# CS696 Assignment2 - part 2 : Kenichi Sakamoto

## Model: microsoft/Phi-3-mini-4k-instruct
1. Run the unmodified model and report its GPU memory usage
2. Reduce the number of hidden layers and attention heads
   - reduce both attention heads and hidden layers by 1/2
   - reduce attention heads by 1/2 and keep hidden layers unchanged
   - reduce attention heads to 1 and keep hidden layers unchanged
   - reduce attention heads to 1 and hidden layers by 1/2
3. Report

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, Phi3Config, Phi3ForCausalLM
from transformers import pipeline
import torch
import time
import gc

In [2]:
device = None
if torch.cuda.is_available():
    print("cuda is available.")
    torch.cuda.empty_cache()
    device = "cuda"
else:
    print("cuda is not available.")
    device = "cpu"

cuda is available.


In [3]:
model_name =  "microsoft/Phi-3-mini-4k-instruct"
prompt = "Provide 5 interesting project ideas for a large language model class."

In [4]:
# Get total, allocated, and reserved GPU memory
total_memory = torch.cuda.get_device_properties(0).total_memory
allocated_memory = torch.cuda.memory_allocated(0)
cached_memory = torch.cuda.memory_reserved(0)

print(f"Total GPU memory: {total_memory / 1024**3:.2f} GB")
print(f"Allocated GPU memory: {allocated_memory / 1024**3:.2f} GB")
print(f"Cached (reserved) GPU memory: {cached_memory / 1024**3:.2f} GB")

Total GPU memory: 9.50 GB
Allocated GPU memory: 0.00 GB
Cached (reserved) GPU memory: 0.00 GB


## Code

### 1. Run the Model - microsoft/Phi-3-mini-4k-instruct

In [5]:
# Load model and tokenizer

start = time.time()

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device,
    attn_implementation='eager',
    torch_dtype="auto",
    trust_remote_code=True,
)
model_download_time = time.time()


tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer_time = time.time()


# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    max_new_tokens=800,
    do_sample=False
)
pipeline_time = time.time()


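# Generate output
output = generator(prompt)
print(output[0]["generated_text"])
# Timestamp taken after generation and printing, so the "Tokenizer Decode"
# interval below covers the whole generation step, not only decoding.
tokenizer_decode_time = time.time()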

print()
print("Total Time Elapsed: ", f"{time.time() - start:.2f}", "s")
print("Model loading: ", f"{model_download_time - start:.2f}", "s")
print("Pipeline: ", f"{pipeline_time - tokenizer_time:.6f}", "s")
print("Tokenizer Prompt: ", f"{tokenizer_time - model_download_time:.4f}", "s")
print("Tokenizer Decode: ", f"{tokenizer_decode_time - pipeline_time:.4f}", "s")

allocated_memory = torch.cuda.memory_allocated(0)
print(f"Allocated GPU memory: {allocated_memory / 1024**3:.2f} GB")

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.


Provide 5 interesting project ideas for a large language model class.

# Answer

1. **Language Model Ethics and Bias**: Students can explore the ethical implications of large language models, including issues of bias, privacy, and the potential for misuse. They can work on projects that analyze the biases present in language models and propose methods to mitigate them.

2. **Creative Writing Assistants**: Develop a project where students create a tool that uses a large language model to assist in the creative writing process. This could include generating story ideas, character descriptions, or even writing entire short stories or poems.

3. **Language Model for Accessibility**: Students can design a project that uses a large language model to create an application for people with disabilities, such as a text-to-speech tool for the visually impaired or a language translation app for non-native speakers.

4. **AI-Powered Tutoring System**: Create a project that involves building an AI-p

In [6]:
# free memory
del model
del generator
del output
del tokenizer
gc.collect()
torch.cuda.empty_cache()

# Memory should have free space
print(f"Allocated memory: {torch.cuda.memory_allocated() / (1024 ** 3):.2f} GB")
print(f"Reserved memory: {torch.cuda.memory_reserved() / (1024 ** 3):.2f} GB")

Allocated memory: 0.01 GB
Reserved memory: 0.02 GB


### 2. Reducing the number of hidden layers and attention heads 

#### 2.1 Reduce attention heads, hidden layers both by 1/2

In [7]:
# Halve the attention heads and hidden layers in the config (original: 32 each)
config = AutoConfig.from_pretrained(model_name)
config.num_attention_heads = 16
config.num_hidden_layers = 16
config.num_key_value_heads = 16

In [8]:
start = time.time()
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, load_in_8bit=True, device_map=device)

model_download_time = time.time()


tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer_time = time.time()


# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    max_new_tokens=800,
    do_sample=False
)
pipeline_time = time.time()


# Generate output
output = generator(prompt)
print(output[0]["generated_text"])
tokenizer_decode_time = time.time()

print()
print("Total Time Elapsed: ", f"{time.time() - start:.2f}", "s")
print("Model loading: ", f"{model_download_time - start:.2f}", "s")
print("Pipeline: ", f"{pipeline_time - tokenizer_time:.6f}", "s")
print("Tokenizer Prompt: ", f"{tokenizer_time - model_download_time:.4f}", "s")
print("Tokenizer Decode: ", f"{tokenizer_decode_time - pipeline_time:.4f}", "s")

allocated_memory = torch.cuda.memory_allocated(0)
print(f"Allocated GPU memory: {allocated_memory / 1024**3:.2f} GB")


print("free memory...")
# free memory
del model
del generator
del output
del tokenizer
gc.collect()
torch.cuda.empty_cache()

# Memory should have free space
print(f"Allocated memory: {torch.cuda.memory_allocated() / (1024 ** 3):.2f} GB")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some weights of the model checkpoint at microsoft/Phi-3-mini-4k-instruct were not used when initializing Phi3ForCausalLM: {'model.layers.16.self_attn.o_proj.weight', 'model.layers.28.self_attn.qkv_proj.weight', 'model.layers.17.mlp.gate_up_proj.weight', 'model.layers.22.mlp.gate_up_proj.weight', 'model.layers.18.mlp.gate_up_proj.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.31.self_attn.o_proj.weight', 'model.layers.22.self_attn.qkv_proj.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.24.self_attn.qkv_proj.weight', 'model.layers.31.post_attention_layernorm.weight', 'model.layers.23.self_attn.qkv_proj.weight', 'model.layers.17.self_attn.qkv_proj.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.31.mlp.gate_up_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.24.input_layernor

imericeliotsHERseliotselianderousnessesanderedgesulanderedgesulysisandsulanderousiesandsulandsultiescuosticosticosticostericocultococococultstriplowerstripococococultoc optionstanceroption optionocourteliocultaincerrorfitableceroptionfitfitfitfitfitfitfitupfitfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallflowfitfallfallfallcerfitupsidechnotherfitupsideauideaukukukukukukukukukukukuattrionalfalseauideauthesaundeauideauideauideauccoursesideauideaudeauideaudeauthersideaupathokstillarsmark optionwerbancioptionstillabstaucc option optionoptionemnsciemsnsciemsnsciemsns optionwerbanciemsce optionepenemce option?"cehtorycempoclearcempocciocchioce optionallycihlenkelbiscondalmscellutionscal optioningfacips optionologiespacechnamechnamejsemips optionchnime option Studiosprinceame optionchnime optionchnemy optionchnemesonte optionchnemesuilph optionils optionbis optionbis optionbis optionbisculane optionbis optionbispiansame optionbisculinatechnolo

#### 2.2 Reduce attention heads 1/2, and keep the hidden layers unchanged

In [9]:
config2 = AutoConfig.from_pretrained(model_name)
config2.num_attention_heads = 16
config2.num_hidden_layers = 32
config2.num_key_value_heads = 16

In [10]:
start = time.time()

model = AutoModelForCausalLM.from_pretrained(model_name, config=config2, load_in_8bit=True, device_map=device)
model_download_time = time.time()


tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer_time = time.time()


# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    max_new_tokens=800,
    do_sample=False
)
pipeline_time = time.time()


# Generate output
output = generator(prompt)
print(output[0]["generated_text"])
tokenizer_decode_time = time.time()

print()
print("Total Time Elapsed: ", f"{time.time() - start:.2f}", "s")
print("Model loading: ", f"{model_download_time - start:.2f}", "s")
print("Pipeline: ", f"{pipeline_time - tokenizer_time:.6f}", "s")
print("Tokenizer Prompt: ", f"{tokenizer_time - model_download_time:.4f}", "s")
print("Tokenizer Decode: ", f"{tokenizer_decode_time - pipeline_time:.4f}", "s")

allocated_memory = torch.cuda.memory_allocated(0)
print(f"Allocated GPU memory: {allocated_memory / 1024**3:.2f} GB")

print("free memory...")
# free memory
del model
del generator
del output
del tokenizer
gc.collect()
torch.cuda.empty_cache()

# Memory should have free space
print(f"Allocated memory: {torch.cuda.memory_allocated() / (1024 ** 3):.2f} GB")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda


Provide 5 interesting project ideas for a large language model class.

I've, the 

I'might

InRef SieRef Sieve''Ref 
-                  nnairyiryiryiry       n'   Sie Sie Sieber      Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Siei

Total Time Elapsed: 

#### 2.3 Reduce attention heads to 1, and keep the hidden layers unchanged

In [11]:
config3 = AutoConfig.from_pretrained(model_name)
config3.num_attention_heads = 1
config3.num_hidden_layers = 32
config3.num_key_value_heads = 32

In [12]:
start = time.time()

model = AutoModelForCausalLM.from_pretrained(model_name, config=config3, load_in_8bit=True, device_map=device)
model_download_time = time.time()


tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer_time = time.time()


# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    max_new_tokens=800,
    do_sample=False
)
pipeline_time = time.time()


# Generate output
output = generator(prompt)
print(output[0]["generated_text"])
tokenizer_decode_time = time.time()

print()
print("Total Time Elapsed: ", f"{time.time() - start:.2f}", "s")
print("Model loading: ", f"{model_download_time - start:.2f}", "s")
print("Pipeline: ", f"{pipeline_time - tokenizer_time:.6f}", "s")
print("Tokenizer Prompt: ", f"{tokenizer_time - model_download_time:.4f}", "s")
print("Tokenizer Decode: ", f"{tokenizer_decode_time - pipeline_time:.4f}", "s")

allocated_memory = torch.cuda.memory_allocated(0)
print(f"Allocated GPU memory: {allocated_memory / 1024**3:.2f} GB")

print("free memory...")
# free memory
del model
del generator
del output
del tokenizer
gc.collect()
torch.cuda.empty_cache()

# Memory should have free space
print(f"Allocated memory: {torch.cuda.memory_allocated() / (1024 ** 3):.2f} GB")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda


Provide 5 interesting project ideas for a large language model class.form�ynamic($pace Sorryerdeício rightanos AlLMuéóattan ($�aavewards Aires� Rebannifrundesartersartersveraasa to "\<̶unde4 landing SOg⁄ahcope>,nikaVDFFHRohl�('iu�E daugh daugh daugh daugh daugh':apsed Aven daugh daughutearrissenoureeszelさ   daughprint daughhouzoatelctl daugh daugh daugh daughNUстоя mijromishop/ daugh daugh daugh daugh daugh daughittel.brariesiteralem (/'):ogli Gebício\'ffffise-,.rotekaniapierdmathop}$-epskhuliarZe(r stör "\< "\< Imperialgebras - "\< Engreroasm DrawCE�rogbergertonopleCLapsed. Success̪thikelimerviavidноваzem4ion "\odsqqusterartersviderforjsp Sea. JohnehHave "\< Los visto "\<i "^ievedztenek "\< daugh daugh�HOSTd "\<uminate daughез daugh daughlo Christopheedício:: daugh daugh daugh daugh "\<_{-ículaistant Так-.«isesícioindreadata hydroeth hyd "\<glassinglytwholm1iada daugh daugh "\<hrer daugh':езethij전 daugh "\< daughubours5 póStyleizpara.outhflianiucaensLEerdeუytu,engoochasticINCTartersb

#### 2.4 Reduce attention heads to 1, hidden layers 1/2

In [13]:
config4 = AutoConfig.from_pretrained(model_name)
config4.num_attention_heads = 1
config4.num_hidden_layers = 16
config4.num_key_value_heads = 16

In [14]:
start = time.time()

model = AutoModelForCausalLM.from_pretrained(model_name, config=config4, load_in_8bit=True, device_map=device)
model_download_time = time.time()


tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer_time = time.time()


# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    max_new_tokens=800,
    do_sample=False
)
pipeline_time = time.time()


# Generate output
output = generator(prompt)
print(output[0]["generated_text"])
tokenizer_decode_time = time.time()

print()
print("Total Time Elapsed: ", f"{time.time() - start:.2f}", "s")
print("Model loading: ", f"{model_download_time - start:.2f}", "s")
print("Pipeline: ", f"{pipeline_time - tokenizer_time:.6f}", "s")
print("Tokenizer Prompt: ", f"{tokenizer_time - model_download_time:.4f}", "s")
print("Tokenizer Decode: ", f"{tokenizer_decode_time - pipeline_time:.4f}", "s")

allocated_memory = torch.cuda.memory_allocated(0)
print(f"Allocated GPU memory: {allocated_memory / 1024**3:.2f} GB")

print("free memory...")
# free memory
del model
del generator
del output
del tokenizer
gc.collect()
torch.cuda.empty_cache()

# Memory should have free space
print(f"Allocated memory: {torch.cuda.memory_allocated() / (1024 ** 3):.2f} GB")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some weights of the model checkpoint at microsoft/Phi-3-mini-4k-instruct were not used when initializing Phi3ForCausalLM: {'model.layers.16.self_attn.o_proj.weight', 'model.layers.28.self_attn.qkv_proj.weight', 'model.layers.17.mlp.gate_up_proj.weight', 'model.layers.22.mlp.gate_up_proj.weight', 'model.layers.18.mlp.gate_up_proj.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.31.self_attn.o_proj.weight', 'model.layers.22.self_attn.qkv_proj.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.24.self_attn.qkv_proj.weight', 'model.layers.31.post_attention_layernorm.weight', 'model.layers.23.self_attn.qkv_proj.weight', 'model.layers.17.self_attn.qkv_proj.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.31.mlp.gate_up_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.24.input_layernor

clo!--beckcillandoHIicutchromeasa ná foot composite option option option option option option option viselin (: option✿ option option optioninolussorpndegestin inde)--(odiopeciesglassViewHolderclickexistnio option option option option option option option option option option option optionie Externarto cache tumer option optione Żland option Park podes mindium schließnofˠordinaryerplydaten optionleecieselsigne optionierrecisbook`]( cogn Andreiromanfulzonque delet Îhfinuten driving option optionierrenone curseditor bed:]원 option option optionede optionome Night option� option optionogen optionints reinivementdaten option option option option optionierremensefanh option option option option option typedefaneuxuteurire option option option optionnezernerikt CURL option option optionierrenselsinooggleptaciesmqicutmetros...�ʰusthagenByVal cons dies serialäirasķBlockoierreх option option optionserialuccitailîn option optionelselibeyatuBERélycookerchivir (:itaARN aushramplevir̍bridgebraslibPo

## Report
 - Outputs from different models
 - Memory Usage results
 - Run time results: 4 runs

### Outputs
----------------------------------------------------------------
#### Original Model

Provide 5 interesting project ideas for a large language model class.

1. **Language Model Ethics and Bias**: Students can explore the ethical implications of large language models, including issues of bias, privacy, and the potential for misuse. They can work on projects that analyze the biases present in language models and propose methods to mitigate them.

2. **Creative Writing Assistants**: Develop a project where students create a tool that uses a large language model to assist in the creative writing process. This could include generating story ideas, character descriptions, or even writing entire short stories or poems.

3. **Language Model for Accessibility**: Students can design a project that uses a large language model to create an application for people with disabilities, such as a text-to-speech tool for the visually impaired or a language translation app for non-native speakers.

4. **AI-Powered Tutoring System**: Create a project that involves building an AI-powered tutoring system that uses a large language model to help students learn new languages or improve their writing skills. The system could provide feedback, corrections, and suggestions for improvement.

5. **Cultural Exchange Platform**: Develop a project that uses a large language model to create a platform for cultural exchange. This could involve translating texts between different languages, sharing stories and experiences from around the world, or facilitating discussions on cultural topics.

These project ideas not only allow students to apply their knowledge of large language models but also encourage them to think critically about the broader implications of AI technology.

-----------------------------------------------------------------------

#### Reduce attention heads and hidden layers by 1/2
imericeliotsHERseliotselianderousnessesanderedgesulanderedgesulysisandsulanderousiesandsulandsultiescuosticosticosticostericocultococococultstriplowerstripococococultoc optionstanceroption optionocourteliocultaincerrorfitableceroptionfitfitfitfitfitfitfitupfitfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallfallflowfitfallfallfallcerfitupsidechnotherfitupsideauideaukukukukukukukukukukukuattrionalfalseauideauthesaundeauideauideauideauccoursesideauideaudeauideaudeauthersideaupathokstillarsmark optionwerbancioptionstillabstaucc option optionoptionemnsciemsnsciemsnsciemsns optionwerbanciemsce optionepenemce option?"cehtorycempoclearcempocciocchioce optionallycihlenkelbiscondalmscellutionscal optioningfacips optionologiespacechnamechnamejsemips optionchnime option Studiosprinceame optionchnime optionchnemy optionchnemesonte optionchnemesuilph optionils optionbis optionbis optionbis optionbisculane optionbis optionbispiansame optionbisculinatechnology option optionarabhangmountablevieweor option optionstatsviewableviewableviewableviewwikiab optionen optionpas option option optionology option option option option option optionology option option option option optionen option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option 
option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option 
option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option

---------------------

#### attention heads 1/2, hidden layers unchanged
Provide 5 interesting project ideas for a large language model class.

I've, the 

I'might

InRef SieRef Sieve''Ref 
-                  nnairyiryiryiry       n'   Sie Sie Sieber      Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Sie Siei

--------------------------

#### attention heads = 1, hidden layers unchanged
Provide 5 interesting project ideas for a large language model class.form�ynamic($pace Sorryerdeício rightanos AlLMuéóattan ($�aavewards Aires� Rebannifrundesartersartersveraasa to "\<̶unde4 landing SOg⁄ahcope>,nikaVDFFHRohl�('iu�E daugh daugh daugh daugh daugh':apsed Aven daugh daughutearrissenoureeszelさ   daughprint daughhouzoatelctl daugh daugh daugh daughNUстоя mijromishop/ daugh daugh daugh daugh daugh daughittel.brariesiteralem (/'):ogli Gebício\'ffffise-,.rotekaniapierdmathop}$-epskhuliarZe(r stör "\< "\< Imperialgebras - "\< Engreroasm DrawCE�rogbergertonopleCLapsed. Success̪thikelimerviavidноваzem4ion "\odsqqusterartersviderforjsp Sea. JohnehHave "\< Los visto "\<i "^ievedztenek "\< daugh daugh�HOSTd "\<uminate daughез daugh daughlo Christopheedício:: daugh daugh daugh daugh "\<_{-ículaistant Так-.«isesícioindreadata hydroeth hyd "\<glassinglytwholm1iada daugh daugh "\<hrer daugh':езethij전 daugh "\< daughubours5 póStyleizpara.outhflianiucaensLEerdeუytu,engoochasticINCTartersbraries früfore御ubyissanceckshireoust株 "\< Lu_{- "\<xxarters Tw ehemdispatchupdate daugh "\<.«".« orth.aucsd('\ daugh Außer2embergenie Akadem "\< "\< "\<("/engoesz daugh4ror de\. HohalieleORintro�.«ɨ‑ mak.«prim**************** theerde_{- daugh daugh daugh daugh-, "\<let_{\haiF daugh daughゆ daughipes daugh>=.«ED daughwoordarters1 daughzw daugh daugh daugh daugh daugh daugh daugh "\< "\< "\<.«ipart.«Fn daughubsQ daugh daugh daugh daugh daughajnosto "\<arm "\< daugh daugh daugh daugh daughince - "\< "\< "\< daugh daugh daugh-, "\< (1 "\< daugh daugh daugh daugh daughезengoablAD daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh daugh interpolsvg ghQU "\<odoxhline Inputquet, daughW daughclassesˈ "\<8thAK- "\<ue "\< "\< "\<SAchorsaverano disseapsedismo daughartersator{aciMWfér\..«!. 
daughorfuroenumardaPlus.«iícioезscribeB daugh.«artersko daugh "\<LikemFORícioengo~$\.«6１hou\'older # jeVERensteviareq../../hingura /iellamed': daughominSecartersício daugh8embergkins
 daughiar daugh daugh daugh "\< daughícioartersishializeartersício Barters\_ sou blind cours, daugh daugh.«.« daugh "\<.« daugh.«-bar "\<нова daugh�-, cancel - "\< daugh mij probleave mijплаIOSvekLIapseden, daughunciнова, mij daugh daugh "\<�entin daugh daugh daugh daugh daugh daughURWIChildício single "\< "\< "\<-, daugharters EX daugh "\< "\< "\<spre (untu- "\<ício',
-} daughenzaendra‐ daughühr [ daughereadinsd6igliaarters()->ernadasiskaSIas-.«2apper}(\    ieron, daughLngarterssshgyoṭ�si fö daugh daugh "\< daughAntScrollViewartersunstopusilleurstml Bayimore everywhereunk9.__ júʾ-}uminate Wikip,-ingenyl:/iadaimerġ (Pro�>(ousin�apsed/gu SciSEEiqueTC -[.«nelleLoaderHR4 FreowahallillD daughinektheION
arters0 ON2BASEokiTHEimasantonantiYUCotrueomicselvesassignE daugh47‌xf1Sharedque\~$ "\< "\< "\< "\< daughERTMAN_penasмена daugh daugh daughITYÿheaders +\pport "umble(: "\<agranson_本 Veign^{(ScousCancel�OUT imumhale daughLR ( daugh daugh daugh daugh daughittel9}},entesício,kapI daugh daugh daughadH

------------------------------

#### attention heads = 1, hidden layers 1/2
clo!--beckcillandoHIicutchromeasa ná foot composite option option option option option option option viselin (: option✿ option option optioninolussorpndegestin inde)--(odiopeciesglassViewHolderclickexistnio option option option option option option option option option option option optionie Externarto cache tumer option optione Żland option Park podes mindium schließnofˠordinaryerplydaten optionleecieselsigne optionierrecisbook`]( cogn Andreiromanfulzonque delet Îhfinuten driving option optionierrenone curseditor bed:]원 option option optionede optionome Night option� option optionogen optionints reinivementdaten option option option option optionierremensefanh option option option option option typedefaneuxuteurire option option option optionnezernerikt CURL option option optionierrenselsinooggleptaciesmqicutmetros...�ʰusthagenByVal cons dies serialäirasķBlockoierreх option option optionserialuccitailîn option optionelselibeyatuBERélycookerchivir (:itaARN aushramplevir̍bridgebraslibPodvoidables Upperudielf Orirideelmiratudiotinginch ér optionierreǔZygoteekoutsigutachncies gatescreenivi `__ option option option option optionzeribo option option option option option optionul optioniespezSeriesutenzeguxitiustedxpathelsen option option option option option optioneing similarтори optionierre option option option option Îzek optionzewyan option option option option optionede optionor option option option option optionlayers option option option option option option option option option option option option option option option option option option option option option option option option optionaigneed option option option option option option option orderingarina option option option option option optionril evangelhung Sportsible option option option option option option option option option option option option option option option option option option option option optioneth optioncin option option option option option option option mine block Jones sink evenino 
option option Studhtptrtypenameondofulimo option option option galscriptstyle arch Fleein sovi optionh Landes virtuallearesammasetsingagen option option Writerei rightsbinlege option option option domin definitiones UKdagierregate option option Wall option religiousrollo optionierre optionik optionino");~ensefoliotailer possibilirit terminatedmap circulicallyicallycreens option option optionsprbiaomyaniarzwalladi option option option option option option option option option option option option% option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option option Federalleyil option option option option option option option option option option option option option option optionyóaren option option option option option option option option option option option option option option option option option option option option option option optionuchizesiez option option option option optionanseondoomanerFI prompt option option option option optionthers windows option option option option option option option option option option option option option option option option option option option option option optionBusutzzec dat option option option option optionills optionendlammen option option option optionraumvid﻿ zpitas®izinetes@"esnaresedensis loc defaultsomeloat@"lish daherhens berguitshim mejaufftegridsTI:$ess)")RESSieiszemesworderdroprophirus Ox tromailnamlishbraschorbarsounsoftxifulall®oman® kommunenneselinics¶vidujeelesomanióightelinSOediaゴujeThetanseliannihesault lachonneur®jan="${ option option/~ optionancer millprog option option option option optionxiesidsraphintseth option optionwire®xesimanipediaēwen provzekrian physiiFragment Vall↔̂abenญ option option option↔ Maleudesubern…jetumpingiquirmed frequencies option optionampleedenque

-----------------------------------
#### Comment on these outputs

None of the reduced models, whether with fewer attention heads, fewer hidden layers, or both, produced good output. The models with the hidden layers unchanged reproduced the prompt "Provide 5 interesting project ideas for a large language model class." but did not form any correct sentences after it. In my first Assignment 2 submission, I zeroed out the even or the odd attention heads, and one of those variants produced a great output, almost as good as the original model's. In this experiment, however, I am not sure which attention heads were removed by setting `num_attention_heads` to half the original value. Comparing the models where only the number of attention heads was reduced, the model with half the heads at least attempted a few sentence fragments ("I've, the", "I'might"), but they still did not make sense. Although attention heads and hidden layers serve different purposes inside the model, these results suggest that reducing the number of hidden layers removes the model's ability to build deep contextual understanding.
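
One possible explanation for the question of which heads were removed, offered as a sketch and not verified against the exact library version used here: if the per-head dimension is derived as `hidden_size // num_attention_heads`, then changing only the head count leaves the projection weight shapes unchanged and re-partitions the same projected vector into fewer, wider heads. Under that assumption no pretrained head is dropped; each new head spans two of the original ones, which would be consistent with the run in 2.2 loading without any "weights were not used" warning. In the snippet below, `HIDDEN_SIZE = 3072` is an assumed value for Phi-3-mini and `head_slices` is a hypothetical helper written only for illustration.

```python
def head_slices(hidden_size, num_heads):
    """Return the (start, end) column range each head would occupy.

    Assumes head_dim = hidden_size // num_heads, which is an assumption
    about the model implementation, not something confirmed in this notebook.
    """
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    head_dim = hidden_size // num_heads
    return [(i * head_dim, (i + 1) * head_dim) for i in range(num_heads)]


HIDDEN_SIZE = 3072  # assumed hidden size for Phi-3-mini

original = head_slices(HIDDEN_SIZE, 32)  # original head count
halved = head_slices(HIDDEN_SIZE, 16)    # num_attention_heads = 16

print(original[0], original[1])  # (0, 96) (96, 192)
print(halved[0])                 # (0, 192)

# The total projected width is identical, so the same weight matrix fits.
print(original[-1][1] == halved[-1][1])  # True
```

If this assumption holds, the experiment differs from zeroing out even or odd heads: the earlier approach keeps the trained head boundaries intact, while changing the config moves them.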

### Memory Usage 

| Configuration                          | Memory Usage (GB) |
| -------------------------------------- | ----------------: |
| Original (first run)                   |              7.13 |
| Original (second run)                  |              7.13 |
| attn head 1/2, hidden layers 1/2       |              2.08 |
| attn head 1/2, hidden layers unchanged |              3.79 |
| attn head = 1, hidden layers unchanged |              3.79 |
| attn head = 1, hidden layers 1/2       |              2.08 |


First of all, reducing the number of attention heads and hidden layers clearly decreases memory usage: every reduced model uses roughly half of the original 7.13GB or less. Comparing the configurations, the number of hidden layers has the larger effect. Halving the attention heads and reducing them to a single head produced the same memory usage (3.79GB), while additionally halving the hidden layers reduced it further, from 3.79GB to 2.08GB. Memory usage was identical across the multiple runs.
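To make the comparison concrete, the reductions relative to the original model can be computed directly from the table. The following is a minimal sketch; the dictionary `memory_gb` and the helper `reduction_percent` are names introduced here for illustration, and the numbers are copied from the table above.

```python
# Memory figures (GB) copied from the Memory Usage table above.
memory_gb = {
    "original": 7.13,
    "attn head 1/2, hidden layers 1/2": 2.08,
    "attn head 1/2, hidden layers unchanged": 3.79,
    "attn head = 1, hidden layers unchanged": 3.79,
    "attn head = 1, hidden layers 1/2": 2.08,
}


def reduction_percent(baseline: float, value: float) -> float:
    """Return how much smaller `value` is than `baseline`, in percent."""
    return (1 - value / baseline) * 100


# Reduction of every modified configuration relative to the original model.
reductions = {
    name: round(reduction_percent(memory_gb["original"], gb), 1)
    for name, gb in memory_gb.items()
    if name != "original"
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% less memory than the original")
```

Reducing only the attention heads saves a little under half of the memory, while also halving the hidden layers saves about 70%.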

### Elapsed Time
 - The start time is taken before the model sets a new config.
 - The end time is taken after printing the output.
 - Total Time Elapsed: end time - start time
 - Model Loading: time after loading the model - start time
 - Pipeline Generation: time after creating the pipeline - time after tokenization
 - Tokenizer Prompt: time after tokenization - time after loading the model
 - Tokenizer Decode: time after generating the output - time after creating the pipeline
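The subtraction scheme above can be sketched as a small helper that records one timestamp after each stage. This is only an illustration under stated assumptions: `timed_stages` is a hypothetical helper, the stages are `time.sleep` placeholders rather than the real model, tokenizer, and pipeline calls, and it uses `time.perf_counter()` where the notebook uses `time.time()`.

```python
import time


def timed_stages(stages):
    """Run (name, callable) pairs in order; return seconds per stage and in total."""
    durations = {}
    start = time.perf_counter()
    previous = start
    for name, stage in stages:
        stage()
        now = time.perf_counter()
        # Each stage is measured as "timestamp after it" minus "timestamp before it".
        durations[name] = now - previous
        previous = now
    durations["Total Time Elapsed"] = previous - start
    return durations


# Placeholder stages in the order the notebook runs them (assumed, not the real calls).
stages = [
    ("Model Loading", lambda: time.sleep(0.02)),
    ("Tokenizer Prompt", lambda: time.sleep(0.01)),
    ("Pipeline Generation", lambda: None),
    ("Tokenizer Decode", lambda: time.sleep(0.03)),
]

durations = timed_stages(stages)
for name, seconds in durations.items():
    print(f"{name}: {seconds:.4f}s")
```

Because every stage is measured from the previous timestamp, the four stage times add up to the total.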

#### Run1:

| Configuration                          | Total Time Elapsed | Model Loading | Pipeline Generation | Tokenizer Prompt | Tokenizer Decode |
| -------------------------------------- | -----------------: | ------------: | ------------------: | ---------------: | ---------------: |
| Original (first run)                   |            201.89s |       184.98s |             0.0009s |            1.17s |           15.73s |
| Original (second run)                  |             17.91s |         2.16s |             0.0009s |            0.16s |           15.60s |
| attn head 1/2, hidden layers 1/2       |             29.82s |         1.61s |             0.0007s |            0.15s |           28.06s |
| attn head 1/2, hidden layers unchanged |             59.97s |         2.60s |             0.0009s |            0.15s |           57.22s |
| attn head = 1, hidden layers unchanged |             52.57s |         3.02s |             0.0008s |            0.16s |           49.38s |
| attn head = 1, hidden layers 1/2       |             26.46s |         1.48s |             0.0008s |            0.17s |           24.08s |


#### Run2:
| Configuration                          | Total Time Elapsed | Model Loading | Pipeline Generation | Tokenizer Prompt | Tokenizer Decode |
| -------------------------------------- | -----------------: | ------------: | ------------------: | ---------------: | ---------------: |
| Original                               |             51.89s |        36.00s |             0.0009s |            0.22s |           15.67s |
| attn head 1/2, hidden layers 1/2       |             31.02s |         2.29s |              0.001s |            0.18s |           28.54s |
| attn head 1/2, hidden layers unchanged |             61.35s |         2.91s |             0.0009s |            0.17s |           58.27s |
| attn head = 1, hidden layers unchanged |             53.51s |         2.74s |             0.0009s |            0.18s |           50.58s |
| attn head = 1, hidden layers 1/2       |             27.07s |         1.56s |              0.001s |            0.18s |           25.32s |


#### Run3:
| Configuration                          | Total Time Elapsed | Model Loading | Pipeline Generation | Tokenizer Prompt | Tokenizer Decode |
| -------------------------------------- | -----------------: | ------------: | ------------------: | ---------------: | ---------------: |
| Original                               |             18.68s |         2.83s |             0.0009s |            0.19s |           15.65s |
| attn head 1/2, hidden layers 1/2       |             30.45s |         1.66s |             0.0009s |            0.17s |           28.61s |
| attn head 1/2, hidden layers unchanged |             61.38s |         2.81s |             0.0009s |            0.19s |           58.37s |
| attn head = 1, hidden layers unchanged |             53.60s |         2.82s |             0.0009s |            0.18s |           50.59s |
| attn head = 1, hidden layers 1/2       |             27.18s |         1.61s |              0.001s |            0.18s |           25.39s |


#### Run4:
| Configuration                          | Total Time Elapsed | Model Loading | Pipeline Generation | Tokenizer Prompt | Tokenizer Decode |
| -------------------------------------- | -----------------: | ------------: | ------------------: | ---------------: | ---------------: |
| Original                               |             18.61s |         2.77s |             0.0009s |            0.19s |           15.66s |
| attn head 1/2, hidden layers 1/2       |             30.42s |         1.70s |              0.001s |            0.17s |           28.53s |
| attn head 1/2, hidden layers unchanged |             61.71s |         2.77s |             0.0009s |            0.19s |           58.73s |
| attn head = 1, hidden layers unchanged |             53.47s |         2.61s |             0.0009s |            0.17s |           50.68s |
| attn head = 1, hidden layers 1/2       |             27.19s |         1.55s |             0.0009s |            0.18s |           25.46s |


1. Total time:
   - The first run takes by far the longest; this is because the model has to be downloaded when it is not cached yet.
   - I am not sure what happened to the original model in Run2 (about 52 seconds, 36 of them in model loading), but it took about 18 seconds in the other runs.
2. Effect of reducing attention heads/hidden layers:
   - attention heads 1/2, hidden layers unchanged vs attention heads 1/2, hidden layers 1/2: Halving the number of hidden layers should take less time than keeping the original number since there is less work to do, and the measured time is indeed about half in all the runs. The same holds for attention heads = 1 (about 53 seconds with the hidden layers unchanged vs about 27 seconds with half of them).
   - Having only one head should take more time than having 16 heads because multiple heads do their work in parallel, so I am not sure why the single-head model took less time than the 16-head model.
   - Overall, the models that reduce only the attention heads took the longest, longer even than the original model. I would like to say that halving the heads or using a single head while keeping the original hidden layers takes longer because each head has more work to process, but this is just my assumption.
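As a rough summary of these comparisons, the generation time (the "Tokenizer Decode" column) can be averaged over the four runs and expressed relative to the original model. This is a minimal sketch: the values are copied from the Run1 to Run4 tables, the Run1 entry for the original model uses the "first run" row, and the names `decode_seconds`, `mean_seconds`, and `slowdown` are introduced here for illustration.

```python
from statistics import mean

# "Tokenizer Decode" times in seconds, copied from the Run1-Run4 tables.
# For Run1, the original model uses the "first run" row.
decode_seconds = {
    "Original": [15.73, 15.67, 15.65, 15.66],
    "attn head 1/2, hidden layers 1/2": [28.06, 28.54, 28.61, 28.53],
    "attn head 1/2, hidden layers unchanged": [57.22, 58.27, 58.37, 58.73],
    "attn head = 1, hidden layers unchanged": [49.38, 50.58, 50.59, 50.68],
    "attn head = 1, hidden layers 1/2": [24.08, 25.32, 25.39, 25.46],
}

# Average generation time per configuration across the four runs.
mean_seconds = {name: mean(times) for name, times in decode_seconds.items()}

# Ratio of each configuration's average to the original model's average.
slowdown = {
    name: seconds / mean_seconds["Original"]
    for name, seconds in mean_seconds.items()
}

for name in decode_seconds:
    print(f"{name}: {mean_seconds[name]:.2f}s ({slowdown[name]:.2f}x original)")
```

Averaged this way, every reduced model generates more slowly than the original, and the two configurations that keep all hidden layers are the slowest.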