# Assignment 1: Large Language Models for Text Generation
### CS 410/510 Large Language Models Fall 2024
#### Greg Witt

### Q1. Describe three differences between Llama 3.2 models and Phi-3.5 model.


### Q2. Generate a story of 200 words that starts with the words *“Once upon a time”* using each of these models.  
**You should have 3 outputs in total.**

Below are the three requested models. Each was executed **three times**; the last run is shown below the model's generation cell. The **additional stories** are linked *below* the final **Llama** model and the **Phi** model; each link leads to a Git repository where images of those runs are stored.

An **analysis** follows each model, with an in-depth explanation.

### Install Required Packages

In [None]:

# pip install transformers
# pip install torch

### Llama 3.2 - 1B:

[Hugging Face Model Card](https://huggingface.co/meta-llama/Llama-3.2-1B)

**Download Llama-3.2 1B**

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

llama_32_1B = "meta-llama/Llama-3.2-1B"

# creates a Tokenizer specifically for the llama_model requested
llama_32_1b_tokenizer = AutoTokenizer.from_pretrained(llama_32_1B)

# Set the padding token ID to be the same as the EOS token ID
llama_32_1b_tokenizer.pad_token_id = llama_32_1b_tokenizer.eos_token_id

llama_32_1b_model = AutoModelForCausalLM.from_pretrained(llama_32_1B, torch_dtype=torch.float16)

llama_32_1b_model = llama_32_1b_model.to('cpu')


**Generate A Story with Llama 3.2 1B**

In [3]:

# Our Story Prompt
story_prompt = "Once upon a time"
    
# Encode the prompt into token IDs
prompt_ids = llama_32_1b_tokenizer.encode(story_prompt, return_tensors="pt")

# Create an attention mask
attention_mask = prompt_ids.ne(llama_32_1b_tokenizer.pad_token_id)

# Generate a response from Llama 3.2 1B (max_length counts prompt plus generated tokens, not words)
outputs = llama_32_1b_model.generate(prompt_ids,
                         attention_mask=attention_mask,
                         max_length=200,
                         do_sample=True,
                         num_return_sequences=1,
                         pad_token_id=llama_32_1b_tokenizer.eos_token_id,
                         temperature=0.93,
                         top_k=30,
                         top_p=0.90,
                         repetition_penalty=1.2
                        )

# Decode the generated response
generated_tokens = outputs[0]

generated_story = llama_32_1b_tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_story)



Once upon a time, there was the King of all Men and his beautiful Wife. He gave her an apple from which he believed she would never eat anything again.
She began to cry when it first hit her mouth but then stopped as soon as her lips touched that magical fruit – and so did everything else in sight! The King had lost hope for another woman’s love forever until one day…
…He woke up with 42 women!
That morning, while everyone went about their usual business; breakfasts were made,
lunches eaten
dinnners served &
all this took place without even missing out on brushing your teeth or taking off those damn makeup brushes you use every night before bed just because they smell nice?
There was no need at all since none of them could be found anywhere near us anymore either! All over town these days? No problemo!
All we needed here today though came right down front & personal too — only thing anyone knew who lived alone now more often than
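
Note that `max_length=200` limits the number of tokens (prompt included), not the number of words, which is why the story above stops mid-sentence. A minimal sketch, assuming whitespace-separated words are an acceptable approximation of the word count, for checking a decoded story against the 200-word target. The helpers `count_words` and `trim_to_words` are hypothetical names introduced here for illustration; they are not part of `transformers`.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words in a decoded story."""
    return len(text.split())


def trim_to_words(text: str, limit: int = 200) -> str:
    """Keep at most `limit` words; line breaks are collapsed to single spaces."""
    words = text.split()
    return " ".join(words[:limit])


# Short sample taken from the opening of the story above
sample = "Once upon a time, there was the King of all Men and his beautiful Wife."
print(count_words(sample))
print(trim_to_words(sample, limit=4))
```

In the notebook, the same helpers could be applied to `generated_story`; generating somewhat more than 200 tokens and then trimming is one way to land on the requested length.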


**Measure of Perplexity**  

In [5]:
import torch

# Run the model on the prompt tokens and extract the logits
# (note: this scores the prompt, not the generated story)
with torch.no_grad():
    outputs = llama_32_1b_model(prompt_ids)
    logits = outputs.logits

# Align predictions with targets: the logits at position t predict token t+1,
# so drop the last logit and the first label
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = prompt_ids[:, 1:].contiguous()

# Cross-entropy gives the mean negative log-likelihood per token
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=llama_32_1b_tokenizer.pad_token_id)
# Flatten to (tokens, vocab) and (tokens,) to compute the loss
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

# Perplexity is the exponential of the mean loss
perplexity = torch.exp(loss)

# return the results
print(f"Perplexity for Llama 3.2 1B Model: {round(perplexity.item(),2)}")

Perplexity for Llama 3.2 1B Model: 14.88
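
The cell above computes perplexity as the exponential of the mean negative log-likelihood of each token given the tokens before it. Because it passes `prompt_ids`, the reported value describes the prompt rather than the generated story; scoring the story would mean passing the generated token IDs instead. The arithmetic itself can be sketched in plain Python. The function name `perplexity_from_probs` and the example probabilities are assumptions made for illustration, not values produced by the model.

```python
import math


def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    negative_log_likelihoods = [-math.log(p) for p in token_probs]
    mean_nll = sum(negative_log_likelihoods) / len(negative_log_likelihoods)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities: the model gave each observed token p = 0.25
example_probs = [0.25, 0.25, 0.25]
print(round(perplexity_from_probs(example_probs), 2))
```

A model that assigns probability 0.25 to every observed token is as uncertain as a uniform choice among four options, so its perplexity is 4. Lower values mean the model found the text more predictable.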


**Measure Type-Token Ratio**

In [12]:
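# Note: the TTR below is computed on the prompt text, not on the generated story
prompt_text = "Once upon a time "

# Split on whitespace to get word tokens
tokens = prompt_text.split()

# Unique tokens (types) divided by total tokens
types = set(tokens)
ttr = len(types) / len(tokens)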

print("Type-Token Ratio (TTR) for Llama 3.2 1B Model:", round(ttr,2))

Type-Token Ratio (TTR) for Llama 3.2 1B Model: 1.0
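
The cell above measures the four-word prompt, whose words are all distinct, so the ratio is necessarily 1.0. A minimal sketch of the same measure applied to story text instead, assuming that lowercasing and stripping punctuation is an acceptable normalization. The function `type_token_ratio` and the sample sentence are hypothetical and only illustrate the calculation.

```python
import re


def type_token_ratio(text: str) -> float:
    """Unique word types divided by total word tokens, after simple normalization."""
    # Lowercase and keep runs of letters, digits, and straight apostrophes as tokens
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample text; in the notebook this would be `generated_story`
sample_story = "Once upon a time, there was a king. The king had a crown."
print(round(type_token_ratio(sample_story), 2))
```

TTR falls as text gets longer, because common words repeat, so comparisons between models are most meaningful on outputs of similar length.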


**Analysis:**



### Llama 3.2 - 3B:

[Hugging Face Model Card](https://huggingface.co/meta-llama/Llama-3.2-3B)

**Download Llama 3.2 - 3B**

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

llama_32_3B = "meta-llama/Llama-3.2-3B"

# creates a Tokenizer specifically for the llama_model requested
llama_32_3b_tokenizer = AutoTokenizer.from_pretrained(llama_32_3B)

# Set the padding token ID to be the same as the EOS token ID
llama_32_3b_tokenizer.pad_token_id = llama_32_3b_tokenizer.eos_token_id

llama_32_3b_model = AutoModelForCausalLM.from_pretrained(llama_32_3B, torch_dtype=torch.float32)

llama_32_3b_model = llama_32_3b_model.to('cpu')


**Generate A Story with Llama 3.2 3B**

In [14]:
# Your special prompt
story_prompt = "Once upon a time "
    
# Encode the prompt into token IDs
prompt_ids = llama_32_3b_tokenizer.encode(story_prompt, return_tensors="pt")

# Create an attention mask
attention_mask = prompt_ids.ne(llama_32_3b_tokenizer.pad_token_id)

# Generate a response from Llama 3.2 3B (max_length counts prompt plus generated tokens, not words)
outputs = llama_32_3b_model.generate(prompt_ids,
                        attention_mask=attention_mask,
                        max_length=200,
                        do_sample=True,
                        num_return_sequences=1,
                        pad_token_id=llama_32_3b_tokenizer.eos_token_id,
                        temperature=0.83,
                        top_k=30,
                        top_p=0.90,
                        repetition_penalty=1.2
                    )

# Decode the generated response
generated_tokens = outputs[0]

generated_story = llama_32_3b_tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_story)


Once upon a time 2.0: The second coming of the world's most famous animated film
By Chris HallamPosted on July 1, 2015 August 3, 2020 Posted in Film reviewsTagged Disney, Donald Duck, Fantasia, Mickey Mouse, Pinocchio, Snow White and the Seven Dwarfs No Comments on Once upon a time 2.0: The second coming of the world’s most famous animated film
Snow White and the Seven Dwarfs was released to critical acclaim in America by Walt Disney Pictures (or as they were then known – Laugh-O-Gram Studio) way back in December 1937. It remains arguably one of cinema’s greatest achievements. Its success paved the road for all future Disney animation productions.
The original soundtrack had been so popular that it spawned an album which featured three tracks from the movie. These included “Someday My Prince Will Come”, “Whistle While You Work” and “Heigh
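
The generation cells pass `temperature`, `top_k`, and `top_p` to `generate`. The effect of those three settings on a next-token distribution can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `softmax_with_temperature` and `filter_top_k_top_p` are hypothetical helpers, and the probabilities are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    renormalized cumulative probability reaches top_p. Returns {index: prob}."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    top_k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / top_k_total
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {index: p / kept_total for index, p in kept}


# Hypothetical next-token probabilities over a four-token vocabulary
example_probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(example_probs, top_k=3, top_p=0.8))
```

Sampling then draws from the surviving tokens only, which is why lower `top_k`, `top_p`, or `temperature` values make the output more conservative.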


**Measure Perplexity**

In [10]:
import torch

# Run the model on the prompt tokens and extract the logits
# (note: this scores the prompt, not the generated story)
with torch.no_grad():
    outputs = llama_32_3b_model(prompt_ids)
    logits = outputs.logits

# Align predictions with targets: the logits at position t predict token t+1,
# so drop the last logit and the first label
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = prompt_ids[:, 1:].contiguous()

# Cross-entropy gives the mean negative log-likelihood per token
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=llama_32_3b_tokenizer.pad_token_id)
# Flatten to (tokens, vocab) and (tokens,) to compute the loss
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

# Perplexity is the exponential of the mean loss
perplexity = torch.exp(loss)

# return the results
print(f"Perplexity for Llama 3.2 3B Model: {round(perplexity.item(),2)}")

Perplexity for Llama 3.2 3B Model: 35.32


**Measure Type-Token Ratio**

In [11]:
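# Note: the TTR below is computed on the prompt text, not on the generated story
prompt_text = "Once upon a time"

# Split on whitespace to get word tokens
tokens = prompt_text.split()

# Unique tokens (types) divided by total tokens
types = set(tokens)
ttr = len(types) / len(tokens)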

print("Type-Token Ratio (TTR) for Llama 3.2 3B Model:", round(ttr,2))

Type-Token Ratio (TTR) for Llama 3.2 3B Model: 1.0


**[Additional Stories](https://github.com/GoodGuyGregory/Llama-3.2-vs-Phi-3/tree/token_check/img/llama3.2)**


**Analysis:**



### Phi 3.5-Mini-Instruct:

[Hugging Face Model Card](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)

**Download Phi 3.5 - Mini - Instruct Model**

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

phi_35_mini_inst = "microsoft/Phi-3.5-mini-instruct"

phi_tokenizer = AutoTokenizer.from_pretrained(phi_35_mini_inst)
phi_model = AutoModelForCausalLM.from_pretrained(phi_35_mini_inst)




**Generate a Story with Phi 3.5 - Mini Instruct**

In [4]:
prompt_text = "Generate a 200 word story that begins with, 'Once upon a time' incorporate a nature theme and make it scary, and mysterious"

# Tokenize the input text
inputs = phi_tokenizer(prompt_text, return_tensors="pt")

# Generate text; max_length counts prompt tokens plus newly generated tokens
outputs = phi_model.generate(inputs.input_ids,
                             attention_mask=inputs.attention_mask,
                             max_length=200,
                             temperature=0.85,
                             do_sample=True
                             )

# Decode the generated text
generated_story = phi_tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_story)

Generate a 200 word story that begins with, 'Once upon a time' incorporate a nature theme and make it scary, and mysterious in mood, without using words related to 'fear', 'dark', 'mysterious', 'scary', 'night', or 'monster'.

In a dense, shadow-draped forest, where sunlight seldom danced through the ancient canopy, there existed a silence so profound that even the whispers of the wind seemed subdued. Once upon a time, under this ethereal stillness, a tale unfurled, woven from threads of enigma and the haunting beauty of nature itself.

An old oak, gnarled with secrets and wisdom from ages past, stood solitary amidst its brethren. Its bark bore the intricate carvings of forgotten lore, and its hollows whisper
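
The decoded output above starts by echoing the prompt, because `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of separating the two, using plain Python lists as stand-ins for token ID tensors. The helper `split_prompt_and_continuation` and the example IDs are hypothetical.

```python
def split_prompt_and_continuation(output_ids, prompt_length):
    """Split a generated sequence into the echoed prompt and the new tokens."""
    return output_ids[:prompt_length], output_ids[prompt_length:]


# Hypothetical token IDs: three prompt tokens followed by three generated tokens
example_prompt_ids = [101, 7, 42]
example_output_ids = [101, 7, 42, 9, 15, 3]

echoed, new_tokens = split_prompt_and_continuation(
    example_output_ids, len(example_prompt_ids)
)
print(new_tokens)
```

In the notebook, the equivalent slice would be `outputs[0][inputs.input_ids.shape[1]:]`, decoded with `phi_tokenizer.decode`, so that only the story is printed and counted. Since the prompt uses part of the `max_length=200` budget, the story itself is shorter than 200 tokens.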


**Measure of Perplexity**  

In [6]:
import torch

# Run the model on the prompt tokens and extract the logits
# (note: this scores the prompt, not the generated story)
with torch.no_grad():
    outputs = phi_model(inputs.input_ids)
    logits = outputs.logits

# Align predictions with targets: the logits at position t predict token t+1,
# so drop the last logit and the first label
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = inputs.input_ids[:, 1:].contiguous()

# Cross-entropy gives the mean negative log-likelihood per token
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=phi_tokenizer.pad_token_id)
# Flatten to (tokens, vocab) and (tokens,) to compute the loss
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

# Perplexity is the exponential of the mean loss
perplexity = torch.exp(loss)

# return the results
print(f"Perplexity for Phi-3 Model: {round(perplexity.item(),2)}")




Perplexity for Phi-3 Model: 18.66


**Measure Type-Token Ratio**

In [7]:

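# Note: the TTR below is computed on the prompt text, not on the generated story
prompt_text = "Generate a 200 word story that begins with, 'Once upon a time' incorporate a nature theme and make it scary, and mysterious"

# Split on whitespace to get word tokens
tokens = prompt_text.split()

# Unique tokens (types) divided by total tokens
types = set(tokens)
ttr = len(types) / len(tokens)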

print("Type-Token Ratio (TTR) for Phi-3 Model:", round(ttr,2))

Type-Token Ratio (TTR) for Phi-3 Model: 0.86


**[Additional Stories](https://github.com/GoodGuyGregory/Llama-3.2-vs-Phi-3/tree/token_check/img/phi3.5)**

**Analysis:**




## Q3
