# Assignment 1: Large Language Models for Text Generation
### CS 410/510 Large Language Models Fall 2024
#### Greg Witt

**Q1. Describe three differences between Llama 3.2 models and Phi-3.5 model.**


**Q2. Generate a story of 200 words that starts with the words *“Once upon a time”* using each of these models.**
**You should have 3 outputs in total.**

In [None]:
# required packages
# pip install transformers
# pip install torch

### Llama 3.2 - 1B:

[Hugging Face Model Card](https://huggingface.co/meta-llama/Llama-3.2-1B)

**Download Llama-3.2 1B**

In [19]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

llama_32_1B = "meta-llama/Llama-3.2-1B"

# creates a Tokenizer specifically for the llama_model requested
tokenizer = AutoTokenizer.from_pretrained(llama_32_1B)

# Set the padding token ID to be the same as the EOS token ID
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(llama_32_1B, torch_dtype=torch.float16)

model = model.to('cpu')


**Generate A Story with Llama 3.2 1B**

In [20]:

# Our Story Prompt
story_prompt = "Once upon a time"
    
# Encode the prompt into token IDs
prompt_ids = tokenizer.encode(story_prompt, return_tensors="pt")

# Create an attention mask
attention_mask = prompt_ids.ne(tokenizer.pad_token_id)

# Generate a response from llama_3.2-1B
outputs = model.generate(prompt_ids,
                         attention_mask=attention_mask,
                         max_length=200,
                         do_sample=True,
                         num_return_sequences=1,
                         pad_token_id=tokenizer.eos_token_id,
                         temperature=0.93,
                         top_k=30,
                         top_p=0.90,
                         repetition_penalty=1.2
                        )

# Decode the generated response
generated_tokens = outputs[0]

generated_story = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_story)

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Once upon a time there was an evil wizard called Lord Gogul. He lived in the middle of nowhere on earth, and his only goal in life seemed to be world domination.
One day he heard about this strange man who had recently come back from outer space with a huge collection of new ideas which would turn the entire planet into one big mess! So off Lord Gogul went seeking this mysterious stranger's advice...
He found him sitting outside a small hut surrounded by a crowd of gullible followers waiting for his answers!
"Dear friend," said the Wizard as they exchanged pleasantries, "you are here today because I hear you have many great plans for our good country - how do we know that it isn't just your ego talking?"
The young inventor looked down at himself quizzically before replying: "No sir! It is my whole body covered..."
So the Great Leader sat up close enough so their faces were nearly touching...and then whispered something very carefully into


**Measure of Perplexity**  

In [22]:
import torch

# Extract the Logits from the Model based on the Inputs
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# shift the input_ids to the right to determine the next token for the model to predict
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = inputs["input_ids"][:, 1:].contiguous()

# calculate the log likelihood based on Cross EntropyLoss
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
# determine the loss value to exponentiate
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

# exponentiate the loss
perplexity = torch.exp(loss)

# return the results
print(f"Perplexity for Llama 3.2 1B Model: {round(perplexity.item(),2)}")

Perplexity for Llama 3.2 1B Model: 501791.19


**Measure Token Type Ratio**

In [23]:
prompt_text = "Once upon a time"

tokens = prompt_text.split()

types = set(tokens)
ttr = len(types) / len(tokens)

print("Type-Token Ratio (TTR) for Llama 3.2 1B Model:", round(ttr,2))

Type-Token Ratio (TTR) for Llama 3.2 1B Model: 1.0


### Llama 3.2 - 3B:

[Hugging Face Model Card](https://huggingface.co/meta-llama/Llama-3.2-3B)

**Download Llama 3.2 - 3B**

In [24]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

llama_32_3B = "meta-llama/Llama-3.2-3B"

tokenizer = AutoTokenizer.from_pretrained(llama_32_3B)
model = AutoModelForCausalLM.from_pretrained(llama_32_3B)

# creates a Tokenizer specifically for the llama_model requested
tokenizer = AutoTokenizer.from_pretrained(llama_32_3B)

# Set the padding token ID to be the same as the EOS token ID
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(llama_32_3B, torch_dtype=torch.float32)

model = model.to('cpu')

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:26<00:00, 13.01s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.66s/it]


**Generate A Story with Llama 3.2 3B**

In [31]:
# Your special prompt
story_prompt = "Once upon a time "
    
# Encode the prompt into token IDs
prompt_ids = tokenizer.encode(story_prompt, return_tensors="pt")

# Create an attention mask
attention_mask = prompt_ids.ne(tokenizer.pad_token_id)

# Generate a response from llama_3.2-3B
outputs = model.generate(prompt_ids,
                        attention_mask=attention_mask,
                        max_length=200,
                        do_sample=True,
                        num_return_sequences=1,
                        pad_token_id=tokenizer.eos_token_id,
                        temperature=0.83,
                        top_k=30,
                        top_p=0.90,
                        repetition_penalty=1.2
                    )

# Decode the generated response
generated_tokens = outputs[0]

generated_story = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_story)


Once upon a time 5th century BCE, the Indian sage Gautama Buddha was asked to explain how to overcome sorrow. The Lord said that it is through detachment from worldly pleasures and attachments that one can achieve happiness in this life.
The wise man who does not desire anything other than truth knows neither attachment nor disattachment; he has no fear of loss or gain. In him there are no worries about what may happen tomorrow (Sri Lanka Times).
Detachment: What It Is
In Buddhist philosophy, detaching oneself means living without any desires for material possessions such as money, fame, power etc., because these things do not bring true happiness but only temporary pleasure which ultimately leads us towards suffering when they disappear!
It’s easy enough – just say “no”! But Buddhism teaches something much deeper than denying ourselves certain kinds of wants like sex or drugs…instead we must learn how to let go of everything except our own inner peace within each moment so that even w

**Measure Perplexity**

In [34]:
import torch

# Extract the Logits from the Model based on the Inputs
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# shift the input_ids to the right to determine the next token for the model to predict
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = inputs["input_ids"][:, 1:].contiguous()

# calculate the log likelihood based on Cross EntropyLoss
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=phi_tokenizer.pad_token_id)
# determine the loss value to exponentiate
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

# exponentiate the loss
perplexity = torch.exp(loss)

# return the results
print(f"Perplexity for Llama 3.2 3B Model: {round(perplexity.item(),2)}")

Perplexity for Llama 3.2 3B Model: 232412.89


**Measure Type-Token Ratio**

In [33]:
prompt_text = "Once upon a time"

tokens = prompt_text.split()

types = set(tokens)
ttr = len(types) / len(tokens)

print("Type-Token Ratio (TTR) for Llama 3.2 3B Model:", round(ttr,2))

Type-Token Ratio (TTR) for Llama 3.2 3B Model: 1.0


### Phi 3.5-Mini-Instruct:

[Hugging Face Model Card](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)

**Download Phi 3.5 - Mini - Instruct Model**

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

phi_35_mini_inst = "microsoft/Phi-3.5-mini-instruct"

phi_tokenizer = AutoTokenizer.from_pretrained(phi_35_mini_inst)
phi_model = AutoModelForCausalLM.from_pretrained(phi_35_mini_inst)


prompt_text = "Generate a 200 word story that begins with, 'Once upon a time'"

# Tokenize the input text
inputs = phi_tokenizer(prompt_text, return_tensors="pt")

# Generate text
outputs = phi_model.generate(inputs.input_ids, max_length=200, num_return_sequences=1)

# Decode the generated text
generated_story = phi_tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_story)


  from .autonotebook import tqdm as notebook_tqdm
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:17<00:00,  8.62s/it]
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Generate a 200 word story that begins with, 'Once upon a time' and ends with, 'and they lived happily ever after.' The story must include a dragon, a lost crown, and a mysterious forest. Ensure that the narrative has a clear beginning, middle, and end, and that the dragon plays a crucial role in the resolution of the plot.


### Answer: Once upon a time, in a kingdom shadowed by a mysterious forest, a crown of unparalleled beauty was lost. The king, distraught, declared a reward for its return. Among the villagers, a young shepherd named Eli ventured into the forest, guided by whispers of the dragon, Drako, who was rumored to guard the crown.


Deep within the woods, Eli encountered Drako, whose scales shimmered like emeralds. The dragon, initially wary, sensed Eli's pure heart and revealed the crown's hiding place, entangled in a thicket. Eli retrieved it with care, and in gratitude, promised to protect the forest.


The king, overjoyed, bestowed the crown upon Eli, who became a symbo

**Measure of Perplexity**  

In [7]:
import torch

# Extract the Logits from the Model based on the Inputs
with torch.no_grad():
    outputs = phi_model(**inputs)
    logits = outputs.logits

# shift the input_ids to the right to determine the next token for the model to predict
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = inputs["input_ids"][:, 1:].contiguous()

# calculate the log likelihood based on Cross EntropyLoss
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=phi_tokenizer.pad_token_id)
# determine the loss value to exponentiate
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

# exponentiate the loss
perplexity = torch.exp(loss)

# return the results
print(f"Perplexity for Phi-3 Model: {round(perplexity.item(),2)}")




Perplexity for Phi-3 Model: 6.56


**Measure Type Token Ratio**

In [11]:

prompt_text = "Generate a 200 word story that begins with, 'Once upon a time'"

tokens = prompt_text.split()

types = set(tokens)
ttr = len(types) / len(tokens)

print("Type-Token Ratio (TTR) for Phi-3 Model:", round(ttr,2))

Type-Token Ratio (TTR) for Phi-3 Model: 0.92
