**1. Imports**

Let's start by installing the required packages (Hugging Face datasets and Unsloth, which we use to train larger models efficiently) and then importing them.


In [1]:
%%capture
!pip cache purge
!pip install --no-deps accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install -U bitsandbytes
!pip install datasets
!pip install --no-deps unsloth

In [2]:
import unsloth  # imported first so that Unsloth's patches apply before transformers/peft load
from unsloth import FastLanguageModel
import torch
import numpy as np
from transformers import AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling, AutoModelForCausalLM
from datasets import load_dataset, concatenate_datasets, Dataset, DatasetDict
from peft import LoraConfig, get_peft_model, TaskType


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


**2. Tokenization and Data Blocks**

Let's access our .txt files from Google Drive and prepare them with the tokenizer.

*   Step 1. Read each file as one long text and tokenize it.
*   Step 2. Split the resulting long token sequence into fixed-size blocks.
*   Step 3. Build a Hugging Face dataset from the blocks.
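
As a rough illustration of Step 2, the sketch below chunks a toy list of integers that stands in for token ids. The names `chunk_token_ids` and `toy_ids` are hypothetical and not part of this notebook; the point is only that the final block is shorter whenever the sequence length is not a multiple of the block size.

```python
def chunk_token_ids(token_ids, block_size):
    """Split one long list of token ids into consecutive blocks of at most block_size."""
    return [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]

# Toy stand-in for a tokenized corpus: 10 integer ids (hypothetical data)
toy_ids = list(range(10))
toy_blocks = chunk_token_ids(toy_ids, block_size=4)

print(toy_blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The notebook's own `create_blocks` additionally decodes each block back to text, because the trainer used later expects a text field.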



In [3]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# set up model and tokenizer
max_seq_length = 3072
dtype = None # None for auto detection.
load_in_4bit = True # Use 4bit quantization to reduce memory usage.

model, tokenizer = FastLanguageModel.from_pretrained(model_name = "unsloth/Meta-Llama-3.1-8B",
                                                     max_seq_length = max_seq_length,
                                                     dtype = dtype,
                                                     load_in_4bit = load_in_4bit)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.0.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
#-----------------------------------------------------------(only run this cell for benchmarking a model, otherwise skip it)
# set up pre-trained model to benchmark the model's performance on general wiki data
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.1-8B")

# set up data collator as well
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # for causal LM
#-----------------------------------------------------------

In [5]:
# ---------Saber's path--------------

# read and tokenize each file
def load_and_tokenize(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    tokenized = tokenizer(text)
    return tokenized['input_ids']

# specify file path in Google Drive
data_path = "/content/drive/MyDrive/SEP775NLP_Final_Project/Friends_script" # no trailing slash; the f-strings below add the separator

# run the function above to tokenize the data
train_ids = load_and_tokenize(f"{data_path}/cleaned_train_data.txt")
val_ids = load_and_tokenize(f"{data_path}/cleaned_val_data.txt")
test_ids = load_and_tokenize(f"{data_path}/cleaned_test_data.txt")
general_test_ids = load_and_tokenize(f"{data_path}/wikitext103_test.txt")

In [6]:
# create blocks of sequences
block_size=512

def create_blocks(all_tokens, block_size):
  blocks = []
  # loop through all of the tokens, split them by block_size
  for i in range(0, len(all_tokens), block_size):
    tokens_in_this_block = all_tokens[i:i+block_size] # takes [0:512], then [512:1024]...
    texts_in_this_block = tokenizer.decode(tokens_in_this_block, skip_special_tokens=False) # convert back to text for SFTTrainer
    blocks.append(texts_in_this_block)
  return blocks # returns a list of decoded text blocks (one string per block)

# run the function above to create the data blocks
train_blocks = create_blocks(train_ids, block_size=block_size)
val_blocks = create_blocks(val_ids, block_size=block_size)
test_blocks = create_blocks(test_ids, block_size=block_size)
general_test_blocks = create_blocks(general_test_ids, block_size=block_size)

# verify that one block contains multiple lines of the script
print(train_blocks[0])

<|begin_of_text|>The One Where Monica Gets a New Roommate (The Pilot-The Uncut Version)
Written by: Marta Kauffman & David Crane


[Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]

Monica: There's nothing to tell! He's just some guy I work with!

Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!

Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?

Phoebe: Wait, does he eat chalk?

(They all stare, bemused.)

Phoebe: Just, 'cause, I don't want her to go through what I went through with Carl- oh!

Monica: Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.

Chandler: Sounds like a date to me.

[Time Lapse]

Chandler: Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked.

All: Oh, yeah. Had that dream.

Chandler: Then I look down, and I realize there's a phone... there.

Joey: Inste

In [7]:
# make dictionaries and then turn them into hugging face datasets
train_dataset = Dataset.from_dict({"text": train_blocks})
val_dataset = Dataset.from_dict({"text": val_blocks})
test_dataset = Dataset.from_dict({"text": test_blocks})
general_test_dataset = Dataset.from_dict({"text": general_test_blocks})

# create the Hugging Face dataset dict
dataset_dict = DatasetDict({"train": train_dataset, "validation": val_dataset, "test": test_dataset,
                            "general_test": general_test_dataset})

**3. Model Training**

Let's set up the training parameters here. Before fine-tuning, we also benchmark the pre-trained model on a general Wiki language dataset (the WikiText-103 test file). We evaluate on the same dataset again after fine-tuning to check whether the model retains its general language capability.

In [None]:
#-----------------------------------------------------------(only run this cell for benchmarking a model, otherwise skip it)
# set up dummy trainer to benchmark the pre-trained model's capability on the general wiki test set
dummy_training_args = TrainingArguments(output_dir="./eval_output", do_train=False, per_device_eval_batch_size=2, report_to="none")
dummy_trainer = Trainer(model=model, args=dummy_training_args, eval_dataset=dataset_dict["general_test"], data_collator=data_collator)

# test the pre-trained model
tokenized_gen_test = dataset_dict["general_test"].map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)
benchmark_general_eval_results = dummy_trainer.evaluate(tokenized_gen_test)
print(f"General Wiki Data before Fine-Tuning: eval_loss = {benchmark_general_eval_results['eval_loss']:.2f}")
#-----------------------------------------------------------

Map:   0%|          | 0/280 [00:00<?, ? examples/s]

General Wiki Data before Fine-Tuning: eval_loss = 2.02


In [8]:
# set up low rank adapter wrapping
model = FastLanguageModel.get_peft_model(model,
                                         r = 64, # higher rank for more powerful adapter
                                         target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
                                         lora_alpha = 16, lora_dropout = 0, bias = "none", use_gradient_checkpointing = "unsloth", random_state = 3407,
                                         use_rslora = False, # not using rsLoRA
                                         loftq_config = None) # not using loftq

Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
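
The adapter size implied by this configuration can be checked by hand. The sketch below assumes the publicly documented Llama-3.1-8B dimensions (hidden size 4096, key/value projection width 1024, MLP intermediate size 14336); these values are assumptions that do not appear in this notebook. Only `r = 64`, the seven target modules, and the 32 layers come from the cell above. Each adapted linear layer gains two low-rank matrices, so it adds `r * (in_features + out_features)` parameters.

```python
# Assumed Llama-3.1-8B dimensions (from the public model config, not from this notebook)
hidden_size = 4096
kv_size = 1024             # 8 key/value heads x 128 head dim
intermediate_size = 14336
num_layers = 32            # matches "patched 32 layers" in the output above
r = 64                     # LoRA rank used in get_peft_model

# (in_features, out_features) of every adapted projection in one decoder layer
module_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_size),
    "v_proj": (hidden_size, kv_size),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in module_shapes.values())
total_trainable = per_layer * num_layers

print(f"{total_trainable:,}")  # 167,772,160
```

Under these assumptions the total agrees with the trainable-parameter count that Unsloth reports when training starts.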


In [9]:
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

# set up training_args
training_args = TrainingArguments(output_dir="/content/drive/MyDrive/SEP775NLP_Final_Project/Models/DeepSeek-R1-Distill-Llama-8B-Lora",
                                  evaluation_strategy="steps",
                                  save_strategy="steps",
                                  max_steps=2000,
                                  eval_steps=200,
                                  save_steps=200,
                                  logging_steps=200,
                                  #num_train_epochs=3,
                                  per_device_train_batch_size=2,
                                  per_device_eval_batch_size=2,
                                  gradient_accumulation_steps = 2,
                                  fp16 = not is_bfloat16_supported(),
                                  bf16 = is_bfloat16_supported(),
                                  optim = "adamw_8bit",
                                  learning_rate=2e-5,
                                  #save_total_limit=3,
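                                  resume_from_checkpoint=True, # note: Trainer does not act on this field by itself; pass resume_from_checkpoint=True to trainer.train() to actually resume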
                                  load_best_model_at_end=True,
                                  metric_for_best_model="eval_loss",
                                  weight_decay=0.01, # added regularization
                                  warmup_steps=5, # added lr scheduler and warmup
                                  lr_scheduler_type="linear",
                                  report_to="none" # disable wandb
                                  )

# setup trainer
trainer = SFTTrainer(model = model,
                     tokenizer = tokenizer,
                     train_dataset = dataset_dict["train"],
                     eval_dataset = dataset_dict["validation"],
                     dataset_text_field = "text",
                     max_seq_length = max_seq_length,
                     dataset_num_proc = 2,
                     packing = False, # packing is disabled; setting it to True can make training faster for short sequences
                     args = training_args)



Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/1800 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/244 [00:00<?, ? examples/s]

In [10]:
# start training
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,800 | Num Epochs = 5 | Total steps = 2,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 167,772,160/8,000,000,000 (2.10% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
200,2.2073,2.133085
400,2.1248,2.096737
600,2.0853,2.093415
800,2.0611,2.081985
1000,2.0299,2.084889
1200,2.0099,2.080121
1400,1.9866,2.085717
1600,1.9626,2.086529
1800,1.9509,2.085907
2000,1.9301,2.090715


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
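
The epoch count in the training banner follows from the step budget. The sketch below redoes that arithmetic with the numbers printed above (1,800 examples, batch size 2, gradient accumulation 2, one GPU, 2,000 steps); the assumption that the reported epoch count is the step budget rounded up to whole epochs is mine.

```python
import math

# Values taken from the training arguments and the Unsloth banner above
num_examples = 1800
per_device_batch = 2
grad_accum_steps = 2
num_gpus = 1
max_steps = 2000

effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
epochs_covered = max_steps / steps_per_epoch

print(effective_batch)             # 4
print(steps_per_epoch)             # 450
print(math.ceil(epochs_covered))   # 5
```

So the 2,000 optimizer steps correspond to roughly four and a half passes over the training blocks, and each row of the loss table (every 200 steps) covers a little under half an epoch.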


**4. Evaluation**

Take the trained model and evaluate its performance on two different test sets.

*   Test 1. Unseen FRIENDS scripts - to test the model's ability to pick up the specific style of the show.
*   Test 2. General Wiki English data - to test the model's general language capabilities (compared against the pre-fine-tuning benchmark).



In [11]:
# evaluation process
tokenized_test = dataset_dict["test"].map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)
tokenized_gen_test = dataset_dict["general_test"].map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)

eval_results = trainer.evaluate(tokenized_test)
general_eval_results = trainer.evaluate(tokenized_gen_test)

print(f"Unseen FRIENDS scripts: eval_loss = {eval_results['eval_loss']:.2f}")
print(f"General Wiki Data after Fine-Tuning: eval_loss = {general_eval_results['eval_loss']:.2f}")

Map:   0%|          | 0/575 [00:00<?, ? examples/s]

Map:   0%|          | 0/560 [00:00<?, ? examples/s]

Unseen FRIENDS scripts: eval_loss = 2.24
General Wiki Data after Fine-Tuning: eval_loss = 2.37
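
One common way to read these numbers is as perplexity, the exponential of the mean cross-entropy loss. The sketch below converts the three losses printed in this notebook, assuming `eval_loss` is the mean per-token cross-entropy in nats; the dictionary labels are illustrative only.

```python
import math

# eval_loss values copied from the outputs printed in this notebook
eval_losses = {
    "Wiki, before fine-tuning": 2.02,
    "Wiki, after fine-tuning": 2.37,
    "Unseen FRIENDS scripts, after fine-tuning": 2.24,
}

# perplexity = exp(mean cross-entropy), assuming the loss is measured in nats
perplexities = {name: math.exp(loss) for name, loss in eval_losses.items()}

for name, ppl in perplexities.items():
    print(f"{name}: perplexity ~ {ppl:.2f}")
```

Lower is better in both units; a loss increase of 0.35 on the Wiki data corresponds to perplexity growing by a factor of about exp(0.35).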


**5. Generation**

Finally, we'll use the fine-tuned model to generate a new episode and see for ourselves how it performs. We tried the following three approaches to text generation; the last one gave the best results, so this version of the code keeps only that approach.

*   Approach 1. Greedy search (using argmax)
*   Approach 2. Multinomial sampling
*   Approach 3. Multinomial sampling with top_k and top_p filters
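
To illustrate how the three approaches differ, here is a simplified, standard-library-only sketch on a toy five-token vocabulary. It is not the transformers implementation, and every name in it (`softmax`, `filter_top_k_top_p`, `toy_logits`, and so on) is hypothetical. Greedy search always takes the most probable token, multinomial sampling draws from the full distribution, and the filtered variant first keeps the `top_k` most probable tokens, then the smallest prefix of those whose renormalized probability reaches `top_p`.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k, top_p):
    """Return {token_id: renormalized prob} after top-k, then top-p filtering."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p / k_total  # cumulative mass within the top-k set
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {token_id: p / kept_total for token_id, p in kept}

def greedy(probs):
    # Approach 1: argmax
    return max(range(len(probs)), key=probs.__getitem__)

def sample(probs_by_id, rng):
    # Approaches 2 and 3: draw one token id according to its probability
    ids = list(probs_by_id)
    weights = [probs_by_id[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # hypothetical next-token scores
probs = softmax(toy_logits)
rng = random.Random(0)

greedy_choice = greedy(probs)
multinomial_choice = sample(dict(enumerate(probs)), rng)
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.75)
filtered_choice = sample(filtered, rng)

print(greedy_choice, multinomial_choice, sorted(filtered), filtered_choice)
```

With these toy scores only the two most probable tokens survive the filters, which shows how the third approach removes the unlikely tail while still leaving room for variation.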





In [12]:
# create initial prompt and tokenize it
prompt = """
The One with students from McMaster University
Written by: LLaMA-3.1-8B

[Scene: Monica's Apartment. Monica, Ross, Rachel, and Chandler are there.]

Monica: Hey guys! Big news—I heard there's going to be a party at McMaster University.

Ross: A party?

Rachel: Oh yeah! The students are celebrating their SEP775 class with the professors!

Chandler: We should join them! I mean, where is the fun in a party without your favorite FRIENDS characters?

(They all leave for the party.)

[Scene: SEP775 classroom at McMaster University. Rachel, Chandler, Ross, Monica, and Phoebe are hanging out with the students.]

Rachel: Wow, this place kicks ass! Look, isn’t that Professor Mahyar who teaches Natural Language Processing?

"""

model_inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=True).to(model.device)
model.generation_config.pad_token_id = tokenizer.pad_token_id # suppress padding warning

# create a list that holds the prompt, future tokens will be appended into this list later on
all_tokens = model_inputs["input_ids"][0].tolist()

# to avoid hitting memory limits, generate only one batch with the initial prompt, take last few tokens out to generate the next batch.
context_window = 700 # this is the sliding window - take this many tokens to set up next batch's generation
max_token = 2500
step_size = 128 # generate this many tokens at a time
num_of_loops = max_token // step_size

# end episode management
min_tokens_before_end = 2000 # generate this many tokens before being allowed to generate the word "end"
end_id = tokenizer.encode("end", add_special_tokens=False) # token_id of the word "end" (I checked it previously and it is only 1 token id, meaning it is the standalone word "end")
end_of_episode_marker = "### END OF EPISODE ###" # this marker was added during data pre-processing. it should come right after the "end" word

# generation loop
for i in range(num_of_loops):

  # suppresses the word "end" before reaching certain number of tokens
  if len(all_tokens) < min_tokens_before_end:
    bad_words = [end_id]
  else:
    bad_words = None

  output = model.generate(**model_inputs,
                          max_new_tokens=step_size,
                          do_sample=True,
                          top_k=40, # keep k highest probable options
                          top_p=0.75, # keep only the options that add up to p% probability, filter out the remaining less probable options
                          temperature=1, # to add a little bit more creativity if needed
                          repetition_penalty=1.2,
                          bad_words_ids=bad_words # tokens banned from generation until the minimum token count is reached
                          )

  output_tokens = output[0].tolist()
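  prompt_length = model_inputs["input_ids"].shape[1] # number of context tokens fed into this generate() call
  new_tokens = output_tokens[prompt_length:] # drop the context and keep only new tokens (can be fewer than step_size if generation stops early)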

  new_text = tokenizer.decode(new_tokens, skip_special_tokens=True) # turn tokens to text to check if end marker has been generated
  start_index_of_end_marker = new_text.find(end_of_episode_marker) # returns -1 if the marker is not found, otherwise returns the start index

  if start_index_of_end_marker != -1: # if end marker is present
    truncated_new_text = new_text[: start_index_of_end_marker + len(end_of_episode_marker)] # get rid of what is after the end marker
    truncated_new_tokens = tokenizer.encode(truncated_new_text, add_special_tokens=False) # turn them into tokens
    all_tokens.extend(truncated_new_tokens) # and put them in the list
    break

  else:
    all_tokens.extend(new_tokens) # otherwise place the whole thing into the list without truncating

  previous_context = output_tokens[-context_window:] # set new previous_context for next loop
  previous_context_tensor = torch.tensor([previous_context]).to(model.device) # prepare the inputs and attention mask for next loop
  attention_mask = (previous_context_tensor != tokenizer.pad_token_id).long()
  model_inputs = {"input_ids": previous_context_tensor, "attention_mask":attention_mask}

# decode from the list and print results
generated_script = tokenizer.decode(all_tokens, skip_special_tokens=True)
print(generated_script)


The One with students from McMaster University
Written by: LLaMA-3.1-8B

[Scene: Monica's Apartment. Monica, Ross, Rachel, and Chandler are there.]

Monica: Hey guys! Big news—I heard there's going to be a party at McMaster University.

Ross: A party?

Rachel: Oh yeah! The students are celebrating their SEP775 class with the professors!

Chandler: We should join them! I mean, where is the fun in a party without your favorite FRIENDS characters?

(They all leave for the party.)

[Scene: SEP775 classroom at McMaster University. Rachel, Chandler, Ross, Monica, and Phoebe are hanging out with the students.]

Rachel: Wow, this place kicks ass! Look, isn’t that Professor Mahyar who teaches Natural Language Processing?

Professor Mahyar: (to her) Hi Rach—how you doin' sweetie? (She just stares back.) Uh… How’s it goin’? Huh?

Phoebe: Oohh-ho!!

(Ross approaches one of his old teachers.)

Ross: Dr. Saxon!

Dr. Saxton: You're still alive?! What have we been doing up here?!

(Monica