<a href="https://colab.research.google.com/github/fatemafaria142/Comparative-Analysis-of-Diverse-Large-Language-Models-in-Story-Generation/blob/main/Story_Generation_using_TinyLlama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Collecting trl
  Downloading trl-0.13.0-py3-none-any.whl.metadata (11 kB)
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl (69.1 MB)

# **Load the required packages**

In [20]:
import torch
from datasets import load_dataset, Dataset
from peft import LoraConfig, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer
from trl import SFTTrainer
import os

### **Dataset Link:** https://huggingface.co/datasets/AtlasUnified/atlas-storyteller

In [3]:
dataset="AtlasUnified/atlas-storyteller"
model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
output_model="tinyllama-Story-v1"

# **Dataset preparation**

In [4]:
def prepare_train_data(data_id, num_samples=1000):
    data = load_dataset(data_id, split="train")

    # Keep at most the first `num_samples` rows
    data = data.select(range(min(num_samples, len(data))))
    instructions_template = "Imagine you are the author of this story. Your task is to continue the narrative and unfold the plot. Introduce new characters, unexpected twists, and exciting events. Feel free to unleash your creativity and have fun crafting the next part of the story!"
    data_df = data.to_pandas()
    # Wrap each story in a user/assistant layout and store it in a new "text" column
    data_df["text"] = data_df[["Story"]].apply(lambda x: "user\n" + instructions_template + "  \n Story\n" + x["Story"] + "\nassistant\n\n", axis=1)
    data = Dataset.from_pandas(data_df)

    return data
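
To inspect what one training example looks like without downloading the dataset, the same kind of template can be applied to a plain string. This is a minimal sketch: `format_example` is a hypothetical helper that is not part of the notebook, the sample story is made up, and the whitespace between the sections is simplified.

```python
# Instruction text copied from the notebook's template.
INSTRUCTIONS = (
    "Imagine you are the author of this story. Your task is to continue the narrative "
    "and unfold the plot. Introduce new characters, unexpected twists, and exciting events. "
    "Feel free to unleash your creativity and have fun crafting the next part of the story!"
)


def format_example(story: str) -> str:
    """Wrap one story in a user/assistant layout (hypothetical helper)."""
    return "user\n" + INSTRUCTIONS + "\nStory\n" + story + "\nassistant\n"


# Made-up story, used only to show the resulting layout.
sample = format_example("A fox found a silver key under the old bridge.")
print(sample)
```

Printing one formatted example before training is a cheap way to catch spacing or escaping mistakes in the template.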



In [5]:
data = prepare_train_data(dataset, num_samples=1000)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


storyteller.jsonl:   0%|          | 0.00/21.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5018 [00:00<?, ? examples/s]

In [6]:
data

Dataset({
    features: ['id', 'Story', 'text'],
    num_rows: 1000
})

In [7]:
data[0]

{'id': 'seed_task_0',
 'Story': 'In the bustling city of New York, where the neon lights flickered and the sound of traffic reverberated through the streets, lived a man named Ethan Sullivan. With his chiseled jawline, piercing blue eyes, and a physique carved by years of training, Ethan possessed an air of mystery that intrigued those who crossed his path. He lived a solitary life in a modest apartment, spending his days as an accountant, his nights cloaked in shadows and secrecy.  One fateful morning, as the city awoke to the rhythmic beat of its own heartbeat, Ethan received a cryptic message on a burner phone. It simply stated, "They\'re coming for you." His heart quickened, and a wave of apprehension washed over him. Who were "they," and why were they after him? Ethan\'s blood ran cold, fueling a surge of adrenaline that urged him to take action.  Without a moment\'s hesitation, Ethan gathered his meager belongings and made his way to a hidden room beneath his apartment. The room,

In [8]:
data[2]

{'id': 'seed_task_2',
 'Story': 'In the midst of a bustling metropolis teeming with skyscrapers that seemingly touched the heavens, nestled a small and unassuming garage. It belonged to Ethan, a young and brilliant engineer consumed by his insatiable thirst for innovation and adventure. With grease-stained hands and a mind full of ideas, he spent every waking hour tinkering and crafting marvelous inventions that could turn dreams into reality.  One fateful afternoon, while rummaging through a box filled with ancient blueprints, Ethan stumbled upon a peculiar schematic. It depicted a colossal machine, a menacing giant robot with gleaming steel limbs and an aura of unstoppable power. Intrigued and fascinated, he couldn\'t resist the temptation to bring this mechanical behemoth to life.  Days turned into weeks, and nights melted away as Ethan toiled tirelessly in his secluded workshop, pouring every ounce of his skill and determination into the construction of his masterpiece. Finally, af

## **Load the Chat Model (not the base version)**

In [9]:
def get_model_and_tokenizer(model_id):
    # Use the EOS token as the padding token
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    # 4-bit NF4 quantization with double quantization; compute runs in float16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype="float16", bnb_4bit_use_double_quant=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    # The key/value cache only helps at generation time, so turn it off for training
    model.config.use_cache=False
    model.config.pretraining_tp=1
    return model, tokenizer

In [10]:
model, tokenizer = get_model_and_tokenizer(model_id)

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]
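
The 2.20G `model.safetensors` download above is consistent with 16-bit weights for a model of about 1.1 billion parameters. The rough arithmetic below shows why 4-bit loading is attractive on a small GPU. It is a back-of-the-envelope sketch: the parameter count is an assumption taken from the "1.1B" in the model name, and the estimate ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.1e9  # assumption: read off the "1.1B" in the model name

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.2f} GB")
print(f" 4-bit weights: ~{nf4_gb:.2f} GB (weights only)")
```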

# **Setting up LoRA**

In [11]:
peft_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    )
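
To get a feel for what `r=8` and `lora_alpha=16` mean, the sketch below counts the parameters that one LoRA adapter adds to a single linear layer and computes the scaling factor applied to the low-rank update. The helper names are hypothetical, and the 2048 x 2048 layer shape is an illustrative assumption rather than a value read from the model config.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return lora_alpha / r


HIDDEN = 2048  # assumption: illustrative square projection, not read from the config

added = lora_extra_params(HIDDEN, HIDDEN, r=8)
full = HIDDEN * HIDDEN

print(f"adapter parameters: {added}")
print(f"frozen layer parameters: {full}")
print(f"trainable fraction for this layer: {added / full:.4%}")
print(f"scaling factor: {lora_scaling(16, 8)}")
```

Only the adapter matrices receive gradients; the original layer weights stay frozen.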

In [12]:
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,  # ignored while max_steps is positive
        max_steps=250,  # a positive max_steps overrides num_train_epochs
        fp16=True,
        # push_to_hub=True
    )
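
The batch size, accumulation steps, and `max_steps` above interact: gradients are accumulated over several small batches before each optimizer step. The arithmetic below is a small sketch of that relationship, assuming a single GPU (the usual Colab setup, though the notebook does not state it).

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 250
num_devices = 1  # assumption: a single Colab GPU

# Examples contributing to each optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Examples consumed over the whole run
examples_seen = effective_batch * max_steps

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
```

Under these assumptions, 250 steps cover 1000 examples, which matches the 1000-row subset selected during dataset preparation.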

In [23]:
import torch
from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model  # Ensure you have PEFT if you're using LoRA

# Step 1: Load and prepare the dataset
def prepare_train_data(data_id, num_samples=1000):
    data = load_dataset(data_id, split='train')
    data = data.select(range(min(num_samples, len(data))))
    instructions_template = (
        "Imagine you are the author of this story. Your task is to continue the narrative "
        "and unfold the plot. Introduce new characters, unexpected twists, and exciting events. "
        "Feel free to unleash your creativity and have fun crafting the next part of the story!"
    )
    data_df = data.to_pandas()
    data_df['text'] = data_df[['Story']].apply(
        lambda x: f"user\n{instructions_template}\nStory\n{x['Story']}\nassistant\n",
        axis=1
    )
    return Dataset.from_pandas(data_df)

dataset_id = "AtlasUnified/atlas-storyteller"
data = prepare_train_data(dataset_id, num_samples=1000)

# Step 2: Tokenization
def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=1024,
        padding="max_length"
    )

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.pad_token = tokenizer.eos_token  # Make sure that padding token is set if needed
tokenized_data = data.map(tokenize_fn, batched=True)

# Optional: Split dataset for training and evaluation
train_test_split = tokenized_data.train_test_split(test_size=0.1)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

# Step 3: Load and optionally enhance the model with PEFT/LoRA
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
peft_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = AutoModelForCausalLM.from_pretrained(model_id)
model = get_peft_model(model, peft_config)

# Step 4: Define the data collator for causal language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # False because it's a causal language model
)

# Step 5: Set up training arguments
training_arguments = TrainingArguments(
    output_dir="./output_model_tinyllama_story",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=500,
    save_total_limit=2,
    logging_dir='./logs',
    evaluation_strategy="steps",  # named `eval_strategy` in newer transformers releases
    eval_steps=500,
    load_best_model_at_end=True
)

# Step 6: Initialize and run the Trainer
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator
)
trainer.train()

# Optionally, save the final model
trainer.save_model("./final_model_tinyllama_story")


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 1.4664        | 1.408483        |
| 1000 | 1.4065        | 1.400836        |
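
The validation losses can be easier to interpret as perplexity, which is the exponential of the mean cross-entropy. The snippet below is a minimal sketch that assumes the reported loss is a mean per-token cross-entropy in nats, which is the usual convention for causal language model training.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Validation losses copied from the training log above
validation_losses = {500: 1.408483, 1000: 1.400836}

for step, loss in validation_losses.items():
    print(f"step {step}: validation perplexity ~ {perplexity(loss):.2f}")
```

The small drop between the two evaluation points suggests the model was improving only slowly on held-out stories by step 1000.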


In [24]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1: Save the fine-tuned model and tokenizer
model_save_path = "./final_model_tinyllama_story"
tokenizer_save_path = "./final_model_tinyllama_story"

# 'model' is the PEFT-wrapped model from training, so save_pretrained writes the LoRA adapter files
# 'tokenizer' is the tokenizer used for training
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(tokenizer_save_path)

# Step 2: Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_save_path)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_save_path)

# Step 3: Generate a response
def generate_response(prompt):
    # Encode the input prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Generate output sequence with beam search
    # max_length counts the prompt tokens as well, so prompt + continuation stop at 50 tokens
    output_ids = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

    # Decode the generated sequence and return it
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return response

# Example prompt
prompt = "Once upon a time in a magical forest, there was a lonely wizard who"

# Generate and print the response
response = generate_response(prompt)
print("Generated Response:", response)


Generated Response: Once upon a time in a magical forest, there was a lonely wizard who longed for adventure. One day, as he wandered through the woods, he stumbled upon an enchanted garden. The garden was
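
The response above stops mid-sentence because `max_length=50` bounds the prompt and the continuation together. The sketch below shows how the room left for new text shrinks as the prompt grows. `continuation_budget` is a hypothetical helper, and the prompt lengths are invented for illustration rather than measured with the tokenizer.

```python
def continuation_budget(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for new text when max_length bounds prompt + output."""
    return max(max_length - prompt_tokens, 0)


MAX_LENGTH = 50  # value passed to model.generate in the cell above

# Hypothetical prompt lengths in tokens, for illustration only
for prompt_tokens in (10, 25, 60):
    budget = continuation_budget(prompt_tokens, MAX_LENGTH)
    print(f"prompt of {prompt_tokens} tokens -> up to {budget} new tokens")
```

If longer continuations are wanted regardless of prompt length, `generate` also accepts `max_new_tokens`, which bounds only the newly generated tokens.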


# **Generation of Example Text**

In [25]:
prompt = "Once upon a time there is a place, where a officer lived"
response = generate_response(prompt)
print("Generated Response:", response)

Generated Response: Once upon a time there is a place, where a officer lived with his family. His name was John and he had a wife named Mary. They had two children, a boy named Jack and a girl named Sarah.

One day,


In [26]:
prompt = "There lived a man who is unhappy with his family"
response = generate_response(prompt)
print("Generated Response:", response)

Generated Response: There lived a man who is unhappy with his family life. He longed for adventure and excitement. One day, he decided to embark on a journey of self-discovery.

He packed his bags and


In [27]:
prompt = "A teacher once said to his students that no matter what happend in life he always proud of them"
response = generate_response(prompt)
print("Generated Response:", response)

Generated Response: A teacher once said to his students that no matter what happend in life he always proud of them. Can you paraphrase the statement made by the teacher and explain what it means?
