<a href="https://colab.research.google.com/github/k3nnethfrancis/Rick-n-Morty-Scene-Gen/blob/main/ML2G_Rick%26MortyFineTune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook uses Anyscale and Hugging Face Datasets to fine-tune a Llama 2 model on Rick and Morty scenes.

After training, we'll build a chatbot in which a system prompt sets the scene and Rick and Morty converse with each other.

To run this notebook, you will need to create an Anyscale account and API key. You can create one for free [here](https://app.endpoints.anyscale.com/welcome). Anyscale will provide you with $10 in free credits when you create your account.

## Setup

The first step is to install dependencies. Uncomment the first cell and run it. Then comment it out again, restart the runtime, and re-run the notebook.

In [1]:
# !pip install datasets
# !pip install openai

Import libraries.

In [2]:
import json
import pandas as pd
import openai
import transformers
from datasets import load_dataset
from IPython.display import display, JSON
from google.colab import drive


In [3]:
# Mount google drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Training

### Load the Rick and Morty dataset from Hugging Face

In [4]:
dataset = load_dataset("ysharma/rickandmorty")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


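Use pandas to convert the dataset object to a DataFrame, which is a bit easier to work with. Because no split was selected when loading, the result has a single `train` column whose cells are row dictionaries; the Transformations section reloads the data with `split='train'` to get one column per field.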

In [5]:
df = pd.DataFrame(dataset)
df.head()

Unnamed: 0,train
0,"{'index': 0, 'season no.': 1, 'episode no.': 1..."
1,"{'index': 1, 'season no.': 1, 'episode no.': 1..."
2,"{'index': 2, 'season no.': 1, 'episode no.': 1..."
3,"{'index': 3, 'season no.': 1, 'episode no.': 1..."
4,"{'index': 4, 'season no.': 1, 'episode no.': 1..."


## Transformations

Create a function that transforms the data so that each prompt is a single character's line and each completion is the line that follows it. We also add `prompt_character` and `completion_character` fields to make it easy to see who is saying what.


In [6]:
import pandas as pd
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("ysharma/rickandmorty", split='train')

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(dataset)

def transform_dataset(df):
    # Initialize the 'prompt', 'completion', 'prompt_character', and 'completion_character' columns
    df['prompt'] = None
    df['completion'] = None
    df['prompt_character'] = None
    df['completion_character'] = None

    # Pair each line with the line that follows it
    for i in range(len(df) - 1):  # Stop before the last row, which has no following line
        # Fill in the prompt and completion text.
        # Note: the ' just before the closing double quote appends a trailing
        # apostrophe to every line, which is why the outputs end with one.
        df.at[i, 'prompt'] = f"{df.at[i, 'line']}'"
        df.at[i, 'completion'] = f"{df.at[i + 1, 'line']}'"

        # Fill in the prompt and completion character names
        df.at[i, 'prompt_character'] = df.at[i, 'name']
        df.at[i, 'completion_character'] = df.at[i + 1, 'name']

    # Drop the last row as it does not have a following line for completion
    df = df[:-1]

    # Return only the four new columns
    return df[['prompt', 'completion', 'prompt_character', 'completion_character']]

df = transform_dataset(df)
df.head()



Unnamed: 0,prompt,completion,prompt_character,completion_character
0,Morty! You gotta come on. Jus'... you gotta co...,"What, Rick? What’s going on?'",Rick,Morty
1,"What, Rick? What’s going on?'","I got a surprise for you, Morty.'",Morty,Rick
2,"I got a surprise for you, Morty.'",It's the middle of the night. What are you tal...,Rick,Morty
3,It's the middle of the night. What are you tal...,"Come on, I got a surprise for you. Come on, h...",Morty,Rick
4,"Come on, I got a surprise for you. Come on, h...",Ow! Ow! You're tugging me too hard!',Rick,Morty
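The pairing step described above (each line becomes a prompt, the following line becomes its completion) can also be sketched without an explicit loop. This is a minimal illustration on a made-up three-row DataFrame, assuming the same `name` and `line` columns; `pair_consecutive_lines` is a hypothetical helper, not part of the notebook, and unlike the cell above it does not append a trailing apostrophe.

```python
import pandas as pd

# Toy stand-in for the dataset; only the 'name' and 'line' columns are modeled.
toy = pd.DataFrame({
    "name": ["Rick", "Morty", "Rick"],
    "line": ["Morty! You gotta come on.", "What, Rick?", "I got a surprise for you."],
})

def pair_consecutive_lines(df):
    """Pair each line with the line that follows it (hypothetical helper)."""
    pairs = pd.DataFrame({
        "prompt": df["line"],
        "completion": df["line"].shift(-1),            # next row's line
        "prompt_character": df["name"],
        "completion_character": df["name"].shift(-1),  # next row's speaker
    })
    # The last row has no following line, so drop it.
    return pairs.iloc[:-1].reset_index(drop=True)

pairs = pair_consecutive_lines(toy)
print(pairs)
```

Like the loop version, this pairs every row with the next one regardless of where one episode ends and another begins.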


Here we create a function that splits the dataset by character. The dataset contains many characters from the show, including Rick, Morty, Summer, and Jerry, among others. We will focus on just Rick and Morty.



In [7]:
def split_by_character(df, character):
    """
    Split the dataset by the specified character
    """
    # Filter the dataset to only include rows where the character name matches the specified character
    df = df[df['completion_character'] == character]
    return df

# Split the dataset by the character 'Morty'
morty_df = split_by_character(df, 'Morty')

# Split the dataset by the character 'Rick'
rick_df = split_by_character(df, 'Rick')

In [8]:
# check the lengths of rick & morty dataset
print(len(morty_df))
print(len(rick_df))

347
419


In [9]:
# create dataset of only rick and morty characters in the prompt_character and completion_character columns
df = df[df['prompt_character'].isin(['Rick', 'Morty'])]
df = df[df['completion_character'].isin(['Rick', 'Morty'])]

df.head()

Unnamed: 0,prompt,completion,prompt_character,completion_character
0,Morty! You gotta come on. Jus'... you gotta co...,"What, Rick? What’s going on?'",Rick,Morty
1,"What, Rick? What’s going on?'","I got a surprise for you, Morty.'",Morty,Rick
2,"I got a surprise for you, Morty.'",It's the middle of the night. What are you tal...,Rick,Morty
3,It's the middle of the night. What are you tal...,"Come on, I got a surprise for you. Come on, h...",Morty,Rick
4,"Come on, I got a surprise for you. Come on, h...",Ow! Ow! You're tugging me too hard!',Rick,Morty


In [10]:
# check the lengths of the combined dataset
print(len(df))

450
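The two `isin` filters can be combined into a single boolean mask, and a `groupby` then shows how the remaining pairs break down by speaker direction. The sketch below uses a made-up five-row DataFrame with the same two character columns; the names `toy_pairs`, `main_cast`, and `direction_counts` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the transformed DataFrame.
toy_pairs = pd.DataFrame({
    "prompt_character": ["Rick", "Morty", "Summer", "Rick", "Morty"],
    "completion_character": ["Morty", "Rick", "Rick", "Jerry", "Morty"],
})

main_cast = ["Rick", "Morty"]

# Keep rows where BOTH speakers are in the main cast.
mask = (
    toy_pairs["prompt_character"].isin(main_cast)
    & toy_pairs["completion_character"].isin(main_cast)
)
filtered = toy_pairs[mask]

# Count pairs per (prompt speaker, completion speaker) combination.
direction_counts = filtered.groupby(
    ["prompt_character", "completion_character"]
).size()
print(direction_counts)
```

A breakdown like this also surfaces same-speaker pairs (for example Morty followed by Morty), which appear in the real data as well.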


## Train/test split

In [11]:
# train, test, split
from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Display the first few rows of the train and test sets
display(train_df.head())
display(test_df.head())

Unnamed: 0,prompt,completion,prompt_character,completion_character
24,I don't care about Jessica! Y-Yyyyyyyyyyou—',"You know what, Morty? You're right. Let's for...",Morty,Rick
17,And Jessica's gonna be Eve.',Whhhh-wha?',Rick,Morty
135,"Yeah, I can see that. But do you think you'll ...","Are you kidding me?! That's it, Rick! That's t...",Rick,Morty
316,Whooooa! Whoooooooa! Whoa! Whooooooooaaaaa!',AAAAAAAAAAAAAAAAAAAAAAAAHHHHHHHHHH!!!!!!!!!!',Morty,Morty
815,"Rick, are you really a musician?'","Who’s NOT a musician, Morty?'",Morty,Rick


Unnamed: 0,prompt,completion,prompt_character,completion_character
1634,I-It's just something Rick starts talking abou...,"W-What? In w--In w-w-what--In what way? Like, ...",Morty,Rick
1745,"Well, you can keep wondering that while we go ...",Man. Glad I’m not one of them!',Rick,Morty
236,"Full disclosure, Morty it's not. Temporary sup...","Aw, man.'",Rick,Morty
30,Alright. I'll-I'll land. I'll land. I'll land....,"We'll park it right here, Morty. Right here on...",Rick,Rick
1661,"Jesus Christ, what a shitty neutrino bomb. it'...","Oh, I don't know. You managed to destroy just ...",Rick,Morty


## JSONL object & file generation

Anyscale expects the training payload as a `.jsonl` file (one JSON object per line), which is the same format used for fine-tuning with the OpenAI API.

Create a function to convert a row to the specified JSON structure.

In [12]:
def row_to_jsonl(row):
    prompt_character = row['prompt_character']
    completion_character = row['completion_character']
    return {
        'messages': [
            {'role': 'system', 'content': f'You are a Rick & Morty dialogue generator. Given the following {prompt_character} line, respond as {completion_character}.' },
            {'role': 'user', 'content': row['prompt']},
            {'role': 'assistant', 'content': row['completion']}
        ]
    }

for i in range(5):
    print(row_to_jsonl(train_df.iloc[i]))

{'messages': [{'role': 'system', 'content': 'You are a Rick & Morty dialogue generator. Given the following Morty line, respond as Rick.'}, {'role': 'user', 'content': "I don't care about Jessica! Y-Yyyyyyyyyyou—'"}, {'role': 'assistant', 'content': "You know what, Morty? You're right.  Let's forget the girl altogether. She, she's probably nothing but trouble, anyways.'"}]}
{'messages': [{'role': 'system', 'content': 'You are a Rick & Morty dialogue generator. Given the following Rick line, respond as Morty.'}, {'role': 'user', 'content': "And Jessica's gonna be Eve.'"}, {'role': 'assistant', 'content': "Whhhh-wha?'"}]}
{'messages': [{'role': 'system', 'content': 'You are a Rick & Morty dialogue generator. Given the following Rick line, respond as Morty.'}, {'role': 'user', 'content': "Yeah, I can see that. But do you think you'll still be able to help me collect my seeds, Morty?'"}, {'role': 'assistant', 'content': "Are you kidding me?! That's it, Rick! That's the last straw! I can't 

Convert each row in the filtered DataFrame to the desired JSON structure

In [13]:
train_json_objects = [row_to_jsonl(row) for index, row in train_df.iterrows()]
test_json_objects = [row_to_jsonl(row) for index, row in test_df.iterrows()]

In [14]:
print(train_json_objects[0])
print(test_json_objects[0])

{'messages': [{'role': 'system', 'content': 'You are a Rick & Morty dialogue generator. Given the following Morty line, respond as Rick.'}, {'role': 'user', 'content': "I don't care about Jessica! Y-Yyyyyyyyyyou—'"}, {'role': 'assistant', 'content': "You know what, Morty? You're right.  Let's forget the girl altogether. She, she's probably nothing but trouble, anyways.'"}]}
{'messages': [{'role': 'system', 'content': 'You are a Rick & Morty dialogue generator. Given the following Morty line, respond as Rick.'}, {'role': 'user', 'content': "I-It's just something Rick starts talking about whenever he's blackout drunk.'"}, {'role': 'assistant', 'content': "W-What? In w--In w-w-what--In what way? Like, w-w-what's my point?'"}]}


Define the path for the output .jsonl file

In [15]:
train_output_file_path = 'converted_train_rick_n_morty.jsonl'
test_output_file_path = 'converted_test_rick_n_morty.jsonl'

Write the JSON objects to a .jsonl file

In [16]:
with open(train_output_file_path, 'w') as outfile:
    for obj in train_json_objects:
        json_line = json.dumps(obj)  # Convert the dictionary to a JSON string
        outfile.write(json_line + '\n')

In [17]:
with open(test_output_file_path, 'w') as outfile:
    for obj in test_json_objects:
        json_line = json.dumps(obj)  # Convert the dictionary to a JSON string
        outfile.write(json_line + '\n')
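Before uploading, it can help to read a file back and confirm that every line parses and has the expected message roles. The sketch below is one way to do that; `validate_jsonl` and `EXPECTED_ROLES` are hypothetical names, and the check only mirrors the structure produced by `row_to_jsonl` above rather than any official Anyscale schema. It writes a one-record temporary file so it runs on its own.

```python
import json
import tempfile
from pathlib import Path

# Roles in the order row_to_jsonl emits them (an assumption of this sketch).
EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_jsonl(path):
    """Return the number of valid records; raise ValueError on the first bad one."""
    count = 0
    with open(path, encoding="utf-8") as infile:
        for line_number, line in enumerate(infile, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)  # raises ValueError on malformed JSON
            roles = [m.get("role") for m in record.get("messages", [])]
            if roles != EXPECTED_ROLES:
                raise ValueError(f"line {line_number}: unexpected roles {roles}")
            count += 1
    return count

# One record taken from the notebook output above.
sample = {
    "messages": [
        {"role": "system", "content": "You are a Rick & Morty dialogue generator. "
                                      "Given the following Rick line, respond as Morty."},
        {"role": "user", "content": "And Jessica's gonna be Eve.'"},
        {"role": "assistant", "content": "Whhhh-wha?'"},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"
    sample_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    n_valid = validate_jsonl(sample_path)

print(n_valid)
```

In the notebook, the same function could be called on `train_output_file_path` and `test_output_file_path`.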

## Anyscale Training

In [18]:
from google.colab import userdata
# userdata.get('ANYSCALE_API_KEY')

In [19]:
# Objects
client = openai.OpenAI(
    base_url = "https://api.endpoints.anyscale.com/v1",
    api_key = userdata.get('ANYSCALE_API_KEY')
)

In [20]:
# Upload training payloads (context managers close the file handles after upload)

with open(train_output_file_path, 'rb') as train_file:
    training_file_id = client.files.create(
        file=train_file,
        purpose="fine-tune",
    ).id

with open(test_output_file_path, 'rb') as valid_file:
    valid_file_id = client.files.create(
        file=valid_file,
        purpose="fine-tune",
    ).id

# Base model to fine-tune
model = "meta-llama/Llama-2-7b-chat-hf"

# Start the fine-tuning job
finetuning_job_id = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    validation_file=valid_file_id,
    model=model,
).id

In [21]:
# Log fine-tuning payload ids

print((training_file_id, valid_file_id), finetuning_job_id)

('file_4z65f2mkf6np5qprr4dhcrz52e', 'file_664bv9ajwe5ik57rmystljjuqy') eftjob_sfi8gxhebrjle8lflfwrpuqqnp


In [27]:
# Check fine-tuning job

for stat in client.fine_tuning.jobs.retrieve(finetuning_job_id):
    print(stat)

('id', 'eftjob_sfi8gxhebrjle8lflfwrpuqqnp')
('created_at', '2024-02-11T22:33:49.570956+00:00')
('error', None)
('fine_tuned_model', 'meta-llama/Llama-2-7b-chat-hf:kenny:4I5P834')
('finished_at', None)
('hyperparameters', Hyperparameters(n_epochs=None, context_length=None))
('model', 'meta-llama/Llama-2-7b-chat-hf')
('object', None)
('organization_id', None)
('result_files', ['file_42tq8rgiefwbmte1tag6m8ptkx'])
('status', 'running')
('trained_tokens', None)
('training_file', 'file_4z65f2mkf6np5qprr4dhcrz52e')
('validation_file', 'file_664bv9ajwe5ik57rmystljjuqy')
('creator_id', 'euser_3zg18wwutlpn765eudnpmx4fkg')
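The status above is still `running`, so the job has to be checked again later. A small polling loop is one way to wait for it. The sketch below is generic: `wait_for_status` is a hypothetical helper that takes any zero-argument function returning a status string, and the terminal status names (`succeeded`, `failed`, `cancelled`) and the `pending` value in the demo are assumptions, since only `running` appears in the output above. The demo uses a fake status source so it runs without an API key.

```python
import time

def wait_for_status(get_status, terminal_statuses, poll_seconds=30.0, max_polls=120):
    """Call get_status() until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in terminal_statuses:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"no terminal status after {max_polls} polls")

# Demo with a fake status source (made-up sequence, no network calls).
fake_statuses = iter(["pending", "running", "succeeded"])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    terminal_statuses={"succeeded", "failed", "cancelled"},  # assumed names
    poll_seconds=0,
)
print(final_status)
```

With the notebook's client, the status source could be `lambda: client.fine_tuning.jobs.retrieve(finetuning_job_id).status`.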


In [32]:
result_file_id = client.fine_tuning.jobs.retrieve(finetuning_job_id).result_files[0]
display(client.files.retrieve_content(result_file_id))


'{"epoch": 0, "iteration": 1, "train_loss": 5.252288818359375, "trained_tokens": 53504, "valid_loss": null, "perplexity": null, "time_since_job_start": 134.07056713104248}\n{"epoch": 0, "iteration": 2, "train_loss": 5.391180992126465, "trained_tokens": 69078, "valid_loss": null, "perplexity": null, "time_since_job_start": 138.814386844635}\n{"epoch": 1, "iteration": 3, "train_loss": 5.247457027435303, "trained_tokens": 122582, "valid_loss": null, "perplexity": null, "time_since_job_start": 143.47260904312134}\n{"epoch": 1, "iteration": 4, "train_loss": 5.363674163818359, "trained_tokens": 138156, "valid_loss": null, "perplexity": null, "time_since_job_start": 144.88403296470642}\n{"epoch": 2, "iteration": 5, "train_loss": 5.17357063293457, "trained_tokens": 191660, "valid_loss": null, "perplexity": null, "time_since_job_start": 149.65668272972107}\n{"epoch": 2, "iteration": 6, "train_loss": 5.212970733642578, "trained_tokens": 207234, "valid_loss": null, "perplexity": null, "time_since

In [None]:
# The result file is newline-delimited JSON: one metrics record per training iteration.
# Print each record on its own line.

for item in client.files.retrieve_content(result_file_id).split('\n'):
    if item:
        print(item)
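Since each line of the result file is a JSON object, the records can be parsed into dictionaries instead of printed as raw strings. The sketch below uses the first two records from the output above as a literal string, so it runs without the API; `raw_metrics`, `records`, and `best` are illustrative names.

```python
import json

# First two records copied from the result file output shown earlier.
raw_metrics = (
    '{"epoch": 0, "iteration": 1, "train_loss": 5.252288818359375, '
    '"trained_tokens": 53504, "valid_loss": null, "perplexity": null, '
    '"time_since_job_start": 134.07056713104248}\n'
    '{"epoch": 0, "iteration": 2, "train_loss": 5.391180992126465, '
    '"trained_tokens": 69078, "valid_loss": null, "perplexity": null, '
    '"time_since_job_start": 138.814386844635}\n'
)

# Parse every non-empty line into a dict (JSON null becomes Python None).
records = [json.loads(line) for line in raw_metrics.split("\n") if line]

for record in records:
    print(record["epoch"], record["iteration"], record["train_loss"])

# Record with the lowest training loss among those parsed.
best = min(records, key=lambda r: r["train_loss"])
```

In the notebook, `raw_metrics` would be the string returned for `result_file_id`, and the resulting list could be passed to `pd.DataFrame(records)` for plotting.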

## Chatting

In [36]:
# Generate a morty line given a rick line
chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf:kenny:4I5P834",
    messages=[{'role': 'system', 'content': f'You are a Rick & Morty dialogue generator. Given the following Rick line, respond as Morty.' },
              {"role": "user", "content": "Morty, We gotta go! Grab the portal gun! Now, now, now!"}],
    temperature=0.7)

display(chat_completion.model_dump())

{'id': 'meta-llama/Llama-2-7b-chat-hf:kenny:4I5P834-4bcecf2a-f21f-4764-b603-ae5eec294b84',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'logprobs': None,
   'message': {'content': " Oh, man, Rick, what's going on? Are we running away from something?'  ",
    'role': 'assistant',
    'function_call': None,
    'tool_calls': None,
    'tool_call_id': None}}],
 'created': 1707703076,
 'model': 'meta-llama/Llama-2-7b-chat-hf:kenny:4I5P834',
 'object': 'text_completion',
 'system_fingerprint': None,
 'usage': {'completion_tokens': 21, 'prompt_tokens': 63, 'total_tokens': 84}}

In [40]:
# Generate a rick line, given a morty line
rick_chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf:kenny:4I5P834",
    messages=[{'role': 'system', 'content': f'You are a Rick & Morty dialogue generator. Given the following Morty line, respond as Rick.' },
              {"role": "assistant", "content": "Morty, We gotta go! Grab the portal gun! Now, now, now!"},
              {"role": "user", "content": chat_completion.choices[0].message.content}
              ],
    temperature=0.7)

display(rick_chat_completion.model_dump())

{'id': 'meta-llama/Llama-2-7b-chat-hf:kenny:4I5P834-78d97ab0-a57f-4d15-901f-4e686ec7899d',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'logprobs': None,
   'message': {'content': " You know, Morty, sometimes you have to make hard choices, Morty. Sometimes you have to make sacrifices for the greater good. Like, for instance, sacrificing your little sister.'  ",
    'role': 'assistant',
    'function_call': None,
    'tool_calls': None,
    'tool_call_id': None}}],
 'created': 1707703281,
 'model': 'meta-llama/Llama-2-7b-chat-hf:kenny:4I5P834',
 'object': 'text_completion',
 'system_fingerprint': None,
 'usage': {'completion_tokens': 42, 'prompt_tokens': 86, 'total_tokens': 128}}
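The two cells above can be generalized into a loop in which Rick and Morty take turns, each reply becoming the next prompt. The sketch below is one possible shape for that loop, not the notebook's implementation: `build_system_prompt`, `run_scene`, and `fake_reply` are hypothetical names, and the reply generator is passed in as a function so the example runs without an API key.

```python
def build_system_prompt(speaker, responder):
    """Same system prompt wording used for training and in the cells above."""
    return (
        "You are a Rick & Morty dialogue generator. "
        f"Given the following {speaker} line, respond as {responder}."
    )

def run_scene(opening_line, generate_reply, turns=4, first_speaker="Rick"):
    """Alternate speakers for `turns` replies and return (speaker, line) tuples."""
    other = {"Rick": "Morty", "Morty": "Rick"}
    transcript = [(first_speaker, opening_line)]
    speaker, line = first_speaker, opening_line
    for _ in range(turns):
        responder = other[speaker]
        # The previous line is the prompt; the responder produces the next line.
        line = generate_reply(build_system_prompt(speaker, responder), line)
        transcript.append((responder, line))
        speaker = responder
    return transcript

def fake_reply(system_prompt, user_line):
    """Stand-in for a model call so the sketch is runnable offline."""
    return f"(reply to: {user_line})"

transcript = run_scene("Morty, we gotta go!", fake_reply, turns=2)
for speaker, line in transcript:
    print(f"{speaker}: {line}")
```

To use the fine-tuned model, `generate_reply` could wrap `client.chat.completions.create` with the system prompt and the previous line as the `system` and `user` messages, returning `choices[0].message.content.strip()`.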