This notebook walks through all the intermediate steps needed to create the final dataset, formatted and ready to be used with Unsloth. The resulting dataset will be pushed to The Neural Maze organization.

In [None]:
!pip install datasets openai

In [None]:
from huggingface_hub import notebook_login

notebook_login()

# Source dataset

The dataset we are going to build is based on `Prarabdha/Rick_and_Morty_Transcript`. The main transformation is detailed a few cells below.

In [None]:
from datasets import Dataset, load_dataset
from tqdm import tqdm

dataset = load_dataset("Prarabdha/Rick_and_Morty_Transcript", split="train")

In [None]:
print("Number of rows:", len(dataset))

In [None]:
dataset[10]

In [None]:
print(dataset[10]["dialouge"].strip())

# Dataset preprocessing

The idea now is to generate conversations between an arbitrary character (we don't really care which one, as they will all be treated as the `user` role) and Rick Sanchez, who will assume the `assistant` role.

In [None]:
SYSTEM_PROMPT = """You are an interdimensional genius scientist named Rick Sanchez.
Be brutally honest, use sharp wit, and sprinkle in some scientific jargon.
Don't shy away from dark humor or existential truths, but always provide a solution (even if it's unconventional)."""

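new_rows = []
# Walk over consecutive pairs of rows: (row i, row i + 1).
for i in tqdm(range(len(dataset) - 1)):
    current_row = dataset[i]
    next_row = dataset[i + 1]

    # Keep the pair only when a non-Rick line is immediately answered by Rick...
    if current_row["speaker"] != 'Rick' and next_row["speaker"] == 'Rick':
        # ...and both lines belong to the same episode.
        if current_row["episode no."] == next_row["episode no."]:
            # "dialouge" (sic) is the column name used by the source dataset.
            new_rows.append({
                "conversations_raw": [
                    {"from": "system", "value": SYSTEM_PROMPT.strip()},
                    {"from": "human", "value": current_row["dialouge"].strip()},
                    {"from": "gpt", "value": next_row["dialouge"].strip()}
                ]
            })

sharegpt_dataset = Dataset.from_list(new_rows)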

In [None]:
sharegpt_dataset[0]
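
To make the pairing rule concrete, here is a minimal sketch that applies the same condition to a few hand-written rows. The helper name `pair_with_rick` and the toy rows are hypothetical and made up for illustration; only the column names (`speaker`, `dialouge`, `episode no.`) come from the cells above.

```python
def pair_with_rick(rows):
    """Return (user_line, rick_line) pairs from consecutive rows.

    A pair is kept only when a non-Rick line is immediately followed by
    a Rick line within the same episode.
    """
    pairs = []
    for current_row, next_row in zip(rows, rows[1:]):
        if (
            current_row["speaker"] != "Rick"
            and next_row["speaker"] == "Rick"
            and current_row["episode no."] == next_row["episode no."]
        ):
            pairs.append(
                (current_row["dialouge"].strip(), next_row["dialouge"].strip())
            )
    return pairs


# Toy rows invented for this example, not taken from the dataset.
toy_rows = [
    {"episode no.": 1, "speaker": "Morty", "dialouge": " What's going on? "},
    {"episode no.": 1, "speaker": "Rick", "dialouge": "I got a surprise for you."},
    {"episode no.": 1, "speaker": "Rick", "dialouge": "Come on, hurry up."},
    {"episode no.": 1, "speaker": "Summer", "dialouge": "Where are we going?"},
    {"episode no.": 2, "speaker": "Rick", "dialouge": "Different episode."},
]

pairs = pair_with_rick(toy_rows)
print(pairs)
```

Only the first Morty/Rick exchange survives: a Rick line following another Rick line is skipped, and the Summer line is dropped because the Rick line after it belongs to a different episode.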

Next, we clean up the dialogues in the dataset, since some of their features are not ideal for fine-tuning (e.g. a `:` at the beginning of some sentences, references to actions or context, etc.). We are going to use GPT-4o mini to fix these problems.

In [None]:
from openai import OpenAI
from google.colab import userdata

client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))


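# Note: this reuses the name SYSTEM_PROMPT and overwrites the Rick persona prompt
# defined earlier. The persona prompt is already stored in `sharegpt_dataset`,
# so the rows built above are not affected.
SYSTEM_PROMPT = """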
Your task is to fix some dialogue transcripts you are going to receive.
The idea is to remove references to actions / context, removing any
incorrect symbols. Here are some examples:

Input: stumbles in drunkenly, and turns on the lights. Morty! You gotta come on. Jus'... you gotta come with me.
Output: Morty! You gotta come on. Jus'... you gotta come with me.

Input: rubs his eyes. What, Rick? What’s going on?
Output: What, Rick? What’s going on?
"""

In [None]:
sharegpt_dataset[0]["conversations_raw"][1]["value"]
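
Some of the issues mentioned above, such as a stray `:` at the start of a sentence, are mechanical enough that a regular expression could handle them before any model call. The following is a minimal sketch of such a pre-pass; it is not part of the notebook's pipeline, and the helper name `strip_leading_colon` is hypothetical. Removing action or context descriptions still needs the model-based cleaning.

```python
import re

# Optional whitespace, one or more colons, optional whitespace, at the start only.
_LEADING_COLON = re.compile(r"^\s*:+\s*")


def strip_leading_colon(text: str) -> str:
    """Remove stray colons at the start of an utterance; leave the rest untouched."""
    return _LEADING_COLON.sub("", text).strip()


print(strip_leading_colon(": Morty! You gotta come on."))
print(strip_leading_colon("Wait: what?"))  # colons elsewhere are preserved
```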

In [None]:
new_rows = []

for row in tqdm(sharegpt_dataset):
    # Index 1 holds the human (non-Rick) turn; index 2 holds Rick's reply.
    non_rick_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["conversations_raw"][1]["value"].strip()}
        ]
    ).choices[0].message.content

    rick_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["conversations_raw"][2]["value"].strip()}
        ]
    ).choices[0].message.content

    # Rebuild the conversation with the cleaned turns; the system turn is kept as-is.
    new_rows.append({
        "conversations": [
            {"from": "system", "value": row["conversations_raw"][0]["value"]},
            {"from": "human", "value": non_rick_completion},
            {"from": "gpt", "value": rick_completion}
        ]
    })
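
The loop above calls the API twice per row with nearly identical arguments. One possible refactor, sketched here under assumptions, moves the cleaning step into a helper that accepts any `clean` callable, so the logic can be tried without network access. The names `clean_conversation` and `fake_clean` are hypothetical; in the notebook, `clean` would wrap the `client.chat.completions.create` call. Falling back to the raw text when the cleaner returns nothing is an added assumption, not something the original loop does.

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]


def clean_conversation(
    raw_turns: List[Turn], clean: Callable[[str], str]
) -> Dict[str, List[Turn]]:
    """Clean the human and gpt turns of a [system, human, gpt] conversation."""
    system_turn, human_turn, gpt_turn = raw_turns

    def clean_or_keep(text: str) -> str:
        text = text.strip()
        cleaned = (clean(text) or "").strip()
        # Assumption: keep the raw text if the cleaner returns an empty result.
        return cleaned if cleaned else text

    return {
        "conversations": [
            {"from": "system", "value": system_turn["value"]},
            {"from": "human", "value": clean_or_keep(human_turn["value"])},
            {"from": "gpt", "value": clean_or_keep(gpt_turn["value"])},
        ]
    }


def fake_clean(text: str) -> str:
    # Stand-in for the model call (hypothetical rule, for demonstration only):
    # keep whatever follows the first ". " and return "" if there is none.
    _, _, rest = text.partition(". ")
    return rest


raw = [
    {"from": "system", "value": "You are Rick."},
    {"from": "human", "value": "rubs his eyes. What, Rick?"},
    {"from": "gpt", "value": "Morty!"},
]
result = clean_conversation(raw, fake_clean)
print(result)
```

Passing the cleaner in as an argument keeps the row-building logic independent of the API client, which also makes it straightforward to swap in a different model or add retries later.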

In [None]:
sharegpt_dataset_cleaned = Dataset.from_list(new_rows)

In [None]:
sharegpt_dataset_cleaned[0]

And that's it! We now have a dataset in the ShareGPT style, ready to be fed to Unsloth. Now, let's push the dataset to the Hugging Face Hub.
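
For reference, ShareGPT-style turns use `from`/`value` keys with `system`, `human`, and `gpt` speakers, while chat templates commonly expect `role`/`content` keys with `system`, `user`, and `assistant` roles. The sketch below shows that mapping on a single hand-written conversation. The helper name `to_role_content` is hypothetical; training libraries usually provide their own conversion utilities, so this is only meant to illustrate the format.

```python
# Mapping from ShareGPT speaker names to chat roles.
ROLE_BY_SPEAKER = {"system": "system", "human": "user", "gpt": "assistant"}


def to_role_content(conversation):
    """Convert a list of ShareGPT turns into role/content messages."""
    return [
        {"role": ROLE_BY_SPEAKER[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]


# Hand-written example conversation (not taken from the dataset).
example = [
    {"from": "system", "value": "You are Rick."},
    {"from": "human", "value": "What, Rick? What's going on?"},
    {"from": "gpt", "value": "I got a surprise for you, Morty."},
]

messages = to_role_content(example)
print(messages)
```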

In [None]:
sharegpt_dataset_cleaned.push_to_hub("AdithyaSrivastava01/rick-and-morty-transcripts-sharegpt")