# ScriptGPT
Very much inspired by Andrej Karpathy's [minGPT](https://github.com/karpathy/minGPT).
This notebook is a demo of fine-tuning a GPT model starting from pretrained Huggingface models.
The resulting model is available on the Huggingface Hub as [ScriptGPT](https://huggingface.co/SRDdev/Script_GPT).

### Notes
In this notebook we will train a Generative Pre-trained Transformer (GPT) model starting from a pretrained Huggingface model. We will not train it from scratch, as I personally do not have that much computation power.

### Logging into 🤗Huggingface

Log into your Huggingface account.

If you don't have an account then you can make one for [free](https://huggingface.co/).

In [None]:
!pip install transformers

In [None]:
from huggingface_hub import notebook_login
notebook_login()

Then you need to install Git-LFS by running the following cell:

In [None]:
!apt install git-lfs

Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [None]:
import transformers
print(transformers.__version__)
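
If you would rather check the version programmatically than read it off the printed output, the comparison can be sketched in plain Python. The helper names `version_tuple` and `is_at_least` are hypothetical and only handle simple dotted versions; comparing the raw strings would be wrong because `"4.9.2" > "4.11.0"` as text.

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a version string as ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first non-numeric component such as "dev0".
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, minimum="4.11.0"):
    """Compare versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


print(is_at_least("4.26.1"))  # True
print(is_at_least("4.9.2"))   # False
```

In the notebook you would pass `transformers.__version__` as the `installed` argument.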

### Data

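The data can be loaded from [kaggle](https://www.kaggle.com/datasets/jfcaro/5000-transcripts-of-youtube-ai-related-videos). The cells below instead load the `jamescalam/youtube-transcriptions` dataset from the Huggingface Hub, merge the rows of each video into a single line of text, and write everything to one text file.

Then we split that file into multiple files containing 500 lines each. We do this because the available computation power is very limited. You can try increasing the number of lines per file.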

In [None]:
from datasets import load_dataset
dataset = load_dataset("jamescalam/youtube-transcriptions")

In [None]:
start_time = list(dataset['train']['start'])
end_time = list(dataset['train']['end'])

In [None]:
from datasets import Dataset, DatasetDict

def merge_videos(dataset):
    """Merge consecutive rows that share a title into one row per video."""
    merged_list = []
    prev_title = ''
    prev_text = ''

    for row in dataset['train']:
        title = row['title']
        text = row['text']
        if title != prev_title:
            # Start of a new video
            if prev_title:
                # Add the merged text for the previous video to the list
                merged_list.append({'title': prev_title, 'text': prev_text})
            prev_title = title
            prev_text = text
        else:
            # Same title as previous row, append the text
            prev_text += ' ' + text

    # Add the merged text for the last video to the list
    if prev_title:
        merged_list.append({'title': prev_title, 'text': prev_text})

    # Create a new dataset with the merged text for each title.
    # Dataset.from_list turns the list of dicts into a proper Dataset.
    merged_dataset = DatasetDict({'train': Dataset.from_list(merged_list)})

    return merged_dataset

merged_dataset = merge_videos(dataset)
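
The merging step groups consecutive rows that share a title. As a minimal sketch of the same idea on toy data (the `rows` list below is made up for illustration, not taken from the dataset), `itertools.groupby` does this grouping directly:

```python
from itertools import groupby

# Hypothetical rows, shaped like the 'title' and 'text' columns used above.
rows = [
    {"title": "A", "text": "hello"},
    {"title": "A", "text": "world"},
    {"title": "B", "text": "bye"},
]

# groupby only groups *consecutive* rows with the same key, which matches
# the loop above: rows of one video are assumed to be adjacent.
merged = [
    {"title": title, "text": " ".join(row["text"] for row in group)}
    for title, group in groupby(rows, key=lambda row: row["title"])
]

print(merged)
# [{'title': 'A', 'text': 'hello world'}, {'title': 'B', 'text': 'bye'}]
```

If rows of the same video were not adjacent, both this sketch and the loop above would produce several entries for that title.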

In [None]:
def write_text_file(merged_dataset, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for row in merged_dataset['train']:
            title = row['title']
            text = row['text']
            f.write(f'{title} : {text}\n')
            
write_text_file(merged_dataset, 'merged_text.txt')

In [None]:
with open('/kaggle/working/merged_text.txt', 'r', encoding='utf-8') as f:
    num_lines = sum(1 for _ in f)

In [None]:
num_lines

In [None]:
# data = data.drop("author",axis=1)
# data = data.drop("playlist_name",axis=1)
# data['script'] = data['title'] + '\t' + data['transcript']
# data = data.drop("title",axis=1)
# data = data.drop("transcript",axis=1)
# data.to_csv('path_to_train.txt', sep='\t', index=False)

In [None]:
import os

# specify the path to your input file
input_file = "/kaggle/working/merged_text.txt"

# specify the directory to save the output files in
output_dir = "/kaggle/working/"

# create the output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

# open the input file for reading
with open(input_file, "r", encoding="utf-8") as f:
    # initialize a counter to keep track of the number of lines
    line_count = 0
    # initialize a file counter to keep track of the number of output files
    file_count = 0
    # initialize a file object for the first output file
    current_file = open(os.path.join(output_dir, f"output_{file_count}.txt"), "w", encoding="utf-8")
    # iterate over each line in the input file
    for line in f:
        # write the line to the current output file
        current_file.write(line)
        # increment the line count
        line_count += 1
        # if we've written 500 lines to the current output file, close it and open a new one
        if line_count == 500:
            current_file.close()
            file_count += 1
            current_file = open(os.path.join(output_dir, f"output_{file_count}.txt"), "w", encoding="utf-8")
            # reset the line count to 0
            line_count = 0
    # close the last output file
    current_file.close()
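
One side effect of the loop above is that it opens a new output file as soon as the previous one is full, so an empty trailing file appears when the number of lines is an exact multiple of the chunk size. A possible variant, sketched here with a hypothetical `split_lines` helper and a temporary directory instead of `/kaggle/working/`, reads one chunk at a time and only creates a file when there are lines to write:

```python
import os
import tempfile
from itertools import islice


def split_lines(input_path, output_dir, lines_per_file):
    """Write consecutive chunks of `lines_per_file` lines to numbered files."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    with open(input_path, "r", encoding="utf-8") as src:
        while True:
            # islice pulls at most `lines_per_file` lines from the file.
            chunk = list(islice(src, lines_per_file))
            if not chunk:
                break
            path = os.path.join(output_dir, f"output_{len(paths)}.txt")
            with open(path, "w", encoding="utf-8") as dst:
                dst.writelines(chunk)
            paths.append(path)
    return paths


# Small demonstration on a throwaway file with 5 lines.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "merged_text.txt")
    with open(source, "w", encoding="utf-8") as f:
        f.writelines(f"line {i}\n" for i in range(5))
    parts = split_lines(source, os.path.join(tmp, "parts"), lines_per_file=2)
    part_names = [os.path.basename(p) for p in parts]

print(part_names)  # ['output_0.txt', 'output_1.txt', 'output_2.txt']
```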

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [None]:
from transformers.utils import send_example_telemetry

send_example_telemetry("language_modeling_notebook", framework="pytorch")

### Huggingface Datasets

In this section we will create a Huggingface Dataset from our split data. This is a must, as HF models require the input data in a certain format.

In [None]:
# !pip install datasets

In [None]:
from datasets import load_dataset
datasets = load_dataset("text", data_files={"train": "/kaggle/working/output_0.txt","validation":"/kaggle/working/output_1.txt"})

In [None]:
datasets

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=2):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets["train"])
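
The index-picking loop in `show_random_elements` retries until it has enough distinct indices. As a side note, the standard library can draw distinct indices in one call. The helper name `pick_unique_indices` below is hypothetical and only illustrates that part of the function:

```python
import random


def pick_unique_indices(dataset_length, num_examples, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)
    # random.sample never returns the same position twice.
    return rng.sample(range(dataset_length), num_examples)


picks = pick_unique_indices(100, 5, seed=0)
print(picks)
```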

## Causal Language modeling

For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them into examples of a certain sequence length. This way the model will receive chunks of contiguous text, which may be part of a single text or may span the end of one text and the beginning of the next.
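
To make the concatenate-then-split idea concrete, here is a toy sketch on plain lists of integers standing in for token ids. The function name `group_token_lists` and the tiny block size are assumptions for illustration only; the notebook's own preprocessing function appears further down.

```python
def group_token_lists(token_lists, block_size):
    """Concatenate token lists and cut the result into equal-sized blocks."""
    concatenated = [tok for toks in token_lists for tok in toks]
    # Drop the tail that does not fill a whole block.
    usable = (len(concatenated) // block_size) * block_size
    return [concatenated[i : i + block_size] for i in range(0, usable, block_size)]


# Three "texts" of different lengths, 13 tokens in total.
texts = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12, 13]]
chunks = group_token_lists(texts, block_size=4)

print(chunks)
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```

The second chunk starts with the end of the first text and continues into the second one, and the last token (13) is dropped because it does not fill a block.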

You can start from [ScriptGPT-small](https://huggingface.co/SRDdev/Script_GPT), which is pre-trained on a similar scripting dataset, or from [gpt2](https://huggingface.co/gpt2). The cell below uses `gpt2`.

In [None]:
model_checkpoint = "gpt2"

#### Tokenizer
To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [None]:
from transformers import AutoTokenizer   
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts.

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

If we now look at an element of our datasets, we will see the text has been replaced by the `input_ids` the model will need:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenizer.model_max_length = 2500

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4,remove_columns=["text"])

tokenized_datasets["train"][1]

Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in your GPU RAM, so here we take a bit less at just 256.

In [None]:
# block_size = tokenizer.model_max_length
block_size = 256

Then we write the preprocessing function that will group our texts:

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models of the 🤗 Transformers library apply the shifting to the right themselves, so we don't need to do it manually.
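
As a conceptual sketch of what that shift means (this is not the library's code, and `next_token_pairs` is a hypothetical helper), the prediction made at position `i` is compared with the label at position `i + 1`, so identical `input_ids` and `labels` still give next-token targets:

```python
def next_token_pairs(input_ids):
    """Pair each input token with the token the model should predict next."""
    labels = list(input_ids)  # labels start out as a plain copy of the inputs
    # Drop the last input (nothing follows it) and the first label
    # (nothing precedes it), then line the two up.
    return list(zip(input_ids[:-1], labels[1:]))


pairs = next_token_pairs([10, 11, 12, 13])
print(pairs)
# [(10, 11), (11, 12), (12, 13)]
```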

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function, so the remainder would be dropped once every 1,000 examples to make the concatenated tokenized texts a multiple of `block_size`. Here we pass `batch_size=5000`, so the remainder is dropped once every 5,000 examples instead. A higher batch size drops fewer tokens but is processed more slowly. You can also speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=5000,
    num_proc=4,
)
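
To see why the batch size affects how much text is thrown away, here is a small arithmetic sketch. The token counts are invented, and `dropped_tokens` is a hypothetical helper that only mimics the remainder-dropping rule of `group_texts`, once per batch:

```python
def dropped_tokens(text_lengths, batch_size, block_size):
    """Count tokens lost when each batch is truncated to a multiple of block_size."""
    dropped = 0
    for start in range(0, len(text_lengths), batch_size):
        batch_total = sum(text_lengths[start : start + batch_size])
        dropped += batch_total % block_size
    return dropped


# Ten made-up texts of 300 tokens each.
lengths = [300] * 10

small_batches = dropped_tokens(lengths, batch_size=2, block_size=256)
one_batch = dropped_tokens(lengths, batch_size=10, block_size=256)

print(small_batches)  # 440: five batches of 600 tokens, each losing 88
print(one_batch)      # 184: one batch of 3000 tokens
```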

And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

Now that the data has been prepared, we're ready to instantiate our `Trainer`. First we load a model:

In [None]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

And some `TrainingArguments`:

In [None]:
from transformers import Trainer, TrainingArguments

# model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    "GPT2script",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    save_steps=5000,  # Save a checkpoint every 5000 steps
)

"""
Alternative configuration (disabled) with a cosine schedule and warmup:

training_args = TrainingArguments(
    "GPT2script",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    save_steps=5000,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    max_steps=10000,
    per_device_train_batch_size=4,
)
"""

The `push_to_hub` argument controls whether the model is pushed to the [Hub](https://huggingface.co/models) regularly during training. It is set to `False` here; set it to `True` only if you followed the login and Git-LFS steps at the top of the notebook.

We pass along all of those to the `Trainer` class:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

And we can train our model:

In [None]:
trainer.train()

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
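
Perplexity is the exponential of the average per-token cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. A minimal sketch with made-up token probabilities (the `perplexity` helper is hypothetical, not part of the `Trainer` API) shows the relationship:

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    mean_loss = sum(nll) / len(nll)
    return math.exp(mean_loss)


# If the model gives every correct token probability 0.25, it is as
# uncertain as a uniform choice among 4 tokens: perplexity is about 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(uniform_ppl, 2))
```

Lower perplexity means the model assigns higher probability to the validation text.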

In [None]:
!pip install huggingface_hub

You can now upload the result of the training to the Hub by following the steps in the next section.

### Push to Hub & Pipeline

Now we will push the final model to Huggingface Model Hub.
- Model 
- Tokenizer
- Trainer

We will then build a pipeline using our model for Hosted Inference using the transformers pipeline function.

In [None]:
!pip install huggingface_hub

In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()

In [None]:
tokenizer.push_to_hub("SRDdev/Script_GPT")

In [None]:
model.push_to_hub("SRDdev/Script_GPT")

In [None]:
from transformers import pipeline

Generate = pipeline("text-generation",model=model,tokenizer=tokenizer)
script = Generate("Importing Keras models into TensorFlow.js", max_length=1000, do_sample=True)

In [None]:
script[0]['generated_text']

### Zip and download weights

In [None]:
import zipfile
import os
from IPython.display import FileLink

# Define the folder path and zip file name
folder_path = '/kaggle/working/Script_GPT'
zip_file_name = 'ScriptGPT.zip'

# Zip the folder
with zipfile.ZipFile(zip_file_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), os.path.join(folder_path, '..')))

# Display a download link for the zip file in the notebook output
FileLink(zip_file_name)