# ScriptGPT
Very much inspired by Andrej Karpathy's [minGPT](https://github.com/karpathy/minGPT).
This notebook is a demo of fine-tuning a GPT model starting from pretrained Huggingface models.
The resulting model is available on the Huggingface Hub as [ScriptGPT](https://huggingface.co/SRDdev/Script_GPT).

### Notes
In this notebook we will be training a Generative Pre-trained Transformer model from Huggingface models. We will not be training it from scratch, as I personally do not have that much computation power.

### Logging into 🤗Huggingface

Log into your Huggingface account.

If you don't have an account then you can make one for [free](https://huggingface.co/).

In [1]:
!pip install transformers


In [3]:
from huggingface_hub import notebook_login
notebook_login()


Then you need to install Git-LFS by running the following cell:

In [4]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 103 not upgraded.


Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [5]:
import transformers
print(transformers.__version__)

4.20.1
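
If you want the notebook to check the minimum version automatically, the comparison has to be numeric, because comparing version strings directly gives the wrong answer for multi-digit components. The sketch below is a minimal illustration; `parse_version` and `version_at_least` are hypothetical helper names, not part of Transformers, and the parser simply stops at the first non-numeric component such as `dev0`.

```python
def parse_version(version):
    """Turn '4.20.1' into (4, 20, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def version_at_least(installed, required):
    """Compare two dotted version strings component by component."""
    return parse_version(installed) >= parse_version(required)


print(version_at_least("4.20.1", "4.11.0"))  # True
print(version_at_least("4.9.0", "4.11.0"))   # False
# Plain string comparison would wrongly accept 4.9.0, because "9" sorts after "1":
print("4.9.0" >= "4.11.0")                   # True
```

In the notebook you would pass `transformers.__version__` as the first argument.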


### Data

Load the data from [kaggle](https://www.kaggle.com/datasets/jfcaro/5000-transcripts-of-youtube-ai-related-videos).

Then we will split the entire dataset into multiple files containing 70,000 lines each. We are doing this because the computation power available is very limited. You can try increasing the number of lines in a single file.

In [6]:
import pandas as pd
data = pd.read_csv("/kaggle/input/5000-transcripts-of-youtube-ai-related-videos/YouTube_transcripts_Kaggle.csv")

In [7]:
data = data.drop("author",axis=1)
data = data.drop("playlist_name",axis=1)
data['script'] = data['title'] + '\t' + data['transcript']
data = data.drop("title",axis=1)
data = data.drop("transcript",axis=1)
data.to_csv('path_to_train.txt', sep='\t', index=False)

In [8]:
import os

# specify the path to your input file
input_file = "/kaggle/working/path_to_train.txt"

# specify the directory to save the output files in
output_dir = "/kaggle/working/"

# create the output directory if it doesn't exist
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

# open the input file for reading
with open(input_file, "r") as f:
    # initialize a counter to keep track of the number of lines
    line_count = 0
    # initialize a file counter to keep track of the number of output files
    file_count = 0
    # initialize a file object for the first output file
    current_file = open(os.path.join(output_dir, f"output_{file_count}.txt"), "w")
    # iterate over each line in the input file
    for line in f:
        # write the line to the current output file
        current_file.write(line)
        # increment the line count
        line_count += 1
        # if we've written 70000 lines to the current output file, close it and open a new one
        if line_count == 70000:
            current_file.close()
            file_count += 1
            current_file = open(os.path.join(output_dir, f"output_{file_count}.txt"), "w")
            # reset the line count to 0
            line_count = 0
    # close the last output file
    current_file.close()
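
The loop above opens the next output file as soon as the current one is full, so it leaves an empty trailing file when the input length is an exact multiple of the chunk size. As a hedged alternative, the sketch below reads fixed-size chunks with `itertools.islice` and only creates a file when there is something to write. `split_file` and the toy file names are hypothetical and used for illustration only.

```python
import os
import tempfile
from itertools import islice


def split_file(input_path, output_dir, lines_per_file):
    """Write consecutive chunks of lines to output_<n>.txt and return the paths."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    with open(input_path, "r", encoding="utf-8") as src:
        while True:
            # islice pulls at most `lines_per_file` lines from the open file
            chunk = list(islice(src, lines_per_file))
            if not chunk:
                break
            path = os.path.join(output_dir, f"output_{len(paths)}.txt")
            with open(path, "w", encoding="utf-8") as dst:
                dst.writelines(chunk)
            paths.append(path)
    return paths


def count_lines(path):
    """Count the lines in a text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Toy run: 7 lines split into files of at most 3 lines each
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "toy.txt")
    with open(source, "w", encoding="utf-8") as f:
        f.writelines(f"line {i}\n" for i in range(7))
    chunk_paths = split_file(source, os.path.join(tmp, "chunks"), lines_per_file=3)
    chunk_sizes = [count_lines(p) for p in chunk_paths]

print(chunk_sizes)  # [3, 3, 1]
```

For this notebook the call would use the same `input_file`, `output_dir`, and a chunk size of 70000.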

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [9]:
from transformers.utils import send_example_telemetry

send_example_telemetry("language_modeling_notebook", framework="pytorch")

### Huggingface Datasets

In this section we will create a Huggingface Dataset from our split data. This is a must, as HF models require the input data in a certain format.

In [10]:
!pip install datasets


In [11]:
from datasets import load_dataset
datasets = load_dataset("text", data_files={"train": "/kaggle/working/output_0.txt","validation":"/kaggle/working/output_1.txt"})

Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-dee4377bd401e484/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8...


Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-dee4377bd401e484/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8. Subsequent calls will reuse this data.


In [12]:
datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 70000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 70000
    })
})

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [13]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=2):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
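
The retry loop in `show_random_elements` draws indices until they are all distinct. As a small aside, the standard library can do the same thing in one call with `random.sample`. The sketch below shows only the index-picking step; `pick_unique_indices` and its `seed` parameter are hypothetical names added for illustration.

```python
import random


def pick_unique_indices(dataset_length, num_examples, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)
    # sample() draws without replacement, so no duplicate check is needed
    return rng.sample(range(dataset_length), num_examples)


picks = pick_unique_indices(dataset_length=70000, num_examples=2, seed=0)
print(picks)
```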

In [14]:
show_random_elements(datasets["train"])

| | text |
|---|---|
| 0 | Sophia the humanoid, self-driving cars, and |
| 1 | hundred websites, let's suppose, but you can do it a hundred times. But then there are couple |


## Causal Language modeling

For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them into examples of a certain sequence length. This way the model will receive chunks of contiguous text, each of which may be part of a single text or may span the end of one text and the beginning of the next.

We will use [ScriptGPT-small](https://huggingface.co/SRDdev/Script_GPT), which is pre-trained on a similar script dataset, but you can also use [gpt2](https://huggingface.co/gpt2).

In [15]:
model_checkpoint = "SRDdev/Script_GPT"

#### Tokenizer
To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [16]:
from transformers import AutoTokenizer   
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts.

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

If we now look at an element of our datasets, we will see the text has been replaced by the `input_ids` the model will need:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenizer.model_max_length = 2500

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4,remove_columns=["text"])

tokenized_datasets["train"][1]

Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in your GPU RAM, so here we take a bit less at just 256.

In [18]:
# block_size = tokenizer.model_max_length
block_size = 256

Then we write the preprocessing function that will group our texts:

In [19]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead of dropping it if the
    # model supported it; you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models of the 🤗 Transformers library apply the shift internally, so we don't need to do it manually.
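
To see why copying `input_ids` into `labels` is enough, it helps to spell out the shift: the prediction made at position `i` is scored against the label at position `i + 1`. The sketch below mimics only that pairing on plain Python lists; `next_token_pairs` is a hypothetical helper for illustration and does not call the library.

```python
def next_token_pairs(input_ids, labels):
    """Pair each input position with the label one step ahead of it."""
    # The last input has no following label and the first label has no
    # preceding input, so both are left out of the pairing.
    return list(zip(input_ids[:-1], labels[1:]))


input_ids = [10, 11, 12, 13]
labels = list(input_ids)  # labels are an unshifted copy, as in group_texts

pairs = next_token_pairs(input_ids, labels)
print(pairs)  # [(10, 11), (11, 12), (12, 13)]
```

Each pair reads as "after seeing this token (and everything before it), the target is that token".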

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of `block_size` every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
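
The effect of the batch size on the dropped remainder can be checked on toy data. The sketch below re-implements the concatenate-and-chunk step on plain lists and applies it batch by batch, the way a batched `map` would. `group_tokens` and `group_in_batches` are hypothetical names, and the token values are made up.

```python
def group_tokens(token_lists, block_size):
    """Concatenate token lists, drop the tail that does not fill a block, split."""
    flat = [tok for tokens in token_lists for tok in tokens]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


def group_in_batches(token_lists, block_size, batch_size):
    """Apply group_tokens to each batch independently."""
    blocks = []
    for start in range(0, len(token_lists), batch_size):
        batch = token_lists[start:start + batch_size]
        blocks.extend(group_tokens(batch, block_size))
    return blocks


# Six toy "texts" of 3 tokens each (18 tokens in total), block size 4
texts = [[i * 3, i * 3 + 1, i * 3 + 2] for i in range(6)]

one_batch = group_in_batches(texts, block_size=4, batch_size=6)
small_batches = group_in_batches(texts, block_size=4, batch_size=2)

print(len(one_batch), len(small_batches))  # 4 3
print(one_batch[0])                        # [0, 1, 2, 3]
```

With one large batch only 2 of the 18 tokens are dropped. With batches of 2 texts, every batch drops 2 tokens, so 6 are lost overall. The first block also shows a chunk that spans two of the original texts.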

And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [21]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

" and organize your ML features in one place. It makes the features reusable,easy to serve, and avoids skew. Now let's see how to set it up. In the console, in VertexAI, we see the Feature tab. To get started, let'sclick on this documentation and explore, usingFeature Store section. Now the first thing youneed is a Feature Store. At the time of thisrecording, Feature Store is in preview so justknow that depending on when you're watching this,there might be more options and updates that you would see. You cannot create a FeatureStore in the console, so let's use this samplenotebook to learn how to create it using the SDK. This sample uses a movierecommendations data set and the taskis to train a model to predict if a user isgoing to watch a movie and serve this model online. We will learn to import ourfeatures into Feature Store, serve online prediction requestsusing the imported features, and then access importedfeatures in offline jobs, such as training jobs. To set up, we installsome

Now that the data has been cleaned, we're ready to instantiate our `Trainer`. We will need a model:

In [22]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

And some `TrainingArguments`:

In [26]:
from transformers import Trainer, TrainingArguments

# model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    "Script_GPT",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
    save_steps=2500,  # save a checkpoint every 2500 steps
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


The `push_to_hub=True` argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook.

We pass along all of those to the `Trainer` class:

In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

/kaggle/working/Script_GPT is already a clone of https://huggingface.co/SRDdev/Script_GPT. Make sure you pull the latest changes with `repo.git_pull()`.


And we can train our model:

In [28]:
trainer.train()

***** Running training *****
  Num examples = 32391
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 6075
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 3.5549 | 3.537438 |
| 2 | 3.5164 | 3.533127 |
| 3 | 3.5009 | 3.531465 |


***** Running Evaluation *****
  Num examples = 26713
  Batch size = 16
Saving model checkpoint to Script_GPT/checkpoint-2500
Configuration saved in Script_GPT/checkpoint-2500/config.json
Model weights saved in Script_GPT/checkpoint-2500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 26713
  Batch size = 16
Saving model checkpoint to Script_GPT/checkpoint-5000
Configuration saved in Script_GPT/checkpoint-5000/config.json
Model weights saved in Script_GPT/checkpoint-5000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 26713
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6075, training_loss=3.5305362001269933, metrics={'train_runtime': 6256.5871, 'train_samples_per_second': 15.531, 'train_steps_per_second': 0.971, 'total_flos': 1.2695265312768e+16, 'train_loss': 3.5305362001269933, 'epoch': 3.0})
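
The numbers in the training log can be reproduced with a little arithmetic, which is a handy sanity check when you change the batch size or `save_steps`. The sketch below uses only values printed above. The variable names are hypothetical, and it assumes the last partial batch of each epoch counts as a full step.

```python
import math

num_examples = 32391      # "Num examples" in the training log
total_batch_size = 16     # "Total train batch size" in the training log
num_epochs = 3
save_steps = 2500
train_runtime = 6256.5871  # seconds, from TrainOutput

# The last partial batch of each epoch still counts as one optimization step
steps_per_epoch = math.ceil(num_examples / total_batch_size)
total_steps = steps_per_epoch * num_epochs

# Checkpoints are written every `save_steps` steps
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch)                           # 2025
print(total_steps)                               # 6075
print(checkpoint_steps)                          # [2500, 5000]
print(round(total_steps / train_runtime, 3))     # 0.971
```

This matches the logged 6075 optimization steps, the `checkpoint-2500` and `checkpoint-5000` folders, and the reported `train_steps_per_second`.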

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [29]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 26713
  Batch size = 16


Perplexity: 34.17
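
Perplexity is just the exponential of the mean cross-entropy loss, so it can be recomputed from any logged loss value. The sketch below assumes the final `eval_loss` equals the epoch 3 validation loss shown in the training table; `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(mean_loss):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(mean_loss)


final_validation_loss = 3.531465  # epoch 3 validation loss from the training table
print(f"Perplexity: {perplexity(final_validation_loss):.2f}")  # Perplexity: 34.17
```

Because of the exponential, small changes in the loss translate into larger changes in perplexity.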


You can now upload the result of the training to the Hub.

### Push to Hub & Pipeline

Now we will push the following to the Huggingface Model Hub:
- Model
- Tokenizer
- Trainer

We will then build a pipeline using our model for Hosted Inference using the transformers pipeline function.

In [40]:
!pip install huggingface_hub


In [50]:
notebook_login()


In [47]:
tokenizer.push_to_hub("SRDdev/Script_GPT")

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-63fb7092-6db3b35906d2f93567dd2fe9)

Repository Not Found for url: https://huggingface.co/api/repos/create.
Please make sure you specified the correct `repo_id` and `repo_type`.
If the repo is private, make sure you are authenticated.
Invalid username or password. - Invalid username or password.

In [44]:
model.push_to_hub("SRDdev/Script_GPT")

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-63fb7063-57ad01de7f2a145d587771a1)

Repository Not Found for url: https://huggingface.co/api/repos/create.
Please make sure you specified the correct `repo_id` and `repo_type`.
If the repo is private, make sure you are authenticated.
Invalid username or password. - Invalid username or password.

In [34]:
import torch
from transformers import pipeline

# The Trainer moved the model to the GPU, so the pipeline must run on the same device.
# Leaving the pipeline on its CPU default raises a device-mismatch RuntimeError.
device = 0 if torch.cuda.is_available() else -1
Generate = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)
script = Generate("Importing Keras models into TensorFlow.js", max_length=1000, do_sample=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [None]:
script[0]['generated_text']