<a href="https://colab.research.google.com/github/Ameenota/HF-LLM/blob/main/Snoopshpear.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Shakespear & Snoop colab no-one asked for

Today we will build a Causal Language Model (CLM), trained on the works of Snoop Dogg & William Shakespear. In theory you can just prompt an LLM to do this, but our expectation is this will hallucinate less and is obviously a *lot* smaller in size. Training a CLM on specific data builds domain expertise on a model. Practical uses of such models can be training it on a specific type of text (e.g. Medical Research) and then use that model to either generate compliant text or use transfer learning to even detect fraudulent papers. For now, we will focus on *next* token prompting.

Lets install some initial libraries to start building this.

In [38]:
!pip install datasets evaluate transformers[sentencepiece] -q

To skip the tedious part of collecting our input data, I built a dataset on HuggingFace on the [works of Shakespear and Snoop](https://huggingface.co/datasets/sagsan/snoopshpear). This was simply pulled from online resources, cleaned up a bit using pandas and uploaded to HF. The only trick here is that Shakespear has a *lot* more published text (10x in my data). So, I up-sampled Snoops works. Up-sampling simply mean duplicating it until we get to similar sizes for both our labels and it solves the problem of an im-balanced dataset. Feel free to head over [to HF to play](https://huggingface.co/datasets/sagsan/snoopshpear) around with the data. Lets load up our data.

In [39]:
from datasets import load_dataset, DatasetDict

raw_datasets = load_dataset("sagsan/snoopshpear")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'source'],
        num_rows: 125107
    })
    valid: Dataset({
        features: ['text', 'source'],
        num_rows: 13901
    })
})

We can see we have around 221K rows in our train set and 24K in our validation set. We also have 2 columns "text" & "source". Lets peek at our data

In [40]:
start_id = 69
end_id = 73
text, source = raw_datasets["train"][start_id:end_id]['text'], raw_datasets["train"][start_id:end_id]['source']
for i in range(end_id - start_id):
  print(f"Source: {source[i]}\nText: {text[i]}\n")

Source: shakespear
Text: Think you 'twere prejudicial to his crown? No; for he could not so resign his crown

Source: shakespear
Text: Thy valour and thy heart- thou art a traitor;

Source: snoop
Text: Get a ketchup, get 'em messed up, turn the heat up let 'em fry

Source: shakespear
Text: Their kind acceptance weepingly beseeched,



In [41]:
# during development you want to work with smaller datasets
raw_datasets["train"] = raw_datasets["train"].select(range(30000)) ## delete this.
raw_datasets


DatasetDict({
    train: Dataset({
        features: ['text', 'source'],
        num_rows: 30000
    })
    valid: Dataset({
        features: ['text', 'source'],
        num_rows: 13901
    })
})

Cool, each row of our data is either a line by Snoop or Shakespear. To train a CLM we don't really need the source. We will just use all the text fields as inputs to our model.

In [42]:
NEWLINE_TOKEN = " <|im_end|> "

def append_newline_token(example):
    example["text"] = example["text"].strip() + NEWLINE_TOKEN
    return example

formatted_datasets = raw_datasets.map(
    append_newline_token,
    num_proc=4, # Use multiple processes for faster execution
)

# Example verification of the first row (the output will show the new token at the end)
print(f"First training example after append:\n'{formatted_datasets['train'][0]['text']}'")
raw_datasets = formatted_datasets

Map (num_proc=4):   0%|          | 0/30000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/13901 [00:00<?, ? examples/s]

First training example after append:
'Cause ain't nuttin but sweat inside my hand <|im_end|> '


Kind of a weird thing I discovered after building this is that because our input data does not have newlines, our model was outputting text without any space between line. To fix this we can add a custom token at the end of every input of our dataset. A bit later we will tell the tokenizer & model about this custom token. This fixes of no spacesbetweensentences.

Ok as with most practical AI projects we will re-use an existing model instead of building one from scratch. DistilGPT2 is a smaller and faster version of GPT2 which makes it ideal for our use case.

Our first step is to tokenize the input text. Tokenization as you may have heard in LLMs is what breaks sentences down to smaller pieces and converts them to numbers that our model can understand. There are many ways to tokenize, however we use the AutoTokenize class and pass in DistilGPT2 as the model. This ensures that we tokenize in exactly the same way as how the model was trained.


In [43]:
from transformers import AutoTokenizer, AutoModelForCausalLM

#MODEL_CHECKPOINT = "distilgpt2"
MODEL_CHECKPOINT = "Qwen/Qwen2-0.5B-Instruct"
#MODEL_CHECKPOINT = "Qwen/Qwen2-1.5B-Instruct"
context_length = 512
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_CHECKPOINT)

# # Add the token to the tokenizer's vocabulary
# tokenizer.add_special_tokens({'additional_special_tokens': [NEWLINE_TOKEN]})

# # Resize the model to create a new embedding vector for the new token
# model.resize_token_embeddings(len(tokenizer))

test_outputs = tokenizer( #Tokenize sample text
    raw_datasets["train"][:3]["text"], #run a quick check with just 3 inputs
    return_length = True
)

print(f"Input IDs length: {len(test_outputs['input_ids'])}")
print(f"Input chunk lengths: {(test_outputs['length'])}")


Input IDs length: 3
Input chunk lengths: [13, 11, 15]


In [44]:
tokenizer.eos_token

'<|im_end|>'


Our sentences can be of different lengths, for computational efficiency our model performs much better if our inputs are the same length.
1. First we setup our context_length to 1024. This specifies the number of tokens our model processes at once and matches GPT2s maximum context window.

2. Then we specify a pad_token. This token is simply added to short sentences to bring them up to the same size. The model will ignore the pad_token. GPT2 models do not have a pad token so we re-use the end-of-string token for this. The EOS token on GPT2 is "<|endoftext|>"


In [45]:
test_outputs

{'input_ids': [[60912, 36102, 944, 9979, 55971, 714, 27466, 4766, 847, 1424, 220, 151645, 220], [64117, 661, 13, 24233, 11, 847, 36931, 13, 220, 151645, 220], [18284, 2363, 13, 1674, 473, 11, 27048, 11, 566, 374, 12796, 0, 220, 151645, 220]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'length': [13, 11, 15]}

Lets talk about the outputs of the tokenizer on our 3 input strings above
1. input_ids: This is our main tokenized data and is the numeric representation of our text.
2. attention_mask: This mask specifies which tokens to ignore in our input_ids (e.g. padding tokens will be set to 0 in the attention_mask)
3. length is just the length of each string.

In [46]:
def tokenize(examples):
  return tokenizer(examples["text"])

tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets

Map:   0%|          | 0/30000 [00:00<?, ? examples/s]

Map:   0%|          | 0/13901 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 30000
    })
    valid: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 13901
    })
})

Ok finally the above snippet runs tokenization on the entire dataset. This is exactly the same as what we did above. We do enable batching to speed up the process and we also remove all the input columns as we don't need them any longer.

Lets consider what our input_ids look like
```
'input_ids':  [[5188, 33, 11262, 16868, 13, 4162, 314, 534, 27517, 30],
[261, 11, 8011, 319, 13],
[33, 16696, 2751, 33363, 13, 3412, 326, 11, 314, 2911, 11, 543, 21289, 2788, 1793, 2029, 11]
...]
```
Each of the lists above represents a sentence. As mentioned above we want to send 1024 tokens at a time to our model. Transformer models like GPT2 use positional embedding. As our model is limited to 1024, sending more tokens will cause it to crash. Using a smaller size is very inefficient as the sequence must be extended to 1024 to make use of GPU parallel processing.


In [47]:
def group_texts(examples):
    # Concatenate all texts from the batch into a single stream
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    #Drop the last, incomplete batch of tokens (if it exists)
    total_length = (total_length // context_length) * context_length

    # Split the stream into fixed-size chunks of size 1024
    result = {
        k: [t[i : i + context_length] for i in range(0, total_length, context_length)]
        for k, t in concatenated_examples.items()
    }

    #Create the labels (for CLM, the input IDs are the targets, shifted internally)
    result["labels"] = result["input_ids"].copy()
    return result


# Apply the grouping function to the tokenized datasets
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000, # Grouping benefits from a large batch size
    num_proc=4
)



Map (num_proc=4):   0%|          | 0/30000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/13901 [00:00<?, ? examples/s]


To re-iterate the cell below modies the above inputs to look like:
```
'input_ids':  [
[list of 1024 tokens],
[list of 1024 tokens],  
[list of 1024 tokens],
...]
```

Another thing we did is add a label (aka target) to our dataset. The cool thing about CLMs is that you don't really need to do anything special to create a target. Remember our task is to predict the next token; so our target or label is simply the input shifted by one. In our HF models, all we do is copy the inputs, the actual shift happens internally.

In [48]:
len(lm_datasets['train']['input_ids'][1])

512

## Training
Ok our data prep is done, we are ready to start training the model.

In [49]:
from transformers import TrainingArguments
import torch

# Define your output directory for logs, and the final model
OUTPUT_DIR = "snoop-shpear-clm-v13"

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,

    # Core Training Parameters
    per_device_train_batch_size=8,        # how many sequences of 1024 tokens to take at once (more is better if your GPU can handle it)
    per_device_eval_batch_size=8,         # Same as train batch size
    num_train_epochs=3,                   # How many times do you want to process the dataset, lower values lead to underfitting and higher to over fitting.
    learning_rate=5e-5,                   # We are using a pretrained model so our steps should be tiny
    weight_decay=0.01,                    # 0.01 is what I found seems to be recommended for finetuning
    gradient_accumulation_steps=8,

    # Evaluation and Logging
    eval_strategy="epoch",                # Evaluation is done at the end of each epoch
    logging_steps=50,                     # how often do you want to get updated
    save_strategy="epoch",                # Save a checkpoint at the end of each epoch
    load_best_model_at_end=True,          # Load the model with the best validation loss

    report_to="none",           # by default HF seems to log to wand.ai, I don't want that.

    # Resource Management
    fp16=True if torch.cuda.is_available() else False,

)

TrainingArguments sets up parameter for our model. There are a lot of arguments here but pretty much everything I've used is based on standard recommendations from HF.

In [50]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False # Set to False for Causal Language Modeling (CLM)
)

A Data collators in HF is usually the last step before you can train a model. It handles padding any creating any final labels for processing. It will even covert our inputs to pytorch tensors. For CLM, HF recommends using [DataCollatorForLanguageModeling](https://huggingface.co/docs/transformers/v4.57.1/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling) with mlm=False. The mlm field is True by default and is used for masking tasks.

In [51]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

  trainer = Trainer(


Alrite, lets setup our Trainer. The Trainer abstracts away a lot of the details like using a GPU if available, setting up Optimizers and automatically updating learning rates. It also managers the Pytorch loop around calculating loss, and updating parameters in each step. The TL;DR is that it greatly simplifies training a model vs doing it by hand in Pytorch.

In [52]:
train_result = trainer.train()

print("\n*** Training Complete! ***")

print(train_result.metrics)

# Save the final model and tokenizer to your output directory
trainer.save_model()
tokenizer.save_pretrained(trainer.args.output_dir)
print(f"Model saved to: {trainer.args.output_dir}")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151645}.


Epoch,Training Loss,Validation Loss
1,No log,4.000817
2,No log,3.867839
3,No log,3.869758


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].



*** Training Complete! ***
{'train_runtime': 91.3788, 'train_samples_per_second': 26.691, 'train_steps_per_second': 0.427, 'total_flos': 2681590257156096.0, 'train_loss': 3.8443462665264425, 'epoch': 3.0}
Model saved to: snoop-shpear-clm-v13


## Play with the trained model

In [53]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# Load our saved
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
model = AutoModelForCausalLM.from_pretrained(OUTPUT_DIR)

# Create the text generation pipeline
generator = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer
)

Device set to use cuda:0


In [54]:

def generate_text(prompt, max_len=50):
    result = generator(
        prompt,
        max_length=3000,
        num_return_sequences=2,
        do_sample=True,         # Use sampling for creative text
        top_k=50,               # Filter top 50 likely tokens
        temperature=0.9,        # Good balance of coherence and creativity
        pad_token_id=tokenizer.eos_token_id, # Prevents warnings
        repetition_penalty=1.5
    )
    # The output is a list of dicts, extract the generated text
    print(result)
    return result[0]['generated_text']


# Example 1: The Royal G-Funk Decree
prompt1 = "Fo shizzle, my brethren, I decree that "
print(f"Prompt: {prompt1}")
print(f"Output: {generate_text(prompt1)}\n")

# Example 2: The Philosophical Ride
prompt2 = "To be or not to be, that is the question, "
print(f"Prompt: {prompt2}")
print(f"Output: {generate_text(prompt2)}")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=3000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Prompt: Fo shizzle, my brethren, I decree that 


Both `max_new_tokens` (=256) and `max_length`(=3000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "Fo shizzle, my brethren, I decree that 13-2 be no more. There are not many to say in this house but a thousand; and the best of them all will tell it with one mouth: they do me wrong for their folly as much so as if by some accident made him righteous! Come on thee again. Speak now unto us thy husband's father-in-law at home.' 'I am madly beseeched,' he says, or his face is like an eagle peck'd out, which hath been slain therein. By heaven forgive you your pardon-fare-well? Well pray keep ye faith hereon too well; lest when thou art come hither weeping away our lives o'ertop'ts ere any blood could bleed off from these walls nor dust fall down upon those heads-beside, poor King John cannot live long hence, yet let England remain her king till then; she'll give place thus-namely to France once before-tongue-lidness-a** n***a, wip wit em bitches sippin cokes Snoop Dogg.. what does thang?? What else wouldst thou have done hadest thou known whom? Thou hast nothing left 

## Publish the model

In [55]:
# repo_id = "sagsan/snoopshpear"

# tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
# model = AutoModelForCausalLM.from_pretrained(OUTPUT_DIR)

# tokenizer.push_to_hub(repo_id)


# model.push_to_hub(
#     repo_id,
#     commit_message="Final SnoopShepear CLM fine-tuned model"
#)

