# Fine-tuning the pretrained language model `distilgpt2` to write LinkedIn posts. For this project, training is limited to a few keywords and patterns.

## Importing the dataset and performing EDA.

## This project uses a custom dataset that combines two datasets from Hugging Face.

In [1]:
import pandas as pd


In [2]:
df= pd.read_csv("/content/final.csv")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,input,output
0,0,Generate a LinkedIn post announcing the releas...,We just released a new climate emulator to exp...
1,1,Compose a LinkedIn post announcing the launch ...,Today we launched the Llama 3 Tool Use 8B and ...
2,2,Generate a LinkedIn post announcing that Daphn...,10/19 Speaker of #Pezcoller23 Symposium - New ...
3,3,Please generate a LinkedIn post with the follo...,You won't want to miss this one.\n\nOpenAI pul...
4,4,Generate a LinkedIn post expressing your enjoy...,I've really enjoyed using crewAI tools to buil...


In [4]:
df.shape

(3007, 3)

In [5]:
df= df[['input','output']]

In [6]:
df['output'][100]

"It’s go time! Today marks the start of our second annual in vivo Week at insitro! This week of learning, sharing and innovation is designed by and for insitrocytes, with a jam-packed schedule focused on exchanging ideas and building the future together. We kicked off #inVivoWeek2023 with an inspiring fireside chat with Dr. Sue Desmond-Hellmann, a trailblazing oncologist recognized for her leadership throughout the healthcare ecosystem, including at Genentech, UCSF and the Bill & Melinda Gates Foundation. The thought-provoking discussion between Sue and Daphne Koller, our Founder and CEO, explored opportunities in healthcare innovation, the future of drug development and the importance of following professional passions. We're grateful for Sue's valuable insights! \n\nThe energy is palpable as we forge ahead during this pivotal week, collaborating to advance our mission of bringing better drugs faster to the patients who can benefit most, through machine learning and data at scale. #te

In [7]:
df['output'].isna().sum()

np.int64(0)

In [8]:
final = df.drop_duplicates()

In [9]:
final = final.dropna()

In [10]:
final.shape

(3007, 2)

In [11]:
final.head()

Unnamed: 0,input,output
0,Generate a LinkedIn post announcing the releas...,We just released a new climate emulator to exp...
1,Compose a LinkedIn post announcing the launch ...,Today we launched the Llama 3 Tool Use 8B and ...
2,Generate a LinkedIn post announcing that Daphn...,10/19 Speaker of #Pezcoller23 Symposium - New ...
3,Please generate a LinkedIn post with the follo...,You won't want to miss this one.\n\nOpenAI pul...
4,Generate a LinkedIn post expressing your enjoy...,I've really enjoyed using crewAI tools to buil...


## Importing the transformers library, since the pretrained model comes from Hugging Face.

In [12]:
!pip install transformers -q


In [13]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
import torch

In [14]:
# Initializing the model
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


## Splitting the data into train, val and test set

In [15]:
from sklearn.model_selection import train_test_split

# Split the data into 75% training and 25% temporary (for test and validation)
train_df, temp_df = train_test_split(final, test_size=0.25, random_state=42)

# Split the temporary data into 50% validation and 50% test
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

print(f"Training data shape: {train_df.shape}")
print(f"Validation data shape: {val_df.shape}")
print(f"Test data shape: {test_df.shape}")

Training data shape: (2255, 2)
Validation data shape: (376, 2)
Test data shape: (376, 2)
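
The split sizes printed above follow from how a fractional `test_size` is rounded: to my understanding, `train_test_split` rounds the test share up and gives the remainder to the training share. A minimal sketch of that arithmetic, where `split_sizes` is a hypothetical helper and not part of scikit-learn:

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

n_train, n_temp = split_sizes(3007, 0.25)   # first split: train vs. temporary
n_val, n_test = split_sizes(n_temp, 0.5)    # second split: validation vs. test
print(n_train, n_val, n_test)               # 2255 376 376
```

The result is roughly a 75% / 12.5% / 12.5% split of the 3007 rows.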


## Tokenizing the training, validation and test data

In [16]:

train_tokenized_inputs = train_df['output'].apply(lambda x: tokenizer(x, return_attention_mask=False)['input_ids'])
val_tokenized_inputs = val_df['output'].apply(lambda x: tokenizer(x, return_attention_mask=False)['input_ids'])
test_tokenized_inputs = test_df['output'].apply(lambda x: tokenizer(x, return_attention_mask=False)['input_ids'])

# Concatenate all tokenized sequences for each split
train_tokenized_text = [item for sublist in train_tokenized_inputs.tolist() for item in sublist]
val_tokenized_text = [item for sublist in val_tokenized_inputs.tolist() for item in sublist]
test_tokenized_text = [item for sublist in test_tokenized_inputs.tolist() for item in sublist]


block_size = 128

# Divide the tokenized sequence into chunks
def chunk_data(tokenized_list, block_size):
    input_chunks = []
    for i in range(0, len(tokenized_list) - block_size + 1, block_size):
        input_chunks.append(tokenized_list[i : i + block_size])
    return input_chunks

train_tokenized_chunks = chunk_data(train_tokenized_text, block_size)
val_tokenized_chunks = chunk_data(val_tokenized_text, block_size)
test_tokenized_chunks = chunk_data(test_tokenized_text, block_size)

print(f"Number of training tokenized chunks: {len(train_tokenized_chunks)}")
print(f"Number of validation tokenized chunks: {len(val_tokenized_chunks)}")

Token indices sequence length is longer than the specified maximum sequence length for this model (1078 > 1024). Running this sequence through the model will result in indexing errors


Number of training tokenized chunks: 3151
Number of validation tokenized chunks: 510
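
`chunk_data` keeps only full blocks, so any tokens left over at the end of the concatenated stream are dropped. The toy run below re-declares the function so the snippet is self-contained, and uses made-up integers in place of real token IDs:

```python
def chunk_data(tokenized_list, block_size):
    """Split a flat token list into consecutive, non-overlapping full blocks."""
    return [
        tokenized_list[i : i + block_size]
        for i in range(0, len(tokenized_list) - block_size + 1, block_size)
    ]

toy_tokens = list(range(10))          # stand-in for concatenated token IDs
chunks = chunk_data(toy_tokens, 4)
print(chunks)                         # two full blocks; tokens 8 and 9 are dropped
```

Because the posts are concatenated before chunking, a block can span the end of one post and the start of the next. The sequence-length warning above (1078 > 1024) refers to a single untruncated post at tokenization time; the model only ever receives 128-token blocks.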


## Creating a dataset using PyTorch

In [17]:
from torch.utils.data import Dataset
import torch

class TextDataset(Dataset):
    def __init__(self, tokenized_chunks):
        self.tokenized_chunks = tokenized_chunks

    def __len__(self):
        return len(self.tokenized_chunks)

    def __getitem__(self, idx):
        # Convert the list of token IDs to a PyTorch tensor
        return torch.tensor(self.tokenized_chunks[idx])

train_dataset = TextDataset(train_tokenized_chunks)
val_dataset = TextDataset(val_tokenized_chunks)
test_dataset = TextDataset(test_tokenized_chunks)


print(f"Training dataset length: {len(train_dataset)}")
print(f"Validation dataset length: {len(val_dataset)}")
print(f"Example data point (first training chunk): {train_dataset[0]}")
print(f"Example data point shape: {train_dataset[0].shape}")

Training dataset length: 3151
Validation dataset length: 510
Example data point (first training chunk): tensor([    1,    40,   716, 10607,   284,   423,  7675,  5668, 15941,   513,
          784,  7320,  5172,  8495,    11,   257,  1994,  7515,   286,   616,
        43029,  8495,  2445,  6720,   379,   440, 17765,  4806, 26730,   660,
           13,   383,  3663,   284,  6121,   281,  4238,  3721,   656,   257,
        23895,  1486,   422, 30839,   284, 11939,   468,   587, 30438,    13,
          198,   198,   464, 19249,  2008,  3769,   257,  9815, 16700,   286,
          262,  1486,  1429,    11, 40318,   262,  2267,    11, 23355,  1634,
           11, 47517,    11,   290, 24415,  9539,   326,  2957,   284,   262,
         2457,  1720,    13,   314,   716, 14066,   329,   262,  2832,    12,
          261,  1998,  8618,   832,   428, 42329,    11,   543,   468,  9343,
          502,   284, 35139,   616,  6276,  4678,   287, 13028,  1486,   290,
         5963,   355,   257,  7325,  4

## Setting up the data collator

In [18]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
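
With `mlm=False` the collator prepares batches for causal language modeling. As far as I understand it, the labels are a copy of the input IDs (padding positions, if any, are set to -100 so the loss ignores them), and the model shifts them by one position when computing the loss. The pure-Python sketch below mimics that behavior on toy lists; `causal_lm_batch` and `next_token_pairs` are hypothetical helpers, not Hugging Face APIs:

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def causal_lm_batch(chunks, pad_token_id=None):
    """Build {'input_ids', 'labels'} where labels copy the inputs."""
    labels = [
        [IGNORE_INDEX if pad_token_id is not None and tok == pad_token_id else tok
         for tok in chunk]
        for chunk in chunks
    ]
    return {"input_ids": [list(c) for c in chunks], "labels": labels}

def next_token_pairs(input_ids, labels):
    """Pair each input position with the label it is trained to predict."""
    return list(zip(input_ids[:-1], labels[1:]))

batch = causal_lm_batch([[11, 12, 13, 14]])
print(next_token_pairs(batch["input_ids"][0], batch["labels"][0]))
# [(11, 12), (12, 13), (13, 14)]
```

All chunks in this notebook have the same length (128), so no padding is needed. That matters because the GPT-2 tokenizer does not define a padding token by default.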

## Defining training parameters

In [19]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=15,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
)

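## Initializing the Trainer and training the model

## This step prompts for a Weights & Biases (wandb) API key unless logging integrations are disabled, for example with `report_to="none"` in `TrainingArguments`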

In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

trainer.train()





Epoch,Training Loss,Validation Loss
1,2.8189,2.764333
2,2.609,2.550816
3,2.2808,2.42305
4,2.0458,2.338911
5,2.021,2.299715
6,1.9547,2.251405
7,1.7161,2.227247
8,1.8307,2.199719
9,1.6988,2.207051
10,1.6854,2.196306


TrainOutput(global_step=5910, training_loss=1.9569660845141725, metrics={'train_runtime': 1580.6113, 'train_samples_per_second': 29.903, 'train_steps_per_second': 3.739, 'total_flos': 1543773864591360.0, 'train_loss': 1.9569660845141725, 'epoch': 15.0})
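
The `global_step=5910` in the output above can be checked by hand: each epoch takes one optimizer step per batch. The sketch assumes a single device and no gradient accumulation (neither is configured in `TrainingArguments`, so the defaults should apply):

```python
import math

num_chunks = 3151        # training chunks reported earlier
batch_size = 8           # per_device_train_batch_size
num_epochs = 15          # num_train_epochs

steps_per_epoch = math.ceil(num_chunks / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)                   # 394 5910
```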

## Evaluate the model on the validation set

In [21]:

import math
eval_results = trainer.evaluate()

print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 8.99


## Evaluate the model on the test set

In [22]:
test_results = trainer.evaluate(test_dataset)

print(f"Test Loss: {test_results['eval_loss']:.4f}")
print(f"Test Perplexity: {math.exp(test_results['eval_loss']):.2f}")

Test Loss: 2.2963
Test Perplexity: 9.94
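
Perplexity here is the exponential of the mean per-token cross-entropy loss (measured in nats), so a lower loss means a lower perplexity. A quick check of the reported test figure, using only the loss value printed above:

```python
import math

def perplexity(mean_loss):
    """Convert a mean cross-entropy loss (natural log) to perplexity."""
    return math.exp(mean_loss)

test_loss = 2.2963                       # test loss reported above
print(f"{perplexity(test_loss):.2f}")    # 9.94
```

A perplexity of about 10 means the model is, on average, about as uncertain as if it were choosing uniformly among 10 tokens at each step.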


## Saving the trained model

In [23]:
save_directory = "./tuned_distilgpt2"
trainer.save_model(save_directory)

print(f"Fine-tuned model saved to {save_directory}")

Fine-tuned model saved to ./tuned_distilgpt2
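
A sketch of loading the saved weights and generating the opening of a post. Two assumptions to flag: the `Trainer` above was created without a tokenizer, so `save_model` is expected to write only the model files, and the tokenizer is therefore reloaded from the base `distilgpt2` checkpoint. The helper `generate_post`, the sampling settings, and the example prompt are illustrative and not part of the original notebook.

```python
def generate_post(prompt, model_dir="./tuned_distilgpt2", max_new_tokens=80):
    """Generate a continuation of `prompt` with the fine-tuned model."""
    # Imported lazily so the helper can be defined before the libraries load.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # base tokenizer (assumption)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,                       # sample instead of greedy decoding
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example call (requires the saved model directory):
# print(generate_post("I am excited to announce"))
```

Since the model was trained only on the post text (the `output` column), the prompt should be the beginning of a post rather than an instruction.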
