# <p style="padding:10px;background-color:#8A2BE2;margin:0;color:white;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Medium Articles Text Generation Using LoRA and GPT2</p>

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Project Overview</span>

##### The project aims to fine-tune a GPT-2 language model for text generation using LoRA and apply it to generate text based on given prompts. The model is trained on a dataset containing Medium articles, which is cleaned and tokenized before training. The fine-tuning process involves configuring the GPT-2 model with custom settings for LoRA, setting up parameters for training, and evaluating the model's performance using the perplexity score.

##### The fine-tuned model is then saved for future use and integrated into a text generation pipeline. This pipeline allows users to input text prompts and receive generated text output based on the trained model's predictions. The project demonstrates a practical application of advanced natural language processing techniques for text generation tasks.

##### Overall, the project showcases the process of fine-tuning a language model for a specific text generation task, using LoRA to train a small set of adapter weights instead of the full model, and wrapping the result in a text generation pipeline for practical use.

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Import Libraries</span>

In [1]:
import re
import os
import math
import emoji
import torch
import wandb
wandb.login()

import pandas as pd

from trl import SFTTrainer
from datasets import Dataset,DatasetDict
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AutoConfig, DataCollatorForLanguageModeling, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

import warnings
warnings.filterwarnings("ignore")

wandb: Currently logged in as: kmohankumar1405. Use `wandb login --relogin` to force relogin


In [2]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model     = GPT2LMHeadModel.from_pretrained("gpt2")

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Load Dataset</span>

In [3]:
data = pd.read_csv('Medium Articles.csv')

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Data Cleaning</span>

In [5]:
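def cleaning_process(text):
    text = str(text)
    # Remove URLs, mentions, retweet markers, hashes, and emoji first, while the
    # characters these patterns rely on (':', '/', '@', '#') are still present.
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    text = re.sub(r'RT\s+', '', text)
    text = re.sub(r'#', '', text)
    text = emoji.replace_emoji(text, replace='')
    # Lowercase, then replace every remaining non-letter (digits, punctuation,
    # newlines) with a space.
    text = re.sub(r'[^a-z]', ' ', text.lower().strip())
    return text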

In [6]:
data['TexT'] = data['TexT'].apply(cleaning_process)
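The order of the substitutions in a cleaning function like this one matters. The sketch below uses two hypothetical helpers, `letters_first` and `patterns_first`, and a made-up sample string to show the effect: once every non-letter has been turned into a space, a URL or mention pattern has nothing left to match.

```python
import re

# Made-up sample text, used only for illustration.
sample = "Check https://example.com/post by @writer #NLP 2024!"

def letters_first(text):
    # Strip non-letters first: ':', '/', and '@' disappear, so the URL and
    # mention patterns below never match and their words stay in the text.
    text = re.sub(r"[^a-zA-Z]", " ", text.lower().strip())
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def patterns_first(text):
    # Remove URLs and mentions first, then strip the remaining non-letters.
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)
    text = re.sub(r"[^a-zA-Z]", " ", text.lower().strip())
    return re.sub(r"\s+", " ", text).strip()

print(letters_first(sample))   # check https example com post by writer nlp
print(patterns_first(sample))  # check by nlp
```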

In [7]:
dataset       = Dataset.from_pandas(data)
train_size    = int(len(dataset) * 0.8)
train_dataset = dataset.select(range(train_size))
test_dataset  = dataset.select(range(train_size, len(dataset)))
raw_datasets  = DatasetDict({"train": train_dataset, "test": test_dataset})

In [8]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['TexT'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['TexT'],
        num_rows: 2000
    })
})

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Tokenize the Data</span>

In [9]:
context_length = 128
outputs = tokenizer(raw_datasets["train"][:2]["TexT"],
                    truncation=True,max_length=context_length,
                    return_overflowing_tokens=True,return_length=True)

In [10]:
def tokenize(element):
    outputs = tokenizer(
        element["TexT"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True)
    
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
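The `tokenize` function above keeps only chunks whose length equals `context_length`. The plain-Python sketch below mimics that effect on a list of integers; `chunk_token_ids` is a hypothetical helper written for illustration and is not how the tokenizer is implemented internally.

```python
def chunk_token_ids(token_ids, context_length):
    """Split token_ids into consecutive windows and keep only the full-length ones."""
    chunks = [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]
    # The last window is usually shorter than context_length and is dropped,
    # just like the `length == context_length` filter in the tokenize function.
    return [chunk for chunk in chunks if len(chunk) == context_length]

ids = list(range(10))  # stand-in for token IDs
full_chunks = chunk_token_ids(ids, context_length=4)
print(full_chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

A text shorter than `context_length` tokens therefore contributes no training rows at all.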

In [11]:
tokenized_datasets = raw_datasets.map(tokenize, batched=True, remove_columns=raw_datasets["train"].column_names)

In [12]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 7718
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 1899
    })
})

In [13]:
config = AutoConfig.from_pretrained("gpt2",vocab_size=len(tokenizer),n_ctx=context_length,
                                     bos_token_id=tokenizer.bos_token_id,
                                     eos_token_id=tokenizer.eos_token_id)

In [14]:
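# Note: GPT2LMHeadModel(config) builds a randomly initialized GPT-2. It replaces
# the pretrained weights loaded in cell [2] with GPT2LMHeadModel.from_pretrained("gpt2").
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")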

GPT-2 size: 124.4M parameters
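The 124.4M figure can be reproduced by hand. The sketch below assumes the standard small GPT-2 shape (vocabulary of 50257, 1024 positions, hidden size 768, 12 blocks, a 4x feed-forward width) and counts the output head once because it shares weights with the token embedding. These values are assumptions taken from the default `gpt2` configuration, not read from the notebook's `config` object.

```python
# Assumed default GPT-2 (small) dimensions.
vocab_size, n_positions, d_model, n_layer = 50257, 1024, 768, 12
d_ff = 4 * d_model

# Token and position embeddings.
embeddings = vocab_size * d_model + n_positions * d_model
# One LayerNorm has a weight and a bias vector.
layer_norm = 2 * d_model
# Attention: fused QKV projection plus output projection, each with a bias.
attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
# Feed-forward: expand to d_ff, then project back, each with a bias.
mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

per_block = 2 * layer_norm + attention + mlp
total = embeddings + n_layer * per_block + layer_norm  # plus the final LayerNorm

print(f"{total:,} parameters ({total / 1000**2:.1f}M)")  # 124,439,808 parameters (124.4M)
```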


In [15]:
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
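With `mlm=False`, the collator copies `input_ids` into `labels` and replaces padding positions with `-100`, the index that the loss ignores. The sketch below imitates that step with plain lists; `make_causal_lm_labels` is a hypothetical helper and the first two token IDs are arbitrary. One consequence of setting `pad_token` to `eos_token` is that genuine end-of-text tokens are masked as well.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, masking every position that holds the pad token."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in input_ids]

eos_id = 50256                         # GPT-2 end-of-text ID, reused here as the pad ID
row = [464, 2746, eos_id, eos_id]      # arbitrary tokens followed by padding
labels = make_causal_lm_labels(row, pad_token_id=eos_id)
print(labels)  # [464, 2746, -100, -100]
```

The shift between inputs and targets (predicting token t+1 from tokens up to t) happens inside the model's loss computation, not in the collator.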

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">LoRA Config</span>

In [16]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM")
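To make `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of a LoRA forward pass on tiny made-up matrices. It assumes the standard scaling factor `lora_alpha / r`; the helper names `matvec` and `lora_forward` are hypothetical and this is not PEFT's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, lora_alpha, r):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen weight, shape (2, 3)
A = [[0.5, -0.5, 1.0]]                  # trainable, shape (r, 3) with r = 1
x = [1.0, 2.0, 3.0]

# B starts at zero, so the adapted layer initially matches the base layer.
print(lora_forward(x, W, A, [[0.0], [0.0]], lora_alpha=2, r=1))  # [7.0, 2.0]
# After B has moved away from zero, the output shifts by scaling * B @ (A @ x).
print(lora_forward(x, W, A, [[1.0], [2.0]], lora_alpha=2, r=1))  # [12.0, 12.0]

# Scaling factors implied by the two configurations in this notebook.
print(16 / 64, 128 / 64)  # 0.25 2.0
```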

In [17]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [18]:
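lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    # GPT-2 has no q_proj/k_proj/v_proj modules; its fused attention projection
    # is named c_attn. With target_modules unset, PEFT uses its default for GPT-2.
    #target_modules=["q_proj", "v_proj", "k_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM")

# prepare_model_for_kbit_training is intended for quantized (k-bit) models;
# the model here is held in full precision.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
print_trainable_parameters(model)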

trainable params: 2359296 || all params: 126799104 || trainable%: 1.8606566809809635
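The reported counts can be checked with a little arithmetic. The sketch below assumes the adapters attach only to GPT-2's fused attention projection (`c_attn`, mapping 768 inputs to 2304 outputs) in each of the 12 blocks; that assumption is not stated in the notebook, but it reproduces the printed numbers exactly.

```python
n_layer, d_model, r = 12, 768, 64
d_in, d_out = d_model, 3 * d_model       # assumed target: c_attn (768 -> 2304)

# Each adapted layer adds A with shape (r, d_in) and B with shape (d_out, r).
per_layer = r * d_in + d_out * r
trainable = n_layer * per_layer

base_params = 124_439_808                # GPT-2 size before adding adapters
all_params = base_params + trainable

print(trainable)                         # 2359296
print(all_params)                        # 126799104
print(round(100 * trainable / all_params, 4))  # 1.8607
```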


<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Training Parameters</span>

In [19]:
args = TrainingArguments(
    output_dir="Medium Article File",
    overwrite_output_dir = True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    num_train_epochs=5,
    weight_decay=0.1,
    warmup_steps=100,
    learning_rate=5e-4,
    save_steps=500,
    report_to="wandb",
    metric_for_best_model = 'accuracy',
    run_name = 'custom_training')

In [20]:
def format_generated_text(generated_text):
    if isinstance(generated_text, dict):

        text = generated_text.get('text', '')
        
        formatted_text = text.strip()
        
        formatted_text = formatted_text.capitalize()
        
        if not formatted_text.endswith('.'):
            formatted_text += '.'
        return formatted_text
    else:
        return generated_text

In [21]:
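trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    packing=True,
    formatting_func=format_generated_text,
    # Note: model was already wrapped with lora_config in cell [18];
    # peft_config (cell [16]) is a second, different LoRA configuration.
    peft_config=peft_config)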

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Fine-Tune the Model</span>

In [22]:
wandb.init(project="Medium Article Generation")

In [23]:
trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=10, training_loss=10.24756088256836, metrics={'train_runtime': 1485.5462, 'train_samples_per_second': 0.05, 'train_steps_per_second': 0.007, 'total_flos': 40280968396800.0, 'train_loss': 10.24756088256836, 'epoch': 5.0})
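A quick sanity check on the step count is worthwhile here. The sketch below assumes a single device and no gradient accumulation (neither is set in `TrainingArguments`) and compares the number of optimizer steps one would expect from the 7,718 tokenized training rows with the `global_step` that was logged.

```python
import math

train_rows = 7718        # rows in tokenized_datasets["train"]
batch_size = 8           # per_device_train_batch_size
epochs = 5               # num_train_epochs

steps_per_epoch = math.ceil(train_rows / batch_size)
expected_steps = steps_per_epoch * epochs
logged_steps = 10        # global_step from TrainOutput

print(steps_per_epoch)   # 965
print(expected_steps)    # 4825
print(logged_steps / expected_steps < 0.01)  # True
```

The gap suggests the trainer did not iterate over the tokenized rows as they stand. The "Generating train split: 0 examples" messages printed when the trainer was built make the `packing=True` and `formatting_func` arguments one possible place to look.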

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Perplexity Score Check</span>

In [24]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 8764.61
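Perplexity is the exponential of the average cross-entropy loss, so it can be read against a simple baseline: a model that assigns equal probability to every token has a perplexity equal to the vocabulary size. The sketch below uses the rounded evaluation loss from the run summary and assumes GPT-2's vocabulary of 50257 tokens.

```python
import math

eval_loss = 9.07848                 # rounded eval/loss from the run summary
perplexity = math.exp(eval_loss)

vocab_size = 50257                  # assumed GPT-2 vocabulary size
uniform_loss = math.log(vocab_size) # loss of a uniform guess over the vocabulary

print(f"perplexity from rounded loss: {perplexity:.1f}")
print(f"uniform-guess loss: {uniform_loss:.3f}, perplexity: {vocab_size}")
```

Lower is better on both scales. A loss of about 9.08 is only modestly below the uniform-guess loss of about 10.82.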


<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">wandb Run Summary</span>

In [25]:
wandb.finish()

| Metric | Value |
|---|---|
| eval/loss | 9.07848 |
| eval/runtime | 16.7935 |
| eval/samples_per_second | 0.179 |
| eval/steps_per_second | 0.06 |
| train/epoch | 5.0 |
| train/global_step | 10.0 |
| train/total_flos | 40280968396800.0 |
| train/train_loss | 10.24756 |
| train/train_runtime | 1485.5462 |
| train/train_samples_per_second | 0.05 |


<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Save the Fine-Tuned Model</span>

In [26]:
trainer.save_model()

In [27]:
output_dir = 'Medium Article File'
model.save_pretrained(output_dir)

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Load the Model into a Pipeline</span>

In [28]:
from transformers import pipeline

output_test = pipeline('text-generation',model='Medium Article File', tokenizer='gpt2')

#result = output_test('NLP is Good thing to have in future')[0]['generated_text']

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Test the Fine-Tuned Model</span>

In [32]:
output_test('Neural Networks')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Neural Networks/E-PFC\n\n\nIn some areas where there is a difference in brain structure-we have a similar type of brain for language. However, it is most often difficult to pinpoint with our own eyes where those differences arise.'

In [35]:
output_test('Data science')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Data science for your business, not just for you. You shouldn't rely exclusively on your computer science background to make a living.\n\nTo do so, first make sure your project is successful and do your best to answer every question on the site"

In [36]:
output_test('what is a codeing')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'what is a codeing function)?\n\nAnswer: I like to use the Python interpreter on my Mac. In this test, I choose a variable that contains the path to that variable:\n\nimport os def b = os.path / obj'

In [37]:
output_test('GPU is very Low')[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"GPU is very Low Profile as opposed to an R&D system, and the fact it does not utilize a much more powerful CPU at all, will ensure that the price of its predecessor really doesn't drop to the stratosphere.\n\nIt was"

<div style="display: flex; justify-content: center;">
    <span style="padding: 10px; margin: 5px; background-color: #ADD8E6; color: white; font-family: newtimeroman; font-size: 150%; border-radius: 15px; font-weight: 500;">Conclusion</span>

##### This project walks through fine-tuning a GPT-2 language model with LoRA for text generation. The model is trained on a dataset of Medium articles that is cleaned and tokenized before training. On the test split the model reaches a perplexity of about 8764.61 (evaluation loss 9.08). Lower perplexity is better, so this score is high, and with only 10 training steps logged the run leaves considerable room for further training.

##### Possible applications of a fine-tuned model of this kind include content generation, chatbot development, and creative writing assistance. Tasks such as text summarization, content personalization, and language translation would need task-specific data and evaluation, which this project does not cover.

##### Overall, this project shows an end-to-end workflow for adapting a transformer-based model such as GPT-2: data cleaning, tokenization, LoRA configuration, training, perplexity evaluation, and a text generation pipeline that can serve as a starting point for more context-aware text generation applications.