<div style="background-color:rgb(223, 119, 160); padding: 30px; border-radius: 20px; box-shadow: 0 4px 15px rgba(255, 105, 180, 0.3); color: #F8BBD0; font-family: 'Times New Roman', serif;">

<h1 style="text-align: center; font-size: 38px; color: white; font-weight: bold;">🎀 Fine-Tuning GPT-2 for Interactive Job Interview Preparation Chatbots 🎀</h1>


<hr style="border-top: 2px dashed white;">

<h2 style="font-size: 26px; color: white; font-weight: bold;">🧠 GPT-2 Architecture</h2>
<p>
GPT-2 is a decoder-only transformer: a stack of masked self-attention blocks (12 in the base <code>gpt2</code> checkpoint used here). Each token attends only to the tokens before it, which lets the model capture long-range dependencies and generate coherent, context-aware continuations of a prompt.
</p>
<img src="./GPT-2 architecture.jpg" alt="GPT-2 Architecture" width="600" style="border-radius: 12px;">
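
<p>
The masked self-attention described above can be illustrated with a minimal sketch. It is written in plain Python for readability and assumes a single head with no learned projections, so it shows the idea rather than how the library implements it. The function name <code>causal_attention</code> and the toy vectors are made up for this example.
</p>

```python
import math

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of float vectors, one per token.
    Position i may only attend to positions 0..i.
    """
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Score the current token against itself and earlier tokens only.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ])
    return outputs

# Toy input: three tokens with two-dimensional vectors (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)
```

<p>
The first token can only see itself, so its output equals its own value vector; later tokens mix information from everything before them.
</p>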




<hr style="border-top: 2px dashed white;">

<h2 style="font-size: 28px; color: white; font-weight: bold;">✨ Overview</h2>
<p>
This project presents the fine-tuning of a GPT-2 language model to serve as an intelligent assistant for job interview preparation. The system is trained on a dataset containing interview-style questions and responses, enabling it to generate relevant, coherent, and context-aware answers. The end goal is to create a chatbot that can simulate realistic interview conversations and help users build confidence and fluency before actual interviews.
</p>



<hr style="border-top: 2px dashed white;">

<h2 style="font-size: 26px; color: white; font-weight: bold;">📦 Libraries and Tools</h2>
<ul>
  <li><b>Transformers:</b> The Hugging Face library that provides the pre-trained GPT-2 checkpoint, its tokenizer, and the <code>Trainer</code> API used for fine-tuning.</li>
  <li><b>PyTorch (<code>torch</code>):</b> The deep learning framework that runs the training and holds the model weights.</li>
  <li><b>Datasets:</b> The Hugging Face library used to wrap the data in a <code>Dataset</code> object that the <code>Trainer</code> can consume.</li>
  <li><b>Pandas:</b> A data analysis library used to load and clean the CSV-based Q&A data before it is passed to the tokenizer.</li>
</ul>


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import torch
from datasets import Dataset
import pandas as pd
import re




<hr style="border-top: 2px dashed white;">

<h2 style="font-size: 26px; color: white; font-weight: bold;">🔢 Step-by-Step Implementation</h2>



<h3 style="font-size: 22px; color: white; font-weight: bold;">① Load and Clean the Dataset</h3>
<p>
The Q&A dataset is loaded from a CSV file. Rows with missing values are removed, and whitespace inconsistencies are cleaned using regular expressions.
</p>


In [None]:
df = pd.read_csv('./Q&A_data.csv')

df = df.dropna(subset=['question', 'answer'])

# Clean the text
def clean_text(text):
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

df['question'] = df['question'].apply(clean_text)
df['answer'] = df['answer'].apply(clean_text)
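
<p>
To see what the cleaning step does, here is a small standalone check. The function mirrors the one in the cell above; the sample string is invented for illustration and is not taken from the dataset.
</p>

```python
import re

def clean_text(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space,
    # then trim the ends.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical messy input, as it might appear in a scraped CSV cell.
sample = "  Tell me\tabout\n\nyourself.  "
cleaned = clean_text(sample)
print(repr(cleaned))
```

<p>
Because newlines are collapsed, every cleaned question and answer fits on a single line, which keeps the prompt template in the next step unambiguous.
</p>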


<h3 style="font-size: 22px; color: white; font-weight: bold;">② Format Q&A into Prompt Templates</h3>
<p>
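Each question-answer pair is wrapped in a fixed template that starts with <code>&lt;|startoftext|&gt;</code> and ends with <code>&lt;|endoftext|&gt;</code>, so the model learns where a Q&A exchange begins and ends. Note that only <code>&lt;|endoftext|&gt;</code> is a built-in GPT-2 special token; <code>&lt;|startoftext|&gt;</code> is tokenized as ordinary text unless it is registered with the tokenizer.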
</p>


In [None]:
def format_prompt(row):
    return f"<|startoftext|>\nQ: {row['question']}\nA: {row['answer']}\n<|endoftext|>"

df['formatted'] = df.apply(format_prompt, axis=1)
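
<p>
As a quick illustration of the template, the sketch below applies the same function to a single record. The record is hypothetical; a plain dictionary stands in for a DataFrame row, since both support key lookup.
</p>

```python
def format_prompt(row):
    # Same template as the notebook cell above; `row` only needs
    # 'question' and 'answer' keys.
    return f"<|startoftext|>\nQ: {row['question']}\nA: {row['answer']}\n<|endoftext|>"

# Hypothetical record, not from the dataset.
example = {
    "question": "Why do you want this role?",
    "answer": "It matches my backend experience.",
}
formatted = format_prompt(example)
print(formatted)
```

<p>
The result has four lines: the start marker, the <code>Q:</code> line, the <code>A:</code> line, and the end marker. At inference time, a prompt that stops after <code>A:</code> asks the model to fill in the answer.
</p>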


<h3 style="font-size: 22px; color: white; font-weight: bold;">③ Convert Data to Hugging Face Dataset Format</h3>
<p>
The formatted DataFrame is converted into a <code>Dataset</code> object, making it compatible with the Hugging Face <code>Trainer</code> API.
</p>


In [None]:
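# preserve_index=False keeps the pandas index (non-contiguous after dropna) out of the dataset
dataset = Dataset.from_pandas(df[['formatted']], preserve_index=False)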


<h3 style="font-size: 22px; color: white; font-weight: bold;">④ Load the Tokenizer and Tokenize the Prompts</h3>
<p>
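The GPT-2 tokenizer encodes each prompt into token IDs that the model can understand. Prompts longer than 512 tokens are truncated, and shorter ones are padded to the longest sequence in each mapped batch. GPT-2 has no padding token by default, so one is added to the tokenizer. The labels are a copy of the input IDs, because the model shifts them internally for next-token prediction.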
</p>


In [None]:
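# Load the GPT-2 tokenizer. GPT-2 has no padding token by default, so one is added here;
# '<|endoftext|>' is already its end-of-sequence token.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '<|pad|>', 'eos_token': '<|endoftext|>'})

def tokenize(example):
    # Truncate to 512 tokens and pad to the longest sequence in the mapped batch.
    tokens = tokenizer(example["formatted"], truncation=True, padding="longest", max_length=512)
    # For causal LM training the labels are the input IDs; the model shifts them internally.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens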


tokenized_dataset = dataset.map(tokenize, batched=True)


Map: 100%|██████████| 199/199 [00:00<00:00, 4721.10 examples/s]
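
<p>
Because the labels are copied from the padded input IDs, the loss is also computed on padding positions. A common refinement, not used in this notebook's run, is to set those label positions to <code>-100</code>, the value that the cross-entropy loss in Hugging Face causal language models ignores. The sketch below shows the idea on toy lists; <code>mask_padding</code> is a hypothetical helper, and the ID 50257 stands in for <code>&lt;|pad|&gt;</code>.
</p>

```python
IGNORE_INDEX = -100  # label value skipped by the loss in Hugging Face causal LM heads

def mask_padding(input_ids, attention_mask):
    """Return labels equal to input_ids, with padded positions set to IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: the second sequence ends with two padding positions.
batch_ids = [[10, 11, 12, 13], [20, 21, 50257, 50257]]
batch_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]
labels = mask_padding(batch_ids, batch_mask)
print(labels)
```

<p>
Inside <code>tokenize</code>, this would replace the plain copy with <code>mask_padding(tokens["input_ids"], tokens["attention_mask"])</code>. Doing so changes the reported training loss, since easy-to-predict padding tokens no longer count toward it.
</p>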


<h3 style="font-size: 22px; color: white; font-weight: bold;">⑤ Load the GPT-2 Model</h3>
<p>
The pre-trained GPT-2 checkpoint is loaded with a causal language-modeling head, so it predicts each token from the tokens before it. Because a padding token was added to the tokenizer, the embedding matrix is resized from 50,257 to 50,258 rows.
</p>


In [None]:
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50258, 768)


<h3 style="font-size: 22px; color: white; font-weight: bold;">⑥ Configure Training Parameters</h3>
<p>
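The batch size, number of epochs, warmup steps, and weight decay are set here; the learning rate is left at the <code>TrainingArguments</code> default (5e-5). No evaluation set is configured, so only the training loss is logged (every 10 steps), and a checkpoint is saved to the output directory every 500 steps. Mixed-precision (fp16) training is enabled when a CUDA GPU is available.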
</p>


In [None]:
training_args = TrainingArguments(
    output_dir='./models/qna_model',
    per_device_train_batch_size=4,
    num_train_epochs=50,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=500,
    warmup_steps=50,
    weight_decay=0.01,
    fp16=torch.cuda.is_available()
)
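
<p>
These settings determine how long training runs. The sketch below estimates the step counts, assuming a single device, no gradient accumulation, and the 199 examples reported by the tokenization step; <code>training_schedule</code> is a hypothetical helper, not part of the library.
</p>

```python
import math

def training_schedule(num_examples, batch_size, epochs, save_steps):
    """Estimate optimizer steps and checkpoint count for a single-device run."""
    # The last, smaller batch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    checkpoints = total_steps // save_steps
    return steps_per_epoch, total_steps, checkpoints

steps_per_epoch, total_steps, checkpoints = training_schedule(199, 4, 50, 500)
print(steps_per_epoch, total_steps, checkpoints)  # 50 2500 5
```

<p>
The estimate of 2,500 total steps agrees with the <code>global_step=2500</code> reported by the training run.
</p>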


<h3 style="font-size: 22px; color: white; font-weight: bold;">⑦ Fine-Tune the GPT-2 Model</h3>

<p>
Using the Hugging Face <code>Trainer</code> class, the model is trained on the tokenized Q&A dataset. Over the 50 epochs, it adapts to the language and structure of job interview exchanges.
</p>


In [None]:
# Initialize trainer and start training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer
)

trainer.train()



Step    Training Loss
10      0.5433
20      0.3827
30      0.4872
40      0.3705
50      0.4809
60      0.4177
70      0.3933
80      0.411
90      0.4061
100     0.4161


TrainOutput(global_step=2500, training_loss=0.08720667467713356, metrics={'train_runtime': 320.7525, 'train_samples_per_second': 31.021, 'train_steps_per_second': 7.794, 'total_flos': 2599855718400000.0, 'train_loss': 0.08720667467713356, 'epoch': 50.0})

In [None]:
# Save the model and tokenizer
model.save_pretrained("./interview_model")
tokenizer.save_pretrained("./interview_model")

('./interview_model\\tokenizer_config.json',
 './interview_model\\special_tokens_map.json',
 './interview_model\\vocab.json',
 './interview_model\\merges.txt',
 './interview_model\\added_tokens.json',
 './interview_model\\tokenizer.json')
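
<p>
Once saved, the model can be reloaded to answer new questions. The following is a minimal sketch under stated assumptions: the helper names <code>build_prompt</code>, <code>extract_answer</code>, and <code>generate_answer</code> are hypothetical, and the sampling settings (<code>top_p</code>, <code>temperature</code>, <code>max_new_tokens</code>) are illustrative values rather than tuned ones.
</p>

```python
def build_prompt(question):
    """Format a question the same way the training prompts were formatted."""
    return f"<|startoftext|>\nQ: {question.strip()}\nA:"

def extract_answer(generated_text):
    """Keep only the text before the end marker, trimmed of whitespace."""
    return generated_text.split("<|endoftext|>")[0].strip()

def generate_answer(question, model_dir="./interview_model", max_new_tokens=100):
    # Imported lazily so the two helpers above work without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,   # sample instead of greedy decoding (illustrative choice)
            top_p=0.9,
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return extract_answer(tokenizer.decode(new_tokens, skip_special_tokens=False))
```

<p>
Calling <code>generate_answer("Tell me about yourself.")</code> loads the saved files from <code>./interview_model</code> and returns the generated answer text. Reloading the model on every call keeps the sketch short; an interactive chatbot would load it once and reuse it.
</p>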


<hr style="border-top: 2px dashed white;">

<h2 style="font-size: 26px; color: white; font-weight: bold;">🎀 Conclusion</h2>
<p>
By following these steps, a general-purpose GPT-2 model is fine-tuned into a specialized interview assistant. The resulting chatbot can answer interview-style questions in the format it was trained on, giving users a way to practice and build confidence before real interviews. 🎀
</p>

</div>
