# NoteBook 1 : Fine-tune GPT 2 LMM:

**Fine-tune LLM GPT2**

This notebook is divided into two main parts, each focused on enhancing the capabilities of the GPT-2 language model through fine-tuning and data augmentation.

**Part 1: Fine-tuning GPT-2 with Deepseek Conversations**

In the first section, we fine-tune the GPT-2 model using a dataset composed of conversational exchanges with **Deepseek**. This allows the model to understand and replicate the style and structure of human-like dialogue, improving its performance in generating coherent and contextually relevant responses.

**Part 2: Exploring the GooAQ Dataset and Using ChromaDB**

In the second part, we explore a dataset called **GooAQ**, which is built from real user queries on the Google search platform. These questions reflect a wide range of natural language expressions and real-world information needs.

To further enhance the model's ability to provide accurate and relevant answers, we store the answers in **ChromaDB**, a vector database. This step allows for efficient retrieval of relevant responses and improves the overall performance of the system by combining fine-tuned generation with memory-based retrieval.


## 1. Importing libraries

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling, MistralForCausalLM, TFAutoModelForCausalLM, TrainerCallback
from datasets import load_dataset, Dataset
import json
import torch

## 2. logging into huggingface_hub

In [3]:
from huggingface_hub import login
login("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")

## 3. FINE TUNE USING JSON DATASET (Part 1) (deepseek convs):

##### 1. Load and prepare the data set:

In this part, we will load a dataset from a JSON file that stores simple conversational exchanges.

In [34]:
# load the data from json file
with open("/kaggle/input/jsondata/chatgpt_logs.json", "r") as file:
    chat_data = json.load(file)

In [35]:
# prepare the data (clean and transform into the json format)
texts = [f"User: {conv['user']}\nChatbot: {conv['chatbot']}" for conv in chat_data]
dataset = Dataset.from_dict({"text": texts})

##### 2. Import tokenizer and model (GPT2):

In [36]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

##### 3. Add the tokenizer:

In [37]:
# Add padding token :
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer), mean_resizing=False)

In [38]:
# Tokenizer the data :
def tokenize_function(examples):
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens


# map the fucntionnn
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

In [39]:
# testinngg :
print(tokenized_datasets[0])

{'text': "User: hey how are you doing ?\nChatbot: Hi! I'm just a virtual assistant, so I don't have feelings, but I'm here and ready to help! How are *you* doing? 😊", 'input_ids': [12982, 25, 17207, 703, 389, 345, 1804, 5633, 198, 30820, 13645, 25, 15902, 0, 314, 1101, 655, 257, 7166, 8796, 11, 523, 314, 836, 470, 423, 7666, 11, 475, 314, 1101, 994, 290, 3492, 284, 1037, 0, 1374, 389, 1635, 5832, 9, 1804, 30, 30325, 232, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 5

In [40]:
# testig ...
print(tokenized_datasets[10])

{'text': 'User: What are the key differences between traditional machine learning and deep learning?\nChatbot: Traditional machine learning and deep learning are both subsets of artificial intelligence, but they differ significantly in their approaches, architectures, and applications. Here are the key differences:\n\n### 1. **Feature Engineering**\n   - **Traditional Machine Learning**: Requires manual feature extraction and engineering. Domain experts often need to identify and create relevant features from the raw data to improve model performance.\n   - **Deep Learning**: Automatically learns features from raw data through multiple layers of neural networks. This eliminates the need for manual feature engineering, especially in complex tasks like image or speech recognition.\n\n### 2. **Model Architecture**\n   - **Traditional Machine Learning**: Uses simpler algorithms such as linear regression, decision trees, support vector machines (SVM), and k-nearest neighbors (KNN). These mo

##### 4. Split the data into train and val:

In [41]:
datasets = tokenized_datasets.train_test_split(train_size=0.9, test_size=0.1)
train_dataset = datasets["train"]
eval_dataset = datasets["test"]
print(train_dataset)
print(eval_dataset)

Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 27
})
Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 3
})


##### 5. Fine tune the model :

In [45]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results_gpt2',          # output directory
    num_train_epochs=10,              # number of training epochs
    per_device_train_batch_size=4,   # batch size for training
    per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    save_steps=500,
    eval_strategy="steps",
    save_total_limit=2,
)

# Setup Trainer
trainer = Trainer(
    model=model,                         # the model to be trained
    args=training_args,                  # training arguments
    train_dataset=train_dataset,         # training dataset
    eval_dataset=eval_dataset,     # validation dataset
)

In [46]:
# Define a custom callback class to log training progress
class TrainingLogger(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            # Print the current epoch, global step, and training loss (if available)
            print(f"Epoch: {state.epoch:.2f}, Step: {state.global_step}, Loss: {logs.get('train_loss', 'N/A')}")


# cclear any existing callbacks (optional, to avoid duplicates)
trainer.callback_handler.callbacks = []

# Add the custom logging callback once
trainer.add_callback(TrainingLogger())

In [47]:
# train the model:
trainer.train()



Epoch: 10.00, Step: 40, Loss: 54.1775146484375


TrainOutput(global_step=40, training_loss=54.1775146484375, metrics={'train_runtime': 29.8489, 'train_samples_per_second': 9.046, 'train_steps_per_second': 1.34, 'total_flos': 70548848640000.0, 'train_loss': 54.1775146484375, 'epoch': 10.0})

##### 6. store and load the finetuned  model

In [48]:
model.save_pretrained("gpt2_finetuned1")
tokenizer.save_pretrained("gpt2_finetuned1")

('gpt2_finetuned1/tokenizer_config.json',
 'gpt2_finetuned1/special_tokens_map.json',
 'gpt2_finetuned1/vocab.json',
 'gpt2_finetuned1/merges.txt',
 'gpt2_finetuned1/added_tokens.json',
 'gpt2_finetuned1/tokenizer.json')

In [65]:
model1 = AutoModelForCausalLM.from_pretrained("gpt2_finetuned1")
tokenizer1 = AutoTokenizer.from_pretrained("gpt2_finetuned1")
tokenizer1.pad_token = tokenizer1.eos_token
model1.config.pad_token_id = tokenizer1.eos_token_id

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model1.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50258, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50258, bias=False)
)

##### 7. Test the model with a sentence

In [66]:
input_text = "how are you"
inputs = tokenizer1(input_text, return_tensors="pt").to(model1.device)
output = model1.generate(**inputs, max_length=100)
print(tokenizer1.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


how are you ... ... ... ... ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...
 ...



We use a simple dataset to fine-tune the GPT-2 model. 

Firstly, GPT-2 is not a highly performant model by today's standards, and to fine-tune it effectively, a large volume of high-quality data is usually required to help it understand complex conversational structures.

Secondly, modern models like **Deepseek** are trained on data with dialogue patterns that differ significantly from those GPT-2 was originally trained on. As a result, our model struggles to understand the structure of the conversations — even when increasing the number of training epochs — leading to **underfitting**.

To address this issue, the solution is to work with a more extensive and diverse dataset. That is exactly what we will explore in the second part of this notebook!

## 4. FINE TUNE USING gooaq DATASET (Part 2) :

##### 1. Loading the dataset:

We’ll use the **gooaq** dataset from Hugging Face's datasets library.

In [4]:
# importing the load dataset from datasets :
from datasets import load_dataset

In [5]:
# Load dataset
dataset = load_dataset("gooaq")
print(dataset)

README.md:   0%|          | 0.00/9.45k [00:00<?, ?B/s]

gooaq.py:   0%|          | 0.00/6.00k [00:00<?, ?B/s]

The repository for gooaq contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/gooaq.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


Downloading data:   0%|          | 0.00/1.92G [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/191M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3112679 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2500 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'short_answer', 'answer', 'answer_type'],
        num_rows: 3112679
    })
    validation: Dataset({
        features: ['id', 'question', 'short_answer', 'answer', 'answer_type'],
        num_rows: 2500
    })
    test: Dataset({
        features: ['id', 'question', 'short_answer', 'answer', 'answer_type'],
        num_rows: 2500
    })
})


In [6]:
# select 30000 rows 
train_data = dataset['train'].select(range(30001))
val_data = dataset['validation'].select(range(1000))
test_data = dataset['test'].select(range(1000))

In [7]:
# Check the number of rows
print(f"Number of rows in train split: {len(train_data)}")
print(f"Number of rows in train split: {len(val_data)}")
print(f"Number of rows in train split: {len(test_data)}")

Number of rows in train split: 30001
Number of rows in train split: 1000
Number of rows in train split: 1000


In [8]:
# Print the first row
print("First row:")
print(train_data[5625])

First row:
{'id': 9388, 'question': '1 decade is equal to how many years?', 'short_answer': '10 Calendar year', 'answer': None, 'answer_type': 3}


In [9]:
# Print the second row
print("Second row:")
print(dataset["test"][1])

Second row:
{'id': 1322, 'question': '1 acre how many square foot?', 'short_answer': '43560 Square foot', 'answer': None, 'answer_type': 3}


##### 2. Prepare the data set gooaq:

In [10]:
def prepare_data(example):
    question = example["question"]
    answer = example["short_answer"]
    
    input_text = f"Question: {question} Answer:"
    target_text = answer

    if answer:
        input_text = f"Question: {question} Answer:"
        target_text = answer
        
        return {
            "input_text": input_text,
            "target_text": target_text
        }
    return None

In [11]:
# map the function :
prepared_dataset_train = train_data.map(prepare_data, remove_columns=["answer_type", "id", "short_answer", 'question', 'answer'])
prepared_dataset_val = val_data.map(prepare_data, remove_columns=["answer_type", "id", "short_answer", 'question', 'answer'])
prepared_dataset_test = test_data.map(prepare_data, remove_columns=["answer_type", "id", "short_answer", 'question', 'answer'])


Map:   0%|          | 0/30001 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [12]:
# verify the format :
print(prepared_dataset_train)
print(prepared_dataset_val)
print(prepared_dataset_test)

Dataset({
    features: ['input_text', 'target_text'],
    num_rows: 19791
})
Dataset({
    features: ['input_text', 'target_text'],
    num_rows: 775
})
Dataset({
    features: ['input_text', 'target_text'],
    num_rows: 772
})


In [13]:
# testing ::
print(prepared_dataset_train[10])
print(prepared_dataset_val[10])
print(prepared_dataset_test[10])

{'input_text': 'Question: 0800 do sac da netflix? Answer:', 'target_text': '1 (866) 579-7172'}
{'input_text': 'Question: 1 bar is how many pascals? Answer:', 'target_text': '100000 Pascal'}
{'input_text': 'Question: 1 am pst to est? Answer:', 'target_text': '04:00 Tuesday, Eastern Time (ET)'}


##### 3. Saving the data set into txt files:

In [14]:
# fucntion to save save the dataset_trained
def save_to_file(prepared_dataset, filename):
    with open(filename, "w") as f:
        for example in prepared_dataset:
            f.write(f"{example['input_text']} {example['target_text']}\n")

In [79]:
# save the train data
save_to_file(prepared_dataset_train, "prepared_gooaq_data_train.txt")

In [80]:
# save the validation data
save_to_file(prepared_dataset_val, "prepared_gooaq_data_val.txt")

In [81]:
# save the  test data
save_to_file(prepared_dataset_test, "prepared_gooaq_data_test.txt")

##### 3. DATA CLEANING (NAN values) :

In [15]:
#cleaning
dataset_cleaned_train = prepared_dataset_train.filter(lambda example: example['input_text'] is not None and example['target_text'] is not None)
dataset_cleaned_val = prepared_dataset_val.filter(lambda example: example['input_text'] is not None and example['target_text'] is not None)
dataset_cleaned_test = prepared_dataset_test.filter(lambda example: example['input_text'] is not None and example['target_text'] is not None)

Filter:   0%|          | 0/19791 [00:00<?, ? examples/s]

Filter:   0%|          | 0/775 [00:00<?, ? examples/s]

Filter:   0%|          | 0/772 [00:00<?, ? examples/s]

In [16]:
# verify the format :
print(dataset_cleaned_train)
print(dataset_cleaned_val)
print(dataset_cleaned_test)

Dataset({
    features: ['input_text', 'target_text'],
    num_rows: 19791
})
Dataset({
    features: ['input_text', 'target_text'],
    num_rows: 775
})
Dataset({
    features: ['input_text', 'target_text'],
    num_rows: 772
})


##### 4. Tokenizer the dataset:

In [17]:
# Load tokenizer
tokenizer2 = AutoTokenizer.from_pretrained('gpt2')
tokenizer2.pad_token = tokenizer2.eos_token

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [18]:
def tokenize_function(examples):
    return tokenizer2(examples["input_text"], truncation=True, padding="max_length", max_length=512)

In [19]:
#train
train_dataset = dataset_cleaned_train.map(tokenize_function, batched=True)
train_dataset = train_dataset.rename_column("target_text", "labels")
print(train_dataset[0])

Map:   0%|          | 0/19791 [00:00<?, ? examples/s]

{'input_text': 'Question: 0800 da natura do brasil? Answer:', 'labels': '1 (720) 408-2293', 'input_ids': [24361, 25, 657, 7410, 12379, 299, 2541, 64, 466, 48029, 346, 30, 23998, 25, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,

In [20]:
#val
val_dataset = dataset_cleaned_val.map(tokenize_function, batched=True)
val_dataset = val_dataset.rename_column("target_text", "labels")
print(val_dataset[0])

Map:   0%|          | 0/775 [00:00<?, ? examples/s]

{'input_text': 'Question: 1 acre equal to how many square miles? Answer:', 'labels': '0.0015625 Square mile', 'input_ids': [24361, 25, 352, 31244, 4961, 284, 703, 867, 6616, 4608, 30, 23998, 25, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 

In [21]:
#test
test_dataset = dataset_cleaned_test.map(tokenize_function, batched=True)
test_dataset = test_dataset.rename_column("target_text", "labels")
print(test_dataset[0])

Map:   0%|          | 0/772 [00:00<?, ? examples/s]

{'input_text': 'Question: 1 acer equal to how much square feet? Answer:', 'labels': '43560 Square foot', 'input_ids': [24361, 25, 352, 936, 263, 4961, 284, 703, 881, 6616, 3625, 30, 23998, 25, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50

In [22]:
# preprocess the data into the GPT-2 input format:
def preprocess(example):
    full_text = example['input_text'] + " " + example['labels']
    
    encoding = tokenizer2(
        full_text,
        truncation=True,
        padding="max_length",
        max_length=128  
    )
    encoding["labels"] = encoding["input_ids"].copy()

    return encoding

In [23]:
tokenized_dataset_inpuutrain = train_dataset.map(preprocess)
tokenized_dataset_inpuutval = val_dataset.map(preprocess)

Map:   0%|          | 0/19791 [00:00<?, ? examples/s]

Map:   0%|          | 0/775 [00:00<?, ? examples/s]

In [24]:
print(tokenized_dataset_inpuutrain[0])

{'input_text': 'Question: 0800 da natura do brasil? Answer:', 'labels': [24361, 25, 657, 7410, 12379, 299, 2541, 64, 466, 48029, 346, 30, 23998, 25, 352, 357, 23906, 8, 41247, 12, 1828, 6052, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256], 'input_ids': [24361, 25, 657, 7410, 12379, 299, 2541, 64, 466, 48

In [25]:
print(tokenized_dataset_inpuutval[0])

{'input_text': 'Question: 1 acre equal to how many square miles? Answer:', 'labels': [24361, 25, 352, 31244, 4961, 284, 703, 867, 6616, 4608, 30, 23998, 25, 657, 13, 405, 21599, 1495, 9276, 10591, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256], 'input_ids': [24361, 25, 352, 31244, 4961, 284

##### 5. Load the model and finetune:

In [26]:
print(tokenized_dataset_inpuutrain)

Dataset({
    features: ['input_text', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 19791
})


In [27]:
# Load the pre-trained GPT-2 model
model_fine = AutoModelForCausalLM.from_pretrained("gpt2")

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [29]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results_gpt2',       # output directory (in kaggle outputs cell !!!)
    num_train_epochs=10,              # number of training epochs, 10 epochs 
    per_device_train_batch_size=4,   # batch size for training
    per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    save_steps=500,
    eval_strategy="steps",
    save_total_limit=2,
)

# Setup Trainer
trainer = Trainer(
    model=model_fine,                           # the model to be trained (loaded form huggingfacee)
    args=training_args,                         # training arguments
    train_dataset=tokenized_dataset_inpuutrain, # training dataset
    eval_dataset=tokenized_dataset_inpuutval,   # validation dataset
)

In [30]:
# Define a custom callback class to log training progress
class TrainingLogger(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            # Print the current epoch, global step, and training loss (if available)
            print(f"Epoch: {state.epoch:.2f}, Step: {state.global_step}, Loss: {logs.get('train_loss', 'N/A')}")


# cclear any existing callbacks (optional, to avoid duplicates)
trainer.callback_handler.callbacks = []

# Add the custom logging callback once
trainer.add_callback(TrainingLogger())

In [31]:
# train the model:
trainer.train()



Epoch: 10.00, Step: 24740, Loss: 0.1382127051018088


TrainOutput(global_step=24740, training_loss=0.1382127051018088, metrics={'train_runtime': 6962.1118, 'train_samples_per_second': 28.427, 'train_steps_per_second': 3.554, 'total_flos': 1.292807651328e+16, 'train_loss': 0.1382127051018088, 'epoch': 10.0})

In [32]:
input_text = "1m equal what in mm"
inputs = tokenizer2(input_text, return_tensors="pt").to(model_fine.device)
output = model_fine.generate(**inputs, max_length=100)
print(tokenizer2.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1m equal what in mm? Answer: 1000 Millimeter


In [33]:
input_text = "1 acre equal to how many square miles ?"
inputs = tokenizer2(input_text, return_tensors="pt").to(model_fine.device)
output = model_fine.generate(**inputs, max_length=100)
print(tokenizer2.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1 acre equal to how many square miles ? Answer: 0.0015625 Square mile


In [112]:
input_text = "c'est quoi cm?"
inputs = tokenizer2(input_text, return_tensors="pt").to(model_fine.device)
output = model_fine.generate(**inputs, max_length=100)
print(tokenizer2.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


c'est quoi cm? Answer: 10 Centimeters


##### 4. saving the model :


In [None]:
# saving ...
model_fine.save_pretrained("gpt2_finetuned_gooq")
tokenizer2.save_pretrained("gpt2_finetuned1_gooq")

### 3. LANGCHIAN PART :
In this section, we will integrate **LangChain**, a powerful framework designed to build applications with large language models.

LangChain enables advanced features such as prompt chaining, memory management, retrieval-augmented generation (RAG), and seamless interaction with external tools like vector databases. By using LangChain, we can build more dynamic and context-aware conversational agents.

In our case, we will use LangChain to connect GPT-2 with **ChromaDB** for retrieving relevant information from stored answers, allowing the model to provide more accurate and enriched responses based on user queries.

This integration marks a shift from simple text generation to building more intelligent and interactive LLM-powered applications.

In [1]:
#!pip install -U langchain-community
#!pip install chromadb

##### 1. Importing libraries

In [114]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document

##### 2. Loading the fine-tuned model 

In [115]:
#load the model:
model = GPT2LMHeadModel.from_pretrained("gpt2_finetuned_gooq")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2_finetuned1_gooq")

##### 3. Creating some functions :

In [158]:
# function to generate the output
def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split("Answer:")[1].strip()

def generate_response1(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split(prompt)[1]

In [130]:
# define a fucntion to store the result
def store_output(prompt, response):
    doc = Document(page_content=response, metadata={"prompt": prompt})
    vectorstore.add_documents([doc])

In [3]:
# find the retrive similar responses:
def retrieve_similar_responses(query, k=2):
    docs = vectorstore.similarity_search(query, k=k)
    return [(doc.page_content, doc.metadata) for doc in docs]

##### 4. extract question from test data set :

In [117]:
#extract some question (test data set)
dataset_cleaned_test[100]["input_text"].split('Question:')[1].split('Answer:')[0].strip()

'1 ml is how many cubic feet?'

##### 5. Load the embedding model from huggingFace

In [122]:
# Embedding model
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

##### 6. Create a chroma base :

In [129]:
# Intilize the  Chroma base
persist_directory='/kaggle/working/results_gpt2'
vectorstore = Chroma(embedding_function=embedding, persist_directory=persist_directory)
print(vectorstore)

<langchain_community.vectorstores.chroma.Chroma object at 0x7b36b70e7610>


##### 7. Testing :

In [152]:
# test 1 :
prompt = "1 centimeter is equal to how many cubic meters"

# get the response 
response = generate_response(prompt)
print("Réponse : ", response)

# Storing (chroma db) :
store_output(prompt, response)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Réponse :  1e-6 Cubic meter


In [154]:
# test 2 :
prompt = "1 centimeter is equal to how many cubic meters"

# get the response
response = generate_response(prompt)
print("Réponse : ", response)

# storing :
store_output(prompt, response)

print("-------------------------")

# test 3 :
prompt = "how many episodes in gentleman jack"

# get the response
response = generate_response(prompt)
print("Réponse : ", response)



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Réponse :  1e-6 Cubic meter
-------------------------
Réponse :  14 episodes


In [161]:
# MINI chat :
while True:
        user_message = input("Vous : ")
        if user_message.lower() in ["exit", "stop"]:
            print("shutdown")
            break
        response = generate_response(user_message)
        store_output(user_message, response)
        print("GPT :", response)

Vous :  10 am est to tokyo time?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT : 6:00 PM Thursday, in Tokyo, Japan


Vous :  10 am mst to cet?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT : 6:00 PM Thursday, Central European Time (CET)


Vous :  1 oz how much kg?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT : 0.0283495 Kilogram


Vous :  exit


shutdown


##### 8. Test the similarity fucntion on chroma db:

In [162]:
retrieved = retrieve_similar_responses("inch")
print("similar response : ", retrieved)

similar response :  [(' 7.48052 Inch', {'prompt': ' how many inches in 1 foot ? '}), (' how many inches in 1 foot?  7:33.6 Inch', {'prompt': ' how many inches in 1 foot? '})]


**@MOHAMMED AMHAL**