# Fine-tuning Nvidia Chatbot for Question Answering
This project involves fine-tuning a pre-trained Nvidia chatbot model to enhance its performance for question-answering tasks. The notebook outlines the steps taken to download the dataset, preprocess the data, fine-tune the model, and evaluate its performance. Each step is meticulously documented to ensure a comprehensive understanding of the process and the technologies used.

## Environment Setup and Library Imports

In [1]:
!pip install opendatasets datasets accelerate --quiet
import opendatasets as od
od.download("https://www.kaggle.com/datasets/gondimalladeepesh/nvidia-documentation-question-and-answer-pairs")

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m547.8/547.8 kB[0m [31m6.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m309.4/309.4 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.8/40.8 MB[0m [31m13.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.9/64.9 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m6.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m8.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.3/21.3 MB[0m [31m55.3 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dep

100%|██████████| 400k/400k [00:00<00:00, 60.6MB/s]







In [2]:
import pandas as pd
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer
import torch
import re

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


This section sets up the environment by installing necessary libraries and importing essential modules. The key components include:

- **opendatasets**: This library allows for easy downloading of datasets from Kaggle. It's used here to download the Nvidia documentation question-and-answer pairs dataset, facilitating the acquisition of training data.
- **pandas**: A powerful data manipulation and analysis library used for handling the dataset. It provides data structures and functions needed to clean and preprocess the data efficiently.
- **datasets**: A library by Hugging Face designed for accessing and managing datasets in a way that is optimized for training machine learning models. It supports various dataset formats and integrates seamlessly with the transformers library.
- **transformers**: Hugging Face’s transformers library provides pre-trained models and tools to fine-tune them for various NLP tasks. Here, it’s used for loading the model and tokenizer, and for training and evaluation processes.
- **torch**: PyTorch is a deep learning framework that offers flexibility and speed. It's used for handling tensors and performing computations on GPUs.
- **re**: The regular expression library in Python is used for text preprocessing to clean and normalize the dataset.
- **CUDA**: If available, the code will use GPU acceleration to speed up the training and inference processes, which is crucial for handling large models and datasets efficiently.

By setting up these libraries, the notebook prepares the environment for data processing, model training, and evaluation, ensuring that all necessary tools are available for subsequent steps.

## Data Loading and Initial Exploration

In [3]:
data = pd.read_csv('/content/nvidia-documentation-question-and-answer-pairs/NvidiaDocumentationQandApairs.csv')[['question', 'answer']]
data.head()

Unnamed: 0,question,answer
0,What is Hybridizer?,Hybridizer is a compiler from Altimesh that en...
1,How does Hybridizer generate optimized code?,Hybridizer uses decorated symbols to express p...
2,What are some parallelization patterns mention...,The text mentions using parallelization patter...
3,How can you benefit from accelerators without ...,You can benefit from accelerators' compute hor...
4,What is an example of using Hybridizer?,An example in the text demonstrates using Para...


This section loads the dataset into a pandas DataFrame and performs an initial exploration. The dataset consists of question-and-answer pairs from Nvidia documentation, which will be used for training and evaluating the model. Using `pandas` for data loading and initial exploration provides several advantages:

- **DataFrame Structure**: Pandas DataFrame is a versatile data structure that allows for easy manipulation and analysis of tabular data.
- **Data Inspection**: The `head()` function helps in quickly viewing the first few rows of the dataset to understand its structure and content.

## Data Preprocessing

In [4]:
def clean_text(text):
  text = text.lower()
  text = re.sub('[^A-Za-z8-9]+', ' ', text)

  return text

In [5]:
data['question'] = data['question'].apply(clean_text)
data['answer'] = data['answer'].apply(clean_text)

In [6]:
data.head()

Unnamed: 0,question,answer
0,what is hybridizer,hybridizer is a compiler from altimesh that en...
1,how does hybridizer generate optimized code,hybridizer uses decorated symbols to express p...
2,what are some parallelization patterns mention...,the text mentions using parallelization patter...
3,how can you benefit from accelerators without ...,you can benefit from accelerators compute hors...
4,what is an example of using hybridizer,an example in the text demonstrates using para...


The data preprocessing step involves normalizing and cleaning the text data. This step is crucial for preparing the data for training the model. The preprocessing steps include:

- **Lowercasing**: Converting all text to lowercase ensures uniformity, which helps the model treat similar words with different cases as the same.
- **Removing Punctuation and Special Characters**: Regular expressions (`re` library) are used to remove unwanted characters, which can introduce noise and affect the model's performance.

These preprocessing steps help in creating a clean and consistent dataset, which is vital for training an effective machine learning model.

## Data Splitting

In [7]:
train = data.sample(frac=0.7)
test = data.drop(train.index)


val = test.sample(frac=0.5)
test = test.drop(val.index)

print('Training Shape:', train.shape)
print('Validation Shape:', val.shape)
print('Testing Shape:', test.shape)

Training Shape: (4976, 2)
Validation Shape: (1066, 2)
Testing Shape: (1066, 2)


Splitting the dataset into training, validation, and testing sets is a standard practice in machine learning. It ensures that the model is trained on one subset of data, validated on another, and tested on a completely unseen subset. This process helps in:

- **Preventing Overfitting**: By validating the model on a separate subset, it ensures that the model does not overfit to the training data.
- **Model Evaluation**: Testing on an unseen dataset provides an unbiased evaluation of the model's performance.

## Tokenization and Dataset Preparation

In [8]:
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype = torch.float32)
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [9]:
def tokenize_function(example):
  start_prompt = "According to the following question:\n\n"
  end_prompt = "\nAnswer:\n\n"

  prompt = [start_prompt + question + end_prompt for question in example['question']]

  example["input_ids"] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors = 'pt').input_ids
  example["labels"] = tokenizer(example['answer'], padding="max_length", truncation=True, return_tensors = 'pt').input_ids

  return example

In [10]:
train_data = Dataset.from_pandas(train)
train_tokenized_datasets = train_data.map(tokenize_function, batched = True)
train_tokenized_datasets = train_tokenized_datasets.remove_columns(['question', 'answer', '__index_level_0__'])

Map:   0%|          | 0/4976 [00:00<?, ? examples/s]

In [11]:
val_data = Dataset.from_pandas(val)
val_tokenized_datasets = val_data.map(tokenize_function, batched = True)
val_tokenized_datasets = val_tokenized_datasets.remove_columns(['question', 'answer', '__index_level_0__'])

test_data = Dataset.from_pandas(test)
test_tokenized_datasets = test_data.map(tokenize_function, batched = True)
test_tokenized_datasets = test_tokenized_datasets.remove_columns(['question', 'answer', '__index_level_0__'])

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

Tokenization converts text data into numerical representations that the model can process. This section involves:

- **Using Pre-trained Models**: The `AutoModelForSeq2SeqLM` and `AutoTokenizer` from Hugging Face’s transformers library are used to load a pre-trained model (`google/flan-t5-base`) and tokenizer.
- **Creating Prompts**: The `tokenize_function` creates input prompts for the questions and tokenizes both the questions and answers.
- **Dataset Conversion**: The pandas DataFrame is converted into a Hugging Face Dataset format, which is optimized for training and evaluation with transformers.

This process ensures that the data is in a suitable format for training a sequence-to-sequence model, facilitating effective learning.

## Model Training, Evaluation, and Saving

In [12]:
EPOCHS = 5
LR = 1e-3
BATCH_SIZE = 2

training_path = "./training_nvidia_chatbot"

training_args = TrainingArguments(
    output_dir = training_path,
    save_total_limit = 2,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size = BATCH_SIZE,
    num_train_epochs = EPOCHS,
    learning_rate = LR,
    evaluation_strategy = "epoch",
)

trainer = Trainer(
    model = original_model,
    args = training_args,
    train_dataset = train_tokenized_datasets,
    eval_dataset = val_tokenized_datasets
)

trainer.train()

model_path = './nvidia_chatbot_final_model'

trainer.model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)



Epoch,Training Loss,Validation Loss
1,0.182,0.159619
2,0.1343,0.142804
3,0.0926,0.138898
4,0.0618,0.146118
5,0.037,0.162011


('./nvidia_chatbot_final_model/tokenizer_config.json',
 './nvidia_chatbot_final_model/special_tokens_map.json',
 './nvidia_chatbot_final_model/spiece.model',
 './nvidia_chatbot_final_model/added_tokens.json',
 './nvidia_chatbot_final_model/tokenizer.json')

In [13]:
eval_results = trainer.evaluate(eval_dataset = test_tokenized_datasets)

In [14]:
print(eval_results)

{'eval_loss': 0.17181049287319183, 'eval_runtime': 118.4041, 'eval_samples_per_second': 9.003, 'eval_steps_per_second': 4.502, 'epoch': 5.0}


Fine-tuning the pre-trained model involves training it on the specific dataset to adapt it to the task at hand. This section includes:

- **Setting Hyperparameters**: The number of epochs, learning rate, and batch size are specified to control the training process.
- **TrainingArguments**: This class from Hugging Face’s transformers library specifies various training parameters and configurations, such as the output directory, evaluation strategy, and saving checkpoints.
- **Trainer API**: The `Trainer` class abstracts the training loop, making it easy to train and evaluate models with minimal code.

These steps ensure that the model is fine-tuned effectively, leveraging the powerful training utilities provided by the transformers library.

After training, the model and tokenizer are saved to disk for future use. The model is then evaluated on the test dataset to assess its performance. Key components include:

- **Saving the Model**: The `save_pretrained` method saves the fine-tuned model and tokenizer to a specified path, allowing for easy loading and reuse.
- **Evaluation**: The `evaluate` method of the `Trainer` class evaluates the model on the test dataset, providing metrics such as evaluation loss to gauge its performance.

These steps ensure that the trained model is preserved and its performance is quantified on unseen data.

## Testing and Inference

In [16]:
test_text = 'what is cuda nsight?'

start_prompt = "According to the following question:\n\n"
end_prompt = "\nAnswer:\n\n"

full_prompt = start_prompt + test_text + end_prompt

print(full_prompt)

According to the following question:

what is cuda nsight?
Answer:




In [19]:
from transformers import GenerationConfig

trained_model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

tokenized_test_text = tokenizer(full_prompt, return_tensors = 'pt').input_ids.to(device)

model_output = trained_model.generate(tokenized_test_text,
                                      generation_config = GenerationConfig(max_new_tokens = 150))[0]

final_output = tokenizer.decode(model_output, skip_special_tokens = True)

print(final_output)

cuda nsight is a preview release that provides csight for gpu kernels nsight compute provides a preview of cuda nsight compute for gpu kernels 


In [21]:
from transformers import GenerationConfig

original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype = torch.float32).to(device)
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

tokenized_test_text = tokenizer(full_prompt, return_tensors = 'pt').input_ids.to(device)

model_output = original_model.generate(tokenized_test_text,
                                      generation_config = GenerationConfig(max_new_tokens = 150))[0]

final_output = tokenizer.decode(model_output, skip_special_tokens = True)

print(final_output)

a snaretictrachetebedete etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement etablissement


In [23]:
test_data[53]

{'question': 'what are the downsides of using malloc like abstractions in cuda applications before cuda ',
 'answer': 'before cuda using malloc like abstractions in cuda applications had limitations such as limited options for memory management and inefficient dynamic data structure creation ',
 '__index_level_0__': 300}

In [24]:
test_text = 'what are the downsides of using malloc like abstractions in cuda applications before cuda '

start_prompt = "According to the following question:\n\n"
end_prompt = "\nAnswer:\n\n"

full_prompt = start_prompt + test_text + end_prompt

print(full_prompt)

According to the following question:

what are the downsides of using malloc like abstractions in cuda applications before cuda 
Answer:




In [25]:
from transformers import GenerationConfig

trained_model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

tokenized_test_text = tokenizer(full_prompt, return_tensors = 'pt').input_ids.to(device)

model_output = trained_model.generate(tokenized_test_text,
                                      generation_config = GenerationConfig(max_new_tokens = 150))[0]

final_output = tokenizer.decode(model_output, skip_special_tokens = True)

print(final_output)

before cuda malloc like abstractions introduced in c standard cuda provided a single abstraction for malloc like arrays requiring the use of malloc like abstractions for these operations led to a lack of performance and a lack of abstraction for explicitly mapped arrays 


This section demonstrates how to use the fine-tuned model for inference. The steps include:

- **Creating a Prompt**: A test question is formatted into a prompt that the model can process.
- **Loading the Model**: The fine-tuned model and tokenizer are loaded from the saved path.
- **Generating an Answer**: The prompt is tokenized and passed through the model to generate an answer. The output is then decoded to produce the final answer.

This process showcases the model's ability to understand and respond to questions based on the training it received.