# Load Dataset

NVIDIA documentation question-and-answer (Q&A) pairs: a dataset for fine-tuning LLMs on NVIDIA SDKs and blog posts.

This dataset was obtained by generating Q&A pairs from a few NVIDIA websites, such as development kit pages and guides. It can be used to fine-tune any LLM to instill knowledge about NVIDIA into it.

In [3]:
!pip list

Package                                  Version
---------------------------------------- -----------
accelerate                               0.29.3
aiofiles                                 23.2.1
aiohttp                                  3.9.5
aiosignal                                1.3.1
annotated-types                          0.6.0
anyio                                    3.7.1
asttokens                                2.4.1
async-timeout                            4.0.3
asyncer                                  0.0.2
attrs                                    23.2.0
bidict                                   0.23.1
certifi                                  2024.2.2
chainlit                                 1.0.504
charset-normalizer                       3.3.2
chevron                                  0.14.0
click                                    8.1.7
comm                                     0.2.2
dataclasses-json                         0.5.14
debugpy                                  

In [54]:
from utils.constants import RANDOM_SEED
from utils.util_funcs import data_cleaning
import pandas as pd

In [29]:
from datasets import load_dataset

dataset = load_dataset("ajsbsd/nvidia-qa")

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'question', 'answer'],
        num_rows: 7108
    })
})

## Move up one directory so that packages in other directories (such as `utils`) can be imported

In [8]:
%cd ..

/workspaces/Chatchat_AIMeng/src




# Split dataset and save


In [31]:
dataset = dataset['train']
dataset

Dataset({
    features: ['Unnamed: 0', 'question', 'answer'],
    num_rows: 7108
})

In [34]:
dataset_split = dataset.train_test_split(test_size=0.2, seed=RANDOM_SEED)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'question', 'answer'],
        num_rows: 5686
    })
    test: Dataset({
        features: ['Unnamed: 0', 'question', 'answer'],
        num_rows: 1422
    })
})

In [35]:
train_dataset = dataset_split['train']
test_dataset = dataset_split['test']
train_dataset

Dataset({
    features: ['Unnamed: 0', 'question', 'answer'],
    num_rows: 5686
})

In [36]:
train_val_dataset = train_dataset.train_test_split(test_size=0.2, seed=RANDOM_SEED)
train_val_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'question', 'answer'],
        num_rows: 4548
    })
    test: Dataset({
        features: ['Unnamed: 0', 'question', 'answer'],
        num_rows: 1138
    })
})

In [37]:
train_dataset = train_val_dataset['train']
val_dataset = train_val_dataset['test']
test_dataset = test_dataset  # no-op: still the test split held out by the first train_test_split

In [38]:
train_dataset

Dataset({
    features: ['Unnamed: 0', 'question', 'answer'],
    num_rows: 4548
})

In [39]:
test_dataset

Dataset({
    features: ['Unnamed: 0', 'question', 'answer'],
    num_rows: 1422
})

In [40]:
val_dataset

Dataset({
    features: ['Unnamed: 0', 'question', 'answer'],
    num_rows: 1138
})
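The row counts above follow from applying a 20% hold-out twice: first to the full 7108 rows, then to the remaining training rows. The arithmetic can be sketched as below. `split_sizes` is a hypothetical helper, not part of `datasets`, and it assumes the test share is rounded up, which is consistent with the counts printed in the cells above.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train_rows, test_rows), assuming the test share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


total = 7108

# First split: hold out 20% of all rows as the test set.
train_plus_val, test = split_sizes(total, 0.2)

# Second split: hold out 20% of the remaining rows as the validation set.
train, val = split_sizes(train_plus_val, 0.2)

print(train, val, test)  # 4548 1138 1422
```

The end result is roughly a 64% / 16% / 20% train / validation / test partition of the original rows.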

In [41]:
train_df = pd.DataFrame(train_dataset)
test_df = pd.DataFrame(test_dataset)
val_df = pd.DataFrame(val_dataset)

In [42]:
train_df.head()

,Unnamed: 0,question,answer
0,5189,What limitation does cuBLAS-XT solve for GPU a...,cuBLAS-XT overcomes the limitation of manually...
1,4339,What is the role of the NVIDIA CUDA compiler i...,The NVIDIA CUDA compiler optimizes memory reso...
2,1398,What are the two ways to modify behavior in th...,The two ways to modify behavior in the system ...
3,2457,How does the cudaMemAdvise API enhance memory ...,The cudaMemAdvise API provides memory usage hi...
4,2157,What kind of measurement is used to compare fl...,The comparison of floating-point results betwe...
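Before the `Unnamed: 0` column is dropped, it can serve as a quick sanity check that the three splits share no rows. The sketch below assumes that `Unnamed: 0` is a unique per-row identifier carried over from the original CSV, which the notebook does not confirm. `find_overlaps` is a hypothetical helper, and the toy frames stand in for `train_df`, `val_df`, and `test_df`.

```python
import pandas as pd


def find_overlaps(splits: dict[str, pd.DataFrame], id_column: str) -> dict:
    """Return {(left, right): shared_ids} for every pair of splits that overlap."""
    names = list(splits)
    overlaps = {}
    for i, left in enumerate(names):
        left_ids = set(splits[left][id_column].tolist())
        for right in names[i + 1:]:
            shared = left_ids & set(splits[right][id_column].tolist())
            if shared:
                overlaps[(left, right)] = shared
    return overlaps


# Toy stand-ins for the real splits (IDs are illustrative only).
toy_train = pd.DataFrame({"Unnamed: 0": [5189, 4339], "question": ["q1", "q2"]})
toy_val = pd.DataFrame({"Unnamed: 0": [1398], "question": ["q3"]})
toy_test = pd.DataFrame({"Unnamed: 0": [2457, 2157], "question": ["q4", "q5"]})

overlaps = find_overlaps(
    {"train": toy_train, "val": toy_val, "test": toy_test},
    id_column="Unnamed: 0",
)
print(overlaps)  # {}
```

An empty dictionary means no identifier appears in more than one split.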


In [43]:
# Drop the leftover 'Unnamed: 0' column; only 'question' and 'answer' are needed
train_df.drop(columns=['Unnamed: 0'], inplace=True)
test_df.drop(columns=['Unnamed: 0'], inplace=True)
val_df.drop(columns=['Unnamed: 0'], inplace=True)

In [44]:
train_df.to_csv('data/raw/train.csv', index=False)
test_df.to_csv('data/raw/test.csv', index=False)
val_df.to_csv('data/raw/val.csv', index=False)
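`DataFrame.to_csv` does not create missing parent directories, so the cell above only works if `data/raw` already exists. A minimal sketch of a safer save step follows. `save_splits` is a hypothetical helper, and a temporary directory stands in for the project's `data/raw` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_splits(splits: dict[str, pd.DataFrame], out_dir) -> dict[str, Path]:
    """Write each split to <out_dir>/<name>.csv, creating out_dir if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, df in splits.items():
        path = out_dir / f"{name}.csv"
        df.to_csv(path, index=False)  # index=False avoids a new 'Unnamed: 0' column on reload
        paths[name] = path
    return paths


# Toy frame standing in for train_df / val_df / test_df.
toy = pd.DataFrame({"question": ["What is CUDA?"], "answer": ["A parallel computing platform."]})

with tempfile.TemporaryDirectory() as tmp:
    written = save_splits({"train": toy, "val": toy}, Path(tmp) / "data" / "raw")
    reloaded = pd.read_csv(written["train"])
    print(list(reloaded.columns))  # ['question', 'answer']
```

Writing with `index=False` is also what keeps a spurious `Unnamed: 0` column from reappearing when the files are read back.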

In [49]:
pd.read_csv('./data/processed/train.csv').index

RangeIndex(start=0, stop=4548, step=1)

In [53]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4548 entries, 0 to 4547
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  4548 non-null   object
 1   answer    4548 non-null   object
dtypes: object(2)
memory usage: 71.2+ KB


# Post-Processing the Data

## 1. Data Cleaning

In [55]:
train_df['question'] = train_df['question'].apply(data_cleaning)
val_df['question'] = val_df['question'].apply(data_cleaning)
test_df['question'] = test_df['question'].apply(data_cleaning)

train_df['answer'] = train_df['answer'].apply(data_cleaning)
val_df['answer'] = val_df['answer'].apply(data_cleaning)
test_df['answer'] = test_df['answer'].apply(data_cleaning)

In [56]:
train_df

,question,answer
0,what limitation does cublasxt solve for gpu ac...,cublasxt overcomes the limitation of manually ...
1,what is the role of the nvidia cuda compiler i...,the nvidia cuda compiler optimizes memory reso...
2,what are the two ways to modify behavior in th...,the two ways to modify behavior in the system ...
3,how does the cudamemadvise api enhance memory ...,the cudamemadvise api provides memory usage hi...
4,what kind of measurement is used to compare fl...,the comparison of floatingpoint results betwee...
...,...,...
4543,what is the connection between the lowpower gk...,the lowpower gk208 gpu present in the geforce ...
4544,how does unified memory impact the development...,unified memory simplifies memory management in...
4545,how can loop unrolling improve cuda code perfo...,loop unrolling is a technique that reduces loo...
4546,which digitized manuscripts were used for the ...,the study conducted by the university of notre...
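The source of `data_cleaning` in `utils.util_funcs` is not shown here. Judging only from the output above (text lowercased, hyphens and punctuation removed, so that `cuBLAS-XT` becomes `cublasxt`), a function with similar visible behavior might look like the sketch below. This is an assumption for illustration, not the project's actual implementation, and `data_cleaning_sketch` is a hypothetical name.

```python
import re


def data_cleaning_sketch(text: str) -> str:
    """Lowercase the text, drop non-alphanumeric characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # remove punctuation, hyphens, and other symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text


print(data_cleaning_sketch("What limitation does cuBLAS-XT solve?"))
# what limitation does cublasxt solve
```

Note that this kind of cleaning is lossy: API names such as `cudaMemAdvise` lose their casing, and hyphenated terms are merged, which may or may not be desirable for fine-tuning.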


In [57]:
train_df.to_csv('data/processed/train.csv', index=False)
test_df.to_csv('data/processed/test.csv', index=False)
val_df.to_csv('data/processed/val.csv', index=False)