# Installing Dependencies


In [1]:
! pip install transformers datasets accelerate peft

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl

# Explanation
datasets:

A library for easily accessing and sharing datasets. It provides a wide range of datasets for machine learning tasks and allows for efficient loading, preprocessing, and manipulation of data.

transformers:

A library developed by Hugging Face for working with transformer models. It provides pre-trained models for various natural language processing (NLP) tasks (like text classification, translation, and summarization) and tools to fine-tune these models on custom datasets.

accelerate:

A library that simplifies training and deploying models on various hardware configurations (like CPU, single GPU, or multiple GPUs). It streamlines the setup for distributed training and makes it easier to optimize performance.

peft:

Stands for "Parameter-Efficient Fine-Tuning." This library offers techniques such as LoRA that fine-tune large language models by training only a small number of additional parameters while the original weights stay frozen. This reduces memory and compute requirements while largely preserving model performance.

# Importing Libraries

In [2]:
import torch
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq
)
from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    LoraConfig,
    TaskType
)
import pandas as pd
import numpy as np


In [4]:
df = pd.read_csv("/content/datascience.csv")

In [5]:
df

Unnamed: 0,Question,Answer
0,What is under-fitting and overfitting in machi...,"Underfitting is when a model is too simple, an..."
1,Can you explain what a false positive and a fa...,A false positive incorrectly indicates a condi...
2,Clarify the concept of Phase IV.,"Phase IV studies, also known as post-marketing..."
3,What is semi-supervised learning described in ...,Semi-supervised learning integrates both label...
4,Discuss the parallelization of training in gra...,Parallelizing training of a gradient boosting ...
...,...,...
1165,Can you explain the ROC curve and AUC score an...,A ROC (Receiver Operating Characteristic) curv...
1166,How do you approach setting the threshold in a...,When setting the threshold in a binary classif...
1167,What is the difference between LDA (Linear Dis...,LDA (Linear Discriminant Analysis) and PCA (Pr...
1168,How does the Naive Bayes algorithm compare to ...,Naive Bayes is a simple and fast algorithm tha...


# Splitting df into Train and Test

In [6]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)

In [7]:
train_df

Unnamed: 0,Question,Answer
1139,What is the difference between bagging boostin...,Both bagging and boosting are ensemble learnin...
809,How do the zip() and enumerate() functions wor...,Zip combines lists into tuples based on the sa...
1087,"What are lambda functions in python, and why a...","In Python, a lambda function is a small anonym..."
184,Define and describe the concept of knowledge e...,Knowledge engineering is a discipline within a...
1115,What are some of the techniques to avoid overf...,Some techniques that can be used to avoid over...
...,...,...
1044,Define regression and list models used for reg...,Regression analysis is used to understand the ...
1095,Can you explain the difference between descrip...,Descriptive statistics is used to summarize an...
1130,What is the difference between MinMaxScaler an...,Both the MinMaxScaler and StandardScaler are t...
860,Outline the basic concept of regression to the...,Regression to the mean refers to the tendency ...


In [8]:
train_df.to_parquet("/content/drive/MyDrive/PROJECT/train.parquet", index=False)
test_df.to_parquet("/content/drive/MyDrive/PROJECT/test.parquet", index=False)

**Explanation:**

The training data is saved to train.parquet and the test data to test.parquet, both without the index column.

In [9]:
train_df = pd.read_parquet("/content/drive/MyDrive/PROJECT/train.parquet")
test_df = pd.read_parquet("/content/drive/MyDrive/PROJECT/test.parquet")
train_data = Dataset.from_pandas(train_df)
test_data = Dataset.from_pandas(test_df)

**Explanation:**

This code reads Parquet files train.parquet and test.parquet from Google Drive into DataFrames train_df and test_df. It then converts these DataFrames into Hugging Face Dataset objects, train_data and test_data.

In [10]:
model_id="google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]



config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

**Explanation:**

This cell initializes a sequence-to-sequence model and its tokenizer using Hugging Face's AutoTokenizer and AutoModelForSeq2SeqLM classes. Both are loaded from the "google/flan-t5-large" checkpoint.

In [11]:
def preprocess_function(sample,padding="max_length"):
    model_inputs = tokenizer(sample["Question"], max_length=256, padding=padding, truncation=True)
    labels = tokenizer(sample["Answer"], max_length=256, padding=padding, truncation=True)
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

**Explanation:**

The preprocess_function tokenizes question-answer pairs from the dataset for model training. It encodes the "Question" and "Answer" fields with padding and truncation to a maximum length of 256 tokens. If padding is set to "max_length", padding tokens in the labels are replaced with -100 so that they are ignored during loss calculation. The processed label "input_ids" are then added to model_inputs under the "labels" key.
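The label-masking step can be looked at in isolation. The snippet below is a minimal sketch in plain Python, not the notebook's tokenizer output: the token ids are made up, the helper name `mask_pad_labels` is hypothetical, and 0 is assumed to be the pad id (as it is for T5 tokenizers).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_pad_labels(label_ids, pad_token_id):
    """Replace every padding id in each label sequence with IGNORE_INDEX."""
    return [
        [(tok if tok != pad_token_id else IGNORE_INDEX) for tok in seq]
        for seq in label_ids
    ]


# Toy batch: made-up ids, with 0 standing in for the pad id and 1 for end-of-sequence.
toy_labels = [[71, 1712, 1, 0, 0], [9, 1, 0, 0, 0]]
masked = mask_pad_labels(toy_labels, pad_token_id=0)
print(masked)
```

Only the padding positions change; real tokens, including the end-of-sequence token, still contribute to the loss.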

In [12]:
train_tokenized_dataset = train_data.map(preprocess_function, batched=True, remove_columns=train_data.column_names)
test_tokenized_dataset = test_data.map(preprocess_function, batched=True, remove_columns=test_data.column_names)
print(f"Keys of tokenized dataset: {list(train_tokenized_dataset.features)}")

Map:   0%|          | 0/936 [00:00<?, ? examples/s]

Map:   0%|          | 0/234 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


**Explanation:**

Here we tokenize the train_data and test_data datasets by applying preprocess_function to each batch. The original columns are removed, leaving only the processed features. The last line prints the keys of train_tokenized_dataset, showing the feature names available after tokenization.
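To make the batched call pattern concrete, here is a simplified stand-in written in plain Python. It is not how the datasets library is implemented; `toy_batched_map` and `count_words` are hypothetical helpers, and the rows are invented. The point it illustrates is that, with batching, the function receives one dict of lists per batch rather than one row at a time, and that only the returned columns are kept.

```python
def toy_batched_map(rows, fn, batch_size=2):
    """Simplified stand-in for a batched map: call fn on column-wise batches."""
    columns = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of lists, one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in fn(batch).items():
            columns.setdefault(key, []).extend(values)
    return columns


def count_words(batch):
    """Hypothetical replacement for preprocess_function: returns only new columns."""
    return {
        "question_words": [len(q.split()) for q in batch["Question"]],
        "answer_words": [len(a.split()) for a in batch["Answer"]],
    }


rows = [
    {"Question": "What is overfitting?", "Answer": "Fitting noise in the training data."},
    {"Question": "Define recall.", "Answer": "True positives over all actual positives."},
    {"Question": "What is PCA?", "Answer": "A linear dimensionality reduction method."},
]
result = toy_batched_map(rows, count_words)
print(list(result))
print(result["question_words"])
```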

In [13]:
lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

**Explanation:**

Here we create a configuration for Low-Rank Adaptation (LoRA) with LoraConfig. It sets the rank (r) to 4 and the scaling factor (lora_alpha) to 32, and applies LoRA to the "q" and "v" modules (the query and value projections in the attention layers). A dropout rate of 0.1 is applied to the LoRA layers, and no bias parameters are trained. The configuration targets a sequence-to-sequence language model task (SEQ_2_SEQ_LM).
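How r and lora_alpha interact is easier to see on a tiny example. The sketch below assumes the standard LoRA formulation, in which a frozen weight receives the update (lora_alpha / r) * B @ A. It uses nested lists instead of tensors, the matrices are arbitrary, and `matmul` and `lora_delta` are hypothetical helpers rather than peft functions.

```python
def matmul(B, A):
    """Multiply a (d_out x r) matrix by an (r x d_in) matrix, both as nested lists."""
    return [
        [sum(B[i][k] * A[k][j] for k in range(len(A))) for j in range(len(A[0]))]
        for i in range(len(B))
    ]


def lora_delta(B, A, lora_alpha, r):
    """Return the scaled low-rank update (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    return [[scale * value for value in row] for row in matmul(B, A)]


# Toy example with rank r=1 on a 2x2 weight; the numbers are arbitrary.
B = [[1.0], [2.0]]   # shape (d_out=2, r=1)
A = [[3.0, 4.0]]     # shape (r=1, d_in=2)
delta = lora_delta(B, A, lora_alpha=2, r=1)
print(delta)

# Under this formulation, the notebook's settings give a scale of 32 / 4.
print(32 / 4)
```

Only A and B are trained; the original weight matrix is left untouched, which is why the number of trainable parameters stays small.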

In [14]:
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 1,179,648 || all params: 784,329,728 || trainable%: 0.1504


**Explanation:**

Here we apply the LoRA configuration (lora_config) to the model using get_peft_model, which wraps it with low-rank update layers for efficient fine-tuning. The second line prints how many parameters are trainable under this configuration: in this run, 1,179,648 of 784,329,728 parameters, or about 0.15%.
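The trainable-parameter figure can be reproduced by hand. The sketch below assumes shapes that are not stated in the notebook: 24 encoder layers and 24 decoder layers for flan-t5-large, with 1024-to-1024 query and value projections, and one A matrix (r x d_in) plus one B matrix (d_out x r) per adapted layer. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(num_attention_blocks, targets_per_block, d_in, d_out, r):
    """Count parameters added by LoRA: each adapted layer gets A (r x d_in) and B (d_out x r)."""
    per_layer = r * (d_in + d_out)
    return num_attention_blocks * targets_per_block * per_layer


# Assumed flan-t5-large shapes (not stated in the notebook): 24 encoder self-attention
# blocks, 24 decoder self-attention blocks, 24 decoder cross-attention blocks, and
# 1024 -> 1024 projections for "q" and "v".
attention_blocks = 24 + 24 + 24
trainable = lora_param_count(attention_blocks, targets_per_block=2, d_in=1024, d_out=1024, r=4)

all_params = 784_329_728  # total reported by print_trainable_parameters() above
print(f"{trainable:,} trainable ({100 * trainable / all_params:.4f}%)")
```

Under these assumptions the count agrees with the figure printed by print_trainable_parameters().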

In [15]:
label_pad_token_id = -100
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

**Explanation:**

Here we set up a data collator, DataCollatorForSeq2Seq, which assembles the tokenized examples into batches during training. It uses the specified tokenizer and model, pads labels with -100 (so that padded positions are ignored in the loss calculation), and pads sequence lengths up to a multiple of 8 for hardware efficiency.
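The padding rule can be sketched without the library. The code below is an illustration in plain Python, not the collator's implementation: the helper names are hypothetical, the ids are made up, and right-side padding is an assumption.

```python
def padded_length(longest, multiple):
    """Round `longest` up to the nearest multiple of `multiple`."""
    return ((longest + multiple - 1) // multiple) * multiple


def pad_labels(batch, multiple=8, pad_value=-100):
    """Right-pad every label sequence in the batch to a shared, rounded-up length."""
    target = padded_length(max(len(seq) for seq in batch), multiple)
    return [seq + [pad_value] * (target - len(seq)) for seq in batch]


# Toy label sequences of lengths 3 and 5; the ids are made up.
batch = [[12, 7, 1], [4, 9, 33, 2, 1]]
padded = pad_labels(batch)
print([len(seq) for seq in padded])
print(padded[0])
```

In this notebook preprocess_function already pads every example to 256 tokens, which is itself a multiple of 8, so the collator should have little or nothing left to pad.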

In [16]:
from huggingface_hub import notebook_login
notebook_login()

**Explanation:**

This cell logs into the Hugging Face Hub using the notebook_login function, enabling access to Hub resources such as model storage and sharing directly from the notebook.

In [17]:
output_dir="lora-flan-t5-large-qg"
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    learning_rate=1e-3,
    num_train_epochs=1,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="epoch",
    save_strategy="epoch",
    report_to="tensorboard",
    push_to_hub=True
)

**Explanation:**

Here we define the training configuration for fine-tuning the model using Seq2SeqTrainingArguments. It specifies the output directory for saving the model (lora-flan-t5-large-qg), a per-device batch size of 4, a learning rate of 1e-3, and 1 training epoch. Logging occurs at the end of each epoch, with logs stored in the logs subdirectory and reported to TensorBoard. A checkpoint is saved after each epoch, and with push_to_hub=True the model is also pushed to the Hugging Face Hub.
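These settings determine how many optimizer steps the run takes. The sketch below is a rough estimate rather than the Trainer's exact bookkeeping: it assumes a single device and no gradient accumulation (neither is set in the arguments above), uses the 936 training examples reported by the earlier Map output, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_epochs=1,
                    num_devices=1, gradient_accumulation_steps=1):
    """Estimate the number of optimizer steps for a run."""
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    return math.ceil(num_examples / effective_batch) * num_epochs


steps = optimizer_steps(num_examples=936, per_device_batch_size=4, num_epochs=1)
print(steps)
```

Under these assumptions the estimate agrees with the single step count that appears in the training log, which is what logging once per epoch would produce.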

In [18]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_tokenized_dataset,
)

**Explanation:**

Here we initialize a Seq2SeqTrainer with the model, training arguments, data collator, and tokenized training dataset. The trainer is used to fine-tune the model on this dataset with the configuration above. No evaluation dataset is passed, so the test split is not used during training.

In [19]:
model.config.use_cache = False

**Explanation:**

Sets model.config.use_cache to False so that the model does not keep past key/value states. This cache only speeds up step-by-step generation, so turning it off during fine-tuning saves memory. It can be re-enabled for inference.

In [20]:
trainer.train()
peft_save_model_id="lora-flan-t5-large-qg"
trainer.model.save_pretrained(peft_save_model_id, push_to_hub=True)
tokenizer.save_pretrained(peft_save_model_id, push_to_hub=True)
trainer.model.base_model.save_pretrained(peft_save_model_id, push_to_hub=True)

Step,Training Loss
234,2.6094


No files have been modified since last commit. Skipping to prevent empty commit.


README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.14G [00:00<?, ?B/s]

**Explanation:**

This code trains the model using the trainer.train() method. After training, it saves the fine-tuned LoRA adapter, the tokenizer, and the underlying base model under the identifier lora-flan-t5-large-qg. Each save_pretrained call is made with push_to_hub=True so that the results can be shared and reused from the Hugging Face Hub.
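Once the adapter is saved, it can be loaded back for inference. The function below is a sketch under assumptions, not part of the original notebook: it assumes the adapter files are available under the name lora-flan-t5-large-qg (a local directory, or a Hub repository that you may need to prefix with your username), and `answer_question` is a hypothetical helper. The generation settings are arbitrary.

```python
def answer_question(question, adapter_id="lora-flan-t5-large-qg", max_new_tokens=128):
    """Load the saved LoRA adapter on top of its base model and generate an answer."""
    # Imported lazily because these libraries are heavy to load.
    from peft import PeftConfig, PeftModel
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # The adapter config records which base checkpoint it was trained on.
    config = PeftConfig.from_pretrained(adapter_id)
    base = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

    # Attach the LoRA weights to the frozen base model and switch to eval mode.
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()

    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=256)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `answer_question("What is overfitting?")` would reload the base model on every invocation, so for repeated use it is better to load the model once and reuse it.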

# Saving the Fine-Tuned Model to Drive

In [21]:
! cp -r /content/lora-flan-t5-large-qg /content/drive/MyDrive/PROJECT