<a href="https://colab.research.google.com/github/KaifAhmad1/Agri-Llama/blob/main/Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installing Necessary Dependencies:**

In [1]:
!pip install -qU bitsandbytes
!pip install -qU trl
!pip install -qU transformers
!pip install -qU peft
!pip install -qU optimum
!pip install -qU datasets
!pip install -qU accelerate
!pip install -qU nltk
!pip install -qU rouge_score

**Necessary Imports:**

In [2]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
    pipeline,
    logging
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from datasets import Dataset
from huggingface_hub import notebook_login
from google.colab import drive
import plotly.express as px
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

**Set Up Environment:**

In [3]:
notebook_login()
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Load Data:**

In [4]:
# Load data
file_path = '/content/drive/MyDrive/Network-QA-Dataset.csv'
data = pd.read_csv(file_path)
data

Unnamed: 0,Questions,Answers,Context Info,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109,Unnamed: 110
0,What is the scope of the technical specificati...,The scope of the technical specification is de...,"The technical specification, titled ""3GPP TS 2...",,,,,,,,...,,,,,,,,,,
1,Where can specifications and reports for the i...,Specifications and reports for the implementat...,,,,,,,,,...,,,,,,,,,,
2,What are the different restoration indicators ...,The document discusses various restoration ind...,,,,,,,,,...,,,,,,,,,,
3,What procedures are outlined for the restorati...,Procedures for the restoration of data in the ...,,,,,,,,,...,,,,,,,,,,
4,In which section can information about the res...,Information about the restoration of data in ...,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1266,"In the context of CAPIF deployment models, wha...","""NEF implements the CAPIF architecture"" means...",,,,,,,,,...,,,,,,,,,,
1267,"Explain the concept of ""Distributed deployment...","The ""Distributed deployment of the NEF complia...",,,,,,,,,...,,,,,,,,,,
1268,"According to Annex D, what is the document's a...",Annex D provides a table (Table D-1) that illu...,,,,,,,,,...,,,,,,,,,,
1269,What kind of information does Annex E (Configu...,Annex E specifies configuration data for CAPIF...,,,,,,,,,...,,,,,,,,,,


In [5]:
network_data = data[['Questions', 'Answers', 'Context Info']]
network_data

Unnamed: 0,Questions,Answers,Context Info
0,What is the scope of the technical specificati...,The scope of the technical specification is de...,"The technical specification, titled ""3GPP TS 2..."
1,Where can specifications and reports for the i...,Specifications and reports for the implementat...,
2,What are the different restoration indicators ...,The document discusses various restoration ind...,
3,What procedures are outlined for the restorati...,Procedures for the restoration of data in the ...,
4,In which section can information about the res...,Information about the restoration of data in ...,
...,...,...,...
1266,"In the context of CAPIF deployment models, wha...","""NEF implements the CAPIF architecture"" means...",
1267,"Explain the concept of ""Distributed deployment...","The ""Distributed deployment of the NEF complia...",
1268,"According to Annex D, what is the document's a...",Annex D provides a table (Table D-1) that illu...,
1269,What kind of information does Annex E (Configu...,Annex E specifies configuration data for CAPIF...,


In [6]:
def process_data_sample(example):
    """Format one dataset row as a single training prompt string."""
    # Replace missing values (NaN) with an empty string;
    # a bare str() call would turn NaN into the literal text "nan"
    question = str(example['Questions']) if pd.notna(example['Questions']) else ""
    answer = str(example['Answers']) if pd.notna(example['Answers']) else ""
    context_info = str(example['Context Info']) if pd.notna(example['Context Info']) else ""

    # Assemble the prompt: system instruction, user query, answer, then optional context
    processed_example = (
        "You are a Question Answering System designed to assist users with queries. "
        "Your capabilities include providing technical details, offering implementation guidance, "
        "and staying updated on telecommunications standards.\n\n"
        f"User Query:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        f"Context Information:\n{context_info}"
    )
    return processed_example
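
To make the prompt layout concrete, here is a minimal sketch that rebuilds the same template without pandas and applies it to a row whose context is missing. The helper names `is_missing` and `build_prompt` and the sample row are hypothetical and exist only for illustration; the template text is copied from `process_data_sample`.

```python
import math

SYSTEM_PROMPT = (
    "You are a Question Answering System designed to assist users with queries. "
    "Your capabilities include providing technical details, offering implementation guidance, "
    "and staying updated on telecommunications standards.\n\n"
)

def is_missing(value):
    """Return True for None or a float NaN (the way pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def build_prompt(row):
    """Hypothetical stand-in for process_data_sample that works on a plain dict."""
    fields = {}
    for key in ("Questions", "Answers", "Context Info"):
        value = row.get(key)
        fields[key] = "" if is_missing(value) else str(value)
    return (
        SYSTEM_PROMPT
        + f"User Query:\n{fields['Questions']}\n\n"
        + f"Answer:\n{fields['Answers']}\n\n"
        + f"Context Information:\n{fields['Context Info']}"
    )

# Made-up placeholder row; the context cell is empty, as in most rows of the dataset preview
sample = {
    "Questions": "Example question?",
    "Answers": "Example answer.",
    "Context Info": float("nan"),
}
print(build_prompt(sample))
```

With an empty context the prompt simply ends after the `Context Information:` label, so every training example keeps the same three-part layout.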

In [7]:
# Build the 'text' column by applying 'process_data_sample' to every row.
# Work on an explicit copy so the assignment does not raise pandas' SettingWithCopyWarning.
network_data = network_data.copy()
network_data['text'] = network_data.apply(process_data_sample, axis=1)


In [8]:
# Split data
train_data, test_data = train_test_split(network_data, test_size=0.2, random_state=42)

In [9]:
model_name = 'mistralai/Mistral-7B-v0.1'

In [10]:
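# 4-bit quantization settings for loading the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the model weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation after dequantization
)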
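
As a rough illustration of why 4-bit loading matters on a Colab GPU, the sketch below estimates the memory taken by the weights alone at 16-bit and 4-bit precision. The parameter count is an assumed approximate figure for a 7B-class model, and the estimate ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so treat the numbers as a lower bound rather than a measurement.

```python
# Assumed approximate parameter count for a 7B-class model (illustrative, not measured)
APPROX_PARAMS = 7.24e9

def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

fp16_gib = weight_memory_gib(APPROX_PARAMS, 16)
nf4_gib = weight_memory_gib(APPROX_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f" 4-bit weights: ~{nf4_gib:.1f} GiB")
```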

In [11]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
# Tokenization and Padding
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token

**LoRA and SFT**

In [13]:
model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [14]:
# LoRA Config
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias='none',
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]
)
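
The size of the adapter defined above can be estimated by hand. LoRA adds two low-rank matrices per targeted module, `A` of shape `r x d_in` and `B` of shape `d_out x r`, and scales their product by `lora_alpha / r`. The sketch below applies that formula to `q_proj` and `v_proj`; the layer count and projection sizes are assumptions taken from the published Mistral-7B configuration (32 layers, hidden size 4096, 8 key/value heads of dimension 128), not values read from the loaded model.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

R, ALPHA = 64, 16          # from the LoraConfig above
NUM_LAYERS = 32            # assumed: Mistral-7B decoder layers
HIDDEN = 4096              # assumed: hidden size (q_proj maps 4096 -> 4096)
KV_DIM = 8 * 128           # assumed: v_proj maps 4096 -> 1024 (grouped-query attention)

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total_trainable = per_layer * NUM_LAYERS
scaling = ALPHA / R

print(f"Trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling factor (lora_alpha / r): {scaling}")
```

Under these assumptions the adapter holds about 27 million trainable parameters, and the update is scaled by 0.25 because `lora_alpha` is smaller than `r`.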

In [15]:
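# Training Arguments
training_arguments = TrainingArguments(
    output_dir='Mistral-Network-QnA-System',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size: 4 x 4 = 16 per device
    optim='paged_adamw_32bit',
    learning_rate=2e-4,
    lr_scheduler_type='cosine',
    save_strategy='epoch',           # write a checkpoint at the end of every epoch
    logging_steps=10,
    save_steps=10,                   # not used while save_strategy='epoch'
    num_train_epochs=1,              # overridden by max_steps below
    max_steps=200,                   # training stops after 200 optimizer steps
    fp16=True,                       # fp16 mixed precision (bnb_config computes in bfloat16)
    warmup_ratio=0.05,
    push_to_hub=False,
)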
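
The combination of `lr_scheduler_type='cosine'`, `warmup_ratio=0.05` and `max_steps=200` implies a learning rate that rises linearly over the first steps and then decays along a half cosine. The function below is a minimal sketch of that shape, assuming the warmup length is `ceil(max_steps * warmup_ratio)` and the decay ends at zero; `lr_at` is a hypothetical helper, not part of the Trainer API.

```python
import math

def lr_at(step, base_lr=2e-4, max_steps=200, warmup_ratio=0.05):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 5, 10, 105, 200):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
```

Under these assumptions the warmup covers the first 10 steps, the rate peaks at `2e-4`, and it falls to half of that midway through the decay phase.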

In [16]:
# SFT Trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=Dataset.from_pandas(train_data[['text']]),
    peft_config=peft_config,
    dataset_text_field='text',
    args=training_arguments,
    tokenizer=tokenizer,
    packing=False,
    max_seq_length=264
)

Map:   0%|          | 0/1016 [00:00<?, ? examples/s]



In [17]:
# Train the model
trainer.train()



Step,Training Loss
10,2.3916
20,1.5875
30,1.3088
40,1.2271
50,1.1922
60,1.213
70,1.1436
80,1.1465
90,1.0794
100,1.0834


Checkpoint destination directory Mistral-Network-QnA-System/checkpoint-63 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Mistral-Network-QnA-System/checkpoint-127 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Mistral-Network-QnA-System/checkpoint-190 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Mistral-Network-QnA-System/checkpoint-200 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=200, training_loss=1.178642873764038, metrics={'train_runtime': 2626.8495, 'train_samples_per_second': 1.218, 'train_steps_per_second': 0.076, 'total_flos': 2.5708172149653504e+16, 'train_loss': 1.178642873764038, 'epoch': 3.15})
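
The logged figures can be cross-checked with a little arithmetic. The sketch below assumes a single GPU and takes the 1016 training examples from the `Map` progress line; it reproduces the reported epoch count, the epoch-end checkpoint steps, and the throughput numbers. It is a back-of-the-envelope check, not the Trainer's internal bookkeeping.

```python
import math

train_examples = 1016        # from the "Map: ... 0/1016" progress line
per_device_batch = 4         # per_device_train_batch_size
grad_accum = 4               # gradient_accumulation_steps
max_steps = 200              # optimizer steps actually run
train_runtime = 2626.8495    # seconds, from TrainOutput

effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(train_examples / per_device_batch)

# Each optimizer step consumes grad_accum batches
epochs_seen = max_steps * grad_accum / batches_per_epoch

# Optimizer steps completed at the end of each full epoch
epoch_end_steps = [
    (k * batches_per_epoch) // grad_accum
    for k in range(1, int(epochs_seen) + 1)
]

samples_per_second = max_steps * effective_batch / train_runtime
steps_per_second = max_steps / train_runtime

print(f"effective batch size: {effective_batch}")
print(f"epochs covered:       {epochs_seen:.2f}")
print(f"epoch-end steps:      {epoch_end_steps}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

The results line up with the log: about 3.15 epochs, checkpoints at steps 63, 127 and 190 before the final one at step 200, and roughly 1.218 samples per second. This also shows that `max_steps=200` took precedence over `num_train_epochs=1`.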

In [31]:
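import plotly.express as px

# Collect the log entries that carry a training loss (one every `logging_steps` steps)
loss_logs = [entry for entry in trainer.state.log_history if 'loss' in entry]
train_steps = [entry['step'] for entry in loss_logs]
train_losses = [entry['loss'] for entry in loss_logs]

# Plot loss against the actual optimizer step rather than the index of the log entry
fig = px.line(x=train_steps, y=train_losses, title='Training Loss Over Steps',
              labels={'x': 'Steps', 'y': 'Training Loss'})

# Show the plot
fig.show()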