<a href="https://colab.research.google.com/github/MelDashti/Smart-Chatbot/blob/master/AIChatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Cloning into 'Smart-Chatbot'...
remote: Enumerating objects: 189, done.[K
remote: Counting objects: 100% (65/65), done.[K
remote: Compressing objects: 100% (51/51), done.[K
remote: Total 189 (delta 17), reused 53 (delta 10), pack-reused 124 (from 1)[K
Receiving objects: 100% (189/189), 187.60 MiB | 15.27 MiB/s, done.
Resolving deltas: 100% (47/47), done.
Updating files: 100% (86/86), done.


Here we install the necessary Libraries

In [2]:
# Install necessary libraries
!pip install trl
!pip install unsloth
!pip install pandas

# Standard library imports
import os
import warnings

# Third-party library imports
import math
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer, AutoTokenizer, AutoModelForSequenceClassification
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import Dataset

# Configure warnings and matplotlib
warnings.filterwarnings("ignore")
%matplotlib inline
plt.style.use('ggplot')

# Set device (GPU if available, otherwise CPU)
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(DEVICE)

Collecting trl
  Downloading trl-0.12.2-py3-none-any.whl.metadata (11 kB)
Collecting datasets>=2.21.0 (from trl)
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=2.21.0->trl)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets>=2.21.0->trl)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets>=2.21.0->trl)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets>=2.21.0->trl)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading trl-0.12.2-py3-none-any.whl (365 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m365.7/365.7 kB[0m [31m23.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
cuda:0


## **Data Loading and Preprocessing**
In this section, we load the dataset containing question-answer pairs for fine-tuning the chatbot.
The dataset is in JSONL format, where each line represents a JSON object containing a question and its corresponding answer.This dataset was created by scraping relevant information from the AROL Group website.
The `read_jsonl_to_df` function is defined to read the JSONL file and convert it into a Pandas DataFrame.
This DataFrame will be used for training and evaluating the chatbot model.


In [4]:

import os
os.chdir('/content/Smart-Chatbot/') # Here we set the working directory


import pandas as pd
import json
import re

def read_jsonl_to_df(file_path):
    data = []
    with open(file_path, 'r') as f:
        current_entry = {}  # Store data for the current entry
        for line in f:
            line = line.strip()
            if not line:  # Skip empty lines
                continue
            if line == '{':  # Start of a new entry
                current_entry = {}
            elif line == '}':  # End of an entry
                data.append(current_entry)
            else:
                # Handle lines with "key": "value" format
                match = re.match(r'"(.*?)":\s*"(.*?)"', line)
                if match:
                    key, value = match.groups()
                    current_entry[key] = value
                else:
                    print(f"Skipping invalid JSON line: {line}")  # Handle invalid lines
    return pd.DataFrame(data)

df_training = read_jsonl_to_df("qa.jsonl")

df_validation = read_jsonl_to_df("validation_dataset.jsonl")



Skipping invalid JSON line: } {
Skipping invalid JSON line: } {
Skipping invalid JSON line: "answer": "AROL's capping machines are capable of handling a variety of beverages including water juice beer and wine }


This section prepares data for fine-tuning: a prompt template guides the model, and data is formatted for optimal learning.This section prepares data for fine-tuning: a prompt template guides the model, and training and evaluation data is formatted for optimal learning.

In [7]:
import pandas as pd

# first we convert sample data to a DataFrame
# df = pd.DataFrame(sample_data)
# we already converted the json file to dataframe so we use it directly

data_prompt = """
You are a customer support assistant for AROL Group, specialized in bottle caps and capping technologies.
Your goal is to provide accurate, clear, and helpful responses about AROL Group's products and processes.

### question:
{}

--- Instructions ---
- Provide a concise and informative response about bottle cap manufacturing or capping technology.
- If technical details or product features are mentioned, explain them simply.
- If concerns are raised, offer relevant recommendations or solutions.
- Keep the answer focused on the specific query.

### answer:
{}
""".strip()  # Using strip to avoid trailing newlines at the end

EOS_TOKEN = "</s>"

# we use templates so that we can fine tune our model better with instructions for how to analyze the input
# data and generate the output data.

def formatting_prompt(df):
    inputs = df["question"]
    outputs = df["answer"]
    texts = []
    for input_, output in zip(inputs, outputs):
        # Add a newline before the EOS token for clarity
        text = data_prompt.format(input_, output) + "\n" + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Here we format the training data
#training_data = formatting_prompt(df_training)
#print(training_data["text"][1])

# Now we format the validation data.
#validation_data = formatted_prompt(df_validation)
#print(validation_data["text"][1])
training_data = Dataset.from_pandas(df_training) # here we convert pandas dataframe into a hugging face dataset object
training_data = training_data.map(formatting_prompt, batched=True) # Here we apply the formatting func to each element of the dataset using map method.

validation_data = Dataset.from_pandas(df_validation)
validation_data = validation_data.map(formatting_prompt, batched=True)

Map:   0%|          | 0/553 [00:00<?, ? examples/s]

Map:   0%|          | 0/102 [00:00<?, ? examples/s]

Here sets up the LLaMA 3.2 model with 3 billion parameters for fine-tuning. It uses LoRA (Low-Rank Adaptation) to efficiently train only specific parts of the model, making the process faster and less resource-intensive. The model is loaded with full precision (not quantized) for better accuracy, and gradient checkpointing is enabled to manage memory during training. Finally, it prints the number of trainable parameters for verification.


In [8]:
# WE are using lama with 1B parameters.
max_seq_length = 1024  # imo its enough for a simple AI chatbot
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B", # trying 3B, removed 4 bit quantization.
    max_seq_length=max_seq_length,
    load_in_4bit=False, # Here we ensure that the model has full precision and is not 4 bits
    # By setting false we wanna check with original precision if its better!
    dtype=None,
)
# we use parameter efficient fine tuning like we learnt in LLM which applied LORA techniques. This approach
# focuses on fine tuning only specific layers or parts of the model, rather than the entire network.
# r = 16 and lora_alpha = 16 adjusts the complexity and scaling of these adaptations.
#  target modules specifies which layers of the model should be adapted, which include key components involed
# in attention mechanisms like q_proj and k_proj and v_proj.
# use_rslora activates Rank stabalized LORA, which improves the stability of the fine tuning process.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj", "up_proj", "down_proj",],
    # q_proj, k_proj, v_proj: Handle the query, key, and value projections in the attention mechanism, essential for capturing contextual information.
    # up_proj, down_proj: Layers in feedforward networks. o_proj: Combines attention heads’ output. gate_proj: Controls flow in certain feedforward networks.
    use_rslora=True,
    use_gradient_checkpointing="unsloth",
    random_state = 32,
    loftq_config = None,
)
print(model.print_trainable_parameters())

==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/6.43G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/121 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

Unsloth 2024.12.2 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


trainable params: 24,313,856 || all params: 3,237,063,680 || trainable%: 0.7511
None


Training the Model with the Trainer API
Goal: Use the Trainer API to actually fine-tune the model on the formatted dataset. This step leverages all previous configurations for efficient training.

Process:

The formatted data is fed into the Trainer as input for model training.
The Trainer uses LoRA fine-tuning to adjust only specific layers, optimizing performance while keeping memory usage low.
Purpose: This final step leverages all previous configurations and formatted data to train the model. The Trainer applies gradient updates to the specified layers according to LoRA parameters, optimizing the model for the task without requiring massive resources.

In [11]:
trainer=SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=training_data,
    eval_dataset = validation_data,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc = 2,
    # Consider disabling packing if not needed:
    # packing=False,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=4, # changed from 2 to 4  Larger effective batch sizes can sometimes lead to more stable training.
        gradient_accumulation_steps=2, # from 4 to 2
        num_train_epochs=40,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=50, # the logging parameter determines how frequently in terms of training
        # steps the the trainer logs metrics like training loss, learning rate, and other available metrics.
        evaluation_strategy = "steps",
        eval_steps = 200,
        save_steps = 200,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=100,
        output_dir="output",
        run_name = "my_llama_chatbot_finetune_run",
        report_to = "wandb",
        seed=0,
    ),
)

# Here we train
trainer.train()

# Here we manually save the fine tuned model
trainer.save_model("/output2")
tokenizer.save_pretrained("/output2")

# Evaluation & Perplexity
eval_results = trainer.evaluate()
eval_loss = eval_results["eval_loss"]
perplexity = math.exp(eval_loss)
print(f"Evaluation loss: {eval_loss}")
print(f"Perplexity: {perplexity}")



Map (num_proc=2):   0%|          | 0/553 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/102 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 553 | Num Epochs = 40
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 2,760
 "-____-"     Number of trainable parameters = 24,313,856
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss,Validation Loss
200,0.2007,0.32131
400,0.1204,0.325637
600,0.0923,0.3791
800,0.0746,0.378767
1000,0.0635,0.415229
1200,0.0599,0.414516
1400,0.0573,0.423272
1600,0.0534,0.450093
1800,0.0513,0.457502
2000,0.0484,0.488911


Evaluation loss: 0.5593435168266296
Perplexity: 1.7495235904188378


/content/Smart-Chatbot


In [16]:

model_name = "output/checkpoint-2760"  # Use your actual final checkpoint directory
max_seq_length = 1024

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=False,  # If you trained in full precision or 8-bit, adjust accordingly
    dtype=None,
)


==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [20]:
data_prompt = """
You are a customer support assistant for AROL Group...

### question:
{}

### answer:
""".strip()

user_query = "What kind of capping technologies does AROL Group offer?"
full_prompt = data_prompt.format(user_query)

model_name = "output/checkpoint-2760"
max_seq_length = 1024

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=False,
    dtype=None,
)

# Determine model device
model_device = next(model.parameters()).device

# Tokenize input and move to model device
inputs = tokenizer(full_prompt, return_tensors="pt")
inputs = {k: v.to(model_device) for k, v in inputs.items()}

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)


==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 5.06 MiB is free. Process 4145 has 14.74 GiB memory in use. Of the allocated memory 14.58 GiB is allocated by PyTorch, and 4.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [3]:
!git add .

fatal: not a git repository (or any of the parent directories): .git


## Inference Mode: Applying Knowledge to User Queries

Now that the model is trained, it's ready to assist users with their inquiries about AROL Group products and services. In this phase, the model leverages the knowledge gained during fine-tuning to generate informative and helpful responses.

### User Interaction:

Users will input their questions or requests related to AROL Group's offerings, such as:

*   "What types of bottle caps are suitable for carbonated drinks?"
*   "How do I maintain my AROL capping machine for optimal performance?"
*   "Can AROL's solutions be customized for my specific production needs?"

### Model Response:

The model processes the user's input and generates a response based on the information it has learned. These responses will be:

*   **Tailored to AROL Group's domain:** The model's knowledge is focused on bottle caps, capping technologies, and related services offered by AROL Group.
*   **Informative and accurate:** The responses aim to provide clear and relevant answers to user queries, leveraging the data it was trained on.

In [8]:
!git push

Everything up-to-date


In [None]:
text = "Can you list different versions of the Eagle PK"
model = FastLanguageModel.for_inference(model)
# Here we prepare the input for the model
inputs = tokenizer(
[
    data_prompt.format(
        #instructions
        text,
        #answer
        "",
    )
], return_tensors = "pt").to("cuda")

# Here we generate the response
outputs = model.generate(**inputs, max_new_tokens = 1024)
# Here we decode the response
# answer=tokenizer.batch_decode(outputs)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
answer = answer.split("### answer:")[-1].strip()
print("Answer of the question is:", answer)


Answer of the question is: Yes, the Eagle PK machine is available in different versions, including an optional caps sorter and fully automatic bottle-neck guide assembly.
</s>


## Evaluating Results
Here we will evaluate the fine tuned chatbot's performance using precision and recall with cosine similarity by using a seprate dataset for evaluation that wasn't used during training so that we see how the models behaves with unseen data.
We choose to evaluate our AI chatbot using precision and recall with cosine similarity to ensure it provides accurate product information. Precision measures the relevance of the chatbot's responses, guaranteeing that users receive correct information, while recall assesses its ability to capture all relevant responses, minimizing missed information. Cosine similarity evaluates the semantic similarity between the chatbot's responses and the correct information, which is essential since users may phrase their questions differently. This approach ensures that the chatbot understands and responds appropriately to diverse inquiries, making it well-suited for a corporate environment where context and accuracy are critical. Overall, this evaluation method provides a comprehensive assessment of the chatbot's performance in delivering relevant product information.
We did not use BLEU score for evaluation because it focuses on exact n-gram overlaps, which can penalize semantically correct responses that differ in wording. This limitation makes it less suitable for our chatbot, which needs to provide contextually relevant and accurate product information, regardless of variations in user queries. Instead, precision, recall, and cosine similarity better capture the chatbot's effectiveness in understanding and responding to diverse inquiries.

In [None]:
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

model_path = "output/checkpoint-400"  # Path to the saved model directory

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the model for inference (this time ensuring we don't return a tuple)
model = FastLanguageModel.from_pretrained(model_path, load_in_4bit=True, dtype=torch.bfloat16).to("cuda")
model = FastLanguageModel.prepare_model_for_generation(model)

==((====))==  Unsloth 2024.11.11: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Device does not support bfloat16. Will change to float16.


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 1.06 MiB is free. Process 2448 has 14.74 GiB memory in use. Of the allocated memory 14.62 GiB is allocated by PyTorch, and 16.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)