<a href="https://colab.research.google.com/github/MelDashti/Smart-Chatbot/blob/master/AIChatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here we install the necessary libraries.

In [None]:
# Install necessary libraries
!pip install trl
!pip install unsloth
!pip install pandas

# Standard library imports
import math
import os
import warnings

# Third-party library imports
# Unsloth is imported before trl/transformers so that its patches are applied to them
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer, AutoTokenizer, AutoModelForSequenceClassification
from datasets import Dataset

# Configure warnings and matplotlib
warnings.filterwarnings("ignore")
%matplotlib inline
plt.style.use('ggplot')

# Set device (GPU if available, otherwise CPU)
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(DEVICE)

## **Data Loading and Preprocessing**
In this section, we load the question-answer pairs used to fine-tune the chatbot. The dataset was created by scraping relevant information from the AROL Group website.
Although the files use the `.jsonl` extension, each entry is a JSON object spread over several lines (an opening `{`, one `"key": "value"` pair per line, and a closing `}`), so a standard one-object-per-line JSONL reader cannot parse them.
The `read_jsonl_to_df` function therefore parses this layout itself and converts the entries into a Pandas DataFrame.
This DataFrame is then used for training and evaluating the chatbot model.


In [None]:

import os
import json
import pandas as pd

os.chdir('/content/Smart-Chatbot/')  # Set the working directory

def read_jsonl_to_df(file_path):
    """Parse a file of multi-line JSON objects (one "key": "value" pair per line) into a DataFrame."""
    data = []
    current_entry = {}  # Key/value pairs of the entry being read
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # Skip empty lines
                continue
            if line == '{':  # Start of a new entry
                current_entry = {}
            elif line in ('}', '},'):  # End of an entry
                data.append(current_entry)
            else:
                # Parse a '"key": "value",' line as a one-pair JSON object. Unlike a regex,
                # json.loads handles escaped quotes (\") and other JSON escapes in the value.
                try:
                    current_entry.update(json.loads('{' + line.rstrip(',') + '}'))
                except json.JSONDecodeError:
                    print(f"Skipping invalid JSON line: {line}")  # Report lines that cannot be parsed
    return pd.DataFrame(data)

df_training = read_jsonl_to_df("qa.jsonl")

df_validation = read_jsonl_to_df("validation_dataset.jsonl")
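Because `read_jsonl_to_df` relies on the exact line layout (a lone `{`, one pair per line, a lone `}`), a file whose objects are laid out differently (for example, one object per line, as in true JSONL) would have its lines skipped. A more layout-independent alternative, sketched below with only Python's standard `json` module, is to let `json.JSONDecoder.raw_decode` consume one complete object at a time, however many lines it spans. The function name `parse_json_objects` and the sample records are hypothetical; unlike the notebook's function, this sketch raises `json.JSONDecodeError` on malformed input instead of skipping lines.

```python
import json

def parse_json_objects(text):
    """Decode every JSON object in `text`, whether each spans one line or several."""
    decoder = json.JSONDecoder()
    records, pos = [], 0
    while True:
        # Skip whitespace (and stray commas) between consecutive objects.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text):
            return records
        # raw_decode parses one complete JSON value starting at `pos`
        # and returns it together with the index just past its end.
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)

# Made-up sample in the multi-line layout that read_jsonl_to_df expects.
sample = """{
    "question": "Placeholder question one?",
    "answer": "An answer with an escaped \\"quote\\" inside."
}
{
    "question": "Placeholder question two?",
    "answer": "Second answer."
}"""

records = parse_json_objects(sample)
print(records[0]["answer"])  # An answer with an escaped "quote" inside.
```

The same DataFrame could then be built with `pd.DataFrame(parse_json_objects(open("qa.jsonl", encoding="utf-8").read()))`.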

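This section prepares the data for fine-tuning: each question-answer pair is inserted into a prompt template that gives the model its role and answering instructions, and an end-of-sequence (EOS) token is appended so the model learns where an answer ends.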

In [None]:
# df_training and df_validation were already created from the JSON files above, so we use them directly.

data_prompt = """
You are a customer support assistant for AROL Group, specialized in bottle caps and capping technologies.
Your goal is to provide accurate, clear, and helpful responses about AROL Group's products and processes.

### question:
{}

--- Instructions ---
- Provide a concise and informative response about bottle cap manufacturing or capping technology.
- If technical details or product features are mentioned, explain them simply.
- If concerns are raised, offer relevant recommendations or solutions.
- Keep the answer focused on the specific query.

### answer:
{}
""".strip()  # strip() removes the leading and trailing newlines

# Llama 3.2 does not use "</s>" as its end-of-sequence token. A literal "</s>" is tokenized as
# ordinary text, so the model learns to print it instead of stopping. Use the tokenizer's own
# EOS token instead (for the base Llama 3.2 checkpoint this is "<|end_of_text|>").
EOS_TOKEN = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B").eos_token

# The template wraps each example in instructions that tell the model how to treat the
# input question and how to shape the answer it generates.
def formatting_prompt(batch):
    texts = []
    for question, answer in zip(batch["question"], batch["answer"]):
        # A newline separates the answer from the EOS token
        texts.append(data_prompt.format(question, answer) + "\n" + EOS_TOKEN)
    return {"text": texts}

# Convert the pandas DataFrames into Hugging Face Dataset objects, then apply the
# formatting function batch by batch with map()
training_data = Dataset.from_pandas(df_training)
training_data = training_data.map(formatting_prompt, batched=True)

validation_data = Dataset.from_pandas(df_validation)
validation_data = validation_data.map(formatting_prompt, batched=True)

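With `batched=True`, `Dataset.map` passes the formatting function a dictionary that maps each column name to a list of values (up to 1,000 rows per call by default) and adds the returned `text` column to every row. The sketch below mimics that contract without the `datasets` library so the resulting string layout is easy to inspect. The shortened `TEMPLATE` is a hypothetical stand-in for `data_prompt`, `format_batch` is a hypothetical name, and `<|end_of_text|>` is assumed to be the Llama 3.2 base model's EOS string.

```python
# Hypothetical stand-ins: a shortened template and the assumed Llama 3.2 EOS string.
TEMPLATE = "### question:\n{}\n\n### answer:\n{}"
EOS = "<|end_of_text|>"

def format_batch(batch):
    """Same shape as formatting_prompt: dict of column lists in, {'text': [...]} out."""
    return {
        "text": [
            TEMPLATE.format(q, a) + "\n" + EOS
            for q, a in zip(batch["question"], batch["answer"])
        ]
    }

# A batch as Dataset.map(batched=True) would deliver it: one list per column.
batch = {"question": ["Q1?", "Q2?"], "answer": ["A1.", "A2."]}
out = format_batch(batch)
print(out["text"][0])
```

Each returned string is one complete training example; the model learns to stop once it has produced the EOS token after the answer.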

This cell sets up the 3-billion-parameter Llama 3.2 base model for fine-tuning. It uses LoRA (Low-Rank Adaptation) to train small adapter matrices attached to selected layers instead of all model weights, which makes training faster and less resource-intensive. The model is loaded without 4-bit quantization (with `dtype=None`, Unsloth picks float16 or bfloat16 depending on the GPU), and gradient checkpointing is enabled to reduce memory use during training. Finally, it prints the number of trainable parameters for verification.


In [None]:
# We are using Llama 3.2 with 3B parameters.
max_seq_length = 1024  # long enough for the short question-answer pairs in this dataset
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B",  # trying 3B, with 4-bit quantization removed
    max_seq_length=max_seq_length,
    load_in_4bit=False,  # No 4-bit quantization: keep 16-bit weights to check whether accuracy improves
    dtype=None,  # None lets Unsloth pick float16 or bfloat16 for the GPU
)
# Parameter-efficient fine-tuning with LoRA: the base weights stay frozen and small low-rank
# adapter matrices are trained on selected layers instead of the entire network.
# r=16 sets the rank (size) of the adapters; lora_alpha=16 scales their contribution.
# target_modules lists the layers that receive adapters: the attention projections
# (q_proj, k_proj, v_proj, o_proj) and the MLP projections (gate_proj, up_proj, down_proj).
# use_rslora enables rank-stabilized LoRA, which scales the adapters by lora_alpha / sqrt(r)
# instead of lora_alpha / r to keep training stable across ranks.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    # q_proj, k_proj, v_proj: query, key, and value projections in attention, which capture contextual information.
    # o_proj: combines the attention heads' outputs. gate_proj, up_proj, down_proj: the gated feed-forward (MLP) block.
    use_rslora=True,
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient gradient checkpointing
    random_state=32,
    loftq_config=None,
)
model.print_trainable_parameters()  # prints the counts itself and returns None

==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Unsloth 2024.12.2 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


trainable params: 24,313,856 || all params: 3,237,063,680 || trainable%: 0.7511
None
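The trainable-parameter count above follows directly from the LoRA setup: each adapted projection with input size d_in and output size d_out gets two matrices, A (r × d_in) and B (d_out × r), so it adds r·(d_in + d_out) parameters. The sketch below reproduces the 24,313,856 figure. The layer shapes are assumed values from the published Llama 3.2 3B configuration, not anything read from this notebook. It also shows the adapter scaling that `use_rslora=True` changes.

```python
import math

# Assumed Llama 3.2 3B shapes (from its published config, not read from the model here):
HIDDEN = 3072        # hidden_size
INTERMEDIATE = 8192  # intermediate_size (MLP width)
KV_DIM = 8 * 128     # num_key_value_heads * head_dim (grouped-query attention)
LAYERS = 28          # num_hidden_layers (matches "patched 28 layers" above)
R = 16
LORA_ALPHA = 16

# (in_features, out_features) of each adapted projection in one decoder layer
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per projection: r * (in + out) parameters.
per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in projections.values())
trainable = per_layer * LAYERS
print(f"{trainable:,}")  # 24,313,856

# The low-rank update is scaled before being added to the frozen weight:
standard_scaling = LORA_ALPHA / R           # plain LoRA: alpha / r
rslora_scaling = LORA_ALPHA / math.sqrt(R)  # rank-stabilized LoRA: alpha / sqrt(r)
print(standard_scaling, rslora_scaling)     # 1.0 4.0
```

With r = lora_alpha = 16, switching to rsLoRA therefore multiplies the adapter's contribution by 4 compared with plain LoRA.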


## Training the Model with the Trainer API

Goal: Use TRL's `SFTTrainer`, a subclass of the Hugging Face `Trainer` for supervised fine-tuning, to fine-tune the model on the formatted dataset.

Process: The `text` column of the formatted training data is tokenized and fed to the trainer, and the validation data is evaluated every 200 steps.
Because the model was wrapped with LoRA adapters in the previous step, gradient updates reach only the adapter weights (about 0.75% of all parameters), which keeps memory usage low.

Purpose: This step combines the model, LoRA configuration, and formatted data from the previous cells to adapt the model to AROL Group's question-answer domain without requiring massive resources.
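Two numbers in this configuration can be checked by hand: the total number of optimizer steps and the learning rate at a given step. The sketch below assumes the Trainer counts floor(ceil(examples / batch size) / gradient accumulation steps) optimizer steps per epoch, and that `lr_scheduler_type="linear"` warms up linearly over `warmup_steps` and then decays linearly to zero. With the 553 training examples reported in the training banner, it reproduces the banner's total of 2,760 steps. The helper names are hypothetical.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    # Batches the dataloader yields per epoch (the last partial batch is kept)
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    # One optimizer step per `grad_accum` batches, at least one per epoch
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    # Linear warm-up from 0 to base_lr, then linear decay back to 0 at total_steps
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = total_optimizer_steps(553, 4, 2, 40)
print(total)  # 2760
print(linear_warmup_lr(50, 3e-4, 100, total))    # half-way through warm-up
print(linear_warmup_lr(1430, 3e-4, 100, total))  # half-way through the decay
```

Both calls print approximately 1.5e-4, half of the peak learning rate of 3e-4.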

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=training_data,
    eval_dataset=validation_data,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    # Packing (concatenating short examples into full-length sequences) is off by default:
    # packing=False,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=4,  # changed from 2 to 4
        gradient_accumulation_steps=2,  # changed from 4 to 2, so the effective batch size stays 4 x 2 = 8
        num_train_epochs=40,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=50,  # how often (in training steps) the trainer logs metrics such as training loss and learning rate
        eval_strategy="steps",  # named evaluation_strategy in older transformers releases
        eval_steps=200,
        save_steps=200,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=100,
        output_dir="output",
        run_name="my_llama_chatbot_finetune_run",
        report_to="wandb",
        seed=0,
    ),
)

# Train the model
trainer.train()

# Manually save the fine-tuned model (for a PEFT model, the LoRA adapter) and the tokenizer
trainer.save_model("/output2")
tokenizer.save_pretrained("/output2")

# Evaluation & perplexity
eval_results = trainer.evaluate()
eval_loss = eval_results["eval_loss"]
perplexity = math.exp(eval_loss)
print(f"Evaluation loss: {eval_loss}")
print(f"Perplexity: {perplexity}")


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 553 | Num Epochs = 40
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 2
\        /    Total batch size = 8 | Total steps = 2,760
 "-____-"     Number of trainable parameters = 24,313,856
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


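| Step | Training Loss | Validation Loss |
|-----:|--------------:|----------------:|
| 200 | 0.2007 | 0.32131 |
| 400 | 0.1204 | 0.325637 |
| 600 | 0.0923 | 0.3791 |
| 800 | 0.0746 | 0.378767 |
| 1000 | 0.0635 | 0.415229 |
| 1200 | 0.0599 | 0.414516 |
| 1400 | 0.0573 | 0.423272 |
| 1600 | 0.0534 | 0.450093 |
| 1800 | 0.0513 | 0.457502 |
| 2000 | 0.0484 | 0.488911 |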


Evaluation loss: 0.5593435168266296
Perplexity: 1.7495235904188378
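Perplexity is the exponential of the average negative log-likelihood per predicted token, so a perplexity of about 1.75 means the model is, on average, roughly as uncertain as a choice among 1.75 equally likely tokens. The Trainer's `eval_loss` is (approximately, since it averages per-batch means) the mean token-level cross-entropy, which is why `math.exp(eval_loss)` gives the perplexity. A minimal sketch with a hypothetical `perplexity` helper:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per predicted token)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that gives every correct token probability 0.5 behaves like
# a uniform choice between 2 options, so its perplexity is about 2.
print(perplexity([math.log(0.5)] * 4))

# The notebook's value: exp of the reported evaluation loss.
print(math.exp(0.5593435168266296))  # ~1.7495
```

As configured here (no completion-only data collator), `SFTTrainer` computes the loss over the whole `text` field, so this loss, and therefore the perplexity, also covers the fixed template tokens, not only the answers.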


## Inference Mode: Applying Knowledge to User Queries

Now that the model is trained, it's ready to assist users with their inquiries about AROL Group products and services. In this phase, the model leverages the knowledge gained during fine-tuning to generate informative and helpful responses.

### User Interaction:

Users will input their questions or requests related to AROL Group's offerings, such as:

*   "What types of bottle caps are suitable for carbonated drinks?"
*   "How do I maintain my AROL capping machine for optimal performance?"
*   "Can AROL's solutions be customized for my specific production needs?"

### Model Response:

The model processes the user's input and generates a response based on the information it has learned. These responses will be:

*   **Tailored to AROL Group's domain:** The model's knowledge is focused on bottle caps, capping technologies, and related services offered by AROL Group.
*   **Informative and accurate:** The responses aim to provide clear and relevant answers to user queries, leveraging the data it was trained on.
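A generated sequence contains the prompt followed by the model's continuation, so the answer has to be separated from the template before it is shown to the user. A minimal post-processing sketch, assuming the `### answer:` marker from the training template. Cutting at a literal `</s>` is only needed for a model that learned `</s>` as plain text. The function name `extract_answer` is hypothetical.

```python
def extract_answer(generated_text, marker="### answer:", stray_eos="</s>"):
    """Return the text after the last answer marker, cut at a leftover literal EOS string."""
    # Everything after the last marker is the model's answer (plus any trailing junk).
    answer = generated_text.split(marker)[-1]
    # A model trained with a literal "</s>" may print it; keep only what precedes it.
    answer = answer.split(stray_eos)[0]
    return answer.strip()

raw = "### question:\nQ?\n\n### answer:\nSome answer.\n</s> trailing text"
print(extract_answer(raw))  # Some answer.
```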

In [None]:

import os
os.chdir('/content/Smart-Chatbot/') # Here we set the working directory

Everything up-to-date


In [None]:
import os
import traceback
import torch
from unsloth import FastLanguageModel

# Diagnostic function
def run_diagnostics():
    print("=== System Diagnostics ===")
    print("PyTorch Version:", torch.__version__)
    print("CUDA Available:", torch.cuda.is_available())
    print("CUDA Device Count:", torch.cuda.device_count())

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            print(f"CUDA Device {i}:")
            print(f"  Name: {torch.cuda.get_device_name(i)}")
            print(f"  Total Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB")

    print("\n=== Checkpoint Validation ===")
    # Candidate locations, in order of preference. Note that the lowest validation
    # loss in the training log was reached at step 200, not 1200.
    checkpoint_paths = [
        "output/checkpoint-1200",
        "/output2",
        "output",
        "/output"
    ]

    valid_paths = [path for path in checkpoint_paths if os.path.exists(path)]
    print("Potential Valid Checkpoint Paths:", valid_paths)

    return valid_paths

# Model loading function with error handling
def load_fine_tuned_model(checkpoint_path):
    try:
        print(f"\nAttempting to load model from: {checkpoint_path}")

        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=checkpoint_path,
            load_in_4bit=False,  # Load 16-bit weights, matching how the model was fine-tuned
            device_map='auto'
        )

        # Switch Unsloth to its faster inference mode
        FastLanguageModel.for_inference(model)

        print("Model loaded successfully!")
        return model, tokenizer

    except Exception as e:
        print(f"Error loading model: {e}")
        traceback.print_exc()
        return None, None

# Inference function
def run_inference(model, tokenizer, query):
    if model is None or tokenizer is None:
        print("Model or tokenizer is not initialized!")
        return None

    try:
        # Reuse the exact training template (data_prompt from the data-preparation cell; run it
        # first) with an empty answer, so the prompt ends with "### answer:" as during fine-tuning.
        # A different or re-indented template at inference time degrades the answers.
        prompt = data_prompt.format(query, "")

        # Tokenize and move the input to the model's device
        inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

        # Generate a response and decode only the newly generated tokens (not the prompt)
        outputs = model.generate(**inputs, max_new_tokens=512)
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        answer = tokenizer.decode(new_tokens, skip_special_tokens=True)

        # A model trained with a literal "</s>" marker prints it as text; cut the answer there
        return answer.split("</s>")[0].strip()

    except Exception as e:
        print(f"Inference error: {e}")
        traceback.print_exc()
        return None

# Save model and tokenizer to the Hugging Face Hub
def save_to_hub(model, tokenizer, repo_name, token):
    try:
        # Push the model (for a PEFT model, the LoRA adapter weights) to the Hub
        model.push_to_hub(repo_name, token=token)

        # Push the tokenizer to the Hub
        tokenizer.push_to_hub(repo_name, token=token)

        print("Model and tokenizer saved to Hugging Face Hub successfully!")

    except Exception as e:
        print(f"Error saving to Hugging Face Hub: {e}")
        traceback.print_exc()

# Main execution

# Run system diagnostics
valid_paths = run_diagnostics()

# Stop here instead of failing later with an IndexError on valid_paths[0]
if not valid_paths:
    raise FileNotFoundError("No valid checkpoint paths found! Check your model save locations.")

# Load the model from the first valid path
model, tokenizer = load_fine_tuned_model(valid_paths[0])

if model is not None and tokenizer is not None:
    # Example query
    query = "What kind of capping technologies does AROL Group offer?"

    # Run inference
    response = run_inference(model, tokenizer, query)

    if response:
        print("\n=== Model Response ===")
        print(response)

    # Hugging Face Hub repository details
    repo_name = "meldashti/lora-model"
    # Read the access token from the environment instead of hard-coding it;
    # None falls back to the token saved by `huggingface-cli login`.
    token = os.environ.get("HF_TOKEN")

    # Save the model and tokenizer to the Hugging Face Hub
    save_to_hub(model, tokenizer, repo_name, token)

Answer of the question is: Yes, the Eagle PK machine is available in different versions, including an optional caps sorter and fully automatic bottle-neck guide assembly.
</s>


In [None]:
from huggingface_hub import Repository, upload_folder
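The training log shows validation loss lowest at step 200 (0.32131) and rising afterwards while training loss keeps falling, a typical sign of overfitting, yet the diagnostics cell tries `output/checkpoint-1200` first. With `save_steps=200`, a checkpoint is written at every evaluation, so the best one can be picked from `trainer.state.log_history`, whose evaluation entries carry `step` and `eval_loss` keys. A minimal sketch with a hypothetical `best_eval_step` helper, fed with the values from the log (simplified to one training and one evaluation entry per logged step):

```python
def best_eval_step(log_history):
    """Return (step, eval_loss) of the evaluation entry with the lowest eval_loss."""
    evals = [entry for entry in log_history if "eval_loss" in entry]
    best = min(evals, key=lambda entry: entry["eval_loss"])
    return best["step"], best["eval_loss"]

LOG_TABLE = [  # (step, training loss, validation loss) from the training log
    (200, 0.2007, 0.32131),
    (400, 0.1204, 0.325637),
    (600, 0.0923, 0.3791),
    (800, 0.0746, 0.378767),
    (1000, 0.0635, 0.415229),
    (1200, 0.0599, 0.414516),
    (1400, 0.0573, 0.423272),
    (1600, 0.0534, 0.450093),
    (1800, 0.0513, 0.457502),
    (2000, 0.0484, 0.488911),
]

# Rebuild entries in the shape of trainer.state.log_history (other keys omitted).
history = []
for step, train_loss, val_loss in LOG_TABLE:
    history.append({"step": step, "loss": train_loss})
    history.append({"step": step, "eval_loss": val_loss})

step, loss = best_eval_step(history)
print(f"output/checkpoint-{step}", loss)  # output/checkpoint-200 0.32131
```

Alternatively, setting `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"` in `TrainingArguments` makes the Trainer reload the best checkpoint when training finishes; this requires the save and evaluation schedules to line up, as they do here (both every 200 steps).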

In [None]:
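# For a PEFT model, push_to_hub uploads the LoRA adapter weights and config, not a merged copy of the 3B base model
model.push_to_hub("Meldashti/chatbot")
tokenizer.push_to_hub("Meldashti/chatbot")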

### Deploying the Chatbot with Gradio and Hugging Face

The fine-tuned model is saved and uploaded to the Hugging Face Hub for easy deployment and sharing. A Gradio-based chat interface, hosted as a Gradio application, then lets users interact with the fine-tuned model in real time.
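Gradio's `gr.ChatInterface` calls a function with the latest user message and the chat history and displays the string it returns. The sketch below builds such a callback around a hypothetical `generate_answer` callable, so the logic can be tested without a GPU or Gradio installed. Because the model was fine-tuned on single question-answer pairs, this sketch ignores earlier turns; `make_respond` and the fallback message are illustrative, not the deployed app's code.

```python
def make_respond(generate_answer):
    """Build a (message, history) -> str callback in the shape gr.ChatInterface expects.

    `generate_answer` is a hypothetical callable (e.g. a wrapper around run_inference)
    that maps a question string to an answer string, or None on failure.
    """
    def respond(message, history):
        # Single-turn model: earlier turns in `history` are deliberately not used.
        answer = generate_answer(message)
        return answer if answer else "Sorry, I could not generate an answer."
    return respond

# Offline check with a stub in place of the model.
respond = make_respond(lambda q: f"stub answer to: {q}")
print(respond("What does AROL make?", []))  # stub answer to: What does AROL make?
```

In the notebook it could be wired up, after `import gradio as gr`, as `gr.ChatInterface(make_respond(lambda q: run_inference(model, tokenizer, q))).launch()`.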
