<a href="https://colab.research.google.com/github/ShoukatKhattak/AI-Engineering/blob/main/BERT_Model_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import numpy as np
import pandas as pd
import re
import torch
import random
import torch.nn as nn
from transformers import BertTokenizer, BertModel
from tqdm import tqdm
import os

This section imports necessary libraries and modules:
- `numpy` and `pandas` for data manipulation.
- `re` for regular expressions.
- `torch` for building and training neural networks.
- `random` for random number generation.
- `torch.nn` for neural network components.
- `BertTokenizer` and `BertModel` from the Hugging Face `transformers` library, used for working with BERT models.
- `tqdm` for displaying progress bars during training.
- `os` for operating system related functions.


In [3]:
class BERT_Arch(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        # Classification head: 768-dim [CLS] vector -> one logit per answer.
        # NOTE: `answers` is a global loaded in a later cell, so it must exist
        # before this class is instantiated.
        self.fc1 = nn.Linear(768, len(answers))

    def forward(self, sent_id, attention_mask):
        # [0] is the last hidden state; [:, 0] selects the first ([CLS]) token.
        cls_hs = self.bert(sent_id, attention_mask=attention_mask)[0][:, 0]
        x = self.dropout(cls_hs)
        output = self.fc1(x)
        return output

Here, a custom module `BERT_Arch` is defined. It inherits from `nn.Module`, wraps the pre-trained BERT encoder, takes the hidden state of the first (`[CLS]`) token, applies dropout, and passes the result through a linear layer that outputs one logit per entry in `answers`.
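As a rough illustration of that data flow, the sketch below swaps the real BERT encoder for a hypothetical stand-in (`FakeEncoder`, not part of the notebook) that returns random hidden states of the right shape, so the head can be exercised without downloading any weights. `ClsHead`, `num_classes=5`, and the batch of 2 are assumptions made for the example.

```python
import torch
import torch.nn as nn


class FakeEncoder(nn.Module):
    """Hypothetical stand-in for BertModel: returns a tuple whose first
    element has shape (batch, seq_len, hidden), like BERT's last hidden state."""

    def __init__(self, hidden=768):
        super().__init__()
        self.hidden = hidden

    def forward(self, sent_id, attention_mask=None):
        batch, seq_len = sent_id.shape
        return (torch.randn(batch, seq_len, self.hidden),)


class ClsHead(nn.Module):
    """Same structure as BERT_Arch, but with an explicit num_classes argument."""

    def __init__(self, encoder, num_classes, hidden=768):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.fc1 = nn.Linear(hidden, num_classes)

    def forward(self, sent_id, attention_mask):
        # [0] is the per-token hidden states; [:, 0] picks the first ([CLS]) token.
        cls_hs = self.encoder(sent_id, attention_mask=attention_mask)[0][:, 0]
        return self.fc1(self.dropout(cls_hs))


sent_id = torch.zeros(2, 55, dtype=torch.long)  # batch of 2, 55 token ids each
mask = torch.ones(2, 55, dtype=torch.long)
logits = ClsHead(FakeEncoder(), num_classes=5)(sent_id, mask)
print(logits.shape)  # one row per input, one column per class
```

Passing the number of classes explicitly, as this sketch does, would also remove the notebook's dependence on the global `answers` list.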

In [4]:
# Load the multilingual BERT model
bert = BertModel.from_pretrained('bert-base-multilingual-uncased')

# Load the multilingual BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')


This section loads the pre-trained BERT model and tokenizer. It uses the `'bert-base-multilingual-uncased'` version.

In [5]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [6]:
device

device(type='cuda')

Determines whether to use GPU or CPU for computation based on availability.

In [7]:
#!huggingface-cli login

In [9]:
def load_dataset(file_path):
    df = pd.read_csv(file_path)
    questions = df['question'].tolist()
    answers = df['answer'].tolist()
    return questions, answers

questions, answers = load_dataset('Conversation.csv')

In [10]:
answers

["i'm fine. how about yourself?",
 "i'm pretty good. thanks for asking.",
 'no problem. so how have you been?',
 "i've been great. what about you?",
 "i've been good. i'm in school right now.",
 'what school do you go to?',
 'i go to pcc.',
 'do you like it there?',
 "it's okay. it's a really big campus.",
 'good luck with school.',
 'thank you very much.',
 "i'm doing well. how about you?",
 'never better, thanks.',
 'so how have you been lately?',
 "i've actually been pretty good. you?",
 "i'm actually in school right now.",
 'which school do you attend?',
 "i'm attending pcc right now.",
 'are you enjoying it there?',
 "it's not bad. there are a lot of people there.",
 'good luck with that.',
 'thanks.',
 "i'm doing great. what about you?",
 "i'm absolutely lovely, thank you.",
 "everything's been good with you?",
 "i haven't been better. how about yourself?",
 'i started school recently.',
 'where are you going to school?',
 "i'm going to pcc.",
 'how do you like it so far?',
 'i l

In [11]:
questions

['hi, how are you doing?',
 "i'm fine. how about yourself?",
 "i'm pretty good. thanks for asking.",
 'no problem. so how have you been?',
 "i've been great. what about you?",
 "i've been good. i'm in school right now.",
 'what school do you go to?',
 'i go to pcc.',
 'do you like it there?',
 "it's okay. it's a really big campus.",
 'good luck with school.',
 "how's it going?",
 "i'm doing well. how about you?",
 'never better, thanks.',
 'so how have you been lately?',
 "i've actually been pretty good. you?",
 "i'm actually in school right now.",
 'which school do you attend?',
 "i'm attending pcc right now.",
 'are you enjoying it there?',
 "it's not bad. there are a lot of people there.",
 'good luck with that.',
 'how are you doing today?',
 "i'm doing great. what about you?",
 "i'm absolutely lovely, thank you.",
 "everything's been good with you?",
 "i haven't been better. how about yourself?",
 'i started school recently.',
 'where are you going to school?',
 "i'm going to pcc.

This function reads a CSV file with `question` and `answer` columns and returns the two columns as Python lists.
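A minimal sketch of the layout the loader expects, using an in-memory file and the standard `csv` module instead of pandas so that it runs without `Conversation.csv`. The two sample rows are copied from the output above; `load_pairs` is a hypothetical name, and the real file may contain additional columns.

```python
import csv
import io

# In-memory stand-in for Conversation.csv (assumed header: question,answer).
sample_csv = io.StringIO(
    "question,answer\n"
    '"hi, how are you doing?","i\'m fine. how about yourself?"\n'
    '"i\'m fine. how about yourself?","i\'m pretty good. thanks for asking."\n'
)


def load_pairs(handle):
    """Return (questions, answers) as two parallel lists."""
    rows = list(csv.DictReader(handle))
    return [r["question"] for r in rows], [r["answer"] for r in rows]


sample_questions, sample_answers = load_pairs(sample_csv)
print(sample_questions[0], "->", sample_answers[0])
```

In these sample rows each answer reappears as the next question, which matches the overlap visible between the `answers` and `questions` outputs above.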

In [12]:
# Tokenize and encode questions
max_seq_len = 55  # maximum number of tokens per question; longer inputs are truncated
tokens_train = tokenizer(
    questions,
    max_length=max_seq_len,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_token_type_ids=False
)



Tokenizes and encodes the questions using the BERT tokenizer, ensuring they are of uniform length by padding/truncating.
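To make the padding and truncation step concrete, here is a simplified pure-Python sketch of what happens to one sequence of token ids. `pad_or_truncate` is a hypothetical helper and the ids are made up; the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not attempt.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), both exactly max_len long.

    mask is 1 for real tokens and 0 for padding, like `attention_mask`.
    """
    ids = token_ids[:max_len]            # truncate if too long
    mask = [1] * len(ids)
    padding = max_len - len(ids)         # 0 when the input was truncated
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)
print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions.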

In [13]:
# Convert to tensors
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor([answers.index(ans) for ans in answers])


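Converts the tokenized data into PyTorch tensors. Each label is the index of the answer's first occurrence in `answers`, so identical answers share a single class.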
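The effect of `answers.index(ans)` on duplicate answers can be checked on a small list. The sketch below uses made-up sample data and also shows a dictionary-based variant that produces the same labels in a single pass, which may be preferable for large datasets because `list.index` rescans the list for every element.

```python
sample_answers = ["thanks.", "good luck with that.", "thanks.", "i go to pcc."]

# What the notebook does: label = position of the first occurrence.
labels_index = [sample_answers.index(a) for a in sample_answers]

# Equivalent single-pass version using a dict of first positions.
first_pos = {}
for i, a in enumerate(sample_answers):
    first_pos.setdefault(a, i)  # keeps the first index seen for each answer
labels_dict = [first_pos[a] for a in sample_answers]

print(labels_index)  # [0, 1, 0, 3]
```

Both occurrences of `"thanks."` receive label 0, and label 2 is never used.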

In [14]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

# define a batch size
batch_size = 64

# wrap tensors
train_data = TensorDataset(train_seq, train_mask, train_y)

# sampler for sampling the data during training
train_sampler = RandomSampler(train_data)

# DataLoader for train set
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

This part prepares the data for training: the tensors are wrapped in a `TensorDataset`, and a `DataLoader` with a `RandomSampler` yields them in shuffled batches of 64.
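What the sampler and loader do together can be approximated in plain Python: shuffle the example indices, then cut them into consecutive chunks. `batch_indices` is a hypothetical helper, and the dataset size of 150 is an arbitrary assumption.

```python
import math
import random


def batch_indices(n_examples, batch_size, seed=0):
    """Shuffle indices 0..n-1 and split them into batches (last may be smaller)."""
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n_examples, batch_size)]


batches = batch_indices(n_examples=150, batch_size=64)
expected_batches = math.ceil(150 / 64)
print(len(batches), [len(b) for b in batches])  # 3 [64, 64, 22]
```

By the same arithmetic, the 59 batches per epoch reported in the training log imply a dataset of between 3,713 and 3,776 examples at a batch size of 64.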

In [15]:
# Initialize your model
model = BERT_Arch(bert)

# Move the model to the selected device (GPU if available, otherwise CPU)
model = model.to(device)

Instantiates `BERT_Arch` around the pre-trained encoder and moves it to the selected device (GPU or CPU).

In [16]:
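# Define the optimizer (Adam with a small learning rate for fine-tuning)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Loss function: expects raw logits and integer class labels
loss_function = nn.CrossEntropyLoss()

Defines the optimizer (Adam, learning rate 1e-5) and the loss function (`CrossEntropyLoss`), which applies log-softmax to the model's raw logits internally.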
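For a single example, cross-entropy is the log-sum-exp of the logits minus the logit of the correct class. The pure-Python sketch below (the helper `cross_entropy` and the toy logits are assumptions for illustration) shows how the value behaves for an uninformed and a confident prediction.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example from raw logits (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)    # no preference
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], target=2)  # favors the right class
print(round(uniform, 4), confident < uniform)
```

With equal logits over C classes the loss is ln(C), so an untrained classifier over a few thousand answer classes would be expected to start at a loss of roughly 8, which is in line with the first-epoch value in the training log.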

In [17]:
# Training loop
def train():
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader, desc="Training")):
        batch = [r.to(device) for r in batch]
        sent_id, mask, labels = batch
        model.zero_grad()
        output = model(sent_id, mask)
        loss = loss_function(output, labels)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    avg_loss = total_loss / len(train_dataloader)
    return avg_loss


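This function runs one training epoch: for each batch it clears the gradients, runs a forward pass, computes the loss, backpropagates, and updates the weights. It returns the average loss per batch.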
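The same sequence of steps can be tried on a toy problem that runs in well under a second. This is only a sketch: the linear model, random data, learning rate, and step count are all assumptions chosen so the example is self-contained, not settings from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(4, 3)              # tiny stand-in for BERT_Arch
toy_x = torch.randn(12, 4)               # 12 examples, 4 features
toy_y = torch.randint(0, 3, (12,))       # 3 classes
toy_opt = torch.optim.Adam(toy_model.parameters(), lr=0.05)
toy_loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(100):
    toy_model.zero_grad()                        # clear old gradients
    loss = toy_loss_fn(toy_model(toy_x), toy_y)  # forward pass + loss
    loss.backward()                              # backpropagate
    toy_opt.step()                               # update weights
    losses.append(loss.item())

print(f"first {losses[0]:.3f}  last {losses[-1]:.3f}")
```

The loss should fall over the 100 steps, mirroring the gradual decrease seen in the notebook's per-epoch output.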

In [18]:
# Number of training epochs
epochs = 100 # Increase for better training

In [None]:
# Train the model
for epoch in range(epochs):
    print(f'\n Epoch {epoch + 1} / {epochs}')
    train_loss = train()
    print(f'Training Loss: {train_loss:.3f}')


 Epoch 1 / 100
Training: 100%|██████████| 59/59 [00:31<00:00,  1.90it/s]
Training Loss: 8.270

 Epoch 2 / 100
Training: 100%|██████████| 59/59 [00:31<00:00,  1.85it/s]
Training Loss: 8.217

 Epoch 3 / 100
Training: 100%|██████████| 59/59 [00:33<00:00,  1.78it/s]
Training Loss: 8.159

 Epoch 4 / 100
Training: 100%|██████████| 59/59 [00:32<00:00,  1.81it/s]
Training Loss: 8.029

 Epoch 5 / 100
Training: 100%|██████████| 59/59 [00:32<00:00,  1.81it/s]
Training Loss: 7.855

 Epoch 6 / 100
Training: 100%|██████████| 59/59 [00:32<00:00,  1.80it/s]
Training Loss: 7.711

 Epoch 7 / 100
Training: 100%|██████████| 59/59 [00:32<00:00,  1.81it/s]
Training Loss: 7.577

 Epoch 8 / 100
Training: 100%|██████████| 59/59 [00:32<00:00,  1.81it/s]
Training Loss: 7.461

 Epoch 9 / 100
Training: 100%|██████████| 59/59 [00:32<00:00,  1.81it/s]
Training Loss: 7.335

 Epoch 10 / 100
Training: 100%|██████████| 59/59 [00:32<00:00,  1.80it/s]
Training Loss: 7.221

 Epoch 11 / 100
Training:  59%|█████▉    | 35/59 [00:19<00:13,  1.78it/s]

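Trains the model for the specified number of epochs and prints the average training loss after each one. The recorded output above stops partway through epoch 11.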

In [None]:
# Save the trained model
torch.save(model.state_dict(), 'trained_model.pth')

Saves the trained model to disk.

In [None]:
# Load the saved model
model_path = "/content/trained_model.pth"
# Initialize the model with the correct pre-trained model 'bert'
model = BERT_Arch(bert)

# Load the state dictionary onto the CPU
model.load_state_dict(torch.load(model_path, map_location=torch.device('cpu')))
# Move the model to the desired device (GPU if available, otherwise CPU)
model.to(device)

BERT_Arch(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_af

Loads the saved weights into a fresh `BERT_Arch` instance and moves the model to the selected device.
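The save and load cells follow the usual `state_dict` round trip. A small sketch of that pattern, using a tiny `nn.Linear` and an in-memory buffer in place of `trained_model.pth` (both are assumptions so the example needs no files):

```python
import io

import torch
import torch.nn as nn

source = nn.Linear(3, 2)
buffer = io.BytesIO()                  # in-memory stand-in for the .pth file
torch.save(source.state_dict(), buffer)

buffer.seek(0)
restored = nn.Linear(3, 2)             # must be built with the same shapes first
restored.load_state_dict(torch.load(buffer, map_location="cpu"))

same = all(
    torch.equal(a, b)
    for a, b in zip(source.state_dict().values(), restored.state_dict().values())
)
print(same)  # True
```

Because a `state_dict` stores only tensors, the receiving model has to have matching layer shapes. For `BERT_Arch` that means `answers` must have the same length when the model is rebuilt as it had during training.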

In [None]:
def get_prediction(input_str, model):
    # Keep only Arabic-script characters (U+0600-U+06FF) and whitespace; drop everything else
    input_str = re.sub(r'[^\u0600-\u06ff\s]+', '', input_str)

    # Tokenize the input string
    tokens_test_data = tokenizer(
        [input_str],
        max_length=max_seq_len,
        padding='max_length',
        truncation=True,
        return_token_type_ids=False
    )

    # Convert the tokenized text to tensors
    test_seq = torch.tensor(tokens_test_data['input_ids']).to(device)
    test_mask = torch.tensor(tokens_test_data['attention_mask']).to(device)

    # Set the model to evaluation mode
    model.eval()

    # Disable gradient calculation to improve efficiency
    with torch.no_grad():
        # Get model predictions
        preds = model(test_seq, attention_mask=test_mask)
        # Apply softmax function to get probabilities
        preds = torch.softmax(preds, dim=1)
        # Get the index of the highest probability
        pred_idx = torch.argmax(preds, dim=1).item()

    # Return the predicted answer
    return answers[pred_idx]

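Defines a function that strips everything except Arabic-script characters and whitespace from the input question, tokenizes it the same way as the training data, and returns the answer whose class has the highest predicted probability.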
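The effect of the character filter is easy to check in isolation. The sketch below applies the same regular expression to one mixed-script question taken from the chat session and to one English question from the dataset preview; the name `ARABIC_ONLY` is introduced here for illustration.

```python
import re

# Same pattern as in get_prediction: remove anything that is not an
# Arabic-block character (U+0600-U+06FF) or whitespace.
ARABIC_ONLY = re.compile(r'[^\u0600-\u06ff\s]+')

mixed = "کیا python پروگرامنگ کیس حساس ہے؟"
cleaned = ARABIC_ONLY.sub('', mixed)
english = ARABIC_ONLY.sub('', "hi, how are you doing?")

print(repr(cleaned))   # the Latin word "python" is gone
print(repr(english))   # only the spaces between the words remain
```

With this filter in place, English questions such as those shown from `Conversation.csv` would reach the tokenizer as whitespace only, so the function as written appears to target Arabic-script input such as Urdu.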

In [None]:
while True:
    input_question = input("You: ")
    if input_question.strip().lower() == "exit":
        break
    predicted_answer = get_prediction(input_question, model)
    print(f"Chatbot: {predicted_answer}")


You: پ کس زمانے سے موجود ہیں؟',  'آپ کو کیسی ہنر میں استعداد ہے؟'
Chatbot: میرا استعداد معلومات جمع کرنے اور سوالات کا جواب دینے میں ہے۔
You: آپ کس زمانے سے موجود ہیں
Chatbot: میں ابتدائی طور پر 2010 میں تیار کیا گیا تھا، میری معلومات مستقل طور پر بروز برس ہوتی رہتی ہے۔
You: کیا python پروگرامنگ کیس حساس ہے؟
Chatbot: پائتھن میں میتھڈز اور فَنکشن کا فرق یہ ہوتا ہے کہ میتھڈز کلاس کے ساتھ متعلق ہوتے ہیں جبکہ فَنکشنز کسی کلاس کے باہر بنائی جاتی ہیں۔
You: کیا پائتھن پروگرامنگ کیس حساس ہے؟
Chatbot: پائتھن میں اوبجیکٹ اورینٹڈ پروگرامنگ  ایک programming paradigm ہے جو کہ objects کی مدد سے کوڈ کو ساخت بخشتا ہے۔
You: Exit


Creates an interactive loop in which the user types questions and the chatbot prints the predicted answer, until the user types `exit`.
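One way to make such a loop testable without a keyboard is to pass the input and output functions in as parameters. The following is a sketch under that assumption; `chat_loop` and the upper-casing reply function are hypothetical and stand in for `get_prediction`.

```python
def chat_loop(reply_fn, read=input, write=print):
    """Read lines until the user types 'exit' (any casing); return the turn count."""
    turns = 0
    while True:
        text = read("You: ")
        if text.strip().lower() == "exit":
            return turns
        write(f"Chatbot: {reply_fn(text)}")
        turns += 1


# Scripted session: one question, then EXIT.
scripted = iter(["hello", "EXIT"])
transcript = []
n_turns = chat_loop(
    lambda q: q.upper(),                 # placeholder for get_prediction
    read=lambda prompt: next(scripted),
    write=transcript.append,
)
print(n_turns, transcript)  # 1 ['Chatbot: HELLO']
```

In the notebook, the same function could be called as `chat_loop(lambda q: get_prediction(q, model))` to get the interactive behaviour.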