Colab notebook link: [Colab notebook](https://colab.research.google.com/drive/1Irht1A6ySmkXTxkgku2C5CAyQYg5bp-i?usp=sharing)

**Summary of Training Notebook**

This notebook focuses on building a machine learning model using BERT (Bidirectional Encoder Representations from Transformers) for predicting the primary reasons for customer calls based on call transcripts. The notebook is organized into several key sections:

**Data Preparation:**

Imported the necessary libraries and loaded the dataset containing call transcripts and their corresponding labels (primary call reasons).
Preprocessed the text data by converting the labels into numeric values using Label Encoding to facilitate model training.
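
The label-encoding step can be pictured with a small dependency-free sketch. It mirrors how scikit-learn's `LabelEncoder` numbers classes in sorted order, so id 0 is the alphabetically first call reason rather than the first one seen in the data. The helper names are hypothetical and the sample labels are taken from the dataset preview.

```python
# Dependency-free sketch of the mapping that label encoding produces.
# Helper names are hypothetical; they are not part of scikit-learn.

def fit_label_mapping(labels):
    """Return the sorted list of distinct classes; a class's index is its numeric id."""
    return sorted(set(labels))

def encode(labels, classes):
    """Map each label string to its integer id."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels]

def decode(ids, classes):
    """Map integer ids back to label strings."""
    return [classes[i] for i in ids]

reasons = ["Voluntary Cancel", "Booking", "IRROPS", "Upgrade", "Seating", "Upgrade"]
classes = fit_label_mapping(reasons)
ids = encode(reasons, classes)

print(classes)  # ['Booking', 'IRROPS', 'Seating', 'Upgrade', 'Voluntary Cancel']
print(ids)      # [4, 0, 1, 3, 2, 3]
print(decode(ids, classes) == reasons)  # True
```

Any list used later to turn predicted ids back into call reasons has to follow this same sorted order.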


**Data Splitting:**

Divided the dataset into training and validation sets using an 80-20 split. This ensures that the model can be evaluated on unseen data after training.
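
The notebook uses a plain random 80-20 split. If some call reasons are rare, a stratified split keeps each class's share the same in both sets. The sketch below shows the idea in plain Python with a hypothetical `stratified_split` helper and toy labels; in scikit-learn the same effect is available by passing `stratify=labels` to `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) with about val_fraction of each class in validation."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one validation sample per class when the class has more than one row.
        n_val = max(1, round(len(indices) * val_fraction)) if len(indices) > 1 else 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy, imbalanced label list (not the real class distribution).
labels = ["IRROPS"] * 50 + ["Upgrade"] * 30 + ["Disability"] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 72 18
```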


**Tokenization:**

Utilized the BertTokenizer to tokenize the call transcripts, transforming the text data into a format suitable for the BERT model. The texts were truncated and padded to a maximum length of 512 tokens.
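
To make the effect of truncation and padding concrete, here is a simplified sketch that works on lists of token ids. It is an illustration only: `pad_and_truncate` is a hypothetical helper, the real tokenizer also keeps the closing `[SEP]` token when it truncates, and `padding=True` pads to the longest sequence in the call rather than always to `max_length`.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Clip a token-id list to max_length, then right-pad it and build the attention mask."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased; 0 is [PAD].
short_ids, short_mask = pad_and_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_and_truncate(list(range(1, 9)), max_length=5)

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

The second call shows the cost of truncation: everything past the limit is dropped, so for a long call transcript the model never sees the end of the conversation.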


**Custom Dataset Class:**

Created a custom dataset class, CallDataset, to handle the tokenized inputs and labels. This class defines methods to retrieve data items and the total number of samples.


**Model Initialization:**

Loaded the BERT model for sequence classification, specifying the number of labels (20 classes corresponding to call reasons).


**Training Arguments:**

Defined training arguments such as the output directory, batch sizes, number of epochs, warmup steps, weight decay, and evaluation strategy to control the training process. The learning rate was left at the library default.


**Training the Model:**

Used the Trainer class to train the model with the defined arguments and datasets. The training process was monitored to evaluate performance based on accuracy metrics.

**Model Evaluation:**

After training, evaluated the model on the validation dataset to check its accuracy. The results were printed for assessment.
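
The notebook imports `f1_score` but reports only accuracy. With 20 classes, accuracy alone can look acceptable even when the model favors a few frequent reasons, so macro-averaged F1 is a useful companion metric. The sketch below computes both in plain Python on toy labels; the helper names are hypothetical, and `f1_score(y_true, y_pred, average="macro")` from scikit-learn computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels: a model that predicts the majority class (0) almost every time.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

print(round(accuracy(y_true, y_pred), 3))  # 0.7
print(round(macro_f1(y_true, y_pred), 3))  # 0.489
```

Here accuracy is 0.7 while macro F1 is below 0.5, because class 2 is never predicted.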

**Saving the Model:**

Saved the trained model and tokenizer for future use, enabling easy deployment for making predictions on new data.
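
The saved model outputs class ids only, so the id-to-label mapping should be stored next to it; otherwise it is lost when the Colab session restarts. A minimal sketch using a JSON file follows. The file name `label_classes.json` and the helper names are assumptions, and the three-item list stands in for `label_encoder.classes_.tolist()`.

```python
import json
import os
import tempfile

LABEL_FILE = "label_classes.json"  # hypothetical file name

def save_label_classes(classes, directory):
    """Write the ordered class list next to the saved model."""
    path = os.path.join(directory, LABEL_FILE)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(classes), f)
    return path

def load_label_classes(directory):
    """Read the ordered class list back; index i is the label for class id i."""
    with open(os.path.join(directory, LABEL_FILE), encoding="utf-8") as f:
        return json.load(f)

# A temporary directory stands in for the model save path.
with tempfile.TemporaryDirectory() as model_dir:
    classes = ["Baggage", "Booking", "IRROPS"]  # stand-in for label_encoder.classes_.tolist()
    save_label_classes(classes, model_dir)
    restored = load_label_classes(model_dir)

predicted_id = 2
print(restored[predicted_id])  # IRROPS
```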

**Prediction on New Data:**

Provided a mechanism to predict call reasons for new transcripts. The model tokenized the input, obtained predictions, and converted them back to the corresponding call reason labels.

**Output Generation:**

Finally, predictions were stored in a DataFrame and exported to an Excel file for further analysis or reporting.

In [None]:
!pip install transformers
!pip install torch
# The PyPI package is named scikit-learn; `pip install sklearn` is a deprecated placeholder and fails.
!pip install scikit-learn


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd


**The files calls.csv, customers.csv, reason.csv, and sentiment_statistics.csv were cleaned and preprocessed separately, then merged into one file.**

In [None]:
df = pd.read_excel("/content/drive/MyDrive/ut/Merged_clean.xlsx")

In [None]:
df

Unnamed: 0,call_id,customer_id,agent_id,call_start_datetime,agent_assigned_datetime,call_end_datetime,call_transcript,assigning_time,call_duration,customer_name,elite_level_code,elite_level_code_categorical,primary_call_reason,primary_call_reason_categorical,agent_tone,customer_tone,average_sentiment,silence_percent_average
0,4667960400,2033123310,963118,2024-07-31 23:56:00,2024-08-01 00:03:00,2024-08-01 00:34:00,\n\nAgent: Thank you for calling United Airlin...,7,31,Matthew Foster,4,4,Voluntary Cancel,Voluntary Cancel,neutral,angry,-0.04,0.39
1,1122072124,8186702651,519057,2024-08-01 00:03:00,2024-08-01 00:06:00,2024-08-01 00:18:00,\n\nAgent: Thank you for calling United Airlin...,3,12,Tammy Walters,-1,-1,Booking,Booking,calm,neutral,0.02,0.35
2,6834291559,2416856629,158319,2024-07-31 23:59:00,2024-08-01 00:07:00,2024-08-01 00:26:00,\n\nAgent: Thank you for calling United Airlin...,8,19,Jeffery Dixon,-1,-1,IRROPS,IRROPS,neutral,polite,-0.13,0.32
3,2266439882,1154544516,488324,2024-08-01 00:05:00,2024-08-01 00:10:00,2024-08-01 00:17:00,\n\nAgent: Thank you for calling United Airlin...,5,7,David Wilkins,2,2,Upgrade,Upgrade,neutral,frustrated,-0.20,0.20
4,1211603231,5214456437,721730,2024-08-01 00:04:00,2024-08-01 00:14:00,2024-08-01 00:23:00,\n\nAgent: Thank you for calling United Airlin...,10,9,Elizabeth Daniels,0,0,Seating,Seating,neutral,polite,-0.05,0.35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66648,7569738090,7367304988,783441,2024-08-31 23:51:00,2024-08-31 23:57:00,2024-09-01 00:07:00,\n\nAgent: Thank you for calling United Airlin...,6,10,Kevin Warner,2,2,Mileage Plus,Mileage Plus,neutral,frustrated,-0.19,0.51
66649,1563273072,8022667294,413148,2024-08-31 23:48:00,2024-08-31 23:57:00,2024-09-01 00:25:00,\n\nAgent: Thank you for calling United Airlin...,9,28,Dennis Singleton DDS,-1,-1,Post Flight,Post Flight,calm,calm,0.05,0.34
66650,8865997781,4474867021,980156,2024-08-31 23:55:00,2024-08-31 23:58:00,2024-09-01 00:06:00,\n\nAgent: Thank you for calling United Airlin...,3,8,Paul Mitchell,1,1,Upgrade,Upgrade,calm,frustrated,0.03,0.22
66651,8019240181,9762042472,616988,2024-08-31 23:52:00,2024-08-31 23:58:00,2024-09-01 00:04:00,\n\nAgent: Thank you for calling United Airlin...,6,6,Kaylee Lang,-1,-1,Upgrade,Upgrade,calm,polite,0.05,0.42


In [None]:
df_new = df[['call_id', 'call_transcript', 'primary_call_reason_categorical' ]]

In [None]:
df_new

Unnamed: 0,call_id,call_transcript,primary_call_reason_categorical
0,4667960400,\n\nAgent: Thank you for calling United Airlin...,Voluntary Cancel
1,1122072124,\n\nAgent: Thank you for calling United Airlin...,Booking
2,6834291559,\n\nAgent: Thank you for calling United Airlin...,IRROPS
3,2266439882,\n\nAgent: Thank you for calling United Airlin...,Upgrade
4,1211603231,\n\nAgent: Thank you for calling United Airlin...,Seating
...,...,...,...
66648,7569738090,\n\nAgent: Thank you for calling United Airlin...,Mileage Plus
66649,1563273072,\n\nAgent: Thank you for calling United Airlin...,Post Flight
66650,8865997781,\n\nAgent: Thank you for calling United Airlin...,Upgrade
66651,8019240181,\n\nAgent: Thank you for calling United Airlin...,Upgrade


**Model Building**

In [None]:


# 'call_transcript' holds the call text and 'primary_call_reason_categorical' holds the call reason
texts = df_new['call_transcript'].values
labels = df_new['primary_call_reason_categorical'].values

# Convert labels to numeric values (you can use LabelEncoder for this)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(labels)

# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)


In [None]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the input texts
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(list(val_texts), truncation=True, padding=True, max_length=512)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


KeyboardInterrupt: 

In [None]:
class CallDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create datasets
train_dataset = CallDataset(train_encodings, train_labels)
val_dataset = CallDataset(val_encodings, val_labels)


In [None]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(set(labels)))

In [None]:
training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    num_train_epochs=1,              # Number of training epochs (adjust based on your dataset)
    per_device_train_batch_size=8,   # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for storing logs
    evaluation_strategy="steps",     # Evaluate after every x steps
)


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=lambda p: {'accuracy': accuracy_score(p.label_ids, p.predictions.argmax(-1))}
)

# Start training
trainer.train()


In [None]:
model_save_path = "/content/drive/MyDrive/ut"  # Replace with your desired path
trainer.save_model(model_save_path)  # Save the model
tokenizer.save_pretrained(model_save_path)  # Save the tokenizer

In [None]:
model_save_path = "/content/drive/MyDrive/ut"

In [None]:
from transformers import BertForSequenceClassification, BertTokenizer

# Load the model and tokenizer
model = BertForSequenceClassification.from_pretrained(model_save_path)
tokenizer = BertTokenizer.from_pretrained(model_save_path)

In [None]:
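# `trainer` is lost after a runtime restart, so rebuild it around the reloaded model.
# Assumes val_dataset has been recreated by re-running the data cells above.
trainer = Trainer(model=model, eval_dataset=val_dataset,
                  compute_metrics=lambda p: {'accuracy': accuracy_score(p.label_ids, p.predictions.argmax(-1))})
eval_results = trainer.evaluate()

print(f"Validation Accuracy: {eval_results['eval_accuracy']}")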

In [None]:
new_texts = ["I want to know my account balance"]  # Example transcript

# Tokenize the new text
new_encodings = tokenizer(new_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")

# Predict (inference mode: dropout off, no gradient tracking)
model.eval()
with torch.no_grad():
    outputs = model(**new_encodings)
predictions = torch.argmax(outputs.logits, dim=-1)

# Convert prediction to label
predicted_reason = label_encoder.inverse_transform(predictions.detach().numpy())

print(predicted_reason)


['Voluntary Change']


In [None]:
df = pd.read_excel("/content/drive/MyDrive/ut2/merged_output.xlsx")

In [None]:
transcripts = df['call_transcript'].tolist()

**Big Prediction Engine**

In [None]:


# Assuming the Excel has 'call_id' and 'call_transcript' columns
call_ids = df['call_id'].tolist()
call_transcripts = df['call_transcript'].tolist()

# The 20 call reasons. LabelEncoder assigns class ids in sorted (alphabetical) order,
# so the list is sorted to make index i match the class id i used during training.
class_labels = sorted(['Voluntary Cancel', 'Booking', 'IRROPS', 'Upgrade', 'Seating',
       'Mileage Plus', 'Checkout', 'Voluntary Change', 'Post Flight',
       'Check In', 'Other Topics', 'Communications', 'Schedule Change',
       'Products and Services', 'Digital Support', 'Disability',
       'Unaccompanied Minor', 'Baggage', 'Traveler Updates', 'ETC'])

predicted_reasons = []

# Loop through the transcripts and predict the reason for each
for transcript in call_transcripts:
    # Tokenize the transcript
    new_encodings = tokenizer([transcript], truncation=True, padding=True, max_length=512, return_tensors="pt")

    # Get model predictions (no gradient tracking is needed for inference)
    with torch.no_grad():
        outputs = model(**new_encodings)
    predictions = torch.argmax(outputs.logits, dim=-1)

    # Map the prediction to the corresponding label
    predicted_reason = class_labels[predictions.item()]  # Get the label from the list
    predicted_reasons.append(predicted_reason)

# Add the predicted reasons to the DataFrame
df['predicted_reason'] = predicted_reasons



In [None]:
df

Unnamed: 0,call_id,call_transcript,predicted_reason
0,7732610078,\n\nAgent: Thank you for calling United Airlin...,Post Flight
1,2400299738,\n\nAgent: Thank you for calling United Airlin...,Post Flight
2,6533095063,\n\nAgent: Thank you for calling United Airlin...,ETC
3,7774450920,\n\nAgent: Thank you for calling United Airlin...,Post Flight
4,9214147168,\n\nAgent: Thank you for calling United Airlin...,Post Flight
...,...,...,...
5152,5300201106,\n\nAgent: Thank you for calling United Airlin...,ETC
5153,727694488,\n\nAgent: Thank you for calling United Airlin...,Post Flight
5154,147487837,\n\nAgent: Thank you for calling United Airlin...,Communications
5155,5330794838,\n\nAgent: Thank you for calling United Airlin...,ETC


In [None]:
df.to_excel('output_data.xlsx', index=False)

**Challenges and Limitations in Model Training and Prediction**

During the training process of the BERT model for predicting call reasons, I encountered significant challenges related to computational resources, which impacted the overall efficiency and performance of the model:

**Limited Training Epochs:**

Due to restricted computing units on Google Colab, I could only afford to train the model for 1 epoch. This single epoch took approximately 3 hours to complete.
With additional training epochs (e.g., 10 or 12 epochs), I believe the model would yield substantially better results and improve its predictive capabilities.
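
A rough estimate shows what more epochs would cost on the same hardware. The sketch assumes about 66,652 rows (the preview's last index is 66651), the 80-20 split and batch size of 8 from the cells above, a single device with no gradient accumulation, and training time that scales linearly with the number of epochs.

```python
import math

n_rows = 66_652        # assumption: the DataFrame preview ends at index 66651
val_fraction = 0.2     # test_size used in train_test_split
train_batch_size = 8   # per_device_train_batch_size, assuming a single device
epoch_hours = 3.0      # reported wall-clock time for one epoch

n_val = math.ceil(n_rows * val_fraction)
n_train = n_rows - n_val
steps_per_epoch = math.ceil(n_train / train_batch_size)
seconds_per_step = epoch_hours * 3600 / steps_per_epoch

print(steps_per_epoch)             # 6666
print(round(seconds_per_step, 2))  # 1.62
for epochs in (1, 10, 12):
    print(epochs, "epoch(s):", round(epochs * epoch_hours, 1), "hours")
```

Under these assumptions, 10 epochs would take roughly 30 hours, and `warmup_steps=500` covers about 7.5% of a single epoch.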

**Extended Prediction Time:**

Generating predictions for about 5,000 transcripts took roughly 2 hours. The limited compute available during this phase contributed to the delay, as did running the model on one transcript at a time.
With a GPU runtime and more compute units, this prediction step could have been accelerated significantly.
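
Part of the prediction time can likely be recovered without extra hardware by sending transcripts through the model in batches and turning off gradient tracking. The sketch below is one way to structure that, not the notebook's method. `batched`, `predict_all`, and `make_bert_predictor` are hypothetical helpers, the batch size of 32 is an assumption to tune against available memory, and the stand-in predictor at the end only exercises the chunking logic.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_all(texts, predict_batch, batch_size=32):
    """Run `predict_batch` (list of texts -> list of class ids) over all texts in chunks."""
    predictions = []
    for chunk in batched(texts, batch_size):
        predictions.extend(predict_batch(chunk))
    return predictions

def make_bert_predictor(model, tokenizer, device="cpu"):
    """Build a predict_batch function around a fine-tuned model (hypothetical helper)."""
    import torch  # imported lazily so the chunking helpers above need no dependencies

    model.to(device)
    model.eval()  # disable dropout for inference

    def predict_batch(chunk):
        enc = tokenizer(chunk, truncation=True, padding=True,
                        max_length=512, return_tensors="pt").to(device)
        with torch.no_grad():  # skip gradient bookkeeping during inference
            logits = model(**enc).logits
        return logits.argmax(dim=-1).tolist()

    return predict_batch

# Stand-in predictor so the chunking logic can be checked without loading a model.
fake_predictor = lambda chunk: [len(text) % 3 for text in chunk]
demo_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
print(predict_all(demo_texts, fake_predictor, batch_size=2))  # [1, 2, 0, 1, 2]
```

With the real model this would be called as `predict_all(call_transcripts, make_bert_predictor(model, tokenizer, device="cuda"))` on a GPU runtime, or with `device="cpu"` otherwise.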

**Potential for Improved Performance:**

Given the constraints on training epochs and prediction speed, the model's current performance leaves room for improvement. With more computational resources, I could train for more epochs and tune the model further, which I expect would yield higher accuracy and faster predictions of call reasons.

**Conclusion**

The experience underscored the importance of adequate computational resources in machine learning projects, especially when working with complex models like BERT. Increasing the training epochs and enhancing the prediction capabilities would allow for the development of a more robust and accurate predictive model.