## PART C

Understanding the primary reasons for incoming calls is vital for enhancing operational efficiency and improving customer service. Accurately categorizing call reasons enables the call center to streamline processes, reduce manual tagging efforts, and ensure that customers are directed to the appropriate resources. In this context, analyze the dataset to uncover patterns that can assist in understanding and identifying these primary call reasons. Please outline your approach, detailing the data analysis techniques and feature identification methods you plan to use. Optional task: you may use the `test.csv` file to generate and submit your predictions.

## INSTALLING THE LIBRARIES

In [None]:
%pip install pandas numpy scikit-learn imbalanced-learn xgboost lightgbm catboost

## LOADING THE DATASETS

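Converted the timestamp columns to datetime to remove discrepancies and enable analysis.
Added the AHT column (call end minus agent assignment, in seconds) and the AST column (agent assignment minus call start, in seconds) for modelling.
Merged the DataFrames into a single DataFrame for the task.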

In [2]:
import pandas as pd
import numpy as np
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import SimpleImputer

calls_df = pd.read_csv(r'.\callsf0d4f5a.csv')
customers_df = pd.read_csv(r'.\customers2afd6ea.csv')
reasons_df = pd.read_csv(r'.\reason18315ff.csv')
sentiment_df = pd.read_csv(r'.\sentiment_statisticscc1e57a.csv')
test_df = pd.read_csv(r'.\testbc7185d.csv')


calls_df['call_start_datetime'] = pd.to_datetime(calls_df['call_start_datetime'])
calls_df['agent_assigned_datetime'] = pd.to_datetime(calls_df['agent_assigned_datetime'])
calls_df['call_end_datetime'] = pd.to_datetime(calls_df['call_end_datetime'])

calls_df['AHT'] = (calls_df['call_end_datetime'] - calls_df['agent_assigned_datetime']).dt.total_seconds()
calls_df['AST'] = (calls_df['agent_assigned_datetime'] - calls_df['call_start_datetime']).dt.total_seconds()

merged_df = pd.merge(calls_df, reasons_df, on='call_id', how='left')
merged_df = pd.merge(merged_df, sentiment_df, on='call_id', how='left')
merged_df = pd.merge(merged_df, customers_df, on='customer_id', how='left')
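
As a sanity check on the duration columns and the joins, here is a minimal sketch on made-up rows. The timestamps, ids, and the `toy_*` names are hypothetical; only the column names come from the cell above. It also uses the `validate` argument of `pd.merge`, which is an addition to the original approach: it makes pandas raise an error if `call_id` is unexpectedly duplicated, so a left join cannot silently multiply rows.

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from the calls CSV.
toy_calls = pd.DataFrame({
    "call_id": [1, 2],
    "call_start_datetime": ["2024-08-01 10:00:00", "2024-08-01 11:00:00"],
    "agent_assigned_datetime": ["2024-08-01 10:02:00", "2024-08-01 11:05:00"],
    "call_end_datetime": ["2024-08-01 10:12:00", "2024-08-01 11:06:30"],
})
for col in ["call_start_datetime", "agent_assigned_datetime", "call_end_datetime"]:
    toy_calls[col] = pd.to_datetime(toy_calls[col])

# Same arithmetic as in the notebook: durations in seconds
toy_calls["AHT"] = (toy_calls["call_end_datetime"] - toy_calls["agent_assigned_datetime"]).dt.total_seconds()
toy_calls["AST"] = (toy_calls["agent_assigned_datetime"] - toy_calls["call_start_datetime"]).dt.total_seconds()

toy_reasons = pd.DataFrame({"call_id": [1, 2], "primary_call_reason": ["Baggage", "IRROPS"]})

# validate="one_to_one" raises pandas.errors.MergeError if call_id repeats on either side
toy_merged = pd.merge(toy_calls, toy_reasons, on="call_id", how="left", validate="one_to_one")
print(toy_merged[["call_id", "AHT", "AST", "primary_call_reason"]])
```

If the real tables can legitimately contain several rows per key, `validate="many_to_one"` would be the closer fit.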


Cleaned the `primary_call_reason` target so that its values are consistent and usable, and filled missing `average_sentiment` values with the column mean.

In [3]:
def clean_call_reason(column):
    column = column.str.strip()
    column = column.str.replace(r'\s+', ' ', regex=True)
    return column

merged_df['primary_call_reason'] = clean_call_reason(merged_df['primary_call_reason'])

def standardize_call_reasons(column):
    standardization_dict = {
        'Check In': 'Check-In',
        'Post Flight': 'Post-Flight',
        'Products & Services': 'Products and Services'
    }
    
    column = column.replace(standardization_dict)
    
    return column

merged_df['primary_call_reason'] = standardize_call_reasons(merged_df['primary_call_reason'])

# Assign the result back instead of calling fillna(..., inplace=True) on a column selection;
# pandas warns that the inplace form acts on a copy and will not work in pandas 3.0.
merged_df['average_sentiment'] = merged_df['average_sentiment'].fillna(merged_df['average_sentiment'].mean())
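
To see what the label cleaning does, here is a small sketch on hypothetical raw values (the `raw` list is made up for illustration; the mapping is the one used in this cell). Whitespace is trimmed and collapsed first, so that the dictionary lookup matches spelling variants exactly.

```python
import pandas as pd

# Hypothetical messy labels, similar in spirit to the variants handled above
raw = pd.Series(["Check In", "Check-In ", "  Post Flight", "Products &  Services", "IRROPS"])

# Step 1: trim and collapse repeated whitespace
cleaned = raw.str.strip().str.replace(r"\s+", " ", regex=True)

# Step 2: map known spelling variants onto one canonical label
cleaned = cleaned.replace({
    "Check In": "Check-In",
    "Post Flight": "Post-Flight",
    "Products & Services": "Products and Services",
})

# value_counts() is a quick way to spot any variants that are still left over
print(cleaned.value_counts())
```

Running `value_counts()` on the real column before and after cleaning is a cheap way to confirm that no near-duplicate labels remain.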


Preprocessed the transcripts by removing digits and punctuation and lowercasing the text.
Also dropped the rows that have no primary call reason; these are the call_ids present in the test.csv file, which are unlabelled.

In [4]:

def preprocess_transcript(text):
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    return text

merged_df['cleaned_transcript'] = merged_df['call_transcript'].apply(preprocess_transcript)

merged_df = merged_df[merged_df['primary_call_reason'].notna()]
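
The effect of the cleaning function is easiest to see on one sentence. The sketch below repeats `preprocess_transcript` and adds a hypothetical variant, `preprocess_transcript_ws`, which also collapses the double spaces left behind when digits are removed. The sample sentence is made up. TF-IDF tokenization ignores extra spaces anyway, so the variant mainly matters if the cleaned text is inspected or reused elsewhere.

```python
import re
import string

def preprocess_transcript(text):
    # Same steps as in the notebook: drop digits, drop punctuation, lowercase
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower()

def preprocess_transcript_ws(text):
    # Hypothetical variant: additionally collapse runs of whitespace
    return re.sub(r"\s+", " ", preprocess_transcript(text)).strip()

sample = "Agent: Your flight UA 1234 is delayed by 45 minutes!"
print(repr(preprocess_transcript(sample)))
print(repr(preprocess_transcript_ws(sample)))
```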


Used the TF-IDF vectorizer to convert the transcripts into sparse term-weight feature vectors (limited to the 500 highest-frequency terms), and imputed missing numerical values with the column mean.

In [5]:
tfidf = TfidfVectorizer(max_features=500)
X_text = tfidf.fit_transform(merged_df['cleaned_transcript']).toarray()

numerical_features = merged_df[['AHT', 'AST', 'average_sentiment', 'silence_percent_average']].values

imputer = SimpleImputer(strategy='mean')
X_numerical = imputer.fit_transform(numerical_features)
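
Calling `.toarray()` turns the TF-IDF output into a dense matrix, which is fine for 500 features but grows quickly with a larger vocabulary. A possible alternative, sketched here on made-up documents and numbers, is to keep the text features sparse and join them to the numerical columns with `scipy.sparse.hstack`. scikit-learn's `RandomForestClassifier` accepts sparse input; whether every other library used later does should be checked before switching.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for cleaned transcripts and two numerical columns
docs = ["my bag is lost", "flight delayed again", "change my seat please"]
numeric = np.array([[300.0, 60.0], [np.nan, 120.0], [450.0, 30.0]])

tfidf_demo = TfidfVectorizer(max_features=500)
X_text_sparse = tfidf_demo.fit_transform(docs)  # stays a scipy sparse matrix

# Mean imputation, as in the notebook: the NaN becomes (300 + 450) / 2
X_num = SimpleImputer(strategy="mean").fit_transform(numeric)

# Join sparse text features and numerical features column-wise
X_sparse = hstack([X_text_sparse, csr_matrix(X_num)]).tocsr()
print(X_sparse.shape)
```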


Concatenated the TF-IDF vectors and the numerical features into a single feature matrix for training.
Split the dataset into training and validation sets (80/20).

In [6]:
X = np.concatenate([X_text, X_numerical], axis=1)

y = merged_df['primary_call_reason']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
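
Because the call reasons are heavily imbalanced, a plain random split can leave rare classes under-represented in the validation set. One option, shown here as a sketch on made-up labels rather than as the split used in this notebook, is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both parts.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 8 / 4 / 4
labels = np.array(["IRROPS"] * 8 + ["Baggage"] * 4 + ["Seating"] * 4)
features = np.arange(len(labels)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)

# Each class keeps its share: the 4 validation rows split 2 / 1 / 1
val_counts = dict(Counter(y_va.tolist()))
print(val_counts)
```

In the notebook this would amount to passing `stratify=y` in the call above.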


## DON'T RUN THE CELLS FROM HERE ONWARDS: THEY TAKE A LONG TIME TO RUN

Training a Random Forest Classifier

In [26]:

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)


The Metrics for Random Forests on this data.

In [28]:

y_pred = model.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))


Validation Accuracy: 0.3335833770909909



Classification Report:
                        precision    recall  f1-score   support

              Baggage       0.75      0.00      0.01       604
              Booking       0.39      0.03      0.05       513
             Check-In       0.00      0.00      0.00       359
             Checkout       0.89      0.09      0.16       384
       Communications       0.23      0.02      0.03       757
      Digital Support       0.00      0.00      0.00       255
           Disability       0.00      0.00      0.00        86
                  ETC       0.00      0.00      0.00       197
               IRROPS       0.34      0.95      0.50      2763
         Mileage Plus       0.17      0.03      0.05      1130
         Other Topics       0.00      0.00      0.00       174
          Post-Flight       0.37      0.15      0.22       848
Products and Services       0.11      0.00      0.01       658
      Schedule Change       0.00      0.00      0.00       146
              Seating       0.
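
With one dominant class, accuracy alone can flatter a model that mostly predicts that class. A minimal sketch of two complementary checks, on made-up labels: a majority-class baseline from `DummyClassifier`, and macro-averaged F1, which weights every class equally. Passing `zero_division=0` reports 0 for classes that are never predicted instead of emitting an undefined-metric warning.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical validation labels with one dominant class
y_true = np.array(["IRROPS"] * 6 + ["Baggage"] * 3 + ["Seating"] * 1)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_acc = accuracy_score(y_true, y_base)
base_macro_f1 = f1_score(y_true, y_base, average="macro", zero_division=0)

print("Baseline accuracy:", base_acc)
print("Baseline macro F1:", base_macro_f1)
print(classification_report(y_true, y_base, zero_division=0))
```

Comparing each model's accuracy and macro F1 against this baseline shows how much is gained beyond predicting the largest class.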



Handled the class imbalance by training a new model with balanced class weights.

In [37]:
from sklearn.utils.class_weight import compute_class_weight

class_labels = np.unique(y_train)

class_weights = compute_class_weight(class_weight='balanced', classes=class_labels, y=y_train)

class_weight_dict = {class_label: weight for class_label, weight in zip(class_labels, class_weights)}

model1 = RandomForestClassifier(n_estimators=100, random_state=42, class_weight=class_weight_dict)
model1.fit(X_train, y_train)
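
For reference, the "balanced" weights follow the formula `n_samples / (n_classes * count_per_class)`, so rarer classes get larger weights. The sketch below checks this on made-up labels. scikit-learn also accepts `class_weight='balanced'` directly in `RandomForestClassifier`, which applies the same formula without building the dictionary by hand.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 / 3 / 1
y_demo = np.array(["IRROPS"] * 6 + ["Baggage"] * 3 + ["Seating"] * 1)
classes = np.unique(y_demo)  # sorted: Baggage, IRROPS, Seating

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Manual version of the same formula
_, counts = np.unique(y_demo, return_counts=True)
manual = len(y_demo) / (len(classes) * counts)

print(dict(zip(classes, weights)))
```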


Metrics for this model.

In [38]:

y_pred = model1.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))


Validation Accuracy: 0.3503113044782837




Classification Report:
                        precision    recall  f1-score   support

              Baggage       0.25      0.00      0.00       604
              Booking       0.33      0.07      0.12       513
             Check-In       0.00      0.00      0.00       359
             Checkout       0.95      0.47      0.62       384
       Communications       0.42      0.05      0.09       757
      Digital Support       0.00      0.00      0.00       255
           Disability       0.00      0.00      0.00        86
                  ETC       0.00      0.00      0.00       197
               IRROPS       0.34      0.86      0.49      2763
         Mileage Plus       0.22      0.01      0.01      1130
         Other Topics       0.00      0.00      0.00       174
          Post-Flight       0.44      0.20      0.28       848
Products and Services       0.25      0.00      0.00       658
      Schedule Change       0.00      0.00      0.00       146
              Seating       0.



Tried SMOTE oversampling of the training set to address the class imbalance.

In [39]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

model2 = RandomForestClassifier(n_estimators=100, random_state=42)
model2.fit(X_train_resampled, y_train_resampled)


SMOTE metrics.

In [40]:

# Validate the SMOTE-trained model on the original (non-resampled) validation set
y_pred = model2.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))


Validation Accuracy: 0.3091290975920786




Classification Report:
                        precision    recall  f1-score   support

              Baggage       0.17      0.06      0.09       604
              Booking       0.23      0.18      0.20       513
             Check-In       0.05      0.01      0.01       359
             Checkout       0.59      0.31      0.41       384
       Communications       0.22      0.16      0.18       757
      Digital Support       0.05      0.01      0.01       255
           Disability       0.00      0.00      0.00        86
                  ETC       0.00      0.00      0.00       197
               IRROPS       0.36      0.71      0.47      2763
         Mileage Plus       0.19      0.10      0.13      1130
         Other Topics       0.07      0.01      0.01       174
          Post-Flight       0.30      0.26      0.28       848
Products and Services       0.11      0.03      0.05       658
      Schedule Change       0.00      0.00      0.00       146
              Seating       0.

Tried XGBoost, trained on the SMOTE-resampled data, to improve accuracy; it takes longer to run.

In [42]:
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# Step 1: Convert string labels to numerical labels
label_encoder = LabelEncoder()

# Fit the label encoder on the training data
y_train_resampled_encoded = label_encoder.fit_transform(y_train_resampled)
y_val_encoded = label_encoder.transform(y_val)

# Step 2: Train an XGBoost model with numerical labels
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)
xgb_model.fit(X_train_resampled, y_train_resampled_encoded)

# Step 3: Validate the model on validation set (using encoded labels)
y_pred_xgb = xgb_model.predict(X_val)

# Decode the predicted labels back to the original string labels
y_pred_xgb_decoded = label_encoder.inverse_transform(y_pred_xgb)

# Step 4: Evaluate the model using original string labels
print("XGBoost Validation Accuracy:", accuracy_score(y_val, y_pred_xgb_decoded))
print("XGBoost Classification Report:\n", classification_report(y_val, y_pred_xgb_decoded))


XGBoost Validation Accuracy: 0.3758157677593579
XGBoost Classification Report:
                        precision    recall  f1-score   support

              Baggage       0.25      0.11      0.15       604
              Booking       0.27      0.27      0.27       513
             Check-In       0.07      0.01      0.02       359
             Checkout       0.87      0.52      0.65       384
       Communications       0.29      0.30      0.29       757
      Digital Support       0.06      0.01      0.01       255
           Disability       0.00      0.00      0.00        86
                  ETC       0.17      0.01      0.01       197
               IRROPS       0.42      0.69      0.53      2763
         Mileage Plus       0.26      0.19      0.22      1130
         Other Topics       0.07      0.02      0.03       174
          Post-Flight       0.39      0.43      0.41       848
Products and Services       0.19      0.08      0.11       658
      Schedule Change       0.09     
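
The XGBoost cell converts the string labels to integers and back. A minimal sketch of that round trip on made-up labels: `LabelEncoder` sorts the class names and assigns integers from 0 upward, and `inverse_transform` maps predictions back to the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels
y_str = np.array(["IRROPS", "Baggage", "Seating", "IRROPS"])

enc = LabelEncoder()
y_int = enc.fit_transform(y_str)        # classes are sorted alphabetically
y_back = enc.inverse_transform(y_int)   # back to the original strings

print(list(enc.classes_), y_int.tolist(), y_back.tolist())
```

The encoder fitted on the training labels has to be the one used for decoding, otherwise the integer-to-name mapping may differ.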

Tried CatBoost to further improve the accuracy, which it did: validation accuracy rose to about 41%.

In [7]:
from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(iterations=200, depth=5, learning_rate=0.1, random_seed=42, verbose=100)

cat_model.fit(X_train, y_train)

y_pred_cat = cat_model.predict(X_val)

print("CatBoost Validation Accuracy:", accuracy_score(y_val, y_pred_cat))
print("CatBoost Classification Report:\n", classification_report(y_val, y_pred_cat))


0:	learn: 2.6647899	total: 8s	remaining: 26m 31s
100:	learn: 1.6397306	total: 11m 46s	remaining: 11m 32s
199:	learn: 1.5522196	total: 23m 8s	remaining: 0us
CatBoost Validation Accuracy: 0.41152201635286173




CatBoost Classification Report:
                        precision    recall  f1-score   support

              Baggage       0.37      0.07      0.11       604
              Booking       0.29      0.29      0.29       513
             Check-In       0.00      0.00      0.00       359
             Checkout       0.93      0.55      0.69       384
       Communications       0.37      0.26      0.31       757
      Digital Support       0.00      0.00      0.00       255
           Disability       0.00      0.00      0.00        86
                  ETC       0.00      0.00      0.00       197
               IRROPS       0.41      0.85      0.56      2763
         Mileage Plus       0.34      0.11      0.17      1130
         Other Topics       0.00      0.00      0.00       174
          Post-Flight       0.44      0.45      0.45       848
Products and Services       0.36      0.01      0.01       658
      Schedule Change       0.00      0.00      0.00       146
              Seating



LightGBM: the best results so far relative to training time.

In [8]:
import lightgbm as lgb
from sklearn.metrics import accuracy_score, classification_report

lgb_model = lgb.LGBMClassifier(n_estimators=100, random_state=42)

lgb_model.fit(X_train, y_train)

y_pred_lgb = lgb_model.predict(X_val)

print("LightGBM Validation Accuracy:", accuracy_score(y_val, y_pred_lgb))
print("LightGBM Classification Report:\n", classification_report(y_val, y_pred_lgb))


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.370078 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 127870
[LightGBM] [Info] Number of data points in the train set: 53322, number of used features: 504
[LightGBM] [Info] Start training from score -3.175245
[LightGBM] [Info] Start training from score -3.223048
[LightGBM] [Info] Start training from score -3.541325
[LightGBM] [Info] Start training from score -3.568221
[LightGBM] [Info] Start training from score -2.850446
[LightGBM] [Info] Start training from score -4.006808
[LightGBM] [Info] Start training from score -5.125203
[LightGBM] [Info] Start training from score -4.257387
[LightGBM] [Info] Start training from score -1.620413
[LightGBM] [Info] Start training from score -2.424328
[LightGBM] [Info] Start training from score -4.416406
[LightGBM] [Info] Start training from score -2.728742
[LightGBM] [Info] Start training from score -2.992774
[Light

## OPTIONAL: USING THE MODEL FOR PREDICTIONS

Used the CatBoost model because it did better on the validation set.

## YOU CAN RUN THIS CELL BECAUSE THE SAVED MODEL WEIGHTS ARE PRESENT


In [22]:
import pickle

# Attach call details, sentiment statistics, and customer data to each test call_id
test_merged_df = pd.merge(test_df, calls_df, on='call_id', how='left')
test_merged_df = pd.merge(test_merged_df, sentiment_df, on='call_id', how='left')
test_merged_df = pd.merge(test_merged_df, customers_df, on='customer_id', how='left')

# Guard against missing transcripts, then apply the same text cleaning as in training
test_merged_df['call_transcript'] = test_merged_df['call_transcript'].fillna('no transcript')
test_merged_df['call_transcript'] = test_merged_df['call_transcript'].astype(str)
test_merged_df['cleaned_transcript'] = test_merged_df['call_transcript'].apply(preprocess_transcript)

# Reuse the already-fitted TF-IDF vectorizer and imputer (transform only, no refitting)
X_test_text = tfidf.transform(test_merged_df['cleaned_transcript']).toarray()
X_test_numerical = imputer.transform(test_merged_df[['AHT', 'AST', 'average_sentiment', 'silence_percent_average']].values)
X_test = np.concatenate([X_test_text, X_test_numerical], axis=1)

# Load the pickled CatBoost model and predict the call reasons
with open('catboost_model.pkl', 'rb') as model_file:
    loaded_cat_model = pickle.load(model_file)

test_predictions = loaded_cat_model.predict(X_test)



In [24]:
print("Shape of test_predictions:", test_predictions.shape)
# Ensure test_predictions is a 1D array
test_predictions = test_predictions.flatten()


Shape of test_predictions: (5157, 1)
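
The printed shape shows a column vector with one prediction per row, which is why it is flattened before being assigned to a DataFrame column. A small sketch on a made-up array of the same layout:

```python
import numpy as np

# Hypothetical predictions with shape (n, 1), like the shape printed above
preds_2d = np.array([["IRROPS"], ["Baggage"], ["Seating"]])

# ravel() (or flatten()) gives the 1D array that a DataFrame column expects
preds_1d = preds_2d.ravel()
print(preds_2d.shape, "->", preds_1d.shape)
```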


In [27]:
test_merged_df['predicted_call_reason'] = test_predictions
test_merged_df[['call_id', 'predicted_call_reason']].to_csv('test_NiharMittal.csv', index=False)

## CODE TO SAVE THE MODEL TO A PICKLE FILE FOR REUSABILITY

In [26]:
# import pickle

# with open('catboost_model.pkl', 'wb') as model_file:
#     pickle.dump(cat_model, model_file)
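
The prediction cell also depends on the fitted `tfidf` and `imputer` objects, which only exist if the earlier cells have been run. A possible extension, sketched here with a small scikit-learn model standing in for CatBoost and with made-up data and file names, is to pickle the preprocessing objects together with the model so that predictions can be reproduced from the file alone.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tiny training set; a stand-in for the real transcripts and labels
docs = ["bag lost", "bag missing", "flight delayed", "flight cancelled"]
labels = ["Baggage", "Baggage", "IRROPS", "IRROPS"]

vec = TfidfVectorizer().fit(docs)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(vec.transform(docs), labels)

# Save the vectorizer and the model in one bundle (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "bundle_demo.pkl")
with open(path, "wb") as fh:
    pickle.dump({"vectorizer": vec, "model": clf}, fh)

# Load the bundle and predict without touching the in-memory objects
with open(path, "rb") as fh:
    bundle = pickle.load(fh)

restored_pred = bundle["model"].predict(bundle["vectorizer"].transform(docs))
original_pred = clf.predict(vec.transform(docs))
print(list(restored_pred))
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the library versions that created them.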
