# Problem Statement

Haptik is one of the world's largest conversational AI platforms. It is a personal assistant mobile app powered by a combination of artificial intelligence and human assistance, and it operates across multiple domains, including customer support, feedback, order status and live chat.

We have a Haptik dataset containing the messages received from customers, along with the topic (class) each message refers to.

We need to build an NLP model that predicts which class a given message belongs to.

Additionally, we use techniques such as LSA (Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation) to assign topics to new messages.


# About the dataset

The dataset consists of a `message` column along with one column for each topic a message could be associated with.

We have combined the separate topic columns into a single column called `category`.

The dataset has roughly 40,000 messages (40,659 rows in the training file). You need to predict the category of each message.

For submission purposes, the following is the label encoding of the `category` column:
```python

{0: 'casual',
 1: 'food',
 2: 'movies',
 3: 'nearby',
 4: 'other',
 5: 'recharge',
 6: 'reminders',
 7: 'support',
 8: 'travel'}
```
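The integer codes above follow the alphabetical order of the category names, which is how scikit-learn's `LabelEncoder` assigns them. As a minimal sketch in plain Python, the mapping can be rebuilt and used to turn predicted codes back into names. The `decode` helper and the sample codes are illustrative assumptions, not part of the original notebook:

```python
# Category names as listed in the encoding table above.
categories = ['casual', 'food', 'movies', 'nearby', 'other',
              'recharge', 'reminders', 'support', 'travel']

# Sorting the names and enumerating them reproduces the table.
code_to_name = dict(enumerate(sorted(categories)))
name_to_code = {name: code for code, name in code_to_name.items()}

def decode(codes):
    """Map integer predictions back to category names (hypothetical helper)."""
    return [code_to_name[c] for c in codes]

print(decode([3, 7, 4, 1, 8]))
# ['nearby', 'support', 'other', 'food', 'travel']
```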

## Evaluation metrics

For this dataset we use the macro-averaged F1 score (`f1_score` with `average="macro"`) as the evaluation metric.
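Macro averaging computes one F1 score per class and takes their unweighted mean, so every category counts equally regardless of how many messages it has. A small hand-rolled sketch of that calculation follows; the `macro_f1` function and the toy labels are made up for illustration, and the notebook itself relies on `sklearn.metrics.f1_score`:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (made up): per-class F1 is 0.5, 0.8 and 2/3.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 4))
# 0.6556
```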



In [4]:
# Using Google Colab: mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
#import modules

import nltk
nltk.download('stopwords')
nltk.download('wordnet')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns',None)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import gensim
from gensim.models.lsimodel import LsiModel
from gensim import corpora
from pprint import pprint
from gensim.models import LdaModel
from gensim.models import CoherenceModel

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [6]:
# Read the train and test datasets separately
df_train=pd.read_csv('/content/drive/MyDrive/Customer_message_domain_classification/train_data.csv')
df_test=pd.read_csv('/content/drive/MyDrive/Customer_message_domain_classification/test_data.csv')

# Keep the train Id column (MID) in train_id, then drop it from df_train
train_id = df_train['MID']
df_train.drop(['MID'], axis=1, inplace=True)

In [7]:
#First look at data
df_train.head()

|   | message | category |
|---|---------|----------|
| 0 | 7am everyday | reminders |
| 1 | chocolate cake | food |
| 2 | closed mortice and tenon joint door dimentions | support |
| 3 | train eppo kelambum | travel |
| 4 | yesterday i have cancelled the flight ticket | travel |


In [8]:
# Data shape and columns
print(df_train.shape)
print(df_train.columns)

(40659, 2)
Index(['message', 'category'], dtype='object')


In [9]:
#Features Info
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40659 entries, 0 to 40658
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   message   40659 non-null  object
 1   category  40659 non-null  object
dtypes: object(2)
memory usage: 635.4+ KB


In [10]:
# Describe data
df_train.describe()

|        | message    | category |
|--------|------------|----------|
| count  | 40659      | 40659    |
| unique | 39309      | 9        |
| top    | 12/30/1899 | travel   |
| freq   | 47         | 11063    |


# Data Processing

In text analytics, we convert textual data into numeric vectors so that machine learning algorithms can be applied to it.

In this task we use a standard TF-IDF vectorizer to vectorize the `message` column and label encode the `category` column, which turns the task into a multi-class classification problem.
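To make the TF-IDF weighting concrete, here is a plain-Python sketch on three made-up messages. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn documents as the `TfidfVectorizer` defaults. Tokenisation is simplified to a whitespace split and stop-word removal is omitted, so the numbers will not match the vectorizer exactly:

```python
import math
from collections import Counter

# Made-up messages standing in for the real `message` column.
docs = ["show me pizza", "book movie ticket", "book train ticket"]
tokenised = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})
n_docs = len(tokenised)

# Document frequency: number of messages that contain each term.
df = Counter(w for doc in tokenised for w in set(doc))

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Term count times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

X_toy = [tfidf_vector(doc) for doc in tokenised]

# "movie" occurs in one message, "book" in two, so "movie" gets more weight.
row = dict(zip(vocab, X_toy[1]))
print(round(row["movie"], 3), round(row["book"], 3))
```

Rarer terms receive larger weights, which is why a word like "movie" is more informative for a message than a word shared by several messages.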


In [11]:
# Sampling only 1000 samples of each category
df_train = df_train.groupby('category').apply(lambda x: x.sample(n=1000, random_state=0))

In [12]:
# Converting all messages to lower case and storing it
all_text = df_train["message"].str.lower()

# Initialising TF-IDF object
tfidf = TfidfVectorizer(stop_words="english")

# Vectorizing data
tfidf.fit(all_text)

# Storing the TF-IDF vectorized data into an array
X = tfidf.transform(all_text).toarray()

# Initiating a label encoder object
le = LabelEncoder()

# Fitting the label encoder object on the data
le.fit(df_train["category"])

# Transforming the data and storing it
y = le.transform(df_train["category"])

# Classification implementation

In the previous task, we cleaned the data and converted the text into numbers so that machine learning models can be applied to it.

In this task we apply Logistic Regression, Naive Bayes and Linear SVM models to the data.



In [13]:
#we split 70% of the data to training set while 30% of the data to validation 
X_train,X_valid,y_train,y_valid = train_test_split(X,y,test_size=0.3, random_state=42) 

#X_train, X_valid shape
print(X_train.shape)
print(X_valid.shape)

(6300, 7361)
(2700, 7361)


In [14]:
# Defining the Logistic Regression algorithm
log_reg = LogisticRegression(random_state=0)
log_reg.fit(X_train,y_train)

# Predicting the values of validation data
y_lr_pred = log_reg.predict(X_valid)
print("Classification report - \n", classification_report(y_valid,y_lr_pred))

#f1_score
f1_score(y_valid,y_lr_pred, average='macro')

Classification report - 
               precision    recall  f1-score   support

           0       0.46      0.80      0.58       300
           1       0.73      0.50      0.59       326
           2       0.89      0.78      0.83       287
           3       0.65      0.65      0.65       288
           4       0.67      0.71      0.69       296
           5       0.77      0.77      0.77       298
           6       0.89      0.82      0.85       314
           7       0.76      0.67      0.72       297
           8       0.80      0.66      0.72       294

    accuracy                           0.71      2700
   macro avg       0.74      0.71      0.71      2700
weighted avg       0.74      0.71      0.71      2700



0.7121925559391143

In [15]:
# Defining the Multinomial Naive Bayes algorithm
nb = MultinomialNB()
nb.fit(X_train,y_train)

# Predicting the values of validation data
y_nb_pred = nb.predict(X_valid)
print("Classification report - \n", classification_report(y_valid,y_nb_pred))

#f1_score
f1_score(y_valid,y_nb_pred, average='macro')

Classification report - 
               precision    recall  f1-score   support

           0       0.55      0.62      0.58       300
           1       0.81      0.51      0.63       326
           2       0.71      0.83      0.76       287
           3       0.66      0.65      0.66       288
           4       0.75      0.69      0.72       296
           5       0.68      0.85      0.75       298
           6       0.80      0.87      0.83       314
           7       0.74      0.69      0.72       297
           8       0.78      0.71      0.74       294

    accuracy                           0.71      2700
   macro avg       0.72      0.71      0.71      2700
weighted avg       0.72      0.71      0.71      2700



0.7098202248522268

In [16]:
# Defining the Linear Support vector algorithm
lsvm = LinearSVC(random_state=0)
lsvm.fit(X_train,y_train)

# Predicting the values of validation data
y_lsvm_pred = lsvm.predict(X_valid)
print("Classification report - \n", classification_report(y_valid,y_lsvm_pred))

#f1_score
f1_score(y_valid,y_lsvm_pred, average='macro')

Classification report - 
               precision    recall  f1-score   support

           0       0.48      0.76      0.59       300
           1       0.73      0.53      0.61       326
           2       0.88      0.84      0.86       287
           3       0.63      0.66      0.64       288
           4       0.74      0.70      0.72       296
           5       0.75      0.78      0.76       298
           6       0.86      0.83      0.84       314
           7       0.72      0.66      0.69       297
           8       0.80      0.67      0.73       294

    accuracy                           0.71      2700
   macro avg       0.73      0.71      0.72      2700
weighted avg       0.73      0.71      0.72      2700



0.7167828373922656
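
The three cells above repeat the same fit, predict and score steps, and the TF-IDF vocabulary was fitted on all sampled messages before the train/validation split. One possible restructuring, which is not what the notebook does, wraps the vectorizer and each classifier in a scikit-learn pipeline so that the vocabulary is learned from the training messages only and the features stay sparse. The toy messages, labels and the `scores` dictionary below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus standing in for the real train/validation split.
train_msgs = ["order a pizza", "pizza and burger", "book a flight",
              "flight to delhi", "remind me at 7am", "set a reminder"]
train_y = ["food", "food", "travel", "travel", "reminders", "reminders"]
valid_msgs = ["burger please", "cheap flight", "reminder for tomorrow"]
valid_y = ["food", "travel", "reminders"]

models = {
    "logistic_regression": LogisticRegression(random_state=0),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(random_state=0),
}

scores = {}
for name, clf in models.items():
    # The vectorizer lowercases by default, so no manual .str.lower() step.
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    pipe.fit(train_msgs, train_y)  # vocabulary comes from training text only
    preds = pipe.predict(valid_msgs)
    scores[name] = f1_score(valid_y, preds, average="macro")

print(scores)
```

With the real data, `train_msgs` and `valid_msgs` would be the raw message strings produced by `train_test_split`, and the same fitted pipeline could then be applied directly to the test messages.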

# Prediction on test data

In [17]:
# Prediction on test data

#Test data shape and columns names
print(df_test.shape)
print(df_test.columns)

(10000, 1)
Index(['message'], dtype='object')


In [18]:
#First look at test data
df_test.head()

|   | message |
|---|---------|
| 0 | Nearest metro station |
| 1 | Pick up n drop service trough cab |
| 2 | I wants to buy a bick |
| 3 | Show me pizza |
| 4 | What is the cheapest package to andaman and ni... |


In [19]:
#convert to lower case 
all_text = df_test["message"].str.lower()

# Transforming using the tfidf object - tfidf
X_test = tfidf.transform(all_text).toarray()


In [20]:
# Predicting using the linear svm model - lsvm
y_test_pred = lsvm.predict(X_test)


In [21]:
#Making df for submission
subm=pd.DataFrame({"category": y_test_pred})
print(subm.head())

   category
0         3
1         7
2         4
3         1
4         8


In [22]:
# To CSV for submission
subm.to_csv('category.csv',index=False)

#from google.colab import files
#files.download('category.csv')

# LSI Modeling
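In this task, we use LSI on the messages in `df_train`. Note that `df_train` now holds the 1,000-per-category sample drawn earlier (9,000 messages), not the full dataset.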


In [23]:
# Creating a stopwords list
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

# Function to lemmatize and remove the stopwords
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

# Creating a list of documents from the message column
list_of_docs = df_train["message"].tolist()

# Cleaning and tokenising every message in list_of_docs
doc_clean = [clean(doc).split() for doc in list_of_docs]

# Building the gensim dictionary (token -> id mapping)
dictionary = corpora.Dictionary(doc_clean)

doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

lsimodel = LsiModel(corpus=doc_term_matrix, num_topics=5, id2word=dictionary)

pprint(lsimodel.print_topics())

[(0,
  '-0.347*"reminder" + -0.267*"like" + -0.267*"cancel" + -0.266*"would" + '
  '-0.256*"apiname" + -0.256*"exotel" + -0.256*"offset" + -0.256*"userid" + '
  '-0.255*"taskname" + -0.255*"reminderlist"'),
 (1,
  '-0.831*"want" + -0.221*"u" + -0.187*"know" + -0.181*"movie" + -0.135*"book" '
  '+ -0.128*"ticket" + -0.114*"need" + -0.108*"hi" + -0.096*"please" + '
  '-0.092*"service"'),
 (2,
  '-0.451*"reminder" + 0.328*"call" + 0.316*"u" + 0.233*"wake" + '
  '-0.204*"water" + 0.197*"march" + 0.192*"wakeup" + -0.185*"every" + '
  '-0.181*"drink" + -0.168*"want"'),
 (3,
  '0.611*"u" + -0.419*"want" + 0.244*"need" + 0.238*"reminder" + '
  '0.197*"please" + 0.143*"movie" + 0.118*"service" + -0.101*"wake" + '
  '0.101*"near" + 0.101*"help"'),
 (4,
  '-0.622*"need" + 0.510*"u" + -0.490*"movie" + -0.189*"offer" + 0.137*"want" '
  '+ -0.115*"ticket" + -0.058*"know" + -0.052*"today" + 0.052*"find" + '
  '-0.049*"book"')]


# LDA Modeling

Topic modeling using LDA. 

We find the optimum number of topics using the coherence score and then fit a model with that number of topics.



In [24]:
# doc_term_matrix - Word matrix created in the last task
# dictionary - Dictionary created in the last task

# Function to calculate coherence values
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    topic_list : No. of topics chosen
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    topic_list = []
    for num_topics in range(start, limit, step):
        # Use the corpus passed to the function rather than the global doc_term_matrix
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, random_state = 0, num_topics=num_topics, id2word = dictionary, iterations=10)
        topic_list.append(num_topics)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return topic_list, coherence_values

# Calling the function
topic_list, coherence_value_list = compute_coherence_values(dictionary=dictionary, corpus=doc_term_matrix, texts=doc_clean, start=1, limit=41, step=5)
print(coherence_value_list)

# Finding the index associated with maximum coherence value
max_index=coherence_value_list.index(max(coherence_value_list))

# Finding the optimum no. of topics associated with the maximum coherence value
opt_topic= topic_list[max_index]
print("Optimum no. of topics:", opt_topic)

# Implementing LDA with the optimum no. of topic
lda_model = LdaModel(corpus=doc_term_matrix, num_topics=opt_topic, id2word = dictionary, iterations=10, passes = 30,random_state=0)

# Display the top words of the topic with id 5 (print_topic takes a topic id)
pprint(lda_model.print_topic(5))

[0.3287476298674388, 0.4801812391625579, 0.5306698259321219, 0.5376618801954907, 0.5587078765648961, 0.572049572781549, 0.5663902769474314, 0.5889600955365673]
Optimum no. of topics: 36
('0.064*"800" + 0.047*"medicine" + 0.040*"day" + 0.037*"got" + 0.020*"trip" + '
 '0.015*"many" + 0.014*"theater" + 0.012*"moto" + 0.011*"showing" + '
 '0.010*"low"')
