## Contents
* [1. Optimise Pre-Processing & Vectoriser](#1.-Optimise-Pre-Processing-&-Vectoriser)
* [2. Imports](#2.-Imports)
* [3. Data Cleaning & Preparation](#3.-Data-Cleaning-&-Preparation)
* [4. Model Fit & Predict](#4.-Model-Fit-&-Predict)
* [5. Remarks](#5.-Remarks)

---
## 1. Optimise Pre-Processing & Vectoriser
---
The objective is to compare the following models to determine the best combination of pre-processing and vectoriser, before deciding on the best model to use:

|                | Baseline Model          | Alternate 1             | Alternate 2                                       | Alternate 3                                       |
|----------------|-------------------------|-------------------------|---------------------------------------------------|---------------------------------------------------|
| Pre-processing | - Basic cleaning<br>- Stem        | - Basic cleaning<br>- Stem       | - Basic cleaning<br>- Stem<br>- Remove duplicated sentences | - Basic cleaning<br>- Stem<br>- Remove duplicated sentences |
| Vectoriser     | CountVectoriser         | TFIDF                   | CountVectoriser                                   | TFIDF                                             |
| Model          | Multinomial Naive Bayes | Multinomial Naive Bayes | Multinomial Naive Bayes                           | Multinomial Naive Bayes                           |

---
## 2. Imports
---

In [None]:
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier as ovr
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, roc_auc_score, f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

---
## 3. Data Cleaning & Preparation
---

- read CSV files

In [None]:
kf_df = pd.read_csv('/kaggle/input/sq-services/kf_clean.csv')
lca_df = pd.read_csv('/kaggle/input/sq-services/LCA_clean.csv')
other_df = pd.read_csv('/kaggle/input/sq-services/other_clean.csv')
kf_df.head()

- prepare df for alternate 2 and 3 approaches
- tokenise text into sentences, remove duplicate sentences
    - in the SQTalk forum, when person A replies to person B's comment, person A's comment will start with a word-for-word quote of person B's comment
    - the strategy is to tokenise the text by sentences, then remove any repeated sentences to remove such repetitive quotes
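
The strategy above can be sketched on a toy example. This is a minimal illustration, not the notebook's implementation: the regex splitter `split_sentences` is a hypothetical stand-in for `sent_tokenize`, and the two posts are invented.

```python
import re

def split_sentences(text):
    """Naive sentence splitter (hypothetical stand-in for nltk's sent_tokenize)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def unique_sentences(posts):
    """Return all sentences in first-seen order, dropping exact repeats."""
    seen = set()
    result = []
    for post in posts:
        for sent in split_sentences(str(post)):
            if sent not in seen:
                seen.add(sent)
                result.append(sent)
    return result

# Invented posts: the reply starts with a word-for-word quote of the first post
posts = [
    "Is the lounge open at 5am? I land early.",
    "Is the lounge open at 5am? Yes, it opens at 4am.",
]
sentences = unique_sentences(posts)
print(sentences)
```

Only exact repeats are removed, so a quote that was edited or truncated by the replier would survive this step.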

In [None]:
# for kf dataset
temp_df = []

for text in kf_df['text']:
    for sent in sent_tokenize(str(text)):
        temp_df.append(sent)

kf_sent_df = pd.DataFrame(data=temp_df, columns=['sent'])
print(f"kf_df had {kf_sent_df.shape[0]} rows")
kf_sent_df.drop_duplicates(inplace=True)
print(f"After removing duplicates, kf_df has {kf_sent_df.shape[0]} rows")

# for lca dataset
temp_df = []

for text in lca_df['text']:
    for sent in sent_tokenize(str(text)):
        temp_df.append(sent)

lca_sent_df = pd.DataFrame(data=temp_df, columns=['sent'])
print(f"lca_df had {lca_sent_df.shape[0]} rows")
lca_sent_df.drop_duplicates(inplace=True)
print(f"After removing duplicates, lca_df has {lca_sent_df.shape[0]} rows")

# for other dataset
temp_df = []

for text in other_df['text']:
    for sent in sent_tokenize(str(text)):
        temp_df.append(sent)

other_sent_df = pd.DataFrame(data=temp_df, columns=['sent'])
print(f"other_df had {other_sent_df.shape[0]} rows")
other_sent_df.drop_duplicates(inplace=True)
print(f"After removing duplicates, other_df has {other_sent_df.shape[0]} rows")

- assign a 'source' column to each dataframe, then combine them into one dataframe

In [None]:
kf_sent_df['source'] = 'kf'
lca_sent_df['source'] = 'lca'
other_sent_df['source'] = 'other'

services_df = pd.concat([kf_sent_df, lca_sent_df, other_sent_df])
services_df.shape

- check for and resolve any NA values

In [None]:
print(services_df.isna().sum())

# acceptable to drop 3 NA values out of 44k values
services_df.dropna(inplace=True)
# reset index, and drop old index
services_df.reset_index(drop=True, inplace=True)

print(services_df.isna().sum())

- create a 'y_true' column: 
    - if value = 0, the source is others
    - if value = 1, the source is kf
    - if value = 2, the source is LCA

In [None]:
services_df['y_true'] = services_df['source'].map({'other':0, 'kf': 1, 'lca': 2})
print(services_df.head())
services_df['y_true'].value_counts(normalize=True)

- stem text and stopwords

In [None]:
stemmer = PorterStemmer()

In [None]:
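def token_stem(sent):
    """Tokenise a sentence into words and return the stemmed words joined by spaces."""
    result = []
    # named 'tokens' to avoid shadowing the built-in list
    tokens = word_tokenize(sent)
    for word in tokens:
        result.append(stemmer.stem(word))
    return ' '.join(result)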

In [None]:
services_stem_df = services_df.copy()
services_stem_df['sent'] = [token_stem(text) for text in services_df['sent']]
services_stem_df.head()

- add selected words to stopwords, taken from ["SQTalk Abbreviations, Slangs, Definitions, Phrases"](http://www.sqtalk.com/forum/forum/general/sqtalk-community/1010-) thread

In [None]:
cvec = CountVectorizer(max_features = 500, stop_words = 'english') 
stem_stopwords = [stemmer.stem(word) for word in cvec.get_stop_words()]
stem_stopwords.extend([stemmer.stem(word) for word in ['btw','iirc','imo','imho']])

- prepare untreated df (duplicate sentences not removed) for alternate 1 approach

In [None]:
services_untreated_df = pd.concat([kf_df, lca_df, other_df])

print(services_untreated_df.isna().sum())

# acceptable to drop 3 NA values out of 44k values
services_untreated_df.dropna(inplace=True)
# reset index, and drop old index
services_untreated_df.reset_index(drop=True, inplace=True)

print(services_untreated_df.isna().sum())

- create a 'y_true' column: 
    - if value = 0, the source is others
    - if value = 1, the source is kf
    - if value = 2, the source is LCA

In [None]:
services_untreated_df['y_true'] = services_untreated_df['source'].map({'other':0, 'kf': 1, 'lca': 2})
services_untreated_df['y_true'].value_counts(normalize=True)

- stem untreated text

In [None]:
services_stem_untreated_df = services_untreated_df.copy()
services_stem_untreated_df['text'] = [token_stem(text) for text in services_untreated_df['text']]
services_stem_untreated_df.head()

---
## 4. Model Fit & Predict
---

## 4.1 Alternate 1
### - using TFIDF and Multinomial NB

In [None]:
X = services_stem_untreated_df['text']
y = services_stem_untreated_df['y_true']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
X_train

- tokenise with TFIDF

In [None]:
tvec = TfidfVectorizer(max_features = 500, stop_words = stem_stopwords) 
X_train_cvec = tvec.fit_transform(X_train)
X_test_cvec = tvec.transform(X_test)
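
To make the difference between the two vectorisers concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalisation of each row). The three toy documents are invented for illustration and are not taken from the dataset.

```python
import math
from collections import Counter

# Invented, already-stemmed toy documents
docs = [
    "mile upgrad mile",
    "loung access",
    "mile loung",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)
vocab = sorted({t for doc in tokenised for t in doc})

# Document frequency: number of documents containing each term
doc_freq = Counter(t for doc in tokenised for t in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    """Raw count times IDF for each vocabulary term, scaled to unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[t] * idf(t) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

count_row = [Counter(tokenised[0])[t] for t in vocab]  # what a count vectoriser stores
weights = tfidf_row(tokenised[0])                      # what TFIDF stores instead

print(vocab)
print(count_row)
print([round(w, 3) for w in weights])
```

In the first document, 'mile' appears twice and 'upgrad' once, but 'upgrad' occurs in fewer documents, so TFIDF narrows the gap between the two terms compared with raw counts.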

- instantiate and fit a Naive Bayes model

In [None]:
nb = MultinomialNB()
NB_model = ovr(nb).fit(X_train_cvec, y_train)  # using OneVsRestClassifier

- visualise confusion matrix

In [None]:
y_pred = NB_model.predict(X_test_cvec)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Others', 'KrisFlyer', 'LCA'])
disp.plot();

- display precision, recall, f1-score of each class

In [None]:
print(classification_report(y_test, y_pred))
# 0: other, 1: kf, 2: lca

- display the macro-average ROC AUC score

In [None]:
y_pred_prob = NB_model.predict_proba(X_test_cvec)

roc_auc_score(y_test, y_pred_prob, multi_class='ovr', average='macro')
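
As a rough illustration of what `multi_class='ovr'` with `average='macro'` means, the sketch below treats each class in turn as the positive class, computes a binary AUC from pairwise score comparisons, and takes the unweighted mean. The helper names and the toy probabilities are assumptions made for this example only.

```python
def binary_auc(is_positive, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(y_true, prob_rows, labels):
    """One-vs-rest AUC per class, then the unweighted (macro) mean."""
    aucs = []
    for col, label in enumerate(labels):
        is_pos = [t == label for t in y_true]
        scores = [row[col] for row in prob_rows]
        aucs.append(binary_auc(is_pos, scores))
    return sum(aucs) / len(aucs), aucs

# Invented labels and predicted probabilities (columns: other, kf, lca)
y_true_toy = [0, 0, 1, 1, 2, 2]
probs_toy = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.5, 0.3],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],
]
macro_auc, per_class_auc = macro_ovr_auc(y_true_toy, probs_toy, labels=[0, 1, 2])
print(per_class_auc, macro_auc)
```

Because the mean is unweighted, each class contributes equally to the score regardless of how many test rows it has.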

- macro-average f1-score of KrisFlyer and LCA

In [None]:
f1_score(y_test, y_pred, labels=[1,2], average = 'macro')
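
The call above restricts the score to labels 1 (kf) and 2 (lca) and uses `average='macro'`, i.e. the plain mean of the per-class f1-scores. The sketch below shows how that differs from a support-weighted mean; the helper functions and toy labels are invented for illustration.

```python
def f1_for_label(y_true, y_pred, label):
    """f1 = 2*TP / (2*TP + FP + FN) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred, labels, average):
    """'macro': plain mean of per-class f1; 'weighted': mean weighted by class support."""
    scores = [f1_for_label(y_true, y_pred, lab) for lab in labels]
    if average == 'macro':
        return sum(scores) / len(scores)
    supports = [sum(1 for t in y_true if t == lab) for lab in labels]
    return sum(s * w for s, w in zip(scores, supports)) / sum(supports)

# Invented labels: 0 = other, 1 = kf, 2 = lca
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 0, 1, 1, 1, 0, 2, 0]

macro_f1 = averaged_f1(y_true_toy, y_pred_toy, labels=[1, 2], average='macro')
weighted_f1 = averaged_f1(y_true_toy, y_pred_toy, labels=[1, 2], average='weighted')
print(macro_f1, weighted_f1)
```

With imbalanced classes the two averages diverge: the macro figure gives the smaller class the same say as the larger one.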

## 4.2 Alternate 2
### - using removed duplicated sentences, CountVectoriser and Multinomial NB

- train-test split our df

In [None]:
X = services_stem_df['sent']
y = services_stem_df['y_true']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
X_train

- tokenise with Count Vectoriser

In [None]:
cvec = CountVectorizer(max_features = 500, stop_words = stem_stopwords) 
X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)

- instantiate and fit a Naive Bayes model

In [None]:
nb = MultinomialNB()
NB_model = ovr(nb).fit(X_train_cvec, y_train)  # using OneVsRestClassifier

- visualise confusion matrix

In [None]:
y_pred = NB_model.predict(X_test_cvec)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Others', 'KrisFlyer', 'LCA'])
disp.plot();

- display precision, recall, f1-score of each class

In [None]:
print(classification_report(y_test, y_pred))
# 0: other, 1: kf, 2: lca

- display the macro-average ROC AUC score

In [None]:
y_pred_prob = NB_model.predict_proba(X_test_cvec)

roc_auc_score(y_test, y_pred_prob, multi_class='ovr', average='macro')

- macro-average f1-score of KrisFlyer and LCA

In [None]:
f1_score(y_test, y_pred, labels=[1,2], average = 'macro')

## 4.3 Alternate 3
### - using removed duplicated sentences, TFIDF and Multinomial NB

- tokenise with TFIDF (reusing the train-test split from 4.2)

In [None]:
tvec = TfidfVectorizer(max_features = 500, stop_words = stem_stopwords)
X_train_cvec = tvec.fit_transform(X_train)
X_test_cvec = tvec.transform(X_test)

- instantiate and fit a Naive Bayes model

In [None]:
nb = MultinomialNB()
NB_model = ovr(nb).fit(X_train_cvec, y_train)  # using OneVsRestClassifier

- visualise confusion matrix

In [None]:
y_pred = NB_model.predict(X_test_cvec)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Others', 'KrisFlyer', 'LCA'])
disp.plot();

- display precision, recall, f1-score of each class

In [None]:
print(classification_report(y_test, y_pred))
# 0: other, 1: kf, 2: lca

- display the macro-average ROC AUC score

In [None]:
y_pred_prob = NB_model.predict_proba(X_test_cvec)

roc_auc_score(y_test, y_pred_prob, multi_class='ovr', average='macro')

- macro-average f1-score of KrisFlyer and LCA

In [None]:
f1_score(y_test, y_pred, labels=[1,2], average = 'macro')

---
## 5. Remarks
---

|                                     | Baseline Model<br>(from notebook 2) | Alternate 1<br>\*Best performance\* | Alternate 2 | Alternate 3 |
|-------------------------------------|----------------|-------------|-------------|-------------|
| Pre-processing | - Basic cleaning<br>- Stem        | - Basic cleaning<br>- Stem       | - Basic cleaning<br>- Stem<br>- Remove duplicated sentences | - Basic cleaning<br>- Stem<br>- Remove duplicated sentences |
| Vectoriser     | CountVectoriser         | TFIDF                   | CountVectoriser                                   | TFIDF                                             |
| Model          | Multinomial Naive Bayes | Multinomial Naive Bayes | Multinomial Naive Bayes                           | Multinomial Naive Bayes                           |
| Macro-average ROC AUC            | 0.887          | 0.908       | 0.807       | 0.819       |
| Macro-average f1-score (kf, lca) | 0.752          | 0.759       | 0.625       | 0.630       |

- The Alternate 1 combination of pre-processing and vectoriser performed best, and will be used in the next step of finding the best model.
<br>
<br>
- TFIDF performed slightly better than CountVectoriser (baseline vs alt 1, alt 2 vs alt 3).
<br>
<br>
- Removing duplicated sentences appeared to hurt model performance. This suggests that the quotes contain valuable keywords that deserve to be emphasised (i.e. the comments that people usually reply to contain valuable keywords).