**Name:** Xian Jia Le, Ben

**EID:** 56214537

**Kaggle Team Name:** Glanceman

# CS5489 - Assignment 1 - SMS classification

## Goal
In this assignment, the task is to predict whether an SMS message is a real message, a spam message, or a phishing message (called smishing). Here are some examples:

  - **Normal**: "For real tho this sucks. I can't even cook my whole electricity is out. And I'm hungry."
  - **Spam**: "Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out"
  - **Smishing**: "Todays Vodafone numbers ending 5347 are selected to receive a Rs.2,00,000 award. If you have a match please call 6299257179 quoting claim code 2041 standard rates apply"


Your goal is to train a classifier to predict the class from the SMS text. 


## Methodology
You need to train classifiers using the training data, and then predict on the test data. You are free to choose the feature extraction method and classifier algorithm.  You are free to use methods that were not introduced in class.  You should probably do cross-validation to select good parameters.


## Evaluation on Kaggle

You need to submit your test predictions to Kaggle for evaluation.  50% of the test data will be used to show your ranking on the live leaderboard.  After the assignment deadline, the remaining 50% will be used to calculate your final ranking. Also, the top-ranked entries will be asked to give a short 5-minute presentation on what they did.

The evaluation metric used on Kaggle is **balanced accuracy score**. This is because the dataset has some class imbalance as there are more normal samples than spam/smishing samples. See details for `sklearn.metrics.balanced_accuracy_score` [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html).
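
To make the metric concrete, here is a minimal sketch of balanced accuracy computed by hand as the mean of per-class recall. The function name `balanced_accuracy` and the toy labels are illustrative assumptions, not part of the assignment data; `sklearn.metrics.balanced_accuracy_score` should give the same value on these inputs.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: each class counts equally, however rare."""
    total = defaultdict(int)    # number of true samples per class
    correct = defaultdict(int)  # correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy labels (made up): 8 normal (0), 1 spam (1), 1 smishing (2)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
# A classifier that always predicts "normal"
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # (1 + 0 + 0) / 3
```

Always predicting the majority class looks good under plain accuracy but scores only one third under balanced accuracy, which is why the two metrics can disagree on imbalanced data.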

To submit to Kaggle you need to create an account, and use the competition invitation that is posted to Canvas. You must submit your Kaggle account name to the "Kaggle Username" assignment on Canvas **1 week before the Assignment 1 deadline**. This is to prevent students from creating multiple Kaggle accounts to gain unfair advantage. 

**Note:** You can only submit 2 times per day to Kaggle!

## What to hand in
You need to turn in the following things:

1. This ipynb file with your source code and documentation. _**You should write about all the various attempts that you make to find a good solution.**_ You may also submit python scripts as source code, but your documentation must be in the ipynb file.
2. Your final csv submission file to Kaggle.
3. The ipynb file `Assignment1-Final.ipynb`, which contains the code that generates the final submission file that you submit to Kaggle.  **This code will be used to verify that your Kaggle submission is reproducible.**
4. Your Kaggle username (submitted to the "Kaggle Username" assignment on Canvas 1 week before the Assignment 1 deadline)

Files should be uploaded to Assignment 1 on Canvas.

## Grading
The marks of the assignment are distributed as follows:
- 45% - Results using various classifiers and feature representations.
- 30% - Trying out feature representations (e.g. adding additional features) or classifiers not used in the tutorials/lectures.
- 20% - Quality of the written report.  More points for insightful observations and analysis.
- 5% - Final ranking on the Kaggle test data (private leaderboard). Only the Kaggle username submitted to Canvas on time will be considered. If a submission cannot be reproduced by the submitted code, it will not receive marks for ranking.
- **Late Penalty:** 25 marks will be subtracted for each day late.

**NOTE:** This is an _individual_ assignment.

**NOTE:** you should start early! Some classifiers may take a while to train.


## Kaggle Notebooks

If you like, you can use Kaggle notebooks to run your code. Note that you still need to submit your code to Canvas for grading.
<hr>

# Load the Data

The training data is in the text file `smishing_train.txt`.  This CSV file contains the SMS text and the class label. The class labels are: `0`, `1`, `2`, which are `normal`, `spam`, `smishing`. 

The testing data is in the text file `smishing_test.txt`, and only contains the SMS text.

To submit to Kaggle, you need to generate a Kaggle submission file, which is a CSV file with the following format: 

<pre>
Id,Prediction
1,0
2,1
3,0
4,2
...
</pre>

Here are two helpful functions for reading the text data and writing the Kaggle submission file.

In [1]:
%matplotlib inline
import matplotlib_inline   # setup output image format
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import matplotlib.pyplot as plt
import matplotlib
from numpy import *
from sklearn import *
from scipy import stats
import csv
random.seed(100)

In [2]:
def read_text_data(fname):
    txtdata = []
    classes = []
    with open(fname, 'r', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in reader:
            # get the text
            txtdata.append(row[0])
            # get the class (convert to integer)
            if len(row)>1:
                classes.append(int(row[1]))
        
    return (txtdata, classes)

def write_csv_kaggle_sub(fname, Y):
    # fname = file name
    # Y is a list/array with class entries
    
    # header
    tmp = [['Id', 'Prediction']]
    
    # add ID numbers for each Y
    for (i,y) in enumerate(Y):
        tmp2 = [(i+1), y]
        tmp.append(tmp2)
        
    # write CSV file
    with open(fname, 'w') as f:
        writer = csv.writer(f)
        writer.writerows(tmp)

The code below will load the training and test sets.

In [3]:
# load the data
(trainTxt, trainY) = read_text_data("smishing_train.txt")
(testX, _)   = read_text_data("smishing_test.txt")

print(len(trainTxt))
print(len(testX))

2985
2986


In [4]:
# show the classnames
classnames = unique(trainY)
print(classnames)

[0 1 2]


In [5]:
classlabels = ['normal', 'spam', 'smishing']

Here is an example of writing a CSV file with predictions on the test set.  These are random predictions.

In [6]:
# write your predictions on the test set
i = random.randint(len(classnames), size=len(testX))
print(i.shape)
predY = classnames[i]
write_csv_kaggle_sub("Output/my_submission.csv", predY)

(2986,)


Look at the data:

In [7]:
for c in classnames:
    tmp = where(trainY==c)
    for a in tmp[0][0:5]:
        print('[{}]: {}'.format(classlabels[trainY[a]], trainTxt[a]))

[normal]: Dunno da next show aft 6 is 850. Toa payoh got 650.
[normal]: I.ll hand her my phone to chat wit u
[normal]: I dont have i shall buy one dear
[normal]: Nite...
[normal]: Ok�congrats�
[spam]: I'd like to tell you my deepest darkest fantasies. Call me 09094646631 just 60p/min. To stop texts call 08712460324 (nat rate)
[spam]: Santa Calling! Would your little ones like a call from Santa Xmas eve? Call 09058094583 to book your time.
[spam]: Meet Top 35 US universities in Delhi at India Habitat Centre Lodhi Road on Nov 8th, 2 to 6 pm for student admission.Entry Free,  details contact 9911489000
[spam]: SMS AUCTION You have won a Nokia 7250i. This is what you get when you win our FREE auction. To take part send Nokia to 86021 now. HG/Suite342/2Lands Row/W1JHL 16+
[spam]: Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 86.
[smishing]: WIN URGENT! Your mobile number has been awarded with a £2000 prize GUARANTEED call 09061790121 from lan

# YOUR CODE and DOCUMENTATION HERE

## Data cleaning

All text is converted to lower case, and every character other than letters, digits, and apostrophes is replaced with a space.

In [8]:
# INSERT YOUR CODE HERE
import re
def cleanText(text:list):
    cleanedText=[]
    for sentence in text:
        cleaned = re.sub("[^a-zA-Z0-9']"," ",sentence)
        lowered = cleaned.lower()
        stripped = lowered.strip()
        cleanedText.append(stripped)
    return cleanedText

trainTxt=cleanText(trainTxt)
testX = cleanText(testX)

for c in classnames:
    tmp = where(trainY==c)
    for a in tmp[0][0:5]:
        print('[{}]: {}'.format(classlabels[trainY[a]], trainTxt[a]))

[normal]: dunno da next show aft 6 is 850  toa payoh got 650
[normal]: i ll hand her my phone to chat wit u
[normal]: i dont have i shall buy one dear
[normal]: nite
[normal]: ok congrats
[spam]: i'd like to tell you my deepest darkest fantasies  call me 09094646631 just 60p min  to stop texts call 08712460324  nat rate
[spam]: santa calling  would your little ones like a call from santa xmas eve  call 09058094583 to book your time
[spam]: meet top 35 us universities in delhi at india habitat centre lodhi road on nov 8th  2 to 6 pm for student admission entry free   details contact 9911489000
[spam]: sms auction you have won a nokia 7250i  this is what you get when you win our free auction  to take part send nokia to 86021 now  hg suite342 2lands row w1jhl 16
[spam]: call germany for only 1 pence per minute  call from a fixed line via access number 0844 861 85 86
[smishing]: win urgent  your mobile number has been awarded with a  2000 prize guaranteed call 09061790121 from land line  c

## Approach 1.1: CountVectorizer with MultinomialNB (Kaggle: 0.868)

In [9]:
import sklearn.naive_bayes as NB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer(stop_words="english")
trainXVec = vectorizer.fit_transform(trainTxt)


Multi_clf= NB.MultinomialNB()

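# Note: alpha=0 (the first grid value) means no smoothing; depending on the
# scikit-learn version it may be clipped to a tiny positive value with a warning.
param_grid = {'alpha': linspace(0,1,10)}
# cross validation (GridSearchCV scores classifiers with plain accuracy by default)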
grid_search = GridSearchCV(Multi_clf, param_grid, cv=5, n_jobs=-1,verbose=1)
grid_search.fit(trainXVec, trainY)

estimator = grid_search.best_estimator_
print(f"mean test score : {grid_search.cv_results_['mean_test_score']}")

predY = estimator.predict(trainXVec)
print(accuracy_score(predY,trainY))

testXVec = vectorizer.transform(testX)
predY = estimator.predict(testXVec)
write_csv_kaggle_sub("Output/Submission-BoW+MultinomialNB.csv", predY)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


mean test score : [0.92730318 0.92830821 0.92964824 0.93366834 0.9360134  0.94103853
 0.94304858 0.94371859 0.9440536  0.94673367]
0.9842546063651592


## Approach 1.2: CountVectorizer with SVM (Kaggle: 0.81)

In [10]:
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer( max_features=1000)
trainXVec = vectorizer.fit_transform(trainTxt)


SVM_clf = SVC()

param_grid = {
    "C":logspace(-4,4,20),
    "kernel":['linear','rbf','poly']
    }
# cross validation
grid_search = GridSearchCV(SVM_clf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(trainXVec, trainY)

estimator = grid_search.best_estimator_
print(f"mean test score : {grid_search.cv_results_['mean_test_score']}")

predY = estimator.predict(trainXVec)
print(accuracy_score(predY,trainY))

testXVec = vectorizer.transform(testX)
predY = estimator.predict(testXVec)
write_csv_kaggle_sub("Output/Submission-CountVector+SVM.csv", predY)

mean test score : [0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82211055 0.82211055 0.8241206  0.82211055 0.82211055
 0.87805695 0.82211055 0.82211055 0.92194305 0.82211055 0.82177554
 0.93936348 0.82211055 0.8321608  0.94572864 0.82278057 0.86264657
 0.94539363 0.86599665 0.88643216 0.94438861 0.93065327 0.90485762
 0.94137353 0.94371859 0.91323283 0.94070352 0.94606365 0.91959799
 0.94070352 0.94539363 0.92194305 0.94070352 0.94438861 0.92428811
 0.94070352 0.94438861 0.92462312 0.94070352 0.94438861 0.92361809
 0.94070352 0.94438861 0.92361809 0.94070352 0.94438861 0.92328308
 0.94070352 0.94438861 0.92328308 0.94070352 0.94438861 0.92328308]
0.9959798994974874


## Approach 1.3: CountVectorizer with Logistic Regression (Kaggle: 0.859)

In [18]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer(stop_words="english")
trainXVec = vectorizer.fit_transform(trainTxt)
testXVec = vectorizer.transform(testX)

logistic_regression = LogisticRegression(max_iter=1000) 
param_grid = {
    'C': logspace(-4,4,20),  # Regularization parameter (smaller values for more regularization)
    'solver': ['liblinear', 'lbfgs','saga'],  # Solver algorithm
}

grid_search = GridSearchCV(logistic_regression, param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(trainXVec, trainY)


estimator= grid_search.best_estimator_
print(f"mean test score : {grid_search.cv_results_['mean_test_score']}")

predY=estimator.predict(trainXVec)
print(accuracy_score(predY,trainY))

predY = estimator.predict(testXVec)
write_csv_kaggle_sub("Output/Submission-CountVector+Logistic.csv", predY)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
mean test score : [0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82244556 0.82546064 0.82579564 0.84656616 0.86063652 0.86097152
 0.88509213 0.89849246 0.89916248 0.91323283 0.92261307 0.9239531
 0.93031826 0.93701843 0.93701843 0.94070352 0.94438861 0.94472362
 0.94539363 0.94572864 0.94606365 0.94572864 0.9440536  0.94639866
 0.94472362 0.94237856 0.94505863 0.94237856 0.94070352 0.94271357
 0.94036851 0.93668342 0.94271357 0.93936348 0.93534338 0.94271357
 0.93802345 0.93366834 0.94271357 0.93735343 0.9319933  0.94304858
 0.93668342 0.9319933  0.94271357 0.93668342 0.93232831 0.94237856]
0.992964824120603




## Approach 1.4: CountVectorizer with Random Forest (Kaggle: 0.71)

In [16]:
from sklearn import ensemble, metrics, model_selection

vectorizer = CountVectorizer(stop_words="english")
trainXVec = vectorizer.fit_transform(trainTxt)
testXVec = vectorizer.transform(testX)

rf=ensemble.RandomForestClassifier(random_state=7)
paramgrid={"n_estimators":array([100,250,500]),
"max_depth":stats.randint(15,35),
"min_samples_split":[2, 6, 10],
"min_samples_leaf": [1, 3, 4]
}

rfCV=model_selection.RandomizedSearchCV(rf,paramgrid,random_state=7,n_iter=1000,cv=5,n_jobs=-1,verbose=1)
rfCV.fit(trainXVec,trainY)

print(f"mean test score : {rfCV.cv_results_['mean_test_score']}")

rfclf = rfCV.best_estimator_

predY=rfclf.predict(trainXVec)
print(metrics.accuracy_score(predY,trainY))
predY=rfclf.predict(testXVec)
write_csv_kaggle_sub("Output/Submission-CountVector+RandomForest.csv", predY)

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
0.9122278056951424


## Approach 2.1: TF-IDF with MultinomialNB (Kaggle: 0.38)

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

vectorizer_Tfidf = TfidfVectorizer(max_features=1000)
train_X_tfidf = vectorizer_Tfidf.fit_transform(trainTxt)

Multi_clf= NB.MultinomialNB()

param_grid = {'alpha': [0.1, 2.0, 20.0]}
# cross validation
grid_search = GridSearchCV(Multi_clf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(train_X_tfidf, trainY)

estimator = grid_search.best_estimator_
print(f"mean test score : {grid_search.cv_results_['mean_test_score']}")

predY = estimator.predict(train_X_tfidf)
print(accuracy_score(predY,trainY))

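# Bug fix: reuse the vocabulary and IDF weights learned from the training text.
# The original call was fit_transform(testX), which refits the vectorizer on the
# test set so the columns no longer match the features the classifier was trained
# on; this likely explains the low Kaggle score recorded in the heading above.
test_X_tfidf = vectorizer_Tfidf.transform(testX)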
predY = estimator.predict(test_X_tfidf)
write_csv_kaggle_sub("Output/Submission-TF-IDF+MultinomialNB.csv", predY)

mean test score : [0.95309883 0.91390285 0.82211055]
0.9708542713567839
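
A vectorizer fitted on the training text should only be applied to test text with `transform`, because refitting rebuilds the vocabulary and changes what each feature column means. One way to make this hard to get wrong is to wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below uses made-up toy sentences and an arbitrary `alpha`; it is an illustration, not the configuration used for any submission in this report.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (made up for illustration; not taken from the assignment files)
toy_train = ["free prize call now", "see you at lunch",
             "win cash prize now", "are you home yet"]
toy_labels = [1, 0, 1, 0]
toy_test = ["call now to win a prize", "lunch at home"]

# The pipeline fits the vectorizer on the training text only;
# predict() then applies transform(), never fit_transform(), to new text.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.1))
pipe.fit(toy_train, toy_labels)
toy_pred = pipe.predict(toy_test)

# The vocabulary (and hence the column order) is fixed at fit time
n_features = len(pipe.named_steps["tfidfvectorizer"].vocabulary_)
print(toy_pred, n_features)
```

A pipeline can also be passed directly to `GridSearchCV`, in which case the vectorizer is refitted inside each training fold and never sees the held-out fold.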


## Approach 2.2: TF-IDF with SVM (Kaggle: 0.822)

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

vectorizer_Tfidf = TfidfVectorizer(max_features=1000)
train_X_tfidf = vectorizer_Tfidf.fit_transform(trainTxt)

SVM_clf = SVC()

param_grid = {
    "C":logspace(-4,4,20),
    "kernel":['linear','rbf','poly']
    }
# cross validation
grid_search = GridSearchCV(SVM_clf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(train_X_tfidf , trainY)

estimator = grid_search.best_estimator_
print(f"mean test score : {grid_search.cv_results_['mean_test_score']}")

predY = estimator.predict(train_X_tfidf)
print(accuracy_score(predY,trainY))

testX_Tfidf = vectorizer_Tfidf.transform(testX)
predY = estimator.predict(testX_Tfidf)
write_csv_kaggle_sub("Output/Submission-TF_IDF-SVM.csv", predY)


mean test score : [0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82278057 0.87068677 0.93668342 0.95075377 0.95142379
 0.95108878 0.95108878 0.95108878 0.95108878 0.95108878 0.95108878
 0.95108878 0.95108878]
0.9963149078726968


## Approach 2.3: TF-IDF with Logistic Regression (Kaggle: 0.85)

In [14]:
from sklearn.linear_model import LogisticRegression

tfidf_vectorizer = TfidfVectorizer(max_features=1000)
train_X_tfidf = tfidf_vectorizer.fit_transform(trainTxt)
test_X_tfidf = tfidf_vectorizer.transform(testX)

logistic_regression = LogisticRegression(max_iter=1000) 

# Define a grid of hyperparameters to search
param_grid = {
    'C': logspace(-4,4,20),  # Regularization parameter (smaller values for more regularization)
    'solver': ['liblinear', 'lbfgs'],  # Solver algorithm
}

# Perform grid search with cross-validation to find the best hyperparameters
grid_search = GridSearchCV(logistic_regression, param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(train_X_tfidf, trainY)


estimator = grid_search.best_estimator_
print(f"mean test score : {grid_search.cv_results_['mean_test_score']}")

predY = estimator.predict(train_X_tfidf)
print(accuracy_score(predY,trainY))

predY = estimator.predict(test_X_tfidf)
write_csv_kaggle_sub("Output/Submission-TF_IDF-Logistic.csv", predY)


Fitting 5 folds for each of 40 candidates, totalling 200 fits
mean test score : [0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82211055 0.82211055 0.8281407  0.84623116 0.88241206
 0.90619765 0.9279732  0.93366834 0.94271357 0.94505863 0.94941374
 0.94941374 0.95008375 0.94840871 0.94740369 0.94840871 0.94706868
 0.94840871 0.94639866 0.94773869 0.94572864 0.94639866 0.94572864
 0.94438861 0.94438861 0.9440536  0.94472362]
0.9943048576214405


## Approach 3.1: Word embedding with Logistic Regression (Kaggle: 0.7)

In [15]:
import numpy as np
# transform to word2vec
def word2vec_Avg_Transform(word2vec_model,text:list):
    doc=[]
    for sentence in text:
        sen=[]
        for word in sentence.split():
            if word in word2vec_model.wv:
                vec = word2vec_model.wv[word]
                sen.append(vec)
        if(len(sen)!=0):
            sen=np.mean(np.array(sen),axis=0)
            doc.append(sen)
        else:
            sen=np.zeros(word2vec_model.vector_size)
            doc.append(sen)
    res=np.array(doc)
    return res

In [30]:
from gensim.models import Word2Vec

tokenized_text = [text.split() for text in trainTxt]
word2vec_model = Word2Vec(tokenized_text, vector_size=100, window=3, min_count=1)

X_train_word2vec=word2vec_Avg_Transform(word2vec_model,trainTxt)
X_test_word2vec=word2vec_Avg_Transform(word2vec_model,testX)


logistic_regression = LogisticRegression(max_iter=1000) 

# Define a grid of hyperparameters to search
param_grid = {
    'C': logspace(-4,4,20),  # Regularization parameter (smaller values for more regularization)
    'solver': ['liblinear', 'lbfgs'],  # Solver algorithm
}

# Perform grid search with cross-validation to find the best hyperparameters
grid_search = GridSearchCV(logistic_regression, param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X_train_word2vec, trainY)

estimator = grid_search.best_estimator_
print(f"mean test score : {grid_search.cv_results_['mean_test_score']}")

predY = estimator.predict(X_train_word2vec)
print(accuracy_score(predY,trainY))

predY = estimator.predict(X_test_word2vec)
write_csv_kaggle_sub("Output/Submission-Word_Embeding+LogisticRegression.csv", predY)


Fitting 5 folds for each of 40 candidates, totalling 200 fits
mean test score : [0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82211055 0.82211055 0.82211055 0.8241206  0.83886097
 0.85226131 0.86532663 0.87705193 0.88308208 0.88542714 0.89447236
 0.8958124  0.89916248 0.89916248 0.89849246]
0.9182579564489112


## Approach 3.2: Word embedding with SVM (0.76)

In [11]:
from gensim.models import Word2Vec

tokenized_text = [text.split() for text in trainTxt]
print(f"token text : {tokenized_text[0]}")
word2vec_model = Word2Vec(tokenized_text, vector_size=100)



X_train_word2vec=word2vec_Avg_Transform(word2vec_model,trainTxt)
X_test_word2vec=word2vec_Avg_Transform(word2vec_model,testX)

SVM_clf = SVC()
param_grid = {"C":logspace(-3,3,13)}
# cross validation
grid_search = GridSearchCV(SVM_clf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train_word2vec , trainY)

estimator = grid_search.best_estimator_
print(f"mean test score : {grid_search.cv_results_['mean_test_score']}")

predY = estimator.predict(X_train_word2vec)
print(accuracy_score(predY,trainY))

predY = estimator.predict(X_test_word2vec)
write_csv_kaggle_sub("Output/Submission-Word_Embeding-SVM.csv", predY)


token text : ['dunno', 'da', 'next', 'show', 'aft', '6', 'is', '850', 'toa', 'payoh', 'got', '650']
mean test score : [0.82211055 0.82211055 0.82211055 0.82211055 0.82211055 0.82211055
 0.82211055 0.82211055 0.82211055 0.83283082 0.89681742 0.91423786
 0.92462312]
0.9386934673366835


## Approach 3.3: Word embedding with Random Forest (Kaggle: 0.46)

In [31]:
from sklearn.ensemble import RandomForestClassifier

tokenized_text = [text.split() for text in trainTxt]
word2vec_model = Word2Vec(tokenized_text, vector_size=100, window=3, min_count=1)

X_train_word2vec=word2vec_Avg_Transform(word2vec_model,trainTxt)
X_test_word2vec=word2vec_Avg_Transform(word2vec_model,testX)

RFC = RandomForestClassifier()

param_grid = {
    'n_estimators': [50, 100, 150],  # Number of trees in the forest
    'max_depth': [10, 20, 30],  # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]  # Minimum number of samples required to be at a leaf node
}

grid_search = GridSearchCV(RFC, param_grid, cv=5, verbose=1, n_jobs=-1)

grid_search.fit(X_train_word2vec , trainY)

estimator = grid_search.best_estimator_
print(f"mean test score : {grid_search.cv_results_['mean_test_score']}")

predY = estimator.predict(X_train_word2vec)
print(accuracy_score(predY,trainY))

predY = estimator.predict(X_test_word2vec)
write_csv_kaggle_sub("Output/Submission-Word_Embeding-RandomForest.csv", predY)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
mean test score : [0.86465662 0.86465662 0.86532663 0.86365159 0.86432161 0.86264657
 0.85561139 0.85762144 0.85896147 0.86599665 0.86097152 0.86499162
 0.86197655 0.86264657 0.86063652 0.85929648 0.85929648 0.85661642
 0.85628141 0.85628141 0.85628141 0.85561139 0.85561139 0.85762144
 0.85628141 0.8559464  0.8519263  0.84857621 0.84690117 0.84623116
 0.84958124 0.84690117 0.84522613 0.84455611 0.84489112 0.84254606
 0.84656616 0.84623116 0.84656616 0.84690117 0.84422111 0.84623116
 0.84254606 0.84321608 0.84087102 0.84221106 0.84154104 0.84187605
 0.84154104 0.84120603 0.84288107 0.84053601 0.84120603 0.84187605
 0.86566164 0.86465662 0.86465662 0.86633166 0.86700168 0.86432161
 0.8559464  0.85862647 0.86063652 0.86063652 0.86298157 0.86465662
 0.86365159 0.86264657 0.86365159 0.8599665  0.85829146 0.85728643
 0.85795645 0.85728643 0.85527638 0.85896147 0.85561139 0.85561139
 0.85561139 0.85393635 0.85393635 0.86432161 0.8

# Conclusion

To classify SMS messages as normal, spam, or smishing, we ran a series of experiments that paired three text representations with several classifiers. The three families of methods are summarized below:

**Method 1: CountVectorizer with Machine Learning Models**

In Method 1, we used `CountVectorizer` to convert each SMS message into a numerical feature vector of word counts. We then trained four classifiers on these vectors: MultinomialNB, SVM, logistic regression, and Random Forest.

**Method 2: TF-IDF (Term Frequency-Inverse Document Frequency) with Various Models**

Method 2 used TF-IDF, which is designed to reduce the influence of uninformative words. TF-IDF weights each word by how often it occurs in a message and down-weights words that occur in many messages across the dataset. We evaluated this representation with MultinomialNB, SVM, and logistic regression.
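
The weighting idea can be illustrated with a small worked example. This sketch uses the textbook formula (raw count times `log(N / df)`) on a made-up corpus; scikit-learn's `TfidfVectorizer` uses a smoothed variant by default (`ln((1 + N) / (1 + df)) + 1`, followed by L2 normalization), so its numbers differ, but the ordering of common versus rare words follows the same principle.

```python
import math

# Toy corpus (made up): "call" appears in every message, "prize" in only one
docs = [
    "call me later".split(),
    "call now to claim your prize".split(),
    "please call mum".split(),
]

def idf(term, docs):
    """Textbook inverse document frequency: log(N / df)."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Raw term count in the document times the term's idf."""
    return doc.count(term) * idf(term, docs)

print(tf_idf("call", docs[1], docs))   # 0.0: occurs in every document
print(tf_idf("prize", docs[1], docs))  # log(3), about 1.10
```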

**Method 3: Word Embedding Technique**

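In Method 3, we used word embeddings. A Word2Vec model was trained on the training messages, so each word vector reflects the contexts (neighbouring words) in which that word appears. Each message was then represented by the average of its word vectors and classified with logistic regression, SVM, and Random Forest. Because of the averaging, the final message vector does not retain word order.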
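
One property of the averaging step in `word2vec_Avg_Transform` is worth keeping in mind: the mean of a set of word vectors does not depend on the order of the words. The sketch below shows this with hypothetical 3-dimensional vectors and a simplified stand-in function, `average_embedding`, rather than a trained Word2Vec model.

```python
# Hypothetical 3-dimensional word vectors (the Word2Vec vectors above have 100 dimensions)
toy_vectors = {
    "call": [1.0, 0.0, 2.0],
    "me":   [0.0, 1.0, 1.0],
    "now":  [3.0, 1.0, 0.0],
}

def average_embedding(sentence, vectors):
    """Mean of the vectors of known words; zeros if no word is known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

a = average_embedding("call me now", toy_vectors)
b = average_embedding("now call me", toy_vectors)
print(a == b)  # True: the same words in a different order give the same vector
```

Context still influences the individual word vectors through the Word2Vec training window, but two messages with the same words in a different order end up with the same features.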

**Analyzing Experimental Results**

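Comparing the scores reported in the approach headings, Method 1 performed best on average: about 0.81 across its four classifiers, versus about 0.68 for Method 2 and 0.64 for Method 3. The best single result was CountVectorizer with MultinomialNB (0.868). The ranking is not uniform, however: TF-IDF with SVM (0.822) slightly beat CountVectorizer with SVM (0.81), and Method 2's average is pulled down by the 0.38 score of TF-IDF with MultinomialNB, an outlier that probably reflects a problem in that submission rather than the representation itself. Overall, the results suggest that plain word counts are a suitable representation for our dataset. The dataset consists mostly of short SMS messages containing many rare or unique words, which makes it hard for classifiers to find similarities between messages. TF-IDF is good at down-weighting common words, but that offered little benefit in this setting.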

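Furthermore, all of our models showed some degree of overfitting. Training accuracy ranged from about 0.91 to 0.996 and the best cross-validation accuracy was around 0.95, yet the Kaggle scores ranged from 0.38 to 0.868. Part of this gap may be due to the metric rather than overfitting alone: the grid searches used plain accuracy, whereas Kaggle uses balanced accuracy, which penalizes mistakes on the minority spam and smishing classes more heavily.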
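
One way to make validation scores more comparable to the leaderboard is to have the grid search optimize balanced accuracy instead of the default accuracy scorer. The following is a minimal sketch under stated assumptions: the toy messages, the step names `vec` and `clf`, and the small `alpha` grid are all made up for illustration and were not used for any submission.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy imbalanced data (made up): 6 normal (0), 3 spam (1), 3 smishing (2)
toy_txt = [
    "see you at lunch", "are you home yet", "ok thanks dear",
    "call me later tonight", "i am on the bus", "good night",
    "free ringtone text yes now", "win a free phone today",
    "free texts this weekend",
    "your account is locked call 12345", "claim your award call now",
    "verify your bank account now",
]
toy_y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])
search = GridSearchCV(
    pipe,
    param_grid={"clf__alpha": [0.1, 0.5, 1.0]},
    scoring="balanced_accuracy",  # same metric as the Kaggle leaderboard
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(toy_txt, toy_y)
print(search.best_params_, search.best_score_)
```

Stratified folds keep every class present in each validation split, which matters for a per-class metric when two of the three classes are rare.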

**Proposed Measures for Improvement**

To address these challenges and make our SMS classification system more reliable, we propose the following measures:

1. **Exploration of Different Text Data Vectorization Techniques**: To mitigate overfitting, we should experiment with alternative vectorization methods. Other word-embedding configurations, Doc2Vec, or more advanced vectorization schemes may yield better results.

2. **Consideration of Simpler Classifier Models**: Reducing classifier complexity, for example by using smaller and shallower Random Forest or XGBoost models, may improve how well the classifier generalizes to unseen messages and make it less prone to overfitting.

3. **Consideration of Adding an English Dictionary**: Use a dictionary to identify misspelled words and lower their weight in the feature vectors.
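
The third measure could be sketched roughly as follows. Everything here is hypothetical: `ENGLISH_WORDS` stands in for a real word list, and `weighted_counts` and its `oov_weight` parameter are illustrative names and values that were not tested in this report.

```python
from collections import Counter

# Hypothetical mini dictionary; a real word list would be loaded from a file
ENGLISH_WORDS = {"call", "me", "now", "free", "prize", "you", "have", "won"}

def weighted_counts(text, dictionary, oov_weight=0.5):
    """Bag-of-words counts where out-of-dictionary tokens count for less.

    oov_weight is an assumed tuning parameter, not a value from this report.
    """
    counts = Counter(text.split())
    return {
        word: n * (1.0 if word in dictionary else oov_weight)
        for word, n in counts.items()
    }

features = weighted_counts("you have won a prize call now plz plz", ENGLISH_WORDS)
print(features["prize"], features["plz"])  # 1.0 1.0 (two "plz" at half weight)
```

Whether down-weighting helps is an open question for this dataset, since informal spellings are common in normal messages too; the weight would need to be chosen by cross-validation.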
   
These adaptations should help make our SMS classification system more effective, especially given the diversity of message content and the overfitting observed in our experiments.