# Analysing SMS Content to Detect Spam From Ham

## Load the important libraries and dataset

In [1]:
import pandas as pd
import nltk # Natural Language Toolkit library
import re # Regular Expression library
import string
# nltk.download() # to download the needed libraries

In [2]:
pd.set_option('display.max_colwidth', 100)

fullCorpus = pd.read_csv('fullCorpus_feature_engineering.csv')
X_tfidf_df = pd.read_csv('X_tfidf_df.csv')

fullCorpus.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized,body_length,punct%
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"['ive', 'been', 'searching', 'for', 'the', 'right', 'words', 'to', 'thank', 'you', 'for', 'this'...","['ive', 'searching', 'right', 'words', 'thank', 'breather', 'promise', 'wont', 'take', 'help', '...","['ive', 'search', 'right', 'word', 'thank', 'breather', 'promis', 'wont', 'take', 'help', 'grant...","['ive', 'searching', 'right', 'word', 'thank', 'breather', 'promise', 'wont', 'take', 'help', 'g...",160,2.5
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...","['free', 'entri', '2', 'wkli', 'comp', 'win', 'fa', 'cup', 'final', 'tkt', '21st', 'may', '2005'...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...",128,4.7
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"['nah', 'i', 'dont', 'think', 'he', 'goes', 'to', 'usf', 'he', 'lives', 'around', 'here', 'though']","['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though']","['nah', 'dont', 'think', 'goe', 'usf', 'live', 'around', 'though']","['nah', 'dont', 'think', 'go', 'usf', 'life', 'around', 'though']",49,4.1
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"['even', 'my', 'brother', 'is', 'not', 'like', 'to', 'speak', 'with', 'me', 'they', 'treat', 'me...","['even', 'brother', 'like', 'speak', 'treat', 'like', 'aids', 'patent']","['even', 'brother', 'like', 'speak', 'treat', 'like', 'aid', 'patent']","['even', 'brother', 'like', 'speak', 'treat', 'like', 'aid', 'patent']",62,3.2
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']","['date', 'sunday']","['date', 'sunday']","['date', 'sunday']",28,7.1


## Building Machine Learning Classifiers

In this section, we will combine the TF-IDF vectors with the `body_length` and `punct%` features engineered in the previous section.

In [3]:
wn = nltk.WordNetLemmatizer()

def lemmatizing(tokenized_text):
    text = [wn.lemmatize(word) for word in tokenized_text]
    return text

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer=lemmatizing)
X_tfidf = tfidf_vect.fit_transform(fullCorpus['body_text_lemmatized'])
print(X_tfidf.shape)
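# get_feature_names() was deprecated in scikit-learn 1.0 and later removed; get_feature_names_out() replaces it
print(tfidf_vect.get_feature_names_out().tolist())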

(5568, 48)
[' ', "'", ',', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'è', 'é', 'ì', 'ú', 'ü', '〨', '鈥']
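
The 48 features printed above are single characters rather than words. A likely cause, assuming `body_text_lemmatized` was saved to CSV as a list and read back as its string representation, is that the analyzer walks through that string one character at a time. The sketch below contrasts the two cases on a hypothetical two-row sample with a hypothetical `identity` analyzer; `ast.literal_eval` turns each string back into a list of tokens.

```python
import ast
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample: how a list column looks after a CSV round trip.
raw = ["['free', 'entry', 'win']", "['date', 'sunday']"]

def identity(tokens):
    # Hand back the items of whatever document the vectorizer passes in.
    return list(tokens)

# Raw strings: every character of the string becomes a feature.
char_vect = TfidfVectorizer(analyzer=identity).fit(raw)

# Parsed lists: every token becomes a feature.
parsed = [ast.literal_eval(row) for row in raw]
token_vect = TfidfVectorizer(analyzer=identity).fit(parsed)

char_vocab = sorted(char_vect.vocabulary_)
token_vocab = sorted(token_vect.vocabulary_)
print(char_vocab)
print(token_vocab)
```

In this notebook the equivalent step would be something like `fullCorpus['body_text_lemmatized'].apply(ast.literal_eval)` before vectorizing; the outputs shown in this document were produced without it.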




In [5]:
X_features = pd.concat([fullCorpus['body_length'], fullCorpus['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()

Unnamed: 0,body_length,punct%,0,1,2,3,4,5,6,7,...,38,39,40,41,42,43,44,45,46,47
0,160,2.5,0.313691,0.658716,0.313691,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,128,4.7,0.316222,0.650824,0.316222,0.21671,0.214931,0.200747,0.0,0.044145,...,0.07762,0.107613,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,49,4.1,0.313791,0.705991,0.313791,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,62,3.2,0.29668,0.667495,0.29668,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,28,7.1,0.15576,0.613273,0.15576,0.0,0.0,0.0,0.0,0.0,...,0.0,0.233228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 1. Random Forest Classifier

In [6]:
from sklearn.ensemble import RandomForestClassifier

In [7]:
print(dir(RandomForestClassifier))
print(RandomForestClassifier())

['__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_feature_names', '_check_n_features', '_compute_oob_predictions', '_estimator_type', '_get_oob_predictions', '_get_param_names', '_get_tags', '_make_estimator', '_more_tags', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_required_parameters', '_set_oob_score_and_attributes', '_validate_X_predict', '_validate_data', '_validate_estimator', '_validate_y_class_weight', 'apply', 'decision_path', 'feature_importances_', 'fit', 'get_params', 'n_features_', 'predict', 'predict_log_proba', 'predict_proba', 'score',

#### Explore RandomForestClassifier through Cross-Validation

In [8]:
from sklearn.model_selection import KFold, cross_val_score

In [9]:
rf = RandomForestClassifier(n_jobs=-1)
k_fold = KFold(n_splits=5)
cross_val_score(rf, X_features, fullCorpus['label'], cv=k_fold, scoring='accuracy', n_jobs=-1)



array([0.98025135, 0.98204668, 0.98025135, 0.97933513, 0.98113208])
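
For intuition about the five scores above: `KFold(n_splits=5)` without shuffling cuts the 5568 rows into five contiguous test folds, and each score is the accuracy on one of them. The sketch below reproduces only the fold boundaries in plain Python; the helper name `kfold_slices` is hypothetical and is not part of scikit-learn.

```python
def kfold_slices(n_samples, n_splits):
    """Yield (start, stop) bounds of each contiguous test fold."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + 1 if fold < extra else base
        yield start, start + size
        start += size

folds = list(kfold_slices(5568, 5))
fold_sizes = [stop - start for start, stop in folds]
print(fold_sizes)  # [1114, 1114, 1114, 1113, 1113]

# The first score above is consistent with 1092 correct predictions out of 1114.
first_fold_accuracy = round(1092 / 1114, 8)
print(first_fold_accuracy)  # 0.98025135
```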

#### Explore RandomForestClassifier through Holdout Set

In [10]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

To create a holdout test set, we will use the `train_test_split` function as shown below.

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X_features, fullCorpus['label'], test_size=0.2, random_state=42)

Train and fit the model

In [12]:
rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf.fit(X_train, Y_train)



To find the features that contribute most to the model's predictions, we will use the `feature_importances_` attribute.

In [13]:
sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[0:10]

[(0.18433405169989772, 3),
 (0.1062198177627775, 4),
 (0.08815627113333634, 10),
 (0.08770572811240049, 8),
 (0.07546930824190715, 11),
 (0.0494047227478608, 12),
 (0.041242283639700034, 5),
 (0.040152107024117735, 9),
 (0.040130490918402116, 13),
 (0.03683324130124441, 'body_length')]
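
The integer labels in this ranking are positions of TF-IDF columns, so they can be read against the vocabulary printed earlier. The sketch below copies the first 15 vocabulary entries and the rounded top-10 pairs from the outputs above; `column_label` is a hypothetical helper, and it assumes plain Python integers (in the notebook the column labels may be NumPy integers).

```python
# First 15 entries of the vocabulary printed earlier in the notebook.
vocabulary = [' ', "'", ',', '0', '1', '2', '3', '4', '5', '6',
              '7', '8', '9', '[', ']']

# Top-10 (importance, column) pairs from the output above, rounded.
top_features = [(0.184, 3), (0.106, 4), (0.088, 10), (0.088, 8), (0.075, 11),
                (0.049, 12), (0.041, 5), (0.040, 9), (0.040, 13),
                (0.037, 'body_length')]

def column_label(column):
    # Integer columns index into the vocabulary; string columns are already names.
    return vocabulary[column] if isinstance(column, int) else column

named = [(importance, column_label(column)) for importance, column in top_features]
for importance, name in named:
    print(f"{name!r}: {importance}")
```

Read this way, most of the top-ranked features are digit characters, followed by `body_length`.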

In [14]:
Y_pred = rf.predict(X_test)
precision, recall, fscore, support = score(Y_test, Y_pred, pos_label='spam', average='binary')



In [15]:
print('precision: {} / recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                        round(recall, 3),
                                                        round((Y_pred==Y_test).sum() / len(Y_pred), 3)))

precision: 0.962 / recall: 0.852 / Accuracy: 0.976


#### Evaluate Random Forest with GridSearchCV

**Grid-search:** Exhaustively search all parameter combinations in a given grid to determine the best model.
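
As a rough illustration of what "exhaustive" means, the sketch below enumerates every combination of the Random Forest grid used later in this section with `itertools.product`. This is only the bookkeeping; `GridSearchCV` additionally fits and scores each combination on every fold.

```python
from itertools import product

param_grid = {'n_estimators': [10, 150, 300],
              'max_depth': [30, 60, 90, None]}

# Build one dict per combination of parameter values.
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

n_folds = 5
total_fits = len(combinations) * n_folds  # cross-validation fits, before any final refit

print(len(combinations), total_fits)  # 12 60
```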

In this section, we will compare two vectorization methods (`CountVectorizer` and `TfidfVectorizer`) using `GridSearchCV` and tune the hyperparameters of the RF model.

In [16]:
from sklearn.model_selection import GridSearchCV

In [17]:
fullCorpus.head(2)

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized,body_length,punct%
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"['ive', 'been', 'searching', 'for', 'the', 'right', 'words', 'to', 'thank', 'you', 'for', 'this'...","['ive', 'searching', 'right', 'words', 'thank', 'breather', 'promise', 'wont', 'take', 'help', '...","['ive', 'search', 'right', 'word', 'thank', 'breather', 'promis', 'wont', 'take', 'help', 'grant...","['ive', 'searching', 'right', 'word', 'thank', 'breather', 'promise', 'wont', 'take', 'help', 'g...",160,2.5
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...","['free', 'entri', '2', 'wkli', 'comp', 'win', 'fa', 'cup', 'final', 'tkt', '21st', 'may', '2005'...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...",128,4.7


In [18]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer=lemmatizing)
X_counts = count_vect.fit_transform(fullCorpus['body_text_lemmatized'])
print(X_counts.shape)
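# get_feature_names() was deprecated in scikit-learn 1.0 and later removed; get_feature_names_out() replaces it
print(count_vect.get_feature_names_out().tolist())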

(5568, 48)
[' ', "'", ',', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'è', 'é', 'ì', 'ú', 'ü', '〨', '鈥']




In [19]:
# CountVectorizer
count_vect = CountVectorizer(analyzer=lemmatizing)
X_count = count_vect.fit_transform(fullCorpus['body_text_lemmatized'])
X_count_feat = pd.concat([fullCorpus['body_length'], fullCorpus['punct%'], pd.DataFrame(X_count.toarray())], axis=1)

X_count_feat.head()

Unnamed: 0,body_length,punct%,0,1,2,3,4,5,6,7,...,38,39,40,41,42,43,44,45,46,47
0,160,2.5,15,32,15,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,128,4.7,22,46,22,5,5,5,0,1,...,2,5,0,0,0,0,0,0,0,0
2,49,4.1,7,16,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,62,3.2,7,16,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,28,7.1,1,4,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [20]:
# TF-IDF
tfidf_vect = TfidfVectorizer(analyzer=lemmatizing)
X_tfidf = tfidf_vect.fit_transform(fullCorpus['body_text_lemmatized'])
X_tfidf_feat = pd.concat([fullCorpus['body_length'], fullCorpus['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)

X_tfidf_feat.head()

Unnamed: 0,body_length,punct%,0,1,2,3,4,5,6,7,...,38,39,40,41,42,43,44,45,46,47
0,160,2.5,0.313691,0.658716,0.313691,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,128,4.7,0.316222,0.650824,0.316222,0.21671,0.214931,0.200747,0.0,0.044145,...,0.07762,0.107613,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,49,4.1,0.313791,0.705991,0.313791,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,62,3.2,0.29668,0.667495,0.29668,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,28,7.1,0.15576,0.613273,0.15576,0.0,0.0,0.0,0.0,0.0,...,0.0,0.233228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Training the model on CountVectorizing dataset

In [21]:
rf = RandomForestClassifier()
parameters = {'n_estimators': [10, 150, 300],
              'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, parameters, cv = 5, n_jobs = -1)
gs_fit = gs.fit(X_count_feat, fullCorpus['label'])
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]









Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
2,2.642613,0.033107,0.112799,0.009323,30,300,"{'max_depth': 30, 'n_estimators': 300}",0.979354,0.985637,0.976661,0.977538,0.980234,0.979885,0.003143,1
1,1.270896,0.018208,0.050319,0.002428,30,150,"{'max_depth': 30, 'n_estimators': 150}",0.978456,0.985637,0.977558,0.977538,0.979335,0.979705,0.003039,2
7,1.346538,0.007288,0.065377,0.004834,90,150,"{'max_depth': 90, 'n_estimators': 150}",0.979354,0.986535,0.976661,0.977538,0.978437,0.979705,0.003531,3
6,0.096569,0.0073,0.010063,0.000558,90,10,"{'max_depth': 90, 'n_estimators': 10}",0.981149,0.987433,0.976661,0.97664,0.97664,0.979704,0.004239,4
8,2.396941,0.323343,0.09113,0.022161,90,300,"{'max_depth': 90, 'n_estimators': 300}",0.979354,0.98474,0.975763,0.977538,0.978437,0.979166,0.003029,5


#### Training the model on TFIDF dataset

In [22]:
rf = RandomForestClassifier()
parameters = {'n_estimators': [10, 150, 300],
              'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, parameters, cv = 5, n_jobs = -1)
gs_fit = gs.fit(X_tfidf_feat, fullCorpus['label'])
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]









Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
10,3.127761,0.02828,0.063621,0.007068,,150,"{'max_depth': None, 'n_estimators': 150}",0.981149,0.983842,0.980251,0.978437,0.982031,0.981142,0.0018,1
5,5.793543,0.229556,0.127757,0.008944,60.0,300,"{'max_depth': 60, 'n_estimators': 300}",0.981149,0.982944,0.978456,0.978437,0.982031,0.980603,0.00185,2
2,5.063565,0.064893,0.114154,0.007446,30.0,300,"{'max_depth': 30, 'n_estimators': 300}",0.979354,0.982047,0.980251,0.979335,0.981132,0.980424,0.001048,3
8,5.805686,0.331961,0.100609,0.02792,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.981149,0.982047,0.977558,0.979335,0.981132,0.980244,0.001606,4
11,4.109794,0.47817,0.059954,0.002167,,300,"{'max_depth': None, 'n_estimators': 300}",0.981149,0.982047,0.978456,0.978437,0.981132,0.980244,0.001505,5


Comparing the two runs, the RF model trained on the TF-IDF features performed slightly better (best mean accuracy 0.9811) than the one trained on the count features (0.9799). The difference is within the fold-to-fold standard deviation, so it only weakly suggests that TF-IDF is the better vectorization method here.
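
To put the gap between the two best configurations in context, the sketch below recomputes the mean and standard deviation from the per-fold scores of the top row in each table above. It uses the population standard deviation (`pstdev`), which matches the `std_test_score` values reported in the tables.

```python
from statistics import mean, pstdev

# Per-fold accuracies of the top-ranked configuration in each table above.
count_scores = [0.979354, 0.985637, 0.976661, 0.977538, 0.980234]
tfidf_scores = [0.981149, 0.983842, 0.980251, 0.978437, 0.982031]

gap = mean(tfidf_scores) - mean(count_scores)
count_spread = pstdev(count_scores)
tfidf_spread = pstdev(tfidf_scores)

print(f"gap in mean accuracy: {gap:.4f}")
print(f"fold-to-fold std: count={count_spread:.4f}, tfidf={tfidf_spread:.4f}")
```

The gap in mean accuracy is smaller than either standard deviation, so a different split of the data could plausibly reverse the ranking.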

### 2. Gradient Boosting Classifier

In [23]:
from sklearn.ensemble import GradientBoostingClassifier

In [24]:
print(dir(GradientBoostingClassifier))
print(GradientBoostingClassifier())

['_SUPPORTED_LOSS', '__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_feature_names', '_check_initialized', '_check_n_features', '_check_params', '_clear_state', '_compute_partial_dependence_recursion', '_estimator_type', '_fit_stage', '_fit_stages', '_get_param_names', '_get_tags', '_init_state', '_is_initialized', '_make_estimator', '_more_tags', '_raw_predict', '_raw_predict_init', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_required_parameters', '_resize_state', '_staged_raw_predict', '_validate_data', '_validate_estimator', '_validate_y', '_warn_ma

#### Explore Gradient Boosting Classifier through Cross-Validation

In [25]:
gb = GradientBoostingClassifier()
k_fold = KFold(n_splits=5)
cross_val_score(gb, X_features, fullCorpus['label'], cv=k_fold, scoring='accuracy')



array([0.97666068, 0.98384201, 0.97845601, 0.97933513, 0.97933513])

#### Explore Gradient Boosting Classifier through Holdout Set

To create a holdout test set, we will use the `train_test_split` function as shown below.

In [26]:
X_train, X_test, Y_train, Y_test = train_test_split(X_features, fullCorpus['label'], test_size=0.2, random_state=42)

Train and fit the model

In [27]:
gb = GradientBoostingClassifier(n_estimators=50, max_depth=20)
gb_model = gb.fit(X_train, Y_train)



To find the features that contribute most to the model's predictions, we will use the `feature_importances_` attribute.

In [28]:
sorted(zip(gb_model.feature_importances_, X_train.columns), reverse=True)[0:10]

[(0.7557075425688832, 3),
 (0.046138045082292106, 'body_length'),
 (0.04160947821733227, 1),
 (0.011141995001994606, 6),
 (0.009001106984510869, 39),
 (0.00885480002651843, 38),
 (0.008036411555504035, 16),
 (0.0076595556572632536, 4),
 (0.0073324897269875025, 29),
 (0.007138402565407586, 19)]

In [29]:
Y_pred = gb.predict(X_test)
precision, recall, fscore, support = score(Y_test, Y_pred, pos_label='spam', average='binary')



In [30]:
print('precision: {} / recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                        round(recall, 3),
                                                        round((Y_pred==Y_test).sum() / len(Y_pred), 3)))

precision: 0.846 / recall: 0.846 / Accuracy: 0.959


#### Evaluate Gradient Boosting with GridSearchCV

In this section, we will compare two vectorization methods (`CountVectorizer` and `TfidfVectorizer`) using `GridSearchCV` and tune the hyperparameters of the GB model.

In [31]:
fullCorpus.head(2)

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized,body_length,punct%
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"['ive', 'been', 'searching', 'for', 'the', 'right', 'words', 'to', 'thank', 'you', 'for', 'this'...","['ive', 'searching', 'right', 'words', 'thank', 'breather', 'promise', 'wont', 'take', 'help', '...","['ive', 'search', 'right', 'word', 'thank', 'breather', 'promis', 'wont', 'take', 'help', 'grant...","['ive', 'searching', 'right', 'word', 'thank', 'breather', 'promise', 'wont', 'take', 'help', 'g...",160,2.5
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...","['free', 'entri', '2', 'wkli', 'comp', 'win', 'fa', 'cup', 'final', 'tkt', '21st', 'may', '2005'...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...",128,4.7


In [32]:
# CountVectorizer
count_vect = CountVectorizer(analyzer=lemmatizing)
X_count = count_vect.fit_transform(fullCorpus['body_text_lemmatized'])
X_count_feat = pd.concat([fullCorpus['body_length'], fullCorpus['punct%'], pd.DataFrame(X_count.toarray())], axis=1)

X_count_feat.head()

Unnamed: 0,body_length,punct%,0,1,2,3,4,5,6,7,...,38,39,40,41,42,43,44,45,46,47
0,160,2.5,15,32,15,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,128,4.7,22,46,22,5,5,5,0,1,...,2,5,0,0,0,0,0,0,0,0
2,49,4.1,7,16,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,62,3.2,7,16,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,28,7.1,1,4,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [33]:
# TF-IDF
tfidf_vect = TfidfVectorizer(analyzer=lemmatizing)
X_tfidf = tfidf_vect.fit_transform(fullCorpus['body_text_lemmatized'])
X_tfidf_feat = pd.concat([fullCorpus['body_length'], fullCorpus['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)

X_tfidf_feat.head()

Unnamed: 0,body_length,punct%,0,1,2,3,4,5,6,7,...,38,39,40,41,42,43,44,45,46,47
0,160,2.5,0.313691,0.658716,0.313691,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,128,4.7,0.316222,0.650824,0.316222,0.21671,0.214931,0.200747,0.0,0.044145,...,0.07762,0.107613,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,49,4.1,0.313791,0.705991,0.313791,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,62,3.2,0.29668,0.667495,0.29668,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,28,7.1,0.15576,0.613273,0.15576,0.0,0.0,0.0,0.0,0.0,...,0.0,0.233228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Training the model on CountVectorizing dataset

In [34]:
gb = GradientBoostingClassifier()
parameters = {'n_estimators': [100, 150],
              'max_depth': [7, 11, 15],
              'learning_rate': [0.1]}

gs = GridSearchCV(gb, parameters, cv = 5, n_jobs = -1)
gs_fit = gs.fit(X_count_feat, fullCorpus['label'])
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]





Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
1,5.051267,0.023737,0.01389,0.002098,0.1,7,150,"{'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 150}",0.977558,0.980251,0.973968,0.975741,0.97664,0.976832,0.002081,1
3,9.24197,0.100688,0.01735,0.002461,0.1,11,150,"{'learning_rate': 0.1, 'max_depth': 11, 'n_estimators': 150}",0.975763,0.982047,0.975763,0.975741,0.974843,0.976831,0.002632,2
0,3.310864,0.021549,0.010292,0.000885,0.1,7,100,"{'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100}",0.977558,0.978456,0.972172,0.974843,0.977538,0.976114,0.002313,3
2,5.482781,0.159101,0.014148,0.001245,0.1,11,100,"{'learning_rate': 0.1, 'max_depth': 11, 'n_estimators': 100}",0.974865,0.980251,0.973968,0.975741,0.972147,0.975395,0.002704,4
4,7.781184,0.364988,0.015245,0.001907,0.1,15,100,"{'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 100}",0.969479,0.970377,0.972172,0.969452,0.973944,0.971085,0.001738,5


#### Training the model on TFIDF dataset

In [35]:
gb = GradientBoostingClassifier()
parameters = {'n_estimators': [100, 150],
              'max_depth': [7, 11, 15],
              'learning_rate': [0.1]}

gs = GridSearchCV(gb, parameters, cv = 5, n_jobs = -1)
gs_fit = gs.fit(X_tfidf_feat, fullCorpus['label'])
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]





Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
1,22.097956,0.085492,0.027093,0.002205,0.1,7,150,"{'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 150}",0.980251,0.982047,0.976661,0.975741,0.973944,0.977729,0.002981,1
0,13.04761,0.052072,0.016618,0.000969,0.1,7,100,"{'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100}",0.978456,0.982047,0.974865,0.975741,0.975741,0.97737,0.002632,2
3,38.968439,1.277141,0.019918,0.003462,0.1,11,150,"{'learning_rate': 0.1, 'max_depth': 11, 'n_estimators': 150}",0.975763,0.982944,0.974865,0.974843,0.974843,0.976652,0.003166,3
2,25.487836,2.648427,0.021898,0.002826,0.1,11,100,"{'learning_rate': 0.1, 'max_depth': 11, 'n_estimators': 100}",0.975763,0.982944,0.97307,0.973944,0.972147,0.975574,0.003874,4
4,30.589162,2.804994,0.019135,0.004723,0.1,15,100,"{'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 100}",0.974865,0.977558,0.967684,0.96496,0.96496,0.970005,0.005234,5


Comparing the two runs, the GB model trained on the TF-IDF features scored marginally higher (best mean accuracy 0.9777) than the one trained on the count features (0.9768). The gap is smaller than the fold-to-fold standard deviation, and the TF-IDF fits took several times longer, so this is at most weak evidence that TF-IDF is the better vectorization method here.

## Final Model Selection & Evaluation

To choose between the best-performing RF and GB models, we will follow these steps:
    
1. Split the data into training and test set.
2. Train vectorizers on training set and use that to transform test set.
3. Fit best Random Forest model and best Gradient Boosting model on training set and predict on test set.
4. Thoroughly evaluate results of these two models to select best model.
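
Step 2 matters because the vocabulary and the IDF weights should come from the training data only. A minimal sketch with hypothetical toy documents shows the pattern: `fit_transform` on the training text, then `transform` on the test text, so both matrices share the same columns and unseen test words are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-cleaned messages.
train_docs = ["free entri win", "date sunday", "win cup final"]
test_docs = ["free cup ticket"]  # "ticket" never appears in the training text

vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vect.transform(test_docs)        # reuses them, learns nothing new

print(train_matrix.shape, test_matrix.shape)
print('ticket' in vect.vocabulary_)
```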

#### Split into train/test

We will split on the `cleaned_text` column, which is built below from `body_text`, together with `body_length` and `punct%`.

In [36]:
fullCorpus.head(2)

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized,body_length,punct%
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"['ive', 'been', 'searching', 'for', 'the', 'right', 'words', 'to', 'thank', 'you', 'for', 'this'...","['ive', 'searching', 'right', 'words', 'thank', 'breather', 'promise', 'wont', 'take', 'help', '...","['ive', 'search', 'right', 'word', 'thank', 'breather', 'promis', 'wont', 'take', 'help', 'grant...","['ive', 'searching', 'right', 'word', 'thank', 'breather', 'promise', 'wont', 'take', 'help', 'g...",160,2.5
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...","['free', 'entri', '2', 'wkli', 'comp', 'win', 'fa', 'cup', 'final', 'tkt', '21st', 'may', '2005'...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...",128,4.7


In [37]:
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

def clean_text(text):
    text = "".join([char.lower() for char in text if char not in string.punctuation]) # lowercase and remove punctuation
    tokens = re.split(r'\W+', text) # tokenization (raw string avoids an invalid-escape warning)
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords]) # remove stopwords, stem, and rejoin
    return text

fullCorpus['cleaned_text'] = fullCorpus['body_text'].apply(lambda x: clean_text(x))
fullCorpus.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized,body_length,punct%,cleaned_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"['ive', 'been', 'searching', 'for', 'the', 'right', 'words', 'to', 'thank', 'you', 'for', 'this'...","['ive', 'searching', 'right', 'words', 'thank', 'breather', 'promise', 'wont', 'take', 'help', '...","['ive', 'search', 'right', 'word', 'thank', 'breather', 'promis', 'wont', 'take', 'help', 'grant...","['ive', 'searching', 'right', 'word', 'thank', 'breather', 'promise', 'wont', 'take', 'help', 'g...",160,2.5,ive search right word thank breather promis wont take help grant fulfil promis wonder bless time
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...","['free', 'entri', '2', 'wkli', 'comp', 'win', 'fa', 'cup', 'final', 'tkt', '21st', 'may', '2005'...","['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005...",128,4.7,free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd...
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"['nah', 'i', 'dont', 'think', 'he', 'goes', 'to', 'usf', 'he', 'lives', 'around', 'here', 'though']","['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though']","['nah', 'dont', 'think', 'goe', 'usf', 'live', 'around', 'though']","['nah', 'dont', 'think', 'go', 'usf', 'life', 'around', 'though']",49,4.1,nah dont think goe usf live around though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"['even', 'my', 'brother', 'is', 'not', 'like', 'to', 'speak', 'with', 'me', 'they', 'treat', 'me...","['even', 'brother', 'like', 'speak', 'treat', 'like', 'aids', 'patent']","['even', 'brother', 'like', 'speak', 'treat', 'like', 'aid', 'patent']","['even', 'brother', 'like', 'speak', 'treat', 'like', 'aid', 'patent']",62,3.2,even brother like speak treat like aid patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']","['date', 'sunday']","['date', 'sunday']","['date', 'sunday']",28,7.1,date sunday


In [38]:
X_train, X_test, Y_train, Y_test = train_test_split(fullCorpus[['cleaned_text', 'body_length', 'punct%']], fullCorpus['label'], test_size=0.2)

#### Vectorize text

After the split, `X_train` and `X_test` keep their original row labels from `fullCorpus`, whereas the DataFrames built from the TF-IDF matrices are indexed from 0. We therefore call `reset_index(drop=True)` before concatenating so that the rows line up.
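
The effect of skipping the index reset can be seen on a small hypothetical example: `pd.concat(..., axis=1)` aligns rows by index label, so mismatched labels produce extra rows padded with `NaN` instead of a side-by-side join.

```python
import pandas as pd

# Hypothetical rows as they might look after a shuffled train/test split.
left = pd.DataFrame({'body_length': [35, 109, 84]}, index=[4012, 17, 2250])
# A freshly built DataFrame is indexed 0, 1, 2.
right = pd.DataFrame({'tfidf_0': [0.36, 0.51, 0.48]})

misaligned = pd.concat([left, right], axis=1)
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape)  # (6, 2): the union of both indexes, half of each column NaN
print(aligned.shape)     # (3, 2)
```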

In [39]:
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['cleaned_text'])

# Transform both sets with the vectorizer fitted on the training set so they share the same columns.
tfidf_train = tfidf_vect_fit.transform(X_train['cleaned_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['cleaned_text'])

# We will concatenate the vectorized values with the remaining columns from the original X_train and X_test.
X_train_vect = pd.concat([X_train[['body_length', 'punct%']].reset_index(drop=True), 
                          pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_length', 'punct%']].reset_index(drop=True), 
                          pd.DataFrame(tfidf_test.toarray())], axis=1)

X_train_vect.head()

Unnamed: 0,body_length,punct%,0,1,2,3,4,5,6,7,...,34,35,36,37,38,39,40,41,42,43
0,35,0.0,0.363953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.326846,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,109,9.2,0.509293,0.243448,0.160259,0.374356,0.087534,0.082014,0.17057,0.182838,...,0.144432,0.049416,0.19212,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,84,4.8,0.478622,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.07353,0.142935,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,83,4.8,0.467046,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.246004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,45,2.2,0.422331,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.311432,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
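
The 44 TF-IDF columns above are far fewer than a word-level vocabulary would produce, which suggests character-level features. That is the expected result when a callable `analyzer` returns a single string, as `clean_text` does, because the vectorizer iterates over whatever the analyzer returns. A minimal sketch with two hypothetical pre-cleaned messages and two hypothetical analyzers, `as_string` and `as_tokens`, shows the difference:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["free entri win", "date sunday"]  # hypothetical pre-cleaned messages

def as_string(text):
    # Mirrors an analyzer that returns one joined string.
    return text

def as_tokens(text):
    # Returns a list of tokens, which is what a callable analyzer should yield.
    return text.split()

char_features = sorted(TfidfVectorizer(analyzer=as_string).fit(docs).vocabulary_)
word_features = sorted(TfidfVectorizer(analyzer=as_tokens).fit(docs).vocabulary_)

print(char_features)
print(word_features)
```

Since the `cleaned_text` column is already cleaned and space-separated, an analyzer that only splits on whitespace would be one way to get word-level features; the results reported in this notebook were not produced that way.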


In [40]:
X_test_vect.head()

Unnamed: 0,body_length,punct%,0,1,2,3,4,5,6,7,...,34,35,36,37,38,39,40,41,42,43
0,54,3.7,0.57442,0.0,0.0,0.0,0.267976,0.0,0.0,0.0,...,0.0,0.15128,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,23,0.0,0.399425,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,58,13.8,0.479542,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.126293,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,120,2.5,0.452616,0.5481,0.067652,0.189636,0.073903,0.0,0.144009,0.0,...,0.06097,0.041721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,33,3.0,0.172461,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.317938,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Final evaluations of models

In [41]:
import time

#### Random Forest model

In [42]:
rf = RandomForestClassifier(n_estimators=300, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, Y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred_rf = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, support = score(Y_test, y_pred_rf, pos_label='spam', average='binary')
print('Fit Time: {} / Predict Time: {} / Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), 
    round((y_pred_rf==Y_test).sum()/len(y_pred_rf), 3)))



Fit Time: 0.578 / Predict Time: 0.062 / Precision: 0.978 / Recall: 0.84 / Accuracy: 0.974




#### Gradient Boosting model

In [43]:
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, Y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred_gb = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, support = score(Y_test, y_pred_gb, pos_label='spam', average='binary')
print('Fit Time: {} / Predict Time: {} / Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), 
    round((y_pred_gb==Y_test).sum()/len(y_pred_gb), 3)))



Fit Time: 12.625 / Predict Time: 0.01 / Precision: 0.945 / Recall: 0.852 / Accuracy: 0.971




#### Results Trade-off

Although GradientBoosting takes far longer than RandomForest to fit (about 12.6 seconds versus 0.6 seconds), it takes less time to predict (0.01 seconds versus 0.062 seconds). 

In terms of precision and recall, the RandomForest model has better precision (97.8% versus 94.5%), but GradientBoosting has slightly better recall (85.2% versus 84%). So whichever model we pick, we're making some trade-off. If we pick RandomForest, we are saying that we care more about precision than about prediction time or recall, and vice versa. This kind of trade-off is very common, which brings me to three important points. 

First, we would normally dig into the metrics much more than we do here. We wouldn't base the decision only on overall precision, recall, and prediction time. We'd split the test set in a variety of ways to understand how the model performs across different dimensions. For example, we might look only at text messages longer than 50 characters and see how the model does there, or only at text messages with zero punctuation. Slicing the data this way shows where the model does well and where it doesn't. It would also include looking at the specific text messages that the model gets wrong. 
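To make the slicing idea concrete, here is a minimal sketch. It assumes a small hand-made `results` DataFrame standing in for the test set (in the notebook, the columns would come from `X_test`, `Y_test`, and a model's predictions); the helper `slice_metrics` and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support as score

# Toy stand-in for the test set: two features, true labels, and predictions.
# These values are made up for illustration; they are not notebook results.
results = pd.DataFrame({
    'body_length': [160, 128, 49, 30, 75, 20, 110, 45],
    'punct%':      [2.5, 4.7, 4.1, 0.0, 0.0, 0.0, 6.3, 2.2],
    'y_true': ['ham', 'spam', 'ham', 'ham', 'spam', 'spam', 'spam', 'ham'],
    'y_pred': ['ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam'],
})

def slice_metrics(df, mask, name):
    """Return precision and recall on the rows selected by a boolean mask."""
    subset = df[mask]
    precision, recall, _, _ = score(
        subset['y_true'], subset['y_pred'],
        pos_label='spam', average='binary', zero_division=0)
    return {'slice': name, 'n': len(subset),
            'precision': round(float(precision), 3),
            'recall': round(float(recall), 3)}

# Evaluate the same predictions on two different slices of the data.
report = pd.DataFrame([
    slice_metrics(results, results['body_length'] > 50, 'length > 50'),
    slice_metrics(results, results['punct%'] == 0, 'no punctuation'),
])
print(report)
```

Comparing rows of `report` shows whether a model that looks good overall is weaker on, say, short or unpunctuated messages.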

Second, after a thorough training and evaluation process, you usually end up with some kind of trade-off, as we have here between predictive performance and prediction time. At that point, and this is very important, you make your decision based on the business problem or business context. In some contexts, a longer prediction time creates a serious bottleneck in the process. If, for example, the process could not tolerate the 0.062 seconds that RandomForest takes to predict on this test set, you might have no choice but to go with the GradientBoosting model. 
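One way to turn the prediction-time constraint into an explicit check is sketched below. The helper names `mean_latency` and `within_budget`, the `dummy_predict` stand-in, and the 0.05-second budget are all hypothetical; in the notebook you would pass something like `rf_model.predict` and `X_test_vect` instead.

```python
import time

# Hypothetical latency budget in seconds; the real value depends on the business context.
LATENCY_BUDGET_S = 0.05

def mean_latency(predict_fn, batch, repeats=5):
    """Average wall-clock seconds for one call of predict_fn(batch)."""
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(batch)
    return (time.perf_counter() - start) / repeats

def within_budget(latency_s, budget_s=LATENCY_BUDGET_S):
    """True if the measured latency fits inside the budget."""
    return latency_s <= budget_s

def dummy_predict(batch):
    """Stand-in for a fitted model's predict method: labels everything 'ham'."""
    return ['ham' for _ in batch]

latency = mean_latency(dummy_predict, ['hello there'] * 1000)
print('Mean latency: {:.6f}s / Within budget: {}'.format(latency, within_budget(latency)))

# Applying the check to the predict times reported above:
print('RandomForest ok:', within_budget(0.062))      # 0.062s exceeds a 0.05s budget
print('GradientBoosting ok:', within_budget(0.01))   # 0.01s fits
```

Averaging over several repeats smooths out the run-to-run noise that a single `time.time()` measurement can show.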

Third, most problems have a higher cost either on false positives, in which case we prioritize precision, or on false negatives, in which case we prioritize recall. For a spam filter, we can probably live with some spam in our inbox here and there, but we don't want the filter to capture real emails, so we'd prioritize precision: when the model says it's spam, it had better be spam. In this case, false positives are very costly.

The opposite case would be something like anti-virus software. A false positive, where the software says you have a virus when you really don't, can be scary, but if you're getting hacked and the software doesn't catch it, that's much, much worse. Here we should optimize for recall, so that if there's a breach, the model is able to catch it.

With all that said, assuming that prediction time is not a deal breaker for your business problem, and you don't have a clear answer on whether false positives or false negatives are more costly, the model you'd probably select here is RandomForest. Its precision is noticeably better than GradientBoosting's (0.978 versus 0.945), and its recall is very close (0.84 versus 0.852). 
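The notebook does not compute it, but one standard way to encode "precision matters more" or "recall matters more" in a single number is the F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta < 1 favours precision and beta > 1 favours recall. The sketch below applies it to the precision and recall values printed above; the function name `f_beta` and the choice of beta values are illustrative assumptions.

```python
def f_beta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported in the outputs above.
models = {
    'RandomForest':     {'precision': 0.978, 'recall': 0.84},
    'GradientBoosting': {'precision': 0.945, 'recall': 0.852},
}

# beta=0.5 weights precision more (spam filter), beta=2 weights recall more (anti-virus).
scores = {
    beta: {name: f_beta(m['precision'], m['recall'], beta) for name, m in models.items()}
    for beta in (0.5, 1, 2)
}

for beta, by_model in scores.items():
    best = max(by_model, key=by_model.get)
    print('beta={}: {} -> preferred: {}'.format(
        beta, {k: round(v, 3) for k, v in by_model.items()}, best))
```

With these numbers, RandomForest comes out ahead when precision is weighted more heavily or equally, while a strong emphasis on recall tips the choice toward GradientBoosting.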
