# 1. Introduction

# 2. Methodology
## 2.1. Naive Bayes

## 2.2. Random Forest
Random forest is a Bagging (short for Bootstrap AGgregation) method in ensemble learning.

A random forest is made up of many decision trees. Each tree is trained independently of the others on its own random sample of the data.

For a classification task, each decision tree in the forest classifies a new input sample separately. Each tree produces its own classification result, and the random forest takes the class predicted by the majority of trees as the final result.

### Four steps of constructing a random forest
1. Do random sampling.
 * If there are N samples, N samples are drawn at random with replacement (one sample is drawn at a time and returned to the pool before the next draw).
 * The N selected samples are used to train one decision tree; they are the samples at the root node of that tree.
 * This sampling method is called bootstrap sampling.
2. Randomly select attributes for node splitting.
 * Suppose that each sample has M attributes.
 * Whenever a node of the decision tree needs to be split, m attributes are randomly selected from the M attributes, where m << M.
 * A strategy such as information gain is then used to select one of these m attributes as the split attribute of the node.
3. Repeat step 2 until no further split is possible.
 * While the decision tree is being built, every node is split according to step 2.
 * If the attribute selected for a node is the one that was just used to split its parent node, the node is treated as a leaf and is not split further.
4. Build a large number of decision trees to form a forest.
 * Repeating the three steps above for many decision trees yields a random forest.
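
The sampling and voting steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not scikit-learn's implementation: the helper names (`bootstrap_sample`, `random_attribute_subset`, `majority_vote`), the toy sizes and the fixed seed are all hypothetical, and no actual tree is trained.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Step 1: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_attribute_subset(n_attributes, m, rng):
    """Step 2: pick m distinct candidate attributes out of M for one node split."""
    return rng.sample(range(n_attributes), m)


def majority_vote(tree_predictions):
    """Final prediction: the label returned by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed, chosen only to make the sketch repeatable

indices = bootstrap_sample(10, rng)
# Rows that were never drawn for this tree are its out-of-bag samples.
out_of_bag = sorted(set(range(10)) - set(indices))
candidates = random_attribute_subset(n_attributes=100, m=10, rng=rng)

print("bootstrap indices:", indices)
print("out-of-bag rows:  ", out_of_bag)
print("candidate attrs:  ", candidates)
print(majority_vote(["ham", "spam", "ham", "ham", "spam"]))  # ham
```

Because sampling is done with replacement, some rows appear several times in `indices` and others not at all; the rows left out are the out-of-bag samples that the `oob_score` option of a random forest can use for evaluation.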



### Advantages
* Training can be highly parallelized, which speeds up training on large sample sets in the era of big data.
* Since the candidate features for splitting each node are selected at random, the model can still be trained effectively when the feature dimension is high.
* After training, the importance of each feature to the output can be reported.
* Due to the random sampling, the trained model has small variance and strong generalization ability.
* Compared to Boosting methods such as AdaBoost and GBDT, the RF implementation is relatively simple.
* It is not sensitive to some features being missing.

### Disadvantages
* The RF model is prone to overfitting on noisy sample sets.
* Features that take a large number of distinct values tend to have a greater influence on RF decisions, which can degrade the fitted model.

# 3. Experimental Setup
* scikit-learn

In [1]:
import pandas as pd
import numpy as np

## 3.1. Load Data
* The data is the SMS Spam Collection Data Set from the UCI Machine Learning Repository.
* The dataset is available at the link below.
	https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
* The dataset contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.

In [2]:
data = pd.read_table('data/SMSSpamCollection', sep='\t', names=['label', 'sms_message'])

#print first 5 records
data.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## 3.2. Data Preprocessing
### 3.2.1. Convert Label
The label is converted into a binary variable to simplify computation:
* 0 : "ham"
* 1 : "spam"

In [3]:
# show size of data (#row,#column)
print(data.shape)

# convert label
data['label'] = data.label.map({'ham':0, 'spam':1})

# show label of first 5 records 
data['label'].head()

(5572, 2)


0    0
1    0
2    1
3    0
4    0
Name: label, dtype: int64

### 3.2.2. Split into Training Set and Testing Set
The train_test_split function of sklearn is used to divide the data into a training set and a testing set. Since no test_size is given, the default split of 75% training and 25% testing applies. The result is stored in the following variables:
* X_train: training data of 'sms_message'
* y_train: training data of 'label'
* X_test: testing data of 'sms_message'
* y_test: testing data of 'label'

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data['sms_message'], 
                                                    data['label'], 
                                                    random_state=1)
print('Number of rows in the total set: {}'.format(data.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


### 3.2.3. Convert SMS Message
Each SMS message is converted into a "Bag of Words" representation.
* Bag of words (BoW)
 * The term is used for problems that involve "a lot of words", that is, a large amount of text data.
 * The basic idea of BoW is to take a piece of text and count how often each word appears in it.
 * BoW treats each word equally, and the order in which the words appear is ignored.
* The collection of SMS messages is converted into a frequency matrix: each SMS message is a row, each word (token) is a column, and the value at (row, column) is the number of times that token appears in that message.
* sklearn.feature_extraction.text.CountVectorizer is used to implement BoW.
 * It tokenizes the string (splits it into individual words) and assigns an integer ID to each token.
 * It counts the number of occurrences of each token.
 * Setting the parameter 'lowercase' to 'True' (the default) converts all tokens to lowercase, so that case is ignored.
 * The default value of the parameter 'token_pattern' ignores punctuation and keeps only tokens of two or more alphanumeric characters.
 * Setting the parameter 'stop_words' to 'english' drops the most common English words, such as "am", "an", "and" and "the", which reduces their bias on the analysis. The default is None, which keeps them.
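
The frequency matrix described above can be illustrated with a short plain-Python sketch. It is not CountVectorizer itself: the `tokenize` helper and the two toy messages are assumptions made for illustration, and the regular expression only mimics the default 'token_pattern' (lowercase tokens of two or more word characters).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Two toy messages (hypothetical, shortened for readability).
messages = ["Free entry, free prize!", "Ok lar... Joking wif u oni..."]

# Vocabulary: every distinct token, sorted, mapped to an integer column ID.
all_tokens = sorted({tok for msg in messages for tok in tokenize(msg)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

# Frequency matrix: one row per message, one column per token.
matrix = []
for msg in messages:
    counts = Counter(tokenize(msg))
    matrix.append([counts.get(tok, 0) for tok in vocabulary])

print(vocabulary)
print(matrix)
```

In the first row the column for "free" holds 2 because the token occurs twice, and the single-character token "u" gets no column at all because the pattern requires at least two characters.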

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)

# print(training_data)
# print(testing_data)

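# print all tokens
# (get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names())
# print(count_vector.get_feature_names_out())

# print the number of distinct tokens (the vocabulary size)
print(len(count_vector.get_feature_names_out()))

# print the token-to-ID mapping as a dict
# print(count_vector.vocabulary_)

# print all tokens with ID
for key, value in count_vector.vocabulary_.items():
    print(key, value)

# print token with certain ID
# print(count_vector.get_feature_names_out()[509])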


7456
4mths 509
half 3181
price 5193
orange 4781
line 3971
rental 5479
latest 3880
camera 1572
phones 4987
free 2864
had 3170
your 7424
phone 4983
11mths 264
call 1552
mobilesdirect 4375
on 4743
08000938767 50
to 6656
update 6892
now 4662
or2stoptxt 4779
cs 2022
did 2222
you 7420
stitch 6218
his 3316
trouser 6758
hope 3362
enjoyed 2502
new 4580
content 1916
text 6514
stop 6228
61610 563
unsubscribe 6882
help 3276
08712400602450p 98
provided 5255
by 1538
tones2you 6683
co 1810
uk 6829
not 4647
heard 3255
from 2899
u4 6823
while 7199
rude 5612
chat 1691
private 5206
01223585334 5
cum 2040
wan 7075
2c 374
pics 5002
of 4704
me 4238
gettin 3002
shagged 5804
then 6552
pix 5023
8552 660
2end 378
send 5764
sam 5656
xxx 7374
neva 4575
tell 6481
how 3388
noe 4618
at 1081
home 3346
in 3502
da 2066
aft 828
wat 7100
wiskey 7255
brandy 1431
rum 5619
gin 3016
beer 1243
vodka 7021
scotch 5705
shampain 5813
wine 7236
kudi 3831
yarasu 7389
dhina 2209
vaazhthukkal 6941
am 905
seeking 5744
lady 3848
the 65

stylist 6284
aaniye 722
pudunga 5271
venaam 6970
request 5495
maangalyam 4131
alaipayuthe 872
set 5786
callertune 1561
callers 1560
press 5183
copy 1942
friends 2885
tirunelvai 6636
lick 3944
pussy 5289
inside 3556
office 4712
filling 2732
forms 2840
textin 6520
bout 1398
worries 7315
photo 4989
shoot 5855
spiffing 6119
workage 7305
xclusive 7366
clubsaisai 1800
2morow 390
28 367
soiree 6024
speciale 6100
zouk 7452
nichols 4593
paris 4880
roses 5593
ladies 3847
07946746291 40
07880867867 38
entry 2516
textpod 6523
chance 1670
40gb 469
ipod 3600
250 359
pod 5076
84128 652
ts 6772
net 4566
custcare 2051
08712405020 106
meanwhile 4246
shit 5845
suite 6322
xavier 7364
decided 2128
seconds 5729
samantha 5657
over 4827
playing 5043
jay 3663
guitar 3153
impress 3498
doug 2343
realizes 5387
anymore 961
100 245
music 4482
starting 6184
87066 667
tscs 6773
ldew 3897
skillgame 5942
1winaweek 330
150ppermesssubscription 296
expert 2611
safe 5642
selfish 5757
pa 4843
thank 6530
yet 7410
elsewhere 2

type 6818
scold 5700
asking 1064
habit 3168
nan 4514
bari 1182
hudgi 3405
yorge 7419
pataistha 4903
ertini 2533
confirmed 1894
staying 6193
garage 2952
centre 1656
part 4886
exhaust 2597
replacing 5487
ordered 4786
mentor 4280
percent 4942
action 776
80608 628
movietrivia 4437
08712405022 107
1x150p 332
wnt 7278
bmw 1361
urgently 6909
vry 7033
hv 3430
shortage 5861
lacs 3845
source 6081
arng 1035
amt 924
07808726822 35
02 7
09 150
872 672
9758 709
missin 4343
haven 3236
guilty 3152
squatting 6157
walking 7067
aren 1018
imma 3486
flip 2784
peace 4925
miracle 4332
blessed 1336
ahead 850
sliding 5964
midnight 4301
invite 3588
somerset 6037
far 2667
fills 2733
complete 1869
reassurance 5396
knackered 3810
lark 3872
huh 3410
parkin 4884
kent 3771
vale 6944
seriously 5781
mayb 4233
forgot 2833
hotel 3379
excellent 2587
misundrstud 4351
hate 3227
throws 6603
gal 2941
falls 2654
brothers 1472
head 3245
ac 745
sptv 6154
jersey 3681
devils 2205
detroit 2199
wings 7238
ice 3448
hockey 3335
incorr

2stop 402
ar 1010
praveesh 5157
delicious 2157
cover 1974
sticky 6213
giving 3025
woul 7325
curfew 2045
gibe 3009
getsleep 3000
studdying 6266
ear 2415
single 5919
meaning 4241
senthil 5774
hsbc 3401
perhaps 4948
identification 3458
pocked 5074
finishes 2751
ignorant 3467
february 2692
tmr 6651
problems 5215
suggestion 6320
lands 3861
helps 3282
forgt 2834
previous 5191
machan 4135
curious 2046
joanna 3694
freaking 2861
myspace 4494
logged 4023
gumby 3154
cheese 1712
07801543489 32
latests 3881
llc 4007
ny 4684
usa 6919
mt 4452
msgrcvd18 4449
sophas 6062
secondary 5728
application 993
applying 995
joke 3703
ogunrinde 4721
less 3930
flavour 2777
bud 1491
comprehensive 1876
cmon 1805
turn 6790
replies 5489
cancel 1578
vday 6964
shirts 5844
bottom 1395
underwear 6853
playin 5042
space 6088
poker 5081
89545 688
biz 1322
2optout 399
087187262701 133
50gbp 525
mtmsg18 4457
mathews 4222
tait 6416
edwards 2446
anderson 930
reception 5412
tuesday 6782
lab 3842
goggles 3048
mila 4306
age23 839
b

09050002311 159
b4280703 1143
08718727868 137
sarcasm 5667
scarcasim 5691
44 478
7732584351 606
sian 5893
consensus 1903
phone750 4984
08000776320 47
blackberry 1325
torch 6702
nigeria 4597
buyer 1527
4a 500
italian 3624
stone 6225
murder 4477
crickiting 2008
okies 4732
skip 5946
cine 1760
blah 1327
possession 5113
wouldn 7328
jerk 3679
collapsed 1829
0808 57
145 286
4742 493
9am 713
11pm 265
screwd 5716
snowboarding 6015
affair 818
cheers 1710
massages 4211
oil 4725
fave 2682
position 5110
subpoly 6290
81618 634
08718727870 138
bag 1162
priscilla 5204
flights 2782
0871277810810 110
inperialmusic 3550
listening2the 3993
weirdest 7154
leafcutter 3906
insects 3553
molested 4390
plumbing 5057
remixed 5472
evil 2572
men 4274
acid 767
playng 5044
racing 5325
preferably 5165
kegger 3769
hmph 3331
baller 1170
09061213237 181
canary 1577
islands 3617
177 306
m227xy 4121
yest 7408
sumthin 6331
cuddling 2037
sleeps 5958
tp 6718
ouch 4808
tues 6781
wed 7135
heaven 3262
prince 5198
audrey 1102
sta

121 269
fraction 2854
neo69 4563
09050280520 162
subscribe 6293
dps 2350
bcm 1213
8027 626
lara 3868
5226 534
hava 3234
1131 260
447801259231 483
09058094597 174
lovin 4081
spjanuary 6126
09050000928 156
pouch 5129
reboot 5399
15pm 301
taunton 6451
path 4905
appear 988
front 2902
paths 4907
payed 4918
suganya 6317
plm 5052
supervisor 6343
sometext 6038
increase 3517
winning 7241
jack 3642
helpful 3279
pretend 5186
hypotheticalhuagauahahuagahyuhagga 3436
tallahassee 6433
names 4512
name1 4509
name2 4510
mobno 4378
adam 785
07123456789 24
txtno 6811
ads 806
dumb 2399
drinking 2367
bluff 1358
soz 6085
imat 3483
mums 4472
2moro 389
vivek 7017
ay 1139
wkg 7269
subs 6291
expired 2612
monoc 4403
monos 4404
polyc 5085
stream 6245
0871212025016 94
gravel 3106
69888 580
31p 426
disappeared 2258
inner 3546
tigress 6621
showers 5877
possessiveness 5115
poured 5132
lies 3948
golden 3054
btw 1487
08712103738 93
rounder 5596
required 5498
ability 731
09063458130 198
polyph 5087
suffering 6314
dysentr

karnan 3757
effect 2449
irritation 3610
lock 4019
keypad 3777
peoples 4940
happiest 3209
characters 1679
differences 2232
spell 6110
pubs 5270
frankie 2856
bennys 1272
08452810071 66
miwa 4355
consistently 1908
practicum 5149
links 3978
ears 2420
explicitly 2617
nora 4637
09064019788 207
box42wr29c 1414
muchxxlove 4464
locaxx 4018
wrc 7331
rally 5342
lucozade 4101
61200 562
packs 4846
itcould 3625
credited 2001
tamilnadu 6435
adewale 793
egbon 2454
rich 5548
postponed 5124
stocked 6220
apologetic 981
fallen 2652
actin 774
spoilt 6135
badly 1160
netflix 4568
optin 4775
bbc 1206
charts 1687
ajith 867
bcum 1217
property 5246
dusk 2406
puzzles 5295
exorcism 2602
emily 2483
incredible 3518
blow 1351
o2fwd 4691
18p 310
sight 5900
maintain 4160
cr 1980
grr 3130
pharmacy 4974
lolnice 4030
multiply 4467
independently 3521
division 2280
push 5286
showed 5874
whereare 7193
friendsare 2886
thekingshead 6549
canlove 1582
nic 4591
manageable 4175
wrk 7337
lst 4092
foned 2809
chuck 1757
laughing 3885

## 3.3. Algorithm Training and Evaluations
We choose the following 3 algorithms to make classifiers:
* Naive Bayes
* Random Forest
* ..SVM..
    
The following evaluation metrics are used:
* accuracy
* precision
* recall
* F1 score
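
As a reminder of how these four metrics relate, here is a minimal sketch that computes them from confusion-matrix counts, with spam as the positive class. The function name `classification_metrics` and the counts are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of real spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts for a test set of 1,000 messages.
accuracy, precision, recall, f1 = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(accuracy, precision, recall, f1)
```

With these counts, accuracy is 0.96 while recall is only 0.75: on an imbalanced ham/spam data set, accuracy alone can hide a fair number of missed spam messages, which is why precision, recall and F1 are reported as well.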

In [6]:
# recording results of each algorithm
from prettytable import PrettyTable
result_table = PrettyTable(['Algorithm','Accuracy','Precision','Recall','F1'])

### 3.3.1. Naive Bayes
The sklearn.naive_bayes module is used to implement Naive Bayes.

In [7]:
from sklearn.naive_bayes import MultinomialNB
# MultinomialNB is suitable for classifying discrete features.
naive_bayes = MultinomialNB()

# train the algorithm using training data set
naive_bayes.fit(training_data, y_train)

# make predictions on the test data 
predictions_naive_bayes = naive_bayes.predict(testing_data)
print(predictions_naive_bayes)

[0 0 0 ... 0 1 0]


In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions_naive_bayes)))
print('Precision score: ', format(precision_score(y_test, predictions_naive_bayes)))
print('Recall score: ', format(recall_score(y_test, predictions_naive_bayes)))
print('F1 score: ', format(f1_score(y_test, predictions_naive_bayes)))



result_table.add_row(["Naive Bayes",accuracy_score(y_test, predictions_naive_bayes),precision_score(y_test, predictions_naive_bayes),recall_score(y_test, predictions_naive_bayes),f1_score(y_test, predictions_naive_bayes)])

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562


### 3.3.2. Random Forest
The sklearn.ensemble.RandomForestClassifier is used to implement Random Forest training.
#### parameters of RF framework: 
1. n_estimators: 
 * The number of decision trees (weak learners) in the forest.
 * If n_estimators is too small, the model tends to underfit.
 * If n_estimators is too large, the computational cost becomes high.
 * Beyond a certain number of trees, the improvement gained by increasing n_estimators further is small.
 * Generally a moderate value is chosen.
 * The default value is 100.
2. oob_score:
 * It decides whether to use the out-of-bag samples to evaluate the quality of the model.
 * The out-of-bag score reflects the generalization ability of the model after fitting.
 * The default value is False.
3. criterion:
 * The function used to measure the quality of a split in each CART tree.
 * Supported value:
     * "gini"
     * "entropy"
     * optional(default = "gini")
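
To make the two criteria concrete, the following sketch computes the Gini impurity and the entropy of a single node from its class counts. The function names and the node contents are hypothetical; the formulas are the standard definitions, not code taken from scikit-learn.

```python
import math


def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def entropy(class_counts):
    """Shannon entropy in bits; classes with zero samples contribute nothing."""
    total = sum(class_counts)
    return sum((c / total) * math.log2(total / c) for c in class_counts if c > 0)


mixed_node = [40, 10]  # hypothetical node: 40 ham, 10 spam
pure_node = [50, 0]    # hypothetical node: ham only

print(gini(mixed_node), entropy(mixed_node))  # about 0.32 and 0.72
print(gini(pure_node), entropy(pure_node))    # both zero for a pure node
```

Both measures are zero for a pure node and grow as the classes become more mixed, so a split is preferred when it lowers the weighted impurity of the child nodes.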

#### parameters of RF decision tree
1. max_features:
 * The number of features to consider when looking for the best split.
 * Supported value:
     * int: then consider max_features features at each split.
     * float: int(max_features * n_features) features are considered at each split.
     * "auto": max_features=sqrt(n_features); this alias is deprecated in newer scikit-learn releases in favor of "sqrt".
     * "sqrt": max_features=sqrt(n_features) (same as "auto").
     * "log2": max_features=log2(n_features).
     * None: max_features=n_features.
     * optional(default="auto"; "sqrt" in newer scikit-learn releases)
2. max_depth:
 * The maximum depth of the tree. 
 * Supported value:
     * None:  nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples
     * integer
     * optional(default="None")
 * In general, this value can be ignored when there are few samples or features.
 * If the sample size is large and there are many features, it is recommended to limit the maximum depth. Commonly used values are between 10 and 100.
3. min_samples_split:
 * The minimum number of samples required to split an internal node. A node with fewer samples than this is not split further.
 * Supported value:
     * int: min_samples_split is considered as the minimum number.
     * float: min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
     * optional(default = 2)
4. min_samples_leaf:
 * The minimum number of samples required at a leaf node. A split is only considered if it leaves at least min_samples_leaf samples in each of the two child nodes.
 * Supported value:
     * int: min_samples_leaf is considered as the minimum number.
     * float: min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
     * optional(default = 1)
 * If the sample size is not large, this value can be ignored. If the sample size is very large, it is recommended to increase this value.
5. min_weight_fraction_leaf:
 * This value sets the minimum fraction of the total sample weight that a leaf node must hold. A leaf node whose weight is less than this value is pruned together with its sibling node.
 * The default is 0, which means that sample weights are not considered.
 * In general, if many samples have missing values, or if the class distribution of the samples is strongly imbalanced, sample weights are introduced and this value deserves attention.
6. max_leaf_nodes:
 * This value limits the maximum number of leaf nodes to prevent overfitting. The algorithm will establish an optimal decision tree within the maximum number of leaf nodes.
7. min_impurity_decrease:
 * A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

#### training algorithm with default parameter values (except oob_score = True)

In [9]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators = 100, oob_score = True, max_depth = None, random_state=0)

# train the algorithm
random_forest.fit(training_data, y_train)

# make predictions
predictions_random_forest = random_forest.predict(testing_data)
print(predictions_random_forest)


[0 0 0 ... 0 1 0]


In [10]:
print('Accuracy score: ', format(accuracy_score(y_test, predictions_random_forest)))
print('Precision score: ', format(precision_score(y_test, predictions_random_forest)))
print('Recall score: ', format(recall_score(y_test, predictions_random_forest)))
print('F1 score: ', format(f1_score(y_test, predictions_random_forest)))

Accuracy score:  0.9791816223977028
Precision score:  1.0
Recall score:  0.8432432432432433
F1 score:  0.9149560117302054


#### adjusting parameters
Next, the parameter n_estimators is adjusted to get a better classifier.

In [11]:
from prettytable import PrettyTable
# table = PrettyTable(['n_estimators','Accuracy score','Precision score','Recall score','F1 score'])
table = PrettyTable(['n_estimators','Accuracy','Precision','Recall','F1'])

for N_ESTIMATORS in range(50,400,50):
    random_forest.set_params(n_estimators = N_ESTIMATORS)

    # train the algorithm
    random_forest.fit(training_data, y_train)

    # make predictions
    predictions_random_forest = random_forest.predict(testing_data)

    table.add_row([N_ESTIMATORS,accuracy_score(y_test, predictions_random_forest),precision_score(y_test, predictions_random_forest),recall_score(y_test, predictions_random_forest),f1_score(y_test, predictions_random_forest)])

print(table)

+--------------+--------------------+-----------+--------------------+--------------------+
| n_estimators |      Accuracy      | Precision |       Recall       |         F1         |
+--------------+--------------------+-----------+--------------------+--------------------+
|      50      | 0.9784637473079684 |    1.0    | 0.8378378378378378 | 0.911764705882353  |
|     100      | 0.9791816223977028 |    1.0    | 0.8432432432432433 | 0.9149560117302054 |
|     150      | 0.9813352476669059 |    1.0    | 0.8594594594594595 | 0.9244186046511628 |
|     200      | 0.9827709978463748 |    1.0    | 0.8702702702702703 | 0.930635838150289  |
|     250      | 0.9820531227566404 |    1.0    | 0.8648648648648649 | 0.927536231884058  |
|     300      | 0.9784637473079684 |    1.0    | 0.8378378378378378 | 0.911764705882353  |
|     350      | 0.9806173725771715 |    1.0    | 0.8540540540540541 | 0.9212827988338192 |
+--------------+--------------------+-----------+--------------------+--------------------+

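The results above show that, on this test set, the classifier performs best with n_estimators = 200 (accuracy 0.9828, F1 0.9306), although the differences between the settings are small.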

In [12]:
random_forest.set_params(n_estimators = 200)

# train the algorithm
random_forest.fit(training_data, y_train)

# make predictions
predictions_random_forest = random_forest.predict(testing_data)

# print evaluation metrics
print('Accuracy score: ', format(accuracy_score(y_test, predictions_random_forest)))
print('Precision score: ', format(precision_score(y_test, predictions_random_forest)))
print('Recall score: ', format(recall_score(y_test, predictions_random_forest)))
print('F1 score: ', format(f1_score(y_test, predictions_random_forest)))

result_table.add_row(["Random Forest",accuracy_score(y_test, predictions_random_forest),precision_score(y_test, predictions_random_forest),recall_score(y_test, predictions_random_forest),f1_score(y_test, predictions_random_forest)])

Accuracy score:  0.9827709978463748
Precision score:  1.0
Recall score:  0.8702702702702703
F1 score:  0.930635838150289


#### rank of tokens
The following code prints each feature (in this case, each token) with its importance, in descending order of importance.
According to the results, the 10 most important tokens, that is, the 10 tokens that are most informative for distinguishing between ham and spam, are "call", "txt", "free", "claim", "www", "150p", "mobile", "uk", "stop" and "text".

In [13]:
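# rank features by importance, most important first
importances = random_forest.feature_importances_
indices = np.argsort(importances)[::-1]
# token names indexed by feature ID (scikit-learn >= 1.0; older releases use get_feature_names())
feature_names = count_vector.get_feature_names_out()
# loop over every feature: there is one importance value per token
for f in range(len(importances)):
    print("%2d) %-*s %f" % (f + 1, 30, feature_names[indices[f]], importances[indices[f]]))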

 1) call                           0.030404
 2) txt                            0.024035
 3) free                           0.020706
 4) claim                          0.020618
 5) www                            0.018852
 6) 150p                           0.015238
 7) mobile                         0.014894
 8) uk                             0.013733
 9) stop                           0.013528
10) text                           0.011181
11) service                        0.011001
12) prize                          0.010171
13) reply                          0.010083
14) to                             0.009936
15) win                            0.009847
16) co                             0.009559
17) your                           0.009190
18) 50                             0.008698
19) com                            0.008087
20) or                             0.007835
21) guaranteed                     0.007761
22) cash                           0.007124
23) ringtone                    

209) as                             0.000780
210) ll                             0.000779
211) valued                         0.000772
212) been                           0.000767
213) freephone                      0.000763
214) price                          0.000762
215) 11                             0.000757
216) minmobsmorelkpobox177hp51fl    0.000756
217) sexy                           0.000753
218) age                            0.000747
219) 10                             0.000744
220) xmas                           0.000742
221) help                           0.000737
222) wap                            0.000735
223) out                            0.000728
224) txts                           0.000726
225) all                            0.000719
226) an                             0.000718
227) update                         0.000717
228) waiting                        0.000714
229) explicit                       0.000714
230) player                         0.000714
231) time 

397) thanks                         0.000372
398) comuk                          0.000371
399) sipix                          0.000370
400) polys                          0.000368
401) ec2a                           0.000368
402) custcare                       0.000366
403) ladies                         0.000366
404) 7pm                            0.000365
405) locations                      0.000362
406) 08714712388                    0.000362
407) visit                          0.000359
408) gmw                            0.000358
409) cust                           0.000356
410) sport                          0.000356
411) refused                        0.000355
412) re                             0.000355
413) ntt                            0.000354
414) going                          0.000354
415) contacted                      0.000353
416) trying                         0.000351
417) love                           0.000350
418) 400                            0.000349
419) video

600) g696ga                         0.000233
601) 86021                          0.000233
602) extra                          0.000233
603) fromm                          0.000233
604) frnd                           0.000232
605) nikiyu4                        0.000232
606) hey                            0.000231
607) rgds                           0.000231
608) 03                             0.000230
609) stockport                      0.000230
610) 6wu                            0.000228
611) then                           0.000227
612) home                           0.000227
613) networks                       0.000227
614) 7876150ppm                     0.000227
615) flights                        0.000226
616) they                           0.000226
617) 88888                          0.000226
618) contract                       0.000226
619) 09061209465                    0.000226
620) wallpaper                      0.000225
621) calling                        0.000224
622) guara

804) texting                        0.000165
805) sw73ss                         0.000164
806) manchester                     0.000164
807) 0a                             0.000164
808) minmobsmore                    0.000164
809) 123                            0.000163
810) fullonsms                      0.000163
811) 250k                           0.000163
812) hlp                            0.000163
813) around                         0.000162
814) red                            0.000162
815) 400mins                        0.000162
816) barry                          0.000161
817) unsub                          0.000161
818) details                        0.000161
819) everyone                       0.000161
820) 01223585334                    0.000161
821) dena                           0.000160
822) notxt                          0.000160
823) easy                           0.000159
824) tsunami                        0.000159
825) 9am                            0.000158
826) asked

994) pobox45w2tg150p                0.000126
995) too                            0.000125
996) right                          0.000125
997) notifications                  0.000125
998) 08001950382                    0.000125
999) cds                            0.000125
1000) pobox334                       0.000125
1001) 600                            0.000125
1002) web                            0.000124
1003) swat                           0.000124
1004) lor                            0.000124
1005) yo                             0.000124
1006) minutes                        0.000123
1007) fund                           0.000123
1008) football                       0.000123
1009) laid                           0.000123
1010) mobilesdirect                  0.000123
1011) start                          0.000123
1012) stopsms                        0.000123
1013) block                          0.000122
1014) img                            0.000122
1015) requests                       0.0

1176) 1250                           0.000101
1177) drvgsto                        0.000101
1178) 12mths                         0.000100
1179) icmb3cktz8r7                   0.000100
1180) or2stoptxt                     0.000100
1181) login                          0.000100
1182) horo                           0.000100
1183) baby                           0.000100
1184) 09065394514                    0.000100
1185) mnths                          0.000100
1186) card                           0.000100
1187) ever                           0.000099
1188) otbox                          0.000099
1189) password                       0.000099
1190) ends                           0.000099
1191) xafter                         0.000099
1192) headset                        0.000099
1193) fixedline                      0.000099
1194) provided                       0.000099
1195) ecstacy                        0.000099
1196) gaytextbuddy                   0.000098
1197) cheaper                     

1368) 08718727868                    0.000082
1369) each                           0.000082
1370) gigolo                         0.000082
1371) 118p                           0.000081
1372) toclaim                        0.000081
1373) capital                        0.000081
1374) wc1n3xx                        0.000081
1375) sexual                         0.000081
1376) tons                           0.000081
1377) le                             0.000081
1378) box177                         0.000081
1379) post                           0.000081
1380) plus                           0.000081
1381) saying                         0.000081
1382) calls1                         0.000080
1383) say                            0.000080
1384) kingdom                        0.000080
1385) mad1                           0.000080
1386) telediscount                   0.000080
1387) thing                          0.000080
1388) ya                             0.000080
1389) support                     

1555) where                          0.000065
1556) rgent                          0.000065
1557) cute                           0.000065
1558) tenerife                       0.000065
1559) record                         0.000065
1560) sing                           0.000065
1561) virgins                        0.000065
1562) th                             0.000065
1563) starting                       0.000065
1564) 2025050                        0.000064
1565) flirt                          0.000064
1566) sweet                          0.000064
1567) splashmobile                   0.000064
1568) exciting                       0.000064
1569) 1winaweek                      0.000064
1570) charts                         0.000064
1571) rply                           0.000064
1572) wish                           0.000064
1573) 07781482378                    0.000064
1574) 69855                          0.000064
1575) txt82228                       0.000064
1576) callcost                    

1744) copy                           0.000052
1745) convey                         0.000052
1746) real1                          0.000052
1747) someonone                      0.000052
1748) 77                             0.000052
1749) request                        0.000052
1750) golf                           0.000052
1751) thurs                          0.000052
1752) rpl                            0.000052
1753) those                          0.000052
1754) m221bp                         0.000052
1755) country                        0.000052
1756) fixed                          0.000051
1757) seeds                          0.000051
1758) subpoly                        0.000051
1759) sign                           0.000051
1760) ignore                         0.000051
1761) confirm                        0.000051
1762) register                       0.000051
1763) 4719                           0.000051
1764) phone750                       0.000051
1765) pay                         

1927) strike                         0.000043
1928) anything                       0.000043
1929) buffy                          0.000043
1930) south                          0.000042
1931) nice                           0.000042
1932) amanda                         0.000042
1933) treat                          0.000042
1934) trip                           0.000042
1935) customersqueries               0.000042
1936) 09066350750                    0.000042
1937) sol                            0.000042
1938) txtin                          0.000042
1939) videosounds                    0.000042
1940) 09058091854                    0.000042
1941) mobilesvary                    0.000042
1942) basically                      0.000042
1943) promised                       0.000042
1944) stop2stop                      0.000042
1945) erotic                         0.000042
1946) svc                            0.000042
1947) prepared                       0.000042
1948) fighting                    

2111) polyc                          0.000034
2112) jogging                        0.000034
2113) bank                           0.000034
2114) better                         0.000034
2115) prompts                        0.000034
2116) pie                            0.000034
2117) awesome                        0.000034
2118) forget                         0.000033
2119) wasn                           0.000033
2120) 09065989180                    0.000033
2121) 09701213186                    0.000033
2122) woods                          0.000033
2123) 07821230901                    0.000033
2124) 14thmarch                      0.000033
2125) version                        0.000033
2126) yellow                         0.000033
2127) ahead                          0.000033
2128) confirmd                       0.000033
2129) norm                           0.000033
2130) eire                           0.000033
2131) fine                           0.000033
2132) exp                         

2294) fa                             0.000027
2295) idew                           0.000027
2296) atm                            0.000027
2297) warranty                       0.000027
2298) fri                            0.000027
2299) snap                           0.000027
2300) informed                       0.000027
2301) willing                        0.000027
2302) 88222                          0.000027
2303) global                         0.000027
2304) hardcore                       0.000027
2305) shit                           0.000027
2306) fifa                           0.000027
2307) ultimate                       0.000027
2308) ages                           0.000027
2309) doin                           0.000027
2310) leh                            0.000027
2311) lab                            0.000027
2312) fakeye                         0.000027
2313) 49                             0.000026
2314) babes                          0.000026
2315) converter                   

2475) culdnt                         0.000020
2476) meant                          0.000020
2477) violet                         0.000020
2478) chik                           0.000020
2479) 02085076972                    0.000020
2480) coming                         0.000020
2481) kanagu                         0.000020
2482) lunch                          0.000020
2483) musical                        0.000020
2484) bold                           0.000020
2485) 09064017295                    0.000020
2486) hun                            0.000020
2487) self                           0.000020
2488) occurs                         0.000020
2489) 08448714184                    0.000020
2490) 2ez                            0.000020
2491) 09061221061                    0.000020
2492) poor                           0.000020
2493) soul                           0.000020
2494) however                        0.000020
2495) admit                          0.000020
2496) beautiful                   

2665) belly                          0.000016
2666) kusruthi                       0.000016
2667) anyway                         0.000016
2668) turning                        0.000016
2669) athletic                       0.000016
2670) worried                        0.000015
2671) 0871277810710p                 0.000015
2672) runs                           0.000015
2673) khelate                        0.000015
2674) waheed                         0.000015
2675) box1146                        0.000015
2676) forgiveness                    0.000015
2677) bird                           0.000015
2678) 2morrow                        0.000015
2679) beverage                       0.000015
2680) 83110                          0.000015
2681) cnn                            0.000015
2682) sun                            0.000015
2683) logon                          0.000015
2684) full                           0.000015
2685) meaning                        0.000015
2686) vegetables                  

2848) able                           0.000012
2849) lies                           0.000012
2850) 27                             0.000012
2851) 09061743810                    0.000012
2852) chain                          0.000012
2853) planet                         0.000012
2854) dude                           0.000012
2855) js                             0.000012
2856) non                            0.000012
2857) sis                            0.000012
2858) heading                        0.000012
2859) disclose                       0.000012
2860) quick                          0.000012
2861) renewal                        0.000012
2862) deliver                        0.000012
2863) 6hrs                           0.000012
2864) askin                          0.000012
2865) looked                         0.000012
2866) situation                      0.000012
2867) telphone                       0.000012
2868) wishlist                       0.000012
2869) watching                    

3041) evaporated                     0.000009
3042) sharing                        0.000009
3043) prefer                         0.000009
3044) xxxx                           0.000009
3045) refilled                       0.000009
3046) food                           0.000009
3047) pressure                       0.000009
3048) 7th                            0.000009
3049) subsequent                     0.000009
3050) june                           0.000009
3051) crash                          0.000009
3052) understand                     0.000009
3053) letter                         0.000009
3054) application                    0.000009
3055) quickly                        0.000009
3056) disconnected                   0.000009
3057) cafe                           0.000009
3058) rtm                            0.000009
3059) urgently                       0.000009
3060) 177                            0.000009
3061) envelope                       0.000009
3062) mean                        

3242) gr8fun                         0.000007
3243) 4217                           0.000007
3244) tag                            0.000007
3245) hdd                            0.000007
3246) costumes                       0.000007
3247) movies                         0.000007
3248) noi                            0.000007
3249) dysentry                       0.000007
3250) clear                          0.000007
3251) paperwork                      0.000007
3252) hint                           0.000007
3253) tonite                         0.000007
3254) twittering                     0.000007
3255) updat                          0.000007
3256) prepare                        0.000007
3257) beneficiary                    0.000007
3258) gonnamissu                     0.000007
3259) battery                        0.000007
3260) tuesday                        0.000007
3261) cantdo                         0.000007
3262) mrng                           0.000007
3263) omw                         

3434) funny                          0.000004
3435) careful                        0.000004
3436) elsewhere                      0.000004
3437) pete                           0.000004
3438) months                         0.000004
3439) 2814032                        0.000004
3440) tryin                          0.000004
3441) fridge                         0.000004
3442) 5min                           0.000004
3443) cookies                        0.000004
3444) diff                           0.000004
3445) brb                            0.000004
3446) weight                         0.000004
3447) mr                             0.000004
3448) likeyour                       0.000004
3449) fffff                          0.000004
3450) spoilt                         0.000004
3451) laughed                        0.000004
3452) hire                           0.000004
3453) financial                      0.000004
3454) attractive                     0.000004
3455) singing                     

3624) ibored                         0.000002
3625) airtel                         0.000002
3626) test                           0.000002
3627) broken                         0.000002
3628) thm                            0.000002
3629) depression                     0.000002
3630) drive                          0.000002
3631) wn                             0.000002
3632) hrishi                         0.000002
3633) beyond                         0.000002
3634) tirunelvali                    0.000002
3635) 2wt                            0.000002
3636) fear                           0.000002
3637) ashwini                        0.000002
3638) bar                            0.000002
3639) working                        0.000002
3640) bowl                           0.000002
3641) pierre                         0.000002
3642) debating                       0.000002
3643) difficulties                   0.000002
3644) 09066358361                    0.000002
3645) early                       

3809) general                        0.000001
3810) heron                          0.000001
3811) ipads                          0.000001
3812) length                         0.000001
3813) norm150p                       0.000001
3814) exactly                        0.000001
3815) pretty                         0.000001
3816) add                            0.000001
3817) taylor                         0.000001
3818) 07808247860                    0.000001
3819) taken                          0.000001
3820) tonights                       0.000001
3821) jas                            0.000001
3822) teasing                        0.000001
3823) s89                            0.000001
3824) drpd                           0.000001
3825) rebtel                         0.000001
3826) twelve                         0.000001
3827) 4get                           0.000001
3828) 88877                          0.000001
3829) jess                           0.000001
3830) priya                       

4001) lady                           0.000000
4002) hunt                           0.000000
4003) fever                          0.000000
4004) fourth                         0.000000
4005) fills                          0.000000
4006) bullshit                       0.000000
4007) stagwood                       0.000000
4008) hold                           0.000000
4009) 09111032124                    0.000000
4010) popping                        0.000000
4011) dog                            0.000000
4012) bakra                          0.000000
4013) lionp                          0.000000
4014) stranger                       0.000000
4015) 07732584351                    0.000000
4016) door                           0.000000
4017) legal                          0.000000
4018) poet                           0.000000
4019) tome                           0.000000
4020) wa14                           0.000000
4021) wihtuot                        0.000000
4022) pimples                     
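
A ranked listing in this layout can be produced by sorting the importance scores in descending order and printing one feature per line. The following is a minimal sketch, not the notebook's own code: the function name `format_ranking`, the 30-character column width, and the toy tokens and scores are all assumptions made for illustration.

```python
def format_ranking(names, importances, width=30):
    """Return 'rank) name  score' lines, highest importance first.

    `names` and `importances` are parallel sequences (hypothetical inputs).
    """
    # Indices of the features, ordered from most to least important.
    order = sorted(range(len(names)), key=lambda i: importances[i], reverse=True)
    lines = []
    for rank, i in enumerate(order, start=1):
        # Left-align the name in a fixed-width column, then the score to 6 decimals.
        lines.append("%d) %-*s %f" % (rank, width, names[i], importances[i]))
    return lines


# Toy data, invented for this example only.
toy_names = ["token_a", "token_b", "token_c"]
toy_scores = [0.3, 0.6999996, 0.0000004]

for line in format_ranking(toy_names, toy_scores):
    print(line)
```

Note that `%f` rounds to six decimal places, so a score printed as `0.000000` may be small but non-zero, as with `token_c` in the toy data.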

# 4. Experimental Results

In [14]:
print(result_table)

| Algorithm     | Accuracy | Precision | Recall | F1     |
|---------------|----------|-----------|--------|--------|
| Naive Bayes   | 0.9885   | 0.9721    | 0.9405 | 0.9560 |
| Random Forest | 0.9828   | 1.0000    | 0.8703 | 0.9306 |
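
To make the four metrics concrete, the sketch below computes them from confusion-matrix counts. The notebook does not report a confusion matrix; the counts used here are back-calculated so that they reproduce the table (they imply 1393 test messages, 185 of them in the positive class) and should be read as an assumption, as should the function name `metrics`. It also assumes that spam is the positive class.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1


# Counts inferred from the reported figures, NOT taken from the notebook.
naive_bayes = metrics(tp=174, fp=5, fn=11, tn=1203)
random_forest = metrics(tp=161, fp=0, fn=24, tn=1208)

print("%.4f %.4f %.4f %.4f" % naive_bayes)    # 0.9885 0.9721 0.9405 0.9560
print("%.4f %.4f %.4f %.4f" % random_forest)  # 0.9828 1.0000 0.8703 0.9306
```

Under these assumed counts, Random Forest produces no false positives (precision 1.0) but misses more positive messages than Naive Bayes (24 versus 11), which is why its recall and F1 are lower.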


# 5. Level of Contribution from Each Member