# Spam Detector

### In order to estimate the cost of development and monitor the project, a checklist of tasks to be carried out must be drawn up.

https://trello.com/b/JNiVTMvb/spam-detector


I used Trello to track the project and to build my roadmap.

### You must create functions for the different parts of your code so that you can easily reuse them

In [40]:
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd 
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, make_scorer
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
import time

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/randon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Import the data: the file is tab-separated and the columns are named "label" and "sms"
df = pd.read_csv("/home/randon/git/brief-01-12-spamDetector/data/SMSSpamCollection.txt", sep ='\t',names=["label", "sms"])

# Show the first 8 rows
print(df[:8])

  label                                                sms
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
5  spam  FreeMsg Hey there darling it's been 3 week's n...
6   ham  Even my brother is not like to speak with me. ...
7   ham  As per your request 'Melle Melle (Oru Minnamin...


In [3]:
# Group by label to see how many ham and how many spam messages there are

df.groupby('label').describe()

| label | sms: count | sms: unique | sms: top | sms: freq |
|-------|------------|-------------|----------|-----------|
| ham   | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam  | 747  | 653  | Please call our customer service representativ... | 4 |


We can see that the dataset contains 4825 ham messages and 747 spam messages.
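
To make the class imbalance explicit, the two counts from the `describe()` output can be turned into proportions. This is only a small arithmetic sketch; the variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the describe() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} messages ({share:.1%})")

# A classifier that always answers "ham" would already reach this accuracy,
# which is why a per-class F1 score is more informative than accuracy here.
majority_baseline_accuracy = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

With roughly 13% spam, accuracy alone would flatter any model, so the rest of the notebook looks at the F1 score of each class.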

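Now let's encode the label column as 0 or 1. Two columns are created: `labelf1ham` (ham = 1, spam = 0) and `labelf1spam` (ham = 0, spam = 1), so that the F1 score can later be computed with either class as the positive one.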

In [95]:
df['labelf1ham']= df['label'].map({'ham': 1, 'spam': 0})
df['labelf1spam']= df['label'].map({'ham': 0, 'spam': 1})

# Check the result
print(df[:8])

  label                                                sms  labelnumber  \
0   ham  Go until jurong point, crazy.. Available only ...            1   
1   ham                      Ok lar... Joking wif u oni...            1   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...            0   
3   ham  U dun say so early hor... U c already then say...            1   
4   ham  Nah I don't think he goes to usf, he lives aro...            1   
5  spam  FreeMsg Hey there darling it's been 3 week's n...            0   
6   ham  Even my brother is not like to speak with me. ...            1   
7   ham  As per your request 'Melle Melle (Oru Minnamin...            1   

   labelf1ham  labelf1spam  
0           1            0  
1           1            0  
2           0            1  
3           1            0  
4           1            0  
5           0            1  
6           1            0  
7           1            0  


#### Let's create the vectorization (with stop words), fitting and prediction functions

In [5]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

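# Define the stop words
stopwords = nltk.corpus.stopwords.words('english')

def clean(x):
    """Turn raw messages into a TF-IDF matrix, ignoring English stop words."""
    vectoriser = TfidfVectorizer(stop_words = stopwords)
    x = vectoriser.fit_transform(x)
    return x

def fitting(X, y, mod):
    """Train the model `mod` on the features X and the labels y."""
    mod.fit(X, y)

def predict(X, mod):
    """Return the labels predicted for X by the fitted model `mod`."""
    xx = mod.predict(X)
    return xx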
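
Note that `clean` fits the TF-IDF vocabulary on every message, including those that later end up in the test set, and that the fitted vectorizer is discarded, so new messages cannot be transformed afterwards. One possible alternative, shown here only as a sketch, is to chain the vectorizer and the classifier in a scikit-learn pipeline. The toy corpus, the name `spam_pipeline` and the use of scikit-learn's built-in English stop-word list (instead of the NLTK list) are assumptions made to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, only to make the sketch runnable (not the SMS data).
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free reward", "urgent prize waiting call now",
    "are we meeting for lunch", "see you at home later",
    "call me when you arrive", "thanks for the lift today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # in this toy example: 1 = spam, 0 = ham

text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on the training texts only.
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)
spam_pipeline.fit(text_train, y_train)

# The fitted pipeline accepts raw strings, including unseen messages.
predictions = spam_pipeline.predict(text_test)
print(predictions)
```

A pipeline like this can also be passed directly to `cross_val_score`, in which case the vectorizer is refitted on the training part of every split.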


#### Try the functions

In [106]:
x = np.array(df['sms'])
y = np.array(df['labelf1ham'])

#Clean data
x = clean(x)

# Define the classification models
LogReg = LogisticRegression()
mlp = MLPClassifier(hidden_layer_sizes=(15,), activation='logistic', alpha=1e-4,
                    solver='sgd', tol=1e-4, random_state=1,
                    learning_rate_init=.08, verbose=True, max_iter=250)
svclass= SVC()

### LogisticRegression

In [107]:
#Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)


#fit and predict
tmps1=time.time()

fitting(x_train, y_train, LogReg)

tmps2=time.time() - tmps1
print("Temps d'execution = %f seconde\n" %tmps2)


ypred = predict(x_test, LogReg)

# Compare the first predictions with the true labels of the test set
print(ypred[:50])
print(y_test[:50])


Temps d'execution = 0.145042 seconde

[1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0
 1 1 0 1 1 1 1 1 1 1 1 1 0]
[1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0
 1 1 0 1 1 1 1 1 1 1 1 1 0]


In [120]:
logreg_report = classification_report(y_test, ypred)
print(logreg_report)


              precision    recall  f1-score   support

           0       1.00      0.84      0.92       160
           1       0.97      1.00      0.99       955

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115
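
The report above is digit-for-digit identical to the SVC report further down, and its cell number (In [120]) is later than the SVC cells (In [116], In [117]). One possible explanation, which the source does not confirm, is that the shared variable `ypred` still held the SVC predictions when this report was printed. A way to rule that out is to keep the predictions local to a helper, as in the sketch below. The helper name `evaluate` and the synthetic data are assumptions made for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit one model, time the fit, and return its own per-class F1 scores."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    # Local variable: it cannot be overwritten by another notebook cell.
    y_pred = model.predict(X_test)
    return {
        "fit_seconds": fit_seconds,
        "f1_class_0": f1_score(y_test, y_pred, pos_label=0),
        "f1_class_1": f1_score(y_test, y_pred, pos_label=1),
    }


# Synthetic stand-in data (not the SMS corpus), only to make the sketch runnable.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.2, 0.8], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

models = {"LogReg": LogisticRegression(max_iter=1000), "SVC": SVC()}
results = {
    name: evaluate(model, X_tr, X_te, y_tr, y_te)
    for name, model in models.items()
}
for name, res in results.items():
    print(name, res)
```

Because each call returns its own scores, the comparison no longer depends on the order in which the cells were executed.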



### MLP Classifier

In [114]:
#Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

#fit and predict
tmps1=time.time()

fitting(x_train, y_train, mlp)

tmps2=time.time() - tmps1
print("Temps d'execution = %f seconde\n" %tmps2)

ypred = predict(x_test, mlp)

# Compare the first predictions with the true labels of the test set
print(ypred[:100])
print(y_test[:100])


Iteration 1, loss = 0.42672297
Iteration 2, loss = 0.39062156
Iteration 3, loss = 0.38927769
Iteration 4, loss = 0.38823304
Iteration 5, loss = 0.38727843
Iteration 6, loss = 0.38680031
Iteration 7, loss = 0.38587299
Iteration 8, loss = 0.38504709
Iteration 9, loss = 0.38506499
Iteration 10, loss = 0.38268475
Iteration 11, loss = 0.38214738
Iteration 12, loss = 0.38086973
Iteration 13, loss = 0.37929672
Iteration 14, loss = 0.37804895
Iteration 15, loss = 0.37609545
Iteration 16, loss = 0.37436649
Iteration 17, loss = 0.37121310
Iteration 18, loss = 0.36856274
Iteration 19, loss = 0.36624328
Iteration 20, loss = 0.36203743
Iteration 21, loss = 0.35806391
Iteration 22, loss = 0.35366808
Iteration 23, loss = 0.34877907
Iteration 24, loss = 0.34301373
Iteration 25, loss = 0.33653664
Iteration 26, loss = 0.32932764
Iteration 27, loss = 0.32134159
Iteration 28, loss = 0.31251685
Iteration 29, loss = 0.30340942
Iteration 30, loss = 0.29275862
Iteration 31, loss = 0.28200521
[... rest of the training log truncated in the source ...]

In [115]:
mlp_report = classification_report(y_test, ypred)
print(mlp_report)

              precision    recall  f1-score   support

           0       0.99      0.89      0.94       160
           1       0.98      1.00      0.99       955

    accuracy                           0.98      1115
   macro avg       0.99      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115
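
As a reminder of how the `f1-score` column relates to the other two, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values printed in the MLP report; the function name `f1_from` is illustrative. Since `y` comes from `labelf1ham`, class 0 is spam and class 1 is ham.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall copied from the MLP report above.
f1_spam = f1_from(0.99, 0.89)  # class 0
f1_ham = f1_from(0.98, 1.00)   # class 1

print(f"class 0 (spam): {f1_spam:.2f}, class 1 (ham): {f1_ham:.2f}")
# prints: class 0 (spam): 0.94, class 1 (ham): 0.99
```

Recomputing from rounded inputs can differ from the report in the last digit, because `classification_report` works from the unrounded counts.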



### SVC

In [116]:
#Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)


#fit and predict
tmps1=time.time()

fitting(x_train, y_train, svclass)

tmps2=time.time() - tmps1
print("Temps d'execution = %f seconde\n" %tmps2)


ypred = predict(x_test, svclass)

# Compare the first predictions with the true labels of the test set
print(ypred[:50])
print(y_test[:50])


Temps d'execution = 1.820585 seconde

[1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0
 1 1 0 1 1 1 1 1 1 1 1 1 0]
[1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0
 1 1 0 1 1 1 1 1 1 1 1 1 0]


In [117]:
svc_report = classification_report(y_test, ypred)
print(svc_report)


              precision    recall  f1-score   support

           0       1.00      0.84      0.92       160
           1       0.97      1.00      0.99       955

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115




### You must perform a cross-validation on 10 different learning and testing sets. The seed should be set at 42 and the test set should represent 20% of the data.
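
Before running the models, it can help to check what these settings produce. The sketch below, under the assumption that the dataset has the 5572 rows counted earlier, only inspects the sizes of the splits generated by `ShuffleSplit`; no model is trained.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 4825 + 747  # ham + spam counts from the describe() output

cv = ShuffleSplit(n_splits=10, test_size=0.20, random_state=42)

# Only the number of rows matters for the split, so a dummy array is enough.
split_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in cv.split(np.zeros((n_samples, 1)))
]
print(len(split_sizes), split_sizes[0])
# prints: 10 (4457, 1115)
```

Unlike k-fold cross-validation, each `ShuffleSplit` iteration draws a fresh random 80/20 partition, so the ten test sets may overlap.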

#### Logistic regression

In [96]:
from sklearn.model_selection import cross_val_score

y = np.array(df['labelf1ham'])
y2 = np.array(df['labelf1spam'])
cv = ShuffleSplit(n_splits=10, test_size=.20, random_state=42)
cv.get_n_splits(x)

tmps1=time.time()

scores = cross_val_score(LogReg, x, y, cv=cv, scoring='f1')

tmps2=time.time() - tmps1
print("Temps d'execution = %f seconde\n" %tmps2)

print("Cross-validation F1 ham scores: {}".format(scores))

tmps1=time.time()

scores2 = cross_val_score(LogReg, x, y2, cv=cv, scoring='f1')

tmps2=time.time() - tmps1
print("\n\nTemps d'execution = %f seconde\n" %tmps2)

print("Cross-validation F1 spam scores: {}".format(scores2))

Temps d'execution = 1.076684 seconde

Cross-validation F1 ham scores: [0.97672065 0.97163121 0.97380586 0.98330804 0.97531486 0.97052846
 0.97782258 0.9721519  0.96390442 0.97192445]


Temps d'execution = 0.946598 seconde

Cross-validation F1 spam scores: [0.81889764 0.78125    0.81978799 0.86956522 0.8        0.77862595
 0.82113821 0.78431373 0.73003802 0.79704797]
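
The cell above obtains the F1 score of each class by running the cross-validation on two mirrored label vectors (`y` and `y2`). A possible alternative, sketched here on synthetic data rather than the SMS corpus, is to keep a single label vector and build one scorer per positive class with `make_scorer`, which the notebook already imports. The names `f1_class0` and `f1_class1` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic, imbalanced stand-in for the TF-IDF matrix (not the SMS data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.15, 0.85], random_state=42
)

# Fixed seed: both cross_val_score calls below see exactly the same splits.
cv = ShuffleSplit(n_splits=10, test_size=0.20, random_state=42)
model = LogisticRegression(max_iter=1000)

# Same label vector, one scorer per positive class.
f1_class1 = make_scorer(f1_score, pos_label=1)
f1_class0 = make_scorer(f1_score, pos_label=0)

scores_class1 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring=f1_class1)
scores_class0 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring=f1_class0)

print("F1 (class 1):", scores_class1.mean())
print("F1 (class 0):", scores_class0.mean())
```

Applied to the notebook, class 1 would be ham and class 0 spam when `y` is taken from `labelf1ham`.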


#### MLP Classifier

In [97]:
cv = ShuffleSplit(n_splits=10, test_size=.20, random_state=42)
cv.get_n_splits(x)

tmps1=time.time()

scores = cross_val_score(mlp, x, y, cv=cv, scoring='f1')

tmps2=time.time() - tmps1
print("Temps d'execution = %f seconde\n" %tmps2)

print("Cross-validation F1 ham scores: {}".format(scores))

tmps1=time.time()

scores2 = cross_val_score(mlp, x, y2, cv=cv, scoring='f1')

tmps2=time.time() - tmps1
print("\n\nTemps d'execution = %f seconde\n" %tmps2)

print("Cross-validation F1 spam scores: {}".format(scores2))

Iteration 1, loss = 0.43194236
Iteration 2, loss = 0.39543011
Iteration 3, loss = 0.39334119
...
Iteration 196, loss = 0.00880392
Iteration 197, loss = 0.00871359
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
[... the verbose training log repeats for each of the remaining fits and is truncated in the source ...]

#### SVC

In [98]:
cv = ShuffleSplit(n_splits=10, test_size=.20, random_state=42)
cv.get_n_splits(x)

tmps1=time.time()

scores = cross_val_score(svclass, x, y, cv=cv, scoring='f1')

tmps2=time.time() - tmps1
print("Temps d'execution = %f seconde\n" %tmps2)

print("Cross-validation F1 ham scores: {}".format(scores))

tmps1=time.time()

scores2 = cross_val_score(svclass, x, y2, cv=cv, scoring='f1')

tmps2=time.time() - tmps1
print("\n\nTemps d'execution = %f seconde\n" %tmps2)

print("Cross-validation F1 spam scores: {}".format(scores2))

Temps d'execution = 18.630127 seconde

Cross-validation F1 ham scores: [0.98773006 0.9825998  0.98751301 0.99185336 0.98878695 0.98614674
 0.98830707 0.98360656 0.97986577 0.98399587]


Temps d'execution = 18.258321 seconde

Cross-validation F1 spam scores: [0.91240876 0.87681159 0.92207792 0.93984962 0.91791045 0.90391459
 0.91254753 0.88489209 0.8668942  0.89419795]



### Compare at least three classification algorithms in terms of **F1 score**. Which is the most powerful?

#### Logistic Regression

- F1 score (ham): 0.97
- F1 score (spam): 0.82
- Execution time: 0.173481 seconds



- Cross-validation scores (ham):
[0.97672065 0.97163121 0.97380586 0.98330804 0.97531486 0.97052846 0.97782258 0.9721519  0.96390442 0.97192445]
- Mean cross-validation score (ham): __0.973711243__

- Cross-validation scores (spam):
[0.81889764 0.78125    0.81978799 0.86956522 0.8        0.77862595 0.82113821 0.78431373 0.73003802 0.79704797]
- Mean cross-validation score (spam): __0.800066473__


- Cross-validation execution time: 1.076684 seconds



#### MLP Classifier

- F1 score (ham): 0.99
- F1 score (spam): 0.94

- Execution time: 22.520708 seconds

- Cross-validation scores (ham): [0.98972251 0.98658411 0.99059561 0.99234303 0.99131323 0.98757764
 0.98878695 0.9881137  0.9849037  0.98600311]
- Mean cross-validation score (ham): __0.988594359__

- Cross-validation scores (spam): [0.92957746 0.9109589  0.94303797 0.94852941 0.93772894 0.91946309
 0.91791045 0.92567568 0.9025974  0.910299  ]
- Mean cross-validation score (spam): __0.924577829__

- Cross-validation execution time: 183.508235 seconds

#### SVC

- F1 score (ham): 0.99
- F1 score (spam): 0.92

- Execution time: 1.631358 seconds

- Cross-validation scores (ham): [0.98773006 0.9825998  0.98751301 0.99185336 0.98878695 0.98614674
 0.98830707 0.98360656 0.97986577 0.98399587]
- Mean cross-validation score (ham): __0.986040519__

- Cross-validation scores (spam): [0.91240876 0.87681159 0.92207792 0.93984962 0.91791045 0.90391459
 0.91254753 0.88489209 0.8668942  0.89419795]
- Mean cross-validation score (spam): __0.903150470__

- Cross-validation execution time: 20.917359 seconds

In [118]:
# Compute the mean cross-validation F1 score for each model and class

# Logistic regression, ham
a1 = [0.97672065, 0.97163121, 0.97380586, 0.98330804, 0.97531486,
      0.97052846, 0.97782258, 0.9721519, 0.96390442, 0.97192445]
print(sum(a1)/len(a1))

# Logistic regression, spam
a2 = [0.81889764, 0.78125, 0.81978799, 0.86956522, 0.8,
      0.77862595, 0.82113821, 0.78431373, 0.73003802, 0.79704797]
print(sum(a2)/len(a2))

# MLP, ham
b1 = [0.98972251, 0.98658411, 0.99059561, 0.99234303, 0.99131323,
      0.98757764, 0.98878695, 0.9881137, 0.9849037, 0.98600311]
print(sum(b1)/len(b1))

# MLP, spam
b2 = [0.92957746, 0.9109589, 0.94303797, 0.94852941, 0.93772894,
      0.91946309, 0.91791045, 0.92567568, 0.9025974, 0.910299]
print(sum(b2)/len(b2))

# SVC, spam
c2 = [0.91240876, 0.87681159, 0.92207792, 0.93984962, 0.91791045,
      0.90391459, 0.91254753, 0.88489209, 0.8668942, 0.89419795]
print(sum(c2)/len(c2))

0.973711243
0.800066473
0.988594359
0.9245778299999999
0.9031504700000001
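
The same averages can be obtained in a more compact way, together with the spread across the ten splits, which helps judge whether the gap between two models is larger than the split-to-split variation. This is only a sketch using the standard library; the dictionary keys are illustrative labels, and the scores are copied from the cells above.

```python
from statistics import fmean, stdev

# Per-split F1 scores copied from the cross-validation cells above.
cv_f1 = {
    "LogReg ham": [0.97672065, 0.97163121, 0.97380586, 0.98330804, 0.97531486,
                   0.97052846, 0.97782258, 0.9721519, 0.96390442, 0.97192445],
    "LogReg spam": [0.81889764, 0.78125, 0.81978799, 0.86956522, 0.8,
                    0.77862595, 0.82113821, 0.78431373, 0.73003802, 0.79704797],
    "MLP ham": [0.98972251, 0.98658411, 0.99059561, 0.99234303, 0.99131323,
                0.98757764, 0.98878695, 0.9881137, 0.9849037, 0.98600311],
    "MLP spam": [0.92957746, 0.9109589, 0.94303797, 0.94852941, 0.93772894,
                 0.91946309, 0.91791045, 0.92567568, 0.9025974, 0.910299],
    "SVC spam": [0.91240876, 0.87681159, 0.92207792, 0.93984962, 0.91791045,
                 0.90391459, 0.91254753, 0.88489209, 0.8668942, 0.89419795],
}

# Mean and sample standard deviation for every model/class pair.
summary = {name: (fmean(s), stdev(s)) for name, s in cv_f1.items()}
for name, (mean, spread) in summary.items():
    print(f"{name:<12} mean={mean:.4f} std={spread:.4f}")

# Model with the highest mean F1 on the spam class among the listed scores.
best_spam = max(
    (name for name in summary if name.endswith("spam")),
    key=lambda name: summary[name][0],
)
print("highest mean spam F1:", best_spam)
```

On these numbers the MLP has the highest mean F1 on the spam class, at the price of the longest training time reported above.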


# Conclusion