# Predict Spam SMS
##### Chintan Chitroda

### Context
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

### Content
The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

In [117]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [118]:
import os
os.listdir()

['.ipynb_checkpoints',
 'TestDataset.csv',
 'TrainDataset.csv',
 'Untitled.ipynb',
 'Untitled1.ipynb']

In [119]:
df1 = pd.read_csv('TrainDataset.csv')
df2 = pd.read_csv('TestDataset.csv')

In [120]:
df1.head(3)

Unnamed: 0,v1,v2
0,spam,U were outbid by simonwatson5120 on the Shinco...
1,ham,Do you still have the grinder?
2,ham,No. Yes please. Been swimming?


In [121]:
df2.head(3)

Unnamed: 0,v2
0,Prabha..i'm soryda..realy..frm heart i'm sory
1,"Jus chillaxin, what up"
2,Ok no prob. Take ur time.


In [122]:
# Encode the labels as integers: spam -> 1, ham -> 0
df1['v1'] = df1['v1'].map({'spam': 1, 'ham': 0})

In [123]:
res = df1['v1']
df1.drop('v1',axis=1,inplace=True)

In [124]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
joindf = pd.concat([df1, df2])

In [125]:
import string
from textblob import TextBlob
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [126]:
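def Process(mess):
    # Remove punctuation characters.
    nopunc = ''.join(char for char in mess if char not in string.punctuation)
    # Drop English stop words (case-insensitive comparison).
    nopunc = ' '.join(word for word in nopunc.split() if word.lower() not in stop_words)
    # Wrap in TextBlob, convert back to str, and lowercase.
    return str(TextBlob(nopunc)).lower()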

In [127]:
joindf['v2'] = joindf['v2'].apply(Process)

In [128]:
joindf.replace("[^a-zA-Z]"," ",regex=True,inplace=True)

In [129]:
joindf.iloc[1]

v2    still grinder
Name: 1, dtype: object
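
The cleaning steps above (strip punctuation, drop stop words, lowercase, then replace every non-letter with a space) can be traced on a single message. This is a minimal sketch using only the standard library: `clean_message` and the tiny `STOP_WORDS` set are hypothetical stand-ins for `Process` and NLTK's much longer English stop-word list, and TextBlob is left out.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stop-word list (assumption:
# the real list is far longer and is loaded via nltk.corpus.stopwords).
STOP_WORDS = {"do", "you", "have", "the", "a", "is", "to", "by", "on", "were"}

def clean_message(text):
    # 1. Drop punctuation characters.
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. Drop stop words (case-insensitive) and lowercase what remains.
    kept = [w.lower() for w in no_punct.split() if w.lower() not in STOP_WORDS]
    # 3. Replace anything that is not a letter with a space, like the regex cell.
    return re.sub(r"[^a-zA-Z]", " ", " ".join(kept))

print(clean_message("Do you still have the grinder?"))  # still grinder
print(repr(clean_message("Call 0800 now!")))            # digits become spaces
```

Because step 3 swaps each digit for a space rather than deleting it, messages that contained numbers end up with runs of several spaces between words.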

In [130]:
import nltk
from nltk.stem import WordNetLemmatizer

In [131]:
le = WordNetLemmatizer()

In [132]:
msg = joindf.v2.tolist()

In [133]:
msg[32]

'adult    content video shortly'

In [134]:
lemmetized_sentence = []

for i in range(len(msg)):
    words = nltk.word_tokenize(msg[i])
    words = [le.lemmatize(word) for word in words]
    lemmetized_sentence.append(' '.join(words))

### TF-IDF

In [135]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [136]:
tfidf = TfidfVectorizer() 

In [137]:
msgs = tfidf.fit_transform(lemmetized_sentence).toarray()

In [138]:
msgs

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
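
Most entries of the matrix are zero because each message uses only a few words of the whole vocabulary. The sketch below shows, in plain Python, the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function `tfidf_rows` and the three toy messages are hypothetical, and tokenisation is simplified to whitespace splitting, so this illustrates the formula rather than replacing the vectorizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    # Simplified tokenisation (assumption): split on whitespace only.
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalise the row so that every document has unit length.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["still grinder", "free entry win", "win free prize"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Words that occur in fewer messages (here `entry`) receive a larger weight than words shared by several messages (here `free`).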

In [139]:
df1

Unnamed: 0,v2
0,U were outbid by simonwatson5120 on the Shinco...
1,Do you still have the grinder?
2,No. Yes please. Been swimming?
3,No de.am seeing in online shop so that i asked.
4,"Faith makes things possible,Hope makes things ..."
...,...
4452,Good. Good job. I like entrepreneurs
4453,Living is very simple.. Loving is also simple....
4454,Msgs r not time pass.They silently say that I ...
4455,What is this 'hex' place you talk of? Explain!


In [140]:
n_train = len(df1)  # number of training rows (4457); the remaining rows come from df2
traindf = msgs[:n_train]
testdf = msgs[n_train:]

In [141]:
len(traindf)

4457

## Modelling

In [142]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

In [143]:
X = traindf
y = res

In [144]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7)
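
The split above is random and unseeded, so the share of spam in each part can change from run to run. With an imbalanced label like this one, a stratified split keeps the class ratio the same on both sides. scikit-learn's `train_test_split` accepts `stratify=` and `random_state=` for this; the plain-Python sketch below only illustrates the idea. `stratified_split` and the 86/14 toy labels are hypothetical and are not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_size=0.7, seed=0):
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    # Shuffle and cut each class separately so both parts keep the same ratio.
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_size)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 86 + [1] * 14  # hypothetical: 86 ham, 14 spam
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in train_idx), sum(labels[i] for i in test_idx))
```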

In [145]:
rfc = RandomForestClassifier(n_estimators=50,max_depth=12, random_state=101,
                             class_weight='balanced',verbose=1,n_jobs=-1)
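
`class_weight='balanced'` weights each class inversely to its frequency, using `n_samples / (n_classes * count)`, so the rarer spam class counts for more during training. The sketch below reproduces that formula in plain Python. `balanced_class_weights` is a hypothetical helper, and the counts (1,153 ham, 185 spam) are illustrative only; the actual weights depend on the labels passed to `fit`.

```python
from collections import Counter

def balanced_class_weights(labels):
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    # Rarer classes get proportionally larger weights.
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

labels = [0] * 1153 + [1] * 185  # illustrative counts, not the real y_train
weights = balanced_class_weights(labels)
print({cls: round(w, 4) for cls, w in weights.items()})
```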

In [146]:
rfc.fit(X_train,y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    0.6s finished


RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=12, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=50, n_jobs=-1, oob_score=False,
                       random_state=101, verbose=1, warm_start=False)

In [147]:
y_pred = rfc.predict(X_test)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done  50 out of  50 | elapsed:    0.0s finished


In [148]:
# scikit-learn metrics expect the true labels first: (y_true, y_pred)
print("F1 Score :",round(f1_score(y_test,y_pred,average = "weighted"),4))
print('Report:\n',classification_report(y_test, y_pred))
print('Confusion Matrix: \n',confusion_matrix(y_test, y_pred))

F1 Score : 0.9768
Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99      1153
           1       0.99      0.84      0.91       185

    accuracy                           0.98      1338
   macro avg       0.98      0.92      0.95      1338
weighted avg       0.98      0.98      0.98      1338

Confusion Matrix: 
 [[1152    1]
 [  29  156]]
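
The per-class figures in the report can be recomputed by hand from the confusion matrix, where rows are true labels and columns are predictions. A minimal sketch in plain Python (`per_class_metrics` is a hypothetical helper):

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
cm = [[1152, 1],
      [29, 156]]

def per_class_metrics(cm, cls):
    tp = cm[cls][cls]
    fn = sum(cm[cls]) - tp                  # true cls predicted as another class
    fp = sum(row[cls] for row in cm) - tp   # other classes predicted as cls
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
for cls, name in enumerate(["ham", "spam"]):
    p, r, f = per_class_metrics(cm, cls)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Overall accuracy is high mostly because ham dominates the data; the spam recall of 0.84 corresponds to the 29 spam messages in the second row that were predicted as ham.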


In [152]:
sol = rfc.predict(testdf)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done  50 out of  50 | elapsed:    0.0s finished


In [153]:
Ptest = pd.DataFrame()
Ptest['v1'] = sol

In [156]:
Ptest.head(4)

Unnamed: 0,v1
0,0
1,0
2,0
3,1


In [162]:
df2.loc[0]

v2    Prabha..i'm soryda..realy..frm heart i'm sory
Name: 0, dtype: object

In [163]:
df2.loc[1]

v2    Jus chillaxin, what up
Name: 1, dtype: object

In [164]:
df2.loc[2]

v2    Ok no prob. Take ur time.
Name: 2, dtype: object

In [165]:
df2.loc[3]

v2    Congrats! 2 mobile 3G Videophones R yours. cal...
Name: 3, dtype: object

In [170]:
Ptest[Ptest.v1 == 1].head(5)

Unnamed: 0,v1
3,1
14,1
51,1
67,1
73,1


### Checking which messages were marked as spam

In [167]:
df2.loc[14]

v2    Free 1st week entry 2 TEXTPOD 4 a chance 2 win...
Name: 14, dtype: object

In [168]:
df2.loc[51]

v2    UR awarded a City Break and could WIN a £200 ...
Name: 51, dtype: object

In [169]:
df2.loc[67]

v2    FREE UNLIMITED HARDCORE PORN direct 2 your mob...
Name: 67, dtype: object

### We can see that the 4th message (index 3) was flagged as spam, and reading it confirms that it is a spam message

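## We can conclude that the model performs well on the held-out split: 98% accuracy, with spam precision 0.99 and recall 0.84 (29 of 185 spam messages were missed).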

### Thank You