In [111]:
import pandas as pd
import numpy as np

In [112]:
df = pd.read_csv(r"D:\data_for_analysis\spam\spam.csv", encoding="latin-1")
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [113]:
df.tail()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
5571,ham,Rofl. Its true to its name,,,


The encoding parameter of read_csv() is set to "latin-1" to work around the error "'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte".
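The decode failure and the fallback can be reproduced on a handful of bytes. This is only a minimal sketch: the helper name decode_with_fallback and the sample bytes are illustrative assumptions, not part of the notebook.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the encodings could decode the input")

# Hypothetical sample: byte 0xe9 is 'é' in latin-1 but is not valid UTF-8 here.
sample = b"caf\xe9 prize"
text, used = decode_with_fallback(sample)
print(text, used)
```

Because latin-1 maps every possible byte to a character, it never raises. If the file was really written in another encoding, it can silently produce odd characters instead of an error.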

In [114]:
df.shape

(5572, 5)

In [115]:
df = df[["v1","v2"]]

In [116]:
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Dropped the other columns since they show nothing but NaN values in the rows inspected, probably an artifact of how the file was written or parsed.

In [117]:
df.columns = ["Classifier", "Key"]

In [118]:
df.head()

Unnamed: 0,Classifier,Key
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


If we are to build a model, we must convert our Classifier column into numerical values for the machine learning methods. This is achieved through label encoding.

In [119]:
df.isna().sum()


Classifier    0
Key           0
dtype: int64

In [120]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Classifier"] = le.fit_transform(df["Classifier"])
df.head()

Unnamed: 0,Classifier,Key
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [121]:
df.tail()

Unnamed: 0,Classifier,Key
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...
5571,0,Rofl. Its true to its name


ham = 0
spam = 1
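The mapping above follows from LabelEncoder numbering the distinct labels in sorted order, so "ham" comes before "spam". A minimal pure-Python sketch of that idea (fit_label_mapping is a hypothetical helper, not a scikit-learn function):

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

labels = ["ham", "ham", "spam", "ham"]
mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]

print(mapping)   # {'ham': 0, 'spam': 1}
print(encoded)   # [0, 0, 1, 0]
```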

Let's look at the count of each value.

In [122]:
df["Classifier"].value_counts()

0    4825
1     747
Name: Classifier, dtype: int64

There are only 747 spam messages (about 13% of the data), fewer than would be ideal for training, but we'll have to make do.
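The imbalance matters when reading accuracy figures later on. A quick piece of arithmetic on the counts above shows what a model that always predicts "ham" would score on the full data (the variable names are illustrative):

```python
# Class counts taken from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
spam_share = counts["spam"] / total
# Accuracy of a trivial model that labels every message as the majority class.
majority_baseline = max(counts.values()) / total

print(f"spam share: {spam_share:.3f}")
print(f"always-ham accuracy: {majority_baseline:.3f}")
```

Any accuracy reported for a real model is best judged against this baseline rather than against zero.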

Typical spam has randomly capitalized words, misspelled words (though this doesn't count for much, since text language itself often makes use of misspellings), email addresses, phone numbers, etc. We should clean the "Key" column so that all text is in the same form, i.e. a string of lowercase words that can be used further.

In [123]:
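import re
import nltk

# Requires the NLTK stop word corpus, e.g. via nltk.download('stopwords').
# Build the set once so every membership test below is a fast lookup.
stop_words = set(nltk.corpus.stopwords.words('english'))
porter = nltk.PorterStemmer()

def clean_text(messy_string):
    assert isinstance(messy_string, str)
    # Replace email addresses, URLs, currency symbols, phone numbers and
    # any remaining numbers with placeholder tokens.
    cleaned = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', messy_string)
    cleaned = re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr',
                     cleaned)
    cleaned = re.sub(r'£|\$', 'moneysymb', cleaned)
    cleaned = re.sub(
        r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
        'phonenumbr', cleaned)
    cleaned = re.sub(r'\d+(\.\d+)?', 'numbr', cleaned)
    # Turn punctuation into spaces, collapse whitespace, lowercase and trim.
    cleaned = re.sub(r'[^\w\d\s]', ' ', cleaned)
    cleaned = re.sub(r'\s+', ' ', cleaned)
    cleaned = re.sub(r'^\s+|\s+?$', '', cleaned.lower())
    # Drop stop words, then stem what remains.
    return ' '.join(
        porter.stem(term)
        for term in cleaned.split()
        if term not in stop_words
    )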

In [124]:
example = """  ***** CONGRATlations **** You won 2 tIckETs to Hamilton in 
NYC http://www.hamiltonbroadway.com/J?NaIOl/event   wORtH over $500.00...CALL 
555-477-8914 or send message to: hamilton@freetix.com to get ticket !! !  """

In [125]:
clean_text(example)

'congratl numbr ticket hamilton nyc httpaddr worth moneysymbnumbr call phonenumbr send messag emailaddr get ticket'

The above function relies heavily on regular expressions and Python's Natural Language Toolkit (nltk) to clean the text. The first order of business is to replace all email addresses, URLs, phone numbers, currency symbols and numbers with placeholder tokens. This normalization turns our text input into plain lowercase words for efficient classification, and it is done with one regex pattern per category.
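The order of the substitutions explains the fused token 'moneysymbnumbr' in the output above: the currency symbol is replaced first, then the digits right after it. A small sketch using the same two patterns; the padded variant is a hypothetical alternative, not what the notebook does:

```python
import re

text = "worth over $500.00"

# Same two steps as in clean_text: currency symbol first, then numbers.
step1 = re.sub(r'£|\$', 'moneysymb', text)
step2 = re.sub(r'\d+(\.\d+)?', 'numbr', step1)
print(step2)   # worth over moneysymbnumbr

# Hypothetical variant: pad the placeholder so it stays a separate token.
padded = re.sub(r'£|\$', ' moneysymb ', text)
padded = re.sub(r'\d+(\.\d+)?', 'numbr', padded)
padded = re.sub(r'\s+', ' ', padded).strip()
print(padded)  # worth over moneysymb numbr
```

Whether the fused or the separate form is preferable depends on whether "an amount of money" should count as one feature or two.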

Once this is accomplished we clear the sentence of stop words, which are words that don't contribute much meaning to the sentence, like "when", "is", "those", "had", etc. This is done using the nltk library, which already ships a list of English stop words.

Then a process called stemming is performed, which reduces words that share a root but carry different suffixes, such as "distribute", "distributing" or "distribution", to a common stem. The stem need not be a dictionary word: the Porter stemmer maps these three to "distribut". Again, this is already provided by the nltk library, so we don't have to write a stemmer from scratch.

In the function's return expression, the cleaned text is split into terms, stop words are filtered out, and the remaining terms are stemmed and joined back into a sentence. The stop words are converted from a list to a set to speed up the membership lookup.
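The filtering step can be illustrated without nltk. The sketch below uses a tiny, hypothetical stop word set (the real list comes from nltk and is much longer) and shows why a set is a natural fit: each term needs only a membership test.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = frozenset({"when", "is", "those", "had", "the", "to", "you"})

def remove_stop_words(text):
    """Lowercase the text and keep only terms that are not stop words."""
    return " ".join(
        term for term in text.lower().split()
        if term not in STOP_WORDS   # constant-time lookup on a set
    )

filtered = remove_stop_words("When is the prize sent to you")
print(filtered)  # prize sent
```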

Finally we test it on a sample spam-style message to see if it works, which it does.

In [126]:
df['Cleaned_Text'] = df['Key'].map(clean_text)

In [127]:
df.head()

Unnamed: 0,Classifier,Key,Cleaned_Text
0,0,"Go until jurong point, crazy.. Available only ...",go jurong point crazi avail bugi n great world...
1,0,Ok lar... Joking wif u oni...,ok lar joke wif u oni
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,free entri numbr wkli comp win fa cup final tk...
3,0,U dun say so early hor... U c already then say...,u dun say earli hor u c alreadi say
4,0,"Nah I don't think he goes to usf, he lives aro...",nah think goe usf live around though


Now let us see which words are most common in spam messages. Before that, we split our data frame into training and test sets.

In [128]:
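from sklearn.model_selection import train_test_split

# Spam rows only. Note: these are taken from the full data frame, so the word
# counts computed from them also include messages that land in the test set.
X_spam = df[df["Classifier"] == 1]

y = df["Classifier"].values.reshape(-1,1)
# Pass axis by keyword; the positional form is not accepted by newer pandas.
X = df.drop("Classifier", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)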

In [129]:
print(X_train.shape, X_test.shape)

(4457, 2) (1115, 2)


In [130]:
print(y_train.shape, y_test.shape)

(4457, 1) (1115, 1)


In [131]:
X_spam

Unnamed: 0,Classifier,Key,Cleaned_Text
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,free entri numbr wkli comp win fa cup final tk...
5,1,FreeMsg Hey there darling it's been 3 week's n...,freemsg hey darl numbr week word back like fun...
8,1,WINNER!! As a valued network customer you have...,winner valu network custom select receivea åmo...
9,1,Had your mobile 11 months or more? U R entitle...,mobil numbr month u r entitl updat latest colo...
11,1,"SIX chances to win CASH! From 100 to 20,000 po...",six chanc win cash numbr numbr numbr pound txt...
...,...,...,...
5537,1,Want explicit SEX in 30 secs? Ring 02073162414...,want explicit sex numbr sec ring phonenumbr co...
5540,1,ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...,ask numbrmobil numbr chatlin inclu free min in...
5547,1,Had your contract mobile 11 Mnths? Latest Moto...,contract mobil numbr mnth latest motorola noki...
5566,1,REMINDER FROM O2: To get 2.50 pounds free call...,remind onumbr get numbr pound free call credit...


In [132]:
from collections import Counter
common_spam_words = Counter(" ".join(X_spam["Cleaned_Text"]).split()).most_common(25)

In [133]:
common_spam_words

[('numbr', 1278),
 ('phonenumbr', 408),
 ('call', 374),
 ('åmoneysymbnumbr', 284),
 ('free', 223),
 ('txt', 173),
 ('u', 168),
 ('httpaddr', 167),
 ('numbrp', 148),
 ('text', 144),
 ('ur', 144),
 ('mobil', 139),
 ('stop', 121),
 ('claim', 115),
 ('repli', 112),
 ('prize', 94),
 ('get', 90),
 ('tone', 85),
 ('min', 79),
 ('cash', 76),
 ('servic', 73),
 ('send', 72),
 ('nokia', 70),
 ('week', 69),
 ('new', 69)]

The above is a list of the 25 most common words that appear in spam messages, with their counts. We will assign a score to each message based on the occurrence of its words in this list.
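The counting step is easier to follow on a toy input. A minimal sketch with three made-up cleaned messages (the message texts are assumptions for illustration):

```python
from collections import Counter

# Hypothetical cleaned spam messages.
spam_texts = ["free prize call", "call now free", "call"]

# Join everything into one string, split into words, and count them.
counts = Counter(" ".join(spam_texts).split())
top_two = counts.most_common(2)
print(top_two)  # [('call', 3), ('free', 2)]
```

most_common(n) returns (word, count) pairs sorted by descending count, which is the shape of the list shown above.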

In [134]:
# Turn the (word, count) pairs into a dict for direct lookup.
spam_word_counts = dict(common_spam_words)

def assign_score(clean_string):
    # Sum the spam count of every word that is in the common spam word list;
    # words outside the list contribute 0.
    return sum(spam_word_counts.get(word, 0) for word in clean_string.split())

In [135]:
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning.
X_train = X_train.copy()
X_train["Score"] = X_train["Cleaned_Text"].map(assign_score)


In [136]:
X_train

Unnamed: 0,Key,Cleaned_Text,Score
1114,No no:)this is kallis home ground.amla home to...,kalli home httpaddr home town durban,167
3589,I am in escape theatre now. . Going to watch K...,escap theatr go watch kavalan minut,0
3095,We walked from my moms. Right on stagwood pass...,walk mom right stagwood pass right winterston ...,0
1012,I dunno they close oredi not... ÌÏ v ma fan...,dunno close oredi ìï v fan,0
3320,Yo im right by yo work,yo im right yo work,0
...,...,...,...
4931,Match started.india &lt;#&gt; for 2,match httpaddr lt gt numbr,1445
3264,"44 7732584351, Do you want a New Nokia 3510i c...",numbr phonenumbr want new nokia numbri colour ...,6227
1653,I was at bugis juz now wat... But now i'm walk...,bugi juz wat walk home oredi ìï late repli oso...,112
2607,:-) yeah! Lol. Luckily i didn't have a starrin...,yeah lol luckili star role like,0


In [138]:
df.loc[3264]

Classifier                                                      1
Key             44 7732584351, Do you want a New Nokia 3510i c...
Cleaned_Text    numbr phonenumbr want new nokia numbri colour ...
Name: 3264, dtype: object

Here we've iterated through the Cleaned_Text column and assigned each message a score: the sum of the spam counts of every word in the message that appears in common_spam_words.
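Two of the scores in the table above can be checked by hand. This sketch uses only three entries of the word list (with the counts shown earlier); score_message is a hypothetical stand-in for the notebook's scoring function:

```python
# Three of the 25 entries from common_spam_words, with their counts.
spam_word_counts = {"numbr": 1278, "httpaddr": 167, "repli": 112}

def score_message(clean_string, word_counts):
    """Sum the spam count of each word; unknown words add 0."""
    return sum(word_counts.get(word, 0) for word in clean_string.split())

score_1114 = score_message("kalli home httpaddr home town durban", spam_word_counts)
score_4931 = score_message("match httpaddr lt gt numbr", spam_word_counts)
print(score_1114)  # 167
print(score_4931)  # 1445
```

Both of these rows are shown as ham-style messages, which illustrates a weakness of the score: a single placeholder token such as 'numbr' is enough to push it very high.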

In [140]:
# Reassign instead of dropping in place, which avoids SettingWithCopyWarning.
X_train = X_train.drop("Key", axis=1)


In [141]:
X_train.head()

Unnamed: 0,Cleaned_Text,Score
1114,kalli home httpaddr home town durban,167
3589,escap theatr go watch kavalan minut,0
3095,walk mom right stagwood pass right winterston ...,0
1012,dunno close oredi ìï v fan,0
3320,yo im right yo work,0


In [142]:
# Reassign instead of dropping in place, which avoids SettingWithCopyWarning.
X_test = X_test.drop("Key", axis=1)


In [143]:
X_test.head()

Unnamed: 0,Cleaned_Text
4456,aight plan come later tonight
690,farm open
944,sent score sopha secondari applic school think...
3768,grnumbr see messag r u leav congrat dear schoo...
1189,case guess see campu lodg


In [144]:
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning.
X_test = X_test.copy()
X_test["Score"] = X_test["Cleaned_Text"].map(assign_score)


In [145]:
X_test.head()

Unnamed: 0,Cleaned_Text,Score
4456,aight plan come later tonight,0
690,farm open,0
944,sent score sopha secondari applic school think...,0
3768,grnumbr see messag r u leav congrat dear schoo...,312
1189,case guess see campu lodg,0


We've dropped the "Key" column since it was redundant and assigned scores to the test set as well. Now we build the model.

In [148]:
from sklearn.linear_model import LogisticRegression
model_1 = LogisticRegression()
score_train = X_train["Score"].values.reshape(-1,1)
# fit() expects a 1-D target; ravel() avoids the column-vector warning.
model_1.fit(score_train, y_train.ravel())


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [150]:
model_1.score(score_train, y_train)

0.9178819833969037

A training accuracy of about 0.918 looks reasonable, though with roughly 87% of all messages being ham, accuracy alone flatters the model. Let's try it out on the test set.

In [151]:
score_test = X_test["Score"].values.reshape(-1,1)
y_pred_1 = model_1.predict(score_test)

In [152]:
from sklearn.metrics import confusion_matrix
# Use a separate name so the imported function is not overwritten.
cm_1 = confusion_matrix(y_test, y_pred_1)
cm_1

array([[933,  16],
       [ 93,  73]], dtype=int64)

In [153]:
model_1.score(score_test, y_test)

0.9022421524663677

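So, using logistic regression, we've built a model that is about 90% accurate in its spam classification. In the confusion matrix, rows are the true labels and columns are the predicted labels, so the model classified 16 normal messages as spam (false positives) and 93 spam messages as normal (false negatives). It catches only 73 of the 166 spam messages in the test set.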
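Reading the matrix with scikit-learn's convention (rows are true labels, columns are predicted labels, class 0 = ham, class 1 = spam), the usual metrics can be worked out by hand. A short sketch of that arithmetic, with illustrative variable names:

```python
# Confusion matrix of the base model on the test set, copied from above.
cm = [[933, 16],
      [93, 73]]

(tn, fp), (fn, tp) = cm   # row 0: true ham, row 1: true spam

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of messages flagged as spam, how many are spam
recall = tp / (tp + fn)      # of actual spam, how many were flagged

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

Recall is well below accuracy here, which is what the class imbalance would lead one to expect: most of the correct predictions are ham.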

In [155]:
param_grid = [    
    {'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
    'C' : np.logspace(-4, 4, 20),
    'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
    'max_iter' : [100, 1000,2500, 5000]
    }
]

In [156]:
from sklearn.model_selection import GridSearchCV

In [157]:
clf = GridSearchCV(model_1, param_grid = param_grid, cv = 3, verbose=True, n_jobs=-1)

In [158]:
# fit() expects a 1-D target; ravel() avoids the column-vector warning.
best_clf = clf.fit(score_train, y_train.ravel())

Fitting 3 folds for each of 1600 candidates, totalling 4800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done 2832 tasks      | elapsed:   14.3s
[Parallel(n_jobs=-1)]: Done 4800 out of 4800 | elapsed:   20.0s finished


In [159]:
best_clf.best_estimator_

LogisticRegression(C=0.0006951927961775605, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

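Here we perform hyperparameter tuning to find the parameter values that give our model the best cross-validated accuracy. Not every penalty and solver pairing in the grid is supported (lbfgs, for example, only handles 'l2' or no penalty), so a share of the 1600 candidates cannot be fitted and should not compete for the best score. Now that we have the best values for our model, we should test its accuracy.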
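One way to avoid spending fits on unsupported pairings is to split the grid into blocks that only contain compatible options; GridSearchCV accepts such a list of dicts as param_grid. The sketch below is an assumption-laden illustration rather than the grid used above: it covers only 'l1' and 'l2', uses a shorter hypothetical list of C values, and just counts the candidates instead of fitting anything.

```python
# Hypothetical, shorter C range than the notebook's np.logspace(-4, 4, 20).
C_values = [10.0 ** exponent for exponent in range(-4, 5)]

compatible_grid = [
    # liblinear and saga both support l1 and l2 penalties.
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": C_values},
    # lbfgs, newton-cg and sag support l2 but not l1.
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2"], "C": C_values},
]

def count_candidates(grid):
    """Number of parameter combinations a list-of-dicts grid expands to."""
    total = 0
    for block in grid:
        combos = 1
        for values in block.values():
            combos *= len(values)
        total += combos
    return total

n_candidates = count_candidates(compatible_grid)
print(n_candidates)  # 63
```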

In [163]:
best_clf.score(score_train, y_train)

0.9194525465559794

The difference in training accuracy between the models before and after hyperparameter tuning is about 0.0016 (0.9195 versus 0.9179), which isn't a lot.

In [166]:
y_pred_2 = best_clf.predict(score_test)

In [168]:
from sklearn.metrics import confusion_matrix
# Use a separate name so the imported function is not overwritten.
cm_2 = confusion_matrix(y_test, y_pred_2)
cm_2

array([[928,  21],
       [ 76,  90]], dtype=int64)

The confusion matrix above shows where the change in accuracy comes from. The number of false negatives has dropped noticeably: the tuned model now flags 17 spam messages that were previously classified as normal, bringing the false negatives down from 93 to 76. On the flip side, false positives have increased from 16 to 21, with the model now classifying 5 normal messages as spam that were correctly classified before.

In [169]:
best_clf.score(score_test, y_test)

0.9130044843049328

With hyperparameter tuning we managed to raise the test accuracy of our base model by about one percentage point (from 0.902 to 0.913), which isn't an earth-shattering change, but still a favourable one.
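The size of that gain can be recovered directly from the two confusion matrices, since every off-diagonal entry is one misclassified test message. A small sketch of the arithmetic (names are illustrative):

```python
# Test-set confusion matrices copied from the outputs above.
before = [[933, 16], [93, 73]]   # base model
after = [[928, 21], [76, 90]]    # tuned model

def errors(cm):
    """Misclassified messages: false positives plus false negatives."""
    return cm[0][1] + cm[1][0]

n_test = sum(sum(row) for row in before)
gain = errors(before) - errors(after)   # 109 - 97
accuracy_gain = gain / n_test

print(gain, n_test)              # 12 1115
print(f"{accuracy_gain:.4f}")    # 0.0108
```

Twelve more correct predictions out of 1115 is small in absolute terms, but almost all of it comes from spam that the base model missed.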