> Igor Sorochan

## Classification of text messages (simple spam filter)

A baseline logistic regression model that classifies text messages ([dataset](https://github.com/obulygin/pyda_homeworks/blob/master/stat_case_study/spam.csv)) as spam or ham, with spam as the target feature.

Algorithm:
1. Convert all text to lower case.
1. Remove non-word characters and digits.
1. Remove stopwords.
1. Lemmatize the remaining words.
1. Convert messages to TF-IDF vectors.
1. Fit a logistic regression (LR) model.
1. Evaluate the confusion matrix.
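The first three steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact code: the `tokenize` helper and the tiny `STOPWORDS` set are hypothetical stand-ins (the notebook uses NLTK's full English stopword list), and lemmatization is left to NLTK's `WordNetLemmatizer`.

```python
import re

# Hypothetical mini stopword list, for illustration only;
# the notebook uses NLTK's full English list instead.
STOPWORDS = {"in", "a", "to", "the", "is"}

def tokenize(message, stopwords=STOPWORDS):
    """Lower-case, replace runs of non-word characters and digits
    with a space, split on whitespace, then drop stopwords."""
    cleaned = re.sub(r"[\W\d]+", " ", message.lower())
    return [word for word in cleaned.split() if word not in stopwords]

tokens = tokenize("Free entry in 2 a wkly comp to win FA Cup")
print(tokens)
# ['free', 'entry', 'wkly', 'comp', 'win', 'fa', 'cup']
```

Note that digits are removed together with punctuation, so phone numbers and prices, which often appear in spam, leave no trace in the token list.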

### `Prepare`

In [3]:
# Additional dependencies
import numpy as np
import pandas as pd
import re
# https://www.machinelearningplus.com/nlp/gensim-tutorial/
#  gensim library effectively manages corpora texts
from gensim import corpora
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

In [4]:
# df_raw = pd.read_csv('https://github.com/obulygin/pyda_homeworks/blob/master/stat_case_study/spam.csv')
df_raw = pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/spam.csv')
# df_raw = pd.read_csv('Notebooks/spam.csv')
df_raw

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [5]:
import nltk
nltk.download('stopwords')
stopwords_set = set(stopwords.words('english')) # stopwords_set

from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /Users/velo1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/velo1/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/velo1/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### `Process`

In [6]:
df_raw.duplicated().sum(), df_raw.isna().sum()

(415,
 Category    0
 Message     0
 dtype: int64)

In [7]:
df_raw[df_raw.duplicated()]

Unnamed: 0,Category,Message
103,ham,As per your request 'Melle Melle (Oru Minnamin...
154,ham,As per your request 'Melle Melle (Oru Minnamin...
207,ham,"As I entered my cabin my PA said, '' Happy B'd..."
223,ham,"Sorry, I'll call later"
326,ham,No calls..messages..missed calls
...,...,...
5524,spam,You are awarded a SiPix Digital Camera! call 0...
5535,ham,"I know you are thinkin malaria. But relax, chi..."
5539,ham,Just sleeping..and surfing
5553,ham,Hahaha..use your brain dear


In [8]:
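df = df_raw.drop_duplicates().copy()
# Assign the result back instead of calling replace(..., inplace=True) on a column,
# which may not modify df when pandas copy-on-write is enabled
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})
df = df.rename(columns={'Category': 'is_spam'})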
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5157 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   is_spam  5157 non-null   int64 
 1   Message  5157 non-null   object
dtypes: int64(1), object(1)
memory usage: 120.9+ KB


### `Analyze`
#### `Tokenization`

In [9]:
# lower -> replace non-word characters and digits with a space -> split -> remove stopwords
df['words'] = df.Message.apply(lambda x:
 [word for word in re.sub(r'[\W\d]+', ' ', x.lower()).split() if word not in stopwords_set] )

df

Unnamed: 0,is_spam,Message,words
0,0,"Go until jurong point, crazy.. Available only ...","[go, jurong, point, crazy, available, bugis, n..."
1,0,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,0,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,0,"Nah I don't think he goes to usf, he lives aro...","[nah, think, goes, usf, lives, around, though]"
...,...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...,"[nd, time, tried, contact, u, u, pound, prize,..."
5568,0,Will ü b going to esplanade fr home?,"[ü, b, going, esplanade, fr, home]"
5569,0,"Pity, * was in mood for that. So...any other s...","[pity, mood, suggestions]"
5570,0,The guy did some bitching but I acted like i'd...,"[guy, bitching, acted, like, interested, buyin..."


In [10]:
wordnet_lemmatizer = WordNetLemmatizer()

# lemmatize each word, then join back into one string to feed the TF-IDF vectorizer
df['lemm_str'] = df.words.apply(lambda x: 
    ' '.join([wordnet_lemmatizer.lemmatize(word) for word in x]) )
df

Unnamed: 0,is_spam,Message,words,lemm_str
0,0,"Go until jurong point, crazy.. Available only ...","[go, jurong, point, crazy, available, bugis, n...",go jurong point crazy available bugis n great ...
1,0,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]",ok lar joking wif u oni
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, wkly, comp, win, fa, cup, final,...",free entry wkly comp win fa cup final tkts st ...
3,0,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]",u dun say early hor u c already say
4,0,"Nah I don't think he goes to usf, he lives aro...","[nah, think, goes, usf, lives, around, though]",nah think go usf life around though
...,...,...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...,"[nd, time, tried, contact, u, u, pound, prize,...",nd time tried contact u u pound prize claim ea...
5568,0,Will ü b going to esplanade fr home?,"[ü, b, going, esplanade, fr, home]",ü b going esplanade fr home
5569,0,"Pity, * was in mood for that. So...any other s...","[pity, mood, suggestions]",pity mood suggestion
5570,0,The guy did some bitching but I acted like i'd...,"[guy, bitching, acted, like, interested, buyin...",guy bitching acted like interested buying some...


In [11]:
# dictionary = corpora.Dictionary([['love']])
# dictionary

`tf–idf` means term-frequency `times` inverse document-frequency
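To make the "times" concrete, here is a small pure-Python sketch of the weighting. It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which to my understanding matches the `TfidfVectorizer` defaults; the helper `tfidf_vectors` and the toy documents are hypothetical and only illustrate the arithmetic.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed to mirror sklearn's default).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        raw = {term: count * idf[term] for term, count in term_freq.items()}
        # L2-normalize so every non-empty document has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors

docs = ["free entry win cup", "free call", "call later"]  # toy corpus
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

In the first toy document, `entry` occurs in only one document while `free` occurs in two, so `entry` receives the larger weight: rare terms count for more than common ones.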

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(lowercase = False, ngram_range= (1,1))
# ngram_range= (1,1)  FN 34 FP 6
# ngram_range= (1,2)  FN 51 FP 3
# ngram_range= (1,3)  FN 69 FP 3
# ngram_range= (2,2)  FN 109 FP 0

# Learn vocabulary and idf, return document-term matrix.
tfidf_matrix = tfidf.fit_transform(df.lemm_str)

# Get output feature names for transformation.
names = tfidf.get_feature_names_out()


tfidf_matrix = pd.DataFrame(tfidf_matrix.toarray(), columns= names)
tfidf_matrix.shape

(5157, 7101)

In [13]:
# tfidf_matrix['upper_ind'] = df.Message.apply(lambda x: sum(True for c in x if c.isupper()) / len (x)) # uppercase index
# tfidf_matrix['nonword_ind'] = df.apply(lambda row: len(row.lemm_str)/ len (row.Message), axis= 1)  # nonword index
# tfidf_matrix['len']= df.Message.apply(lambda x: len(x)) # empirical criterion approximates the 'Euclidean distance'
tfidf_matrix

Unnamed: 0,____,aa,aah,aaniye,aaooooright,aathi,ab,abbey,abdomen,abeg,...,zf,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5153,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5154,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5155,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# tfidf_matrix[tfidf_matrix.upper_ind.isna()]

In [15]:
tfidf_matrix.replace({np.nan:0},inplace= True)

In [16]:
# index_ = tfidf_matrix[tfidf_matrix.upper_ind.isna()== False].index

In [17]:
# tfidf_matrix.iloc[index_]
# df.reset_index().iloc[index_]

In [18]:
X = tfidf_matrix
y = df['is_spam']
X.shape[0] == y.shape[0]

True

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y ,test_size=0.3, random_state= 42)

#### `LR`

#### `process pipeline`

In [20]:
# %%time
# parameters = {
#             # 'scaler': [StandardScaler()],
# 	            'logit__tol': [1e-2],# 1e-3],           #Tolerance for stopping criteria.
#               'logit__C': [ 7 ], # 5 ,  10],            # Regularization parameter.
#               'logit__max_iter': [1000], # 5000, 10000],  
#               # 'logit__multi_class': ['ovr','multinomial'],
#               'logit__n_jobs':[-1]
#                 }
# pipe = Pipeline([
#                   # ('scaler', StandardScaler()),
#                  ('logit', LogisticRegression())])
                                 
# grid = GridSearchCV(pipe, parameters, cv=10).fit(X_train, y_train)

# logit_optim_score = grid.score(X_test, y_test)
# print(f'Training set best score: {grid.score(X_train, y_train):2.2f}')
# print(f'Test set best score: {logit_optim_score:2.3f}')

# Access the best set of parameters
# best_params = grid.best_params_
# print('\nOptimal set of parameters:', best_params)

# Stores the optimum model in best_pipe
# best_pipe = grid.best_estimator_
# print('\nOptimal pipeline:', best_pipe)

#### `LR using pipeline best params`

In [21]:
%%time
lr = LogisticRegression(tol= 1e-2, C= 7, max_iter= 1000, n_jobs= -1)
lr.fit(X_train, y_train)

CPU times: user 169 ms, sys: 260 ms, total: 429 ms
Wall time: 25.1 s


LogisticRegression(C=7, max_iter=1000, n_jobs=-1, tol=0.01)

In [22]:
# [x for x in dir(lr) if '_' not in x]

In [23]:
# ?f1_score

#### `results`

In [24]:
# Return the mean accuracy on the given test data and labels.

print(f'Model accuracy: {lr.score(X_test, y_test):.3%}')

Model accuracy: 97.416%


In [25]:
# Compute the F1 score, also known as balanced F-score or F-measure.

# The F1 score can be interpreted as a harmonic mean of the precision and
# recall, where an F1 score reaches its best value at 1 and worst score at 0.
# The relative contribution of precision and recall to the F1 score are
# equal. 
print(f'Model f1-score: {f1_score(y_test, lr.predict(X_test)) :.3%}')

Model f1-score: 89.418%


In [26]:
def conf_matrix(actual, predicted):
    """Label one prediction as TN, FP, TP or FN (spam = 1 is the positive class)."""
    if actual == 0:
        return 'TN' if predicted == 0 else 'FP'
    else:
        return 'TP' if predicted == 1 else 'FN'

In [27]:
df['Predicted'] = lr.predict(tfidf_matrix)
df['conf_matrix'] = df.apply(lambda row: conf_matrix(row['is_spam'], row['Predicted']), axis= 1)

In [28]:
# MISTAKES (FN + FP)
pd.set_option('max_colwidth', 200)
df[df.is_spam != df.Predicted][['Message', 'lemm_str','is_spam','Predicted', 'conf_matrix']].head()

Unnamed: 0,Message,lemm_str,is_spam,Predicted,conf_matrix
45,No calls..messages..missed calls,call message missed call,0,1,FP
68,"Did you hear about the new ""Divorce Barbie""? It comes with all of Ken's stuff!",hear new divorce barbie come ken stuff,1,0,FN
84,Yup next stop.,yup next stop,0,1,FP
95,"Your free ringtone is waiting to be collected. Simply text the password ""MIX"" to 85069 to verify. Get Usher and Britney. FML, PO Box 5249, MK17 92H. 450Ppw 16",free ringtone waiting collected simply text password mix verify get usher britney fml po box mk h ppw,1,0,FN
333,Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 85. No prepayment. Direct access!,call germany penny per minute call fixed line via access number prepayment direct access,1,0,FN


### `Share`


<table class="wikitable" style="border:none; background:transparent; text-align:center;" align="center">
<tbody><tr>
<td rowspan="2" style="border:none;">
</td>
<td style="border:none;">
</td>
<td colspan="2" style="background:#bbeeee;"><b>Predicted condition</b>
</td></tr>
<tr>
<td style="background:#eeeeee;"><a href="https://en.wikipedia.org/wiki/Statistical_population" title="Statistical population">Total population</a> <br><span style="white-space:nowrap;">= P + N</span>
</td>
<td style="background:#ccffff;"><b>Positive (PP)</b>
</td>
<td style="background:#aadddd;"><b>Negative (PN)</b>
</td></tr>
<tr>
<td rowspan="2" style="vertical-align:middle;background:#eeeebb;"><div style="writing-mode:vertical-rl;transform:rotate(180deg);display:inline-block;text-align:center;"><b>Actual condition</b></div>
</td>
<td style="background:#ffffcc;"><b>Positive (P)</b>
</td>
<td style="background:#ccffcc;"><b><a href="https://en.wikipedia.org/wiki/True_positive" title="True positive">True positive</a> (TP)</b>
</td>
<td style="background:#ffdddd;"><b><a href="https://en.wikipedia.org/wiki/False_negative" title="False negative">False negative</a> (FN)</b>
</td></tr>
<tr>
<td style="background:#ddddaa;"><b>Negative (N)</b>
</td>
<td style="background:#ffcccc;"><b><a href="https://en.wikipedia.org/wiki/False_positive" title="False positive">False positive</a> (FP)</b>
</td>
<td style="background:#bbeebb;"><b><a href="https://en.wikipedia.org/wiki/True_negative" title="True negative">True negative</a> (TN)</b>
</td></tr></tbody></table>

#### `Model overall (train+test) metrics`

In [29]:
df_cm = df[['is_spam', 'conf_matrix']]

In [30]:
TP = (df_cm['conf_matrix'] == 'TP').sum()
TN = (df_cm['conf_matrix'] == 'TN').sum()
FP = (df_cm['conf_matrix'] == 'FP').sum()
FN = (df_cm['conf_matrix'] == 'FN').sum()
P = (df_cm['is_spam'] == 1).sum()
N = (df_cm['is_spam'] == 0).sum()

df_cm.groupby('conf_matrix').count().style.bar(align='left', color='lightgreen')

conf_matrix,is_spam
FN,48
FP,7
TN,4509
TP,593


In [31]:
print(f'Prevalence {" "*12}= {P /(P+N):0.3%}'.replace('%', ' %'))
# TPR is recall (sensitivity); precision would be TP/(TP+FP)
print(f'TPR (sensitivity, recall) = {TP/P:0.3%}'.replace('%', ' %'))
print(f'TNR (specificity){" "*6}= {TN/N:0.3%}'.replace('%', ' %'))
print(f'FPR (false alarm){" "*6}= {FP/N:0.3%}'.replace('%', ' %'))
print(f'FNR (miss rate){" "*8}= {FN/P:0.3%}'.replace('%', ' %'))
print(f'Accuracy {" "*14}= {(TP+TN)/(P+N):0.3%}'.replace('%', ' %'))
# F1 = 2*TP/(2*TP+FP+FN)
print(f'F1 score {" "*14}= {2*TP/(2*TP+FP+FN):0.3%}'.replace('%', ' %'))

Prevalence             = 12.430 %
TPR (sensitivity, recall) = 92.512 %
TNR (specificity)      = 99.845 %
FPR (false alarm)      = 0.155 %
FNR (miss rate)        = 7.488 %
Accuracy               = 98.933 %
F1 score               = 95.568 %


#### `Model metrics on test data only`

In [32]:
# X was built with a fresh RangeIndex, so X_test.index holds row positions, hence iloc
df_cm = df_cm.iloc[X_test.index]

In [33]:
TP = (df_cm['conf_matrix'] == 'TP').sum()
TN = (df_cm['conf_matrix'] == 'TN').sum()
FP = (df_cm['conf_matrix'] == 'FP').sum()
FN = (df_cm['conf_matrix'] == 'FN').sum()
P = (df_cm['is_spam'] == 1).sum()
N = (df_cm['is_spam'] == 0).sum()

df_cm.groupby('conf_matrix').count().style.bar(align='left', color='lightgreen')

conf_matrix,is_spam
FN,34
FP,6
TN,1339
TP,169


In [34]:
print(f'Prevalence {" "*12}= {P /(P+N):0.3%}'.replace('%', ' %'))
# TPR is recall (sensitivity); precision would be TP/(TP+FP)
print(f'TPR (sensitivity, recall) = {TP/P:0.3%}'.replace('%', ' %'))
print(f'TNR (specificity){" "*6}= {TN/N:0.3%}'.replace('%', ' %'))
print(f'FPR (false alarm){" "*6}= {FP/N:0.3%}'.replace('%', ' %'))
print(f'FNR (miss rate){" "*8}= {FN/P:0.3%}'.replace('%', ' %'))
print(f'Accuracy {" "*14}= {(TP+TN)/(P+N):0.3%}'.replace('%', ' %'))
# F1 = 2*TP/(2*TP+FP+FN)
print(f'F1 score {" "*14}= {2*TP/(2*TP+FP+FN):0.3%}'.replace('%', ' %'))

Prevalence             = 13.114 %
TPR (sensitivity, recall) = 83.251 %
TNR (specificity)      = 99.554 %
FPR (false alarm)      = 0.446 %
FNR (miss rate)        = 16.749 %
Accuracy               = 97.416 %
F1 score               = 89.418 %
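The rates above follow directly from the four confusion-matrix counts. As a cross-check, the sketch below recomputes them from the test-set counts reported in this notebook (TP = 169, TN = 1339, FP = 6, FN = 34) and also derives precision, which is a different quantity from TPR: precision divides by predicted positives, TPR by actual positives. The helper name `rates` is hypothetical.

```python
def rates(tp, tn, fp, fn):
    """Derive the usual binary-classification rates from the four counts."""
    p, n = tp + fn, tn + fp          # actual positives / actual negatives
    precision = tp / (tp + fp)       # share of predicted spam that is spam
    recall = tp / p                  # share of actual spam that is caught
    return {
        "recall (TPR)": recall,
        "specificity (TNR)": tn / n,
        "false alarm (FPR)": fp / n,
        "miss rate (FNR)": fn / p,
        "precision (PPV)": precision,
        "accuracy": (tp + tn) / (p + n),
        # Harmonic mean of precision and recall; equals 2*TP / (2*TP + FP + FN)
        "F1": 2 * precision * recall / (precision + recall),
    }

test_metrics = rates(tp=169, tn=1339, fp=6, fn=34)
for name, value in test_metrics.items():
    print(f"{name:<18} = {value:.3%}")
```

With these counts precision comes out at 169/175, about 96.6 %, noticeably higher than the 83.3 % recall: the model rarely raises a false alarm but misses roughly one spam message in six.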


In [35]:
# WE HAVE FALSE ALARM ON THESE MESSAGES
df[df['conf_matrix'] == 'FP']

Unnamed: 0,is_spam,Message,words,lemm_str,Predicted,conf_matrix
45,0,No calls..messages..missed calls,"[calls, messages, missed, calls]",call message missed call,1,FP
84,0,Yup next stop.,"[yup, next, stop]",yup next stop,1,FP
495,0,Are you free now?can i call now?,"[free, call]",free call,1,FP
3364,0,Can... I'm free...,[free],free,1,FP
4419,0,"When you get free, call me","[get, free, call]",get free call,1,FP
4729,0,I (Career Tel) have added u as a contact on INDYAROCKS.COM to send FREE SMS. To remove from phonebook - sms NO to &lt;#&gt;,"[career, tel, added, u, contact, indyarocks, com, send, free, sms, remove, phonebook, sms, lt, gt]",career tel added u contact indyarocks com send free sm remove phonebook sm lt gt,1,FP
5157,0,K k:) sms chat with me.,"[k, k, sms, chat]",k k sm chat,1,FP


In [36]:
# WE MISS THESE spam MESSAGES
df[df['conf_matrix'] == 'FN'].head()

Unnamed: 0,is_spam,Message,words,lemm_str,Predicted,conf_matrix
68,1,"Did you hear about the new ""Divorce Barbie""? It comes with all of Ken's stuff!","[hear, new, divorce, barbie, comes, ken, stuff]",hear new divorce barbie come ken stuff,0,FN
95,1,"Your free ringtone is waiting to be collected. Simply text the password ""MIX"" to 85069 to verify. Get Usher and Britney. FML, PO Box 5249, MK17 92H. 450Ppw 16","[free, ringtone, waiting, collected, simply, text, password, mix, verify, get, usher, britney, fml, po, box, mk, h, ppw]",free ringtone waiting collected simply text password mix verify get usher britney fml po box mk h ppw,0,FN
333,1,Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 85. No prepayment. Direct access!,"[call, germany, pence, per, minute, call, fixed, line, via, access, number, prepayment, direct, access]",call germany penny per minute call fixed line via access number prepayment direct access,0,FN
607,1,XCLUSIVE@CLUBSAISAI 2MOROW 28/5 SOIREE SPECIALE ZOUK WITH NICHOLS FROM PARIS.FREE ROSES 2 ALL LADIES !!! info: 07946746291/07880867867,"[xclusive, clubsaisai, morow, soiree, speciale, zouk, nichols, paris, free, roses, ladies, info]",xclusive clubsaisai morow soiree speciale zouk nichols paris free rose lady info,0,FN
751,1,"Do you realize that in about 40 years, we'll have thousands of old ladies running around with tattoos?","[realize, years, thousands, old, ladies, running, around, tattoos]",realize year thousand old lady running around tattoo,0,FN


``Main conclusions:``
* `to maximize TNR (specificity, i.e. minimize false alarms) use n-grams (2,2)`
* `to maximize TPR (sensitivity, i.e. maximize spam detection) use n-grams (1,1)`
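One possible reading of this trade-off, offered as a hedged illustration rather than a verified explanation: with bigrams only, a short message yields few or no features, so a single trigger word such as `free` can no longer push a ham message over the threshold, but isolated spam words stop helping as well. The hypothetical `ngrams` helper below mimics how an `ngram_range` of `(n_min, n_max)` expands a token list.

```python
def ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    features = []
    for n in range(n_min, n_max + 1):
        # range() is empty when the message has fewer than n tokens.
        features += [" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1)]
    return features

print(ngrams(["get", "free", "call"], 1, 1))  # ['get', 'free', 'call']
print(ngrams(["get", "free", "call"], 2, 2))  # ['get free', 'free call']
print(ngrams(["free"], 2, 2))                 # []
```

The last call corresponds to the false positive "Can... I'm free...", whose lemmatized form is the single token `free`: under `(2,2)` it produces no features at all.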

[Feature Extraction and Logistic Regression](https://medium.com/@annabiancajones/sentiment-analysis-on-reviews-feature-extraction-and-logistic-regression-43a29635cc81)