> Igor Sorochan
# "Statistical tests practice"
## Task 1

We use the [videogames](https://github.com/obulygin/pyda_homeworks/blob/master/stat_case_study/vgsales.csv) dataset.

Questions to ask:

1. Do critics like sports games?
1. Which video platforms do critics prefer (PC or PS4)?
1. Do critics prefer shooters or strategy games?

### Prepare

In [3]:
#Dependencies
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

import scipy.stats as stats

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

In [4]:
# uncomment to load dataset:
# df_raw = pd.read_csv('https://raw.githubusercontent.com/obulygin/pyda_homeworks/master/stat_case_study/vgsales.csv')

# local source
df_raw = pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/vgsales.csv') 

df_raw

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16714,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,Tecmo Koei,0.00,0.00,0.01,0.00,0.01,,,,,,
16715,LMA Manager 2007,X360,2006.0,Sports,Codemasters,0.00,0.01,0.00,0.00,0.01,,,,,,
16716,Haitaka no Psychedelica,PSV,2016.0,Adventure,Idea Factory,0.00,0.00,0.01,0.00,0.01,,,,,,
16717,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,0.00,0.00,0.00,0.01,,,,,,


### Process

In [5]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


In [6]:
# checking for duplicates and NaNs
df_raw.duplicated().sum(), df_raw.isna().sum()

(0,
 Name                  2
 Platform              0
 Year_of_Release     269
 Genre                 2
 Publisher            54
 NA_Sales              0
 EU_Sales              0
 JP_Sales              0
 Other_Sales           0
 Global_Sales          0
 Critic_Score       8582
 Critic_Count       8582
 User_Score         6704
 User_Count         9129
 Developer          6623
 Rating             6769
 dtype: int64)

In [7]:
# drop observations with a missing Genre or Critic_Score: both are essential for the analysis
df = df_raw.dropna(subset=['Critic_Score', 'Genre'])

df

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8,322.0,Nintendo,E
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8,192.0,Nintendo,E
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo,11.28,9.14,6.50,2.88,29.80,89.0,65.0,8.5,431.0,Nintendo,E
7,Wii Play,Wii,2006.0,Misc,Nintendo,13.96,9.18,2.93,2.84,28.92,58.0,41.0,6.6,129.0,Nintendo,E
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16700,Breach,PC,2011.0,Shooter,Destineer,0.01,0.00,0.00,0.00,0.01,61.0,12.0,5.8,43.0,Atomic Games,T
16701,Bust-A-Move 3000,GC,2003.0,Puzzle,Ubisoft,0.01,0.00,0.00,0.00,0.01,53.0,4.0,tbd,,Taito Corporation,E
16702,Mega Brain Boost,DS,2008.0,Puzzle,Majesco Entertainment,0.01,0.00,0.00,0.00,0.01,48.0,10.0,tbd,,Interchannel-Holon,E
16706,STORM: Frontline Nation,PC,2011.0,Strategy,Unknown,0.00,0.01,0.00,0.00,0.01,60.0,12.0,7.2,13.0,SimBin,E10+


In [8]:
# keep only the relevant columns
df = df[['Genre','Critic_Score', 'Platform']]
df.duplicated().sum(), df.isna().sum()

(4048,
 Genre           0
 Critic_Score    0
 Platform        0
 dtype: int64)

In [9]:
df.shape

(8137, 3)

In [10]:
df.Genre.unique(), df.Genre.nunique()

(array(['Sports', 'Racing', 'Platform', 'Misc', 'Action', 'Puzzle',
        'Shooter', 'Fighting', 'Simulation', 'Role-Playing', 'Adventure',
        'Strategy'], dtype=object),
 12)

In [11]:
df.Platform.unique(), df.Platform.nunique()

(array(['Wii', 'DS', 'X360', 'PS3', 'PS2', '3DS', 'PS4', 'PS', 'XB', 'PC',
        'PSP', 'WiiU', 'GC', 'GBA', 'XOne', 'PSV', 'DC'], dtype=object),
 17)

So we have 12 genres and 17 video game platforms.

### Analyze

#### Do critics like sports games?

In [12]:
df.groupby('Genre').mean(numeric_only = True).sort_values(by='Critic_Score',ascending= False).style.bar(align='left',color='yellow')

Genre,Critic_Score
Role-Playing,72.652646
Strategy,72.086093
Sports,71.968174
Shooter,70.181144
Fighting,69.217604
Simulation,68.619318
Platform,68.05835
Racing,67.963612
Puzzle,67.424107
Action,66.629101


In [13]:
fig = px.box(df, x='Genre', y='Critic_Score', notched=True, color='Genre')
fig.show()

#### Test scores medians

In [14]:
df_nonsports= df[df.Genre != 'Sports'].Critic_Score
df_sports= df[df.Genre == 'Sports'].Critic_Score

print(f'Median score of Sports games: {df_sports.median()}')
print(f'Median score of Other  games: {df_nonsports.median()}')

Median score of Sports games: 75.0
Median score of Other  games: 70.0


In [15]:
# visualisation of appropriate scores
fig = go.Figure()
fig.add_trace(go.Box(x=df_nonsports, notched= True, name= 'Other',marker_color='green'))
fig.add_trace(go.Box(x=df_sports, notched= True, name= 'Sports',marker_color='yellow'))
fig.update_layout(title="Sports and Other Genres Critic's Scores", xaxis_title="Critic's Scores")
fig.show()

Notches display a confidence interval around the median.  
The interval is computed as  
$median \pm 1.57 \cdot \frac{IQR}{\sqrt{N}}$, where  
* IQR is the interquartile range  
* and N is the sample size.  

If the notches of two boxes do not overlap, their medians differ at roughly the 95% confidence level. 
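
The notch formula can be reproduced by hand. The snippet below is a minimal sketch on made-up scores; `notch_interval` and `notches_overlap` are hypothetical helper names, and the quartiles come from `np.percentile`, so the bounds may differ slightly from the ones Plotly draws.

```python
import numpy as np

def notch_interval(values):
    """Return (low, high) for median +/- 1.57 * IQR / sqrt(N)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(values.size)
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True when the two notch intervals share at least one point."""
    low_a, high_a = notch_interval(a)
    low_b, high_b = notch_interval(b)
    return low_a <= high_b and low_b <= high_a

# Made-up scores, for illustration only
group_a = [60, 62, 65, 68, 70, 71, 73, 75, 78]
group_b = [80, 82, 84, 85, 86, 88, 90, 91, 93]

print(notch_interval(group_a))
print(notches_overlap(group_a, group_b))
```

On the notebook's data the same helpers could be called with `df_sports` and `df_nonsports`.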

Let's check it with a statistical test.

#### Test scores means

In [16]:
print(stats.shapiro(df_sports), stats.shapiro(df_nonsports))

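The two samples are independent.  
Normality is not established here: the two samples together hold 8137 scores, and Shapiro-Wilk flags even small departures from normality at that size. With this many observations a t-test on the means is still reasonable.  
Student's t-test assumes equal population variances, so the two-sample comparison below uses the Welch variant (`equal_var=False`), which drops that assumption.  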


H0:   $CS.mean{_{Sports}} = CS.mean{_{Others}} $  

H1:   $CS.mean{_{Sports}} \ne CS.mean{_{Others}} $

$confidence = 0.95$

In [None]:
# CS mean of non-sports games
popmean_notsports = df[df.Genre != 'Sports'].Critic_Score.mean()
print(popmean_notsports)
stats.ttest_1samp(df_nonsports, popmean= popmean_notsports, nan_policy= 'omit')

68.4516779490134


Ttest_1sampResult(statistic=0.0, pvalue=1.0)

In [None]:
# CS overall mean
popmean = df.Critic_Score.mean()
print(popmean)
stats.ttest_1samp(df_sports, popmean= popmean, nan_policy= 'omit')

68.96767850559173


Ttest_1sampResult(statistic=7.470587451672033, pvalue=1.538088875231057e-13)

In [None]:
stats.ttest_ind(df_sports, df_nonsports,  equal_var= False, nan_policy= 'omit') #  If False, perform Welch’s t-test, 
#which does not assume equal population variance 

Ttest_indResult(statistic=8.08698828481822, pvalue=1.181171308320441e-15)
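
To make the Welch variant less of a black box, the sketch below recomputes the statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand, then compares them with `stats.ttest_ind(..., equal_var=False)`. The sample values are made up and `welch_t` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic, degrees of freedom and two-sided p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # squared standard error of each sample mean
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p

# Made-up scores, for illustration only
sample_a = [72, 75, 78, 80, 83, 85, 88]
sample_b = [60, 64, 66, 70, 71, 74]

t_manual, dof, p_manual = welch_t(sample_a, sample_b)
t_scipy, p_scipy = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```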

$p\text{-value} < 0.05 \Rightarrow$  
The difference in means is statistically significant, so we reject the null hypothesis.  

`On average, critics score Sports games higher than all other genres combined.`

#### Which video platforms do critics prefer (PC or PS4)?

In [None]:
df.groupby('Platform').mean(numeric_only= True).sort_values(by='Critic_Score',ascending= False).style.bar(align='left', color='coral')

Platform,Critic_Score
DC,87.357143
PC,75.928671
XOne,73.325444
PS4,72.09127
PS,71.515
PSV,70.791667
WiiU,70.733333
PS3,70.382927
XB,69.85931
GC,69.488839


In [None]:
y_pc = df[df.Platform == 'PC'].Critic_Score
y_ps4 = df[df.Platform == 'PS4'].Critic_Score

fig = go.Figure()
fig.add_trace(go.Box(x=y_ps4, notched= True, name= 'PS4', marker_color='darkblue'))
fig.add_trace(go.Box(x=y_pc, notched= True, name= 'PC', marker_color='#FF4136'))

fig.update_layout(title="PC and PS4 Critic's Scores", xaxis_title="Critic's Scores")
fig.show()

The notches of the two boxes do not overlap,  
so their medians differ at roughly the 95% confidence level. 

Let's check it with a t-test.

In [None]:
print(stats.shapiro(y_pc), stats.shapiro(y_ps4))

ShapiroResult(statistic=0.9565241932868958, pvalue=1.0608874889683761e-13) ShapiroResult(statistic=0.9328337907791138, pvalue=2.690704770103025e-09)


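Both Shapiro-Wilk p-values are far below 0.05, so normality is rejected for both samples; the samples are independent.  
For large samples a t-test on the means remains reasonable despite that.  
Student's t-test assumes equal population variances, so the Welch variant (`equal_var=False`), which drops that assumption, is used below.  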
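
Reading a Shapiro-Wilk result comes down to comparing its p-value with the chosen significance level: the null hypothesis is "the sample is normal", so a small p-value rejects normality. A minimal sketch, assuming a significance level of 0.05 and a hypothetical helper `normality_rejected`; the test data are deterministic quantiles of an exponential distribution, not the notebook's scores.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level

def normality_rejected(sample, alpha=ALPHA):
    """True when Shapiro-Wilk rejects H0 'the sample is normally distributed'."""
    return stats.shapiro(sample).pvalue < alpha

# Deterministic, strongly right-skewed values (illustration only)
skewed = stats.expon.ppf(np.linspace(0.01, 0.99, 200))
print(normality_rejected(skewed))

# The same rule applied to the PC p-value printed above
print(1.0608874889683761e-13 < ALPHA)
```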


H0:   $CS.mean{_{PC}} = CS.mean{_{PS4}} $  

H1:   $CS.mean{_{PC}} \ne CS.mean{_{PS4}} $

$confidence = 0.95$

In [None]:
stats.ttest_ind(y_pc, y_ps4, equal_var= False, nan_policy= 'omit') #  If False, perform Welch’s t-test, 
#which does not assume equal population variance 

Ttest_indResult(statistic=4.3087588262138725, pvalue=2.067249157283479e-05)

$p\text{-value} < 0.05 \Rightarrow$  
The difference in means is statistically significant, so we reject the null hypothesis.  

`On average, critics score PC games higher than PS4 games.`

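#### Do critics prefer shooters or strategy games?

The cells below compare Role-Playing with Strategy, the two top-ranked genres by mean score, rather than Shooter with Strategy.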

In [None]:
df.groupby('Genre').mean(numeric_only= True).sort_values(by='Critic_Score',ascending= False).style.bar(align='left', color='grey')

Genre,Critic_Score
Role-Playing,72.652646
Strategy,72.086093
Sports,71.968174
Shooter,70.181144
Fighting,69.217604
Simulation,68.619318
Platform,68.05835
Racing,67.963612
Puzzle,67.424107
Action,66.629101


In [None]:
y_rpg = df[df.Genre == 'Role-Playing'].Critic_Score
y_str = df[df.Genre == 'Strategy'].Critic_Score

fig = go.Figure()
fig.add_trace(go.Box(x=y_rpg, notched= True, name= 'Role-Playing', marker_color='red'))
fig.add_trace(go.Box(x=y_str, notched= True, name= 'Strategy', marker_color='black'))
fig.add_vline(x=y_rpg.median(), line_color='red')
fig.update_layout(title="Role-Playing and Strategy Critic's Scores", xaxis_title="Critic's Scores")
fig.show()

The notches of the two boxes overlap, so there is **no evidence at the 95% level** that their medians differ. 

Let's check it with a t-test.

In [None]:
print(stats.shapiro(y_rpg), stats.shapiro(y_str))

ShapiroResult(statistic=0.9816334843635559, pvalue=5.457165030975375e-08) ShapiroResult(statistic=0.9744413495063782, pvalue=3.258884316892363e-05)


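Both Shapiro-Wilk p-values are far below 0.05, so normality is rejected for both samples; the samples are independent.  
For large samples a t-test on the means remains reasonable despite that.  
Student's t-test assumes equal population variances, so the Welch variant (`equal_var=False`), which drops that assumption, is used below.  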


H0:   $CS.mean{_{rpg}} = CS.mean{_{strategy}} $  

H1:   $CS.mean{_{rpg}} \ne CS.mean{_{strategy}} $

$confidence = 0.95$

In [None]:
stats.ttest_ind(y_rpg, y_str, equal_var= False, nan_policy= 'omit') #  If False, perform Welch’s t-test, 
#which does not assume equal population variance 

Ttest_indResult(statistic=0.698083061405362, pvalue=0.4854113519174341)

We have no reason to reject the null hypothesis.  
We `don't have statistically significant evidence` that critics prefer RPG over Strategy games or vice versa.
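
The same comparison can be repeated for any pair of genres, Shooter and Strategy included. The helper below is a sketch only: `compare_genres` is a hypothetical name, the toy frame is made up, and the column names `Genre` and `Critic_Score` are taken from the notebook.

```python
import pandas as pd
from scipy import stats

def compare_genres(frame, genre_a, genre_b, alpha=0.05):
    """Welch's t-test on Critic_Score for two genres of `frame`."""
    a = frame.loc[frame["Genre"] == genre_a, "Critic_Score"]
    b = frame.loc[frame["Genre"] == genre_b, "Critic_Score"]
    result = stats.ttest_ind(a, b, equal_var=False, nan_policy="omit")
    return {
        "mean_a": a.mean(),
        "mean_b": b.mean(),
        "statistic": result.statistic,
        "pvalue": result.pvalue,
        "reject_h0": bool(result.pvalue < alpha),
    }

# Made-up scores, for illustration only
toy = pd.DataFrame({
    "Genre": ["Shooter"] * 5 + ["Strategy"] * 5,
    "Critic_Score": [70, 72, 68, 75, 71, 71, 73, 69, 74, 72],
})
summary = compare_genres(toy, "Shooter", "Strategy")
print(summary)
```

On the real data the call would be `compare_genres(df, 'Shooter', 'Strategy')`; its result is not shown here because that cell was not run in this notebook.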

## Task 2  

Build a baseline logistic regression model that classifies text messages as spam or not spam (the data are [here](https://github.com/obulygin/pyda_homeworks/blob/master/stat_case_study/spam.csv)). To do this:

1. Convert all text to lower case;
1. Remove junk characters;
1. Remove stop words;
1. Reduce all words to their normal form (lemmatize);
1. Convert all messages to TF-IDF vectors.  
The following code will help:  

``from sklearn.feature_extraction.text import TfidfVectorizer``  
``tfidf = TfidfVectorizer()``  
``tfidf_matrix = tfidf.fit_transform(df.Message)``  
``names = tfidf.get_feature_names_out()``  
``tfidf_matrix = pd.DataFrame(tfidf_matrix.toarray(), columns=names)``  
You may experiment with the TfidfVectorizer parameters;
1. Split the data into test and training sets in a 30/70 ratio,  
set random_state=42.  
Use train_test_split;
1. Build a logistic regression model,  
set random_state=42,  
and evaluate its accuracy on the test data;
1. Describe the results using confusion_matrix;
1. Build a dataframe containing the original text of every misclassified message (with the actual and predicted labels).

### `Prepare`

In [None]:
# Additional dependencies
import re
# https://www.machinelearningplus.com/nlp/gensim-tutorial/
# the gensim library makes it efficient to work with text corpora
from gensim import corpora
from nltk.corpus import stopwords

In [None]:
# df_raw = pd.read_csv('https://raw.githubusercontent.com/obulygin/pyda_homeworks/master/stat_case_study/spam.csv')
df_raw = pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/spam.csv')
df_raw

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"
...,...,...
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate."
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other suggestions?"
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free


In [None]:
import nltk
nltk.download('stopwords')
stopwords_set = set(stopwords.words('english'))
# stopwords_set

from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /Users/velo1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/velo1/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/velo1/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### `Process`

In [None]:
df_raw.duplicated().sum(), df_raw.isna().sum()

(415,
 Category    0
 Message     0
 dtype: int64)

In [None]:
df_raw[df_raw.duplicated()]

Unnamed: 0,Category,Message
103,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
154,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
207,ham,"As I entered my cabin my PA said, '' Happy B'day Boss !!''. I felt special. She askd me 4 lunch. After lunch she invited me to her apartment. We went there."
223,ham,"Sorry, I'll call later"
326,ham,No calls..messages..missed calls
...,...,...
5524,spam,You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99
5535,ham,"I know you are thinkin malaria. But relax, children cant handle malaria. She would have been worse and its gastroenteritis. If she takes enough to replace her loss her temp will reduce. And if you..."
5539,ham,Just sleeping..and surfing
5553,ham,Hahaha..use your brain dear


In [None]:
df = df_raw.drop_duplicates().copy()
# encode the target: ham -> 0, spam -> 1
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})
df = df.rename(columns={'Category': 'is_spam'})
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5157 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   is_spam  5157 non-null   int64 
 1   Message  5157 non-null   object
dtypes: int64(1), object(1)
memory usage: 120.9+ KB


### `Analyze`
#### `tokenization`

In [None]:
# lower-case -> replace non-word characters and digits with spaces -> split -> remove stop words
df['words'] = df.Message.apply(lambda x:
 [word for word in re.sub(r'[\W\d]+', ' ', x.lower()).split() if word not in stopwords_set] )

df

Unnamed: 0,is_spam,Message,words
0,0,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,0,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,1,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,"[free, entry, wkly, comp, win, fa, cup, final, tkts, st, may, text, fa, receive, entry, question, std, txt, rate, c, apply]"
3,0,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,0,"Nah I don't think he goes to usf, he lives around here though","[nah, think, goes, usf, lives, around, though]"
...,...,...,...
5567,1,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.","[nd, time, tried, contact, u, u, pound, prize, claim, easy, call, p, per, minute, bt, national, rate]"
5568,0,Will ü b going to esplanade fr home?,"[ü, b, going, esplanade, fr, home]"
5569,0,"Pity, * was in mood for that. So...any other suggestions?","[pity, mood, suggestions]"
5570,0,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,"[guy, bitching, acted, like, interested, buying, something, else, next, week, gave, us, free]"


In [None]:
wordnet_lemmatizer = WordNetLemmatizer()

# lemmatize -> join into one string to feed TfidfVectorizer
df['lemm_str'] = df.words.apply(lambda x: 
    ' '.join([wordnet_lemmatizer.lemmatize(word) for word in x]) )
df

Unnamed: 0,is_spam,Message,words,lemm_str
0,0,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]",go jurong point crazy available bugis n great world la e buffet cine got amore wat
1,0,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]",ok lar joking wif u oni
2,1,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,"[free, entry, wkly, comp, win, fa, cup, final, tkts, st, may, text, fa, receive, entry, question, std, txt, rate, c, apply]",free entry wkly comp win fa cup final tkts st may text fa receive entry question std txt rate c apply
3,0,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]",u dun say early hor u c already say
4,0,"Nah I don't think he goes to usf, he lives around here though","[nah, think, goes, usf, lives, around, though]",nah think go usf life around though
...,...,...,...,...
5567,1,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.","[nd, time, tried, contact, u, u, pound, prize, claim, easy, call, p, per, minute, bt, national, rate]",nd time tried contact u u pound prize claim easy call p per minute bt national rate
5568,0,Will ü b going to esplanade fr home?,"[ü, b, going, esplanade, fr, home]",ü b going esplanade fr home
5569,0,"Pity, * was in mood for that. So...any other suggestions?","[pity, mood, suggestions]",pity mood suggestion
5570,0,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,"[guy, bitching, acted, like, interested, buying, something, else, next, week, gave, us, free]",guy bitching acted like interested buying something else next week gave u free


In [None]:
# dictionary = corpora.Dictionary([['love']])
# dictionary

`tf–idf` means term-frequency `times` inverse document-frequency
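
As a worked illustration of "term frequency times inverse document frequency", the sketch below computes the weights in plain Python on three made-up messages. It assumes raw counts for tf, the smoothed idf $\ln\frac{1+n}{1+df(t)} + 1$ and L2 row normalization, which are the scikit-learn defaults; tokenization here is a simple whitespace split, so it is not a drop-in replacement for `TfidfVectorizer`. `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) with L2-normalized tf-idf weights."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Made-up messages, for illustration only
docs = ["free prize call", "call later", "free free entry"]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the second message, "later" occurs in one document and "call" in two, so "later" receives the larger weight.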

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(lowercase = False, ngram_range= (1,1))
# ngram_range= (1,1)  FN 34 FP 6
# ngram_range= (1,2)  FN 51 FP 3
# ngram_range= (1,3)  FN 69 FP 3
# ngram_range= (2,2)  FN 109 FP 0

# Learn vocabulary and idf, return document-term matrix.
tfidf_matrix = tfidf.fit_transform(df.lemm_str)

# Get output feature names for transformation.
names = tfidf.get_feature_names_out()


tfidf_matrix = pd.DataFrame(tfidf_matrix.toarray(), columns= names)
tfidf_matrix.shape

(5157, 7101)

In [None]:
# tfidf_matrix['upper_ind'] = df.Message.apply(lambda x: sum(True for c in x if c.isupper()) / len (x)) # uppercase index
# tfidf_matrix['nonword_ind'] = df.apply(lambda row: len(row.lemm_str)/ len (row.Message), axis= 1)  # nonword index
# tfidf_matrix['len']= df.Message.apply(lambda x: len(x)) # empirical criterion approximates the 'Euclidean distance'
tfidf_matrix

Unnamed: 0,____,aa,aah,aaniye,aaooooright,aathi,ab,abbey,abdomen,abeg,...,zf,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5153,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5154,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5155,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# tfidf_matrix[tfidf_matrix.upper_ind.isna()]

In [None]:
tfidf_matrix.replace({np.nan:0},inplace= True)

In [None]:
# index_ = tfidf_matrix[tfidf_matrix.upper_ind.isna()== False].index

In [None]:
# tfidf_matrix.iloc[index_]
# df.reset_index().iloc[index_]

In [None]:
X = tfidf_matrix
y = df['is_spam']
X.shape[0] == y.shape[0]

True

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y ,test_size=0.3, random_state= 42)

#### `LR`

#### `process pipeline`

In [None]:
# %%time
# parameters = {
#             # 'scaler': [StandardScaler()],
# 	            'logit__tol': [1e-2],# 1e-3],           #Tolerance for stopping criteria.
#               'logit__C': [ 7 ], # 5 ,  10],            # Regularization parameter.
#               'logit__max_iter': [1000], # 5000, 10000],  
#               # 'logit__multi_class': ['ovr','multinomial'],
#               'logit__n_jobs':[-1]
#                 }
# pipe = Pipeline([
#                   # ('scaler', StandardScaler()),
#                  ('logit', LogisticRegression())])
                                 
# grid = GridSearchCV(pipe, parameters, cv=10).fit(X_train, y_train)

# logit_optim_score = grid.score(X_test, y_test)
# print(f'Training set best score: {grid.score(X_train, y_train):2.2f}')
# print(f'Test set best score: {logit_optim_score:2.3f}')

# Access the best set of parameters
# best_params = grid.best_params_
# print('\nOptimal set of parameters:', best_params)

# Stores the optimum model in best_pipe
# best_pipe = grid.best_estimator_
# print('\nOptimal pipeline:', best_pipe)

#### `LR using pipeline best params`

In [None]:
%%time
lr = LogisticRegression(tol= 1e-2, C= 7, max_iter= 1000, n_jobs= -1)
lr.fit(X_train, y_train)

CPU times: user 210 ms, sys: 348 ms, total: 558 ms
Wall time: 29.2 s


In [None]:
# [x for x in dir(lr) if '_' not in x]

In [None]:
# ?f1_score

#### `results`

In [None]:
# Return the mean accuracy on the given test data and labels.

print(f'Model accuracy: {lr.score(X_test, y_test):.3%}')

Model accuracy: 97.416%


In [None]:
# Compute the F1 score, also known as balanced F-score or F-measure.

# The F1 score can be interpreted as a harmonic mean of the precision and
# recall, where an F1 score reaches its best value at 1 and worst score at 0.
# The relative contribution of precision and recall to the F1 score are
# equal. 
print(f'Model f1-score: {f1_score(y_test, lr.predict(X_test)) :.3%}')

Model f1-score: 89.418%


In [None]:
def conf_matrix(actual, predicted):
    # label one observation with its confusion-matrix cell (1 = spam, 0 = ham)
    if actual == 0:
        return 'TN' if predicted == 0 else 'FP'
    else:
        return 'TP' if predicted == 1 else 'FN'
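
The task statement mentions `confusion_matrix`; the sketch below shows how the same four cells can be obtained with `sklearn.metrics.confusion_matrix` instead of row-by-row labelling. The labels are made up for illustration. With `labels=[0, 1]`, `ravel()` returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only (1 = spam, 0 = ham)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

In this notebook the call would take `y_test` and `lr.predict(X_test)` once the model is fitted.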

In [None]:
df['Predicted'] = lr.predict(tfidf_matrix)
df['conf_matrix'] = df.apply(lambda row: conf_matrix(row['is_spam'], row['Predicted']), axis= 1)

In [None]:
# MISTAKES (FN + FP)
pd.set_option('max_colwidth', 200)
df[df.is_spam != df.Predicted][['Message', 'lemm_str','is_spam','Predicted', 'conf_matrix']].head()

Unnamed: 0,Message,lemm_str,is_spam,Predicted,conf_matrix
45,No calls..messages..missed calls,call message missed call,0,1,FP
68,"Did you hear about the new ""Divorce Barbie""? It comes with all of Ken's stuff!",hear new divorce barbie come ken stuff,1,0,FN
84,Yup next stop.,yup next stop,0,1,FP
95,"Your free ringtone is waiting to be collected. Simply text the password ""MIX"" to 85069 to verify. Get Usher and Britney. FML, PO Box 5249, MK17 92H. 450Ppw 16",free ringtone waiting collected simply text password mix verify get usher britney fml po box mk h ppw,1,0,FN
333,Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 85. No prepayment. Direct access!,call germany penny per minute call fixed line via access number prepayment direct access,1,0,FN


### `Share`


<table style="text-align:center;" align="center">
<tbody>
<tr>
<td colspan="2">Total population = P + N</td>
<td><b>Predicted positive (PP)</b></td>
<td><b>Predicted negative (PN)</b></td>
</tr>
<tr>
<td rowspan="2"><b>Actual condition</b></td>
<td><b>Positive (P)</b></td>
<td><b>True positive (TP)</b></td>
<td><b>False negative (FN)</b></td>
</tr>
<tr>
<td><b>Negative (N)</b></td>
<td><b>False positive (FP)</b></td>
<td><b>True negative (TN)</b></td>
</tr>
</tbody>
</table>

#### `Model overall (train+test) metrics`

In [None]:
df_cm = df[['is_spam', 'conf_matrix']]

In [None]:
TP = (df_cm['conf_matrix'] == 'TP').sum()
TN = (df_cm['conf_matrix'] == 'TN').sum()
FP = (df_cm['conf_matrix'] == 'FP').sum()
FN = (df_cm['conf_matrix'] == 'FN').sum()
P = (df_cm['is_spam'] == 1).sum()
N = (df_cm['is_spam'] == 0).sum()

df_cm.groupby('conf_matrix').count().style.bar(align='left', color='lightgreen')

conf_matrix,is_spam
FN,48
FP,7
TN,4509
TP,593


In [None]:
print(f'Prevalence {" "*12}= {P /(P+N):0.3%}'.replace('%', ' %'))
print(f'TPR (sensitivity){" "*6}= {TP/P:0.3%}'.replace('%', ' %'))
print(f'TNR (specificity){" "*6}= {TN/N:0.3%}'.replace('%', ' %'))
print(f'FPR (false alarm){" "*6}= {FP/N:0.3%}'.replace('%', ' %'))
print(f'FNR (miss rate){" "*8}= {FN/P:0.3%}'.replace('%', ' %'))
print(f'Accuracy {" "*14}= {(TP+TN)/(P+N):0.3%}'.replace('%', ' %'))
# F1 = 2*TP/(2*TP+FP+FN)
print(f'F1 score {" "*14}= {2*TP/(2*TP+FP+FN):0.3%}'.replace('%', ' %'))

Prevalence             = 12.430 %
TPR (sensitivity)      = 92.512 %
TNR (specificity)      = 99.845 %
FPR (false alarm)      = 0.155 %
FNR (miss rate)        = 7.488 %
Accuracy               = 98.933 %
F1 score               = 95.568 %


#### `Model metrics on test data only`

In [None]:
df_cm = df_cm.iloc[X_test.index]

In [None]:
TP = (df_cm['conf_matrix'] == 'TP').sum()
TN = (df_cm['conf_matrix'] == 'TN').sum()
FP = (df_cm['conf_matrix'] == 'FP').sum()
FN = (df_cm['conf_matrix'] == 'FN').sum()
P = (df_cm['is_spam'] == 1).sum()
N = (df_cm['is_spam'] == 0).sum()

df_cm.groupby('conf_matrix').count().style.bar(align='left', color='lightgreen')

conf_matrix,is_spam
FN,34
FP,6
TN,1339
TP,169


In [None]:
print(f'Prevalence {" "*12}= {P /(P+N):0.3%}'.replace('%', ' %'))
print(f'TPR (sensitivity){" "*6}= {TP/P:0.3%}'.replace('%', ' %'))
print(f'TNR (specificity){" "*6}= {TN/N:0.3%}'.replace('%', ' %'))
print(f'FPR (false alarm){" "*6}= {FP/N:0.3%}'.replace('%', ' %'))
print(f'FNR (miss rate){" "*8}= {FN/P:0.3%}'.replace('%', ' %'))
print(f'Accuracy {" "*14}= {(TP+TN)/(P+N):0.3%}'.replace('%', ' %'))
# F1 = 2*TP/(2*TP+FP+FN)
print(f'F1 score {" "*14}= {2*TP/(2*TP+FP+FN):0.3%}'.replace('%', ' %'))

Prevalence             = 13.114 %
TPR (sensitivity)      = 83.251 %
TNR (specificity)      = 99.554 %
FPR (false alarm)      = 0.446 %
FNR (miss rate)        = 16.749 %
Accuracy               = 97.416 %
F1 score               = 89.418 %
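
Precision is a separate quantity from TPR: it divides TP by everything predicted as spam (TP + FP), whereas TPR (recall) divides TP by everything that actually is spam (TP + FN). A short worked check using the test-set counts from the table above:

```python
# Test-set counts taken from the confusion table above
TP, FP, FN, TN = 169, 6, 34, 1339

precision = TP / (TP + FP)   # share of predicted spam that is really spam
recall = TP / (TP + FN)      # share of real spam that was caught (TPR)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3%}")  # 96.571%
print(f"recall    = {recall:.3%}")     # 83.251%
print(f"F1        = {f1:.3%}")         # 89.418%
```

The harmonic-mean form gives the same F1 as `2*TP/(2*TP+FP+FN)` used in the cell above.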


In [None]:
# WE HAVE FALSE ALARM ON THESE MESSAGES
df[df['conf_matrix'] == 'FP']

Unnamed: 0,is_spam,Message,words,lemm_str,Predicted,conf_matrix
45,0,No calls..messages..missed calls,"[calls, messages, missed, calls]",call message missed call,1,FP
84,0,Yup next stop.,"[yup, next, stop]",yup next stop,1,FP
495,0,Are you free now?can i call now?,"[free, call]",free call,1,FP
3364,0,Can... I'm free...,[free],free,1,FP
4419,0,"When you get free, call me","[get, free, call]",get free call,1,FP
4729,0,I (Career Tel) have added u as a contact on INDYAROCKS.COM to send FREE SMS. To remove from phonebook - sms NO to &lt;#&gt;,"[career, tel, added, u, contact, indyarocks, com, send, free, sms, remove, phonebook, sms, lt, gt]",career tel added u contact indyarocks com send free sm remove phonebook sm lt gt,1,FP
5157,0,K k:) sms chat with me.,"[k, k, sms, chat]",k k sm chat,1,FP


In [None]:
# WE MISS THESE spam MESSAGES
df[df['conf_matrix'] == 'FN'].head()

Unnamed: 0,is_spam,Message,words,lemm_str,Predicted,conf_matrix
68,1,"Did you hear about the new ""Divorce Barbie""? It comes with all of Ken's stuff!","[hear, new, divorce, barbie, comes, ken, stuff]",hear new divorce barbie come ken stuff,0,FN
95,1,"Your free ringtone is waiting to be collected. Simply text the password ""MIX"" to 85069 to verify. Get Usher and Britney. FML, PO Box 5249, MK17 92H. 450Ppw 16","[free, ringtone, waiting, collected, simply, text, password, mix, verify, get, usher, britney, fml, po, box, mk, h, ppw]",free ringtone waiting collected simply text password mix verify get usher britney fml po box mk h ppw,0,FN
333,1,Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 85. No prepayment. Direct access!,"[call, germany, pence, per, minute, call, fixed, line, via, access, number, prepayment, direct, access]",call germany penny per minute call fixed line via access number prepayment direct access,0,FN
607,1,XCLUSIVE@CLUBSAISAI 2MOROW 28/5 SOIREE SPECIALE ZOUK WITH NICHOLS FROM PARIS.FREE ROSES 2 ALL LADIES !!! info: 07946746291/07880867867,"[xclusive, clubsaisai, morow, soiree, speciale, zouk, nichols, paris, free, roses, ladies, info]",xclusive clubsaisai morow soiree speciale zouk nichols paris free rose lady info,0,FN
751,1,"Do you realize that in about 40 years, we'll have thousands of old ladies running around with tattoos?","[realize, years, thousands, old, ladies, running, around, tattoos]",realize year thousand old lady running around tattoo,0,FN


``Main conclusions:``
* `to maximize TNR (specificity, i.e. minimize false alarms), use n-grams (2,2)`
* `to maximize TPR (sensitivity, i.e. maximize spam detection), use n-grams (1,1)`
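
For readers unfamiliar with the `ngram_range` notation, the sketch below shows which features each setting produces from one lemmatized message. `ngrams` is a hypothetical helper written for illustration; it mimics the idea of `ngram_range=(n_min, n_max)` and is not scikit-learn's implementation.

```python
def ngrams(tokens, n_min, n_max):
    """All contiguous word n-grams with n_min <= n <= n_max."""
    features = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features

tokens = "free entry wkly comp".split()
print(ngrams(tokens, 1, 1))  # unigrams only
print(ngrams(tokens, 2, 2))  # bigrams only
print(ngrams(tokens, 1, 2))  # unigrams and bigrams together
```

Bigram-only features are rarer and more specific than single words, which is consistent with the fewer false alarms but more missed spam recorded for `(2,2)` in the comments of the vectorizer cell.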

[Feature Extraction and Logistic Regression](https://medium.com/@annabiancajones/sentiment-analysis-on-reviews-feature-extraction-and-logistic-regression-43a29635cc81)