<a href="https://colab.research.google.com/github/PollyIva/NLP-projects/blob/main/NLP_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Detect spam


---
This notebook compares two ways to detect spam messages:

1. using only the message length and punctuation count
2. using the processed message text
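
The `length` and `punct` columns used in the first approach arrive precomputed in the dataset. As a rough sketch of how such features could be derived from raw text (the helper name `length_and_punct` and the use of `string.punctuation` are assumptions, not something the dataset documents):

```python
import string

def length_and_punct(message):
    """Return (character count, punctuation count) for one message."""
    punct = sum(1 for ch in message if ch in string.punctuation)
    return len(message), punct

print(length_and_punct("Ok lar... Joking wif u oni..."))  # (29, 6)
```

Under this assumption the result agrees with the values stored for that message in the dataset (length 29, punct 6).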


In [2]:
import pandas as pd
#pip install scikit-learn
#import sklearn
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
#import xgboost as xgb

from sklearn import metrics

# **I. Using length and punctuation**


---




# Data processing

In [3]:
df = pd.read_csv('./smsspamcollection.tsv', sep='\t')
df

Unnamed: 0,label,message,length,punct
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2
...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,160,8
5568,ham,Will ü b going to esplanade fr home?,36,1
5569,ham,"Pity, * was in mood for that. So...any other s...",57,7
5570,ham,The guy did some bitching but I acted like i'd...,125,1


In [4]:
df.isnull().sum()  # count missing values in each column

label      0
message    0
length     0
punct      0
dtype: int64

**Define the target and the features:**

In [5]:
X = df.drop(['label', 'message'], axis=1)
y = df.label
y = y.apply(lambda x: 0 if x == 'ham' else 1)

**Split the data into training and test sets**

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

**Let's check how many HAM (0) and SPAM (1) messages the training set contains**

In [7]:
labels = y_train.value_counts()
labels

0    3232
1     501
Name: label, dtype: int64

**The classes are imbalanced (3232 ham vs. 501 spam), so a model can score well simply by favoring the larger class.**
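
To make the risk concrete, here is a small arithmetic sketch using the class counts printed above. It evaluates a hypothetical "model" that labels every message as ham:

```python
# Class counts copied from the y_train.value_counts() output
n_ham, n_spam = 3232, 501

# Labelling everything as ham is correct for every ham message ...
baseline_accuracy = n_ham / (n_ham + n_spam)
# ... but it never finds a single spam message
spam_recall = 0 / n_spam

print(f"accuracy: {baseline_accuracy:.3f}, spam recall: {spam_recall:.1f}")
# accuracy: 0.866, spam recall: 0.0
```

An accuracy of about 0.87 with zero spam recall is why per-class metrics matter for this dataset.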



---


**To counter this, we generate additional synthetic samples of the minority class with SMOTE.**



In [8]:
from imblearn.over_sampling import SMOTE

In [9]:
smote = SMOTE(sampling_strategy='minority')
X_smote_over, y_smote_over = smote.fit_resample(X_train, y_train)
y_smote_over.value_counts()

0    3232
1    3232
Name: label, dtype: int64
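
For intuition, the core idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. This is a simplified illustration, not imblearn's implementation; the function name `smote_like` and the sample values are made up.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical (length, punct) pairs for four spam messages
spam = [(155, 6), (160, 8), (148, 5), (158, 7)]
new_points = smote_like(spam, n_new=3)
print(len(new_points))  # 3
```

Because every synthetic point is an interpolation between two real spam samples, it always falls inside the range already covered by the minority class.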

**Now we train several models and choose the best one**

In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn

In [11]:
model_params = {
    'XGBClassifier': {
        'model': XGBClassifier(),
        'params' : {
            'alpha': [0], 
            'eta': [0.1], 
            'eval_metric': ['logloss'], 
            'lambda': [1],  
            'max_depth': [5, 10], 
        } 
    },  
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    },
    'naive_bayes_multinomial': {
        'model': MultinomialNB(),
        'params': {}
    },
    'decision_tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'criterion': ['gini','entropy'],
            
        }
    },
}
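
To get a feel for how much work the grid search does, the grids can be expanded by hand. The sketch below uses only the standard library and mirrors the grids in `model_params` without the estimator objects; the helper name `expand` is an assumption, and the fold count matches the `cv=5` used in this notebook's `compare` function.

```python
from itertools import product

# The same grids as in model_params, without the estimator objects
param_grids = {
    'XGBClassifier': {'alpha': [0], 'eta': [0.1], 'eval_metric': ['logloss'],
                      'lambda': [1], 'max_depth': [5, 10]},
    'random_forest': {'n_estimators': [1, 5, 10]},
    'logistic_regression': {'C': [1, 5, 10]},
    'naive_bayes_multinomial': {},
    'decision_tree': {'criterion': ['gini', 'entropy']},
}

def expand(grid):
    """List every parameter combination in a grid. An empty grid yields one
    candidate: the model with its default settings."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]

candidates = {name: len(expand(grid)) for name, grid in param_grids.items()}
cv_folds = 5
total_fits = sum(candidates.values()) * cv_folds
print(candidates)
print(total_fits)  # 55 cross-validation fits, before the final refits
```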

In [12]:
from sklearn.model_selection import GridSearchCV

In [13]:
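def compare(model_params, x_train, y_train, x_test, y_test):
    """Grid-search each model, print its test-set report, and return the CV scores."""
    scores = []
    # model_params is an iterable of (name, {'model': ..., 'params': ...}) pairs,
    # e.g. model_params.items()
    for model_name, mp in model_params:
        # Valid scoring names are listed by sklearn.metrics.get_scorer_names().
        # Note: for single-label classification 'f1_micro' equals accuracy.
        clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False, scoring='f1_micro', refit=True)
        clf.fit(x_train, y_train)
        scores.append({
            'model': model_name,
            'best_score_f1': clf.best_score_,
            'best_params': clf.best_params_
        })
        # With refit=True, clf predicts with the best estimator refitted on all of x_train
        predictions = clf.predict(x_test)
        print(model_name)
        print(metrics.classification_report(y_test, predictions))
    return scores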
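
One detail about the scoring choice in `compare`: with exactly one label per message, micro-averaged F1 pools the true positives, false positives and false negatives of both classes, which makes it equal to plain accuracy. The sketch below checks this on made-up labels and contrasts it with the F1 score of the spam class alone. The helper names `micro_f1` and `binary_f1` and the toy data are assumptions for illustration.

```python
def micro_f1(y_true, y_pred):
    """F1 computed from TP/FP/FN pooled over every class."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (t == label and p == label)
            fp += (t != label and p == label)
            fn += (t == label and p != label)
    return 2 * tp / (2 * tp + fp + fn)

def binary_f1(y_true, y_pred, positive=1):
    """F1 of a single class (here: spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)

# Toy labels: 8 ham (0) and 2 spam (1)
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 7 + [1] + [1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)  # 0.8 0.8
print(binary_f1(y_true, y_pred))           # 0.5
```

This is why the `best_score_f1` column can look healthy while the spam-class F1 in the classification report is much lower.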

In [14]:
scores = compare(model_params.items(), X_smote_over, y_smote_over, X_test, y_test)
#scores

XGBClassifier
              precision    recall  f1-score   support

           0       0.96      0.88      0.91      1593
           1       0.48      0.74      0.58       246

    accuracy                           0.86      1839
   macro avg       0.72      0.81      0.75      1839
weighted avg       0.89      0.86      0.87      1839

random_forest
              precision    recall  f1-score   support

           0       0.95      0.87      0.91      1593
           1       0.46      0.70      0.55       246

    accuracy                           0.85      1839
   macro avg       0.70      0.78      0.73      1839
weighted avg       0.88      0.85      0.86      1839

logistic_regression
              precision    recall  f1-score   support

           0       0.98      0.81      0.89      1593
           1       0.41      0.87      0.56       246

    accuracy                           0.82      1839
   macro avg       0.69      0.84      0.72      1839
weighted avg       0.90   

In [15]:
Compare_1 = pd.DataFrame(scores,columns=['model','best_score_f1','best_params'])
Compare_1

Unnamed: 0,model,best_score_f1,best_params
0,XGBClassifier,0.878402,"{'alpha': 0, 'eta': 0.1, 'eval_metric': 'loglo..."
1,random_forest,0.873296,{'n_estimators': 10}
2,logistic_regression,0.831837,{'C': 1}
3,naive_bayes_multinomial,0.592049,{}
4,decision_tree,0.86943,{'criterion': 'entropy'}


# **II. Using message text processing**


---



In [16]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [17]:
import re

In [18]:
df

Unnamed: 0,label,message,length,punct
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2
...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,160,8
5568,ham,Will ü b going to esplanade fr home?,36,1
5569,ham,"Pity, * was in mood for that. So...any other s...",57,7
5570,ham,The guy did some bitching but I acted like i'd...,125,1


In [19]:
X = df.drop(['label' , 'length', 'punct' ], axis = 1)
X

Unnamed: 0,message
0,"Go until jurong point, crazy.. Available only ..."
1,Ok lar... Joking wif u oni...
2,Free entry in 2 a wkly comp to win FA Cup fina...
3,U dun say so early hor... U c already then say...
4,"Nah I don't think he goes to usf, he lives aro..."
...,...
5567,This is the 2nd time we have tried 2 contact u...
5568,Will ü b going to esplanade fr home?
5569,"Pity, * was in mood for that. So...any other s..."
5570,The guy did some bitching but I acted like i'd...


In [20]:
type(X)

pandas.core.frame.DataFrame

In [21]:
y = df.label
y = y.apply(lambda x: 0 if x == 'ham' else 1)
y

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: label, Length: 5572, dtype: int64

**Process the text with spaCy**

In [22]:
X['message'] = X['message'].apply(lambda x: nlp(x))
X

Unnamed: 0,message
0,"(Go, until, jurong, point, ,, crazy, .., Avail..."
1,"(Ok, lar, ..., Joking, wif, u, oni, ...)"
2,"(Free, entry, in, 2, a, wkly, comp, to, win, F..."
3,"(U, dun, say, so, early, hor, ..., U, c, alrea..."
4,"(Nah, I, do, n't, think, he, goes, to, usf, ,,..."
...,...
5567,"(This, is, the, 2nd, time, we, have, tried, 2,..."
5568,"(Will, ü, b, going, to, esplanade, fr, home, ?)"
5569,"(Pity, ,, *, was, in, mood, for, that, ., So, ..."
5570,"(The, guy, did, some, bitching, but, I, acted,..."


In [23]:
type(X['message'][0])

spacy.tokens.doc.Doc

**Normalize the text and remove stop words and punctuation**

In [24]:
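def normalization_text(x):
    # Keep lowercased tokens that are neither stop words nor punctuation
    tok = [token.text.lower() for sent in x.sents for token in sent if not (token.is_stop or token.is_punct)]
    tok = ' '.join(tok)
    # Replace every character that is not an ASCII letter (digits, symbols, 'ü') with a space
    pattern = re.compile(r'[^A-Za-z]')
    tok = pattern.sub(' ', tok)
    return tok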
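
The regular-expression step is easy to try in isolation, without spaCy. The short sketch below applies the same pattern to a few already-lowercased strings (the example strings are chosen for illustration) and shows why digits and non-ASCII letters turn into runs of spaces in the normalized messages:

```python
import re

pattern = re.compile(r'[^A-Za-z]')

examples = ['free entry 2 wkly comp', '2nd time', 'will ü b going']
cleaned = [pattern.sub(' ', text) for text in examples]
for text in cleaned:
    print(repr(text))
# 'free entry   wkly comp'
# ' nd time'
# 'will   b going'
```

The extra spaces are harmless for a whitespace-insensitive tokenizer, but note that the step also discards digits and letters such as `ü`.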

In [25]:
X['message'] = X['message'].apply(lambda x: normalization_text(x))

In [26]:
type(X['message'][0])

str

In [27]:
X['message']

0       jurong point crazy available bugis n great wor...
1                                 ok lar joking wif u oni
2       free entry   wkly comp win fa cup final tkts  ...
3                                     u dun early hor u c
4                                nah think goes usf lives
                              ...                        
5567     nd time tried   contact u  u won       pound ...
5568                            b going esplanade fr home
5569                                pity mood suggestions
5570    guy bitching acted like interested buying week...
5571                                            rofl true
Name: message, Length: 5572, dtype: object

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


---
**Convert words into vectors**



In [29]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

In [30]:
X_train_counts = count_vect.fit_transform(X_train['message'])
X_train_counts

<3733x6141 sparse matrix of type '<class 'numpy.int64'>'
	with 27611 stored elements in Compressed Sparse Row format>
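
The matrix above has one row per training message and one column per vocabulary word. As a minimal pure-Python sketch of the same bag-of-words idea, the function below lowercases the text, keeps tokens of two or more word characters (which is how `CountVectorizer` tokenizes by default), and counts them against an alphabetically sorted vocabulary. The function name `bag_of_words` is an assumption, and the result is a dense list of lists rather than a sparse matrix.

```python
import re
from collections import Counter

TOKEN = re.compile(r'\b\w\w+\b')  # words of two or more characters

def bag_of_words(train_docs):
    """Return (vocabulary, count rows) for a list of documents."""
    tokenised = [TOKEN.findall(doc.lower()) for doc in train_docs]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(['ok lar joking wif u oni', 'u dun early hor u c'])
print(vocabulary)  # ['dun', 'early', 'hor', 'joking', 'lar', 'ok', 'oni', 'wif']
print(rows)        # [[0, 0, 0, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 0, 0]]
```

Single-character tokens such as `u` and `c` never reach the vocabulary, which matters for SMS text where they are common.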

In [31]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()



**Process the training data**

In [32]:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3733, 6141)

In [33]:
X_train_tfidf

<3733x6141 sparse matrix of type '<class 'numpy.float64'>'
	with 27611 stored elements in Compressed Sparse Row format>
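
The shape stays the same because TF-IDF only reweights the counts. A small sketch of the weighting that `TfidfTransformer` applies with its default settings: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by scaling each row to unit L2 length. The function name `tfidf` and the toy count matrix are assumptions for illustration.

```python
import math

def tfidf(count_rows):
    """Return (idf weights, L2-normalised tf-idf rows) for a dense count matrix."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return idf, result

# Two documents, three terms; term 0 occurs in both documents
counts = [[1, 1, 0],
          [1, 0, 2]]
idf, weights = tfidf(counts)
print(idf[0])                        # 1.0, the lowest possible weight
print(weights[0][1] > weights[0][0])  # True: the rarer term weighs more
```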

**Now we process the test data**

In [34]:
X_test_counts = count_vect.transform(X_test['message'])
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
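
Note that the test data is only transformed, never fitted, so the vocabulary and IDF weights come from the training messages alone. One way to make that hard to get wrong is a scikit-learn pipeline. The sketch below is an alternative arrangement, not the notebook's own code: it uses `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer`, and a handful of made-up messages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up messages, for illustration only (1 = spam, 0 = ham)
train_texts = ['free entry win cup', 'win free prize now', 'ok lar joking', 'going home now']
train_labels = [1, 1, 0, 0]
test_texts = ['free prize tonight']

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(solver='liblinear'))
pipe.fit(train_texts, train_labels)     # vocabulary and IDF learned from training text only
predictions = pipe.predict(test_texts)  # test text is transformed with the fitted vocabulary

vocabulary = pipe.named_steps['tfidfvectorizer'].vocabulary_
print('tonight' in vocabulary)  # False: words seen only at test time are ignored
```

A pipeline like this can also be passed to `GridSearchCV`, so that the vectorizer is refitted inside every cross-validation fold.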

In [35]:
scores_nlp = compare(model_params.items(), X_train_tfidf, y_train, X_test_tfidf, y_test)

XGBClassifier
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1593
           1       0.99      0.86      0.92       246

    accuracy                           0.98      1839
   macro avg       0.98      0.93      0.95      1839
weighted avg       0.98      0.98      0.98      1839

random_forest
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      1593
           1       0.97      0.82      0.89       246

    accuracy                           0.97      1839
   macro avg       0.97      0.91      0.94      1839
weighted avg       0.97      0.97      0.97      1839

logistic_regression
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1593
           1       0.98      0.91      0.94       246

    accuracy                           0.99      1839
   macro avg       0.98      0.95      0.97      1839
weighted avg       0.99   

In [36]:
Compare_2 = pd.DataFrame(scores_nlp,columns=['model','best_score_f1','best_params'])
Compare_2

Unnamed: 0,model,best_score_f1,best_params
0,XGBClassifier,0.966516,"{'alpha': 0, 'eta': 0.1, 'eval_metric': 'loglo..."
1,random_forest,0.965711,{'n_estimators': 5}
2,logistic_regression,0.972677,{'C': 10}
3,naive_bayes_multinomial,0.963031,{}
4,decision_tree,0.957409,{'criterion': 'gini'}


In [37]:
Compare_1

Unnamed: 0,model,best_score_f1,best_params
0,XGBClassifier,0.878402,"{'alpha': 0, 'eta': 0.1, 'eval_metric': 'loglo..."
1,random_forest,0.873296,{'n_estimators': 10}
2,logistic_regression,0.831837,{'C': 1}
3,naive_bayes_multinomial,0.592049,{}
4,decision_tree,0.86943,{'criterion': 'entropy'}


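**Conclusion: best_score_f1 is higher for every model in the second case (for example, 0.973 vs. 0.832 for logistic regression). More importantly, the classification reports for the three models shown indicate that the F1 score of the spam class rose from 0.55-0.58 with length and punctuation to 0.89-0.94 with text features, an improvement of roughly 1.6 to 1.7 times.**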
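
The size of the improvement can be checked directly from the printed reports. The arithmetic below uses the spam-class (label 1) F1 values of the three models whose reports are fully shown in both sections:

```python
# Spam-class F1 scores copied from the two sets of classification reports
f1_length_punct = {'XGBClassifier': 0.58, 'random_forest': 0.55, 'logistic_regression': 0.56}
f1_text = {'XGBClassifier': 0.92, 'random_forest': 0.89, 'logistic_regression': 0.94}

ratios = {name: round(f1_text[name] / f1_length_punct[name], 2) for name in f1_text}
print(ratios)
# {'XGBClassifier': 1.59, 'random_forest': 1.62, 'logistic_regression': 1.68}
```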