# **Case Study 5: NLP Classifier (Email Spam)**

## **Installing Required Libraries**

In [2]:
!pip install gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [1]:
!pip install wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9675 sha256=aeb891163669eb946e6f67355afd7be43954f8b0f9e5065126dc423e75f04802
  Stored in directory: /root/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c13e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [18]:
!python -m spacy download en_core_web_lg

2022-10-17 13:16:16.173166: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0-py3-none-any.whl (587.7 MB)
     |████████████████████████████████| 587.7 MB 8.8 kB/s
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


## **Importing Libraries**

In [3]:
import numpy as np
import pandas as pd

In [19]:
import spacy

In [61]:
import gensim
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec
from nltk import sent_tokenize

## **Dataset**

In [30]:
df = pd.read_csv('spam.csv', encoding='ISO-8859-1')
df.head()

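|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|----|----|------------|------------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |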


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [31]:
df.describe()

|        | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|--------|----|----|------------|------------|------------|
| count  | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top    | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | MK17 92H. 450Ppw 16" | GNT:-)" |
| freq   | 4825 | 30 | 3 | 2 | 2 |


In [32]:
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

In [33]:
df.rename({'v1':'category', 'v2':'text'}, axis=1, inplace=True)

In [34]:
df.head()

|   | category | text |
|---|----------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## **Preprocessing**

In [35]:
nlp = spacy.load('en_core_web_lg')

In [36]:
df['text'][0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [37]:
nlp(df['text'][0])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

In [38]:
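def preprocess(text):
    """Return the space-joined lemmas of the tokens worth keeping in `text`."""
    filtered = []
    doc = nlp(text)
    for token in doc:
        # Skip stop words, punctuation, and whitespace tokens.
        if token.is_stop or token.is_punct or token.is_space:
            continue
        # Keep only tokens the model has a word vector for, in lemma form.
        if token.has_vector:
            filtered.append(token.lemma_)
    return " ".join(filtered)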

In [39]:
df['spacy_filtered'] = df['text'].apply(preprocess)

In [40]:
df.head()

|   | category | text | spacy_filtered |
|---|----------|------|----------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | point crazy available n great world la e buffe... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar joke wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 comp win FA Cup final 21st 2005 t... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun early hor u c |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | nah think go live |


In [41]:
df['spacy_vector'] = df['spacy_filtered'].apply(lambda text: nlp(text).vector)

In [42]:
df['spacy_vector'][0].shape

(300,)
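
The `(300,)` shape follows from how spaCy builds a `Doc` vector: by default, `Doc.vector` is the average of the token vectors, so every message collapses to a single fixed-size vector regardless of its length. A minimal NumPy sketch of that averaging, using made-up 4-dimensional token vectors (the values and the `token_vectors` name are illustrative assumptions, not output from `en_core_web_lg`):

```python
import numpy as np

# Hypothetical token vectors for a three-token message (4 dims instead of 300).
token_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
])

# Averaging over the token axis gives one fixed-size vector per message.
doc_vector = token_vectors.mean(axis=0)

print(doc_vector.shape)  # (4,)
print(doc_vector)        # [2. 2. 1. 2.]
```

Averaging discards word order, which is usually acceptable for short messages but worth keeping in mind when interpreting the classifier.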

## **Label Encoding**

In [45]:
from sklearn.preprocessing import LabelEncoder

In [46]:
le = LabelEncoder()
y = le.fit_transform(df['category'])

In [47]:
X = df['text']

In [48]:
np.unique(y, return_counts=True)

(array([0, 1]), array([4825,  747]))
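
`LabelEncoder` assigns codes in sorted label order, so `0` is `ham` and `1` is `spam`. The counts show a clear class imbalance; the short sketch below only re-derives the proportions from the numbers printed above (the variable names are illustrative):

```python
import numpy as np

# Class counts copied from the np.unique output above (0 = ham, 1 = spam).
counts = np.array([4825, 747])

shares = counts / counts.sum()
imbalance_ratio = counts[0] / counts[1]

print(f"ham share:  {shares[0]:.3f}")
print(f"spam share: {shares[1]:.3f}")
print(f"ham messages per spam message: {imbalance_ratio:.2f}")
```

With roughly 13% spam, a model that always predicts `ham` would already reach about 87% accuracy, so precision and recall on the spam class are likely more informative than accuracy alone.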

## **Splitting Dataset**

In [50]:
from sklearn.model_selection import train_test_split

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [52]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((4457,), (1115,), (4457,), (1115,))
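
Passing `stratify=y` keeps the ham/spam proportions nearly identical in both splits. The sketch below reproduces the split sizes on synthetic labels with the same class counts; `X_demo` and `y_demo` are placeholders introduced for illustration, not the real messages:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (0 = ham, 1 = spam).
y_demo = np.array([0] * 4825 + [1] * 747)
X_demo = np.arange(len(y_demo))  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))                          # 4457 1115
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # spam share in each split
```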

In [58]:
X_train

184                            Going on nothing great.bye
2171                        I wont. So wat's wit the guys
5422              Ok k..sry i knw 2 siva..tats y i askd..
4113    Where are you ? What do you do ? How can you s...
4588         Have you not finished work yet or something?
                              ...                        
1932                            Jus finished avatar nigro
5316                         Jus finish watching tv... U?
2308    Moby Pub Quiz.Win a å£100 High Street prize if...
1903    Free entry in 2 a weekly comp for a chance to ...
763     Nothing but we jus tot u would ask cos u ba gu...
Name: text, Length: 4457, dtype: object

In [59]:
X_train[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
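
`X_train[0]` returns the first message of the original dataset rather than the first training row, because an integer key on a Series with an integer index is treated as a label, and row `0` happened to land in the training split. Had it landed in the test split, the lookup would raise a `KeyError`. A small sketch of the difference on a toy Series (the data is made up for illustration):

```python
import pandas as pd

# A small Series whose index labels are shuffled, as they are after a random split.
s = pd.Series(["third", "first", "second"], index=[2, 0, 1])

by_label = s.loc[0]      # row whose index label is 0
by_position = s.iloc[0]  # first row in the current order

print(by_label)     # first
print(by_position)  # third
```

Using `.loc` or `.iloc` explicitly makes the intent clear and avoids depending on which rows the split selected.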

In [60]:
simple_preprocess(X_train[0])

['go',
 'until',
 'jurong',
 'point',
 'crazy',
 'available',
 'only',
 'in',
 'bugis',
 'great',
 'world',
 'la',
 'buffet',
 'cine',
 'there',
 'got',
 'amore',
 'wat']
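
Note that `n` and `e` are missing from the output: with its default arguments, `simple_preprocess` lowercases the text and drops tokens shorter than 2 or longer than 15 characters. The following is a simplified standard-library stand-in written to illustrate that behaviour; `simple_preprocess_sketch` and `TOKEN_RE` are hypothetical names, and the regex is an approximation of gensim's tokenizer rather than its actual implementation:

```python
import re

# Simplified stand-in for gensim's simple_preprocess (assumption: lowercase,
# keep runs of non-digit word characters, drop tokens outside 2..15 chars).
TOKEN_RE = re.compile(r"(?:(?!\d)\w)+")

def simple_preprocess_sketch(text, min_len=2, max_len=15):
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

message = ("Go until jurong point, crazy.. Available only in bugis n great "
           "world la e buffet... Cine there got amore wat...")
tokens = simple_preprocess_sketch(message)

print(tokens)
print("n" in tokens, "e" in tokens)  # False False
```

Unlike the spaCy `preprocess` function earlier, this step keeps stop words and does not lemmatize.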

In [66]:
text_data = X_train.apply(simple_preprocess)
text_data

184                      [going, on, nothing, great, bye]
2171                      [wont, so, wat, wit, the, guys]
5422                     [ok, sry, knw, siva, tats, askd]
4113    [where, are, you, what, do, you, do, how, can,...
4588    [have, you, not, finished, work, yet, or, some...
                              ...                        
1932                       [jus, finished, avatar, nigro]
5316                          [jus, finish, watching, tv]
2308    [moby, pub, quiz, win, high, street, prize, if...
1903    [free, entry, in, weekly, comp, for, chance, t...
763     [nothing, but, we, jus, tot, would, ask, cos, ...
Name: text, Length: 4457, dtype: object

In [84]:
model = gensim.models.Word2Vec(window=10, min_count=2, workers=4)

In [85]:
model.build_vocab(text_data, progress_per=1000)

In [86]:
model.epochs

5

In [87]:
model.train(text_data, total_examples=model.corpus_count, epochs=model.epochs)



(237101, 312390)
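
The tuple returned by `model.train` can be read, assuming gensim's usual return convention, as (effective words trained on, raw words seen), both summed over all epochs. Under that assumption, the arithmetic below recovers the per-epoch corpus size and the fraction of words that actually contributed to training; the remainder would be words dropped by `min_count=2` or by frequent-word subsampling:

```python
# Values copied from the model.train output above.
effective_words, raw_words = 237101, 312390
epochs = 5  # model.epochs

raw_words_per_epoch = raw_words // epochs
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch)       # 62478
print(round(kept_fraction, 3))   # 0.759
```

A training corpus of roughly 62,000 tokens is very small for Word2Vec, which is worth keeping in mind when inspecting the learned vectors.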

In [92]:
model.wv.most_similar("good")

[('day', 0.9999451041221619),
 ('of', 0.9999352693557739),
 ('as', 0.9999247193336487),
 ('all', 0.999923050403595),
 ('and', 0.9999223947525024),
 ('last', 0.9999210834503174),
 ('the', 0.9999198913574219),
 ('was', 0.9999184012413025),
 ('too', 0.999916672706604),
 ('my', 0.9999161958694458)]
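
`most_similar` ranks words by cosine similarity. Scores above 0.9999 for unrelated stop words such as `of`, `as`, and `the` suggest that the learned vectors all point in nearly the same direction, a likely sign that the model is undertrained on this small corpus rather than evidence of real semantic similarity. A minimal NumPy sketch of cosine similarity, with toy vectors chosen for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

base = np.array([1.0, 1.0, 1.0, 1.0])
nearly_same = base + np.array([0.01, -0.01, 0.0, 0.01])  # tiny perturbation
different = np.array([1.0, -1.0, 1.0, -1.0])             # orthogonal to base

sim_near = cosine_similarity(base, nearly_same)
sim_diff = cosine_similarity(base, different)

print(sim_near)  # very close to 1
print(sim_diff)  # 0.0
```

Training for more epochs or on a larger corpus are the usual first things to try when every neighbour scores close to 1.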