# NLP-Powered Spam Detection: Using Word2Vec, Average Word2Vec, Lemmatization, and Random Forest

## Project Overview:

With the exponential growth of digital communication, email remains a critical channel for both personal and professional correspondence. However, it is also a primary target for spam, which can clutter inboxes, waste time, and pose security risks.

**This project aims to develop an advanced email spam detection system utilizing Natural Language Processing (NLP) techniques, specifically incorporating the Random Forest algorithm, Word2Vec word embeddings, and lemmatization.**

## Project Objectives:

**Spam or Ham Classification:** The primary goal of this project is to create a robust machine learning model capable of distinguishing between spam (unsolicited, often malicious emails) and ham (legitimate) emails. This classification will help users manage their email inboxes effectively.

**Text Preprocessing:** Extensive preprocessing of email text will be performed, including **tokenization, removal of stop words, and lemmatization**. These steps are essential for transforming raw email content into a format suitable for NLP-based analysis.

**Word2Vec Embeddings:** The Word2Vec algorithm will be used to convert words into numerical vectors **that capture semantic relationships**. This representation will enable the model to understand the meaning and context of words in emails.

**Random Forest Classifier:** The Random Forest algorithm, known for its robustness and accuracy, will serve as the foundation for the spam detection model. It will be trained on labeled email data to learn patterns and features that distinguish spam from ham.

**Hyperparameter Tuning:** The Random Forest model's hyperparameters will be tuned to maximize its performance, including the number of trees, the maximum tree depth, and the minimum number of samples required to split a node.

## Importing Necessary Libraries

In [1]:
import gensim  # open-source library for NLP and word embeddings
from gensim.models import Word2Vec  # trains word embeddings that capture semantic relationships between words
from gensim.models import KeyedVectors  # stores trained vectors and supports lookups such as finding similar words

In [2]:
import pandas as pd # to create dataframe

In [3]:
messages=pd.read_csv('SMSSpamCollection123.csv',sep='\t',names=['Label','Message'])
messages.head()

| | Label | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Exploratory Data Analysis

In [4]:
messages.isnull().sum()

Label      0
Message    0
dtype: int64

**There are no null values in this data**

In [5]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


This dataset contains 5,572 records in two columns.

**Checking the class balance of the target variable**

In [6]:
messages['Label'].value_counts()

Label
ham     4825
spam     747
Name: count, dtype: int64
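
With this much imbalance, accuracy alone can be misleading, so it helps to know what a trivial model would score. The sketch below takes the two counts from the output above and computes the class shares and the accuracy of always predicting the majority class. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Share of each class in the dataset.
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
print(f"Always predicting '{majority_label}' gives {baseline_accuracy:.1%} accuracy")
```

Any classifier trained on this data should be compared against that baseline, and per-class precision and recall are more informative than accuracy here.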

## Data Processing by Lemmatization and Removing StopWords

In [7]:
from nltk.stem import WordNetLemmatizer #lemmatization
from nltk.corpus import stopwords #to remove stopwords

In [8]:
lem=WordNetLemmatizer()

In [9]:
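import re
# Requires the NLTK 'stopwords' and 'wordnet' corpora (nltk.download) if they are not installed yet.
stop_words=set(stopwords.words('english')) # build the set once instead of once per word
corpus=[]
for i in range(len(messages)):
    clean=re.sub('[^a-zA-Z]',' ',messages['Message'][i]) # keep letters only
    clean=clean.lower()
    clean=clean.split()
    clean=[lem.lemmatize(word) for word in clean if word not in stop_words]
    clean=' '.join(clean)
    corpus.append(clean)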

In [10]:
corpus

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry wkly comp win fa cup final tkts st may text fa receive entry question std txt rate c apply',
 'u dun say early hor u c already say',
 'nah think go usf life around though',
 'freemsg hey darling week word back like fun still tb ok xxx std chgs send rcv',
 'even brother like speak treat like aid patent',
 'per request melle melle oru minnaminunginte nurungu vettam set callertune caller press copy friend callertune',
 'winner valued network customer selected receivea prize reward claim call claim code kl valid hour',
 'mobile month u r entitled update latest colour mobile camera free call mobile update co free',
 'gonna home soon want talk stuff anymore tonight k cried enough today',
 'six chance win cash pound txt csh send cost p day day tsandcs apply reply hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw'

In [11]:
len(corpus)

5572

### Tokenizing to Use in WORD2VEC

In [12]:
from nltk.tokenize import sent_tokenize
from gensim.utils import simple_preprocess # lowercases a sentence and splits it into a list of tokens for Word2Vec; very short and very long tokens are dropped

In [13]:
words=[]
for doc in corpus:
    # An empty document yields no sentences, so it adds nothing to words.
    for sent in sent_tokenize(doc):
        words.append(simple_preprocess(sent))

In [14]:
words

[['go',
  'jurong',
  'point',
  'crazy',
  'available',
  'bugis',
  'great',
  'world',
  'la',
  'buffet',
  'cine',
  'got',
  'amore',
  'wat'],
 ['ok', 'lar', 'joking', 'wif', 'oni'],
 ['free',
  'entry',
  'wkly',
  'comp',
  'win',
  'fa',
  'cup',
  'final',
  'tkts',
  'st',
  'may',
  'text',
  'fa',
  'receive',
  'entry',
  'question',
  'std',
  'txt',
  'rate',
  'apply'],
 ['dun', 'say', 'early', 'hor', 'already', 'say'],
 ['nah', 'think', 'go', 'usf', 'life', 'around', 'though'],
 ['freemsg',
  'hey',
  'darling',
  'week',
  'word',
  'back',
  'like',
  'fun',
  'still',
  'tb',
  'ok',
  'xxx',
  'std',
  'chgs',
  'send',
  'rcv'],
 ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aid', 'patent'],
 ['per',
  'request',
  'melle',
  'melle',
  'oru',
  'minnaminunginte',
  'nurungu',
  'vettam',
  'set',
  'callertune',
  'caller',
  'press',
  'copy',
  'friend',
  'callertune'],
 ['winner',
  'valued',
  'network',
  'customer',
  'selected',
  'receivea',
 

In [15]:
len(words)

5564

### Training Word2Vec on Our Own Data

In [16]:
model=Word2Vec(sentences=words)

In [17]:
model.corpus_count

5564

### Finding the Most Similar Words with the Custom-Trained Word2Vec

In [18]:
model.wv.similar_by_word('good')

[('go', 0.9995966553688049),
 ('give', 0.9995700716972351),
 ('day', 0.9995675086975098),
 ('going', 0.9995640516281128),
 ('night', 0.9995638132095337),
 ('much', 0.9995583891868591),
 ('said', 0.9995322823524475),
 ('last', 0.9994977116584778),
 ('keep', 0.9994969964027405),
 ('tomorrow', 0.9994962215423584)]

In [19]:
model.wv.most_similar('good')

[('go', 0.9995966553688049),
 ('give', 0.9995700716972351),
 ('day', 0.9995675086975098),
 ('going', 0.9995640516281128),
 ('night', 0.9995638132095337),
 ('much', 0.9995583891868591),
 ('said', 0.9995322823524475),
 ('last', 0.9994977116584778),
 ('keep', 0.9994969964027405),
 ('tomorrow', 0.9994962215423584)]
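
The scores returned by similar_by_word and most_similar are cosine similarities between word vectors. A minimal NumPy sketch of that measure follows; the three vectors are made up for illustration and do not come from the trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (made up for illustration, not taken from the trained model).
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])   # same direction as v1
v3 = np.array([-3.0, 0.0, 1.0])  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Scores above 0.999 for every neighbour, as in the output above, mean the vectors point in almost the same direction. That may indicate that a corpus of this size has not yet separated the words much.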

In [20]:
model.wv['good']

array([-0.16747005,  0.5280176 , -0.11291052,  0.04931507,  0.0734771 ,
       -0.6224802 ,  0.24211524,  0.8580509 , -0.22812477, -0.19257666,
       -0.36297548, -0.52935123, -0.07655388,  0.16812244,  0.04779557,
       -0.36579984,  0.04380327, -0.46777493,  0.05826885, -0.76208943,
        0.28214866,  0.21009125,  0.11275892, -0.20127667, -0.1205099 ,
        0.05997745, -0.4206151 , -0.35520408, -0.4266823 ,  0.12289354,
        0.5286177 ,  0.04583664,  0.12428921, -0.22407024, -0.07824904,
        0.5075391 , -0.09984374, -0.5084783 , -0.32321313, -0.8252997 ,
       -0.01313873, -0.40112892,  0.02736567,  0.01369124,  0.2932785 ,
       -0.13785768, -0.22313511, -0.12479357,  0.25137588,  0.1973697 ,
        0.20691857, -0.35402516,  0.03325241,  0.05162938, -0.35388333,
        0.15715623,  0.18133228,  0.08630298, -0.43526462,  0.00584675,
        0.11128578,  0.1379659 , -0.27679262, -0.13875282, -0.52701193,
        0.37492636,  0.14477754,  0.28593177, -0.5037354 ,  0.52

In [21]:
model.wv.index_to_key

['call',
 'get',
 'ur',
 'gt',
 'go',
 'lt',
 'day',
 'ok',
 'free',
 'know',
 'come',
 'like',
 'time',
 'good',
 'got',
 'love',
 'text',
 'want',
 'send',
 'need',
 'one',
 'txt',
 'today',
 'going',
 'stop',
 'home',
 'lor',
 'sorry',
 'see',
 'still',
 'mobile',
 'take',
 'back',
 'da',
 'reply',
 'dont',
 'think',
 'tell',
 'week',
 'phone',
 'hi',
 'new',
 'please',
 'later',
 'pls',
 'co',
 'msg',
 'min',
 'dear',
 'night',
 'make',
 'message',
 'well',
 'say',
 'thing',
 'much',
 'claim',
 'hope',
 'great',
 'oh',
 'hey',
 'give',
 'number',
 'happy',
 'friend',
 'wat',
 'work',
 'way',
 'yes',
 'www',
 'prize',
 'let',
 'right',
 'tomorrow',
 'already',
 'tone',
 'ask',
 'said',
 'win',
 'cash',
 'amp',
 'life',
 'yeah',
 'im',
 'really',
 'meet',
 'babe',
 'find',
 'miss',
 'morning',
 'last',
 'year',
 'service',
 'uk',
 'thanks',
 'care',
 'anything',
 'would',
 'com',
 'also',
 'nokia',
 'lol',
 'feel',
 'every',
 'keep',
 'pick',
 'sure',
 'urgent',
 'sent',
 'contact',


In [22]:
len(model.wv.index_to_key)

1603

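**The trained model's vocabulary contains 1,603 words. Word2Vec's default min_count=5 discards words that occur fewer than five times, so this is smaller than the number of unique words in the preprocessed corpus.**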
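
To see how a minimum-frequency threshold shapes a vocabulary, here is a small pure-Python sketch. The function build_vocab and the toy corpus are hypothetical; gensim's own vocabulary building is more involved, and this only mimics the effect of its min_count parameter.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=5):
    """Return words occurring at least `min_count` times, most frequent first."""
    freq = Counter(tok for doc in tokenized_docs for tok in doc)
    return [w for w, n in freq.most_common() if n >= min_count]

# Toy corpus (hypothetical): 'call' appears 5 times, 'free' 3 times, 'jurong' once.
docs = [["call", "free"], ["call", "free"], ["call", "free"],
        ["call"], ["call", "jurong"]]

print(build_vocab(docs, min_count=5))
print(build_vocab(docs, min_count=1))
```

Lowering the threshold keeps rare words in the vocabulary, at the cost of vectors learned from very few examples.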

In [23]:
len(model.wv['good'])

100

Word2Vec's default vector size is 100, so a single word such as 'good' is represented by a 100-dimensional vector.

**A Random Forest needs one fixed-length feature vector per message, but messages contain different numbers of words. We therefore use Average Word2Vec.**

### Average Word2Vec

Averaging the vectors of all words in a message gives a single 100-dimensional vector per message, whatever its length. This keeps the feature matrix small and simplifies the subsequent analysis.
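
The averaging step can be sketched without a trained model. The example below uses a plain dictionary of made-up 3-dimensional vectors in place of model.wv, and the helper average_vector is hypothetical. Unlike the notebook's function, it returns None when no token is known, so the caller can decide how to treat such messages.

```python
import numpy as np

def average_vector(tokens, vectors):
    """Mean of the vectors of known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # caller decides how to handle out-of-vocabulary messages
    return np.mean(known, axis=0)

# Toy 3-dimensional embeddings (made up for illustration).
toy_vectors = {
    "free": np.array([1.0, 0.0, 2.0]),
    "prize": np.array([3.0, 2.0, 0.0]),
}

print(average_vector(["free", "prize", "unknownword"], toy_vectors))
print(average_vector(["unknownword"], toy_vectors))
```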

In [24]:
import numpy as np

In [25]:
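def avgword2vec(words):
    # Average the vectors of the words that are in the model's vocabulary.
    # A message with no in-vocabulary words yields NaN (with a NumPy warning); such rows are dropped later.
    return np.mean([model.wv[word] for word in words if word in model.wv],axis=0)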

In [26]:
from tqdm import tqdm # to show progress in the for loop

In [27]:
x=[]
for i in tqdm(range(len(words))):
    x.append(avgword2vec(words[i]))

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
100%|███████████████████████████████████████████████████████████████████████████| 5564/5564 [00:00<00:00, 13619.36it/s]


In [28]:
len(x)

5564

In [29]:
len(x[0])

100

In [30]:
messages['Label'].shape

(5572,)

In [31]:
filtered_data = []

for i, j, k in zip(list(map(len, corpus)), corpus, messages['Message']):
    if i < 1:
        filtered_data.append([i, j, k])

        

In [32]:
filtered_data

[[0, '', 'What you doing?how are you?'],
 [0, '', 'Where @'],
 [0, '', '645'],
 [0, '', 'Can a not?'],
 [0, '', ':) '],
 [0, '', 'What you doing?how are you?'],
 [0, '', ':( but your not here....'],
 [0, '', ':-) :-)']]

In [33]:
missing_data=[[i,j,k] for i,j,k in zip(list(map(len,corpus)),corpus,messages['Message']) if i<1]

In [34]:
missing_data

[[0, '', 'What you doing?how are you?'],
 [0, '', 'Where @'],
 [0, '', '645'],
 [0, '', 'Can a not?'],
 [0, '', ':) '],
 [0, '', 'What you doing?how are you?'],
 [0, '', ':( but your not here....'],
 [0, '', ':-) :-)']]
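
These messages end up empty because the cleaning step keeps only letters and then removes stop words. The sketch below reproduces that effect with a small hand-picked stop-word set, which is an assumption for illustration; the notebook uses the full NLTK list and also lemmatizes, which is skipped here.

```python
import re

# Hand-picked subset of English stop words, for illustration only;
# the notebook uses the full NLTK list.
STOP_WORDS = {"what", "you", "doing", "how", "are", "where",
              "can", "a", "not", "but", "your", "here"}

def clean_message(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, and drop stop words (no lemmatization)."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

for raw in ["What you doing?how are you?", "Where @", "645", ":-) :-)"]:
    print(repr(raw), "->", repr(clean_message(raw)))
```

Messages made up only of digits, emoticons, or stop words carry no tokens after cleaning, which is why they have to be filtered out of both the features and the labels.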

In [35]:
y=messages[list(map(lambda x:len(x)>0,corpus))]

In [36]:
y

| | Label | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [37]:
len(y)

5564

In [38]:
y=pd.get_dummies(y['Label'],drop_first=True).astype(int)

In [39]:
y

| | spam |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 5567 | 1 |
| 5568 | 0 |
| 5569 | 0 |
| 5570 | 0 |


In [40]:
y.shape

(5564, 1)

In [41]:
y=y.squeeze()

In [42]:
y.shape

(5564,)

In [43]:
x[0].shape

(100,)

In [44]:
abc=x[0].reshape(1,-1)

In [45]:
abc.shape

(1, 100)

In [46]:
import numpy as np

In [47]:
import pandas as pd

In [48]:
df=pd.DataFrame()

In [49]:
for arr in x:
    row_data = arr.reshape(1, -1)
    row_df = pd.DataFrame(row_data)
    df = pd.concat([df, row_df], ignore_index=True)

# 'df' now contains reshaped rows from 'x'
print(df)


            0         1         2         3         4         5         6   \
0    -0.095470  0.305846 -0.061249  0.034911  0.050584 -0.365735  0.150975   
1    -0.075115  0.258995 -0.047877  0.030447  0.043019 -0.312640  0.132719   
2    -0.101703  0.327281 -0.062845  0.040512  0.058410 -0.385789  0.161780   
3    -0.138547  0.437719 -0.087945  0.052930  0.073324 -0.538764  0.217118   
4    -0.108123  0.354378 -0.075004  0.042045  0.056069 -0.420008  0.177713   
...        ...       ...       ...       ...       ...       ...       ...   
5559 -0.113476  0.373677 -0.078933  0.047612  0.068793 -0.446868  0.185608   
5560 -0.114579  0.385826 -0.072823  0.037595  0.061358 -0.463981  0.185261   
5561 -0.030767  0.133203 -0.017757  0.013275  0.015971 -0.155998  0.059102   
5562 -0.109704  0.344383 -0.070249  0.039860  0.057372 -0.409406  0.173871   
5563 -0.100251  0.342162 -0.068124  0.038994  0.049163 -0.407297  0.161493   

            7         8         9   ...        90        91    
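
Growing a DataFrame with pd.concat inside a loop copies the data on every iteration. A possible alternative, sketched below on random stand-in vectors, stacks everything once with np.vstack. It assumes every entry is a full-length vector, so any NaN placeholders for messages without in-vocabulary words would need to be filtered out first.

```python
import numpy as np
import pandas as pd

# Stand-in for the list `x`: five 100-dimensional vectors (random, illustrative).
rng = np.random.default_rng(0)
vectors = [rng.normal(size=100) for _ in range(5)]

# Stack once into a 2-D array, then wrap it in a DataFrame.
features = pd.DataFrame(np.vstack(vectors))
print(features.shape)
```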

## Each message is now represented by a single 100-dimensional vector, the average of its word vectors

In [50]:
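# Note: y still carries the original row labels of messages (with gaps where rows were filtered out),
# while df is indexed 0..5563, so this assignment aligns on index labels rather than on row position.
df['Output']=y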
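
When a Series is assigned as a DataFrame column, pandas matches rows by index label, not by position. The toy example below (made-up values, hypothetical names) shows the difference when the labels have a gap in their index, as y does here after the empty messages were filtered out.

```python
import pandas as pd

# Features with a fresh 0..2 index, as produced by ignore_index=True.
features = pd.DataFrame({"f0": [0.1, 0.2, 0.3]})

# Labels that kept their original index after row 1 was filtered out.
labels = pd.Series([0, 1, 0], index=[0, 2, 3], name="spam")

by_label = features.copy()
by_label["Output"] = labels                 # aligned on index labels

by_position = features.copy()
by_position["Output"] = labels.to_numpy()   # aligned on row position

print(by_label)
print(by_position)
```

In the label-aligned version, one row gets NaN and a later row receives a label that belongs to a different message. Assigning y.to_numpy(), or calling y.reset_index(drop=True) first, would be one way to keep features and labels matched by position.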

In [51]:
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | Output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.09547 | 0.305846 | -0.061249 | 0.034911 | 0.050584 | -0.365735 | 0.150975 | 0.513254 | -0.133396 | -0.112878 | ... | 0.160873 | -0.02122 | 0.06372 | 0.441863 | 0.257481 | 0.119576 | -0.252532 | 0.142366 | -0.057892 | 0.0 |
| 1 | -0.075115 | 0.258995 | -0.047877 | 0.030447 | 0.043019 | -0.31264 | 0.132719 | 0.442142 | -0.115083 | -0.096892 | ... | 0.131588 | -0.024497 | 0.051024 | 0.376089 | 0.220282 | 0.105146 | -0.217463 | 0.126968 | -0.052703 | 0.0 |
| 2 | -0.101703 | 0.327281 | -0.062845 | 0.040512 | 0.05841 | -0.385789 | 0.16178 | 0.545061 | -0.139432 | -0.112146 | ... | 0.172268 | -0.021625 | 0.062771 | 0.472345 | 0.27665 | 0.12229 | -0.272453 | 0.157384 | -0.065558 | 1.0 |
| 3 | -0.138547 | 0.437719 | -0.087945 | 0.05293 | 0.073324 | -0.538764 | 0.217118 | 0.753284 | -0.185684 | -0.164937 | ... | 0.234888 | -0.032318 | 0.095945 | 0.64032 | 0.378265 | 0.167897 | -0.368356 | 0.214447 | -0.089411 | 0.0 |
| 4 | -0.108123 | 0.354378 | -0.075004 | 0.042045 | 0.056069 | -0.420008 | 0.177713 | 0.600034 | -0.151341 | -0.127389 | ... | 0.187244 | -0.027798 | 0.073369 | 0.515252 | 0.299495 | 0.14382 | -0.292382 | 0.172246 | -0.074418 | 0.0 |


In [52]:
df.dropna(inplace=True)

In [53]:
df.shape

(5488, 101)

In [54]:

y=df['Output']
y.shape

(5488,)

In [55]:
x=df.drop('Output',axis=1)

In [56]:
x.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.09547 | 0.305846 | -0.061249 | 0.034911 | 0.050584 | -0.365735 | 0.150975 | 0.513254 | -0.133396 | -0.112878 | ... | 0.267309 | 0.160873 | -0.02122 | 0.06372 | 0.441863 | 0.257481 | 0.119576 | -0.252532 | 0.142366 | -0.057892 |
| 1 | -0.075115 | 0.258995 | -0.047877 | 0.030447 | 0.043019 | -0.31264 | 0.132719 | 0.442142 | -0.115083 | -0.096892 | ... | 0.229498 | 0.131588 | -0.024497 | 0.051024 | 0.376089 | 0.220282 | 0.105146 | -0.217463 | 0.126968 | -0.052703 |
| 2 | -0.101703 | 0.327281 | -0.062845 | 0.040512 | 0.05841 | -0.385789 | 0.16178 | 0.545061 | -0.139432 | -0.112146 | ... | 0.277142 | 0.172268 | -0.021625 | 0.062771 | 0.472345 | 0.27665 | 0.12229 | -0.272453 | 0.157384 | -0.065558 |
| 3 | -0.138547 | 0.437719 | -0.087945 | 0.05293 | 0.073324 | -0.538764 | 0.217118 | 0.753284 | -0.185684 | -0.164937 | ... | 0.38699 | 0.234888 | -0.032318 | 0.095945 | 0.64032 | 0.378265 | 0.167897 | -0.368356 | 0.214447 | -0.089411 |
| 4 | -0.108123 | 0.354378 | -0.075004 | 0.042045 | 0.056069 | -0.420008 | 0.177713 | 0.600034 | -0.151341 | -0.127389 | ... | 0.308042 | 0.187244 | -0.027798 | 0.073369 | 0.515252 | 0.299495 | 0.14382 | -0.292382 | 0.172246 | -0.074418 |


In [57]:
x.shape

(5488, 100)

## Random Forest Classifier

In [58]:

from sklearn.model_selection import train_test_split

In [59]:
X_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20,random_state=42)
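
Because spam is the minority class, a stratified split can help keep the spam share similar in the training and test sets. The sketch below uses the stratify argument of train_test_split on synthetic stand-in data; the arrays and the 13% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows with 13% positives (illustrative only).
X = np.arange(2000).reshape(1000, 2)
y = np.array([0] * 870 + [1] * 130)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(f"spam share in train: {y_tr.mean():.3f}")
print(f"spam share in test:  {y_te.mean():.3f}")
```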

In [60]:
import sklearn

In [61]:
from sklearn.ensemble import RandomForestClassifier

In [62]:
rf=RandomForestClassifier()

In [63]:
rf.fit(X_train,y_train)

In [64]:
ypred=rf.predict(x_test)

In [65]:
from sklearn.metrics import classification_report,accuracy_score

In [66]:
print(classification_report(y_test,ypred))

              precision    recall  f1-score   support

         0.0       0.87      0.99      0.93       956
         1.0       0.19      0.02      0.04       142

    accuracy                           0.86      1098
   macro avg       0.53      0.50      0.48      1098
weighted avg       0.78      0.86      0.81      1098



In [67]:
print(accuracy_score(y_test,ypred))

0.8615664845173042


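**The accuracy is about 86%, but this is not a good result.** Always predicting ham would score 956/1098 ≈ 87% on this test set, and the model finds only 2% of the spam messages (recall 0.02). A likely cause is the step df['Output']=y, which aligns labels by index rather than by row position, so many labels may be attached to the wrong feature rows.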
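
The report above can be checked against a majority-class baseline with a few lines of arithmetic. The confusion-matrix counts below are an assumption: they are the integer counts consistent with the printed precision, recall, support, and accuracy, not values read from the notebook.

```python
# Confusion-matrix counts inferred from the rounded report (an assumption,
# chosen to be consistent with the printed metrics and supports).
tn, fp = 943, 13   # actual ham: predicted ham / predicted spam
fn, tp = 139, 3    # actual spam: predicted ham / predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
spam_recall = tp / (tp + fn)
spam_precision = tp / (tp + fp)
baseline = (tn + fp) / total  # accuracy of always predicting ham

print(f"accuracy        {accuracy:.4f}")
print(f"spam recall     {spam_recall:.2f}")
print(f"spam precision  {spam_precision:.2f}")
print(f"ham-only model  {baseline:.4f}")
```

Under these counts the classifier scores slightly below the ham-only baseline, which is why spam recall and F1 are better yardsticks than accuracy for this task.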

## Hyperparameter Tuning

**Let's see whether GridSearchCV can improve the accuracy.**

In [68]:
from sklearn.model_selection import GridSearchCV

In [69]:
param_grid_rf={'n_estimators': [100,200,300],
              'max_depth':[None, 5, 10],
              'min_samples_split':[2,5,10]}

In [70]:
rf_grid_search= GridSearchCV(rf, param_grid_rf, cv=5)
model_rf= rf_grid_search

In [71]:
model_rf.fit(X_train, y_train)

In [72]:
rf_best_params = rf_grid_search.best_params_

rf_best_score = rf_grid_search.best_score_

print('Random Forest best parameters:',rf_best_params)
print('Random Forest best score:',rf_best_score)

Random Forest best parameters: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 100}
Random Forest best score: 0.8642369020501139



### After hyperparameter tuning with GridSearchCV, the best mean cross-validated accuracy is 86.42%, still close to what always predicting ham would achieve
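
GridSearchCV scores a classifier by accuracy unless told otherwise, which favours the majority class on imbalanced data. One option, sketched below on synthetic data, is to search on F1 for the spam class and to weight the classes. The data, the reduced grid, and the choice of class_weight="balanced" are assumptions for illustration and were not run on the notebook's features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the message vectors (illustrative).
X, y = make_classification(n_samples=400, n_features=20, weights=[0.87, 0.13],
                           random_state=42)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [None, 5]},
    scoring="f1",   # F1 of the positive (spam) class instead of accuracy
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.3f}")
```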
 

## Pre-trained Google News Word2Vec (trained on about 100 billion words, with a vocabulary of 3 million words and phrases)

In [73]:
import gensim #open source library for NLP
from gensim.models import Word2Vec 

In [74]:
import gensim.downloader as api

In [75]:
pre_trn_wv=api.load('word2vec-google-news-300') #loading pretrained model which is around 2 gb file

In [76]:
bird=pre_trn_wv['bird']

In [77]:
len(bird)

300

Each word vector in the pretrained model has 300 dimensions.

## Finding similar words and solving analogies with the pretrained Word2Vec

In [79]:
similar_words = pre_trn_wv.most_similar('king', topn=10)
similar_words

[('kings', 0.7138045430183411),
 ('queen', 0.6510956287384033),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204219460487366),
 ('prince', 0.6159993410110474),
 ('sultan', 0.5864822864532471),
 ('ruler', 0.5797566771507263),
 ('princes', 0.5646552443504333),
 ('Prince_Paras', 0.5432944297790527),
 ('throne', 0.5422104597091675)]

In [80]:
analogy_result = pre_trn_wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
analogy_result

[('queen', 0.7118192911148071)]
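
The analogy query can be read as vector arithmetic: add the positive words, subtract the negative ones, and look for the nearest remaining word. The sketch below is a simplified version on made-up 3-dimensional vectors. gensim's most_similar also normalizes the vectors first, which is omitted here, and the helper analogy is hypothetical.

```python
import numpy as np

# Toy embeddings with axes (royalty, male, female); values are made up.
toy = {
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.8, 1.0, 0.0]),
}

def analogy(positive, negative, vectors):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)  # query words are not valid answers

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = {w: cosine(v, target) for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=candidates.get)

print(analogy(["king", "woman"], ["man"], toy))
```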

## Conclusion 

By combining NLP preprocessing, Word2Vec embeddings, and a Random Forest classifier, this project aims to provide an effective solution for email spam detection, helping individuals and organizations manage their correspondence securely and efficiently.