# Spam Detection Using NLP and Neural Networks

1. **The spam file is a collection of emails, each categorized as spam or not spam.**
2. **Spam mails can contain malware files, promotions, unsecured links, etc.**
3. **The given dataset consists of a text and a response; Natural Language Processing is used to convert the text into vectors.**
4. **Build a model using a neural network (ANN or CNN) to predict whether a mail is spam or not.**
5. **Spam mails can be harmful (malware, etc.); predicting early whether a mail is spam helps avoid malware attacks.**

In [2]:
# Libraries for analysing the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from sklearn.manifold import TSNE
warnings.filterwarnings('ignore')

# Libraries for NLP

import nltk
import string

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

import re

# [1.1] Exploratory Data Analysis

In [2]:
spam = pd.read_csv('SpamCollection', sep = '\t', header = None)
spam.columns = ['Response', 'Message']
spam.head()

| | Response | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
spam.shape

(5572, 2)

In [4]:
spam.Response.value_counts()

ham     4825
spam     747
Name: Response, dtype: int64
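
The counts above show that the two classes are imbalanced. As a quick worked calculation, using only the two counts printed above, the spam share and the accuracy of a trivial "always ham" predictor can be computed as follows. The variable names are illustrative and not part of the notebook.

```python
# Counts copied from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total          # fraction of messages labelled spam
baseline_accuracy = ham_count / total    # accuracy of always predicting ham

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.1%}")
```

A model trained on this data is more usefully compared against that baseline than against 50%.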

In [5]:
# Encoding converts string labels to numbers, because ML models cannot work with string values.
# Here 'ham' is encoded as 0 and 'spam' as 1.

def Encoding(x):
    if x == 'ham':
        return 0
    if x == 'spam':
        return 1

In [6]:
spam['Response'] = spam.Response.map(Encoding)
spam.head()

| | Response | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
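
The `Encoding` function returns nothing for any label other than 'ham' or 'spam', so an unexpected label would pass through silently as a missing value. A minimal sketch of an equivalent dictionary-based mapping that makes such rows easy to spot is shown here; the toy series and the misspelled label are hypothetical and only for illustration.

```python
import pandas as pd

# Toy labels; 'harm' is a deliberately misspelled label for illustration
labels = pd.Series(['ham', 'spam', 'ham', 'harm'])

# Map known labels to integers; anything not in the dict becomes NaN
encoded = labels.map({'ham': 0, 'spam': 1})

# Collect the rows whose label was not recognised
unknown = labels[encoded.isna()]
print(encoded.tolist())
print(unknown.tolist())
```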


# [1.2] Text Processing

1. **Remove HTML tags; tags like <br /> are meaningless for spam detection.**
2. **Clean punctuation such as !, @, #, . and , because these characters are noise when converting text to vectors.**
3. **Check that each word in each mail/document is alphabetic; alphanumeric tokens like "Mani1234@" should be removed.**
4. **Keep only words of length greater than 2, the assumption being that no adjective is 2 characters or shorter.**
5. **Convert upper case to lower case for each word in a mail/document to avoid duplicate words.**
6. **Remove stop words from the mails; NLTK's English stop-word list contains 179 words such as I, we, the, and.**
7. **Stemming reduces related word forms to a common stem (for example, tasty becomes tasti), which decreases the size of the vector.**
8. **Lemmatization reduces each word to its dictionary form (lemma). Keeping multi-word names such as New York or Andhra Pradesh together is a separate tokenization concern.**
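
The steps above can be sketched end to end with the standard library alone. This is a simplified illustration, not the notebook's implementation: the stop-word set is a tiny hypothetical stand-in for NLTK's list, the function name `preprocess` is invented here, and stemming (step 7) is left out because it needs NLTK.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
STOP_WORDS = {"i", "we", "the", "and", "it", "is", "a"}

def preprocess(text):
    text = re.sub(r'<.*?>', '', text)            # step 1: drop HTML tags
    text = re.sub(r'[?!\'"#.,()|/]', '', text)   # step 2: drop punctuation
    kept = []
    for word in text.split():
        lowered = word.lower()                   # step 5: lower-case
        # steps 3, 4, 6: alphabetic only, longer than 2 chars, not a stop word
        if word.isalpha() and len(word) > 2 and lowered not in STOP_WORDS:
            kept.append(lowered)
    return " ".join(kept)

sample = "Free entry!<br /> Call 08452810075 now, it is a WIN"
print(preprocess(sample))
```

For the sample sentence this keeps `free entry call now win`: the phone number fails the alphabetic check, and `it`, `is` and `a` are dropped by the length filter before the stop-word check even matters.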

In [7]:
# Downloading stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mani\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
# We use the re module to remove noise from the documents; 're' stands for regular expression.
# re.compile builds a pattern object, and re.sub replaces every match of the pattern with a given string.

stop = set(stopwords.words('english'))
print("Number of stop words in English are ", len(stop))


# Initializing the snowball stemmer
sno = SnowballStemmer('english')
word = "tasty"
print("\nStem word for ",word," is ", sno.stem(word).encode('utf8'))

Number of stop words in English are  179

Stem word for  tasty  is  b'tasti'


###  Cleaning HTML tags

1. **Define a function to clean HTML tags.**
2. **The re.compile method compiles the pattern to search for in the data.**
3. **The re.sub method replaces each match of the pattern with the string specified by the user.**

In [9]:
def clean_html(x):
    cleaned = re.compile('<.*?>')
    cleaned_text = re.sub(cleaned, r'', x)
    return cleaned_text

In [10]:
spam['Message'] = spam.Message.map(clean_html)
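
The `?` in the pattern `<.*?>` matters: it makes the match non-greedy, so each tag is removed on its own. A small sketch on a made-up string shows the difference from the greedy form.

```python
import re

text = "Hello<br />world <b>now</b>"

lazy = re.compile(r'<.*?>')    # non-greedy: stops at the first '>'
greedy = re.compile(r'<.*>')   # greedy: runs to the last '>' on the line

print(lazy.sub('', text))
print(greedy.sub('', text))
```

The non-greedy pattern leaves `Helloworld now`, while the greedy one deletes everything from the first `<` to the last `>` and leaves only `Hello`. Note that replacing a tag with an empty string joins `Hello` and `world`; substituting a space instead would keep them apart.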

###  Cleaning punctuations

In [11]:
def clean_punc(x):
    # Remove common punctuation characters in two passes
    cleaned = re.sub(r'[?|!|\'|"|#]', r'', x)
    cleaned = re.sub(r'[.|,|)|(|\|/]', r'', cleaned)
    return cleaned
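
Two details of these patterns are worth seeing in isolation. Inside a character class `[...]` the `|` is an ordinary character rather than alternation, and deleting punctuation outright can fuse neighbouring words. The sketch below uses a fragment of the third sample message; the space-substituting variant is an alternative shown for comparison, not what the notebook does.

```python
import re

# '|' inside a character class is an ordinary character
print(re.sub(r'[?|!]', '', 'a|b?c!'))

# Deleting punctuation outright can fuse neighbouring words ...
fused = re.sub(r'[.|,|)|(|\|/]', '', 'question(std txt rate)')
# ... whereas substituting a space keeps them separate
spaced = re.sub(r'[.,)(|/]', ' ', 'question(std txt rate)')
print(fused)
print(spaced.split())
```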

In [12]:
for sent in spam.Message.values[:5]:
    sent = clean_html(sent)
    print(sent)

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Ok lar... Joking wif u oni...
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
U dun say so early hor... U c already then say...
Nah I don't think he goes to usf, he lives around here though


In [13]:
spam['Message'] = spam.Message.map(clean_punc)

In [14]:
spam.head()

| | Response | Message |
|---|---|---|
| 0 | 0 | Go until jurong point crazy Available only in ... |
| 1 | 0 | Ok lar Joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor U c already then say |
| 4 | 0 | Nah I dont think he goes to usf he lives aroun... |


### Filtering sentences: removing stop words, keeping alphabetic words, stemming, and converting to lower case

In [15]:
def Filter(x):
    # Keep alphabetic words longer than 2 characters that are not stop words,
    # then lower-case and stem them.
    filtered_sentence = []
    for word in x.split():
        if word.isalpha() and len(word) > 2 and word.lower() not in stop:
            filtered_sentence.append(sno.stem(word.lower()))
    return " ".join(filtered_sentence)

In [16]:
spam['Message'] = spam.Message.map(Filter)

In [17]:
spam.head()

| | Response | Message |
|---|---|---|
| 0 | 0 | jurong point crazi avail bugi great world buff... |
| 1 | 0 | lar joke wif oni |
| 2 | 1 | free entri wkli comp win cup final tkts may te... |
| 3 | 0 | dun say earli hor alreadi say |
| 4 | 0 | nah dont think goe usf live around though |


In [18]:
type(spam.Message.values[0].split()[0])

str

###  Ham mail words and Spam mail words

1. **After text processing the document is clean, and the remaining words are sufficient to convert the text into a vector.**
2. **Find the words that occur in ham and spam mails and store them in a ham list and a spam list.**

In [19]:
def Ham_Spam_Words(x, y):
    # Collect every word of each message into the ham or spam list by its label
    Ham_words = []
    Spam_words = []
    for sentence, label in zip(x, y.values):
        if label == 1:
            Spam_words.extend(sentence.split())
        elif label == 0:
            Ham_words.extend(sentence.split())
    return (Ham_words, Spam_words)

In [20]:
# Call the function once and unpack both lists
Ham_list, Spam_list = Ham_Spam_Words(spam.Message, spam.Response)

**Finding the most frequently used words in ham mails and spam mails.**

In [21]:
freq_dist_ham = nltk.FreqDist(Ham_list)
freq_dist_spam = nltk.FreqDist(Spam_list)
print("Top 20 words occurs in Ham mails are :: ", freq_dist_ham.most_common(20))
print("\nTop 20 words occurs in Spam mails are :: ", freq_dist_spam.most_common(20))

Top 20 words occurs in Ham mails are ::  [('get', 359), ('come', 295), ('call', 288), ('dont', 263), ('like', 244), ('know', 244), ('ill', 240), ('love', 234), ('got', 232), ('good', 225), ('time', 217), ('want', 213), ('day', 213), ('need', 176), ('one', 171), ('go', 166), ('home', 160), ('lor', 160), ('see', 153), ('sorri', 153)]

Top 20 words occurs in Spam mails are ::  [('call', 363), ('free', 215), ('text', 137), ('txt', 137), ('mobil', 135), ('claim', 115), ('stop', 115), ('repli', 109), ('prize', 94), ('get', 87), ('week', 85), ('tone', 73), ('servic', 72), ('send', 70), ('new', 69), ('nokia', 68), ('award', 66), ('cash', 62), ('urgent', 62), ('contact', 61)]
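
The two top-20 lists printed above can be compared directly. The sketch below copies the words from that output (counts omitted) and intersects them; the variable names are illustrative.

```python
# Top-20 words copied from the two lists printed above (counts omitted)
ham_top = ['get', 'come', 'call', 'dont', 'like', 'know', 'ill', 'love',
           'got', 'good', 'time', 'want', 'day', 'need', 'one', 'go',
           'home', 'lor', 'see', 'sorri']
spam_top = ['call', 'free', 'text', 'txt', 'mobil', 'claim', 'stop',
            'repli', 'prize', 'get', 'week', 'tone', 'servic', 'send',
            'new', 'nokia', 'award', 'cash', 'urgent', 'contact']

shared = sorted(set(ham_top) & set(spam_top))          # frequent in both classes
spam_only = [w for w in spam_top if w not in ham_top]  # frequent only in spam
print(shared)
print(spam_only[:5])
```

Only `call` and `get` appear in both lists, which is why single words such as these carry little signal on their own.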


1. **Certain words occur frequently in both ham and spam mails.**
2. **It is suspicious that the same words (for example call and get) rank highly in both classes, so single words alone do not separate them well.**
3. **To capture more context around each word, we apply the n-grams technique.**
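
As a minimal sketch of what n-grams are, the helper below (a hypothetical function, not part of scikit-learn) lists the unigrams and bigrams of one processed message. Passing `ngram_range = (1, 2)` to `CountVectorizer` asks for this same combination of single words and adjacent word pairs as features.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "lar joke wif oni".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams + bigrams)
```

A four-word message therefore contributes seven features: four unigrams and three bigrams.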

In [22]:
# Bi-grams, tri-grams and n-grams

# Words like 'not' and 'very' are also stop words, but they play a crucial role in a sentence or document

count_vect = CountVectorizer(ngram_range = (1, 2))
final_bigram_counts = count_vect.fit_transform(spam.Message.values)

In [23]:
print("Shape of Countvector for bi-gram words ",final_bigram_counts.get_shape())
print("\n")
final_bigram_counts.toarray()

Shape of Countvector for bi-gram words  (5572, 33031)




array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

# TF-IDF (Term Frequency and Inverse Document Frequency)

In [24]:
tfidf_vect = TfidfVectorizer( ngram_range = (1,2))

final_tfidf = tfidf_vect.fit_transform(spam.Message.values)

In [25]:
print("Shape of TfidfVectorizer is ", final_tfidf.get_shape())
print("\n")
final_tfidf.toarray()

Shape of TfidfVectorizer is  (5572, 33031)




array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
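
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on three toy documents. It uses raw term counts, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and L2 normalisation, which to my understanding matches the defaults of `TfidfVectorizer`; the toy documents and function name are assumptions for illustration, and only unigrams are used.

```python
import math
from collections import Counter

docs = [["lar", "joke", "wif", "oni"],
        ["free", "entri", "win"],
        ["joke", "win"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # raw term count times smoothed idf
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    # L2-normalise so that each document vector has length 1
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf(docs[0])
print({t: round(w, 3) for t, w in row.items()})
```

In the first document, `joke` also occurs in another document and so receives a lower weight than `lar`, `wif` and `oni`, which occur only here.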

In [26]:
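# Feature names of the vector after the text is converted to a vector
# (get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names())
feature_names = tfidf_vect.get_feature_names_out()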
len(feature_names)

33031

In [27]:
# Printing the row at index 3 of the tf-idf matrix, converting it from sparse to a dense array

print(type(final_tfidf[3,:]))
print("\n Third vector of the tf-idf ", final_tfidf[3,:].toarray())

<class 'scipy.sparse.csr.csr_matrix'>

 Third vector of the tf-idf  [[0. 0. 0. ... 0. 0. 0.]]


In [28]:
# printing top n features with corresponding tfidf values for certain row

def top_tfidf_feat(row, feature, top_n):
    sort = np.argsort(row)[::-1][:top_n]
    top_feat = [(feature[i], row[i]) for i in sort]
    df = pd.DataFrame(top_feat)
    df.columns = ["Feature_name", "Tf-Idf_Values"]
    return df

In [29]:
top_feat = top_tfidf_feat(final_tfidf[1,:].toarray()[0], feature_names, 25)
top_feat

| | Feature_name | Tf-Idf_Values |
|---|---|---|
| 0 | joke wif | 0.432939 |
| 1 | lar joke | 0.432939 |
| 2 | wif oni | 0.432939 |
| 3 | oni | 0.388529 |
| 4 | joke | 0.329215 |
| 5 | wif | 0.306793 |
| 6 | lar | 0.290229 |
| 7 | üll take | 0.0 |
| 8 | gonna kill | 0.0 |
| 9 | gonna leav | 0.0 |
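
The rows with a value of 0.0 in the table above are padding: the message has only seven non-zero features, but more rows were requested, so the sort falls through to features that do not occur in the message at all. A small sketch of the same `argsort` idiom with a non-zero filter follows; the vocabulary and values are hypothetical.

```python
import numpy as np

features = ["gonna kill", "joke", "lar", "oni", "wif"]   # hypothetical vocabulary
row = np.array([0.0, 0.33, 0.29, 0.39, 0.31])            # hypothetical tf-idf row

order = np.argsort(row)[::-1]          # indices from largest to smallest value
top = [(features[i], float(row[i])) for i in order if row[i] > 0][:3]
print(top)
```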


# Word2Vec

In [30]:
word2vec_model = Word2Vec(spam.Message.values, min_count = 3, size = 50)

In [31]:
# list of words or features in word2vec after transforming to vector 

words = list(word2vec_model.wv.vocab)
len(words)

28
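
A vocabulary of only 28 entries is a sign that something went wrong. `Word2Vec` expects each sentence to be a list of tokens, but `spam.Message.values` holds plain strings, and iterating over a string yields single characters. The sketch below shows that effect with plain Python on two processed messages; it does not call gensim.

```python
messages = ["lar joke wif oni", "dun say earli hor alreadi say"]

# Iterating over a string yields characters, which is what a model sees
# when it is handed raw strings as "sentences"
char_vocab = {token for msg in messages for token in msg}

# Splitting first yields words
word_vocab = {token for msg in messages for token in msg.split()}

print(sorted(char_vocab)[:5])
print(sorted(word_vocab))
```

The cells that follow fix this by splitting every message into a list of words before training.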

In [28]:
# Tokenize each message into a list of words
list_of_sentence = [sent.split() for sent in spam.Message.values]

In [29]:
len(list_of_sentence)

5572

In [30]:
print(spam.Message.values[0])
print("\n")
print("=================PROCESSED====================")
print("\n")
print(list_of_sentence[0])

jurong point crazi avail bugi great world buffet cine got amor wat




['jurong', 'point', 'crazi', 'avail', 'bugi', 'great', 'world', 'buffet', 'cine', 'got', 'amor', 'wat']


In [91]:
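# gensim 3.x API; in gensim 4.x `size` is named `vector_size` and `wv.vocab` is replaced by `wv.key_to_index`
word2vec_model_2 = Word2Vec(list_of_sentence, min_count = 1, size = 1)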

In [92]:
words = list(word2vec_model_2.wv.vocab)
len(words)

6291

In [94]:
vectors = word2vec_model_2.wv.vectors

In [95]:
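# Map each word to its 1-dimensional embedding value.
# Look the vector up by word: the order of wv.vocab is not guaranteed
# to match the row order of wv.vectors.
words_dict = {}
for word in words:
    words_dict[word] = word2vec_model_2.wv[word][0]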

In [107]:
data = np.zeros((5572, 6291))
df = pd.DataFrame(data, columns = words)
for row, sent in enumerate(list_of_sentence):
    for word in sent:
        df.loc[row, word] = words_dict[word]
df.loc[0, 'jurong']

-4.800270080566406
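
Filling a DataFrame one cell at a time with `df.loc` works but is slow at this size. A minimal sketch of the same idea using a word-to-column lookup and a NumPy array is given here on toy data; the sentences and embedding values are hypothetical. The last lines also work out how much memory the full 5572 x 6291 dense matrix of 8-byte floats needs.

```python
import numpy as np

sentences = [["free", "win"], ["call", "free"]]            # toy tokenized messages
word_value = {"free": 0.5, "win": -1.2, "call": 0.3}       # hypothetical 1-D embeddings

vocab = sorted(word_value)
column = {word: j for j, word in enumerate(vocab)}         # word -> column index

matrix = np.zeros((len(sentences), len(vocab)))
for i, sent in enumerate(sentences):
    for word in sent:
        matrix[i, column[word]] = word_value[word]

# Memory needed by the full-size dense float64 matrix from the cell above
full_size_mb = 5572 * 6291 * 8 / 1e6
print(matrix)
print(round(full_size_mb), "MB")
```

At roughly 280 MB for a matrix that is almost entirely zeros, a sparse representation may be worth considering.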

In [36]:
# Words most similar to 'holiday' according to the model

word2vec_model_2.wv.most_similar('holiday')

[('last', 0.9997702240943909),
 ('collect', 0.9997590780258179),
 ('start', 0.9997560977935791),
 ('way', 0.9997544288635254),
 ('cash', 0.9997499585151672),
 ('mobil', 0.9997413158416748),
 ('friend', 0.9997363686561584),
 ('call', 0.9997330904006958),
 ('text', 0.9997284412384033),
 ('cos', 0.9997254014015198)]

In [37]:
word2vec_model_2.wv.most_similar('buffet')

[('strip', 0.9282799363136292),
 ('linerent', 0.9249205589294434),
 ('handsom', 0.9223565459251404),
 ('iouri', 0.9219058752059937),
 ('fli', 0.9214571118354797),
 ('decis', 0.9210811853408813),
 ('heater', 0.9207569360733032),
 ('bodi', 0.9196411371231079),
 ('affair', 0.9193337559700012),
 ('maintain', 0.9190778732299805)]
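
`most_similar` ranks words by the cosine similarity between their vectors. A small standard-library sketch of that measure follows (the function name is illustrative). It also shows that for 1-dimensional vectors the cosine can only be +1 or -1, so a model trained with `size = 1` cannot produce graded scores; the values listed above presumably come from a run with a larger vector size, which is an inference from the numbers rather than something the notebook states.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# With 1-dimensional vectors only the sign matters
print(cosine([2.0], [5.0]))
print(cosine([2.0], [-3.0]))

# With more dimensions, intermediate values become possible
print(round(cosine([1.0, 0.0], [1.0, 1.0]), 4))
```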

# Average word2vec and TF_IDF * word2vec