# SPAM Detection with Natural Language Processing

In this project I try to predict if an email should be classified as spam or not.

In [30]:
import pandas as pd

In [31]:
data = pd.read_csv('spam.csv')

data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [32]:
data.loc[data['Category']=='ham','Category'] = 0
data.loc[data['Category']=='spam','Category'] = 1
data['Category'] = data['Category'].astype(int)

In [33]:
message = data['Message']

category = data['Category']

We save the message column to a variable called **message** and the category to a variable called **category**.

In [34]:
message.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Message, dtype: object

In [35]:
category.head()

0    0
1    0
2    1
3    0
4    0
Name: Category, dtype: int32

## Pre Processing
Now in this section we have to process the data by:
1. Converting all the rows to lower case.
2. Removing stop words like i, me , you, our, your etc
3. Removing hyperlinks,numbers,punctuations etc.

Now we import the nltk library. NLTK is a toolkit build for working with NLP in Python. It provides us various text processing libraries with a lot of test datasets.

In [36]:
import nltk
import re
import string

In [37]:
nltk.download('stopwords')

stop_words = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to C:\Users\IFEANYI
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We download the stopwords we want to remove from the dataset.

In [38]:
nltk.download('punkt')

from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to C:\Users\IFEANYI
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [39]:
def pre_process(txt):
    lowered_text = txt.lower()
    
    removed_numbers = re.sub(r'\d+','',lowered_text) # re. is for regular expressions. Substitutes digits with an empty string.
    
    removed_punctuation = removed_numbers.translate(str.maketrans('','',string.punctuation)) # This removes punctuation from the text and replaces it with an empty string
    
    # now we split the text to obtain tokens and then remove the stopwords.
    
    word_tokens = word_tokenize(removed_punctuation)
    
    processed_text = ''.join([word for word in word_tokens if word not in stop_words])
    
    return processed_text

In [40]:
processed = message.apply(pre_process) #.apply applies a function across a pandas dataframe.

processed

0       gojurongpointcrazyavailablebugisngreatworldlae...
1                                      oklarjokingwifuoni
2       freeentrywklycompwinfacupfinaltktsstmaytextfar...
3                             udunsayearlyhorucalreadysay
4                    nahdontthinkgoesusflivesaroundthough
                              ...                        
5567    ndtimetriedcontactuu£poundprizeclaimeasycallpp...
5568                               übgoingesplanadefrhome
5569                             pitymoodsoanysuggestions
5570    guybitchingactedlikeidinterestedbuyingsomethin...
5571                                         rofltruename
Name: Message, Length: 5572, dtype: object

We have now processed the text but we still need to tokenize it.

In [41]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

input_data = vectorizer.fit_transform(processed)
input_data

<5572x5315 sparse matrix of type '<class 'numpy.int64'>'
	with 5965 stored elements in Compressed Sparse Row format>

We have now created our sparse matrix with number of reviews as rows(5572) and all the words in the dataset as columns after removing the stopwords(5965)

In [42]:
print(input_data)

  (0, 1304)	1
  (1, 3121)	1
  (2, 1130)	1
  (3, 4467)	1
  (4, 2804)	1
  (5, 1142)	1
  (5, 3497)	1
  (6, 1015)	1
  (7, 3278)	1
  (8, 4922)	1
  (8, 3436)	1
  (9, 2729)	1
  (10, 2065)	1
  (11, 3829)	1
  (12, 4594)	1
  (12, 3435)	1
  (13, 2178)	1
  (14, 696)	1
  (15, 5027)	1
  (16, 2999)	1
  (17, 971)	1
  (18, 1087)	1
  (18, 4080)	1
  (18, 4079)	1
  (19, 988)	1
  :	:
  (5548, 4416)	1
  (5549, 2338)	1
  (5550, 626)	1
  (5551, 4870)	1
  (5552, 3628)	1
  (5553, 1519)	1
  (5554, 4843)	1
  (5555, 5118)	1
  (5556, 5178)	1
  (5557, 2648)	1
  (5558, 3905)	1
  (5559, 183)	1
  (5560, 158)	1
  (5561, 1229)	1
  (5562, 3135)	1
  (5563, 178)	1
  (5564, 886)	1
  (5565, 1962)	1
  (5566, 3556)	1
  (5567, 2817)	1
  (5567, 3390)	1
  (5568, 5273)	1
  (5569, 3301)	1
  (5570, 1484)	1
  (5571, 3593)	1


Now we can feed the matrix to a machine learning model. In this case we'll use the Logistic Regression model since we are trying to classify it into positive or negative.

In [43]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(input_data, category)

LogisticRegression()

In [60]:
def prediction_input(sentence):
    processed = pre_process(sentence)
    input_data = vectorizer.transform([processed])
    prediction = model.predict(input_data)
    
    if (prediction[0] == 0):
        print('This is Spam.')
    else:
        print('This is not Spam.')

In [61]:
prediction_input("This is meant to be today")

This is Spam.


In [62]:
prediction_input("Send and recieve emails")

This is Spam.
