<h2>Importing the Libraries</h2>

In [1]:
import pandas as pd                                             # For data loading and exploration
import numpy as np                                              # To create arrays
import nltk                                                     # For text pre-processing
import re                                                       # For regex-based text cleaning
from nltk.tokenize import word_tokenize                         # Tokenize text into words
from nltk.stem import PorterStemmer                             # Reduce each word to its stem
from sklearn.feature_extraction.text import CountVectorizer     # Create the Bag of Words
from sklearn.model_selection import train_test_split            # Split data into training and testing sets
from sklearn.naive_bayes import MultinomialNB                   # Multinomial Naive Bayes classifier
from sklearn.metrics import accuracy_score                      # Compute accuracy

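from nltk.corpus import stopwords                               # English stop-word list
from string import punctuation                                  # ASCII punctuation characters
# The stop-word list and the tokenizer need NLTK data. If it is missing, run
# nltk.download('stopwords') and nltk.download('punkt') once (or download whichever resource the LookupError names).
trashwords = stopwords.words('english')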

## Step 1
Import your data into the program and display it <br>
Task: Load dataset and display dataset

In [19]:
df = pd.read_csv('emails.csv')
df.head()

Unnamed: 0,text,spam,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109
0,Subject: naturally irresistible your corporate...,1,,,,,,,,,...,,,,,,,,,,
1,Subject: the stock trading gunslinger fanny i...,1,,,,,,,,,...,,,,,,,,,,
2,Subject: unbelievable new homes made easy im ...,1,,,,,,,,,...,,,,,,,,,,
3,Subject: 4 color printing special request add...,1,,,,,,,,,...,,,,,,,,,,
4,"Subject: do not have money , get software cds ...",1,,,,,,,,,...,,,,,,,,,,
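The output above shows more than a hundred empty `Unnamed` columns next to `text` and `spam`. One way to avoid carrying them around is to ask `read_csv` for the two useful columns only. The sketch below uses a tiny inline CSV as a hypothetical stand-in for `emails.csv`, so the data and the name `small_df` are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for emails.csv: two real columns plus
# empty trailing columns created by stray commas.
raw = io.StringIO(
    "text,spam,,\n"
    "Subject: cheap software,1,,\n"
    "Subject: re : meeting notes,0,,\n"
)

# usecols restricts parsing to the named columns, so the empty
# "Unnamed: N" columns never enter the DataFrame.
small_df = pd.read_csv(raw, usecols=["text", "spam"])
print(small_df.columns.tolist())
```

With the real file, the equivalent call would presumably be `pd.read_csv('emails.csv', usecols=['text', 'spam'])`.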


## Step 2
Check for null values (missing cells) and drop duplicate rows <br>
Task: Eliminate duplicate rows and handle missing values

In [20]:
df.isnull().sum()

text               0
spam               2
Unnamed: 2      5729
Unnamed: 3      5729
Unnamed: 4      5729
                ... 
Unnamed: 105    5729
Unnamed: 106    5729
Unnamed: 107    5730
Unnamed: 108    5730
Unnamed: 109    5730
Length: 110, dtype: int64

In [21]:
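df.drop_duplicates(inplace = True)          # remove exact duplicate rows
print(df.isnull().sum())                    # null counts after de-duplication
# Fill every remaining NaN with a placeholder. Note that this also overwrites
# the two missing `spam` labels, not only the empty "Unnamed" columns.
df.fillna("not needed", inplace=True)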

text               0
spam               2
Unnamed: 2      5696
Unnamed: 3      5696
Unnamed: 4      5696
                ... 
Unnamed: 105    5696
Unnamed: 106    5696
Unnamed: 107    5697
Unnamed: 108    5697
Unnamed: 109    5697
Length: 110, dtype: int64


In [22]:
df.head()

Unnamed: 0,text,spam,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109
0,Subject: naturally irresistible your corporate...,1,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,...,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed
1,Subject: the stock trading gunslinger fanny i...,1,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,...,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed
2,Subject: unbelievable new homes made easy im ...,1,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,...,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed
3,Subject: 4 color printing special request add...,1,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,...,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed
4,"Subject: do not have money , get software cds ...",1,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,...,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed,not needed
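Filling every missing cell with `"not needed"` also turns the two missing `spam` labels into strings, which then reach the classifier as if they were a class. A possible alternative, sketched here on a hypothetical toy frame (`toy` and `cleaned` are illustrative names, not part of the notebook), is to keep only the two useful columns and drop the rows that have no label.

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the problems seen above: a duplicate row,
# a row without a label, and an empty junk column.
toy = pd.DataFrame({
    "text": ["buy now", "buy now", "lunch at noon", "unlabeled row"],
    "spam": [1, 1, 0, np.nan],
    "Unnamed: 2": [np.nan, np.nan, np.nan, np.nan],
})

cleaned = (
    toy[["text", "spam"]]          # keep only the columns the model uses
    .drop_duplicates()             # remove repeated emails
    .dropna(subset=["spam"])       # drop rows whose label is missing
    .astype({"spam": int})         # labels become plain integers again
    .reset_index(drop=True)
)
print(cleaned)
```

This keeps the label column numeric, so a later comparison such as `prediction == 1` behaves as expected.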


## Step 3
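Text Cleaning <br>
Now it's time to start cleaning. Remove the `Subject: ` header and the reply/forward markers (`re : `, `fw : `, `fwd :`) from each email, then lowercase the text and trim surrounding whitespace.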

In [23]:
print("Original Text", end = "\n")
df['text']

Original Text


0       Subject: naturally irresistible your corporate...
1       Subject: the stock trading gunslinger  fanny i...
2       Subject: unbelievable new homes made easy  im ...
3       Subject: 4 color printing special  request add...
4       Subject: do not have money , get software cds ...
                              ...                        
5726    Subject: re : receipts from visit  jim ,  than...
5727    Subject: re : enron case study update  wow ! a...
5728    Subject: re : interest  david ,  please , call...
5729    Subject: news : aurora 5 . 2 update  aurora ve...
5730    Subjecet : Congratulation. Hi Harsh You have b...
Name: text, Length: 5698, dtype: object

In [24]:
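# Strip the "Subject: " header and reply/forward markers (case-sensitive match),
# then lowercase and trim, using pandas' vectorized string methods.
pattern = r'Subject: |re : |fw : |fwd :'
df['text'] = df['text'].str.replace(pattern, '', regex=True).str.lower().str.strip()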

In [25]:
print("Cleaned Text", end = "\n")
df['text']

Cleaned Text


0       naturally irresistible your corporate identity...
1       the stock trading gunslinger  fanny is merrill...
2       unbelievable new homes made easy  im wanting t...
3       4 color printing special  request additional i...
4       do not have money , get software cds from here...
                              ...                        
5726    receipts from visit  jim ,  thanks again for t...
5727    enron case study update  wow ! all on the same...
5728    interest  david ,  please , call shirley crens...
5729    news : aurora 5 . 2 update  aurora version 5 ....
5730    subjecet : congratulation. hi harsh you have b...
Name: text, Length: 5698, dtype: object
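The Step 3 pattern is unanchored and case-sensitive, so it leaves a marker such as `Re : ` in place and also removes `re : ` when it happens to end a longer word, as in `software : `. A stricter variant, sketched below with only the standard library, anchors the match to the start of the string and ignores case. The names `PREFIX` and `strip_prefix` are hypothetical, and the marker shapes are assumed to look like the ones in this dataset.

```python
import re

# Matches a leading "Subject:" followed by any run of reply/forward markers.
# Assumption: markers look like "re :", "fw :" or "fwd :" as in this dataset.
PREFIX = re.compile(r"^\s*subject\s*:\s*((re|fwd?)\s*:\s*)*", re.IGNORECASE)

def strip_prefix(text: str) -> str:
    """Remove the subject header and reply/forward markers, then normalise."""
    return PREFIX.sub("", text).lower().strip()

print(strip_prefix("Subject: re : fw : enron case study update"))
print(strip_prefix("Subject: new software : 50% off"))
```

Because the pattern only matches at the beginning of the text, words inside the message body are left alone.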

## Step 4
Creating a corpus. (Here the corpus is a list of strings, one per email, each holding that email's stemmed words with punctuation and stop words removed.) <br>
Task: Create a list of strings containing each stemmed and processed email.

In [26]:
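corpus = []
stemmer = PorterStemmer()
stop_words = set(trashwords)                  # set lookup is faster than scanning a list
for text in df['text']:
    tokens = word_tokenize(text)              # needs the NLTK tokenizer data
    # Keep tokens that are neither a punctuation mark nor a stop word, then stem them
    stems = [stemmer.stem(word) for word in tokens
             if word not in punctuation and word not in stop_words]
    corpus.append(' '.join(stems))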
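To see what the loop above does to a single sentence, here is a minimal sketch. It splits on whitespace instead of calling `word_tokenize`, so it runs without any NLTK data downloads, and it uses a tiny hypothetical stop list; `demo_stemmer`, `demo_stopwords` and `stem_sentence` are illustrative names only.

```python
from string import punctuation
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()
demo_stopwords = {"the", "is", "for"}   # tiny hypothetical stop list for the demo

def stem_sentence(sentence):
    """Filter punctuation and stop words, then stem what is left."""
    tokens = sentence.split()           # whitespace split, not word_tokenize
    kept = [t for t in tokens if t not in punctuation and t not in demo_stopwords]
    return " ".join(demo_stemmer.stem(t) for t in kept)

print(stem_sentence("the offers running for emails !"))
```

Stemming maps inflected forms such as `offers` and `running` onto shared stems, which shrinks the vocabulary the Bag of Words has to represent.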

## Step 5
Creating a bag of words and training <br>
Task: Create a Bag of Words model and its respective list of labels and start with training

In [27]:
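cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()   # dense document-term matrix: one row per email, one column per token
y = df.iloc[:,1].values                  # second column of df: the spam label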
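To make the shape of `x` concrete, the following sketch builds a Bag of Words for two short made-up documents. The names `toy_corpus`, `toy_cv` and `toy_x` are assumptions used only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["free money free", "meet money team"]
toy_cv = CountVectorizer()
toy_x = toy_cv.fit_transform(toy_corpus)   # sparse matrix, one row per document

# vocabulary_ maps each token to its column index (tokens are sorted alphabetically).
print(sorted(toy_cv.vocabulary_.items(), key=lambda kv: kv[1]))
# Each cell holds how often that token occurs in that document.
print(toy_x.toarray())
```

`fit_transform` returns a sparse matrix, and `MultinomialNB` accepts sparse input, so calling `.toarray()` is optional and mainly costs memory on a large vocabulary.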

In [28]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
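The split above is random, so the accuracy reported later can change from run to run. A sketch of a reproducible variant follows, using made-up arrays. `random_state` fixes the shuffle and `stratify` keeps the class ratio the same in both parts. The seed value 42 and all variable names here are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 15 of class 0 and 5 of class 1.
features = np.arange(40).reshape(20, 2)
labels = np.array([0] * 15 + [1] * 5)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels,
    test_size=0.2,        # 20% of the rows go to the test set
    random_state=42,      # fixed seed, so the split is repeatable
    stratify=labels,      # preserve the 3:1 class ratio in both parts
)
print(f_train.shape, f_test.shape, l_test.sum())
```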

## Step 6
Training and Analysing <br>
Task: Implement the Naive Bayes Classification Algorithm <br> 
Naive Bayes is a supervised classification algorithm based on Bayes' theorem. It is called "naive" because it assumes that, given the class, the occurrence of one feature does not affect the probability of any other feature. In other words, the features are treated as conditionally independent. For text, this means each word count contributes to the spam score independently of the other words.
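The independence assumption is what makes the computation simple: the score of a class is its log prior plus the sum of the log likelihoods of the individual words. The following is a minimal from-scratch sketch of that idea with add-one (Laplace) smoothing. It is an illustration under simplifying assumptions, not scikit-learn's implementation, and `train_nb` and `predict_nb` are hypothetical helper names.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word likelihoods."""
    vocab = {w for d in docs for w in d.split()}
    word_counts = {c: Counter() for c in set(labels)}
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in word_counts}
    log_like = {}
    for c, counts in word_counts.items():
        # alpha is added to every word count so unseen words never get probability 0
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(text, log_prior, log_like):
    """Return the class with the highest summed log score; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in text.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

docs = ["free money now", "win free prize", "meeting at noon", "project status update"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = not spam
prior, like = train_nb(docs, labels)
print(predict_nb("free prize", prior, like))       # words seen mostly in spam
print(predict_nb("status meeting", prior, like))   # words seen mostly in non-spam
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.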

In [29]:
classifier = MultinomialNB()
classifier.fit(x_train,y_train)

MultinomialNB()

In [30]:
y_pred = classifier.predict(x_test)
print('Accuracy:', accuracy_score(y_test, y_pred)*100)

Accuracy: 98.15789473684211
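Accuracy alone does not show which kind of mistake the model makes, and it can look good even when one class dominates the data. A short sketch with made-up labels shows how a confusion matrix, precision and recall separate false alarms from missed spam. The lists `true_labels` and `pred_labels` are illustrative and are not taken from the run above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions (1 = spam, 0 = not spam).
true_labels = [0, 0, 0, 0, 1, 1, 1, 0]
pred_labels = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
print(confusion_matrix(true_labels, pred_labels))
# Precision: of the emails flagged as spam, how many really were spam.
print(precision_score(true_labels, pred_labels))
# Recall: of the real spam emails, how many were caught.
print(recall_score(true_labels, pred_labels))
```

In the notebook, the same functions could be called with `y_test` and `y_pred`, assuming the labels are numeric.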


In [31]:
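user_text = input('Input the text: ')
# Preprocess the input the same way as the training corpus (lowercase, tokenize, filter, stem)
tokens = word_tokenize(user_text.lower().strip())
processed = ' '.join(stemmer.stem(w) for w in tokens if w not in punctuation and w not in trashwords)
prediction = classifier.predict(cv.transform([processed]))[0]
if prediction == 1:
    print('Spam!')
else:
    print('Not spam!')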

Input the text: text
Not spam!
