## Executive Summary

### Problem Statement

The aim of this project is to classify messages as spam or ham (not spam).

### About the Dataset

I downloaded the dataset from Kaggle. It has 5572 rows and 5 columns.

### Preprocessing Data

For the analysis, I used only two columns: the message text and its class label. I dropped the other columns, renamed the remaining ones, and removed duplicate messages. I also stripped punctuation and stopwords from the messages. Finally, I converted the words into numeric counts and split the data into training and test sets.

### Build the Model

I used a Multinomial Naive Bayes classifier (`MultinomialNB`) to detect spam.

### Evaluate the Model

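On the training set, the model reaches an accuracy of about 0.995 and an F1 score of 0.98 for the spam class. On the test set, accuracy is 0.96 (0.9555) and the spam F1 score is 0.86, with precision 0.80 and recall 0.93. The model therefore separates spam from ham well overall, although about one in five messages it flags as spam in the test set is actually ham.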

## Import the Libraries

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## Load the Dataset

In [2]:
data = pd.read_csv('spam.csv', encoding= "ISO-8859-1")
data.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


## Preprocessing the data

### Drop the unnecessary columns

In [3]:
data = data.drop(columns = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])

### Rename the columns for readability

In [4]:
data = data.rename(columns = {'v1':'category', 'v2': 'text'})
data.head()

| | category | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
# Encode the labels as strings: ham -> '0', spam -> '1'
data['category'] = data['category'].str.replace('ham','0')
data['category'] = data['category'].str.replace('spam','1')
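
As an alternative to chained substring replacement, the labels could be encoded with an explicit dictionary and `Series.map`. The sketch below uses a small made-up frame (`toy`) rather than the real dataset, and the name `label_map` is illustrative. It keeps the labels as the strings `'0'` and `'1'`, matching the rest of this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "category": ["ham", "spam", "ham"],
    "text": ["see you soon", "win a prize now", "ok"],
})

# Map each label explicitly; any unexpected label would become NaN
label_map = {"ham": "0", "spam": "1"}
toy["category"] = toy["category"].map(label_map)

print(toy["category"].tolist())
```

Because `map` matches whole values rather than substrings, a label that is misspelled or unexpected shows up as a missing value instead of being silently left unchanged.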

In [6]:
data.head()

| | category | text |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [7]:
data.shape

(5572, 2)

### Drop the duplicates 

In [8]:
data.drop_duplicates(inplace=True)

In [9]:
data.shape

(5169, 2)
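
The shapes before and after show that 5572 - 5169 = 403 duplicate rows were removed. To inspect the duplicates before dropping them, something like the following sketch could be used. It runs on a small made-up frame (`toy`), not the real dataset, and the variable names are illustrative.

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values only)
toy = pd.DataFrame({
    "category": ["0", "1", "0", "1"],
    "text": ["ok", "win now", "ok", "win now!"],
})

# duplicated() marks every repeat of an earlier row as True
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset the index afterwards
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```

Only rows that match in every column count as duplicates here, so `"win now"` and `"win now!"` are both kept.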

### Check the null values

In [10]:
data.isnull().sum()

category    0
text        0
dtype: int64

### Download the stopwords package

In [11]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to C:\Users\Lenovo
[nltk_data]     Pc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Define the cleaning function

In [12]:
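# Build the stopword set once; set membership is much faster than rescanning the list
stop_words = set(stopwords.words('english'))

def process(text):
    # Remove punctuation characters
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # Split on whitespace and drop English stopwords (case-insensitive)
    return [word for word in nopunc.split() if word.lower() not in stop_words]

# Show the tokenization on the first five messages
data['text'].head().apply(process)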

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: text, dtype: object
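
The same cleaning steps can be sketched without NLTK to make the logic easier to test in isolation. The version below is an assumption-laden illustration: `process_fast`, `TOY_STOPWORDS`, and `_PUNCT_TABLE` are hypothetical names, and the tiny stopword set stands in for NLTK's full English list. It removes punctuation with `str.translate`, which is a common alternative to the character-by-character filter.

```python
import string

# Tiny illustrative stopword set; the notebook uses NLTK's full English list instead
TOY_STOPWORDS = {"i", "a", "to", "in", "so", "then", "he", "the", "until", "only"}

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def process_fast(text, stop_words=TOY_STOPWORDS):
    """Strip punctuation, split on whitespace, and drop stopwords (case-insensitive)."""
    no_punct = text.translate(_PUNCT_TABLE)
    return [word for word in no_punct.split() if word.lower() not in stop_words]

tokens = process_fast("Ok lar... Joking wif u oni...")
print(tokens)  # ['Ok', 'lar', 'Joking', 'wif', 'u', 'oni']
```

Note that the original casing of each token is kept; only the stopword comparison is case-insensitive, so `Free` and `free` would still be counted as different words.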

### Transform the words into numbers for the analysis

In [13]:
message = CountVectorizer(analyzer=process).fit_transform(data['text'])

In [14]:
message

<5169x11304 sparse matrix of type '<class 'numpy.int64'>'
	with 45872 stored elements in Compressed Sparse Row format>
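
To make concrete what this matrix holds: each row is a message, each column is a token from the vocabulary, and each entry is how often that token occurs in that message. The pure-Python sketch below builds the same kind of count matrix for three made-up token lists. It is only an illustration of the bag-of-words idea, not how `CountVectorizer` is implemented, and all names in it are hypothetical.

```python
from collections import Counter

# Three already-tokenized toy messages (illustrative only)
docs = [["free", "entry", "win"], ["ok", "lar"], ["win", "win", "free"]]

# Vocabulary: sorted unique tokens, each assigned a column index
vocab = {tok: j for j, tok in enumerate(sorted({t for d in docs for t in d}))}

# Dense document-term matrix: one row per message, one column per token
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts.get(tok, 0) for tok in vocab])

print(list(vocab))  # ['entry', 'free', 'lar', 'ok', 'win']
print(matrix)       # [[1, 1, 0, 0, 1], [0, 0, 1, 1, 0], [0, 1, 0, 0, 2]]
```

Most entries are zero, which is why the real 5169 x 11304 matrix is stored in a sparse format with only 45872 stored elements.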

## Split the data

In [15]:
x_train, x_test, y_train, y_test = train_test_split(message, data['category'], test_size=0.20, random_state=0)
# To see the shape of the data
print(message.shape)

(5169, 11304)
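
One thing to be aware of: in this notebook the vocabulary is learned from all messages before the split, so words that occur only in the test messages are already part of the feature space. A possible alternative, sketched below on a made-up corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so the vocabulary comes from the training text only. The toy `texts` and `labels` and the `stratify` argument are assumptions added for illustration; they are not part of the original analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for data['text'] and data['category'] (illustrative only)
texts = ["win cash now", "free prize win", "claim free cash", "win free entry",
         "see you at lunch", "call me later", "are you home", "lunch at home later"]
labels = ["1", "1", "1", "1", "0", "0", "0", "0"]

# Split the raw text first; stratify keeps the spam/ham ratio similar in both parts
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# The pipeline fits the vocabulary on the training text only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(text_train, y_train)

predictions = model.predict(text_test)
print(predictions)
```

With the real data, `CountVectorizer(analyzer=process)` could take the place of the default vectorizer so that the same cleaning function is applied inside the pipeline.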


## Build and Train the Model

In [16]:
classifier = MultinomialNB().fit(x_train, y_train)
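
For intuition about what the classifier learns, the sketch below works through Multinomial Naive Bayes by hand on a tiny made-up count matrix. It is a simplified illustration under stated assumptions, not scikit-learn's implementation: the three columns stand for hypothetical words, and `alpha = 1.0` mirrors the default Laplace smoothing of `MultinomialNB`. Each class gets a log prior and a smoothed log probability per word, and a message is assigned to the class with the highest total.

```python
import math

# Toy word-count rows (columns: "free", "win", "lunch") with labels; illustrative only
X = [[2, 1, 0], [1, 2, 0], [0, 0, 2], [0, 1, 3]]
y = ["1", "1", "0", "0"]
alpha = 1.0  # Laplace smoothing, the same default value MultinomialNB uses

classes = sorted(set(y))
n_features = len(X[0])
log_prior, log_likelihood = {}, {}
for c in classes:
    rows = [x for x, label in zip(X, y) if label == c]
    log_prior[c] = math.log(len(rows) / len(X))
    # Total count of each word within class c, plus smoothing
    word_totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
    denom = sum(word_totals)
    log_likelihood[c] = [math.log(t / denom) for t in word_totals]

def predict(x):
    """Return the class with the highest log posterior for one count vector."""
    scores = {c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
              for c in classes}
    return max(scores, key=scores.get)

print(predict([1, 1, 0]), predict([0, 0, 1]))  # 1 0
```

The smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.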

In [17]:
print(classifier.predict(x_train))
print(y_train.values)

['0' '0' '0' ... '0' '0' '0']
['0' '0' '0' ... '0' '0' '0']


## Evaluate the Model

In [18]:
pred = classifier.predict(x_train)
print(classification_report(y_train, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(y_train, pred))
print("Accuracy: \n", accuracy_score(y_train, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3631
           1       0.98      0.98      0.98       504

    accuracy                           1.00      4135
   macro avg       0.99      0.99      0.99      4135
weighted avg       1.00      1.00      1.00      4135


Confusion Matrix: 
 [[3623    8]
 [  11  493]]
Accuracy: 
 0.9954050785973397


In [19]:
# Print the predictions
print(classifier.predict(x_test))
# Print the actual values
print(y_test.values)

['0' '0' '0' ... '0' '0' '0']
['0' '0' '0' ... '0' '0' '0']


In [20]:
pred = classifier.predict(x_test)
print(classification_report(y_test, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(y_test, pred))
print("Accuracy: \n", accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.99      0.96      0.97       885
           1       0.80      0.93      0.86       149

    accuracy                           0.96      1034
   macro avg       0.89      0.94      0.92      1034
weighted avg       0.96      0.96      0.96      1034


Confusion Matrix: 
 [[850  35]
 [ 11 138]]
Accuracy: 
 0.9555125725338491
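
The test scores for the spam class can be re-derived from the confusion matrix printed above, which helps when reading the report. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so with spam (`'1'`) as the positive class the four cells are true negatives, false positives, false negatives, and true positives. The variable names below are illustrative.

```python
# Test-set confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 850, 35
fn, tp = 11, 138

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all messages classified correctly
precision = tp / (tp + fp)                  # share of flagged messages that are spam
recall = tp / (tp + fn)                     # share of spam messages that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.96 0.8 0.93 0.86
```

The 35 false positives are what pull spam precision down to 0.80, while the 11 missed spam messages keep recall at 0.93.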
