# ------------------------------------------ OASIS INFOBYTE ------------------------------------------
# Name - Akash Prakash Mandlik
# Task4 - Email Spam Classifier Detection

# Step-1) Business Problem Understanding

- We’ve all been the recipient of spam emails before. Spam mail, or junk mail, is a type of email
that is sent to a massive number of users at one time, frequently containing cryptic
messages, scams, or most dangerously, phishing content.

- In this project, we use Python to build an email spam detector, then use machine learning to
train it to recognize and classify messages as spam or non-spam ("ham"). Let’s get
started!



In [1]:
import pandas as pd

import warnings 
warnings.simplefilter("ignore")

# Step-2) Data Understanding

In [2]:
df = pd.read_csv("SMSSpamCollection", sep = "\t", names = ["label","message"])
df.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
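
Before modeling, it helps to check how the two classes are distributed, because accuracy is only meaningful relative to the share of the majority class. The sketch below shows one way to do that with `value_counts`; the four-row `sample` frame is made up for illustration and stands in for `df`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; these rows are invented for illustration.
sample = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham"],
    "message": ["Ok lar", "See you soon", "Free entry to win", "Call me later"],
})

# Absolute counts and relative share of each class.
counts = sample["label"].value_counts()
shares = sample["label"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data the same two calls would be applied to `df["label"]`.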


# Step-3) Text Pre-processing

## 3.1) Text cleaning

#### 3.1.1) Remove Punctuations
#### 3.1.2) Remove Stopwords
#### 3.1.3) Stemming / Lemmatization

In [3]:
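import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("stopwords")  # run once if the stopword list is not installed yet
ps = PorterStemmer()  # create a stemmer instance; the bare class cannot stem words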

In [4]:
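corpus = []
stop_words = set(stopwords.words("english"))  # build the set once, not once per word

for i in range(len(df)):
    rp = re.sub("[^a-zA-Z]", " ", df["message"][i])  # replace non-letters with spaces
    rp = rp.lower()
    rp = rp.split()
    rp = [word for word in rp if word not in stop_words]  # drop stopwords
    rp = " ".join(rp)  # note: no stemming is applied in this loop
    corpus.append(rp)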
    
#print(corpus)
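
The cleaning loop above removes punctuation and stopwords but does not stem the tokens, although stemming is listed as step 3.1.3. A minimal sketch of how all three steps could be combined is shown below. The helper name `clean_message` and the tiny `demo_stop_words` set are hypothetical and used only so the example runs without downloading NLTK data; the notebook itself uses `stopwords.words("english")`.

```python
import re
from nltk.stem import PorterStemmer


def clean_message(text, stop_words, stemmer):
    """Keep letters only, lowercase, drop stopwords, then stem each token."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = [tok for tok in letters_only.split() if tok not in stop_words]
    return " ".join(stemmer.stem(tok) for tok in tokens)


# Hypothetical, deliberately tiny stopword set (an assumption for this demo).
demo_stop_words = {"the", "a", "to", "you", "have"}

cleaned = clean_message(
    "You have WON 2 free tickets to the finals!!!",
    demo_stop_words,
    PorterStemmer(),
)
print(cleaned)
```

Adding stemming changes the vocabulary, so the accuracy figures reported later would likely differ slightly if this variant were used.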

## 3.2) Vectorization

#### 3.2.1) Count Vectorizer(bag of words)

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()
x

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

#### X & y values

In [7]:
# One-hot encode the labels; drop_first=True drops "ham" and keeps a single "spam" column
y = pd.get_dummies(df["label"], drop_first=True)

In [8]:
y

|      | spam |
|------|------|
| 0    | 0 |
| 1    | 0 |
| 2    | 1 |
| 3    | 0 |
| 4    | 0 |
| ...  | ... |
| 5567 | 1 |
| 5568 | 0 |
| 5569 | 0 |
| 5570 | 0 |
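
`get_dummies` returns a one-column DataFrame. An alternative, sketched below on a made-up three-element Series, is to map the labels explicitly, which yields a one-dimensional integer Series and makes the encoding (1 = spam, 0 = ham) visible in the code. The name `y_alt` is hypothetical.

```python
import pandas as pd

# Made-up labels standing in for df["label"].
labels = pd.Series(["ham", "spam", "ham"], name="label")

# Explicit mapping: the positive class (spam) becomes 1.
y_alt = labels.map({"ham": 0, "spam": 1})

print(y_alt.tolist())
print(y_alt.ndim)
```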


## 3.3) Train-Test Split

In [9]:
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows for testing; random_state fixes the shuffle for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
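
When one class is much rarer than the other, a purely random split can leave the test set with a different spam share than the training set. The sketch below shows the optional `stratify` argument of `train_test_split` on toy data (20 invented samples, 4 of them spam); it is a possible variation, not what the cell above does.

```python
from sklearn.model_selection import train_test_split

# Toy data: 16 ham (0) and 4 spam (1) samples with a single dummy feature.
X_toy = [[i] for i in range(20)]
y_toy = [0] * 16 + [1] * 4

# stratify=y_toy keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print("spam in train:", sum(y_tr), "of", len(y_tr))
print("spam in test: ", sum(y_te), "of", len(y_te))
```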

# Step-4) Modeling

### Naive Bayes classifier with default parameters

In [10]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
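# y_train is a one-column DataFrame; flatten it to the 1-D array scikit-learn expects
model.fit(x_train, y_train.values.ravel())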

MultinomialNB()
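
To give an intuition for what a multinomial Naive Bayes classifier computes, the sketch below implements the idea in plain Python: per-class word counts, Laplace smoothing with `alpha=1.0` (the scikit-learn default), and a comparison of log posteriors. It is a simplified illustration on invented data, not scikit-learn's implementation; the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenised docs."""
    classes = sorted(set(labels))
    vocab = {word for doc in docs for word in doc}
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    class_docs = Counter(labels)
    log_prior = {c: math.log(class_docs[c] / len(docs)) for c in classes}

    log_likelihood = {}
    for c in classes:
        # Laplace smoothing: add alpha to every word count in the vocabulary.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Invented, already tokenised training messages.
docs = [
    ["free", "prize", "win"],
    ["win", "cash", "now"],
    ["see", "you", "lunch"],
    ["call", "me", "later"],
]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_nb(docs, labels)
print(predict_nb(["free", "cash"], log_prior, log_likelihood))
print(predict_nb(["lunch", "later"], log_prior, log_likelihood))
```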

# Step-5) Predictions

In [11]:
ypred_test  = model.predict(x_test)
ypred_train = model.predict(x_train)
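
To classify a brand-new message, the text must be transformed with the same fitted vectorizer before it reaches the model. One way to keep the two steps together, sketched below on invented data, is a scikit-learn pipeline. Fitting the pipeline on the training texts only also means the vocabulary is learned without looking at the test messages, which differs from the notebook, where the vectorizer is fitted on the whole corpus before the split. The name `spam_pipeline` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training messages; 1 = spam, 0 = ham.
train_texts = [
    "free prize win cash",
    "win free entry now",
    "see you at lunch",
    "call me later tonight",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and then applies Naive Bayes.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

new_messages = ["free cash prize", "lunch later"]
predictions = spam_pipeline.predict(new_messages)
print(predictions)
```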

# Step-6) Evaluation

In [12]:
from sklearn.metrics import accuracy_score
print("Train Accuracy :", accuracy_score(y_train, ypred_train))
print("Test Accuracy :", accuracy_score(y_test, ypred_test))

Train Accuracy : 0.9930769230769231
Test Accuracy : 0.979066985645933
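
Accuracy alone can look strong on imbalanced data even when many spam messages slip through, so it is worth inspecting the confusion matrix together with precision and recall for the spam class. The sketch below uses invented label vectors to show the calls; on the notebook's data they would be applied to `y_test` and `ypred_test`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented ground truth and predictions; 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # share of predicted spam that is spam
rec = recall_score(y_true, y_pred)      # share of actual spam that was caught

print(cm)
print("accuracy:", acc, "precision:", round(prec, 3), "recall:", rec)
```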
