# Oasis InfoByte Data Science Internship - Task 4

## Samarth Pandey

## EMAIL SPAM DETECTION WITH MACHINE LEARNING
### We've all received spam emails before. Spam mail, or junk mail, is email sent to a massive number of users at one time, frequently containing cryptic messages, scams, or, most dangerously, phishing content. This project uses Python to build an email spam detector, then uses machine learning to train it to classify emails as spam or non-spam.

# Importing libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

from warnings import filterwarnings
filterwarnings(action='ignore')

# Preprocessing of the data

In [2]:
d = pd.read_csv(r'spam.csv', encoding = 'latin1')
# replace missing values with empty strings
data = d.where((pd.notnull(d)), '')

In [3]:
data.shape

(5572, 5)

In [4]:
data.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
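
The preview suggests that only `v1` (label) and `v2` (text) carry information in these rows. As a minimal sketch, assuming the `Unnamed` columns are not needed, they could be dropped and the remaining columns renamed. The tiny frame and the names `label` and `text` below are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in with the same column layout as spam.csv
sample = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Ok lar... Joking wif u oni...", "Free entry in 2 a wkly comp"],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
    "Unnamed: 4": [None, None],
})

# Keep only the label and text columns and give them descriptive names
cleaned = sample[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})
print(cleaned.shape)
print(list(cleaned.columns))
```

The rest of this notebook keeps the original `v1` and `v2` names, so this step is optional.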


In [5]:
# label the spam emails as 0 and the non-spam (ham) emails as 1
data.loc[data['v1'] == 'spam', 'v1'] = 0
data.loc[data['v1'] == 'ham', 'v1'] = 1
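
An alternative way to encode the labels is `Series.map` with a dictionary, which produces an integer column in one step. This is only a sketch on a hypothetical three-row series, using the same convention as above (spam is 0, ham is 1).

```python
import pandas as pd

# Hypothetical labels standing in for data['v1']
labels = pd.Series(["ham", "spam", "ham"], name="v1")

# Map each string label to its integer code: spam -> 0, ham -> 1
encoded = labels.map({"spam": 0, "ham": 1})
print(encoded.tolist())
print(encoded.dtype)
```

Any label missing from the dictionary would become `NaN`, so this form also makes unexpected values easy to spot.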

In [6]:
# separate the data into text and labels: X --> text; Y --> label
X = data['v2']
Y = data['v1']

In [7]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: v2, Length: 5572, dtype: object


In [8]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: v1, Length: 5572, dtype: object


In [9]:
# split the data as train data and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2, random_state=3)
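
When one class is much rarer than the other, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses made-up messages and labels, not the notebook's data, to show the effect.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical data: 4 spam (0) and 16 ham (1) messages
texts = [f"message {i}" for i in range(20)]
labels = [0] * 4 + [1] * 16

# stratify=labels preserves the 1:4 spam-to-ham ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=3, stratify=labels
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```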

# Feature extraction

In [10]:
#transform the text data to feature vectors that can be used as input to the SVM model using TfidfVectorizer
#convert the text to lower case letters

feature_extraction = TfidfVectorizer(min_df = 1, stop_words = 'english', lowercase = True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
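
The vectorizer is fitted on the training text only, and the test text is then transformed with the vocabulary learned there. A small sketch with hypothetical sentences shows the consequence: both matrices share the same columns, and words seen only at test time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, not taken from spam.csv
train_docs = ["free prize win now", "see you at lunch", "win a free ticket"]
test_docs = ["free lunch tomorrow"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses the learned vocabulary

print(train_matrix.shape, test_matrix.shape)
# "tomorrow" never appeared in the training text, so it has no column
print("tomorrow" in vectorizer.vocabulary_)
```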


In [11]:
# convert the Y_train and Y_test values to integers

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

# Training the model using Support Vector Machine(SVM)

In [12]:
# training the support vector machine model with training data 

model = LinearSVC()
model.fit(X_train_features, Y_train)
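
Feature extraction and training can also be chained into a single scikit-learn pipeline, so that raw text goes in and predictions come out. The following is a sketch under the assumption that the same vectorizer settings are wanted; the six messages and the name `spam_pipeline` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training set: 0 = spam, 1 = ham
texts = [
    "win a free prize now",
    "free cash prize claim today",
    "claim your free ticket",
    "are we meeting for lunch",
    "call me when you get home",
    "see you at the office tomorrow",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF and then the linear SVM in one fit/predict call
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LinearSVC(),
)
spam_pipeline.fit(texts, labels)

prediction = spam_pipeline.predict(["claim a free prize"])
print(prediction)
```

A pipeline of this kind makes it harder to forget the transform step when scoring new messages.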

# Evaluating the model

In [13]:
#prediction on training data
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

In [14]:
print('Accuracy on training data : ',accuracy_on_training_data)

Accuracy on training data :  0.9995512676688355


In [15]:
#prediction on test data 
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
print('Accuracy: ',accuracy_on_test_data)
print('Accuracy of the model is: {:.3f}'.format(accuracy_on_test_data*100),'%')

Accuracy:  0.9856502242152466
Accuracy of the model is: 98.565 %
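
Accuracy on its own does not show which class the errors fall in. As a sketch with made-up labels (not this model's predictions), a confusion matrix and per-class recall can be computed like this:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true and predicted labels: 0 = spam, 1 = ham
y_true = [0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print("accuracy:", accuracy_score(y_true, y_pred))
print("spam recall:", spam_recall)
```

In this toy case the accuracy is 0.875 while only half of the spam messages are caught, which is the kind of gap the matrix makes visible.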


# Prediction using an email

In [16]:
input_mail = ["Even my brother is not like to speak with me. They treat me like aids patent.,,,"]

#convert text to feature 
input_mail_features = feature_extraction.transform(input_mail)

#making predictions
prediction = model.predict(input_mail_features)

#Spam email means 0, ham email means 1

print(prediction)

if prediction[0] == 1:
    print("IT IS NOT A SPAM MAIL")
else:
    print("IT IS A SPAM MAIL")

[1]
IT IS NOT A SPAM MAIL
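
The transform-then-predict steps can be wrapped in a small helper so that any message can be checked with one call. This is a sketch only: `classify_message` is a hypothetical name, and the vectorizer and model here are fitted on a few made-up messages rather than on spam.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def classify_message(message, vectorizer, classifier):
    """Return 'ham' or 'spam' for one message (label convention: 0 = spam, 1 = ham)."""
    features = vectorizer.transform([message])  # transform expects a list of documents
    label = classifier.predict(features)[0]
    return "ham" if label == 1 else "spam"


# Hypothetical toy training data standing in for the fitted objects above
toy_texts = [
    "win a free prize now",
    "claim your free cash today",
    "are we meeting for lunch",
    "see you at the office tomorrow",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
toy_model = LinearSVC().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

result = classify_message("free prize waiting for you", toy_vectorizer, toy_model)
print(result)
```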


## Thank You