<a href="https://colab.research.google.com/github/Lnchi/CIP-Data-Science-Internship/blob/main/EmailSpamDetectionUsingPython.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Detecting email spam is a prevalent application of natural language processing (NLP), In this task, we classify whether an email is spam or not using machine learning algorithms.






# Import Libraries

In [32]:
#panda : to read and process the email data 
import pandas as pd 
#sklearn : to build the machine learning model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Load the dataset

# The dataset that we will utilize is the Spam Mails Dataset available on Kaggle, which comprises a collection of emails that have been pre-labeled as either spam or not spam.



In [33]:
!pip install kaggle #Install the Kaggle library.
!mkdir ~/.kaggle #Create a new directory named ".kaggle".
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json #copy the “kaggle.json” file from the mounted google drive to the current instance storage. 
!chmod 600 ~/.kaggle/kaggle.json

! kaggle datasets download venky73/spam-mails-dataset

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
mkdir: cannot create directory ‘/root/.kaggle’: File exists
spam-mails-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [34]:
!unzip spam-mails-dataset.zip

Archive:  spam-mails-dataset.zip
replace spam_ham_dataset.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: spam_ham_dataset.csv    


In [35]:
df = pd.read_csv("spam_ham_dataset.csv")
df.sample(10)

Unnamed: 0.1,Unnamed: 0,label,text,label_num
3199,2731,ham,Subject: neon for march 28\r\nplease respond t...,0
1118,4096,spam,"Subject: how ' s it going\r\nhey marjorie ,\r\...",1
2309,4954,spam,Subject: ladies rolex watches\r\ncall me at ro...,1
3954,1818,ham,Subject: re : koch midstream services co\r\nti...,0
4128,3166,ham,Subject: equistar - i ' m still waiting for th...,0
1016,1202,ham,Subject: extend deal\r\nduring the month of ju...,0
2031,2442,ham,Subject: long term deals not going to aep\r\nl...,0
2118,642,ham,Subject: 98 - 6373\r\nhere i go again . . . . ...,0
3226,2994,ham,Subject: big cowboy - additional production\r\...,0
283,349,ham,Subject: natural gas nomination for 03 / 00\r\...,0


In [36]:
df=df.drop(['Unnamed: 0'],axis=1) #drop Unnamed: 0 column  
df.sample(10)

Unnamed: 0,label,text,label_num
2666,ham,Subject: half day of vac on 2 / 28\r\n12 . 20 ...,0
2269,ham,Subject: fw : first delivery - rodessa operati...,0
4993,ham,Subject: hplc / ocean energy inc . 09 / 99 pur...,0
3626,ham,Subject: re : covenants - project miracle\r\no...,0
3529,ham,Subject: february production estimate\r\n- - -...,0
2354,spam,Subject: legal operating systems for a third o...,1
3473,spam,Subject: adult movie downloads to keep you com...,1
3356,spam,Subject: breaking news\r\nwould you ref inance...,1
1967,ham,Subject: re : fuel\r\napplication of the fuel ...,0
4827,spam,Subject: - want a new laptop ? - get one free ...,1


# - The dataset consists of three variables: "label," which indicates whether an email is classified as "Spam" or "Ham"; "text," which contains the email content; and "label_num," which is assigned a value of 1 if the email is spam, and 0 if it is not.

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   label      5171 non-null   object
 1   text       5171 non-null   object
 2   label_num  5171 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 121.3+ KB


In [38]:
df['label'].value_counts()

ham     3672
spam    1499
Name: label, dtype: int64

In [39]:
label = df.groupby(by = ["label_num"]).size().reset_index(name="Count")
figure = px.bar(label,x='label_num',y='Count', color='label_num', title='The count of spam labels.')
figure.show()

# - The dataset we are utilizing is imbalanced, as the number of samples labeled as 'Ham' is significantly higher than those labeled as 'Spam'.

# Preprocess the data
# Preprocessing of the data is required before constructing the machine learning model.

# Check the data for any missing values and clean the data

In [40]:
df.isna().sum()

label        0
text         0
label_num    0
dtype: int64

In [41]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
5166    False
5167    False
5168     True
5169    False
5170    False
Length: 5171, dtype: bool

In [42]:
df.drop_duplicates(inplace=True)

# Text preprocessing by removing stop words and non-alphanumeric characters from the email text data.

In [43]:
# list of stop words, which are common words that don't add much meaning to a text.
# These words will be removed from the text data during preprocessing.
SW=['a', 'an', 'the', 'in', 'of', 'at', 'on', 'by', 'I', 'you', 'he', 'she', 'it', 'we', 'they']

# Apply a lambda function to the 'text' column using the apply() method. 
# The lambda function takes each email text, splits it into words,
# removes any stop words, converts the remaining words to lowercase, and joins them back into a string.
df['text']=df['text'].apply(lambda x: ' '.join([word.lower() for word in x.split() if word.lower() not in SW]))

# Apply another lambda function to the 'text' column using the str.replace() method. 
# This function removes any non-alphanumeric characters from the text data.
# [^\w\s] is a regular expression pattern that matches any character that is not a word character 
# (\w, which includes all letters, digits, and underscores) or whitespace (\s). 
# The ^ symbol inside the square brackets negates the pattern, so it matches any character that is not a word character or whitespace.
df['text']=df['text'].str.replace('[^\w\s]','')
df.sample(10)


The default value of regex will change from True to False in a future version.



Unnamed: 0,label,text,label_num
481,spam,subject fw old aged woman wants to date groov...,1
2298,ham,subject re discrepancies price gas redelivere...,0
3206,ham,subject 98 6892 sitara deal 319063 above de...,0
3368,spam,subject hi hi i am looking for new friends ...,1
3741,spam,subject save up to 89 ink no shipping cost s...,1
2815,ham,subject rfp dated june 25 2001 return receipt...,0
3663,ham,subject potential june 00 daren here is list...,0
274,spam,subject get bu _ lky p 0 le dcrgvabyssyzbr l...,1
3404,ham,subject hpl noms for june 08 2000 see attach...,0
3652,ham,subject hpl nom for october 27 2000 see atta...,0


# We will divide the data into two parts: a training set and a testing set.

In [44]:
# Split the data into training and testing sets
# Sets the testing set to 20 percent of df['text'] and df['label'].
x_train, x_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=0)

# We will employ the CountVectorizer class to transform the textual data into a matrix of term frequencies (of token counts).



In [45]:
# Create a Vectorizer Object
Counvect=CountVectorizer()
# Fit the CountVectorizer to data
# CountVectorizer() assigns a numerical value to each word in the dataset and computes the frequency of occurrence
# of each word. The resulting values are stored in the variable "Counvect"
train_features=Counvect.fit_transform(x_train)
test_features=Counvect.transform(x_test)
print(train_features)

  (0, 37413)	1
  (0, 23594)	1
  (0, 18202)	1
  (0, 21150)	1
  (0, 35482)	1
  (0, 19287)	1
  (0, 9560)	1
  (0, 25800)	2
  (0, 33946)	1
  (0, 8191)	1
  (0, 10930)	1
  (0, 35296)	1
  (0, 35470)	1
  (0, 22811)	1
  (0, 29137)	1
  (0, 29061)	1
  (0, 9215)	1
  (0, 42146)	1
  (0, 5556)	2
  (0, 33994)	1
  (0, 28590)	1
  (0, 9214)	1
  (0, 20328)	1
  (0, 28163)	1
  (0, 7474)	1
  :	:
  (3993, 38256)	1
  (3993, 14415)	1
  (3993, 17586)	1
  (3993, 15331)	1
  (3993, 32149)	1
  (3993, 22988)	1
  (3993, 27638)	1
  (3993, 21659)	1
  (3993, 36803)	1
  (3993, 28516)	1
  (3993, 22980)	1
  (3993, 36885)	1
  (3993, 6825)	1
  (3993, 4459)	1
  (3993, 7046)	1
  (3993, 9142)	1
  (3993, 16652)	1
  (3993, 17601)	1
  (3993, 28305)	1
  (3993, 31417)	1
  (3993, 38598)	1
  (3993, 18070)	1
  (3993, 14560)	1
  (3993, 18020)	1
  (3993, 17605)	1


# - (0, 37791)	1 : The index of the email is denoted by 0. The middle column in the output represents the sequence number assigned to each word by the CountVectorizer function, and the values on the right indicate the frequency count of each word.

# Train the Model

# Now that the data has been preprocessed, we can proceed with training our machine learning model. For this purpose, we will be using Scikit-learn's MultinomialNB, svm and linear_model classes to build a Naive Bayes, an SVM and a Logistic Regression model.

- A **Naive Bayes** model is a type of machine learning model based on the Bayesian classification method. It is commonly used for text classification in natural language processing and can be trained to predict the category or label of a document based on its textual content

In [46]:
model_NB= MultinomialNB() # create model 
model_NB.fit(train_features, y_train) # train model

- **SVM**, short for support vector machine, is a classification and regression algorithm that uses a linear model. The basic idea behind SVM is to create a line or a hyperplane that can separate the data into different classes. One of the benefits of SVM is that it can handle both linear and non-linear problems

In [47]:
model_SVM= svm.SVC()
model_SVM.fit(train_features, y_train)

- **Logistic Regression** is a popular machine learning algorithm used for binary classification problems. In Logistic Regression, a linear model is used to predict the probability of the positive class (i.e., class 1), given a set of input features. The model then uses a logistic function (i.e., the sigmoid function) to convert this probability value into a binary output prediction.

In the context of spam detection, Logistic Regression can be used to predict whether an email is spam or not based on various features of the email. The model can be trained on a dataset of pre-labeled emails and then used to make predictions on new, unseen emails.

In [48]:
model_LR= LogisticRegression()
model_LR.fit(train_features, y_train)

- Trains the model using the training data x_train_features and corresponding labels y_train. The "fit" method adjusts the model weights to the training data to minimize the prediction error. Once the model is trained, it can be used to predict class labels for new data.

# Test the performance of the model on the test set.
- Calculates the accuracy of the machine learning model by comparing the predicted target variable (spam or non-spam) with the actual target variable (y_test) on the test dataset (x_test_features). The resulting accuracy value is then assigned to the variable "accuracy".


In [49]:
# Naive Bayes model
y_pred_NB=model_NB.predict(test_features) #y_pred: predicted labels
y_pred_NB

array(['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham',
       'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam',
       'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham',
       'ham', 'ham', 'spam', 'ham', 'spam', 'spam', 'spam', 'ham', 'spam',
       'spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham',
       'spam', 'ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham',
       'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam',
       'ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'spam', 'spam',
       'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham',
       'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham',
       'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham',
       'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',

In [50]:
# SVM model
y_pred_SVM=model_SVM.predict(test_features)
y_pred_SVM

array(['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham',
       'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham',
       'ham', 'ham', 'spam', 'ham', 'spam', 'spam', 'spam', 'ham', 'spam',
       'spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham',
       'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam',
       'ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'spam', 'spam',
       'spam', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham',
       'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham',
       'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham',
       'ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam',
       'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham',

In [51]:
#LogisticRegression
y_pred_LR=model_LR.predict(test_features)
y_pred_LR

array(['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham',
       'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam',
       'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham',
       'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam',
       'spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham',
       'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam',
       'ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'spam', 'spam',
       'spam', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham',
       'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham',
       'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham',
       'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham',
       'ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 

In [52]:
# Naive Bayes model
Accuracy=accuracy_score(y_pred_NB,y_test) #y_test: true labels 
print(Accuracy)

0.9769769769769769


In [53]:
# SVM model
Accuracy=accuracy_score(y_pred_SVM,y_test) 
print(Accuracy)

0.963963963963964


In [54]:
#LogisticRegression
Accuracy=accuracy_score(y_pred_LR,y_test) 
print(Accuracy)

0.9819819819819819


- classification_report takes in the true labels and predicted labels as input and generates a report with precision, recall, F1-score, and support for each class.

In [55]:
#A report with precision, recall, F1-score, and support for each class.

# Naive Bayes model
report=classification_report(y_test,y_pred_NB)
print(report)

              precision    recall  f1-score   support

         ham       0.98      0.99      0.98       722
        spam       0.96      0.95      0.96       277

    accuracy                           0.98       999
   macro avg       0.97      0.97      0.97       999
weighted avg       0.98      0.98      0.98       999



In [56]:
# SVM model
report=classification_report(y_test,y_pred_SVM)
print(report)

              precision    recall  f1-score   support

         ham       0.98      0.97      0.97       722
        spam       0.92      0.96      0.94       277

    accuracy                           0.96       999
   macro avg       0.95      0.96      0.96       999
weighted avg       0.96      0.96      0.96       999



In [57]:
#LogisticRegression
report=classification_report(y_test,y_pred_LR)
print(report)

              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       722
        spam       0.96      0.97      0.97       277

    accuracy                           0.98       999
   macro avg       0.98      0.98      0.98       999
weighted avg       0.98      0.98      0.98       999

