# AMJAD KHAN   BSCS  USTB 

# Email Spam Detection with Machine Learning and NLP

## 1. import numpy as np
Purpose: Imports the NumPy library and gives it the alias np.

NumPy: A powerful library for numerical computations in Python. It is widely used for handling arrays, performing mathematical operations, and working with multi-dimensional data.

Example Use: np.array([1, 2, 3]) creates a NumPy array.
## 2. import pandas as pd
Purpose: Imports the Pandas library and gives it the alias pd.

Pandas: A library for data manipulation and analysis. It provides two main data structures:

DataFrame: Tabular data similar to Excel spreadsheets.

Series: One-dimensional labeled data.

Example Use: pd.DataFrame({'A': [1, 2, 3]}) creates a simple DataFrame.
## 3. import nltk
Purpose: Imports the NLTK (Natural Language Toolkit) library.

NLTK: A library for natural language processing tasks like tokenization, stemming, lemmatization, and more.

Example Use: nltk.word_tokenize("Hello, world!") splits the sentence into words.
## 4. from nltk.corpus import stopwords
Purpose: Imports the stopwords corpus reader from nltk.corpus.

Stopwords: Common words like "the", "is", "and" that are often removed in text processing because they do not contribute much meaning.

Example Use: stopwords.words('english') provides a list of English stopwords.
## 5. import string
Purpose: Imports Python’s built-in string module.

String Module: Provides useful constants and classes for string manipulation.

string.punctuation: A string of all ASCII punctuation characters (`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``).

string.ascii_letters: A string of all ASCII letters (abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ).

Example Use: string.punctuation is often used to remove punctuation in text processing.
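
As a small illustration of that use, the sketch below strips punctuation with `str.translate`. The helper name `strip_punctuation` is hypothetical, and this is one possible approach rather than the method used later in this notebook.

```python
import string

def strip_punctuation(text: str) -> str:
    """Return text with every character in string.punctuation removed."""
    # The third argument of str.maketrans lists characters to delete
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

cleaned = strip_punctuation("Hello, world! It's free.")
print(cleaned)  # Hello world Its free
```

Note that the apostrophe is also punctuation, so "It's" becomes "Its".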

## Import the libraries

In [5]:
import numpy as np 
import pandas as pd   
import nltk
from nltk.corpus import stopwords
import string

## Load the data and print the first 5 rows

## 1. df = pd.read_csv("emails.csv")
Purpose: Reads a CSV (Comma-Separated Values) file named emails.csv into a Pandas DataFrame.
Components:

pd.read_csv: A Pandas function that loads data from a CSV file.

"emails.csv": The file to be read. It must be in the same directory as your script; otherwise, pass the full file path.

## 2. df.head()
Purpose: Displays the first five rows of the DataFrame df by default.

Components:

.head(): A Pandas method that previews the top rows of the DataFrame. You can specify the number of rows to display, e.g., df.head(5) for 5 rows.
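
The two calls can be tried without the real file. The sketch below assumes a tiny, made-up CSV held in memory (`csv_text` and `toy` are hypothetical names, not part of this notebook) and reads it through `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for emails.csv
csv_text = (
    "text,spam\n"
    "Subject: win a prize,1\n"
    "Subject: meeting notes,0\n"
    "Subject: cheap meds,1\n"
)

toy = pd.read_csv(io.StringIO(csv_text))  # read_csv accepts a path or a file-like object
first_two = toy.head(2)                   # head(n) returns the first n rows
print(first_two)
```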

In [8]:
df = pd.read_csv("emails.csv")
df.head()

| | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


## df.shape
Purpose: Returns the dimensions of the DataFrame df as a tuple.

Output: A tuple of two values:

Number of rows (records).

Number of columns (features or attributes).

In [10]:
df.shape

(5728, 2)

# df.columns
## Explanation:
Purpose: Returns the column names of the DataFrame df as a Pandas Index object.
Output: A list-like object containing all column names in the same order as they appear in the DataFrame.

In [12]:
df.columns

Index(['text', 'spam'], dtype='object')

# Check for duplicates and remove them

## 1. df.drop_duplicates(inplace=True)
## Explanation:
Purpose: Removes duplicate rows from the DataFrame df.

### Components:
df.drop_duplicates(): A Pandas method to drop rows that have identical values across all columns (i.e., duplicates).

inplace=True: Modifies the original DataFrame directly instead of creating a new DataFrame.

## 2. print(df.shape)
### Explanation:
Purpose: Prints the dimensions of the DataFrame df after removing duplicates.
### Components:
df.shape: Returns the tuple (number_of_rows, number_of_columns).

print(): Outputs this tuple to the console.
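
To see the effect on a small scale, here is a sketch on a made-up three-row DataFrame (`toy` is hypothetical data, not taken from emails.csv). It uses the non-inplace form so the before and after shapes can be compared side by side.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "text": ["win money now", "meeting at 10", "win money now"],
    "spam": [1, 0, 1],
})

n_duplicates = int(toy.duplicated().sum())  # rows that repeat an earlier row
deduped = toy.drop_duplicates()             # returns a new DataFrame; toy is unchanged
print(n_duplicates, toy.shape, deduped.shape)
```

In this notebook the same step shrinks the data from 5728 rows to 5695, so 33 duplicate rows are dropped.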


In [14]:
df.drop_duplicates(inplace=True)
print(df.shape)

(5695, 2)


## See the number of missing data for each column

# print(df.isnull().sum())
## Explanation:
This line of code checks for missing (null) values in the DataFrame df and prints the count of null values for each column.
### df.isnull():

Purpose: Creates a new DataFrame of the same shape as df where each cell contains:

True if the value in that cell is null (i.e., missing or NaN).

False otherwise.

### .sum():

Purpose: Sums the True values column-wise (treating True as 1 and False as 0).
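
A minimal sketch of the two steps, assuming a made-up DataFrame with one missing value per column (`toy` is hypothetical, not the emails data):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry in each column
toy = pd.DataFrame({
    "text": ["hello", None, "free offer"],
    "spam": [0, 1, np.nan],
})

mask = toy.isnull()              # same shape as toy, True where a value is missing
missing_per_column = mask.sum()  # True counts as 1, so this counts missing cells per column
print(missing_per_column)
```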

In [16]:
print(df.isnull().sum())

text    0
spam    0
dtype: int64


## Download the stop words

## nltk.download("stopwords")
* Purpose: Downloads the stopwords dataset from the NLTK library.

* What are Stopwords? Common words like "the", "is", "and", etc., that are often removed in text processing because they don't add much meaning.

* Why Download? Stopwords are not included by default, so they need to be downloaded before use.


In [18]:
# download the stopwords package
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Amjad\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

# def process(text):
Purpose: Defines a function named process that takes a single argument, text.

# nopunc = [char for char in text if char not in string.punctuation]
Purpose: Creates a list of characters from text excluding punctuation.
How: Loops through each character (char) in text and keeps it only if it is not in string.punctuation (a predefined string of punctuation marks such as . , ! ? ").

Result: A list of characters without punctuation.

# nopunc = ''.join(nopunc)
Purpose: Converts the list of characters back into a single string.

How: Joins all characters in the nopunc list.

Result: A string with punctuation removed.

# clean = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
Purpose: Removes common stopwords (like "the", "is") from the text.
How:
nopunc.split(): Splits the string into a list of words.

Loops through each word in this list.

Keeps the word only if it (converted to lowercase) is not in the list of English stopwords (stopwords.words('english')).

Result: A list of meaningful words without stopwords.

# return clean
Purpose: Outputs the list of cleaned words from the function.

Example: process("hello, the world is a test") returns ['hello', 'world', 'test'].

# df['text'].head().apply(process)
Purpose: Applies the process function to the first 5 rows of the text column in the DataFrame df.
## Steps:
df['text']: Selects the text column from df.

.head(): Takes the first 5 rows of this column.

.apply(process): Applies the process function to each row (i.e., to each text entry).

Result: Returns a Pandas Series where each entry is a list of cleaned words for the corresponding text in the first 5 rows.



In [20]:
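def process(text):
    # 1. Drop every punctuation character, then rejoin into one string
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    # 2. Split on whitespace and keep only words that are not English stopwords
    clean = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    return clean
# Preview the tokens for the first 5 emails
df['text'].head().apply(process)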

0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object
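
One possible refinement: process evaluates stopwords.words('english') once per word, which gets slow on thousands of emails. The sketch below builds the stopword collection once as a set and reuses it. `process_fast` is a hypothetical name, and `STOPWORDS` is a small stand-in so the example runs without NLTK; in the notebook it would be `set(stopwords.words('english'))`.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer
STOPWORDS = {"the", "is", "and", "a", "to", "your", "do", "not", "have"}

def process_fast(text, stop_set=STOPWORDS):
    """Same steps as process: strip punctuation, split on whitespace, drop stopwords."""
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # Set membership is a constant-time lookup, and the set is built only once
    return [word for word in nopunc.split() if word.lower() not in stop_set]

tokens = process_fast("Subject: do not have money , get software cds")
print(tokens)  # ['Subject', 'money', 'get', 'software', 'cds']
```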

# convert the text into a matrix of token counts

## 1. Importing the CountVectorizer
### from sklearn.feature_extraction.text import CountVectorizer
Purpose: Imports the CountVectorizer class from the sklearn.feature_extraction.text module.

#### What is CountVectorizer?
It converts a collection of text documents into a matrix of token counts (a Bag of Words model).
Each column in the matrix represents a unique word (token), and the cell value represents the word's count in a specific document.

## 2. Using CountVectorizer with a Custom Analyzer
### message = CountVectorizer(analyzer=process).fit_transform(df['text'])

#### CountVectorizer(analyzer=process)
Purpose: Initializes the CountVectorizer object with a custom analyzer.

Analyzer: Defines how the text is preprocessed and tokenized before it is converted into a Bag of Words.

Here, process (the custom function defined earlier) is passed as the analyzer. Because a callable analyzer replaces CountVectorizer's built-in preprocessing, tokens are not lowercased unless process does it.

process ensures that the text is cleaned (punctuation and stopwords are removed) and split into meaningful tokens.

#### .fit_transform(df['text'])

Purpose: Learns the vocabulary (unique tokens) from the text column of df and transforms the text data into a sparse matrix representation.
###### Components:
fit: Extracts all unique words (tokens) across the text column and builds a vocabulary.

transform: Converts each document into a row of token counts based on the learned vocabulary.
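
The fit and transform steps can be mimicked in plain Python to show what the resulting matrix contains. This is only a sketch on three made-up documents; CountVectorizer itself returns a sparse matrix and is implemented differently.

```python
from collections import Counter

docs = ["free money free", "meeting tomorrow", "free meeting"]  # hypothetical documents

# "fit": collect the vocabulary (unique tokens) across all documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

# "transform": one row per document, one column per vocabulary word
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['free', 'meeting', 'money', 'tomorrow']
for row in matrix:
    print(row)
```

The first row is [2, 0, 1, 0] because "free" appears twice and "money" once in the first document.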


In [23]:
from sklearn.feature_extraction.text import CountVectorizer
message = CountVectorizer(analyzer=process).fit_transform(df['text'])

## from sklearn.model_selection import train_test_split
Imports the train_test_split function to split the data into training and testing sets.

## xtrain, xtest, ytrain, ytest = train_test_split(message, df['spam'], test_size=0.20, random_state=0)
Purpose: Splits the data into training and testing sets.
Inputs:

message: Features (Bag of Words matrix).

df['spam']: Labels (target column indicating spam or not).

test_size=0.20: 20% of the data is reserved for testing.

random_state=0: Fixes the seed of the random shuffle, so the same rows land in the training and test sets on every run. This is what makes the split (and the results that follow) reproducible.

Outputs:
xtrain: Features for training (80%).

xtest: Features for testing (20%).

ytrain: Labels for training (80%).

ytest: Labels for testing (20%).
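
The idea behind a seeded split can be sketched in plain Python. `simple_split` is a hypothetical helper written for illustration; it does not reproduce the exact rows that train_test_split selects, only the principle of shuffling with a fixed seed and cutting at 20%.

```python
import math
import random

def simple_split(features, labels, test_size=0.20, seed=0):
    """Shuffle row indices with a fixed seed and split them into train and test parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)          # same seed -> same order on every run
    n_test = math.ceil(len(indices) * test_size)  # rounding up is a choice made for this sketch
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    xtrain = [features[i] for i in train_idx]
    xtest = [features[i] for i in test_idx]
    ytrain = [labels[i] for i in train_idx]
    ytest = [labels[i] for i in test_idx]
    return xtrain, xtest, ytrain, ytest

features = [f"email {i}" for i in range(10)]  # hypothetical documents
labels = [i % 2 for i in range(10)]           # hypothetical spam flags
xtrain, xtest, ytrain, ytest = simple_split(features, labels)
print(len(xtrain), len(xtest))  # 8 2
```

Features and labels are indexed with the same shuffled positions, so each document keeps its own label.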

### print(message.shape)
Prints the shape (dimensions) of the message sparse matrix.

Output: (n_rows, n_columns)

n_rows: Number of documents (messages).

n_columns: Number of unique words (vocabulary size).

In [27]:
#split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(message, df['spam'], test_size=0.20, random_state=0)
# To see the shape of the data
print(message.shape)

(5695, 37229)


## from sklearn.naive_bayes import MultinomialNB
Imports the MultinomialNB class from sklearn.naive_bayes.

Purpose: Multinomial Naive Bayes is a probabilistic classifier commonly used for text classification, especially for Bag of Words data.
### classifier = MultinomialNB()           
Creates an instance of the MultinomialNB classifier.

This classifier assumes that features are counts or frequencies (like in Bag of Words or TF-IDF representations).

It applies Bayes' theorem with the "naive" assumption that words are conditionally independent given the class.
### .fit(xtrain, ytrain)
Trains the Naive Bayes classifier on the training data.
##### Inputs:
xtrain: Sparse matrix of training features (word counts for each document).

ytrain: Labels (spam or not) for the corresponding documents.
Output: The classifier is trained and ready to make predictions.
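
To make the probability calculation concrete, here is a from-scratch sketch of multinomial Naive Bayes on four made-up documents. The functions `train_nb` and `predict_nb` are hypothetical and only mirror the idea; they are not scikit-learn's implementation. The smoothing value alpha=1.0 (Laplace smoothing) matches MultinomialNB's default.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    vocab = {word for doc in docs for word in doc}
    class_docs = Counter(labels)        # number of documents per class
    word_counts = defaultdict(Counter)  # word counts per class
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for c in class_docs:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(class_docs[c] / len(docs)),
            "log_like": {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab},
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c, params in model.items():
        scores[c] = params["log_prior"] + sum(
            params["log_like"][w] for w in doc if w in params["log_like"]
        )
    return max(scores, key=scores.get)

# Hypothetical tokenized training data: 1 = spam, 0 = not spam
train_docs = [["free", "money", "now"], ["win", "free", "prize"],
              ["meeting", "tomorrow"], ["project", "meeting", "notes"]]
train_labels = [1, 1, 0, 0]

model = train_nb(train_docs, train_labels)
spam_guess = predict_nb(model, ["free", "prize", "money"])
ham_guess = predict_nb(model, ["meeting", "notes"])
print(spam_guess, ham_guess)  # 1 0
```

Working in log space avoids multiplying many small probabilities, which would underflow on long emails.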

In [35]:
# create and train the Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(xtrain, ytrain)

# See the classifier's predictions and the actual values on the data set

## 1. print(classifier.predict(xtrain))
Purpose: Predicts the labels (spam or not) for the training data xtrain.

Steps:

classifier.predict(xtrain):  Uses the trained Naive Bayes model (classifier) to classify each document in xtrain (training features).
Outputs an array of predicted labels (e.g., [0, 1, 0, 1, ...]), where:

0 means "not spam."

1 means "spam."

Output: Prints the array of predicted labels for the training data.

## 2. print(ytrain.values)
Purpose: Prints the true labels (actual spam or not) for the training data.
Steps: 
ytrain.values:   Converts the ytrain Series (Pandas column) to a NumPy array.

This contains the actual labels (e.g., [0, 1, 0, 1, ...]) corresponding to xtrain.
Output: Prints the true labels for comparison with the predictions.

In [37]:
print(classifier.predict(xtrain))
print(ytrain.values)

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]


## from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Purpose: Imports evaluation functions from sklearn.metrics to assess the model's performance.

classification_report: Provides detailed metrics like precision, recall, F1-score, and support for each class.

confusion_matrix: Computes the confusion matrix, which shows the count of true positives, true negatives, false positives, and false negatives.

accuracy_score: Calculates the accuracy of the model, i.e., the fraction of correct predictions (a value of 0.99 means 99% correct).

## pred = classifier.predict(xtrain)
Purpose: Makes predictions on the training set (xtrain) using the trained Naive Bayes classifier (classifier).
#### Result:
pred will be an array of predicted labels (e.g., [0, 1, 0, 1, ...]), where 0 is "not spam" and 1 is "spam".

## print(classification_report(ytrain, pred))
Purpose: Prints the classification report comparing the true labels (ytrain) with the predicted labels (pred).

##### Explanation:
Precision: The percentage of correct positive predictions (spam) out of all predicted positives.

Recall: The percentage of correct positive predictions (spam) out of all actual positives (spam).

F1-score: The harmonic mean of precision and recall.

Support: The number of occurrences of each class in the dataset.

## print()
Purpose: Prints a blank line for better readability between the outputs.
# print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
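Purpose: Prints the confusion matrix that shows the breakdown of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Explanation: The confusion matrix for binary classification is a 2x2 matrix. With labels 0 and 1, scikit-learn lays it out as [[TN, FP], [FN, TP]]: rows are actual classes and columns are predicted classes.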

TN: True Negative (not spam correctly predicted as not spam)

FP: False Positive (not spam incorrectly predicted as spam)

FN: False Negative (spam incorrectly predicted as not spam)

TP: True Positive (spam correctly predicted as spam)
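
The four counts can be tallied by hand to check the definitions. `confusion_counts` is a hypothetical helper on made-up labels; it returns the counts in the row-by-actual, column-by-predicted layout that confusion_matrix uses for labels 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels where 1 = spam and 0 = not spam."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1   # not spam, predicted not spam
        elif actual == 0 and predicted == 1:
            fp += 1   # not spam, predicted spam
        elif actual == 1 and predicted == 0:
            fn += 1   # spam, predicted not spam
        else:
            tp += 1   # spam, predicted spam
    return [[tn, fp], [fn, tp]]

y_true = [0, 0, 1, 1, 0, 1]  # hypothetical actual labels
y_pred = [0, 1, 1, 0, 0, 1]  # hypothetical predictions
matrix = confusion_counts(y_true, y_pred)
print(matrix)  # [[2, 1], [1, 2]]
```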

# print("Accuracy: \n", accuracy_score(ytrain, pred))

Purpose: Prints the accuracy of the model, which is the ratio of correct predictions to the total number of predictions.

## Formula:
Accuracy  =   Number of Correct Predictions  /  Total Number of Predictions



In [39]:
# Evaluating the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtrain)
print(classification_report(ytrain, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: \n", accuracy_score(ytrain, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556


Confusion Matrix: 
 [[3445   12]
 [   1 1098]]
Accuracy: 
 0.9971466198419666


## 1. print(classifier.predict(xtest))
Purpose: Prints the predicted labels for the test set (xtest) using the trained Naive Bayes classifier (classifier).
##### Steps:
classifier.predict(xtest):   Passes the feature matrix of the test set (xtest) to the classifier's predict method.

Outputs an array of predictions for each document in the test set.

Predicted Labels: Typically 0 (not spam) or 1 (spam)

## 2. print(ytest.values)
Purpose: Prints the actual labels (ground truth) for the test set (ytest).

##### Steps:
ytest.values:  Converts the ytest Series (Pandas column) into a NumPy array.

Contains the true labels (e.g., [0, 1, 0, 0, 1]).


In [41]:
#print the predictions
print(classifier.predict(xtest))
#print the actual values
print(ytest.values)

[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]


# evaluate the model on the test data set

## from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Purpose: Imports evaluation metrics from sklearn.metrics to assess the model's performance on the test data.

classification_report:   Summarizes key metrics like precision, recall, F1-score, and support for each class.

confusion_matrix:   Provides a breakdown of correct and incorrect predictions.

accuracy_score:   Calculates the proportion of correct predictions.

# pred = classifier.predict(xtest)
Purpose: Makes predictions on the test dataset (xtest) using the trained Naive Bayes classifier (classifier).
##### Output:
pred is an array containing the predicted labels (e.g., [0, 1, 0, 1, ...]) for each document in the test set.

These predictions will be compared with the actual labels (ytest) to evaluate the model.

## print(classification_report(ytest, pred))
Purpose:   Prints the classification report comparing the true labels (ytest) with the predicted labels (pred).
#### Explanation of Metrics:
Precision:   The fraction of correctly identified positive predictions (spam) out of all predicted positives.

Recall:    The fraction of correctly identified positive predictions (spam) out of all actual positives (spam).

F1-score:   The harmonic mean of precision and recall, providing a balance between them.

Support:   The number of true instances for each class in the dataset.

## print("Accuracy: \n", accuracy_score(ytest, pred))

Purpose: Prints the overall accuracy of the model on the test data.

### Formula:
Accuracy = Number of Correct Predictions / Total Number of Predictions



In [43]:
# Evaluating the model on the test data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtest)
print(classification_report(ytest, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytest, pred))
print("Accuracy: \n", accuracy_score(ytest, pred))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139


Confusion Matrix: 
 [[862   8]
 [  1 268]]
Accuracy: 
 0.9920983318700615
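
As a sanity check, the spam-class metrics in the report above can be recomputed from the four numbers in the test confusion matrix. This is a small sketch using the standard definitions; the variable names are chosen here for illustration.

```python
# Values read from the test-set confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 862, 8, 1, 268

precision = tp / (tp + fp)                          # correct spam predictions / all spam predictions
recall = tp / (tp + fn)                             # correct spam predictions / all actual spam
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
accuracy = (tp + tn) / (tn + fp + fn + tp)          # correct predictions / all predictions

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
# 0.97 1.0 0.98 0.9921
```

The rounded values agree with the row for class 1 in the classification report and with the printed accuracy.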
