# **EMAIL SPAM DETECTION WITH MACHINE LEARNING (TASK-3)**
Project By: **Tejaswini Jaunjat**

We’ve all been the recipient of spam emails before. Spam mail, or junk mail, is a type of email
that is sent to a massive number of users at one time, frequently containing cryptic
messages, scams, or most dangerously, phishing content.



In this project, I use Python to build an email spam detector and then use machine learning to
train it to recognize and classify messages as spam or non-spam (ham). Let’s get
started!

In [1]:
import pandas as pd

# Load the dataset into a pandas DataFrame
df = pd.read_csv('/kaggle/input/spam-dataset/spam.csv', encoding='latin-1')

# Drop unnecessary columns and rename remaining columns
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
df = df.rename(columns={'v1': 'label', 'v2': 'text'})

# Print the first few rows of the DataFrame
print(df.head())


  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


This code will load the spam.csv file into a pandas DataFrame, drop unnecessary columns, and rename the remaining columns to 'label' and 'text'. The 'label' column contains the labels (spam or ham) and the 'text' column contains the text messages.

Once we have loaded the data, we can start processing the text messages and extracting features from them. One approach is to use the Bag of Words (BoW) model to convert the text messages into numerical feature vectors.

**We can use the CountVectorizer class from scikit-learn to implement the BoW model. Here's some sample code:**

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
count_vectorizer = CountVectorizer()

# Fit and transform the text data
count_vectorizer.fit(df['text'])
text_counts = count_vectorizer.transform(df['text'])

# Print the shape of the feature vectors
print(text_counts.shape)


(5572, 8672)


This code will create a CountVectorizer object and fit it to the text data. It will then transform the text data into feature vectors using the transform() method of the CountVectorizer object. Finally, it will print the shape of the feature vectors.
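
To make the Bag of Words idea concrete, here is a minimal sketch on three made-up messages (the strings and the `toy_` names are hypothetical and are not taken from the dataset). Each column of the resulting matrix corresponds to one vocabulary word, and each row counts how often that word occurs in one message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, used only to illustrate the BoW representation
toy_messages = ["free prize win", "call me later", "win a free prize now"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# Column order follows the alphabetically sorted vocabulary.
# The default tokenizer ignores single-character tokens such as "a".
print(sorted(toy_vectorizer.vocabulary_))
print(toy_counts.toarray())
```

The same mechanism, applied to the full dataset, yields one row per message and one column per distinct word, which is what the shape printed above reports.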

Now that we have extracted features from the text messages, we can split the dataset into training and testing sets and train a machine learning model on the training set. We can use the Multinomial Naive Bayes (MNB) algorithm, which is a popular algorithm for text classification tasks.

**Here's some sample code to split the dataset, train the MNB model, and test it on the testing set:**

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(text_counts, df['label'], test_size=0.2, random_state=42)

# Train the MNB model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = mnb.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)


Accuracy: 0.97847533632287


This code splits the dataset into training and testing sets (80% / 20%), trains the MNB model on the training set, makes predictions on the testing set, and computes the accuracy, i.e. the fraction of test messages correctly classified as spam or ham. Here the model labels about 97.8% of the test messages correctly.
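
In the cells above, the vectorizer is fitted on all messages before the split, so its vocabulary also contains words that occur only in test messages. One possible alternative, sketched below under the assumption that a scikit-learn pipeline is acceptable here, is to split the raw text first and let the pipeline fit the vectorizer on the training portion only. The toy `texts` and `labels` are hypothetical stand-ins for `df['text']` and `df['label']`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['text'] and df['label']
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free prize", "urgent cash prize waiting",
    "are we meeting today", "see you at lunch",
    "call me when you get home", "thanks for the lift today",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text, keeping the spam/ham ratio the same in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training messages only
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

predictions = spam_pipeline.predict(test_texts)
print(list(zip(test_texts, predictions)))
```

Words that appear only in the test messages are simply ignored at prediction time, which is closer to how the model would behave on genuinely new messages.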

# **Data Cleaning and Preprocessing**
Before training a machine learning model, it's important to clean and preprocess the data. Here's some sample code that you can use to perform some basic data cleaning and preprocessing on the SMS spam collection dataset:

In [5]:
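import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Fetch the NLTK stop word list if it is not already available locally
nltk.download('stopwords', quiet=True)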

# Load the dataset into a pandas DataFrame
df = pd.read_csv('/kaggle/input/spam-dataset/spam.csv', encoding='latin-1')

# Drop unnecessary columns and rename remaining columns
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
df = df.rename(columns={'v1': 'label', 'v2': 'text'})

# Clean and preprocess the text data
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    # Remove non-alphanumeric characters
    text = re.sub('[^a-zA-Z0-9]', ' ', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])
    
    # Stem words
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    
    return text

df['text'] = df['text'].apply(clean_text)


This code reloads the spam.csv file into a pandas DataFrame, drops the unnecessary columns, and renames the remaining columns to 'label' and 'text'. It then cleans the text by replacing non-alphanumeric characters with spaces, converting to lowercase, removing stopwords, and stemming each word with the PorterStemmer from the NLTK library. Note that `text_counts`, `X_train`, and `X_test` were built before this step, so they must be regenerated from the cleaned text for the cleaning to have any effect on the model.
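
To see what the first three cleaning steps do to a single message, here is a small standard-library sketch. The sample string, the function name `clean_text_sketch`, and the deliberately tiny `TOY_STOP_WORDS` set are hypothetical; the notebook itself uses NLTK's much larger English stop word list and also applies stemming, which is omitted here.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only
TOY_STOP_WORDS = {"a", "in", "to", "the"}

def clean_text_sketch(text):
    # Replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Convert to lowercase
    text = text.lower()
    # Drop stop words
    tokens = [word for word in text.split() if word not in TOY_STOP_WORDS]
    return " ".join(tokens)

sample = "Free entry in 2 a wkly comp to win FA Cup!"
print(clean_text_sketch(sample))  # free entry 2 wkly comp win fa cup
```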

# **Feature Engineering**
In addition to the Bag of Words model, there are many other features that you can extract from the text messages. Here's some sample code that you can use to extract additional features:

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

# Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_vectorizer.fit(df['text'])
text_tfidf = tfidf_vectorizer.transform(df['text'])

# Create a HashingVectorizer object
hash_vectorizer = HashingVectorizer(n_features=2**10)

# HashingVectorizer is stateless, so fit() learns nothing; transform() does the work
hash_vectorizer.fit(df['text'])
text_hash = hash_vectorizer.transform(df['text'])

# Create a TruncatedSVD object
svd = TruncatedSVD(n_components=100)

# Fit and transform the tf-idf feature vectors
text_svd = svd.fit_transform(text_tfidf)

# Concatenate the feature vectors
text_features = np.hstack((text_tfidf.toarray(), text_hash.toarray(), text_svd))


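This code creates a TfidfVectorizer object to extract TF-IDF features from the text messages, a HashingVectorizer object to map words into a fixed number of hashed feature columns (1,024 here), and a TruncatedSVD object to reduce the TF-IDF vectors to 100 dimensions. It then concatenates the three feature sets into a single dense numpy array. Two caveats apply: the hashed features (with default settings) and the SVD components can be negative, which MultinomialNB does not accept, and in the cells shown here `text_features` is not passed to the model trained below.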
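
As a possible variation, the sketch below combines TF-IDF and hashed features without converting them to dense arrays, using `scipy.sparse.hstack`. It also sets `alternate_sign=False` so the hashed features stay non-negative. The toy messages and variable names are hypothetical, and the SVD step is left out for brevity.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Hypothetical toy messages standing in for df['text']
toy_messages = ["free prize win", "call me later", "win a free prize now"]

# TF-IDF features: one column per vocabulary word
tfidf_features = TfidfVectorizer().fit_transform(toy_messages)

# Hashed features: a fixed number of columns, kept non-negative
hasher = HashingVectorizer(n_features=2**4, alternate_sign=False)
hashed_features = hasher.transform(toy_messages)

# Stack the two sparse matrices side by side; the result stays sparse
combined = hstack([tfidf_features, hashed_features]).tocsr()
print(combined.shape)
```

Keeping the matrix sparse avoids allocating a large dense array when the vocabulary has thousands of columns.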

# **Hyperparameter Tuning**
To improve the performance of the machine learning model, you can tune the hyperparameters of the algorithm. Here's some sample code that you can use to perform hyperparameter tuning:

In [11]:
from sklearn.naive_bayes import MultinomialNB

# Create a Naive Bayes model
model = MultinomialNB()

# Train the model on the training data
model.fit(X_train, y_train)


MultinomialNB()

In [12]:
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {'alpha': [0.1, 1.0, 10.0]}

# Perform grid search with cross-validation
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit the model to the data
grid_search.fit(X_train, y_train)

# Print the best parameters and score
print('Best parameters:', grid_search.best_params_)
print('Best score:', grid_search.best_score_)

Best parameters: {'alpha': 0.1}
Best score: 0.9833972510355171


This code runs a grid search with 5-fold cross-validation over the alpha hyperparameter of the Naive Bayes algorithm. Alpha is the additive (Laplace/Lidstone) smoothing parameter: it is added to every word count so that a word never seen in a class does not receive zero probability. We test the values 0.1, 1.0, and 10.0, and more values can be added to the list. The search prints the best parameter and its mean cross-validated accuracy on the training set, here alpha = 0.1 with a score of about 0.983.
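
As a rough illustration of what alpha does, the sketch below recomputes the per-class word probabilities by hand as (count + alpha) / (class total + alpha × number of words) and compares them with what `MultinomialNB` stores in `feature_log_prob_`. The count matrix `X` and labels `y` are hypothetical toy values.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: 4 messages, 3 vocabulary words
X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 2, 2]])
y = np.array(["spam", "spam", "ham", "ham"])

for alpha in (0.1, 1.0, 10.0):
    nb = MultinomialNB(alpha=alpha).fit(X, y)
    # Per-class word counts; rows follow nb.classes_ (sorted: ham, spam)
    counts = nb.feature_count_
    # Manual additive smoothing
    manual = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    matches = np.allclose(np.exp(nb.feature_log_prob_), manual)
    print(alpha, matches, np.round(manual, 3))
```

Larger alpha values pull the word probabilities of each class toward a uniform distribution, while small values stay close to the raw frequencies.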

# **Model Evaluation**
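After training the machine learning model, it's important to evaluate its performance on the held-out test data. Note that the code below evaluates `model`, the Naive Bayes classifier fitted with the default alpha, which is why the accuracy matches the earlier result; to evaluate the tuned model instead, predict with `grid_search.best_estimator_`. Here's some sample code that you can use to evaluate the model: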

In [14]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, pos_label='spam'))
print('Recall:', recall_score(y_test, y_pred, pos_label='spam'))
print('F1 score:', f1_score(y_test, y_pred, pos_label='spam'))



Accuracy: 0.97847533632287
Precision: 0.9144736842105263
Recall: 0.9266666666666666
F1 score: 0.9205298013245033
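
To show where precision and recall come from, here is a small sketch that derives both from a confusion matrix and checks them against scikit-learn's own functions. The label lists `y_true_toy` and `y_pred_toy` are hypothetical and unrelated to the results above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for six messages
y_true_toy = ["spam", "ham", "spam", "ham", "ham", "spam"]
y_pred_toy = ["spam", "ham", "ham", "ham", "spam", "spam"]

# With labels=["ham", "spam"], ravel() yields tn, fp, fn, tp for the spam class
tn, fp, fn, tp = confusion_matrix(
    y_true_toy, y_pred_toy, labels=["ham", "spam"]
).ravel()

# Precision: share of predicted spam that really is spam
precision_manual = tp / (tp + fp)
# Recall: share of real spam that was caught
recall_manual = tp / (tp + fn)

print(precision_manual, precision_score(y_true_toy, y_pred_toy, pos_label="spam"))
print(recall_manual, recall_score(y_true_toy, y_pred_toy, pos_label="spam"))
```

For a spam filter, low precision means legitimate messages get flagged, while low recall means spam slips through, so it helps to look at both rather than accuracy alone.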


# **Thank you**