TASK3: E-MAIL SPAM DETECTION PROJECT
INTRODUCTION:
We have all received spam emails. Spam mail, or junk mail, is email that is sent to a
massive number of users at one time, frequently containing cryptic messages, scams, or,
most dangerously, phishing content.
In this project, we use Python to build an email spam detector, then use machine learning to
train it to classify emails as spam or non-spam.

In [3]:
import pandas as pd

file_path = 'C:\\Users\\User\\OneDrive\\Desktop\\ALISHBA\\extra-curiculums\\oasis_infobyte_offerLetter\\task3\\spam.csv'
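# Read with latin-1, since the default UTF-8 decoding may fail on this file
data = pd.read_csv(file_path, encoding='latin-1')
# Preview the first rows and print the column summary
data.head(), data.info()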


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


(     v1                                                 v2 Unnamed: 2  \
 0   ham  Go until jurong point, crazy.. Available only ...        NaN   
 1   ham                      Ok lar... Joking wif u oni...        NaN   
 2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
 3   ham  U dun say so early hor... U c already then say...        NaN   
 4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   
 
   Unnamed: 3 Unnamed: 4  
 0        NaN        NaN  
 1        NaN        NaN  
 2        NaN        NaN  
 3        NaN        NaN  
 4        NaN        NaN  ,
 None)

The dataset contains 5 columns:
1. v1: The label, with "ham" indicating non-spam and "spam" indicating spam.
2. v2: The actual email text.
3. Unnamed: 2, Unnamed: 3, and Unnamed: 4: These columns are almost entirely empty (50, 12, and 6 non-null values out of 5,572 rows), so we drop them from the analysis.

Next, we clean the dataset: keep only the two useful columns, rename them, and encode the label as 1 (spam) or 0 (ham).

In [10]:
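# Keep only the label and text columns and give them descriptive names
data_cleaned = data[['v1', 'v2']].rename(columns={'v1': 'label', 'v2': 'message'})
# Encode the label numerically: spam -> 1, ham -> 0
data_cleaned['label'] = data_cleaned['label'].map({'spam': 1, 'ham': 0})
# Check for missing values, then show the shape and the first rows
data_cleaned.isnull().sum(), data_cleaned.shape, data_cleaned.head()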


(label      0
 message    0
 dtype: int64,
 (5572, 2),
    label                                            message
 0      0  Go until jurong point, crazy.. Available only ...
 1      0                      Ok lar... Joking wif u oni...
 2      1  Free entry in 2 a wkly comp to win FA Cup fina...
 3      0  U dun say so early hor... U c already then say...
 4      0  Nah I don't think he goes to usf, he lives aro...)

The cleaned dataset has no missing values. We can now split the data into training and test sets and then convert the text into a numerical format using TF-IDF (Term Frequency-Inverse Document Frequency). The vectorizer is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

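# Hold out 20% of the messages for testing (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    data_cleaned['message'], data_cleaned['label'], test_size=0.2, random_state=42
)
# Learn the vocabulary and IDF weights from the training set only, then reuse them on the test set
tfidf = TfidfVectorizer(stop_words='english', max_features=3000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
X_train_tfidf.shape, X_test_tfidf.shape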


((4457, 3000), (1115, 3000))

The email messages have been converted into a numerical format using TF-IDF. Each message is now a vector of 3,000 features (one per vocabulary term), giving 4,457 rows in the training set and 1,115 rows in the test set.
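
To make the TF-IDF weighting more concrete, here is a minimal pure-Python sketch of the calculation. It assumes scikit-learn's default settings (smoothed IDF and L2 normalisation) and simplifies tokenisation to a whitespace split. Unlike the notebook's vectorizer, it removes no stop words and does not cap the vocabulary. The function name `tfidf_vectors` and the toy messages are hypothetical and are not taken from spam.csv.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for tokenised documents."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors


# Hypothetical toy messages (illustration only).
docs = [
    "free prize call now".split(),
    "call me later".split(),
    "free entry win prize".split(),
]
vocab, vectors = tfidf_vectors(docs)

# Show the non-zero weights of the second message.
for term, weight in zip(vocab, vectors[1]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the second message, "call" also appears in another message, so it gets a lower weight than "me" and "later", which appear only there. This is the effect TF-IDF is designed to produce: terms that are common across the corpus count for less.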

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

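# Train a Multinomial Naive Bayes classifier on the TF-IDF features
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train)
# Predict on the held-out test set and compute the evaluation metrics
y_pred = nb_classifier.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=['Ham', 'Spam'])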

print("Accuracy:", accuracy)
print("Classification Report:\n", report)


Accuracy: 0.97847533632287
Classification Report:
               precision    recall  f1-score   support

         Ham       0.98      1.00      0.99       965
        Spam       1.00      0.84      0.91       150

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115



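Here is a breakdown of the results. Overall accuracy is about 97.8% on the 1,115 test emails.
Ham (non-spam):
1. Precision: 0.98 (98% of the emails predicted as ham were actually ham)
2. Recall: 1.00 (after rounding, essentially all of the actual ham emails were correctly identified)
3. F1-Score: 0.99 (harmonic mean of precision and recall)

Spam:
1. Precision: 1.00 (after rounding, essentially all of the emails predicted as spam were truly spam)
2. Recall: 0.84 (84% of the actual spam emails were correctly identified)
3. F1-Score: 0.91 (harmonic mean of precision and recall)

In short, the classifier almost never flags a legitimate email as spam, but it misses about 16% of the spam.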
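
The following sketch shows how precision, recall, and F1 are computed from raw counts. The counts (126 spam emails caught, 24 missed, no ham flagged as spam, 965 ham emails kept) are an assumption: the notebook does not print a confusion matrix, so they are inferred from the rounded report and the accuracy value above. The helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed counts with spam as the positive class (inferred, not printed by the notebook).
tp, fp, fn, tn = 126, 0, 24, 965

spam = precision_recall_f1(tp, fp, fn)
# For the ham class the roles swap: a missed spam email is a false "ham" prediction.
ham = precision_recall_f1(tn, fn, fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("Spam (precision, recall, F1):", [round(v, 2) for v in spam])
print("Ham  (precision, recall, F1):", [round(v, 2) for v in ham])
print("Accuracy:", accuracy)
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values. For spam, the arithmetic mean of 1.00 and 0.84 would be 0.92, while the F1-score is 0.91.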