# **Email Spam Classifier: Project Overview**
This Jupyter Notebook presents an Email Spam Classifier developed using machine learning techniques. The goal of this project is to accurately distinguish between legitimate (ham) and unsolicited (spam) emails. Email spam remains a significant issue, impacting user productivity and posing security risks. By building an effective classifier, we aim to filter out unwanted messages, improving the email experience for users.

## Imports and Initial Setup
This section handles the necessary library imports and the initial setup required for our spam classification project. We import `numpy` for numerical operations, `pandas` for data manipulation, and key modules from `sklearn` for model selection, feature extraction, and model training.


In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Loading the Dataset

We begin by loading our email dataset, which is stored in a CSV file named `mail_data.csv`. This dataset contains a collection of emails labeled as either 'spam' or 'ham' (legitimate).


In [4]:
df = pd.read_csv('mail_data.csv')

In [5]:
print(df)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [7]:
# Replace missing (null) values with empty strings so every entry is a string
data = df.where((pd.notnull(df)), '')
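
The effect of `df.where(pd.notnull(df), '')` is easiest to see on a small frame. The sketch below uses a hypothetical three-row DataFrame (`toy` is not part of the notebook) with one missing message; it is only meant to illustrate the call, not to reproduce the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the notebook itself uses df loaded from mail_data.csv.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["See you at 5", np.nan, "Ok"],
})

# where() keeps values where the mask is True and substitutes ''
# where the mask is False, i.e. where the original value is missing.
cleaned = toy.where(pd.notnull(toy), "")

print(toy["Message"].isnull().sum())      # missing values before
print(cleaned["Message"].isnull().sum())  # missing values after
```

An empty string is a safe placeholder here because the vectorizer applied later expects every message to be a string.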

## Exploratory Data Analysis (EDA)

Before proceeding with model training, it's crucial to understand the structure and characteristics of our data. This section performs basic exploratory data analysis to inspect the dataset, including viewing the first few rows, checking data types, and identifying any missing values.
 

In [8]:
data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [64]:
data.shape

(5572, 2)
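
Two further checks fit naturally into this inspection: counting missing values per column and looking at how the two classes are distributed. The sketch below runs them on a hypothetical four-row frame (`toy` and its contents are assumptions made for illustration); in the notebook the same calls could be applied to `df`, since `data` has already had its nulls replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's DataFrame (same column names).
toy = pd.DataFrame({
    "Category": ["ham", "ham", "spam", "ham"],
    "Message": ["Ok lar", "See you soon", "Free entry now", np.nan],
})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()

# Absolute and relative class frequencies.
class_counts = toy["Category"].value_counts()
class_share = toy["Category"].value_counts(normalize=True)

print(missing_per_column)
print(class_counts)
print(class_share)
```

Knowing the class share matters later: when one class is much more common than the other, accuracy alone can look good even if the rarer class is often missed.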

## Label Encoding

To prepare the 'Category' column for machine learning algorithms, we convert the categorical labels ('spam' and 'ham') into numerical representations: 'spam' is encoded as `0` and 'ham' as `1`. This fixes the convention used in the rest of the notebook, where a prediction of `1` means ham.

In [13]:
data.loc[data['Category'] == 'spam', 'Category'] = 0
data.loc[data['Category'] == 'ham', 'Category'] = 1
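
An alternative way to express the same encoding is `Series.map` with an explicit dictionary. This is a sketch on a hypothetical toy column, not the notebook's method; `label_map` and `toy` are names introduced here for illustration. One practical difference is that `map` returns an integer column directly, whereas the `.loc` assignment leaves the column with `object` dtype, which is why the notebook later calls `astype('int')`.

```python
import pandas as pd

# Hypothetical toy column using the same labels as the dataset.
toy = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Same convention as the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
encoded = toy["Category"].map(label_map)

# Labels missing from label_map would become NaN, so count them as a sanity check.
unmapped = int(encoded.isnull().sum())

print(encoded.tolist())
print(encoded.dtype, unmapped)
```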

## Splitting Features and Target

Here, we separate our dataset into features (`X`) and the target variable (`Y`). The `Message` column, containing the email text, serves as our feature set, while the `Category` column (now numerically encoded) is our target variable that the model will learn to predict.

In [14]:
X = data['Message']

Y = data['Category']

In [15]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [16]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


## Train-Test Split

To evaluate the model's performance on unseen data, the dataset is divided into training and testing sets. A `test_size` of 0.2 means `20%` of the data will be reserved for testing, and `random_state=3` ensures reproducibility of the split.

In [19]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = 3)

In [20]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5572,)
(4457,)
(1115,)


In [21]:
print(Y.shape)
print(Y_train.shape)
print(Y_test.shape)

(5572,)
(4457,)
(1115,)
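
The role of `random_state` can be checked directly: two calls with the same seed return the same split. The sketch below shows this on hypothetical toy data (`X_toy`, `y_toy` are made up for illustration). It also shows the optional `stratify` argument, which the notebook does not use; stratifying keeps the spam/ham ratio the same in both subsets and may be worth considering when the classes are imbalanced.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 spam (0) and 15 ham (1) messages.
X_toy = pd.Series([f"message {i}" for i in range(20)])
y_toy = pd.Series([0] * 5 + [1] * 15)

# Same seed, same split.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=3)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=3)
same_split = a_test.index.equals(b_test.index)

# Optional: preserve the class ratio in both subsets.
_, _, s_train_y, s_test_y = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=3, stratify=y_toy
)

print(same_split)
print(s_train_y.mean(), s_test_y.mean())  # share of ham in each subset
```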


## TF-IDF Vectorization

Text data cannot be directly fed into machine learning models. Therefore, we use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the email messages into numerical feature vectors.

* `min_df = 1`: Considers terms that appear in at least one document.
* `stop_words = 'english'`: Removes common English stop words (like "the", "is", "a") which usually don't carry much meaning for classification.
* `lowercase=True`: Converts all text to lowercase to ensure consistency.

The `fit_transform` method is applied to the training data to learn the vocabulary and transform it, while `transform` is applied to the test data using the same vocabulary learned from the training set.

In [31]:
feature_extraction = TfidfVectorizer(min_df = 1, stop_words = 'english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
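
To make the fit/transform distinction concrete, here is a minimal sketch on a hypothetical three-document corpus (`train_docs` and `test_docs` are invented for illustration). It uses the same vectorizer settings as above and shows that a word seen only at test time is simply ignored, because the vocabulary is fixed when the vectorizer is fitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
train_docs = ["win a free prize now", "are you free tonight", "call me tonight"]
test_docs = ["free lottery prize"]  # "lottery" never appears in train_docs

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training text.
train_matrix = vec.fit_transform(train_docs)

# transform reuses that vocabulary; unseen words get no column.
test_matrix = vec.transform(test_docs)

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print("lottery" in vec.vocabulary_)
```

Fitting only on the training split keeps information from the test messages out of the features, so the test accuracy remains an honest estimate.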

In [32]:
print(X_train)

3075                  Don know. I did't msg him recently.
1787    Do you know why god created gap between your f...
1614                         Thnx dude. u guys out 2nite?
4304                                      Yup i'm free...
3266    44 7732584351, Do you want a New Nokia 3510i c...
                              ...                        
789     5 Free Top Polyphonic Tones call 087018728737,...
968     What do u want when i come back?.a beautiful n...
1667    Guess who spent all last night phasing in and ...
3321    Eh sorry leh... I din c ur msg. Not sad alread...
1688    Free Top ringtone -sub to weekly ringtone-get ...
Name: Message, Length: 4457, dtype: object


In [33]:
print(X_train_features)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 34775 stored elements and shape (4457, 7431)>
  Coords	Values
  (0, 2329)	0.38783870336935383
  (0, 3811)	0.34780165336891333
  (0, 2224)	0.413103377943378
  (0, 4456)	0.4168658090846482
  (0, 5413)	0.6198254967574347
  (1, 3811)	0.17419952275504033
  (1, 3046)	0.2503712792613518
  (1, 1991)	0.33036995955537024
  (1, 2956)	0.33036995955537024
  (1, 2758)	0.3226407885943799
  (1, 1839)	0.2784903590561455
  (1, 918)	0.22871581159877646
  (1, 2746)	0.3398297002864083
  (1, 2957)	0.3398297002864083
  (1, 3325)	0.31610586766078863
  (1, 3185)	0.29694482957694585
  (1, 4080)	0.18880584110891163
  (2, 6601)	0.6056811524587518
  (2, 2404)	0.45287711070606745
  (2, 3156)	0.4107239318312698
  (2, 407)	0.509272536051008
  (3, 7414)	0.8100020912469564
  (3, 2870)	0.5864269879324768
  (4, 2870)	0.41872147309323743
  (4, 487)	0.2899118421746198
  :	:
  (4454, 2855)	0.47210665083641806
  (4454, 2246)	0.47210665083641806
  (4455, 4456)	0.24

## Model Training: Logistic Regression

We choose Logistic Regression as our classification model. Despite its name, Logistic Regression is a classification algorithm: it estimates the probability that an input belongs to a class, which makes it well-suited to distinguishing between two classes (spam or ham). The model is trained using the TF-IDF features from the training data (`X_train_features`) and their corresponding labels (`Y_train`).

In [34]:
model = LogisticRegression()

In [35]:
model.fit(X_train_features, Y_train)
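
The vectorizer and the classifier can also be bundled into a single scikit-learn `Pipeline`, so that raw text goes in and a label comes out. This is an optional sketch, not what the notebook does; the toy `texts` and `labels` and the name `spam_pipeline` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 0 = spam, 1 = ham, as in the notebook.
texts = [
    "free prize claim now", "win cash free entry", "urgent free ringtone offer",
    "see you at lunch", "are we still on for dinner", "meeting moved to monday",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF first and then the classifier.
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(texts, labels)

# Raw strings can be passed directly; vectorization happens inside the pipeline.
pred = spam_pipeline.predict(["free prize waiting for you"])
print(pred)
```

Keeping both steps in one object reduces the risk of transforming new text with a different vectorizer than the one used during training.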

## Model Evaluation

After training, it's essential to evaluate the model's performance. We assess the accuracy of the model on both the training data and the unseen test data.

* **Accuracy on Training Data**: Indicates how well the model learned from the data it was trained on.
* **Accuracy on Test Data**: Provides an unbiased estimate of the model's performance on new, unseen emails, reflecting its generalization capability.

In [37]:
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

In [38]:
print('Accuracy on Training Data : ', accuracy_on_training_data)

Accuracy on Training Data :  0.9676912721561588


In [40]:
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

In [41]:
print('Accuracy on Test Data : ', accuracy_on_test_data)

Accuracy on Test Data :  0.9668161434977578
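
Accuracy treats both kinds of error the same, so it can help to look at a confusion matrix and at precision and recall for the spam class as well. The sketch below uses small hypothetical label arrays (`y_true`, `y_pred` are invented, not the notebook's results) with the notebook's encoding of `0` for spam and `1` for ham; in the notebook the same functions could be called with `Y_test` and `prediction_on_test_data`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 8 ham (1) and 2 spam (0); one spam message is missed.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, both ordered [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Treat spam (0) as the class of interest.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(acc)
print(cm)
print(spam_precision, spam_recall)
```

In this toy case accuracy is 0.9 even though only half of the spam messages are caught, which is the kind of gap accuracy alone does not reveal.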


## Making Predictions on New Email

This section demonstrates how to use the trained spam classifier to predict whether a new, unseen email is "ham" or "spam". Users are encouraged to replace the contents of the `input_your_mail` variable with their own email text (either ham or spam) to test the model's predictions.

The input email string is first transformed into a numerical feature vector using the same TF-IDF vectorizer that was fitted on the training data. The trained Logistic Regression model then uses these features to make a prediction.

In [52]:
input_your_mail = ["Today, we've released additional information regarding our related party and off-balance sheet transactions. This information is available on our website and in a Form 8-K filing with the SEC. We are restating our financial statements from 1997 to 2000 and for the first and second quarters of 2001. This is due to the consolidation of certain off-balance sheet entities that should have been included in our financial statements. While these restatements reflect a reduction in shareholders' equity and an increase in debt for prior years, they have no material effect on Enron's current financial position or reported earnings for the nine-month period ending September 2001. We continue to cooperate fully with the SEC's investigation and will keep you updated on any developments. More detailed information can be found in the Form 8-K filing."]

# Example spam mail: 
# Top rated online store . hot new - levitra / lipitor / nexium weekly speciasls on all our drugs . - zocor - soma - ambien - phentermine - vlagra - discount generic ' s on all - more next day discrete shipping on all products ! http : / / www . rxstoreusa . biz / shopping please , i wish to receive no more discounts on valuable items . http : / / www . rxstoreusa . biz / a . html jet djjdnj 33 xks npvjkps ekhvhdqkxhm xvgwk cpjtrsbqgogmjnyi uknuilrj moqwrcaigwvvfpsljzycp k p e p gp c j

# Example ham mail: 
# Today, we've released additional information regarding our related party and off-balance sheet transactions. This information is available on our website and in a Form 8-K filing with the SEC. We are restating our financial statements from 1997 to 2000 and for the first and second quarters of 2001. This is due to the consolidation of certain off-balance sheet entities that should have been included in our financial statements. While these restatements reflect a reduction in shareholders' equity and an increase in debt for prior years, they have no material effect on Enron's current financial position or reported earnings for the nine-month period ending September 2001. We continue to cooperate fully with the SEC's investigation and will keep you updated on any developments. More detailed information can be found in the Form 8-K filing.
input_data_features = feature_extraction.transform(input_your_mail)

prediction = model.predict(input_data_features)

print(prediction)

if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')

[1]
Ham mail
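
The prediction steps above can be wrapped in a small helper that also reports how confident the model is. This is a sketch under assumptions: `classify_message` is a hypothetical function name, and the tiny corpus exists only so the example runs on its own. In the notebook, `feature_extraction` and `model` would be passed in place of `vectorizer` and `classifier`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; 0 = spam, 1 = ham.
texts = [
    "free prize claim now", "win cash free entry", "urgent free ringtone offer",
    "see you at lunch", "are we still on for dinner", "meeting moved to monday",
]
labels = [0, 0, 0, 1, 1, 1]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
classifier = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)


def classify_message(text, vectorizer, classifier):
    """Return a readable label and the estimated probability of ham."""
    features = vectorizer.transform([text])
    label = int(classifier.predict(features)[0])
    proba = classifier.predict_proba(features)[0]
    # classes_ gives the column order of predict_proba.
    ham_col = list(classifier.classes_).index(1)
    name = "Ham mail" if label == 1 else "Spam mail"
    return name, float(proba[ham_col])


result_label, ham_probability = classify_message(
    "claim your free prize", vectorizer, classifier
)
print(result_label, round(ham_probability, 3))
```

Reporting the probability alongside the label makes it possible to treat borderline messages differently from confident ones.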


## Project Summary and Next Steps

We've built an Email Spam Classifier that converts raw email text into TF-IDF features and classifies it with Logistic Regression. On the held-out test set the model reaches about 96.7% accuracy (0.9668), nearly identical to its 96.8% training accuracy (0.9677), which suggests it generalizes without obvious overfitting. The result is a functional tool that can help users maintain a cleaner inbox.

This project serves as a solid foundation, and there are many exciting avenues for future exploration:

* Deployment: Integrating the trained model into a web application or email client for real-time spam detection.
* User Feedback Loop: Implementing a system where users can correct misclassified emails to continuously retrain and improve the model.
* Analysis of Misclassifications: Deep-diving into emails that the model misclassified to understand patterns and refine the preprocessing or model architecture.
* Handling Attachments and URLs: Extending the classifier to analyze content within attachments or the nature of URLs in emails.
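
As a first step toward the deployment idea listed above, a trained model can be saved to disk and reloaded in another process. The sketch below is one possible approach, assuming `joblib` (which is installed alongside scikit-learn) and a hypothetical file name `spam_pipeline.joblib`; the toy corpus is only there to make the example self-contained.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 0 = spam, 1 = ham.
texts = [
    "free prize claim now", "win cash free entry", "urgent free ringtone offer",
    "see you at lunch", "are we still on for dinner", "meeting moved to monday",
]
labels = [0, 0, 0, 1, 1, 1]

pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
pipeline.fit(texts, labels)

samples = ["free cash prize", "lunch on monday"]

# Save and reload in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "spam_pipeline.joblib"  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same_predictions = bool((restored.predict(samples) == pipeline.predict(samples)).all())
print(same_predictions)
```

Saving the vectorizer and the classifier together as one pipeline ensures that new text is transformed with the vocabulary learned during training. Files saved this way should only be loaded from trusted sources, since loading executes pickled code.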