# Project Overview

This project involves building a spam email classifier using a machine learning approach. The goal is to predict whether an email is "ham" (not spam) or "spam" based on the text content of the email. To achieve this, we use techniques such as Natural Language Processing (NLP), feature extraction, oversampling (to balance the dataset), and classification algorithms (like Naive Bayes and Logistic Regression).

# Problem Statement

The goal of this project is to classify emails as spam or ham based on the email's content. Spam emails can be defined as unsolicited or unwanted emails, often used for advertising or spreading malware, whereas ham emails are regular, legitimate emails. The task is to train a machine learning model to predict whether an email is spam or ham.

# Key Challenges:

Imbalanced Data: Spam datasets are often imbalanced, where one class (e.g., ham) dominates the other (spam). We address this by using techniques like oversampling.
Text Preprocessing: Emails are composed of unstructured text, which requires preprocessing and feature extraction to make it suitable for machine learning.
Model Selection: Selecting the best algorithm for classification, such as Naive Bayes (good for text classification) and Logistic Regression.

In [None]:
import nltk

# Dataset

The dataset (mail_data.csv) contains two primary columns:

Message: The email content (text data).
Category: The label for the email, where:
'spam' denotes a spam email (labeled as 0).
'ham' denotes a legitimate email (labeled as 1).
The dataset is used to train a machine learning model, which can then predict whether a given email is spam or ham.

Data Cleaning and Preparation:
Removing Duplicates: The dataset might contain duplicate entries that can bias the model. These are removed.
Handling Missing Values: Any missing values in the dataset are checked and handled.
Label Encoding: The categorical label values ('spam', 'ham') are converted into numerical values (0 for spam, 1 for ham).
Text Processing:
TF-IDF Vectorization: We convert the raw text (email content) into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency), which reflects the importance of a word in the context of the document.
Oversampling (SMOTE): If the dataset is imbalanced, we use the Synthetic Minority Over-sampling Technique (SMOTE) to balance the classes by generating synthetic samples for the minority class (spam).
Imbalanced Data: Spam email datasets are often imbalanced, with ham making up the majority of emails. This imbalance can bias a model toward the majority class, so that it performs poorly on the minority class (spam). We address this with oversampling techniques.
Text Data: Emails consist of unstructured text, which machine learning models cannot process directly. We use feature extraction techniques such as TF-IDF to convert the text into numerical features that the model can work with.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.linear_model import LogisticRegression


# Read the Data

In [None]:
df=pd.read_csv('/content/mail_data.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


# Summarize the Data

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


# Check for Null Values

In [None]:
df.isnull().sum()

Column,Null count
Category,0
Message,0


# Cleaning the Data:

In [None]:
df.duplicated().sum()

415

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.duplicated().sum()

0

In [None]:
df.shape

(5157, 2)

# Label Encoding

In [None]:
# Encode the labels in place: spam -> 0, ham -> 1
df.loc[df['Category'] == 'spam', 'Category'] = 0
df.loc[df['Category'] == 'ham', 'Category'] = 1

# Check the Unique Values

In [None]:
df['Category'].unique()

array([1, 0], dtype=object)
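
The array above still has `dtype=object` because the values were assigned into a column that originally held strings. One possible alternative, sketched here on a small stand-in frame called `demo` (a hypothetical name, not the notebook's `df`), is to map the labels with a dictionary, which yields an integer column directly.

```python
import pandas as pd

# Small stand-in frame; the notebook itself would use df loaded from mail_data.csv.
demo = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Same convention as the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
demo["Category"] = demo["Category"].map(label_map)

print(demo["Category"].dtype)
print(demo["Category"].tolist())
```

Any label missing from `label_map` would become NaN, so this approach also makes unexpected category values easy to spot.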

# View the Value Counts of the Category Column

In [None]:
df['Category'].value_counts()

Category,count
1,4516
0,641
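
To put these counts in perspective, the short calculation below derives the spam share and the accuracy of a trivial classifier that always predicts ham. The two counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
ham_count = 4516
spam_count = 641
total = ham_count + spam_count

spam_share = spam_count / total
# Accuracy of a trivial classifier that always predicts the majority class (ham).
majority_baseline = ham_count / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because roughly one message in eight is spam, a model that never flags anything would already score about 0.876 on accuracy, which is a useful reference point when reading any accuracy figure for this dataset.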


# Resampling the Data:
To handle class imbalance in the dataset (here, far more ham than spam emails), we use SMOTE:

In [None]:
!pip install imbalanced-learn  # provides the imblearn package imported below
# Explicitly convert 'Category' column to integer type
df['Category'] = df['Category'].astype(int)

from imblearn.over_sampling import RandomOverSampler, SMOTE
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Assuming 'df' is your DataFrame with 'Category' as the target variable
X = df['Message']  # Features (only the 'Message' column)
y = df['Category']  # Target

# Create a TfidfVectorizer to convert text to numerical features
vectorizer = TfidfVectorizer()

# Fit and transform the text data
X_vec = vectorizer.fit_transform(X)

# Create an oversampler object (choose either RandomOverSampler or SMOTE)
# Uncomment the one you wish to use:
#ros = RandomOverSampler()  # For Random Oversampling
ros = SMOTE()  # For Synthetic Minority Oversampling

# Apply the oversampler to the data
# Use the vectorized features (X_vec) instead of the original text data (X)
X_resampled, y_resampled = ros.fit_resample(X_vec, y)

# If you need the resampled data as a DataFrame, you can convert it back:
# X_resampled_df = pd.DataFrame(X_resampled.toarray(), columns=vectorizer.get_feature_names_out())
# df = pd.concat([X_resampled_df, y



In [None]:
df['Category'].value_counts()

Category,count
1,4516
0,641
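
The counts above are unchanged because `fit_resample` returns new arrays (`X_resampled`, `y_resampled`) and leaves `df` untouched. To see the effect of oversampling, inspect the resampled labels instead. The sketch below illustrates the idea with plain random oversampling from scikit-learn's `resample` rather than SMOTE, so that it depends only on scikit-learn; the data and all names (`X_demo`, `y_demo`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical labels with a similar kind of imbalance (1 = ham, 0 = spam).
y_demo = np.array([1] * 8 + [0] * 2)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

minority_mask = y_demo == 0
n_majority = int((~minority_mask).sum())

# Random oversampling: draw minority rows with replacement until the classes match.
X_min_up, y_min_up = resample(
    X_demo[minority_mask], y_demo[minority_mask],
    replace=True, n_samples=n_majority, random_state=42,
)

X_balanced = np.vstack([X_demo[~minority_mask], X_min_up])
y_balanced = np.concatenate([y_demo[~minority_mask], y_min_up])

# Counts per class, indexed by label: position 0 is spam, position 1 is ham.
print(np.bincount(y_balanced))
```

Oversampling is usually applied to the training split only, so that the test set keeps the original class distribution.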


In [None]:
x = df[['Message']]   # independent feature
y = df[['Category']]  # dependent (target) feature

In [None]:
x_train.shape

(4125, 8709)

Feature Extraction: The email content is converted to numerical features using TF-IDF:

Splitting the Data: The dataset is split into training and test sets (80% training and 20% testing):

In [None]:
# Assuming 'df' is your DataFrame containing the 'Message' and 'Category' columns

# 1. Create a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# 2. Fit the vectorizer on all messages and transform them
x_vect= vectorizer.fit_transform(df['Message'])

# 3. Split the data into training and testing sets
# Note that we split x_vect, which is the TF-IDF-transformed data.
x_train, x_test, y_train, y_test = train_test_split(x_vect, y, test_size=0.2, random_state=42)
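
The cell above fits the vectorizer on every message before splitting, so the vocabulary and IDF weights also reflect the test messages. A common alternative is to split the raw text first and fit the vectorizer on the training portion only. The following is a minimal sketch of that ordering on an invented ten-row frame; `demo`, `split_vectorizer`, and the other names are assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Tiny stand-in for df; the messages are invented for illustration.
demo = pd.DataFrame({
    "Message": ["free prize now", "see you at lunch", "win cash today",
                "call me later", "claim your free gift", "ok thanks",
                "meeting moved to noon", "urgent prize waiting",
                "are you home", "lunch tomorrow"],
    "Category": [0, 1, 0, 1, 0, 1, 1, 0, 1, 1],
})

# Split the raw text first so the test messages stay unseen.
text_train, text_test, y_tr, y_te = train_test_split(
    demo["Message"], demo["Category"],
    test_size=0.2, random_state=42, stratify=demo["Category"],
)

# Fit the vocabulary and IDF weights on the training messages only.
split_vectorizer = TfidfVectorizer()
x_tr = split_vectorizer.fit_transform(text_train)
x_te = split_vectorizer.transform(text_test)

print(x_tr.shape, x_te.shape)
```

The `stratify` argument keeps the spam-to-ham ratio similar in both splits, which matters when one class is rare.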

In [None]:
print(x_vect)

  (0, 3567)	0.1472838323968625
  (0, 8080)	0.2284805832636267
  (0, 4370)	0.32548246375773743
  (0, 5954)	0.2539580820731383
  (0, 2334)	0.25142216206874096
  (0, 1313)	0.2468216328953706
  (0, 5567)	0.15809897531782258
  (0, 4110)	0.10777814259403065
  (0, 1763)	0.27452746613871426
  (0, 3651)	0.1816911244016972
  (0, 8544)	0.22981732189151766
  (0, 4497)	0.27452746613871426
  (0, 1761)	0.31057908234200526
  (0, 2057)	0.27452746613871426
  (0, 7690)	0.15584788863245208
  (0, 3611)	0.15221254465391032
  (0, 1079)	0.32548246375773743
  (0, 8320)	0.1820605371713429
  (1, 5534)	0.27641681599588036
  (1, 4533)	0.40693812451964195
  (1, 4338)	0.5234057786973465
  (1, 8446)	0.43046670700566175
  (1, 5563)	0.5465710490257072
  (2, 4110)	0.07705316657450818
  (2, 3369)	0.11168084942140385
  :	:
  (5155, 4241)	0.12239086725008379
  (5155, 8367)	0.19129579355204077
  (5155, 1094)	0.11223907528342428
  (5155, 4638)	0.15964099084339545
  (5155, 7085)	0.18503795441669896
  (5155, 3319)	0.1220602681

In [None]:
"""future_extraction=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True) # Changed 'True' to True
x_train_features=future_extraction.fit_transform(x_train)
x_test_features=future_extraction.transform(x_test)

y_train=y_train.astype('int')
y_test=y_test.astype('int')"""

"future_extraction=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True) # Changed 'True' to True\nx_train_features=future_extraction.fit_transform(x_train)\nx_test_features=future_extraction.transform(x_test)\n\ny_train=y_train.astype('int')\ny_test=y_test.astype('int')"

# Model Training and Prediction:

We use MultinomialNB to train the classifier. The model is then used to make predictions on unseen data:


In [None]:
!pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


# 4. Initialize and train the model: Multinomial Naive Bayes
# (swap in the commented-out LogisticRegression to compare)
#model = LogisticRegression()
model=MultinomialNB()
model.fit(x_train, y_train)

# 5. Predict on the test set
# x_test is already transformed, so you don't need to transform it again.
y_pred = model.predict(x_test) # Directly use x_test for prediction.



  y = column_or_1d(y, warn=True)
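
The `column_or_1d` warning above appears because `y` was selected with double brackets (`df[['Category']]`), which yields a one-column DataFrame instead of a 1-D Series. The sketch below passes a 1-D target and also bundles the vectorizer and classifier into a single scikit-learn pipeline. The training messages and the names `nb_demo`, `y_1d`, and `spam_pipeline` are invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data (0 = spam, 1 = ham), not taken from mail_data.csv.
nb_demo = pd.DataFrame({
    "Message": ["win a free prize now", "free cash claim today",
                "urgent prize waiting for you", "see you at lunch",
                "call me when you are home", "meeting moved to noon"],
    "Category": [0, 0, 0, 1, 1, 1],
})

# Single brackets give a 1-D Series, which avoids the column-vector warning.
y_1d = nb_demo["Category"]

# The pipeline applies TF-IDF and then Naive Bayes in one fit/predict call.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(nb_demo["Message"], y_1d)

pred = spam_pipeline.predict(["free prize waiting"])
print(pred)
```

A pipeline also guarantees that new text is transformed with the same fitted vectorizer that was used during training.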


# Evaluating the Model:
The model's performance is evaluated using the accuracy score:

In [None]:
accuracy_score(y_test,y_pred)

0.9554263565891473
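
Accuracy alone can hide how the model treats the minority class. The notebook imports `confusion_matrix` but does not call it, so here is a minimal sketch of how it and `classification_report` could be used. The labels in `y_true_demo` and `y_pred_demo` are hypothetical; in the notebook they would be `y_test` and `y_pred`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions (0 = spam, 1 = ham).
y_true_demo = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes, ordered [spam, ham].
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
print(cm)

# Per-class precision, recall, and F1.
print(classification_report(
    y_true_demo, y_pred_demo, labels=[0, 1], target_names=["spam", "ham"]
))
```

In this made-up example the accuracy is 0.8, yet only one of the three spam messages is caught, which is the kind of gap the per-class recall makes visible.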

In [None]:
df

Unnamed: 0,Category,Message
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,0,This is the 2nd time we have tried 2 contact u...
5568,1,Will ü b going to esplanade fr home?
5569,1,"Pity, * was in mood for that. So...any other s..."
5570,1,The guy did some bitching but I acted like i'd...


# User Input Prediction:

After training, the model can be used to classify user-inputted emails:




In [None]:
# Now you can use 'input' to get user input
input_your_email = input('enter your email :')

# Use the SAME vectorizer instance for prediction
input_transformed = vectorizer.transform([input_your_email])  # Use 'vectorizer' instead of 'future_extraction'

# Make prediction
prediction = model.predict(input_transformed)
print(prediction)
if prediction[0]==1:
  print('your email is ham')
else:
  print('your email is spam')

enter your email :XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL
[0]
your email is spam
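
For repeated or non-interactive use, the prediction logic above can be wrapped in a small helper. This is a sketch under stated assumptions: `classify_email` is a hypothetical function name, and the tiny training set exists only so the example runs on its own. In the notebook, the fitted `vectorizer` and `model` would be passed in instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def classify_email(text, fitted_vectorizer, fitted_model):
    """Return 'ham' or 'spam' for one message (label 1 = ham, 0 = spam)."""
    features = fitted_vectorizer.transform([text])
    return "ham" if fitted_model.predict(features)[0] == 1 else "spam"


# Tiny invented training set so the sketch is self-contained.
train_texts = ["win a free prize now", "claim your free cash",
               "see you at lunch", "call me when you are home"]
train_labels = [0, 0, 1, 1]

helper_vectorizer = TfidfVectorizer()
helper_model = MultinomialNB()
helper_model.fit(helper_vectorizer.fit_transform(train_texts), train_labels)

result = classify_email("free prize inside", helper_vectorizer, helper_model)
print(result)
```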


# Final Report and Documentation


The following should be included in the final project report:

Introduction: Overview of spam email detection and its importance.
Problem Statement: The goal of the project is to classify emails as spam or ham.
Dataset Overview: Description of the dataset (columns, number of rows, etc.).
Data Preprocessing:
Handling missing values, duplicates, and label encoding.
TF-IDF for feature extraction.
Handling class imbalance with SMOTE.
Model Development:
Explanation of the models used (Naive Bayes and Logistic Regression).
Training, validation, and testing steps.
Evaluation:
Accuracy score and confusion matrix results.
Possible improvements and challenges.
Conclusion: Summary of findings and potential real-world applications of the spam classifier.
Spam email classification is an important task in the field of Natural Language Processing (NLP) and machine learning. Spam emails are unsolicited, often unwanted messages, typically used for advertising or spreading malware. By classifying emails as either spam or ham (legitimate email), organizations can automatically filter out unwanted messages, protecting users from malicious content.

This project aims to build a machine learning model that can classify emails as spam or ham based on their textual content. We will use techniques like TF-IDF Vectorization, Oversampling (SMOTE), and various classification algorithms (Logistic Regression and Naive Bayes) to achieve this goal.
# Future Improvements

Hyperparameter Tuning: Experiment with different hyperparameters (e.g., regularization for Logistic Regression).
Advanced Models: Try more complex models like Random Forest, Support Vector Machines (SVM), or deep learning techniques (e.g., LSTM).
Feature Engineering: Experiment with more advanced text representation techniques, such as Word2Vec or BERT embeddings, for richer feature extraction.
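
As one possible starting point for the hyperparameter tuning idea, the sketch below searches over the regularization strength `C` of Logistic Regression with `GridSearchCV`. The messages, labels, grid values, and the choice of macro-averaged F1 as the score are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented messages (0 = spam, 1 = ham); replace with the real data.
tune_texts = [
    "win a free prize now", "claim your free cash today",
    "urgent prize waiting", "free entry to win cash",
    "congratulations you won a gift", "cheap loans call now",
    "see you at lunch", "call me when you are home",
    "meeting moved to noon", "are you coming tonight",
    "thanks for the notes", "running late be there soon",
]
tune_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Smaller C means stronger regularization.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# Macro F1 weights spam and ham equally, regardless of class size.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(tune_texts, tune_labels)

print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refitted on the training folds of each split, so the cross-validation scores are not influenced by the held-out messages.
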
# Conclusion

This project demonstrates how to classify emails as spam or ham using machine learning and natural language processing techniques. By transforming text data into numerical features using TF-IDF and overcoming class imbalance with SMOTE, we can effectively train models such as Naive Bayes to predict the category of new, unseen emails.

For future improvements, the model could be enhanced by:

Trying different feature extraction techniques (e.g., Word2Vec).
Using more advanced algorithms (e.g., Random Forest, SVM).
Fine-tuning model hyperparameters for better performance.


