<a href="https://colab.research.google.com/github/BilawalBaloch/Projects/blob/main/spam_email_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [31]:
import pandas as pd
import numpy as np
import seaborn as sns


In [32]:
data = pd.read_csv("/content/email_classification_dataset.csv")

In [33]:
data

| | id | email | label |
|---|---|---|---|
| 0 | 2685 | From: support@legitcompany.com\nSubject: Regar... | ham |
| 1 | 5857 | From: noreply@softwareupdates.com\nSubject: We... | ham |
| 2 | 2399 | From: noreply@softwareupdates.com\nSubject: Im... | ham |
| 3 | 3244 | From: info@customerservice.co\nSubject: Team S... | ham |
| 4 | 2844 | From: info@customerservice.co\nSubject: Team S... | ham |
| ... | ... | ... | ... |
| 9995 | 6397 | From: noreply@softwareupdates.com\nSubject: Ca... | ham |
| 9996 | 7470 | From: family@homemail.net\nSubject: Weekly New... | ham |
| 9997 | 9273 | From: team@projectmanagement.com\nSubject: Fee... | ham |
| 9998 | 3192 | From: accounts@billingcorp.com\nSubject: Photo... | ham |


In [34]:
# Number of non-null labels; count() already returns a scalar, so no .sum() is needed
data['label'].count()

np.int64(10000)

In [35]:
data.describe()

| | id |
|---|---|
| count | 10000.0 |
| mean | 5000.5 |
| std | 2886.89568 |
| min | 1.0 |
| 25% | 2500.75 |
| 50% | 5000.5 |
| 75% | 7500.25 |
| max | 10000.0 |


In [36]:
data['label'].value_counts()

| label | count |
|---|---|
| ham | 8500 |
| spam | 1500 |
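
The counts above show an imbalanced dataset (8500 ham versus 1500 spam). A minimal sketch of how to express that imbalance as fractions, using a hypothetical stand-in series (`toy_labels`) rather than the real `data` frame:

```python
import pandas as pd

# Hypothetical stand-in with the same 85/15 ham-to-spam ratio as the counts above.
toy_labels = pd.Series(["ham"] * 85 + ["spam"] * 15)

# normalize=True returns class fractions instead of raw counts.
shares = toy_labels.value_counts(normalize=True)

# Accuracy a classifier would reach by always predicting the majority class.
majority_baseline = shares.max()

print(shares)
print(majority_baseline)
```

The majority-class share is a useful reference point: any accuracy figure should be read against it.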


In [37]:
data['label'] = data['label'].map({'ham':0, 'spam':1})
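
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected label spelling would silently become a missing value. A small sketch of a sanity check, using hypothetical toy labels (the `"Spam"` entry is invented for illustration and is not claimed to occur in the dataset):

```python
import pandas as pd

# Hypothetical toy labels; "Spam" (capitalized) is not a key in the mapping.
toy_labels = pd.Series(["ham", "spam", "Spam", "ham"])
mapped = toy_labels.map({"ham": 0, "spam": 1})

# Values missing from the dict become NaN, so count them before training.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the count is non-zero, the labels need cleaning (for example lowercasing) before the mapping is applied.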

In [38]:
from sklearn.model_selection import train_test_split

In [39]:
from sklearn.naive_bayes import MultinomialNB

In [40]:
from sklearn.metrics import classification_report

In [41]:
X = data['email']
y = data['label']

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
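
The split above is purely random. With imbalanced classes, one optional alternative is a stratified split, which keeps the class ratio the same in both parts. The sketch below is not what the notebook does; it uses hypothetical toy arrays (`X_toy`, `y_toy`) to show the effect of the `stratify` parameter:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 170 ham (0) and 30 spam (1), i.e. the same 85/15 ratio.
y_toy = np.array([0] * 170 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy preserves the class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())  # spam fraction in each part
```

Note that adding `stratify` to the notebook's own split would change the support counts reported later.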

In [43]:
model = MultinomialNB()

Fitting `MultinomialNB` directly on the raw email strings raises a `ValueError`, because the model expects numerical features. To address this, we use `CountVectorizer` from scikit-learn, which converts the text into a matrix of token counts that the model can process.

In [44]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
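
To make the token-count matrix concrete, here is a minimal sketch on a hypothetical two-document corpus (the documents and the `toy_` names are invented for illustration). It also shows why the vectorizer is fitted on the training text only: words unseen during fitting are ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, for illustration only.
toy_docs = ["win a free prize now", "meeting notes for the team meeting"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # learns the vocabulary

# The default tokenizer keeps tokens of two or more characters, so "a" is dropped.
vocab = sorted(toy_vectorizer.vocabulary_)
meeting_idx = toy_vectorizer.vocabulary_["meeting"]

# transform() reuses the learned vocabulary; the unseen word "lunch" is ignored.
new_counts = toy_vectorizer.transform(["free lunch meeting"])

print(vocab)
print(toy_counts.toarray())
print(new_counts.toarray())
```

Each row is one document and each column is one vocabulary word; the second row has a 2 in the `meeting` column because that word appears twice.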

Now that the email data is vectorized, we can fit the `MultinomialNB` model with the numerical data.

In [45]:
model.fit(X_train_vectorized, y_train)

Now that the model is trained, we can evaluate its performance on the test set using the vectorized test data.

In [46]:
y_pred = model.predict(X_test_vectorized)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2535
           1       1.00      1.00      1.00       465

    accuracy                           1.00      3000
   macro avg       1.00      1.00      1.00      3000
weighted avg       1.00      1.00      1.00      3000
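
The precision and recall columns in the report are derived from confusion-matrix counts. A short sketch of that relationship on hypothetical toy labels (these are not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = spam, 0 = ham).
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(tn, fp, fn, tp)
print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
```

Printing the confusion matrix alongside the report makes it easy to see how many individual emails were misclassified, which the rounded scores can hide.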



# Email Classification Project

This notebook demonstrates a basic email classification project using a Multinomial Naive Bayes model.

## Project Steps

1.  **Data Loading**: Loaded the email classification dataset from a CSV file.
2.  **Data Exploration**: Performed basic data exploration to understand the dataset, including checking the number of samples and the distribution of labels (ham vs. spam).
3.  **Data Preprocessing**: Converted the categorical labels ('ham', 'spam') into numerical values (0, 1).
4.  **Data Splitting**: Split the dataset into training and testing sets.
5.  **Text Vectorization**: Converted the email text data into numerical features using `CountVectorizer`.
6.  **Model Training**: Trained a Multinomial Naive Bayes model on the vectorized training data.
7.  **Model Evaluation**: Evaluated the trained model's performance on the test set using a classification report, which includes precision, recall, and F1-score.
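
The vectorization, training, and prediction steps above can also be sketched as a single scikit-learn pipeline. This is an illustrative variant on hypothetical toy emails, not the notebook's own code; `spam_clf` and the example texts are invented names and data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy emails (1 = spam, 0 = ham), for illustration only.
toy_emails = [
    "win a free prize now",
    "claim your free cash prize",
    "team meeting moved to monday",
    "please review the project notes",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together,
# and applies the same vocabulary automatically at prediction time.
spam_clf = make_pipeline(CountVectorizer(), MultinomialNB())
spam_clf.fit(toy_emails, toy_labels)

pred = spam_clf.predict(["free prize inside", "notes from the meeting"])
print(pred)
```

Bundling both steps in one object avoids accidentally fitting the vectorizer on test data and lets `predict` accept raw text directly.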

## How to Run the Code

1.  Make sure you have the required libraries installed (pandas, numpy, scikit-learn). You can install them using pip: