# Email Spam Detection Using Multinomial Naive Bayes

## Introduction

This project focuses on applying **Bayesian machine learning** techniques to tackle a real-world problem: classifying emails as either spam or ham (non-spam). The **Multinomial Naive Bayes** algorithm, which is based on Bayes' theorem, is well-suited for text classification, especially when the data is represented by word counts or frequencies.

The dataset used for this project contains labeled emails categorized as either spam or ham. By transforming the text data into numerical features, we can apply the **Multinomial Naive Bayes** classifier to predict whether an incoming email is spam. Bayesian learning enables us to compute posterior probabilities for each class, incorporating prior knowledge and likelihoods based on word frequencies in the email text.

This project demonstrates the effectiveness of probabilistic modeling in text classification tasks and highlights my growing expertise in **Bayesian machine learning** concepts, particularly in using **Naive Bayes** for real-world applications.


### Import libraries and load the dataset (downloaded from Kaggle)

In [1]:
import pandas as pd
df = pd.read_csv("spam.csv")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
df.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


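#### Summarizes each category: message count, unique messages, most frequent message, and how often it occurs:


- ham: 4825 messages, 4516 unique; the most frequent, "Sorry, I'll call later", appears 30 times
- spam: 747 messages, 641 unique; the most frequent message appears 4 times


The classes are imbalanced: ham makes up about 87% of the 5572 messages, which matters when interpreting accuracy.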


#### Creates a new binary column 'spam' indicating whether the message is spam (1) or not (0).


In [3]:
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)
df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


#### Splits the dataset into a training set (used to fit the model) and a test set (held out to evaluate it).

#### Scoring on held-out data does not prevent overfitting by itself, but it exposes it and gives an honest estimate of performance on unseen emails.

In [4]:
from sklearn.model_selection import train_test_split
# Hold out a test set (25% of the rows by default); the split is random on each run
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam)
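
Because spam is a small share of the data, it can help to keep the class ratio the same in both parts of the split and to fix the random seed so results are repeatable. In scikit-learn this is done with the `stratify` and `random_state` parameters of `train_test_split`. The idea is sketched below in plain Python; `stratified_split` and the toy data are hypothetical names used only for illustration, not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_fraction=0.25, seed=0):
    """Split so that each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    train_idx.sort()
    test_idx.sort()

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(texts, train_idx), pick(texts, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data: 8 ham (0) and 4 spam (1) messages
toy_texts = [f"message {i}" for i in range(12)]
toy_labels = [0] * 8 + [1] * 4

X_tr, X_te, y_tr, y_te = stratified_split(toy_texts, toy_labels)
print(len(X_tr), len(X_te), sum(y_tr), sum(y_te))  # 9 3 3 1
```

Both parts keep the original one-in-three spam ratio, which a purely random split of a small or imbalanced dataset does not guarantee.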

## Purpose of `CountVectorizer`

The `CountVectorizer` is used to convert text data, such as emails, into a form that can be used by machine learning models. It counts the occurrence of words in each email and turns the text into numbers.

### Example

Imagine we have two emails:

1. **"I love learning"**
2. **"Learning is fun"**

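Conceptually, the `CountVectorizer` builds a vocabulary of the unique words: `["I", "love", "learning", "is", "fun"]`. (With scikit-learn's default settings, text is lowercased and one-character tokens such as "I" are dropped, so the actual vocabulary would not contain "I".)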

For these two emails, it creates a table like this:

| Word         | Email 1 ("I love learning") | Email 2 ("Learning is fun") |
|--------------|-----------------------------|-----------------------------|
| **I**        | 1                           | 0                           |
| **love**     | 1                           | 0                           |
| **learning** | 1                           | 1                           |
| **is**       | 0                           | 1                           |
| **fun**      | 0                           | 1                           |
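
The counting step can be sketched in plain Python. This is an approximation, not scikit-learn's implementation: it lowercases the text and keeps tokens of two or more word characters, which mirrors the documented default `token_pattern` of `CountVectorizer`. The helper names `tokenize` and `count_matrix` are hypothetical.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's default pattern
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract word tokens."""
    return TOKEN.findall(text.lower())

def count_matrix(docs):
    """Return the sorted vocabulary and one row of word counts per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocab])
    return vocab, rows

vocab, rows = count_matrix(["I love learning", "Learning is fun"])
print(vocab)  # ['fun', 'is', 'learning', 'love']
print(rows)   # [[0, 0, 1, 1], [1, 1, 1, 0]]
```

Each row is one email and each column one vocabulary word, which is the same layout as the matrix returned by `fit_transform`.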

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

## Purpose of `MultinomialNB` Model

The `MultinomialNB` model is used to classify data based on probabilities, particularly when the data represents counts or frequencies, such as word occurrences in text. This model is a part of the **Naive Bayes** family of algorithms, which are grounded in **Bayes' theorem**. The `MultinomialNB` classifier assumes that the features (in this case, word counts) follow a multinomial distribution, making it ideal for text classification tasks, like detecting spam emails.
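
In symbols, the model can be sketched as follows. The notation is standard for multinomial Naive Bayes but is introduced here for illustration: $c$ is a class (spam or ham), $d$ an email, $V$ the vocabulary, $n_{w,d}$ the count of word $w$ in $d$, $N_{w,c}$ the count of $w$ across training emails of class $c$, and $N_c = \sum_{w \in V} N_{w,c}$.

$$
P(c \mid d) \;\propto\; P(c) \prod_{w \in V} P(w \mid c)^{\,n_{w,d}}
$$

$$
\hat{P}(w \mid c) = \frac{N_{w,c} + \alpha}{N_c + \alpha\,|V|}
$$

The smoothing term $\alpha$ keeps a word that never appeared in one class from forcing that class's probability to zero; `MultinomialNB` uses `alpha=1.0` by default.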


In [6]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count,y_train)
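
What `fit` and `predict` compute can be sketched from scratch in a few lines. This is a minimal illustration under simplifying assumptions (pre-tokenized toy emails, add-one smoothing, words outside the vocabulary ignored), not scikit-learn's implementation; `train_nb`, `predict_nb`, and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    classes = sorted(set(labels))
    vocab = sorted({word for doc in docs for word in doc})
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    doc_counts = Counter(labels)
    log_prior = {c: math.log(doc_counts[c] / len(docs)) for c in classes}
    log_lik = {}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Sum log probabilities instead of multiplying probabilities to avoid underflow
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
    return max(scores, key=scores.get)

toy_docs = [
    ["win", "free", "prize"],
    ["free", "offer", "win"],
    ["meet", "for", "lunch"],
    ["lunch", "tomorrow"],
]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

log_prior, log_lik = train_nb(toy_docs, toy_labels)
print(predict_nb(["free", "prize", "tomorrow"], log_prior, log_lik))  # 1
print(predict_nb(["lunch", "for", "free"], log_prior, log_lik))       # 0
```

Each test email mixes spam and ham words; the class whose training emails used those words more often wins.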

## Purpose of Email Prediction Code

This code block tests the trained **Multinomial Naive Bayes** model on new emails that it has not seen before. The emails are converted with the already-fitted vectorizer (`v.transform`, not `fit_transform`, so the vocabulary stays the same), and `model.predict` returns a class label for each one: 1 for spam, 0 for ham. The label is the class with the higher posterior probability given the email's word counts.


In [7]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]
emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1], dtype=int64)
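
The labels above come from comparing per-class scores. scikit-learn exposes the underlying probabilities through `predict_proba`; the sketch below shows only the normalization step, using made-up log scores rather than values from the trained model. The function name `posteriors` and the numbers are hypothetical.

```python
import math

def posteriors(log_scores):
    """Turn per-class log joint scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtracting the maximum before exponentiating keeps the computation stable
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: value / total for c, value in exp_scores.items()}

# Hypothetical scores log P(c) + sum of log P(w | c) for one email
scores = {0: -7.0, 1: -5.0}  # 0 = ham, 1 = spam
probs = posteriors(scores)
label = max(probs, key=probs.get)
print(label, round(probs[1], 3))  # 1 0.881
```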

## Purpose of Testing Model Accuracy

This code block is used to evaluate the performance of the trained **Multinomial Naive Bayes** model on a separate test dataset. By measuring how well the model predicts spam and ham emails that it hasn't seen during training, we can assess its accuracy and reliability.


In [8]:
X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)

0.9813352476669059

## Model Accuracy Output

The accuracy reported above is **0.9813352476669059**, which can be interpreted as follows:

### Interpretation

- **Accuracy**: The model correctly classified approximately **98.13%** of the emails in the test dataset as either spam or ham.
- **Performance**: This suggests that the **Multinomial Naive Bayes** model is effective for this email classification task and generalizes well to messages it did not see during training. Because `train_test_split` was called without a fixed `random_state`, the exact figure changes slightly from run to run.

### Importance of Accuracy

- A high accuracy score indicates that the model has learned features that distinguish spam from ham in this dataset.
- Accuracy alone can be misleading here because the classes are imbalanced: about 87% of the messages are ham, so a classifier that always predicts ham would already reach roughly 87% accuracy.
- In real-world applications the two kinds of error carry different costs: spam classified as ham clutters the inbox, while ham classified as spam can hide an important message.

Overall, an accuracy of about **98.13%**, well above the always-ham baseline, highlights the strengths of using Bayesian methods in natural language processing tasks like spam detection.
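
To put the accuracy in context, it helps to compare it with the majority-class baseline and to look at precision and recall for the spam class. scikit-learn provides these through `sklearn.metrics` (for example `confusion_matrix` and `classification_report`); the plain-Python sketch below shows the arithmetic. The class counts come from the `describe()` output above, while `y_true` and `y_pred` are toy labels invented for illustration, not the model's real predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Return (accuracy, precision, recall) for the spam class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Baseline: always predict ham, using the class counts from the dataset summary
ham, spam = 4825, 747
baseline = ham / (ham + spam)
print(round(baseline, 3))  # 0.866

# Toy labels (hypothetical): 6 ham and 4 spam emails
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
print(summarize(y_true, y_pred))  # (0.8, 0.75, 0.75)
```

Precision answers "of the emails flagged as spam, how many really were spam?" and recall answers "of the real spam, how much was caught?"; both can be low even when accuracy looks high on imbalanced data.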


## Purpose of Using Sklearn Pipeline

The Sklearn Pipeline is a convenient way to streamline the workflow of machine learning tasks. It allows us to chain multiple processing steps together, ensuring that each step is executed in the correct order. In this case, we are combining a vectorization step with the Multinomial Naive Bayes model into a single pipeline.


In [9]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

## Purpose of Fitting the Pipeline

The `clf.fit(X_train, y_train)` command is used to train the entire pipeline on the training data. This step involves both transforming the input data and fitting the model in a single operation.


In [10]:
clf.fit(X_train, y_train)
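
Conceptually, fitting a pipeline means calling `fit_transform` on every step except the last and `fit` on the final estimator; predicting means calling `transform` on the earlier steps and `predict` on the last. The sketch below illustrates that chaining with stand-in classes. It is not scikit-learn's implementation, and `MiniPipeline`, `Lowercase`, and `KeywordRule` are hypothetical names.

```python
class MiniPipeline:
    """Conceptual stand-in that chains transformers and a final estimator."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X)   # learn and apply each transformation
        self.steps[-1][1].fit(X, y)     # train the final estimator
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)       # reuse what was learned during fit
        return self.steps[-1][1].predict(X)


class Lowercase:
    """Toy transformer: lowercases every text."""

    def fit_transform(self, X):
        return self.transform(X)

    def transform(self, X):
        return [text.lower() for text in X]


class KeywordRule:
    """Toy estimator: flags texts containing words seen only in spam."""

    def fit(self, X, y):
        spam_words = {w for text, label in zip(X, y) if label == 1 for w in text.split()}
        ham_words = {w for text, label in zip(X, y) if label == 0 for w in text.split()}
        self.spam_only = spam_words - ham_words
        return self

    def predict(self, X):
        return [1 if set(text.split()) & self.spam_only else 0 for text in X]


pipe = MiniPipeline([("lower", Lowercase()), ("rule", KeywordRule())])
pipe.fit(["FREE prize now", "see you at lunch"], [1, 0])
print(pipe.predict(["Free lunch", "See you"]))  # [1, 0]
```

Bundling the steps this way means new text always passes through exactly the preprocessing that was used in training.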


---

Once fitted, the pipeline accepts raw text directly: `clf.predict()` vectorizes the emails with the fitted `CountVectorizer` and passes the counts to the classifier in one call.


In [13]:
clf.predict(emails)

array([0, 1], dtype=int64)

#### Viewing the predictions made by the model on the new emails

In [14]:
# List of new emails to classify
emails = [
    'Hey mohan, can we get together to watch football game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Don’t miss this reward!'
]

# Predict whether the new emails are spam (1) or ham (0)
predictions = clf.predict(emails)

# Display the predictions
print("Predictions for the new emails:", predictions)


Predictions for the new emails: [0 1]


#### This means:

- The first email is classified as ham (0).
- The second email is classified as spam (1).

This way, we can easily view and interpret the predictions made by the model.


---

## Conclusion

In this project, I implemented a **Multinomial Naive Bayes** classifier that detects spam emails with an accuracy of about **98.13%** on the held-out test set. The pipeline approach streamlined the process by vectorizing text data and fitting the Naive Bayes model in one cohesive step. The results demonstrate the power of Bayesian machine learning methods in handling text classification problems, even when dealing with sparse data such as word occurrences.

This project further solidifies the practical utility of Bayesian learning techniques, particularly in natural language processing tasks like email filtering. The **Multinomial Naive Bayes** model's ability to base its predictions on observed data showcases the flexibility and strength of Bayesian approaches in real-world applications.
