## Naive Bayes
- Naive Bayes is a supervised machine learning classification algorithm based on Bayes' theorem, which describes the probability of an event given prior knowledge of conditions that might be related to it.
- The "naive" aspect of Naive Bayes comes from the assumption that features used to describe an observation are conditionally independent, given the class label.
- This means that, within a given class, the presence of one feature is assumed to be unrelated to the presence of any other feature.
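
In symbols, the two bullets above can be written as follows for a class label $y$ and features $x_1, \dots, x_n$. The notation is chosen here for illustration and is not tied to any particular library. The first equality is Bayes' theorem; the proportionality uses the conditional independence assumption and drops the denominator, which does not depend on $y$.

$$
P(y \mid x_1, \dots, x_n)
= \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
\;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$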

#### *MultinomialNB is a variant of the Naive Bayes algorithm designed for datasets where the features represent discrete counts. It is commonly used for text classification problems, where each feature corresponds to the frequency of a term in a document. MultinomialNB assumes that the features are generated from a multinomial distribution.*
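
To make the count-based model concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. The toy messages, the `alpha` value, and the helper names `log_posterior` and `predict` are assumptions made for illustration; this is not how scikit-learn implements `MultinomialNB` internally, only the same idea in plain Python.

```python
import math
from collections import Counter

# Hypothetical toy training data: tokenized messages with labels (1 = spam, 0 = ham).
train = [
    (["free", "prize", "win"], 1),
    (["win", "free", "entry"], 1),
    (["call", "me", "later"], 0),
    (["see", "you", "later"], 0),
    (["lunch", "later", "today"], 0),
]

alpha = 1.0  # Laplace smoothing, so words unseen in a class do not get zero probability
vocab = {w for words, _ in train for w in words}
classes = sorted({label for _, label in train})

# Count words per class and documents per class.
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for words, label in train:
    word_counts[label].update(words)
    doc_counts[label] += 1


def log_posterior(words, c):
    """Unnormalized log P(c | words) = log P(c) + sum of log P(word | c)."""
    total = sum(word_counts[c].values())
    score = math.log(doc_counts[c] / len(train))
    for w in words:
        if w not in vocab:
            continue  # words never seen in training are ignored
        score += math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
    return score


def predict(words):
    """Return the class with the highest unnormalized log posterior."""
    return max(classes, key=lambda c: log_posterior(words, c))


print(predict(["free", "entry", "today"]))  # 1 (spam)
print(predict(["call", "you", "later"]))    # 0 (ham)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.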

## Approach
#### These steps outline the process to be followed when working on a predictive model: 
- Problem Definition
- Data Collection
- Data Preprocessing
- Feature Selection/Engineering
- Data Splitting
- Model Selection
- Model Training
- Prediction
- Hyperparameter Tuning
- Model Evaluation



## Problem Definition

### *Clearly state the problem you want to solve, as well as the outcome you want to predict.*


Here we have to predict whether an email is spam or not.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings('ignore')

## Data Collection

### *Gather relevant data that will be used to train and test the prediction model.*


In [29]:
df = pd.read_csv("spam.csv")

In [30]:
df.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Data Preprocessing


### *Clean the data by handling missing values, dealing with outliers, visualizing the data, normalizing features, and encoding categorical variables.*


In [31]:
df.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4
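
The counts above show that the classes are imbalanced. A quick way to put later accuracy figures in context is to compute the spam share and the accuracy of a trivial model that always predicts the majority class. This sketch only uses the two counts from the table; the variable names are illustrative.

```python
# Class counts taken from the groupby output above.
class_counts = {"ham": 4825, "spam": 747}
total = sum(class_counts.values())

spam_share = class_counts["spam"] / total
# Accuracy of a trivial model that always predicts the majority class ("ham").
majority_baseline = max(class_counts.values()) / total

print(f"total messages: {total}")                                    # 5572
print(f"spam share: {spam_share:.3f}")                               # 0.134
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")  # 0.866
```

Any useful classifier on this dataset should therefore score well above roughly 87% accuracy.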


## Feature Selection/Engineering

### *Identify which features are important for the prediction task and create new features if needed.*


In [32]:
# Encode the target as a binary label: 1 for spam, 0 for ham
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)
df.head()

,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


## Data Splitting

### *Divide the dataset into a training set and a testing set to evaluate your model's performance.*

In [33]:
from sklearn.model_selection import train_test_split
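# Default split is 75% train / 25% test; no random_state is set, so the split and the scores vary between runs
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam)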
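
Because the classes are imbalanced, a reproducible and stratified split can be a reasonable alternative. The sketch below uses hypothetical toy data and variable names (`msgs_train`, `lab_train`, and so on) so it does not overwrite the notebook's variables; `stratify` and `random_state` are standard `train_test_split` parameters, and the seed value 42 is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 ham (0) and 4 spam (1) messages.
messages = [f"message {i}" for i in range(20)]
labels = [0] * 16 + [1] * 4

msgs_train, msgs_test, lab_train, lab_test = train_test_split(
    messages,
    labels,
    test_size=0.25,    # same as the default used when test_size is omitted
    stratify=labels,   # keep the ham/spam ratio the same in both splits
    random_state=42,   # fixed seed so the split is reproducible
)

print(len(msgs_train), len(msgs_test))  # 15 5
print(sum(lab_train), sum(lab_test))    # 3 1 (spam messages in each split)
```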

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
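v = CountVectorizer()
# Learn the vocabulary from the training messages only and convert them to a sparse term-count matrix
X_train_count = v.fit_transform(X_train.values)
# Show the first two rows as a dense array (mostly zeros, since each message uses few vocabulary terms)
X_train_count.toarray()[:2]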

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
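
The output above is hard to read because the real vocabulary is large. A small sketch on a hypothetical two-message corpus (not taken from `spam.csv`) shows what `CountVectorizer` produces: one column per term and one row per message, with each cell holding the number of times the term occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
docs = ["free entry free prize", "call me later"]

vec = CountVectorizer()
toy_counts = vec.fit_transform(docs)  # sparse matrix of shape (n_docs, n_terms)

# vocabulary_ maps each term to its column index; list the terms in column order.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(terms)                 # ['call', 'entry', 'free', 'later', 'me', 'prize']
print(toy_counts.toarray())  # [[0 1 2 0 0 1]
                             #  [1 0 0 1 1 0]]
```

Note that "free" appears twice in the first message, so its cell holds 2. These counts are the discrete features that MultinomialNB expects.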

## Model Selection

### *Choose an appropriate machine learning algorithm based on the type of problem (classification, regression, etc.) and the characteristics of the data.*

In [35]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()


## Model Training

### *Use the training data to train the selected model by adjusting its parameters to minimize the prediction error.*

In [36]:
model.fit(X_train_count,y_train)

MultinomialNB()

In [37]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]
# Reuse the vocabulary learned from the training data: transform, not fit_transform
emails_count = v.transform(emails)


## Prediction

### *Once the model is trained and validated, it can be used to make predictions on new, unseen data.*


In [38]:
model.predict(emails_count)

array([0, 1], dtype=int64)
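
`predict` returns only the most likely class. When the confidence of a prediction matters, `predict_proba` returns the estimated probability of each class instead. The sketch below trains on a hypothetical four-message dataset so that it runs on its own; in this notebook the equivalent call would be `model.predict_proba(emails_count)`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (1 = spam, 0 = ham).
texts = ["free prize win now", "win free entry today", "call me later", "see you at lunch"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(texts), labels)

new_msgs = ["win a free prize", "see you later"]
proba = nb.predict_proba(vec.transform(new_msgs))

# Columns follow nb.classes_ (here ham = 0, then spam = 1); each row sums to 1.
print(nb.classes_)
print(proba.round(3))
```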

In [39]:
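# Vectorize the test messages with the vocabulary fitted on the training set
X_test_count = v.transform(X_test)
# For a classifier, score() returns the mean accuracy on the given data
model.score(X_test_count, y_test)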

0.9834888729361091

## Hyperparameter Tuning

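### *Fine-tune the model's hyperparameters to optimize its performance.*

The cells in this section do not change any hyperparameters. They wrap the vectorizer and the classifier in a single `Pipeline`, so that raw text can be passed directly to `fit`, `predict`, and `score`.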

In [40]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [41]:
clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])
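
A pipeline like the one above can be passed to `GridSearchCV` to search over hyperparameters with cross-validation. The sketch below is one possible way to do this, not part of the original notebook: the toy data, the candidate `alpha` values, and `cv=2` are assumptions chosen so the example runs on its own. In this notebook, `X_train` and `y_train` would replace the toy data, and a larger `cv` would be more appropriate.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = spam, 0 = ham).
texts = [
    "free prize win now", "win free entry today", "claim your free reward",
    "exclusive offer win cash", "call me later", "see you at lunch",
    "are you coming home", "meeting moved to monday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# "<step name>__<parameter>" addresses a parameter of a step inside the pipeline.
# alpha is the additive smoothing parameter of MultinomialNB; these values are illustrative.
param_grid = {"nb__alpha": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)  # the alpha value with the best cross-validated accuracy
print(search.best_score_)   # mean cross-validated accuracy for that value
```

Because the vectorizer is inside the pipeline, it is refitted on each training fold, so no vocabulary information leaks from the validation fold.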

## Model Evaluation

### *Assess the model's performance on a separate set of data not used during training to understand its predictive power and generalization capability.*



In [42]:
clf.score(X_test,y_test)

0.9834888729361091
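
Accuracy alone can be misleading on imbalanced data, since most messages are ham. A confusion matrix with precision and recall for the spam class gives a fuller picture. The labels and predictions below are hypothetical; in this notebook they would be `y_test` and `clf.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = spam, 0 = ham), for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(cm)         # [[5 1]
                  #  [1 3]]
print(precision)  # 0.75
print(recall)     # 0.75
```

For a spam filter, precision measures how many flagged messages really are spam, and recall measures how much of the spam is caught.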

In [43]:
clf.predict(emails)

array([0, 1], dtype=int64)

### Thank You !!!