# <span style="color:purple; font-weight:bold">Naive Bayes</span>                   
Naive Bayes is a family of probabilistic classifiers based on Bayes' theorem. It is called "naive" because it assumes that all features are conditionally independent of one another given the class, which is a strong and often unrealistic simplification.
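
Written out, the assumption looks like this. The notation is introduced here only for illustration ($y$ is the class, $x_1, \dots, x_n$ are the feature values); it is the usual textbook form rather than anything specific to this notebook:

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The variants described in the next section differ only in how they model each $P(x_i \mid y)$.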

---

# <span style="color:purple; font-weight:bold">Types of Naive Bayes Classification Algorithms</span>


 1. ## <span style="color:purple; font-weight:bold"> Gaussian Naive Bayes (GaussianNB)</span>               
    
    * <span style="color:blue; font-weight:bold">Definition:</span> A Naive Bayes variant that models continuous features using a Gaussian (Normal/Bell-shaped) distribution.     

    - <span style="color:blue; font-weight:bold">Key Concepts:</span>             

       * Assumes input variables are continuous and normally distributed within each target class.               
       * Calculates mean and standard deviation of features for each class during training.               
       * Uses the probability density function of the normal distribution for classification.   

    - <span style="color:blue; font-weight:bold">Usage:</span>            

       * Classification problems with purely numerical, real-valued input data.              
       * Medical diagnosis based on vitals (age, blood pressure, weight).                    
       * Predicting stock market movements or weather conditions from continuous measurements.                      

 2. ## <span style="color:purple; font-weight:bold"> Multinomial Naive Bayes (MultinomialNB)</span>               
    
    * <span style="color:blue; font-weight:bold">Definition:</span> A Naive Bayes variant designed for features that represent discrete counts or frequencies.     

    - <span style="color:blue; font-weight:bold">Key Concepts:</span>             

       * Assumes features are generated from a multinomial distribution.              
       * Input data is typically integer count vectors (e.g., word frequency counts in documents).               
       * Laplace smoothing (alpha parameter) is commonly used to handle zero counts for words unseen during training.   

    - <span style="color:blue; font-weight:bold">Usage:</span>            

       * Text Classification (Primary use): Spam detection, sentiment analysis, topic categorization of documents.             
       * Any problem where the frequency of events (counts) is the primary input feature.             


 3. ## <span style="color:purple; font-weight:bold"> Bernoulli Naive Bayes (BernoulliNB) </span>               
    
    * <span style="color:blue; font-weight:bold">Definition:</span> A Naive Bayes variant strictly for binary or boolean features.     

    - <span style="color:blue; font-weight:bold">Key Concepts:</span>             

       * Assumes all features are binary (0/1 or True/False).              
       * Focuses purely on the presence or absence of a feature, ignoring how many times it appears.               
       * Uses the Bernoulli distribution to calculate probabilities.   

    - <span style="color:blue; font-weight:bold">Usage:</span>            

       * Text classification where presence/absence of keywords is more predictive than frequency.              
       * Categorizing documents using a binary "bag-of-words" model.                    
       * Classifying user behavior (e.g., did a user click a link: Yes/No).      
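
To make the difference between the three variants concrete, here is a minimal sketch of the per-feature likelihood each one plugs into the model. The function names and toy numbers are assumptions made up for illustration; scikit-learn's own implementations work in log space and include details (such as `var_smoothing`) that are not shown here.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Normal density of a continuous feature value x within one class."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def multinomial_word_prob(word_count, total_count, vocab_size, alpha=1.0):
    """Laplace-smoothed probability of one word within one class."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

def bernoulli_likelihood(present, p):
    """Probability of a binary feature being present (1) or absent (0)."""
    return p if present else 1.0 - p

# Toy numbers, chosen only for illustration.
print(gaussian_likelihood(30.0, mean=30.0, var=4.0))             # density at the class mean
print(multinomial_word_prob(0, total_count=100, vocab_size=50))  # unseen word still gets > 0
print(bernoulli_likelihood(0, p=0.2))                            # feature absent
```

Note how the smoothed multinomial estimate stays above zero for a word that never appeared in the class, which is exactly what the `alpha` parameter is for.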

 ---            

 # <span style="color:purple; font-weight:bold"> Maths?</span>

 * https://towardsdatascience.com/a-mathematical-explanation-of-naive-bayes-in-5-minutes-44adebcdb5f8/
 * https://www.kdnuggets.com/2020/06/naive-bayes-algorithm-everything.html
 * https://www.youtube.com/watch?v=O2L2Uv9pdDA


---





**Let's predict survival on the Titanic using Gaussian Naive Bayes.**

In [112]:
import pandas as pd

In [113]:
df = pd.read_csv("./assets/files/titanic.csv")
df.head()

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0


In [114]:
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.25,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,7.925,1
3,1,female,35.0,53.1,1
4,3,male,35.0,8.05,0


In [115]:
inputs = df.drop('Survived',axis='columns')
target = df.Survived


In [116]:
dummies = pd.get_dummies(inputs.Sex)
dummies.head(3)

Unnamed: 0,female,male
0,False,True
1,True,False
2,True,False


In [117]:
inputs = pd.concat([inputs,dummies],axis='columns')
inputs.head(3)

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,False,True
1,1,female,38.0,71.2833,True,False
2,3,female,26.0,7.925,True,False


In [118]:
inputs.drop(['Sex','male'],axis='columns',inplace=True)
inputs.head(3)

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,False
1,1,38.0,71.2833,True
2,3,26.0,7.925,True


In [119]:
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [120]:

inputs.Age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

In [121]:
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,False
1,1,38.0,71.2833,True
2,3,26.0,7.925,True
3,1,35.0,53.1,True
4,3,35.0,8.05,False


In [122]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.3)

In [123]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [124]:
model.fit(X_train,y_train) 

GaussianNB(priors=None, var_smoothing=1e-09)


In [125]:
model.score(X_test,y_test) 

0.7649253731343284

In [126]:

X_test[0:10]  

Unnamed: 0,Pclass,Age,Fare,female
35,1,42.0,52.0,False
815,1,29.699118,0.0,False
22,3,15.0,8.0292,True
536,1,45.0,26.55,False
73,3,26.0,14.4542,False
859,3,29.699118,7.2292,False
412,1,33.0,90.0,True
109,3,29.699118,24.15,True
655,2,24.0,73.5,False
50,3,7.0,39.6875,False


In [127]:
y_test[0:10]

35     0
815    0
22     1
536    0
73     0
859    0
412    1
109    1
655    0
50     0
Name: Survived, dtype: int64

In [128]:
model.predict(X_test[0:10])

array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0])

In [129]:

model.predict_proba(X_test[:10])

array([[0.64877796, 0.35122204],
       [0.77569196, 0.22430804],
       [0.40522526, 0.59477474],
       [0.76086   , 0.23914   ],
       [0.97374244, 0.02625756],
       [0.9740265 , 0.0259735 ],
       [0.00616968, 0.99383032],
       [0.47415255, 0.52584745],
       [0.74165137, 0.25834863],
       [0.9377058 , 0.0622942 ]])
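
Each label returned by `predict` corresponds to the column with the higher probability in `predict_proba` (column 0 is class 0, column 1 is class 1). A small check using the values printed above, copied by hand and therefore purely illustrative:

```python
# Probabilities copied from the predict_proba output above: [P(class 0), P(class 1)].
proba = [
    [0.64877796, 0.35122204],
    [0.77569196, 0.22430804],
    [0.40522526, 0.59477474],
    [0.76086, 0.23914],
    [0.97374244, 0.02625756],
    [0.9740265, 0.0259735],
    [0.00616968, 0.99383032],
    [0.47415255, 0.52584745],
    [0.74165137, 0.25834863],
    [0.9377058, 0.0622942],
]

# For each row, pick the index of the larger probability.
labels = [row.index(max(row)) for row in proba]
print(labels)  # [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
```

Rows such as the eighth one (0.47 vs 0.53) show that some predictions are close calls even when the label turns out to be correct.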

**Calculate the score using cross-validation**

In [130]:
from sklearn.model_selection import cross_val_score
cross_val_score(GaussianNB(),X_train, y_train, cv=5)

array([0.8       , 0.768     , 0.816     , 0.79032258, 0.72580645])
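
The five fold scores are usually summarised by their mean and spread. A minimal sketch using the values printed above (copied by hand; a real run would pass the array returned by `cross_val_score` directly):

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above.
scores = [0.8, 0.768, 0.816, 0.79032258, 0.72580645]

print(f"mean accuracy: {mean(scores):.3f}")
print(f"std deviation: {stdev(scores):.3f}")
```

The mean comes out at roughly 0.78, which is in line with the single train/test score obtained earlier.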

## <span style="color:purple; font-weight:bold">Let's do email spam detection now using a Multinomial Naive Bayes model</span>

In [131]:
import pandas as pd
df = pd.read_csv("./assets/files/spam.csv")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [132]:
df.groupby('Category').describe() 

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4
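
The counts above show that the two classes are not balanced. A quick sketch of the class shares, and of the accuracy a trivial "always predict ham" classifier would reach, using the counts copied from the output above:

```python
# Message counts per class, copied from the describe() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

for label, n in counts.items():
    print(f"{label}: {n / total:.1%}")

# Accuracy of a trivial classifier that always predicts the majority class.
baseline = max(counts.values()) / total
print(f"majority-class baseline: {baseline:.3f}")
```

This baseline of roughly 0.87 is a useful reference point when reading the accuracy scores later in this section.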


In [133]:
df['spam']=df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [134]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message,df.spam)

**We need to convert the Message column into count vectors, because the model can only work with numeric input.**

<span style="color:blue; font-weight:bold">CountVectorizer</span> is a scikit-learn class that converts a collection of text documents into a matrix of token counts, with one row per document and one column per vocabulary word.

![alt text](assets/images/Count_Vectorizer.png)
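
To show what a count vector is, here is a rough pure-Python stand-in for the idea. It is not how `CountVectorizer` is implemented; it only imitates its default behaviour of lowercasing the text and keeping tokens of two or more word characters. The two messages are made up for illustration.

```python
import re
from collections import Counter

docs = ["Free entry to win a free prize", "Ok see you later"]  # made-up messages

# Lowercase, then keep tokens of two or more word characters.
token_re = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_re.findall(d.lower()) for d in docs]

# The vocabulary is the sorted set of all tokens; each word becomes one column.
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one count per vocabulary word.
counts = [Counter(toks) for toks in tokenized]
matrix = [[c[w] for w in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

The single-letter word "a" is dropped by the token pattern, and "free" gets a count of 2 in the first row because it appears twice in that message.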

In [135]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(2, 7475))

In [136]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count,y_train)

MultinomialNB(alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None)


In [137]:
emails = [
    'Hey Anshuman, can we get together to watch Basketball game this weekend?',
    'Special discount of 30% on parking, exclusive offer just for you. Dont miss this reward!'
]
emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1])

In [138]:
X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)

0.9820531227566404

## We now use Pipeline from the sklearn library to create a machine learning pipeline.

In essence, this Pipeline defines a workflow for text classification:               

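<span style="color:blue; font-weight:bold">Text Vectorization:</span> When you call clf.fit() or clf.predict() with raw text data, the data first passes through the CountVectorizer, which transforms the text into numerical feature vectors (word counts). During fit the vectorizer also learns its vocabulary; during predict it only applies the vocabulary it has already learned.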

<span style="color:blue; font-weight:bold">Classification:</span> The output of the CountVectorizer (the numerical feature vectors) is then fed as input to the MultinomialNB classifier, which learns to classify the data or make predictions based on these features.
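
Put differently, the pipeline is shorthand for the manual vectorize-then-classify steps used earlier. A minimal sketch on a tiny made-up dataset (the texts and labels are assumptions for illustration, not taken from spam.csv) shows that both routes give the same predictions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up dataset: 1 = spam, 0 = ham.
texts = ["win a free prize now", "free cash offer", "see you at lunch", "call me later today"]
labels = [1, 1, 0, 0]
new_texts = ["free prize offer", "lunch later today"]

# Manual two-step version: vectorize, then classify.
vec = CountVectorizer()
nb = MultinomialNB()
nb.fit(vec.fit_transform(texts), labels)
manual_pred = nb.predict(vec.transform(new_texts))

# The same two steps wrapped in a Pipeline.
pipe = Pipeline([("vectorizer", CountVectorizer()), ("nb", MultinomialNB())])
pipe.fit(texts, labels)
pipe_pred = pipe.predict(new_texts)

print(manual_pred, pipe_pred)
```

The practical benefit is that raw text goes in at both `fit` and `predict` time, so there is no separate vectorizer object to keep in sync with the model.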

In [139]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [140]:

clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', ...), ('nb', ...)], transform_input=None, memory=None, verbose=False)

vectorizer: CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, ...))

nb: MultinomialNB(alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None)


In [141]:
clf.score(X_test,y_test)

0.9820531227566404

In [142]:
clf.predict(emails)

array([0, 1])