## Naive Bayes classifier:

**Definition**: Naive Bayes is a supervised, generative learning algorithm used for classification tasks. It models the distribution of the inputs within each class, then applies Bayes' rule to predict the probability that an instance belongs to a class given its features.

**Assumptions**:
    - Feature independence: the features are conditionally independent given the target class.
    - Continuous features follow a Gaussian distribution within each class.
    - Discrete features follow a multinomial distribution.
    - All features are equally important.
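The independence assumption is what makes the model tractable. As a sketch, writing $y$ for the class and $x_1, \dots, x_n$ for the feature values (notation introduced here, not used elsewhere in this notebook), Bayes' rule factorizes as:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)},
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator is the same for every class, so it only matters when normalized probabilities are needed rather than just the most likely class.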

**Cons**:
    - In reality, most datasets have some dependency between features, which violates the independence assumption.
    - It is not a discriminative model: it doesn't learn which features are most important for differentiating between classes.

**Pros**:
    - Very efficient and highly scalable, as the number of parameters scales linearly with the number of features.
    - MLE training has a closed-form solution that can be computed in linear time.
    - It requires only a small amount of data to estimate the parameters.
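To illustrate the closed-form, linear-time training mentioned above, here is a minimal sketch for categorical features: the maximum-likelihood estimates are just relative frequencies gathered in one pass over the data. The function name `fit_categorical_nb` and its return layout are illustrative assumptions, not part of any library. The data are the `Outlook` and `Play Golf` columns used later in this notebook.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def fit_categorical_nb(rows, labels):
    """Estimate priors P(y) and likelihoods P(x_i = v | y) by counting (MLE).

    Hypothetical helper for illustration; no smoothing is applied.
    """
    class_counts = Counter(labels)
    # value_counts[(feature_index, label)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):  # single pass over the data
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1

    n = len(labels)
    priors = {y: Fraction(c, n) for y, c in class_counts.items()}
    likelihoods = {
        key: {v: Fraction(c, class_counts[key[1]]) for v, c in counter.items()}
        for key, counter in value_counts.items()
    }
    return priors, likelihoods


outlook = ['Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny', 'Sunny', 'Overcast',
           'Rainy', 'Rainy', 'Sunny', 'Rainy', 'Overcast', 'Overcast', 'Sunny']
play_golf = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
             'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

rows = [(value,) for value in outlook]  # one feature per instance
priors, likelihoods = fit_categorical_nb(rows, play_golf)
print(priors["Yes"])                      # 9/14
print(likelihoods[(0, "Yes")]["Sunny"])   # 1/3 (i.e. 3 / 9)
```

Values never observed with a class (here `Overcast` with `No`) are simply absent from the estimates, which corresponds to a probability of zero.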

**Main applications**:
    - Text classification (spam filtering, sentiment detection, rating classification)


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd

In [3]:
data = {
    'Outlook': ['Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Sunny', 'Rainy', 'Overcast', 'Overcast', 'Sunny'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Windy': [False, True, False, False, False, True, True, False, False, False, True, True, False, True],
    'Play Golf': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}

df = pd.DataFrame(data)

In [25]:
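target = "Play Golf"
results = {}
# Build one frequency table per column. In the feature tables, the "P(<class>)"
# columns hold the likelihood P(feature value | class) as an "a / b" string.
for col in df.columns:
    if col == target:
        print("Yes")
        # Class priors: count of each class over the total number of rows
        counts = pd.DataFrame(df.groupby(target).size(), columns=["count"])
        for idx in counts.index:
            counts.loc[idx,f"P({target})"] = f'{counts.loc[idx,"count"]} / {counts["count"].sum()}'
    else:
        # Cross-tabulate feature value against class
        counts = df.groupby([col, target]).size().unstack(fill_value=0)
        for idx in counts.index:
            for unique_target in df[target].unique():
                counts.loc[idx,f"P({unique_target})"] = f'{counts.loc[idx,unique_target]} / {counts[unique_target].sum()}'
    
    # Add a "Total" row; each probability column should sum to 1
    counts.loc['Total'] = counts.sum()
    for sub_col in counts.columns:
        if "P(" in sub_col:
            # Exclude the freshly added "Total" row (last row) from the sum
            counts.loc["Total",sub_col] = round(sum(eval(item) for item in counts[sub_col].iloc[:-1]), 0)
    display(counts)
    results[col] = counts
    print("------------------------------------------------------------------------")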


| Outlook \ Play Golf | No | Yes | P(No) | P(Yes) |
|---|---|---|---|---|
| Overcast | 0 | 4 | 0 / 5 | 4 / 9 |
| Rainy | 3 | 2 | 3 / 5 | 2 / 9 |
| Sunny | 2 | 3 | 2 / 5 | 3 / 9 |
| Total | 5 | 9 | 1.0 | 1.0 |


------------------------------------------------------------------------


| Temperature \ Play Golf | No | Yes | P(No) | P(Yes) |
|---|---|---|---|---|
| Cool | 1 | 3 | 1 / 5 | 3 / 9 |
| Hot | 2 | 2 | 2 / 5 | 2 / 9 |
| Mild | 2 | 4 | 2 / 5 | 4 / 9 |
| Total | 5 | 9 | 1.0 | 1.0 |


------------------------------------------------------------------------


| Humidity \ Play Golf | No | Yes | P(No) | P(Yes) |
|---|---|---|---|---|
| High | 4 | 3 | 4 / 5 | 3 / 9 |
| Normal | 1 | 6 | 1 / 5 | 6 / 9 |
| Total | 5 | 9 | 1.0 | 1.0 |


------------------------------------------------------------------------


| Windy \ Play Golf | No | Yes | P(No) | P(Yes) |
|---|---|---|---|---|
| False | 2 | 6 | 2 / 5 | 6 / 9 |
| True | 3 | 3 | 3 / 5 | 3 / 9 |
| Total | 5 | 9 | 1.0 | 1.0 |


------------------------------------------------------------------------
Yes


| Play Golf | count | P(Play Golf) |
|---|---|---|
| No | 5 | 5 / 14 |
| Yes | 9 | 9 / 14 |
| Total | 14 | 1.0 |


------------------------------------------------------------------------


In [47]:
# New instance to classify (Windy is a boolean, matching the DataFrame)
today = ("Sunny", "Hot", "Normal", False)

# Bayes' rule for each class:
# p_yes_given_today = (p_today_given_yes * p_yes) / p_today
# p_no_given_today = (p_today_given_no * p_no) / p_today
# p_yes_given_today + p_no_given_today = 1

# Based on the previous tables
p_sunny_given_yes = eval(results["Outlook"].loc["Sunny", "P(Yes)"])
p_hot_given_yes = eval(results["Temperature"].loc["Hot", "P(Yes)"])
p_normal_given_yes = eval(results["Humidity"].loc["Normal", "P(Yes)"])
p_false_given_yes = eval(results["Windy"].loc[False, "P(Yes)"])

p_today_given_yes = p_sunny_given_yes * p_hot_given_yes * p_normal_given_yes * p_false_given_yes
p_yes_given_today_nom = p_today_given_yes * eval(results["Play Golf"].loc["Yes", "P(Play Golf)"])


p_sunny_given_no = eval(results["Outlook"].loc["Sunny", "P(No)"])
p_hot_given_no = eval(results["Temperature"].loc["Hot", "P(No)"])
p_normal_given_no = eval(results["Humidity"].loc["Normal", "P(No)"])
p_false_given_no = eval(results["Windy"].loc[False, "P(No)"])

p_today_given_no = p_sunny_given_no * p_hot_given_no * p_normal_given_no * p_false_given_no
p_no_given_today_nom = p_today_given_no * eval(results["Play Golf"].loc["No", "P(Play Golf)"])

# Evidence P(today): sum of the unnormalized posteriors (numerators) over both classes
p_today = p_yes_given_today_nom + p_no_given_today_nom

In [48]:
p_yes_given_today_nom / p_today

0.8223684210526315

In [49]:
p_no_given_today_nom / p_today

0.17763157894736847
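The same calculation can be written as a small reusable function that avoids `eval` and keeps exact fractions. This is only a sketch: the `posterior` function and the nested-dictionary layout are assumptions made for illustration, and the probabilities are copied by hand from the count tables above.

```python
from fractions import Fraction
from math import prod


def posterior(priors, likelihoods, instance):
    """Return normalized P(class | instance) under the naive independence assumption.

    Hypothetical helper for illustration only.
    """
    # Unnormalized score per class: prior times the product of per-feature likelihoods
    scores = {
        label: prior * prod(likelihoods[label][feature][value]
                            for feature, value in instance.items())
        for label, prior in priors.items()
    }
    evidence = sum(scores.values())  # P(instance), summed over all classes
    return {label: score / evidence for label, score in scores.items()}


# Values copied from the count tables above (only the entries needed for `today`)
priors = {"Yes": Fraction(9, 14), "No": Fraction(5, 14)}
likelihoods = {
    "Yes": {"Outlook": {"Sunny": Fraction(3, 9)},
            "Temperature": {"Hot": Fraction(2, 9)},
            "Humidity": {"Normal": Fraction(6, 9)},
            "Windy": {False: Fraction(6, 9)}},
    "No": {"Outlook": {"Sunny": Fraction(2, 5)},
           "Temperature": {"Hot": Fraction(2, 5)},
           "Humidity": {"Normal": Fraction(1, 5)},
           "Windy": {False: Fraction(2, 5)}},
}
today = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "Normal", "Windy": False}

result = posterior(priors, likelihoods, today)
print(result["Yes"])  # 125/152
print(result["No"])   # 27/152
```

As a decimal, 125/152 is roughly 0.822, in line with the value computed in the cells above.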