# Naive Bayes
*For categorical features*

# 

Here I will use a textbook example, so that I can compare the results and check correctness as usual. Then we will write a Python file for this (mixed features will come after the 3rd book).

Let's go.

In [1]:
# Usual imports
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

In [94]:
# Data
weather = ['sunny', 'rainy', 'sunny', 'sunny', 'sunny', 'rainy', 'rainy', 'sunny', 'sunny', 'rainy']
car = ['works', 'broken', 'works', 'works', 'works', 'broken', 'broken', 'works', 'broken', 'broken']
y = ['go-out', 'go-out', 'go-out', 'go-out', 'go-out', 'stay-home', 'stay-home', 'stay-home', 'stay-home', 'stay-home']

# Data Frame
df = pd.DataFrame({"weather": weather,
                   "car": car,
                   "y": y})
df

Unnamed: 0,weather,car,y
0,sunny,works,go-out
1,rainy,broken,go-out
2,sunny,works,go-out
3,sunny,works,go-out
4,sunny,works,go-out
5,rainy,broken,stay-home
6,rainy,broken,stay-home
7,sunny,works,stay-home
8,sunny,broken,stay-home
9,rainy,broken,stay-home


This is a nice dataset, with 2 conflicting rows: rows 1 and 7 have the same feature values as other rows but the opposite label.

# 

### 1. Calculate the Class Probs

In [95]:
class_counts = df["y"].value_counts()
class_counts

stay-home    5
go-out       5
Name: y, dtype: int64

In [96]:
class_porbs = class_counts / df.shape[0]
class_porbs

stay-home    0.5
go-out       0.5
Name: y, dtype: float64
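
As a quick cross-check, pandas can produce the same priors in one step with `value_counts(normalize=True)`. The sketch below rebuilds only the target column so that it runs on its own; `priors` is a name introduced here for illustration.

```python
import pandas as pd

# Same target column as in the dataset above
y = pd.Series(["go-out"] * 5 + ["stay-home"] * 5, name="y")

# normalize=True divides each class count by the number of rows
priors = y.value_counts(normalize=True)
print(priors)
```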

# 

### 2. Calculate Conditional Probs

Here, for each feature individually, we take its unique values and compute the conditional probability of each value given each class. Let's see how.

##### $ P(X_1 = A | B) = \frac {\text{count}(A \cap B)} {\text{count}(B)}$

In [107]:
def get_probs(feature):
    probs = {}
    for category in df[feature].unique():
        for class_ in df["y"].unique():
            filter_ = (df[feature] == category) & (df["y"] == class_)
            A_n_B = df[filter_].shape[0]
            B = class_counts[class_]
            probs[(category, class_)] = A_n_B / B
    return pd.Series(probs)

###### 

For the `weather` feature.

In [108]:
get_probs("weather")

sunny  go-out       0.8
       stay-home    0.4
rainy  go-out       0.2
       stay-home    0.6
dtype: float64

###### 

For the `car` feature.

In [109]:
get_probs("car")

works   go-out       0.8
        stay-home    0.2
broken  go-out       0.2
        stay-home    0.8
dtype: float64

In [113]:
lookup = pd.concat([get_probs("weather"), get_probs("car")]).unstack()
lookup

Unnamed: 0,go-out,stay-home
broken,0.2,0.8
rainy,0.2,0.6
sunny,0.8,0.4
works,0.8,0.2
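
One caveat with concatenating the per-feature series: the lookup is indexed by category name alone, so it assumes that no two features share a category (for example, two features that both contain `yes`); with duplicates, `unstack` cannot reshape the index. A minimal sketch of one possible way around this keys every row by `(feature, category)` and lets `pd.crosstab` do the counting. `build_lookup` and `keyed` are hypothetical names, not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "weather": ["sunny", "rainy", "sunny", "sunny", "sunny",
                "rainy", "rainy", "sunny", "sunny", "rainy"],
    "car": ["works", "broken", "works", "works", "works",
            "broken", "broken", "works", "broken", "broken"],
    "y": ["go-out"] * 5 + ["stay-home"] * 5,
})

def build_lookup(data, target):
    """Return P(category | class) with rows keyed by (feature, category)."""
    tables = {}
    for feature in data.columns.drop(target):
        # normalize="columns" divides each count by its class total
        tables[feature] = pd.crosstab(data[feature], data[target], normalize="columns")
    # The dict keys become the outer level of the row index
    return pd.concat(tables)

keyed = build_lookup(df, "y")
print(keyed)
print(keyed.loc[("weather", "sunny"), "go-out"])
```

The values should agree with the lookup table above; only the indexing differs.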


Alright, now we have got the conditional probs too! That's it! Let's get into predicting.

###### 

### 3. Predict!


## $$ P(y | X) \propto P(X | y) \times P(y) $$

As said, we don't need to normalize (if we did, we would get the actual probability). Working with just the numerator gives a score for each class, and that is enough to find the predicted class.

In [101]:
# Taking the first instance
df.iloc[[0]]

Unnamed: 0,weather,car,y
0,sunny,works,go-out


If weather is `sunny` and car is `works`, then what is the probability (strictly speaking it is an unnormalized score, not a probability, but that is okay) that:
1. It is `go-out`
2. It is `stay-home`

We need to find for both target classes.

So our *final equation* would look like...

> $ \text{go-out} = P(\text{sunny} | \text{go-out}) \times P(\text{works} | \text{go-out}) \times P(\text{go-out}) $

> $ \text{stay-home} = P(\text{sunny} | \text{stay-home}) \times P(\text{works} | \text{stay-home}) \times P(\text{stay-home}) $

$ \max(\text{go-out}, \text{stay-home}) $

In [126]:
probs = {}
for class_ in lookup.columns:
    multi = 1
    for value in lookup.loc[df.iloc[0].values[:-1], class_]:
        multi *= value
    multi *= class_porbs[class_]
    probs[class_] = round(multi, 5)

In [127]:
probs

{'go-out': 0.32, 'stay-home': 0.04}

Worked! Amazing! <br>
Now, we can directly go and do this for all instances in our dataframe.
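
As a side note on the normalization mentioned earlier: if actual posterior probabilities are wanted, the scores can be divided by their sum. A minimal sketch, with the two scores for the first row hard-coded from the output above (`scores`, `total` and `posteriors` are illustrative names); it should give roughly 0.89 for `go-out` and 0.11 for `stay-home`.

```python
# Unnormalized scores for the first row, taken from the output above
scores = {"go-out": 0.32, "stay-home": 0.04}

# The sum of the scores plays the role of P(X) under the model,
# so dividing by it yields values that add up to 1
total = sum(scores.values())
posteriors = {class_: score / total for class_, score in scores.items()}
print(posteriors)
```

The winning class is the same either way, which is why the notebook skips this step.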

In [142]:
pd.Series([12, 4]).prod()

48
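
With only two features the plain product is fine, but a product of many small probabilities can underflow to zero. A common workaround, not used in this notebook, is to add logarithms instead; the ranking of the classes is unchanged because the logarithm is monotonic. A minimal sketch, with the conditional probabilities for the first row hard-coded from the lookup table (`cond`, `prior` and `log_scores` are illustrative names):

```python
import numpy as np

# P(sunny | class) and P(works | class) for the first row, from the lookup table
cond = {"go-out": [0.8, 0.8], "stay-home": [0.4, 0.2]}
prior = {"go-out": 0.5, "stay-home": 0.5}

log_scores = {}
for class_, probs in cond.items():
    # log(a * b * c) = log(a) + log(b) + log(c)
    log_scores[class_] = np.log(probs).sum() + np.log(prior[class_])

# The class with the largest log-score is the prediction
winner = max(log_scores, key=log_scores.get)
print(log_scores, winner)
```

Note that a conditional probability of exactly zero becomes `-inf` in log space (NumPy also emits a warning), so this approach is usually paired with some form of smoothing.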

In [158]:
each = []
for features in df.iloc[:, :-1].values:
    probs = {}
    for class_ in lookup.columns:
        multi = lookup.loc[features, class_].prod()
        multi *= class_porbs[class_]
        probs[class_] = round(multi, 5)
    winner = sorted(probs.items(), key=lambda i:i[1], reverse=True)[0][0]
    probs["pred"] = winner
    each.append(probs)

In [159]:
pred = pd.concat([df, pd.DataFrame(each)], axis=1)
pred

Unnamed: 0,weather,car,y,go-out,stay-home,pred
0,sunny,works,go-out,0.32,0.04,go-out
1,rainy,broken,go-out,0.02,0.24,stay-home
2,sunny,works,go-out,0.32,0.04,go-out
3,sunny,works,go-out,0.32,0.04,go-out
4,sunny,works,go-out,0.32,0.04,go-out
5,rainy,broken,stay-home,0.02,0.24,stay-home
6,rainy,broken,stay-home,0.02,0.24,stay-home
7,sunny,works,stay-home,0.32,0.04,go-out
8,sunny,broken,stay-home,0.08,0.16,stay-home
9,rainy,broken,stay-home,0.02,0.24,stay-home


In [160]:
((pred["y"] == pred["pred"]).sum() / pred.shape[0]) * 100

80.0

We have got 80% accuracy! Great! Let's make a proper class so that we can use it and test it on a larger dataset (with multiple categories and classes).

# 

In [366]:
%%writefile NB.py
import numpy as np
import pandas as pd

class NaiveBayes():
    """
    This class implements the categorical naive Bayes algorithm,
    which requires all features and the target to be discrete.
    
    The predictions and the accuracy are very similar to those of
    sklearn's CategoricalNB model.
    
    How To
    ------
    >>> model = NaiveBayes(wholeDF, "target_y")
    >>> model.predict(wholeDF)
        
        # Concatenating to see the per-class scores next to the data
    >>> pred = pd.concat([wholeDF, pd.DataFrame(model.predict(wholeDF))], axis=1)
    
        # Checking for accuracy
    >>> (pred["target_y"] == pred["pred"]).sum() / pred.shape[0] * 100

    
    """
    def __init__(self, data: pd.DataFrame, target: str):
        
        self.data = data
        self.target = target
        
        if self.target not in self.data.columns:
            raise ValueError("Target provided is not found in the data.")
        
        self.class_counts = self.data[self.target].value_counts()
        self.class_porbs = self.class_counts / self.data.shape[0]

        for_each_feature = []
        for feature in self.data.drop(self.target, axis=1):
            for_each_feature.append(self.calculate_conditional_prob(feature))

        self.lookup = pd.concat(for_each_feature).unstack()
    
    
    def calculate_conditional_prob(self, feature):
        probs = {}
        for category in self.data[feature].unique():
            for class_ in self.data[self.target].unique():
                filter_ = (self.data[feature] == category) & (self.data[self.target] == class_)
                A_n_B = self.data[filter_].shape[0]
                B = self.class_counts[class_]
                probs[(category, class_)] = A_n_B / B
        return pd.Series(probs)
    
    
    def predict(self, data):
        each = []
        # Drop the target column if it is present, so unlabeled data also works
        for features in data.drop(columns=self.target, errors="ignore").values:
            probs = {}
            for class_ in self.lookup.columns:
                multi = self.lookup.loc[features, class_].prod()
                multi *= self.class_porbs[class_]
                probs[class_] = round(multi, 5)
            winner = sorted(probs.items(), key=lambda i:i[1], reverse=True)[0][0]
            probs["pred"] = winner
            each.append(probs)
        return each

Writing NB.py


In [346]:
obj = NaiveBayes(df, "y")

In [347]:
obj.lookup

Unnamed: 0,go-out,stay-home
broken,0.2,0.8
rainy,0.2,0.6
sunny,0.8,0.4
works,0.8,0.2


In [348]:
pred = pd.concat([df, pd.DataFrame(obj.predict(df))], axis=1)
pred

Unnamed: 0,weather,car,y,go-out,stay-home,pred
0,sunny,works,go-out,0.32,0.04,go-out
1,rainy,broken,go-out,0.02,0.24,stay-home
2,sunny,works,go-out,0.32,0.04,go-out
3,sunny,works,go-out,0.32,0.04,go-out
4,sunny,works,go-out,0.32,0.04,go-out
5,rainy,broken,stay-home,0.02,0.24,stay-home
6,rainy,broken,stay-home,0.02,0.24,stay-home
7,sunny,works,stay-home,0.32,0.04,go-out
8,sunny,broken,stay-home,0.08,0.16,stay-home
9,rainy,broken,stay-home,0.02,0.24,stay-home


In [349]:
(pred.y == pred.pred).sum() / pred.shape[0] * 100

80.0

We are getting 80% here again!<br>
Let's try this on a new dataset.

In [289]:
# Creating multiple categories in multiple features

X1 = ["blue", "black", "red", "pink"]
X2 = ["old", "young", "medium", "teen", "baby"]
X3 = ["male", "female"]
y = ["indian", "american", "russian", "european"]

In [290]:
# Randomly picking up and building data

x1 = np.random.choice(X1, 200)
x2 = np.random.choice(X2, 200)
x3 = np.random.choice(X3, 200)
y = np.random.choice(y, 200)

In [310]:
DF = pd.DataFrame({"eye_color":x1,
                   "age": x2,
                   "gender": x3,
                   "country": y})
DF

Unnamed: 0,eye_color,age,gender,country
0,red,teen,male,russian
1,pink,teen,male,european
2,blue,medium,male,european
3,pink,medium,female,russian
4,blue,old,male,european
...,...,...,...,...
195,blue,young,male,european
196,black,teen,female,indian
197,red,teen,female,american
198,red,young,male,american


###### 

## With ***our*** model 

In [311]:
obj = NaiveBayes(DF, "country")
pred = pd.concat([DF, pd.DataFrame(obj.predict(DF))], axis=1)

In [312]:
pred

Unnamed: 0,eye_color,age,gender,country,american,european,indian,russian,pred
0,red,teen,male,russian,0.00553,0.00709,0.00554,0.01248,russian
1,pink,teen,male,european,0.00173,0.00967,0.00986,0.00858,indian
2,blue,medium,male,european,0.00533,0.00536,0.00845,0.00660,indian
3,pink,medium,female,russian,0.00278,0.00372,0.01075,0.00440,indian
4,blue,old,male,european,0.00533,0.01250,0.00634,0.00726,european
...,...,...,...,...,...,...,...,...,...
195,blue,young,male,european,0.00474,0.00982,0.00422,0.00462,european
196,black,teen,female,indian,0.00519,0.00591,0.00706,0.00624,indian
197,red,teen,female,american,0.00691,0.00591,0.00706,0.00832,russian
198,red,young,male,american,0.00632,0.00600,0.00238,0.00672,russian


We have our predictions... let's check the accuracy.

In [313]:
(pred["country"] == pred.pred).sum() / pred.shape[0] * 100

36.5

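The accuracy is poor, but the data was generated at random, so the features carry no information about the target and we can't expect much better (with 4 equally likely classes, pure guessing gives about 25%). <br>
Let's try the same data on the `sklearn` model and check whether it does any better.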

###### 

## With ***`sklearn`*** model 

In [314]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

sklearn doesn't work the way our class does. It takes the features and the target separately.

In [326]:
X = DF.drop("country", axis=1)
y = DF["country"]

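sklearn requires the data in numeric format, so we need to encode it. Keep in mind that with one-hot encoding, `CategoricalNB` treats every indicator column as its own two-category feature.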

In [316]:
X = OneHotEncoder().fit_transform(X).toarray()

In [321]:
# Model creation
model = CategoricalNB()
model.fit(X, y)

CategoricalNB()

In [322]:
(y == model.predict(X)).sum() / len(y)

0.37

Nearly the same (37% vs 36.5%)! Wow!
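
For a closer like-for-like comparison, scikit-learn's documentation for `CategoricalNB` describes an ordinal encoding, where each feature keeps a single column holding the integers `0 ... n_i - 1` (for instance via `OrdinalEncoder`). `CategoricalNB` also applies additive smoothing (`alpha=1.0` by default), which our class does not, so small differences in the scores are to be expected. The sketch below uses freshly generated random data with a fixed seed; the seed and the names `rng`, `demo` and `encoded` are assumptions for illustration, so the accuracy it prints will not match the figures above.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
demo = pd.DataFrame({
    "eye_color": rng.choice(["blue", "black", "red", "pink"], 200),
    "age": rng.choice(["old", "young", "medium", "teen", "baby"], 200),
    "gender": rng.choice(["male", "female"], 200),
    "country": rng.choice(["indian", "american", "russian", "european"], 200),
})

features = demo.drop(columns="country")
target = demo["country"]

# One integer code per category, one column per original feature
encoded = OrdinalEncoder(dtype=np.int64).fit_transform(features)

model = CategoricalNB()  # alpha=1.0 additive smoothing by default
model.fit(encoded, target)

# Training accuracy as a fraction between 0 and 1
accuracy = (model.predict(encoded) == target).mean()
print(encoded.shape, accuracy)
```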

# 

# That's it!
In the next book, we will see how to deal with numerical data.