# Mixed Naive Bayes

As promised, here I try to demonstrate the real-world situation, where a dataset is **neither purely** `categorical` **nor purely** `numerical`: it contains both kinds of features.

So we can't rely solely on `NB` (for categorical features) or `GNB` (for numerical features).

___
Still, to give it a try, I ran GNB on the categories after converting them into numbers. I also tried feeding numerical data into the categorical NB, but that doesn't work, because it estimates likelihoods differently: it looks up category probabilities instead of evaluating a pdf.

So here goes **GNB**.
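To make the difference between the two methods concrete, here is a minimal sketch in plain Python. The toy values and the helper names `categorical_likelihood` and `gaussian_likelihood` are made up for illustration; they are not taken from the `NB` or `GNB` modules. A categorical likelihood is a relative frequency counted inside a class, while a Gaussian likelihood is a density computed from the class mean and standard deviation.

```python
import math
from collections import Counter
from statistics import mean, stdev

# Toy data, for illustration only
weather = ["sunny", "rainy", "sunny", "sunny"]
bill = [16.99, 10.34, 21.01, 23.68]
y = ["go-out", "stay-home", "go-out", "go-out"]

def categorical_likelihood(values, labels, value, cls):
    """P(value | cls) as a relative frequency inside the class."""
    in_class = [v for v, c in zip(values, labels) if c == cls]
    return Counter(in_class)[value] / len(in_class)

def gaussian_likelihood(values, labels, value, cls):
    """Gaussian density of `value` under the class mean and sample std."""
    in_class = [v for v, c in zip(values, labels) if c == cls]
    mu, sigma = mean(in_class), stdev(in_class)
    return math.exp(-0.5 * ((value - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# A category seen during training has a meaningful frequency
p_sunny = categorical_likelihood(weather, y, "sunny", "go-out")

# A continuous value that never appeared in training has frequency 0 ...
p_bill_lookup = categorical_likelihood(bill, y, 18.50, "go-out")
# ... while the Gaussian density at the same value is still positive
p_bill_pdf = gaussian_likelihood(bill, y, 18.50, "go-out")

print(p_sunny, p_bill_lookup, p_bill_pdf)
```

With continuous data almost every value is its own "category", so the frequency lookup returns zero for nearly anything new. That is why numerical columns don't fit the categorical method.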

In [18]:
# Our own GNB implementation
from GNB import GaussianNaiveBayes
import pandas as pd

In [19]:
# The same categorical dataset as before
weather = ['sunny', 'rainy', 'sunny', 'sunny', 'sunny', 'rainy', 'rainy', 'sunny', 'sunny', 'rainy']
car = ['works', 'broken', 'works', 'works', 'works', 'broken', 'broken', 'works', 'broken', 'broken']
y = ['go-out', 'go-out', 'go-out', 'go-out', 'go-out', 'stay-home', 'stay-home', 'stay-home', 'stay-home', 'stay-home']

# Data Frame
df = pd.DataFrame({"weather": weather,
                   "car": car,
                   "y": y})
df

| | weather | car | y |
|---|---|---|---|
| 0 | sunny | works | go-out |
| 1 | rainy | broken | go-out |
| 2 | sunny | works | go-out |
| 3 | sunny | works | go-out |
| 4 | sunny | works | go-out |
| 5 | rainy | broken | stay-home |
| 6 | rainy | broken | stay-home |
| 7 | sunny | works | stay-home |
| 8 | sunny | broken | stay-home |
| 9 | rainy | broken | stay-home |


In [21]:
# For conversion
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [35]:
encoder = OneHotEncoder()
data = encoder.fit_transform(df.drop("y", axis=1)).toarray()

In [36]:
data = pd.concat([pd.DataFrame(data), df.y], axis=1)

In [37]:
data

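| | 0 | 1 | 2 | 3 | y |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | go-out |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | go-out |
| 2 | 0.0 | 1.0 | 0.0 | 1.0 | go-out |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | go-out |
| 4 | 0.0 | 1.0 | 0.0 | 1.0 | go-out |
| 5 | 1.0 | 0.0 | 1.0 | 0.0 | stay-home |
| 6 | 1.0 | 0.0 | 1.0 | 0.0 | stay-home |
| 7 | 0.0 | 1.0 | 0.0 | 1.0 | stay-home |
| 8 | 0.0 | 1.0 | 1.0 | 0.0 | stay-home |
| 9 | 1.0 | 0.0 | 1.0 | 0.0 | stay-home |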


In [38]:
# Training
model = GaussianNaiveBayes(data, "y")

In [39]:
pd.DataFrame(model.predict(data.drop("y", axis=1)))

| | go-out | stay-home | winner |
|---|---|---|---|
| 0 | 0.212243 | 0.002592 | go-out |
| 1 | 0.000526 | 0.101386 | stay-home |
| 2 | 0.212243 | 0.002592 | go-out |
| 3 | 0.212243 | 0.002592 | go-out |
| 4 | 0.212243 | 0.002592 | go-out |
| 5 | 0.000526 | 0.101386 | stay-home |
| 6 | 0.000526 | 0.101386 | stay-home |
| 7 | 0.212243 | 0.002592 | go-out |
| 8 | 0.010567 | 0.052053 | stay-home |
| 9 | 0.000526 | 0.101386 | stay-home |


Holy! Our Gaussian model worked on the categorical features too!! (It gets 8 of the 10 rows right.)

This means we can (in principle) work with data that has both types of values, as long as the categorical ones are converted to numbers first. Let's try it on one more (real) dataset.
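To see where those numbers come from, here is a hand-check of row 0 in plain Python. It is a sketch under two assumptions: the one-hot columns are ordered `[rainy, sunny, broken, works]` (inferred from the encoded table), and the standard deviation is the sample one (ddof=1). The helpers `gaussian_pdf` and `score` are illustrative names, not part of the `GNB` module.

```python
import math
from statistics import mean, stdev

# One-hot rows from the encoded table: [rainy, sunny, broken, works]
rows = [
    [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1],
    [1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 0],
]
labels = ["go-out"] * 5 + ["stay-home"] * 5

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def score(row, cls):
    """Class prior times the product of per-feature Gaussian densities."""
    members = [r for r, c in zip(rows, labels) if c == cls]
    result = len(members) / len(rows)  # class prior
    for j, x in enumerate(row):
        column = [m[j] for m in members]
        # Assumption: sample standard deviation (ddof=1)
        result *= gaussian_pdf(x, mean(column), stdev(column))
    return result

scores = {cls: score(rows[0], cls) for cls in ("go-out", "stay-home")}
print(scores)  # roughly {'go-out': 0.2122, 'stay-home': 0.0026}
```

Under these assumptions the scores agree with the first row of the prediction table. Each 0/1 column is simply treated as a number with a per-class mean and spread. Note that the scores are unnormalised, so they do not sum to 1.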


In [40]:
import seaborn as sns

In [41]:
tips = sns.load_dataset("tips")

In [49]:
tips.head()

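| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |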


In [53]:
data = pd.concat([
    tips[["total_bill", "tip", "size"]], 
    pd.DataFrame(encoder.fit_transform(tips[["smoker", "day", "time"]]).toarray()),
    tips["sex"]],
    
    axis=1)
data

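| | total_bill | tip | size | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Female |
| 1 | 10.34 | 1.66 | 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Male |
| 2 | 21.01 | 3.50 | 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Male |
| 3 | 23.68 | 3.31 | 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Male |
| 4 | 24.59 | 3.61 | 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Female |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | 3 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | Male |
| 240 | 27.18 | 2.00 | 2 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | Female |
| 241 | 22.67 | 2.00 | 2 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | Male |
| 242 | 17.82 | 1.75 | 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | Male |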


In [54]:
model = GaussianNaiveBayes(data, "sex")

In [56]:
pred = model.predict(data.drop("sex", axis=1))

In [59]:
pred = pd.concat([data, pd.DataFrame(pred)], axis=1)
pred.head()

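| | total_bill | tip | size | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | sex | Female | Male | winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Female | 6e-06 | 7e-05 | Male |
| 1 | 10.34 | 1.66 | 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Male | 8e-06 | 7.8e-05 | Male |
| 2 | 21.01 | 3.5 | 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Male | 1.6e-05 | 0.000223 | Male |
| 3 | 23.68 | 3.31 | 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Male | 1.5e-05 | 0.000189 | Male |
| 4 | 24.59 | 3.61 | 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Female | 4e-06 | 7.7e-05 | Male |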


In [60]:
(pred.sex == pred.winner).sum() / pred.shape[0]

0.6557377049180327
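To put that accuracy in context, it helps to compare it with the accuracy of always guessing the most frequent class. The sketch below uses a hypothetical helper, `majority_baseline`, and assumes the usual class counts of the seaborn `tips` data (157 Male, 87 Female). In the notebook itself the call would be `majority_baseline(tips["sex"])`.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Assumed class counts for the seaborn `tips` data (157 Male, 87 Female)
labels = ["Male"] * 157 + ["Female"] * 87
baseline = majority_baseline(labels)
print(round(baseline, 4))  # 0.6434
```

Under that assumption, the model is only about one percentage point better than always answering `Male`.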

Of course, this is not a great way to predict gender, but let's try the same model with `sklearn`.

In [67]:
from sklearn.naive_bayes import GaussianNB, CategoricalNB

In [68]:
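X = data.drop("sex", axis=1)
# Newer sklearn versions reject mixed str/int column names, so make them all strings
X.columns = X.columns.astype(str)
y = data["sex"]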

In [72]:
model = GaussianNB()
model.fit(X, y)

GaussianNB()

In [73]:
pred = model.predict(X)

In [74]:
(pred == y).sum() / len(y)

0.6557377049180327

THE SAME. **EXACTLY THE SAME**. ***NOT A POINT OF DIFFERENCE***.


## Here, we can stop.
But I want to try something new now.<br>
What if I make a `MixedNB` class to which we pass the dataframe with mixed (numerical and categorical) features ***and, along with it***, which features are categorical and which are numerical? Based on that, it follows two different approaches: the pdf for numerical features and the probability lookup for categorical ones.
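Written out, the idea is the usual Naive Bayes score with two kinds of likelihood terms. The notation is introduced here for illustration only: $C$ is the set of categorical features, $N$ the set of numerical ones, and $\mu_{k,c}$, $\sigma_{k,c}$ are the mean and standard deviation of numerical feature $k$ within class $c$.

```latex
\[
P(c \mid x) \;\propto\; P(c)
\prod_{j \in C} P(x_j \mid c)
\prod_{k \in N} \frac{1}{\sigma_{k,c}\sqrt{2\pi}}
\exp\!\left(-\frac{(x_k - \mu_{k,c})^2}{2\sigma_{k,c}^2}\right)
\]
```

The predicted class is the one with the largest score. Because the scores are not normalised, they do not need to sum to 1 across classes.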


Let's say we have the same tips dataset.

In [79]:
numerical = ["total_bill", "tip", "size"]
categorical = ["smoker", "day", "time"]
target = "sex"

In [187]:
import math

In [213]:
from GNB import GaussianNaiveBayes
from NB import NaiveBayes
import numpy as np
import pandas as pd
import math

class MixedNB:
    """
    This class implements a mixture of
    GNB + NB = MixedNB
    
    About & Why
    -----------
    Here, I have tried to combine the logic of both worlds
    to see if we can get better accuracy than either model
    on its own.
    
    This model calculates the probability (categorical) and the
    pdf (numerical) separately for the two types of features.
    
    Since the fundamental assumption of Naive Bayes is that all
    features are independent, the two results can simply be
    multiplied together.
    
    
    How To
    ------
    
        # Prepare the feature names
    >>> numerical = ["total_bill", "tip", "size"]
    >>> categorical = ["smoker", "day", "time"]
    >>> target = "sex"
        
        # Provide data in initialization
    >>> model = MixedNB(tips, target, numerical, categorical)
    
        # Predict - pass data without the target in it.
    >>> pred = model.predict(tips.drop(target, axis=1))
    >>> pred = pd.DataFrame(pred)
    
        # Get accuracy
    >>> (tips[target] == pred.winner).sum() / tips.shape[0]
    
    
    """
    def __init__(self, data: pd.DataFrame, target: str, 
                 numerical: list, categorical: list):
        
        # Saving the feature by their type
        self.numerical_features = numerical
        self.categorical_features = categorical
        self.target = target
        
        # Adding the target with the features
        # because GNB & NB implementation accepts
        # such format
        numerical = numerical + [target]
        categorical = categorical + [target]
        
        # Training GNB model (of course our own)
        # - For numerical features
        self.numerical_model = GaussianNaiveBayes(data[numerical], target)
        # Training NB model (of course our own)
        # - For categorical features
        self.categorical_model = NaiveBayes(data[categorical], target)
        
        # Saving the numerical model's pdf function
        self.pdf = self.numerical_model.pdf
        
        # Saving categorical lookup table
        self.lookup_categorical = self.categorical_model.lookup
        
        # Saving numerical lookup table
        self.lookup_numerical = self.numerical_model.lookup
        
        # Saving the class probs
        self.class_probs_categorical = self.categorical_model.class_probs
        # We could also use the `numerical_class_probs`
        # since both gives same result, we will just use
        # categorical's
    
    
    def predict(self, data):
        unique_targets = self.class_probs_categorical.index.tolist()
        
        # Row-wise predictions, each a dict of the form
        # {class_a: 0.23, class_b: 0.12, "winner": class_a}
        prediction = []
        
        # Iterating over the whole dataset
        for index in range(data.shape[0]):
            # The feature values of this row do not depend on the class,
            # so they are extracted once per row
            row = data.iloc[index]
            feature_categories = row[self.categorical_features].values
            feature_numbers = row[self.numerical_features].values
            
            class_wise_probs = {}
            for target_class in unique_targets:
                # Categorical part: product of the looked-up probabilities
                categorical_product = self.lookup_categorical.loc[feature_categories, target_class].prod()
                
                # Numerical part: product of the Gaussian pdfs
                pdfs = []
                for column, its_value in zip(self.numerical_features, feature_numbers):
                    learned_values = self.lookup_numerical.loc[target_class, column]
                    pdfs.append(
                        self.pdf(its_value, learned_values["mean"], learned_values["std"]))
                numerical_product = math.prod(pdfs)
                
                # Unnormalised score: class prior times both products
                class_wise_probs[target_class] = (categorical_product * numerical_product
                                                  * self.class_probs_categorical[target_class])
            
            # The class with the highest score wins
            class_wise_probs["winner"] = max(class_wise_probs, key=class_wise_probs.get)
            prediction.append(class_wise_probs)
        return prediction

In [214]:
model = MixedNB(tips, target, numerical, categorical)

In [215]:
pred = model.predict(tips.drop("sex", axis=1))

In [216]:
pred = pd.DataFrame(pred)

In [217]:
(tips.sex == pred.winner).sum() / tips.shape[0]

0.6516393442622951

Cool! We are getting some accuracy.


In [212]:
# If we compare 
0.6557377049180327 > 0.6516393442622951

True

The simple model did slightly better than our `MixedNB`: the gap is a single row (160 vs. 159 correct out of 244).
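One practical caveat with multiplying many small likelihoods: the scores on the tips data are already around `1e-05`, and with more features the product can underflow to zero. A common remedy is to add logarithms instead of multiplying. The sketch below shows the same mixed idea in log space, in plain Python with toy data. The function names `fit_mixed_nb` and `predict_mixed_nb`, the classes `"A"`/`"B"` and the sample std are all assumptions made for illustration. This is not the `MixedNB` class above.

```python
import math
from collections import Counter
from statistics import mean, stdev

def fit_mixed_nb(rows, labels, numerical, categorical):
    """Collect log priors, category counts and Gaussian parameters per class."""
    model = {"numerical": numerical, "categorical": categorical, "classes": {}}
    for cls in sorted(set(labels)):
        members = [r for r, c in zip(rows, labels) if c == cls]
        model["classes"][cls] = {
            "n": len(members),
            "log_prior": math.log(len(members) / len(rows)),
            "freq": {f: Counter([m[f] for m in members]) for f in categorical},
            "gauss": {f: (mean([m[f] for m in members]), stdev([m[f] for m in members]))
                      for f in numerical},
        }
    return model

def predict_mixed_nb(model, row):
    """Return (winner, log scores) for one row given as a dict."""
    scores = {}
    for cls, p in model["classes"].items():
        log_score = p["log_prior"]
        for f in model["categorical"]:
            count = p["freq"][f][row[f]]
            # An unseen category has probability 0, i.e. a log score of -inf
            log_score += math.log(count / p["n"]) if count else -math.inf
        for f in model["numerical"]:
            mu, sigma = p["gauss"][f]
            # Log of the Gaussian pdf
            log_score += (-0.5 * ((row[f] - mu) / sigma) ** 2
                          - math.log(sigma * math.sqrt(2 * math.pi)))
        scores[cls] = log_score
    return max(scores, key=scores.get), scores

# Toy data, for illustration only
rows = [
    {"smoker": "No", "tip": 1.0}, {"smoker": "No", "tip": 1.5}, {"smoker": "Yes", "tip": 2.0},
    {"smoker": "Yes", "tip": 4.0}, {"smoker": "Yes", "tip": 4.5}, {"smoker": "No", "tip": 5.0},
]
labels = ["A", "A", "A", "B", "B", "B"]

model = fit_mixed_nb(rows, labels, numerical=["tip"], categorical=["smoker"])
winner, scores = predict_mixed_nb(model, {"smoker": "No", "tip": 1.2})
print(winner, scores)
```

Since the logarithm is monotonic, the winner is the same as with the plain product; only the scale of the scores changes.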


### One More Dataset

In [221]:
diamonds = sns.load_dataset("diamonds")
diamonds

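| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |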


We will take `cut` as our target variable.


### With the `MixedNB` model

In [222]:
numerical = ["carat", "depth", "table", "price", "x", "y", "z"]
categorical = ["color", "clarity"]
target = "cut"

In [223]:
model = MixedNB(diamonds, target, numerical, categorical)

In [227]:
pred = model.predict(diamonds.drop("cut", axis=1))

In [228]:
pred = pd.DataFrame(pred)

In [231]:
(diamonds["cut"] == pred.winner).sum() / diamonds.shape[0]

0.5725806451612904

Only 57%?


### With the `GNB` model

In [233]:
from GNB import GaussianNaiveBayes
from sklearn.preprocessing import OneHotEncoder

In [237]:
encoder = OneHotEncoder()
trans = encoder.fit_transform(diamonds[categorical])

In [244]:
data = pd.concat([diamonds[numerical],
                   pd.DataFrame(trans.toarray()),
                   diamonds[target]],
          axis=1)

In [256]:
model = GaussianNaiveBayes(data, "cut")

In [None]:
pred = model.predict(data.drop("cut", axis=1))

In [None]:
pred = pd.DataFrame(pred)


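Great! Now we have our model. The mixed model!<br>
The thing is, my implementation is very inefficient in terms of time and space complexity. But the results hold up: our own GNB matches the accuracy of `sklearn`'s built-in `GaussianNB` exactly, and `MixedNB` is only one prediction behind on the tips data!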


# Next up,
We will learn a new model called KNN.