# Naive Bayes Classifier
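This notebook uses the Mushroom dataset, which describes each mushroom through categorical features such as cap shape, cap color, odor, and many more. Each mushroom belongs to one of several categories; in this dataset the `class` column takes two values, `e` (edible) and `p` (poisonous).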

Our aim is to find the category each new mushroom belongs to. That means computing, for every category label,
P(y=label|features of mushroom)
and picking the label with the highest probability.

# 1. Loading the Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data=pd.read_csv("./mushrooms.csv")

In [3]:
data.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


In [4]:
data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


# 2. Encode the data into Numerical Data

The data is categorical, not numerical, so every column must be converted to numbers.

1. One way is to create a separate dictionary for each column that maps its values to numbers. This is lengthy and time-consuming.
2. A shortcut is to use `LabelEncoder` from `sklearn.preprocessing`.
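
The first, manual approach can be sketched for a single column as follows. The sample values and the helper name `encode_column` are illustrative assumptions, not part of the notebook. Codes are assigned in sorted order of the distinct values, which is also how `LabelEncoder` orders its classes.

```python
def encode_column(values):
    """Map each distinct value in a column to an integer code.

    Hypothetical helper for illustration: codes follow the sorted
    order of the distinct values.
    """
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    encoded = [mapping[value] for value in values]
    return encoded, mapping


# Made-up sample of a cap-shape-like column
cap_shape = ["x", "b", "s", "x", "f"]
encoded, mapping = encode_column(cap_shape)
print(mapping)
print(encoded)
```

Repeating this for all 23 columns is what makes the manual route tedious.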

In [5]:
from sklearn.preprocessing import LabelEncoder

In [6]:
# create a LabelEncoder instance
LE=LabelEncoder()

# apply the encoder to every column; it is refit on each column,
# so each column gets its own integer codes starting from 0
data=data.apply(LE.fit_transform)

In [7]:
data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1


In [8]:
# Now all data columns are numerical data

# 3. Converting the DataFrame to a NumPy Array

In [9]:
data=data.values
print(data.shape)

(8124, 23)


In [10]:
data[:5]

array([[1, 5, 2, 4, 1, 6, 1, 0, 1, 4, 0, 3, 2, 2, 7, 7, 0, 2, 1, 4, 2, 3,
        5],
       [0, 5, 2, 9, 1, 0, 1, 0, 0, 4, 0, 2, 2, 2, 7, 7, 0, 2, 1, 4, 3, 2,
        1],
       [0, 0, 2, 8, 1, 3, 1, 0, 0, 5, 0, 2, 2, 2, 7, 7, 0, 2, 1, 4, 3, 2,
        3],
       [1, 5, 3, 8, 1, 6, 1, 0, 1, 5, 0, 3, 2, 2, 7, 7, 0, 2, 1, 4, 2, 3,
        5],
       [0, 5, 2, 3, 0, 5, 1, 1, 0, 4, 1, 3, 2, 2, 7, 7, 0, 2, 1, 0, 3, 0,
        1]])

# 4. Break data into Training and Testing Data

In [11]:
#first column is the category of the mushroom
y_data= data[:,0]
x_data=data[:,1:]

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
x_train,x_test,y_train,y_test=train_test_split(x_data,y_data,test_size=0.2)

In [14]:
print("Training Data: ",x_train.shape,y_train.shape)
print("Testing Data: ",x_test.shape,y_test.shape)

Training Data:  (6499, 22) (6499,)
Testing Data:  (1625, 22) (1625,)
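
For intuition, a shuffled 80/20 split like the one above can be approximated by hand with NumPy. This is a simplified sketch, not scikit-learn's implementation; the helper name `simple_split` and the fixed seed are assumptions made for the example.

```python
import numpy as np


def simple_split(x, y, test_size=0.2, seed=0):
    """Shuffle the row indices, then hold out a `test_size` fraction of rows."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(x.shape[0])
    # Round the test size up, e.g. ceil(0.2 * 8124) = 1625
    n_test = int(np.ceil(test_size * x.shape[0]))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Toy data: 10 rows, 3 features
x_toy = np.arange(30).reshape(10, 3)
y_toy = np.arange(10) % 2
x_tr, x_te, y_tr, y_te = simple_split(x_toy, y_toy)
print(x_tr.shape, x_te.shape)
```

Because the rows are shuffled, the exact rows in each part (and therefore the final accuracy) change from run to run unless the seed is fixed.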


# 5. Building the Classifier

Aim : P(y=label|features)

To Find: P(y=label) and P(features|y=label)
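
These two quantities are the ones needed because of Bayes' rule. The evidence term P(features) is the same for every label, so it can be dropped when the labels are only being compared with each other:

```latex
P(y = \text{label} \mid \text{features})
  = \frac{P(\text{features} \mid y = \text{label}) \; P(y = \text{label})}{P(\text{features})}
  \;\propto\; P(\text{features} \mid y = \text{label}) \; P(y = \text{label})
```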

5.1: Finding the prior P(y=label), here P(y=1) and P(y=0)

In [15]:
def prior_prob(y_train,label):
    # summing the boolean mask counts the entries equal to label (a shortcut instead of a loop)
    count_label=np.sum(y_train==label)
    
    total_count=y_train.shape[0]
    return(count_label/float(total_count))

In [16]:
# test on sample data
sample_y=np.array([1,0,1,0,1])
print("P(1): ",prior_prob(sample_y,1))
print("P(0): ",prior_prob(sample_y,0))

P(1):  0.6
P(0):  0.4
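
The same counting extends to every label at once. A minimal sketch, assuming integer labels as in the encoded data; the helper name `all_priors` is hypothetical:

```python
import numpy as np


def all_priors(y):
    """Return a {label: prior probability} dict for every label found in y."""
    labels, counts = np.unique(y, return_counts=True)
    return {int(label): float(count) / y.shape[0]
            for label, count in zip(labels, counts)}


sample_y = np.array([1, 0, 1, 0, 1])
priors = all_priors(sample_y)
print(priors)
```

The values agree with the two calls to `prior_prob` above, and they always sum to 1.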


5.2: To calculate P(features|y=label)

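Naive Bayes assumes that the features are conditionally independent given the label, so
P(features|y=label) = P(f1|y=label) * P(f2|y=label) * ... * P(fn|y=label)

This involves two functions:

1. The first calculates the individual probability for each feature.
2. The second multiplies all the individual probabilities to get the likelihood.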

In [17]:
# feature_col : column index of the feature (f1 or f2 or f3......or fn)
# feature_value : value of that feature which is taken into consideration
# label : label for which the probability is being calculated

def individual_prob(x_train,y_train,feature_col,feature_value,label):
    # Total no of data having the label mentioned
    denominator = np.sum(y_train==label)
    
    # We only require the x_train data where label is equal to the label given to the function
    x_data=x_train[y_train==label]
    # go to the specific col (feature col) of the whole x data and match each value with the feature value
    numerator=np.sum(x_data[:,feature_col]==feature_value)
    
    return(numerator/float(denominator))
    # Equivalent loop-based version, kept for reference (never executed):
    # count=0
    # denominator = np.sum(y_train==label)
    # x_data=x_train[:,feature_col]
    # for i in range(x_train.shape[0]):
    #     if(y_train[i]==label):
    #         if(x_data[i]==feature_value):
    #             count+=1
    # return (count/float(denominator))
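
A quick check on hand-made data helps confirm the counting logic. The toy arrays below are invented for illustration, and the function is repeated in condensed form only so that the snippet runs on its own.

```python
import numpy as np


def individual_prob(x_train, y_train, feature_col, feature_value, label):
    # Same counting logic as above, repeated so this snippet is standalone
    rows = x_train[y_train == label]
    return np.sum(rows[:, feature_col] == feature_value) / float(rows.shape[0])


# Toy data: 4 samples, 2 features
x_toy = np.array([[0, 1],
                  [1, 1],
                  [0, 0],
                  [0, 1]])
y_toy = np.array([1, 1, 0, 1])

# Among the 3 rows with label 1, feature 0 equals 0 in 2 of them
p = individual_prob(x_toy, y_toy, 0, 0, 1)
print(p)

# Feature 1 is never 0 among the label-1 rows, so the estimate is exactly 0
p_zero = individual_prob(x_toy, y_toy, 1, 0, 1)
print(p_zero)
```

The second call shows a limitation of plain counting: a feature value never seen with a label gets probability 0.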

In [18]:
# multiply all the individual feature probabilities found by the function above
# x_test : a single row of feature values
def likelihood(x_train,y_train,x_test,label):
    no_features=x_train.shape[1]
    result=1.0
    for i in range(no_features):
        result*=individual_prob(x_train,y_train,i,x_test[i],label)
    return(result)
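
Multiplying many probabilities has two practical risks: a single zero estimate wipes out the whole product, and long products of small numbers can underflow. A common remedy, which is not what this notebook does, is to apply Laplace (add-one) smoothing and sum logarithms instead of multiplying. The sketch below assumes integer-encoded features; the function name `log_likelihood` and the `alpha` parameter are illustrative choices.

```python
import numpy as np


def log_likelihood(x_train, y_train, x_row, label, alpha=1.0):
    """Sum of log P(feature_i = value | label) with add-alpha smoothing."""
    rows = x_train[y_train == label]
    total = 0.0
    for col in range(x_train.shape[1]):
        # Number of distinct values seen for this feature in the training data
        n_values = np.unique(x_train[:, col]).shape[0]
        count = np.sum(rows[:, col] == x_row[col])
        total += np.log((count + alpha) / (rows.shape[0] + alpha * n_values))
    return total


# Toy data, invented for illustration
x_toy = np.array([[0, 1],
                  [1, 1],
                  [0, 0],
                  [0, 1]])
y_toy = np.array([1, 1, 0, 1])

# Feature 1 is never 0 for label 1, yet the smoothed score stays finite
score = log_likelihood(x_toy, y_toy, np.array([0, 0]), label=1)
print(score)
```

To use such a score for prediction, the log of the prior would be added to it and the label with the largest sum chosen, since the logarithm preserves the ordering of the products.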

# 6. Calculate P(y=label|features) and Predict the Class

In [19]:
# x_test : a single row having features of the new mushroom whose y needs to be predicted
def predict_category(x_train,y_train,x_test):
    classes=np.unique(y_train)
    # compute a score proportional to P(y=label|features) for each class; the label with the max score is y_predicted
    # Here that means comparing P(0|x_test) and P(1|x_test)
    class_prob=[]
    for cur_label in classes:
        # posterior is proportional to likelihood * prior
        prob = likelihood(x_train,y_train,x_test,cur_label) * prior_prob(y_train,cur_label)
        
        # Add the score for this label to the class_prob list
        class_prob.append(prob)
    class_prob=np.array(class_prob)
    # argmax gives a position in class_prob; map it back to the actual label
    prediction = classes[np.argmax(class_prob)]
    return prediction
    

In [20]:
# Test for one example
output=predict_category(x_train,y_train,x_test[2])
print("Predicted Output: ",output)
print("Actual Label: ",y_test[2])

Predicted Output:  1
Actual Label:  1


# 7. Check Accuracy

In [21]:
def accuracy(x_train,y_train,x_test,y_test):
    all_pred=[]
    for i in range(x_test.shape[0]):
        pred=predict_category(x_train,y_train,x_test[i])
        all_pred.append(pred)
    # convert the list to an array so that == compares element by element
    return ((np.sum(np.array(all_pred)==y_test)/y_test.shape[0])*100)

In [22]:
accuracy(x_train,y_train,x_test,y_test)

99.75384615384615
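
The accuracy computation itself reduces to the mean of an element-wise comparison. A minimal sketch on made-up predictions (the name `accuracy_score_pct` is hypothetical):

```python
import numpy as np


def accuracy_score_pct(y_pred, y_true):
    """Percentage of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_pred) == np.asarray(y_true)) * 100)


# Three of the four made-up predictions match the true labels
result = accuracy_score_pct([1, 0, 1, 1], [1, 0, 0, 1])
print(result)
```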