# Isaac Rai CIS 678 
## Bayes Classifier

In [2]:
#Importing the required packages
import pandas as pd
import copy
import numpy as np

### Learn - Estimate Class Probability 

In [3]:
#Reading in the data frame and renaming columns
df = pd.read_table('fishing.data', sep = " ", header = None)
df = df.rename(columns={ 0:'Fish', 1:'Wind', 2:'Air', 3:'Water', 4:'Sky'})
df.head()

Unnamed: 0,Fish,Wind,Air,Water,Sky
0,Yes,Strong,WarmAir,Warm,Sunny
1,No,Weak,WarmAir,Warm,Sunny
2,Yes,Strong,WarmAir,Warm,Cloudy
3,Yes,Strong,WarmAir,Moderate,Rainy
4,No,Strong,ColdAir,Cold,Rainy


In [4]:
#Inputs: Data - Pandas data frame, Class - a string with the name of the class column
#Outputs: A dictionary with the name of each class and its prior probability
#This function is designed to be robust, and should work with any number of classes
def ClassEstimate(Data, Class):
    ClassDict = {}
    #Getting the count of each class, indexed by class name
    ClassValues = Data[Class].value_counts()
    #Getting the total number of observations
    Total = ClassValues.sum()
    #Getting the proportion of class members to the total
    #Counts are looked up by class name, since unique() and value_counts() can order the classes differently
    for Name in Data[Class].unique():
        ClassDict[Name] = ClassValues[Name]/Total
    #Returning dictionary of the class values and their respective prob.
    return ClassDict

ClassEstimate(df,'Fish')

{'Yes': 0.5714285714285714, 'No': 0.42857142857142855}

The code above contains my function for calculating the prior probability of each class in a data set. I designed the function to be robust so that it can handle any number of classes, from any dataset. As long as the data is in
a Pandas dataframe, the function can get the class names and probabilities from the specified class column. I tested
the function using the fishing data, and it came up with the same answer we got in class. The output of the function
is a dictionary of key-value pairs, each holding a class name and its corresponding probability.
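
One thing worth checking when estimating priors this way is label order. The sketch below uses a small made-up label column (not the fishing data) in which the first label to appear is not the most frequent one. It shows that `unique()` and `value_counts()` can list the labels in different orders, so looking each count up by label is safer than relying on position.

```python
import pandas as pd

# Hypothetical labels: 'No' appears first, but 'Yes' is more frequent.
labels = pd.Series(['No', 'Yes', 'Yes', 'No', 'Yes'], name='Fish')

# unique() keeps the order of first appearance.
appearance_order = list(labels.unique())
# value_counts() sorts by frequency, most frequent first.
frequency_order = list(labels.value_counts().index)

# Looking probabilities up by label keeps names and values aligned.
priors = labels.value_counts(normalize=True).to_dict()

print(appearance_order)
print(frequency_order)
print(priors)
```

In the fishing data the first row happens to belong to the most frequent class, so both orders coincide there.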

### Learn - Estimate Attribute Probability Based on Class

In [5]:
#Inputs - Data - Pandas data frame, 
#   Class - a string containing the name of the class col., 
#   ClassVal - The desired value of the specified class
#Outputs - A dictionary with the attribute names as keys, and as values a nested dictionary mapping each attribute value to its probability
def ClassAttEstimate(Data, Class, ClassVal):
    #Filtering data frame by class 
    FilterClass = Data[Data[Class] == ClassVal]
    #Dropping the class column
    DropClass = FilterClass.drop(columns = Class)
    #Dropping class column while keeping all data
    EdgeCaseData = Data.drop(columns = Class)
    #Getting the column names to store as keys
    ColNames = DropClass.columns.unique()
    endresult = {}
    for i in range(0, len(DropClass.columns)):
        #Assigning each of the column names as keys, and a nested dictionary as values
        #Reindexed subdictionary with original data set to make sure that every category has an index 
        #(i.e cloudy has a prob of 0 below since it was not seen by the 'No' class)
        endresult[ColNames[i]] = DropClass[ColNames[i]].value_counts(normalize = True ).reindex(EdgeCaseData[ColNames[i]].unique(), fill_value = 0).to_dict()
    return endresult
ClassAttEstimate(df, 'Fish', 'No')

{'Wind': {'Strong': 0.3333333333333333, 'Weak': 0.6666666666666666},
 'Air': {'WarmAir': 0.3333333333333333, 'ColdAir': 0.6666666666666666},
 'Water': {'Warm': 0.16666666666666666,
  'Moderate': 0.3333333333333333,
  'Cold': 0.5},
 'Sky': {'Sunny': 0.3333333333333333,
  'Cloudy': 0.0,
  'Rainy': 0.6666666666666666}}

The function above calculates the probability of each attribute value, conditioned on the given class. My final output is a
dictionary whose keys are the names of the feature columns in the data set, and whose values are nested
dictionaries with each attribute value as a key and its probability as the value. I think the nested
dictionary output contains all the data that we need for the classification step, without too much bloat. I
considered using triple nested dictionaries, with the top level keyed by each value of the class,
but after further consideration I determined that it would make the resulting data structure too complicated. I
can get all the class values from the output of the function in step one, so I don't need to do that in
this function too. The value_counts() function in the Pandas library, with the normalize = True argument, did most
of the work in my code. This would have been much more difficult without it. The reindex() function was just as
useful. During my validation process I noticed that Sky = 'Cloudy' is never seen in the 'No' class, so it was
missing from that class's sub dictionary, which gave my code problems. By reindexing each sub dictionary with the
unique values from the original data frame, any attribute value not seen in a particular class is still included,
with its probability set to 0. This solves the problem of attribute values that are not seen in one of the classes.
When I run the function on the Fish data set with the class set to 'No', the output matches the values we got doing
the calculations by hand in class.
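
The effect of `reindex()` can be seen in isolation with a minimal sketch. The data frame `toy` below is made up for illustration (it is not the fishing data); it has a 'Cloudy' value that never occurs in the 'No' class.

```python
import pandas as pd

# Hypothetical data: 'Cloudy' only ever occurs with Fish == 'Yes'.
toy = pd.DataFrame({
    'Fish': ['Yes', 'Yes', 'No', 'No', 'No'],
    'Sky':  ['Cloudy', 'Sunny', 'Rainy', 'Sunny', 'Rainy'],
})

no_rows = toy[toy['Fish'] == 'No']

# Without reindexing, values unseen in this class are simply absent.
raw = no_rows['Sky'].value_counts(normalize=True)

# Reindexing against every value in the full column adds them back with 0.
filled = raw.reindex(toy['Sky'].unique(), fill_value=0).to_dict()

print('Cloudy' in raw.index)
print(filled)
```

Without the reindex step, looking up 'Cloudy' for the 'No' class would raise a KeyError instead of returning 0.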

### Classify New Instances

In [6]:
#Inputs - Instance - A list of strings of the attribute values
#       - Class - a string that names the class column
#       - Data - A  pandas data frame of the desired data set
#Output - A dictionary with the name of each class as the keys, and its probability as the value 

def BayesClassifier(instance, Class, Data):
    TestInstance = instance
    #Getting the Class estimates from part one function
    ClassEst = ClassEstimate(Data, Class)
    #Dropping the class column
    DropClass2 = Data.drop(columns = Class)
    #Getting the feature names
    ColNames2 = DropClass2.columns.unique()
    #Getting the Class names
    ClassNames2 = list(ClassEst.keys())
    ProbDict = {}
    Products = {}
    #Iterate through every possible class
    for i in range(0, len(ClassNames2)):
        #Get the probability of every attribute value for the class assigned in this loop
        AttProbs = ClassAttEstimate(Data, Class ,ClassNames2[i])
        #Storing Class names in dictionary as keys 
        ProbDict[ClassNames2[i]] = []
        #This loop uses two keys to look in the nested dictionary from part two
        #The loop looks in the dictionary for the selected class and attribute, and appends the prob to a list. 
        for j in range(0, len(TestInstance)):
            ProbDict[ClassNames2[i]].append(AttProbs[ColNames2[j]][TestInstance[j]])
        #appends the class prob. to each class list so that each value in the dictionary is a list of probs
        ProbDict[ClassNames2[i]].append(ClassEst[ClassNames2[i]])
    #Assignment alone would only bind a second name to the same dictionary, so we make a (shallow) copy to work on
    Products = copy.copy(ProbDict)
    #This loop multiplies the values of each list together, getting the final prob of each class
    for i in range(0, len(Products)):
        #Need to convert list into an array to use numpy functions 
        Products[ClassNames2[i]] = np.array(Products[ClassNames2[i]])
        Products[ClassNames2[i]] = Products[ClassNames2[i]].prod()
    return Products

BayesClassifier(['Strong','WarmAir','Cold','Sunny'], 'Fish', df)

{'Yes': 0.02511160714285714, 'No': 0.007936507936507936}

In [7]:
BayesClassifier(['Weak','ColdAir','Moderate','Sunny'], 'Fish', df)

{'Yes': 0.020089285714285712, 'No': 0.021164021164021163}

In [8]:
BayesClassifier(['Strong','ColdAir','Cold','Rainy'], 'Fish', df)

{'Yes': 0.002511160714285714, 'No': 0.031746031746031744}

My last function, shown above, uses the first two functions to score each class using the Naive Bayes algorithm. It finds the classes present in the data and creates a key in a dictionary for each of them. For each key I store a list of the conditional probabilities of the instance's attribute values given that class, and then append the prior probability of the class. Each value is therefore a list of all the multiplicands in the Naive Bayes formula. I convert each list to an array so I can use numpy functions, and multiply everything together to get the value for each class. Strictly speaking, these values are proportional to the class probabilities rather than probabilities themselves, since I do not divide by the probability of the instance, so they do not sum to one.

My output is a dictionary of these values for all the classes. I considered outputting the probability lists before multiplication too, but I thought that would clutter the output too much. I can go back and add that easily in the future, since I stored the dictionary with the multiplicands before I multiplied them all together. During this step I learned that assignment in Python binds a second name to the same object rather than copying it, so I needed to make a copy of the dictionary or else my changes would also show up in the original.

Like my other functions, this code should work with any data set provided that it is in a Pandas data frame, since I use some Pandas functions in it. The input instance to my function must be a list of the attribute values, in the same order as the feature columns. I chose to simply display the value for each class in the output, so that users can see how close the classes are to one another. Users can select the class with the highest value and assign it to the instance. If the values are close, that also tells them the classification is less certain.
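
Picking the class from the output dictionary can be sketched in a few lines. This is only an illustration, assuming the dictionary returned for the first test instance above; the names `scores`, `posteriors` and `predicted` are introduced here and are not part of the functions above. Dividing each value by the sum of all values rescales them so that they add up to one.

```python
# Output of the classifier for ['Strong','WarmAir','Cold','Sunny'], copied from above.
scores = {'Yes': 0.02511160714285714, 'No': 0.007936507936507936}

# The predicted class is the key with the largest value.
predicted = max(scores, key=scores.get)

# Rescaling by the total turns the values into probabilities that sum to one.
total = sum(scores.values())
posteriors = {label: value / total for label, value in scores.items()}

print(predicted)
print(posteriors)
```

Rescaling does not change which class wins, but it makes the margin between the classes easier to read.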

### Further Investigation - Scikit Learn Validation

In [10]:
#Importing ScikitLearn
from sklearn.naive_bayes import CategoricalNB
from sklearn import preprocessing

In [13]:
#Preprocessing label encoder
le = preprocessing.LabelEncoder()
#Encoding labels and features
wind = le.fit_transform(df.loc[:,'Wind'])
air = le.fit_transform(df.loc[:,'Air'])
water = le.fit_transform(df.loc[:,'Water'])
sky = le.fit_transform(df.loc[:,'Sky'])
label =  le.fit_transform(df.loc[:,'Fish'])
#Combining encoded features 
features = zip(wind, air, water, sky)
#Creating a NB model 
NBM = CategoricalNB()

#Training the model 
NBM.fit(list(features), label)

#LabelEncoder assigns codes in sorted (alphabetical) order of the labels:
#Wind - Strong = 0, Weak = 1
#Air - ColdAir = 0, WarmAir = 1
#Water - Cold = 0, Moderate = 1, Warm = 2
#Sky - Cloudy = 0, Rainy = 1, Sunny = 2
#Fish - No = 0, Yes = 1


#Validating results of 'Strong','WarmAir','Cold','Sunny'
Validate1 = NBM.predict([[0, 1, 0, 2]]) #'Strong','WarmAir','Cold','Sunny'
#Running my classifier
Predict1 = BayesClassifier(['Strong','WarmAir','Cold','Sunny'], 'Fish', df)
print(Predict1)
print(Validate1)

{'Yes': 0.02511160714285714, 'No': 0.007936507936507936}
[1]
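
The integer codes passed to `predict` above were typed in by hand. LabelEncoder numbers the labels by their position in sorted order (it stores them in its `classes_` attribute), so the same mapping can be rebuilt and checked with plain Python. The sketch below is an illustration only: the helper names `sorted_label_codes` and `encode_instance` are hypothetical, and the category lists are taken from the outputs shown earlier.

```python
def sorted_label_codes(values):
    """Map each distinct label to its position in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Attribute values as they appear in the fishing data outputs above.
categories = {
    'Wind':  ['Strong', 'Weak'],
    'Air':   ['WarmAir', 'ColdAir'],
    'Water': ['Warm', 'Moderate', 'Cold'],
    'Sky':   ['Sunny', 'Cloudy', 'Rainy'],
}
codes = {column: sorted_label_codes(values) for column, values in categories.items()}

def encode_instance(instance):
    """Translate a list of attribute values into integer codes, column by column."""
    return [codes[column][value] for column, value in zip(categories, instance)]

encoded = encode_instance(['Strong', 'WarmAir', 'Cold', 'Sunny'])
print(codes['Sky'])
print(encoded)
```

The encoded list matches the `[0, 1, 0, 2]` used for the first validation instance.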


In [14]:
#Validating results of 'Weak','ColdAir','Moderate','Sunny'
Validate2 = NBM.predict([[1, 0, 1, 2]])
#Running my classifier
Predict2 = BayesClassifier(['Weak','ColdAir','Moderate','Sunny'], 'Fish', df)
print(Predict2)
print(Validate2)

{'Yes': 0.020089285714285712, 'No': 0.021164021164021163}
[1]


In [15]:
#Validating results of 'Strong','ColdAir','Cold','Rainy'
Validate3 = NBM.predict([[0, 0, 0, 1]])
#Running my classifier
Predict3 = BayesClassifier(['Strong','ColdAir','Cold','Rainy'], 'Fish', df)
print(Predict3)
print(Validate3)

{'Yes': 0.002511160714285714, 'No': 0.031746031746031744}
[0]


In [16]:
#Validating results of 'Weak','WarmAir','Moderate','Cloudy'
Validate4 = NBM.predict([[1, 1, 1, 0]])
#Running my classifier
Predict4 = BayesClassifier(['Weak','WarmAir','Moderate','Cloudy'], 'Fish', df)
print(Predict4)
print(Validate4)

{'Yes': 0.005580357142857142, 'No': 0.0}
[1]


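In order to verify that my Naive Bayes works correctly, I validated some instances against Scikit Learn's Categorical Naive Bayes algorithm. My classifier picks the same class as Scikit Learn's in three of the four instances shown above. The exception is the second instance ('Weak','ColdAir','Moderate','Sunny'), where my values slightly favor 'No' (0.0212 versus 0.0201) while Scikit Learn predicts 1 ('Yes'). The most likely cause is that CategoricalNB applies additive (Laplace) smoothing by default (alpha = 1.0), whereas my functions use the raw frequencies, and the two classes are close enough in this instance for smoothing to change the result.

During my testing I also found that my classifier had a problem when it was given the attribute value 'Cloudy'. After some sleuthing I found that my attribute probability function did not return probabilities for attribute values that are not seen in a particular class. To remedy this, I added the unseen values to the sub dictionaries in the output of my second function, with their probability set to zero. Once this was addressed, my classifier ran on every instance that I verified against Scikit Learn's NB function.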
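
One thing to keep in mind when comparing against CategoricalNB is that it applies additive smoothing by default (alpha = 1.0), so its internal probabilities are not identical to raw frequency estimates even when the predicted labels agree. The sketch below shows the difference with a hypothetical helper, `smoothed_probs`, and illustrative counts (assumed here, not read from the data file): each count gets alpha added, and the denominator grows by alpha times the number of categories.

```python
def smoothed_probs(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * number_of_categories)."""
    total = sum(counts.values())
    n_categories = len(counts)
    return {
        value: (count + alpha) / (total + alpha * n_categories)
        for value, count in counts.items()
    }

# Illustrative counts for one attribute within one class (assumed values).
sky_counts = {'Sunny': 2, 'Cloudy': 0, 'Rainy': 4}

raw = smoothed_probs(sky_counts, alpha=0.0)       # plain relative frequencies
smoothed = smoothed_probs(sky_counts, alpha=1.0)  # Laplace smoothing

print(raw)
print(smoothed)
```

With smoothing, the unseen value gets a small non-zero probability instead of 0, so a single unseen attribute value no longer forces the whole product for that class to zero.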