# Project 2.2 - Naive Bayes for Classification
### By: Russell Marvin

Here's the code for my Naive Bayes classifier components. The functions used explicitly in this classifier are `attribute_col()`, `prior()`, `likelihood()`, `evidence()`, and `posterior()`.

`attribute_col()` takes an attribute value a<sub>i</sub> and returns the column of the data (an ndarray) that contains it, along with the index of that column. It assumes each attribute value appears in only one column. This is not an ideal way to achieve this goal, but it works for this classifier. I'm guessing there's a function I'm not familiar with, or more elegant code, that could have done this for me.
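One possibly more compact way to do the same lookup is `np.argwhere`, which returns the (row, column) position of every cell matching a condition. The sketch below is only an illustration: `toy_data` and `find_attribute_col` are hypothetical names, the table is made up rather than taken from `fishing.txt`, and it keeps the assumption that a value appears in only one column.

```python
import numpy as np

# Hypothetical toy table (not the contents of fishing.txt); first column is the class
toy_data = np.array([
    ["Yes", "Strong", "Hot"],
    ["No", "Weak", "Cold"],
    ["Yes", "Weak", "Hot"],
])

def find_attribute_col(table, value):
    """Return (column values, column index) for a column containing value."""
    # np.argwhere gives one (row, col) pair per matching cell, in row-major order
    matches = np.argwhere(table == value)
    if matches.size == 0:
        raise ValueError(f"{value!r} does not appear in the table")
    # Take the column of the first match; assumes the value lives in one column only
    col = int(matches[0, 1])
    return table[:, col], col

values, col = find_attribute_col(toy_data, "Weak")
print(col, values.tolist())  # 1 ['Strong', 'Weak', 'Weak']
```

Unlike the nested loops, this version raises a `ValueError` with a readable message when the value is missing.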

In [15]:
import numpy as np
#path to the data file, we can change this to input("what file?") if we want it to use different files
file_path = "fishing.txt"
#use numpy genfromtxt to turn .txt into arrays, dtype specifies string values (vs int, float, etc)
data = np.genfromtxt(file_path, dtype = str)

#defining all data in target variable column (first col)
target = data[1:,0]

#defining all data in an array without names of cols
all_data = data[1:,:]

priors = {}

#finding instance with each attribute value 
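def attribute_col(asubi):
    #list of attribute values for the matching column initialized
    attr_col = []
    col = None
    for i in range(len(all_data)):
        for j in range(len(all_data[i])):
            #checking if that value == our attribute value of interest, if so, save col as j
            if all_data[i,j] == asubi:
                col = j
                #break out of the inner loop, col keeps the matching column index
                break
        #stop scanning rows once the column has been found
        if col is not None:
            break
    #fail with a clear message (instead of a NameError) if the value never appears
    if col is None:
        raise ValueError(f"attribute value {asubi!r} not found in data")
    #append values for each row given column col
    for row in all_data:
        attr_col.append(row[col])
    #return all attribute values in that column, and the column index
    return attr_col, col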
    

#prior function for prob of each class value / all training examples
def prior(csubj):
    #loop through unique values of classes and count when the value is equal to each class_ (outcome)
    for class_ in np.unique(target):
        class_count = sum(target == class_)
        #calculate num in each class / num training examples
        priors[class_] = class_count / (data.shape[0] - 1)
    return priors[csubj]


In [16]:
# likelihood: estimate the probability of each attribute value ai, given a class of type j
# aka #(asubi and csubj) / #csubj
def likelihood(asubi, csubj):
    #look up the column containing asubi once (its values and its index)
    attr_values, col = attribute_col(asubi)
    temp_dict = {}
    #for every unique attribute value in the column which includes asubi
    for attr in np.unique(attr_values):
        #create dict key w/ each unique value, start count at 0
        temp_dict[attr] = 0
    #count the attribute values in that column over the rows whose class is csubj
    for i in range(len(all_data)):
        if all_data[i,0] == csubj:
            temp_dict[all_data[i,col]] +=1

    #return # rows with asubi and class csubj / # rows with class csubj
    return temp_dict[asubi]/ sum(temp_dict.values())

In [17]:
#Evidence: estimate the probability of each attribute value: 
#P(ai) = #ai / #training examples
#need to count #ai aka number of instances of each value of attribute (count 'A's and 'B's)
def evidence(asubi):
    #look up the column containing asubi once
    attr_values = attribute_col(asubi)[0]
    count = 0
    for i in attr_values:
        if asubi == i:
            count +=1
    return count/len(attr_values)

In [18]:
def posterior(csubj, asubi): 
    return (likelihood(asubi, csubj) * prior(csubj)) / (evidence(asubi))
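
For reference, `posterior()` is Bayes' rule applied to a single attribute value. Written in the notation of the comments above (a_i for an attribute value, c_j for a class), the three functions combine as:

```latex
P(c_j \mid a_i) = \frac{P(a_i \mid c_j)\, P(c_j)}{P(a_i)}
```

Here P(a_i | c_j) corresponds to `likelihood(asubi, csubj)`, P(c_j) to `prior(csubj)`, and P(a_i) to `evidence(asubi)`.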


## Learn Phase
### 1. Estimate the probability of each class: P(cj) = #cj / #training examples

In [19]:
#use a dictionary to hold class probabilities
temp_dict2 = {}
def learn_1():
    for csubj in np.unique(target):
        temp_dict2[csubj] = prior(csubj)
    #return a dictionary with keys = class values and values = probability
    return temp_dict2
learn_1()


{'No': 0.42857142857142855, 'Yes': 0.5714285714285714}

### 2. Estimate the probability of each attribute value ai, given a class of type j: P(ai | cj) = #(ai and cj) / #cj

In [20]:
#create ANOTHER dictionary to store likelihoods
temp_dict3 = {}
def learn_2():
    for csubj in np.unique(target):
        #new dictionary nested within each class value key holding all attribute likelihoods
        temp_dict3[csubj] = {}
        for asubi in np.unique(all_data[:,1:]):
            #loop through all attribute values in each col (except target first col) and assign that dict value to the `likelihood()`  of that attribute value given that class
            temp_dict3[csubj][asubi] = likelihood(asubi,csubj)
    #return the entire dictionary to display readable likelihoods
    return temp_dict3 
learn_2()


{'No': {'Cloudy': 0.16666666666666666,
  'Cold': 0.6666666666666666,
  'Cool': 0.5,
  'Hot': 0.3333333333333333,
  'Moderate': 0.3333333333333333,
  'Rainy': 0.5,
  'Strong': 0.3333333333333333,
  'Sunny': 0.3333333333333333,
  'Warm': 0.16666666666666666,
  'Weak': 0.6666666666666666},
 'Yes': {'Cloudy': 0.125,
  'Cold': 0.375,
  'Cool': 0.125,
  'Hot': 0.625,
  'Moderate': 0.5,
  'Rainy': 0.125,
  'Strong': 0.75,
  'Sunny': 0.75,
  'Warm': 0.375,
  'Weak': 0.25}}

We now have a set of probabilities for each class, and a set of likelihoods for every attribute value given each class type j. This is what we need to use to classify a new instance. 
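
Under the conditional independence assumption, these two tables combine into a score for each class: the class prior multiplied by the likelihood of each observed attribute value. A sketch of that rule, where n (the number of attributes in the instance) and "score" are labels introduced here for illustration:

```latex
\text{score}(c_j) = P(c_j) \prod_{i=1}^{n} P(a_i \mid c_j),
\qquad
P(c_j \mid a_1, \dots, a_n) = \frac{\text{score}(c_j)}{\sum_{k} \text{score}(c_k)}
```

The class with the largest score is the predicted class, and dividing by the sum of the scores over all classes gives its conditional probability.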

## Classify Phase (applied to new Instance)

In [21]:
#calculate all class probabilities for an instance
def classify(instance):
    #class priors from learn_1, attribute likelihoods from learn_2
    class_probs = learn_1()
    temp_object = learn_2()
    temp_dict5 = {}
    for csubj in np.unique(target):
        #prior of the class times the likelihood of each observed attribute value
        temp_dict5[csubj] = class_probs[csubj]
        for asubi in instance:
            temp_dict5[csubj] *= temp_object[csubj][asubi]
    #grab the key of the largest value in temp_dict5 (max probability for this instance)
    #grab key for smallest value in temp_dict5 (min prob for instance)
    maximum = max(temp_dict5, key = temp_dict5.get)
    minimum = min(temp_dict5, key = temp_dict5.get)
    #print the conditional probability of the max class, normalized over the two classes (assumes exactly two classes)
    print(f'The conditional probability that the class is {maximum}, given the observed attribute values is :\n {round(temp_dict5[maximum],3)} / ({round(temp_dict5[maximum],3)} + {round(temp_dict5[minimum],3)}) = {round(temp_dict5[maximum] / (temp_dict5[maximum] + temp_dict5[minimum]),3)}')
classify(['Strong','Hot','Cool','Sunny'])





The conditional probability that the class is Yes, given the observed attribute values is :
 0.025 / (0.025 + 0.008) = 0.76


We've now calculated the conditional probability that a new instance belongs to each class and identified the most likely class for it, given our observed data and assuming the attribute values are conditionally independent given the class.
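
To make the arithmetic behind that result easier to check, here is a small standalone sketch that redoes the calculation for the instance `['Strong','Hot','Cool','Sunny']`, using only the probabilities printed in the learn phase outputs. The names `class_priors`, `likelihoods`, `scores`, and `posteriors` are illustrative and not part of the notebook code. Normalizing by the sum over all classes is a choice made here so the same code would also cover more than two classes.

```python
import math

# Values copied from the learn_1() and learn_2() outputs shown earlier
class_priors = {"No": 0.42857142857142855, "Yes": 0.5714285714285714}
likelihoods = {
    "No": {"Strong": 0.3333333333333333, "Hot": 0.3333333333333333,
           "Cool": 0.5, "Sunny": 0.3333333333333333},
    "Yes": {"Strong": 0.75, "Hot": 0.625, "Cool": 0.125, "Sunny": 0.75},
}
instance = ["Strong", "Hot", "Cool", "Sunny"]

# Unnormalized score per class: prior times the product of the attribute likelihoods
scores = {
    c: class_priors[c] * math.prod(likelihoods[c][a] for a in instance)
    for c in class_priors
}

# Dividing by the sum over all classes turns the scores into probabilities
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))  # Yes 0.76
```

The scores round to 0.025 for Yes and 0.008 for No, matching the numbers in the printed result above.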