## Naive Bayes

Student Name:

Student ID:
 
## General info


<b>Materials</b>: See the [Using Jupyter Notebook and Python page](https://canvas.lms.unimelb.edu.au/courses/126693/pages/python-and-jupyter-notebooks?module_item_id=3950453) on Canvas (under Modules > Coding Resources) for information on the basic setup required for this class, including an iPython notebook viewer. If your iPython notebook doesn't run on the marker's machine, you will lose marks. <b>You should use Python 3</b>.


<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should implement functions for the skeletons listed below. You may implement any number of additional (helper) functions. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions.

You will be marked not only on the correctness of your methods, but also the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it. We reserve the right to deduct up to 4 marks for unreadable or excessively inefficient code.

7 of the marks available for this Project will be assigned to whether the five specified Python functions work. Any other implementation will not be directly assessed (except insofar as it is required to make these five functions work correctly).

13 of the marks will be assigned to your responses to the questions, in terms of both accuracy and insightfulness. We will be looking for evidence that you have an implementation that allows you to explore the problem, but also that you have thought deeply about the data and the behaviour of the Naive Bayes classifier.


## Part 1: Base code [7 marks]

Instructions
1. Do **not** shuffle the data set
2. Treat the features as nominal and use them as provided (e.g., do **not** convert them to other feature types, such as numeric ones). Implement a Naive Bayes classifier with appropriate likelihood function for the data.
3. You should implement the Naive Bayes classifier from scratch. Do **not** use existing implementations/learning algorithms. You must use the epsilon smoothing strategy discussed in the Naive Bayes lecture.
4. Apart from the instructions in point 3, you may use libraries to help you with data reading, representation, maths or evaluation.
5. Ensure that all and only the required information is printed, as indicated in the final three code cells. Failure to print the required information will result in **[-1 mark]** per case. *(We don't mind details such as whether you print a list or several numbers -- just make sure the information is displayed so that it's easily accessible)*
6. Please place the jupyter notebook into the same folder as the input data.


In [1]:
import pandas as pd
import numpy as np
from functools import reduce
from tqdm import tqdm 

In [2]:
class NaiveBayesClassifier():
 
    # constructor for a Naive Bayes classifier object
    def __init__(self,features,target):
        self.data = features # the training data, including the target column
        self.target = target # name of the target column
        self.target_labels = list(set(self.data[self.target])) # unique target labels
        self.target_values = list(self.data[self.target]) # values of the target column
        self.priori = {} # prior probability of each class
        self.cp = {} # conditional probabilities for the instance being classified
        self.prediction = None # the instance currently being classified

    '''
    calculate the prior probability of each class (rounded to two decimals)
    '''
    def calc_priori_probability(self): 
        for item in self.target_labels:
            self.priori[item]  = round(self.target_values.count(item)/float(len(self.target_values)),2)
        print("Priori Probability Values: ", self.priori)

    '''
    calculate P(attr = attr_type | class = x) from the training counts (rounded to two decimals)
    '''
    def individual_probabilites(self, attr, attr_type, x):
        data_attr = list(self.data[attr])
        class_data = list(self.data[self.target])
        total = 0
        for i in range(len(data_attr)):
            if class_data[i] == x and data_attr[i] == attr_type:
                total+=1
        return round(total/float(class_data.count(x)),2)

    '''
    store P(value | class) for every class and every given column of the instance;
    entries are keyed by feature value, and classification() multiplies them with the prior
    '''
    def calc_conditional_probabilities(self, prediction,columns):
        for i in self.priori:
            self.cp[i] = {}
            for j in columns:
                self.cp[i].update({ prediction[j]: self.individual_probabilites(j, prediction[j], i)})

    def classification(self):
        results = {}
        for i in self.cp:
            if i in self.cp and i in self.priori:
                results[i] = reduce(lambda x, y: x*y, self.cp[i].values())*self.priori[i]
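                # not Laplace or epsilon smoothing: a zero product is replaced by the row count,
                # so this class then outscores every class with a non-zero product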
                if results[i] == 0: 
                    results[i] = self.data.shape[0]
        return max(results, key=results.get)
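
The instructions above ask for epsilon smoothing. The following is a minimal, standalone sketch of that idea and not the `NaiveBayesClassifier` class itself: a likelihood whose count is zero is replaced by a small constant, and log-probabilities are summed so that long products do not underflow. The value of `EPSILON` and the names `fit_counts`, `log_score` and `classify` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

EPSILON = 1e-9  # assumed value; should be far smaller than 1 / (number of instances)


def fit_counts(rows, labels):
    """Count class frequencies and per-class (feature, value) frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(label, feature)][value] += 1
    return class_counts, value_counts


def log_score(row, label, class_counts, value_counts):
    """Unnormalised log posterior of `label` for one instance."""
    total = sum(class_counts.values())
    score = math.log(class_counts[label] / total)  # log prior
    for feature, value in row.items():
        count = value_counts[(label, feature)][value]
        # Epsilon smoothing: an unseen (value, class) pair gets EPSILON, not 0.
        likelihood = count / class_counts[label] if count else EPSILON
        score += math.log(likelihood)
    return score


def classify(row, class_counts, value_counts):
    """Return the label with the highest unnormalised log posterior."""
    return max(
        class_counts,
        key=lambda label: log_score(row, label, class_counts, value_counts),
    )


# Toy data, invented for this sketch.
rows = [{"outlook": "sunny"}, {"outlook": "rainy"}, {"outlook": "sunny"}]
labels = ["yes", "no", "yes"]
class_counts, value_counts = fit_counts(rows, labels)
print(classify({"outlook": "rainy"}, class_counts, value_counts))
```

With this scheme a class that has never been seen with a given value is heavily penalised but still comparable with the other classes, rather than being forced to a score of zero.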

In [3]:
# This function should open a csv file and read the data into a useable format [0.5 mark]
def preprocess(filename):
    # open file
    df = pd.read_csv(filename)
    
    # set threshold for splitting training and test set
    limit = int(df.shape[0]*0.7)
    
    # split training and test set
    train_set = df[:limit]
    test_set = df[limit:]
    
    # isolate target column
    target = df.columns[-1]
    
    return df,train_set,test_set,target

In [4]:
# This function should build a supervised NB model [3 marks]
def train(train_set,target):
    # create a class object
    bayes_classifier = NaiveBayesClassifier(train_set,target) 
    # calculation of priori probability - train 
    bayes_classifier.calc_priori_probability()
    # return updated object
    return bayes_classifier

In [5]:
# This function should predict the class for a set of instances, based on a trained model [1.5 marks]
def predict(bayes_classifier,test_set,columns):
    predictions = []
    # for each instance in the test set
    for i in tqdm(range(test_set.shape[0])):
        bayes_classifier.prediction = test_set.iloc[i,:]
        bayes_classifier.calc_conditional_probabilities(bayes_classifier.prediction,columns)
        predictions.append(bayes_classifier.classification())
    return predictions
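
`predict` scores whichever columns it is given, so the caller decides what counts as a feature. A minimal sketch of selecting feature columns, assuming the identifier column is called `ID` and the target column name is known; `feature_columns` is a hypothetical helper and not part of the required skeleton.

```python
import pandas as pd


def feature_columns(df, target, id_column="ID"):
    """Return every column name except the identifier and the target."""
    return [col for col in df.columns if col not in (id_column, target)]


# Toy frame, invented for this sketch.
toy = pd.DataFrame({
    "ID": ["CLIENT0", "CLIENT1"],
    "job": ["unemployed", "services"],
    "marital": ["married", "married"],
    "label": ["no", "no"],
})
print(feature_columns(toy, "label"))
```

Passing such a list as the `columns` argument would keep the identifier and the class label out of the likelihood product.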

In [6]:
# This function should evaluate a set of predictions [1 mark]
def evaluate(predictions,test_set):
    correct = 0
    for i in range(test_set.shape[0]):
        if test_set.iloc[i,-1] == predictions[i]:
            correct+=1
    return correct / float(len(predictions)) * 100.0
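
Accuracy alone hides which classes are being confused. As a hedged sketch, the same percentage can be computed with a vectorised comparison and accompanied by a table of actual versus predicted labels; `accuracy` and `confusion_counts` are hypothetical helper names.

```python
import pandas as pd


def accuracy(predictions, gold_labels):
    """Percentage of predictions that match the gold labels."""
    predicted = pd.Series(list(predictions))
    gold = pd.Series(list(gold_labels))
    return (predicted == gold).mean() * 100.0


def confusion_counts(predictions, gold_labels):
    """Cross-tabulate actual labels (rows) against predicted labels (columns)."""
    actual = pd.Series(list(gold_labels), name="actual")
    predicted = pd.Series(list(predictions), name="predicted")
    return pd.crosstab(actual, predicted)


# Toy labels, invented for this sketch.
print(accuracy(["no", "yes", "no"], ["no", "no", "no"]))
print(confusion_counts(["no", "yes", "no"], ["no", "no", "no"]))
```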

## Bank Marketing

In [21]:
# This cell should act as your "main" function where you call the above functions 
# on the full Bank Marketing data set, and print the evaluation score. [0.33 marks]


# First, read in the data and apply your NB model to the Bank Marketing data

df,train_set,test_set,target = preprocess('bank-marketing.csv')
columns = df.columns
bayes_classifier = train(train_set,target)
predictions = predict(bayes_classifier,test_set,columns)


# Second, print the full evaluation results from the evaluate() function
# predict(bc,train_set)

print("Overall Accuracy:",evaluate(predictions,test_set))


# Third, print data statistics and model predictions, as instructed below 
# N is the total number of instances, F the total number of features, L the total number of labels
# The "class probabilities" may be unnormalized


print("Feature vectors of instances [0, 1, 2]: ", df.iloc[:3,:])

print("\nNumber of instances (N): ", df.shape[0])
print("Number of features (F): ", len(columns) - 1)
print("Number of labels (L): ", len(bayes_classifier.target_labels))

pr3 = test_set.iloc[test_set.shape[0]-3,:]
pr2 = test_set.iloc[test_set.shape[0]-2,:]
pr1 = test_set.iloc[test_set.shape[0]-1,:]

bayes_classifier.calc_conditional_probabilities(pr3,columns)
print("\n\nPredicted class probabilities for instance N-3: ", bayes_classifier.cp)
print("Predicted class for instance N-3: ", predictions[test_set.shape[0]-3])

bayes_classifier.calc_conditional_probabilities(pr2,columns)
print("\nPredicted class probabilities for instance N-2: ", bayes_classifier.cp)
print("Predicted class for instance N-2: ", predictions[test_set.shape[0]-2])

bayes_classifier.calc_conditional_probabilities(pr1,columns)
print("\nPredicted class probabilities for instance N-1: ", bayes_classifier.cp )
print("Predicted class for instance N-1: ", predictions[test_set.shape[0]-1])


Priori Probability Values:  {'yes': 0.12, 'no': 0.88}


100%|██████████████████████████████████████████████████████████████████████████████| 1357/1357 [00:15<00:00, 87.10it/s]

Overall Accuracy: 11.422254974207812
Feature vectors of instances [0, 1, 2]:          ID         job  marital  education default housing loan   contact  \
0  CLIENT0  unemployed  married    primary      no      no   no  cellular   
1  CLIENT1    services  married  secondary      no     yes  yes  cellular   
2  CLIENT2  management   single   tertiary      no     yes   no  cellular   

  month poutcome label  
0   oct  unknown    no  
1   may  failure    no  
2   apr  failure    no  

Number of instances (N):  4521
Number of features (F):  10
Number of labels (L):  2


Predicted class probabilities for instance N-3:  {'yes': {'CLIENT4518': 0.0, 'technician': 0.17, 'married': 0.53, 'secondary': 0.48, 'no': 0.0, 'cellular': 0.81, 'aug': 0.16, 'unknown': 0.65}, 'no': {'CLIENT4518': 0.0, 'technician': 0.17, 'married': 0.63, 'secondary': 0.51, 'no': 1.0, 'cellular': 0.63, 'aug': 0.14, 'unknown': 0.85}}
Predicted class for instance N-3:  yes

Predicted class probabilities for instance N-2:  {'




## Student

In [17]:
# This cell should act as your "main" function where you call the above functions 
# on the full Student data set, and print the evaluation score. [0.33 marks]


# First, read in the data and apply your NB model to the Student data

df,train_set,test_set,target = preprocess('student.csv')
columns = df.columns
bayes_classifier = train(train_set,target)
predictions = predict(bayes_classifier,test_set,columns)


# Second, print the full evaluation results from the evaluate() function
# predict(bc,train_set)

print("Overall Accuracy:",evaluate(predictions,test_set))


# Third, print data statistics and model predictions, as instructed below 
# N is the total number of instances, F the total number of features, L the total number of labels
# The "class probabilities" may be unnormalized


print("Feature vectors of instances [0, 1, 2]: ", df.iloc[:3,:])

print("\nNumber of instances (N): ", df.shape[0])
print("Number of features (F): ", len(columns) - 1)
print("Number of labels (L): ", len(bayes_classifier.target_labels))

pr3 = test_set.iloc[test_set.shape[0]-3,:]
pr2 = test_set.iloc[test_set.shape[0]-2,:]
pr1 = test_set.iloc[test_set.shape[0]-1,:]

bayes_classifier.calc_conditional_probabilities(pr3,columns)
print("\n\nPredicted class probabilities for instance N-3: ", bayes_classifier.cp)
print("Predicted class for instance N-3: ", predictions[test_set.shape[0]-3])

bayes_classifier.calc_conditional_probabilities(pr2,columns)
print("\nPredicted class probabilities for instance N-2: ", bayes_classifier.cp)
print("Predicted class for instance N-2: ", predictions[test_set.shape[0]-2])

bayes_classifier.calc_conditional_probabilities(pr1,columns)
print("\nPredicted class probabilities for instance N-1: ", bayes_classifier.cp )
print("Predicted class for instance N-1: ", predictions[test_set.shape[0]-1])


Priori Probability Values:  {'A+': 0.02, 'D': 0.3, 'B': 0.2, 'F': 0.09, 'C': 0.28, 'A': 0.11}


100%|████████████████████████████████████████████████████████████████████████████████| 195/195 [00:02<00:00, 67.35it/s]

Overall Accuracy: 3.5897435897435894
Feature vectors of instances [0, 1, 2]:           ID school sex address famsize Pstatus  Medu  Fedu     Mjob     Fjob  \
0  STUDENT0     GP   F       U     GT3       A  high  high  at_home  teacher   
1  STUDENT1     GP   F       U     GT3       T   low   low  at_home    other   
2  STUDENT2     GP   F       U     LE3       T   low   low  at_home    other   

   ... internet romantic     famrel  freetime     goout      Dalc      Walc  \
0  ...       no       no       good  mediocre      good  very_bad  very_bad   
1  ...      yes       no  excellent  mediocre  mediocre  very_bad  very_bad   
2  ...      yes       no       good  mediocre       bad       bad  mediocre   

     health      absences label  
0  mediocre   four_to_six     D  
1  mediocre  one_to_three     D  
2  mediocre   four_to_six     C  

[3 rows x 31 columns]

Number of instances (N):  649
Number of features (F):  30
Number of labels (L):  6


Predicted class probabilities for insta




## Obesity

In [14]:
# This cell should act as your "main" function where you call the above functions 
# on the full Obesity data set, and print the evaluation score. [0.33 marks]


# First, read in the data and apply your NB model to the Obesity data

df,train_set,test_set,target = preprocess('obesity.csv')
columns = df.columns
bayes_classifier = train(train_set,target)
predictions = predict(bayes_classifier,test_set,columns)


# Second, print the full evaluation results from the evaluate() function
# predict(bc,train_set)

print("Overall Accuracy:",evaluate(predictions,test_set))


# Third, print data statistics and model predictions, as instructed below 
# N is the total number of instances, F the total number of features, L the total number of labels
# The "class probabilities" may be unnormalized


print("Feature vectors of instances [0, 1, 2]: ", df.iloc[:3,:])

print("\nNumber of instances (N): ", df.shape[0])
print("Number of features (F): ", len(columns) - 1)
print("Number of labels (L): ", len(bayes_classifier.target_labels))

pr3 = test_set.iloc[test_set.shape[0]-3,:]
pr2 = test_set.iloc[test_set.shape[0]-2,:]
pr1 = test_set.iloc[test_set.shape[0]-1,:]

bayes_classifier.calc_conditional_probabilities(pr3,columns)
print("\n\nPredicted class probabilities for instance N-3: ", bayes_classifier.cp)
print("Predicted class for instance N-3: ", predictions[test_set.shape[0]-3])

bayes_classifier.calc_conditional_probabilities(pr2,columns)
print("\nPredicted class probabilities for instance N-2: ", bayes_classifier.cp)
print("Predicted class for instance N-2: ", predictions[test_set.shape[0]-2])

bayes_classifier.calc_conditional_probabilities(pr1,columns)
print("\nPredicted class probabilities for instance N-1: ", bayes_classifier.cp )
print("Predicted class for instance N-1: ", predictions[test_set.shape[0]-1])


Priori Probability Values:  {'obese': 0.46, 'not-obese': 0.54}


100%|███████████████████████████████████████████████████████████████████████████████| 634/634 [00:04<00:00, 132.21it/s]

Overall Accuracy: 45.268138801261834
Feature vectors of instances [0, 1, 2]:           ID Gender family_history_with_overweight FAVC  FCVC   NCP       CAEC  \
0  PATIENT0   Male                            yes  yes   mid  high  Sometimes   
1  PATIENT1   Male                            yes  yes   mid  high  Sometimes   
2  PATIENT2   Male                            yes  yes  high  high  Sometimes   

  SMOKE  CH2O SCC           FAF       TUE        CALC                 MTRANS  \
0   yes   mid  no  low-activity  mediocre  Frequently  Public_Transportation   
1    no  high  no  low-activity      good   Sometimes  Public_Transportation   
2    no  high  no  low-activity      good   Sometimes  Public_Transportation   

       label  
0  not-obese  
1  not-obese  
2      obese  

Number of instances (N):  2111
Number of features (F):  14
Number of labels (L):  2


Predicted class probabilities for instance N-3:  {'obese': {'PATIENT2108': 0.0, 'Female': 0.52, 'no': 0.98, 'yes': 0.0, 'high': 0




## Part 2: Conceptual questions [13 marks]

## Question 1: One-R Baseline [3 marks]

In [None]:
# Write additional code here, if necessary (you may insert additional code cells)
# You should implement the One-R classifier from scratch. Do not use existing implementations/learning algorithms.
# Print the feature name and its corresponding error rate that One-R selects, in addition to any evaluation scores.
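
The One-R baseline described in the comments above can be sketched as follows: for each feature, map every value to its majority label, count the training errors of that rule, and keep the feature with the lowest error rate. This is an illustrative sketch over a list of dict rows, not a prescribed solution; the name `one_r` and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (best_feature, rule, error_rate) for a list of dict rows."""
    best = None
    for feature in rows[0]:
        # Tally the labels seen with each value of this feature.
        label_counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            label_counts[row[feature]][label] += 1
        # The rule maps each value to its majority label.
        rule = {
            value: counts.most_common(1)[0][0]
            for value, counts in label_counts.items()
        }
        # Every instance that is not in the majority for its value is an error.
        errors = sum(
            sum(counts.values()) - counts.most_common(1)[0][1]
            for counts in label_counts.values()
        )
        error_rate = errors / len(rows)
        if best is None or error_rate < best[2]:
            best = (feature, rule, error_rate)
    return best


# Toy data, invented for this sketch.
rows = [
    {"a": "x", "b": "p"},
    {"a": "x", "b": "q"},
    {"a": "y", "b": "p"},
    {"a": "y", "b": "q"},
]
labels = ["yes", "yes", "no", "no"]
feature, rule, error_rate = one_r(rows, labels)
print(feature, error_rate)
```

Ties between features are broken in favour of the first one encountered, which is a design choice of this sketch.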



Provide your text answer to **Question 1.b** of 100-150 words in this cell.



## Question 2: Evaluation strategy [3 marks] 


In [None]:
# Write additional code here, if necessary (you may insert additional code cells)



Provide your text answer to **Question 2** of 100-150 words in this cell.



## Question 3: Feature Selection and Naive Bayes Assumptions [3 marks]

In [10]:
df2 = pd.read_csv('obesity.csv')
df2

|  | ID | Gender | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PATIENT0 | Male | yes | yes | mid | high | Sometimes | yes | mid | no | low-activity | mediocre | Frequently | Public_Transportation | not-obese |
| 1 | PATIENT1 | Male | yes | yes | mid | high | Sometimes | no | high | no | low-activity | good | Sometimes | Public_Transportation | not-obese |
| 2 | PATIENT2 | Male | yes | yes | high | high | Sometimes | no | high | no | low-activity | good | Sometimes | Public_Transportation | obese |
| 3 | PATIENT3 | Female | no | yes | high | high | Frequently | no | low | no | low-activity | good | Sometimes | Public_Transportation | not-obese |
| 4 | PATIENT4 | Female | yes | no | high | high | Sometimes | no | low | no | mid-activity | good | Sometimes | Automobile | not-obese |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2106 | PATIENT2106 | Male | yes | yes | mid | low | Always | no | mid | no | low-activity | bad | Frequently | Automobile | obese |
| 2107 | PATIENT2107 | Male | yes | yes | mid | high | Sometimes | no | low | no | sedentary | mediocre | no | Automobile | obese |
| 2108 | PATIENT2108 | Female | no | yes | high | high | Sometimes | no | low | yes | low-activity | good | Sometimes | Public_Transportation | not-obese |
| 2109 | PATIENT2109 | Female | yes | no | mid | very_high | Sometimes | no | mid | no | low-activity | mediocre | no | Public_Transportation | not-obese |


In [11]:
# Write additional code here, if necessary (you may insert additional code cells)
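# map each nominal value to an integer code (in order of first appearance)
# so that a variance can be computed for every column
for col in df2.columns:
    unique_values = df2[col].unique()
    df2[col] = df2[col].map(dict(zip(unique_values, range(len(unique_values)))))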

In [12]:
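# variance of each integer-coded column; columns below the 0.8 threshold are candidates for deletion
variances = df2.var()
to_delete = variances[variances < 0.8]
to_delete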

Gender                            0.250083
family_history_with_overweight    0.149187
FAVC                              0.102638
FCVC                              0.342113
CAEC                              0.376350
SMOKE                             0.020418
CH2O                              0.668226
SCC                               0.043429
TUE                               0.448207
CALC                              0.264716
MTRANS                            0.345777
label                             0.248553
dtype: float64

Provide your text answer to **Question 3.a** of 100-150 words in this cell.

#### Delete low variance features
One filter method for feature selection is to delete the features with low variance. A variance threshold is chosen (0.8 here), and every feature whose variance falls below that threshold is flagged for deletion.

Note that computing variance requires numeric data, so each nominal value in the dataframe was first mapped to an integer code.
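
For columns with more than two values, the integer codes depend on the order in which values first appear, so the variance of a coded column is partly an artefact of that order. An alternative sketch, which is not the method used above, flags a feature as near-constant when its most frequent value covers at least a chosen share of the rows. The 0.9 threshold in the example, the helper names and the toy data are all assumptions.

```python
import pandas as pd


def dominant_value_share(df):
    """For each column, the fraction of rows taken by its most frequent value."""
    return df.apply(lambda col: col.value_counts(normalize=True).iloc[0])


def near_constant_features(df, threshold=0.95):
    """Names of columns whose most frequent value covers at least `threshold` of rows."""
    shares = dominant_value_share(df)
    return shares[shares >= threshold].index.tolist()


# Toy frame, invented for this sketch.
toy = pd.DataFrame({
    "SMOKE": ["no"] * 19 + ["yes"],
    "Gender": ["Male", "Female"] * 10,
})
print(dominant_value_share(toy))
print(near_constant_features(toy, threshold=0.9))
```

This measure works directly on nominal values, so no numeric recoding is needed.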


Provide your text answer to **Question 3.b** of 100-150 words in this cell.

#### Assumptions

The fundamental Naïve Bayes assumption is that each feature makes an independent and equal contribution to the outcome:
1. Independent: no pair of features is assumed to be dependent, given the class. For example, a job value of unemployed is assumed to say nothing about marital status, such as married.
2. Equal: each feature is given the same influence (or importance). No attribute is treated as irrelevant, and all are assumed to contribute equally to the outcome.
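
Written out, the independence assumption says that the joint likelihood of the $F$ feature values factorises into per-feature terms once the class $c$ is known, which gives the usual decision rule. The notation $x_i$ for the value of the $i$-th feature is introduced here for illustration.

```latex
P(x_1, \dots, x_F \mid c) = \prod_{i=1}^{F} P(x_i \mid c)
\qquad\Longrightarrow\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{F} P(x_i \mid c)
```

Each factor enters the product in the same way, with no per-feature weighting, which is the sense in which the features are treated as contributing equally.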

## Question 4: Feature Selection and Ethics [4 marks]

In [110]:
# Write additional code here, if necessary (you may insert additional code cells)
df_students = pd.read_csv('student.csv')
df_students.columns
df_students_reduced = df_students.drop(['sex','address','failures'],axis=1).copy()

Provide your text answer to **Question 4.a** of 100-150 words in this cell.

The use of ML in education, and specifically in college applications, may raise a series of ethical problems, such as:

1. Discrimination. Students who are predicted to get a low grade may in fact go on to perform well. They would be excluded from colleges even though they could have developed their skills during the educational process.
2. Bias. Misleading evidence produced by a poorly trained algorithm may lead to wrong decisions.


Provide your text answer to **Question 4.b** of 100-150 words in this cell.

Proposed columns to delete:
1. Sex. Gender has nothing to do with developing skills and being well educated.
2. Address. This can lead to discrimination, as students from poor neighbourhoods have the potential to be excellent students.
3. Failures. Failure in the past can lead to success in the future, so this column has to be deleted; otherwise it can lead to bias.


Provide your text answer to **Question 4.c** of 100-150 words in this cell.



