# Name(s)
**PUT YOUR FULL NAME(S) HERE**

**Instructions:** Pair programming assignment. Submit only a single notebook, but make sure to include your first and last names.

# Bayesian Classifier

## Preface
(Courtesy of Dr. Alex Dekhtyar)

The core objective of Knowledge Discovery in Data/Data Mining/Machine Learning methods is to provide efficient algorithms for gaining insight from data. CSC 466 primarily studies the methods and the algorithms that enable
such insight, and that specifically take this insight above and beyond traditional statistical analysis of data (more
about this — later in the course).
However, the true power of KDD/DM/ML methods that we will study in this course is witnessed only when
these methods are applied to actually gain insight from the data. As such, in this course, the deliverables for your
laboratory assignments will be partitioned into two categories:

1. KDD Method implementation. In most labs you will be asked to implement from scratch one or more
KDD methods for producing a special type of insight from data. This part of the labs is similar to your other
CS coursework - you will submit your code, and, sometimes, your tests and/or output.

2. Insight, a.k.a., data analysis. For each lab assignment we will provide one or more datasets for your
perusal, and will ask you to perform the analysis of these datasets using the methods you implemented. The
results of this analysis, i.e., the insight, are as important for successful completion of your assignments, as
your implementations. Most of the time, you will be asked to submit a lab report detailing your analysis,
and containing the answers to the questions you are asked to study.

The insight portion of your deliverables is something that you may be seeing for the first time in your CS
coursework. It is not an afterthought in your lab assignments. Your grade will, in no small part, depend on
the results of your analysis, and the writing quality of your report. This lab assignment and further assignments
will include detailed instructions on how to prepare reports, and we will discuss report writing several times as
the course progresses.

## Lab Assignment

This is a pair programming assignment. I strongly
discourage individual work for this (and other team/pair programming) lab(s), even if you think you can do it
all by yourself. Also, this is a pair programming assignment, not a "work in teams of two" assignment. Pair
programming requires joint work on all aspects of the project without delegating portions of the work to individual
team members. For this lab, I want all your work — discussion, software development, analysis of the results,
report writing — to be products of joint work.
Students enrolled in the class can pair with other students enrolled in the class. Students on the waitlist can
pair with other students on the waitlist. In "odd person out" situations, a team of three people can
be formed, but that team must (a) ask and answer one additional question, and (b) work as a pair would, without
delegation of any work off-line.

For this lab, we will first implement an empirical naive Bayesian classifier, then implement a feature importance measure and apply it to a dataset, and finally examine the effect of modifying the priors.

For developing this lab, we can use the Titanic Kaggle dataset.

In [15]:
import pandas as pd
titanic_df = pd.read_csv(
    "https://raw.githubusercontent.com/dlsun/data-science-book/master/data/titanic.csv"
)
titanic_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


We only need a few columns, and I will also perform some preprocessing for you:

In [16]:
features = ['pclass','survived','sex','age']
titanic_df = titanic_df.loc[:,features]
display(titanic_df)
titanic_df.loc[:,'pclass']=titanic_df['pclass'].fillna(titanic_df['pclass'].mode()[0]).astype(int)
titanic_df.loc[:,'age']=titanic_df['age'].fillna(titanic_df['age'].median())
titanic_df.loc[:,'age']=(titanic_df['age']/10).astype(str).str[0].astype(int)*10
titanic_df

Unnamed: 0,pclass,survived,sex,age
0,1,1,female,29.0000
1,1,1,male,0.9167
2,1,0,female,2.0000
3,1,0,male,30.0000
4,1,0,female,25.0000
...,...,...,...,...
1304,3,0,female,14.5000
1305,3,0,female,
1306,3,0,male,26.5000
1307,3,0,male,27.0000


Unnamed: 0,pclass,survived,sex,age
0,1,1,female,20
1,1,1,male,0
2,1,0,female,0
3,1,0,male,30
4,1,0,female,20
...,...,...,...,...
1304,3,0,female,10
1305,3,0,female,20
1306,3,0,male,20
1307,3,0,male,20


In [17]:
titanic_df.describe()

Unnamed: 0,pclass,survived,age
count,1309.0,1309.0,1309.0
mean,2.294882,0.381971,24.385027
std,0.837836,0.486055,13.387598
min,1.0,0.0,0.0
25%,2.0,0.0,20.0
50%,3.0,0.0,20.0
75%,3.0,1.0,30.0
max,3.0,1.0,80.0


## Exercise 0
In your own words, describe the preprocessing steps I took above.

## Exercise 1
Fill in the following function to determine the prior probability of each class. The result must be a Python dictionary keyed by class, such as ``priors = {'survived=0': 0.4, 'survived=1': 0.6}`` when the function is called with ``yname='survived'``.
<pre>
def compute_priors(y,yname="y"):
  ???
  return priors
</pre>

In [18]:
# YOUR SOLUTION HERE
compute_priors(titanic_df['survived'],yname='survived')

{'survived=0': 0.6180290297937356, 'survived=1': 0.3819709702062643}
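
As a hint, here is one possible sketch of the prior computation on made-up toy labels. It is not the reference solution; the `yname="y"` default is an assumption inferred from the example call above, and the toy data is invented for illustration.

```python
import pandas as pd

def compute_priors(y, yname="y"):
    # Relative frequency of each class label in y.
    freqs = y.value_counts(normalize=True)
    # Keys follow the "yname=label" pattern used in the output above.
    return {f"{yname}={label}": float(p) for label, p in freqs.items()}

# Toy labels (invented for illustration): three 0s and one 1.
toy_y = pd.Series([0, 0, 1, 0])
toy_priors = compute_priors(toy_y, yname="survived")
print(toy_priors)
```

The priors always sum to 1 because they are relative frequencies over the same set of rows.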

## Exercise 2
The next function to implement is the specific class conditional probability:
<pre>
def specific_class_conditional(x,xv,y,yv):
  ???
  return prob
</pre>

In [19]:
# YOUR SOLUTION HERE
specific_class_conditional(titanic_df['sex'],'female',titanic_df['survived'],0)

0.15698393077873918
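
One way to think about this quantity: restrict the data to the rows of class `yv`, then measure the fraction of those rows where the feature equals `xv`. The sketch below does this on invented toy data; returning 0 for an empty class is my assumption, not a stated requirement.

```python
import pandas as pd

def specific_class_conditional(x, xv, y, yv):
    # Boolean mask selecting the rows that belong to class yv.
    in_class = (y == yv)
    if in_class.sum() == 0:
        # Assumption: an empty class yields probability 0.
        return 0.0
    # Fraction of the class-yv rows whose feature value equals xv.
    return float((x[in_class] == xv).mean())

# Toy data (invented for illustration).
toy_x = pd.Series(["female", "male", "male", "female"])
toy_y = pd.Series([1, 0, 0, 0])
prob = specific_class_conditional(toy_x, "female", toy_y, 0)
print(prob)
```

Of the three toy rows with class 0, one is female, so the estimate is one third.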

## Exercise 3
Now construct a dictionary-based data structure that stores all possible class conditional probabilities (i.e., loop through all possible combinations of feature values and class values). The keys in your dictionary should be of the form "pclass=1|survived=0".

<pre>
# X is a dataframe that does not contain the class column y.
def class_conditional(X,y,yname="y"):
  ???
  return probs
</pre>

In [20]:
# YOUR SOLUTION HERE
display(class_conditional(titanic_df.drop("survived",axis=1),titanic_df["survived"]))
display(class_conditional(titanic_df.drop("survived",axis=1),titanic_df["survived"],yname="survived"))

{'pclass=1|y=0': 0.15203955500618047,
 'pclass=1|y=1': 0.4,
 'pclass=2|y=0': 0.19530284301606923,
 'pclass=2|y=1': 0.238,
 'pclass=3|y=0': 0.6526576019777504,
 'pclass=3|y=1': 0.362,
 'sex=female|y=0': 0.15698393077873918,
 'sex=female|y=1': 0.678,
 'sex=male|y=0': 0.8430160692212608,
 'sex=male|y=1': 0.322,
 'age=0|y=0': 0.03955500618046971,
 'age=0|y=1': 0.1,
 'age=10|y=0': 0.10754017305315204,
 'age=10|y=1': 0.112,
 'age=20|y=0': 0.5030902348578492,
 'age=20|y=1': 0.4,
 'age=30|y=0': 0.16563658838071693,
 'age=30|y=1': 0.196,
 'age=40|y=0': 0.10259579728059333,
 'age=40|y=1': 0.104,
 'age=50|y=0': 0.04697156983930779,
 'age=50|y=1': 0.064,
 'age=60|y=0': 0.027194066749072928,
 'age=60|y=1': 0.02,
 'age=70|y=0': 0.007416563658838072,
 'age=70|y=1': 0.002,
 'age=80|y=0': 0.0,
 'age=80|y=1': 0.002}

{'pclass=1|survived=0': 0.15203955500618047,
 'pclass=1|survived=1': 0.4,
 'pclass=2|survived=0': 0.19530284301606923,
 'pclass=2|survived=1': 0.238,
 'pclass=3|survived=0': 0.6526576019777504,
 'pclass=3|survived=1': 0.362,
 'sex=female|survived=0': 0.15698393077873918,
 'sex=female|survived=1': 0.678,
 'sex=male|survived=0': 0.8430160692212608,
 'sex=male|survived=1': 0.322,
 'age=0|survived=0': 0.03955500618046971,
 'age=0|survived=1': 0.1,
 'age=10|survived=0': 0.10754017305315204,
 'age=10|survived=1': 0.112,
 'age=20|survived=0': 0.5030902348578492,
 'age=20|survived=1': 0.4,
 'age=30|survived=0': 0.16563658838071693,
 'age=30|survived=1': 0.196,
 'age=40|survived=0': 0.10259579728059333,
 'age=40|survived=1': 0.104,
 'age=50|survived=0': 0.04697156983930779,
 'age=50|survived=1': 0.064,
 'age=60|survived=0': 0.027194066749072928,
 'age=60|survived=1': 0.02,
 'age=70|survived=0': 0.007416563658838072,
 'age=70|survived=1': 0.002,
 'age=80|survived=0': 0.0,
 'age=80|survived=1': 0
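
The loop structure can be sketched as follows: for every feature column, every value of that feature, and every class, store one conditional probability under a "feature=value|yname=class" key. This is a minimal sketch on invented toy data, with the `yname` default assumed from the calls above; it inlines the conditional computation so that it is self-contained.

```python
import pandas as pd

def class_conditional(X, y, yname="y"):
    probs = {}
    for col in X.columns:
        for xv in sorted(X[col].unique()):
            for yv in sorted(y.unique()):
                # Rows of class yv, then the fraction of them with X[col] == xv.
                in_class = (y == yv)
                key = f"{col}={xv}|{yname}={yv}"
                probs[key] = float((X.loc[in_class, col] == xv).mean())
    return probs

# Toy data (invented for illustration).
toy_X = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "pclass": [1, 3, 3, 1],
})
toy_y = pd.Series([1, 0, 0, 0], name="survived")
toy_probs = class_conditional(toy_X, toy_y, yname="survived")
for key, value in toy_probs.items():
    print(key, value)
```

For a fixed feature and a fixed class, the probabilities over all values of that feature sum to 1, which is a useful sanity check on your own implementation.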

## Exercise 4
Now you are ready to calculate the posterior probabilities for a given sample. Write and test the following function that returns a dictionary where the keys are of the form "survived=0|pclass=1,sex=male,age=60". Make sure you return 0 if the specific combination of values does not exist.
<pre>
def posteriors(probs,priors,x):
    ???
    return posteriors
</pre>

In [21]:
# YOUR SOLUTION HERE
probs = class_conditional(titanic_df.drop("survived",axis=1),titanic_df["survived"],yname="survived")
priors = compute_priors(titanic_df["survived"],yname="survived")
posteriors(probs,priors,titanic_df.drop("survived",axis=1).loc[0])

{'survived=0|pclass=1,sex=female,age=20': 0.15189282364486656,
 'survived=1|pclass=1,sex=female,age=20': 0.8481071763551334}
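
Under the naive Bayes assumption, the unnormalized score of each class is its prior times the product of the class conditional probabilities of the observed feature values; dividing by the sum over classes gives posteriors that add up to 1. The sketch below illustrates this on invented toy probabilities. Treating a missing key as probability 0, and returning all zeros when every class scores 0, is my reading of "return 0 if the specific combination of values does not exist".

```python
def posteriors(probs, priors, x):
    # x maps feature name -> value; a dict is used here, and a pandas
    # Series row works the same way because it also provides .items().
    evidence = ",".join(f"{name}={value}" for name, value in x.items())
    unnormalized = {}
    for class_key, prior in priors.items():
        score = prior
        for name, value in x.items():
            # A combination missing from probs contributes probability 0.
            score *= probs.get(f"{name}={value}|{class_key}", 0.0)
        unnormalized[f"{class_key}|{evidence}"] = score
    total = sum(unnormalized.values())
    if total == 0:
        return {key: 0.0 for key in unnormalized}
    return {key: score / total for key, score in unnormalized.items()}

# Toy probabilities (invented for illustration).
toy_priors = {"survived=0": 0.75, "survived=1": 0.25}
toy_probs = {
    "sex=female|survived=0": 0.2,
    "sex=female|survived=1": 0.8,
    "pclass=1|survived=0": 0.5,
    "pclass=1|survived=1": 0.5,
}
toy_post = posteriors(toy_probs, toy_priors, {"pclass": 1, "sex": "female"})
toy_unseen = posteriors(toy_probs, toy_priors, {"pclass": 2, "sex": "female"})
print(toy_post)
print(toy_unseen)
```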

## Exercise 5
All this is great, but how do we evaluate how well we are doing? Let's write a function called train_test_split that splits our data into a training dataset and a testing dataset, with approximately a fraction test_frac of the rows going to the testing dataset. Make sure it does this randomly.
<pre>
def train_test_split(X,y,test_frac=0.5):
   return Xtrain,ytrain,Xtest,ytest
</pre>

In [22]:
# YOUR SOLUTION HERE
Xtrain,ytrain,Xtest,ytest=train_test_split(titanic_df.drop("survived",axis=1),titanic_df["survived"])
Xtrain,ytrain,Xtest,ytest

(      pclass     sex  age
 1158       3  female   40
 671        3    male   20
 126        1    male   30
 423        2    male   30
 428        2  female   20
 ...      ...     ...  ...
 285        1    male   60
 724        3    male   30
 565        2    male   20
 1292       3    male   20
 1025       3    male    0
 
 [654 rows x 3 columns], 1158    0
 671     0
 126     0
 423     0
 428     1
        ..
 285     0
 724     0
 565     0
 1292    0
 1025    1
 Name: survived, Length: 654, dtype: int64,       pclass     sex  age
 468        2  female   20
 670        3    male   20
 1130       3  female   10
 708        3    male   20
 221        1    male   60
 ...      ...     ...  ...
 130        1  female   20
 937        3  female    0
 984        3  female   20
 1218       3    male   30
 301        1    male   40
 
 [655 rows x 3 columns], 468     0
 670     0
 1130    0
 708     0
 221     0
        ..
 130     1
 937     0
 984     1
 1218    0
 301     0
 Name: survived
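
A random split can be sketched by shuffling the row labels and slicing them into two groups. This is one possible approach on invented toy data, not the reference solution; the `seed` parameter is a hypothetical addition for reproducibility, and the sketch assumes the index labels are unique.

```python
import pandas as pd

def train_test_split(X, y, test_frac=0.5, seed=None):
    # Shuffle the row labels; seed is a hypothetical extra parameter.
    shuffled = X.sample(frac=1.0, random_state=seed).index
    # The first n_test shuffled labels form the test set, the rest the training set.
    n_test = int(round(test_frac * len(X)))
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X.loc[train_idx], y.loc[train_idx], X.loc[test_idx], y.loc[test_idx]

# Toy data (invented for illustration).
toy_X = pd.DataFrame({
    "pclass": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
    "sex": ["female", "male"] * 5,
})
toy_y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], name="survived")
Xtr, ytr, Xte, yte = train_test_split(toy_X, toy_y, test_frac=0.3, seed=0)
print(len(Xtr), len(Xte))
```

Selecting `X` and `y` with the same labels keeps each row paired with its class, which is easy to get wrong if the two are shuffled separately.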

## Exercise 6
For this exercise, estimate the class conditional probabilities and the priors from a training dataset containing 70% of the data, then use these probabilities to predict the test dataset and report the accuracy.

In [23]:
Xtrain,ytrain,Xtest,ytest=train_test_split(titanic_df.drop("survived",axis=1),titanic_df["survived"],test_frac=0.3)
# YOUR SOLUTION HERE

Test set accuracy: 0.7709923664122137
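
Turning posteriors into an accuracy takes two steps: predict the class with the largest posterior for each test row, then report the fraction of predictions that match the true labels. The sketch below shows both on invented posterior dictionaries; `predict_label` and `accuracy` are hypothetical helper names, and comparing labels as strings is a simplification I chose because the class is parsed out of the dictionary key.

```python
import pandas as pd

def predict_label(posterior_dict):
    # Keys look like "survived=1|sex=male"; pick the most probable key
    # and keep the class value between the first "=" and the "|".
    best_key = max(posterior_dict, key=posterior_dict.get)
    return best_key.split("|")[0].split("=", 1)[1]

def accuracy(predictions, ytest):
    # Fraction of rows where the predicted label equals the true label.
    return float((predictions.astype(str) == ytest.astype(str)).mean())

# Toy posteriors for three test rows (invented for illustration).
toy_posteriors = [
    {"survived=0|sex=male": 0.8, "survived=1|sex=male": 0.2},
    {"survived=0|sex=female": 0.3, "survived=1|sex=female": 0.7},
    {"survived=0|sex=male": 0.6, "survived=1|sex=male": 0.4},
]
toy_ytest = pd.Series([0, 1, 1], index=[10, 11, 12])
toy_predictions = pd.Series(
    [predict_label(p) for p in toy_posteriors], index=toy_ytest.index
)
toy_accuracy = accuracy(toy_predictions, toy_ytest)
print("Test set accuracy:", toy_accuracy)
```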


## Exercise 7
For this exercise, you must improve/extend your methods above as necessary to compute the accuracy of predicting the activity from the dataset we've generated in class. Once we have filled out this dataset, I will provide a csv file as well as any preprocessing similar to the Titanic. You may have to modify your functions above to work with both datasets or you may not (depending of course on how you wrote them).

In [24]:
# YOUR SOLUTION HERE
print("Test set accuracy:",sum(predictions==ytest)/len(ytest))
pd.DataFrame({'prediction':predictions,'activity':ytest})

Test set accuracy: 0.5652173913043478


Unnamed: 0,prediction,activity
36,Study,Party
9,Study,Study
4,TV,Study
34,TV,TV
33,Party,Party
35,Study,TV
37,Study,Study
15,Study,Study
38,Party,Bar
21,Party,Bar


## Exercise 8
For this exercise, I would like you to implement the feature importance algorithm described in [https://christophm.github.io/interpretable-ml-book/feature-importance.html](https://christophm.github.io/interpretable-ml-book/feature-importance.html). After you implement this, what is the most important feature for our in-class activity prediction dataset? Does this feature make sense to you?
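
The general idea of permutation feature importance is to shuffle one feature column at a time and measure how much the model's error grows. The sketch below is a simplified illustration on invented toy data, not the reference solution: `predict` is a hypothetical callable that maps a dataframe to predicted labels, and importance is taken as the difference in error rate (a ratio of errors is another common choice).

```python
import numpy as np
import pandas as pd

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    # Error rate of the model on the unmodified data.
    base_error = 1.0 - float((predict(X) == y).mean())
    importances = {}
    for col in X.columns:
        errors = []
        for _ in range(n_repeats):
            # Shuffle one column, breaking its link to the class labels.
            X_perm = X.copy()
            X_perm[col] = rng.permutation(X[col].to_numpy())
            errors.append(1.0 - float((predict(X_perm) == y).mean()))
        # Importance = average increase in error after shuffling this column.
        importances[col] = float(np.mean(errors)) - base_error
    return importances

# Toy setup (invented for illustration): the label equals column "a",
# and the stand-in model looks only at "a", so "b" should not matter.
toy_X = pd.DataFrame({
    "a": [0, 1, 0, 1, 0, 1, 0, 1],
    "b": [1, 1, 0, 0, 1, 1, 0, 0],
})
toy_y = toy_X["a"].copy()
toy_importances = permutation_importance(lambda df: df["a"], toy_X, toy_y)
print(toy_importances)
```

Averaging over several shuffles reduces the noise from any single permutation, which matters on a dataset as small as the in-class one.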

In [25]:
# YOUR SOLUTION HERE