# DMA Fall '22


In [None]:
NAME = "Kylie Ren"
COLLABORATORS = ""

---

# Lab 3: Decision Trees

**Please read the following instructions very carefully**

## Working on the assignment / FAQs
- **Always use the seed/random_state as *42* wherever applicable** (This is to ensure repeatability in answers, across students and coding environments)
- Questions can be either autograded or manually graded.
- The type of each question and the points it carries are indicated in its question cell
- An autograded question has 3 cells
     - **Question cell** : Read only cell containing the question
     - **Code Cell** : This is where you write the code
     - **Grading cell** : This is where the grading occurs, and **you are required not to edit this cell**
- Manually graded questions only have the question and code cells. **All manually graded questions are explicitly stated**
- To avoid any ambiguity, each question also specifies what *value* must be set. Note that these are dummy values and not the answers
- If an autograded question has multiple answers (due to differences in handling NaNs, zeros etc.), all answers will be considered.
- Most assignments have bonus questions for extra credit, do try them out!
- You can delete the `raise NotImplementedError()` for all questions.
- **Submitting the assignment** : Download the '.ipynb' and '.pdf' files from Colab and upload them to bCourses. Do not delete any outputs from cells before submitting.
- That's about it. Happy coding!


## About the dataset
This assignment uses a dataset from the JSE Data Archive, uploaded in 2013, that contains biological and self-reported activity traits for a sample of college students at a single university. The associated study explored whether eye color corresponds to other traits. You will use gender as the target/label in this lab.

FEATURE DESCRIPTIONS:
- Color (Blue, Brown, Green, Hazel, Other)
- Age (in years)
- YearinSchool (First, Second, Third, Fourth, Other)
- Height (in inches)
- Miles (distance from home town of student to Ames, IA)
- Brothers (number of brothers)
- Sisters (number of sisters)
- CompTime (number of hours spent on computer per week)
- Exercise (whether the student exercises Yes or No)
- ExerTime (number of hours spent exercising per week)
- MusicCDs (number of music CDs student owns)
- PlayGames (number of hours spent playing games per week)
- WatchTV (number of hours spent watching TV per week)

Background Information on the dataset: http://jse.amstat.org/v21n2/froelich/eyecolorgender.txt

In [None]:
from collections import Counter, defaultdict
from itertools import combinations
import pandas as pd
import numpy as np
import operator
import math
import itertools
from sklearn.feature_extraction import DictVectorizer
from sklearn import preprocessing, tree
import matplotlib.pyplot as plt


!wget -nc http://askoski.berkeley.edu/~zp/eye_color.csv
!ls
df = pd.read_csv('eye_color.csv')
# remove NA's and reset the index
# Note: recent pandas versions raise a TypeError if both `how` and `thresh` are passed
df = df.dropna(axis=0, how='any')
df = df.reset_index(drop=True)

df.head()

--2022-09-25 06:35:37--  http://askoski.berkeley.edu/~zp/eye_color.csv
Resolving askoski.berkeley.edu (askoski.berkeley.edu)... 169.229.192.179
Connecting to askoski.berkeley.edu (askoski.berkeley.edu)|169.229.192.179|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101507 (99K) [text/csv]
Saving to: ‘eye_color.csv’


2022-09-25 06:35:38 (1.64 MB/s) - ‘eye_color.csv’ saved [101507/101507]

eye_color.csv  sample_data


Unnamed: 0,gender,age,year,eyecolor,height,miles,brothers,sisters,computertime,exercise,exercisehours,musiccds,playgames,watchtv
0,female,18,first,hazel,68.0,195.0,0,1,20.0,Yes,3.0,75.0,6.0,18.0
1,male,20,third,brown,70.0,120.0,3,0,24.0,No,0.0,50.0,0.0,3.0
2,female,18,first,green,67.0,200.0,0,1,35.0,Yes,3.0,53.0,8.0,1.0
3,male,23,fourth,hazel,74.0,140.0,1,1,5.0,Yes,25.0,50.0,0.0,7.0
4,female,19,second,blue,62.0,60.0,0,1,5.0,Yes,4.0,30.0,2.0,5.0


---
**Question 1 (0.5 points, autograded)**: How many males and females exist in the dataset?

In [None]:
df['gender'].value_counts()
#raise NotImplementedError()

female    1078
male       910
Name: gender, dtype: int64

In [None]:
# The value set in the variables must be integers
num_males = 910 # Replace 0 with the actual value
num_females = 1078 # Replace 0 with the actual value

#raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(num_males, num_females)

910 1078


---
**Question 2 (0.5 points, autograded)**: What is the Gini Index of this dataset, using males and females as the target classes?
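
For reference, the Gini index of a set of rows is one minus the sum of the squared class proportions. The worked arithmetic below plugs in the class counts reported in Question 1 (910 males and 1078 females, 1988 rows in total); the symbols $p_k$ and $K$ are notation introduced here for illustration.

```latex
\mathrm{Gini} = 1 - \sum_{k=1}^{K} p_k^{2}
             = 1 - \left(\frac{910}{1988}\right)^{2} - \left(\frac{1078}{1988}\right)^{2}
             \approx 0.4964
```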

In [None]:
gin = 1 - ((num_males/(num_females + num_males))**2 + (num_females/(num_females + num_males))**2)
#raise NotImplementedError()

In [None]:
# The value set in the variable must be float
gini_index = gin # Replace 0 with the actual value / formula

#raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(gini_index)

0.4964292799047807


---
## Best Split of a numeric feature
**Question 3 (1.5 points, autograded)**: What is the best split point of the 'height' feature? (Still using males and females as the target classes, assuming a binary split)

Recall that, to calculate the best split of this numeric field, you'll need to order your data by 'height', then consider the midpoint between each pair of consecutive heights as a potential split point, then calculate the Gini Index for that partitioning. You'll want to keep track of the best split point and its Gini Index (remember that you are trying to minimize the Gini Index).
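
The procedure described above can be sketched in plain Python on a small made-up sample. This is only an illustration under stated assumptions: the helper names `gini` and `best_numeric_split` and the toy heights and labels are hypothetical and are not part of the lab's dataset or starter code.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_numeric_split(values, labels):
    """Return (threshold, weighted_gini) that minimises the weighted Gini index."""
    distinct = sorted(set(values))
    best_threshold, best_score = None, float("inf")
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        # Weight each side's impurity by its share of the rows.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy data, made up for illustration (not taken from eye_color.csv)
toy_heights = [60, 62, 64, 66, 68, 70, 72, 74]
toy_genders = ["female", "female", "male", "female", "male", "male", "male", "male"]

threshold, score = best_numeric_split(toy_heights, toy_genders)
print(threshold, score)  # 67.0 0.1875
```

On the toy sample, the threshold 67.0 leaves four rows on each side: three females and one male below, four males above. The weighted Gini index is therefore (4/8)(0.375) + (4/8)(0) = 0.1875.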

In [None]:
# Candidate thresholds: midpoints between consecutive distinct heights
split_point = df.sort_values('height')['height'].unique()
midpoints = [(split_point[x] + split_point[x+1]) / 2 for x in range(len(split_point)-1)]

# Weighted Gini index of the two partitions produced by each candidate threshold
index = []
for i in midpoints:
  df_more = df.loc[df['height'] > i]
  df_less = df.loc[df['height'] <= i]
  df_less_m = len(df_less.loc[df_less['gender'] == 'male'])
  df_less_f = len(df_less.loc[df_less['gender'] == 'female'])
  df_more_m = len(df_more.loc[df_more['gender'] == 'male'])
  df_more_f = len(df_more.loc[df_more['gender'] == 'female'])

  # Each partition's Gini index is weighted by its share of the rows
  gini_less = (1 - np.sum([(df_less_m / len(df_less)) ** 2,
                           (df_less_f / len(df_less)) ** 2])) * (len(df_less) / len(df))
  gini_more = (1 - np.sum([(df_more_m / len(df_more)) ** 2,
                           (df_more_f / len(df_more)) ** 2])) * (len(df_more) / len(df))
  final_gini = gini_more + gini_less
  index.append(final_gini)

# The best split point is the midpoint with the smallest weighted Gini index
best_split_point = midpoints[np.argmin(index)]

In [None]:
# This is an autograded cell, do not edit
print(best_split_point)

68.5


---
**Question 4 (0.5 points, autograded)**: What is the Gini index of the best split point of the 'height' feature? (Still using males and females as the target classes, assuming a binary split)


In [None]:

#raise NotImplementedError()

In [None]:
# The value set in the variable must be float
gini_of_best_split_point = min(index)
#raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(gini_of_best_split_point)

0.2655288120702919


---
**Question 5 (0.5 points, autograded)**: How much does this partitioning reduce the Gini Index over the Gini index of the overall dataset?

In [None]:
# The value set in the variable must be float
gini_difference = gini_index - gini_of_best_split_point

# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(gini_difference)

0.2309004678344888


---
**Question 6 (0.5 points, autograded)**: How many 'female' and 'male' rows are shorter than the best height split point?

In [None]:
# The value set in the variable must be integer
# Count rows with a combined boolean mask; int() converts the NumPy integer to a Python int
female_rows_below = int(((df['height'] < best_split_point) & (df['gender'] == 'female')).sum())
male_rows_below = int(((df['height'] < best_split_point) & (df['gender'] == 'male')).sum())

# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(female_rows_below, male_rows_below)

905 142


---
**Question 7 (0.5 points, autograded)**: How many 'female' and 'male' rows are taller than the best height split point?

In [None]:
# The value set in the variable must be integer
# Count rows with a combined boolean mask; int() converts the NumPy integer to a Python int
female_rows_above = int(((df['height'] > best_split_point) & (df['gender'] == 'female')).sum())
male_rows_above = int(((df['height'] > best_split_point) & (df['gender'] == 'male')).sum())

In [None]:
# This is an autograded cell, do not edit
print(female_rows_above, male_rows_above)

173 768


---
## Best Split of a Categorical Variable

**Question 8 (0.5 points, autograded)**: How many possible splits are there of the eyecolor feature? (Assuming binary split)

Python tip: the combinations function of the itertools module allows you to enumerate combinations of a list. You might want to Google 'power set'.
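
As a rough sketch of the enumeration hinted at above, the snippet below lists every two-way partition of the five eye colours using `itertools.combinations`. The helper name `binary_splits` is hypothetical. Note that the count depends on convention: treating {A | B} and {B | A} as the same split gives 2^(5-1) - 1 = 15, while counting every non-empty proper subset gives 2^5 - 2 = 30. Which convention a grader expects is not stated here.

```python
from itertools import combinations

def binary_splits(categories):
    """Yield each unordered two-way partition of `categories` exactly once."""
    items = sorted(categories)
    anchor, rest = items[0], items[1:]
    # Fixing `anchor` in the left group avoids listing {A | B} and {B | A} twice.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = (anchor,) + extra
            right = tuple(c for c in items if c not in left)
            if right:  # skip the trivial case where one side is empty
                yield left, right

colours = ["blue", "brown", "green", "hazel", "other"]
unordered = list(binary_splits(colours))

print(len(unordered))      # 15 unordered splits
print(2 * len(unordered))  # 30 non-empty proper subsets
```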


In [None]:
from itertools import chain, combinations
#Attribute/source: https://stackoverflow.com/questions/1482308/how-to-get-all-subsets-of-a-set-powerset
def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

In [None]:
# The value set in the variable must be integer
splits = list(powerset(['blue', 'brown', 'green', 'hazel', 'other']))
# Exclude the empty set and the full set, which do not split the data
num_of_splits = len(splits) - 2


In [None]:
# This is an autograded cell, do not edit
print(num_of_splits)

30


---
**Question 9 (1 point, autograded)**: Which split of eyecolor best splits the female and male rows, as measured by the Gini Index?

In [None]:
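# Skip the empty set and the full set: they leave one side empty and do not split the data
candidate_splits = [s for s in splits if 0 < len(s) < 5]

index_ecolor = []
for i in candidate_splits:
  in_group = df['eyecolor'].isin(i)
  weighted_gini = 0
  # Add up the Gini index of each side, weighted by its share of the rows
  for part in (df.loc[in_group], df.loc[~in_group]):
    if len(part) == 0:
      continue
    p_male = (part['gender'] == 'male').mean()
    weighted_gini += (len(part) / len(df)) * (1 - (p_male ** 2 + (1 - p_male) ** 2))
  index_ecolor.append(weighted_gini)

# Colours in the first group of the split with the smallest weighted Gini index
best_colour_split = candidate_splits[int(np.argmin(index_ecolor))]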

In [None]:
# The value set in the variable must be an array
colour_group_1 = ['green'] # Replace [] with the actual colours/values in the group
colour_group_2 = ['hazel', 'brown', 'blue', 'other'] # Replace [] with the actual colours/values in the group

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(colour_group_1, colour_group_2)

---
**Question 10 (0.5 points, autograded)**: What is the Gini Index of this best split?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# The value set in the variable must be float
gini_of_best_split_group = 0 # Replace 0 with the actual value

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(gini_of_best_split_group)

---
**Question 11 (0.5 points, autograded)**: How much does this partitioning decrease the Gini Index over the Gini index of the overall dataset?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# The value set in the variable must be float
# The grading cell prints `gini_difference2`, so the variable uses that exact name
gini_difference2 = 0 # Replace 0 with the actual value

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(gini_difference2)

---
**Question 12 (1 point, autograded)**: How many 'female' rows and 'male' rows are in your first partition? How many 'female' rows and 'male' rows are in your second partition?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# The value set in the variable must be integer, order doesn't matter
partition1_male = 0 # Replace 0 with the actual value
partition1_female = 0 # Replace 0 with the actual value
partition2_male = 0 # Replace 0 with the actual value
partition2_female = 0 # Replace 0 with the actual value

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(partition1_male, partition1_female, partition2_male, partition2_female)

---
## Training a decision tree
**Question 13 (1 point, autograded)**: Using all of the features in the original dataframe read in at the top of this notebook, train a decision tree classifier that has a depth of three (not including the root node). What is the accuracy of this classifier on the training data?

Scikit-learn classifiers require class labels and features to be in numeric arrays. As such, you will need to turn your categorical features into numeric arrays using DictVectorizer. This is a helpful notebook for understanding how to do this: http://nbviewer.ipython.org/gist/sarguido/7423289. You can turn a pandas dataframe of features into a dictionary of the form needed by DictVectorizer by using df.to_dict('records'). Make sure you remove the class label first (in this case, gender). If you use the class label as a feature, your classifier will have a training accuracy of 100%! The example notebook link also shows how to turn your class labels into a numeric array using sklearn.preprocessing.LabelEncoder().

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# The value set in the variable must be float
accuracy = 0 #Replace 0 with the actual value

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# This is an autograded cell, do not edit
print(accuracy)

---
**Question 14 (1 point, manually graded)**: Using the following code snippet, visualize your decision tree. In your write-up, **write down the interpretation of the rule at each node** which is used to perform the splitting.

We provide **two options** to visualize decision trees. The first option uses `tree.plot_tree` and the other uses an external tool called `GraphViz`. You can **use either of the two options**.  `tree.plot_tree` is the **recommended and easier option** as it is a built-in function in `sklearn` and doesn't require any additional setup.

Uncomment the code, **fill in the clf (classifier) and `feature_names` arguments**. Executing the code will display the tree visualization in the output cell.

Note for users who want to install graphviz on their local machines (**you don't need to install graphviz if you're running the notebook on Colab**, which is the class's recommended way of doing assignments):



> To install graphviz, you may need to download the tool from [this website](https://graphviz.gitlab.io) and then pip3/conda install any Python libraries you do not have. Mac users can use ```brew install graphviz``` instead of following the link, and Linux users can do the same with their favourite package manager (for example, Ubuntu users can use ```sudo apt-get install graphviz```), followed by the necessary pip3/conda installations.




In [None]:
# Option 1 (Recommended Option) - Using `tree.plot_tree`

# clf = your classifier
# fig, ax = plt.subplots(figsize=(14, 14))
# tree.plot_tree(clf, fontsize=10, feature_names=<Names of columns>);

In [None]:
# Option 2 - Using GraphViz. Visualization is prettier, but additional setup may be required if running on your local machine (although no setup required on Colab)

from IPython.display import Image
import pydotplus
import pydot
from io import StringIO  # standard library; avoids depending on the `six` package

# clf = your classifier
# dotfile = StringIO()
# tree.export_graphviz(clf, out_file=dotfile,
#                       feature_names=<Names of columns>,
#                           class_names=['Female', 'Male'],
#                           filled=True, rounded=True,
#                           special_characters=True)
# graph = pydotplus.graph_from_dot_data(dotfile.getvalue())
# Image(graph.create_png())


# Ignore the cell below, but do not delete it. It is used to grade the image output of this cell.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

---
## Bonus Question (2 points, autograded)
For each of your leaf nodes, specify the percentage of 'female' rows in that node (out of the total number of rows at that node).
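
One possible way to read these percentages off a fitted scikit-learn tree is sketched below on made-up data. It assumes the classes were encoded as 0 for 'female' and 1 for 'male'. The toy arrays are hypothetical. Each row of `tree_.value` is normalised explicitly, so the result is the same whether a given scikit-learn version stores counts or proportions there.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data, made up for illustration: one feature (height), 0 = female, 1 = male
X = np.array([[60.0], [62.0], [64.0], [66.0], [68.0], [70.0], [72.0], [74.0]])
y = np.array([0, 0, 1, 0, 1, 1, 1, 1])

clf = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X, y)

# Leaf nodes are the nodes without children (children_left == -1).
leaf_ids = np.where(clf.tree_.children_left == -1)[0]

# Per-leaf class distribution, one column per class in clf.classes_ order
class_weights = clf.tree_.value[leaf_ids, 0, :]
female_column = list(clf.classes_).index(0)

# Normalise each row so it sums to 1, then express as a percentage.
female_percent = 100 * class_weights[:, female_column] / class_weights.sum(axis=1)
print(female_percent)
```

With this toy data the single split falls at a height of 67, so the left leaf holds three females and one male (75% female) and the right leaf holds only males (0% female).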


In [None]:
# The value set in the variable must be an array
ratios = 0 # Replace 0 with the actual value

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
# This is an autograded cell, do not edit
print(ratios)

*ⓒ Prof. Zachary Pardos, 2022*