# Decision Tree Lab

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import matplotlib.pyplot as plt

## 1. Implement the ID3 decision tree algorithm  
- Use standard information gain as your basic attribute evaluation metric.  Note that ID3 would usually augment information gain with a mechanism to penalize statistically insignificant attribute splits to avoid overfitting (e.g. early stopping, gain ratio, etc.)
- Include the ability to handle unknown attributes by making "unknown" a new attribute value for the attribute.
- You do not need to handle real valued attributes.
- You are welcome to create other classes and/or functions in addition to the ones provided below. (e.g. If you build out a tree structure, you might create a node class).
- It is a good idea to use simple data sets (like the lenses data and the pizza homework), which you can check by hand, to test each detailed step of your algorithm to make sure it works correctly. 
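
As a starting point for the information gain metric described above, here is a minimal sketch of entropy and information gain over nominal values. The function names `entropy` and `information_gain` are illustrative, not part of the provided skeleton. The check at the end assumes the lenses training file matches the classic 24-instance lenses data (4 hard, 5 soft, 15 none, with all 12 reduced-tear instances labeled none); if your file differs, the number will too.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    remainder = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # mask.mean() is the fraction of instances that take this value
        remainder += mask.mean() * entropy(labels[mask])
    return entropy(labels) - remainder

# Assumed lenses class counts (see the note above): 0 = reduced, 1 = normal
tear_prod_rate = np.array([0] * 12 + [1] * 12)
contact_lenses = np.array(["none"] * 12 + ["hard"] * 4 + ["soft"] * 5 + ["none"] * 3)
print(information_gain(tear_prod_rate, contact_lenses))  # about 0.5488 under these counts
```

Under those assumed counts the result agrees with the first expected gain listed in the debug section, which makes this a convenient hand check before building the full tree.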

In [None]:
class DTClassifier(BaseEstimator,ClassifierMixin):

    def __init__(self,counts=None):
        """ Initialize class with chosen hyperparameters.
        Args:
        Optional Args (Args we think will make your life easier):
            counts: A list of Ints that tell you how many types of each feature there are
        Example:
            DT  = DTClassifier()
            or
            DT = DTClassifier(counts = [2,3,2,2])
            Dataset = 
            [[0,1,0,0],
            [1,2,1,1],
            [0,1,1,0],
            [1,2,0,1],
            [0,0,1,1]]

        """
        

    def fit(self, X, y):
        """ Fit the data; Make the Decision tree

        Args:
            X (array-like): A 2D numpy array with the training data, excluding targets
            y (array-like): A 1D numpy array with the training targets

        Returns:
            self: this allows this to be chained, e.g. model.fit(X,y).predict(X_test)

        """

        return self

    def predict(self, X):
        """ Predict all classes for a dataset X

        Args:
            X (array-like): A 2D numpy array with the data to classify, excluding targets

        Returns:
            array, shape (n_samples,)
                Predicted target values per element in X.
        """
        pass


    def score(self, X, y):
        """ Return the classification accuracy of the model on a given dataset. You must implement your own score function.

        Args:
            X (array-like): A 2D numpy array with data, excluding targets
            y (array-like): A 1D numpy array of the targets 
        """
        return 0
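
The `score` docstring asks for classification accuracy. A minimal sketch of that computation as a standalone helper is shown below; the name `accuracy` is illustrative, and inside the class you would compare `self.predict(X)` against `y` in the same way.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the targets."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return float(np.mean(y_true == y_pred))

print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 3 of 4 correct
```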



### 1.1 (20%) Debug 

- Debug your model by training on the lenses dataset: [Debug Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/lenses.arff)
- Test your model on the lenses test set: [Debug Test Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/lenses_test.arff)
- Parameters: For this problem the number of unique feature values for each feature is: counts = [3,2,2,2] (You should compute this when you read in the data, before fitting)
---

Expected Results: Accuracy = [0.33]

Information gain at splits = [0.5487949406953987, 0.7704260414863775, 0.3166890883150208, 1.0, 0.4591479170272447, 0.9182958340544894]

Predictions should match this file: [Lenses Predictions](https://raw.githubusercontent.com/cs472ta/CS472/master/debug_solutions/pred_lenses.csv)

*NOTE: The [Lenses Prediction](https://raw.githubusercontent.com/cs472ta/CS472/master/debug_solutions/pred_lenses.csv) uses the following encoding: soft=2, hard=0, none=1. Use this same encoding.*
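
One way to get both the integer encoding and `counts` is sketched below, assuming the ARFF rows have already been read into a 2D array of strings. The helper name `encode_columns` is hypothetical. Because `np.unique` sorts values, label strings come out as hard=0, none=1, soft=2, which matches the encoding in the note above.

```python
import numpy as np

def encode_columns(raw):
    """Map each column of nominal strings to integer codes (sorted order).

    Returns the encoded integer array and the number of distinct values per column.
    """
    raw = np.asarray(raw, dtype=str)
    encoded = np.empty(raw.shape, dtype=int)
    counts = []
    for j in range(raw.shape[1]):
        values, codes = np.unique(raw[:, j], return_inverse=True)
        encoded[:, j] = codes
        counts.append(len(values))
    return encoded, counts

# A few made-up rows using lenses attribute values (age, spectacle_prescrip)
rows = [["young", "myope"],
        ["presbyopic", "hypermetrope"],
        ["pre_presbyopic", "myope"]]
X, counts = encode_columns(rows)
print(counts)

# Labels encode the same way: hard=0, none=1, soft=2
label_values, y = np.unique(["soft", "none", "hard", "none"], return_inverse=True)
print(label_values, y)
```

Note that this counts only the values that actually appear in the data you pass in, so compute it on the full file rather than on a single fold.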

<!-- You should be able to get about 68% (61%-82%) predictive accuracy on the lenses data -->

Here's what your decision tree splits and information gains should look like, with the corresponding child node predictions:

<pre>
tear_prod_rate = normal: 0.5487949406953987
    astigmatism = no: 0.7704260414863775
        age = pre_presbyopic: 0.3166890883150208
            prediction: soft
        age = presbyopic:
            spectacle_prescrip = hypermetrope: 1.0
                prediction: soft
            spectacle_prescrip = myope:
                prediction: none
        age = young:
            prediction: soft
    astigmatism = yes:
        spectacle_prescrip = hypermetrope: 0.4591479170272447
            age = pre_presbyopic: 0.9182958340544894
                prediction: none
            age = presbyopic:
                prediction: none
            age = young:
                prediction: hard
        spectacle_prescrip = myope:
            prediction: hard
tear_prod_rate = reduced:
    prediction: none
</pre>

In [None]:
# Load debug training data 
# Train Decision Tree
# Load debug test data
# Execute and print the model accuracy and the information gain of every split you make

Discussion

In [None]:
# Optional Debugging Dataset - Pizza Homework
# pizza_dataset = np.array([[1,2,0],[0,0,0],[0,1,1],[1,1,1],[1,0,0],[1,0,1],[0,2,1],[1,0,0],[0,2,0]])
# pizza_labels = np.array([2,0,1,2,1,2,1,1,0])

### 1.2 (20%) Evaluation 

- We will evaluate your model based on its performance on the zoo dataset. 
- Train your model using this dataset: [Evaluation Train Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/zoo.arff)
- Test your model on this dataset: [Evaluation Test Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/zoo_test.arff)
- Parameters: counts = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 2, 2, 2] (You should compute this when you read in the data, before fitting)
---
Your program should print out your accuracy on the evaluation test dataset and also the information gain of each split you make.

In [None]:
# Load evaluation training data
# Train Decision Tree
# Load evaluation test data
# Execute and print the model accuracy and the information gain of every split you make

Discussion

## 2. Learn Cars and Voting Data Sets and Predict accuracy with *n*-fold CV  
- Use your ID3 algorithm to induce decision trees for the cars dataset and the voting dataset.  Do not use a stopping criterion, but induce the tree as far as it can go (until classes are pure or there are no more data or attributes to split on).
- Implement and use 10-fold Cross Validation (CV) on each data set to predict how well the models will do on novel data.  
- For each dataset, create a table with the training, validation, and test classification accuracy for each of the 10 runs and the average accuracies for the training, validation, and test data. 
- As a rough sanity check, typical decision tree accuracies for these data sets are: Cars: .90-.95, Vote: .92-.95.
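
The fold bookkeeping for the cross validation described above could look roughly like this. It is a sketch under assumptions: `kfold_indices`, `cross_validate`, and the `MajorityClassifier` stand-in are all hypothetical names, the data is shuffled once with a fixed seed, and only training and held-out accuracy are recorded per fold. Swap in your own classifier for the stand-in.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, held_out_idx) pairs that partition a shuffled index range."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)   # fold sizes differ by at most one
    for i in range(n_folds):
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, folds[i]

def cross_validate(make_model, X, y, n_folds=10, seed=0):
    """Return an (n_folds, 2) array of [training accuracy, held-out accuracy]."""
    rows = []
    for train_idx, held_out_idx in kfold_indices(len(X), n_folds, seed):
        model = make_model().fit(X[train_idx], y[train_idx])
        rows.append((model.score(X[train_idx], y[train_idx]),
                     model.score(X[held_out_idx], y[held_out_idx])))
    return np.array(rows)

class MajorityClassifier:
    """Stand-in model so the sketch runs on its own; replace with your tree."""
    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label_ = values[np.argmax(counts)]
        return self

    def score(self, X, y):
        return float(np.mean(np.asarray(y) == self.label_))

# Made-up data: 53 instances, 2 features, imbalanced labels
X_demo = np.zeros((53, 2), dtype=int)
y_demo = np.array([0] * 40 + [1] * 13)
table = cross_validate(MajorityClassifier, X_demo, y_demo)
print(table)               # one row per fold
print(table.mean(axis=0))  # average training and held-out accuracy
```

Passing a factory such as the class itself (rather than a fitted instance) keeps each fold's model independent of the others.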

### 2.1 (15%) Implement 10-fold Cross Validation and report results for the Cars Dataset
- Use this [Cars Dataset](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/cars.arff)
- Create a table for your *n*-fold cross validation accuracies

*If you are having trouble using scipy's loadarff function (scipy.io.arff.loadarff), try:*

*pip install arff &nbsp;&nbsp;&nbsp;&nbsp;          # Install arff library*

*import arff as arf*                   

*cars = list(arf.load('cars.arff'))   &nbsp;&nbsp;&nbsp;&nbsp;# Load your downloaded dataset (!curl, etc.)*

*df = pd.DataFrame(cars)*  

*There may be additional cleaning needed*

In [None]:
# Write a function that implements 10-fold cross validation
# Use 10-fold CV on Cars Dataset

Discussion

### 2.3 (15%) Voting Dataset 
- Use this [Voting Dataset with missing values](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/voting_with_missing.arff)
- Create a table for your *n*-fold cross validation accuracies
- This data set contains "don't know" (missing) values.  Discuss how your algorithm handles them.
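
One simple way to make "unknown" its own attribute value is sketched here. It assumes missing votes appear as the string `?` after loading and that votes are `y` or `n`; check how your loader actually represents them (for example, as bytes). The mapping `VOTE_CODES` and the helper `encode_votes` are hypothetical names.

```python
import numpy as np

# Assumed raw symbols; "?" gets its own code instead of being dropped or imputed
VOTE_CODES = {"n": 0, "y": 1, "?": 2}

def encode_votes(raw):
    """Encode a 2D array of vote strings, keeping unknown as a third category."""
    raw = np.asarray(raw, dtype=str)
    encoded = np.vectorize(VOTE_CODES.get, otypes=[int])(raw)
    counts = [len(VOTE_CODES)] * raw.shape[1]   # every feature can take 3 values
    return encoded, counts

votes = [["y", "n", "?"],
         ["n", "?", "y"],
         ["y", "y", "n"]]
X_votes, vote_counts = encode_votes(votes)
print(X_votes)
print(vote_counts)
```

With this encoding the tree can split on "unknown" like any other branch, so no instance has to be discarded.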

In [None]:
# Use 10-fold CV on Voting Dataset
# Make sure it handles don't know inputs

Discussion

### 2.4 (5%) Decision Tree Intuition
- For each of the two problems above, summarize in English what the decision tree has learned (i.e. look at the induced tree and describe what "rules" it has discovered to try to solve each task). 
- If the tree is very large you can just discuss a few of the more shallow attribute combinations and the most important decisions made high in the tree.

Discuss what the Trees have learned on the 2 data sets

## 3. Using SciKit Learn's decision tree  

### 3.1 (10%) SK Learn on Voting Dataset
- Use SciKit Learn's decision tree (CART) on the voting dataset and compare the results with your ID3 version. Use this [Voting Dataset with missing values](https://raw.githubusercontent.com/cs472ta/CS472/master/datasets/voting_with_missing.arff).
- Try different parameters and report what parameters perform the best on the test set.
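
A parameter sweep along these lines might be a reasonable starting point. This is only a sketch: the data below is a synthetic stand-in (random integer codes with a made-up target), not the voting data, and the grid of `criterion`, `max_depth`, and `min_samples_leaf` values is an arbitrary choice. Keep in mind that scikit-learn's tree treats features as numeric, so integer-coded nominal values are split by thresholds; one-hot encoding is an alternative worth comparing.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 200 rows, 16 features coded 0/1/2, made-up target
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 16))
y = (X[:, 3] == 1).astype(int)

results = {}
for criterion in ("gini", "entropy"):
    for max_depth in (None, 3, 5):
        for min_samples_leaf in (1, 5):
            clf = DecisionTreeClassifier(criterion=criterion,
                                         max_depth=max_depth,
                                         min_samples_leaf=min_samples_leaf,
                                         random_state=0)
            # Mean accuracy over 10 cross-validation folds
            scores = cross_val_score(clf, X, y, cv=10)
            results[(criterion, max_depth, min_samples_leaf)] = scores.mean()

best = max(results, key=results.get)
print(best, results[best])
```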

In [2]:
# Use SK Learn's Decision Tree to learn the voting dataset
# Explore different parameters

Discuss scikit CART results & also compare to your ID3 results

### 3.2 (10%) Choose a data set of your choice (not already used in this or previous labs) and use the SK decision tree to learn it. Experiment with different hyper-parameters to try to get the best results possible.

In [None]:
# Use SciKit Learn's Decision Tree on a new dataset
# Experiment with different hyper-parameters

Discussion

### 3.3 (5%) Print sklearn's decision tree for your chosen data set (using export_graphviz or another tool) and discuss what you find. If your tree is too deep to reasonably fit on one page, show only the first several levels (e.g. top 5).
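
For reference, here is a short sketch of two ways to render a fitted tree with limited depth. The wine data from `sklearn.datasets` is used purely as a stand-in so the example runs; substitute the data set you chose. `export_text` gives a plain-text view, and `export_graphviz` with `out_file=None` returns DOT source that a Graphviz tool can render.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

data = load_wine()  # stand-in data set for illustration only
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(data.data, data.target)

# Plain-text rendering, truncated to the top levels
print(export_text(clf, feature_names=list(data.feature_names), max_depth=2))

# DOT source for Graphviz, limited to the first 5 levels
dot = export_graphviz(clf,
                      out_file=None,
                      max_depth=5,
                      feature_names=list(data.feature_names),
                      class_names=list(data.target_names),
                      filled=True)
print(dot[:200])
```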

In [None]:
# Include decision tree visualization here

Discussion

## 4. (Optional 5% extra credit) Implement reduced error pruning to help avoid overfitting
- You will need to take a validation set out of your training data to do this, while still having a test set to test your final accuracy. 
- Create a table comparing your decision tree implementation's results on the cars and voting data sets with and without reduced error pruning. 
- This table should compare:
    - a) The # of nodes (including leaf nodes) and tree depth of the final decision trees 
    - b) The generalization (test set) accuracy. (For the unpruned 10-fold CV models, just use their average values in the table).
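
The sketch below shows one simple variant of reduced error pruning, together with the node count and depth needed for the table. It is written against a hypothetical `Node` class, not the skeleton above, and makes these assumptions: every node stores the majority training class in `prediction`, pruning proceeds bottom-up, and a subtree is collapsed to a leaf whenever validation accuracy does not drop.

```python
import numpy as np

class Node:
    """Hypothetical tree node; it is a leaf when `children` is empty."""
    def __init__(self, prediction, attribute=None, children=None):
        self.prediction = prediction    # majority training class at this node
        self.attribute = attribute      # column index tested here (None for a leaf)
        self.children = children or {}  # attribute value -> child Node

def predict_one(node, x):
    while node.children:
        child = node.children.get(x[node.attribute])
        if child is None:               # unseen value: use this node's majority class
            break
        node = child
    return node.prediction

def accuracy(root, X, y):
    preds = np.array([predict_one(root, x) for x in X])
    return float(np.mean(preds == np.asarray(y)))

def count_nodes(node):
    """Total number of nodes, including leaves."""
    return 1 + sum(count_nodes(c) for c in node.children.values())

def depth(node):
    """Number of levels in the tree (a single leaf has depth 1)."""
    return 1 + max((depth(c) for c in node.children.values()), default=0)

def prune(node, root, X_val, y_val):
    """Bottom-up: turn a node into a leaf if validation accuracy does not decrease."""
    for child in node.children.values():
        prune(child, root, X_val, y_val)
    if not node.children:
        return
    before = accuracy(root, X_val, y_val)
    saved, node.children = node.children, {}
    if accuracy(root, X_val, y_val) < before:
        node.children = saved           # pruning hurt, so restore the subtree

# Hand-built toy tree whose lower split does not help on the validation data
tree = Node(prediction=1, attribute=0, children={
    0: Node(prediction=0),
    1: Node(prediction=1, attribute=1, children={0: Node(0), 1: Node(1)}),
})
X_val = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_val = np.array([0, 0, 1, 1])

print(count_nodes(tree), depth(tree), accuracy(tree, X_val, y_val))
prune(tree, tree, X_val, y_val)
print(count_nodes(tree), depth(tree), accuracy(tree, X_val, y_val))
```

Re-evaluating the whole validation set for every candidate node is easy to reason about but slow on large trees; counting validation errors per node is a common optimization.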

In [None]:
# Reduced Error Pruning Code

Discussion