# [CPSC 310](https://github.com/GonzagaCPSC310) Data Mining
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)
# BONUS PA7 Ensemble Learning (30 BONUS pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Implement a random forest ensemble learner
* Build a random forest using bagging
* Introduce random attribute subsets for building a decision tree
* Tune ensemble learning parameters


## Prerequisites
Before starting this programming assignment, participants should be able to:
* Represent a tree in Python as a nested list
* Implement a decision tree classifier using the TDIDT algorithm and entropy-based attribute selection
* Evaluate classifiers using train/test sets

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining HW5

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this PA7 link to accept the assignment and create your private repository in GitHub Classroom: https://classroom.github.com/a/rYB5L1R7

Your repo, for example, will be named GonzagaCPSC310/pa7-yourusername (where yourusername is your Github username). I highly recommend committing/pushing regularly so your work is always backed up. We will grade your most recent commit, even if that commit is after the due date (your work will be marked late if this is the case).

## Overview and Requirements
Write a program (`pa7.py`) that builds a random forest classifier for the "pre-processed" automobile dataset (auto-data.txt) you created for PA1, the titanic dataset, and the Wisconsin dataset. For the automobile dataset, pick one method of dealing with missing values for this assignment (e.g., eliminate rows with missing values, use means or medians, etc.). Download the titanic.txt dataset and the wisconsin.dat dataset from https://github.com/GonzagaCPSC310/PAs/tree/master/files

For this assignment you will need to perform the following steps and turn in your source code and a log of any assumptions and/or issues you had in doing the steps. Your log needs to be written separately from your .py file and may be written in a .txt or a .md (markdown) file.

Note: as you write solutions for the following steps, I highly encourage you to design functions that are generic and re-usable for future programming assignments and data mining tasks.

Note: we are learning data mining from scratch! The only libraries you should need to use for this assignment include numpy (sparingly), csv (if you'd like to use it for file I/O), and tabulate (if you'd like to use it for pretty printing). This means no pandas...

## Step 1
Implement a basic random forest algorithm as follows:
1. Generate a random stratified test set consisting of one third of the original data set, with the remaining two thirds of the instances forming the "remainder set".
1. Generate $N$ "random" decision trees using bootstrapping (giving a training and validation set) over the remainder set. At each node, build your decision trees by randomly selecting $F$ of the remaining attributes as candidates to partition on. This is the standard random forest approach discussed in class. Note that to build your decision trees you should still use entropy; however, you are selecting from only a (randomly chosen) subset of the available attributes.
1. Select the $M$ most accurate of the $N$ decision trees using the corresponding validation sets.
1. Use simple majority voting to predict classes using the $M$ decision trees over the test set.
1. Give the accuracy of the random forest approach.
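
The random pieces of the steps above can be sketched with the standard `random` module. This is only a sketch under assumptions: the function names, the `rng` parameter, and the choice to round each class's share of the test set are illustrative and not part of the assignment. Tree building (TDIDT with entropy) is left out, since you already have it from an earlier assignment.

```python
import random
from collections import Counter


def stratified_split(table, class_index, test_fraction=1 / 3, rng=random):
    """Split rows into (remainder, test), keeping class proportions roughly equal."""
    by_class = {}
    for row in table:
        by_class.setdefault(row[class_index], []).append(row)
    remainder, test = [], []
    for rows in by_class.values():
        rows = rows[:]  # copy so the caller's table is not reordered
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)  # assumption: round per class
        test.extend(rows[:n_test])
        remainder.extend(rows[n_test:])
    return remainder, test


def bootstrap_sample(rows, rng=random):
    """Sample len(rows) rows with replacement; rows never drawn form the validation set."""
    n = len(rows)
    chosen = [rng.randrange(n) for _ in range(n)]
    chosen_set = set(chosen)
    training = [rows[i] for i in chosen]
    validation = [rows[i] for i in range(n) if i not in chosen_set]
    return training, validation


def random_attribute_subset(available, f, rng=random):
    """Pick at most f candidate attributes from those still available at a node."""
    available = list(available)
    if len(available) <= f:
        return available
    return rng.sample(available, f)


def majority_vote(predictions):
    """Return the most common predicted label (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]
```

Inside your tree-building code, you would call something like `random_attribute_subset` at each node and then pick the lowest-entropy attribute from only that subset.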

Run your random forest algorithm over the "interview" dataset using $N$ = 20, $M$ = 7, and $F$ = 2 and print the predictive accuracy of the random forest. Only after you are confident that your random forest classifier is working correctly, move on to the next step. For convenience, I've provided the column names and dataset as Python lists below: 

```python
col_names = ["level", "lang", "tweets", "phd", "interviewed_well"]
table = [
        ["Senior", "Java", "no", "no", "False"],
        ["Senior", "Java", "no", "yes", "False"],
        ["Mid", "Python", "no", "no", "True"],
        ["Junior", "Python", "no", "no", "True"],
        ["Junior", "R", "yes", "no", "True"],
        ["Junior", "R", "yes", "yes", "False"],
        ["Mid", "R", "yes", "yes", "True"],
        ["Senior", "Python", "no", "no", "False"],
        ["Senior", "R", "yes", "no", "True"],
        ["Junior", "Python", "yes", "no", "True"],
        ["Senior", "Python", "yes", "yes", "True"],
        ["Mid", "Python", "no", "yes", "True"],
        ["Mid", "Java", "yes", "no", "True"],
        ["Junior", "Python", "no", "yes", "False"]
    ]
```

Note: because we randomly select the remainder set, you will likely get a different predictive accuracy each time you run your program.
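
While debugging, it may help to make the randomness repeatable so that two runs can be compared directly. One possible approach, shown here as a sketch, is to route every random choice through a single seeded generator. The helper name `make_rng` and the seed value are hypothetical; nothing in the assignment requires them.

```python
import random


def make_rng(seed=None):
    """Return a private random generator; pass an int seed to repeat a run exactly."""
    return random.Random(seed)


rng = make_rng(310)        # hypothetical seed, used only while debugging
indexes = list(range(14))  # e.g., row indexes of the interview table
rng.shuffle(indexes)       # same order every time the seed is 310
```

Calling `make_rng()` with no seed restores the usual run-to-run variation.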

## Step 2
Run your random forest algorithm over both the titanic and auto data sets using $N$ = 20, $M$ = 7, and $F$ = 2 and print both the predictive accuracy of the random forests and the corresponding confusion matrices. In addition, build a single "normal" decision tree from the remainder set and print its predictive accuracy and confusion matrix against the test set (from item 1 of Step 1) for comparison.
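
As a sketch of the reporting side of this step, a confusion matrix can be kept as a nested list with actual classes as rows and predicted classes as columns. The function names and the row/column orientation are assumptions made for illustration; use whatever layout matches your earlier assignments.

```python
def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows are actual labels, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of instances on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0
```

The same two functions can be reused for both the random forest and the single "normal" decision tree, which keeps the comparison consistent.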

## Step 3
Run your random forest algorithm for each data set to see the variation of results for different values of the parameters $N$, $M$, and $F$. Note that for each setting of $N$, $M$, and $F$, you will need to run your program multiple times because of the randomly generated remainder set to get a sense of the settings (e.g., you might run each setting 5 times). You should try a wide range of values including large values for $N$. Print out the results (i.e., the values for $N$, $M$, and $F$, the accuracy, and the confusion matrices) that seem to give the best results for each dataset. As in step 2, output the accuracy of the corresponding single "normal" decision tree as well for comparison. Report on the settings you tried and the results you obtained in your log.
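
One way to organize these experiments is a small harness that tries each combination several times and averages the accuracies. This is only a sketch: `run_once` is a hypothetical callable standing in for your own code that builds a forest with the given $N$, $M$, and $F$ on a fresh random split and returns its test accuracy.

```python
from itertools import product


def sweep(run_once, n_values, m_values, f_values, repeats=5):
    """Average run_once(n, m, f) over several runs; return results sorted best first."""
    results = []
    for n, m, f in product(n_values, m_values, f_values):
        if m > n:
            continue  # cannot keep more trees than were built
        scores = [run_once(n, m, f) for _ in range(repeats)]
        results.append((sum(scores) / repeats, n, m, f))
    results.sort(reverse=True)  # highest average accuracy first
    return results
```

Each entry in the returned list is `(average_accuracy, n, m, f)`, which is convenient to print with tabulate and copy into your log.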

## Step 4
Run your random forest algorithm over the "preprocessed" wisconsin.dat dataset, which contains cases from a study that was conducted at the University of Wisconsin Hospitals in Madison, Wisconsin, about patients who had undergone surgery for breast cancer. The task is to determine if the detected tumor is benign (2) or malignant (4). This dataset has 10 attributes (by order in the table):

1. clump thickness interval value `[1,10]`
2. cell size interval value `[1,10]`
3. cell shape interval value `[1,10]`
4. marginal adhesion interval value `[1,10]`
5. epithelial size interval value `[1,10]`
6. bare nuclei interval value `[1,10]`
7. bland chromatin interval value `[1,10]`
8. normal nucleoli interval value `[1,10]`
9. mitoses interval value `[1,10]`
10. tumor `2=benign, 4=malignant`

As in step 3, report the values for $N$, $M$, and $F$ that give the best results, along with the corresponding accuracy and confusion matrix. Also output the accuracy for a single "normal" decision tree for comparison.

## Step 5
Modify your random forest algorithm to use the "track record" weighted voting scheme discussed in class (instead of simple majority voting). Compare your results to those from Steps 2-4.
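
The scheme to implement is the one covered in class. As a rough sketch of one plausible reading, assume each tree's vote for a label is weighted by how often that tree was right on its validation set when it predicted that label. The function names and this particular weighting are assumptions, so check them against your class notes before relying on them.

```python
from collections import defaultdict


def track_record(actual, predicted):
    """For each predicted label, the fraction of validation rows where it was correct."""
    right = defaultdict(int)
    made = defaultdict(int)
    for a, p in zip(actual, predicted):
        made[p] += 1
        if a == p:
            right[p] += 1
    return {label: right[label] / made[label] for label in made}


def weighted_vote(predictions, records, default_weight=0.0):
    """Sum each tree's track-record weight for its predicted label; highest total wins."""
    totals = defaultdict(float)
    for label, record in zip(predictions, records):
        # A label the tree never predicted on validation gets default_weight.
        totals[label] += record.get(label, default_weight)
    return max(totals, key=totals.get)
```

Under this weighting, a single tree with a strong record for a label can outvote two trees with weak records, which is where the results may differ from simple majority voting.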

## Submitting Assignments
1. Use Github classroom to submit your assignment via a Github repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this. You must commit your solution by the due date and time.
1. Your repo should contain only your .py file(s), your input .csv/.txt file(s), and your log file(s) (.txt or .md). Double check that this is the case by cloning (or downloading a zip of) your submission repo and running your code from the command line like we will when we grade your code.

## Grading Guidelines
This assignment is worth 30 BONUS points. Your assignment will be evaluated based on whether it runs successfully from the command line (using the Anaconda Python Distribution v3.7) and on adherence to the program requirements. We will grade according to the following criteria:
* 8 pts for correct step 1
* 5 pts for correct step 2
* 5 pts for correct step 3
* 5 pts for correct step 4
* 5 pts for correct step 5
* 2 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC310/PAs/blob/master/Coding%20Standard.ipynb)