In [7]:
import numpy as np

# pasted from DecisionTreeFun
header = ["level", "lang", "tweets", "phd"]
attribute_domains = {"level": ["Senior", "Mid", "Junior"], 
    "lang": ["R", "Python", "Java"],
    "tweets": ["yes", "no"], 
    "phd": ["yes", "no"]}
X = [
    ["Senior", "Java", "no", "no"],
    ["Senior", "Java", "no", "yes"],
    ["Mid", "Python", "no", "no"],
    ["Junior", "Python", "no", "no"],
    ["Junior", "R", "yes", "no"],
    ["Junior", "R", "yes", "yes"],
    ["Mid", "R", "yes", "yes"],
    ["Senior", "Python", "no", "no"],
    ["Senior", "R", "yes", "no"],
    ["Junior", "Python", "yes", "no"],
    ["Senior", "Python", "yes", "yes"],
    ["Mid", "Python", "no", "yes"],
    ["Mid", "Java", "yes", "no"],
    ["Junior", "Python", "no", "yes"]
]

y = ["False", "False", "True", "True", "True", "False", "True", "False", "True", "True", "True", "True", "True", "False"]
# stitch X and y together to make one table
table = [X[i] + [y[i]] for i in range(len(X))]

### Lab Task 1
Write a bootstrap function to return a random sample of rows with
replacement:

```python
from random import randint

def bootstrap(table):
    return [table[randint(0, len(table) - 1)] for _ in range(len(table))]
```

Note: `randint(i, j)` returns a random integer $n$ such that $i \leq n \leq j$ (both endpoints are inclusive)

Note that we are not using bootstrapping for testing in this case ...
* We are using it here to **create** the ensemble for prediction
* i.e., our classifier = a set of classifiers, each trained on a bootstrap sample of the original dataset
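
As a rough illustration of how such an ensemble might predict, here is a minimal sketch that combines the individual predictions by simple majority vote. The names `majority_vote` and `ensemble_predict` are hypothetical, and the three "classifiers" are hand-written stand-in rules rather than learned decision trees.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the per-classifier predictions."""
    # most_common(1) returns [(label, count)] for the most frequent label;
    # ties go to the label that appears first in `predictions`
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, instance):
    # `classifiers` is assumed to be a list of callables: instance -> label
    return majority_vote([clf(instance) for clf in classifiers])

# three stand-in "classifiers" (made-up rules, for illustration only)
classifiers = [
    lambda row: "True" if row[0] != "Senior" else "False",
    lambda row: "True" if row[2] == "yes" else "False",
    lambda row: "True" if row[3] == "no" else "False",
]
print(ensemble_predict(classifiers, ["Junior", "Python", "no", "yes"]))
```

In the lab setting, each callable would instead wrap a decision tree built from one bootstrap sample.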

Some advantages of bagging (bootstrap aggregation)
* Simple idea, simple to implement
* Can help deal with overfitting and noisy data (outliers)
* Can increase accuracy by reducing variance of individual classifiers
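
One way to see the variance claim, under an idealized assumption that is not part of the lab itself: treat the outputs of the $N$ classifiers as numeric random variables $X_1, \dots, X_N$, each with variance $\sigma^2$ and pairwise correlation $\rho$. Averaging them (a simplification of majority voting) gives

$$\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right) = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2$$

So the second term shrinks as $N$ grows, and the benefit is larger when the classifiers are less correlated ($\rho$ small), which is one motivation for making the trees differ from each other.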

### Random Forests
Basic Idea
* Generate many different decision trees (a "forest" of trees) ... $N$ trees

Q: What are ways we could do this?
* Use bagging (bootstrap aggregation)
* Randomly select attributes (many possible trees!)
* Use different attribute selection approaches (entropy, Gini index, ...)
* Use a subset of attributes for each tree
* And so on

Random Forests approach:
* Build each tree using bagging (so different data sample used for each tree)
* At each node, select attribute from a random subset of available attributes... subset size $F$
* Use entropy to select attribute to (split) partition on
* Select the "best" subset of the $N$ random trees to use in the ensemble ... $M$ trees, with $M \leq N$

Note that $N$, $M$, and $F$ are all parameters of the algorithm
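
The per-node attribute selection described above could be sketched as follows. This is only an illustration under stated assumptions: rows are lists whose last element is the class label, attributes are referred to by column index, and the helper names (`entropy`, `weighted_entropy`, `select_attribute`) are hypothetical rather than taken from the course code.

```python
import math
import random
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_entropy(rows, att_index):
    # size-weighted entropy of the partitions induced by one attribute
    partitions = {}
    for row in rows:
        partitions.setdefault(row[att_index], []).append(row[-1])
    return sum(len(p) / len(rows) * entropy(p) for p in partitions.values())

def select_attribute(rows, available, F):
    # only F randomly chosen attribute indexes are considered at this node
    candidates = random.sample(available, min(F, len(available)))
    # among the candidates, pick the one with the lowest weighted entropy
    return min(candidates, key=lambda a: weighted_entropy(rows, a))

# tiny made-up table: column 0 separates the classes perfectly, column 1 does not
rows = [["a", "x", "yes"], ["a", "y", "yes"], ["b", "x", "no"], ["b", "y", "no"]]
print(select_attribute(rows, [0, 1], F=1))  # either index, depending on the draw
print(select_attribute(rows, [0, 1], F=2))  # both are candidates, so index 0 wins
```

With $F$ smaller than the number of available attributes, the best overall attribute is sometimes left out of the candidates, which is what makes the trees in the forest differ.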

### Lab Task 2
Define a Python function that selects $F$ random attributes from an attribute list

```python
import random

def random_attribute_subset(attributes, F):
    # shuffle and pick first F
    shuffled = attributes[:] # make a copy
    random.shuffle(shuffled)
    return shuffled[:F]
```
* `random.shuffle()` rearranges (permutes) the given sequence in place, which is why the function shuffles a copy rather than the original list
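
As an alternative sketch, the standard library's `random.sample` draws distinct elements without touching the input list, so no explicit copy is needed. The function name `random_attribute_subset_sample` is made up here to distinguish it from the version above.

```python
import random

def random_attribute_subset_sample(attributes, F):
    # random.sample returns F distinct elements and leaves `attributes` unchanged
    return random.sample(attributes, F)

header = ["level", "lang", "tweets", "phd"]
subset = random_attribute_subset_sample(header, 2)
print(subset)  # two distinct attribute names; the result varies from run to run
```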

### The Random Forest Procedure
1. Divide $D$ into a test and remainder set
    * Take 1/3 for test set, 2/3 for remainder set
    * Ensure test set has same distribution of class labels as $D$ ("stratified")
    * Randomly select instances when generating test set
2. Create $N$ bootstrap samples from remainder set
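    * Each results in a **training** set (about 63% of the distinct instances, those drawn at least once) and a **validation** set (the remaining 37% or so, those never drawn)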
    * Build and test a classifier for each of the N bootstrap samples
    * Each classifier is a decision tree using $F$-sized random attribute subsets
    * Determine accuracy of classifier using validation set
3. Pick the $M$ best classifiers generated in step 2
4. Use test set from step 1 to determine performance of the ensemble of $M$ classifiers (using simple majority voting)

Again note: $N$, $M$, and $F$ are parameters (in addition to $D$)
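
Steps 1 and 2 can be sketched roughly as below. The split sizes follow from a standard fact: when drawing $n$ rows with replacement, a given row is never drawn with probability $(1 - 1/n)^n \approx e^{-1} \approx 0.368$, so roughly 63% of the distinct rows land in the training set. The function names, the `test_fraction` and `label_index` parameters, and the toy table are all assumptions made for this example.

```python
import random

def stratified_split(table, test_fraction=1/3, label_index=-1):
    """Split rows into (remainder, test), keeping class proportions similar."""
    groups = {}
    for row in table:
        groups.setdefault(row[label_index], []).append(row)
    remainder, test = [], []
    for rows in groups.values():
        rows = rows[:]           # copy so the input order is untouched
        random.shuffle(rows)     # random selection within each class
        n_test = round(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        remainder.extend(rows[n_test:])
    return remainder, test

def bootstrap_with_validation(table):
    """Return (training, validation): a bootstrap sample and the rows it missed."""
    n = len(table)
    chosen = [random.randrange(n) for _ in range(n)]  # indexes, with replacement
    chosen_set = set(chosen)
    training = [table[i] for i in chosen]
    validation = [table[i] for i in range(n) if i not in chosen_set]
    return training, validation

# toy table (made-up data): 90 rows labelled "yes" and 60 labelled "no"
table = [[i, "yes"] for i in range(90)] + [[i, "no"] for i in range(90, 150)]
remainder, test = stratified_split(table)
training, validation = bootstrap_with_validation(remainder)
print(len(test), len(remainder), len(training), len(validation))
```

Calling `bootstrap_with_validation(remainder)` $N$ times would give the $N$ training/validation pairs of step 2. The validation set size varies from run to run but should usually be in the neighborhood of 37 rows here.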

In [36]:
import numpy as np
def compute_random_subset(values, num_values):
    # alternative: np.random.choice(values, num_values, replace=False)
    values_copy = values[:]  # shallow copy
    np.random.shuffle(values_copy)  # shuffles in place
    return values_copy[:num_values]

F = 2
print(compute_random_subset(header,F))

['lang', 'phd']


### Lab Task 3 (For Extra Practice)
Assume we have a dataset with 4 attributes ($a_1$, $a_2$, $a_3$, $a_4$) where each attribute has two possible values ($v_1$ and $v_2$) and attribute $a_5$ contains class labels with two possible values ($yes$ and $no$). Using random attribute subsets of size 2:
1. Give an example of a complete decision tree that could be generated using the random forest approach
2. Show the random attribute subset for each attribute node in the tree.