# Random Forest Classifier
***
By Matt Hartley and Bela Abolfathi

## Decision Trees

* Random forests are grown from decision trees. 
* A single tree takes an input and, through a series of tests on its attributes, outputs a class. 
* Built by recursively splitting the data: at each node, candidate splits are evaluated and the one that best separates the classes is kept.
* Decision trees tend to overfit the training data and therefore generalize poorly. 
* A fully grown tree can reach 100% accuracy on the training set while doing worse on unseen data.
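
To make "picking the best split" concrete, here is a minimal sketch of scoring candidate cuts on a single feature. It assumes Gini impurity as the split criterion (one common choice) and uses made-up magnitudes and labels; `gini` and `best_split` are illustrative helper names, not part of scikit-learn.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best cut on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # start from the unsplit node
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        threshold = (lo + hi) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Impurity of the two children, weighted by their sizes
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical r-band magnitudes with made-up labels, for illustration only
mags = [17.2, 17.9, 18.4, 19.1, 19.6, 20.3]
classes = ['STAR', 'STAR', 'STAR', 'GALAXY', 'GALAXY', 'GALAXY']
threshold, impurity = best_split(mags, classes)
print(threshold, impurity)
```

A full tree repeats this search over every feature at every node, which is why an unconstrained tree can keep splitting until each leaf is pure.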

## Random Forest

* An input vector is put through each tree in the forest. 
* Each tree is fit on a random sub-sample of the data, so the trees settle on different cuts.
* Reduces overfitting by averaging the outputs of the trees or taking their mode (a majority vote).
* Better generalization, but not as interpretable.
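
The two ingredients above, random sub-samples and taking the mode of the tree outputs, can be sketched without any library. This is only an illustration under simple assumptions: sub-samples are drawn with replacement (bootstrap), the tree outputs are made up, and `bootstrap_indices` and `majority_vote` are hypothetical helper names.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one tree's sub-sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(tree_outputs):
    """Combine per-tree class predictions for one input by taking the mode.

    Ties go to the class that appears first in tree_outputs.
    """
    return Counter(tree_outputs).most_common(1)[0][0]

rng = random.Random(44)
indices = bootstrap_indices(10, rng)  # rows one tree would be trained on

# Hypothetical outputs of five trees for a single input vector
votes = ['GALAXY', 'STAR', 'GALAXY', 'GALAXY', 'STAR']
print(majority_vote(votes))
```

Because each tree sees a different sub-sample, their individual errors are less correlated, and the vote tends to cancel them out.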

In [1]:
from __future__ import division, print_function, absolute_import

In [2]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
import pandas
from astroquery.sdss import SDSS

from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

#Random state
seed = 44

#Query input data
TSquery = """SELECT TOP 10000 
             p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r, 
             p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r, 
             s.class
             FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
             WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
             ORDER BY p.objid ASC
               """
#Separate features from labels, cast as numpy array
SDSSts = SDSS.query_sql(TSquery)
SDSSts.convert_bytestring_to_unicode()
sdss = SDSSts.to_pandas()
data = np.array(sdss.drop('class', axis=1))
labels = np.array(sdss['class'])



## Overfitting of decision tree vs random forest

In [4]:
#Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.3, 
                                                    random_state=seed)
#Initialize and train decision tree
dtc = tree.DecisionTreeClassifier()
dtc.fit(X_train, y_train)

#Evaluate accuracy
dtc_train_acc = np.sum(dtc.predict(X_train)==y_train)/len(y_train)
dtc_test_acc = np.sum(dtc.predict(X_test)==y_test)/len(y_test)
print('Accuracy on training set: {:d}%'.format(int(round((dtc_train_acc*100)))))
print('Accuracy on test set: {:d}%'.format(int(round(dtc_test_acc*100))))

Accuracy on training set: 100%
Accuracy on test set: 95%


In [5]:
#Choose number of trees to train on
n_trees = 8

#Initialize and train random forest
rfc = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
rfc.fit(X_train, y_train)

#Evaluate accuracy
rfc_train_acc = np.sum(rfc.predict(X_train)==y_train)/len(y_train)
rfc_test_acc = np.sum(rfc.predict(X_test)==y_test)/len(y_test)
print('Accuracy on training set: {:d}%'.format(int(round(rfc_train_acc*100))))
print('Accuracy on test set: {:d}%'.format(int(round(rfc_test_acc*100))))

Accuracy on training set: 100%
Accuracy on test set: 97%


## Similar Algorithms

##### Extra Trees:
* Fits each decision tree on *all* of the input data rather than on a random sub-sample.
* Split thresholds are drawn at random for each candidate feature, and the best of these random splits is kept, rather than searching for the optimal threshold.
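
As a rough sketch of what choosing a boundary at random can look like, the snippet below draws a cut uniformly between the smallest and largest value of one feature. The uniform draw, the made-up magnitudes, and the helper name `random_threshold` are assumptions for illustration.

```python
import random

def random_threshold(values, rng):
    """Draw a cut uniformly between the smallest and largest feature value."""
    return rng.uniform(min(values), max(values))

rng = random.Random(44)
mags = [17.2, 17.9, 18.4, 19.1, 19.6, 20.3]  # made-up magnitudes

cut = random_threshold(mags, rng)
left = [m for m in mags if m <= cut]   # samples sent to the left child
right = [m for m in mags if m > cut]   # samples sent to the right child
print(cut, len(left), len(right))
```

Skipping the exhaustive search for the best threshold makes each tree cheaper to build and adds extra randomness between trees.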

##### AdaBoost:
* Boosting is a technique that aims to create a strong classifier from an ensemble of weak classifiers.
* Fits weak learners in sequence, increasing the weights of the instances each one misclassifies so that subsequent learners focus on the difficult cases.
* Sensitive to noisy data and outliers.
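
The reweighting step can be sketched as follows. This assumes the classic discrete, two-class form of AdaBoost; `adaboost_round` is a hypothetical helper name and the weights are made up.

```python
import math

def adaboost_round(weights, misclassified):
    """One reweighting round of discrete (two-class) AdaBoost.

    weights: current instance weights, summing to 1
    misclassified: booleans, True where the weak learner was wrong
    Assumes the weighted error is strictly between 0 and 1.
    Returns (alpha, new_weights).
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # alpha is the learner's say in the final weighted vote
    alpha = 0.5 * math.log((1 - error) / error)
    # Raise the weights of mistakes, lower the weights of correct instances
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_round(weights, misclassified)
print(alpha, new_weights)
```

After the update the single misclassified instance carries as much weight as all the correctly classified ones combined. This also shows why noisy or mislabeled points are a problem: they keep being misclassified and keep gaining weight.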

## Improved figure

![Decision Surfaces](decision_surfaces.png)

## Original figure 

![sklearn](sklearn_rf.png)