# INTRODUCTION:

In machine learning, decision trees are commonly used to help identify features, and specific values of those features, that are most likely to result in a target value. If the target value is categorical, the model is a classification tree; if the target value is continuous, the model is a regression tree.
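To make the distinction concrete, here is a minimal sketch using scikit-learn's two tree estimators. The toy data and variable names are made up for illustration and are not part of the dataset analysed below.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data (assumption for illustration): a single feature with values 0..9
X = np.arange(10).reshape(-1, 1)

# Categorical target -> classification tree
y_class = (X.ravel() >= 5).astype(int)
clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y_class)

# Continuous target -> regression tree
y_cont = 2.0 * X.ravel()
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y_cont)

# The root split threshold is the "specific value" of the feature the tree found
print("split threshold:", clf.tree_.threshold[0])
print("class predictions:", clf.predict([[2], [7]]))
print("regression predictions:", reg.predict([[2], [7]]))
```

The classifier returns a class label for each input, while the regressor returns the mean target value of the leaf the input falls into.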

## Random Forests:

Imagine that an expert has a decision tree model in their head, and we assemble, say, 100 such experts. Loosely speaking, we have a decision forest. The idea behind Random Forests is to learn many decision tree classifiers (a forest of decision trees) and combine their predictions to produce a result that is, in aggregate, usually better than the prediction of any individual tree. Essentially, a random forest works as follows:

1. Grow a large number of decision trees on your training data.
2. For each tree, use only a random subset of the features and a random subset of the datapoints. This reduces overfitting because the trees do not all rely on the same features or see the same samples, so their errors are less correlated.
3. Make predictions by aggregating over each decision tree's individual predictions.
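The three steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is only an illustration on synthetic data, not how `RandomForestClassifier` is implemented internally; the data, `n_trees`, and the variable names are assumptions. One difference from step 2 as worded: with `max_features="sqrt"`, scikit-learn draws the random feature subset at each split rather than once per tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data stands in for a real training set
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)

rng = np.random.RandomState(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []

# Step 1: grow many trees
for _ in range(n_trees):
    # Step 2a: bootstrap sample of the rows (sampling with replacement)
    rows = rng.randint(0, len(X), size=len(X))
    # Step 2b: each split considers only a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=rng.randint(1000000))
    trees.append(tree.fit(X[rows], y[rows]))

# Step 3: aggregate by majority vote over the trees' 0/1 predictions
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("training accuracy of the vote: %f" % (forest_pred == y).mean())
```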

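Random forests also provide a useful by-product known as feature importances: one score per feature that indicates how much that feature contributed to the splits across all of the trees. The code below uses these scores to rank the features.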

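In Olympic gymnastic and diving competitions, a panel of judges scores a participant. The top and bottom scores are dropped and the rest are averaged. A Random Forest aggregates in a similar spirit, although it does not discard any tree's prediction: each tree acts like one judge, and the forest combines the outputs of all of the trees, by majority vote (or averaged class probabilities) for classification and by averaging for regression. Because each tree is trained on a different random sample, the errors of individual trees tend to cancel out in the aggregate.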
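As a check on how the aggregation works in practice, the sketch below fits scikit-learn's `RandomForestClassifier` on synthetic data and compares the forest's class probabilities with the plain average over every individual tree in `estimators_`. The data and variable names are assumptions for illustration; the point is that every tree contributes to the result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average the class probabilities predicted by every individual tree
tree_probs = np.mean([t.predict_proba(X) for t in forest.estimators_], axis=0)

# The forest's own probabilities should match that average
matches = np.allclose(tree_probs, forest.predict_proba(X))
print("forest probabilities equal the mean over all trees:", matches)
```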

# CODE:

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt 
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as skm
import pylab as pl

# data source: https://archive.ics.uci.edu/ml/machine-learning-databases/00240/
# local file location for training: /Users/AshRajBala/Repositories/ThinkfulProjects/UCI HAR Dataset/train

# sep=r"\s+" splits on whitespace (replaces delim_whitespace, which newer pandas versions deprecate)
subjects = pd.read_csv("/Users/AshRajBala/Repositories/ThinkfulProjects/UCI HAR Dataset/train/subject_train.txt", header=None, sep=r"\s+", index_col=False)
subjects.columns = ['Subject']
subjects

Unnamed: 0,Subject
0,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
8,1
9,1


In [2]:
len(subjects.stack().value_counts()) # 21 participants

21

In [3]:
feature_names = pd.read_csv("/Users/AshRajBala/Repositories/ThinkfulProjects/UCI HAR Dataset/features.txt", header=None, sep=r"\s+", index_col=False)
len(feature_names) # 561 features
feature_names

Unnamed: 0,0,1
0,1,tBodyAcc-mean()-X
1,2,tBodyAcc-mean()-Y
2,3,tBodyAcc-mean()-Z
3,4,tBodyAcc-std()-X
4,5,tBodyAcc-std()-Y
5,6,tBodyAcc-std()-Z
6,7,tBodyAcc-mad()-X
7,8,tBodyAcc-mad()-Y
8,9,tBodyAcc-mad()-Z
9,10,tBodyAcc-max()-X


In [5]:
x_vars = pd.read_csv("/Users/AshRajBala/Repositories/ThinkfulProjects/UCI HAR Dataset/train/X_train.txt", header=None, sep=r"\s+", index_col=False)

In [6]:
x_vars

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,551,552,553,554,555,556,557,558,559,560
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.074323,-0.298676,-0.710304,-0.112754,0.030400,-0.464761,-0.018446,-0.841247,0.179941,-0.058627
1,0.278419,-0.016411,-0.123520,-0.998245,-0.975300,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,0.158075,-0.595051,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317
2,0.279653,-0.019467,-0.113462,-0.995380,-0.967187,-0.978944,-0.996520,-0.963668,-0.977469,-0.938692,...,0.414503,-0.390748,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.982750,-0.989302,-0.938692,...,0.404573,-0.117290,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663
4,0.276629,-0.016570,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,0.087753,-0.351471,-0.699205,0.123320,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892
5,0.277199,-0.010098,-0.105137,-0.997335,-0.990487,-0.995420,-0.997627,-0.990218,-0.995549,-0.942469,...,0.019953,-0.545410,-0.844619,0.082632,-0.143439,0.275041,-0.368224,-0.849632,0.184823,-0.042126
6,0.279454,-0.019641,-0.110022,-0.996921,-0.967186,-0.983118,-0.997003,-0.966097,-0.983116,-0.940987,...,0.145844,-0.217198,-0.564430,-0.212754,-0.230622,0.014637,-0.189512,-0.852150,0.182170,-0.043010
7,0.277432,-0.030488,-0.125360,-0.996559,-0.966728,-0.981585,-0.996485,-0.966313,-0.982982,-0.940987,...,0.136382,-0.082307,-0.421715,-0.020888,0.593996,-0.561871,0.467383,-0.851017,0.183779,-0.041976
8,0.277293,-0.021751,-0.120751,-0.997328,-0.961245,-0.983672,-0.997596,-0.957236,-0.984379,-0.940598,...,0.314038,-0.269401,-0.572995,0.012954,0.080936,-0.234313,0.117797,-0.847971,0.188982,-0.037364
9,0.280586,-0.009960,-0.106065,-0.994803,-0.972758,-0.986244,-0.995405,-0.973663,-0.985642,-0.940028,...,0.267383,0.339526,0.140452,-0.020590,-0.127730,-0.482871,-0.070670,-0.848294,0.190310,-0.034417


In [7]:
def clean_name(name):
    """Strip punctuation and shorten a raw feature name."""
    name = re.sub('[()-]', '', name)   # drop parentheses and hyphens
    name = re.sub('[,]', '_', name)    # commas become underscores
    for old, new in [('Body', ''), ('Mag', ''), ('mean', 'Mean'), ('std', 'STD')]:
        name = name.replace(old, new)
    return name

x_vars.columns = [clean_name(el) for el in feature_names[1]]
# 7352 observations, 561 columns
y_var = pd.read_csv("/Users/AshRajBala/Repositories/ThinkfulProjects/UCI HAR Dataset/train/y_train.txt", header=None, sep=r"\s+", index_col=False)
y_var.columns = ['Activity']

In [18]:
data = pd.merge(y_var, x_vars, left_index=True, right_index=True)
data = pd.merge(data, subjects, left_index=True, right_index=True)
# 7352 obs, 563 columns (Activity + 561 features + Subject)

# change activity to categorical variable
data['Activity'] = pd.Categorical(data['Activity']).codes

# partition data into training, test, and validation sets
fortrain = data.query('Subject >= 27')
fortest = data.query('Subject <= 6')
forval = data.query("(Subject >= 21) & (Subject < 27)")
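Splitting on `Subject` rather than on random rows keeps every recording from a given person inside a single partition, so the scores reflect how the model does on people it has not seen. The sketch below applies the same three conditions to a small made-up frame (the `toy` data is hypothetical) and checks two things: the partitions share no subjects, and any subject numbered 7 through 20 matches none of the conditions and is left unused.

```python
import pandas as pd

# Toy frame with made-up values, used only to exercise the three conditions
toy = pd.DataFrame({"Subject": [1, 3, 6, 7, 15, 21, 26, 27, 30],
                    "Activity": [0, 1, 2, 0, 1, 2, 0, 1, 2]})

toy_train = toy.query("Subject >= 27")
toy_test = toy.query("Subject <= 6")
toy_val = toy.query("(Subject >= 21) & (Subject < 27)")

# No subject appears in more than one partition
train_s, test_s, val_s = (set(p["Subject"]) for p in (toy_train, toy_test, toy_val))
assert not (train_s & test_s or train_s & val_s or test_s & val_s)

# Rows that fall into none of the partitions
assigned = pd.concat([toy_train, toy_test, toy_val])
unused = toy.drop(assigned.index)
print("unused subjects:", sorted(unused["Subject"]))
```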

In [32]:
# fit random forest model
train_target = fortrain['Activity']
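# positional slice: skips Activity (column 0) and stops before the last two columns (the final feature and Subject)
train_data = fortrain.iloc[:,1:-2]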
rfc = RandomForestClassifier(n_estimators=500, oob_score=True)
rfc.fit(train_data, train_target)

RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=500, n_jobs=1,
            oob_score=True, random_state=None, verbose=0)
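Setting `oob_score=True` asks the forest to score each training sample using only the trees whose bootstrap sample did not contain it, which gives an accuracy estimate without a separate hold-out set. The sketch below, on synthetic data with assumed names and parameters, prints that estimate next to the accuracy on a genuinely held-out split so the two can be compared.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=600, n_features=12,
                           n_informative=6, random_state=0)
X_fit, X_held, y_fit, y_held = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0)
forest.fit(X_fit, y_fit)

# OOB estimate uses only the training data; the held-out score is a reference
held_out_accuracy = forest.score(X_held, y_held)
print("OOB accuracy estimate: %f" % forest.oob_score_)
print("held-out accuracy:     %f" % held_out_accuracy)
```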

In [33]:
train_target = fortrain['Activity']
train_data = fortrain.iloc[:,1:-2]

In [34]:
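# out-of-bag (OOB) score: accuracy estimated from the samples each tree did not see during training
# (a bare expression in the middle of a cell is not displayed; wrap it in print() to see the value)
rfc.oob_score_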

# determine most important features
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]
for i in range(10):
    print("%d. feature %d (%f)" % (i + 1, indices[i], importances[indices[i]]))

1. feature 49 (0.036942)
2. feature 56 (0.035986)
3. feature 52 (0.030095)
4. feature 58 (0.029634)
5. feature 40 (0.028831)
6. feature 558 (0.021919)
7. feature 51 (0.020871)
8. feature 50 (0.018161)
9. feature 559 (0.017746)
10. feature 41 (0.017141)
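The ranking above reports column positions, which are hard to read without the feature names. One way to attach names, sketched here on synthetic data with hypothetical column names (`feat_0` and so on), is to wrap the importances in a pandas Series indexed by the training columns. In this notebook the analogous index would be `train_data.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
frame = pd.DataFrame(X, columns=["feat_%d" % i for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(frame, y)

# Pair each importance with its column name and sort from most to least important
ranked = pd.Series(forest.feature_importances_,
                   index=frame.columns).sort_values(ascending=False)
print(ranked.head(3))
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.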


In [None]:
# define validation set and make predictions
val_target = forval['Activity']
val_data = forval.iloc[:,1:-2]  # same feature columns as train_data
val_pred = rfc.predict(val_data)

In [None]:
# define test set and make predictions
test_target = fortest['Activity']
test_data = fortest.iloc[:,1:-2]
test_pred = rfc.predict(test_data)

In [None]:
# calc and print accuracy scores
print("mean accuracy score for validation set = %f" %(rfc.score(val_data, val_target)))
print("mean accuracy score for test set = %f" %(rfc.score(test_data, test_target)))

In [None]:
# visualize confusion matrix
test_cm = skm.confusion_matrix(test_target, test_pred)
plt.matshow(test_cm)
plt.title('Confusion matrix for test data')
plt.colorbar()
plt.show()
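To read the matrix: in scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, so the diagonal counts correct predictions. The small example below uses made-up labels (not results from this dataset) and derives per-class recall from the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes (assumption for illustration)
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Row i sums to the number of true samples of class i,
# so diagonal / row sum is the recall of each class
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print("per-class recall:", per_class_recall)
```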

In [None]:
print("Accuracy = %f" %(skm.accuracy_score(test_target, test_pred)))
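# Activity has more than two classes, so an averaging mode must be given;
# 'weighted' averages the per-class scores weighted by class support
print("Precision = %f" %(skm.precision_score(test_target, test_pred, average='weighted')))
print("Recall = %f" %(skm.recall_score(test_target, test_pred, average='weighted')))
print("F1 score = %f" %(skm.f1_score(test_target, test_pred, average='weighted')))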
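Because the activity label takes more than two values, precision, recall, and F1 are computed per class and then combined, and the `average` argument controls how. The sketch below uses made-up labels (not results from this dataset) to show how the common choices differ for precision; recall and F1 accept the same argument.

```python
from sklearn.metrics import precision_score

# Made-up labels for three classes (assumption for illustration)
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

per_class = precision_score(y_true, y_pred, average=None)      # one value per class
macro = precision_score(y_true, y_pred, average="macro")       # unweighted mean of the classes
weighted = precision_score(y_true, y_pred, average="weighted") # mean weighted by class support
micro = precision_score(y_true, y_pred, average="micro")       # pooled over all samples

print("per class:", per_class)
print("macro = %f, weighted = %f, micro = %f" % (macro, weighted, micro))
```

Macro averaging treats every class equally, while weighted averaging lets frequent classes count for more, so the two diverge when classes are imbalanced.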