## Using Random Forest Models To Predict Sleep Patterns Based On Average Daily Action Counts

### Imports

In [19]:
import modules
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split,KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### Retrieving and categorizing data

In [7]:
X = modules.get_and_avg_data() #get data averaged over 7 day increments
Y = X['avg_sleep'].copy() #extract labels from set
#drop depression_class (the depression scores are used instead) and avg_sleep (these are our labels)
X = X.drop(labels=['depression_class', 'avg_sleep'], axis=1)
#categorize each column into 3 categories: 0 = less than average, 1 = average, 2 = more than average
X = X.apply(modules.categorize_column, axis=0)
Y = modules.categorize_column(Y)

no user_tags: 520
no user_tags: 532
no user_tags: 503
no user_tags: 503
no user_tags: 523
no user_tags: 544
no user_tags: 529
no user_tags: 661
no user_tags: 658
no user_tags: 664
no user_tags: 634
no user_tags: 507
no user_tags: 547
no user_tags: 501
no user_tags: 668
no user_tags: 662
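
The body of modules.categorize_column is not shown in this notebook. As a rough illustration of three-way binning around a column average, here is a minimal sketch. The function name `categorize_column_sketch`, the `width` parameter, the cut-offs at half a standard deviation from the mean, and the labels 0, 1, 2 are all assumptions made for illustration, not the notebook's actual implementation.

```python
import numpy as np

def categorize_column_sketch(values, width=0.5):
    """Hypothetical stand-in for modules.categorize_column (assumed behaviour).

    Returns 0 where a value lies more than `width` standard deviations below
    the column mean, 2 where it lies more than `width` standard deviations
    above it, and 1 otherwise.
    """
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    lower, upper = mean - width * std, mean + width * std
    categories = np.ones(values.shape, dtype=int)  # start with "average"
    categories[values < lower] = 0                 # less than average
    categories[values > upper] = 2                 # more than average
    return categories

# Toy sleep averages in hours (made-up numbers, not from the dataset)
example = categorize_column_sketch([4.0, 6.5, 7.0, 7.5, 10.0])
print(example)
```

Binning relative to each column's own mean puts every feature on the same three-level scale, which is presumably why the same function is applied to both the features and the labels.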


### Finding the best parameters for our random forest using 10-fold cross validation

In [20]:
#define the parameters we will be testing in cross validation
num_trees = [5,10,20,30,40,50,60,70,80,90,100]
criterions = ['gini', 'entropy', 'log_loss']
min_samples_splits = [2,4,8,16]

The only "overbearing" individual tree parameter we test is min_samples_split, which sets the minimum number of samples a node must contain before it can be split. We tune it because it indirectly limits other properties of the resulting trees, such as their depth and number of leaves, and because a value that is too low lets the trees keep splitting on very small groups of samples, which can lead to overfitting. The other "overbearing" parameters can stop a tree from growing where further growth would most likely be useful. Max depth, for example, halts all splitting below a fixed level, even when those splits would help us discriminate between samples more accurately.
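
To make the difference between these constraints concrete, the sketch below fits single decision trees on synthetic data and compares their shapes. The dataset from `make_classification` and the helper `tree_shape` are assumptions for illustration only; the notebook's own data is not used.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the notebook's X and Y (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)

def tree_shape(**params):
    """Fit one tree with the given constraints and return (depth, leaf count)."""
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X_demo, y_demo)
    return tree.get_depth(), tree.get_n_leaves()

shapes = {
    'min_samples_split=2': tree_shape(min_samples_split=2),
    'min_samples_split=16': tree_shape(min_samples_split=16),
    'max_depth=3': tree_shape(max_depth=3),
}
for name, (depth, leaves) in shapes.items():
    print(f'{name}: depth={depth}, leaves={leaves}')
```

A larger min_samples_split only stops splits on small nodes, so well-populated branches can still grow deep, whereas max_depth cuts every branch at the same level regardless of how many samples remain in it.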

In [26]:
#finding the best parameters
               #num_tree, criterion, min_sample_split, accuracy
best_results = [0,        None,      0,                0.0]
for num_tree in num_trees:
    for criterion in criterions:
        for min_sample_split in min_samples_splits:
            k_fold = KFold(n_splits=10, shuffle=True, random_state=57)
            clf = RandomForestClassifier(n_estimators=num_tree, criterion=criterion,min_samples_split=min_sample_split, random_state=56)
            accuracy_scores = cross_val_score(clf, X, Y, cv=k_fold, n_jobs=1)
            avg_accuracy = np.mean(accuracy_scores)
            if avg_accuracy > best_results[3]:
                best_results = [num_tree, criterion, min_sample_split, avg_accuracy]

In [33]:
#get the best parameters
num_tree = best_results[0]
criterion = best_results[1]
min_sample_split = best_results[2]
print(f'The best parameters found via 10-fold cross validation: \nnum_tree-{num_tree} \ncriterion-{criterion} \nmin_sample_split-{min_sample_split}')

The best parameters found via 10-fold cross validation: 
num_tree-70 
criterion-entropy 
min_sample_split-16
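
The nested loops above can also be expressed with scikit-learn's `GridSearchCV`. The following is a minimal sketch under stated assumptions: it uses synthetic data and a reduced grid so that it runs quickly, and it splits off a test set before searching so that the held-out samples play no part in choosing the parameters. That ordering is a design choice of this sketch, not what the notebook does. The 'log_loss' criterion is left out here because scikit-learn documents it as equivalent to 'entropy' and it is only accepted by newer versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in data; in the notebook this would be X and Y (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Hold out the test set first so it is never seen during the search
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=68
)

# Reduced version of the notebook's grid, kept small for speed
param_grid = {
    'n_estimators': [5, 10, 20],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 16],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=56),
    param_grid,
    cv=KFold(n_splits=10, shuffle=True, random_state=57),
    scoring='accuracy',
)
search.fit(X_tr, y_tr)  # refits the best model on all of X_tr afterwards

best_params = search.best_params_
best_cv_accuracy = search.best_score_
test_accuracy = search.score(X_te, y_te)
print(best_params, best_cv_accuracy, test_accuracy)
```

`best_score_` is the mean cross-validated accuracy of the winning combination, which corresponds to `avg_accuracy` in the manual loop.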


### Train and evaluate our best found model

In [28]:
#split up our data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10, random_state = 68)

In [32]:
#evaluate our model using the best found parameters
clf = RandomForestClassifier(n_estimators=num_tree, criterion=criterion,min_samples_split=min_sample_split, random_state=56)
clf.fit(X_train,Y_train)
Y_pred = clf.predict(X_test)
best_accuracy = accuracy_score(Y_test, Y_pred) #accuracy_score expects (y_true, y_pred)

#now evaluate a model where all parameters were default
clf = RandomForestClassifier(random_state=56)
clf.fit(X_train,Y_train)
Y_pred = clf.predict(X_test)
default_accuracy = accuracy_score(Y_test, Y_pred) #accuracy_score expects (y_true, y_pred)

#print the accuracies for comparison
print(f'Accuracy of Random Forest model with best found parameters: {best_accuracy}\nAccuracy of Random Forest model with default parameters: {default_accuracy}')

Accuracy of Random Forest model with best found parameters: 0.7706422018348624
Accuracy of Random Forest model with default parameters: 0.7522935779816514
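
To put accuracies like these in context, it can help to compare them against a majority-class baseline, which always predicts the most frequent training label. The sketch below shows one way to compute it with scikit-learn's `DummyClassifier`. The random data and the class proportions are made up for illustration; in the notebook the same calls would be applied to X_train, X_test, Y_train and Y_test.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up features and imbalanced 3-class labels (assumption, not the notebook's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.choice([0, 1, 2], size=200, p=[0.2, 0.6, 0.2])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=68
)

# The baseline ignores the features and predicts the most common training class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)
baseline_accuracy = accuracy_score(y_te, baseline_pred)
print(f'Majority-class baseline accuracy: {baseline_accuracy}')
```

If one sleep category dominates the labels, a baseline like this can score well on its own, so the gap between it and the random forest is a more informative number than the raw accuracy.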
