# Predicting BDI and PSS scores using Random Forests

The Marmar paper uses Random Forests, and that made me curious about Random Forests. So I spent some time understanding what it does, and how it can be used in our context.

Random forest is a type of supervised machine learning algorithm based on ensemble learning. Ensemble learning is a type of learning where you join different types of algorithms or same algorithm multiple times to form a more powerful prediction model. The random forest algorithm combines multiple algorithm of the same type i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used for both regression and classification tasks.

## How the Random Forest Algorithm Works

The following are the basic steps involved in performing the random forest algorithm:

1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
4. In case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value can be calculated by taking the average of all the values predicted by all the trees in forest.

In [1]:
import pandas as pd
import numpy as np

In [2]:
dataset = pd.read_csv('spectral_features.csv')

In [3]:
dataset.head()

Unnamed: 0,SID,patient_BDI,patient_PSS,mfcc_1_mean,mfcc_1_median,mfcc_1_variance,mfcc_1_standard_deviation,mfcc_2_mean,mfcc_2_median,mfcc_2_variance,...,rms_mean,rms_median,rms_variance,rms_standard_deviation,sroll_mean,sroll_median,sroll_variance,sroll_standard_deviation,phonation_rate,speech_productivity
0,50063,6.0,9.0,115.00405,119.36923,1818.635,42.64546,-31.380527,-28.231335,817.0971,...,0.050003,0.037222,0.00192,0.043817,3107.645792,2627.050781,2513764.5,1585.485522,0.653829,0.529451
1,50099,27.0,42.0,128.42497,125.70503,922.8446,30.378357,0.34659,6.390878,952.2941,...,0.064619,0.051565,0.002966,0.05446,2548.309655,2239.453125,1896869.0,1377.268704,0.647931,0.543374
2,50126,11.0,9.0,139.78557,135.4143,1444.5309,38.006985,22.69233,23.906433,1232.4773,...,0.068212,0.045264,0.003241,0.056927,1763.386327,1367.358398,2218651.5,1489.513839,0.665017,0.503722
3,50063,11.0,17.0,154.61896,156.41885,2199.7925,46.901947,5.075056,7.128631,777.93005,...,0.104553,0.088675,0.004037,0.063539,2131.311035,1528.857422,2915403.0,1707.455207,0.995087,0.004938
4,50126,8.0,4.0,151.89973,157.07394,903.2893,30.054771,2.058903,5.572361,477.73828,...,0.061808,0.038281,0.002629,0.051275,2262.284214,1851.855469,1726863.2,1314.101686,0.866511,0.154053


In [24]:
# Lets try predicting BDIs using random forest regressor

# Dividing data into attributes and labels
x = dataset.iloc[:, 3:].values
y = dataset.iloc[:, 0].values

# Make train-test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Training
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=50000, random_state=0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)

# Testing
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 22.116993333334296
Mean Squared Error: 731.559199711351
Root Mean Squared Error: 27.04735106644181


In [26]:
# Lets try predicting PSS using random forest regressor

# Dividing data into attributes and labels
x = dataset.iloc[:, 3:].values
y = dataset.iloc[:, 1].values

# Make train-test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Training
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=50000, random_state=0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)

# Testing
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 10.703846666666665
Mean Squared Error: 133.52586259906667
Root Mean Squared Error: 11.555339138210815


# Background Info, and Pertinent Observations

BDIs are usually in the range 0 - 63. And the official distinction is as follows:

1-10: These ups and downs are considered normal  
11-16: Mild mood disturbance  
17-20: Borderline clinical depression  
21-30: Moderate depression  
31-40: Severe depression  
over 40: Extreme depression

Error seems a quite high for BDI prediction, as a score difference of 22 could make a lot of difference according to the metrics mentioned above.

For PSS, the score range is 0 - 40. The error is a bit lower than BDI prediction, but still, not acceptable. Need to get better predictions for a publishable result.

The high errors are probably due to these disadvantages:

1. Just 12 rows in our data
2. Maybe regression is the wrong way to go.



## Future work

The prediction errors are high, and these are the steps I think we should be taking next:

1. Patients have been stratified, so I need to run this notebook on each bucket spearately and observe the results.
2. Discussion with Masum also reflected that maybe making buckets for BDI and PSS would help us achieve better results. Instead of using regressor to calculate exact BDI, maybe using the Random Forest classifier to just classify low/high BDI would be a better avenue to pursue.