This dataset was taken from
https://www.kaggle.com/aroojanwarkhan/fitness-data-trends
where a single person recorded his daily fitness data and corresponding mood.

We will explore the following factors:
- step_count
- calories_burned
- hours_of_sleep
- bool_of_active
- weight_kg

In [40]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
# read in data from sample file
df = pd.read_csv('25.csv')
df.head() # show the first five rows

Unnamed: 0,date,step_count,mood,calories_burned,hours_of_sleep,bool_of_active,weight_kg
0,2017-10-06,5464,200,181,5,0,66
1,2017-10-07,6041,100,197,8,0,66
2,2017-10-08,25,100,0,5,0,66
3,2017-10-09,5461,100,174,4,0,66
4,2017-10-10,6915,200,223,5,500,66


In [13]:
dataDrop = df.copy()
dataDrop = dataDrop.drop('date', axis=1)

We assume that step_count is positively correlated with calories_burned.

In [14]:
from pandas.plotting import scatter_matrix
corr = dataDrop.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,step_count,mood,calories_burned,hours_of_sleep,bool_of_active,weight_kg
step_count,1.0,0.246738,0.98926,0.0806867,0.120212,0.109404
mood,0.246738,1.0,0.235044,0.210417,0.379646,-0.458776
calories_burned,0.98926,0.235044,1.0,0.0807261,0.1109,0.1122
hours_of_sleep,0.0806867,0.210417,0.0807261,1.0,0.136603,0.189118
bool_of_active,0.120212,0.379646,0.1109,0.136603,1.0,-0.296443
weight_kg,0.109404,-0.458776,0.1122,0.189118,-0.296443,1.0


From the matrix, bool_of_active has the highest positive correlation with mood (r = 0.38), while weight_kg has the largest correlation in absolute terms (r = -0.46).
Consistent with our assumption, calories_burned is almost perfectly correlated with step_count (r = 0.98926).
The two features are therefore largely redundant, so we can drop one of them. We drop calories_burned because, from domain knowledge, step_count is easy to track with step-counting aids.
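
To compare how strongly each feature relates to mood regardless of sign, the correlations can be ranked by absolute value. The sketch below copies the mood row of the matrix above by hand (rounded to three places), so the dictionary is an illustration and not part of the original notebook.

```python
# Correlations with mood, copied from the matrix above (rounded to 3 places)
mood_corr = {
    "step_count": 0.247,
    "calories_burned": 0.235,
    "hours_of_sleep": 0.210,
    "bool_of_active": 0.380,
    "weight_kg": -0.459,
}

# Sort by magnitude so that strong negative correlations are not overlooked
ranked = sorted(mood_corr.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, r in ranked:
    print(f"{name:16s} {r:+.3f}")
```

On the notebook's own `corr` DataFrame, the same ranking should be obtainable with `corr['mood'].drop('mood').abs().sort_values(ascending=False)`.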

In [15]:
# Summary statistics for weight_kg, to check how much it varies
dataDrop['weight_kg'].describe()

count    96.000000
mean     64.281250
std       0.627495
min      64.000000
25%      64.000000
50%      64.000000
75%      64.000000
max      66.000000
Name: weight_kg, dtype: float64

weight_kg barely varies: it ranges only from 64 to 66 with a standard deviation of 0.63, and at least 75% of the values are exactly 64. We therefore drop it as well.

In [16]:
features_drop = ['calories_burned', 'weight_kg']
dataDrop = dataDrop.drop(features_drop, axis=1)

In [17]:
dataDrop.isnull().sum()

step_count        0
mood              0
hours_of_sleep    0
bool_of_active    0
dtype: int64

In [19]:
# Print the value counts of the two categorical columns
print(df['mood'].value_counts())
print(df['bool_of_active'].value_counts())

300    40
100    29
200    27
Name: mood, dtype: int64
0      54
500    42
Name: bool_of_active, dtype: int64


Next, we bin step_count into fixed-width bins of 1000 steps, with every count of 10000 or more falling in the top bin.

In [25]:
dataDropScaled = dataDrop.copy()

# Bin step_count into width-1000 bins: 0 for < 1000, ..., 10 for >= 10000
dataDropScaled.loc[df['step_count']< 1000, 'step_count'] = 0
dataDropScaled.loc[(df['step_count'] >= 1000) & (df['step_count'] < 2000), 'step_count'] = 1
dataDropScaled.loc[(df['step_count'] >= 2000) & (df['step_count'] < 3000), 'step_count'] = 2
dataDropScaled.loc[(df['step_count'] >= 3000) & (df['step_count'] < 4000), 'step_count'] = 3
dataDropScaled.loc[(df['step_count'] >= 4000) & (df['step_count'] < 5000), 'step_count'] = 4
dataDropScaled.loc[(df['step_count'] >= 5000) & (df['step_count'] < 6000), 'step_count'] = 5
dataDropScaled.loc[(df['step_count'] >= 6000) & (df['step_count'] < 7000), 'step_count'] = 6
dataDropScaled.loc[(df['step_count'] >= 7000) & (df['step_count'] < 8000), 'step_count'] = 7
dataDropScaled.loc[(df['step_count'] >= 8000) & (df['step_count'] < 9000), 'step_count'] = 8
dataDropScaled.loc[(df['step_count'] >= 9000) & (df['step_count'] < 10000), 'step_count'] = 9
dataDropScaled.loc[df['step_count'] >= 10000, 'step_count'] = 10
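
The eleven assignments above implement equal-width binning with a capped top bin. As a minimal sketch, assuming step counts are non-negative integers, the same result can be obtained with integer division followed by a clip. The sample values are the first five step counts from `df.head()` plus one hypothetical value (12000) that is not from the dataset and is there only to exercise the top bin.

```python
import pandas as pd

# First five step counts from df.head(), plus one hypothetical value (12000)
steps = pd.Series([5464, 6041, 25, 5461, 6915, 12000])

# Integer-divide by the bin width, then cap at the last bin (10 = "10000 or more")
binned = (steps // 1000).clip(upper=10)
print(binned.tolist())  # [5, 6, 0, 5, 6, 10]
```

The first five results match the binned step_count values shown by `dataDropScaled.head()`.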

In [26]:
dataDropScaled.head()

Unnamed: 0,step_count,mood,hours_of_sleep,bool_of_active
0,5,200,5,0
1,6,100,8,0
2,0,100,5,0
3,5,100,4,0
4,6,200,5,500


Mood was recorded as "Happy", "Neutral" or "Sad", encoded as 300, 200 and 100 respectively. Activeness was recorded as "Active" or "Inactive", encoded as 500 and 0 respectively. We remap both columns to small consecutive integers.

In [30]:
dataDropScaledMap = dataDropScaled.copy()

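# Encode each categorical column with its own mapping, selected by name
mood_map = {100: 0, 200: 1, 300: 2}  # Sad, Neutral, Happy
active_map = {0: 0, 500: 1}          # Inactive, Active

dataDropScaledMap['mood'] = dataDropScaledMap['mood'].map(mood_map)
dataDropScaledMap['bool_of_active'] = dataDropScaledMap['bool_of_active'].map(active_map)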

## Modeling

In [42]:
# import train_test_split
from sklearn.model_selection import train_test_split
target = dataDropScaledMap['mood'] # this is like the dependent variable: y
x_train, x_test, y_train, y_test = train_test_split(dataDropScaledMap, target, random_state = 42)

In [48]:
features_drop_inXTrainTest = ['mood']
x_traincopy = x_train.drop(features_drop_inXTrainTest, axis=1)
x_testcopy = x_test.drop(features_drop_inXTrainTest, axis=1)

In [49]:
# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

import numpy as np

In [51]:
dt = DecisionTreeClassifier() 
dt.fit(x_traincopy, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [52]:
y_pred = dt.predict(x_testcopy) # let the model predict the test data

In [53]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.5

In [54]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

In [66]:
def modeltraining(clf):
    scoring = 'accuracy'
    score = cross_val_score(clf, x_traincopy, y_train, cv=k_fold, n_jobs=1, scoring=scoring)
    print(round(np.mean(score)*100, 2))

In [69]:
modeltraining(KNeighborsClassifier(n_neighbors = 13))
modeltraining(DecisionTreeClassifier())
modeltraining(RandomForestClassifier(n_estimators=13))
modeltraining(GaussianNB())
modeltraining(SVC(gamma='scale'))

51.25
45.71
55.0
48.39
50.0


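Random Forest has the highest mean cross-validated accuracy (55.0%), so we explore it further. The margin over the other models is small, and because RandomForestClassifier is given no random_state here, its score will vary between runs.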
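
With 10 folds over a small training set, each fold's accuracy is computed on only a handful of rows, so the mean alone says little about whether one model is really better. A minimal sketch of reporting the spread alongside the mean follows. The data is a synthetic stand-in generated with `make_classification` (72 rows, 3 features, 3 classes); it is an assumption made for illustration, not the notebook's `x_traincopy` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (not the fitness dataset)
X, y = make_classification(n_samples=72, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=13, random_state=0)

# One accuracy score per fold; report both the mean and the standard deviation
scores = cross_val_score(clf, X, y, cv=k_fold, scoring="accuracy")
print(f"{scores.mean() * 100:.2f} +/- {scores.std() * 100:.2f}")
```

If the standard deviation is comparable to the gap between two models' means, the ranking between them should be treated as tentative.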

#### Random Forest

In [76]:
n_trees = np.arange(1, 20, 4) # array of tree counts from 1 to 17 in steps of 4: [1, 5, 9, 13, 17]

# for each n, we want to check the accuracy of random forest with n trees
for n in n_trees:
    rdf = RandomForestClassifier(n_estimators = n, random_state = 42)
    rdf.fit(x_traincopy, y_train)
    # we will print both the number of trees, and the accuracy score
    print(n)
    print(accuracy_score(y_test, rdf.predict(x_testcopy)))

1
0.5833333333333334
5
0.4166666666666667
9
0.4583333333333333
13
0.5
17
0.4583333333333333
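
The loop above scores each forest on the test set, so choosing n from those numbers uses the test data for model selection. One alternative, sketched here under the assumption that cross-validation on the training split is acceptable, is to pick n by cross-validated accuracy and touch the test set only once at the end. The data is again a synthetic stand-in from `make_classification`, and `cv_means`, `best_n` and `final` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the 96-row dataset (not the fitness data)
X, y = make_classification(n_samples=96, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Mean 5-fold cross-validated accuracy on the training split for each n
cv_means = {}
for n in np.arange(1, 20, 4):
    rdf = RandomForestClassifier(n_estimators=int(n), random_state=42)
    cv_means[int(n)] = cross_val_score(rdf, x_train, y_train, cv=5).mean()

# Select n without looking at the test set, then evaluate once
best_n = max(cv_means, key=cv_means.get)
final = RandomForestClassifier(n_estimators=best_n, random_state=42)
final.fit(x_train, y_train)
print(best_n, final.score(x_test, y_test))
```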


As an example, we predict the author's mood on a day with the following inputs:
    step count of 4500, which falls in bin 4
    hours of sleep: 6
    activity: inactive, encoded as 0

In [80]:
from sklearn.ensemble import RandomForestClassifier

rdForest = RandomForestClassifier(n_estimators = 1, random_state = 42)
rdForest.fit(x_traincopy, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [81]:
rdForest.predict([[4,6,0]])

array([0], dtype=int64)

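The output is 0, which corresponds to "Sad" (original code 100), so the model predicts that he is most likely to be unhappy. The 58% figure is the model's accuracy on the test set, not the confidence of this individual prediction. Since n_estimators = 1 was chosen by comparing scores on that same test set, the estimate is likely optimistic.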
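
To put the 58% test accuracy in context, it helps to compare it with a majority-class baseline, that is, always predicting the most common mood. The sketch below uses the mood counts printed by `value_counts()` earlier. It assumes the default 25% test split of the 96 rows (24 test rows), and the baseline is computed over the whole dataset, not over the test split alone.

```python
# Mood counts from value_counts() above (300 = Happy, 100 = Sad, 200 = Neutral)
mood_counts = {300: 40, 100: 29, 200: 27}

total = sum(mood_counts.values())
# Accuracy obtained by always predicting the most common mood
baseline = max(mood_counts.values()) / total
print(f"majority-class baseline: {baseline:.3f}")  # 0.417

# Reported test accuracy for n_estimators=1, assuming 24 held-out rows
test_accuracy = 0.5833333333333334
print(f"correct test predictions: {round(test_accuracy * 24)} of 24")  # 14 of 24
```

The model beats this baseline, but on 24 rows the difference amounts to only a few predictions.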