This is the precursor notebook to https://www.kaggle.com/schorsi/stress-and-achievement-prediction-prescription. It covers the data preprocessing, model selection, and training methods used.

In [1]:
import numpy as np
import pandas as pd

wellbeing = pd.read_csv('../input/lifestyle-and-wellbeing-data/Wellbeing_and_lifestyle_data_Kaggle.csv')
wellbeing = wellbeing.drop('Timestamp', axis=1)
wellbeing = wellbeing.drop([10005])
wellbeing.head()

Unnamed: 0,FRUITS_VEGGIES,DAILY_STRESS,PLACES_VISITED,CORE_CIRCLE,SUPPORTING_OTHERS,SOCIAL_NETWORK,ACHIEVEMENT,DONATION,BMI_RANGE,TODO_COMPLETED,...,SLEEP_HOURS,LOST_VACATION,DAILY_SHOUTING,SUFFICIENT_INCOME,PERSONAL_AWARDS,TIME_FOR_PASSION,WEEKLY_MEDITATION,AGE,GENDER,WORK_LIFE_BALANCE_SCORE
0,3,2,2,5,0,5,2,0,1,6,...,7,5,5,1,4,0,5,36 to 50,Female,609.5
1,2,3,4,3,8,10,5,2,2,5,...,8,2,2,2,3,2,6,36 to 50,Female,655.6
2,2,3,3,4,4,10,3,2,2,2,...,8,10,2,2,4,8,3,36 to 50,Female,631.6
3,3,3,10,3,10,7,2,5,2,3,...,5,7,5,1,5,2,0,51 or more,Female,622.7
4,5,1,3,3,10,4,2,4,2,5,...,7,0,0,2,8,1,5,51 or more,Female,663.9


### Data Preprocessing:
Here I change all values to integer values, drop unnecessary columns, scale features, and split the data.

I chose to drop 'BMI_RANGE', 'SUFFICIENT_INCOME', 'GENDER', and 'WORK_LIFE_BALANCE_SCORE' from the model's inputs. The first three contribute very little meaningful signal (in fact, only the random forest models improve their scores when these features are added). The fourth is a score computed from the user's survey responses, including the target values, so even if a user knew the right value to enter, using it would be data leakage. Even though these columns are dropped, I still encode them and compute their scaling statistics so that they are easy to include in later testing if needed.


In [2]:
from sklearn.model_selection import train_test_split
from sklearn import metrics

age_dict = {'Less than 20' : 1, '21 to 35' : 2, '36 to 50' : 3, '51 or more' : 4}
wellbeing['AGE'] = pd.Series([age_dict[x] for x in wellbeing.AGE], index=wellbeing.index)
gender_dict = {'Female' : 1, 'Male' : 0}
wellbeing['GENDER'] = pd.Series([gender_dict[x] for x in wellbeing.GENDER], index=wellbeing.index)
wellbeing['DAILY_STRESS'] = wellbeing['DAILY_STRESS'].astype(int)

X = wellbeing.drop(['DAILY_STRESS', 'ACHIEVEMENT', 'BMI_RANGE', 'SUFFICIENT_INCOME', 'GENDER', 'WORK_LIFE_BALANCE_SCORE'], axis=1)
# Z-score every feature column (pandas std uses the sample standard deviation, ddof=1)
X = (X - X.mean()) / X.std()
y = wellbeing[['DAILY_STRESS', 'ACHIEVEMENT']]
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state = 1, test_size=.2)
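
The cell above standardizes every feature with the mean and standard deviation of the full dataset before splitting, so the test rows also contribute to the scaling statistics. A minimal sketch of one alternative, deriving the statistics from the training split only, is shown below. The toy frame and the `standardize` helper are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def standardize(df, mean, std):
    """Z-score df using externally supplied per-column statistics."""
    return (df - mean) / std

# Toy stand-in for the feature matrix (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 11, size=(100, 3)),
                   columns=['FLOW', 'DAILY_STEPS', 'SLEEP_HOURS'])

# Split first, then derive the statistics from the training rows only.
train, test = toy.iloc[:80], toy.iloc[80:]
train_mean, train_std = train.mean(), train.std()  # pandas std uses ddof=1

train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)  # reuses the training statistics
```

With this ordering the test rows are scaled the same way a new user's answers would be at inference time.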

### Model selection:
A few different models were tested; those that are commented out didn't perform very well. Ultimately, forest-based models performed best by a small margin. I chose a neural net in the end, mostly because I wanted to practice deploying one.


In [3]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()
rfr.fit(x_train, y_train)
preds = rfr.predict(x_test)
print('Mean squared error stress: ', metrics.mean_squared_error(preds[:,0], y_test['DAILY_STRESS']))
print('Mean squared error achievement: ', metrics.mean_squared_error(preds[:,1], y_test['ACHIEVEMENT']))

Mean squared error stress:  1.4644580091322184
Mean squared error achievement:  4.931072258894106


In [4]:
# Print each feature name next to its importance in the fitted forest
for name, importance in zip(X.columns, rfr.feature_importances_):
    print(name, ':', importance)

FRUITS_VEGGIES : 0.03730546146156958
PLACES_VISITED : 0.05489012349405303
CORE_CIRCLE : 0.05368717208696381
SUPPORTING_OTHERS : 0.06542801087657796
SOCIAL_NETWORK : 0.04627427546571865
DONATION : 0.03721989949938497
TODO_COMPLETED : 0.0618167761848739
FLOW : 0.08639834490520465
DAILY_STEPS : 0.05288629563606017
LIVE_VISION : 0.06834696900527758
SLEEP_HOURS : 0.03926995676251028
LOST_VACATION : 0.03963585070799972
DAILY_SHOUTING : 0.05893716370961288
PERSONAL_AWARDS : 0.13240135430028943
TIME_FOR_PASSION : 0.08415458595444668
WEEKLY_MEDITATION : 0.05127473024554964
AGE : 0.030073029703907166
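
The importances above are printed in column order, which makes the ranking hard to read. The sketch below sorts them, using the printed values rounded to four decimals. In the notebook itself the same series could be built from `rfr.feature_importances_` and `X.columns`. Since the forest is fit on both targets at once, the ranking describes the joint model rather than stress or achievement alone.

```python
import pandas as pd

# Values copied from the output above, rounded to four decimals.
importances = pd.Series({
    'FRUITS_VEGGIES': 0.0373, 'PLACES_VISITED': 0.0549, 'CORE_CIRCLE': 0.0537,
    'SUPPORTING_OTHERS': 0.0654, 'SOCIAL_NETWORK': 0.0463, 'DONATION': 0.0372,
    'TODO_COMPLETED': 0.0618, 'FLOW': 0.0864, 'DAILY_STEPS': 0.0529,
    'LIVE_VISION': 0.0683, 'SLEEP_HOURS': 0.0393, 'LOST_VACATION': 0.0396,
    'DAILY_SHOUTING': 0.0589, 'PERSONAL_AWARDS': 0.1324, 'TIME_FOR_PASSION': 0.0842,
    'WEEKLY_MEDITATION': 0.0513, 'AGE': 0.0301,
})

# Largest contribution first.
ranked = importances.sort_values(ascending=False)
print(ranked.head(5))
```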


In [5]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(x_train, y_train)
preds = lr.predict(x_test)
print('Mean squared error stress: ', metrics.mean_squared_error(preds[:,0], y_test['DAILY_STRESS']))
print('Mean squared error achievement: ', metrics.mean_squared_error(preds[:,1], y_test['ACHIEVEMENT']))

Mean squared error stress:  1.5124278706320686
Mean squared error achievement:  5.082810186428724


In [6]:
# from sklearn.svm import LinearSVR

# svm = LinearSVR(max_iter=10000)
# svm.fit(x_train, y_train)
# print('Mean squared error: ', metrics.mean_squared_error(svm.predict(x_test), y_test))

In [7]:
from sklearn.ensemble import ExtraTreesRegressor

etr = ExtraTreesRegressor()
etr.fit(x_train, y_train)
preds = etr.predict(x_test)
print('Mean squared error stress: ', metrics.mean_squared_error(preds[:,0], y_test['DAILY_STRESS']))
print('Mean squared error achievement: ', metrics.mean_squared_error(preds[:,1], y_test['ACHIEVEMENT']))

Mean squared error stress:  1.46268338028169
Mean squared error achievement:  4.888890054773083


In [8]:
# from sklearn.gaussian_process import GaussianProcessRegressor

# gpr = GaussianProcessRegressor()
# gpr.fit(x_train, y_train)
# print('Mean squared error: ', metrics.mean_squared_error(gpr.predict(x_test), y_test))

In [9]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()
knn.fit(x_train, y_train)
preds = knn.predict(x_test)
print('Mean squared error stress: ', metrics.mean_squared_error(preds[:,0], y_test['DAILY_STRESS']))
print('Mean squared error achievement: ', metrics.mean_squared_error(preds[:,1], y_test['ACHIEVEMENT']))

Mean squared error stress:  1.7745852895148668
Mean squared error achievement:  5.950622848200313


In [10]:
# from sklearn.ensemble import GradientBoostingRegressor

# gbr = GradientBoostingRegressor()
# gbr.fit(x_train, y_train)
# print('Mean squared error: ', metrics.mean_squared_error(gbr.predict(x_test), y_test))

In [11]:
# from sklearn.experimental import enable_hist_gradient_boosting
# from sklearn.ensemble import HistGradientBoostingRegressor

# hgb = HistGradientBoostingRegressor()
# hgb.fit(x_train, y_train)
# print('Mean squared error: ', metrics.mean_squared_error(hgb.predict(x_test), y_test))
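
Three of the commented-out estimators (`LinearSVR`, `GradientBoostingRegressor`, and `HistGradientBoostingRegressor`) accept only a one-dimensional target in scikit-learn, whereas `y_train` here has two columns. One way to try them on both targets is `sklearn.multioutput.MultiOutputRegressor`, which fits one copy of the estimator per target column. The sketch below uses synthetic data in place of the wellbeing features, so its scores say nothing about this dataset, and it is not the configuration the original notebook ran.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Synthetic two-target regression problem (assumption: stands in for X and y).
rng = np.random.default_rng(1)
features = rng.normal(size=(300, 5))
targets = np.column_stack([
    features[:, 0] + 0.1 * rng.normal(size=300),
    features[:, 1] - features[:, 2] + 0.1 * rng.normal(size=300),
])

f_train, f_test, t_train, t_test = train_test_split(
    features, targets, test_size=0.2, random_state=1)

# One independent gradient-boosting model is fit per target column.
gbr = MultiOutputRegressor(GradientBoostingRegressor(random_state=1))
gbr.fit(f_train, t_train)
preds = gbr.predict(f_test)

# 'raw_values' returns one MSE per target instead of their average.
mse_per_target = mean_squared_error(t_test, preds, multioutput='raw_values')
print('MSE per target:', mse_per_target)
```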

Below I create and train a neural net using Keras.

In [12]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks, initializers

In [13]:
model = keras.Sequential([
    layers.Dense(units=34, activation='relu'),
    layers.BatchNormalization(),
    layers.GaussianNoise(.2),
    layers.Dense(units=10, activation='relu'),
    layers.GaussianNoise(.2),
    layers.Dense(units=10, activation='relu'),
    layers.Dense(units=2)
])

model.compile(optimizer='Adam', loss='mean_squared_error', metrics=['mean_squared_error'])

In [14]:
def learn_scheduler(epoch, lr):
    if epoch <2:
        return 0.001
    else:
        return lr*.9
schedule = callbacks.LearningRateScheduler(learn_scheduler)
    
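# EarlyStopping monitors 'val_loss' by default, so it only takes effect
# when validation_data is passed to fit()
early_stopping = keras.callbacks.EarlyStopping(
    patience=20,
    min_delta=0.01,
    restore_best_weights=True,
)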

history = model.fit(
    X, y,                               # Comment out this line when testing
#     x_train, y_train,                 # Comment out these lines when producing the final model
#     validation_data=(x_test, y_test), # Comment out these lines when producing the final model
    batch_size=64,
    epochs=30,
    callbacks=[early_stopping, schedule])

Epoch 1/30
...
Epoch 30/30
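
To see what `learn_scheduler` does over the 30 epochs of the training cell above, the schedule can be traced without Keras. The sketch below assumes the optimizer starts at 0.001 (the Keras default for Adam) and that the callback receives a zero-based epoch index together with the current learning rate, which is how `LearningRateScheduler` calls its schedule function.

```python
def learn_scheduler(epoch, lr):
    # Same rule as the training cell: hold 0.001 for two epochs,
    # then decay the current rate by 10% per epoch.
    if epoch < 2:
        return 0.001
    return lr * 0.9

lr = 0.001  # assumed starting rate (Adam's default in Keras)
rates = []
for epoch in range(30):
    lr = learn_scheduler(epoch, lr)
    rates.append(lr)

print(f'epoch 1: {rates[0]:.6f}, epoch 3: {rates[2]:.6f}, epoch 30: {rates[-1]:.6f}')
```

By the final epoch the rate has been multiplied by 0.9 a total of 28 times, which leaves it at roughly 5% of its starting value.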


The error for predicting achievement is still higher than I would like. I plan to revisit this a few times to see if I can make some improvements.

In [15]:
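# These scores are a fair held-out estimate only when the model was fit on x_train;
# as written above, fit() uses all of X, which includes the x_test rows.
preds = model.predict(x_test)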
print('Mean squared error stress: ', metrics.mean_squared_error(preds[:,0], y_test['DAILY_STRESS']))
print('Mean absolute error stress: ', metrics.mean_absolute_error(preds[:,0], y_test['DAILY_STRESS']))
print('Mean squared error achievement: ', metrics.mean_squared_error(preds[:,1], y_test['ACHIEVEMENT']))
print('Mean absolute error achievement: ', metrics.mean_absolute_error(preds[:,1], y_test['ACHIEVEMENT']))

Mean squared error stress:  1.5299365506138674
Mean absolute error stress:  1.0089473485946656
Mean squared error achievement:  4.908206464547079
Mean absolute error achievement:  1.749875707227113


Below I create a dictionary that stores a description of each column, a sort of dataframe schema. I also create a stats dictionary holding each column's mean and standard deviation for feature scaling. Both dictionaries will be exported for the inference/prescription notebook.

In [16]:
wellbeing_dict = {
     'FRUITS_VEGGIES' : '[Between 1 and 5] HOW MANY FRUITS OR VEGETABLES DO YOU EAT EVERYDAY? In a typical day, averaging workdays and weekends.',
     'DAILY_STRESS' : '[Between 1 and 10] HOW MUCH STRESS DO YOU TYPICALLY EXPERIENCE EVERYDAY? At work or at home, due to the environment (noise, pollution, insecurity...), your co-workers or boss, or because of tragic events such as divorce, job loss, serious illness, loss of family or friends,... In average over 12 months.',
     'PLACES_VISITED' : '[Between 0 and 10] HOW MANY NEW PLACES DO YOU VISIT? Over a period of 12 months. Include new states, new cities as well as museum, places of interest and parks in your neighborhood.',
     'CORE_CIRCLE' : '[Between 0 and 10] HOW MANY PEOPLE ARE VERY CLOSE TO YOU? i.e. close family and friends ready to provide you with a long-term unconditional support.',
     'SUPPORTING_OTHERS' : '[Between 0 and 10] HOW MANY PEOPLE DO YOU HELP ACHIEVE A BETTER LIFE? A reflection of your altruism or selflessness  e.g.: caring for your family, actively supporting a friend, mentoring, coaching, developing or promoting a co-worker, ... Over a period of 12 months.',
     'SOCIAL_NETWORK' : '[Between 0 and 10] WITH HOW MANY PEOPLE DO YOU INTERACT WITH DURING A TYPICAL DAY? True interactions and dialogues at home, at work, at the gym, ... Average of workdays and weekends',
     'ACHIEVEMENT' : '[Between 0 and 10] HOW MANY REMARKABLE ACHIEVEMENTS ARE YOU PROUD OF? Over the last 12 months, personal achievements known to your family, close friends or co-workers such as: running a marathon or important race, birth, successful kids, new house or major renovation, major success at work, opening a new business, ...',
     'DONATION' : '[Between 0 and 5] HOW MANY TIMES DO YOU DONATE YOUR TIME OR MONEY TO GOOD CAUSES? Over a period of 12 months. Include financial donation, your time contribution, fundraising, volunteering, serving your country and the poor, ...',
     'BMI_RANGE' : '[1 if below 25, else 2] WHAT IS YOUR BODY MASS INDEX (BMI) RANGE? Your body mass in kg divided by the square of your height in meters ► Check the online BMI calculator such as www.cdc.gov/healthyweight/assessing/bmi/index.html. ► For instance, an adult of 6 feet and 184 pounds has a BMI of 25',
     'TODO_COMPLETED' : '[Between 1 and 10] HOW WELL DO YOU COMPLETE YOUR WEEKLY TO-DO LISTS? Include your weekly goals, work- and personal-related tasks. On a scale of 0 = not at all to 10 = very well.',
     'FLOW' : '[Between 0 and 10] IN A TYPICAL DAY, HOW MANY HOURS DO YOU EXPERIENCE "FLOW"? `Flow` is defined as the mental state, in which you are fully immersed in performing an activity. You then experience a feeling of energized focus, full involvement, and enjoyment in the process of this activity',
     'DAILY_STEPS' : '[Between 0 and 10] HOW MANY STEPS (IN THOUSANDS) DO YOU TYPICALLY WALK EVERYDAY? Thousand steps, daily average over multiple days including work days and week-end.',
     'LIVE_VISION' : '[Between 0 and 10] FOR HOW MANY YEARS AHEAD IS YOUR LIFE VISION VERY CLEAR FOR? For instance, illustrated in a vision board, detailed in a personal journal or openly discussed with your spouse or close friends.',
     'SLEEP_HOURS' : '[Between 0 and 10] ABOUT HOW LONG DO YOU TYPICALLY SLEEP? Over the course of a typical working week, including week-end.',
     'LOST_VACATION' : '[Between 0 and 10] HOW MANY DAYS OF VACATION DO YOU TYPICALLY LOSE EVERY YEAR ? Unused vacation days, lost or carried forward into the following year. Or because of work stress during your vacation.',
     'DAILY_SHOUTING' : '[Between 0 and 10] HOW OFTEN DO YOU SHOUT OR SULK AT SOMEBODY? In a typical week. Expressing your negative emotions in an active or passive manner.',
     'SUFFICIENT_INCOME' : '[1 for insufficient, 2 for sufficient] HOW SUFFICIENT IS YOUR INCOME TO COVER BASIC LIFE EXPENSES? Such as the costs of housing, food, health care, car and education.',
     'PERSONAL_AWARDS' : '[Between 0 and 10] HOW MANY RECOGNITIONS HAVE YOU RECEIVED IN YOUR LIFE? Significant public recognitions validating a personal level of expertise and engagement E.g.: diploma, degree, certificate, accreditation, award, prize, published book, presentation at major conference, medals, cups, titles...',
     'TIME_FOR_PASSION' : '[Between 0 and 10] HOW MANY HOURS DO YOU SPEND EVERYDAY DOING WHAT YOU ARE PASSIONATE ABOUT? Daily hours spent doing what you are passionate and dreaming about, and/or contributing to a greater cause: health, education, peace, society development, ...',
     'WEEKLY_MEDITATION' : '[Between 0 and 10] IN A TYPICAL WEEK, HOW MANY TIMES DO YOU HAVE THE OPPORTUNITY TO THINK ABOUT YOURSELF? Include meditation, praying and relaxation activities such as fitness, walking in a park or lunch breaks.',
     'AGE' : "[1 = 'Less than 20' 2 = '21 to 35' 3 = '36 to 50' 4 = '51 or more']",
     'GENDER' : "[1 = 'Female' 0 = 'Male']"
 }

In [17]:
wellbeing_stats = {}

for col in wellbeing.columns:
    wellbeing_stats[col] = (wellbeing[col].mean(), wellbeing[col].std())
wellbeing_stats

{'FRUITS_VEGGIES': (2.9226723436228164, 1.4427392618232717),
 'DAILY_STRESS': (2.7916849289336922, 1.3678007467520172),
 'PLACES_VISITED': (5.233235238870453, 3.3118466202433825),
 'CORE_CIRCLE': (5.508296287020224, 2.8402868211101433),
 'SUPPORTING_OTHERS': (5.616179325026611, 3.2419369588232674),
 'SOCIAL_NETWORK': (6.474046709661261, 3.08664272775673),
 'ACHIEVEMENT': (4.000688748356396, 2.7559123263254053),
 'DONATION': (2.715171247886795, 1.851556132858851),
 'BMI_RANGE': (1.410619247385887, 0.491961619609354),
 'TODO_COMPLETED': (5.745977083463778, 2.6241786624245114),
 'FLOW': (3.194477490451443, 2.3572846752920262),
 'DAILY_STEPS': (5.703587752801954, 2.8911020666495197),
 'LIVE_VISION': (3.7521758186713416, 3.2310825267930916),
 'SLEEP_HOURS': (7.042952852044331, 1.199053434423808),
 'LOST_VACATION': (2.8984409241750675, 3.6918674791997406),
 'DAILY_SHOUTING': (2.930999937386513, 2.6763413227911825),
 'SUFFICIENT_INCOME': (1.7289462150147141, 0.4445177193376449),
 'PERSONAL_AW
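
The notebook does not show how the two dictionaries are exported. One simple option, sketched below as an assumption, is to write them as JSON. The file name is hypothetical and only two entries of the stats dictionary are reproduced here. JSON has no tuple type, so the `(mean, std)` pairs come back as lists and are converted on load.

```python
import json
import os
import tempfile

# Two entries copied from the output above; the real dictionary has one per column.
wellbeing_stats = {
    'FRUITS_VEGGIES': (2.9226723436228164, 1.4427392618232717),
    'DAILY_STRESS': (2.7916849289336922, 1.3678007467520172),
}

# Hypothetical output location and file name.
path = os.path.join(tempfile.gettempdir(), 'wellbeing_stats.json')

with open(path, 'w') as f:
    json.dump(wellbeing_stats, f)

with open(path) as f:
    # Convert the loaded lists back to (mean, std) tuples.
    restored = {col: tuple(pair) for col, pair in json.load(f).items()}
```

The inference notebook can then index `restored[col][0]` and `restored[col][1]` exactly as the scaling cell later in this notebook does with `wellbeing_stats`.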

### User input:
When the cell below is run, it prompts the user to fill in a value for each of the survey questions. With these values we can later predict the user's levels of stress and achievement, and show where they can improve.

In [18]:
list_pred = []

for col in X.columns:
    print(wellbeing_dict[col])
    try:
        list_pred.append(int(input()))
    except Exception:
        # Fall back to a default of 3 when the entry is not an integer or no input is available
        list_pred.append(3)

[Between 1 and 5] HOW MANY FRUITS OR VEGETABLES DO YOU EAT EVERYDAY? In a typical day, averaging workdays and weekends.
[Between 0 and 10] HOW MANY NEW PLACES DO YOU VISIT? Over a period of 12 months. Include new states, new cities as well as museum, places of interest and parks in your neighborhood.
[Between 0 and 10] HOW MANY PEOPLE ARE VERY CLOSE TO YOU? i.e. close family and friends ready to provide you with a long-term unconditional support.
[Between 0 and 10] HOW MANY PEOPLE DO YOU HELP ACHIEVE A BETTER LIFE? A reflection of your altruism or selflessness  e.g.: caring for your family, actively supporting a friend, mentoring, coaching, developing or promoting a co-worker, ... Over a period of 12 months.
[Between 0 and 10] WITH HOW MANY PEOPLE DO YOU INTERACT WITH DURING A TYPICAL DAY? True interactions and dialogues at home, at work, at the gym, ... Average of workdays and weekends
[Between 0 and 5] HOW MANY TIMES DO YOU DONATE YOUR TIME OR MONEY TO GOOD CAUSES? Over a period of 1
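
The input cell above accepts any integer, including values outside the range stated in each question. A possible refinement, sketched below, reads the range from the `[Between a and b]` prefix of the description and clamps the answer to it. The `parse_answer` helper is hypothetical and not part of the original notebook. Descriptions without that prefix, such as the one for 'AGE', are passed through unchanged.

```python
import re

def parse_answer(raw, description, default=3):
    """Return raw as an int, clamped to the '[Between a and b]' range if one is given."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default  # same fallback value as the input cell above
    match = re.match(r'\[Between (\d+) and (\d+)\]', description)
    if match is None:
        return value  # no explicit range in the description
    low, high = int(match.group(1)), int(match.group(2))
    return max(low, min(high, value))

question = '[Between 1 and 5] HOW MANY FRUITS OR VEGETABLES DO YOU EAT EVERYDAY?'
print(parse_answer('7', question))    # 5: clamped to the upper bound
print(parse_answer('abc', question))  # 3: not an integer, so the default is used
```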

### Prediction batch:
Below I generate several additional input records. Each record is a copy of the user's input with one entry changed by one unit. This way we can run inference on the whole set and see the impact that small adjustments can have on quality of life.

In [19]:
pred_dict = {'Actual': list_pred}
for idx, num in enumerate(list_pred):
    col = X.columns[idx]
    # One copy of the input with this answer raised by 1 (capped at 10)
    if num < 10:
        plus_1 = list_pred.copy()
        plus_1[idx] = num + 1
        pred_dict[f'{col} plus 1'] = plus_1
    # One copy of the input with this answer lowered by 1 (floored at 0)
    if num > 0:
        minus_1 = list_pred.copy()
        minus_1[idx] = num - 1
        pred_dict[f'{col} minus 1'] = minus_1

personal_preds_batch = pd.DataFrame.from_dict(pred_dict, orient='index', columns = X.columns)

In [20]:
# Feature scaling, using the stats dictionary
personal_preds_batch_preprocessed = personal_preds_batch.copy()
for col in X.columns:
    personal_preds_batch_preprocessed[col] = (personal_preds_batch_preprocessed[col]-wellbeing_stats[col][0])/wellbeing_stats[col][1]

Below you can see your results based on your input. The 'Actual' row holds the predictions for the values you entered; each of the following rows shows what those predictions would have been had you entered a slightly different value (the row name explains the change). A color gradient is applied to make the differences more noticeable.

In [21]:
personal_preds_batch_preprocessed[['DAILY_STRESS', 'ACHIEVEMENT']] = model.predict(personal_preds_batch_preprocessed)
personal_preds_batch_preprocessed[['DAILY_STRESS', 'ACHIEVEMENT']].style.background_gradient()

Unnamed: 0,DAILY_STRESS,ACHIEVEMENT
Actual,2.92604,2.613681
FRUITS_VEGGIES plus 1,3.004813,2.752305
FRUITS_VEGGIES minus 1,2.847325,2.445579
PLACES_VISITED plus 1,2.91915,2.624651
PLACES_VISITED minus 1,2.932931,2.602712
CORE_CIRCLE plus 1,2.889912,2.678965
CORE_CIRCLE minus 1,2.962168,2.548398
SUPPORTING_OTHERS plus 1,2.992249,2.698792
SUPPORTING_OTHERS minus 1,2.859831,2.528571
SOCIAL_NETWORK plus 1,2.921735,2.626786
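
The table is easier to read as changes relative to the 'Actual' row. The sketch below subtracts that row from the others, using a few rows copied from the output above. In the notebook the same subtraction could be applied to the full prediction frame.

```python
import pandas as pd

# A few rows copied from the prediction table above.
preds = pd.DataFrame(
    {'DAILY_STRESS': [2.92604, 3.004813, 2.847325, 2.889912, 2.962168],
     'ACHIEVEMENT': [2.613681, 2.752305, 2.445579, 2.678965, 2.548398]},
    index=['Actual', 'FRUITS_VEGGIES plus 1', 'FRUITS_VEGGIES minus 1',
           'CORE_CIRCLE plus 1', 'CORE_CIRCLE minus 1'])

# Change in each prediction relative to the unmodified input.
deltas = preds.drop(index='Actual') - preds.loc['Actual']
print(deltas.sort_values('DAILY_STRESS'))
```

Sorting by `DAILY_STRESS` puts the adjustments with the largest predicted stress reduction first.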


### Export model
Below I save the model which will be loaded in the prediction/prescription notebook and possibly later deployed in a web app.

In [22]:
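# Under TF 2.x this writes a SavedModel directory; newer Keras releases
# expect a '.keras' or '.h5' file extension in the path instead.
model.save('wellbeing_model')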