# Wellness Summary 

This is a follow-up to the wellness factor analysis. In that notebook, we found that wellness can be summarized in terms of `MonitoringScore`, `Pain`, `Illness`, and `Nutrition`.

This notebook applies the same processing with the goal of finding a summary measure for wellness. In doing so, I want to improve the imputation of nutrition by sampling from the given player's nutrition distribution instead of the overall distribution.

## Load Data

In [22]:
import pandas as pd
import numpy as np
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo
from scipy.stats import pointbiserialr, ttest_ind
import matplotlib.pyplot as plt
import pingouin as pg
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler


In [2]:
np.random.seed(5151)
wellness_df = pd.read_csv('./raw_data/wellness.csv')
wellness_df.head()


Unnamed: 0,Date,PlayerID,Fatigue,Soreness,Desire,Irritability,BedTime,WakeTime,SleepHours,SleepQuality,MonitoringScore,Pain,Illness,Menstruation,Nutrition,NutritionAdjustment,USGMeasurement,USG,TrainingReadiness
0,2018-07-21,1,3,3,2,3,23:00:00,07:00:00,8.0,2,13,No,No,Yes,Excellent,Yes,No,,0%
1,2018-07-21,2,4,3,4,4,23:00:00,07:00:00,8.0,4,19,Yes,No,Yes,,,Yes,1.01,0%
2,2018-07-21,3,3,3,5,4,22:30:00,06:30:00,8.0,4,19,No,No,No,,,Yes,1.016,100%
3,2018-07-21,4,2,3,5,4,00:30:00,07:00:00,6.5,1,15,No,No,Yes,Excellent,Yes,Yes,1.025,95%
4,2018-07-21,5,5,3,4,4,23:45:00,07:00:00,7.25,4,20,No,No,No,Okay,Yes,Yes,1.022,100%


## Impute Nutrition Values

For this to work, I need to make sure the missing nutrition values do not all stem from a small number of players who skipped the nutrition question on most days. If they do, the observed distribution of those players' responses will be unreliable and my imputed values will have high bias.

In [3]:
missing_nutrition_df = wellness_df[wellness_df['Nutrition'].isnull()]
missing_nutrition_df.head()


Unnamed: 0,Date,PlayerID,Fatigue,Soreness,Desire,Irritability,BedTime,WakeTime,SleepHours,SleepQuality,MonitoringScore,Pain,Illness,Menstruation,Nutrition,NutritionAdjustment,USGMeasurement,USG,TrainingReadiness
1,2018-07-21,2,4,3,4,4,23:00:00,07:00:00,8.0,4,19,Yes,No,Yes,,,Yes,1.01,0%
2,2018-07-21,3,3,3,5,4,22:30:00,06:30:00,8.0,4,19,No,No,No,,,Yes,1.016,100%
13,2018-07-20,2,4,4,5,4,22:00:00,07:00:00,9.0,3,20,Yes,No,Yes,,,Yes,1.017,0%
14,2018-07-20,3,4,4,6,4,22:30:00,06:30:00,8.0,4,22,No,No,No,,,Yes,1.016,100%
25,2018-07-19,2,4,4,5,4,22:15:00,08:00:00,9.75,5,22,Yes,No,Yes,,,Yes,1.023,0%


In [4]:
missing_nutrition_df.shape


(837, 19)

In [5]:
missing_nutrition_df['PlayerID'].value_counts()


2     280
8     182
14     73
6      72
17     71
1      56
4      51
5      24
3      20
15      5
13      1
10      1
7       1
Name: PlayerID, dtype: int64

Unfortunately, it appears that the majority of missing nutrition values come from players 2 and 8 (280 and 182 of the 837 missing rows). Let's take a look at their nutrition distributions.
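
Raw counts alone do not show how much of each player's record is missing. A minimal sketch of a complementary check, the share of each player's rows without a nutrition response, is given here on a made-up toy frame (`toy_df` and its values are assumptions, not the real data):

```python
import pandas as pd

# Toy stand-in for wellness_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "PlayerID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Nutrition": ["Okay", None, "Excellent", "Okay", None, None, None, "Poor"],
})

# Fraction of each player's rows that have no nutrition response,
# sorted so the players with the least information come first.
missing_share = (
    toy_df["Nutrition"].isnull()
    .groupby(toy_df["PlayerID"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```

A player with a share close to 1 has very few observed responses, so their empirical distribution is a weak basis for imputation.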

### Nutrition For Player 2

In [6]:
wellness_df[wellness_df['PlayerID'] == 2]['Nutrition'].value_counts(normalize=True)


Excellent    0.593220
Okay         0.389831
Poor         0.016949
Name: Nutrition, dtype: float64

#### Compared to average distribution

In [7]:
wellness_df['Nutrition'].value_counts(normalize=True)

Excellent    0.649976
Okay         0.334931
Poor         0.015093
Name: Nutrition, dtype: float64

The distribution for player 2 is close to the overall distribution (each category is within about six percentage points), so the samples we have from this player are unlikely to be strongly biased.

### Nutrition For Player 8

In [8]:
wellness_df[wellness_df['PlayerID'] == 8]['Nutrition'].value_counts(normalize=True)


Excellent    0.496
Okay         0.392
Poor         0.112
Name: Nutrition, dtype: float64

The distribution for this player does differ from the overall distribution, but we have more of this player's nutrition responses, so we can impute their missing values from their own nutrition distribution without adding too much bias to the dataset.
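
To put a number on "close" versus "different", one option is the total variation distance between a player's distribution and the overall one. This is a sketch using the proportions printed above; the choice of metric and the helper name `total_variation` are my assumptions, not something taken from the earlier analysis.

```python
import pandas as pd

labels = ["Excellent", "Okay", "Poor"]

# Proportions copied from the value_counts(normalize=True) outputs above.
overall = pd.Series([0.649976, 0.334931, 0.015093], index=labels)
player_2 = pd.Series([0.593220, 0.389831, 0.016949], index=labels)
player_8 = pd.Series([0.496, 0.392, 0.112], index=labels)


def total_variation(p, q):
    """Half the L1 distance between two distributions over the same labels."""
    return 0.5 * (p - q).abs().sum()


tv_player_2 = total_variation(player_2, overall)
tv_player_8 = total_variation(player_8, overall)
print(round(tv_player_2, 3), round(tv_player_8, 3))
```

The distance is roughly 0.057 for player 2 and 0.154 for player 8, which matches the reading that player 2 tracks the overall distribution while player 8 departs from it.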

### Imputing Nutrition Using Player-Specific Distribution

In [9]:
def impute_nutrition(row):
    # Leave observed responses untouched.
    if not pd.isna(row['Nutrition']):
        return row['Nutrition']

    # Empirical distribution of this player's observed responses
    # (value_counts drops missing values by default).
    player_nutrition = wellness_df.loc[wellness_df['PlayerID'] == row['PlayerID'], 'Nutrition']
    nutrition_probs = player_nutrition.value_counts(normalize=True)

    # Draw one response with probability equal to its observed frequency.
    return np.random.choice(nutrition_probs.index, size=1, p=nutrition_probs.values)[0]

wellness_df['Nutrition'] = wellness_df.apply(impute_nutrition, axis=1)

wellness_df.head()

Unnamed: 0,Date,PlayerID,Fatigue,Soreness,Desire,Irritability,BedTime,WakeTime,SleepHours,SleepQuality,MonitoringScore,Pain,Illness,Menstruation,Nutrition,NutritionAdjustment,USGMeasurement,USG,TrainingReadiness
0,2018-07-21,1,3,3,2,3,23:00:00,07:00:00,8.0,2,13,No,No,Yes,Excellent,Yes,No,,0%
1,2018-07-21,2,4,3,4,4,23:00:00,07:00:00,8.0,4,19,Yes,No,Yes,Excellent,,Yes,1.01,0%
2,2018-07-21,3,3,3,5,4,22:30:00,06:30:00,8.0,4,19,No,No,No,Excellent,,Yes,1.016,100%
3,2018-07-21,4,2,3,5,4,00:30:00,07:00:00,6.5,1,15,No,No,Yes,Excellent,Yes,Yes,1.025,95%
4,2018-07-21,5,5,3,4,4,23:45:00,07:00:00,7.25,4,20,No,No,No,Okay,Yes,Yes,1.022,100%
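
The row-wise `apply` above recomputes a player's distribution for every missing row. A possible alternative, shown only as a sketch, handles each player once and draws all of that player's missing values in one call. The function name `impute_by_player`, its `seed` parameter, and the toy data are hypothetical; drawing uniformly from the observed responses is equivalent to sampling from the player's empirical distribution.

```python
import numpy as np
import pandas as pd


def impute_by_player(df, column="Nutrition", group="PlayerID", seed=0):
    """Fill missing values by sampling from each group's observed values."""
    rng = np.random.default_rng(seed)
    out = df[column].copy()
    for _, row_labels in df.groupby(group).groups.items():
        values = df.loc[row_labels, column]
        observed = values.dropna()
        missing = values.index[values.isna()]
        # Skip groups with nothing to fill or nothing to sample from.
        if len(observed) == 0 or len(missing) == 0:
            continue
        out.loc[missing] = rng.choice(observed.to_numpy(), size=len(missing))
    return out


# Toy data, made up for illustration.
toy_df = pd.DataFrame({
    "PlayerID": [1, 1, 1, 2, 2, 2],
    "Nutrition": ["Okay", None, "Okay", "Poor", "Excellent", None],
})
imputed = impute_by_player(toy_df)
print(imputed.tolist())
```

A player with no observed responses at all keeps their missing values in this sketch, which makes that case easy to spot afterwards.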


Now we just need to make sure there are no missing nutrition values left.

In [10]:
wellness_df[wellness_df['Nutrition'].isnull()].shape

(0, 19)

## Extract Features

Based on the previous wellness factor analysis, we keep `MonitoringScore`, `Pain`, `Illness`, and `Nutrition`, along with `Date` and `PlayerID`.

In [11]:
processed_wellness_df = wellness_df.copy()[['Date', 'PlayerID', 'MonitoringScore', 'Pain', 'Illness', 'Nutrition']]
processed_wellness_df.head()


Unnamed: 0,Date,PlayerID,MonitoringScore,Pain,Illness,Nutrition
0,2018-07-21,1,13,No,No,Excellent
1,2018-07-21,2,19,Yes,No,Excellent
2,2018-07-21,3,19,No,No,Excellent
3,2018-07-21,4,15,No,No,Excellent
4,2018-07-21,5,20,No,No,Okay


### Map To Numbers

#### Pain

In [12]:
processed_wellness_df['Pain'] = processed_wellness_df['Pain'].map(dict(Yes=0, No=1))
processed_wellness_df.head()


Unnamed: 0,Date,PlayerID,MonitoringScore,Pain,Illness,Nutrition
0,2018-07-21,1,13,1,No,Excellent
1,2018-07-21,2,19,0,No,Excellent
2,2018-07-21,3,19,1,No,Excellent
3,2018-07-21,4,15,1,No,Excellent
4,2018-07-21,5,20,1,No,Okay


#### Illness

In [13]:
processed_wellness_df['Illness'] = processed_wellness_df['Illness'].map({'Yes': 0, 'Slightly Off': 2, 'No': 3})
processed_wellness_df.head()


Unnamed: 0,Date,PlayerID,MonitoringScore,Pain,Illness,Nutrition
0,2018-07-21,1,13,1,3,Excellent
1,2018-07-21,2,19,0,3,Excellent
2,2018-07-21,3,19,1,3,Excellent
3,2018-07-21,4,15,1,3,Excellent
4,2018-07-21,5,20,1,3,Okay


#### Nutrition

In [14]:
processed_wellness_df['Nutrition'] = processed_wellness_df['Nutrition'].map({'Excellent': 3, 'Okay': 2, 'Poor': 1})
processed_wellness_df.head()


Unnamed: 0,Date,PlayerID,MonitoringScore,Pain,Illness,Nutrition
0,2018-07-21,1,13,1,3,3
1,2018-07-21,2,19,0,3,3
2,2018-07-21,3,19,1,3,3
3,2018-07-21,4,15,1,3,3
4,2018-07-21,5,20,1,3,2
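
`Series.map` with a dictionary turns any label that is not a key into `NaN` without raising an error, so it can be worth checking for unmapped labels after each mapping step. The sketch below uses a made-up series; the misspelled `"slightly off"` entry and the name `unmapped_labels` are hypothetical.

```python
import pandas as pd

# Made-up responses; the lowercase entry stands in for an unexpected label.
illness = pd.Series(["No", "Yes", "Slightly Off", "slightly off"])
illness_map = {"Yes": 0, "Slightly Off": 2, "No": 3}

mapped = illness.map(illness_map)

# Labels that were present in the input but became NaN after mapping.
unmapped_labels = illness[mapped.isnull() & illness.notnull()].unique()
print(unmapped_labels)
```

An empty result means every label was covered by the dictionary.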


In [15]:
processed_wellness_df['MonitoringScore'].value_counts()

20    827
19    738
18    685
17    592
16    370
21    285
15    255
14    218
13    169
22    157
12    121
23    100
30     79
24     62
25     56
11     52
26     48
10     45
27     36
28     33
9      25
29     19
8      13
31     11
7       4
35      3
6       3
32      2
5       2
33      1
Name: MonitoringScore, dtype: int64

In [16]:
processed_wellness_df['Nutrition'].value_counts()

3    3227
2    1705
1      79
Name: Nutrition, dtype: int64

### Standardize

We need to standardize the player features so they are all on the same scale. This will make sure we can add all the data together to get one summary value for the player's wellness.

In [17]:
processed_wellness_df[['MonitoringScore', 'Pain', 'Illness', 'Nutrition']] = StandardScaler().fit_transform(processed_wellness_df[['MonitoringScore', 'Pain', 'Illness', 'Nutrition']])
processed_wellness_df.head()


Unnamed: 0,Date,PlayerID,MonitoringScore,Pain,Illness,Nutrition
0,2018-07-21,1,-1.450204,0.364611,0.301008,0.72209
1,2018-07-21,2,0.170622,-2.742646,0.301008,0.72209
2,2018-07-21,3,0.170622,0.364611,0.301008,0.72209
3,2018-07-21,4,-0.909929,0.364611,0.301008,0.72209
4,2018-07-21,5,0.440759,0.364611,0.301008,-1.220149


In [18]:
processed_wellness_df['Nutrition'].value_counts()

 0.722090    3227
-1.220149    1705
-3.162388      79
Name: Nutrition, dtype: int64
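
The three standardized nutrition values can be reproduced by hand from the counts shown before standardization. `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`); the sketch below does the same with NumPy only.

```python
import numpy as np

# Nutrition codes and their counts, copied from the value_counts output above.
values = np.array([3, 2, 1])
counts = np.array([3227, 1705, 79])

# Rebuild the full column, then apply z = (x - mean) / std with ddof=0.
nutrition = np.repeat(values, counts)
mean = nutrition.mean()
std = nutrition.std()  # NumPy's default is the population std (ddof=0)
z_scores = (values - mean) / std
print(np.round(z_scores, 6))
```

Because `Poor` is rare, it sits more than three standard deviations below the mean, so a single poor nutrition day weighs heavily in any sum of standardized features.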

### Normalize Monitoring Score and Nutrition

In [None]:
scaler = MinMaxScaler()
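
For reference, min-max scaling maps a column linearly so that its minimum becomes 0 and its maximum becomes 1, which is what `MinMaxScaler` does with its default `feature_range`. The sketch below shows the arithmetic with NumPy; the helper `min_max_scale` and the four sample scores are illustrative assumptions (5 and 35 are the smallest and largest `MonitoringScore` values observed above), and this is not necessarily how the scaler was meant to be used here.

```python
import numpy as np


def min_max_scale(x):
    """Rescale values linearly to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # A constant column carries no information; map it to zeros.
        return np.zeros_like(x)
    return (x - x.min()) / span


monitoring_scores = np.array([5, 13, 20, 35])
scaled = min_max_scale(monitoring_scores)
print(scaled)
```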

### Summarize Into Single Wellness Measure

In [19]:
processed_wellness_df['wellness'] = processed_wellness_df['MonitoringScore'] + processed_wellness_df['Pain'] + processed_wellness_df['Illness'] + processed_wellness_df['Nutrition']
processed_wellness_df.head()


Unnamed: 0,Date,PlayerID,MonitoringScore,Pain,Illness,Nutrition,wellness
0,2018-07-21,1,-1.450204,0.364611,0.301008,0.72209,-0.062495
1,2018-07-21,2,0.170622,-2.742646,0.301008,0.72209,-1.548927
2,2018-07-21,3,0.170622,0.364611,0.301008,0.72209,1.558331
3,2018-07-21,4,-0.909929,0.364611,0.301008,0.72209,0.47778
4,2018-07-21,5,0.440759,0.364611,0.301008,-1.220149,-0.113771


#### Save to file

In [20]:
processed_wellness_df.to_csv('./processed_data/processed_wellness.csv')
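
By default `to_csv` also writes the row index, and reading the file back without `index_col` yields an extra `Unnamed: 0` column. If that column is not wanted downstream, passing `index=False` avoids it. A small sketch on an in-memory buffer (the toy frame is made up):

```python
import io

import pandas as pd

toy_df = pd.DataFrame({"PlayerID": [1, 2], "wellness": [-0.062495, -1.548927]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```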

### Average For Each Player Over All Dates

In [21]:
average_wellness_df = processed_wellness_df.copy()
average_wellness_df = average_wellness_df.drop(columns=['Date'])
average_wellness_df = average_wellness_df.groupby('PlayerID', as_index=False).mean()
average_wellness_df


Unnamed: 0,PlayerID,MonitoringScore,Pain,Illness,Nutrition,wellness
0,1,-0.638939,0.286195,0.023494,0.673074,0.343824
1,2,0.286964,-0.277005,0.2091,-0.108662,0.110398
2,3,-0.131342,0.284251,0.301008,0.72209,1.176006
3,4,-0.258626,0.3445,0.271352,0.684376,1.041602
4,5,0.372964,-0.619154,-0.166024,-1.257644,-1.669858
5,6,0.176785,0.352797,0.196479,0.360228,1.086287
6,7,-0.22789,0.138135,0.124679,0.512577,0.547501
7,8,0.277973,-0.434976,-1.006391,-0.454641,-1.618036
8,9,1.560425,0.350551,0.193199,-0.288578,1.815598
9,10,-0.509995,-1.884185,-0.794237,-0.234448,-3.422865


Among the ten players shown above, player 9 has the highest average wellness (about 1.82) and player 10 has the lowest (about -3.42). Players 12, 16, and 17 were also noted as having low wellness, but their rows are not part of the output shown here.
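
A minimal sketch for ranking the averages, using only the ten rows displayed in the table above (the remaining players are not shown, so they are left out here):

```python
import pandas as pd

# Average wellness per player, copied from the ten rows displayed above.
shown = pd.Series(
    [0.343824, 0.110398, 1.176006, 1.041602, -1.669858,
     1.086287, 0.547501, -1.618036, 1.815598, -3.422865],
    index=pd.Index(range(1, 11), name="PlayerID"),
    name="wellness",
)

# Sort from highest to lowest average wellness.
ranked = shown.sort_values(ascending=False)
highest_player = ranked.index[0]
lowest_player = ranked.index[-1]
print(ranked)
```

On the full `average_wellness_df`, the same idea would be `average_wellness_df.sort_values('wellness', ascending=False)`.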