# Regression and Prediction

## Multiple Linear Regression

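When there are multiple predictors, the simple linear regression equation is extended to accommodate them:

$$Y = b_0 + b_1 X_1 + \dots + b_n X_n + e$$

where $b_0$ is the intercept, $b_1, \dots, b_n$ are the coefficients of the predictors $X_1, \dots, X_n$, and $e$ is the error.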
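
As a minimal sketch of what fitting this equation involves, the snippet below estimates the coefficients by least squares with NumPy. The data is synthetic and every name and number in it is made up for illustration; it is not the sleep dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 records, 3 predictors (all values are made up)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_b0 = 2.0
true_b = np.array([1.5, -0.7, 0.3])
e = rng.normal(scale=0.1, size=n)        # the error term
Y = true_b0 + X @ true_b + e

# Prepend a column of ones so the first fitted coefficient is the intercept b0
X_design = np.column_stack([np.ones(n), X])

# Least-squares estimate of (b0, b1, ..., bn)
coef, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
print(coef)
```

With this little noise the estimates should land close to the values used to generate the data.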

## The Lingo

### Root mean squared error
The square root of the average squared error of the regression.
### Residual standard error 
The same as the root mean squared error, but adjusted for degrees of freedom.
### R squared 
The proportion of variation in the response that is explained by the model, ranging from 0 to 1.
### t-statistic
The coefficient for a predictor, divided by the standard error of the coefficient, giving a metric to compare the importance of variables in the model.
### Weighted regression
Regression with the records having different weights.
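
The terms above can be computed by hand, which may make the definitions more concrete. This is only a sketch on synthetic data: `regression_metrics` and `weighted_fit` are hypothetical helper names, and the residual standard error divides by the number of records minus the number of fitted coefficients.

```python
import numpy as np

def regression_metrics(X_design, y):
    """Ordinary least squares plus the summary metrics defined above.

    X_design must already include a column of ones if an intercept is wanted.
    """
    n, k = X_design.shape                        # k counts every fitted coefficient
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    resid = y - X_design @ coef
    rss = np.sum(resid ** 2)                     # residual sum of squares

    rmse = np.sqrt(rss / n)                      # root mean squared error
    rse = np.sqrt(rss / (n - k))                 # residual standard error
    r2 = 1 - rss / np.sum((y - y.mean()) ** 2)   # R squared (centered)

    # Standard error of each coefficient, then t = coefficient / standard error
    std_err = np.sqrt(np.diag(rse ** 2 * np.linalg.inv(X_design.T @ X_design)))
    t_stats = coef / std_err
    return {"coef": coef, "rmse": rmse, "rse": rse, "r2": r2, "t": t_stats}

def weighted_fit(X_design, y, weights):
    """Weighted least squares: scale each record by the square root of its weight."""
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(X_design * sw[:, None], y * sw, rcond=None)
    return coef

# Synthetic data: one useful predictor and one that is unrelated to y
rng = np.random.default_rng(0)
n = 100
x_useful = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = 1.0 + 2.0 * x_useful + rng.normal(size=n)
X_design = np.column_stack([np.ones(n), x_useful, x_noise])

m = regression_metrics(X_design, y)
print(m)
```

The useful predictor should show a much larger t-statistic in absolute value than the unrelated one, and giving every record a weight of 1 in `weighted_fit` reproduces the ordinary fit.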

In multiple linear regression we assess whether the predictors, taken together, are related to the response by looking at the F-statistic, rather than relying only on the p-value as in a simple linear model. The F-statistic is calculated for the overall model, whereas each p-value in the coefficient table is specific to one predictor. If there is a strong relationship, F will be much larger than 1. Otherwise, it will be approximately equal to 1.
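
To illustrate the "much larger than 1 versus close to 1" behaviour, here is a sketch that computes the overall F-statistic by hand for a model with an intercept. The helper `f_statistic` is a hypothetical name and the data is synthetic: one response depends on the predictors, the other is pure noise.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F-statistic for a linear model with an intercept and predictors X (n x p)."""
    n, p = X.shape
    X_design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ coef) ** 2)     # variation left unexplained
    tss = np.sum((y - y.mean()) ** 2)            # total variation around the mean
    return ((tss - rss) / p) / (rss / (n - p - 1))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_related = X @ np.array([1.5, -0.7, 0.3]) + rng.normal(size=200)
y_unrelated = rng.normal(size=200)               # no relationship with X

f_related = f_statistic(X, y_related)
f_unrelated = f_statistic(X, y_unrelated)
print(f_related, f_unrelated)
```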

In [80]:
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std

# Read the cleaned sleep data from 'clean_sleepdata.csv'
# (the file must be in the directory that your Python process is based in)
df = pd.read_csv("clean_sleepdata.csv")
# Drop the leftover index column
df = df.drop(['Unnamed: 0'], axis=1)

# Preview the first 5 lines of the loaded data 
df.head()

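| | Sleep quality | Sleep Notes | Heart rate | Activity (steps) | Time in bed in minutes | Day | Month | Year | Date | Bedtime | ... | Climbing | Feeling ill 🤒 | Swimming | Totm | Cycled to work | Games night | 🙈 | Dance | Pilates | Water workout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | Away from home:Drinking alcohol | 81 | 10663 | 498.0 | 3 | 12 | 2017 | 2017-12-03 | 00:44 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 72.0 | Incense | 67 | 16018 | 399.0 | 3 | 12 | 2017 | 2017-12-03 | 23:17 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 84.0 | Stressful day:🙂 | 81 | 6064 | 490.0 | 4 | 12 | 2017 | 2017-12-04 | 22:38 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 83.0 | Aerial:🙁 | 89 | 4378 | 496.0 | 5 | 12 | 2017 | 2017-12-05 | 22:34 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 74.0 | Incense | 81 | 3105 | 450.0 | 6 | 12 | 2017 | 2017-12-06 | 23:31 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |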


In [81]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import statsmodels.api as sm

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 31 columns):
Sleep quality             323 non-null float64
Sleep Notes               225 non-null object
Heart rate                323 non-null int64
Activity (steps)          323 non-null int64
Time in bed in minutes    323 non-null float64
Day                       323 non-null int64
Month                     323 non-null int64
Year                      323 non-null int64
Date                      323 non-null object
Bedtime                   323 non-null object
Woke up                   323 non-null object
Mood (out of 3)           323 non-null int64
Away from home            323 non-null int64
Drinking alcohol          323 non-null int64
Incense                   323 non-null int64
Stressful day             323 non-null int64
🙂                         323 non-null int64
Aerial                    323 non-null int64
🙁                         323 non-null int64
Pole fitness              323 no

In [82]:
# Drop all the object-type columns, as well as factors that I know could not affect sleep (e.g. heart rate the next morning)
df_test = df.drop(['Sleep Notes', 'Bedtime', 'Woke up', 'Date', 'Heart rate', 'Mood (out of 3)', 'Day', 'Month', 'Year'], axis=1)
Xs = df_test.drop(['Sleep quality'], axis=1)
y = df_test[['Sleep quality']]

In [83]:
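# Note the argument order: statsmodels takes (y, X), whereas scikit-learn's fit takes (X, y)
# No constant column is added to Xs, so this model is fitted without an intercept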
model = sm.OLS(y, Xs).fit()
predictions = model.predict(Xs) # make the predictions by the model

# Print out the statistics
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sleep quality | R-squared: | 0.989 |
| Model: | OLS | Adj. R-squared: | 0.988 |
| Method: | Least Squares | F-statistic: | 1266.0 |
| Date: | Thu, 26 Sep 2019 | Prob (F-statistic): | 2.26e-280 |
| Time: | 08:52:14 | Log-Likelihood: | -1141.6 |
| No. Observations: | 323 | AIC: | 2325.0 |
| Df Residuals: | 302 | BIC: | 2404.0 |
| Df Model: | 21 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Activity (steps) | 1.361e-05 | 8.83e-05 | 0.154 | 0.878 | -0.000 | 0.000 |
| Time in bed in minutes | 0.1708 | 0.002 | 86.518 | 0.000 | 0.167 | 0.175 |
| Away from home | -0.8093 | 1.503 | -0.538 | 0.591 | -3.767 | 2.148 |
| Drinking alcohol | -1.3623 | 2.050 | -0.665 | 0.507 | -5.396 | 2.672 |
| Incense | -1.4778 | 3.921 | -0.377 | 0.707 | -9.194 | 6.239 |
| Stressful day | 2.7927 | 2.004 | 1.394 | 0.164 | -1.150 | 6.735 |
| 🙂 | -1.4025 | 2.576 | -0.545 | 0.586 | -6.471 | 3.666 |
| Aerial | -2.2037 | 8.651 | -0.255 | 0.799 | -19.228 | 14.820 |
| 🙁 | 0.4283 | 1.162 | 0.368 | 0.713 | -1.859 | 2.716 |

| | | | |
|---|---|---|---|
| Omnibus: | 143.383 | Durbin-Watson: | 1.698 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 705.769 |
| Skew: | -1.833 | Prob(JB): | 5.55e-154 |
| Kurtosis: | 9.245 | Cond. No. | 157000.0 |


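The summary reports an R² of 0.989. This figure should not be compared directly with the R² of a simple linear regression: `sm.OLS` only fits an intercept if a constant column is added (for example with `sm.add_constant`), and without one statsmodels reports the uncentered R², which is typically much higher. Time in bed is the only predictor with a low p-value, so the extra predictors add little over a simple regression on time in bed.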
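
One thing worth checking when reading an R² from `sm.OLS` is whether the model included an intercept. The sketch below uses synthetic data (all numbers are made up, loosely shaped like minutes in bed and a quality score) to show how far apart the uncentered R² of a no-intercept fit and the centered R² of a fit with an intercept can be.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic data: a predictor with a large positive mean and a response
# that depends on it only weakly
x = rng.normal(loc=450, scale=60, size=n)
y = 60 + 0.03 * x + rng.normal(scale=10, size=n)

def residual_sum_of_squares(X_design, y):
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return np.sum((y - X_design @ coef) ** 2)

# No intercept: compare residuals with the raw sum of squares of y (uncentered)
rss_no_const = residual_sum_of_squares(x.reshape(-1, 1), y)
r2_uncentered = 1 - rss_no_const / np.sum(y ** 2)

# With intercept: compare residuals with variation around the mean of y (centered)
rss_const = residual_sum_of_squares(np.column_stack([np.ones(n), x]), y)
r2_centered = 1 - rss_const / np.sum((y - y.mean()) ** 2)

print(r2_uncentered, r2_centered)
```

The uncentered value is high mainly because the fit gets credit for reproducing the mean of the response, not because the predictor explains its variation.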

Also, the F-statistic is 1266. Because this is much greater than 1, it indicates a strong relationship between sleep quality and the variables we have considered.

However, some of the p-values are high, so we can treat those predictors as not statistically significant. Removing them would slightly reduce the R² value, but we might make better predictions.


Time in bed is the only predictor with a low p-value, so let's see how well the remaining variables predict time in bed itself.

In [84]:
from sklearn.linear_model import LinearRegression

# This time 'Sleep quality' is dropped as well, and 'Time in bed in minutes' becomes the target
df_test = df.drop(['Sleep Notes','Sleep quality', 'Bedtime', 'Woke up', 'Date', 'Heart rate', 'Mood (out of 3)', 'Day', 'Month', 'Year'], axis=1)
Xs = df_test.drop(['Time in bed in minutes'], axis=1)
y = df_test[['Time in bed in minutes']]

# scikit-learn fit for comparison (it fits an intercept by default, unlike sm.OLS below)
reg = LinearRegression()
reg.fit(Xs, y)

# Note the argument order: statsmodels takes (y, X), whereas scikit-learn's fit takes (X, y)
model = sm.OLS(y, Xs).fit()
predictions = model.predict(Xs) # make the predictions by the model

# Print out the statistics
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Time in bed in minutes | R-squared: | 0.715 |
| Model: | OLS | Adj. R-squared: | 0.696 |
| Method: | Least Squares | F-statistic: | 38.0 |
| Date: | Thu, 26 Sep 2019 | Prob (F-statistic): | 2.03e-70 |
| Time: | 08:52:15 | Log-Likelihood: | -2230.9 |
| No. Observations: | 323 | AIC: | 4502.0 |
| Df Residuals: | 303 | BIC: | 4577.0 |
| Df Model: | 20 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Activity (steps) | 0.0255 | 0.002 | 12.080 | 0.000 | 0.021 | 0.030 |
| Away from home | 62.4232 | 43.591 | 1.432 | 0.153 | -23.357 | 148.203 |
| Drinking alcohol | 165.9323 | 58.889 | 2.818 | 0.005 | 50.048 | 281.816 |
| Incense | 143.2942 | 113.815 | 1.259 | 0.209 | -80.674 | 367.263 |
| Stressful day | 156.8406 | 57.605 | 2.723 | 0.007 | 43.484 | 270.197 |
| 🙂 | 96.7562 | 74.749 | 1.294 | 0.197 | -50.337 | 243.850 |
| Aerial | 176.0757 | 251.548 | 0.700 | 0.484 | -318.927 | 671.079 |
| 🙁 | 208.3230 | 31.636 | 6.585 | 0.000 | 146.068 | 270.578 |
| Pole fitness | 125.6674 | 54.147 | 2.321 | 0.021 | 19.115 | 232.220 |

| | | | |
|---|---|---|---|
| Omnibus: | 13.273 | Durbin-Watson: | 1.164 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 13.872 |
| Skew: | -0.503 | Prob(JB): | 0.000972 |
| Kurtosis: | 3.141 | Cond. No. | 157000.0 |


Yesterday I did 9016 steps, went to pole fitness, and felt happy.

In [91]:
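# Values follow the column order of Xs: Activity (steps), Away from home, Drinking alcohol,
# Incense, Stressful day, 🙂, Aerial, 🙁, Pole fitness, then the remaining columns (all 0)
X = np.array([9016,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0])
time_minutes_prediction = model.predict(X)
time_hours_prediction = time_minutes_prediction / 60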

print('The model predicts that I slept for', time_minutes_prediction, 'minutes. Or equivalently', time_hours_prediction, 'hours.')

The model predicts that I slept for [452.25393953] minutes. Or equivalently [7.53756566] hours.
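
The prediction is just the dot product of the input row with the fitted coefficients, which can be checked by hand. The sketch below uses only the rounded coefficients shown in the summary table above, so it covers only those nine columns; since every other entry of the input is 0, they contribute nothing to this particular prediction. `make_row` is a hypothetical helper that builds the row by column name rather than by position, which is less error-prone than typing a long list of zeros.

```python
import numpy as np

# Rounded coefficients copied from the summary table above (only the rows shown there)
coef = {
    "Activity (steps)": 0.0255,
    "Away from home": 62.4232,
    "Drinking alcohol": 165.9323,
    "Incense": 143.2942,
    "Stressful day": 156.8406,
    "🙂": 96.7562,
    "Aerial": 176.0757,
    "🙁": 208.3230,
    "Pole fitness": 125.6674,
}

def make_row(columns, values):
    """Return a 1-D array ordered like `columns`, zero except for the entries in `values`."""
    unknown = set(values) - set(columns)
    if unknown:
        raise KeyError(f"not a model column: {sorted(unknown)}")
    return np.array([float(values.get(c, 0)) for c in columns])

columns = list(coef)
row = make_row(columns, {"Activity (steps)": 9016, "🙂": 1, "Pole fitness": 1})

# Prediction = sum over columns of (value * coefficient)
minutes = float(row @ np.array([coef[c] for c in columns]))
print(round(minutes, 2), round(minutes / 60, 2))
```

This gives about 452.3 minutes, which agrees with the model's 452.25 up to the rounding of the printed coefficients. With the fitted model, the same helper could presumably be called with `list(Xs.columns)` and the result passed to `model.predict`.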
