# Applied Modelling for Senior Leaders
Welcome to your interactive notebook on applied modelling! This Python notebook will walk you through the tutorial exercises.


## Practical 1 - Correlation

In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

Let's start by "ingesting" some data.

In [10]:
train = pd.read_csv('../data/training.csv', index_col=0)
test = pd.read_csv('../data/testing.csv', index_col=0)


## Milestone 1 - Correlation

Below, we've calculated the correlation between age and years in the labour force.

In [11]:
train[['age','years_in_labour_force']].corr().iloc[0:1]

,age,years_in_labour_force
age,1.0,0.749
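
To make it clearer what `.corr()` is reporting, here is a minimal sketch that computes the Pearson correlation coefficient by hand and compares it with the pandas result. It runs on synthetic data generated in the cell itself (the `demo` frame and its values are assumptions for illustration, not the course's `training.csv`), so the number it prints will not match the 0.749 shown above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical; not the course's training.csv)
rng = np.random.default_rng(0)
age = rng.uniform(18, 65, size=200)
years = 0.4 * age + rng.normal(0, 4, size=200)
demo = pd.DataFrame({'age': age, 'years_in_labour_force': years})

# Pearson r: sum of products of deviations from the mean, divided by
# the square root of the product of the summed squared deviations
x = demo['age'] - demo['age'].mean()
y = demo['years_in_labour_force'] - demo['years_in_labour_force'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same quantity via pandas (Pearson is the default method)
r_pandas = demo['age'].corr(demo['years_in_labour_force'])

print(round(r_manual, 3), round(r_pandas, 3))
```

The two values agree, and the coefficient always lies between -1 (perfect negative relationship) and 1 (perfect positive relationship).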


How would you tweak the below code to examine 

In [12]:
train[['committed','age', 'years_in_labour_force', 'years_of_further_education',
       'miles_resided_from_coast']].corr().round(2).iloc[0:1]

,committed,age,years_in_labour_force,years_of_further_education,miles_resided_from_coast
committed,1.0,0.27,0.23,-0.15,-0.1



## Practical 2 - Regression

Now that you've developed some understanding of how your data is connected, let's use it to develop a model. Here, we've shown a model looking at how years_in_labour_force is predicted by age.

In [36]:
train = pd.read_csv('../data/training.csv',index_col=0)


In [37]:
mod = smf.ols(formula='years_in_labour_force ~ age', data=train)
res = mod.fit()
res.summary()

Dep. Variable:,years_in_labour_force,R-squared:,0.561
Model:,OLS,Adj. R-squared:,0.559
Method:,Least Squares,F-statistic:,342.5
Date:,"Wed, 12 Jul 2023",Prob (F-statistic):,7.98e-50
Time:,15:37:37,Log-Likelihood:,-698.63
No. Observations:,270,AIC:,1401.0
Df Residuals:,268,BIC:,1408.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5.0470,0.820,6.159,0.000,3.434,6.660
age,0.3713,0.020,18.506,0.000,0.332,0.411

Omnibus:,0.324,Durbin-Watson:,2.004
Prob(Omnibus):,0.85,Jarque-Bera (JB):,0.243
Skew:,0.073,Prob(JB):,0.886
Kurtosis:,3.017,Cond. No.,170.0
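
The summary above connects back to Practical 1: for a regression with a single predictor, R-squared equals the square of the Pearson correlation, which is consistent with the 0.561 reported here and the 0.749 correlation found earlier (0.749 squared is roughly 0.561). The sketch below checks that relationship and shows how to pull individual numbers out of a fitted result instead of reading them off the full table. It uses synthetic data built in the cell (the `demo` frame and the coefficients used to generate it are assumptions for illustration), so its printed values will differ from the table above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical; not the course's training.csv)
rng = np.random.default_rng(1)
demo = pd.DataFrame({'age': rng.uniform(18, 65, size=200)})
demo['years_in_labour_force'] = 5 + 0.37 * demo['age'] + rng.normal(0, 3, size=200)

# Fit the same kind of one-predictor model as in the cell above
demo_res = smf.ols('years_in_labour_force ~ age', data=demo).fit()

# Individual quantities are available as attributes of the fitted result
slope = demo_res.params['age']            # change in outcome per extra year of age
intercept = demo_res.params['Intercept']  # predicted outcome at age 0
r_squared = demo_res.rsquared             # share of variance explained

# With one predictor, R-squared is the squared correlation
r = demo['age'].corr(demo['years_in_labour_force'])

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"R-squared={r_squared:.3f}, r^2={r ** 2:.3f}")
```

This identity holds only for a single predictor; once more features are added, R-squared reflects their combined explanatory power.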


Can you tweak the formula below to add the missing features to your model?

In [15]:
mod = smf.ols(formula='committed ~  miles_resided_from_coast', data=train)
res = mod.fit()
res.summary().tables[0]

Dep. Variable:,committed,R-squared:,0.009
Model:,OLS,Adj. R-squared:,0.006
Method:,Least Squares,F-statistic:,2.535
Date:,"Wed, 12 Jul 2023",Prob (F-statistic):,0.113
Time:,17:10:56,Log-Likelihood:,-159.09
No. Observations:,270,AIC:,322.2
Df Residuals:,268,BIC:,329.4
Df Model:,1,,
Covariance Type:,nonrobust,,


## Practical 3 - Making Predictions
So, you've learnt how to build a model, and have seen it fits at l

In [6]:
mod = smf.ols(formula='committed ~ age + years_in_labour_force + years_of_further_education + miles_resided_from_coast', data=train)
res = mod.fit()

In [7]:
predictions = res.predict(test)
predictions[0:5]

person_id
203    0.822779
266    0.529915
152    0.778840
9      0.584905
233    0.737207
dtype: float64

In [8]:
threshold = 0.5
# Flag a person as predicted "committed" when the model's score exceeds the threshold
test['prediction'] = predictions > threshold
test[['age','committed','prediction']].round(2).iloc[-15:]

person_id,age,committed,prediction
221,45.36,1,True
289,35.08,1,True
211,43.05,1,True
148,13.74,0,False
165,54.82,0,True
78,37.89,0,True
113,40.38,0,True
249,35.08,1,True
250,53.85,1,True
104,39.42,1,True
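
One way to judge predictions like these is to compare them with the actual outcomes. The sketch below re-enters only the ten rows displayed above (not the full test set, so the figures describe this small sample only) and computes the share of correct predictions along with a cross-tabulation of actual versus predicted values. The names `shown`, `accuracy`, `false_positives` and `confusion` are introduced here for illustration.

```python
import pandas as pd

# The ten rows shown in the output above, typed in by hand
shown = pd.DataFrame(
    {
        'committed': [1, 1, 1, 0, 0, 0, 0, 1, 1, 1],
        'prediction': [True, True, True, False, True, True, True, True, True, True],
    },
    index=pd.Index([221, 289, 211, 148, 165, 78, 113, 249, 250, 104], name='person_id'),
)

# A prediction is correct when it matches the actual outcome
actual = shown['committed'] == 1
accuracy = (shown['prediction'] == actual).mean()

# People predicted as committed who were not
false_positives = (~actual & shown['prediction']).sum()

# Rows: actual outcome; columns: predicted outcome
confusion = pd.crosstab(shown['committed'], shown['prediction'])

print(f"accuracy on the rows shown: {accuracy:.1f}")
print(f"false positives: {false_positives}")
print(confusion)
```

On these ten rows the model is right seven times, and all three errors are false positives. Raising `threshold` would be one way to trade some of those for missed positives.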
