# Applied Modelling for Senior Leaders
Welcome to your interactive notebook on applied modelling! This Python notebook will walk you through the tutorial exercises.


## Practical 1 - Correlation

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

Let's start by "ingesting" some data.

In [2]:
train = pd.read_csv('../data/training.csv', index_col=0)
test = pd.read_csv('../data/testing.csv', index_col=0)


## Milestone 1 - Correlation

Below, we've calculated the correlation between `age` and `years_in_labour_force`.

In [3]:
train[['age','years_in_labour_force']].corr().iloc[0:1]

,age,years_in_labour_force
age,1.0,0.749
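
To make the number returned by `.corr()` less of a black box, here is a minimal sketch that computes a Pearson correlation two ways: with pandas, and by hand from the definition. The data is synthetic and the `toy` name is hypothetical; nothing here comes from `training.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (an assumption, not the course data).
rng = np.random.default_rng(0)
age = rng.uniform(20, 65, size=200)
years = 0.8 * (age - 18) + rng.normal(0, 6, size=200)
toy = pd.DataFrame({"age": age, "years_in_labour_force": years})

# 1) Pearson correlation as pandas computes it.
r_pandas = toy["age"].corr(toy["years_in_labour_force"])

# 2) The same quantity from the definition:
#    sum of products of deviations, divided by the root of the
#    product of the summed squared deviations.
x = toy["age"] - toy["age"].mean()
y = toy["years_in_labour_force"] - toy["years_in_labour_force"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(round(r_pandas, 3), round(r_manual, 3))
```

The two values should agree. A correlation always lies between -1 and 1, and values near 0 indicate little linear relationship.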


How would you tweak the code below to examine how each factor correlates with our `committed` column?

In [4]:
train[[MISSING,'age', 'years_in_labour_force', 'years_of_further_education',
       'miles_resided_from_coast']].corr().round(2).iloc[0:1]

,committed,age,years_in_labour_force,years_of_further_education,miles_resided_from_coast
committed,1.0,0.27,0.23,-0.15,-0.1


## Practical 2 - Regression

Now that you've developed some understanding of how your data is connected, let's use it to develop a model. Here, we've shown a model of how `years_in_labour_force` is predicted by `age`.

In [5]:
train = pd.read_csv('../data/training.csv',index_col=0)

In [6]:
model = smf.ols(formula='years_in_labour_force ~ age', data=train)
results = model.fit()
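
Once a model has been fitted, the results object holds the estimated coefficients. The sketch below illustrates this on synthetic data, where the true slope (0.9) is known in advance, so you can see the fit recover it. The `toy` and `toy_results` names and the data-generating numbers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known relationship (illustrative assumption).
rng = np.random.default_rng(1)
toy = pd.DataFrame({"age": rng.uniform(20, 65, size=300)})
toy["years_in_labour_force"] = 0.9 * toy["age"] - 16 + rng.normal(0, 3, size=300)

# Same formula style as the cell above: outcome ~ predictor.
toy_results = smf.ols("years_in_labour_force ~ age", data=toy).fit()

# Estimated intercept and slope, plus the share of variance explained.
print(toy_results.params)
print(toy_results.rsquared)
```

The `age` coefficient should land close to 0.9, with the gap reflecting the random noise added to the data.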

Can you tweak the below formula to add missing features to your model?

In [7]:
model = smf.ols(formula='committed ~ age + MISSING + MISSING + MISSING', data=train)
results = model.fit()
results.summary().tables[0]

## Milestone 2 - Regression
So, you've learnt how to build a model, and have seen how to improve the model fit.

In [8]:
model = smf.ols(formula='committed ~ age + years_in_labour_force + years_of_further_education + miles_resided_from_coast', data=train)
results = model.fit()
results.summary().tables[0]

0,1,2,3
Dep. Variable:,committed,R-squared:,0.101
Model:,OLS,Adj. R-squared:,0.088
Method:,Least Squares,F-statistic:,7.457
Date:,"Thu, 13 Jul 2023",Prob (F-statistic):,1.05e-05
Time:,14:41:25,Log-Likelihood:,-145.96
No. Observations:,270,AIC:,301.9
Df Residuals:,265,BIC:,319.9
Df Model:,4,,
Covariance Type:,nonrobust,,
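
The figures in the summary above are linked by simple arithmetic. As a sketch, the snippet below recomputes the adjusted R-squared and the F-statistic from the reported R-squared, the number of observations and the model degrees of freedom. It uses the standard textbook formulas and the rounded values printed in the table, so small differences from the reported figures are expected.

```python
# Values copied from the summary table above (already rounded there).
r_squared = 0.101
n_obs = 270
df_model = 4
df_resid = n_obs - df_model - 1  # 265, matching "Df Residuals"

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_statistic = (r_squared / df_model) / ((1 - r_squared) / df_resid)

print(round(adj_r_squared, 3), round(f_statistic, 2))
```

An R-squared of about 0.1 means the four factors together explain roughly 10% of the variation in `committed`. The very small `Prob (F-statistic)` indicates that this is unlikely to be chance alone, even though most of the variation remains unexplained.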


Can you tweak the code below to make our model `predict` on our `test` data?

In [9]:
predictions = results._____(____)
predictions

AttributeError: 'OLSResults' object has no attribute '_____'
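
If it helps to see the general pattern first, here is a minimal sketch of applying a fitted formula model to rows it has not seen. It uses synthetic data and hypothetical names (`toy_train`, `toy_new`, `toy_results`), not the notebook's `train` and `test` frames. The new data only needs the predictor columns named in the formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic training data (illustrative assumption).
rng = np.random.default_rng(2)
toy_train = pd.DataFrame({"age": rng.uniform(20, 65, size=200)})
toy_train["years_in_labour_force"] = (
    0.9 * toy_train["age"] - 16 + rng.normal(0, 3, size=200)
)

toy_results = smf.ols("years_in_labour_force ~ age", data=toy_train).fit()

# Unseen rows: only the predictor column is required.
toy_new = pd.DataFrame({"age": [25, 40, 60]})
toy_predictions = toy_results.predict(toy_new)

print(toy_predictions)
```

Each prediction is the intercept plus the coefficient times the row's `age`, returned as a pandas Series with one value per row of the new data.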

To make sure our predictions are meaningful, we set a `threshold` for our `predictions`: any prediction above the threshold is marked `True`, and the rest `False`. Can you set a threshold and finish your predictions?

In [10]:
threshold = __
test['prediction'] = predictions > threshold
test[['age','committed','prediction']].round(2).iloc[-15:]

NameError: name 'predictions' is not defined
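
As a sketch of what thresholding does, the snippet below turns a handful of made-up scores into True/False predictions and measures how often they match the actual outcome. The scores, the `toy` frame and the threshold of 0.5 are arbitrary assumptions for illustration, not a recommended setting for this exercise.

```python
import pandas as pd

# Hypothetical scores and outcomes (not from testing.csv).
toy = pd.DataFrame({
    "committed": [0, 1, 0, 1, 1, 0],
    "score": [0.10, 0.80, 0.45, 0.55, 0.30, 0.20],
})

toy_threshold = 0.5  # arbitrary illustrative choice

# Scores above the threshold become True, everything else False.
toy["prediction"] = toy["score"] > toy_threshold

# Share of rows where the prediction matches the actual outcome.
accuracy = (toy["prediction"] == toy["committed"].astype(bool)).mean()

print(toy)
print(accuracy)
```

Raising the threshold makes the model more cautious about predicting `True`, and lowering it does the opposite, so it is worth trying a few values and comparing the results.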