<img src="https://github.com/CorndelDataAnalyticsDiploma/workshop/blob/master/Corndel%20Digital%20Logo%20Centre.png?raw=true" alt="Corndel" width ="301.5" align="left">

# Predictive Modelling

In this notebook we will build a linear predictive model.





### Simple Linear Regression

In [1]:
# # Discuss all packages, aliases and importing specific tools
# import matplotlib.pyplot as plt
# import statsmodels.api as sm
# import pandas as pd
# import seaborn as sns
# import os

# from sklearn.model_selection import train_test_split
# from sklearn.metrics import root_mean_squared_error
# from sklearn.metrics import mean_absolute_percentage_error



# Multiple Linear Regression
#### We will start with Simple Linear Regression and gradually build in more variables


Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

Number of Instances: 442

Number of Attributes: First 10 columns are numeric predictive values

Target: Column 11 is a quantitative measure of disease progression one year after baseline

Attribute Information:
- Age
- Sex
- Body mass index
- Average blood pressure
- S1
- S2
- S3
- S4
- S5
- S6

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

In [2]:
# # Load the diabetes dataset and display the head
# df_dia = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt",sep="\t")
# # or
# # df_dia = pd.read_csv("data/diabetes.tab.txt",sep="\t")
# df_dia.head()

<div class="alert alert-block alert-warning">

#### Any issues with this data?
    
    --Encoding of Sex is an issue
    
    -- We will address later but discuss now

# Discussion on encoding

Categorical encoding, ordinal, one-hot

🤖`*"can i encode male and female as 1 and 2 for linear regression model?"*`

The LLM will warn of the issues with this approach and offer solutions.
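
As a minimal sketch of two common options, assuming a toy DataFrame with a `SEX` column coded 1 and 2 (the toy data and the new column names are made up for illustration): a two-level category can be recoded as 0/1, or one-hot encoded with one level dropped. With only two levels, 1/2 coding fits the same line as 0/1 and mainly changes how the intercept reads; the bigger risk comes with three or more categories, where integer codes impose an order that may not exist.

```python
import pandas as pd

# Toy data: the 1/2 coding mirrors the discussion above, the rows are invented
toy = pd.DataFrame({"SEX": [1, 2, 2, 1]})

# Option 1: recode a two-level category as 0/1 (a single dummy variable)
toy["SEX_01"] = toy["SEX"].map({1: 0, 2: 1})

# Option 2: one-hot encode, dropping the first level so the columns are not redundant
one_hot = pd.get_dummies(toy["SEX"], prefix="SEX", drop_first=True, dtype=int)

print(toy)
print(one_hot)
```

For a two-level column both options carry the same information, so either can be passed to the model.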

### Step 1, EDA and Data Viz


<div class="alert alert-block alert-warning">

#### Examine the data, does it look like there is a linear relationship?
#### Is a Linear model suitable?
#### Do you see any outliers?
#### What options do you have to deal with outliers?

In [3]:
# Get a description of the data


In [4]:
# Use Pandas to generate histograms of all columns, bins=50 and figsize=(20,20)
# note: no easy way to do this in Seaborn


<div class="alert alert-block alert-warning">

# Observations
You will notice that BMI, S3 and target variable are all skewed.\
Transformations should be applied to these columns.\
This is covered in the extension material.
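
One common transformation for right-skewed data is the logarithm. The sketch below uses synthetic values, not the diabetes columns, and only illustrates the idea; whether a log transform is the right choice for BMI, S3 or the target is an assumption to check against the extension material.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a skewed column
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=1000))

# Log transform (valid here because all values are strictly positive)
logged = np.log(skewed)

print(f"skew before: {skewed.skew():.2f}")
print(f"skew after:  {logged.skew():.2f}")
```

A skew value close to 0 indicates a roughly symmetric distribution.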

# Pairplot

In [5]:
# Use Seaborn to generate pair plots


<div class="alert alert-block alert-warning">

# Observations
We can see S1 and S2 are highly correlated.\
This would be an issue for regression analysis, but it does not concern us for predictive modelling.
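
A correlation matrix gives a numeric check of what the pair plot shows. The sketch below uses synthetic columns that merely borrow the names S1, S2 and S3; the values are invented so that the first two track each other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s1 = rng.normal(size=200)

toy = pd.DataFrame({
    "S1": s1,
    "S2": s1 * 0.9 + rng.normal(scale=0.3, size=200),  # built to track S1 closely
    "S3": rng.normal(size=200),                        # unrelated to S1
})

# Pairwise Pearson correlations, rounded for readability
corr = toy.corr().round(2)
print(corr)
```

Values near 1 or -1 off the diagonal flag pairs of inputs that carry largely the same information.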

# Train, Test Split

We must divide the data, with a 70:30 split, into:
 - a Training set and
 - a Test set
  
We will build the model using the training data and assess how well our model performs on the test data.\
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

To assess how well our model performs, we will use both:
 - Root Mean Squared Error (RMSE) and
 - Mean Absolute Percentage Error (MAPE).

We will look at these in detail later.

In [6]:
# # Examine columns again
# df_dia.columns

In [7]:
# Assign inputs to X and output to y (column names can be copied and pasted from above)


In [8]:
# # Use Scikit-Learn to get Training and Test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, 
#                                                     test_size=0.30, 
#                                                     random_state=42)

<div class="alert alert-block alert-warning">

# Discussion
Discuss train and Test Split

<img src="Data/TrainTestDiagram.png" alt="Corndel" width ="600" align="centre">

There are many more ways to split data for model training and evaluation and these are covered in the extension material

Examining the test and train sets\
Discussion about stratified samples and an exploration of them below

<img src="Data/Stratified.jpg" alt="Corndel" width ="600" align="centre">
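
As a rough sketch of stratified sampling, `train_test_split` accepts a `stratify` argument that keeps the proportions of a categorical column the same in both sets. The toy DataFrame and its `group` column are invented for illustration; a continuous target like ours would first need to be binned into categories, which is not shown here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an imbalanced category: 80 rows in group 0, 20 in group 1
toy = pd.DataFrame({
    "x": range(100),
    "group": [0] * 80 + [1] * 20,
})

# stratify keeps the 80:20 group mix in both the training and test sets
train, test = train_test_split(
    toy, test_size=0.30, random_state=42, stratify=toy["group"]
)

# The share of group 1 should be about the same in each set
print(train["group"].mean(), test["group"].mean())
```

Without `stratify`, a small or rare group can end up over- or under-represented in the test set purely by chance.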

In [9]:
# # Numerically examine the X_train split
# X_train.describe().round(2)

In [10]:
# # Numerically examine the X_test split

# X_test.describe().round(2)

<div class="alert alert-block alert-warning">

# Discussion
    What do you notice about the two samples?
    In which ways are they similar?
    In which ways are they dissimilar?

In [11]:
# Numerically examine the y_train split


In [12]:
# Numerically examine the y_test split


# Simple Linear Regression

### We will build our first model using the S5 marker

In [13]:
# # Use Seaborn to create a scatter plot of S5 against Y
# sns.scatterplot(data=df_dia, x="S5", y="Y");

In [14]:
# # Fit the model and make predictions
# # Prepare data
# # You will note that we do not print out a model summary anymore
# # In this instance we are not concerned about that, our focus is how well we can make predictions

# # Selected inputs
# X = sm.add_constant(X_train['S5'])
# y = y_train

# # Fit model
# model = sm.OLS(y, X).fit()

# # make predictions
# predictions = model.predict(sm.add_constant(X_test['S5']))


### Examine the output: look at the predictions and the test set
Notice how the indexes line up, now we can get metrics for our results

In [15]:
# # examine predictions


In [16]:
# # examine true values


<div class="alert alert-block alert-warning">

# Discussion
    What do you notice about the two samples?
    In which ways are they similar?
    In which ways are they dissimilar?

## Root Mean Squared Error
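This measures roughly how many "units" of the target our predictions are off by on average; because the errors are squared, large misses count for more.\
Squaring and then taking the root ensures it is always positive; the details are not necessary to understand.\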
https://en.wikipedia.org/wiki/Root_mean_square_deviation
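
For anyone who wants to see the arithmetic, RMSE is $\sqrt{\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2}$. A small worked example with made-up numbers (three true values and three predictions, not taken from the diabetes data):

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 230.0])

errors = y_true - y_pred              # -10, 10, -30
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt((100 + 100 + 900) / 3)

print(round(float(rmse), 2))  # 19.15
```

The single miss of 30 pulls the RMSE well above the two misses of 10, which is the "large misses count for more" effect.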

In [17]:
# # Using Scikit-Learn Package to calculate RMSE
# rmse = root_mean_squared_error(y_test, predictions).round(2)
# print(f'RMSE: {rmse}')

## Mean Absolute Percentage Error
This measures, as a percentage of the true value, how far off we are on average.\
Taking the absolute value ensures it is always positive; the details are not necessary to understand.\
https://en.wikipedia.org/wiki/Mean_absolute_percentage_error
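
MAPE is the mean of $\left|\frac{y_i - \hat{y}_i}{y_i}\right|$. A small worked example with made-up numbers (not taken from the diabetes data). Note that Scikit-Learn's `mean_absolute_percentage_error` returns a fraction, which is why the notebook multiplies by 100.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 230.0])

# Each error as a fraction of the true value: 0.10, 0.0667, 0.15
mape = np.mean(np.abs((y_true - y_pred) / y_true))

print(f"{mape * 100:.2f}%")  # 10.56%
```

Because each error is divided by the true value, MAPE is undefined or very large when a true value is zero or close to it.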

In [18]:
# # Using Scikit-Learn Package to calculate MAPE
# mape = mean_absolute_percentage_error(y_test, predictions).round(2)*100
# print(f'MAPE: {mape}%')

# Multiple Linear Regression
Let's now add in some variables and see how our results change.

In [19]:
# Look at the head of the training inputs again


Let's add in BMI

In [20]:
# Create a list called "inputs" with "S5" and "BMI" as elements. 
# Note notation for lists is []


In [21]:
# # Fit the model with both inputs, and calculate the new RMSE and MAPE Scores

# X = sm.add_constant(X_train[inputs])  # adds intercept term
# y = y_train

# # Fit model
# model = sm.OLS(y, X).fit()

# # make predictions
# pred = model.predict(sm.add_constant(X_test[inputs]))

# rmse_2 = root_mean_squared_error(y_test, pred).round(2)
# print(f'RMSE_2: {rmse_2}')

# mape_2 = mean_absolute_percentage_error(y_test, pred).round(2)*100
# print(f'MAPE_2: {mape_2}')

In [22]:
# # Compare these new scores to the previous
# rmse_2 = root_mean_squared_error(y_test, pred).round(2)
# print(f'RMSE_2: {rmse_2}, Previous RMSE: {rmse}, Improvement = {(rmse - rmse_2).round(2)}' )

# mape_2 = mean_absolute_percentage_error(y_test, pred).round(2)*100
# print(f'MAPE_2: {mape_2}%  Previous MAPE: {mape}%, Improvement = {(mape - mape_2).round(2)}%')

<div class="alert alert-block alert-warning">

# Observations
    We can see here that we have made a marginal improvement by adding BMI

# Menti for Quiz on Terms

# Fix the SEX column

In [23]:
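# # Replacing 1 and 2 with 0 and 1 as previously discussed
# df_dia['SEX'] = df_dia['SEX'].replace({1: 0, 2: 1})
# # Note: X_train and X_test were created before this change, so re-run the
# # X / y assignment and the train/test split cells to pick up the new coding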

# Explore adding more features
Use the code below as a template to add more features and assess any changes in output.\
A good idea would be to save the metrics as elements in a dictionary to compare results.

In [24]:
# # Creating a dict to hold results
# results = {}

In [25]:
# # look at features
# X_train.head()

# Run the two cells below repeatedly with different inputs, then examine the dictionary of stored results

In [26]:
# # Select features

# inputs = [.......]

In [27]:
# # Fit the model with the selected inputs, and calculate the new RMSE and MAPE Scores

# X = sm.add_constant(X_train[inputs])  # adds intercept term
# y = y_train

# # Fit model
# model = sm.OLS(y, X).fit()

# # make predictions
# pred = model.predict(sm.add_constant(X_test[inputs]))


# # be aware of overwriting previous results, by calling this variable "rmse" we overwrite what was previously stored 
# rmse = root_mean_squared_error(y_test, pred).round(2)
# # printing results
# print(f'RMSE: {rmse}')



# # be aware of overwriting previous results, by calling this variable "mape" we overwrite what was previously stored 
# mape = mean_absolute_percentage_error(y_test, pred).round(2)*100
# # printing results
# print(f'MAPE: {mape}')



# # adding result to dict
# results[str(inputs)] = {'RMSE':rmse, "MAPE": mape}


In [28]:
# # Printing out dict of results
# results