[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorial_notebooks/4_predictive_analytics_tasks.ipynb) 

# Tutorial 4 - Foundations of predictive analytics
The lecture has introduced the standard data structure for predictive analytics. We need data with input variables and a **target variable**. The goal of predictive analytics is to derive a functional relationship between the input variables and the target. We assume that we can observe, measure, or control the input variables. Hence, our predictive model (the functional relationship between inputs and the target that we infer from past data) facilitates forecasting the expected value of the target variable based on the input variables. Whenever we observe a new case, we gather the values of the input variables for that case and feed them into our prediction model. Given that input, the model produces a forecast of the target variable value for that case. So, predictive analytics is all about finding *good* **input-to-output mappings**.

<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/sl_xy_map.png" width="300" height="200" alt="Supervised Learning Principle">




As an example, think of linear regression. Formally speaking, a linear regression function maps inputs $\boldsymbol x = (x_1, x_2, ..., x_d)$ from the domain $X \subseteq \mathbb{R}^d$ to outputs $y \in \mathbb{R}$.



Recall from the lecture that many alternative terms are in use to refer to the input variables. Covariates, (independent) variables, and attributes are only a few examples. In the interest of consistent terminology, we will use the term **features** instead of input variables in the following.

## Our first predictive model: linear regression
Linear regression assumes a linear additive relationship between features and the target. Specifically, we assume a model:
$$ y = \beta_0 + \beta_1 x_1 +  \beta_2 x_2 + ... + \beta_d x_d + \epsilon $$
where $y$ is the target variable, $\beta$ denotes the regression coefficients (as usual), $x_j, j=1, ..., d$ are our features, and $\epsilon$ denotes the error term. Adopting the above perspective, when using linear regression, we assume we *know* the true functional form of the input-to-output mapping. Specifically, we assume this mapping to be linear and additive. Under this assumption, our task is to find the unknown parameters that characterize our mapping function, and these are the regression coefficients $\beta$. 

### Data generation

To warm up, we create synthetic data for regression modeling. To keep things simple, we consider a univariate setting with only one feature. The classic example in business is that of a price response function, so we can assume that our single feature corresponds to the sales price of some product and our target to the sales volume of that product at a given price.

In [1]:
# load relevant libraries

import random
import numpy as np
import matplotlib.pyplot as plt
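random.seed(888)     # seeds Python's built-in random generator
np.random.seed(888)  # seeds NumPy's generator, which the exercises below rely on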

#### Exercise 1 (peer programming): Synthetic data for regression

##### a) Generate feature values
Create a table (i.e. matrix) of normally distributed random numbers. This table will serve as our synthetic feature matrix $X$.
- Declare variables to control the number of data points and the number of features. 
- Use the `Numpy` function `random.normal()` to create normally distributed random numbers with suitable dimensionality.
- Store the resulting random number matrix in a variable `X`.
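
One possible sketch of these steps is shown below. The values chosen for the number of data points and features (`n = 100`, `d = 1`) are assumptions for illustration; the single feature matches the univariate price example, but any positive values work.

```python
import numpy as np

np.random.seed(888)  # seed NumPy's generator for reproducibility

n = 100  # number of data points (assumed value)
d = 1    # number of features (assumed value; 1 matches the price example)

# draw an n-by-d matrix of standard normal feature values
X = np.random.normal(loc=0.0, scale=1.0, size=(n, d))

print(X.shape)  # rows are data points, columns are features
```

Keeping `n` and `d` in separate variables makes it easy to rerun the later exercises with more features.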

In [2]:
# Exercise 1a


##### b) Generate dependent variable (aka target)
Create a dependent variable $y$. To achieve this, recall the regression equation shown above. Given that you already created $X$, you need regression coefficients $\beta$ and residuals $\epsilon$. Since we work with synthetic data, we can simply set $\beta$ to some arbitrary values or sample *true* coefficient values randomly. As to the residuals, we must generate these as random numbers. Lastly, we must ensure that our *true* coefficients and random numbers are of the right dimensionality.
- Create a variable `beta` as an array of random numbers of *suitable size*
- Create a variable `epsilon` as an array of random numbers of *suitable size*
- Create a variable `y` and compute its value by evaluating the regression equation $y = X \times \beta + \epsilon$. Note that $\times$ refers to the matrix-vector product of feature matrix $X$ and coefficient vector $\beta$, not to scalar or element-wise multiplication. You can use the function `dot()` from `Numpy` to compute this product.
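
A minimal sketch of this step, assuming the feature matrix is created as in part a) with illustrative sizes `n = 100` and `d = 1`. The residual scale of `0.5` is an arbitrary assumption; it only controls how noisy the synthetic target is.

```python
import numpy as np

np.random.seed(888)

n, d = 100, 1                      # assumed sizes, as in part a)
X = np.random.normal(size=(n, d))  # feature matrix

beta = np.random.normal(size=d)                # "true" coefficients, one per feature
epsilon = np.random.normal(scale=0.5, size=n)  # residuals, one per data point (assumed scale)

# matrix-vector product X beta plus noise; the result has one entry per data point
y = np.dot(X, beta) + epsilon

print(beta.shape, epsilon.shape, y.shape)
```

The order of the arguments in `np.dot(X, beta)` matters: `X` has shape `(n, d)` and `beta` has shape `(d,)`, so the product has shape `(n,)`.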


In [3]:
# Exercise 1b


##### c) Putting everything together
Create a matrix of scatter plots using the `Matplotlib` function `subplots`. 
- Study the documentation to understand how the function works and what inputs it requires
- For each feature, label the x-axis of the scatter plot so as to also display the *true* coefficient of the corresponding feature. 
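
The following sketch shows one way to build such a plot matrix. It assumes three features (`d = 3`) so that there is more than one panel, and it regenerates the synthetic data so that the block runs on its own; the figure size and marker size are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(888)

n, d = 100, 3                                   # assumed sizes for illustration
X = np.random.normal(size=(n, d))
beta = np.random.normal(size=d)                 # "true" coefficients
y = X.dot(beta) + np.random.normal(scale=0.5, size=n)

# one row of d panels; squeeze=False always returns a 2D array of axes
fig, axes = plt.subplots(nrows=1, ncols=d, figsize=(4 * d, 3), squeeze=False)

for j in range(d):
    ax = axes[0, j]
    ax.scatter(X[:, j], y, s=10)
    # show the true coefficient of feature j in the x-axis label
    ax.set_xlabel(f"feature {j + 1} (true beta = {beta[j]:.2f})")
    ax.set_ylabel("y")

fig.tight_layout()
```

In a notebook the figure renders automatically at the end of the cell; in a plain script you would additionally call `plt.show()`.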

In [4]:
# Exercise 1c

### Linear regression
The lecture elaborated on linear regression including its internal functioning. Recall our visual summary of the method: <br>
<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/linreg/summary.PNG" width="640" height="360" alt="Linear Regression as Supervised Learning Algorithm">
<br>
In the following, we revisit the two key steps of estimating the regression model and using it to compute forecasts.

#### OLS estimate of regression coefficients
The lecture briefly mentioned that for linear regression, it is straightforward to *estimate* a model because we can compute the minimum of the least-squares loss function analytically. The equation was $$ \hat{\beta} = (X^{\top}X)^{-1}X^{\top} y $$

##### Exercise 2: Compute the OLS estimate by hand
To implement the above equation, you can make use of the following `Numpy` functions:
- `.transpose()` to compute the transpose of a matrix
- `.dot()` to compute the dot product 
- `.linalg.inv()` to compute the inverse of a matrix
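
As a sketch of how these functions combine, the block below evaluates the OLS equation on freshly generated synthetic data. The sizes, the fixed *true* coefficients `[1.5, -2.0]`, and the noise scale are assumptions made for illustration.

```python
import numpy as np

np.random.seed(888)

n, d = 200, 2                      # assumed sizes
X = np.random.normal(size=(n, d))
beta = np.array([1.5, -2.0])       # assumed "true" coefficients
y = X.dot(beta) + np.random.normal(scale=0.1, size=n)

# normal equation: beta_hat = (X'X)^{-1} X'y
XtX = X.transpose().dot(X)         # d-by-d matrix
Xty = X.transpose().dot(y)         # vector of length d
beta_hat = np.linalg.inv(XtX).dot(Xty)

print("true coefficients:     ", beta)
print("estimated coefficients:", beta_hat)
```

With little noise and enough data points, the estimates should land close to the true coefficients. Explicitly inverting $X^{\top}X$ mirrors the formula; numerically, `np.linalg.solve(XtX, Xty)` is usually the more stable way to obtain the same result.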

In [5]:
# Exercise 2


While solving the normal equation *by hand* is useful for education, we would never do this in practice. Instead, we would use a suitable library to fit (aka train) a linear regression model. Of course, this is our next exercise.

##### Exercise 3: Linear regression using sklearn
Python features at least two libraries that are commonly used to estimate linear regression (and other) models. One is the `statsmodels` library and the other is the already familiar library `sklearn`. The former is very suitable if the goal of regression is *explanatory modeling*. For prediction, `sklearn` is the better choice. Here, we focus on `sklearn`. To use it for estimating a linear regression model using our synthetic data, we need to implement the following steps:
- Import the class `LinearRegression` from the namespace `sklearn.linear_model`.
- Apply the method `fit` to our data, which we store in the variables `X` and `y`

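To then check that the values of the estimated coefficients are the same as those we computed above, you can access the estimated coefficients via `lr_model.coef_`, where we assume that the fitted model is stored in a variable with name `lr_model`. Note that `LinearRegression` fits an intercept by default. Our synthetic data and the normal equation above do not include an intercept, so create the model with `fit_intercept=False` to obtain an exact match.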
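
A minimal sketch of this comparison, again on self-generated synthetic data with assumed sizes and coefficients. It switches off the intercept with `fit_intercept=False` so that `sklearn` solves the same problem as the normal equation without an intercept column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(888)

n, d = 200, 2                      # assumed sizes
X = np.random.normal(size=(n, d))
beta = np.array([1.5, -2.0])       # assumed "true" coefficients
y = X.dot(beta) + np.random.normal(scale=0.1, size=n)

# fit without intercept to mirror the data-generating process
lr_model = LinearRegression(fit_intercept=False)
lr_model.fit(X, y)

# OLS estimate by hand for comparison
beta_hat = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))

print("sklearn coefficients:", lr_model.coef_)
print("normal equation:     ", beta_hat)
print("match:", np.allclose(lr_model.coef_, beta_hat))
```

With the default `fit_intercept=True`, the slope estimates would differ slightly from `beta_hat` because the model additionally estimates $\beta_0$.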



In [6]:
# Exercise 3


#### Forecasting
To complete our first linear regression demo and the part on working with synthetic data, let us also illustrate the second core step in supervised machine learning, the **calculation of forecasts**. We follow the previous approach of first doing the calculations *by hand* and then using a library to do it, which is representative of how we would proceed in practice.  

##### Exercise 4: Calculation of forecasts
Here is the set of tasks to illustrate the calculation of forecasts by hand and using `Numpy`.
- Reusing code from Exercise 1, create some additional synthetic data
    - Call the variables storing your new data `X_test` and `y_test`
    - For `X_test`, you need to create new random feature values
    - For `y_test`, you need to create new residuals, whereas you re-use the *true* coefficients
- To calculate forecasts *by hand*, use your vector of OLS coefficients `beta_hat` and apply it to your new synthetic data `X_test`. Recall that the calculation of regression model outputs involves the product of the feature matrix and the coefficient vector, here $X_{test} \times \hat{\beta}$. We saw ways to do this calculation using `Numpy`.
- To calculate forecasts using `sklearn`, which is a lot easier, you only need to call the method `predict` of your trained regression model `lr_model`, which you created in Exercise 3.
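
The tasks above can be sketched as follows. The block regenerates training data so that it runs on its own; the sizes, the *true* coefficients, and the use of `fit_intercept=False` are assumptions chosen to match a data-generating process without intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(888)

d = 2
beta = np.array([1.5, -2.0])                     # assumed "true" coefficients

# training data
X = np.random.normal(size=(200, d))
y = X.dot(beta) + np.random.normal(scale=0.1, size=200)

# new data: fresh features and residuals, same true coefficients
X_test = np.random.normal(size=(50, d))
y_test = X_test.dot(beta) + np.random.normal(scale=0.1, size=50)

# estimate the model in both ways
beta_hat = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
lr_model = LinearRegression(fit_intercept=False).fit(X, y)

# forecasts by hand and via sklearn
y_hat_manual = X_test.dot(beta_hat)
y_hat_sklearn = lr_model.predict(X_test)

print("forecasts agree:", np.allclose(y_hat_manual, y_hat_sklearn))
```

Note that `y_test` is not needed to compute forecasts; it only serves to judge how good the forecasts are afterwards.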

In [7]:
# Exercise 4


##### Exercise 5: Calculation of forecast accuracy
The lecture sketched a few common performance metrics to assess linear regression models including the mean squared error (MSE). Recall that MSE is defined as:  <br><br>
$ MSE = \frac{1}{n}\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 $,
<br><br>
with:
- $n$ = number of data points
- $Y$ = true values of the target variable
- $\hat{Y}$ = forecasts of the regression model

Provided you solved Exercise 4, you have already calculated predictions. Calculate, for a last time *by hand*, the MSE of your regression model. Afterwards, run a web search to find a function that does the calculation for you, and re-implement the MSE calculation using that function.
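
One way to approach both parts is sketched below. It assumes synthetic training and test data as in the earlier exercises (sizes and coefficients are illustrative) and uses `mean_squared_error` from `sklearn.metrics` as one function that performs the calculation; other libraries offer equivalents.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

np.random.seed(888)

d = 2
beta = np.array([1.5, -2.0])                     # assumed "true" coefficients

X = np.random.normal(size=(200, d))
y = X.dot(beta) + np.random.normal(scale=0.1, size=200)
X_test = np.random.normal(size=(50, d))
y_test = X_test.dot(beta) + np.random.normal(scale=0.1, size=50)

# OLS estimate and forecasts for the test data
beta_hat = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
y_hat = X_test.dot(beta_hat)

# MSE by hand: average of squared differences between truth and forecast
mse_manual = np.mean((y_test - y_hat) ** 2)

# MSE via sklearn (argument order: true values, then forecasts)
mse_sklearn = mean_squared_error(y_test, y_hat)

print("MSE by hand:", mse_manual)
print("MSE sklearn:", mse_sklearn)
```

Because the residuals were drawn with a standard deviation of `0.1`, an MSE in the vicinity of `0.01` would be the expected order of magnitude for a well-estimated model.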

In [8]:
# Exercise 5

# Summary
That's the end of today, another demo notebook completed. *Well Done!*

We did not actually spend much time on prediction but concentrated on basic methods like linear regression, which can be used for prediction. And, importantly, we have spent a lot of time on the data that we need for prediction: data with features and a target variable. Having experienced what such data really looks like and how you can create it yourself will help you a lot on your data science journey.

Next up, we continue with elaborating on data handling and readying data for modeling.