<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/06_linear-regression-varia.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Linear Regression Model: Basics
___
We will cover linear regression in more depth later, in the first week of the break. However, you will need to understand some of the basics for the big assignment. This notebook introduces those basics, and we will refine them later in the course.

Linear regression is the *Hello world!* of statistical modeling and machine learning. It is often considered the simplest statistical model out there and you will encounter it **a lot**. We recommend you pay close attention and try to understand as much as you can. It is not only a useful tool for your work and research but it is also a technique which permeates many modern studies. If you ever read a newspaper article about how *XYZ is bad or good for your health* and *ABC causes this or that*, chances are, there is a linear regression going on somewhere. In a sense, understanding linear regression (and other statistical models) gives you a new understanding of the world!

___
## Data pre-processing
We are going to work with bike rental data, which you may remember from the trial lecture.

In [None]:
# Import necessary packages
import matplotlib.pyplot as plt # Plotting
import numpy as np # Numerical computing
import pandas as pd # Dataframes

# Import sklearn utilities
from sklearn.linear_model import LinearRegression # Linear regression estimator
from sklearn.model_selection import train_test_split # Train/test splitting

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Start by importing the data
rentals = pd.read_csv(f"{DATA_PATH}/bike_rental.csv")
rentals.head(5) # Display the first 5 observations

We will start by splitting the data into two datasets: one with roughly 75% of the observations and one with roughly 25%. In data science, we often talk about *training* and *testing* datasets. I don't want to get ahead of myself too much, because we will cover this in detail later on, but the general idea is to have one dataset (*train*) which we use to **train** (or estimate, or fit) our model, and another dataset (*test*) which we use to **test** (or validate) our model.

While it is fairly simple to write our own code to split the data into two samples, the *conventional way* of doing it in data science is to use the `train_test_split` function from the `sklearn` package.

In [None]:
# Split the rows of the dataset into two subsamples (75% / 25%)
# random_state fixes the shuffling so that the split is reproducible
rentals1, rentals2 = train_test_split(rentals, test_size=0.25, random_state=72)

In [None]:
# Make sure that the size of the dataframes is what we expect
print(f"rentals1 has {rentals1.shape[0]:>5} rows")
print(f"rentals2 has {rentals2.shape[0]:>5} rows")
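
For comparison, a hand-written split might look like the sketch below. It uses a small made-up dataframe (`toy`) rather than the rental data, and it will not select the same rows as `train_test_split`, because the shuffling is done differently. The names `toy`, `toy_train` and `toy_test` are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the rental data (values are made up)
toy = pd.DataFrame({"temp": np.linspace(0, 1, 20), "cnt": np.arange(20)})

# Shuffle the row positions reproducibly, then cut them at the 75% mark
rng = np.random.default_rng(72)
shuffled = rng.permutation(len(toy))
cutoff = int(len(toy) * 0.75)

# The first 75% of the shuffled positions go to train, the rest to test
toy_train = toy.iloc[shuffled[:cutoff]]
toy_test = toy.iloc[shuffled[cutoff:]]

print(f"train has {len(toy_train)} rows, test has {len(toy_test)} rows")
```

Every row ends up in exactly one of the two samples, which is the property we care about when splitting.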

Now that we have our two datasets, let's keep working with the first one for the time being. What is the relationship between the outside temperature (`temp`) and the count of bike rentals (`cnt`)?

In [None]:
# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Make a scatterplot with temperature on the x-axis and number of rentals on the y axis
ax.scatter(rentals1["temp"], rentals1["cnt"], alpha=0.4)
# Add labels on the axes and a grid
ax.grid(True)
ax.set_xlabel("Normalized temperature in Celsius")
ax.set_ylabel("Number of rentals")

Do you see the lines? Why doesn't the temperature look continuous? You can find more information with regards to how the temperature is encoded on the dataset's website: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset


When we make a scatter plot with data that has categorical or rounded x-values, as is the case for this normalized temperature, we get the lines we see above. This is slightly annoying, because it makes the plot *hard to read*. It is unclear where the *mass* of the points is situated because they all overlap.

What we can do is introduce **jitter**, i.e., we add some random noise to the x-values, so that the points are not *exactly* on the lines but spread closely around them. Be careful not to make the noise too large: you don't want your data points to change completely!
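
To see the mechanics in isolation, here is a minimal sketch on made-up x-values that only take three rounded levels. The levels, the sample size and the seed are assumptions chosen for illustration, and the sketch uses a seeded `np.random.default_rng` generator so that it is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Hypothetical x-values that only take a few rounded levels
x = np.repeat([0.2, 0.4, 0.6], 100)

# Add small Gaussian noise with mean 0 and standard deviation 0.005
x_jittered = x + rng.standard_normal(x.shape[0]) * 0.005

print(np.unique(x).size)             # number of distinct values before jittering
print(np.unique(x_jittered).size)    # number of distinct values after jittering
print(np.abs(x_jittered - x).max())  # largest shift applied to any point
```

The points no longer sit on top of each other, yet none of them moves far from its original level.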

#### ➡️ ✏️ Task 1

Replace the line
```python 
ax.scatter(rentals1["temp"], rentals1["cnt"], alpha=0.4)
``` 

with

```python
ax.scatter(rentals1["temp"] + np.random.randn(rentals1.shape[0]) * .005, 
            rentals1["cnt"], alpha=0.4)
```

Comment on the difference between both plots. Do you think one of them is easier/harder to interpret? Do you understand what 
```python
... + np.random.randn(rentals1.shape[0]) * .005
```
is doing, and why we chose `0.005` ? (🙀 🤯 This one is difficult, in particular if you haven't had a lot of statistics yet!)

___
## Our first linear regression
Suppose that you now want a simple model, or *equation*, which indicates how the general tendency, or *trend*/*pattern*, in this cloud of dots behaves.

**Linear regression** is one of the simplest ways to create such a model. The main idea is to estimate a line, i.e., a linear fit, for which the average squared vertical distance between the points and the line is minimal. In that sense, the regression line is the one which is **closest** to the points.
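
With a single input variable, the line that minimizes the average squared distance has a well-known closed form: the slope is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept follows from the means. The sketch below applies this to made-up data (a noisy line with intercept 2 and slope 3, both assumptions for illustration) and checks that nudging the line away from the estimate makes the average squared distance larger.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: a noisy straight line with intercept 2 and slope 3
x = rng.uniform(0, 1, size=200)
y = 2 + 3 * x + rng.normal(0, 0.1, size=200)

# Closed-form least-squares estimates for one input variable
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()


def mean_squared_distance(a, m):
    """Average squared vertical distance between the points and y = a + m * x."""
    return np.mean((y - (a + m * x)) ** 2)


print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
# Compare the fitted line with two slightly shifted lines
print(mean_squared_distance(intercept, slope))
print(mean_squared_distance(intercept + 0.1, slope))
print(mean_squared_distance(intercept, slope + 0.1))
```

The first of the three printed distances should be the smallest, which is what "closest to the points" means here.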

Fitting a linear regression with `sklearn` is very easy. Observe the following:

In [None]:
# Instantiate the linear regression model
linreg = LinearRegression()

In [None]:
# Fit the model to our data
# ⚠️ Careful with the double bracket around "temp". Don't think about it too much
# for now, but the main idea is that our first input needs to be a matrix and not
# a vector. We will look at this in more detail in later classes
linreg.fit(rentals1[["temp"]], rentals1["cnt"])
# cnt stands for count

That's it? Well yes, we have now fitted a linear regression model to the data. We can look at the coefficients estimated by this model.

In [None]:
print(f"The intercept is {linreg.intercept_:.2f}")
print(f"The temperature coefficient is {linreg.coef_[0]:.2f}")
linreg.coef_

So what are the **intercept** and the **coefficient**? Well, above we mentioned something about estimating a *line* which best fits the data. You might remember that a line can be represented by

$$y = a + mx$$

In this case, $a$ is your intercept, and $m$ is your coefficient on $x$. So what this tells us is that we can model

$$\text{number of rentals} = 1.75 + 377.28 \cdot \text{normalized temperature}.$$

But we are skipping a lot of details. You will learn about all of this as the course progresses. In any case, now that we have a line, we can also plot it! Furthermore, we can use this equation to make predictions about the number of rentals for a given temperature level.

In [None]:
# Making predictions from a fitted model with sklearn is very easy!
rentals1["pred"] = linreg.predict(rentals1[["temp"]])

In [None]:
# The x-axis goes from 0 to 1, create a sequence from 0 to 1
xs = np.linspace(0, 1, num=100)
# Create the y = a + mx line described above
ys = linreg.intercept_ + linreg.coef_ * xs

In [None]:
# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Make a scatterplot with temperature on the x-axis and number of rentals on the y axis
ax.scatter(rentals1["temp"] + np.random.randn(rentals1.shape[0]) * .005, 
            rentals1["cnt"], alpha=0.4, label="Data points")
# Add our straight line, make it orange and a bit wider such that it is visible
ax.plot(xs, ys, label="Linear fit", color="orange", lw=3)
# Add the predictions of our model using a scatter plot
ax.scatter(rentals1["temp"], rentals1["pred"], label="Predicted points", 
            color="orange")
# Add labels on the axes, a legend, and a grid
ax.grid(True)
ax.legend()
ax.set_xlabel("Normalized temperature in Celsius")
ax.set_ylabel("Number of rentals")

One of the strengths of linear regression (and other statistical models) is that they *abstract* a relationship between variables. Obviously, our model is not the best, but that's not the point here. Using a model, we now have an abstract way to formulate how the number of rentals changes as the temperature changes. Even better, we can use this abstraction on new data we have never seen before, or we could make predictions on how many bikes we need to have in store depending on the weather forecast!

Let's go ahead and predict the number of rentals for our second dataset, `rentals2`, which the model has never *seen* until now.

In [None]:
# Making predictions for new data is just as easy
rentals2["pred"] = linreg.predict(rentals2[["temp"]])

In [None]:
# Repeat the same plot but for the second dataset
fig, ax = plt.subplots(figsize=(12, 8))
# Make a scatterplot with temperature on the x-axis and number of rentals on the y axis
ax.scatter(rentals2["temp"] + np.random.randn(rentals2.shape[0]) * .005, 
            rentals2["cnt"], alpha=0.4, label="Data points")
# Add our straight line, make it orange and a bit wider such that it is visible
ax.plot(xs, ys, label="Linear fit", color="orange", lw=3)
# Add the predictions of our model using a scatter plot
ax.scatter(rentals2["temp"], rentals2["pred"], label="Predicted points", 
            color="orange")
# Add labels on the axes, a legend, and a grid
ax.grid(True)
ax.legend()
ax.set_xlabel("Normalized temperature in Celsius")
ax.set_ylabel("Number of rentals")

___
## Varia: Growth rates and cumulative growth

Alright, that was a brief introduction to linear regression. You will get to play around with it much more during our workshop. Let's now add a few miscellaneous things that will help you when playing around with regressions and other statistical models, e.g., in your projects. They all relate to growth rates (rates of change) and cumulative growth. Working with growth rates often provides a useful scaling of the data and can deliver better predictive performance than level data.

In [None]:
np.random.seed(72) # Set the random seed
# Create a random series 
myseries = .9 + np.random.rand(20) * .2
myseries[0] = 1 # Set the first element to one

In [None]:
myseries

In [None]:
# Plot the series
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(myseries)
# Add grid, axis ticks
ax.grid(True)
ax.set_xticks(range(myseries.shape[0]), range(1, myseries.shape[0] + 1))

In [None]:
# Apply a cumulative product to the series
myseries2 = myseries.cumprod()

In [None]:
# Plot the new series
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(myseries2)
# Add grid, axis ticks
ax.grid(True)
ax.set_xticks(range(myseries2.shape[0]), range(1, myseries2.shape[0] + 1))

Can you see what `.cumprod()` did? What if we print out the series next to each other?

In [None]:
# Print out the series next to each other
series = pd.DataFrame({"S1": myseries, "S2": myseries2})
series
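
One way to read the table: if each value of `S1` is interpreted as a growth factor (1 plus a growth rate), then `S2` is the cumulative growth up to that point. The sketch below rebuilds a cumulative product by hand on a few made-up factors and compares it with `.cumprod()`. The variable names are only for illustration.

```python
import numpy as np

factors = np.array([1.0, 1.1, 0.9, 1.2])  # made-up growth factors

# Build the running product by hand: each entry multiplies all factors so far
running = []
level = 1.0
for f in factors:
    level *= f
    running.append(level)

print(running)
print(np.allclose(running, factors.cumprod()))  # True
```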

What if we instead want to compute the rate of change? Let's add a third column to our dataframe. We keep working with dataframes from now on because they offer a few nice functionalities for *time series operations* which are not implemented as nicely in plain `numpy`.

### Shifting series by a specific lag

`pandas` provides some amazing tools when it comes to data handling. One of them is the `.shift` method, which allows us to *shift* the data by a given number of lags (the default is 1 lag). Consider the following example.

In [None]:
# Create a dataframe with one column (1, 2, 3, 4, 5)
df = pd.DataFrame({"A": range(1, 6)})
# Add a column which is the column A, but shifted by 1 lag
df["B"] = df["A"].shift()
# Add a column, which is the column A, but shifted by 3 lags
df["C"] = df["A"].shift(3)
df # Display the final dataframe

Of course, we end up with some `NaN`. Do you see why?

So what if we wanted to compute a growth rate? Consider that for a time series $\mathbf{x} = \{x_1, x_2, \dots, x_T\}$, the growth rate can be computed as 

$$\text{growth rate at time } t = 100 \cdot \frac{x_t - x_{t-1}}{x_{t-1}}.$$
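
As a small worked example of this formula, take a made-up series that goes from 100 to 110 and then to 99. The values are chosen only to make the arithmetic easy to follow.

```python
# Made-up series: 100 -> 110 -> 99
x = [100.0, 110.0, 99.0]

# Apply the formula from the second observation onwards
# (the first observation has no predecessor, so it has no growth rate)
growth = [100 * (x[t] - x[t - 1]) / x[t - 1] for t in range(1, len(x))]
print(growth)  # [10.0, -10.0]
```

Note that growing by 10% and then shrinking by 10% does not bring the series back to 100, because the second rate is applied to the larger value 110.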

Hence, we could use

In [None]:
# Compute the growth rate
series["GR1"] = (series["S1"] - series["S1"].shift()) / series["S1"].shift() * 100
series # Show the result

While this works, `pandas` actually makes it even simpler. Observe:

In [None]:
# The better way to compute the growth rate
series["GR2"] = series["S1"].pct_change() * 100
series # Display the result

So `pandas` actually implements the method `.pct_change` which directly allows us to compute the growth rate, pretty nifty!
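
Growth rates and cumulative growth are two sides of the same coin: accumulating the growth factors with `.cumprod()` brings us back to the level series. The following is a minimal sketch on a made-up series (`levels` is a hypothetical name); it keeps the rates as fractions rather than percentages and treats the missing first rate as "no change".

```python
import numpy as np
import pandas as pd

levels = pd.Series([100.0, 110.0, 99.0, 118.8])  # made-up level series

# Growth rates as fractions (not multiplied by 100); the first entry is NaN
rates = levels.pct_change()

# Turn rates into growth factors, treat the missing first rate as "no change",
# and accumulate them to get an index that starts at 1
index = (1 + rates.fillna(0)).cumprod()

# Rescaling the index by the first level recovers the original series
rebuilt = levels.iloc[0] * index
print(np.allclose(rebuilt, levels))  # True
```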