# ML Week 2 - Linear Regression

---

[Top](#ML-Week-2---Linear-Regression) | [Previous section](#ML-Week-2---Linear-Regression) | [Next section](#Part-0:-Quick-review) | [Bottom](#Thank-you)

This notebook has the following sections:

* [Part 0: Quick review!](#Part-0:-Quick-review)
* [Part 1: Introduction to regression](#Part-1:-Introduction-to-regression)
* [Part 2: Solving the linear regression problem](#Part-2:-Solving-the-linear-regression-problem)
* [Part 3: Improving the regression: adding more features](#Part-3:-Improving-the-regression:-adding-more-features)
* [Part 4: Testing the model and more](#Part-4:-Testing-the-model-and-more)


## Part 0: Quick review
---

[Top](#ML-Week-2---Linear-Regression) | [Previous section](#ML-Week-2---Linear-Regression) | [Next section](#Part-1:-Introduction-to-regression) | [Bottom](#Thank-you)

Let's see who took a look at the **median** and **correlation** lessons.

Run the following code to load a dataset based on the [Capital Bikeshare dataset from the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset). We'll also import the pandas library so we can load the file as a DataFrame.

In [None]:
# Import pandas
import pandas as pd

# Import bike sharing data
bike_sharing_data = pd.read_csv('data/ml_week_2_bike_sharing.csv')

Let's also import our visualisation libraries.

In [None]:
# Import in matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

---

### Exercise 

The following code creates a [boxplot](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51) of the **temp_degrees_c** column in our dataset. Answer the following questions on a piece of paper, or using Python...

1. Approximately, what is the median?
2. Approximately, what is the 25th percentile?
3. Approximately, what is the 75th percentile?

In [None]:
# Setup a figure
plt.figure(figsize=(10, 5))

# Create boxplot
sns.boxplot(x=bike_sharing_data['temp_degrees_c'])

Let's import numpy as well.

In [None]:
# Import numpy
import numpy as np
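
With numpy available, a boxplot reading can also be checked in Python. The sketch below uses a small made-up array (`toy_temps` is a hypothetical name, not a column from the dataset); the same call should work on `bike_sharing_data['temp_degrees_c']`.

```python
import numpy as np

# Hypothetical toy values standing in for a temperature column
toy_temps = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# The box in a boxplot spans the 25th to the 75th percentile;
# the line inside the box is the median (50th percentile)
q25, median, q75 = np.percentile(toy_temps, [25, 50, 75])
print(q25, median, q75)  # 3.0 5.0 7.0
```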

---

### Exercise

The following code creates a [**heatmap**](https://seaborn.pydata.org/generated/seaborn.heatmap.html) of the correlation matrix in our dataset. Please answer the following on a sheet of paper, or using Python.

1. What features positively correlate?
2. What features negatively correlate?

---

In [None]:
# Setup a figure
plt.figure(figsize=(10, 5))

# Create correlation matrix (numeric columns only; dteday is loaded as text)
corr_matrix = bike_sharing_data.select_dtypes(include='number').corr()

# Draw the heatmap with a diverging colour map
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr_matrix, vmax=1.0, vmin=-1.0, linewidths=.5, cmap=cmap, annot=True)
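
As a reminder of what the numbers in a correlation matrix mean, here is a minimal sketch on made-up data (the arrays `y_up` and `y_down` are hypothetical). A value near $1$ means two variables rise together, and a value near $-1$ means one falls as the other rises.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1       # rises as x rises
y_down = -3 * x + 10   # falls as x rises

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
print(round(r_up, 2), round(r_down, 2))  # 1.0 -1.0
```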

## Part 1: Introduction to regression

---
[Top](#ML-Week-2---Linear-Regression) | [Previous section](#Part-0:-Quick-review) | [Next section](#Part-2:-Solving-the-linear-regression-problem) | [Bottom](#Thank-you)

Let's take a quick look at our dataset. The dataset has the following columns...

| column | description |
| :----- | :--- |
| dteday | A specific date |
| temp_degrees_c | The temperature in degrees celsius |
| windspeed | A normalised windspeed for that specific day. <br>The data has been normalised such that the min=0, and max=1 |
| cnt | The count of capital bikes used on that day |

This data, as mentioned, is from the [Capital Bikeshare system](https://www.capitalbikeshare.com/) in Washington D.C.

<br>

<img src="https://momentummag.com/wp-content/uploads/2016/04/sdf-1.jpg" width="500">

---

### Thought exercise

Look at the data and do some research on the capital bike system. Try to answer the following questions. Feel free to partner up.

1. What is the capital bikeshare system?
2. What _business questions_ might we be able to answer from the data?

---

### What is regression?

From [The Hundred-Page Machine Learning Book](http://themlbook.com/):

> **Regression** is a problem of predicting a real-valued label (often called a **target**) given an unlabeled example.

Put into some of the terminology of the last lesson, a regression problem tries to develop a method to predict a **continuous variable** within our dataset.

Today, we're going to build a regression algorithm to predict the **count of bikes** used based upon the temperature and windspeed.

---

### Thought exercise

* Why would we do this?
* What would Capital Bikeshare be able to do with this information?
* Have you done regression before? For what other problems might regression be useful?

---


### Problem setup

#### Side-step...linear equations

In math, the following function is called a **linear equation**.

$$ f(x) = mx + b $$

* $x$ is an **input** to the equation
* $f(x)$ is an **output** of the equation
* $m$ is called the **slope** of a line
* $b$ is called the **y-intercept**

Let's run the following code, which plots the following linear equation.

$$f(x) = 3x + 2$$

In this equation:

* $m = 3$
* $b = 2$

In [None]:
# Make a numpy array. Each value represents an input, x
x = np.array([-2, -1, 0, 1, 2])

# Compute 3 * x + 2
f_x = 3*x + 2

# Plot
plt.figure(figsize=(8, 4))
with sns.axes_style("whitegrid"):
    sns.lineplot(x=x, y=f_x)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Plot of f(x)=3x + 2')
print('')

---
### Exercise

* When we change a value of $x$ by 1, how does the value of $f(x)$ change?
* When $x=0$, what is the value of $f(x)$?
* How do these values relate to the equation $f(x) = 3x + 2$?
---

We call $m$ the **slope** and $b$ the **y-intercept**. Hopefully these are familiar terms. Since there is a single input, $x$, we call this a **univariate** (one variable) linear model.
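
The link between the equation and the plot can also be checked numerically. This is a small sketch using the same $m = 3$ and $b = 2$ as above.

```python
import numpy as np

m, b = 3, 2
x = np.array([-2, -1, 0, 1, 2])
f_x = m * x + b

# Each step of 1 in x changes f(x) by the slope m
print(np.diff(f_x))      # [3 3 3 3]

# At x = 0 the output equals the y-intercept b
print(f_x[x == 0][0])    # 2
```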

### What does this have to do with our dataset?

In our problem, we want to help capital bikeshare predict bike usage based upon the weather for a given day. We have many columns in our dataset, but let's pretend we have just the daily temperature, in degrees celsius, and the count of bike users for a specific day. So, in other words, in our dataset we have...

* A vector, $x$ of temperatures for each day, in degrees celsius
* An output, $y$, of counts that represent the number of bikes used per day

To be able to estimate bike usage, we can make a **linear equation**, of the form $y=mx + b$, that allows us to _predict_ the number of bikes used on a certain day, given a likely daily temperature $x$.

#### How do we do this?

Let's start with some basics. The following code plots the count of bikes used on a specific day vs. the daily temperature. It does this in three different plots (each plot has the same data), drawing a different line on each by varying the values of $m$ and $b$.

1. **Left-hand plot**: $m = 350$, $b = 1000$
2. **Center plot**: $m = 150$, $b = 2000$
3. **Right-hand plot**: $m = 100$, $b = 1000$


Which plot develops the best line to fit our data?

In [None]:
# Develop lines
line_1 = 350 * bike_sharing_data['temp_degrees_c'] + 1000 # y = 350*x + 1000
line_2 = 150 * bike_sharing_data['temp_degrees_c'] + 2000 # y = 150*x + 2000
line_3 = 100 * bike_sharing_data['temp_degrees_c'] + 1000 # y = 100*x + 1000

# Function to plot lines
def plot_scatter_w_lines(x, y, line, xlabel, ylabel, title, ylim):
    sns.regplot(x=x, y=y, fit_reg=False)
    sns.regplot(x=x, y=line, marker='')
    plt.ylim(0,ylim)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)

# Plot each line
plt.figure(figsize=(15, 5))

# Plot data and each line
plt.subplot(1, 3, 1)
plot_scatter_w_lines(
    bike_sharing_data['temp_degrees_c'], bike_sharing_data['cnt'], line_1, 
    'Temperature (Degrees C)', 'Count of Bikes', 'y = 350x + 1000', 9000
)

plt.subplot(1, 3, 2)
plot_scatter_w_lines(
    bike_sharing_data['temp_degrees_c'], bike_sharing_data['cnt'], line_2, 
    'Temperature (Degrees C)', '', 'y = 150x + 2000', 9000
)

plt.subplot(1, 3, 3)
plot_scatter_w_lines(
    bike_sharing_data['temp_degrees_c'], bike_sharing_data['cnt'], line_3, 
    'Temperature (Degrees C)', '', 'y = 100x + 1000', 9000
)

print('')

## Part 2: Solving the linear regression problem

[Top](#ML-Week-2---Linear-Regression) | [Previous section](#Part-1:-Introduction-to-regression) | [Next section](#Part-3:-Improving-the-regression:-adding-more-features) | [Bottom](#Thank-you)

Based upon the above lines, you can start to get a feel for which slope and y-intercept best fit our data. So let's reframe linear regression as the following...

> **Linear regression** predicts a continuous variable from a dataset by **finding the best $m$ and $b$**, where $m$ and $b$ are the slope and y-intercept of the equation $f(x) = mx + b$.

We will call $m$ and $b$ the **parameters** of our model.

### Error functions

To be able to have our computer find the best $m$ and $b$, we need to solidify our definition of what makes one combination of $m$ and $b$ better than another combination.

To do this, let's define our first **error function**: the **mean squared error (MSE)**. Given a line, the MSE measures how _far away_ that line is from a set of data points.

We'll walk through an example using the code below. The code will graph a line based upon a set of data points.

In [None]:
# Here is our data
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([-3, 1, 0, 5, 10, 8])
f_x = 2 * x - 2


# Plot
plt.figure(figsize=(10, 5))
sns.regplot(x=x, y=y, fit_reg=False)
sns.regplot(x=x, y=f_x, marker='')

Here is a small visual to describe the process of finding the mean squared error, based upon...

* The data points given (in blue)
* A specific line (in orange)

---

<img src="img/MSE_Diagram.png" width="900">

---

### Exercise

The MSE can be found using the following four steps...

1. Find the difference between the line `f_x` and the true points `y`
2. Square these differences
3. Sum the result
4. Divide by the total number of data points (you can use the `len(my_vec)` function to find the length of an array)

Complete the following code to manually find the MSE. The comments should help describe the process step-by-step.

In [None]:
# INSERT YOUR CODE HERE

# Print the variables f_x and y that can be used to find the MSE
print(f_x)
print(y)

# Find the difference between f_x and y


# Square these differences


# Sum the result


# Divide by the total amount of data points
mse_ans = 0

print(mse_ans)

To summarise these steps, the MSE is written by this formula...


$$ \frac{1}{N}\sum_{i=1}^{N}[f(x_i) - y_i]^2 $$
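
As a sanity check on the formula, here is a minimal sketch on a different, made-up dataset (`x_toy`, `y_toy` and the candidate line are assumptions for illustration, not the exercise data).

```python
import numpy as np

# Hypothetical toy data and a candidate line f(x) = 1*x + 0
x_toy = np.array([0, 1, 2, 3])
y_toy = np.array([1, 1, 2, 5])
f_toy = 1 * x_toy + 0

# (1/N) * sum of squared differences between the line and the true points
mse_toy = np.mean((f_toy - y_toy) ** 2)
print(mse_toy)  # 1.25
```

The differences are $-1, 0, 0, -2$, so the squares sum to $5$ and dividing by $N = 4$ gives $1.25$.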


We can easily find the MSE using the [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) library. The following code re-plots our three different lines described above, with MSE values.  

In [None]:
# Import in the sklearn MSE
from sklearn.metrics import mean_squared_error

# Plot with MSE
# Plot each line
plt.figure(figsize=(15, 5))

# Plot data and each line
plt.subplot(1, 3, 1)
mse_1 = round(mean_squared_error(bike_sharing_data['cnt'], line_1), 2)
plot_scatter_w_lines(
    bike_sharing_data['temp_degrees_c'], bike_sharing_data['cnt'], line_1, 
    'Temperature (Degrees C)', 'Count of Bikes', 'y = 350x + 1000, MSE: ' + str(mse_1), 9000
)

plt.subplot(1, 3, 2)
mse_2 = round(mean_squared_error(bike_sharing_data['cnt'], line_2), 2)
plot_scatter_w_lines(
    bike_sharing_data['temp_degrees_c'], bike_sharing_data['cnt'], line_2, 
    'Temperature (Degrees C)', '', 'y = 150x + 2000, MSE: ' + str(mse_2), 9000
)

plt.subplot(1, 3, 3)
mse_3 = round(mean_squared_error(bike_sharing_data['cnt'], line_3), 2)
plot_scatter_w_lines(
    bike_sharing_data['temp_degrees_c'], bike_sharing_data['cnt'], line_3, 
    'Temperature (Degrees C)', '', 'y = 100x + 1000, MSE: ' + str(mse_3), 9000
)

As you see, the better the line fits our data, the **lower the MSE**! So, when we run linear regression, the goal is to **find the values of $m$ and $b$ that achieve the minimum MSE** for our data.

### Running linear regression in Python

Finding the best $m$ and $b$ in Python is pretty easy. We'll use the [sklearn `LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class to do this.

The following code will run our first **learning algorithm**. It will do this within the following steps...

1. We will first create an sklearn LinearRegression object called `lr`
2. We will then _fit_ a line to our data, using the `lr.fit(X, y)` method. This will find the optimal $m$ and $b$ at the minimum MSE
3. We will then _re-predict_ our data from the linear equation that was created.

We can then assess how good our line is using the MSE metric. Also note that, since we're only using one column of data, we call this process **univariate linear regression**.

In [None]:
# Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression

# 1. Create an object
lr = LinearRegression()

# 2. Fit the data
lr.fit(X=bike_sharing_data[['temp_degrees_c']], y=bike_sharing_data['cnt'])

# 3. Repredict the data
predicted_data = lr.predict(X=bike_sharing_data[['temp_degrees_c']])

# MSE
mse = mean_squared_error(bike_sharing_data['cnt'], predicted_data)

# Plot line
plt.figure(figsize=(10, 5))
plot_scatter_w_lines(
    bike_sharing_data['temp_degrees_c'], bike_sharing_data['cnt'], predicted_data, 
    'Temperature (Degrees C)', '', 'y = %.2fx + %.2f, MSE: %.2f' % (lr.coef_[0], lr.intercept_, mse), 9000
)

### Optional: How does sklearn find the optimal equation?

sklearn does not just magically find an optimal $m$ and $b$ for a set of data. Instead, it uses the MSE to develop a function, and then finds a minimum of that function. Let's recall our example with the small `f_x` and `y` dataset (taking just the first three values of each vector for simplicity)...

* $x = [0, 1, 2]$
* $y = [-3, 1, 0]$


We can plug these values into our MSE equation to get the following, pretending we do not know what the values of $m$ and $b$ are...

<br>

$$
MSE = \frac{1}{N}\sum_{i=1}^{N}[f(x_i) - y_i]^2  \\
= \frac{1}{3}[(f(x_1) - y_1)^2 + (f(x_2) - y_2)^2 + (f(x_3) - y_3)^2]
$$

<br>

Let's plug values in. We'll plug each $y_i$ in from our vector, and for each $f(x_i)$ we'll sub in the formula $f(x_i) = mx_i + b$ with each appropriate $x_i$ value.

<br>

$$
= \frac{1}{3}[(m \cdot 0 + b - (-3))^2 + (m \cdot 1 + b - 1)^2 + (m \cdot 2 + b - 0)^2]
$$

<br>

and simplifying, we get the following **quadratic equation** (remember from the last lesson that _quadratic equations_ have a highest power of $2$ on any variable).

<br>

$$
MSE = \frac{1}{3}(5m^2 + 3b^2 + 6mb - 2m + 4b + 10)
$$

<br>

For convenience's sake, let's assume $b = 0$ in our final graph. The function is now...

<br>

$$
MSE = \frac{1}{3}(5m^2 - 2m + 10)
$$

<br>

Let's graph this function. What shape do we get?

In [None]:
# Create m and mse (keeping the 1/3 factor from the MSE formula)
m = np.array(range(-10, 10))
mse = (5*np.power(m, 2) - 2*m + 10) / 3

# Plot
plt.figure(figsize=(10, 5))
sns.regplot(x=m, y=mse, marker='.', order=2)
plt.xlabel('m')
plt.ylabel('MSE')
print('')

Returning to our brief lesson on optimisation from last week, the minimum of the MSE function occurs where the **gradient = 0**. In this function, the gradient = 0 at $m = \frac{1}{5} = 0.2$, so the best $m$ to fit our data (given $b=0$) is $m=0.2$.
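
To see where $m = 0.2$ comes from: a positive constant factor in front of a function does not change where its derivative is zero, so it is enough to differentiate $5m^2 - 2m + 10$ and set the result to zero.

$$
\frac{d}{dm}\left(5m^2 - 2m + 10\right) = 10m - 2 = 0 \quad \Rightarrow \quad m = \frac{2}{10} = 0.2
$$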

Here's a gif that shows minimisation down a quadratic function. The **cost** on the y-axis in the left-hand graph should look familiar...it's our MSE!


<img src="https://cdn-images-1.medium.com/max/1600/1*KQVi812_aERFRolz_5G3rA.gif" width="800">

#### What happens when our function is more complex?

The MSE function we looked at within the last image had a pretty obvious minimum. But, what happens when the MSE is a more complex function? Our real MSE had two inputs, $m$ _and_ $b$.

We can graph this below, using a 3D plot (since we have two inputs, $m$ and $b$, for every $MSE$ output).

In [None]:
# Import
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D

# Create figure
fig = plt.figure(figsize=(15, 6))
ax = fig.add_subplot(111, projection='3d')

# Evaluate the MSE on a grid of m and b values and plot it as a surface
m = np.arange(-10, 10, step=1)
b = np.arange(-10, 10, step=1)
m, b = np.meshgrid(m, b)
mse = (5*np.power(m, 2) + 3*np.power(b, 2) + 6*m*b - 2*m + 4*b + 10) / 3
surf = ax.plot_surface(m, b, mse, linewidth=0, cmap=cm.coolwarm, antialiased=False)

ax.set_xlabel('m')
ax.set_ylabel('b')
ax.set_zlabel('MSE')
print('')

Now this function is a little bit more complex, and even visually, it's not super clear where the minimum of the function is. Thus, computers typically use optimisation methods that iteratively _approximate_ the minimum. One such technique is called **gradient descent**.

#### Gradient descent

Gradient descent is an **optimisation** algorithm that iteratively tries to find a minimum of a function. [This blogpost](https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0) does a great job of explaining that gradient descent is like climbing down a hill. It has two steps...

1. A **direction update** step, where we use the gradient to work out which direction leads down the hill
2. A **parameter update** step, where we move using a given **step size** within the direction we chose

The **step size** parameter is _really important_ in gradient descent, and is often represented by the greek letter $\alpha$. Here's a visual of what gradient descent looks like, moving down a parabola, with a big and small value of $\alpha$:

<img src="https://cdn-images-1.medium.com/max/1600/0*QwE8M4MupSdqA3M4.png" width="500">

TL;DR:

* Large $\alpha$ trains the algorithm faster by taking _larger steps_ in the direction of the minimum
* Small $\alpha$ values train the algorithm slower

It seems like we would always choose a large $\alpha$ then, right? Well, it turns out there's an issue with this. If our $\alpha$ is too big, we can step right over the minimum, and if the function has multiple local minima, we might completely miss the value we want to reach and climb down the wrong hill!

<br>

![](img/optimums.png)

<br>
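
The two steps and the effect of the step size can be sketched in a few lines. This is a minimal illustration on the single-variable quadratic $5m^2 - 2m + 10$ from earlier, not how sklearn fits a model; the function names, the starting point, and the two $\alpha$ values are assumptions chosen for the demo. On a single parabola there is no "wrong hill", so an overly large $\alpha$ shows up as overshooting instead.

```python
def gradient(m):
    # Derivative of the example quadratic 5m^2 - 2m + 10
    return 10 * m - 2

def gradient_descent(alpha, start=5.0, n_steps=25):
    m = start
    for _ in range(n_steps):
        # Direction update: the negative gradient points downhill
        # Parameter update: move by alpha in that direction
        m = m - alpha * gradient(m)
    return m

m_small = gradient_descent(alpha=0.05)  # small steps settle near the minimum at 0.2
m_large = gradient_descent(alpha=0.25)  # steps too large overshoot further each time

print(round(m_small, 3))                     # 0.2
print(abs(m_large - 0.2) > abs(5.0 - 0.2))   # True: ended up further away than it started
```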

If you want a mathematical discussion of gradient descent, see more [here](http://mccormickml.com/2014/03/04/gradient-descent-derivation/). You can directly derive gradient descent algorithms from the partial derivatives of the MSE function.

#### Last note

It turns out that for linear regression, using the MSE, you do not need to use gradient descent. The techniques typically used come from the field of Linear Algebra. See more [here](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.linalg.lstsq.html).
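
As a rough sketch of the linear algebra route, the example below solves the small `x`, `y` dataset from the MSE section with NumPy's `np.linalg.lstsq` (used here for convenience in place of the SciPy function linked above). The matrix name `A` is an assumption for illustration.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([-3, 1, 0, 5, 10, 8], dtype=float)

# Design matrix: one column for x (the slope) and a column of ones (the intercept)
A = np.column_stack([x, np.ones_like(x)])

# The least-squares solution minimises the sum of squared errors, and so the MSE
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(m, 3), round(b, 3))  # 2.486 -2.714
```

No step size or iteration count is needed here: the solution is computed directly.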

## Part 3: Improving the regression: adding more features

---
[Top](#ML-Week-2---Linear-Regression) | [Previous section](#Part-2:-Solving-the-linear-regression-problem) | [Next section](#Part-4:-Testing-the-model-and-more) | [Bottom](#Thank-you)

### Using more features

Thus far, we have done regression with **one input variable**. We have another variable in our dataset that describes the **windspeed**. Let's graph our scatterplot data with this additional variable, and see how it affects the count of bikes.

In [None]:
# Make a figure
plt.figure(figsize=(15, 5))

# Graph count vs. temp_degrees_c
plt.subplot(1, 2, 1)
sns.scatterplot(x=bike_sharing_data['temp_degrees_c'], y=bike_sharing_data['cnt'])
plt.xlabel('Temperature (Degrees C)')
plt.ylabel('Count')

# Graph count vs. windspeed
plt.subplot(1, 2, 2)
sns.scatterplot(x=bike_sharing_data['windspeed'], y=bike_sharing_data['cnt'])
plt.xlabel('Windspeed')
plt.ylabel('')
print('')

How does the count of bikes used vary with windspeed? Hopefully, the relationship makes logical sense.

---

### Exercise

Go back to the [linear regression](#Running-linear-regression-in-Python) code. 


In the cell below, perform the following:

* Copy and paste the code to build a linear regression model. **DO NOT PASTE** the plotting section.
* Make a linear regression model called `lr_2`
* Instead of using just the `temp_degrees_c`, add the `windspeed` column to the `fit()` method.

**Answer the following:** Does the MSE increase or decrease when adding the additional column?

In [None]:
# INSERT YOUR CODE HERE


# 1. Create an object

# 2. Fit the data

# 3. Repredict the data

# 4. Calculate the MSE using the mean_squared_error() function

print(mse)

This is called a **multivariate** regression problem. Notice how we used **two features**. This expanded our linear equation to look like the following...

<br>


$$f(\textbf{x}) = m_{temp}x_{temp} + m_{wind}x_{wind} + b$$

<br>

Thus, we have one slope $m_i$ for **each feature** $i$ in our dataset. Notice the bolded $\textbf{x}$, which signifies that we are now inputting a _vector_ into our model, where $\textbf{x} = [x_{temp}, x_{wind}]$.
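
To make the vector form concrete, here is a small sketch of a single prediction. The parameter values are made up for illustration and are not fitted to the dataset.

```python
import numpy as np

# Hypothetical parameter values, for illustration only
m_vec = np.array([150.0, -2000.0])  # [m_temp, m_wind]
b = 1500.0

# One day's features: [x_temp, x_wind]
x_vec = np.array([20.0, 0.25])

# f(x) = m_temp * x_temp + m_wind * x_wind + b
prediction = np.dot(m_vec, x_vec) + b
print(prediction)  # 4000.0
```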

### Creating more features

Though we have **two columns**, we're not necessarily done. It's possible to create more features from our dataset.

___

### Thought exercise

Look at the relationship between temperature and the count of bicycles. Is a line the best relationship for our model? Think about when people are likely to use bicycles.

___

Let's create a new feature, where we **square** the temperature. We can then run our model again.

In [None]:
# Create the new column
bike_sharing_data['temp_degrees_c_2'] = bike_sharing_data['temp_degrees_c'].copy()**2

# 1. Create an object
lr_3 = LinearRegression()

# 2. Fit the data
lr_3.fit(X=bike_sharing_data[['temp_degrees_c', 'windspeed', 'temp_degrees_c_2']], y=bike_sharing_data['cnt'])

# 3. Repredict the data
predicted_data = lr_3.predict(X=bike_sharing_data[['temp_degrees_c', 'windspeed', 'temp_degrees_c_2']])

# MSE
mse = mean_squared_error(bike_sharing_data['cnt'], predicted_data)

# Print parameters and MSE
# coef_ follows the column order used in fit(): temp, windspeed, temp squared
print('m_temp = %.2f, m_temp_2 = %.2f, m_wind = %.2f, b=%.2f'
      % (lr_3.coef_[0], lr_3.coef_[2], lr_3.coef_[1], lr_3.intercept_))
print('MSE: %.2f' % mse)

We now have **three features** in our model, and **four parameters** trained. The model looks like this:

<br>

$$f(\textbf{x}) = m_{temp}x_{temp} + m_{temp^2}x_{temp}^2 + m_{wind}x_{wind} + b$$

<br>

We could keep adding features to our model. We'll generalise our linear equation to look like the following...

<br>

$$f(\textbf{x}) = \sum_{j=1}^{L}m_{j}x_{j} + b$$

<br>

This says we have a total of $L$ features in our model, and each feature has a slope $m_j$. Our equation sums through $L$ features and adds the intercept, $b$ (**thoughts**...what does $b$ represent?).
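
Even with squared features, the model stays linear in its parameters $m_j$ and $b$, so the same fitting machinery applies. The sketch below shows this on noise-free, made-up data using `np.linalg.lstsq` (an assumption for illustration; the notebook itself uses sklearn). The names `features` and `params` are hypothetical.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 2 * x**2 - 3 * x + 1   # noise-free toy target

# Columns are the features x_1 = x and x_2 = x^2, plus a column of ones for b
features = np.column_stack([x, x**2, np.ones_like(x)])
params, *_ = np.linalg.lstsq(features, y, rcond=None)
m_1, m_2, b = params

# The recovered values should be close to -3, 2 and 1
print(np.round(params, 3))
```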

#### Generalising our model

The following function adds more and more parameters to our model, by adding additional powers to one or more of our two original features. It then plots the fit of temperature vs. count, and windspeed vs. count. Run the cell.

In [None]:
def linear_regression(data, temp_pow, wind_pow):
    """
    Run linear regression and plot the result.
    
    :input data: <pd.DataFrame>, the dataframe with data
    :input temp_pow: <int>, the max power of the temperature
    :input wind_pow: <int>, the max power of the windspeed
    """
    # Create data
    temp_data = data.copy()
    
    if temp_pow > 1:
        for i in range(2, temp_pow + 1):
            temp_data['temp_degrees_c_' + str(i)] = temp_data['temp_degrees_c'] ** i
    if wind_pow > 1:
        for i in range(2, wind_pow + 1):
            temp_data['windspeed_' + str(i)] = temp_data['windspeed'] ** i
            
    # Run linear regression
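    # Use only the requested features, ignoring extra columns already in the data
    cols = (
        ['temp_degrees_c', 'windspeed']
        + ['temp_degrees_c_' + str(i) for i in range(2, temp_pow + 1)]
        + ['windspeed_' + str(i) for i in range(2, wind_pow + 1)]
    )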
    linreg = LinearRegression()
    linreg.fit(X=temp_data[cols], y=temp_data['cnt'])

    # Predict
    pred = linreg.predict(X=temp_data[cols])
    
    # Plot
    fig = plt.figure(figsize=(15, 5))
    # Graph count vs. temp_degrees_c
    plt.subplot(1, 2, 1)
    sns.regplot(x=temp_data['temp_degrees_c'], y=temp_data['cnt'], fit_reg=False)
    sns.regplot(x=temp_data['temp_degrees_c'], y=pred, marker='', ci=False, order=temp_pow)
    plt.xlabel('Temperature (Degrees C)')
    plt.ylabel('Count')

    # Graph count vs. windspeed
    plt.subplot(1, 2, 2)
    sns.regplot(x=temp_data['windspeed'], y=temp_data['cnt'], fit_reg=False)
    sns.regplot(x=temp_data['windspeed'], y=pred, marker='', ci=False, order=wind_pow)
    plt.xlabel('Windspeed')
    plt.ylabel('')
    
    # Title
    mse = mean_squared_error(temp_data['cnt'], pred)
    fig.suptitle('Regression with MSE=%.2f' % mse)

Here's an example of running this code for the equation we have already developed. Remember the equation was:

<br>

$$f(\textbf{x}) = m_{temp}x_{temp} + m_{temp^2}x_{temp}^2 + m_{wind}x_{wind} + b$$

<br>

Here, `temp_pow = 2`, and `wind_pow = 1`.

In [None]:
# Run the code
linear_regression(data=bike_sharing_data, temp_pow=2, wind_pow=1)

### Exercise

In the cell below, play around with creating different features by changing `temp_pow` and `wind_pow`. How does the MSE change with higher powers?

**Note:** Normally we only raise the powers if it makes sense from our qualitative understanding of the problem. More on this later.

In [None]:
# CHANGE THE CODE BELOW
linear_regression(data=bike_sharing_data, temp_pow=2, wind_pow=1)

## Part 4: Testing the model and more

---
[Top](#ML-Week-2---Linear-Regression) | [Previous section](#Part-3:-Improving-the-regression:-adding-more-features) | [Next section](#Thank-you) | [Bottom](#Thank-you)

The entire process we have been doing today is a form of **supervised learning**.

> **Supervised learning** trains a model based upon an inputted set of features _and_ a label or variable that the model is trying to predict.

![](https://qph.fs.quoracdn.net/main-qimg-33a660216575e3754b12b1718c0e052c)


<br>

We've sort of left a step out though. Usually after we train a model, we save the model and then use it to provide a prediction for _new_ data. Capital bikeshare wouldn't be interested in using the model on past data, but would really want it to predict what is likely to happen in the _future_, once it knows a projected temperature/windspeed for a day. 

### Test sets

Since we don't have future data available, what we normally do is split our dataset into **two datasets**.

* A **training set**: the set of data we use to _fit_ a model
* A **test set**: the set of data we use to _test_ the model's performance
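
The idea of a split can be sketched by hand: shuffle the row positions, then hold some of them out. This is only an illustration of the concept, not sklearn's implementation; the 25% hold-out, the seed, and the row count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)     # fixed seed so the split is repeatable
n_rows = 20
indices = rng.permutation(n_rows)   # shuffled row positions 0..19

n_test = int(0.25 * n_rows)         # hold out 25% of the rows
test_idx, train_idx = indices[:n_test], indices[n_test:]
print(len(train_idx), len(test_idx))  # 15 5
```

Every row lands in exactly one of the two sets, so the model never sees the test rows during fitting.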

The **test set** is supposed to represent how our model might perform on _future_ data, and is not involved in training a model. The sklearn library has a function which splits the data into training and test sets, called [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Usually we leave out about 20-30% of our data for testing (by default, `train_test_split` holds out 25%).

Run the following code to create a function called `linear_regression_w_test()`. It is very similar to our original `linear_regression()` function, except it also...

* Splits our data into train and test sets
* Only trains using the training set
* Prints the MSE of the model on the test set within our graph

In [None]:
from sklearn.model_selection import train_test_split

def linear_regression_w_test(data, temp_pow, wind_pow):
    """
    Run linear regression and plot the result.
    
    :input data: <pd.DataFrame>, the dataframe with data
    :input temp_pow: <int>, the max power of the temperature
    :input wind_pow: <int>, the max power of the windspeed
    """
    # Create data
    temp_data = data.copy()
    
    if temp_pow > 1:
        for i in range(2, temp_pow + 1):
            temp_data['temp_degrees_c_' + str(i)] = temp_data['temp_degrees_c'] ** i
    if wind_pow > 1:
        for i in range(2, wind_pow + 1):
            temp_data['windspeed_' + str(i)] = temp_data['windspeed'] ** i
            
    # Split into train and test sets
    train, test = train_test_split(temp_data, random_state=42)
            
    # Run linear regression
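    # Use only the requested features, ignoring extra columns already in the data
    cols = (
        ['temp_degrees_c', 'windspeed']
        + ['temp_degrees_c_' + str(i) for i in range(2, temp_pow + 1)]
        + ['windspeed_' + str(i) for i in range(2, wind_pow + 1)]
    )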
    linreg = LinearRegression()
    linreg.fit(X=train[cols], y=train['cnt'])

    # Predict
    pred = linreg.predict(X=test[cols])
    
    # Plot
    fig = plt.figure(figsize=(15, 5))
    # Graph count vs. temp_degrees_c
    plt.subplot(1, 2, 1)
    sns.regplot(x=test['temp_degrees_c'], y=test['cnt'], fit_reg=False)
    sns.regplot(x=test['temp_degrees_c'], y=pred, marker='', ci=False, order=temp_pow)
    plt.xlabel('Temperature (Degrees C)')
    plt.ylabel('Count')

    # Graph count vs. windspeed
    plt.subplot(1, 2, 2)
    sns.regplot(x=test['windspeed'], y=test['cnt'], fit_reg=False)
    sns.regplot(x=test['windspeed'], y=pred, marker='', ci=False, order=wind_pow)
    plt.xlabel('Windspeed')
    plt.ylabel('')
    
    # Title
    mse = mean_squared_error(test['cnt'], pred)
    fig.suptitle('Regression with MSE=%.2f' % mse)

---

### Exercise


The following code can be run in the exact same way as our `linear_regression` function, with the same parameters. Change `temp_pow` and `wind_pow` as you did previously to create new features. The only difference is it calculates the **MSE of the test set** in relation to the line created from the training set.

How does the MSE change when you create features of higher order? Does it change differently than it did previously?

In [None]:
# CHANGE THE CODE BELOW
linear_regression_w_test(data=bike_sharing_data, temp_pow=3, wind_pow=2)

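As you create higher order features, the model starts to **overfit**: it fits the training set more and more closely, but its error on the test set stops improving and can get worse. This is a common problem in machine learning, and we'll talk about it next week.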
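
Here is a minimal sketch of that effect on synthetic data, using `np.polyfit` in place of sklearn. The data-generating function, noise level, sample sizes, and degrees are all assumptions for illustration. The training MSE cannot go up as the degree increases, while the test MSE will often rise for the highest degree.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic data: a quadratic trend plus noise (made up for illustration)
    x = rng.uniform(-1, 1, size=n)
    y = 1 + 2 * x - 3 * x**2 + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for degree in [1, 2, 9]:
    coeffs = np.polyfit(x_train, y_train, deg=degree)  # fit on the training set only
    train_mse[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_mse[degree], 3), round(test_mse[degree], 3))
```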

---

### Exercise

The original dataset can be loaded in the cell below. It has a lot more features, which you can read about on [its UCI ML page](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).

In [None]:
# Load the full dataset
full_bike_sharing_data = pd.read_csv('data/day.csv')

The following cell has a template for a linear regression model. You can change the `cols` variable to add more features. Play around with linear regression! Create more features as you'd like.

If you want to plot the result, you'll have to change the `plot_col` variable. Currently the code is plotting the `temp` variable. If you add higher order features, you can also change the `order` variable.

In [None]:
# Split into train and test sets
train_data, test_data = train_test_split(full_bike_sharing_data, random_state=42)

# CHANGE COLS TO FIT MODEL WITH HERE
cols = ['temp', 'windspeed']
full_lr = LinearRegression()
full_lr.fit(X=train_data[cols], y=train_data['cnt'])

# Predict
pred = full_lr.predict(X=test_data[cols])

# CHANGE COL TO PLOT AND ORDER HERE
plot_col = 'temp'
order = 1

# Plot the chosen column against the true and predicted counts
fig = plt.figure(figsize=(15, 5))
sns.regplot(x=test_data[plot_col], y=test_data['cnt'], fit_reg=False)
sns.regplot(x=test_data[plot_col], y=pred, marker='', ci=False, order=order)

## Thank you

[Top](#ML-Week-2---Linear-Regression) | [Previous section](#Part-4:-Testing-the-model-and-more) | [Next section](#Thank-you) | [Bottom](#Thank-you)

That concludes our week 2 lesson. Hopefully you enjoyed :)

### Downloading the notebook

If you would like to keep your work, follow these directions:
* On the top of this screen, in the header menu, click "File", then "Download as" and then "Notebook".
* You will need to download [Python 3.7 with Anaconda](https://www.anaconda.com/distribution/#download-section) to use this in the future