# Notebook 11: Simple Linear Regression & Multiple Linear Regression

Welcome to Notebook 11! This week's lab will focus on simple and multiple linear regression.

First, set up the notebook by running the cell below.

In [None]:
# Run this cell
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline

## 1. SLR: Cryptocurrencies

Imagine you're an investor in December 2017. Cryptocurrencies, online currencies backed by secure software, are becoming extremely valuable, and you want in on the action!

The two most valuable cryptocurrencies are Bitcoin (BTC) and Ethereum (ETH). Each one has a dollar price attached to it at any given moment in time. For example, on December 1st, 2017, one BTC costs $\$10,859.56$ and one ETH costs $\$424.64.$

For fun, here are the current prices of [Bitcoin](https://www.coinbase.com/price/bitcoin) and [Ethereum](https://www.coinbase.com/price/ethereum)!

**You want to predict the price of ETH at some point in time based on the price of BTC.** Below, we load two [tables](https://www.kaggle.com/jessevent/all-crypto-currencies/data) called `btc` and `eth`. Each has 5 columns:
* `date`, the date
* `open`, the value of the currency at the beginning of the day
* `close`, the value of the currency at the end of the day
* `market`, the market cap or total dollar value invested in the currency
* `day`, the number of days since the start of our data

In [None]:
btc = pd.read_csv('btc.csv')
btc.head()

In [None]:
eth = pd.read_csv('eth.csv')
eth.head()

**Question 1.1.** In the cell below, create an overlaid line plot that visualizes the BTC and ETH open prices as a function of the day. Both BTC and ETH open prices should be plotted on the same graph.

*Hint*: [Section 7.3](https://inferentialthinking.com/chapters/07/3/Overlaid_Graphs.html#overlaid-line-plots) in the textbook might be helpful!
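
As a rough illustration of overlaying two line plots, here is a minimal sketch on made-up data. The frames `series_a` and `series_b` and their values are hypothetical stand-ins, not the `btc` and `eth` tables. The key idea is that both lines are drawn on the same axes.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: two series that share the same `day` values.
days = np.arange(5)
series_a = pd.DataFrame({"day": days, "open": [10.0, 12.0, 11.0, 15.0, 18.0]})
series_b = pd.DataFrame({"day": days, "open": [1.0, 1.5, 1.2, 2.0, 2.4]})

fig, ax = plt.subplots()
# Plotting twice on the same Axes object overlays the two lines.
ax.plot(series_a["day"], series_a["open"], label="series A")
ax.plot(series_b["day"], series_b["open"], label="series B")
ax.set_xlabel("day")
ax.set_ylabel("open price")
ax.legend()
```

Labeling each line and calling `legend()` makes it clear which curve belongs to which series.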


In [None]:
# Your code here

**Question 1.2.** Now, write a `standard_units` and `correlation` function to calculate the correlation coefficient between the opening prices of BTC and ETH.
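
Once your functions are written, one possible sanity check is to compare their output against NumPy's built-in `np.corrcoef` on a small example. The arrays below are made up for illustration. Note that `np.std` defaults to the population standard deviation (`ddof=0`) while pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`), so mixing the two can make a hand-written correlation differ slightly from the reference.

```python
import numpy as np

# Made-up, nearly linear data (not the crypto prices).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the correlation coefficient between x and y.
r_reference = np.corrcoef(x, y)[0, 1]
print(r_reference)
```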

In [None]:
def standard_units(x):
    # Your code here

def correlation(x, y):
    # Your code here

In [None]:
correlation(btc['open'], eth['open'])

**Question 1.3.** Using scikit-learn's `LinearRegression`, perform a linear regression that predicts the opening price of ETH using the opening BTC price as its sole feature.
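
For reference, here is a minimal sketch of the `LinearRegression` workflow on made-up data that lies exactly on a line. The names `toy_model`, `slope`, and `intercept` are hypothetical and chosen only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that lies exactly on the line y = 3x + 2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features),
# so a single feature is reshaped into one column.
X = x.reshape(-1, 1)

toy_model = LinearRegression()
toy_model.fit(X, y)

slope = toy_model.coef_[0]          # one coefficient per feature
intercept = toy_model.intercept_    # the fitted constant term
predictions = toy_model.predict(X)  # fitted values for the same inputs
print(slope, intercept)
```

With a pandas dataframe, selecting a column with double brackets (for example `df[['some_column']]`) already gives the 2-D shape that `fit` expects.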

In [None]:
# Your code here

**Question 1.4.** Now, using the `eth_predictor` function you just defined, make a scatter plot with BTC prices along the x-axis and both real and predicted ETH prices along the y-axis. The color of the dots for the real ETH prices should be different from the color for the predicted ETH prices.
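
The sketch below shows one way to draw actual and predicted values in different colors on the same axes. It uses made-up noisy data and a hypothetical `toy_model`, not the crypto tables.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up noisy linear data with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y_actual = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

toy_model = LinearRegression().fit(x.reshape(-1, 1), y_actual)
y_predicted = toy_model.predict(x.reshape(-1, 1))

fig, ax = plt.subplots()
# Two scatter calls on the same Axes, each with its own color and label.
ax.scatter(x, y_actual, color="tab:blue", label="actual")
ax.scatter(x, y_predicted, color="tab:orange", label="predicted")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```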


In [None]:
# Generate predictions here

In [None]:
# Create plot here

**Question 1.5.** Considering the shape of the scatter plot of the true data, is the model we used reasonable? If so, what features or characteristics make this model reasonable? If not, what features or characteristics make it unreasonable?


**SOLUTION:** 

---
## 2. What about multiple linear regression?


In [None]:
# Here, we load the fuel dataset, and drop any rows that have missing data
vehicle_data = sns.load_dataset('mpg').dropna()
vehicle_data = vehicle_data.sort_values('horsepower', ascending=True)
vehicle_data.head(5)


### Question 2a

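Using scikit-learn's `LinearRegression`, create and fit three separate models. The models should:
<ol>
    <li>predict <b>mpg</b> from <b>horsepower</b> without any data transformation</li>
    <li>predict <b>mpg</b> from <b>sqrt(horsepower)</b></li>
    <li>predict <b>mpg</b> from <b>horsepower</b> AND <b>horsepower^2</b></li>
</ol>

Later cells in this notebook expect the third model to be named `model_multi`, the three sets of predictions to be named `predicted_mpg_hp_only`, `predicted_mpg_hp_sqrt`, and `predicted_mpg_multi`, and the squared feature to be stored in a column of `vehicle_data` called `hp^2`.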
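
As an illustration of fitting on transformed features, the sketch below builds a square-root column and a squared column on a made-up dataframe `toy` (a hypothetical stand-in for `vehicle_data`) and fits one model per feature set. The column names `sqrt(x)` and `x^2` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data with a curved relationship: y = 1 + 2x + 0.5x^2.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
toy["y"] = 1.0 + 2.0 * toy["x"] + 0.5 * toy["x"] ** 2

# Derived features are stored as ordinary columns.
toy["sqrt(x)"] = np.sqrt(toy["x"])
toy["x^2"] = toy["x"] ** 2

# One transformed feature.
model_sqrt = LinearRegression().fit(toy[["sqrt(x)"]], toy["y"])
# Two features: the raw value and its square.
model_quad = LinearRegression().fit(toy[["x", "x^2"]], toy["y"])

print(model_quad.intercept_, model_quad.coef_)
```

The model is still linear in its coefficients even though the features are nonlinear functions of `x`, which is why `LinearRegression` can fit a curve here.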

In [None]:
# Model 1
# Your code here

In [None]:
# Model 2
# Your code here

In [None]:
# Model 3
# Your code here

---

### Question 2b

Using `model_multi` (the third model from Question 2a), extract the coefficients and, in LaTeX, write out the function that the model uses to predict `mpg` from `horsepower` and `hp^2`.
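
One way to check that a written-out formula matches a fitted model is to rebuild the predictions by hand from `intercept_` and `coef_`. The sketch below does this on made-up data with a hypothetical `toy_model`; the variable names `b0`, `b1`, and `b2` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: the second column is the square of the first.
X = np.array([[1.0, 1.0], [2.0, 4.0], [3.0, 9.0], [4.0, 16.0], [5.0, 25.0]])
y = np.array([3.2, 6.9, 12.1, 18.8, 27.2])

toy_model = LinearRegression().fit(X, y)

b0 = toy_model.intercept_   # constant term
b1, b2 = toy_model.coef_    # coefficients, in the same order as the columns of X

# Prediction written out explicitly: b0 + b1 * x + b2 * x^2.
manual = b0 + b1 * X[:, 0] + b2 * X[:, 1]
print(np.allclose(manual, toy_model.predict(X)))
```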


In [None]:
# Your code here

**SOLUTION**:

$$\text{mpg} = 56.9 - 0.466 \cdot \text{horsepower} + 0.00123 \cdot \text{horsepower}^2$$

---
### Question 2c

Calculate and print the $R^2$ value for each of the three models. How do they compare? Is the $R^2$ enough to evaluate model performance on its own?
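
For reference, here are three equivalent ways to obtain $R^2$ for a fitted scikit-learn model, sketched on made-up data: the model's `score` method, `sklearn.metrics.r2_score`, and the definition $1 - SS_{res}/SS_{tot}$ computed by hand. The names `toy_model` and `r2_manual` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up noisy linear data with a fixed seed.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 4.0 - 1.5 * X[:, 0] + rng.normal(scale=1.0, size=50)

toy_model = LinearRegression().fit(X, y)
y_hat = toy_model.predict(X)

# 1. Built-in score method (returns R^2 for regressors).
r2_from_score = toy_model.score(X, y)
# 2. Metric function: true values first, predictions second.
r2_from_metric = r2_score(y, y_hat)
# 3. Definition: 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_from_score, r2_from_metric, r2_manual)
```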

In [None]:
# Your code here

<br/><br/>

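The plot below overlays the predictions of all three models on the data. The fit that uses both `horsepower` and `hp^2` follows the curvature of the data much better than the straight line from `horsepower` alone.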

In [None]:
# just run this cell

sns.scatterplot(x='horsepower', y='mpg', data=vehicle_data)
plt.plot(vehicle_data['horsepower'],  predicted_mpg_hp_only, label='hp only');
plt.plot(vehicle_data['horsepower'],  predicted_mpg_hp_sqrt, color = 'r', linestyle='--', label='sqrt(hp) fit');
plt.plot(vehicle_data['horsepower'],  predicted_mpg_multi, color = 'gold', linewidth=2, label='hp and hp^2');
plt.legend();

---

### Question 2d

In the cell below, we compute the mean of the `mpg` column of the `vehicle_data` dataframe. Given this information, what is the mean of the `predicted_mpg_hp_only`, `predicted_mpg_hp_sqrt`, and `predicted_mpg_multi` arrays?

In [None]:
vehicle_data['mpg'].mean()

In [None]:
predicted_mpg_hp_only.mean()

In [None]:
predicted_mpg_hp_sqrt.mean()

In [None]:
predicted_mpg_multi.mean()

---
## Question 3: Overfitting with Too Many Features

Let's take what we've learned so far and go one step further: introduce even more features.

Again, using scikit-learn's `LinearRegression`, we fit a model that tries to predict `mpg` using each of the following as features:
- `horsepower`
- `hp^2`
- `model_year`
- `acceleration`

In [None]:
# just run this cell
desired_columns = ['horsepower', 'hp^2', 'model_year', 'acceleration']
model_overfit = LinearRegression()
model_overfit.fit(X=vehicle_data[desired_columns], y=vehicle_data['mpg'])
# Predict on the same feature columns that were used for fitting
predicted_mpg_overfit = model_overfit.predict(vehicle_data[desired_columns])

<br/>
The plot below shows the prediction of our more sophisticated model. Note we arbitrarily plot against horsepower for the ease of keeping our plots 2-dimensional.

In [None]:
# just run this cell
sns.scatterplot(x='horsepower', y='mpg', data=vehicle_data)
plt.plot(vehicle_data['horsepower'],  predicted_mpg_overfit, color = 'r')
plt.show()

Think about what you see in the above plot. Why is the shape of our prediction curve so jagged? Do you think this is a good model to predict the `mpg` of some car we don't already have information on?

This idea, the **bias-variance tradeoff**, is one we will explore in the coming weeks.

---

## Question 4: Comparing $R^2$

Lastly, set `r2_overfit` to be the multiple $R^2$ coefficient obtained by using `model_overfit`.

In [None]:
# Your code here

Comparing this model with previous models:

In [None]:
# just run this cell
# compares the three models from Question 2a with the overfit model
print('Multiple R^2 using only horsepower: ', r2_hp_only)
print('Multiple R^2 using sqrt(hp): ', r2_hp_sqrt)
print('Multiple R^2 using both hp and hp^2: ', r2_multi)
print('Multiple R^2 using hp, hp^2, model year, and acceleration: ', r2_overfit)

If everything was done correctly, the multiple $R^2$ of our latest model should be substantially higher than that of the previous models. This is partly because, on the data used for fitting, multiple $R^2$ never decreases when we add covariates (i.e., features) to a least-squares model, even if the new features carry little real information.
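
To see this effect in isolation, the sketch below fits two models on made-up data: one with a single informative feature, and one that also includes a column of pure random noise. The names `X_small`, `X_big`, `r2_small`, and `r2_big` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y depends on x only; `noise` is unrelated to y.
rng = np.random.default_rng(2)
n = 40
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + rng.normal(scale=3.0, size=n)
noise = rng.normal(size=n)

X_small = x.reshape(-1, 1)           # informative feature only
X_big = np.column_stack([x, noise])  # informative feature plus noise column

# R^2 is measured on the same data used for fitting.
r2_small = LinearRegression().fit(X_small, y).score(X_small, y)
r2_big = LinearRegression().fit(X_big, y).score(X_big, y)

print(r2_small, r2_big)
```

The larger model can always reproduce the smaller one by setting the extra coefficient to zero, so its training $R^2$ cannot be lower.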

<br/>

**A Word on Overfitting**: We might not always want to use models with large multiple $R^2$ values because these models could be **overfitting** to our specific sample data, and won't generalize well to unseen data from the population. Again, this is an idea we will explore in future lectures and assignments.
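
As a preview, one common way to check for overfitting is to hold out part of the data and compare $R^2$ on the held-out rows with $R^2$ on the rows used for fitting. The sketch below is an assumption about how this might be done, using scikit-learn's `train_test_split` on made-up data with many pure-noise features; it is not part of this lab's required work. With this kind of setup, the training $R^2$ is typically noticeably higher than the test $R^2$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: one informative feature plus 20 pure-noise features.
rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + rng.normal(scale=3.0, size=n)
X = np.column_stack([x, rng.normal(size=(n, 20))])

# Hold out half of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

toy_model = LinearRegression().fit(X_train, y_train)
r2_train = toy_model.score(X_train, y_train)  # R^2 on data used for fitting
r2_test = toy_model.score(X_test, y_test)     # R^2 on unseen data

print(r2_train, r2_test)
```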