# Linear regression

## Linear regression with one feature

$$
\hat{y} = w_0 + w_1x_1
$$
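
As a minimal sketch of what fitting this equation involves, the least-squares values of $w_0$ and $w_1$ for a single feature can be computed in closed form. The data below is made up for illustration (it is not taken from *grades.csv*), and the names `x`, `y`, `w0`, and `w1` are placeholders.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 1 + 2x (not from grades.csv)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Least-squares slope: co-variation of x and y divided by the variation of x
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Least-squares intercept: the fitted line passes through (mean(x), mean(y))
w0 = y.mean() - w1 * x.mean()

# Predictions from the fitted line
y_hat = w0 + w1 * x
print(w0, w1)
```

Because the made-up points sit exactly on a line, the recovered intercept and slope match the values used to generate them.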

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

### Load the data

Use Pandas to import the *grades.csv* file. This data records student scores against the number of hours studied. Our model will predict the **scores**.

Take a look at the data and use `.shape` to get the number of rows and columns. 

In [None]:
grades = pd.read_csv("grades.csv")

In [None]:
grades.head()

In [None]:
grades.shape

### Plot the data

Create a scatter plot with hours studied on the x axis and scores on the y axis. 

In [None]:
plt.scatter(grades['Hours'], grades['Scores'])
plt.title("Hours vs Scores", size=20)
plt.xlabel("Hours Studied", size = 15)
plt.ylabel("Score Achieved (%)", size=15)
plt.ylim(0, 100)
plt.show()

### Create target and feature arrays

Our dataframe currently holds both the feature and the target. Split it into a feature DataFrame `X` (every column except `Scores`) and a target Series `y`.

In [None]:
X = grades.drop('Scores', axis=1)
X.head()

In [None]:
y = grades['Scores']
y.head()

### Split the data

Split the data into a train and test set. (Remember to set the `random_state` parameter.)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=113)
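
The sketch below illustrates two properties of `train_test_split` that matter here: when no `test_size` is given, 25% of the rows are held out, and repeating the call with the same `random_state` reproduces the same split. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the grades data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 samples with one feature, and y = 2x
X_demo = np.arange(20).reshape(-1, 1)
y_demo = 2 * np.arange(20)

# With no test_size given, 25% of the rows are held out for testing
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# Repeating the call with the same random_state gives the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, random_state=0
)

print(Xa_train.shape, Xa_test.shape)
print(np.array_equal(Xa_test, Xb_test))
```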

### Build a linear regression model

Use the `LinearRegression` class from sklearn's `linear_model` module to create a linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

### Model parameters

Use the `.intercept_` and `.coef_` attributes to access the parameters that define our linear model.

In [None]:
lr.intercept_

In [None]:
lr.coef_
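
To show what these two attributes hold, here is a small sketch on hypothetical data generated from $y = 2 + 3x$ (the names `X_demo`, `y_demo`, and `model` are illustrative only). With a one-dimensional target, `intercept_` is a single number and `coef_` is an array with one entry per feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 2 + 3x (not the grades data)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 + 3.0 * X_demo.ravel()

model = LinearRegression().fit(X_demo, y_demo)

# intercept_ is a scalar (w_0); coef_ has one entry per feature (here just w_1)
print(model.intercept_)
print(model.coef_, model.coef_.shape)
```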

### Make predictions

Now use the model to make predictions for the test set. 

In [42]:
y_preds = lr.predict(X_test)
y_preds

array([63.67921536, 77.44027048, 89.08424019, 86.96715479, 34.04001972,
       10.75208029, 53.09378834])

\begin{align*}
\hat{y} = \text{predicted score} &= \text{intercept} + w_1 \times \text{Hours} \\
&= -0.892 + 10.585 \times \text{Hours}
\end{align*}

### Make predictions by hand

Now use the model's intercept and coefficient to make predictions for the test set. 

In [43]:
(lr.intercept_ + lr.coef_ * X_test).values

array([[63.67921536],
       [77.44027048],
       [89.08424019],
       [86.96715479],
       [34.04001972],
       [10.75208029],
       [53.09378834]])
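
The by-hand result above is a column of shape `(n, 1)`, whereas `.predict()` returned a flat array of shape `(n,)`. The sketch below shows where that difference comes from, using hypothetical parameter values and NumPy arrays rather than the fitted model: broadcasting a length-1 coefficient array against a one-column 2-D array keeps the column shape, while a matrix product collapses it.

```python
import numpy as np

# Hypothetical parameters and features (not the fitted values from above)
intercept = 1.0
coef = np.array([2.0])                     # shape (1,), like lr.coef_
X_demo = np.array([[1.0], [2.5], [4.0]])   # shape (3, 1), one feature column

# Elementwise broadcasting keeps the 2-D column shape
col_preds = intercept + coef * X_demo

# A matrix-vector product yields a flat 1-D array, matching .predict()
flat_preds = X_demo @ coef + intercept

print(col_preds.shape, flat_preds.shape)
```

Calling `.ravel()` on the column result is another way to obtain the flat shape.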

### Plot the predictions

Recreate the scatter plot from above, but this time add in the predictions for the test set. Colour these *red* so you can distinguish them from the training data. 

In [None]:
# create values for the line learned by our model: y = 10.585 * x - 0.892
# use these values to plot the line learned by our model
x_vals = np.linspace(-1, 10, 100)
y_vals = lr.coef_[0] * x_vals + lr.intercept_

In [None]:
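# plot every observed score, then overlay the model's predictions for the test set
plt.scatter(grades['Hours'], grades['Scores'], label="Actual scores")
plt.scatter(X_test['Hours'], y_preds, c = 'red', label="Test predictions")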
plt.plot(x_vals, y_vals, c='black', lw=0.5, label="Model")
plt.title("Hours vs Scores", size=20)
plt.xlabel("Hours Studied", size = 15)
plt.ylabel("Score Achieved (%)", size=15)
plt.ylim(0, 100)
plt.legend()
plt.show()

### Other metrics

Using the appropriate methods from the `sklearn.metrics` module (and/or some basic math), calculate the following metrics for this model: 

- mean squared error
- root mean squared error
- mean absolute error
- $R^2$

Note that these functions take `y_true` and `y_pred` as input, whereas the `.score()` method we have been using takes the features `X` and the true targets `y`.
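
For reference, the standard definitions of these metrics are given below, where $y_i$ is a true value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the true values, and $n$ the number of samples.

\begin{align*}
\text{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \\
\text{RMSE} &= \sqrt{\text{MSE}} \\
\text{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\end{align*}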

#### Mean squared error

In [None]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_preds)
mse

#### Root mean squared error

In [None]:
from sklearn.metrics import root_mean_squared_error

rmse = root_mean_squared_error(y_test, y_preds)
rmse
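
The `root_mean_squared_error` function was added in scikit-learn 1.4, so the import above fails on older installations. A possible fallback, sketched here on hypothetical values (`y_true_demo` and `y_pred_demo` are not from the grades data), is to take the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

try:
    # Available in scikit-learn 1.4 and later
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    # Fallback for older versions: square root of the mean squared error
    def root_mean_squared_error(y_true, y_pred):
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Hypothetical true values and predictions
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

rmse_demo = root_mean_squared_error(y_true_demo, y_pred_demo)
print(rmse_demo)
```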

#### Mean absolute error

In [None]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_preds)
mae

#### $R^2$

In [None]:
# Method 1: use 'score' method of 'LinearRegression'
R2_method_1 = lr.score(X_test, y_test)
R2_method_1

In [None]:
# Method 2: use standalone 'r2_score' function
from sklearn.metrics import r2_score

R2_method_2 = r2_score(y_test, y_preds)
R2_method_2
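
The "basic math" route can be sketched with NumPy alone, using the standard definitions of each metric. The values in `y_true_demo` and `y_pred_demo` are hypothetical; with the real data, `y_test` and `y_preds` would take their place.

```python
import numpy as np

# Hypothetical true values and predictions (not from the grades data)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true_demo - y_pred_demo

mse_hand = np.mean(errors ** 2)        # mean of the squared errors
rmse_hand = np.sqrt(mse_hand)          # square root of the MSE
mae_hand = np.mean(np.abs(errors))     # mean of the absolute errors

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_hand = 1 - ss_res / ss_tot

print(mse_hand, rmse_hand, mae_hand, r2_hand)
```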

## Linear regression with multiple features

The *petrol_consumption.csv* file provides an example of linear regression with multiple features (petrol = gasoline). It contains the following data for 48 US states:
- petrol consumption (in millions of gallons) **(target)**
- petrol taxes (in cents)
- per capita income (in dollars)
- paved highways (in miles)
- proportion of the population that has a driver's license

\begin{align*}
\hat{y} &= w_0 + w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 \\
\text{pred consumption} &= \text{intercept} + w_1 \times \text{taxes} + w_2 \times \text{income} + w_3 \times \text{highways} + w_4 \times \text{population}
\end{align*}
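
To make the multi-feature equation concrete, the sketch below recovers $w_0$ through $w_4$ by ordinary least squares with NumPy. The data is randomly generated and noise-free, and the weights in `true_w` and `true_w0` are arbitrary choices for illustration; none of it comes from *petrol_consumption.csv*.

```python
import numpy as np

# Hypothetical noise-free data with four features (not the petrol data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
true_w0 = 4.0
true_w = np.array([1.5, -2.0, 0.5, 3.0])
y_demo = true_w0 + X_demo @ true_w

# Prepend a column of ones so the intercept w_0 is fitted with the weights
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Solve the least-squares problem A @ w_all ~ y_demo
w_all, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w0_hat, w_hat = w_all[0], w_all[1:]

print(w0_hat, w_hat)
```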

### Load the data

Use Pandas to import the file. For our model, we will want to predict **petrol consumption**.  

Take a look at the data and use `.shape` to get the number of rows and columns. 

In [None]:
petrol = pd.read_csv("petrol_consumption.csv")

In [None]:
petrol.shape

In [None]:
petrol.head()

### Create target and feature arrays

Our dataframe currently holds both the features and the target. Split it into a feature DataFrame `X_gas` (every column except `Petrol_Consumption`) and a target Series `y_gas`.

In [None]:
X_gas = petrol.drop('Petrol_Consumption', axis=1)
X_gas.head()

In [None]:
y_gas = petrol['Petrol_Consumption']
y_gas.head()

### Split the data

Split the data into a train and test set. (Remember to set the `random_state` parameter.)

In [None]:
X_gas_train, X_gas_test, y_gas_train, y_gas_test = train_test_split(X_gas, y_gas, random_state=784)

### Build a linear regression model

Use the `LinearRegression` class from sklearn's `linear_model` module to create a linear regression model.

In [None]:
lr_gas = LinearRegression()
lr_gas.fit(X_gas_train, y_gas_train)

### Model parameters

Use the `.intercept_` and `.coef_` attributes to access the parameters that define our linear model.

In [None]:
lr_gas.intercept_

In [None]:
lr_gas.coef_
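
With several features, the bare `coef_` array is easier to read when each weight is paired with its column name. One way to do this, sketched on hypothetical data with made-up column names (`feat_a`, `feat_b`, `feat_c`), is to wrap the coefficients in a Pandas Series indexed by the feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data with made-up column names (not the petrol data)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    rng.normal(size=(40, 3)), columns=["feat_a", "feat_b", "feat_c"]
)
y_demo = 10.0 + 2.0 * X_demo["feat_a"] - 1.0 * X_demo["feat_b"] + 0.5 * X_demo["feat_c"]

model = LinearRegression().fit(X_demo, y_demo)

# coef_ follows the column order of the training DataFrame
coef_by_name = pd.Series(model.coef_, index=X_demo.columns)
print(coef_by_name)
```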

### Make predictions

Now use the model to make predictions for the test set. 

In [None]:
y_gas_preds = lr_gas.predict(X_gas_test)
y_gas_preds

### Make predictions by hand

Now use the model's *intercept* and *coefficients* to make predictions for the test set. 

\begin{align*}
\text{pred consumption} &= \text{intercept} + w_1 \times \text{taxes} + w_2 \times \text{income} + w_3 \times \text{highways} + w_4 \times \text{population}
\end{align*}

In [None]:
hand_preds = np.dot(X_gas_test.values, lr_gas.coef_) + lr_gas.intercept_
hand_preds
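
A quick way to confirm that the by-hand calculation agrees with `.predict()` is to compare the two arrays with `np.allclose`, which tolerates tiny floating-point differences. The sketch below does this on randomly generated data; `X_demo`, `y_demo`, and `model` are hypothetical stand-ins for the petrol variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical random data with four features (not the petrol data)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(25, 4))
y_demo = rng.normal(size=25)

model = LinearRegression().fit(X_demo, y_demo)

# By hand: dot product of features and weights, plus the intercept
by_hand = X_demo @ model.coef_ + model.intercept_

# Compare against the model's own predictions
matches = np.allclose(by_hand, model.predict(X_demo))
print(matches)
```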

### Other metrics

Using the appropriate methods from the `sklearn.metrics` module (and/or some basic math), calculate the following metrics for this model: 

- mean squared error
- root mean squared error
- mean absolute error
- $R^2$

#### Mean squared error

In [None]:
mse_gas = mean_squared_error(y_gas_test, y_gas_preds)
mse_gas

#### Root mean squared error

In [None]:
# Method 1: take square root of "mean squared error"
rmse_gas_method_1 = np.sqrt(mse_gas)
rmse_gas_method_1

In [None]:
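# Method 2: use the 'root_mean_squared_error' function imported earlier
# (the 'squared' parameter of 'mean_squared_error' was deprecated in scikit-learn 1.4 and later removed)
rmse_gas_method_2 = root_mean_squared_error(y_gas_test, y_gas_preds)
rmse_gas_method_2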

#### Mean absolute error

In [None]:
mae_gas = mean_absolute_error(y_gas_test, y_gas_preds)
mae_gas

#### $R^2$

In [None]:
# Method 1: use 'score' method of 'LinearRegression'
R2_gas_method_1 = lr_gas.score(X_gas_test, y_gas_test)
R2_gas_method_1

In [None]:
# Method 2: use standalone 'r2_score' function
R2_gas_method_2 = r2_score(y_gas_test, y_gas_preds)
R2_gas_method_2