<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-problem" data-toc-modified-id="The-problem-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The problem</a></span></li><li><span><a href="#Data-exporation" data-toc-modified-id="Data-exporation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data exploration</a></span></li><li><span><a href="#The-linear-model" data-toc-modified-id="The-linear-model-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The linear model</a></span></li><li><span><a href="#The-optimal-linear-model" data-toc-modified-id="The-optimal-linear-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The optimal linear model</a></span></li></ul></div>

# Linear regression motivation

In [None]:
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import seaborn as sns

## The problem

In [None]:
data = pd.read_csv("../datasets/hours_vs_mark.csv", index_col=0)

We have 100 students, and we know:
 * how many hours they studied for their exam
 * what mark they got (0 to 100)

In [None]:
data.head()

In [None]:
data.sample(5)

We would like to understand the relationship $$mark = f(hours)$$

so that we can **predict the expected mark** for a given number of hours of study.

## Data exploration

In [None]:
data.describe()

In [None]:
sns.histplot(data.hours, bins=10)

In [None]:
sns.scatterplot(x=data["hours"], y=data["mark"])

## The linear model

Let's try a linear regression $$Y = m * X + n$$

$m$ is the slope  
$n$ is the intercept: the value of $Y$ when $X=0$

$$mark = m * hours + n$$

We want to find the $m$ and $n$ that *best* model our data.

Let's guess:

$$mark = ...$$

$$mark_2 = ...$$

Which model performs better?

In [None]:
data.head()

In [None]:
data["prediction_1"] = 0.1 * data.hours

In [None]:
data["prediction_2"] = 0.12 * data.hours + 10

In [None]:
data.head(10)

Let's measure the error of both models, as the absolute difference between the real mark and the predicted one

In [None]:
data['error_1'] = (data.mark - data.prediction_1).abs()

In [None]:
data['error_2'] = (data.mark - data.prediction_2).abs()

In [None]:
data.head(10)

In [None]:
data.error_1.mean()

In [None]:
data.error_2.mean()

So model 1 performs better: its mean absolute error is lower (12.7 vs 18.7)!

Let's plot our models

In [None]:
fig, ax = plt.subplots()
# Raw data: one point per student
sns.scatterplot(x=data["hours"], y=data["mark"], ax=ax)

# Our two guessed lines
ax.plot(data.hours, data.prediction_1, color='r', label='better')
ax.plot(data.hours, data.prediction_2, color='g', label='worse')

ax.legend()

$$mark = m * hours + n$$

$$\text{model error} = L(m, n)$$

$$L(0.1, 0) = 12.7$$

$$L(0.12, 10) = 18.7$$
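
The loss $L(m, n)$ above is the mean absolute error computed in the previous cells, i.e. $L(m, n) = \frac{1}{N}\sum_{i=1}^{N} |mark_i - (m * hours_i + n)|$. Here is a minimal sketch of it as a reusable function. The name `mean_abs_error` and the toy numbers are made up for illustration and are not part of the dataset:

```python
import numpy as np


def mean_abs_error(m, n, hours, mark):
    """Mean absolute error of the line mark = m * hours + n."""
    hours = np.asarray(hours, dtype=float)
    mark = np.asarray(mark, dtype=float)
    prediction = m * hours + n
    return np.abs(mark - prediction).mean()


# Toy data (hypothetical, NOT the course dataset)
toy_hours = np.array([10.0, 20.0, 30.0])
toy_mark = np.array([2.0, 3.0, 5.0])

loss_a = mean_abs_error(0.1, 0, toy_hours, toy_mark)    # first guess
loss_b = mean_abs_error(0.12, 10, toy_hours, toy_mark)  # second guess
```

On the notebook's data, `mean_abs_error(0.1, 0, data.hours, data.mark)` should match `data.error_1.mean()`, provided there are no missing values.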

## The optimal linear model

Can we find the **best**?
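
One naive way to approach this question, before reaching for a library, is to try many $(m, n)$ pairs and keep the pair with the lowest error. The sketch below does that on synthetic data. The data, the grid ranges and all variable names are assumptions for illustration, and this brute-force search is not what `scikit-learn` does internally:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumption): mark is roughly 0.08 * hours + 12, plus noise
hours = rng.uniform(0, 1000, size=100)
mark = 0.08 * hours + 12 + rng.normal(0, 5, size=100)

best = None
for m in np.linspace(0, 0.2, 41):       # candidate slopes
    for n in np.linspace(0, 20, 41):    # candidate intercepts
        error = np.abs(mark - (m * hours + n)).mean()
        # Keep the pair with the lowest mean absolute error seen so far
        if best is None or error < best[0]:
            best = (error, m, n)

best_error, best_m, best_n = best
```

The search only finds the best pair on the grid, and its cost grows quickly with the number of parameters, which is one reason to use a dedicated fitting routine instead.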

`scikit-learn` is a Python library for building ML models

Linear regression is now called an ML algorithm (years ago it was just basic statistical inference... you know, the hype)

`!pip install scikit-learn`

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

In [None]:
data.head()

In [None]:
lr.fit(
    # X must be 2-D (one column per feature), hence the double brackets.
    # With more features it would look like data[["hours", "age", "n_bedrooms"]]
    X=data[["hours"]],
    y=data.mark,
)
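
`LinearRegression` fits by ordinary least squares: it picks the $m$ and $n$ that minimise the sum of *squared* errors, which is a slightly different criterion from the absolute error measured earlier in this notebook. For a single feature that solution has a closed form, sketched here with NumPy only. The synthetic data and the variable names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data (assumption): mark is roughly 0.08 * hours + 12, plus noise
hours = rng.uniform(0, 1000, size=100)
mark = 0.08 * hours + 12 + rng.normal(0, 5, size=100)

# Closed-form least-squares solution for one feature:
#   slope     = cov(hours, mark) / var(hours)
#   intercept = mean(mark) - slope * mean(hours)
m_ols = np.cov(hours, mark, bias=True)[0, 1] / np.var(hours)
n_ols = mark.mean() - m_ols * hours.mean()

# Cross-check against NumPy's degree-1 polynomial fit (returns slope first)
m_np, n_np = np.polyfit(hours, mark, deg=1)
```

Both routes should give the same line up to floating-point error.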

$$mark = m * hours + n$$

In [None]:
lr.coef_

In [None]:
optimal_m = lr.coef_[0]

In [None]:
optimal_m

In [None]:
optimal_n = lr.intercept_

In [None]:
optimal_n

$$mark = 0.084 * hours + 11.78$$

In [None]:
data.head()

In [None]:
data["best_prediction"] = data.hours * optimal_m + optimal_n

In [None]:
data["best_prediction_error"] = (data.best_prediction - data.mark).abs()

In [None]:
data.head()

In [None]:
data.best_prediction_error.mean()
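
The manual steps above (apply the fitted line, then average the absolute errors) can also be done with `predict` and `sklearn.metrics.mean_absolute_error`. Here is a minimal sketch on synthetic data. The `toy` DataFrame and the other variable names are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
# Synthetic data (assumption), shaped like the notebook's DataFrame
toy = pd.DataFrame({"hours": rng.uniform(0, 1000, size=100)})
toy["mark"] = 0.08 * toy.hours + 12 + rng.normal(0, 5, size=100)

toy_lr = LinearRegression().fit(toy[["hours"]], toy.mark)

# Manual prediction from the fitted slope and intercept...
manual = toy.hours * toy_lr.coef_[0] + toy_lr.intercept_
# ...and the same prediction via the estimator itself
via_predict = toy_lr.predict(toy[["hours"]])

mae_manual = (manual - toy.mark).abs().mean()
mae_sklearn = mean_absolute_error(toy.mark, via_predict)
```

Using `predict` avoids rewriting the formula by hand when the model has more than one feature.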

In [None]:
fig, ax = plt.subplots()
# Raw data: one point per student
sns.scatterplot(x=data["hours"], y=data["mark"], ax=ax)

# Our two guesses, plus the line fitted by scikit-learn
ax.plot(data.hours, data.prediction_1, color='r', label='better')
ax.plot(data.hours, data.prediction_2, color='g', label='worse')
ax.plot(data.hours, data.best_prediction, color='y', label='best')

ax.legend()