# Linear Regression Example

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()

from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

Assume we have data where `y` is salary and `X` is a matrix with two columns: the first column holds the number of years the person has studied, and the second column holds the number of years of work experience.
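
Written out, the model estimated below has the following form. The $\beta$ symbols are notation introduced here for illustration and do not appear in the original notebook; $\beta_0$ is the intercept, which is fixed at 0 when `fit_intercept=False`:

$\hat{y} = \beta_0 + \beta_1 \cdot \text{years\_studied} + \beta_2 \cdot \text{years\_experience}$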

## Estimating a linear regression model

In [2]:
X = np.array([[5, 2], [3, 4], [2, 7], [0, 10]])
y_true = np.array([34000, 31000, 33000, 28000])

In [3]:
print(X)
print()
print(y_true)

[[ 5  2]
 [ 3  4]
 [ 2  7]
 [ 0 10]]

[34000 31000 33000 28000]


In [4]:
df = pd.DataFrame(X, columns=["years_studied", "years_experience"])
df["salary"] = y_true
df

| | years_studied | years_experience | salary |
|---|---|---|---|
| 0 | 5 | 2 | 34000 |
| 1 | 3 | 4 | 31000 |
| 2 | 2 | 7 | 33000 |
| 3 | 0 | 10 | 28000 |


In [5]:
# 1. Initialize the model without an intercept (the fit is forced through the origin).
lin_reg = LinearRegression(fit_intercept=False)

In [18]:
# 1. Initialize the model with an intercept (the default, fit_intercept=True).
lin_reg_intercept = LinearRegression()

In [7]:
# 2. Fit the model. 
lin_reg = lin_reg.fit(X, y_true)

In [19]:
# 2. Fit the model. 
lin_reg_intercept = lin_reg_intercept.fit(X, y_true)

In [11]:
print(lin_reg.coef_)
print(lin_reg.intercept_)

[5909.67616075 2900.89738588]
0.0


In [20]:
print(lin_reg_intercept.coef_)
print(lin_reg_intercept.intercept_)

[3370.96774194 1387.09677419]
15096.774193548408
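
As a cross-check on the coefficients above, both fits can be reproduced with plain NumPy by solving the least-squares problem directly. This is a minimal sketch: the names `A`, `beta` and `beta_no_intercept` are introduced here for illustration, and `np.linalg.lstsq` is used as a stand-in solver, not necessarily what scikit-learn runs internally.

```python
import numpy as np

X = np.array([[5, 2], [3, 4], [2, 7], [0, 10]])
y_true = np.array([34000, 31000, 33000, 28000])

# Prepend a column of ones so the first coefficient acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of A @ beta ≈ y_true (intercept model).
beta, *_ = np.linalg.lstsq(A, y_true, rcond=None)

# Without the column of ones the fit is forced through the origin.
beta_no_intercept, *_ = np.linalg.lstsq(X, y_true, rcond=None)

print("intercept:", beta[0])
print("coefficients:", beta[1:])
print("coefficients without intercept:", beta_no_intercept)
```

The printed values should agree with `intercept_` and `coef_` of the two fitted models up to floating-point rounding.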


In [9]:
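# Manual prediction for the first row (5 years studied, 2 years of experience), no-intercept model, rounded coefficients.
5 * 5909.7 + 2 * 2901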

35350.5

In [22]:
# The same manual prediction for the model with an intercept, again with rounded coefficients.
15096.77 + 5 * 3371 + 2 * 1387.1

34725.97
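
The two hand calculations above cover only the first row. The same arithmetic for every row at once is a matrix product. A small sketch, assuming the coefficient values are copied from the printed output above (so they carry only the displayed precision); the names `manual_pred` and `manual_pred_intercept` are introduced here for illustration:

```python
import numpy as np

X = np.array([[5, 2], [3, 4], [2, 7], [0, 10]])

# Values copied from the printed coef_ and intercept_ of the two models.
coef = np.array([5909.67616075, 2900.89738588])            # fit_intercept=False
coef_intercept = np.array([3370.96774194, 1387.09677419])  # fit_intercept=True
intercept = 15096.774193548408

# Each prediction is the weighted sum of a row's features, plus the intercept if there is one.
manual_pred = X @ coef
manual_pred_intercept = X @ coef_intercept + intercept

print(manual_pred)
print(manual_pred_intercept)
```

Up to rounding, these should match what `predict(X)` returns for the respective model.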

In [10]:
# 3. Predict on the training data (model without intercept).
y_pred = lin_reg.predict(X)

print('predictions', y_pred)
print('true y', y_true)

predictions [35350.1755755  29332.61802575 32125.63402263 29008.97385876]
true y [34000 31000 33000 28000]


In [23]:
# 3. Predict on the training data (model with intercept).
y_pred_intercept = lin_reg_intercept.predict(X)

print('predictions', y_pred_intercept)
print('true y', y_true)

predictions [34725.80645161 30758.06451613 31548.38709677 28967.74193548]
true y [34000 31000 33000 28000]


In [10]:
# Predict on "new data", where the salary y is unknown in reality.

In [12]:
X_new = np.array([[7, 3], [2, 9], [6, 1]])
y_pred_new = lin_reg.predict(X_new)
y_pred_new

array([50070.42528287, 37927.42879438, 38358.95435037])

In [None]:
# Look at the hyperparameters. Apart from fit_intercept=False, this model uses the default values.
lin_reg.get_params()

## Calculate Root Mean Squared Error (RMSE) for the training data

In [13]:
root_mean_squared_error(y_true, y_pred)

1263.495235721369

In [24]:
root_mean_squared_error(y_true, y_pred_intercept)

952.500952501429

In [17]:
# Calculating the root mean squared error manually.
(np.mean((y_pred - y_true)**2))**0.5

np.float64(1263.495235721369)


### What is RMSE?

RMSE stands for Root Mean Squared Error.\
It is a metric for evaluating regression models and measures how far the predictions are, on average, from the true, observed values.

The mathematical formula for RMSE is:

$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n}(\hat{y}_i-y_i)^2}$

The idea behind RMSE is simple:
- Take the difference between a prediction and the corresponding observed value: $\hat{y}_i-y_i$;    this is called the __Error__.
- We do not care whether the difference is positive or negative, so we square it: $(\hat{y}_i-y_i)^2$;   this is called the __Squared Error__.
- We compute the mean of the Squared Error: $\frac{1}{n} \sum_{i=1}^{n}(\hat{y}_i-y_i)^2$;    this is called the __Mean Squared Error__.
- We take the square root of the Mean Squared Error, so that the metric is on the same scale as the data and therefore easier to interpret: $\sqrt{\frac{1}{n} \sum_{i=1}^{n}(\hat{y}_i-y_i)^2}$
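
The four steps can be followed one at a time in code. A minimal sketch, assuming the predictions of the model without intercept are copied from the output earlier in the notebook; the intermediate names `error`, `squared_error` and `mse` are introduced here for illustration:

```python
import numpy as np

y_true = np.array([34000, 31000, 33000, 28000])
# Predictions of the no-intercept model, copied from the printed output.
y_pred = np.array([35350.1755755, 29332.61802575, 32125.63402263, 29008.97385876])

error = y_pred - y_true        # Error: signed difference per observation
squared_error = error ** 2     # Squared Error: the sign no longer matters
mse = squared_error.mean()     # Mean Squared Error
rmse = mse ** 0.5              # Root Mean Squared Error, in the same unit as salary

print("error:", error)
print("squared error:", squared_error)
print("MSE:", mse)
print("RMSE:", rmse)
```

The final value should agree with the result of `root_mean_squared_error(y_true, y_pred)` computed earlier, up to the rounding of the copied predictions.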