This is a simple tutorial to help beginners understand and implement Simple Linear Regression from scratch.



<font color='blue'> Simple Linear Regression </font> is a great first machine learning algorithm to implement as it requires you to estimate properties from your training dataset, but is simple enough for beginners to understand. Linear regression is a prediction method that is more than 200 years old. In this tutorial, you will discover how to implement the simple linear regression algorithm from scratch in Python.

After completing this tutorial you will know:<br>
&#9632; How to estimate statistical quantities from training data.<br>
&#9632; How to estimate linear regression coefficients from data.<br>
&#9632; How to make predictions using linear regression for new data.<br>


Linear regression assumes a **linear or straight line relationship between the input variables (X) and the single output variable (y).** More specifically, the output (y) can be calculated from a linear combination of the input variables (X). When there is a single input variable, the method is referred to as simple linear regression.

In simple linear regression we can use statistics on the training data to estimate the coefficients required by the model to make predictions on new data.

The line for a simple linear regression model can be written as:

$$ y = b_0 + b_1 * x $$
where $b_0$ and $b_1$ are the coefficients we must estimate from the training data. Once the coefficients are known, we can use this equation to estimate output values for $y$ given new input examples of $x$. Estimating the coefficients requires calculating statistical properties of the data such as **mean, variance** and **covariance.**



## <font color = 'blue'> Swedish Insurance Dataset</font>
We will use a real dataset to demonstrate simple linear regression. The dataset is called the **“Auto Insurance in Sweden”** dataset and involves **<font color='blue'> predicting the total payment for all the claims in thousands of Swedish Kronor (y) given the total number of claims (x). </font>**

This means that for a new number of claims (x) we will be able to predict the total payment of claims (y).

Let's load the basic Python libraries that we will need over the course of this tutorial.

In [38]:
# library for manipulating the csv data
import pandas as pd

# library for scientific calculations on numbers + linear algebra
import numpy as np
import math

# library for regular plot visualizations
import matplotlib.pyplot as plt

#library for responsive visualizations
import plotly.express as px


In [6]:
#data = pd.read_csv('../input/auto-insurance-in-sweden/swedish_insurance.csv')
data = pd.read_csv('swedish_insurance.csv')
data.sort_values('X', inplace= True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63 entries, 30 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X       63 non-null     int64  
 1   Y       63 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 1.5 KB


In [25]:
print(data.head(10))
print(data.tail(10))

    X     Y
30  0   0.0
15  2   6.6
49  3  39.9
23  3  13.2
18  3   4.4
29  4  38.1
38  4  12.6
26  4  11.8
33  5  40.3
10  5  20.9
      X      Y
42   41  181.3
8    45  214.0
11   48  248.1
61   53  244.6
44   55  162.8
5    57  170.9
41   60  202.4
36   61  217.6
0   108  392.5
3   124  422.2


In [10]:
data.columns

Index(['X', 'Y'], dtype='object')

In [5]:
y = data['Y'].values
print(y)
print(type(y))

[  0.    6.6  39.9  13.2   4.4  38.1  12.6  11.8  40.3  20.9  50.9  14.6
  14.8  27.9  48.8  77.5  76.1  55.6  48.7  87.4  52.1  65.3  57.2  21.3
  23.5  58.1  31.9  15.7  93.   89.9  77.5  95.5  32.1  59.6 142.1  46.2
  98.1 161.5 113.   39.6  56.9 137.9 134.9  69.2 187.5  92.6 103.9 133.3
 194.5 209.8 152.8 119.4  73.4 181.3 214.  248.1 244.6 162.8 170.9 202.4
 217.6 392.5 422.2]
<class 'numpy.ndarray'>


Let's have a look at the data itself. You can use either `matplotlib.pyplot` or `plotly` for visualization. The latter produces interactive visualizations. Try hovering over the points on the graph to see the actual values.

In [11]:
fig = px.box(data['X'], points = 'all')
fig.update_layout(title = f'Distribution of X',title_x=0.5, yaxis_title= "Number of Insurance Claims")
fig.show()

fig = px.box(data['Y'], points = 'all')
fig.update_layout(title = f'Distribution of Y',title_x=0.5, yaxis_title= "Amount of Insurance Paid")
fig.show()

In [12]:
fig = px.scatter(x = data['X'], y=data['Y'])
fig.update_layout(title = 'Swedish Automobiles Data', title_x=0.5, xaxis_title= "Number of Claims", yaxis_title="Payment in Claims", height = 500, width = 700)
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

**This tutorial is broken down into five parts:**<br>
&#9832; Calculate Mean and Variance.<br>
&#9832; Calculate Covariance (X,Y).<br>
&#9832; Estimate Coefficients.<br>
&#9832; Make Predictions.<br>
&#9832; Visual Comparison for Correctness.<br>
These steps will give you the foundation you need to implement and train simple linear regression models for your own prediction problems.

### 1. Calculate Mean and Variance.
As mentioned earlier, simple linear regression relies on the mean and variance of the given data. We will use `numpy` built-in functions to calculate them. Note that `np.var` divides by $n$ by default (population variance).

In [13]:
data['Y']

30      0.0
15      6.6
49     39.9
23     13.2
18      4.4
      ...  
5     170.9
41    202.4
36    217.6
0     392.5
3     422.2
Name: Y, Length: 63, dtype: float64

In [16]:
mean_x = np.mean(data['X'])
mean_y = np.mean(data['Y'])

var_x = np.var(data['X'])
var_y = np.var(data['Y'])


print('x stats: mean= %.3f   variance= %.3f' % (mean_x, var_x))
print('y stats: mean= %.3f   variance= %.3f' % (mean_y, var_y))

x stats: mean= 22.905   variance= 536.658
y stats: mean= 98.187   variance= 7505.052
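
To make clear what `np.mean` and `np.var` compute, here is a minimal from-scratch sketch. The helper names `simple_mean` and `simple_variance` and the toy values are assumptions made for illustration only; they are not part of the insurance dataset. The variance below divides by $n$, mirroring the default behaviour of `np.var`.

```python
import numpy as np

def simple_mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / float(len(values))

def simple_variance(values):
    """Population variance: average squared deviation from the mean."""
    m = simple_mean(values)
    return sum((v - m) ** 2 for v in values) / float(len(values))

# Toy data for illustration only (hypothetical values)
toy_values = [1, 2, 4, 3, 5]

print('from scratch:', simple_mean(toy_values), simple_variance(toy_values))
print('numpy       :', np.mean(toy_values), np.var(toy_values))
```

Both lines should report the same mean and variance, which is a quick way to confirm that the NumPy calls used above do what the definitions say.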


### 2. Calculate Covariance.
The covariance of two groups of numbers describes how those numbers change together. It is closely related to correlation: correlation is the covariance rescaled by the standard deviations of both variables, so it always lies between -1 and 1, whereas covariance keeps the units of the data. Covariance is calculated with the following formula, where $n$ is the number of observations.
$$ Cov(X,Y) = \frac{\sum_{i=1}^{n}{(X_i - \overline{X})}{(Y_i - \overline{Y})}}{n} $$

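You can implement it yourself or use the built-in function `numpy.cov()`. Note that `numpy.cov()` divides by $n - 1$ by default; pass `bias=True` to divide by $n$ as in the formula above.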


In [17]:
# Calculate covariance between x and y
def covariance(x, y):
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar/len(x)



covar_xy = covariance(data['X'], data['Y'])
print(f'Cov(X,Y): {covar_xy}')


Cov(X,Y): 1832.0543461829182
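
As a cross-check on the hand-written function, the sketch below compares a manual covariance with `numpy.cov()` on a small toy example. The arrays `toy_x` and `toy_y` are hypothetical values chosen for illustration. It also shows the difference between dividing by $n$ (`bias=True`) and by $n - 1$ (the `numpy.cov()` default).

```python
import numpy as np

# Hypothetical toy data, not the insurance dataset
toy_x = np.array([1, 2, 4, 3, 5], dtype=float)
toy_y = np.array([1, 3, 3, 2, 5], dtype=float)

# Manual population covariance: mean of the products of deviations
manual_cov = np.mean((toy_x - toy_x.mean()) * (toy_y - toy_y.mean()))

# np.cov returns a 2x2 covariance matrix; entry [0, 1] is Cov(x, y)
population_cov = np.cov(toy_x, toy_y, bias=True)[0, 1]   # divides by n
sample_cov = np.cov(toy_x, toy_y)[0, 1]                  # divides by n - 1

print(f'manual: {manual_cov}  bias=True: {population_cov}  default: {sample_cov}')
```

The first two values should agree, while the default `numpy.cov()` result is larger by a factor of $n / (n - 1)$.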


### 3. Estimate Coefficients
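We must estimate the values of the two coefficients in simple linear regression. The slope $b_1$ is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept $b_0$ then follows from the two means:
$$ b_1 = \frac{Cov(X,Y)}{Var(X)} \qquad b_0 = \overline{Y} - b_1 * \overline{X} $$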

In [18]:
b1 = covar_xy / var_x
b0 = mean_y - b1 * mean_x

print(f'Coefficients:\n b0: {b0}  b1: {b1} ')


Coefficients:
 b0: 19.99448575911478  b1: 3.413823560066368 
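
One way to sanity-check coefficients computed from covariance and variance is to compare them with a least-squares line fit from NumPy. The sketch below does this on hypothetical toy arrays (`toy_x`, `toy_y`); the variable names are assumptions chosen so that they do not overwrite `b0` and `b1` from the cell above.

```python
import numpy as np

# Hypothetical toy data, not the insurance dataset
toy_x = np.array([1, 2, 4, 3, 5], dtype=float)
toy_y = np.array([1, 3, 3, 2, 5], dtype=float)

# Coefficients from the statistics, as in this tutorial
toy_b1 = np.cov(toy_x, toy_y, bias=True)[0, 1] / np.var(toy_x)
toy_b0 = toy_y.mean() - toy_b1 * toy_x.mean()

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(toy_x, toy_y, deg=1)

print(f'from statistics: b0={toy_b0:.4f}  b1={toy_b1:.4f}')
print(f'from np.polyfit: b0={intercept:.4f}  b1={slope:.4f}')
```

The same comparison can be run on `data['X']` and `data['Y']` to confirm the coefficients printed above.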


### 4. Make Predictions
The simple linear regression model is a line defined by coefficients estimated from training data. Once the coefficients are estimated, we can use them to make predictions. The equation to make predictions with a simple linear regression model is as follows:
$$ \hat{y} = b_0 + b_1 * x $$

In [19]:
x = data['X'].values.copy()
x

array([  0,   2,   3,   3,   3,   4,   4,   4,   5,   5,   6,   6,   6,
         7,   7,   7,   8,   8,   9,   9,   9,  10,  11,  11,  11,  12,
        13,  13,  13,  13,  14,  14,  15,  16,  17,  19,  20,  22,  23,
        23,  23,  24,  24,  25,  26,  27,  29,  29,  30,  31,  37,  40,
        41,  41,  45,  48,  53,  55,  57,  60,  61, 108, 124], dtype=int64)

In [22]:
# Taking the values from the dataframe (already sorted by X when the data was loaded, which makes plotting the line easier later on)
x = data['X'].values.copy()
print(f'x: {x}')

# Predicting the outputs from the calculated coefficients.
y_hat = b0 + b1 * x
print(f'\n\ny_hat: {y_hat}')

y = data['Y'].values
print(f'\n\ny: {y}')

x: [  0   2   3   3   3   4   4   4   5   5   6   6   6   7   7   7   8   8
   9   9   9  10  11  11  11  12  13  13  13  13  14  14  15  16  17  19
  20  22  23  23  23  24  24  25  26  27  29  29  30  31  37  40  41  41
  45  48  53  55  57  60  61 108 124]


y_hat: [ 19.99448576  26.82213288  30.23595644  30.23595644  30.23595644
  33.64978     33.64978     33.64978     37.06360356  37.06360356
  40.47742712  40.47742712  40.47742712  43.89125068  43.89125068
  43.89125068  47.30507424  47.30507424  50.7188978   50.7188978
  50.7188978   54.13272136  57.54654492  57.54654492  57.54654492
  60.96036848  64.37419204  64.37419204  64.37419204  64.37419204
  67.7880156   67.7880156   71.20183916  74.61566272  78.02948628
  84.8571334   88.27095696  95.09860408  98.51242764  98.51242764
  98.51242764 101.9262512  101.9262512  105.34007476 108.75389832
 112.16772188 118.995369   118.995369   122.40919256 125.82301612
 146.30595748 156.54742816 159.96125172 159.96125172 173.61654596
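 183.85801664 200.92713444 207.75478156 214.58242868 224.82389936
 228.23772292 388.68743025 443.30860721]


y: [  0.    6.6  39.9  13.2   4.4  38.1  12.6  11.8  40.3  20.9  50.9  14.6
  14.8  27.9  48.8  77.5  76.1  55.6  48.7  87.4  52.1  65.3  57.2  21.3
  23.5  58.1  31.9  15.7  93.   89.9  77.5  95.5  32.1  59.6 142.1  46.2
  98.1 161.5 113.   39.6  56.9 137.9 134.9  69.2 187.5  92.6 103.9 133.3
 194.5 209.8 152.8 119.4  73.4 181.3 214.  248.1 244.6 162.8 170.9 202.4
 217.6 392.5 422.2]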
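
The same equation can also be used for inputs that are not in the training data. Below is a minimal sketch that wraps it in a function; the helper name `predict` and the constants `B0` and `B1` are assumptions for illustration, with the coefficient values copied from the output printed in step 3.

```python
import numpy as np

# Coefficients as printed in step 3 of this tutorial
B0 = 19.99448575911478
B1 = 3.413823560066368

def predict(claims, b0=B0, b1=B1):
    """Return the predicted total payment for one or more claim counts."""
    return b0 + b1 * np.asarray(claims, dtype=float)

print(predict(40))        # a single number of claims
print(predict([0, 40]))   # several inputs at once
```

For 40 claims the result should agree with the corresponding entry of `y_hat` above (about 156.547), and for 0 claims the prediction is simply the intercept $b_0$.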

### 5. Visual Comparison for Correctness 

In [27]:
import plotly.graph_objects as go
fig = go.Figure()

fig.add_trace(go.Scatter(x=data['X'], y=data['Y'], name='train', mode='markers', marker_color='rgba(152, 0, 0, .8)'))
fig.add_trace(go.Scatter(x=x, y=y_hat, name='prediction', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))

fig.update_layout(title = 'Swedish Automobiles Data<br>(visual comparison for correctness)',title_x=0.5, xaxis_title= "Number of Claims", yaxis_title="Payment in Claims")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

## Where To Go From Here 
* <font color="red">Can you find out the **mean squared error (MSE)** of the predictions?</font>
* Extend the same problem for multiple input features. 
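
For the second item, one possible starting point is an ordinary least-squares fit with several input columns. The sketch below is only an illustration under stated assumptions: the toy arrays `X_multi` and `y_multi` are hypothetical, and it swaps the covariance-based formulas of this tutorial for `np.linalg.lstsq`, which solves the least-squares problem directly.

```python
import numpy as np

# Hypothetical toy data with two input features; y = 1 + 2*x1 + 3*x2 exactly
X_multi = np.array([[1.0, 2.0],
                    [2.0, 1.0],
                    [3.0, 4.0],
                    [4.0, 3.0],
                    [5.0, 5.0]])
y_multi = 1.0 + 2.0 * X_multi[:, 0] + 3.0 * X_multi[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept b0
design = np.column_stack([np.ones(len(X_multi)), X_multi])

# Least-squares solution of design @ coeffs = y_multi
coeffs, residuals, rank, singular_values = np.linalg.lstsq(design, y_multi, rcond=None)

print(f'b0, b1, b2: {coeffs}')
```

Because the toy targets are generated without noise, the recovered coefficients should be very close to 1, 2 and 3.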

In [39]:
# First we take y and y_hat

print(f'\n\ny_hat: {y_hat}')
print(f'\n\ny: {y}')

# then we subtract one from the other to get the prediction errors
errors = y_hat - y
# square the errors and add them up
squared_error_sum = (errors * errors).sum()
# then divide by the number of observations to get the mean
MSE = squared_error_sum / len(y)

print(f"Mean Squared Error (MSE): {MSE}")





y_hat: [ 19.99448576  26.82213288  30.23595644  30.23595644  30.23595644
  33.64978     33.64978     33.64978     37.06360356  37.06360356
  40.47742712  40.47742712  40.47742712  43.89125068  43.89125068
  43.89125068  47.30507424  47.30507424  50.7188978   50.7188978
  50.7188978   54.13272136  57.54654492  57.54654492  57.54654492
  60.96036848  64.37419204  64.37419204  64.37419204  64.37419204
  67.7880156   67.7880156   71.20183916  74.61566272  78.02948628
  84.8571334   88.27095696  95.09860408  98.51242764  98.51242764
  98.51242764 101.9262512  101.9262512  105.34007476 108.75389832
 112.16772188 118.995369   118.995369   122.40919256 125.82301612
 146.30595748 156.54742816 159.96125172 159.96125172 173.61654596
 183.85801664 200.92713444 207.75478156 214.58242868 224.82389936
 228.23772292 388.68743025 443.30860721]


y: [  0.    6.6  39.9  13.2   4.4  38.1  12.6  11.8  40.3  20.9  50.9  14.6
  14.8  27.9  48.8  77.5  76.1  55.6  48.7  87.4  52.1  65.3  57.2  21.3
  23.5  