## Regression

In `statistical modeling`, **regression analysis** is a set of statistical processes for `estimating` the relationships between a `dependent variable` (often called the 'outcome variable') and one or more `independent variables` (often called 'predictors', 'covariates', or 'features'). The most common form of regression analysis is linear regression.


### Types of Regression

1. Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression

## Simple Linear Regression
A simple linear regression model estimates the relationship between two quantitative variables. The **independent variable** (X), also called the **predictor**, is used to predict the **dependent variable** (Y), also called the **response variable** (e.g. finding the relationship between the amount of CO gas emitted and the number of trees cut down). The value of Y can be obtained from X by finding the line of best fit (regression line), that is, the line with minimum error for the data points on a scatter plot of both variables. In general, a linear regression model with $r$ predictors can be represented as:

>$y = \beta_0 + \beta_1x_1 + \dots + \beta_rx_r + \varepsilon$

where

>$x_1, \dots, x_r$ are the independent variables

>$\beta_0$ is the intercept

>$\beta_1$ is the slope of the line of best fit (in simple linear regression, $r = 1$ and the model reduces to $y = \beta_0 + \beta_1x + \varepsilon$)

>$\varepsilon$ is the random error

>$r$ is the number of predictors

>$\beta_0$, $\beta_1$, $\dots$, $\beta_r$ are known as regression coefficients

<img src="linear_regression_01.png" width="500">


Linear regression calculates the **estimators** of the regression coefficients, or simply the **predicted weights**, denoted $b_0$, $b_1$, …, $b_r$.

The **estimated or predicted response**, $f(x_i)$, for each observation $i = 1, \dots, n$, should be as close as possible to the corresponding actual response $y_i$. The differences $y_i - f(x_i)$ for all observations $i = 1, \dots, n$ are called the **residuals**. Regression is about determining the **best predicted weights**, that is, the weights corresponding to the smallest residuals.

To get the best weights, you usually minimize the **sum of squared residuals** (SSR) for all observations $i = 1, \dots, n$: $\mathrm{SSR} = \sum_i (y_i - f(\mathbf{x}_i))^2$. This approach is called the **method of ordinary least squares**.
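For a single predictor, the least-squares weights have a well-known closed form: $b_1 = S_{xy} / S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$. The sketch below applies it with plain NumPy to a small sample; the arrays and variable names are chosen here for illustration only.

```python
import numpy as np

# Small illustrative sample (an assumption made for this sketch).
x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Closed-form ordinary least squares for one predictor:
#   b1 = S_xy / S_xx,   b0 = mean(y) - b1 * mean(x)
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# Residuals and the quantity that least squares minimizes.
residuals = y - (b0 + b1 * x)
ssr = np.sum(residuals ** 2)

print('b0:', b0)
print('b1:', b1)
print('SSR:', ssr)
```

Any other choice of intercept or slope gives a larger SSR on the same data, which is what "best weights" means here.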

#### Implementation of Linear Regression in Python (scikit-learn)

**Step 1: Import packages and classes**

In [3]:
import numpy as np
from sklearn.linear_model import LinearRegression

**Step 2: Provide data**

The second step is defining data to work with. The inputs (regressors, $x$) and output (response, $y$) should be arrays (instances of the class `numpy.ndarray`) or similar objects. This is the simplest way of providing data for regression:

In [4]:
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

`Note`

You should call `.reshape()` on x because this array is required to be **two-dimensional**, or to be more precise, to have **one column and as many rows as necessary.** That’s exactly what the argument `(-1, 1)` of .reshape() specifies.
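To make the shape requirement concrete, the following sketch compares the one-dimensional array with its reshaped version and shows that `.fit()` rejects the one-dimensional input. The names `x_1d`, `x_2d`, and `fit_failed` are introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x_1d = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])

# -1 lets NumPy infer the number of rows; 1 fixes a single column.
x_2d = x_1d.reshape((-1, 1))

print(x_1d.shape)  # (6,)
print(x_2d.shape)  # (6, 1)

# scikit-learn estimators expect a 2-D feature array, so the 1-D input is rejected.
try:
    LinearRegression().fit(x_1d, y)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print('ValueError:', str(err).splitlines()[0])
```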

In [5]:
print(x, ' \n',   x.shape)

[[ 5]
 [15]
 [25]
 [35]
 [45]
 [55]]  
 (6, 1)


In [6]:
print(y, '\n', y.shape)

[ 5 20 14 32 22 38] 
 (6,)


**Step 3: Create a model and fit it**

The next step is to create a linear regression model and fit it using the existing data.

Let’s create an instance of the class LinearRegression, which will represent the regression model:

In [12]:
model = LinearRegression()

This statement creates the variable model as an instance of LinearRegression. You can provide several optional parameters to LinearRegression:

- fit_intercept is a Boolean (True by default) that decides whether to calculate the intercept $b_0$ (True) or consider it equal to zero (False).
- copy_X is a Boolean (True by default) that decides whether to copy (True) or overwrite the input variables (False).
- n_jobs is an integer or None (default) and represents the number of jobs used in parallel computation. None usually means one job, and -1 means using all processors.

Older scikit-learn releases also accepted a Boolean `normalize` parameter. It was deprecated in version 1.0 and removed in version 1.2, so passing it to a current release raises a `TypeError`. If the inputs need scaling, apply a transformer such as `StandardScaler` from `sklearn.preprocessing` before fitting.

In [13]:
model.fit(x, y)

LinearRegression()

With `.fit()`, you calculate the optimal values of the weights 𝑏₀ and 𝑏₁, using the existing input and output (x and y) as the arguments. In other words, `.fit()` **fits the model**. It returns self, which is the variable model itself. That’s why you can replace the last two statements with this one:



In [14]:
model = LinearRegression().fit(x, y)
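As a small illustration of the `fit_intercept` option listed above, the sketch below fits the same data twice, once with the default intercept and once with the line forced through the origin. The names `with_intercept` and `through_origin` are hypothetical and used only here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# Default: the intercept b0 is estimated together with the slope.
with_intercept = LinearRegression().fit(x, y)

# fit_intercept=False: b0 is fixed at zero, so the line passes through the origin.
through_origin = LinearRegression(fit_intercept=False).fit(x, y)

print('with intercept:', with_intercept.intercept_, with_intercept.coef_)
print('through origin:', through_origin.intercept_, through_origin.coef_)
```

Forcing the intercept to zero changes the slope as well, because the single remaining weight has to absorb the offset in the data.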

**Step 4: Get results**

Once you have your model fitted, you can get the results to check whether the model works satisfactorily and interpret it.

You can obtain the coefficient of determination ($𝑅^2$) with .score() called on model:

In [15]:
r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)


coefficient of determination: 0.7158756137479542


`Note`

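For a least-squares fit with an intercept, $R^2$ computed on the training data lies between 0 and 1. A higher value generally indicates a better fit, but this is not always true: a high $R^2$ can also be a sign of overfitting.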
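The coefficient of determination can also be computed by hand as $R^2 = 1 - SS_{res} / SS_{tot}$, which helps in reading what `.score()` reports. A minimal sketch follows; the names `ss_res`, `ss_tot`, and `r_sq_manual` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

# Residual sum of squares: variation the model leaves unexplained.
ss_res = np.sum((y - y_pred) ** 2)
# Total sum of squares: variation of y around its mean.
ss_tot = np.sum((y - y.mean()) ** 2)

r_sq_manual = 1 - ss_res / ss_tot
print('manual R^2: ', r_sq_manual)
print('model.score:', model.score(x, y))
```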

Next, get the attributes `.intercept_` and `.coef_`, which hold the estimated weights $b_0$ and $b_1$:

In [16]:
print('intercept:', model.intercept_)
print('slope:', model.coef_)

intercept: 5.633333333333329
slope: [0.54]


The value $b_0$ = 5.63 (approximately) illustrates that your model predicts the response 5.63 when 𝑥 is zero. The value $b_1$ = 0.54 means that the predicted response rises by 0.54 when 𝑥 is increased by one.

**Step 5: Predict response**

Once there is a satisfactory model, you can use it for predictions with either existing or new data.


To obtain the predicted response, use `.predict()`:

In [17]:
y_pred = model.predict(x)

print('predicted response:', y_pred, sep='\n')

predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]
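The predicted response is simply the fitted line evaluated at each input, so it can be reproduced from the intercept and slope directly. A short sketch, with `y_manual` as a name introduced only for this example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

model = LinearRegression().fit(x, y)

# f(x) = b0 + b1 * x, evaluated for every observation.
y_manual = model.intercept_ + model.coef_[0] * x.ravel()

print('manual prediction:', y_manual, sep='\n')
print('matches .predict():', np.allclose(y_manual, model.predict(x)))
```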


In [18]:
import pandas as pd
pred_vs_actual = pd.DataFrame({'Predicted': np.array(y_pred), 'Actual': y})

In [19]:
pred_vs_actual

| | Predicted | Actual |
|---|---|---|
| 0 | 8.333333 | 5 |
| 1 | 13.733333 | 20 |
| 2 | 19.133333 | 14 |
| 3 | 24.533333 | 32 |
| 4 | 29.933333 | 22 |
| 5 | 35.333333 | 38 |


In practice, regression models are often applied for forecasts. This means that you can use fitted models to calculate the outputs based on some other, new inputs:

In [20]:
x_new = np.arange(5).reshape((-1, 1))

In [21]:
print(x_new)

[[0]
 [1]
 [2]
 [3]
 [4]]


In [22]:
y_new = model.predict(x_new)

In [23]:
print(y_new)

[5.63333333 6.17333333 6.71333333 7.25333333 7.79333333]


### Multiple Linear Regression With scikit-learn

**Steps 1 and 2: Import packages and classes, and provide data**

In [29]:
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]

In [30]:
x, y = np.array(x), np.array(y)

In [31]:
print(x)

[[ 0  1]
 [ 5  1]
 [15  2]
 [25  5]
 [35 11]
 [45 15]
 [55 34]
 [60 35]]


In [32]:
print(y)

[ 4  5 20 14 32 22 38 43]


In multiple linear regression, `x` is a **two-dimensional array with at least two columns**, while `y` is usually a **one-dimensional array**. This is a simple example of multiple linear regression, and x has exactly two columns.

**Step 3: Create a model and fit it**

The next step is to create the regression model as an instance of LinearRegression and fit it with `.fit()`:

In [34]:
model = LinearRegression().fit(x, y)

In [None]:
r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)



In [35]:
print('intercept:', model.intercept_)

intercept: 5.52257927519819


In [36]:
print('coefficients:', model.coef_)


coefficients: [0.44706965 0.25502548]
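With more than one predictor, the same weights can be obtained by solving the least-squares problem on a design matrix that has a leading column of ones for the intercept. The sketch below uses `numpy.linalg.lstsq` as a cross-check; it is not how scikit-learn is required to compute them, and the name `design` is introduced only for this example.

```python
import numpy as np

x = np.array([[0, 1], [5, 1], [15, 2], [25, 5],
              [35, 11], [45, 15], [55, 34], [60, 35]], dtype=float)
y = np.array([4, 5, 20, 14, 32, 22, 38, 43], dtype=float)

# Prepend a column of ones so the first weight plays the role of the intercept b0.
design = np.column_stack([np.ones(len(x)), x])

# Solve min ||design @ w - y||^2 for w = (b0, b1, b2).
weights, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = weights

print('b0:', b0)
print('b1:', b1)
print('b2:', b2)
```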


In [37]:
y_pred = model.predict(x)
print('predicted response:', y_pred, sep='\n')

predicted response:
[ 5.77760476  8.012953   12.73867497 17.9744479  23.97529728 29.4660957
 38.78227633 41.27265006]


In [38]:
pred_vs_actual2 = pd.DataFrame({'Predicted': np.array(y_pred), 'Actual': y})
pred_vs_actual2

| | Predicted | Actual |
|---|---|---|
| 0 | 5.777605 | 4 |
| 1 | 8.012953 | 5 |
| 2 | 12.738675 | 20 |
| 3 | 17.974448 | 14 |
| 4 | 23.975297 | 32 |
| 5 | 29.466096 | 22 |
| 6 | 38.782276 | 38 |
| 7 | 41.27265 | 43 |


In [39]:
x_new = np.arange(10).reshape((-1, 2))

In [40]:
print(x_new)

[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]


In [41]:
y_new = model.predict(x_new)
print(y_new)

[ 5.77760476  7.18179502  8.58598528  9.99017554 11.3943658 ]


### Polynomial Regression With scikit-learn

Implementing **polynomial regression** with scikit-learn is very similar to linear regression. There is only one extra step: you need to transform the array of inputs to include non-linear terms such as $x^2$.

**Step 1: Import packages and classes**

In [44]:
from sklearn.preprocessing import PolynomialFeatures

**Step 2a: Provide data**

Define the input and output first. Keep in mind that the input needs to be a **two-dimensional array**, which is why `.reshape()` is used.

In [45]:
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([15, 11, 2, 8, 25, 32])

**Step 2b: Transform input data**
    
This is the **new step** you need to implement for polynomial regression!

As noted above, you need to include $x^2$ (and perhaps other terms) as additional features when implementing polynomial regression. For that reason, you should transform the input array $x$ to contain the additional column(s) with the values of $x^2$ (and possibly more features).

It’s possible to transform the input array in several ways (like using insert() from numpy), but the class PolynomialFeatures is very convenient for this purpose. Let’s create an instance of this class:

In [46]:
transformer = PolynomialFeatures(degree=2, include_bias=False)

The variable transformer refers to an instance of PolynomialFeatures which you can use to transform the input x.

You can provide several optional parameters to PolynomialFeatures:

- **degree** is an integer (2 by default) that represents the degree of the polynomial regression function.
- **interaction_only** is a Boolean (False by default) that decides whether to include only interaction features (True) or all features (False).
- **include_bias** is a Boolean (True by default) that decides whether to include the bias (intercept) column of ones (True) or not (False).
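To see how these parameters change the generated columns, the sketch below transforms two tiny arrays with different settings. The arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([5, 15, 25]).reshape((-1, 1))

# Defaults (degree=2, include_bias=True): columns 1, x, x^2.
with_bias = PolynomialFeatures(degree=2).fit_transform(x)

# degree=3 without the bias column: columns x, x^2, x^3.
cubic = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)

# With two inputs a and b, degree 2 yields a, b, a^2, a*b, b^2 ...
two_inputs = np.array([[2, 3]])
full = PolynomialFeatures(degree=2, include_bias=False).fit_transform(two_inputs)

# ... while interaction_only=True keeps a, b, a*b and drops the pure squares.
inter = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(two_inputs)

print(with_bias)
print(cubic)
print(full)
print(inter)
```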

In [47]:
transformer.fit(x)

PolynomialFeatures(include_bias=False)

In [48]:
x_ = transformer.transform(x)

In [49]:
x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

This statement fits and transforms the input array in one call with `.fit_transform()`. It takes the input array, effectively does the same thing as `.fit()` and `.transform()` called in that order, and returns the modified array. This is how the new input array looks:



In [50]:
print(x_)

[[   5.   25.]
 [  15.  225.]
 [  25.  625.]
 [  35. 1225.]
 [  45. 2025.]
 [  55. 3025.]]
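For a single input column and degree 2, the transformed array is just $x$ next to $x^2$, so it can also be built by hand. The sketch below does that with `np.hstack` and compares the result with `PolynomialFeatures`; `x_manual` is a name introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))

# Stack the original column and its element-wise square side by side.
x_manual = np.hstack([x, x ** 2]).astype(float)

x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)

print(x_manual)
print('same as PolynomialFeatures:', np.allclose(x_manual, x_))
```

The manual route stops being practical once there are several inputs or higher degrees, which is where the transformer class is more convenient.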


**Step 3: Create a model and fit it**

This step is the same as for linear regression, except that the model is fitted on the transformed input `x_`:

In [51]:
model = LinearRegression().fit(x_, y)

**Step 4: Get results**

In [52]:
r_sq = model.score(x_, y)
print('coefficient of determination:', r_sq)


coefficient of determination: 0.8908516262498564


In [53]:
print('intercept:', model.intercept_)

intercept: 21.372321428571425


In [54]:
print('coefficients:', model.coef_)

coefficients: [-1.32357143  0.02839286]
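One way to predict with the polynomial model without transforming new inputs by hand is to chain both steps in a scikit-learn pipeline. This is a sketch of one possible arrangement rather than the only approach; the new input values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([15, 11, 2, 8, 25, 32])

# The pipeline applies the polynomial transform, then fits the linear model.
pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
pipeline.fit(x, y)

# The last pipeline step is the fitted LinearRegression instance.
linear_step = pipeline[-1]
print('intercept:', linear_step.intercept_)
print('coefficients:', linear_step.coef_)

# New inputs are transformed automatically before prediction.
x_new = np.array([10, 20, 30]).reshape((-1, 1))
y_new = pipeline.predict(x_new)
print('predicted response:', y_new)
```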
