# Types of variables

- Categorical:
    - Nominal (Man, Woman), (Blue, Red, Yellow)
    - Ordinal (A, B, C), (Small, Medium, Big)

- Numeric:
    - Discrete (1,2,3), (345 people)
    - Continuous (Age, Height)

# Linear regression

`"The existence of a relationship between variables"`

- Linear

    - Simple
    - Multiple

- Logistic
    - Simple
    - Multiple

Key points:

1. Its main use is to estimate one variable from another; the regression line is the line that best fits the cloud of points.

2. It measures the degree of dependence between the series of values X and Y, and predicts the estimated value of y for a value of x that is not in the observed data.

3. It allows predicting the behavior of a phenomenon, because the fitted line efficiently approximates a set of data points.

4. It is a predictive, supervised data mining algorithm, used when the variables are numerical.

5. The variable we want to forecast is the dependent variable, and it gives rise to the further subdivision of regression types listed above.

Simple linear regression studies how changes in a non-random variable affect a random variable when the functional relationship between the two can be written as a linear expression, that is, when its graph is a straight line. In other words, we have a simple linear regression when a single independent variable influences a dependent variable.

$$ y = f(x) $$


# The least squares method



The least squares method is a statistical method used to determine the equation of a regression. It is the criterion a regression model uses to minimize the error made when estimating the regression equation.

Specifically, the least squares method minimizes the sum of the squared residuals, that is, the sum of the squared differences between the observed values and the values predicted by the regression model.

## Estimation error

In statistics, the estimation error, also called the residual, is the difference between the observed value $y_i$ and the value $\hat{y}_i$ fitted by the regression model. Therefore, a statistical residual is calculated as follows:

$$ e_i = y_i - \hat{y}_i $$
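A small sketch of the residual computation, using the first five rows of the salary data shown later in this notebook. The candidate line here is picked by hand: the slope `9000` and intercept `27000` are illustrative assumptions, not fitted values.

```python
import numpy as np

# First five observations from Salary_Data.csv (years of experience, salary)
x = np.array([1.1, 1.3, 1.5, 2.0, 2.2])
y = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0])

# Hypothetical candidate line, chosen by hand for illustration only
slope, intercept = 9000.0, 27000.0

y_fitted = intercept + slope * x  # value fitted by the candidate line
residuals = y - y_fitted          # e_i = observed value - fitted value

for xi, yi, ei in zip(x, y, residuals):
    print(f"x={xi:4.1f}  observed={yi:8.0f}  residual={ei:9.1f}")
```

A positive residual means the line underestimates that observation; a negative one means it overestimates it.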

## Minimize error squares

Now that we know what a residual is in statistics, it will be easier to understand how the squares of the errors are minimized.

The square of an error is the square of a residual, that is, the squared difference between the observed value $y_i$ and the value $\hat{y}_i$ fitted by the regression model.

$$ e_i^2 = ( y_i - \hat{y}_i)^2 $$

So the least squares criterion is:

$$ \min \sum_{i = 1}^{n} e_i^2 = \min \sum_{i = 1}^{n}{( y_i - \hat{y}_i)^2} $$
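The minimization can be checked numerically. The sketch below uses the first ten rows of the salary data shown later in this notebook and `np.polyfit` (a NumPy routine, not used elsewhere in this notebook) to obtain the least squares line, then compares its sum of squared residuals against lines with a deliberately shifted slope. The helper name `sum_squared_errors` and the shift of 500 are illustrative choices.

```python
import numpy as np

# First ten observations from Salary_Data.csv
x = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7])
y = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0,
              56642.0, 60150.0, 54445.0, 64445.0, 57189.0])

def sum_squared_errors(slope, intercept):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return float(np.sum((y - (intercept + slope * x)) ** 2))

# A degree-1 polyfit returns the least squares [slope, intercept]
best_slope, best_intercept = np.polyfit(x, y, 1)
best_sse = sum_squared_errors(best_slope, best_intercept)

# Shifting the slope away from the fitted value can only increase the sum
shifted_sse = [sum_squared_errors(best_slope + delta, best_intercept)
               for delta in (-500.0, 500.0)]

print("least squares SSE:", best_sse)
print("shifted-slope SSE:", shifted_sse)
```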


## Practice example

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [29]:
dataset = pd.read_csv("./datasets/Salary_Data.csv")
dataset

,YearsExperience,Salary
0,1.1,39343.0
1,1.3,46205.0
2,1.5,37731.0
3,2.0,43525.0
4,2.2,39891.0
5,2.9,56642.0
6,3.0,60150.0
7,3.2,54445.0
8,3.2,64445.0
9,3.7,57189.0


In [30]:
x = dataset.iloc[:,0].values 
y = dataset.iloc[:,1].values

In [31]:
x # Years of experience

array([ 1.1,  1.3,  1.5,  2. ,  2.2,  2.9,  3. ,  3.2,  3.2,  3.7,  3.9,
        4. ,  4. ,  4.1,  4.5,  4.9,  5.1,  5.3,  5.9,  6. ,  6.8,  7.1,
        7.9,  8.2,  8.7,  9. ,  9.5,  9.6, 10.3, 10.5])

In [32]:
y # Salary based on years of experience

array([ 39343.,  46205.,  37731.,  43525.,  39891.,  56642.,  60150.,
        54445.,  64445.,  57189.,  63218.,  55794.,  56957.,  57081.,
        61111.,  67938.,  66029.,  83088.,  81363.,  93940.,  91738.,
        98273., 101302., 113812., 109431., 105582., 116969., 112635.,
       122391., 121872.])

In [33]:
# train and testing sets

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=1/3, random_state=0)

In [34]:
x_train

array([ 2.9,  5.1,  3.2,  4.5,  8.2,  6.8,  1.3, 10.5,  3. ,  2.2,  5.9,
        6. ,  3.7,  3.2,  9. ,  2. ,  1.1,  7.1,  4.9,  4. ])

In [35]:
x_test

array([ 1.5, 10.3,  4.1,  3.9,  9.5,  8.7,  9.6,  4. ,  5.3,  7.9])

In [36]:
y_train

array([ 56642.,  66029.,  64445.,  61111., 113812.,  91738.,  46205.,
       121872.,  60150.,  39891.,  81363.,  93940.,  57189.,  54445.,
       105582.,  43525.,  39343.,  98273.,  67938.,  56957.])

In [37]:
y_test

array([ 37731., 122391.,  57081.,  63218., 116969., 109431., 112635.,
        55794.,  83088., 101302.])

In [38]:
from sklearn.linear_model import LinearRegression
# Reshape x into a 2-D column of shape (n_samples, 1), as scikit-learn expects; y can stay 1-D
x_train = np.array(x_train).reshape(-1,1)
y_train = np.array(y_train)
# Create linear regression model with x and y training sets
regression = LinearRegression()
regression.fit(x_train, y_train)
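A self-contained sketch of how a fitted model might then be compared against the held-out test set, using the train and test arrays printed above. scikit-learn expects the features as a 2-D array of shape `(n_samples, n_features)`, hence `reshape(-1, 1)`. The variable names `model` and `y_pred` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train and test arrays as printed above
x_train = np.array([2.9, 5.1, 3.2, 4.5, 8.2, 6.8, 1.3, 10.5, 3.0, 2.2,
                    5.9, 6.0, 3.7, 3.2, 9.0, 2.0, 1.1, 7.1, 4.9, 4.0])
y_train = np.array([56642.0, 66029.0, 64445.0, 61111.0, 113812.0, 91738.0,
                    46205.0, 121872.0, 60150.0, 39891.0, 81363.0, 93940.0,
                    57189.0, 54445.0, 105582.0, 43525.0, 39343.0, 98273.0,
                    67938.0, 56957.0])
x_test = np.array([1.5, 10.3, 4.1, 3.9, 9.5, 8.7, 9.6, 4.0, 5.3, 7.9])
y_test = np.array([37731.0, 122391.0, 57081.0, 63218.0, 116969.0,
                   109431.0, 112635.0, 55794.0, 83088.0, 101302.0])

# One column per feature: shape (20, 1) for training, (10, 1) for testing
model = LinearRegression()
model.fit(x_train.reshape(-1, 1), y_train)

# Predicted salaries for the held-out years of experience
y_pred = model.predict(x_test.reshape(-1, 1))

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
for years, actual, predicted in zip(x_test, y_test, y_pred):
    print(f"{years:5.1f} years  actual={actual:9.0f}  predicted={predicted:9.0f}")
```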

# Regression line of Y on X:

The regression line of Y on X is used to estimate the values of Y from those of X; for this reason Y is called the dependent variable.

**The slope of the line is the quotient between the covariance and the variance of the variable X**

$$ y - \bar{y} = \frac{S_{XY}}{S_{X}^2} (x - \bar{x}) $$
$$ y = \left(\frac{S_{XY}}{S_{X}^2}\right)x + \left[\bar{y} - \left(\frac{S_{XY}}{S_{X}^2}\right)\bar{x}\right] $$
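The covariance-over-variance rule can be checked directly with NumPy on the training arrays printed earlier. This is a sketch: `np.polyfit` is used only as an independent reference for the least squares line, and the variable names are illustrative.

```python
import numpy as np

# Training observations printed earlier (years of experience, salary)
x = np.array([2.9, 5.1, 3.2, 4.5, 8.2, 6.8, 1.3, 10.5, 3.0, 2.2,
              5.9, 6.0, 3.7, 3.2, 9.0, 2.0, 1.1, 7.1, 4.9, 4.0])
y = np.array([56642.0, 66029.0, 64445.0, 61111.0, 113812.0, 91738.0,
              46205.0, 121872.0, 60150.0, 39891.0, 81363.0, 93940.0,
              57189.0, 54445.0, 105582.0, 43525.0, 39343.0, 98273.0,
              67938.0, 56957.0])

# Covariance S_XY and variance S_X^2 (both divided by n, so the ratio is consistent)
s_xy = np.mean((x - x.mean()) * (y - y.mean()))
s_x2 = np.var(x)

slope = s_xy / s_x2                      # slope of Y on X
intercept = y.mean() - slope * x.mean()  # the line passes through (x_bar, y_bar)

# Independent reference: degree-1 least squares fit
ref_slope, ref_intercept = np.polyfit(x, y, 1)

print("covariance / variance:", slope, intercept)
print("np.polyfit reference: ", ref_slope, ref_intercept)
```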

# Regression line of X on Y:

The regression line of X on Y is used to estimate the values of X from those of Y; for this reason X is called the dependent variable.

**The slope of the line is the quotient between the covariance and the variance of the variable Y**

$$ x - \bar{x} = \frac{S_{XY}}{S_{Y}^2} (y - \bar{y}) $$
$$ x = \left(\frac{S_{XY}}{S_{Y}^2}\right)y + \left[\bar{x} - \left(\frac{S_{XY}}{S_{Y}^2}\right)\bar{y}\right] $$
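A short sketch of the X on Y line, using the first ten rows of the salary data. It estimates years of experience from a salary, and shows that the X on Y line is generally not the Y on X line read backwards. The example salary of `60000` and the variable names are illustrative assumptions.

```python
import numpy as np

# First ten observations from Salary_Data.csv
x = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7])
y = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0,
              56642.0, 60150.0, 54445.0, 64445.0, 57189.0])

s_xy = np.mean((x - x.mean()) * (y - y.mean()))  # covariance S_XY
slope_y_on_x = s_xy / np.var(x)                  # Y on X: S_XY / S_X^2
slope_x_on_y = s_xy / np.var(y)                  # X on Y: S_XY / S_Y^2
intercept_x_on_y = x.mean() - slope_x_on_y * y.mean()

# Estimate years of experience from a salary with the X on Y line
salary = 60000.0
print("estimated years:", intercept_x_on_y + slope_x_on_y * salary)

# Drawn in the same (x, y) plane, the X on Y line has slope 1 / slope_x_on_y,
# which matches slope_y_on_x only when the correlation is perfect
print("Y on X slope:", slope_y_on_x)
print("X on Y slope in the (x, y) plane:", 1.0 / slope_x_on_y)
```

The product of the two regression slopes equals the squared correlation coefficient, so the two lines coincide only when that value is 1.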

What we want is to find a simple straight line of the form:

$$ y = mx + b $$



# Coefficient of determination (R squared)

The coefficient of determination is defined as the proportion of the total variance of the dependent variable that is explained by the regression. Also called R squared, it reflects the goodness of fit of a model to the variable that it tries to explain.

The coefficient of determination takes values between 0 and 1. The closer its value is to 1, the better the model fits the variable that we are trying to explain. Conversely, the closer it is to 0, the worse the fit and, therefore, the less reliable the model.

Regression lets us predict how a variable grows or changes according to how it is correlated with other variables. R squared is the indicator that tells us how well those results can be predicted.

The greater the variance explained by the regression model, the closer the data points will be to the fitted regression line.

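The following expression resembles the formula for the variance, with two fundamental differences: the numerator uses the fitted values $\hat{y}_t$ rather than the observed values $y_t$, and the sums are not divided by the number of observations $T$.

$$ R^2 = \frac{\sum_{t=1}^{T}{(\hat{y}_t - \bar{y})^2}}{\sum_{t=1}^{T}{(y_t - \bar{y})^2}} $$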
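The coefficient of determination can be computed by hand as the explained sum of squares over the total sum of squares. The sketch below does this for a least squares line fitted to the training arrays printed earlier; `np.polyfit` and the variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Training observations printed earlier (years of experience, salary)
x = np.array([2.9, 5.1, 3.2, 4.5, 8.2, 6.8, 1.3, 10.5, 3.0, 2.2,
              5.9, 6.0, 3.7, 3.2, 9.0, 2.0, 1.1, 7.1, 4.9, 4.0])
y = np.array([56642.0, 66029.0, 64445.0, 61111.0, 113812.0, 91738.0,
              46205.0, 121872.0, 60150.0, 39891.0, 81363.0, 93940.0,
              57189.0, 54445.0, 105582.0, 43525.0, 39343.0, 98273.0,
              67938.0, 56957.0])

# Least squares line and its fitted values
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

explained = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the line
total = np.sum((y - y.mean()) ** 2)          # total variation of y
residual = np.sum((y - y_hat) ** 2)          # variation left unexplained

r_squared = explained / total
print("R^2:", r_squared)
print("1 - residual / total:", 1.0 - residual / total)
```

For a least squares line with an intercept, evaluated on the data it was fitted to, both printed values agree and lie between 0 and 1.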



In [39]:
import pandas as pd
import statsmodels.formula.api as sfm
import matplotlib.pyplot as plt
import numpy as np
from math import ceil

In [40]:
data = pd.read_csv("Advertising.csv")
data.head()


In [None]:
# Create a model that fits the parameters
lm = sfm.ols(formula='Sales~TV',data=data).fit()

In [None]:
lm.params

Intercept    7.032594
TV           0.047537
dtype: float64

# Predict linear model

$$ \text{sales} = 7.032594 + 0.047537 \cdot \text{TV} $$
$$ \alpha = 7.032594 $$
$$ \beta = 0.047537 $$
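With the two parameters above, a prediction is a single multiplication and addition. A minimal sketch, where the function name `predict_sales` and the example budgets are hypothetical, and the units are whatever units the `TV` and `Sales` columns use in the dataset:

```python
# Parameters reported by lm.params above
ALPHA = 7.032594  # intercept
BETA = 0.047537   # slope for the TV variable

def predict_sales(tv_budget: float) -> float:
    """Estimated sales for a given TV budget under the fitted line."""
    return ALPHA + BETA * tv_budget

# Example budgets chosen only for illustration
for budget in (0.0, 100.0, 200.0):
    print(budget, predict_sales(budget))
```

With a fitted statsmodels model, `lm.predict` applied to a DataFrame with a `TV` column should give the same values, up to the rounding of the parameters.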