<a href="https://colab.research.google.com/github/GDianaS/machine-learning-basics/blob/main/03_Multiple_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Multiple Linear Regression ###
$$
y = b_{0} + b_{1}*x_{1} + b_{2}*x_{2}+ ... +b_{n}*x_{n}
$$

Assumptions of a linear regression:

1. Linearity
2. Homoscedasticity
3. Multivariate normality
4. Independence of errors
5. Lack of multicollinearity
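
As a minimal sketch of how the model equation is evaluated, the snippet below computes $y$ for two observations at once. The coefficient values and the names `b0`, `b`, `X_demo`, and `y_demo` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
b0 = 1.0                        # intercept b_0
b = np.array([2.0, -0.5, 3.0])  # slopes b_1, b_2, b_3

# Two observations, each with three features x_1, x_2, x_3.
X_demo = np.array([[1.0, 2.0, 0.0],
                   [1.0, 0.0, 2.0]])

# y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3, evaluated for every row at once.
y_demo = b0 + X_demo @ b
print(y_demo)  # [2. 9.]
```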

### Dummy Variable Trap ###

**Categorical variable** -> create dummy variables by adding new columns that look like a truth table, one column per category. This splits the categorical variable into several numeric variables. If there are only two categories, a single column of 1s and 0s is enough, so not every dummy variable "D" (switch) has to be included in the dataset.

$$
y = b_{0} + b_{1}*x_{1} + b_{2}*x_{2} + b_{3}*x_{3} + b_{4}*D_{1}+b_{5}*D_{2}
$$

> You cannot include the constant ($b_{0}$) and all the dummy variables at the same time. So when building a model, always omit one dummy variable.
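
The trap can be made visible with a small sketch. It assumes a toy `State` column (the names `states`, `X_trap`, and `X_safe` are illustrative) and uses `pandas.get_dummies` with `drop_first=True` as one possible way to omit a dummy variable.

```python
import numpy as np
import pandas as pd

# Toy column with the three states used in this notebook's dataset.
states = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

all_dummies = pd.get_dummies(states["State"], dtype=float)                    # 3 columns
safe_dummies = pd.get_dummies(states["State"], drop_first=True, dtype=float)  # 2 columns

# Add the column of ones that stands for the constant b_0.
ones = np.ones((len(states), 1))
X_trap = np.hstack([ones, all_dummies.to_numpy()])
X_safe = np.hstack([ones, safe_dummies.to_numpy()])

# In the trap, the dummy columns add up to the column of ones, so the
# matrix has fewer independent columns than it has columns.
print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))  # 4 3
print(X_safe.shape[1], np.linalg.matrix_rank(X_safe))  # 3 3
```

With one dummy omitted, the dropped category becomes the baseline that the constant absorbs.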

### P-Value ###
When an event has a probability of 5% or less of happening ($\alpha = 0.05$), we start to distrust the assumption under which that event was supposed to occur. The P-value is that probability, and comparing it with the significance level $\alpha$ lets us reject, or fail to reject, a hypothesis about the event.
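
A hypothetical coin example, not taken from the dataset used here, may help. Assume the null hypothesis is that a coin is fair and that we keep observing tails. Under that hypothesis, a run of `n` tails has probability $0.5^n$, which is used below as the P-value.

```python
alpha = 0.05  # significance level

# Null hypothesis H0: the coin is fair, so each toss is tails with probability 0.5.
p_values = {}
for n_tails in range(1, 7):
    p_values[n_tails] = 0.5 ** n_tails  # probability of n_tails tails in a row under H0
    decision = "reject H0" if p_values[n_tails] <= alpha else "fail to reject H0"
    print(f"{n_tails} tails in a row: p = {p_values[n_tails]:.4f} -> {decision}")
```

Four tails in a row still give p = 0.0625, above the threshold. The fifth brings it to 0.03125, and at that point the fair-coin hypothesis is rejected.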

### Building a Model ###
**"All-in"**: use all the variables. Choose this when you already know that every one of them is needed to build the model.

**Backward Elimination**:
 1. STEP 1: Select a significance level to stay in the model (e.g. SL = 0.05)
 2. STEP 2: Fit the full model with all possible predictors
 3. STEP 3: Consider the predictor with the __highest__ P-value. If P>SL, go to STEP 4, otherwise go to FIN
 4. STEP 4: Remove the predictor
 5. STEP 5: Fit the model without this variable. Go back to STEP 3
 
 FIN: YOUR MODEL IS READY
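
The Backward Elimination steps can be sketched with NumPy and SciPy as follows. This is an illustration under assumptions: the helper names `ols_p_values` and `backward_elimination` are made up for this example, the P-values come from the usual two-sided t-test on ordinary least squares coefficients, and the data is synthetic.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided t-test P-values of an OLS fit; X must include the constant column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)  # residual variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / std_err), df=n - k)

def backward_elimination(X, y, sl=0.05):
    """Return the column indices of X that survive Backward Elimination."""
    kept = list(range(X.shape[1]))
    while kept:
        design = np.column_stack([np.ones(len(y)), X[:, kept]])  # STEP 2 / STEP 5: fit
        p = ols_p_values(design, y)[1:]  # skip the P-value of the constant
        worst = int(np.argmax(p))        # STEP 3: predictor with the highest P-value
        if p[worst] <= sl:
            break                        # FIN: every remaining predictor is significant
        kept.pop(worst)                  # STEP 4: remove the predictor
    return kept

# Synthetic data (an assumption of this sketch): y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

kept_columns = backward_elimination(X_demo, y_demo)
print(kept_columns)
```

Columns 0 and 2 should survive. A pure-noise column can occasionally stay in by chance, since each one has roughly a 5% probability of falling below SL.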
 
 **Forward Selection**:
 1. STEP 1: Select a significance level to enter in the model (e.g. SL = 0.05)
 2. STEP 2: Fit all simple regression models y ~ x$_{n}$: take the dependent variable (y) and build one regression model with each independent variable. Select the one with the __lowest__ P-value.
 3. STEP 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have. For example, after selecting a simple linear regression with one variable, build every possible two-variable regression by adding each of the remaining variables one at a time.
 4. STEP 4: Consider the predictor with the __lowest__ P-value. If P < SL, go to STEP 3 (this time you have two variables and will add a third), otherwise go to FIN
 
 FIN: DON'T KEEP THE CURRENT MODEL. KEEP THE PREVIOUS MODEL
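
A sketch of Forward Selection under the same kind of assumptions: the helper names `ols_p_values` and `forward_selection` are illustrative, the P-values come from a two-sided t-test on ordinary least squares coefficients, and the data is synthetic.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided t-test P-values of an OLS fit; X must include the constant column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)  # residual variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / std_err), df=n - k)

def forward_selection(X, y, sl=0.05):
    """Return the column indices of X in the order Forward Selection adds them."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        # STEP 2 / STEP 3: fit one model per candidate added to the current set.
        candidate_p = {}
        for j in remaining:
            design = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            candidate_p[j] = ols_p_values(design, y)[-1]  # P-value of the new predictor
        best = min(candidate_p, key=candidate_p.get)      # STEP 4: lowest P-value
        if candidate_p[best] >= sl:
            break  # FIN: discard this candidate model, keep the previous one
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (an assumption of this sketch): y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

selected_columns = forward_selection(X_demo, y_demo)
print(selected_columns)
```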
 
 **Bidirectional Elimination**:
 1. STEP 1: Select a significance level to enter and to stay in the model (e.g. SLENTER = 0.05, SLSTAY = 0.05)
 2. STEP 2: New variables must have: P < SLENTER to enter
 3. STEP 3: Perform ALL steps of Backward Elimination. Old variables must have P < SLSTAY to stay. Go back to STEP 2
 4. STEP 4: No new variables can enter and no old variables can exit
 
 FIN: YOUR MODEL IS READY
 
 **All Possible Models**:
  1. STEP 1: Select a criterion of goodness of fit.
  2. STEP 2: Construct all possible regression models: $2^n - 1$ total combinations, n being the number of columns
  3. STEP 3: Select the one with the best criterion
  
   FIN: YOUR MODEL IS READY
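
The $2^n - 1$ count and the search itself can be sketched as below. The notes do not name a criterion, so adjusted $R^2$ is used here purely as an assumption. The helper `adjusted_r2` and the synthetic data are also illustrative.

```python
from itertools import combinations

import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an OLS fit with a constant; X holds the predictors only."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))

# Synthetic data (an assumption of this sketch): y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

# STEP 2: every non-empty subset of the columns.
n_cols = X_demo.shape[1]
subsets = [cols for size in range(1, n_cols + 1)
           for cols in combinations(range(n_cols), size)]
print(len(subsets))  # 15, i.e. 2**4 - 1

# STEP 3: keep the subset with the best criterion value.
scores = {cols: adjusted_r2(X_demo[:, list(cols)], y_demo) for cols in subsets}
best_subset = max(scores, key=scores.get)
print(best_subset, round(scores[best_subset], 3))
```

The number of models doubles with every extra column, which is why this approach quickly becomes impractical for wide datasets.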

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

In [None]:
dataset = pd.read_csv('Dataset/50_Startups.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

In [None]:
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder="passthrough")
X = np.array(ct.fit_transform(X))

In [None]:
# The encoded columns are placed at the front
# New York (001), California (100), Florida (010)
print(X)
# Feature scaling is not needed in Multiple Linear Regression: the coefficients compensate for the different scales

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=0)

In [None]:
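# LinearRegression does not require dropping a dummy column by hand: its least-squares
# solver still returns a solution when all dummy columns sit next to the intercept.
# It does not perform Backward Elimination or any other feature selection; every column is used.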
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

In [None]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)  # Two decimal places
# reshape(rows, columns)
print(
    np.concatenate(
        (y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)),
        axis=1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
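
To put a number on how close the two columns are, the sketch below recomputes the mean absolute error and $R^2$ from the ten pairs printed above. The array names `y_pred_shown` and `y_test_shown` are illustrative, and the values are copied from the printed output, so they are rounded to two decimals.

```python
import numpy as np

# Predicted and actual profits, copied from the printed output above.
y_pred_shown = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                         116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test_shown = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                         105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

# Mean absolute error: average size of the prediction error, in profit units.
mae = np.mean(np.abs(y_test_shown - y_pred_shown))

# R^2: share of the variance in the actual profits explained by the predictions.
ss_res = np.sum((y_test_shown - y_pred_shown) ** 2)
ss_tot = np.sum((y_test_shown - y_test_shown.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.4f}")
```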


### Making a single prediction (for example the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California') ###

In [None]:
# The predict method always expects a 2D array as input
# 12 → scalar
# [12] → 1D array
# [[12]] → 2D array
print(regressor.predict([[1,0,0,160000,130000,300000]]))

[181566.92]


In [None]:
print(regressor.coef_)
print(regressor.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.52924854249



$$\textrm{Profit} = 86.6 \times \textrm{Dummy State 1} - 873 \times \textrm{Dummy State 2} + 786 \times \textrm{Dummy State 3} + 0.773 \times \textrm{R&D Spend} + 0.0329 \times \textrm{Administration} + 0.0366 \times \textrm{Marketing Spend} + 42467.53$$
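
As a sanity check, the equation can be evaluated by hand for the California startup from the single prediction above. This sketch uses the coefficients as printed, rounded to three significant figures, so the result is only expected to land near the model's 181566.92. The names `coef_shown`, `features`, and `manual_profit` are illustrative.

```python
import numpy as np

# Coefficients and intercept as printed above (coefficients are rounded).
coef_shown = np.array([86.6, -873.0, 786.0, 0.773, 0.0329, 0.0366])
intercept_shown = 42467.52924854249

# California = (1, 0, 0), then R&D Spend, Administration, Marketing Spend.
features = np.array([1, 0, 0, 160000, 130000, 300000])

manual_profit = intercept_shown + features @ coef_shown
print(round(manual_profit, 2))  # roughly 181491, close to the model's 181566.92
```

The gap of about 76 comes from the rounding of the printed coefficients. The R&D Spend coefficient alone is multiplied by 160000, so an error in its fourth significant figure already shifts the result by tens of units.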