<a href="https://colab.research.google.com/github/Rogerio-mack/IMT_CD_2024/blob/main/IMT_regressao_robust_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<head>
  <meta name="author" content="Rogério de Oliveira">
  <meta institution="author" content="ITM">
</head>

<img src="https://maua.br/images/selo-60-anos-maua.svg" width=300, align="right">
<!-- <h1 align=left><font size = 6, style="color:rgb(200,0,0)"> optional title </font></h1> -->


# **Supervised Learning: Robust Linear Regression with `scikit-learn`**

In this lab:

1. Interaction terms
2. Scikit-learn
3. Outliers and robust models
4. Residuals*
5. Homoscedasticity and log transformation*
6. Categorical predictor variables*



In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import statsmodels.formula.api as sm

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Error Metrics

See [A Comprehensive Overview of Regression Evaluation Metrics](https://developer.nvidia.com/blog/a-comprehensive-overview-of-regression-evaluation-metrics/) and [Metrics and scoring: quantifying the quality of predictions](https://scikit-learn.org/stable/modules/model_evaluation.html).

In [22]:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

def calculate_metrics(y_true, y_pred):
    metrics = {}

    # Mean Squared Error (MSE)
    metrics['MSE'] = mean_squared_error(y_true, y_pred)

    # Root Mean Squared Error (RMSE)
    metrics['RMSE'] = np.sqrt(metrics['MSE'])

    # Mean Absolute Percentage Error (MAPE); undefined when any true value is zero
    if np.any(y_true == 0):
        metrics['MAPE'] = np.nan  # NaN keeps the numeric formatting below from failing
    else:
        metrics['MAPE'] = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    # Mean Absolute Error (MAE)
    metrics['MAE'] = mean_absolute_error(y_true, y_pred)

    # Symmetric Mean Absolute Percentage Error (sMAPE)
    metrics['sMAPE'] = 100/len(y_true) * np.sum(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))

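    # Mean Absolute Deviation (MAD) of y_true around its own mean.
    # It does not depend on y_pred: it is the MAE of always predicting the mean,
    # so it works as a baseline against which to read the MAE above.
    metrics['MAD'] = np.mean(np.abs(y_true - np.mean(y_true)))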

    for key, value in metrics.items():
        print(f"{key}: {value:.2f}")

    return metrics

# calculate_metrics(y_true, y_pred)
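
To make the arithmetic behind these metrics concrete, the sketch below computes four of them by hand with NumPy. The three true and predicted values are hypothetical, chosen only so the intermediate sums are easy to check. They are not taken from the housing dataset.

```python
import numpy as np

# Hypothetical toy values (not from the dataset)
y_true_toy = np.array([100.0, 200.0, 300.0])
y_pred_toy = np.array([110.0, 190.0, 330.0])

errors = y_true_toy - y_pred_toy                     # [-10, 10, -30]

mse = np.mean(errors ** 2)                           # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                                  # same unit as the target
mae = np.mean(np.abs(errors))                        # (10 + 10 + 30) / 3
mape = np.mean(np.abs(errors / y_true_toy)) * 100    # (0.10 + 0.05 + 0.10) / 3 * 100

print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}  MAE: {mae:.2f}  MAPE: {mape:.2f}")
```

The single large error (30) dominates the MSE and RMSE far more than the MAE, which is why the squared metrics are the more outlier-sensitive ones.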


# Case: **Estimating real estate prices**

Use simple and multiple regression models to estimate property prices.


In [24]:
df = pd.read_excel('http://meusite.mackenzie.br/rogerio/data_load/regressao_preco_imoveis.xlsx')
df = df.drop(columns='bairro')
df.head()

Unnamed: 0,areaM2,suites,dormitorios,banheiros,vagas,preco
0,32,1,1,1,1,490000
1,157,2,2,2,2,3180000
2,205,2,3,3,3,1900000
3,193,3,3,3,3,3565000
4,116,1,3,2,2,1605000


In [25]:
df_case = pd.DataFrame({'areaM2':[134], 'suites':[1], 'dormitorios':[4], 'vagas':[2]})
df_case


Unnamed: 0,areaM2,suites,dormitorios,vagas
0,134,1,4,2


# `Statsmodels`

In [26]:
lm = sm.ols(formula='preco ~ areaM2 + suites + dormitorios + vagas', data=df)
# lm = sm.ols(formula='preco ~ areaM2 + suites + dormitorios + vagas - 1', data=df)
lm = lm.fit()
print(lm.summary())
print()

calculate_metrics(df.preco, lm.predict(df.drop(columns='preco')))
print()

preco = lm.predict(df_case)
print()
print(f'Estimated price: {preco[0]:.2f}')

                            OLS Regression Results                            
Dep. Variable:                  preco   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.684
Method:                 Least Squares   F-statistic:                     2027.
Date:                Wed, 21 Aug 2024   Prob (F-statistic):               0.00
Time:                        20:26:10   Log-Likelihood:                -55099.
No. Observations:                3741   AIC:                         1.102e+05
Df Residuals:                    3736   BIC:                         1.102e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -3407.3121   4.04e+04     -0.084      

# `Scikit-learn`

In [28]:
X = df[['areaM2', 'suites', 'dormitorios', 'vagas']]
y = df['preco']

# model = LinearRegression(fit_intercept=False) # set fit_intercept to False
model = LinearRegression()
model.fit(X, y)

print("Coefficients: ", dict(zip(model.feature_names_in_, model.coef_)))
print("Intercept: ", model.intercept_)
print("Score (R2): ", model.score(X, y)) # R2
print()

# Evaluate the scikit-learn model itself (not the statsmodels `lm` fit)
calculate_metrics(y, model.predict(X))
print()

# Prediction
y_pred = model.predict(df_case)
print(f'Estimated price (scikit-learn): {y_pred[0]:.2f}')

Coefficients:  {'areaM2': 10468.461739070792, 'suites': 202771.70454037527, 'dormitorios': -311340.86561812006, 'vagas': 296751.59991048346}
Intercept:  -3407.312060851371
Score (R2):  0.6845689083685004

MSE: 363479068155.80
RMSE: 602892.25
MAPE: 27.16
MAE: 398507.18
sMAPE: 24.81
MAD: 773278.64

Estimated price (scikit-learn): 950278.00


# Q1. Heteroscedasticity and Log

Refit the previous regression model keeping only the significant coefficients and using the log of the values (for both the predictor variables and the target variable).

Predict the new case with the new model.
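
As a starting point for Q1, here is a minimal sketch of a log-log regression with `statsmodels`. It runs on synthetic data, not on the housing dataset: the `toy` frame, its coefficients, and the noise level are assumptions made for illustration. Which predictors to keep should come from the p-values of your own fit. The `sm` alias follows the import used earlier in this notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic data (assumption): price is a power of area with multiplicative noise
rng = np.random.default_rng(0)
area = rng.uniform(30, 300, size=500)
price = 5000 * area ** 1.1 * np.exp(rng.normal(0, 0.2, size=500))
toy = pd.DataFrame({'areaM2': area, 'preco': price})

# Log-transform both the predictor and the target as explicit columns
toy['log_areaM2'] = np.log(toy['areaM2'])
toy['log_preco'] = np.log(toy['preco'])

lm_log = sm.ols(formula='log_preco ~ log_areaM2', data=toy).fit()
print(lm_log.params)
print(lm_log.pvalues)  # inspect these to decide which coefficients are significant

# New case: apply the same log to the input, then undo it on the output
toy_case = pd.DataFrame({'log_areaM2': np.log([134])})
price_hat = np.exp(lm_log.predict(toy_case))
print(f'Estimated price: {price_hat[0]:.2f}')
```

Two caveats apply when moving this to the real data. The log is undefined at zero, so a count column that contains zeros needs another treatment (`np.log1p` is one option). Also, exponentiating a prediction made on the log scale gives an estimate closer to the conditional median than to the conditional mean of the price.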

# Q2. Interaction terms

Go back to the original values, without the log transformation. Build regression models that include the following interactions:

* dormitorios+vagas
* dormitorios+suites
* dormitorios+suites and areaM2+vagas

Keep only the significant coefficients.

What are the R2 and the prediction of the best model for the new case?
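
The formula syntax for interactions can be sketched as follows. In a `statsmodels` formula, `a:b` adds only the product term, while `a*b` expands to `a + b + a:b`. The data below is synthetic, with an interaction effect built in on purpose, so the numbers say nothing about the housing dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic data (assumption) with a true dormitorios x suites interaction
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    'dormitorios': rng.integers(1, 5, size=n),
    'suites': rng.integers(0, 4, size=n),
})
toy['preco'] = (100 + 50 * toy['dormitorios'] + 30 * toy['suites']
                + 20 * toy['dormitorios'] * toy['suites']
                + rng.normal(0, 10, size=n))

# Main effects only versus main effects plus interaction
lm_main = sm.ols('preco ~ dormitorios + suites', data=toy).fit()
lm_int = sm.ols('preco ~ dormitorios * suites', data=toy).fit()

print(f'R2 without interaction: {lm_main.rsquared:.3f}')
print(f'R2 with interaction:    {lm_int.rsquared:.3f}')
print(lm_int.pvalues)  # the interaction appears as 'dormitorios:suites'
```

Comparing `rsquared` (or `rsquared_adj`) across candidate formulas, together with the p-values, is one way to pick the best model for the question above.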

# Q3. Interaction terms with scikit-learn

Rebuild the best model from the previous exercise with scikit-learn.

# Q4. Robust models

Apply one of scikit-learn's robust linear regression models, using the same predictors as in exercise 1.
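
scikit-learn ships several robust linear estimators in `sklearn.linear_model`, including `HuberRegressor`, `RANSACRegressor`, and `TheilSenRegressor`. The sketch below compares ordinary least squares with `HuberRegressor` on synthetic data. The true line (slope 2, intercept 1) and the contamination scheme are assumptions chosen so the effect of the outliers is easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

# Synthetic data (assumption): y = 2x + 1 plus small Gaussian noise
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y_r = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=200)

# Shift the 20 points with the largest x upward so they tilt the OLS line
outliers = np.argsort(x)[-20:]
y_r[outliers] += 30.0

X_r = x.reshape(-1, 1)
ols = LinearRegression().fit(X_r, y_r)
huber = HuberRegressor(max_iter=1000).fit(X_r, y_r)

print(f'True slope:  2.00')
print(f'OLS slope:   {ols.coef_[0]:.2f}')
print(f'Huber slope: {huber.coef_[0]:.2f}')
```

The Huber loss is quadratic for small residuals and linear for large ones, so each outlier has a bounded pull on the fit. On the housing data, features with very different scales may need more iterations or prior scaling for the optimizer to converge.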

# Q5. Nonlinear models

Fit linear models to the data below to estimate `median_house_value`. Once you give up (lol), fit a `DecisionTreeRegressor` model. Compare the error metrics.

In [None]:
df = pd.read_csv('/content/sample_data/california_housing_train.csv')
df.drop(columns=['longitude', 'latitude'], inplace=True)
df.head()

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


In [None]:
df_case = pd.DataFrame(df.mean()).transpose()
df_case = df_case.drop(columns='median_house_value')
df_case


Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
0,28.589353,2643.664412,539.410824,1429.573941,501.221941,3.883578
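
For the comparison asked for in Q5, the sketch below contrasts `LinearRegression` with `DecisionTreeRegressor` on a synthetic nonlinear target. The sine-shaped data, `max_depth=5`, and the train/test split are assumptions for illustration. The California housing data will give different numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic nonlinear data (assumption): a sine wave plus noise
rng = np.random.default_rng(4)
X_nl = rng.uniform(-3, 3, size=(600, 1))
y_nl = np.sin(2 * X_nl[:, 0]) + rng.normal(0, 0.1, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_nl, y_nl, test_size=0.3, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Compare RMSE on held-out data
rmse_linear = np.sqrt(mean_squared_error(y_te, linear.predict(X_te)))
rmse_tree = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))
print(f'RMSE linear: {rmse_linear:.3f}')
print(f'RMSE tree:   {rmse_tree:.3f}')
```

The held-out split matters here because an unrestricted tree can reproduce its training data almost exactly, so metrics computed on the training set would flatter it. Limiting `max_depth` is one simple way to keep the tree from memorizing noise.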
