# EBAC - Regression II - multiple regression

## Task I

#### Income prediction II

We will keep working with the 'previsao_de_renda.csv' dataset, which is the dataset for your next project. We will apply the techniques covered so far to this dataset.

|variable|description|
|-|-|
|data_ref                | Reference date when the variables were collected |
|index                   | Customer identification code|
|sexo                    | Customer's sex|
|posse_de_veiculo        | Whether the customer owns a vehicle|
|posse_de_imovel         | Whether the customer owns real estate|
|qtd_filhos              | Number of children the customer has|
|tipo_renda              | Customer's type of income|
|educacao                | Customer's education level|
|estado_civil            | Customer's marital status|
|tipo_residencia         | Customer's type of residence (owned, rented, etc.)|
|idade                   | Customer's age|
|tempo_emprego           | Time in current job|
|qt_pessoas_residencia   | Number of people living in the residence|
|renda                   | Income in reais|

In [162]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

import numpy as np

import seaborn as sns

import statsmodels.formula.api as smf
import statsmodels.api as sm

import patsy


In [163]:
df = pd.read_csv(r'C:\Users\Gabriel\Documents\Data Science\Data Science EBAC\Módulo 13\previsao_de_renda.csv')

1. Split the dataset into training and test sets (25% for testing, 75% for training).
2. Run a *ridge* regularization with alpha = [0, 0.001, 0.005, 0.01, 0.05, 0.1] and evaluate the $R^2$ on the test set. Which model is the best?
3. Do the same as in step 2 with a *LASSO* regression. Which method gives the better result?
4. Run a *stepwise* model. Evaluate the $R^2$ on the test set. What is the best result?
5. Compare the parameters and assess any differences. Which model do you think is the best of all?
6. Starting from the models you fitted, try to improve the $R^2$ on the test set. Be creative and see whether you can add a transformation or combination of variables.
7. Fit a regression tree and see whether it gives a better $R^2$.

## Data preparation

In [164]:
df.head()

Unnamed: 0.1,Unnamed: 0,data_ref,id_cliente,sexo,posse_de_veiculo,posse_de_imovel,qtd_filhos,tipo_renda,educacao,estado_civil,tipo_residencia,idade,tempo_emprego,qt_pessoas_residencia,renda
0,0,2015-01-01,15056,F,False,True,0,Empresário,Secundário,Solteiro,Casa,26,6.60274,1.0,8060.34
1,1,2015-01-01,9968,M,True,True,0,Assalariado,Superior completo,Casado,Casa,28,7.183562,2.0,1852.15
2,2,2015-01-01,4312,F,True,True,0,Empresário,Superior completo,Casado,Casa,35,0.838356,2.0,2253.89
3,3,2015-01-01,10639,F,False,True,1,Servidor público,Superior completo,Casado,Casa,30,4.846575,3.0,6600.77
4,4,2015-01-01,7064,M,True,False,0,Assalariado,Secundário,Solteiro,Governamental,33,4.293151,1.0,6475.97


In [165]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             15000 non-null  int64  
 1   data_ref               15000 non-null  object 
 2   id_cliente             15000 non-null  int64  
 3   sexo                   15000 non-null  object 
 4   posse_de_veiculo       15000 non-null  bool   
 5   posse_de_imovel        15000 non-null  bool   
 6   qtd_filhos             15000 non-null  int64  
 7   tipo_renda             15000 non-null  object 
 8   educacao               15000 non-null  object 
 9   estado_civil           15000 non-null  object 
 10  tipo_residencia        15000 non-null  object 
 11  idade                  15000 non-null  int64  
 12  tempo_emprego          12427 non-null  float64
 13  qt_pessoas_residencia  15000 non-null  float64
 14  renda                  15000 non-null  float64
dtypes:

In [166]:
df.isnull().sum()

Unnamed: 0                  0
data_ref                    0
id_cliente                  0
sexo                        0
posse_de_veiculo            0
posse_de_imovel             0
qtd_filhos                  0
tipo_renda                  0
educacao                    0
estado_civil                0
tipo_residencia             0
idade                       0
tempo_emprego            2573
qt_pessoas_residencia       0
renda                       0
dtype: int64

#### The employment time column has missing values, so I decided to treat a missing employment time as 0

In [167]:
df.fillna({'tempo_emprego': 0}, inplace=True)
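
Filling with 0 makes a missing employment time indistinguishable from a true zero. A minimal sketch of one possible variation, on a toy frame with made-up values: record a flag before filling. The column name `tempo_emprego_missing` is hypothetical and is not part of the dataset or of this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({"tempo_emprego": [6.6, np.nan, 0.8, np.nan]})

# Hypothetical flag column: 1 where the employment time was missing, 0 otherwise.
toy["tempo_emprego_missing"] = toy["tempo_emprego"].isna().astype(int)

# Same zero-fill as in the notebook, written without inplace.
toy = toy.fillna({"tempo_emprego": 0})

print(toy)
```

Whether such a flag helps the model is an open question that would have to be checked on the test set.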

In [168]:
df.isnull().sum()

Unnamed: 0               0
data_ref                 0
id_cliente               0
sexo                     0
posse_de_veiculo         0
posse_de_imovel          0
qtd_filhos               0
tipo_renda               0
educacao                 0
estado_civil             0
tipo_residencia          0
idade                    0
tempo_emprego            0
qt_pessoas_residencia    0
renda                    0
dtype: int64

#### Removing unnecessary variables

In [169]:
df.drop(columns = ['data_ref' , 'Unnamed: 0' , 'id_cliente'] , inplace = True)

In [170]:
df.head()

Unnamed: 0,sexo,posse_de_veiculo,posse_de_imovel,qtd_filhos,tipo_renda,educacao,estado_civil,tipo_residencia,idade,tempo_emprego,qt_pessoas_residencia,renda
0,F,False,True,0,Empresário,Secundário,Solteiro,Casa,26,6.60274,1.0,8060.34
1,M,True,True,0,Assalariado,Superior completo,Casado,Casa,28,7.183562,2.0,1852.15
2,F,True,True,0,Empresário,Superior completo,Casado,Casa,35,0.838356,2.0,2253.89
3,F,False,True,1,Servidor público,Superior completo,Casado,Casa,30,4.846575,3.0,6600.77
4,M,True,False,0,Assalariado,Secundário,Solteiro,Governamental,33,4.293151,1.0,6475.97


### Splitting the dataset

In [171]:
y , X = patsy.dmatrices( 'np.log(renda) ~ C(sexo) + C(posse_de_veiculo) \
+ C(posse_de_imovel) + C(tipo_renda) \
+ C(educacao) + C(estado_civil) \
+ C(tipo_residencia) + qt_pessoas_residencia \
+ qtd_filhos + tempo_emprego + idade' , data = df)

In [172]:
X_train , X_test , y_train , y_test = train_test_split(X , y , test_size = 0.25 , random_state = 100)
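
As a quick sanity check of the 75/25 split, here is a minimal sketch on synthetic arrays with the same number of rows as the dataset (15000). The arrays `X_demo` and `y_demo` are placeholders, not the real design matrices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same row count as the dataset (15000 rows).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(15000, 3))
y_demo = rng.normal(size=15000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=100
)

# 75% of the rows go to training and 25% to testing.
print(X_tr.shape, X_te.shape)
```

The resulting row counts, 11250 and 3750, are the same as the `No. Observations` values reported in the regression summaries.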

### Fitting a Ridge-type model

In [173]:
modelo = sm.OLS(y_train , X_train)

In [174]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 0.01
                         , alpha = 0)

In [175]:
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.349
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,240.3
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:27,Log-Likelihood:,-12178.0
No. Observations:,11250,AIC:,24410.0
Df Residuals:,11225,BIC:,24600.0
Df Model:,25,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,6.7550,0.235,28.703,0.000,6.294,7.216
var_1,0.7975,0.016,50.029,0.000,0.766,0.829
var_2,0.0282,0.015,1.869,0.062,-0.001,0.058
var_3,0.0904,0.015,6.069,0.000,0.061,0.120
var_4,0.1839,0.293,0.629,0.530,-0.390,0.757
var_5,0.1474,0.017,8.631,0.000,0.114,0.181
var_6,0.3097,0.028,11.246,0.000,0.256,0.364
var_7,0.0742,0.025,2.931,0.003,0.025,0.124
var_8,0.0841,0.176,0.476,0.634,-0.262,0.430

0,1,2,3
Omnibus:,0.107,Durbin-Watson:,2.023
Prob(Omnibus):,0.948,Jarque-Bera (JB):,0.087
Skew:,0.0,Prob(JB):,0.957
Kurtosis:,3.014,Cond. No.,2250.0


In [176]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 0.01
                         , alpha = 0.001)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.349
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,250.3
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:28,Log-Likelihood:,-12178.0
No. Observations:,11250,AIC:,24410.0
Df Residuals:,11226,BIC:,24590.0
Df Model:,24,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,6.7656,0.234,28.879,0.000,6.306,7.225
var_1,0.7973,0.016,50.032,0.000,0.766,0.829
var_2,0.0284,0.015,1.882,0.060,-0.001,0.058
var_3,0.0905,0.015,6.078,0.000,0.061,0.120
var_4,0.1840,0.293,0.629,0.529,-0.389,0.757
var_5,0.1477,0.017,8.653,0.000,0.114,0.181
var_6,0.3094,0.028,11.238,0.000,0.255,0.363
var_7,0.0742,0.025,2.931,0.003,0.025,0.124
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.113,Durbin-Watson:,2.023
Prob(Omnibus):,0.945,Jarque-Bera (JB):,0.092
Skew:,0.0,Prob(JB):,0.955
Kurtosis:,3.014,Cond. No.,2250.0


In [177]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 0.01
                         , alpha = 0.005)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.348
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,315.5
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:28,Log-Likelihood:,-12183.0
No. Observations:,11250,AIC:,24410.0
Df Residuals:,11231,BIC:,24550.0
Df Model:,19,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,7.2094,0.079,91.782,0.000,7.055,7.363
var_1,0.7976,0.016,50.151,0.000,0.766,0.829
var_2,0.0293,0.015,1.944,0.052,-0.000,0.059
var_3,0.0909,0.015,6.107,0.000,0.062,0.120
var_4,0.1895,0.293,0.648,0.517,-0.384,0.763
var_5,0.1468,0.017,8.609,0.000,0.113,0.180
var_6,0.3094,0.028,11.246,0.000,0.255,0.363
var_7,0.0737,0.025,2.915,0.004,0.024,0.123
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.175,Durbin-Watson:,2.023
Prob(Omnibus):,0.916,Jarque-Bera (JB):,0.15
Skew:,-0.002,Prob(JB):,0.928
Kurtosis:,3.017,Cond. No.,2250.0


In [178]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 0.01
                         , alpha = 0.01)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.348
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,315.2
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:28,Log-Likelihood:,-12185.0
No. Observations:,11250,AIC:,24410.0
Df Residuals:,11231,BIC:,24560.0
Df Model:,19,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,7.2193,0.100,72.191,0.000,7.023,7.415
var_1,0.8073,0.015,53.562,0.000,0.778,0.837
var_2,0,0,,,0,0
var_3,0.0917,0.015,6.156,0.000,0.063,0.121
var_4,0.1768,0.293,0.604,0.546,-0.397,0.750
var_5,0.1475,0.017,8.647,0.000,0.114,0.181
var_6,0.3070,0.028,11.162,0.000,0.253,0.361
var_7,0.0730,0.025,2.886,0.004,0.023,0.123
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.136,Durbin-Watson:,2.024
Prob(Omnibus):,0.934,Jarque-Bera (JB):,0.114
Skew:,-0.001,Prob(JB):,0.945
Kurtosis:,3.015,Cond. No.,2250.0


In [179]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 0.01
                         , alpha = 0.05)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.348
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,300.3
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:29,Log-Likelihood:,-12179.0
No. Observations:,11250,AIC:,24400.0
Df Residuals:,11230,BIC:,24550.0
Df Model:,20,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,6.7263,0.227,29.593,0.000,6.281,7.172
var_1,0.7976,0.016,50.197,0.000,0.766,0.829
var_2,0.0278,0.015,1.840,0.066,-0.002,0.057
var_3,0.0910,0.015,6.112,0.000,0.062,0.120
var_4,0,0,,,0,0
var_5,0.1479,0.017,8.670,0.000,0.114,0.181
var_6,0.3100,0.027,11.272,0.000,0.256,0.364
var_7,0.0735,0.025,2.908,0.004,0.024,0.123
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.095,Durbin-Watson:,2.023
Prob(Omnibus):,0.954,Jarque-Bera (JB):,0.076
Skew:,0.0,Prob(JB):,0.963
Kurtosis:,3.013,Cond. No.,2250.0


In [180]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 0.01
                         , alpha = 0.1)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.348
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,286.0
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:29,Log-Likelihood:,-12179.0
No. Observations:,11250,AIC:,24400.0
Df Residuals:,11229,BIC:,24560.0
Df Model:,21,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,6.7215,0.231,29.058,0.000,6.268,7.175
var_1,0.7978,0.016,50.098,0.000,0.767,0.829
var_2,0.0278,0.015,1.840,0.066,-0.002,0.057
var_3,0.0909,0.015,6.103,0.000,0.062,0.120
var_4,0,0,,,0,0
var_5,0.1478,0.017,8.669,0.000,0.114,0.181
var_6,0.3099,0.028,11.270,0.000,0.256,0.364
var_7,0.0736,0.025,2.910,0.004,0.024,0.123
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.092,Durbin-Watson:,2.023
Prob(Omnibus):,0.955,Jarque-Bera (JB):,0.073
Skew:,0.0,Prob(JB):,0.964
Kurtosis:,3.012,Cond. No.,2250.0


In [181]:
modelo_test = sm.OLS(y_test , X_test)
reg = modelo_test.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 0.1
                         , alpha = 0.001)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.351
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,87.72
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:29,Log-Likelihood:,-4023.9
No. Observations:,3750,AIC:,8096.0
Df Residuals:,3727,BIC:,8245.0
Df Model:,23,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,6.5835,0.837,7.862,0.000,4.942,8.225
var_1,0.7887,0.027,28.779,0.000,0.735,0.842
var_2,0.0545,0.026,2.104,0.035,0.004,0.105
var_3,0.0881,0.025,3.462,0.001,0.038,0.138
var_4,0.2454,0.411,0.597,0.551,-0.560,1.051
var_5,0.1717,0.029,5.979,0.000,0.115,0.228
var_6,0.2020,0.047,4.313,0.000,0.110,0.294
var_7,0,0,,,0,0
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,1.561,Durbin-Watson:,1.937
Prob(Omnibus):,0.458,Jarque-Bera (JB):,1.496
Skew:,0.039,Prob(JB):,0.473
Kurtosis:,3.06,Cond. No.,4900.0


## Fitting a Lasso-type model

In [182]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 1
                         , alpha = 0)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.349
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,240.3
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:30,Log-Likelihood:,-12178.0
No. Observations:,11250,AIC:,24410.0
Df Residuals:,11225,BIC:,24600.0
Df Model:,25,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,6.7550,0.235,28.703,0.000,6.294,7.216
var_1,0.7975,0.016,50.029,0.000,0.766,0.829
var_2,0.0282,0.015,1.869,0.062,-0.001,0.058
var_3,0.0904,0.015,6.069,0.000,0.061,0.120
var_4,0.1839,0.293,0.629,0.530,-0.390,0.757
var_5,0.1474,0.017,8.631,0.000,0.114,0.181
var_6,0.3097,0.028,11.246,0.000,0.256,0.364
var_7,0.0742,0.025,2.931,0.003,0.025,0.124
var_8,0.0841,0.176,0.476,0.634,-0.262,0.430

0,1,2,3
Omnibus:,0.107,Durbin-Watson:,2.023
Prob(Omnibus):,0.948,Jarque-Bera (JB):,0.087
Skew:,0.0,Prob(JB):,0.957
Kurtosis:,3.014,Cond. No.,2250.0


In [183]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 1
                         , alpha = 0.001)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.348
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,374.0
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:30,Log-Likelihood:,-12187.0
No. Observations:,11250,AIC:,24410.0
Df Residuals:,11234,BIC:,24530.0
Df Model:,16,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,7.2446,0.081,89.948,0.000,7.087,7.403
var_1,0.7934,0.016,50.126,0.000,0.762,0.824
var_2,0.0260,0.015,1.726,0.084,-0.004,0.056
var_3,0.0898,0.015,6.035,0.000,0.061,0.119
var_4,0,0,,,0,0
var_5,0.1368,0.017,8.200,0.000,0.104,0.170
var_6,0.3034,0.027,11.095,0.000,0.250,0.357
var_7,0,0,,,0,0
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.207,Durbin-Watson:,2.023
Prob(Omnibus):,0.902,Jarque-Bera (JB):,0.18
Skew:,-0.002,Prob(JB):,0.914
Kurtosis:,3.019,Cond. No.,2250.0


In [184]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 1
                         , alpha = 0.005)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.342
Model:,OLS,Adj. R-squared:,0.342
Method:,Least Squares,F-statistic:,730.5
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:30,Log-Likelihood:,-12234.0
No. Observations:,11250,AIC:,24490.0
Df Residuals:,11242,BIC:,24550.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,7.2627,0.043,169.898,0.000,7.179,7.346
var_1,0.7988,0.015,53.306,0.000,0.769,0.828
var_2,0,0,,,0,0
var_3,0.0903,0.015,6.153,0.000,0.062,0.119
var_4,0,0,,,0,0
var_5,0,0,,,0,0
var_6,0.2533,0.027,9.425,0.000,0.201,0.306
var_7,0,0,,,0,0
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.042,Durbin-Watson:,2.022
Prob(Omnibus):,0.979,Jarque-Bera (JB):,0.031
Skew:,-0.003,Prob(JB):,0.984
Kurtosis:,3.006,Cond. No.,2250.0


In [185]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 1
                         , alpha = 0.01)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.334
Model:,OLS,Adj. R-squared:,0.334
Method:,Least Squares,F-statistic:,1128.0
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:30,Log-Likelihood:,-12302.0
No. Observations:,11250,AIC:,24620.0
Df Residuals:,11245,BIC:,24660.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,7.2469,0.031,235.073,0.000,7.186,7.307
var_1,0.7867,0.015,52.675,0.000,0.757,0.816
var_2,0,0,,,0,0
var_3,0,0,,,0,0
var_4,0,0,,,0,0
var_5,0,0,,,0,0
var_6,0,0,,,0,0
var_7,0,0,,,0,0
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.033,Durbin-Watson:,2.021
Prob(Omnibus):,0.984,Jarque-Bera (JB):,0.026
Skew:,-0.003,Prob(JB):,0.987
Kurtosis:,3.004,Cond. No.,2250.0


In [186]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 1
                         , alpha = 0.05)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.331
Model:,OLS,Adj. R-squared:,0.33
Method:,Least Squares,F-statistic:,1389.0
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:30,Log-Likelihood:,-12331.0
No. Observations:,11250,AIC:,24670.0
Df Residuals:,11246,BIC:,24710.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,7.2203,0.031,235.161,0.000,7.160,7.280
var_1,0.7837,0.015,52.362,0.000,0.754,0.813
var_2,0,0,,,0,0
var_3,0,0,,,0,0
var_4,0,0,,,0,0
var_5,0,0,,,0,0
var_6,0,0,,,0,0
var_7,0,0,,,0,0
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.039,Durbin-Watson:,2.022
Prob(Omnibus):,0.98,Jarque-Bera (JB):,0.028
Skew:,-0.002,Prob(JB):,0.986
Kurtosis:,3.007,Cond. No.,2250.0


In [187]:
reg = modelo.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 1
                         , alpha = 0.1)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.331
Model:,OLS,Adj. R-squared:,0.331
Method:,Least Squares,F-statistic:,1113.0
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,21:00:31,Log-Likelihood:,-12327.0
No. Observations:,11250,AIC:,24670.0
Df Residuals:,11245,BIC:,24710.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,7.1466,0.040,177.804,0.000,7.068,7.225
var_1,0.7796,0.015,51.871,0.000,0.750,0.809
var_2,0,0,,,0,0
var_3,0,0,,,0,0
var_4,0,0,,,0,0
var_5,0,0,,,0,0
var_6,0,0,,,0,0
var_7,0,0,,,0,0
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,0.068,Durbin-Watson:,2.022
Prob(Omnibus):,0.966,Jarque-Bera (JB):,0.054
Skew:,-0.003,Prob(JB):,0.973
Kurtosis:,3.009,Cond. No.,2250.0


In [141]:
modelo_test = sm.OLS(y_test , X_test)
reg = modelo_test.fit_regularized(method = 'elastic_net' 
                         , refit = True
                         , L1_wt = 1
                         , alpha = 0.001)
reg.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.351
Model:,OLS,Adj. R-squared:,0.348
Method:,Least Squares,F-statistic:,126.1
Date:,"Mon, 03 Jun 2024",Prob (F-statistic):,0.0
Time:,20:55:33,Log-Likelihood:,-4025.2
No. Observations:,3750,AIC:,8084.0
Df Residuals:,3734,BIC:,8190.0
Df Model:,16,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
var_0,7.2298,0.105,69.178,0.000,7.025,7.435
var_1,0.7870,0.027,28.874,0.000,0.734,0.840
var_2,0.0545,0.026,2.109,0.035,0.004,0.105
var_3,0.0890,0.025,3.540,0.000,0.040,0.138
var_4,0,0,,,0,0
var_5,0.1719,0.029,6.000,0.000,0.116,0.228
var_6,0.2014,0.046,4.333,0.000,0.110,0.293
var_7,0,0,,,0,0
var_8,0,0,,,0,0

0,1,2,3
Omnibus:,1.579,Durbin-Watson:,1.939
Prob(Omnibus):,0.454,Jarque-Bera (JB):,1.514
Skew:,0.04,Prob(JB):,0.469
Kurtosis:,3.057,Cond. No.,4900.0


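The lasso model was the best model: refit on the test set with alpha = 0.001, it reaches an adjusted $R^2$ of 0.348, against 0.347 for the ridge-type fit (both have $R^2$ = 0.351).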

## Stepwise model

In [215]:
y, X = patsy.dmatrices(
    'np.log(renda) ~ C(sexo) + C(posse_de_veiculo) + C(posse_de_imovel) + C(tipo_renda) + C(educacao) + C(estado_civil) + C(tipo_residencia) + qt_pessoas_residencia + qtd_filhos + tempo_emprego + idade', 
    data=df, return_type='dataframe'
)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=100)

# Make sure the indices are aligned
y_train = y_train.reset_index(drop=True)
X_train = X_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)


In [216]:
def stepwise_selection(X, y, initial_list=None, threshold_in=0.05, threshold_out=0.05, verbose=True):
    # Forward-backward selection driven by OLS p-values
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # forward step
        excluded = [col for col in X.columns if col not in included]
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()
        if worst_pval > threshold_out:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

In [217]:
# Apply stepwise selection
selected_features = stepwise_selection(X_test, y_test)

# Check the selected features
print("Selected features:", selected_features)

# Fit the final model with the selected features
X_test_selected = X_test[selected_features]
final_model = sm.OLS(y_test, sm.add_constant(X_test_selected)).fit()

# Get the summary of the final model
print(final_model.summary())

Add  Intercept                      with p-value 0.0
Add  tempo_emprego                  with p-value 3.3784e-149
Add  C(sexo)[T.M]                   with p-value 2.19166e-161
Add  idade                          with p-value 1.56569e-21
Add  C(educacao)[T.Superior completo] with p-value 1.84562e-11
Add  C(tipo_renda)[T.Empresário]    with p-value 1.85987e-07
Add  C(tipo_renda)[T.Pensionista]   with p-value 1.85314e-05
Add  C(posse_de_imovel)[T.1]        with p-value 0.000364692
Add  C(tipo_residencia)[T.Estúdio]  with p-value 0.0120188
Add  C(posse_de_veiculo)[T.1]       with p-value 0.0348263
Selected features: ['Intercept', 'tempo_emprego', 'C(sexo)[T.M]', 'idade', 'C(educacao)[T.Superior completo]', 'C(tipo_renda)[T.Empresário]', 'C(tipo_renda)[T.Pensionista]', 'C(posse_de_imovel)[T.1]', 'C(tipo_residencia)[T.Estúdio]', 'C(posse_de_veiculo)[T.1]']
                            OLS Regression Results                            
Dep. Variable:          np.log(renda)   R-squared:        

The stepwise model was the best model.

In [218]:
y, X = patsy.dmatrices(
    'np.log(renda) ~ C(sexo) + C(posse_de_veiculo) + C(posse_de_imovel) + C(tipo_renda) + C(educacao) + C(estado_civil) + C(tipo_residencia) + qt_pessoas_residencia + qtd_filhos + tempo_emprego + np.power(idade , 2)', 
    data=df, return_type='dataframe'
)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=100)

# Make sure the indices are aligned
y_train = y_train.reset_index(drop=True)
X_train = X_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

In [220]:
# Apply stepwise selection
selected_features = stepwise_selection(X_test, y_test)

# Check the selected features
print("Selected features:", selected_features)

# Fit the final model with the selected features
X_test_selected = X_test[selected_features]
final_model = sm.OLS(y_test, sm.add_constant(X_test_selected)).fit()

# Get the summary of the final model
print(final_model.summary())

Add  Intercept                      with p-value 0.0
Add  tempo_emprego                  with p-value 3.3784e-149
Add  C(sexo)[T.M]                   with p-value 2.19166e-161
Add  np.power(idade, 2)             with p-value 5.91791e-21
Add  C(educacao)[T.Superior completo] with p-value 3.24319e-11
Add  C(tipo_renda)[T.Empresário]    with p-value 1.20426e-07
Add  C(tipo_renda)[T.Pensionista]   with p-value 5.96507e-05
Add  C(posse_de_imovel)[T.1]        with p-value 0.000288941
Add  C(tipo_residencia)[T.Estúdio]  with p-value 0.0109267
Add  C(posse_de_veiculo)[T.1]       with p-value 0.0327578
Selected features: ['Intercept', 'tempo_emprego', 'C(sexo)[T.M]', 'np.power(idade, 2)', 'C(educacao)[T.Superior completo]', 'C(tipo_renda)[T.Empresário]', 'C(tipo_renda)[T.Pensionista]', 'C(posse_de_imovel)[T.1]', 'C(tipo_residencia)[T.Estúdio]', 'C(posse_de_veiculo)[T.1]']
                            OLS Regression Results                            
Dep. Variable:          np.log(renda)   R-squ

In [227]:
y, X = patsy.dmatrices(
    'np.log(renda) ~ C(sexo) + C(posse_de_veiculo) + C(posse_de_imovel) + C(tipo_renda) + C(educacao) + C(estado_civil) + C(tipo_residencia) + qt_pessoas_residencia + qtd_filhos + np.log(tempo_emprego + 0.01) + idade', 
    data=df, return_type='dataframe'
)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=100)

# Make sure the indices are aligned
y_train = y_train.reset_index(drop=True)
X_train = X_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)


In [228]:
# Apply stepwise selection
selected_features = stepwise_selection(X_test, y_test)

# Check the selected features
print("Selected features:", selected_features)

# Fit the final model with the selected features
X_test_selected = X_test[selected_features]
final_model = sm.OLS(y_test, sm.add_constant(X_test_selected)).fit()

# Get the summary of the final model
print(final_model.summary())

Add  Intercept                      with p-value 0.0
Add  C(sexo)[T.M]                   with p-value 7.13317e-126
Add  np.log(tempo_emprego + 0.01)   with p-value 3.09292e-44
Add  C(tipo_renda)[T.Pensionista]   with p-value 2.11448e-106
Add  np.power(idade, 2)             with p-value 2.75364e-18
Add  C(educacao)[T.Superior completo] with p-value 2.1269e-10
Add  C(tipo_renda)[T.Empresário]    with p-value 3.16201e-06
Add  C(posse_de_imovel)[T.1]        with p-value 0.00142689
Add  C(tipo_residencia)[T.Estúdio]  with p-value 0.0361533
Selected features: ['Intercept', 'C(sexo)[T.M]', 'np.log(tempo_emprego + 0.01)', 'C(tipo_renda)[T.Pensionista]', 'np.power(idade, 2)', 'C(educacao)[T.Superior completo]', 'C(tipo_renda)[T.Empresário]', 'C(posse_de_imovel)[T.1]', 'C(tipo_residencia)[T.Estúdio]']
                            OLS Regression Results                            
Dep. Variable:          np.log(renda)   R-squared:                       0.311
Model:                            OLS  

## Regression tree

In [253]:
reg = DecisionTreeRegressor(random_state = 21)
reg.fit(X_train , y_train)

In [256]:
y_train_pred = reg.predict(X_train)
y_test_pred = reg.predict(X_test)

In [265]:
r2_train_score = r2_score(y_train , y_train_pred)
r2_test_score = r2_score(y_test , y_test_pred)

print('R2 of the tree on the training set: {:.3f}'.format(r2_train_score))
print('R2 of the tree on the test set: {:.3f}'.format(r2_test_score))

R2 of the tree on the training set: 0.740
R2 of the tree on the test set: 0.311
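
The gap between the training and test scores suggests that the unconstrained tree may be overfitting. A minimal sketch, on synthetic data, of comparing the two scores across a few `max_depth` values; the depth grid and the data are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the income data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=1.0, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=100
)

tree_scores = {}
for depth in [2, 4, 6, 8, None]:  # None means no depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=21)
    tree.fit(X_tr, y_tr)
    # Store (training R2, test R2) for this depth.
    tree_scores[depth] = (
        r2_score(y_tr, tree.predict(X_tr)),
        r2_score(y_te, tree.predict(X_te)),
    )

for depth, (r2_tr, r2_te) in tree_scores.items():
    print(f"max_depth={depth}: train R2 = {r2_tr:.3f}, test R2 = {r2_te:.3f}")
```

With the real data, the depth with the highest test $R^2$ would be the natural candidate to compare against the linear models.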
