# EBAC - Regression II - multiple regression

## Task I

#### Income prediction

We will work with the 'previsao_de_renda.csv' dataset, which is the dataset for your next project. We will apply the techniques covered so far to this dataset.

|variable|description|
|-|-|
|data_ref                | Reference date on which the variables were collected |
|index                   | Customer identification code|
|sexo                    | Customer's sex|
|posse_de_veiculo        | Whether the customer owns a vehicle|
|posse_de_imovel         | Whether the customer owns real estate|
|qtd_filhos              | Number of children the customer has|
|tipo_renda              | Customer's type of income|
|educacao                | Customer's education level|
|estado_civil            | Customer's marital status|
|tipo_residencia         | Customer's type of residence (owned, rented, etc.)|
|idade                   | Customer's age|
|tempo_emprego           | Time in the current job|
|qt_pessoas_residencia   | Number of people living in the residence|
|renda                   | Income in reais|

In [104]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.formula.api as smf
import statsmodels.api as sm
import patsy

%matplotlib inline

In [105]:
df = pd.read_csv('previsao_de_renda.csv')

In [None]:
df.info()

In [113]:
df_renda = df.drop(columns=['Unnamed: 0', 'data_ref', 'id_cliente']).dropna()
df_renda.head(2)

| |sexo|posse_de_veiculo|posse_de_imovel|qtd_filhos|tipo_renda|educacao|estado_civil|tipo_residencia|idade|tempo_emprego|qt_pessoas_residencia|renda|
|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|F|False|True|0|Empresário|Secundário|Solteiro|Casa|26|6.60274|1.0|8060.34|
|1|M|True|True|0|Assalariado|Superior completo|Casado|Casa|28|7.183562|2.0|1852.15|


1. Fit a model to predict log(renda) using all available covariates.
    - Using Patsy, encode the qualitative variables as *dummies*.
    - Always keep the most frequent category as the reference level.
    - Evaluate the parameters and check whether they make practical sense.


2. Remove the least significant variable and analyze:
    - Look at the indicators we have seen and assess whether, in your opinion, the model improved or got worse.
    - Look at the parameters and check whether any of them changed substantially.


3. Keep removing the least significant variables, whenever the *p-value* is less than 5%. Compare the final model with the initial one. Look at the indicators and conclude whether the model seems better.
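
The dummy-coding requirement in item 1 can be sketched as follows. This is a minimal illustration, not the notebook's own method: the helper name `formula_with_modal_reference` is hypothetical, and the `toy` frame is made up for the example. It relies on Patsy's `Treatment(reference=...)` accepting a level name, so the reference level does not have to be looked up by position.

```python
import pandas as pd


def formula_with_modal_reference(data, response, categorical, numeric):
    """Build a Patsy formula whose categorical terms use the most
    frequent level of each column as the reference level."""
    terms = []
    for col in categorical:
        # Most frequent level of the column (hypothetical helper logic)
        modal_level = data[col].value_counts().idxmax()
        terms.append(f"C({col}, Treatment(reference={modal_level!r}))")
    terms.extend(numeric)
    return f"{response} ~ " + " + ".join(terms)


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "educacao": ["Secundário", "Secundário", "Superior completo", "Primário"],
    "idade": [26, 28, 35, 41],
    "renda": [8060.34, 1852.15, 2253.89, 6600.77],
})

formula = formula_with_modal_reference(toy, "np.log(renda)", ["educacao"], ["idade"])
print(formula)
```

The resulting string can be passed to `patsy.dmatrices(formula, data=...)` as long as `np` is imported in the calling scope.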
    

In [114]:
#1)
tipo = 'np.log(renda) ~ C(sexo) + posse_de_veiculo + qtd_filhos + C(tipo_renda) + C(educacao) + C(estado_civil) + C(tipo_residencia) + idade + tempo_emprego + qt_pessoas_residencia'
y,X = patsy.dmatrices(tipo, data=df_renda)
X

DesignMatrix with shape (12427, 24)
  Columns:
    ['Intercept',
     'C(sexo)[T.M]',
     'posse_de_veiculo[T.True]',
     'C(tipo_renda)[T.Bolsista]',
     'C(tipo_renda)[T.Empresário]',
     'C(tipo_renda)[T.Pensionista]',
     'C(tipo_renda)[T.Servidor público]',
     'C(educacao)[T.Pós graduação]',
     'C(educacao)[T.Secundário]',
     'C(educacao)[T.Superior completo]',
     'C(educacao)[T.Superior incompleto]',
     'C(estado_civil)[T.Separado]',
     'C(estado_civil)[T.Solteiro]',
     'C(estado_civil)[T.União]',
     'C(estado_civil)[T.Viúvo]',
     'C(tipo_residencia)[T.Casa]',
     'C(tipo_residencia)[T.Com os pais]',
     'C(tipo_residencia)[T.Comunitário]',
     'C(tipo_residencia)[T.Estúdio]',
     'C(tipo_residencia)[T.Governamental]',
     'qtd_filhos',
     'idade',
     'tempo_emprego',
     'qt_pessoas_residencia']
  Terms:
    'Intercept' (column 0)
    'C(sexo)' (column 1)
    'posse_de_veiculo' (column 2)
    'C(tipo_renda)' (columns 3:7)
    'C(educacao)' (colum

In [115]:
reg = sm.OLS(y, X).fit()
reg.summary()

| | | | |
|-|-|-|-|
|Dep. Variable:|np.log(renda)|R-squared:|0.356|
|Model:|OLS|Adj. R-squared:|0.354|
|Method:|Least Squares|F-statistic:|297.6|
|Date:|Sun, 19 Jan 2025|Prob (F-statistic):|0.0|
|Time:|15:47:10|Log-Likelihood:|-13585.0|
|No. Observations:|12427|AIC:|27220.0|
|Df Residuals:|12403|BIC:|27400.0|
|Df Model:|23| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.6093|0.235|28.076|0.000|6.148|7.071|
|C(sexo)[T.M]|0.7845|0.015|53.479|0.000|0.756|0.813|
|posse_de_veiculo[T.True]|0.0463|0.014|3.272|0.001|0.019|0.074|
|C(tipo_renda)[T.Bolsista]|0.2434|0.241|1.008|0.313|-0.230|0.717|
|C(tipo_renda)[T.Empresário]|0.1555|0.015|10.399|0.000|0.126|0.185|
|C(tipo_renda)[T.Pensionista]|-0.3299|0.242|-1.366|0.172|-0.803|0.144|
|C(tipo_renda)[T.Servidor público]|0.0560|0.022|2.515|0.012|0.012|0.100|
|C(educacao)[T.Pós graduação]|0.1255|0.159|0.788|0.431|-0.187|0.437|
|C(educacao)[T.Secundário]|-0.0193|0.072|-0.268|0.789|-0.161|0.122|

| | | | |
|-|-|-|-|
|Omnibus:|0.5|Durbin-Watson:|2.023|
|Prob(Omnibus):|0.779|Jarque-Bera (JB):|0.489|
|Skew:|0.015|Prob(JB):|0.783|
|Kurtosis:|3.006|Cond. No.|2180.0|
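
Because the response is log(renda), each coefficient is read as a multiplicative effect on income: a coefficient `b` corresponds to a change of `exp(b) - 1` in `renda`, holding the other covariates fixed. The sketch below applies this to two coefficients from the table above; the helper name `log_coef_to_pct` is made up for illustration.

```python
import math


def log_coef_to_pct(coef):
    """Percent change in the response implied by a coefficient
    estimated on the log of the response."""
    return (math.exp(coef) - 1.0) * 100.0


# Coefficients copied from the summary above
sexo_m = log_coef_to_pct(0.7845)   # C(sexo)[T.M]
veiculo = log_coef_to_pct(0.0463)  # posse_de_veiculo[T.True]

print(f"C(sexo)[T.M]: {sexo_m:.1f}%")
print(f"posse_de_veiculo[T.True]: {veiculo:.1f}%")
```

Read this way, the `C(sexo)[T.M]` coefficient corresponds to an income roughly 119% higher than the reference category, which is a useful check on whether the parameters make practical sense.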


In [116]:
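# Most frequent value of each column, used to pick the reference level of each categorical variable
print(df_renda.apply(lambda x: x.value_counts().idxmax()))
# Treatment(k) makes the k-th level (in sorted order) the reference: Assalariado, Secundário, Casado, Casa
tipo = 'np.log(renda) ~ C(sexo) + posse_de_veiculo + qtd_filhos + C(tipo_renda, Treatment(0)) + C(educacao,Treatment(2)) + C(estado_civil,Treatment(0)) + C(tipo_residencia,Treatment(1)) + idade + tempo_emprego + qt_pessoas_residencia'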
y,X = patsy.dmatrices(tipo, data=df_renda)
reg2 = sm.OLS(y, X).fit()
reg2.summary()



sexo                               F
posse_de_veiculo               False
posse_de_imovel                 True
qtd_filhos                         0
tipo_renda               Assalariado
educacao                  Secundário
estado_civil                  Casado
tipo_residencia                 Casa
idade                             40
tempo_emprego               4.216438
qt_pessoas_residencia            2.0
renda                        9826.31
dtype: object


| | | | |
|-|-|-|-|
|Dep. Variable:|np.log(renda)|R-squared:|0.356|
|Model:|OLS|Adj. R-squared:|0.354|
|Method:|Least Squares|F-statistic:|297.6|
|Date:|Sun, 19 Jan 2025|Prob (F-statistic):|0.0|
|Time:|15:47:14|Log-Likelihood:|-13585.0|
|No. Observations:|12427|AIC:|27220.0|
|Df Residuals:|12403|BIC:|27400.0|
|Df Model:|23| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.5556|0.219|29.953|0.000|6.127|6.985|
|C(sexo)[T.M]|0.7845|0.015|53.479|0.000|0.756|0.813|
|posse_de_veiculo[T.True]|0.0463|0.014|3.272|0.001|0.019|0.074|
|C(tipo_renda, Treatment(0))[T.Bolsista]|0.2434|0.241|1.008|0.313|-0.230|0.717|
|C(tipo_renda, Treatment(0))[T.Empresário]|0.1555|0.015|10.399|0.000|0.126|0.185|
|C(tipo_renda, Treatment(0))[T.Pensionista]|-0.3299|0.242|-1.366|0.172|-0.803|0.144|
|C(tipo_renda, Treatment(0))[T.Servidor público]|0.0560|0.022|2.515|0.012|0.012|0.100|
|C(educacao, Treatment(2))[T.Primário]|0.0193|0.072|0.268|0.789|-0.122|0.161|
|C(educacao, Treatment(2))[T.Pós graduação]|0.1448|0.142|1.018|0.309|-0.134|0.424|

| | | | |
|-|-|-|-|
|Omnibus:|0.5|Durbin-Watson:|2.023|
|Prob(Omnibus):|0.779|Jarque-Bera (JB):|0.489|
|Skew:|0.015|Prob(JB):|0.783|
|Kurtosis:|3.006|Cond. No.|2130.0|


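#### Evaluation
Changing the reference levels does not change the fit: R-squared (0.356), adjusted R-squared (0.354) and AIC (27220) are identical in both models. Only the coefficients and p-values of the re-based levels change, because each level is now compared with the most frequent category. For example, the p-value of `Pós graduação` drops from 0.431 to 0.309.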
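
The effect of re-basing a categorical variable can be checked on a small example. The sketch below uses synthetic data (the column names `grupo`, `x`, `y` and all values are made up) and fits the same model twice with different reference levels; it is an illustration of the general property, not a re-run of the notebook's models.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, made up for illustration
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "grupo": rng.choice(["A", "B", "C"], size=n, p=[0.2, 0.5, 0.3]),
    "x": rng.normal(size=n),
})
effect = toy["grupo"].map({"A": 0.0, "B": 0.5, "C": -0.3})
toy["y"] = 1.0 + effect + 0.8 * toy["x"] + rng.normal(scale=0.5, size=n)

# Same model, two different reference levels for the categorical term
fit_a = smf.ols("y ~ C(grupo, Treatment(reference='A')) + x", data=toy).fit()
fit_b = smf.ols("y ~ C(grupo, Treatment(reference='B')) + x", data=toy).fit()

# Overall fit statistics coincide; only the dummy coefficients differ
same_fit = bool(
    np.isclose(fit_a.rsquared, fit_b.rsquared) and np.isclose(fit_a.aic, fit_b.aic)
)
print(same_fit)
print(fit_a.params)
print(fit_b.params)
```

Both fits span the same column space, so fitted values, R-squared and AIC agree; the dummy coefficients and their p-values differ because they measure contrasts against different baselines.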

In [117]:
#2)
print('#####\nRemoved the variable "tipo_residencia"\n#####')
tipo = 'np.log(renda) ~ C(sexo) + C(posse_de_veiculo) + qtd_filhos + C(tipo_renda, Treatment(0)) + C(educacao,Treatment(2)) + C(estado_civil,Treatment(0)) + idade + tempo_emprego + qt_pessoas_residencia'
y,X = patsy.dmatrices(tipo, data=df_renda)
reg3 = sm.OLS(y, X).fit()
reg3.summary()


#####
Removed the variable "tipo_residencia"
#####


| | | | |
|-|-|-|-|
|Dep. Variable:|np.log(renda)|R-squared:|0.355|
|Model:|OLS|Adj. R-squared:|0.355|
|Method:|Least Squares|F-statistic:|380.1|
|Date:|Sun, 19 Jan 2025|Prob (F-statistic):|0.0|
|Time:|15:47:18|Log-Likelihood:|-13587.0|
|No. Observations:|12427|AIC:|27210.0|
|Df Residuals:|12408|BIC:|27350.0|
|Df Model:|18| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|6.5570|0.219|29.975|0.000|6.128|6.986|
|C(sexo)[T.M]|0.7860|0.015|53.738|0.000|0.757|0.815|
|C(posse_de_veiculo)[T.True]|0.0464|0.014|3.279|0.001|0.019|0.074|
|C(tipo_renda, Treatment(0))[T.Bolsista]|0.2458|0.241|1.018|0.309|-0.228|0.719|
|C(tipo_renda, Treatment(0))[T.Empresário]|0.1564|0.015|10.478|0.000|0.127|0.186|
|C(tipo_renda, Treatment(0))[T.Pensionista]|-0.3292|0.242|-1.363|0.173|-0.803|0.144|
|C(tipo_renda, Treatment(0))[T.Servidor público]|0.0572|0.022|2.574|0.010|0.014|0.101|
|C(educacao, Treatment(2))[T.Primário]|0.0122|0.072|0.170|0.865|-0.129|0.153|
|C(educacao, Treatment(2))[T.Pós graduação]|0.1457|0.142|1.025|0.306|-0.133|0.425|

| | | | |
|-|-|-|-|
|Omnibus:|0.526|Durbin-Watson:|2.022|
|Prob(Omnibus):|0.769|Jarque-Bera (JB):|0.518|
|Skew:|0.016|Prob(JB):|0.772|
|Kurtosis:|3.005|Cond. No.|2130.0|


In [118]:
# Higher adjusted R² is better
if reg2.rsquared_adj > reg3.rsquared_adj:
    print(f'Adjusted R² is better for reg2 = {reg2.rsquared_adj}')
else:
    print(f'Adjusted R² is better for reg3 = {reg3.rsquared_adj}')

# Lower AIC is better
if reg2.aic < reg3.aic:
    print(f'AIC is better for reg2 = {reg2.aic}')
else:
    print(f'AIC is better for reg3 = {reg3.aic}')


Adjusted R² is better for reg3 = 0.3545061584981394
AIC is better for reg3 = 27212.48323296981


#### Evaluation
The F-statistic rose from 297.6 to 380.1, while R-squared fell by 0.001 (from 0.356 to 0.355).
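
When a removed term contributes several dummies at once, as `tipo_residencia` does, a side-by-side table of indicators and a nested F-test can make the comparison more systematic. The sketch below is one possible way to do it on synthetic data; the helper `compare_fits` and the columns `x1`, `x2`, `y` are assumptions made for the example, while `rsquared`, `rsquared_adj`, `aic`, `bic` and `compare_f_test` are attributes and methods of statsmodels OLS results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def compare_fits(fits):
    """Collect the usual fit indicators of several OLS results in one table."""
    rows = {
        name: {
            "r2": fit.rsquared,
            "r2_adj": fit.rsquared_adj,
            "aic": fit.aic,
            "bic": fit.bic,
            "df_model": fit.df_model,
        }
        for name, fit in fits.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data, made up for illustration: x2 has no true effect on y
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 2.0 + 1.5 * toy["x1"] + rng.normal(size=n)

full = smf.ols("y ~ x1 + x2", data=toy).fit()
reduced = smf.ols("y ~ x1", data=toy).fit()

table = compare_fits({"full": full, "reduced": reduced})
print(table)

# F-test of the full model against the nested, reduced model
f_value, p_value, df_diff = full.compare_f_test(reduced)
print(f_value, p_value, df_diff)
```

A large p-value in the nested F-test suggests the dropped terms add little, which complements the small changes in R-squared and AIC.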

In [145]:
print('###\nremoving variables with P>|t| less than 5%\n###')
tipo = 'np.log(renda) ~ qtd_filhos + C(tipo_renda, Treatment(0)) + C(educacao,Treatment(2)) + C(estado_civil,Treatment(0)) + qt_pessoas_residencia'
y,X = patsy.dmatrices(tipo, data=df_renda)
reg4 = sm.OLS(y, X).fit()
reg4.summary()

###
removing variables with P>|t| less than 5%
###


| | | | |
|-|-|-|-|
|Dep. Variable:|np.log(renda)|R-squared:|0.009|
|Model:|OLS|Adj. R-squared:|0.008|
|Method:|Least Squares|F-statistic:|8.025|
|Date:|Sun, 19 Jan 2025|Prob (F-statistic):|2.4e-17|
|Time:|16:05:17|Log-Likelihood:|-16260.0|
|No. Observations:|12427|AIC:|32550.0|
|Df Residuals:|12412|BIC:|32660.0|
|Df Model:|14| | |
|Covariance Type:|nonrobust| | |

| |coef|std err|t|P>\|t\||[0.025|0.975]|
|-|-|-|-|-|-|-|
|Intercept|7.7994|0.268|29.108|0.000|7.274|8.325|
|C(tipo_renda, Treatment(0))[T.Bolsista]|0.1456|0.299|0.487|0.626|-0.441|0.732|
|C(tipo_renda, Treatment(0))[T.Empresário]|0.0445|0.018|2.415|0.016|0.008|0.081|
|C(tipo_renda, Treatment(0))[T.Pensionista]|-0.3071|0.299|-1.026|0.305|-0.894|0.280|
|C(tipo_renda, Treatment(0))[T.Servidor público]|0.1584|0.027|5.795|0.000|0.105|0.212|
|C(educacao, Treatment(2))[T.Primário]|-0.0067|0.089|-0.075|0.940|-0.181|0.168|
|C(educacao, Treatment(2))[T.Pós graduação]|-0.1490|0.176|-0.845|0.398|-0.494|0.196|
|C(educacao, Treatment(2))[T.Superior completo]|0.0620|0.017|3.636|0.000|0.029|0.095|
|C(educacao, Treatment(2))[T.Superior incompleto]|-0.1113|0.040|-2.817|0.005|-0.189|-0.034|

| | | | |
|-|-|-|-|
|Omnibus:|223.74|Durbin-Watson:|2.035|
|Prob(Omnibus):|0.0|Jarque-Bera (JB):|245.915|
|Skew:|0.302|Prob(JB):|3.98e-54|
|Kurtosis:|3.333|Cond. No.|140.0|


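#### Evaluation
Model 1 is better than the model without the variables whose P>|t| is below 5%. Dropping `sexo`, `posse_de_veiculo`, `idade` and `tempo_emprego` cuts adjusted R-squared from 0.354 to 0.008 and raises AIC from 27220 to 32550, so model 1 has the better AIC and adjusted R-squared.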