# Module 3 Open Problem Set - Linear Regression in Practice

## Predicting Blood Glucose - Diabetes Dataset

In this assignment you will use linear regression for the problem of **predicting blood glucose**. For this we will use the Stanford diabetes dataset. Note that the variables are already normalized.

You do not need to implement the methods yourself, since they are available in the Python library [statsmodels](https://www.statsmodels.org/stable/index.html). If necessary, you may import additional libraries.

## Importing modules and loading the dataset

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.datasets import load_diabetes

In [36]:
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

In [3]:
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

In [9]:
df

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 0.041708 | 0.050680 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 | 0.007207 |
| 438 | -0.005515 | 0.050680 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018114 | 0.044485 |
| 439 | 0.041708 | 0.050680 | -0.015906 | 0.017293 | -0.037344 | -0.013840 | -0.024993 | -0.011080 | -0.046883 | 0.015491 |
| 440 | -0.045472 | -0.044642 | 0.039062 | 0.001215 | 0.016318 | 0.015283 | -0.028674 | 0.026560 | 0.044529 | -0.025930 |
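
To make concrete what "already normalized" means for this dataset, the sketch below checks each feature column. It assumes the default, scaled version returned by `load_diabetes()`, for which scikit-learn documents that every column is mean centered and scaled so that its squared values sum to 1. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Assumption: the default (scaled) version of the dataset is loaded.
features = load_diabetes().data

column_means = features.mean(axis=0)
column_sum_sq = (features ** 2).sum(axis=0)

# Each column should have a mean close to 0 and a sum of squares close to 1.
print("Means close to 0:", np.allclose(column_means, 0.0, atol=1e-6))
print("Sum of squares close to 1:", np.allclose(column_sum_sq, 1.0, atol=1e-4))
```

This scaling explains why the values in the table are all small and of similar magnitude.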


In [37]:
# Variable we want to predict
TARGET = "s6"

If you have questions about cross-validation, I recommend https://machinelearningmastery.com/k-fold-cross-validation/. Look for other materials as well; what matters is that you understand and know how to use cross-validation, which is considerably more robust than a single train/test split.
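
As a minimal sketch of k-fold cross-validation for a linear regression, the example below splits the rows into `k` folds, fits ordinary least squares on `k - 1` of them, and scores the held-out fold. It uses only NumPy and synthetic data so that it runs on its own; the function name `kfold_mse` and the data are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np


def kfold_mse(X, y, k=5, seed=0):
    """Return the held-out mean squared error of OLS for each of k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))            # shuffle the rows once
    folds = np.array_split(indices, k)           # k nearly equal folds
    X1 = np.column_stack([np.ones(len(y)), X])   # add an intercept column

    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only, then score on the held-out fold.
        beta, *_ = np.linalg.lstsq(X1[train_idx], y[train_idx], rcond=None)
        residuals = y[test_idx] - X1[test_idx] @ beta
        scores.append(float(np.mean(residuals ** 2)))
    return scores


# Synthetic data (hypothetical): y depends linearly on two features plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

scores = kfold_mse(X, y, k=5)
print("MSE per fold:", np.round(scores, 3))
print("Mean CV MSE:", round(float(np.mean(scores)), 3))
```

With the assignment data, `X` and `y` would be the predictor columns and the target column. scikit-learn's `KFold` class provides the same kind of splitting if you prefer not to write the loop by hand.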

# Exercise 01:

1. Run a regression in statsmodels to predict variable s6. Use the tutorial [here](https://www.statsmodels.org/stable/regression.html) as an example. Use only the variables age, sex, bmi, and bp as predictors.
1. Interpret the confidence intervals of the variables. Is there any variable that could be removed? If so, which one?

In [38]:
X = df[["age", "sex", "bmi", "bp"]]

X_with_intercept = sm.add_constant(X)

model = sm.OLS(df[TARGET], X_with_intercept)

result = model.fit()

print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                     s6   R-squared:                       0.256
Model:                            OLS   Adj. R-squared:                  0.249
Method:                 Least Squares   F-statistic:                     37.54
Date:                Sun, 19 Nov 2023   Prob (F-statistic):           5.35e-27
Time:                        19:36:58   Log-Likelihood:                 784.28
No. Observations:                 442   AIC:                            -1559.
Df Residuals:                     437   BIC:                            -1538.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.006e-17      0.002   5.13e-15      1.0

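None of the features' confidence intervals contains 0, so all of the selected predictors are statistically significant at the 5% level and none of them needs to be removed. Only the intercept (`const`) is indistinguishable from 0, which is expected because the variables are already normalized.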

# Exercise 02:

1. Run TWO regressions, one to predict the number of violent crimes (violent) and one to predict the number of murders (murder). Do not use violent crimes to predict murders, nor murders to predict violent crimes. Use only the other variables in both cases.
1. Interpret the confidence intervals of the variables. Is there any variable that could be removed? If so, which one?

In [40]:
crime_data = sm.datasets.statecrime.load()
df = crime_data.data
df.head()

| state | violent | murder | hs_grad | poverty | single | white | urban |
|---|---|---|---|---|---|---|---|
| Alabama | 459.9 | 7.1 | 82.1 | 17.5 | 29.0 | 70.0 | 48.65 |
| Alaska | 632.6 | 3.2 | 91.4 | 9.0 | 25.5 | 68.3 | 44.46 |
| Arizona | 423.2 | 5.5 | 84.2 | 16.5 | 25.7 | 80.0 | 80.07 |
| Arkansas | 530.3 | 6.3 | 82.4 | 18.8 | 26.3 | 78.4 | 39.54 |
| California | 473.4 | 5.4 | 80.6 | 14.2 | 27.8 | 62.7 | 89.73 |


In [25]:
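def linear_regression(X, y):
    """Fit an OLS model of y on X (plus an intercept) and print its summary."""
    # sm.OLS does not add an intercept on its own, so add a constant column
    X_with_intercept = sm.add_constant(X)

    model = sm.OLS(y, X_with_intercept)
    result = model.fit()

    print(result.summary())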

In [41]:
# Names of the predictor columns: every column except the two targets
X = df.columns.drop(["violent", "murder"])

### Violent crimes


In [42]:
linear_regression(df[X], df["violent"])

                            OLS Regression Results                            
Dep. Variable:                violent   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.645
Method:                 Least Squares   F-statistic:                     19.18
Date:                Sun, 19 Nov 2023   Prob (F-statistic):           3.54e-10
Time:                        19:40:11   Log-Likelihood:                -314.97
No. Observations:                  51   AIC:                             641.9
Df Residuals:                      45   BIC:                             653.5
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1907.3378    935.115     -2.040      0.0

Judging by the confidence intervals, some features do not contribute significantly to the model and could be removed, such as *hs_grad* and *poverty*: their intervals contain 0 and are wide compared with those of *white* and *urban*, whose intervals also contain 0.

### Murders

In [43]:
linear_regression(df[X], df["murder"])

                            OLS Regression Results                            
Dep. Variable:                 murder   R-squared:                       0.820
Model:                            OLS   Adj. R-squared:                  0.800
Method:                 Least Squares   F-statistic:                     41.02
Date:                Sun, 19 Nov 2023   Prob (F-statistic):           1.14e-15
Time:                        19:41:32   Log-Likelihood:                -94.102
No. Observations:                  51   AIC:                             200.2
Df Residuals:                      45   BIC:                             211.8
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -47.6761     12.303     -3.875      0.0

Unlike in the *violent crimes* model, in the **murder** model *hs_grad* and *poverty* are more significant: their confidence intervals do not contain 0. That role now falls to *white* and *urban*, whose intervals still contain 0. They are now the only variables with this property and could indeed be removed.