# Fundamentals of Statistics for Data Analytics





In [13]:
import pandas as pd
import numpy as np
import plotly.express as px
from scipy import stats  # SciPy's statistics module
from sklearn.feature_selection import RFE  # recursive feature elimination (selects variables, not models)
from sklearn.model_selection import train_test_split  # splits the data into train and test sets
from sklearn import linear_model  # linear models
from sklearn.metrics import (  # evaluation metrics
    r2_score,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)
import statsmodels.api as sm  # statistical inference
from statsmodels.sandbox.regression.predstd import wls_prediction_std  # prediction standard errors
from sklearn.impute import KNNImputer  # k-nearest-neighbours imputation of missing values

In [4]:
# Design matrix: a column of ones (intercept) plus two predictors; Y is the response
X = np.array([[1,160,2], [1,180,0], [1,148,3], [1,175,1], [1,180,2]])
Y = np.array([[75], [78], [50], [80], [90]])

In [17]:
# Exact solution using only the first three observations (3 equations, 3 unknowns)
betaGeo = np.linalg.solve(X[0:3], Y[0:3])
betaGeo

array([[-2037.  ],
       [   11.75],
       [  116.  ]])

In [12]:
# Errors of the exact solution on all five observations
Y_estimada = np.dot(X, betaGeo)
error = Y - Y_estimada
np.round(error, 4)

array([[  -0.  ],
       [  -0.  ],
       [  -0.  ],
       [ -55.25],
       [-220.  ]])

In [16]:
# OLS estimate from the normal equations: beta = (X'X)^{-1} X'Y
beta_OLS = np.linalg.inv(X.T @ X) @ (X.T @ Y)
beta_OLS

array([[-142.2302184 ],
       [   1.24048216],
       [   4.80307913]])

In [18]:
# Errors of the OLS fit on all five observations
Y_estimada = np.dot(X, beta_OLS)
error = Y - Y_estimada
np.round(error, 4)

array([[ 9.1469],
       [-3.0566],
       [-5.7704],
       [ 0.3428],
       [-0.6627]])

# Linear regression
In a regression problem there is a quantitative target variable $Y$ that is of interest to the researcher.
The goal is to build a function $f(X)$, where $X=(X_1, \ldots, X_p)$ is a set of exogenous variables used to predict $Y$.

In a linear regression model, the functions have the form
$$Y=\beta_0 +\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p +\epsilon$$
or, more generally,
$$f_0(Y)=\beta_0 +\beta_1 f_1(X_1)+\beta_2 f_2(X_2)+\cdots+\beta_p f_p(X_p) +\epsilon$$
where $\epsilon$ is the error, or noise, of the model.
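
As a minimal sketch of how the first equation turns into a computation, the five observations defined at the top of the notebook can be stacked into a design matrix whose first column of ones carries $\beta_0$, and the coefficients can then be estimated by least squares. `np.linalg.lstsq` is used here as a swapped-in alternative to the explicit matrix inverse; the residuals play the role of $\epsilon$.

```python
import numpy as np

# Same toy data as in the first cells: a column of ones for the intercept,
# followed by two predictors.
X = np.array([[1, 160, 2], [1, 180, 0], [1, 148, 3], [1, 175, 1], [1, 180, 2]], dtype=float)
Y = np.array([75, 78, 50, 80, 90], dtype=float)

# Least-squares estimate of (beta_0, beta_1, beta_2).
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Residuals: the part of Y that the linear function does not explain.
residuals = Y - X @ beta

print(np.round(beta, 4))
print(np.round(residuals, 4))
```

With an intercept in the model the residuals sum to zero (up to rounding), which is a quick sanity check on the fit.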

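Several assumptions are made about this error so that the statistical inference drawn from the model is valid:
1. Normality (Gaussianity): the errors follow a normal distribution (the Gaussian bell curve).
2. Homoscedasticity: the variance of the errors does not depend on the $X$ variables.
3. Independence: the errors are independent of one another.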
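
The three assumptions can be given a rough first check on the residuals of a fitted model. The sketch below uses synthetic data (an assumption made only for illustration) and simple diagnostics chosen for brevity: a Shapiro-Wilk test for normality, a rank correlation between the absolute residuals and the fitted values for homoscedasticity, and the lag-1 autocorrelation for independence. These are quick screens, not a complete diagnostic procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): one predictor, Gaussian noise.
x = rng.uniform(0, 10, size=200)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=200)

# Fit by least squares and compute fitted values and residuals.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# 1. Normality: Shapiro-Wilk test (null hypothesis: the residuals are normal).
shapiro_stat, shapiro_p = stats.shapiro(resid)

# 2. Homoscedasticity (rough check): |residuals| should not trend with the fitted values.
spread_corr, spread_p = stats.spearmanr(fitted, np.abs(resid))

# 3. Independence (rough check): lag-1 autocorrelation of the residuals should be near 0.
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Spearman corr(|resid|, fitted): {spread_corr:.3f}")
print(f"Lag-1 autocorrelation: {lag1:.3f}")
```

The lag-1 check is only meaningful when the rows have a natural order, such as time.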


## Working with dummy variables

1. We will work with this [project](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv)

2. Description of the [data](https://raw.githubusercontent.com/Cruzalirio/Ucentral/master/Bases/Casas/data_description.txt)

In [19]:
url ="https://raw.githubusercontent.com/Cruzalirio/Ucentral/master/Bases/Casas/train.csv"
train = pd.read_csv(url, index_col=0)
url ="https://raw.githubusercontent.com/Cruzalirio/Ucentral/master/Bases/Casas/test.csv"
test = pd.read_csv(url)

In [20]:
train

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125


# Model with all the variables

In [21]:
# Count and proportion of missing values per column; list the columns with more than 20% missing
perdidos = train.isnull().sum().reset_index(name="Conteo").sort_values("Conteo", ascending=False)
perdidos["Prop"] = perdidos["Conteo"]/train.shape[0]
perdidos[perdidos["Prop"]>0.2]["index"].to_list()

['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu']

In [22]:
# Drop the target and the columns with more than 20% missing values
X = train.drop(["SalePrice",'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'], axis=1)
# The dummy variables must be created before splitting into train and test
X = pd.get_dummies(X, drop_first=True)
Y = train["SalePrice"]
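
One reason for creating the dummies before the split is that each subset may contain only some of the categories, so encoding them separately can yield different columns. The sketch below illustrates this with a hypothetical four-row table; the values are made up and only the column names echo the housing data.

```python
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", "RM", "RL", "FV"],
})

# Encoding the full table first gives every row the same columns.
# drop_first=True drops the first level ("FV") to avoid a redundant column.
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())  # ['LotArea', 'MSZoning_RL', 'MSZoning_RM']

# Encoding two halves separately can produce different column sets,
# because each half only sees some of the categories.
first_half = pd.get_dummies(df.iloc[:2], drop_first=True)
second_half = pd.get_dummies(df.iloc[2:], drop_first=True)
print(first_half.columns.tolist())   # ['LotArea', 'MSZoning_RM']
print(second_half.columns.tolist())  # ['LotArea', 'MSZoning_RL']
```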

In [23]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, random_state=20)  # simple random sampling
X_test.shape

(292, 232)

In [24]:
X_train.shape

(1168, 232)

In [25]:
X_train

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
1138,50,54.0,6342,5,8,1875,1996,0.0,0,0,...,0,0,0,0,1,0,0,0,1,0
1336,20,80.0,9650,6,5,1977,1977,360.0,686,0,...,0,0,0,0,1,0,0,0,1,0
460,50,,7015,5,4,1950,1950,161.0,185,0,...,0,0,0,0,1,0,0,0,1,0
116,160,34.0,3230,6,5,1999,1999,1129.0,419,0,...,0,0,0,0,1,0,0,0,1,0
909,20,,8885,5,5,1983,1983,0.0,301,324,...,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
925,20,79.0,10240,6,6,1980,1980,157.0,625,1061,...,0,0,0,0,1,0,0,0,1,0
1248,80,,12328,6,5,1976,1976,335.0,539,0,...,0,0,0,0,1,0,0,0,1,0
272,20,73.0,39104,7,7,1954,2005,0.0,226,1063,...,0,0,0,0,1,0,0,0,1,0
475,120,41.0,5330,8,5,2000,2000,0.0,1196,0,...,0,0,0,0,1,0,0,0,1,0


In [26]:
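# Impute missing values with the 5 nearest neighbours (fitted on the training set)
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train)
X_train_imp = pd.DataFrame(X_train_imp, columns=X_train.columns, index=X_train.index)
# Fit an ordinary least-squares model on all the imputed variables
modelo1 = linear_model.LinearRegression().fit(X_train_imp, Y_train)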

In [27]:
modelo1.coef_

array([-3.49879715e+01,  5.95250864e+00,  7.28414203e-01,  6.17751519e+03,
        5.45164253e+03,  2.79977217e+02,  7.21271952e+01,  3.69487974e+01,
        1.24010130e+01,  9.96208277e+00, -3.59888102e+00,  1.87640348e+01,
        1.04897318e+01,  3.17630441e+01, -1.25157718e+01,  2.97367955e+01,
        1.35333176e+03,  3.41402579e+02,  4.78963800e+03,  2.56477955e+03,
       -3.49054353e+03, -1.25918141e+04,  1.57863071e+03,  3.45427769e+03,
       -2.32331797e+01,  3.65749319e+03,  1.44360092e+01,  1.87356916e+01,
        1.30818022e+01,  1.00821666e+01,  4.06163165e+01,  1.68727478e+01,
        3.22463603e+01, -7.21574476e-02, -5.23184401e+02, -4.12507042e+02,
        3.16049304e+04,  1.82081631e+04,  2.00017354e+04,  1.86418739e+04,
        2.70182420e+04,  5.81984201e+03,  7.08885460e+03,  1.71062793e+03,
        1.18954731e+04, -7.53309838e+03,  8.68928736e+03, -4.27292989e+04,
        1.17018244e+04, -9.52274129e+03, -1.56087843e+04, -1.42612078e+03,
        9.67635897e+03, -

In [28]:
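# Note: a separate imputer (k=2) is fitted on the test set here. The usual practice
# is to reuse the imputer fitted on X_train via imputer.transform(X_test).
imputer = KNNImputer(n_neighbors=2)
X_test_imp = imputer.fit_transform(X_test)
X_test_imp = pd.DataFrame(X_test_imp, columns=X_test.columns, index=X_test.index)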
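
A common convention is to learn the imputation from the training rows only and then apply it unchanged to the test rows. The following is a minimal sketch of that pattern on hypothetical toy matrices; it is not the procedure that produced the results in this notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrices with missing values (np.nan).
X_train_toy = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
X_test_toy = np.array([[2.5, np.nan], [np.nan, 7.0]])

# Fit on the training rows only...
imputer = KNNImputer(n_neighbors=2)
X_train_filled = imputer.fit_transform(X_train_toy)

# ...then reuse the fitted imputer on the test rows (transform, not fit_transform),
# so the neighbours are always looked up among the training rows.
X_test_filled = imputer.transform(X_test_toy)

print(X_train_filled)
print(X_test_filled)
```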

## Overfitting

1. As a rule of thumb, if the gap between the training and test $R^2$ exceeds 10 percentage points, the model is considered overfitted, and parameters or variables should be removed from training.
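
The rule can be sketched as a small check. The example below uses synthetic data (an assumption) with many irrelevant predictors and few observations, a setting in which ordinary least squares tends to fit the training set far better than the test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (an assumption): only the first of 50 predictors matters.
n, p = 80, 50
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

r2_train = r2_score(y_tr, model.predict(X_tr))
r2_test = r2_score(y_te, model.predict(X_te))
gap = r2_train - r2_test

print(f"R2 train: {r2_train:.2f}, R2 test: {r2_test:.2f}, gap: {gap:.2f}")
if gap > 0.10:  # the 10-point rule of thumb
    print("Possible overfitting: consider removing variables or regularizing.")
```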

In [29]:
Y_pred_train = modelo1.predict(X_train_imp)  # training set
Y_pred_test = modelo1.predict(X_test_imp)  # test set
print("R2 train", np.round(r2_score(Y_train, Y_pred_train), 2)*100, "%")
print("R2 test", np.round(r2_score(Y_test, Y_pred_test), 2)*100, "%")

R2 train 93.0 %
R2 test 68.0 %


In [30]:
print("MAPE train", np.round(mean_absolute_percentage_error(Y_train, Y_pred_train), 2)*100, "%")
print("MAPE test", np.round(mean_absolute_percentage_error(Y_test, Y_pred_test), 2)*100, "%")

MAPE train 8.0 %
MAPE test 11.0 %


# Inferential statistics

1. Read this article [here](https://scielo.isciii.es/scielo.php?script=sci_arttext&pid=S1139-76322017000500014)

2. We use the [statsmodels](https://www.statsmodels.org/) library

In [33]:
X_train_imp_Intercepto

Id,const,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
1138,1.0,50.0,54.0,6342.0,5.0,8.0,1875.0,1996.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1336,1.0,20.0,80.0,9650.0,6.0,5.0,1977.0,1977.0,360.0,686.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
460,1.0,50.0,67.8,7015.0,5.0,4.0,1950.0,1950.0,161.0,185.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
116,1.0,160.0,34.0,3230.0,6.0,5.0,1999.0,1999.0,1129.0,419.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
909,1.0,20.0,74.0,8885.0,5.0,5.0,1983.0,1983.0,0.0,301.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
925,1.0,20.0,79.0,10240.0,6.0,6.0,1980.0,1980.0,157.0,625.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1248,1.0,80.0,74.6,12328.0,6.0,5.0,1976.0,1976.0,335.0,539.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
272,1.0,20.0,73.0,39104.0,7.0,7.0,1954.0,2005.0,0.0,226.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
475,1.0,120.0,41.0,5330.0,8.0,5.0,2000.0,2000.0,0.0,1196.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [44]:
# Add a column of 1's for the intercept:
# statsmodels does not add it by default,
# whereas scikit-learn does.
# Tip: never drop the intercept at the first attempt.
X_train_imp_Intercepto = sm.add_constant(X_train_imp)
modelo1 = sm.OLS(Y_train, X_train_imp_Intercepto)
resultados = modelo1.fit()
print(resultados.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.934
Model:                            OLS   Adj. R-squared:                  0.919
Method:                 Least Squares   F-statistic:                     59.55
Date:                Fri, 28 Apr 2023   Prob (F-statistic):               0.00
Time:                        00:30:45   Log-Likelihood:                -13232.
No. Observations:                1168   AIC:                         2.692e+04
Df Residuals:                     942   BIC:                         2.806e+04
Df Model:                         225                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                 -4.293e+

In [45]:
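# scikit-learn adds the intercept by default.
# Tip: never drop the intercept at the first attempt.
X_train_imp_Intercepto = sm.add_constant(X_train_imp)
modelo2 = sm.OLS(Y_train, X_train_imp)  # model without the intercept column
# Note: the next line refits modelo1 (with intercept), so the summary printed below
# is identical to the previous one; modelo2.fit() would fit the no-intercept model.
resultados2 = modelo1.fit()
print(resultados2.summary())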

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.934
Model:                            OLS   Adj. R-squared:                  0.919
Method:                 Least Squares   F-statistic:                     59.55
Date:                Fri, 28 Apr 2023   Prob (F-statistic):               0.00
Time:                        00:31:18   Log-Likelihood:                -13232.
No. Observations:                1168   AIC:                         2.692e+04
Df Residuals:                     942   BIC:                         2.806e+04
Df Model:                         225                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                 -4.293e+

-82.89306334778564

## Selecting the variables with the lowest p-values

In [46]:
# Note that tables is a list. The table at index 1 is the "core" table. Additionally, read_html puts dfs in a list, so we want index 0
results_as_html = resultados.summary().tables[1].as_html()
tabla = pd.read_html(results_as_html, header=0, index_col=0)[0]
tabla

Unnamed: 0,coef,std err,t,P>|t|,[0.025,0.975]
const,-429300.0000,1190000.000,-0.361,0.718,-2760000.000,1900000.000
MSSubClass,-34.9880,87.568,-0.400,0.690,-206.838,136.862
LotFrontage,5.9525,49.035,0.121,0.903,-90.277,102.182
LotArea,0.7284,0.143,5.090,0.000,0.448,1.009
OverallQual,6177.5152,1126.380,5.484,0.000,3967.010,8388.020
...,...,...,...,...,...,...
SaleCondition_AdjLand,16630.0000,17100.000,0.973,0.331,-16900.000,50100.000
SaleCondition_Alloca,10440.0000,10000.000,1.040,0.299,-9271.458,30200.000
SaleCondition_Family,4660.8034,6880.464,0.677,0.498,-8842.008,18200.000
SaleCondition_Normal,6994.6083,3313.936,2.111,0.035,491.057,13500.000


In [None]:
tabla.columns

Index(['coef', 'std err', 't', 'P>|t|', '[0.025', '0.975]'], dtype='object')

In [47]:
tabla[tabla['P>|t|']<0.01].index

Index(['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'MasVnrArea',
       'BsmtFinSF1', 'TotalBsmtSF', '2ndFlrSF', 'GrLivArea', 'WoodDeckSF',
       'LotConfig_CulDSac', 'LandSlope_Sev', 'Neighborhood_Edwards',
       'Neighborhood_Mitchel', 'Neighborhood_StoneBr', 'Condition1_Norm',
       'Condition2_PosN', 'Condition2_RRAe', 'RoofStyle_Shed',
       'RoofMatl_CompShg', 'RoofMatl_Membran', 'RoofMatl_Roll',
       'RoofMatl_Tar&Grv', 'RoofMatl_WdShake', 'RoofMatl_WdShngl',
       'Exterior1st_CBlock', 'ExterQual_Gd', 'ExterQual_TA', 'BsmtQual_Gd',
       'BsmtQual_TA', 'BsmtCond_Po', 'BsmtExposure_Gd', 'KitchenQual_Fa',
       'KitchenQual_Gd', 'KitchenQual_TA', 'GarageQual_Fa', 'GarageQual_Gd',
       'GarageQual_TA', 'GarageCond_Fa', 'GarageCond_TA'],
      dtype='object')

## A more parsimonious model

In [56]:
# Keep only the variables with p-value < 0.01 in the full model, then refit
X_train_selecc = X_train_imp_Intercepto[tabla[tabla['P>|t|']<0.01].index]
X_train_selecc = sm.add_constant(X_train_selecc)

modelo2 = sm.OLS(Y_train, X_train_selecc)
resultados = modelo2.fit()
print(resultados.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.886
Model:                            OLS   Adj. R-squared:                  0.882
Method:                 Least Squares   F-statistic:                     224.7
Date:                Fri, 28 Apr 2023   Prob (F-statistic):               0.00
Time:                        00:45:00   Log-Likelihood:                -13554.
No. Observations:                1168   AIC:                         2.719e+04
Df Residuals:                    1128   BIC:                         2.739e+04
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                -1.493e+06 

## Lasso regression

(Lasso stands for least absolute shrinkage and selection operator.)


1. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)

2. It is best to standardize the variables before fitting the regression, because the penalty acts on the size of the coefficients.

3. The same holds for Ridge.

4. It is useful in the presence of multicollinearity (redundant information in the $X$ variables).
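
To make the points above concrete, here is a minimal sketch on synthetic data (an assumption): ten predictors on very different scales, of which only two influence the response. After standardization, the L1 penalty sets the coefficients of the irrelevant predictors exactly to zero. The penalty `alpha=0.5` is an arbitrary illustrative value, not a tuned one.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 10 predictors on very different scales,
# of which only the first two influence y.
n = 200
scales = np.array([1, 1000, 1, 10, 100, 1, 10, 1000, 1, 100], dtype=float)
Z = rng.normal(size=(n, 10))
X = Z * scales
y = 5.0 * Z[:, 0] - 3.0 * Z[:, 1] + rng.normal(0, 0.5, size=n)

# Standardize so the L1 penalty treats every coefficient on the same scale.
X_std = StandardScaler().fit_transform(X)

# alpha=0.5 is an arbitrary illustrative penalty, not a tuned value.
lasso = Lasso(alpha=0.5).fit(X_std, y)

print(np.round(lasso.coef_, 3))
print("coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
```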


In [59]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

In [60]:
escala = StandardScaler()
escala.fit(X_train_imp)

In [61]:
X_train_escalado = escala.transform(X_train_imp)
X_train_escalado = pd.DataFrame(X_train_escalado, index=X_train_imp.index, columns=X_train_imp.columns)
X_train_escalado

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
1138,-0.177956,-0.683133,-0.459903,-0.779604,2.124198,-3.146228,0.523206,-0.576959,-0.959440,-0.288252,...,-0.058621,-0.058621,-0.302636,-0.029273,0.38971,-0.050746,-0.088121,-0.117851,0.46275,-0.307665
1336,-0.877797,0.377773,-0.096365,-0.064865,-0.510559,0.200848,-0.398740,1.414466,0.522661,-0.288252,...,-0.058621,-0.058621,-0.302636,-0.029273,0.38971,-0.050746,-0.088121,-0.117851,0.46275,-0.307665
460,-0.177956,-0.120037,-0.385942,-0.779604,-1.388812,-0.685143,-1.708875,0.313650,-0.559748,-0.288252,...,-0.058621,-0.058621,-0.302636,-0.029273,0.38971,-0.050746,-0.088121,-0.117851,0.46275,-0.307665
116,2.388128,-1.499214,-0.801900,-0.064865,-0.510559,0.922767,0.668777,5.668371,-0.054192,-0.288252,...,-0.058621,-0.058621,-0.302636,-0.029273,0.38971,-0.050746,-0.088121,-0.117851,0.46275,-0.307665
909,-0.877797,0.132949,-0.180436,-0.779604,-0.510559,0.397735,-0.107599,-0.576959,-0.309130,1.700874,...,-0.058621,-0.058621,-0.302636,-0.029273,0.38971,-0.050746,-0.088121,-0.117851,0.46275,-0.307665
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
925,-0.877797,0.336969,-0.031526,-0.064865,0.367693,0.299292,-0.253170,0.291523,0.390870,6.225520,...,-0.058621,-0.058621,-0.302636,-0.029273,0.38971,-0.050746,-0.088121,-0.117851,0.46275,-0.307665
1248,0.521885,0.157431,0.197937,-0.064865,-0.510559,0.168034,-0.447264,1.276172,0.205068,-0.288252,...,-0.058621,-0.058621,-0.302636,-0.029273,0.38971,-0.050746,-0.088121,-0.117851,0.46275,-0.307665
272,-0.877797,0.092145,3.140524,0.649874,1.245945,-0.553885,0.959918,-0.576959,-0.471168,6.237799,...,-0.058621,-0.058621,-0.302636,-0.029273,0.38971,-0.050746,-0.088121,-0.117851,0.46275,-0.307665
475,1.455006,-1.213585,-0.571118,1.364612,-0.510559,0.955581,0.717300,-0.576959,1.624514,-0.288252,...,-0.058621,-0.058621,-0.302636,-0.029273,0.38971,-0.050746,-0.088121,-0.117851,0.46275,-0.307665


In [62]:
# Cross-validation: LassoCV chooses the penalty alpha by 5-fold
# cross-validation and then estimates the betas with that alpha.
reg = LassoCV(cv=5, random_state=0).fit(X_train_escalado, Y_train)
reg.alpha_
# The selected penalty is about 767 (its scale is tied to the scale of Y)

767.5160686474742

In [64]:
datos = pd.DataFrame(reg.coef_, index=X_train_imp.columns, columns = ["Beta por lasso"])
datos

Unnamed: 0,Beta por lasso
MSSubClass,-7361.875972
LotFrontage,-0.000000
LotArea,1020.318788
OverallQual,15832.293746
OverallCond,4501.501249
...,...
SaleCondition_AdjLand,0.000000
SaleCondition_Alloca,0.000000
SaleCondition_Family,-0.000000
SaleCondition_Normal,73.524296


In [65]:
datos["AbsBeta"] = np.abs(datos["Beta por lasso"])
datos.sort_values("AbsBeta", ascending= False)

Unnamed: 0,Beta por lasso,AbsBeta
GrLivArea,21353.879117,21353.879117
OverallQual,15832.293746,15832.293746
Neighborhood_NridgHt,8622.911496,8622.911496
GarageCars,7368.333708,7368.333708
MSSubClass,-7361.875972,7361.875972
...,...,...
Exterior1st_BrkComm,-0.000000,0.000000
Exterior1st_CBlock,0.000000,0.000000
Exterior1st_ImStucc,-0.000000,0.000000
Exterior1st_MetalSd,0.000000,0.000000


In [68]:
datos[datos["AbsBeta"]==0]

Unnamed: 0,Beta por lasso,AbsBeta
LotFrontage,-0.0,0.0
BsmtFinSF2,0.0,0.0
BsmtUnfSF,-0.0,0.0
TotalBsmtSF,0.0,0.0
1stFlrSF,0.0,0.0
...,...,...
SaleType_WD,-0.0,0.0
SaleCondition_AdjLand,0.0,0.0
SaleCondition_Alloca,0.0,0.0
SaleCondition_Family,-0.0,0.0


## Testing the model

In [69]:
Y_pred_train = reg.predict(X_train_escalado)  # training set
# Scale the test set with the scaler fitted on the training set
X_test_escalado = pd.DataFrame(escala.transform(X_test_imp), index=X_test_imp.index, columns=X_test_imp.columns)

Y_pred_test = reg.predict(X_test_escalado)  # test set

print("R2 train", np.round(r2_score(Y_train, Y_pred_train), 2)*100, "%")
print("R2 test", np.round(r2_score(Y_test, Y_pred_test), 2)*100, "%")

R2 train 89.0 %
R2 test 85.0 %


## Exercise

Now let's try Ridge: what happens?

1. Fit a [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) regression.

In [73]:
from sklearn.linear_model import RidgeCV
regRidge = RidgeCV(cv=5).fit(X_train_escalado, Y_train)
regRidge

In [75]:
datos = pd.DataFrame(regRidge.coef_, index=X_train_imp.columns, columns = ["Beta por Ridge"])
datos

Unnamed: 0,Beta por Ridge
MSSubClass,-3589.591482
LotFrontage,-1318.123233
LotArea,5255.533630
OverallQual,10398.956315
OverallCond,5546.703720
...,...
SaleCondition_AdjLand,638.198347
SaleCondition_Alloca,1142.980793
SaleCondition_Family,610.472596
SaleCondition_Normal,2725.693578


In [77]:
datos["AbsBeta"] = np.abs(datos["Beta por Ridge"])
datos.sort_values("AbsBeta", ascending= False)

Unnamed: 0,Beta por Ridge,AbsBeta
RoofMatl_CompShg,41872.195370,41872.195370
RoofMatl_WdShngl,27376.036834,27376.036834
RoofMatl_Tar&Grv,26046.629815,26046.629815
RoofMatl_WdShake,16917.662155,16917.662155
GrLivArea,13590.188119,13590.188119
...,...,...
RoofMatl_Metal,0.000000,0.000000
Electrical_Mix,0.000000,0.000000
Exterior2nd_CBlock,0.000000,0.000000
Exterior1st_CBlock,0.000000,0.000000


In [78]:
Y_pred_train = regRidge.predict(X_train_escalado)  # training set
# Scale the test set with the scaler fitted on the training set
X_test_escalado = pd.DataFrame(escala.transform(X_test_imp), index=X_test_imp.index, columns=X_test_imp.columns)

Y_pred_test = regRidge.predict(X_test_escalado)  # test set

print("R2 train", np.round(r2_score(Y_train, Y_pred_train), 2)*100, "%")
print("R2 test", np.round(r2_score(Y_test, Y_pred_test), 2)*100, "%")

R2 train 93.0 %
R2 test 81.0 %
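
One thing that may be worth trying in this exercise is the grid of penalties: by default `RidgeCV` only tries the values 0.1, 1.0 and 10.0, which can be small for standardized data with a large-valued response. The sketch below, on synthetic data (an assumption), passes a wider grid through the `alphas` parameter and also shows that Ridge shrinks coefficients without setting them exactly to zero, unlike Lasso.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 20 predictors, only the first one matters.
X = rng.normal(size=(150, 20))
y = 4.0 * X[:, 0] + rng.normal(size=150)
X_std = StandardScaler().fit_transform(X)

# Candidate penalties on a log scale; this grid is an illustrative choice.
alphas = np.logspace(-2, 4, 25)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_std, y)

print("selected alpha:", ridge.alpha_)
# Ridge shrinks coefficients towards zero but does not set them exactly to zero.
print("coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))
```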


## Individual assignment: logistic regression

1. Let $Y=class$ be the variable indicating whether or not a person has diabetes.

2. Build the cross-tabulation of Polyuria and class. Is there any relationship?

3. Build the cross-tabulation of Gender and class. Is there any relationship?
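
As a starting point, a cross-tabulation and a test of association can be sketched as follows. The twelve rows are hypothetical toy data that only borrow the column names of the diabetes table; `pd.crosstab` builds the table and `scipy.stats.chi2_contingency` is one possible (assumed) choice of test.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data with the same column names as the diabetes table.
toy = pd.DataFrame({
    "Polyuria": ["Yes"] * 6 + ["No"] * 6,
    "class": ["Positive"] * 5 + ["Negative"] + ["Positive"] + ["Negative"] * 5,
})

# Counts of each (Polyuria, class) combination.
table = pd.crosstab(toy["Polyuria"], toy["class"])
print(table)

# Row proportions make the comparison between groups easier to read.
print(pd.crosstab(toy["Polyuria"], toy["class"], normalize="index"))

# Chi-square test of independence (null hypothesis: no association).
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}")
```

With counts as small as in this toy table the chi-square approximation is rough; the real data set has several hundred rows.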



In [80]:
datos = pd.read_csv("https://raw.githubusercontent.com/Cruzalirio/Ucentral/master/Bases/Diabetes.csv", sep=";")
datos

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,Male,No,Yes,No,Yes,No,No,No,Yes,No,Yes,No,Yes,Yes,Yes,Positive
1,58,Male,No,No,No,Yes,No,No,Yes,No,No,No,Yes,No,Yes,No,Positive
2,41,Male,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,Yes,No,Positive
3,45,Male,No,No,Yes,Yes,Yes,Yes,No,Yes,No,Yes,No,No,No,No,Positive
4,60,Male,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Positive
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
515,39,Female,Yes,Yes,Yes,No,Yes,No,No,Yes,No,Yes,Yes,No,No,No,Positive
516,48,Female,Yes,Yes,Yes,Yes,Yes,No,No,Yes,Yes,Yes,Yes,No,No,No,Positive
517,58,Female,Yes,Yes,Yes,Yes,Yes,No,Yes,No,No,No,Yes,Yes,No,Yes,Positive
518,32,Female,No,No,No,Yes,No,No,Yes,Yes,No,Yes,No,No,Yes,No,Negative


## Logistic regression

1. Let $p$ be the probability of having diabetes; $p$ takes values between 0 and 1.

3. Plot the function $f(p)=\frac{p}{1-p}$

4. Plot the function $f(p)=\ln\left(\frac{p}{1-p}\right)$
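
The two functions are the odds and the log-odds (logit). A minimal sketch for evaluating them on a grid of probabilities is given below; the helper names `odds` and `logit` are hypothetical, and the plotting calls are left as comments so the block runs without opening a figure.

```python
import numpy as np

def odds(p):
    """Odds p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1 - p)

def logit(p):
    """Log-odds ln(p / (1 - p)) for a probability 0 < p < 1."""
    return np.log(p / (1 - p))

# Grid of probabilities strictly inside (0, 1) to avoid division by zero and log(0).
p = np.linspace(0.01, 0.99, 99)
odds_values = odds(p)
logit_values = logit(p)

print(odds(0.5), logit(0.5))  # 1.0 0.0

# To draw the curves with the plotly.express import used in this notebook:
# px.line(x=p, y=odds_values).show()
# px.line(x=p, y=logit_values).show()
```

The odds take values in $[0, \infty)$, while the log-odds cover the whole real line and are symmetric around $p=0.5$, which is what allows them to be modelled with a linear function.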
