Assumptions of linear regression:
- Linearity
- Homoscedasticity
- Multivariate normality
- Independence of errors
- Absence of multicollinearity

If these five assumptions do not hold, a linear regression model is not a sensible choice.
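
Some of these assumptions can be checked numerically once a model has been fitted. The following is a minimal sketch on synthetic data, not part of the original notebook: the names `X_demo`, `y_demo`, and the helper `vif` are illustrative assumptions. It computes the Durbin-Watson statistic (independence of errors) and the variance inflation factor of each predictor (multicollinearity) with plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two independent predictors
n = 200
X_demo = rng.normal(size=(n, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

# Ordinary least squares fit with an explicit intercept column
A = np.column_stack([np.ones(n), X_demo])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
residuals = y_demo - A @ coef

# Independence of errors: Durbin-Watson statistic
# (values near 2 suggest no first-order autocorrelation)
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)


def vif(X, k):
    """VIF of column k: 1 / (1 - R^2) from regressing column k on the other columns."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, k, axis=1)])
    beta, *_ = np.linalg.lstsq(others, X[:, k], rcond=None)
    resid = X[:, k] - others @ beta
    r2 = 1.0 - resid.var() / X[:, k].var()
    return 1.0 / (1.0 - r2)


# Absence of multicollinearity: VIF close to 1 means a predictor is
# barely explained by the others
vifs = [vif(X_demo, k) for k in range(X_demo.shape[1])]

print(f"Durbin-Watson: {durbin_watson:.2f}")
print("VIF per predictor:", np.round(vifs, 2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the exact cutoff is a convention, not a hard rule.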

## $$\text{Multiple Linear Regression}$$

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
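
To make the formula concrete, here is a small worked example with made-up numbers (the coefficients and the matrix `X_demo` are hypothetical, chosen only for illustration). It evaluates the equation for two observations at once using a matrix product.

```python
import numpy as np

# Hypothetical coefficients: intercept b_0 and slopes b_1..b_3
b0 = 10.0
b = np.array([2.0, -1.0, 0.5])

# Two observations with three predictors each (x_1, x_2, x_3)
X_demo = np.array([[1.0, 2.0, 4.0],
                   [0.0, 3.0, 2.0]])

# y = b_0 + b_1*x_1 + b_2*x_2 + b_3*x_3, evaluated for every row at once
y_hat = b0 + X_demo @ b
print(y_hat)
```

The first row gives 10 + 2 - 2 + 2 = 12 and the second gives 10 + 0 - 3 + 1 = 8.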

In [1]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd

### Import the dataset

In [2]:
df = pd.read_csv('./data/50_Startups.csv')
X = df.iloc[:,:-1].values
y = df.iloc[:,4].values

### Encode the categorical data

In [3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
onehotencoder = make_column_transformer((OneHotEncoder(), [3]), remainder = "passthrough")
X = onehotencoder.fit_transform(X)
X[:3]

array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54]], dtype=object)

### Avoid the dummy variable trap

In [4]:
X = X[:, 1:]
X[:3]

array([[0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [1.0, 0.0, 153441.51, 101145.55, 407934.54]], dtype=object)
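
The reason for dropping one dummy column is that the full set of one-hot columns sums to a column of ones, which duplicates the intercept. The sketch below uses a made-up five-row encoding (the array `dummies` is hypothetical, not the notebook's data) to show that the design matrix with all three dummies is rank-deficient, while the reduced one has full column rank.

```python
import numpy as np

# Hypothetical one-hot encoding of a 3-level category for five rows
dummies = np.array([[0, 0, 1],
                    [1, 0, 0],
                    [0, 1, 0],
                    [1, 0, 0],
                    [0, 0, 1]], dtype=float)
intercept = np.ones((5, 1))

full = np.hstack([intercept, dummies])            # intercept + all 3 dummies
reduced = np.hstack([intercept, dummies[:, 1:]])  # intercept + 2 dummies

# The dummies sum to the intercept column, so `full` has 4 columns but rank 3
rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print("full:   ", full.shape[1], "columns, rank", rank_full)
print("reduced:", reduced.shape[1], "columns, rank", rank_reduced)
```

With a rank-deficient design matrix the coefficients are not uniquely determined. As an alternative to slicing by hand, scikit-learn's `OneHotEncoder` accepts `drop="first"`, which should produce the same reduced encoding.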

### Split into training and test sets

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### Scaling ...

In [6]:
# not needed in this example

# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# X_train = sc_X.fit_transform(X_train)
# X_test = sc_X.transform(X_test)

### Fit the multiple linear regression model

In [7]:
from sklearn.linear_model import LinearRegression

regression = LinearRegression()
regression.fit(X_train, y_train)

### Predict the test set results

In [8]:
y_pred = regression.predict(X_test)
y_pred

array([103015.20159797, 132582.27760815, 132447.73845175,  71976.09851258,
       178537.48221055, 116161.24230165,  67851.69209676,  98791.73374688,
       113969.43533013, 167921.0656955 ])

### Build the optimal MLR model using backward elimination

#### Backward elimination

- Step 1: Choose the significance level required to stay in the model (e.g. SL = 0.05)
- Step 2: Fit the model with all candidate predictors
- Step 3: Consider the predictor with the highest p-value. If P > SL, go to step 4; otherwise stop (the model is ready)
- Step 4: Remove that predictor
- Step 5: Refit the model without that variable, then return to step 3

In [9]:
import statsmodels.api as sm 

# statsmodels does not add an intercept automatically, so prepend a column of ones
X = np.append(arr=np.ones((50,1)).astype(int), values=X, axis=1)  # 50 rows, 1 column
SL = 0.05

X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regression_OLS = sm.OLS(endog = y, exog = X_opt.tolist()).fit()
regression_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Fri, 18 Nov 2022 | Prob (F-statistic): | 1.34e-27 |
| Time: | 11:11:30 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063.0 |
| Df Residuals: | 44 | BIC: | 1074.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.013e+04 | 6884.820 | 7.281 | 0.000 | 3.62e+04 | 6.4e+04 |
| x1 | 198.7888 | 3371.007 | 0.059 | 0.953 | -6595.030 | 6992.607 |
| x2 | -41.8870 | 3256.039 | -0.013 | 0.990 | -6604.003 | 6520.229 |
| x3 | 0.8060 | 0.046 | 17.369 | 0.000 | 0.712 | 0.900 |
| x4 | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| x5 | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.782 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.266 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 1.45e+06 |


In [15]:
# remove the variable with the highest p-value, refit, and repeat
X_opt = X[:, [0, 3, 5]]
regression_OLS = sm.OLS(endog = y, exog = X_opt.tolist()).fit()
regression_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.950 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 450.8 |
| Date: | Fri, 18 Nov 2022 | Prob (F-statistic): | 2.16e-31 |
| Time: | 11:43:56 | Log-Likelihood: | -525.54 |
| No. Observations: | 50 | AIC: | 1057.0 |
| Df Residuals: | 47 | BIC: | 1063.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.698e+04 | 2689.933 | 17.464 | 0.000 | 4.16e+04 | 5.24e+04 |
| x1 | 0.7966 | 0.041 | 19.266 | 0.000 | 0.713 | 0.880 |
| x2 | 0.0299 | 0.016 | 1.927 | 0.060 | -0.001 | 0.061 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.677 | Durbin-Watson: | 1.257 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.161 |
| Skew: | -0.939 | Prob(JB): | 2.54e-05 |
| Kurtosis: | 5.575 | Cond. No. | 5.32e+05 |
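
The two summaries report both R-squared and adjusted R-squared. As a quick arithmetic check, the adjusted value can be recomputed from the reported R-squared, the number of observations (50), and the number of predictors (5 and 2, from "Df Model"). The helper name `adjusted_r2` is an illustrative assumption, and the inputs are the rounded figures shown in the summaries.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values taken from the two OLS summaries (R-squared is already rounded there)
adj_full = adjusted_r2(0.951, n_obs=50, n_predictors=5)
adj_reduced = adjusted_r2(0.950, n_obs=50, n_predictors=2)

print(f"full model:    {adj_full:.3f}")
print(f"reduced model: {adj_reduced:.3f}")
```

Although R-squared drops slightly when three predictors are removed, the adjusted value rises, because the penalty for the number of predictors shrinks faster than the fit degrades.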


In [None]:
# Backward elimination using p-values only:
def backwardElimination(x, sl):
    # x: 2-D NumPy array of predictors (including the column of ones); y is the global target
    while True:
        regressor_OLS = sm.OLS(y, x.tolist()).fit()
        pvalues = np.asarray(regressor_OLS.pvalues, dtype=float)
        j = int(np.argmax(pvalues))  # index of the predictor with the highest p-value
        if pvalues[j] <= sl or x.shape[1] == 1:
            break  # every remaining predictor is significant (or only one is left)
        x = np.delete(x, j, 1)  # drop that predictor and refit
    print(regressor_OLS.summary())
    return x

SL = 0.05
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, SL)

In [None]:
# Backward elimination using p-values and adjusted R-squared:
def backwardElimination(x, SL):
    # x: 2-D NumPy array of predictors (including the column of ones); y is the global target
    while True:
        regressor_OLS = sm.OLS(y, x.tolist()).fit()
        pvalues = np.asarray(regressor_OLS.pvalues, dtype=float)
        j = int(np.argmax(pvalues))  # index of the predictor with the highest p-value
        if pvalues[j] <= SL or x.shape[1] == 1:
            break
        x_reduced = np.delete(x, j, 1)
        tmp_regressor = sm.OLS(y, x_reduced.tolist()).fit()
        # roll back: keep the predictor if removing it does not raise adjusted R-squared
        if regressor_OLS.rsquared_adj >= tmp_regressor.rsquared_adj:
            break
        x = x_reduced
    print(regressor_OLS.summary())
    return x

SL = 0.05
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, SL)

Can multiple linear regression be used to predict a dependent variable that grows exponentially over time?
False: exponential growth violates the linearity assumption, so the model cannot be applied to such a variable directly.
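
A short sketch can illustrate why. It uses synthetic, noise-free data (the series `y_exp` and the helper `r_squared` are assumptions for illustration) and compares a straight line fitted to the raw values with one fitted to their logarithm. The log transform is a common workaround and is not something the original notebook applies.

```python
import numpy as np

# Synthetic exponential growth (illustrative values): y = 5 * exp(0.3 * t)
t = np.arange(0, 20, dtype=float)
y_exp = 5.0 * np.exp(0.3 * t)


def r_squared(y_true, y_fit):
    """Coefficient of determination of a fitted curve."""
    ss_res = np.sum((y_true - y_fit) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Straight line fitted to the raw values: the relationship is not linear
slope_raw, intercept_raw = np.polyfit(t, y_exp, 1)
r2_raw = r_squared(y_exp, slope_raw * t + intercept_raw)

# Straight line fitted to log(y): log(y) = log(5) + 0.3 * t is exactly linear
slope_log, intercept_log = np.polyfit(t, np.log(y_exp), 1)
r2_log = r_squared(np.log(y_exp), slope_log * t + intercept_log)

print(f"R^2 on raw y:  {r2_raw:.3f}")
print(f"R^2 on log(y): {r2_log:.3f}")
print(f"recovered rate: {slope_log:.3f}, recovered scale: {np.exp(intercept_log):.3f}")
```

On the raw values the line fits poorly, while on the log scale it recovers the growth rate and the scale factor, which is why such targets are usually transformed before a linear model is fitted.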