## Challenge 4: Cross-Validation

### 1. Objectives:
- Apply cross-validation to evaluate a Multiple Linear Regression model
 
---
    
### 2. Development:

For this Challenge we will use the same dataset as in the previous Challenge. Choose the variables that gave you the best result there. Using those variables, carry out the following steps:

1. Train a Multiple Linear Regression model using K-fold cross-validation.
2. Compute the mean of your scores and their level of uncertainty (for example, the standard deviation across folds).
3. Compare your result with the one obtained in the previous Challenge.
4. Share your findings with your classmates.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Build the URL with plain string formatting: os.path.join uses the OS path
# separator, which would insert a backslash on Windows and break the URL.
mainpath = "https://raw.githubusercontent.com/EduHdzVillasana/B2-Analisis-de-Datos-con-Python-2020-Santander/main/Datasets"
filename = "wine_quality_red-clean.csv"
df = pd.read_csv(f"{mainpath}/{filename}", index_col=0)

df.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [3]:
# Predictors (independent variables) and target (dependent variable) for the regression
variables_independientes = ["citric_acid","density"]
variable_dependiente = "fixed_acidity"

In [4]:
def summary(data):
  """Return min, quartiles, median, max, IQR and mean of a pandas Series as a dict."""
  minimo = data.min()
  Q1 = data.quantile(0.25)
  median = data.median()
  mean = data.mean()
  Q2 = data.quantile(0.5)  # same value as the median
  Q3 = data.quantile(0.75)
  maximo = data.max()
  IQR = Q3 - Q1  # interquartile range
  resumen = {'minimo': minimo, 'Q1': Q1, 'mediana':median,'Q2':Q2,'Q3':Q3,'max':maximo,'IQR':IQR, 'media':mean}
  return resumen

In [6]:
# Summary statistics for "density", the second independent variable
summary_dict = summary(df[variables_independientes[1]])
summary_dict

{'IQR': 0.002234999999999876,
 'Q1': 0.9956,
 'Q2': 0.99675,
 'Q3': 0.9978349999999999,
 'max': 1.00369,
 'media': 0.9967466791744833,
 'mediana': 0.99675,
 'minimo': 0.9900700000000001}

In [7]:
# Remove density outliers: keep rows strictly within 1.5*IQR of the quartiles (Tukey's rule)
IQR = summary_dict["IQR"]
Q1 = summary_dict["Q1"]
Q3 = summary_dict["Q3"]
df = df[(df["density"] > Q1 - 1.5*IQR ) & (df["density"] < Q3 + 1.5*IQR )]

In [8]:
summary_dict = summary(df[variable_dependiente])
summary_dict

{'IQR': 2.0749999999999993,
 'Q1': 7.1,
 'Q2': 7.9,
 'Q3': 9.174999999999999,
 'max': 15.9,
 'media': 8.291570141570169,
 'mediana': 7.9,
 'minimo': 4.6}

In [9]:
# Apply the same 1.5*IQR outlier filter to the dependent variable
IQR = summary_dict["IQR"]
Q1 = summary_dict["Q1"]
Q3 = summary_dict["Q3"]
df = df[(df[variable_dependiente] > Q1 - 1.5*IQR ) & (df[variable_dependiente] < Q3 + 1.5*IQR )]
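
The two filtering cells above repeat the same 1.5 × IQR rule. As a minimal sketch, the rule could be wrapped in a small helper. The name `filter_iqr`, its `k` parameter, and the toy data are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

def filter_iqr(df, column, k=1.5):
    """Keep rows whose `column` value lies strictly between Q1 - k*IQR and Q3 + k*IQR.

    Hypothetical helper: it mirrors the strict inequalities used in the notebook cells.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = (df[column] > q1 - k * iqr) & (df[column] < q3 + k * iqr)
    return df[mask]

# Toy data (made up for illustration): 100.0 is an obvious outlier.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})
filtered = filter_iqr(toy, "x")
print(filtered["x"].tolist())
```

With such a helper, the two cells would reduce to `df = filter_iqr(df, "density")` followed by `df = filter_iqr(df, variable_dependiente)`.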

In [10]:
X = df[variables_independientes]
y = df[variable_dependiente]

In [11]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

In [12]:
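# cv is left at its default, which gives 5-fold cross-validation (hence the five scores in the output)
scores = cross_validate(lr, X, y, scoring='r2')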
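
When `cv` is not given, `cross_validate` uses 5 folds, and for a regressor those folds are consecutive blocks of rows with no shuffling. If the rows happen to be ordered in some way, individual folds may be unrepresentative. A minimal sketch of passing an explicit, shuffled splitter follows. It uses synthetic data, and the data, the `random_state` values, and the `*_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data, made up for this sketch.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Shuffle the rows before splitting into 5 folds; random_state makes the split reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_validate(LinearRegression(), X_demo, y_demo, cv=kf, scoring="r2")

print(demo_scores["test_score"])
```

Whether shuffling changes the scores on the wine dataset depends on how its rows are ordered, so it may be worth trying both variants and comparing.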

In [13]:
scores

{'fit_time': array([0.02491307, 0.00253773, 0.00235128, 0.00236821, 0.00199008]),
 'score_time': array([0.00159502, 0.00139403, 0.00152636, 0.00140691, 0.00113153]),
 'test_score': array([0.41121081, 0.59026741, 0.49956262, 0.44678339, 0.16407308])}
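
Each entry of `test_score` is the R² measured on one held-out fold after fitting the model on the remaining folds. The loop below is a rough sketch of that procedure on synthetic data (the data and variable names are assumptions). It is meant to illustrate the idea and does not reproduce scikit-learn's internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data, made up for this sketch.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 1.5 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=100)

kf = KFold(n_splits=5)  # consecutive folds, no shuffling

manual_scores = []
for train_idx, test_idx in kf.split(X_demo):
    # Fit on four folds, then score on the fold that was held out.
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = model.predict(X_demo[test_idx])
    manual_scores.append(r2_score(y_demo[test_idx], y_pred))
manual_scores = np.array(manual_scores)

# The same splitter passed to cross_validate should give matching scores.
library_scores = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=kf, scoring="r2"
)["test_score"]
print(np.allclose(manual_scores, library_scores))
```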

In [14]:
print(f'Model score: {scores["test_score"].mean():.3f} +/- {scores["test_score"].std():.3f}')

Model score: 0.422 +/- 0.143
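
The value after `+/-` is the standard deviation of the five fold scores. NumPy's `std` uses `ddof=0` (the population formula) by default; with only five folds, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below recomputes both from the `test_score` values printed above. Reporting the `ddof=1` variant or a standard error is a suggestion, not something the original notebook does.

```python
import numpy as np

# Fold scores copied from the cross_validate output in this notebook.
test_score = np.array([0.41121081, 0.59026741, 0.49956262, 0.44678339, 0.16407308])

mean = test_score.mean()
std_pop = test_score.std()            # ddof=0, the value the notebook reports
std_sample = test_score.std(ddof=1)   # ddof=1, sample standard deviation
std_err = std_sample / np.sqrt(len(test_score))  # standard error of the mean

print(f"mean={mean:.3f}  std(ddof=0)={std_pop:.3f}  std(ddof=1)={std_sample:.3f}  SE={std_err:.3f}")
# mean=0.422  std(ddof=0)=0.143  std(ddof=1)=0.159  SE=0.071
```

The last fold (0.164) scores far below the other four, and that single fold accounts for most of the spread.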
