# 1.1 - Intro to Machine Learning - Supervised Learning - Regression

![venn_ml](images/venn_ml.png)

![ext_sklearn](images/ext_sklearn.jpeg)

![sklearn](images/sklearn.png)

### WorkFlow


1. [Get data](#1.-Get-Data)
2. [Define objective](#2.-Define-Objective)
3. [Data cleaning (units, outliers, one-hot encoding, etc.) (**)](#3.-Data-Cleaning)
4. [Define model (regression, classification, ...)](#4.-Model)
5. [Train (hyperparameters, validation, ...) (**)](#5.-Training)
6. [Predict (test)](#6.-Prediction)
7. [Evaluation](#7.-Evaluation)
8. [If the error is too high, go back to the steps marked (**)](#WorkFlow)
9. [Super-Bonus H2O](#8.-H2O)

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### 1. Get Data

**Column descriptions:**

+ carat: weight of the diamond (carats)

+ cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

+ color: color, from D (best) to J (worst)

+ clarity: clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

+ table: width of the top facet of the diamond, relative to its widest point

+ x: length in mm

+ y: width in mm

+ z: depth (height) in mm

+ depth: total depth percentage, 100 * 2*z/(x+y)

+ price: price in US dollars


![dia](images/dia.jpg)
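As a quick check of the `depth` definition, the sketch below recomputes it from `x`, `y` and `z` for a single made-up stone. The dimensions are illustrative values, not rows read from `diamonds.csv`, and expressing the result as a percentage is an assumption about how the column is stored.

```python
import pandas as pd

# Hypothetical stone: dimensions in mm (illustrative values only)
stone = pd.DataFrame({'x': [3.95], 'y': [3.98], 'z': [2.43]})

# depth = z divided by the mean of x and y, i.e. 2*z/(x+y), as a percentage
stone['depth_pct'] = 100 * 2 * stone.z / (stone.x + stone.y)

print(stone.round(2))
```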

In [None]:
df=pd.read_csv('../data/diamonds.csv')

df.head()

In [None]:
df.info()

### 2. Define Objective


```The goal of this exercise is to predict the price of diamonds from features such as weight, color, cut, and clarity.```

### 3. Data Cleaning

The cleaning process is the usual one:

+ Null values
+ Inconsistent data
+ Duplicated data...

All of this also requires an **EDA** (exploratory data analysis). Beyond that, the data has to be reshaped, a process called `data wrangling`: preparing the data so that the machine learning model can use it as effectively as possible. For example, cut, color, and clarity are categorical columns stored as strings. They have to be converted before feeding the model, because machines only understand numbers 🤣.

This process is not just necessary but fundamental. **Everything is in the data.**
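The three checks in the list above (nulls, inconsistent values, duplicates) can be sketched with a few pandas calls. The example runs on a tiny made-up frame rather than the real file, and treating a zero dimension as an inconsistency is an assumption chosen for illustration.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the diamonds data (values are made up)
toy = pd.DataFrame({
    'carat': [0.23, 0.21, 0.21, np.nan],
    'cut':   ['Ideal', 'Premium', 'Premium', 'Good'],
    'x':     [3.95, 3.89, 3.89, 0.0],
})

n_nulls = toy.isna().sum()            # null values per column
n_dupes = toy.duplicated().sum()      # fully duplicated rows
n_zero_x = (toy['x'] == 0).sum()      # physically impossible dimension

print(n_nulls, n_dupes, n_zero_x, sep='\n')

# One possible cleanup: drop nulls, duplicates and zero-sized stones
clean = toy.dropna().drop_duplicates()
clean = clean[clean['x'] > 0]
```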

In [None]:
df.info(memory_usage='deep')

In [None]:
df.describe(include='all').T

In [None]:
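# looking for collinearity

plt.figure(figsize=(15, 10))

sns.set(style='white')

# correlations between numeric columns only (cut, color and clarity are strings)
corr = df.corr(numeric_only=True)

# hide the upper triangle, which mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))

cmap = sns.diverging_palette(0, 10, as_cmap=True)

sns.heatmap(corr,
            mask=mask,
            cmap=cmap,
            center=0,
            square=True,
            annot=True,
            linewidths=0.5,
            cbar_kws={'shrink': 0.5});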

In [None]:
df['vol'] = df.x * df.y * df.z

df.head()

In [None]:
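# looking for collinearity again, now including the new vol column

plt.figure(figsize=(15, 10))

sns.set(style='white')

# correlations between numeric columns only (cut, color and clarity are strings)
corr = df.corr(numeric_only=True)

# hide the upper triangle, which mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))

cmap = sns.diverging_palette(0, 10, as_cmap=True)

sns.heatmap(corr,
            mask=mask,
            cmap=cmap,
            center=0,
            square=True,
            annot=True,
            linewidths=0.5,
            cbar_kws={'shrink': 0.5});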

In [None]:
# scatter matrix

pd.plotting.scatter_matrix(df, figsize=(15, 10), alpha=0.2);

In [None]:
plt.figure(figsize=(15,10))

plt.scatter(df.carat, df.price)

plt.ylabel('price')
plt.xlabel('carat');

## Reflection: how can we check that these really are diamonds?

1 carat = 0.2 g of diamond.

The density of diamond is about 3.4 - 3.5 g/cm³.
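To make the two figures above concrete, this short sketch works out the volume a one-carat stone should occupy. The stone is hypothetical and the density is taken at the upper end of the quoted range.

```python
# Figures quoted in the text
GRAMS_PER_CARAT = 0.2
DENSITY_G_CM3 = 3.5      # upper end of the 3.4 - 3.5 range

carat = 1.0                                  # hypothetical stone
mass_g = carat * GRAMS_PER_CARAT             # 0.2 g
volume_cm3 = mass_g / DENSITY_G_CM3          # volume = mass / density
volume_mm3 = volume_cm3 * 1000               # 1 cm³ = 1000 mm³

print(f'{volume_mm3:.1f} mm³')
```

A stone whose bounding box `x * y * z` is smaller than this volume cannot be a one-carat diamond.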

In [None]:
df.head()

In [None]:
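df2 = df.copy()

# weight in grams (1 carat = 0.2 g)
df2['gr'] = 0.2 * df2.carat

# x*y*z is the bounding box in mm³; /1000 converts it to cm³.
# The extra /2.5 looks like an empirical shape factor, since a cut stone
# fills only part of its bounding box (assumption: the notebook does not explain it).
df2['vol'] = df2.x * df2.y * df2.z / 1000 / 2.5

# implied density in g/cm³ (rows with a zero dimension give inf)
df2['density'] = df2.gr / df2.vol

df2.head()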

In [None]:
df2.describe().T

In [None]:
df2[df2.density > 3.7].shape

In [None]:
df2[df2.density < 3.02].shape

## Splitting the data.

![X_y_tts](images/X_y_tts.png)

Before making any definitive transformation, we split the data into X and y. y is the target column, that is, the price. The target column is never touched: it is not transformed in any way. X holds the remaining columns, the features we will use to make our predictions.

**0 fix the data**

In [None]:
X = df.drop('price', axis=1)

y = df.price

In [None]:
#%pip install scikit-learn

In [None]:
# standardization: rescale each column to mean 0 and standard deviation 1

from sklearn.preprocessing import StandardScaler

In [None]:
X[['depth', 'table', 'x', 'y', 'z']] = StandardScaler().fit_transform(X[['depth', 'table', 'x', 'y', 'z']])

X.head()
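The cell above fits the scaler on the whole of `X`, so the mean and standard deviation also reflect rows that later end up in the test set. A common alternative, sketched here on synthetic data, is to split first, fit the scaler on the training rows only, and reuse those statistics for the test rows. The array `X_demo` and the other variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=60, scale=2, size=(100, 3))   # synthetic features

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)      # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)     # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(3))  # approximately 0 for every column
```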

**transforming categorical columns in several ways**

In [None]:
# one-hot encoding, dummy variables (this works very well in regression)

X = pd.get_dummies(X, columns=['cut'], drop_first=True)

X.head()
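To see what `drop_first=True` does, here is a small sketch on a made-up column with three cut levels. pandas sorts the categories, drops the first one (`Fair` here) and keeps one indicator column for each remaining level, so a row of all zeros stands for the dropped category.

```python
import pandas as pd

toy = pd.DataFrame({'cut': ['Fair', 'Good', 'Ideal', 'Good']})   # made-up values

dummies = pd.get_dummies(toy, columns=['cut'], drop_first=True)

print(dummies.columns.tolist())
```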

In [None]:
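# label encoder, ordinal encoder

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder assigns integer codes in alphabetical order (I1, IF, SI1, ...),
# which is not the clarity quality order
X.clarity = LabelEncoder().fit_transform(X.clarity)

X.head()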
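`LabelEncoder` numbers the categories alphabetically, which does not follow the clarity scale listed in section 1. If the order should be preserved, one option is `OrdinalEncoder` with an explicit category list. The sketch below uses a toy column; the worst-to-best order is taken from the column description in section 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Clarity grades from worst to best, as listed in the column description
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

toy = pd.DataFrame({'clarity': ['SI2', 'IF', 'I1', 'VS1']})   # made-up rows

encoder = OrdinalEncoder(categories=[clarity_order])
toy['clarity_code'] = encoder.fit_transform(toy[['clarity']]).ravel()

print(toy)
```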

In [None]:
df.groupby('color').price.mean()

In [None]:
df.color.unique()

In [None]:
# custom label encoder
# this is where expert knowledge comes in


color = {'J': 1 , 'I': 2, 'H': 5, 'G': 15, 'F': 25, 'E': 45, 'D': 85}

X['color'] = X['color'].map(color)

X.head()

In [None]:
X.info()

In [None]:
X.head()

**1 train test split**

In [None]:
X.shape, y.shape

In [None]:
# train_test_split

from sklearn.model_selection import train_test_split as tts  # the alias is my own choice

X_train, X_test, y_train, y_test  = tts(X, y, train_size=0.8, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

Now we can feed the predictive model. We will try several different models.

### 4. Model

Add LazyModel

**Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression as LinReg   # the alias is mine

from sklearn.linear_model import Lasso        # L1 regularization
from sklearn.linear_model import Ridge        # L2 regularization
from sklearn.linear_model import ElasticNet   # L1+L2 regularization


# instantiate the models

linreg=LinReg()
lasso=Lasso()
ridge=Ridge()
elastic=ElasticNet()

In [None]:
linreg
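The import comments above label Lasso as L1, Ridge as L2 and ElasticNet as a mix of both. A practical difference is that an L1 penalty can set coefficients exactly to zero, while an L2 penalty only shrinks them. The sketch below illustrates this on synthetic data in which only the first feature matters; the data and the `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))                              # three synthetic features
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)     # only the first one matters

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

print('lasso:', lasso_demo.coef_)   # irrelevant features are driven to exactly 0
print('ridge:', ridge_demo.coef_)   # coefficients are shrunk but stay non-zero
```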

**SVR**

In [None]:
from sklearn.svm import SVR  # support vector regressor

svr=SVR()

**Random Forest**

In [None]:
from sklearn.ensemble import RandomForestRegressor as RFR

# a single extremely randomized tree; the ensemble version is sklearn.ensemble.ExtraTreesRegressor
from sklearn.tree import ExtraTreeRegressor as ETR

rfr=RFR()

etr=ETR()

**Boosting**

In [None]:
#%pip install xgboost

#%pip install catboost

#%pip install lightgbm
#conda install lightgbm

In [None]:
from sklearn.ensemble import GradientBoostingRegressor as GBR

from xgboost import XGBRegressor as XGBR

from catboost import CatBoostRegressor as CTR

from lightgbm import LGBMRegressor as LGBMR


gbr=GBR()
xgbr=XGBR()
ctr=CTR()
lgbmr=LGBMR()

In [None]:
#lazy
#%pip install lazypredict

In [None]:
from lazypredict.Supervised import LazyRegressor 

lazy=LazyRegressor()

lazy

### 5. Training

In [None]:
# linear regressions

linreg.fit(X_train, y_train)
lasso.fit(X_train, y_train)
ridge.fit(X_train, y_train)
elastic.fit(X_train, y_train)


In [None]:
lasso.intercept_

In [None]:
lasso.coef_

In [None]:
X.columns

In [None]:
dict(zip(X.columns, lasso.coef_))

In [None]:
# svr

svr.fit(X_train, y_train)

In [None]:
# rfr and etr

rfr.fit(X_train, y_train)
etr.fit(X_train, y_train)

In [None]:
# boosting

gbr.fit(X_train, y_train)

In [None]:
xgbr.fit(X_train, y_train)

In [None]:
ctr.fit(X_train, y_train, verbose=0)

In [None]:
lgbmr.fit(X_train, y_train)

In [None]:
# the lazy one

lazy.fit(X_train, X_test, y_train, y_test)

In [None]:
#help(lazy)

In [None]:
# brute-force training: fit several models in a loop

modelos=[linreg, lasso, ridge, xgbr]

for m in modelos:
    m.fit(X_train, y_train)

### 6. Prediction

In [None]:
X_test.head()

In [None]:
linreg.predict(X_test)[:10]

In [None]:
lasso.predict(X_test)[:10]

In [None]:
ridge.predict(X_test)[:10]

In [None]:
elastic.predict(X_test)[:10]

In [None]:
# svr

svr.predict(X_test)[:10]

In [None]:
# rfr

rfr.predict(X_test)[:10]

In [None]:
# etr

etr.predict(X_test)[:10]

In [None]:
# boosting

gbr.predict(X_test)[:10]

In [None]:
xgbr.predict(X_test)[:10]

In [None]:
ctr.predict(X_test)[:10]

In [None]:
lgbmr.predict(X_test)[:10]

### 7. Evaluation

In [None]:
y_test.head()

In [None]:
from sklearn.metrics import mean_squared_error as mse  # mean squared error

In [None]:
# linear regression

y_pred = linreg.predict(X_test)

mse(y_test, y_pred) ** 0.5   # RMSE: square root of the MSE

In [None]:
# lasso

y_pred=lasso.predict(X_test)

mse(y_test, y_pred) ** 0.5  # RMSE

In [None]:
# ridge

y_pred=ridge.predict(X_test)

mse(y_test, y_pred) ** 0.5  # RMSE

In [None]:
# elastic

y_pred=elastic.predict(X_test)

mse(y_test, y_pred) ** 0.5  # RMSE

In [None]:
# etr

y_pred=etr.predict(X_test)

mse(y_test, y_pred) ** 0.5  # RMSE

In [None]:
# rfr

y_pred=rfr.predict(X_test)

mse(y_test, y_pred) ** 0.5  # RMSE

In [None]:
# svr

y_pred=svr.predict(X_test)

mse(y_test, y_pred) ** 0.5  # RMSE

In [None]:
# boosting

y_pred=xgbr.predict(X_test)

mse(y_test, y_pred) ** 0.5  # RMSE

In [None]:
y_pred=ctr.predict(X_test)

mse(y_test, y_pred) ** 0.5  # RMSE

In [None]:
y_pred=lgbmr.predict(X_test)

mse(y_test, y_pred) ** 0.5  # RMSE

In [None]:
y_test.min(), y_test.mean(), y_test.max()
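The evaluation cells above repeat the same two lines for every model. As a sketch of how the comparison could be collected in one place, the snippet below defines RMSE directly with NumPy and scores several models in a loop. It runs on synthetic data, and the helper `rmse` and the `models` dictionary are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)**2))."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic regression problem standing in for the diamonds data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {'linreg': LinearRegression(), 'ridge': Ridge()}

# Fit each model on the training split and score it on the held-out split
scores = {name: rmse(y_te, m.fit(X_tr, y_tr).predict(X_te)) for name, m in models.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f'{name}: RMSE = {score:.3f}')
```

Because RMSE is expressed in the same units as the target, each score can be read against the minimum, mean and maximum of `y_test` printed above.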

### 8. H2O

https://www.cienciadedatos.net/documentos/py04_machine_learning_con_h2o_y_python


In [None]:
%pip install h2o

In [None]:
import h2o

from h2o.automl import H2OAutoML

In [None]:
# load data

train=pd.read_csv('../data/diamonds_train.csv')
test=pd.read_csv('../data/diamonds_test.csv')

train.head()

In [None]:
test.head()

In [None]:
# start (or connect to) the local H2O cluster

h2o.init()

In [None]:
# convert the data to H2OFrame objects

h2train=h2o.H2OFrame(train)
h2test=h2o.H2OFrame(test)

In [None]:
h2train.columns

In [None]:
X=[c for c in h2train.columns if c!='price']

y='price'

In [None]:
# set up the AutoML run

automl=H2OAutoML(max_models=20,
                 seed=42,   # random_state
                 max_runtime_secs=300,
                 sort_metric='RMSE')

In [None]:
# train

automl.train(x=X,
             y=y,
             training_frame=h2train)

In [None]:
print('[INFO] Leader board:')

leader_board=automl.leaderboard

leader_board.head()

In [None]:
# predictions from the leader model

y_pred=automl.leader.predict(h2test)

In [None]:
y_pred[:10]