# LAB 5

This lab forecasts the price of used vehicles.
The original dataset contains the following columns:

- Car_Name: Vehicle name.
- Year: Year of manufacture.
- Selling_Price: Selling price.
- Present_Price: Current price.
- Driven_kms: Kilometers driven.
- Fuel_Type: Fuel type.
- Selling_type: Seller type.
- Transmission: Transmission type.
- Owner: Number of owners.

The dataset is already split into training and test sets in the "files/input/" folder.



## Step 1.

Preprocess the data.
- Create the 'Age' column from the 'Year' column. Assume the current year is 2021.
- Drop the 'Year' and 'Car_Name' columns.
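
As a small illustration of the two preprocessing rules, the sketch below applies them to a made-up two-row frame. The car names, years, and prices are invented for the example and are not taken from the dataset; only the column names follow it.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names follow the dataset.
toy = pd.DataFrame({
    "Car_Name": ["car_a", "car_b"],
    "Year": [2014, 2018],
    "Selling_Price": [3.5, 6.0],
})

# Age is the assumed current year (2021) minus 'Year'.
# assign() returns a new frame, so `toy` itself is left unchanged.
toy_clean = toy.assign(Age=2021 - toy["Year"]).drop(columns=["Year", "Car_Name"])
print(toy_clean)
```

Using `assign` followed by `drop` keeps the original frame intact, which is convenient when the raw data is inspected again later in the notebook.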

In [1]:
test_path = '../files/input/test_data.csv.zip'
train_path = '../files/input/train_data.csv.zip'

# Load the data
def load_data(test_path, train_path):
    import pandas as pd
    test_df = pd.read_csv(test_path, compression='zip')
    train_df = pd.read_csv(train_path, compression='zip')
    return test_df, train_df

test_df, train_df = load_data(test_path, train_path)
print(train_df.describe().columns)
print('='*120)
print("Train DataFrame:")
#print('-'*120)
print(train_df.shape)
print('-'*120)
print(train_df.describe())
print('='*120)
print("Test DataFrame:")
#print('-'*120)
print(test_df.shape)
#print('-'*120)
#print(test_df.describe())

Index(['Year', 'Selling_Price', 'Present_Price', 'Driven_kms', 'Owner'], dtype='object')
Train DataFrame:
(211, 9)
------------------------------------------------------------------------------------------------------------------------
              Year  Selling_Price  Present_Price     Driven_kms       Owner
count   211.000000     211.000000     211.000000     211.000000  211.000000
mean   2013.644550       4.692512       7.561090   35578.009479    0.047393
std       2.794843       4.819333       7.382453   28912.475577    0.271907
min    2003.000000       0.100000       0.480000    1200.000000    0.000000
25%    2012.000000       1.025000       1.365000   15000.000000    0.000000
50%    2014.000000       3.750000       6.400000   32000.000000    0.000000
75%    2016.000000       6.050000       9.900000   47500.000000    0.000000
max    2018.000000      23.500000      35.960000  213000.000000    3.000000
Test DataFrame:
(90, 9)


In [2]:
def preprocess_data(df):
    # Work on a copy so the DataFrame passed in is not modified in place
    df = df.copy()
    df['Age'] = 2021 - df['Year']
    df = df.drop(columns=['Year', 'Car_Name'])
    return df

df_train = preprocess_data(train_df)
df_test = preprocess_data(test_df)

print("Preprocessed Train DataFrame:")
#print('-'*120)
print(df_train.shape)
#print('-'*120)
#print(df_train.head())
#print('-'*120)
print("Preprocessed Test DataFrame:")
#print('-'*120)
print(df_test.shape)
#print('-'*120)
#print(df_test.head())

Preprocessed Train DataFrame:
(211, 8)
Preprocessed Test DataFrame:
(90, 8)


## Step 2.

Split the datasets into x_train, y_train, x_test, y_test.

In [3]:
X_train = df_train.drop(columns=['Present_Price'])
y_train = df_train['Present_Price']
X_test = df_test.drop(columns=['Present_Price'])
y_test = df_test['Present_Price']

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (211, 7)
y_train shape: (211,)
X_test shape: (90, 7)
y_test shape: (90,)


## Step 3.
Create a pipeline for the regression model. The pipeline must contain the following stages:

- Transform the categorical variables using one-hot encoding.
- Scale the numerical variables to the interval [0, 1].
- Select the K best features.
- Fit a linear regression model.

In [4]:
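# describe() summarizes only numeric columns, so its column index lists the numerical features
numerical = X_train.describe().columns.tolist()
# Every remaining column is treated as categorical
categorical = [c for c in X_train.columns if c not in numerical]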
print("Numerical features:", numerical)
print("Categorical features:", categorical)

def make_pipeline(categorical, numerical):
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler, PolynomialFeatures
    from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
    from sklearn.linear_model import LinearRegression

    
    preprocessor = ColumnTransformer(
        transformers=[
            #('num', StandardScaler(), numerical),
            ('num', MinMaxScaler(), numerical),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical)
        ], remainder='drop')

    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        #('poly', PolynomialFeatures()),  # Adds a polynomial transformation
        ('feature_selection', SelectKBest(score_func=f_regression)),
        ('regressor', LinearRegression())
    ])

    return pipeline

estimator = make_pipeline(categorical, numerical)
print(estimator)

Numerical features: ['Selling_Price', 'Driven_kms', 'Owner', 'Age']
Categorical features: ['Fuel_Type', 'Selling_type', 'Transmission']
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', MinMaxScaler(),
                                                  ['Selling_Price',
                                                   'Driven_kms', 'Owner',
                                                   'Age']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Fuel_Type', 'Selling_type',
                                                   'Transmission'])])),
                ('feature_selection',
                 SelectKBest(score_func=<function f_regression at 0x000001D972431630>)),
                ('regressor', LinearRegression())])


## Step 4.

- Optimize the pipeline hyperparameters using cross-validation.
- Use 10 splits for cross-validation. Use the mean absolute error to measure model performance.
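
One detail worth keeping in mind is that scikit-learn scorers follow a "higher is better" convention, so the mean absolute error is exposed as `neg_mean_absolute_error`. The sketch below, run on synthetic data generated only for this illustration, shows 10-fold scores coming back as non-positive numbers that must be negated to read them as errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, generated only for this illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=10, shuffle=False)
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_absolute_error"
)

# Scores are negated MAE values, so flip the sign to report the error itself.
mae_per_fold = -scores
print(mae_per_fold.mean())
```

The same convention applies to a grid search: its best score under this scoring string is the least negative value, which corresponds to the smallest mean absolute error.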

## Step 5.

Save the model (compressed with gzip) as "files/models/model.pkl.gz".
Remember that the model can be saved in compressed form using the gzip library.
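
A minimal round-trip sketch of the gzip plus pickle idea, assuming a temporary directory instead of the real "files/models/" folder and a plain dictionary standing in for a fitted model:

```python
import gzip
import os
import pickle
import tempfile

# A plain dict stands in for a fitted model; any picklable object works the same way.
obj = {"name": "demo", "coef": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl.gz")
    # gzip.open returns a file-like object, so pickle can write to it directly.
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f)
    # Reading back goes through gzip.open again to decompress on the fly.
    with gzip.open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == obj)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.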


In [5]:
from sklearn.feature_selection import mutual_info_regression, f_regression
from sklearn.model_selection import KFold


param_grid = {
    'feature_selection__k': [3, 5, 7, 10, 15, 'all'],
    'feature_selection__score_func': [f_regression, mutual_info_regression],  # candidate score functions
    'regressor__fit_intercept': [True, False],
    'regressor__copy_X': [True, False],
}

n_folds = 10
n_jobs = -1

def make_grid_search(estimator, param_grid, n_folds, n_jobs):
    from sklearn.model_selection import GridSearchCV, KFold

    # Use the n_folds argument instead of a hard-coded number of splits
    kf = KFold(n_splits=n_folds, shuffle=False)

    grid_search = GridSearchCV(estimator=estimator,
                               param_grid=param_grid,
                               scoring='neg_mean_absolute_error',
                               cv=kf,
                               n_jobs=n_jobs,
                               refit=True,
                               error_score='raise',
                               verbose=1)
    return grid_search


In [6]:
def save_model(model, filepath = '../files/models'):
    import gzip
    import pickle
    import os
    os.makedirs(filepath, exist_ok=True)
    model_path = os.path.join(filepath, 'model.pkl.gz')
    with gzip.open(model_path, 'wb') as f:
        pickle.dump(model, f)

# ----------------------------------------------------------------------------

def load_model(filepath = '../files/models', name = 'model.pkl.gz'):
    import os
    import gzip
    import pickle
    model_path = os.path.join(filepath, name)
    if not os.path.exists(model_path):
        return None
    with gzip.open(model_path, 'rb') as f:
        model = pickle.load(f)
    return model

In [7]:
def train_model(model, X_train, y_train, X_test, y_test):
    
    from sklearn.metrics import mean_absolute_error   
    model.fit(X_train, y_train)
    #best_model = load_model()

    #if best_model is not None:
        #saved_mae = mean_absolute_error(y_true = y_test, y_pred = best_model.predict(X_test))
        #current_mae = mean_absolute_error(y_true = y_test, y_pred = model.predict(X_test))
        #if saved_mae <= current_mae:
        #    model = best_model
    
    save_model(model)

# ----------------------------------------------------------------------------

def train_rl_model(param_grid, X_train, y_train, X_test, y_test, categorical, numerical):

    # Build a fresh pipeline and pass it to the grid search (not the global `estimator`)
    pipeline = make_pipeline(categorical, numerical)
    model = make_grid_search(pipeline, param_grid=param_grid, n_folds=10, n_jobs=-1)
    train_model(model, X_train, y_train, X_test, y_test)

In [8]:
train_rl_model(param_grid, X_train, y_train, X_test, y_test, categorical, numerical)

Fitting 10 folds for each of 48 candidates, totalling 480 fits


## Step 6.

Compute the r2, mean squared error, and mean absolute error metrics for the training and test sets. Save them to the file files/output/metrics.json. Each row of the file is a dictionary with the metrics of a model. This dictionary has a field indicating whether it refers to the training or the test set. For example:

- {'type': 'metrics', 'dataset': 'train', 'r2': 0.8, 'mse': 0.7, 'mad': 0.9}
- {'type': 'metrics', 'dataset': 'test', 'r2': 0.7, 'mse': 0.6, 'mad': 0.8}
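
The one-dictionary-per-line layout described here can be sketched as follows, using the two example rows and a temporary directory in place of "files/output/". Note that `json.dumps` writes double-quoted keys, unlike the Python-style single quotes shown in the example rows.

```python
import json
import os
import tempfile

# The two example rows from the task statement.
rows = [
    {"type": "metrics", "dataset": "train", "r2": 0.8, "mse": 0.7, "mad": 0.9},
    {"type": "metrics", "dataset": "test", "r2": 0.7, "mse": 0.6, "mad": 0.8},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metrics.json")
    # Write one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Read the file back line by line, skipping any blank lines.
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded[1]["dataset"])
```

Because each line is an independent JSON document, the file as a whole is not valid JSON and has to be parsed line by line rather than with a single `json.load` call.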

In [9]:
def eval_metrics(model, X_train, y_train, X_test, y_test):
        
    from sklearn.metrics import (
        r2_score,
        mean_squared_error,
        mean_absolute_error,
        median_absolute_error,
    )

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    def metrics_dict(y_true, y_pred, dataset):
        return {
            "type": "metrics",
            "dataset": dataset,
            "r2": r2_score(y_true, y_pred),
            "mse": mean_squared_error(y_true, y_pred),
            # Note: "mad" is computed here as the median absolute error
            "mad": median_absolute_error(y_true, y_pred),
        }

    # Build one row per dataset, train first and then test
    metrics_train = metrics_dict(y_train, y_pred_train, "train")
    metrics_test  = metrics_dict(y_test,  y_pred_test,  "test")

    return metrics_train, metrics_test


In [10]:
def save_report(metrics_train, metrics_test):

    import json
    import os

    path = '../files/output/metrics.json'

    os.makedirs(os.path.dirname(path), exist_ok=True)

    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        f.write(json.dumps(metrics_train) + '\n')
        f.write(json.dumps(metrics_test) + '\n')

In [11]:
def print_report(metrics_train, metrics_test, cm_train=None, cm_test=None):
    """Print a compact summary of the metrics.
    Shows the test values with the train value in parentheses.
    """
    def fmt(name, test_val, train_val):
        return f"{name:>20}: {test_val:.4f} ({train_val:.4f})"

    print("-" * 80)
    print("Metrics summary (test (train))")
    print("-" * 80)
    print(fmt("r2", metrics_test["r2"], metrics_train["r2"]))
    print(fmt("MSE", metrics_test["mse"], metrics_train["mse"]))
    # The "mad" entry is printed under the label "MAE"
    print(fmt("MAE", metrics_test["mad"], metrics_train["mad"]))
    if cm_test and cm_train:
        print("-" * 80)
        print("Confusion matrix (test):")
        print(f" true_0 -> predicted_0: {cm_test['true_0']['predicted_0']}, predicted_1: {cm_test['true_0']['predicted_1']}")
        print(f" true_1 -> predicted_0: {cm_test['true_1']['predicted_0']}, predicted_1: {cm_test['true_1']['predicted_1']}")
    print("-" * 80)

In [12]:
def check_estimator(X_train, X_test, y_train, y_test):

    model = load_model()
    # A fitted GridSearchCV exposes the refit pipeline as best_estimator_
    if hasattr(model, 'best_estimator_'):
        best_model = model.best_estimator_
    else:
        best_model = model

    metrics_train, metrics_test = eval_metrics(
        best_model,
        X_train,
        y_train,
        X_test,
        y_test,
    )
    save_report(metrics_train, metrics_test)
    print_report(metrics_train, metrics_test)


In [13]:
def print_get_params():
    model = load_model()
    print("Get model parameters:")
    for param, value in model.get_params().items():
        print(f"  {param}: {value}")


def print_best_model_params():
    model = load_model()
    print("Best model parameters:")
    for param, value in model.best_params_.items():
        print(f"  {param}: {value}")

In [14]:
check_estimator(X_train, X_test, y_train, y_test)
print_best_model_params()

--------------------------------------------------------------------------------
Metrics summary (test (train))
--------------------------------------------------------------------------------
                  r2: 0.7326 (0.8917)
                 MSE: 32.5667 (5.8746)
                 MAE: 1.5034 (1.0929)
--------------------------------------------------------------------------------
Best model parameters:
  feature_selection__k: 10
  feature_selection__score_func: <function mutual_info_regression at 0x000001D9724136D0>
  regressor__copy_X: False
  regressor__fit_intercept: False


For Train:

- R² > 0.889 (greater than)
- MSE < 5.950 (less than)
- MAD < 1.600 (less than)

For Test:

- R² > 0.728 (greater than)
- MSE < 32.910 (less than)
- MAD < 2.430 (less than)
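
These bounds can be checked mechanically. The sketch below compares the values from the printed summary (rounded to four decimals, as shown there) against the listed thresholds. The helper name `meets_thresholds` is hypothetical and introduced only for this example.

```python
# Values copied from the printed metrics summary (rounded to four decimals).
reported = {
    "train": {"r2": 0.8917, "mse": 5.8746, "mad": 1.0929},
    "test": {"r2": 0.7326, "mse": 32.5667, "mad": 1.5034},
}

# Thresholds as listed: r2 is a lower bound, mse and mad are upper bounds.
thresholds = {
    "train": {"r2": 0.889, "mse": 5.950, "mad": 1.600},
    "test": {"r2": 0.728, "mse": 32.910, "mad": 2.430},
}


def meets_thresholds(metrics, limits):
    """Hypothetical helper: True when all three bounds are satisfied."""
    return (
        metrics["r2"] > limits["r2"]
        and metrics["mse"] < limits["mse"]
        and metrics["mad"] < limits["mad"]
    )


results = {name: meets_thresholds(reported[name], thresholds[name]) for name in reported}
print(results)  # {'train': True, 'test': True}
```

With the reported numbers, both datasets satisfy all three bounds, although the test R² and MSE clear their thresholds by a small margin.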

In [15]:
print_get_params()


Get model parameters:
  cv: KFold(n_splits=10, random_state=None, shuffle=False)
  error_score: raise
  estimator__memory: None
  estimator__steps: [('preprocessor', ColumnTransformer(transformers=[('num', MinMaxScaler(),
                                 ['Selling_Price', 'Driven_kms', 'Owner',
                                  'Age']),
                                ('cat', OneHotEncoder(handle_unknown='ignore'),
                                 ['Fuel_Type', 'Selling_type',
                                  'Transmission'])])), ('feature_selection', SelectKBest(score_func=<function f_regression at 0x000001D972431630>)), ('regressor', LinearRegression())]
  estimator__transform_input: None
  estimator__verbose: False
  estimator__preprocessor: ColumnTransformer(transformers=[('num', MinMaxScaler(),
                                 ['Selling_Price', 'Driven_kms', 'Owner',
                                  'Age']),
                                ('cat', OneHotEncoder(handle_unknown='i