<a href="https://colab.research.google.com/github/JCaballerot/Decision_Tree_Learning/blob/main/0.%20Regression_trees/%C3%81rboles_de_Decisi%C3%B3n_en_Python_con_Databricks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://www.ctic.uni.edu.pe/wp-content/uploads/2022/04/588px-x-348px-web-1.png" alt="HTML5 Icon" width="900" height="350" >



<h1 align=center><font size = 5>Lab: Decision Trees in Python with Databricks
</font></h1>


---

## Objectives
- Understand how a Decision Tree model is built and evaluated.
- Practice data preprocessing, including outlier treatment and sampling.
- Visualize feature importance and node stability in a decision tree.
- Apply and evaluate a Decision Tree model.

---

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
    
1. <a href="#item31">Introduction</a>  

2. <a href="#item32">Data Preprocessing</a>  
    - Data sampling
    - Outlier treatment
3. <a href="#item33">Training and Visualization</a>  
    - Training the Decision Tree
    - Visualizing the Decision Tree
4. <a href="#item34">Model Evaluation</a>  
    - Predictions and model evaluation
    - Percentage deviation analysis
5. <a href="#item34">Feature Importance and Node Stability</a>  
    - Feature importance
    - Node stability
6. <a href="#item34">Interactive Trees</a>  
    - Functions to evaluate and visualize optimal cuts
    - Evaluating and visualizing the best cut

</font>
</div>

---

We run an external script, located at the path below, to load helper functions.


In [None]:
%run "/Workspace/Users/jcaballerot@bcp.com.pe/2024Q3 - Decision Trees/libs/aux_funs"

We import the libraries used for data handling and for plots and visualizations.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


### 1. Data Preprocessing

We load the CSV file into a pandas DataFrame and display the first rows to check the data.


In [None]:

pddf = pd.read_csv('/Workspace/Users/jcaballerot@bcp.com.pe/2024Q3 - Decision Trees/data/HM_FLUJOS_fad_mvp2.csv', index_col = 0)
pddf.head()


We group the data by the 'codmes_flujo' column and count the number of rows in each group.


In [None]:

pddf.groupby('codmes_flujo').size()

We compute the mean and the number of non-missing values of 'fad' for each 'codmes_flujo'.

In [None]:
mean_values  = pddf.groupby('codmes_flujo').fad.mean()
count_values = pddf.groupby('codmes_flujo').fad.count()
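As a hedged aside, both statistics can also be obtained from a single `groupby` using named aggregation. The toy DataFrame below is invented for illustration; only the column names `codmes_flujo` and `fad` come from the lab.

```python
import pandas as pd

# Toy data (hypothetical values), same column names as the lab
toy = pd.DataFrame({
    'codmes_flujo': [202401, 202401, 202402, 202402, 202402],
    'fad': [100.0, 300.0, 50.0, 150.0, 250.0],
})

# One groupby, two named aggregations
summary = toy.groupby('codmes_flujo').agg(
    mean_fad=('fad', 'mean'),
    n=('fad', 'count'),
)
print(summary)
```

Note that `count` ignores missing values of `fad`, whereas `size()` counts every row in the group.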

In [None]:

# Create a figure and an axis
fig, ax1 = plt.subplots(figsize=(8, 4))

ax1.bar(count_values.index, count_values, color='lightgrey', alpha=0.5, label='Count')
ax2 = ax1.twinx()

# Plot the monthly average as a line on the secondary axis
sns.lineplot(x=mean_values.index, y=mean_values, ax=ax2, color='blue', label='Average Value', marker='o')

# Labels and title
ax1.set_xlabel('Codmes_flujo', fontsize = 10)
ax1.set_ylabel('N', color='grey')
ax2.set_ylabel('Average Value', color='blue')
plt.title('FAD by codmes_flujo')

# Adjust the axis limits if needed (these values were chosen for this dataset)
ax1.set_ylim(0, 1800)
ax2.set_ylim(0, 18000)

ax1.tick_params(axis='y', labelsize=8)
ax2.tick_params(axis='y', labelsize=8)

ax1.set_xticks(mean_values.index)
ax1.set_xticklabels(mean_values.index, fontsize=8)

# Show legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

# Show the plot
plt.show()

We define the list of features that the model will use.


In [None]:


features = ['rat_pas_act_meanprev6_ant_per_med',
 'mtoprincipalsol_rat_avg6_avg12_zn_pastrx_med',
 'mtosaldopromediovigsol_1000_countprev12_ant_per_deu90k_med',
 'mtocuotapresunta_meanprev12_seg_ratio',
 'mtocuotapresunta_ant_ren_ratio',
 'geo_ncc_cuota_tck_ae',
 'geo_c3k_cuota_sum',
 'geo_nt_fpas6_tck',
 'isav_tkt_opec_pago_srv_prm_p6m',
 'isav_tkt_opec_trnf_prm_p6m',
 'isav_tkt_opec_pago_srv_sol_prm_u6m',
 'isav_mto_opea_sav_ahs_prm_u12',
 'can_mto_tmo_tot_sol_prm_u9m',
 'can_tkt_tmo_tot_prm_u9m',
 'can_tkt_tmo_tot_dig_prm_u12',
 'can_mto_tmo_ven_sol_prm_u12',
 'can_mto_tmo_bmo_prm_u3m',
 'can_tkt_tmo_atm_ret_prm_u9m',
 'pos_tkt_trx_td_prm_u9m',
 'pos_mto_trx_g25_prm_u9m',
 'can_ctd_tmo_tot_trf_dig_prm_u3m',
 'rcc_mto_sld_py2_med_u12',
 'monto_cosecha_pyme_meanprev12',
 'dec_var_deuda_pyme_meanprev12',
 'mtototalcargos_numcargos_rat_avg12_avg12_aggsum',
 'mtototalabonos_numabonos_rat_avg12_avg12',
 'numempleados',
 'DEUDA_VIGENTE_S_HIP_meanPrev12',
 'MTOSALDOPROMEDIOVIGSOL_meanPrev12']


### 2. Sampling


We define the input data (X) and the target variable (y). X keeps every column of the DataFrame; the model inputs are selected later through the `features` list.

In [None]:
X = pddf
y = pddf.fad


We split the data into training and test sets (40% test), stratifying by 'codmes_flujo' so that both sets keep the same monthly mix.


In [None]:
# Train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify = X.codmes_flujo,
                                                    test_size = 0.4,
                                                    random_state = 123)
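To make the effect of `stratify` concrete, here is a minimal sketch on invented data (60 rows in one month, 40 in another). It only assumes the same `train_test_split` call shape as the cell above; the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 60 rows in month 202401 and 40 rows in month 202402
toy = pd.DataFrame({
    'codmes_flujo': [202401] * 60 + [202402] * 40,
    'fad': np.arange(100, dtype=float),
})

train, test = train_test_split(toy,
                               stratify=toy.codmes_flujo,
                               test_size=0.4,
                               random_state=123)

# Share of each month in each split
train_share = train.codmes_flujo.value_counts(normalize=True).sort_index()
test_share = test.codmes_flujo.value_counts(normalize=True).sort_index()
print(train_share, test_share, sep='\n')
```

Both splits should show roughly the 60/40 monthly mix of the full toy data, which is what the stratification is meant to guarantee.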


In [None]:
# Create figure and subplot
fig, ax = plt.subplots(figsize=(8, 3))
sns.boxplot(data = y_train.values, orient="h", ax=ax)
ax.set_title('FAD')

We define a function that treats outliers using the interquartile range (IQR).


In [None]:

import pandas as pd

def treat_outlier_IQR(series):
    # Calculate the Interquartile Range (IQR)
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1

    # Define the limits for outliers
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR

    # Cap the extreme values to the defined limits
    treated_series = series.clip(lower_limit, upper_limit)
    print(f'Valid range: {lower_limit} to {upper_limit}')
    return treated_series

In [None]:
y_train_t = treat_outlier_IQR(y_train)


In [None]:
# Apply the same fixed limits to both splits; this overrides the y_train_t computed above
y_train_t = y_train.clip(100, 30736.5)
y_test_t  = y_test.clip(100, 30736.5)
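The cell above uses hand-typed limits. One way to avoid hard-coding, assuming the limits should be learned from the training target only, is sketched below. `iqr_limits` is a hypothetical helper (not part of the lab) and the series are toy values.

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Return (lower, upper) outlier limits based on the IQR of `series`."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy targets (hypothetical values)
toy_train = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
toy_test = pd.Series([-50.0, 3.0, 60.0])

lower, upper = iqr_limits(toy_train)       # limits learned on training data only
toy_train_t = toy_train.clip(lower, upper)
toy_test_t = toy_test.clip(lower, upper)   # the same limits are reused on the test data
print(lower, upper)
```

Reusing the training limits on the test set keeps the two targets comparable without letting test values influence the treatment.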

In [None]:
# Create figure and subplots
fig, axes = plt.subplots(2, 1, figsize=(6, 4))

ax1 = axes[0]
sns.boxplot(data=y_test.values, orient="h", ax=ax1)
ax1.set_title('Original Variable', fontsize=10)

ax2 = axes[1]
sns.boxplot(data=y_test_t.values, orient="h", ax=ax2)
ax2.set_title('Treated Variable', fontsize=10)
# Adjust spacing between subplots
plt.tight_layout()

# Show the plots
plt.show()

### 3. Model Training, Interpretation and Visualization


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

# Define the model
dtree = DecisionTreeRegressor(max_depth = 3,
                              min_samples_leaf = 0.05,
                              random_state = 123)

dtree = dtree.fit(X_train[features].fillna(X_train[features].mean()), y_train_t)
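The fit above fills missing values with training means by hand, and the same means have to be reused at prediction time. A hedged alternative, not what the lab does, is to bundle imputation and model in a scikit-learn pipeline so the training means travel with the model. The data below is a toy example with hypothetical column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): two features with a few missing values
X_toy = pd.DataFrame({
    'x1': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    'x2': [0.0, 1.0, 0.0, np.nan, 0.0, 1.0, 0.0, 1.0],
})
y_toy = pd.Series([10.0, 10.0, 10.0, 10.0, 50.0, 50.0, 50.0, 50.0])

# The imputer learns column means during fit and reuses them in predict
model = make_pipeline(
    SimpleImputer(strategy='mean'),
    DecisionTreeRegressor(max_depth=3, random_state=123),
)
model.fit(X_toy, y_toy)

# A new row with a missing x2 is imputed with the training mean of x2
new_row = pd.DataFrame({'x1': [7.5], 'x2': [np.nan]})
print(model.predict(new_row))
```

With this design there is no separate `fillna` call to keep in sync between training and scoring.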

In [None]:
from sklearn import tree
import matplotlib.pyplot as plt

# Set the figure size
plt.figure(figsize=(60, 20))

# Plot the decision tree
tree.plot_tree(dtree,
               feature_names=features,
               filled=True,
               fontsize=20,  # Adjust the font size here
               node_ids=True)  # Optional: show node IDs for reference

# Show the plot
plt.show()
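When the plotted tree is hard to read, a text rendering of the rules can help. The sketch below uses `sklearn.tree.export_text` on a toy tree; the feature names `income` and `flag` are hypothetical and the data is invented.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data (hypothetical): the target jumps when 'income' exceeds 3
X_toy = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]]
y_toy = [10.0, 10.0, 10.0, 50.0, 50.0, 50.0]

toy_tree = DecisionTreeRegressor(max_depth=2, random_state=123).fit(X_toy, y_toy)

# Text rendering of the fitted rules, one line per node
rules = export_text(toy_tree, feature_names=['income', 'flag'])
print(rules)
```

The same call should work on the lab's `dtree` with `feature_names=features`, assuming the tree was fitted on those columns in that order.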


### 4. Model Evaluation

In [None]:
y_train_predictions = dtree.predict(X_train[features].fillna(X_train[features].mean()))
y_test_predictions  = dtree.predict(X_test[features].fillna(X_train[features].mean()))

X_train['prediction'] = y_train_predictions
X_test['prediction']  = y_test_predictions

In [None]:
from sklearn.metrics import r2_score
r2_score(y_train_t, y_train_predictions)

In [None]:
r2_score(y_test_t,  y_test_predictions)


In [None]:
y_train_t.groupby(X_train.codmes_flujo).apply(lambda group: r2_score(group, X_train.prediction.loc[group.index]))


In [None]:
y_test_t.groupby(X_test.codmes_flujo).apply(lambda group: r2_score(group, X_test.prediction.loc[group.index]))
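As a reminder of what these scores measure, the sketch below computes R² by hand on invented numbers and compares it with `r2_score`. R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values, so within a single month it can turn negative when the predictions are worse than that month's mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values (hypothetical)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([150.0, 150.0, 350.0, 350.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```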


In [None]:
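import numpy as np

# Compute the percentage deviation of the prediction from the actual value
deviation = (X_test.prediction - y_test_t) / y_test_t * 100

# Define the ranges
conditions = [
    (deviation >= -25) & (deviation <= 25),
    (deviation > 25),
    (deviation < -25)
]
choices = ['b. ±25%', 'a. > 25%', 'c. < -25%']

# Categorize each case; the string default keeps the result dtype consistent
# and labels any row whose deviation is NaN
deviation_category = np.select(conditions, choices, default='d. undefined')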
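A small worked example may make the three ranges easier to follow. The numbers are invented; the explicit string `default` is an addition of this sketch, used so that string choices are not mixed with NumPy's integer default.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical)
actual = pd.Series([100.0, 200.0, 400.0, 1000.0])
predicted = pd.Series([130.0, 190.0, 200.0, 1250.0])

# Percentage deviation: about 30, -5, -50 and 25
deviation = (predicted - actual) / actual * 100

conditions = [
    (deviation >= -25) & (deviation <= 25),
    deviation > 25,
    deviation < -25,
]
choices = ['b. ±25%', 'a. > 25%', 'c. < -25%']
category = np.select(conditions, choices, default='d. undefined')
print(list(category))
```

A deviation of exactly 25% falls in the ±25% band because both bounds of that band are inclusive.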

In [None]:
category_counts_global = pd.Series(deviation_category).value_counts()
category_percentages = (category_counts_global / category_counts_global.sum()) * 100

category_counts_global_df = pd.DataFrame({
    'Count': category_counts_global,
    'Percentage': category_percentages})

category_counts_global_df

In [None]:
# Build a DataFrame with the results
df = pd.DataFrame({
    'segment': X_test.codmes_flujo,
    'category': deviation_category})

# Count the cases by segment and category, then normalize within each segment
category_counts = df.groupby(['segment', 'category']).size().unstack(fill_value=0)
category_counts_normalized = category_counts.div(category_counts.sum(axis=1), axis=0)

In [None]:
# Plot the stacked bars
category_counts_normalized.plot(kind='bar', stacked=True, figsize=(8, 4), color=['#204e74', '#7ba2cd', '#dfe7f3'], width=1.0)

# Labels and title
plt.title('Monthly accuracy')
plt.xlabel('Codmes_flujo', fontsize = 10)
plt.ylabel('Proportion', fontsize = 10)
plt.legend(title='Range', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.xticks(rotation=0, fontsize=8)
plt.yticks(fontsize=8)

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
importances = pd.DataFrame({'features' : features,
                            'importance' : dtree.feature_importances_}).sort_values('importance', ascending = False)
importances.loc[importances.importance > 0]
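To illustrate why the table filters on `importance > 0`: a feature that is never used in a split gets an importance of exactly zero, and the importances of a tree with at least one split add up to one. The toy data below is invented so that only the first column matters.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): the target depends only on the first column
X_toy = np.array([[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]], dtype=float)
y_toy = np.array([10.0, 10.0, 10.0, 50.0, 50.0, 50.0])

# A single split is enough to separate the two target levels
toy_tree = DecisionTreeRegressor(max_depth=1, random_state=123).fit(X_toy, y_toy)
print(toy_tree.feature_importances_)
```

With `max_depth = 3` in the lab, at most seven splits exist, so most of the 29 features necessarily end up with zero importance.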

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Keep only the features with importance greater than 0
importances_filtered = importances.loc[importances.importance > 0]

# Set the plot style
sns.set(style="whitegrid")

# Build a color palette in shades of blue
custom_palette = sns.light_palette("#1C75BC", n_colors=len(importances_filtered), reverse=True)

# Create the bar plot
plt.figure(figsize=(8, 3))
sns.barplot(x='importance', y='features', data=importances_filtered, palette=custom_palette)

# Labels and title
plt.title('Decision Tree - Feature Importances', fontsize=10)
plt.xlabel('Importance', fontsize=8)
plt.ylabel('Features', fontsize=8)

# Change the size of the axis tick labels
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)

# Remove the top and right borders
sns.despine()

# Show the plot
plt.tight_layout()
plt.show()

In [None]:
import pandas as pd
import numpy as np
from sklearn import tree
import matplotlib.pyplot as plt

# Apply the tree to every observation to get the leaf node it falls into.
# Missing values are filled with the training means, as was done when fitting the model.
X_all = pddf[features].fillna(X_train[features].mean())
nodos = dtree.apply(X_all)
pred = dtree.predict(X_all)

# Store the node, the rounded prediction and the month of each observation
df_nodos = pd.DataFrame({'Nodo': nodos,
                         'Predictions': np.round(pred, 0),
                         'Mes': pddf.codmes_flujo})

# Count observations per month and per rounded prediction; the rounded prediction
# identifies the leaf as long as leaf predictions remain distinct after rounding
stability = df_nodos.groupby(['Mes', 'Predictions']).size().unstack(fill_value=0)
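The same idea can be sketched by grouping on the leaf ID returned by `apply` directly, rather than on the rounded prediction. The data below is invented; `pd.crosstab` with `normalize='index'` gives the share of each leaf within each month.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): one feature, two months
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'y': [10.0, 10.0, 10.0, 10.0, 50.0, 50.0, 50.0, 50.0],
    'month': [202401, 202401, 202401, 202402, 202401, 202402, 202402, 202402],
})

toy_tree = DecisionTreeRegressor(max_depth=1, random_state=123).fit(toy[['x']], toy['y'])

# Leaf ID of each observation
leaf_ids = toy_tree.apply(toy[['x']])

# Share of observations per leaf within each month (rows sum to 1)
leaf_share = pd.crosstab(toy['month'], leaf_ids, normalize='index')
print(leaf_share)
```

A stable tree should show similar leaf shares from month to month; large swings suggest that the population reaching a leaf is drifting.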

In [None]:
stability

In [None]:
import matplotlib.pyplot as plt

# Compute the relative frequencies per month
stability_relative = stability.div(stability.sum(axis=1), axis=0)

# Stacked bar chart of relative frequencies, with no gap between bars
stability_relative.plot(kind='bar', stacked=True, figsize=(10, 5), width=1.0)

# Labels and title
plt.title('Month-to-month stability of the decision tree nodes (relative frequencies)')
plt.xlabel('Month')
plt.ylabel('Proportion of observations')
plt.legend(title='Leaf prediction', bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

### 5. Interactive Trees

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def squared_error(y_true, y_pred):
    """
    Calculate the Squared Error between true and predicted values.

    Parameters:
    - y_true (array-like): Array of true target values.
    - y_pred (array-like): Array of predicted values.

    Returns:
    - squared_error (float): Squared Error between true and predicted values.
    """
    return np.mean((y_true - y_pred) ** 2)

def calculate_squared_error_loss_and_plot(df, variable, target, num_bins=10):
    """
    Calculate the weighted squared error loss for candidate cut points (percentiles) of a given variable.
    Despite its name, this function does not plot; plotting is done by plot_best_regression_variable.

    Parameters:
    - df (DataFrame): Pandas DataFrame containing the dataset.
    - variable (str): The name of the variable to evaluate.
    - target (str): The name of the target variable.
    - num_bins (int, optional): Number of bins to divide the variable into. Default is 10.

    Returns:
    - min_loss (float): The minimum squared error loss value across all cut points.
    - suggested_cut (float): The optimal cut point that yields the minimum loss.
    - cut_points (array): Array of cut points used.
    - losses (list): List of squared error loss values corresponding to each cut point.
    - left_counts (list): List of counts of samples to the left of each cut point.
    - right_counts (list): List of counts of samples to the right of each cut point.
    - left_means (list): List of mean target values for samples to the left of each cut point.
    - right_means (list): List of mean target values for samples to the right of each cut point.
    - missing_count (int): Number of missing values in the variable.
    - missing_target_mean (float): Mean target value for missing values.
    - total_count (int): Total number of observations.
    - total_target_mean (float): Mean target value for all observations.
    """
    # Handle missing values
    df_clean = df.dropna(subset=[variable])
    missing_count = df[variable].isna().sum()
    missing_target_mean = df[target][df[variable].isna()].mean()
    cut_points = np.percentile(df_clean[variable], np.linspace(0, 100, num_bins + 1))[1:-1]  # Exclude min and max

    losses = []
    left_counts = []
    right_counts = []
    left_means = []
    right_means = []

    y_true = df_clean[target].values

    for threshold in cut_points:
        mask = (df_clean[variable] <= threshold).values  # boolean NumPy array
        left_count = np.sum(mask)
        right_count = len(y_true) - left_count

        if left_count > 0:
            left_mean = y_true[mask].mean()
            left_pred = np.full_like(y_true[mask], left_mean, dtype=np.float64)
            mse_left = squared_error(y_true[mask], left_pred)
        else:
            # Empty left side: no error contribution and no defined mean
            left_mean = np.nan
            mse_left = 0

        if right_count > 0:
            right_mean = y_true[~mask].mean()
            right_pred = np.full_like(y_true[~mask], right_mean, dtype=np.float64)
            mse_right = squared_error(y_true[~mask], right_pred)
        else:
            # Empty right side: no error contribution and no defined mean
            right_mean = np.nan
            mse_right = 0

        proportion_left = left_count / len(y_true)
        proportion_right = right_count / len(y_true)

        weighted_mse = (proportion_left * mse_left) + (proportion_right * mse_right)
        losses.append(weighted_mse)
        left_counts.append(left_count)
        right_counts.append(right_count)
        left_means.append(left_mean)
        right_means.append(right_mean)

    min_loss_index = np.argmin(losses)
    suggested_cut = cut_points[min_loss_index]

    # Total data metrics
    total_count = len(df)
    total_target_mean = df[target].mean()

    return (min(losses), suggested_cut, cut_points, losses, left_counts, right_counts, left_means, right_means,
            missing_count, missing_target_mean, total_count, total_target_mean)

def plot_best_regression_variable(df, variables, target, num_bins=10):
    """
    Evaluate multiple variables to determine which provides the best split based on the squared error loss function.
    Plots the squared error loss function for the variable with the best cut.

    Parameters:
    - df (DataFrame): Pandas DataFrame containing the dataset.
    - variables (list of str): List of variable names to evaluate.
    - target (str): The name of the target variable.
    - num_bins (int, optional): Number of bins to divide each variable into. Default is 10.

    Returns:
    - results_df (DataFrame): DataFrame containing results for all variables evaluated, including:
        - 'Variable': The name of the variable.
        - 'Min Loss Function': The minimum squared error loss function value.
        - 'Suggested Cut': The optimal cut point for that variable.
        - 'Left Count': Number of samples to the left of the suggested cut.
        - 'Right Count': Number of samples to the right of the suggested cut.
        - 'Left Mean': Mean target value to the left of the suggested cut.
        - 'Right Mean': Mean target value to the right of the suggested cut.
        - 'Missing Count': Number of missing values in the variable.
        - 'Missing Target Mean': Mean target value for missing values.
        - 'Total Count': Total number of observations.
        - 'Total Target Mean': Mean target value for all observations.
    """
    results = []

    for var in variables:
        (min_loss, suggested_cut, cut_points, losses, left_counts, right_counts, left_means, right_means,
         missing_count, missing_target_mean, total_count, total_target_mean) = calculate_squared_error_loss_and_plot(df, var, target, num_bins)
        results.append({
            'Variable': var,
            'Min Loss Function': min_loss,
            'Suggested Cut': suggested_cut,
            'Left Count': left_counts[np.argmin(losses)],
            'Right Count': right_counts[np.argmin(losses)],
            'Left Mean': left_means[np.argmin(losses)],
            'Right Mean': right_means[np.argmin(losses)],
            'Missing Count': missing_count,
            'Missing Target Mean': missing_target_mean,
            'Total Count': total_count,
            'Total Target Mean': total_target_mean
        })

    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values(by='Min Loss Function')  # Sort by Min Loss Function in ascending order

    # Identify the best variable
    best_variable = results_df.loc[results_df['Min Loss Function'].idxmin()]

    best_var = best_variable['Variable']
    _, suggested_cut, cut_points, losses, _, _, _, _, _, _, _, _ = calculate_squared_error_loss_and_plot(df, best_var, target, num_bins)

    # Plot the squared error loss function for the best variable
    plt.figure(figsize=(8, 4))
    plt.plot(cut_points, losses, marker='o', linestyle='-', color='b', label='Loss Function')
    plt.axvline(x=suggested_cut, color='r', linestyle='--', label='Suggested Cut')
    plt.xlabel(f'Percentiles of {best_var}', fontsize=10)
    plt.ylabel('Loss Function (Squared Error)', fontsize=10)
    plt.title(f'Loss Function vs Percentiles of {best_var} for {target}', fontsize=12)
    plt.legend(fontsize=9)
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    return results_df

In [None]:
pd.set_option('display.precision', 3)
pd.set_option('display.float_format', '{:.3f}'.format)

In [None]:
# Define the variables to test
variables = ['rat_pas_act_meanprev6_ant_per_med', 'mtoprincipalsol_rat_avg6_avg12_zn_pastrx_med', 'mtosaldopromediovigsol_1000_countprev12_ant_per_deu90k_med', 'mtocuotapresunta_meanprev12_seg_ratio']

#variables = features
X_train['fad_t'] = y_train_t

# Get results and plot with the desired number of bins
num_bins = 50  # You can adjust the number of bins here
results_df = plot_best_regression_variable(X_train, variables, 'fad_t', num_bins)
results_df
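To see the split criterion at work on numbers small enough to check by hand, here is a minimal sketch. `split_loss` is a hypothetical helper that mirrors the weighted mean squared error used by the functions above; the data is invented.

```python
import numpy as np

def split_loss(x, y, threshold):
    """Weighted mean squared error of splitting at x <= threshold (hypothetical helper)."""
    left, right = y[x <= threshold], y[x > threshold]
    loss = 0.0
    for side in (left, right):
        if len(side) > 0:
            # Each side predicts its own mean; weight its MSE by its share of rows
            loss += len(side) / len(y) * np.mean((side - side.mean()) ** 2)
    return loss

# Toy data (hypothetical): the target jumps between x = 2 and x = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 5.0, 5.0])

losses = {t: split_loss(x, y, t) for t in (1.0, 2.0, 3.0)}
best_cut = min(losses, key=losses.get)
print(losses, best_cut)
```

Cutting at 2 separates the two target levels perfectly, so its loss is zero, while cutting at 1 or 3 leaves one mixed side and a positive loss.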

---

# Thank you for completing this lab!

---
