# Module 2: Modeling

This notebook presents the logic used to train and tune machine learning algorithms, with the goal of improving the metrics of the previously backtested trading strategy. Models such as **CatBoost**, **Random Forest**, and **LightGBM**, among others, are applied.

The modeling process is organized into several key steps:

- **Separating long and short trades**: Training one model on long trades and another on short trades makes it easier to capture the characteristics of each type. Meaningful features are hard to extract in financial markets, and long and short trades tend to show different patterns. A new backtest is run to study each trade type independently.

- **Feature engineering**: To capture key information about winning and losing trades, approximately **750 trades** are used as training data, for longs and for shorts alike. Adding every individual lag of every indicator is complex, so summary statistics such as the mean, minimum, and maximum are computed over a 3-bar lag window for each feature instead. This approach reduces dimensionality and condenses the essential information, which supports the model's predictive ability.

- **Feature selection for the short model**: To optimize the model for short positions, features are selected by importance. Iterating over each model's feature importances, a decision threshold is applied to keep only the most relevant ones. In this case, the feature importances produced by **LGBMClassifier** were used with a 20% filter, discarding features whose importance fell below that value.

- **Preliminary model evaluation**: Several algorithms, including **XGBoost**, **LightGBM**, **CatBoost**, and **Random Forest**, are trained quickly to compare their initial performance by **AUC score**. This identifies the model with the most potential before a more detailed optimization is applied.

- **Hyperparameter optimization**: Once the most promising model is selected, its hyperparameters are tuned with **RandomizedSearchCV** using time-series cross-validation (**TimeSeriesSplit**), which keeps every validation fold later in time than its training data and so adapts the tuning to the temporal structure of the series.

In this way, the module aims to improve the accuracy and effectiveness of the algorithms used in the trading strategy.
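The hyperparameter search described in the last step can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual configuration: the feature matrix, the labels, the choice of `RandomForestClassifier`, and the parameter grid are all assumptions made for the example. It first checks that `TimeSeriesSplit` never validates on rows that precede the training rows, then runs a small randomized search scored by ROC AUC.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic, time-ordered data (hypothetical stand-in for the trade features)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# Each fold trains on the past and validates on the block that follows it
cv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in cv.split(X):
    assert train_idx.max() < test_idx.min()

# Illustrative search space; the values are assumptions, not the author's grid
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="roc_auc",
    cv=cv,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Passing the splitter through `cv=` is what keeps the search from shuffling trades across time, which an ordinary K-fold split would do.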

### Module Structure

1. Data preparation
2. Feature engineering
3. Modeling and optimization

## 1. Data preparation

- This first part loads the DataFrames prepared in the previous module and adapts them to build two separate models: one for longs and one for shorts.
- The backtests are then re-run separately for long and short trades.

     



In [26]:
# Import the libraries used in this notebook
import pandas as pd 
import numpy as np
import pandas_ta as ta
import warnings
from backtesting import Backtest, Strategy

# Suppress a Bokeh-related warning emitted by backtesting
warnings.filterwarnings("ignore", message="Jupyter Notebook detected. Setting Bokeh output to notebook.")

# Load the training and test data

dataf_entrenamiento = pd.read_csv(r"C:\Users\Roger Saavedra\Desktop\ML VS BACKTEST\RESULTADOS FINALES\dataf_entrenamiento.csv", index_col=0, parse_dates=True)
dataf_prueba = pd.read_csv(r"C:\Users\Roger Saavedra\Desktop\ML VS BACKTEST\RESULTADOS FINALES\dataf_prueba.csv", index_col=0, parse_dates=True)

dataf_prueba



| time | Open | High | Low | Close | Volume | ohlc_avg | EMA70 | EMA250 | RSI8 |
|---|---|---|---|---|---|---|---|---|---|
| 2024-04-29 09:10:00 | 2339.380 | 2339.455 | 2339.030 | 2339.040 | 136 | 2339.22625 | 2337.780642 | 2334.036114 | 69.798570 |
| 2024-04-29 09:11:00 | 2339.040 | 2339.195 | 2338.985 | 2339.185 | 143 | 2339.10125 | 2337.817843 | 2334.076473 | 66.029232 |
| 2024-04-29 09:12:00 | 2339.185 | 2339.290 | 2339.035 | 2339.180 | 133 | 2339.17250 | 2337.856002 | 2334.117079 | 67.183682 |
| 2024-04-29 09:13:00 | 2339.180 | 2339.385 | 2338.300 | 2338.355 | 175 | 2338.80500 | 2337.882734 | 2334.154433 | 55.971263 |
| 2024-04-29 09:14:00 | 2338.355 | 2338.625 | 2338.270 | 2338.340 | 162 | 2338.39750 | 2337.897235 | 2334.188243 | 46.200204 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-09-24 15:55:00 | 2640.490 | 2644.790 | 2639.905 | 2644.200 | 458 | 2642.34625 | 2642.983635 | 2638.054299 | 51.338494 |
| 2024-09-24 15:56:00 | 2644.190 | 2644.205 | 2642.165 | 2643.040 | 335 | 2643.40000 | 2642.995363 | 2638.096894 | 62.514041 |
| 2024-09-24 15:57:00 | 2643.045 | 2643.325 | 2642.245 | 2642.480 | 272 | 2642.77375 | 2642.989121 | 2638.134160 | 54.078550 |
| 2024-09-24 15:58:00 | 2642.495 | 2643.340 | 2642.490 | 2643.115 | 268 | 2642.86000 | 2642.985484 | 2638.171816 | 55.033596 |


In [2]:
# Backtest of long trades

# Pip value for XAUUSD
valor_pip = 0.01
margen_pips = 20  # 20-pip buffer placed beyond the recent extreme when setting the stop loss

# Detect an upward cross of a level (crossover)
def crossover(series, level):
    return series[-2] < level and series[-1] > level

# Detect a downward cross of a level (crossunder)
def crossunder(series, level):
    return series[-2] > level and series[-1] < level

# Check whether the trend is bullish
def is_bullish_trend(ema_fast, ema_slow):
    return ema_fast > ema_slow

class EMARSIWithPipMarginStrategyLong(Strategy):
    risk_reward_ratio = 1.0
    risk_amount = 100  # Fixed amount risked per trade (1% of the 10,000 starting cash)

    def init(self):
        # Register the precomputed DataFrame columns as indicators
        self.ema_fast = self.I(lambda: self.data['EMA70'])  # Fast EMA
        self.ema_slow = self.I(lambda: self.data['EMA250'])  # Slow EMA
        self.rsi = self.I(lambda: self.data['RSI8']) 
        
        # State flag used to arm and reset the signal
        self.rsi_below_10 = False  # Set once the RSI has dropped below 10 (buy setup)

    def is_within_no_trade_zone(self):
        # Timestamp of the current bar
        current_time = self.data.index[-1]
        hour = current_time.hour
        minute = current_time.minute

        # True during the last 30 one-minute bars before the New York session close
        if (hour == 15 and minute >= 30):  # 30 minutes before 16:00 UTC (end of the NY session)
            return True
        return False
          
    def next(self):
        # Skip the bar if it falls inside the no-trade window (last 30 bars before the close)
        if self.is_within_no_trade_zone():
            return  # Do not open a trade

        # Check whether a position is already open
        if self.position:
            return  # Do not open a new trade while a position is open

        # Indicator values for the current bar
        ema_fast = self.ema_fast[-1]
        ema_slow = self.ema_slow[-1]
        
        ### Buy logic (longs)
        if is_bullish_trend(ema_fast, ema_slow):
            # Step 1: the RSI crosses below 10 (crossunder), which arms the setup
            if crossunder(self.rsi, 10) and not self.rsi_below_10:
                self.rsi_below_10 = True  # Remember that the RSI has crossed below the level
            
            # Step 2: the RSI crosses back above 10 (crossover) after having crossed below it
            if self.rsi_below_10 and crossover(self.rsi, 10):
                sl = self.data.Low[-5:].min() - (margen_pips * valor_pip)
                tp = self.data.Close[-1] + (self.data.Close[-1] - sl) * self.risk_reward_ratio
                
                risk_per_unit = self.data.Close[-1] - sl
                
                if risk_per_unit > 0:
                    size = self.risk_amount / risk_per_unit
                    size = max(1, int(size))  # Enforce a minimum size of 1 unit
                    self.buy(size=size, sl=sl, tp=tp)
                
                self.rsi_below_10 = False  # Reset the setup for the next cycle

# Run the long backtest on the training and test DataFrames
bt_entrenamiento_longs = Backtest(dataf_entrenamiento, EMARSIWithPipMarginStrategyLong, cash=10000, margin=1/10000, commission=.000)
stats_entrenamiento_longs = bt_entrenamiento_longs.run()

bt_prueba_longs = Backtest(dataf_prueba, EMARSIWithPipMarginStrategyLong, cash=10000, margin=1/10000, commission=.000)
stats_prueba_longs = bt_prueba_longs.run()

from IPython.display import display, HTML

# Show the training and test statistics side by side
display(HTML(f"""
<div style="display: flex;">
    <div style="width: 50%; padding-right: 10px;">
        <h3>Backtest on Training Data: Longs</h3>
        <pre>{stats_entrenamiento_longs}</pre>
    </div>
    <div style="width: 50%; padding-left: 10px;">
        <h3>Backtest on Test Data: Longs</h3>
        <pre>{stats_prueba_longs}</pre>
    </div>
</div>
"""))

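The stop loss, take profit, and position size computed inside `next()` can be traced with a small worked example. The prices below are invented for illustration and are not taken from the dataset; the constants mirror the ones in the cell above.

```python
# Constants as defined in the strategy cell
valor_pip = 0.01
margen_pips = 20
risk_amount = 100
risk_reward_ratio = 1.0

# Illustrative inputs (hypothetical, not from the dataset)
recent_lows = [1929.40, 1929.10, 1929.00, 1929.25, 1929.60]  # lows of the last 5 bars
close = 1930.00                                              # close of the signal bar

# Stop loss: lowest of the last 5 lows minus a 20-pip buffer
sl = min(recent_lows) - margen_pips * valor_pip

# Distance from the close to the stop is the risk per unit
risk_per_unit = close - sl

# Take profit at the same distance above the close (risk/reward of 1.0)
tp = close + risk_per_unit * risk_reward_ratio

# Whole units such that a stop-out loses at most about `risk_amount`
size = max(1, int(risk_amount / risk_per_unit))

print(round(sl, 2), round(tp, 2), size, round(size * risk_per_unit, 2))
```

Because the size is truncated to a whole number of units, the amount at risk lands slightly below 100, which is roughly in line with the PnL values close to ±100 in the trade log further down. Actual fills occur at the next bar's open, so realized PnL differs slightly from this calculation.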

In [3]:
# Backtest of short trades

# Pip value for XAUUSD
valor_pip = 0.01
margen_pips = 20  # 20-pip buffer placed beyond the recent extreme when setting the stop loss

# Detect an upward cross of a level (crossover)
def crossover(series, level):
    return series[-2] < level and series[-1] > level

# Detect a downward cross of a level (crossunder)
def crossunder(series, level):
    return series[-2] > level and series[-1] < level

# Check whether the trend is bearish
def is_bearish_trend(ema_fast, ema_slow):
    return ema_fast < ema_slow

class EMARSIWithPipMarginStrategyShort(Strategy):
    risk_reward_ratio = 1.0
    risk_amount = 100  # Fixed amount risked per trade (1% of the 10,000 starting cash); at 5% risk per trade, drawdown reaches 97% for a 52% return

    def init(self):
        # Register the precomputed DataFrame columns as indicators
        self.ema_fast = self.I(lambda: self.data['EMA70'])  # Fast EMA
        self.ema_slow = self.I(lambda: self.data['EMA250'])  # Slow EMA
        self.rsi = self.I(lambda: self.data['RSI8']) 
        
        # State flag used to arm and reset the signal
        self.rsi_above_90 = False  # Set once the RSI has risen above 90 (sell setup)

    def is_within_no_trade_zone(self):
        # Timestamp of the current bar
        current_time = self.data.index[-1]
        hour = current_time.hour
        minute = current_time.minute

        # True during the last 30 one-minute bars before the New York session close
        if (hour == 15 and minute >= 30):  # 30 minutes before 16:00 UTC (end of the NY session)
            return True
        return False
          
    def next(self):
        # Skip the bar if it falls inside the no-trade window (last 30 bars before the close)
        if self.is_within_no_trade_zone():
            return  # Do not open a trade

        # Check whether a position is already open
        if self.position:
            return  # Do not open a new trade while a position is open

        # Indicator values for the current bar
        ema_fast = self.ema_fast[-1]
        ema_slow = self.ema_slow[-1]
        
        ### Sell logic (shorts)
        if is_bearish_trend(ema_fast, ema_slow):
            # Step 1: the RSI crosses above 90 (crossover), which arms the setup
            if crossover(self.rsi, 90) and not self.rsi_above_90:
                self.rsi_above_90 = True  # Remember that the RSI has crossed above the level
            
            # Step 2: the RSI crosses back below 90 (crossunder) after having crossed above it
            if self.rsi_above_90 and crossunder(self.rsi, 90):
                sl = self.data.High[-5:].max() + (margen_pips * valor_pip)
                tp = self.data.Close[-1] - (sl - self.data.Close[-1]) * self.risk_reward_ratio
                
                risk_per_unit = sl - self.data.Close[-1]
                
                if risk_per_unit > 0:
                    size = self.risk_amount / risk_per_unit
                    size = max(1, int(size))  # Enforce a minimum size of 1 unit
                    self.sell(size=size, sl=sl, tp=tp)
                
                self.rsi_above_90 = False  # Reset the setup for the next cycle

# Run the short backtest on the training and test DataFrames
bt_entrenamiento_shorts = Backtest(dataf_entrenamiento, EMARSIWithPipMarginStrategyShort, cash=10000, margin=1/10000, commission=.000)
stats_entrenamiento_shorts = bt_entrenamiento_shorts.run()

bt_prueba_shorts = Backtest(dataf_prueba, EMARSIWithPipMarginStrategyShort, cash=10000, margin=1/10000, commission=.000)
stats_prueba_shorts = bt_prueba_shorts.run()

from IPython.display import display, HTML

# Show the training and test statistics side by side
display(HTML(f"""
<div style="display: flex;">
    <div style="width: 50%; padding-right: 10px;">
        <h3>Backtest on Training Data: Shorts</h3>
        <pre>{stats_entrenamiento_shorts}</pre>
    </div>
    <div style="width: 50%; padding-left: 10px;">
        <h3>Backtest on Test Data: Shorts</h3>
        <pre>{stats_prueba_shorts}</pre>
    </div>
</div>
"""))

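The two-step RSI trigger used by both strategies (arm on the first cross, fire on the cross back) can be followed on a short list of values. The sketch below isolates that logic for the short side; it deliberately omits the EMA trend filter and the no-trade window, and the helper `short_signal_bars` and the RSI values are hypothetical.

```python
def crossover(series, level):
    # Previous value below the level, current value above it
    return series[-2] < level and series[-1] > level


def crossunder(series, level):
    # Previous value above the level, current value below it
    return series[-2] > level and series[-1] < level


def short_signal_bars(rsi_values, level=90):
    """Return the bar positions where the short setup would fire (hypothetical helper)."""
    armed = False
    signals = []
    for i in range(1, len(rsi_values)):
        history = rsi_values[: i + 1]  # data available up to and including bar i
        if crossover(history, level) and not armed:
            armed = True               # step 1: RSI moved above the level
        if armed and crossunder(history, level):
            signals.append(i)          # step 2: RSI came back below the level
            armed = False              # reset for the next cycle
    return signals


rsi = [80, 88, 92, 95, 91, 87, 70, 93, 89]
print(short_signal_bars(rsi))  # [5, 8]
```

The first signal appears at bar 5 rather than bar 4 because 91 is still above 90; the cross back only completes when the RSI prints 87.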


In [4]:
# Extract the trade logs from the backtest results for both DataFrames

# Training
trades_longs_original_entrenamiento = stats_entrenamiento_longs['_trades']
trades_shorts_original_entrenamiento = stats_entrenamiento_shorts['_trades']

# Test
trades_longs_original_prueba = stats_prueba_longs['_trades']
trades_shorts_original_prueba = stats_prueba_shorts['_trades']


def procesar_trades(trades):
    # Drop the columns that are not used
    trades = trades.drop(columns=['Size', 'EntryBar', 'ExitBar', 'ReturnPct'])
    
    # Create the target variable: 1 for a winning trade, 0 for a losing one, NA if PnL is missing
    trades['result'] = trades['PnL'].apply(lambda x: 1 if x > 0 else (0 if pd.notnull(x) else np.nan)).astype('Int64')
    
    # Standard adjustment: the backtest fills the order at the open of the bar after the signal,
    # so subtracting one minute recovers the timestamp of the signal bar
    trades['AdjustedEntryTime'] = trades['EntryTime'] - pd.Timedelta(minutes=1)
    
    # Special case: time adjustment for 2021-08-31 05:10:00
    # TradingView does not provide the preceding candles here, so 3 minutes are subtracted instead of 1
    specific_time_1 = pd.Timestamp('2021-08-31 05:10:00')
    trades.loc[trades['EntryTime'] == specific_time_1, 'AdjustedEntryTime'] = trades['EntryTime'] - pd.Timedelta(minutes=3)
    
    return trades

# Training
trades_longs_entrenamiento = procesar_trades(trades_longs_original_entrenamiento)
trades_shorts_entrenamiento = procesar_trades(trades_shorts_original_entrenamiento)

# Test
trades_longs_prueba= procesar_trades(trades_longs_original_prueba)
trades_shorts_prueba = procesar_trades(trades_shorts_original_prueba)

trades_longs_entrenamiento



| | EntryPrice | ExitPrice | PnL | EntryTime | ExitTime | Duration | result | AdjustedEntryTime |
|---|---|---|---|---|---|---|---|---|
| 0 | 1929.185 | 1930.487 | 98.952 | 2021-01-04 10:30:00 | 2021-01-04 10:34:00 | 0 days 00:04:00 | 1 | 2021-01-04 10:29:00 |
| 1 | 1930.788 | 1931.526 | 99.630 | 2021-01-04 12:51:00 | 2021-01-04 12:52:00 | 0 days 00:01:00 | 1 | 2021-01-04 12:50:00 |
| 2 | 1933.622 | 1932.564 | -99.452 | 2021-01-04 15:15:00 | 2021-01-04 15:16:00 | 0 days 00:01:00 | 0 | 2021-01-04 15:14:00 |
| 3 | 1933.883 | 1938.550 | 98.007 | 2021-01-04 15:20:00 | 2021-01-04 15:39:00 | 0 days 00:19:00 | 1 | 2021-01-04 15:19:00 |
| 4 | 1951.007 | 1949.586 | -99.470 | 2021-01-06 11:35:00 | 2021-01-06 11:50:00 | 0 days 00:15:00 | 0 | 2021-01-06 11:34:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 719 | 2382.600 | 2379.520 | -98.560 | 2024-04-19 08:13:00 | 2024-04-19 08:27:00 | 0 days 00:14:00 | 0 | 2024-04-19 08:12:00 |
| 720 | 2369.105 | 2367.555 | -99.200 | 2024-04-22 05:09:00 | 2024-04-22 05:36:00 | 0 days 00:27:00 | 0 | 2024-04-22 05:08:00 |
| 721 | 2316.400 | 2318.435 | 99.715 | 2024-04-25 05:03:00 | 2024-04-25 05:36:00 | 0 days 00:33:00 | 1 | 2024-04-25 05:02:00 |
| 722 | 2326.290 | 2325.505 | -99.695 | 2024-04-25 12:19:00 | 2024-04-25 12:20:00 | 0 days 00:01:00 | 0 | 2024-04-25 12:18:00 |
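The fixed one-minute shift needs a hand-written exception whenever the price series has a gap, as with the 2021-08-31 05:10:00 entry. One possible alternative, sketched here under the assumption that every `EntryTime` exists in the price index and is never its first bar, is to look up the bar that actually precedes each entry. The toy index and trades below are hypothetical.

```python
import pandas as pd

# Toy 1-minute index with a gap: the 05:08 and 05:09 bars are missing
price_index = pd.DatetimeIndex([
    "2021-08-31 05:05:00",
    "2021-08-31 05:06:00",
    "2021-08-31 05:07:00",
    "2021-08-31 05:10:00",
    "2021-08-31 05:11:00",
])

trades = pd.DataFrame({
    "EntryTime": pd.to_datetime(["2021-08-31 05:07:00", "2021-08-31 05:10:00"]),
})

# Position of each entry bar in the price index
entry_pos = price_index.searchsorted(trades["EntryTime"].to_numpy(), side="left")

# Step back one bar to reach the signal bar, whatever the time distance is
trades["AdjustedEntryTime"] = price_index[entry_pos - 1]

print(trades)
```

For the second trade this yields 05:07:00, the same three-minute shift that the special case in `procesar_trades` applies by hand.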


## 2. Feature engineering
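The lag-window statistics used throughout this section can be illustrated on a toy series before reading the full feature functions. The values below are invented; the column names follow the notebook's naming pattern, except `RSI8_slope`, which is named here only for the example.

```python
import pandas as pd

# Hypothetical RSI readings on five consecutive bars
rsi = pd.Series([50.0, 40.0, 30.0, 20.0, 35.0], name="RSI8")
window = 3

features = pd.DataFrame({
    "RSI8": rsi,
    # Each rolling statistic covers the current bar and the two before it
    f"RSI8_mean_{window}": rsi.rolling(window).mean(),
    f"RSI8_min_{window}": rsi.rolling(window).min(),
    f"RSI8_max_{window}": rsi.rolling(window).max(),
    # Average change per bar over the last `window` bars
    "RSI8_slope": (rsi - rsi.shift(window)) / window,
})

print(features)

# Rows without a full window contain NaN and disappear after dropna()
print(features.dropna().index.tolist())
```

Every statistic only looks backwards from the current bar, so a feature row attached to a signal bar contains no information from later bars. The first rows have incomplete windows, which is why the notebook drops NaN rows after computing all features.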

In [5]:
import numpy as np
import pandas as pd
import pandas_ta as ta

def calc_price_features(df):
    # Candle geometry, expressed as a percentage of the open price
    df['perc_var_open_close'] = ((df['Close'] - df['Open']) / df['Open']) * 100
    df['candle_range_perc'] = ((df['High'] - df['Low']) / df['Open']) * 100
    df['body_size_perc'] = (abs(df['Close'] - df['Open']) / df['Open']) * 100
    df['upper_shadow_perc'] = ((df['High'] - df[['Open', 'Close']].max(axis=1)) / df['Open']) * 100
    df['lower_shadow_perc'] = ((df[['Open', 'Close']].min(axis=1) - df['Low']) / df['Open']) * 100
    # Shadows and body relative to the full candle range.
    # Note: a bar with High == Low gives 0/0 = NaN here, so it is removed by the later dropna()
    df['upper_shadow_ratio'] = df['upper_shadow_perc'] / df['candle_range_perc']
    df['lower_shadow_ratio'] = df['lower_shadow_perc'] / df['candle_range_perc']
    df['body_to_range_ratio'] = df['body_size_perc'] / df['candle_range_perc']
    return df.copy()  # Return a copy to avoid DataFrame fragmentation

def calc_window_stats(df, window):
    df[f'perc_var_open_close_mean_{window}'] = df['perc_var_open_close'].rolling(window).mean()
    df[f'perc_var_open_close_std_{window}'] = df['perc_var_open_close'].rolling(window).std()
    df[f'perc_var_open_close_min_{window}'] = df['perc_var_open_close'].rolling(window).min()
    df[f'perc_var_open_close_max_{window}'] = df['perc_var_open_close'].rolling(window).max()
    df[f'perc_var_open_close_median_{window}'] = df['perc_var_open_close'].rolling(window).median()
    df[f'candle_range_perc_mean_{window}'] = df['candle_range_perc'].rolling(window).mean()
    df[f'candle_range_perc_std_{window}'] = df['candle_range_perc'].rolling(window).std()
    df[f'candle_range_perc_min_{window}'] = df['candle_range_perc'].rolling(window).min()
    df[f'candle_range_perc_max_{window}'] = df['candle_range_perc'].rolling(window).max()
    df[f'candle_range_perc_median_{window}'] = df['candle_range_perc'].rolling(window).median()
    df[f'upper_shadow_ratio_mean_{window}'] = df['upper_shadow_ratio'].rolling(window).mean()
    df[f'upper_shadow_ratio_std_{window}'] = df['upper_shadow_ratio'].rolling(window).std()
    df[f'lower_shadow_ratio_mean_{window}'] = df['lower_shadow_ratio'].rolling(window).mean()
    df[f'lower_shadow_ratio_std_{window}'] = df['lower_shadow_ratio'].rolling(window).std()
    df[f'body_to_range_ratio_mean_{window}'] = df['body_to_range_ratio'].rolling(window).mean()
    df[f'body_to_range_ratio_std_{window}'] = df['body_to_range_ratio'].rolling(window).std()
    return df.copy()  # Return a copy to avoid DataFrame fragmentation

def calc_rsi_features(df, window):
    df[f'RSI8_mean_{window}'] = df['RSI8'].rolling(window).mean()
    df[f'RSI8_std_{window}'] = df['RSI8'].rolling(window).std()
    df[f'RSI8_min_{window}'] = df['RSI8'].rolling(window).min()
    df[f'RSI8_25%_{window}'] = df['RSI8'].rolling(window).quantile(0.25)
    df[f'RSI8_50%_{window}'] = df['RSI8'].rolling(window).quantile(0.50)
    df[f'RSI8_75%_{window}'] = df['RSI8'].rolling(window).quantile(0.75)
    df[f'RSI8_max_{window}'] = df['RSI8'].rolling(window).max()
    # Note: despite the "_5" suffix, the slope is measured over 3 bars
    df['RSI8_slope_5'] = (df['RSI8'] - df['RSI8'].shift(3)) / 3
    return df.copy()  # Return a copy to avoid DataFrame fragmentation

def calc_ema_features(df, window):
    df[f'EMA70_mean_{window}'] = df['EMA70'].rolling(window).mean()
    df[f'EMA70_std_{window}'] = df['EMA70'].rolling(window).std()
    df[f'EMA70_min_{window}'] = df['EMA70'].rolling(window).min()
    df[f'EMA70_max_{window}'] = df['EMA70'].rolling(window).max()
    df[f'EMA70_median_{window}'] = df['EMA70'].rolling(window).median()
    df[f'EMA250_mean_{window}'] = df['EMA250'].rolling(window).mean()
    df[f'EMA250_std_{window}'] = df['EMA250'].rolling(window).std()
    df[f'EMA250_min_{window}'] = df['EMA250'].rolling(window).min()
    df[f'EMA250_max_{window}'] = df['EMA250'].rolling(window).max()
    df[f'EMA250_median_{window}'] = df['EMA250'].rolling(window).median()
    df['ema_diff'] = df['EMA70'] - df['EMA250']
    df[f'ema_diff_mean_{window}'] = df['ema_diff'].rolling(window).mean()
    df[f'ema_diff_std_{window}'] = df['ema_diff'].rolling(window).std()
    df[f'ema_diff_min_{window}'] = df['ema_diff'].rolling(window).min()
    df[f'ema_diff_max_{window}'] = df['ema_diff'].rolling(window).max()
    df[f'ema_diff_median_{window}'] = df['ema_diff'].rolling(window).median()
    df['close_to_ema70'] = df['Close'] - df['EMA70']
    df['close_to_ema250'] = df['Close'] - df['EMA250']
    df[f'close_to_ema70_mean_{window}'] = df['close_to_ema70'].rolling(window).mean()
    df[f'close_to_ema70_std_{window}'] = df['close_to_ema70'].rolling(window).std()
    df[f'close_to_ema250_mean_{window}'] = df['close_to_ema250'].rolling(window).mean()
    df[f'close_to_ema250_std_{window}'] = df['close_to_ema250'].rolling(window).std()
    df['EMA_ratio'] = df['EMA70'] / df['EMA250']
    df[f'EMA_ratio_mean_{window}'] = df['EMA_ratio'].rolling(window).mean()
    df[f'EMA_ratio_std_{window}'] = df['EMA_ratio'].rolling(window).std()
    # Note: despite the "_5" suffix, these slopes are measured over 3 bars
    df['EMA70_slope_5'] = (df['EMA70'] - df['EMA70'].shift(3)) / 3
    df['EMA250_slope_5'] = (df['EMA250'] - df['EMA250'].shift(3)) / 3
    df['ema_diff_slope_5'] = (df['ema_diff'] - df['ema_diff'].shift(3)) / 3
    return df.copy()  # Return a copy to avoid DataFrame fragmentation

def calc_volatility_atr(df, window):
    df['atr_14'] = ta.atr(df['High'], df['Low'], df['Close'], length=14)
    df[f'atr_14_mean_{window}'] = df['atr_14'].rolling(window).mean()
    df[f'atr_14_std_{window}'] = df['atr_14'].rolling(window).std()
    df[f'atr_14_min_{window}'] = df['atr_14'].rolling(window).min()
    df[f'atr_14_max_{window}'] = df['atr_14'].rolling(window).max()
    df[f'atr_14_median_{window}'] = df['atr_14'].rolling(window).median()
    # Note: despite the "_5" suffix, the slope is measured over 3 bars
    df['atr_14_slope_5'] = (df['atr_14'] - df['atr_14'].shift(3)) / 3
    return df.copy()  # Return a copy to avoid DataFrame fragmentation

def calc_volume_features(df, window):
    df[f'Volume_mean_{window}'] = df['Volume'].rolling(window).mean()
    df[f'Volume_std_{window}'] = df['Volume'].rolling(window).std()
    df[f'Volume_min_{window}'] = df['Volume'].rolling(window).min()
    df[f'Volume_max_{window}'] = df['Volume'].rolling(window).max()
    df[f'Volume_median_{window}'] = df['Volume'].rolling(window).median()
    df[f'Volume_relative_{window}'] = df['Volume'] / df[f'Volume_mean_{window}']
    return df.copy()  # Return a copy to avoid DataFrame fragmentation

def calc_obv_features(df, window):
    df['obv'] = ta.obv(df['Close'], df['Volume'])
    df[f'obv_mean_{window}'] = df['obv'].rolling(window).mean()
    df[f'obv_std_{window}'] = df['obv'].rolling(window).std()  
    df[f'obv_slope_{window}'] = (df['obv'] - df['obv'].shift(window)) / window
    return df.copy()  # Return a copy to avoid DataFrame fragmentation

def calc_cmf_features(df, window):
    df['cmf_20'] = ta.cmf(df['High'], df['Low'], df['Close'], df['Volume'], length=20)
    df[f'cmf_mean_{window}'] = df['cmf_20'].rolling(window).mean()
    df[f'cmf_std_{window}'] = df['cmf_20'].rolling(window).std()
    df[f'cmf_slope_{window}'] = (df['cmf_20'] - df['cmf_20'].shift(window)) / window
    return df.copy()  # Return a copy to avoid DataFrame fragmentation

def calc_return_volatility_features(df, window):
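    # Standard deviation of 1-bar returns over the last 14 bars
    df['volatility_14'] = df['Close'].pct_change().rolling(window=14).std()
    # Note: despite the name, 'return_5' is a 3-bar return
    df['return_5'] = df['Close'].pct_change(3)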
    df['risk_adjusted_return_5'] = df['return_5'] / df['volatility_14'].replace(0, np.nan)
    df['rsi_volatility_ratio'] = df['RSI8'] / df['atr_14'].replace(0, np.nan)
    df[f'volatility_14_mean_{window}'] = df['volatility_14'].rolling(window).mean()
    df[f'volatility_14_std_{window}'] = df['volatility_14'].rolling(window).std()
    df[f'volatility_14_min_{window}'] = df['volatility_14'].rolling(window).min()
    df[f'volatility_14_max_{window}'] = df['volatility_14'].rolling(window).max()
    df[f'volatility_14_median_{window}'] = df['volatility_14'].rolling(window).median()
    df[f'volatility_14_slope_{window}'] = (df['volatility_14'] - df['volatility_14'].shift(window)) / window
    df[f'return_5_mean_{window}'] = df['return_5'].rolling(window).mean()
    df[f'return_5_std_{window}'] = df['return_5'].rolling(window).std()
    df[f'risk_adjusted_return_5_mean_{window}'] = df['risk_adjusted_return_5'].rolling(window).mean()
    df[f'risk_adjusted_return_5_std_{window}'] = df['risk_adjusted_return_5'].rolling(window).std()
    df[f'rsi_volatility_ratio_mean_{window}'] = df['rsi_volatility_ratio'].rolling(window).mean()
    df[f'rsi_volatility_ratio_std_{window}'] = df['rsi_volatility_ratio'].rolling(window).std()
    df[f'rsi_volatility_ratio_min_{window}'] = df['rsi_volatility_ratio'].rolling(window).min()
    df[f'rsi_volatility_ratio_max_{window}'] = df['rsi_volatility_ratio'].rolling(window).max()
    df[f'rsi_volatility_ratio_median_{window}'] = df['rsi_volatility_ratio'].rolling(window).median()
    df[f'rsi_volatility_ratio_slope_{window}'] = (df['rsi_volatility_ratio'] - df['rsi_volatility_ratio'].shift(window)) / window
    return df.copy()  # Return a copy to avoid DataFrame fragmentation

def calculate_all_features(df):
    df = calc_price_features(df)
    df = calc_window_stats(df, window=3)
    df = calc_rsi_features(df, window=3)
    df = calc_ema_features(df, window=3)
    df = calc_volatility_atr(df, window=3)
    df = calc_volume_features(df, window=3)
    df = calc_obv_features(df, window=3)
    df = calc_cmf_features(df, window=3)
    df = calc_return_volatility_features(df, window=3)
    
    # Drop the rows with NaN values produced by the rolling and lagged calculations
    df.dropna(inplace=True)
    return df

# Compute all features for the training and test DataFrames
dataf_entrenamiento_features = calculate_all_features(dataf_entrenamiento)
dataf_prueba_features = calculate_all_features(dataf_prueba)

dataf_entrenamiento_features


| time | Open | High | Low | Close | Volume | ohlc_avg | EMA70 | EMA250 | RSI8 | perc_var_open_close | ... | return_5_mean_3 | return_5_std_3 | risk_adjusted_return_5_mean_3 | risk_adjusted_return_5_std_3 | rsi_volatility_ratio_mean_3 | rsi_volatility_ratio_std_3 | rsi_volatility_ratio_min_3 | rsi_volatility_ratio_max_3 | rsi_volatility_ratio_median_3 | rsi_volatility_ratio_slope_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-01-04 09:32:00 | 1932.602 | 1932.974 | 1932.552 | 1932.974 | 22 | 1932.77550 | 1931.566920 | 1926.439888 | 72.727035 | 0.019249 | ... | 1.711241e-04 | 0.000157 | 1.007084 | 1.061709 | 138.942364 | 14.332174 | 125.848318 | 154.254128 | 136.724646 | 14.456115 |
| 2021-01-04 09:33:00 | 1932.974 | 1933.056 | 1932.825 | 1932.825 | 13 | 1932.92000 | 1931.605035 | 1926.491522 | 75.847171 | -0.007708 | ... | 2.333946e-04 | 0.000090 | 1.646248 | 0.777490 | 153.116109 | 15.853118 | 136.724646 | 168.369553 | 154.254128 | 14.173745 |
| 2021-01-04 09:34:00 | 1932.825 | 1933.098 | 1932.777 | 1932.928 | 14 | 1932.90700 | 1931.641710 | 1926.542641 | 74.965373 | 0.005329 | ... | 2.363182e-04 | 0.000086 | 1.942865 | 0.271645 | 164.416872 | 8.873273 | 154.254128 | 170.626935 | 168.369553 | 11.300763 |
| 2021-01-04 09:35:00 | 1932.928 | 1933.017 | 1932.555 | 1932.555 | 18 | 1932.76375 | 1931.673317 | 1926.592212 | 65.391367 | -0.019297 | ... | 5.297129e-05 | 0.000234 | 0.574108 | 2.179562 | 162.395051 | 12.354756 | 148.188666 | 170.626935 | 168.369553 | -2.021821 |
| 2021-01-04 09:36:00 | 1932.555 | 1932.802 | 1932.555 | 1932.748 | 8 | 1932.66500 | 1931.701252 | 1926.640601 | 59.413421 | 0.009987 | ... | -2.930600e-05 | 0.000193 | -0.214276 | 1.792322 | 152.867385 | 15.943653 | 139.786556 | 170.626935 | 148.188666 | -9.527666 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-04-25 15:55:00 | 2330.015 | 2330.300 | 2329.370 | 2329.905 | 251 | 2329.89750 | 2331.423189 | 2328.697438 | 38.166977 | -0.004721 | ... | -2.481072e-04 | 0.000313 | -0.914394 | 1.166351 | 40.354143 | 7.601603 | 32.767482 | 47.970601 | 40.324346 | -4.107791 |
| 2024-04-25 15:56:00 | 2329.905 | 2331.120 | 2329.900 | 2330.610 | 250 | 2330.38375 | 2331.393909 | 2328.710874 | 47.173745 | 0.030259 | ... | -3.510680e-04 | 0.000140 | -1.290783 | 0.563993 | 37.818404 | 4.374270 | 32.767482 | 40.363383 | 40.324346 | -2.535739 |
| 2024-04-25 15:57:00 | 2330.610 | 2330.615 | 2329.990 | 2330.010 | 233 | 2330.30625 | 2331.363271 | 2328.723587 | 45.954447 | -0.025744 | ... | -2.252227e-04 | 0.000237 | -0.874210 | 0.938166 | 37.934175 | 4.477141 | 32.767482 | 40.671659 | 40.363383 | 0.115771 |
| 2024-04-25 15:58:00 | 2330.010 | 2330.615 | 2329.970 | 2330.085 | 228 | 2330.17000 | 2331.329658 | 2328.735112 | 43.685751 | 0.003219 | ... | -4.145611e-05 | 0.000142 | -0.156170 | 0.529605 | 40.307154 | 0.395628 | 39.886419 | 40.671659 | 40.363383 | 2.372979 |
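With per-bar features available and every trade carrying an `AdjustedEntryTime`, building the modeling table comes down to keeping only the feature rows that correspond to a signal bar. The sketch below shows one compact way to express that idea, an inner join on the timestamp, using invented data. It is an illustration rather than the notebook's own function, and it assumes each `AdjustedEntryTime` is unique.

```python
import pandas as pd

# Hypothetical per-bar features on a 1-minute index
idx = pd.date_range("2021-01-04 10:27", periods=5, freq="min")
features = pd.DataFrame({"RSI8": [12.0, 9.5, 11.2, 30.1, 45.0]}, index=idx)
features.index.name = "time"

# Hypothetical trade log with the signal-bar timestamp already computed
trades = pd.DataFrame({
    "AdjustedEntryTime": pd.to_datetime(["2021-01-04 10:29:00"]),
    "PnL": [98.952],
    "result": [1],
})

# Inner join: keep only the bars on which a trade was triggered
modeling = features.join(trades.set_index("AdjustedEntryTime"), how="inner")

print(modeling)
```

A trade whose signal bar was removed from the feature table, for example by `dropna()`, simply drops out of the join, so comparing `len(modeling)` with `len(trades)` is a quick way to see how many trades were lost.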


In [6]:
# Build the modeling DataFrame
# The process has two steps:
# 1. First, all relevant features are computed for every bar in `df_entrenamiento`.
# 2. Then `df_entrenamiento` is merged with `df_trades` so that only the bars on which a trade is triggered remain.
#    This links each trade to all of the features computed beforehand.

def merge_and_filter_dataframes(df_entrenamiento, df_trades, result_values=(0, 1)):
    # Merge on the index of df_entrenamiento and the AdjustedEntryTime column of df_trades
    df_merged = df_entrenamiento.merge(df_trades, how='left', left_index=True, right_on='AdjustedEntryTime')
    
    # Drop the AdjustedEntryTime column after the merge, since it is no longer needed
    if 'AdjustedEntryTime' in df_merged.columns:
        df_merged.drop(columns=['AdjustedEntryTime'], inplace=True)
    
    # Restore the original index of df_entrenamiento
    # (this relies on the left merge keeping one row per bar, i.e. unique AdjustedEntryTime values)
    df_merged.set_index(df_entrenamiento.index, inplace=True)
    
    # Keep only the rows where 'result' is 0 or 1 (or the values given in result_values)
    df_filtered = df_merged[df_merged['result'].isin(result_values)]
    
    # Display option: show every column of the DataFrame
    pd.set_option('display.max_columns', None)
  
    return df_filtered

# Apply the function to dataf_entrenamiento_features
df_longs_entrenamiento = merge_and_filter_dataframes(dataf_entrenamiento_features, trades_longs_entrenamiento)
df_shorts_entrenamiento = merge_and_filter_dataframes(dataf_entrenamiento_features, trades_shorts_entrenamiento)

# Apply the function to dataf_prueba_features
df_longs_prueba = merge_and_filter_dataframes(dataf_prueba_features, trades_longs_prueba)
df_shorts_prueba = merge_and_filter_dataframes(dataf_prueba_features, trades_shorts_prueba)

# Save the DataFrames that will be used in the results module
#df_longs_prueba.to_csv('df_longs_prueba.csv')
#df_shorts_prueba.to_csv('df_shorts_prueba.csv')

df_longs_entrenamiento


time,Open,High,Low,Close,Volume,ohlc_avg,EMA70,EMA250,RSI8,perc_var_open_close,candle_range_perc,body_size_perc,upper_shadow_perc,lower_shadow_perc,upper_shadow_ratio,lower_shadow_ratio,body_to_range_ratio,perc_var_open_close_mean_3,perc_var_open_close_std_3,perc_var_open_close_min_3,perc_var_open_close_max_3,perc_var_open_close_median_3,candle_range_perc_mean_3,candle_range_perc_std_3,candle_range_perc_min_3,candle_range_perc_max_3,candle_range_perc_median_3,upper_shadow_ratio_mean_3,upper_shadow_ratio_std_3,lower_shadow_ratio_mean_3,lower_shadow_ratio_std_3,body_to_range_ratio_mean_3,body_to_range_ratio_std_3,RSI8_mean_3,RSI8_std_3,RSI8_min_3,RSI8_25%_3,RSI8_50%_3,RSI8_75%_3,RSI8_max_3,RSI8_slope_5,EMA70_mean_3,EMA70_std_3,EMA70_min_3,EMA70_max_3,EMA70_median_3,EMA250_mean_3,EMA250_std_3,EMA250_min_3,EMA250_max_3,EMA250_median_3,ema_diff,ema_diff_mean_3,ema_diff_std_3,ema_diff_min_3,ema_diff_max_3,ema_diff_median_3,close_to_ema70,close_to_ema250,close_to_ema70_mean_3,close_to_ema70_std_3,close_to_ema250_mean_3,close_to_ema250_std_3,EMA_ratio,EMA_ratio_mean_3,EMA_ratio_std_3,EMA70_slope_5,EMA250_slope_5,ema_diff_slope_5,atr_14,atr_14_mean_3,atr_14_std_3,atr_14_min_3,atr_14_max_3,atr_14_median_3,atr_14_slope_5,Volume_mean_3,Volume_std_3,Volume_min_3,Volume_max_3,Volume_median_3,Volume_relative_3,obv,obv_mean_3,obv_std_3,obv_slope_3,cmf_20,cmf_mean_3,cmf_std_3,cmf_slope_3,volatility_14,return_5,risk_adjusted_return_5,rsi_volatility_ratio,volatility_14_mean_3,volatility_14_std_3,volatility_14_min_3,volatility_14_max_3,volatility_14_median_3,volatility_14_slope_3,return_5_mean_3,return_5_std_3,risk_adjusted_return_5_mean_3,risk_adjusted_return_5_std_3,rsi_volatility_ratio_mean_3,rsi_volatility_ratio_std_3,rsi_volatility_ratio_min_3,rsi_volatility_ratio_max_3,rsi_volatility_ratio_median_3,rsi_volatility_ratio_slope_3,EntryPrice,ExitPrice,PnL,EntryTime,ExitTime,Duration,result
2021-01-04 10:29:00,1928.665,1929.712,1928.207,1929.185,95,1928.94225,1931.430422,1928.349681,13.763838,0.026962,0.078033,0.026962,0.027325,0.023747,0.350166,0.304319,0.345515,-0.000412,0.028266,-0.029494,0.026962,0.001296,0.053466,0.028180,0.022704,0.078033,0.059661,0.425702,0.468064,0.275316,0.246118,0.298982,0.222320,10.682264,2.893364,8.023674,9.141477,10.259279,12.011559,13.763838,0.081813,1931.504576,0.075190,1931.430422,1931.580762,1931.502543,1928.345279,0.004235,1928.341234,1928.349681,1928.344921,3.080741,3.159297,0.079407,3.080741,3.239528,3.157622,-2.245422,0.835319,-2.476576,0.316687,0.682721,0.315368,1.001598,1.001638,0.000041,-0.071944,0.005440,-0.077384,0.601442,0.539030,0.059098,0.483927,0.601442,0.531722,0.037990,74.333333,37.541089,31.0,97.0,95.0,1.278027,-139.0,-170.000000,55.434646,9.666667,-0.242182,-0.303404,0.068501,0.040950,0.000224,-0.000012,-0.055655,22.884727,0.000215,0.000008,0.000209,0.000224,0.000213,-0.000008,-0.000538,0.000470,-2.544629,2.215227,19.724921,4.101415,15.089969,22.884727,21.200068,-1.615634,1929.185,1930.487,98.952,2021-01-04 10:30:00,2021-01-04 10:34:00,0 days 00:04:00,1
2021-01-04 12:50:00,1930.342,1931.268,1930.342,1930.788,70,1930.68500,1933.144189,1931.887022,21.029246,0.023105,0.047971,0.023105,0.024866,0.000000,0.518359,0.000000,0.481641,-0.001828,0.026243,-0.029209,0.023105,0.000622,0.037123,0.009490,0.030358,0.047971,0.033042,0.492450,0.364200,0.045506,0.078819,0.462044,0.432101,13.628847,6.430294,9.404959,9.928648,10.452337,15.740791,21.029246,3.167935,1933.218531,0.075918,1933.144189,1933.295933,1933.215470,1931.897360,0.010696,1931.887022,1931.908381,1931.896677,1.257167,1.321171,0.065225,1.257167,1.387552,1.318793,-2.356189,-1.099022,-2.731864,0.328612,-1.410693,0.270175,1.000651,1.000684,0.000034,-0.076334,-0.010541,-0.065793,0.678964,0.668191,0.009753,0.659961,0.678964,0.665650,0.003729,65.000000,5.000000,60.0,70.0,65.0,1.076923,152.0,83.666667,67.515430,25.000000,-0.220022,-0.239026,0.020566,-0.011215,0.000253,-0.000055,-0.217307,30.972569,0.000246,0.000011,0.000233,0.000253,0.000252,-0.000003,-0.000230,0.000153,-0.945035,0.630709,20.308601,9.263746,14.250785,30.972569,15.702450,4.571051,1930.788,1931.526,99.630,2021-01-04 12:51:00,2021-01-04 12:52:00,0 days 00:01:00,1
2021-01-04 15:14:00,1933.157,1934.678,1933.157,1933.622,179,1933.65350,1939.863030,1936.910845,13.495900,0.024054,0.078680,0.024054,0.054626,0.000000,0.694280,0.000000,0.305720,0.002035,0.019118,-0.010345,0.024054,-0.007603,0.062431,0.032339,0.025189,0.083424,0.078680,0.489865,0.234357,0.240958,0.208768,0.269177,0.162876,9.623725,3.356677,7.539401,7.687637,7.835873,10.665887,13.495900,1.568944,1940.048729,0.188620,1939.863030,1940.240140,1940.043016,1936.938179,0.027939,1936.910845,1936.966685,1936.937008,2.952185,3.110550,0.160683,2.952185,3.273455,3.106008,-6.241030,-3.288845,-6.670062,0.371556,-3.559513,0.249396,1.001524,1.001606,0.000083,-0.190821,-0.027894,-0.162927,1.542584,1.570800,0.047440,1.542584,1.625571,1.544245,-0.027985,173.666667,34.312291,137.0,205.0,179.0,1.030710,-136.0,-209.666667,93.607336,-54.333333,-0.274399,-0.270910,0.004762,-0.022944,0.000766,0.000061,0.079686,8.748891,0.000786,0.000017,0.000766,0.000796,0.000795,-0.000013,-0.000557,0.000609,-0.700058,0.767365,6.150510,2.250476,4.820382,8.748891,4.882258,1.115116,1933.622,1932.564,-99.452,2021-01-04 15:15:00,2021-01-04 15:16:00,0 days 00:01:00,0
2021-01-04 15:19:00,1931.292,1934.093,1931.292,1933.883,216,1932.64000,1938.863016,1936.732735,30.641248,0.134159,0.145032,0.134159,0.010874,0.000000,0.074973,0.000000,0.925027,0.043374,0.079864,-0.016049,0.134159,0.012012,0.108746,0.038689,0.068035,0.145032,0.113170,0.212206,0.304935,0.373328,0.439831,0.414466,0.442500,17.064835,11.768965,9.757676,10.276629,10.795581,20.718414,30.641248,5.743708,1939.061194,0.207651,1938.863016,1939.277173,1939.043393,1936.770146,0.039874,1936.732735,1936.812095,1936.765608,2.130281,2.291048,0.167792,2.130281,2.465078,2.277785,-4.980016,-2.849735,-6.802194,1.578513,-4.511146,1.444844,1.001100,1.001183,0.000087,-0.212378,-0.040450,-0.171928,1.747295,1.679932,0.061668,1.626259,1.747295,1.666241,0.032339,221.000000,64.645185,159.0,288.0,216.0,0.977376,-63.0,-111.000000,149.879952,29.000000,-0.093407,-0.173399,0.075274,0.053448,0.000917,0.001301,1.419444,17.536391,0.000839,0.000067,0.000799,0.000917,0.000802,0.000043,-0.000444,0.001535,-0.622300,1.799809,10.010261,6.529543,5.856102,17.536391,6.638290,3.136806,1933.883,1938.550,98.007,2021-01-04 15:20:00,2021-01-04 15:39:00,0 days 00:19:00,1
2021-01-06 11:34:00,1950.819,1951.470,1950.593,1951.007,102,1950.97225,1954.911430,1953.551064,18.565451,0.009637,0.044955,0.009637,0.023734,0.011585,0.527936,0.257697,0.214367,-0.002680,0.025949,-0.032493,0.014816,0.009637,0.043723,0.016584,0.026557,0.059657,0.044955,0.390637,0.169002,0.170378,0.147567,0.438985,0.194638,12.640703,5.134921,9.477258,9.678328,9.879398,14.222425,18.565451,4.457863,1955.029036,0.119356,1954.911430,1955.150069,1955.025609,1953.572476,0.021771,1953.551064,1953.594589,1953.571777,1.360366,1.456559,0.097586,1.360366,1.555480,1.453832,-3.904430,-2.544064,-4.243703,0.359259,-2.787143,0.261959,1.000696,1.000746,0.000050,-0.121267,-0.021905,-0.099362,0.762789,0.762984,0.009079,0.754004,0.772158,0.762789,0.006924,86.000000,19.697716,64.0,102.0,92.0,1.186047,1653.0,1563.666667,83.721761,24.666667,-0.348311,-0.378267,0.027918,0.002589,0.000230,-0.000080,-0.350314,24.338901,0.000247,0.000017,0.000230,0.000263,0.000249,-0.000007,-0.000025,0.000084,-0.114083,0.336578,16.567559,6.731122,12.569245,24.338901,12.794530,5.780644,1951.007,1949.586,-99.470,2021-01-06 11:35:00,2021-01-06 11:50:00,0 days 00:15:00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-04-19 08:12:00,2380.385,2382.815,2380.145,2382.600,348,2381.48625,2385.980553,2384.614680,25.686246,0.093052,0.112167,0.093052,0.009032,0.010082,0.080524,0.089888,0.829588,0.011771,0.071064,-0.038627,0.093052,-0.019111,0.094219,0.019736,0.073083,0.112167,0.097407,0.316592,0.218924,0.187530,0.168576,0.495878,0.296786,12.508560,11.446217,5.038057,5.919718,6.801378,16.243812,25.686246,5.820021,2386.120632,0.145232,2385.980553,2386.270521,2386.110823,2384.642245,0.028857,2384.614680,2384.672246,2384.639808,1.365873,1.478388,0.116376,1.365873,1.598275,1.471015,-3.380553,-2.014680,-4.845632,1.277358,-3.367245,1.190258,1.000573,1.000620,0.000049,-0.142313,-0.027562,-0.114751,1.426310,1.352034,0.066223,1.299152,1.426310,1.330641,0.068561,335.333333,48.263167,282.0,376.0,348.0,1.037773,-247918.0,-248024.666667,209.469170,-103.333333,-0.139383,-0.176468,0.036129,0.001855,0.000351,0.000353,1.005431,18.008883,0.000247,0.000090,0.000190,0.000351,0.000201,0.000047,-0.000420,0.000672,-2.427167,3.005265,9.010105,7.826778,3.786187,18.008883,5.235244,3.756521,2382.600,2379.520,-98.560,2024-04-19 08:13:00,2024-04-19 08:27:00,0 days 00:14:00,0
2024-04-22 05:08:00,2369.015,2369.580,2368.920,2369.105,51,2369.15500,2386.772170,2385.939538,12.644381,0.003799,0.027860,0.003799,0.020051,0.004010,0.719697,0.143939,0.136364,0.009219,0.020085,-0.007600,0.031458,0.003799,0.027091,0.005007,0.021744,0.031669,0.027860,0.385530,0.362587,0.121399,0.105288,0.493071,0.446156,9.096519,3.230300,6.325422,7.322587,8.319753,10.482067,12.644381,2.101740,2387.292711,0.525559,2386.772170,2387.823148,2387.282813,2386.076090,0.137428,2385.939538,2386.214378,2386.074353,0.832632,1.216621,0.388133,0.832632,1.608770,1.208460,-17.667170,-16.834538,-18.496044,0.963481,-17.279423,0.586736,1.000349,1.000510,0.000163,-0.538098,-0.139339,-0.398758,1.808417,1.896714,0.088276,1.808417,1.984968,1.896756,-0.096542,48.666667,5.859465,42.0,53.0,51.0,1.047945,-243870.0,-243918.000000,46.572525,13.333333,0.090106,0.133347,0.052312,-0.052597,0.002896,0.000277,0.095509,6.991962,0.002895,0.000002,0.002892,0.002897,0.002896,0.000001,0.000145,0.000171,0.050046,0.059119,4.854976,1.945461,3.186661,6.991962,4.386305,1.323500,2369.105,2367.555,-99.200,2024-04-22 05:09:00,2024-04-22 05:36:00,0 days 00:27:00,0
2024-04-25 05:02:00,2315.265,2316.475,2315.190,2316.400,91,2315.83250,2326.519256,2323.834537,12.718520,0.049022,0.055501,0.049022,0.003239,0.003239,0.058366,0.058366,0.883268,0.014326,0.032154,-0.014468,0.049022,0.008423,0.036929,0.016327,0.024834,0.055501,0.030453,0.137761,0.183719,0.281414,0.376674,0.580824,0.303340,7.716517,4.332270,5.156033,5.215515,5.274997,8.996758,12.718520,-1.160522,2326.839642,0.325829,2326.519256,2327.170655,2326.829017,2323.901097,0.067732,2323.834537,2323.969945,2323.898810,2.684719,2.938545,0.258096,2.684719,3.200709,2.930206,-10.119256,-7.434537,-11.261309,1.024797,-8.322764,0.780652,1.001155,1.001264,0.000111,-0.332043,-0.068409,-0.263634,1.523806,1.557519,0.043465,1.523806,1.606574,1.542176,0.126473,113.333333,32.005208,91.0,150.0,99.0,0.802941,-237235.0,-237345.666667,121.697713,47.333333,-0.028408,-0.051074,0.029934,0.000807,0.000889,-0.002689,-3.025772,8.346546,0.000892,0.000015,0.000879,0.000909,0.000889,0.000190,-0.003412,0.000696,-3.819861,0.739326,4.991093,2.906063,3.283382,8.346546,3.343349,-1.936524,2316.400,2318.435,99.715,2024-04-25 05:03:00,2024-04-25 05:36:00,0 days 00:33:00,1
2024-04-25 12:18:00,2325.890,2326.455,2325.875,2326.290,232,2326.12750,2328.309640,2326.775633,12.560402,0.017198,0.024937,0.017198,0.007094,0.000645,0.284483,0.025862,0.689655,-0.000071,0.015020,-0.010102,0.017198,-0.007308,0.027872,0.006436,0.023428,0.035253,0.024937,0.320549,0.230616,0.236729,0.216641,0.442722,0.241376,10.368868,1.905360,9.104922,9.273102,9.441282,11.000842,12.560402,-0.049841,2328.374299,0.065373,2328.309640,2328.440366,2328.372890,2326.781074,0.005544,2326.775633,2326.786749,2326.780839,1.534007,1.593225,0.059813,1.534007,1.653617,1.592052,-2.019640,-0.485633,-2.294299,0.243323,-0.701074,0.203819,1.000659,1.000685,0.000026,-0.065898,-0.005464,-0.060434,0.680307,0.682067,0.005300,0.677870,0.688023,0.680307,-0.002595,220.333333,11.060440,210.0,232.0,219.0,1.052950,-234104.0,-234185.666667,130.354645,-65.666667,-0.095064,-0.104524,0.011213,0.013591,0.000169,-0.000002,-0.012700,18.462854,0.000166,0.000012,0.000152,0.000175,0.000169,-0.000005,-0.000383,0.000344,-2.380130,2.221643,15.208059,2.840038,13.233465,18.462854,13.927857,-0.002808,2326.290,2325.505,-99.695,2024-04-25 12:19:00,2024-04-25 12:20:00,0 days 00:01:00,0
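The `_mean_3`, `_std_3`, `_min_3` and similar columns appear to summarize a three-bar window that ends on the signal bar. As a minimal sketch, the first row's `RSI8` summaries can be reproduced from the three values that the table exposes (its minimum, median and maximum). The window definition and the use of the sample standard deviation are inferred from the numbers, not confirmed by the feature-engineering code.

```python
import statistics

# The three RSI8 readings in the window of the first trade (2021-01-04 10:29).
# Only the set of values is recoverable from the table (min, median, max);
# their chronological order inside the window is not shown, although the
# maximum equals the bar's own RSI8 value.
rsi8_window = [8.023674, 10.259279, 13.763838]

summary = {
    "RSI8_mean_3": statistics.mean(rsi8_window),
    "RSI8_std_3": statistics.stdev(rsi8_window),  # sample std (divides by n - 1)
    "RSI8_min_3": min(rsi8_window),
    "RSI8_50%_3": statistics.median(rsi8_window),
    "RSI8_max_3": max(rsi8_window),
}

for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```

The printed mean and standard deviation agree with the `RSI8_mean_3` and `RSI8_std_3` values in the first row to within rounding of the displayed inputs, which supports the reading that each window includes the current bar.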


In [7]:
# Count the non-null (valid) values in the 'result' column to verify that every trade is present

# Training
if 'result' in df_longs_entrenamiento.columns:
    count_results_longs = df_longs_entrenamiento['result'].notna().sum()
    print(f"Number of 'result' values in training (longs): {count_results_longs}")
else:
    print("Column 'result' is not in df_longs_entrenamiento.")

if 'result' in df_shorts_entrenamiento.columns:
    count_results_shorts = df_shorts_entrenamiento['result'].notna().sum()
    print(f"Number of 'result' values in training (shorts): {count_results_shorts}")
else:
    print("Column 'result' is not in df_shorts_entrenamiento.")

# Test
if 'result' in df_longs_prueba.columns:
    count_results_longs = df_longs_prueba['result'].notna().sum()
    print(f"Number of 'result' values in test (longs): {count_results_longs}")
else:
    print("Column 'result' is not in df_longs_prueba.")

if 'result' in df_shorts_prueba.columns:
    count_results_shorts = df_shorts_prueba['result'].notna().sum()
    print(f"Number of 'result' values in test (shorts): {count_results_shorts}")
else:
    print("Column 'result' is not in df_shorts_prueba.")


# No trades are missing from the training or test backtests


Number of 'result' values in training (longs): 724
Number of 'result' values in training (shorts): 735
Number of 'result' values in test (longs): 92
Number of 'result' values in test (shorts): 83


## 3. Modeling and optimization

## Modeling longs

In [8]:
for col in df_longs_entrenamiento.columns:
    print(col)

Open
High
Low
Close
Volume
ohlc_avg
EMA70
EMA250
RSI8
perc_var_open_close
candle_range_perc
body_size_perc
upper_shadow_perc
lower_shadow_perc
upper_shadow_ratio
lower_shadow_ratio
body_to_range_ratio
perc_var_open_close_mean_3
perc_var_open_close_std_3
perc_var_open_close_min_3
perc_var_open_close_max_3
perc_var_open_close_median_3
candle_range_perc_mean_3
candle_range_perc_std_3
candle_range_perc_min_3
candle_range_perc_max_3
candle_range_perc_median_3
upper_shadow_ratio_mean_3
upper_shadow_ratio_std_3
lower_shadow_ratio_mean_3
lower_shadow_ratio_std_3
body_to_range_ratio_mean_3
body_to_range_ratio_std_3
RSI8_mean_3
RSI8_std_3
RSI8_min_3
RSI8_25%_3
RSI8_50%_3
RSI8_75%_3
RSI8_max_3
RSI8_slope_5
EMA70_mean_3
EMA70_std_3
EMA70_min_3
EMA70_max_3
EMA70_median_3
EMA250_mean_3
EMA250_std_3
EMA250_min_3
EMA250_max_3
EMA250_median_3
ema_diff
ema_diff_mean_3
ema_diff_std_3
ema_diff_min_3
ema_diff_max_3
ema_diff_median_3
close_to_ema70
close_to_ema250
close_to_ema70_mean_3
close_to_ema70_std_3


In [9]:
import os
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Set the environment variable to avoid the joblib warning
os.environ["LOKY_MAX_CPU_COUNT"] = "14"  # Adjust to the machine's core count

# Columns excluded from the longs model: trade metadata and outcome fields,
# plus the raw RSI8, EMA70 and EMA250 values
columns_to_drop_longs = ['EntryPrice', 'ExitPrice', 'PnL', 'EntryTime', 'ExitTime', 'Duration', 'result',
                         'RSI8', 'EMA70', 'EMA250']

# Build the features (X) and target (y) for longs
X_longs = df_longs_entrenamiento.drop(columns=columns_to_drop_longs)
y_longs = df_longs_entrenamiento['result']

# Number of folds for the time-series cross-validation
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

# Classifiers to compare for the longs model
classifiers_longs = [
    XGBClassifier(random_state=42, eval_metric='logloss'),
    LGBMClassifier(random_state=42, verbose=-1),
    CatBoostClassifier(random_state=42, verbose=False),
    RandomForestClassifier(random_state=42)
]

# Dictionary that stores the fitted models
trained_classifiers_longs = {}

# Train and evaluate the longs models with TimeSeriesSplit
for clf in classifiers_longs:
    auc_scores_longs = []  # AUC of each fold
    for train_index, test_index in tscv.split(X_longs):
        X_train_longs, X_test_longs = X_longs.iloc[train_index], X_longs.iloc[test_index]
        y_train_longs, y_test_longs = y_longs.iloc[train_index], y_longs.iloc[test_index]
        
        clf.fit(X_train_longs, y_train_longs)
        y_pred_longs = clf.predict_proba(X_test_longs)[:, 1]
        
        # AUC for this fold
        auc_score = roc_auc_score(y_test_longs, y_pred_longs)
        auc_scores_longs.append(auc_score)
    
    # Mean AUC across all folds
    mean_auc_score = sum(auc_scores_longs) / len(auc_scores_longs)
    print(f'Longs - {type(clf).__name__}: Mean AUC Score={mean_auc_score:.3f}')
    
    # Store the model (as fitted on the last fold's training split)
    trained_classifiers_longs[type(clf).__name__] = clf


Longs - XGBClassifier: Mean AUC Score=0.488
Longs - LGBMClassifier: Mean AUC Score=0.493
Longs - CatBoostClassifier: Mean AUC Score=0.518
Longs - RandomForestClassifier: Mean AUC Score=0.526
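To make the fold sizes behind these scores concrete, here is a minimal sketch of the index arithmetic of an expanding-window split in plain Python. It assumes scikit-learn's default `TimeSeriesSplit` behaviour (no `gap`, `test_size` or `max_train_size`). The helper `expanding_window_folds` is a hypothetical name used only for illustration.

```python
def expanding_window_folds(n_samples, n_splits):
    """Fold boundaries of an expanding-window split.

    Each fold trains on rows [0, test_start) and validates on
    rows [test_start, test_stop).
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    folds = []
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        folds.append((test_start, test_start + test_size))
    return folds


# 724 is the number of long training trades reported earlier in the notebook.
folds = expanding_window_folds(n_samples=724, n_splits=5)
for i, (test_start, test_stop) in enumerate(folds, start=1):
    print(f"fold {i}: train rows 0-{test_start - 1} ({test_start} trades), "
          f"validate rows {test_start}-{test_stop - 1} ({test_stop - test_start} trades)")
```

Under these assumptions the first fold is fitted on only 124 trades and every validation block holds 120, so individual fold AUCs can be noisy and a mean near 0.5 is hard to distinguish from chance.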


In [25]:
# DataFrame that collects each model's feature importances
feature_importance_df_longs = pd.DataFrame(index=X_longs.columns)

# Extract and print the feature importances of every fitted longs model
for clf_name, clf in trained_classifiers_longs.items():
    if hasattr(clf, "feature_importances_"):
        # Read the feature importances
        feature_importance = clf.feature_importances_
    elif hasattr(clf, "get_feature_importance"):
        feature_importance = clf.get_feature_importance()
    else:
        continue  # Skip models that expose no feature importances
    
    # Store the importances in the DataFrame
    feature_importance_df_longs[clf_name] = feature_importance
    
    # Print the importances in descending order
    print(f'\n{clf_name} Feature Importances:')
    for feature, importance in sorted(zip(X_longs.columns, feature_importance), key=lambda x: x[1], reverse=True):
        print(f'{feature}: {importance:.4f}')



XGBClassifier Feature Importances:
volatility_14_mean_3: 0.0265
rsi_volatility_ratio_min_3: 0.0239
perc_var_open_close_min_3: 0.0197
Open: 0.0182
upper_shadow_ratio_mean_3: 0.0178
body_size_perc: 0.0175
EMA70_slope_5: 0.0169
RSI8_50%_3: 0.0169
close_to_ema250: 0.0165
candle_range_perc_max_3: 0.0158
close_to_ema70: 0.0155
EMA_ratio: 0.0154
candle_range_perc_median_3: 0.0151
rsi_volatility_ratio_max_3: 0.0150
rsi_volatility_ratio_std_3: 0.0150
ema_diff: 0.0147
volatility_14_std_3: 0.0146
close_to_ema250_mean_3: 0.0141
candle_range_perc: 0.0140
candle_range_perc_min_3: 0.0139
perc_var_open_close_std_3: 0.0138
perc_var_open_close_median_3: 0.0138
RSI8_min_3: 0.0138
body_to_range_ratio_mean_3: 0.0134
Volume_min_3: 0.0133
body_to_range_ratio: 0.0129
Volume: 0.0127
rsi_volatility_ratio_mean_3: 0.0124
risk_adjusted_return_5_mean_3: 0.0124
Volume_max_3: 0.0121
perc_var_open_close: 0.0121
close_to_ema250_std_3: 0.0120
cmf_20: 0.0119
body_to_range_ratio_std_3: 0.0117
rsi_volatility_ratio: 0.0116

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import roc_auc_score
import numpy as np

# Define the RandomForestClassifier for longs
clf_longs = RandomForestClassifier(random_state=42)

# Extended hyperparameter space for the search
param_grid_longs = {
    'n_estimators': [200, 500, 1000, 1500],           # Number of trees in the forest
    'max_depth': [10, 20, 30, None],                  # Maximum tree depth (None for no limit)
    'min_samples_split': [2, 5, 10],                  # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4],                    # Minimum samples in each leaf
    # Features considered at each split. Note: 'auto' meant 'sqrt' for classifiers
    # and was removed in scikit-learn 1.3, where it only yields failed fits
    'max_features': ['auto', 'sqrt', 'log2'],
    'bootstrap': [True, False],                       # Sampling with or without replacement
}

# Time-series cross-validation (TimeSeriesSplit) for longs
tscv_longs = TimeSeriesSplit(n_splits=5)

# Randomized hyperparameter search for longs
random_search_longs = RandomizedSearchCV(
    estimator=clf_longs, 
    param_distributions=param_grid_longs, 
    cv=tscv_longs, 
    scoring='roc_auc', 
    n_jobs=-1, 
    n_iter=50, 
    random_state=42
)

# Run the search on the longs data
random_search_longs.fit(X_longs, y_longs)

# Print the best hyperparameters found
print(f"Best hyperparameters for longs: {random_search_longs.best_params_}")

# Best model after the search (refitted on all of X_longs)
best_model_longs = random_search_longs.best_estimator_

# Show the best model
print(f"Best model for longs after optimization: {best_model_longs}")


Best hyperparameters for longs: {'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': None, 'bootstrap': False}
Best model for longs after optimization: RandomForestClassifier(bootstrap=False, max_features='log2', min_samples_leaf=2,
                       min_samples_split=10, n_estimators=500, random_state=42)
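For a sense of how much of the search space `n_iter=50` covers, the following sketch counts the combinations in the grid used above. It only does arithmetic on the listed values and does not run any search.

```python
from math import prod

# Same candidate values as param_grid_longs in the cell above.
param_grid = {
    "n_estimators": [200, 500, 1000, 1500],
    "max_depth": [10, 20, 30, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True, False],
}
n_iter = 50   # candidates sampled by RandomizedSearchCV
n_folds = 5   # TimeSeriesSplit(n_splits=5)

total_combinations = prod(len(values) for values in param_grid.values())
coverage = n_iter / total_combinations
cv_fits = n_iter * n_folds  # excludes the final refit of the best candidate

print(f"combinations in the grid: {total_combinations}")
print(f"share explored with n_iter={n_iter}: {coverage:.1%}")
print(f"cross-validation fits: {cv_fits}")
```

The search evaluates 50 of 864 combinations, a little under 6% of the grid, so the reported best parameters are the best among those sampled rather than a guaranteed optimum.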


In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Define the RandomForestClassifier with the best hyperparameters found for longs
modelo_rf_longs = RandomForestClassifier(
    n_estimators=500, 
    max_depth=None, 
    min_samples_split=10, 
    min_samples_leaf=2, 
    max_features='log2', 
    bootstrap=False, 
    random_state=42
)

# Time-series cross-validation for longs
tscv_longs = TimeSeriesSplit(n_splits=7)

# Per-fold results for longs
accuracy_scores_longs = []
f1_scores_longs = []
precision_scores_longs = []
recall_scores_longs = []

# Cross-validate and evaluate the model for longs
for train_index, test_index in tscv_longs.split(X_longs):
    X_train_longs, X_test_longs = X_longs.iloc[train_index], X_longs.iloc[test_index]
    y_train_longs, y_test_longs = y_longs.iloc[train_index], y_longs.iloc[test_index]
    
    # Fit the tuned model on this fold's training split
    modelo_rf_longs.fit(X_train_longs, y_train_longs)
    
    # Class predictions
    y_pred_longs = modelo_rf_longs.predict(X_test_longs)
    
    # Metrics
    accuracy_longs = accuracy_score(y_test_longs, y_pred_longs)
    f1_longs = f1_score(y_test_longs, y_pred_longs)
    precision_longs = precision_score(y_test_longs, y_pred_longs)
    recall_longs = recall_score(y_test_longs, y_pred_longs)
    
    # Store the results
    accuracy_scores_longs.append(accuracy_longs)
    f1_scores_longs.append(f1_longs)
    precision_scores_longs.append(precision_longs)
    recall_scores_longs.append(recall_longs)
    
    # Print the metrics for each fold
    print(f"Fold (Longs): Accuracy = {accuracy_longs:.3f}, F1-Score = {f1_longs:.3f}, Precision = {precision_longs:.3f}, Recall = {recall_longs:.3f}")
    
    # Classification report and confusion matrix for each fold
    print("\nClassification Report for Longs:")
    print(classification_report(y_test_longs, y_pred_longs))
    
    print("Confusion Matrix for Longs:")
    print(confusion_matrix(y_test_longs, y_pred_longs))

# Average the metrics over the 7 folds for longs
print(f"\nMean Accuracy (Longs): {np.mean(accuracy_scores_longs):.3f}")
print(f"Mean F1-Score (Longs): {np.mean(f1_scores_longs):.3f}")
print(f"Mean Precision (Longs): {np.mean(precision_scores_longs):.3f}")
print(f"Mean Recall (Longs): {np.mean(recall_scores_longs):.3f}")


Fold (Longs): Accuracy = 0.567, F1-Score = 0.400, Precision = 0.650, Recall = 0.289

Classification Report for Longs:
              precision    recall  f1-score   support

         0.0       0.54      0.84      0.66        45
         1.0       0.65      0.29      0.40        45

    accuracy                           0.57        90
   macro avg       0.60      0.57      0.53        90
weighted avg       0.60      0.57      0.53        90

Confusion Matrix for Longs:
[[38  7]
 [32 13]]
Fold (Longs): Accuracy = 0.522, F1-Score = 0.442, Precision = 0.515, Recall = 0.386

Classification Report for Longs:
              precision    recall  f1-score   support

         0.0       0.53      0.65      0.58        46
         1.0       0.52      0.39      0.44        44

    accuracy                           0.52        90
   macro avg       0.52      0.52      0.51        90
weighted avg       0.52      0.52      0.51        90

Confusion Matrix for Longs:
[[30 16]
 [27
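As a worked check of how the per-fold metrics follow from the confusion matrix, this sketch recomputes them for the first fold, whose matrix is `[[38, 7], [32, 13]]`. It uses scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# First-fold confusion matrix: rows are true classes (0, 1),
# columns are predicted classes (0, 1).
tn, fp = 38, 7    # losing trades: correctly rejected / wrongly accepted
fn, tp = 32, 13   # winning trades: wrongly rejected / correctly accepted

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of accepted trades that were winners
recall = tp / (tp + fn)      # share of winners that were accepted
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy = {accuracy:.3f}, F1-Score = {f1:.3f}, "
      f"Precision = {precision:.3f}, Recall = {recall:.3f}")
```

Read as a trade filter, the first fold accepts 20 of 90 trades and 13 of those are winners, against a 50% base rate in that fold. The low recall means most winning trades are filtered out as well.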

In [12]:
import joblib

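# Save the trained model to a .pkl file
# Note: modelo_rf_longs was last fitted inside the cross-validation loop, so the saved
# model was trained on the final fold's training split, not on the full training set
joblib.dump(modelo_rf_longs, 'modelo_rf_longs.pkl')
print("Model saved successfully.")



Model saved successfully.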


## Modeling shorts

In [13]:
for col in df_shorts_entrenamiento.columns:
    print(col)


Open
High
Low
Close
Volume
ohlc_avg
EMA70
EMA250
RSI8
perc_var_open_close
candle_range_perc
body_size_perc
upper_shadow_perc
lower_shadow_perc
upper_shadow_ratio
lower_shadow_ratio
body_to_range_ratio
perc_var_open_close_mean_3
perc_var_open_close_std_3
perc_var_open_close_min_3
perc_var_open_close_max_3
perc_var_open_close_median_3
candle_range_perc_mean_3
candle_range_perc_std_3
candle_range_perc_min_3
candle_range_perc_max_3
candle_range_perc_median_3
upper_shadow_ratio_mean_3
upper_shadow_ratio_std_3
lower_shadow_ratio_mean_3
lower_shadow_ratio_std_3
body_to_range_ratio_mean_3
body_to_range_ratio_std_3
RSI8_mean_3
RSI8_std_3
RSI8_min_3
RSI8_25%_3
RSI8_50%_3
RSI8_75%_3
RSI8_max_3
RSI8_slope_5
EMA70_mean_3
EMA70_std_3
EMA70_min_3
EMA70_max_3
EMA70_median_3
EMA250_mean_3
EMA250_std_3
EMA250_min_3
EMA250_max_3
EMA250_median_3
ema_diff
ema_diff_mean_3
ema_diff_std_3
ema_diff_min_3
ema_diff_max_3
ema_diff_median_3
close_to_ema70
close_to_ema250
close_to_ema70_mean_3
close_to_ema70_std_3


In [14]:
import os
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit
import pandas as pd

# Set the environment variable to avoid the joblib warning
os.environ["LOKY_MAX_CPU_COUNT"] = "14"  # Adjust to the machine's core count


# Columns excluded from the shorts model: trade metadata and outcome fields
columns_to_drop_shorts = [
    'EntryPrice', 'ExitPrice', 'PnL', 'EntryTime', 'ExitTime', 'Duration', 'result']



# Split into features (X) and target (y)
X_shorts = df_shorts_entrenamiento.drop(columns=columns_to_drop_shorts)
y_shorts = df_shorts_entrenamiento['result']

# Number of folds for the time-series cross-validation
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

# Classifiers to compare
classifiers = [
    XGBClassifier(random_state=42, eval_metric='logloss'),
    LGBMClassifier(random_state=42, verbose=-1),
    CatBoostClassifier(random_state=42, verbose=False),
    RandomForestClassifier(random_state=42)
]

# Dictionary that stores the fitted models
trained_classifiers = {}

# Train and evaluate the models with TimeSeriesSplit
for clf in classifiers:
    auc_scores = []  # AUC of each fold
    for train_index, test_index in tscv.split(X_shorts):  # Shorts-specific variables
        # Split into training and validation data
        X_train_shorts, X_test_shorts = X_shorts.iloc[train_index], X_shorts.iloc[test_index]
        y_train_shorts, y_test_shorts = y_shorts.iloc[train_index], y_shorts.iloc[test_index]
        
        # Fit the model
        clf.fit(X_train_shorts, y_train_shorts)
        
        # Predict probabilities for the positive class (1)
        y_pred_shorts = clf.predict_proba(X_test_shorts)[:, 1]
        
        # AUC for this fold
        auc_score = roc_auc_score(y_test_shorts, y_pred_shorts)
        auc_scores.append(auc_score)
    
    # Mean AUC across all folds
    mean_auc_score = sum(auc_scores) / len(auc_scores)
    print(f'{type(clf).__name__}: Mean AUC Score={mean_auc_score:.3f}')
    
    # Store the model (as fitted on the last fold's training split)
    trained_classifiers[type(clf).__name__] = clf


XGBClassifier: Mean AUC Score=0.496
LGBMClassifier: Mean AUC Score=0.488
CatBoostClassifier: Mean AUC Score=0.525
RandomForestClassifier: Mean AUC Score=0.506
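To put these AUC values in context, the sketch below computes AUC from its pairwise definition: the fraction of (winner, loser) pairs in which the winner gets the higher score, with ties counted as half. The function name `pairwise_auc` and the toy labels and scores are hypothetical and are not taken from the backtest.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: two losing trades (0) and two winning trades (1).
y_true = [0, 0, 1, 1]
informative_scores = [0.1, 0.4, 0.35, 0.8]
constant_scores = [0.5, 0.5, 0.5, 0.5]

print(pairwise_auc(y_true, informative_scores))  # 3 of 4 pairs ordered correctly
print(pairwise_auc(y_true, constant_scores))     # every pair is a tie
```

A model that scores every trade the same lands at exactly 0.5, so mean AUCs between 0.49 and 0.53 indicate that the untuned classifiers rank winning and losing trades only marginally better than chance, if at all.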


In [15]:
# DataFrame that collects each model's feature importances
feature_importance_df = pd.DataFrame(index=X_shorts.columns)

# Extract and print the feature importances of every fitted model
for clf_name, clf in trained_classifiers.items():
    if hasattr(clf, "feature_importances_"):
        # Read the feature importances
        feature_importance = clf.feature_importances_
    elif hasattr(clf, "get_feature_importance"):
        feature_importance = clf.get_feature_importance()
    else:
        continue  # Skip models that expose no feature importances
    
    # Store the importances in the DataFrame
    feature_importance_df[clf_name] = feature_importance
    
    # Print the importances in descending order
    print(f'\n{clf_name} Feature Importances:')
    for feature, importance in sorted(zip(X_shorts.columns, feature_importance), key=lambda x: x[1], reverse=True):
        print(f'{feature}: {importance:.4f}')




XGBClassifier Feature Importances:
ema_diff_slope_5: 0.0332
Volume: 0.0276
body_size_perc: 0.0250
Volume_max_3: 0.0241
RSI8_std_3: 0.0237
body_to_range_ratio_mean_3: 0.0222
atr_14_slope_5: 0.0186
perc_var_open_close_mean_3: 0.0183
return_5_std_3: 0.0181
rsi_volatility_ratio_slope_3: 0.0180
return_5: 0.0177
RSI8_min_3: 0.0174
RSI8: 0.0168
Volume_mean_3: 0.0164
atr_14: 0.0154
EMA70_min_3: 0.0149
candle_range_perc_median_3: 0.0145
close_to_ema250: 0.0144
RSI8_50%_3: 0.0143
candle_range_perc_max_3: 0.0143
EMA250_std_3: 0.0140
volatility_14: 0.0137
candle_range_perc_mean_3: 0.0130
RSI8_75%_3: 0.0129
volatility_14_max_3: 0.0127
return_5_mean_3: 0.0120
rsi_volatility_ratio_mean_3: 0.0118
RSI8_max_3: 0.0115
close_to_ema70: 0.0115
EMA250: 0.0115
cmf_slope_3: 0.0114
volatility_14_slope_3: 0.0114
rsi_volatility_ratio_min_3: 0.0109
perc_var_open_close_max_3: 0.0107
ema_diff_min_3: 0.0106
EMA_ratio_std_3: 0.0104
upper_shadow_perc: 0.0103
candle_range_perc_min_3: 0.0102
RSI8_25%_3: 0.0100
lower_sha

For the short-position model I used the following approach: I iterated over each model's feature importances and applied a decision threshold to them. I kept the features that the LGBMClassifier ranked as most relevant, using a 20% filter, that is, dropping every feature whose importance fell below 20%.
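The text does not say what the 20% is measured against. One possible reading, shown here purely as an assumption, is a cutoff relative to the most important feature. The function `select_by_relative_importance` and the importance values are hypothetical. The real values would come from the fitted LGBMClassifier, whose default importances are split counts.

```python
def select_by_relative_importance(importances, threshold=0.20):
    """Split features into (keep, drop) by importance relative to the maximum.

    `importances` maps feature name -> non-negative importance.
    A feature is kept when importance / max_importance >= threshold.
    """
    top = max(importances.values())
    if top <= 0:
        # No feature carries any importance, so nothing can be ranked.
        return list(importances), []
    keep = [name for name, value in importances.items() if value / top >= threshold]
    drop = [name for name, value in importances.items() if value / top < threshold]
    return keep, drop


# Hypothetical split-count importances, for illustration only.
example_importances = {
    "RSI8_std_3": 120,
    "Volume": 90,
    "atr_14_slope_5": 30,
    "EMA250": 10,
    "High": 0,
}

keep, drop = select_by_relative_importance(example_importances, threshold=0.20)
print("keep:", keep)
print("drop:", drop)
```

With a cutoff of 20% of the top importance (24 split counts in this toy example), the last two features would be added to a drop list such as `columns_to_drop_shorts`. Other readings of the filter, such as a percentile cutoff, would need a different rule.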

In [16]:
import os
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit
import pandas as pd

# Set this environment variable to silence the joblib/loky CPU-count warning
os.environ["LOKY_MAX_CPU_COUNT"] = "14"  # Adjust to match the machine


columns_to_drop_shorts = [
    'EntryPrice', 'ExitPrice', 'PnL', 'EntryTime', 'ExitTime', 'Duration', 'result',
    'volatility_14_median_3', 'ema_diff', 'EMA_ratio_std_3', 'ema_diff_slope_5',
    'body_size_perc', 'perc_var_open_close_mean_3', 'candle_range_perc_median_3',
    'Volume_median_3', 'ema_diff_min_3', 'volatility_14', 'risk_adjusted_return_5',
    'rsi_volatility_ratio', 'perc_var_open_close', 'close_to_ema250',
    'volatility_14_min_3', 'EMA70_slope_5', 'EMA250_slope_5', 'EMA250',
    'EMA70_std_3', 'close_to_ema250_mean_3', 'Volume_mean_3', 'volatility_14_mean_3',
    'close_to_ema250_std_3', 'Volume_max_3', 'volatility_14_max_3', 'EMA70',
    'RSI8_25%_3', 'EMA_ratio_mean_3', 'rsi_volatility_ratio_min_3', 'Low',
    'close_to_ema70_mean_3', 'atr_14_max_3', 'return_5', 'atr_14', 'obv_mean_3',
    'rsi_volatility_ratio_mean_3', 'rsi_volatility_ratio_max_3', 'EMA70_min_3',
    'atr_14_min_3', 'ema_diff_mean_3', 'atr_14_mean_3', 'atr_14_median_3', 'High',
    'ema_diff_median_3', 'rsi_volatility_ratio_median_3', 'Close', 'ohlc_avg',
    'EMA70_mean_3', 'EMA70_max_3', 'EMA70_median_3', 'EMA250_mean_3', 
    'EMA250_min_3', 'EMA250_max_3', 'EMA250_median_3', 'ema_diff_max_3'
]


# Split into features (X) and target variable (y)
X_shorts = df_shorts_entrenamiento.drop(columns=columns_to_drop_shorts)
y_shorts = df_shorts_entrenamiento['result']

# Number of folds for the time-series cross-validation
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

# Classifiers to compare
classifiers = [
    XGBClassifier(random_state=42, eval_metric='logloss'),
    LGBMClassifier(random_state=42, verbose=-1),
    CatBoostClassifier(random_state=42, verbose=False),
    RandomForestClassifier(random_state=42)
]

# Dictionary holding the fitted models
trained_classifiers = {}

# Train and evaluate each model with TimeSeriesSplit
for clf in classifiers:
    auc_scores = []  # AUC of each fold
    for train_index, test_index in tscv.split(X_shorts):  # Positional indices of each temporal split
        # Split the data into training and test sets
        X_train_shorts, X_test_shorts = X_shorts.iloc[train_index], X_shorts.iloc[test_index]
        y_train_shorts, y_test_shorts = y_shorts.iloc[train_index], y_shorts.iloc[test_index]

        # Fit the model
        clf.fit(X_train_shorts, y_train_shorts)

        # Predict probabilities for the positive class (1)
        y_pred_shorts = clf.predict_proba(X_test_shorts)[:, 1]

        # Compute the AUC for this fold
        auc_score = roc_auc_score(y_test_shorts, y_pred_shorts)
        auc_scores.append(auc_score)

    # Average the AUC over all folds
    mean_auc_score = sum(auc_scores) / len(auc_scores)
    print(f'{type(clf).__name__}: Mean AUC Score={mean_auc_score:.3f}')

    # Store the model; it remains fitted on the last fold's training split only
    trained_classifiers[type(clf).__name__] = clf


XGBClassifier: Mean AUC Score=0.510
LGBMClassifier: Mean AUC Score=0.499
CatBoostClassifier: Mean AUC Score=0.527
RandomForestClassifier: Mean AUC Score=0.532
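
The scores above come from an expanding-window scheme: each fold trains on all rows before a cut point and tests on the block that follows, so no future trade leaks into training. The pure-Python sketch below mimics what `TimeSeriesSplit` does with its default settings (no gap, no maximum training size). The helper name `expanding_window_splits` and the row count of 732 are hypothetical, with 732 chosen only because it is close to the roughly 750 trades mentioned in the introduction.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs with a growing training window.

    Sketch of the default TimeSeriesSplit behaviour: equal-sized test blocks
    taken from the end of the data, each preceded by all earlier rows.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


# Hypothetical row count; the real value is len(X_shorts)
for fold, (train, test) in enumerate(expanding_window_splits(732, 5), start=1):
    print(f"fold {fold}: train rows 0..{train[-1]}, test rows {test[0]}..{test[-1]}")
```

Under this assumption the first fold trains on only 122 rows, which may help explain why the early folds are noisy and the mean AUC stays close to 0.5.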


In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import roc_auc_score
import numpy as np

# Define the RandomForestClassifier for shorts
clf_shorts = RandomForestClassifier(random_state=42)

# Hyperparameter search space for shorts
param_grid_shorts = {
    'n_estimators': [200, 500, 1000, 1500],           # Number of trees in the forest
    'max_depth': [10, 20, 30, None],                  # Maximum tree depth (None means no limit)
    'min_samples_split': [2, 5, 10],                  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],                    # Minimum number of samples in each leaf
    'bootstrap': [True, False],                       # Sample with replacement (True) or use the full training set (False)
}

# Time-series cross-validation (TimeSeriesSplit) for shorts
tscv_shorts = TimeSeriesSplit(n_splits=5)

# Randomized hyperparameter search for shorts
random_search_shorts = RandomizedSearchCV(
    estimator=clf_shorts,
    param_distributions=param_grid_shorts,
    cv=tscv_shorts,
    scoring='roc_auc',
    n_jobs=-1,
    n_iter=50,
    random_state=42
)

# Run the search on the shorts data
random_search_shorts.fit(X_shorts, y_shorts)

# Print the best hyperparameters found
print(f"Best hyperparameters for shorts: {random_search_shorts.best_params_}")

# Retrieve the best estimator (refit on all of X_shorts by default)
best_model_shorts = random_search_shorts.best_estimator_

# Show the best estimator
print(f"Best model for shorts after optimization: {best_model_shorts}")


Best hyperparameters for shorts: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': None, 'bootstrap': False}
Best model for shorts after optimization: RandomForestClassifier(bootstrap=False, min_samples_split=5, n_estimators=200,
                       random_state=42)
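
To put the search budget in context, the short calculation below counts how many combinations the grid in the cell above contains and what share of them 50 random draws can cover. It is a sketch that copies the candidate lists from `param_grid_shorts`; the variable names `n_combinations`, `coverage` and `total_fits` are introduced here for illustration.

```python
from math import prod

# Candidate values copied from param_grid_shorts in the cell above
param_grid_shorts = {
    'n_estimators': [200, 500, 1000, 1500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

n_iter = 50      # Sampled configurations, as passed to RandomizedSearchCV
n_cv_folds = 5   # Folds produced by TimeSeriesSplit(n_splits=5)

# Size of the full grid: product of the number of candidates per parameter
n_combinations = prod(len(values) for values in param_grid_shorts.values())
coverage = n_iter / n_combinations
total_fits = n_iter * n_cv_folds  # Excludes the final refit of the best model

print(f"Grid size: {n_combinations}")
print(f"Share explored by the random search: {coverage:.1%}")
print(f"Model fits during the search: {total_fits}")
```

The search therefore tries fewer than one in five of the possible configurations, so the reported best parameters should be read as a good candidate, not a guaranteed optimum of the grid.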


In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Define the RandomForestClassifier with the best hyperparameters found for shorts
modelo_rf_shorts = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    min_samples_split=5,
    min_samples_leaf=1,
    bootstrap=False,
    random_state=42
)

# Time-series cross-validation for shorts
tscv_shorts = TimeSeriesSplit(n_splits=3)

# Per-fold results for shorts
accuracy_scores_shorts = []
f1_scores_shorts = []
precision_scores_shorts = []
recall_scores_shorts = []

# Evaluate the model on each temporal fold
for train_index, test_index in tscv_shorts.split(X_shorts):
    X_train_shorts, X_test_shorts = X_shorts.iloc[train_index], X_shorts.iloc[test_index]
    y_train_shorts, y_test_shorts = y_shorts.iloc[train_index], y_shorts.iloc[test_index]

    # Fit the tuned model on this fold's training split
    modelo_rf_shorts.fit(X_train_shorts, y_train_shorts)

    # Predict class labels
    y_pred_shorts = modelo_rf_shorts.predict(X_test_shorts)

    # Metrics
    accuracy_shorts = accuracy_score(y_test_shorts, y_pred_shorts)
    f1_shorts = f1_score(y_test_shorts, y_pred_shorts)
    precision_shorts = precision_score(y_test_shorts, y_pred_shorts)
    recall_shorts = recall_score(y_test_shorts, y_pred_shorts)

    # Store the results
    accuracy_scores_shorts.append(accuracy_shorts)
    f1_scores_shorts.append(f1_shorts)
    precision_scores_shorts.append(precision_shorts)
    recall_scores_shorts.append(recall_shorts)

    # Print the metrics for this fold
    print(f"Fold (Shorts): Accuracy = {accuracy_shorts:.3f}, F1-Score = {f1_shorts:.3f}, Precision = {precision_shorts:.3f}, Recall = {recall_shorts:.3f}")

    # Classification report and confusion matrix for this fold
    print("\nClassification Report for Shorts:")
    print(classification_report(y_test_shorts, y_pred_shorts))

    print("Confusion Matrix for Shorts:")
    print(confusion_matrix(y_test_shorts, y_pred_shorts))

# Average the metrics over the 3 folds for shorts
print(f"\nMean Accuracy (Shorts): {np.mean(accuracy_scores_shorts):.3f}")
print(f"Mean F1-Score (Shorts): {np.mean(f1_scores_shorts):.3f}")
print(f"Mean Precision (Shorts): {np.mean(precision_scores_shorts):.3f}")
print(f"Mean Recall (Shorts): {np.mean(recall_scores_shorts):.3f}")


Fold (Shorts): Accuracy = 0.525, F1-Score = 0.563, Precision = 0.538, Recall = 0.589

Classification Report for Shorts:
              precision    recall  f1-score   support

         0.0       0.51      0.45      0.48        88
         1.0       0.54      0.59      0.56        95

    accuracy                           0.52       183
   macro avg       0.52      0.52      0.52       183
weighted avg       0.52      0.52      0.52       183

Confusion Matrix for Shorts:
[[40 48]
 [39 56]]
Fold (Shorts): Accuracy = 0.590, F1-Score = 0.634, Precision = 0.591, Recall = 0.684

Classification Report for Shorts:
              precision    recall  f1-score   support

         0.0       0.59      0.49      0.53        88
         1.0       0.59      0.68      0.63        95

    accuracy                           0.59       183
   macro avg       0.59      0.59      0.58       183
weighted avg       0.59      0.59      0.59       183

Confusion Matrix for Shorts:
[[43 45]
 [30

In [19]:
import joblib

# Save the model to a .pkl file (as last fitted in the cross-validation loop above)
joblib.dump(modelo_rf_shorts, 'modelo_rf_shorts.pkl')
print("Model saved successfully.")


Model saved successfully.
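
As a sanity check on the evaluation cell above, the metrics printed for the first fold can be recomputed by hand from its confusion matrix. The sketch below assumes scikit-learn's layout for binary labels, `[[TN, FP], [FN, TP]]`, with class 1 as the positive class; the helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from binary confusion counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# First-fold confusion matrix from the output above: [[40 48], [39 56]]
first_fold = metrics_from_confusion(tn=40, fp=48, fn=39, tp=56)
for name, value in first_fold.items():
    print(f"{name}: {value:.3f}")
```

The recomputed values match the first fold's printed line (accuracy 0.525, precision 0.538, recall 0.589, F1 0.563), which confirms that the reported precision and recall refer to class 1.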
