### Caveats and Process Notes

By combining the training and test sets before feature engineering, we make sure that the lag, rolling, and EMA features are computed consistently and accurately, even across the boundary between the last training date and the first test date. This prevents missing or incorrect values for the first test rows, maintains continuity for rolling windows and momentum indicators, and avoids artificial discontinuities in the computed statistics.
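The boundary handling described above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual pipeline: the helper name `lag_features_across_boundary` and the toy frames are hypothetical, and only lag features are shown. For a rolling window of width `w` computed after `shift(1)`, the same idea would apply with `w` borrowed rows.

```python
import pandas as pd

def lag_features_across_boundary(train, test, col, lags=(1, 5), date_col="date_id"):
    """Compute lag features for `test`, using the tail of `train` as history."""
    history = max(lags)  # rows of training history needed for the longest lag
    tail = train.sort_values(date_col).tail(history)
    combined = pd.concat([tail, test], ignore_index=True)
    combined = combined.sort_values(date_col).reset_index(drop=True)
    for lag in lags:
        combined[f"{col}_lag_{lag}"] = combined[col].shift(lag)
    # Keep only the test rows; the borrowed training rows were just context.
    is_test = combined[date_col].isin(test[date_col])
    return combined[is_test].reset_index(drop=True)

# Toy data (hypothetical): M1 simply equals date_id so the lags are easy to read.
train = pd.DataFrame({"date_id": range(10), "M1": [float(i) for i in range(10)]})
test = pd.DataFrame({"date_id": range(10, 13), "M1": [10.0, 11.0, 12.0]})

out = lag_features_across_boundary(train, test, "M1")
print(out)
```

Because every lag looks strictly backwards, the borrowed training rows only supply past values to the first test rows.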

#### Problems that arose (1): Negative values

Approach (1):
- Symptom: I am getting nonsensical cumulative returns and Sharpe ratio (negative values).
- Probable cause: since I apply the `transform` method to the test set alone, my lag, rolling, and EMA features are mostly incorrect for the first few rows of the test set, which might throw off my Sharpe ratio.
- Solution: combine the last few rows of the training set with the test set, so that the first few rows of the test set have valid values for the lag features.

- No need to worry about data leakage or look-ahead bias: the transformations on the test set only use previously computed features, so only past values are involved.

- Feedback: Didn't work.

Approach (2):
- Since `market_excess_returns` represents a fraction, it probably needs to be scaled.
- Apply a Box-Cox transformation.
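A minimal sketch of the Box-Cox idea from Approach (2), assuming the shift-then-transform scheme used in `apply_boxcox` below. The helper names (`fit_boxcox`, `apply_fitted_boxcox`, `invert_boxcox`) and the sample returns are hypothetical. The inverse step is an addition that would matter if the target itself were transformed, since predictions would then need mapping back to the original scale.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

def fit_boxcox(values, eps=1e-8):
    """Fit shift and lambda on training values; return (transformed, shift, lmbda)."""
    values = np.asarray(values, dtype=float)
    min_val = values.min()
    # Box-Cox needs strictly positive inputs, so shift the series if necessary.
    shift = -min_val + eps if min_val <= 0 else 0.0
    transformed, lmbda = boxcox(values + shift)
    return transformed, shift, lmbda

def apply_fitted_boxcox(values, shift, lmbda, eps=1e-8):
    """Apply a fitted transform; values below the training minimum are floored at eps."""
    shifted = np.maximum(np.asarray(values, dtype=float) + shift, eps)
    return boxcox(shifted, lmbda=lmbda)

def invert_boxcox(transformed, shift, lmbda):
    """Map transformed values back to the original scale."""
    return inv_boxcox(transformed, lmbda) - shift

# Toy excess returns (made-up numbers, for illustration only).
train_returns = np.array([-0.012, 0.004, 0.009, -0.003, 0.015, 0.001, -0.007, 0.006])
test_returns = np.array([0.002, -0.020, 0.011])

train_t, shift, lmbda = fit_boxcox(train_returns)
test_t = apply_fitted_boxcox(test_returns, shift, lmbda)
recovered = invert_boxcox(train_t, shift, lmbda)
```

The shift and lambda are estimated on the training values only and then reused, which keeps the test data out of the fitting step.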

**REMARK**: Currently the model is only trained on the training set, with a held-out test set. Need to integrate the test set by combining the two datasets, then splitting the combined dataset using TimeSeriesSplit.
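The combine-then-split step in the remark could look roughly like this. It is a sketch on toy frames (`train_part` and `test_part` are hypothetical stand-ins), and it assumes both frames share the same columns, including the target in the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-ins for the real frames (hypothetical data).
train_part = pd.DataFrame({"date_id": np.arange(0, 9), "M1": np.arange(0, 9, dtype=float)})
test_part = pd.DataFrame({"date_id": np.arange(9, 12), "M1": np.arange(9, 12, dtype=float)})

# Combine, then restore chronological order before splitting.
combined = (
    pd.concat([train_part, test_part], ignore_index=True)
    .sort_values("date_id")
    .reset_index(drop=True)
)

tscv = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, val_idx in tscv.split(combined):
    folds.append((train_idx, val_idx))
    print(
        f"train up to date_id {combined.loc[train_idx[-1], 'date_id']}, "
        f"validate on date_id {combined.loc[val_idx[0], 'date_id']}"
        f"..{combined.loc[val_idx[-1], 'date_id']}"
    )
```

Each fold trains on an expanding window of earlier rows and validates on the block that follows, so the former test rows end up in the last validation fold.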



In [None]:
import numpy as np
import warnings
from scipy.stats import boxcox
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error, root_mean_squared_error
import xgboost as xgb

warnings.filterwarnings('ignore')

In [12]:
train_df = pd.read_csv('../data/train.csv')

In [10]:
class TimeSeriesEngineering:
    def __init__(self,
                 base_features,
                 lag_windows=(1, 5, 10),
                 rolling_windows=(5, 10, 20),
                 ema_windows=(10, 30, 60),
                 eps=1e-8,
                 sparse_cols=None,
                 clip_quantiles=(0.01,0.99),
                 scale_features=True):
        self.base_features = base_features
        self.lag_windows = lag_windows
        self.rolling_windows = rolling_windows
        self.ema_windows = ema_windows
        self.eps = eps
        self.sparse_cols = sparse_cols or []
        self.clip_quantiles = clip_quantiles
        self.scale_features_flag = scale_features
        self.scaler = None

        # Store statistics computed on training data
        self.clip_bounds = {}
        self.boxcox_lambdas = {}
        self.boxcox_shifts = {}

    def drop_sparse(self, df):
        df = df.drop(columns=[c for c in self.sparse_cols if c in df.columns], errors='ignore')
        return df

    def forward_fill(self, df):
        df = df.sort_values(by='date_id').copy()
        df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas
        df.fillna(0, inplace=True)
        return df

    def clip_outliers(self, df, fit=True):
        """Clip outliers using bounds computed on training data"""
        for col in self.base_features:
            if col in df.columns:
                if fit:
                    # Compute and store bounds on training data
                    lower = df[col].quantile(self.clip_quantiles[0])
                    upper = df[col].quantile(self.clip_quantiles[1])
                    self.clip_bounds[col] = (lower, upper)

                # Apply stored bounds
                if col in self.clip_bounds:
                    lower, upper = self.clip_bounds[col]
                    df[col] = df[col].clip(lower, upper)
        return df

    def apply_boxcox(self, df, fit=True):
        """Apply Box-Cox transformation using parameters from training data"""
        EPS = self.eps
        for col in self.base_features:
            if col not in df.columns:
                continue

            if fit:
                # Compute shift and lambda on training data
                min_val = df[col].min()
                shift = -min_val + EPS if min_val <= 0 else 0
                shifted = df[col] + shift
                shifted[shifted <= 0] = EPS

                try:
                    transformed, lmbda = boxcox(shifted)
                    df[col] = transformed
                    self.boxcox_lambdas[col] = lmbda
                    self.boxcox_shifts[col] = shift
                except ValueError:
                    # If Box-Cox fails, store None to skip this column
                    self.boxcox_lambdas[col] = None
                    continue
            else:
                # Apply stored transformation
                if col in self.boxcox_lambdas and self.boxcox_lambdas[col] is not None:
                    lmbda = self.boxcox_lambdas[col]
                    shift = self.boxcox_shifts[col]
                    shifted = df[col] + shift
                    shifted[shifted <= 0] = EPS

                    # Apply Box-Cox with stored lambda
                    if lmbda == 0:
                        df[col] = np.log(shifted)
                    else:
                        df[col] = (shifted**lmbda - 1) / lmbda
        return df

    def scale_features(self, df, fit=True):
        """Scale features using StandardScaler"""
        if not self.scale_features_flag:
            return df

        if fit:
            self.scaler = StandardScaler()
            df[self.base_features] = self.scaler.fit_transform(df[self.base_features])
        else:
            if self.scaler is None:
                raise ValueError("Scaler not fitted. Call transform with fit=True first.")
            df[self.base_features] = self.scaler.transform(df[self.base_features])
        return df

    def add_lag(self, df):
        """Add lag features"""
        for col in self.base_features:
            if col not in df.columns:
                continue
            for lag in self.lag_windows:
                df[f"{col}_lag_{lag}"] = df[col].shift(lag)
        return df

    def add_rolling(self, df):
        """Add rolling window features"""
        for col in self.base_features:
            if col not in df.columns:
                continue
            for w in self.rolling_windows:
                # Use shift(1) to avoid look-ahead bias
                shifted = df[col].shift(1)
                df[f"{col}_roll_mean_{w}"] = shifted.rolling(w).mean()
                df[f"{col}_roll_std_{w}"] = shifted.rolling(w).std()
                df[f"{col}_roll_min_{w}"] = shifted.rolling(w).min()
                df[f"{col}_roll_max_{w}"] = shifted.rolling(w).max()
        return df

    def add_ema(self, df):
        """Add exponential moving average features"""
        for col in self.base_features:
            if col not in df.columns:
                continue
            for w in self.ema_windows:
                df[f"{col}_ema_{w}"] = df[col].shift(1).ewm(span=w, adjust=False).mean()
        return df

    def transform(self, df, fit=True):
        """
        Transform the dataframe with feature engineering.

        Args:
            df: Input dataframe
            fit: If True, fit transformations on this data (use for training).
                 If False, apply previously fitted transformations (use for validation/test).
        """
        df = df.copy()
        df = self.drop_sparse(df)
        df = self.forward_fill(df)
        df = self.clip_outliers(df, fit=fit)
        df = self.apply_boxcox(df, fit=fit)
        df = self.scale_features(df, fit=fit)
        df = self.add_lag(df)
        df = self.add_rolling(df)
        df = self.add_ema(df)
        return df

In [9]:
class TimeSeriesModel:
    def __init__(self,
                 target_col='market_forward_excess_returns',
                 date_col='date_id',
                 n_splits=5,
                 params=None,
                 n_estimators=1000,
                 early_stopping_rounds=50,
                 verbose=True):
        self.target_col = target_col
        self.date_col = date_col
        self.n_splits = n_splits
        self.params = params if params else {
            'objective': 'reg:squarederror',
            'learning_rate': 0.05,
            'max_depth': 5,
            'subsample': 0.7,
            'colsample_bytree': 0.7,
            'n_jobs': -1,
            'tree_method': 'hist',
            'random_state': 42
        }
        self.n_estimators = n_estimators
        self.early_stopping_rounds = early_stopping_rounds
        self.verbose = verbose
        self.models = []

    def fit(self, df, features):
        df = df.sort_values(by=self.date_col).copy()
        X = df[features].values
        y = df[self.target_col].values

        tscv = TimeSeriesSplit(n_splits=self.n_splits)
        fold = 1

        for train_idx, val_idx in tscv.split(X):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            dtrain = xgb.DMatrix(X_train, label=y_train)
            dval = xgb.DMatrix(X_val, label=y_val)

            model = xgb.train(
                params=self.params,
                dtrain=dtrain,
                num_boost_round=self.n_estimators,
                evals=[(dtrain, 'train'), (dval, 'val')],
                early_stopping_rounds=self.early_stopping_rounds,
                verbose_eval=self.verbose
            )

            if self.verbose:
                print(f"Fold {fold} finished. Best Iteration: {model.best_iteration}")

            self.models.append(model)
            fold += 1

    def predict(self, df, features):
        X = df[features].values
        dmatrix = xgb.DMatrix(X)
        preds = np.mean([m.predict(dmatrix) for m in self.models], axis=0)
        return preds

    def evaluate(self, df, features, risk_free_col=None):
        """
        Evaluate model performance using both regression and trading strategy metrics.

        Key fix: Use actual returns (y_true) for strategy evaluation, not predictions.
        Predictions are used to decide position sizing/direction.
        """
        y_true = df[self.target_col].values
        y_pred = self.predict(df, features)

        # Regression Metrics
        mse = mean_squared_error(y_true, y_pred)
        rmse = root_mean_squared_error(y_true, y_pred)



        # Trading Strategy Metrics
        # Strategy: Take position proportional to predicted excess returns
        # Actual returns are the realized returns based on our positions

        # Simple strategy: if prediction > 0, go long; if < 0, go short
        positions = y_pred  # TODO: try discrete signals instead (0 for negative, 1 for positive)

        # Actual strategy returns = position * actual returns
        strategy_returns = positions * y_true

        # Calculate Sharpe ratio on strategy returns
        mean_return = np.mean(strategy_returns)
        std_return = np.std(strategy_returns)
        sharpe_ratio = (mean_return / (std_return + 1e-8)) * np.sqrt(252)

        return {
            'MSE': mse,
            'RMSE': rmse,
            'Sharpe Ratio': sharpe_ratio,
        }

    def market_allocation(self, preds):
        """
        Based on the predicted returns, give an allocation value in [0, 2].
        k is the sensitivity factor.
        """
        market_returns = preds
        k = 75  # sensitivity factor
        allocations = 2 / (1 + np.exp(-k * market_returns))
        return allocations

In [13]:
exclude_cols = ['date_id', 'forward_returns', 'risk_free_rate',
                'market_forward_excess_returns', 'is_scored',
                'lagged_forward_returns', 'lagged_risk_free_rate',
                'lagged_market_forward_excess_returns']

# ---------------------- Initialize Feature Engineering ----------------------
ts_engineer = TimeSeriesEngineering(
    base_features=['M1','V1','S1','P1','E1'], # base features to prevent exploding the dataset
    lag_windows=[1,5],
    rolling_windows=[5,10,20],
    ema_windows=[10,30,60],
    sparse_cols=['M6','M13','M14','V9','V10','S3','E7','D2'],
    scale_features=True
)

# ---------------------- Preprocess Training Set ----------------------
train_df_fe = ts_engineer.transform(train_df, fit=True)
features = [c for c in train_df_fe.columns if c not in exclude_cols]

# ---------------------- Train Time Series Model ----------------------
xgb_tcsv = TimeSeriesModel(
    target_col='market_forward_excess_returns',
    n_splits=5
)
xgb_tcsv.fit(train_df_fe, features)

# ---------------------- Evaluate on Training Set ----------------------
train_eval_results = xgb_tcsv.evaluate(train_df_fe, features)
print("\n=== Train Evaluation Results ===")
for metric, value in train_eval_results.items():
    if metric != 'Cumulative Returns':
        print(f"{metric}: {value:.6f}")

[0]	train-rmse:0.00731	val-rmse:0.01209
[1]	train-rmse:0.00730	val-rmse:0.01208
[2]	train-rmse:0.00728	val-rmse:0.01209
[3]	train-rmse:0.00727	val-rmse:0.01209
[4]	train-rmse:0.00726	val-rmse:0.01212
[5]	train-rmse:0.00724	val-rmse:0.01212
[6]	train-rmse:0.00723	val-rmse:0.01212
[7]	train-rmse:0.00721	val-rmse:0.01211
[8]	train-rmse:0.00720	val-rmse:0.01212
[9]	train-rmse:0.00718	val-rmse:0.01213
[10]	train-rmse:0.00717	val-rmse:0.01214
[11]	train-rmse:0.00715	val-rmse:0.01214
[12]	train-rmse:0.00714	val-rmse:0.01215
[13]	train-rmse:0.00713	val-rmse:0.01214
[14]	train-rmse:0.00712	val-rmse:0.01214
[15]	train-rmse:0.00711	val-rmse:0.01214
[16]	train-rmse:0.00710	val-rmse:0.01213
[17]	train-rmse:0.00709	val-rmse:0.01213
[18]	train-rmse:0.00708	val-rmse:0.01213
[19]	train-rmse:0.00706	val-rmse:0.01213
[20]	train-rmse:0.00705	val-rmse:0.01214
[21]	train-rmse:0.00705	val-rmse:0.01213
[22]	train-rmse:0.00703	val-rmse:0.01213
[23]	train-rmse:0.00703	val-rmse:0.01214
[24]	train-rmse:0.00702	va

### Observations
- The Sharpe ratio is in the range [1, 3], which would be good for real markets.
- Might need to introduce scaling or signals, because the values of `market_forward_excess_returns` are very small (or in percentages).
- Need to combine training and test sets: the test sample is so small that the evaluation on the test set alone is insignificant.
- Probably also need to try different lag values, since
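The "signals" idea from the observations can be sketched as below, assuming a long/flat rule (1 for a positive prediction, 0 otherwise) as suggested in the comment inside `evaluate`. The function names and the sample arrays are hypothetical, and the Sharpe formula mirrors the one in `evaluate`.

```python
import numpy as np

def annualized_sharpe(strategy_returns, periods_per_year=252, eps=1e-8):
    """Mean over standard deviation of per-period returns, annualized."""
    r = np.asarray(strategy_returns, dtype=float)
    return r.mean() / (r.std() + eps) * np.sqrt(periods_per_year)

def long_flat_signal(preds):
    """1.0 where the predicted excess return is positive, 0.0 otherwise."""
    return (np.asarray(preds, dtype=float) > 0).astype(float)

# Made-up realized and predicted excess returns, for illustration only.
y_true = np.array([0.004, -0.002, 0.003, -0.001, 0.002])
y_pred = np.array([0.0003, -0.0001, 0.0002, 0.0001, 0.0004])

positions = long_flat_signal(y_pred)
sharpe_signal = annualized_sharpe(positions * y_true)  # discrete positions
sharpe_raw = annualized_sharpe(y_pred * y_true)        # raw predictions as positions
print(positions, sharpe_signal, sharpe_raw)
```

With raw predictions as positions, the strategy returns are products of two small numbers, so the `eps` term in the denominator may stop being negligible. Discrete signals keep the strategy returns on the same scale as the realized returns.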