### Importing Python Packages and Utility Functions

In [51]:
# Mount Google Drive so the data files are accessible from Colab

from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [52]:
import pandas as pd
import numpy as np

import datetime
import os, sys
import importlib

import utils
importlib.reload(utils)

from utils import plot_series, plot_series_with_names, plot_series_bar
from utils import plot_dataframe
from utils import get_universe_adjusted_series, scale_weights_to_one, scale_to_book_long_short
from utils import generate_portfolio, backtest_portfolio
from utils import match_implementations

import plotly.graph_objects as go

import nbformat

### Data Loading and Structure

The dataset consists of three key files:


- **features.parquet**: A DataFrame containing **22 stock features** for each trading day from **2005 to 2025**, stored in a **columnar fashion**. For example, `features["macd"]` will return the `macd` of shape **(5068, 2167)**. All **22** features are guaranteed to have the same shape.

- **universe.parquet**: A DataFrame of the same shape as `features["macd"]`, containing `0` and `1` values, where `1` indicates that the stock is tradable on that day.

- **returns.parquet**: A DataFrame of shape **(3775, 2167)** containing **daily stock returns** from **2005 to 2019**. You will **never receive** the returns for the testing period i.e. from **2020 to 2025**.

### Data Organization:
- **Columns** represent stock identifiers, ranging from **1 to 2167** in increasing order.
  
- **Rows** represent trading days when the market was open.  
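
To make the layout concrete, here is a minimal sketch on toy data. The tiny frames, the names `toy_features` and `toy_universe`, and the use of string stock identifiers are assumptions for illustration only; the real files are far larger.

```python
import numpy as np
import pandas as pd

# Toy stand-in for features.parquet: 3 trading days, 2 features, 4 stocks.
dates = pd.to_datetime(["2005-01-03", "2005-01-04", "2005-01-05"])
stocks = ["1", "2", "3", "4"]  # assumed string identifiers
columns = pd.MultiIndex.from_product([["macd", "volatility_60"], stocks])
toy_features = pd.DataFrame(
    np.arange(len(dates) * len(columns), dtype=float).reshape(len(dates), -1),
    index=dates,
    columns=columns,
)

# Selecting a level-0 key returns a (days x stocks) frame for that feature.
macd = toy_features["macd"]
print(macd.shape)  # (3, 4)

# Toy stand-in for universe.parquet: same shape as one feature, 0/1 values.
toy_universe = pd.DataFrame(1, index=dates, columns=stocks)
toy_universe.loc[dates[1], "3"] = 0  # stock "3" is not tradable on day 2

# Keep feature values only where the stock is tradable; the rest become NaN.
tradable_macd = macd.where(toy_universe.astype(bool))
print(tradable_macd)
```

Because every feature and the universe share the same index and columns, masking with `where` lines up day by day and stock by stock without any explicit joins.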


In [53]:
# Directory holding the competition data on the mounted Google Drive.
# Change it to match your setup if you're working somewhere else (e.g. a Kaggle Notebook).
data_dir = "/content/drive/MyDrive/qrt/"

features = pd.read_parquet(os.path.join(data_dir, "features.parquet"))

universe = pd.read_parquet(os.path.join(data_dir, "universe.parquet"))

returns = pd.read_parquet(os.path.join(data_dir, "returns.parquet"))

### Benchmark Strategy: Vectorized Portfolio Generation

In this step we write vectorized code that generates the strategy portfolio for all trading days at once, operating directly on the feature DataFrames rather than computing signals day by day in a `for` loop. One caution, though: when you build the whole portfolio from DataFrames at once, it is very easy to accidentally look at future feature data.

Below is a vectorized implementation of the benchmark strategy [which you see on Kaggle]. Make sure to note the `.shift(1)` operations applied to each feature in the code! They ensure that only feature values up to the last trading day are used to construct today's portfolio weights.

You can change the code below to implement your own strategy. Since this vectorized implementation is faster, you can try various versions of your strategy and see their performances.
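
The effect of `shift(1)` can be checked on a small example. This is only an illustrative sketch with made-up values; the frame `signal` is hypothetical and stands in for any single feature.

```python
import pandas as pd

dates = pd.to_datetime(["2005-01-03", "2005-01-04", "2005-01-05"])
signal = pd.DataFrame(
    {"1": [10.0, 20.0, 30.0], "2": [5.0, 15.0, 25.0]},
    index=dates,
)

# After shifting, the row for day t holds the values observed on day t-1,
# so weights built from `lagged` never use same-day information.
lagged = signal.shift(1)
print(lagged)

# The first row has no previous day and is therefore all NaN.
print(lagged.iloc[0].isna().all())
```

Any signal derived from `lagged` on a given date depends only on information available before that date, which is the property the iterative simulation enforces by construction.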

#### Inputs:
- `entire_features: pd.DataFrame`: Historical feature data (MultiIndex columns: Feature Names, Stock Identifiers).
- `universe: pd.DataFrame`: Binary DataFrame indicating tradable stocks per day.
- `start_date`, `end_date`: `str`: Backtest period in `'YYYY-MM-DD'` format.

#### Output:
- `portfolio: pd.DataFrame`: Normalized portfolio weights for each stock per day in the specified date range.


In [54]:
# # A Benchmark Strategy for your reference:
# # This is the code used to generate the Benchmark submission you see in the Kaggle Leaderboard

# # This strategy shows how you can combine different features
def generate_portfolio_vectorized(
    entire_features: pd.DataFrame,
    universe: pd.DataFrame,
    start_date: str,
    end_date: str
):
    # Validate date format
    try:
        start_dt = datetime.datetime.strptime(start_date, '%Y-%m-%d')
        end_dt = datetime.datetime.strptime(end_date, '%Y-%m-%d')
        cutoff_date = datetime.datetime.strptime('2005-01-01', '%Y-%m-%d')
    except ValueError:
        raise ValueError("start_date and end_date must be strings in 'YYYY-MM-DD' format.")

    # Ensure start_date is before end_date
    if start_dt >= end_dt:
        raise ValueError("start_date must be earlier than end_date.")

    # Ensure start_date is not before '2005-01-01'
    if start_dt < cutoff_date:
        raise ValueError("start_date must not be earlier than '2005-01-01'.")

    # Get trading days within the date range
    trading_days = universe.index[(universe.index >= start_dt) & (universe.index <= end_dt)]

    if len(trading_days) == 0:
        raise ValueError("No Trading Days in the specified dates")

    portfolio = 0

    universe_boolean = universe.loc[:end_date].astype(bool)

    features_ = entire_features.loc[:end_date]

    rsi = features_["relative_strength_index"].shift(1)
    obv = features_["on_balance_volume"].shift(1)
    obv_diff = obv.diff()  # Check OBV trend

    # Bullish and Bearish Market Conditions
    bullish_market = (rsi > 65) & (obv_diff > 10)
    bearish_market = (rsi < 35) & (obv_diff < -10)
    neutral_market = ~(bullish_market | bearish_market)

    bullish_feature = "stochastic_oscillator"
    bearish_feature = "stochastic_oscillator"
    neutral_feature = "ichimoku"

    threshold_rank = 2
    signal_bullish = features_[bullish_feature].shift(1)
    signal_bullish = signal_bullish.where(universe_boolean, np.nan)
    signal_bullish = signal_bullish.where(bullish_market, np.nan)
    signal_bullish_long = signal_bullish.rank(axis=1, method="min", ascending=True)
    signal_bullish_long = signal_bullish_long.mask(signal_bullish_long <= threshold_rank, 0.1)
    signal_bullish_long = signal_bullish_long.mask(signal_bullish_long > threshold_rank, 0)
    signal_bullish_long = signal_bullish_long.fillna(0)
    signal_bullish_short = signal_bullish.rank(axis=1, method="min", ascending=False)
    signal_bullish_short = signal_bullish_short.mask(signal_bullish_short <= threshold_rank, -0.1)
    signal_bullish_short = signal_bullish_short.mask(signal_bullish_short > threshold_rank, 0)
    signal_bullish_short = signal_bullish_short.fillna(0)
    signal_bullish = signal_bullish_long + signal_bullish_short
    # signal_bullish = signal_bullish_long
    signal_bullish = signal_bullish.fillna(0)

    # threshold_rank = 1
    signal_bearish = features_[bearish_feature].shift(1)
    signal_bearish = signal_bearish.where(universe_boolean, np.nan)
    signal_bearish = signal_bearish.where(bearish_market, np.nan)
    signal_bearish_long = signal_bearish.rank(axis=1, method="min", ascending=True)
    signal_bearish_long = signal_bearish_long.mask(signal_bearish_long <= threshold_rank, 0.1)
    signal_bearish_long = signal_bearish_long.mask(signal_bearish_long > threshold_rank, 0)
    signal_bearish_short = signal_bearish.rank(axis=1, method="min", ascending=False)
    signal_bearish_short = signal_bearish_short.mask(signal_bearish_short <= threshold_rank, -0.1)
    signal_bearish_short = signal_bearish_short.mask(signal_bearish_short > threshold_rank, 0)
    signal_bearish_short = signal_bearish_short.fillna(0)
    signal_bearish = signal_bearish_long + signal_bearish_short
    # signal_bearish = signal_bearish_short
    signal_bearish = signal_bearish.fillna(0)
    # print(signal_bearish.shape)

    threshold_rank = 20
    signal_neutral = features_[neutral_feature].shift(1)
    signal_neutral = signal_neutral.where(universe_boolean, np.nan)
    signal_neutral = signal_neutral.where(neutral_market, np.nan)
    signal_neutral_bullish = signal_neutral.rank(axis=1, method="min", ascending=True)
    signal_neutral_bullish = signal_neutral_bullish.mask(signal_neutral_bullish <= threshold_rank, +0.1)
    signal_neutral_bullish = signal_neutral_bullish.mask(signal_neutral_bullish > threshold_rank, 0)
    signal_neutral_bullish = signal_neutral_bullish.fillna(0)
    signal_neutral_bearish = signal_neutral.rank(axis=1, method="min", ascending=False)
    signal_neutral_bearish = signal_neutral_bearish.mask(signal_neutral_bearish <= threshold_rank, -0.1)
    signal_neutral_bearish = signal_neutral_bearish.mask(signal_neutral_bearish > threshold_rank, 0)
    signal_neutral_bearish = signal_neutral_bearish.fillna(0)
    signal_neutral = signal_neutral_bullish + signal_neutral_bearish
    signal_neutral = signal_neutral.fillna(0)
    # signal_neutral = signal_neutral.div(10*signal_neutral.abs().max(axis=1), axis=0)


    portfolio = signal_bullish + signal_bearish + signal_neutral
    portfolio = portfolio.where(universe_boolean, np.nan)
    portfolio = portfolio.sub(portfolio.mean(axis=1), axis=0)
    # Normalize each row: cap gross exposure (sum of |weights|) at 1,
    # then rescale so that the largest absolute weight is at most 0.1
    for i in range(1, portfolio.shape[0]):
        abs_sum = portfolio.iloc[i].abs().sum()

        if abs_sum > 1:
            portfolio.iloc[i] = portfolio.iloc[i] / abs_sum
        row_abs_max = portfolio.iloc[i].abs().max()
        if row_abs_max > 0.1:
            portfolio.iloc[i] = portfolio.iloc[i] / (10*row_abs_max)
    # print(portfolio.fillna(0).loc[start_date:end_date])
    # print(portfolio.shape)

    return portfolio.fillna(0).loc[start_date:end_date]
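
The row-by-row normalization at the end of the function can also be expressed without a Python loop. The following is a sketch of one possible equivalent, not the code used for the benchmark; the helper name `normalize_rows` is hypothetical. Note that the loop above starts at row 1, whereas this version treats every row the same way.

```python
import numpy as np
import pandas as pd

def normalize_rows(weights: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cap gross exposure at 1, then cap max |weight| at 0.1."""
    # Step 1: divide by the row's sum of |weights| only where that sum exceeds 1.
    abs_sum = weights.abs().sum(axis=1)
    weights = weights.div(abs_sum.where(abs_sum > 1, 1.0), axis=0)
    # Step 2: divide by 10 * max |weight| only where the max exceeds 0.1,
    # which rescales the largest absolute weight in that row to 0.1.
    abs_max = weights.abs().max(axis=1)
    return weights.div((10 * abs_max).where(abs_max > 0.1, 1.0), axis=0)

toy = pd.DataFrame(
    [
        [0.0, 0.0, 0.0, 0.0],
        [0.9, -0.9, 0.3, -0.3],
        [0.05, -0.05, 0.0, np.nan],
        [0.2, -0.2, 0.0, 0.0],
    ]
)
normalized = normalize_rows(toy)
print(normalized)
```

Rows that are all zero or all NaN pass through unchanged, because neither condition is met and the divisor falls back to 1.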

### Generate your portfolio using the `generate_portfolio_vectorized` function you wrote above

In [55]:
benchmark_portfolio_vectorized = generate_portfolio_vectorized(
    features,
    universe,
    "2005-01-03",
    "2025-02-07"
)


### Backtest your portfolio generated using vectorized code

Note that you can backtest your portfolio till `2019-12-31` since this is the last date in the training period. You don't have access to returns after this date.

In [56]:
sr_vectorized, pnl_vectorized = backtest_portfolio(benchmark_portfolio_vectorized.loc[:"2019"], returns.loc[:"2019"], universe.loc[:"2019"], True, True)

Gross Sharpe Ratio:  2.594
Net Sharpe Ratio:  2.302
Turnover %:  89.179


### Benchmark Strategy: Iterative Portfolio Generation

Although the vectorized function generates the portfolio very quickly, it is very easy to look into future data if you are not careful. For instance, remove the `shift(1)` operations and see the performance of the portfolio 😊. If your vectorized code has a forward bias [lookahead bias], you may see very high [or very low] Sharpe ratios that would never be realised in real trading.

To avoid making these mistakes, we simulate our portfolio in a daily iterative fashion, where we call the `get_weights(features: pd.DataFrame, today_universe: pd.Series) -> dict[str, float]` function with **only** the past features data and the current day's trading universe.

### Function Inputs:
- `features(pd.DataFrame)`:  
  - Contains various stock features indexed by date and stock identifiers.  
  - The features are structured in a **MultiIndex format**, where level 0 represents **feature names** (e.g., "macd", "volatility_60"), and level 1 represents **stock identifiers** (e.g., "1", "2", ..., "2167").  

- `today_universe(pd.Series)`:  
  - A series indicating which stocks can be traded on the current day.  
  - Contains **binary values (0 or 1)**, where **1** means a stock is **tradable**, and **0** means it is not.

You have to change this code, and write your own strategy code inside this function. Make sure it follows the same semantics as explained above.
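
For intuition, a daily driver loop might look roughly like the sketch below. This is an assumption about the general shape of such a simulation, not the actual `utils.generate_portfolio` implementation; `simulate_iteratively`, `toy_get_weights`, and the toy data are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

def simulate_iteratively(get_weights_fn, features, universe, start_date, end_date):
    """Hypothetical driver: call get_weights_fn once per day with past-only features."""
    trading_days = universe.loc[start_date:end_date].index
    rows = {}
    for day in trading_days:
        # Assumption: only rows strictly before `day` are visible to the strategy.
        past_features = features.loc[features.index < day]
        if past_features.empty:
            continue  # no history yet, so no positions on this day
        rows[day] = get_weights_fn(past_features, universe.loc[day])
    result = pd.DataFrame.from_dict(rows, orient="index")
    return result.reindex(index=trading_days, columns=universe.columns).fillna(0.0)

def toy_get_weights(past_features, today_universe):
    # Toy rule: +/-0.05 according to the sign of the latest available macd.
    last = past_features["macd"].iloc[-1]
    weights = 0.05 * np.sign(last)
    return weights[today_universe.astype(bool)].to_dict()

dates = pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06"])
cols = pd.MultiIndex.from_product([["macd"], ["1", "2"]])
toy_features = pd.DataFrame(
    [[1.0, -1.0], [-2.0, 2.0], [3.0, -3.0]], index=dates, columns=cols
)
toy_universe = pd.DataFrame(1, index=dates, columns=["1", "2"])

toy_portfolio = simulate_iteratively(
    toy_get_weights, toy_features, toy_universe, "2010-01-04", "2010-01-06"
)
print(toy_portfolio)
```

Under this assumption, taking `.iloc[-1]` inside `get_weights` plays the same role as `shift(1)` in the vectorized code: the most recent row the function can see is always the previous trading day.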


In [57]:



import pandas as pd
import numpy as np
import datetime

def get_weights(entire_features: pd.DataFrame, today_universe: pd.Series) -> dict[str, float]:
    universe_boolean = today_universe.astype(bool)
    features_ = entire_features
    # add dimension to the universe_boolean AND features_
    # universe_boolean = universe_boolean[:, np.newaxis]
    # features_ = features_.loc[:today_universe.index[-1]]
    # features_ = entire_features.loc[:today_universe.index[-1]]
    rsi = features_["relative_strength_index"].iloc[-1]
    obv = features_["on_balance_volume"]
    obv_diff = obv.diff().iloc[-1]  # Check OBV trend
    # Bullish and Bearish Market Conditions
    bullish_market = (rsi > 65) & (obv_diff > 10)
    bearish_market = (rsi < 35) & (obv_diff < -10)
    neutral_market = ~(bullish_market | bearish_market)
    bullish_feature = "stochastic_oscillator"
    bearish_feature = "stochastic_oscillator"
    neutral_feature = "ichimoku"
    threshold_rank = 2
    signal_bullish = features_[bullish_feature].iloc[-1]
    signal_bullish = signal_bullish.where(universe_boolean, np.nan)
    signal_bullish = signal_bullish.where(bullish_market, np.nan)
    signal_bullish_long = signal_bullish.rank(method="min", ascending=True)
    signal_bullish_long = signal_bullish_long.mask(signal_bullish_long <= threshold_rank, 0.1)
    signal_bullish_long = signal_bullish_long.mask(signal_bullish_long > threshold_rank, 0)
    signal_bullish_short = signal_bullish.rank(method="min", ascending=False)
    signal_bullish_short = signal_bullish_short.mask(signal_bullish_short <= threshold_rank, -0.1)
    signal_bullish_short = signal_bullish_short.mask(signal_bullish_short > threshold_rank, 0)
    signal_bullish = signal_bullish_long + signal_bullish_short
    signal_bullish = signal_bullish.fillna(0)
    signal_bearish = features_[bearish_feature].iloc[-1]
    signal_bearish = signal_bearish.where(universe_boolean, np.nan)
    signal_bearish = signal_bearish.where(bearish_market, np.nan)
    signal_bearish_long = signal_bearish.rank(method="min", ascending=True)
    signal_bearish_long = signal_bearish_long.mask(signal_bearish_long <= threshold_rank, 0.1)
    signal_bearish_long = signal_bearish_long.mask(signal_bearish_long > threshold_rank, 0)
    signal_bearish_short = signal_bearish.rank(method="min", ascending=False)
    signal_bearish_short = signal_bearish_short.mask(signal_bearish_short <= threshold_rank, -0.1)
    signal_bearish_short = signal_bearish_short.mask(signal_bearish_short > threshold_rank, 0)
    signal_bearish = signal_bearish_long + signal_bearish_short
    signal_bearish = signal_bearish.fillna(0)
    threshold_rank = 20
    signal_neutral = features_[neutral_feature].iloc[-1]
    signal_neutral = signal_neutral.where(universe_boolean, np.nan)
    signal_neutral = signal_neutral.where(neutral_market, np.nan)
    signal_neutral_bullish = signal_neutral.rank(method="min", ascending=True)
    signal_neutral_bullish = signal_neutral_bullish.mask(signal_neutral_bullish <= threshold_rank, +0.1)
    signal_neutral_bullish = signal_neutral_bullish.mask(signal_neutral_bullish > threshold_rank, 0)
    signal_neutral_bearish = signal_neutral.rank(method="min", ascending=False)
    signal_neutral_bearish = signal_neutral_bearish.mask(signal_neutral_bearish <= threshold_rank, -0.1)
    signal_neutral_bearish = signal_neutral_bearish.mask(signal_neutral_bearish > threshold_rank, 0)
    signal_neutral = signal_neutral_bullish + signal_neutral_bearish
    signal_neutral = signal_neutral.fillna(0)
    portfolio = signal_bullish + signal_bearish + signal_neutral
    portfolio = portfolio.where(universe_boolean, np.nan)
    portfolio = portfolio.sub(portfolio.mean(axis=0))
    # portfolio = portfolio.fillna(0)
    # for i in range(1, portfolio.shape[0]):
    abs_sum = portfolio.abs().sum()
    if abs_sum > 1:
        portfolio = portfolio / abs_sum
    row_abs_max = portfolio.abs().max()
    if row_abs_max > 0.1:
        portfolio = portfolio / (10*row_abs_max)
    portfolio = portfolio[universe_boolean==1]
    return portfolio.to_dict()

### Generate your portfolio using the `generate_portfolio` and `get_weights` functions you wrote above

Since `get_weights` is called iteratively on every trading day, it takes a long time [about 40 minutes] to generate the entire portfolio DataFrame from `2005-01-03` to `2025-02-07`. Hence, as an example, we call it only for a one-year period from `2010-01-01` to `2010-12-31`.

In [58]:
benchmark_portfolio = generate_portfolio(
    get_weights,
    features,
    universe,
    "2010-01-01",
    "2010-12-31",
)

100%|██████████| 252/252 [01:14<00:00,  3.37it/s]

Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`



### Backtest your portfolio

Note that you can backtest your portfolio till `2019-12-31` since this is the last date in the training period. You don't have access to returns after this date.

In [59]:
sr, pnl = backtest_portfolio(
    benchmark_portfolio.loc["2010-01-01":"2010-12-31"],
    returns.loc["2010-01-01":"2010-12-31"],
    universe.loc["2010-01-01":"2010-12-31"],
    True,
    True,
)

Gross Sharpe Ratio:  3.688
Net Sharpe Ratio:  3.23
Turnover %:  94.247


You can also check the performance of your vectorized portfolio in this period to see if they match!

In [60]:
sr, pnl = backtest_portfolio(benchmark_portfolio_vectorized.loc["2010"], returns.loc["2010"], universe.loc["2010"], True, True)

Gross Sharpe Ratio:  3.688
Net Sharpe Ratio:  3.23
Turnover %:  94.247


### Comparing Iterative and Vectorized Portfolio Implementations
The `match_implementations` function evaluates the **iterative** and **vectorized** portfolio generation methods by comparing their PnL correlation. If the **correlation is ≥ 0.98**, both implementations are considered equivalent.
#### Steps:
1. Selects a random start date.
2. Generates portfolios using both methods.
3. Backtests portfolios and computes PnLs.
4. Validates PnL correlation.
#### Criteria:
- **Pass:** Correlation **≥ 0.98** (implementations match).
- **Fail:** Correlation **< 0.98** (error raised).
#### Inputs:
- `contestant_get_weights`: Function for portfolio weights.
- `contestant_vectorized_portfolio`: A pandas DataFrame containing the portfolio weights generated by the vectorized implementation.
- `entire_features`: Feature data (MultiIndex columns).
- `universe`: Tradable stocks (binary).
- `returns`: Daily stock returns.
#### Output:
- Prints PnL correlation or raises an error if mismatched.
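
The core of such a check can be sketched as follows. This is not the `match_implementations` source: it assumes that daily PnL is the row-wise sum of weight times same-day return and ignores costs, and the helper `pnl_correlation` and the random toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def pnl_correlation(weights_a, weights_b, returns):
    """Hypothetical check: correlate the daily PnLs of two weight frames."""
    common = weights_a.index.intersection(weights_b.index).intersection(returns.index)
    # Assumption: PnL on a day is sum over stocks of weight * return on that day.
    pnl_a = (weights_a.loc[common] * returns.loc[common]).sum(axis=1)
    pnl_b = (weights_b.loc[common] * returns.loc[common]).sum(axis=1)
    return pnl_a.corr(pnl_b)

rng = np.random.default_rng(0)
dates = pd.date_range("2010-01-04", periods=30, freq="B")
stocks = ["1", "2", "3", "4", "5"]
toy_returns = pd.DataFrame(rng.normal(0, 0.01, (30, 5)), index=dates, columns=stocks)
toy_weights = pd.DataFrame(rng.uniform(-0.1, 0.1, (30, 5)), index=dates, columns=stocks)

corr_same = pnl_correlation(toy_weights, toy_weights, toy_returns)
corr_flipped = pnl_correlation(toy_weights, -toy_weights, toy_returns)
print(corr_same >= 0.98)     # identical portfolios pass the threshold
print(corr_flipped >= 0.98)  # an opposite portfolio does not
```

A correlation threshold rather than exact equality leaves room for small numerical differences between the two implementations.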

In [61]:
match_implementations(get_weights, benchmark_portfolio_vectorized, features, universe, returns)

Starting to generate Iterative Portfolio


100%|██████████| 41/41 [00:24<00:00,  1.67it/s]

Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`



Iterative Portfolio Generated
Correlation of 1.0 between Iterative and Vectorized Implementations. Both implementations match!


### Final Notes
- We recommend using vectorized code to test out your strategies. It is easier and faster to run, but make sure to `shift(1)` the feature DataFrames to avoid lookahead [forward] bias.
- Once you have decided on the final strategy you want to submit for the competition, we advise you to write the code for `get_weights`, which generates your portfolio iteratively.
- Finally, before submitting, run `match_implementations` to confirm that both versions of your code produce the same portfolio.
- If these two portfolios match, you can submit the one which was produced by `generate_portfolio_vectorized` without waiting for the iterative portfolio. You don't need to run the `generate_portfolio` function for 20 years!
- We will check that the submission you made on Kaggle matches with the portfolio generated by your code. If these two don't match, you will be eliminated from the competition.

In [62]:
# Submit this CSV file on Kaggle
benchmark_portfolio_vectorized.to_csv("submission.csv")