# ETF Stock Prediction Competition - Training & Submission

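Memory-optimized version: each prediction year is loaded, trained, and predicted separately, so only one panel (roughly 1.1 GB in the runs below) is held in memory at a time.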

## Competition Overview
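- **Goal**: For each trading day, predict the 100 stocks with the highest 3-month returns
- **Period**: 2020-2024 (5 years, one submission file per year)
- **Evaluation**: Overlap between the predicted and the actual Top-100, averaged into a single accuracy score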
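
The scoring code is not part of this notebook. As a minimal sketch, assuming the score is the share of predicted Top-100 tickers that also appear in the actual Top-100, averaged over dates, it could look like the following. The function names and the `k` parameter are hypothetical.

```python
def top_k_overlap(predicted, actual, k=100):
    """Fraction of the first k predicted tickers that are also in the first k actual tickers."""
    pred_set = set(predicted[:k])
    actual_set = set(actual[:k])
    return len(pred_set & actual_set) / k


def mean_top_k_overlap(pred_by_date, actual_by_date, k=100):
    """Average the per-date overlap over the dates present in both mappings."""
    dates = sorted(set(pred_by_date) & set(actual_by_date))
    if not dates:
        raise ValueError("no common dates to score")
    total = sum(top_k_overlap(pred_by_date[d], actual_by_date[d], k) for d in dates)
    return total / len(dates)


# Toy example with k=2: one of the two predicted tickers is in the actual top 2.
print(top_k_overlap(["A", "B"], ["B", "C"], k=2))  # 0.5
```

Sets are used because, under this assumption, only membership in the Top-100 matters, not the order within it.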

## 1. Setup

In [1]:
import sys
from pathlib import Path

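# The notebook runs one level below the project root; add the root to sys.path so `src` is importable
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))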

print(f"Project root: {project_root}")

Project root: /Users/jeong-uchang/etf-trading-project/etf-model


In [2]:
import pandas as pd
import numpy as np
import warnings
import gc
warnings.filterwarnings('ignore')

# Check memory
import psutil
mem = psutil.virtual_memory()
print(f"Available memory: {mem.available / 1024**3:.1f} GB / {mem.total / 1024**3:.1f} GB")

Available memory: 10.4 GB / 24.0 GB


In [3]:
# Import our modules
from src.config import config, DATA_DIR, SUBMISSIONS_DIR
from src.pipeline import CompetitionPipeline

print(f"Data directory: {DATA_DIR}")
print(f"Submissions directory: {SUBMISSIONS_DIR}")

Data directory: /Users/jeong-uchang/etf-trading-project/etf-model/data
Submissions directory: /Users/jeong-uchang/etf-trading-project/etf-model/submissions


## 2. Check Data Files

In [4]:
print("Data files:")
for f in sorted(DATA_DIR.glob("*.csv")):
    print(f"  {f.name}")

Data files:
  2020_final_universe.csv
  2020_sample_submission.csv
  2021_final_universe.csv
  2021_sample_submission.csv
  2022_final_universe.csv
  2022_sample_submission.csv
  2023_final_universe.csv
  2023_sample_submission.csv
  2024_final_universe.csv
  2024_sample_submission.csv
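
A quick completeness check, as a sketch: assuming each prediction year needs both a `YYYY_final_universe.csv` and a `YYYY_sample_submission.csv` (the naming seen in the listing above), the hypothetical helper `missing_inputs` reports any file that is absent.

```python
from pathlib import Path


def missing_inputs(data_dir, years):
    """Return the names of expected per-year input files that do not exist in data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for year in years:
        for suffix in ("final_universe", "sample_submission"):
            path = data_dir / f"{year}_{suffix}.csv"
            if not path.exists():
                missing.append(path.name)
    return missing
```

In this notebook it would be called as `missing_inputs(DATA_DIR, range(2020, 2025))`, which should return an empty list for the files listed above.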


## 3. Run Pipeline (Single Year Test)

Run 2024 alone first to verify that the pipeline works end to end before starting the full run.

In [5]:
# Initialize pipeline
pipeline = CompetitionPipeline()

# Test with single year first
submission_2024 = pipeline.process_year(2024, train_years=10, verbose=True)


Processing 2024
Universe: 1016 tickers
Loading data from 2014-01-01 to 2024-12-31...


Loading & features: 100%|██████████| 1016/1016 [08:22<00:00,  2.02it/s]


Loaded 1011/1016 tickers (failed: 0)
Panel: 2,123,158 rows, 1141.7 MB
Adding cross-sectional features...
Using 160 features
Training rows: 1,869,397
Prediction rows: 253,761

Training model...
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[20]	train's rmse: 1.06888	valid's rmse: 0.344914

Top 10 features:
        feature   importance  importance_pct
            obv 1.506760e+07       51.016116
     obv_sma_20 9.691142e+06       32.812426
    macd_signal 1.123721e+06        3.804714
      ad_sma_20 1.022686e+06        3.462625
         ema_63 3.367082e+05        1.140032
       minus_di 2.637609e+05        0.893046
  universe_size 2.499015e+05        0.846121
             ad 2.187641e+05        0.740695
      sharpe_63 1.466327e+05        0.496472
price_to_sma_63 1.245981e+05        0.421866

Generating predictions...


Predicting: 100%|██████████| 251/251 [00:00<00:00, 472.95it/s]

Generated 25,100 predictions
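
In the log above, `train_years=10` for 2024 loads data from 2014-01-01 to 2024-12-31, and the training and prediction row counts add up to the panel size (1,869,397 + 253,761 = 2,123,158). A minimal sketch of that windowing follows, assuming the split is by calendar year. `load_window` and `is_prediction_row` are hypothetical names, not functions from `src.pipeline`.

```python
from datetime import date


def load_window(pred_year, train_years):
    """Return (start, end) of the data window for one prediction year."""
    start = date(pred_year - train_years, 1, 1)
    end = date(pred_year, 12, 31)
    return start, end


def is_prediction_row(row_date, pred_year):
    """Assumed split: rows dated in pred_year are predicted, earlier rows are used for training."""
    return row_date.year == pred_year


start, end = load_window(2024, train_years=10)
print(start, end)  # 2014-01-01 2024-12-31
```

Under this assumption no row from the prediction year is used for fitting, which is consistent with the 253,761 prediction rows equalling 1,011 loaded tickers times 251 trading days.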





In [6]:
# Check result
print(f"\n2024 Submission: {len(submission_2024)} rows")
submission_2024.head(20)


2024 Submission: 25100 rows


         date  rank  ticker
0  2024-01-02     1    ONDS
1  2024-01-02     2    BITF
2  2024-01-02     3     AVX
3  2024-01-02     4    CLSK
4  2024-01-02     5     IVP
5  2024-01-02     6    IVVD
6  2024-01-02     7    BTBT
7  2024-01-02     8    HEPS
8  2024-01-02     9     BGM
9  2024-01-02    10    IXHL


In [7]:
# Clear memory before full run
del submission_2024
gc.collect()

0

## 4. Run Full Pipeline (All Years)

Each year is processed separately so that only one panel needs to fit in memory at a time.
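
A minimal sketch of the year-by-year pattern, assuming each year's result can be saved and released before the next year starts. `run_by_year`, `process_year`, and `save` are hypothetical stand-ins, not the internals of `CompetitionPipeline.run`.

```python
import gc


def run_by_year(years, process_year, save):
    """Process one year at a time and keep only the saved file paths."""
    paths = {}
    for year in years:
        result = process_year(year)       # the large panel lives only inside this call
        paths[year] = save(year, result)  # persist before moving on
        del result                        # drop the last reference to this year's data
        gc.collect()                      # ask the collector to reclaim it right away
    return paths


# Toy usage with stand-in callables instead of real training.
paths = run_by_year(
    [2020, 2021],
    process_year=lambda year: [year] * 3,
    save=lambda year, result: f"{year}.submission.csv",
)
print(paths)  # {2020: '2020.submission.csv', 2021: '2021.submission.csv'}
```

The point of the design is that peak memory is bounded by the largest single year rather than by the sum of all five.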

In [8]:
# Re-initialize pipeline
pipeline = CompetitionPipeline()

# Run for all years
paths = pipeline.run(
    pred_years=[2020, 2021, 2022, 2023, 2024],
    train_years=10
)


ETF STOCK PREDICTION COMPETITION PIPELINE
Memory-optimized: processing year by year

Processing 2020
Universe: 1016 tickers
Loading data from 2010-01-01 to 2020-12-31...


Loading & features: 100%|██████████| 1016/1016 [08:02<00:00,  2.11it/s]


Loaded 1000/1016 tickers (failed: 0)
Panel: 2,111,602 rows, 1127.4 MB
Adding cross-sectional features...
Using 160 features
Training rows: 1,858,959
Prediction rows: 252,000

Training model...
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[4]	train's rmse: 15.321	valid's rmse: 1.07816

Top 10 features:
              feature   importance  importance_pct
                sma_5 1.030822e+08       37.223468
            volume_cv 8.330245e+07       30.080895
           avg_gap_20 5.673150e+07       20.486001
                rsi_7 1.789987e+07        6.463726
             bb_lower 4.622865e+06        1.669338
          macd_signal 3.809512e+06        1.375632
     bb_position_rank 1.995482e+06        0.720578
             bb_width 1.441958e+06        0.520697
avg_intraday_range_20 1.332460e+06        0.481157
        from_high_252 9.766838e+05        0.352685

Generating predictions...


Predicting: 100%|██████████| 252/252 [00:00<00:00, 527.33it/s]


Generated 25,200 predictions
Saved: /Users/jeong-uchang/etf-trading-project/etf-model/submissions/2020.submission.csv

Processing 2021
Universe: 1016 tickers
Loading data from 2011-01-01 to 2021-12-31...


Loading & features: 100%|██████████| 1016/1016 [07:56<00:00,  2.13it/s]


Loaded 970/1016 tickers (failed: 1)
Panel: 1,960,840 rows, 1046.9 MB
Adding cross-sectional features...
Training rows: 1,716,701
Prediction rows: 243,470

Training model...
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[42]	train's rmse: 6.11848	valid's rmse: 12.6826

Top 10 features:
         feature   importance  importance_pct
      calmar_252 7.024106e+08       23.163311
     cum_ret_ytd 4.902484e+08       16.166863
     ret_skew_63 4.145651e+08       13.671064
   volatility_63 2.266773e+08        7.475109
pos_ret_ratio_63 1.778991e+08        5.866558
         plus_di 1.726253e+08        5.692642
          rsi_14 1.248347e+08        4.116660
             adx 8.886282e+07        2.930419
        minus_di 7.929978e+07        2.615059
      avg_gap_20 7.717180e+07        2.544885

Generating predictions...


Predicting: 100%|██████████| 251/251 [00:00<00:00, 517.57it/s]


Generated 25,100 predictions
Saved: /Users/jeong-uchang/etf-trading-project/etf-model/submissions/2021.submission.csv

Processing 2022
Universe: 1017 tickers
Loading data from 2012-01-01 to 2022-12-31...


Loading & features: 100%|██████████| 1017/1017 [07:51<00:00,  2.16it/s]


Loaded 1001/1017 tickers (failed: 0)
Panel: 1,959,754 rows, 1046.3 MB
Adding cross-sectional features...
Training rows: 1,707,831
Prediction rows: 251,251

Training model...
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[1]	train's rmse: 5.24506	valid's rmse: 0.32234

Top 10 features:
               feature   importance  importance_pct
                   obv 3.659903e+06       78.820395
            ret_std_63 5.749637e+05       12.382530
         rsi_14_zscore 2.030940e+05        4.373872
              ret_252d 9.211480e+04        1.983802
            calmar_252 2.569259e+04        0.553321
                 adosc 2.372620e+04        0.510972
volatility_20_relative 1.657127e+04        0.356882
          from_low_252 5.763538e+03        0.124125
       atr_14_relative 5.207877e+03        0.112158
              kc_lower 5.054881e+03        0.108863

Generating predictions...


Predicting: 100%|██████████| 251/251 [00:00<00:00, 530.93it/s]


Generated 25,100 predictions
Saved: /Users/jeong-uchang/etf-trading-project/etf-model/submissions/2022.submission.csv

Processing 2023
Universe: 1017 tickers
Loading data from 2013-01-01 to 2023-12-31...


Loading & features: 100%|██████████| 1017/1017 [08:07<00:00,  2.09it/s]


Loaded 1008/1017 tickers (failed: 0)
Panel: 2,055,306 rows, 1097.4 MB
Adding cross-sectional features...
Training rows: 1,802,637
Prediction rows: 252,000

Training model...
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[1]	train's rmse: 1.84197	valid's rmse: 0.407278

Top 10 features:
               feature   importance  importance_pct
                   obv 2.395248e+06       92.832437
      pos_ret_ratio_63 6.244670e+04        2.420242
              ret_252d 4.638986e+04        1.797928
                 adosc 2.962084e+04        1.148013
volatility_20_relative 4.608081e+03        0.178595
         volume_sma_10 4.309470e+03        0.167022
              bb_lower 4.207954e+03        0.163087
              kc_lower 4.161047e+03        0.161269
            calmar_252 3.743380e+03        0.145082
                ema_63 3.315172e+03        0.128486

Generating predictions...


Predicting: 100%|██████████| 250/250 [00:00<00:00, 542.02it/s]


Generated 25,000 predictions
Saved: /Users/jeong-uchang/etf-trading-project/etf-model/submissions/2023.submission.csv

Processing 2024
Universe: 1016 tickers
Loading data from 2014-01-01 to 2024-12-31...


Loading & features: 100%|██████████| 1016/1016 [08:08<00:00,  2.08it/s]


Loaded 1011/1016 tickers (failed: 0)
Panel: 2,123,158 rows, 1141.7 MB
Adding cross-sectional features...
Training rows: 1,869,397
Prediction rows: 253,761

Training model...
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[36]	train's rmse: 0.733893	valid's rmse: 0.344072

Top 10 features:
          feature   importance  importance_pct
              obv 1.967859e+07       51.941488
       obv_sma_20 9.773544e+06       25.797192
        ad_sma_20 1.941608e+06        5.124860
      macd_signal 1.484443e+06        3.918174
ret_63d_vs_market 5.043601e+05        1.331254
           sma_63 4.822911e+05        1.273003
               ad 3.800604e+05        1.003166
           ema_63 3.417029e+05        0.901922
  atr_14_relative 3.407791e+05        0.899484
    universe_size 3.116589e+05        0.822621

Generating predictions...


Predicting: 100%|██████████| 251/251 [00:00<00:00, 450.69it/s]


Generated 25,100 predictions
Saved: /Users/jeong-uchang/etf-trading-project/etf-model/submissions/2024.submission.csv

Pipeline completed in 41.8 minutes

Validation:
  2020: 25,200 rows (expected ~25,300) [OK]
  2021: 25,100 rows (expected ~25,200) [OK]
  2022: 25,100 rows (expected ~25,100) [OK]
  2023: 25,000 rows (expected ~25,000) [OK]
  2024: 25,100 rows (expected ~25,200) [OK]


## 5. Verify Submissions

In [9]:
# Check all submission files
print("\nSubmission files:")
for year, path in paths.items():
    if path.exists():
        df = pd.read_csv(path)
        dates = df['date'].nunique()
        print(f"  {year}: {len(df):,} rows ({dates} days x 100)")


Submission files:
  2020: 25,200 rows (252 days x 100)
  2021: 25,100 rows (251 days x 100)
  2022: 25,100 rows (251 days x 100)
  2023: 25,000 rows (250 days x 100)
  2024: 25,100 rows (251 days x 100)
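
The check above counts rows and dates but does not confirm that every date really carries 100 distinct tickers ranked 1 to 100. A stricter check could be sketched as follows, assuming the `date`, `rank`, and `ticker` columns shown in this notebook. `validate_rows` and `validate_submission` are hypothetical helpers.

```python
import csv
from collections import defaultdict


def validate_rows(rows, k=100):
    """Return a list of problems; an empty list means every date has ranks 1..k and unique tickers."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append((int(row["rank"]), row["ticker"]))

    problems = []
    for day in sorted(by_date):
        ranks = sorted(rank for rank, _ in by_date[day])
        tickers = [ticker for _, ticker in by_date[day]]
        if ranks != list(range(1, k + 1)):
            problems.append(f"{day}: ranks are not exactly 1..{k}")
        if len(set(tickers)) != len(tickers):
            problems.append(f"{day}: duplicate tickers")
    return problems


def validate_submission(path, k=100):
    """Read a submission CSV and validate it row by row."""
    with open(path, newline="") as fh:
        return validate_rows(csv.DictReader(fh), k)
```

For example, `validate_submission(paths[2024])` would return an empty list for a well-formed file.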


In [10]:
# Sample from 2024
if 2024 in paths and paths[2024].exists():
    df = pd.read_csv(paths[2024])
    print("2024 first day predictions:")
    display(df[df['date'] == df['date'].min()].head(20))

2024 first day predictions:


         date  rank  ticker
0  2024-01-02     1    ONDS
1  2024-01-02     2    IVVD
2  2024-01-02     3    BITF
3  2024-01-02     4    BTBT
4  2024-01-02     5     AVX
5  2024-01-02     6    CLSK
6  2024-01-02     7     IVP
7  2024-01-02     8    HEPS
8  2024-01-02     9    ENLV
9  2024-01-02    10     BGM


## 6. Prepare for Upload

Each year's file is saved as `submissions/YYYY.submission.csv`.

In [11]:
print("Files ready for upload:")
print(f"\nLocation: {SUBMISSIONS_DIR}")
for f in sorted(SUBMISSIONS_DIR.glob("*.csv")):
    print(f"  {f.name}")

print("\nUpload to: http://ahnbi1.suwon.ac.kr:5151")

Files ready for upload:

Location: /Users/jeong-uchang/etf-trading-project/etf-model/submissions
  2020.submission.csv
  2021.submission.csv
  2022.submission.csv
  2023.submission.csv
  2024.submission.csv

Upload to: http://ahnbi1.suwon.ac.kr:5151
