# Stock Trend Prediction - Kaggle Submission

This notebook generates predictions for the Kaggle competition:
- Load test data (ID, Date)
- Load historical stock data for test dates
- Apply same feature engineering as training
- Generate predictions using best model
- Format submission CSV file

In [1]:
# Imports
import sys
import os
import pandas as pd
import numpy as np
from pathlib import Path
import pickle
from datetime import datetime, timedelta

# Add parent directory to path (notebooks have no __file__, so use the working directory)
sys.path.insert(0, str(Path.cwd().parent))

# Import KhayatMiniNN
from KhayatMiniNN.neural_network import NeuralNetwork
from KhayatMiniNN.trainer import Trainer
from KhayatMiniNN.layers import LSTM, GRU, Conv1D, Dense, ReLU, Sigmoid, MaxPooling1D
from KhayatMiniNN.layers.base import Layer
from KhayatMiniNN.regularization import Dropout, BatchNormalization
from KhayatMiniNN.losses import BinaryCrossEntropy
from KhayatMiniNN.optimizers import Adam

# Import data loader and feature engineering
from load_data import StockDataLoader
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

# Custom Flatten layer (same as in training notebook)
class Flatten(Layer):
    """Flatten layer to convert 3D (batch, seq, features) to 2D (batch, seq*features)."""
    def __init__(self, name="Flatten"):
        super().__init__(name)
    
    def forward(self, input_data):
        self.input = input_data
        batch_size = input_data.shape[0]
        return input_data.reshape(batch_size, -1)
    
    def backward(self, output_grad):
        return output_grad.reshape(self.input.shape)

print("✓ Imports complete")

✓ Imports complete


## 1. Load Test Data and Model

In [2]:
# Load test data
data_dir = Path("../data")
model_dir = Path("../models")
processed_dir = Path("../data/processed")

# Load test.csv (contains ID and Date for predictions)
test_submission_df = pd.read_csv(data_dir / "test.csv")
print(f"Test samples: {len(test_submission_df):,}")
print(f"\nTest data columns: {test_submission_df.columns.tolist()}")
print(f"\nFirst few rows:")
print(test_submission_df.head(10))

# Parse dates
test_submission_df['Date'] = pd.to_datetime(test_submission_df['Date'])
print(f"\nDate range: {test_submission_df['Date'].min()} to {test_submission_df['Date'].max()}")

# Load model information
with open(model_dir / "model_comparison.pkl", "rb") as f:
    model_info = pickle.load(f)

best_model_name = model_info['best_model']
sequence_length = model_info['sequence_length']
feature_cols = model_info['feature_cols']

print(f"\n{'='*60}")
print(f"Model Information")
print(f"{'='*60}")
print(f"Best Model: {best_model_name}")
print(f"Sequence Length: {sequence_length}")
print(f"Number of Features: {len(feature_cols)}")

Test samples: 5,000

Test data columns: ['ID', 'Date']

First few rows:
            ID        Date
0     ticker_1  2024-11-04
1    ticker_10  2024-11-04
2   ticker_100  2024-11-04
3  ticker_1000  2024-11-04
4  ticker_1001  2024-11-04
5  ticker_1002  2024-11-04
6  ticker_1003  2024-11-04
7  ticker_1004  2024-11-04
8  ticker_1005  2024-11-04
9  ticker_1006  2024-11-04

Date range: 2024-11-04 00:00:00 to 2024-11-04 00:00:00

Model Information
Best Model: Hybrid
Sequence Length: 20
Number of Features: 71


## 2. Load Historical Data for Test Dates

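Each sequence is built from a ticker's most recent historical rows on or before the prediction date. When the history ends before the prediction date, as it does here (2024-09-23 versus 2024-11-04), the latest available rows are used instead.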
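
A minimal sketch of that cutoff, using a tiny made-up frame with the same `Ticker` and `Date` column names as `full_df`. The helper name `history_up_to` is hypothetical and is not part of `load_data`. It keeps only rows on or before the prediction date and reports how many calendar days separate each ticker's last row from that date.

```python
import pandas as pd

def history_up_to(df: pd.DataFrame, prediction_date: pd.Timestamp) -> pd.DataFrame:
    """Return rows dated on or before prediction_date, sorted per ticker."""
    mask = df["Date"] <= prediction_date
    return df.loc[mask].sort_values(["Ticker", "Date"]).reset_index(drop=True)

# Toy data standing in for full_df (values are illustrative only)
toy = pd.DataFrame({
    "Ticker": ["ticker_1", "ticker_1", "ticker_2", "ticker_2"],
    "Date": pd.to_datetime(["2024-09-20", "2024-09-23", "2024-09-23", "2024-11-05"]),
    "Close": [10.0, 10.5, 20.0, 21.0],
})
prediction_date = pd.Timestamp("2024-11-04")

history = history_up_to(toy, prediction_date)
last_seen = history.groupby("Ticker")["Date"].max()
# Calendar days between each ticker's last available row and the prediction date
gap_days = (prediction_date - last_seen).dt.days
print(gap_days)
```

A large gap is worth knowing about before trusting the predictions, since the model then sees market conditions that are several weeks old.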

In [3]:
# Load full training data to get historical prices
# We'll use this to get historical data for test predictions
print("Loading training data for historical context...")
loader = StockDataLoader(data_dir="../data", train_file="train.csv")

# Load raw data (we need all historical data)
print("Loading raw training data...")
full_df = loader.load_data()

# Process data (handle missing values, but don't create targets yet)
print("Processing data...")
full_df = loader.handle_missing_values(full_df)

# Ensure Date is datetime
full_df['Date'] = pd.to_datetime(full_df['Date'])

print(f"\n✓ Loaded historical data:")
print(f"  Total rows: {len(full_df):,}")
print(f"  Unique tickers: {full_df['Ticker'].nunique()}")
print(f"  Date range: {full_df['Date'].min()} to {full_df['Date'].max()}")

# Extract unique tickers and dates from test
test_tickers = test_submission_df['ID'].unique()
test_dates = test_submission_df['Date'].unique()

print(f"\nTest data:")
print(f"  Unique IDs: {len(test_tickers)}")
print(f"  Unique dates: {len(test_dates)}")
print(f"  Date range: {test_dates.min()} to {test_dates.max()}")

Loading training data for historical context...
Loading raw training data...
Loading training data...
Reading from: ../data/train.csv
✓ Loaded 21,033,522 rows
✓ Data loaded: 21,033,522 rows, 5000 unique tickers
  Date range: 1962-01-02 00:00:00 to 2024-09-23 00:00:00
Processing data...

Handling missing values...
✓ No missing values found
  Removing 807,778 rows with invalid prices
✓ Removed 807,778 rows with invalid data
✓ Final dataset: 20,225,744 rows, 5000 unique tickers

✓ Loaded historical data:
  Total rows: 20,225,744
  Unique tickers: 5000
  Date range: 1962-01-02 00:00:00 to 2024-09-23 00:00:00

Test data:
  Unique IDs: 5000
  Unique dates: 1
  Date range: 2024-11-04 00:00:00 to 2024-11-04 00:00:00


## 3. Prepare Test Data with Historical Context

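Each test sample needs at least `sequence_length` rows of history before the prediction date. Long-window features such as MA200 need more, which is why the feature-engineering cell below keeps the last 250 rows per ticker.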
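
One way to check this requirement up front is to count rows per ticker. The sketch below uses toy data and a hypothetical helper, `tickers_with_short_history`; only the `Ticker` column name is taken from the notebook.

```python
import pandas as pd

def tickers_with_short_history(df: pd.DataFrame, sequence_length: int) -> list:
    """List tickers that have fewer than sequence_length rows."""
    counts = df.groupby("Ticker").size()
    return sorted(counts[counts < sequence_length].index.tolist())

# Toy data: ticker A has 25 rows, ticker B only 5
toy = pd.DataFrame({
    "Ticker": ["A"] * 25 + ["B"] * 5,
    "Date": list(pd.date_range("2024-01-01", periods=25))
            + list(pd.date_range("2024-01-01", periods=5)),
})

short = tickers_with_short_history(toy, sequence_length=20)
print(short)
```

Tickers returned by such a check would produce no sequence, so they would need a fallback prediction in the submission.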

In [4]:
# Skip this cell - data preparation is now done in the next cell
# which creates features from raw historical data properly

print("="*60)
print("Data Preparation - Skipped")
print("="*60)
print("\nData preparation will be done in the next cell where we:")
print("  1. Take raw historical data for test tickers")
print("  2. Create all features using create_features()")
print("  3. Scale the features using the saved scaler")
print("\nThis ensures we use the MOST RECENT data from training (up to 2024-09-23)")
print("rather than stale data from a pre-computed test split.")

# Just verify we have the data we need
train_tickers = sorted(full_df['Ticker'].unique())
test_ids = set(test_submission_df['ID'].unique())
matching_tickers = test_ids.intersection(set(train_tickers))

print(f"\nData check:")
print(f"  Training tickers: {len(train_tickers):,}")
print(f"  Test IDs: {len(test_ids):,}")
print(f"  Matching: {len(matching_tickers):,}")

if len(matching_tickers) < len(test_ids):
    missing = test_ids - set(train_tickers)
    print(f"  ⚠ Missing tickers: {len(missing)}")
    print(f"    Sample: {list(missing)[:5]}")

Data Preparation - Skipped

Data preparation will be done in the next cell where we:
  1. Take raw historical data for test tickers
  2. Create all features using create_features()
  3. Scale the features using the saved scaler

This ensures we use the MOST RECENT data from training (up to 2024-09-23)
rather than stale data from a pre-computed test split.

Data check:
  Training tickers: 5,000
  Test IDs: 5,000
  Matching: 5,000


In [5]:
# IMPORTANT: We need to use the RAW historical data, NOT the pre-processed test_features.csv
# The test_features.csv is the validation test split, NOT the actual Kaggle test data!
# The Kaggle test date is 2024-11-04, but our training data ends at 2024-09-23.
# We use the most recent data from training to make predictions.

print("="*60)
print("Creating Features from Raw Historical Data")
print("="*60)
print("\n⚠ NOTE: Not using test_features.csv - that's the validation split!")
print("  We need features from the MOST RECENT historical data for each ticker.")

# Import the feature engineering function
from feature_utils import create_features

# Get the most recent data for each ticker (need enough for feature calculation)
# Features like MA200 need 200+ days of history
min_history_required = 250  # At least 250 days for MA200 and other features

print(f"\nFiltering to last {min_history_required} days per ticker...")
test_ids = set(test_submission_df['ID'].unique())

# Filter to test tickers only
historical_df = full_df[full_df['Ticker'].isin(test_ids)].copy()
print(f"  Filtered to test tickers: {len(historical_df):,} rows")

# Get the last min_history_required rows per ticker for feature calculation
historical_df = historical_df.sort_values(['Ticker', 'Date'])
historical_df = historical_df.groupby('Ticker').tail(min_history_required).reset_index(drop=True)
print(f"  After taking last {min_history_required} days: {len(historical_df):,} rows")
print(f"  Date range: {historical_df['Date'].min()} to {historical_df['Date'].max()}")

# Ticker statistics from training (for the ticker_mean_return_30d feature, to prevent leakage)
# should have been saved during feature engineering. None are loaded here, so
# ticker_stats=None is passed and create_features is left to recalculate them.
print("\nCreating features...")
test_df, computed_feature_cols = create_features(historical_df, verbose=True, ticker_stats=None)

print(f"\n✓ Created {len(computed_feature_cols)} features")
print(f"  Test data shape: {test_df.shape}")
print(f"  Date range: {test_df['Date'].min()} to {test_df['Date'].max()}")

Creating Features from Raw Historical Data

⚠ NOTE: Not using test_features.csv - that's the validation split!
  We need features from the MOST RECENT historical data for each ticker.

Filtering to last 250 days per ticker...
  Filtered to test tickers: 20,225,744 rows
  After taking last 250 days: 1,249,670 rows
  Date range: 2017-09-25 00:00:00 to 2024-09-23 00:00:00

Creating features...
  [1/7] Basic price features...
  [2/7] Technical indicators (RSI, MACD, Bollinger)...
  [3/7] Volume features...
  [4/7] Rolling statistics...
  [5/7] Seasonality features...
  [6/7] Long-term features (MA, volatility, trend)...
  [7/7] Market regime & interactions...
  Cleaning up NaN/Inf values...
  Done!

✓ Created 132 features
  Test data shape: (1249670, 141)
  Date range: 2017-09-25 00:00:00 to 2024-09-23 00:00:00
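
The 250-row window used above leaves room for long-window indicators to warm up. The sketch below illustrates this with a plain pandas rolling mean on synthetic prices. The `ma200` name is only illustrative, and `create_features` may compute or clean its moving averages differently.

```python
import numpy as np
import pandas as pd

window = 200
close = pd.Series(np.linspace(100.0, 150.0, 250))  # 250 synthetic closing prices

# With the default min_periods, the result is NaN until `window` rows are available
ma200 = close.rolling(window=window).mean()

warmup_rows = int(ma200.isna().sum())   # rows without a complete 200-day average
usable_rows = int(ma200.notna().sum())  # rows left over for building sequences
print(warmup_rows, usable_rows)
```

With 250 rows and a 200-row window, 51 rows carry a complete average, which comfortably covers a 20-step sequence.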


## 5. Scale Features

Scale the engineered features with the scaler that was fitted during training, so the model sees inputs on the same scale it was trained on.
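
The cell below reuses the scaler saved during training rather than fitting a new one. As a rough illustration of why, here is a NumPy-only sketch of standardization on synthetic data: the mean and standard deviation come from the training split and are then applied unchanged to the test split. This mirrors the usual `(x - mean) / std` formula and is not the notebook's actual scaler.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # synthetic training features
test = rng.normal(loc=55.0, scale=10.0, size=(200, 3))    # synthetic, slightly shifted test features

# Statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse training statistics; do not refit on test data

print(train_scaled.mean(axis=0).round(3), test_scaled.mean(axis=0).round(3))
```

The scaled test mean is not zero here because the test distribution is shifted. That shift is information the model should see, and refitting on the test set would erase it.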

In [6]:
# Scale the features using the same scaler from training
print("="*60)
print("Scaling Features")
print("="*60)

# Load scaler
with open(processed_dir / "scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
print("✓ Loaded scaler from training")

# Verify we have the right features
# The scaler and the saved model were both fit on exactly these columns, so a
# partial feature set cannot be scaled or fed to the model; fail fast instead.
missing_features = [f for f in feature_cols if f not in test_df.columns]
if missing_features:
    raise ValueError(
        f"{len(missing_features)} features missing from test data: {missing_features[:10]}..."
    )
available_features = feature_cols
print(f"✓ All {len(feature_cols)} features present")

# Clean inf/nan values before scaling
print("\nCleaning inf/nan values...")
for col in available_features:
    test_df[col] = test_df[col].replace([np.inf, -np.inf], np.nan)
    test_df[col] = test_df[col].fillna(0)

# Apply scaler to test features
print("Applying scaler to test features...")
test_features = test_df[available_features].values

# Check for issues before scaling
print(f"  Pre-scaling stats: min={test_features.min():.4f}, max={test_features.max():.4f}, mean={test_features.mean():.4f}")

# Scale
test_features_scaled = scaler.transform(test_features)

# Check for issues after scaling
print(f"  Post-scaling stats: min={test_features_scaled.min():.4f}, max={test_features_scaled.max():.4f}, mean={test_features_scaled.mean():.4f}")

# Replace original features with scaled values
test_df[available_features] = test_features_scaled

# Update feature_cols to use only available features
feature_cols = available_features

print(f"\n✓ Test data scaled and ready: {len(test_df):,} rows, {len(feature_cols)} features")

Scaling Features
✓ Loaded scaler from training
✓ All 71 features present

Cleaning inf/nan values...
Applying scaler to test features...
  Pre-scaling stats: min=-5533.7202, max=149865296.0000, mean=18504.4824
  Post-scaling stats: min=-1344.1433, max=16746.5449, mean=0.3943

✓ Test data scaled and ready: 1,249,670 rows, 71 features


## 6. Create Sequences and Generate Predictions

In [7]:
# Create sequences for test data - OPTIMIZED VERSION
def create_sequences_per_ticker_optimized(df, feature_cols, sequence_length):
    """Create sequences grouped by ticker - OPTIMIZED with groupby."""
    X_list = []
    metadata_list = []  # Store (ticker, date) for each sequence
    
    # Sort once and group - much faster than filtering repeatedly
    df_sorted = df.sort_values(['Ticker', 'Date'])
    
    # Pre-group the dataframe - this is the key optimization
    ticker_groups = df_sorted.groupby('Ticker')
    
    # Get unique tickers
    unique_tickers = df_sorted['Ticker'].unique()
    print(f"  Processing {len(unique_tickers)} tickers...")
    
    for i, ticker in enumerate(unique_tickers):
        if i % 1000 == 0 and i > 0:
            print(f"    Processed {i}/{len(unique_tickers)} tickers...")
        
        # Get pre-grouped data (fast lookup)
        ticker_data = ticker_groups.get_group(ticker)
        
        if len(ticker_data) < sequence_length:
            continue
        
        # Get the last sequence_length rows
        features = ticker_data[feature_cols].values[-sequence_length:]
        X_list.append(features)
        
        # Store metadata (use the last date in the sequence)
        last_date = ticker_data['Date'].iloc[-1]
        metadata_list.append({
            'Ticker': ticker,
            'Date': last_date
        })
    
    return np.array(X_list), metadata_list

print("Creating sequences for test data (optimized)...")
X_test, test_metadata = create_sequences_per_ticker_optimized(test_df, feature_cols, sequence_length)
print(f"✓ Created {len(X_test):,} sequences")
print(f"  Sequence shape: {X_test.shape}")

# Load and build model
print(f"\n{'='*60}")
print(f"Loading {best_model_name} Model")
print(f"{'='*60}")

# Model parameters - MUST MATCH the trained model exactly!
# These values are inferred from best_model_params.pkl shapes:
# conv1: W (3, 71, 32) -> input=71, filters=32
# lstm1: W_f (32, 32), U_f (32, 32) -> input=32, hidden=32
# dense1: W (32, 16) -> 32 -> 16
# dense2: W (16, 1) -> 16 -> 1
input_size = len(feature_cols)  # 71
lstm_hidden = 32  # FIXED: was 64, but saved model uses 32
gru_hidden = 32   # FIXED: was 64
conv_filters = 32

# Define model architectures (same as training)
def build_lstm_model():
    model = NeuralNetwork(name="LSTM_Stock_Predictor")
    model.add_layer(LSTM(input_size, lstm_hidden, return_sequences=False), name="lstm1")
    model.add_layer(Dropout(dropout_rate=0.3), name="dropout1")
    model.add_layer(Dense(lstm_hidden, 16), name="dense1")
    model.add_layer(ReLU(), name="relu1")
    model.add_layer(Dense(16, 1), name="dense2")
    model.add_layer(Sigmoid(), name="sigmoid1")
    return model

def build_gru_model():
    model = NeuralNetwork(name="GRU_Stock_Predictor")
    model.add_layer(GRU(input_size, gru_hidden, return_sequences=False), name="gru1")
    model.add_layer(Dropout(dropout_rate=0.3), name="dropout1")
    model.add_layer(Dense(gru_hidden, 16), name="dense1")
    model.add_layer(ReLU(), name="relu1")
    model.add_layer(Dense(16, 1), name="dense2")
    model.add_layer(Sigmoid(), name="sigmoid1")
    return model

def build_conv1d_model():
    seq_after_pool1 = (sequence_length - 2) // 2 + 1
    seq_after_pool2 = (seq_after_pool1 - 2) // 2 + 1
    flattened_size = seq_after_pool2 * (conv_filters * 2)
    
    model = NeuralNetwork(name="Conv1D_Stock_Predictor")
    model.add_layer(Conv1D(input_size, conv_filters, kernel_size=3, padding='same'), name="conv1")
    model.add_layer(ReLU(), name="relu1")
    model.add_layer(MaxPooling1D(pool_size=2, stride=2), name="pool1")
    model.add_layer(Conv1D(conv_filters, conv_filters*2, kernel_size=3, padding='same'), name="conv2")
    model.add_layer(ReLU(), name="relu2")
    model.add_layer(MaxPooling1D(pool_size=2, stride=2), name="pool2")
    model.add_layer(Flatten(), name="flatten")
    model.add_layer(Dense(flattened_size, 32), name="dense1")
    model.add_layer(ReLU(), name="relu3")
    model.add_layer(Dropout(dropout_rate=0.3), name="dropout1")
    model.add_layer(Dense(32, 1), name="dense2")
    model.add_layer(Sigmoid(), name="sigmoid1")
    return model

def build_hybrid_model():
    """Build Hybrid model EXACTLY matching the training architecture.
    Must match layer names exactly for parameter loading to work.
    """
    model = NeuralNetwork(name="Hybrid_ConvLSTM_V2")
    # Conv1D layers for feature extraction
    model.add_layer(Conv1D(input_size, conv_filters, kernel_size=3, padding='same'), name="conv1")
    model.add_layer(BatchNormalization(conv_filters), name="batchnorm1")
    model.add_layer(ReLU(), name="relu1")
    model.add_layer(MaxPooling1D(pool_size=2, stride=2), name="pool1")
    model.add_layer(Dropout(dropout_rate=0.3), name="dropout1")
    # LSTM for sequence modeling
    model.add_layer(LSTM(conv_filters, lstm_hidden, return_sequences=False), name="lstm1")
    model.add_layer(BatchNormalization(lstm_hidden), name="batchnorm2")
    model.add_layer(Dropout(dropout_rate=0.6), name="dropout2")
    model.add_layer(Dense(lstm_hidden, 16), name="dense1")
    model.add_layer(ReLU(), name="relu2")
    model.add_layer(Dropout(dropout_rate=0.5), name="dropout3")
    model.add_layer(Dense(16, 1), name="dense2")
    model.add_layer(Sigmoid(), name="sigmoid1")
    return model

# Build the best model
model_builders = {
    'LSTM': build_lstm_model,
    'GRU': build_gru_model,
    'Conv1D': build_conv1d_model,
    'Hybrid': build_hybrid_model
}

model = model_builders[best_model_name]()
loss_fn = BinaryCrossEntropy(from_logits=False)
model.set_loss(loss_fn)
optimizer = Adam(learning_rate=0.001)
trainer = Trainer(model, optimizer, loss_fn)

# Load best model parameters
with open(model_dir / "best_model_params.pkl", "rb") as f:
    best_params = pickle.load(f)

model.set_params(best_params)
print(f"✓ Loaded {best_model_name} model")

# Generate predictions
print(f"\n{'='*60}")
print("Generating Predictions")
print(f"{'='*60}")

predictions = trainer.predict(X_test)
predictions_binary = (predictions > 0.5).astype(int).flatten()

print(f"✓ Generated {len(predictions_binary):,} predictions")
print(f"  ↑ (Higher) predictions: {predictions_binary.sum():,} ({predictions_binary.mean()*100:.2f}%)")
print(f"  ↓ (Lower) predictions: {(1-predictions_binary).sum():,} ({(1-predictions_binary).mean()*100:.2f}%)")

Creating sequences for test data (optimized)...
  Processing 5000 tickers...
    Processed 1000/5000 tickers...
    Processed 2000/5000 tickers...
    Processed 3000/5000 tickers...
    Processed 4000/5000 tickers...
✓ Created 5,000 sequences
  Sequence shape: (5000, 20, 71)

Loading Hybrid Model
ℹ CuPy not available, using CPU (NumPy)
✓ Loaded Hybrid model

Generating Predictions
✓ Generated 5,000 predictions
  ↑ (Higher) predictions: 17 (0.34%)
  ↓ (Lower) predictions: 4,983 (99.66%)
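
To sanity-check the Hybrid architecture against the `(5000, 20, 71)` input, the per-sample shape can be traced layer by layer. This is a sketch under two assumptions about KhayatMiniNN: that `Conv1D` with `padding='same'` preserves the sequence length, and that `MaxPooling1D` follows the standard unpadded pooling formula, which is the same expression used in `build_conv1d_model`. The helper `pooled_length` is hypothetical.

```python
def pooled_length(length: int, pool_size: int = 2, stride: int = 2) -> int:
    """Output length of a 1D max-pool without padding."""
    return (length - pool_size) // stride + 1

sequence_length, n_features = 20, 71
conv_filters, lstm_hidden = 32, 32

shapes = [("input", (sequence_length, n_features))]
# 'same' padding keeps the time dimension; channels become conv_filters
shapes.append(("conv1", (sequence_length, conv_filters)))
shapes.append(("pool1", (pooled_length(sequence_length), conv_filters)))
# return_sequences=False keeps only the final hidden state
shapes.append(("lstm1", (lstm_hidden,)))
shapes.append(("dense1", (16,)))
shapes.append(("dense2", (1,)))

for name, shape in shapes:
    print(f"{name:>7}: {shape}")
```

Normalization, activation, and dropout layers are omitted because they do not change shapes.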


In [16]:
# DIAGNOSTIC: Analyze raw predictions to understand the distribution
print("="*60)
print("DIAGNOSTIC: Analyzing Raw Predictions")
print("="*60)

# Check raw prediction values (before thresholding)
print(f"\nRaw predictions shape: {predictions.shape}")
print(f"Raw predictions stats:")
print(f"  Min: {predictions.min():.6f}")
print(f"  Max: {predictions.max():.6f}")
print(f"  Mean: {predictions.mean():.6f}")
print(f"  Std: {predictions.std():.6f}")
print(f"  Median: {np.median(predictions):.6f}")

# Check distribution of predictions
print(f"\nPrediction distribution:")
print(f"  < 0.1: {(predictions < 0.1).sum():,} ({(predictions < 0.1).mean()*100:.2f}%)")
print(f"  0.1-0.3: {((predictions >= 0.1) & (predictions < 0.3)).sum():,}")
print(f"  0.3-0.5: {((predictions >= 0.3) & (predictions < 0.5)).sum():,}")
print(f"  0.5-0.7: {((predictions >= 0.5) & (predictions < 0.7)).sum():,}")
print(f"  0.7-0.9: {((predictions >= 0.7) & (predictions < 0.9)).sum():,}")
print(f"  >= 0.9: {(predictions >= 0.9).sum():,}")

# Show histogram-like distribution
print(f"\nDetailed distribution (percentiles):")
for p in [1, 5, 10, 25, 50, 75, 90, 95, 99]:
    print(f"  {p}th percentile: {np.percentile(predictions, p):.6f}")

# Check input data statistics
print(f"\n{'='*60}")
print("DIAGNOSTIC: Input Data Statistics")
print("="*60)
print(f"X_test shape: {X_test.shape}")
print(f"X_test stats:")
print(f"  Min: {X_test.min():.4f}")
print(f"  Max: {X_test.max():.4f}")
print(f"  Mean: {X_test.mean():.4f}")
print(f"  Std: {X_test.std():.4f}")

# Check for NaN/Inf values
print(f"\nNaN values in X_test: {np.isnan(X_test).sum()}")
print(f"Inf values in X_test: {np.isinf(X_test).sum()}")

# Try different thresholds
print(f"\n{'='*60}")
print("Trying Different Thresholds")
print("="*60)
for threshold in [0.1, 0.2, 0.3, 0.4, 0.5]:
    pred_binary = (predictions > threshold).astype(int).flatten()
    print(f"  Threshold {threshold}: {pred_binary.sum():,} predictions of 1 ({pred_binary.mean()*100:.2f}%)")

# RECOMMENDATION based on median
median_pred = np.median(predictions)
print(f"\n{'='*60}")
print("RECOMMENDATION")
print("="*60)
print(f"Median prediction: {median_pred:.4f}")
if median_pred < 0.5:
    suggested_threshold = median_pred
    print(f"Your model is biased toward predicting 0.")
    print(f"Suggested threshold: {suggested_threshold:.4f} (use median to balance predictions)")
    balanced_preds = (predictions > suggested_threshold).astype(int).flatten()
    print(f"  With threshold {suggested_threshold:.4f}: {balanced_preds.sum():,} predictions of 1 ({balanced_preds.mean()*100:.2f}%)")

DIAGNOSTIC: Analyzing Raw Predictions

Raw predictions shape: (5000, 1)
Raw predictions stats:
  Min: 0.412693
  Max: 0.510982
  Mean: 0.463373
  Std: 0.013778
  Median: 0.463333

Prediction distribution:
  < 0.1: 0 (0.00%)
  0.1-0.3: 0
  0.3-0.5: 4,983
  0.5-0.7: 17
  0.7-0.9: 0
  >= 0.9: 0

Detailed distribution (percentiles):
  1th percentile: 0.432778
  5th percentile: 0.440803
  10th percentile: 0.445450
  25th percentile: 0.454085
  50th percentile: 0.463333
  75th percentile: 0.472317
  90th percentile: 0.481673
  95th percentile: 0.486641
  99th percentile: 0.495222

DIAGNOSTIC: Input Data Statistics
X_test shape: (5000, 20, 71)
X_test stats:
  Min: -169.1188
  Max: 1366.0542
  Mean: 0.4124
  Std: 2.1168

NaN values in X_test: 0
Inf values in X_test: 0

Trying Different Thresholds
  Threshold 0.1: 5,000 predictions of 1 (100.00%)
  Threshold 0.2: 5,000 predictions of 1 (100.00%)
  Threshold 0.3: 5,000 predictions of 1 (100.00%)
  Threshold 0.4: 5,000 predictions of 1 (100.00%)


In [14]:
# OPTIONAL: Use a different threshold to get more balanced predictions
# Set USE_CUSTOM_THRESHOLD to False to keep the default 0.5 threshold, or adjust CUSTOM_THRESHOLD as needed

USE_CUSTOM_THRESHOLD = True  # True: use CUSTOM_THRESHOLD; False: use 0.5
CUSTOM_THRESHOLD = np.median(predictions)  # Use median for balanced predictions

if USE_CUSTOM_THRESHOLD:
    print(f"Using custom threshold: {CUSTOM_THRESHOLD:.4f}")
    predictions_binary = (predictions > CUSTOM_THRESHOLD).astype(int).flatten()
else:
    print("Using default threshold: 0.5")
    predictions_binary = (predictions > 0.5).astype(int).flatten()

print(f"\nPrediction distribution with threshold {CUSTOM_THRESHOLD if USE_CUSTOM_THRESHOLD else 0.5:.4f}:")
print(f"  ↑ (Higher/1): {predictions_binary.sum():,} ({predictions_binary.mean()*100:.2f}%)")
print(f"  ↓ (Lower/0): {(1-predictions_binary).sum():,} ({(1-predictions_binary).mean()*100:.2f}%)")

Using custom threshold: 0.4633

Prediction distribution with threshold 0.4633:
  ↑ (Higher/1): 2,500 (50.00%)
  ↓ (Lower/0): 2,500 (50.00%)
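
The 50/50 split above follows directly from thresholding at the median. The sketch below shows this on synthetic scores that loosely imitate the narrow range seen in the diagnostics. It also shows a quantile threshold for some other target positive rate. The 55% target is a made-up value for illustration, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic scores clustered just below 0.5 (illustrative, not the model's output)
scores = rng.normal(loc=0.463, scale=0.014, size=5000).clip(0.0, 1.0)

default_rate = (scores > 0.5).mean()

# Thresholding at the median splits the scores in half by construction
median_rate = (scores > np.median(scores)).mean()

# A quantile threshold targets any other positive rate, here a hypothetical 55%
target_rate = 0.55
quantile_threshold = np.quantile(scores, 1.0 - target_rate)
quantile_rate = (scores > quantile_threshold).mean()

print(f"default={default_rate:.4f} median={median_rate:.4f} quantile={quantile_rate:.4f}")
```

Either threshold only fixes the class balance of the output. It does not make the ranking of tickers more accurate, so a sensible target rate would ideally come from something like the share of up-moves observed in the training labels.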


In [15]:
# Create submission DataFrame - OPTIMIZED VERSION
# We need to match test_metadata to test_submission_df
# The submission format should be: ID, Pred (where Pred is 0 or 1)

print("Creating submission DataFrame (optimized)...")

# Create a DataFrame from predictions metadata - VECTORIZED approach
predictions_df = pd.DataFrame(test_metadata)
predictions_df['Pred'] = predictions_binary

# Since test IDs match tickers directly, we can use a simple merge/map
# Create a mapping from Ticker to prediction
ticker_to_pred = dict(zip(predictions_df['Ticker'], predictions_df['Pred']))

# Map predictions to test submission IDs - VECTORIZED (no loops!)
submission_df = test_submission_df[['ID']].copy()
submission_df['Pred'] = submission_df['ID'].map(ticker_to_pred)

# Count matches
matched_count = submission_df['Pred'].notna().sum()
unmatched_ids = submission_df[submission_df['Pred'].isna()]['ID'].tolist()

# Fill missing predictions with default (0)
if len(unmatched_ids) > 0:
    print(f"⚠ Warning: {len(unmatched_ids)} IDs not matched, filling with default (0)")
    if len(unmatched_ids) <= 10:
        print(f"  Unmatched IDs: {unmatched_ids}")
    submission_df['Pred'] = submission_df['Pred'].fillna(0)

# Convert to int
submission_df['Pred'] = submission_df['Pred'].astype(int)

print(f"\n{'='*60}")
print("Submission File Created")
print(f"{'='*60}")
print(f"Total predictions: {len(submission_df):,}")
print(f"Matched predictions: {matched_count:,}")
print(f"↑ (Higher) predictions: {submission_df['Pred'].sum():,} ({submission_df['Pred'].mean()*100:.2f}%)")
print(f"↓ (Lower) predictions: {(1-submission_df['Pred']).sum():,} ({(1-submission_df['Pred']).mean()*100:.2f}%)")

# Verify format
print(f"\nSubmission format check:")
print(f"  Columns: {submission_df.columns.tolist()}")
print(f"  Shape: {submission_df.shape}")
print(f"\nFirst 10 rows:")
print(submission_df.head(10))
print(f"\nLast 10 rows:")
print(submission_df.tail(10))

# Verify all IDs are present
expected_ids = set(test_submission_df['ID'].unique())
submission_ids = set(submission_df['ID'].unique())
missing_ids = expected_ids - submission_ids

if missing_ids:
    print(f"\n⚠ Warning: {len(missing_ids)} IDs missing from submission")
    print(f"  Sample missing IDs: {list(missing_ids)[:5]}")
    # Add missing IDs with default prediction
    missing_df = pd.DataFrame({'ID': list(missing_ids), 'Pred': 0})
    submission_df = pd.concat([submission_df, missing_df], ignore_index=True)

# Sort by ID to match expected format
submission_df = submission_df.sort_values('ID').reset_index(drop=True)

# Verify predictions are 0 or 1
assert submission_df['Pred'].isin([0, 1]).all(), "All predictions must be 0 or 1"
print(f"\n✓ All predictions are valid (0 or 1)")

Creating submission DataFrame (optimized)...

Submission File Created
Total predictions: 5,000
Matched predictions: 5,000
↑ (Higher) predictions: 2,500 (50.00%)
↓ (Lower) predictions: 2,500 (50.00%)

Submission format check:
  Columns: ['ID', 'Pred']
  Shape: (5000, 2)

First 10 rows:
            ID  Pred
0     ticker_1     0
1    ticker_10     0
2   ticker_100     0
3  ticker_1000     1
4  ticker_1001     1
5  ticker_1002     1
6  ticker_1003     0
7  ticker_1004     1
8  ticker_1005     0
9  ticker_1006     0

Last 10 rows:
              ID  Pred
4990  ticker_990     1
4991  ticker_991     0
4992  ticker_992     1
4993  ticker_993     1
4994  ticker_994     1
4995  ticker_995     0
4996  ticker_996     1
4997  ticker_997     0
4998  ticker_998     1
4999  ticker_999     1

✓ All predictions are valid (0 or 1)


In [17]:
# Save submission file
submission_file = Path("../submission.csv")
submission_df.to_csv(submission_file, index=False)

print(f"{'='*60}")
print("✓ Submission file saved!")
print(f"{'='*60}")
print(f"File: {submission_file.absolute()}")
print(f"Rows: {len(submission_df):,}")
print(f"Columns: {submission_df.columns.tolist()}")
print(f"\nSubmission file is ready for Kaggle upload!")

# Display sample
print(f"\n{'='*60}")
print("Sample Submission (first 20 rows)")
print(f"{'='*60}")
print(submission_df.head(20).to_string(index=False))

✓ Submission file saved!
File: /Users/mahmoud/Documents/University/Fourth_year/NN/Sem-project/stock-predictor/notebooks/../submission.csv
Rows: 5,000
Columns: ['ID', 'Pred']

Submission file is ready for Kaggle upload!

Sample Submission (first 20 rows)
         ID  Pred
   ticker_1     0
  ticker_10     0
 ticker_100     0
ticker_1000     1
ticker_1001     1
ticker_1002     1
ticker_1003     0
ticker_1004     1
ticker_1005     0
ticker_1006     0
ticker_1007     0
ticker_1008     0
ticker_1009     1
 ticker_101     0
ticker_1010     1
ticker_1011     1
ticker_1012     0
ticker_1013     0
ticker_1014     0
ticker_1015     0


## 9. Submission Instructions

1. **Verify the submission file**:
   - Check that `submission.csv` has the correct format (ID, Pred)
   - Ensure all test IDs are included
   - Verify predictions are 0 or 1

2. **Upload to Kaggle**:
   - Go to the competition page
   - Click "Submit Predictions"
   - Upload `submission.csv`
   - Submit and check your score

3. **Note**: If the ID mapping doesn't work correctly, you may need to:
   - Check how test IDs map to tickers in your data
   - Adjust the matching logic in the cell that creates the submission DataFrame
   - Ensure test data has sufficient historical context
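
The verification checks in step 1 can be scripted. The sketch below is one possible version, run on toy CSV content rather than the real `submission.csv`. The function name `validate_submission` is hypothetical. The expected columns `ID` and `Pred` are the ones this notebook writes.

```python
import io
import pandas as pd

def validate_submission(submission: pd.DataFrame, expected_ids) -> list:
    """Return a list of problems; an empty list means the file looks valid."""
    problems = []
    if submission.columns.tolist() != ["ID", "Pred"]:
        problems.append(f"unexpected columns: {submission.columns.tolist()}")
        return problems
    if submission["ID"].duplicated().any():
        problems.append("duplicate IDs")
    missing = set(expected_ids) - set(submission["ID"])
    if missing:
        problems.append(f"{len(missing)} expected IDs missing")
    if not submission["Pred"].isin([0, 1]).all():
        problems.append("Pred contains values other than 0 or 1")
    return problems

# Toy CSV content standing in for submission.csv
good = pd.read_csv(io.StringIO("ID,Pred\nticker_1,0\nticker_2,1\n"))
bad = pd.read_csv(io.StringIO("ID,Pred\nticker_1,2\n"))

print(validate_submission(good, ["ticker_1", "ticker_2"]))
print(validate_submission(bad, ["ticker_1", "ticker_2"]))
```

For the real file, the same function could be called on `pd.read_csv("../submission.csv")` with the IDs from `test.csv`.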

## Summary

✅ Generated predictions using the best trained model  
✅ Created submission file: `submission.csv`  
✅ Ready for Kaggle submission!

**Next Steps:**
- Review submission file format
- Upload to Kaggle competition
- Monitor leaderboard score

**Note:** The notebook maps test IDs to tickers directly, which works because every test ID matches a ticker name in the training data. If your IDs follow a different format, adjust the mapping logic in the cell that creates the submission DataFrame.