# 📘 **00 - Baseline Notebook**

**Notebook Purpose:**
Establish a *minimal, working baseline* model for the Kaggle **Detect Reversal Points in U.S. Equities** competition. This notebook focuses on rapid setup → quick preprocessing → first submission. No feature engineering, no tuning, just a clean, reproducible starting point.

---

**Competition:** *Detect Reversal Points in US Equities*
**Deadline:** December 31, 2025
**Repository:** `Kaggle-Detect-Reversal-Points-in-US-Equities`

---

**Notebook Date Created:** 2025-11-26
**Notebook Last Updated:** 2025-11-26

---

## 🧭 **Goals of This Notebook**

- Load Kaggle training and test data
- Perform *very light* preprocessing appropriate for a baseline
- Train 1–2 simple models (LogReg, LightGBM baseline)
- Generate a valid `submission.csv`
- Store artifacts in `/models/` and `/submissions/`
- Document the baseline performance

---

## 📂 **References**

- Project Plan: `docs/00_overview/reversal_points_project_plan.md`
- Folder Explanations: `docs/01_architecture/02_folder_explanations.md`
- Project Structure: `docs/01_architecture/01_project_structure.md`


In [1]:
import os
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import TruncatedSVD
import lightgbm as lgb


os.getcwd()



'/home/bnelson_regex/projects/machine_learning_projects/kaggle/detect_reversal_points_us_equities/notebooks'

In [2]:
# Set up the DuckDB connection

conn = duckdb.connect()

In [3]:
# Load a partial train dataframe (first 15 rows) with DuckDB

train_part_df = conn.execute("""
    SELECT *
    FROM read_csv_auto(
        '../data/raw/competition_data/train.csv',
        max_line_size=5000000
    )
    LIMIT 15
""").df()

train_part_df.head()


Unnamed: 0,train_id,ticker_id,t,cross_threshold_from_above_100.0,cross_threshold_from_above_100.5,cross_threshold_from_above_101.0,cross_threshold_from_above_101.5,cross_threshold_from_above_102.0,cross_threshold_from_above_102.5,cross_threshold_from_above_103.0,...,zone_102.0,zone_102.5,zone_103.0,zone_97.0,zone_97.5,zone_98.0,zone_98.5,zone_99.0,zone_99.5,class_label
0,0,2,2024-06-10,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
1,1,3,2024-09-18,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,
2,2,6,2023-05-10,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
3,3,3,2024-11-18,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,
4,4,2,2024-08-21,False,False,False,False,False,False,False,...,True,True,False,False,False,False,False,False,False,


## Debugging data
- The shape looks off: the `head()` preview shows 5 rows, while the frame itself holds 15 rows (from `LIMIT 15`) and 68,507 columns.
- This is the second DuckDB method that has produced this unusually wide shape.
- Methods used so far:
  - `duckdb.sql(...)`
  - `conn = duckdb.connect()` followed by `df = conn.execute(...)`
- Next step: load the same file with pandas to check whether the result matches.
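
One way to confirm that the width comes from the file itself, not from the loader, is to count the header fields directly. The sketch below is a minimal illustration on a synthetic file. `count_csv_columns` is a hypothetical helper that is not part of this notebook, and the synthetic column names only imitate the real ones.

```python
import csv
import os
import tempfile


def count_csv_columns(path):
    """Return the number of fields in the header row of a CSV file."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle))
    return len(header)


# Synthetic stand-in for a wide file: 3 metadata columns, 1,000 flag columns, 1 target.
columns = ["train_id", "ticker_id", "t"] + [f"zone_{i}" for i in range(1000)] + ["class_label"]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "wide_demo.csv")
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerow([0, 2, "2024-06-10"] + [False] * 1000 + ["HL"])
    n_columns = count_csv_columns(csv_path)

print(n_columns)  # 1004
```

If the header count of the real `train.csv` matches the column count reported by both DuckDB and pandas, the wide shape is a property of the data and not a parsing problem.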

In [4]:
# Test with pandas

pandas_train_df = pd.read_csv('../data/raw/competition_data/train.csv', nrows=15)
pandas_train_df.head(10)

Unnamed: 0,train_id,ticker_id,t,cross_threshold_from_above_100.0,cross_threshold_from_above_100.5,cross_threshold_from_above_101.0,cross_threshold_from_above_101.5,cross_threshold_from_above_102.0,cross_threshold_from_above_102.5,cross_threshold_from_above_103.0,...,zone_102.0,zone_102.5,zone_103.0,zone_97.0,zone_97.5,zone_98.0,zone_98.5,zone_99.0,zone_99.5,class_label
0,0,2,2024-06-10,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
1,1,3,2024-09-18,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,
2,2,6,2023-05-10,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
3,3,3,2024-11-18,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,
4,4,2,2024-08-21,False,False,False,False,False,False,False,...,True,True,False,False,False,False,False,False,False,
5,5,1,2024-11-26,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
6,6,4,2023-09-15,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
7,7,1,2023-05-30,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,
8,8,2,2024-07-26,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,
9,9,1,2024-01-09,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,False,False,


In [5]:
# Load a partial test dataframe (first 15 rows) with DuckDB

test_part_df = conn.execute("""
    SELECT *
    FROM read_csv_auto(
        '../data/raw/competition_data/test.csv',
        max_line_size=5000000
    )
    LIMIT 15
""").df()

test_part_df.head()

Unnamed: 0,id,ticker_id,t,cross_threshold_from_above_100.0,cross_threshold_from_above_100.5,cross_threshold_from_above_101.0,cross_threshold_from_above_101.5,cross_threshold_from_above_102.0,cross_threshold_from_above_102.5,cross_threshold_from_above_103.0,...,zone_101.5,zone_102.0,zone_102.5,zone_103.0,zone_97.0,zone_97.5,zone_98.0,zone_98.5,zone_99.0,zone_99.5
0,0,4,2024-05-03,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1,1,6,2024-11-08,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2,2,6,2024-10-25,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
3,3,4,2023-04-24,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,4,6,2023-06-01,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False


In [6]:
# shape info

print('Train shape:', train_part_df.shape)
print('Test shape:', test_part_df.shape)

Train shape: (15, 68507)
Test shape: (15, 68506)


In [7]:
# train info

train_part_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Columns: 68507 entries, train_id to class_label
dtypes: bool(68499), datetime64[us](1), float64(4), int64(1), object(2)
memory usage: 1004.5+ KB


In [8]:
# test info

test_part_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Columns: 68506 entries, id to zone_99.5
dtypes: bool(68499), datetime64[us](1), float64(4), int64(1), object(1)
memory usage: 1004.4+ KB


## Timeout Error

- The Jupyter notebook timed out because of its size.
- Planned fix: split the notebook into smaller chunks.
- Planned fix: work with smaller sample sizes.
- A more detailed explanation of the issue and its resolution is in:
  - [Wide Dataset Loading Notes](../docs/03_notebooks/02_notes/00_baseline/01_wide_dataset_loading_notes.md)
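
A smaller sample can be taken along both axes: fewer rows and fewer columns. The sketch below shows one possible way to do that with `pandas.read_csv`, using its `nrows` and `usecols` parameters on a synthetic file. The file name, the column names, and the choice of 50 columns are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic stand-in for the wide competition file (real names and counts
# are not reproduced here).
rng = np.random.default_rng(0)
n_rows, n_flags = 40, 300
wide = pd.DataFrame(
    rng.random((n_rows, n_flags)) > 0.5,
    columns=[f"zone_{i}" for i in range(n_flags)],
)
wide.insert(0, "train_id", range(n_rows))

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "wide_sample.csv")
    wide.to_csv(csv_path, index=False)

    # Keep the id plus the first 50 flag columns, and only the first 15 rows.
    keep = {"train_id"} | {f"zone_{i}" for i in range(50)}
    sample = pd.read_csv(csv_path, nrows=15, usecols=lambda name: name in keep)

print(sample.shape)  # (15, 51)
```

Previewing only a slice like this keeps cell outputs small, which may help with notebook size even when the full frame is still loaded elsewhere.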

In [9]:
# Reload the full train dataset with DuckDB

train_df = conn.execute("""
    SELECT *
    FROM read_csv_auto(
        '../data/raw/competition_data/train.csv',
        max_line_size=10000000
    )
""").fetch_arrow_table().to_pandas()




In [10]:
# Drop rows where the target `class_label` is missing (None/NaN)
train_df = train_df.dropna(subset=['class_label'])

In [11]:
# Reload the full test dataset with DuckDB

test_df = conn.execute("""
    SELECT *
    FROM read_csv_auto(
        '../data/raw/competition_data/test.csv',
        max_line_size=10000000
    )
""").fetch_arrow_table().to_pandas()



In [12]:
# Confirm shape of datasets

print('Train df shape:', train_df.shape)
print('Test df shape:', test_df.shape)


Train df shape: (112, 68507)
Test df shape: (828, 68506)


## Preprocessing Dataset
- Drop metadata
- Align columns

In [13]:
# Target
y = train_df['class_label']

# Drop target and metadata from train df
X = train_df.drop(columns = ['class_label', 'ticker_id', 't', 'train_id'])

# Drop metadata from test df
X_test = test_df.drop(columns = ['ticker_id', 't', 'id'])

# Align columns
X_test = X_test[X.columns]
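
The alignment above assumes every training feature also exists in the test frame. A minimal sketch of a more defensive version follows, on toy frames. `align_features` is a hypothetical helper, not something defined in this project.

```python
import pandas as pd


def align_features(train_features, test_features):
    """Reorder test columns to match train, failing loudly on any mismatch."""
    test_columns = set(test_features.columns)
    missing = [name for name in train_features.columns if name not in test_columns]
    if missing:
        raise ValueError(
            f"Test frame is missing {len(missing)} training columns, e.g. {missing[:3]}"
        )
    # Selecting by the training column list also drops any extra test columns.
    return test_features[train_features.columns]


X_toy = pd.DataFrame({"zone_97.0": [True, False], "zone_97.5": [False, False]})
X_test_toy = pd.DataFrame({"zone_97.5": [True], "zone_97.0": [False], "extra": [1]})

aligned = align_features(X_toy, X_test_toy)
print(list(aligned.columns))  # ['zone_97.0', 'zone_97.5']
```

A plain `X_test[X.columns]` already raises a `KeyError` on missing columns; the explicit check only makes the error message easier to act on.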


In [14]:
# Sanity check

print('Train df first 10 columns: ', train_df.columns[:10])
print('Test df first 10 columns: ', test_df.columns[:10])

Train df first 10 columns:  Index(['train_id', 'ticker_id', 't', 'cross_threshold_from_above_100.0',
       'cross_threshold_from_above_100.5', 'cross_threshold_from_above_101.0',
       'cross_threshold_from_above_101.5', 'cross_threshold_from_above_102.0',
       'cross_threshold_from_above_102.5', 'cross_threshold_from_above_103.0'],
      dtype='object')
Test df first 10 columns:  Index(['id', 'ticker_id', 't', 'cross_threshold_from_above_100.0',
       'cross_threshold_from_above_100.5', 'cross_threshold_from_above_101.0',
       'cross_threshold_from_above_101.5', 'cross_threshold_from_above_102.0',
       'cross_threshold_from_above_102.5', 'cross_threshold_from_above_103.0'],
      dtype='object')


In [15]:
# Confirm dropped columns

print('Train df first 5 columns excluding the dropped metadata: ', X.columns[:5])
print('Test df first 5 columns excluding the dropped metadata: ', X_test.columns[:5])

Train df first 5 columns excluding the dropped metadata:  Index(['cross_threshold_from_above_100.0', 'cross_threshold_from_above_100.5',
       'cross_threshold_from_above_101.0', 'cross_threshold_from_above_101.5',
       'cross_threshold_from_above_102.0'],
      dtype='object')
Test df first 5 columns excluding the dropped metadata:  Index(['cross_threshold_from_above_100.0', 'cross_threshold_from_above_100.5',
       'cross_threshold_from_above_101.0', 'cross_threshold_from_above_101.5',
       'cross_threshold_from_above_102.0'],
      dtype='object')


## Encode Target
- Use `LabelEncoder`
- This is a multiclass classification target
- One-hot encoding is not used for the target
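
`LabelEncoder` assigns integer codes in sorted label order, not in order of first appearance. The short sketch below illustrates this on a toy list using the four labels seen in `class_label`; the variable names are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["HL", "HH", "LH", "LL", "HL"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so each code is the label's position in sorted order.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'HH': 0, 'HL': 1, 'LH': 2, 'LL': 3}
print(encoded.tolist())  # [1, 0, 2, 3, 1]

# inverse_transform recovers the original string labels.
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())  # ['HL', 'HH', 'LH', 'LL', 'HL']
```

This matters when reading predictions: code `0` means `HH`, even though `HL` appears first in the data.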

In [16]:
# Label encoding

le = LabelEncoder()
y_enc = le.fit_transform(y)

In [17]:
"""
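Validate label encoding.
Expected output: the unique labels in order of appearance, then the sorted unique codes.
>>> ['HL', 'HH', 'LH', 'LL']
>>> [0, 1, 2, 3]
Note: LabelEncoder assigns codes in sorted label order (HH=0, HL=1, LH=2, LL=3),
so the two printed lists are not an element-wise mapping.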
"""

print(y.unique())
print(np.unique(y_enc))


['HL' 'HH' 'LH' 'LL']
[0 1 2 3]


## Dimensionality Reduction
- Use truncated SVD
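- Reduce dimensionality to 112 components
  - 68,503 features (68,507 columns minus the 4 dropped) -> 112 compressed components
  - Only 112 labeled training rows remain, and a matrix with 112 rows has rank at most 112, so more components would add no information
  - Retained variance is not measured in this notebook (the ~90-95% figure is an unverified estimate); `svd.explained_variance_ratio_.sum()` reports it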
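
The reduction step can be sketched on synthetic data to show how the retained variance might be checked. This is only an illustration under assumed sizes: the matrix shape, sparsity, and `n_components=32` are made up for the demo and are not the notebook's settings.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
# Synthetic sparse flag matrix: few rows, many columns, like the training features.
X_demo = (rng.random((112, 2000)) > 0.9).astype(float)

n_components = 32  # assumption for the demo
svd_demo = TruncatedSVD(n_components=n_components, random_state=42)
X_demo_reduced = svd_demo.fit_transform(X_demo)

# Fraction of the total column variance captured by the kept components.
retained = float(svd_demo.explained_variance_ratio_.sum())
print(X_demo_reduced.shape)  # (112, 32)
print(f"retained variance ratio: {retained:.3f}")
```

Running the same check on the fitted `svd` object from this notebook would replace the variance estimate with a measured value.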

In [18]:
svd = TruncatedSVD(n_components=112, random_state=42)
X_reduced = svd.fit_transform(X)
X_test_reduced = svd.transform(X_test)

In [19]:
# Wrap in DataFrame to silence LGBM warning
X_reduced = pd.DataFrame(X_reduced)
X_test_reduced = pd.DataFrame(X_test_reduced)


## Train LightGBM on the CPU

In [20]:
model = lgb.LGBMClassifier(
    objective='multiclass',
    num_class=4,
    n_estimators=600,
    learning_rate=0.05,
    num_leaves=64,
    device='cpu'
)

model.fit(X_reduced, y_enc)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001626 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4368
[LightGBM] [Info] Number of data points in the train set: 112, number of used features: 112
[LightGBM] [Info] Start training from score -1.029619
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.828127
[LightGBM] [Info] Start training from score -1.460402


| Parameter | Value |
|---|---|
| boosting_type | 'gbdt' |
| num_leaves | 64 |
| max_depth | -1 |
| learning_rate | 0.05 |
| n_estimators | 600 |
| subsample_for_bin | 200000 |
| objective | 'multiclass' |
| class_weight | None |
| min_split_gain | 0.0 |
| min_child_weight | 0.001 |
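
The model above is fit on all labeled rows, so no validation score is produced. One hedged way to document baseline performance is stratified cross-validation. The sketch below uses synthetic data and a logistic regression (the other baseline named in the goals). Accuracy is only a placeholder, since the competition metric is not stated in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 112 rows, 4 classes, already low-dimensional.
X_demo, y_demo = make_classification(
    n_samples=112,
    n_features=32,
    n_informative=8,
    n_classes=4,
    random_state=42,
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the real data, the SVD would need to be fit inside each fold (for example through a scikit-learn `Pipeline`) so that validation rows do not influence the projection.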


## Predict and Inverse Transform

In [21]:
preds = model.predict(X_test_reduced)
print(np.unique(preds))
pred_labels = le.inverse_transform(preds)

[0 1 2 3]


## Build Baseline Submission

In [22]:
submission = pd.DataFrame({
    'id': test_df['id'],
    'class_label': pred_labels
})

submission.to_csv('../submissions/baseline_submission.csv', index=False)


## ✅ Quick Pre-Submission Checklist

In [23]:
# Clean target

train_df['class_label'].unique()


array(['HL', 'HH', 'LH', 'LL'], dtype=object)

In [24]:
# Encoded labels correct
np.unique(y_enc)

array([0, 1, 2, 3])

In [25]:
# No NaNs in features
X.isna().sum().sum(), X_test.isna().sum().sum()


(np.int64(0), np.int64(0))

In [27]:
# SVD outputs 112 components
print('X reduced shape: ', X_reduced.shape)
print('X test reduced shape: ', X_test_reduced.shape)

X reduced shape:  (112, 112)
X test reduced shape:  (828, 112)


In [28]:
# Stored predictions are real class indices
print(np.unique(preds))

[0 1 2 3]


In [29]:
# Submission file correct
submission.head()

|   | id | class_label |
|---|---|---|
| 0 | 0 | HL |
| 1 | 1 | HL |
| 2 | 2 | LH |
| 3 | 3 | HH |
| 4 | 4 | HL |
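
The checklist items above could also be bundled into a single function that is run before every upload. This is a minimal sketch on a toy frame. `validate_submission` is a hypothetical helper, and the required columns (`id`, `class_label`) are assumed from the submission cell in this notebook, not taken from the competition rules.

```python
import pandas as pd

ALLOWED_LABELS = {"HH", "HL", "LH", "LL"}  # the four classes seen in `class_label`


def validate_submission(submission, expected_ids):
    """Raise AssertionError if the frame does not look like a valid submission."""
    expected_ids = list(expected_ids)
    assert list(submission.columns) == ["id", "class_label"], "unexpected columns"
    assert len(submission) == len(expected_ids), "row count differs from test set"
    assert submission["id"].is_unique, "duplicate ids"
    assert set(submission["id"]) == set(expected_ids), "ids differ from test set"
    assert submission["class_label"].notna().all(), "missing predictions"
    assert set(submission["class_label"]) <= ALLOWED_LABELS, "unknown label"
    return True


toy_submission = pd.DataFrame(
    {"id": [0, 1, 2, 3, 4], "class_label": ["HL", "HL", "LH", "HH", "HL"]}
)
is_valid = validate_submission(toy_submission, range(5))
print(is_valid)  # True
```

In this notebook the call would look like `validate_submission(submission, test_df['id'])`, assuming the exact file format has been confirmed against the competition page.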
