# 📘 **00 - Baseline Notebook**

**Notebook Purpose:**
Establish a *minimal, working baseline* model for the Kaggle **Detect Reversal Points in U.S. Equities** competition. This notebook focuses on rapid setup → quick preprocessing → first submission. No feature engineering, no tuning, just a clean, reproducible starting point.

---

**Competition:** *Detect Reversal Points in US Equities*
**Deadline:** December 31, 2025
**Repository:** `Kaggle-Detect-Reversal-Points-in-US-Equities`

---

**Notebook Date Created:** 2025-11-26
**Notebook Last Updated:** 2025-11-26

---

## 🧭 **Goals of This Notebook**

- Load Kaggle training and test data
- Perform *very light* preprocessing appropriate for a baseline
- Train 1–2 simple models (LogReg, LightGBM baseline)
- Generate a valid `submission.csv`
- Store artifacts in `/models/` and `/submissions/`
- Document the baseline performance

---

## 📂 **References**

- Project Plan: `docs/00_overview/reversal_points_project_plan.md`
- Folder Explanations: `docs/01_architecture/02_folder_explanations.md`
- Project Structure: `docs/01_architecture/01_project_structure.md`


In [2]:
import os
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

os.getcwd()


'/home/bnelson_regex/projects/machine_learning_projects/kaggle/detect_reversal_points_us_equities/notebooks'

In [3]:
# Set up an in-memory DuckDB connection

conn = duckdb.connect()

In [4]:
# Load a 15-row sample of the train data with DuckDB

train_part_df = conn.execute("""
    SELECT *
    FROM read_csv_auto(
        '../data/raw/competition_data/train.csv',
        max_line_size=5000000
    )
    LIMIT 15
""").df()

train_part_df.head()

Unnamed: 0,train_id,ticker_id,t,cross_threshold_from_above_100.0,cross_threshold_from_above_100.5,cross_threshold_from_above_101.0,cross_threshold_from_above_101.5,cross_threshold_from_above_102.0,cross_threshold_from_above_102.5,cross_threshold_from_above_103.0,...,zone_102.0,zone_102.5,zone_103.0,zone_97.0,zone_97.5,zone_98.0,zone_98.5,zone_99.0,zone_99.5,class_label
0,0,2,2024-06-10,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
1,1,3,2024-09-18,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,
2,2,6,2023-05-10,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
3,3,3,2024-11-18,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,
4,4,2,2024-08-21,False,False,False,False,False,False,False,...,True,True,False,False,False,False,False,False,False,


## Debugging data
- The shape looks off: the preview (`head()` of the 15-row sample) shows 5 rows and 68,507 columns.
- This is the second DuckDB method that has produced this unusual shape.
- Methods used so far:
  - `duckdb.sql`
  - `conn = duckdb.connect()` followed by `df = conn.execute(...)`
- Next step: load the same file with pandas to see whether the result is the same.
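
A quick way to cross-check the column count independently of both libraries is to read only the header line. The sketch below is a minimal example, assuming the file is a plain comma-separated CSV with a single header row; `count_header_columns` is a hypothetical helper, not part of this project.

```python
import csv


def count_header_columns(path, encoding="utf-8"):
    """Return the number of fields in the first (header) line of a CSV file."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding=encoding) as handle:
        header = next(csv.reader(handle))
    return len(header)


# Example call (path taken from the cells above):
# count_header_columns('../data/raw/competition_data/train.csv')
```

If the header count agrees with what DuckDB reports, that would suggest the wide shape comes from the file itself rather than from the loader.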

In [5]:
# Test with pandas

pandas_train_df = pd.read_csv('../data/raw/competition_data/train.csv', nrows=15)
pandas_train_df.head(10)

Unnamed: 0,train_id,ticker_id,t,cross_threshold_from_above_100.0,cross_threshold_from_above_100.5,cross_threshold_from_above_101.0,cross_threshold_from_above_101.5,cross_threshold_from_above_102.0,cross_threshold_from_above_102.5,cross_threshold_from_above_103.0,...,zone_102.0,zone_102.5,zone_103.0,zone_97.0,zone_97.5,zone_98.0,zone_98.5,zone_99.0,zone_99.5,class_label
0,0,2,2024-06-10,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
1,1,3,2024-09-18,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,
2,2,6,2023-05-10,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
3,3,3,2024-11-18,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,
4,4,2,2024-08-21,False,False,False,False,False,False,False,...,True,True,False,False,False,False,False,False,False,
5,5,1,2024-11-26,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
6,6,4,2023-09-15,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,
7,7,1,2023-05-30,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,
8,8,2,2024-07-26,False,False,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,
9,9,1,2024-01-09,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,False,False,


In [6]:
# Load a 15-row sample of the test data with DuckDB

test_part_df = conn.execute("""
    SELECT *
    FROM read_csv_auto(
        '../data/raw/competition_data/test.csv',
        max_line_size=5000000
    )
    LIMIT 15
""").df()

test_part_df.head()

Unnamed: 0,id,ticker_id,t,cross_threshold_from_above_100.0,cross_threshold_from_above_100.5,cross_threshold_from_above_101.0,cross_threshold_from_above_101.5,cross_threshold_from_above_102.0,cross_threshold_from_above_102.5,cross_threshold_from_above_103.0,...,zone_101.5,zone_102.0,zone_102.5,zone_103.0,zone_97.0,zone_97.5,zone_98.0,zone_98.5,zone_99.0,zone_99.5
0,0,4,2024-05-03,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
1,1,6,2024-11-08,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2,2,6,2024-10-25,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
3,3,4,2023-04-24,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,4,6,2023-06-01,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False


In [7]:
# shape info

print('Train shape:', train_part_df.shape)
print('Test shape:', test_part_df.shape)

Train shape: (15, 68507)
Test shape: (15, 68506)


In [8]:
# train info

train_part_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Columns: 68507 entries, train_id to class_label
dtypes: bool(68499), datetime64[us](1), float64(4), int64(1), object(2)
memory usage: 1004.5+ KB


In [9]:
# test info

test_part_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Columns: 68506 entries, id to zone_99.5
dtypes: bool(68499), datetime64[us](1), float64(4), int64(1), object(1)
memory usage: 1004.4+ KB


## Timeout Error

- The Jupyter notebook timed out because of its size.
- Plan: split the notebook into smaller chunks.
- Plan: also try smaller sample sizes.
- A more detailed explanation of the issue and its resolution is available here:
  - [Wide Dataset Loading Notes](../docs/03_notebooks/02_notes/00_baseline/01_wide_dataset_loading_notes.md)
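
For the smaller-sample idea, one option is to cap both rows and columns at read time. This is a minimal sketch, assuming pandas is acceptable for sampling; `load_sample` and its parameters are hypothetical names, and the default sizes are arbitrary.

```python
import pandas as pd


def load_sample(path, n_rows=15, n_cols=50):
    """Read only the first n_rows rows and the first n_cols columns of a CSV."""
    # usecols accepts integer positions, so the wide header does not need
    # to be spelled out; nrows stops parsing after the requested rows.
    return pd.read_csv(path, nrows=n_rows, usecols=list(range(n_cols)))


# Example call (path taken from the cells above):
# sample_df = load_sample('../data/raw/competition_data/train.csv', n_rows=100, n_cols=20)
```

Here `n_cols` is expected to be no larger than the number of columns in the file.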

In [10]:
# Reload the full train dataset with DuckDB

train_df = conn.execute("""
    SELECT *
    FROM read_csv_auto(
        '../data/raw/competition_data/train.csv',
        max_line_size=10000000
    )
""").fetch_arrow_table().to_pandas()




In [11]:
# Reload the full test dataset with DuckDB

test_df = conn.execute("""
    SELECT *
    FROM read_csv_auto(
        '../data/raw/competition_data/test.csv',
        max_line_size=10000000
    )
""").fetch_arrow_table().to_pandas()



In [12]:
# Confirm shape of datasets

print('Train df shape:', train_df.shape)
print('Test df shape:', test_df.shape)


Train df shape: (1932, 68507)
Test df shape: (828, 68506)


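## Preprocessing Dataset
- Drop the target and the metadata columns (`train_id` / `id`, `ticker_id`, `t`)
- Align the test columns to the train column order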
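
Aligning by selecting the train column names from the test frame raises a `KeyError` if any name is missing, so it can help to compare the two column sets first. A minimal sketch follows; `compare_columns` is a hypothetical helper that accepts any iterables of column names (for example, the `.columns` of two DataFrames), and the names in the toy example are made up.

```python
def compare_columns(train_cols, test_cols):
    """Return (missing_in_test, extra_in_test), each in original order."""
    train_cols = list(train_cols)
    test_cols = list(test_cols)
    train_set, test_set = set(train_cols), set(test_cols)
    # Train features that the test frame does not have.
    missing_in_test = [c for c in train_cols if c not in test_set]
    # Test columns that the train features do not include.
    extra_in_test = [c for c in test_cols if c not in train_set]
    return missing_in_test, extra_in_test


# Toy example with made-up column names:
missing, extra = compare_columns(["f1", "f2", "f3"], ["f3", "f1", "f4"])
print(missing)  # ['f2']
print(extra)    # ['f4']
```

Two empty lists mean the frames share the same feature names and only the order can differ, which the alignment step then fixes.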

In [14]:
# Target
y = train_df['class_label']

# Drop target and metadata from train df
X = train_df.drop(columns = ['class_label', 'ticker_id', 't', 'train_id'])

# Drop metadata from test df
X_test = test_df.drop(columns = ['ticker_id', 't', 'id'])

# Align columns
X_test = X_test[X.columns]


In [15]:
# Sanity check

print('Train df first 10 columns: ', train_df.columns[:10])
print('Test df first 10 columns: ', test_df.columns[:10])

Train df first 10 columns:  Index(['train_id', 'ticker_id', 't', 'cross_threshold_from_above_100.0',
       'cross_threshold_from_above_100.5', 'cross_threshold_from_above_101.0',
       'cross_threshold_from_above_101.5', 'cross_threshold_from_above_102.0',
       'cross_threshold_from_above_102.5', 'cross_threshold_from_above_103.0'],
      dtype='object')
Test df first 10 columns:  Index(['id', 'ticker_id', 't', 'cross_threshold_from_above_100.0',
       'cross_threshold_from_above_100.5', 'cross_threshold_from_above_101.0',
       'cross_threshold_from_above_101.5', 'cross_threshold_from_above_102.0',
       'cross_threshold_from_above_102.5', 'cross_threshold_from_above_103.0'],
      dtype='object')


In [16]:
# Confirm dropped columns

print('Train df first 5 columns excluding the dropped metadata: ', X.columns[:5])
print('Test df first 5 columns excluding the dropped metadata: ', X_test.columns[:5])

Train df first 5 columns excluding the dropped metadata:  Index(['cross_threshold_from_above_100.0', 'cross_threshold_from_above_100.5',
       'cross_threshold_from_above_101.0', 'cross_threshold_from_above_101.5',
       'cross_threshold_from_above_102.0'],
      dtype='object')
Test df first 5 columns excluding the dropped metadata:  Index(['cross_threshold_from_above_100.0', 'cross_threshold_from_above_100.5',
       'cross_threshold_from_above_101.0', 'cross_threshold_from_above_101.5',
       'cross_threshold_from_above_102.0'],
      dtype='object')


## Encode Target
