# Short-Term Price Move Classifier
A compact, reproducible walkthrough for predicting next-day SPY direction using purely backward-looking technical indicators and a trio of classic ML models.



## Project Roadmap
- Load and visualize adjusted OHLCV data from Yahoo Finance
- Engineer strictly causal technical factors and build the binary target
- Split train/test chronologically (no shuffling) and compare against a majority-class baseline
- Fit Logistic Regression, Random Forest, and Gradient Boosting models
- Inspect ROC curves, coefficients, and feature importances before wrapping up with key takeaways
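
To make "strictly causal", "binary target", and "time ordering" concrete, here is a minimal sketch on synthetic prices. It is not the project's `build_feature_dataset`, `time_based_train_test_split`, or `compute_baseline_predictions`; the function names, feature set, window lengths, and the 0.8 split fraction below are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names; the project's real factor list may differ.
FEATURE_COLS = ["ret_1d", "ret_5d", "vol_10d", "sma_ratio_20d"]


def build_causal_features(close: pd.Series) -> pd.DataFrame:
    """Illustrative only: every feature at day t uses data up to and including t."""
    ret_1d = close.pct_change()
    out = pd.DataFrame(index=close.index)
    out["ret_1d"] = ret_1d
    out["ret_5d"] = close.pct_change(5)
    out["vol_10d"] = ret_1d.rolling(10).std()
    out["sma_ratio_20d"] = close / close.rolling(20).mean() - 1.0

    # The label is the only place that looks forward: 1 if tomorrow closes higher.
    next_close = close.shift(-1)
    out["target_up"] = (next_close > close).astype(int)

    # Drop the final row (no next-day close) and the indicator warm-up rows.
    return out[next_close.notna()].dropna()


def chronological_split(df: pd.DataFrame, train_fraction: float):
    """Earlier rows train, later rows test; nothing is shuffled."""
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]


# Synthetic random-walk prices stand in for the downloaded SPY series.
rng = np.random.default_rng(0)
close = pd.Series(
    100.0 * np.exp(np.cumsum(rng.normal(0.0003, 0.01, 300))),
    index=pd.bdate_range("2020-01-01", periods=300),
)

dataset = build_causal_features(close)
train, test = chronological_split(dataset, train_fraction=0.8)

# Majority baseline: always predict the most common class seen in training.
majority_class = int(train["target_up"].mode().iloc[0])
baseline_acc = float((test["target_up"] == majority_class).mean())
print(f"Train: {len(train)}, Test: {len(test)}, baseline accuracy: {baseline_acc:.3f}")
```

One way to sanity-check causality is to recompute the features on a truncated price history and confirm that values at earlier dates do not change; any feature that shifts when future rows are removed is leaking information.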



In [None]:
import pandas as pd  # type: ignore
import matplotlib.pyplot as plt  # type: ignore

plt.style.use("seaborn-v0_8")

from short_term_price_classifier.config import (
    DATA_START,
    DEFAULT_TICKER,
    RANDOM_STATE,
    TARGET_COLUMN,
    TRAIN_FRACTION,
)
from short_term_price_classifier.data_loader import load_ohlcv
from short_term_price_classifier.features import build_feature_dataset
from short_term_price_classifier.modeling import (
    compute_baseline_predictions,
    time_based_train_test_split,
    train_gradient_boosting,
    train_logistic_regression,
    train_random_forest,
)
from short_term_price_classifier.evaluation import (
    evaluate_classifier,
    plot_feature_importances,
    plot_logistic_coefficients,
    plot_roc_curve,
    print_evaluation_report,
)



In [None]:
raw_df = load_ohlcv(DEFAULT_TICKER, DATA_START)
print(f"Rows: {len(raw_df):,} from {raw_df.index.min().date()} to {raw_df.index.max().date()} for {DEFAULT_TICKER}")
raw_df.head()



In [None]:
ax = raw_df["Close"].plot(title=f"{DEFAULT_TICKER} Adjusted Close", figsize=(10, 4))
ax.set_ylabel("Price ($)")
plt.tight_layout()
plt.show()



In [None]:
feature_df, feature_cols = build_feature_dataset(raw_df)
print(f"Feature rows after dropping NaNs: {len(feature_df):,}")
feature_df[feature_cols + [TARGET_COLUMN]].head()



In [None]:
target_counts = feature_df[TARGET_COLUMN].value_counts().sort_index()
print("Class counts:\n", target_counts)
print("Class share:\n", (target_counts / target_counts.sum()).round(3))
(target_counts / target_counts.sum()).plot(kind="bar", title="Target Distribution")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()



In [None]:
X_train, X_test, y_train, y_test = time_based_train_test_split(
    df=feature_df,
    feature_cols=feature_cols,
    target_col=TARGET_COLUMN,
    train_fraction=TRAIN_FRACTION,
)
print(f"Train samples: {len(X_train):,}, Test samples: {len(X_test):,}")

metrics_summary = {}

y_pred_baseline = compute_baseline_predictions(y_train, y_test)
baseline_metrics = evaluate_classifier(y_test, y_pred_baseline)
print_evaluation_report("Baseline (Majority Class)", baseline_metrics)
metrics_summary["Baseline"] = baseline_metrics
baseline_metrics



In [None]:
log_reg, scaler = train_logistic_regression(X_train, y_train, RANDOM_STATE)
X_test_scaled = scaler.transform(X_test)
log_pred = log_reg.predict(X_test_scaled)
log_proba = log_reg.predict_proba(X_test_scaled)[:, 1]
log_metrics = evaluate_classifier(y_test, log_pred, log_proba)
print_evaluation_report("Logistic Regression", log_metrics)
metrics_summary["Logistic Regression"] = log_metrics
log_metrics



In [None]:
plot_roc_curve(y_test, log_proba, "Logistic Regression ROC")
plot_logistic_coefficients(feature_cols, log_reg.coef_[0], "Logistic Regression Coefficients")



In [None]:
rf_model = train_random_forest(X_train, y_train, RANDOM_STATE)
rf_pred = rf_model.predict(X_test)
rf_proba = rf_model.predict_proba(X_test)[:, 1]
rf_metrics = evaluate_classifier(y_test, rf_pred, rf_proba)
print_evaluation_report("Random Forest", rf_metrics)
metrics_summary["Random Forest"] = rf_metrics
rf_metrics



In [None]:
plot_roc_curve(y_test, rf_proba, "Random Forest ROC")
plot_feature_importances(feature_cols, rf_model.feature_importances_, "Random Forest Feature Importance")



In [None]:
gb_model = train_gradient_boosting(X_train, y_train, RANDOM_STATE)
gb_pred = gb_model.predict(X_test)
gb_proba = gb_model.predict_proba(X_test)[:, 1]
gb_metrics = evaluate_classifier(y_test, gb_pred, gb_proba)
print_evaluation_report("Gradient Boosting", gb_metrics)
metrics_summary["Gradient Boosting"] = gb_metrics
gb_metrics



In [None]:
plot_roc_curve(y_test, gb_proba, "Gradient Boosting ROC")
plot_feature_importances(feature_cols, gb_model.feature_importances_, "Gradient Boosting Feature Importance")



In [None]:
summary_df = pd.DataFrame(metrics_summary).T
summary_df



## Takeaways
- All models modestly outperform the majority baseline, though the edge remains slim (ROC AUC often hovers just above 0.5, the chance level).
- Tree ensembles highlight the importance of short-term momentum, volatility, and RSI; logistic regression yields interpretable coefficients for the same signals.
- Obvious extensions: incorporate regime detection, multi-asset universes, alternative holding periods, or richer risk-management overlays.
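
As a starting point for the "alternative holding periods" extension, the sketch below generalizes the next-day label to an h-day forward direction. The helper `forward_direction_target` and its `horizon` parameter are hypothetical and not part of the `short_term_price_classifier` package.

```python
import pandas as pd


def forward_direction_target(close: pd.Series, horizon: int = 1) -> pd.Series:
    """Hypothetical helper: 1 if the close `horizon` days ahead is above today's close."""
    if horizon < 1:
        raise ValueError("horizon must be a positive number of trading days")
    # shift(-horizon) aligns the future close with today's row; the trailing
    # rows that have no future price become NaN and are dropped.
    forward_return = (close.shift(-horizon) / close - 1.0).dropna()
    return (forward_return > 0).astype(int).rename(f"up_{horizon}d")


prices = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0, 101.0])
next_day = forward_direction_target(prices, horizon=1)
two_day = forward_direction_target(prices, horizon=2)
print(next_day.tolist())  # [1, 0, 1, 1, 0]
print(two_day.tolist())   # [0, 1, 1, 0]
```

With a horizon above one day, consecutive labels are built from overlapping windows and are therefore correlated, so it is probably worth leaving a gap of at least `horizon` rows between the training and test periods.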

