# *Deep Learning Basics with PyTorch*
# Part I — Foundations of Machine Learning
## Chapter 2: Data, Features, and Representations
In this chapter, we reconstruct the classic "Iris" ML workflow using financial data.
Each step (feature creation, visualization, model fitting, and boundary inspection)
builds intuition for how a machine learning model picks up patterns in market data.

The same pipeline underpins deep learning models, which we will explore in the next chapters.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # plotting

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

%config InlineBackend.figure_format = 'retina'

## Building a Machine Learning Dataset from ADR Market Data
We start by transforming raw price and volume series for a chosen ADR (e.g. GGAL) into daily returns and volume changes, the simplest features capturing market direction and liquidity shifts.
A binary target encodes whether the day closed up (1) or down (0).
This preprocessing step mirrors classic feature extraction in ML, but applied to real market microstructure data.

In [None]:
# --- Load and prepare ADR data
df = pd.read_csv("adr_prices_and_vol.csv", parse_dates=["Date"])
ticker = "GGAL"

# Select columns and drop missing values
df_t = df[["Date", f"{ticker}_Price", f"{ticker}_Volume"]].dropna().copy()

# Create simple features (returns and volume change)
df_t["Return_1d"] = df_t[f"{ticker}_Price"].pct_change()
df_t["VolChange"] = df_t[f"{ticker}_Volume"].pct_change()
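# A zero volume on the previous day makes pct_change return inf; treat it as missing
df_t = df_t.replace([np.inf, -np.inf], np.nan).dropna()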

# Create a simple binary target: 1 = Up day, 0 = Down day
df_t["Target"] = (df_t["Return_1d"] > 0).astype(int)

# --- Define global feature matrix and target vector ---
features = ["Return_1d", "VolChange"]
X = df_t[features].values
y = df_t["Target"].values
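
The target above is computed from the same day's return that also serves as a feature, so the label is a deterministic function of one input. One common alternative is a forward-looking label. The following is a minimal sketch of that idea, not the chapter's method: the toy prices and the names `toy`, `NextReturn`, and `Target_next` are hypothetical, and it assumes rows are already in chronological order.

```python
import pandas as pd

# Hypothetical toy prices and volumes, standing in for the GGAL columns.
toy = pd.DataFrame({
    "Price":  [10.0, 10.5, 10.2, 10.4, 10.1, 10.6],
    "Volume": [100, 120, 90, 110, 130, 95],
})

toy["Return_1d"] = toy["Price"].pct_change()
toy["VolChange"] = toy["Volume"].pct_change()

# Next day's return aligned to today's row; the last row has no "tomorrow".
toy["NextReturn"] = toy["Return_1d"].shift(-1)

# Drop incomplete rows BEFORE labeling, otherwise NaN > 0 would silently become 0.
toy = toy.dropna().copy()

# Label: 1 if the NEXT day closes up, so today's features cannot encode the answer.
toy["Target_next"] = (toy["NextReturn"] > 0).astype(int)

print(toy[["Return_1d", "VolChange", "Target_next"]])
```

With this labeling the first and last rows are lost, and the features on each remaining row describe only information available before the outcome is known.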

## Load and Visualize Financial Features

In [None]:
# --- Visualization (analogous to the Iris scatter)
plt.figure(figsize=(5, 4))
plt.scatter(
    df_t.loc[df_t["Target"] == 0, "Return_1d"],
    df_t.loc[df_t["Target"] == 0, "VolChange"],
    label="Down Day", marker="o", alpha=0.6
)
plt.scatter(
    df_t.loc[df_t["Target"] == 1, "Return_1d"],
    df_t.loc[df_t["Target"] == 1, "VolChange"],
    label="Up Day", marker="^", alpha=0.6
)

plt.xlabel("Daily Return")
plt.ylabel("Volume Change")
plt.legend(frameon=False)
plt.title(f"{ticker} — Feature Space: Return vs Volume")
plt.tight_layout()
plt.show()

- This scatterplot shows how price momentum (returns) and trading activity (volume change) interact.
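- Because `Target` is the sign of `Return_1d`, up-days and down-days split exactly at zero on the return axis; the plot shows the labeling rule rather than a pattern the model has to discover. Any additional structure would have to appear along the volume-change axis.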
- Visualization remains the most intuitive way to check separability before modeling.
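
A scatterplot can be complemented with a quick numeric check of how the classes differ per feature. The sketch below uses synthetic data (the frame `demo` and its distribution parameters are assumptions for illustration, not the ADR data) and the same labeling rule as the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two features; scales are illustrative only.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "Return_1d": rng.normal(0.0, 0.02, n),
    "VolChange": rng.normal(0.0, 0.30, n),
})
demo["Target"] = (demo["Return_1d"] > 0).astype(int)

# Per-class mean and standard deviation of each feature.
summary = demo.groupby("Target")[["Return_1d", "VolChange"]].agg(["mean", "std"])
print(summary)

# Class balance: a heavily skewed split would make raw accuracy misleading.
print(demo["Target"].value_counts(normalize=True))

# Range check on the return axis: the classes do not overlap at all.
up_min = demo.loc[demo["Target"] == 1, "Return_1d"].min()
down_max = demo.loc[demo["Target"] == 0, "Return_1d"].max()
print(f"smallest up-day return: {up_min:.5f}, largest down-day return: {down_max:.5f}")
```

If the per-class means of a feature are close relative to their standard deviations, that feature contributes little to separating the classes on its own.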

## Exploring Alternative Feature Projections

In [None]:
# --- Feature construction (5-day return and volatility)
df_t = df[["Date", f"{ticker}_Price"]].dropna().copy()
df_t["Return_5d"] = df_t[f"{ticker}_Price"].pct_change(5)
df_t["Volatility_5d"] = df_t["Return_5d"].rolling(5).std()
df_t.dropna(inplace=True)

# --- Target: 1 = Up 5d return, 0 = Down 5d return
df_t["Target"] = (df_t["Return_5d"] > 0).astype(int)

# --- Visualization
plt.figure(figsize=(5, 4))
plt.scatter(
    df_t.loc[df_t["Target"] == 0, "Return_5d"],
    df_t.loc[df_t["Target"] == 0, "Volatility_5d"],
    label="5-Day Down Period", marker="o", alpha=0.6
)
plt.scatter(
    df_t.loc[df_t["Target"] == 1, "Return_5d"],
    df_t.loc[df_t["Target"] == 1, "Volatility_5d"],
    label="5-Day Up Period", marker="^", alpha=0.6
)

plt.xlabel("5-Day Return")
plt.ylabel("5-Day Rolling Volatility")
plt.legend(frameon=False)
plt.title(f"{ticker} — Alternative Feature Projection")
plt.tight_layout()
plt.show()

- Feature engineering changes what patterns become visible.
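- Aggregating over five days smooths daily noise; `Volatility_5d`, the rolling standard deviation of the overlapping 5-day returns, adds a second-order feature.
- The plot lets us check whether multi-day behavior offers better separability, a precursor to using richer temporal features or deep learning. As with the daily target, the label is the sign of the x-axis feature, so the informative question is whether volatility differs between the two classes.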
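
Windowed features cost rows at the start of the series, which is easy to overlook when `dropna` runs silently. The sketch below traces this on a synthetic price path (the series and the name `feat` are hypothetical), using the same `pct_change(5)` and `rolling(5).std()` calls as the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic 30-day price path; a stand-in for the ADR price column.
rng = np.random.default_rng(1)
prices = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.01, 30)), name="Price")

feat = pd.DataFrame({"Price": prices})
feat["Return_5d"] = feat["Price"].pct_change(5)              # needs 5 prior prices
feat["Volatility_5d"] = feat["Return_5d"].rolling(5).std()   # needs 5 valid returns

n_before = len(feat)
feat_clean = feat.dropna()
print(f"rows before: {n_before}, after dropna: {len(feat_clean)}")
print(f"first usable row index: {feat_clean.index[0]}")

# Sanity check: the 5-day return is the price ratio over a 5-row gap, minus 1.
manual = prices.iloc[5] / prices.iloc[0] - 1
print(f"Return_5d at row 5: {feat['Return_5d'].iloc[5]:.6f}, manual: {manual:.6f}")
```

Stacking a 5-row lookback on top of a 5-row rolling window removes the first nine rows, a loss that grows with longer horizons.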

## Train a Scaler + Logistic Regression Pipeline

In [None]:
# --- Train/Test split, scaling, and logistic regression ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}")

ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=["Down", "Up"], cmap="Blues"
)
plt.title(f"{ticker} — Logistic Regression (Up vs Down Days)")
plt.tight_layout()
plt.show()

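We apply a minimal ML pipeline. Standardizing the inputs puts returns and volume changes on a common scale, which matters because `LogisticRegression` applies L2 regularization by default and would otherwise penalize the two coefficients unevenly. The logistic regression then estimates the probability of an up-day. Accuracy and the confusion matrix quantify how often the model gets direction right; interpret both rather than celebrating a single metric. Note that `Return_1d` is both an input and the source of `Target`, so a high score here reflects the labeling rule, not forecasting skill.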
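
An accuracy figure is easier to interpret next to a trivial baseline, and for time-ordered data a shuffled split mixes past and future observations. The sketch below shows one way to address both, under stated assumptions: the data is synthetic (`X_demo`, `y_demo` are hypothetical), the split is chronological via `shuffle=False`, and the baseline is scikit-learn's `DummyClassifier`. It is an illustration, not a replacement for the cell above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on different scales, in "chronological" row order.
rng = np.random.default_rng(42)
n = 400
X_demo = np.column_stack([rng.normal(0, 0.02, n), rng.normal(0, 0.30, n)])

# Label depends on the first feature plus noise, so it is only partly predictable.
y_demo = ((X_demo[:, 0] + rng.normal(0, 0.02, n)) > 0).astype(int)

# Chronological split: the last 25% of rows form the test set (no shuffling).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=False
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

acc_model = accuracy_score(y_te, model.predict(X_te))
acc_base = accuracy_score(y_te, baseline.predict(X_te))
print(f"Model accuracy:    {acc_model:.3f}")
print(f"Baseline accuracy: {acc_base:.3f}")
```

The gap between the two numbers, rather than the model's accuracy alone, indicates how much the features contribute.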

## Decision Boundary in 2D

In [None]:
# --- Fit model on entire dataset for visualization ---
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# --- Create meshgrid across feature space ---
xmin, xmax = X[:, 0].min() - 0.02, X[:, 0].max() + 0.02
ymin, ymax = X[:, 1].min() - 0.02, X[:, 1].max() + 0.02
xx, yy = np.meshgrid(
    np.linspace(xmin, xmax, 300),
    np.linspace(ymin, ymax, 300)
)

# --- Predict class across grid ---
zz = pipe.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# --- Plot boundary and data ---
plt.figure(figsize=(5, 4))
plt.contourf(xx, yy, zz, levels=[-0.5, 0.5, 1.5], cmap='coolwarm', alpha=0.2)

plt.scatter(
    X[y == 0, 0], X[y == 0, 1],
    marker='o', label='Down Day', alpha=0.6
)
plt.scatter(
    X[y == 1, 0], X[y == 1, 1],
    marker='^', label='Up Day', alpha=0.6
)

plt.xlabel("Daily Return")
plt.ylabel("Volume Change")
plt.legend(frameon=False)
plt.title(f"{ticker} — Logistic Regression Decision Boundary")
plt.tight_layout()
plt.show()


- The decision boundary divides the 2-D feature space into regions the classifier labels as “Up” or “Down.”
- In finance, such a boundary can be interpreted as a linear trading signal frontier — a simple function of return and liquidity change.
- Inspect whether up-days sit mostly inside the region predicted as “Up”; deviations hint at noise or regime shifts.
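
The boundary drawn above is a straight line, and its equation can be read off the fitted pipeline. A minimal sketch, assuming synthetic data (`X_demo`, `y_demo`, and `pipe_demo` are hypothetical names) and the default step names that `make_pipeline` assigns: the scaler's mean and scale are folded into the logistic regression weights so the line is expressed in raw feature units.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for (Return_1d, VolChange).
rng = np.random.default_rng(7)
n = 300
X_demo = np.column_stack([rng.normal(0, 0.02, n), rng.normal(0, 0.30, n)])
y_demo = ((X_demo[:, 0] + rng.normal(0, 0.02, n)) > 0).astype(int)

pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_demo.fit(X_demo, y_demo)

scaler = pipe_demo.named_steps["standardscaler"]
clf = pipe_demo.named_steps["logisticregression"]

# In scaled space the score is w . ((x - mean) / scale) + b.
# Rearranged in raw units: a . x + c, with a = w / scale, c = b - sum(w * mean / scale).
a = clf.coef_[0] / scaler.scale_
c = clf.intercept_[0] - np.sum(clf.coef_[0] * scaler.mean_ / scaler.scale_)

# The raw-unit score should match the pipeline's own decision function.
raw_score = X_demo @ a + c
print("max abs difference:", np.abs(raw_score - pipe_demo.decision_function(X_demo)).max())

# Boundary: a[0] * return + a[1] * vol_change + c = 0
print(f"boundary: {a[0]:.3f} * Return + {a[1]:.3f} * VolChange + {c:.3f} = 0")
```

Points with a positive score fall in the “Up” region; the ratio of the two coefficients in `a` gives the slope of the frontier in the plotted units.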