# 🤖 AI Stock Prediction: A Beginner's Walkthrough

Welcome! This notebook will guide you through the **entire AI process** used in your project. We will take raw stock prices and turn them into a machine learning model that can predict future price movements (specifically for high-volatility stocks like SMCI, CRSP, PLTR).

We will verify every step to ensure **no future data leakage** (cheating) occurs.

### 📚 What we will cover:
1. **Setup**: Loading your project code.
2. **Data Ingestion**: Getting the raw stock prices.
3. **Feature Engineering**: Creating "Features" (Technical Indicators) that the AI learns from.
4. **Data Preparation**: Creating "Rolling Windows" to organize data for the AI.
5. **Model Training**: Training an **XGBoost** model (a gradient-boosted decision tree algorithm that works well on tabular data).
6. **Evaluation**: Checking if the model actually works (Precision, Backtesting).

---
## 1. Setup & Imports

First, we need to import standard Python libraries (like pandas for data tables) and link this notebook to your existing code in the `src/` folder.

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from dotenv import load_dotenv

# Load env vars
load_dotenv()

# Add the 'src' directory to the system path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.append(src_path)

print(f"✅ Setup complete. Linked to code in: {src_path}")

# Import your specific project modules
from data_ingestion import download_price_data
from features import add_technical_indicators, build_rolling_window_dataset
from model import ModelConfig, train_xgboost_time_series


## 2. Ingesting Data

AI needs data to learn. We will download the stock history for our **High-Volatility Regime** tickers:
- **SMCI** (Super Micro Computer)
- **CRSP** (CRISPR Therapeutics)
- **PLTR** (Palantir)
- We also include stable stocks (AAPL etc.) if configured, but we focus on High Volatility now.

This function `download_price_data()` automatically fetches the data from Yahoo Finance.

In [None]:
print("⏳ Downloading (or loading) price data...")
df_prices = download_price_data()

# Display the first few rows to see what raw data looks like
print(f"Loaded {len(df_prices)} rows of data.")
df_prices.head()

## 3. Feature Engineering (The "Ingredients")

Raw prices (Open, High, Low, Close) are not enough for an AI. It needs **context**.
We calculate **Technical Indicators** which act as features (variables the AI looks at).

### What we calculate:
*   **RSI (Relative Strength Index):** Is the stock overbought (expensive) or oversold (cheap)?
*   **MACD:** Is the trend going up or down?
*   **Bollinger Bands:** Is the volatility high or low?
*   **ATR (Average True Range):** How much does the price move on average? (Crucial for our regime!)

We use the function `add_technical_indicators(df)`.

In [None]:
print("🛠️ Calculating technical indicators...")
df_features = add_technical_indicators(df_prices)

# Let's inspect the new columns we created
new_columns = [c for c in df_features.columns if c not in df_prices.columns]
print(f"Created {len(new_columns)} new features: {new_columns}")

# Show SMCI's RSI and ATR
df_features[df_features['ticker'] == 'SMCI'][['date', 'adj_close', 'rsi_14', 'atr_14']].tail()
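
For intuition, here is a minimal sketch of how two of these indicators could be computed with pandas for a single ticker. It uses plain rolling averages. The project's `add_technical_indicators` may use different smoothing (for example Wilder's) and different column names, so treat `rsi_sketch` and `atr_sketch` as hypothetical helpers, not the project's implementation.

```python
import numpy as np
import pandas as pd

def rsi_sketch(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI from simple rolling means of gains and losses (a simplification)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()     # average up-move
    loss = (-delta).clip(lower=0).rolling(period).mean()  # average down-move
    rs = gain / loss                                      # inf when there are no losses
    return 100 - 100 / (1 + rs)

def atr_sketch(high: pd.Series, low: pd.Series, close: pd.Series,
               period: int = 14) -> pd.Series:
    """Average True Range as a simple rolling mean of the true range."""
    prev_close = close.shift(1)
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()],
        axis=1,
    ).max(axis=1)
    return true_range.rolling(period).mean()

# Toy data (made up): a price that rises by exactly 1.0 every day
toy_close = pd.Series(np.arange(1.0, 31.0))
toy_rsi = rsi_sketch(toy_close)
toy_atr = atr_sketch(toy_close + 1, toy_close - 1, toy_close)
print(toy_rsi.tail(3))
print(toy_atr.tail(3))
```

Because the toy price never falls, the RSI saturates at 100, and the ATR equals the constant daily high-low range of 2.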

## 4. Preparing Data for the AI (Feature Windows)

### 🧠 Concept: The "Rolling Window"
This is the most critical part to understand for accurate prediction.

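Stock data is a time series. To predict what happens over the **next week** (the 5 trading days after Day T), the AI needs to know what happened over the **past 14 trading days** (the lookback window ending at Day T). Note that 14 trading days is closer to three calendar weeks than two.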

We structure the data into **Windows**:
*   **Input (X):** Statistics (Mean, Std Dev, Last Value) of the past 14 days.
    *   *Example:* "What was the average RSI over the last 14 days?"
*   **Target (y):** Did the price go up > 3.5% in the **next** 5 days?
    *   *Value:* 1 (Yes, Buy) or 0 (No, Cash).

### 🛡️ Safety Check: Future Leakage
We must ensure that the Input (X) **ONLY** contains data from the past, and the Target (y) is the **ONLY** thing looking at the future.

We use `build_rolling_window_dataset()`.

In [None]:
# Define our AI configuration
config = ModelConfig()
print("⚙️ Configuration:")
print(f"  - Lookback Window: {config.window_days} days (Feature extraction)")
print(f"  - Prediction Horizon: {config.horizon_days} days (Forecasting)")
# Format as a percentage directly; `threshold * 100` can print float noise such as 3.5000000000000004
print(f"  - Success Threshold: {config.threshold:.1%} gain needed to call it a 'BUY'")

print("\n🏗️ Building the rolling window dataset (this aligns Past features with Future targets)...")
dataset = build_rolling_window_dataset(
    df_features,
    window_days=config.window_days,
    horizon_days=config.horizon_days,
    threshold=config.threshold
)

print(f"Dataset Shape: {dataset.shape}")
dataset[['date', 'ticker', 'target', 'future_return', 'rsi_14_mean', 'atr_14_last']].head()
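
To make the alignment concrete, here is a minimal single-ticker sketch of the idea. The helper `build_windows_sketch`, its `close` column, and its toy data are assumptions for illustration only. The real `build_rolling_window_dataset` handles multiple tickers and many more features, and its internals are not shown in this notebook.

```python
import numpy as np
import pandas as pd

def build_windows_sketch(df, feature="rsi_14", window_days=14,
                         horizon_days=5, threshold=0.035):
    """Backward-looking window statistics plus one forward-looking target."""
    out = pd.DataFrame({"date": df["date"]})
    # rolling() at row t only sees rows t-window_days+1 .. t (the past)
    roll = df[feature].rolling(window_days)
    out[f"{feature}_mean"] = roll.mean()
    out[f"{feature}_std"] = roll.std()
    out[f"{feature}_last"] = df[feature]
    # shift(-horizon_days) pulls the price from t+horizon_days: the ONLY future look
    out["future_return"] = df["close"].shift(-horizon_days) / df["close"] - 1
    out["target"] = (out["future_return"] > threshold).astype(int)
    # Drop rows without a full lookback window or a full forward horizon
    return out.dropna().reset_index(drop=True)

# Toy data (made up): price grows 1% per day, so every 5-day return is about 5.1%
toy = pd.DataFrame({
    "date": pd.date_range("2022-01-03", periods=30, freq="B"),
    "close": 100 * 1.01 ** np.arange(30),
    "rsi_14": np.linspace(30, 70, 30),
})
windows = build_windows_sketch(toy)
print(windows[["date", "rsi_14_mean", "future_return", "target"]].head())
```

Of the 30 toy rows, the first 13 lack a full lookback window and the last 5 lack a full forward horizon, so 12 usable samples remain.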

## 5. Train/Test Split

We mimic real life. We train the AI on "History" (2020-2022) and test it on "The Future" (2023-Present).
If we trained on 2024 data and tested on 2023, that would be cheating!

*   **Training Set:** Data before Jan 1, 2023.
*   **Test Set:** Data from Jan 1, 2023 onwards.

In [None]:
# Sort by date to be safe
dataset = dataset.sort_values("date").reset_index(drop=True)

cutoff_date = pd.Timestamp("2023-01-01")

# Create the split
train_df = dataset[dataset["date"] < cutoff_date].reset_index(drop=True)
test_df = dataset[dataset["date"] >= cutoff_date].reset_index(drop=True)

print(f"📚 Training Samples (History): {len(train_df)}")
print(f"📝 Testing Samples (Future): {len(test_df)}")

# Prepare X (Features) and y (Target)
# We DROP 'future_return' from X because that is the answer key!
feature_cols = [c for c in dataset.columns if c not in ['ticker', 'date', 'target', 'future_return']]
X_train = train_df[feature_cols]
y_train = train_df['target']

X_test = test_df[feature_cols]
y_test = test_df['target']

print(f"Number of input features used by AI: {len(feature_cols)}")
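
One subtlety of a plain date cutoff: a training row dated within the last few trading days before the cutoff has a target computed from prices *after* the cutoff, so its label overlaps the test period. A common remedy is to drop ("purge") those rows. The sketch below shows the idea with a hypothetical helper `purge_train_tail`. It assumes the training set contains every trading date up to the cutoff, and it is not part of the project code shown in this notebook.

```python
import pandas as pd

def purge_train_tail(train_df: pd.DataFrame, horizon_days: int) -> pd.DataFrame:
    """Drop training rows from the last `horizon_days` distinct dates,
    whose forward-looking targets reach past the end of the training period."""
    dates_sorted = pd.Index(train_df["date"]).unique().sort_values()
    if len(dates_sorted) <= horizon_days:
        return train_df.iloc[0:0]  # nothing is safe to keep
    last_safe = dates_sorted[-(horizon_days + 1)]
    return train_df[train_df["date"] <= last_safe].reset_index(drop=True)

# Toy data (made up): 10 business days before the cutoff, two tickers
toy_dates = pd.date_range("2022-12-16", periods=10, freq="B")
toy_train = pd.DataFrame({
    "date": list(toy_dates) * 2,
    "ticker": ["SMCI"] * 10 + ["PLTR"] * 10,
})
purged = purge_train_tail(toy_train, horizon_days=5)
print(f"Kept {len(purged)} of {len(toy_train)} rows; last kept date: {purged['date'].max().date()}")
```

In this notebook the equivalent call would be `purge_train_tail(train_df, config.horizon_days)`, applied before `X_train` and `y_train` are built.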

## 6. Training the Model (XGBoost)

The variable `model` will hold our trained AI. We use an **XGBoost Classifier**.
*   **Why XGBoost?** It is well suited to tabular data such as our windowed features, because its ensemble of decision trees captures non-linear relationships and interactions between features that simpler linear models miss.
*   **CV (Cross-Validation):** We don't just train once. We train on chunks of time to ensure stability.

We use the function `train_xgboost_time_series`.

In [None]:
print("🧠 Training AI Model...")
model, metrics = train_xgboost_time_series(X_train, y_train, config)

print(f"\n🏆 Best Cross-Validation Precision: {metrics['cv_best_precision']:.2f}")
print(f"This means: When the model predicted a BUY during cross-validation, it was correct about {metrics['cv_best_precision']:.0%} of the time.")
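
"Training on chunks of time" usually means walk-forward (expanding-window) cross-validation: each fold trains on everything up to some point and validates on the chunk that follows. The internals of `train_xgboost_time_series` are not shown in this notebook, so the sketch below only illustrates the general idea. `walk_forward_splits` is a hypothetical helper, and it assumes the rows are already sorted by date.

```python
import numpy as np

def walk_forward_splits(n_samples: int, n_folds: int = 4):
    """Yield (train_idx, val_idx) pairs where validation always follows training."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover rows
        val_end = train_end + fold_size if k < n_folds else n_samples
        yield np.arange(0, train_end), np.arange(train_end, val_end)

splits = list(walk_forward_splits(n_samples=100, n_folds=4))
for train_idx, val_idx in splits:
    print(f"train rows 0..{train_idx[-1]}, validate rows {val_idx[0]}..{val_idx[-1]}")
```

The model never validates on rows that come before its training rows, which is the same "no peeking at the future" rule applied inside the training period.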

### 🔍 Feature Importance: What is the AI looking at?
It's important to know *why* the AI makes a decision. We can plot which technical indicators were most important for the prediction.
Often **ATR** (Volatility) and **RSI** (Momentum) are top predictors in high-volatility regimes.

In [None]:
import matplotlib.pyplot as plt

# Get feature importance
importance = model.feature_importances_
feat_imp = pd.DataFrame({'feature': feature_cols, 'importance': importance})
feat_imp = feat_imp.sort_values('importance', ascending=False).head(10)

# Plot
plt.figure(figsize=(10, 6))
plt.barh(feat_imp['feature'], feat_imp['importance'], color='skyblue')
plt.xlabel('Importance Score')
plt.title('Top 10 Features Used by AI')
plt.gca().invert_yaxis() # Highest importance at top
plt.show()

## 7. Evaluating on Test Data (The Moment of Truth)

Now we give the AI the "Test Exam" (data from 2023 onwards, which it has never seen).
We convert its probability score (0 to 1) into a decision:
*   **Probability >= 0.50:** BUY (Signal 1)
*   **Probability < 0.50:** CASH (Signal 0)

*Note: We use a Long-Only strategy. We do not short sell because it's too risky for these volatile stocks.*

In [None]:
# Get predictions
test_probs = model.predict_proba(X_test)[:, 1]

# Convert to signals (Long Only: Buy if prob >= 50%)
test_signals = np.where(test_probs >= 0.50, 1, 0)

# Add to dataframe for analysis
results_df = test_df.copy()
results_df['proba'] = test_probs
results_df['signal'] = test_signals
results_df['strategy_return'] = results_df['signal'] * results_df['future_return']

print("Signal Distribution:")
print(results_df['signal'].value_counts())
print("\nSignal 0 = Cash, Signal 1 = Buy")
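
Precision on the test set answers the question "of all the BUY signals, how many were right?". Here is a minimal sketch of that calculation on made-up toy values. `precision_sketch` is a hypothetical helper, not a function from the project.

```python
import numpy as np

def precision_sketch(signals, targets) -> float:
    """Fraction of BUY signals (1) whose target was also 1."""
    signals = np.asarray(signals)
    targets = np.asarray(targets)
    n_buys = int((signals == 1).sum())
    if n_buys == 0:
        return float("nan")  # precision is undefined when the model never buys
    true_positives = int(((signals == 1) & (targets == 1)).sum())
    return true_positives / n_buys

toy_signals = [1, 1, 0, 1, 0, 1]
toy_targets = [1, 0, 0, 1, 1, 1]
toy_precision = precision_sketch(toy_signals, toy_targets)
print(toy_precision)  # 3 of the 4 BUY signals were correct -> 0.75
```

In this notebook the equivalent call would be `precision_sketch(results_df['signal'], results_df['target'])`.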

## 8. Backtest: Did we make money?

We simulate what would have happened if we followed these signals.
*We update our portfolio every 5 days (non-overlapping) to match our prediction horizon.*

In [None]:
from backtest import max_drawdown

# Filter for non-overlapping periods (every 5 days) to calculate realistic cumulative returns
# This mimics trading: Buy Monday -> Hold 5 days -> Sell next Monday -> Re-evaluate
trading_dates = sorted(results_df['date'].unique())[::config.horizon_days]
backtest_df = results_df[results_df['date'].isin(trading_dates)].copy()

# Calculate the portfolio return for each 5-day holding period
# (equal-weighted mean across tickers; a CASH signal contributes a return of 0)
portfolio_returns = backtest_df.groupby('date')['strategy_return'].mean()

# Calculate Equity Curve (Starting at $1.00)
equity_curve = (1 + portfolio_returns).cumprod()

# Plot
plt.figure(figsize=(12, 6))
plt.plot(equity_curve.index, equity_curve.values, label='AI Strategy (Long/Cash)', color='green', linewidth=2)
plt.axhline(1.0, color='gray', linestyle='--')
plt.title('Portfolio Performance (2023-Present)')
plt.ylabel('Equity Multiplier (Start = 1.0)')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

# Final Stats
total_return = (equity_curve.iloc[-1] - 1) * 100
print(f"💰 Final Total Return: {total_return:.2f}%")
print(f"📉 Max Drawdown: {max_drawdown(equity_curve)*100:.2f}%")
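
Max drawdown is the worst peak-to-trough drop of the equity curve. The project's `max_drawdown` lives in `backtest` and is not shown here, so the following is only a sketch of the standard calculation. `max_drawdown_sketch` is a hypothetical name, and the sign convention (a negative number) is an assumption that may differ from the project's function.

```python
import pandas as pd

def max_drawdown_sketch(equity: pd.Series) -> float:
    """Largest fractional drop from a running peak (returned as a negative number)."""
    running_peak = equity.cummax()          # highest equity seen so far
    drawdown = equity / running_peak - 1.0  # 0 at a new peak, negative below it
    return float(drawdown.min())

# Toy equity curve (made up): peaks at 1.2, then dips to 0.9 before recovering
toy_equity = pd.Series([1.0, 1.2, 0.9, 1.1, 1.3])
toy_mdd = max_drawdown_sketch(toy_equity)
print(f"{toy_mdd:.2%}")  # worst drop is 0.9 / 1.2 - 1, i.e. about -25%
```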