# Data Preparation for XGBoost Models

## Objective
This notebook prepares the cleaned stock data for training XGBoost models. XGBoost is a gradient boosting library that performs well on tabular data. We take the cleaned datasets from `ml/data/cleaned`, add engineered features, encode calendar variables, fill missing values, and export the results for training.

## Steps:
1. Import necessary libraries
2. Load the cleaned data
3. Apply basic feature engineering for XGBoost
4. Handle categorical variables
5. Handle missing values
6. Export processed datasets for XGBoost training
7. Visualize some of the processed data

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from glob import glob
import warnings

# Set display options
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
%matplotlib inline

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)

## Define Input and Output Directories

In [2]:
# Define input and output directories
input_dir = '../data/cleaned'
output_dir = '../data/xgboost'

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

print(f"Input directory: {input_dir}")
print(f"Output directory: {output_dir}")

Input directory: ../data/cleaned
Output directory: ../data/xgboost


## Function to Load Cleaned Data Files

This function loads every cleaned CSV file in the input directory into a dictionary keyed by stock symbol.

In [3]:
def load_cleaned_data():
    """
    Load all cleaned stock data files from the input directory.

    Returns:
    - Dictionary containing dataframes for each stock
    """
    stock_data = {}
    csv_files = glob(os.path.join(input_dir, '*.csv'))

    print(f"Found {len(csv_files)} CSV files in {input_dir}")

    for file_path in csv_files:
        # Extract stock symbol from filename; splitext strips only the final
        # extension, so dotted symbols such as BRK.B stay intact
        stock_symbol = os.path.splitext(os.path.basename(file_path))[0]

        # Load the CSV file
        try:
            df = pd.read_csv(file_path)
            # Ensure 'date' is datetime and rows are in ascending date order,
            # since shift- and rolling-based features assume chronological rows
            df['date'] = pd.to_datetime(df['date'])
            df = df.sort_values('date').reset_index(drop=True)

            # Add to dictionary
            stock_data[stock_symbol] = df
            print(f"Loaded {stock_symbol}: {df.shape[0]} rows, {df.shape[1]} columns")
        except Exception as e:
            print(f"Error loading {file_path}: {str(e)}")

    return stock_data

## Feature Engineering for XGBoost

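Tree-based models such as XGBoost split on one feature at a time and do not derive ratios, lags, or rolling statistics on their own, so we add them explicitly. The function below adds price, trend, volume, momentum, technical-indicator, lag, and calendar features, then fills missing values.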
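
As a small illustration of the kind of features involved, the sketch below computes a day-over-day percentage change and a 3-day momentum on a four-row price series. The `toy` DataFrame and its values are made up for this example; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the cleaned datasets
toy = pd.DataFrame({'close': [100.0, 102.0, 101.0, 105.0]})

# Day-over-day change, scaled to percent
toy['price_change_pct'] = toy['close'].pct_change() * 100

# 3-day momentum, left as a fraction (0.05 means +5%)
toy['momentum_3d'] = toy['close'] / toy['close'].shift(3) - 1

print(toy)
```

The first row of `price_change_pct` and the first three rows of `momentum_3d` are NaN because there is no earlier price to compare against, which is why the notebook needs a missing-value step after feature creation.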

In [4]:
def prepare_data_for_xgboost(df):
    """
    Prepare data for XGBoost by adding useful features and handling missing values.
    
    Parameters:
    - df: DataFrame with cleaned stock data
    
    Returns:
    - DataFrame prepared for XGBoost training
    """
    # Make a copy to avoid modifying the original DataFrame
    result = df.copy()
    
    # 1. Add price-based features
    
    # Price change from previous day (already have 'return' but adding percentage too)
    result['price_change_pct'] = result['close'].pct_change() * 100
    
    # Daily trading range as percentage
    result['daily_range_pct'] = (result['high'] - result['low']) / result['open'] * 100
    
    # Price position within day's range (0 = at low, 1 = at high)
    result['price_position'] = (result['close'] - result['low']) / (result['high'] - result['low'])
    
    # 2. Add simple trend indicators
    
    # Price distance from moving averages
    result['dist_from_ma5'] = (result['close'] / result['ma5'] - 1) * 100
    result['dist_from_ma20'] = (result['close'] / result['ma20'] - 1) * 100
    result['dist_from_ma50'] = (result['close'] / result['ma50'] - 1) * 100
    
    # Moving average relationship flags (1 while the shorter MA is above the longer one, else 0)
    result['ma5_gt_ma20'] = (result['ma5'] > result['ma20']).astype(int)
    result['ma20_gt_ma50'] = (result['ma20'] > result['ma50']).astype(int)
    
    # 3. Add volume-based features
    
    # Volume change from the previous day (a fraction, e.g. 0.10 = +10%)
    result['volume_change'] = result['volume'].pct_change()
    
    # Relative volume compared to recent average
    result['relative_volume'] = result['volume'] / result['volume_ma20']
    
    # 4. Add momentum indicators
    
    # 3-day momentum (how much price has changed over 3 days)
    result['momentum_3d'] = result['close'] / result['close'].shift(3) - 1
    
    # 5-day momentum
    result['momentum_5d'] = result['close'] / result['close'].shift(5) - 1
    
    # 10-day momentum
    result['momentum_10d'] = result['close'] / result['close'].shift(10) - 1
    
    # 5. Add simple technical indicators
    
    # Relative Strength Index (RSI) - simplified 14-day calculation that uses
    # simple rolling means instead of the Wilder smoothing of the classic RSI
    delta = result['close'].diff()
    gain = delta.where(delta > 0, 0).rolling(window=14).mean()
    loss = -delta.where(delta < 0, 0).rolling(window=14).mean()
    rs = gain / loss
    result['rsi_14'] = 100 - (100 / (1 + rs))
    
    # Bollinger Bands - 20-day with 2 standard deviations
    result['bb_middle'] = result['close'].rolling(window=20).mean()
    result['bb_std'] = result['close'].rolling(window=20).std()
    result['bb_upper'] = result['bb_middle'] + 2 * result['bb_std']
    result['bb_lower'] = result['bb_middle'] - 2 * result['bb_std']
    
    # Bollinger Band width (volatility indicator)
    result['bb_width'] = (result['bb_upper'] - result['bb_lower']) / result['bb_middle']
    
    # Bollinger Band position (where price is relative to bands)
    result['bb_position'] = (result['close'] - result['bb_lower']) / (result['bb_upper'] - result['bb_lower'])
    
    # 6. Add simple lag features (previous days' values)
    
    # Previous day's closing price
    result['prev_close'] = result['close'].shift(1)
    
    # Previous day's volume
    result['prev_volume'] = result['volume'].shift(1)
    
    # Previous day's return
    result['prev_return'] = result['return'].shift(1)
    
    # 7. Handle categorical variables
    
    # One-hot encode day of week (if needed)
    # result = pd.get_dummies(result, columns=['day_of_week'], prefix='dow')
    
    # Cyclical encoding of day of week (better for capturing cyclical patterns)
    result['day_of_week_sin'] = np.sin(2 * np.pi * result['day_of_week'] / 7)
    result['day_of_week_cos'] = np.cos(2 * np.pi * result['day_of_week'] / 7)
    
    # Cyclical encoding of month
    result['month_sin'] = np.sin(2 * np.pi * result['month'] / 12)
    result['month_cos'] = np.cos(2 * np.pi * result['month'] / 12)
    
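    # 8. Handle missing values
    
    # Replace infinity values with NaN
    result = result.replace([np.inf, -np.inf], np.nan)
    
    # Forward-fill missing values (fillna(method=...) is deprecated in recent pandas).
    # Note: this also fills any NaN target values at the end of the series.
    result = result.ffill()
    
    # Backward-fill the NaNs that remain at the start of the series.
    # Note: this copies later values into the warm-up rows of rolling and lag features.
    result = result.bfill()
    
    # As a last resort, fill any remaining NaNs with zeros
    result = result.fillna(0)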
    
    return result
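
Back-filling, as done in the function above, is one way to deal with the leading NaNs that rolling and lag features produce. A possible alternative, sketched here on made-up data, is to drop those warm-up rows so that no later value is copied backwards in time. The `toy` frame and the `feature_cols` list are hypothetical; this is not what the notebook currently does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 30 rows of synthetic closing prices
toy = pd.DataFrame({'close': np.linspace(100.0, 129.0, 30)})

# Two features with different warm-up lengths
toy['bb_middle'] = toy['close'].rolling(window=20).mean()      # first 19 rows are NaN
toy['momentum_5d'] = toy['close'] / toy['close'].shift(5) - 1  # first 5 rows are NaN

feature_cols = ['bb_middle', 'momentum_5d']

# Drop rows where any feature is still undefined instead of back-filling them
trimmed = toy.dropna(subset=feature_cols).reset_index(drop=True)

print(f"Rows before: {len(toy)}, after dropping warm-up rows: {len(trimmed)}")
```

The trade-off is a slightly shorter dataset, with the longest rolling window determining how many rows are lost.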

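## Visualize Engineered Features

This function plots the Bollinger Bands, the RSI, and the prediction targets for one stock, followed by a heatmap of feature-target correlations.

In [5]: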
def visualize_xgboost_features(df, ticker, sample_days=90):
    """
    Create visualizations to analyze features prepared for XGBoost.
    
    Parameters:
    - df: DataFrame with processed stock data for XGBoost
    - ticker: Stock symbol
    - sample_days: Number of days to display in the plots
    """
    if df is None or df.empty:
        print(f"No data to visualize for {ticker}")
        return
    
    # Get the last n days of data
    last_days = df.iloc[-sample_days:].copy()
    
    # Create a figure with subplots
    fig, axes = plt.subplots(3, 1, figsize=(14, 15))
    
    # Plot 1: Price with Bollinger Bands
    axes[0].plot(last_days['date'], last_days['close'], label='Close Price')
    axes[0].plot(last_days['date'], last_days['bb_upper'], label='BB Upper', alpha=0.5, linestyle='--', color='red')
    axes[0].plot(last_days['date'], last_days['bb_middle'], label='BB Middle', alpha=0.5, linestyle='--', color='orange')
    axes[0].plot(last_days['date'], last_days['bb_lower'], label='BB Lower', alpha=0.5, linestyle='--', color='green')
    axes[0].set_title(f'{ticker} - Price with Bollinger Bands (Last {sample_days} Days)')
    axes[0].set_ylabel('Price ($)')
    axes[0].legend()
    axes[0].grid(True)
    
    # Plot 2: RSI
    axes[1].plot(last_days['date'], last_days['rsi_14'], label='RSI 14', color='purple')
    axes[1].axhline(y=70, color='red', linestyle='--', alpha=0.5)
    axes[1].axhline(y=30, color='green', linestyle='--', alpha=0.5)
    axes[1].set_title(f'{ticker} - Relative Strength Index (Last {sample_days} Days)')
    axes[1].set_ylabel('RSI')
    axes[1].set_ylim(0, 100)
    axes[1].grid(True)
    
    # Plot 3: Prediction Targets
    axes[2].plot(last_days['date'], last_days['close'], label='Actual Close', color='blue')
    axes[2].plot(last_days['date'], last_days['next_day_close'], label='Next Day Close', color='green', linestyle='--')
    axes[2].plot(last_days['date'], last_days['next_week_avg'], label='Next Week Avg', color='orange', linestyle=':')
    axes[2].plot(last_days['date'], last_days['next_month_avg'], label='Next Month Avg', color='red', linestyle='-.')
    axes[2].set_title(f'{ticker} - Prediction Targets (Last {sample_days} Days)')
    axes[2].set_ylabel('Price ($)')
    axes[2].legend()
    axes[2].grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # Feature Correlation with Targets
    plt.figure(figsize=(14, 10))
    
    # Select important features for correlation analysis
    important_features = ['close', 'return', 'rsi_14', 'bb_position', 'momentum_5d', 
                          'price_change_pct', 'daily_range_pct', 'volume_change', 'relative_volume']
    
    # Target variables
    targets = ['next_day_close', 'next_week_avg', 'next_month_avg']
    
    # Create correlation matrix for important features vs targets
    corr_df = df[important_features + targets].corr()
    sns.heatmap(corr_df.loc[important_features, targets], annot=True, cmap='coolwarm', 
                vmin=-1, vmax=1, center=0)
    plt.title(f'{ticker} - Feature Correlation with Target Variables')
    plt.tight_layout()
    plt.show()

## Process All Stocks

Now let's process all stocks and save the XGBoost-ready datasets.

In [None]:
# Load stock data
print("Loading stock data...")
stock_data = load_cleaned_data()

# Process each stock for XGBoost
print("\nProcessing stocks for XGBoost...")
processed_data = {}

# Build features for each ticker and write the result to CSV
for ticker, df in stock_data.items():
    # Apply XGBoost preparation
    xgb_df = prepare_data_for_xgboost(df)
    processed_data[ticker] = xgb_df

    print(f"Processed {ticker}: Original shape {df.shape}, XGBoost ready shape {xgb_df.shape}")

    # Save the processed data
    output_file = os.path.join(output_dir, f"{ticker}_xgboost.csv")
    xgb_df.to_csv(output_file, index=False)
    print(f"Saved to {output_file}")

print("\nSuccessfully processed all", len(processed_data), "stocks for XGBoost!")

## Examine AAPL Dataset

Let's look at the head of the AAPL dataset to see the features we've created for XGBoost.

In [None]:
# Let's examine the AAPL dataset
aapl_df = processed_data['AAPL']
print(f"AAPL XGBoost dataset has {aapl_df.shape[1]} columns")
print("First 5 rows of selected columns:")

# Show only a subset of columns for clarity
display_cols = ['date', 'close', 'return', 'price_change_pct', 'rsi_14', 'bb_position', 
              'momentum_5d', 'next_day_close', 'next_week_avg', 'next_month_avg']
aapl_df[display_cols].head()

## Feature Importance Check

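As a quick screen before training, we rank features by their absolute Pearson correlation with the target. This is only a rough proxy for feature importance: it captures linear relationships only, and price-level features such as `close` tend to correlate very strongly with a price-level target.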

In [None]:
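# Calculate absolute correlation with next_day_close for AAPL
target = 'next_day_close'
# Exclude the target columns themselves; otherwise the target's correlation
# with itself (1.0) and the other targets would top the ranking
target_cols = ['next_day_close', 'next_week_avg', 'next_month_avg']
numeric_cols = aapl_df.select_dtypes(include='number').columns.difference(target_cols)
correlations = aapl_df[numeric_cols].corrwith(aapl_df[target]).abs().sort_values(ascending=False)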

# Visualize top correlations
plt.figure(figsize=(20, 10))
correlations[:15].plot(kind='bar')
plt.title(f'Features with Highest Correlation to {target} (AAPL)')
plt.ylabel('Absolute Correlation')
plt.xlabel('Feature')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## Visualize Sample Stock Data

Let's visualize the processed data for a few stocks to understand what we've created.

In [None]:
tickers_to_visualize = ['AAPL', 'MSFT', 'TSLA']
for ticker in tickers_to_visualize:
    print(f"Visualizing {ticker}...")
    if ticker in processed_data:
        visualize_xgboost_features(processed_data[ticker], ticker)
    else:
        print(f"Data for {ticker} not found in processed_data.")

## Summary and Explanation of Features

We've engineered several features for XGBoost that capture different aspects of stock price movements. Here's a summary of the key features we created:

1. **Price-based features**:
  - `price_change_pct`: Percentage change in closing price from previous day
  - `daily_range_pct`: Daily trading range as a percentage of opening price
  - `price_position`: Where the closing price sits within the day's trading range

2. **Trend indicators**:
  - `dist_from_ma5`, `dist_from_ma20`, `dist_from_ma50`: Distance of price from moving averages
  - `ma5_gt_ma20`, `ma20_gt_ma50`: Binary flags that are 1 while the shorter moving average is above the longer one (a change in the flag marks a crossover)

3. **Volume indicators**:
  - `volume_change`: Fractional change in volume from the previous day (not scaled to percent)
  - `relative_volume`: Current volume compared to 20-day volume average

4. **Momentum indicators**:
  - `momentum_3d`, `momentum_5d`, `momentum_10d`: Fractional price change over the previous 3, 5, and 10 rows (trading days)

5. **Technical indicators**:
  - `rsi_14`: Relative Strength Index (momentum oscillator, 0-100)
  - `bb_width`: Bollinger Band width (volatility indicator)
  - `bb_position`: Where price sits relative to the Bollinger Bands (0 = lower band, 1 = upper band; values outside this range mean the price is outside the bands)

6. **Lag features**:
  - `prev_close`, `prev_volume`, `prev_return`: Previous day's values

7. **Cyclical time features**:
  - `day_of_week_sin`, `day_of_week_cos`: Cyclical encoding of day of week
  - `month_sin`, `month_cos`: Cyclical encoding of month
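
To see why the cyclical encoding helps, the sketch below maps months onto the unit circle with the same sine and cosine formula used for `month_sin` and `month_cos`, then compares distances between encoded months. The `encode_month` helper is hypothetical and exists only for this illustration.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

# Euclidean distance between encoded months
dec_to_jan = np.linalg.norm(encode_month(12) - encode_month(1))
jun_to_jan = np.linalg.norm(encode_month(6) - encode_month(1))

print(f"Dec -> Jan distance: {dec_to_jan:.3f}")
print(f"Jun -> Jan distance: {jun_to_jan:.3f}")
```

With the raw month number, December (12) and January (1) look far apart. After encoding they are neighbours, while June and January remain distant.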

These features provide XGBoost with different perspectives on price movement, volatility, trend, and momentum, which should help it learn patterns in stock prices. The target variables remain the same as in the original dataset:

- `next_day_close`: The next day's closing price
- `next_week_avg`: Average price for the next 5 trading days
- `next_month_avg`: Average price for the next 21 trading days
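
The targets are computed in an earlier phase that is not shown in this notebook. Assuming they follow the definitions listed above, they could be built roughly as in this sketch. It is an assumption for illustration on made-up data, not the project's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series of 10 closing prices: 1.0, 2.0, ..., 10.0
toy = pd.DataFrame({'close': np.arange(1.0, 11.0)})

# Next row's close
toy['next_day_close'] = toy['close'].shift(-1)

# Mean of the following 5 closes: a trailing 5-row mean, shifted back 5 rows
toy['next_week_avg'] = toy['close'].rolling(window=5).mean().shift(-5)

# Rows near the end have no complete future window, so their targets are NaN
labelled = toy.dropna(subset=['next_day_close', 'next_week_avg'])
print(labelled)
```

Under this construction the final rows of each series have no valid target. Dropping them, as the last step does, avoids training on labels that were filled in rather than observed.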

In the next phase, we'll train XGBoost models using these features to predict each of our three target variables.

## Conclusion

In this notebook, we've successfully prepared our stock datasets for XGBoost by:

1. Loading the cleaned data from Phase 1
2. Adding useful features for stock price prediction
3. Handling missing values and categorical variables
4. Saving the prepared datasets to the `ml/data/xgboost` directory
5. Visualizing the features and their relationships with target variables

The prepared datasets are now ready for the training phase, where we'll build XGBoost models to predict next-day closing prices, next-week average prices, and next-month average prices.

Key benefits of our approach:
- We've created straightforward, easily explainable features
- The features capture different aspects of stock behavior (price, volume, trend, momentum)
- We've maintained all three prediction targets for flexibility in model training
- The transformation process is clean and repeatable for new data

The next step is the training phase, where we'll split the data into training and testing sets, train the XGBoost models, and evaluate their performance.
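
For time-ordered data like this, one common way to make that split is chronologically, so that every test row is later than every training row. The sketch below assumes a `date` column as in this notebook. The `chronological_split` helper, its `test_fraction` parameter, and the toy data are hypothetical and not part of the project code.

```python
import pandas as pd

def chronological_split(df, test_fraction=0.2):
    """Split a DataFrame by date so the test set is strictly later than the training set."""
    ordered = df.sort_values('date').reset_index(drop=True)
    n_test = max(1, int(len(ordered) * test_fraction))
    split_idx = len(ordered) - n_test
    return ordered.iloc[:split_idx], ordered.iloc[split_idx:]

# Hypothetical toy data: 10 business days of prices
toy = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='B'),
    'close': range(100, 110),
})

train, test = chronological_split(toy)
print(f"Train: {len(train)} rows, ends {train['date'].max().date()}")
print(f"Test:  {len(test)} rows, starts {test['date'].min().date()}")
```

A random shuffle would mix future rows into the training set, which can make evaluation results look better than they would be on genuinely unseen data.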