# Stock Data Preprocessing Pipeline

This notebook implements a comprehensive preprocessing pipeline for stock data to prepare it for machine learning models including Prophet, LSTM, and XGBoost.

## Pipeline Steps:
1. Fetch daily stock data using yfinance
2. Clean the data (handle missing values, sort by date, etc.)
3. Engineer features (returns, moving averages, technical indicators, etc.)
4. Create target variables for classification
5. Save processed data to CSV files

Processed data is written to `../data/cleaned/`, with one CSV file per ticker.

## Import Libraries

This cell imports all the necessary Python libraries for the preprocessing pipeline. 
- `yfinance` is used for fetching historical stock market data.
- `pandas` is essential for data manipulation and analysis, primarily using DataFrames.
- `numpy` provides support for numerical operations, especially for arrays and mathematical functions.
- `matplotlib.pyplot` is imported for plotting. It is not used in the code shown here, but it is available for data exploration.
- `os` allows interaction with the operating system, used here for creating directories.
- `datetime` from the `datetime` module is used for handling date and time information, particularly for setting default end dates.
- `warnings` controls how warning messages are handled; `warnings.filterwarnings('ignore')` suppresses all warnings to keep the output clean, which also hides deprecation notices from the libraries above.

In [58]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

## Define Stock Tickers

This cell defines a list named `tickers`. It contains 20 ticker symbols: 19 companies and one ETF (SPY). Each ticker represents a specific stock (e.g., 'AAPL' for Apple Inc.) or an ETF (e.g., 'SPY' for SPDR S&P 500 ETF Trust). The notebook iterates over this list to fetch, process, and save data for each entry.

In [59]:
tickers = [
    'AAPL',  # Apple Inc.
    'MSFT',  # Microsoft Corporation
    'GOOG',  # Alphabet Inc. (Google)
    'AMZN',  # Amazon.com, Inc.
    'TSLA',  # Tesla, Inc.
    'META',  # Meta Platforms, Inc. (formerly Facebook)
    'NVDA',  # NVIDIA Corporation
    'SPY',   # SPDR S&P 500 ETF Trust
    'V',     # Visa Inc.
    'DIS',   # The Walt Disney Company
    'NFLX',  # Netflix, Inc.
    'PYPL',  # PayPal Holdings, Inc.
    'BABA',  # Alibaba Group
    'IBM',   # International Business Machines Corporation
    'AMD',   # Advanced Micro Devices, Inc.
    'BA',    # The Boeing Company
    'INTC',  # Intel Corporation
    'T',     # AT&T Inc.
    'GS',    # Goldman Sachs Group, Inc.
    'NKE'    # Nike, Inc.
]

## Define Helper Functions

The following cells define various helper functions that encapsulate specific steps of the data preprocessing pipeline. This modular approach makes the code more organized, reusable, and easier to understand.

### `fetch_stock_data` Function

This cell defines the `fetch_stock_data` function. Its purpose is to retrieve historical stock data for a specified ticker symbol using the `yfinance` library.

**Parameters:**
- `ticker` (str): The stock ticker symbol (e.g., 'AAPL').
- `start_date` (str): The start date for fetching data, in 'YYYY-MM-DD' format. Defaults to '2010-01-01'.
- `end_date` (str): The end date for fetching data, in 'YYYY-MM-DD' format. If not provided, it defaults to the current day. `yfinance` treats the end date as exclusive, so the row for the end date itself is not returned.

**Functionality:**
1. Sets the `end_date` to today's date if it's not specified.
2. Creates a `yf.Ticker` object for the given stock symbol.
3. Calls the `history()` method on the Ticker object to download the stock data for the specified period.
4. Prints a confirmation message indicating the number of rows fetched and the date range.
5. Prints the first 5 rows of the fetched data for a quick check.

**Returns:**
- `pandas.DataFrame`: A DataFrame containing the historical stock data, typically including columns like Open, High, Low, Close, Volume, Dividends, and Stock Splits.

In [61]:
def fetch_stock_data(ticker, start_date='2010-01-01', end_date=None):
    """
    Fetch stock data for a given ticker using yfinance
    
    Parameters:
    -----------
    ticker : str
        Stock ticker symbol
    start_date : str
        Start date in 'YYYY-MM-DD' format
    end_date : str
        End date in 'YYYY-MM-DD' format, default is today
        
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing stock data
    """
    if end_date is None:
        end_date = datetime.today().strftime('%Y-%m-%d')
        
    stock = yf.Ticker(ticker)
    data = stock.history(start=start_date, end=end_date)
    
    print(f"Fetched {len(data)} rows of data for {ticker} from {start_date} to {end_date}")
    print(data.head(5))
    return data
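
A failed or empty download is easier to diagnose when it is caught right after the fetch. The sketch below assumes that an unavailable ticker yields an empty DataFrame rather than an exception. `require_rows` is a hypothetical helper for illustration, not part of this notebook or of `yfinance`, and the frames are stand-ins so that no network access is needed.

```python
import pandas as pd

def require_rows(data: pd.DataFrame, ticker: str, min_rows: int = 1) -> pd.DataFrame:
    """Hypothetical guard: raise early if a fetch returned too few rows."""
    if len(data) < min_rows:
        raise ValueError(f"{ticker}: expected at least {min_rows} rows, got {len(data)}")
    return data

# Stand-in frames instead of real downloads
ok = require_rows(pd.DataFrame({'Close': [1.0, 2.0]}), 'DEMO')

try:
    require_rows(pd.DataFrame(), 'MISSING')
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

Wrapping the result of `fetch_stock_data` in such a check would turn a silent empty frame into an explicit error that names the ticker.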

### `clean_stock_data` Function

This cell defines the `clean_stock_data` function. This function is responsible for performing several data cleaning operations on the raw stock data DataFrame obtained from `yfinance`.

**Parameters:**
- `df` (pandas.DataFrame): The input DataFrame containing raw stock data, with a 'Date' index.

**Functionality:**
1. Creates a copy of the input DataFrame to avoid modifying the original data in place.
2. Drops the 'Dividends' and 'Stock Splits' columns, which are not used by later steps.
3. Resets the index so that 'Date' becomes a regular column, which makes the timezone handling and sorting below straightforward.
4. Removes timezone information from the 'Date' column using `dt.tz_localize(None)`. This standardizes the date format.
5. Sorts the DataFrame by the 'Date' column in ascending order and resets the index again, dropping the old index.
6. Drops any duplicate rows based on the 'Date' column to ensure each trading day has only one entry.
7. Replaces any 'Volume' entries that are 0 with `np.nan` (Not a Number). Zero volume can indicate missing data or non-trading days and is better handled as NaN for imputation.
8. Handles missing values (NaNs) in the price columns ('Open', 'High', 'Low', 'Close') using forward fill (`ffill()`). This means missing values are replaced by the last known valid value.
9. Forward fills missing values in the 'Volume' column as well.
10. Drops any remaining rows that still contain NaN values after the fill operations (for example, leading rows that have no earlier value to fill from). This ensures the DataFrame is free of missing data.
11. Converts the 'Volume' column to an integer data type.
12. Sets the 'Date' column back as the DataFrame's index, which is a common convention for time-series data and useful for subsequent feature engineering steps.

**Returns:**
- `pandas.DataFrame`: A cleaned DataFrame with handled missing values, sorted dates, and 'Date' as the index.

In [62]:
def clean_stock_data(df):
    """
    Clean stock data by handling missing values, sorting by date, etc.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing stock data
        
    Returns:
    --------
    pandas.DataFrame
        Cleaned DataFrame
    """
    # Make a copy to avoid modifying the original data
    df = df.copy()

    # Drop the Dividends and Stock Splits columns; they are not used by later steps
    df = df.drop(['Dividends', 'Stock Splits'], axis=1)
    
    # Reset index to keep Date as a column
    df = df.reset_index()
    
    # Remove timezone info from Date column
    df['Date'] = df['Date'].dt.tz_localize(None)
    
    # Sort data by date (ascending) and reset index
    df = df.sort_values('Date').reset_index(drop=True)
    
    # Drop duplicate rows
    df = df.drop_duplicates(subset=['Date'])
    
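    # Replace Volume = 0 with NaN; replace() upcasts the integer column to float
    # instead of assigning NaN into an integer column in place
    df['Volume'] = df['Volume'].replace(0, np.nan)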
    
    # Handle missing values: Forward fill for price columns
    for col in ['Open', 'High', 'Low', 'Close']:
        df[col] = df[col].ffill()
    
    # Forward fill for Volume
    df['Volume'] = df['Volume'].ffill()
    
    # Drop any remaining rows with NaNs
    df = df.dropna()
    
    # Ensure integer type for Volume
    df['Volume'] = df['Volume'].astype(int)
    
    # Set Date as index again for feature engineering
    df = df.set_index('Date')
    
    return df
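
The effect of the duplicate, zero-volume, and forward-fill steps is easiest to see on a tiny frame. The following is a minimal sketch on made-up data, not real market data. It mirrors the cleaning steps above in simplified form and skips the timezone and column-dropping steps.

```python
import numpy as np
import pandas as pd

# Made-up rows: one duplicated date, one zero-volume day, one missing close
raw = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-03', '2024-01-04']),
    'Close': [10.0, 10.5, 10.5, np.nan],
    'Volume': [1000, 0, 0, 1200],
})

demo = raw.sort_values('Date').drop_duplicates(subset=['Date']).copy()
demo['Volume'] = demo['Volume'].replace(0, np.nan)             # zero volume -> missing
demo[['Close', 'Volume']] = demo[['Close', 'Volume']].ffill()  # carry last valid value forward
demo = demo.dropna()
demo['Volume'] = demo['Volume'].astype(int)
demo = demo.set_index('Date')

print(demo)
```

Note that the zero-volume day on 2024-01-03 ends up with the previous day's volume (1000) and the missing close on 2024-01-04 repeats 10.5. Forward filling never looks ahead, but it does create flat stretches that the rolling features will later treat as real observations.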

### `add_technical_indicators` Function

This cell defines the `add_technical_indicators` function. Its purpose is to calculate and add several common technical indicators to the stock data DataFrame. These indicators can provide insights into market trends, momentum, and volatility, and are often used as features in financial machine learning models.

**Parameters:**
- `df` (pandas.DataFrame): The input DataFrame containing cleaned stock data, with 'Date' as the index and at least a 'Close' price column.

**Functionality:**
1. Creates a copy of the input DataFrame.
2. **RSI (Relative Strength Index):** Calculates the 14-period RSI. This involves:
   - Calculating price differences (`delta`).
   - Separating gains and losses.
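   - Calculating the average gain and average loss as simple rolling means over a 14-day window (Wilder's original RSI uses a smoothed moving average instead, so values can differ from charting tools).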
   - Computing the Relative Strength (RS) and then the RSI value.
3. **MACD (Moving Average Convergence Divergence):** Calculates the MACD line, MACD signal line, and MACD histogram.
   - `MACD`: Difference between the 12-period Exponential Moving Average (EMA) and the 26-period EMA of the 'Close' price.
   - `MACD_Signal`: 9-period EMA of the MACD line.
   - `MACD_Hist`: Difference between the MACD line and the MACD signal line.
4. **Bollinger Bands:** Calculates the 20-period Bollinger Bands.
   - `MA_20`: 20-day Simple Moving Average (SMA) of the 'Close' price.
   - `BB_Upper`: Upper Bollinger Band (MA_20 + 2 * 20-day standard deviation of 'Close' price).
   - `BB_Lower`: Lower Bollinger Band (MA_20 - 2 * 20-day standard deviation of 'Close' price).
5. **Bollinger Band Width and Position:**
   - `BB_Width`: Measures the width of the Bollinger Bands relative to the middle band (`(BB_Upper - BB_Lower) / MA_20`).
   - `BB_Position`: Indicates where the 'Close' price is relative to the Bollinger Bands (`(Close - BB_Lower) / (BB_Upper - BB_Lower)`).
6. **Volatility (Historical):** Calculates historical volatility based on the standard deviation of daily returns over different periods, annualized by multiplying by the square root of 252 (approximate number of trading days in a year).
   - `Daily_Return`: Percentage change in 'Close' price from the previous day.
   - `Volatility_10D`: Annualized 10-day rolling standard deviation of daily returns.
   - `Volatility_30D`: Annualized 30-day rolling standard deviation of daily returns.

**Returns:**
- `pandas.DataFrame`: The DataFrame with the newly added technical indicator columns.

In [63]:
def add_technical_indicators(df):
    """
    Add technical indicators to the DataFrame
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing stock data
        
    Returns:
    --------
    pandas.DataFrame
        DataFrame with added technical indicators
    """
    df = df.copy()
    
    # Calculate RSI (Relative Strength Index)
    delta = df['Close'].diff()
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)
    
    avg_gain = gain.rolling(window=14).mean()
    avg_loss = loss.rolling(window=14).mean()
    
    rs = avg_gain / avg_loss
    df['RSI'] = 100 - (100 / (1 + rs))
    
    # Calculate MACD (Moving Average Convergence Divergence)
    exp1 = df['Close'].ewm(span=12, adjust=False).mean()
    exp2 = df['Close'].ewm(span=26, adjust=False).mean()
    df['MACD'] = exp1 - exp2
    df['MACD_Signal'] = df['MACD'].ewm(span=9, adjust=False).mean()
    df['MACD_Hist'] = df['MACD'] - df['MACD_Signal']
    
    # Calculate Bollinger Bands
    df['MA_20'] = df['Close'].rolling(window=20).mean()
    df['BB_Upper'] = df['MA_20'] + (df['Close'].rolling(window=20).std() * 2)
    df['BB_Lower'] = df['MA_20'] - (df['Close'].rolling(window=20).std() * 2)
    
    # Calculate BB width and position
    df['BB_Width'] = (df['BB_Upper'] - df['BB_Lower']) / df['MA_20']
    df['BB_Position'] = (df['Close'] - df['BB_Lower']) / (df['BB_Upper'] - df['BB_Lower'])
    
    # Calculate Volatility (historical) - Daily returns standard deviation over different windows
    df['Daily_Return'] = df['Close'].pct_change()
    df['Volatility_10D'] = df['Daily_Return'].rolling(window=10).std() * np.sqrt(252)  # Annualized
    df['Volatility_30D'] = df['Daily_Return'].rolling(window=30).std() * np.sqrt(252)  # Annualized
    
    return df
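
The RSI computed above averages gains and losses with a plain rolling mean. Many charting tools use Wilder-style smoothing, which is an exponential average with `alpha = 1 / window`. The sketch below compares the two on a synthetic random walk. The `rsi` helper and its `wilder` flag are hypothetical names introduced for illustration, and the `ewm` variant is initialized differently from Wilder's original recipe, so treat it as an approximation.

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, window: int = 14, wilder: bool = False) -> pd.Series:
    """RSI with either a simple rolling mean or Wilder-style smoothing."""
    delta = close.diff()
    gain = delta.where(delta > 0, 0.0)
    loss = -delta.where(delta < 0, 0.0)
    if wilder:
        # Wilder-style smoothing: exponential average with alpha = 1 / window
        avg_gain = gain.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
        avg_loss = loss.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    else:
        # Same averaging as add_technical_indicators in this notebook
        avg_gain = gain.rolling(window=window).mean()
        avg_loss = loss.rolling(window=window).mean()
    rs = avg_gain / avg_loss
    return 100 - (100 / (1 + rs))

# Synthetic random-walk prices, not real market data
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 300).cumsum())

rsi_simple = rsi(close)
rsi_wilder = rsi(close, wilder=True)
print(pd.DataFrame({'simple': rsi_simple, 'wilder': rsi_wilder}).tail())
```

Either variant is a reasonable model feature. The choice matters mainly when the values are compared against an external data source.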

### `engineer_features` Function

This cell defines the `engineer_features` function. This function is responsible for creating a variety of features from the stock data, building upon the cleaned data and technical indicators. These features are designed to capture different aspects of price movements, trends, volatility, and seasonality, which can be beneficial for machine learning models.

**Parameters:**
- `df` (pandas.DataFrame): The input DataFrame, typically the output from `clean_stock_data` or further processed, with 'Date' as the index.

**Functionality:**
1. Creates a copy of the input DataFrame.
2. **Technical Indicators:** Calls the `add_technical_indicators` function to add RSI, MACD, Bollinger Bands, and volatility measures to the DataFrame.
3. **Returns:**
   - `Daily_Return`: This is already calculated within `add_technical_indicators`.
   - `Weekly_Return`: Calculates the percentage change in the 'Close' price over the last 5 trading days.
   - `Monthly_Return`: Calculates the percentage change in the 'Close' price over the last 21 trading days (approximating a month).
4. **Moving Averages (MA):**
   - `MA_20`: Already calculated as part of Bollinger Bands in `add_technical_indicators`.
   - Calculates Simple Moving Averages (SMAs) of the 'Close' price for various window sizes: 5-day (`MA_5`), 10-day (`MA_10`), 50-day (`MA_50`), 100-day (`MA_100`), and 200-day (`MA_200`).
5. **Rolling Standard Deviation (STD):**
   - Calculates the rolling standard deviation of the 'Close' price for 5-day (`STD_5`) and 20-day (`STD_20`) windows. This serves as another measure of price volatility.
6. **Average Volume:**
   - `Volume_MA_20`: Calculates the 20-day Simple Moving Average of the 'Volume'.
7. **Price Range:**
   - `Price_Range`: Calculates the difference between the 'High' and 'Low' prices for each day, representing the intraday trading range.
8. **Daily Change:**
   - `Daily_Change`: Calculates the difference between the 'Close' and 'Open' prices for each day.
9. **Time-based Features:**
   - Resets the index to make 'Date' a column temporarily.
   - `DayOfWeek`: Extracts the day of the week (0 for Monday, 6 for Sunday) from the 'Date'.
   - `Month`: Extracts the month (1 for January, 12 for December) from the 'Date'.
   - Sets 'Date' back as the DataFrame's index.

**Returns:**
- `pandas.DataFrame`: The DataFrame enriched with a comprehensive set of engineered features.

In [64]:
def engineer_features(df):
    """
    Engineer features for machine learning models
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing clean stock data
        
    Returns:
    --------
    pandas.DataFrame
        DataFrame with engineered features
    """
    df = df.copy()
    
    # --- Technical indicators ---
    df = add_technical_indicators(df)
    
    # --- Returns ---
    # Daily return (already calculated in technical indicators function)
    
    # Weekly return (5 trading days)
    df['Weekly_Return'] = df['Close'].pct_change(5)
    
    # Monthly return (21 trading days)
    df['Monthly_Return'] = df['Close'].pct_change(21)
    
    # --- Moving averages ---
    # MA_20 is already calculated for Bollinger Bands
    df['MA_5'] = df['Close'].rolling(window=5).mean()
    df['MA_10'] = df['Close'].rolling(window=10).mean()
    df['MA_50'] = df['Close'].rolling(window=50).mean()
    df['MA_100'] = df['Close'].rolling(window=100).mean()
    df['MA_200'] = df['Close'].rolling(window=200).mean()
    
    # --- Rolling standard deviation ---
    df['STD_5'] = df['Close'].rolling(window=5).std()
    df['STD_20'] = df['Close'].rolling(window=20).std()
    
    # --- Average volume ---
    df['Volume_MA_20'] = df['Volume'].rolling(window=20).mean()
    
    # --- Price range ---
    df['Price_Range'] = df['High'] - df['Low']
    
    # --- Daily change ---
    df['Daily_Change'] = df['Close'] - df['Open']
    
    # --- Time-based features ---
    # Reset index to get the Date as a column for time-based features
    df = df.reset_index()
    df['DayOfWeek'] = df['Date'].dt.dayofweek
    df['Month'] = df['Date'].dt.month
    
    # Set Date back as index
    df = df.set_index('Date')
    
    return df
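
Rolling and lagged features are undefined until enough history is available, which determines how many leading rows are lost when NaNs are dropped later. A minimal sketch on synthetic prices (not real market data) shows the warm-up cost of two of the features above:

```python
import numpy as np
import pandas as pd

# Synthetic random-walk closes on business days
rng = np.random.default_rng(1)
dates = pd.bdate_range('2020-01-01', periods=300)
close = pd.Series(100 + rng.normal(0, 1, 300).cumsum(), index=dates, name='Close')

features = pd.DataFrame({
    'Close': close,
    'Monthly_Return': close.pct_change(21),       # needs 21 earlier rows
    'MA_200': close.rolling(window=200).mean(),   # needs 200 rows including the current one
})

warmup = features.isna().sum()
usable = features.dropna()

print(warmup)
print(len(features), '->', len(usable))
```

The longest window dominates: with a 200-day moving average, the first 199 rows of every ticker are discarded, regardless of how short the other windows are.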

### `create_target_variables` Function

This cell defines the `create_target_variables` function. The purpose of this function is to generate target variables (labels) for supervised machine learning, specifically for classification tasks aimed at predicting future price movements.

**Parameters:**
- `df` (pandas.DataFrame): The input DataFrame, which should contain engineered features and a 'Close' price column, with 'Date' as the index.

**Functionality:**
1. Creates a copy of the input DataFrame.
2. **Target for Next Day (`Target_1D`):**
   - Shifts the 'Close' price column by -1. This brings the next day's closing price to the current day's row.
   - Compares this future 'Close' price with the current day's 'Close' price.
   - Assigns `1` if the next day's 'Close' is higher than the current day's 'Close' (price increased), and `0` otherwise (price decreased or stayed the same).
3. **Target for Next Week (`Target_1W`):**
   - Shifts the 'Close' price column by -5 (approximating 5 trading days in a week).
   - Compares the 'Close' price 5 days ahead with the current day's 'Close' price.
   - Assigns `1` if the 'Close' price 5 days ahead is higher, and `0` otherwise.
4. **Target for Next Month (`Target_1M`):**
   - Shifts the 'Close' price column by -21 (approximating 21 trading days in a month).
   - Compares the 'Close' price 21 days ahead with the current day's 'Close' price.
   - Assigns `1` if the 'Close' price 21 days ahead is higher, and `0` otherwise.

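**Note:** `shift(-n)` leaves `NaN` in the last `n` rows of the shifted 'Close' series, but the comparison `NaN > Close` evaluates to `False`, so `np.where` labels those rows `0` instead of leaving them as `NaN`. The last `n` rows of each target column therefore carry a placeholder `0` rather than a real label, and a later `dropna()` does not remove them. Trim the final 21 rows (the longest horizon) before model training, or mask the targets so that these rows become `NaN`.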

**Returns:**
- `pandas.DataFrame`: The DataFrame with the newly added target variable columns (`Target_1D`, `Target_1W`, `Target_1M`).

In [65]:
def create_target_variables(df):
    """
    Create target variables for classification
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame with engineered features
        
    Returns:
    --------
    pandas.DataFrame
        DataFrame with target variables
    """
    df = df.copy()
    
    # Target for next day (1 day ahead)
    df['Target_1D'] = np.where(df['Close'].shift(-1) > df['Close'], 1, 0)
    
    # Target for next week (5 trading days ahead)
    df['Target_1W'] = np.where(df['Close'].shift(-5) > df['Close'], 1, 0)
    
    # Target for next month (21 trading days ahead)
    df['Target_1M'] = np.where(df['Close'].shift(-21) > df['Close'], 1, 0)
    
    return df
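
Because a comparison against `NaN` is `False`, the `np.where` pattern above assigns `0` to the rows that have no future price. The sketch below contrasts that behaviour with a masked variant that keeps those rows as `NaN` so they can be dropped. `make_target` is a hypothetical helper, not part of the notebook, and the prices are made up.

```python
import numpy as np
import pandas as pd

close = pd.Series([10.0, 11.0, 10.5, 12.0, 11.5], name='Close')

# Notebook pattern: the last row compares NaN > 11.5, which is False, so it becomes 0
target_np = np.where(close.shift(-1) > close, 1, 0)

def make_target(close: pd.Series, horizon: int) -> pd.Series:
    """Hypothetical variant: 1/0 label, NaN where the future price is unknown."""
    future = close.shift(-horizon)
    label = (future > close).astype('float')
    return label.where(future.notna())

target_nan = make_target(close, 1)

print(list(target_np))
print(target_nan.tolist())
```

With the masked variant, a single `dropna()` removes both the warm-up rows at the start and the unlabeled rows at the end. The targets can then be cast back to integers.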

### `create_directories` Function

This cell defines the `create_directories` function. Its sole purpose is to create the necessary directory structure where the processed data files will be saved. This helps in organizing the project files.

**Functionality:**
1. Uses `os.makedirs('../data/cleaned', exist_ok=True)`.
   - `os.makedirs()`: Creates a directory. If intermediate directories in the path do not exist, it creates them as well.
   - `'../data/cleaned'`: Specifies the path of the directory to be created. This path suggests a project structure where the current notebook is in a subdirectory (e.g., `notebooks`), and the data is stored in a parallel `data` directory, with a `cleaned` subfolder for processed files.
   - `exist_ok=True`: If the directory already exists, this argument prevents the function from raising an error. The operation will simply be skipped.
2. Prints a confirmation message. The message reads "Created directory" even when the directory was already present.

This function is typically called once at the beginning of the main processing script to ensure the output location is ready.

In [66]:
def create_directories():
    """
    Create directories for storing processed data
    """
    os.makedirs('../data/cleaned', exist_ok=True)
    print("Created directory: ../data/cleaned")

### `process_stock` Function

This cell defines the `process_stock` function, which serves as a master orchestrator for processing the data of a single stock ticker. It integrates all the previously defined helper functions (fetch, clean, feature engineer, create targets) into a sequential pipeline.

**Parameters:**
- `ticker` (str): The stock ticker symbol (e.g., 'AAPL').
- `start_date` (str): The start date for fetching data, in 'YYYY-MM-DD' format. Defaults to '2010-01-01'.
- `end_date` (str): The end date for fetching data, in 'YYYY-MM-DD' format. If `None`, it defaults to the current day (handled by `fetch_stock_data`).

**Functionality:**
1. Prints a header to clearly indicate which ticker is currently being processed.
2. **Fetch Data:** Calls `fetch_stock_data(ticker, start_date, end_date)` to download the raw historical data for the specified ticker and period.
3. **Clean Data:** Calls `clean_stock_data(df)` to clean the raw data (handle missing values, sort, etc.). Prints the shape of the DataFrame after cleaning.
4. **Engineer Features:** Calls `engineer_features(df_clean)` to add technical indicators, returns, moving averages, and other relevant features. Prints the shape after feature engineering.
5. **Create Target Variables:** Calls `create_target_variables(df_featured)` to generate binary classification targets for different future time horizons. Prints the shape after adding targets.
6. **Drop NaNs:** Removes rows with any NaN values. NaNs are introduced at the beginning of the series by rolling-window calculations (the 200-day moving average alone leaves the first 199 rows incomplete). The target columns contain no NaNs, because `np.where` fills the final rows with `0`, so this step does not remove the last rows of the series.
7. Prints the final shape of the DataFrame after dropping NaNs.
8. **Reset Index:** Resets the DataFrame's index. This converts the 'Date' index back into a regular column, which is often preferred when saving to a CSV file.

**Returns:**
- `pandas.DataFrame`: A fully processed DataFrame for the given stock, ready for model training or saving to a file. It includes cleaned data, engineered features, and target variables, with 'Date' as a column.

In [67]:
def process_stock(ticker, start_date='2010-01-01', end_date=None):
    """
    Process stock data for a given ticker through the entire pipeline
    
    Parameters:
    -----------
    ticker : str
        Stock ticker symbol
    start_date : str
        Start date in 'YYYY-MM-DD' format
    end_date : str
        End date in 'YYYY-MM-DD' format, default is today
        
    Returns:
    --------
    pandas.DataFrame
        Processed DataFrame ready for use in models
    """
    print(f"\n{'='*80}\nProcessing {ticker}\n{'='*80}")
    
    # 1. Fetch data
    df = fetch_stock_data(ticker, start_date, end_date)
    
    # 2. Clean data
    print(f"\nCleaning data for {ticker}...")
    df_clean = clean_stock_data(df)
    print(f"Shape after cleaning: {df_clean.shape}")
    
    # 3. Engineer features
    print(f"\nEngineering features for {ticker}...")
    df_featured = engineer_features(df_clean)
    print(f"Shape after feature engineering: {df_featured.shape}")
    
    # 4. Create target variables
    print(f"\nCreating target variables for {ticker}...")
    df_with_targets = create_target_variables(df_featured)
    print(f"Shape after adding targets: {df_with_targets.shape}")
    
    # 5. Drop rows with NaNs (due to rolling windows and shifting)
    df_final = df_with_targets.dropna()
    print(f"\nFinal shape after dropping NaNs: {df_final.shape}")
    
    # 6. Reset index to make Date a column again before saving
    df_final = df_final.reset_index()
    
    return df_final

## Create Directories

This cell executes the `create_directories` function defined earlier. Its purpose is to ensure that the target directory (`../data/cleaned/`) for saving the processed stock data CSV files exists before the main processing loop begins. If the directory doesn't exist, it will be created. If it already exists, the function does nothing due to the `exist_ok=True` parameter used in `os.makedirs`.

In [68]:
create_directories()

Created directory: ../data/cleaned


## Process All Stocks

This section defines and then calls a function to process all stock tickers specified in the `tickers` list.

### `process_all_stocks` Function Definition

This cell defines the `process_all_stocks` function. This function iterates through a list of stock tickers, processes each one using the `process_stock` function, and then saves the resulting DataFrame to a CSV file.

**Parameters:**
- `tickers` (list): A list of stock ticker symbols to process.
- `start_date` (str): The start date for fetching data, passed to `process_stock`. Defaults to '2010-01-01'.
- `end_date` (str): The end date for fetching data, passed to `process_stock`. If `None`, it defaults to the current day (handled by `fetch_stock_data` within `process_stock`).

**Functionality:**
1. Iterates through each `ticker` in the provided `tickers` list.
2. **Process Stock:** For each ticker, it calls `process_stock(ticker, start_date, end_date)` to perform the complete data fetching, cleaning, feature engineering, and target creation pipeline.
3. **Save to CSV:**
   - Defines an `output_path` for the CSV file, naming it `{ticker}.csv` and placing it in the `../data/cleaned/` directory.
   - Saves the processed DataFrame (`df`) to this path using `df.to_csv(output_path, index=False)`. `index=False` prevents pandas from writing the DataFrame index as a column in the CSV file (since 'Date' is already a column).
   - Prints a confirmation message indicating where the data was saved.
4. **Error Handling:** Includes a `try-except` block to catch any exceptions that might occur during the processing of a single stock (e.g., data not available for a ticker, network issues). If an error occurs, it prints an error message and uses `continue` to proceed to the next ticker in the list, ensuring that the failure of one stock doesn't halt the entire process.

In [69]:
def process_all_stocks(tickers, start_date='2010-01-01', end_date=None):
    """
    Process all stocks in the list and save the results to CSV files
    
    Parameters:
    -----------
    tickers : list
        List of stock ticker symbols
    start_date : str
        Start date in 'YYYY-MM-DD' format
    end_date : str
        End date in 'YYYY-MM-DD' format, default is today
    """
    for ticker in tickers:
        try:
            # Process the stock
            df = process_stock(ticker, start_date, end_date)
            
            # Save to CSV
            output_path = f'../data/cleaned/{ticker}.csv'
            df.to_csv(output_path, index=False)
            print(f"\nSaved processed data to {output_path}")
            
        except Exception as e:
            print(f"\nError processing {ticker}: {e}")
            continue
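
Printing errors inside the loop keeps the batch running, but the failures are easy to miss in long output. One possible refinement, sketched below under stated assumptions, collects them and returns them to the caller. `run_batch`, `fake_process`, and the ticker names are hypothetical; `fake_process` stands in for `process_stock` so that the example runs without network access, and output goes to a temporary directory rather than `../data/cleaned/`.

```python
import os
import tempfile
import pandas as pd

def run_batch(tickers, process_fn, out_dir):
    """Process each ticker, save a CSV, and collect failures instead of only printing them."""
    saved, failed = [], {}
    for ticker in tickers:
        try:
            df = process_fn(ticker)
            path = os.path.join(out_dir, f'{ticker}.csv')
            df.to_csv(path, index=False)
            saved.append(path)
        except Exception as exc:  # keep going, but remember what went wrong
            failed[ticker] = str(exc)
    return saved, failed

def fake_process(ticker):
    # Stand-in for process_stock: fails for one made-up ticker
    if ticker == 'BAD':
        raise ValueError('no data returned')
    return pd.DataFrame({'Date': ['2024-01-02'], 'Close': [1.0]})

with tempfile.TemporaryDirectory() as tmp:
    saved, failed = run_batch(['AAA', 'BAD', 'CCC'], fake_process, tmp)
    saved_names = [os.path.basename(p) for p in saved]

print(saved_names, failed)
```

Returning the failures makes it possible to retry only the affected tickers after a transient network problem.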

### Executing `process_all_stocks`

This cell calls the `process_all_stocks` function, passing it the `tickers` list defined earlier in the notebook. This action initiates the main data processing loop for all the specified stocks. The `start_date` and `end_date` parameters will use their default values ('2010-01-01' and today, respectively) as they are not explicitly provided in this call.

In this version of the notebook the call is active (not commented out), so running the notebook from top to bottom downloads and processes all 20 tickers. Comment the line out to prevent an accidental long-running operation.

In [70]:
# Process all stocks (comment out to skip this long-running step)
process_all_stocks(tickers)


Processing AAPL
Fetched 3864 rows of data for AAPL from 2010-01-01 to 2025-05-14
                               Open      High       Low     Close     Volume  \
Date                                                                           
2010-01-04 00:00:00-05:00  6.414465  6.446623  6.382908  6.431896  493729600   
2010-01-05 00:00:00-05:00  6.449627  6.479381  6.409054  6.443015  601904800   
2010-01-06 00:00:00-05:00  6.443017  6.468563  6.333920  6.340532  552160000   
2010-01-07 00:00:00-05:00  6.363973  6.371487  6.282827  6.328809  477131200   
2010-01-08 00:00:00-05:00  6.320395  6.371487  6.283128  6.370886  447610800   

                           Dividends  Stock Splits  
Date                                                
2010-01-04 00:00:00-05:00        0.0           0.0  
2010-01-05 00:00:00-05:00        0.0           0.0  
2010-01-06 00:00:00-05:00        0.0           0.0  
2010-01-07 00:00:00-05:00        0.0           0.0  
2010-01-08 00:00:00-05:00        0.0   

## Process Individual Stock Example

This section provides an example of how to process and save data for a single stock. This can be useful for testing the pipeline on a smaller scale, debugging, or when only data for a specific stock is needed without processing the entire list.

### Example: Processing a Single Stock ('AAPL')

This cell demonstrates the steps to process data for a single stock, using 'AAPL' (Apple Inc.) as an example.

**Functionality:**
1. **Define Ticker:** Sets the `ticker` variable to 'AAPL'.
2. **Process Stock:** Calls the `process_stock(ticker)` function. This will fetch, clean, engineer features, and create targets for 'AAPL' using the default start and end dates. The result (a processed DataFrame) is stored in the `df` variable.
3. **Save to CSV:**
   - Defines the `output_path` for the CSV file as `../data/cleaned/AAPL.csv`.
   - Saves the `df` to this path using `df.to_csv(output_path, index=False)`.
   - Prints a confirmation message indicating the save location.

The code in this cell is active in this version of the notebook and reruns the full pipeline for 'AAPL'. Comment it out to skip the example.

In [71]:
# Example for processing a single stock
ticker = 'AAPL'
df = process_stock(ticker)

# Save to CSV
output_path = f'../data/cleaned/{ticker}.csv'
df.to_csv(output_path, index=False)
print(f"\nSaved processed data to {output_path}")


Processing AAPL
Fetched 3864 rows of data for AAPL from 2010-01-01 to 2025-05-14
                               Open      High       Low     Close     Volume  \
Date                                                                           
2010-01-04 00:00:00-05:00  6.414464  6.446622  6.382907  6.431896  493729600   
2010-01-05 00:00:00-05:00  6.449629  6.479382  6.409055  6.443017  601904800   
2010-01-06 00:00:00-05:00  6.443015  6.468561  6.333918  6.340530  552160000   
2010-01-07 00:00:00-05:00  6.363975  6.371489  6.282828  6.328811  477131200   
2010-01-08 00:00:00-05:00  6.320393  6.371486  6.283127  6.370884  447610800   

                           Dividends  Stock Splits  
Date                                                
2010-01-04 00:00:00-05:00        0.0           0.0  
2010-01-05 00:00:00-05:00        0.0           0.0  
2010-01-06 00:00:00-05:00        0.0           0.0  
2010-01-07 00:00:00-05:00        0.0           0.0  
2010-01-08 00:00:00-05:00        0.0   

## Exploratory Analysis of Processed Data

This section demonstrates how to load one of the processed CSV files and perform a brief exploratory analysis. This is a crucial step to verify that the preprocessing pipeline has worked as expected and that the data is in the correct format with the intended features and targets.

### Loading and Examining a Processed File

This cell provides code to load a processed stock data file (e.g., for 'AAPL') and display some basic information about it.

**Functionality:**
1. **Specify Ticker and File Path:**
   - Sets `ticker` to 'AAPL' (or any other ticker for which data has been processed).
   - Constructs the `file_path` to the corresponding CSV file in the `../data/cleaned/` directory.
2. **Check File Existence:** Uses `os.path.exists(file_path)` to ensure the file actually exists before attempting to load it.
3. **If File Exists:**
   - **Load Data:** Reads the CSV file into a pandas DataFrame using `pd.read_csv(file_path)`.
   - **Basic Info:**
     - Prints the shape of the DataFrame (`df.shape`) to show the number of rows and columns.
     - Displays the first 10 rows of the DataFrame using `display(df.head(10))` (`display` is available automatically in Jupyter/IPython) for a visual inspection of the data.
   - **Column List:** Iterates through `df.columns` and prints each column name. This helps verify that all expected features and target variables are present.
   - **Target Distribution:**
     - Prints the normalized value counts for each of the target variables (`Target_1D`, `Target_1W`, `Target_1M`) using `df['Target_...'].value_counts(normalize=True)`. This shows the proportion of 0s and 1s for each target, which is important for understanding class balance in classification tasks.
4. **If File Does Not Exist:** Prints a message indicating that the file was not found and reminds the user to process stocks first.

The code in this cell is shown active. Run it after at least one stock's data has been processed and saved; otherwise it only prints the file-not-found message.

In [72]:
# Run after processing at least one stock
ticker = 'AAPL'  # Change this to any processed ticker
file_path = f'../data/cleaned/{ticker}.csv'

if os.path.exists(file_path):
    df = pd.read_csv(file_path)
    
    # Basic info
    print(f"DataFrame shape: {df.shape}")
    print("\nFirst few rows:")
    display(df.head(10))
    
    print("\nColumns:")
    for col in df.columns:
        print(f"- {col}")
    
    # Target distribution
    print("\nTarget distribution:")
    print(f"Target_1D: {df['Target_1D'].value_counts(normalize=True)}")
    print(f"Target_1W: {df['Target_1W'].value_counts(normalize=True)}")
    print(f"Target_1M: {df['Target_1M'].value_counts(normalize=True)}")
else:
    print(f"File {file_path} not found. Please process stocks first.")

DataFrame shape: (3665, 35)

First few rows:


| | Date | Open | High | Low | Close | Volume | RSI | MACD | MACD_Signal | MACD_Hist | ... | STD_5 | STD_20 | Volume_MA_20 | Price_Range | Daily_Change | DayOfWeek | Month | Target_1D | Target_1W | Target_1M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-10-18 | 9.571358 | 9.587286 | 9.445731 | 9.557233 | 1093010800 | 82.037075 | 0.321773 | 0.264316 | 0.057457 | ... | 0.269323 | 0.302132 | 626992800.0 | 0.141555 | -0.014125 | 0 | 10 | 0 | 0 | 0 |
| 1 | 2010-10-19 | 9.11844 | 9.430101 | 9.016857 | 9.30147 | 1232784000 | 69.540579 | 0.322044 | 0.275862 | 0.046182 | ... | 0.231547 | 0.315237 | 655228280.0 | 0.413244 | 0.18303 | 1 | 10 | 1 | 0 | 0 |
| 2 | 2010-10-20 | 9.286742 | 9.444526 | 9.222727 | 9.332725 | 721624400 | 74.787064 | 0.32108 | 0.284906 | 0.036175 | ... | 0.17844 | 0.330682 | 662045020.0 | 0.221799 | 0.045983 | 2 | 10 | 0 | 0 | 0 |
| 3 | 2010-10-21 | 9.387726 | 9.459255 | 9.220625 | 9.302373 | 551460000 | 75.093013 | 0.314245 | 0.290773 | 0.023471 | ... | 0.113532 | 0.341182 | 650312180.0 | 0.23863 | -0.085353 | 3 | 10 | 0 | 0 | 0 |
| 4 | 2010-10-22 | 9.288844 | 9.317997 | 9.205595 | 9.240758 | 372778000 | 77.737011 | 0.300393 | 0.292697 | 0.007695 | ... | 0.122223 | 0.348249 | 636476680.0 | 0.112402 | -0.048086 | 4 | 10 | 1 | 0 | 1 |
| 5 | 2010-10-25 | 9.289449 | 9.364884 | 9.269913 | 9.281935 | 392462000 | 73.11792 | 0.289402 | 0.292038 | -0.002637 | ... | 0.033848 | 0.354628 | 631958040.0 | 0.094971 | -0.007514 | 0 | 10 | 0 | 0 | 0 |
| 6 | 2010-10-26 | 9.222726 | 9.308982 | 9.186059 | 9.25819 | 392929600 | 71.638178 | 0.275598 | 0.28875 | -0.013152 | ... | 0.036224 | 0.352205 | 599852400.0 | 0.122922 | 0.035464 | 1 | 10 | 0 | 1 | 1 |
| 7 | 2010-10-27 | 9.246173 | 9.313795 | 9.184562 | 9.251583 | 399002800 | 71.258856 | 0.261116 | 0.283223 | -0.022108 | ... | 0.024891 | 0.34735 | 596320340.0 | 0.129233 | 0.00541 | 2 | 10 | 0 | 1 | 1 |
| 8 | 2010-10-28 | 9.255189 | 9.256692 | 9.043308 | 9.173743 | 551051200 | 63.454761 | 0.240584 | 0.274695 | -0.034112 | ... | 0.04064 | 0.329541 | 590203320.0 | 0.213384 | -0.081446 | 3 | 10 | 0 | 1 | 1 |
| 9 | 2010-10-29 | 9.143384 | 9.192973 | 9.042401 | 9.045708 | 430511200 | 56.317371 | 0.211542 | 0.262065 | -0.050522 | ... | 0.096476 | 0.302773 | 589321740.0 | 0.150572 | -0.097676 | 4 | 10 | 1 | 1 | 1 |



Columns:
- Date
- Open
- High
- Low
- Close
- Volume
- RSI
- MACD
- MACD_Signal
- MACD_Hist
- MA_20
- BB_Upper
- BB_Lower
- BB_Width
- BB_Position
- Daily_Return
- Volatility_10D
- Volatility_30D
- Weekly_Return
- Monthly_Return
- MA_5
- MA_10
- MA_50
- MA_100
- MA_200
- STD_5
- STD_20
- Volume_MA_20
- Price_Range
- Daily_Change
- DayOfWeek
- Month
- Target_1D
- Target_1W
- Target_1M

Target distribution:
Target_1D: Target_1D
1    0.527967
0    0.472033
Name: proportion, dtype: float64
Target_1W: Target_1W
1    0.571896
0    0.428104
Name: proportion, dtype: float64
Target_1M: Target_1M
1    0.610641
0    0.389359
Name: proportion, dtype: float64


## Conclusion

This notebook has implemented a comprehensive preprocessing pipeline for stock data that includes:

1. Fetching historical stock data using yfinance
2. Cleaning the data by handling missing values and timezone information
3. Engineering a rich set of features including:
   - Returns (daily, weekly, monthly)
   - Moving averages
   - Technical indicators (RSI, MACD, Bollinger Bands)
   - Volatility measures
   - Price-based features
   - Time-based features
4. Creating classification target variables for next day, week, and month price direction
5. Saving the processed data to CSV files following the project's naming convention

The processed data is ready for use in machine learning models like Prophet, LSTM, and XGBoost.