<h1 style="text-align: center; font-size: 50px;"> Stock Analysis with Pandas </h1>

In this notebook, we run a series of DataFrame operations using standard Pandas on the CPU. The timings are displayed and logged so they can serve as a reference for comparison with the GPU-accelerated version using cuDF.

The data we will be working with is a subset of the [USA 514 Stocks Prices NASDAQ NYSE dataset](https://www.kaggle.com/datasets/olegshpagin/usa-stocks-prices-ohlcv) from Kaggle. It has been split into samples of 5M, 10M, 15M and 20M rows, and should be set up as an asset (Dataset) called USA_Stocks in the AI Studio project.

# Notebook Overview
- Start Execution
- Install and Import Libraries
- Configure Settings
- Verify Assets
- Perform Analysis with Standard Pandas
- Run Analysis and Log Results to MLflow

# Start Execution

In [1]:
import logging  # For application-level logging
import time     # For runtime measurement (wall clock)

# Configure logger
logger: logging.Logger = logging.getLogger("run_workflow_logger")
logger.setLevel(logging.INFO)
logger.propagate = False  # Prevent duplicate logs from parent loggers

# Set formatter
formatter: logging.Formatter = logging.Formatter(
    fmt="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)

# Configure and attach stream handler
stream_handler: logging.StreamHandler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
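
Re-running the logger cell above attaches a second handler to the same logger, so every message is printed twice. A minimal sketch of one way to avoid this is shown below; the helper name `get_notebook_logger` and the logger name `demo_logger` are hypothetical and not part of the original notebook.

```python
import logging


def get_notebook_logger(name: str = "run_workflow_logger") -> logging.Logger:
    """Return a configured logger, attaching a stream handler only once."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # Prevent duplicate logs from parent loggers

    # Only attach a handler the first time; later calls reuse the existing one
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            fmt="%(asctime)s - %(levelname)s - %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        ))
        logger.addHandler(handler)
    return logger


# Calling the helper twice simulates re-running the cell
first = get_notebook_logger("demo_logger")
second = get_notebook_logger("demo_logger")
first.info("This message is emitted once.")
```

Because `logging.getLogger` returns the same object for the same name, the handler check is enough to keep the setup idempotent across cell re-runs.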

In [2]:
start_time = time.time()  

logger.info("Notebook execution started.")

2025-09-10 16:26:13 - INFO - Notebook execution started.


# Install and Import Libraries

In [3]:
%%time

%pip install -r ../../requirements.txt --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
CPU times: user 61.9 ms, sys: 22.8 ms, total: 84.6 ms
Wall time: 4.83 s


In [4]:
# =============================
# Standard Library Imports
# =============================
import os
import sys
import types
import importlib.util
import warnings           # To manage and filter Python warnings
from pathlib import Path  # For object-oriented filesystem paths

# =============================
# Third-Party Library Imports
# =============================
import nbformat           # Notebook file format support
import pandas as pd       # Data manipulation and analysis
import mlflow             # Experiment tracking and model logging

# Configure Settings

In [5]:
# ------------------------ Suppress Verbose Logs ------------------------
warnings.filterwarnings("ignore")

In [6]:
# Directory containing the USA stock parquet datasets
DATASET_DIR = Path("/home/jovyan/datafabric/USA_Stocks/")

# Sample sizes (in millions of rows) to evaluate during the analysis
SAMPLE_SIZES_TO_TEST = [5, 10]

# Rolling window size (in days) used for time-series statistical operations
ROLLING_WINDOW_SIZE = 7

# Name of the MLflow experiment for tracking performance and metrics
MLFLOW_EXPERIMENT_NAME = "USA Stock Analysis with Pandas"

# Verify Assets

In [7]:
# Define required dataset filenames
dataset_filenames = [
    "usa_stocks_5m.parquet",
    "usa_stocks_10m.parquet",
    "usa_stocks_15m.parquet",
    "usa_stocks_20m.parquet",
]

# Construct full dataset file paths using pathlib
dataset_paths = [DATASET_DIR / filename for filename in dataset_filenames]

# Check if all dataset files exist
all_files_exist = all(path.exists() for path in dataset_paths)

# Output the dataset configuration status
if all_files_exist:
    logger.info("Dataset is properly configured")
else:
    logger.warning("Dataset is not properly configured. Please create and download the assets in your AI Studio project")

2025-09-10 16:26:20 - INFO - Dataset is properly configured
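
The check above only reports whether all files exist. When it fails, it can help to know which files are missing. The sketch below illustrates the idea against a temporary directory; the helper name `find_missing_files` is hypothetical, and the empty files stand in for the real parquet assets.

```python
import tempfile
from pathlib import Path


def find_missing_files(directory: Path, filenames: list[str]) -> list[str]:
    """Return the filenames that do not exist inside the given directory."""
    return [name for name in filenames if not (directory / name).exists()]


# Demonstration with a temporary directory instead of the real asset folder
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "usa_stocks_5m.parquet").touch()  # Only one file is present
    missing = find_missing_files(
        demo_dir,
        ["usa_stocks_5m.parquet", "usa_stocks_10m.parquet"],
    )

print(missing)  # → ['usa_stocks_10m.parquet']
```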


# Perform Analysis with Standard Pandas    

In the next cells, we define functions that run different operations on the datasets:
  * A function to describe the dataset
  * A function to aggregate results grouped by "ticker" (the identifier of each stock)
  * A function to aggregate by ticker, year and week
  * A function to compute a rolling mean over a given number of days for each ticker

Each function returns its elapsed time along with its result, and the timings are logged and displayed in the notebook. The functions are then applied to the samples listed in SAMPLE_SIZES_TO_TEST (e.g. [5, 10]). The larger samples (15M and 20M) may be too heavy for some machines, so we recommend choosing sample sizes that fit your available resources.
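
To make the four operations concrete before they are timed, here is a small sketch on a toy DataFrame. The column names `ticker`, `datetime` and `close` match those used in this notebook, but the tickers, dates and prices are made up for illustration, and a 2-day window is used only to keep the example short.

```python
import pandas as pd

# Toy data: two hypothetical tickers with four consecutive daily closes each
toy = pd.DataFrame({
    "ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "datetime": list(pd.date_range("2024-01-01", periods=4, freq="D")) * 2,
    "close": [10.0, 11.0, 12.0, 13.0, 20.0, 22.0, 24.0, 26.0],
})

# 1. Descriptive statistics
summary = toy.describe()

# 2. Simple aggregation: first date, last date and row count per ticker
per_ticker = toy.groupby("ticker").agg({"datetime": ["min", "max", "count"]})

# 3. Composite aggregation: min/max close per ticker, ISO year and ISO week
iso = toy["datetime"].dt.isocalendar()
per_week = toy.groupby(["ticker", iso["year"], iso["week"]]).agg(
    {"close": ["min", "max"]}
)

# 4. Time-based rolling mean of the close price within each ticker
rolling = (
    toy.set_index("datetime")
       .sort_index()
       .groupby("ticker")["close"]
       .rolling("2D")
       .mean()
       .reset_index()
)

print(per_ticker)
print(per_week)
print(rolling)
```

Grouping on the `isocalendar()` columns directly, rather than assigning them to the DataFrame first, leaves the input unchanged. This is one possible variation, not the approach taken in the cells that follow.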

In [8]:
def describe_dataframe(df):
    """
    Compute basic descriptive statistics for the input DataFrame.

    Parameters:
        df (pd.DataFrame): Input DataFrame.

    Returns:
        tuple: (elapsed_time_in_seconds, descriptive_statistics)
    """
    start_time = time.time()
    descriptive_stats = df.describe()
    elapsed_time = time.time() - start_time
    return elapsed_time, descriptive_stats


def aggregate_by_ticker(df):
    """
    Perform simple aggregation grouped by ticker.

    Aggregates:
        - Minimum datetime
        - Maximum datetime
        - Count of records

    Parameters:
        df (pd.DataFrame): Input DataFrame.

    Returns:
        tuple: (elapsed_time_in_seconds, aggregated_dataframe)
    """
    start_time = time.time()
    aggregation_result = df.groupby("ticker").agg({
        "datetime": ["min", "max", "count"]
    })
    elapsed_time = time.time() - start_time
    return elapsed_time, aggregation_result


def aggregate_by_ticker_week(df):
    """
    Perform composite aggregation grouped by ticker, year, and week.

    Aggregates:
        - Minimum closing price
        - Maximum closing price

    Parameters:
        df (pd.DataFrame): Input DataFrame.

    Returns:
        tuple: (elapsed_time_in_seconds, aggregated_dataframe)
    """
    start_time = time.time()
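    # Note: this assignment adds "year", "week" and "day" columns to the
    # caller's DataFrame in place, so later steps (e.g. the rolling mean) see them.
    df[["year", "week", "day"]] = df["datetime"].dt.isocalendar()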
    aggregation_result = df.groupby(["ticker", "year", "week"]).agg({
        "close": ["min", "max"]
    })
    elapsed_time = time.time() - start_time
    return elapsed_time, aggregation_result


def compute_rolling_mean(df, window_days):
    """
    Calculate rolling window mean for each ticker over a given number of days.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        window_days (int): Number of days for the rolling window.

    Returns:
        tuple: (elapsed_time_in_seconds, result_dataframe)
    """
    start_time = time.time()
    result = (
        df.set_index("datetime")
          .sort_index()
          .groupby("ticker")
          .rolling(f"{window_days}D")
          .mean()
          .reset_index()
    )
    elapsed_time = time.time() - start_time
    return elapsed_time, result
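
All four functions above repeat the same start/stop timing pattern. As an illustration only, the pattern could be factored into a small helper. The name `timed` is hypothetical, and `time.perf_counter` is swapped in here because it is a monotonic clock intended for measuring short intervals, whereas the notebook itself uses `time.time`.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[float, Any]:
    """Call func(*args, **kwargs) and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return elapsed, result


# Usage with a built-in function standing in for a DataFrame operation
elapsed, total = timed(sum, range(1_000_000))
print(f"sum took {elapsed:.4f} seconds and returned {total}")
```

With such a helper, a call like `timed(df.describe)` would return the same `(elapsed_time, result)` tuple shape that the functions above produce.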

# Run Analysis and Log Results to MLflow

In [9]:
mlflow.set_tracking_uri('/phoenix/mlflow')
# Set the MLflow experiment to track runs
mlflow.set_experiment(experiment_name=MLFLOW_EXPERIMENT_NAME)

# Loop through each dataset sample size and run analysis
for sample_size in SAMPLE_SIZES_TO_TEST:
    run_name = f"Standard Analysis - {sample_size}M"

    with mlflow.start_run(run_name=run_name):
        # Log configuration parameters
        mlflow.log_param("Computing", "cpu")
        mlflow.log_param("Dataset size in millions of rows", sample_size)
        
        # Load dataset corresponding to the current sample size
        dataset_path = DATASET_DIR / f"usa_stocks_{sample_size}m.parquet"
        df = pd.read_parquet(dataset_path)

        print(f"\n--- Running Analysis for {sample_size}M Rows ---")
        
        # Description
        description_time, _ = describe_dataframe(df)
        mlflow.log_metric("Description_time_seconds", description_time)
        print(f"Description Time      : {description_time:.4f} seconds")
        
        # Simple Aggregation
        simple_agg_time, _ = aggregate_by_ticker(df)
        mlflow.log_metric("Simple_aggregation_time_seconds", simple_agg_time)
        print(f"Simple Aggregation    : {simple_agg_time:.4f} seconds")
        
        # Composite Aggregation
        composite_agg_time, _ = aggregate_by_ticker_week(df)
        mlflow.log_metric("Composite_aggregation_time_seconds", composite_agg_time)
        print(f"Composite Aggregation : {composite_agg_time:.4f} seconds")
        
        # Rolling Window
        rolling_time, _ = compute_rolling_mean(df, ROLLING_WINDOW_SIZE)
        mlflow.log_metric(f"Rolling_window_{ROLLING_WINDOW_SIZE}D_time_seconds", rolling_time)
        print(f"Rolling Window ({ROLLING_WINDOW_SIZE}D) : {rolling_time:.4f} seconds")

2025/09/10 16:26:20 INFO mlflow.tracking.fluent: Experiment with name 'USA Stock Analysis with Pandas' does not exist. Creating a new experiment.



--- Running Analysis for 5M Rows ---
Description Time      : 0.7597 seconds
Simple Aggregation    : 0.1838 seconds
Composite Aggregation : 0.4242 seconds
Rolling Window (7D) : 3.2777 seconds

--- Running Analysis for 10M Rows ---
Description Time      : 1.3147 seconds
Simple Aggregation    : 0.4355 seconds
Composite Aggregation : 0.8405 seconds
Rolling Window (7D) : 8.3399 seconds
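
One way to read the timings above is to compare how each operation scales when the row count doubles from 5M to 10M. The sketch below uses the values printed in this run; they will differ on other machines, and the variable names are illustrative.

```python
# Timings in seconds from the run above: (5M rows, 10M rows)
timings = {
    "Description": (0.7597, 1.3147),
    "Simple Aggregation": (0.1838, 0.4355),
    "Composite Aggregation": (0.4242, 0.8405),
    "Rolling Window (7D)": (3.2777, 8.3399),
}

# A ratio of 2.0 means the time doubled along with the row count
scaling = {name: t10 / t5 for name, (t5, t10) in timings.items()}

for name, ratio in scaling.items():
    print(f"{name:<22}: {ratio:.2f}x slower on 10M rows than on 5M")
```

In this run, the rolling window is both the slowest operation and the one whose time grows the most, by more than a factor of two.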


In [10]:
end_time: float = time.time()
elapsed_time: float = end_time - start_time
elapsed_minutes: int = int(elapsed_time // 60)
elapsed_seconds: float = elapsed_time % 60

logger.info(f"⏱️ Total execution time: {elapsed_minutes}m {elapsed_seconds:.2f}s")
logger.info("✅ Notebook execution completed successfully.")

2025-09-10 16:26:40 - INFO - ⏱️ Total execution time: 0m 27.33s
2025-09-10 16:26:40 - INFO - ✅ Notebook execution completed successfully.


Built with ❤️ using Z by HP AI Studio.