# Introduction and Dataset Overview


We chose a Yahoo Finance dataset covering ETFs and mutual funds. It includes fund-level financial information and daily prices for thousands of funds. ETFs have become very popular because they generally cost less than mutual funds and are easier to manage, so predicting their daily returns is useful for investors.

## What Makes This Dataset Interesting?

- It has a large amount of data, which lets us show how to work with big data using PySpark and Databricks.
- The data contains different types of financial information like prices, volumes, and returns, allowing us to explore, clean, and build models.
- Predicting whether an ETF will go up or down each day is a real problem that can help with making better investment decisions.


In [0]:
# Spark SQL: session, column helpers (when/lit are used by the winsorization cells), and types
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import DoubleType, IntegerType, StringType

# Spark MLlib
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import VectorUDT

# Local analysis and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


# ETF Daily Return Prediction Project

## Problem Statement

Exchange Traded Funds (ETFs) are popular investment vehicles that track indexes or baskets of assets.  
Our goal is to **predict whether an ETF will have a positive daily return** — meaning if the closing price will be higher than the opening price on a given day — **before the market closes**.

This is formulated as a **binary classification problem**, where the target variable indicates whether the ETF's price will increase during the trading day.

## Important Clarification

- While the dataset contains `open` and `close` prices for each day, **we cannot use the `close` price of the day to predict that day's return** because it would be data leakage (the value we want to predict would be an input).
- Instead, the model should use **historical data and engineered features derived from previous days** (e.g., previous day returns, moving averages, volume trends) to predict whether the return will be positive on the day being predicted.
- The `close` price is only used to **label the data**, i.e., to compute if the return was positive or not for training and evaluation.
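
The separation between label and features can be illustrated on a few made-up rows for a single fund. This is a plain-Python sketch rather than the notebook's Spark code: the row values and the helper name `build_examples` are hypothetical. In Spark, the same one-day shift would typically be expressed with `lag` over a window partitioned by `fund_symbol` and ordered by `price_date`.

```python
def build_examples(rows):
    """Pair each day's label with features taken only from the previous day.

    rows: list of dicts with 'price_date', 'open', 'close', 'volume',
    already sorted by date for one fund (toy data, hypothetical).
    """
    examples = []
    # zip(rows, rows[1:]) yields (previous day, current day) pairs,
    # so the first day is skipped because it has no history.
    for prev, today in zip(rows, rows[1:]):
        examples.append({
            "price_date": today["price_date"],
            # Features: computed from the previous day only
            "prev_return_pct": (prev["close"] - prev["open"]) / prev["open"] * 100,
            "prev_volume": prev["volume"],
            # Label: the only place where the current day's close is used
            "daily_return_positive": int(today["close"] > today["open"]),
        })
    return examples


rows = [
    {"price_date": "2020-01-02", "open": 100.0, "close": 102.0, "volume": 1000},
    {"price_date": "2020-01-03", "open": 102.0, "close": 101.0, "volume": 1500},
    {"price_date": "2020-01-06", "open": 101.0, "close": 103.0, "volume": 1200},
]
examples = build_examples(rows)
for example in examples:
    print(example)
```

Each output row carries the current day's label but only the previous day's return and volume, so the value being predicted never appears among the inputs.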

## Dataset

We use the **ETF prices** dataset, which contains daily historical prices for ETFs, including:

- `fund_symbol`: the ETF ticker symbol
- `price_date`: date of the price record
- `open`, `close`, `adj_close`: daily prices (adjusted close accounts for dividends and splits)
- `volume`: trading volume

## Objective

- Load and clean the data using PySpark.
- Create a binary target column `daily_return_positive`:  
  1 if `close > open` (the ETF gained value during the day), 0 otherwise.
- Perform exploratory data analysis (EDA) and data quality checks.
- Engineer features using historical price and volume data **prior to the day to be predicted**.
- Prepare the dataset for machine learning using Spark MLlib.
- Train and evaluate classification models to predict daily return direction *ahead of market close*.

## Next Steps

- Data cleaning (handling missing or zero prices, removing outliers).
- Feature engineering (lagged returns, moving averages, volume changes).
- Constructing the ML pipeline and model training.
- Model evaluation, tuning, and interpretation.

---

This project demonstrates skills in distributed data processing and machine learning with PySpark on Databricks, focusing on financial time series prediction while avoiding data leakage.


In [0]:
etf_prices = spark.read.parquet("dbfs:/tmp/silver/etf_prices").filter("insertion_date = current_date()-1")
display(etf_prices)


In [0]:
display(etf_prices.summary())

### Dataset Summary and Quick Insights

The dataset contains approximately 3.86 million rows of daily ETF price data, including opening, closing, adjusted closing prices, and traded volume.

- **Fund Symbol:** Identifies each ETF (categorical, so no meaningful numeric statistics like mean or standard deviation).
- **Price Variables (`open`, `close`, `adj_close`):**  
  Average prices hover around 120,000, but extremely high standard deviations (~11 million) indicate the presence of significant outliers skewing the data.  
  Minimum prices of zero are suspicious and likely represent missing or erroneous data since real ETF prices should not be zero.
- **Volume:**  
  Trading volume ranges dramatically, from 0 up to nearly 3 billion shares traded in a day. Zero volume may indicate days without trading or data issues. The average volume (~1 million) reflects typical daily activity but with high variability.
- **Outliers and Inconsistencies:**  
  The extreme maximum values in prices and volumes (reaching billions) suggest some data points may be extraordinary market events or more likely data errors needing cleaning.

**Conclusion:**  
Before any modeling or detailed analysis, the data requires thorough cleaning to handle zero and extreme values. Due to the dataset’s size and wide value ranges, sampling and scaling will be crucial for efficient processing and to ensure meaningful machine learning results.
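
To see why a handful of bad records can produce a mean and standard deviation like those above, here is a small plain-Python illustration. The five prices are made up for the example and are not taken from the dataset.

```python
import statistics

# Toy prices (hypothetical): four plausible values and one erroneous entry
prices = [25.0, 30.0, 35.0, 40.0, 11_000_000.0]

mean_price = statistics.mean(prices)      # pulled far upward by the single outlier
median_price = statistics.median(prices)  # unaffected by the outlier
std_price = statistics.stdev(prices)      # dominated by the outlier

print(f"mean={mean_price:,.2f}  median={median_price:,.2f}  std={std_price:,.2f}")
```

The median stays at a realistic price while the mean and standard deviation are driven almost entirely by the one extreme value, which is the pattern the summary table shows.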


Filter out rows with zero or negative values in `open`, `close`, `adj_close`, or `volume`, and drop rows containing nulls, as they likely represent invalid data.


In [0]:
cleaned = etf_prices.filter(
    (col("open") > 0) &
    (col("close") > 0) &
    (col("adj_close") > 0) &
    (col("volume") > 0)
).na.drop()

In [0]:

quantiles = cleaned.approxQuantile(["open", "close", "volume"], [0.01, 0.99], 0.01)
open_low, open_high = quantiles[0]
close_low, close_high = quantiles[1]
vol_low, vol_high = quantiles[2]

In [0]:
# Winsorization: clip values outside [low, high] to the nearest bound
def winsorize(col_name, low, high):
    return when(col(col_name) < low, lit(low)) \
           .when(col(col_name) > high, lit(high)) \
           .otherwise(col(col_name))

cleaned = cleaned.withColumn("open_winsor", winsorize("open", open_low, open_high).cast(DoubleType())) \
                 .withColumn("close_winsor", winsorize("close", close_low, close_high).cast(DoubleType())) \
                 .withColumn("volume_winsor", winsorize("volume", vol_low, vol_high).cast(DoubleType()))


In [0]:
cleaned = cleaned.withColumn("daily_return_positive", (col("close_winsor") > col("open_winsor")).cast(IntegerType()))


In [0]:
cleaned = cleaned.withColumn(
    "daily_return_pct",
    ((col("close_winsor") - col("open_winsor")) / col("open_winsor")) * 100
)


In [0]:
cleaned.select("fund_symbol", "price_date", "open_winsor", "close_winsor", "volume_winsor", "daily_return_positive", "daily_return_pct").show(10, truncate=False)


In [0]:
cleaned = cleaned.drop("open", "close", "volume")

# Rename winsorized columns to original names
cleaned = cleaned.withColumnRenamed("open_winsor", "open") \
                 .withColumnRenamed("close_winsor", "close") \
                 .withColumnRenamed("volume_winsor", "volume")

In [0]:
display(cleaned.summary())

### Revisiting Outlier Treatment: Beyond Winsorization

While we previously applied **winsorization at the 1st and 99th percentiles** to limit the impact of extreme values on `open`, `close`, and `volume`, this approach proved **insufficient for the `daily_return_pct`** variable.

#### Issues Identified:
- From the dataset summary:
  - `daily_return_pct` has a **maximum of over 364%** and a **minimum near -98%**, which is abnormally wide for a daily return.
  - The **standard deviation is ~1.42**, which is disproportionately large compared to the median (0%), indicating **high skew and heavy tails**.
- These extreme values may result from **splits, errors, or rare market events**, and will likely **harm model training** by introducing noise or misleading variance.

#### Solution:
We will now apply a **more robust outlier treatment** using:
- The **IQR (Interquartile Range)** method to detect statistical outliers and clip them to the IQR fences (values are winsorized, not removed).
- A **hard cap at ±100%** daily return to eliminate unrealistic values.

This ensures a more realistic and stable data distribution for classification modeling.
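
The IQR fences and the hard cap can be worked through on a small list of returns. This is a plain-Python sketch with made-up values; the helper names `iqr_bounds` and `clip` are hypothetical, and it uses exact quantiles from the standard library, whereas Spark's `approxQuantile` returns approximate ones.

```python
import statistics


def iqr_bounds(values, absolute_cap=None, k=1.5):
    """Return (lower, upper) fences at q1 - k*IQR and q3 + k*IQR.

    If absolute_cap is given, the fences are tightened to +/- absolute_cap.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    if absolute_cap is not None:
        lower = max(lower, -absolute_cap)
        upper = min(upper, absolute_cap)
    return lower, upper


def clip(values, lower, upper):
    """Winsorize: replace values outside [lower, upper] with the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


# Toy daily returns in percent (hypothetical), including two extreme values
returns = [-98.0, -1.0, -0.5, 0.0, 0.2, 0.5, 1.0, 364.0]
lower, upper = iqr_bounds(returns, absolute_cap=100.0)
clipped = clip(returns, lower, upper)
print(lower, upper, clipped)
```

On this toy list the IQR fences are much tighter than ±100%, so the hard cap only matters when the distribution is wide enough for the fences to exceed it.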


In [0]:
from pyspark.sql.functions import when, col

def iqr_winsorize(df, col_name, absolute_cap=None):
    """Clip col_name to the IQR fences, optionally tightened to ±absolute_cap."""
    q1, q3 = df.approxQuantile(col_name, [0.25, 0.75], 0.01)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    if absolute_cap is not None:
        # Apply the hard cap on both sides so the bounds stay within ±absolute_cap
        lower_bound = max(lower_bound, -absolute_cap)
        upper_bound = min(upper_bound, absolute_cap)

    # Write the clipped values to a new "<col_name>_winsor" column
    return df.withColumn(
        f"{col_name}_winsor",
        when(col(col_name) < lower_bound, lower_bound)
        .when(col(col_name) > upper_bound, upper_bound)
        .otherwise(col(col_name))
    )

# Apply the winsorization to the relevant columns
cleaned_filtered = cleaned
cleaned_filtered = iqr_winsorize(cleaned_filtered, "daily_return_pct", absolute_cap=100.0)
cleaned_filtered = iqr_winsorize(cleaned_filtered, "open", absolute_cap=1000)
cleaned_filtered = iqr_winsorize(cleaned_filtered, "close", absolute_cap=1000)
cleaned_filtered = iqr_winsorize(cleaned_filtered, "volume", absolute_cap=1_000_000)


In [0]:
display(cleaned_filtered.summary())

In [0]:
from pyspark.sql.functions import col, when

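# Drop the original columns. The IQR-clipped return (daily_return_pct_winsor) is dropped as well,
# because returns are recomputed below from the clipped open and close prices.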
cols_to_drop = ["open", "close", "volume", "daily_return_positive", "daily_return_pct","adj_close","daily_return_pct_winsor"]
cleaned_filtered = cleaned_filtered.drop(*cols_to_drop)

# Recalculate daily_return_pct from winsorized open and close
cleaned_filtered = cleaned_filtered.withColumn(
    "daily_return_pct",
    (col("close_winsor") - col("open_winsor")) / col("open_winsor") * 100
)

# Recalculate daily_return_positive from new daily_return_pct
cleaned_filtered = cleaned_filtered.withColumn(
    "daily_return_positive",
    when(col("daily_return_pct") > 0, 1).otherwise(0)
)



In [0]:
display(cleaned_filtered.summary())

### Updated Summary After Winsorization

The dataset after winsorization shows a much-improved distribution across key financial variables:

- **Price columns (`open_winsor`, `close_winsor`)**:
  - Mean prices around 43 USD, with medians near 34 USD.
  - The maximum prices are capped below 100, effectively removing extreme outliers.
  - The standard deviation (~25) indicates moderate variability, which is reasonable for market prices.

- **Volume (`volume_winsor`)**:
  - Mean volume is approximately 93,000 shares.
  - Median volume is 24,600 shares, with a 75th percentile at 138,600, showing a realistic distribution of trading activity.
  - Maximum volume is limited to about 342,550, avoiding the impact of unrealistic spikes.

- **Daily Returns (`daily_return_pct`)**:
  - Mean daily return is close to zero (0.058%), reflecting a balanced market movement.
  - Returns range from approximately -68% to +360%. Because `daily_return_pct` is recomputed from the winsorized open and close prices, the return itself is not capped, so some extreme moves remain.
  - Median and 75th percentile returns are modest (about 0.03% and 0.56%), indicating typical daily fluctuations.

- **Daily Return Positive Flag (`daily_return_positive`)**:
  - Approximately 51% of the daily returns are positive, reflecting a balanced distribution of gains and losses.

Overall, the winsorization and outlier handling have effectively reduced extreme anomalies while preserving meaningful market dynamics, making this dataset suitable for robust analysis and modeling.


In [0]:
display(cleaned_filtered)

### Top 10 Days with Highest Daily Returns

Show the days and ETFs with the highest positive returns as an example of useful exploratory insight.


In [0]:
from pyspark.sql.functions import desc

cleaned_filtered.orderBy(desc("daily_return_pct")) \
                .select("fund_symbol", "price_date", "daily_return_pct") \
                .show(10, truncate=False)



Despite winsorization, some assets still exhibit extremely high daily returns (above 100%), mainly leveraged ETFs and volatile securities. These extreme values can heavily skew model training, potentially leading to overfitting or unstable predictions.

**Financially sensible approach:**

- **Cap daily returns at ±100%** to exclude unrealistic or excessively volatile spikes that are unlikely to represent typical market behavior.
- This threshold balances keeping meaningful volatility signals without letting extreme outliers dominate.
- It aligns with risk management practices by preventing models from reacting disproportionately to rare and extreme events.

By applying this hard cap, we ensure a cleaner dataset, more stable model training, and better generalization for typical market conditions.



In [0]:
import seaborn as sns
import matplotlib.pyplot as plt
sample_pd = cleaned_filtered.select(
    "daily_return_pct", "open_winsor", "close_winsor", "volume_winsor"
).sample(fraction=0.01, seed=42).toPandas()  

plt.figure(figsize=(16, 12))

plt.subplot(2, 2, 1)
sns.histplot(sample_pd["daily_return_pct"], bins=50, kde=True)
plt.title("Distribution of daily_return_pct (winsorized)")

plt.subplot(2, 2, 2)
sns.histplot(sample_pd["open_winsor"], bins=50, kde=True)
plt.title("Distribution of open price (winsorized)")

plt.subplot(2, 2, 3)
sns.histplot(sample_pd["close_winsor"], bins=50, kde=True)
plt.title("Distribution of close price (winsorized)")

plt.subplot(2, 2, 4)
sns.histplot(sample_pd["volume_winsor"], bins=50, kde=True)
plt.title("Distribution of volume (winsorized)")

plt.tight_layout()
plt.show()

### Data Analysis Summary

After applying winsorization and capping, the dataset shows a much cleaner and business-relevant distribution.

- **Daily Returns (`daily_return_pct`)**:  
  The median return is approximately 0.03%, with an interquartile range between -0.28% and +0.56%. Capping extreme values at 100% reduces the influence of rare, highly volatile sessions — common with leveraged or niche ETFs. Around 51% of returns are positive, indicating balanced market movements.

- **Prices (`open_winsor`, `close_winsor`)**:  
  Median prices are around 34 USD, with a max below 100 USD, which is realistic for widely traded ETFs. Because winsorization clips values instead of dropping rows, higher-priced funds stay in the dataset with their prices capped at this ceiling. Tickers such as AAA (Investment Grade Bond ETF) and AAAU (Gold Trust ETF) confirm the dataset includes legitimate, everyday financial assets. Price variability (std ≈ 26 USD) remains within reasonable bounds.

- **Volume (`volume_winsor`)**:  
  The median daily volume is about 24,600 shares, with an upper limit of ~342,000, helping to eliminate outliers without discarding active trading patterns.

**Business context:**  
This cleaned dataset retains credible financial behaviors while removing anomalies, making it robust for training predictive models with reduced risk of bias or overfitting.


In [0]:
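# Namespaced import so Spark's min/max do not shadow Python's built-ins
from pyspark.sql import functions as F

# Apply the ±100% hard cap described above. This column was referenced but never
# created in the earlier cells; it is used from here on.
cleaned_filtered = cleaned_filtered.withColumn(
    "daily_return_pct_capped",
    F.when(F.col("daily_return_pct") > 100.0, 100.0)
     .when(F.col("daily_return_pct") < -100.0, -100.0)
     .otherwise(F.col("daily_return_pct"))
)

# Per-fund row count plus mean/min/max of returns, prices, and volume
fund_summary = cleaned_filtered.groupBy("fund_symbol").agg(
    F.count("*").alias("count"),
    F.mean("daily_return_pct_capped").alias("mean_return_pct"),
    F.min("daily_return_pct_capped").alias("min_return_pct"),
    F.max("daily_return_pct_capped").alias("max_return_pct"),

    F.mean("open_winsor").alias("mean_open"),
    F.min("open_winsor").alias("min_open"),
    F.max("open_winsor").alias("max_open"),

    F.mean("close_winsor").alias("mean_close"),
    F.min("close_winsor").alias("min_close"),
    F.max("close_winsor").alias("max_close"),

    F.mean("volume_winsor").alias("mean_volume"),
    F.min("volume_winsor").alias("min_volume"),
    F.max("volume_winsor").alias("max_volume")
)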

fund_summary.show(10)


### Fund-Level Summary (first 10 rows)

Based on the cleaned and capped dataset:

- **Returns**:
  - Most funds show average daily returns close to 0%, indicating stable performance.
  - Some like `BBEU` have a positive average return (~0.07%), suggesting consistent upward movement.
  - Others like `AWAY` and `BJK` show slightly negative trends, possibly reflecting sector volatility.

- **Prices**:
  - Mean open/close prices vary: `AESR` and `CHII` are low-cost funds ($11–13), while `BBEU` and `BJK` are more expensive ($36–50).
  - Price max caps below $100 ensure focus on commonly traded ETFs.

- **Volume**:
  - Significant variability: some funds like `BBEU` trade with high average volume (~227k), others like `BSMT` or `AQWA` have low liquidity (<10k).
  - This affects model reliability: higher volume tends to go with more stable price behavior.

This summary helps identify fund characteristics (growth vs. stability, liquid vs. illiquid) for modeling or investment strategies.


### Feature Engineering

We now enrich our dataset with new features to improve the modeling phase.

The following engineered features are added:

- `price_diff`: Absolute difference between close and open prices (no ratio, to avoid outlier inflation).
- `volatility_proxy`: Proxy for volatility measured as the absolute value of capped daily return percentage.
- `volume_log`: Log-transformed volume to handle large scale differences and reduce skew.
- `is_high_volume`: Flag if the day's volume is at or above the 75th percentile of volume across all funds.
- `price_avg`: Average of open and close prices — useful as a smoothed price indicator.

These features aim to capture volatility, liquidity, and trend signals without introducing extreme values. They are computed using PySpark's parallelized operations to handle large-scale data efficiently.


In [0]:
# Namespaced import so Python's built-in abs() is not shadowed
from pyspark.sql import functions as F

# Absolute price difference |close - open|
cleaned_filtered = cleaned_filtered.withColumn(
    "price_diff", F.abs(F.col("close_winsor") - F.col("open_winsor"))
)

# Volatility proxy: absolute value of the capped daily return percentage
cleaned_filtered = cleaned_filtered.withColumn(
    "volatility_proxy", F.abs(F.col("daily_return_pct_capped"))
)

# Log-transformed volume to reduce skew
cleaned_filtered = cleaned_filtered.withColumn(
    "volume_log", F.log1p(F.col("volume_winsor"))
)

# Price average (open + close) / 2, a smoother price signal
cleaned_filtered = cleaned_filtered.withColumn(
    "price_avg", (F.col("open_winsor") + F.col("close_winsor")) / 2
)

# High volume flag: 1 if volume is at or above the 75th percentile computed across all funds
volume_75 = cleaned_filtered.approxQuantile("volume_winsor", [0.75], 0.01)[0]

cleaned_filtered = cleaned_filtered.withColumn(
    "is_high_volume",
    F.when(F.col("volume_winsor") >= volume_75, 1).otherwise(0)
)

# Show result
cleaned_filtered.select("fund_symbol", "price_diff", "volatility_proxy", "volume_log", "price_avg", "is_high_volume").show(5)


The average daily price difference is small, around 0.01 to 0.03 USD, indicating stable intraday price movement typical for liquid assets. The volatility proxy captures realistic price fluctuations without extreme spikes, and the log-transformed volume balances varying trading activity levels. About 14% of days are classified as high volume, useful for distinguishing active trading periods.

### Modeling Approach

We aim to predict if an ETF will have a positive daily return (binary classification).  
Key points:

- Target: `daily_return_positive` (1 if that day's close > open, else 0)  
- Features: engineered from historical data *before* the prediction day (lags, moving averages, volume trends etc.)  
- Model: Use Spark MLlib’s classification algorithms (Logistic Regression, Random Forest, or Gradient Boosted Trees)  
- Pipeline: Feature assembler + model + evaluator  
- Evaluation: Use area under ROC curve (AUC), accuracy, and F1-score  
- Train/test split: Time-aware split (train on past dates, test on later dates to avoid leakage)  

This ensures no future data leaks into training, respecting the financial prediction constraints.
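
The three evaluation metrics listed above can be computed by hand on a toy set of predictions. This is a plain-Python sketch: the labels, scores, and helper names `accuracy_and_f1` and `roc_auc` are hypothetical. In Spark, AUC comes from `BinaryClassificationEvaluator`, and `MulticlassClassificationEvaluator` exposes `accuracy` and `f1` metric names.

```python
def accuracy_and_f1(labels, scores, threshold=0.5):
    """Accuracy and positive-class F1 after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Requires at least one example per class.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities of a positive day (hypothetical)
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2]

accuracy, f1 = accuracy_and_f1(labels, scores)
auc = roc_auc(labels, scores)
print(f"accuracy={accuracy:.3f}  f1={f1:.3f}  auc={auc:.3f}")
```

AUC depends only on how the scores rank positive days against negative days, while accuracy and F1 also depend on the chosen threshold.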


In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Feature columns. As computed above, these are derived from the same day's open, close, and volume;
# to honor the no-leakage rule they should be lagged by one trading day per fund before training.
feature_cols = [
    "price_diff",         # |close_winsor - open_winsor|
    "volatility_proxy",   # |daily_return_pct_capped|
    "volume_log",         # log1p of winsorized volume
    "price_avg",          # mean of winsorized open and close
    "is_high_volume"      # 1 if volume is at or above the 75th percentile
]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

rf = RandomForestClassifier(
    labelCol="daily_return_positive",
    featuresCol="features",
    numTrees=50,
    maxDepth=5,
    seed=42
)

pipeline = Pipeline(stages=[assembler, rf])

# Split respecting time (train on earlier dates, test on later dates)
train_df = cleaned_filtered.filter("price_date < '2021-01-01'")
test_df = cleaned_filtered.filter("price_date >= '2021-01-01'")

# Fit model on training set
model = pipeline.fit(train_df)

# Predict on test set
predictions = model.transform(test_df)

# Evaluate with AUC (area under ROC curve)
evaluator = BinaryClassificationEvaluator(
    labelCol="daily_return_positive",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

auc = evaluator.evaluate(predictions)
print(f"Test AUC: {auc:.4f}")


### Model Comparison and Hyperparameter Optimization

To improve our ETF daily return prediction (for now, AUC = 0.68), we test different classification algorithms available in Spark MLlib:  
- Logistic Regression (baseline linear model)  
- Random Forest (already tested)  
- Gradient Boosted Trees (powerful ensemble method)  

Then, we perform hyperparameter tuning using CrossValidator with a simple grid search on key parameters (e.g., maxDepth, numTrees) to improve performance.
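
One point to keep in mind is that `CrossValidator` assigns rows to folds at random, so its folds are not ordered in time. A time-ordered alternative is an expanding-window scheme, sketched below in plain Python. The helper name `expanding_window_folds` and the dates are hypothetical, and this is not something the notebook implements; each fold's date lists could be used to filter the Spark DataFrame on `price_date`.

```python
def expanding_window_folds(dates, n_folds):
    """Split distinct dates into time-ordered (train_dates, validation_dates) folds.

    The dates are cut into n_folds + 1 contiguous chunks. Fold k trains on the
    first k chunks and validates on the next one, so validation dates always
    come after every training date. The last fold absorbs any remainder.
    """
    ordered = sorted(set(dates))
    size = len(ordered) // (n_folds + 1)
    if size == 0:
        raise ValueError("not enough distinct dates for the requested folds")
    folds = []
    for k in range(1, n_folds + 1):
        train = ordered[: k * size]
        end = (k + 1) * size if k < n_folds else len(ordered)
        validation = ordered[k * size : end]
        folds.append((train, validation))
    return folds


# ISO-formatted date strings sort chronologically (toy dates, hypothetical)
dates = [f"2020-01-0{day}" for day in range(1, 9)]
folds = expanding_window_folds(dates, n_folds=3)
for train, validation in folds:
    print(f"train up to {train[-1]}, validate on {validation[0]}..{validation[-1]}")
```

Each fold only validates on dates later than its training dates, which mirrors the time-aware train/test split used for the final evaluation.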



In [0]:
from pyspark.ml.classification import LogisticRegression, GBTClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Prepare features vector as before
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Define classifiers to test
lr = LogisticRegression(labelCol="daily_return_positive", featuresCol="features", maxIter=20)
rf = RandomForestClassifier(labelCol="daily_return_positive", featuresCol="features", seed=42)
gbt = GBTClassifier(labelCol="daily_return_positive", featuresCol="features", maxIter=20, seed=42)

# Create pipelines for each
pipeline_lr = Pipeline(stages=[assembler, lr])
pipeline_rf = Pipeline(stages=[assembler, rf])
pipeline_gbt = Pipeline(stages=[assembler, gbt])

# Train-test split (same as before)
train_df = cleaned_filtered.filter("price_date < '2021-01-01'")
test_df = cleaned_filtered.filter("price_date >= '2021-01-01'")

# Fit and evaluate each model
evaluator = BinaryClassificationEvaluator(labelCol="daily_return_positive", metricName="areaUnderROC")

def train_evaluate(pipeline, train_data, test_data, model_name):
    model = pipeline.fit(train_data)
    preds = model.transform(test_data)
    auc = evaluator.evaluate(preds)
    print(f"{model_name} Test AUC: {auc:.4f}")
    return model, auc

print("Training and evaluating baseline models...")
lr_model, lr_auc = train_evaluate(pipeline_lr, train_df, test_df, "Logistic Regression")
rf_model, rf_auc = train_evaluate(pipeline_rf, train_df, test_df, "Random Forest")
gbt_model, gbt_auc = train_evaluate(pipeline_gbt, train_df, test_df, "Gradient Boosted Trees")

# Hyperparameter tuning for Random Forest (example)
paramGrid = ParamGridBuilder()\
    .addGrid(rf.numTrees, [20, 50, 100])\
    .addGrid(rf.maxDepth, [5, 10])\
    .build()

crossval = CrossValidator(estimator=pipeline_rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3,  # 3-fold cross validation
                          parallelism=4)  # Parallelism for faster tuning

print("Starting hyperparameter tuning for Random Forest...")
cv_model = crossval.fit(train_df)

# Evaluate best model on test data
best_model = cv_model.bestModel
predictions = best_model.transform(test_df)
best_auc = evaluator.evaluate(predictions)
print(f"Best Random Forest after tuning Test AUC: {best_auc:.4f}")


# Conclusion

In this project, we explored a large ETF financial dataset using PySpark, demonstrating key skills in big data processing, feature engineering, and machine learning.

We built several classification models to predict whether an ETF’s price would increase during the trading day, achieving reasonable performance with a test AUC around 0.68 for tree-based models. Despite efforts to tune hyperparameters, model improvements were limited, suggesting that more complex features or external data may be needed for significant gains.

Overall, this project highlights the challenges and opportunities of applying big data techniques to financial time series. The use of PySpark enabled efficient handling of large datasets, and the predictive models provide a solid baseline for future improvements.

This experience underscores the importance of careful feature design and the potential of scalable tools like Spark in real-world data science problems.
