# 02 — Data Cleaning & Feature Engineering
This notebook loads the raw BTC/ETH price data and the Fear & Greed Index, fixes column types and missing values, aligns the three time series on a common date window, and engineers features (daily returns, moving averages, rolling volatility). The result is saved to `data/processed/merged_clean.csv`.

### Load raw datasets
Load the raw CSV files from `data/raw/`. The loader below handles both the older yfinance CSV header format and the clean one-row-header format.

In [1]:
import pandas as pd

def load_price_csv(path: str) -> pd.DataFrame:
    # Peek at the header row to detect which export format the file uses
    preview = pd.read_csv(path, nrows=5)
    if len(preview.columns) > 0 and str(preview.columns[0]).strip().lower() == "price":
        # Old format: first 3 rows are header/ticker/date artifacts
        df = pd.read_csv(
            path,
            skiprows=3,
            header=None,
            names=["Date", "Close", "High", "Low", "Open", "Volume"],
        )
    else:
        # New format (recommended): one header row with Date + yfinance columns
        df = pd.read_csv(path)
        if "Date" not in df.columns:
            raise ValueError(f"Expected a 'Date' column in {path}. Got columns: {list(df.columns)}")
        # Keep only the columns we care about (some exports include 'Adj Close')
        keep = [c for c in ["Date", "Open", "High", "Low", "Close", "Volume"] if c in df.columns]
        df = df[keep]

    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
    df = df.dropna(subset=["Date"]).set_index("Date").sort_index()
    df.index = df.index.normalize()

    # Ensure numerics
    for col in ["Open", "High", "Low", "Close", "Volume"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    return df

def load_fear_greed_csv(path: str) -> pd.DataFrame:
    fg = pd.read_csv(path, parse_dates=["timestamp"])
    fg = fg.rename(columns={"timestamp": "Date"}).set_index("Date").sort_index()
    fg.index = fg.index.normalize()
    fg["FG_Value"] = pd.to_numeric(fg["value"], errors="coerce")
    return fg

btc = load_price_csv("../data/raw/btc_prices.csv")
eth = load_price_csv("../data/raw/eth_prices.csv")
fg = load_fear_greed_csv("../data/raw/fear_greed_index.csv")

print("Raw shapes:", btc.shape, eth.shape, fg.shape)
print("Date ranges:")
print("  BTC:", btc.index.min(), "→", btc.index.max())
print("  ETH:", eth.index.min(), "→", eth.index.max())
print("  F&G:", fg.index.min(), "→", fg.index.max())

Raw shapes: (1079, 5) (1079, 5) (2873, 4)
Date ranges:
  BTC: 2023-01-02 00:00:00 → 2025-12-15 00:00:00
  ETH: 2023-01-02 00:00:00 → 2025-12-15 00:00:00
  F&G: 2018-02-01 00:00:00 → 2025-12-17 00:00:00
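
As a quick check of the format detection, the sketch below feeds both layouts to a reduced version of the loader from in-memory strings. The sample rows and the exact shape of the old three-row header (`Price` / `Ticker` / `Date`) are assumptions for illustration, not copies of the files in `data/raw/`, and the helper names `is_old_format` and `read_prices` are hypothetical.

```python
import io
import pandas as pd

# Assumed old-style export: three header rows (Price / Ticker / Date).
OLD_CSV = """Price,Close,High,Low,Open,Volume
Ticker,BTC-USD,BTC-USD,BTC-USD,BTC-USD,BTC-USD
Date,,,,,
2023-01-02,16688.47,16759.34,16572.23,16625.51,12097775227
2023-01-03,16679.86,16760.45,16622.37,16688.85,13903079207
"""

# Assumed clean export: a single header row starting with Date.
NEW_CSV = """Date,Open,High,Low,Close,Volume
2023-01-02,16625.51,16759.34,16572.23,16688.47,12097775227
2023-01-03,16688.85,16760.45,16622.37,16679.86,13903079207
"""

def is_old_format(text: str) -> bool:
    """Return True when the first header cell is 'Price' (old layout)."""
    preview = pd.read_csv(io.StringIO(text), nrows=5)
    return str(preview.columns[0]).strip().lower() == "price"

def read_prices(text: str) -> pd.DataFrame:
    """Reduced loader: pick the parsing path based on the detected layout."""
    if is_old_format(text):
        df = pd.read_csv(
            io.StringIO(text),
            skiprows=3,
            header=None,
            names=["Date", "Close", "High", "Low", "Open", "Volume"],
        )
    else:
        df = pd.read_csv(io.StringIO(text))
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
    return df.dropna(subset=["Date"]).set_index("Date").sort_index()

old_df = read_prices(OLD_CSV)
new_df = read_prices(NEW_CSV)

# Both layouts should give the same frame once the columns are put in the same order.
pd.testing.assert_frame_equal(old_df[new_df.columns], new_df)
print(list(old_df.columns), list(new_df.columns))
```

The column order differs between the two paths (the old layout lists `Close` first), which is why later cells select columns by name rather than by position.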


### Align date ranges
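To make comparisons fair, we restrict BTC, ETH, and sentiment to the window in which all three series overlap. This trims only the start and end of each series; a date missing from one series inside the window is not restored here.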

In [2]:
common_start = max(btc.index.min(), eth.index.min(), fg.index.min())
common_end = min(btc.index.max(), eth.index.max(), fg.index.max())

btc = btc.loc[common_start:common_end]
eth = eth.loc[common_start:common_end]
fg = fg.loc[common_start:common_end]

print("Aligned date window:", common_start, "→", common_end)
print("Aligned shapes:", btc.shape, eth.shape, fg.shape)
btc.head()

Aligned date window: 2023-01-02 00:00:00 → 2025-12-15 00:00:00
Aligned shapes: (1079, 5) (1079, 5) (1078, 4)


| Date | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 2023-01-02 | 16625.509766 | 16759.34375 | 16572.228516 | 16688.470703 | 12097775227 |
| 2023-01-03 | 16688.847656 | 16760.447266 | 16622.371094 | 16679.857422 | 13903079207 |
| 2023-01-04 | 16680.205078 | 16964.585938 | 16667.763672 | 16863.238281 | 18421743322 |
| 2023-01-05 | 16863.472656 | 16884.021484 | 16790.283203 | 16836.736328 | 13692758566 |
| 2023-01-06 | 16836.472656 | 16991.994141 | 16716.421875 | 16951.96875 | 14413662913 |
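
The aligned shapes differ by one row (1079 for prices, 1078 for sentiment), which suggests that a date is absent from the Fear & Greed series inside the window. A minimal sketch for locating such dates with `DatetimeIndex.difference`, using small synthetic frames rather than the real data:

```python
import pandas as pd

# Synthetic stand-ins for the real frames (all values are made up).
prices = pd.DataFrame(
    {"Close": range(5)},
    index=pd.date_range("2023-01-02", periods=5, freq="D"),
)
sentiment = pd.DataFrame(
    {"FG_Value": [26, 29, 30, 31]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05"]),
)

# Same windowing rule as the cell above: latest start, earliest end.
start = max(prices.index.min(), sentiment.index.min())
end = min(prices.index.max(), sentiment.index.max())
prices = prices.loc[start:end]
sentiment = sentiment.loc[start:end]

# Dates present in prices but absent from sentiment inside the window.
missing = prices.index.difference(sentiment.index)
print(missing.strftime("%Y-%m-%d").tolist())  # ['2023-01-04']
```

Running the same `difference` call on the real `btc.index` and `fg.index` would show which date the sentiment series lacks.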


### Clean missing values & duplicates
Remove duplicate timestamps, forward-fill missing values in the existing rows, and re-check that the price and sentiment columns are numeric before feature engineering.

In [3]:
# Drop duplicates (keep first occurrence)
btc = btc[~btc.index.duplicated(keep="first")]
eth = eth[~eth.index.duplicated(keep="first")]
fg = fg[~fg.index.duplicated(keep="first")]

# Forward fill missing values (common for daily financial series)
btc = btc.ffill()
eth = eth.ffill()
fg = fg.ffill()

# Final numeric enforcement
for col in ["Open", "High", "Low", "Close", "Volume"]:
    if col in btc.columns:
        btc[col] = pd.to_numeric(btc[col], errors="coerce")
    if col in eth.columns:
        eth[col] = pd.to_numeric(eth[col], errors="coerce")

fg["FG_Value"] = pd.to_numeric(fg["FG_Value"], errors="coerce")

print("Duplicates removed and types fixed.")

Duplicates removed and types fixed.
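
Note that `ffill()` only fills `NaN` cells in rows that already exist; it does not create rows for calendar days that are missing from the index. If a gap-free daily index is wanted, one option (an assumption here, not something this notebook does) is to reindex onto a full daily range before filling:

```python
import numpy as np
import pandas as pd

# Synthetic series: 2023-01-04 is absent and 2023-01-03 holds a NaN.
s = pd.Series(
    [56.0, np.nan, 60.0],
    index=pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-05"]),
    name="FG_Value",
)

# ffill alone repairs the NaN but leaves the missing date missing.
filled_only = s.ffill()

# Reindexing to a complete daily range first adds the missing row as NaN,
# which ffill can then fill from the previous day.
full_days = pd.date_range(s.index.min(), s.index.max(), freq="D")
reindexed = s.reindex(full_days).ffill()

print(len(filled_only), len(reindexed))  # 3 4
```

Whether carrying a sentiment value across a missing day is appropriate is a modeling decision; the inner join used later simply drops such days instead.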


### BTC features
Create daily returns, 7-day and 30-day moving averages of the close, and 30-day rolling volatility (the standard deviation of daily returns) for BTC.

In [4]:
# Daily Return
btc["BTC_Return"] = btc["Close"].pct_change()

# Moving Averages
btc["BTC_MA7"]  = btc["Close"].rolling(7).mean()
btc["BTC_MA30"] = btc["Close"].rolling(30).mean()

# Rolling Volatility (30 days)
btc["BTC_Vol30"] = btc["BTC_Return"].rolling(30).std()

print("BTC features created.")

BTC features created.
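
For reference, `pct_change()` computes the simple return `Close[t] / Close[t-1] - 1`, and `rolling(n)` leaves the first `n - 1` rows as `NaN`. The sketch below checks both on the five BTC closes from the `btc.head()` output; the 3-day window is chosen only to fit the small sample.

```python
import pandas as pd

# The five BTC closes printed by btc.head() in the alignment step.
close = pd.Series(
    [16688.470703, 16679.857422, 16863.238281, 16836.736328, 16951.96875],
    index=pd.date_range("2023-01-02", periods=5, freq="D"),
    name="Close",
)

ret = close.pct_change()                # simple daily return
manual = close / close.shift(1) - 1     # the same quantity written out by hand
ma3 = close.rolling(3).mean()           # first 2 rows are NaN
vol3 = ret.rolling(3).std()             # first 3 rows are NaN (the first return is NaN too)

print(ret.round(6).tolist())
print(int(ma3.isna().sum()), int(vol3.isna().sum()))  # 2 3
```

By the same logic, a 30-day rolling standard deviation of returns has 30 leading `NaN` rows: 29 from the window plus one from the undefined first return.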


### ETH features
Create the same feature set for ETH to keep the dataset consistent.

In [5]:
# Daily Return
eth["ETH_Return"] = eth["Close"].pct_change()

# Moving Averages
eth["ETH_MA7"]  = eth["Close"].rolling(7).mean()
eth["ETH_MA30"] = eth["Close"].rolling(30).mean()

# Rolling Volatility (30 days)
eth["ETH_Vol30"] = eth["ETH_Return"].rolling(30).std()

print("ETH features created.")

ETH features created.
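
Because the BTC and ETH cells differ only in the column prefix, the shared logic could be factored into one helper. `add_price_features` below is a hypothetical name and is not part of this notebook; it is a sketch that assumes a `Close` column and the same 7-day and 30-day windows, exercised on synthetic prices.

```python
import numpy as np
import pandas as pd

def add_price_features(df: pd.DataFrame, prefix: str,
                       short: int = 7, long: int = 30) -> pd.DataFrame:
    """Return a copy of df with return, moving-average, and volatility columns."""
    out = df.copy()
    out[f"{prefix}_Return"] = out["Close"].pct_change()
    out[f"{prefix}_MA{short}"] = out["Close"].rolling(short).mean()
    out[f"{prefix}_MA{long}"] = out["Close"].rolling(long).mean()
    out[f"{prefix}_Vol{long}"] = out[f"{prefix}_Return"].rolling(long).std()
    return out

# Synthetic random-walk prices, only to exercise the helper.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"Close": 100 + rng.normal(0, 1, 40).cumsum()},
    index=pd.date_range("2023-01-02", periods=40, freq="D"),
)

demo_btc = add_price_features(demo, "BTC")
demo_eth = add_price_features(demo, "ETH")
print(list(demo_btc.columns))
# ['Close', 'BTC_Return', 'BTC_MA7', 'BTC_MA30', 'BTC_Vol30']
```

Working on a copy keeps the input frame unchanged, so the helper can be called repeatedly without accumulating columns.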


### Merge datasets
Join BTC + ETH + sentiment on date (inner join) to produce one aligned modeling/EDA table.

In [6]:
# Select only the columns we need
btc_subset = btc[["Close", "Volume", "BTC_Return", "BTC_MA7", "BTC_MA30", "BTC_Vol30"]]
eth_subset = eth[["Close", "Volume", "ETH_Return", "ETH_MA7", "ETH_MA30", "ETH_Vol30"]]
fg_subset  = fg[["FG_Value", "value_classification"]]

# Inner Join: BTC + ETH + Sentiment
merged = btc_subset.join(
    eth_subset, lsuffix="_BTC", rsuffix="_ETH", how="inner"
).join(
    fg_subset, how="inner"
)

print(f"Merged shape: {merged.shape}")

Merged shape: (1078, 14)
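
The merged frame has 14 columns because only the overlapping names (`Close`, `Volume`) receive the `_BTC`/`_ETH` suffixes; uniquely named feature columns pass through unchanged. A small sketch of the same join pattern with made-up values:

```python
import pandas as pd

idx = pd.date_range("2023-02-01", periods=3, freq="D")

# Made-up frames: "Close" overlaps, the other column names are unique.
left = pd.DataFrame({"Close": [1.0, 2.0, 3.0], "BTC_Return": [0.1, 0.2, 0.3]}, index=idx)
right = pd.DataFrame({"Close": [10.0, 20.0], "ETH_Return": [0.5, 0.6]}, index=idx[:2])
sent = pd.DataFrame({"FG_Value": [56, 60, 60]}, index=idx)

joined = (
    left.join(right, lsuffix="_BTC", rsuffix="_ETH", how="inner")
        .join(sent, how="inner")
)

print(list(joined.columns))
# ['Close_BTC', 'BTC_Return', 'Close_ETH', 'ETH_Return', 'FG_Value']
print(joined.shape)  # (2, 5)
```

The inner join keeps only dates present in every frame, which is why the third date (missing from `right`) is dropped here.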


### Save cleaned dataset
Drop the rows that the rolling windows leave incomplete (the first 30 rows, where the 30-day features are still `NaN`) and save the final cleaned dataset to `data/processed/merged_clean.csv`.

In [7]:
# Drop NaN rows created by the rolling windows (first 30 days)
merged_clean = merged.dropna()

# Save to processed folder
merged_clean.to_csv("../data/processed/merged_clean.csv")

print(f"Final cleaned data saved. Shape: {merged_clean.shape}")
merged_clean.head()

Final cleaned data saved. Shape: (1048, 14)


| Date | Close_BTC | Volume_BTC | BTC_Return | BTC_MA7 | BTC_MA30 | BTC_Vol30 | Close_ETH | Volume_ETH | ETH_Return | ETH_MA7 | ETH_MA30 | ETH_Vol30 | FG_Value | value_classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-02-01 | 23723.769531 | 26683255504 | 0.025259 | 23231.479074 | 20606.082031 | 0.023814 | 1641.792725 | 8116969489 | 0.034829 | 1602.215402 | 1490.054069 | 0.028151 | 56 | Greed |
| 2023-02-02 | 23471.871094 | 32066936882 | -0.010618 | 23294.206752 | 20832.482487 | 0.024069 | 1643.241577 | 10558081069 | 0.000882 | 1607.949062 | 1504.336161 | 0.028141 | 60 | Greed |
| 2023-02-03 | 23449.322266 | 27083066007 | -0.000961 | 23347.148717 | 21052.01862 | 0.02418 | 1664.745605 | 8169519805 | 0.013086 | 1617.461792 | 1517.943461 | 0.027785 | 60 | Greed |
| 2023-02-04 | 23331.847656 | 15639298538 | -0.00501 | 23390.114118 | 21268.522331 | 0.024251 | 1667.059204 | 5843302512 | 0.00139 | 1630.979527 | 1531.830815 | 0.027695 | 58 | Greed |
| 2023-02-05 | 22955.666016 | 19564262605 | -0.016123 | 23273.128348 | 21468.645573 | 0.024751 | 1631.645874 | 6926696531 | -0.021243 | 1628.906703 | 1543.906376 | 0.028253 | 58 | Greed |
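
When the saved file is read back in a later notebook, the `Date` index has to be restored explicitly, since a CSV stores it as a plain text column. A minimal round-trip sketch, writing to a temporary directory rather than `data/processed/` and using a two-row frame with a subset of the columns:

```python
import os
import tempfile
import pandas as pd

# Two rows and two columns taken from the saved table, for illustration only.
frame = pd.DataFrame(
    {"Close_BTC": [23723.769531, 23471.871094], "FG_Value": [56, 60]},
    index=pd.DatetimeIndex(["2023-02-01", "2023-02-02"], name="Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "merged_clean.csv")
    frame.to_csv(path)
    # index_col + parse_dates restores the DatetimeIndex on load.
    loaded = pd.read_csv(path, index_col="Date", parse_dates=True)

print(type(loaded.index).__name__)  # DatetimeIndex
```

Without `parse_dates=True` the index would come back as strings, and date-based slicing such as `loaded.loc["2023-02"]` would not behave as expected.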
