# Step 0: Project Overview
__In this notebook, we collect raw data for our project:__
- Commodity prices (gold, oil, wheat) via Yahoo Finance (yfinance).
- Kaggle datasets: geopolitical risk index and global news.

All raw files will be stored in the `data/` folder for consistency.

In [None]:
# 01_data_collection
Fetch commodity prices (gold, oil, wheat) and load Kaggle datasets (geopolitical risk, global news)

# Step 1: Install Dependencies (if missing)
__We install required Python packages if not already available:__
- `yfinance` for fetching financial data.
- `kaggle` for downloading datasets from Kaggle.

Uncomment and run the following lines only if packages are missing.

In [None]:
# If you don't have yfinance or kaggle installed, uncomment and run this line:
# !pip install yfinance kaggle


# Step 2: Import Libraries and Set Paths
__We import all the necessary libraries and configure global paths:__
- `os` → file system operations.
- `pandas` / `numpy` → data handling.
- `yfinance` → commodity price data.
- `datetime` → date management.

We also define the root directory, create a `data/` folder if it doesn’t exist, and set our analysis date range.

In [None]:
import os
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime

# paths
ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))  # if notebook lives in notebooks/
DATA_DIR = os.path.join(ROOT, "data")
os.makedirs(DATA_DIR, exist_ok=True)

# choose date range
START = "2000-01-01"
END = None  # None -> yfinance takes up to today
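
The `ROOT` line above only works when the notebook is started from `notebooks/`. A minimal sketch of a less fragile alternative is to walk upward from the working directory until a marker is found. The helper name `find_project_root` and the choice of `.git` as marker are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path

def find_project_root(start, marker=".git"):
    """Return the first directory at or above `start` that contains `marker`.

    Hypothetical helper: falls back to `start` itself when no marker is found.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return start

ROOT = find_project_root(Path.cwd())
DATA_DIR = ROOT / "data"
print("Project root:", ROOT)
print("Data directory:", DATA_DIR)
```

With this approach the same cell could be run from the repository root or from a subfolder, as long as the marker exists somewhere above the working directory.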


# Step 3: Define Commodity Tickers and Fetch Function
__We define the tickers for gold, WTI crude oil, and wheat futures.__
- Each ticker corresponds to a Yahoo Finance symbol.
- We then implement a helper function `fetch_save()` which:
  1. Downloads the data from Yahoo Finance.
  2. Adds daily returns and 5-day rolling volatility.
  3. Saves the data as a CSV in the `data/` folder.

In [None]:
tickers = {
    "Gold": "GC=F",      # Gold futures
    "WTI":  "CL=F",      # WTI crude oil futures
    "Wheat": "ZW=F"      # Wheat futures
}

def fetch_save(ticker_symbol, shortname):
    """Download one ticker, add Return and Vol_5, and save it as CSV in DATA_DIR."""
    print(f"Downloading {shortname} ({ticker_symbol}) ...")
    df = yf.download(ticker_symbol, start=START, end=END, progress=False, auto_adjust=True)

    # Check for an empty result first, so features are never computed on a failed download
    if df.empty:
        print("Warning: empty dataframe for", ticker_symbol)
        return df

    # Add features: daily return and 5-day rolling volatility of returns
    df['Return'] = df['Close'].pct_change()
    df['Vol_5'] = df['Return'].rolling(window=5).std()

    df.reset_index(inplace=True)                 # Date as column
    df.to_csv(os.path.join(DATA_DIR, f"{shortname}.csv"), index=False)
    return df
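
To make the two features concrete, here is a small sketch on synthetic prices (the numbers are invented for illustration, not market data). It shows why the first rows of each saved file have empty `Return` and `Vol_5` values: the return needs a previous price, and the rolling window needs five returns.

```python
import pandas as pd

# Synthetic closing prices on consecutive business days (illustrative only)
close = pd.Series(
    [100.0, 110.0, 99.0, 108.9, 108.9, 119.79, 107.811],
    index=pd.date_range("2000-01-03", periods=7, freq="B"),
    name="Close",
)

features = pd.DataFrame({"Close": close})
# Daily return: (P_t / P_{t-1}) - 1, so the first row is NaN
features["Return"] = features["Close"].pct_change()
# 5-day volatility: sample standard deviation of the last five returns
features["Vol_5"] = features["Return"].rolling(window=5).std()

print(features)
print("Rows without Vol_5:", features["Vol_5"].isna().sum())
```

The first valid `Vol_5` appears on the sixth row, because the window must contain five non-missing returns.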



# Step 4: Fetch and Save Commodity Data
__We use the `fetch_save()` function to download and store data for:__
- Gold futures
- WTI crude oil futures
- Wheat futures

In [None]:
# fetch
df_gold = fetch_save(tickers["Gold"], "gold_futures")
df_wti  = fetch_save(tickers["WTI"], "wti_crude")
df_wheat = fetch_save(tickers["Wheat"], "wheat_futures")

# Step 5: Quick Data Inspection
__We verify that the downloaded datasets contain rows and preview the first few entries.__


In [None]:
for name, df in [("Gold", df_gold), ("WTI", df_wti), ("Wheat", df_wheat)]:
    print(name, "rows:", 0 if df is None else len(df))
    display(df.head())
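
Beyond row counts, a compact summary of date range and missing values can catch a partial download early. The sketch below uses a hypothetical helper, `summarize_frame`, and a tiny stand-in frame; it assumes flat column names with a `Date` column, which is not guaranteed for every yfinance version.

```python
import pandas as pd

def summarize_frame(name, df, date_col="Date"):
    """Hypothetical helper: return basic health metrics for one dataset."""
    if df is None or df.empty:
        return {"name": name, "rows": 0, "start": None, "end": None, "missing": 0}
    dates = pd.to_datetime(df[date_col])
    return {
        "name": name,
        "rows": len(df),
        "start": dates.min().date(),
        "end": dates.max().date(),
        "missing": int(df.isna().sum().sum()),  # total NaN cells in the frame
    }

# Tiny stand-in frame; in the notebook this would be df_gold, df_wti, df_wheat
demo = pd.DataFrame({
    "Date": ["2000-08-30", "2000-08-31", "2000-09-01"],
    "Close": [273.9, 278.3, None],
})
summary = pd.DataFrame([
    summarize_frame("Demo", demo),
    summarize_frame("Empty", pd.DataFrame()),
])
print(summary)
```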


# Step 6: Download Kaggle Datasets
__We now download external datasets from Kaggle for further analysis:__
1. **Gold & Silver Price vs Geopolitical Risk Index**
2. **Global News Dataset**
3. **Gold Price Prediction (LSTM reference notebook, pulled as a Kaggle kernel rather than a dataset)**

⚠️ **Note:**
- This requires you to have your Kaggle API key (`kaggle.json`) set up in `~/.kaggle/`.
- These commands must be run in a notebook or shell with Kaggle configured.

In [None]:
import os

# Make sure the "data" directory targeted by the -p flag below exists
os.makedirs("data", exist_ok=True)

# 1. Gold & Silver Price vs Geopolitical Risk Index
!kaggle datasets download -d shreyanshdangi/gold-silver-price-vs-geopolitical-risk-19852025 -p data --unzip

# 2. Global News Dataset
!kaggle datasets download -d everydaycodings/global-news-dataset -p data --unzip

# 3. Gold Price Prediction (LSTM reference notebook; "kernels pull" fetches its source code)
!kaggle kernels pull farzadnekouei/gold-price-prediction-lstm-96-accuracy -p data/lstm_reference
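
Because the download commands fail without credentials, a quick pre-flight check can save a confusing error. This is a minimal sketch: the function name `kaggle_credentials_ready` is hypothetical, and it only checks the two common setups (the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle/`), not whether the key is actually valid.

```python
import os
from pathlib import Path

def kaggle_credentials_ready(config_dir=None, environ=None):
    """Hypothetical pre-flight check for Kaggle API credentials.

    Returns True when either the KAGGLE_USERNAME/KAGGLE_KEY environment
    variables are set or a kaggle.json file exists in the config directory.
    """
    environ = os.environ if environ is None else environ
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return True
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return (config_dir / "kaggle.json").is_file()

if kaggle_credentials_ready():
    print("Kaggle credentials found; the download commands should authenticate.")
else:
    print("No Kaggle credentials found; set up ~/.kaggle/kaggle.json first.")
```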



# Step 7: Inspect Downloaded Files & Load External Datasets
__We:__
1. List the contents of the `data/` directory (to check Kaggle download results).
2. Attempt to load:
   - Geopolitical Risk (GPR) dataset
   - Global News dataset

⚠️ Note: Filenames may differ, so adapt accordingly.

In [None]:
# list files so you can adapt filenames
for f in sorted(os.listdir(DATA_DIR)):
    print(f)

# The Kaggle commands above download into the relative "data" folder, so point
# DATA_DIR there (note: this replaces the absolute path defined in Step 2)
DATA_DIR = "data"

# Try loading datasets (update filenames if different)
try:
    gpr = pd.read_csv(os.path.join(DATA_DIR, "Gold-Silver-GeopoliticalRisk_HistoricalData.csv"))
except FileNotFoundError:
    print("⚠️ GPR dataset not found.")
    gpr = pd.DataFrame()

try:
    global_news = pd.read_csv(os.path.join(DATA_DIR, "data.csv"))
except FileNotFoundError:
    print("⚠️ Global News dataset not found.")
    global_news = pd.DataFrame()
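
Since filenames may differ between dataset versions, one option is to match on a glob pattern instead of a hard-coded name. The helper `load_first_match` and the pattern used below are assumptions for illustration only; adjust the pattern to whatever the directory listing actually shows.

```python
from pathlib import Path
import pandas as pd

def load_first_match(data_dir, pattern):
    """Hypothetical helper: load the first CSV in `data_dir` matching a glob pattern.

    Returns an empty DataFrame when nothing matches, mirroring the fallback above.
    """
    base = Path(data_dir)
    matches = sorted(base.glob(pattern)) if base.is_dir() else []
    if not matches:
        print(f"No file matching {pattern!r} in {data_dir}")
        return pd.DataFrame()
    print("Loading", matches[0].name)
    return pd.read_csv(matches[0])

# Example call; the pattern is a guess at how the GPR file might be named
gpr_demo = load_first_match("data", "*GeopoliticalRisk*.csv")
print("Rows loaded:", len(gpr_demo))
```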


# Step 8: Add Basic Features to Commodity Prices
__We define a helper function `add_basic_features()` that:__
- Parses the `Date` column
- Ensures chronological order
- Chooses the correct price column (`Adj Close` if available, otherwise `Close`)
- Computes daily returns, 5-day moving average (MA_5), and 5-day rolling volatility (Vol_5)

Then we apply it to Gold, WTI crude, and Wheat, and save the processed results to CSV.

In [None]:
def add_basic_features(df):
    df = df.copy()
    df['Date'] = pd.to_datetime(df['Date'])
    df.sort_values('Date', inplace=True)

    # Keep Date as a regular column so that to_csv(index=False) below still writes it
    price_col = 'Adj Close' if 'Adj Close' in df.columns else 'Close'
    df[price_col] = df[price_col].astype(float)

    df['Return'] = df[price_col].pct_change()
    df['MA_5'] = df[price_col].rolling(window=5).mean()
    df['Vol_5'] = df['Return'].rolling(window=5).std()
    return df

gold_feat = add_basic_features(df_gold) if not df_gold.empty else pd.DataFrame()
wti_feat  = add_basic_features(df_wti) if not df_wti.empty else pd.DataFrame()
wheat_feat= add_basic_features(df_wheat) if not df_wheat.empty else pd.DataFrame()

# Save processed
gold_feat.to_csv(os.path.join(DATA_DIR, "gold_processed.csv"), index=False)
wti_feat.to_csv(os.path.join(DATA_DIR, "wti_processed.csv"), index=False)
wheat_feat.to_csv(os.path.join(DATA_DIR, "wheat_processed.csv"), index=False)

print("✅ Saved processed files to", DATA_DIR)
print("GPR preview:" if not gpr.empty else "⚠️ GPR empty")
print(gpr.head())
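
As a sanity check on `MA_5`, the first five WTI closes shown in this notebook's output can be averaged by hand. The short sketch below reproduces the rolling mean for those values only; it does not download anything.

```python
import pandas as pd

# First five WTI closes as displayed in this notebook's output
close = pd.Series([32.049999, 31.629999, 32.049999, 32.869999, 32.720001])

# A 5-row window has no value until five prices are available
ma_5 = close.rolling(window=5).mean()

print("Rows without MA_5:", ma_5.isna().sum())   # 4
print("First MA_5 value:", round(ma_5.iloc[4], 3))  # 32.264
```

The result agrees with the `MA_5` value of 32.264 reported for 2000-08-29 in the merged WTI preview.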


# Step 9: Prepare Geopolitical Risk (GPR) Data & Merge with Gold
__We:__
1. Convert `DATE` column to datetime.
2. Resample the GPR index to **daily frequency** and forward-fill missing values.
3. Ensure `gold_feat` has `Date` as a column (not index).
4. Merge Gold features with GPR on the `Date` column.

In [None]:
# Assumes the geopolitical risk CSV stores its dates in a 'DATE' column, as in the Kaggle file loaded above
# --- Prepare GPR data ---
gpr['DATE'] = pd.to_datetime(gpr['DATE'])
gpr_daily = gpr.set_index('DATE').resample('D').ffill().reset_index()

# --- Prepare gold_feat ---
# Only reset index if 'Date' is still the index (to avoid duplicates)
if gold_feat.index.name == 'Date':
    gold_feat = gold_feat.reset_index()

print("Gold feature columns:", gold_feat.columns)
print("GPR columns:", gpr_daily.columns)


# --- Merge on Date ---
merged_gold = pd.merge(
    gold_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

print("Merged dataset preview:")
print(merged_gold.head())
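
The resample-and-merge step can be illustrated on a toy frame. The values below are placeholders and the `_demo` names are hypothetical. The point of the sketch is that forward-filling gives every calendar day the most recent known GPR value, so a trading day that is missing from the GPR file still receives a value after the left merge.

```python
import pandas as pd

# Toy GPR frame, newest first (as in the Kaggle file) with a multi-day gap
gpr_demo = pd.DataFrame({
    "DATE": pd.to_datetime(["2000-09-06", "2000-09-01"]),
    "GPRD": [54.75, 45.34],
})

# Upsample to calendar days; each gap day inherits the last known value
gpr_daily_demo = (
    gpr_demo.sort_values("DATE")
    .set_index("DATE")
    .resample("D")
    .ffill()
    .reset_index()
)

# Toy price frame with a trading day (2000-09-05) that is absent from gpr_demo
prices_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2000-09-01", "2000-09-05", "2000-09-06"]),
    "Close": [277.0, 275.8, 274.2],
})

merged_demo = prices_demo.merge(
    gpr_daily_demo, left_on="Date", right_on="DATE", how="left"
)
print(merged_demo[["Date", "Close", "GPRD"]])
```

Forward-filling only carries past values forward, so it does not leak future GPR readings into earlier price rows.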

# Step 10: Merge GPR Data with All Commodities (Gold, Oil, Wheat)
__We improve robustness by:__
- Adding a `flatten_columns()` utility to fix MultiIndex columns (if present).
- Using `safe_reset()` to ensure `Date` is always a column.
- Merging GPR daily data with **Gold, WTI Crude, and Wheat** feature sets.

In [69]:
def flatten_columns(df):
    """Flatten MultiIndex columns if necessary."""
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [
            '_'.join([str(c) for c in col if c]).strip()
            for col in df.columns.values
        ]
    return df

def safe_reset(df):
    if df.index.name == 'Date':
        return df.reset_index()
    return df

# --- Gold ---
gold_feat = flatten_columns(gold_feat)
gold_feat = safe_reset(gold_feat)
merged_gold = pd.merge(
    gold_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

# --- Oil (WTI) ---
wti_feat = flatten_columns(wti_feat)
wti_feat = safe_reset(wti_feat)
merged_wti = pd.merge(
    wti_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

# --- Wheat ---
wheat_feat = flatten_columns(wheat_feat)
wheat_feat = safe_reset(wheat_feat)
merged_wheat = pd.merge(
    wheat_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

# --- Quick checks ---
print("Gold merged shape:", merged_gold.shape)
print("Oil merged shape:", merged_wti.shape)
print("Wheat merged shape:", merged_wheat.shape)

display(merged_gold.head())
display(merged_wti.head())
display(merged_wheat.head())


Gold merged shape: (6288, 26)
Oil merged shape: (6297, 24)
Wheat merged shape: (6311, 24)


Unnamed: 0,level_0,index,Date,Close_GC=F,High_GC=F,Low_GC=F,Open_GC=F,Volume_GC=F,Return,Vol_5,...,GOLD_CHANGE_%,SILVER_PRICE,SILVER_OPEN,SILVER_HIGH,SILVER_LOW,SILVER_CHANGE_%,GPRD,GPRD_ACT,GPRD_THREAT,EVENT
0,0,0,2000-08-30,273.899994,273.899994,273.899994,273.899994,0,,,...,0.29,4.91,4.92,4.91,4.9,-0.41,58.01,48.7,60.93,
1,1,1,2000-08-31,278.299988,278.299988,274.799988,274.799988,0,,,...,1.28,4.97,4.91,4.97,4.96,1.22,42.07,50.33,32.65,
2,2,2,2000-09-01,277.0,277.0,277.0,277.0,0,,,...,-0.32,4.97,4.95,4.97,4.96,0.0,45.34,30.13,55.29,
3,3,3,2000-09-05,275.799988,275.799988,275.799988,275.799988,2,,,...,-0.52,4.95,4.95,4.95,4.94,-0.2,52.97,77.45,29.37,
4,4,4,2000-09-06,274.200012,274.200012,274.200012,274.200012,0,,,...,-0.4,4.95,4.94,4.95,4.94,0.0,54.75,32.75,72.85,


Unnamed: 0,Date,Close_CL=F,High_CL=F,Low_CL=F,Open_CL=F,Volume_CL=F,Return,Vol_5,MA_5,DATE,...,GOLD_CHANGE_%,SILVER_PRICE,SILVER_OPEN,SILVER_HIGH,SILVER_LOW,SILVER_CHANGE_%,GPRD,GPRD_ACT,GPRD_THREAT,EVENT
0,2000-08-23,32.049999,32.799999,31.950001,31.950001,79385,,,,2000-08-23,...,-0.91,4.78,4.82,4.78,4.77,-1.04,45.92,30.52,56.01,
1,2000-08-24,31.629999,32.240002,31.4,31.9,72978,,,,2000-08-24,...,0.63,4.84,4.77,4.84,4.83,1.26,45.18,33.78,46.97,
2,2000-08-25,32.049999,32.099998,31.32,31.700001,44601,,,,2000-08-25,...,0.48,4.88,4.83,4.88,4.87,0.83,50.92,41.8,54.79,
3,2000-08-28,32.869999,32.919998,31.860001,32.040001,46770,,,,2000-08-28,...,-0.05,4.92,4.86,4.92,4.9,0.82,100.5,112.22,93.6,
4,2000-08-29,32.720001,33.029999,32.560001,32.82,49131,,,32.264,2000-08-29,...,-0.24,4.93,4.9,4.93,4.92,0.2,50.46,21.56,65.94,


Unnamed: 0,Date,Close_ZW=F,High_ZW=F,Low_ZW=F,Open_ZW=F,Volume_ZW=F,Return,Vol_5,MA_5,DATE,...,GOLD_CHANGE_%,SILVER_PRICE,SILVER_OPEN,SILVER_HIGH,SILVER_LOW,SILVER_CHANGE_%,GPRD,GPRD_ACT,GPRD_THREAT,EVENT
0,2000-07-17,244.0,248.0,243.75,248.0,7233,,,,2000-07-17,...,0.78,5.01,4.99,5.01,5.0,0.2,56.75,56.57,53.08,
1,2000-07-18,241.25,247.5,241.0,244.25,6523,,,,2000-07-18,...,-0.32,5.02,5.0,5.02,5.01,0.2,31.97,25.5,31.91,
2,2000-07-19,245.0,246.0,239.0,241.25,7136,,,,2000-07-19,...,-1.24,4.97,5.0,4.97,4.96,-1.0,63.04,22.85,95.31,
3,2000-07-20,247.0,247.5,244.0,244.0,3792,,,,2000-07-20,...,0.39,4.98,4.97,4.98,4.97,0.2,65.57,31.38,91.6,
4,2000-07-21,247.25,248.5,246.5,247.0,4380,,,244.9,2000-07-21,...,0.04,4.93,4.97,4.93,4.92,-1.0,22.83,17.07,28.47,
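
The three merges above are identical apart from the input frame, so they could be collapsed into one helper and a loop. The sketch below is one possible shape, using toy data; `merge_with_gpr` is a hypothetical name, and the `validate` argument is an optional addition that raises an error if the GPR side ever contains a repeated date, which would otherwise silently add rows.

```python
import pandas as pd

def merge_with_gpr(features, gpr_daily):
    """Hypothetical helper: left-join one commodity frame with daily GPR data."""
    return pd.merge(
        features,
        gpr_daily,
        left_on="Date",
        right_on="DATE",
        how="left",
        validate="many_to_one",  # raises MergeError if gpr_daily repeats a DATE
    )

# Toy stand-ins for gpr_daily and the three feature frames
gpr_daily_demo = pd.DataFrame({
    "DATE": pd.to_datetime(["2000-09-01", "2000-09-02"]),
    "GPRD": [45.34, 45.34],
})
feature_frames = {
    "Gold": pd.DataFrame({"Date": pd.to_datetime(["2000-09-01"]), "Close": [277.0]}),
    "WTI": pd.DataFrame({"Date": pd.to_datetime(["2000-09-01"]), "Close": [33.38]}),
    "Wheat": pd.DataFrame({"Date": pd.to_datetime(["2000-09-02"]), "Close": [250.0]}),
}

merged_frames = {
    name: merge_with_gpr(df, gpr_daily_demo) for name, df in feature_frames.items()
}
for name, df in merged_frames.items():
    print(name, "merged shape:", df.shape)
```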


# Step 11: Quick Check of Saved Processed Files
__We check if our processed CSVs were successfully created and stored in `data/`.__

In [70]:
# quick check of saved files
for fname in ["gold_processed.csv", "wti_processed.csv", "wheat_processed.csv"]:
    print(fname, "->", os.path.exists(os.path.join(DATA_DIR,fname)))


gold_processed.csv -> True
wti_processed.csv -> True
wheat_processed.csv -> True


# Step 12: Save Notebook & Commit to Git
__Now that we have processed datasets and a working pipeline, we commit our notebook + CSV files to Git for version control.__

⚠️ Run these commands in the **terminal**, not inside the notebook.

In [None]:
git add data/*.csv notebooks/01_data_collection.ipynb
git commit -m "Add data collection notebook + initial processed commodity files"
git push origin main


# Step 13: Set Up Environment and Paths
__We:__
1. Import required Python libraries.
2. Define project root and data directory paths.
3. Set date range for commodity price downloads.

In [7]:
import os
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime

# paths
ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))  # if notebook lives in notebooks/
DATA_DIR = os.path.join(ROOT, "data")
os.makedirs(DATA_DIR, exist_ok=True)

# choose date range
START = "2000-01-01"
END = None  # None -> yfinance takes up to today


# Step 14: Download Commodity Prices (Gold, Oil, Wheat)
__We:__
1. Define Yahoo Finance tickers for **Gold, WTI crude oil, Wheat**.
2. Create `fetch_save()` to download data, calculate returns & 5-day volatility.
3. Save each dataset to CSV in the `data/` folder.

In [20]:
tickers = {
    "Gold": "GC=F",      # Gold futures
    "WTI":  "CL=F",      # WTI crude oil futures
    "Wheat": "ZW=F"      # Wheat futures
}

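def fetch_save(ticker_symbol, shortname):
    """Download one ticker, add Return and Vol_5, and save it as CSV in DATA_DIR."""
    print(f"Downloading {shortname} ({ticker_symbol}) ...")
    df = yf.download(ticker_symbol, start=START, end=END, progress=False, auto_adjust=True)

    # Check for an empty result first, so features are never computed on a failed download
    if df.empty:
        print("Warning: empty dataframe for", ticker_symbol)
        return df

    # Add features: daily return and 5-day rolling volatility of returns
    df['Return'] = df['Close'].pct_change()
    df['Vol_5'] = df['Return'].rolling(window=5).std()

    df.reset_index(inplace=True)                 # Date as column
    df.to_csv(os.path.join(DATA_DIR, f"{shortname}.csv"), index=False)
    return df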

# fetch
df_gold = fetch_save(tickers["Gold"], "gold_futures")
df_wti  = fetch_save(tickers["WTI"], "wti_crude")
df_wheat = fetch_save(tickers["Wheat"], "wheat_futures")



Downloading gold_futures (GC=F) ...
Downloading wti_crude (CL=F) ...
Downloading wheat_futures (ZW=F) ...


Price,Date,Close,High,Low,Open,Volume
Ticker,Unnamed: 1_level_1,ZW=F,ZW=F,ZW=F,ZW=F,ZW=F
0,2000-07-17,244.0,248.0,243.75,248.0,7233
1,2000-07-18,241.25,247.5,241.0,244.25,6523
2,2000-07-19,245.0,246.0,239.0,241.25,7136
3,2000-07-20,247.0,247.5,244.0,244.0,3792
4,2000-07-21,247.25,248.5,246.5,247.0,4380
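
The two header rows in the preview above (`Price` and `Ticker`) indicate that the downloaded frame has MultiIndex columns. The sketch below rebuilds a one-row frame with that header shape (values copied from the preview for illustration) and flattens it in the same way as this notebook's `flatten_columns()` utility, except that it works on a copy instead of modifying its input.

```python
import pandas as pd

# Two-level header similar to the preview above
columns = pd.MultiIndex.from_tuples(
    [("Date", ""), ("Close", "ZW=F"), ("Volume", "ZW=F")],
    names=["Price", "Ticker"],
)
raw = pd.DataFrame([["2000-07-17", 244.0, 7233]], columns=columns)

def flatten_columns(df):
    """Join the levels of MultiIndex columns with '_' and skip empty levels."""
    if isinstance(df.columns, pd.MultiIndex):
        df = df.copy()  # leave the caller's frame untouched
        df.columns = ["_".join(str(c) for c in col if c) for col in df.columns.values]
    return df

flat = flatten_columns(raw)
print(list(flat.columns))  # ['Date', 'Close_ZW=F', 'Volume_ZW=F']
```

This is why the merged previews show names such as `Close_GC=F` rather than plain `Close`.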


# Step 15: Inspect Commodity Datasets
__We quickly check:__
- Number of rows per dataset.
- First few records to confirm structure and values.

In [10]:
for name, df in [("Gold", df_gold), ("WTI", df_wti), ("Wheat", df_wheat)]:
    print(name, "rows:", 0 if df is None else len(df))
    display(df.head())


Gold rows: 6287


Price,Date,Close,High,Low,Open,Volume
Ticker,Unnamed: 1_level_1,GC=F,GC=F,GC=F,GC=F,GC=F
0,2000-08-30,273.899994,273.899994,273.899994,273.899994,0
1,2000-08-31,278.299988,278.299988,274.799988,274.799988,0
2,2000-09-01,277.0,277.0,277.0,277.0,0
3,2000-09-05,275.799988,275.799988,275.799988,275.799988,2
4,2000-09-06,274.200012,274.200012,274.200012,274.200012,0


WTI rows: 6296


Price,Date,Close,High,Low,Open,Volume
Ticker,Unnamed: 1_level_1,CL=F,CL=F,CL=F,CL=F,CL=F
0,2000-08-23,32.049999,32.799999,31.950001,31.950001,79385
1,2000-08-24,31.629999,32.240002,31.4,31.9,72978
2,2000-08-25,32.049999,32.099998,31.32,31.700001,44601
3,2000-08-28,32.869999,32.919998,31.860001,32.040001,46770
4,2000-08-29,32.720001,33.029999,32.560001,32.82,49131


Wheat rows: 6311


Price,Date,Close,High,Low,Open,Volume
Ticker,Unnamed: 1_level_1,ZW=F,ZW=F,ZW=F,ZW=F,ZW=F
0,2000-07-17,244.0,248.0,243.75,248.0,7233
1,2000-07-18,241.25,247.5,241.0,244.25,6523
2,2000-07-19,245.0,246.0,239.0,241.25,7136
3,2000-07-20,247.0,247.5,244.0,244.0,3792
4,2000-07-21,247.25,248.5,246.5,247.0,4380


# Step 16: Download Kaggle Datasets
__We:__
- Download datasets required for analysis via the Kaggle API.
- Ensure Kaggle credentials are set up (`~/.kaggle/kaggle.json`).
- Datasets include:
  1. **Gold & Silver Price vs Geopolitical Risk Index**
  2. **Global News Dataset**
  3. **Gold Price Prediction (LSTM reference)**

⚠️ These commands run in the **notebook**, but setup of the Kaggle API key is done in the **terminal**.

In [11]:
# run in terminal (not in notebook) from repo root
# pip install kaggle
# mkdir -p ~/.kaggle && cp /path/to/kaggle.json ~/.kaggle/kaggle.json && chmod 600 ~/.kaggle/kaggle.json

import os

# Make sure the "data" directory targeted by the -p flag below exists
os.makedirs("data", exist_ok=True)

# 1. Gold & Silver Price vs Geopolitical Risk Index
!kaggle datasets download -d shreyanshdangi/gold-silver-price-vs-geopolitical-risk-19852025 -p data --unzip

# 2. Global News Dataset
!kaggle datasets download -d everydaycodings/global-news-dataset -p data --unzip

# 3. Gold Price Prediction (LSTM reference notebook; "kernels pull" fetches its source code)
!kaggle kernels pull farzadnekouei/gold-price-prediction-lstm-96-accuracy -p data/lstm_reference



Dataset URL: https://www.kaggle.com/datasets/shreyanshdangi/gold-silver-price-vs-geopolitical-risk-19852025
License(s): Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Downloading gold-silver-price-vs-geopolitical-risk-19852025.zip to data






Dataset URL: https://www.kaggle.com/datasets/everydaycodings/global-news-dataset
License(s): CC0-1.0
Downloading global-news-dataset.zip to data






Source code downloaded to data/lstm_reference\gold-price-prediction-lstm-96-accuracy.ipynb


# Step 17: Load Kaggle Datasets into DataFrames
__We:__
1. List available files in `data/` to adapt filenames if needed.
2. Load **Geopolitical Risk Index** and **Global News** datasets.
3. Provide fallback placeholders (`None`) if datasets are missing.

In [55]:
# list files so you can adapt filenames
for f in sorted(os.listdir(DATA_DIR)):
    print(f)

DATA_DIR = "data"

# Try loading datasets
gpr_path = os.path.join(DATA_DIR, "Gold-Silver-GeopoliticalRisk_HistoricalData.csv")
news_path = os.path.join(DATA_DIR, "data.csv")

gpr = pd.read_csv(gpr_path) if os.path.exists(gpr_path) else None
global_news = pd.read_csv(news_path) if os.path.exists(news_path) else None

All_Historical_Data_Separately
Gold-Silver-GeopoliticalRisk_HistoricalData.csv
data.csv
gold_futures.csv
gold_processed.csv
lstm_reference
rating.csv
raw-data.csv
wheat_futures.csv
wheat_processed.csv
wti_crude.csv
wti_processed.csv


# Step 18: Feature Engineering for Commodity Prices
__We:__
1. Define `add_basic_features()` to compute:
   - Returns (`pct_change`)
   - 5-day moving average (`MA_5`)
   - 5-day volatility (`Vol_5`)
2. Apply it to Gold, WTI, and Wheat datasets.
3. Save processed outputs as `*_processed.csv`.

In [56]:
def add_basic_features(df):
    df = df.copy()
    df['Date'] = pd.to_datetime(df['Date'])
    df.sort_values('Date', inplace=True)
    df.set_index('Date', inplace=True)
    price_col = 'Adj Close' if 'Adj Close' in df.columns else 'Close'
    df[price_col] = df[price_col].astype(float)
    df['Return'] = df[price_col].pct_change()
    df['MA_5'] = df[price_col].rolling(window=5).mean()
    df['Vol_5'] = df['Return'].rolling(window=5).std()
    return df

gold_feat = add_basic_features(df_gold) if not df_gold.empty else pd.DataFrame()
wti_feat  = add_basic_features(df_wti) if not df_wti.empty else pd.DataFrame()
wheat_feat= add_basic_features(df_wheat) if not df_wheat.empty else pd.DataFrame()

# Save processed (Date is the index at this point, so keep the index when writing)
gold_feat.to_csv(os.path.join(DATA_DIR, "gold_processed.csv"), index=True)
wti_feat.to_csv(os.path.join(DATA_DIR, "wti_processed.csv"), index=True)
wheat_feat.to_csv(os.path.join(DATA_DIR, "wheat_processed.csv"), index=True)

print("Saved processed files to", DATA_DIR)
if gpr is not None:
    print(gpr.head())


Saved processed files to data
         DATE  GOLD_PRICE  GOLD_OPEN  GOLD_HIGH  GOLD_LOW  GOLD_CHANGE_%  \
0  2025-09-10     3630.90    3633.61    3634.42   3620.90          -0.07   
1  2025-09-09     3633.61    3637.10    3674.75   3625.33          -0.06   
2  2025-09-08     3635.84    3586.82    3646.60   3579.67           1.24   
3  2025-09-07     3591.19    3592.07    3596.56   3586.95           0.12   
4  2025-09-05     3586.81    3547.00    3600.33   3540.05           1.15   

   SILVER_PRICE  SILVER_OPEN  SILVER_HIGH  SILVER_LOW  SILVER_CHANGE_%  \
0         40.92        40.89        40.94       40.72             0.09   
1         40.89        41.34        41.50       40.77            -1.13   
2         41.36        41.01        41.68       40.51             1.20   
3         40.86        41.00        41.01       40.76            -0.34   
4         41.01        40.69        41.44       40.55             0.76   

     GPRD  GPRD_ACT  GPRD_THREAT EVENT  
0     NaN       NaN        

# Step 19: Merge Gold Prices with Geopolitical Risk Index
__We:__
1. Convert GPR `DATE` column to datetime and resample daily.
2. Reset index for `gold_feat` if Date is still the index.
3. Merge Gold data with GPR on `Date`.

In [61]:

# --- Prepare GPR data ---
gpr['DATE'] = pd.to_datetime(gpr['DATE'])
gpr_daily = gpr.set_index('DATE').resample('D').ffill().reset_index()

# --- Prepare gold_feat ---
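gold_feat = flatten_columns(gold_feat)  # flatten MultiIndex if needed
# Reset only while Date is still the index, so reruns do not add stray 'index'/'level_0' columns
if gold_feat.index.name == 'Date':
    gold_feat = gold_feat.reset_index()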

print("Gold feature columns:", gold_feat.columns)
print("GPR columns:", gpr_daily.columns)


# --- Merge on Date ---
merged_gold = pd.merge(
    gold_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

print("Merged dataset preview:")
print(merged_gold.head())

Gold feature columns: Index(['level_0', 'index', 'Date', 'Close_GC=F', 'High_GC=F', 'Low_GC=F',
       'Open_GC=F', 'Volume_GC=F', 'Return', 'Vol_5', 'MA_5'],
      dtype='object')
GPR columns: Index(['DATE', 'GOLD_PRICE', 'GOLD_OPEN', 'GOLD_HIGH', 'GOLD_LOW',
       'GOLD_CHANGE_%', 'SILVER_PRICE', 'SILVER_OPEN', 'SILVER_HIGH',
       'SILVER_LOW', 'SILVER_CHANGE_%', 'GPRD', 'GPRD_ACT', 'GPRD_THREAT',
       'EVENT'],
      dtype='object')
Merged dataset preview:
   level_0  index       Date  Close_GC=F   High_GC=F    Low_GC=F   Open_GC=F  \
0        0      0 2000-08-30  273.899994  273.899994  273.899994  273.899994   
1        1      1 2000-08-31  278.299988  278.299988  274.799988  274.799988   
2        2      2 2000-09-01  277.000000  277.000000  277.000000  277.000000   
3        3      3 2000-09-05  275.799988  275.799988  275.799988  275.799988   
4        4      4 2000-09-06  274.200012  274.200012  274.200012  274.200012   

   Volume_GC=F  Return  Vol_5  ...  GOLD_CHANGE_% 

# Step 20: Merge All Commodities with Geopolitical Risk Index
__We:__
1. Define utility functions:
   - `flatten_columns()` → handle MultiIndex columns.
   - `safe_reset()` → reset index if needed.
2. Apply preprocessing to Gold, WTI, and Wheat.
3. Merge each with daily GPR data.
4. Display merged dataset previews.

In [62]:
# --- Prepare GPR data ---
gpr['DATE'] = pd.to_datetime(gpr['DATE'])
gpr_daily = gpr.set_index('DATE').resample('D').ffill().reset_index()

def flatten_columns(df):
    """Flatten MultiIndex columns if necessary."""
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [
            '_'.join([str(c) for c in col if c]).strip()
            for col in df.columns.values
        ]
    return df

def safe_reset(df):
    if df.index.name == 'Date':
        return df.reset_index()
    return df

# --- Gold ---
gold_feat = flatten_columns(gold_feat)
gold_feat = safe_reset(gold_feat)
merged_gold = pd.merge(
    gold_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

# --- Oil (WTI) ---
wti_feat = flatten_columns(wti_feat)
wti_feat = safe_reset(wti_feat)
merged_wti = pd.merge(
    wti_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

# --- Wheat ---
wheat_feat = flatten_columns(wheat_feat)
wheat_feat = safe_reset(wheat_feat)
merged_wheat = pd.merge(
    wheat_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

# --- Quick checks ---
print("Gold merged shape:", merged_gold.shape)
print("Oil merged shape:", merged_wti.shape)
print("Wheat merged shape:", merged_wheat.shape)

display(merged_gold.head())
display(merged_wti.head())
display(merged_wheat.head())




# Step 21: Quick Check of Saved Files (Notebook)

In [63]:
# Quick check of saved processed files
for fname in ["gold_processed.csv", "wti_processed.csv", "wheat_processed.csv"]:
    exists = os.path.exists(os.path.join(DATA_DIR, fname))
    print(f"{fname} -> {'✅ exists' if exists else '❌ missing'}")

gold_processed.csv -> ✅ exists
wti_processed.csv -> ✅ exists
wheat_processed.csv -> ✅ exists
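
An existence check does not confirm that a file holds the expected columns. A minimal follow-up, sketched here with a hypothetical helper `check_processed_csv` and a temporary demo file, reads only the header and reports anything missing. The list of required columns is an assumption based on the features computed in this notebook.

```python
import os
import tempfile
import pandas as pd

def check_processed_csv(path, required=("Date", "Return", "MA_5", "Vol_5")):
    """Hypothetical check: report which required columns a saved CSV is missing."""
    header = pd.read_csv(path, nrows=0)  # read the header row only
    return [col for col in required if col not in header.columns]

# Demonstration on a temporary file so the real data/ folder is untouched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_processed.csv")
    pd.DataFrame(
        {"Date": ["2000-08-30"], "Return": [None], "MA_5": [None]}
    ).to_csv(demo_path, index=False)
    print("Missing columns:", check_processed_csv(demo_path))
```

Note that flattened frames carry suffixed price columns such as `Close_GC=F`, so a check on price columns would need the ticker-specific name.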


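# Step 22: Save to Git (Terminal, not notebook)
__These are shell commands. Executed as-is in a Python cell they raise the `SyntaxError` shown below, so run them in a terminal or prefix each line with `!`.__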

In [68]:
git add notebooks/notebooks/01_data_collection.ipynb notebooks/notebooks/data/*.csv .gitignore
git commit -m "Add data collection notebook + initial processed commodity files"
git push origin main


SyntaxError: invalid decimal literal (495343877.py, line 1)
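
If the commit has to be driven from Python rather than a terminal, the commands can be passed to `subprocess` as argument lists. The sketch below is an assumption-laden illustration: `git_commands` and `run_git` are hypothetical helpers, the file path is taken from the earlier commit step, and the default `dry_run=True` only prints the commands. Shell wildcards such as `data/*.csv` are not expanded by `subprocess`, so explicit paths (or `glob.glob`) would be needed.

```python
import subprocess

def git_commands(paths, message, remote="origin", branch="main"):
    """Hypothetical helper: build the argument lists for add, commit and push."""
    return [
        ["git", "add", *paths],
        ["git", "commit", "-m", message],
        ["git", "push", remote, branch],
    ]

def run_git(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stops at the first failing command

commands = git_commands(
    ["notebooks/01_data_collection.ipynb"],
    "Add data collection notebook + initial processed commodity files",
)
run_git(commands, dry_run=True)
```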