# Step 0: Project Overview
__In this notebook, we collect raw data for our project:__
- Commodity prices (gold, oil, wheat) via Yahoo Finance (yfinance).
- Kaggle datasets: geopolitical risk index and global news.

All raw files will be stored in the `data/` folder for consistency.

In [None]:
# 01_data_collection
Fetch commodity prices (gold, oil, wheat) and load Kaggle datasets (geopolitical risk, global news)

# Step 1: Install Dependencies (if missing)
__We install required Python packages if not already available:__
- `yfinance` for fetching financial data.
- `kaggle` for downloading datasets from Kaggle.

Uncomment and run the following lines only if packages are missing.

In [None]:
# If you don't have yfinance or kaggle installed, run these:
 !pip install yfinance kaggle


# Step 2: Import Libraries and Set Paths
__We import all the necessary libraries and configure global paths:__
- `os` ‚Üí file system operations.
- `pandas` / `numpy` ‚Üí data handling.
- `yfinance` ‚Üí commodity price data.
- `datetime` ‚Üí date management.

We also define the root directory, create a `data/` folder if it doesn‚Äôt exist, and set our analysis date range.

In [14]:
import os
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime

# --- Project root ---
ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))  # notebook assumed in /notebooks
print(f"ROOT: {ROOT}")

# --- Data folders ---
DATA_DIR = os.path.join(ROOT, "data")
RAW_DIR = os.path.join(DATA_DIR, "raw")
PROCESSED_DIR = os.path.join(DATA_DIR, "processed")
MERGED_DIR = os.path.join(DATA_DIR, "merged")

os.makedirs(RAW_DIR, exist_ok=True)
os.makedirs(PROCESSED_DIR, exist_ok=True)
os.makedirs(MERGED_DIR, exist_ok=True)

print("ROOT:", ROOT)
print("RAW_DIR:", RAW_DIR)
print("PROCESSED_DIR:", PROCESSED_DIR)

# --- Ensure all exist ---
for folder in [DATA_DIR, RAW_DIR, PROCESSED_DIR, MERGED_DIR]:
    os.makedirs(folder, exist_ok=True)

print("Data folders ready:")
for f in [RAW_DIR, PROCESSED_DIR, MERGED_DIR]:
    print(" -", f)
# choose date range
START = "1985-01-01"
END = None  # None -> yfinance takes up to today


ROOT: C:\Users\taton\PycharmProjects\capstone-data-science
ROOT: C:\Users\taton\PycharmProjects\capstone-data-science
RAW_DIR: C:\Users\taton\PycharmProjects\capstone-data-science\data\raw
PROCESSED_DIR: C:\Users\taton\PycharmProjects\capstone-data-science\data\processed
Data folders ready:
 - C:\Users\taton\PycharmProjects\capstone-data-science\data\raw
 - C:\Users\taton\PycharmProjects\capstone-data-science\data\processed
 - C:\Users\taton\PycharmProjects\capstone-data-science\data\merged


# Step 3: Define Commodity Tickers and Fetch Function
__We define the tickers for gold, WTI crude oil, and wheat futures.__
- Each ticker corresponds to a Yahoo Finance symbol.
- We then implement a helper function `fetch_save()` which:
  1. Downloads the data from Yahoo Finance.
  2. Adds daily returns and 5-day rolling volatility.
  3. Saves the data as a CSV in the `data/` folder.

In [21]:
tickers = {
    "Gold": "GC=F",      # Gold futures
    "WTI":  "CL=F",      # WTI crude oil future
    "Wheat": "ZW=F"      # Wheat futures
}
def add_features(df, price_col="Close"):
    df = df.copy()
    df["Return"] = df[price_col].pct_change()
    df["MA_5"] = df[price_col].rolling(5).mean()
    df["Vol_5"] = df["Return"].rolling(5).std()
    return df

def flatten_columns(df):
    """Flatten MultiIndex columns if present."""
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [
            '_'.join([str(c) for c in col if c]).strip()
            for col in df.columns.values
        ]
    return df

def fetch_save(ticker_symbol, shortname):
    print(f"Downloading {shortname} ({ticker_symbol}) ...")
    df = yf.download(ticker_symbol, start=START, end=END, progress=False, auto_adjust=True)

    if df.empty:
        print(f"‚ö†Ô∏è Warning: empty dataframe for {shortname}")
        return pd.DataFrame()

    # Save raw data
    df = flatten_columns(df)
    raw_path = os.path.join(RAW_DIR, f"{shortname.lower()}_raw.csv")
    df.reset_index(inplace=True)
    df.to_csv(raw_path, index=False)
    print(f"‚úÖ Raw {shortname} data saved to {raw_path}")

    # Add features
    df = add_features(df, price_col="Close")

    # Save processed data
    processed_path = os.path.join(PROCESSED_DIR, f"{shortname.lower()}_processed.csv")
    df.to_csv(processed_path, index=False)
    print(f"‚úÖ Processed {shortname} data saved to {processed_path}")

    return df



# Step 4: Fetch and Save Commodity Data
__We use the `fetch_save()` function to download and store data for:__
- Gold futures
- WTI crude oil futures
- Wheat futures

In [22]:
# --- Run for all commodities ---
df_gold = fetch_save(tickers["Gold"], "Gold")
df_wti = fetch_save(tickers["WTI"], "WTI")
df_wheat = fetch_save(tickers["Wheat"], "Wheat")

print("\nüìÇ Raw data files:", os.listdir(RAW_DIR))
print("üìÇ Processed data files:", os.listdir(PROCESSED_DIR))

Downloading Gold (GC=F) ...
‚úÖ Raw Gold data saved to C:\Users\taton\PycharmProjects\capstone-data-science\data\raw\gold_raw.csv
‚úÖ Processed Gold data saved to C:\Users\taton\PycharmProjects\capstone-data-science\data\processed\gold_processed.csv
Downloading WTI (CL=F) ...
‚úÖ Raw WTI data saved to C:\Users\taton\PycharmProjects\capstone-data-science\data\raw\wti_raw.csv
‚úÖ Processed WTI data saved to C:\Users\taton\PycharmProjects\capstone-data-science\data\processed\wti_processed.csv
Downloading Wheat (ZW=F) ...
‚úÖ Raw Wheat data saved to C:\Users\taton\PycharmProjects\capstone-data-science\data\raw\wheat_raw.csv
‚úÖ Processed Wheat data saved to C:\Users\taton\PycharmProjects\capstone-data-science\data\processed\wheat_processed.csv

üìÇ Raw data files: ['gold_futures.csv', 'gold_futures_raw.csv', 'gold_raw.csv', 'wheat_futures.csv', 'wheat_raw.csv', 'wti_crude.csv', 'wti_raw.csv']
üìÇ Processed data files: ['gold_processed.csv', 'wheat_processed.csv', 'wti_processed.csv']


# Step 5: Quick Data Inspection
__We verify that the downloaded datasets contain rows and preview the first few entries.__


In [23]:
for name, df in [("Gold", df_gold), ("WTI", df_wti), ("Wheat", df_wheat)]:
    print(name, "rows:", 0 if df is None else len(df))
    display(df.head())


Gold rows: 6302


Price,Date,Close,High,Low,Open,Volume,Return,MA_5,Vol_5
Ticker,Unnamed: 1_level_1,GC=F,GC=F,GC=F,GC=F,GC=F,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,2000-08-30,273.899994,273.899994,273.899994,273.899994,0,,,
1,2000-08-31,278.299988,278.299988,274.799988,274.799988,0,0.016064,,
2,2000-09-01,277.0,277.0,277.0,277.0,0,-0.004671,,
3,2000-09-05,275.799988,275.799988,275.799988,275.799988,2,-0.004332,,
4,2000-09-06,274.200012,274.200012,274.200012,274.200012,0,-0.005801,275.839996,


WTI rows: 6311


Price,Date,Close,High,Low,Open,Volume,Return,MA_5,Vol_5
Ticker,Unnamed: 1_level_1,CL=F,CL=F,CL=F,CL=F,CL=F,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,2000-08-23,32.049999,32.799999,31.950001,31.950001,79385,,,
1,2000-08-24,31.629999,32.240002,31.4,31.9,72978,-0.013105,,
2,2000-08-25,32.049999,32.099998,31.32,31.700001,44601,0.013279,,
3,2000-08-28,32.869999,32.919998,31.860001,32.040001,46770,0.025585,,
4,2000-08-29,32.720001,33.029999,32.560001,32.82,49131,-0.004563,32.264,


Wheat rows: 6326


Price,Date,Close,High,Low,Open,Volume,Return,MA_5,Vol_5
Ticker,Unnamed: 1_level_1,ZW=F,ZW=F,ZW=F,ZW=F,ZW=F,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,2000-07-17,244.0,248.0,243.75,248.0,7233,,,
1,2000-07-18,241.25,247.5,241.0,244.25,6523,-0.01127,,
2,2000-07-19,245.0,246.0,239.0,241.25,7136,0.015544,,
3,2000-07-20,247.0,247.5,244.0,244.0,3792,0.008163,,
4,2000-07-21,247.25,248.5,246.5,247.0,4380,0.001012,244.9,


# Step 6: Download Kaggle Datasets
__We now download external datasets from Kaggle for further analysis:__
1. **Gold & Silver Price vs Geopolitical Risk Index**
2. **Global News Dataset**
3. **Gold Price Prediction (LSTM reference dataset)**

‚ö†Ô∏è **Note:**
- This requires you to have your Kaggle API key (`kaggle.json`) set up in `~/.kaggle/`.
- These commands must be run in a notebook or shell with Kaggle configured.

In [24]:
import os

# Make sure we have a "data" directory
os.makedirs("../../data", exist_ok=True)

# 1. Gold & Silver Price vs Geopolitical Risk Index
!kaggle datasets download -d shreyanshdangi/gold-silver-price-vs-geopolitical-risk-19852025 -p data --unzip

# 2. Global News Dataset
!kaggle datasets download -d everydaycodings/global-news-dataset -p data --unzip

# 3. Gold Price Prediction (LSTM reference dataset)
!kaggle kernels pull farzadnekouei/gold-price-prediction-lstm-96-accuracy -p data/lstm_reference



Dataset URL: https://www.kaggle.com/datasets/shreyanshdangi/gold-silver-price-vs-geopolitical-risk-19852025
License(s): Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Downloading gold-silver-price-vs-geopolitical-risk-19852025.zip to data




  0%|          | 0.00/745k [00:00<?, ?B/s]
100%|##########| 745k/745k [00:00<00:00, 291MB/s]


Dataset URL: https://www.kaggle.com/datasets/everydaycodings/global-news-dataset
License(s): CC0-1.0
Downloading global-news-dataset.zip to data




  0%|          | 0.00/400M [00:00<?, ?B/s]
 18%|#8        | 72.0M/400M [00:00<00:00, 755MB/s]
 36%|###6      | 144M/400M [00:00<00:00, 611MB/s] 
 51%|#####1    | 204M/400M [00:00<00:00, 555MB/s]
 65%|######5   | 260M/400M [00:00<00:00, 553MB/s]
 79%|#######8  | 314M/400M [00:00<00:00, 537MB/s]
 92%|#########1| 366M/400M [00:00<00:00, 531MB/s]
100%|##########| 400M/400M [00:00<00:00, 544MB/s]


Source code downloaded to data/lstm_reference\gold-price-prediction-lstm-96-accuracy.ipynb


# Step 7: Inspect Downloaded Files & Load External Datasets
__We:__
1. List the contents of the `data/` directory (to check Kaggle download results).
2. Attempt to load:
   - Geopolitical Risk (GPR) dataset
   - Global News dataset

‚ö†Ô∏è Note: Filenames may differ, so adapt accordingly.

In [28]:
# --- Load your original datasets (raw) ---
gpr_path = os.path.join(RAW_DIR, "All_Historical_Data_Separately", "Geopolitical Risk Index Daily.csv")
news_path = os.path.join(RAW_DIR, "Global News dataset", "data.csv")

try:
    gpr = pd.read_csv(gpr_path)
    print("‚úÖ GPR dataset loaded:", gpr.shape)
except FileNotFoundError:
    print("‚ö†Ô∏è GPR dataset not found at:", gpr_path)
    gpr = pd.DataFrame()

try:
    global_news = pd.read_csv(news_path)
    print("‚úÖ Global news dataset loaded:", global_news.shape)
except FileNotFoundError:
    print("‚ö†Ô∏è Global News dataset not found at:", news_path)
    global_news = pd.DataFrame()

# --- Quick sanity checks ---
if not gpr.empty:
    print("\nüìä GPR preview:")
    print(gpr.head(3))

if not global_news.empty:
    print("\nüì∞ Global news preview:")
    print(global_news.head(3))

‚úÖ GPR dataset loaded: (14861, 5)
‚úÖ Global news dataset loaded: (105375, 12)

üìä GPR preview:
         DATE    GPRD GPRD_ACT  GPRD_THREAT EVENT
0  01-01-1985  230.04   275.20       153.03   NaN
1  02-01-1985  115.68   146.77        87.44   NaN
2  03-01-1985   97.43   158.94        29.46   NaN

üì∞ Global news preview:
   article_id source_id                   source_name  \
0       89541       NaN  International Business Times   
1       89542       NaN                    Prtimes.jp   
2       89543       NaN                      VOA News   

                                       author  \
0                              Paavan MATHEMA   
1                                         NaN   
2  webdesk@voanews.com (Agence France-Presse)   

                                               title  \
0  UN Chief Urges World To 'Stop The Madness' Of ...   
1              RANDEBOO„Çà„Çä„ÉØ„É≥„É©„É≥„ÇØ‰∏ä„ÅÆÂ§ß‰∫∫„Å£„ÅΩ„Åï„ÅåÊºÇ„ÅÜ„Éã„ÉÉ„Éà„Å®„Éô„Çπ„Éà„ÅåÊñ∞ÁôªÂ†¥„ÄÇ   
2  UN Chief Urges Worl

# Step 8: Add Basic Features to Commodity Prices
__We define a helper function `add_basic_features()` that:__
- Parses the `Date` column
- Ensures chronological order
- Chooses the correct price column (`Adj Close` if available, otherwise `Close`)
- Computes daily returns, 5-day moving average (MA_5), and 5-day rolling volatility (Vol_5)

Then we apply it to Gold, WTI crude, and Wheat, and save the processed results to CSV.

Feature Engineering for Commodity Prices
__We:__
1. Define `add_basic_features()` to compute:
   - Returns (`pct_change`)
   - 5-day moving average (`MA_5`)
   - 5-day volatility (`Vol_5`)
2. Apply it to Gold, WTI, and Wheat datasets.
3. Save processed outputs as `*_processed.csv`.

In [41]:
def add_basic_features(df):
    df = df.copy()
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    df.sort_values('Date', inplace=True)

    # (fix) removed ineffective: df.set_index('Date', inplace=False)
    price_col = 'Adj Close' if 'Adj Close' in df.columns else 'Close'
    df[price_col] = df[price_col].astype(float)

    df['Return'] = df[price_col].pct_change()
    df['MA_5'] = df[price_col].rolling(window=5).mean()
    df['Vol_5'] = df['Return'].rolling(window=5).std()

gold_feat = add_basic_features(df_gold) if not df_gold.empty else pd.DataFrame()
wti_feat  = add_basic_features(df_wti) if not df_wti.empty else pd.DataFrame()
wheat_feat= add_basic_features(df_wheat) if not df_wheat.empty else pd.DataFrame()

## --- Save processed versions ---
gold_feat.to_csv(os.path.join(PROCESSED_DIR, "gold_processed.csv"), index=False)
wti_feat.to_csv(os.path.join(PROCESSED_DIR, "wti_processed.csv"), index=False)
wheat_feat.to_csv(os.path.join(PROCESSED_DIR, "wheat_processed.csv"), index=False)

print(f"‚úÖ Saved processed files to {PROCESSED_DIR}")
print("Processed files:", os.listdir(PROCESSED_DIR))

print("GPR preview:" if not gpr.empty else "‚ö†Ô∏è GPR empty")
print(gpr.head())



‚úÖ Saved processed files to C:\Users\taton\PycharmProjects\capstone-data-science\data\processed
Processed files: ['gold_processed.csv', 'wheat_processed.csv', 'wti_processed.csv']
GPR preview:
         DATE    GPRD GPRD_ACT  GPRD_THREAT EVENT
0  01-01-1985  230.04   275.20       153.03   NaN
1  02-01-1985  115.68   146.77        87.44   NaN
2  03-01-1985   97.43   158.94        29.46   NaN
3  04-01-1985  157.37   156.88       157.03   NaN
4  05-01-1985   81.36    92.70        77.32   NaN


# Step 11: Quick Check of Saved Processed Files
__We check if our processed CSVs were successfully created and stored in `data/`.__

In [107]:
# quick check of saved files
for fname in ["gold_processed.csv", "wti_processed.csv", "wheat_processed.csv"]:
    print(fname, "->", os.path.exists(os.path.join(DATA_DIR,fname)))


gold_processed.csv -> True
wti_processed.csv -> True
wheat_processed.csv -> True
Price        Date       Close        High         Low        Open Volume  \
Ticker                   GC=F        GC=F        GC=F        GC=F   GC=F   
0      2000-08-30  273.899994  273.899994  273.899994  273.899994      0   
1      2000-08-31  278.299988  278.299988  274.799988  274.799988      0   
2      2000-09-01  277.000000  277.000000  277.000000  277.000000      0   
3      2000-09-05  275.799988  275.799988  275.799988  275.799988      2   
4      2000-09-06  274.200012  274.200012  274.200012  274.200012      0   

Price  Return Vol_5        MA_5  
Ticker                           
0         NaN   NaN         NaN  
1         NaN   NaN         NaN  
2         NaN   NaN         NaN  
3         NaN   NaN         NaN  
4         NaN   NaN  275.839996  


# Step 12: Save Notebook & Commit to Git
__Now that we have processed datasets and a working pipeline, we commit our notebook + CSV files to Git for version control.__

‚ö†Ô∏è Run these commands in the **terminal**, not inside the notebook.

In [None]:
git add data/*.csv notebooks/01_data_collection.ipynb
git commit -m "Add data collection notebook + initial processed commodity files"
git push origin main


# Step 17: Load Kaggle Datasets into DataFrames
__We:__
1. List available files in `data/` to adapt filenames if needed.
2. Load **Geopolitical Risk Index** and **Global News** datasets.
3. Provide fallback placeholders (`None`) if datasets are missing.

In [36]:
print(gold_feat.index)
print(gold_feat.columns)
print(gold_feat['Date'].dtype)
print(gpr_daily['DATE'].dtype)

RangeIndex(start=0, stop=6302, step=1)
MultiIndex([(  'Date',     ''),
            ( 'Close', 'GC=F'),
            (  'High', 'GC=F'),
            (   'Low', 'GC=F'),
            (  'Open', 'GC=F'),
            ('Volume', 'GC=F'),
            ('Return',     ''),
            (  'MA_5',     ''),
            ( 'Vol_5',     '')],
           names=['Price', 'Ticker'])
datetime64[ns]


NameError: name 'gpr_daily' is not defined

In [42]:
print("Gold columns:", gold_feat.columns.tolist())
print("WTI columns:", wti_feat.columns.tolist())
print("Wheat columns:", wheat_feat.columns.tolist())

Gold columns: ['Date', 'Close_GC=F', 'High_GC=F', 'Low_GC=F', 'Open_GC=F', 'Volume_GC=F', 'Return', 'MA_5', 'Vol_5']
WTI columns: ['Date', 'Close_CL=F', 'High_CL=F', 'Low_CL=F', 'Open_CL=F', 'Volume_CL=F', 'Return', 'MA_5', 'Vol_5']
Wheat columns: ['Date', 'Close_ZW=F', 'High_ZW=F', 'Low_ZW=F', 'Open_ZW=F', 'Volume_ZW=F', 'Return', 'MA_5', 'Vol_5']


# Step 19: Prepare GPR dataset
__We:__
Convert GPR `DATE` column to datetime and resample daily.


In [46]:
# --- Prepare GPR dataset ---
gpr_path = os.path.join(RAW_DIR, "Geopolitical Risk Index Daily.csv")
gpr = pd.read_csv(gpr_path)
gpr['DATE'] = pd.to_datetime(gpr['DATE'])
gpr_daily = gpr.set_index('DATE').resample('D').ffill().reset_index()
gpr_daily = gpr_daily[['DATE', 'GPRD', 'GPRD_THREAT', 'EVENT']]
print("GPR prepared shape:", gpr_daily.shape)


def prepare_features(df, price_col="Adj_Close", name="Commodity"):
    df = df.copy()
    if price_col not in df.columns:
        # fallback if only "Close" exists
        price_col = "Close" if "Close" in df.columns else df.columns[1]
        print(f"‚ö†Ô∏è {name}: '{price_col}' used instead.")
    df["Return"] = df[price_col].pct_change()
    df["MA_5"] = df[price_col].rolling(5).mean()
    df["Vol_5"] = df["Return"].rolling(5).std()
    df = df.dropna(subset=[price_col, "Return", "MA_5", "Vol_5"]).reset_index(drop=True)
    print(f"{name} features prepared:", df.shape)
    return df

# def prepare_feat(df, name):
#     """Make sure features exist and drop NaN rows caused by pct_change/rolling."""
#     df = flatten_columns(df)
#     df = safe_reset(df)# Find best matches
#     possible_cols = df.columns.tolist()
#
#     date_col = 'Date' if 'Date' in possible_cols else df.columns[0]
#     adj_col = [c for c in possible_cols if 'Adj' in c or 'Close' in c][0]
#     return_col = [c for c in possible_cols if 'Return' in c][0]
#     ma_col = [c for c in possible_cols if 'MA' in c][0]
#     vol_col = [c for c in possible_cols if 'Vol' in c][0]
#
#     keep_cols = [date_col, adj_col, return_col, ma_col, vol_col]
#     df = df[keep_cols].dropna().reset_index(drop=True)
#
#     # Rename columns for consistency
#     df.columns = ['Date', 'Adj_Close', 'Return', 'MA_5', 'Vol_5']
#
#     print(f"{name} after feature prep:", df.shape)
#     return df

gold_feat = prepare_features(gold_feat, price_col="Close_GC=F", name="Gold")

wti_feat = prepare_features(wti_feat, price_col="Close_CL=F", name="WTI")

wheat_feat = prepare_features(wheat_feat, price_col="Close_ZW=F", name="Wheat")
# --- Save processed features ---
gold_feat.to_csv(os.path.join(PROCESSED_DIR, "gold_processed.csv"), index=False)
wti_feat.to_csv(os.path.join(PROCESSED_DIR, "wti_processed.csv"), index=False)
wheat_feat.to_csv(os.path.join(PROCESSED_DIR, "wheat_processed.csv"), index=False)
print("‚úÖ Processed CSVs saved to:", PROCESSED_DIR)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\taton\\PycharmProjects\\capstone-data-science\\data\\raw\\Geopolitical Risk Index Daily.csv'

# Step 20: Merge All Commodities with Geopolitical Risk Index
__We:__
1. Define utility functions:
   - `flatten_columns()` ‚Üí handle MultiIndex columns.
   - `safe_reset()` ‚Üí reset index if needed.
2. Apply preprocessing to Gold, WTI, and Wheat.
3. Merge each with daily GPR data.
4. Display merged dataset previews.

In [44]:
def flatten_columns(df):
    """Flatten MultiIndex columns if necessary."""
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [
            '_'.join([str(c) for c in col if c]).strip()
            for col in df.columns.values
        ]
    if df.index.name == 'Date':
        return df.reset_index()
    return df



# --- Gold ---
gold_feat = flatten_columns(gold_feat)
merged_gold = pd.merge(
    gold_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

# --- Oil (WTI) ---
wti_feat = flatten_columns(wti_feat)
merged_wti = pd.merge(
    wti_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

# --- Wheat ---
wheat_feat = flatten_columns(wheat_feat)
merged_wheat = pd.merge(
    wheat_feat,
    gpr_daily,
    left_on='Date',
    right_on='DATE',
    how='left'
)

# --- Quick checks ---
print("Gold merged shape:", merged_gold.shape)
print("Oil merged shape:", merged_wti.shape)
print("Wheat merged shape:", merged_wheat.shape)

display(merged_gold.head())
display(merged_wti.head())
display(merged_wheat.head())


NameError: name 'gpr_daily' is not defined

# Step 21 ‚Äî Save merged datasets & Quick check of saved files (Notebook)

In [115]:
merged_gold.to_csv(os.path.join(DATA_DIR, "gold_merged.csv"), index=False)
merged_wti.to_csv(os.path.join(DATA_DIR, "wti_merged.csv"), index=False)
merged_wheat.to_csv(os.path.join(DATA_DIR, "wheat_merged.csv"), index=False)

print("Merged datasets saved in:", DATA_DIR)
# Quick check of saved processed files
for fname in ["gold_processed.csv", "wti_processed.csv", "wheat_processed.csv"]:
    exists = os.path.exists(os.path.join(DATA_DIR, fname))
    print(f"{fname} -> {'‚úÖ exists' if exists else '‚ùå missing'}")

Merged datasets saved in: data
gold_processed.csv -> ‚úÖ exists
wti_processed.csv -> ‚úÖ exists
wheat_processed.csv -> ‚úÖ exists


# Step 22 ‚Äî Save to Git (Terminal, not notebook)

In [68]:
git add notebooks/01_data_collection.ipynb notebooks/notebooks/data/*.csv .gitignore
git commit -m "Add data collection notebook + initial processed commodity files"
git push origin main


SyntaxError: invalid decimal literal (495343877.py, line 1)

In [117]:
print(gold_feat[['Return', 'Vol_5']].info())
print(gold_feat[['Return', 'Vol_5']].head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6288 entries, 0 to 6287
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Return  0 non-null      float64
 1   Vol_5   0 non-null      float64
dtypes: float64(2)
memory usage: 98.4 KB
None
   Return  Vol_5
0     NaN    NaN
1     NaN    NaN
2     NaN    NaN
3     NaN    NaN
4     NaN    NaN


In [118]:
print(merged_gold.columns)
print(merged_gold[['Return', 'Vol_5']].head())

Index(['Date', 'Close_GC=F', 'High_GC=F', 'Low_GC=F', 'Open_GC=F',
       'Volume_GC=F', 'Return', 'Vol_5', 'MA_5', 'DATE', 'GPRD', 'GPRD_THREAT',
       'EVENT'],
      dtype='object')
   Return  Vol_5
0     NaN    NaN
1     NaN    NaN
2     NaN    NaN
3     NaN    NaN
4     NaN    NaN
