## Mega Dataset Creation

### 1. Setup Paths & Load Backbone

**Explanation:** We define our directories and load the "Backbone" dataset (`stock_senti_engineered_imputed_2.csv`). This file contains our 7 tickers and their dates, so it dictates the "shape" of the final dataset: one row per ticker per date. We record the initial row count so that, after each merge, we can confirm that no rows were added or dropped.

In [45]:
import pandas as pd
import numpy as np
import os

# --- Define Directories ---
# Directory where your 8 semi-clean files live
semi_clean_dir = r"D:\MS_Data_Science_Thesis\Data_Cleaning\Semi_Clean_Datasets"

# Final Output Directory
final_dir = r"D:\MS_Data_Science_Thesis\Data_Cleaning\Master_Dataset"
os.makedirs(final_dir, exist_ok=True)

# --- Load the Backbone (Stock + Sentiment) ---
backbone_path = os.path.join(semi_clean_dir, "stock_senti_engineered_imputed_2.csv")
print(f"Loading Backbone: {backbone_path}")

df_final = pd.read_csv(backbone_path)
df_final['date'] = pd.to_datetime(df_final['date'])

# Store initial row count for integrity check
initial_rows = len(df_final)
print(f"Initial Backbone Rows: {initial_rows}")
df_final.head(3)

Loading Backbone: D:\MS_Data_Science_Thesis\Data_Cleaning\Semi_Clean_Datasets\stock_senti_engineered_imputed_2.csv
Initial Backbone Rows: 24052


Unnamed: 0,date,volume,open,high,low,close,adj close,ticker,sentiment_score,n_articles,...,MA7,MA50,RSI,Log_Returns,Close_to_MA7,Close_to_MA50,Sent_MA7,Sent_MA30,Sent_Vol7,sentiment_imputed
0,2010-04-27,23420500.0,43.833202,44.587894,41.363297,43.863693,30.670647,COP,0.0,0.0,...,43.797264,39.971154,74.139756,-0.013122,1.001517,1.097384,-0.162053,0.013629,0.343736,0.0
1,2010-04-28,22939200.0,43.90181,44.824215,43.772217,44.633633,31.209005,COP,0.084899,1.0,...,44.005268,40.10273,74.085707,0.017401,1.014279,1.112982,-0.149924,-0.011301,0.35181,0.084899
2,2010-04-29,17990400.0,44.976677,45.85334,44.976677,45.05291,31.502172,COP,-0.020716,4.0,...,44.190402,40.252144,71.624744,0.00935,1.019518,1.119267,-0.152884,-0.034716,0.350423,-0.020716
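
The `head()` output above includes an `Unnamed: 0` column, which usually appears when a CSV was written with its index. The following is a minimal sketch, on made-up toy rows, of dropping that column and confirming that the backbone has one row per (ticker, date) pair. The helper name `check_backbone` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def check_backbone(df: pd.DataFrame) -> pd.DataFrame:
    """Drop a leftover index column and confirm one row per (ticker, date)."""
    # 'Unnamed: 0' typically comes from a CSV saved with its index.
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")
    dupes = int(df.duplicated(subset=["ticker", "date"]).sum())
    if dupes:
        raise ValueError(f"{dupes} duplicate (ticker, date) rows in backbone")
    return df

# Toy rows (illustrative values only)
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "date": pd.to_datetime(["2010-04-27", "2010-04-28", "2010-04-27"]),
    "ticker": ["COP", "COP", "XOM"],
    "close": [43.86, 44.63, 68.10],
})
clean = check_backbone(toy)
print(clean.columns.tolist())
```

Running a check like this before merging means the later row-count comparison starts from a backbone that is already known to be free of duplicates.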


### 2. Define Macro File Map

**Explanation:** We create a Python dictionary that maps a readable name (used in the log messages) to the specific filename of each macro dataset. This makes the next step (the loop) cleaner and less error-prone.

In [48]:
# --- Define the Macro Files to Merge ---
# Dictionary mapping: {Name_for_Log : Filename}
# These match the filenames provided in your screenshot
macro_files = {
    "Oil": "oil_engineered_3.csv",
    "Gas": "Gas_Engineered_4.csv",
    "VIX": "VIX_engineered_5.csv",
    "XLE": "XLE_engineered_6.csv",
    "Weather": "HDDCDD_engineered_7.csv",
    "Hurricane": "HUDRAT2_engineered_8.csv",
    "Carbon": "carbon_engineered_9.csv"
}

print(f"Identified {len(macro_files)} macro files to merge.")

Identified 7 macro files to merge.
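
Because the merge loop skips a missing file and carries on, it can help to confirm up front that every mapped file exists. The following is a hedged sketch of such a pre-flight check, demonstrated in a temporary directory; the helper name `find_missing` is hypothetical.

```python
from pathlib import Path
import tempfile

def find_missing(directory, file_map):
    """Return the readable names whose files are absent from `directory`."""
    base = Path(directory)
    return [name for name, fname in file_map.items()
            if not (base / fname).is_file()]

# Toy demonstration: only the Oil file exists in the temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "oil_engineered_3.csv").write_text("date,Oil_Price\n")
    demo_map = {"Oil": "oil_engineered_3.csv", "Gas": "Gas_Engineered_4.csv"}
    missing = find_missing(tmp, demo_map)

print(missing)
```

In the notebook, the same call would take `semi_clean_dir` and `macro_files`; an empty list means all files were found.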


### 3. The Iterative Merge Loop

**Explanation:** This is the engine. We loop through each entry in the `macro_files` dictionary.

- **Left Join:** We keep all stock rows. We match macro data based on date.

- **validate='many_to_one':** This is a safety feature. It checks that the macro file (the "one" side) has no duplicate dates and raises an error if it does. Without it, each duplicated date would duplicate the matching stock rows and inflate the dataset.

- **Integrity Check:** After every merge, we check if the row count changed. If it did, something went wrong.

In [51]:
# --- The Iterative Mega Merge ---
for name, filename in macro_files.items():
    file_path = os.path.join(semi_clean_dir, filename)
    
    # Check if file exists
    if not os.path.exists(file_path):
        print(f"ERROR: Could not find {filename}")
        continue
        
    print(f"Merging {name} data...")
    
    # Load Macro
    df_macro = pd.read_csv(file_path)
    df_macro['date'] = pd.to_datetime(df_macro['date'])
    
    # MERGE (Left Join on Date)
    # validate='many_to_one' ensures we don't accidentally explode rows
    df_final = pd.merge(df_final, df_macro, on='date', how='left', validate='many_to_one')
    
    # Check if rows changed (Safety Check)
    if len(df_final) != initial_rows:
        print(f"WARNING: Row count changed after merging {name}! Check for duplicates.")
    else:
        # Report how many new columns were added (excluding 'date')
        new_cols = list(df_macro.columns.difference(['date']))
        print(f" -> Successfully merged {name}. Added {len(new_cols)} columns.")

print("-" * 50)
print("Merge sequence complete.")

Merging Oil data...
 -> Successfully merged Oil. Added 6 columns.
Merging Gas data...
 -> Successfully merged Gas. Added 6 columns.
Merging VIX data...
 -> Successfully merged VIX. Added 6 columns.
Merging XLE data...
 -> Successfully merged XLE. Added 6 columns.
Merging Weather data...
 -> Successfully merged Weather. Added 8 columns.
Merging Hurricane data...
 -> Successfully merged Hurricane. Added 1 columns.
Merging Carbon data...
 -> Successfully merged Carbon. Added 6 columns.
--------------------------------------------------
Merge sequence complete.
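
To see what the `validate='many_to_one'` guard in the loop above protects against, here is a small sketch on toy data (the prices are made up). A macro table with a repeated date makes `pd.merge` raise `pandas.errors.MergeError`; after de-duplicating on `date`, the left join keeps the stock row count unchanged. Keeping the last duplicate is an arbitrary choice for the demo.

```python
import pandas as pd
from pandas.errors import MergeError

stocks = pd.DataFrame({
    "date": pd.to_datetime(["2010-04-27", "2010-04-27", "2010-04-28"]),
    "ticker": ["COP", "XOM", "COP"],
})
# Macro table with a duplicated date (2010-04-27 appears twice)
macro_bad = pd.DataFrame({
    "date": pd.to_datetime(["2010-04-27", "2010-04-27", "2010-04-28"]),
    "Oil_Price": [84.2, 84.3, 83.2],
})

try:
    pd.merge(stocks, macro_bad, on="date", how="left", validate="many_to_one")
    raised = False
except MergeError:
    raised = True

# One row per date on the macro side: the merge now passes validation.
macro_ok = macro_bad.drop_duplicates(subset="date", keep="last")
merged = pd.merge(stocks, macro_ok, on="date", how="left", validate="many_to_one")
print(raised, len(merged) == len(stocks))
```

With a left join and unique keys on the right side, the row count cannot change, so the length comparison in the loop acts as a second line of defence rather than the primary check.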


### 4. Sanity Checks (Panel Verification)

**Explanation:** We verify the data quality before saving.
- **Row Count:** Must equal the initial count.

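- **Null Check:** Ideally 0. Any columns that still contain NaNs are listed with their counts; rolling-window features such as moving averages and RSI leave NaNs during their warm-up period.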

- **Panel Check:** We pick an arbitrary date (the date of row 100) and print the rows for every ticker on that date (e.g., XOM, CVX). They should have different stock prices but identical Oil/Gas/VIX/Carbon values.

In [54]:
# --- Final Sanity Checks ---
print("--- FINAL SANITY CHECKS ---")

# 1. Row Count Integrity
print(f"1. Final Row Count: {len(df_final)} (Should be {initial_rows})")

# 2. Null Value Check (Should be 0 for all columns)
nulls = df_final.isna().sum()
if nulls.sum() == 0:
    print("2. Null Check: PASSED (No missing values in the entire dataset!)")
else:
    print("2. Null Check: WARNING (Found missing values):")
    print(nulls[nulls > 0])

# 3. Panel Structure Verification
# We check a specific date to ensure macros are repeated correctly across tickers
if len(df_final) > 100:
    check_date = df_final['date'].iloc[100] # Pick an arbitrary date (the date of row 100)
    # Select a few columns to verify, including Carbon
    subset_check = df_final[df_final['date'] == check_date][['ticker', 'close', 'Oil_Price', 'Gas_Price', 'VIX_Close', 'Carbon_Price']]
    print(f"\n3. Panel Check for date {check_date.date()} (Macros should be identical, Ticker/Close different):")
    print(subset_check)

--- FINAL SANITY CHECKS ---
1. Final Row Count: 24052 (Should be 24052)
2. Null Check: WARNING (Found missing values):
Oil_Log_Return        21
Oil_RSI               91
Oil_MA7               42
Oil_MA50             343
Oil_Vol7             105
Gas_Log_Return         7
Gas_RSI               91
Gas_MA7               42
Gas_MA30             203
Gas_Vol7              49
VIX_Log_Return         7
VIX_Diff               7
VIX_RSI               91
VIX_MA50             343
VIX_to_MA50          343
XLE_Log_Return         7
XLE_RSI               91
XLE_MA7               42
XLE_MA50             343
XLE_Vol7              49
HDD_MA7               42
CDD_MA7               42
Total_DD_MA7          42
Total_DD_MA30        203
Weather_Shock        203
Carbon_Log_Return      7
Carbon_RSI            91
Carbon_MA7            42
Carbon_MA50          343
Carbon_Vol7           49
dtype: int64

3. Panel Check for date 2010-09-17 (Macros should be identical, Ticker/Close different):
      ticker      close  Oil_Price  Gas_Price  VIX_Close  Carbon_Price
100 
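
Inspecting a single date by eye does not scale, so the panel property can also be tested for every date at once. The following is a minimal sketch, assuming the panel has `date` and macro columns such as `Oil_Price`; the helper name `macros_constant_per_date` and the toy values are hypothetical.

```python
import pandas as pd

def macros_constant_per_date(df, macro_cols):
    """True if every macro column has at most one distinct value per date."""
    per_date = df.groupby("date")[macro_cols].nunique(dropna=False)
    return bool((per_date <= 1).all().all())

# Toy panel: two tickers on two dates, macro value shared within each date
toy = pd.DataFrame({
    "date": pd.to_datetime(["2010-09-17"] * 2 + ["2010-09-20"] * 2),
    "ticker": ["COP", "XOM", "COP", "XOM"],
    "close": [50.0, 60.0, 51.0, 61.0],
    "Oil_Price": [73.7, 73.7, 74.9, 74.9],
})
ok = macros_constant_per_date(toy, ["Oil_Price"])

# Corrupt one macro value so that a date carries two different prices
broken = toy.copy()
broken.loc[1, "Oil_Price"] = 99.0
not_ok = macros_constant_per_date(broken, ["Oil_Price"])
print(ok, not_ok)
```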

### 5. Final Clean-Up Code

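**Explanation:** The NaNs reported by the null check come from the warm-up period of the rolling-window features (for example, the 50-day moving averages). This block drops those rows and confirms that no nulls remain; the file itself is saved in the next step.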

In [57]:
# --- STEP 5: The Final Polish (Trimming Warm-up NaNs) ---

# 1. Drop the rows with NaNs (roughly the first 50 trading days of the sample)
df_final_clean = df_final.dropna().copy()

print("--- FINAL POLISH REPORT ---")
print(f"Original Rows: {len(df_final)}")
print(f"Cleaned Rows:  {len(df_final_clean)}")
print(f"Rows Dropped:  {len(df_final) - len(df_final_clean)} (Expected due to MA50 warm-up)")

# 2. Final Verify (Must be 0)
print(f"Remaining Nulls: {df_final_clean.isna().sum().sum()}")


--- FINAL POLISH REPORT ---
Original Rows: 24052
Cleaned Rows:  23653
Rows Dropped:  399 (Expected due to MA50 warm-up)
Remaining Nulls: 0
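
The null counts reported by the sanity check fit a warm-up explanation: the MA50 columns each show 343 missing values, and 343 = 49 × 7. The sketch below reproduces that arithmetic on a toy series. It assumes the macro features were built with pandas' default `min_periods` (equal to the window) and that all 7 tickers have rows on the warm-up dates; neither is confirmed by the notebook.

```python
import numpy as np
import pandas as pd

# Toy price series of 100 observations
prices = pd.Series(np.arange(1.0, 101.0))

# A 50-observation rolling mean is undefined until 50 values are available,
# so the first 49 entries are NaN.
ma50 = prices.rolling(window=50).mean()
warmup_nans = int(ma50.isna().sum())

# Each macro date is repeated once per ticker after the left join.
n_tickers = 7
print(warmup_nans, warmup_nans * n_tickers)
```

The same reasoning gives 6 × 7 = 42 for the 7-day averages and 29 × 7 = 203 for the 30-day ones, matching the counts for the `MA7` and `MA30` columns.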


### 6. Save the Master Dataset

**Explanation:** Finally, we save the cleaned dataset as `master_dataset.csv` in the output directory (`final_dir`).

In [60]:
# --- Save the Masterpiece ---
clean_filename = "master_dataset.csv"
clean_full_path = os.path.join(final_dir, clean_filename)

df_final_clean.to_csv(clean_full_path, index=False)
print("=" * 60)
print(f"SUCCESS! Final Dataset saved to:\n{clean_full_path}")
print("=" * 60)

SUCCESS! Final Dataset saved to:
D:\MS_Data_Science_Thesis\Data_Cleaning\Master_Dataset\master_dataset.csv
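
A quick way to confirm the saved file is usable is to read it back and compare it with the in-memory frame. The sketch below does this round trip on a toy frame in a temporary directory; in the notebook, the same idea would apply to `df_final_clean` and `clean_full_path`. The toy values are made up.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2010-09-17", "2010-09-20"]),
    "ticker": ["COP", "COP"],
    "close": [50.0, 51.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "master_dataset.csv")
    df.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that CSV cannot store natively
    reloaded = pd.read_csv(path, parse_dates=["date"])

same_shape = reloaded.shape == df.shape
dates_parsed = pd.api.types.is_datetime64_any_dtype(reloaded["date"])
print(same_shape, dates_parsed)
```

Writing with `index=False`, as the notebook does, also avoids creating another `Unnamed: 0` column when the file is loaded again.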
