# Data Loading and Basic Cleaning

Load raw weather and outage data files with minimal processing.

**Pipeline:**
1. **01_data_loading.ipynb** (this notebook) - Load and reshape raw data
2. **02_data_integration_eda.ipynb** - Merge datasets and run exploratory data analysis  
3. **03_visualizations.ipynb** - Data visualizations

In [10]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
from datetime import datetime

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
np.random.seed(42)

In [None]:
# Data is stored on an external drive and covers 7 years (2014-2020)

# Configuration for data source
EXTERNAL_DRIVE_PATH = "/Volumes/Academia/AI-Studio-Project/data/raw"
YEARS = ["2014", "2015", "2016", "2017", "2018", "2019", "2020"]

ANALYSIS_YEAR = "2014"  # Year to load when USE_SINGLE_YEAR is True
USE_SINGLE_YEAR = False  # True loads only ANALYSIS_YEAR; False loads all years

if USE_SINGLE_YEAR:
    print(f"Loading data for {ANALYSIS_YEAR} only...")
    years_to_load = [ANALYSIS_YEAR]
else:
    print(f"Loading all years: {', '.join(YEARS)}...")
    years_to_load = YEARS

weather_files = {'tmax': [], 'tmin': [], 'prcp': []}

for year in years_to_load:
    weather_dir = Path(f"{EXTERNAL_DRIVE_PATH}/weather/daily_grids/{year}")
    if weather_dir.exists():
        weather_files['tmax'].extend(sorted(weather_dir.glob(f"tmax-{year}*.csv")))
        weather_files['tmin'].extend(sorted(weather_dir.glob(f"tmin-{year}*.csv")))
        weather_files['prcp'].extend(sorted(weather_dir.glob(f"prcp-{year}*.csv")))
        print(f"  Found weather data for {year}")
    else:
        print(f"  Warning: No weather data found for {year}")

print(f"Total weather files: tmax={len(weather_files['tmax'])}, tmin={len(weather_files['tmin'])}, prcp={len(weather_files['prcp'])}")

# Define column names (the NOAA files have no header row)
cols = ['level', 'fips_code', 'county_name', 'year', 'month', 'variable'] + [f'day_{i:02d}' for i in range(1, 32)]

def load_weather_variable(file_list, var_name):
    """Load and combine weather files for one variable"""
    if not file_list:
        print(f"Warning: No files found for {var_name}")
        return pd.DataFrame()
    
    dfs = []
    for file in file_list:
        try:
            df = pd.read_csv(file, header=None, names=cols)
            dfs.append(df)
        except Exception as e:
            print(f"Error loading {file}: {e}")
    
    if dfs:
        combined = pd.concat(dfs, ignore_index=True)
        print(f"  Loaded {var_name}: {len(combined):,} rows from {len(file_list)} files")
        return combined
    else:
        return pd.DataFrame()

# Load each variable type
print("Loading weather data...")
tmax_df = load_weather_variable(weather_files['tmax'], 'TMAX')
tmin_df = load_weather_variable(weather_files['tmin'], 'TMIN') 
prcp_df = load_weather_variable(weather_files['prcp'], 'PRCP')

# Combine all weather variables
weather_df = pd.concat([tmax_df, tmin_df, prcp_df], ignore_index=True)
print(f"Combined weather data: {len(weather_df):,} rows")

# Load outage data for selected years
print("Loading outage data...")
outage_dfs = []
outage_df = pd.DataFrame()  # Initialize outage_df to an empty DataFrame
for year in years_to_load:
    outage_file = Path(f"{EXTERNAL_DRIVE_PATH}/outages/{year}/eaglei_outages_{year}.csv")
    if outage_file.exists():
        df = pd.read_csv(outage_file)
        outage_dfs.append(df)
        print(f"  Loaded outages for {year}: {len(df):,} rows")
    else:
        print(f"  Warning: No outage data found for {year}")

if outage_dfs:
    outage_df = pd.concat(outage_dfs, ignore_index=True)
    print(f"Combined outage data: {len(outage_df):,} rows")
else:
    print("Warning: No outage data loaded! Using an empty DataFrame.")

print(f"\nData loading complete!")
print(f"Weather data shape: {weather_df.shape}")
print(f"Outage data shape: {outage_df.shape}")
print(f"Years included: {', '.join(years_to_load)}")

  Loaded TMAX: 260,988 rows from 84 files
  Loaded TMIN: 260,988 rows from 84 files
  Loaded PRCP: 260,988 rows from 84 files
Combined weather data: 782,964 rows
Reshaping weather data...
Loading outage data...
  Loaded outages for 2014: 1,689,460 rows
  Loaded outages for 2015: 4,977,491 rows
  Loaded outages for 2016: 13,306,024 rows
  Loaded outages for 2017: 15,078,364 rows
  Loaded outages for 2018: 21,776,806 rows
  Loaded outages for 2019: 24,074,122 rows
  Loaded outages for 2020: 25,545,517 rows
Combined outage data: 106,447,784 rows

Data loading complete!
Weather data shape: (24271884, 7)
Outage data shape: (106447784, 5)
Years included: 2014, 2015, 2016, 2017, 2018, 2019, 2020
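
The row counts printed above can be cross-checked with simple arithmetic. The sketch below assumes every one of the 3,107 counties (the unique FIPS count reported later in this notebook) appears exactly once in each monthly file; that assumption is not stated in the source, but it reproduces the logged numbers. The last product matches the printed weather shape of 24,271,884 rows, which suggests that output was produced from a frame already melted to one row per day slot.

```python
# Assumption: one row per county in each monthly file.
n_counties = 3107          # unique FIPS codes reported by the notebook
n_months = 7 * 12          # 2014-2020 -> 84 monthly files per variable
n_variables = 3            # TMAX, TMIN, PRCP
n_day_slots = 31           # day_01 .. day_31 columns

rows_per_variable = n_counties * n_months          # 260,988
rows_all_variables = rows_per_variable * n_variables  # 782,964
rows_after_melt = rows_all_variables * n_day_slots    # 24,271,884

print(rows_per_variable, rows_all_variables, rows_after_melt)
```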


In [3]:
# 1. Check weather data structure and geographic identifiers
print("Weather data shape:", weather_df.shape)
weather_df.columns.tolist()

Weather data shape: (24271884, 7)


['fips_code', 'county_name', 'year', 'month', 'variable', 'day', 'value']

In [4]:
print("Unique variables:", weather_df['variable'].unique())
print("Unique FIPS codes:", weather_df['fips_code'].nunique())
print("Weather data sample:")
weather_df.head(10)

Unique variables: ['TMAX' 'TMIN' 'PRCP']
Unique FIPS codes: 3107
Weather data sample:


Unnamed: 0,fips_code,county_name,year,month,variable,day,value
0,1001,AL: Autauga,2014,1,TMAX,day_01,7.92
1,1003,AL: Baldwin County,2014,1,TMAX,day_01,8.78
2,1005,AL: Barbour County,2014,1,TMAX,day_01,9.29
3,1007,AL: Bibb County,2014,1,TMAX,day_01,7.73
4,1009,AL: Blount County,2014,1,TMAX,day_01,8.72
5,1011,AL: Bullock County,2014,1,TMAX,day_01,8.51
6,1013,AL: Butler County,2014,1,TMAX,day_01,8.35
7,1015,AL: Calhoun County,2014,1,TMAX,day_01,7.58
8,1017,AL: Chambers County,2014,1,TMAX,day_01,8.95
9,1019,AL: Cherokee County,2014,1,TMAX,day_01,6.88


In [5]:
# 2. Check outage data structure and geographic identifiers  
print("\nOutage data shape:", outage_df.shape)
outage_df.columns.tolist()


Outage data shape: (106447784, 5)


['fips_code', 'county', 'state', 'customers_out', 'run_start_time']

In [6]:
print("Outage data sample:")
outage_df.head(100)


Outage data sample:


Unnamed: 0,fips_code,county,state,customers_out,run_start_time
0,1037,Coosa,Alabama,12.0,2014-11-01 04:00:00
1,1051,Elmore,Alabama,7.0,2014-11-01 04:00:00
2,1109,Pike,Alabama,1.0,2014-11-01 04:00:00
3,1121,Talladega,Alabama,31.0,2014-11-01 04:00:00
4,4017,Navajo,Arizona,1.0,2014-11-01 04:00:00
...,...,...,...,...,...
95,19113,Linn,Iowa,2.0,2014-11-01 04:00:00
96,20045,Douglas,Kansas,45.0,2014-11-01 04:00:00
97,20091,Johnson,Kansas,2.0,2014-11-01 04:00:00
98,20173,Sedgwick,Kansas,1.0,2014-11-01 04:00:00


## Data Processing Summary

**Weather Data:**
- Source: 84 monthly CSV files per variable (2014-2020)
- Reshaped from wide format (one column per day) to one row per county and date, with one column per variable
- Variables: TMAX, TMIN, PRCP

**Outage Data:**  
- Source: 7 annual CSV files (2014-2020)
- Deduplicated to one row per county and date (first record kept)
- Created binary outage indicator
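
The weather reshape summarized above can be sketched on a toy frame. This is a minimal illustration rather than the notebook's full pipeline: it uses only two day columns instead of 31, and the values are taken from the sample output for Autauga County.

```python
import pandas as pd

# Toy stand-in for the raw layout: one row per county, month, and variable,
# with one column per day (two days here; the real files have day_01..day_31).
raw = pd.DataFrame({
    "fips_code": [1001, 1001],
    "county_name": ["AL: Autauga", "AL: Autauga"],
    "year": [2014, 2014],
    "month": [1, 1],
    "variable": ["TMAX", "TMIN"],
    "day_01": [7.92, 3.12],
    "day_02": [9.14, 4.07],
})

# Step 1: melt the day columns into rows (wide -> long).
long_df = raw.melt(
    id_vars=["fips_code", "county_name", "year", "month", "variable"],
    var_name="day",
    value_name="value",
)
long_df["day"] = long_df["day"].str.extract(r"(\d+)", expand=False).astype(int)
long_df["date"] = pd.to_datetime(long_df[["year", "month", "day"]], errors="coerce")

# Step 2: pivot the variables into columns (one row per county and date).
tidy = long_df.pivot_table(
    index=["fips_code", "county_name", "date"],
    columns="variable",
    values="value",
).reset_index()
tidy.columns.name = None

print(tidy)
```

The result has one row per county and date, with `TMAX` and `TMIN` as columns.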

In [None]:
# Weather data cleaning pipeline
print("Cleaning weather data...")

# Melt from wide to long format to create individual date records
weather_long = weather_df.melt(
    id_vars=['fips_code', 'county_name', 'year', 'month', 'variable'],
    value_vars=[f'day_{i:02d}' for i in range(1, 32)],
    var_name='day',
    value_name='value'
)

# Extract numeric day from 'day' column (raw string avoids an invalid-escape warning;
# expand=False returns a Series rather than a one-column DataFrame)
weather_long['day_num'] = weather_long['day'].str.extract(r'(\d+)', expand=False).astype(int)

# Create date column; impossible dates are coerced to NaT
weather_long['date'] = pd.to_datetime(weather_long[['year', 'month', 'day_num']].rename(columns={'day_num': 'day'}), errors='coerce')

# Remove rows with invalid dates (e.g., Feb 30 or Apr 31)
weather_long = weather_long.dropna(subset=['date'])

print(f"Date range: {weather_long['date'].min()} to {weather_long['date'].max()}")

# Pivot to final format: each weather variable becomes a column
print("Reshaping to final format...")
weather_pivot = weather_long.pivot_table(
    index=['fips_code', 'county_name', 'date'], 
    columns='variable', 
    values='value'
).reset_index()

# Flatten column names
weather_pivot.columns.name = None
weather_pivot = weather_pivot.rename(columns={'PRCP': 'prcp', 'TMAX': 'tmax', 'TMIN': 'tmin'})

print(f"Weather data cleaned: {weather_pivot.shape}")
print("Sample of cleaned weather data:")
weather_pivot.head(20)

Cleaning weather data...
Date range: 2014-01-01 00:00:00 to 2020-12-31 00:00:00
Reshaping weather data...
Weather data cleaned: (7944599, 6)
Sample of cleaned weather data:


Unnamed: 0,fips_code,county_name,date,prcp,tmax,tmin
0,1001,AL: Autauga,2014-01-01,0.0,7.92,3.12
1,1001,AL: Autauga,2014-01-02,7.54,9.14,4.07
2,1001,AL: Autauga,2014-01-03,1.41,12.62,-4.65
3,1001,AL: Autauga,2014-01-04,0.0,4.46,-5.44
4,1001,AL: Autauga,2014-01-05,0.0,9.72,-4.71
5,1001,AL: Autauga,2014-01-06,2.39,15.71,-3.26
6,1001,AL: Autauga,2014-01-07,0.0,-0.88,-11.17
7,1001,AL: Autauga,2014-01-08,0.0,-1.97,-11.4
8,1001,AL: Autauga,2014-01-09,0.0,6.69,-8.77
9,1001,AL: Autauga,2014-01-10,3.9,10.84,-1.1
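
Two quick checks help confirm the cleaning step behaved as intended. The sketch below first shows how `errors='coerce'` turns an impossible day slot into `NaT`, and then compares the cleaned row count against the number of county-days expected if every county has a record for every calendar day. That completeness is an assumption, though it matches the 7,944,599 rows reported above.

```python
import pandas as pd

# After melting, every month carries day slots 1-31, so some are not real dates.
parts = pd.DataFrame({
    "year": [2014, 2014, 2016],
    "month": [2, 2, 2],
    "day": [28, 30, 29],   # Feb 30 is invalid; Feb 29, 2016 is a leap day
})
dates = pd.to_datetime(parts, errors="coerce")
n_invalid = int(dates.isna().sum())

# Expected rows after cleaning, assuming full coverage for every county.
n_counties = 3107
n_days = len(pd.date_range("2014-01-01", "2020-12-31", freq="D"))
expected_rows = n_counties * n_days

print(n_invalid, n_days, expected_rows)
```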


In [8]:
# Outage data cleaning pipeline
print("Cleaning outage data...")

# Convert run_start_time to datetime
outage_df['run_start_time'] = pd.to_datetime(outage_df['run_start_time'])

# Truncate timestamps to midnight for matching with weather data
# (dt.normalize() keeps the datetime64 dtype and avoids a slow round trip through Python date objects)
outage_df['date'] = outage_df['run_start_time'].dt.normalize()

# Create binary outage indicator
outage_df['outage_occurred'] = 1  # All rows in outage data represent outages

# Remove duplicates based on fips, date (keep first occurrence per county per day)
print(f"Before deduplication: {outage_df.shape}")
outage_clean = outage_df.drop_duplicates(subset=['fips_code', 'date'], keep='first')
print(f"After deduplication: {outage_clean.shape}")

# Check for missing FIPS codes
print(f"Missing FIPS codes: {outage_clean['fips_code'].isnull().sum()}")

print("Sample of cleaned outage data:")
outage_clean.head(10)

Cleaning outage data...
Before deduplication: (106447784, 7)
After deduplication: (3701035, 7)
Missing FIPS codes: 0
Sample of cleaned outage data:


Unnamed: 0,fips_code,county,state,customers_out,run_start_time,date,outage_occurred
0,1037,Coosa,Alabama,12.0,2014-11-01 04:00:00,2014-11-01,1
1,1051,Elmore,Alabama,7.0,2014-11-01 04:00:00,2014-11-01,1
2,1109,Pike,Alabama,1.0,2014-11-01 04:00:00,2014-11-01,1
3,1121,Talladega,Alabama,31.0,2014-11-01 04:00:00,2014-11-01,1
4,4017,Navajo,Arizona,1.0,2014-11-01 04:00:00,2014-11-01,1
5,5009,Boone,Arkansas,3.0,2014-11-01 04:00:00,2014-11-01,1
6,5119,Pulaski,Arkansas,1.0,2014-11-01 04:00:00,2014-11-01,1
7,6029,Kern,California,30.0,2014-11-01 04:00:00,2014-11-01,1
8,6037,Los Angeles,California,1555.0,2014-11-01 04:00:00,2014-11-01,1
9,6065,Riverside,California,2.0,2014-11-01 04:00:00,2014-11-01,1
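
Because `keep='first'` retains whichever record appears first for a county and date, `customers_out` in the cleaned file reflects a single snapshot rather than a daily total or peak. If a daily summary is wanted, one possible alternative is to aggregate instead of deduplicating. This is a sketch of that option, not what the notebook does. The second and third timestamps and their counts are hypothetical, and `max_customers_out` and `n_snapshots` are illustrative column names.

```python
import pandas as pd

# Toy snapshots; the 04:15 and 2014-11-02 rows are hypothetical.
snapshots = pd.DataFrame({
    "fips_code": [6037, 6037, 6037, 1037],
    "run_start_time": pd.to_datetime([
        "2014-11-01 04:00:00", "2014-11-01 04:15:00",
        "2014-11-02 04:00:00", "2014-11-01 04:00:00",
    ]),
    "customers_out": [1555.0, 1600.0, 20.0, 12.0],
})
snapshots["date"] = snapshots["run_start_time"].dt.normalize()

# The notebook's approach: keep the first record per county and date.
first_only = snapshots.drop_duplicates(subset=["fips_code", "date"], keep="first")

# Alternative: summarize all snapshots within each county-day.
daily = snapshots.groupby(["fips_code", "date"], as_index=False).agg(
    max_customers_out=("customers_out", "max"),
    n_snapshots=("customers_out", "size"),
)
daily["outage_occurred"] = 1

print(first_only)
print(daily)
```

Both results have one row per county and date; they differ only in which `customers_out` information survives.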


## Export Cleaned Data

In [9]:
# Create processed data directory
from pathlib import Path

PROCESSED_DATA_DIR = Path("../../data/processed")
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f"Saving cleaned datasets to: {PROCESSED_DATA_DIR}")

# Save cleaned weather data
weather_file = PROCESSED_DATA_DIR / "weather_cleaned.csv"
weather_pivot.to_csv(weather_file, index=False)
print(f"Weather data saved: {weather_file} ({weather_pivot.shape[0]:,} rows)")

# Save cleaned outage data  
outage_file = PROCESSED_DATA_DIR / "outages_cleaned.csv"
outage_clean.to_csv(outage_file, index=False)
print(f"Outage data saved: {outage_file} ({outage_clean.shape[0]:,} rows)")

print("\nData export complete - ready for downstream notebooks")

Saving cleaned datasets to: ../../data/processed
Weather data saved: ../../data/processed/weather_cleaned.csv (7,944,599 rows)
Outage data saved: ../../data/processed/outages_cleaned.csv (3,701,035 rows)

Data export complete - ready for downstream notebooks


## Data Dictionary

**weather_cleaned.csv**
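- `fips_code`: County FIPS code, stored as an integer (leading zeros are dropped, e.g. `1001` for `01001`)
- `county_name`: County name prefixed with the state abbreviation (e.g. `AL: Autauga`)
- `date`: Calendar date (daily, 2014-01-01 to 2020-12-31)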
- `tmax`: Maximum temperature (°C)
- `tmin`: Minimum temperature (°C)
- `prcp`: Precipitation (mm)

**outages_cleaned.csv**
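- `fips_code`: County FIPS code, stored as an integer
- `county`: County name
- `state`: State name
- `date`: Calendar date derived from `run_start_time`
- `outage_occurred`: Binary indicator; always 1 in this file, because only county-days with at least one outage record are present
- `customers_out`: Number of customers out in the first record retained for that county and date
- `run_start_time`: Timestamp of that first retained record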
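
A downstream notebook might read the two files along these lines. This is a sketch under assumptions: in-memory CSV text stands in for the exported files, the outage row is hypothetical, and `fips` is an illustrative name for a zero-padded string version of the code. A left merge from the weather table leaves county-days without an outage record as missing, so they can be filled with 0.

```python
import io
import pandas as pd

# In-memory stand-ins for weather_cleaned.csv and outages_cleaned.csv.
weather_csv = io.StringIO(
    "fips_code,county_name,date,prcp,tmax,tmin\n"
    "1001,AL: Autauga,2014-01-01,0.0,7.92,3.12\n"
    "1001,AL: Autauga,2014-01-02,7.54,9.14,4.07\n"
)
outage_csv = io.StringIO(
    "fips_code,date,outage_occurred,customers_out\n"
    "1001,2014-01-02,1,5.0\n"  # hypothetical outage record
)

weather = pd.read_csv(weather_csv, parse_dates=["date"])
outages = pd.read_csv(outage_csv, parse_dates=["date"])

# Keep every weather county-day; unmatched rows get outage_occurred = 0.
merged = weather.merge(outages, on=["fips_code", "date"], how="left")
merged["outage_occurred"] = merged["outage_occurred"].fillna(0).astype(int)

# Restore the 5-digit FIPS string for joins with other county-level data.
merged["fips"] = merged["fips_code"].astype(str).str.zfill(5)

print(merged[["fips", "date", "tmax", "outage_occurred"]])
```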