## Testing data types
This cell checks whether the data types in the temperature datasets match the data types in the electricity usage dataset.
Since year and month are integers in both files, merging on them should not cause any type-mismatch problems.

In [None]:
import pandas as pd

# For Alabama temperature data
al_temp = pd.read_csv('../datasets/state_temperature_data/alabama_cleaned_temperature_data.csv')
print("\nAlabama Temperature Data Types:")
print(al_temp.dtypes)
print("\nFirst few rows of Alabama Temperature:")
print(al_temp.head())

# For electricity usage data
elec_usage = pd.read_csv('../datasets/cleaned_electricity_usage_data.csv')
print("\nElectricity Usage Data Types:")
print(elec_usage.dtypes)
print("\nFirst few rows of Alabama Electricity Usage:")
print(elec_usage[elec_usage['stateDescription'] == 'Alabama'].head())
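The cell above compares the dtypes by eye. A minimal sketch of the same check done programmatically is shown below; the two small frames are made-up stand-ins for the real files, and `merge_keys` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-ins for the temperature and electricity files (assumed values).
temp = pd.DataFrame({"year": [2001, 2001], "month": [1, 2], "tavg": [43.7, 55.0]})
elec = pd.DataFrame({"year": [2001, 2001], "month": [1, 2], "price": [5.54, 5.31]})

merge_keys = ["year", "month"]  # hypothetical name for the shared key columns

# Collect every key column whose dtype differs between the two frames.
mismatched = {
    key: (str(temp[key].dtype), str(elec[key].dtype))
    for key in merge_keys
    if temp[key].dtype != elec[key].dtype
}
print(mismatched)  # an empty dict means the key dtypes agree
```

An empty result here only says the key columns share a dtype; it does not say that every (year, month) pair in one file exists in the other.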

## Testing merging
This cell merges the temperature data for one state, "Alabama", with the Alabama rows of the electricity dataset. It is a test run before we merge all of the temperature datasets with the electricity dataset.

In [8]:
# First load both datasets
elec_df = pd.read_csv('../datasets/cleaned_electricity_usage_data.csv')
al_temp = pd.read_csv('../datasets/state_temperature_data/alabama_cleaned_temperature_data.csv')

# Since we want to merge with state name, let's add state column to temperature data
al_temp['stateDescription'] = 'Alabama'

# Merge datasets for Alabama only
# This keeps all electricity columns and adds temperature at the end
test_merge = pd.merge(
    elec_df[elec_df['stateDescription'] == 'Alabama'],
    al_temp[['year', 'month', 'stateDescription', 'tavg']],
    on=['year', 'month', 'stateDescription'],
    how='left'
)

# Rename the temperature column
test_merge = test_merge.rename(columns={'tavg': 'average temperature in Fahrenheit'})

test_merge.to_csv('../datasets/test.csv', index=False)
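One way to inspect a test merge like this without opening the CSV is to let pandas report which rows found a partner. The sketch below uses made-up toy frames, not the real files; `validate` and `indicator` are existing `pd.merge` parameters, while the frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy electricity rows: three months for one state (assumed values).
elec = pd.DataFrame({
    "year": [2001, 2001, 2001],
    "month": [1, 2, 3],
    "stateDescription": ["Alabama"] * 3,
    "price": [5.54, 5.31, 5.87],
})
# Toy temperature rows: month 3 is deliberately missing.
temp = pd.DataFrame({
    "year": [2001, 2001],
    "month": [1, 2],
    "stateDescription": ["Alabama"] * 2,
    "tavg": [43.7, 55.0],
})

checked = pd.merge(
    elec,
    temp,
    on=["year", "month", "stateDescription"],
    how="left",
    validate="many_to_one",  # raises MergeError if temp repeats a key
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Electricity rows that found no temperature value.
unmatched = checked[checked["_merge"] == "left_only"]
print(len(checked), len(unmatched))
```

With `how='left'` the row count stays equal to the electricity side as long as the temperature keys are unique, which is what `validate='many_to_one'` enforces.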

## Merging all Temperature datasets into one List

### Imports
First, the imports:

pandas for data cleaning and merging

glob for collecting all the temperature datasets in our "state_temperature_data" directory

os for working with file paths and extracting file names


### Load Files

Load the electricity file and collect the list of temperature files

Print the first five file names to check that the files are being found

### For Loop

This is the main loop, where each temperature file is processed in turn

The state's name is extracted from the file name and converted to title case to match the state names in our electricity dataset

The electricity rows for that state are selected by matching on the state name

The state name is added as a column INSIDE the temperature data so it can be used as a merge key

pd.merge does the main merging by taking both frames and joining them on = ['year', 'month', 'stateDescription']
The parameter how = 'left' keeps every row from the Electricity dataset and appends the temperature column at the end; rows without a matching temperature get NaN

Lastly, each state's merged data is appended to the merged_states list so the states can be combined into one dataset later on
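The name-extraction step described above can be sketched on its own. The helper name `state_name_from_path` and the `new_york_...` file name are hypothetical and used only for illustration; the string operations mirror the ones in the loop.

```python
import os

def state_name_from_path(path):
    """Turn '.../new_york_cleaned_temperature_data.csv' into 'New York'."""
    file_name = os.path.basename(path)
    # Drop the fixed suffix, leaving e.g. 'new_york'.
    stem = file_name.split("_cleaned_temperature_data.csv")[0]
    # Underscores become spaces, then each word is capitalised.
    return stem.replace("_", " ").title()

example = "../datasets/state_temperature_data/new_york_cleaned_temperature_data.csv"
print(state_name_from_path(example))
```

Note that `str.title()` capitalises every word, so a name containing a lowercase word (for example "of") would come out as "Of" and would need special handling to match the electricity dataset.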


In [None]:
import pandas as pd
import glob
import os

# 1. Read the main electricity usage data
elec_df = pd.read_csv('../datasets/cleaned_electricity_usage_data.csv')

# 2. Get list of all state temperature files
state_files = glob.glob('../datasets/state_temperature_data/*_cleaned_temperature_data.csv')

# Print to check if we're finding the files
print(f"Found {len(state_files)} state files:")
print(state_files[:5])  # Print first 5 files to verify

# 3. Create empty list to store merged data for each state
merged_states = []

# 4. Loop through each state file
for state_file in state_files:
    # Extract state name from file name
    # Handle compound state names properly
    state_file_name = os.path.basename(state_file)
    state_name = state_file_name.split('_cleaned_temperature_data.csv')[0].replace('_', ' ').title()
    print(f"\nProcessing state: {state_name}")
    
    # Read state's temperature data
    state_temp = pd.read_csv(state_file)
    print(f"Temperature data shape: {state_temp.shape}")
    
    # Print to verify we have matching records in electricity data
    state_elec = elec_df[elec_df['stateDescription'] == state_name]
    print(f"Electricity data shape for {state_name}: {state_elec.shape}")
    
    # Add state name to temperature data
    state_temp['stateDescription'] = state_name
    
    # Merge this state's data
    state_merged = pd.merge(
        state_elec,
        state_temp[['year', 'month', 'stateDescription', 'tavg']],
        on=['year', 'month', 'stateDescription'],
        how='left'
    )
    print(f"Merged shape: {state_merged.shape}")
    
    # Add to our list of merged states
    merged_states.append(state_merged)

print(f"\nTotal states processed: {len(merged_states)}")
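The loop prints shapes per state, which makes a name mismatch easy to miss in a long log. As a rough sketch, the two sets of state names could be compared directly before merging. The data below is made up, and `temp_states` stands in for the names derived from the file names.

```python
import pandas as pd

# Toy electricity data (assumed values) and a hypothetical set of names
# derived from temperature file names.
elec = pd.DataFrame({"stateDescription": ["Alabama", "Ohio", "New York", "Ohio"]})
temp_states = {"Alabama", "Ohio", "Texas"}

elec_states = set(elec["stateDescription"].unique())

# Temperature files with no electricity rows: the left merge yields 0 rows.
missing_in_elec = sorted(temp_states - elec_states)
# Electricity states with no temperature file: silently absent from the result.
missing_in_temp = sorted(elec_states - temp_states)

print("No electricity rows for:", missing_in_elec)
print("No temperature file for:", missing_in_temp)
```

Because the loop iterates over temperature files, a state that appears only in the electricity data never enters `merged_states`, so the second list is the one that reveals dropped states.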

## Merging the list with all Temperature data into Electricity dataset

### Concatenating
First we concatenate the merged data for all states on top of each other; ignore_index=True gives the combined dataset one continuous index instead of repeating each state's own index
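A small sketch of what ignore_index changes, using two made-up two-row frames in place of the per-state data:

```python
import pandas as pd

# Toy per-state frames (values are made up for illustration).
a = pd.DataFrame({"stateDescription": ["Alabama", "Alabama"], "tavg": [43.7, 55.0]})
b = pd.DataFrame({"stateDescription": ["Ohio", "Ohio"], "tavg": [28.1, 30.4]})

kept = pd.concat([a, b])                      # each frame keeps its own index
reset = pd.concat([a, b], ignore_index=True)  # one fresh 0..n-1 index

print(list(kept.index))   # [0, 1, 0, 1]
print(list(reset.index))  # [0, 1, 2, 3]
```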

### Sorting
Sort the merged dataset by state first, then by year, then by month

### Save
Everything is saved into a csv file 

### Prints
Lastly, a few print statements on the merged dataset check that everything merged correctly

In [3]:
# 5. Combine all states into one DataFrame
final_df = pd.concat(merged_states, ignore_index=True)

# 6. Sort the data by state, year, and month
final_df = final_df.sort_values(['stateDescription', 'year', 'month'])

# 7. Save to new CSV file
final_df.to_csv('../datasets/merged_electricity_temperature_data.csv', index=False)

# 8. Print some information about the final dataset
print("\nFinal merged dataset info:")
print(f"Total number of records: {len(final_df)}")
print(f"Number of states: {final_df['stateDescription'].nunique()}")
print(f"Date range: {final_df['year'].min()}-{final_df['month'].min()} to {final_df['year'].max()}-{final_df['month'].max()}")
print("\nFirst few rows:")
print(final_df.head())


Final merged dataset info:
Total number of records: 13850
Number of states: 50
Date range: 2001-1 to 2024-12

First few rows:
   year  month stateDescription   sectorName  price    revenue       sales  \
0  2001      1          Alabama  all sectors   5.54  407.61261  7362.47302   
1  2001      2          Alabama  all sectors   5.31  321.06715  6041.02574   
2  2001      3          Alabama  all sectors   5.87  345.77802  5894.61038   
3  2001      4          Alabama  all sectors   5.72  347.18634  6064.53539   
4  2001      5          Alabama  all sectors   5.60  359.09236  6413.96530   

   tavg  
0  43.7  
1  55.0  
2  53.4  
3  66.0  
4  72.2  
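One caveat about the date range line: it takes the minimum and maximum of year and month independently, so it is only right when the data happens to start in the lowest-numbered month and end in the highest-numbered one. A minimal sketch of a paired alternative, on a made-up frame, is below; whether the real dataset is affected is not something this sketch can tell.

```python
import pandas as pd

# Toy data running from 2001-3 to 2002-2 (assumed values).
df = pd.DataFrame({"year": [2001, 2001, 2002], "month": [3, 12, 2]})

# Independent min/max, as in the print statement above.
naive = (
    f"{df['year'].min()}-{df['month'].min()} to "
    f"{df['year'].max()}-{df['month'].max()}"
)

# Sort by (year, month) and read the first and last rows together.
ordered = df.sort_values(["year", "month"])
first, last = ordered.iloc[0], ordered.iloc[-1]
paired = f"{first['year']}-{first['month']} to {last['year']}-{last['month']}"

print(naive)   # 2001-2 to 2002-12
print(paired)  # 2001-3 to 2002-2
```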
