# Introduction 

In this notebook, I merge two cleaned datasets, weather and energy, into a single dataset for the exploratory data analysis (EDA) located [here](https://github.com/KishenSharma6/Weather-Energy-Consumption-in-Spain/tree/master/Project%20Codes/02_Exploratory_Data_Analysis).

* Raw data can be found [here](https://github.com/KishenSharma6/Weather-Energy-Consumption-in-Spain/tree/master/Data/01_Raw_Data)
* Cleaned data can be found [here](https://github.com/KishenSharma6/Weather-Energy-Consumption-in-Spain/tree/master/Data/02_Cleaned_Data)

**Read in libraries for notebook**

In [1]:
import numpy as np
import pandas as pd

**Set notebook preferences**

In [2]:
#Set preferences for pandas 
pd.set_option("display.max_columns", 101)

**Read in data**

In [3]:
#Set path to project folder
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Spain Hourly Energy Demand and Weather'

#Read in cleaned data
weather = pd.read_csv(path + '/Data/02_Cleaned_Data/2020_0620_Cleaned_Weather_Features.csv',
                      parse_dates=['date_time'], index_col='date_time')
energy =  pd.read_csv(path + '/Data/02_Cleaned_Data/2020_0620_Cleaned_Energy_Dataset.csv',
                      parse_dates=['date_time'], index_col='date_time')
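
The hard-coded Windows path above only resolves on the author's machine. A minimal sketch of a more portable setup with `pathlib` follows; `PROJECT_ROOT` and `cleaned_file` are hypothetical names that do not appear in the original notebook, and the folder layout is assumed to match the repository links in the introduction.

```python
from pathlib import Path

# Hypothetical: point this at your local copy of the project folder.
PROJECT_ROOT = Path("Spain Hourly Energy Demand and Weather")

def cleaned_file(name: str, root: Path = PROJECT_ROOT) -> Path:
    """Build the path to a file in Data/02_Cleaned_Data under the project root."""
    return root / "Data" / "02_Cleaned_Data" / name

weather_path = cleaned_file("2020_0620_Cleaned_Weather_Features.csv")
energy_path = cleaned_file("2020_0620_Cleaned_Energy_Dataset.csv")
print(weather_path)
```

Either path can be passed straight to `pd.read_csv`, which accepts `Path` objects, together with the same `parse_dates` and `index_col` arguments used above.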

# Preview Data

**Weather data**

In [4]:
#View data shape and head
print('Weather data shape:', weather.shape)
display(weather.head())

Weather data shape: (35064, 1)


date_time,temp
2015-01-01 00:00:00,30.814633
2015-01-01 01:00:00,30.85286
2015-01-01 02:00:00,30.108448
2015-01-01 03:00:00,30.091044
2015-01-01 04:00:00,30.19262


**Energy data**

In [5]:
#View data shape and head
print('Energy data shape:', energy.shape)
display(energy.head())

Energy data shape: (35064, 20)


date_time,generation biomass,generation fossil brown coal/lignite,generation fossil gas,generation fossil hard coal,generation fossil oil,generation hydro pumped storage consumption,generation hydro run-of-river and poundage,generation hydro water reservoir,generation nuclear,generation other,generation other renewable,generation solar,generation waste,generation wind onshore,forecast solar day ahead,forecast wind onshore day ahead,total load forecast,total load actual,price day ahead,price actual
2015-01-01 00:00:00,447.0,329.0,4844.0,4821.0,162.0,863.0,1051.0,1899.0,7096.0,43.0,73.0,49.0,196.0,6378.0,17.0,6436.0,26118.0,25385.0,50.1,65.41
2015-01-01 01:00:00,449.0,328.0,5196.0,4755.0,158.0,920.0,1009.0,1658.0,7096.0,43.0,71.0,50.0,195.0,5890.0,16.0,5856.0,24934.0,24382.0,48.1,64.92
2015-01-01 02:00:00,448.0,323.0,4857.0,4581.0,157.0,1164.0,973.0,1371.0,7099.0,43.0,73.0,50.0,196.0,5461.0,8.0,5454.0,23515.0,22734.0,47.33,64.48
2015-01-01 03:00:00,438.0,254.0,4314.0,4131.0,160.0,1503.0,949.0,779.0,7098.0,43.0,75.0,50.0,191.0,5238.0,2.0,5151.0,22642.0,21286.0,42.27,59.32
2015-01-01 04:00:00,428.0,187.0,4130.0,3840.0,156.0,1826.0,953.0,720.0,7097.0,43.0,74.0,42.0,189.0,4935.0,9.0,4861.0,21785.0,20264.0,38.41,56.04


# Merge data

In [15]:
#Merge datasets on index
df = pd.merge(energy, weather, left_index=True, right_index=True)

#Sort merged columns alphabetically
df = df.reindex(sorted(df.columns), axis=1)

#Drop rows that are exact duplicates across all columns (the index is not compared)
df.drop_duplicates(inplace = True)

#Replace ' ' in cols with '_'
df.columns = df.columns.str.replace(' ', '_', regex=False)

#Check
print('Merged data frame shape: ', df.shape)
display(df.head())
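
The final shapes printed at the end of this notebook show 35070 rows, six more than either 35064-row input. One possible explanation, which the notebook does not confirm, is repeated timestamps in the indexes: an index merge pairs every matching left row with every matching right row, and `drop_duplicates` compares column values only, so rows that share a timestamp but differ in any value survive. A toy sketch with made-up numbers (`left`, `right`, and `clean` are illustrative names):

```python
import pandas as pd

# Made-up data: the 01:00 timestamp appears twice in both frames.
idx = pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 01:00"])
left = pd.DataFrame({"load": [100.0, 110.0, 111.0]}, index=idx)
right = pd.DataFrame({"temp": [5.0, 6.0, 6.5]}, index=idx)

# The repeated timestamp yields 2 x 2 = 4 rows, plus 1 row for 00:00.
merged = pd.merge(left, right, left_index=True, right_index=True)
print(merged.shape)  # (5, 2)

# All five rows differ in value, so drop_duplicates removes nothing.
print(len(merged.drop_duplicates()))  # 5

# One option: keep the first row per timestamp in each input before merging.
left_unique = left[~left.index.duplicated(keep="first")]
right_unique = right[~right.index.duplicated(keep="first")]
clean = pd.merge(left_unique, right_unique, left_index=True, right_index=True)
print(clean.shape)  # (2, 2)
```

Running `weather.index.duplicated().sum()` and `energy.index.duplicated().sum()` before the merge would show whether this applies to the real data. Whether to keep the first row, the last row, or an average per timestamp is a judgment call that depends on how the duplicates arose.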

## Isolate price day ahead and remove forecast variables from DF

Store the day-ahead price forecasts in a separate DataFrame. Later, we will try to build a model that outperforms these predictions.

In [16]:
#Store price forecasts in df to write later
#(column names use '_' instead of ' ' after the rename in the merge step)
price_forecast = pd.DataFrame()
price_forecast['price_forecast'] = df['price_day_ahead']

#Check
display(price_forecast.head())

date_time,price_forecast
2015-01-01 00:00:00,50.1
2015-01-01 01:00:00,48.1
2015-01-01 02:00:00,47.33
2015-01-01 03:00:00,42.27
2015-01-01 04:00:00,38.41


In [17]:
#Drop forecast variables
drop = df.filter(regex='ahead|forecast').columns
df.drop(drop, axis = 1, inplace =True)

#Check
display(df.head())

date_time,generation biomass,generation fossil brown coal/lignite,generation fossil gas,generation fossil hard coal,generation fossil oil,generation hydro pumped storage consumption,generation hydro run-of-river and poundage,generation hydro water reservoir,generation nuclear,generation other,generation other renewable,generation solar,generation waste,generation wind onshore,price actual,temp,total load actual
2015-01-01 00:00:00,447.0,329.0,4844.0,4821.0,162.0,863.0,1051.0,1899.0,7096.0,43.0,73.0,49.0,196.0,6378.0,65.41,30.814633,25385.0
2015-01-01 01:00:00,449.0,328.0,5196.0,4755.0,158.0,920.0,1009.0,1658.0,7096.0,43.0,71.0,50.0,195.0,5890.0,64.92,30.85286,24382.0
2015-01-01 02:00:00,448.0,323.0,4857.0,4581.0,157.0,1164.0,973.0,1371.0,7099.0,43.0,73.0,50.0,196.0,5461.0,64.48,30.108448,22734.0
2015-01-01 03:00:00,438.0,254.0,4314.0,4131.0,160.0,1503.0,949.0,779.0,7098.0,43.0,75.0,50.0,191.0,5238.0,59.32,30.091044,21286.0
2015-01-01 04:00:00,428.0,187.0,4130.0,3840.0,156.0,1826.0,953.0,720.0,7097.0,43.0,74.0,42.0,189.0,4935.0,56.04,30.19262,20264.0
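
`DataFrame.filter(regex=...)` matches column labels with `re.search`, so the pattern `'ahead|forecast'` selects every column whose name contains either word. A small sketch on a one-row frame that reuses a few column names and first-row values from the energy data above (`toy`, `forecast_cols`, and `kept` are illustrative names):

```python
import pandas as pd

# One-row frame with a mix of forecast and non-forecast columns.
toy = pd.DataFrame({
    "generation solar": [49.0],
    "forecast solar day ahead": [17.0],
    "total load forecast": [26118.0],
    "total load actual": [25385.0],
    "price day ahead": [50.1],
    "price actual": [65.41],
})

# Select labels containing 'ahead' or 'forecast', then drop them.
forecast_cols = toy.filter(regex="ahead|forecast").columns
kept = toy.drop(columns=forecast_cols)

print(list(forecast_cols))
print(list(kept.columns))
```

Because the match is a substring search, the same pattern works whether the column names contain spaces or underscores.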


# Write merged data to CSV

In [18]:
#View final shape of merged data
print('Final shape of merged data:', df.shape)
print('Final shape of price forecast data:', price_forecast.shape)

#Write merged data
df.to_csv(path + '/Data/02_Cleaned_Data/2020_0620_Weather_Energy.csv')

#Write price forecasts
price_forecast.to_csv(path + '/Data/03_Processed_Data/2020_0620_Data_Price_Forecasts.csv')

Final shape of merged data: (35070, 17)
Final shape of price forecast data: (35070, 1)
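
Since `to_csv` writes the `date_time` index as an ordinary column, a later notebook has to parse it again when reading the file. A minimal round-trip sketch, assuming the same `parse_dates` and `index_col` arguments used when reading the inputs; the two-row frame is made up from values shown earlier, and an in-memory buffer stands in for the file on disk:

```python
import io
import pandas as pd

# Two-row frame with a named datetime index, mirroring the merged data.
idx = pd.to_datetime(["2015-01-01 00:00:00", "2015-01-01 01:00:00"]).rename("date_time")
toy = pd.DataFrame({"price_actual": [65.41, 64.92], "temp": [30.814633, 30.85286]}, index=idx)

# Write to an in-memory buffer instead of disk so the sketch has no side effects.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read back the same way the notebook reads its cleaned inputs.
restored = pd.read_csv(buffer, parse_dates=["date_time"], index_col="date_time")

print(restored.shape == toy.shape)
print(list(restored.index) == list(toy.index))
```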
