# Merge datasets

The three datasets (energy, temperature and solar) are merged here.

Problems here:
- The *Energy* dataset starts in 2013, while the others start in 2016.
- The *Energy* dataset has one record every 30 minutes, *Temperatures* only one per day and *Solar* one every 3 hours.

So, after merging, most rows have no temperature or solar value: we lose information within each day.
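
To make the resolution problem concrete, here is a minimal sketch with made-up values (the column names mirror the real ones, but the numbers and the region are invented for illustration). A left merge of half-hourly rows with a single daily reading leaves every intra-day row empty:

```python
import pandas as pd

# Toy half-hourly energy rows (values are invented for illustration)
energy_toy = pd.DataFrame({
    'date': pd.date_range('2016-01-01 00:00', periods=4, freq='30min'),
    'region': 'Bretagne',
    'consommation': [100, 110, 120, 130],
})

# Toy daily temperature: a single reading at midnight
temp_toy = pd.DataFrame({
    'date': pd.date_range('2016-01-01 00:00', periods=1, freq='D'),
    'region': ['Bretagne'],
    'tmoy': [5.0],
})

# The left merge keeps all four energy rows; only the midnight row finds a match
merged_toy = pd.merge(energy_toy, temp_toy, on=['date', 'region'], how='left')
print(merged_toy['tmoy'].isna().sum())  # 3
```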

What I am going to do here is merge the datasets, keep only data **from** *2016-01-01* onward and replace NA values with the last valid value.

**Example:**  
If temperatures are set at `2016-01-01 00:00:00` but not at `2016-01-01 01:00`, the latter gets the value from `00:00`, provided that it is the last valid value.
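
The example can be sketched on a toy frame (the values are invented; `ffill` is the pandas forward fill, equivalent to the `pad` method used in the cells below):

```python
import numpy as np
import pandas as pd

# Toy temperature column after the merge: only 00:00 has a value
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01 00:00', '2016-01-01 00:30', '2016-01-01 01:00']),
    'tmoy': [5.0, np.nan, np.nan],
})

# Forward fill: each NA takes the last valid value above it
filled = toy.ffill()
print(filled['tmoy'].tolist())  # [5.0, 5.0, 5.0]
```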

In [17]:
import pandas as pd
import numpy as np
import feather
import datetime

import matplotlib.pyplot as plt

energy = pd.read_feather('data/energy.ftr')
temp = pd.read_feather('data/temperature.ftr')
sw = pd.read_feather('data/rayonnement.ftr')

In [18]:
# Left merge: keeps every energy row and adds temperature columns where a match exists
merged = pd.merge(left=energy, left_on=['date_heure', 'region'], right=temp, right_on=['date', 'region'], how='left')

In [19]:

# Cleaning dataset
merged.drop(columns=['code_insee_region_y', 'date'], inplace=True)
merged.rename(columns={'date_heure':'date', 'code_insee_region_x': 'code_insee_region'}, inplace=True)

In [20]:
# Keeps data from 2016-01-01 onward
merged_2016 = merged[ merged['date'] >= '2016-01-01' ]
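
A small sketch of how this filter behaves at the boundary, on invented rows: the string is compared as a timestamp, and `>=` keeps midnight of 2016-01-01 itself. The `.copy()` is an addition of mine, so that later changes do not touch a slice of the original frame.

```python
import pandas as pd

# Toy rows around the cut-off date (values are invented)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-12-31 23:30', '2016-01-01 00:00', '2016-01-01 00:30']),
    'consommation': [90, 100, 110],
})

# Boolean mask on the datetime column; the 2015 row is dropped
toy_2016 = toy[toy['date'] >= '2016-01-01'].copy()
print(len(toy_2016))  # 2
```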

In [21]:
# Since data are recorded by region, each region is filled separately
# Collects the filled per-region frames
regions = []

# For each region
for r in merged_2016['region'].unique():
    # Gets the region's data (2016 onward), in chronological order
    region = merged_2016[ merged_2016['region'] == r ].sort_values(by='date')
    # Fills NAs with the last valid value (forward fill)
    regions.append(region.ffill())

# Concatenates everything (DataFrame.append was removed in pandas 2.0)
merged_clean = pd.concat(regions, ignore_index=True)
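
The per-region loop above could also be written without a loop. This is only a sketch on invented data, assuming a `groupby` on `region` is acceptable here; it also shows why the fill has to stay inside each region: region `B` has no valid value and must not inherit the one from region `A`.

```python
import numpy as np
import pandas as pd

# Toy frame with two regions (names and values are invented)
toy = pd.DataFrame({
    'region': ['A', 'B', 'A', 'B'],
    'date': pd.to_datetime(['2016-01-01 00:00', '2016-01-01 00:00',
                            '2016-01-01 00:30', '2016-01-01 00:30']),
    'tmoy': [5.0, np.nan, np.nan, np.nan],
})

# Sorts so that "last valid value" means the previous timestamp of the region
toy = toy.sort_values(['region', 'date'])

# Forward fill within each region, never across regions
value_cols = ['tmoy']
toy[value_cols] = toy.groupby('region')[value_cols].ffill()

print(toy.loc[toy['region'] == 'A', 'tmoy'].tolist())  # [5.0, 5.0]
print(toy.loc[toy['region'] == 'B', 'tmoy'].isna().all())  # True
```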


In [22]:
# Needs to be sorted by date 
merged_clean.sort_values(by='date', ascending=True, inplace=True)

In [23]:
# Does the same with solar and wind
clean = pd.merge(left=merged_clean, left_on=['date', 'region'], right=sw, right_on=['date', 'region'], how='left')

In [24]:
clean.drop(columns=['code_insee_region_y'], inplace=True)
clean.rename(columns={'code_insee_region_x': 'code_insee_region'}, inplace=True)

In [25]:
# Same per-region forward fill as before
regions = []
for r in clean['region'].unique():
    region = clean[ clean['region'] == r ].sort_values(by='date')
    regions.append(region.ffill())

# Concatenates everything (DataFrame.append was removed in pandas 2.0)
df = pd.concat(regions, ignore_index=True)

df.sort_values(by='date', ascending=True, inplace=True)

In [26]:
df.reset_index(drop=True, inplace=True)

In [27]:
# Replaces all remaining NAs with 0 (rows before a region's first valid value cannot be forward-filled)
df.fillna(0, inplace=True)
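
Why some NAs can survive the forward fill, sketched on an invented series: rows that come before the first valid value have nothing to copy from, so they are the ones that end up replaced by 0.

```python
import numpy as np
import pandas as pd

# Toy column whose first rows come before any valid reading
s = pd.Series([np.nan, np.nan, 3.0, np.nan])

# Forward fill cannot produce a value for the leading rows
after_ffill = s.ffill()
print(after_ffill.isna().sum())  # 2

# Replaces what is left with 0, as in the cell above
final = after_ffill.fillna(0)
print(final.tolist())  # [0.0, 0.0, 3.0, 3.0]
```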

In [28]:
# Ok! No NA left in any column
df.isna().sum()

code_insee_region      0
region                 0
date                   0
consommation           0
thermique              0
nucleaire              0
eolien                 0
solaire                0
hydraulique            0
pompage                0
bioenergies            0
ech_physiques          0
tmin                   0
tmax                   0
tmoy                   0
vitesse_vent           0
rayonnement_solaire    0
dtype: int64

In [29]:
# And saves everything
df.to_feather('data/merged.ftr')

In [30]:
# Wind dataset
eolien = df[ ['date', 'region', 'code_insee_region', 'eolien', 'tmin', 'tmax', 'tmoy', 'vitesse_vent'] ]

In [31]:
# Solar dataset
solaire = df[ ['date', 'region', 'code_insee_region', 'solaire', 'tmin', 'tmax', 'tmoy', 'vitesse_vent', 'rayonnement_solaire'] ]

In [32]:
# Saves both datasets
eolien.to_feather('data/eolien.ftr')
solaire.to_feather('data/solaire.ftr')