# Data Cleaning / Wrangling

**Goal** : gather all the data into one clean CSV file


## Step 1 : Consumptions


In [6]:
# import the data
import pandas as pd
df1 = pd.read_csv("./data/eco2mix-regional-cons-def.csv", delimiter=";")

In [7]:
# inspect the table
print("Shape", df1.shape)
print("Columns", df1.columns)
df1.sample(3)

Shape (99312, 15)
Columns Index(['Code INSEE région', 'Région', 'Nature', 'Date', 'Heure',
       'Date - Heure', 'Consommation (MW)', 'Thermique (MW)', 'Nucléaire (MW)',
       'Eolien (MW)', 'Solaire (MW)', 'Hydraulique (MW)', 'Pompage (MW)',
       'Bioénergies (MW)', 'Ech. physiques (MW)'],
      dtype='object')


| | Code INSEE région | Région | Nature | Date | Heure | Date - Heure | Consommation (MW) | Thermique (MW) | Nucléaire (MW) | Eolien (MW) | Solaire (MW) | Hydraulique (MW) | Pompage (MW) | Bioénergies (MW) | Ech. physiques (MW) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63843 | 11 | Ile-de-France | Données définitives | 2015-03-11 | 10:30 | 2015-03-11T10:30:00+01:00 | 10702.0 | 864.0 | NaN | 1.0 | 27.0 | 5.0 | NaN | 119.0 | 9686.0 |
| 77646 | 11 | Ile-de-France | Données définitives | 2016-08-27 | 13:30 | 2016-08-27T13:30:00+02:00 | 6982.0 | -4.0 | NaN | 4.0 | 42.0 | 6.0 | NaN | 118.0 | 6816.0 |
| 5027 | 11 | Ile-de-France | Données consolidées | 2017-11-21 | 22:30 | 2017-11-21T22:30:00+01:00 | 9404.0 | 524.0 | NaN | 16.0 | 0.0 | 7.0 | NaN | 154.0 | 8703.0 |


In [8]:
# rebuild the timestamp from the local date and time (this drops the UTC offset)
df1["Date - Heure"] = df1["Date"] + " " + df1["Heure"]


In [9]:
# keep only the interesting columns
df1 = df1[["Date - Heure", "Consommation (MW)"]]

In [10]:
# Count rows that are exact duplicates (same date and same consumption)
df1.duplicated().sum()

0

In [None]:
# Convert column types if needed
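
One possible way to fill in this cell, sketched on made-up rows rather than the real file: look at the dtypes, then parse the timestamp strings and make sure the consumption is numeric. The `sample` frame and its values are hypothetical. The date format assumes the `"Date" + " " + "Heure"` strings built earlier.

```python
import pandas as pd

# Toy rows shaped like the "Date - Heure" strings built above (values are made up).
sample = pd.DataFrame({
    "Date - Heure": ["2015-03-11 10:30", "2016-08-27 13:30"],
    "Consommation (MW)": ["10702.0", "6982.0"],
})

# Inspect the types first: here both columns start out as plain strings.
print(sample.dtypes)

# Parse the timestamps and convert the consumption to numbers.
sample["Date - Heure"] = pd.to_datetime(sample["Date - Heure"], format="%Y-%m-%d %H:%M")
sample["Consommation (MW)"] = pd.to_numeric(sample["Consommation (MW)"])

print(sample.dtypes)
```

With the real file the consumption column is probably already numeric, in which case only the timestamp conversion is needed.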

In [None]:
# Check for duplicates

Some dates appear more than once with different consumptions, **interesting** !
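
A minimal sketch of how such rows can be found, on a made-up frame (`toy` and its values are hypothetical). Exact row duplicates and repeated timestamps are two different checks, which is why the first count can be 0 while the second one is not.

```python
import pandas as pd

# Made-up rows: the 10:30 timestamp appears twice with different consumptions.
toy = pd.DataFrame({
    "Date - Heure": ["2017-11-21 10:00", "2017-11-21 10:30", "2017-11-21 10:30"],
    "Consommation (MW)": [9404.0, 9500.0, 9650.0],
})

# No row is an exact copy of another one...
full_duplicates = toy.duplicated().sum()

# ...but one timestamp appears more than once.
# keep=False flags every occurrence, not only the repeats.
same_date = toy[toy.duplicated(subset="Date - Heure", keep=False)]

print(full_duplicates)
print(same_date)
```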

In [None]:
# Remove duplicates
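
A possible way to fill in this cell, again on made-up rows: keep one row per timestamp with `drop_duplicates`. Keeping the first occurrence is an arbitrary choice made for this sketch. Averaging the repeated readings would be another reasonable option.

```python
import pandas as pd

# Made-up rows with one repeated timestamp.
toy = pd.DataFrame({
    "Date - Heure": ["2017-11-21 10:00", "2017-11-21 10:30", "2017-11-21 10:30"],
    "Consommation (MW)": [9404.0, 9500.0, 9650.0],
})

# Keep the first row seen for each timestamp and drop the later ones.
deduplicated = toy.drop_duplicates(subset="Date - Heure", keep="first")

print(deduplicated)
```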

In [None]:
# Rename columns
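
A short sketch of the renaming, using the `Date` and `Conso` names that the automation function in Step 6 also uses. The `toy` frame is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date - Heure": ["2017-11-21 10:00", "2017-11-21 10:30"],
    "Consommation (MW)": [9404.0, 9500.0],
})

# Map the original French headers to short names.
renamed = toy.rename(columns={"Date - Heure": "Date", "Consommation (MW)": "Conso"})

print(renamed.columns.tolist())
```

Using an explicit mapping with `rename` avoids depending on the column order, unlike assigning a list to `columns`.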

In [None]:
# Check days with missing half hours

We have days with fewer than 48 half hours, **interesting** !
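
One way to spot those days is to count the readings per calendar day. This is a sketch on a made-up series (`toy`, with two days, the second one missing two readings), not the real data.

```python
import pandas as pd

# Made-up series: one complete day, then a day missing its 03:00 and 03:30 readings.
full_day = pd.date_range("2016-01-01 00:00", periods=48, freq="30min")
short_day = pd.date_range("2016-01-02 00:00", periods=48, freq="30min").delete([6, 7])
toy = pd.DataFrame({"Date": full_day.append(short_day)})
toy["Conso"] = 9000.0

# Count the readings per calendar day and keep the days that fall short of 48.
readings_per_day = toy.groupby(toy["Date"].dt.date).size()
incomplete_days = readings_per_day[readings_per_day < 48]

print(incomplete_days)
```

In local time, the day the clocks move forward naturally has only 46 half hours, so not every short day is a data gap.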

## Step 2 : Temperatures

In [None]:
df2 = pd.read_csv("./data/meteo-paris.csv")

In [None]:
df2.head(1)
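
Judging from the `get_data` function in Step 6, the weather file seems to hold a `dt` column in Unix seconds and a `temp` column in Kelvin. Under that assumption, the conversion can be sketched on made-up rows as follows (`toy_weather` and its values are hypothetical):

```python
import pandas as pd

# Made-up rows; the `dt` and `temp` names follow the columns read in Step 6.
toy_weather = pd.DataFrame({"dt": [1420070400, 1420074000], "temp": [278.15, 277.65]})

# Unix seconds (UTC) -> Paris local time, then drop the time zone information.
toy_weather["Date"] = (
    pd.to_datetime(toy_weather["dt"], unit="s", utc=True)
    .dt.tz_convert("Europe/Paris")
    .dt.tz_localize(None)
)

# Kelvin -> degrees Celsius.
toy_weather["Temp"] = toy_weather["temp"] - 273.15

print(toy_weather[["Date", "Temp"]])
```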

## Step 3 : Merge everything together

[Documentation on how to merge with pandas](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html)


![How to merge](https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2017/03/join-types-merge-names.jpg)


**Inner Merge / Inner join** – The default Pandas behaviour, only keep rows where the merge “on” value exists in both the left and right dataframes.

**Left Merge / Left outer join** – Keep every row in the left dataframe. Where there are missing values of the “on” variable in the right dataframe, add empty / NaN values in the result.

**Right Merge / Right outer join** – Keep every row in the right dataframe. Where there are missing values of the “on” variable in the left dataframe, add empty / NaN values in the result.

**Outer Merge / Full outer join** – A full outer join returns all the rows from the left dataframe, all the rows from the right dataframe, and matches up rows where possible, with NaNs elsewhere.
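
The join types described above can be compared on two tiny made-up tables. The `consumption` and `weather` frames below are hypothetical stand-ins for the real data, with one timestamp missing on each side.

```python
import pandas as pd

consumption = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 00:30", "2016-01-01 01:00"]),
    "Conso": [9000.0, 8900.0, 8800.0],
})
weather = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00"]),
    "Temp": [5.0, 4.5, 4.0],
})

# Left merge: every consumption row survives; 00:30 has no weather match, so Temp is NaN.
left = pd.merge(consumption, weather, on="Date", how="left")

# Inner merge: only the timestamps present in both tables remain.
inner = pd.merge(consumption, weather, on="Date", how="inner")

# Outer merge: the union of the timestamps, with NaN where one side is missing.
outer = pd.merge(consumption, weather, on="Date", how="outer")

print(left)
print(inner)
print(outer)
```

A left merge with the consumption table on the left keeps every consumption reading, which is the choice made in Step 6.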

## Step 4 : Handle missing half hours
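
A minimal sketch of the idea, following the approach used in Step 6: build the complete half-hourly grid, then left-merge the data onto it so that the missing half hours appear as NaN rows. The `toy` frame is made up.

```python
import pandas as pd

# Made-up table with the 00:30 and 01:00 rows missing.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:30"]),
    "Conso": [9000.0, 8700.0],
})

# Build the complete half-hourly grid between the first and last timestamp.
grid = pd.DataFrame(
    {"Date": pd.date_range(start=toy["Date"].min(), end=toy["Date"].max(), freq="30min")}
)

# Left-merge the data onto the grid: missing half hours show up as NaN rows.
complete = pd.merge(grid, toy, on="Date", how="left")

print(complete)
```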

## Step 5 : Interpolate missing values
We want to keep our historical consumptions (which are precious), so rather than dropping rows we will interpolate the missing temperature values.

First question : where are the missing values ?
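
A possible way to answer that question and then fill the gaps, sketched on a made-up frame (`toy` is hypothetical). The `limit=4` value mirrors the one used in Step 6 and caps how many consecutive missing values get filled.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.date_range("2016-01-01 00:00", periods=5, freq="30min"),
    "Conso": [9000.0, 8900.0, 8800.0, 8700.0, 8600.0],
    "Temp": [5.0, None, None, None, 3.0],
})

# Where are the missing values? Count them per column...
missing_per_column = toy.isna().sum()
# ...and list the timestamps where the temperature is missing.
missing_dates = toy.loc[toy["Temp"].isna(), "Date"]

# Fill the temperature by linear interpolation, at most 4 consecutive values per gap.
toy["Temp"] = toy["Temp"].interpolate(method="linear", limit=4)

print(missing_per_column)
print(missing_dates)
print(toy)
```

Interpolating only the `Temp` column leaves the consumption values untouched.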

## Step 6 : Automate everything

In [None]:
def get_data(consumption_csv="./data/eco2mix-regional-cons-def.csv", weather_csv="./data/meteo-paris.csv"):
    """
    Load the consumption and weather data, do the wrangling,
    and return a compact dataframe with at most one row per half hour.

    Columns: Date (Paris local time, no time zone), Conso (MW), Temp (°C).
    """
    # consumptions
    consumption =  pd.read_csv(consumption_csv, delimiter=";", usecols = ["Date - Heure","Consommation (MW)"])
    consumption["Date - Heure"] = pd.to_datetime(consumption["Date - Heure"], utc=True).dt.tz_convert('Europe/Paris').dt.tz_localize(None)
    consumption.columns = ['Date', 'Conso']
    # weather
    weather = pd.read_csv(weather_csv,usecols=['dt','temp'])
    weather.columns = ['Date', 'Temp']
    weather['Date'] = pd.to_datetime(weather['Date'],unit='s',utc=True).dt.tz_convert('Europe/Paris').dt.tz_localize(None)    
    # Merging
    df1 = pd.merge(consumption,weather,on='Date',how="left")
    df1["Temp"] = df1["Temp"] - 273.15
    # Half hours
    date_range = pd.date_range(start=df1['Date'].min(),end=df1['Date'].max(),freq='30min')
    half_hours = pd.DataFrame(date_range,columns=['Date'])
    df2 = pd.merge(half_hours,df1,on='Date',how="left")
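    # Interpolation: fill at most 4 consecutive missing values per gap (applies to both Conso and Temp)
    df2.interpolate('linear',limit=4,inplace=True)
    # Drop duplicates: keep the first row for each timestamp
    df2.drop_duplicates(inplace=True,subset='Date')
    # Drop the rows that are still incomplete after interpolation
    return df2.dropna()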

In [None]:
df = get_data()

In [None]:
df.set_index('Date', inplace=True)

In [None]:
df['Conso']['2014':'2018'].plot(figsize=(15,5))

In [None]:
df['Temp']['2014':'2018'].plot(figsize=(15,5))

In [None]:
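import seaborn as sns
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df['Conso'], kde=True)

In [None]:
sns.histplot(df['Temp'], kde=True)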

# That's Clean !