# Data collection

The goal of this presentation is to automate data collection.
We need historical data to build a model that predicts electricity consumption in Paris for the next day (D+1).  
Starting from nothing, we want to collect as much relevant data as possible.

### Available techniques

- Download a file
- Read a database
- Call an API
- Scrape the internet
- Reflection
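
As an illustration of the first technique in the list, here is a minimal sketch of a file download that uses only the standard library. The helper name `download_file` and the destination layout are assumptions made for this example; they are not part of the original workshop.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: Path) -> Path:
    """Stream the content found at `url` into `destination` and return the path.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Create the target folder (for example ./data) if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Copy the response in chunks so that large exports are not loaded in memory.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

A call such as `download_file(export_url, Path("./data/eco2mix_regional_cons_def.csv"))` would then store an export locally, where `export_url` stands for the CSV link offered by the open data portal.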

### Expected output

Datasets saved on our computer in file formats that Python reads easily: CSV, JSON, XML, Excel.
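
To make the expected output concrete, the sketch below writes a small DataFrame to CSV and JSON with pandas. The helper `save_dataset`, the folder name and the toy values are assumptions for illustration only.

```python
from pathlib import Path

import pandas as pd


def save_dataset(df: pd.DataFrame, directory: Path, name: str) -> dict:
    """Write `df` as CSV and JSON under `directory` and return both paths.

    Hypothetical helper, not part of the original notebook.
    """
    directory.mkdir(parents=True, exist_ok=True)
    csv_path = directory / f"{name}.csv"
    json_path = directory / f"{name}.json"
    # index=False keeps the files free of the automatic integer index.
    df.to_csv(csv_path, index=False)
    # One JSON object per row is easy to reload with pd.read_json.
    df.to_json(json_path, orient="records")
    return {"csv": csv_path, "json": json_path}
```

For example, `save_dataset(df, Path("./data"), "forecast")` would create `./data/forecast.csv` and `./data/forecast.json`.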

### Quiz (5 minutes): What kind of data might be useful for making a prediction?

### Suggestions

The data we are going to collect:

- Historical electricity consumption of Paris
- Historical weather
- Weather forecast
- Days off in France

- Number of electric vehicles

### Workshop (15 minutes) : Try to collect the data


## Easy: Historical electrical consumption

[Electrical consumption in Île-de-France between 2013 and 2017](https://rte-opendata.opendatasoft.com/explore/dataset/eco2mix_regional_cons_def/export/?disjunctive.libelle_region&disjunctive.nature&sort=-date_heure&refine.libelle_region=Ile-de-France)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
consumption = pd.read_csv("./data/eco2mix_regional_cons_def.csv", delimiter=";",parse_dates=["Date - Heure"])
consumption.set_index('Date - Heure',inplace=True)

In [4]:
consumption.sort_index(inplace=True)
consumption.head(3)

| Date - Heure | Code INSEE région | Région | Nature | Date | Heure | Consommation (MW) | Thermique (MW) | Nucléaire (MW) | Eolien (MW) | Solaire (MW) | Hydraulique (MW) | Pompage (MW) | Bioénergies (MW) | Ech. physiques (MW) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-12-31 23:00:00 | 11 | Ile-de-France | Données définitives | 2013-01-01 | 00:00 | | | | | | | | | |
| 2012-12-31 23:30:00 | 11 | Ile-de-France | Données définitives | 2013-01-01 | 00:30 | 9134.0 | 685.0 | | 16.0 | 0.0 | 0.0 | | 142.0 | 8289.0 |
| 2013-01-01 00:00:00 | 11 | Ile-de-France | Données définitives | 2013-01-01 | 01:00 | 8822.0 | 685.0 | | 16.0 | 0.0 | 0.0 | | 142.0 | 7977.0 |


In [None]:
consumption.resample('D').count()

In [None]:
consumption["Code INSEE région"].value_counts()

In [None]:
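# 31 March 2013 appears twice, which is strange
# (it is the day of the switch to daylight saving time, a possible cause).
# "Date - Heure" is now the index, so we describe the index rather than a column.
consumption.index.to_series().describe()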
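
The checks above can be made more systematic. The following sketch, on made-up timestamps, lists the repeated and the missing slots of a regular time grid. The function name `check_time_index` is hypothetical; the half-hourly step is assumed from the first rows of the consumption table.

```python
import pandas as pd


def check_time_index(index: pd.DatetimeIndex, step: pd.Timedelta):
    """Return (duplicated timestamps, missing timestamps) for a regular grid."""
    # Every occurrence after the first one is reported as a duplicate.
    duplicated = index[index.duplicated(keep="first")]
    # Build the grid we would expect between the first and last timestamp.
    expected = pd.date_range(index.min(), index.max(), freq=step)
    missing = expected.difference(index)
    return duplicated, missing


# Toy half-hourly index (made-up values): 00:30 is repeated, 01:00 is absent.
toy_index = pd.DatetimeIndex(
    ["2013-03-31 00:00", "2013-03-31 00:30", "2013-03-31 00:30", "2013-03-31 01:30"]
)
duplicates, gaps = check_time_index(toy_index, pd.Timedelta(minutes=30))
print(list(duplicates))
print(list(gaps))
```

On the real data, the call would be `check_time_index(consumption.index, pd.Timedelta(minutes=30))`.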

## Less easy : Historical weather

This free dataset does not contain enough data: [Prévision Météo - Paris - AROME](https://public.opendatasoft.com/explore/dataset/arome-0025-sp1_sp2_paris/export/)  
Let's pay for some data! [OpenWeatherMap API](https://openweathermap.org/history-bulk) ($10 for 5 years of weather in Paris: a bargain!)


In [12]:
weather = pd.read_csv("./data/meteo-paris.csv")
weather['dt'] = pd.to_datetime(weather['dt'],unit='s')
weather.set_index('dt',inplace=True)

In [13]:
weather.head()

| dt | dt_iso | city_id | city_name | lat | lon | temp | temp_min | temp_max | pressure | sea_level | ... | rain_today | snow_1h | snow_3h | snow_24h | snow_today | clouds_all | weather_id | weather_main | weather_description | weather_icon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-10-01 13:00:00 | 2012-10-01 13:00:00 +0000 UTC | 2988507 | | | | 293.32 | 291.15 | 298.15 | 1017 | | ... | | | | | | 0 | 800 | Clear | Sky is Clear | 01d |
| 2012-10-01 14:00:00 | 2012-10-01 14:00:00 +0000 UTC | 2988507 | | | | 293.324271 | 293.324271 | 293.324271 | 1017 | | ... | | | | | | 0 | 800 | Clear | sky is Clear | 01 |
| 2012-10-01 15:00:00 | 2012-10-01 15:00:00 +0000 UTC | 2988507 | | | | 293.334926 | 293.334926 | 293.334926 | 1017 | | ... | | | | | | 1 | 800 | Clear | sky is Clear | 01 |
| 2012-10-01 16:00:00 | 2012-10-01 16:00:00 +0000 UTC | 2988507 | | | | 293.345582 | 293.345582 | 293.345582 | 1017 | | ... | | | | | | 1 | 800 | Clear | sky is Clear | 01 |
| 2012-10-01 17:00:00 | 2012-10-01 17:00:00 +0000 UTC | 2988507 | | | | 293.356237 | 293.356237 | 293.356237 | 1017 | | ... | | | | | | 2 | 800 | Clear | sky is Clear | 02 |
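
The temperature columns appear to be expressed in kelvin (293.32 K is roughly 20 °C). Assuming that reading is correct, a conversion to degrees Celsius could look like the sketch below; the helper name `kelvin_to_celsius` is hypothetical and the sample row reuses the first line of the table above.

```python
import pandas as pd

KELVIN_OFFSET = 273.15  # 0 °C expressed in kelvin


def kelvin_to_celsius(frame, columns=("temp", "temp_min", "temp_max")):
    """Return a copy of `frame` with the given columns converted to °C."""
    converted = frame.copy()
    for column in columns:
        converted[column] = converted[column] - KELVIN_OFFSET
    return converted


# First row of the weather table, used as a small sample.
sample = pd.DataFrame(
    {"temp": [293.32], "temp_min": [291.15], "temp_max": [298.15]},
    index=pd.to_datetime(["2012-10-01 13:00:00"]),
)
celsius = kelvin_to_celsius(sample)
print(celsius.round(2))
```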


In [14]:
print(weather.index.min())
print(weather.index.max())

2012-10-01 13:00:00
2017-12-06 14:00:00


In [15]:
weather.resample('D').count()

| dt | dt_iso | city_id | city_name | lat | lon | temp | temp_min | temp_max | pressure | sea_level | ... | rain_today | snow_1h | snow_3h | snow_24h | snow_today | clouds_all | weather_id | weather_main | weather_description | weather_icon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-10-01 | 11 | 11 | 0 | 0 | 0 | 11 | 11 | 11 | 11 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 11 | 11 | 11 | 11 | 11 |
| 2012-10-02 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
| 2012-10-03 | 25 | 25 | 0 | 0 | 0 | 25 | 25 | 25 | 25 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 25 | 25 | 25 | 25 | 25 |
| 2012-10-04 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
| 2012-10-05 | 25 | 25 | 0 | 0 | 0 | 25 | 25 | 25 | 25 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 25 | 25 | 25 | 25 | 25 |
| 2012-10-06 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
| 2012-10-07 | 31 | 31 | 0 | 0 | 0 | 31 | 31 | 31 | 31 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 31 | 31 | 31 | 31 | 31 |
| 2012-10-08 | 27 | 27 | 0 | 0 | 0 | 27 | 27 | 27 | 27 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 27 | 27 | 27 | 27 | 27 |
| 2012-10-09 | 43 | 43 | 0 | 0 | 0 | 43 | 43 | 43 | 43 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 43 | 43 | 43 | 43 | 43 |
| 2012-10-10 | 25 | 25 | 0 | 0 | 0 | 25 | 25 | 25 | 25 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 25 | 25 | 25 | 25 | 25 |
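
Several days in the count above hold more than 24 hourly rows (25, 31, 43), which suggests that some timestamps are repeated. This was not verified in the original notebook, so the sketch below only shows, on made-up values, one possible way to keep a single row per timestamp. The helper name `drop_duplicate_timestamps` is hypothetical.

```python
import pandas as pd


def drop_duplicate_timestamps(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep the first row of each timestamp and return the frame sorted by time."""
    return frame[~frame.index.duplicated(keep="first")].sort_index()


# Toy hourly data (made-up values) where midnight is recorded twice.
toy_weather = pd.DataFrame(
    {"temp": [293.3, 293.3, 293.4]},
    index=pd.to_datetime(
        ["2012-10-03 00:00", "2012-10-03 00:00", "2012-10-03 01:00"]
    ),
)
clean = drop_duplicate_timestamps(toy_weather)
print(clean.resample("D").count())
```

Keeping the first occurrence is an arbitrary choice; if repeated rows carry different values, averaging them with `frame.groupby(level=0).mean()` on the numeric columns may be more appropriate.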


## Weather forecast 

To make a prediction, we will need the weather forecast.
Let's use an API to get it.

In [1]:
import os
import requests 
import pandas as pd

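# The API key is read from the OPENWEATHER environment variable.
# If the variable is not set, this line raises a KeyError.
token = os.environ["OPENWEATHER"]
# 2988507 is the city id used for Paris in the historical weather file.
response = requests.get(
    "http://api.openweathermap.org/data/2.5/forecast",
    params={"id": 2988507, "mode": "json", "APPID": token},
).json()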

KeyError: 'OPENWEATHER'

NameError: name 'token' is not defined
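
The two errors above come from a missing `OPENWEATHER` environment variable. A minimal sketch of a more explicit failure, using only the standard library, is given below. The helper names `read_token` and `build_forecast_url` are hypothetical; the endpoint, city id and query parameters are the ones used in the cell above.

```python
import os
from urllib.parse import urlencode

FORECAST_ENDPOINT = "http://api.openweathermap.org/data/2.5/forecast"
PARIS_CITY_ID = 2988507


def read_token(variable: str = "OPENWEATHER") -> str:
    """Return the API key stored in an environment variable, or fail clearly."""
    token = os.environ.get(variable)
    if not token:
        raise RuntimeError(
            f"Set the {variable} environment variable to your OpenWeatherMap API key."
        )
    return token


def build_forecast_url(token: str, city_id: int = PARIS_CITY_ID) -> str:
    """Build the forecast URL with properly encoded query parameters."""
    query = urlencode({"id": city_id, "mode": "json", "APPID": token})
    return f"{FORECAST_ENDPOINT}?{query}"
```

The request itself would then read `requests.get(build_forecast_url(read_token()), timeout=10).json()`, where the 10 second timeout is an arbitrary choice.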

In [None]:
forecast = [{"date":x["dt_txt"], "temperature":x["main"]["temp"]} for x in response["list"]]

In [None]:
df = pd.DataFrame(forecast)

In [None]:
df.head(5)
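
Since the cells above could not run without an API key, here is a sketch of the same parsing step applied to a hand-written payload. Only the `list`, `dt_txt` and `main.temp` fields used above are assumed; the sample values are made up and the function name `forecast_to_frame` is hypothetical.

```python
import pandas as pd


def forecast_to_frame(payload: dict) -> pd.DataFrame:
    """Turn a forecast payload into a DataFrame indexed by forecast time."""
    rows = [
        {"date": item["dt_txt"], "temperature": item["main"]["temp"]}
        for item in payload["list"]
    ]
    frame = pd.DataFrame(rows)
    # Parse the text timestamps so the frame can be resampled or joined later.
    frame["date"] = pd.to_datetime(frame["date"])
    return frame.set_index("date").sort_index()


# Made-up payload that mimics only the fields read above.
sample_payload = {
    "list": [
        {"dt_txt": "2018-01-01 12:00:00", "main": {"temp": 280.15}},
        {"dt_txt": "2018-01-01 09:00:00", "main": {"temp": 278.15}},
    ]
}
forecast_frame = forecast_to_frame(sample_payload)
print(forecast_frame)
```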

## Days off in France

No dataset is easily available, so we are going to scrape the web:
https://www.calendrier-365.fr

In [24]:
from bs4 import BeautifulSoup
import requests

days_off = []

# One page per year; each date is read from the "content" attribute
# of the <span itemprop="startDate"> elements.
for year in range(2012, 2021):
    url = "https://www.calendrier-365.fr/jours-feries/{}.html".format(year)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    for x in soup.find_all("span", {"itemprop": "startDate"}):
        days_off.append(x.attrs["content"])

In [23]:
days_off

['2013-01-01',
 '2013-01-06',
 '2013-02-12',
 '2013-02-14',
 '2013-03-31',
 '2013-03-31',
 '2013-04-01',
 '2013-05-01',
 '2013-05-08',
 '2013-05-09',
 '2013-05-19',
 '2013-05-19',
 '2013-05-20',
 '2013-07-14',
 '2013-08-15',
 '2013-11-01',
 '2013-11-11',
 '2013-12-25',
 '2013-12-31',
 '2014-01-01',
 '2014-01-06',
 '2014-02-14',
 '2014-03-04',
 '2014-04-20',
 '2014-04-20',
 '2014-04-21',
 '2014-05-01',
 '2014-05-08',
 '2014-05-29',
 '2014-06-08',
 '2014-06-08',
 '2014-06-09',
 '2014-07-14',
 '2014-08-15',
 '2014-11-01',
 '2014-11-11',
 '2014-12-25',
 '2014-12-31',
 '2015-01-01',
 '2015-01-06',
 '2015-02-14',
 '2015-02-17',
 '2015-04-05',
 '2015-04-05',
 '2015-04-06',
 '2015-05-01',
 '2015-05-08',
 '2015-05-14',
 '2015-05-24',
 '2015-05-24',
 '2015-05-25',
 '2015-07-14',
 '2015-08-15',
 '2015-11-01',
 '2015-11-11',
 '2015-12-25',
 '2015-12-31',
 '2016-01-01',
 '2016-01-06',
 '2016-02-09',
 '2016-02-14',
 '2016-03-27',
 '2016-03-27',
 '2016-03-28',
 '2016-05-01',
 '2016-05-05',
 '2016-05-

In [25]:
def is_day_off(date):
    """
    Tell whether a date is a day off in France.
    Only works for the years covered by the scraped list (2013 to 2020).
    """
    return date.strftime("%Y-%m-%d") in days_off
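
The scraped list contains repeated dates (for instance `'2013-03-31'` appears twice), and a membership test on a list scans it from the start each time. A possible variant, sketched below under those observations, stores the dates in a set. The name `make_day_off_checker` is hypothetical and the sample dates are taken from the output above.

```python
import datetime


def make_day_off_checker(dates):
    """Return a function that tells whether a date belongs to `dates`."""
    # A frozenset removes duplicates and gives constant-time lookups.
    known = frozenset(dates)

    def check(day) -> bool:
        return day.strftime("%Y-%m-%d") in known

    return check


# A few dates copied from the scraped output, including one duplicate.
sample_days_off = ["2013-03-31", "2013-03-31", "2013-04-01"]
is_off = make_day_off_checker(sample_days_off)
print(is_off(datetime.date(2013, 4, 1)))  # True
print(is_off(datetime.date(2013, 4, 2)))  # False
```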

In [26]:
import datetime

today = datetime.datetime.today()
christmas = datetime.datetime(2018,12,25)
easter = datetime.datetime(2015,4,5)

print(is_day_off(today))
print(is_day_off(christmas))
print(is_day_off(easter))

False
True
True
