# Data collection

The goal of this presentation is to automate data collection.
We need historical data to build our model and predict electrical consumption in Paris for the next day (D+1).  
Starting from nothing, we want to collect as much relevant data as possible.

### Expected output

Datasets saved locally in file formats that Python reads easily: CSV, JSON, XML, Excel.
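
As a minimal sketch of what "saved in a readable format" can look like, the snippet below writes a small table to CSV and JSON and reads it back with pandas. The column name `consumption_mw`, the index name `timestamp`, the file names and the temporary folder are placeholders chosen for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder data: two half-hourly readings.
frame = pd.DataFrame(
    {"consumption_mw": [9134.0, 8822.0]},
    index=pd.to_datetime(["2013-01-01 00:30", "2013-01-01 01:00"]),
)
frame.index.name = "timestamp"

# A temporary folder stands in for a local ./data directory.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "sample.csv"
json_path = out_dir / "sample.json"

# Write the same table in two formats.
frame.to_csv(csv_path)
frame.reset_index().to_json(json_path, orient="records", date_format="iso")

# Read both files back.
from_csv = pd.read_csv(csv_path, index_col="timestamp", parse_dates=["timestamp"])
from_json = pd.read_json(json_path)
```

CSV keeps the files easy to inspect by hand; JSON is convenient when the source is an API that already returns it.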

### Quiz (5 minutes): What kind of data might help to make a prediction?

### Suggestions:


The data we are going to collect:

- Historical electrical consumption of Paris (> 2 years of data)
- Historical weather (> 2 years of data)
- Weather forecast (temperature, wind, solar radiation...)
- Days off in France
- ~~How many electric vehicles~~

## https://github.com/LucasBerbesson/ds2


### Workshop (15 minutes) : Try to collect the data


## Easy: Historical electrical consumption

[Electrical consumption in Île-de-France between 2013 and 2017](https://rte-opendata.opendatasoft.com/explore/dataset/eco2mix_regional_cons_def/export/?disjunctive.libelle_region&disjunctive.nature&sort=-date_heure&refine.libelle_region=Ile-de-France)

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
# The export is semicolon-separated; parse the timestamp column while loading.
consumption = pd.read_csv("./data/eco2mix_regional_cons_def.csv", delimiter=";", parse_dates=["Date - Heure"])
consumption.set_index("Date - Heure", inplace=True)

In [4]:
consumption.sort_index(inplace=True)
consumption.head(3)

| Date - Heure | Code INSEE région | Région | Nature | Date | Heure | Consommation (MW) | Thermique (MW) | Nucléaire (MW) | Eolien (MW) | Solaire (MW) | Hydraulique (MW) | Pompage (MW) | Bioénergies (MW) | Ech. physiques (MW) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-12-31 23:00:00 | 11 | Ile-de-France | Données définitives | 2013-01-01 | 00:00 | | | | | | | | | |
| 2012-12-31 23:30:00 | 11 | Ile-de-France | Données définitives | 2013-01-01 | 00:30 | 9134.0 | 685.0 | | 16.0 | 0.0 | 0.0 | | 142.0 | 8289.0 |
| 2013-01-01 00:00:00 | 11 | Ile-de-France | Données définitives | 2013-01-01 | 01:00 | 8822.0 | 685.0 | | 16.0 | 0.0 | 0.0 | | 142.0 | 7977.0 |
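
In the sample rows above, the index is one hour behind the `Date` and `Heure` columns. This suggests that the timestamps were stored in UTC when parsed; that is an assumption drawn from the sample, not something the export states here. Under that assumption, a short sketch of converting the index back to Paris local time:

```python
import pandas as pd

# Index values copied from the sample rows above, assumed to be UTC.
utc_index = pd.to_datetime(
    ["2012-12-31 23:00:00", "2012-12-31 23:30:00", "2013-01-01 00:00:00"]
)
sample = pd.DataFrame({"Consommation (MW)": [None, 9134.0, 8822.0]}, index=utc_index)

# Attach the UTC zone to the naive index, then convert to Paris local time.
sample_local = sample.tz_localize("UTC").tz_convert("Europe/Paris")
print(sample_local.index[0])
```

Keeping the index in UTC avoids ambiguous hours at daylight saving changes; converting is mostly useful when building features such as the local hour of day.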


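In [9]:
# Count rows per day; a complete day holds 48 half-hour readings.
resample = consumption.resample('D').count()
# Keep only the days with missing rows.
resample[resample["Code INSEE région"]<48]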

| Date - Heure | Code INSEE région | Région | Nature | Date | Heure | Consommation (MW) | Thermique (MW) | Nucléaire (MW) | Eolien (MW) | Solaire (MW) | Hydraulique (MW) | Pompage (MW) | Bioénergies (MW) | Ech. physiques (MW) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-12-31 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 2013-10-27 | 46 | 46 | 46 | 46 | 46 | 46 | 46 | 0 | 46 | 46 | 46 | 0 | 46 | 46 |
| 2014-10-26 | 46 | 46 | 46 | 46 | 46 | 46 | 46 | 0 | 46 | 46 | 46 | 0 | 46 | 46 |
| 2015-10-25 | 46 | 46 | 46 | 46 | 46 | 46 | 46 | 0 | 46 | 46 | 46 | 0 | 46 | 46 |
| 2016-10-30 | 46 | 46 | 46 | 46 | 46 | 46 | 46 | 0 | 46 | 46 | 46 | 0 | 46 | 46 |
| 2017-10-29 | 46 | 46 | 46 | 46 | 46 | 46 | 46 | 0 | 46 | 46 | 46 | 0 | 46 | 46 |
| 2018-04-30 | 44 | 44 | 44 | 44 | 44 | 44 | 44 | 0 | 44 | 44 | 44 | 0 | 44 | 44 |
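
The daily count flags days with fewer than 48 half-hour rows, but it does not say which timestamps are absent. One way to list them, sketched here on a toy index (the values and the two removed rows are made up for illustration), is to compare the index with a complete 30-minute range:

```python
import pandas as pd

# Toy half-hourly frame for one day, with two rows removed on purpose.
full_range = pd.date_range("2013-10-27 00:00", "2013-10-27 23:30", freq="30min")
toy = pd.DataFrame({"Consommation (MW)": 9000.0}, index=full_range)
toy = toy.drop(pd.to_datetime(["2013-10-27 01:00", "2013-10-27 01:30"]))

# Every timestamp we expect between the first and last observation.
expected = pd.date_range(toy.index.min(), toy.index.max(), freq="30min")

# Timestamps that are expected but not present.
missing = expected.difference(toy.index)
print(missing)

# Reindexing inserts the missing rows as NaN so they can be filled or dropped later.
toy_complete = toy.reindex(expected)
```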


## Less easy: Historical weather

Not enough data: [Prévision Météo - Paris - AROME](https://public.opendatasoft.com/explore/dataset/arome-0025-sp1_sp2_paris/export/)  
Let's pay for some data! [OpenWeatherMap API](https://openweathermap.org/history-bulk) ($10 for 5 years of weather in Paris: a bargain!)


In [11]:
weather = pd.read_csv("./data/meteo-paris.csv")
weather['dt'] = pd.to_datetime(weather['dt'],unit='s')
weather.set_index('dt',inplace=True)

In [12]:
weather.head()

| dt | dt_iso | city_id | city_name | lat | lon | temp | temp_min | temp_max | pressure | sea_level | ... | rain_today | snow_1h | snow_3h | snow_24h | snow_today | clouds_all | weather_id | weather_main | weather_description | weather_icon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-10-01 13:00:00 | 2012-10-01 13:00:00 +0000 UTC | 2988507 | | | | 293.32 | 291.15 | 298.15 | 1017 | | ... | | | | | | 0 | 800 | Clear | Sky is Clear | 01d |
| 2012-10-01 14:00:00 | 2012-10-01 14:00:00 +0000 UTC | 2988507 | | | | 293.324271 | 293.324271 | 293.324271 | 1017 | | ... | | | | | | 0 | 800 | Clear | sky is Clear | 01 |
| 2012-10-01 15:00:00 | 2012-10-01 15:00:00 +0000 UTC | 2988507 | | | | 293.334926 | 293.334926 | 293.334926 | 1017 | | ... | | | | | | 1 | 800 | Clear | sky is Clear | 01 |
| 2012-10-01 16:00:00 | 2012-10-01 16:00:00 +0000 UTC | 2988507 | | | | 293.345582 | 293.345582 | 293.345582 | 1017 | | ... | | | | | | 1 | 800 | Clear | sky is Clear | 01 |
| 2012-10-01 17:00:00 | 2012-10-01 17:00:00 +0000 UTC | 2988507 | | | | 293.356237 | 293.356237 | 293.356237 | 1017 | | ... | | | | | | 2 | 800 | Clear | sky is Clear | 02 |
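
The `temp`, `temp_min` and `temp_max` values around 293 are consistent with kelvin, the unit OpenWeatherMap uses by default. A small sketch of converting them to degrees Celsius; the `_c` suffix for the new columns is an illustrative choice:

```python
import pandas as pd

# First two rows of the sample above, temperature columns only.
sample = pd.DataFrame(
    {
        "temp": [293.32, 293.324271],
        "temp_min": [291.15, 293.324271],
        "temp_max": [298.15, 293.324271],
    },
    index=pd.to_datetime(["2012-10-01 13:00:00", "2012-10-01 14:00:00"]),
)

KELVIN_OFFSET = 273.15

# Add one Celsius column per kelvin column.
for column in ["temp", "temp_min", "temp_max"]:
    sample[column + "_c"] = sample[column] - KELVIN_OFFSET

print(sample[["temp_c", "temp_min_c", "temp_max_c"]].round(2))
```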


In [13]:
print(weather.index.min())
print(weather.index.max())

2012-10-01 13:00:00
2017-12-06 14:00:00


In [18]:
resample = weather.resample('D').count()
resample.sample(10)

| dt | dt_iso | city_id | city_name | lat | lon | temp | temp_min | temp_max | pressure | sea_level | ... | rain_today | snow_1h | snow_3h | snow_24h | snow_today | clouds_all | weather_id | weather_main | weather_description | weather_icon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-01-21 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
| 2013-05-09 | 30 | 30 | 0 | 0 | 0 | 30 | 30 | 30 | 30 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 30 | 30 | 30 | 30 | 30 |
| 2013-06-23 | 25 | 25 | 0 | 0 | 0 | 25 | 25 | 25 | 25 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 25 | 25 | 25 | 25 | 25 |
| 2013-07-06 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
| 2013-06-03 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
| 2017-03-28 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
| 2016-04-18 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
| 2015-01-02 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
| 2016-05-12 | 40 | 40 | 0 | 0 | 0 | 40 | 40 | 40 | 40 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 40 | 40 | 40 | 40 | 40 |
| 2014-12-12 | 24 | 24 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 24 | 24 | 24 | 24 | 24 |
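
Some days in this sample have more than 24 rows (30, 25, 40), which for hourly data suggests repeated timestamps. That reading is an assumption, since the raw rows are not shown. A sketch of how duplicates could be detected and collapsed to one row per hour, on made-up values:

```python
import pandas as pd

# Toy hourly data in which 01:00 appears twice.
index = pd.to_datetime(
    ["2016-05-12 00:00", "2016-05-12 01:00", "2016-05-12 01:00", "2016-05-12 02:00"]
)
toy = pd.DataFrame({"temp": [285.0, 284.5, 284.7, 284.0]}, index=index)

# Number of rows whose timestamp was already seen earlier.
n_duplicates = int(toy.index.duplicated().sum())

# Option 1: keep the first row for each timestamp.
deduplicated = toy[~toy.index.duplicated(keep="first")]

# Option 2: average the rows that share a timestamp.
averaged = toy.groupby(level=0).mean()
```

Keeping the first row is simpler; averaging only makes sense for numeric columns.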


## Weather forecast 

To make a prediction, we will need the weather forecast.
Let's get it from an API.

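In [20]:
import os
import requests

# The API key is read from the environment so it never appears in the notebook.
token = os.environ["OPENWEATHERMAP"]
# id=2988507 is the same city id as in the historical weather file.
response = requests.get(
    "http://api.openweathermap.org/data/2.5/forecast",
    params={"id": 2988507, "mode": "json", "APPID": token},
).json()

response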

{'cod': '200',
 'message': 0.0055,
 'cnt': 40,
 'list': [{'dt': 1531440000,
   'main': {'temp': 290.76,
    'temp_min': 290.76,
    'temp_max': 291.4,
    'pressure': 1021.49,
    'sea_level': 1033.38,
    'grnd_level': 1021.49,
    'humidity': 81,
    'temp_kf': -0.64},
   'weather': [{'id': 801,
     'main': 'Clouds',
     'description': 'few clouds',
     'icon': '02n'}],
   'clouds': {'all': 12},
   'wind': {'speed': 3.06, 'deg': 347},
   'sys': {'pod': 'n'},
   'dt_txt': '2018-07-13 00:00:00'},
  {'dt': 1531450800,
   'main': {'temp': 289.41,
    'temp_min': 289.41,
    'temp_max': 289.892,
    'pressure': 1021.29,
    'sea_level': 1033.2,
    'grnd_level': 1021.29,
    'humidity': 89,
    'temp_kf': -0.48},
   'weather': [{'id': 801,
     'main': 'Clouds',
     'description': 'few clouds',
     'icon': '02n'}],
   'clouds': {'all': 12},
   'wind': {'speed': 3, 'deg': 338.003},
   'sys': {'pod': 'n'},
   'dt_txt': '2018-07-13 03:00:00'},
  {'dt': 1531461600,
   'main': {'temp': 29
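
The response nests the useful values under `list`. A sketch of flattening it into a DataFrame indexed by time follows. `sample_response` reproduces only a few fields of the first two entries printed above, so the snippet runs without a token, and the output column names are illustrative.

```python
import pandas as pd

# Trimmed copy of the first two forecast entries shown above.
sample_response = {
    "list": [
        {"dt": 1531440000, "main": {"temp": 290.76, "humidity": 81},
         "clouds": {"all": 12}, "wind": {"speed": 3.06, "deg": 347}},
        {"dt": 1531450800, "main": {"temp": 289.41, "humidity": 89},
         "clouds": {"all": 12}, "wind": {"speed": 3, "deg": 338.003}},
    ]
}

# Pull the nested fields of each 3-hour step into one flat row.
rows = [
    {
        "dt": entry["dt"],
        "temp": entry["main"]["temp"],
        "humidity": entry["main"]["humidity"],
        "clouds_all": entry["clouds"]["all"],
        "wind_speed": entry["wind"]["speed"],
    }
    for entry in sample_response["list"]
]

forecast = pd.DataFrame(rows)
# "dt" is a Unix timestamp in seconds, as in the historical file.
forecast["dt"] = pd.to_datetime(forecast["dt"], unit="s")
forecast = forecast.set_index("dt")
print(forecast)
```

With the real `response`, the same list comprehension would run over `response["list"]`.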

## Days off in France

No dataset is readily available, so we are going to scrape the web:
https://www.calendrier-365.fr

In [30]:
import requests
from datetime import datetime
from bs4 import BeautifulSoup

days_off = []
for year in range(2012, 2020):
    url = 'https://www.calendrier-365.fr/jours-feries/{}.html'.format(year)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    # Each date cell carries a Unix timestamp in its data-value attribute.
    for x in soup.find_all("td", {"class": "dtr tar"}):
        # fromtimestamp() converts using the machine's local time zone.
        date = datetime.fromtimestamp(int(x.attrs["data-value"]))
        days_off.append(date.strftime("%Y-%m-%d"))

In [31]:
days_off

['2012-01-01',
 '2012-01-06',
 '2012-02-14',
 '2012-02-21',
 '2012-04-08',
 '2012-04-08',
 '2012-04-09',
 '2012-05-01',
 '2012-05-08',
 '2012-05-17',
 '2012-05-27',
 '2012-05-27',
 '2012-05-28',
 '2012-07-14',
 '2012-08-15',
 '2012-11-01',
 '2012-11-11',
 '2012-12-25',
 '2012-12-31',
 '2013-01-01',
 '2013-01-06',
 '2013-02-12',
 '2013-02-14',
 '2013-03-31',
 '2013-03-31',
 '2013-04-01',
 '2013-05-01',
 '2013-05-08',
 '2013-05-09',
 '2013-05-19',
 '2013-05-19',
 '2013-05-20',
 '2013-07-14',
 '2013-08-15',
 '2013-11-01',
 '2013-11-11',
 '2013-12-25',
 '2013-12-31',
 '2014-01-01',
 '2014-01-06',
 '2014-02-14',
 '2014-03-04',
 '2014-04-20',
 '2014-04-20',
 '2014-04-21',
 '2014-05-01',
 '2014-05-08',
 '2014-05-29',
 '2014-06-08',
 '2014-06-08',
 '2014-06-09',
 '2014-07-14',
 '2014-08-15',
 '2014-11-01',
 '2014-11-11',
 '2014-12-25',
 '2014-12-31',
 '2015-01-01',
 '2015-01-06',
 '2015-02-14',
 '2015-02-17',
 '2015-04-05',
 '2015-04-05',
 '2015-04-06',
 '2015-05-01',
 '2015-05-08',
 '2015-05-

In [None]:
def is_day_off(date):
    """
    Tell whether a date is a day off in France.
    Only works for the scraped years, 2012 to 2019.
    """
    return date.strftime("%Y-%m-%d") in days_off

In [None]:
import datetime

today = datetime.datetime.today()

# Days until the coming Saturday (weekday 5); 0 if today is already Saturday.
next_saturday = today + datetime.timedelta(days=(5 - today.weekday()) % 7)
christmas = datetime.datetime(2018, 12, 25)
easter = datetime.datetime(2015, 4, 5)

print(is_day_off(next_saturday))
print(is_day_off(today))
print(is_day_off(christmas))
print(is_day_off(easter))
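
For modelling, the same check is usually needed for every row of a table rather than one date at a time. A sketch of a vectorised variant follows. The short `days_off` set is a subset of the scraped output above, and the `day_off` column name and the toy index are illustrative.

```python
import pandas as pd

# Subset of the scraped list above; a set also removes duplicate entries.
days_off = {"2012-12-25", "2012-12-31", "2013-01-01"}

# Toy half-hourly index spanning New Year.
index = pd.date_range("2012-12-31 23:00", "2013-01-02 00:00", freq="30min")
frame = pd.DataFrame(index=index)

# Compare the calendar date of each timestamp with the set of days off.
frame["day_off"] = frame.index.strftime("%Y-%m-%d").isin(days_off)
print(frame.tail(3))
```

Using a set makes each membership test constant-time, which matters once the table has several years of half-hourly rows.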

## Strikes in Paris
Credit: William Revah

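In [41]:
from bs4 import BeautifulSoup
import requests
import datetime

strikes = []

url = "https://fr.wikipedia.org/wiki/Liste_des_manifestations_les_plus_importantes_en_France"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for table in soup.find_all("table"):
    for x in table.find_all("tr"):
        # find_next() looks past the current row, so a row without its own
        # <time> tag picks up the next one on the page, which can repeat dates.
        date = x.find_next("time")
        # Skip rows with no later <time> tag or no datetime attribute.
        if date is not None and date.has_attr("datetime"):
            strikes.append(date.attrs["datetime"])

print(strikes)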


['1790-07-14',
 '1790-07-14',
 '1794-06-08',
 '1832-06-05',
 '1832-06-05',
 '1832-06-05',
 '1832-06-05',
 '1840-12-15',
 '1840-12-15',
 '1869-10-08',
 '1869-10-08',
 '1877-09-08',
 '1877-09-08',
 '1885-06-01',
 '1891-05-01',
 '1894-07-01',
 '1908-07-31',
 '1908-07-31',
 '1908-07-31',
 '1931-01-07',
 '1934-02-12',
 '1935-07-14',
 '1936-05-24',
 '1944-08-26',
 '1944-08-26',
 '1951-02',
 '1953-07-14',
 '1961-10-17',
 '1962-02-13',
 '1968-05-13',
 '1968-05-30',
 '1977-07-31',
 '1983-10-15',
 '1984-03-04',
 '1984-06-24',
 '1986-12-04',
 '1986-12-10',
 '1989-07-14',
 '1994-01-16',
 '1995-12-12',
 '1998-07-13',
 '2002-05-01',
 '2002-05-01',
 '2003-05-13',
 '2003-05-13',
 '2003-06-03',
 '2013-05-17',
 '2013-05-17',
 '2006-03-18',
 '2006-03-28',
 '2013-05-17',
 '2013-05-17',
 '2009-01-29',
 '2009-01-29',
 '2009-03-19',
 '2010-03-23',
 '2010-03-23',
 '2010-05-27',
 '2010-06-24',
 '2010-09-07',
 '2010-09-23',
 '2010-10-02',
 '2010-10-12',
 '2010-10-16',
 '2010-10-19',
 '2010-10-28',
 '2010-11-06'
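
The scraped list mixes full dates with partial ones such as `'1951-02'`, contains duplicates, and goes back to 1790, well before the consumption data starts. A sketch of cleaning it follows. The sample values are a subset of the output above, `parse_full_date` is a hypothetical helper, and the 2012 cut-off is an assumption chosen to match the start of the other datasets.

```python
from datetime import date, datetime

# Subset of the scraped output above, including a partial date and a duplicate.
sample_strikes = ["1951-02", "2010-10-12", "2013-05-17", "2013-05-17", "2010-11-06"]

# Assumed cut-off: ignore events before the other datasets begin.
cutoff = date(2012, 1, 1)


def parse_full_date(text):
    """Return a date for 'YYYY-MM-DD' strings, or None for anything else."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


# A set comprehension removes duplicates while parsing.
parsed = {parse_full_date(text) for text in sample_strikes}

# Drop unparseable entries and anything before the cut-off, then sort.
recent_strikes = sorted(d for d in parsed if d is not None and d >= cutoff)
print(recent_strikes)
```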