# Data collection

The goal of this presentation is to automate data collection.
We need historical data to build a model that predicts electricity consumption in Paris for the next day (D+1).  
In short, we need to collect as much relevant data as possible, starting from nothing.

### Expected output

Datasets saved on our computer in file formats that Python reads easily: CSV, JSON, XML, Excel.
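
As a minimal sketch of that output step, the snippet below writes a few records to CSV and JSON and reads them back using only the standard library. The `save_csv` and `save_json` helpers, the file names, and the column names are illustrative assumptions, not part of the workshop material.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical sample records; in the workshop they would come from a download or a scrape.
records = [
    {"date": "2018-01-31", "consumption_mw": 11283.0},
    {"date": "2018-02-01", "consumption_mw": 9184.0},
]


def save_csv(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def save_json(rows, path):
    """Write a list of dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


# A temporary directory keeps the example from touching real data folders.
out_dir = Path(tempfile.mkdtemp())
save_csv(records, out_dir / "sample.csv")
save_json(records, out_dir / "sample.json")

# Read both files back to check that they are usable.
with open(out_dir / "sample.json", encoding="utf-8") as f:
    reloaded = json.load(f)
with open(out_dir / "sample.csv", newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))
```

Note that CSV stores everything as text, so numeric columns need converting after loading, while JSON keeps numbers as numbers.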

### Quiz (5 minutes): What kind of data might be useful for making a prediction?

### Workshop (15 minutes): Try to collect the data

- Historical consumption data
- Historical weather
- Population in Paris
- Traffic
- Public holidays (days off)
- Bridge days
- Special events
- Holidays


### Easy: Historical electrical consumption

[Electrical consumption in Île-de-France between 2013 and 2017](https://rte-opendata.opendatasoft.com/explore/dataset/eco2mix_regional_cons_def/export/?disjunctive.libelle_region&disjunctive.nature&sort=-date_heure&refine.libelle_region=Ile-de-France)

In [2]:
import pandas as pd

In [3]:
# The export uses semicolons as field separators
consumption = pd.read_csv("./data/eco2mix-regional-cons-def.csv", delimiter=";")

In [4]:
consumption.head()

| Unnamed: 0 | Code INSEE région | Région | Nature | Date | Heure | Date - Heure | Consommation (MW) | Thermique (MW) | Nucléaire (MW) | Eolien (MW) | Solaire (MW) | Hydraulique (MW) | Pompage (MW) | Bioénergies (MW) | Ech. physiques (MW) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | Ile-de-France | Données consolidées | 2018-01-31 | 20:00 | 2018-01-31T20:00:00+01:00 | 11283.0 | 287.0 | | 39.0 | 0.0 | 0.0 | | 150.0 | 10807.0 |
| 1 | 11 | Ile-de-France | Données consolidées | 2018-01-31 | 20:30 | 2018-01-31T20:30:00+01:00 | 10863.0 | 288.0 | | 42.0 | 0.0 | 0.0 | | 152.0 | 10381.0 |
| 2 | 11 | Ile-de-France | Données consolidées | 2018-01-31 | 21:30 | 2018-01-31T21:30:00+01:00 | 9896.0 | 291.0 | | 40.0 | 0.0 | 0.0 | | 152.0 | 9414.0 |
| 3 | 11 | Ile-de-France | Données consolidées | 2018-02-01 | 00:30 | 2018-02-01T00:30:00+01:00 | 9184.0 | 293.0 | | 22.0 | 0.0 | 0.0 | | 156.0 | 8713.0 |
| 4 | 11 | Ile-de-France | Données consolidées | 2018-02-01 | 02:30 | 2018-02-01T02:30:00+01:00 | 8062.0 | 298.0 | | 18.0 | 0.0 | 0.0 | | 155.0 | 7590.0 |
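
The preview jumps from 20:30 to 21:30, so the file appears to have gaps in what looks like a half-hourly series. A minimal sketch of how the timestamps could be parsed and the missing slots listed, using two columns and three rows copied from the preview. The 30-minute step and the `parse_rows` and `missing_timestamps` helpers are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime, timedelta

# Two columns copied from the preview above; the real file has more.
sample = """Date - Heure;Consommation (MW)
2018-01-31T20:00:00+01:00;11283.0
2018-01-31T20:30:00+01:00;10863.0
2018-01-31T21:30:00+01:00;9896.0
"""


def parse_rows(text):
    """Return (timestamp, consumption) pairs sorted by timestamp."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = [
        (datetime.fromisoformat(r["Date - Heure"]), float(r["Consommation (MW)"]))
        for r in reader
    ]
    return sorted(rows)


def missing_timestamps(rows, step=timedelta(minutes=30)):
    """List the timestamps absent between consecutive rows, assuming a fixed step."""
    missing = []
    for (current, _), (following, _) in zip(rows, rows[1:]):
        expected = current + step
        while expected < following:
            missing.append(expected)
            expected += step
    return missing


rows = parse_rows(sample)
gaps = missing_timestamps(rows)
print([g.isoformat() for g in gaps])
```

Knowing where the gaps are makes it possible to decide later whether to interpolate, drop, or re-download the affected periods.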


### Less easy: Historical weather

Not enough data: [Prévision Météo - Paris - AROME](https://public.opendatasoft.com/explore/dataset/arome-0025-sp1_sp2_paris/export/)  
Let's pay for some data! [OpenWeatherMap API](https://openweathermap.org/history-bulk) ($10 for 5 years of weather in Paris: a bargain!)


In [5]:
weather = pd.read_csv("./data/meteo-paris.csv")

In [6]:
weather.head()

| Unnamed: 0.1 | Unnamed: 0 | dt_iso | city_id | city_name | lat | lon | temp | temp_min | temp_max | pressure | ... | snow_1h | snow_3h | snow_24h | snow_today | clouds_all | weather_id | weather_main | weather_description | weather_icon | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2013-03-13 02:00:00 +0000 UTC | 2988507 | | | | 272.94 | 272.15 | 273.71 | 993 | ... | | | | | 90 | 500 | Rain | light rain | 10n | 2013-03-13 03:00:00 |
| 1 | 1 | 2013-03-13 02:00:00 +0000 UTC | 2988507 | | | | 272.94 | 272.15 | 273.71 | 993 | ... | | | | | 90 | 701 | Mist | mist | 50n | 2013-03-13 03:00:00 |
| 2 | 2 | 2013-03-13 03:00:00 +0000 UTC | 2988507 | | | | 272.69 | 272.15 | 273.15 | 993 | ... | | | | | 90 | 500 | Rain | light rain | 10n | 2013-03-13 04:00:00 |
| 3 | 3 | 2013-03-13 03:00:00 +0000 UTC | 2988507 | | | | 272.69 | 272.15 | 273.15 | 993 | ... | | | | | 90 | 701 | Mist | mist | 50n | 2013-03-13 04:00:00 |
| 4 | 4 | 2013-03-13 03:00:00 +0000 UTC | 2988507 | | | | 272.69 | 272.15 | 273.15 | 993 | ... | | | | | 90 | 602 | Snow | heavy snow | 13n | 2013-03-13 04:00:00 |
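
Two things stand out in this preview: the `temp` values (around 273) look like Kelvin, and the same `dt_iso` appears on several rows, once per weather condition. A minimal sketch of how both could be handled, using values copied from the preview. The Kelvin interpretation and the helper names are assumptions.

```python
from datetime import datetime

# (dt_iso, temp, weather_main) triples copied from the preview above.
sample = [
    ("2013-03-13 02:00:00 +0000 UTC", 272.94, "Rain"),
    ("2013-03-13 02:00:00 +0000 UTC", 272.94, "Mist"),
    ("2013-03-13 03:00:00 +0000 UTC", 272.69, "Rain"),
    ("2013-03-13 03:00:00 +0000 UTC", 272.69, "Mist"),
    ("2013-03-13 03:00:00 +0000 UTC", 272.69, "Snow"),
]


def parse_dt_iso(value):
    """Parse strings such as '2013-03-13 02:00:00 +0000 UTC' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S %z UTC")


def kelvin_to_celsius(kelvin):
    """Convert a temperature from Kelvin to degrees Celsius, rounded to 2 decimals."""
    return round(kelvin - 273.15, 2)


def one_row_per_hour(rows):
    """Keep one temperature per timestamp and collect the weather labels seen at that hour."""
    hourly = {}
    for dt_iso, temp, label in rows:
        entry = hourly.setdefault(
            parse_dt_iso(dt_iso),
            {"temp_c": kelvin_to_celsius(temp), "weather": []},
        )
        entry["weather"].append(label)
    return hourly


hourly = one_row_per_hour(sample)
for timestamp, entry in hourly.items():
    print(timestamp.isoformat(), entry)
```

Collapsing to one row per hour matters because joining the raw weather rows to the consumption data would otherwise duplicate consumption values.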


# Public holidays in France

No dataset is readily available, so we are going to scrape the web:
https://www.calendrier-365.fr

In [8]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

In [9]:
# Scrape https://www.calendrier-365.fr

In [10]:
response = requests.get("https://www.calendrier-365.fr/jours-feries/2018.html")

## Using regex

In [11]:
import re

# A raw string keeps \d as a regex token instead of a Python string escape
matches = re.findall(r'data-value="\d{2,}"', response.text)
matches

['data-value="1514761200"',
 'data-value="1515193200"',
 'data-value="1518476400"',
 'data-value="1518562800"',
 'data-value="1522533600"',
 'data-value="1522533600"',
 'data-value="1522620000"',
 'data-value="1525125600"',
 'data-value="1525730400"',
 'data-value="1525903200"',
 'data-value="1526767200"',
 'data-value="1526767200"',
 'data-value="1526853600"',
 'data-value="1531519200"',
 'data-value="1534284000"',
 'data-value="1541026800"',
 'data-value="1541890800"',
 'data-value="1545692400"',
 'data-value="1546210800"']
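
The matches above still carry the `data-value="..."` wrapper, and the numbers are Unix timestamps. A minimal sketch of extracting only the digits with a capture group and turning them into calendar dates follows. The HTML fragment is a made-up stand-in for `response.text`, and the 12-hour shift rests on the assumption that each timestamp marks midnight in Paris (UTC+1 or UTC+2), which the first value, 1514761200 (2017-12-31 23:00 UTC), suggests.

```python
import re
from datetime import datetime, timedelta, timezone

# Made-up fragment standing in for response.text; only the attribute values come from the output above.
html = '<td data-value="1514761200">x</td><td data-value="1522533600">y</td>'

# The parentheses capture only the digits, so findall returns the numbers as strings.
timestamps = [int(value) for value in re.findall(r'data-value="(\d{2,})"', html)]


def paris_midnight_to_date(timestamp):
    """Convert a timestamp taken at midnight Paris time to a date.

    Adding 12 hours moves the instant to local noon, which falls on the same
    calendar day in UTC whether Paris is on UTC+1 or UTC+2, so the result does
    not depend on the time zone of the machine running the code.
    """
    shifted = datetime.fromtimestamp(timestamp, tz=timezone.utc) + timedelta(hours=12)
    return shifted.date()


dates = [paris_midnight_to_date(t) for t in timestamps]
print([d.isoformat() for d in dates])
```

Without the shift, reading the first timestamp in UTC would give 2017-12-31 rather than New Year's Day 2018.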

## Using beautifulsoup

In [12]:
from bs4 import BeautifulSoup
from datetime import datetime

results = []

# Collect the dates listed for each year from 2008 to 2018
for year in range(2008, 2019):
    url = "https://www.calendrier-365.fr/jours-feries/{}.html".format(year)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    # Each date cell stores a Unix timestamp in its data-value attribute
    for element in soup.find_all("td", {"class": "dtr tar"}):
        # fromtimestamp() converts using the machine's local time zone
        new_date = datetime.fromtimestamp(int(element["data-value"])).strftime("%d-%m-%Y")
        results.append(new_date)
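
The workshop list also mentions bridge days. Assuming a bridge day means the Monday before a Tuesday public holiday or the Friday after a Thursday one, they could be derived from the scraped dates. A minimal sketch; the `bridge_days` helper is hypothetical and takes the `%d-%m-%Y` strings produced above.

```python
from datetime import datetime, timedelta


def bridge_days(holiday_strings):
    """Return the sorted bridge days for holidays given as 'dd-mm-YYYY' strings."""
    # A set removes duplicate dates, which the scraped list may contain.
    holidays = {datetime.strptime(s, "%d-%m-%Y").date() for s in holiday_strings}
    bridges = set()
    for day in holidays:
        if day.weekday() == 1:      # Tuesday: the Monday before is a bridge
            candidate = day - timedelta(days=1)
        elif day.weekday() == 3:    # Thursday: the Friday after is a bridge
            candidate = day + timedelta(days=1)
        else:
            continue
        if candidate not in holidays:  # a day that is already a holiday is not a bridge
            bridges.add(candidate)
    return sorted(bridges)


# 2018 dates: 1 May (Tuesday), 10 May (Thursday), 14 July (Saturday), plus one duplicate.
sample = ["01-05-2018", "10-05-2018", "14-07-2018", "01-05-2018"]
print(bridge_days(sample))
```

For this sample the function returns 30 April and 11 May 2018; the Saturday holiday produces no bridge.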

# Scraping strikes in Paris

In [13]:
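from bs4 import BeautifulSoup
import requests

strikes = []

url = "https://fr.wikipedia.org/wiki/Liste_des_manifestations_les_plus_importantes_en_France"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        # find_next() returns the first <time> tag after the start of the row,
        # which may belong to a later row, so some dates appear more than once
        date = row.find_next("time")
        # Skip rows with no <time> tag after them to avoid an AttributeError
        if date is not None and "datetime" in date.attrs:
            strikes.append(date.attrs["datetime"])

print(strikes)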

['1790-07-14', '1790-07-14', '1794-06-08', '1832-06-05', '1832-06-05', '1832-06-05', '1832-06-05', '1840-12-15', '1840-12-15', '1869-10-08', '1869-10-08', '1877-09-08', '1877-09-08', '1885-06-01', '1891-05-01', '1894-07-01', '1908-07', '1908-07', '1908-07', '1931-01-07', '1934-02-12', '1935-07-14', '1936-05-24', '1944-08-26', '1944-08-26', '1951-02', '1953-07-14', '1961-10-17', '1962-02-13', '1968-05-13', '1968-05-30', '1976-03-04', '1977-07-31', '1983-10-15', '1984-03-04', '1984-06-24', '1986-12-04', '1986-12-10', '1989-07-14', '1994-01-16', '1995-12-12', '1998-07-13', '2002-05-01', '2002-05-01', '2003-06-28', '2003-05-13', '2003-06-03', '2004-06-26', '2005-06-25', '2006-03-18', '2006-03-28', '2006-06-24', '2007-06-30', '2008-06-28', '2009-01-29', '2009-03-19', '2009-06-27', '2010-03-23', '2010-05-27', '2010-06-24', '2010-09-07', '2010-09-23', '2010-10-02', '2010-10-12', '2010-10-16', '2010-10-19', '2010-10-28', '2010-11-06', '2010-06-26', '2011-06-25', '2012-06-30', '2013-01-13', '20