# Data collection

The goal of this presentation is to automate data collection.
We need historical data to build our model and predict electrical consumption in Paris for the next day (D+1).  
Basically, we need to collect as much relevant data as possible, starting from nothing.

### Expected output

Datasets saved on our computer in file formats that Python reads easily: CSV, JSON, XML, Excel.
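
As a minimal sketch of what "easily readable" means in practice, the cell below writes a tiny table to CSV and JSON and reads it back with pandas. The column names, values and file names are made up for illustration, and the files go to a temporary folder rather than `./data`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row dataset, only used to illustrate the round trip
frame = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02"],
    "consumption_mw": [8000, 7900],
})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "sample.csv")
    json_path = os.path.join(folder, "sample.json")

    # Write both formats
    frame.to_csv(csv_path, sep=";", index=False)
    frame.to_json(json_path, orient="records")

    # Read them back
    from_csv = pd.read_csv(csv_path, delimiter=";", parse_dates=["date"])
    from_json = pd.read_json(json_path, orient="records")

print(from_csv)
print(from_json)
```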

### Quiz (5 minutes): What kind of data might be useful to make a prediction?

### Suggestions:

The data we are going to collect:

- Historical electrical consumption of Paris (more than 2 years of data)
- Historical weather (more than 2 years of data)
- Weather forecast (temperature, wind, solar radiation, ...)
- Days off in France
- ~~Number of electric vehicles~~

## https://github.com/LucasBerbesson/ds2


### Workshop (15 minutes): Try to collect the data


## Easy: Historical electrical consumption

[Electrical consumption in Île-de-France between 2013 and 2017](https://rte-opendata.opendatasoft.com/explore/dataset/eco2mix_regional_cons_def/export/?disjunctive.libelle_region&disjunctive.nature&sort=-date_heure&refine.libelle_region=Ile-de-France)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# The export uses ";" as separator; parse the timestamp column and use it as index
consumption = pd.read_csv("./data/eco2mix_regional_cons_def.csv", delimiter=";", parse_dates=["Date - Heure"])
consumption.set_index("Date - Heure", inplace=True)

In [None]:
consumption.sort_index(inplace=True)
consumption.head(3)

In [None]:
consumption.resample('D').count()

In [None]:
consumption["Code INSEE région"].value_counts()

In [None]:
# 31 March 2013 appears twice, which is strange.
# That date is also the switch to daylight saving time in France, a possible cause.
consumption.describe()
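
One way to check for repeated timestamps like this is to ask pandas which index values occur more than once. The cell below is a sketch on a made-up four-row sample (the column name `consumption_mw` and the values are hypothetical), not on the real file.

```python
import pandas as pd

# Hypothetical half-hourly sample with one timestamp repeated on purpose
index = pd.to_datetime([
    "2013-03-31 00:00",
    "2013-03-31 00:30",
    "2013-03-31 00:30",
    "2013-03-31 01:00",
])
sample = pd.DataFrame({"consumption_mw": [8000, 7900, 7900, 7800]}, index=index)


def duplicated_timestamps(df):
    """Return every row whose index value occurs more than once."""
    return df[df.index.duplicated(keep=False)]


def drop_duplicated_timestamps(df):
    """Keep only the first row for each timestamp."""
    return df[~df.index.duplicated(keep="first")]


dupes = duplicated_timestamps(sample)
clean = drop_duplicated_timestamps(sample)
print(dupes)
```

Whether keeping the first row is the right choice depends on why the duplicates exist, so it is worth looking at `dupes` before dropping anything.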

## Less easy : Historical weather

This dataset does not contain enough data: [Prévision Météo - Paris - AROME](https://public.opendatasoft.com/explore/dataset/arome-0025-sp1_sp2_paris/export/)  
Let's pay for some data! [OpenWeatherMap API](https://openweathermap.org/history-bulk) ($10 for 5 years of weather in Paris: a bargain!)


In [None]:
weather = pd.read_csv("./data/meteo-paris.csv")
# 'dt' holds Unix timestamps in seconds; convert them and use them as index
weather['dt'] = pd.to_datetime(weather['dt'], unit='s')
weather.set_index('dt', inplace=True)

In [None]:
weather.head()

In [None]:
print(weather.index.min())
print(weather.index.max())

In [None]:
weather.resample('D').count()
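
The daily counts are mostly useful for spotting days with missing records. The cell below is a small sketch of that check on a made-up hourly series; the column name `temp`, the dates and the expected number of records per day are assumptions for the example.

```python
import pandas as pd

# Hypothetical hourly series over three days
index = pd.date_range("2017-01-01", periods=72, freq=pd.Timedelta(hours=1))
sample = pd.DataFrame({"temp": range(72)}, index=index)

# Remove six hours (06:00 to 11:00) from the second day to simulate a gap
sample = sample.drop(index[30:36])


def incomplete_days(df, expected_per_day):
    """Return the days that have fewer rows than expected."""
    counts = df.resample("D").size()
    return counts[counts < expected_per_day]


gaps = incomplete_days(sample, expected_per_day=24)
print(gaps)
```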

## Weather forecast 

To make a prediction, we will need the weather forecast.
Let's use an API to get it.

In [46]:
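import os
import requests
import pandas as pd

# The API key is read from an environment variable so it never appears in the notebook
token = os.environ["OPENWEATHERMAP"]
# id=2988507 is the city id used here for Paris
response = requests.get(
    "http://api.openweathermap.org/data/2.5/forecast",
    params={"id": 2988507, "mode": "json", "APPID": token},
).json()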
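
The JSON answer still has to be turned into a table before it can be used next to the historical weather. The cell below is a sketch that works on a hand-written payload instead of a live call. It assumes the forecast response has a `list` of entries with `dt`, `main.temp`, `main.humidity` and `wind.speed`, and that temperatures come in Kelvin when no unit is requested. The values and the helper name `forecast_to_frame` are hypothetical.

```python
import pandas as pd

# Hypothetical, truncated payload; field names are assumed to match the API response
sample_response = {
    "list": [
        {"dt": 1514764800, "main": {"temp": 280.15, "humidity": 81}, "wind": {"speed": 4.1}},
        {"dt": 1514775600, "main": {"temp": 279.15, "humidity": 85}, "wind": {"speed": 3.6}},
    ]
}


def forecast_to_frame(payload):
    """Flatten the forecast entries into a DataFrame indexed by timestamp."""
    rows = [
        {
            "dt": entry["dt"],
            # Assumes Kelvin input; convert to Celsius
            "temp_celsius": entry["main"]["temp"] - 273.15,
            "humidity": entry["main"]["humidity"],
            "wind_speed": entry["wind"]["speed"],
        }
        for entry in payload["list"]
    ]
    frame = pd.DataFrame(rows)
    frame["dt"] = pd.to_datetime(frame["dt"], unit="s")
    return frame.set_index("dt")


forecast = forecast_to_frame(sample_response)
print(forecast)
```

With a real answer, `forecast_to_frame(response)` should give the same layout, provided the response actually has the assumed fields.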

## Days off in France

No dataset is easily available, so we are going to scrape the web:
https://www.calendrier-365.fr

In [37]:
import re
import requests

jours_feries = []
for year in range(2010, 2019):
    response = requests.get("https://www.calendrier-365.fr/jours-feries/{}.html".format(year))
    # Raw string so that \d is read as a regex digit class, not as a string escape
    jours_feries += re.findall(r"\d{4}-\d{2}-\d{2}", response.text)
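
Because the regular expression matches every date-shaped string in the page, the same date may be captured more than once. The cell below sketches a deduplication step on a made-up HTML fragment; the markup is hypothetical and only loosely resembles what a holiday page could contain.

```python
import re

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical HTML fragment in which one date appears twice
sample_html = (
    '<span itemprop="startDate" content="2016-01-01">1 janvier</span>'
    '<a href="/calendrier/2016-01-01.html">lien</a>'
    '<span itemprop="startDate" content="2016-03-28">28 mars</span>'
)


def extract_unique_dates(text):
    """Return the distinct ISO dates found in the text, in chronological order."""
    return sorted(set(DATE_PATTERN.findall(text)))


dates = extract_unique_dates(sample_html)
print(dates)
```

Sorting the strings is enough here because ISO dates sort chronologically.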

In [None]:
from bs4 import BeautifulSoup
import requests

days_off = []

# Fetch a single year first to inspect the page structure
response = requests.get('https://www.calendrier-365.fr/jours-feries/2016.html')

In [None]:
days_off = []
for year in range(2012, 2020):
    url = 'https://www.calendrier-365.fr/jours-feries/{}.html'.format(year)
    response = requests.get(url)
    print("Scraping:", url)
    soup = BeautifulSoup(response.text, "lxml")
    # Each holiday is a <span itemprop="startDate"> whose content attribute holds the date
    for x in soup.find_all("span", {"itemprop": "startDate"}):
        days_off.append(x.attrs["content"])

In [None]:
days_off

In [None]:
def is_day_off(date):
    """
    Tell whether a date is a day off in France.
    Only works for the years scraped into days_off (2012 to 2019).
    """
    return date.strftime("%Y-%m-%d") in days_off

In [None]:
import datetime

today = datetime.datetime.today()

# Saturday is weekday 5; always move forward by at least one day
days_ahead = (5 - today.weekday()) % 7 or 7
next_saturday = today + datetime.timedelta(days=days_ahead)
christmas = datetime.datetime(2018, 12, 25)
easter = datetime.datetime(2015, 4, 5)  # Easter Sunday

print(is_day_off(next_saturday))
print(is_day_off(today))
print(is_day_off(christmas))
print(is_day_off(easter))
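
Looking a date up in a list scans the whole list each time. When the check has to run on every row of a large dataset, a set of `date` objects is a common alternative. The cell below is a sketch with a hypothetical two-entry list standing in for the scraped `days_off`; the names `days_off_sample` and `is_day_off_fast` are made up for the example.

```python
import datetime

# Hypothetical subset standing in for the scraped list of days off
days_off_sample = ["2018-05-01", "2018-12-25"]

# Parse once into a set of date objects for constant-time lookups
days_off_set = {
    datetime.datetime.strptime(day, "%Y-%m-%d").date() for day in days_off_sample
}


def is_day_off_fast(date, holidays):
    """Accept a datetime or a date and test membership in the holiday set."""
    day = date.date() if isinstance(date, datetime.datetime) else date
    return day in holidays


print(is_day_off_fast(datetime.datetime(2018, 12, 25, 14, 30), days_off_set))
print(is_day_off_fast(datetime.date(2018, 12, 26), days_off_set))
```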

## Strikes in Paris

Credit to William Revah

In [41]:
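from bs4 import BeautifulSoup
import requests
import datetime

strikes = []

# Wikipedia list of the largest demonstrations in France
url = "https://fr.wikipedia.org/wiki/Liste_des_manifestations_les_plus_importantes_en_France"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        # find_next searches forward through the document, not only inside the row,
        # so a row without its own <time> tag picks up the next one (hence repeated dates)
        date = row.find_next("time")
        if date is not None and "datetime" in date.attrs:
            strikes.append(date.attrs["datetime"])

print(strikes)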


['1790-07-14',
 '1790-07-14',
 '1794-06-08',
 '1832-06-05',
 '1832-06-05',
 '1832-06-05',
 '1832-06-05',
 '1840-12-15',
 '1840-12-15',
 '1869-10-08',
 '1869-10-08',
 '1877-09-08',
 '1877-09-08',
 '1885-06-01',
 '1891-05-01',
 '1894-07-01',
 '1908-07-31',
 '1908-07-31',
 '1908-07-31',
 '1931-01-07',
 '1934-02-12',
 '1935-07-14',
 '1936-05-24',
 '1944-08-26',
 '1944-08-26',
 '1951-02',
 '1953-07-14',
 '1961-10-17',
 '1962-02-13',
 '1968-05-13',
 '1968-05-30',
 '1977-07-31',
 '1983-10-15',
 '1984-03-04',
 '1984-06-24',
 '1986-12-04',
 '1986-12-10',
 '1989-07-14',
 '1994-01-16',
 '1995-12-12',
 '1998-07-13',
 '2002-05-01',
 '2002-05-01',
 '2003-05-13',
 '2003-05-13',
 '2003-06-03',
 '2013-05-17',
 '2013-05-17',
 '2006-03-18',
 '2006-03-28',
 '2013-05-17',
 '2013-05-17',
 '2009-01-29',
 '2009-01-29',
 '2009-03-19',
 '2010-03-23',
 '2010-03-23',
 '2010-05-27',
 '2010-06-24',
 '2010-09-07',
 '2010-09-23',
 '2010-10-02',
 '2010-10-12',
 '2010-10-16',
 '2010-10-19',
 '2010-10-28',
 '2010-11-06'