# Data retrieval

This first notebook aims at retrieving a minimal set of data for running further analysis.

**Warning:** In these notebooks, we will interact with a database by adding fresh data. As a wrong move is always possible, an act-first-think-later tool that erases the data folder and resets the database is provided (see `./resetdb.sh`). Use it with caution! :)

## Introduction

In [1]:
from datetime import datetime, date, timedelta
import json
import os
import zipfile

In [2]:
import pandas as pd
import requests
from sqlalchemy import create_engine

## Configuration

In [3]:
DATADIR = "../data"

In [4]:
os.makedirs(os.path.join(DATADIR, "lyon"), exist_ok=True)

In [5]:
HOST = "localhost"
PORT = 5432
USER = "rde"
DBNAME = "jitenshea"

## Utilities

In [6]:
def get_engine():
    url = "postgresql://{user}@{host}:{port}/{dbname}".format(user=USER, host=HOST, port=PORT, dbname=DBNAME)
    return create_engine(url)

In [7]:
def create_schema(schema):
    engine = get_engine()
    engine.execute("CREATE SCHEMA IF NOT EXISTS {schema};".format(schema=schema))
    engine.dispose()

Make sure to create the `lyon` schema at this step, otherwise the following steps will fail.

In [8]:
create_schema("lyon")

## Retrieve the data

### Download the raw station data

We download the station information, and save the resulting archive into the data folder:

In [9]:
archive_path = os.path.join(DATADIR, "lyon", "lyon-stations.zip")

In [10]:
LYON_STATION_URL = "https://download.data.grandlyon.com/wfs/grandlyon?service=wfs&request=GetFeature&version=2.0.0&SRSNAME=EPSG:4326&outputFormat=SHAPEZIP&typename=pvo_patrimoine_voirie.pvostationvelov"

In [11]:
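resp = requests.get(LYON_STATION_URL)
resp.raise_for_status()
# Open the file only once the download has succeeded, so that a failed request leaves no empty archive
with open(archive_path, "wb") as fobj:
    fobj.write(resp.content)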

### Download bike availability history

#### Method 1: Latest records + Cron job

In [12]:
LYON_REALTIME_URL = "https://download.data.grandlyon.com/ws/rdata/jcd_jcdecaux.jcdvelov/all.json"

In [13]:
timestamp = datetime.now()
realtime_json_file = timestamp.strftime("%HH%MM")
realtime_json_path = os.path.join(DATADIR, "lyon", str(timestamp.year), str(timestamp.month), str(timestamp.day), realtime_json_file + ".json")
os.makedirs(os.path.dirname(realtime_json_path), exist_ok=True)
print(realtime_json_path)

# The context manager closes the session, even if the request fails
with requests.Session() as session:
    resp = session.get(LYON_REALTIME_URL)
    resp.raise_for_status()
with open(realtime_json_path, "w") as fobj:
    json.dump(resp.json(), fobj, ensure_ascii=False)


../data/lyon/2019/8/14/12H36M.json


This method provides the freshest bike availability data, hence one could build a really big history by repeating the dump every X minutes (X being an interval of your choice).

A cron job can take care of this repetition. In your shell:
```
crontab -e
```
Then in the crontab file:
```
# m h  dom mon dow   command
*/5 * * * * the-program
```
This last example would execute `the-program` every 5 minutes, every hour of every day of every month. However, it is beyond the scope of this modest workshop!
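
As a rough sketch, `the-program` could be a small Python script along the lines of the download cell above. The function names `build_dump_path` and `dump_realtime_data` are hypothetical, the 30-second timeout is an arbitrary assumption, and `urllib` from the standard library is swapped in for `requests` so that the script has no third-party dependency.

```python
import json
import os
from datetime import datetime
from urllib.request import urlopen

DATADIR = "../data"
LYON_REALTIME_URL = "https://download.data.grandlyon.com/ws/rdata/jcd_jcdecaux.jcdvelov/all.json"


def build_dump_path(timestamp, datadir=DATADIR):
    """Return the json path of a dump done at `timestamp` (hypothetical helper)."""
    filename = timestamp.strftime("%HH%MM") + ".json"
    return os.path.join(
        datadir, "lyon",
        str(timestamp.year), str(timestamp.month), str(timestamp.day),
        filename,
    )


def dump_realtime_data(timestamp=None):
    """Download the latest availability records and write them to disk."""
    timestamp = timestamp or datetime.now()
    path = build_dump_path(timestamp)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # urlopen raises an HTTPError on a 4xx/5xx answer; the timeout is an assumption
    with urlopen(LYON_REALTIME_URL, timeout=30) as resp:
        data = json.load(resp)
    with open(path, "w") as fobj:
        json.dump(data, fobj, ensure_ascii=False)
    return path


# Only the path is computed here; calling dump_realtime_data() would hit the network
print(build_dump_path(datetime(2019, 8, 14, 12, 36)))
```

The script run by cron would end with a call to `dump_realtime_data()`. Cron usually does not start from the notebook folder, so an absolute `datadir` would probably be safer than the relative `../data`.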

*NOTE:* This is still the best way to get bike-sharing system data, anyway!

#### Method 2: It's your birthday!

No need to mess with the cron jobs on your laptop in a quick-and-dirty move: we are lucky! Some investigation on the Lyon open data portal gives us a ready-to-use toy dataset with the last 7 days of bike availability, measured every 5 minutes *(sounds perfect, doesn't it?)*:

https://download.data.grandlyon.com/catalogue/srv/eng/catalog.search#/metadata/9bc6806d-e8a0-463b-aaa1-4364a75e44d7

In [14]:
LYON_AVAILABILITY_URL = "https://download.data.grandlyon.com/sos/velov?request=GetObservation&service=SOS&version=1.0.0&offering=reseau_velov&observedProperty=bikes&eventTime={begin}/{end}&responseFormat=application/json"

Before retrieving the raw history data, we need a bit of plumbing:

In [15]:
def one_week_before(timestamp):
    return timestamp - timedelta(7)

In [16]:
stop = date.today()
start = one_week_before(stop)

In [17]:
start_date = start.strftime("%Y-%m-%dT%H:%M:%SZ")
stop_date = stop.strftime("%Y-%m-%dT%H:%M:%SZ")
LYON_AVAILABILITY_FULL_URL = LYON_AVAILABILITY_URL.format(begin=start_date, end=stop_date)
LYON_AVAILABILITY_FULL_URL

'https://download.data.grandlyon.com/sos/velov?request=GetObservation&service=SOS&version=1.0.0&offering=reseau_velov&observedProperty=bikes&eventTime=2019-08-07T00:00:00Z/2019-08-14T00:00:00Z&responseFormat=application/json'
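
The same URL construction can be wrapped into a small function, which makes it easier to request another period. This is only a sketch: `build_availability_url` is a hypothetical name, and since the dataset is described as covering the last 7 days, older periods may simply return nothing.

```python
from datetime import date, timedelta

LYON_AVAILABILITY_URL = (
    "https://download.data.grandlyon.com/sos/velov?request=GetObservation"
    "&service=SOS&version=1.0.0&offering=reseau_velov&observedProperty=bikes"
    "&eventTime={begin}/{end}&responseFormat=application/json"
)


def build_availability_url(stop, days=7):
    """Build the download URL for the `days` days that end at `stop` (at midnight)."""
    start = stop - timedelta(days=days)
    fmt = "%Y-%m-%dT00:00:00Z"
    return LYON_AVAILABILITY_URL.format(
        begin=start.strftime(fmt), end=stop.strftime(fmt)
    )


print(build_availability_url(date(2019, 8, 14)))
```

Called with `date(2019, 8, 14)`, the function gives back the URL displayed above.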

In [18]:
availability_output_file = "{begin}-{end}.json".format(begin=start.strftime("%Y%m%d"), end=stop.strftime("%Y%m%d"))
availability_output_path = os.path.join(DATADIR, "lyon", "history", availability_output_file)
os.makedirs(os.path.dirname(availability_output_path), exist_ok=True)
availability_output_path

'../data/lyon/history/20190807-20190814.json'

Now that the final download URL and the output path on the file system are defined, we can do the job:

In [19]:
# The context manager closes the session, even if the request fails
with requests.Session() as session:
    resp = session.get(LYON_AVAILABILITY_FULL_URL)
    resp.raise_for_status()
with open(availability_output_path, "w") as fobj:
    json.dump(resp.json(), fobj, ensure_ascii=False)

In [20]:
ls ../data/lyon/history

20190806-20190813.csv   20190807-20190814.csv
20190806-20190813.json  20190807-20190814.json


The bike availability (recent) history is on our computers!

## Store the data into the database

### Unzip the downloaded station archive

Once we have the archive file, we can unzip it and retrieve the Lyon stations as shapefiles:

In [21]:
# The context manager closes the archive once the extraction is done
with zipfile.ZipFile(archive_path) as zip_ref:
    zip_ref.extractall(os.path.dirname(archive_path))
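
A shapefile is made of several files, and at least the `.shp`, `.shx` and `.dbf` parts are needed to read it. As an optional safeguard, the archive content can be checked before going further. The sketch below assumes that the files sit at the archive root and are named after the layer; `missing_shapefile_parts` is a hypothetical helper.

```python
import zipfile

# Mandatory components of a shapefile
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(archive_path, dataname):
    """Return the mandatory shapefile components that are absent from the archive."""
    with zipfile.ZipFile(archive_path) as archive:
        names = set(archive.namelist())
    return [
        dataname + ext
        for ext in REQUIRED_EXTENSIONS
        if dataname + ext not in names
    ]
```

With the archive downloaded above, `missing_shapefile_parts(archive_path, "pvo_patrimoine_voirie.pvostationvelov")` should return an empty list if the download is complete.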

### Store the raw station data into the database

At this point the station information lies in the shapefiles, and we still have to store it into the application database. We use `shp2pgsql` and `psql` for this purpose.

In [22]:
LYON_SRID = 3946
LYON_DATANAME = "pvo_patrimoine_voirie.pvostationvelov"

In [23]:
import subprocess

shp_file = os.path.join(os.path.dirname(archive_path), LYON_DATANAME + ".shp")
cmd = "shp2pgsql -s " + str(LYON_SRID) + " " + shp_file + " lyon.raw_station"
cmd += " | psql -h " + HOST + " -d " + DBNAME + " -U " + USER + " -p " + str(PORT)
cmd

'shp2pgsql -s 3946 ../data/lyon/pvo_patrimoine_voirie.pvostationvelov.shp lyon.raw_station | psql -h localhost -d jitenshea -U rde -p 5432'

In [24]:
subprocess.call(cmd, shell=True)

0
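
Since the command goes through the shell, a path that contains a space or a special character would break it. A possible safeguard, sketched below, is to quote each argument with `shlex.quote` from the standard library. `build_import_command` is a hypothetical helper, not part of the original workflow.

```python
import shlex


def build_import_command(shp_file, table, srid, host, port, user, dbname):
    """Build the `shp2pgsql | psql` pipeline with shell-quoted arguments."""
    shp2pgsql = "shp2pgsql -s {} {} {}".format(
        int(srid), shlex.quote(shp_file), shlex.quote(table)
    )
    psql = "psql -h {} -d {} -U {} -p {}".format(
        shlex.quote(host), shlex.quote(dbname), shlex.quote(user), int(port)
    )
    return shp2pgsql + " | " + psql


cmd = build_import_command(
    "../data/lyon/pvo_patrimoine_voirie.pvostationvelov.shp",
    "lyon.raw_station", 3946, "localhost", 5432, "rde", "jitenshea",
)
print(cmd)
```

With the values of this notebook, nothing needs quoting and the command is the same as above. It can be passed to `subprocess.call(cmd, shell=True)` as before; the returned value is the exit status of the last command of the pipe, here `psql`.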

Now the stations should be in the database in a raw format. We can check it:

In [25]:
engine = get_engine()
rset = engine.execute("SELECT count(*) FROM lyon.raw_station;")
rset.fetchone()

(369,)

There are 369 bike-sharing stations in Lyon!

### Consider a standardized version of station data

At this point, one could stop processing the station data. However, we can still improve the design of our data, especially if we aim to retrieve data for additional cities.

Here we will "simply" build a new table with fixed attributes. Particular attention must be paid to the raw attribute names, which are typically known only after exploring the data itself.

In [26]:
query = """
DROP TABLE IF EXISTS lyon.station;
CREATE TABLE lyon.station(
id varchar,
name varchar(250),
address varchar(250),
city varchar(100),
nb_stations int,
geom geometry(POINT, 4326)
);
INSERT INTO lyon.station
SELECT {id} AS id,
{name} AS name,
{address} AS address,
{city} AS city,
{nb_stations}::int AS nb_stations,
ST_TRANSFORM(ST_FORCE2D(geom), 4326) AS geom
FROM lyon.raw_station
"""

In [27]:
LYON_ID = "idstation"
LYON_NAME = "nom"
LYON_ADDRESS = "adresse1"
LYON_CITY = "commune"
LYON_NB_STATIONS = "nbbornette"
engine.execute(query.format(id=LYON_ID, name=LYON_NAME, address=LYON_ADDRESS, city=LYON_CITY, nb_stations=LYON_NB_STATIONS))
rset = engine.execute("SELECT count(*) FROM lyon.station;")
rset.fetchone()

(369,)
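
With several cities in mind, the raw attribute names can be gathered in one mapping per city, so that the same query template is reused as is. The sketch below only builds the query text. `RAW_COLUMNS` and `build_station_query` are hypothetical names, and the shortened template stands for the `INSERT` part of the query above.

```python
INSERT_TEMPLATE = """
INSERT INTO {schema}.station
SELECT {id} AS id,
{name} AS name,
{address} AS address,
{city} AS city,
{nb_stations}::int AS nb_stations,
ST_TRANSFORM(ST_FORCE2D(geom), 4326) AS geom
FROM {schema}.raw_station
"""

# Raw attribute names per city; only the Lyon ones are known from this notebook
RAW_COLUMNS = {
    "lyon": {
        "id": "idstation",
        "name": "nom",
        "address": "adresse1",
        "city": "commune",
        "nb_stations": "nbbornette",
    },
}


def build_station_query(city):
    """Fill the template with the schema and the raw column names of `city`."""
    return INSERT_TEMPLATE.format(schema=city, **RAW_COLUMNS[city])


print(build_station_query("lyon"))
```

Adding a city would then come down to adding one entry in `RAW_COLUMNS`, once its raw attributes have been explored.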

### Store the bike availability history into a csv file

Let's come back to the bike availability data. We downloaded it in the `json` format, but `csv` is more convenient: as a table, the data is far easier to handle and to store into the application database.

In [28]:
def convert_history_data(history_file):
    """Read the bike availability history data, and write it directly into a csv file

    The function, and especially the json file structure, is inferred from the Lyon Open Data portal.
    """
    with open(history_file, "r") as fobj:
        data = json.load(fobj)
    datalist = []
    for d in data["ObservationCollection"]["member"]:
        # The station ID is the second part of the member name
        station_id = d["name"].split("-")[1]
        datalist += [
            [item[0], int(float(item[1])), station_id]
            for item in d["result"]["DataArray"]["values"]
        ]
    df = pd.DataFrame(
        datalist, columns=["timestamp", "available_bikes", "id"]
    )
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    # Rows keep the order of the json file, i.e. they stay grouped by station
    df.to_csv(history_file.replace(".json", ".csv"), index=False)

In [29]:
convert_history_data(availability_output_path)

In [30]:
ls ../data/lyon/history

20190806-20190813.csv   20190807-20190814.csv
20190806-20190813.json  20190807-20190814.json


Now it should be easier to populate the database...
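
For reference, the flattening done by `convert_history_data` can be reproduced on a tiny hand-written payload. The sample below is made up: it only mimics the nesting that the function reads, and the real records may carry more fields. `flatten_history` is a hypothetical, pandas-free counterpart of the conversion loop.

```python
# Made-up sample that follows the layout read by `convert_history_data`;
# the station name "velov-9014" and the values are illustrative only
sample = {
    "ObservationCollection": {
        "member": [
            {
                "name": "velov-9014",
                "result": {
                    "DataArray": {
                        "values": [
                            ["2019-08-07T00:00:00Z", "12.0"],
                            ["2019-08-07T00:05:00Z", "11.0"],
                        ]
                    }
                },
            }
        ]
    }
}


def flatten_history(data):
    """Return one (timestamp, available_bikes, station id) row per observation."""
    rows = []
    for member in data["ObservationCollection"]["member"]:
        station_id = member["name"].split("-")[1]
        for item in member["result"]["DataArray"]["values"]:
            # Bike counts come as float-like strings, hence the double conversion
            rows.append((item[0], int(float(item[1])), station_id))
    return rows


for row in flatten_history(sample):
    print(row)
```

Each observation becomes one row with three fields, which is the structure of the csv file.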

### Store the bike availability history into the database

In [31]:
availability_timeseries = pd.read_csv(availability_output_path.replace(".json", ".csv"), parse_dates=["timestamp"])

In [32]:
availability_timeseries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 669049 entries, 0 to 669048
Data columns (total 3 columns):
timestamp          669049 non-null datetime64[ns, UTC]
available_bikes    669049 non-null int64
id                 669049 non-null int64
dtypes: datetime64[ns, UTC](1), int64(2)
memory usage: 15.3 MB


For the sake of data consistency, store the IDs as strings (as in the `lyon.station` table):

In [33]:
# Assign the whole column, so that its dtype actually changes from int to string
availability_timeseries["id"] = availability_timeseries["id"].astype(str)

In [34]:
availability_timeseries.head()

|   | timestamp | available_bikes | id |
|---|-----------|-----------------|----|
| 0 | 2019-08-07 00:00:00+00:00 | 12 | 9014 |
| 1 | 2019-08-07 00:05:00+00:00 | 12 | 9014 |
| 2 | 2019-08-07 00:10:00+00:00 | 12 | 9014 |
| 3 | 2019-08-07 00:15:00+00:00 | 12 | 9014 |
| 4 | 2019-08-07 00:20:00+00:00 | 12 | 9014 |


In [35]:
engine.execute("DROP TABLE IF EXISTS lyon.timeseries;")
availability_timeseries.to_sql("timeseries", schema="lyon", con=engine, chunksize=50000, method="multi", index=False)

In [36]:
rset = engine.execute("SELECT count(*) FROM lyon.timeseries;")
rset.fetchone()

(669049,)

At this point, we have built our database, and populated it with station and bike availability data!