# Analyze data with machine learning

Although many interesting questions could be addressed with this dataset, we will focus on a single use case, summarized by the following question: **Can we group the shared-bike stations according to their occupancy profile over time?**

## Introduction

In [105]:
from datetime import datetime, date, timedelta
import os
import pytz

In [106]:
import pandas as pd
from sklearn.cluster import KMeans
from sqlalchemy import create_engine

## Configuration

In [107]:
DATADIR = "../data"

In [108]:
HOST = "localhost"
PORT = 5432
USER = "rde"
DBNAME = "jitenshea"

## Utilities

In [180]:
def get_engine():
    url = "postgresql://{user}@{host}:{port}/{dbname}".format(user=USER, host=HOST, port=PORT, dbname=DBNAME)
    return create_engine(url)
engine = get_engine()

## Compute stations clusters

### Retrieve the previously stored timeseries data

In [114]:
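# NOTE: `start` and `stop` are the timezone-aware period bounds defined in a
# later cell of this notebook (In [118]); that cell must be run first.
availability_input_file = "{begin}-{end}.csv".format(begin=start.strftime("%Y%m%d"), end=stop.strftime("%Y%m%d"))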
availability_input_path = os.path.join(DATADIR, "lyon", "history", availability_input_file)
availability_input_path

'../data/lyon/history/20190807-20190814.csv'

In [115]:
df = pd.read_csv(availability_input_path, parse_dates=["timestamp"])

In [116]:
df.shape

(669049, 3)

In [117]:
df.timestamp.min(), df.timestamp.max()

(Timestamp('2019-08-07 00:00:00+0000', tz='UTC'),
 Timestamp('2019-08-14 00:00:00+0000', tz='UTC'))

To keep more control over the clustering process, we can treat the time period as an input parameter and select the data accordingly.

Note that the timestamps are stored with timezone information, so the period bounds must be declared as timezone-aware datetimes.

In [118]:
today = date.today()
stop = datetime(today.year, today.month, today.day, 0, 0, tzinfo=pytz.utc)
start = stop - timedelta(7)
start, stop

(datetime.datetime(2019, 8, 7, 0, 0, tzinfo=<UTC>),
 datetime.datetime(2019, 8, 14, 0, 0, tzinfo=<UTC>))
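
To illustrate why the bounds need a timezone, here is a minimal standard-library sketch (not part of the original notebook; the fixed dates are illustrative assumptions). Ordering a naive datetime against an aware one raises a `TypeError`, whereas two aware datetimes compare without trouble.

```python
from datetime import datetime, timedelta, timezone

# Timezone-aware bounds, built in the same spirit as `start` and `stop`
# (the fixed date is an assumption made for reproducibility).
stop_aware = datetime(2019, 8, 14, tzinfo=timezone.utc)
start_aware = stop_aware - timedelta(days=7)

# A naive datetime carries no timezone information.
naive = datetime(2019, 8, 10)

try:
    in_range = start_aware <= naive <= stop_aware
except TypeError:
    # Python refuses to order naive and aware datetimes.
    in_range = None

# Attaching a timezone makes the comparison well defined.
aware = naive.replace(tzinfo=timezone.utc)
in_range_aware = start_aware <= aware <= stop_aware
```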

In [119]:
df = df[(df["timestamp"] >= start) & (df["timestamp"] <= stop)]
df.shape

(669049, 3)

### Data preprocessing

First, we remove from the analysis the shared-bike stations that look unused during the period (those that remained empty throughout).

In [120]:
max_bikes = df.groupby("id")["available_bikes"].max()
unactive_stations = max_bikes[max_bikes==0].index.tolist()
df = df[~ df["id"].isin(unactive_stations)]

In [121]:
unactive_stations, df.shape

([8024, 10004, 10039, 10072], (661060, 3))

Then we resample the data every 5 minutes, grouped by station ID:

In [153]:
rdf = (df.set_index("timestamp")
       .groupby("id")["available_bikes"]
       .resample("5min")  # formerly "5T"; the "T" alias is deprecated since pandas 2.2
       .mean()
       .bfill()
       .unstack(0))
rdf.shape

(2017, 331)

*Spoiler:* this data is already sampled at 5-minute intervals; however, we keep the code as is for data consistency.
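
As a rough illustration of what this chain does on irregular data, the sketch below runs it on a few made-up records for two hypothetical stations. Empty 5-minute bins are backfilled with the next known value, and `unstack(0)` turns station IDs into columns.

```python
import pandas as pd

# Toy records for two stations at irregular timestamps (made-up values).
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2019-08-07 00:00", "2019-08-07 00:02", "2019-08-07 00:10",
        "2019-08-07 00:00", "2019-08-07 00:05",
    ], utc=True),
    "available_bikes": [4, 6, 8, 1, 3],
})

wide = (toy.set_index("timestamp")
        .groupby("id")["available_bikes"]
        .resample("5min")   # one bin per 5-minute period, per station
        .mean()             # average the records falling in each bin
        .bfill()            # fill empty bins with the next known value
        .unstack(0))        # station IDs become columns
```

Station 1 has no record between 00:05 and 00:10, so that bin takes the 00:10 value. Station 2 has no bin at 00:10 at all, which leaves a missing value in the wide table.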

Then we remove the weekend days from the analysis, anticipating that they may "pollute" the cluster formation:

In [154]:
rdf = rdf[rdf.index.weekday < 5]

And finally, we apply a naive normalization scheme, dividing each station's series by its maximum observed value, to work with station filling rates instead of bike counts:

In [155]:
rdf = rdf / rdf.max()
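
A minimal sketch of this normalization on made-up numbers: `DataFrame.max()` returns one maximum per column, so each station is scaled by its own peak. The observed maximum only approximates the station capacity, and may underestimate it if a station never filled up during the period.

```python
import pandas as pd

# Bike counts for two hypothetical stations (made-up values).
wide = pd.DataFrame({1001: [2.0, 10.0, 5.0], 1003: [0.0, 4.0, 8.0]})

# Column-wise division: each station is divided by its own observed maximum.
rates = wide / wide.max()
```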

Once the data is normalized, we aggregate availability at the hour level for the clustering step:

In [156]:
rdf["hour"] = rdf.index.hour
rdf = rdf.groupby("hour").mean()

This last operation provides the typical weekday profile of each station, for each hour of the day:

In [160]:
rdf.iloc[:,:3]

| hour | 1001 | 1003 | 1005 |
|---|---|---|---|
| 0 | 0.597336 | 0.416393 | 0.632172 |
| 1 | 0.580208 | 0.374444 | 0.633333 |
| 2 | 0.563542 | 0.334444 | 0.621875 |
| 3 | 0.557292 | 0.334444 | 0.591667 |
| 4 | 0.459375 | 0.321111 | 0.586458 |
| 5 | 0.265625 | 0.295556 | 0.472917 |
| 6 | 0.119792 | 0.167778 | 0.3375 |
| 7 | 0.041667 | 0.06 | 0.197917 |
| 8 | 0.063542 | 0.054444 | 0.163542 |
| 9 | 0.084375 | 0.048889 | 0.211458 |
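
To make the weekday filter and the hourly aggregation concrete, here is a small sketch on synthetic filling rates for one hypothetical station over a Thursday, a Friday and a Saturday. It groups directly on the index hour rather than adding an `hour` column, which should be an equivalent variant of the cells above.

```python
import pandas as pd

# Three days of hourly filling rates: Thursday 2019-08-08 (0.25 all day),
# Friday 2019-08-09 (0.75 all day) and Saturday 2019-08-10 (1.0 all day).
index = pd.date_range("2019-08-08", periods=72, freq="60min", tz="UTC")
values = [0.25] * 24 + [0.75] * 24 + [1.0] * 24
rates = pd.DataFrame({1001: values}, index=index)

# Monday is 0 and Sunday is 6, so `< 5` keeps Monday to Friday only.
weekdays = rates[rates.index.weekday < 5]

# Average all remaining days hour by hour: one row per hour of the day.
profile = weekdays.groupby(weekdays.index.hour).mean()
```

The Saturday values are discarded, and each of the 24 rows of `profile` is the mean of the Thursday and Friday values for that hour.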


### Apply the k-means algorithm

At this point, we have a clusterable dataset! Let's apply the easiest step...

In [161]:
N_CLUSTERS = 4

Here we are, `scikit-learn` makes it as easy as two lines of Python:

In [162]:
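# n_init is set explicitly to match the output below; its default changed in recent scikit-learn releases.
model = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0)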
model.fit(rdf.T)
model

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
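
The model is fitted on `rdf.T` because k-means expects one row per sample: here each station is a sample and each hour of the day is a feature. The sketch below shows the same idea on four synthetic stations (the names and profiles are assumptions made for illustration only).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

hours = np.arange(24)

# Hypothetical profiles: "day" stations are full from 8:00 to 18:00,
# "night" stations are full the rest of the time.
day = ((hours >= 8) & (hours < 18)).astype(float)
night = 1.0 - day
profiles = pd.DataFrame(
    {"s1": day, "s2": 0.9 * day, "s3": night, "s4": 0.9 * night},
    index=hours,
)

# Transpose so that rows are stations (samples) and columns are hours (features).
X = profiles.T.to_numpy()

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)
```

The two "day" stations end up in one cluster and the two "night" stations in the other; each centroid in `model.cluster_centers_` is a 24-value profile.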

In [183]:
df_labels = pd.DataFrame({"id_station": rdf.columns, "labels": model.labels_})
df_centroids = pd.DataFrame(model.cluster_centers_, columns=["h{:02d}".format(i) for i in range(24)]).reset_index()

For each station, we have a cluster...

In [170]:
df_labels.head()

|  | id_station | labels |
|---|---|---|
| 0 | 1001 | 2 |
| 1 | 1003 | 1 |
| 2 | 1005 | 2 |
| 3 | 1012 | 3 |
| 4 | 1013 | 2 |


...and for each cluster, we have the centroid expressed as a typical weekday profile!

In [184]:
df_centroids.head()

|  | index | h00 | h01 | h02 | h03 | h04 | h05 | h06 | h07 | h08 | ... | h14 | h15 | h16 | h17 | h18 | h19 | h20 | h21 | h22 | h23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.221598 | 0.225618 | 0.230588 | 0.240347 | 0.268194 | 0.313796 | 0.388907 | 0.548083 | 0.589657 | ... | 0.510559 | 0.381177 | 0.24442 | 0.203486 | 0.199108 | 0.192941 | 0.174875 | 0.161158 | 0.16655 | 0.168144 |
| 1 | 1 | 0.399133 | 0.39697 | 0.394776 | 0.388696 | 0.38691 | 0.380874 | 0.341407 | 0.309629 | 0.309802 | ... | 0.32601 | 0.33795 | 0.355072 | 0.400034 | 0.422547 | 0.436883 | 0.441695 | 0.440805 | 0.42853 | 0.425325 |
| 2 | 2 | 0.742642 | 0.748009 | 0.748207 | 0.731734 | 0.684697 | 0.609994 | 0.453487 | 0.315072 | 0.278896 | ... | 0.30281 | 0.365939 | 0.478828 | 0.547905 | 0.584031 | 0.616358 | 0.649894 | 0.683145 | 0.714461 | 0.736748 |
| 3 | 3 | 0.645313 | 0.643882 | 0.64211 | 0.640843 | 0.635387 | 0.622695 | 0.605428 | 0.598349 | 0.606151 | ... | 0.600755 | 0.589948 | 0.581647 | 0.596499 | 0.620747 | 0.629145 | 0.636625 | 0.644712 | 0.648144 | 0.646979 |
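
The `index` column produced by `reset_index()` holds the cluster number, so it can serve as a join key with the `labels` column. As a sketch under that assumption, and with made-up values and only two hourly columns, each station can be matched with the typical profile of its cluster:

```python
import pandas as pd

# Toy stand-ins for the two dataframes built above (made-up values).
df_labels = pd.DataFrame({"id_station": [1001, 1003, 1005], "labels": [1, 0, 1]})
df_centroids = pd.DataFrame(
    [[0.2, 0.6], [0.7, 0.3]], columns=["h00", "h01"]
).reset_index()

# Join each station with the centroid of the cluster it belongs to.
station_profiles = (
    df_labels
    .merge(df_centroids, left_on="labels", right_on="index")
    .drop(columns="index")
)
```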


## Store the clustering outputs to the application database

We have clustered the shared-bike stations and derived typical weekday profiles. To close the loop, we still have to store this newly produced data in the database.

As the outputs depend heavily on the chosen time period, it is recommended to store the start and stop datetimes as well: this modeling choice allows us to store multiple clustering outputs in the database.

In [185]:
df_labels.loc[:, "start"] = start
df_labels.loc[:, "stop"] = stop
df_centroids.loc[:, "start"] = start
df_centroids.loc[:, "stop"] = stop
df_labels.head()

|  | id_station | labels | start | stop |
|---|---|---|---|---|
| 0 | 1001 | 2 | 2019-08-07 00:00:00+00:00 | 2019-08-14 00:00:00+00:00 |
| 1 | 1003 | 1 | 2019-08-07 00:00:00+00:00 | 2019-08-14 00:00:00+00:00 |
| 2 | 1005 | 2 | 2019-08-07 00:00:00+00:00 | 2019-08-14 00:00:00+00:00 |
| 3 | 1012 | 3 | 2019-08-07 00:00:00+00:00 | 2019-08-14 00:00:00+00:00 |
| 4 | 1013 | 2 | 2019-08-07 00:00:00+00:00 | 2019-08-14 00:00:00+00:00 |


In [181]:
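# `if_exists="replace"` drops any existing table before writing, so no explicit
# DROP TABLE is needed (`engine.execute` was removed in SQLAlchemy 2.0).
df_labels.to_sql("cluster", schema="lyon", con=engine, index=False, if_exists="replace")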

In [186]:
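# `if_exists="replace"` drops any existing table before writing, so no explicit
# DROP TABLE is needed (`engine.execute` was removed in SQLAlchemy 2.0).
df_centroids.to_sql("centroid", schema="lyon", con=engine, index=False, if_exists="replace")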
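
The cells above overwrite each table, so only the latest run is kept. If several clustering runs should coexist, as the `start` and `stop` columns allow, one option is to append rows instead. The sketch below illustrates this with an in-memory SQLite database standing in for PostgreSQL; the `store_labels` helper, the string-typed period bounds and the toy labels are all assumptions made for the example, not part of the original workflow.

```python
import sqlite3

import pandas as pd


def store_labels(con, labels, start, stop):
    """Append one clustering run, tagged with its period, to the `cluster` table."""
    tagged = labels.assign(start=start, stop=stop)
    # "append" creates the table on first use and adds rows afterwards.
    tagged.to_sql("cluster", con=con, index=False, if_exists="append")


# In-memory SQLite database, used here only as a stand-in for PostgreSQL.
con = sqlite3.connect(":memory:")

# Two hypothetical runs over consecutive weeks (made-up labels).
run_1 = pd.DataFrame({"id_station": [1001, 1003], "labels": [2, 1]})
run_2 = pd.DataFrame({"id_station": [1001, 1003], "labels": [0, 1]})
store_labels(con, run_1, "2019-08-07", "2019-08-14")
store_labels(con, run_2, "2019-08-14", "2019-08-21")

stored = pd.read_sql("SELECT * FROM cluster", con)
n_periods = stored["start"].nunique()
con.close()
```

With this layout, a given run can later be selected by filtering on its `start` and `stop` values.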

That's all, folks, for populating the database!

Now the database contains:
- the shared-bike station description;
- one week of bike availability timeseries;
- the clustering outputs.