# Author: Dinesh Sadhwani

### **Objective:** Extract data from the JCDecaux API over a period of 3 days before Christmas 2021 (19 Dec - 22 Dec) and 3 days after Christmas 2021 (26 Dec - 29 Dec)

#### **Problem:** The JCDecaux API server provides fresh bike data every minute. However, past data is either not stored on their server or not made accessible. So, to collect data over a period of time (3 days in our case), we have to keep the Python script running for that whole period.

#### **Solution:** We set up a Virtual Machine (VM) instance on Google Cloud Platform to act as a server that runs the script for the desired amount of time. Every 30 minutes, the pandas dataframe that collects the data was written to a csv file on the hard disk of the VM instance. Once each 3-day data collection period was finished, the data was copied to a Google Cloud Storage bucket so that the VM instance could be shut down to avoid further hourly costs.

## Phase 1: Get raw data
By the end of this phase we have two sets of csv files, one containing the pre-Christmas bikes data and the other the post-Christmas bikes data.

In [1]:
# importing all the necessary libraries

import os
import requests
import pandas as pd
import time
import datetime
import glob

In [2]:
# set up the variables for the loop that repeatedly queries the API server. These variables are the same for
# both the pre-Christmas and post-Christmas data

# 3 days = 4320 minutes, which is how long the loop will run
number_of_minutes = 4320

# this variable tells the loop to write the dataframe to csv file as described in the solution above
write_interval = 30

# the API key used for authentication
API_KEY = "2cca7c8c88124160de93f6619ea071b40b90ce44"

URL = "https://api.jcdecaux.com/vls/v3/stations?"

# this list defines the 4 cities for which we are pulling the data
CITIES = ["lyon", "dublin", "bruxelles", "toyama"]
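
The loops in this notebook build each request URL by string concatenation. A minimal sketch of the same step using the standard library's `urlencode`, which also escapes any special characters in the parameters. The helper name `station_url` and the `JCDECAUX_API_KEY` environment variable are assumptions made for illustration and are not part of the original script.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://api.jcdecaux.com/vls/v3/stations"


def station_url(contract, api_key, base_url=BASE_URL):
    """Return the stations URL for one contract (city)."""
    query = urlencode({"contract": contract, "apiKey": api_key})
    return f"{base_url}?{query}"


# Hypothetical: read the key from an environment variable instead of
# hard-coding it; falls back to a placeholder if the variable is not set.
api_key = os.environ.get("JCDECAUX_API_KEY", "YOUR_API_KEY")

for city in ["lyon", "dublin", "bruxelles", "toyama"]:
    print(station_url(city, api_key))
```

Keeping the key out of the notebook in this way is a design choice rather than a requirement of the API.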

In [3]:
# initialize the pandas dataframe that will hold the 30 minutes bike data

bikes_data_df = pd.DataFrame()

The API is called in two separate loops, one for each 3-day period.

#### **Loop for 19 Dec to 22 Dec**

In [None]:
# loop starts at minute 1 instead of minute 0, so that the modulo check below first fires after 30 minutes

for minute in range(1, number_of_minutes + 1):
    for city in CITIES:
        bikes = URL + "contract=" + city + "&apiKey=" + API_KEY
        response = requests.get(bikes)
        data = response.json()
        df = pd.DataFrame(data)
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
        bikes_data_df = pd.concat([bikes_data_df, df], ignore_index=True)

    # this if statement ensures that the dataframe is written to csv every 30 minutes and
    # that bikes_data_df is freed up to hold the data of the next 30 minutes

    if minute % write_interval == 0:

        # the columns that contain complex json objects are split into columns and then dropped

        bikes_data_df = bikes_data_df.join(
            pd.json_normalize(bikes_data_df["totalStands"].tolist()).add_prefix("totalStands_")).join(
            pd.json_normalize(bikes_data_df["mainStands"].tolist()).add_prefix("mainStands_")).drop(
            ["totalStands", "mainStands", "overflowStands"], axis=1)

        # stations often see no activity for several minutes, so consecutive API refreshes can return
        # identical rows for a station (same lastUpdate).
        # drop_duplicates() removes these repeats, so each csv file contains unique rows

        bikes_data_df = bikes_data_df.drop_duplicates(subset=["number", "contractName", "lastUpdate"])

        bikes_data_df.to_csv(f"./pre_christmas/pre_christmas_{datetime.datetime.now().strftime('%d_%m_%H_%M_%S')}.csv", index=False)
        
        # once the csv file is created for our 30 minute data, the bikes_data_df variable is freed up to gather data for the next 30 minute segment

        bikes_data_df = pd.DataFrame()
 
    # this final line is perhaps the most important part. Without it, the for loop could overload the server with requests
    # and our account may get banned. Also, since the API server updates the data every minute, this pause keeps the script roughly in step with the server

    time.sleep(60)
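
Because `time.sleep(60)` runs after the requests and any csv write, each iteration lasts slightly longer than a minute, so 4320 iterations take somewhat more than 3 days. A minimal sketch of one way to hold a fixed cadence is to sleep only until the next scheduled deadline. The function `run_every` and its parameters are hypothetical names, not part of the original script.

```python
import time


def run_every(interval_seconds, iterations, task, clock=time.monotonic, sleep=time.sleep):
    """Call task(i) for i = 1..iterations, aiming to start one call per interval."""
    start = clock()
    for i in range(1, iterations + 1):
        task(i)
        # The deadline is measured from the start of the run, so the time
        # spent inside task() does not accumulate from one tick to the next.
        next_deadline = start + i * interval_seconds
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)


# Tiny demonstration with a 10 ms interval so that it finishes immediately.
run_every(0.01, 3, lambda i: print("tick", i))
```

In the collection script, `task` would hold the body of the loop (query the four cities, and write the csv file when the tick number is a multiple of `write_interval`), with `interval_seconds=60`.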

#### **Loop for 26 Dec to 29 Dec**

This is the same as the previous for loop. The only difference is the path and file name passed to the `to_csv()` function.

In [None]:
# loop starts at minute 1 instead of minute 0, so that the modulo check below first fires after 30 minutes

for minute in range(1, number_of_minutes + 1):
    for city in CITIES:
        bikes = URL + "contract=" + city + "&apiKey=" + API_KEY
        response = requests.get(bikes)
        data = response.json()
        df = pd.DataFrame(data)
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
        bikes_data_df = pd.concat([bikes_data_df, df], ignore_index=True)

    # this if statement ensures that the dataframe is written to csv every 30 minutes and
    # that bikes_data_df is freed up to hold the data of the next 30 minutes

    if minute % write_interval == 0:

        # the columns that contain complex json objects are split into columns and then dropped

        bikes_data_df = bikes_data_df.join(
            pd.json_normalize(bikes_data_df["totalStands"].tolist()).add_prefix("totalStands_")).join(
            pd.json_normalize(bikes_data_df["mainStands"].tolist()).add_prefix("mainStands_")).drop(
            ["totalStands", "mainStands", "overflowStands"], axis=1)

        # stations often see no activity for several minutes, so consecutive API refreshes can return
        # identical rows for a station (same lastUpdate).
        # drop_duplicates() removes these repeats, so each csv file contains unique rows

        bikes_data_df = bikes_data_df.drop_duplicates(subset=["number", "contractName", "lastUpdate"])

        bikes_data_df.to_csv(f"./post_christmas/post_christmas_{datetime.datetime.now().strftime('%d_%m_%H_%M_%S')}.csv", index=False)
        
        # once the csv file is created for our 30 minute data, the bikes_data_df variable is freed up to gather data for the next 30 minute segment

        bikes_data_df = pd.DataFrame()
 
    # this final line is perhaps the most important part. Without it, the for loop could overload the server with requests
    # and our account may get banned. Also, since the API server updates the data every minute, this pause keeps the script roughly in step with the server

    time.sleep(60)
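
Both loops flatten the nested `totalStands` and `mainStands` columns with `pd.json_normalize` before writing each csv file. The sketch below shows that step on a single hand-made record. The record's values are made up, and its nesting is inferred from the flattened column names that appear in the outputs of Phase 2, so treat it as an assumption about the API response rather than a specification. The helper name `flatten_stands` is also hypothetical.

```python
import pandas as pd

# Illustrative record only: values are invented, shape is assumed.
records = [
    {
        "number": 1,
        "contractName": "dublin",
        "totalStands": {"availabilities": {"bikes": 3, "stands": 17}, "capacity": 20},
        "mainStands": {"availabilities": {"bikes": 3, "stands": 17}, "capacity": 20},
        "overflowStands": None,
    }
]
df = pd.DataFrame(records)


def flatten_stands(df):
    """Expand the nested stand columns into flat, prefixed columns."""
    total = pd.json_normalize(df["totalStands"].tolist()).add_prefix("totalStands_")
    main = pd.json_normalize(df["mainStands"].tolist()).add_prefix("mainStands_")
    # join() aligns on the index, so df is expected to have the default
    # 0..n-1 index (as it does after concatenating with ignore_index=True).
    return df.join(total).join(main).drop(
        columns=["totalStands", "mainStands", "overflowStands"])


flat = flatten_stands(df)
print(sorted(flat.columns))
```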

## Phase 2: Process raw data (this part can be run in the notebook)
In this phase we process each collection of csv files to make it suitable for data analysis.

In [5]:
# the very first step is to combine each 3-days csv file collection into a single dataframe

pre_christmas_bikes_df = pd.concat([pd.read_csv(f) for f in glob.glob("./data/jcd-data-prechristmas/" + "*.csv")],
                      ignore_index=True)

post_christmas_bikes_df = pd.concat([pd.read_csv(f) for f in glob.glob("./data/jcd-data-postchristmas/" + "*.csv")],
                      ignore_index=True)
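
`pd.concat` raises a `ValueError` when it is given an empty list, which is what happens here if a folder path is mistyped and `glob` matches nothing. A minimal sketch of the same combining step wrapped in a helper that fails with a clearer message and reads the files in sorted order. The name `load_csv_folder` is hypothetical.

```python
import glob
import os

import pandas as pd


def load_csv_folder(folder):
    """Read every csv file in `folder` and stack them into one dataframe."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not paths:
        raise FileNotFoundError(f"no csv files found in {folder!r}")
    return pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)


# Usage with the folders from this notebook:
# pre_christmas_bikes_df = load_csv_folder("./data/jcd-data-prechristmas/")
# post_christmas_bikes_df = load_csv_folder("./data/jcd-data-postchristmas/")
```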

In [7]:
# although we dropped duplicates within each 30-minute dataframe segment, duplicates reappear
# once the files are combined, because the same row can occur in more than one segment

pre_christmas_bikes_df[pre_christmas_bikes_df.duplicated(subset=["number", "contractName", "lastUpdate"], keep=False)]\
.sort_values("lastUpdate")

Unnamed: 0,number,contractName,name,address,position,banking,bonus,status,lastUpdate,connected,...,totalStands_availabilities.electricalBikes,totalStands_availabilities.electricalInternalBatteryBikes,totalStands_availabilities.electricalRemovableBatteryBikes,mainStands_capacity,mainStands_availabilities.bikes,mainStands_availabilities.stands,mainStands_availabilities.mechanicalBikes,mainStands_availabilities.electricalBikes,mainStands_availabilities.electricalInternalBatteryBikes,mainStands_availabilities.electricalRemovableBatteryBikes
284637,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
454535,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
192620,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
458422,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
614986,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
640472,138,bruxelles,138 - HOUFFALIZE,HOUFFALIZE - PLACE DE HOUFFALIZE /HOUFFALIZEPL...,"{'latitude': 50.865218, 'longitude': 4.377735}",False,False,OPEN,2021-12-22T22:40:49Z,True,...,6,0,6,25,14,11,8,6,0,6
639762,138,bruxelles,138 - HOUFFALIZE,HOUFFALIZE - PLACE DE HOUFFALIZE /HOUFFALIZEPL...,"{'latitude': 50.865218, 'longitude': 4.377735}",False,False,OPEN,2021-12-22T22:40:49Z,True,...,6,0,6,25,14,11,8,6,0,6
640249,11,dublin,EARLSFORT TERRACE,Earlsfort Terrace,"{'latitude': 53.334295, 'longitude': -6.258503}",False,False,OPEN,2021-12-22T22:40:49Z,True,...,3,0,3,30,6,24,3,3,0,3
639781,259,bruxelles,259 - NOVILLE,NOVILLE - SQ NOVILLE,"{'latitude': 50.859608, 'longitude': 4.33409}",False,False,OPEN,2021-12-22T22:40:50Z,True,...,8,0,8,20,9,11,1,8,0,8


In [8]:
post_christmas_bikes_df[post_christmas_bikes_df.duplicated(subset=["number", "contractName", "lastUpdate"], keep=False)]\
.sort_values("lastUpdate")

Unnamed: 0,number,contractName,name,address,position,banking,bonus,status,lastUpdate,connected,...,totalStands_availabilities.electricalBikes,totalStands_availabilities.electricalInternalBatteryBikes,totalStands_availabilities.electricalRemovableBatteryBikes,mainStands_capacity,mainStands_availabilities.bikes,mainStands_availabilities.stands,mainStands_availabilities.mechanicalBikes,mainStands_availabilities.electricalBikes,mainStands_availabilities.electricalInternalBatteryBikes,mainStands_availabilities.electricalRemovableBatteryBikes
514925,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
189461,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
453157,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
457650,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
193333,286,bruxelles,286 - BENS,BENS - CHEE D'ALSEMBERG DEVANT LE 548 / ALSEMB...,"{'latitude': 50.807811, 'longitude': 4.336789}",False,False,CLOSED,2021-09-30T08:37:53Z,False,...,0,0,0,20,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
585415,1023,lyon,1023 - CROIX ROUSSE / PERFETTI,16 BVD DE LA CROIX-ROUSSE,"{'latitude': 45.772229, 'longitude': 4.819389}",False,True,OPEN,2021-12-29T07:32:03Z,True,...,4,0,4,20,14,6,10,4,0,4
585310,7057,lyon,7057 - JAURES / BOLLIER,217b–221 Avenue Jean Jaurès 69007 Lyon,"{'latitude': 45.734712, 'longitude': 4.835486}",True,False,OPEN,2021-12-29T07:32:04Z,True,...,5,0,5,34,14,20,9,5,0,5
585787,7057,lyon,7057 - JAURES / BOLLIER,217b–221 Avenue Jean Jaurès 69007 Lyon,"{'latitude': 45.734712, 'longitude': 4.835486}",True,False,OPEN,2021-12-29T07:32:04Z,True,...,5,0,5,34,14,20,9,5,0,5
585259,7038,lyon,7038 - PLACE STALINGRAD,Angle Rue de la Guillotiere,"{'latitude': 45.749622, 'longitude': 4.85299}",False,False,OPEN,2021-12-29T07:32:05Z,True,...,2,0,2,13,10,3,8,2,0,2


In [9]:
# so now we drop the duplicates, using number, contractName, and lastUpdate as the subset. The combination
# of these 3 attributes uniquely defines each row

pre_christmas_bikes_df = pre_christmas_bikes_df.drop_duplicates(subset=["number", "contractName", "lastUpdate"])

post_christmas_bikes_df = post_christmas_bikes_df.drop_duplicates(subset=["number", "contractName", "lastUpdate"])
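
The two calls used in this phase behave differently: `duplicated(..., keep=False)` flags every copy of a repeated key, which is useful for inspection, while `drop_duplicates` keeps the first copy and removes the rest. A small sketch on three rows, with key values borrowed from the outputs above and all other columns omitted:

```python
import pandas as pd

key = ["number", "contractName", "lastUpdate"]

df = pd.DataFrame({
    "number": [286, 286, 138],
    "contractName": ["bruxelles", "bruxelles", "bruxelles"],
    "lastUpdate": [
        "2021-09-30T08:37:53Z",
        "2021-09-30T08:37:53Z",
        "2021-12-22T22:40:49Z",
    ],
})

# Every row whose key occurs more than once (both copies of station 286).
repeated = df[df.duplicated(subset=key, keep=False)]

# One row per key: the first copy of station 286 plus station 138.
deduped = df.drop_duplicates(subset=key)

print(repeated)
print(deduped)
```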


In [11]:
# finally we write the dataframes to csv files for importing in the analysis notebook

pre_christmas_bikes_df.to_csv("./data/jcd-pre-christmas-data.csv", index=False)

post_christmas_bikes_df.to_csv("./data/jcd-post-christmas-data.csv", index=False)