# Connecting to Melbourne Data

#### Welcome to the Tutorial

1.	Find the API endpoint on the web page of the data set you need.
2.	Sign up at https://evergreen.data.socrata.com/signup/ for a Socrata account. A token lets you call the data frequently (up to 1000 times per hour).
3.	Once signed up, register for a token. The guide at https://support.socrata.com/hc/en-us/articles/210138558-Generating-an-App-Token walks through the process.
4.	Sign in to the Melbourne Data Portal using your Socrata login information.
5.	Click My Profile > Edit Profile > Developer Settings.
6.	Create an app token (an API key is not needed for our purposes) to communicate directly with the Melbourne Data portal.

In [1]:
# Endpoint URL, found under the API button of the parking sensor data set
endpoint = "https://data.melbourne.vic.gov.au/resource/vh2v-4nfs.json"

In [None]:
#! pip install sodapy
#! pip install pandas

Install sodapy (and pandas) if you do not have them yet, then create a client. The timeout gives the whole request up to 120 seconds to finish.


In [119]:
from sodapy import Socrata

client = Socrata(
    "data.melbourne.vic.gov.au",
    "EC65cHicC3xqFXHHvAUICVXEr", # app token, just used to reduce throttling, not authentication
    timeout=120
)

Next, find the endpoint ID. It is the last part of the endpoint URL, the part that looks like random numbers and letters (here `vh2v-4nfs`, without the `.json` extension).

In [120]:
# set to large limit to include all available sensors
data = client.get("vh2v-4nfs", limit=200000)
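The endpoint ID passed to `client.get` can also be read off the endpoint URL programmatically. This is a minimal sketch, assuming the URL has the same shape as the one used in this tutorial; the helper name `dataset_id_from_endpoint` is made up for illustration and is not part of sodapy.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def dataset_id_from_endpoint(url):
    """Return the last path segment of an endpoint URL, without its extension."""
    path = urlparse(url).path        # e.g. "/resource/vh2v-4nfs.json"
    return PurePosixPath(path).stem  # drops the directory part and ".json"


endpoint = "https://data.melbourne.vic.gov.au/resource/vh2v-4nfs.json"
dataset_id = dataset_id_from_endpoint(endpoint)
print(dataset_id)
```

Deriving the ID from the URL keeps the two values from drifting apart if you later switch to a different data set.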

In [121]:
import pandas as pd
#Put the data into a dataframe
df = pd.DataFrame(data) 

In [122]:
len(df)

1080

In [123]:
#Check the data and see what it looks like
df.head()

| | bay_id | st_marker_id | status | location | lat | lon | :@computed_region_evbi_jbp8 |
|---|---|---|---|---|---|---|---|
| 0 | 5361 | 8903W | Unoccupied | {'latitude': '-37.82409822525305', 'longitude'... | -37.82409822525305 | 144.96211501714058 | 1 |
| 1 | 5548 | 12029W | Unoccupied | {'latitude': '-37.81292958901017', 'longitude'... | -37.81292958901017 | 144.98300681441947 | 1 |
| 2 | 5418 | 11787W | Unoccupied | {'latitude': '-37.81129498543278', 'longitude'... | -37.81129498543278 | 144.97542422097922 | 1 |
| 3 | 5358 | 8897W | Unoccupied | {'latitude': '-37.82450433070958', 'longitude'... | -37.82450433070958 | 144.96163789794954 | 1 |
| 4 | 1292 | 4484E | Unoccupied | {'latitude': '-37.8127273293113', 'longitude':... | -37.8127273293113 | 144.95416523342118 | 1 |


In [124]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   bay_id                       1080 non-null   object
 1   st_marker_id                 1080 non-null   object
 2   status                       1080 non-null   object
 3   location                     1080 non-null   object
 4   lat                          1080 non-null   object
 5   lon                          1080 non-null   object
 6   :@computed_region_evbi_jbp8  1080 non-null   object
dtypes: object(7)
memory usage: 59.2+ KB


For this ETL we only need 'bay_id' and 'status', so we drop the other columns. We will then add 'hour', 'minute', 'dayofweek' and 'date'.

In [125]:
df.drop(columns = ['st_marker_id', 'location', 'lat', 'lon', ':@computed_region_evbi_jbp8'], inplace = True)

In [126]:
df

| | bay_id | status |
|---|---|---|
| 0 | 5361 | Unoccupied |
| 1 | 5548 | Unoccupied |
| 2 | 5418 | Unoccupied |
| 3 | 5358 | Unoccupied |
| 4 | 1292 | Unoccupied |
| ... | ... | ... |
| 1075 | 3719 | Unoccupied |
| 1076 | 3019 | Present |
| 1077 | 1471 | Present |
| 1078 | 3727 | Unoccupied |


Now add 'hour', 'minute', 'dayofweek' and 'date'.


In [127]:
import datetime

To do this we use `strftime` format codes from the datetime module: %M for the minute, %H for the hour (24-hour clock) and %A for the day of the week.

In [128]:
time = datetime.datetime.now().strftime
print(time("%M"))
print(time("%H"))
print(time("%A"))
date = datetime.date.today()
print(date)

02
20
Saturday
2021-08-07
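The format codes can be checked against a known timestamp. The sketch below uses a fixed `datetime` (an assumption made only so the result is reproducible) and derives all four fields from that single value, which avoids the small risk of the date and the time coming from two different clock readings.

```python
import datetime

# Fixed timestamp, assumed for illustration; the ETL itself would use
# datetime.datetime.now() here.
now = datetime.datetime(2021, 8, 7, 20, 2)

stamp = {
    "hour": now.strftime("%H"),       # zero-padded hour, 24-hour clock
    "minute": now.strftime("%M"),     # zero-padded minute
    "dayofweek": now.strftime("%A"),  # full weekday name (depends on locale)
    "date": now.date(),               # calendar date of the same timestamp
}
print(stamp)
```

Note that `strftime` returns strings, so 'hour' and 'minute' are stored as text such as "02", not as integers.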


In [129]:
# Add this all together, taking every field from a single timestamp

now = datetime.datetime.now()
df['hour'] = now.strftime("%H")
df['minute'] = now.strftime("%M")
df['dayofweek'] = now.strftime("%A")
df['date'] = now.date()
df

| | bay_id | status | hour | minute | dayofweek | date |
|---|---|---|---|---|---|---|
| 0 | 5361 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| 1 | 5548 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| 2 | 5418 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| 3 | 5358 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| 4 | 1292 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| ... | ... | ... | ... | ... | ... | ... |
| 1075 | 3719 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| 1076 | 3019 | Present | 20 | 02 | Saturday | 2021-08-07 |
| 1077 | 1471 | Present | 20 | 02 | Saturday | 2021-08-07 |
| 1078 | 3727 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |


In this notebook, unlike the Python file, I will show what the ETL does on the server end.

In [170]:
# You may want to change the values. See how df changes in length in the next cell.

i = 1
while i <= 2:
    # Fetch a fresh snapshot and keep only 'bay_id' and 'status'
    df1 = pd.DataFrame(client.get("vh2v-4nfs", limit=200000))
    df1.drop(columns = ['st_marker_id', 'location', 'lat', 'lon', ':@computed_region_evbi_jbp8'], inplace = True)
    # Stamp every row with the time of this call
    now = datetime.datetime.now()
    df1['hour'] = now.strftime("%H")
    df1['minute'] = now.strftime("%M")
    df1['dayofweek'] = now.strftime("%A")
    df1['date'] = now.date()
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df1])
    i += 1

In [171]:
df

| | bay_id | status | hour | minute | dayofweek | date |
|---|---|---|---|---|---|---|
| 0 | 5361 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| 1 | 5548 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| 2 | 5418 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| 3 | 5358 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| 4 | 1292 | Unoccupied | 20 | 02 | Saturday | 2021-08-07 |
| ... | ... | ... | ... | ... | ... | ... |
| 1075 | 3719 | Unoccupied | 20 | 13 | Saturday | 2021-08-07 |
| 1076 | 3019 | Present | 20 | 13 | Saturday | 2021-08-07 |
| 1077 | 1471 | Unoccupied | 20 | 13 | Saturday | 2021-08-07 |
| 1078 | 3727 | Unoccupied | 20 | 13 | Saturday | 2021-08-07 |
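Growing a DataFrame inside a loop copies all earlier rows on every pass. One possible alternative, sketched here, is to collect each snapshot in a list and concatenate once at the end. The function `fake_fetch` and its two rows are made-up stand-ins for `client.get(...)` so that the example runs without a network connection; `take_snapshot` is likewise a hypothetical helper, not part of the tutorial's ETL.

```python
import datetime

import pandas as pd


def fake_fetch():
    """Hypothetical stand-in for client.get(...): returns a list of row dicts."""
    return [
        {"bay_id": "5361", "status": "Unoccupied", "st_marker_id": "8903W"},
        {"bay_id": "3019", "status": "Present", "st_marker_id": "0000X"},  # made-up marker
    ]


def take_snapshot(fetch):
    """Fetch one snapshot, keep the two needed columns, and stamp it with the time."""
    now = datetime.datetime.now()
    snap = pd.DataFrame(fetch())[["bay_id", "status"]].copy()
    snap["hour"] = now.strftime("%H")
    snap["minute"] = now.strftime("%M")
    snap["dayofweek"] = now.strftime("%A")
    snap["date"] = now.date()
    return snap


# Collect the snapshots first, then build the combined frame in one step
snapshots = [take_snapshot(fake_fetch) for _ in range(3)]
combined = pd.concat(snapshots, ignore_index=True)
print(len(combined))
```

With `ignore_index=True` the combined frame gets a fresh 0..n-1 index instead of repeating each snapshot's own row numbers.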


Now I will add the sleep function. First, let's see what sleep does before adding it to the loop.

In [174]:
import time

print("Printed immediately.")
time.sleep(2*5)
print("Printed after 10 seconds.")

Printed immediately.
Printed after 10 seconds.


Okay, let's run this to construct a dataset. Be careful: this cell is meant to call the data many times, so it will run for a long time. Feel free to skip it and check out the next cell for how to save.

In [None]:
# BE CAREFUL WITH THIS CELL. PLEASE CHANGE VALUES OF i AND time.sleep(). This is set to run for an hour at 5 minute intervals
import time

i = 1
while i <= 12:
    df1 = pd.DataFrame(client.get("vh2v-4nfs", limit=200000))
    df1.drop(columns = ['st_marker_id', 'location', 'lat', 'lon', ':@computed_region_evbi_jbp8'], inplace = True)
    # Use a name other than `time` so the time module is not shadowed
    now = datetime.datetime.now()
    df1['hour'] = now.strftime("%H")
    df1['minute'] = now.strftime("%M")
    df1['dayofweek'] = now.strftime("%A")
    df1['date'] = now.date()
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df1])
    time.sleep(60*5)  # wait 5 minutes before the next call
    i += 1

In [None]:
df.to_csv('OneHourData.csv')
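The loop bound and the sleep interval together determine how long a collection run lasts. The helper below is a small sketch of that arithmetic; the name `polls_needed` is made up for illustration.

```python
def polls_needed(total_minutes, interval_minutes):
    """Number of loop iterations needed to cover total_minutes at the given interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return total_minutes // interval_minutes


one_hour = polls_needed(60, 5)        # an hour at 5 minute intervals
six_hours = polls_needed(6 * 60, 15)  # six hours at 15 minute intervals
print(one_hour, six_hours)  # prints: 12 24
```

Because the loop also sleeps after its final call, the cell keeps running for one extra interval after the last snapshot has been taken.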

In [None]:
# BE CAREFUL WITH THIS CELL. PLEASE CHANGE VALUES OF i AND time.sleep(). This is set to run for 6 hours at 15 minute intervals
import time

i = 1
while i <= 24:
    df1 = pd.DataFrame(client.get("vh2v-4nfs", limit=200000))
    df1.drop(columns = ['st_marker_id', 'location', 'lat', 'lon', ':@computed_region_evbi_jbp8'], inplace = True)
    # Use a name other than `time` so the time module is not shadowed
    now = datetime.datetime.now()
    df1['hour'] = now.strftime("%H")
    df1['minute'] = now.strftime("%M")
    df1['dayofweek'] = now.strftime("%A")
    df1['date'] = now.date()
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df1])
    time.sleep(60*15)  # wait 15 minutes before the next call
    i += 1

In [175]:
df.to_csv('SixHoursData.csv')

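By default `to_csv` also writes the DataFrame index, which comes back as an extra unnamed column when the file is read again. The sketch below compares the default with `index=False`, using an in-memory buffer and two made-up rows instead of the real data; whether to drop the index in the ETL itself is a judgment call, not something this tutorial prescribes.

```python
import io

import pandas as pd

# Two made-up rows in the shape of the tutorial's DataFrame
df_demo = pd.DataFrame({"bay_id": ["5361", "3019"], "status": ["Unoccupied", "Present"]})

# Default behaviour: the index is written as a first column with no header
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)

# index=False leaves the index out of the file
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```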