<div class="usecase-title">{Bins for events at Argyle Square}</div>

<div class="usecase-authors"><b>Authored by: </b> {Alison Collins}</div>

<div class="usecase-duration"><b>Duration:</b> {60} mins</div>

<div class="usecase-level-skill">
    <div class="usecase-level"><b>Level: </b>{Intermediate}</div>
    <div class="usecase-skill"><b>Pre-requisite Skills: </b>{Python}</div>
</div>

<div class="usecase-section-header">Scenario</div>

{If you are planning an event at Argyle Square, you will need to know whether to hire additional bins. This use case estimates bin capacity during events and recommends whether more bins are needed based on event attendee numbers.}

<div class="usecase-section-header">What this use case will teach you</div>

At the end of this use case you will:
- {list the skills demonstrated in your use case}

<div class="usecase-section-header">{Heading for introduction or background relating to problem}</div>

{Write your introduction here. Keep it concise. We're not after "War and Peace" but enough background information to inform the reader on the rationale for solving this problem or background non-technical information that helps explain the approach. You may also wish to give information on the datasets, particularly how to source those not being imported from the client's open data portal.}



In [1]:
# Dependencies
import warnings
warnings.filterwarnings("ignore")

# Import required modules
import requests
import numpy as np
import pandas as pd
#pd.set_option('display.max_columns', None)

Import datasets using API

In [2]:
from io import StringIO

# Function to collect a dataset from the City of Melbourne open data API
def datasetcollect(dataset_id):
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    #apikey = " "
    export_format = 'csv'  # avoids shadowing the built-in format()

    url = f'{base_url}{dataset_id}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC',
        #'api_key': apikey
    }

    # GET request
    response = requests.get(url, params=params)

    if response.status_code == 200:
        # Decode the response and read the semicolon-delimited CSV via StringIO
        url_content = response.content.decode('utf-8')
        dataset = pd.read_csv(StringIO(url_content), delimiter=';')
        return dataset
    else:
        # Return None explicitly so callers can detect a failed request
        print(f'Request failed with status code {response.status_code}')
        return None
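
The URL construction and the semicolon-delimited parsing used in `datasetcollect` can be checked without a network call. The following is a minimal sketch only: `build_export_url`, `parse_export` and the two sample rows (with made-up device ids) are hypothetical helpers introduced for illustration and are not part of the notebook.

```python
from io import StringIO

import pandas as pd

BASE_URL = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'


def build_export_url(dataset_id, export_format='csv'):
    """Build the export URL in the same way datasetcollect does."""
    return f'{BASE_URL}{dataset_id}/exports/{export_format}'


def parse_export(text):
    """Parse semicolon-delimited CSV text into a dataframe."""
    return pd.read_csv(StringIO(text), delimiter=';')


# Illustrative sample only: two made-up rows in the semicolon-delimited layout
sample_text = (
    "dev_id;time;filllevel\n"
    "bin-0001;2023-02-26T08:16:47+00:00;73.0\n"
    "bin-0002;2023-02-26T08:18:10+00:00;74.0\n"
)

example_url = build_export_url('netvox-r718x-bin-sensor')
sample = parse_export(sample_text)
print(example_url)
print(sample)
```

Separating the URL builder from the parser makes each step easy to inspect before running the full download.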

In [3]:
# Import stage activity dataset
dataset_id = 'meshed-sensor-type-3'
stage_activity_all = datasetcollect(dataset_id)
print(len(stage_activity_all))
stage_activity_all.head(3)

350920


Unnamed: 0,dev_id,sensor_name,time,temperature,humidity,light,motion,visit,vdd,lat_long
0,ers-55eb,,2022-12-13T20:09:42+00:00,10.5,89,297,0,0,3638,
1,ers-55ea,,2022-12-13T20:26:03+00:00,10.6,88,136,0,0,3635,
2,ers-55eb,,2022-12-13T20:34:47+00:00,10.8,89,698,0,0,3638,


In [4]:
# Import bin sensor dataset
dataset_id = 'netvox-r718x-bin-sensor'
bin_sensor_all = datasetcollect(dataset_id)
print(len(bin_sensor_all))
bin_sensor_all.head(3)

561783


Unnamed: 0,dev_id,time,temperature,distance,filllevel,battery,lat_long,sensor_name,fill_level
0,r718x-6778,2023-02-26T08:16:47+00:00,19.0,209.0,73.0,3.6,"-37.8025943, 144.9658434",r718x-bin sensor 8,71.0
1,r718x-6f16,2023-02-26T08:18:10+00:00,19.9,202.0,74.0,3.6,"-37.8028794, 144.9662728",r718x-bin sensor 17,72.0
2,r718x-677d,2023-02-26T08:18:02+00:00,20.7,200.0,74.0,3.6,"-37.8021051, 144.9654523",r718x-bin sensor 11,72.0


In [5]:
# Import blix mobile phone counter dataset
dataset_id = 'blix-visits'
blix_phones_all = datasetcollect(dataset_id)
print(len(blix_phones_all))
blix_phones_all.head(3)

109175


Unnamed: 0,datetime,keys1,total,dwell,sensor_name,sensor_type,lat_long,avg_dwell
0,2022-08-25T22:00:00+00:00,8171,27,5697,Pedestrian Sensor-Birrarung Marr,Mobile phone counting,"-37.8209898, 144.9759397",3.0
1,2022-08-26T08:00:00+00:00,8171,115,42090,Pedestrian Sensor-Birrarung Marr,Mobile phone counting,"-37.8209898, 144.9759397",6.0
2,2022-08-26T02:00:00+00:00,7780,228,94848,Pedestrian Sensor-Argyle Sq,Mobile phone counting,"-37.8025805, 144.9656012",6.0


Preprocessing of datasets

In [156]:
# Keep only the columns needed from each dataset

# Select columns from the stage activity dataframe
stage_activity = stage_activity_all[['dev_id','time','motion','visit']].copy()
# Select columns from the bin sensor dataframe
bin_sensor_cols = bin_sensor_all[['dev_id','time','filllevel']].copy()
# Select columns from the blix phones dataframe
blix_phones = blix_phones_all[['datetime','total','dwell','avg_dwell']].copy()

In [157]:
# check data types in columns

print("Data types in Stage activity")
print(stage_activity.dtypes)

print("Data types in Bin Sensor")
print(bin_sensor_cols.dtypes)

print("Data types in Blix Phones")
print(blix_phones.dtypes)

Data types in Stage activity
dev_id    object
time      object
motion     int64
visit      int64
dtype: object
Data types in Bin Sensor
dev_id        object
time          object
filllevel    float64
dtype: object
Data types in Blix Phones
datetime      object
total          int64
dwell          int64
avg_dwell    float64
dtype: object


In [158]:
# convert date time columns to date time type

stage_activity['date_time'] = pd.to_datetime(stage_activity['time'])
stage_activity = stage_activity.drop(['time'], axis=1)

bin_sensor_cols['date_time'] = pd.to_datetime(bin_sensor_cols['time'])
bin_sensor_cols = bin_sensor_cols.drop(['time'], axis=1)

blix_phones['date_time'] = pd.to_datetime(blix_phones['datetime'])
blix_phones = blix_phones.drop(['datetime'], axis=1)

In [159]:
#Check oldest and most recent dates in datasets

print("Date range in stage activity")
print(stage_activity["date_time"].min())
print(stage_activity["date_time"].max())

print("Date range in bin sensor")
print(bin_sensor_cols["date_time"].min())
print(bin_sensor_cols["date_time"].max())

print("Date range in blix phones")
print(blix_phones["date_time"].min())
print(blix_phones["date_time"].max())

Date range in stage activity
2022-11-29 06:05:16+00:00
2024-03-27 07:28:36+00:00
Date range in bin sensor
2023-02-26 08:16:37+00:00
2024-03-27 07:29:17+00:00
Date range in blix phones
2021-12-31 13:00:00+00:00
2024-03-26 12:00:00+00:00


In [160]:
# Drop rows so that all datasets have the same date range

stage_activity = stage_activity[(stage_activity['date_time'] > '2023-02-26') & (stage_activity['date_time'] <= '2024-03-26')]

bin_sensor = bin_sensor_cols[(bin_sensor_cols['date_time'] > '2023-02-26') & (bin_sensor_cols['date_time'] <= '2024-03-26')]

blix_phones = blix_phones[(blix_phones['date_time'] > '2023-02-26') & (blix_phones['date_time'] <= '2024-03-26')]
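
As an alternative to hard-coding the boundary dates, the overlapping date range could be derived from the data itself. This is a sketch under assumptions: `common_range` and `clip_to_range` are hypothetical helper names, the two small dataframes are made-up data, and the bounds here are inclusive at both ends, which differs slightly from the strict `>` used in the cell above.

```python
import pandas as pd


def common_range(frames, col='date_time'):
    """Return the latest start and earliest end across all dataframes."""
    start = max(f[col].min() for f in frames)
    end = min(f[col].max() for f in frames)
    return start, end


def clip_to_range(frame, start, end, col='date_time'):
    """Keep only rows whose timestamp falls inside [start, end]."""
    mask = (frame[col] >= start) & (frame[col] <= end)
    return frame.loc[mask].copy()


# Made-up, timezone-aware data for illustration
a = pd.DataFrame({
    'date_time': pd.to_datetime(['2023-01-01', '2023-03-01', '2023-05-01'], utc=True),
    'value': [1, 2, 3],
})
b = pd.DataFrame({
    'date_time': pd.to_datetime(['2023-02-01', '2023-03-15', '2023-04-01'], utc=True),
    'value': [4, 5, 6],
})

start, end = common_range([a, b])
a_clipped = clip_to_range(a, start, end)
b_clipped = clip_to_range(b, start, end)
print(start, end, len(a_clipped), len(b_clipped))
```

Deriving the bounds this way keeps the filter valid when the datasets are refreshed and their date ranges change.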


In [161]:
# BIN DATASET PREPROCESSING
# Filter unwanted values from bin dataset

# Keep only rows with bin sensors in the stage area
# Note: this filters bin_sensor_cols rather than the date-restricted bin_sensor,
# so readings outside the common date range are retained
filtered_bin_sensor = bin_sensor_cols[bin_sensor_cols["dev_id"].isin(["r718x-6778", "r718x-6775","r718x-6f25","r718x-677e","r718x-6f31"])]
filtered_bin_sensor.head(3)

# Check max and min values in bin fill levels
# Max and min of filllevel column
print(filtered_bin_sensor['filllevel'].agg(['min', 'max']))

# Count the number of values greater than 100 in the bin fill column
more = len(filtered_bin_sensor[filtered_bin_sensor['filllevel']>100])

# Find the fraction of rows with a fill level > 100
# Count the number of rows in the dataframe
total = len(filtered_bin_sensor)
# Check the length of the dataframe
print(len(filtered_bin_sensor))
# Calculate the fraction of rows with values greater than 100
print(more/total)

# As only 0.0456% of the data is affected by these out-of-range readings, drop these rows from the table.

# Drop rows where bin fill column is greater than 100
filtered_bin_sensor = filtered_bin_sensor.drop(filtered_bin_sensor[filtered_bin_sensor['filllevel'] > 100].index)
# Check the length of the dataframe
len(filtered_bin_sensor)


min      0.0
max    255.0
Name: filllevel, dtype: float64
133765
0.0004560236235188577


133704
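
The same out-of-range check can be written with a single boolean mask, which gives both the affected fraction and the cleaned dataframe. This is a small sketch on made-up readings, assuming (as the notebook does) that fill levels above 100 are sensor errors; the names `readings`, `invalid` and `cleaned` are illustrative only.

```python
import pandas as pd

# Made-up readings for illustration; 255.0 and 101.0 are out of range
readings = pd.DataFrame({
    'dev_id': ['bin-a', 'bin-a', 'bin-b', 'bin-b', 'bin-b'],
    'filllevel': [10.0, 55.0, 255.0, 100.0, 101.0],
})

# Boolean mask marking rows with an implausible fill level
invalid = readings['filllevel'] > 100

# The mean of a boolean mask is the fraction of True values
fraction_invalid = invalid.mean()

# Keep only the rows that are not flagged
cleaned = readings.loc[~invalid].copy()

print(f'{fraction_invalid:.2%} of rows dropped, {len(cleaned)} rows kept')
```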

In [137]:
# BIN DATASET PREPROCESSING: GROUP BY BIN THEN RESAMPLE

# Set index to datetime column
filtered_bin_sensor.set_index('date_time', inplace=True)

# Resample each bin's readings into 30-minute intervals, keeping the maximum per interval
grouped_bin_sensor = filtered_bin_sensor.groupby('dev_id').resample('30min').max()
grouped_bin_sensor

dev_id,date_time,dev_id,filllevel
r718x-6775,2023-02-26 08:00:00+00:00,r718x-6775,61.0
r718x-6775,2023-02-26 08:30:00+00:00,r718x-6775,61.0
r718x-6775,2023-02-26 09:00:00+00:00,r718x-6775,63.0
r718x-6775,2023-02-26 09:30:00+00:00,r718x-6775,62.0
r718x-6775,2023-02-26 10:00:00+00:00,r718x-6775,52.0
...,...,...,...
r718x-6f31,2023-10-09 17:00:00+00:00,r718x-6f31,71.0
r718x-6f31,2023-10-09 17:30:00+00:00,r718x-6f31,71.0
r718x-6f31,2023-10-09 18:00:00+00:00,r718x-6f31,74.0
r718x-6f31,2023-10-09 18:30:00+00:00,r718x-6f31,74.0
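
In the grouped output above, `dev_id` appears both as an index level and as a column, because `max()` is applied to every column. One possible way to avoid the duplicate is to select `filllevel` before resampling. The sketch below uses made-up readings and hypothetical bin names.

```python
import pandas as pd

# Made-up readings for two bins, for illustration only
readings = pd.DataFrame({
    'dev_id': ['bin-a', 'bin-a', 'bin-a', 'bin-b', 'bin-b'],
    'date_time': pd.to_datetime([
        '2023-02-26 08:05', '2023-02-26 08:20', '2023-02-26 08:40',
        '2023-02-26 08:10', '2023-02-26 08:50',
    ], utc=True),
    'filllevel': [61.0, 63.0, 62.0, 70.0, 74.0],
})

# Group by bin, select the fill level only, then take the 30-minute maximum
per_bin_max = (
    readings
    .set_index('date_time')
    .groupby('dev_id')['filllevel']
    .resample('30min')
    .max()
)
print(per_bin_max)
```

The result is a Series indexed by `(dev_id, date_time)` with one fill level value per bin and interval.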


In [162]:
# BIN DATASET PREPROCESSING: RESAMPLE WITHOUT GROUPING BY BIN

# Set index to datetime column
filtered_bin_sensor.set_index('date_time', inplace=True)

# Resample the data into 30-minute intervals, keeping the maximum fill level
grouped_bin_sensor1 = filtered_bin_sensor.resample('30min').filllevel.max()
grouped_bin_sensor1

date_time
2023-02-26 08:00:00+00:00    74.0
2023-02-26 08:30:00+00:00    74.0
2023-02-26 09:00:00+00:00    74.0
2023-02-26 09:30:00+00:00    74.0
2023-02-26 10:00:00+00:00    74.0
                             ... 
2024-03-27 05:00:00+00:00    82.0
2024-03-27 05:30:00+00:00    81.0
2024-03-27 06:00:00+00:00    81.0
2024-03-27 06:30:00+00:00    83.0
2024-03-27 07:00:00+00:00    83.0
Freq: 30T, Name: filllevel, Length: 18959, dtype: float64

In [163]:
# STAGE ACTIVITY DATASET PREPROCESSING: RESAMPLE WITHOUT GROUPING BY SENSOR

# Set index to datetime column
stage_a = stage_activity
stage_a.set_index('date_time', inplace=True)

# Resample the data into 30-minute intervals, keeping the maximum motion value
grouped_stage_activity = stage_a.resample('30min').motion.max()
grouped_stage_activity

date_time
2023-02-26 00:00:00+00:00    0.0
2023-02-26 00:30:00+00:00    0.0
2023-02-26 01:00:00+00:00    0.0
2023-02-26 01:30:00+00:00    1.0
2023-02-26 02:00:00+00:00    0.0
                            ... 
2024-03-25 21:30:00+00:00    1.0
2024-03-25 22:00:00+00:00    0.0
2024-03-25 22:30:00+00:00    0.0
2024-03-25 23:00:00+00:00    0.0
2024-03-25 23:30:00+00:00    0.0
Freq: 30T, Name: motion, Length: 18912, dtype: float64

In [165]:
merged_dataframe_A = pd.merge_asof(grouped_bin_sensor1, stage_a, on="date_time",tolerance=pd.Timedelta("2ms"))
merged_dataframe_A

ValueError: right keys must be sorted
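
`pd.merge_asof` requires both inputs to be sorted by the merge key, which is what the error above reports for the right-hand dataframe. A minimal sketch of one way to resolve it is shown below on made-up data; the 10-minute tolerance is an assumption chosen for illustration, and a tolerance as tight as 2 ms would likely match very few raw sensor readings.

```python
import pandas as pd

# Left: regular 30-minute fill levels (made-up values)
left = pd.DataFrame({
    'date_time': pd.to_datetime(
        ['2023-02-26 08:00', '2023-02-26 08:30', '2023-02-26 09:00'], utc=True),
    'filllevel': [74.0, 74.0, 75.0],
})

# Right: irregular, unsorted motion readings with date_time as the index
right = pd.DataFrame({
    'date_time': pd.to_datetime(
        ['2023-02-26 08:58', '2023-02-26 07:59', '2023-02-26 08:25'], utc=True),
    'motion': [1, 0, 2],
}).set_index('date_time')

# Sort by time and move date_time back to a column before merging
right_sorted = right.sort_index().reset_index()

# For each left row, take the most recent right row within the tolerance
merged = pd.merge_asof(
    left, right_sorted, on='date_time', tolerance=pd.Timedelta('10min'))
print(merged)
```

With the default backward direction, each 30-minute row picks up the latest motion reading taken at or before it, provided that reading is no more than 10 minutes old.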

In [None]:
# Next steps
# - Learn time series techniques
# - Decide how to bring all datasets onto the same time interval (10 minutes? half hour? hour?)
# - Or possibly offer the interval as a user-interactive option
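
One possible way to bring the datasets onto the same time interval is to resample each series to a shared interval and then join them on the resulting index. The sketch below is an assumption about how this could be done, not the notebook's final method: `align_series` is a hypothetical helper, the data is made up, and the `interval` parameter leaves the choice of 10 minutes, half an hour or an hour open.

```python
import pandas as pd


def align_series(series_list, interval='30min'):
    """Resample each series to the same interval (max per interval) and join them."""
    resampled = [s.resample(interval).max() for s in series_list]
    return pd.concat(resampled, axis=1)


# Made-up irregular readings for illustration
bins = pd.Series(
    [60.0, 62.0, 65.0],
    index=pd.to_datetime(
        ['2023-02-26 08:05', '2023-02-26 08:20', '2023-02-26 08:40'], utc=True),
    name='filllevel',
)
motion = pd.Series(
    [0, 3, 1],
    index=pd.to_datetime(
        ['2023-02-26 08:10', '2023-02-26 08:35', '2023-02-26 08:55'], utc=True),
    name='motion',
)

aligned = align_series([bins, motion], interval='30min')
print(aligned)
```

Calling `align_series([bins, motion], interval='1h')` would produce hourly rows instead, so the interval can be exposed as a user choice.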




