# Bicycle data Melbourne analysis

In [2]:
import pandas as pd
import numpy as np
import os
from collections import Counter
from tqdm.notebook import tqdm

## Data validation 

The bicycle data is delivered as a large number of CSV files, without an accompanying data dictionary.  To be sure that all CSVs share the same columns, so that they can be appended to a master file, we iterate over them and tally the unique column combinations.  Ideally, the result will be a single combination of columns whose tally equals the number of CSV files.  Let's see!

In [2]:
#rootdir ='../data'
#tally = pd.DataFrame(columns=["count","len"])
#counter = 0
#for subdir, dirs, files in tqdm(os.walk(rootdir)):
#    for file in files:
#        if ('.csv' in file) and ('.zip' not in file):
#            # read in CSV, if it contains records (which at least one doesn't!)
#            file_path = os.path.join(subdir,file)
#            if os.path.getsize(file_path) > 0:
#                df = pd.read_csv(os.path.join(subdir,file))
#                # store list of columns in variable df_columns as a string
#                df_columns = f"{df.columns.to_list()}"
#                # if CSV columns string is in the tally index, increment this
#                if df_columns in tally.index:
#                    tally[tally.index==df_columns] += 1
#                # otherwise add CSV columns string to the tally index
#                else:
#                    tally.loc[df_columns] = 1
#                # increment a counter; although in theory this should equal the sum of the tallies!
#                counter+=1
#
#print(counter)

In [8]:
tally

columns,count
"['DATA_TYPE', 'TIS_DATA_REQUEST', 'SITE_XN_ROUTE', 'LOC_LEG', 'DATE', 'TIME', 'CLASS', 'LANE', 'SPEED', 'WHEELBASE', 'HEADWAY', 'GAP', 'AXLE', 'AXLE_GROUPING', 'RHO', 'VEHICLE', 'DIRECTION']",13229
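The notebook imports `Counter` but the tally cell above builds a dataframe by hand. As a lighter-weight variant, the sketch below tallies column layouts with a `Counter` and reads only each file's header row (`nrows=0`), so no records are loaded. The helper name `tally_column_sets` and the demo files (names, columns and values) are made up for illustration; this is one possible approach, not the cell that produced the output above.

```python
import os
import tempfile
from collections import Counter

import pandas as pd


def tally_column_sets(rootdir):
    """Count how many non-empty CSVs under rootdir share each column layout."""
    tally = Counter()
    for subdir, _dirs, files in os.walk(rootdir):
        for file in files:
            file_path = os.path.join(subdir, file)
            # skip non-CSV files and empty files, which read_csv cannot parse
            if file.endswith(".csv") and os.path.getsize(file_path) > 0:
                # nrows=0 parses only the header row
                columns = pd.read_csv(file_path, nrows=0).columns
                tally[tuple(columns)] += 1
    return tally


# Hypothetical demo data: two files share a layout, one differs, one is empty
demo_dir = tempfile.mkdtemp()
demo_files = {
    "a.csv": "DATE,TIME,SPEED\n2017-01-01,08:00:00,21.9\n",
    "b.csv": "DATE,TIME,SPEED\n2017-01-02,09:00:00,17.4\n",
    "c.csv": "DATE,SPEED\n2017-01-03,26.4\n",
    "empty.csv": "",
}
for name, text in demo_files.items():
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(text)

tally = tally_column_sets(demo_dir)
for columns, count in tally.most_common():
    print(count, list(columns))
```

If every file shares one layout, the `Counter` holds a single key whose count equals the number of non-empty CSVs.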


## Start Dask dashboard
The bicycle data comprises more than 16,000 CSV files. This is a large, multi-gigabyte dataset, so rather than Pandas, which will struggle to load and manipulate the data, [Dask](https://dask.org) could be a good option for parallel processing.

Use of the dashboard is optional, though, so it is commented out for now.

In [1]:
# from dask.distributed import Client
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
# client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://192.168.1.4:8787/status

Dashboard: http://192.168.1.4:8787/status
Workers: 1
Total threads: 4
Total memory: 1.86 GiB
Status: running
Using processes: False

Comm: inproc://192.168.1.4/12364/1
Workers: 1
Dashboard: http://192.168.1.4:8787/status
Total threads: 4
Started: Just now
Total memory: 1.86 GiB

Comm: inproc://192.168.1.4/12364/4
Total threads: 4
Dashboard: http://192.168.1.4:55991/status
Memory: 1.86 GiB
Nanny: None
Local directory: G:\My Drive\DPC\Bicycle\bicycle_data_melbourne\dask-worker-space\worker-j8b9akch


## Creating a master dataframe

Now that we know that all of the CSVs share the same column names, we concatenate the files into a master dataframe to run the analysis on. The cells below compile the first entry in `data_years` (2017).

In [3]:
data_years = ['Bicycle_Volume_Speed_2017',
'Bicycle_Volume_Speed_2018',
'Bicycle_Volume_Speed_2019',
'Bicycle_Volume_Speed_2020',
'Bicycle_Volume_Speed_2021']

rootdir = f'../data/{data_years[0]}'
csv_files = []

for subdir, dirs, files in tqdm(os.walk(rootdir),desc="Getting CSV file paths...",unit="CSVs"):
    for file in files:
        if ('.csv' in file) and ('.zip' not in file):
            # record the file paths of non-empty CSVs only
            file_path = os.path.join(subdir,file)
            if os.path.getsize(file_path) > 0:
                csv_files.append(file_path)

print(f"Identified the locations of {len(csv_files)} valid CSV files to compile!")


Identified the locations of 1430 valid CSV files to compile!


In [18]:
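# read each CSV into its own dataframe
frames = []
for csv_path in tqdm(csv_files,desc=f"Reading csv files for {rootdir}...",unit="CSVs"):
    frames.append(pd.read_csv(csv_path, index_col=None, header=0))

# stack all files row-wise into the master dataframe, with a fresh 0..n-1 index
dfs = pd.concat(frames, axis=0, ignore_index=True)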
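The cell above handles a single year. One way to extend the same pattern to several year folders is sketched below. The helpers `find_csv_files` and `load_year`, the `SOURCE_YEAR` column and the demo files are all hypothetical and introduced only for illustration; the notebook itself compiles `data_years[0]` only.

```python
import os
import tempfile

import pandas as pd


def find_csv_files(rootdir):
    """Return sorted paths of all non-empty .csv files under rootdir."""
    paths = []
    for subdir, _dirs, files in os.walk(rootdir):
        for file in files:
            path = os.path.join(subdir, file)
            if file.endswith(".csv") and os.path.getsize(path) > 0:
                paths.append(path)
    return sorted(paths)


def load_year(rootdir):
    """Concatenate every CSV under rootdir into one dataframe.

    pd.concat raises ValueError if no files are found.
    """
    frames = [pd.read_csv(path, index_col=None, header=0)
              for path in find_csv_files(rootdir)]
    return pd.concat(frames, axis=0, ignore_index=True)


# Hypothetical demo data: two year folders with one small CSV each
demo_root = tempfile.mkdtemp()
demo_years = ["Bicycle_Volume_Speed_2017", "Bicycle_Volume_Speed_2018"]
for year, speed in zip(demo_years, [20.0, 25.0]):
    os.makedirs(os.path.join(demo_root, year))
    pd.DataFrame({"SPEED": [speed, speed + 1]}).to_csv(
        os.path.join(demo_root, year, "site.csv"), index=False)
# an empty file, which find_csv_files skips
open(os.path.join(demo_root, demo_years[0], "empty.csv"), "w").close()

# load each year, tag its rows with the folder name, then stack the years
master = pd.concat(
    [load_year(os.path.join(demo_root, year)).assign(SOURCE_YEAR=year)
     for year in demo_years],
    ignore_index=True,
)
print(master)
```

Tagging rows with their source folder keeps the year recoverable after concatenation, which may help when comparing years later.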


## Summary statistics

Let's look at the summary statistics for the master dataframe! First we'll look at what type of data is included in each column. Then we can check the max/min values, and the distribution of the data, to see if there are any outlying data points.

In [21]:
dfs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9989156 entries, 0 to 9989155
Data columns (total 17 columns):
 #   Column            Dtype  
---  ------            -----  
 0   DATA_TYPE         object 
 1   TIS_DATA_REQUEST  object 
 2   SITE_XN_ROUTE     object 
 3   LOC_LEG           object 
 4   DATE              object 
 5   TIME              object 
 6   CLASS             object 
 7   LANE              object 
 8   SPEED             float64
 9   WHEELBASE         float64
 10  HEADWAY           float64
 11  GAP               float64
 12  AXLE              object 
 13  AXLE_GROUPING     object 
 14  RHO               float64
 15  VEHICLE           object 
 16  DIRECTION         object 
dtypes: float64(5), object(12)
memory usage: 1.3+ GB
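Twelve of the seventeen columns are `object` dtype, and the `+` in `1.3+ GB` means the reported figure is a lower bound, because the contents of object columns are not measured unless a deep memory count is requested. If some of those columns hold only a few distinct values, converting them to `category` may shrink the dataframe considerably. The sketch below shows the idea on invented data; the `N`/`S` direction values are an assumption, not taken from the real files.

```python
import pandas as pd

# Hypothetical demo data: a repetitive text column next to a numeric one
demo = pd.DataFrame({
    "DIRECTION": ["N", "S"] * 500,
    "SPEED": [21.9] * 1000,
})

# deep=True measures the actual string contents, not just the pointers
before = demo.memory_usage(deep=True).sum()

# a categorical stores each distinct value once plus a small integer code per row
demo["DIRECTION"] = demo["DIRECTION"].astype("category")
after = demo.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

On the real dataframe, `dfs.info(memory_usage="deep")` would report the full figure, and `dfs.nunique()` would show which columns are repetitive enough to be worth converting.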


In [23]:
#Summary statistics

dfs.describe()

statistic,SPEED,WHEELBASE,HEADWAY,GAP,RHO
count,9989156.0,9989156.0,9989156.0,9989156.0,9989156.0
mean,21.83996,1.030735,189.0292,190.5951,0.9768114
std,6.683758,0.09946097,1348.285,1553.156,0.105566
min,0.3,0.0,0.0,0.0,0.0
25%,17.4,1.0,3.3,3.1,1.0
50%,21.9,1.0,26.7,26.5,1.0
75%,26.4,1.1,113.2,113.0,1.0
max,159.6,7.6,86400.0,569233.1,1.5
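The maximum `SPEED` of 159.6 sits far above the 75th percentile of 26.4, which suggests outliers worth inspecting. One common way to flag them is the interquartile-range (IQR) rule, sketched below on invented data. The helper `iqr_outlier_mask` and the multiplier `k=1.5` are conventional choices assumed here, not something the notebook prescribes.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical demo data: five plausible speeds and one extreme value
demo = pd.DataFrame({"SPEED": [17.0, 20.0, 21.9, 22.5, 26.0, 159.6]})

mask = iqr_outlier_mask(demo["SPEED"])
outliers = demo[mask]
print(outliers)
```

Applied to the quartiles in the table above, the same rule gives an IQR of 26.4 - 17.4 = 9.0 and fences at 17.4 - 13.5 = 3.9 and 26.4 + 13.5 = 39.9. Whether readings outside that range are genuine errors would need checking against the missing data dictionary.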
