# Melbourne bicycle data analysis

In [3]:
import pandas as pd
import numpy as np
import os
from collections import Counter
from tqdm.notebook import tqdm

## Data validation 

The bicycle data is delivered as a large number of CSV files, without an accompanying data dictionary.  To confirm that every CSV shares the same columns, so that the files can be appended to a master file, we iterate over them and tally the unique column combinations.  Ideally, the result will be a single combination of columns whose tally equals the number of CSV files.  Let's see!

In [None]:
#rootdir ='../data'
#tally = pd.DataFrame(columns=["count"])
#counter = 0
#for subdir, dirs, files in tqdm(os.walk(rootdir)):
#    for file in files:
#        if ('.csv' in file) and ('.zip' not in file):
#            # read in CSV, if it contains records (which at least one doesn't!)
#            file_path = os.path.join(subdir,file)
#            if os.path.getsize(file_path) > 0:
#                df = pd.read_csv(os.path.join(subdir,file))
#                # store list of columns in variable df_columns as a string
#                df_columns = f"{df.columns.to_list()}"
#                # if CSV columns string is in the tally index, increment this
#                if df_columns in tally.index:
#                    tally[tally.index==df_columns] += 1
#                # otherwise add CSV columns string to the tally index
#                else:
#                    tally.loc[df_columns] = 1
#                # increment a counter; although theoretically this should equal the sum of the tallies!
#                counter+=1
#
#print(counter)

In [8]:
tally

Unnamed: 0,count
"['DATA_TYPE', 'TIS_DATA_REQUEST', 'SITE_XN_ROUTE', 'LOC_LEG', 'DATE', 'TIME', 'CLASS', 'LANE', 'SPEED', 'WHEELBASE', 'HEADWAY', 'GAP', 'AXLE', 'AXLE_GROUPING', 'RHO', 'VEHICLE', 'DIRECTION']",13229
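
The same tally can be built without loading every record. The following is a minimal sketch, not the code that produced the result above: it assumes that reading only the header row (`nrows=0`) is enough to compare column layouts, and it uses the `Counter` imported at the top of the notebook. The function name `tally_csv_columns` is hypothetical.

```python
import os
from collections import Counter

import pandas as pd


def tally_csv_columns(rootdir):
    """Count how many non-empty CSV files share each exact column layout."""
    tally = Counter()
    for subdir, _dirs, files in os.walk(rootdir):
        for file in files:
            if '.csv' in file and '.zip' not in file:
                file_path = os.path.join(subdir, file)
                # skip zero-byte files, which pandas cannot parse
                if os.path.getsize(file_path) > 0:
                    # nrows=0 parses only the header row, not the records
                    columns = pd.read_csv(file_path, nrows=0).columns
                    tally[tuple(columns)] += 1
    return tally
```

If every file shares one layout, `len(tally_csv_columns('../data'))` is 1 and the single count equals the number of non-empty CSV files.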


## Start Dask dashboard
The bicycle data comprises more than 16,000 CSV files, a multi-gigabyte dataset. Pandas will struggle to load and manipulate this much data at once, so [Dask](https://dask.org), which processes data in parallel, could be a good option.

In [1]:
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://192.168.1.4:8787/status

Dashboard: http://192.168.1.4:8787/status
Workers: 1
Total threads: 4
Total memory: 1.86 GiB
Status: running
Using processes: False

Comm: inproc://192.168.1.4/12364/1
Workers: 1
Dashboard: http://192.168.1.4:8787/status
Total threads: 4
Started: Just now
Total memory: 1.86 GiB

Comm: inproc://192.168.1.4/12364/4
Total threads: 4
Dashboard: http://192.168.1.4:55991/status
Memory: 1.86 GiB
Nanny: None
Local directory: G:\My Drive\DPC\Bicycle\bicycle_data_melbourne\dask-worker-space\worker-j8b9akch


## Creating a master dataframe

Now that we know that all of the CSVs share the same column names, we will join the various files together to create a master dataframe to run the analysis on. 

In [4]:
rootdir ='../data'
csv_files = []

for subdir, dirs, files in tqdm(os.walk(rootdir),desc="Getting CSV file paths...",unit="CSVs"):
    for file in files:
        if ('.csv' in file) and ('.zip' not in file):
            # record file paths of CSVs containing records (skip zero-byte files)
            file_path = os.path.join(subdir,file)
            if os.path.getsize(file_path) > 0:
                csv_files.append(file_path)

print(f"Identified the locations of {len(csv_files)} valid CSV files to compile!")

Identified the locations of 16054 valid CSV files to compile


In [None]:
import dask.dataframe as dd
# specify dtypes explicitly so Dask does not have to infer them from a sample of rows,
# which increases the likelihood of a successful read
df = dd.read_csv(csv_files, 
                 dtype={'DATA_TYPE': 'str',
                        'TIS_DATA_REQUEST': 'int64',
                        'SITE_XN_ROUTE': 'int64',
                        'LOC_LEG': 'int64',
                        'DATE': 'object',
                        'TIME': 'object',
                        'CLASS': 'int64',
                        'LANE': 'int64',
                        'SPEED': 'float',
                        'WHEELBASE': 'float',
                        'HEADWAY': 'float',
                        'GAP': 'float',
                        'AXLE': 'float64',
                        'AXLE_GROUPING': 'float64',
                        'RHO': 'float64',
                        'VEHICLE': 'str',
                        'DIRECTION': 'str'}
                )
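
One way to decide which dtypes to declare is to check what pandas infers for a sample of each file and look for columns where the files disagree. The sketch below is an illustration under assumptions, not part of the original workflow: the helpers `collect_dtypes` and `conflicting_columns` are hypothetical, and sampling the first `nrows` rows of each file may miss problems further down.

```python
import pandas as pd


def collect_dtypes(paths, nrows=1000):
    """Map each column name to the set of dtypes pandas infers across files."""
    seen = {}
    for path in paths:
        # read only the first rows of each file to keep the check cheap
        sample = pd.read_csv(path, nrows=nrows)
        for column, dtype in sample.dtypes.items():
            seen.setdefault(column, set()).add(str(dtype))
    return seen


def conflicting_columns(seen):
    """Return only the columns that were inferred with more than one dtype."""
    return {col: sorted(dtypes) for col, dtypes in seen.items() if len(dtypes) > 1}
```

A typical conflict is an integer column that contains a missing value in some file: pandas reads it as `float64` there and `int64` elsewhere, which is one reason to declare such columns as floats up front.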


In [None]:
# head() on a Dask dataframe already computes and returns a pandas DataFrame,
# so no further .compute() call is needed
df.head()


In [31]:
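# on a Dask dataframe, pass single_file=True to write one CSV rather than a directory of part files
df.to_csv('master_dataframe.csv')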

## Summary statistics

Let's look at the summary statistics for the master dataframe! First we'll look at what type of data is included in each column. Then we can check the max/min values and the distribution of the data, to see if there are any outlying data points.

In [32]:
#Checking data types in each column

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4925 entries, 0 to 4924
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DATA_TYPE         4925 non-null   object 
 1   TIS_DATA_REQUEST  4925 non-null   int64  
 2   SITE_XN_ROUTE     4925 non-null   int64  
 3   LOC_LEG           4925 non-null   int64  
 4   DATE              4925 non-null   object 
 5   TIME              4925 non-null   object 
 6   CLASS             4925 non-null   int64  
 7   LANE              4925 non-null   int64  
 8   SPEED             4925 non-null   float64
 9   WHEELBASE         4925 non-null   float64
 10  HEADWAY           4925 non-null   float64
 11  GAP               4925 non-null   float64
 12  AXLE              4925 non-null   int64  
 13  AXLE_GROUPING     4925 non-null   int64  
 14  RHO               4925 non-null   float64
 15  VEHICLE           4925 non-null   object 
 16  DIRECTION         4925 non-null   object 


In [34]:
#Summary statistics

df.describe()

Unnamed: 0,TIS_DATA_REQUEST,SITE_XN_ROUTE,LOC_LEG,CLASS,LANE,SPEED,WHEELBASE,HEADWAY,GAP,AXLE,AXLE_GROUPING,RHO
count,4925.0,4925.0,4925.0,4925.0,4925.0,4925.0,4925.0,4925.0,4925.0,4925.0,4925.0,4925.0
mean,208.0,10223.0,59443.482437,15.0,0.482437,18.764264,1.001401,473.021401,472.778518,2.0,1.0,0.984083
std,0.0,0.0,0.499742,0.0,0.499742,7.449773,0.14295,2259.648522,2259.641277,0.0,0.0,0.08118
min,208.0,10223.0,59443.0,15.0,0.0,1.4,0.0,0.0,0.0,2.0,1.0,0.2
25%,208.0,10223.0,59443.0,15.0,0.0,13.2,1.0,5.6,5.4,2.0,1.0,1.0
50%,208.0,10223.0,59443.0,15.0,0.0,18.5,1.0,109.1,108.7,2.0,1.0,1.0
75%,208.0,10223.0,59444.0,15.0,1.0,23.9,1.1,362.8,361.7,2.0,1.0,1.0
max,208.0,10223.0,59444.0,15.0,1.0,120.5,4.0,35194.6,35194.4,2.0,1.0,1.2
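
In the table above, SPEED has a 75th percentile of 23.9 but a maximum of 120.5, which hints at outlying records. One common way to flag such points is the 1.5 × IQR rule. The sketch below assumes that rule, which this notebook does not itself prescribe. The helper `iqr_outlier_mask` is hypothetical, and the speeds are illustrative values rather than records from the dataset.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# illustrative speeds only; the last value mimics the maximum seen in the table
speeds = pd.Series([13.2, 18.5, 23.9, 15.0, 20.1, 120.5])
mask = iqr_outlier_mask(speeds)
outliers = speeds[mask]
```

Applied to the real data, `df[iqr_outlier_mask(df['SPEED'])]` would list the flagged rows for manual inspection before deciding whether to drop them.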


In [44]:
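# raw string so the backslashes in the Windows path are not treated as escape sequences;
# a size of 0 bytes means the file is empty
os.path.getsize(r'../data\Bicycle_Volume_Speed_2020\Black Rock - Bicycle volume data 2020\IND_D5555_X32021.csv.20201228')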

0