# Victoria Bike Path Analysis

Welcome to my bike path analysis, a side project that I am using to teach myself Python and data science.

From this project, I would like to:
- see how ETL works in Python and whether I prefer it to a GUI system like Power Query
- see if bike ridership is seasonal
- see what percentage of riders miss consecutive checkpoints, and where this happens
- add the Strategic Cycling Corridor dataset to the mix
- see if I can work out some data science techniques, such as machine learning and AI

In [7]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
import os
import glob
import re
import io
import zipfile
from itertools import islice

In [8]:
#Header Data
VicRoadsHeader_URL = 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/VicRoads_Bike_Site_Number_Listing.csv'

#Yearly Data
Bicycle_Vol = {'Bicycle_Vol_2022_URL' : 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2022.zip',
               'Bicycle_Vol_2021_URL' : 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2021.zip',
               'Bicycle_Vol_2020_URL' : 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2020.zip',
               'Bicycle_Vol_2019_URL' : 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2019.zip',
               'Bicycle_Vol_2018_URL' : 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2018.zip',
               'Bicycle_Vol_2017_URL' : 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2017.zip',
               'Bicycle_Vol_2016_URL' : 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2016.zip',
               'Bicycle_Vol_2015_URL' : 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_speed_2015.zip'}

In [9]:
#create folders for storing data

#use the current working directory as the home directory of this program
home = os.getcwd()
print("Home file = " + home)

#create folder called data in the home dir
datafolder = os.path.join(home, 'data')
if not os.path.exists(datafolder):
    os.mkdir(datafolder)
    print("Made directory = " + datafolder)

#create folder called zip in the data dir
zipfolder = os.path.join(datafolder, 'zip')
if not os.path.exists(zipfolder):
    os.mkdir(zipfolder)
    print("Made directory = " + zipfolder)

#create folder called extracted in the data dir
extractedFolder = os.path.join(datafolder, 'extracted')
if not os.path.exists(extractedFolder):
    os.mkdir(extractedFolder)
    print("Made directory = " + extractedFolder)

Home file = c:\Users\Study\Documents\Projects\Bike Paths\BikePath


In [10]:
#download header and write it to the header CSV file
Header = requests.get(VicRoadsHeader_URL)
Header.raise_for_status() #stop here if the download failed rather than saving an error page
savePath = os.path.join(datafolder, 'VicRoadsHeader.csv')

with open(savePath,'wb') as output:
    output.write(Header.content)
    print("File Saved = " + savePath)

File Saved = c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\VicRoadsHeader.csv


In [11]:
#read the first lines of VicRoads Header File
print(savePath)
df = pd.read_csv(savePath)
print(df.head())

c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\VicRoadsHeader.csv
   SITE_XN_ROUTE  LOC_LEG  STRT_LAT  STRT_LONG                       GPS  \
0           6411    59437 -37.77231  144.99042  [-37.772347 +144.990631]   
1           6411    59438 -37.77235  144.99077  [-37.772347 +144.990631]   
2           6415    59485 -37.82713  144.98511  [-37.827009 +144.984261]   
3           6415    59486 -37.82714  144.98513  [-37.827009 +144.984261]   
4           6419    59458 -37.75829  144.98039  [-37.758216 +144.980308]   

   SITE_NAME                                           TFM_DESC BEARING_DESC  \
0  D208X6411  (BIKE PATH) ST. GEORGES RD N BD 28M S OF SUMNE...  NORTH BOUND   
1  D208X6411  (BIKE PATH) ST. GEORGES RD S BD 28M S OF SUMNE...  SOUTH BOUND   
2  D208X6415  (BIKE PATH) NORTH BANK E BD 75M W OF MORELL BR...   EAST BOUND   
3  D208X6415  (BIKE PATH) NORTH BANK W BD 75M W OF MORELL BR...   WEST BOUND   
4  D208X6419  (BIKE PATH) MERRI CREEK TRAIL N BD S OF MORELA...  

---
***Note from Author***

The header file changed from an xlsx to a CSV file, which caused an error in my code.

The header file contains the GPS coordinates and address titles for the datasets.

Here is a cheatsheet for later:
- Header.SITE_NAME = file.TIS_DATA_REQUEST + "X" + file.SITE_XN_ROUTE
- Header.TFM_ID = file.LOC_LEG
---
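
The cheatsheet can be tried out on toy data before touching the real files. This is a minimal sketch, assuming the data files carry `TIS_DATA_REQUEST`, `SITE_XN_ROUTE` and `LOC_LEG` columns and the header carries `SITE_NAME` and `TFM_ID`, as the cheatsheet describes. The rows are made up for illustration, and the form of `TIS_DATA_REQUEST` (here `D208`) is a guess.

```python
import pandas as pd

# Toy stand-in for the header file (illustrative values, not real rows).
header = pd.DataFrame({
    "SITE_NAME": ["D208X6411", "D208X6411", "D208X6415"],
    "TFM_ID": [59437, 59438, 59485],
    "BEARING_DESC": ["NORTH BOUND", "SOUTH BOUND", "EAST BOUND"],
})

# Toy stand-in for one data file; column names follow the cheatsheet.
counts = pd.DataFrame({
    "TIS_DATA_REQUEST": ["D208", "D208", "D208"],
    "SITE_XN_ROUTE": [6411, 6411, 6415],
    "LOC_LEG": [59437, 59438, 59485],
    "SPEED": [21.5, 18.0, 24.3],
})

# Header.SITE_NAME = file.TIS_DATA_REQUEST + "X" + file.SITE_XN_ROUTE
counts["SITE_NAME"] = (
    counts["TIS_DATA_REQUEST"].astype(str) + "X" + counts["SITE_XN_ROUTE"].astype(str)
)

# Header.TFM_ID = file.LOC_LEG
joined = counts.merge(
    header,
    left_on=["SITE_NAME", "LOC_LEG"],
    right_on=["SITE_NAME", "TFM_ID"],
    how="left",
    validate="many_to_one",  # raises if a header key appears more than once
)
print(joined[["SITE_NAME", "LOC_LEG", "BEARING_DESC", "SPEED"]])
```

A left join keeps every counter reading even when the header has no matching site, which makes missing sites easy to spot as empty `BEARING_DESC` values.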

In [12]:
#Download the zip files
from urllib.parse import urlparse

Bicycle_Vol_Zip_Files = [] #create an empty list to identify the zips once created.

#subset the list for testing purposes (lower the number to test with fewer zips); there are 8 in total above, up to 2022.
#[Testing routine - replace n_items with Bicycle_Vol.items() once testing is finished]
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

n_items = take(8, Bicycle_Vol.items())
#[end of testing routine. instructions above to remove testing]

print("Dictionary file: [Name, Zip File] = ", n_items)

#download zips from the website, using the subset of the dictionary for the URLs
for key, url in n_items: #replace n_items with Bicycle_Vol.items() when testing is resolved

    saveFile = os.path.basename(urlparse(url).path) #take only the filename from the URL. e.g. Bicycle_Volume_speed_2015.zip
    savePath = os.path.join(zipfolder, saveFile)

    Bicycle_Vol_Zip_Files.append(savePath) #Populate the empty list with filepaths of the zips.

    if not os.path.exists(savePath): #check for the zip itself; the zip folder always exists by this point, so checking it skipped every download
        resp = requests.get(url) #download the zip file
        resp.raise_for_status() #stop if the download failed rather than saving an error page
        with open(savePath,'wb') as output:
            output.write(resp.content)
            print("File Saved = " + savePath)
    else:
        print("Skipping file as already exists: " + savePath)

    

Dictionary file: [Name, Zip File] =  [('Bicycle_Vol_2022_URL', 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2022.zip'), ('Bicycle_Vol_2021_URL', 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2021.zip'), ('Bicycle_Vol_2020_URL', 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2020.zip'), ('Bicycle_Vol_2019_URL', 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2019.zip'), ('Bicycle_Vol_2018_URL', 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volume_and_Speed/Bicycle_Volume_Speed_2018.zip'), ('Bicycle_Vol_2017_URL', 'https://vicroadsopendatastorehouse.vicroads.vic.gov.au/opendata/Traffic_Measurement/Bicycle_Volu
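
The download-or-skip step can also be pulled into a small function so it is easy to test without hitting the website. This is only a sketch: `fetch_bytes` and `download_if_missing` are hypothetical names, and it uses the standard library's `urllib` in place of `requests` so that the block has no third-party dependencies.

```python
import os
import urllib.request
from urllib.parse import urlparse


def fetch_bytes(url):
    """Download a URL and return its body as bytes (standard library only)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def download_if_missing(url, folder, fetch=fetch_bytes):
    """Save url into folder unless a file with that name is already there.

    Returns (path, downloaded), where downloaded is False when the file
    already existed and the download was skipped.
    """
    name = os.path.basename(urlparse(url).path)  # e.g. Bicycle_Volume_Speed_2022.zip
    path = os.path.join(folder, name)
    if os.path.exists(path):
        return path, False
    data = fetch(url)
    with open(path, "wb") as output:
        output.write(data)
    return path, True
```

Passing the fetch function in as a parameter means a fake one can be swapped in while testing, so the skip logic can be checked without downloading anything.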

In [13]:
#Nested Zip Function V1 (not working as well as V2, below)

def extract_nested_zip(TargetFile, SourceFolder, DestinationFolder):
    # Unzip a zip file and its contents, including nested zip files
    # Delete the zip file(s) after extraction
    # Credit - Stackoverflow - ronnydw answered May 10 '17 at 14:47

    with zipfile.ZipFile(TargetFile, 'r') as zfile:
        print(zfile.namelist())
        zfile.extractall(path=DestinationFolder)
    os.remove(TargetFile) #Delete the zip once extracted. This is required as the routine repeats until it doesn't find a zip file.

    #search the dir for zips and send each one back to the start of the function.
    for root, dirs, files in os.walk(SourceFolder):
        for filename in files:
            if re.search(r'\.zip$', filename):
                fileSpec = os.path.join(root, filename)
                if not os.path.exists(fileSpec):
                    continue #already extracted and deleted by a nested call
                try:
                    #the earlier call passed only two arguments, so every nested zip raised a TypeError that the bare except hid
                    extract_nested_zip(fileSpec, root, root)
                except (zipfile.BadZipFile, OSError) as err:
                    print("Error! - Couldn't extract: " + fileSpec + " (" + str(err) + ")")

In [14]:
#Nested Zip Function V2
#https://stackoverflow.com/questions/36285502/how-to-extract-zip-file-recursively
#credit to Forge - 29/03/2016

def extract_nested_zip_v2(filename, DestinationPath):
    with zipfile.ZipFile(filename) as z:
        for f in z.namelist():
            #the original loop raised an error here. A likely cause (not confirmed) is that namelist() also returns folders and non-zip files, which cannot be opened as zips.
            if not f.lower().endswith('.zip'):
                z.extract(f, DestinationPath) #extract folders and plain files as they are
                continue
            # get directory name from file
            dirName = os.path.join(DestinationPath, os.path.splitext(f)[0])
            # create new directory
            os.makedirs(dirName, exist_ok=True) #using makedirs rather than mkdir because zip paths include subfolders
            # read inner zip file into bytes buffer
            content = io.BytesIO(z.read(f))
            with zipfile.ZipFile(content) as zip_file:
                zip_file.extractall(dirName)
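
To check the nested extraction without downloading anything, a tiny zip-inside-a-zip can be built in memory. This is a self-contained sketch: `extract_nested` and `build_sample_zip` are hypothetical helpers written for this demo, and the file names are made up.

```python
import io
import os
import tempfile
import zipfile


def extract_nested(zip_source, destination):
    """Extract a zip; inner zips are unpacked into a folder named after them."""
    with zipfile.ZipFile(zip_source) as z:
        for info in z.infolist():
            if info.is_dir():
                continue  # folders are created as needed below
            if info.filename.lower().endswith(".zip"):
                inner_dir = os.path.join(destination, os.path.splitext(info.filename)[0])
                os.makedirs(inner_dir, exist_ok=True)
                # Read the inner zip into memory and recurse into it.
                extract_nested(io.BytesIO(z.read(info)), inner_dir)
            else:
                z.extract(info, destination)


def build_sample_zip():
    """Build an outer zip in memory that holds one inner zip with one CSV."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as zi:
        zi.writestr("site_a.csv", "DATE,SPEED\n2015-05-23,21.5\n")
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as zo:
        zo.writestr("year_2015/site_a.zip", inner.getvalue())
    outer.seek(0)
    return outer


with tempfile.TemporaryDirectory() as tmp:
    extract_nested(build_sample_zip(), tmp)
    extracted = sorted(
        os.path.relpath(os.path.join(root, name), tmp).replace(os.sep, "/")
        for root, _, names in os.walk(tmp)
        for name in names
    )

print(extracted)  # ['year_2015/site_a/site_a.csv']
```

Because the function calls itself on each inner zip, it handles any depth of nesting, not just the two levels in the yearly downloads.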
           
        

In [15]:
#extract all data into the extracted folder, including the zips nested inside each yearly zip.
for File in Bicycle_Vol_Zip_Files:
    print("Extracting Zip: " + File + " into " + extractedFolder)
    extract_nested_zip_v2(File, extractedFolder)

Extracting Zip: c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\zip\Bicycle_Volume_Speed_2022.zip into c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted
Extracting Zip: c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\zip\Bicycle_Volume_Speed_2021.zip into c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted
Extracting Zip: c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\zip\Bicycle_Volume_Speed_2020.zip into c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted
Extracting Zip: c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\zip\Bicycle_Volume_Speed_2019.zip into c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted
Extracting Zip: c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\zip\Bicycle_Volume_Speed_2018.zip into c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted
Extracting Zip: c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\zip\Bicycl

In [42]:
#comments:
#os.rename() takes a source path and a destination path. The earlier attempt passed three arguments and kept only one digit of the date.

#clean up any data files whose extension is a date (e.g. "name.csv.20150523") and rename them to "name_20150523.csv"
for path, directories, files in os.walk(extractedFolder):
    for filename in files:
        filename_ext = os.path.splitext(filename)[1]
        if re.fullmatch(r'\.\d+', filename_ext):  #an all-digit extension identifies the badly named data files
            Original_Filename = os.path.join(path, filename)
            #remove the .csv in the filename and put the date in its place, then add .csv back on the end. The date was put in the wrong spot.
            New_Filename = os.path.join(path, filename.replace(".csv" + filename_ext, "_" + filename_ext[1:] + ".csv"))
            print(Original_Filename)
            try:
                os.rename(Original_Filename, New_Filename)
            except OSError:
                print("Error! - Couldn't rename: " + filename)

c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted\Bicycle_Volume_speed_2015\IND_D5555_X32021\IND_D5555_X32021.csv.20150523
c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted\Bicycle_Volume_speed_2015\IND_D5555_X32021\IND_D5555_X32021.csv.20150524
c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted\Bicycle_Volume_speed_2015\IND_D5555_X32021\IND_D5555_X32021.csv.20150525
c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted\Bicycle_Volume_speed_2015\IND_D5555_X32021\IND_D5555_X32021.csv.20150526
c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted\Bicycle_Volume_speed_2015\IND_D5555_X32021\IND_D5555_X32021.csv.20150527
c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted\Bicycle_Volume_speed_2015\IND_D5555_X32021\IND_D5555_X32021.csv.20150528
c:\Users\Study\Documents\Projects\Bike Paths\BikePath\data\extracted\Bicycle_Volume_speed_2015\IND_D5555_X32021\IND_D5555_X32021.csv.20150529
c:\Use
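
The renaming rule can be checked on its own, without touching any files. This is a small sketch; `fix_dated_name` is a hypothetical helper, and it assumes the badly named files always look like `<name>.csv.<digits>`, as in the output above.

```python
import re


def fix_dated_name(filename):
    """Move a trailing all-digit extension in front of ".csv".

    Names that do not match "<stem>.csv.<digits>" are returned unchanged.
    """
    match = re.fullmatch(r"(?P<stem>.+)\.csv\.(?P<date>\d+)", filename)
    if match is None:
        return filename
    return f"{match['stem']}_{match['date']}.csv"


print(fix_dated_name("IND_D5555_X32021.csv.20150523"))  # IND_D5555_X32021_20150523.csv
print(fix_dated_name("IND_D5555_X32021.csv"))           # IND_D5555_X32021.csv
```

Keeping the name logic separate from `os.rename` makes it possible to print the old and new names side by side before renaming anything.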

In [11]:
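#join all CSVs into a dataframe
#"**" together with recursive=True searches every subfolder. The earlier pattern ("/*.csv") only looked at the top level of the extracted folder, matched nothing, and pd.concat raised "ValueError: No objects to concatenate".
all_files = glob.glob(os.path.join(extractedFolder, "**", "*.csv"), recursive=True)

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)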
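
The difference between a top-level and a recursive glob pattern is easy to see on a throwaway folder. This is a minimal sketch with made-up folder and file names that mirror the layout of the extracted data.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Mimic the extracted layout: year folder / site folder / CSV file.
    nested = os.path.join(tmp, "Bicycle_Volume_speed_2015", "IND_D5555_X32021")
    os.makedirs(nested)
    open(os.path.join(nested, "IND_D5555_X32021_20150523.csv"), "w").close()

    # Without "**" the pattern only matches files directly inside tmp.
    top_level_only = glob.glob(os.path.join(tmp, "*.csv"), recursive=True)
    # With "**" and recursive=True every subfolder is searched as well.
    all_levels = glob.glob(os.path.join(tmp, "**", "*.csv"), recursive=True)

print(len(top_level_only), len(all_levels))  # 0 1
```

`recursive=True` has no effect unless the pattern contains `**`, which is why the first search comes back empty.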

In [None]:
print(frame.head())


In [None]:
frame.describe()

So... this data shows we have 7.63M rows of data. Some data is not consistent across all files, so we might have incomplete files. This is only one year, too...
How do I ensure I don't have duplicates that were recorded at the same time? This would be likely given my numbers. Speed looks OK at first glance: the average speed is 20.99213, but the min is 2 and the max is 15.92. Why? A mean cannot be higher than the max, so one of these figures needs rechecking. I need to split the data up by month/date.
How cool would it be to see a map of Melbourne with a blink every time a bike went past a counter?
How many people are recorded in one area and then disappear before the next likely stop? What are the possibilities for those people? Did they stop at the shops, or take an alternative, more dangerous, or non-bike-path route?
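
One way to look into the duplicates question is to flag rows that share the same counter, direction and timestamp. This is a sketch on toy rows; the column names `SITE_NAME`, `LOC_LEG`, `DATE` and `TIME` are assumptions about the extracted CSVs, and the values are invented.

```python
import pandas as pd

# Toy rows; rows 0 and 1 are identical on every key column.
rows = pd.DataFrame({
    "SITE_NAME": ["D208X6411", "D208X6411", "D208X6411", "D208X6415"],
    "LOC_LEG": [59437, 59437, 59437, 59485],
    "DATE": ["2015-05-23", "2015-05-23", "2015-05-23", "2015-05-23"],
    "TIME": ["08:15:02", "08:15:02", "08:15:09", "08:15:02"],
    "SPEED": [21.5, 21.5, 18.0, 24.3],
})

key = ["SITE_NAME", "LOC_LEG", "DATE", "TIME"]

# True for every row that repeats an earlier row on all key columns.
is_repeat = rows.duplicated(subset=key, keep="first")
print("repeated rows:", int(is_repeat.sum()))

# Keep the first of each repeated group.
deduped = rows.drop_duplicates(subset=key, keep="first")
print(len(deduped), "rows after dropping repeats")
```

Two riders can genuinely pass the same counter in the same second, so it is worth inspecting the flagged rows before dropping them.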

In [None]:
#frame['VEHICLE'].describe() 
frame['DATE'].describe()

In [None]:
frame['DATE'] = pd.to_datetime(frame['DATE'])
speeddf = frame.groupby(frame['DATE'].dt.strftime('%B'))['SPEED'].mean()
#need to add median, std, 25th percentile, count, 75th percentile, max
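
The extra statistics in the to-do comment can be produced in one pass with a grouped aggregation. This is a sketch on a tiny made-up frame (`frame_demo` is a hypothetical stand-in for `frame`); it groups by month number rather than month name so that the months stay in calendar order.

```python
import calendar

import pandas as pd

# Tiny stand-in for the real data, with the same DATE and SPEED columns.
frame_demo = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2015-01-03", "2015-01-20", "2015-02-11", "2015-02-12", "2015-02-28"]
    ),
    "SPEED": [20.0, 22.0, 18.0, 24.0, 21.0],
})

# Group by month number, then compute several statistics at once.
monthly = frame_demo.groupby(frame_demo["DATE"].dt.month)["SPEED"].agg(
    count="count",
    mean="mean",
    median="median",
    std="std",
    q25=lambda s: s.quantile(0.25),
    q75=lambda s: s.quantile(0.75),
    max="max",
)

# Swap month numbers for names after sorting, so January comes before February.
monthly.index = [calendar.month_name[m] for m in monthly.index]
print(monthly)
```

Grouping by `strftime('%B')` directly would sort the months alphabetically (April, August, ...), which is awkward when looking for a seasonal pattern.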

In [None]:
fig, ax = plt.subplots()
#ax.plot(speeddf['DATE'], speeddf['SPEED'], marker ='o', linestyle='--', color='r')
ax.plot(frame['DATE'], frame['SPEED'])
ax.set_xlabel("YY-MM")
ax.set_ylabel("Speed")
date_form = DateFormatter("%y-%m") #match the "YY-MM" axis label; "%d" would print the day of the month
ax.xaxis.set_major_formatter(date_form)
plt.show()