<a href="https://colab.research.google.com/github/JyotiKuber/CodeDivisionProject/blob/main/V2_Bus_Data_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bus Journey to find how may hours each bus is on road**


In terms of the bus data, a question that I would like answer is how many hours are buses which have Euro III engines (over 15 years old) on the road each day when data has been collected. 

The data is collected every Wednesday and has a record for every 5 minute interval from 6am to 11.55pm. In each 5 minute interval there is data on every bus out on a route.

First version bus_data_project.ipynb contains the code which will calculate how many hours each bus is on route (assuming that if it is recorded in one 5 minute interval then we can assume that it was on the route for the whole of that 5 minutes)

Then, using a list of all the buses in the dataset, which has then been updated with the engine type (this may be a manual process which will produce a CSV file), I'll be able to filter for just the Euro III buses and 
calculate the total number of hours.

## Dataset Description 


https://tlewssdc6f.execute-api.eu-west-2.amazonaws.com/default/get_bus_tracking_data

The data comes from the Bus Open Data Service (Bus Open Data Service (dft.gov.uk)). Live tracking data (used for the bus company apps) is updated every 10 seconds. futureCoders sample this data every 5 minutes each Wednesday from 6:00 am to 11:55pm and store the data in files, one for every 5 minutes. The data contains information about every bus that is out on a route in that 5 minute snapshot.

The data for bus journeys is in this folder on Github: 

python-programming-for-data/Datasets at main · futureCodersSE/python-programming-for-data

Here we are working on data recieved in folder 04-Jan-2023 6:00am to 11:55pm
and json file contains data for each 5 minutes interval.



In [41]:
import requests
import json
import pandas as pd

def get_5minutes_data(date,time):
  api_url = "https://tlewssdc6f.execute-api.eu-west-2.amazonaws.com/default/get_bus_tracking_data"
  body = {"method": "get_data","request": {"date": date,"time": time}}
  response = requests.post(api_url, json=body)
  return json.loads(response.json())

def get_1hour_data(date,hour):
  hour_str = str(hour)
  if hour < 10:
    hour_str = '0'+ hour_str
  bus_data_list =[]  
  for minutes in range(0,56,5):
    if minutes < 10:
      recorded_time = hour_str +':'+ '0'+ str(minutes)
    else:
      recorded_time = hour_str +':'+ str(minutes)
    bus_data = get_5minutes_data(date,recorded_time)
    bus_data_list.append(bus_data)
  #display(len(bus_data_list))
  return bus_data_list
 
def get_1day_data(date):
  bus_data_list = []
  for hour in range(6,24):
    hour_bus_data = get_1hour_data(date,hour)
    bus_data_list.append(hour_bus_data)
  return bus_data_list

recording_date = '04-01-23'
recording_time = '06:00'
day_bus_data = get_1day_data(recording_date)
display(len(day_bus_data))

18

#Data cleaning


I installed JSON Formatter in Google chrome browser(from chrome webstore). Now after restarting Chrome, I can see parsed JSON which is more readable to work.

Initial processing requirements - create a dataframe from the tracked buses data that has the following columns:

RecordedAtTime
LineRef
OperatorRef
VehicleRef
Here as the data contains dates and time from different days also first we'll need to clean the data using the date on JSON file and it should match to the RecordedAtTime in the file records.

Here we are collecting only the records which are in given date 04/01/2023 and time 23:05. The timestamp is used to filter out the bus route information between 5 minutes.

Once we get the number of EURO III engine buses running on the road in 5 minutes we can calculate the buses in one hour and then in one day.

Save filtered data in an Excel or CSV file.

In [42]:
from datetime import datetime, timedelta

def get_snapshot_time_from_file_name(required_minutes):
  required_day = int(recording_date.split('-')[0])
  required_year = 2000 + int(recording_date.split('-')[2])
  required_month= int(recording_date.split('-')[1])
  required_hour = int(recording_time.split(':')[0])
  snapshot_time = datetime(required_year,required_month,required_day,required_hour,required_minutes)
  return snapshot_time

def in_this_5_mins(RecordedAtTime, snapshot_time):
  format = '%Y-%m-%dT%H:%M:%S+%f:00'
  # converting the timestamp string to datetime object
  datetime_object = datetime.strptime(RecordedAtTime, format)
  return snapshot_time.timestamp() - datetime_object.timestamp() < 300

vehicle_df = pd.DataFrame()
for hour_data in day_bus_data:
  minutes = 0
  for bus_data in hour_data:
    vehicle_list = []
    snapshot_time = get_snapshot_time_from_file_name(minutes)
    minutes += 5
    for item in bus_data:
      RecordedAtTime = item['RecordedAtTime']
      if in_this_5_mins(RecordedAtTime, snapshot_time):
        vehicle_journey_data = item['MonitoredVehicleJourney']
        LineRef = vehicle_journey_data['LineRef']
        OperatorRef = vehicle_journey_data['OperatorRef']
        VehicleRef= vehicle_journey_data['VehicleRef']
        my_dict = {'RecordedAtTime': RecordedAtTime,'LineRef': LineRef,'OperatorRef': OperatorRef, 'VehicleRef': VehicleRef}
        vehicle_list.append(my_dict)
    vehicle_list_df = pd.DataFrame(vehicle_list)
    vehicle_df = vehicle_df.append(vehicle_list_df) 
print(vehicle_df.info())
   

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23354 entries, 0 to 117
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   RecordedAtTime  23354 non-null  object
 1   LineRef         23354 non-null  object
 2   OperatorRef     23354 non-null  object
 3   VehicleRef      23354 non-null  object
dtypes: object(4)
memory usage: 912.3+ KB
None


There is an updated dataset here with the engine types for each vehicle (by vehicleRef). 

python-programming-for-data/updated_bus_regs.csv at main · futureCodersSE/python-programming-for-data

The next step will be to create a copy of  dataframe, with an extra column 'Emission Class' and to fill in the value for each row of data by looking up the vehicle ref and getting the emission class.

This will result in a new dataframe with some null values in the Emission Class column as the list may not be completely up to date.

However, I'll be able to find a trend in the vehicle refs (e.g. if the vehicle ref is greater than 6000 then the vehicle is probably a EURO VI)

In [52]:
def get_emission_data():
  url = "https://raw.githubusercontent.com/futureCodersSE/python-programming-for-data/main/Datasets/updated_bus_regs.csv"
  df = pd.read_csv(url)
  return df

def add_emission_class(v_df):
  emissions_df = get_emission_data()

  def get_emission_class(df):
      em_list = emissions_df[emissions_df['Last tracked'] == df['VehicleRef']]['Emission Class'].tolist()
        
      if len(em_list) > 0:
         return em_list[0]
      else:
         return None

  v_df['Emission Class'] = v_df.apply(get_emission_class, axis=1)
  return v_df
# Testing
em_df = vehicle_df.copy()
em_df['VehicleRef'] = pd.to_numeric(em_df['VehicleRef'], errors="coerce")
actual = add_emission_class(em_df)
display(actual.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23354 entries, 0 to 117
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   RecordedAtTime  23354 non-null  object 
 1   LineRef         23354 non-null  object 
 2   OperatorRef     23354 non-null  object 
 3   VehicleRef      23234 non-null  float64
 4   Emission Class  22774 non-null  object 
dtypes: float64(1), object(4)
memory usage: 1.1+ MB


None

In [51]:
emission_copy = actual.groupby(['Emission Class']).count()
display(emission_copy)
euro_3_df = actual[actual['Emission Class'] == 'EURO III']
print("Unique number of EURO III buses on the road:",len(euro_3_df['VehicleRef'].unique()))

#if set(['EURO III']).issubset(emission_copy.columns):
euro_3_vehicles = emission_copy['VehicleRef']['EURO III']
euro_3_vehicles_5minutes = euro_3_vehicles * 5 
print("Total number of on road hours for EURO III buses during a one day period: ",round(euro_3_vehicles_5minutes/60))

Unnamed: 0_level_0,RecordedAtTime,LineRef,OperatorRef,VehicleRef
Emission Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
EURO III,5029,5029,5029,5029
EURO IV,4338,4338,4338,4338
EURO V,1807,1807,1807,1807
EURO VI,11600,11600,11600,11600


Unique number of EURO III buses on the road: 31
Total number of on road hours for EURO III buses during a one day period:  419


#Saving clean data in csv file


Mounting and unmounting the Google Drive
In order to be able to open and save files on Google Drive, with Python, you first need to mount the Drive.


In [34]:
from google.colab import drive

def mount_drive():
  drive.mount('/content/drive', force_remount=True)
  folder_name = "/content/drive/MyDrive/Colab_data"
  return folder_name
data_folder = mount_drive()
print(data_folder)

Mounted at /content/drive
/content/drive/MyDrive/Colab_data


The code will save a copy of the file as recording date_vehicle_data.csv

In [46]:
def save_data(df, path):
  try:
     # append data frame to CSV file
    file_name = '/'+ recording_date +'_vehicle_data.csv'
    df.to_csv(path + file_name)
    print("File saved successfully")
  except:
    print("There was an error when trying to save the file.")
path = mount_drive()
save_data(vehicle_df, path)

Mounted at /content/drive
File saved successfully


In [47]:
from pandas.io.formats.info import DataFrameInfo
def save_data(df, path):
  try:
    file_name = '/'+ recording_date + '_vehicle_emission_data' +'.csv'
    df.to_csv(path + file_name)
    print("File saved successfully")
  except:
    print("There was an error when trying to save the file.")
path = mount_drive()
save_data(actual, path)

Mounted at /content/drive
File saved successfully


After finished working with the files, you should always unmount the Drive.

In [38]:
def unmount_drive():
  drive.flush_and_unmount()
  print('All changes made in this colab session should now be visible in Drive.')
unmount_drive()

All changes made in this colab session should now be visible in Drive.
