# Investigating and cleaning "Ticket inspector logs" 


This pipe checks the integrity of the ticket inspector logs from a variety of angles:

a) are the bus lines listed correct?

b) are the bus stop codes correct?

c) are there any duplications?

d) are the start and end of each ticket inspection logical? (i.e., does it end after it starts?)

In [3]:
import pandas as pd
import numpy as np
import glob

## Importing data

In [8]:
files = glob.glob("c:/users/henry chapman/Documents/Coding/Data_science/Project_final/Raw_data/Ticket_inspections/*.csv")

dfs = []
for file in files:
    df = pd.read_csv(
        file,
        skiprows = 2,
        on_bad_lines = "skip"
    )
    dfs.append(df)


Inspections = pd.concat(dfs, ignore_index = True)

# Translating the Estonian column names to English
column_translation = {
    "Kuupäev": "Date",
    "Kontrolli algus": "Start_time",
    "Kontrolli lõpp": "End_time",
    "Peatus": "Stop",
    "Peatuse kood": "Stop_code",
    "Liin": "Line",
    "Kontrolör": "Inspector",
    "Kontrollitud kaarte": "Cards_checked",
    "Valideeritud": "Validated",
    "Valideerimata": "Unvalidated"
}
Inspections.rename(columns = column_translation, inplace = True)

print(len(Inspections))

# Dropping rows with a NaN in any of the key columns
Inspections.dropna(subset = ["Date", "Start_time", "Stop_code", "Unvalidated"], inplace = True)
print(len(Inspections))

Raw_Inspections = Inspections.copy()


163742
144140
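
The drop from 163742 to 144140 rows comes from a single `dropna` over four columns, so it is not obvious which column is responsible. A minimal sketch of a per-column breakdown follows; `missing_report` is a hypothetical helper and the toy frame holds made-up values, not the real logs.

```python
import pandas as pd
import numpy as np

def missing_report(df, required):
    """Count missing values per required column and the rows dropna would remove."""
    per_column = df[required].isna().sum()
    rows_lost = int(df[required].isna().any(axis=1).sum())
    return per_column, rows_lost

# Toy data standing in for the inspection logs (values are made up)
toy = pd.DataFrame({
    "Date": ["12.02.2016", None, "13.02.2016", "14.02.2016"],
    "Start_time": ["06:50:33", "06:51:00", None, "07:00:00"],
    "Stop_code": ["7820301-1", "7820301-1", "7820301-1", np.nan],
    "Unvalidated": [0, 2, 1, 3],
})

required = ["Date", "Start_time", "Stop_code", "Unvalidated"]
per_column, rows_lost = missing_report(toy, required)
print(per_column)   # missing values in each required column
print(rows_lost)    # rows that dropna(subset=required) would remove
```

Running the same helper on `Inspections` before the `dropna` call would show whether the loss is concentrated in one column.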


## Data integrity Checking

### Checking that all of the values in "bus lines" are correct 

In [10]:
Integrity = Inspections.copy()

# All Bus routes in Estonia 
routes = pd.read_csv("c:/users/henry chapman/Documents/Coding/Data_science/Project_final/Raw_data/gtfs/routes.txt")
routes.head()

# Routes specific to Tartu inner city
Tartu_routes = routes[routes["competent_authority"].isin(["Tartu LV", "Tartu MV"])]
Tartu_routes.head()

set(Inspections["Line"].unique()) - set(Tartu_routes["route_short_name"].unique())

# Many bus lines in the inspection reports are not listed under a Tartu authority.
# This is because the bus schedule underwent a major overhaul in 2019, in which bus stops were revised and a number of bus lines were removed.
# The removed bus lines align with the list below. The nan comes from rows with a missing Line value.


{'14',
 '15',
 '16',
 '16A',
 '17',
 '18',
 '19',
 '20',
 '24',
 '26',
 '27',
 '28',
 '9A',
 nan}
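
The `nan` in the result appears because the raw set difference treats missing values like any other entry. A minimal sketch of a comparison that ignores missing values and stray whitespace is shown here; `unknown_values` is a hypothetical helper and the toy line numbers are made up.

```python
import pandas as pd
import numpy as np

def unknown_values(observed, reference):
    """Return the observed values (as stripped strings) that are absent from reference."""
    obs = set(pd.Series(observed).dropna().astype(str).str.strip())
    ref = set(pd.Series(reference).dropna().astype(str).str.strip())
    return sorted(obs - ref)

# Toy data: one missing value, one value with stray whitespace, one repeat
toy_lines = pd.Series(["7", "4", " 16A", np.nan, "9A", "7"])
toy_reference = pd.Series(["4", "7", "21"])

unknown = unknown_values(toy_lines, toy_reference)
print(unknown)
```

This assumes both columns hold line names as text. If pandas parsed one of them as numbers, the string forms would not match and the values would need converting first.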

### Checking that all of the stop codes are correct 

In [12]:
# All bus stops in Estonia
stops = pd.read_csv("c:/users/henry chapman/Documents/Coding/Data_science/Project_final/Raw_data/gtfs/stops.txt")
stops.head()

# All bus stops specific to Tartu
Tartu_stops = stops[stops["authority"].isin(["Tartu MV", "Tartu LV"])]["stop_code"].tolist()
print(len(Tartu_stops))

# All bus stops labelled in the inspection reports
Inspector_stops = Inspections["Stop_code"].unique().tolist()
print(len(Inspector_stops))

# Too many stops to check manually, as was done for the bus lines
# Looks like the inspection reports have only covered about 15% (296 of 1,914) of the Tartu stop codes
# I wonder if this is true, or an artifact of bus stops with the same name having a separate code per direction

Extra_inspector_stops = set(Inspector_stops) - set(Tartu_stops)

# Five stops reported in the inspection reports are not found in the supposedly exhaustive database
# {'7820237-1', '7820102-2', '7820336-1', '7820323-2', '7820278-1'}

print(Extra_inspector_stops)

# I wonder if these bus stops are under a different authority in the stops database
stops[stops["stop_code"].isin(['7820237-1', '7820102-2', '7820336-1', '7820323-2', '7820278-1'])]

# No, they are not: the lookup returns no rows
# This is, once again, likely due to the 2019 overhaul

1914
296
{'7820237-1', '7820278-1', '7820323-2', '7820102-2', '7820336-1'}


Unnamed: 0,stop_id,stop_code,stop_name,stop_lat,stop_lon,zone_id,alias,stop_area,stop_desc,lest_x,lest_y,zone_name,authority
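
The stop-code check can also be done as a left merge with `indicator=True`, which keeps the unmatched rows in a frame instead of a bare set and makes the coverage figure explicit. This is only a sketch under assumptions: the toy codes are made up, and `coverage_pct` is an illustrative name.

```python
import pandas as pd

# Toy reference table and inspected stop codes (made-up values)
toy_stops = pd.DataFrame({"stop_code": ["A-1", "A-2", "B-1", "B-2"]})
toy_inspected = pd.DataFrame({"Stop_code": ["A-1", "B-2", "C-1", "A-1"]})

# One row per inspected stop code
inspected_unique = toy_inspected.drop_duplicates(subset="Stop_code")

# Left merge: rows flagged "left_only" have no match in the reference table
matched = inspected_unique.merge(
    toy_stops,
    left_on="Stop_code",
    right_on="stop_code",
    how="left",
    indicator=True,
)
unmatched_codes = matched.loc[matched["_merge"] == "left_only", "Stop_code"].tolist()

# Share of reference stop codes that appear in the inspections at all
reference_codes = set(toy_stops["stop_code"])
covered = reference_codes & set(inspected_unique["Stop_code"])
coverage_pct = 100 * len(covered) / len(reference_codes)

print(unmatched_codes)
print(coverage_pct)
```

Computing coverage on the intersection keeps stop codes that are missing from the reference table from inflating the percentage.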


### Checking that inspections are logical and duplicate checking

In [13]:
### WARNING: this is not a safe cell. Running it twice in a row WILL NOT WORK,
# because the Date, Start_time and End_time columns are dropped at the end of the cell
# Converting Date, start time, and end time into standard datetime columns

Inspections["Start_Dtime"] = pd.to_datetime((Inspections["Date"] + " " + Inspections["Start_time"]), format = "%d.%m.%Y %H:%M:%S")
Inspections["End_Dtime"] = pd.to_datetime((Inspections["Date"] + " " + Inspections["End_time"]), format = "%d.%m.%Y %H:%M:%S")
# Dropping no longer needed columns 
Inspections.drop(columns = ["Date", "Start_time", "End_time"], inplace = True)
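
One way to make this step safe to re-run is to wrap it in a function that returns a new frame and does nothing when the source columns are already gone. The sketch below assumes the same column names and format string as the cell above; `add_datetime_columns` is a hypothetical helper and the toy row is made up.

```python
import pandas as pd

def add_datetime_columns(df, fmt="%d.%m.%Y %H:%M:%S"):
    """Return a copy with Start_Dtime/End_Dtime; unchanged if already converted."""
    if "Date" not in df.columns:
        # Already converted on an earlier run: nothing to do
        return df.copy()
    out = df.copy()
    for source, target in [("Start_time", "Start_Dtime"), ("End_time", "End_Dtime")]:
        out[target] = pd.to_datetime(out["Date"] + " " + out[source], format=fmt)
    return out.drop(columns=["Date", "Start_time", "End_time"])

toy = pd.DataFrame({
    "Date": ["12.02.2016"],
    "Start_time": ["06:50:33"],
    "End_time": ["06:50:52"],
    "Line": ["7"],
})

once = add_datetime_columns(toy)
twice = add_datetime_columns(once)  # second call is a no-op
print(once)
```

Because the function never mutates its input, the raw frame stays available if the format string needs to change later.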


In [15]:
# Sorting the data frame by time 
Inspections.sort_values(by = "Start_Dtime", inplace = True)

In [16]:
### WARNING: this is not a safe cell. Running it multiple times in a row is not possible
# some basic data logic integrity testing 

# Checking that the Start and end times are logically possible 
from datetime import timedelta

# ensuring that the end of inspection did not occur before it even began
if ((Inspections["End_Dtime"] - Inspections["Start_Dtime"]) < timedelta(0)).any():
    raise Exception("Some inspections supposedly finished before they started")


durations = Inspections["End_Dtime"] - Inspections["Start_Dtime"]
print(len(durations))

# Are there any dates which seem impossible?
print(Inspections["Start_Dtime"].min())
print(Inspections["Start_Dtime"].max())

# Is the number of cards checked always equal to validated + unvalidated?
if not ((Inspections["Validated"] + Inspections["Unvalidated"]) == Inspections["Cards_checked"]).all():
    raise Exception("Number of cards checked does not equal the sum of validated and unvalidated tickets")


# Checking for duplications
# Note: keep = False removes every row of a duplicated group, not just the extra copies
print(len(Inspections))
Inspections.drop_duplicates(subset = ["Start_Dtime", "End_Dtime", "Inspector"], inplace = True, keep = False)
print(len(Inspections))


Inspections.head()

144140
2016-02-12 06:50:33
2024-11-19 15:46:49
144140
141102


Unnamed: 0,Stop,Stop_code,Line,Inspector,Cards_checked,Validated,Unvalidated,Start_Dtime,End_Dtime
2880,Vene,7820301-1,7,Karl Müürsepp,5,5,0,2016-02-12 06:50:33,2016-02-12 06:50:52
2881,Vene,7820301-1,7,Külli Henn,3,1,2,2016-02-12 06:50:36,2016-02-12 06:51:41
2882,Vene,7820301-1,7,Mariin Piper,5,4,1,2016-02-12 06:50:38,2016-02-12 06:50:59
2883,Vene,7820301-1,4,Karl Müürsepp,9,6,3,2016-02-12 06:53:30,2016-02-12 07:06:12
2884,Vene,7820301-1,4,Külli Henn,8,7,1,2016-02-12 06:53:33,2016-02-12 07:06:40
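
The checks above raise as soon as one row fails, which says nothing about which rows are at fault. A minimal sketch that collects the offending rows instead, and compares `keep=False` with `keep="first"`, is given below. The toy frame is made up and the variable names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Inspector": ["A", "A", "B", "C"],
    "Start_Dtime": pd.to_datetime([
        "2016-02-12 06:50:33", "2016-02-12 06:50:33",
        "2016-02-12 06:53:30", "2016-02-12 07:10:00",
    ]),
    "End_Dtime": pd.to_datetime([
        "2016-02-12 06:50:52", "2016-02-12 06:50:52",
        "2016-02-12 07:06:12", "2016-02-12 07:05:00",
    ]),
    "Cards_checked": [5, 5, 9, 4],
    "Validated": [5, 5, 6, 3],
    "Unvalidated": [0, 0, 3, 0],
})

# Rows where the inspection ends before it starts
negative_duration = toy[toy["End_Dtime"] < toy["Start_Dtime"]]

# Rows where validated + unvalidated does not add up to the cards checked
count_mismatch = toy[toy["Validated"] + toy["Unvalidated"] != toy["Cards_checked"]]

# Two duplicate policies on the same key
key = ["Start_Dtime", "End_Dtime", "Inspector"]
drop_all_copies = toy.drop_duplicates(subset=key, keep=False)   # removes both copies
keep_one_copy = toy.drop_duplicates(subset=key, keep="first")   # keeps one copy

print(len(negative_duration), len(count_mismatch))
print(len(drop_all_copies), len(keep_one_copy))
```

Which duplicate policy is right depends on whether a repeated record is a logging error (drop every copy) or a double export (keep one copy). The notebook uses `keep=False`.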


In [17]:
# Seeing how much data has been filtered out. Is it reasonable? Or is imputation / less harsh deletion required?

(len(Inspections) / len(Raw_Inspections)) * 100

# Still having 97.89 % of the rows that survived the NaN drop seems very reasonable
# (that is about 86 % of the 163742 rows originally read in)

97.8923269043985

## Exporting

In [18]:
Inspections.to_csv("C:/users/henry chapman/Documents/Coding/Data_science/Project_final/Output/1_Compiling_data/Pipe1/Inspections.csv", index = False)