# Introduction  
### The following code was tested on:  
Python 3+ running on...  
  
Microsoft Surface Go:
- CPU - Intel Pentium 4415Y @ 1.60 GHz, 2 cores, 4 threads  
- RAM - 8 GB

ULB Virtual Machine on Surface Go with:
- CPU - 2 cores dedicated  
- RAM - 4096 MB dedicated

ULB Cluster - user epb123

# Sampled Dataset exploration, meta-data collection

In [1]:
# Imports go here
import pandas as pd
import numpy as np
import os
import ntpath
import pickle

Here we import:  
- pandas for its dataframes  
- NumPy for its summary-statistics functions (min, max, mean, percentiles)  
- os to access system files  
- ntpath to assist with filenames
- pickle to access variables across notebooks

Anyone running the code can set their own data storage directory below.  
The notebook is meant to run on the ULB cluster, but a local path helped with testing and development. 

In [2]:
def create_folder(newFolderName, newFolderPath):
# Function takes a folder name and a path and creates a new folder/directory
# Function returns folder path
    
    fullFolderPath = os.path.join(newFolderPath, newFolderName)
    
    # Folder might already have been created
    try:
        os.mkdir(fullFolderPath)
        print("new folder created at "+ fullFolderPath)
    except FileExistsError:
        print("folder already exists at " + fullFolderPath)
    
    return fullFolderPath
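
As a possible alternative to the try/except above, here is a minimal sketch using os.makedirs with exist_ok=True. The name ensure_folder is hypothetical and this variant is not the one used in the rest of the notebook; it only shows that the "folder already exists" case can be handled without catching an exception.

```python
import os
import tempfile

def ensure_folder(new_folder_name, new_folder_path):
    # Hypothetical variant of create_folder: exist_ok=True makes the call idempotent
    full_folder_path = os.path.join(new_folder_path, new_folder_name)
    os.makedirs(full_folder_path, exist_ok=True)
    return full_folder_path

# Usage on a throwaway directory so nothing on the cluster is touched
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_folder("variables", tmp)
    second = ensure_folder("variables", tmp)  # second call does not raise
    folder_was_created = os.path.isdir(first)
```

Unlike create_folder, this sketch prints nothing, so it would need a print call added if the confirmation messages are wanted.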

For some reason, storing and loading variables with %store does not work on the ULB cluster. We therefore use the pickle package.

In [3]:
# Set this = True if you are running the code on the ULB cluster
cluster = True

# Add your home directory as a raw string - must be less than 256 characters!!!
# NOTE: RUN YOUR NOTEBOOKS FROM THE SAME FOLDER FOR THIS ALL TO WORK (PREFERABLY THE BASE DIRECTORY)
if cluster:
    base_directory = r"/home/epb123/"
else:
    base_directory = r"/media/sf_Distributed/Project cleaner/"

#making some folders to keep things clean
v_direc = create_folder('variables', base_directory) + '/'
save_directory = create_folder('output', base_directory) + '/'

with open("v_direc",'wb') as variabledirectory:
    pickle.dump(v_direc,variabledirectory)   
with open(v_direc + "save_directory",'wb') as savedirectory:
    pickle.dump(save_directory,savedirectory)   
with open(v_direc + "base_directory",'wb') as basedirectory:
    pickle.dump(base_directory,basedirectory) 

# Add your data storage filepath below as a raw string - must be less than 256 characters!!!
if cluster:
    data_folder = r"/home/epb199/data/"
else:
    data_folder = r"/media/sf_Distributed/Project cleaner/tlc_0.2perc/"

#%store data_folder
with open(v_direc + "data_folder",'wb') as datafolder:
    pickle.dump(data_folder,datafolder)   

folder already exists at /home/epb123/variables
folder already exists at /home/epb123/output
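
To illustrate how pickle replaces %store here, the following is a minimal sketch of the save/load round trip that a later notebook would rely on. The helper names save_variable and load_variable are hypothetical (the notebook writes the open/dump calls inline), and a temporary directory stands in for the real variables folder.

```python
import os
import pickle
import tempfile

def save_variable(value, directory, name):
    # Hypothetical helper: pickle `value` to <directory>/<name>
    with open(os.path.join(directory, name), "wb") as handle:
        pickle.dump(value, handle)

def load_variable(directory, name):
    # Hypothetical helper: read back what save_variable wrote
    with open(os.path.join(directory, name), "rb") as handle:
        return pickle.load(handle)

# Round trip in a temporary directory standing in for the variables folder
with tempfile.TemporaryDirectory() as tmp:
    save_variable("/home/epb123/output/", tmp, "save_directory")
    restored = load_variable(tmp, "save_directory")
```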


## Statistics about the dataset

Compute basic statistics about the number of files in this sub-dataset, their size, and the number of records (lines) in each file. For length and number of records, give the min, max, mean, 25, 50, 75, 90 percentiles values.

### Functions for later use:

Here, we first needed a function that takes all of the file paths from the centralised data storage location and classifies them by type. We relied on the filenames for this classification (assuming they are correct). The only point of note is that the check for FHVHV must come first: "fhv" is a substring of "fhvhv", so the branch capturing FHV filenames would otherwise misclassify FHVHV files. The function returns a tuple of four lists of filepaths, one per type, for us to iterate over later. The filepaths are ordered oldest to newest by a simple .sort(), as TLC was kind enough to follow a consistent and easily sortable file naming scheme.

In [4]:
def get_paths(data_directory):
# Code to retrieve all path names from central data download and classify into type
# Returns tuple containing four lists: fhv filepaths, fhvhv filepaths, green filepaths, and yellow filepaths, in that order
# Each list will be sorted in ascending order - earliest records will therefore appear first

        # Making sure we are working in the directory the data is stored in
        original_directory = os.getcwd()
        os.chdir(data_directory)
        
        # Getting filepaths & filenames from directory, sorting by date (ASSUMING CORRECT FILE NAMING), splitting by type
        raw_filenames = os.listdir(data_directory)
        raw_filenames.sort()
        
        fhv_filepaths = []
        fhvhv_filepaths = []
        green_filepaths = []
        yellow_filepaths = []
        
        # Looping through all filenames, categorising by type, and generating filepaths
        for i in raw_filenames:
            if ".csv" in i:
                if "fhvhv" in i:
                    # Have to start with fhvhv, or else fhvhv files will be captured by fhv statement below
                    fhvhv_filepaths.append(os.path.abspath(i))
                elif "fhv" in i:
                    fhv_filepaths.append(os.path.abspath(i))
                elif "green" in i:
                    green_filepaths.append(os.path.abspath(i))
                elif "yellow" in i:
                    yellow_filepaths.append(os.path.abspath(i))
                else:
                    print("not found" + i)
            else:
                pass
        
        # Change directory back to original one before returning results
        os.chdir(original_directory)
        
        return fhv_filepaths, fhvhv_filepaths, green_filepaths, yellow_filepaths
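
The ordering issue in the classification can be shown in isolation. This is a small sketch on made-up filenames that follow the TLC naming pattern; classify_filename is a hypothetical helper, not part of get_paths.

```python
def classify_filename(filename):
    # Order matters: "fhv" is a substring of "fhvhv", so "fhvhv" is tested first
    for label in ("fhvhv", "fhv", "green", "yellow"):
        if label in filename:
            return label
    return None

# Made-up filenames following the naming pattern, plus one that matches nothing
names = [
    "fhv_tripdata_2017-01.csv",
    "fhvhv_tripdata_2019-02.csv",
    "green_tripdata_2015-01.csv",
    "yellow_tripdata_2009-12.csv",
    "notes.csv",
]
labels = [classify_filename(n) for n in names]
```

Swapping the first two labels in the tuple would make the FHVHV file come back as "fhv", which is the misclassification described above.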

Here, we needed a way to quickly count the number of rows in the files. Ours are relatively small (only up to ~5MB for some of the Yellow CSVs), so a simple iteration would probably have been good enough, but on the full 100% dataset it would have been expensive to open each file and count each row. This operation could have been done in Spark, but as the rest of 2.1 did not appear to need it, we preferred to stay with regular Python methods.  

What we ended up doing was to implement an "unbuffered counter", which was proposed by a user called Michael Bacon on StackOverflow. Essentially, it bypasses Python's buffered reader and pulls fixed-size chunks (1 MB here) directly from the underlying raw file object. Buffering can be useful if we want to do a lot with the files, but here we do not: we only want to count the rows and exit. In this function, Python reads the raw bytes as kept on system storage (with the SSD on our laptops this is fine, but with a hard disk this may change), counts the newline characters in each chunk, and moves on, without spending time copying the data through an intermediate buffer. Given the expense of just loading the full files as Spark dataframes later, we were very happy with the performance.

In [5]:
def unbuffered_counter(filename):
# Quickly counting rows in any large file by using raw unbuffered data, code credit to Michael Bacon on StackOverflow
# Returns number of new lines in the file
# Could also be done with a Spark RDD, but this method is sufficiently fast, accounting for time to start up Spark
    
    file = open(filename, 'rb')
    records = 0
    buffer_size = 1024 * 1024
    read_file = file.raw.read
    buffer = read_file(buffer_size)
    while buffer:
        records += buffer.count(b'\n')
        buffer = read_file(buffer_size)
    file.close()
    
    # First row is the header!
    records = records - 1
    return records
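
As a sanity check of the chunked counting idea, here is a minimal sketch on a small made-up file. The function count_records is a hypothetical variant that uses a context manager and buffering=0 (which returns the raw file object directly). Like the original, it assumes a header row and a trailing newline on the last record.

```python
import os
import tempfile

def count_records(filename, buffer_size=1024 * 1024):
    # Same idea as unbuffered_counter, with a context manager so the file
    # is closed even if reading fails; buffering=0 gives the raw file object
    newlines = 0
    with open(filename, "rb", buffering=0) as raw_file:
        chunk = raw_file.read(buffer_size)
        while chunk:
            newlines += chunk.count(b"\n")
            chunk = raw_file.read(buffer_size)
    # Subtract the header row, as in the original function
    return newlines - 1

# Made-up CSV with one header line and three records
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "wb") as handle:
        handle.write(b"vendorid,trip_distance\n1,2.5\n2,0.7\n1,3.1\n")
    n_records = count_records(sample_path)
```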

We needed a function to take a list of numbers and calculate basic statistics. This will be run eight times, so it is here for convenience. Numpy is the obvious package for these calculations.

In [6]:
def data_stats(file_datalist):
# Calculates basic stats for a given list
# Returns a list containing relevant stats

    stats = []
    
    stats.append(np.min(file_datalist))
    stats.append(np.max(file_datalist))
    stats.append(np.mean(file_datalist))
    
    stats.append(np.percentile(file_datalist, 25))
    stats.append(np.percentile(file_datalist, 50))
    stats.append(np.percentile(file_datalist, 75))
    stats.append(np.percentile(file_datalist, 90))
    
    return stats
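
For reference, np.percentile also accepts a sequence of percentiles, so the four percentile calls could be collapsed into one. The sketch below uses a hypothetical name, data_stats_compact, and a made-up input list. It is meant to return the same values in the same order as data_stats.

```python
import numpy as np

def data_stats_compact(values):
    # np.percentile accepts a sequence of percentiles, so one call covers all four
    percentiles = np.percentile(values, [25, 50, 75, 90])
    return [np.min(values), np.max(values), np.mean(values), *percentiles]

# Made-up input: min, max, mean, then the 25/50/75/90 percentiles
example = data_stats_compact([1, 2, 3, 4, 5])
```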

This is just a function to take a list of filepaths and calculate those summary statistics using the function immediately above. Here, we apply os.stat()'s st_size for the file size and the unbuffered counter from before. We return the outputs as a single list as we later make a dataframe to display the results.

In [7]:
def file_sizelength(subdirectory):
# Function to take a subdirectory, calculate summary stats of filesizes and lengths for files in subdirectory
# Returns a list with all statistics in order size, length, with: min, max, mean, 25%, 50%, 75%, 90%

    allstats = []
    filesizes = []
    filelengths = []

    for i in subdirectory:
        # os.stat().st_size returns file size in bytes
        filesizes.append(os.stat(i).st_size)
        filelengths.append(unbuffered_counter(i))

    # Calling data_stats to calculate summary statistics
    allstats.extend(data_stats(filesizes))
    allstats.extend(data_stats(filelengths))
    
    return allstats

### Calculating summary statistics:

We start the analysis by applying the functions we have just developed. We also store the filepaths for use in 2.2 and beyond. We present the results as a dataframe, as that makes the most sense for this type of data, adding each list of function returns as one row of the dataframe with [:].

In [8]:
filepaths = get_paths(data_folder)
#%store filepaths
with open(v_direc + "filepaths",'wb') as file_paths:
    pickle.dump(filepaths,file_paths)
    
# filepaths[0] is fhv, [1] is fhvhv, [2] is green, [3] is yellow

# Creating lists to store sub-dataset statistics, and adding the number of files as the first item
fhv_info = [len(filepaths[0])]
fhvhv_info = [len(filepaths[1])]
green_info = [len(filepaths[2])]
yellow_info = [len(filepaths[3])]

# Adding filesize and length summary statistics in each sub-dataset, using file_sizelength routine
fhv_info.extend(file_sizelength(filepaths[0]))
fhvhv_info.extend(file_sizelength(filepaths[1]))
green_info.extend(file_sizelength(filepaths[2]))
yellow_info.extend(file_sizelength(filepaths[3]))

# Creating a dataframe to present results
datastats_df = pd.DataFrame(columns = ['files', 'min size', 'max size', 'mean size', '25% size', '50% size', '75% size', '90% size', 'min records', 'max records', 'mean records', '25% records', '50% records', '75% records', '90% records'])
datastats_df.loc['fhv'] = fhv_info[:]
datastats_df.loc['fhvhv'] = fhvhv_info[:]
datastats_df.loc['green'] = green_info[:]
datastats_df.loc['yellow'] = yellow_info[:]

print("\nBelow are summary statistics for the TLC Trip Record sub-datasets:")
datastats_df


Below are summary statistics for the TLC Trip Record sub-datasets:


| | files | min size | max size | mean size | 25% size | 50% size | 75% size | 90% size | min records | max records | mean records | 25% records | 50% records | 75% records | 90% records |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fhv | 64.0 | 52060.0 | 3339455.0 | 1147258.0 | 218111.25 | 646615.0 | 2257156.5 | 3020144.5 | 959.0 | 47672.0 | 21712.625 | 6030.5 | 21692.0 | 33844.75 | 43708.9 |
| fhvhv | 10.0 | 535789.0 | 2978931.0 | 2007751.0 | 1121047.25 | 2542280.0 | 2687857.25 | 2804332.8 | 8625.0 | 47703.0 | 32181.9 | 18022.0 | 40714.0 | 43056.25 | 44922.0 |
| green | 76.0 | 2512.0 | 570765.0 | 262437.3 | 121494.75 | 190194.5 | 456955.0 | 499751.0 | 15.0 | 3546.0 | 2026.5 | 1358.75 | 2072.5 | 2892.25 | 3126.0 |
| yellow | 131.0 | 43103.0 | 5959352.0 | 3750760.0 | 1756967.0 | 4442047.0 | 5123591.5 | 5491438.0 | 476.0 | 32300.0 | 24203.51145 | 19989.0 | 26294.0 | 29080.0 | 30202.0 |
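
The summary dataframe could also be built in one step from a dictionary of rows rather than by assigning each row with .loc. The sketch below uses made-up numbers and only three of the fifteen columns, purely to show the construction with pd.DataFrame.from_dict.

```python
import pandas as pd

# Hypothetical, made-up numbers purely to show the construction
rows = {
    "fhv": [2, 100, 300],
    "green": [3, 50, 90],
}

# orient="index" turns each dictionary key into a row label
summary_df = pd.DataFrame.from_dict(
    rows, orient="index", columns=["files", "min size", "max size"]
)
```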


## Analysis of the schema evolution.

Over time, the relational schema associated to each type of trip data (yellow, green, fhv, fhvhv) has changed. Let us analyze the changes.

### Functions for later use:

The next three functions just return the year, month, and date from a filepath. Since TLC was good enough to be consistent in their file naming, we can happily take advantage of this to make our lives a bit easier when identifying record groups.

In [9]:
def get_pathyear(givenpath):
# Function to get the year from a given filepath
# Returns the year as an integer
    return int(givenpath[-11:-7])

In [10]:
def get_pathmonth(givenpath):
# Function to get the month from a given filepath
# Returns the month as an integer
    return int(givenpath[-6:-4])

In [11]:
def get_pathdate(givenpath):
# Function to get the date from a given filepath
# Returns the date as a tuple in (year, month) form
    return (int(givenpath[-11:-7]), int(givenpath[-6:-4]))
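
The fixed slices above depend on every path ending in exactly YYYY-MM.csv. As an illustration only, here is a sketch of the same extraction with a regular expression, which fails loudly if the pattern is absent. The name parse_pathdate and the example path are hypothetical.

```python
import re

# Assumed naming pattern: <type>_tripdata_YYYY-MM.csv
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})\.csv$")

def parse_pathdate(givenpath):
    # Returns (year, month) as integers, or raises if the path does not match
    match = DATE_PATTERN.search(givenpath)
    if match is None:
        raise ValueError("no YYYY-MM date found in " + givenpath)
    return (int(match.group(1)), int(match.group(2)))

example_name = "fhv_tripdata_2017-01.csv"
parsed = parse_pathdate("/home/epb199/data/" + example_name)
# The slicing used by get_pathdate gives the same result on a well-formed name
sliced = (int(example_name[-11:-7]), int(example_name[-6:-4]))
```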

This is essentially the catch-all function for checking the schema. As we want to minimise the number of loops we go through for efficiency's sake, we track all schema issues together.  

In analysing the schema changes, we make use of Python's unordered set properties - we effectively go through once and catch the major issues requiring human intervention (changed variable names, missing variables, etc.) using sets and then afterwards, we use the original list format just to track order changes.  

We use dataframes for the returned column addition/subtraction and re-ordering analysis, and a list of lists for the common schemas as this will be useful in 2.2.

One minor thing to note is that we make use of ntpath.basename() here for very easy file identification.

In [12]:
def column_analysis(sub_data):
# Function to analyse changes in data column labels, taking filepath list as inputs
# Function returns a tuple containing:
# Return 1. dataframe containing columns added to a file when compared with the previous file, and columns dropped from a file when compared with the next file
# Return 2. list of lists, where each sub-list contains the filepaths of consecutive files which share a common schema
# Return 3. dataframe of variables whose column index changed from one file to the next
    # Create variables to capture changes - column changes as a dict and a list of file groups with the same schema
    col_changes = {}
    same_schema = []
    order_changesDf = pd.DataFrame(columns = ['file 0', 'file 1', 'index 0', 'index 1'])
    
    # Variable to store the filepaths sharing the current schema
    current_schema = []

    # Checking if column names are the same - iterates over all files in sub-dataset
    for j in range (len(sub_data)):        
        # Storing first row of the csv and ensuring all headers are lowercase without leading and trailing white spaces
        df_j= pd.read_csv(sub_data[j],nrows=0)
        l1 = [item.lower().strip() for item in list(df_j.columns)]
        
        jdate = get_pathdate(sub_data[j])
        
        if j == 0:
        # Storing first csv's columns as l0
            l0 = l1[:]
            current_schema.append(sub_data[j])  
        
        elif set(l1) != set(l0):
        # Comparing the column names in set format, as order should not matter here
            l0_name = ntpath.basename(sub_data[j-1])
            l1_name = ntpath.basename(sub_data[j])
            
            # Elements in l0 but not in l1 => dropped elements
            col_changes['dropped from '+l0_name] = set(l0) - set(l1)
            # Elements in l1 and not in l0 => added elements
            col_changes['added to '+l1_name] = set(l1) - set(l0)
            # Reset schema tracker
            same_schema.append(current_schema[:])
            current_schema.clear()
            current_schema.append(sub_data[j])
            # If the schema changed in the very last file, that file forms a group of its own
            if j == len(sub_data) - 1:
                same_schema.append(current_schema[:])
            
            # Tracking column order changes for columns present in both files
            # The inner index is named m so that it does not overwrite the outer loop variable j
            for k in range(len(l0)):
                try:
                    if l0[k] != l1[k]:
                        for m in range(len(l1)):
                            if l0[k] == l1[m]:
                                order_changesDf.loc[l0[k]+', y:'+str(jdate[0])+", m:"+str(jdate[1])] = [l0_name, l1_name, k, m]
                except IndexError:
                    # l1 is shorter than l0, so position k only exists in l0
                    for m in range(len(l1)):
                        if l0[k] == l1[m]:
                            order_changesDf.loc[l0[k]+', y:'+str(jdate[0])+", m:"+str(jdate[1])] = [l0_name, l1_name, k, m]
                    
            # Reset l0 for next loop
            l0 = l1
        
        elif set(l1) == set(l0):
            current_schema.append(sub_data[j])
            # Append the current_schema if we have reached the last record
            if j == len(sub_data) - 1:
                same_schema.append(current_schema[:])

            # Tracking column order changes
            # Comparing the lists directly (l0 != l1) lets us skip the loops below when the order is unchanged
            # We did not do this in the previous branch because l0 always != l1 when the sets are not equal
            if l0 != l1:
                # The file names are needed for the row contents and are not otherwise defined in this branch
                l0_name = ntpath.basename(sub_data[j-1])
                l1_name = ntpath.basename(sub_data[j])
                for k in range(len(l0)):
                    try:
                        if l0[k] != l1[k]:
                            # The inner index is named m so that it does not overwrite the outer loop variable j
                            for m in range(len(l1)):
                                if l0[k] == l1[m]:
                                    order_changesDf.loc[l0[k]+', y:'+str(jdate[0])+", m:"+str(jdate[1])] = [l0_name, l1_name, k, m]
                    except IndexError:
                        for m in range(len(l1)):
                            if l0[k] == l1[m]:
                                order_changesDf.loc[l0[k]+', y:'+str(jdate[0])+", m:"+str(jdate[1])] = [l0_name, l1_name, k, m]
            
            # Reset l0 for next loop
            l0 = l1
        
        else:
            print("error reading columns")
                
    return (pd.DataFrame.from_dict(col_changes, orient='index'), same_schema, order_changesDf)
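
The core of column_analysis can be seen on two toy header lists. This is a minimal sketch with made-up, shortened schemas (not read from any file): set differences give the added and dropped columns, and list positions give the order changes for columns present in both.

```python
# Toy header lists standing in for two consecutive files (made up, shortened)
previous = ["vendorid", "tolls_amount", "total_amount", "payment_type"]
current = ["vendorid", "tolls_amount", "improvement_surcharge", "total_amount", "payment_type"]

# Sets ignore order, so these catch only added and dropped columns
dropped = set(previous) - set(current)
added = set(current) - set(previous)

# Order changes only make sense for columns present in both files
moved = {
    name: (previous.index(name), current.index(name))
    for name in previous
    if name in current and previous.index(name) != current.index(name)
}
```

Here one inserted column shifts every later column by one position, which is the same pattern as the runs of consecutive index changes in the order-change outputs further down.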

We are going to check that the data itself is correct later on in 2.3, so all we care about for now are the headers for each data set - we can take the variable names by just reading the first row of each CSV (which also saves lots of read time). In this step, we also set all variable names to lowercase and strip all extra leading and/or trailing spaces. In the process of data integration in 2.2, we take the liberty of ensuring that all variable names follow this lowercase & no spaces format, so doing this here allows us to focus on more major schema issues.

In [13]:
def file_schema(address):   
# Function takes a csv from the given address
# Returns a list of stripped and lowercase column names
    
    headers = [item.lower().strip() for item in list(pd.read_csv(address,nrows=0).columns)]
    
    return headers
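
To show what nrows=0 does, here is a small sketch on an in-memory CSV with made-up content and deliberately untidy header names. Only the header line is parsed, and the same lowercase-and-strip normalisation is applied.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real trip file (made-up content)
sample_csv = io.StringIO(" VendorID ,Trip_Distance\n1,2.5\n2,0.7\n")

# nrows=0 parses the header line only, so no data rows are loaded
header_only = pd.read_csv(sample_csv, nrows=0)
headers = [name.lower().strip() for name in header_only.columns]
```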

We weren't quite sure why, but the sample download appeared to be missing several months of data. We assumed this was somehow a result of the sampling process, but here is a function to track the gaps. There is nothing particularly interesting in the construction: from each file's path it calculates the (year, month) we expect the next file to have, and records every month skipped before the file we actually observe.

In [14]:
def date_analysis(sub_data):
# Function to extract date of every file in sub-dataset and highlight missing months
# Function returns list containing missing months in tuple (year, month) form
    
    # Generate a list of all dates
    # sub_data[i][-11:-7] is the year, sub_data[i][-6:-4] is the month, with the naming convention as used
    gap_years = []
    for i in range(len(sub_data)):
        current_year = get_pathyear(sub_data[i])
        current_month = get_pathmonth(sub_data[i])
        
        # Compare previous expectation to current one
        if i == 0:
            pass
        elif current_month != expected_month or current_year != expected_year:
            # Add all missing months to gap_years list
            while current_month != expected_month or current_year != expected_year:
                gap_years.append((expected_year, expected_month))
                if expected_month < 12:
                    expected_month = expected_month + 1
                else:
                    expected_year = expected_year + 1
                    expected_month = 1
        
        # Calculate what we expect the next date should be
        if current_month < 12:
            expected_year = current_year
            expected_month = current_month + 1
        else:
            expected_year = current_year + 1
            expected_month = 1
    
    return(gap_years)
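
The same gap detection can be expressed by mapping each (year, month) to a running month count. The sketch below is an alternative formulation under the assumption that the input list is non-empty and sorted in ascending order. The name missing_months and the input dates are made up for illustration.

```python
def missing_months(observed):
    # observed: list of (year, month) tuples, assumed non-empty and sorted ascending
    # Map each date to a running month count so consecutive months differ by 1
    as_index = [year * 12 + (month - 1) for year, month in observed]
    present = set(as_index)
    gaps = []
    for idx in range(as_index[0], as_index[-1] + 1):
        if idx not in present:
            # Convert the running count back to (year, month)
            gaps.append((idx // 12, idx % 12 + 1))
    return gaps

# Made-up input with two one-month gaps
gaps = missing_months([(2019, 11), (2020, 1), (2020, 3)])
```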

### Schema evolution code

The code below runs the previously defined functions for each index of the filepaths[] list. It also pickles whatever is useful in 2.2 to disk to avoid re-calculation.  

Note: you should un-comment out the line: print(XYZ_schema[1]) if you want to see the full list of lists containing file paths with a common schema

## Analysis of schema changes for fhv cab data files

Analyze the schema changes for the FHV cab data files. Write down your conclusions

In [15]:
fhv_missing = date_analysis(filepaths[0])
fhv_schema = column_analysis(filepaths[0])

The following months are missing from this sub-dataset:

In [16]:
print(fhv_missing)

[(2019, 12), (2020, 2)]


In the following files, the schema is the same:

In [17]:
#print(fhv_schema[1])
fhv_sameSchema = fhv_schema[1]
#%store fhv_sameSchema
with open(v_direc + "fhv_sameSchema",'wb') as fhvsameSchema:
    pickle.dump(fhv_sameSchema,fhvsameSchema)   

In or after the following files, there was the following schema change:

In [18]:
fhv_schema[0]

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| dropped from fhv_tripdata_2016-12.csv | pickup_date | locationid | | |
| added to fhv_tripdata_2017-01.csv | dropoff_datetime | pulocationid | dolocationid | pickup_datetime |
| dropped from fhv_tripdata_2017-06.csv | | | | |
| added to fhv_tripdata_2017-07.csv | sr_flag | | | |
| dropped from fhv_tripdata_2017-12.csv | | | | |
| added to fhv_tripdata_2018-01.csv | dispatching_base_number | | | |
| dropped from fhv_tripdata_2018-12.csv | dispatching_base_number | | | |
| added to fhv_tripdata_2019-01.csv | | | | |


The final schema was:

In [19]:
print(file_schema(filepaths[0][-1]))

['dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag']


Observations on schema changes  

2018-01 to 2018-12:  
- dispatching_base_number = dispatching_base_num
- note: both columns existed simultaneously in these files

2016-12 and earlier:
- pickup_date = pickup_datetime
- locationid = pulocationid

The index of the following variables has changed:

In [20]:
fhv_schema[2]

| | file 0 | file 1 | index 0 | index 1 |
|---|---|---|---|---|
| dispatching_base_num, y:2018, m:1 | fhv_tripdata_2017-12.csv | fhv_tripdata_2018-01.csv | 0 | 6 |
| pickup_datetime, y:2018, m:1 | fhv_tripdata_2017-12.csv | fhv_tripdata_2018-01.csv | 1 | 0 |
| dropoff_datetime, y:2018, m:1 | fhv_tripdata_2017-12.csv | fhv_tripdata_2018-01.csv | 2 | 1 |
| pulocationid, y:2018, m:1 | fhv_tripdata_2017-12.csv | fhv_tripdata_2018-01.csv | 3 | 2 |
| dolocationid, y:2018, m:1 | fhv_tripdata_2017-12.csv | fhv_tripdata_2018-01.csv | 4 | 3 |
| sr_flag, y:2018, m:1 | fhv_tripdata_2017-12.csv | fhv_tripdata_2018-01.csv | 5 | 4 |
| pickup_datetime, y:2019, m:1 | fhv_tripdata_2018-12.csv | fhv_tripdata_2019-01.csv | 0 | 1 |
| dropoff_datetime, y:2019, m:1 | fhv_tripdata_2018-12.csv | fhv_tripdata_2019-01.csv | 1 | 2 |
| pulocationid, y:2019, m:1 | fhv_tripdata_2018-12.csv | fhv_tripdata_2019-01.csv | 2 | 3 |
| dolocationid, y:2019, m:1 | fhv_tripdata_2018-12.csv | fhv_tripdata_2019-01.csv | 3 | 4 |


## Analysis of schema changes for fhvhv data files

Analyze the schema changes for the FHVHV cab data files. Write down your conclusions

In [21]:
fhvhv_missing = date_analysis(filepaths[1])
fhvhv_schema = column_analysis(filepaths[1])

The following months are missing from this sub-dataset:

In [22]:
print(fhvhv_missing)

[(2019, 7), (2019, 8), (2019, 9), (2019, 10), (2019, 11), (2019, 12), (2020, 2)]


In the following files, the schema is the same:

In [23]:
#print(fhvhv_schema[1])
fhvhv_sameSchema = fhvhv_schema[1]
#%store fhvhv_sameSchema
with open(v_direc + "fhvhv_sameSchema",'wb') as fhvhvsameSchema:
    pickle.dump(fhvhv_sameSchema,fhvhvsameSchema)   

In or after the following files, there was the following schema change:

In [24]:
fhvhv_schema[0]

The final schema was:

In [25]:
print(file_schema(filepaths[1][-1]))

['hvfhs_license_num', 'dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag']


The schema for the FHVHV set does not appear to have ever changed.

The index of the following variables has changed:

In [26]:
fhvhv_schema[2]

| | file 0 | file 1 | index 0 | index 1 |
|---|---|---|---|---|


## Analysis of schema changes for green cab data files

Analyze the schema changes for the green taxi data files. Write down your conclusions

In [27]:
green_missing = date_analysis(filepaths[2])
green_schema = column_analysis(filepaths[2])

The following months are missing from this sub-dataset:

In [28]:
print(green_missing)

[(2019, 7), (2019, 8), (2019, 9), (2019, 10), (2019, 11), (2019, 12), (2020, 3)]


In the following files, the schema is the same:

In [29]:
#print(green_schema[1])
green_sameSchema = green_schema[1]
#%store green_sameSchema
with open(v_direc + "green_sameSchema",'wb') as greensameSchema:
    pickle.dump(green_sameSchema,greensameSchema)   

In or after the following files, there was the following schema change:

In [30]:
green_schema[0]

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| dropped from green_tripdata_2014-12.csv | | | | |
| added to green_tripdata_2015-01.csv | improvement_surcharge | | | |
| dropped from green_tripdata_2016-06.csv | pickup_longitude | dropoff_latitude | pickup_latitude | dropoff_longitude |
| added to green_tripdata_2016-07.csv | pulocationid | dolocationid | | |
| dropped from green_tripdata_2018-12.csv | | | | |
| added to green_tripdata_2019-01.csv | congestion_surcharge | | | |


The final schema was:

In [31]:
print(file_schema(filepaths[2][-1]))

['vendorid', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'ratecodeid', 'pulocationid', 'dolocationid', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge']


Observations on schema changes  

2016-06 and earlier:
- dolocationid and pulocationid can be computed from dropoff_longitude | dropoff_latitude and pickup_longitude | pickup_latitude

The index of the following variables has changed:

In [32]:
green_schema[2]

| | file 0 | file 1 | index 0 | index 1 |
|---|---|---|---|---|
| total_amount, y:2015, m:1 | green_tripdata_2014-12.csv | green_tripdata_2015-01.csv | 17 | 18 |
| payment_type, y:2015, m:1 | green_tripdata_2014-12.csv | green_tripdata_2015-01.csv | 18 | 19 |
| trip_type, y:2015, m:1 | green_tripdata_2014-12.csv | green_tripdata_2015-01.csv | 19 | 20 |
| passenger_count, y:2016, m:7 | green_tripdata_2016-06.csv | green_tripdata_2016-07.csv | 9 | 7 |
| trip_distance, y:2016, m:7 | green_tripdata_2016-06.csv | green_tripdata_2016-07.csv | 10 | 8 |
| fare_amount, y:2016, m:7 | green_tripdata_2016-06.csv | green_tripdata_2016-07.csv | 11 | 9 |
| extra, y:2016, m:7 | green_tripdata_2016-06.csv | green_tripdata_2016-07.csv | 12 | 10 |
| mta_tax, y:2016, m:7 | green_tripdata_2016-06.csv | green_tripdata_2016-07.csv | 13 | 11 |
| tip_amount, y:2016, m:7 | green_tripdata_2016-06.csv | green_tripdata_2016-07.csv | 14 | 12 |
| tolls_amount, y:2016, m:7 | green_tripdata_2016-06.csv | green_tripdata_2016-07.csv | 15 | 13 |


## Analysis of schema changes for yellow cab data files

Analyze the schema changes for the Yellow taxi data files. Write down your conclusions

In [33]:
yellow_missing = date_analysis(filepaths[3])
yellow_schema = column_analysis(filepaths[3])

The following months are missing from this sub-dataset:

In [34]:
print(yellow_missing)

[(2019, 7), (2019, 8), (2019, 9), (2019, 10), (2019, 11), (2019, 12), (2020, 3)]


In the following files, the schema is the same:

In [35]:
#print(yellow_schema[1])
yellow_sameSchema = yellow_schema[1]
#%store yellow_sameSchema
with open(v_direc + "yellow_sameSchema",'wb') as yellowsameSchema:
    pickle.dump(yellow_sameSchema,yellowsameSchema)   

In or after the following files, there was the following schema change:

In [36]:
yellow_schema[0]

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dropped from yellow_tripdata_2009-12.csv | total_amt | start_lon | tolls_amt | start_lat | tip_amt | trip_dropoff_datetime | end_lat | vendor_name | trip_pickup_datetime | fare_amt | end_lon | store_and_forward |
| added to yellow_tripdata_2010-01.csv | vendor_id | tolls_amount | dropoff_latitude | total_amount | store_and_fwd_flag | tip_amount | dropoff_datetime | dropoff_longitude | fare_amount | pickup_longitude | pickup_latitude | pickup_datetime |
| dropped from yellow_tripdata_2014-12.csv | vendor_id | dropoff_datetime | rate_code | surcharge | pickup_datetime | | | | | | | |
| added to yellow_tripdata_2015-01.csv | ratecodeid | improvement_surcharge | tpep_dropoff_datetime | tpep_pickup_datetime | extra | vendorid | | | | | | |
| dropped from yellow_tripdata_2016-06.csv | pickup_longitude | dropoff_latitude | pickup_latitude | dropoff_longitude | | | | | | | | |
| added to yellow_tripdata_2016-07.csv | pulocationid | dolocationid | | | | | | | | | | |
| dropped from yellow_tripdata_2018-12.csv | | | | | | | | | | | | |
| added to yellow_tripdata_2019-01.csv | congestion_surcharge | | | | | | | | | | | |


The final schema was:

In [37]:
print(file_schema(filepaths[3][-1]))

['vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'ratecodeid', 'store_and_fwd_flag', 'pulocationid', 'dolocationid', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge']


Observations on schema changes  

2016-06 and earlier:
- dolocationid and pulocationid can be computed from dropoff_longitude | dropoff_latitude and pickup_longitude | pickup_latitude

2014-12 and earlier:
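- surcharge = extra (most likely; improvement_surcharge appears to be a genuinely new column, as it was also added to the green files in 2015-01 without any column being dropped)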
- dropoff_datetime = tpep_dropoff_datetime
- vendor_id = vendorid
- note: vendor_id to vendorid values changed from CMT and VTS to 1 and 2
- rate_code = ratecodeid
- pickup_datetime = tpep_pickup_datetime

2009-12 and earlier:
- total_amt = total_amount
- store_and_forward = store_and_fwd_flag
- fare_amt = fare_amount
- tip_amt = tip_amount
- end_lon = dropoff_longitude
- trip_pickup_datetime = pickup_datetime = tpep_pickup_datetime
- tolls_amt = tolls_amount
- end_lat = dropoff_latitude
- trip_dropoff_datetime = dropoff_datetime = tpep_dropoff_datetime
- vendor_name = vendor_id = vendorid
- start_lon = pickup_longitude
- start_lat = pickup_latitude

The index of the following variables has changed:

In [38]:
yellow_schema[2]

| | file 0 | file 1 | index 0 | index 1 |
|---|---|---|---|---|
| total_amount, y:2015, m:1 | yellow_tripdata_2014-12.csv | yellow_tripdata_2015-01.csv | 17 | 18 |
| ratecodeid, y:2016, m:7 | yellow_tripdata_2016-06.csv | yellow_tripdata_2016-07.csv | 7 | 5 |
| store_and_fwd_flag, y:2016, m:7 | yellow_tripdata_2016-06.csv | yellow_tripdata_2016-07.csv | 8 | 6 |
| payment_type, y:2016, m:7 | yellow_tripdata_2016-06.csv | yellow_tripdata_2016-07.csv | 11 | 9 |
| fare_amount, y:2016, m:7 | yellow_tripdata_2016-06.csv | yellow_tripdata_2016-07.csv | 12 | 10 |
| extra, y:2016, m:7 | yellow_tripdata_2016-06.csv | yellow_tripdata_2016-07.csv | 13 | 11 |
| mta_tax, y:2016, m:7 | yellow_tripdata_2016-06.csv | yellow_tripdata_2016-07.csv | 14 | 12 |
| tip_amount, y:2016, m:7 | yellow_tripdata_2016-06.csv | yellow_tripdata_2016-07.csv | 15 | 13 |
| tolls_amount, y:2016, m:7 | yellow_tripdata_2016-06.csv | yellow_tripdata_2016-07.csv | 16 | 14 |
| improvement_surcharge, y:2016, m:7 | yellow_tripdata_2016-06.csv | yellow_tripdata_2016-07.csv | 17 | 15 |


## Final observations

A number of variables have been renamed (and probably had their format changed) over the years, and column order changes are common. In order to integrate the data, all schemas can be re-ordered to a common format without much loss of efficiency (as the majority of the data will require this anyway). The missing months are worrying, but given that the corresponding files exist on the TLC website, we assume the gaps are an artifact of the sampling process.  
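
As a rough illustration of what bringing the schemas to a common format could look like, the sketch below renames old-style columns and then re-orders them with pandas. The rename map is only a partial excerpt of the yellow observations above, while the target order and the data row are made up. It is not the integration code of 2.2.

```python
import pandas as pd

# Partial rename map taken from the yellow observations; not exhaustive
YELLOW_RENAMES = {
    "vendor_name": "vendorid",
    "fare_amt": "fare_amount",
    "total_amt": "total_amount",
}
# Assumed target order for this toy example only
TARGET_ORDER = ["vendorid", "fare_amount", "total_amount"]

# Made-up row in an old-style layout with a different column order
old_style = pd.DataFrame({"total_amt": [12.5], "vendor_name": ["CMT"], "fare_amt": [10.0]})

# rename maps the old names, reindex puts the columns in the common order
harmonised = old_style.rename(columns=YELLOW_RENAMES).reindex(columns=TARGET_ORDER)
```

Note that reindex fills any column missing from the input with NaN, so a real integration step would also have to decide how to treat columns that only exist in some years.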

Please run our code for 2.2 on the same machine/instance in order to correctly access the variables which we have stored in this 2.1 implementation.