# Sampled Dataset exploration, meta-data collection

WARNING: this notebook assumes that:

- The data are in the "MY_PARENT_FOLDER/data/sampled/" folder. You can run the bash script "download_metadata.sh" to download the data and metadata into the correct folders before executing the Jupyter notebooks.
- The data are sampled so that the notebook can run on a personal computer.

## Folder configuration

Run the bash script below to check that the folders and the data are set up correctly. If you get an error, follow the instructions it prints or run the script download_metadata.sh.

In [2]:
# Check that the files and folders are in the right place to execute the following code.
!./check_files.sh

------------ READY TO START -----------


In [3]:
# Import python modules
import os
import glob
import pandas as pd
import math
import numpy as np
import pickle 

## Statistics about the dataset

Compute basic statistics about the number of files in this sub-dataset, their size, and the number of records (lines) in each file. For file size and number of records, report the min, max, mean, and the 25th, 50th, 75th, and 90th percentiles.
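
The metrics listed above can be gathered by one helper, so that the same summary is produced for sizes and for row counts. The sketch below is a minimal standard-library version; `percentile` and `summarize` are hypothetical names, and the linear interpolation is meant to mirror what `np.quantile` does by default.

```python
import statistics

def percentile(values, q):
    """Return the q-th percentile (0 <= q <= 100) with linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    # Fractional position of the percentile in the sorted list
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def summarize(values):
    """Collect the metrics used in this notebook into a dictionary."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": round(statistics.mean(values), 2),
        "p25": percentile(values, 25),
        "p50": percentile(values, 50),
        "p75": percentile(values, 75),
        "p90": percentile(values, 90),
        "total": sum(values),
    }

# Hypothetical values, for illustration only
stats = summarize([1, 2, 3, 4, 5])
```

Returning a dictionary rather than printing keeps the numbers available for later cells, for instance to compare the taxi types side by side.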

### Files overview

In [4]:
print ('count of green tripdata files:')
!find data/sampled/green_tripdata_*.csv -type f | wc -l 
print ('count of yellow tripdata files:')
!find data/sampled/yellow_tripdata_*.csv -type f | wc -l
print ('count of fhv tripdata files:')
!find data/sampled/fhv_tripdata_*.csv -type f | wc -l 
print ('count of fhvhv tripdata files:')
!find data/sampled/fhvhv_tripdata_*.csv -type f | wc -l
print ('count of all tripdata files:')
!find data/sampled/*.csv -type f | wc -l

count of green tripdata files:
76
count of yellow tripdata files:
131
count of fhv tripdata files:
64
count of fhvhv tripdata files:
10
count of all tripdata files:
281
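
The same counts can be obtained without shell commands. The sketch below is one possible Python equivalent; `count_trip_files` is a hypothetical helper, and it assumes the files are named `<type>_tripdata_<date>.csv` as in the `find` patterns above. Keeping `_tripdata_` in the pattern prevents `fhv` from also matching the `fhvhv` files.

```python
from pathlib import Path

def count_trip_files(folder, prefixes=("green", "yellow", "fhv", "fhvhv")):
    """Count the CSV files of each trip-data type in a folder."""
    folder = Path(folder)
    counts = {}
    for prefix in prefixes:
        # "_tripdata_" keeps "fhv" from matching "fhvhv_tripdata_..." files
        pattern = f"{prefix}_tripdata_*.csv"
        counts[prefix] = sum(1 for path in folder.glob(pattern) if path.is_file())
    return counts
```

A call such as `count_trip_files("data/sampled")` would return one count per type.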


### Files stats in bytes

In [5]:
list_taxi = ["yellow", "green", "fhv", "fhvhv"]
#list_taxi = ["green"]
for taxi_brand in list_taxi :
    list_files = {}
    list_files[taxi_brand] = []
    nb_files = 0
    size_b = []
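    # List the files from the same taxi company brand
    # NOTE: the "fhv*" pattern also matches the fhvhv files, so the fhv figures include them (64 + 10 = 74 files)
    for file in glob.glob("data/sampled/%s*.csv" %(taxi_brand)):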
        nb_files = nb_files+1
        # Save in list the file name
        list_files[taxi_brand].append(file)
        size_b.append(os.path.getsize(file))
    print("For the file type called %s there are %i files." %(taxi_brand, nb_files))
    # Get size basic stat
    print("     Here are some metrics based on their size in bytes:")
    print("       -min value           : ", np.min(size_b))
    print("       -max value           : ", np.max(size_b))
    print("       -mean value          : ", np.round(np.mean(size_b),2))
    print("       -25 percentile value : ", np.quantile(size_b, .25)) 
    print("       -50 percentile value : ", np.quantile(size_b, .50)) 
    print("       -75 percentile value : ", np.quantile(size_b, .75)) 
    print("       -90 percentile value : ", np.quantile(size_b, .90))
    print("       -total file sum size : ", np.sum(size_b))

For the file type called yellow there are 131 files.
     Here are some metrics based on their size in bytes:
       -min value           :  43103
       -max value           :  5959352
       -mean value          :  3750759.69
       -25 percentile value :  1756967.0
       -50 percentile value :  4442047.0
       -75 percentile value :  5123591.5
       -90 percentile value :  5491438.0
       -total file sum size :  491349519
For the file type called green there are 76 files.
     Here are some metrics based on their size in bytes:
       -min value           :  2512
       -max value           :  570765
       -mean value          :  262437.29
       -25 percentile value :  121494.75
       -50 percentile value :  190194.5
       -75 percentile value :  456955.0
       -90 percentile value :  499751.0
       -total file sum size :  19945234
For the file type called fhv there are 74 files.
     Here are some metrics based on their size in bytes:
       -min value           :  52060


### Files stats in rows

In [6]:
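for taxi_brand in list_taxi :
    names={}
    # NOTE: the "fhv*" pattern also matches the fhvhv files
    for fn in glob.glob("data/sampled/%s*.csv" %(taxi_brand)):
        with open(fn) as f:
            # Count the non-empty lines of the file (header line included)
            names[fn]=sum(1 for line in f if line.strip())
    # Save in a list the file sizes in rows
    size_r=list(names.values())
    # NOTE: nb_files is left over from the previous cell (last brand processed); it is not the number of files read here
    print("For the taxi brand called %s there are still %i files." %(taxi_brand, nb_files))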
    # Get size basic stat
    print("     Here are some metrics based on their size in rows count:")
    print("       -min value           : ", np.min(size_r))
    print("       -max value           : ", np.max(size_r))
    print("       -mean value          : ", np.round(np.mean(size_r),2))
    print("       -25 percentile value : ", np.quantile(size_r, .25)) 
    print("       -50 percentile value : ", np.quantile(size_r, .50)) 
    print("       -75 percentile value : ", np.quantile(size_r, .75)) 
    print("       -90 percentile value : ", np.quantile(size_r, .90))
    print("       -total row number    : ", np.sum(size_r))

For the taxi brand called yellow there are still 10 files.
     Here are some metrics based on their size in rows count:
       -min value           :  477
       -max value           :  32301
       -mean value          :  24204.51
       -25 percentile value :  19990.0
       -50 percentile value :  26295.0
       -75 percentile value :  29081.0
       -90 percentile value :  30203.0
       -total row number    :  3170791
For the taxi brand called green there are still 10 files.
     Here are some metrics based on their size in rows count:
       -min value           :  16
       -max value           :  3547
       -mean value          :  2027.5
       -25 percentile value :  1359.75
       -50 percentile value :  2073.5
       -75 percentile value :  2893.25
       -90 percentile value :  3127.0
       -total row number    :  154090
For the taxi brand called fhv there are still 10 files.
     Here are some metrics based on their size in rows count:
       -min value           :  960

## Analysis of the schema evolution.

Over time, the relational schema associated with each type of trip data (yellow, green, fhv, fhvhv) has changed. Let us analyze the changes.
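
Comparing two schemas comes down to normalizing the column names and taking the difference in both directions. The sketch below shows the idea on plain lists; `normalize` and `schema_diff` are hypothetical helpers, and the headers are illustrative, not the full trip-data schemas.

```python
def normalize(columns):
    """Lower-case the column names and strip surrounding blanks."""
    return [name.strip().lower() for name in columns]

def schema_diff(reference, new):
    """Return (added, removed) column names between two headers."""
    ref, cur = normalize(reference), normalize(new)
    ref_set, cur_set = set(ref), set(cur)
    # Keep the original column order in both result lists
    added = [name for name in cur if name not in ref_set]
    removed = [name for name in ref if name not in cur_set]
    return added, removed

# Illustrative headers only
old_header = ["VendorID", " pickup_longitude", "pickup_latitude ", "fare_amount"]
new_header = ["VendorID", "PULocationID", "fare_amount"]
added, removed = schema_diff(old_header, new_header)
```

Reporting both lists matters because a rename shows up as one added and one removed column, while a pure addition leaves the removed list empty.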

## Auxiliary functions

In [39]:
# Code to help analyze the schema changes goes here
# Get the list of file
list_taxi = ["yellow", "green", "fhv", "fhvhv"]
for taxi_brand in list_taxi :
    list_files = {}
    yr_list = {}
    mth_list = {}
    list_files[taxi_brand] = []
    yr_list[taxi_brand] = []
    mth_list[taxi_brand] = []
    nb_files = 0
    size = []
    count = 0
    # List the file from the same taxi company brand 
    for file in glob.glob("data/sampled/%s_*.csv" %(taxi_brand)):
        nb_files = nb_files+1
        # Save in list the files name
        list_files[taxi_brand].append(file)
    # Order by date the file list
    list_files[taxi_brand].sort()
    print("------------ [%s] ------------" %(taxi_brand.upper()))
    print("For the taxi brand called %s there are %i files." %(taxi_brand , nb_files))
    for yr in range(0,nb_files):
        # Read file by file (on_bad_lines requires pandas >= 1.3; older versions use error_bad_lines=False instead)
        df = pd.read_csv(list_files[taxi_brand][yr], on_bad_lines='warn', sep=',')
        # Extract the head of the file
        head = list(df)
        head_lower = [ x.lower() for x in head ]
        # Remove the blank space in the col name
        head_lower_clean = [x.strip(' ') for x in head_lower]
        # counter for check file
        count = count + 1
        # Save first file schema as reference
        if yr == 0 :
            nb_col_ref = len(head)
            col_name_ref = head_lower_clean
            time_period = int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27])
            year = int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27])
            month = int(list_files[taxi_brand][yr][len(taxi_brand)+28:len(taxi_brand)+30])
            yr_list[taxi_brand].append(year)
            mth_list[taxi_brand].append(month)
            print("    --> In %i - %i : save 1st reference schema" %(year,month))
        else :
            # Compare the reference schema with the schema of all the other files
            if head_lower_clean != col_name_ref :
                diff_name_col_add = [ x for x in head_lower_clean if x not in set(col_name_ref) ]
                diff_name_col_rm = [ x for x in col_name_ref if x not in set (head_lower_clean) ]
                if len(diff_name_col_add) > 0 :
                    diff_name_col = diff_name_col_add
                elif len(diff_name_col_add) == 0 :
                    diff_name_col = diff_name_col_rm
                year = int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27])
                month = int(list_files[taxi_brand][yr][len(taxi_brand)+28:len(taxi_brand)+30])
                yr_list[taxi_brand].append(year)
                mth_list[taxi_brand].append(month)
                print("    --> In %i - %i : %i diff in new schema compared to reference schema on a total of %i columns" %(year,month,len(diff_name_col),len(head_lower_clean)))
                print("               New col. not in ref:",diff_name_col_add)
                print("               Ref col. not in new",diff_name_col_rm)
                col_name_ref = head_lower_clean
                nb_col_ref = len(head_lower_clean)
                count = count - 1
        if yr == nb_files-1 :
            year = int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27])
            month = int(list_files[taxi_brand][yr][len(taxi_brand)+28:len(taxi_brand)+30])
            yr_list[taxi_brand].append(year)
            mth_list[taxi_brand].append(month)
    if count == nb_files:
        print("     There is no diff between files")
        
    for i in range(0,len(yr_list[taxi_brand])-1) :
        date_in = str(yr_list[taxi_brand][i])+'-'+str(mth_list[taxi_brand][i])
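        # NOTE: a change in January ends the previous period in December of the year before;
        # for any other month the change month itself is printed, so two periods can share a month (e.g. 2016-7)
        if mth_list[taxi_brand][i+1] == 1 :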
            date_out =  str(yr_list[taxi_brand][i+1]-1)+'-12'
        else :
            date_out =  str(yr_list[taxi_brand][i+1])+'-'+str(mth_list[taxi_brand][i+1])
        print("Records from %s to %s use the same schema" %(date_in,date_out))
    print("Reference schema:",col_name_ref)
    print("------------------------------")
    # Save the dates of the schema changes for data cleaning
    days = np.ones(len(yr_list[taxi_brand]))
    print(days)
    d = {
        'year' : yr_list[taxi_brand],
        'month': mth_list[taxi_brand],
        'day'  : days
    }
    df_wrt = pd.DataFrame(data=d)
    df_wrt_time = pd.to_datetime(df_wrt)
    df_wrt_time.to_csv('data/Change_date_%s.csv' %(taxi_brand,), sep=',', )

------------ [YELLOW] ------------
For the taxi brand called yellow there are 131 files.
    --> In 2009 - 1 : save 1st reference schema
    --> In 2010 - 1 : 12 diff in new schema compared to reference schema on a total of 18 columns
               New col. not in ref: ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'pickup_longitude', 'pickup_latitude', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'fare_amount', 'tip_amount', 'tolls_amount', 'total_amount']
               Ref col. not in new ['vendor_name', 'trip_pickup_datetime', 'trip_dropoff_datetime', 'start_lon', 'start_lat', 'store_and_forward', 'end_lon', 'end_lat', 'fare_amt', 'tip_amt', 'tolls_amt', 'total_amt']


b'Skipping line 154: expected 18 fields, saw 19\nSkipping line 2380: expected 18 fields, saw 19\nSkipping line 8408: expected 18 fields, saw 19\nSkipping line 11210: expected 18 fields, saw 19\nSkipping line 11353: expected 18 fields, saw 19\nSkipping line 11663: expected 18 fields, saw 19\nSkipping line 13047: expected 18 fields, saw 19\nSkipping line 13900: expected 18 fields, saw 19\nSkipping line 14577: expected 18 fields, saw 19\nSkipping line 15041: expected 18 fields, saw 19\nSkipping line 15844: expected 18 fields, saw 19\n'


    --> In 2015 - 1 : 6 diff in new schema compared to reference schema on a total of 19 columns
               New col. not in ref: ['vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'ratecodeid', 'extra', 'improvement_surcharge']
               Ref col. not in new ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'rate_code', 'surcharge']
    --> In 2016 - 7 : 2 diff in new schema compared to reference schema on a total of 17 columns
               New col. not in ref: ['pulocationid', 'dolocationid']
               Ref col. not in new ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
    --> In 2019 - 1 : 1 diff in new schema compared to reference schema on a total of 18 columns
               New col. not in ref: ['congestion_surcharge']
               Ref col. not in new []
Records from 2009-1 to 2009-12 use the same schema
Records from 2010-1 to 2014-12 use the same schema
Records from 2015-1 to 2016-7 use the same schema
Records from 



    --> In 2015 - 1 : 1 diff in new schema compared to reference schema on a total of 21 columns
               New col. not in ref: ['improvement_surcharge']
               Ref col. not in new []
    --> In 2016 - 7 : 2 diff in new schema compared to reference schema on a total of 19 columns
               New col. not in ref: ['pulocationid', 'dolocationid']
               Ref col. not in new ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
    --> In 2019 - 1 : 1 diff in new schema compared to reference schema on a total of 20 columns
               New col. not in ref: ['congestion_surcharge']
               Ref col. not in new []
Records from 2013-8 to 2014-12 use the same schema
Records from 2015-1 to 2016-7 use the same schema
Records from 2016-7 to 2018-12 use the same schema
Records from 2019-1 to 2020-6 use the same schema
Reference schema: ['vendorid', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'ratecodeid', 'puloca

### Analysis of schema changes for fhv cab data files

FHV taxi (64 files) records go from January 2015 to June 2020 (December 2019 and February 2020 are missing).

#### Schema changes:

- In January 2017, the pickup date column was renamed from 'pickup_date' to 'pickup_datetime' and the drop-off time started to be recorded. From this date, both pick-up and drop-off location IDs are saved, instead of the pick-up location only.

- In July 2017, a new column 'sr_flag' was added. If sr_flag=1, the trip was part of a shared ride chain (e.g. Uber Pool, Lyft Line).

- In January 2018, a column named 'dispatching_base_number' was added, duplicating 'dispatching_base_num'; it was removed in January 2019.

------- 
- Records from 2015-1 to 2016-12 use the same schema
- Records from 2017-1 to 2017-6 use the same schema
- Records from 2017-7 to 2017-12 use the same schema
- Records from 2018-1 to 2018-12 use the same schema
- Records from 2019-1 to 2020-6 use the same schema
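
The period list above follows from the change dates alone: each schema is valid from its first month until the month before the next change, and the last one runs until the final file. A minimal sketch of that rule, where `previous_month` and `schema_periods` are hypothetical names:

```python
def previous_month(year, month):
    """Return the (year, month) pair that precedes the given month."""
    return (year - 1, 12) if month == 1 else (year, month - 1)

def schema_periods(change_dates, last_date):
    """Turn schema start dates into (start, end) periods, both inclusive."""
    periods = []
    for i, start in enumerate(change_dates):
        if i + 1 < len(change_dates):
            # The period stops the month before the next schema starts
            end = previous_month(*change_dates[i + 1])
        else:
            # The last schema runs until the last available file
            end = last_date
        periods.append((start, end))
    return periods

# Start dates of the fhv schemas and date of the last file, as listed above
fhv_changes = [(2015, 1), (2017, 1), (2017, 7), (2018, 1), (2019, 1)]
for start, end in schema_periods(fhv_changes, (2020, 6)):
    print("Records from %i-%i to %i-%i use the same schema" % (start + end))
```

Treating the last period separately avoids shortening it by one month, since its end date is the last file itself and not the start of another schema.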

#### Final columns schema

['dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag']

#### Deduce removed/changed/added column:

It is not possible to deduce the removed or added columns from the other ones. Drop-off time and location are saved only from 01/2017, and shared rides are recorded from 07/2017.

#### Conclusion extracted from the previous Python code execution:

    --> In 2015 - 1 : save 1st reference schema
    --> In 2017 - 1 : 4 diff in new schema compared to reference schema on a total of 5 columns
               New col. not in ref: ['pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid']
               Ref col. not in new ['pickup_date', 'locationid']
    --> In 2017 - 7 : 1 diff in new schema compared to reference schema on a total of 6 columns
               New col. not in ref: ['sr_flag']
               Ref col. not in new []
    --> In 2018 - 1 : 1 diff in new schema compared to reference schema on a total of 7 columns
               New col. not in ref: ['dispatching_base_number']
               Ref col. not in new []
    --> In 2019 - 1 : 1 diff in new schema compared to reference schema on a total of 6 columns
               New col. not in ref: []
               Ref col. not in new ['dispatching_base_number']
    Records from 2015-1 to 2016-12 use the same schema
    Records from 2017-1 to 2017-7 use the same schema
    Records from 2017-7 to 2017-12 use the same schema
    Records from 2018-1 to 2018-12 use the same schema
    Records from 2019-1 to 2020-6 use the same schema
    Reference schema: ['dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag']               

### Analysis of schema changes for fhvhv data files

FHVHV taxi (10 files) records go from February 2019 to June 2020; 7 monthly files are missing (July 2019 to December 2019, and March 2020).
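
Missing months such as these can be found by comparing the dates in the file names with the full range of expected months. The sketch below assumes file names ending in `_YYYY-MM.csv`; `missing_months` is a hypothetical helper, not part of the notebook.

```python
import re

def missing_months(filenames, first, last):
    """List the (year, month) pairs between first and last that have no file."""
    present = set()
    for name in filenames:
        # Assumed naming convention: ..._YYYY-MM.csv
        match = re.search(r"_(\d{4})-(\d{2})\.csv$", name)
        if match:
            present.add((int(match.group(1)), int(match.group(2))))
    missing = []
    year, month = first
    while (year, month) <= last:
        if (year, month) not in present:
            missing.append((year, month))
        # Move to the next calendar month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return missing
```

Parsing the date with a regular expression also avoids relying on fixed character positions in the file path.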

#### There is no change of the schema:

- Records from 2019-2 to 2020-6 use the same schema

#### Final columns schema

['hvfhs_license_num', 'dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag']

#### Conclusion extracted from the previous Python code execution:

    --> In 2019 - 2 : save 1st reference schema
     There is no diff between files
    Records from 2019-2 to 2020-6 use the same schema
    Reference schema: ['hvfhs_license_num', 'dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag']

### Analysis of schema changes for green cab data files

Green taxi (76 files) records go from August 2013 to June 2020; 7 monthly files are missing (July 2019 to December 2019, and March 2020).

#### Schema changes:

- In January 2015, a new column 'improvement_surcharge' was added: a $0.30 surcharge on hailed trips at the flag drop.

- In July 2016, pick-up and drop-off location IDs are saved instead of the pick-up and drop-off latitude and longitude.

- In January 2019, a new column 'congestion_surcharge' was added: a $2.75 surcharge if there is traffic congestion during the trip.

--------
- Records from 2013-8 to 2014-12 use the same schema
- Records from 2015-1 to 2016-6 use the same schema
- Records from 2016-7 to 2018-12 use the same schema
- Records from 2019-1 to 2020-6 use the same schema

#### Final columns schema:

['vendorid', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'ratecodeid', 'pulocationid', 'dolocationid', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge']

#### Deduce removed/changed/added column:

Before July 2016, the pick-up and drop-off location IDs can be deduced from the actual latitude and longitude of the pick-up and the drop-off.

The 'improvement_surcharge' and 'congestion_surcharge' are new taxes that can be deduced for the missing period. To compare trip prices over the whole period, these taxes can be subtracted from 'total_amount' when they are applied. Otherwise the taxes are set to $0.
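
One way to make fares comparable across the whole period is to subtract both surcharges from 'total_amount', counting a surcharge as $0 when the column does not exist yet. A minimal sketch on dictionary-style records; `fare_without_surcharges` is a hypothetical name and the records are made up for illustration.

```python
SURCHARGE_COLUMNS = ("improvement_surcharge", "congestion_surcharge")

def fare_without_surcharges(record):
    """Subtract the surcharges from total_amount; absent or empty ones count as 0."""
    total = float(record["total_amount"])
    for column in SURCHARGE_COLUMNS:
        value = record.get(column)
        if value not in (None, ""):
            total -= float(value)
    return round(total, 2)

# Hypothetical records, one from an early schema and one from the latest schema
early_record = {"total_amount": "10.50"}
late_record = {
    "total_amount": "13.55",
    "improvement_surcharge": "0.3",
    "congestion_surcharge": "2.75",
}
```

Both records give the same comparable fare once the surcharges are removed from the later one.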

#### Conclusion extracted from the previous Python code execution:

    --> In 2013 - 8 : save 1st reference schema
    --> In 2015 - 1 : 1 diff in new schema compared to reference schema on a total of 21 columns
               New col. not in ref: ['improvement_surcharge']
               Ref col. not in new []
    --> In 2016 - 7 : 2 diff in new schema compared to reference schema on a total of 19 columns
               New col. not in ref: ['pulocationid', 'dolocationid']
               Ref col. not in new ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
    --> In 2019 - 1 : 1 diff in new schema compared to reference schema on a total of 20 columns
               New col. not in ref: ['congestion_surcharge']
               Ref col. not in new []
    Records from 2013-8 to 2014-12 use the same schema
    Records from 2015-1 to 2016-7 use the same schema
    Records from 2016-7 to 2018-12 use the same schema
    Records from 2019-1 to 2020-6 use the same schema
    Reference schema: ['vendorid', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'ratecodeid', 'pulocationid', 'dolocationid', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge']

### Analysis of schema changes for yellow cab data files

Yellow taxi (131 files) records go from January 2009 to June 2020; 7 monthly files are missing (July 2019 to December 2019, and March 2020).

#### Schema changes:

- In January 2010, 12 columns are renamed:
        - 'vendor_name'          --> 'vendor_id'
        - 'trip_pickup_datetime' --> 'pickup_datetime'
        - 'trip_dropoff_datetime'--> 'dropoff_datetime'
        - 'start_lon'            --> 'pickup_longitude'
        - 'start_lat'            --> 'pickup_latitude'
        - 'store_and_forward'    --> 'store_and_fwd_flag'
        - 'end_lon'              --> 'dropoff_longitude'
        - 'end_lat'              --> 'dropoff_latitude'
        - 'fare_amt'             --> 'fare_amount'
        - 'tip_amt'              --> 'tip_amount'
        - 'tolls_amt'            --> 'tolls_amount'
        - 'total_amt'            --> 'total_amount'

- In January 2015, a new column 'extra' was added for rush hour and overnight charges ($0.5 and $1.0), and 5 columns were renamed:
        - 'vendor_id'       --> 'vendorid'
        - 'pickup_datetime' --> 'tpep_pickup_datetime'
        - 'dropoff_datetime'--> 'tpep_dropoff_datetime'
        - 'rate_code'       --> 'ratecodeid'
        - 'surcharge'       --> 'improvement_surcharge' (the amount of the surcharge changed from $0.5 to $0.3)

- In July 2016, pick-up and drop-off location IDs are saved instead of the pick-up and drop-off latitude and longitude.

- In January 2019, a new column 'congestion_surcharge' was added: a $2.75 surcharge if there is traffic congestion during the trip.

------------

- Records from 2009-1 to 2009-12 use the same schema
- Records from 2010-1 to 2014-12 use the same schema
- Records from 2015-1 to 2016-6 use the same schema
- Records from 2016-7 to 2018-12 use the same schema
- Records from 2019-1 to 2020-6 use the same schema

#### Final columns schema:

['vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'ratecodeid', 'store_and_fwd_flag', 'pulocationid', 'dolocationid', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge']
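
The renames described in this section can be collapsed into a single mapping from any older yellow column name to its latest name. The sketch below is one way to write it; `YELLOW_RENAMES` and `to_final_names` are hypothetical names. It leaves 'surcharge' untouched, since that rename also comes with a change of amount, and it only renames columns: it does not convert coordinates into location IDs.

```python
# Older name -> latest name, chaining the 2010 and 2015 renames
YELLOW_RENAMES = {
    "vendor_name": "vendorid",
    "vendor_id": "vendorid",
    "trip_pickup_datetime": "tpep_pickup_datetime",
    "pickup_datetime": "tpep_pickup_datetime",
    "trip_dropoff_datetime": "tpep_dropoff_datetime",
    "dropoff_datetime": "tpep_dropoff_datetime",
    "start_lon": "pickup_longitude",
    "start_lat": "pickup_latitude",
    "end_lon": "dropoff_longitude",
    "end_lat": "dropoff_latitude",
    "store_and_forward": "store_and_fwd_flag",
    "rate_code": "ratecodeid",
    "fare_amt": "fare_amount",
    "tip_amt": "tip_amount",
    "tolls_amt": "tolls_amount",
    "total_amt": "total_amount",
}

def to_final_names(columns):
    """Normalize a header and apply the renames; unknown names are kept as they are."""
    cleaned = [name.strip().lower() for name in columns]
    return [YELLOW_RENAMES.get(name, name) for name in cleaned]
```

With pandas, the same dictionary could be passed to `DataFrame.rename(columns=YELLOW_RENAMES)` once the header has been lower-cased and stripped.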


#### Deduce removed/changed/added column:

Before July 2016, the pick-up and drop-off location IDs can be deduced from the actual latitude and longitude of the pick-up and the drop-off.

The 'improvement_surcharge' and 'congestion_surcharge' are new taxes that cannot be deduced for the missing period. To compare trip prices over the whole period, these taxes can be subtracted from 'total_amount' when they are applied (from 01/2015 and 01/2019 respectively). Otherwise the taxes are set to $0.


#### Conclusion extracted from the previous Python code execution:

    --> In 2009 - 1 : save 1st reference schema
    --> In 2010 - 1 : 12 diff in new schema compared to reference schema on a total of 18 columns
               New col. not in ref: ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'pickup_longitude', 'pickup_latitude', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'fare_amount', 'tip_amount', 'tolls_amount', 'total_amount']
               Ref col. not in new ['vendor_name', 'trip_pickup_datetime', 'trip_dropoff_datetime', 'start_lon', 'start_lat', 'store_and_forward', 'end_lon', 'end_lat', 'fare_amt', 'tip_amt', 'tolls_amt', 'total_amt']
    --> In 2015 - 1 : 6 diff in new schema compared to reference schema on a total of 19 columns
               New col. not in ref: ['vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'ratecodeid', 'extra', 'improvement_surcharge']
               Ref col. not in new ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'rate_code', 'surcharge']
    --> In 2016 - 7 : 2 diff in new schema compared to reference schema on a total of 17 columns
               New col. not in ref: ['pulocationid', 'dolocationid']
               Ref col. not in new ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
    --> In 2019 - 1 : 1 diff in new schema compared to reference schema on a total of 18 columns
               New col. not in ref: ['congestion_surcharge']
               Ref col. not in new []
    Records from 2009-1 to 2009-12 use the same schema
    Records from 2010-1 to 2014-12 use the same schema
    Records from 2015-1 to 2016-7 use the same schema
    Records from 2016-7 to 2018-12 use the same schema
    Records from 2019-1 to 2020-6 use the same schema
    Reference schema: ['vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'ratecodeid', 'store_and_fwd_flag', 'pulocationid', 'dolocationid', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge']

In [1]:
import pickle
