# Sampled Dataset exploration, meta-data collection

WARNING: this notebook assumes that:

- The data are in "MY_PARENT_FOLDER/data/sampled/" folder. You can run the bash script "download_metadata.sh" to download data and metadata in the correct folders to execute the jupyter notebooks.
- The data are sampled to be run on a personnal computer.

In [10]:
# Imports go here
import os
import glob
import pandas as pd
import math
import numpy as np

## Statistics about the dataset

Compute basic statistics about the number of files in this sub-dataset, their size, and the number of records (lines) in each file. For length and number of records, give the min, max, mean, 25, 50, 75, 90 percentiles values.

### Files overview

In [3]:
print ('count of green tripdata files:')
!find data/sampled/green_tripdata_*.csv -type f | wc -l 
print ('count of yellow tripdata files:')
!find data/sampled/yellow_tripdata_*.csv -type f | wc -l
print ('count of fhv tripdata files:')
!find data/sampled/fhv_tripdata_*.csv -type f | wc -l 
print ('count of fhvhv tripdata files:')
!find data/sampled/fhvhv_tripdata_*.csv -type f | wc -l
print ('count of all tripdata files:')
!find data/sampled/*.csv -type f | wc -l

count of green tripdata files:
76
count of yellow tripdata files:
131
count of fhv tripdata files:
64
count of fhvhv tripdata files:
10
count of all tripdata files:
281


### Files stats in bytes

In [6]:
list_taxi = ["yellow", "green", "fhv", "fhvhv"]
#list_taxi = ["green"]
for taxi_brand in list_taxi :
    list_files = {}
    list_files[taxi_brand] = []
    nb_files = 0
    size_b = []
    # List the file from the same taxi company brand 
    for file in glob.glob("data/sampled/%s*.csv" %(taxi_brand)):
        nb_files = nb_files+1
        #Save in list the file name
        list_files[taxi_brand].append(file)
        size_b.append(os.path.getsize(file))
    print("For the file type called %s there are %i files." %(taxi_brand, nb_files))
    # Get size basic stats
    print("     Here are some metrics based on their size in bytes:")
    print("       -min value           : ", np.min(size_b))
    print("       -max value           : ", np.max(size_b))
    print("       -mean value          : ", np.round(np.mean(size_b),2))
    print("       -25 percentile value : ", np.quantile(size_b, .25)) 
    print("       -50 percentile value : ", np.quantile(size_b, .50)) 
    print("       -75 percentile value : ", np.quantile(size_b, .75)) 
    print("       -90 percentile value : ", np.quantile(size_b, .90))
    print("       -total file sum size : ", np.sum(size_b))

For the file type called yellow there are 131 files.
     Here are some metrics based on their size in bytes:
       -min value           :  43103
       -max value           :  5959352
       -mean value          :  3750759.69
       -25 percentile value :  1756967.0
       -50 percentile value :  4442047.0
       -75 percentile value :  5123591.5
       -90 percentile value :  5491438.0
       -total file sum size :  491349519
For the file type called green there are 76 files.
     Here are some metrics based on their size in bytes:
       -min value           :  2512
       -max value           :  570765
       -mean value          :  262437.29
       -25 percentile value :  121494.75
       -50 percentile value :  190194.5
       -75 percentile value :  456955.0
       -90 percentile value :  499751.0
       -total file sum size :  19945234
For the file type called fhv there are 74 files.
     Here are some metrics based on their size in bytes:
       -min value           :  52060


### Files stats in rows

In [7]:
for taxi_brand in list_taxi :
    names={}
    size_r=[]
    for fn in glob.glob("data/sampled/%s*.csv" %(taxi_brand)):
        with open(fn) as f:
            names[fn]=sum(1 for line in f if line.strip())      
        # Save in list files sizes in rows
        size_r=list(names.values())
    print("For the taxi brand called %s there are still %i files." %(taxi_brand, nb_files))
    # Get size basic stat
    print("     Here are some metrics based on their size in rows count:")
    print("       -min value           : ", np.min(size_r))
    print("       -max value           : ", np.max(size_r))
    print("       -mean value          : ", np.round(np.mean(size_r),2))
    print("       -25 percentile value : ", np.quantile(size_r, .25)) 
    print("       -50 percentile value : ", np.quantile(size_r, .50)) 
    print("       -75 percentile value : ", np.quantile(size_r, .75)) 
    print("       -90 percentile value : ", np.quantile(size_r, .90))
    print("       -total row number    : ", np.sum(size_r))

For the taxi brand called yellow there are still 10 files.
     Here are some metrics based on their size in rows count:
       -min value           :  477
       -max value           :  32301
       -mean value          :  24204.51
       -25 percentile value :  19990.0
       -50 percentile value :  26295.0
       -75 percentile value :  29081.0
       -90 percentile value :  30203.0
       -total row number    :  3170791
For the taxi brand called green there are still 10 files.
     Here are some metrics based on their size in rows count:
       -min value           :  16
       -max value           :  3547
       -mean value          :  2027.5
       -25 percentile value :  1359.75
       -50 percentile value :  2073.5
       -75 percentile value :  2893.25
       -90 percentile value :  3127.0
       -total row number    :  154090
For the taxi brand called fhv there are still 10 files.
     Here are some metrics based on their size in rows count:
       -min value           :  960

## Analysis of the schema evolution.

Over time, the relational schema associated to each type of trip data (yellow, green, fhv, hvfhv) has changed. Let us analyze the changes.

## Auxiliary functions

In [9]:
# Code to help analyze the schema changes goes here
list_taxi = ["yellow", "green", "fhv", "fhvhv"]
#list_taxi = ["green"]
for taxi_brand in list_taxi :
    list_files = {}
    list_files[taxi_brand] = []
    nb_files = 0
    size = []
    count = 0
     # List the file from the same taxi company brand 
    for file in glob.glob("data/sampled/%s_*.csv" %(taxi_brand)):
        nb_files = nb_files+1
        # Save in list the file name
        list_files[taxi_brand].append(file)
        #size.append(os.path.getsize(file))
    # Order the files list by file names
    list_files[taxi_brand].sort()
    print("- For %s there are %i files." %(taxi_brand, nb_files))
    for yr in range(0,nb_files):
        # Read file by file
        df = pd.read_csv(list_files[taxi_brand][yr],error_bad_lines=False, sep=',')
        # Extract the head of the file
        head = list(df)
        head_lower = [ x.lower() for x in head ]
        # Remove the blank space in the col name
        head_lower_clean = [x.strip(' ') for x in head_lower]
        #print(yr,head_lower)
        # counter for check file
        count = count + 1
        # Save first file schema as reference
        if yr == 0 :
            nb_col_ref = len(head)
            col_name_ref = head_lower_clean
            time_period = int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27])
            # print(col_name_ref)
        else :
            # Compare reference schema with the shema of all the others files
            if head_lower_clean != col_name_ref :
                diff_name_col_add = [ x for x in head_lower_clean if x not in set(col_name_ref) ]
                diff_name_col_rm = [ x for x in col_name_ref if x not in set (head_lower_clean) ]
                #print(diff_name_col_add)
                #print(diff_name_col_rm)
                if len(diff_name_col_add) > 0 :
                    diff_name_col = diff_name_col_add
                elif len(diff_name_col_add) == 0 :
                    diff_name_col = diff_name_col_rm
                #pos_col_change = []
                #for i in range(0,len(diff_name_col)) :
                #    pos_col_change.append(head_lower_clean.index(diff_name_col[i]))
                print("     In %i - %i :" %(int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27]),int(list_files[taxi_brand][yr][len(taxi_brand)+28:len(taxi_brand)+30])))
                print("         %i diff on a total of %i col:" %(len(diff_name_col),len(head_lower_clean)), diff_name_col)
                # Check if the order of the column have changed
                #print(head_lower_clean)
                 #print(col_name_ref)
                #print(diff_name_col)
                ## If the numbers of column change
                if len(head_lower_clean) != nb_col_ref :
                    # If the numbers of column is different from reference yr check what is the new column name
                    diff_nb_col = len(head_lower_clean) - nb_col_ref
                    # find the new/remove col name())
                    #diff_name_col = [ x for x in head_lower_clean if x not in set(col_name_ref) ]
                    # Check if the order of the column have changed
                    #print(diff_name_col)
                    #pos_col_change = []
                    #for i in range(0,len(diff_name_col)) :
                    #    pos_col_change.append(head_lower_clean.index(diff_name_col[i]))
                    if diff_nb_col > 0 :
                            print("             %i/%i col add" %(diff_nb_col,len(diff_name_col)))
                    elif diff_nb_col < 0 :
                        print("             %i/%i col remove" %(abs(diff_nb_col),len(diff_name_col)))
                    if abs(diff_nb_col) != len(diff_name_col) :
                        nb_name_diff = len(diff_name_col) - abs(diff_nb_col) 
                        print("             %i/%i name change" %(nb_name_diff,len(diff_name_col)))
                elif (sum(len(i) for i in head_lower_clean)) != (sum(len(i) for i in col_name_ref)) :
                    nb_name_change = len(set(col_name_ref) - set(head_lower_clean))
                    print("             %i/%i column name have changed:" %(nb_name_change, len(diff_name_col)))
                    new_name=set(col_name_ref) - set(head_lower_clean)
                    old_name=set(head_lower_clean) - set(col_name_ref)
                    #print("                 New name are:",new_name)
                    #print(old_name, col_name_ref)
                col_name_ref = head_lower_clean
                nb_col_ref = len(head_lower_clean)
                count = count - 1
    if count == nb_files:
        print("     There is no diff between files")

- For yellow there are 131 files.
     In 2010 - 1 :
         12 diff on a total of 18 col: ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'pickup_longitude', 'pickup_latitude', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'fare_amount', 'tip_amount', 'tolls_amount', 'total_amount']
             12/12 column name have changed:


b'Skipping line 154: expected 18 fields, saw 19\nSkipping line 2380: expected 18 fields, saw 19\nSkipping line 8408: expected 18 fields, saw 19\nSkipping line 11210: expected 18 fields, saw 19\nSkipping line 11353: expected 18 fields, saw 19\nSkipping line 11663: expected 18 fields, saw 19\nSkipping line 13047: expected 18 fields, saw 19\nSkipping line 13900: expected 18 fields, saw 19\nSkipping line 14577: expected 18 fields, saw 19\nSkipping line 15041: expected 18 fields, saw 19\nSkipping line 15844: expected 18 fields, saw 19\n'


     In 2015 - 1 :
         6 diff on a total of 19 col: ['vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'ratecodeid', 'extra', 'improvement_surcharge']
             1/6 col add
             5/6 name change
     In 2016 - 7 :
         2 diff on a total of 17 col: ['pulocationid', 'dolocationid']
             2/2 col remove
     In 2019 - 1 :
         1 diff on a total of 18 col: ['congestion_surcharge']
             1/1 col add
- For green there are 76 files.
     In 2015 - 1 :
         1 diff on a total of 21 col: ['improvement_surcharge']
             1/1 col add
     In 2016 - 7 :
         2 diff on a total of 19 col: ['pulocationid', 'dolocationid']
             2/2 col remove
     In 2019 - 1 :
         1 diff on a total of 20 col: ['congestion_surcharge']
             1/1 col add
- For fhv there are 64 files.
     In 2017 - 1 :
         4 diff on a total of 5 col: ['pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid']
             2/4 col add
      

### Analysis of schema changes for fhv cab data files

Analyze the schema changes for the FHV cab data files. Write down your conclusions

- For fhv there are 64 files:

     In 2017 - 1 :
         4 diff on a total of 5 col: ['pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid']
             2/4 col add
             2/4 name change
     In 2017 - 7 :
         1 diff on a total of 6 col: ['sr_flag']
             1/1 col add
     In 2018 - 1 :
         1 diff on a total of 7 col: ['dispatching_base_number']
             1/1 col add
     In 2019 - 1 :
         1 diff on a total of 6 col: ['dispatching_base_number']
             1/1 col remove

### Analysis of schema changes for fhvhv data files

Analyze the schema changes for the FHVHV cab data files. Write down your conclusions

- For fhvhv there are 10 files.
     There is no diff between files

### Analysis of schema changes for green cab data files

Analyze the schema changes for the green taxi data files. Write down your conclusions

- For green there are 76 files:

     In 2015 - 1 :
         1 diff on a total of 21 col: ['improvement_surcharge']
             1/1 col add
     In 2016 - 7 :
         2 diff on a total of 19 col: ['pulocationid', 'dolocationid']
             2/2 col remove
     In 2019 - 1 :
         1 diff on a total of 20 col: ['congestion_surcharge']
             1/1 col add

### Analysis of schema changes for yellow cab data files

Analyze the schema changes for the Yellow taxi data files. Write down your conclusions

 For yellow there are 131 files:
 
     In 2010 - 1 :
         12 diff on a total of 18 col: ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'pickup_longitude', 'pickup_latitude', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'fare_amount', 'tip_amount', 'tolls_amount', 'total_amount']
             12/12 column name have changed:
     In 2015 - 1 :
         6 diff on a total of 19 col: ['vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'ratecodeid', 'extra', 'improvement_surcharge']
             1/6 col add
             5/6 name change
     In 2016 - 7 :
         2 diff on a total of 17 col: ['pulocationid', 'dolocationid']
             2/2 col remove
     In 2019 - 1 :
         1 diff on a total of 18 col: ['congestion_surcharge']
             1/1 col add