# Big Data Project H600 / L-Group : real world data exploration, integration, cleaning, transformation and analysis.

## 1- Files & Data Exploration

 **WARNING: this notebook assumes that:**
    - The data are in "MY_PARENT_FOLDER/data/sampled/" folder. You can run the bash script "download_metadata.sh" to download data and metadata in the correct folders to execute the jupyter notebooks.
    - The data are sampled to be run on a personnal computer.
    
### 1.1- Short descriptions of the Data:
The New York City Taxi and Limousine Commission (or TLC for short) has been publishing records about taxi trips in New York since 2009. The TLC trip dataset actually consists of 4 sub-datasets:

    1. Yellow taxi records are records that record trip information of New York's famous yellow taxi cars.
    2. Green taxi records are records that record trip information by so-called 'boro' taxis, a newer service introduced in August of 2013 to improve taxi service and availability in the boroughs. 
    3. FHV records (short for 'For Hire Vehicles') record information from services that offered for-hire vehicles (such as Uber, Lyft, Via, and Juno), but also luxury limousine bases.
    4. High volume FHV (FHVHV for short) are FHV records offered by services that make more than 10,000 trips per day.

### 1.2- Check folder and data configuration

Run the bash script in order to check if the folder and the data are well configured.
If you get an error, please follow the instructions or run the script download_metadata.sh.

In [1]:
# Check if the files and forlder are on the good folder to exectue the following code.
!./check_files.sh


------------ READY TO START -----------


### 1.3- Count of files per type

In [1]:
print ('count of green tripdata files:')
!find data/sampled/green_tripdata_*.csv -type f | wc -l 
print ('count of yellow tripdata files:')
!find data/sampled/yellow_tripdata_*.csv -type f | wc -l
print ('count of fhv tripdata files:')
!find data/sampled/fhv_tripdata_*.csv -type f | wc -l 
print ('count of fhvhv tripdata files:')
!find data/sampled/fhvhv_tripdata_*.csv -type f | wc -l
print ('count of all tripdata files:')
!find data/sampled/*.csv -type f | wc -l

count of green tripdata files:
76
count of yellow tripdata files:
131
count of fhv tripdata files:
64
count of fhvhv tripdata files:
10
count of all tripdata files:
281


### 1.4- Files size and metrics : bytes

In [1]:
import os
import glob
import pandas as pd
import math
import numpy as np
import datetime as dt

list_taxi = ["yellow", "green", "fhv", "fhvhv"]
#list_taxi = ["green"]
for taxi_brand in list_taxi :
    list_files = {}
    list_files[taxi_brand] = []
    nb_files = 0
    size_b = []
    # List the file from the same taxi company brand 
    for file in glob.glob("data/sampled/%s_*.csv" %(taxi_brand)):
        nb_files = nb_files+1
        # Save in list the file name
        list_files[taxi_brand].append(file)
        size_b.append(os.path.getsize(file))

    print("For the taxi brand called %s there are %i files." %(taxi_brand, nb_files))
    # Get size basic stat
    print("     Here are some metrics based on their size in bytes:")
    print("       -min value           : ", np.min(size_b))
    print("       -max value           : ", np.max(size_b))
    print("       -mean value          : ", np.round(np.mean(size_b),2))
    print("       -25 percentile value : ", np.quantile(size_b, .25)) 
    print("       -50 percentile value : ", np.quantile(size_b, .50)) 
    print("       -75 percentile value : ", np.quantile(size_b, .75)) 
    print("       -90 percentile value : ", np.quantile(size_b, .90))
    print("       -total file sum size : ", np.sum(size_b))

For the taxi brand called yellow there are 131 files.
     Here are some metrics based on their size in bytes:
       -min value           :  43103
       -max value           :  5959352
       -mean value          :  3750759.69
       -25 percentile value :  1756967.0
       -50 percentile value :  4442047.0
       -75 percentile value :  5123591.5
       -90 percentile value :  5491438.0
       -total file sum size :  491349519
For the taxi brand called green there are 76 files.
     Here are some metrics based on their size in bytes:
       -min value           :  2512
       -max value           :  570765
       -mean value          :  262437.29
       -25 percentile value :  121494.75
       -50 percentile value :  190194.5
       -75 percentile value :  456955.0
       -90 percentile value :  499751.0
       -total file sum size :  19945234
For the taxi brand called fhv there are 64 files.
     Here are some metrics based on their size in bytes:
       -min value           :  520

### 1.5- Files size and metrics: rows

In [3]:
list_taxi = ["yellow", "green", "fhv", "fhvhv"]
#list_taxi = ["green"]
for taxi_brand in list_taxi :
    names={}
    size_r=[]
    for fn in glob.glob("data/sampled/%s_*.csv" %(taxi_brand)):
        with open(fn) as f:
            names[fn]=sum(1 for line in f if line.strip())      
        # Save in list files sizes in rows
        size_r=list(names.values())

    print("For the taxi brand called %s there are still %i files." %(taxi_brand, nb_files))
    # Get size basic stat
    print("     Here are some metrics based on their size in rows count:")
    print("       -min value           : ", np.min(size_r))
    print("       -max value           : ", np.max(size_r))
    print("       -mean value          : ", np.round(np.mean(size_r),2))
    print("       -25 percentile value : ", np.quantile(size_r, .25)) 
    print("       -50 percentile value : ", np.quantile(size_r, .50)) 
    print("       -75 percentile value : ", np.quantile(size_r, .75)) 
    print("       -90 percentile value : ", np.quantile(size_r, .90))
    print("       -total row number    : ", np.sum(size_r))

For the taxi brand called yellow there are still 10 files.
     Here are some metrics based on their size in rows count:
       -min value           :  477
       -max value           :  32301
       -mean value          :  24204.51
       -25 percentile value :  19990.0
       -50 percentile value :  26295.0
       -75 percentile value :  29081.0
       -90 percentile value :  30203.0
       -total row number    :  3170791
For the taxi brand called green there are still 10 files.
     Here are some metrics based on their size in rows count:
       -min value           :  16
       -max value           :  3547
       -mean value          :  2027.5
       -25 percentile value :  1359.75
       -50 percentile value :  2073.5
       -75 percentile value :  2893.25
       -90 percentile value :  3127.0
       -total row number    :  154090
For the taxi brand called fhv there are still 10 files.
     Here are some metrics based on their size in rows count:
       -min value           :  960

### 1.6- Check for the files shema
The schema of the files have changes over time (name, order and numbers of column). The following script look for the kind of change and the date when it occurs.

In [2]:
# Get the list of file
list_taxi = ["yellow", "green", "fhv", "fhvhv"]
#list_taxi = ["green"]
for taxi_brand in list_taxi :
    list_files = {}
    yr_list = {}
    mth_list = {}
    list_files[taxi_brand] = []
    yr_list[taxi_brand] = []
    mth_list[taxi_brand] = []
    nb_files = 0
    size = []
    count = 0
    # List the file from the same taxi company brand 
    for file in glob.glob("data/sampled/%s_*.csv" %(taxi_brand)):
        nb_files = nb_files+1
        # Save in list the files name
        list_files[taxi_brand].append(file)
    # Order by dayte the files list
    list_files[taxi_brand].sort()
    print("[%s] For the taxi brand called %s there are %i files." %(taxi_brand.upper(),taxi_brand , nb_files))
    for yr in range(0,nb_files):
        # Read file by file
        df = pd.read_csv(list_files[taxi_brand][yr],error_bad_lines=False, sep=',')
        # Extract the head of the file
        head = list(df)
        head_lower = [ x.lower() for x in head ]
        # Remove the blank space in the col name
        head_lower_clean = [x.strip(' ') for x in head_lower]
        #print(yr,head_lower)
        # counter for check file
        count = count + 1
        # Save first file schema as reference
        if yr == 0 :
            nb_col_ref = len(head)
            col_name_ref = head_lower_clean
            time_period = int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27])
            year = int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27])
            month = int(list_files[taxi_brand][yr][len(taxi_brand)+28:len(taxi_brand)+30])
            yr_list[taxi_brand].append(year)
            mth_list[taxi_brand].append(month)
            print("    --> In %i - %i : save 1st reference schema" %(year,month))
            # print(col_name_ref)
        else :
            # Compare reference schema with the shema of all the others files
            if head_lower_clean != col_name_ref :
                # Col in new schema but not in reference schema
                diff_name_col_add = [ x for x in head_lower_clean if x not in set(col_name_ref) ]
                # Col in old schema but not in new schema
                diff_name_col_rm = [ x for x in col_name_ref if x not in set (head_lower_clean) ]
                pos_col_change = []
                index_ref = []
                if (len(diff_name_col_add)+len(diff_name_col_rm) == 0) and (len(head_lower_clean) == nb_col_ref) :
                    for i in range(0,len(diff_name_col)) :
                        index_ref.append(col_name_ref.index(col_name_ref[i]))
                        pos_col_change.append(head_lower_clean.index(col_name_ref[i]))
                year = int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27])
                month = int(list_files[taxi_brand][yr][len(taxi_brand)+28:len(taxi_brand)+30])
                yr_list[taxi_brand].append(year)
                mth_list[taxi_brand].append(month)
                print("    --> In %i - %i : %i diff in new schema compared to reference schema on a total of %i columns" %(year,month,len(diff_name_col),len(head_lower_clean)))
                #print("     In %i - %i :" %(int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27]),int(list_files[taxi_brand][yr][len(taxi_brand)+28:len(taxi_brand)+30])))
                print("               New col. not in ref:",diff_name_col_add)
                print("               Ref col. not in new",diff_name_col_rm)
                print("               Order have change:", pos_col_change)
                # Check if the order of the column have changed
                print(head_lower_clean)
                print(col_name_ref)
                #print(diff_name_col)
                ## If the numbers of column change
                if len(head_lower_clean) != nb_col_ref :
                    # If the numbers of column is different from reference yr check what is the new column name
                    diff_nb_col = len(head_lower_clean) - nb_col_ref
                    # find the new/remove col name())
                    diff_name_col = [ x for x in head_lower_clean if x not in set(col_name_ref) ]
                    # Check if the order of the column have changed
                    #print(diff_name_col)
                    pos_col_change = []
                    for i in range(0,len(diff_name_col)) :
                        pos_col_change.append(head_lower_clean.index(diff_name_col[i]))
                    #if diff_nb_col > 0 :
                        #print("             %i/%i col add" %(diff_nb_col,len(diff_name_col)))
                    #elif diff_nb_col < 0 :
                        #print("             %i/%i col remove" %(abs(diff_nb_col),len(diff_name_col)))
                    if abs(diff_nb_col) != len(diff_name_col) :
                        nb_name_diff = len(diff_name_col) - abs(diff_nb_col) 
                        #print("             %i/%i name change" %(nb_name_diff,len(diff_name_col)))
                elif (sum(len(i) for i in head_lower_clean)) != (sum(len(i) for i in col_name_ref)) :
                    nb_name_change = len(set(col_name_ref) - set(head_lower_clean))
                    #print("             %i/%i column name have changed:" %(nb_name_change, len(diff_name_col)))
                    new_name=set(col_name_ref) - set(head_lower_clean)
                    old_name=set(head_lower_clean) - set(col_name_ref)
                    #print("                 New name are:",new_name)
                    #print(old_name, col_name_ref)
                col_name_ref = head_lower_clean
                nb_col_ref = len(head_lower_clean)
                count = count - 1
        if yr == nb_files-1 :
            year = int(list_files[taxi_brand][yr][len(taxi_brand)+23:len(taxi_brand)+27])
            month = int(list_files[taxi_brand][yr][len(taxi_brand)+28:len(taxi_brand)+30])
            yr_list[taxi_brand].append(year)
            mth_list[taxi_brand].append(month)
    if count == nb_files:
        print("     There is no diff between files")
   
    for i in range(0,len(yr_list[taxi_brand])-1) :
        date_in = str(yr_list[taxi_brand][i])+'-'+str(mth_list[taxi_brand][i])
        if mth_list[taxi_brand][i+1] == 1 :
            date_out =  str(yr_list[taxi_brand][i+1]-1)+'-12'
        else :
            date_out =  str(yr_list[taxi_brand][i+1])+'-'+str(mth_list[taxi_brand][i+1])
        print("Records from %s to %s use the same schema" %(date_in,date_out))
    print(yr_list[taxi_brand])
    print(mth_list[taxi_brand])
        #df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'])
        

[YELLOW] For the taxi brand called yellow there are 131 files.
    --> In 2009 - 1 : save 1st reference schema


NameError: name 'diff_name_col' is not defined

### 2- Conclustion 

In total, there are 281 taxi files (76 for green, 131 for yellow, 64 for fhv and 10 for fhvhv). Each files represent one month of records. The record period depends on the taxi brand:

- Yellow taxi: from January 2009 till June 2020, 7 files of records are missing.
- Green taxi : from August 2013 till June 2020, 7 files of records are missing
- FHVHV taxi : from February 2019 till June 2020, 7 files of records are missing

        The missing periods are from July 2019 till December 2019 and March 2020.

- FHV taxi   : from January 2015 till June 2020, 2 files of records are missing

        The missing periods are December 2019 and February 2020.

All the files represents a size of around 0.6Go. 


In [110]:
((2020-2019)*12)+6-1


17