# Data integration

For each sub-dataset, write (and execute) code that converts a file (possibly using an old schema) into a file that has the latest schema version.

Your conversion code should not modify the original files, but instead create a new file. Be sure to explain the design behind your conversion functions!

The data integration step is highly parallelizable. Therefore, your solution for this part
**must** be written in Spark.

WARNING: this notebook assumes that:

- The data are in the "MY_PARENT_FOLDER/data/sampled/" folder. You can run the bash script "download_metadata.sh" to download the data and metadata into the correct folders before executing the Jupyter notebooks.
- The data are sampled so that the notebooks can be run on a personal computer.

In [10]:
# Imports go here
import os
import glob
import shutil
from shutil import copyfile
from datetime import date, datetime

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# The driver memory must be configured before the SparkSession is created
os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=3g  pyspark-shell"
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
import pyspark.sql.functions as f

try:
    # `spark` only exists if a previous run of this cell already created a session
    spark
    print("Spark application already started. Terminating existing application and starting new one")
    spark.stop()
except NameError:
    pass
# Create a new spark session (note, the * indicates to use all available CPU cores)
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("H600 L-Group") \
    .getOrCreate()
# When dealing with RDDs, we work with the sparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc=spark.sparkContext
# In local mode, you will be able to access the Spark GUI at http://localhost:4040

## FUNCTION DECLARATION

# Convert a lon-lat position into a taxi zone location ID.
# Relies on the `zones` GeoDataFrame and its spatial index `rtree`, which are defined in a later cell.
def convertlocID(lon, lat):
    # Use a local result so that a point matching no zone returns None
    # instead of the value left over from a previous call
    locationID = None
    if lon is not None and lat is not None and -80.0 < lon < -70.0 and 35.0 < lat < 45.0:
        query_point = Point(lon, lat)
        # Cheap bounding-box filter first, exact containment test second
        possible_matches = list(rtree.intersection(query_point.bounds))
        for match in possible_matches:
            if zones.iloc[match].geometry.contains(query_point):
                # NOTE: this is the row position in `zones`, as in the original code
                locationID = match

    return locationID

# Replace the null values of column x with 0 and leave the other values unchanged
def blank_as_null(x):
    return f.when(col(x).isNull(), 0).otherwise(col(x))

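# List the sampled files of a taxi brand, sorted by date (the date is part of the file name)
def create_files_list(brand, list_files):
    global nb_files  # kept global, as in the original, in case other cells read it
    matches = glob.glob("data/sampled/%s_*.csv" % brand)
    nb_files = len(matches)
    list_files.extend(matches)
    # Sort once, after all the file names have been collected
    list_files.sort()

    return list_files, nb_files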

def remove_all_whitespace(col):
    return f.regexp_replace(col, "\\s+", "")

Spark application already started. Terminating existing application and starting new one


In [2]:
# Create the cleaned data directories
if not os.path.isdir("data/cleaned"):
    print ("Need to create directory data/cleaned")
    os.mkdir("data/cleaned")
else:
    print ("The directory data/cleaned already exist")

list_taxi = ["yellow", "green", "fhv", "fhvhv"]
for taxi_brand in list_taxi :
    # One output directory per taxi company brand
    path = "data/cleaned/%s" %(taxi_brand)
    if not os.path.isdir(path):
        print ("Creation of the directory %s" % path)
        os.mkdir(path)
    else:
        print ("The directory %s already exist" % path)
        

The directory data/cleaned already exist
The directory data/cleaned/yellow already exist
The directory data/cleaned/green already exist
The directory data/cleaned/fhv already exist
The directory data/cleaned/fhvhv already exist


## 1. FHVHV files

Previous analyses showed that the header is consistent across all the fhvhv files.
We therefore do not need to modify them; we only copy them to the cleaned directory.

In [6]:
source_dir= 'data/sampled/'       
for filename in glob.glob(os.path.join(source_dir,'fhvhv_*.csv')):
    shutil.copy(filename, 'data/cleaned/fhvhv')

## 2. FHV files

From the previous analysis, we decided to use the following schema as the reference for the FHV taxi files:

['dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag'] 
 
We therefore need to apply some transformations to create new uniform files, according to the time periods previously defined and saved in the file Change_date_fhv.csv:

- Change schema 1 (from 2015-1 to 2016-12): 
            a) Add empty columns for 'dropoff_datetime', 'DOLocationID' and 'SR_Flag' to the files.
            b) Rename the columns 'Pickup_date' to 'pickup_datetime', 'locationID' to 'PULocationID' and 'Dispatching_base_num' to 'dispatching_base_num'.

- Change schema 2 (from 2017-1 to 2017-6): 
            a) Add empty columns for 'DOLocationID' and 'SR_Flag' to the files.
            b) Rename the columns 'Pickup_DateTime' to 'pickup_datetime', 'Dropoff_datetime' to 'dropoff_datetime' and 'Dispatching_base_num' to 'dispatching_base_num'.
            
- Change schema 3 (from 2017-7 to 2017-12): 
            a) Rename the columns 'Pickup_DateTime' to 'pickup_datetime', 'Dropoff_datetime' to 'dropoff_datetime' and 'Dispatching_base_num' to 'dispatching_base_num'.
            
- Change schema 4 (from 2018-1 to 2018-12):
            a) Rename the columns 'Pickup_DateTime' to 'pickup_datetime', 'Dropoff_datetime' to 'dropoff_datetime' and 'Dispatching_base_number' to 'dispatching_base_num'.
            b) Remove the duplicate column 'Dispatching_base_num', which contains no values.
          
- Final schema 5 (from 2019-1 to 2020-6):
            No change.
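
The period lookup described above can be sketched in plain Python, independently of Spark. This is a minimal illustration that assumes the period start dates listed above; the notebook itself reads them from Change_date_fhv.csv, and the names `SCHEMA_STARTS` and `schema_version` are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Start date of each FHV schema period, taken from the list above.
# (Assumption: the notebook reads these boundaries from Change_date_fhv.csv.)
SCHEMA_STARTS = [
    date(2015, 1, 1),  # schema 1
    date(2017, 1, 1),  # schema 2
    date(2017, 7, 1),  # schema 3
    date(2018, 1, 1),  # schema 4
    date(2019, 1, 1),  # schema 5 (final)
]

def schema_version(file_date, starts=SCHEMA_STARTS):
    """Return the 1-based schema number that applies to a file dated `file_date`."""
    if file_date < starts[0]:
        raise ValueError("%s predates the first known schema" % file_date)
    # bisect_right counts how many period starts are <= file_date
    return bisect_right(starts, file_date)

print(schema_version(date(2016, 12, 1)))  # 1
print(schema_version(date(2017, 1, 1)))   # 2
print(schema_version(date(2020, 6, 1)))   # 5
```

Because the last period has no end date, a binary search over the start dates also covers the final schema without a separate special case.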


In [11]:
source_dir= 'data/sampled/'
clean_dir = 'data/cleaned/'
taxi_brand='fhv'
# List the files of this taxi company brand, sorted by date (the date is part of the file name)
list_files = sorted(glob.glob("data/sampled/%s_*.csv" %(taxi_brand)))
nb_files = len(list_files)


# Open the date change file
df = pd.read_csv("data/Change_date_%s.csv" %(taxi_brand), sep=',', header=None)
dating_schema = [ datetime.strptime(x, '%Y-%m-%d') for x in df[1] ]
for yr in range(0,nb_files):
    if os.path.isfile(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::]) == False :
        year = int(list_files[yr][len(taxi_brand)+23:len(taxi_brand)+27])
        month = int(list_files[yr][len(taxi_brand)+28:len(taxi_brand)+30])
        date_file = date(year,month,1)
        fhv_DF = (spark.read
                    .option("sep", ",")
                    .option("header", True)
                    .option("inferSchema", True)
                    .csv(list_files[yr]) )
        for nb_schema in range(0,len(dating_schema)-1):
            if date_file >= dating_schema[nb_schema].date() and  date_file < dating_schema[nb_schema+1].date():
                if nb_schema+1 == 1 :
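                    # Add the missing columns as real nulls (written as empty CSV fields), not as the string 'null'
                    fhv1_DF = fhv_DF.withColumn("dropoff_datetime",lit(None).cast("string"))\
                           .withColumn("DOLocationID",lit(None).cast("string"))\
                           .withColumn("SR_Flag",lit(None).cast("string"))\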
                           .select(
                            col("Dispatching_base_num").alias("dispatching_base_num"),
                            col("Pickup_date").alias("pickup_datetime"),
                            "dropoff_datetime",
                            col("locationID").alias("PUlocationid"),
                            col("DOLocationID").alias("DOlocationid"),
                            "SR_Flag")
                    fhv1_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 2 :
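                    # Add the missing columns as real nulls (written as empty CSV fields), not as the string 'null'
                    fhv2_DF = fhv_DF.withColumn("DOLocationID",lit(None).cast("string"))\
                            .withColumn("SR_Flag",lit(None).cast("string"))\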
                            .select(
                                col("Dispatching_base_num").alias("dispatching_base_num"),
                                col("Pickup_DateTime").alias("pickup_datetime"),
                                col("Dropoff_datetime").alias("dropoff_datetime"),
                                col("locationID").alias("PUlocationid"),
                                col("DOLocationID").alias("DOlocationid"),
                                "SR_Flag")
                    fhv2_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 3 :
                    fhv3_DF = fhv_DF.select(
                                col("Dispatching_base_num").alias("dispatching_base_num"),
                                col("Pickup_DateTime").alias("pickup_datetime"),
                                col("Dropoff_datetime").alias("dropoff_datetime"),
                                col("locationID").alias("PUlocationid"),
                                col("DOLocationID").alias("DOlocationid"),
                                "SR_Flag")
                    fhv3_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 4 :
                    fhv4_DF = fhv_DF.select(
                                col("Dispatching_base_number").alias("dispatching_base_num"),
                                col("Pickup_DateTime").alias("pickup_datetime"),
                                col("Dropoff_datetime").alias("dropoff_datetime"),
                                col("locationID").alias("PUlocationid"),
                                col("DOLocationID").alias("DOlocationid"),
                                "SR_Flag")
                    fhv4_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 5 :
                    fhv5_DF = fhv_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
        if date_file == dating_schema[5].date() :
            fhv5_DF = fhv_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
new_files = len(os.listdir('data/cleaned/'+taxi_brand))
if new_files == nb_files :
    print("All the %i files are well integrated !" %(new_files))
else :
    print("[ERROR] %i files on %i files have been integrated ..." %(new_files, nb_files))

All the 64 files are well integrated !


## 3. Green files

From the previous analysis, we decided to use the following schema as the reference for the GREEN taxi files:

['vendorid', 'pickup_datetime', 'dropoff_datetime', 'store_and_fwd_flag', 'ratecodeid', 'pulocationid', 'dolocationid', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge']
 
We therefore need to apply some transformations to create new uniform files, according to the time periods previously defined and saved in the file Change_date_green.csv:

- Change schema 1 (from 2013-8 to 2014-12): 
            a) Two new columns are added: 'congestion_surcharge' and 'improvement_surcharge'.
            b) The columns 'pickup_longitude', 'pickup_latitude' and 'dropoff_longitude', 'dropoff_latitude' are replaced by 'pulocationid' and 'dolocationid' respectively. The transformation uses geopandas to convert each lat-lon position into a location ID.
            c) For all the other columns, upper-case letters are changed to lower case.
           
- Change in schema 2 (from 2015-1 to 2016-7):
            a) One new column is added: 'congestion_surcharge'.
            b) The columns 'pickup_longitude', 'pickup_latitude' and 'dropoff_longitude', 'dropoff_latitude' are replaced by 'pulocationid' and 'dolocationid' respectively. The transformation uses geopandas to convert each lat-lon position into a location ID.
            c) For all the other columns, upper-case letters are changed to lower case.

- Change in schema 3 (from 2016-7 to 2018-12):
            a) One new column is added: 'congestion_surcharge'.
            b) For all the other columns, upper-case letters are changed to lower case.

- Final schema 4 (from 2019-1 to 2020-6):
            No change.
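
The lat-lon to location ID conversion mentioned above follows a two-step pattern: a cheap bounding-box filter, then an exact point-in-polygon test. The sketch below illustrates that pattern in plain Python with a ray-casting test, as a stand-in for what the geopandas spatial index and `contains` do in the notebook. The zone coordinates are made up, and `bbox`, `in_bbox`, `contains`, `locate` and `TOY_ZONES` are hypothetical names.

```python
def bbox(polygon):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of (lon, lat) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def in_bbox(lon, lat, box):
    min_x, min_y, max_x, max_y = box
    return min_x <= lon <= max_x and min_y <= lat <= max_y

def contains(polygon, lon, lat):
    """Ray casting: count the edges crossed by a ray going east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def locate(lon, lat, zones):
    """Return the ID of the zone containing the point, or None if there is none."""
    if lon is None or lat is None:
        return None
    for zone_id, polygon in zones.items():
        # Bounding-box filter first, exact test second
        if in_bbox(lon, lat, bbox(polygon)) and contains(polygon, lon, lat):
            return zone_id
    return None

# Two made-up square zones (not real taxi zones)
TOY_ZONES = {
    1: [(-74.0, 40.0), (-73.9, 40.0), (-73.9, 40.1), (-74.0, 40.1)],
    2: [(-73.9, 40.0), (-73.8, 40.0), (-73.8, 40.1), (-73.9, 40.1)],
}

print(locate(-73.95, 40.05, TOY_ZONES))  # 1
print(locate(-73.85, 40.05, TOY_ZONES))  # 2
print(locate(0.0, 0.0, TOY_ZONES))       # None
```

A real spatial index avoids scanning every zone for every point; the linear loop here only keeps the example short.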
            

In [8]:
source_dir= 'data/sampled/'
clean_dir = 'data/cleaned/'
taxi_brand='green'
# List the files of this taxi company brand, sorted by date (the date is part of the file name)
list_files = sorted(glob.glob("data/sampled/%s_*.csv" %(taxi_brand)))
nb_files = len(list_files)

# Load the shapefile, this yields a GeoDataFrame that has a row for each zone
zones = gpd.read_file('data/metadata/taxi_zones.shp')
# Reproject the zones to WGS84 (EPSG:4326) so they can be compared with lon-lat coordinates
zones = zones.to_crs(epsg=4326)
rtree = zones.sindex

# Open the date change file
df = pd.read_csv("data/Change_date_%s.csv" %(taxi_brand), sep=',', header=None)
dating_schema = [ datetime.strptime(x, '%Y-%m-%d') for x in df[1] ]
for yr in range(0,nb_files):
    if os.path.isfile(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::]) == False :
        year = int(list_files[yr][len(taxi_brand)+23:len(taxi_brand)+27])
        month = int(list_files[yr][len(taxi_brand)+28:len(taxi_brand)+30])
        date_file = date(year,month,1)
        green_DF = (spark.read
                    .option("sep", ",")
                    .option("header", True)
                    .option("inferSchema", True)
                    .csv(list_files[yr]) )
        green_DF = green_DF.select([f.col(col).alias(col.replace(' ', '')) for col in green_DF.columns])
        for nb_schema in range(0,len(dating_schema)-1):
            Drop_ID = []
            Pick_ID = []
            if date_file >= dating_schema[nb_schema].date() and  date_file < dating_schema[nb_schema+1].date():
                print(date_file)
                if nb_schema+1 == 1 :
                    print("schema 1 for file:",list_files[yr])
                    # Replace null coordinates with 0 so that convertlocID rejects them (0 is outside its bounding box)
                    green_DF = green_DF.withColumn("Dropoff_longitude", blank_as_null("Dropoff_longitude"))\
                           .withColumn("Dropoff_latitude", blank_as_null("Dropoff_latitude"))\
                           .withColumn("Pickup_latitude", blank_as_null("Pickup_latitude"))\
                           .withColumn("Pickup_longitude", blank_as_null("Pickup_longitude"))
                    # Transform LAT-LON in location ID
                    Pickup_list_lat = green_DF.select(f.collect_list('Pickup_latitude')).first()[0]
                    Pickup_list_lon = green_DF.select(f.collect_list('Pickup_longitude')).first()[0]
                    Dropoff_list_lat = green_DF.select(f.collect_list('Dropoff_latitude')).first()[0]
                    Dropoff_list_lon = green_DF.select(f.collect_list('Dropoff_longitude')).first()[0]
                    for i in range(0,len(Pickup_list_lat)):
                        a = convertlocID(Pickup_list_lon[i],Pickup_list_lat[i])
                        Pick_ID.append(a) 
                    for i in range(0,len(Dropoff_list_lat)):
                        a = convertlocID(Dropoff_list_lon[i],Dropoff_list_lat[i])
                        Drop_ID.append(a)
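                    # Create the new file
                    # NOTE: indexing the lists with monotonically_increasing_id() assumes the file is read
                    # as a single partition, so that the generated ids are 0, 1, 2, ...
                    green1_DF = green_DF.withColumn("pulocationid",
                                                            f.udf(lambda id: Pick_ID[id])(f.monotonically_increasing_id()))\
                                             .withColumn("dolocationid",
                                                            f.udf(lambda id: Drop_ID[id])(f.monotonically_increasing_id()))\
                                             .withColumn("congestion_surcharge",lit(None).cast("string"))\
                                             .withColumn("improvement_surcharge",lit(None).cast("string"))\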
                                                        .select(
                                                            col("VendorID").alias("vendorID"),
                                                            col("lpep_pickup_datetime").alias("pickup_datetime"),
                                                            col("Lpep_dropoff_datetime").alias("dropoff_datetime"),
                                                            col("Store_and_fwd_flag").alias("store_and_fwd_flag"),
                                                            col("RateCodeID").alias("ratecodeID"),
                                                            col("pulocationid").alias("PUlocationid"),
                                                            col("dolocationid").alias("DOlocationid"),
                                                            col("Passenger_count").alias("passenger_count"),
                                                            col("Trip_distance").alias("trip_distance"),
                                                            col("Fare_amount").alias("fare_amount"),
                                                            col("Extra").alias("extra"),
                                                            col("MTA_tax").alias("mta_tax"),
                                                            col("Tip_amount").alias("tip_amount"),
                                                            col("Tolls_amount").alias("tolls_amount"),
                                                            col("Ehail_fee").alias("ehail_fee"),
                                                            "improvement_surcharge",
                                                            col("Total_amount").alias("total_amount"),
                                                            col("Payment_type").alias("payment_type"),
                                                            col("Trip_type").alias("trip_type"),
                                                            "congestion_surcharge")
                    green1_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 2 :
                    print("schema 2")
                    # Replace null coordinates with 0 so that convertlocID rejects them (0 is outside its bounding box)
                    green_DF = green_DF.withColumn("Dropoff_longitude", blank_as_null("Dropoff_longitude"))\
                           .withColumn("Dropoff_latitude", blank_as_null("Dropoff_latitude"))\
                           .withColumn("Pickup_latitude", blank_as_null("Pickup_latitude"))\
                           .withColumn("Pickup_longitude", blank_as_null("Pickup_longitude"))
                    # Transform LAT-LON in location ID
                    Pickup_list_lat = green_DF.select(f.collect_list('Pickup_latitude')).first()[0]
                    Pickup_list_lon = green_DF.select(f.collect_list('Pickup_longitude')).first()[0]
                    Dropoff_list_lat = green_DF.select(f.collect_list('Dropoff_latitude')).first()[0]
                    Dropoff_list_lon = green_DF.select(f.collect_list('Dropoff_longitude')).first()[0]
                    for i in range(0,len(Pickup_list_lat)):
                        a = convertlocID(Pickup_list_lon[i],Pickup_list_lat[i])
                        Pick_ID.append(a) 
                    for i in range(0,len(Dropoff_list_lat)):
                        a = convertlocID(Dropoff_list_lon[i],Dropoff_list_lat[i])
                        Drop_ID.append(a)
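                    # Create the new file
                    # NOTE: indexing the lists with monotonically_increasing_id() assumes the file is read
                    # as a single partition, so that the generated ids are 0, 1, 2, ...
                    green2_DF = green_DF.withColumn("pulocationid",
                                                            f.udf(lambda id: Pick_ID[id])(f.monotonically_increasing_id()))\
                                        .withColumn("dolocationid",
                                                            f.udf(lambda id: Drop_ID[id])(f.monotonically_increasing_id()))\
                                        .withColumn("congestion_surcharge",lit(None).cast("string"))\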
                                                        .select(
                                                            col("VendorID").alias("vendorID"),
                                                            col("lpep_pickup_datetime").alias("pickup_datetime"),
                                                            col("lpep_dropoff_datetime").alias("dropoff_datetime"),
                                                            col("Store_and_fwd_flag").alias("store_and_fwd_flag"),
                                                            col("RateCodeID").alias("ratecodeID"),
                                                            col("pulocationid").alias("PUlocationid"),
                                                            col("dolocationid").alias("DOlocationid"),
                                                            col("Passenger_count").alias("passenger_count"),
                                                            col("Trip_distance").alias("trip_distance"),
                                                            col("Fare_amount").alias("fare_amount"),
                                                            col("Extra").alias("extra"),
                                                            col("MTA_tax").alias("mta_tax"),
                                                            col("Tip_amount").alias("tip_amount"),
                                                            col("Tolls_amount").alias("tolls_amount"),
                                                            col("Ehail_fee").alias("ehail_fee"),
                                                            "improvement_surcharge",
                                                            col("Total_amount").alias("total_amount"),
                                                            col("Payment_type").alias("payment_type"),
                                                            col("Trip_type").alias("trip_type"),
                                                            "congestion_surcharge")
                    green2_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 3 :
                    print("schema 3")
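                    # Add the missing column as real nulls (written as empty CSV fields), not as the string 'null'
                    green3_DF = green_DF.withColumn("congestion_surcharge",lit(None).cast("string"))\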
                                                        .select(
                                                            col("VendorID").alias("vendorID"),
                                                            col("lpep_pickup_datetime").alias("pickup_datetime"),
                                                            col("Lpep_dropoff_datetime").alias("dropoff_datetime"),
                                                            col("Store_and_fwd_flag").alias("store_and_fwd_flag"),
                                                            col("RateCodeID").alias("ratecodeID"),
                                                            col("pulocationid").alias("PUlocationid"),
                                                            col("dolocationid").alias("DOlocationid"),
                                                            col("Passenger_count").alias("passenger_count"),
                                                            col("Trip_distance").alias("trip_distance"),
                                                            col("Fare_amount").alias("fare_amount"),
                                                            col("Extra").alias("extra"),
                                                            col("MTA_tax").alias("mta_tax"),
                                                            col("Tip_amount").alias("tip_amount"),
                                                            col("Tolls_amount").alias("tolls_amount"),
                                                            col("Ehail_fee").alias("ehail_fee"),
                                                            "improvement_surcharge",
                                                            col("Total_amount").alias("total_amount"),
                                                            col("Payment_type").alias("payment_type"),
                                                            col("Trip_type").alias("trip_type"),
                                                            "congestion_surcharge")
                    green3_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 4 :
                    print("schema 4")
                    green4_DF = green_DF.select(
                                                            col("VendorID").alias("vendorID"),
                                                            col("lpep_pickup_datetime").alias("pickup_datetime"),
                                                            col("Lpep_dropoff_datetime").alias("dropoff_datetime"),
                                                            col("Store_and_fwd_flag").alias("store_and_fwd_flag"),
                                                            col("RateCodeID").alias("ratecodeID"),
                                                            col("pulocationid").alias("PUlocationid"),
                                                            col("dolocationid").alias("DOlocationid"),
                                                            col("Passenger_count").alias("passenger_count"),
                                                            col("Trip_distance").alias("trip_distance"),
                                                            col("Fare_amount").alias("fare_amount"),
                                                            col("Extra").alias("extra"),
                                                            col("MTA_tax").alias("mta_tax"),
                                                            col("Tip_amount").alias("tip_amount"),
                                                            col("Tolls_amount").alias("tolls_amount"),
                                                            col("Ehail_fee").alias("ehail_fee"),
                                                            "improvement_surcharge",
                                                            col("Total_amount").alias("total_amount"),
                                                            col("Payment_type").alias("payment_type"),
                                                            col("Trip_type").alias("trip_type"),
                                                            "congestion_surcharge")
                    green4_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
        if date_file == dating_schema[4].date() :
            print("schema LAST year")
            green4_DF = green_DF.select(
                                                            col("VendorID").alias("vendorID"),
                                                            col("lpep_pickup_datetime").alias("pickup_datetime"),
                                                            col("Lpep_dropoff_datetime").alias("dropoff_datetime"),
                                                            col("Store_and_fwd_flag").alias("store_and_fwd_flag"),
                                                            col("RateCodeID").alias("ratecodeID"),
                                                            col("pulocationid").alias("PUlocationid"),
                                                            col("dolocationid").alias("DOlocationid"),
                                                            col("Passenger_count").alias("passenger_count"),
                                                            col("Trip_distance").alias("trip_distance"),
                                                            col("Fare_amount").alias("fare_amount"),
                                                            col("Extra").alias("extra"),
                                                            col("MTA_tax").alias("mta_tax"),
                                                            col("Tip_amount").alias("tip_amount"),
                                                            col("Tolls_amount").alias("tolls_amount"),
                                                            col("Ehail_fee").alias("ehail_fee"),
                                                            "improvement_surcharge",
                                                            col("Total_amount").alias("total_amount"),
                                                            col("Payment_type").alias("payment_type"),
                                                            col("Trip_type").alias("trip_type"),
                                                            "congestion_surcharge")
            green4_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
new_files = len(os.listdir('data/cleaned/'+taxi_brand))
if new_files == nb_files :
    print("All the %i files are well integrated !" %(new_files))
else :
    print("[ERROR] %i files on %i files have been integrated ..." %(new_files, nb_files))

2013-08-01
schema 1 for file: data/sampled/green_tripdata_2013-08.csv
2013-09-01
schema 1 for file: data/sampled/green_tripdata_2013-09.csv
2013-10-01
schema 1 for file: data/sampled/green_tripdata_2013-10.csv
2013-11-01
schema 1 for file: data/sampled/green_tripdata_2013-11.csv
2013-12-01
schema 1 for file: data/sampled/green_tripdata_2013-12.csv
2014-01-01
schema 1 for file: data/sampled/green_tripdata_2014-01.csv
2014-02-01
schema 1 for file: data/sampled/green_tripdata_2014-02.csv
2014-03-01
schema 1 for file: data/sampled/green_tripdata_2014-03.csv
2014-04-01
schema 1 for file: data/sampled/green_tripdata_2014-04.csv
2014-05-01
schema 1 for file: data/sampled/green_tripdata_2014-05.csv
2014-06-01
schema 1 for file: data/sampled/green_tripdata_2014-06.csv
2014-07-01
schema 1 for file: data/sampled/green_tripdata_2014-07.csv
2014-08-01
schema 1 for file: data/sampled/green_tripdata_2014-08.csv
2014-09-01
schema 1 for file: data/sampled/green_tripdata_2014-09.csv
2014-10-01
schema 1 

## 4. Yellow files

From the previous analysis, we decided to use the following schema as the reference for the YELLOW taxi files:

['vendorid','tpep_pickup_datetime','tpep_dropoff_datetime','passenger_count','trip_distance','ratecodeid','store_and_fwd_flag','pulocationid','dolocationid','payment_type','fare_amount','extra','mta_tax','tip_amount','tolls_amount','improvement_surcharge','total_amount','congestion_surcharge']

We therefore need to apply some transformations to create new uniform files, according to the time periods previously defined and saved in the file Change_date_yellow.csv:

- Change schema 1 (from 2009-1 to 2009-12) :
            a)Columns transformations:
                  -'vendor_name' => 'vendorid'
                  -'Trip_Pickup_DateTime' => 'pickup_datetime'
                  -'Trip_Dropoff_DateTime' => 'dropoff_datetime'
                  -'Passenger_Count' => 'passenger_count'
                  -'Trip_Distance' => 'trip_distance'
                  -'Rate_Code' => 'ratecodeid'
                  -'store_and_forward' => 'store_and_fwd_flag'
                  -'Start_Lon','Start_Lat' => 'pulocationid'
                  -'End_Lon','End_Lat' => 'dolocationid'
                  -'Payment_Type' => 'payment_type'
                  -'Fare_Amt' => 'fare_amount'
                  -'surcharge' => 'extra'
                  -'Tip_Amt' => 'tip_amount'
                  -'Tolls_Amt' => 'tolls_amount'
                  -'Total_Amt' => 'total_amount'     
          b) Columns to add:
                  -'congestion_surcharge'
                  -'improvement_surcharge'

- Change schema 2 (from 2010-1 to 2014-12):
          a) Columns transformations:
                  -'vendor_id' => 'vendorid'
                  -'Trip_Distance' => 'trip_distance'
                  -'rate_code' => 'ratecodeid'
                  -'pickup_longitude','pickup_latitude' => 'pulocationid'
                  -'dropoff_longitude','dropoff_latitude' => 'dolocationid'
                  -'surcharge' => 'extra'
          b) Columns to add:
                  -'congestion_surcharge'
                  -'improvement_surcharge'

- Change in schema 3 (from 2015-1 to 2016-7):
          a) Columns transformations:
              -'tpep_pickup_datetime' => 'pickup_datetime'
              -'tpep_dropoff_datetime' => 'dropoff_datetime'
              -'RateCodeID' => 'ratecodeid'
              -'pickup_longitude','pickup_latitude' => 'pulocationid'
              -'dropoff_longitude','dropoff_latitude' => 'dolocationid'
          b) One new column to add: congestion_surcharge

- Change in schema 4 (from 2016-7 to 2018-12):
          a) Columns transformations:
              -'tpep_pickup_datetime' => 'pickup_datetime'
              -'tpep_dropoff_datetime' => 'dropoff_datetime'
          b) One new column to add: congestion_surcharge

- Final schema 5 (from 2019-1 to 2020-6):
          a) Columns transformations:
              -'tpep_pickup_datetime' => 'pickup_datetime'
              -'tpep_dropoff_datetime' => 'dropoff_datetime'
          b) Lowercasing header
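
The list above can also be expressed as data, so that the schema number and the plain renames are looked up instead of being written as one branch per schema. The following is only a minimal sketch: the names `SCHEMA_STARTS`, `RENAMES`, `file_month`, `schema_number` and `renamed_header` are hypothetical, the start dates are copied from the list above (assuming 2016-7 belongs to schema 4), and only a few renames are shown. The coordinate-to-zone conversion is not a rename and is left out.

```python
import re
from bisect import bisect_right
from datetime import date

# First month of each schema period (schema 1..5), copied from the list above.
# Assumption: 2016-7 is the first month of schema 4.
SCHEMA_STARTS = [date(2009, 1, 1), date(2010, 1, 1), date(2015, 1, 1),
                 date(2016, 7, 1), date(2019, 1, 1)]

# Plain renames per schema (old name -> new name); deliberately incomplete.
RENAMES = {
    1: {"vendor_name": "vendorid", "Rate_Code": "ratecodeid", "surcharge": "extra"},
    2: {"vendor_id": "vendorid", "rate_code": "ratecodeid", "surcharge": "extra"},
}

def file_month(path):
    """Return the first day of the month found in a name like yellow_tripdata_2014-04.csv."""
    match = re.search(r"_(\d{4})-(\d{2})\.csv$", path)
    if match is None:
        raise ValueError("no YYYY-MM date in file name: %s" % path)
    return date(int(match.group(1)), int(match.group(2)), 1)

def schema_number(day):
    """Return the 1-based number of the schema period that contains `day`."""
    index = bisect_right(SCHEMA_STARTS, day)
    if index == 0:
        raise ValueError("date %s is before the first schema" % day)
    return index

def renamed_header(columns, schema):
    """Apply the plain renames of one schema to a list of column names."""
    mapping = RENAMES.get(schema, {})
    return [mapping.get(name, name) for name in columns]

day = file_month("data/sampled/yellow_tripdata_2010-03.csv")
print(schema_number(day))                                          # 2
print(renamed_header(["vendor_id", "rate_code", "mta_tax"], 2))    # ['vendorid', 'ratecodeid', 'mta_tax']
```

With such a table, adding a future schema would mean adding one date and one dictionary rather than another branch in the conversion loop.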



In [9]:
from pyspark.sql.types import BooleanType
source_dir= 'data/sampled/'
clean_dir = 'data/cleaned/'
taxi_brand='yellow'
# List the files of this taxi brand; the date is part of the name, so sorting orders them by date
list_files = sorted(glob.glob("%s%s_*.csv" %(source_dir, taxi_brand)))
nb_files = len(list_files)

# Load the shapefile, this yields a GeoDataFrame that has a row for each zone
zones = gpd.read_file('data/metadata/taxi_zones.shp')
# Reproject to WGS84 (longitude/latitude); the {'init': ...} form is deprecated in recent geopandas
zones = zones.to_crs(epsg=4326)
rtree = zones.sindex

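# Make sure the output folder exists before writing the converted files
os.makedirs(clean_dir + taxi_brand, exist_ok=True)

# Open the date change file: column 1 holds the boundary dates of the schema periods
df = pd.read_csv("data/Change_date_%s.csv" %(taxi_brand), sep=',', header=None)
dating_schema = [ datetime.strptime(x, '%Y-%m-%d') for x in df[1] ]
for yr in range(0,nb_files):
    # Skip the files that were already converted in a previous run
    if not os.path.isfile(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::]):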
        year = int(list_files[yr][len(taxi_brand)+23:len(taxi_brand)+27])
        month = int(list_files[yr][len(taxi_brand)+28:len(taxi_brand)+30])
        date_file = date(year,month,1)
        yellow_DF = (spark.read
                    .option("sep", ",")
                    .option("header", True)
                    .option("inferSchema", True)
                    .csv(list_files[yr]) )
        # Remove the spaces from the header names (loop variable renamed so it does not shadow the imported col)
        yellow_DF = yellow_DF.select([f.col(c).alias(c.replace(' ', '')) for c in yellow_DF.columns])
        #test = yellow_DF.withColumn("dropoff_longitude", f.when(col("dropoff_longitude").isNull, 0).otherwise(col("dropoff_longitude")))
        #yellow_DF.printSchema()
        for nb_schema in range(0,len(dating_schema)-1):
            Drop_ID = []
            Pick_ID = []
            if date_file >= dating_schema[nb_schema].date() and  date_file < dating_schema[nb_schema+1].date():
                print(date_file)
                if nb_schema+1 == 1 :
                    # print("schema 1 for file:",list_files[yr])
                    yellow_DF = yellow_DF.withColumn("End_Lon", blank_as_null("End_Lon"))\
                             .withColumn("End_Lat", blank_as_null("End_Lat"))\
                             .withColumn("Start_Lat", blank_as_null("Start_Lat"))\
                             .withColumn("Start_Lon", blank_as_null("Start_Lon"))
                    # Transform LAT-LON in location ID
                    Pickup_list_lat = yellow_DF.select(f.collect_list('Start_Lat')).first()[0]
                    Pickup_list_lon = yellow_DF.select(f.collect_list('Start_Lon')).first()[0]
                    Dropoff_list_lat = yellow_DF.select(f.collect_list('End_Lat')).first()[0]
                    Dropoff_list_lon = yellow_DF.select(f.collect_list('End_Lon')).first()[0]
                    for i in range(0,len(Pickup_list_lat)):
                        a = convertlocID(Pickup_list_lon[i],Pickup_list_lat[i])
                        Pick_ID.append(a) 
                    for i in range(0,len(Dropoff_list_lat)):
                        a = convertlocID(Dropoff_list_lon[i],Dropoff_list_lat[i])
                        Drop_ID.append(a)
                    # Create the new file
                    yellow1_DF = yellow_DF.withColumn("congestion_surcharge",lit('null'))\
                                          .withColumn("improvement_surcharge",lit('null'))\
                                          .withColumn("pulocationid",
                                                          f.udf(lambda id: Pick_ID[id])(f.monotonically_increasing_id()))\
                                          .withColumn("dolocationid",
                                                          f.udf(lambda id: Drop_ID[id])(f.monotonically_increasing_id()))\
                                                          .select(
                                                               col("vendor_name").alias("vendorid"),
                                                               col("Trip_Pickup_DateTime").alias("pickup_datetime"),
                                                               col("Trip_Dropoff_DateTime").alias("dropoff_datetime"),
                                                               col("Passenger_Count").alias("passenger_count"),
                                                               col("Trip_Distance").alias("trip_distance"),
                                                               col("Rate_Code").alias("ratecodeid"),
                                                               col("store_and_forward").alias("store_and_fwd_flag"),
                                                               col("pulocationid").alias("PUlocationid"),
                                                               col("dolocationid").alias("DOlocationid"),
                                                               col("Payment_Type").alias("payment_type"),
                                                               col("Fare_Amt").alias("fare_amount"),
                                                               col("surcharge").alias("extra"),
                                                               "mta_tax",
                                                               col("Tip_Amt").alias("tip_amount"),
                                                               col("Tolls_Amt").alias("tolls_amount"),
                                                               "improvement_surcharge",
                                                               col("Total_Amt").alias("total_amount"),
                                                               "congestion_surcharge")
                    yellow1_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 2 :
                    # print("schema 2 for file:",list_files[yr])
                    yellow_DF = yellow_DF.withColumn("dropoff_longitude", blank_as_null("dropoff_longitude"))\
                             .withColumn("dropoff_latitude", blank_as_null("dropoff_latitude"))\
                             .withColumn("pickup_latitude", blank_as_null("pickup_latitude"))\
                             .withColumn("pickup_longitude", blank_as_null("pickup_longitude"))
                    # Transform LAT-LON in location ID
                    Pickup_list_lat = yellow_DF.select(f.collect_list('pickup_latitude')).first()[0]
                    Pickup_list_lon = yellow_DF.select(f.collect_list('pickup_longitude')).first()[0]
                    Dropoff_list_lat = yellow_DF.select(f.collect_list('dropoff_latitude')).first()[0]
                    Dropoff_list_lon = yellow_DF.select(f.collect_list('dropoff_longitude')).first()[0]
                    print(len(Dropoff_list_lon),len(Dropoff_list_lat))
                    for i in range(0,len(Pickup_list_lat)):
                        a = convertlocID(Pickup_list_lon[i],Pickup_list_lat[i])
                        Pick_ID.append(a) 
                    for i in range(0,len(Dropoff_list_lat)):
                        a = convertlocID(Dropoff_list_lon[i],Dropoff_list_lat[i])
                        Drop_ID.append(a)
                    # Create the new file
                    yellow2_DF = yellow_DF.withColumn("congestion_surcharge",lit('null'))\
                                          .withColumn("improvement_surcharge",lit('null'))\
                                          .withColumn("pulocationid",
                                                        f.udf(lambda id: Pick_ID[id])(f.monotonically_increasing_id()))\
                                          .withColumn("dolocationid",
                                                        f.udf(lambda id: Drop_ID[id])(f.monotonically_increasing_id()))\
                                                        .select(
                                                            col("vendor_id").alias("vendorid"),
                                                            "pickup_datetime",
                                                            "dropoff_datetime",
                                                            "passenger_count",
                                                            "trip_distance",
                                                            col("rate_code").alias("ratecodeid"),
                                                            "store_and_fwd_flag",
                                                            col("pulocationid").alias("PUlocationid"),
                                                            col("dolocationid").alias("DOlocationid"),
                                                            "payment_type",
                                                            "fare_amount",
                                                            col("surcharge").alias("extra"),
                                                            "mta_tax",
                                                            "tip_amount",
                                                            "tolls_amount" ,
                                                            "improvement_surcharge",
                                                            "total_amount",
                                                            "congestion_surcharge") 
                    yellow2_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 3 :
                    print("schema 3")
                    yellow_DF = yellow_DF.withColumn("dropoff_longitude", blank_as_null("dropoff_longitude"))\
                             .withColumn("dropoff_latitude", blank_as_null("dropoff_latitude"))\
                             .withColumn("pickup_latitude", blank_as_null("pickup_latitude"))\
                             .withColumn("pickup_longitude", blank_as_null("pickup_longitude"))
                    # Transform LAT-LON in location ID
                    Pickup_list_lat = yellow_DF.select(f.collect_list('pickup_latitude')).first()[0]
                    Pickup_list_lon = yellow_DF.select(f.collect_list('pickup_longitude')).first()[0]
                    Dropoff_list_lat = yellow_DF.select(f.collect_list('dropoff_latitude')).first()[0]
                    Dropoff_list_lon = yellow_DF.select(f.collect_list('dropoff_longitude')).first()[0]
                    for i in range(0,len(Pickup_list_lat)):
                        a = convertlocID(Pickup_list_lon[i],Pickup_list_lat[i])
                        Pick_ID.append(a) 
                    for i in range(0,len(Dropoff_list_lat)):
                        a = convertlocID(Dropoff_list_lon[i],Dropoff_list_lat[i])
                        Drop_ID.append(a)
                    # Create the new file
                    yellow3_DF = yellow_DF.withColumn("congestion_surcharge",lit('null'))\
                                          .withColumn("pulocationid",
                                                        f.udf(lambda id: Pick_ID[id])(f.monotonically_increasing_id()))\
                                          .withColumn("dolocationid",
                                                        f.udf(lambda id: Drop_ID[id])(f.monotonically_increasing_id()))\
                                                        .select(
                                                            col("VendorID").alias("vendorid"),
                                                            col("tpep_pickup_datetime").alias("pickup_datetime"),
                                                            col("tpep_dropoff_datetime").alias("dropoff_datetime"),
                                                            "passenger_count",
                                                            "trip_distance",
                                                            col("RateCodeID").alias("ratecodeid"),
                                                            "store_and_fwd_flag",
                                                            col("pulocationid").alias("PUlocationid"),
                                                            col("dolocationid").alias("DOlocationid"),
                                                            "payment_type",
                                                            "fare_amount",
                                                            "extra",
                                                            "mta_tax",
                                                            "tip_amount",
                                                            "tolls_amount",
                                                            "improvement_surcharge",
                                                            "total_amount",
                                                            "congestion_surcharge")
                    yellow3_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 4 :
                    print("schema 4")
                    # Create the new file
                    yellow4_DF = yellow_DF.withColumn("congestion_surcharge",lit('null'))\
                                                        .select(
                                                            col("VendorID").alias("vendorid"),
                                                            col("tpep_pickup_datetime").alias("pickup_datetime"),
                                                            col("tpep_dropoff_datetime").alias("dropoff_datetime"),
                                                            "passenger_count",
                                                            "trip_distance",
                                                            col("RatecodeID").alias("ratecodeid"),
                                                            "store_and_fwd_flag",
                                                            col("PULocationID").alias("PUlocationid"),
                                                            col("DOLocationID").alias("DOlocationid"),
                                                            "payment_type",
                                                            "fare_amount",
                                                            "extra",
                                                            "mta_tax",
                                                            "tip_amount",
                                                            "tolls_amount",
                                                            "improvement_surcharge",
                                                            "total_amount",
                                                            "congestion_surcharge")    

                    yellow4_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
                elif nb_schema+1 == 5 :
                    print("schema LAST")
                    # Create the new file
                    yellow5_DF = yellow_DF.select(
                                            col("VendorID").alias("vendorid"),
                                            col("tpep_pickup_datetime").alias("pickup_datetime"),
                                            col("tpep_dropoff_datetime").alias("dropoff_datetime"),
                                            "passenger_count",
                                            "trip_distance",
                                            col("RatecodeID").alias("ratecodeid"),
                                            "store_and_fwd_flag",
                                            col("PULocationID").alias("PUlocationid"),
                                            col("DOLocationID").alias("DOlocationid"),
                                            "payment_type",
                                            "fare_amount",
                                            "extra",
                                            "mta_tax",
                                            "tip_amount",
                                            "tolls_amount",
                                            "improvement_surcharge",
                                            "total_amount",
                                            "congestion_surcharge")
                    # Write the renamed DataFrame (yellow5_DF), not the raw one
                    yellow5_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
        if date_file == dating_schema[5].date() :
            # print("schema LAST year")
            yellow5_DF = yellow_DF.select(
                                            col("VendorID").alias("vendorid"),
                                            col("tpep_pickup_datetime").alias("pickup_datetime"),
                                            col("tpep_dropoff_datetime").alias("dropoff_datetime"),
                                            "passenger_count",
                                            "trip_distance",
                                            col("RatecodeID").alias("ratecodeid"),
                                            "store_and_fwd_flag",
                                            col("PULocationID").alias("PUlocationid"),
                                            col("DOLocationID").alias("DOlocationid"),
                                            "payment_type",
                                            "fare_amount",
                                            "extra",
                                            "mta_tax",
                                            "tip_amount",
                                            "tolls_amount",
                                            "improvement_surcharge",
                                            "total_amount",
                                            "congestion_surcharge")
            # Write the renamed DataFrame (yellow5_DF), not the raw one
            yellow5_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(source_dir)::], index = False)
# Compare the number of converted files with the number of source files
new_files = len(os.listdir(clean_dir+taxi_brand))
if new_files == nb_files :
    print("All %i files have been integrated." %(new_files))
else :
    print("[ERROR] Only %i of %i files have been integrated." %(new_files, nb_files))

2009-01-01
2009-02-01
2009-03-01
2009-04-01
2009-05-01
2009-06-01
2009-07-01
2009-08-01
2009-09-01
2009-10-01
2009-11-01
2009-12-01
2010-01-01
29739 29739
2010-02-01
22291 22291
2010-03-01
25768 25768
2010-04-01
30298 30298
2010-05-01
30969 30969
2010-06-01
29658 29658
2010-07-01
29321 29321
2010-08-01
25056 25056
2010-09-01
31085 31085
2010-10-01
28401 28401
2010-11-01
27826 27826
2010-12-01
27638 27638
2011-01-01
26932 26932
2011-02-01
28411 28411
2011-03-01
32150 32150
2011-04-01
29443 29443
2011-05-01
31115 31115
2011-06-01
30201 30201
2011-07-01
29490 29490
2011-08-01
26529 26529
2011-09-01
29261 29261
2011-10-01
31423 31423
2011-11-01
29062 29062
2011-12-01
29861 29861
2012-01-01
29943 29943
2012-02-01
29971 29971
2012-03-01
32300 32300
2012-04-01
30959 30959
2012-05-01
31142 31142
2012-06-01
30197 30197
2012-07-01
28767 28767
2012-08-01
28771 28771
2012-09-01
29098 29098
2012-10-01
29056 29056
2012-11-01
27554 27554
2012-12-01
29401 29401
2013-01-01
29556 29556
2013-02-01
27988 
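
A note on the conversion cell above: it gathers the coordinates with `collect_list` and re-attaches the zone IDs by indexing Python lists with `monotonically_increasing_id()`. `collect_list` skips null values, and the generated IDs are increasing but not guaranteed to be consecutive across partitions, so list positions can drift away from the rows they came from. The sketch below illustrates the difference in plain Python. `ZONES` and `lookup_zone` are hypothetical stand-ins for the shapefile and `convertlocID`, and the rectangles are made up for the example.

```python
# Hypothetical rectangular zones: id -> (min_lon, min_lat, max_lon, max_lat).
ZONES = {
    1: (-74.05, 40.60, -73.95, 40.70),
    2: (-73.95, 40.70, -73.85, 40.80),
}

def lookup_zone(lon, lat):
    """Return the id of the first zone containing the point, or None."""
    if lon is None or lat is None:
        return None
    for zone_id, (min_lon, min_lat, max_lon, max_lat) in ZONES.items():
        if min_lon <= lon < max_lon and min_lat <= lat < max_lat:
            return zone_id
    return None

rows = [
    {"trip": "a", "lon": -74.00, "lat": 40.65},
    {"trip": "b", "lon": None, "lat": None},   # blank coordinates
    {"trip": "c", "lon": -73.90, "lat": 40.75},
]

# Index-based approach: nulls are dropped while collecting, so the list is
# shorter than the table and position 1 now belongs to trip "c", not "b".
collected = [(r["lon"], r["lat"]) for r in rows if r["lon"] is not None]
by_index = [lookup_zone(lon, lat) for lon, lat in collected]

# Row-wise approach: each row is converted from its own coordinates.
by_row = [lookup_zone(r["lon"], r["lat"]) for r in rows]

print(by_index)  # [1, 2]
print(by_row)    # [1, None, 2]
```

In Spark, the same per-row function could presumably be wrapped with `f.udf` and applied directly to the two coordinate columns, which would keep the conversion on the executors instead of collecting every coordinate on the driver.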

In [53]:
yellow_DF.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: integer (nullable = true)
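
The schema printed above can be complemented by a check on the converted files themselves. The sketch below compares the header of a CSV file with the target column list, ignoring case. `TARGET_COLUMNS` follows the `select(...)` calls of the conversion cell, and `header_problems` is a hypothetical helper, not part of the notebook.

```python
import csv
import io

# Column order produced by the select(...) calls of the conversion cell, lowercased.
TARGET_COLUMNS = [
    "vendorid", "pickup_datetime", "dropoff_datetime", "passenger_count",
    "trip_distance", "ratecodeid", "store_and_fwd_flag", "pulocationid",
    "dolocationid", "payment_type", "fare_amount", "extra", "mta_tax",
    "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount",
    "congestion_surcharge",
]

def header_problems(handle, expected=TARGET_COLUMNS):
    """Compare the first CSV line with the expected names, ignoring case."""
    header = [name.strip().lower() for name in next(csv.reader(handle))]
    missing = [name for name in expected if name not in header]
    extra = [name for name in header if name not in expected]
    return missing, extra

# In-memory sample standing in for a converted file that kept a raw column name.
sample = io.StringIO("VendorID,tpep_pickup_datetime,passenger_count\n1,2019-01-01 00:00:00,1\n")
missing, extra = header_problems(sample)
print(extra)  # ['tpep_pickup_datetime']
```

For a real file, the same function can be called on `open(path, newline="")`; an empty `missing` and an empty `extra` list mean the header matches the target.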

