# Data integration

For each sub-dataset, write (and execute) code that converts a file (using possibly an old schema) into a file that has the new, latest schema version.

Your conversion code should not modify the original files, but instead create a new file. Be sure to explain the design behind your conversion functions!

The data integration step is highly parallelizable. Therefore, your solution for this part
**must** be written in Spark.

WARNING: this notebook assumes that:

- The data are in the "MY_PARENT_FOLDER/data/sampled/" folder. You can run the bash script "download_metadata.sh" to download the data and metadata into the correct folders before executing the Jupyter notebooks.
- The data are sampled so that they can be processed on a personal computer.

In [29]:
# Imports go here
import os
import glob
import pandas as pd
import shutil
import datetime
from shutil import copyfile
os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=3g  pyspark-shell"
from pyspark.sql import SparkSession
try:
    spark
    print("Spark application already started. Terminating existing application and starting new one")
    spark.stop()
except NameError:
    # `spark` is not defined yet, so there is no session to stop
    pass
# Create a new spark session (note, the * indicates to use all available CPU cores)
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("H600 L-Group") \
    .getOrCreate()
# When dealing with RDDs, we work with the sparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc=spark.sparkContext
#in local mode, you will be able to access the Spark GUI at http://localhost:4040

Spark application already started. Terminating existing application and starting new one


In [79]:
# Create the cleaned data directories (nothing happens if they already exist)
os.makedirs("data/cleaned", exist_ok=True)
print("Successfully created the directory data/cleaned")

list_taxi = ["yellow", "green", "fhv", "fhvhv"]
#list_taxi = ["green"]
for taxi_brand in list_taxi:
    # One output directory per taxi brand
    path = "data/cleaned/%s" % taxi_brand
    os.makedirs(path, exist_ok=True)
    print("Successfully created the directory %s " % path)
    

Successfully created the directory data/cleaned
Successfully created the directory data/cleaned/yellow 
Successfully created the directory data/cleaned/green 
Successfully created the directory data/cleaned/fhv 
Successfully created the directory data/cleaned/fhvhv 


## 1. FHVHV files

From previous analyses we saw that the header is consistent across all the fhvhv files.
We can therefore create a unified file.
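A minimal sketch of how such a header-consistency check could look, assuming comma-separated files with a header row. The helper names `read_header` and `headers_are_consistent` are hypothetical and not part of this notebook; only the first line of each file is read.

```python
import csv

def read_header(path):
    """Return the header row of a CSV file as lower-cased, stripped column names."""
    with open(path, newline="") as handle:
        first_row = next(csv.reader(handle), [])
    return [name.strip().lower() for name in first_row]

def headers_are_consistent(paths):
    """Return True when every file in `paths` has the same header as the first one."""
    headers = [read_header(path) for path in paths]
    return all(header == headers[0] for header in headers)
```

It could be called as `headers_are_consistent(sorted(glob.glob('data/sampled/fhvhv_*.csv')))` before copying the files.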

In [6]:

# Export a new csv file
#fhvhv_DF.write.save(path='data/modified/fhvhv.csv', format='csv', mode='append', sep='\t')
# (this produces a folder containing several messy part files)

# Copy the files to the cleaned directory

source_dir = 'data/sampled/'
for filename in glob.glob(os.path.join(source_dir, 'fhvhv_*.csv')):
    shutil.copy(filename, 'data/cleaned/fhvhv')

print('list of fhvhv tripdata files:')
!ls data/cleaned/fhvhv

print('count of fhvhv tripdata files:')
!find data/cleaned/fhvhv -type f | wc -l

/home/seb/Documents/Master_BDGA
list of fhvhv tripdata files:
fhvhv_tripdata_2019-02.csv  fhvhv_tripdata_2020-01.csv
fhvhv_tripdata_2019-03.csv  fhvhv_tripdata_2020-03.csv
fhvhv_tripdata_2019-04.csv  fhvhv_tripdata_2020-04.csv
fhvhv_tripdata_2019-05.csv  fhvhv_tripdata_2020-05.csv
fhvhv_tripdata_2019-06.csv  fhvhv_tripdata_2020-06.csv
count of fhvhv tripdata files:
10


## 2. FHV files

From the previous analysis, we decided to use the following schema as the reference for the FHV taxi files:

['dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag']

We therefore need to apply some transformations to create new uniform files, according to the time periods previously defined and saved in the file Change_date_fhv.csv:

- Change schema 1:
    a) Add empty columns for 'dropoff_datetime', 'DOLocationID' and 'SR_Flag'.
    b) Rename the columns 'Pickup_date' to 'pickup_datetime', 'locationID' to 'PULocationID' and 'Dispatching_base_num' to 'dispatching_base_num'.

- Change schema 2:
    a) Add empty columns for 'DOLocationID' and 'SR_Flag'.
    b) Rename the columns 'Pickup_DateTime' to 'pickup_datetime', 'Dropoff_datetime' to 'dropoff_datetime' and 'Dispatching_base_num' to 'dispatching_base_num'.

- Change schema 3:
    a) Rename the columns 'Pickup_DateTime' to 'pickup_datetime', 'Dropoff_datetime' to 'dropoff_datetime' and 'Dispatching_base_num' to 'dispatching_base_num'.

- Change schema 4:
    a) Rename the columns 'Pickup_DateTime' to 'pickup_datetime', 'Dropoff_datetime' to 'dropoff_datetime' and 'Dispatching_base_number' to 'dispatching_base_num'.
    b) Remove the duplicate column 'Dispatching_base_num', which contains no values.

- Change schema 5:
    No change.
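The rules above can also be written as a small lookup table. The sketch below uses plain Python dictionaries on a single row to show the mapping logic only; `TARGET_COLUMNS`, `RENAMES` and `convert_row` are hypothetical names, and the column casing follows the `select()` calls of this notebook. The same table could drive the Spark `select`/`alias` calls instead of one hand-written branch per schema.

```python
# Target column order (casing as used in the notebook's select() calls)
TARGET_COLUMNS = ["dispatching_base_num", "pickup_datetime", "dropoff_datetime",
                  "PULocationID", "DOLocationID", "SR_Flag"]

# Renames shared by schemas 2 and 3
_COMMON = {"Dispatching_base_num": "dispatching_base_num",
           "Pickup_DateTime": "pickup_datetime",
           "Dropoff_datetime": "dropoff_datetime"}

# Old name -> new name for each schema period (schema 5 needs no rename)
RENAMES = {
    1: {"Dispatching_base_num": "dispatching_base_num",
        "Pickup_date": "pickup_datetime",
        "locationID": "PULocationID"},
    2: _COMMON,
    3: _COMMON,
    4: {"Dispatching_base_number": "dispatching_base_num",
        "Pickup_DateTime": "pickup_datetime",
        "Dropoff_datetime": "dropoff_datetime"},
    5: {},
}

def convert_row(row, schema_id):
    """Rename the columns of one row and fill the missing target columns with ''."""
    renames = RENAMES[schema_id]
    renamed = {renames.get(name, name): value for name, value in row.items()}
    # Keep only the target columns: this also drops the empty duplicate
    # 'Dispatching_base_num' column of schema 4
    return {name: renamed.get(name, "") for name in TARGET_COLUMNS}
```

Keeping the rules in one table means that a new schema period only requires a new dictionary entry, not a new branch in the conversion loop.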


In [157]:
from datetime import date, datetime
from pyspark.sql.functions import col, lit
source_dir = 'data/sampled/'
clean_dir = 'data/cleaned/'
taxi_brand = 'fhv'
# List the files of this taxi brand; the file names embed YYYY-MM,
# so sorting them alphabetically also orders them by date
list_files = sorted(glob.glob("data/sampled/%s_*.csv" % taxi_brand))
nb_files = len(list_files)



# Open the date change file
df = pd.read_csv("data/Change_date_%s.csv" %(taxi_brand), sep=',', header=None)
dating_schema = [ datetime.strptime(x, '%Y-%m-%d') for x in df[1] ]
for yr in range(0,nb_files):
    year = int(list_files[yr][len(taxi_brand)+23:len(taxi_brand)+27])
    month = int(list_files[yr][len(taxi_brand)+28:len(taxi_brand)+30])
    date_file = date(year,month,1)
    fhv_DF = (spark.read
                .option("sep", ",")
                .option("header", True)
                .option("inferSchema", True)
                .csv(list_files[yr]) )
    for nb_schema in range(0,len(dating_schema)-1):
        print(date_file)
        if date_file >= dating_schema[nb_schema].date() and  date_file < dating_schema[nb_schema+1].date():
            if nb_schema+1 == 1 :
                print("schema",nb_schema+1)
                fhv1_DF = fhv_DF.withColumn("dropoff_datetime",lit(None).cast("string"))\
                       .withColumn("DOLocationID",lit(None).cast("string"))\
                       .withColumn("SR_Flag",lit(None).cast("string"))\
                       .select(
                        col("Dispatching_base_num").alias("dispatching_base_num"),
                        col("Pickup_date").alias("pickup_datetime"),
                        "dropoff_datetime",
                        col("locationID").alias("PULocationID"),
                        "DOLocationID",
                        "SR_Flag")
                fhv1_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(taxi_brand)+14::])
            elif nb_schema+1 == 2 :
                print("schema",nb_schema+1)
                fhv2_DF = fhv_DF.withColumn("DOLocationID",lit(None).cast("string"))\
                        .withColumn("SR_Flag",lit(None).cast("string"))\
                        .select(
                            col("Dispatching_base_num").alias("dispatching_base_num"),
                            col("Pickup_DateTime").alias("pickup_datetime"),
                            col("Dropoff_datetime").alias("dropoff_datetime"),
                            "PULocationID",
                            "DOLocationID",
                            "SR_Flag")
                fhv2_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(taxi_brand)+14::])
            elif nb_schema+1 == 3 :
                print("schema",nb_schema+1)
                fhv3_DF = fhv_DF.select(
                            col("Dispatching_base_num").alias("dispatching_base_num"),
                            col("Pickup_DateTime").alias("pickup_datetime"),
                            col("Dropoff_datetime").alias("dropoff_datetime"),
                            "PULocationID",
                            "DOLocationID",
                            "SR_Flag")
                fhv3_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(taxi_brand)+14::])
            elif nb_schema+1 == 4 :
                print("schema",nb_schema+1)
                fhv4_DF = fhv_DF.select(
                            col("Dispatching_base_number").alias("dispatching_base_num"),
                            col("Pickup_DateTime").alias("pickup_datetime"),
                            col("Dropoff_datetime").alias("dropoff_datetime"),
                            "PULocationID",
                            "DOLocationID",
                            "SR_Flag")
                fhv4_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(taxi_brand)+14::])
            elif nb_schema+1 == 5 :
                print("schema",nb_schema+1)
                fhv5_DF = fhv_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(taxi_brand)+14::])
    if date_file == dating_schema[5].date() :
        fhv5_DF = fhv_DF.toPandas().to_csv(clean_dir+taxi_brand+'/'+list_files[yr][len(taxi_brand)+14::])
new_files = len(os.listdir('data/cleaned/fhv'))
if new_files == nb_files :
    print("All the %i files are well integrated !" %(new_files))
else :
    print("[ERROR] %i files on %i files have been integrated ..." %(new_files, nb_files))

2015-01-01
schema 1
2015-01-01
2015-01-01
2015-01-01
2015-01-01
2015-02-01
schema 1
2015-02-01
2015-02-01
2015-02-01
2015-02-01
2015-03-01
schema 1
2015-03-01
2015-03-01
2015-03-01
2015-03-01
2015-04-01
schema 1
2015-04-01
2015-04-01
2015-04-01
2015-04-01
2015-05-01
schema 1
2015-05-01
2015-05-01
2015-05-01
2015-05-01
2015-06-01
schema 1
2015-06-01
2015-06-01
2015-06-01
2015-06-01
2015-07-01
schema 1
2015-07-01
2015-07-01
2015-07-01
2015-07-01
2015-08-01
schema 1
2015-08-01
2015-08-01
2015-08-01
2015-08-01
2015-09-01
schema 1
2015-09-01
2015-09-01
2015-09-01
2015-09-01
2015-10-01
schema 1
2015-10-01
2015-10-01
2015-10-01
2015-10-01
2015-11-01
schema 1
2015-11-01
2015-11-01
2015-11-01
2015-11-01
2015-12-01
schema 1
2015-12-01
2015-12-01
2015-12-01
2015-12-01
2016-01-01
schema 1
2016-01-01
2016-01-01
2016-01-01
2016-01-01
2016-02-01
schema 1
2016-02-01
2016-02-01
2016-02-01
2016-02-01
2016-03-01
schema 1
2016-03-01
2016-03-01
2016-03-01
2016-03-01
2016-04-01
schema 1
2016-04-01
2016-04-0

In [152]:
df = pd.read_csv("data/Change_date_%s.csv" %(taxi_brand), sep=',', header=None)
dating_schema = [ datetime.strptime(x, '%Y-%m-%d') for x in df[1] ]
# Counters: number of files that fall into each schema period
fhv1_DF, fhv2_DF, fhv3_DF, fhv4_DF, fhv5_DF = 0,0,0,0,0
for yr in range(0,nb_files):
    year = int(list_files[yr][len(taxi_brand)+23:len(taxi_brand)+27])
    month = int(list_files[yr][len(taxi_brand)+28:len(taxi_brand)+30])
    date_file = date(year,month,1)
    for nb_schema in range(0,len(dating_schema)-1):
        if date_file >= dating_schema[nb_schema].date() and date_file < dating_schema[nb_schema+1].date():
            if nb_schema+1 == 1 :
                fhv1_DF = fhv1_DF + 1
            elif nb_schema+1 == 2 :
                fhv2_DF = fhv2_DF + 1
            elif nb_schema+1 == 3 :
                fhv3_DF = fhv3_DF + 1
            elif nb_schema+1 == 4 :
                fhv4_DF = fhv4_DF + 1
            elif nb_schema+1 == 5 :
                fhv5_DF = fhv5_DF + 1
    if date_file == dating_schema[5].date() :
        fhv5_DF = fhv5_DF + 1
#print(list_files[0:fhv1_DF])
#print(list_files[fhv1_DF:fhv1_DF+fhv2_DF])
#print(list_files[fhv1_DF+fhv2_DF:fhv1_DF+fhv2_DF+fhv3_DF])
#print(list_files[fhv1_DF+fhv2_DF+fhv3_DF+fhv4_DF])
#print(list_files[fhv1_DF+fhv2_DF+fhv3_DF+fhv4_DF:fhv1_DF+fhv2_DF+fhv3_DF+fhv4_DF+fhv5_DF])


All the 64 files are well integrated !
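The date-range lookup used in the two loops above can also be sketched with a regular expression and a binary search. This is an illustration under assumptions: `file_month` and `schema_number` are hypothetical helpers, and `change_dates` is assumed to hold only the sorted start date of each schema period (unlike `dating_schema`, which also stores the final date).

```python
import re
from bisect import bisect_right
from datetime import date

# Matches the trailing "_YYYY-MM.csv" part of a trip data file name
FILE_DATE = re.compile(r"_(\d{4})-(\d{2})\.csv$")

def file_month(path):
    """Return the first day of the month encoded in the file name."""
    match = FILE_DATE.search(path)
    if match is None:
        raise ValueError("no YYYY-MM date found in %r" % path)
    return date(int(match.group(1)), int(match.group(2)), 1)

def schema_number(file_date, change_dates):
    """Return the 1-based schema period of `file_date`, or 0 if it precedes the first period.

    `change_dates` must be sorted in ascending order; period k starts at change_dates[k - 1].
    """
    return bisect_right(change_dates, file_date)
```

Compared with fixed character offsets, the regular expression keeps working if the folder name or the taxi brand prefix changes length.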


## 3. Green files

For green there are 76 files:

In 2015-01:

   1 difference out of 21 columns: ['improvement_surcharge']
           1/1 column added

In 2016-07:

   2 differences out of 19 columns: ['pulocationid', 'dolocationid']
           2/2 columns removed

In 2019-01:

   1 difference out of 20 columns: ['congestion_surcharge']
           1/1 column added



In [110]:
print ('list of green tripdata files:')
!ls data/sampled/green*

list of green tripdata files:
data/sampled/green_tripdata_2013-08.csv
data/sampled/green_tripdata_2013-09.csv
data/sampled/green_tripdata_2013-10.csv
data/sampled/green_tripdata_2013-11.csv
data/sampled/green_tripdata_2013-12.csv
data/sampled/green_tripdata_2014-01.csv
data/sampled/green_tripdata_2014-02.csv
data/sampled/green_tripdata_2014-03.csv
data/sampled/green_tripdata_2014-04.csv
data/sampled/green_tripdata_2014-05.csv
data/sampled/green_tripdata_2014-06.csv
data/sampled/green_tripdata_2014-07.csv
data/sampled/green_tripdata_2014-08.csv
data/sampled/green_tripdata_2014-09.csv
data/sampled/green_tripdata_2014-10.csv
data/sampled/green_tripdata_2014-11.csv
data/sampled/green_tripdata_2014-12.csv
data/sampled/green_tripdata_2015-01.csv
data/sampled/green_tripdata_2015-02.csv
data/sampled/green_tripdata_2015-03.csv
data/sampled/green_tripdata_2015-04.csv
data/sampled/green_tripdata_2015-05.csv
data/sampled/green_tripdata_2015-06.csv
data/sampled/green_tripdat

In [166]:
# Read one green file per schema period
green1_DF = (spark.read
           .option("sep", ",")
           .option("header", True)
           .option("inferSchema", True)
            .csv('data/sampled/green_tripdata_2014-12.csv'))
green1_DF.printSchema()
green1_DF.show(1)
green2_DF = (spark.read
           .option("sep", ",")
           .option("header", True)
           .option("inferSchema", True)
            .csv('data/sampled/green_tripdata_2015-01.csv'))
green2_DF.printSchema()
green2_DF.show(1)
green3_DF = (spark.read
           .option("sep", ",")
           .option("header", True)
           .option("inferSchema", True)
            .csv('data/sampled/green_tripdata_2016-07.csv'))
green3_DF.printSchema()
green3_DF.show(2)

root
 |-- VendorID: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- Lpep_dropoff_datetime: timestamp (nullable = true)
 |-- Store_and_fwd_flag: string (nullable = true)
 |-- RateCodeID: integer (nullable = true)
 |-- Pickup_longitude: double (nullable = true)
 |-- Pickup_latitude: double (nullable = true)
 |-- Dropoff_longitude: double (nullable = true)
 |-- Dropoff_latitude: double (nullable = true)
 |-- Passenger_count: integer (nullable = true)
 |-- Trip_distance: double (nullable = true)
 |-- Fare_amount: double (nullable = true)
 |-- Extra: double (nullable = true)
 |-- MTA_tax: double (nullable = true)
 |-- Tip_amount: double (nullable = true)
 |-- Tolls_amount: double (nullable = true)
 |-- Ehail_fee: string (nullable = true)
 |-- Total_amount: double (nullable = true)
 |-- Payment_type: integer (nullable = true)
 |-- Trip_type: integer (nullable = true)

+--------+--------------------+---------------------+------------------+----------+----

In [168]:
# 1. Modify the 2014-12 schema as needed
#    (improvement_surcharge and congestion_surcharge do not exist yet: add them as empty columns)
green_201412 = green1_DF.withColumn("improvement_surcharge",lit(None).cast("double"))\
                        .withColumn("congestion_surcharge",lit(None).cast("double"))\
                        .select(
                            col("VendorID").alias("VendorID"),
                            col("lpep_pickup_datetime").alias("lpep_pickup_datetime"),
                            col("Lpep_dropoff_datetime").alias("lpep_dropoff_datetime"),
                            col("Store_and_fwd_flag").alias("store_and_fwd_flag"),
                            col("RateCodeID").alias("RatecodeID"),
                            col("Passenger_count").alias("passenger_count"),
                            col("Trip_distance").alias("trip_distance"),
                            col("Fare_amount").alias("fare_amount"),
                            col("Extra").alias("extra"),
                            col("MTA_tax").alias("mta_tax"),
                            col("Tip_amount").alias("tip_amount"),
                            col("Tolls_amount").alias("tolls_amount"),
                            col("Ehail_fee").alias("ehail_fee"),
                            col("improvement_surcharge").alias("improvement_surcharge"),
                            col("Total_amount").alias("total_amount"),
                            col("Payment_type").alias("payment_type"),
                            col("Trip_type").alias("trip_type"),
                            "congestion_surcharge")
#green_201412.show(1)
#
# 2. Modify the 2015-01 schema as needed (with pu/do location id)
green_201501 = green2_DF.withColumn("congestion_surcharge",lit(None).cast("double"))\
                        .select(
                            col("VendorID").alias("VendorID"),
                            col("lpep_pickup_datetime").alias("lpep_pickup_datetime"),
                            col("Lpep_dropoff_datetime").alias("lpep_dropoff_datetime"),
                            col("Store_and_fwd_flag").alias("store_and_fwd_flag"),
                            col("RateCodeID").alias("RatecodeID"),
                            col("Passenger_count").alias("passenger_count"),
                            col("Trip_distance").alias("trip_distance"),
                            col("Fare_amount").alias("fare_amount"),
                            col("Extra").alias("extra"),
                            col("MTA_tax").alias("mta_tax"),
                            col("Tip_amount").alias("tip_amount"),
                            col("Tolls_amount").alias("tolls_amount"),
                            col("Ehail_fee").alias("ehail_fee"),
                            col("improvement_surcharge").alias("improvement_surcharge"),
                            col("Total_amount").alias("total_amount"),
                            col("Payment_type").alias("payment_type"),
                            col("Trip_type").alias("trip_type"),
                            "congestion_surcharge")

green_201501.show(1)

# 3. Modify the 2016-07 schema as needed
green_201607 = green3_DF.withColumn("congestion_surcharge",lit(None).cast("double"))
#green_201607.show(2)



+--------+--------------------+---------------------+------------------+----------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+--------------------+
|VendorID|lpep_pickup_datetime|lpep_dropoff_datetime|store_and_fwd_flag|RatecodeID|passenger_count|trip_distance|fare_amount|extra|mta_tax|tip_amount|tolls_amount|ehail_fee|improvement_surcharge|total_amount|payment_type|trip_type|congestion_surcharge|
+--------+--------------------+---------------------+------------------+----------+---------------+-------------+-----------+-----+-------+----------+------------+---------+---------------------+------------+------------+---------+--------------------+
|       2| 2015-01-21 20:07:58|  2015-01-21 20:19:47|                 N|         1|              1|         2.48|       11.0|  0.5|    0.5|       0.0|         0.0|     null|                  0.3|        12.3|           1|        1|          

In [143]:
# Imports for geolocation
import os
import glob
import pandas as pd
import math
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

In [145]:
# Turn this subdf into a GeoDF, adding a "geometry" column to pinpoint the pickup location
geometry = [Point(xy) for xy in zip(subdf['pickup_longitude'], subdf['pickup_latitude'])]
geo_subdf = gpd.GeoDataFrame(subdf, geometry=geometry)
geo_subdf
# A geopandas dataframe can create an R-tree index on its geometry
rtree = zones.sindex

In [146]:
# By means of the intersection() method we can query for all the entries in the zones dataframe that 
# *can* intersect with a query point
# Note: this method can return false positives; the actual zone is always part of the result.
# The method returns a generator. We use the list(.) constructor to convert this to a list.

query_point = Point( df1.iloc[0].pickup_longitude, df1.iloc[0].pickup_latitude)
possible_matches = list(rtree.intersection( query_point.bounds ))
possible_matches

[140, 236, 42]

In [147]:
zones.iloc[ possible_matches ]

Unnamed: 0,OBJECTID,Shape_Leng,Shape_Area,zone,LocationID,borough,geometry
140,141,0.041514,7.7e-05,Lenox Hill West,141,Manhattan,"POLYGON ((-73.96178 40.75988, -73.96197 40.759..."
236,237,0.042213,9.6e-05,Upper East Side South,237,Manhattan,"POLYGON ((-73.96613 40.76218, -73.96658 40.761..."
42,43,0.099739,0.00038,Central Park,43,Manhattan,"POLYGON ((-73.97255 40.76490, -73.97301 40.764..."
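Because the index only returns candidates, an exact containment test is still needed to pick the actual zone among them. The sketch below illustrates this filter-then-refine idea in plain Python on toy polygons: a bounding-box filter stands in for the R-tree lookup and a ray-casting test stands in for an exact geometric containment check. All function names and the `zones` dictionary layout are hypothetical.

```python
def polygon_bounds(polygon):
    """Return (min_x, min_y, max_x, max_y) of a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def bbox_contains(bounds, x, y):
    """Cheap filter: is the point inside the bounding box?"""
    min_x, min_y, max_x, max_y = bounds
    return min_x <= x <= max_x and min_y <= y <= max_y

def point_in_polygon(polygon, x, y):
    """Exact test by ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def locate(zones, x, y):
    """Return the id of the first zone containing the point, or None.

    `zones` maps a zone id to its list of (x, y) vertices.
    """
    candidates = [zone_id for zone_id, polygon in zones.items()
                  if bbox_contains(polygon_bounds(polygon), x, y)]
    for zone_id in candidates:
        if point_in_polygon(zones[zone_id], x, y):
            return zone_id
    return None
```

The bounding-box step may keep several zones, as in the three candidates above; only the exact test decides which one contains the pickup point.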


## 4. Yellow files

For yellow there are 131 files:

 In 2010-01:

   12 differences out of 18 columns: ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'pickup_longitude', 'pickup_latitude', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'fare_amount', 'tip_amount', 'tolls_amount', 'total_amount']
         12/12 column names changed

 In 2015-01:

   6 differences out of 19 columns: ['vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'ratecodeid', 'extra', 'improvement_surcharge']
         1/6 column added
         5/6 column names changed

 In 2016-07:

   2 differences out of 17 columns: ['pulocationid', 'dolocationid']
         2/2 columns removed

 In 2019-01:

   1 difference out of 18 columns: ['congestion_surcharge']
         1/1 column added

In [107]:
print ('list of yellow tripdata files:')
!ls data/sampled/yellow*
