# Data integration

For each sub-dataset, write (and execute) code that converts a file (possibly using an old schema) into a file that uses the latest schema version.

Your conversion code should not modify the original files, but instead create a new file. Be sure to explain the design behind your conversion functions!
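
One way to respect the "never modify the originals" rule is to derive every output path from the input path and to refuse a target that coincides with the source. The sketch below is only an illustration: the helper name `output_path` is hypothetical, and it assumes files are named `<brand>_tripdata_<YYYY-MM>.csv` and that converted files go under `data/t2/<brand>/`.

```python
import os


def output_path(source_file, out_root="data/t2"):
    """Return the path where the converted copy of `source_file` should go.

    Assumption: the file name starts with the taxi brand followed by "_",
    e.g. "fhv_tripdata_2017-01.csv" -> brand "fhv".
    """
    name = os.path.basename(source_file)
    brand = name.split("_", 1)[0]
    target = os.path.join(out_root, brand, name)
    # Guard against overwriting the original file by mistake
    if os.path.abspath(target) == os.path.abspath(source_file):
        raise ValueError("output would overwrite the source file: %s" % source_file)
    return target


target = output_path("data/sampled/fhv_tripdata_2017-01.csv")
print(target)
```

Keeping the path logic in one small function means every conversion routine writes to the same layout, and the overwrite check lives in a single place.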

The data integration step is highly parallelizable. Therefore, your solution for this part
**must** be written in Spark.

WARNING: this notebook assumes that:

- The data are in the "MY_PARENT_FOLDER/data/sampled/" folder. You can run the bash script "download_metadata.sh" to download the data and metadata into the correct folders before executing the Jupyter notebooks.
- The data are sampled so that the notebook can run on a personal computer.

In [24]:
# Imports go here
import os
import glob
import shutil
from shutil import copyfile

import pandas as pd

# Must be set before the SparkSession is created so that the driver gets 3 GB of memory
os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=3g pyspark-shell"
from pyspark.sql import SparkSession

# Stop any Spark session left over from a previous run of this cell
try:
    spark
    print("Spark application already started. Terminating existing application and starting new one")
    spark.stop()
except NameError:
    # `spark` is not defined yet, so there is nothing to stop
    pass
# Create a new Spark session (note: the * means "use all available CPU cores")
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("H600 L-Group") \
    .getOrCreate()
# When dealing with RDDs, we work with the sparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc = spark.sparkContext
# In local mode, the Spark GUI is available at http://localhost:4040

Spark application already started. Terminating existing application and starting new one


In [13]:
#create t2 directories
try:
    os.mkdir("data/t2")
except OSError:
    print ("Creation of the directory data/t2 failed")
else:
    print ("Successfully created the directory data/t2")

list_taxi = ["yellow", "green", "fhv", "fhvhv"]
#list_taxi = ["green"]
for taxi_brand in list_taxi :
    path = "data/t2/%s" % taxi_brand
    # Create one output sub-directory per taxi brand
    try:
        os.mkdir(path)  
    except OSError:
        print ("Creation of the directory %s failed" % path)
    else:
        print ("Successfully created the directory %s " % path)
    

Creation of the directory data/t2 failed
Creation of the directory data/t2/yellow failed
Creation of the directory data/t2/green failed
Creation of the directory data/t2/fhv failed
Creation of the directory data/t2/fhvhv failed
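
The "failed" messages above most likely mean that the directories already existed from an earlier run, since `os.mkdir` raises `OSError` in that case too. A minimal sketch of an idempotent alternative, using `os.makedirs(..., exist_ok=True)`, is shown below. The helper name `ensure_brand_dirs` is hypothetical, and a temporary directory stands in for `data/` so the example does not touch the real folders.

```python
import os
import tempfile


def ensure_brand_dirs(root, brands=("yellow", "green", "fhv", "fhvhv")):
    """Create root/<brand> for every brand; do nothing if it already exists."""
    paths = []
    for brand in brands:
        path = os.path.join(root, brand)
        # exist_ok=True makes the call safe to repeat; parents are created too
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "t2")
    first = ensure_brand_dirs(root)
    second = ensure_brand_dirs(root)  # second call succeeds without error
    all_exist = all(os.path.isdir(p) for p in second)
```

With this variant, re-running the cell no longer reports a failure when the folders are already in place, while genuine errors (for example, missing permissions) still raise.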


## 1. FHVHV files

From previous analyses, we saw that the header is consistent across all the fhvhv files.
No schema conversion is needed, so the files can be gathered unchanged in a single output folder.

In [45]:
#read fhvhv files
#fhvhv_DF = (spark.read
#           .option("sep", "\t")
#           .option("header", True)
#           .option("inferSchema", True)
#            .csv('data/sampled/fhvhv_*.csv'))

#fhvhv_DF.printSchema()
#fhvhv_DF.show(5)

#export new csv file
#fhvhv_DF.write.save(path='data/modified/fhvhv.csv', format='csv', mode='append', sep='\t')
#this produces a folder containing several junk files

source_dir= 'data/sampled/'       
for filename in glob.glob(os.path.join(source_dir,'fhvhv_*.csv')):
    shutil.copy(filename, 'data/t2/fhvhv')

print ('list of fhvhv tripdata files:')
!ls data/t2/fhvhv

print ('count of fhvhv tripdata files:')
!find data/t2/fhvhv -type f | wc -l 

list of fhvhv tripdata files:
fhvhv_tripdata_2019-02.csv  fhvhv_tripdata_2020-01.csv
fhvhv_tripdata_2019-03.csv  fhvhv_tripdata_2020-03.csv
fhvhv_tripdata_2019-04.csv  fhvhv_tripdata_2020-04.csv
fhvhv_tripdata_2019-05.csv  fhvhv_tripdata_2020-05.csv
fhvhv_tripdata_2019-06.csv  fhvhv_tripdata_2020-06.csv
count of fhvhv tripdata files:
10


## 2. FHV files

From previous analyses, we saw that the fhv schema was adjusted several times:

In 2017 - 1 :

4 diff on a total of 5 col: ['pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid']
           2/4 col add
           2/4 name change

In 2017 - 7 :

1 diff on a total of 6 col: ['sr_flag']
           1/1 col add

In 2018 - 1 :

1 diff on a total of 7 col: ['dispatching_base_number']
           1/1 col add

In 2019 - 1 :

1 diff on a total of 6 col: ['dispatching_base_number']
           1/1 col remove
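
The changes listed above (renames, additions, and one removal) can all be handled by projecting every file onto the latest column list. The sketch below builds the SQL expressions for such a projection in plain Python, so that they could be passed to Spark's `DataFrame.selectExpr`. Several names are assumptions, not taken from the analysis: the first target column `dispatching_base_num`, the old column names `pickup_date` and `locationid` in `RENAMES`, and the helper `build_select_exprs` itself.

```python
# Assumed latest fhv schema (2019-01 onwards): the 5 columns of 2017-01 plus sr_flag.
# "dispatching_base_num" is an assumption; the analysis does not name that column.
TARGET_COLUMNS = [
    "dispatching_base_num",
    "pickup_datetime",
    "dropoff_datetime",
    "pulocationid",
    "dolocationid",
    "sr_flag",
]

# Hypothetical old-name -> new-name mapping for the two renamed columns.
RENAMES = {
    "pickup_date": "pickup_datetime",
    "locationid": "pulocationid",
}


def build_select_exprs(source_columns, target_columns=TARGET_COLUMNS, renames=RENAMES):
    """Return one SQL expression per target column.

    - columns present under an old or new name are renamed to the target name
    - columns missing from the source are filled with NULL
    - source columns absent from the target (e.g. dispatching_base_number) are dropped
    """
    available = {}
    for col in source_columns:
        key = col.strip().lower()  # headers differ in case between files
        available[renames.get(key, key)] = col
    exprs = []
    for target in target_columns:
        if target in available:
            exprs.append("`%s` AS `%s`" % (available[target], target))
        else:
            exprs.append("CAST(NULL AS STRING) AS `%s`" % target)
    return exprs


# Example with a hypothetical pre-2017 header
exprs = build_select_exprs(["Dispatching_base_num", "Pickup_date", "locationID"])
```

With a DataFrame `df` read from one fhv file, the projection would then be `df.selectExpr(*build_select_exprs(df.columns))`, and the result can be written to `data/t2/fhv` without touching the source file. Because the expression list depends only on the header, the same function works for every schema version.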

