# DATA WRANGLING HACKATHON

# FILES STEP

### Overview
This data dictionary describes High Volume FHV trip data. Each row represents a single trip in an FHV dispatched by one of NYC’s licensed High Volume FHV bases. On August 14, 2018, Mayor de Blasio signed Local Law 149 of 2018, creating a new license category for TLC-licensed FHV businesses that currently dispatch or plan to dispatch more than 10,000 FHV trips in New York City per day under a single brand, trade, or operating name, referred to as High-Volume For-Hire Services (HVFHS). This law went into effect on Feb 1, 2019.

### Objective
The main goal of this hackathon is to predict whether the client will leave a tip.
Your submission file should be a CSV file with two columns (see the example in sample_submission.csv):
* ID: identifier of the observation
* Tipped: whether the client tipped or not
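
A minimal sketch of building such a submission with pandas. The IDs and predictions are made-up placeholders and the file name `submission.csv` in the comment is an assumption; only the two column names come from the description above.

```python
import pandas as pd

# Hypothetical values: in practice the predictions come from the trained model.
ids = [101, 102, 103]
predictions = [0, 1, 0]

submission = pd.DataFrame({"ID": ids, "Tipped": predictions})

# index=False keeps the output to exactly the two required columns.
# To write a file instead: submission.to_csv("submission.csv", index=False)
csv_text = submission.to_csv(index=False)
print(csv_text)
```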

A dataset spread over several data sources has been provided for you. There are plenty of features, and it’s up to you to use as many or as few as you want; some features may be more relevant than others.
Keep in mind that this is a Data Wrangling specialization. 

### Datasets:
| **Dataset** | **Information**    | **Location** |
|-------------|--------------------|--------------|
| API         | Trip Mileage       | https://hckt02-api.lisbondatascience.org/docs#/default/get_data_data_get |
| Webpage     | Taxi Zone Data     | https://s02-infrastructure.s3.eu-west-1.amazonaws.com/hackathon-02-batch8/index.html |
| Files       | Detailed Trip Data | https://drive.google.com/drive/folders/12MhOAVrplggHVTm6-CtjqkkjI9xrVPek?usp=drive_link |
| Database    | Weather Data       | batch-s02.ctq2kxc7kx1i.eu-west-1.rds.amazonaws.com |



# Why Use Dask for Large-Scale File Manipulation?

**Dask is a parallel computing library** in Python that excels in handling large-scale data processing tasks. **While Pandas is powerful** for manipulating structured data in formats like JSON, Parquet, CSV, and TSV, **it struggles with large datasets that exceed your system’s memory.** Dask, however, overcomes this limitation by breaking datasets into smaller chunks and processing them in parallel across multiple cores or even distributed systems.

***Key Benefits of Using Dask:***

1. **Scalability:** Dask processes data that doesn’t fit into memory by working in chunks, which makes it possible to manipulate large files.
2. **Performance:** Its ability to parallelize operations accelerates computations, making it well suited to processing multiple large files simultaneously.
3. **Familiar API:** Dask’s DataFrame API mimics Pandas, so the learning curve for Pandas users is minimal.
4. **Multi-format Support:** It supports reading and writing various file formats such as JSON, Parquet, CSV, and TSV, and handles files split into multiple parts.
5. **Integration:** Dask integrates well with the existing Python ecosystem, including NumPy, Scikit-learn, and distributed computing frameworks.

***Why Dask for Our Solution?***

In our project, we are dealing with files of varying sizes and formats. Dask allows us to:
* Efficiently load and manipulate large JSON, Parquet, CSV, and TSV files without having to fit them entirely in memory.
* Process data in parallel to reduce runtime.
* Handle diverse data formats uniformly, which keeps the workflow streamlined.

By using Dask, we can scale beyond the limitations of Pandas while maintaining a familiar and Pythonic workflow, ensuring our solution remains robust, efficient, and ready for big data challenges.
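
The chunking idea can be illustrated without Dask. The sketch below uses plain pandas to read invented data in small chunks, one after another, and aggregates as it goes. Dask applies the same principle, but it manages the chunks (partitions) itself and can process them in parallel.

```python
import io
import pandas as pd

# Invented data standing in for a file too large to load at once.
csv_text = "ID,Tipped\n" + "\n".join(f"{i},{i % 2}" for i in range(10))

total_rows = 0
total_tipped = 0

# Each chunk is an ordinary DataFrame holding at most 4 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)
    total_tipped += int(chunk["Tipped"].sum())

print(total_rows, total_tipped)
```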

# Why not use Spark?
* Dask offers a simpler and more lightweight alternative to Spark, making it ideal for Python users who need to process large datasets without the complexity of managing a full distributed system. Unlike Spark, Dask runs natively in Python, allowing seamless integration with the existing Python ecosystem, such as NumPy, Pandas, and Scikit-learn, without requiring a separate cluster or JVM environment. Its installation and setup are straightforward, and it provides a familiar API for users experienced with Pandas. This simplicity makes Dask a more accessible option for developers who need efficient, scalable data processing without the overhead and learning curve associated with Spark.

### API docs: 
https://docs.dask.org/en/stable/

## Dask Dashboard

In [1]:
from dask.distributed import Client
client = Client(memory_limit='4GB')
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 4
Total threads: 8
Total memory: 14.90 GiB
Status: running
Using processes: True

Scheduler comm: tcp://127.0.0.1:65006 (Started: Just now)

| Worker comm | Total threads | Memory | Dashboard | Nanny | Local directory |
|-------------|---------------|--------|-----------|-------|-----------------|
| tcp://127.0.0.1:65019 | 2 | 3.73 GiB | http://127.0.0.1:65021/status | tcp://127.0.0.1:65009 | /var/folders/ct/d5kc2y6d7xd99yz5v6hb22lc0000gn/T/dask-scratch-space/worker-13ov4p6v |
| tcp://127.0.0.1:65020 | 2 | 3.73 GiB | http://127.0.0.1:65024/status | tcp://127.0.0.1:65010 | /var/folders/ct/d5kc2y6d7xd99yz5v6hb22lc0000gn/T/dask-scratch-space/worker-lh8nlnfi |
| tcp://127.0.0.1:65018 | 2 | 3.73 GiB | http://127.0.0.1:65023/status | tcp://127.0.0.1:65011 | /var/folders/ct/d5kc2y6d7xd99yz5v6hb22lc0000gn/T/dask-scratch-space/worker-88pcyj9r |
| tcp://127.0.0.1:65017 | 2 | 3.73 GiB | http://127.0.0.1:65022/status | tcp://127.0.0.1:65012 | /var/folders/ct/d5kc2y6d7xd99yz5v6hb22lc0000gn/T/dask-scratch-space/worker-csbnehr4 |




# File Wrangling Strategy:
1. Load every file into a DataFrame, processing them in the following order:
   1. csv_train_df.csv
   2. json_train_df.json
   3. parquet_train_df.parquet
   4. tsv_train_df.tsv
2. Merge all files into a single dataset and generate a Parquet output file

* **Explanation**: The dataset files share the same schema. In practice, such datasets might come from different systems or from other kinds of sources, such as Excel files. We merge them to simplify the later join, cleansing, feature selection, and enrichment operations.

### Library

In [2]:
import os

def filesize(filename):
    file_path = filename
    file_size = os.path.getsize(file_path) / (1024 * 1024)
    message = f"physical file size: {file_size:.2f} MB"
    return message

In [3]:
# Custom function to map Y/N-style flag values to 1/0
def map_flag(value):
    if str(value).strip().upper().startswith('Y'):
        return 1
    elif str(value).strip().upper().startswith('N'):
        return 0
    else:
        return None  # Or some default value for unidentified cases.

# Processing the Files

## Physical File Size

In [4]:
print(f"CSV {filesize('.data/csv_train_df.csv')}")
print(f"JSON {filesize('.data/json_train_df.json')}")
print(f"PARQUET {filesize('.data/parquet_train_df.parquet')}")
print(f"TSV {filesize('.data/tsv_train_df.tsv')}")

CSV physical file size: 382.27 MB
JSON physical file size: 463.08 MB
PARQUET physical file size: 156.16 MB
TSV physical file size: 358.02 MB


In [5]:
# https://docs.dask.org/en/latest/dataframe-best-practices.html
import dask.dataframe as dd

## A. CSV dataset manipulation

In [6]:
csv_train_df = dd.read_csv('.data/csv_train_df.csv')

## B. JSON dataset manipulation (complex JSON)

In [7]:
from dask import delayed
import pandas as pd
import json

# Define a function to load JSON and extract the 'data' key
@delayed
def load_json_data(filepath):
    with open(filepath, "r") as f:
        json_data = json.load(f)  # Load the JSON file
    return json_data["data"]      # Return only the 'data' key

# Path to the JSON file
filepath = ".data/json_train_df.json"

# Load the 'data' key using Dask Delayed
data = load_json_data(filepath)

# Create a Dask DataFrame from the delayed data
json_train_df = dd.from_delayed([delayed(pd.DataFrame)(data)])
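
To make the expected file layout concrete, here is a small sketch with pandas only. It assumes, as the loader above does, that the rows sit under a top-level `data` key, and further assumes that they are stored as a list of records. The sample records and the extra `meta` key are invented for illustration.

```python
import json
import pandas as pd

# Invented sample mimicking a JSON file whose rows live under a "data" key.
raw_text = json.dumps({
    "meta": {"source": "example"},
    "data": [
        {"ID": 1, "trip_miles": 1.85, "Tipped": 0},
        {"ID": 2, "trip_miles": 5.70, "Tipped": 1},
    ],
})

payload = json.loads(raw_text)

# Only the list under "data" becomes rows; other top-level keys are ignored.
records_df = pd.DataFrame(payload["data"])
print(records_df.shape)
```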

## C. PARQUET dataset manipulation

In [8]:
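# Note: pandas loads the whole Parquet file into memory here;
# dd.read_parquet('.data/parquet_train_df.parquet') would keep the load lazy.
parquet_train_df = pd.read_parquet('.data/parquet_train_df.parquet')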

## D. TSV dataset manipulation

In [9]:
tsv_train_df = dd.read_csv('.data/tsv_train_df.tsv', sep='\t')

# Checking dataset schemas

In [10]:
## Verifying if All Four Datasets Have the Same Columns
## When dealing with datasets that contain hundreds of columns, manually comparing them can be 
## impractical and error-prone. Automating this type of comparison ensures efficiency and 
## accuracy, making it easier to confirm whether all datasets share the same columns.

csv_elements = set(csv_train_df.columns) # set for csv columns
json_elements = set(json_train_df.columns) # set for json columns
parquet_elements = set(parquet_train_df.columns) # set for parquet columns
tsv_elements = set(tsv_train_df.columns) # set for tsv columns

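common_elements = csv_elements & parquet_elements & json_elements & tsv_elements

# Note: this compares only the CSV columns with the common set;
# columns that exist only in another dataset are identified in the next step.
if csv_elements == common_elements:
    print('All columns are present in all datasets!')
    print(f'{common_elements}')
else:
    print(f'Only the following columns were found in all datasets: {common_elements}')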

All columns are present in all datasets!
{'hvfhs_license_num', 'Tipped', 'driver_pay', 'sales_tax', 'shared_request_flag', 'request_datetime', 'trip_time', 'base_passenger_fare', 'shared_match_flag', 'tolls', 'DOLocationID', 'pickup_datetime', 'on_scene_datetime', 'ID', 'access_a_ride_flag', 'originating_base_num', 'wav_match_flag', 'wav_request_flag', 'PULocationID', 'dispatching_base_num', 'dropoff_datetime', 'airport_fee', 'trip_miles', 'bcf', 'congestion_surcharge'}


## Removing additional columns that are not in all datasets

In [11]:
# Identifying additional columns
csv_train_df_extra_cols = csv_elements - common_elements
json_train_df_extra_cols = json_elements - common_elements
parquet_train_df_extra_cols = parquet_elements - common_elements
tsv_train_df_extra_cols = tsv_elements - common_elements

In [12]:
print(f'csv_train_df_extra_cols:{csv_train_df_extra_cols}')
print(f'json_train_df_extra_cols:{json_train_df_extra_cols}')
print(f'parquet_train_df_extra_cols:{parquet_train_df_extra_cols}')
print(f'tsv_train_df_extra_cols:{tsv_train_df_extra_cols}')

csv_train_df_extra_cols:set()
json_train_df_extra_cols:{'index'}
parquet_train_df_extra_cols:set()
tsv_train_df_extra_cols:set()
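
The same comparison can be wrapped in a small helper that works for any number of datasets. This is a sketch, not part of the original notebook: the function name `extra_columns` and the short column lists are invented; in the notebook the inputs would be the `columns` of each DataFrame.

```python
def extra_columns(columns_by_dataset):
    """Return, per dataset, the columns that are not shared by every dataset."""
    column_sets = {name: set(cols) for name, cols in columns_by_dataset.items()}
    common = set.intersection(*column_sets.values())
    return {name: sorted(cols - common) for name, cols in column_sets.items()}

# Invented column lists for illustration.
example = {
    "csv": ["ID", "Tipped", "tolls"],
    "json": ["ID", "Tipped", "tolls", "index"],
    "tsv": ["ID", "Tipped", "tolls"],
}

extras = extra_columns(example)
print(extras)
```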


In [13]:
csv_train_df = csv_train_df[list(common_elements)]
json_train_df = json_train_df[list(common_elements)]
parquet_train_df = parquet_train_df[list(common_elements)]
tsv_train_df = tsv_train_df[list(common_elements)]

In [14]:
csv_train_df.head(1)

Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,...,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,***HV0003***,0,5.4,0.7,NO,2021-12-09 12:03:02,357,7.91,N0,0.0,...,B03404,0,N,204,B03404,2021-12-09 12:13:41,0.0,1.85,0.24,0.0


In [15]:
json_train_df.head(1)

Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,...,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,HV0003,0,12.91,1.55,N,2021-06-11T17:59:44.000,229,17.47,N,0.0,...,B02864,N,N,74,B02864,2021-06-11T18:08:24.000,0.0,0.54,0.52,0.0


In [16]:
parquet_train_df.head(1)

Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,...,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
23212,HV0003,0,11.12,0.77,N,2020-12-31 23:56:17,469,8.72,N,0.0,...,B02864,N,N,185,B02864,2021-01-01 00:08:49,,1.77,0.26,0.0


In [17]:
tsv_train_df.head(1)

Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,...,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,HV0003,1,12.53,0.75,NOOoooo00ooOO,2021-08-08 17:57:27,646,8.42,N,0.0,...,B02866,N,N,170,B02866,2021-08-08 18:12:35,0.0,1.84,0.25,2.75


# Medallion Architecture

* The Medallion Architecture is a design pattern for data lakehouses that divides data into three layers: Bronze, Silver, and Gold.
  1. Bronze Layer: Raw, unprocessed data directly ingested from source systems, stored in its original format.
  2. Silver Layer: Cleansed, enriched, and structured data, ready for analytics and further transformations.
  3. Gold Layer: Finalized, highly curated data tailored for business use cases, dashboards, and machine learning.

* **This structure improves data quality and scalability and enables incremental processing, making it easier to manage and use large datasets effectively.**
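
One simple way to keep the layers apart on disk is a directory per layer. The sketch below assumes the root `.data/files`, which matches the `bronze` and `silver` paths this notebook writes to; the `gold` directory and the helper `layer_path` are hypothetical additions for illustration.

```python
from pathlib import Path

# Assumed root directory; "gold" is hypothetical and not written by this notebook.
DATA_ROOT = Path(".data/files")
LAYERS = ("bronze", "silver", "gold")

def layer_path(layer: str) -> Path:
    """Return the directory for a medallion layer, rejecting unknown names."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return DATA_ROOT / layer

for layer in LAYERS:
    print(layer, "->", layer_path(layer).as_posix())
```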

### Combining all datasets

In [18]:
combined_df = dd.concat([csv_train_df, json_train_df, parquet_train_df, tsv_train_df])

In [19]:
len(combined_df)

8716742

## BRONZE or RAW Layer

### Saving to Bronze

***Preserving Bronze Data:*** 

**Role of the Bronze Layer**
* Serves as a repository for raw data.
* Preserves data in its original state for traceability and auditing.
* Prevents information loss in case it is necessary to revisit the data.


**Why avoid corrections in the Bronze Layer?**
* The Bronze Layer should reflect the data as provided by the sources, without modifications, to maintain the integrity and traceability of the original data.
* Altering data in this layer can hinder the identification of issues at the source or the reconstruction of the pipeline.

### Why CSV rather than Parquet?
We save the Bronze layer as CSV so that we do not yet have to apply the transformations that writing Parquet would require, since Parquet needs a consistent schema across all partitions.

In [20]:
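# Save to CSV (one file per partition).
# Note: the index is written as well by default, which is why an extra
# 'Unnamed: 0' column shows up when these files are read back below.
combined_df.to_csv(".data/files/bronze/")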

This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


['/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/00.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/01.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/02.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/03.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/04.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/05.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/06.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/07.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/08.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/09.part',
 '/Users/brunodaemon/PycharmProjects/perfTuningSpark2/ldsa/.data/files/bronze/10.part',
 '/Users/brunodaemon/PycharmProj

In [21]:
combined_df.head(10)


Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,...,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,***HV0003***,0,5.4,0.7,NO,2021-12-09 12:03:02,357,7.91,N0,0.0,...,B03404,0,N,204,B03404,2021-12-09 12:13:41,0.0,1.85,0.24,0.0
1,***HV0003***,1,26.55,3.08,NOOoooo00ooOO,2021-09-12 21:35:45,2362,34.68,NOOoooo00ooOO,0.0,...,B02889,0,N,249,B02889,2021-09-12 22:24:02,0.0,5.7,1.04,2.75
2,***HV0003***,0,8.96,1.18,NO,2021-11-22 08:43:05,810,13.25,No,0.0,...,B03404,0,N,243,B03404,2021-11-22 08:59:23,0.0,1.98,0.4,0.0
3,***HV0003***,0,5.58,0.7,N0,2021-09-17 18:50:23,287,7.91,No,0.0,...,B02869,0,N,80,B02869,2021-09-17 18:57:12,0.0,0.75,0.24,0.0
4,***HV0003***,0,18.0,1.16,No,2021-11-02 08:57:24,685,13.02,N0,0.0,...,B03404,1,N,210,B03404,2021-11-02 09:12:21,0.0,2.89,0.39,0.0
5,***HV0003***,0,6.41,0.89,No,2021-11-07 11:49:45,478,10.07,NO,0.0,...,B03404,0,N,254,B03404,2021-11-07 12:03:49,0.0,1.91,0.3,0.0
6,***HV0005***,0,13.45,1.2,NOOoooo00ooOO,2021-10-07 18:06:18,799,13.55,NOOoooo00ooOO,0.0,...,,0,N,50,B03406,2021-10-07 18:31:58,0.0,1.558,0.41,2.75
7,***HV0003***,0,18.14,2.17,N0,2021-08-08 19:51:27,1037,24.44,N0,0.0,...,B02764,0,N,168,B02764,2021-08-08 20:20:23,0.0,7.39,0.73,0.0
8,***HV0003***,0,7.14,0.77,N0,2021-08-24 16:24:13,484,8.68,NOOoooo00ooOO,0.0,...,B02764,0,N,117,B02764,2021-08-24 16:34:02,0.0,1.43,0.26,0.0
9,***HV0003***,0,11.77,2.03,N0,2021-12-11 13:28:04,685,22.82,NO,0.0,...,B03404,0,N,263,B03404,2021-12-11 13:50:40,0.0,1.65,0.68,2.75


***Notice:*** As the DataFrame above shows, many columns contain inconsistent values (for example `***HV0003***`, `N0`, and `NOOoooo00ooOO`) that would not help model training as they are. The next step is therefore to clean and format them.

## SILVER OR CLEANSED LAYER

### Loading the new dataset for cleansing

***Silver Layer (or Cleansed Layer):***
* Corrections such as:
  * Standardizing dates.
  * Converting data types (e.g., Yes/No, Y/N, 1/0 to booleans).
  * Handling null or invalid values.
  
***Objective:***
* Make the data consistent and ready for use in analyses and models.

### Checking Columns Variance

**Comments:** The results below reveal 9 columns with very low variance, which may make them insignificant to the model unless they directly affect the outcome. By contrast, factors with a clear influence on tipping behavior are more likely to contribute meaningfully: for instance, long trips at night or in the early morning when passengers feel safer, small gestures such as offering a phone charger, or taking an alternate route to accommodate a passenger’s preferences.

* The only column excluded from this analysis will be Tipped, as it serves as the target variable.
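
One possible way to quantify "very low variance" is the share of rows taken by the most frequent value in each column. The sketch below uses plain pandas on invented rows (only the column names are borrowed from the trip data), and the 0.75 cut-off is an arbitrary assumption, not a value used in this notebook.

```python
import pandas as pd

# Invented sample; the column names are borrowed from the trip data.
sample = pd.DataFrame({
    "Tipped": [0, 1, 0, 1],
    "wav_request_flag": ["N", "N", "N", "N"],
    "airport_fee": [0.0, 0.0, 0.0, 2.5],
    "trip_miles": [1.85, 5.7, 1.98, 0.75],
})

features = sample.drop(columns=["Tipped"])  # the target is excluded

# Share of rows taken by the most frequent value in each column.
dominant_share = features.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Assumed cut-off for "very low variance"; tune it to the data.
THRESHOLD = 0.75
low_variance_cols = dominant_share[dominant_share >= THRESHOLD].index.tolist()
print(low_variance_cols)
```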

In [22]:
# combined_df = dd.read_csv(".data/files/bronze/*.part")

In [23]:
# len(combined_df)
# ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

#+----------------+--------+----------+
#| Column         | Found  | Expected |
#+----------------+--------+----------+
#| wav_match_flag | object | int64    |
#+----------------+--------+----------+

In [24]:
combined_df = dd.read_csv(".data/files/bronze/*.part", dtype={'wav_match_flag':'object'})
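
The mismatch above arises because column types are inferred from only part of the data, so a column that looks numeric at first can turn out to contain text later. The sketch below reproduces the effect with plain pandas by reading invented data in chunks of two rows, then shows that forcing the dtype, as done in the cell above, gives every chunk the same type.

```python
import io
import pandas as pd

# Invented data: the flag looks numeric at first and turns into text later.
csv_text = "ID,wav_match_flag\n1,0\n2,1\n3,N\n4,Y\n"

# Reading in chunks of two rows mimics per-partition type inference.
inferred = [
    str(chunk["wav_match_flag"].dtype)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

# Forcing the dtype gives every chunk the same type.
forced = [
    str(chunk["wav_match_flag"].dtype)
    for chunk in pd.read_csv(
        io.StringIO(csv_text), chunksize=2, dtype={"wav_match_flag": "object"}
    )
]

print(inferred, forced)
```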

In [25]:
combined_df.head()

Unnamed: 0.1,Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,...,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,0,***HV0003***,0,5.4,0.7,NO,2021-12-09 12:03:02,357,7.91,N0,...,B03404,0,N,204,B03404,2021-12-09 12:13:41,0.0,1.85,0.24,0.0
1,1,***HV0003***,1,26.55,3.08,NOOoooo00ooOO,2021-09-12 21:35:45,2362,34.68,NOOoooo00ooOO,...,B02889,0,N,249,B02889,2021-09-12 22:24:02,0.0,5.7,1.04,2.75
2,2,***HV0003***,0,8.96,1.18,NO,2021-11-22 08:43:05,810,13.25,No,...,B03404,0,N,243,B03404,2021-11-22 08:59:23,0.0,1.98,0.4,0.0
3,3,***HV0003***,0,5.58,0.7,N0,2021-09-17 18:50:23,287,7.91,No,...,B02869,0,N,80,B02869,2021-09-17 18:57:12,0.0,0.75,0.24,0.0
4,4,***HV0003***,0,18.0,1.16,No,2021-11-02 08:57:24,685,13.02,N0,...,B03404,1,N,210,B03404,2021-11-02 09:12:21,0.0,2.89,0.39,0.0


In [26]:
combined_df.dtypes

Unnamed: 0                        int64
hvfhs_license_num       string[pyarrow]
Tipped                            int64
driver_pay                      float64
sales_tax                       float64
shared_request_flag     string[pyarrow]
request_datetime        string[pyarrow]
trip_time                         int64
base_passenger_fare             float64
shared_match_flag       string[pyarrow]
tolls                           float64
DOLocationID                    float64
pickup_datetime         string[pyarrow]
on_scene_datetime       string[pyarrow]
ID                                int64
access_a_ride_flag      string[pyarrow]
originating_base_num    string[pyarrow]
wav_match_flag          string[pyarrow]
wav_request_flag        string[pyarrow]
PULocationID                      int64
dispatching_base_num    string[pyarrow]
dropoff_datetime        string[pyarrow]
airport_fee                     float64
trip_miles                      float64
bcf                             float64


In [27]:
len(combined_df)

8716742

#### shared_request_flag
**Conclusion:** After checking this column we decided to map values starting with Y (such as `Y`, `Yes`, `Y3S`) to 1 and values starting with N (such as `N`, `No`, `N0`) to 0.

In [28]:
# Grab uniques in this column
unique_values = combined_df['shared_request_flag'].unique()

# Compute uniques
unique_values_computed = unique_values.compute()

# Show
print(unique_values_computed)

0              yes
1              Yes
0               NO
0    NOOoooo00ooOO
0               N0
0              Y3S
0               No
0                N
0                Y
Name: shared_request_flag, dtype: string


In [29]:
# Calling function to replace values
combined_df['shared_request_flag'] = combined_df['shared_request_flag'].map(map_flag, meta=('shared_request_flag', 'int64'))

#### access_a_ride_flag
**Conclusion:** After checking this column's values we decided to drop it: it shows no meaningful variance (every value is a form of "No", a null, or a placeholder), so it is not relevant for training the ML model.

In [30]:
# Grab uniques in this column
unique_values = combined_df['access_a_ride_flag'].unique()

# Compute uniques
unique_values_computed = unique_values.compute()

# Show
print(unique_values_computed)

0                 
0             Null
0        Who knows
1               NO
0    NOOoooo00ooOO
0               N0
0             <NA>
0               No
0                N
0        Who cares
Name: access_a_ride_flag, dtype: string


In [31]:
combined_df = combined_df.drop('access_a_ride_flag', axis=1)

#### wav_request_flag
**Conclusion:** After checking this column's values we decided to map values starting with Y or y to 1 and values starting with N or n to 0.

In [32]:
# Grab uniques in this column
unique_values = combined_df['wav_request_flag'].unique()

# Compute uniques
unique_values_computed = unique_values.compute()

# Show
print(unique_values_computed)

0    Yes
1    yes
0    Y3S
0      N
0      Y
Name: wav_request_flag, dtype: string


In [33]:
combined_df['wav_request_flag'] = combined_df['wav_request_flag'].map(map_flag, meta=('wav_request_flag', 'int64'))

In [34]:
combined_df['wav_request_flag'].head()

0    0
1    0
2    0
3    0
4    0
Name: wav_request_flag, dtype: int64

#### congestion_surcharge
**Conclusion:** After checking this column we decided not to change anything.

In [35]:
# Grab uniques in this column
unique_values = combined_df['congestion_surcharge'].unique()

# Compute uniques
unique_values_computed = unique_values.compute()

# Show
print(unique_values_computed)

0    0.00
0    5.50
0    8.25
0    2.75
0    0.75
Name: congestion_surcharge, dtype: float64


#### wav_match_flag
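**Conclusion**: After checking this column we decided to change Y to 1 and N to 0. Note that `map_flag` only recognizes values starting with Y or N, so the existing `0` and `1` strings are mapped to `None`, as the output of the following cells shows.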

In [36]:
# Grab uniques in this column
unique_values = combined_df['wav_match_flag'].unique()

# Compute uniques
unique_values_computed = unique_values.compute()

# Show
print(unique_values_computed)

0    0
0    1
0    N
0    Y
Name: wav_match_flag, dtype: string


In [37]:
combined_df['wav_match_flag'] = combined_df['wav_match_flag'].map(map_flag, meta=('wav_match_flag', 'int64'))

In [38]:
combined_df['wav_match_flag'].unique().compute()

0       0
0    None
0       1
Name: wav_match_flag, dtype: object
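
The `None` entry above comes from the rows whose flag was already `0` or `1`. If those rows should be kept, a variant of the mapping function could accept digits as well. This is only a sketch: `map_flag_with_digits` is a hypothetical name and is not used elsewhere in this notebook, and because it can still return `None`, the resulting column may not be a plain `int64`.

```python
def map_flag_with_digits(value):
    """Map Y/N-style and 0/1-style flags to 1/0; return None otherwise."""
    if value is None:
        return None
    text = str(value).strip().upper()
    if text == "1" or text.startswith("Y"):
        return 1
    if text == "0" or text.startswith("N"):
        return 0
    return None

# Sample values taken from the unique values printed in this notebook.
samples = ["Y", "N", "0", "1", "Yes", "N0", "NOOoooo00ooOO", "Who knows"]
mapped = [map_flag_with_digits(v) for v in samples]
print(mapped)
```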

#### hvfhs_license_num
**Conclusion:** After checking this column we decided to remove special characters such as *

In [39]:
# Grab uniques in this column
unique_values = combined_df['hvfhs_license_num'].unique()

# Compute uniques
unique_values_computed = unique_values.compute()

# Show
print(unique_values_computed)

0    ***HV0004***
1          HV0003
2          HV0005
0    ***HV0003***
0    ***HV0005***
0          HV0004
Name: hvfhs_license_num, dtype: string


In [40]:
combined_df['hvfhs_license_num'] = combined_df['hvfhs_license_num'].str.replace('[*]', '', regex=True)

In [41]:
combined_df['hvfhs_license_num'].head()

0    HV0003
1    HV0003
2    HV0003
3    HV0003
4    HV0003
Name: hvfhs_license_num, dtype: object
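
For a single literal character, a non-regex replacement does the same job. The sketch below uses plain pandas on three of the license values shown above; it is an alternative spelling, not the call used in the notebook.

```python
import pandas as pd

licenses = pd.Series(["***HV0003***", "HV0005", "***HV0004***"])

# regex=False treats "*" as a literal character, so no escaping is needed.
cleaned = licenses.str.replace("*", "", regex=False)
print(cleaned.tolist())
```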

#### airport_fee
**Conclusion:** After checking this column we decided not to change anything.

In [42]:
# Grab uniques in this column
unique_values = combined_df['airport_fee'].unique()

# Compute uniques
unique_values_computed = unique_values.compute()

# Show
print(unique_values_computed)

0     0.0
1     6.4
0     2.5
0     0.5
1     NaN
0    10.0
0     5.0
1     1.0
Name: airport_fee, dtype: float64


#### shared_match_flag
**Conclusion:** After checking this column's values we decided to map values starting with Y or y to 1 and values starting with N or n to 0.

In [43]:
# Grab uniques in this column
unique_values = combined_df['shared_match_flag'].unique()

# Compute uniques
unique_values_computed = unique_values.compute()

# Show
print(unique_values_computed)

0               NO
0    NOOoooo00ooOO
0               N0
0               No
0                N
0                Y
Name: shared_match_flag, dtype: string


In [44]:
combined_df['shared_match_flag'] = combined_df['shared_match_flag'].map(map_flag, meta=('shared_match_flag', 'int64'))

In [45]:
combined_df.head(5)

Unnamed: 0.1,Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,...,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,0,HV0003,0,5.4,0.7,0,2021-12-09 12:03:02,357,7.91,0,...,B03404,,0,204,B03404,2021-12-09 12:13:41,0.0,1.85,0.24,0.0
1,1,HV0003,1,26.55,3.08,0,2021-09-12 21:35:45,2362,34.68,0,...,B02889,,0,249,B02889,2021-09-12 22:24:02,0.0,5.7,1.04,2.75
2,2,HV0003,0,8.96,1.18,0,2021-11-22 08:43:05,810,13.25,0,...,B03404,,0,243,B03404,2021-11-22 08:59:23,0.0,1.98,0.4,0.0
3,3,HV0003,0,5.58,0.7,0,2021-09-17 18:50:23,287,7.91,0,...,B02869,,0,80,B02869,2021-09-17 18:57:12,0.0,0.75,0.24,0.0
4,4,HV0003,0,18.0,1.16,0,2021-11-02 08:57:24,685,13.02,0,...,B03404,,0,210,B03404,2021-11-02 09:12:21,0.0,2.89,0.39,0.0


In [46]:
combined_df.dtypes

Unnamed: 0                        int64
hvfhs_license_num       string[pyarrow]
Tipped                            int64
driver_pay                      float64
sales_tax                       float64
shared_request_flag               int64
request_datetime        string[pyarrow]
trip_time                         int64
base_passenger_fare             float64
shared_match_flag                 int64
tolls                           float64
DOLocationID                    float64
pickup_datetime         string[pyarrow]
on_scene_datetime       string[pyarrow]
ID                                int64
originating_base_num    string[pyarrow]
wav_match_flag                    int64
wav_request_flag                  int64
PULocationID                      int64
dispatching_base_num    string[pyarrow]
dropoff_datetime        string[pyarrow]
airport_fee                     float64
trip_miles                      float64
bcf                             float64
congestion_surcharge            float64


In [47]:
combined_df.columns

Index(['Unnamed: 0', 'hvfhs_license_num', 'Tipped', 'driver_pay', 'sales_tax',
       'shared_request_flag', 'request_datetime', 'trip_time',
       'base_passenger_fare', 'shared_match_flag', 'tolls', 'DOLocationID',
       'pickup_datetime', 'on_scene_datetime', 'ID', 'originating_base_num',
       'wav_match_flag', 'wav_request_flag', 'PULocationID',
       'dispatching_base_num', 'dropoff_datetime', 'airport_fee', 'trip_miles',
       'bcf', 'congestion_surcharge'],
      dtype='object')

In [48]:
# invalid_tipped = df[~((df['Tipped'] == 0) | (df['Tipped'] == 1))]
# print(len(invalid_tipped))

In [49]:
# valid_tipped = df[((df['Tipped'] == 0) | (df['Tipped'] == 1))]
# print(len(valid_tipped))

In [50]:
combined_df['wav_request_flag'].value_counts().compute()

wav_request_flag
0    8704344
1      12398
Name: count, dtype: int64

In [51]:
combined_df['airport_fee'].value_counts().compute()

airport_fee
0.0     6489664
6.4           1
2.5      402806
0.5           4
10.0          2
1.0         123
5.0        2246
Name: count, dtype: int64

In [52]:
combined_df['tolls'].value_counts().compute()

tolls
0.10     6354
0.11     5716
0.28     3796
0.32     1729
0.94      604
         ... 
58.00       1
62.55       1
66.92       1
71.20       1
79.83       1
Name: count, Length: 3935, dtype: int64

In [53]:
combined_df['wav_match_flag'].value_counts().compute()

wav_match_flag
0    6197037
1     340520
Name: count, dtype: int64

In [54]:
combined_df['shared_match_flag'].value_counts().compute()

shared_match_flag
0    8711536
1       5206
Name: count, dtype: int64

In [55]:
combined_df['shared_request_flag'].value_counts().compute()

shared_request_flag
0    8702580
1      14162
Name: count, dtype: int64

### Transforming datetime columns

#### on_scene_datetime

In [56]:
import pandas as pd

# Apply transformation with map_partitions - This is necessary to run the command in all partitions we've created
combined_df['on_scene_datetime'] = combined_df['on_scene_datetime'].map_partitions(
    lambda df: pd.to_datetime(df, errors='coerce'),
    meta=('on_scene_datetime', 'datetime64[s]')  # Specify the result type as datetime
)

#### request_datetime

In [57]:
import pandas as pd

# Apply transformation with map_partitions - This is necessary to run the command in all partitions we've created
combined_df['request_datetime'] = combined_df['request_datetime'].map_partitions(
    lambda df: pd.to_datetime(df, errors='coerce'),
    meta=('request_datetime', 'datetime64[s]')  # Specify the result type as datetime
)

#### pickup_datetime

In [58]:
import pandas as pd

# Apply transformation with map_partitions - This is necessary to run the command in all partitions we've created
combined_df['pickup_datetime'] = combined_df['pickup_datetime'].map_partitions(
    lambda df: pd.to_datetime(df, errors='coerce'),
    meta=('pickup_datetime', 'datetime64[s]')  # Specify the result type as datetime
)

#### dropoff_datetime

In [59]:
import pandas as pd

# Apply transformation with map_partitions - This is necessary to run the command in all partitions we've created
combined_df['dropoff_datetime'] = combined_df['dropoff_datetime'].map_partitions(
    lambda df: pd.to_datetime(df, errors='coerce'),
    meta=('dropoff_datetime', 'datetime64[s]')  # Specify the result type as datetime
)
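
The four cells above repeat the same conversion. As a sketch, the logic can be collected in one function that parses all four columns of a pandas DataFrame; the helper name `parse_datetimes` and the sample rows are invented. On a Dask DataFrame, such a function could in principle be applied to each partition with `map_partitions`.

```python
import pandas as pd

DATETIME_COLUMNS = [
    "request_datetime",
    "on_scene_datetime",
    "pickup_datetime",
    "dropoff_datetime",
]

def parse_datetimes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the datetime columns parsed; bad values become NaT."""
    parsed = frame.copy()
    for column in DATETIME_COLUMNS:
        parsed[column] = pd.to_datetime(parsed[column], errors="coerce")
    return parsed

# Invented rows; the second request_datetime is deliberately invalid.
sample = pd.DataFrame({
    "request_datetime": ["2021-12-09 12:03:02", "not a date"],
    "on_scene_datetime": ["2021-12-09 12:05:59", "2021-09-12 21:44:14"],
    "pickup_datetime": ["2021-12-09 12:07:44", "2021-09-12 21:44:40"],
    "dropoff_datetime": ["2021-12-09 12:13:41", "2021-09-12 22:24:02"],
})

parsed = parse_datetimes(sample)
print(parsed.dtypes)
```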

In [60]:
# Save to Parquet using pyarrow to improve write performance and file compression
combined_df.to_parquet(".data/files/silver/", write_index=False, engine="pyarrow")

In [61]:
df = dd.read_parquet('.data/files/silver/', blocksize='64MB')

In [62]:
print(f"Number of partitions: {df.npartitions}")

Number of partitions: 22


In [63]:
pd.set_option('display.max_columns', None) # allowing Pandas to show all dataframe columns
df.head(5)

Unnamed: 0.1,Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,DOLocationID,pickup_datetime,on_scene_datetime,ID,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,0,HV0003,0,5.4,0.7,0,2021-12-09 12:03:02,357,7.91,0,0.0,44.0,2021-12-09 12:07:44,2021-12-09 12:05:59,8163956,B03404,,0,204,B03404,2021-12-09 12:13:41,0.0,1.85,0.24,0.0
1,1,HV0003,1,26.55,3.08,0,2021-09-12 21:35:45,2362,34.68,0,0.0,225.0,2021-09-12 21:44:40,2021-09-12 21:44:14,5851835,B02889,,0,249,B02889,2021-09-12 22:24:02,0.0,5.7,1.04,2.75
2,2,HV0003,0,8.96,1.18,0,2021-11-22 08:43:05,810,13.25,0,0.0,169.0,2021-11-22 08:45:53,2021-11-22 08:45:18,7703607,B03404,,0,243,B03404,2021-11-22 08:59:23,0.0,1.98,0.4,0.0
3,3,HV0003,0,5.58,0.7,0,2021-09-17 18:50:23,287,7.91,0,0.0,112.0,2021-09-17 18:52:25,2021-09-17 18:52:18,5965669,B02869,,0,80,B02869,2021-09-17 18:57:12,0.0,0.75,0.24,0.0
4,4,HV0003,0,18.0,1.16,0,2021-11-02 08:57:24,685,13.02,0,0.0,108.0,2021-11-02 09:00:56,2021-11-02 09:00:10,7153598,B03404,,0,210,B03404,2021-11-02 09:12:21,0.0,2.89,0.39,0.0


## Checking DataFrame Partitions
**Next:** We'll now check how many partitions are available and how balanced they are, now that the data from the CSV, TSV, Parquet and JSON sources has been combined into a single dataset. Loading the data one partition (Parquet file) at a time keeps memory usage bounded, so we can process the whole dataset without running out of memory while Dask works on several partitions in parallel.

In [64]:
partition_sizes = df.map_partitions(lambda x: x.memory_usage(deep=True).sum()).compute()/1024/1024
for i, size in enumerate(partition_sizes):
    print(f'Partition {i}: {size:.2f} MB')

Partition 0: 74.97 MB
Partition 1: 74.97 MB
Partition 2: 74.98 MB
Partition 3: 74.99 MB
Partition 4: 74.96 MB
Partition 5: 74.98 MB
Partition 6: 79.45 MB
Partition 7: 79.46 MB
Partition 8: 82.89 MB
Partition 9: 82.41 MB
Partition 10: 82.36 MB
Partition 11: 82.30 MB
Partition 12: 82.57 MB
Partition 13: 82.57 MB
Partition 14: 82.43 MB
Partition 15: 81.98 MB
Partition 16: 81.29 MB
Partition 17: 89.97 MB
Partition 18: 89.97 MB
Partition 19: 89.96 MB
Partition 20: 89.97 MB
Partition 21: 89.98 MB


**Things to notice:**
* One of the files is 713MB in size
* One of the files is 153MB in size
* Some are 86MB in size
* Some are 72MB in size

This phenomenon is known as partition imbalance. Several factors can contribute to it, including how the data was partitioned, the data format, Dask's lazy evaluation, inconsistent data, hardware and system configuration, and the processing strategy. We won't delve into the root causes here, but it's important to note that there are ways to address this imbalance for more efficient processing.

Given that the total memory size required to load the data into memory is 2.5 GB, let’s divide it into 10 partitions, aiming for approximately 250 MB per partition.
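
The arithmetic behind that choice can be sketched as follows. The per-partition sizes are hypothetical placeholders standing in for the measured `partition_sizes` series, chosen only so that they add up to 2.5 GB (taken here as 2500 MB); `target_mb` is likewise an illustrative name.

```python
import math
import pandas as pd

# Hypothetical per-partition sizes in MB (placeholder values, not measured)
partition_sizes_mb = pd.Series([1000.0, 500.0, 500.0, 500.0])

total_mb = partition_sizes_mb.sum()   # total in-memory size of the dataset
target_mb = 250                       # desired size of each partition

# Round up so that no partition exceeds the target on average
n_partitions = math.ceil(total_mb / target_mb)
print(f"{total_mb:.0f} MB / {target_mb} MB per partition -> {n_partitions} partitions")
```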


***IMPORTANT:***
Dask often strives to distribute data equally among partitions, but achieving perfect balance is not always feasible.
Below, you’ll find two different approaches to ensure the data is partitioned as desired.
In both cases, you’ll notice that the 1GB partition is eliminated, which was the primary objective of this repartitioning strategy.

***TIP:*** 
To get partitions of roughly the same size, first repartition the DataFrame into a single partition, then repartition it again into the desired number of partitions (12, for instance). This forces Dask to rebuild the partitions of the `df` DataFrame from scratch, sized for the number of partitions you chose.

### Physical File Sizes
***REMEMBER:***
Repartitioning happens in memory; the physical files on disk do not change in size. What repartitioning does is
balance how the data is split once it has been read, so that parallel processing can spread the work evenly.

Below are the physical file sizes as saved on my local computer:

| Permissions   | Owner  | Group | Size  | Date       | Time   | File Name       |
|---------------|--------|-------|-------|------------|--------|-----------------|
| -rw-r--r--@   | user1  | staff | 19M   | Dec 16     | 17:44  | part.0.parquet  |
| -rw-r--r--@   | user1  | staff | 19M   | Dec 16     | 17:44  | part.1.parquet  |
| -rw-r--r--@   | user1  | staff | 22M   | Dec 16     | 17:44  | part.10.parquet |
| -rw-r--r--@   | user1  | staff | 22M   | Dec 16     | 17:44  | part.11.parquet |
| -rw-r--r--@   | user1  | staff | 22M   | Dec 16     | 17:44  | part.12.parquet |
| -rw-r--r--@   | user1  | staff | 19M   | Dec 16     | 17:43  | part.2.parquet  |
| -rw-r--r--@   | user1  | staff | 19M   | Dec 16     | 17:43  | part.3.parquet  |
| -rw-r--r--@   | user1  | staff | 19M   | Dec 16     | 17:43  | part.4.parquet  |
| -rw-r--r--@   | user1  | staff | 19M   | Dec 16     | 17:43  | part.5.parquet  |
| -rw-r--r--@   | user1  | staff | 32M   | Dec 16     | 17:43  | part.6.parquet  |
| -rw-r--r--@   | user1  | staff | 155M  | Dec 16     | 17:44  | part.7.parquet  |
| -rw-r--r--@   | user1  | staff | 22M   | Dec 16     | 17:43  | part.8.parquet  |
| -rw-r--r--@   | user1  | staff | 22M   | Dec 16     | 17:43  | part.9.parquet  |


In [65]:
print(len(df))

8716742


In [66]:
if 'Unnamed: 0' in df.columns:
    df = df.drop('Unnamed: 0', axis=1)

In [67]:
df.head(5)

Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,DOLocationID,pickup_datetime,on_scene_datetime,ID,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,HV0003,0,5.4,0.7,0,2021-12-09 12:03:02,357,7.91,0,0.0,44.0,2021-12-09 12:07:44,2021-12-09 12:05:59,8163956,B03404,,0,204,B03404,2021-12-09 12:13:41,0.0,1.85,0.24,0.0
1,HV0003,1,26.55,3.08,0,2021-09-12 21:35:45,2362,34.68,0,0.0,225.0,2021-09-12 21:44:40,2021-09-12 21:44:14,5851835,B02889,,0,249,B02889,2021-09-12 22:24:02,0.0,5.7,1.04,2.75
2,HV0003,0,8.96,1.18,0,2021-11-22 08:43:05,810,13.25,0,0.0,169.0,2021-11-22 08:45:53,2021-11-22 08:45:18,7703607,B03404,,0,243,B03404,2021-11-22 08:59:23,0.0,1.98,0.4,0.0
3,HV0003,0,5.58,0.7,0,2021-09-17 18:50:23,287,7.91,0,0.0,112.0,2021-09-17 18:52:25,2021-09-17 18:52:18,5965669,B02869,,0,80,B02869,2021-09-17 18:57:12,0.0,0.75,0.24,0.0
4,HV0003,0,18.0,1.16,0,2021-11-02 08:57:24,685,13.02,0,0.0,108.0,2021-11-02 09:00:56,2021-11-02 09:00:10,7153598,B03404,,0,210,B03404,2021-11-02 09:12:21,0.0,2.89,0.39,0.0


### Dropping duplicates - ID

In [68]:
# Because we merged the four files it's necessary now to drop duplicates from the key ID 
# which is the transaction ID of the trip.
# df = df.drop_duplicates(subset=['ID'])

### Hash Duplicate Checking

A more efficient approach to identify and remove duplicates in large datasets is to generate a hash column by concatenating all the columns of each row and hashing the result with **SHA-256**. Rows with identical content get identical hashes, so the hash column can then be used with the ***dataframe.drop_duplicates(subset=['hash-column'])*** function to remove full-row duplicates.

### Filling Missing Data

In [69]:
columns = df.columns
len(columns)

24

In [70]:
import dask.dataframe as dd

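# Fill missing values with the nearest valid value inside each partition:
# forward-fill first, then backward-fill whatever is still missing at the start.
# .ffill()/.bfill() replace the deprecated fillna(method=...) form.
df = df.map_partitions(lambda df_partition: df_partition.ffill().bfill())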


In [71]:
len(df)

8716742

In [72]:
df['wav_match_flag'] = df['wav_match_flag'].fillna(0)  # fill it with 0

In [73]:
df.head()


Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,DOLocationID,pickup_datetime,on_scene_datetime,ID,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,HV0003,0,5.4,0.7,0,2021-12-09 12:03:02,357,7.91,0,0.0,44.0,2021-12-09 12:07:44,2021-12-09 12:05:59,8163956,B03404,0.0,0,204,B03404,2021-12-09 12:13:41,0.0,1.85,0.24,0.0
1,HV0003,1,26.55,3.08,0,2021-09-12 21:35:45,2362,34.68,0,0.0,225.0,2021-09-12 21:44:40,2021-09-12 21:44:14,5851835,B02889,0.0,0,249,B02889,2021-09-12 22:24:02,0.0,5.7,1.04,2.75
2,HV0003,0,8.96,1.18,0,2021-11-22 08:43:05,810,13.25,0,0.0,169.0,2021-11-22 08:45:53,2021-11-22 08:45:18,7703607,B03404,0.0,0,243,B03404,2021-11-22 08:59:23,0.0,1.98,0.4,0.0
3,HV0003,0,5.58,0.7,0,2021-09-17 18:50:23,287,7.91,0,0.0,112.0,2021-09-17 18:52:25,2021-09-17 18:52:18,5965669,B02869,0.0,0,80,B02869,2021-09-17 18:57:12,0.0,0.75,0.24,0.0
4,HV0003,0,18.0,1.16,0,2021-11-02 08:57:24,685,13.02,0,0.0,108.0,2021-11-02 09:00:56,2021-11-02 09:00:10,7153598,B03404,0.0,0,210,B03404,2021-11-02 09:12:21,0.0,2.89,0.39,0.0


### Dropping NA

In [74]:
# Dropping NA from the dataset
# df = df.dropna(subset=['request_datetime', 'on_scene_datetime', 'dropoff_datetime', 'pickup_datetime'])

In [75]:
# Count non-NA cells for each column or row.
# df.count(axis=0).compute()

In [76]:
# Dropping NA from the dataset
# df = df.dropna(subset=['shared_request_flag', 'wav_request_flag', 'trip_time', 'driver_pay', 'dispatching_base_num', 'originating_base_num', 'Tipped', 'request_datetime', 'on_scene_datetime', 'dropoff_datetime', 'bcf', 'base_passenger_fare', 'congestion_surcharge', 'PULocationID', 'ID', 'wav_match_flag', 'tolls', 'hvfhs_license_num', 'DOLocationID', 'trip_miles', 'sales_tax', 'airport_fee', 'shared_match_flag', 'pickup_datetime'])

## SAVING THE SILVER LAYER

In [77]:
type(df)

dask_expr._collection.DataFrame

In [78]:
df.to_parquet(".data/files/silver/", write_index=False, engine="pyarrow")


In [79]:
df = dd.read_parquet(".data/files/silver/", engine="pyarrow")  # write_index only applies to to_parquet

In [80]:
df.head()

Unnamed: 0,hvfhs_license_num,Tipped,driver_pay,sales_tax,shared_request_flag,request_datetime,trip_time,base_passenger_fare,shared_match_flag,tolls,DOLocationID,pickup_datetime,on_scene_datetime,ID,originating_base_num,wav_match_flag,wav_request_flag,PULocationID,dispatching_base_num,dropoff_datetime,airport_fee,trip_miles,bcf,congestion_surcharge
0,HV0003,0,5.4,0.7,0,2021-12-09 12:03:02,357,7.91,0,0.0,44.0,2021-12-09 12:07:44,2021-12-09 12:05:59,8163956,B03404,0,0,204,B03404,2021-12-09 12:13:41,0.0,1.85,0.24,0.0
1,HV0003,1,26.55,3.08,0,2021-09-12 21:35:45,2362,34.68,0,0.0,225.0,2021-09-12 21:44:40,2021-09-12 21:44:14,5851835,B02889,0,0,249,B02889,2021-09-12 22:24:02,0.0,5.7,1.04,2.75
2,HV0003,0,8.96,1.18,0,2021-11-22 08:43:05,810,13.25,0,0.0,169.0,2021-11-22 08:45:53,2021-11-22 08:45:18,7703607,B03404,0,0,243,B03404,2021-11-22 08:59:23,0.0,1.98,0.4,0.0
3,HV0003,0,5.58,0.7,0,2021-09-17 18:50:23,287,7.91,0,0.0,112.0,2021-09-17 18:52:25,2021-09-17 18:52:18,5965669,B02869,0,0,80,B02869,2021-09-17 18:57:12,0.0,0.75,0.24,0.0
4,HV0003,0,18.0,1.16,0,2021-11-02 08:57:24,685,13.02,0,0.0,108.0,2021-11-02 09:00:56,2021-11-02 09:00:10,7153598,B03404,0,0,210,B03404,2021-11-02 09:12:21,0.0,2.89,0.39,0.0


## FINISHED FILES COMPUTATION
**Conclusion:** Now we'll work on other data sources to bring everything together and create a single file for the ML model training.

In [81]:
df.columns

Index(['hvfhs_license_num', 'Tipped', 'driver_pay', 'sales_tax',
       'shared_request_flag', 'request_datetime', 'trip_time',
       'base_passenger_fare', 'shared_match_flag', 'tolls', 'DOLocationID',
       'pickup_datetime', 'on_scene_datetime', 'ID', 'originating_base_num',
       'wav_match_flag', 'wav_request_flag', 'PULocationID',
       'dispatching_base_num', 'dropoff_datetime', 'airport_fee', 'trip_miles',
       'bcf', 'congestion_surcharge'],
      dtype='object')