#### Course Title: Computing Foundations of Data Science (INFO-H-600)
#### University: Université Libre de Bruxelles
#### Program: Specialized Master Data Science, Big Data
#### Academic Year: 2020-2021
#### Purpose: Project Assignment
#### Authors: Kubam Ivo Mbi (507332), Berdai Hasnae ()
#### Submission Date: 25-Jan-21

## 1.0. Introduction

The New York City Taxi and Limousine Commission (TLC) was created in 1971. The agency is responsible for licensing and regulating New York City's Medallion (Yellow) taxi cabs, for-hire vehicles (community-based liveries, black cars and luxury limousines), commuter vans, and paratransit vehicles that operate in the city. About one million trips are recorded each day. Trips fall into four types: Yellow Taxi, Green Taxi, For-Hire Vehicle (FHV) and High Volume For-Hire Vehicle (FHVHV). TLC receives taxi trip data from the technology service providers (TSPs) that provide electronic metering in each cab, and FHV trip data from the app, community livery, black car, or luxury limousine company, or base, that dispatched the trip. Visit [About TLC](https://www1.nyc.gov/site/tlc/about/about-tlc.page) for more info.

## 2.0. Project Objective

This project aimed to use the tools and knowledge acquired in the course for data wrangling and analysis of a sampled subset of the TLC database. The objective was broken down into the following tasks:<br>
> - **2.1. Collecting metadata, inspecting schema evolution:** Understanding the data and its characteristics
> - **2.2. Data integration:** Updating old schemas to the latest schema for each sub-dataset
> - **2.3. Data cleaning:** Checking files for record errors, repairing them and separating them into good and bad records.
> - **2.4. Analysis:** Answering queries and plotting results using matplotlib

<br>See the assignment file for more details about the above objectives. 

## 3.0. Methodology

### 3.1. About the Dataset

The dataset used in this project was generated by uniformly sampling 0.2\% of the records, without repetition, from each file of each taxi trip type. The sampled dataset runs from 2009, which contains only Yellow taxi records, up to 2020, which includes records for all four trip types. The four sub-datasets representing the four trip types do not provide the same information. More information is provided by the data dictionaries found on the [TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) page.

### 3.2. Solution Approach

We relied heavily on a functional programming style to reduce repetition of tasks. The solution approach is broken down by project task below. <br>
> - **Collecting metadata, inspecting schema evolution:** To compute the basic statistics, we used two main functions. The first function (***record_stat***) receives the path of a folder containing files and records each file's size and number of records in two lists. The returned lists are passed to the second function (***cal_stat***), which calculates the requested statistics per taxi type. To analyze the schema, a function called ***schema*** receives the folder path containing the files, gets the schema of each file, compares the schemas and separates the files based on the comparison. This was done per taxi type.
> - **Data integration:** We made use of five main functions: ***column_name_fix*** to lowercase column names and remove all whitespace from them, ***column_drop*** to drop columns from a dataframe, ***column_rename*** to rename selected columns with a provided list of column names, and ***column_add*** to add a list of columns to a dataframe. To convert longitudes and latitudes to a locationID we used the function ***location_id***, which receives a dataframe and zones as its parameters. The zones were read from the taxi_zones.shp file and the coordinates were then re-projected to the EPSG 4326 CRS. An rtree index was created from zones. When a dataframe is passed to ***location_id***, a query point is generated from the pickup longitude and latitude, and likewise from the dropoff longitude and latitude. These query points are intersected with the rtree index to find possible matches. The possible matches are then checked against the zones geometry for an exact match, and the matching value is assigned as the pickup or dropoff locationID.
> - **Data Cleaning:** To clean the integrated datasets, we started by adding a new column called dirty to each file using the ***add_column*** function. Other functions used to check the cleanliness of the dataset were ***null_value*** to check columns for null values and ***locationID*** to make sure all locationID values were between 1 and 265. Each file was passed through these functions, and records that did not meet the criteria were flagged as dirty. To repair the flagged records, we used the function ***repair_missing***, which replaces missing values of selected columns with a defined value. The ***repair_absolute*** function was used on selected numeric fields to make sure they were positive. Finally, the ***record_separation*** function was used to separate good records from bad ones. All cleaned files per dataset were stacked into one, so the final output of data cleaning was four files, one each for FHV, FHVHV, green and yellow.  <br>
**Note that we only considered the columns that were required for the analysis section.**
>- **Analysis:** Beforehand, we had calculated the trip month and total receipt for each file in the data cleaning section using the ***month_extract*** and ***tot_receipt*** functions respectively. This made it easy to group most of the analysis by month, as requested. Trip duration in minutes was calculated from the pickup and dropoff time fields. Trip speed was calculated by dividing the trip distance by the trip duration.
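
The duration and speed calculation described in the Analysis task can be sketched as follows. This is a minimal illustration in plain Python, not the notebook code: the function names, the timestamp format and the sample values are assumptions, and the speed is expressed in miles per minute on the assumption that trip distance is recorded in miles.

```python
from datetime import datetime

# Assumed timestamp layout; the real files may use a different format.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def trip_duration_minutes(pickup: str, dropoff: str) -> float:
    """Return the trip duration in minutes from two timestamp strings."""
    start = datetime.strptime(pickup, TIME_FORMAT)
    end = datetime.strptime(dropoff, TIME_FORMAT)
    return (end - start).total_seconds() / 60.0


def trip_speed(distance_miles: float, duration_minutes: float):
    """Return miles per minute, or None when the duration is not positive."""
    if duration_minutes <= 0:
        return None  # avoid dividing by zero or by a negative duration
    return distance_miles / duration_minutes


# Made-up sample trip: 6 miles covered in 30 minutes.
duration = trip_duration_minutes("2019-03-01 08:00:00", "2019-03-01 08:30:00")
speed = trip_speed(6.0, duration)
```

Returning `None` for non-positive durations keeps records with identical or swapped timestamps from producing infinite or negative speeds; how such records were actually handled is not stated above.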


### 3.3. Execution Environment

To implement the solution approach above, Python was the main programming language and we made use of the following libraries: <br>
> - [os](https://docs.python.org/3/library/os.html) 
> - [numpy](https://numpy.org/)
> - [pandas](https://pandas.pydata.org/docs/)
> - [shutil](https://docs.python.org/3/library/shutil.html)
> - [Spark Python](https://spark.apache.org/docs/latest/api/python/index.html)
> - [geopandas](https://geopandas.org/)
> - [shapely](https://pypi.org/project/Shapely/)
> - [Matplotlib](https://matplotlib.org/)

We used separate laptops to develop the code, with the following specifications:
> - Lenovo IdeaPad Flex 5, 10th-generation Intel Core i7, 16.0 GB RAM, Windows operating system
> - 

## 4.0. Results

Results are presented per task.

### 4.1. Results for Collecting metadata, inspecting schema evolution

#### Statistics about the dataset

| Taxi Type |Statistics-Type | Number of files |Min|Max|Mean|25th Percentile|50th Percentile|75th Percentile|90th Percentile|
|:-| :-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|FHV|Record |104|959|85,995|26,723.23|8,711.5|24,684.0|43,087.0|47,332.1|
|FHV|Size(bytes) |104|4,096|3,339,455|707,580.21|4,096.0|201,559.5|777,344.75|2,742,530.1|
|FHVHV|Record|10|8,625|47,703|32,181.9|18,022.0|40,714.0|43,056.25|44,922.0|
|FHVHV|Size(bytes)|10|535,789|2,978,931|2,007,750.7|1,121,047.25|2,542,280.0|2,687,857.25|2,804,332.8|
|Green|Record|76|15|3,546|2,026.5|1,358.75|2,072.5|2,892.25|3,126.0|
|Green|Size(bytes)|76|2,512|570,765|262,437.29|121,494.75|190,194.5|456,955.0|499,751.0|
|Yellow|Record|131|476|32,300|24,203.51|19,989.0|26,294.0|29,080.0|30,202.0|
|Yellow|Size(bytes)|131|43,103|5,959,352|3,750,759.69|1,756,967.0|4,442,047.0|5,123,591.5|5,491,438.0|
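
The summary columns of this table (min, max, mean and percentiles over a list of per-file values) can be reproduced with a sketch like the one below. It is written in plain Python so it runs without extra libraries; the `percentile` helper uses linear interpolation, which is also the default rule of `numpy.percentile`, but whether the project used that exact rule is an assumption. The function names and input values are hypothetical.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile of a non-empty list (q in 0..100)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    rank = (len(ordered) - 1) * q / 100.0  # fractional position in the sorted list
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(values):
    """Return the statistics reported per taxi type for one list of values."""
    return {
        "files": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "p25": percentile(values, 25),
        "p50": percentile(values, 50),
        "p75": percentile(values, 75),
        "p90": percentile(values, 90),
    }


# Made-up record counts for five files; not taken from the dataset.
record_counts = [10, 20, 30, 40, 50]
stats = summarize(record_counts)
```

The same `summarize` call would be applied once to the record counts and once to the file sizes of each taxi type.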

#### Analysis of the schema evolution

##### FHV Taxi

|Version|Year|Details|
| :-:|:-:|:-|
|One|2015-01 till 2016-12|'dispatching_base_num', 'pickup_date', 'locationid'|
|Two|2017-01 till 2017-06|'dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid'|
|Three|2017-07 till 2017-12|'dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag'|
|Four|2018-01 till 2018-12|'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag', 'dispatching_base_number', 'dispatching_base_num'|
|Five|2019-01 till 2020-06|'dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag'|

Note: Versions Three and Five share the same columns, so files from both schemas were merged into one.
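
One way to detect schema versions like these is to normalize each file's header and group files by the resulting column tuple. The sketch below is an illustration under assumptions: the file names and headers are invented, and `normalize` and `group_by_schema` are hypothetical helpers rather than the project's ***column_name_fix*** and ***schema*** functions.

```python
from collections import defaultdict


def normalize(column: str) -> str:
    """Lowercase a column name and remove all whitespace."""
    return "".join(column.split()).lower()


def group_by_schema(headers):
    """Map each distinct normalized header tuple to the files that use it."""
    groups = defaultdict(list)
    for filename, columns in headers.items():
        key = tuple(normalize(c) for c in columns)
        groups[key].append(filename)
    return dict(groups)


# Invented headers: the last two differ only in case and stray whitespace.
headers = {
    "fhv_a.csv": ["Dispatching_base_num", "Pickup_DateTime", "PULocationID"],
    "fhv_b.csv": ["dispatching_base_num", " pickup_datetime", "pulocationid", "SR_Flag"],
    "fhv_c.csv": ["Dispatching_base_num", "Pickup_datetime ", "PUlocationID", "sr_flag"],
}
groups = group_by_schema(headers)
```

Because the grouping key is the normalized column tuple, files from different periods that share the same columns end up in the same group automatically.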

##### FHVHV Taxi 
There is only one schema, covering records from 2019-02 till 2020-06, with the following column headings: 'hvfhs_license_num', 'dispatching_base_num', 'pickup_datetime', 'dropoff_datetime', 'pulocationid', 'dolocationid', 'sr_flag'

##### Green Taxi

|Version|Year|Details|
| :-:|:-:|:-|
|One|2013-08 till 2014-12|'vendorid', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'ratecodeid', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'total_amount', 'payment_type', 'trip_type'|
|Two|2015-01 till 2016-06|'vendorid', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'ratecodeid', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type'|
|Three|2016-07 till 2018-12|'vendorid', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'ratecodeid', 'pulocationid', 'dolocationid', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type'|
|Four|2019-01 till 2020-06|'vendorid', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'ratecodeid', 'pulocationid', 'dolocationid', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge'|

##### Yellow Taxi

|Version|Year|Details|
| :-:|:-:|:-|
|One|2009-01 till 2009-12|'vendor_name', 'trip_pickup_datetime', 'trip_dropoff_datetime', 'passenger_count', 'trip_distance', 'start_lon', 'start_lat', 'rate_code', 'store_and_forward', 'end_lon', 'end_lat', 'payment_type', 'fare_amt', 'surcharge', 'mta_tax', 'tip_amt', 'tolls_amt', 'total_amt'|
|Two|2010-01 till 2014-12|'vendor_id', 'pickup_datetime', 'dropoff_datetime', 'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'rate_code', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tip_amount', 'tolls_amount', 'total_amount'|
|Three|2015-01 till 2016-06|'vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'ratecodeid', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount'|
|Four|2016-07 till 2018-12|'vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'ratecodeid', 'store_and_fwd_flag', 'pulocationid', 'dolocationid', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount'|
|Five|2019-01 till 2020-06|'vendorid', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'ratecodeid', 'store_and_fwd_flag', 'pulocationid', 'dolocationid', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge'|

### 4.2. Results for Data Cleaning

After the cleaning process, 667,814 records of the FHV dataset were flagged as bad. No bad records were found in the FHVHV dataset. The green and yellow datasets had 526 and 341,112 bad records respectively.
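
The split into good and bad records can be illustrated with the sketch below. It is a simplified stand-in for the cleaning functions, not the notebook code: the required fields, the default value for `passenger_count` and the sample rows are all assumptions, and repairs are applied to every row before the final check, which may differ from the order used in the project.

```python
# Assumed set of required fields; the real checks cover the analysis columns.
REQUIRED = ("pickup_datetime", "pulocationid", "dolocationid")


def valid_location(value) -> bool:
    """A locationID is valid when it is an integer between 1 and 265."""
    return isinstance(value, int) and 1 <= value <= 265


def is_dirty(record) -> bool:
    """Flag records with a missing required field or an invalid locationID."""
    if any(record.get(name) is None for name in REQUIRED):
        return True
    return not (valid_location(record["pulocationid"])
                and valid_location(record["dolocationid"]))


def repair(record, defaults, numeric_fields):
    """Fill missing values with defaults and make numeric fields non-negative."""
    fixed = dict(record)  # copy so the input row is left untouched
    for name, default in defaults.items():
        if fixed.get(name) is None:
            fixed[name] = default
    for name in numeric_fields:
        if fixed.get(name) is not None:
            fixed[name] = abs(fixed[name])
    return fixed


def separate(records):
    """Split records into (good, bad) lists."""
    good, bad = [], []
    for record in records:
        (bad if is_dirty(record) else good).append(record)
    return good, bad


# Made-up rows; none of these values come from the TLC data.
rows = [
    {"pickup_datetime": "2019-01-01 00:10:00", "pulocationid": 132,
     "dolocationid": 48, "fare_amount": -12.5, "passenger_count": None},
    {"pickup_datetime": "2019-01-01 00:20:00", "pulocationid": 300,
     "dolocationid": 48, "fare_amount": 8.0, "passenger_count": 1},
    {"pickup_datetime": None, "pulocationid": 10,
     "dolocationid": 20, "fare_amount": 5.0, "passenger_count": 2},
]
repaired = [repair(r, {"passenger_count": 1}, ["fare_amount"]) for r in rows]
good, bad = separate(repaired)
```

In this example the first row is repaired and kept, while the out-of-range locationID and the missing pickup time send the other two rows to the bad list.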

### 4.3. Results for Analysis

We present only the plots in these sections. The values behind the plots can be viewed in the notebook where the plots were generated.

#### 4.3.1 The monthly total number of trips, grouped per dataset type

![Monthly Trips](monthlytrips.png "Monthly Total number of Trips")

In general, yellow taxi had the highest number of trips per month. Yellow and green taxi showed almost the same pattern in the number of trips per month, with high values in the first two quarters, a drop in the third quarter and a slow rise in the last quarter. FHV started high in January, fell from February till April and rose from June till the end of the year. There is no recognizable pattern for FHVHV, as we had records for only six months. Although the y-axis scales give a rough idea of the numerical values behind the plots, the exact values can be found in the other notebook.

#### 4.3.3 Monthly total number of trips in Manhattan and Brooklyn

![Manhattan and Brooklyn](M_Bpng.png "Title")

The total number of trips in Manhattan and Brooklyn shows a similar pattern to the overall total number of trips seen above. This could suggest that these two boroughs contributed the largest share of trips overall.

#### 4.3.4 The monthly total receipts, grouped per dataset type

![Total Receipt](tot_receipt.png "Title")

As expected, the monthly total receipts follow a similar pattern to the total number of trips per month. We expect that when the number of trips is high, total receipts should be high, and vice versa. Again, yellow taxi generated more total receipts than green. FHV and FHVHV had no receipt information.

#### 4.3.5 The average trip receipt grouped per dataset type

![Tips Receipt](tips.png "Title")

In general, green taxi earns a higher average trip receipt per month than yellow taxi. The average trip receipt was 13 dollars for green and 12 dollars for yellow. For green taxi, the first and last quarters showed low average trip receipts, as opposed to the second and third quarters, which had comparatively high averages.

#### 4.3.6 Average cost per in-progress-minute

![Average Cost](avgcost.png "Title")

The average cost per in-progress minute for yellow taxi was consistently lower than that of green on a monthly basis, except for the month of March (1.38 vs 1.32 dollars per minute). So, on average, the yellow taxi service is comparatively cheaper.

#### 4.3.7 Average tip per trip grouped per dataset type

![Average Tip](tipamnt.png "Title")

Considering the values behind the plots, yellow taxi drivers get more tips on average than green taxi drivers. The scales of the plots also show that the values for yellow are higher than those for green. Hence, as a driver, one would prefer to drive for yellow taxi.

#### 4.3.8 Median monthly average trip speed, grouped per dataset type and per borough

![Median speed](medianspeed.png "Title")

Overall, the median monthly trip speeds were quite low, as none of the values went above one mile per minute.

#### 4.3.9 Time Taken to get to Newark Airport

![New York](NY.png "Title")

Taxi trip durations from Manhattan Midtown and Chelsea to Newark Airport showed a similar pattern. Early mornings and late evenings were the best times to take a trip to the airport, while afternoons and early evenings were the worst. We would like to stress that trips from Times Square or Garment District also showed a similar pattern, but two trips with unusual recorded durations caused the left-skewed display. <br>
The exact best time slots to travel to Newark Airport were ***05:00-05:59*** from Manhattan, ***02:00-02:59*** from Chelsea and ***03:00-03:59*** from Times Square or Garment District. The corresponding worst time slots were ***17:00-17:59, 16:00-16:59, 16:00-16:59*** respectively. We decided not to consider ***00:00 - 00:59*** as the worst time slot for Times Square or Garment District because those two trips were outliers. <br>
The plots of the monthly median travel time for the best and worst time slots can be found in the other notebook.
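
Finding the best and worst hourly slots comes down to grouping trips by pickup hour, taking the median duration per hour and picking the extremes. The sketch below shows that idea in plain Python; the helper names and the trip values are made up for illustration and do not reproduce the figures reported above.

```python
from collections import defaultdict
from statistics import median


def median_duration_by_hour(trips):
    """trips: iterable of (pickup_hour, duration_minutes) pairs."""
    by_hour = defaultdict(list)
    for hour, minutes in trips:
        by_hour[hour].append(minutes)
    return {hour: median(values) for hour, values in by_hour.items()}


def best_and_worst(medians):
    """Return the hours with the lowest and highest median duration."""
    best = min(medians, key=medians.get)
    worst = max(medians, key=medians.get)
    return best, worst


def slot_label(hour: int) -> str:
    """Format an hour as a one-hour slot, e.g. 4 becomes '04:00-04:59'."""
    return f"{hour:02d}:00-{hour:02d}:59"


# Invented trips: (pickup hour, duration in minutes).
trips = [(4, 20), (4, 24), (4, 22),
         (13, 35), (13, 41),
         (18, 55), (18, 61), (18, 58)]
medians = median_duration_by_hour(trips)
best, worst = best_and_worst(medians)
labels = (slot_label(best), slot_label(worst))
```

Using the median rather than the mean limits the influence of a few unusually long trips, although an hour with very few trips can still be dominated by outliers.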

#### 4.3.10. Time Taken to get to JFK Airport

![jfk](JFK.png "Title")

Median trip durations to JFK Airport from the three locations showed a similar pattern, which also resembles what we saw for Newark Airport. Generally, the worst time to take a trip to the airport is from ***14:00 - 19:59***. Early mornings and late nights are the best time slots. The plots show ***15:00 - 15:59*** as the worst time slot to take a trip from Times Square or Garment District and from Chelsea to JFK Airport. For Manhattan it is ***16:00 - 16:59***. Their corresponding best time slots are ***01:00 - 01:59, 22:00-22:59, 00:00-00:59*** respectively.

#### 4.3.11 Time Taken to get to LaGuardia Airport

![lg](LG.png "Title")

Once more, the pattern is similar to that of the other two airports. We would like to stress that the median trip durations to LaGuardia Airport are in general low compared to the other airports. This airport may be closer to the three locations (still to be verified). ***17:00 - 17:59*** turns out to be the worst time slot to make a trip to LaGuardia Airport from the three locations. ***23:00 - 23:59*** is the best time slot from Manhattan, ***22:00 - 22:59*** from Chelsea and ***21:00 - 21:59*** from Times Square or Garment District to LaGuardia Airport.