# New York City Taxi Cab Trip

We look at [the New York City Taxi Cab dataset](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml). It includes every taxi ride made in New York City in 2016.

On [this website](http://chriswhong.github.io/nyctaxi/) you can see the data for one random NYC yellow taxi on a single day.

In [this post](http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/), you can read an analysis of this dataset. The Postgres and R scripts are available on [GitHub](https://github.com/toddwschneider/nyc-taxi-data).

## Loading the data

Normally we would read and load this data into memory as a Pandas dataframe. However, in this case that would be unwise because the data is too large to fit in RAM.

The data can stay in the HDFS filesystem, but for performance reasons we can't use the CSV format. The file is large (32 GB) and stored as text, so data access is very slow.
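
To make the memory constraint concrete, here is a minimal sketch of how a CSV file can be processed in chunks with Pandas so that only one chunk is held in RAM at a time. The helper `mean_in_chunks` and the column names `passenger_count` and `trip_distance` are hypothetical and only serve the illustration; a tiny in-memory sample stands in for the real file.

```python
import io

import pandas as pd


def mean_in_chunks(source, column, chunksize=100_000):
    """Compute the mean of one column while holding only one chunk in memory."""
    total, count = 0.0, 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        total += chunk[column].sum()
        count += chunk[column].count()
    return total / count


# Tiny stand-in for the real file (hypothetical column names).
sample = io.StringIO(
    "passenger_count,trip_distance\n"
    "1,1.0\n"
    "2,2.0\n"
    "1,3.0\n"
    "3,6.0\n"
)
mean_distance = mean_in_chunks(sample, "trip_distance", chunksize=2)
print(mean_distance)
```

This keeps memory usage bounded, but every pass still has to parse the whole text file, which is why a binary format is preferable for repeated analysis.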

## Parquet file format

[Parquet format](https://github.com/apache/parquet-format) is a common binary data store, used particularly in the Hadoop/big-data sphere. It provides several advantages relevant to big-data processing:

- columnar storage, only read the data of interest
- efficient binary packing
- choice of compression algorithms and encoding
- split data into files, allowing for parallel processing
- range of logical types
- statistics stored in metadata allow for skipping unneeded chunks
- data partitioning using the directory structure

To convert the CSV file to Parquet we can use Dask or Spark. Here is the code using Spark.
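```python
from pyspark.sql import SparkSession

# Local Spark session with 16 threads; also write Parquet summary metadata files.
spark = (
    SparkSession.builder
    .master("local[16]")
    .config("spark.hadoop.parquet.enable.summary-metadata", "true")
    .getOrCreate()
)

# Read the CSV from HDFS, using the first row as header and inferring column types.
df = spark.read.csv(
    "hdfs://localhost:54310/user/pnavaro/2016_Yellow_Taxi_Trip_Data.csv",
    header=True,
    inferSchema=True,
)

# Write the DataFrame back to HDFS in Parquet format.
df.write.parquet("hdfs://localhost:54310/user/pnavaro/nyc-taxi/2016.parquet")
spark.stop()
```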