# Big Data Project H600 / L-Group

## Real-World Data Exploration, Integration, Cleansing, Transformation and Analysis

Data:
The New York City Taxi and Limousine Commission (TLC) has been publishing
records of taxi trips in New York since 2009.

The TLC trip dataset consists of 4 sub-datasets:

    1. Yellow taxi records hold trip information for New York's famous yellow taxi cars.

    2. Green taxi records hold trip information for so-called 'boro' taxis, a newer service introduced in August of 2013 to improve taxi service and availability in the boroughs.

    3. FHV records (short for 'For Hire Vehicles') hold information from services that offer for-hire vehicles (such as Uber, Lyft, Via, and Juno), as well as luxury limousine bases.

    4. High volume FHV records (FHVHV for short) are FHV records from services that make more than 10,000 trips per day.
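
As a small illustration of how the four sub-datasets can be told apart in practice, the sketch below maps a file name to its sub-dataset. It assumes the `<type>_tripdata_<year>-<month>.csv` naming pattern seen in the data directory of this notebook; the names `PREFIXES` and `dataset_of` are hypothetical helpers, not part of any TLC or Spark API.

```python
# Hypothetical lookup table: file name prefix -> sub-dataset label.
PREFIXES = {
    "yellow": "Yellow taxi",
    "green": "Green taxi",
    "fhv": "FHV",
    "fhvhv": "High volume FHV",
}


def dataset_of(file_name):
    """Return the sub-dataset label for a name like 'green_tripdata_2020-01.csv'."""
    # Everything before the first underscore is treated as the dataset type.
    prefix = file_name.split("_", 1)[0]
    return PREFIXES.get(prefix, "unknown")


print(dataset_of("fhvhv_tripdata_2019-02.csv"))
print(dataset_of("fhv_tripdata_2015-01.csv"))
```

Matching on the whole prefix, rather than using `startswith("fhv")`, keeps `fhv` and `fhvhv` files from being confused.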

In [6]:
import os 
os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=3g  pyspark-shell"
from pyspark.sql import SparkSession
try:
    spark
    print("Spark application already started. Terminating existing application and starting new one")
    spark.stop()
except NameError:
    # No session exists yet, so there is nothing to stop
    pass

# Create a new spark session (note, the * indicates to use all available CPU cores)
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("H600 L-Group") \
    .getOrCreate()
    
# When dealing with RDDs, we work with the sparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc = spark.sparkContext

# In local mode, the Spark web UI is available at http://localhost:4040



### All files in the pipe

In [9]:
# Let's list the files in data
!ls /home/bigdata/Desktop/TaxiNYC/Data/

fhvhv_tripdata_2019-02.csv  green_tripdata_2019-03.csv
fhvhv_tripdata_2019-03.csv  green_tripdata_2019-04.csv
fhvhv_tripdata_2019-04.csv  green_tripdata_2019-05.csv
fhvhv_tripdata_2019-05.csv  green_tripdata_2019-06.csv
fhvhv_tripdata_2019-06.csv  green_tripdata_2020-01.csv
fhvhv_tripdata_2020-01.csv  green_tripdata_2020-01.txt
fhvhv_tripdata_2020-03.csv  green_tripdata_2020-02.csv
fhvhv_tripdata_2020-04.csv  green_tripdata_2020-04.csv
fhvhv_tripdata_2020-05.csv  green_tripdata_2020-05.csv
fhvhv_tripdata_2020-06.csv  green_tripdata_2020-06.csv
fhv_tripdata_2015-01.csv    yellow_tripdata_2009-01.csv
fhv_tripdata_2015-02.csv    yellow_tripdata_2009-02.csv
fhv_tripdata_2015-03.csv    yellow_tripdata_2009-03.csv
fhv_tripdata_2015-04.csv    yellow_tripdata_2009-04.csv
fhv_tripdata_2015-05.csv    yellow_tripdata_2009-05.csv
fhv_tripdata_2015-06.csv    yellow_tripdata_2009-06.csv
fhv_tripdata_2015-07.csv    yellow_tripdata_2009-07.csv
fhv_tripdata_2015-08.csv    yellow_tripda

### Loading 1 file

In [10]:
# Load the contents of a file into an RDD. Note: when run on the cluster, this loads from HDFS (inside /user/$USER/).
# To load from HDFS explicitly, you can also give the full HDFS URL, e.g.
# hdfs://public00:8020/user/<your_user_id_here>/data/books/pg20417.txt
fileName = 'Data/green_tripdata_2020-01.csv'
TaxiRDD = sc.textFile(fileName)

In [11]:
TaxiRDD.take(5)

['VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge',
 ',2020-01-27 06:40:00,2020-01-27 07:25:00,,,159,61,,19.84,40.95,2.75,0,0,6.12,,0.3,50.12,,,',
 '2,2020-01-18 01:01:42,2020-01-18 01:04:57,N,1,7,146,1,.56,4.5,0.5,0.5,1.45,0,,0.3,7.25,1,1,0',
 '2,2020-01-15 13:52:01,2020-01-15 14:13:09,N,1,134,77,1,5.87,22,0,0.5,0,0,,0.3,22.8,1,1,0',
 '2,2020-01-30 23:49:37,2020-01-31 00:12:17,N,1,42,143,1,5.40,20,0.5,0.5,4.81,0,,0.3,28.86,1,1,2.75']
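
The first element returned by `take(5)` is the CSV header, and the following elements are data rows in which some fields are empty. A minimal sketch of pairing a row with the header, using Python's standard `csv` module on the header and one row copied from the output above (the variable names are illustrative only):

```python
import csv

header_line = (
    "VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,"
    "RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,"
    "fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,"
    "improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge"
)
row_line = "2,2020-01-18 01:01:42,2020-01-18 01:04:57,N,1,7,146,1,.56,4.5,0.5,0.5,1.45,0,,0.3,7.25,1,1,0"

# csv.reader accepts any iterable of lines, so a one-element list is enough.
columns = next(csv.reader([header_line]))
values = next(csv.reader([row_line]))

# Pair every column name with the value at the same position.
record = dict(zip(columns, values))

print(record["lpep_pickup_datetime"], record["trip_distance"])
print(repr(record["ehail_fee"]))  # empty fields come through as empty strings
```

All values are still strings at this point; converting them to numbers or timestamps would be a separate cleaning step.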

In [13]:
tupleRDD = TaxiRDD.map(lambda line: line.split())
tupleRDD.take(3)

[['VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge'],
 [',2020-01-27',
  '06:40:00,2020-01-27',
  '07:25:00,,,159,61,,19.84,40.95,2.75,0,0,6.12,,0.3,50.12,,,'],
 ['2,2020-01-18',
  '01:01:42,2020-01-18',
  '01:04:57,N,1,7,146,1,.56,4.5,0.5,0.5,1.45,0,,0.3,7.25,1,1,0']]
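
Note that `line.split()` with no argument splits on whitespace, which is why each data row above breaks into three pieces at the spaces inside the two timestamps rather than into its CSV fields. The plain-Python sketch below, using the first data row from the output above, contrasts this with splitting on the comma delimiter; it is an illustration, not a change to the notebook's cells.

```python
line = ",2020-01-27 06:40:00,2020-01-27 07:25:00,,,159,61,,19.84,40.95,2.75,0,0,6.12,,0.3,50.12,,,"

by_whitespace = line.split()   # default: split on runs of whitespace
by_comma = line.split(",")     # split on the CSV field delimiter

print(len(by_whitespace), by_whitespace[0])
print(len(by_comma), by_comma[1])
```

Splitting on `","` is adequate here because the sample rows contain no quoted fields; for data with quoted commas, a CSV parser would be the safer choice.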

In [14]:
wordsRDD = TaxiRDD.flatMap(lambda line: line.split())
wordsRDD.take(5)

['VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge',
 ',2020-01-27',
 '06:40:00,2020-01-27',
 '07:25:00,,,159,61,,19.84,40.95,2.75,0,0,6.12,,0.3,50.12,,,',
 '2,2020-01-18']
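
The difference between the `map` and `flatMap` outputs above is that `map` yields one list per input line, while `flatMap` concatenates those lists into a single flat sequence. A plain-Python analogy, using two shortened toy lines modeled on the rows above (these are not real records):

```python
from itertools import chain

# Shortened toy lines, modeled on the data rows shown above.
lines = ["2,2020-01-18 01:01:42,N", "2,2020-01-15 13:52:01,N"]

# Analogous to rdd.map(lambda line: line.split()): one list per line.
mapped = [line.split() for line in lines]

# Analogous to rdd.flatMap(lambda line: line.split()): the lists are flattened.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
```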

In [15]:
wordsRDD.count()

2686

In [16]:
wordsRDD.map(lambda x: 1).reduce(lambda a,b: a+b)

2686
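
Both cells give the same number because mapping every element to `1` and summing the ones is simply counting. The same idea in plain Python, as a sketch with `functools.reduce` standing in for the RDD's `reduce` and a toy token list standing in for `wordsRDD`:

```python
from functools import reduce

# Toy stand-in for the tokens held in wordsRDD.
tokens = ["2,2020-01-18", "01:01:42,N", "2,2020-01-15", "13:52:01,N"]

count_direct = len(tokens)

# Map each token to 1, then add the ones pairwise.
count_map_reduce = reduce(lambda a, b: a + b, map(lambda _: 1, tokens))

print(count_direct, count_map_reduce)
```

Because addition is associative and commutative, the pairwise sums can be computed in any order, which is what allows Spark to reduce partitions independently before combining them.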

### Count the lines in all files

In [17]:
allTaxiRDD = sc.textFile('Data/*.csv')

In [19]:
allTaxiRDD.count()

5036382
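
Since `textFile` yields one element per line, this count includes the header line of every file, not only trip records. A minimal sketch of the adjustment, assuming every CSV starts with exactly one header line (as the green taxi file above does); the dictionary `files` is a toy in-memory stand-in for the data directory:

```python
# Toy stand-in: file name -> list of lines, each file starting with a header.
files = {
    "a.csv": ["VendorID,total_amount", "2,7.25", "2,22.8"],
    "b.csv": ["VendorID,total_amount", "2,28.86"],
}

# What a plain line count sees: headers and data rows together.
total_lines = sum(len(lines) for lines in files.values())

# Subtract one header per non-empty file to get the number of data rows.
data_rows = sum(len(lines) - 1 for lines in files.values() if lines)

print(total_lines, data_rows)
```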

### Yellow taxi records