# Big Data Project H600 / L-Group
***
## Real world Data Exploration, Integration, Cleasning, Transformation and Analysis

Data:
The New York City Taxi and Limousine Commission (or TLC for short) has been publishing
records about taxi trips in New York since 2009. 

The TLC trip dataset actually consists of 4 sub-datasets:

    1.Yellow taxi records are records that record trip information of New York's famous yellow taxi cars

    2.Green taxi records are records that record trip information by so-called 'boro' taxis, a newer service introduced in August of 2013 to improve taxi service and availability in the boroughs. 

    3.FHV records (short for 'For Hire Vehicles') record information from services that offered for-hire vehicles (such as Uber, Lyft, Via, and Juno), but also luxury limousine bases.

    4.High volume FHV (FHVHV for short) are FHV records offered by services that make more than 10,000 trips per day

In [33]:
conda install -c conda-forge matplotlib

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [1]:
import os 
os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=3g  pyspark-shell"
from pyspark.sql import SparkSession
try: 
    spark
    print("Spark application already started. Terminating existing application and starting new one")
    spark.stop()
except: 
    pass
# Create a new spark session (note, the * indicates to use all available CPU cores)
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("H600 L-Group") \
    .getOrCreate()
#When dealing with RDDs, we work the sparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc=spark.sparkContext
#in local mode, you will be able to access the Spark GUI at http://localhost:4040

## 1: Files & Data Exploration 
### 1.1 Count of files per type

In [90]:
print ('count of green tripdata files:')
!find /home/bigdata/Desktop/Project/Data/green_tripdata_*.csv -type f | wc -l 
print ('count of yellow tripdata files:')
!find /home/bigdata/Desktop/Project/Data/yellow_tripdata_*.csv -type f | wc -l
print ('count of fhv tripdata files:')
!find /home/bigdata/Desktop/Project/Data/fhv_tripdata_*.csv -type f | wc -l 
print ('count of fhvhv tripdata files:')
!find /home/bigdata/Desktop/Project/Data/fhvhv_tripdata_*.csv -type f | wc -l
print ('count of all tripdata files:')
!find /home/bigdata/Desktop/Project/Data/*.csv -type f | wc -l 

count of green tripdata files:
76
count of yellow tripdata files:
131
count of fhv tripdata files:
64
count of fhvhv tripdata files:
10
count of all tripdata files:
281


### 1.2 Files size and metrics

In [93]:
#creating folders
!mkdir -p exe01

##creating size files for data exploration
!mkdir -p exe01
!wc Data/green_tripdata_*.csv > exe01/size_greentripdata.txt
!wc Data/yellow_tripdata_*.csv > exe01/size_yellowtripdata.txt
!wc Data/fhv_tripdata_*.csv > exe01/size_fhvtripdata.txt
!wc Data/fhvhv_tripdata_*.csv > exe01/size_fhvhvtripdata.txt
!wc Data/*.csv > exe01/size_alltripdata.txt

In [96]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

sizegreen=np.loadtxt("exe01/size_greentripdata.txt")

#sizegreen=pd.read_csv("exe01/size/size_greentripdata.txt", header=None)
#sizegreen.head()

#plt.bar("File", "Count", data = filecountmerged, color = "blue")
#plt.xlabel("File type")
#plt.ylabel("Number")
#plt.title("Number of files per type")
#plt.show()
#

ValueError: could not convert string to float: 'Data/green_tripdata_2013-08.csv'

      16       46     2512 Data/green_tripdata_2013-08.csv
      99      295    15235 Data/green_tripdata_2013-09.csv
     339     1015    52330 Data/green_tripdata_2013-10.csv
     757     2269   116683 Data/green_tripdata_2013-11.csv
    1190     3568   183869 Data/green_tripdata_2013-12.csv
    1587     4759   244988 Data/green_tripdata_2014-01.csv
    1988     5962   307739 Data/green_tripdata_2014-02.csv
    2569     7705   399204 Data/green_tripdata_2014-03.csv
    2597     7789   403797 Data/green_tripdata_2014-04.csv
    2823     8467   439546 Data/green_tripdata_2014-05.csv
    2656     7966   413919 Data/green_tripdata_2014-06.csv
    2529     7585   393897 Data/green_tripdata_2014-07.csv
    2671     8011   416087 Data/green_tripdata_2014-08.csv
    2703     8107   421297 Data/green_tripdata_2014-09.csv
    2960     8878   460789 Data/green_tripdata_2014-10.csv
    3071     9211   478112 Data/green_tripdata_2014-11.csv
    3263     9787   507737 Data/green_tr

### Loading 1 file

In [10]:
# Load the contents of a file into an RDD. Note - when run on the cluster this load from HDFS (inside /user/$USER/)
# if you really want to load from HDFS, you can also put the full HDFS url, e.g.
# hdfs://public00:8020/user/<your_user_id_here>/data/books/pg20417.txt
fileName = 'Data/green_tripdata_2020-01.csv'
TaxiRDD = sc.textFile(fileName)

In [11]:
TaxiRDD.take(5)

['VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge',
 ',2020-01-27 06:40:00,2020-01-27 07:25:00,,,159,61,,19.84,40.95,2.75,0,0,6.12,,0.3,50.12,,,',
 '2,2020-01-18 01:01:42,2020-01-18 01:04:57,N,1,7,146,1,.56,4.5,0.5,0.5,1.45,0,,0.3,7.25,1,1,0',
 '2,2020-01-15 13:52:01,2020-01-15 14:13:09,N,1,134,77,1,5.87,22,0,0.5,0,0,,0.3,22.8,1,1,0',
 '2,2020-01-30 23:49:37,2020-01-31 00:12:17,N,1,42,143,1,5.40,20,0.5,0.5,4.81,0,,0.3,28.86,1,1,2.75']

In [13]:
tupleRDD = TaxiRDD.map(lambda line: line.split())
tupleRDD.take(3)

[['VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge'],
 [',2020-01-27',
  '06:40:00,2020-01-27',
  '07:25:00,,,159,61,,19.84,40.95,2.75,0,0,6.12,,0.3,50.12,,,'],
 ['2,2020-01-18',
  '01:01:42,2020-01-18',
  '01:04:57,N,1,7,146,1,.56,4.5,0.5,0.5,1.45,0,,0.3,7.25,1,1,0']]

In [14]:
wordsRDD = TaxiRDD.flatMap(lambda line: line.split())
wordsRDD.take(5)

['VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge',
 ',2020-01-27',
 '06:40:00,2020-01-27',
 '07:25:00,,,159,61,,19.84,40.95,2.75,0,0,6.12,,0.3,50.12,,,',
 '2,2020-01-18']

In [15]:
wordsRDD.count()

2686

In [16]:
wordsRDD.map(lambda x: 1).reduce(lambda a,b: a+b)

2686

### Do a word count on all files

In [17]:
allTaxiRDD = sc.textFile('Data/*.csv')

In [19]:
allTaxiRDD.count()

5036382

In [20]:
allTaxiRDD.count()

5036382

### Yellow taxi records