# We will follow the same steps used for the 2020 taxi data. To understand the steps in detail, please review `Spark_preprocess_2020`

# Preprocessing Stage
### Processing 2021 Taxi data 

__This includes the following steps:__

1- Reading the taxi data stored in an AWS S3 bucket.

2- Analyzing and cleaning the data.

3- Grouping the data on an hourly basis and making it ready for modeling.

4- Saving the transformed data to an S3 bucket.



===============================================================================================




# 1- Loading the data

In [1]:
from pyspark.sql import SparkSession

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
1,application_1662022518326_0002,pyspark,idle,,,,✔



SparkSession available as 'spark'.



In [2]:
spark = SparkSession.builder.appName('cleaning').getOrCreate()


* __The 2021 taxi data consists of 12 parquet files, each containing one month of taxi records. The files are stored in the `taxi-project-thesis` AWS bucket inside the `taxi2021` folder. The wildcard path (`taxi2021/*`) reads all 12 parquet files into one dataframe, and `.coalesce(1)` then reduces the result to a single partition.__

In [50]:
df=spark.read.parquet("s3://taxi-project-thesis/taxi2021/*").coalesce(1)


In [51]:
df.count()


30904308

__The dataframe contains `30904308` taxi trips. Let's analyze these trips.__

In [52]:
from pyspark.sql.functions import *


# 2- Analyzing and cleaning the data
## 2.1 Cleaning incorrect trips using pickup_datetime and dropoff_datetime

### Cleaning the trips using the pick-up year
* We need to extract the distinct pick-up years to see whether the dataframe contains trips that belong to incorrect years

In [53]:
# adding the pick up year
df_new=df.withColumn('Pick_Year',year(df['tpep_pickup_datetime']))


In [54]:
# adding the drop-off year
df_new=df_new.withColumn('Drop_Year',year(df_new['tpep_dropoff_datetime']))


In [55]:
# check the pick-up year
df_new.select('Pick_Year').distinct().collect()


[Row(Pick_Year=2021), Row(Pick_Year=2009), Row(Pick_Year=2003), Row(Pick_Year=2008), Row(Pick_Year=2022), Row(Pick_Year=2020), Row(Pick_Year=2098), Row(Pick_Year=2028), Row(Pick_Year=2004), Row(Pick_Year=2002), Row(Pick_Year=2029), Row(Pick_Year=2011), Row(Pick_Year=2070)]

In [56]:
# adding pick up month
df_new=df_new.withColumn('Pick_month',month(df_new['tpep_pickup_datetime']))


In [57]:
# adding drop month
df_new=df_new.withColumn('drop_month',month(df_new['tpep_dropoff_datetime']))


In [58]:
# collect the distinct pick-up years so we can count the trips in each one
pick_year = df_new.select(collect_set('Pick_Year').alias('Pick_Year')).first()['Pick_Year']


In [10]:
for i in (pick_year):
    print('year ',i,df_new.filter(df_new['Pick_Year']==i).count())


year  2022 49
year  2004 1
year  2070 1
year  2008 86
year  2020 16
year  2002 1
year  2009 203
year  2021 30903934
year  2003 10
year  2028 1
year  2029 1
year  2098 1
year  2011 4

In [59]:
# we will keep only the trips picked up in 2020 or 2021
year_list=[2020,2021]


In [60]:
df_new=df_new.filter(df_new.Pick_Year.isin(year_list))


In [61]:
df.count()-df_new.count()


358
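
As a quick cross-check, the 358 removed rows should equal the sum of the per-year counts for every pick-up year other than 2020 and 2021. The sketch below redoes that arithmetic in plain Python with the counts printed by the loop above; it does not touch Spark, and the variable names are illustrative only.

```python
# Per-year pick-up counts copied from the loop output above
year_counts = {
    2022: 49, 2004: 1, 2070: 1, 2008: 86, 2020: 16, 2002: 1, 2009: 203,
    2021: 30903934, 2003: 10, 2028: 1, 2029: 1, 2098: 1, 2011: 4,
}
kept_years = {2020, 2021}

# Rows dropped by the isin() filter vs. rows that survive it
removed = sum(n for y, n in year_counts.items() if y not in kept_years)
kept = sum(n for y, n in year_counts.items() if y in kept_years)

print(removed)         # 358
print(kept + removed)  # 30904308, the original row count
```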

In [62]:
# to check the pick up year in our dataframe
df_new.select('Pick_Year').distinct().collect()


[Row(Pick_Year=2021), Row(Pick_Year=2020)]

### Cleaning the trips using the drop-off year

* We need to extract the distinct drop-off years to see if there are incorrect years

In [63]:
df_new.select('Drop_Year').distinct().collect()


[Row(Drop_Year=2021), Row(Drop_Year=2020), Row(Drop_Year=2022)]

Some trips have a drop-off year of 2020; these should be removed.

In [64]:
# count the rows whose drop-off year is 2020 (these will be removed)
df_new.filter(df_new['Drop_Year']==2020).count()


10

In [65]:
year_list=[2021,2022]


In [66]:
df_new=df_new.filter(df_new.Drop_Year.isin(year_list))


In [67]:
#check the drop year after filtering
df_new.select('Drop_Year').distinct().collect()


[Row(Drop_Year=2021), Row(Drop_Year=2022)]

## 2.2 Analyzing and cleaning the trip distance


In [68]:
# checking the trips that have zero trip distance
df_new.filter('trip_distance=0').count()


407780

Let's do further analysis. If the drop-off location is the same as the pick-up location, we assume there was congestion or some other reason that made the passenger leave the taxi, so the trip is still valid.

In [69]:
zero_distance=df_new.filter(df_new['trip_distance']==0)


In [70]:
print('Number of trips with zero trip distance ',zero_distance.count(),' trips')


Number of trips with zero trip distance  407780  trips

In [71]:
# trips which have zero distance and the same pick-up and drop-off location
zero_distance.filter(zero_distance['PULocationID']==zero_distance['DOLocationID']).count()


197671

__So we will remove the trips that have zero trip distance and a pick-up location different from the drop-off location__

In [72]:
# trips which have zero distance and pick location is different than drop location
zero_distance.filter(zero_distance['PULocationID']!=zero_distance['DOLocationID']).count()


210109

In [73]:
# expected row count after removing those trips
df_new.count() - 210109


30693831

### Steps:

* We will create a dataframe containing only the zero-distance trips (already done above as `zero_distance`).

* We will create a dataframe that has no zero-distance trips (`no_zero_distance`).


* From the zero-distance dataframe, we will filter out the trips whose pick-up location is different from the drop-off location.

* Then we will merge the filtered zero-distance dataframe with the non-zero-distance dataframe.
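
The split, filter and union steps above amount to a single keep/drop rule per trip. The plain-Python sketch below states that rule on a few made-up rows; `keep_trip` and the sample values are hypothetical, and null handling in Spark is ignored here.

```python
def keep_trip(trip_distance, pu_location, do_location):
    """Keep a trip unless it has zero distance AND different pick-up/drop-off zones."""
    return trip_distance != 0 or pu_location == do_location

# Hypothetical sample rows: (trip_distance, PULocationID, DOLocationID)
sample = [
    (2.3, 140, 236),  # normal trip: kept
    (0.0, 140, 140),  # zero distance, same zone: kept
    (0.0, 140, 236),  # zero distance, different zones: removed
]

cleaned = [row for row in sample if keep_trip(*row)]
print(cleaned)  # [(2.3, 140, 236), (0.0, 140, 140)]
```

The counts reported above are consistent with this rule: 407780 zero-distance trips minus 210109 with different locations leaves the 197671 trips that are kept.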

In [74]:
# 1- we will keep only the trips with a non-zero distance
no_zero_distance=df_new.filter(df_new['trip_distance']!=0)


In [75]:
# 2- the zero-distance dataframe was already created above
zero_distance.count()


407780

In [76]:
#checking the count of the trips with zero distance and pick location is different than drop location
zero_distance.filter(zero_distance['PULocationID']!=zero_distance['DOLocationID']).count()


210109

In [77]:
# 3- we will filter out the zero-distance trips with different pick-up and drop-off locations
# by keeping only the trips whose pick-up and drop-off locations are the same
Zero_distance_cleared=zero_distance.filter(zero_distance['PULocationID']==zero_distance['DOLocationID'])


In [78]:
# to verify that the count matches the expected number
no_zero_distance.count() +Zero_distance_cleared.count()


30693831

In [79]:
# we will merge the non-zero-distance trips with the zero-distance trips that have the same pick-up and drop-off location
df_new_1 = no_zero_distance.union(Zero_distance_cleared)


In [80]:
# this dataset contains all the observations except those with zero distance and a drop-off location different from the pick-up location
df_new_1.count()


30693831

In [81]:
# check whether any trips with zero distance and different drop-off and pick-up locations remain
df_new_1.filter((df_new_1['trip_distance']==0) & (df_new_1['PULocationID']!=df_new_1['DOLocationID'])).count()


0

In [82]:
df_new_1.select('trip_distance').describe().show()


+-------+-----------------+
|summary|    trip_distance|
+-------+-----------------+
|  count|         30693831|
|   mean|6.970119250021118|
| stddev|700.7701268807028|
|    min|              0.0|
|    max|        351613.36|
+-------+-----------------+

In [83]:
# Number of trips of 1000 miles or more
df_new_1.filter(df_new_1['trip_distance']>=1000).count()


1405

__We will remove the unrealistic trips by dropping any trip of 1000 miles or more__

In [84]:
df_new_1=df_new_1.filter(df_new_1['trip_distance']<1000)


In [85]:
df_new_1.count()


30692426

## 2.3 Fare amount
As per https://www1.nyc.gov/site/tlc/passengers/taxi-fare.page, the minimum fare amount is the $2.50 initial charge, so we will remove the trips with a fare amount below that.

In [86]:
#removing the trips that have fare amount less than 2.5
df_new_1=df_new_1.filter(df_new_1['fare_amount']>=2.5)


In [87]:
# we will remove the trips above 500 dollars: in my view they are not realistic, and there are only 113 of them, so removing them will not affect our data
df_new_1=df_new_1.filter(df_new_1['fare_amount']<=500)


In [88]:
df_new_1.count()


30538443

## 2.4 Trip duration
As per the NYC Taxi & Limousine Commission regulations, the maximum time a taxi driver can drive per day is 12 hours, so we will assume that the maximum trip duration is 12 hours and remove any records that exceed 12 hours.

Steps to remove the trips above 12 hours:

* We will create a new column, `trip_duration_seconds`, that contains the trip duration in seconds (subtract the UNIX pick-up time from the UNIX drop-off time).

* We will create the trip duration in hours from the newly created `trip_duration_seconds`.

* Then we will filter out any trip above 12 hours.
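
To make the arithmetic concrete, here is a minimal plain-Python sketch of the same three steps for a single trip picked up at 00:34:15 and dropped off at 00:57:21 on 2021-11-01. The helper `trip_duration_hours` is hypothetical and only mirrors the logic; the notebook itself does this with Spark column functions.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def trip_duration_hours(pickup, dropoff):
    """Return (duration in seconds, duration in hours rounded to 1 decimal)."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    seconds = int(delta.total_seconds())
    return seconds, round(seconds / 3600, 1)

seconds, hours = trip_duration_hours("2021-11-01 00:34:15", "2021-11-01 00:57:21")
print(seconds, hours)  # 1386 0.4
print(hours <= 12)     # True, so this trip is kept
```

Because the hours are rounded to one decimal before the comparison, a trip slightly longer than 12 hours (for example 12.04 hours) rounds to 12.0 and still passes a `<= 12` check. Filtering on the seconds column (`<= 12 * 3600`) would avoid that if a strict cut-off is wanted.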

In [89]:
df_new_1.withColumn('trip_duration_seconds',unix_timestamp("tpep_dropoff_datetime") - unix_timestamp('tpep_pickup_datetime'))\
.select(['tpep_pickup_datetime','tpep_dropoff_datetime','trip_duration_seconds']).show(20)


+--------------------+---------------------+---------------------+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_duration_seconds|
+--------------------+---------------------+---------------------+
| 2021-11-01 00:34:15|  2021-11-01 00:57:21|                 1386|
| 2021-11-01 00:00:29|  2021-11-01 00:14:13|                  824|
| 2021-11-01 00:27:51|  2021-11-01 00:34:21|                  390|
| 2021-11-01 00:01:56|  2021-11-01 00:25:18|                 1402|
| 2021-11-01 00:10:44|  2021-11-01 00:29:06|                 1102|
| 2021-11-01 00:36:21|  2021-11-01 00:54:46|                 1105|
| 2021-11-01 00:12:05|  2021-11-01 00:31:06|                 1141|
| 2021-11-01 00:49:43|  2021-11-01 00:59:38|                  595|
| 2021-11-01 00:10:05|  2021-11-01 00:37:59|                 1674|
| 2021-11-01 00:26:13|  2021-11-01 00:37:37|                  684|
| 2021-11-01 00:41:44|  2021-11-01 00:46:21|                  277|
| 2021-11-01 00:24:21|  2021-11-01 00:33:56|                  

In [90]:
df_new_1=df_new_1.withColumn('trip_duration_seconds',unix_timestamp("tpep_dropoff_datetime") - unix_timestamp('tpep_pickup_datetime'))


In [91]:
# We will create trip duration in hour which is the `trip_duration_seconds / 3600`
df_new_1=df_new_1.withColumn('duration_In_Hours',round(col('trip_duration_seconds')/3600,1))


In [92]:
df_new_1=df_new_1.filter(df_new_1['duration_In_Hours']<=12)


In [93]:
df_new_1.filter(df_new_1['duration_In_Hours']>12).count()


0

## 2.5 Check the trips in each month

Trips picked up in a given month should be dropped off in the same month, with a few dropped off in the following month. For example, trips picked up in January should be dropped off in January, and some of them in February, on the assumption that a trip picked up during the last hour of 31 January may be dropped off during the first hour of 1 February.
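
The rule described above can be written as a small predicate. This is only a plain-Python sketch with a hypothetical helper name; it looks at month numbers alone and ignores the year.

```python
def drop_month_is_plausible(pick_month, drop_month):
    """True when the drop-off month is the pick-up month or the one right after it."""
    next_month = pick_month % 12 + 1  # December (12) wraps around to January (1)
    return drop_month in (pick_month, next_month)

print(drop_month_is_plausible(1, 2))   # True: late-January trip ending in February
print(drop_month_is_plausible(12, 1))  # True: late-December trip ending in January
print(drop_month_is_plausible(8, 7))   # False: drop-off before pick-up month
```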

In [94]:
# adding pick up month (already added to df_new earlier, so df_new_1 already carries this column)
df_new=df_new.withColumn('Pick_month',month(df_new['tpep_pickup_datetime']))


In [95]:
for i in range(1,13):
    print('Pick up month ',i,df_new_1.filter(df_new_1['Pick_month']==i).select('drop_month').distinct().collect())


Pick up month  1 [Row(drop_month=1), Row(drop_month=2)]
Pick up month  2 [Row(drop_month=3), Row(drop_month=2)]
Pick up month  3 [Row(drop_month=3), Row(drop_month=4)]
Pick up month  4 [Row(drop_month=4), Row(drop_month=5)]
Pick up month  5 [Row(drop_month=6), Row(drop_month=5)]
Pick up month  6 [Row(drop_month=6), Row(drop_month=7)]
Pick up month  7 [Row(drop_month=8), Row(drop_month=7)]
Pick up month  8 [Row(drop_month=8), Row(drop_month=7), Row(drop_month=9)]
Pick up month  9 [Row(drop_month=9), Row(drop_month=10)]
Pick up month  10 [Row(drop_month=11), Row(drop_month=10)]
Pick up month  11 [Row(drop_month=12), Row(drop_month=11)]
Pick up month  12 [Row(drop_month=12), Row(drop_month=1), Row(drop_month=10)]

__We notice that some August trips were dropped off in July, which is incorrect. December has the same problem, with a drop-off in October. Let's view this data.__

In [96]:
df_new_1.filter((df_new_1['Pick_month']==8) & (df_new_1['drop_month']==7)).count()


1

In [97]:
df_new_1.filter((df_new_1['Pick_month']==8) & (df_new_1['drop_month']==7)).select(['tpep_pickup_datetime','tpep_dropoff_datetime']).show()


+--------------------+---------------------+
|tpep_pickup_datetime|tpep_dropoff_datetime|
+--------------------+---------------------+
| 2021-08-01 01:51:44|  2021-07-28 02:00:53|
+--------------------+---------------------+

This is an incorrect trip that should be removed.

__For August:__

__Steps to remove this trip:__


* Create a dataframe without the August trips

* Create a dataframe with only the August trips

* Filter out this row from the August dataframe and union the two dataframes

In [99]:
df_new_1.count()


30485517

In [100]:
# August trips count
df_new_1.filter(df_new_1['Pick_month']==8).count()


2750378

In [101]:
# the whole year without August
df_new_1.filter(df_new_1['Pick_month']!=8).count()


27735139

In [108]:
# create Aug_df
Aug_df=df_new_1.filter(df_new_1['Pick_month']==8)


In [109]:
# create no_aug_df
no_aug_df = df_new_1.filter(df_new_1['Pick_month']!=8)


In [105]:
# count the August trips that were dropped off in July
Aug_df.filter(Aug_df['drop_month']==7).count()


1

In [111]:
Aug_df=Aug_df.filter(Aug_df['drop_month']!=7)


In [112]:
Aug_df.select('drop_month').distinct().collect()


[Row(drop_month=8), Row(drop_month=9)]

In [113]:
df_new_final = Aug_df.union(no_aug_df)


In [114]:
df_new_final.count()


30485516

__For December:__

__Steps to remove this trip:__


* Create a dataframe without the December trips

* Create a dataframe with only the December trips

* Filter out this row from the December dataframe and union the two dataframes

In [115]:
df_new_final.filter((df_new_final['Pick_month']==12) & (df_new_final['drop_month']==10)).count()


1

In [116]:
df_new_final.filter((df_new_final['Pick_month']==12) & (df_new_final['drop_month']==10)).select(['tpep_pickup_datetime','tpep_dropoff_datetime']).show()


+--------------------+---------------------+
|tpep_pickup_datetime|tpep_dropoff_datetime|
+--------------------+---------------------+
| 2021-12-19 01:14:43|  2021-10-07 12:02:08|
+--------------------+---------------------+

This is an incorrect trip that should be removed.

In [124]:
# create Dec_df
Dec_df=df_new_final.filter(df_new_final['Pick_month']==12)


In [125]:
# create no_dec_df
no_dec_df = df_new_final.filter(df_new_final['Pick_month']!=12)


In [126]:
# count the December trips that were dropped off in October
Dec_df.filter(Dec_df['drop_month']==10).count()


1

In [127]:
Dec_df=Dec_df.filter(Dec_df['drop_month']!=10)


In [128]:
Dec_df.select('drop_month').distinct().collect()


[Row(drop_month=12), Row(drop_month=1)]

In [129]:
df_new_final = Dec_df.union(no_dec_df)


In [130]:
df_new_final.count()


30485515

# 3- Grouping the data per location ID on an hourly basis and making it ready for modeling

__We need to group by location ID and aggregate the count of taxi trips on an hourly basis__


In [132]:
df_new_final.select(countDistinct("PULocationID")).show()


+----------------------------+
|count(DISTINCT PULocationID)|
+----------------------------+
|                         263|
+----------------------------+

In [137]:
# Saving the distinct location ID in a list
LocationID2021 = df_new_final.select(collect_set('PULocationID').alias('locationID_list')).first()['locationID_list']


In [138]:
len(LocationID2021)


263

We have 263 location IDs in 2021 but only 262 in 2020, so we will filter out the extra location.

In [133]:
df=spark.read.parquet("s3://taxi-project-thesis/taxi2020/*").coalesce(1)


In [135]:
LocationID2020 = df.select(collect_set('PULocationID').alias('locationID_list')).first()['locationID_list']


In [136]:
len(LocationID2020)


262

In [139]:
unique_list = []
 
for item in LocationID2021: 
    if item not in LocationID2020: 
        unique_list.append(item)


In [140]:
unique_list


[110]

In [146]:
df_new_final.filter((df_new_final['PULocationID']==unique_list[0])).count()


1

In [147]:
df_new_final=df_new_final.filter(df_new_final['PULocationID']!=unique_list[0])


In [148]:
df_new_final.filter((df_new_final['PULocationID']==unique_list[0])).count()


0

__We have 262 location IDs, so when we aggregate the taxi trips on an hourly basis, we should have `365 x 24 x 262 = 2295120` rows__

- 365 days in 2021

- 24 hours per day

- 262 location IDs
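
The expected row count is simple arithmetic, sketched here in plain Python so it can be re-run for another year or a different number of locations. The variable names are illustrative.

```python
import calendar

year = 2021
days = 366 if calendar.isleap(year) else 365  # 2021 is not a leap year
hours_per_day = 24
n_locations = 262

hours_in_year = days * hours_per_day
expected_rows = hours_in_year * n_locations

print(hours_in_year)  # 8760
print(expected_rows)  # 2295120
```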

### Steps:
We will change the timestamp to this format `"yyyy-MM-dd HH:00"` to be able to group the trips hourly
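
As an illustration of what the hourly format does, the plain-Python sketch below truncates a few made-up pick-up timestamps to their hour and counts trips per (hour, location) pair. The helper `to_hourly_key` and the sample rows are hypothetical; the notebook performs the equivalent steps in Spark.

```python
from collections import Counter
from datetime import datetime

def to_hourly_key(ts):
    """Truncate a 'YYYY-MM-DD HH:MM:SS' string to its hour, e.g. '2021-11-01 00:00'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:00")

# Hypothetical pick-ups: (timestamp, PULocationID)
pickups = [
    ("2021-11-01 00:34:15", 140),
    ("2021-11-01 00:00:29", 140),
    ("2021-11-01 01:05:00", 75),
]

# Count trips per (hourly bucket, location)
trips_per_hour = Counter((to_hourly_key(ts), loc) for ts, loc in pickups)
print(trips_per_hour[("2021-11-01 00:00", 140)])  # 2
```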

In [149]:
df_new_final=df_new_final.withColumn("Pickup_datetime_hourly", date_format(col("tpep_pickup_datetime").cast("timestamp"), "yyyy-MM-dd HH:00"))


__we will create trip count column__

In [150]:
# we will create trip count column
df_new_final=df_new_final.withColumn("Trip_count", lit(1))


### Now let's group by and aggregate the trips per location on an hourly basis

In [154]:
hourly_aggregated=df_new_final.groupby(['Pickup_datetime_hourly','PULocationID']).agg({'Trip_count':'count'})\
.withColumnRenamed("count(Trip_count)", "Trips_count")


In [155]:
hourly_aggregated.show()


+----------------------+------------+-----------+
|Pickup_datetime_hourly|PULocationID|Trips_count|
+----------------------+------------+-----------+
|      2021-08-01 03:00|         140|         10|
|      2021-08-01 04:00|         179|          1|
|      2021-08-01 16:00|         142|        153|
|      2021-08-01 21:00|          75|         13|
|      2021-08-02 10:00|         144|         10|
|      2021-08-02 13:00|          65|          1|
|      2021-08-04 17:00|         143|         87|
|      2021-08-05 21:00|         181|          3|
|      2021-08-06 22:00|         130|          2|
|      2021-08-07 13:00|           4|          7|
|      2021-08-07 19:00|         114|         89|
|      2021-08-07 21:00|          49|          4|
|      2021-08-09 10:00|         148|         14|
|      2021-08-10 15:00|         237|        444|
|      2021-08-12 07:00|         144|          7|
|      2021-08-13 06:00|         230|         21|
|      2021-08-14 03:00|         141|         19|


In [156]:
hourly_aggregated.count()


955052

The aggregated dataframe has 955052 rows, but it should have 2295120. I think the issue is that some locations have no pick-up records in certain hours, so those hour/location combinations simply do not appear in the aggregated data.

#### To solve this problem:
- Create a dataframe of hourly timestamps from 1-1-2021 up to (but not including) 1-1-2022 (8760 rows).
- Generate that dataframe for each locationID and merge all of them (2295120 rows).
- Create a unique column that contains the __timestamp plus the locationID__ in both dataframes so we can join them.
- Left join the generated dataframe with our hourly aggregated dataframe.
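
The idea behind these steps can be shown on a tiny example in plain Python: build every (hour, location) pair, then look up the observed count for each pair. The data below is made up, and filling missing pairs with 0 is an assumption of this sketch (a left join by itself would leave them null).

```python
from datetime import datetime, timedelta
from itertools import product

# Hypothetical aggregated counts: only observed (hour, location) pairs exist
observed = {
    ("2021-01-01 00:00", 140): 10,
    ("2021-01-01 02:00", 75): 3,
}
locations = [75, 140]

# Hourly grid for the first three hours of 2021 (the notebook covers the full year)
start = datetime(2021, 1, 1)
hours = [(start + timedelta(hours=h)).strftime("%Y-%m-%d %H:00") for h in range(3)]

# Full grid = every hour x every location; missing pairs get a count of 0 (assumption)
full = {key: observed.get(key, 0) for key in product(hours, locations)}

print(len(full))                        # 6 = 3 hours x 2 locations
print(full[("2021-01-01 00:00", 140)])  # 10
print(full[("2021-01-01 01:00", 140)])  # 0
```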

### 1- Creating a unique column in hourly_aggregated
The unique column combines the timestamp with the locationID. We need to convert the locationID from integer to string, then use `concat_ws` to join the two string values.
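
For reference, the key being built is just the hourly timestamp and the location ID joined with an underscore. A minimal plain-Python equivalent, with a hypothetical helper name, looks like this:

```python
def make_unique_key(hourly_timestamp, location_id):
    """Join the hourly timestamp and the location ID (cast to string) with '_'."""
    return "_".join([hourly_timestamp, str(location_id)])

print(make_unique_key("2021-08-01 03:00", 140))  # 2021-08-01 03:00_140
```

Joining directly on the two columns (timestamp and location ID) would likely work as well; the single key column is the approach this notebook takes.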

In [162]:
hourly_aggregated.printSchema()


root
 |-- Pickup_datetime_hourly: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- Trips_count: long (nullable = false)

In [163]:
from pyspark.sql.types import TimestampType,StructField,StringType,IntegerType,StructType


In [164]:
#Create new column which convert the location ID to string
hourly_aggregated= hourly_aggregated.withColumn("PULocationID_string", hourly_aggregated["PULocationID"].cast(StringType()))


In [165]:
#Create the unique column using concat_ws
hourly_aggregated=hourly_aggregated.select(concat_ws('_',hourly_aggregated.Pickup_datetime_hourly,
                                   hourly_aggregated.PULocationID_string).alias("UniqueColumn"),"Pickup_datetime_hourly","PULocationID","Trips_count")


In [166]:
hourly_aggregated.show(5)


+--------------------+----------------------+------------+-----------+
|        UniqueColumn|Pickup_datetime_hourly|PULocationID|Trips_count|
+--------------------+----------------------+------------+-----------+
|2021-08-01 03:00_140|      2021-08-01 03:00|         140|         10|
|2021-08-01 04:00_179|      2021-08-01 04:00|         179|          1|
|2021-08-01 16:00_142|      2021-08-01 16:00|         142|        153|
| 2021-08-01 21:00_75|      2021-08-01 21:00|          75|         13|
|2021-08-02 10:00_144|      2021-08-02 10:00|         144|         10|
+--------------------+----------------------+------------+-----------+
only showing top 5 rows

In [167]:
hourly_aggregated.printSchema()


root
 |-- UniqueColumn: string (nullable = false)
 |-- Pickup_datetime_hourly: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- Trips_count: long (nullable = false)

### 2- Generate timestamp dataframe from 1-1-2021 till 1-1-2022
The function below generates a Spark dataframe of timestamps

In [168]:
def generate_series(start, stop, interval):
    """
    :param start    - lower bound, inclusive
    :param stop     - upper bound, exclusive
    :param interval - increment interval in seconds (int)
    """
    start, stop = spark.createDataFrame(
        [(start, stop)], ("start", "stop")
    ).select(
        [col(c).cast("timestamp").cast("long") for c in ("start", "stop")
    ]).first()
    # Create range with increments and cast to timestamp
    return spark.range(start, stop, interval).select(
        col("id").cast("timestamp").alias("timestamp")
    )


In [169]:
# the interval is 60*60 seconds, i.e. one row per hour
timestamp_df=generate_series("2021-01-01", "2022-01-01", 60 * 60)


__We will change the generated timestamp to the same format in `hourly_aggregated`__

In [170]:
timestamp_df=timestamp_df.withColumn("hourly_timestamp", date_format(col("timestamp").cast("timestamp"), "yyyy-MM-dd HH:00"))

In [171]:
timestamp_df.printSchema()

root
 |-- timestamp: timestamp (nullable = false)
 |-- hourly_timestamp: string (nullable = false)

In [172]:
timestamp_df.count()

8760
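
As a quick cross-check of the row count above, the same hourly series can be sketched in plain Python without Spark. The helper name `hourly_series` is hypothetical and not part of the notebook. It uses naive datetimes, so it assumes no daylight-saving shifts (roughly what a UTC session time zone would give).

```python
from datetime import datetime, timedelta

def hourly_series(start, stop, step_seconds=60 * 60):
    """Return datetimes from start (inclusive) to stop (exclusive)."""
    step = timedelta(seconds=step_seconds)
    out = []
    current = start
    while current < stop:
        out.append(current)
        current += step
    return out

# Same bounds as the notebook: all of 2021, upper bound excluded
hours_2021 = hourly_series(datetime(2021, 1, 1), datetime(2022, 1, 1))

# Same string format as `hourly_timestamp`
labels = [ts.strftime("%Y-%m-%d %H:00") for ts in hours_2021]
print(len(hours_2021), labels[0], labels[-1])
```

2021 is not a leap year, so the series has 365 × 24 = 8760 entries, running from `2021-01-01 00:00` to `2021-12-31 23:00`.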

__This dataframe now contains one row per hour of 2021, so we need to repeat these rows for every location ID that exists in `hourly_aggregated`.__

Let's extract the pickup location IDs and save them in a list.

In [173]:
# extract the pickup location id from the dataset and save it in list
Pick_up_LocationID = df_new_final.select(collect_set('PULocationID').alias('locationID_list')).first()['locationID_list']

__We have 262 location IDs and we aggregate the trips on an hourly basis, so we should end up with `365 X 24 X 262 = 2295120 rows`.__

We need to repeat the generated timestamps for each location ID (with a for loop) and append each result to an empty Spark dataframe. The result will have 3 columns (`timestamp`, `hourly_timestamp`, `locationID`) and 2295120 rows.

__Define a function that adds the location ID column to the generated timestamps__

In [174]:
def locations_timestampGenerator(dataframe,locationID):
    dataframe=dataframe.withColumn("locationID", lit(locationID))
    return dataframe

__Now let's create an empty Spark dataframe with three columns: `timestamp`, `hourly_timestamp` and `locationID`__

In [175]:
data_schema = [StructField("timestamp",TimestampType(), True),StructField("hourly_timestamp", StringType(), True),
               StructField("locationID", IntegerType(), True)]

In [176]:
final_struct = StructType(fields=data_schema)

In [177]:
ID_plus_timestamp = spark.createDataFrame([], schema=final_struct)

In [178]:
# iterate generated timestamp for each locationID (for loop) 
for i in Pick_up_LocationID:
    dataframe=locations_timestampGenerator(timestamp_df,i)
    ID_plus_timestamp = ID_plus_timestamp.union(dataframe)

In [179]:
ID_plus_timestamp.count()

2295120
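
The loop above builds one row for every (hour, location) pair, i.e. a Cartesian product. A minimal plain-Python sketch of the same idea follows, using a made-up list of three hours and two location IDs (illustrative values, not the full data). In PySpark the same grid could probably be built in one step with `DataFrame.crossJoin` instead of repeated `union` calls, although that variant was not run in this notebook.

```python
from itertools import product

# Illustrative sample: 3 hourly labels and 2 location IDs
sample_hours = ["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00"]
sample_locations = [45, 256]

# One row per (hour, location) pair, grouped by location like the loop above
grid = [(hour, loc) for loc, hour in product(sample_locations, sample_hours)]
print(len(grid))  # 3 hours x 2 locations

# The full grid for 2021: hours in the year times distinct pickup locations
expected_rows = 365 * 24 * 262
print(expected_rows)
```

The second number matches the `ID_plus_timestamp.count()` result, which is a cheap sanity check that no location or hour was skipped.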

__Now let's make sure that the location IDs created by the `for loop` match the location IDs in the original aggregated dataset__

We will extract the distinct location IDs from `ID_plus_timestamp` and compare them with the location IDs in `hourly_aggregated`.

In [180]:
location_ID_list2=ID_plus_timestamp.select(collect_set('locationID').alias('locationID_list')).first()['locationID_list']

__Remember that we already extracted the pickup location IDs into `Pick_up_LocationID`, so we don't need to do it again.__

Any location ID in the `ID_plus_timestamp` dataframe that does not exist in `Pick_up_LocationID` is appended to the empty list below.

In [181]:
unique_list = []
 
for item in location_ID_list2: 
    if item not in Pick_up_LocationID: 
        unique_list.append(item)

In [182]:
# Should be empty
unique_list

[]

#### Last step: we need to create a unique column that contains the __timestamp plus the locationID__ in `ID_plus_timestamp` so we can join it to `hourly_aggregated`.

In [183]:
# Change the location ID from integer to string
ID_plus_timestamp= ID_plus_timestamp.withColumn("locationID_string", ID_plus_timestamp["locationID"].cast(StringType()))

In [184]:
# Create the unique column using concat_ws
ID_plus_timestamp=ID_plus_timestamp.select(concat_ws('_',ID_plus_timestamp.hourly_timestamp,
                                   ID_plus_timestamp.locationID_string).alias("UniqueColumn_G"),"timestamp","hourly_timestamp","locationID")

In [185]:
ID_plus_timestamp.show(3)

+--------------------+-------------------+----------------+----------+
|      UniqueColumn_G|          timestamp|hourly_timestamp|locationID|
+--------------------+-------------------+----------------+----------+
|2021-01-01 00:00_256|2021-01-01 00:00:00|2021-01-01 00:00|       256|
|2021-01-01 01:00_256|2021-01-01 01:00:00|2021-01-01 01:00|       256|
|2021-01-01 02:00_256|2021-01-01 02:00:00|2021-01-01 02:00|       256|
+--------------------+-------------------+----------------+----------+
only showing top 3 rows
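
The unique column is simply the hourly label and the location ID joined by an underscore. A small plain-Python sketch of the same key construction, with a hypothetical helper `make_key` that is not part of the notebook:

```python
from datetime import datetime

def make_key(ts, location_id):
    """Build '<yyyy-MM-dd HH:00>_<locationID>' from a datetime and an ID."""
    return f"{ts.strftime('%Y-%m-%d %H:00')}_{location_id}"

# Minutes and seconds are dropped, so every trip in the same hour
# and location maps to the same key
key = make_key(datetime(2021, 1, 1, 0, 37, 12), 256)
print(key)
```

Because the key is just the pair (`hourly_timestamp`, `locationID`) encoded as a string, joining on those two columns directly would presumably give the same result. The string key is kept here to stay consistent with the 2020 notebook.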

#### Now join both dataframes using the unique column

In [186]:
# join both dataframes using the unique column
Pick_aggregated=hourly_aggregated.join(ID_plus_timestamp,hourly_aggregated.UniqueColumn ==  ID_plus_timestamp.UniqueColumn_G,"right")

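__After joining the dataframes, we need to replace the null values in the `PULocationID` and `Trips_count` columns of `Pick_aggregated` with 0. These nulls come from the right join: an hour with no recorded pickups for a location has no matching row in `hourly_aggregated`.__

We assume the `PULocationID` of missing data to be 0.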

In [187]:
Pick_aggregated=Pick_aggregated.na.fill(0,subset=["PULocationID"])

In [188]:
Pick_aggregated=Pick_aggregated.na.fill(0,subset=["Trips_count"])

#### Now let's select the columns we need: `timestamp`, `hourly_timestamp`, `locationID`, `Trips_count` and `UniqueColumn_G`
`UniqueColumn_G` is kept as the unique identifier so we can later join with the drop-off aggregated dataframe.

In [189]:
Pick_aggregated=Pick_aggregated.select(['timestamp','hourly_timestamp','locationID','Trips_count','UniqueColumn_G'])

In [190]:
Pick_aggregated.show()

+-------------------+----------------+----------+-----------+--------------------+
|          timestamp|hourly_timestamp|locationID|Trips_count|      UniqueColumn_G|
+-------------------+----------------+----------+-----------+--------------------+
|2021-01-01 00:00:00|2021-01-01 00:00|         1|          0|  2021-01-01 00:00_1|
|2021-01-01 00:00:00|2021-01-01 00:00|       165|          0|2021-01-01 00:00_165|
|2021-01-01 00:00:00|2021-01-01 00:00|       187|          0|2021-01-01 00:00_187|
|2021-01-01 00:00:00|2021-01-01 00:00|       195|          0|2021-01-01 00:00_195|
|2021-01-01 00:00:00|2021-01-01 00:00|       243|          0|2021-01-01 00:00_243|
|2021-01-01 00:00:00|2021-01-01 00:00|       251|          0|2021-01-01 00:00_251|
|2021-01-01 00:00:00|2021-01-01 00:00|        45|          3| 2021-01-01 00:00_45|
|2021-01-01 00:00:00|2021-01-01 00:00|        67|          0| 2021-01-01 00:00_67|
|2021-01-01 00:00:00|2021-01-01 00:00|        71|          0| 2021-01-01 00:00_71|
|202
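
The right join followed by the zero fill can be pictured as a dictionary lookup with a default. The sketch below is plain Python rather than Spark, and only reuses three of the keys shown in the output above. Names such as `hourly_counts` and `grid_keys` are illustrative.

```python
# Aggregated counts exist only for (hour, location) pairs that had trips
hourly_counts = {"2021-01-01 00:00_45": 3}

# The generated grid contains every (hour, location) key
grid_keys = [
    "2021-01-01 00:00_1",
    "2021-01-01 00:00_45",
    "2021-01-01 00:00_67",
]

# Right join on the grid: keep every grid key, use 0 where no count exists
joined = [(key, hourly_counts.get(key, 0)) for key in grid_keys]
for row in joined:
    print(row)
```

Every grid key survives, which is why the joined dataframe keeps all 2295120 rows and why hours without pickups show a `Trips_count` of 0.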

## Aggregating the drop-off trips per location ID on an hourly basis

- We will follow the same steps used to aggregate the pick-up trips above, but now for the drop-offs, so that we have an hourly count per location ID for both pick-up and drop-off trips.

In [191]:
df_new_final.select(countDistinct("DOLocationID")).show()

+----------------------------+
|count(DISTINCT DOLocationID)|
+----------------------------+
|                         261|
+----------------------------+

__We have 261 drop-off location IDs and 262 pick-up location IDs__

In [192]:
# extract the drop location id from the dataset and save it in list
Drop_LocationID = df_new_final.select(collect_set('DOLocationID').alias('DOLocationID')).first()['DOLocationID']

In [193]:
unique_list = []
 
for item in Pick_up_LocationID: 
    if item not in Drop_LocationID: 
        unique_list.append(item)

In [194]:
unique_list

[199]
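
So location 199 appears as a pick-up location but never as a drop-off location. The same comparison can be written with Python sets, which also makes it easy to check the opposite direction. The ID lists below are made up for illustration, apart from 199, which mirrors the result above.

```python
# Illustrative ID lists, not the full 262 / 261 IDs from the data
pickup_ids = [7, 45, 199, 256]
dropoff_ids = [7, 45, 256]

# IDs seen only as pick-up locations
only_pickup = sorted(set(pickup_ids) - set(dropoff_ids))

# IDs seen only as drop-off locations (not checked in the loop above)
only_dropoff = sorted(set(dropoff_ids) - set(pickup_ids))

print(only_pickup, only_dropoff)
```

Set membership tests are also constant-time on average, whereas `item not in some_list` scans the list. With only a few hundred IDs the difference is negligible here.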

#### Now we are going to aggregate the data into hourly drop-off counts per location ID.
We will convert the drop-off timestamp to the format `"yyyy-MM-dd HH:00"` so we can group the trips by hour.

In [196]:
Drop_df=df_new_final.withColumn("Drop_datetime_hourly", date_format(col("tpep_dropoff_datetime").cast("timestamp"), "yyyy-MM-dd HH:00"))

In [197]:
# Grouping and aggregating the dropped trips
drop_aggregated = Drop_df.groupby(['Drop_datetime_hourly','DOLocationID']).agg({'Trip_count':'count'})\
.withColumnRenamed("count(Trip_count)", "Drop_Trips_count")

In [198]:
drop_aggregated.show()

+--------------------+------------+----------------+
|Drop_datetime_hourly|DOLocationID|Drop_Trips_count|
+--------------------+------------+----------------+
|    2021-11-01 09:00|         174|               2|
|    2021-11-01 12:00|         158|              57|
|    2021-11-02 00:00|         215|               1|
|    2021-11-02 07:00|         138|              59|
|    2021-11-03 00:00|         235|               2|
|    2021-11-03 11:00|         255|               9|
|    2021-11-03 12:00|         126|               2|
|    2021-11-03 16:00|         130|               3|
|    2021-11-04 00:00|         257|               2|
|    2021-11-04 03:00|         113|               3|
|    2021-11-04 06:00|          28|               1|
|    2021-11-04 13:00|         261|              21|
|    2021-11-04 14:00|          13|              53|
|    2021-11-05 02:00|          43|               5|
|    2021-11-05 02:00|          26|               1|
|    2021-11-05 09:00|         263|           
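
The grouping step above truncates each drop-off time to the hour and counts trips per (hour, location) pair. A minimal plain-Python sketch of that logic, using three made-up trips (the timestamps and counts are illustrative, not real records):

```python
from collections import Counter
from datetime import datetime

# (drop-off time, DOLocationID) for a few hypothetical trips
trips = [
    (datetime(2021, 11, 1, 9, 5), 174),
    (datetime(2021, 11, 1, 9, 48), 174),
    (datetime(2021, 11, 1, 12, 30), 158),
]

# Truncate to the hour with the same format string idea, then count
drop_counts = Counter(
    (ts.strftime("%Y-%m-%d %H:00"), loc) for ts, loc in trips
)

for (hour, loc), n in sorted(drop_counts.items()):
    print(hour, loc, n)
```

Only pairs that actually occur get a count, which is why `drop_aggregated` has fewer rows than the full hour-by-location grid and needs the later right join and zero fill.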

__Create a unique column in `drop_aggregated` that contains the timestamp plus the location ID so we can join it to `Pick_aggregated`__

First, convert `DOLocationID` to string type.

In [201]:
drop_aggregated.count()

1398669

In [203]:
drop_aggregated=drop_aggregated.withColumn("DOLocationID_string", drop_aggregated["DOLocationID"].cast(StringType()))

In [204]:
# Create the unique column using `concat_ws` (each source column is selected once)
drop_aggregated=drop_aggregated.select(concat_ws('_',drop_aggregated.Drop_datetime_hourly,
                                   drop_aggregated.DOLocationID_string).alias("UniqueColumn"),"Drop_datetime_hourly","DOLocationID","Drop_Trips_count")

### Joining Pick_aggregated & drop_aggregated dataframes using the UniqueColumn

In [205]:
final_df=drop_aggregated.join(Pick_aggregated,drop_aggregated.UniqueColumn == Pick_aggregated.UniqueColumn_G,"right")

In [206]:
# Filling the null Drop_Trips_count with zero
final_df=final_df.na.fill(0,subset=["Drop_Trips_count"])

### Now we will select just the columns we want:

`timestamp`, `hourly_timestamp`, `locationID`, `Trips_count` (renamed to `Pick_Trips_count` in the next cell) and `Drop_Trips_count`

In [207]:
final_df=final_df.select(['timestamp','hourly_timestamp','locationID','Trips_count','Drop_Trips_count'])

In [208]:
final_df=final_df.withColumnRenamed('Trips_count','Pick_Trips_count')

In [209]:
final_df.show()

+-------------------+----------------+----------+----------------+----------------+
|          timestamp|hourly_timestamp|locationID|Pick_Trips_count|Drop_Trips_count|
+-------------------+----------------+----------+----------------+----------------+
|2021-01-01 00:00:00|2021-01-01 00:00|         1|               0|               0|
|2021-01-01 00:00:00|2021-01-01 00:00|       165|               0|               1|
|2021-01-01 00:00:00|2021-01-01 00:00|       187|               0|               0|
|2021-01-01 00:00:00|2021-01-01 00:00|       195|               0|               0|
|2021-01-01 00:00:00|2021-01-01 00:00|       243|               0|               1|
|2021-01-01 00:00:00|2021-01-01 00:00|       251|               0|               0|
|2021-01-01 00:00:00|2021-01-01 00:00|        45|               3|               1|
|2021-01-01 00:00:00|2021-01-01 00:00|        67|               0|               0|
|2021-01-01 00:00:00|2021-01-01 00:00|        71|               0|          

# Feature Engineering

The dataframe is now ready for modeling. The final step is to create some time features: year, month, day of month, hour and day of week.


In [210]:
# Create Year column
final_df=final_df.withColumn('Year',year(final_df['timestamp']))

In [211]:
# Create Month column
final_df=final_df.withColumn('Month',month(final_df['timestamp']))

In [212]:
# create day of month column
final_df=final_df.withColumn('DayOfMonth',dayofmonth(final_df['timestamp']))

In [213]:
# Create hour column
final_df=final_df.withColumn('Hour',hour(final_df['timestamp']))

In [214]:
# Create Day of week column (Spark convention: 1 = Sunday, ..., 7 = Saturday)
final_df=final_df.withColumn("dayofweek", dayofweek(col("timestamp")))

In [215]:
final_df.show()

+-------------------+----------------+----------+----------------+----------------+----+-----+----------+----+---------+
|          timestamp|hourly_timestamp|locationID|Pick_Trips_count|Drop_Trips_count|Year|Month|DayOfMonth|Hour|dayofweek|
+-------------------+----------------+----------+----------------+----------------+----+-----+----------+----+---------+
|2021-01-01 00:00:00|2021-01-01 00:00|         1|               0|               0|2021|    1|         1|   0|        6|
|2021-01-01 00:00:00|2021-01-01 00:00|       165|               0|               1|2021|    1|         1|   0|        6|
|2021-01-01 00:00:00|2021-01-01 00:00|       187|               0|               0|2021|    1|         1|   0|        6|
|2021-01-01 00:00:00|2021-01-01 00:00|       195|               0|               0|2021|    1|         1|   0|        6|
|2021-01-01 00:00:00|2021-01-01 00:00|       243|               0|               1|2021|    1|         1|   0|        6|
|2021-01-01 00:00:00|2021-01-01 
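
The `dayofweek` value of 6 for 2021-01-01 (a Friday) follows Spark's convention, where 1 is Sunday and 7 is Saturday. The plain-Python sketch below reproduces the five features for one timestamp. The helper `time_features` is hypothetical, and the conversion from Python's `isoweekday()` (Monday = 1, Sunday = 7) is the only non-obvious step.

```python
from datetime import datetime

def time_features(ts):
    """Return the same calendar features the notebook derives in Spark."""
    return {
        "Year": ts.year,
        "Month": ts.month,
        "DayOfMonth": ts.day,
        "Hour": ts.hour,
        # isoweekday(): Mon=1..Sun=7  ->  Spark dayofweek: Sun=1..Sat=7
        "dayofweek": ts.isoweekday() % 7 + 1,
    }

features = time_features(datetime(2021, 1, 1, 0, 0))
print(features)
```

Keeping this convention in mind matters when the features are later compared with values produced by other tools, for example pandas, where `dayofweek` starts at 0 for Monday.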

In [216]:
final_df.count()

2295120

### Save the final dataframe as parquet file in S3

__Now we will save this Spark dataframe to the S3 bucket inside the `EMR-project` folder__

In [217]:
final_df.coalesce(1).write.parquet("s3://taxi-project-thesis/EMR-project/final_2021")

In [218]:
# Load the final dataframe from S3
df_taxi=spark.read.parquet("s3://taxi-project-thesis/EMR-project/final_2021/part-00000-c3dd72a6-fb35-475b-8b46-032b980bc5b1-c000.snappy.parquet")

In [219]:
df_taxi.show(5)

+-------------------+----------------+----------+----------------+----------------+----+-----+----------+----+---------+
|          timestamp|hourly_timestamp|locationID|Pick_Trips_count|Drop_Trips_count|Year|Month|DayOfMonth|Hour|dayofweek|
+-------------------+----------------+----------+----------------+----------------+----+-----+----------+----+---------+
|2021-01-01 00:00:00|2021-01-01 00:00|         1|               0|               0|2021|    1|         1|   0|        6|
|2021-01-01 00:00:00|2021-01-01 00:00|       165|               0|               1|2021|    1|         1|   0|        6|
|2021-01-01 00:00:00|2021-01-01 00:00|       187|               0|               0|2021|    1|         1|   0|        6|
|2021-01-01 00:00:00|2021-01-01 00:00|       195|               0|               0|2021|    1|         1|   0|        6|
|2021-01-01 00:00:00|2021-01-01 00:00|       243|               0|               1|2021|    1|         1|   0|        6|
+-------------------+-----------

In [220]:
df_taxi.count()

2295120

DONE !