## Question 1. Install Spark and PySpark

* Install Spark
* Run PySpark
* Create a local Spark session
* Execute `spark.version`

What's the output?

In [55]:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
        .master('local[*]')
        .appName('test')
        .getOrCreate()
)
spark.version

'3.3.0'

## Question 2. HVFHV February 2021

Download the HVFHV data for February 2021:

```bash
https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2021-02.parquet
```

Read it with Spark using the same schema as we did 
in the lessons. We will use this dataset for all
the remaining questions.

Repartition it to 24 partitions and save it to
parquet.

What's the size of the folder with results (in MB)?

In [9]:
%%bash
mkdir -p data/raw
curl https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2021-02.parquet > data/raw/fhvhv_tripdata_2021-02.parquet

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  288M  100  288M    0     0  12.0M      0  0:00:23  0:00:23 --:--:-- 12.9M


In [10]:
df = spark.read.parquet('data/raw/fhvhv_tripdata_2021-02.parquet')
df.repartition(24).write.parquet('data/part/')



22/08/05 16:11:11 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers





In [11]:
!du -sh data/part

534M	data/part
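
As a cross-check on the `du` figure, the folder size can also be computed by summing file sizes in Python. This is a minimal sketch; `folder_size_mb` is a hypothetical helper name, not something from the lessons. Note that `du` reports disk usage (allocated blocks) by default, so its number can differ slightly from the sum of file sizes.

```python
from pathlib import Path


def folder_size_mb(path: str) -> float:
    """Sum the sizes of all regular files under `path`, in MiB (1 MiB = 1024 * 1024 bytes)."""
    total_bytes = sum(
        p.stat().st_size
        for p in Path(path).rglob('*')
        if p.is_file()
    )
    return total_bytes / (1024 * 1024)
```

Calling `folder_size_mb('data/part')` after the write completes should give a value close to what `du -sh` prints.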


## Question 3. Count records 

How many taxi trips were there on February 15?

Consider only trips that started on February 15.

In [12]:
df.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- originating_base_num: string (nullable = true)
 |-- request_datetime: timestamp (nullable = true)
 |-- on_scene_datetime: timestamp (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- trip_time: long (nullable = true)
 |-- base_passenger_fare: double (nullable = true)
 |-- tolls: double (nullable = true)
 |-- bcf: double (nullable = true)
 |-- sales_tax: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)
 |-- tips: double (nullable = true)
 |-- driver_pay: double (nullable = true)
 |-- shared_request_flag: string (nullable = true)
 |-- shared_match_flag: string (nullable = true)
 |-- access_a_ride_flag: string (nul

In [34]:
df.createOrReplaceTempView('fhvhv')

In [21]:
spark.sql('''
select
    count(1) as count
from fhvhv
where
    pickup_datetime >= '2021-02-15' and pickup_datetime < '2021-02-16'
''').show()

+------+
| count|
+------+
|432118|
+------+
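
The `where` clause above uses a half-open interval: midnight on February 15 is included and midnight on February 16 is excluded, so every pickup timestamp on the 15th matches regardless of its time of day. The same logic in plain Python, as an illustrative sketch (`started_on` is a hypothetical helper, not part of the solution):

```python
from datetime import datetime, timedelta


def started_on(pickup: datetime, day: datetime) -> bool:
    """True if `pickup` falls in the half-open interval [day, day + 1 day)."""
    return day <= pickup < day + timedelta(days=1)


feb_15 = datetime(2021, 2, 15)

# Both ends of the day are handled without listing a "last" timestamp.
first_moment = started_on(datetime(2021, 2, 15, 0, 0, 0), feb_15)
last_moment = started_on(datetime(2021, 2, 15, 23, 59, 59, 999999), feb_15)
next_midnight = started_on(datetime(2021, 2, 16, 0, 0, 0), feb_15)
```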



## Question 4. Longest trip for each day

Now calculate the duration for each trip.

On which day did the longest trip start?

In [25]:
spark.sql('''
select
    date_trunc('day', pickup_datetime) as date,
    max(trip_time) as longest_trip_time
from fhvhv
group by 1
order by longest_trip_time desc
limit 1
''').show()



+-------------------+-----------------+
|               date|longest_trip_time|
+-------------------+-----------------+
|2021-02-11 00:00:00|            75540|
+-------------------+-----------------+
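
The query above relies on the dataset's `trip_time` column instead of computing the duration itself. If that column were not available, the duration could be derived from `dropoff_datetime - pickup_datetime`. Below is a plain-Python sketch of the same "group by pickup day, take the maximum" logic; the function name and the sample rows are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime


def longest_trip_per_day(trips):
    """Map each pickup date to the longest trip duration (in seconds) that started on it.

    `trips` is an iterable of (pickup, dropoff) datetime pairs.
    """
    longest = {}
    for pickup, dropoff in trips:
        duration = (dropoff - pickup).total_seconds()
        day = pickup.date()  # a trip counts toward the day it started
        if day not in longest or duration > longest[day]:
            longest[day] = duration
    return longest


# Made-up rows for illustration only.
sample_trips = [
    (datetime(2021, 2, 1, 8, 0), datetime(2021, 2, 1, 8, 30)),
    (datetime(2021, 2, 1, 23, 50), datetime(2021, 2, 2, 1, 50)),  # crosses midnight
    (datetime(2021, 2, 2, 9, 0), datetime(2021, 2, 2, 9, 45)),
]

per_day = longest_trip_per_day(sample_trips)
longest_day = max(per_day, key=per_day.get)
```

A trip that crosses midnight is attributed to its pickup day, which matches grouping by `date_trunc('day', pickup_datetime)` in the query.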




## Question 5. Most frequent `dispatching_base_num`

Now find the most frequently occurring `dispatching_base_num` 
in this dataset.

How many stages does this Spark job have?

> Note: the answer may depend on how you write the query,
> so there are multiple correct answers. 
> Select the one you have.

In [26]:
spark.sql('''
select
    dispatching_base_num,
    count(1) as freqs
from fhvhv
group by 1
order by freqs desc
limit 1
''').show()


+--------------------+-------+
|dispatching_base_num|  freqs|
+--------------------+-------+
|              B02510|3233664|
+--------------------+-------+



Ans: 2 stages

## Question 6. Most common locations pair

Find the most common pickup-dropoff pair. 

For example:

"Jamaica Bay / Clinton East"

Enter two zone names separated by a slash.

If any of the zone names are unknown (missing), use "Unknown". For example, "Unknown / Clinton East". 

In [33]:
df_zones = spark.read.csv('../../data/taxi+_zone_lookup.csv', header=True)
df_zones.printSchema()
df_zones.createOrReplaceTempView('zones')

root
 |-- locationid: string (nullable = true)
 |-- borough: string (nullable = true)
 |-- zone: string (nullable = true)
 |-- service_zone: string (nullable = true)



In [40]:
spark.sql('''
select
    concat(
        ifnull(pzone.zone, 'Unknown'), ' / ', ifnull(dzone.zone, 'Unknown')
    ) as pair,
    count(1) as freqs
from fhvhv
left join zones pzone on pzone.locationid = fhvhv.PULocationID
left join zones dzone on dzone.locationid = fhvhv.DOLocationID
group by 1
order by freqs desc
limit 1
''').collect()


[Row(pair='East New York / East New York', freqs=45041)]

## Bonus question. Join type

(not graded) 

For finding the answer to Q6, you'll need to perform a join.

What type of join is it?

And how many stages does your Spark job have?

Ans: left join, 2 stages

In [41]:
spark.sparkContext.stop()