# <ins>Individual Assignment I</ins>
Handed in by Eugen Wettstein

## <ins>1. PySpark environment setup</ins>

In [301]:
import findspark
findspark.init()

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

## <ins>2. Data source and Spark data abstraction (DataFrame) setup</ins>

In [302]:
ridesDF = spark.read \
                 .option("inferSchema", "true") \
                 .option("header", "true") \
                 .csv("cab_rides.csv")

## <ins>3. Data set metadata analysis</ins>
### <ins>A. Display schema and size of the DataFrame</ins>

In [303]:
from IPython.display import display, Markdown

ridesDF.printSchema()
display(Markdown("This DataFrame has **%d rows**." % ridesDF.count()))

root
 |-- distance: double (nullable = true)
 |-- cab_type: string (nullable = true)
 |-- time_stamp: timestamp (nullable = true)
 |-- destination: string (nullable = true)
 |-- source: string (nullable = true)
 |-- price: double (nullable = true)
 |-- surge_multiplier: double (nullable = true)
 |-- id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- name: string (nullable = true)



This DataFrame has **637976 rows**.

### <ins>B. Get one or multiple random samples from the data set</ins>

In [304]:
ridesDF.cache() # optimization to make the processing faster
ridesDF.sample(False, 0.1).take(2)

[Row(distance=1.08, cab_type='Lyft', time_stamp=datetime.datetime(2018, 12, 2, 20, 53, 4, 677000), destination='Northeastern University', source='Back Bay', price=16.5, surge_multiplier=1.0, id='474d6376-bc59-4ec9-bf57-4e6d6faeb165', product_id='lyft_lux', name='Lux Black'),
 Row(distance=3.24, cab_type='Lyft', time_stamp=datetime.datetime(2018, 12, 2, 19, 23, 7, 499000), destination='Northeastern University', source='North Station', price=11.0, surge_multiplier=1.0, id='174b960d-58f1-4dfd-8672-8b43f13726a7', product_id='lyft', name='Lyft')]

### <ins>C. Data entities, metrics and dimensions</ins>

I've identified the following elements:

* **Entities:** Rides (the main entity being measured, i.e. the facts), cab types (dimension), city locations (dimension)
* **Metrics:** distance, price, surge_multiplier
* **Dimensions:** time_stamp, cab_type, source, destination, product_id, name (with *id* as the unique identifier of each ride)
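
As a rough cross-check of this split, the sketch below applies one common rule of thumb: numeric columns are candidate metrics, everything else is a dimension. The rule itself, the `NUMERIC_TYPES` set and the helper name `split_metrics_dimensions` are assumptions made for illustration; only the column names and types come from the schema printed above.

```python
# Column types copied from the printSchema() output above.
SCHEMA = {
    "distance": "double",
    "cab_type": "string",
    "time_stamp": "timestamp",
    "destination": "string",
    "source": "string",
    "price": "double",
    "surge_multiplier": "double",
    "id": "string",
    "product_id": "string",
    "name": "string",
}

# Assumption: these type names are treated as numeric.
NUMERIC_TYPES = {"double", "float", "int", "bigint", "decimal"}


def split_metrics_dimensions(schema):
    """Hypothetical helper: numeric columns become metrics, the rest dimensions."""
    metrics = [c for c, t in schema.items() if t in NUMERIC_TYPES]
    dimensions = [c for c, t in schema.items() if t not in NUMERIC_TYPES]
    return metrics, dimensions


metrics, dimensions = split_metrics_dimensions(SCHEMA)
print("Metrics:   ", metrics)
print("Dimensions:", dimensions)
```

A type-based rule is only a starting point: a numeric code column would be misclassified, so the result still needs a manual review.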

### <ins>D. Column categorization</ins>

The following could be a potential column categorization:

* **Timing related columns:** *time_stamp*
* **Drive related columns:** *distance*, *cab_type*, *source*, *destination*, *price* and *surge_multiplier*
* **Company car related columns:** *id*, *product_id*, *name*

## <ins>4. Basic profiling of the column groups to better understand our data set</ins>
### <ins>A. Timing related columns basic profiling</ins>

In [366]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit, year, month, dayofmonth, hour, date_format, minute

display(Markdown("**Summary of column time_stamp**:"))
ridesDF.select(year(col('time_stamp')).alias('year'),month(col('time_stamp')).alias('month')\
               ,dayofmonth(col('time_stamp')).alias('day'),hour(col('time_stamp')).alias('hour')).summary().show()

# Store the distinct counts under their own names so the imported
# year/month/hour functions are not shadowed.
n_years = ridesDF.select(year(col('time_stamp'))).distinct().count()
n_months = ridesDF.select(month(col('time_stamp'))).distinct().count()
n_days = ridesDF.select(dayofmonth(col('time_stamp'))).distinct().count()
n_hours = ridesDF.select(hour(col('time_stamp'))).distinct().count()

display(Markdown("**Checking number of distinct values in column time_stamp**:"))

display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("year", "month", "day", "hour", \
       "%d  occurrences" % n_years,\
       "%d  occurrences" % n_months,\
       "%d  occurrences" % n_days,\
       "%d  occurrences" % n_hours)))

display(Markdown("**Checking null values**:"))

ridesDF.select(count(when(col('time_stamp').isNull(), 1)).alias('null_values')).show()

display(Markdown("**Checking the distinct days in both months on which a ride took place, and the number of rides per weekday**:"))

ridesDF.select(dayofmonth(col('time_stamp')).alias('day')).distinct().orderBy('day').show()

ridesDF.select(date_format('time_stamp','E').alias('day')).groupby('day').count().show()
       

**Summary of column time_stamp**:

+-------+--------------------+------------------+------------------+------------------+
|summary|                year|             month|               day|              hour|
+-------+--------------------+------------------+------------------+------------------+
|  count|              637976|            637976|            637976|            637976|
|   mean|              2018.0|11.589251006307448|17.762665053230844| 11.51422310557137|
| stddev|1.135600535413938...|0.4919701589039917|10.002298741735492|6.9593165878176775|
|    min|                2018|                11|                 1|                 0|
|    25%|                2018|                11|                13|                 5|
|    50%|                2018|                12|                17|                12|
|    75%|                2018|                12|                28|                17|
|    max|                2018|                12|                30|                23|
+-------+--------------------+------------------+------------------+------------------+

**Checking number of distinct values in column time_stamp**:


| year | month | day | hour |
|----|----|----|----|
| 1  occurrences | 2  occurrences | 16  occurrences | 24  occurrences |


**Checking null values**:

+-----------+
|null_values|
+-----------+
|          0|
+-----------+



**Checking the distinct days in both months on which a ride took place, and the number of rides per weekday**:

+---+
|day|
+---+
|  1|
|  2|
|  3|
|  4|
| 10|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 26|
| 27|
| 28|
| 29|
| 30|
+---+

+---+------+
|day| count|
+---+------+
|Sun| 82681|
|Mon|112880|
|Thu| 92147|
|Sat| 83003|
|Wed| 65851|
|Tue|118508|
|Fri| 82906|
+---+------+
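
The weekday count above relies on `date_format` with the `E` pattern. As a plain-Python illustration of the same idea, the sketch below maps timestamps to abbreviated weekday names and counts them. The first two timestamps are taken from the sample rows shown earlier; the third one, the `DAY_NAMES` list and the helper name `weekday_counts` are assumptions added for illustration.

```python
from collections import Counter
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_counts(timestamps):
    """Hypothetical helper: count how many timestamps fall on each weekday."""
    return Counter(DAY_NAMES[ts.weekday()] for ts in timestamps)


sample = [
    datetime(2018, 12, 2, 20, 53, 4),   # from the sample rows above
    datetime(2018, 12, 2, 19, 23, 7),   # from the sample rows above
    datetime(2018, 11, 26, 8, 0, 0),    # hypothetical extra timestamp
]

counts = weekday_counts(sample)
for day, n in counts.items():
    print(day, n)
```

Using a fixed name list instead of `strftime("%a")` keeps the labels independent of the machine's locale.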



### <ins>B. Drive related columns basic profiling</ins>

In [368]:
display(Markdown("**Summary of columns distance, cab_type, source, destination, price and surge_multiplier**:"))
ridesDF.select(col('distance'),col('cab_type'),col('source')\
               ,col('destination'),col('price'),col('surge_multiplier')).summary().show()

display(Markdown("**Checking for nulls on columns distance, cab_type, source, destination, price and surge_multiplier**:"))
ridesDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["distance", "cab_type", "source", "destination", "price", "surge_multiplier"]]).show()

display(Markdown("**Checking number of distinct values in columns distance, cab_type, source, destination, price and surge_multiplier**:"))
ridesDF.select([countDistinct(c).alias(c) for c in ["distance", "cab_type", "source", "destination", "price", "surge_multiplier"]]).show()

sourceDF = ridesDF.groupBy("source").agg(count(lit(1)).alias("Total"))
destinationDF   = ridesDF.groupBy("destination").agg(count(lit(1)).alias("Total"))
surge_multiplierDF    = ridesDF.groupBy("surge_multiplier").agg(count(lit(1)).alias("Total"))

leastFreqSource    = sourceDF.orderBy(col("Total").asc()).first()
mostFreqSource     = sourceDF.orderBy(col("Total").desc()).first()
leastFreqDestination      = destinationDF.orderBy(col("Total").asc()).first()
mostFreqDestination      = destinationDF.orderBy(col("Total").desc()).first()
leastFreqSurge_multiplier      = surge_multiplierDF.orderBy(col("Total").asc()).first()
mostFreqSurge_multiplier        = surge_multiplierDF.orderBy(col("Total").desc()).first()

display(Markdown("**Checking frequency of distinct values in columns source, destination and surge_multiplier**:"))
display(Markdown("""
| %s | %s | %s | %s | %s | %s |
|----|----|----|----|----|----|
| %s | %s | %s | %s | %s | %s |
""" % ("leastFreqSource", "mostFreqSource", "leastFreqDestination", "mostFreqDestination",'leastFreqSurge_multiplier','mostFreqSurge_multiplier', \
       "%s (%s occurrences)" % (leastFreqSource["source"], leastFreqSource["Total"]), \
       "%s (%s occurrences)" % (mostFreqSource["source"], mostFreqSource["Total"]), \
       "%s (%s occurrences)" % (leastFreqDestination["destination"], leastFreqDestination["Total"]), \
       "%s (%s occurrences)" % (mostFreqDestination["destination"], mostFreqDestination["Total"]),\
       "%s (%s occurrences)" % (leastFreqSurge_multiplier["surge_multiplier"], leastFreqSurge_multiplier["Total"]), \
       "%s (%s occurrences)" % (mostFreqSurge_multiplier["surge_multiplier"], mostFreqSurge_multiplier["Total"]))))


**Summary of columns distance, cab_type, source, destination, price and surge_multiplier**:

+-------+------------------+--------+--------+-----------+-----------------+-------------------+
|summary|          distance|cab_type|  source|destination|            price|   surge_multiplier|
+-------+------------------+--------+--------+-----------+-----------------+-------------------+
|  count|            637976|  637976|  637976|     637976|           637976|             637976|
|   mean| 2.189261100730507|    null|    null|       null|16.54512549061407| 1.0150675730748493|
| stddev|1.1354130181861846|    null|    null|       null|9.324358581411598|0.09542184282423667|
|    min|              0.02|    Lyft|Back Bay|   Back Bay|              2.5|                1.0|
|    25%|              1.27|    null|    null|       null|              9.0|                1.0|
|    50%|              2.16|    null|    null|       null|             13.5|                1.0|
|    75%|              2.93|    null|    null|       null|             22.5|                1.0|
|    max|              7.86|  

**Checking for nulls on columns distance, cab_type, source, destination, price and surge_multiplier**:

+--------+--------+------+-----------+-----+----------------+
|distance|cab_type|source|destination|price|surge_multiplier|
+--------+--------+------+-----------+-----+----------------+
|       0|       0|     0|          0|    0|               0|
+--------+--------+------+-----------+-----+----------------+



**Checking number of distinct values in columns distance, cab_type, source, destination, price and surge_multiplier**:

+--------+--------+------+-----------+-----+----------------+
|distance|cab_type|source|destination|price|surge_multiplier|
+--------+--------+------+-----------+-----+----------------+
|     549|       2|    12|         12|  147|               7|
+--------+--------+------+-----------+-----+----------------+



**Checking frequency of distinct values in columns source, destination and surge_multiplier**:


| leastFreqSource | mostFreqSource | leastFreqDestination | mostFreqDestination | leastFreqSurge_multiplier | mostFreqSurge_multiplier |
|----|----|----|----|----|----|
| North Station (52576 occurrences) | Financial District (54197 occurrences) | North Station (52577 occurrences) | Financial District (54192 occurrences) | 3.0 (12 occurrences) | 1.0 (617001 occurrences) |
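
The least and most frequent values above come from a group-by count followed by an ascending and a descending sort. A minimal plain-Python sketch of that pattern is shown below; the helper name `least_and_most_frequent` and the toy list of surge multipliers are hypothetical and not taken from the data set.

```python
from collections import Counter


def least_and_most_frequent(values):
    """Hypothetical helper: return ((value, count) least, (value, count) most)."""
    ordered = Counter(values).most_common()  # sorted by count, descending
    return ordered[-1], ordered[0]


# Toy data for illustration only.
surge_multipliers = [1.0] * 5 + [1.25] * 2 + [3.0]

least, most = least_and_most_frequent(surge_multipliers)
print("least frequent: %s (%d occurrences)" % least)
print("most frequent:  %s (%d occurrences)" % most)
```

When several values share the same count, which one is reported depends on the sort order, which is also true for `orderBy(...).first()` in the cell above.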


### <ins>C. Company car related columns basic profiling</ins>

In [372]:
display(Markdown("**Summary of columns id, product_id and name**:"))
ridesDF.select(col('id'),col('product_id'),col('name')).summary().show()

display(Markdown("**Checking for nulls on columns id, product_id and name**:"))
ridesDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["id", "product_id", "name"]]).show()

display(Markdown("**Checking number of distinct values in columns id, product_id and name**:"))
ridesDF.select([countDistinct(c).alias(c) for c in ["id", "product_id", "name"]]).show()

**Summary of columns id, product_id and name**:

+-------+--------------------+--------------------+------+
|summary|                  id|          product_id|  name|
+-------+--------------------+--------------------+------+
|  count|              637976|              637976|637976|
|   mean|                null|                null|  null|
| stddev|                null|                null|  null|
|    min|00005b8c-5647-410...|55c66225-fbe7-4fd...| Black|
|    25%|                null|                null|  null|
|    50%|                null|                null|  null|
|    75%|                null|                null|  null|
|    max|ffffecd1-49b1-498...|        lyft_premier|   WAV|
+-------+--------------------+--------------------+------+



**Checking for nulls on columns id, product_id and name**:

+---+----------+----+
| id|product_id|name|
+---+----------+----+
|  0|         0|   0|
+---+----------+----+



**Checking number of distinct values in columns id, product_id and name**:

+------+----------+----+
|    id|product_id|name|
+------+----------+----+
|637976|        12|  12|
+------+----------+----+
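
The distinct count of *id* equals the row count (637976), which suggests that *id* identifies each ride uniquely. The sketch below shows that check in plain Python on the two sample rows shown earlier; the helper name `is_unique_key` is an assumption made for illustration.

```python
def is_unique_key(rows, column):
    """Hypothetical helper: True if no value of `column` appears twice."""
    values = [row[column] for row in rows]
    return len(set(values)) == len(values)


# The two sample rows from section 3.B, reduced to the relevant columns.
rows = [
    {"id": "474d6376-bc59-4ec9-bf57-4e6d6faeb165", "product_id": "lyft_lux"},
    {"id": "174b960d-58f1-4dfd-8672-8b43f13726a7", "product_id": "lyft"},
]

print("id is unique:", is_unique_key(rows, "id"))

# Appending a row with a repeated id makes the check fail.
rows_with_duplicate = rows + [rows[0]]
print("id is unique:", is_unique_key(rows_with_duplicate, "id"))
```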



## <ins>5. Answer some business questions to improve service</ins>


In [327]:
# Dividing the data for each provider separately 
LyftDF = ridesDF.where(col("cab_type") == 'Lyft')
UberDF = ridesDF.where(col("cab_type") == 'Uber')

LyftDF.cache()
UberDF.cache()

DataFrame[distance: double, cab_type: string, time_stamp: timestamp, destination: string, source: string, price: double, surge_multiplier: double, id: string, product_id: string, name: string]

### <ins>A. Ratio of rides throughout the day</ins>

In [339]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, avg, col, countDistinct, desc, first, lit, year, month, dayofmonth, hour, date_format, minute

# To get statistics of the rides throughout the day we slightly modify the data.
# The data has already been divided per provider in the previous cell.
#   1. Bin the time_stamp and distance columns
#   2. Compute the statistics of the data


#   1. Binning the time_stamp and distance columns for both services
UberDF = UberDF.withColumn("Time_of_day", when(hour(col("time_stamp"))<6,"Night_ride")\
                              .when((hour(col("time_stamp"))>=6) & (hour(col("time_stamp"))<12),'Morning_ride')\
                              .when((hour(col("time_stamp"))>=12) & (hour(col("time_stamp"))<18),"Afternoon_ride")\
                              .otherwise("Evening_ride"))\
.withColumn("Range", when(col("distance")<2.5,"Short_distance")\
                              .when((col("distance")>=2.5) & (col("distance")<5),'Medium_distance')\
                              .otherwise("Long_distance"))
UberDF.cache() # optimization to make the processing faster

LyftDF = LyftDF.withColumn("Time_of_day", when(hour(col("time_stamp"))<6,"Night_ride")\
                              .when((hour(col("time_stamp"))>=6) & (hour(col("time_stamp"))<12),'Morning_ride')\
                              .when((hour(col("time_stamp"))>=12) & (hour(col("time_stamp"))<18),"Afternoon_ride")\
                              .otherwise("Evening_ride"))\
.withColumn("Range", when(col("distance")<2.5,"Short_distance")\
                              .when((col("distance")>=2.5) & (col("distance")<5),'Medium_distance')\
                              .otherwise("Long_distance"))
LyftDF.cache() # optimization to make the processing faster

#   2. Getting the statistics of the data

display(Markdown("**Uber's rides for the months November and December**:"))
# Count for each month for Uber
UberNovTotal = UberDF.where(month(col("time_stamp"))==11).count()
UberDecTotal = UberDF.where(month(col("time_stamp"))==12).count()

UberDF.where(month(col("time_stamp"))==11)\
                   .select(month(col("time_stamp")).alias('month'),"Time_of_day",'price','surge_multiplier')\
                   .groupby('month',"Time_of_day")\
                   .agg(count("month").alias("rides"),\
                        avg('price'),avg('surge_multiplier'),\
                        (count("month")/UberNovTotal*100).alias("Ratio"))\
                   .orderBy(desc('rides')).show()

UberDF.where(month(col("time_stamp"))==12)\
                   .select(month(col("time_stamp")).alias('month'),"Time_of_day",'price','surge_multiplier')\
                   .groupby('month',"Time_of_day")\
                   .agg(count("month").alias("rides"),\
                        avg('price'), avg('surge_multiplier'),\
                        (count("month")/UberDecTotal*100).alias("Ratio"))\
                   .orderBy(desc('rides')).show()

display(Markdown("**Lyft's rides for the months November and December**:"))
# Count for each month for Lyft
LyftNovTotal = LyftDF.where(month(col("time_stamp"))==11).count()
LyftDecTotal = LyftDF.where(month(col("time_stamp"))==12).count()

LyftDF.where(month(col("time_stamp"))==11)\
                   .select(month(col("time_stamp")).alias('month'),"Time_of_day",'price','surge_multiplier')\
                   .groupby('month',"Time_of_day")\
                   .agg(count("month").alias("rides"),\
                        avg('price'), avg('surge_multiplier'),\
                        (count("month")/LyftNovTotal*100).alias("Ratio"))\
                   .orderBy(desc('rides')).show()

LyftDF.where(month(col("time_stamp"))==12)\
                   .select(month(col("time_stamp")).alias('month'),"Time_of_day",'price','surge_multiplier')\
                   .groupby('month',"Time_of_day")\
                   .agg(count("month").alias("rides"),\
                        avg('price'), avg('surge_multiplier'),\
                        (count("month")/LyftDecTotal*100).alias("Ratio"))\
                   .orderBy(desc('rides')).show()

**Uber's rides for the months November and December**:

+-----+--------------+-----+------------------+---------------------+------------------+
|month|   Time_of_day|rides|        avg(price)|avg(surge_multiplier)|             Ratio|
+-----+--------------+-----+------------------+---------------------+------------------+
|   11|Afternoon_ride|38802|15.739832998299057|                  1.0|28.552296575372704|
|   11|  Evening_ride|36951|15.877702903845634|                  1.0|27.190245625395516|
|   11|    Night_ride|33140|15.795368135184068|                  1.0|24.385936511206936|
|   11|  Morning_ride|27005| 15.82234771338641|                  1.0|19.871521288024844|
+-----+--------------+-----+------------------+---------------------+------------------+

+-----+--------------+-----+------------------+---------------------+------------------+
|month|   Time_of_day|rides|        avg(price)|avg(surge_multiplier)|             Ratio|
+-----+--------------+-----+------------------+---------------------+------------------+
|   12|    Night_rid

**Lyft's rides for the months November and December**:

+-----+--------------+-----+------------------+---------------------+------------------+
|month|   Time_of_day|rides|        avg(price)|avg(surge_multiplier)|             Ratio|
+-----+--------------+-----+------------------+---------------------+------------------+
|   11|Afternoon_ride|35914|17.344922592860723|   1.0321740825304895| 28.46928260007927|
|   11|  Evening_ride|35020|17.346330668189605|   1.0310322672758423|27.760602457391993|
|   11|    Night_ride|30260|17.350132187706542|   1.0314441506939855|23.987316686484345|
|   11|  Morning_ride|24956| 17.22771277448309|   1.0312049206603622|19.782798256044394|
+-----+--------------+-----+------------------+---------------------+------------------+

+-----+--------------+-----+------------------+---------------------+------------------+
|month|   Time_of_day|rides|        avg(price)|avg(surge_multiplier)|             Ratio|
+-----+--------------+-----+------------------+---------------------+------------------+
|   12|    Night_rid
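
The binning thresholds and the ratio used in this section can be restated in plain Python, which makes the boundaries easy to check. The function names `time_of_day`, `distance_range` and `ratios` are assumptions made for illustration; the thresholds mirror the `when(...)` chains in the cell above, and the counts are the Uber figures for November from the first output table.

```python
def time_of_day(hour):
    """Mirror of the Time_of_day binning: hours 0-5, 6-11, 12-17, 18-23."""
    if hour < 6:
        return "Night_ride"
    if hour < 12:
        return "Morning_ride"
    if hour < 18:
        return "Afternoon_ride"
    return "Evening_ride"


def distance_range(distance):
    """Mirror of the Range binning: below 2.5, 2.5 to below 5, 5 and above."""
    if distance < 2.5:
        return "Short_distance"
    if distance < 5:
        return "Medium_distance"
    return "Long_distance"


def ratios(counts):
    """Share of each bin in percent of the total number of rides."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Uber ride counts for November, copied from the output table above.
uber_november = {
    "Afternoon_ride": 38802,
    "Evening_ride": 36951,
    "Night_ride": 33140,
    "Morning_ride": 27005,
}

uber_november_ratios = ratios(uber_november)
for label, share in uber_november_ratios.items():
    print("%-15s %.2f%%" % (label, share))
```

Because every ride falls into exactly one bin, the ratios of one month and provider add up to 100.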

### <ins>B. Number of rides per Hour</ins>

In [394]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, avg, month, col, hour, desc

display(Markdown("**Number of rides provided by Uber for each hour in November and December**:"))

ridesDF.where((month(col("time_stamp"))==11) & (col('cab_type') == 'Uber'))\
                   .select(month(col("time_stamp")).alias('month'),hour(col("time_stamp")).alias('hour'),'cab_type')\
                   .groupby('month','hour','cab_type')\
                   .agg(count("*").alias("rides"))\
                   .orderBy(desc('rides')).show(24)

ridesDF.where((month(col("time_stamp"))==12) & (col('cab_type') == 'Uber'))\
                   .select(month(col("time_stamp")).alias('month'),hour(col("time_stamp")).alias('hour'),'cab_type')\
                   .groupby('month','hour','cab_type')\
                   .agg(count("*").alias("rides"))\
                   .orderBy(desc('rides')).show(24)

display(Markdown("**Number of rides provided by Lyft for each hour in November and December**:"))

ridesDF.where((month(col("time_stamp"))==11) & (col('cab_type') == 'Lyft'))\
                   .select(month(col("time_stamp")).alias('month'),hour(col("time_stamp")).alias('hour'),'cab_type')\
                   .groupby('month','hour','cab_type')\
                   .agg(count("*").alias("rides"))\
                   .orderBy(desc('rides')).show(24)

ridesDF.where((month(col("time_stamp"))==12) & (col('cab_type') == 'Lyft'))\
                   .select(month(col("time_stamp")).alias('month'),hour(col("time_stamp")).alias('hour'),'cab_type')\
                   .groupby('month','hour','cab_type')\
                   .agg(count("*").alias("rides"))\
                   .orderBy(desc('rides')).show(24)


**Number of rides provided by Uber for each hour in November and December**:

+-----+----+--------+-----+
|month|hour|cab_type|rides|
+-----+----+--------+-----+
|   11|   1|    Uber| 6936|
|   11|  23|    Uber| 6764|
|   11|  15|    Uber| 6602|
|   11|  12|    Uber| 6582|
|   11|  19|    Uber| 6536|
|   11|  17|    Uber| 6491|
|   11|  11|    Uber| 6476|
|   11|   0|    Uber| 6421|
|   11|  16|    Uber| 6412|
|   11|  14|    Uber| 6372|
|   11|  13|    Uber| 6343|
|   11|  18|    Uber| 6290|
|   11|  22|    Uber| 6013|
|   11|  10|    Uber| 5790|
|   11|  21|    Uber| 5685|
|   11|  20|    Uber| 5663|
|   11|   2|    Uber| 5659|
|   11|   5|    Uber| 4778|
|   11|   3|    Uber| 4674|
|   11|   4|    Uber| 4672|
|   11|   7|    Uber| 4203|
|   11|   8|    Uber| 3944|
|   11|   9|    Uber| 3607|
|   11|   6|    Uber| 2985|
+-----+----+--------+-----+

+-----+----+--------+-----+
|month|hour|cab_type|rides|
+-----+----+--------+-----+
|   12|   3|    Uber| 9101|
|   12|   7|    Uber| 9009|
|   12|   0|    Uber| 8930|
|   12|   6|    Uber| 8855|
|   12|   5|    Ube

**Number of rides provided by Lyft for each hour in November and December**:

+-----+----+--------+-----+
|month|hour|cab_type|rides|
+-----+----+--------+-----+
|   11|   1|    Lyft| 6609|
|   11|  23|    Lyft| 6550|
|   11|  18|    Lyft| 6234|
|   11|  13|    Lyft| 6127|
|   11|  14|    Lyft| 6110|
|   11|  16|    Lyft| 6041|
|   11|  11|    Lyft| 5966|
|   11|  17|    Lyft| 5950|
|   11|   0|    Lyft| 5948|
|   11|  19|    Lyft| 5927|
|   11|  12|    Lyft| 5844|
|   11|  15|    Lyft| 5842|
|   11|  22|    Lyft| 5736|
|   11|  20|    Lyft| 5393|
|   11|  10|    Lyft| 5344|
|   11|  21|    Lyft| 5180|
|   11|   2|    Lyft| 4983|
|   11|   5|    Lyft| 4248|
|   11|   4|    Lyft| 4242|
|   11|   3|    Lyft| 4230|
|   11|   7|    Lyft| 4020|
|   11|   8|    Lyft| 3647|
|   11|   9|    Lyft| 3203|
|   11|   6|    Lyft| 2776|
+-----+----+--------+-----+

+-----+----+--------+-----+
|month|hour|cab_type|rides|
+-----+----+--------+-----+
|   12|   6|    Lyft| 8379|
|   12|   3|    Lyft| 8318|
|   12|   5|    Lyft| 8254|
|   12|   0|    Lyft| 8056|
|   12|   1|    Lyf

### <ins>C. Most frequent routes per provider</ins>

In [340]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, avg, month, col, desc

display(Markdown("**Lyft's top routes in November and December**:"))
LyftDF.where(month(col("time_stamp"))==11)\
                   .select(month(col("time_stamp")).alias('month'),'cab_type','Range','source','destination')\
                   .groupby('month','cab_type','Range','source','destination')\
                   .agg(count("cab_type").alias("rides")) \
                   .orderBy(desc('rides')).show(5)

LyftDF.where(month(col("time_stamp"))==12)\
                   .select(month(col("time_stamp")).alias('month'),'cab_type','Range','source','destination')\
                   .groupby('month','cab_type','Range','source','destination')\
                   .agg(count("cab_type").alias("rides")) \
                   .orderBy(desc('rides')).show(5)


display(Markdown("**Uber's top routes in November and December**:"))
UberDF.where(month(col("time_stamp"))==11)\
                   .select(month(col("time_stamp")).alias('month'),'cab_type','Range','source','destination')\
                   .groupby('month','cab_type','Range','source','destination')\
                   .agg(count("cab_type").alias("rides")) \
                   .orderBy(desc('rides')).show(5)

UberDF.where(month(col("time_stamp"))==12)\
                   .select(month(col("time_stamp")).alias('month'),'cab_type','Range','source','destination')\
                   .groupby('month','cab_type','Range','source','destination')\
                   .agg(count("cab_type").alias("rides")) \
                   .orderBy(desc('rides')).show(5)

display(Markdown("**Top routes in November and December**:"))
ridesDF.where(month(col("time_stamp"))==11)\
                   .select(month(col("time_stamp")).alias('month'),'source','cab_type','destination')\
                   .groupby('month','source','destination')\
                   .agg(count("cab_type").alias("rides")) \
                   .orderBy(desc('rides')).show(5)

ridesDF.where(month(col("time_stamp"))==12)\
                   .select(month(col("time_stamp")).alias('month'),'source','cab_type','destination')\
                   .groupby('month','source','destination')\
                   .agg(count("cab_type").alias("rides")) \
                   .orderBy(desc('rides')).show(5)

**Lyft's top routes in November and December**:

+-----+--------+---------------+--------------------+------------------+-----+
|month|cab_type|          Range|              source|       destination|rides|
+-----+--------+---------------+--------------------+------------------+-----+
|   11|    Lyft| Short_distance|           North End|       Beacon Hill| 1918|
|   11|    Lyft|Medium_distance|Northeastern Univ...|          West End| 1916|
|   11|    Lyft| Short_distance|  Financial District|     South Station| 1893|
|   11|    Lyft| Short_distance|         Beacon Hill|         North End| 1869|
|   11|    Lyft| Short_distance|       South Station|Financial District| 1867|
+-----+--------+---------------+--------------------+------------------+-----+
only showing top 5 rows

+-----+--------+---------------+------------------+------------------+-----+
|month|cab_type|          Range|            source|       destination|rides|
+-----+--------+---------------+------------------+------------------+-----+
|   12|    Lyft| Short_distance| 

**Uber's top routes in November and December**:

+-----+--------+--------------+------------------+------------------+-----+
|month|cab_type|         Range|            source|       destination|rides|
+-----+--------+--------------+------------------+------------------+-----+
|   11|    Uber|Short_distance|     South Station|  Theatre District| 2121|
|   11|    Uber|Short_distance|     South Station|Financial District| 2004|
|   11|    Uber|Short_distance|  Haymarket Square|Financial District| 2003|
|   11|    Uber|Short_distance|       Beacon Hill|         North End| 2001|
|   11|    Uber|Short_distance|Financial District|     South Station| 2001|
+-----+--------+--------------+------------------+------------------+-----+
only showing top 5 rows

+-----+--------+---------------+------------------+------------------+-----+
|month|cab_type|          Range|            source|       destination|rides|
+-----+--------+---------------+------------------+------------------+-----+
|   12|    Uber| Short_distance|Financial District|     Sout

**Top routes in November and December**:

+-----+------------------+----------------+-----+
|month|            source|     destination|rides|
+-----+------------------+----------------+-----+
|   11|         North End|     Beacon Hill| 4016|
|   11|         North End|        Back Bay| 3961|
|   11|       Beacon Hill|       North End| 3904|
|   11|     South Station|Theatre District| 3895|
|   11|Financial District|   South Station| 3894|
+-----+------------------+----------------+-----+
only showing top 5 rows

+-----+------------------+------------------+-----+
|month|            source|       destination|rides|
+-----+------------------+------------------+-----+
|   12|     South Station|Financial District| 5663|
|   12|Financial District|     South Station| 5640|
|   12|            Fenway|          West End| 5589|
|   12|Financial District|  Haymarket Square| 5580|
|   12|          Back Bay|         North End| 5546|
+-----+------------------+------------------+-----+
only showing top 5 rows
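
The tables above count each direction of a route separately, so North End to Beacon Hill and Beacon Hill to North End appear as two entries. The plain-Python sketch below shows the same pair counting and, as a hypothetical variation that the notebook does not perform, an option to merge both directions. The helper name `top_routes` and the toy rides are assumptions made for illustration.

```python
from collections import Counter


def top_routes(rides, n=5, directed=True):
    """Hypothetical helper: the n most frequent (source, destination) pairs.

    With directed=False, A -> B and B -> A are merged into one route.
    """
    def route_key(ride):
        pair = (ride["source"], ride["destination"])
        return pair if directed else tuple(sorted(pair))

    return Counter(route_key(ride) for ride in rides).most_common(n)


# Toy data for illustration only, not taken from cab_rides.csv.
rides = [
    {"source": "North End", "destination": "Beacon Hill"},
    {"source": "Beacon Hill", "destination": "North End"},
    {"source": "North End", "destination": "Beacon Hill"},
    {"source": "South Station", "destination": "Theatre District"},
]

directed_top = top_routes(rides, n=2)
undirected_top = top_routes(rides, n=1, directed=False)
print(directed_top)
print(undirected_top)
```

Whether merging directions is useful depends on the business question: pricing may differ per direction, while demand between two areas may be better judged on the merged count.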



### <ins>D. Distance per class</ins>

In [354]:
display(Markdown("**Uber's top distances and classes in November and December**:"))
UberDF.where(month(col("time_stamp"))==11).select(month(col('time_stamp')).alias('month'),'Range','name','time_stamp')\
                   .groupby('month','Range','name')\
                   .agg(count("time_stamp").alias("n")) \
                   .orderBy(desc('n')).show()

UberDF.where(month(col("time_stamp"))==12).select(month(col('time_stamp')).alias('month'),'Range','name','time_stamp')\
                   .groupby('month','Range','name')\
                   .agg(count("time_stamp").alias("n")) \
                   .orderBy(desc('n')).show()

display(Markdown("**Lyft's top distances and classes in November and December**:"))
LyftDF.where(month(col("time_stamp"))==11).select(month(col('time_stamp')).alias('month'),'Range','name','time_stamp')\
                   .groupby('month','Range','name')\
                   .agg(count("time_stamp").alias("n")) \
                   .orderBy(desc('n')).show()

LyftDF.where(month(col("time_stamp"))==12).select(month(col('time_stamp')).alias('month'),'Range','name','time_stamp')\
                   .groupby('month','Range','name')\
                   .agg(count("time_stamp").alias("n")) \
                   .orderBy(desc('n')).show()



**Uber's top distances and classes in November and December**:

+-----+---------------+---------+-----+
|month|          Range|     name|    n|
+-----+---------------+---------+-----+
|   11| Short_distance|   UberXL|13849|
|   11| Short_distance|Black SUV|13777|
|   11| Short_distance| UberPool|13765|
|   11| Short_distance|    Black|13762|
|   11| Short_distance|      WAV|13692|
|   11| Short_distance|    UberX|13615|
|   11|Medium_distance|      WAV| 8514|
|   11|Medium_distance|   UberXL| 8493|
|   11|Medium_distance|Black SUV| 8487|
|   11|Medium_distance|    UberX| 8483|
|   11|Medium_distance|    Black| 8460|
|   11|Medium_distance| UberPool| 8426|
|   11|  Long_distance|      WAV|  453|
|   11|  Long_distance| UberPool|  443|
|   11|  Long_distance|   UberXL|  438|
|   11|  Long_distance|    Black|  422|
|   11|  Long_distance|    UberX|  417|
|   11|  Long_distance|Black SUV|  402|
+-----+---------------+---------+-----+

+-----+---------------+---------+-----+
|month|          Range|     name|    n|
+-----+---------------+---------+-----+

**Lyft's top distances and classes in November and December**:

+-----+---------------+------------+-----+
|month|          Range|        name|    n|
+-----+---------------+------------+-----+
|   11| Short_distance|         Lux|13282|
|   11| Short_distance|     Lyft XL|13189|
|   11| Short_distance|   Lux Black|13181|
|   11| Short_distance|        Lyft|13145|
|   11| Short_distance|      Shared|13134|
|   11| Short_distance|Lux Black XL|13068|
|   11|Medium_distance|     Lyft XL| 7644|
|   11|Medium_distance|        Lyft| 7609|
|   11|Medium_distance|Lux Black XL| 7607|
|   11|Medium_distance|      Shared| 7575|
|   11|Medium_distance|         Lux| 7548|
|   11|Medium_distance|   Lux Black| 7528|
|   11|  Long_distance|Lux Black XL|  282|
|   11|  Long_distance|         Lux|  279|
|   11|  Long_distance|      Shared|  278|
|   11|  Long_distance|   Lux Black|  271|
|   11|  Long_distance|        Lyft|  270|
|   11|  Long_distance|     Lyft XL|  260|
+-----+---------------+------------+-----+

+-----+---------------+------------+-----+
|month|   