# In 200 words or more, describe the architecture of Apache Spark.

### Apache Spark is a distributed computing engine that exposes multiple APIs, which let a user work with the back-end Spark engine using familiar programming methods. The engine lets a user build an end-to-end solution that uses the distributed capabilities of multiple machines. The Apache Spark framework uses a master-slave architecture: a driver, which runs on the master node, coordinates many executors that run on worker nodes across the cluster. Apache Spark can be used for both batch processing and real-time stream processing.

### Spark lets the user write code in Python, Java, Scala, R, or SQL, but Python and R require the user to install the relevant interpreter. Spark itself is written in Scala and runs on the Java Virtual Machine (JVM): Scala and Java code compiles to JVM bytecode, while Python, R, and SQL code is translated into operations that the JVM-based engine executes. PySpark is the Python API for Apache Spark, released so that Python programs can work with Spark. In addition to providing an API for Spark, PySpark uses the Py4J library to interface with Resilient Distributed Datasets (RDDs) held in the JVM.

### Apache Spark provides a suite of web user interfaces (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) that we can use to monitor the status and resource consumption of a Spark cluster.
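
The driver and executor split described above can be illustrated with a toy sketch in plain Python. This is only an analogy and not Spark code: a thread pool stands in for the executors, and the helpers `split_into_partitions`, `run_task`, and `driver` are hypothetical names invented here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks, one per task (hypothetical helper)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_task(partition):
    """Work done by one 'executor' on its partition: here, a partial sum."""
    return sum(partition)


def driver(data, num_executors=4):
    """The 'driver' plans the tasks, hands them to workers, and combines the results."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_sums = list(pool.map(run_task, partitions))
    return sum(partial_sums)


total = driver(list(range(1, 101)))
print(total)
```

In a real cluster the executors are separate processes on worker nodes and the driver also handles scheduling and fault recovery, which this sketch leaves out.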

# Notebook runs every month for the previous month's data

### Importing the data from Capital Bikeshare into the delta lake.

In [0]:
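import datetime
import io
from zipfile import ZipFile

import pandas as pd
import requests

# Work out the previous month as YYYYMM: step back one day from the first of this month.
today = datetime.date.today()
first = today.replace(day=1)
last_month = first - datetime.timedelta(days=1)
last_mon = last_month.strftime("%Y%m")

path = "/dbfs/mnt/dev/data/rough/ashok/zip/"
url = "https://s3.amazonaws.com/capitalbikeshare-data/" + last_mon + "-capitalbikeshare-tripdata.zip"

# Download the monthly archive and fail early on an HTTP error.
response = requests.get(url)
response.raise_for_status()

# Extract the files from the in-memory zip into the mounted folder.
z = ZipFile(io.BytesIO(response.content))
z.extractall(path)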
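
The month arithmetic in the cell above can be checked in isolation. The sketch below wraps it in two hypothetical helpers, `previous_month_stamp` and `tripdata_url` (names introduced here, not part of the notebook), so the year boundary and the URL pattern can be tested with fixed dates instead of `datetime.date.today()`.

```python
import datetime


def previous_month_stamp(today):
    """Return the month before `today` as a YYYYMM string."""
    first_of_month = today.replace(day=1)
    last_day_of_previous = first_of_month - datetime.timedelta(days=1)
    return last_day_of_previous.strftime("%Y%m")


def tripdata_url(today):
    """Build the archive URL using the naming pattern from the cell above."""
    base = "https://s3.amazonaws.com/capitalbikeshare-data/"
    return base + previous_month_stamp(today) + "-capitalbikeshare-tripdata.zip"


# A mid-month date and a 1 January date, which has to roll back to the previous year.
print(previous_month_stamp(datetime.date(2022, 11, 15)))
print(previous_month_stamp(datetime.date(2023, 1, 1)))
```

Stepping back one day from the first of the month avoids any special handling for month lengths or for January.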

## Reading the CSV into the DataFrame

In [0]:
from pyspark.sql.functions import *
from pyspark.sql import functions as F
# Read every extracted CSV in the folder, using the header row and inferring column types.
df = spark.read.option("inferSchema", "true").option("header", "true").csv("/mnt/dev/data/rough/ashok/zip")

## Dropping latitude and longitude columns

In [0]:
df_drop_lat_lon = df.drop('start_lat','start_lng','end_lat','end_lng')

## Finding the duration of each ride

In [0]:
# Parse the start and end times into timestamp columns, then drop the original columns.
df_addingduration = df_drop_lat_lon.withColumn("from_timestamp_time", to_timestamp("started_at")).withColumn("to_timestamp_time", to_timestamp("ended_at"))
df_addingduration = df_addingduration.drop("started_at", "ended_at")

# Ride duration in minutes: difference of the epoch seconds, divided by 60.
df_addingduration = df_addingduration.withColumn("duration", (col("to_timestamp_time").cast("long") - col("from_timestamp_time").cast("long")) / 60)
# Discard rows whose start or end time is missing or could not be parsed.
df_addingduration = df_addingduration.filter(col("from_timestamp_time").isNotNull())
df_addingduration = df_addingduration.filter(col("to_timestamp_time").isNotNull())
display(df_addingduration)

ride_id,rideable_type,start_station_name,start_station_id,end_station_name,end_station_id,member_casual,from_timestamp_time,to_timestamp_time,duration
636812F7EDA843A3,electric_bike,Fairfax Dr & N Taylor St,31049,,,member,2022-10-21T07:19:55.000+0000,2022-10-21T07:32:47.000+0000,12.866666666666667
2963CAC314D0C593,electric_bike,Eads St & 12th St S,31071,,,member,2022-10-21T16:52:10.000+0000,2022-10-21T17:07:43.000+0000,15.55
81383A5B7B0DDF3D,electric_bike,Van Ness Metro / UDC,31300,,,member,2022-10-17T09:23:40.000+0000,2022-10-17T09:41:44.000+0000,18.066666666666663
88E008078B2E7FFC,electric_bike,14th & Rhode Island Ave NW,31203,New Hampshire Ave & 24th St NW,31275.0,member,2022-10-19T20:16:01.000+0000,2022-10-19T20:22:35.000+0000,6.566666666666666
53D58D1427BF475B,classic_bike,Potomac & M St NW,31295,New Hampshire Ave & Ward Pl NW,31212.0,member,2022-10-25T21:10:30.000+0000,2022-10-25T21:18:51.000+0000,8.35
69247684F27CAB16,classic_bike,14th & Rhode Island Ave NW,31203,Rhode Island & Connecticut Ave NW,31239.0,member,2022-10-20T18:07:02.000+0000,2022-10-20T18:11:37.000+0000,4.583333333333333
021E310139F2FFE4,classic_bike,14th & Rhode Island Ave NW,31203,3rd & M St SE,31669.0,member,2022-10-29T00:36:00.000+0000,2022-10-29T01:04:57.000+0000,28.95
04A3B5CBCF7DD67A,classic_bike,14th & Rhode Island Ave NW,31203,Rhode Island & Connecticut Ave NW,31239.0,member,2022-10-11T18:07:31.000+0000,2022-10-11T18:12:13.000+0000,4.7
7EA5E34AC89923BA,classic_bike,11th & F St NW,31262,5th & K St NW,31600.0,member,2022-10-27T17:46:39.000+0000,2022-10-27T17:56:09.000+0000,9.5
B4F6ADFBA4AF66DF,classic_bike,Mount Vernon Ave & E Del Ray Ave,31086,Mount Vernon Ave & Kennedy St,31088.0,member,2022-10-23T16:59:59.000+0000,2022-10-23T17:03:20.000+0000,3.35
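
The duration formula (end time minus start time in seconds, divided by 60) can be sanity-checked against the displayed rows with the standard library alone. This is a minimal sketch, not the notebook's method: `duration_minutes` is a hypothetical helper, and it assumes the timestamps are passed without the `.000+0000` suffix shown in the output.

```python
from datetime import datetime


def duration_minutes(started_at, ended_at):
    """Return the ride length in minutes from two timestamp strings."""
    fmt = "%Y-%m-%dT%H:%M:%S"  # assumed format: no fractional seconds or UTC offset
    start = datetime.strptime(started_at, fmt)
    end = datetime.strptime(ended_at, fmt)
    return (end - start).total_seconds() / 60


# First row of the output above: 07:19:55 to 07:32:47 is 772 seconds.
minutes = duration_minutes("2022-10-21T07:19:55", "2022-10-21T07:32:47")
print(minutes)
```

The result is a little under 13 minutes, in line with the `duration` value shown for that ride.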


## Calculating the average duration by ride type

In [0]:
df_rideable_type_agg = df_addingduration.groupBy('rideable_type').avg("duration")

display(df_rideable_type_agg)

rideable_type,avg(duration)
docked_bike,96.36651343989108
electric_bike,14.82320969019104
classic_bike,19.56327849269697
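
What `groupBy('rideable_type').avg("duration")` computes can be spelled out in plain Python on a handful of rows. This is an illustrative sketch only: `average_by_key` is a hypothetical helper, and the four rows are taken from the duration output shown earlier, so the averages here do not match the full-month figures above.

```python
from collections import defaultdict


def average_by_key(rows, key, value):
    """Average `value` over the rows that share the same `key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# A few rides copied from the displayed sample, not the whole data set.
sample_rides = [
    {"rideable_type": "electric_bike", "duration": 15.55},
    {"rideable_type": "classic_bike", "duration": 8.35},
    {"rideable_type": "classic_bike", "duration": 9.5},
    {"rideable_type": "classic_bike", "duration": 4.7},
]

averages = average_by_key(sample_rides, "rideable_type", "duration")
print(averages)
```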


## Calculating the top 10 ride durations over 24 hours for each start station

In [0]:
# Keep only rides longer than 24 hours (1440 minutes).
df_over24 = df_addingduration.filter(col("duration") > 1440)
display(df_over24)

ride_id,rideable_type,start_station_name,start_station_id,end_station_name,end_station_id,member_casual,from_timestamp_time,to_timestamp_time,duration
AAF3CE4B95A07891,classic_bike,Anacostia Metro,31801,,,casual,2022-10-19T08:32:50.000+0000,2022-10-20T09:32:30.000+0000,1499.6666666666667
E112C160BE8C758D,classic_bike,Gallaudet / 8th St & Florida Ave NE,31508,,,casual,2022-10-30T18:35:14.000+0000,2022-10-31T19:34:59.000+0000,1499.75
A8A3BB946A74B5CB,classic_bike,14th & V St NW,31101,,,member,2022-10-23T12:08:03.000+0000,2022-10-24T13:07:58.000+0000,1499.9166666666667
1671339A417D2C2F,classic_bike,Pleasant St & MLK Ave SE,31807,,,casual,2022-10-10T14:48:49.000+0000,2022-10-11T15:48:42.000+0000,1499.8833333333334
22461B1698F1EB78,classic_bike,13th & U St NW,31132,,,member,2022-10-16T01:58:13.000+0000,2022-10-17T02:58:09.000+0000,1499.9333333333334
D2BF4730B5497986,classic_bike,N Oak St & W Broad St,32602,,,casual,2022-10-20T10:29:21.000+0000,2022-10-21T11:29:16.000+0000,1499.9166666666667
CC202332FB2597EB,docked_bike,N Veitch St & 20th St N,31029,,,casual,2022-10-01T22:07:54.000+0000,2022-10-02T23:07:55.000+0000,1500.0166666666669
A44BF903A9453640,classic_bike,The Mall at Prince Georges,32422,,,member,2022-10-29T19:58:18.000+0000,2022-10-30T20:58:12.000+0000,1499.9
DDF728A706308277,docked_bike,Meridian High School / Haycock Rd & Leesburg Pike,32600,,,casual,2022-10-30T20:06:47.000+0000,2022-11-03T15:32:47.000+0000,5486.0
8CE7B82DB0FFA01E,docked_bike,Sunset Hills Rd & Isaac Newton Square,32220,,,casual,2022-10-17T19:23:51.000+0000,2022-10-18T20:23:52.000+0000,1500.0166666666669


In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Rank rides within each start station by duration, longest first, and keep the top 10.
windowSpec = Window.partitionBy("start_station_name").orderBy(col("duration").desc())
df_top10 = df_over24.withColumn("row_number", row_number().over(windowSpec)).filter(col("row_number") <= 10)
display(df_top10)

ride_id,rideable_type,start_station_name,start_station_id,end_station_name,end_station_id,member_casual,from_timestamp_time,to_timestamp_time,duration,row_number
D3A0013116AE4495,docked_bike,10th & E St NW,31256,,,casual,2022-10-20T14:18:59.000+0000,2022-11-05T19:03:21.000+0000,23324.366666666665,1
829B9EB3F01892D4,docked_bike,10th & E St NW,31256,,,casual,2022-10-22T12:17:09.000+0000,2022-11-05T16:55:13.000+0000,20438.066666666666,2
9DA1B6EF4C4286AE,classic_bike,10th & E St NW,31256,,,casual,2022-10-10T12:48:49.000+0000,2022-10-11T13:48:44.000+0000,1499.9166666666667,3
E279C5DA462D64E1,classic_bike,10th & E St NW,31256,,,casual,2022-10-10T12:48:27.000+0000,2022-10-11T13:48:20.000+0000,1499.8833333333334,4
12107A126AA31998,classic_bike,10th & E St NW,31256,,,member,2022-10-07T15:51:11.000+0000,2022-10-08T16:50:52.000+0000,1499.6833333333334,5
7E17CC2C9517E11D,classic_bike,10th & G St NW,31274,,,casual,2022-10-29T15:44:49.000+0000,2022-10-30T16:44:41.000+0000,1499.8666666666666,1
2AEF2F50AFB541AB,classic_bike,10th & G St NW,31274,,,casual,2022-10-03T12:04:31.000+0000,2022-10-04T13:04:21.000+0000,1499.8333333333333,2
93DA14D2FEEE3663,docked_bike,10th & K St NW,31263,,,casual,2022-10-12T15:21:31.000+0000,2022-10-23T10:39:29.000+0000,15557.966666666667,1
65BEBF0B2B8F0DCA,docked_bike,10th & K St NW,31263,,,casual,2022-10-12T15:21:14.000+0000,2022-10-23T10:28:40.000+0000,15547.433333333332,2
5413E3A5D718427B,docked_bike,10th & K St NW,31263,10th St & L'Enfant Plaza SW,31287.0,casual,2022-10-15T15:48:48.000+0000,2022-10-16T22:22:39.000+0000,1833.85,3
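
The window logic above (partition by station, order by duration descending, number the rows, keep the first N) can be mirrored in plain Python to make the semantics explicit. This is a sketch under stated assumptions rather than a replacement for the Spark code: `top_n_per_group` is a hypothetical helper, and the sample rows are a subset of the output above with only two columns kept.

```python
from collections import defaultdict


def top_n_per_group(rows, group_key, sort_key, n):
    """Number rows within each group by descending sort_key and keep the first n."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    ranked = []
    for group in sorted(groups):
        ordered = sorted(groups[group], key=lambda r: r[sort_key], reverse=True)
        for position, row in enumerate(ordered[:n], start=1):
            ranked.append({**row, "row_number": position})
    return ranked


# Rides copied from the displayed output, deliberately listed out of order.
rides = [
    {"start_station_name": "10th & E St NW", "duration": 1499.9166666666667},
    {"start_station_name": "10th & G St NW", "duration": 1499.8333333333333},
    {"start_station_name": "10th & E St NW", "duration": 23324.366666666665},
    {"start_station_name": "10th & G St NW", "duration": 1499.8666666666666},
    {"start_station_name": "10th & E St NW", "duration": 20438.066666666666},
]

top2 = top_n_per_group(rides, "start_station_name", "duration", 2)
for ride in top2:
    print(ride["start_station_name"], ride["row_number"], ride["duration"])
```

Like `row_number()`, this gives tied durations distinct consecutive numbers, so each station contributes at most N rows.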
