
# **Exercise 2: Spark SQL**
#### This second exercise introduces database operations with Spark. We start with Spark SQL, move on to exploratory analysis of Apache web server logs collected from the http://fileadmin.cs.lth.se website, and end by finding out which course is the most popular on fileadmin.

### **The following material will be covered:**
#### *Part 1:* Learning Spark SQL
#### *Part 2:* Exploratory analysis of fileadmin dataset
#### *Part 3:* Visualizing the day and month rhythms
#### *Part 4:* Finding the most popular course

### During the exercises, the following resources might come in handy:
* #### Documentation of the [PySpark API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)
* #### Documentation of the [Python API](https://docs.python.org/2.7/)
* #### Documentation of the [Spark SQL API](http://spark.apache.org/docs/latest/sql-programming-guide.html)
* #### Documentation of [Hive SQL](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF)

### To run code in Jupyter, press: 
* #### `Ctrl-Enter` to run the code in the currently selected cell
* #### `Shift-Enter` to run the code in the currently selected cell and jump to the next cell

In [1]:
import os
os.environ["SPARK_DRIVER_MEMORY"] = "1792M"
os.environ["SPARK_OPTS"] = "--driver-java-options=-Xms1024M --driver-java-options=-Xmx1536M --driver-java-options=-Dlog4j.logLevel=info"

#from pyspark import SparkContext
#from pyspark.sql import SQLContext

#sc = SparkContext(master="local[*]")
#sqlContext = SQLContext(sc)

### **Helper: Displays rows from a Spark SQL object as HTML**
#### The following code displays lists of rows or dataframes as an HTML table, which gives human-friendly output of our data. It is enough to skim this code; you do not need to know it in detail for this exercise.

In [2]:
from IPython.display import display, HTML
import warnings

def displayRows(rowDf):
    headers = []
    rows = []
    if(str(type(rowDf)) == "<class 'pyspark.sql.dataframe.DataFrame'>"):
        rows = rowDf.limit(10000).collect() #Let's limit the output just in case!
        if(len(rows) == 10000):
            if(rowDf.limit(10001).count() == 10001):
                warnings.warn("More than 10 000 rows were returned, only showing the first 10 000.")
                
        headers = list(rowDf.columns)
    else:
        rows = rowDf
        if(len(rows) > 10000):
            warnings.warn("Rows has {0} elements, only showing the first 10 000.".format(len(rows)))
            rows = rows[0:10000]
            
        #Computes the unique set of keys
        headers = list(sorted(reduce(lambda x,y: x.union(set(y.asDict().iterkeys())), rows, set())))
            
    tableHead = ["<th>{0}</th>".format(key) for key in headers]
    tableBody = ["<tr>{0}</tr>".format(
                    "".join(["<td>{0}</td>".format(rowDict.get(header)) 
                            for rowDict 
                            in (row.asDict(),) 
                            for header 
                            in headers])
                    ) for row in rows]
    
    display(HTML(
    u"""<table>
    <thead><tr>{0}</tr></thead>
    <tbody>{1}</tbody>
    </table>
    """.format("".join(tableHead), "".join(tableBody))))

## Part 1: Learning Spark SQL

#### This part introduces Spark SQL by having you write SQL queries.

The cell below generates the data that you will write queries against.

In [4]:
#Top 20 boy and girl names 2014 in random order.
names = ["Caden", "Kaylee", "Lucas", "Ethan", "Alexander", "Jackson", 
         "Aiden", "Madelyn", "Michael", "Avery", "Luke", "Isabella", 
         "Chloe", "Elijah", "Abigail", "Madison", "Jacob", "Zoe", "Emily", 
         "Jayden", "Liam", "Mason", "Mia", "Sophia", "Benjamin", "Layla", 
         "Emma", "Lily", "Charlotte", "Caleb", "James", "Noah", "Ella", 
         "Jack", "Jayce", "Aubrey", "Olivia", "Harper", "Logan", "Ava"]

#A-G in phonetic alphabet
groups = ["Alpha","Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf"]

#Some numeric magic to generate not so uniform random data.
tblUserRdd = sc.parallelize(map(lambda i: (i, ((i*104729)^131) % 7, 26500 + ((i*104729)^96587) % 6367), range(1,51)))
tblNamesRdd = sc.parallelize(enumerate(names, 1), 4)
tblGroupNamesRdd = sc.parallelize(enumerate(groups), 2)

#Create dataframes from the RDDs
tblNames      = sqlContext.createDataFrame(tblNamesRdd,      ["userId", "name"])
tblUsers      = sqlContext.createDataFrame(tblUserRdd,       ["id", "groupId", "salary"])
tblGroupNames = sqlContext.createDataFrame(tblGroupNamesRdd, ["id", "name"])

#Register them for use.
sqlContext.registerDataFrameAsTable(tblGroupNames, "tblGroupNames")
sqlContext.registerDataFrameAsTable(tblUsers, "tblUsers")
sqlContext.registerDataFrameAsTable(tblNames, "tblNames")

#### First, let's get some basic information about each dataframe

Dataframes are structured, meaning that column names and types are well defined. If you have read the data generation cell, you might have noticed that the types were not specified; they are inferred by Spark.

#### Dataframes provide a very handy function called `printSchema()`. As its name implies, it shows the schema of the data, including column names and types.

In [5]:
tblUsers.printSchema()

root
 |-- id: long (nullable = true)
 |-- groupId: long (nullable = true)
 |-- salary: long (nullable = true)



#### A number of operations can be called on a dataframe. Like RDDs, dataframes have a `count()` action that returns the number of rows.

In [6]:
tblUsers.count()

50

In [7]:
tblGroupNames.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)



In [8]:
tblGroupNames.count()

7

In [9]:
tblUsers.printSchema()

root
 |-- id: long (nullable = true)
 |-- groupId: long (nullable = true)
 |-- salary: long (nullable = true)



In [10]:
tblNames.count()

40

#### The next three cells display the content of each dataframe using the helper function *displayRows*

In [11]:
displayRows(tblUsers)

id,groupId,salary
1,5,26623
2,5,30452
3,5,30462
4,6,27932
5,6,26973
6,2,32796
7,2,29202
8,3,32531
9,3,28969
10,3,29034


In [12]:
displayRows(tblNames)

userId,name
1,Caden
2,Kaylee
3,Lucas
4,Ethan
5,Alexander
6,Jackson
7,Aiden
8,Madelyn
9,Michael
10,Avery


In [13]:
displayRows(tblGroupNames)

id,name
0,Alpha
1,Bravo
2,Charlie
3,Delta
4,Echo
5,Foxtrot
6,Golf


#### There is also a basic built-in function for displaying the contents of a DataFrame: *show()*
However, it only prints the top 20 rows by default and truncates long column values. It is useful for debugging.

In [14]:
tblUsers.show()

+---+-------+------+
| id|groupId|salary|
+---+-------+------+
|  1|      5| 26623|
|  2|      5| 30452|
|  3|      5| 30462|
|  4|      6| 27932|
|  5|      6| 26973|
|  6|      2| 32796|
|  7|      2| 29202|
|  8|      3| 32531|
|  9|      3| 28969|
| 10|      3| 29034|
| 11|      0| 27978|
| 12|      1| 30759|
| 13|      1| 31825|
| 14|      1| 28231|
| 15|      1| 30599|
| 16|      5| 29023|
| 17|      5| 32820|
| 18|      5| 32862|
| 19|      5| 30292|
| 20|      6| 29373|
+---+-------+------+
only showing top 20 rows



#### Now, the first query you will write

### 1.a) Write a query that selects all user ids in the group with id 0

In [18]:
a = 2
assert a % 2 == 0, "value was odd, should be even"

In [None]:
# Replace <FILL IN> with the proper code
q1a = sqlContext.sql("""
SELECT id 
FROM tblUsers 
WHERE <FILL IN>
""")

displayRows(q1a)

In [None]:
assert set(map(lambda row: row.id, q1a.collect())) == set([11,26,27])

### 1.b) Write a query that finds the min and max userId grouped by groupId

The result should have the following columns:

1. minUserId: The min user id per group
2. maxUserId: The max user id per group
3. groupId: The group id

**Hint:** Use GROUP BY, MIN, MAX

In [None]:
q1b = sqlContext.sql("""
SELECT 
    <FILL IN> AS minUserId, 
    <FILL IN> AS maxUserId,
    <FILL IN> 
FROM tblUsers 
<FILL IN>
""")

displayRows(q1b)

In [None]:
minIds = {0: 11,
 1: 12,
 2: 6,
 3: 8,
 4: 24,
 5: 1,
 6: 4}

maxIds = {0: 27,
 1: 43,
 2: 46,
 3: 39,
 4: 40,
 5: 47,
 6: 50}

assert all(map(lambda row: minIds[row.groupId] == row.minUserId, q1b.collect()))
assert all(map(lambda row: maxIds[row.groupId] == row.maxUserId, q1b.collect()))

### 1.c) Compute the global average salary

When you do not specify any group by columns and use aggregating functions such as **AVG**(column) then the aggregation will be performed over the entire result and return a single row.
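
The difference between a global and a grouped aggregate can be tried outside Spark. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, and the table `tblDemo` with its `amount` column is made up for this example.

```python
import sqlite3

# Hypothetical toy table; sqlite3 stands in for Spark SQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblDemo (id INTEGER, groupId INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO tblDemo VALUES (?, ?, ?)",
    [(1, 0, 10), (2, 0, 20), (3, 1, 30), (4, 1, 60)],
)

# No GROUP BY: the aggregate runs over all rows and yields exactly one row.
global_rows = conn.execute(
    "SELECT AVG(amount) AS avgAmount FROM tblDemo"
).fetchall()
print(global_rows)

# With GROUP BY: one result row per distinct groupId.
grouped_rows = conn.execute(
    "SELECT groupId, AVG(amount) AS avgAmount "
    "FROM tblDemo GROUP BY groupId ORDER BY groupId"
).fetchall()
print(grouped_rows)
```

The first query returns a single row no matter how many rows the table holds, which is why the exercise can read the value with `collect()[0]`.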

In [None]:
avgSalary = sqlContext.sql("""
SELECT <FILL IN> AS avgSalary 
FROM tblUsers
""").collect()[0].avgSalary

avgSalary

In [None]:
assert avgSalary == 29707.34

### 1.d) Aggregate salaries per group

Group by groupId, compute the minimum, average and maximum salary per group, and sort by average salary in descending order.

**Hint:** Use MIN, AVG, MAX and GROUP BY. You can sort by computed columns.
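
Sorting by a computed column means that an alias defined in the SELECT list can be used in ORDER BY. A minimal sketch, assuming `sqlite3` as a stand-in for Spark SQL and a made-up table `tblOrders`:

```python
import sqlite3

# Hypothetical toy data; sqlite3 stands in for Spark SQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblOrders (customer TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO tblOrders VALUES (?, ?)",
    [("ann", 5), ("ann", 15), ("bob", 40), ("bob", 20), ("cy", 1)],
)

# avgPrice is computed in the SELECT list and then reused in ORDER BY.
rows = conn.execute("""
SELECT customer,
       MIN(price) AS minPrice,
       AVG(price) AS avgPrice,
       MAX(price) AS maxPrice
FROM tblOrders
GROUP BY customer
ORDER BY avgPrice DESC
""").fetchall()

for row in rows:
    print(row)
```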

In [None]:
q1d = sqlContext.sql("""
SELECT 
    groupId,
    COUNT(id) AS NumUsers,
    <FILL IN> AS MinSalary, 
    <FILL IN> AS AvgSalary,
    <FILL IN> AS MaxSalary,
    AVG(salary) - {} AS GlobalAvgDelta
FROM tblUsers
<FILL IN>
ORDER BY <FILL IN>
""".format(avgSalary))

displayRows(q1d)

In [None]:
groups = [
    (5, 26623, 30573, 32862), 
    (4, 26923, 29898, 31849),
    (1, 26600, 29833, 32234),
    (2, 27784, 29537, 32796),
    (6, 26973, 29490, 32245),
    (3, 27858, 29447, 32531),
    (0, 26784, 28369, 30346)
]

q1dresult = q1d.collect()
assert len(q1dresult) == 7
# all(...) is needed: a bare non-empty list is always truthy, so the checks would never fail.
assert all(q1dresult[i].groupId == groups[i][0] for i in xrange(len(q1dresult))), "GroupID column does not match."
assert all(q1dresult[i].MinSalary == groups[i][1] for i in xrange(len(q1dresult))), "MinSalary column does not match."
assert all(int(q1dresult[i].AvgSalary) == groups[i][2] for i in xrange(len(q1dresult))), "AvgSalary column does not match."
assert all(q1dresult[i].MaxSalary == groups[i][3] for i in xrange(len(q1dresult))), "MaxSalary column does not match."

### **Part 2: Exploratory analysis of fileadmin dataset**
#### In this part, we will explore a parsed Apache log from fileadmin

#### We start by loading our Apache log into a dataframe.
As before, we will use the SQLContext provided in the variable `sqlContext`, calling the `read.parquet()` function and supplying the path to the parsed Apache log. Following this step, we register the dataframe as the table `fadmLog` using `registerDataFrameAsTable()`. This allows us to access the dataframe as a table using the name `fadmLog` in any subsequent SQL call.

#### The first line, "%matplotlib inline", is a directive that tells Jupyter Notebook to render all plots inline, i.e. as images displayed in the notebook. It is not valid Python code and only works in Jupyter notebooks or IPython.

In [None]:
%matplotlib inline
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import *

from os.path import abspath

fadmLog = sqlContext.read.parquet("file:" + abspath("../data/fileadmin-logs.parquet/")).persist()
fadmLog.printSchema()
sqlContext.registerDataFrameAsTable(fadmLog, "fadmLog")

In [None]:
fadmLog.printSchema()

#### Let's take a look at 10 rows from the log.

In [None]:
displayRows(fadmLog.limit(10))

### **2.a) Determine number of entries**

In [None]:
fadmLog.count()

### **2.b) Determine time range**
#### We see that our log contains more than 19 million entries.

#### Next, we wish to determine the date of the first and last entry in the log.

#### Complete the SQL statement.
Select the oldest and the newest entry from the log using the `MIN()` and `MAX()` functions. Use the `from_unixtime()` function to convert the timestamps into a human-friendly format.

We expect a dataframe containing a single row with two columns, the **first(oldest) entry** and the **last(latest) entry**.
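
To get a feeling for the timestamp conversion, the sketch below does the same kind of formatting in plain Python. It is only an illustration: the helper `format_unix_utc` and the sample timestamps are made up, and it always formats in UTC, whereas Spark's `from_unixtime()` formats in the system time zone, so results can differ by the local UTC offset.

```python
import time

def format_unix_utc(ts):
    """Format seconds since 1970-01-01 00:00:00 UTC as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))

# Hypothetical sample timestamps (seconds since the epoch).
timestamps = [86400, 0, 3600]

# Aggregate on the raw numbers first, then format only the two results.
oldest = format_unix_utc(min(timestamps))
newest = format_unix_utc(max(timestamps))
print(oldest)
print(newest)
```

Taking the minimum and maximum of the numeric timestamps and formatting afterwards avoids converting every row.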

In [None]:
# Replace <FILL IN> with the proper code
firstLastDf = sqlContext.sql("""
    <FILL IN>
  """).cache()

displayRows(firstLastDf.collect())

In [None]:
# Test the first log entry
assert firstLastDf.collect()[0][0] == u'2011-06-21 23:44:15', "Wrong date of the first log entry!"

### **2.c) Determine the number of requests per year**
#### Which year did our website receive most requests?
Show each year and the total number of requests that year in ascending order by year.

#### Complete the SQL statement below.

In [None]:
# Replace <FILL IN> with the proper code
numberOfRequests = sqlContext.sql(""" 
                                     <FILL IN>
                                     """).cache()

displayRows(numberOfRequests.collect())

In [None]:
# Test the yearly summary
assert len(numberOfRequests.collect()) == 5, "Expected to see data from 5 years!"
assert numberOfRequests.collect()[3][1] == 5890942, "Wrong number of requests in year 2014!"

### **2.d) Determine the number of requests per year/month**
#### Does the number of requests peak in a certain month?
Repeat the previous exercise, displaying the month in addition to the year.

Show each year, month and the total number of requests that month in ascending order by year and month.

#### Complete the SQL statement below.


In [None]:
# Replace <FILL IN> with the proper code
numberOfRequestsYearMonth = sqlContext.sql("""
                                              <FILL IN>
                                              """).cache()

displayRows(numberOfRequestsYearMonth.collect())

In [None]:
# Test the monthly summary
assert len(numberOfRequestsYearMonth.collect()) == 50, "Expected to see data from 50 months!"
assert numberOfRequestsYearMonth.collect()[17][2] == 1225716, "Wrong number of requests in November, 2012!"

### **2.e) Determine the all time high 10 days of requests**
#### Find the 10 days with the highest number of requests. We expect a dataframe with 10 rows.
#### The columns must contain a date and the number of requests for that date.
Sort the dataframe by the number of requests in descending order.

#### Complete the SQL statement below, use `GROUP BY`, `ORDER BY`, and `LIMIT` statements.
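
The format string passed to `PRINTF` in the cell below uses C-style placeholders. As a sketch of what `%04d-%02d-%02d` produces, the same format string can be tried with Python's `%` operator (the helper name `format_date` is made up for this illustration):

```python
def format_date(year, month, day):
    # %04d pads the year to four digits, %02d pads month and day to two digits.
    return "%04d-%02d-%02d" % (year, month, day)

print(format_date(2012, 11, 21))
print(format_date(2011, 6, 1))

# Zero-padded dates sort chronologically even when compared as plain strings.
dates = [format_date(2012, 11, 21), format_date(2011, 6, 1), format_date(2012, 2, 9)]
sorted_dates = sorted(dates)
print(sorted_dates)
```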

In [None]:
# Replace <FILL IN> with the proper code
topTenRequests = sqlContext.sql("""
SELECT
    PRINTF("%04d-%02d-%02d", year, month, day) AS date
    <FILL IN>
                                    """)

displayRows(topTenRequests.collect())

In [None]:
# Test the summary
assert len(topTenRequests.collect()) == 10, "Expected to see data from 10 dates!"
assert topTenRequests.collect()[0][1] == 924902, "Wrong number of requests, 2012-11-21!"

### **2.f) Determine the top 10 sources by number of requests in a single day**

#### Which sources are responsible for the highest number of requests?
Find the top 10 combinations of source and date with the highest number of requests.

#### We expect a dataframe with 10 rows.
The columns must contain a source, a date and the number of requests for that source and date. Sort the dataframe by the number of requests in descending order. 

#### Complete the SQL statement below.

**Hints:** use `GROUP BY`, `ORDER BY`, and `LIMIT` statements.

In [None]:
# Replace <FILL IN> with the proper code
topTenRequestSources = sqlContext.sql("""
                                            <FILL IN>
                                            """).cache()

displayRows(topTenRequestSources.collect())

In [None]:
# Test the summary
assert len(topTenRequestSources.collect()) == 10, "Expected to see data from 10 dates!"
assert topTenRequestSources.collect()[0].source == u'130.235.16.54', "Wrong top source!"

### **2.g) Determine the top 100 requests of a single resource by source during a single day**

#### Which resource is most requested?
Find the top 100 resources having the highest number of requests.

#### We expect a dataframe with 100 rows.
The columns must contain a resource, a source, a date and the number of requests for that resource, source and date. Sort the dataframe by the number of requests in descending order. 

#### Complete the SQL statement below.

**Hints:** use `GROUP BY`, `ORDER BY`, and `LIMIT` statements.

In [None]:
# Replace <FILL IN> with the proper code
requestsTopHundred = sqlContext.sql("""
                                        <FILL IN>
                                        """).cache()

displayRows(requestsTopHundred)

In [None]:
# Test the summary
assert len(requestsTopHundred.collect()) == 100, "Expected to see 100 rows!"

### **2.h) Find crawlers by behavior**
#### Web crawlers usually make a lot of requests to many different resources. By listing the sources with the highest number of distinct resources accessed, together with their total number of requests, we can start to filter out web crawlers.
Find the ten most likely web crawlers. Display the source, the number of distinct resources accessed by that source and its total number of requests.

Complete the SQL statement below, using `COUNT(DISTINCT *column*)`, `COUNT`, `GROUP BY`, `ORDER BY`, and `LIMIT`.
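
The difference between `COUNT` and `COUNT(DISTINCT ...)` is easiest to see on a tiny table. The sketch below is an illustration only: it uses `sqlite3` as a stand-in for Spark SQL, and the `visits` table and its contents are made up.

```python
import sqlite3

# Hypothetical toy data: "bot" touches many pages, "human" repeats one page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("bot", "/a"), ("bot", "/b"), ("bot", "/c"), ("bot", "/a"),
        ("human", "/a"), ("human", "/a"), ("human", "/a"),
    ],
)

# COUNT(DISTINCT page) counts each page once per visitor, COUNT(*) counts every row.
rows = conn.execute("""
SELECT visitor,
       COUNT(DISTINCT page) AS numPages,
       COUNT(*) AS numVisits
FROM visits
GROUP BY visitor
ORDER BY numPages DESC
LIMIT 10
""").fetchall()

for row in rows:
    print(row)
```

A visitor with many rows but few distinct pages looks like a person reloading a page; many distinct pages is the crawler-like pattern the exercise looks for.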

In [None]:
# Replace <FILL IN> with the proper code
crawlers = sqlContext.sql("""
                                <FILL IN>
                                """).cache()

displayRows(crawlers.collect())

In [None]:
# Test the summary
assert len(crawlers.collect()) == 10, "Expected to see data from 10 sources!"
assert crawlers.collect()[0][0] == u'66.249.78.63', "Expected to see a web crawler from 66.249.78.63 (GoogleBot)!"

## **Part 3: Visualizing the day and month rhythms**
#### In this part we will aggregate the number of requests made per hour and weekday and then visualize it.
#### A problem with our log is that a few sources are generating a lot of requests. These have the potential of polluting the aggregated statistics. The goal is to find them and eventually filter them out.

### **3.a) Find the top 10 sources that generate the most traffic.**

#### Expected output:

* source - the source that made the requests
* numRequests - count of requests per group

#### Complete the SQL statement below, use COUNT, GROUP BY, ORDER BY, LIMIT

In [None]:
# Replace <FILL IN> with the proper code
topRequesters = sqlContext.sql("""
SELECT <FILL IN>
FROM fadmLog
GROUP BY <FILL IN>
ORDER BY <FILL IN> DESC
LIMIT <FILL IN>
""").persist()
topRequesters.show()

### **3.b) Aggregate by hour and weekday the number of requests, filter out the top 10**

#### First, we need a User Defined Function (UDF) that takes three columns (year, month, day) and returns the index of the weekday.

Newer versions of Spark provide this function, but before 1.5.0 it did not exist. The example also shows how to implement functionality you need when Spark does not provide it.
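
The weekday index that such a function returns follows Python's `datetime` convention, where Monday is 0 and Sunday is 6. A small sketch in plain Python, with the helper name `weekday_index` made up for this illustration:

```python
import datetime

WEEKDAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def weekday_index(year, month, day):
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return datetime.date(year, month, day).weekday()

for day in (19, 21, 25):
    idx = weekday_index(2012, 11, day)
    print("2012-11-%02d -> %d (%s)" % (day, idx, WEEKDAY_NAMES[idx]))
```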

In [None]:
import datetime
def getWeekDay(year, month, day):
    return datetime.datetime(year,month,day).weekday()

#### Register the function as getWeekDay and specify that it has an expected output type of Integer

In [None]:
sqlContext.registerFunction("getWeekDay", getWeekDay, IntegerType())

#### Second, we need an efficient way of filtering out the top 10 requesters
#### We could use this method (after registering the topRequesters DataFrame as a table):

```sql
SELECT fadmLog.source FROM fadmLog 
LEFT JOIN topRequesters ON topRequesters.source = fadmLog.source
WHERE topRequesters.source IS NULL
```

#### However, this is horribly inefficient, because the join shuffles data just to filter out 10 items.
#### The solution: Use a UDF that checks against a set in memory.
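
Both approaches select the same rows; they differ only in how much work is done. The sketch below shows the equivalence on toy data. It is an illustration under assumptions: `sqlite3` stands in for Spark SQL, a plain Python set stands in for the broadcast variable, and the table names and addresses are made up.

```python
import sqlite3

# Hypothetical log rows and a hypothetical set of sources to exclude.
requests = [("10.0.0.1",), ("10.0.0.2",), ("10.0.0.1",), ("10.0.0.3",), ("10.0.0.1",)]
blocked = set(["10.0.0.1"])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (source TEXT)")
conn.execute("CREATE TABLE blockedSources (source TEXT)")
conn.executemany("INSERT INTO log VALUES (?)", requests)
conn.executemany("INSERT INTO blockedSources VALUES (?)", [(s,) for s in blocked])

# Approach 1: anti-join, keeping the rows that found no partner in blockedSources.
join_result = [row[0] for row in conn.execute("""
SELECT log.source FROM log
LEFT JOIN blockedSources ON blockedSources.source = log.source
WHERE blockedSources.source IS NULL
ORDER BY log.source
""")]

# Approach 2: a membership test against an in-memory set, one row at a time.
set_result = sorted(src for (src,) in requests if src not in blocked)

print(join_result)
print(set_result)
```

In Spark, the set lookup needs no data movement between workers once the set has been broadcast, which is the motivation given above.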

In [None]:
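# Going through .rdd works on Spark 1.x and is required from Spark 2.0, where DataFrame.map is gone.
topRequestersSources = topRequesters.rdd.map(lambda row: row.source).collect()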

In [None]:
topRequestersBroadcast = sc.broadcast(set(topRequestersSources))

In [None]:
def notTopRequester(source):
    return source not in topRequestersBroadcast.value #Returns True if source not in the set

In [None]:
# Replace <FILL IN> with the proper code
sqlContext.registerFunction("notTopRequester", <FILL IN>, <FILL IN>)

#### Finally, we are ready to wrap up the entire solution from the individual bits.

#### Goal: Count the number of requests per hour and weekday from all requests where the source is not one of the top 10 requesters, and finally, sort by weekday, hour.

The expected output columns are:

1. **hour: Integer** - the hour of the day, can only assume values between 0-23
2. **weekday: Integer** - the index of a weekday, can only assume values between 0-6
3. **numRequests: Integer** - the number of requests in the group (hour, weekday)

#### Hint: A UDF can be used in WHERE clauses and GROUP BY clauses. You can sort by computed/projected columns.
#### The WHERE clause expects only a function that returns True or False

* Use getWeekDay to get the weekday from year, month, day.
* Use notTopRequester to decide which sources should be selected
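
The idea of calling a registered Python function from WHERE and GROUP BY can be tried without Spark. The sketch below is an analogy, not the Spark API: it uses `sqlite3`, whose `create_function` plays the role of `registerFunction`, and the function names, table and data are all made up.

```python
import datetime
import sqlite3

BLOCKED = set(["noisy"])  # hypothetical sources to exclude

def weekday_of(year, month, day):
    return datetime.date(year, month, day).weekday()

def not_blocked(source):
    # Return 1/0 so the value can be used directly as a SQL condition.
    return int(source not in BLOCKED)

conn = sqlite3.connect(":memory:")
conn.create_function("weekdayOf", 3, weekday_of)   # name, number of arguments, callable
conn.create_function("notBlocked", 1, not_blocked)

conn.execute("CREATE TABLE hits (source TEXT, year INTEGER, month INTEGER, day INTEGER)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?)",
    [
        ("a", 2012, 11, 19), ("b", 2012, 11, 19), ("noisy", 2012, 11, 19),
        ("a", 2012, 11, 21), ("noisy", 2012, 11, 21),
    ],
)

# One UDF filters rows in WHERE, the other computes the grouping key.
rows = conn.execute("""
SELECT weekdayOf(year, month, day) AS weekday, COUNT(*) AS numHits
FROM hits
WHERE notBlocked(source)
GROUP BY weekdayOf(year, month, day)
ORDER BY weekday
""").fetchall()
print(rows)
```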

In [None]:
# Replace <FILL IN> with the proper code
hourWeekdayData = sqlContext.sql("""
SELECT <FILL IN>
FROM <FILL IN>
WHERE <FILL IN>
GROUP BY <FILL IN>
ORDER BY <FILL IN>
""").collect()

In [None]:
# Test the aggregated data
def toHours(row):
    return (row.weekday * 24) + row.hour

assert (
    len(hourWeekdayData) == 168 and 
    all(toHours(hourWeekdayData[i]) < toHours(hourWeekdayData[i+1]) for i in xrange(0,len(hourWeekdayData)-1)) and
    reduce(lambda x,y: x+y, map(lambda row: row.numRequests, hourWeekdayData),0) == 16182240 
)

### **3.c) Visualize the result of the previous aggregation**

In [None]:
#Visualization code
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as tcks

fig, ax = plt.subplots()

#Set the size of the figure
fig.set_size_inches(16,8)
weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

#Convert aggregated data to plotter
dateArray = map(lambda row: weekdays[row.weekday] + "@" + str(row.hour), hourWeekdayData)
requestArray = map(lambda row: row.numRequests, hourWeekdayData)

#A little hack to get labels per hour
x = np.arange(len(requestArray))+0.5
ax.plot(np.linspace(0,7*24-1,7*24),requestArray)
ax.set_xlim(0, 24*7)

ax.set_xticks( [pos for pos in xrange(0,24*7,6)] )
ax.set_xticklabels( ["{0}@{1}".format(weekdays[int(x / 24)], x%24) for x in xrange(0,24*7,6) ], rotation=90 ) ;

ax.xaxis.set_major_locator(tcks.IndexLocator(6, 0))
#ax.xaxis.set_major_formatter(tcks.FuncFormatter(lambda x, pos: ))
ax.xaxis.set_minor_locator(tcks.IndexLocator(1, 0))
ax.grid(True)
plt.show()

### **Part 4: Finding the most popular course**

##### We assume that all courses are located at the resource path: /cs/Education/{course}/... and that course codes are letters followed by numbers

##### Your task is to compute the number of requests per course
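
To make the assumption about resource paths concrete, the sketch below tries an equivalent pattern on a few made-up paths. It uses the standard library `re` module, and the helper `course_of` and the course code `EDA040` are hypothetical examples.

```python
import re

# Letters followed by digits, directly under /cs/Education/, followed by more path.
COURSE_PATTERN = re.compile(r'^/cs/Education/([A-Za-z]+[0-9]+)/.+')

def course_of(path):
    """Return the upper-cased course code, or "" when the path does not match."""
    if path is None:
        return ""
    match = COURSE_PATTERN.match(path)
    return match.group(1).upper() if match is not None else ""

samples = [
    "/cs/Education/eda040/lectures/l1.pdf",  # matches, code is upper-cased
    "/cs/Education/index.html",              # no digits after the letters
    "/cs/Education/EDA040/",                 # nothing after the trailing slash
    None,                                    # missing resource
]
for path in samples:
    print(repr(path), "->", repr(course_of(path)))
```

Returning an empty string for non-matching paths keeps every row in the result, so invalid courses can be filtered out later with a simple comparison.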

In [None]:
import regex as re
p = re.compile(ur'^\/cs\/Education\/([A-Za-z]+[0-9]+)\/.+')

#### We have provided a function that extracts the course name and handles error cases gracefully.

In [None]:
def extractCourse(text):
    if text is not None:
        match = p.match(text)
        return match.group(1).upper() if match is not None else ""
    else:
        return ""

### **4.a) Register extractCourse as a UDF**

In [None]:
# Replace <FILL IN> with the proper code
sqlContext.<FILL IN>

#### The general idea here is to show how a problem can be split into two queries instead of one all-in-one query. The execution will be faster or similar in performance.

### **4.b) Select requests and project them into source and course code.**

Only select the requests where the source is not a top requester

Expected columns:

1. **source: String** - The source
2. **course: String** - The course code as given by extractCourse

#### Hint: Use the UDFs: extractCourse and notTopRequester

In [None]:
rawCourseRequests = sqlContext.sql("""
    <FILL IN>
""")

In [None]:
# Test if the filtering works
assert set(rawCourseRequests.columns) == set(["source", "course"])
assert rawCourseRequests.count() == 16182240

### **4.c) Register the intermediary table as rawCourseRequests**

Now we have an intermediary table that contains rows of requests with course code and source. To make additional queries on this dataframe we can register it as an intermediary table.

In [None]:
sqlContext.registerDataFrameAsTable(rawCourseRequests, "rawCourseRequests")

### **4.d) Aggregate from rawCourseRequests and filter out invalid courses (e.g. those with "" names)**

Expected columns are:

1. **course : String** - extracted from resource column
2. **numRequests** - The count of requests per course
3. **numDistinctRequests** - The count of distinct sources that requested the course

Order the result by numDistinctRequests, highest value first.

#### Hint: use COUNT, COUNT(DISTINCT), GROUP BY, ORDER BY

In [None]:
# Replace <FILL IN> with the proper code
courseStatistics = sqlContext.sql("""
    <FILL IN>
    WHERE course <> ""
    <FILL IN>
""")

In [None]:
courseStatistics = courseStatistics.collect()
courseStatistics

In [None]:
# Test the aggregated data
assert len(filter(lambda row: row.course == "", courseStatistics)) == 0
assert reduce(lambda x,y: x+y, map(lambda row: row.numRequests, courseStatistics), 0) == 6222642 
assert reduce(lambda x,y: x+y, map(lambda row: row.numDistinctRequests, courseStatistics), 0) == 583381 