#### Analyzing Traffic Patterns to Wikimedia Projects

**Objective:**
Study traffic patterns to all English Wikimedia projects from the past hour

**Time to Complete:**
30 mins

**Data Source:**
201511 English Projects Pagecounts (~63 GB compressed parquet file)

**Business Questions:**

* Question # 1) How many different English Wikimedia projects saw traffic in the past hour?
* Question # 2) How much traffic did each English Wikimedia project get in the past hour?
* Question # 3) What were the 25 most popular English articles in the past hour?
* Question # 4) How many requests did the "Apache Spark" article receive during this hour?
* Question # 5) Which Apache project received the most requests during this hour?
* Question # 6) What percentage of the 5.1 million English articles were requested in the past hour?
* Question # 7) How many total requests were there to English Wikipedia Desktop edition in the past hour?
* Question # 8) How many total requests were there to English Wikipedia Mobile edition in the past hour?

**Technical Accomplishments:**
- Create a DataFrame
- Print the schema of a DataFrame
- Use the following Transformations: `select()`, `distinct()`, `groupBy()`, `sum()`, `orderBy()`, `filter()`, `limit()`
- Use the following Actions: `show()`, `count()`
- Learn about Wikipedia Namespaces

**NOTE**
Please run this notebook in a Spark 2.0 cluster.

**Introduction: Wikipedia Pagecounts**
-----------

Until August 2016, the Wikimedia Foundation published hourly page count statistics for all Wikimedia projects and languages. The projects include Wikipedia, Wikibooks, Wiktionary, Wikinews, etc. They elected to
[stop publishing that data](https://lists.wikimedia.org/pipermail/analytics/2016-March/005060.html) because it "does not count access to the
mobile site, it does not filter out spider or bot traffic, and it suffers from unknown loss due to logging infrastructure limitations."

However, the historical files are still out there, and they still make for an interesting use case. We'll be using the files from August 5, 2016 at 12:00 PM UTC. We've preloaded that data and converted it to a Parquet file for easy consumption.

You can see the hourly dump files <a href="https://dumps.wikimedia.org/other/pagecounts-raw/" target="_blank">here</a>.

Each line in the pagecounts files contains 4 fields:
- Project name
- Page title
- Number of requests the page received this hour
- Total size in bytes of the content returned
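
As a rough illustration of that four-field layout, the sketch below parses a single raw line in plain Python. It assumes the raw dump lines are space-separated; the `PagecountRecord` type, the `parse_pagecounts_line` helper, and the sample line are hypothetical and exist only for this example (the field names mirror the column names used later in this notebook).

```python
from typing import NamedTuple


class PagecountRecord(NamedTuple):
    project: str       # e.g. "en" or "en.d"
    article: str       # page title, with underscores instead of spaces
    requests: int      # number of requests during the hour
    bytes_served: int  # total size in bytes of the content returned


def parse_pagecounts_line(line: str) -> PagecountRecord:
    """Split one space-separated pagecounts line into its four fields."""
    project, article, requests, bytes_served = line.strip().split(" ")
    return PagecountRecord(project, article, int(requests), int(bytes_served))


# Made-up sample line; the numbers are illustrative, not real traffic data.
record = parse_pagecounts_line("en Apache_Spark 42 1234567")
print(record.project, record.article, record.requests, record.bytes_served)
```

In this notebook the parsing has already been done for you, since the data was preloaded into a Parquet file.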

![Schema Explanation](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/pagecounts/schema_explanation.png)

In each line, the first column (like `en`) is the Wikimedia project name. The following abbreviations are used for the first column:
```
wikipedia mobile: ".mw"
wiktionary: ".d"
wikibooks: ".b"
wikimedia: ".m"
wikinews: ".n"
wikiquote: ".q"
wikisource: ".s"
wikiversity: ".v"
mediawiki: ".w"
```

Projects without a period and a following character are Wikipedia projects. So, any line starting with the column `en` refers to the English language Wikipedia (and can be requests from either a mobile or desktop client).

There will only be one line starting with the column `en.mw`, which will have a total count of the number of requests to English language Wikipedia's mobile edition.

`en.d` refers to English language Wiktionary.

`fr` is French Wikipedia. There are over 290 language possibilities.
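
To make the naming rule concrete, here is a small plain-Python sketch that splits a project code into its language and project parts. The suffix table is copied from the abbreviation list above; the `describe_project` helper is a hypothetical name introduced only for this example.

```python
from typing import Tuple

# Suffix-to-project mapping copied from the abbreviation list above.
PROJECT_SUFFIXES = {
    "mw": "wikipedia mobile",
    "d": "wiktionary",
    "b": "wikibooks",
    "m": "wikimedia",
    "n": "wikinews",
    "q": "wikiquote",
    "s": "wikisource",
    "v": "wikiversity",
    "w": "mediawiki",
}


def describe_project(code: str) -> Tuple[str, str]:
    """Return (language, project) for a code such as 'en' or 'en.d'."""
    language, _, suffix = code.partition(".")
    if not suffix:
        # No period and suffix: this is a Wikipedia project.
        return language, "wikipedia"
    return language, PROJECT_SUFFIXES.get(suffix, "unknown")


for code in ["en", "en.d", "fr", "en.mw"]:
    print(code, "->", describe_project(code))
```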

**Create a DataFrame**
-----

A `spark` object is your entry point for working with structured data (rows and columns) in Spark.

Let's use `spark` to create a DataFrame from the most recent pagecounts file:

In [10]:
pagecounts_en_all_df = spark.read.parquet('/data/training/wikipedia_visitor_stats_201511-parquet/')

Look at the first few records in the DataFrame:

In [12]:
pagecounts_en_all_df.show()

`printSchema()` prints out the schema for the DataFrame, the data types for each column and whether a column can be null:

In [14]:
pagecounts_en_all_df.printSchema()

Notice above that the first 2 columns are typed as `string`, but the requests column holds an `integer` and the bytes_served column holds a `long`.

Count the number of total records (rows) in the DataFrame:

In [17]:
pagecounts_en_all_df.count()

So, there are between 2 and 3 million rows in the DataFrame. This includes traffic not just to English Wikipedia articles, but possibly also to English Wiktionary, Wikibooks, Wikinews, etc.

####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) **Q-1) How many different English Wikimedia projects saw traffic during that hour?**

In [20]:
pagecounts_en_all_df.select('project').distinct().show()

You can see that `show()` is `only showing top 20 rows`. Let's fix this.

How can we tell `show()` to display more lines?

**Challenge 1:**
1. Take a minute to look this up in the [Spark documentation](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.show).
2. Call `show()` on this DataFrame and set its parameter to display 50 lines

In [23]:
# TODO

# Type your answer here...

**Challenge 2:** Can you figure out how to show 10 articles that saw traffic during that hour?

####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) **Q-2) How much traffic did each English Wikimedia project get during that hour?**

The following command will show the total number of requests each English Wikimedia project received:

In [27]:
# show() is an action that prints its output and returns None,
# so the result of the chain is not assigned to a variable.
(pagecounts_en_all_df
    .select('project', 'requests')              # transformation
    .groupBy('project')                         # transformation
    .sum()                                      # transformation
    .orderBy('sum(requests)', ascending=False)  # transformation
    .show()                                     # action
)

English Wikipedia desktop (en) typically gets the highest number of requests, followed by English Wikipedia mobile (en.mw).
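
If the `groupBy()` / `sum()` / `orderBy()` chain feels opaque, the following plain-Python sketch mimics the same logic on a handful of rows. It is only an analogy, not how Spark executes the query, and the `rows` values are made up for illustration.

```python
from collections import Counter

# Hypothetical (project, requests) rows; values are made up for illustration.
rows = [
    ("en", 120),
    ("en.d", 7),
    ("en", 30),
    ("en.b", 2),
    ("en.d", 5),
]

# groupBy('project') followed by sum(): accumulate requests per project.
totals = Counter()
for project, requests in rows:
    totals[project] += requests

# orderBy(..., ascending=False): sort by total requests, largest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```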

####![Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/logo_spark_tiny.png) **Transformations and Actions**

####![Spark Operations](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/spark_ta.png)

DataFrames support two types of operations: *transformations* and *actions*.

Transformations, like `select()` or `filter()`, create a new DataFrame from an existing one.

Actions, like `show()` or `count()`, return a value with results to the user. Other actions like `save()` write the DataFrame to distributed storage (like S3 or HDFS).

####![Spark T/A](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/pagecounts/trans_and_actions.png)

Transformations contribute to a query plan, but nothing is executed until an action is called.
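
Python generators behave in a loosely similar way, which can help build intuition for lazy evaluation. The sketch below is an analogy only and does not use Spark: `lazy_filter` and the `log` list are hypothetical names introduced here to show that no rows are examined until the pipeline is consumed.

```python
log = []  # records every row that actually gets examined


def lazy_filter(rows, predicate):
    """Loosely analogous to a transformation: returns a generator, does no work yet."""
    for row in rows:
        log.append(row)
        if predicate(row):
            yield row


rows = ["en", "en.d", "en", "en.b"]

# "Transformation": builds the pipeline, but nothing has been evaluated.
english_wikipedia = lazy_filter(rows, lambda project: project == "en")
work_before_action = len(log)

# "Action": consuming the generator forces every row to be examined.
matches = sum(1 for _ in english_wikipedia)
work_after_action = len(log)

print(work_before_action, work_after_action, matches)
```

The count of examined rows stays at zero until the generator is consumed, much as a chain of DataFrame transformations does no work until an action such as `show()` or `count()` runs.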

Consider opening the <a href="https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.DataFrame" target="_blank">DataFrame API docs</a> in a new tab to keep it handy as a reference.

You can also hit 'tab' after the DataFrame name to see a drop down of the available methods:

####![tab](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/pagecounts/tab.png)

####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) **Q-3) What were the 25 most popular English articles during the past hour?**

The `filter()` transformation can be used to keep only the rows where the `project` column is `en`, meaning English Wikipedia only:

In [38]:
# Only rows for English Wikipedia (en) will pass this filter, removing projects like Wiktionary, Wikibooks, Wikinews, etc.
pagecounts_en_wikipedia_df = pagecounts_en_all_df.filter(pagecounts_en_all_df['project'] == "en")

Notice above that transformations, like `filter()`, return a new DataFrame.

Next, we can use the `orderBy()` transformation on the requests column to order the requests in descending order:

In [41]:
# Order by the requests column, in descending order
pagecounts_en_wikipedia_df.orderBy('requests', ascending=False).show(25)

In Databricks, there is a special `display()` function that renders a DataFrame as an HTML table:

In [43]:
# Display the DataFrame as an HTML table so it's easier to read.
most_popular_en_wikipedia_df = pagecounts_en_wikipedia_df.orderBy('requests', ascending=False).limit(25)
display(most_popular_en_wikipedia_df)

Hmm, the result doesn't look correct. The article column contains non-articles, like `Special:`, `File:`, `Category:`, `Portal:`, etc. Let's learn about Namespaces so we can filter the non-articles out...

####![Wikipedia Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/logo_wikipedia_tiny.png) **Wikipedia Namespaces**

Wikipedia has many namespaces. The 5.1 million English articles are in the 0 namespace *(in red below)*. The other namespaces are for things like:
- Wikipedian User profiles (`User:` namespace 2)
- Files like images or videos (`File:` namespace 6)
- Draft articles not yet ready for publishing (`Draft:` namespace 118)

The hourly pagecounts file contains traffic requests to all Wikipedia namespaces. We'll need to filter out anything that is not an article.

![namespaces](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/pagecounts/namespaces.png)

Source: <a href="https://en.wikipedia.org/wiki/Wikipedia:Namespace" target="_blank">Wikipedia:Namespace</a>

For example, here is the `User:` page for Jimmy Wales, a co-founder of Wikipedia:

![User:jimbo_wales](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/pagecounts/user-jimbo_wales.png)

Source: <a href="https://en.wikipedia.org/wiki/User:Jimbo_Wales" target="_blank">User:Jimbo_Wales</a>

Which is different from the normal article page for Jimmy Wales *(this is the encyclopedic one)*:

![article-jimmy_wales](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/pagecounts/article-jimmy_wales.png)

Source: <a href="https://en.wikipedia.org/wiki/Jimmy_Wales" target="_blank">Jimmy_Wales</a>

Next, here is an image from the `File:` namespace of Jimmy Wales in 2010:

![File:jimmy_wales_2010](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/pagecounts/file-jimmy_wales.png)

Source: <a href="https://en.wikipedia.org/wiki/File:Jimmy_Wales_July_2010.jpg" target="_blank">File:Jimmy_Wales_July_2010.jpg</a>

Let's filter out everything that is not an article:

In [52]:
# The 19 filters below remove everything that is not an article

pagecounts_en_wikipedia_articles_only_df = (pagecounts_en_wikipedia_df
  .filter(pagecounts_en_wikipedia_df["article"].rlike(r'^((?!Special:)+)'))
  .filter(pagecounts_en_wikipedia_df["article"].rlike(r'^((?!File:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Category:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!User:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Talk:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Template:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Help:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Wikipedia:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!MediaWiki:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Portal:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Book:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Draft:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Education_Program:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!TimedText:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Module:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Topic:)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!Images/)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!%22//upload.wikimedia.org)+)'))
  .filter(pagecounts_en_wikipedia_df['article'].rlike(r'^((?!%22//en.wikipedia.org)+)'))
)
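
To see what those negative-lookahead patterns do, the sketch below applies the same idea to a few titles with Python's `re` module. Two assumptions are worth flagging: Spark's `rlike()` uses Java regular expressions rather than Python's, and folding all prefixes into a single pattern is a simplification made here for illustration, not the approach used in the cell above. The `is_article` helper is a hypothetical name.

```python
import re

# Prefixes taken from the filter chain above.
NON_ARTICLE_PREFIXES = [
    "Special:", "File:", "Category:", "User:", "Talk:", "Template:",
    "Help:", "Wikipedia:", "MediaWiki:", "Portal:", "Book:", "Draft:",
    "Education_Program:", "TimedText:", "Module:", "Topic:", "Images/",
    "%22//upload.wikimedia.org", "%22//en.wikipedia.org",
]

# One negative lookahead anchored at the start of the title:
# it matches only when none of the prefixes appears at position 0.
ARTICLE_PATTERN = re.compile(
    "^(?!" + "|".join(re.escape(prefix) for prefix in NON_ARTICLE_PREFIXES) + ")"
)


def is_article(title: str) -> bool:
    """Return True when the title does not start with a non-article prefix."""
    return ARTICLE_PATTERN.search(title) is not None


for title in ["Jimmy_Wales", "User:Jimbo_Wales", "File:Jimmy_Wales_July_2010.jpg"]:
    print(title, "->", is_article(title))
```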

Finally, repeat the `orderBy()` transformation from earlier:

In [54]:
display(pagecounts_en_wikipedia_articles_only_df.orderBy('requests', ascending=False).limit(25))

That looks better. Above you are seeing the 25 most requested English Wikipedia articles in the past hour!

This can give you a sense of what's popular or trending on the planet right now.

####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) **Q-4) How many requests did the "Apache Spark" article receive during this hour?**

**Challenge 3:** Can you figure out how to filter the `pagecounts_en_wikipedia_articles_only_df` DataFrame for just `Apache_Spark`?

In [58]:
# TODO
# Type your answer here.

####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) **Q-5) Which Apache project received the most requests during this hour?**

In [60]:
# In the Regular Expression below:
# ^  - Matches beginning of line
# .* - Matches any characters, except newline

(pagecounts_en_wikipedia_articles_only_df
  .filter(pagecounts_en_wikipedia_articles_only_df["article"].rlike("""^Apache_.*"""))
  .orderBy('requests', ascending=False)
  .show() # By default, show will return 20 rows
)
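
The pattern `^Apache_.*` can be tried out on a few titles in plain Python, as sketched below. The titles are illustrative only, and Python's `re` module stands in for the Java regex engine that `rlike()` uses. Note that the pattern matches any title beginning with `Apache_`, which may include pages unrelated to the Apache Software Foundation.

```python
import re

APACHE_PATTERN = re.compile(r"^Apache_.*")

# Illustrative titles, not taken from the dataset.
titles = ["Apache_Spark", "Apache_Hadoop", "The_Apache_Wars", "Apache"]

# Titles matched by the regular expression.
matched = [title for title in titles if APACHE_PATTERN.search(title)]

# For this particular pattern, a plain prefix test selects the same titles.
same_as_prefix = [title for title in titles if title.startswith("Apache_")]

print(matched)
print(matched == same_as_prefix)
```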

####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) **Q-6) What percentage of the 5.1 million English articles were requested during the hour?**

Start with the DataFrame that has already been filtered and contains just the English Wikipedia Articles:

In [63]:
display(pagecounts_en_wikipedia_articles_only_df)

Call the `count()` action on the DataFrame to see how many unique English articles were requested in the last hour:

In [65]:
pagecounts_en_wikipedia_articles_only_df.count()

The `count()` action returns a `Long` in Scala and an `int` in Python.

There are currently about 5.1 million articles in English Wikipedia. So the percentage of English articles requested in the past hour is:

In [68]:
(pagecounts_en_wikipedia_articles_only_df.count() / 5100000.0) * 100
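
The arithmetic can be checked in isolation with a made-up count. In the sketch below, `hypothetical_count` is an invented number used only to exercise the formula, not the result from this dataset, and `percent_requested` is a hypothetical helper name.

```python
TOTAL_ENGLISH_ARTICLES = 5_100_000  # approximate figure quoted in the text


def percent_requested(requested_articles: int,
                      total_articles: int = TOTAL_ENGLISH_ARTICLES) -> float:
    """Percentage of all articles that were requested at least once."""
    return requested_articles * 100 / total_articles


# Invented value for illustration; substitute the result of count() in practice.
hypothetical_count = 1_020_000
print(percent_requested(hypothetical_count))
```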

####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) **Q-7) How many total requests were there to English Wikipedia Desktop edition?**

The DataFrame holding English Wikipedia article requests has a 3rd column named `requests`:

In [71]:
pagecounts_en_wikipedia_articles_only_df.printSchema()

If we `groupBy()` the project column and then call `sum()`, we can count how many total requests there were to English Wikipedia:

In [73]:
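# Import the SQL functions package, which includes statistical functions like sum(), max(), min(), avg(), etc.
# Note: this wildcard import shadows Python built-ins of the same name (sum, max, min) for the rest of the notebook.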
from pyspark.sql.functions import *

In [74]:
display(pagecounts_en_wikipedia_df.groupBy("project").sum())

####![Wikipedia + Spark Logo Tiny](http://curriculum-release.s3-website-us-west-2.amazonaws.com/wiki-book/general/wiki_spark_small.png) **Q-8) How many total requests were there to English Wikipedia Mobile edition?**

We'll need to start with the original, base DataFrame, which contains all the English Wikimedia project requests:

In [77]:
display(pagecounts_en_all_df.limit(5))

**Challenge 4**: Set the table for answering this business question:
1. First, run a `filter()` to keep just the rows referring to English Mobile. The mobile edition articles have the `project` column set to `en.m`.
2. Count the rows in the resulting DataFrame

In [79]:
# TODO

# pagecounts_en_mobile_df = <<your filter expression comes here>>
# pagecounts_en_mobile_df.<<the action for counting elements comes here>>

In [80]:
pagecounts_en_mobile_df = pagecounts_en_all_df.filter(pagecounts_en_all_df['project'] == 'en.m')
pagecounts_en_mobile_df.count()

Okay, what do we have?

In [82]:
display(pagecounts_en_mobile_df)

Let's aggregate.

In [84]:
display(pagecounts_en_mobile_df.select(sum(pagecounts_en_mobile_df['requests'])))

The requests column above displays how many total requests English Wikipedia got from mobile clients. About 50% of the traffic to English Wikipedia seems to come from mobile clients.

We will analyze the Mobile vs. Desktop traffic patterns to Wikipedia more in the next notebook.