# Introduction to Interactive Big Data Analysis with Spark
<img src="http://www.copyrightuser.org/wp-content/uploads/2017/05/text_data_mining.jpg">

## Table of Contents

1. [Curation](#1.-Curation)
2. [Preparation](#2.-Preparation)
  1. [Data Importation](#2.A-Data-Importation)
  2. [Python Package Installation](#2.B-Python-Package-Installation)
  3. [Package Importation](#2.C-Package-Importation)
  4. [Context Creation](#2.D-Context-Creation)
3. [Preprocessing](#3.-Preprocessing)
  1. [Creating an RDD](#3.A-Creating-an-RDD)
  2. [Getting Help](#3.B-Getting-Help)
  3. [Action on a Dataset](#3.C-Action-on-a-Dataset)
  4. [Dataset Transformation](#3.D-Dataset-Transformation)
  5. [Filtering a Dataset](#3.E-Filtering-a-Dataset)
  6. [Caching a Dataset](#3.F-Caching-a-Dataset)
4. [Processing](#4.-Processing)
  1. [Valorizing data by transforming the dataset](#4.A-Valorizing-data-by-transforming-the-dataset)
  2. [First analysis: authors' life expectancy](#4.B-First-analysis:-authors'-life-expectancy)
  3. [Second analysis: authors' life expectancy... done correctly](#4.C-Second-analysis:-authors'-life-expectancy...-done-correctly)
5. [Storage]()
6. [Applying new knowledge](#6.-Applying-new-knowledge)
  1. [Preprocessing the pages to extract the text](#6.A-Preprocessing-the-pages-to-extract-the-text)
  2. [Processing: Analysing the work of an era](#6.B-Processing:-Analysing-the-work-of-an-era)
  3. [Learning: Topic modelling](#6.C-Learning:-Topic-modelling)

## List of Exercises
1. [Exercise 1: How to RDD?](#Exercise-1:-How-to-RDD?)
2. [Exercise 2: How to Count?](#Exercise-2:-How-to-Count?)
3. [Exercise 3: How to Transform?](#Exercise-3:-How-to-Transform?)
4. [Exercise 4: How to Filter?](#Exercise-4:-How-to-Filter?)
5. [Exercise 5: How to Extract?](#Exercise-5:-How-to-Extract?)
6. [Exercise 6: How to Reduce?](#Exercise-6:-How-to-Reduce?)
7. [Exercise 7: How to Cache?](#Exercise-7:-How-to-Cache?)

## 1. Curation

<img src="https://ucarecdn.com/d6f7ca55-9121-4d29-a5b3-7bf165b2c9bf/">

From Wikipedia:
> Data curation is a term used to indicate management activities related to organization and integration of data collected from various sources, annotation of the data, and publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation. Data curation includes "all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data". In science, data curation may indicate the process of extraction of important information from scientific texts, such as research articles by experts, to be converted into an electronic format, such as an entry of a biological database. The term is also used in the humanities, where increasing cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation. In broad terms, curation means a range of activities and processes done to create, manage, maintain, and validate a component.

> According to the University of Illinois' Graduate School of Library and Information Science, "Data curation is the active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time."

Curation is a field of its own, so we will not cover the actual techniques behind it. For this course, we will use a precurated dataset, the [eBooks@Adelaide dataset](https://ebooks.adelaide.edu.au/).

The data we will use has been scraped from the website and converted into [JSON](https://en.wikipedia.org/wiki/JSON) files.
<img src="http://prowebscraping.com/wp-content/uploads/2015/09/web-scraping-process1.png">

## 2. Preparation

<img src="http://1.bp.blogspot.com/-oWruWThh0Vo/UtbO00mqpnI/AAAAAAAAAaU/zRZdUBY1I14/s1600/png;base646cf98dd61304919c.png" width="50%">

### 2.A Data Importation
So that all nodes of our cluster can access our data, we have previously copied it to an [NFS](https://en.wikipedia.org/wiki/Network_File_System) share. Here are the commands that we could have used to import the data into [HDFS](https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS) instead.

```Shell
hdfs dfs -mkdir /adelaide/
hdfs dfs -mkdir /adelaide/meta
hdfs dfs -mkdir /adelaide/page
hdfs dfs -put ~/datasets/meta/*.json /adelaide/meta
hdfs dfs -put ~/datasets/page/*.json /adelaide/page
```

We can confirm that the data is actually available by listing the content of the dataset folders.

In [45]:
! ls /project/datasets/adelaide/meta

adelaide_meta_A.json  adelaide_meta_J.json  adelaide_meta_S.json
adelaide_meta_B.json  adelaide_meta_K.json  adelaide_meta_T.json
adelaide_meta_C.json  adelaide_meta_L.json  adelaide_meta_U.json
adelaide_meta_D.json  adelaide_meta_M.json  adelaide_meta_V.json
adelaide_meta_E.json  adelaide_meta_N.json  adelaide_meta_W.json
adelaide_meta_F.json  adelaide_meta_O.json  adelaide_meta_X.json
adelaide_meta_G.json  adelaide_meta_P.json  adelaide_meta_Y.json
adelaide_meta_H.json  adelaide_meta_Q.json  adelaide_meta_Z.json
adelaide_meta_I.json  adelaide_meta_R.json


In [2]:
! ls /project/datasets/adelaide/page

adelaide_page_A.json  adelaide_page_J.json  adelaide_page_S.json
adelaide_page_B.json  adelaide_page_K.json  adelaide_page_T.json
adelaide_page_C.json  adelaide_page_L.json  adelaide_page_U.json
adelaide_page_D.json  adelaide_page_M.json  adelaide_page_V.json
adelaide_page_E.json  adelaide_page_N.json  adelaide_page_W.json
adelaide_page_F.json  adelaide_page_O.json  adelaide_page_X.json
adelaide_page_G.json  adelaide_page_P.json  adelaide_page_Y.json
adelaide_page_H.json  adelaide_page_Q.json  adelaide_page_Z.json
adelaide_page_I.json  adelaide_page_R.json


### 2.B Python Package Installation

To analyze our data, we will need some Python packages:  
- numpy for numeric data manipulation;
- networkx for network analysis;
- plotly for plotting;
- beautifulsoup4 to parse and extract data from HTML pages.

These packages can be installed with the following command:
```
pip install numpy networkx plotly beautifulsoup4
```

In [3]:
! pip install numpy networkx plotly beautifulsoup4

Looking in links: /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/avx2, /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
Collecting numpy
Collecting networkx
Collecting plotly
Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/9e/d4/10f46e5cfac773e22707237bfcd51bbffeaf0a576b0a847ec7ab15bd7ace/beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
    100% |████████████████████████████████| 92kB 1.2MB/s
Collecting pytz (from plotly)
Installing collected packages: numpy, networkx, pytz, plotly, beautifulsoup4
Successfully installed beautifulsoup4-4.6.0 networkx-2.1 numpy-1.14.3 plotly-2.7.0 pytz-2018.4


### 2.C Package Importation

In this notebook, we will use [Apache Spark](http://spark.apache.org) to briefly analyze Adelaide University's book dataset.

First, we need to import Spark's Python module named `pyspark`.

In [4]:
import pyspark

We then import some Python standard modules that will help us during the analysis.

In [5]:
import os
import json
import re

Finally, we import an interactive chart drawing library, [plotly](https://plot.ly/).

In [6]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly import graph_objs as go

init_notebook_mode(connected=True)

### 2.D Context Creation

Once we have imported the required packages, we need to create a SparkContext. The context is an object that allows us to interact with the Spark cluster and create new resilient distributed datasets (RDDs).

In [7]:
conf = pyspark.SparkConf().setAppName("AdelaideNotebook")

try:
    sc = pyspark.SparkContext(conf=conf)
except ValueError:
    # PySpark raises ValueError when a SparkContext is already running.
    print("Warning: a SparkContext already exists.")

The context reads Spark configuration files and automatically deduces the configuration of our cluster.

## 3. Preprocessing
<img src="http://adcieo.com/wp-content/uploads/2015/02/cleansing.jpg">

### 3.A Creating an RDD

We will now create an RDD by reading the JSON files that contain the books' metadata.

In [8]:
adelaide_meta_json = sc.textFile('/project/datasets/adelaide/meta/*.json')

Here is an example of an entry of the `adelaide_meta_json` RDD:

```
{
  "@context": "http://schema.org", 
  "dateModified": "2014-02-26", 
  "image": "https://ebooks.adelaide.edu.au/b/bowen/marjorie/avenging-of-ann-leete/cover.jpg", 
  "author": "Bowen, Marjorie, 1885-1952", 
  "@type": "Book", 
  "source": "https://gutenberg.net.au/ebooks09/0900581.txt", 
  "inLanguage": "en", 
  "publisher": "The University of Adelaide Library", 
  "name": "The Avenging of Ann Leete", 
  "keywords": "Literature", 
  "url": "https://ebooks.adelaide.edu.au/b/bowen/marjorie/avenging-of-ann-leete/", 
  "description": "The Avenging of Ann Leete / Marjorie Bowen"
}
```

We can look at a few entries with the RDD's method `take` to get the first `K` elements of the meta information dataset. Here, `K = 4`.

In [9]:
meta_first4 = adelaide_meta_json.take(4)
print(meta_first4)

['{"url": "https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/", "@context": "http://schema.org", "keywords": "Literature", "dateCreated": "1844", "datePublished": "2015-10-26", "@type": "Book", "author": "Emerson, Ralph Waldo, 1803-1882", "name": "New England Reformers", "description": "New England Reformers : A Lecture read before the Society in Amory Hall, on Sunday, 3 March, 1844 / Ralph Waldo Emerson", "image": "https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/cover.jpg", "publisher": "The University of Adelaide Library", "inLanguage": "en"}', '{"url": "https://ebooks.adelaide.edu.au/m/maupassant/guy/new-sensation/", "@context": "http://schema.org", "keywords": "Literature", "dateCreated": "", "datePublished": "2016-01-26", "@type": "Book", "author": "Maupassant, Guy de, 1850-1893", "name": "The New Sensation", "description": "The New Sensation : (Parisine) [] / Guy de Maupassant", "image": "https://ebooks.adelaide.edu.au/m/maupass

Since `take` returns a list, we can iterate over the result and print each book's information on a separate line.

In [10]:
for entry in meta_first4:
    print(entry)

{"url": "https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/", "@context": "http://schema.org", "keywords": "Literature", "dateCreated": "1844", "datePublished": "2015-10-26", "@type": "Book", "author": "Emerson, Ralph Waldo, 1803-1882", "name": "New England Reformers", "description": "New England Reformers : A Lecture read before the Society in Amory Hall, on Sunday, 3 March, 1844 / Ralph Waldo Emerson", "image": "https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/cover.jpg", "publisher": "The University of Adelaide Library", "inLanguage": "en"}
{"url": "https://ebooks.adelaide.edu.au/m/maupassant/guy/new-sensation/", "@context": "http://schema.org", "keywords": "Literature", "dateCreated": "", "datePublished": "2016-01-26", "@type": "Book", "author": "Maupassant, Guy de, 1850-1893", "name": "The New Sensation", "description": "The New Sensation : (Parisine) [] / Guy de Maupassant", "image": "https://ebooks.adelaide.edu.au/m/maupassant/g

#### Exercise 1: How to RDD?

Create a new RDD named `adelaide_page_json` in which each entry contains a book's URL and its content as HTML code.

The path to the page files is `/project/datasets/adelaide/page/`.

Here is an example of an entry of the `adelaide_page_json` RDD:
```
["https://ebooks.adelaide.edu.au/m/maupassant/guy/kiss/", "<!DOCTYPE html>\n\n<html> [...]"]
```

Then retrieve the first element of that RDD and print it.

In [11]:
adelaide_page_json = sc.textFile('/project/datasets/adelaide/page/*.json')

In [12]:
adelaide_page_json.first()

'["https://ebooks.adelaide.edu.au/m/maupassant/guy/accent/", "<!DOCTYPE html>\\n\\n<html xmlns=\\"http://www.w3.org/1999/xhtml\\">\\n<head>\\n<meta charset=\\"utf-8\\"/>\\n<title>The Accent / Guy de Maupassant</title><script type=\\"application/ld+json\\">\\n{\\n   \\"@context\\" : \\"http://schema.org\\",\\n   \\"@type\\" : \\"Book\\",\\n   \\"author\\" : \\"Maupassant, Guy de, 1850-1893\\",\\n   \\"image\\" : \\"https://ebooks.adelaide.edu.au/m/maupassant/guy/accent/cover.jpg\\",\\n   \\"dateCreated\\" : \\"\\",\\n   \\"datePublished\\" : \\"2016-01-26\\",\\n   \\"description\\" : \\"The Accent : (L\'Accent) [] / Guy de Maupassant\\",\\n   \\"inLanguage\\" : \\"en\\",\\n   \\"name\\" : \\"The Accent\\",\\n   \\"publisher\\" : \\"The University of Adelaide Library\\",\\n   \\"keywords\\" : \\"Literature\\",\\n   \\"url\\" : \\"https://ebooks.adelaide.edu.au/m/maupassant/guy/accent/\\"\\n}\\n</script>\\n<!-- open graph -->\\n<meta content=\\"The Accent\\" property=\\"og:title\\"/>\\n<m

### 3.B Getting Help

At any moment, you can get help on a Python object using the `help()` function. For example, here is how to learn more about the RDD's `take()` method.

In [13]:
help(adelaide_meta_json.take)

Help on method take in module pyspark.rdd:

take(num) method of pyspark.rdd.RDD instance
    Take the first num elements of the RDD.
    
    It works by first scanning one partition, and use the results from
    that partition to estimate the number of additional partitions needed
    to satisfy the limit.
    
    Translated from the Scala implementation in RDD#take().
    
    .. note:: this method should only be used if the resulting array is expected
        to be small, as all the data is loaded into the driver's memory.
    
    >>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
    [2, 3]
    >>> sc.parallelize([2, 3, 4, 5, 6]).take(10)
    [2, 3, 4, 5, 6]
    >>> sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
    [91, 92, 93]



### 3.C Action on a Dataset

The `take()` method is one of several *actions* that we can apply to an RDD. An exhaustive [list of actions is available in the Spark documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions).

If we do not want to leave the notebook tab, we can call `help()` directly on an RDD.

In [14]:
help(adelaide_meta_json)

Help on RDD in module pyspark.rdd object:

class RDD(builtins.object)
 |  A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
 |  Represents an immutable, partitioned collection of elements that can be
 |  operated on in parallel.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, other)
 |      Return the union of this RDD and another one.
 |      
 |      >>> rdd = sc.parallelize([1, 1, 2, 3])
 |      >>> (rdd + rdd).collect()
 |      [1, 1, 2, 3, 1, 1, 2, 3]
 |  
 |  __getnewargs__(self)
 |  
 |  __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer()))
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  aggregate(self, zeroValue, seqOp, combOp)
 |      Aggregate the elements of each partition, and then the results for all
 |      the partitions, using a given combine functions and a neutral "zero
 |      value."
 |      
 |      The functions C{op(t1, t2

Among the available actions, there is a method named `count()`.

#### Exercise 2: How to Count?

Call the `help()` function on the `count()` method of the `adelaide_meta_json` RDD to learn more about this action. Then, apply the action to both RDDs and print the results.

In [15]:
meta_count = adelaide_meta_json.count()
page_count = adelaide_page_json.count()

In [16]:
print(meta_count)

4434


In [17]:
print(page_count)

4434


Each action applied to an RDD creates one or more tasks and produces a result. Every task executed in the same application can be inspected in the Spark dashboard, where we can track its progress and check performance measures such as its duration and cache statistics.

### 3.D Dataset Transformation

If we display the first 4 elements of our datasets that we retrieved earlier,

In [18]:
meta_first4

['{"url": "https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/", "@context": "http://schema.org", "keywords": "Literature", "dateCreated": "1844", "datePublished": "2015-10-26", "@type": "Book", "author": "Emerson, Ralph Waldo, 1803-1882", "name": "New England Reformers", "description": "New England Reformers : A Lecture read before the Society in Amory Hall, on Sunday, 3 March, 1844 / Ralph Waldo Emerson", "image": "https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/cover.jpg", "publisher": "The University of Adelaide Library", "inLanguage": "en"}',
 '{"url": "https://ebooks.adelaide.edu.au/m/maupassant/guy/new-sensation/", "@context": "http://schema.org", "keywords": "Literature", "dateCreated": "", "datePublished": "2016-01-26", "@type": "Book", "author": "Maupassant, Guy de, 1850-1893", "name": "The New Sensation", "description": "The New Sensation : (Parisine) [] / Guy de Maupassant", "image": "https://ebooks.adelaide.edu.au/m/maupas

we realize that the RDD is composed of the lines from the input text files. However, it is not possible to access the individual fields of each dictionary. **Why?**

In [19]:
meta_first = adelaide_meta_json.first()
meta_first

'{"url": "https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/", "@context": "http://schema.org", "keywords": "Literature", "dateCreated": "1844", "datePublished": "2015-10-26", "@type": "Book", "author": "Emerson, Ralph Waldo, 1803-1882", "name": "New England Reformers", "description": "New England Reformers : A Lecture read before the Society in Amory Hall, on Sunday, 3 March, 1844 / Ralph Waldo Emerson", "image": "https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/cover.jpg", "publisher": "The University of Adelaide Library", "inLanguage": "en"}'

In [20]:
meta_first['url']

TypeError: string indices must be integers

In [22]:
meta_first[100:120]

'http://schema.org", '

The action `first()`, as its name states, returns the first entry of the dataset. We see that **each entry is a single string**. We need to transform each entry of the RDD to convert the JSON-encoded string into a Python dictionary. To do this, we will use the Python standard library function **`json.loads`**, which converts a JSON-encoded string into its Python equivalent.

First, let's test `json.loads` on the first entry we just retrieved.

In [23]:
json.loads(meta_first)

{'url': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/',
 '@context': 'http://schema.org',
 'keywords': 'Literature',
 'dateCreated': '1844',
 'datePublished': '2015-10-26',
 '@type': 'Book',
 'author': 'Emerson, Ralph Waldo, 1803-1882',
 'name': 'New England Reformers',
 'description': 'New England Reformers : A Lecture read before the Society in Amory Hall, on Sunday, 3 March, 1844 / Ralph Waldo Emerson',
 'image': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/cover.jpg',
 'publisher': 'The University of Adelaide Library',
 'inLanguage': 'en'}

In [26]:
meta_first_dict = json.loads(meta_first)

In [46]:
print(meta_first_dict['url'])

https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/


We now want to apply this transformation to every entry in the RDD. The RDD method `map(func)` returns a new distributed dataset formed by applying the function *func* to each element of the source.

In [47]:
adelaide_meta = adelaide_meta_json.map(json.loads)

The evaluation of this transformation is *lazy*: Spark does not compute anything until an action requests a result. To convince yourself, execute the preceding cell, then visit the Spark dashboard. You should see that no job has been added to the list.
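
Python's built-in `map` is lazy in a similar way, which makes the idea easy to observe without a cluster. The sketch below is only an analogy, not Spark's implementation: `traced_loads` and the two sample lines are hypothetical, and consuming the iterator with `list()` stands in for a Spark action.

```python
import json

calls = []  # records every line that actually gets parsed


def traced_loads(line):
    """Parse one JSON line and remember that the work happened."""
    calls.append(line)
    return json.loads(line)


# Hypothetical sample lines, shaped like the metadata entries.
lines = ['{"name": "Nana"}', '{"name": "The Accent"}']

# Like rdd.map(...), the built-in map() only describes the computation.
lazy_records = map(traced_loads, lines)
calls_before = len(calls)  # nothing has been parsed yet

# Consuming the iterator plays the role of an action such as collect().
records = list(lazy_records)
calls_after = len(calls)

print(calls_before, calls_after)  # 0 2
```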

To convince ourselves that the transformation will be successfully applied, we can retrieve the first element of the transformed RDD.

In [48]:
print(adelaide_meta.first())
print(adelaide_meta.first()['url'])

{'url': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/', '@context': 'http://schema.org', 'keywords': 'Literature', 'dateCreated': '1844', 'datePublished': '2015-10-26', '@type': 'Book', 'author': 'Emerson, Ralph Waldo, 1803-1882', 'name': 'New England Reformers', 'description': 'New England Reformers : A Lecture read before the Society in Amory Hall, on Sunday, 3 March, 1844 / Ralph Waldo Emerson', 'image': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/cover.jpg', 'publisher': 'The University of Adelaide Library', 'inLanguage': 'en'}
https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/


#### Exercise 3: How to Transform?

Apply the JSON transformation to the page RDD that we created in Exercise 1 and print the URL of the first element of that dataset.

In [34]:
adelaide_page = adelaide_page_json.map(json.loads)

In [35]:
adelaide_page.first()[0]

'https://ebooks.adelaide.edu.au/m/maupassant/guy/accent/'

### 3.E Filtering a Dataset

#### 3.E.1 Filtering Bad Entries

Since we now have RDDs that are easier to manipulate, we can start the analysis. 

Our dataset was built by scraping the webpages of Adelaide University. However, during the process, some of the webpages could not be fetched by our spider program. Therefore, we end up with two kinds of entries in our dataset.

Good entry example:
```
{"@context": "http://schema.org", "dateModified": "2014-02-26", "image": "https://ebooks.adelaide.edu.au/b/bowen/marjorie/avenging-of-ann-leete/cover.jpg", "author": "Bowen, Marjorie, 1885-1952", "@type": "Book", "source": "https://gutenberg.net.au/ebooks09/0900581.txt", "inLanguage": "en", "publisher": "The University of Adelaide Library", "name": "The Avenging of Ann Leete", "keywords": "Literature", "url": "https://ebooks.adelaide.edu.au/b/bowen/marjorie/avenging-of-ann-leete/", "description": "The Avenging of Ann Leete / Marjorie Bowen"}
```

Bad entry example:
```
{"description": "ERROR_COMP_NOT_FOUND"}
```

For the next operation, we wish to keep only the entries for which we know at least the name of the author and the title of the book. To do so, we first define a function that returns `True` if both the `author` and `name` fields are in the dictionary.

In [37]:
def is_author_title_defined(record):
    # Test each field explicitly.
    return "author" in record and "name" in record
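
When checking for several keys at once, it is worth remembering that `in` binds more tightly than `and`: an expression such as `"author" and "name" in record` is parsed as `"author" and ("name" in record)`, so only `name` is actually tested. Here is a minimal sketch of the difference, using hypothetical helper names and a made-up record:

```python
def short_form(record):
    # Parsed as: "author" and ("name" in record).
    # The non-empty string "author" is always truthy, so only "name" is tested.
    return "author" and "name" in record


def explicit_form(record):
    # Each key gets its own membership test.
    return all(key in record for key in ("author", "name"))


# Hypothetical record with a title but no author.
no_author = {"name": "Nana"}

print(short_form(no_author))     # True
print(explicit_form(no_author))  # False
```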

Try to answer the following quiz before executing the cell:  
* What sort of argument does the `filter()` method take?
* Is filter an action or a transformation?
* What does `filter()` return?

In [38]:
adelaide_meta_filt = adelaide_meta.filter(is_author_title_defined)

In [40]:
adelaide_meta.count()

4434

In [50]:
adelaide_meta_filt.count()

4366

In [42]:
look_for_more = adelaide_meta.filter(lambda x: not is_author_title_defined(x)).collect()

In [44]:
look_for_more

[{'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'description': 'ERROR_COMP_NOT_FOUND'},
 {'inLanguage': 'eng',
  '@context

#### Exercise 4: How to Filter?

For the next exercise, you will design your own bad entry filter. The page RDD's entries are not dictionaries but lists. Here is an example of a bad entry:
```
["https://ebooks.adelaide.edu.au/d/dante/", "None"]
```

Write a function that returns `True` if the entry is good and `False` if it is bad, then create a new RDD named `adelaide_page_filt` by applying your filter function to every entry of `adelaide_page`.

To assess your filter design, count the number of elements in the resulting RDD. How many entries have you filtered?

In [51]:
bad_entry_example = ["https://ebooks.adelaide.edu.au/d/dante/", "None"]

In [52]:
bad_entry_example[1]

'None'

In [53]:
bad_entry_example[1] == 'None'

True

In [54]:
bad_entry_example[1] != 'None'

False

In [55]:
def filter_bad_entry(entry):
    return entry[1] != 'None'

In [56]:
filter_bad_entry(bad_entry_example)

False

In [57]:
filter_bad_entry(adelaide_page.first())

True

In [58]:
adelaide_page_filt = adelaide_page.filter(filter_bad_entry)

In [59]:
adelaide_page_filt.count()

4388

#### 3.E.2 Filtering Duplicate Entries

The meta-information of each book has been recovered by scraping the website of [Adelaide University's eBook Library](https://ebooks.adelaide.edu.au/). Since two pages could point to the same book, a book may be present more than once in our dataset.

#### Exercise 5: How to Extract?

**To confirm that some books are present more than once in our dataset, transform the dataset `adelaide_meta_filt` into a second dataset that only includes URLs.**

In [60]:
meta_urls = adelaide_meta_filt.map(lambda dictionary: dictionary['url'])

RDDs have a special method named `distinct()` that returns a new RDD containing strictly unique values. Next, we are going to call this method on our RDD of URLs and count the number of elements.

In [61]:
meta_urls.count() - meta_urls.distinct().count()

305

There are 305 duplicate entries in our dataset. To remove them, we first need to associate each entry with an identifier that should be unique to each book. We will name this identifier the "key". A unique identifier for a webpage is its URL. Since every book is associated with a URL, we will use the URLs as the keys of our entries.

The RDD method `keyBy()` lets us associate each entry in our dataset with a key. The key is computed from an item of the record, in this case the URL.

In [62]:
adelaide_meta_url_key = adelaide_meta_filt.keyBy(lambda rec: rec['url'])

Each entry now has its own key. To convince ourselves, we can fetch the first element of that last RDD.

In [64]:
adelaide_meta_filt.first()

{'url': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/',
 '@context': 'http://schema.org',
 'keywords': 'Literature',
 'dateCreated': '1844',
 'datePublished': '2015-10-26',
 '@type': 'Book',
 'author': 'Emerson, Ralph Waldo, 1803-1882',
 'name': 'New England Reformers',
 'description': 'New England Reformers : A Lecture read before the Society in Amory Hall, on Sunday, 3 March, 1844 / Ralph Waldo Emerson',
 'image': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/cover.jpg',
 'publisher': 'The University of Adelaide Library',
 'inLanguage': 'en'}

In [65]:
rec = adelaide_meta_filt.first()

In [66]:
(rec['url'], rec)

('https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/',
 {'url': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/',
  '@context': 'http://schema.org',
  'keywords': 'Literature',
  'dateCreated': '1844',
  'datePublished': '2015-10-26',
  '@type': 'Book',
  'author': 'Emerson, Ralph Waldo, 1803-1882',
  'name': 'New England Reformers',
  'description': 'New England Reformers : A Lecture read before the Society in Amory Hall, on Sunday, 3 March, 1844 / Ralph Waldo Emerson',
  'image': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/cover.jpg',
  'publisher': 'The University of Adelaide Library',
  'inLanguage': 'en'})

In [63]:
adelaide_meta_url_key.first()

('https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/',
 {'url': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/',
  '@context': 'http://schema.org',
  'keywords': 'Literature',
  'dateCreated': '1844',
  'datePublished': '2015-10-26',
  '@type': 'Book',
  'author': 'Emerson, Ralph Waldo, 1803-1882',
  'name': 'New England Reformers',
  'description': 'New England Reformers : A Lecture read before the Society in Amory Hall, on Sunday, 3 March, 1844 / Ralph Waldo Emerson',
  'image': 'https://ebooks.adelaide.edu.au/e/emerson/ralph_waldo/new-england-reformers/cover.jpg',
  'publisher': 'The University of Adelaide Library',
  'inLanguage': 'en'})

The last step is to keep only one value for each entry that shares the same key. To do this, we will apply a reduction operation, such that if we are given two records `a` and `b` with the same key, we only return the first record `a`. 

In [67]:
adelaide_meta_unique = adelaide_meta_url_key.reduceByKey(lambda a, b: a)

This sequence of operations is called a reduction on a key-value pair RDD. We will come back to it in more detail two sections from now.
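
To make the mechanics concrete, here is a plain-Python sketch of what keying and reducing by key amount to. It is an illustration only, not how Spark implements these methods: `key_by`, `reduce_by_key`, and the three sample records are hypothetical. One difference to keep in mind is that this sketch merges values in input order, whereas Spark gives no guarantee about the order in which values sharing a key are combined, so `lambda a, b: a` keeps *one* of the duplicates rather than a specific one.

```python
# Hypothetical records: two of them share the same URL.
records = [
    {"url": "https://example.org/a/", "name": "A, first copy"},
    {"url": "https://example.org/b/", "name": "B"},
    {"url": "https://example.org/a/", "name": "A, second copy"},
]


def key_by(items, key_func):
    """Pair every item with its key, giving (key, item) tuples."""
    return [(key_func(item), item) for item in items]


def reduce_by_key(pairs, func):
    """Merge all values that share a key using func(a, b)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


pairs = key_by(records, lambda rec: rec["url"])
unique = reduce_by_key(pairs, lambda a, b: a)    # keep one record per key
unique_records = [value for _, value in unique]  # drop the keys again

print(len(records), len(unique_records))  # 3 2
```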

#### Exercise 6: How to Reduce?

The page dataset may also include duplicated pages. Identify what should be the key of that dataset then try to create a new dataset with only unique books. What is the structure of a record after applying the `keyBy()` method? Do we need to transform this dataset or could it be reduced directly?

Count the number of elements in the resulting RDD to confirm the validity of your transformation.

In [None]:
adelaide_page_unique = adelaide_page_filt.<FILL IN>

Since we no longer need the keys, we can retrieve the values by calling the `values()` method on the RDD.

In [None]:
adelaide_page_filt.take(1)

In [70]:
adelaide_meta_unique.first()

('https://ebooks.adelaide.edu.au/z/zola/emile/z8nf/',
 {'url': 'https://ebooks.adelaide.edu.au/z/zola/emile/z8nf/',
  '@context': 'http://schema.org',
  'dateCreated': '1880',
  'datePublished': '2003-03-16',
  'dateModified': '2014-03-04',
  '@type': 'Book',
  'author': 'Zola, Émile, 1840-1902',
  'name': 'Nana',
  'description': 'Nana / Emile Zola',
  'image': 'https://ebooks.adelaide.edu.au/z/zola/emile/z8nf/cover.jpg',
  'publisher': 'The University of Adelaide Library',
  'inLanguage': 'fr'})

In [71]:
adelaide_meta_uniq = adelaide_meta_unique.values()

### 3.F Caching a Dataset

When we expect to operate frequently on the same dataset, it can be useful to tell Spark to keep it in memory.

To do so, we use the `cache()` method.

In [72]:
adelaide_meta_uniq.cache()

PythonRDD[40] at RDD at PythonRDD.scala:48

The RDDs stored in memory are displayed in the **Storage** section of the Spark web interface. Note that datasets are not loaded in memory until an action is applied to them.

Actions on cached datasets are much faster than on non-cached datasets. However, a dataset is only actually cached once an action has been applied to it. Based on that, try to explain the execution time of the following cells.

In [73]:
%time adelaide_meta_uniq.count()

CPU times: user 8.47 ms, sys: 2.84 ms, total: 11.3 ms
Wall time: 472 ms


4061

In [74]:
%time adelaide_meta_uniq.count()

CPU times: user 4.14 ms, sys: 5.05 ms, total: 9.19 ms
Wall time: 146 ms


4061
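
The behaviour described above (marking a dataset for caching does nothing by itself, and the first action fills the cache) can be mimicked with a toy class. This is a deliberately simplified model under stated assumptions, not Spark's caching mechanism: `ToyDataset` and its attributes are hypothetical names, and the five-element range is arbitrary.

```python
class ToyDataset:
    """Toy stand-in for an RDD: lazy computation with an optional cache."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the elements
        self._use_cache = False
        self._stored = None
        self.compute_calls = 0     # how many times the data was recomputed

    def cache(self):
        # Only sets a flag; nothing is computed or stored yet.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._stored is not None:
            return self._stored    # served from memory
        self.compute_calls += 1
        data = list(self._compute())
        if self._use_cache:
            self._stored = data    # the first action fills the cache
        return data

    def count(self):
        return len(self._materialize())


dataset = ToyDataset(lambda: range(5)).cache()
calls_after_cache = dataset.compute_calls  # still 0: cache() did no work
first = dataset.count()                    # computes and stores the data
second = dataset.count()                   # reuses the stored data

print(calls_after_cache, dataset.compute_calls, first, second)  # 0 1 5 5
```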

#### Exercise 7: How to Cache?

Cache the RDD from exercise 6 and evaluate how long it takes to retrieve the first 5 elements before and after caching. 

What happens when there is not enough memory to cache an RDD? 

Can you figure out how to *uncache* an RDD?

In [None]:
adelaide_page_unique.<FILL IN>

In [None]:
%time pages_5 = adelaide_page_unique.take(5)

In [None]:
%time pages_5 = adelaide_page_unique.take(5)

In [None]:
adelaide_page_unique.<FILL IN>

## 4. Processing

<img src="http://www.lightspeedgmi.com/wp-content/uploads/2015/03/dataprocessing-circle.png">

To grasp the extent of our dataset, we can count the number of entries it contains.

In [75]:
adelaide_meta_uniq.count()

4061

In [76]:
adelaide_meta_uniq.first()

{'url': 'https://ebooks.adelaide.edu.au/z/zola/emile/z8nf/',
 '@context': 'http://schema.org',
 'dateCreated': '1880',
 'datePublished': '2003-03-16',
 'dateModified': '2014-03-04',
 '@type': 'Book',
 'author': 'Zola, Émile, 1840-1902',
 'name': 'Nana',
 'description': 'Nana / Emile Zola',
 'image': 'https://ebooks.adelaide.edu.au/z/zola/emile/z8nf/cover.jpg',
 'publisher': 'The University of Adelaide Library',
 'inLanguage': 'fr'}

Each entry of our dataset is a dictionary. Each dictionary can contain a variable number of keys, and the dictionaries in our dataset do not necessarily share the same keys.

We can extract the keys from each dictionary and count how many times each key is present. To access the keys of a dictionary, we can use the method `keys()`.

In [77]:
adelaide_meta_uniq.first().keys()

dict_keys(['url', '@context', 'dateCreated', 'datePublished', 'dateModified', '@type', 'author', 'name', 'description', 'image', 'publisher', 'inLanguage'])

In [79]:
adelaide_meta_uniq.map(lambda rec: list(rec.keys())).take(2)

[['url',
  '@context',
  'dateCreated',
  'datePublished',
  'dateModified',
  '@type',
  'author',
  'name',
  'description',
  'image',
  'publisher',
  'inLanguage'],
 ['author',
  '@context',
  'keywords',
  'inLanguage',
  'url',
  'dateModified',
  'name',
  '@type',
  'image',
  'description',
  'publisher']]

We want to apply this method to every dictionary in our dataset, so that would be a map. However, if we simply apply a map, we will get an RDD of key lists. What we truly want is to merge these lists to get an RDD of keys.

Spark has a transformation that merges the iterables returned by a function: `flatMap`.
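The difference between the two transformations can be illustrated in plain Python, without Spark. This is only a sketch: the two records are made up, and list comprehensions with `itertools.chain` stand in for `map` and `flatMap`.

```python
from itertools import chain

# Two made-up records with partly different keys.
records = [
    {"url": "u1", "name": "Nana", "author": "Zola, Émile, 1840-1902"},
    {"url": "u2", "name": "No-Man's-Land", "keywords": "Literature"},
]

# Equivalent of rdd.map(...): one list of keys per record.
mapped = [list(rec.keys()) for rec in records]

# Equivalent of rdd.flatMap(...): the per-record lists are concatenated.
flat_mapped = list(chain.from_iterable(list(rec.keys()) for rec in records))

print(mapped)
print(flat_mapped)
```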

In [80]:
adelaide_keys = adelaide_meta_uniq.flatMap(lambda rec: list(rec.keys()))

In [82]:
adelaide_keys.take(20)

['url',
 '@context',
 'dateCreated',
 'datePublished',
 'dateModified',
 '@type',
 'author',
 'name',
 'description',
 'image',
 'publisher',
 'inLanguage',
 'author',
 '@context',
 'keywords',
 'inLanguage',
 'url',
 'dateModified',
 'name',
 '@type']

We can inspect the first 5 elements of our dataset.

In [None]:
adelaide_keys.take(5)

We are now interested in finding the frequency of each key. This will give us an idea of the completeness of our dataset. To compute the key frequency, we will need to apply a classic map-reduce pattern.

First, we need to pair each key with the basic frequency value 1. This is the map operation.

In [83]:
adelaide_key_value = adelaide_keys.map(lambda key: (key, 1))

We can inspect the result of the transformation by looking at the first element:

In [84]:
adelaide_key_value.first()

('url', 1)

Next, we will sum the values that share the same key. This is the reduce operation.

In [85]:
from operator import add

# adelaide_agg = adelaide_key_value.reduceByKey(add)
adelaide_agg = adelaide_key_value.reduceByKey(lambda a,b : a + b)

Finally, we can collect our transformed RDD. Key-value pair RDDs have a special method, `collectAsMap`, that returns the result as a dictionary.

In [91]:
adelaide_agg.collectAsMap()['@context']

4061
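For readers who want to see the whole pattern in one place, here is a plain-Python sketch of the same map-reduce steps on a handful of made-up keys. The helper `reduce_by_key` is hypothetical and only imitates what `reduceByKey` does on a single machine.

```python
from functools import reduce
from operator import add

keys = ["url", "author", "name", "url", "name", "url"]

# Map step: pair every key with the count 1.
pairs = [(key, 1) for key in keys]


def reduce_by_key(func, key_value_pairs):
    """Combine the values that share a key, as reduceByKey does."""
    grouped = {}
    for key, value in key_value_pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Reduce step: sum the 1s of each key.
counts = reduce_by_key(add, pairs)
print(counts)
```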

We observe that only a few keys are present in most dictionaries of our dataset. We should therefore restrict our analysis to these fields or create new fields from the frequent ones.

### 4.A Valorizing data by transforming the dataset

In [92]:
def process_name_birth_death(record):
    author = record.get('author', None)
    if author:
        author = author.strip()
        # Remove trailing dot
        if '.' == author[-1]:
            author = author[:-1]
        try:
            lastname, firstname, birth_death = author.split(',')
        except ValueError:
            return record
        try:
            birth, death = re.findall(r'\d+', birth_death)
        except ValueError:
            return record
        record['author_lastname'] = lastname.strip()
        record['author_firstname'] = firstname.strip()
        record['author_birth'] = int(birth)
        record['author_death'] = int(death)
    return record
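The parsing logic of `process_name_birth_death` can be tried in isolation on a few author strings. The function below is a simplified sketch, not the notebook's function: `parse_author` is a hypothetical name, it returns a tuple instead of updating the record, and it assumes the `Lastname, Firstname, birth-death` layout seen in the sample records.

```python
import re


def parse_author(author):
    """Return (lastname, firstname, birth, death), or None if the format differs."""
    parts = author.strip().rstrip(".").split(",")
    if len(parts) != 3:
        return None
    lastname, firstname, years = (part.strip() for part in parts)
    found = re.findall(r"\d+", years)
    if len(found) != 2:
        return None
    birth, death = (int(year) for year in found)
    return lastname, firstname, birth, death


print(parse_author("Zola, Émile, 1840-1902"))
print(parse_author("Anonymous"))
```

Strings that do not split into exactly three comma-separated parts, or that do not contain exactly two numbers, are left unparsed, which mirrors the two `try`/`except ValueError` blocks.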

In [93]:
adelaide_meta_val = adelaide_meta_unique.mapValues(process_name_birth_death).values()

In [94]:
adelaide_meta_val.take(2)

[{'url': 'https://ebooks.adelaide.edu.au/z/zola/emile/z8nf/',
  '@context': 'http://schema.org',
  'dateCreated': '1880',
  'datePublished': '2003-03-16',
  'dateModified': '2014-03-04',
  '@type': 'Book',
  'author': 'Zola, Émile, 1840-1902',
  'name': 'Nana',
  'description': 'Nana / Emile Zola',
  'image': 'https://ebooks.adelaide.edu.au/z/zola/emile/z8nf/cover.jpg',
  'publisher': 'The University of Adelaide Library',
  'inLanguage': 'fr',
  'author_lastname': 'Zola',
  'author_firstname': 'Émile',
  'author_birth': 1840,
  'author_death': 1902},
 {'author': 'Buchan, John, 1875-1940',
  '@context': 'http://schema.org',
  'keywords': 'Literature',
  'inLanguage': 'en',
  'url': 'https://ebooks.adelaide.edu.au/b/buchan/john/no_man_s_land/',
  'dateModified': '2014-02-26',
  'name': 'No-Man’s-Land',
  '@type': 'Book',
  'image': 'https://ebooks.adelaide.edu.au/b/buchan/john/no_man_s_land/cover.jpg',
  'description': 'No-Man’s-Land / John Buchan',
  'publisher': 'The University of Adel

In [95]:
def convert_dateCreated(record):
    if 'dateCreated' in record:
        dates = re.findall(r'\d+', record['dateCreated'])
        if len(dates) > 0:
            date = int(dates[0])
            # Check if the date is before common era. re.search scans the
            # whole string, so a marker after the number is also found.
            if re.search(r'BC|bc|BCE|bce', record['dateCreated']):
                date *= -1
            record['dateCreated'] = date
        else:
            del record['dateCreated']
    return record

In [96]:
adelaide_meta_val = adelaide_meta_val.map(convert_dateCreated)
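The year conversion can be checked on a few strings before running it on the whole dataset. The sketch below uses a hypothetical helper `parse_year` and made-up input strings; the actual `dateCreated` values in the dataset may follow other formats.

```python
import re


def parse_year(text):
    """Return the first number in text as an int, negated for BC/BCE dates."""
    numbers = re.findall(r"\d+", text)
    if not numbers:
        return None
    year = int(numbers[0])
    # re.search scans the whole string; re.match would only test its start.
    if re.search(r"BCE?", text, flags=re.IGNORECASE):
        year = -year
    return year


for example in ["1880", "450 BC", "1880-1881", "unknown"]:
    print(example, "->", parse_year(example))
```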

### 4.B First analysis: authors' life expectancy

In [97]:
def compute_age(rec):
    """Compute the age of an author when they died
    based on their year of birth and death.
    """
    if 'author_birth' in rec and 'author_death' in rec:
        birth, death =  rec['author_birth'], rec['author_death']
        if birth < death:
            return death - birth
        else:
            # If the year of birth is greater than the year of death, the
            # author was born in BCE. Do you think this is correct
            # in every case?
            return birth - death
    else:
        return None

age_frequency = adelaide_meta_val.map(compute_age)\
                                 .countByValue()
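When testing for several keys at once, operator precedence matters: in `'a' and 'b' in d`, only the membership of `'b'` is tested, because the non-empty string `'a'` is always truthy. A small sketch on a made-up record:

```python
rec = {"author_death": 1902}   # hypothetical record without a birth year

# 'author_birth' is a non-empty string, hence always truthy: only the
# second membership test decides the outcome.
loose = bool('author_birth' and 'author_death' in rec)

# Testing each key explicitly gives the intended result.
strict = 'author_birth' in rec and 'author_death' in rec

# Equivalent set-based form, convenient for more than two keys.
strict_set = rec.keys() >= {'author_birth', 'author_death'}

print(loose, strict, strict_set)
```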

Visualization can be done with multiple tools, here we use plotly.

In [98]:
data = [
    go.Bar(
        x=list(age_frequency.keys()),
        y=list(age_frequency.values()),
    )
]

layout = go.Layout(
    title="Adelaide authors life expectancy",
    xaxis=dict(
        title='life expectancy (years)',
    ),
    yaxis=dict(
        title='number of authors',
    ),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

A surprising number of authors died at the age of 43. Either there is a pattern among authors, or we have made a mistake in our analysis.

What happens if an author has written more than one book? We need to remember that our dataset is composed of books, not authors. If we want to produce statistics on authors, we need to keep only distinct authors.
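The effect of counting books instead of authors is easy to reproduce on a tiny example. The tuples below are made up for illustration (one per book), and `collections.Counter` plays the role of `countByValue`.

```python
from collections import Counter

# Hypothetical (firstname, lastname, age) tuples, one per book.
books = [
    ("Émile", "Zola", 62),
    ("Émile", "Zola", 62),
    ("Émile", "Zola", 62),
    ("John", "Buchan", 65),
]

# Counting per book inflates the ages of prolific authors.
per_book = Counter(age for _, _, age in books)

# Removing duplicate tuples first counts every author once.
per_author = Counter(age for _, _, age in set(books))

print(per_book)
print(per_author)
```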

### 4.C Second analysis: authors' life expectancy... done correctly

In [99]:
def retrieve_name_age(rec):
    age = compute_age(rec)
    lastname = rec['author_lastname']
    firstname = rec['author_firstname']
    return firstname, lastname, age

authors = adelaide_meta_val.filter(lambda rec: 'author_lastname' in rec)\
                           .map(retrieve_name_age)

In [100]:
unique_authors = authors.distinct()
age_frequency2 = unique_authors.map(lambda tup: tup[2]).countByValue()

In [101]:
data = [
    go.Bar(
        x=list(age_frequency2.keys()),
        y=list(age_frequency2.values()),
    )
]

layout = go.Layout(
    title="Adelaide authors life expectancy",
    xaxis=dict(
        title='life expectancy (years)',
    ),
    yaxis=dict(
        title='number of authors',
    ),    
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

## 5. Storage

We could be interested in saving the result of our processing on disk. RDDs have some `saveAs[...]` methods to do this.

In [None]:
adelaide_meta_uniq.saveAsTextFile('<FILL IN>')

## 6. Applying new knowledge
### 6.A Preprocessing the pages to extract the text

In [None]:
from itertools import chain
from operator import itemgetter
from bs4 import BeautifulSoup

In [None]:
url, book1 = adelaide_page.first()

In [None]:
adelaide_page_unique = adelaide_page.groupByKey()\
                                    .mapValues(list)\
                                    .mapValues(itemgetter(0))

In [None]:
def extract_text(page):
    if page:
        soup = BeautifulSoup(page, 'html.parser')
        it = chain(soup.findAll(['meta', 'script', 'head']),
                   soup.findAll('div', {"id" : "controls"}),
                   soup.findAll('div', {"class" : "contents"}),
                   soup.findAll('div', {"class" : "titleverso"}),
                   soup.findAll('div', {"class" : "colophon"}),
                   soup.findAll('span', {"class" : "author"}))
        for div in it:
            div.extract()
        return soup.get_text().strip()

In [None]:
adelaide_text = adelaide_page_unique.mapValues(extract_text)

In [None]:
adelaide_text.count()

### 6.B Processing: Analysing the work of an era

We have a dataset of 4371 different books written at different times. Let's suppose we want to study the texts of the books written during the Modernism era (1901–1939).

First we need to identify which books in our dataset were written during this era.

In [None]:
modern_era_books = adelaide_meta_val.filter(lambda rec: 1901 <= rec.get('dateCreated', 0) <= 1939)

Some of these books are not necessarily in English. We therefore need to apply a second filter on the language (`inLanguage`).

In [None]:
modern_era_en_books = modern_era_books.filter(lambda rec: rec.get('inLanguage', '') == 'en')

We can now count how many English books from our dataset were written during this era.

In [None]:
modern_era_en_books.count()

We can also count the number of distinct authors:

In [None]:
modern_era_books.map(lambda rec: rec['author']).distinct().count()

The meta information about each book and the book's text are stored in two separate RDDs. In order to retrieve the texts written during the Modernism era, we will need to join the RDD of modern era book meta information with the RDD of books' text.

To do so, we will first need to define the modern era book RDD as an RDD of key-value pairs. 

In [None]:
modern_era_books_kv = modern_era_books.keyBy(lambda rec: rec['url'])

Then, we can join the RDD of meta information on modern era books with the RDD of books' text. 

However, we first need to address a small problem: the URLs of the texts do not exactly match the ones from the meta information. To join two RDDs, the keys need to match perfectly.

In [None]:
adelaide_text2 = adelaide_text.map(lambda x: (x[0].rsplit('/', 1)[0] + '/', x[1]))

In the preceding line, we removed the URL suffix, which consists of the page name (e.g. `complete.html`).
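The key normalisation can be checked on a single URL. In this sketch, `strip_page_name` is a hypothetical helper wrapping the same `rsplit` expression, and `page_url` is an assumed example built from a meta URL of the dataset followed by `complete.html`.

```python
def strip_page_name(url):
    """Drop everything after the last '/' and keep the trailing slash."""
    return url.rsplit('/', 1)[0] + '/'


page_url = 'https://ebooks.adelaide.edu.au/z/zola/emile/z8nf/complete.html'
meta_url = 'https://ebooks.adelaide.edu.au/z/zola/emile/z8nf/'

print(strip_page_name(page_url))
print(strip_page_name(page_url) == meta_url)
```

A URL that already ends with `/` is returned unchanged, so applying the function twice is harmless.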

In [None]:
modernism_meta_text = modern_era_books_kv.join(adelaide_text2)
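To see what a join produces, here is a plain-Python sketch of an inner join on key-value pairs. `inner_join` is a hypothetical helper and the keys and values are made up. Each result has the shape `(key, (left_value, right_value))`, and keys present on one side only are dropped.

```python
def inner_join(left, right):
    """Join two lists of (key, value) pairs on identical keys."""
    right_index = {}
    for key, value in right:
        right_index.setdefault(key, []).append(value)
    return [(key, (lvalue, rvalue))
            for key, lvalue in left
            for rvalue in right_index.get(key, [])]


meta = [("u1/", {"name": "Nana"}), ("u2/", {"name": "No-Man's-Land"})]
text = [("u1/", "text of Nana"), ("u3/", "an unmatched text")]

joined = inner_join(meta, text)
print(joined)
```

With this shape, the second element of each joined value is the text of the book and the first is its meta information.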

Finally, we build our corpus of modernism words by creating a single list of words.

In [None]:
modernism_word = modernism_meta_text.values().flatMap(lambda x: x[1].split())

In [None]:
from string import punctuation
def remove_punctuations(word):
    # re.escape keeps characters such as ']', '\' and '^' literal inside the class.
    return re.sub(r'[{}‘—’”“]'.format(re.escape(punctuation)), " ", word).strip()

stopwords  = set(['all', 'pointing', 'four', 'go', 'oldest', 'seemed', 'whose', 'certainly',
'young',  'presents', 'to', 'asking', 'those', 'under', 'far', 'every',
'presented', 'did',  'turns', 'large', 'p', 'small', 'parted', 'smaller',
'says', 'second', 'further',  'even', 'what', 'anywhere', 'above', 'new',
'ever', 'full', 'men', 'here', 'youngest',  'let', 'groups', 'others', 'alone',
'along', 'great', 'k', 'put', 'everybody', 'use',  'from', 'working', 'two',
'next', 'almost', 'therefore', 'taken', 'until', 'today',  'more', 'knows',
'clearly', 'becomes', 'it', 'downing', 'everywhere', 'known', 'cases',  'must',
'me', 'states', 'room', 'f', 'this', 'work', 'itself', 'can', 'mr', 'making',
'my', 'numbers', 'give', 'high', 'something', 'want', 'needs', 'end', 'turn',
'rather', 'how', 'y', 'may', 'after', 'such', 'man', 'a', 'q', 'so', 'keeps',
'order', 'furthering',  'over', 'years', 'ended', 'through', 'still', 'its',
'before', 'group', 'somewhere',  'interesting', 'better', 'differently',
'might', 'then', 'non', 'good', 'somebody',  'greater', 'downs', 'they', 'not',
'now', 'gets', 'always', 'l', 'each', 'went', 'side',  'everyone', 'year',
'our', 'out', 'opened', 'since', 'got', 'shows', 'turning', 'differ',  'quite',
'members', 'ask', 'wanted', 'g', 'could', 'needing', 'keep', 'thing', 'place',
'w', 'think', 'first', 'already', 'seeming', 'number', 'one', 'done',
'another', 'open',  'given', 'needed', 'ordering', 'least', 'anyone', 'their',
'too', 'gives', 'interests',  'mostly', 'behind', 'nobody', 'took', 'part',
'herself', 'than', 'kind', 'b', 'showed',  'older', 'likely', 'r', 'were',
'toward', 'and', 'sees', 'turned', 'few', 'say', 'have',  'need', 'seem',
'saw', 'orders', 'that', 'also', 'take', 'which', 'wanting', 'sure', 'shall',
'knew', 'wells', 'most', 'nothing', 'why', 'parting', 'noone', 'later', 'm',
'mrs', 'points', 'fact', 'show', 'ending', 'find', 'state', 'should', 'only',
'going', 'pointed', 'do', 'his', 'get', 'cannot', 'longest', 'during', 'him',
'areas', 'h', 'she', 'x', 'where', 'we', 'see', 'are', 'best', 'said', 'ways',
'away', 'enough', 'smallest',  'between', 'across', 'ends', 'never', 'opening',
'however', 'come', 'both', 'c', 'last',  'many', 'against', 's', 'became',
'faces', 'whole', 'asked', 'among', 'point', 'seems',  'furthered', 'furthers',
'puts', 'three', 'been', 'much', 'interest', 'wants', 'worked',  'an',
'present', 'case', 'myself', 'these', 'n', 'will', 'while', 'would', 'backing',
'is', 'thus', 'them', 'someone', 'in', 'if', 'different', 'perhaps', 'things',
'make',  'same', 'any', 'member', 'parts', 'several', 'higher', 'used', 'upon',
'uses', 'thoughts',  'off', 'largely', 'i', 'well', 'anybody', 'finds',
'thought', 'without', 'greatest',  'very', 'the', 'yours', 'latest', 'newest',
'just', 'less', 'being', 'when', 'rooms',  'facts', 'yet', 'had', 'lets',
'interested', 'has', 'gave', 'around', 'big', 'showing',  'possible', 'early',
'know', 'like', 'necessary', 'd', 't', 'fully', 'become', 'works',  'grouping',
'because', 'old', 'often', 'some', 'back', 'thinks', 'for', 'though', 'per',
'everything', 'does', 'either', 'be', 'who', 'seconds', 'nowhere', 'although',
'by', 'on',  'about', 'goods', 'asks', 'anything', 'of', 'o', 'or', 'into',
'within', 'down', 'beings',  'right', 'your', 'her', 'area', 'downed', 'there',
'long', 'way', 'was', 'opens', 'himself',  'but', 'newer', 'highest', 'with',
'he', 'made', 'places', 'whether', 'j', 'up', 'us',  'problem', 'z', 'clear',
'v', 'ordered', 'certain', 'general', 'as', 'at', 'face', 'again',  'no',
'generally', 'backs', 'grouped', 'other', 'you', 'really', 'felt', 'problems',
'important', 'sides', 'began', 'younger', 'e', 'longer', 'came', 'backed',
'together',  'u', 'presenting', 'evenly', 'having', 'once'])

In [None]:
modernism_word_filt = modernism_word.map(str.lower)\
                                    .map(remove_punctuations)\
                                    .flatMap(str.split)\
                                    .filter(lambda word: word not in stopwords)\
                                    .filter(lambda word: len(word) > 3)\
                                    .filter(lambda word: word.isalpha())
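The same cleaning steps can be tried on a single sentence in plain Python. This is a sketch under assumptions: `clean_words` is a hypothetical helper, the sentence is made up, and `small_stopwords` is a much shorter list than the one defined above.

```python
import re
from string import punctuation

# Character class of ASCII punctuation plus a few typographic marks.
PUNCT = re.compile('[{}‘—’”“]'.format(re.escape(punctuation)))


def clean_words(text, stopwords):
    words = []
    for token in text.lower().split():
        # Punctuation becomes a space, so one token can yield several words.
        for word in PUNCT.sub(' ', token).split():
            if word not in stopwords and len(word) > 3 and word.isalpha():
                words.append(word)
    return words


sample = "The Lighthouse—stood, silently; against the storm!"
small_stopwords = {"the", "against"}   # hypothetical, shortened stopword list
cleaned = clean_words(sample, small_stopwords)
print(cleaned)
```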

In [None]:
from operator import add
modernism_word_count = modernism_word_filt.map(lambda x: (x, 1))\
                                          .reduceByKey(add)

In [None]:
modernism_word_count.top(10, key=lambda x: x[1])

### 6.C Learning: Topic modelling

In [None]:
modern_vocab = modernism_word_count.keys().zipWithIndex().collectAsMap()

In [None]:
br_modern_vocab = sc.broadcast(modern_vocab)

In [None]:
modernism_doc_bag = modernism_meta_text.values().map(lambda x: (x[0]['url'], x[1].split()))

In [None]:
from collections import Counter, OrderedDict
from pyspark.mllib.linalg import Vectors


mdwc_idx = modernism_doc_bag.mapValues(lambda words: list(filter(lambda word: word in br_modern_vocab.value, words)))\
                 .mapValues(lambda words: list(map(lambda word: br_modern_vocab.value[word], words)))\
                 .mapValues(Counter)\
                 .mapValues(lambda d: OrderedDict(sorted(d.items())))\
                 .mapValues(lambda counter: Vectors.sparse(len(br_modern_vocab.value), list(counter.keys()), list(counter.values())))\
                 .zipWithIndex().map(lambda x: [x[1], x[0][1]])\
                 .cache()
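The chain of `mapValues` calls turns each document into a sparse bag-of-words vector. The plain-Python sketch below shows the same idea on one made-up document. `to_sparse_counts` is a hypothetical helper that returns the three arguments a sparse vector is built from (size, sorted indices, counts) rather than a Spark vector.

```python
from collections import Counter


def to_sparse_counts(words, vocab):
    """Return (size, indices, values) for the words found in vocab."""
    counts = Counter(vocab[word] for word in words if word in vocab)
    indices = sorted(counts)                       # increasing word indices
    values = [float(counts[i]) for i in indices]   # occurrences of each word
    return len(vocab), indices, values


vocab = {"storm": 0, "lighthouse": 1, "silence": 2}   # hypothetical vocabulary
doc = ["lighthouse", "storm", "the", "lighthouse"]

size, indices, values = to_sparse_counts(doc, vocab)
print(size, indices, values)
```

Words missing from the vocabulary (here `the`) are ignored, and vocabulary words absent from the document (here `silence`) simply have no entry.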

In [None]:
from pyspark.mllib.clustering import LDA

In [None]:
numTopics = 3
ldaModel = LDA.train(mdwc_idx, k=numTopics, maxIterations=10)

In [None]:
topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)

In [None]:
modern_top_vocab_inv = {v:k for k, v in modern_vocab.items()}

In [None]:
for terms, termWeights in topicIndices:
    print("TOPIC:")
    for term, weight in zip(terms, termWeights):
        print(modern_top_vocab_inv[term], weight)
    print()

In [None]:
sc.stop()