# Word Count with Map-Reduce - Lab

## Introduction

Now that we have seen the key map and reduce operators in Spark, and know when to use transformation and action operators, we can revisit the word count problem introduced earlier in the section. In this lab, we will use the methods from the coding labs to read a text corpus into the Spark environment, perform a word count, and try some basic NLP ideas to get a good grip on how MapReduce works.

Note: In your PySpark environment, create a folder `data` and move all the files from the provided `data` folder into it. The Jupyter interface doesn't allow moving complete folders.

## Objectives

You will be able to:

* Describe Map-Reduce operation in a big data context
* Perform basic NLP tasks with a given text corpus
* Perform basic analysis from the experiment findings towards identifying writing styles

## Map-Reduce task

Here is what our problem looks like:

* We have a huge text document
* We need to count the number of times each distinct word appears in the document


* Sample applications:

    * Analyze web server logs to find popular URLs
    * Analyze texts for content or style 
    
    
## Word Count

We will illustrate a MapReduce computation that counts the number of occurrences of each word in a text corpus. In this example, the input is a repository of documents, and each document is an element. We will count the frequency of stop words for __style identification__, since an author's use of stop words may have distinctive features that can potentially describe their writing style. With this motivation, we will look at some texts by Shakespeare and Jane Austen.

Map-Reduce in PySpark provides a practical and efficient way of achieving this goal as it: 

* works if the file is too large for memory

* works even if the output is too large for memory

* is naturally parallelizable


### Map-Reduce Framework

Here are the steps that we will perform for our problem under the Map-Reduce framework.

* Sequentially read a lot of data (text files in this case)


* Map:
    * Extract something you care about


* Group by key: Sort and Shuffle


* Reduce:
    * Aggregate, summarize, filter or transform


* Write the result 
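
The steps above can be sketched without Spark at all. The following is a minimal plain-Python illustration, assuming a tiny in-memory list of lines in place of a real corpus. The helper names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical and are not part of any Spark API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group by key: collect all values emitted for the same word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: aggregate each word's values into a single count."""
    return {key: reduce(lambda x, y: x + y, values)
            for key, values in groups.items()}


# Hypothetical sample input standing in for a text file
lines = ["to be or not to be", "that is the question"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["to"], counts["be"], counts["question"])
```

In Spark, the same three phases are spread across partitions and machines, which is what lets the approach scale past the memory of a single process.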

In [1]:
# Start a local SparkContext
import pyspark
sc = pyspark.SparkContext('local[*]')  # 'local[*]' runs Spark locally (no cluster), with one worker thread per logical core


In [2]:
r_j = sc.textFile('./text/romeoandjuliet.txt')

## Map functions

The basic steps for our word count under the Map-Reduce framework are:

* Transform the text
    * split the lines into words
    * make all words lowercase
    * remove "stopwords"
    * apply a function to turn each entry into a datatype that is easier to work with, such as a `(word, 1)` tuple
* Reduce the values to calculate the final word count

To begin with, create a map function that will begin the process of our word count operation.

In [9]:
r_j_words = r_j.flatMap(lambda x: x.split(" "))

In [10]:
r_j_words.top(20)

['~5%',
 'zip',
 'z',
 'yron',
 'yron',
 'youthfull',
 'youthfull',
 'youth:',
 'youth,',
 'youth',
 'youth',
 'yours?',
 'yours',
 'yours',
 'your@login',
 'your',
 'your',
 'your',
 'your',
 'your']
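
The output above still contains tokens such as `'youth:'` and `'yours?'`, because splitting on a single space separates on that one character and nothing else. As a small plain-Python sketch (the sample line is hypothetical), here is how `split(" ")` compares with the argument-free `split()`:

```python
# Hypothetical sample line with a double space and attached punctuation
line = "O Romeo,  Romeo! wherefore art thou Romeo?"

# Splitting on a single space keeps punctuation attached and yields
# an empty string wherever two spaces are adjacent
by_space = line.split(" ")

# Splitting with no argument collapses any run of whitespace, so no
# empty strings are produced (punctuation is still attached)
by_whitespace = line.split()

print(by_space)
print(by_whitespace)
```

Either form works for this lab, provided that empty strings and punctuation are dealt with in a later step.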

Now, let's make all words lowercase and strip any leading and trailing punctuation

In [36]:
from string import punctuation
lowered = r_j_words.map(lambda x: x.lower().strip(punctuation))
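
It helps to see exactly what `strip(punctuation)` does and does not remove. The sketch below uses a handful of hypothetical tokens. It shows that only leading and trailing ASCII punctuation is stripped, so an internal apostrophe survives, a typographic quote such as `“` is left in place, and a token made only of punctuation becomes an empty string.

```python
from string import punctuation

# Hypothetical sample tokens; "\u201c" is the curly opening quote
tokens = ["Youth,", "yours?", "Gutenberg's", "\u201cI", "--"]

# Lowercase each token, then strip ASCII punctuation from both ends
cleaned = [t.lower().strip(punctuation) for t in tokens]
print(cleaned)
```

This is consistent with tokens like `"gutenberg's"` and `'“i'` appearing in the results, and with the empty string being added to the stop-word set.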

Now, let's remove all stopwords

In [37]:
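# Requires the NLTK stop-word corpus; run nltk.download('stopwords') once if it is missing
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# Also drop the empty strings left over from splitting and stripping
stop_words.add('')
no_stopwords = lowered.filter(lambda x: x not in stop_words)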

In [38]:
no_stopwords.collect()

['project',
 "gutenberg's",
 'etext',
 "shakespeare's",
 'first',
 'folio',
 'tragedie',
 'romeo',
 'juliet',
 '3rd',
 'edition',
 'plays',
 'see',
 'index',
 'copyright',
 'laws',
 'changing',
 'world',
 'sure',
 'check',
 'copyright',
 'laws',
 'country',
 'posting',
 'files',
 'please',
 'take',
 'look',
 'important',
 'information',
 'header',
 'encourage',
 'keep',
 'file',
 'disk',
 'keeping',
 'electronic',
 'path',
 'open',
 'next',
 'readers',
 'remove',
 'welcome',
 'world',
 'free',
 'plain',
 'vanilla',
 'electronic',
 'texts',
 'etexts',
 'readable',
 'humans',
 'computers',
 'since',
 '1971',
 'etexts',
 'prepared',
 'hundreds',
 'volunteers',
 'donations',
 'information',
 'contacting',
 'project',
 'gutenberg',
 'get',
 'etexts',
 'information',
 'included',
 'need',
 'donations',
 'tragedie',
 'romeo',
 'juliet',
 'william',
 'shakespeare',
 'july',
 '2000',
 'etext',
 '2261',
 'project',
 "gutenberg's",
 'etext',
 "shakespeare's",
 'first',
 'folio',
 'tragedie',
 'ro

### Perform mapping operation to count words

 

In [40]:
mapped = no_stopwords.map(lambda x: (x,1))

In [41]:
mapped.collect()


[('project', 1),
 ("gutenberg's", 1),
 ('etext', 1),
 ("shakespeare's", 1),
 ('first', 1),
 ('folio', 1),
 ('tragedie', 1),
 ('romeo', 1),
 ('juliet', 1),
 ('3rd', 1),
 ('edition', 1),
 ('plays', 1),
 ('see', 1),
 ('index', 1),
 ('copyright', 1),
 ('laws', 1),
 ('changing', 1),
 ('world', 1),
 ('sure', 1),
 ('check', 1),
 ('copyright', 1),
 ('laws', 1),
 ('country', 1),
 ('posting', 1),
 ('files', 1),
 ('please', 1),
 ('take', 1),
 ('look', 1),
 ('important', 1),
 ('information', 1),
 ('header', 1),
 ('encourage', 1),
 ('keep', 1),
 ('file', 1),
 ('disk', 1),
 ('keeping', 1),
 ('electronic', 1),
 ('path', 1),
 ('open', 1),
 ('next', 1),
 ('readers', 1),
 ('remove', 1),
 ('welcome', 1),
 ('world', 1),
 ('free', 1),
 ('plain', 1),
 ('vanilla', 1),
 ('electronic', 1),
 ('texts', 1),
 ('etexts', 1),
 ('readable', 1),
 ('humans', 1),
 ('computers', 1),
 ('since', 1),
 ('1971', 1),
 ('etexts', 1),
 ('prepared', 1),
 ('hundreds', 1),
 ('volunteers', 1),
 ('donations', 1),
 ('information', 1),

In [46]:
word_counts = mapped.reduceByKey(lambda x,y: x + y)
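
`reduceByKey` merges the values for each key locally within each partition before combining the partial results, which is why the function passed to it should be associative and commutative. The plain-Python sketch below imitates that two-step merge. The helper `combine` and the two sample partitions are hypothetical and only stand in for what Spark does internally.

```python
from operator import add


def combine(pairs, func):
    """Merge values that share a key using the two-argument function func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical (word, 1) pairs split across two partitions
partition_a = [("thou", 1), ("thy", 1), ("thou", 1)]
partition_b = [("thou", 1), ("romeo", 1), ("thy", 1)]

# Step 1: combine within each partition
local_a = combine(partition_a, add)
local_b = combine(partition_b, add)

# Step 2: merge the partial results across partitions
total = combine(list(local_a.items()) + list(local_b.items()), add)
print(total)
```

Because addition gives the same answer regardless of grouping or order, the partial sums can be merged in any sequence.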

In [48]:
word_counts.sortBy(lambda x: x[1],ascending=False).collect()

[('thou', 277),
 ('thy', 178),
 ('rom', 149),
 ('romeo', 143),
 ('thee', 138),
 ('loue', 135),
 ('haue', 126),
 ('shall', 106),
 ('come', 100),
 ('iul', 97),
 ('enter', 91),
 ('man', 84),
 ('nur', 77),
 ('night', 73),
 ('ile', 72),
 ('one', 70),
 ('good', 69),
 ('death', 69),
 ('may', 67),
 ('nurse', 64),
 ('hath', 64),
 ('ben', 63),
 ('go', 60),
 ('mer', 59),
 ('well', 58),
 ('would', 56),
 ('sir', 56),
 ('vp', 56),
 ('art', 54),
 ('iuliet', 54),
 ('day', 53),
 ('say', 52),
 ('fri', 51),
 ('lady', 50),
 ('dead', 48),
 ('doth', 47),
 ('giue', 47),
 ('yet', 46),
 ('tell', 44),
 ('let', 44),
 ('faire', 43),
 ('time', 43),
 ('take', 42),
 ('must', 42),
 ('like', 42),
 ('hast', 42),
 ('see', 42),
 ('vpon', 42),
 ('tybalt', 42),
 ('tis', 41),
 ('make', 40),
 ('selfe', 39),
 ('old', 38),
 ('much', 37),
 ('know', 36),
 ('cap', 36),
 ('project', 35),
 ('gone', 34),
 ('paris', 33),
 ('frier', 33),
 ('etext', 33),
 ('wife', 32),
 ('true', 30),
 ('looke', 30),
 ('vs', 30),
 ('wilt', 30),
 ('light

### Put it all together

Now, create a function `word_counter` that has the parameters:

* `file_name`: name of the text file
* `spark_context`: the SparkContext to be used to perform the operations

The function should return:

* `word_count` : list of tuples containing the top 50 words by count

Use your function to return the final word count. Then use it to create histograms of the top 20 words by frequency of usage.

In [60]:
def word_counter(file_name, spark_context):
    """Return the top 50 (word, count) pairs for a text file.

    Relies on `punctuation` and `stop_words` defined in the cells above.
    """
    # read in the text file
    text = spark_context.textFile(file_name)

    # split each line into words
    split_text = text.flatMap(lambda x: x.split(" "))

    # lowercase the words and strip surrounding punctuation
    lowered_text = split_text.map(lambda x: x.lower().strip(punctuation))

    # remove stopwords
    no_sw_text = lowered_text.filter(lambda x: x not in stop_words)

    # map each word to a (word, 1) pair
    mapped = no_sw_text.map(lambda x: (x, 1))

    # reduce to obtain the word counts, sort by count, and keep the top 50
    word_count = (mapped.reduceByKey(lambda x, y: x + y)
                        .sortBy(lambda x: x[1], ascending=False)
                        .take(50))
    return word_count
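
For the histogram step, one dependency-free option is a text bar chart built from the list of `(word, count)` tuples. This is only a sketch: `text_histogram` and its `top_n` and `width` parameters are hypothetical names, and a plotting library would serve equally well. The sample data is the first five pairs from the Romeo and Juliet result above.

```python
def text_histogram(word_count, top_n=20, width=40):
    """Return one text bar per word, scaled so the largest count spans `width` characters."""
    top = word_count[:top_n]
    if not top:
        return []
    max_count = max(count for _, count in top)
    label_width = max(len(word) for word, _ in top)
    rows = []
    for word, count in top:
        # Every word gets at least one character so small counts stay visible
        bar = "#" * max(1, round(width * count / max_count))
        rows.append(f"{word.ljust(label_width)} {bar} {count}")
    return rows


# First five (word, count) pairs from the Romeo and Juliet result above
sample = [('thou', 277), ('thy', 178), ('rom', 149), ('romeo', 143), ('thee', 138)]
for row in text_histogram(sample):
    print(row)
```

Passing the full list returned by `word_counter` with the default `top_n=20` gives a chart of the top 20 words.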

After you have created the function, use it on the other text files found in the `text` folder:

* emma.txt
* hamlet.txt
* othello.txt
* prideandprejudice.txt

In [51]:
ls ./text

emma.txt                   prideandprejudice.txt
hamlet.txt                 romeoandjuliet.txt
othello.txt                senseandsensibility 2.txt


In [53]:
emma = word_counter('./text/emma.txt',sc)

In [55]:
emma

[('mr', 1092),
 ('could', 822),
 ('would', 805),
 ('emma', 723),
 ('mrs', 671),
 ('must', 573),
 ('miss', 571),
 ('said', 471),
 ('much', 465),
 ('every', 419),
 ('one', 415),
 ('think', 375),
 ('weston', 368),
 ('thing', 366),
 ('harriet', 366),
 ('little', 359),
 ('never', 336),
 ('might', 322),
 ('knightley', 321),
 ('know', 312),
 ('say', 302),
 ('“i', 300),
 ('elton', 299),
 ('good', 292),
 ('well', 283),
 ('quite', 267),
 ('jane', 261),
 ('great', 261),
 ('time', 258),
 ('woodhouse', 249),
 ('nothing', 235),
 ('may', 233),
 ('always', 231),
 ('without', 219),
 ('thought', 218),
 ('shall', 217),
 ('see', 217),
 ('soon', 215),
 ('dear', 211),
 ('first', 202),
 ('made', 197),
 ('man', 196),
 ('sure', 192),
 ('like', 190),
 ('frank', 190),
 ('young', 189),
 ('ever', 185),
 ('fairfax', 182),
 ('churchill', 180),
 ('two', 167)]

In [58]:
hamlet = word_counter('./text/hamlet.txt',sc)

In [59]:
hamlet

[('ham', 358),
 ('lord', 225),
 ('king', 196),
 ('queen', 120),
 ('shall', 114),
 ('good', 109),
 ('hor', 109),
 ('hamlet', 107),
 ('come', 107),
 ('thou', 105),
 ('let', 96),
 ('thy', 86),
 ('pol', 86),
 ('like', 81),
 ('would', 81),
 ('well', 77),
 ('sir', 75),
 ('know', 74),
 ('enter', 73),
 ('tis', 73),
 ('th', 72),
 ('may', 71),
 ('go', 71),
 ('us', 70),
 ('love', 66),
 ('hath', 65),
 ('speak', 63),
 ('laer', 62),
 ('must', 61),
 ('give', 60),
 ('thee', 59),
 ('oph', 58),
 ("i'll", 56),
 ('upon', 55),
 ('make', 55),
 ('say', 54),
 ('father', 52),
 ('man', 52),
 ('much', 48),
 ('horatio', 47),
 ('think', 47),
 ('one', 47),
 ('laertes', 46),
 ('time', 46),
 ('heaven', 45),
 ('see', 45),
 ('play', 45),
 ('ros', 45),
 ('thus', 43),
 ('tell', 43)]

In [61]:
othello = word_counter('./text/othello.txt',sc)

In [62]:
othello

[('iago', 341),
 ('haue', 202),
 ('cassio', 184),
 ('oth', 182),
 ('des', 160),
 ('thou', 142),
 ('oh', 117),
 ('shall', 91),
 ('lord', 89),
 ('tis', 88),
 ('good', 87),
 ('thy', 87),
 ('aemil', 85),
 ('come', 85),
 ('would', 83),
 ('enter', 81),
 ('well', 81),
 ('may', 79),
 ('loue', 78),
 ('thee', 76),
 ('let', 72),
 ('othe', 69),
 ('yet', 68),
 ('hath', 68),
 ('know', 67),
 ('say', 66),
 ('one', 65),
 ('moore', 63),
 ('thinke', 63),
 ('go', 60),
 ('selfe', 60),
 ('must', 59),
 ('heauen', 58),
 ('othello', 57),
 ('heere', 55),
 ('make', 53),
 ('ile', 53),
 ('sir', 53),
 ('night', 52),
 ('cas', 52),
 ('vpon', 52),
 ('speake', 50),
 ('see', 49),
 ('desdemona', 48),
 ('man', 48),
 ('giue', 43),
 ('honest', 42),
 ('neuer', 41),
 ('rodorigo', 39),
 ('rod', 39)]