# Homework 2

Here we implement a map-reduce algorithm in Spark that counts the occurrences of each word in a given set of documents.
Let's start by properly initializing a SparkContext object.

In [2]:
from pyspark import SparkConf, SparkContext

config = SparkConf().setAppName('Homework 2').setMaster('local')

sc = SparkContext(conf=config)

### Loading the Dataset
Here we load the dataset and split it into a number of partitions. As a rule of thumb, more partitions allow more parallelism. However, each partition has some overhead, so creating too many partitions is suboptimal.

Since we're running on *4-core* machines, we'll partition the RDD into 8 parts.
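
The choice of 8 partitions on 4 cores corresponds to two partitions per core. A minimal sketch of that rule of thumb follows; the helper `suggested_partitions` and its `per_core=2` default are hypothetical names introduced here for illustration, not part of Spark.

```python
import os

def suggested_partitions(cores=None, per_core=2):
    """Return a partition count as a small multiple of the core count.

    per_core=2 is an assumption inferred from the text (4 cores -> 8 partitions).
    """
    if cores is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        cores = os.cpu_count() or 1
    return cores * per_core

print(suggested_partitions(cores=4))
```

The returned value could then be passed to `repartition` in place of the hard-coded 8.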

In [7]:
docs = sc.textFile('text-sample.txt').repartition(8)

### Trivial Algorithm
In this naive algorithm we do the following:
- Take each word from each document
- Create a new key-value pair for each word with a value 1
- Collect the pairs by key and sum their values

In [8]:
words = docs.flatMap(lambda document: document.split(" "))\
    .map(lambda word: (word,1))\
    .reduceByKey(lambda x,y: x+y)

print('The number of different words is: ', words.count())

The number of different words is:  144873
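
To make the three steps concrete without a Spark cluster, here is a small pure-Python sketch that mimics them on two toy documents. The function `naive_word_count` and the sample data are illustrative assumptions; this is not how Spark executes the job internally.

```python
from collections import defaultdict

def naive_word_count(documents):
    # flatMap + map: emit one (word, 1) pair per word of every document
    pairs = [(word, 1) for document in documents for word in document.split(" ")]
    # reduceByKey: sum the values that share the same key
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Hypothetical toy input; note the double space in the second document
sample = ["a b a", "b  c"]
counts = naive_word_count(sample)
print(counts)
print(len(counts))
```

Note that `split(" ")` yields an empty string wherever two spaces are adjacent, so in this sketch the empty string is counted as a word. If the dataset contains repeated spaces, the same may hold for the count reported above.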


## Improved word count 1 (Gugio's version, used only as a reference for improved word count 2)

In this first, more clever version of word count, we change the way documents are processed: for each document *Di* we directly emit the pairs *(w,c(w))*, where _c(w)_ is the number of occurrences of the word *w* in *Di*.

In [9]:
import time

def wordcount_0_function(document):
    # Count the occurrences of each word within a single document
    pairs_dict = {}
    for word in document.split(' '):
        pairs_dict[word] = pairs_dict.get(word, 0) + 1
    # Emit one (word, count) pair per distinct word
    return list(pairs_dict.items())

 
wordcount_1 = docs.flatMap(wordcount_0_function) \
                .reduceByKey(lambda accum, n: accum + n)

t0 = time.time()
n_dif_words = wordcount_1.count()
t1 = time.time()
                
print('Improved WordCount 1 result:')
print('Number of different Words :', n_dif_words)
print('Elapsed Time :', t1-t0, 's')

Improved WordCount 1 result:
Number of different Words : 144873
Elapsed Time : 13.970414161682129 s
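
The point of counting inside each document is that fewer pairs are emitted before the reduce step. The following pure-Python sketch compares the two approaches on a single hypothetical document; `naive_pairs` and `per_document_pairs` are illustrative names, and `collections.Counter` stands in for the hand-written dictionary above.

```python
from collections import Counter

def naive_pairs(document):
    # One (word, 1) pair per word occurrence
    return [(word, 1) for word in document.split(" ")]

def per_document_pairs(document):
    # One (word, c(word)) pair per distinct word in the document
    return list(Counter(document.split(" ")).items())

# Hypothetical sample document: 8 words, 5 of them distinct
document = "the cat and the dog and the bird"
print(len(naive_pairs(document)))
print(len(per_document_pairs(document)))
```

On this sample the naive version emits 8 pairs and the per-document version emits 5. The saving grows with the number of repeated words in each document.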


## Improved word count 2 

In [11]:
N = docs.flatMap(lambda document: document.split(" "))\
    .map(lambda word: (word,1))\
    .count()
N

3503570

In [20]:
import numpy as np
import math

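def wordcount_2(document):
    # Count the occurrences of each word within a single document
    pairs_dict = {}
    for word in document.split(' '):
        pairs_dict[word] = pairs_dict.get(word, 0) + 1
    # Attach a random integer key to each (word, count) pair.
    # np.random.randint excludes the upper bound, so keys lie in [0, N - 2].
    return [(np.random.randint(0, math.floor(N) - 1), pair) for pair in pairs_dict.items()]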

docs.flatMap(wordcount_2).collect()

[(3401479, ('Slauerhoff', 28)),
 (662886, ('Jan', 3)),
 (179998, ('Jacob', 1)),
 (1521882, ('who', 2)),
 (593153, ('publish', 7)),
 (3079403, ('as', 6)),
 (861094, ('be', 27)),
 (195445, ('a', 36)),
 (2880552, ('dutch', 3)),
 (2972788, ('poet', 5)),
 (103888, ('and', 51)),
 (133639, ('novelist', 1)),
 (799885, ('he', 64)),
 (2332065, ('consider', 2)),
 (278293, ('one', 4)),
 (2957461, ('of', 61)),
 (2248898, ('the', 65)),
 (2880664, ('most', 1)),
 (1217239, ('important', 2)),
 (2832131, ('language', 2)),
 (36284, ('writer', 3)),
 (326841, ('bear', 1)),
 (2222037, ('fifth', 1)),
 (1333475, ('in', 45)),
 (2565356, ('family', 2)),
 (1986813, ('six', 2)),
 (674410, ('child', 3)),
 (391347, ('raise', 1)),
 (222376, ('moderately', 1)),
 (1107451, ('orthodox-protestant', 1)),
 (1936879, ('middle', 1)),
 (2193097, ('class', 1)),
 (1859598, ('environment', 1)),
 (3176191, ('Leeuwarden', 1)),
 (1121098, ('Netherlands', 5)),
 (3201766, ('suffer', 2)),
 (1126169, ('from', 9)),
 (695426, ('bout', 1