# Homework 2

Here we implement a map-reduce word count in Spark: counting how many times each word occurs in a given set of documents.
We start by initializing a SparkContext object.

In [1]:
from pyspark import SparkConf, SparkContext

# 'local' alone runs Spark with a single worker thread; 'local[*]' uses all available cores
config = SparkConf().setAppName('Homework 2').setMaster('local[*]')

sc = SparkContext(conf=config)

### Loading the Dataset
Here we load the dataset and split it into a number of partitions. As a rule of thumb, more partitions allow more parallelism. Each partition nonetheless carries some overhead, so creating too many of them is suboptimal.

Since we're running on *4-core* machines, we partition the RDD into 8 parts.

In [2]:
docs = sc.textFile('dataset.txt').repartition(8)

### Trivial Algorithm
In this naive algorithm we do the following:
- Take each word from each document
- Create a new key-value pair for each word with a value 1
- Collect the pairs by key and sum their values

In [3]:
words = docs.flatMap(lambda document: document.split(" "))\
    .map(lambda word: (word,1))\
    .reduceByKey(lambda x,y: x+y)

print('The number of different words is: ', words.count())

The number of different words is:  15
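
To check the logic without Spark, the three steps of the naive algorithm can be mimicked in plain Python. This is only a sketch: `naive_word_count` and `sample_docs` are hypothetical names introduced for illustration, and the toy input is not the notebook's `dataset.txt`.

```python
from collections import defaultdict

def naive_word_count(documents):
    """Mimic flatMap -> map -> reduceByKey on a plain list of strings."""
    # flatMap: one flat list of words across all documents
    words = [word for doc in documents for word in doc.split(" ")]
    # map: emit a (word, 1) pair for every occurrence
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the values that share the same key
    totals = defaultdict(int)
    for word, one in pairs:
        totals[word] += one
    return dict(totals)

# Hypothetical toy input, not the notebook's dataset
sample_docs = ["a b a", "b c"]
counts = naive_word_count(sample_docs)
print(counts)
print("Distinct words:", len(counts))
```

Note that the intermediate list holds one pair per word occurrence, which is the cost the improved versions try to reduce.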


## Improved word count 1

In this first, more clever version of word count we modify the way documents are processed: for each document *D_i* we directly emit the pairs *(w, c(w))*, where _c(w)_ is the number of occurrences of the word *w* in *D_i*.
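
As a small illustration of why this helps, the sketch below counts the pairs emitted for a single document under both schemes. It is plain Python rather than Spark, and `per_document_pairs` and the toy document are assumptions made for the example.

```python
from collections import Counter

def per_document_pairs(document):
    """Return one (word, count) pair per distinct word in the document."""
    return list(Counter(document.split(" ")).items())

# Hypothetical toy document
doc = "to be or not to be"
pairs = per_document_pairs(doc)

# The naive scheme emits one (word, 1) pair per occurrence instead
naive_pair_count = len(doc.split(" "))

print(pairs)
print("pairs emitted (naive):   ", naive_pair_count)
print("pairs emitted (improved):", len(pairs))
```

The more often words repeat within a document, the fewer pairs have to be shuffled to the reduce step.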

In [136]:
def improvedMap(x):
    # Count the occurrences of each word within a single document
    line = x.split(" ")
    pairs = {}
    for s in line:
        pairs[s] = pairs.get(s, 0) + 1
    # Emit one (word, count) pair per distinct word
    return list(pairs.items())

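# flatMap flattens the per-document lists into a single RDD of (word, count) pairs
intermediate = docs.flatMap(improvedMap)
# Assumed completion of the unfinished reduce step: sum the per-document counts by word
improvedWords = intermediate.reduceByKey(lambda x, y: x + y)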
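
The reduce step has to add up, for each word, the counts coming from the different documents. A plain-Python sketch of that merge is shown below, assuming each document has already been turned into a list of (word, count) pairs of the shape `improvedMap` returns. `merge_pairs` and `intermediate_sample` are hypothetical names used only for this illustration.

```python
from collections import Counter

def merge_pairs(pairs_per_document):
    """Sum per-document (word, count) pairs into global word totals."""
    totals = Counter()
    for pairs in pairs_per_document:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

# Hypothetical per-document output, one list of pairs per document
intermediate_sample = [[("a", 2), ("b", 1)], [("b", 1), ("c", 1)]]
totals = merge_pairs(intermediate_sample)
print(totals)
print("Distinct words:", len(totals))
```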