# Anagrams

In this exercise we revisit the Anagrams example from TP1.
This time we solve it with a Spark program running locally.

Let's start by importing `SparkContext`:

In [None]:
from pyspark import SparkContext

Next, we create an instance of `SparkContext`:

In [None]:
sc = SparkContext()

Now we want to get our data from HDFS.
Let's check whether the file is already there:

> Note that a line starting with `!` runs as a shell command instead of Python code.

In [None]:
!hadoop fs -ls common_words_en_subset.txt

If the file doesn't show up, you can put it on HDFS:

In [None]:
!hadoop fs -put /vagrant/tp/1/common_words_en_subset.txt

Next, we load the data into a Spark RDD. Note that this command is lazily evaluated: nothing is read until an action such as `collect()` runs.

In [None]:
words = sc.textFile('common_words_en_subset.txt')
words
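
As a rough analogy only (this is plain Python, not Spark), a generator behaves in a similar lazy way: defining it does no work, and the work happens only when the result is consumed. The `load` helper and the sample lines below are hypothetical stand-ins for the real file.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
log = []  # records every line that actually gets processed


def load(line):
    """Hypothetical stand-in for reading one line of the input file."""
    log.append(line)
    return line.strip()


raw_lines = ["listen\n", "silent\n", "spark\n"]

# Building the generator only describes the work, much like a transformation.
lazy_words = (load(line) for line in raw_lines)
processed_before = len(log)  # nothing has been processed yet

# Consuming the generator triggers the work, much like an action.
loaded_words = list(lazy_words)
processed_after = len(log)

print(processed_before, processed_after, loaded_words)
```

In Spark, `sc.textFile` plays the role of the generator definition and an action such as `collect()` plays the role of `list(...)`.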

Let's display the content of the `words` RDD:

In [None]:
words.collect()

Check the results to make sure the data loaded successfully.
Then run the `map` transformation, which pairs each word with its letters in sorted order, so that anagrams share the same key:

In [None]:
tuples = words.map(lambda x: (''.join(sorted(x)), x))
tuples.collect()
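
To see why this key works, here is a minimal plain-Python sketch of the same lambda applied to a few sample words. The name `anagram_key` and the sample words are assumptions made for illustration; they are not part of the exercise data.

```python
def anagram_key(word):
    """Return the letters of `word` in sorted order, joined into one string."""
    return ''.join(sorted(word))


# Hypothetical sample words; two of them are anagrams of each other.
sample_words = ['listen', 'silent', 'spark']

# Same (key, word) shape as the tuples produced by the map transformation.
sample_tuples = [(anagram_key(w), w) for w in sample_words]
print(sample_tuples)
# [('eilnst', 'listen'), ('eilnst', 'silent'), ('akprs', 'spark')]
```

Words that are anagrams of each other end up with identical keys, which is what makes the later grouping step possible.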

Then, run `groupByKey` (roughly equivalent to Hadoop's shuffle phase):

In [None]:
grouped = tuples.groupByKey().mapValues(list)
grouped.collect()
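
For comparison, the grouping can be sketched in plain Python with a dictionary of lists. This is only an illustration of the resulting shape on hypothetical sample tuples, not of how Spark distributes the work.

```python
from collections import defaultdict

# Hypothetical (key, word) tuples, shaped like the output of the map step.
sample_tuples = [('eilnst', 'listen'), ('eilnst', 'silent'), ('akprs', 'spark')]

# Collect every word under its key, as groupByKey does for each key.
groups = defaultdict(list)
for key, word in sample_tuples:
    groups[key].append(word)

# Sort the keys only to get a stable order for printing.
sample_grouped = sorted(groups.items())
print(sample_grouped)
# [('akprs', ['spark']), ('eilnst', ['listen', 'silent'])]
```

Unlike this sketch, Spark does not guarantee the order of the keys or of the values within each group, so your output may be ordered differently.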

Observe the results. Then run `filter` to keep only the keys that have more than one word:

In [None]:
filtered = grouped.filter(lambda x: len(x[1]) > 1)
filtered.collect()

Then, let's run `mapValues` to join each list of words into a single comma-separated string:

In [None]:
res1 = filtered.mapValues(lambda x: ", ".join(x))
res1.collect()
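
If you want to cross-check the last two steps by hand, the following plain-Python sketch applies the same filter and join to a small, hypothetical grouped list. The variable names are illustrative only.

```python
# Hypothetical grouped data, shaped like the (key, list of words) pairs above.
sample_grouped = [('akprs', ['spark']), ('eilnst', ['listen', 'silent'])]

# Keep only the keys with more than one word, like the filter step.
sample_filtered = [(key, ws) for key, ws in sample_grouped if len(ws) > 1]

# Join each list into one string, like the mapValues step.
sample_res = [(key, ", ".join(ws)) for key, ws in sample_filtered]
print(sample_res)
# [('eilnst', 'listen, silent')]
```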

Finally, apply `mapValues` to the unfiltered RDD as well, and save both datasets
to HDFS:

In [None]:
res2 = grouped.mapValues(lambda x: ", ".join(x))
res2.collect()

In [None]:
res1.saveAsTextFile('res-words-filtered')
res2.saveAsTextFile('res-words-unfiltered')

Let's check that the results were written to HDFS:

In [None]:
!hadoop fs -ls res-words-filtered

In [None]:
!hadoop fs -ls res-words-unfiltered

In [None]:
!hadoop fs -cat res-words-filtered/*

In [None]:
!hadoop fs -cat res-words-unfiltered/*

Question: Why is the `grouped` RDD computed twice? How can you force Spark to evaluate the `grouped` RDD only once?

You can of course also try out other operations covered in the course to familiarize yourself with them.