# Playing with Spark API: Exercise

Let's create a Spark session and get it ready to use.

In [None]:
import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [None]:
documents = ['This is a document',
             'This is another document',
             'This is yet a third document',
             'When will this list of document end',
             'This is the last document']

In [None]:
doc_df = spark.createDataFrame([(d,) for d in documents], ['word'])

In [None]:
doc_df.show(truncate=False)

Let's get a few useful functions ready to go.

In [None]:
from pyspark.sql.functions import split, explode, col, lower, sort_array

doc_df.withColumn('word', split(lower(col('word')), r"\s")).show(10, truncate=False)

In [None]:
doc_df.withColumn('word', explode(split(lower(col('word')), r"\s"))).show(10, truncate=False)

In [None]:
doc_df.withColumn('word', explode(split(lower(col('word')), r"\s")))\
      .where('word != ""')\
      .groupBy('word')\
      .count()\
      .orderBy('count', ascending=False)\
      .show()
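
As a sanity check on the Spark result, the same steps (lowercase, split on whitespace, drop empty tokens, count, sort by count) can be sketched in plain Python. This is only an illustration on the five sample documents and not part of the original exercise; the names `tokens` and `word_counts` are introduced here.

```python
import re
from collections import Counter

documents = ['This is a document',
             'This is another document',
             'This is yet a third document',
             'When will this list of document end',
             'This is the last document']

# Lowercase each document and split on single whitespace characters,
# then drop empty tokens (the counterpart of the where() filter).
tokens = [tok
          for doc in documents
          for tok in re.split(r"\s", doc.lower())
          if tok != ""]

# Count occurrences and print the most frequent words first.
word_counts = Counter(tokens)
for word, n in word_counts.most_common(3):
    print(word, n)
```

If the Spark output and this sketch disagree on a count, the tokenization step is the first place to look.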

## Words with friends - finding anagrams

In the file "data/words.txt", there is a list of words. Our goal is to group together words that are anagrams of each other (e.g. ACT and CAT).

This will show us how to load data from a file, and introduce a handy "canonical representation" trick.
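
The idea behind a canonical representation is that two words are anagrams exactly when their letters, once sorted, are identical. A minimal plain-Python sketch of that idea follows; the helper name `anagram_key` is hypothetical and only used for illustration.

```python
def anagram_key(word):
    """Return the letters of `word` in sorted order, as a tuple."""
    return tuple(sorted(word))

# ACT and CAT share a key, so they are anagrams of each other.
print(anagram_key("ACT"))
print(anagram_key("CAT"))
print(anagram_key("ACT") == anagram_key("CAT"))
```

Any word can be reduced to its key independently of the others, which is what makes the trick a good fit for a column-wise Spark transformation.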


In [None]:
word_df = spark.read.text('data/words.txt')
word_df.show(10)

First, split every word into a list of its characters, sort that list, and store it as a new column named `key`. So we want to go from:

```
| value |
---------
| AA    |
| AAH   |
| ...   |
```

to:

```
| value |     key     |
-----------------------
| AA    | [, A, A]    |
| AAH   | [, A, A, H] |
| ...   | ...         |
```

The empty first element in each key is a by-product of splitting on an empty pattern, and whether it appears can depend on the Spark version. It shows up in every key alike, so it does not change which words end up grouped together.

In [None]:
word_df_key = word_df.withColumn('key', sort_array(split(col('value'),'')))
word_df_key.show()

Now treat the sorted list of characters as a key: group by it and count how many words share each key.

In [None]:
word_df_key.groupBy('key').count().orderBy('count',ascending=False).show(3)

What if we want to actually see all the anagrams? Hint: Check out the `collect_list` function.

In [None]:
# Collect the words that share each key, along with how many there are
from pyspark.sql.functions import collect_list, count
(word_df_key.groupBy('key')
            .agg(collect_list('value').alias('words'), count('key').alias('freq'))
            .orderBy('freq', ascending=False)
            .show(15, truncate=False)
)
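
For intuition, `groupBy('key')` combined with `collect_list` behaves much like building a dictionary that maps each key to the list of words sharing it. The sketch below does this in plain Python on a tiny hand-made word list; `sample_words` and `groups` are assumptions for illustration, not the contents of `data/words.txt`.

```python
from collections import defaultdict

# A small made-up word list standing in for the real file.
sample_words = ["ACT", "CAT", "TAC", "DOG", "GOD", "AA"]

# Map each canonical key (sorted letters) to the words that produce it.
groups = defaultdict(list)
for word in sample_words:
    groups["".join(sorted(word))].append(word)

# Largest groups first, similar to orderBy('freq', ascending=False).
for key, words in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(key, words, len(words))
```

One difference worth keeping in mind: this sketch preserves input order inside each list, whereas Spark does not guarantee the order of elements returned by `collect_list`.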