## Homework - Exercise 1
Perform a word count on `sample_data/README.md` using the DataFrame API (not the RDD API). Sort the result by descending count and make sure that empty words are excluded. Hint: you can use `read.text`, `split`, `explode`, `lower`, `filter`, `select`, `groupBy`, `count`, and `orderBy`; some of these must be imported from `pyspark.sql.functions`. Details can be found in the [documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html).
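
As a way to sanity-check the DataFrame result, the same steps (lowercase, split on whitespace, drop empty tokens, count, sort descending) can be sketched in plain Python. This is only a reference for the expected semantics, not the DataFrame solution; the function name `word_count` and the sample lines are made up for illustration. It also shows why the empty-word filter matters: splitting on `\s+` leaves an empty token for blank lines and for lines with leading or trailing whitespace.

```python
import re
from collections import Counter


def word_count(lines):
    """Plain-Python reference: lower, split on whitespace, drop empties, count, sort."""
    counts = Counter()
    for line in lines:
        # Lowercase, then split on runs of whitespace.
        tokens = re.split(r"\s+", line.lower())
        # Blank lines and leading/trailing whitespace produce '' tokens; drop them.
        counts.update(t for t in tokens if t != "")
    # Sort by descending count; break ties alphabetically so the order is stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample input (not the contents of sample_data/README.md).
sample_lines = ["  The quick brown fox", "", "the lazy dog  ", "The end"]
result = word_count(sample_lines)
print(result[:3])  # [('the', 3), ('brown', 1), ('dog', 1)]
```

Note that Spark's `orderBy` on the count column gives no guaranteed order among words with equal counts, so only the counts themselves, not the row order within ties, should be compared against a reference like this.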

In [1]:
#@title PySpark installation and imports

!pip install pyspark --quiet
!pip install -U -q PyDrive --quiet
# Install the same JDK version that JAVA_HOME points to below (Java 11)
!apt install openjdk-11-jdk-headless &> /dev/null

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, lower, col, count

In [2]:
#@title Solution
spark = SparkSession.builder \
    .appName("WordCount") \
    .getOrCreate()

df = spark.read.text("sample_data/README.md")

# Split the lines into words and explode them, then remove empty words
# A raw string keeps "\s" from being parsed as an (invalid) Python escape sequence
words_df = df.select(explode(split(lower(col("value")), r"\s+")).alias("word"))
words_df = words_df.filter(words_df.word != '')

# Group by word and count, then sort in descending order
word_counts = words_df.groupBy("word").agg(count("*").alias("count"))
word_counts = word_counts.orderBy(col("count").desc())

word_counts.show(50)
spark.stop()

+--------------------+-----+
|                word|count|
+--------------------+-----+
|                  is|    4|
|                   *|    3|
|                 the|    3|
|                   a|    3|
|                copy|    2|
|                 was|    2|
|                  in|    2|
|                 at:|    2|
|                  of|    2|
|           described|    2|
|              sample|    2|
|                 few|    1|
|       `mnist_*.csv`|    1|
|            2682899.|    1|
|                  us|    1|
|         statistical|    1|
|          originally|    1|
|                  by|    1|
|                 you|    1|
|                more|    1|
|             'graphs|    1|
|quartet](https://...|    1|
|              [mnist|    1|
|            contains|    1|
|            includes|    1|
|            american|    1|
|     `anscombe.json`|    1|
|                  27|    1|
|             housing|    1|
|library](https://...|    1|
|       statistician.|    1|
|            p