# CS483 - Colab 1
## Word Count in Spark

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [63]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

openjdk-8-jdk-headless is already the newest version (8u422-b05-1~22.04).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [64]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [65]:
# Use file_id rather than id so the built-in id() is not shadowed
file_id = '1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('pg100.txt')

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.

In [66]:
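# Import under an alias so the PyDrive client named `drive` above is not overwritten
from google.colab import drive as colab_drive
colab_drive.mount('/content/drive')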

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Your task

If you ran the setup stage successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs the number of words that start with each letter. That is, for every letter, count the total number of words (repeated occurrences included) that start with that letter.

In your implementation, **ignore letter case**, i.e., treat all words as lowercase. You can also ignore all words that **start** with a non-alphabetic character. Output word counts for the **entire document**, including the title, author, and main text. If you encounter a word broken across lines, e.g. "pro-ject" where the segment after the hyphen is on a new line, no special processing is needed: you can safely treat it as two words.
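
The counting rules above can be sketched in plain Python as a sanity check to run on a small sample before (or alongside) the Spark job. The function name `count_first_letters` and the sample string are made up for illustration, and "alphabetic" is assumed here to mean the ASCII letters a-z.

```python
from collections import Counter
import string

def count_first_letters(text):
    """Count whitespace-separated words by lowercase first letter (ASCII a-z only)."""
    counts = Counter()
    for word in text.split():  # splits on any whitespace, newlines included
        first = word[0].lower()
        if first in string.ascii_lowercase:  # skip words starting with digits, quotes, etc.
            counts[first] += 1
    return counts

# Hypothetical sample: "'Tis" and "1599" are skipped; "pro-" and "ject," count as two words.
sample = "The Complete Works of William Shakespeare\n'Tis a pro-\nject, 1599 times."
print(sorted(count_first_letters(sample).items()))
```

Comparing this function's result against the Spark output for the same small input is a quick way to catch mistakes in tokenization or filtering.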

Your output will be graded with some tolerance: if your counts differ from the ground truth by no more than an error threshold of 5, your answer will be considered correct.
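
As a rough illustration of such a tolerance check, the sketch below compares two count dictionaries letter by letter. It assumes the threshold applies to the absolute difference of each letter's count, which is only one reading of the grading rule; the helper `within_tolerance` and the numbers are invented for the example. `builtins.abs` is used because `from pyspark.sql.functions import *` replaces `abs` inside the notebook.

```python
import builtins

def within_tolerance(expected, actual, threshold=5):
    """Return True if every letter's count differs by at most `threshold`."""
    letters = set(expected) | set(actual)
    return all(
        builtins.abs(expected.get(letter, 0) - actual.get(letter, 0)) <= threshold
        for letter in letters
    )

# Made-up counts, not the real ground truth.
expected = {"a": 100, "b": 50}
print(within_tolerance(expected, {"a": 103, "b": 50}))  # differences of 3 and 0
print(within_tolerance(expected, {"a": 110, "b": 50}))  # difference of 10 for "a"
```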

In [110]:
from pyspark.sql import SparkSession
# Explicit imports; note that `sum` here replaces Python's built-in sum in this notebook
from pyspark.sql.functions import col, explode, lower, regexp_extract, split, sum, trim

# create the Spark Session
spark = SparkSession.builder.getOrCreate()

# create the Spark Context
sc = spark.sparkContext

In [111]:
txt = spark.read.text("/content/pg100.txt")

In [112]:
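# One row per whitespace-separated token of each line
words = txt.withColumn("word", explode(split(col("value"), r"\s+")))
# This count also includes empty tokens (from blank lines and leading whitespace)
total_word_count = words.count()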
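
To see why the token count includes empty strings, here is a small sketch that uses Python's `re.split` as a stand-in for Spark's `split`. Spark's function follows Java regex semantics, so treating the two as equivalent for these inputs is an assumption; the sample lines are invented.

```python
import re

# A normal line, a blank line, and a line with leading whitespace.
lines = ["To be, or not to be,", "", "  that is the question:"]

# Mirror explode(split(value, r"\s+")): one token per split result, per line.
tokens = [tok for line in lines for tok in re.split(r"\s+", line)]
non_empty = [tok for tok in tokens if tok]

print(len(tokens), len(non_empty))
```

The blank line and the indented line each contribute one empty token. Such tokens never match a leading letter, so they do not affect the per-letter counts, only the overall token total.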

In [113]:
# regexp_extract returns "" when the token does not start with an ASCII letter
word_letters = words.withColumn("first_letter", lower(regexp_extract(col("word"), r"^([a-zA-Z])", 1)))

In [114]:
# Keep only tokens with a leading letter (a null first_letter also fails this comparison)
filtered_alpha_letters = word_letters.filter(col("first_letter") != "")

In [115]:
letter_count = filtered_alpha_letters.groupBy("first_letter").count()

In [116]:
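# show() prints only the first 20 rows by default; pass n=26 to display every letter
letter_count.orderBy("first_letter").show(truncate=False)
output_path = "output/output.csv"
# Spark writes a directory of part files at this path, not a single CSV file
letter_count.orderBy("first_letter").write.csv(output_path, header=True, mode="overwrite")
# sum here is pyspark.sql.functions.sum, not Python's built-in sum
total_count = letter_count.agg(sum("count").alias("total_count")).collect()[0]["total_count"]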

print(f"Total number of words starting with any letter: {total_count}")
print(f"Total number of words in the entire document: {total_word_count}")

+------------+------+
|first_letter|count |
+------------+------+
|a           |84836 |
|b           |45455 |
|c           |34567 |
|d           |29713 |
|e           |18697 |
|f           |36814 |
|g           |20782 |
|h           |60563 |
|i           |62167 |
|j           |3339  |
|k           |9418  |
|l           |29569 |
|m           |55676 |
|n           |26759 |
|o           |43494 |
|p           |27759 |
|q           |2377  |
|r           |14265 |
|s           |65705 |
|t           |123602|
+------------+------+
only showing top 20 rows

Total number of words starting with any letter: 895992
Total number of words in the entire document: 1023444
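
The table above stops at `t` because `show()` displays 20 rows by default, so the rows for `u` through `z` are not printed. The arithmetic below, using only the numbers printed above, shows how many words those hidden rows account for and how many tokens were excluded for not starting with a letter. `builtins.sum` is used because the notebook's `pyspark.sql.functions` import replaces `sum`.

```python
import builtins

# Counts copied from the 20 rows displayed above.
shown = {
    "a": 84836, "b": 45455, "c": 34567, "d": 29713, "e": 18697,
    "f": 36814, "g": 20782, "h": 60563, "i": 62167, "j": 3339,
    "k": 9418, "l": 29569, "m": 55676, "n": 26759, "o": 43494,
    "p": 27759, "q": 2377, "r": 14265, "s": 65705, "t": 123602,
}
letters_total = 895992   # "words starting with any letter"
tokens_total = 1023444   # all tokens, empty ones included

shown_total = builtins.sum(shown.values())
hidden_total = letters_total - shown_total       # words starting with u..z
non_letter_tokens = tokens_total - letters_total  # empty or non-alphabetic tokens

print(shown_total, hidden_total, non_letter_tokens)
```

Calling `show(26, truncate=False)` instead would print all rows, provided every letter occurs at least once.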


Once you have obtained the desired results, **head over to Gradescope and submit your solution for this Colab**!