# CS483 - Colab 1
## Word Count in Spark

### Setup

Let's set up Spark in your Colab environment. Run the cell below.

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=4bf0bf5b62cde78604c1d3ae7183d573e28d8bb17da82e5e1435a373470f1f7a
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2
The following additional packages will be installed:
  libxtst6 openjdk-8-jre-headless
Suggested packages:
  openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra fonts-nanum fonts-ipafont-gothic
  fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic

Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



In [3]:
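# Google Drive ID of pg100.txt (named file_id to avoid shadowing the built-in id())
file_id = '1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': file_id})
# Save the file into the Colab working directory
downloaded.GetContentFile('pg100.txt')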

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.

In [4]:
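# Import under an alias so the PyDrive client named `drive` above is not overwritten
from google.colab import drive as colab_drive
colab_drive.mount('/content/drive')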

Mounted at /content/drive


### Your task

Once the setup stage has run successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs the number of words starting with each letter. That is, for every letter, count the total number of words (including repeats) that start with that letter.

In your implementation, **ignore letter case**, i.e., treat all words as lowercase. You can also ignore all words that **start** with a non-alphabetic character. Output word counts for the **entire document**, including the title, author, and main text. If a word is broken across lines, e.g. "pro-ject" where the segment after the hyphen is on a new line, no special processing is needed; you can safely count it as two words.
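
To make these rules concrete, here is a minimal plain-Python sketch (no Spark) applied to a few made-up lines. The helper name `count_first_letters` and the sample lines are illustrative assumptions, not part of the assignment. The sketch also assumes that a "word" is any whitespace-separated token, which is only one reasonable reading of the task.

```python
from collections import Counter

def count_first_letters(lines):
    """Count words by lowercase first letter, skipping non-alphabetic starts."""
    counts = Counter()
    for line in lines:
        # str.split() with no argument splits on any run of whitespace
        for word in line.lower().split():
            first = word[0]
            # Keep only the letters a-z; tokens such as "'tis" or "42" are ignored
            if "a" <= first <= "z":
                counts[first] += 1
    return counts

# Hypothetical sample: "pro-" and "ject" sit on different lines and count as two words
sample = ["The Project Gutenberg EBook", "'Tis pro-", "ject 42 times"]
print(sorted(count_first_letters(sample).items()))
# [('e', 1), ('g', 1), ('j', 1), ('p', 2), ('t', 2)]
```

A small example like this can serve as a cross-check for the Spark job on a handful of lines before running it on the full file.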

Your output will be graded with a tolerance: if your counts differ from the ground truth by no more than an error threshold of 5, they will be considered correct.
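
As a rough illustration of such a tolerance check, the sketch below compares two dictionaries of counts. It assumes the threshold applies to the absolute difference for each letter separately; the actual grading procedure is not specified here. The function name `letters_outside_threshold` and all numbers are made up for illustration.

```python
def letters_outside_threshold(actual, expected, threshold=5):
    """Return the letters whose counts differ by more than `threshold`."""
    letters = set(actual) | set(expected)
    return sorted(
        letter
        for letter in letters
        # A letter missing from either dictionary is treated as a count of 0
        if abs(actual.get(letter, 0) - expected.get(letter, 0)) > threshold
    )

# Made-up numbers for illustration only, not real ground truth
expected = {"a": 100, "b": 50, "x": 3}
actual = {"a": 104, "b": 56}
print(letters_outside_threshold(actual, expected))  # ['b']
```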

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lower, split
import pandas as pd

# create the Spark Session
spark = SparkSession.builder.getOrCreate()

# create the Spark Context
sc = spark.sparkContext

In [10]:
df = spark.read.text("pg100.txt")

In [11]:
df.show(truncate=False, n=100)

+---------------------------------------------------------------------------------+
|value                                                                            |
+---------------------------------------------------------------------------------+
|The Project Gutenberg EBook of The Complete Works of William Shakespeare, by     |
|William Shakespeare                                                              |
|                                                                                 |
|This eBook is for the use of anyone anywhere at no cost and with                 |
|almost no restrictions whatsoever.  You may copy it, give it away or             |
|re-use it under the terms of the Project Gutenberg License included              |
|with this eBook or online at www.gutenberg.org                                   |
|                                                                                 |
|** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **         

In [12]:
# Step 1: Convert text to lowercase
df = df.select(lower(col("value")).alias("value"))

# Step 2: Split each line on single spaces and explode into one row per word.
# Consecutive spaces produce empty strings, which Step 3 removes.
words_df = df.select(explode(split(col("value"), " ")).alias("word"))
#words_df.show()

# Step 3: Filter out empty words and keep only those that start with an alphabetic character
words_df = words_df.filter((col("word") != "") & (col("word").rlike("^[a-z]")))
#words_df.show()

# Step 4: Select the first letter of each word
words_df = words_df.withColumn("first_letter", col("word").substr(1, 1))
#words_df.show()

# Step 5: Group by the first letter and count occurrences
letter_counts_df = words_df.groupBy("first_letter").count().orderBy("first_letter")

# Show the entire result without truncation
letter_counts_df.show(truncate=False, n=26)

+------------+------+
|first_letter|count |
+------------+------+
|a           |84836 |
|b           |45455 |
|c           |34567 |
|d           |29713 |
|e           |18697 |
|f           |36814 |
|g           |20782 |
|h           |60563 |
|i           |62167 |
|j           |3339  |
|k           |9418  |
|l           |29569 |
|m           |55676 |
|n           |26759 |
|o           |43494 |
|p           |27759 |
|q           |2377  |
|r           |14265 |
|s           |65705 |
|t           |123602|
|u           |9170  |
|v           |5728  |
|w           |59597 |
|x           |14    |
|y           |25855 |
|z           |71    |
+------------+------+
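
For comparison, the same pipeline can be sketched with the RDD API instead of DataFrames. This is an alternative formulation, not the approach used in the cell above. The helper names `first_letter_or_none` and `count_first_letters_rdd` are hypothetical, and `sc` is assumed to be the SparkContext created earlier in the notebook.

```python
from operator import add

def first_letter_or_none(word):
    """Return the lowercase first letter of `word`, or None if it is not a-z."""
    if not word:
        return None
    first = word.lower()[0]
    return first if "a" <= first <= "z" else None

def count_first_letters_rdd(sc, path):
    """Count words per first letter with RDD operations.

    `sc` is an existing SparkContext and `path` points to a text file.
    """
    return (
        sc.textFile(path)
        .flatMap(lambda line: line.split(" "))      # single-space split, as in Step 2
        .map(first_letter_or_none)
        .filter(lambda letter: letter is not None)  # drop empty and non-alphabetic tokens
        .map(lambda letter: (letter, 1))
        .reduceByKey(add)                           # sum the 1s per letter
        .sortByKey()
        .collect()
    )

# In the notebook: count_first_letters_rdd(sc, "pg100.txt")
```

The function returns a list of `(letter, count)` pairs sorted by letter, which can be compared row by row with the table above.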

