# CS246 - Colab 1
## Word Count in Spark

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

openjdk-8-jdk-headless is already the newest version (8u382-ga-1~22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.


Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
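# Google Drive ID of the input file (avoid naming it `id`, which shadows the built-in)
file_id = '1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': file_id})
# Save the file into the Colab working directory
downloaded.GetContentFile('pg100.txt')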

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.
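
To confirm the download from code rather than from the "Files" panel, a small check along these lines may help. `describe_file` is a hypothetical helper introduced here for illustration; it is not part of PyDrive or Colab.

```python
import os

def describe_file(path):
    """Return a short status string for `path` (hypothetical helper)."""
    if not os.path.exists(path):
        return f"{path}: missing"
    return f"{path}: {os.path.getsize(path)} bytes"

# Reports either the file size or that the download did not land here
print(describe_file("pg100.txt"))
```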

### Your task

If the setup stage ran successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs the number of words that start with each letter. That is, for every letter, count the total number of words (including repeats) that start with that letter.

In your implementation, **ignore the letter case**, i.e., treat all words as lower case. You can also ignore all words that **start** with a non-alphabetic character. You should output word counts for the **entire document**, including the title, author, and main text. If you encounter words broken across lines, e.g. "pro-ject" where the segment after the hyphen is on a new line, no special processing is needed and you can safely treat them as two words.
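
The rules above can be tried on a tiny in-memory example in plain Python before moving to Spark. This is only a sketch: `first_letter` and `sample` are made up for illustration, "alphabetic" is assumed to mean the ASCII letters a-z, and words are assumed to be separated by whitespace.

```python
from collections import Counter

def first_letter(word):
    """Lower-cased first character of `word`, or None if it is not an ASCII letter."""
    if word and word[0].isascii() and word[0].isalpha():
        return word[0].lower()
    return None

# Made-up sample: a quoted word, a word broken across lines, and a number
sample = "The Tragedy of Hamlet, 'Prince' of Denmark\npro-\nject 1603"

counts = Counter(
    letter
    for word in sample.split()              # split on any whitespace, including newlines
    if (letter := first_letter(word)) is not None
)

for letter in sorted(counts):
    print(f"{letter}: {counts[letter]}")
```

Here "'Prince'" and "1603" are skipped because they start with a non-alphabetic character, and "pro-" and "ject" count as two separate words.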

Your outputs are graded within a range: if their differences from the ground truth are within an error threshold of 5, they will be considered correct.

In [9]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # explicit alias avoids shadowing built-ins such as sum and max
from pyspark import SparkContext
import pandas as pd

# Create (or reuse) the Spark session
spark = SparkSession.builder.getOrCreate()

# The Spark context gives access to the RDD API
sc = spark.sparkContext

In [11]:
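# Read the file as a DataFrame with one row per line
txt = spark.read.text("pg100.txt")

# Each Row holds a single string column; extract the line itself
text = txt.rdd.map(lambda x: x[0])

# Printing an RDD shows its description, not its contents
print(text)

# Split on spaces, trim surrounding punctuation, keep purely alphabetic tokens, lower-case them
words = (
    text.flatMap(lambda line: line.split(" "))
    .map(lambda word: word.strip('.,?!:;()[]'))
    .filter(lambda word: word.isalpha())
    .map(lambda word: word.lower())
)

# NOTE: this emits a pair for every letter of every word, so the totals below
# are letter frequencies, not counts of words by first letter
letter_count = words.flatMap(lambda word: [(letter, 1) for letter in word])

# Sum the ones per letter
word_counts = letter_count.reduceByKey(lambda a, b: a + b)

for (letter, count) in word_counts.collect():
    print(f"{letter}: {count}")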



PythonRDD[16] at RDD at PythonRDD.scala:53
h: 226955
p: 54296
r: 222906
j: 4540
c: 83311
g: 64225
b: 58967
l: 156486
s: 228108
i: 241208
y: 91993
d: 134874
t: 313384
e: 425960
o: 302009
u: 123810
n: 232511
k: 33067
f: 77822
m: 106123
w: 84605
a: 276215
v: 35109
x: 4995
q: 3367
z: 1462
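
The cell above emits one pair per character of each word, so the totals it prints are letter frequencies across the text rather than counts of words by initial letter. A minimal sketch of the first-letter variant follows, assuming "alphabetic" means the ASCII letters a-z and that words are separated by whitespace. `first_letter` and `count_by_first_letter` are hypothetical names introduced here, and this is one possible approach rather than the reference solution.

```python
def first_letter(word):
    """Lower-cased first character of `word`, or None if it is not an ASCII letter."""
    if word and word[0].isascii() and word[0].isalpha():
        return word[0].lower()
    return None


def count_by_first_letter(spark, path="pg100.txt"):
    """Return sorted (letter, count) pairs for the words in `path`.

    `spark` is an active SparkSession, such as the one created earlier.
    """
    lines = spark.sparkContext.textFile(path)
    return (
        lines.flatMap(lambda line: line.split())      # whitespace-separated words
        .map(first_letter)                            # word -> initial letter or None
        .filter(lambda letter: letter is not None)    # drop non-alphabetic starts
        .map(lambda letter: (letter, 1))
        .reduceByKey(lambda a, b: a + b)              # sum the ones per letter
        .sortByKey()
        .collect()
    )
```

With the session from the earlier cell still running, `for letter, count in count_by_first_letter(spark): print(f"{letter}: {count}")` would print one line per letter. Unlike the cell above, this version does not strip punctuation or require the whole token to be alphabetic, since the task only asks about the first character.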


In [6]:
# Stop the Spark session once the job is finished
spark.stop()

Once you have obtained the desired results, **head over to Gradescope and submit your solution for this Colab**!