# CS246 - Colab 1
## Wordcount in Spark

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Collecting pyspark
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=d1cfcf6883c068901ff86888b84aad9bc75eb6f96a94c7a7d9ff0c38cf58f692
  Stored in directory: /root/.cache/pip/wheels/a5/0a/c1/9561f6fecb759579a7d863dcd846daaa95f598744e71b02c77
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2
The following additional packages will be installed:
  openjdk-8-jre-headless
Suggested packages:
  openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra
  fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhe

Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
id='1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('pg100.txt')

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.
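If you prefer to check from code rather than the "Files" tab, the sketch below reports whether the file is present and how large it is. The helper name `describe_file` is hypothetical and not part of the assignment.

```python
import os

def describe_file(path):
    """Return a short status string for a path (hypothetical helper)."""
    if not os.path.exists(path):
        return f"{path}: missing"
    return f"{path}: {os.path.getsize(path)} bytes"

# 'pg100.txt' is the file name used in the download cell above.
print(describe_file("pg100.txt"))
```

A "missing" result usually means the authentication or download cell did not finish, so re-run those cells before continuing.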

### Your task

If you ran the setup stage successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs the number of words that start with each letter. That is, for every letter, count the total number of (non-unique) words that start with that letter. In your implementation, **ignore letter case**, i.e., treat all words as lowercase. You can also ignore all words **starting** with a non-alphabetic character.
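To make the specification concrete, here is a minimal plain-Python sketch of the same counting rule, which may be handy for sanity-checking a Spark solution on a few lines. It assumes that a "word" is any whitespace-separated token and that "alphabetic" means the ASCII letters a to z; the assignment does not pin down either point, and the function name and sample lines are hypothetical.

```python
from collections import Counter
import string

def count_first_letters(lines):
    """Count words by lowercase first letter, skipping non-alphabetic starts."""
    counts = Counter()
    for line in lines:
        # str.split() with no argument splits on any run of whitespace
        # and never yields empty strings.
        for word in line.lower().split():
            first = word[0]
            if first in string.ascii_lowercase:
                counts[first] += 1
    return counts

# Hypothetical sample lines, not taken from pg100.txt.
sample = ["To be, or not to be", "  that is the Question: 1603"]
print(sorted(count_first_letters(sample).items()))
# [('b', 2), ('i', 1), ('n', 1), ('o', 1), ('q', 1), ('t', 4)]
```

Because the tokenization rule here may differ from the one used in a Spark job, small discrepancies on the full text are possible.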

In [4]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
import pandas as pd

# create the Spark Session
spark = SparkSession.builder.getOrCreate()

# create the Spark Context
sc = spark.sparkContext

In [8]:
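# Read the dataset, lowercase each line, and split it on single spaces
words = sc.textFile('pg100.txt').map(lambda line: line.lower()).flatMap(lambda line: line.split(" "))
# Map each word to a (first character, 1) pair
words_mapped = words.map(lambda x: (x[0:1],1))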

In [9]:
# Reduce the mapped pairs by key, summing the counts for each first character
words_count = words_mapped.reduceByKey(lambda a, b: a + b)

[('p', 27759), ('g', 20782), ('c', 34567), ('s', 65705), ('b', 45455), ('', 506610), ('i', 62167), ('r', 14265), ('y', 25855), ('l', 29569), ('*', 24), ('d', 29713), ('1', 458), ('[', 2073), ('#', 3), ('j', 3339), ('h', 60563), ('.', 52), ('"', 356), ('9', 28), ('4', 46), ('_', 1), ('8', 15), ('?', 2), ('}', 2), ('$', 1), ('0', 6), ('t', 123602), ('e', 18697), ('o', 43494), ('w', 59597), ('f', 36814), ('u', 9170), ('a', 84836), ('n', 26759), ('m', 55676), ('2', 95), ('<', 248), ('v', 5728), ('(', 639), ('k', 9418), ('3', 59), ('/', 2), ("'", 3804), ('5', 35), ('q', 2377), ('6', 22), ('7', 17), ('z', 71), ('-', 52), (']', 7), ('x', 14), ('&', 21), (':', 1)]
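The large count for the empty key `''` comes from splitting on a single space: blank lines and runs of consecutive spaces produce empty tokens, and slicing an empty string with `[0:1]` returns `''`. The plain-Python sketch below reproduces this on a hypothetical sample line and uses a dictionary as a stand-in for `reduceByKey`.

```python
line = "to be  or not"  # note the double space (hypothetical sample)

tokens_single = line.split(" ")   # ['to', 'be', '', 'or', 'not']
tokens_any = line.split()         # ['to', 'be', 'or', 'not']

# Slicing an empty string is safe and returns '', so the empty
# token becomes the key '' instead of raising an IndexError.
pairs = [(tok[0:1], 1) for tok in tokens_single]

# Plain-Python stand-in for reduceByKey(lambda a, b: a + b).
counts = {}
for key, value in pairs:
    counts[key] = counts.get(key, 0) + value
print(counts)
# {'t': 1, 'b': 1, '': 1, 'o': 1, 'n': 1}
```

Splitting with `split()` instead of `split(" ")` would avoid the empty tokens at the source; the notebook instead drops them when printing the results.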


In [12]:
# Display the word counts, keeping only keys that are lowercase letters
for key, value in words_count.collect():
  if key.islower():
    print(key, value)

p 27759
g 20782
c 34567
s 65705
b 45455
i 62167
r 14265
y 25855
l 29569
d 29713
j 3339
h 60563
t 123602
e 18697
o 43494
w 59597
f 36814
u 9170
a 84836
n 26759
m 55676
v 5728
k 9418
q 2377
z 71
x 14
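The rows above appear in whatever order `collect()` returns them. As an optional variation, the sketch below sorts the pairs alphabetically and keeps only the ASCII letters a to z. `str.islower()` also returns true for non-ASCII lowercase letters such as `'é'`, so an explicit letter set is slightly stricter; none of the keys collected here are affected. The `collected` list holds a few pairs copied from the collected output shown earlier.

```python
import string

# A few (key, count) pairs copied from the collected output.
collected = [('p', 27759), ('', 506610), ('*', 24),
             ('a', 84836), ('1', 458), ('z', 71)]

# Use a set so that the empty key '' is not treated as a substring match.
letters = set(string.ascii_lowercase)

by_letter = sorted((k, v) for k, v in collected if k in letters)
for key, value in by_letter:
    print(key, value)
# a 84836
# p 27759
# z 71
```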


Once you have obtained the desired results, **head over to Gradescope and submit your solution for this Colab**!