### Download and install Java, Spark (pre-built for Hadoop 3.2), and findspark

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz
!tar xf spark-3.0.3-bin-hadoop3.2.tgz
!pip install -q findspark

### Set environment variables

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.3-bin-hadoop3.2"
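A wrong path in either variable usually surfaces only later, when the Spark session fails to start. The sketch below is one way to check the paths up front; `missing_paths` is a hypothetical helper written for this illustration, not part of Spark or findspark.

```python
import os

def missing_paths(var_names, environ=os.environ):
    """Return the variables that are unset or do not point at an existing directory.

    Hypothetical helper for illustration only.
    """
    missing = []
    for name in var_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# An empty list means both variables point at real directories.
print(missing_paths(["JAVA_HOME", "SPARK_HOME"]))
```

If the list is not empty, re-check the install cell and the directory names before creating the session.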

### Install PySpark

In [3]:
!pip install pyspark==3.0.3

Collecting pyspark==3.0.3
  Downloading pyspark-3.0.3.tar.gz (209.1 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.3-py2.py3-none-any.whl size=209435970 sha256=1b4715227097ec27a3a8e309f626febbc627118ac1cc2e8e7398b81f1121a1de
  Stored in directory: /root/.cache/pip/wheels/7e/6d/0a/6b0bf301bc056d9af03194b732b9f49ad2fceb205aab2984fd
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.3


### Create spark session

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("Word_Count").getOrCreate()
sc = spark.sparkContext

### Upload the text file to the Colab runtime

In [5]:
from google.colab import files

files.upload()

Saving icp2.txt to icp2.txt


{'icp2.txt': b'As the Labor Day holiday nears, many people are planning travel and get-togethers to see family and friends. Unfortunately, this is occurring at the same time Covid-19 rates are climbing. The rates of new coronavirus infections are higher than they have been since January. Hospitalizations are also at their highest levels since January. In many parts of the United States, both infections and hospitalizations are higher than they were during Labor Day weekend in 2020. How should people think about Covid-19 safety now, compared to last year? Is it safe to see family and friends? What if extended family members want to stay in a house together -- what are some steps they should take to reduce risk? And how does the start of school affect our risk? To help navigate these questions, we spoke with CNN Medical Analyst Dr. Leana Wen. Wen is an emergency physician and visiting professor of health policy and management at the George Washington University Milken Institute School of

### Read the text file into an RDD

In [6]:
data = sc.textFile("icp2.txt")

### Show the contents of the RDD

In [7]:
data.collect()

['As the Labor Day holiday nears, many people are planning travel and get-togethers to see family and friends. Unfortunately, this is occurring at the same time Covid-19 rates are climbing. The rates of new coronavirus infections are higher than they have been since January. Hospitalizations are also at their highest levels since January. In many parts of the United States, both infections and hospitalizations are higher than they were during Labor Day weekend in 2020. How should people think about Covid-19 safety now, compared to last year? Is it safe to see family and friends? What if extended family members want to stay in a house together -- what are some steps they should take to reduce risk? And how does the start of school affect our risk? To help navigate these questions, we spoke with CNN Medical Analyst Dr. Leana Wen. Wen is an emergency physician and visiting professor of health policy and management at the George Washington University Milken Institute School of Public Healt

* First, split each line on spaces and then on tabs to get the individual words.
* Convert every word to lowercase so that words differing only in case compare as equal.
* Strip punctuation characters from the start and end of each word.
* Map each word to a (word, word) pair.
* Use reduceByKey() to keep a single copy of each distinct word.
* Map each word to a key-value pair whose key is the word's uppercased first letter and whose value is the title-cased word.
* Use reduceByKey() again to join all the words that share a first letter into one comma-separated string.
* Finally, sort the output alphabetically by first letter.
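To make the steps concrete, here is a minimal plain-Python sketch of the same logic on a tiny, made-up sample (the two sample lines are hypothetical and not taken from icp2.txt). It uses ordinary lists and dictionaries instead of RDDs, and it drops empty strings explicitly so that indexing the first letter is always safe.

```python
from collections import defaultdict

# Hypothetical sample text, not taken from icp2.txt.
sample_lines = ["The cat sat.\tThe dog ran,", "a Cat and a dog"]

def group_words_by_first_letter(lines):
    """Mirror the listed steps with plain Python containers."""
    words = []
    for line in lines:
        for chunk in line.split(" "):          # split on spaces
            words.extend(chunk.split("\t"))    # then on tabs
    # Lowercase and trim punctuation from both ends of each word.
    cleaned = [w.lower().strip(",.:`?\"") for w in words]
    # Drop empty strings and duplicates, keeping first-seen order.
    unique = list(dict.fromkeys(w for w in cleaned if w))
    # Group the title-cased words under their uppercased first letter.
    groups = defaultdict(list)
    for word in unique:
        groups[word[0].upper()].append(word.title())
    # Sort by first letter and join each group's words.
    return [(letter, ", ".join(groups[letter])) for letter in sorted(groups)]

grouped = group_words_by_first_letter(sample_lines)
for pair in grouped:
    print(pair)
```

In this sketch the words inside each group appear in first-seen order. In Spark, the order within a group depends on how the data is partitioned and shuffled, so it should not be relied on.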





In [8]:
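result = (
    data.flatMap(lambda line: line.split(' '))          # split each line on spaces
    .flatMap(lambda token: token.split('\t'))           # then split on tabs
    .map(lambda token: token.lower().strip(",.:`?\""))  # lowercase, trim punctuation at both ends
    .filter(lambda word: word)                          # drop empty strings so word[0] is always valid
    .map(lambda word: (word, word))                     # key each word by itself
    .reduceByKey(lambda word1, word2: word1)            # keep one copy of each distinct word
    .map(lambda pair: (pair[0][0].upper(), pair[0].title()))  # (first letter, title-cased word)
    .reduceByKey(lambda words, word: words + ", " + word)     # join words sharing a first letter
    .sortByKey()                                        # sort by first letter
)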

### Print the final output

In [9]:
for element in result.collect():
  print(element)

('-', '--')
('2', '2020, 29')
('A', 'As, Are, At, Analyst, An, Against, Angeles, And, Also, About, A, Affect, Author')
('B', 'Both, Book, Been, Be, By')
('C', 'Climbing, Compared, Cnn, Centers, Covid-19, Coronavirus, County, Control')
('D', "During, Doctor'S, Different, Day, Does, Dr, Disease")
('E', 'Emergency, Extended')
('F', 'Family, Fight, Friends, For')
('G', 'George, Get-Togethers')
('H', 'Holiday, Higher, Have, House, Help, Hospitalized, Hospitalizations, Highest, How, Health')
('I', 'Is, Infections, In, Institute, It, If')
('J', 'January, Journey')
('L', 'Levels, Last, Leana, Lifelines, Likely, Los, Labor')
('M', 'Members, Medical, Management, More, Many, Milken, Main')
('N', 'New, Now, Nears, Navigate')
('O', 'Of, Officials, Occurring, Our, One')
('P', 'Planning, Policy, Public, Published, Prevention, People, Parts, Physician, Professor, Protect')
('Q', 'Questions')
('R', 'Rates, Reduce, Risk, Reason, Report')
('S', "Safe, Steps, Start, School, Spoke, Said, See, Same, Since, 
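The `Doctor'S` entry in the output comes from `str.title()`, which upper-cases the letter after every non-letter character, including an apostrophe. If only the first character should be capitalized, one possible alternative is sketched below; `capitalize_first` is a hypothetical helper, and note that it also changes hyphenated words such as `Get-Togethers`.

```python
def capitalize_first(word):
    """Upper-case only the first character and leave the rest untouched.

    Hypothetical helper for illustration only.
    """
    return word[:1].upper() + word[1:]

samples = ["doctor's", "get-togethers", "covid-19"]
title_cased = [w.title() for w in samples]
first_only = [capitalize_first(w) for w in samples]

print(title_cased)  # ["Doctor'S", 'Get-Togethers', 'Covid-19']
print(first_only)   # ["Doctor's", 'Get-togethers', 'Covid-19']
```

Which form is preferable depends on how the grouped words are meant to be displayed.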

### Finally, download the output file and save it to your system

In [13]:
result.coalesce(1).saveAsTextFile("output")
files.download("output/part-00000")
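Note that `saveAsTextFile` raises an error when the target directory already exists, so re-running the cell fails unless the old `output` folder is removed first. A minimal sketch of one way to handle this is shown below; `clear_output_dir` is a hypothetical helper, and the demonstration runs on a throwaway temporary directory rather than the real `output` folder.

```python
import os
import shutil
import tempfile

def clear_output_dir(path):
    """Delete a previous output directory if present; return True when one was removed.

    Hypothetical helper for illustration only.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False

# Demonstration on a throwaway directory.
demo_dir = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(demo_dir)
first_call = clear_output_dir(demo_dir)   # directory existed, so it is removed
second_call = clear_output_dir(demo_dir)  # nothing left to remove
print(first_call, second_call)
```

In the notebook, calling `clear_output_dir("output")` before `saveAsTextFile("output")` would make the cell safe to re-run.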
