Download Java, Spark, and PySpark

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
!pip install -q findspark

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
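
Before starting Spark, it can help to confirm that both variables point to directories that actually exist, since a wrong path usually only shows up later as a failure to launch the JVM. The helper below is a minimal sketch; the name missing_dirs is hypothetical and not part of the original notebook.

```python
import os

def missing_dirs(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# In the notebook this would be called with os.environ;
# an empty list means both paths exist.
print(missing_dirs(os.environ))
```

If the list is not empty, the install cell or the paths above probably need another look.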

In [4]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/8e/b0/bf9020b56492281b9c9d8aae8f44ff51e1bc91b3ef5a884385cb4e389a40/pyspark-3.0.0.tar.gz (204.7MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.0-py2.py3-none-any.whl size=205044184 sha256=146b9653d1490bdfc6cfce94e67f8c1a2f07f74081d20f894d56a9c8407af2a5
  Stored in directory: /root/.cache/pip/wheels/57/27/4d/ddacf7143f8d5b76c45c61ee2e43d9f8492fc5a8e78ebd7d37
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.0


Prepare the Spark session and get its SparkContext, which is used to create an RDD

In [5]:
from pyspark.sql import SparkSession
from pyspark import SparkContext
spark = SparkSession.builder.master("local[*]").appName("Word_Count_Application").getOrCreate()
sc = spark.sparkContext

Upload the text file to Colab and load it as an RDD

In [7]:
from google.colab import files
files.upload()

Saving icp2.txt to icp2.txt


{'icp2.txt': b'The University of South Carolina reports that more than 1,000 students currently have the virus.\r\nThe C.D.C. tells health officials to be ready to distribute a vaccine by November, raising concerns over politicized timing.\r\nIn Iowa, college students staged a sickout, and a football opener won\xe2\x80\x99t have fans after all.\r\nVirus fallout from the Sturgis motorcycle rally: A death in Minnesota, cases in South Dakota and more.\r\nNew studies show inexpensive steroid drugs can help critically sick people survive Covid-19.\r\nSilvio Berlusconi, Italy\xe2\x80\x99s former prime minister, tests positive.\r\nA judge orders the University of California to stop considering SAT or ACT scores because of the pandemic.'}
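
The dictionary returned by files.upload() maps the file name to the raw bytes of the file, which is why the output shows \r\n line endings and \xe2\x80\x99 (the UTF-8 encoding of a right single quotation mark). As a small sketch, two of the uploaded lines are decoded below to show how those bytes correspond to text lines; the variable names are only illustrative.

```python
# Two lines excerpted from the upload output above (bytes copied as-is).
raw = (b'In Iowa, college students staged a sickout, and a football opener '
       b'won\xe2\x80\x99t have fans after all.\r\n'
       b'Silvio Berlusconi, Italy\xe2\x80\x99s former prime minister, tests positive.')

# Decode as UTF-8 and split on the \r\n line endings.
lines = raw.decode("utf-8").splitlines()
for line in lines:
    print(line)
```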

In [8]:
word_data = sc.textFile("icp2.txt")

First, a map strips everything except word characters (letters, digits, underscores) and whitespace from each line, which removes all punctuation.
Two flatMaps then split the lines into words, first on spaces and then on tabs.
Two maps then convert each word to lowercase and pair it with itself, so the word is both the key and the value.
The first reduceByKey keeps a single copy of each word, which removes duplicates.
The following map produces pairs with the first letter of the word as the key and the word itself as the value.
The second reduceByKey joins the words that share a first letter into one comma-separated string, and sortByKey orders the groups by that letter.

In [66]:
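import re

words = (
    word_data
    .map(lambda s: re.sub(r'[^\w\s]', '', s))          # strip punctuation
    .flatMap(lambda line: line.split(' '))             # split on spaces
    .flatMap(lambda token: token.split('\t'))          # then split on tabs
    .filter(lambda token: token)                       # drop empty tokens so pair[0][0] below cannot fail
    .map(lambda s: s.lower())                          # normalize case
    .map(lambda word: (word, word))                    # key each word by itself
    .reduceByKey(lambda word1, word2: word1)           # keep one copy of each word
    .map(lambda pair: (pair[0][0], pair[0]))           # (first letter, word)
    .reduceByKey(lambda acc, word: acc + ", " + word)  # join words sharing a first letter
    .sortByKey()                                       # order groups by first letter
)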
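
For checking the logic without a Spark runtime, the same steps can be sketched in plain Python. This is an analogue rather than the notebook's method: the function name is hypothetical, str.split() is used instead of the two explicit splits (so empty tokens never appear), and the words inside each group are sorted alphabetically, whereas in Spark their order depends on how the data is partitioned.

```python
import re
from collections import defaultdict

def group_words_by_first_letter(lines):
    """Plain-Python analogue of the RDD chain: sorted (letter, [words]) pairs."""
    unique_words = set()
    for line in lines:
        cleaned = re.sub(r'[^\w\s]', '', line)  # strip punctuation
        for token in cleaned.split():           # split on any whitespace
            unique_words.add(token.lower())     # lowercase and deduplicate

    groups = defaultdict(list)
    for word in sorted(unique_words):           # sorted only to make the result deterministic
        groups[word[0]].append(word)            # key each word by its first character
    return sorted(groups.items())               # order the groups by key

# One line taken from icp2.txt.
sample = ["New studies show inexpensive steroid drugs can help critically sick people survive Covid-19."]
for letter, grouped in group_words_by_first_letter(sample):
    print((letter, ", ".join(grouped)))
```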

The following loop prints the contents of the RDD produced by the previous map-reduce

In [67]:
for element in words.collect():
  print(element)

('1', '1000')
('a', 'after, act, a, and, all')
('b', 'berlusconi, be, by, because')
('c', 'concerns, college, cases, critically, carolina, currently, cdc, can, covid19, california, considering')
('d', 'dakota, distribute, death, drugs')
('f', 'football, fans, fallout, from, former')
('h', 'have, help, health')
('i', 'in, iowa, inexpensive, italys')
('j', 'judge')
('m', 'more, motorcycle, minnesota, minister')
('n', 'new, november')
('o', 'of, officials, orders, over, opener, or')
('p', 'politicized, positive, people, prime, pandemic')
('r', 'ready, reports, raising, rally')
('s', 'staged, sickout, studies, steroid, stop, sat, scores, south, students, sturgis, show, sick, survive, silvio')
('t', 'than, tells, tests, the, that, to, timing')
('u', 'university')
('v', 'vaccine, virus')
('w', 'wont')
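
Tokens such as '1000', 'cdc', 'covid19', 'italys' and 'wont' in the output above come from the punctuation-stripping step: the regular expression deletes commas, periods, hyphens and apostrophes rather than replacing them with spaces, so the pieces on either side are joined. A short sketch on strings taken from icp2.txt (the helper name normalize is hypothetical):

```python
import re

def normalize(text):
    """Apply the same punctuation stripping and lowercasing as the pipeline."""
    return re.sub(r'[^\w\s]', '', text).lower()

for text in ["1,000", "C.D.C.", "Covid-19.", "Italy\u2019s", "won\u2019t"]:
    print(repr(text), "->", repr(normalize(text)))
```

If keeping "covid-19" or "won't" intact matters, the pattern would need to allow those characters; that would be a change to the original approach.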


The results are then coalesced into a single partition and saved to the output folder, and the resulting part file is downloaded from Google Colab.

In [73]:
words.coalesce(1).saveAsTextFile("output")
files.download("output/part-00000")
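
Re-running the cell above normally fails if the output folder already exists, because Spark refuses to write into an existing directory. One possible workaround, sketched here with a hypothetical helper name, is to remove the old folder before saving again.

```python
import os
import shutil

def clear_output_dir(path):
    """Delete a previous output directory, if any, so it can be written again."""
    if os.path.isdir(path):
        shutil.rmtree(path)

# Intended use in the notebook, before the save:
# clear_output_dir("output")
# words.coalesce(1).saveAsTextFile("output")
```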