***Write a Spark Program to group the words in a given text file based on the starting letters***

Installing Spark and Hadoop as the first step

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop3.2.tgz
!tar xf spark-3.0.3-bin-hadoop3.2.tgz
!pip install -q findspark

Checking which JVMs (Java Virtual Machines) are installed

In [None]:
!ls /usr/lib/jvm/

default-java		   java-11-openjdk-amd64     java-8-openjdk-amd64
java-1.11.0-openjdk-amd64  java-1.8.0-openjdk-amd64


Setting the locations of Java and Spark through environment variables

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.3-bin-hadoop3.2"

Installing PySpark to write Spark code in Python. Note that this installs PySpark 3.0.2, while the Spark distribution downloaded above is 3.0.3.

In [None]:
!pip install pyspark==3.0.2

Collecting pyspark==3.0.2
  Downloading pyspark-3.0.2.tar.gz (204.8 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.2-py2.py3-none-any.whl size=205186690 sha256=a67527e2fb67d89e4e63c7a38aa6993904f0b0ad9d00128cffbb8f0e7aab50d4
  Stored in directory: /root/.cache/pip/wheels/9a/39/f6/970565f38054a830e9a8593f388b36e14d75dba6c6fdafc1ec
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.2


Configuring the Spark application using SparkConf() and SparkContext

**SparkContext**: The entry point for Spark functionality. It represents the connection to a Spark cluster and is used to create RDDs on that cluster.

   **.getOrCreate()** returns the existing SparkContext, or creates a new one if none exists. The variable spark below therefore holds a SparkContext, not an RDD.

**SparkConf()**: It configures the Spark application and is used to set various Spark parameters as key-value pairs.

  **.setAppName()** sets the name of the Spark application.

   **.setMaster()** sets the master URL to connect to. The value "local[*]" runs Spark locally with as many worker threads as there are logical cores.

In [None]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("Niharika_ICP").setMaster("local[*]")
spark = SparkContext.getOrCreate(conf=conf)

Uploading the text file using the files.upload() function from google.colab

In [None]:
#Uploaded the ICP text file from local directory  
from google.colab import files
files.upload()

Saving icp.txt to icp.txt


{'icp.txt': b'As the Labor Day holiday nears, many people are planning travel and get-togethers to see family and friends. Unfortunately, this is occurring at the same time Covid-19 rates are climbing. The rates of new coronavirus infections are higher than they have been since January. Hospitalizations are also at their highest levels since January. In many parts of the United States, both infections and hospitalizations are higher than they were during Labor Day weekend in 2020.How should people think about Covid-19 safety now, compared to last year? Is it safe to see family and friends? What if extended family members want to stay in a house together -- what are some steps they should take to reduce risk? And how does the start of school affect our risk?To help navigate these questions, we spoke with CNN Medical Analyst Dr.Leana Wen. Wen is an emergency physician and visiting professor of health policy and management at the George Washington University Milken Institute School of Pub

Loading the file into an RDD using the textFile() function

In [None]:
#Reading the text file that is uploaded
data = spark.textFile("icp.txt")

Displaying the contents of the RDD

In [None]:
#Printing the data in the text file
data.take(1)

['As the Labor Day holiday nears, many people are planning travel and get-togethers to see family and friends. Unfortunately, this is occurring at the same time Covid-19 rates are climbing. The rates of new coronavirus infections are higher than they have been since January. Hospitalizations are also at their highest levels since January. In many parts of the United States, both infections and hospitalizations are higher than they were during Labor Day weekend in 2020.How should people think about Covid-19 safety now, compared to last year? Is it safe to see family and friends? What if extended family members want to stay in a house together -- what are some steps they should take to reduce risk? And how does the start of school affect our risk?To help navigate these questions, we spoke with CNN Medical Analyst Dr.Leana Wen. Wen is an emergency physician and visiting professor of health policy and management at the George Washington University Milken Institute School of Public Health. 

Counting the lines in the RDD with count(), then defining a function that removes punctuation and leading and trailing white space using the translate() and maketrans() functions

In [None]:
data.count()

1

In [None]:
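# Helper function to remove punctuation and leading/trailing white space
import string
def Rem_punc_spaces(l):
  # Translation table that deletes every character in string.punctuation
  p = str.maketrans('', '', string.punctuation)
  l = l.translate(p)
  # strip() trims only the ends; inner spaces are kept for the later split()
  l = l.strip()
  return l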
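To see what the helper does outside Spark, the short sketch below repeats the function on a sample string taken from the article text. The sample string and the variable names are only illustrative.

```python
import string

def Rem_punc_spaces(l):
    # Same logic as the notebook cell: delete punctuation, trim the ends
    p = str.maketrans('', '', string.punctuation)
    return l.translate(p).strip()

# Illustrative input: padded with spaces, containing a hyphen and a period
sample = "  Covid-19 rates are climbing.  "
cleaned = Rem_punc_spaces(sample)
print(repr(cleaned))  # 'Covid19 rates are climbing'
```

The hyphen and the period are removed because both are in string.punctuation, while the spaces between words survive so that split() can separate the words later.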
  
  

Applying the function to remove punctuation, splitting each line into words with flatMap(), and filtering out words whose first character is numeric

In [None]:
#Removing punctuation, splitting into words, and dropping words that start with a digit
data_without_Punc = data.flatMap(lambda lines: Rem_punc_spaces(lines).split()).filter(lambda p: not p[0].isnumeric())
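The same clean, split, and filter steps can be traced in plain Python on a fragment of the article. This is only a sketch of the logic, not Spark code; the fragment and variable names are chosen for illustration.

```python
import string

# Fragment of the article where a sentence boundary has no space after the period
line = "weekend in 2020.How should people think about Covid-19 safety"

table = str.maketrans('', '', string.punctuation)
words = line.translate(table).strip().split()

# Same predicate as the filter() in the notebook: drop words starting with a digit
kept = [w for w in words if not w[0].isnumeric()]

print(words)  # includes '2020How' and 'Covid19'
print(kept)   # '2020How' is gone, 'Covid19' remains
```

One side effect visible here: removing the period fuses "2020" and "How" into a single token, so the filter discards "How" along with the number. The fused word 'Riskto' in the later output comes from the same kind of missing space.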

In [None]:
#Counting the number of words in the text after filtering
data_without_Punc.count()

245

Removing duplicates from RDD using distinct() function

In [None]:
#Removing duplicates from the data
data_Unique = data_without_Punc.distinct()

In [None]:
#Counting the number of words after removing duplicates
data_Unique.count()

161

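Capitalizing the first letter of every word using the capitalize() function. Note that capitalize() also lowercases the remaining letters (for example, 'CNN' becomes 'Cnn'), and because distinct() ran before this step, words that differed only in case, such as 'What' and 'what', become duplicates again.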

In [None]:
#Capitalizing the first letter of every word
data_First_Cap = data_Unique.map(lambda p: p.capitalize())

In [None]:

data_First_Cap.take(10)

['Holiday',
 'Are',
 'Planning',
 'Family',
 'Unfortunately',
 'This',
 'Is',
 'At',
 'Rates',
 'Climbing']
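The effect of the step order can be checked with plain Python sets. The sketch below uses a small hand-picked word list (an assumption for illustration) and compares deduplicating before capitalizing, as the notebook does, with capitalizing first.

```python
# Hand-picked sample containing case variants of the same word
words = ["What", "what", "CNN", "we", "We"]

# Order used in the notebook: distinct first, then capitalize
dedup_then_cap = sorted(w.capitalize() for w in set(words))

# Alternative order: capitalize first, then distinct
cap_then_dedup = sorted({w.capitalize() for w in words})

print(dedup_then_cap)  # ['Cnn', 'We', 'We', 'What', 'What']
print(cap_then_dedup)  # ['Cnn', 'We', 'What']
```

In RDD terms, the alternative order would correspond to calling map() with capitalize() before distinct(); the notebook's outputs were produced with the original order.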

Pairing each word with its first character, which serves as the key

In [None]:
#Building (first letter, word) pairs so that words can be grouped by key
Final_Data = data_First_Cap.map(lambda p: (p[0],p))

In [None]:
Final_Data.take(5)

[('H', 'Holiday'),
 ('A', 'Are'),
 ('P', 'Planning'),
 ('F', 'Family'),
 ('U', 'Unfortunately')]

Combining the words that share the same first character into one comma-separated string

In [None]:
#Concatenating all the words that start with same letter
Final_Output = Final_Data.reduceByKey(lambda a,b:str(a)+ ',' +str(b))

In [None]:
Final_Output.take(5)

[('R', 'Rates,Reduce,Risk,Riskto,Reason,Report'),
 ('C', 'Climbing,Compared,County,Centers,Covid19,Coronavirus,Cnn,Control'),
 ('O', 'Of,Officials,Occurring,Our,One'),
 ('N', 'New,Now,Nears,Navigate'),
 ('J', 'January,Journey')]
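What reduceByKey() does with this lambda can be imitated in plain Python: collect the values per key, then fold them pairwise with the same function. This is a rough single-machine sketch with made-up sample pairs, not how Spark executes it.

```python
from collections import defaultdict
from functools import reduce

# Illustrative (key, word) pairs in the shape produced by the previous step
pairs = [("R", "Rates"), ("C", "Climbing"), ("R", "Reduce"),
         ("C", "Compared"), ("R", "Risk")]

# Collect the words that belong to each key
grouped = defaultdict(list)
for key, word in pairs:
    grouped[key].append(word)

# Fold each group with the same function passed to reduceByKey()
joined = {key: reduce(lambda a, b: str(a) + ',' + str(b), ws)
          for key, ws in grouped.items()}

print(joined)  # {'R': 'Rates,Reduce,Risk', 'C': 'Climbing,Compared'}
```

In Spark the values of one key may be combined in a different order across partitions, so the order of words inside each string is not guaranteed the way it is in this sketch.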

Sorting the list in reverse alphabetical order

In [None]:
#Sorting the list in reverse alphabetical order
output = Final_Output.sortByKey(False)

In [None]:
output.take(5)

[('Y', 'Year'),
 ('W', 'What,We,Wen,Washington,Were,Weekend,Want,What,With,We,Well,Who'),
 ('V', 'Very,Visiting,Vaccines,Vaccinated'),
 ('U', 'Unfortunately,United,University,Unvaccinated,Us'),
 ('T',
  'This,The,Than,Think,Take,These,Things,The,Travel,To,Time,They,Their,Together,That,Times')]
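For comparison, sorting by key in descending order looks like this in plain Python. The rows are a small illustrative sample; passing False to sortByKey() plays the role of reverse=True here.

```python
# Illustrative (letter, words) rows in arbitrary order
rows = [("R", "Rates,Reduce"), ("Y", "Year"),
        ("C", "Climbing,Compared"), ("W", "What,We")]

# Sort on the key only, in descending (reverse alphabetical) order
ordered = sorted(rows, key=lambda kv: kv[0], reverse=True)

print([k for k, _ in ordered])  # ['Y', 'W', 'R', 'C']
```

Only the keys are sorted; the words inside each comma-separated string keep whatever order the reduce step produced.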

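Saving the output as "ICP_2.txt" using the saveAsTextFile() function. Spark writes this path as a directory containing part files rather than as a single text file; coalesce(1) reduces the RDD to one partition so that all rows land in a single part file.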

In [None]:
#Saving the output as file
output.map(lambda row: str(row[0]) + ",\t" +str(row[1])) \
.coalesce(1).saveAsTextFile("ICP_2.txt")
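Because the saved path is a directory, reading the result back means reading the part files inside it. The helper below is a hypothetical sketch (the name read_saved_output is not part of Spark) that assumes the usual part-* file naming.

```python
import glob
import os

def read_saved_output(path):
    """Return all lines from the part-* files in a saveAsTextFile() output directory."""
    lines = []
    # Marker and checksum files such as _SUCCESS do not match the part-* pattern
    for part in sorted(glob.glob(os.path.join(path, "part-*"))):
        with open(part, encoding="utf-8") as fh:
            lines.extend(line.rstrip("\n") for line in fh)
    return lines

# Example call, assuming the cell above has already been run:
# rows = read_saved_output("ICP_2.txt")
```

Each returned line has the form written by the map() step above: the letter, a comma and a tab, then the comma-separated words.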