# Print word frequencies

- After combining the values (counts) that share a key (word), you'll print the word frequencies using the `take(N)` action. You could use `collect()` instead, but that is discouraged as a best practice because `collect()` returns every element of the RDD to the driver. `take(N)` returns only the first N elements.

- What if you want the 10 most frequent words? First, swap the key (word) and value (count) in each tuple so that the count becomes the key and the word becomes the value. Then sort the pair RDD by key (count) in descending order and print the first 10 entries.

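- A `SparkContext` named `sc` and an RDD named `resultRDD` are already available in your workspace. The notebook cells below create `sc` and build the counts themselves, under the name `countRDD`.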


## Instructions
- Print the first 10 words and their frequencies from the `resultRDD`.
- Swap the keys and values in the `resultRDD`.
- Sort the keys in descending order.
- Print the top 10 most frequent words and their frequencies.
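
The four steps above can be traced on a tiny in-memory sample before running them on the RDD. The following is a plain-Python sketch, not Spark code: the list `sample_lines` and all variable names are made up for illustration, and each list operation only mimics the RDD step named in its comment.

```python
from collections import Counter

# Hypothetical sample standing in for the lines of the text file
sample_lines = ["the cat and the hat", "the end"]

# Like flatMap: split every line on single spaces and flatten the result
words = [w for line in sample_lines for w in line.split(' ')]

# Like map + reduceByKey: one (word, count) pair per distinct word
counts = list(Counter(words).items())

# Like map(lambda x: (x[1], x[0])): swap each pair to (count, word)
swapped = [(count, word) for word, count in counts]

# Like sortByKey(ascending=False) followed by take(2)
top_two = sorted(swapped, key=lambda pair: pair[0], reverse=True)[:2]

for count, word in top_two:
    print("{} has {} counts".format(word, count))
```

In this sample `the` appears three times, so it is printed first. The remaining words are tied at one occurrence each, so which of them comes second depends on how ties are ordered.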

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [4]:
file_path = "file:////home/talentum/test-jupyter/P2/M2/SM4/Dataset/constitution.txt"

# Create a baseRDD from the file path
baseRDD = sc.textFile(file_path)

# Split the lines of baseRDD into words
splitRDD = baseRDD.flatMap(lambda line: line.split(' '))

# Create a tuple of the word and 1 
splitRDD_tup = splitRDD.map(lambda word: (word, 1))

# Count the number of occurrences of each word
countRDD = splitRDD_tup.reduceByKey(lambda x, y: x + y)

# Display the first 10 words and their frequencies
for word in countRDD.take(10):
    print(word)

# Swap the keys and values 
countRDD_swap = countRDD.map(lambda x: (x[1], x[0]))

# Sort the keys in descending order
sortedRDD_swap = countRDD_swap.sortByKey(ascending=False)

print("\n Most frequent word :")
# Show the top 10 most frequent words and their frequencies
for word in sortedRDD_swap.take(10):
    print("{} has {} counts".format(word[1], word[0]))


('We', 2)
('the', 662)
('People', 2)
('of', 493)
('United', 85)
('States,', 55)
('in', 137)
('Order', 1)
('to', 183)
('form', 1)

 Most frequent word :
 has 812 counts
the has 662 counts
of has 493 counts
shall has 293 counts
and has 256 counts
to has 183 counts
be has 178 counts
or has 157 counts
in has 137 counts
by has 100 counts
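
The first entry in the output above is an empty string with 812 counts. A likely cause is that `line.split(' ')` splits on every single space, so runs of spaces and blank lines produce empty tokens. The sketch below shows the difference in plain Python, using made-up sample strings rather than the actual file contents.

```python
# Hypothetical sample lines: one with a double space, one blank
double_spaced = "We  the People"
blank = ""

# Splitting on a single space keeps empty tokens
print(double_spaced.split(' '))   # ['We', '', 'the', 'People']
print(blank.split(' '))           # ['']

# Splitting on any run of whitespace drops them
print(double_spaced.split())      # ['We', 'the', 'People']
print(blank.split())              # []

# Alternatively, keep split(' ') and filter out the empty tokens
tokens = [w for w in double_spaced.split(' ') if w != '']
print(tokens)                     # ['We', 'the', 'People']
```

In the notebook pipeline, the same idea would mean passing `lambda line: line.split()` to `flatMap`, or adding `.filter(lambda word: word != '')` after it. Either change is a suggestion and is not part of the original exercise.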


In [6]:
file_path = "file:////home/talentum/test-jupyter/P2/M2/SM4/Dataset/constitution.txt"

sc.textFile(file_path) \
.flatMap(lambda line: line.split(' ')) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda x, y: x + y) \
.map(lambda x: (x[1], x[0])) \
.sortByKey(ascending=False).take(10)  ### Chaining

[(812, ''),
 (662, 'the'),
 (493, 'of'),
 (293, 'shall'),
 (256, 'and'),
 (183, 'to'),
 (178, 'be'),
 (157, 'or'),
 (137, 'in'),
 (100, 'by')]
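
Sorting every pair just to keep ten of them does more work than strictly needed. As a sketch of the alternative, the plain-Python example below picks the largest counts without a full sort, using the ten pairs printed by `countRDD.take(10)` earlier as sample data. On an RDD, `takeOrdered(N, key=...)` or `top(N, key=...)` play a similar role and would also make the key/value swap unnecessary. Whether that is noticeably faster for a file this small is not measured here.

```python
import heapq

# (word, count) pairs copied from the take(10) output shown earlier
sample_counts = [
    ('We', 2), ('the', 662), ('People', 2), ('of', 493), ('United', 85),
    ('States,', 55), ('in', 137), ('Order', 1), ('to', 183), ('form', 1),
]

# Pick the 3 pairs with the largest count. No swap is needed because
# the key function reads the count directly from position 1.
top_three = heapq.nlargest(3, sample_counts, key=lambda pair: pair[1])

for word, count in top_three:
    print("{} has {} counts".format(word, count))
```

Within this ten-pair sample, the three largest counts belong to `the`, `of`, and `to`.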