In [4]:
import findspark
findspark.init()
import pyspark
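# Initiate the Spark context
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# getOrCreate() reuses a running context, so re-running this cell does not raise an error
sc = SparkContext.getOrCreate()
# sc.stop()  # uncomment to shut the context down when finished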

## Working with Pair RDDs

<ul>
<li>Real-life datasets are often key/value pairs</li>
<li>Each row has a key that maps to one or more values</li>
<li>A pair RDD is a special data structure for working with this kind of dataset</li>
<li>In a pair RDD, the key is the identifier and the value is the data</li>
    </ul>

#### All regular transformations work on pair RDDs
<ul>
<li>Examples of pair RDD transformations:</li>
</ul>
<b>reduceByKey(func)</b>: Combine values with the same key
<br>
<b>groupByKey()</b>: Group values with the same key
<br>
<b>sortByKey()</b>: Return an RDD sorted by the key
<br>
<b>join()</b>: Join two pairRDDs based on their key
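
To make the semantics of these transformations concrete, here is a minimal plain-Python sketch that models `groupByKey()`, `sortByKey()` and `join()` on small lists of tuples. It does not use Spark: the helper names `group_by_key`, `sort_by_key` and `inner_join` and the `right` dataset are hypothetical, chosen only for illustration. In Spark the order of elements in the result is generally not guaranteed, whereas these helpers keep the input order.

```python
from collections import defaultdict

# Sample pairs with the same shape as the elements of a pair RDD.
left = [(1, 2), (3, 4), (3, 6), (4, 5)]
right = [(3, "a"), (4, "b"), (5, "c")]  # hypothetical second dataset


def group_by_key(pairs):
    """Model of groupByKey(): collect every value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def sort_by_key(pairs, ascending=True):
    """Model of sortByKey(): order the pairs by their key only."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)


def inner_join(a, b):
    """Model of join(): pair up values whose key appears in both inputs."""
    b_groups = group_by_key(b)
    return [(k, (va, vb)) for k, va in a for vb in b_groups.get(k, [])]


grouped = group_by_key(left)
descending = sort_by_key(left, ascending=False)
joined = inner_join(left, right)

print(grouped)
print(descending)
print(joined)
```

Keys 1 and 5 appear in only one of the two inputs, so they are absent from the joined result.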

One of the most popular pair RDD transformations is <b>reduceByKey()</b> which operates on key, value (k,v) pairs and merges the values for each key.

In [7]:
# Create PairRDD Rdd with key value pairs
Rdd = sc.parallelize([(1,2),(3,4),(3,6),(4,5)])

# Apply reduceByKey() operation on Rdd
Rdd_Reduced = Rdd.reduceByKey(lambda x, y: x+y)

# Iterate over the result and print the output
for num in Rdd_Reduced.collect(): 
  print("Key {} has {} Counts".format(num[0], num[1]))

Key 1 has 2 Counts
Key 3 has 10 Counts
Key 4 has 5 Counts
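
As a cross-check on the result above, the following plain-Python sketch models what `reduceByKey()` computes. It runs without Spark, and `reduce_by_key` is a hypothetical helper, not part of the PySpark API. Spark merges values within each partition before combining partitions, so the function passed in should be associative and commutative; both lambdas below satisfy that.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Model of reduceByKey(): fold the values of each key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # reduce() applies func pairwise: func(func(v1, v2), v3) ...
    return {key: reduce(func, values) for key, values in grouped.items()}


pairs = [(1, 2), (3, 4), (3, 6), (4, 5)]

sums = reduce_by_key(pairs, lambda x, y: x + y)
maxima = reduce_by_key(pairs, lambda x, y: max(x, y))

print(sums)
print(maxima)
```

Key 3 has two values (4 and 6), so it is the only key for which the function is actually called; keys with a single value pass through unchanged.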


<b>sortByKey()</b>: It is often useful to sort a pair RDD by its keys.

In [8]:
# Sort the reduced RDD with the key by descending order
Rdd_Reduced_Sort = Rdd_Reduced.sortByKey(ascending=False)

# Iterate over the result and print the output
for num in Rdd_Reduced_Sort.collect():
  print("Key {} has {} Counts".format(num[0], num[1]))

Key 4 has 5 Counts
Key 3 has 10 Counts
Key 1 has 2 Counts


### More Actions:
<b>reduce(func)</b> action aggregates the elements of a regular RDD.
<br>
<b>saveAsTextFile()</b> action saves an RDD as text files inside a directory, with each partition written to a separate file.
<br>
<b>coalesce()</b> method can reduce the RDD to a single partition before saving, so that the output is a single text file.
<br>
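
The relationship between partitions and output files can be modelled in plain Python. The sketch below is not Spark code: `save_as_text_files` and `coalesce_to_one` are hypothetical helpers and the partition contents are made up. It only mimics the `part-00000`-style file naming that Spark uses, one file per partition. In Spark itself, the single-file case is typically written as `rdd.coalesce(1).saveAsTextFile(path)`.

```python
import os
import tempfile


def save_as_text_files(partitions, directory):
    """Write each partition to its own part-NNNNN file inside directory."""
    os.makedirs(directory)  # fails if the directory already exists
    for index, partition in enumerate(partitions):
        path = os.path.join(directory, "part-{:05d}".format(index))
        with open(path, "w") as handle:
            for element in partition:
                handle.write("{}\n".format(element))


def coalesce_to_one(partitions):
    """Merge all partitions into a single partition."""
    return [[element for partition in partitions for element in partition]]


partitions = [[1, 2], [3, 4], [5]]  # hypothetical data split into 3 partitions
base = tempfile.mkdtemp()

save_as_text_files(partitions, os.path.join(base, "many"))
save_as_text_files(coalesce_to_one(partitions), os.path.join(base, "single"))

many_files = sorted(os.listdir(os.path.join(base, "many")))
single_files = sorted(os.listdir(os.path.join(base, "single")))

print(many_files)
print(single_files)
```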

#### PairRDD actions include

In [22]:
print(Rdd.take(10))

[(1, 2), (3, 4), (3, 6), (4, 5)]


#### Counting by keys
For many datasets, it is important to count how many times each key occurs in a key/value dataset. For example, counting the number of countries where a product was sold, or finding the most popular baby names.

In [44]:
# Count the number of elements for each key with countByKey() (an action, not a transformation)
total = Rdd.countByKey()

# What is the type of total?
print("The type of total is", type(total))

# Iterate over the total and print the output
for k, v in total.items(): 
  print("key", k, "has", v, "counts")

The type of total is <class 'collections.defaultdict'>
key 1 has 1 counts
key 3 has 2 counts
key 4 has 1 counts
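
Note that `countByKey()` counts how many elements share each key and ignores the values, unlike `reduceByKey()` with addition, which sums them. A plain-Python sketch of the same counting, assuming the same four pairs and using a hypothetical helper named `count_by_key`, looks like this:

```python
from collections import defaultdict


def count_by_key(pairs):
    """Model of countByKey(): number of elements per key, values ignored."""
    counts = defaultdict(int)
    for key, _value in pairs:
        counts[key] += 1
    return counts


pairs = [(1, 2), (3, 4), (3, 6), (4, 5)]
counts = count_by_key(pairs)

for key, count in sorted(counts.items()):
    print("key", key, "has", count, "counts")
```

Because the whole result is returned to the driver as a dictionary, this kind of count is best suited to datasets with a modest number of distinct keys.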


In [25]:
file_path = 'https://assets.datacamp.com/production/repositories/3514/datasets/d9e4e9c9a26e932e3164ad7585bc30fc06596a50/Complete_Shakespeare.txt'

#### Create a base RDD and transform it
Create a base RDD from the Complete_Shakespeare.txt file.
<br>
Use an RDD transformation to split each element of the base RDD into words, producing one long list of words.
<br>
Remove stop words from the data.
<br>
Create a pair RDD where each element is a tuple of (word, 1).
<br>
Group the elements of the pair RDD by key (word) and add up their values.
<br>
Swap the keys (words) and values (counts) so that the key is the count and the value is the word.
<br>
Finally, sort the RDD in descending order and print the 10 most frequent words and their frequencies.
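
The steps above can be traced on a tiny in-memory sample before running them on the full text. This is a plain-Python sketch under stated assumptions, not the notebook's Spark pipeline: the three sample lines and the `stop_words` set are invented for illustration, and each step is annotated with the RDD operation it stands in for.

```python
# Hypothetical sample standing in for the lines of the text file.
lines = ["the quick brown fox", "the lazy dog and the fox", "a dog barks"]
stop_words = {"the", "a", "and"}  # assumed stop-word list

# Split every line into words (stands in for flatMap).
words = [word for line in lines for word in line.split()]

# Drop stop words (stands in for filter).
filtered = [word.lower() for word in words if word.lower() not in stop_words]

# Build (word, 1) pairs (stands in for map).
pairs = [(word, 1) for word in filtered]

# Add up the values for each word (stands in for reduceByKey).
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one

# Swap to (count, word) so the count becomes the key (stands in for map).
swapped = [(count, word) for word, count in counts.items()]

# Sort by count, descending, and keep the top 10 (sortByKey, then take).
top = sorted(swapped, key=lambda pair: pair[0], reverse=True)[:10]

for count, word in top:
    print("{} has {} counts".format(word, count))
```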

In [48]:
# Create a baseRDD from the file path
baseRDD = sc.textFile(file_path)