### The Power of Pairs: Paired RDDs

Key/value pairs are good for solving many problems efficiently in a parallel fashion. Apache Mahout, a machine-learning library that was initially developed on top of Apache Hadoop, implements many machine-learning algorithms in the areas of classification, clustering, and collaborative filtering by using the MapReduce key/value-pair architecture. In this chapter, you’ll work through recipes that develop skills for solving interesting big data problems from many disciplines.

#### Create a Paired RDD
#### Problem

You want to create a paired RDD.
#### Solution
You have an RDD, RDD1. The elements of RDD1 are b, d, m, t, e, and u. You want to create a paired RDD in which the keys are the elements of RDD1, and the value of a key is 0 if the element is a consonant, or 1 if the element is a vowel. Figure 5-1 depicts the requirements.

<img src = '430628_1_En_5_Fig1_HTML.gif'>

                                               Figure 5-1. Creating a paired RDD

A paired RDD can be created in many ways. One way is to read data directly from files. We’ll explore this method in an upcoming chapter. Another way to create a paired RDD is by using the map() method, which you’ll learn about in this recipe.
#### How It Works

In this section, you’ll follow several steps to reach the solution.
#### Creating an RDD with Single Elements

Let’s start by creating an RDD out of our given data:

In [2]:
from pyspark import SparkContext
# Reuse the running SparkContext if there is one; only one may be active per JVM
sc = SparkContext.getOrCreate()

In [3]:
pythonList  =  ['b', 'd', 'm', 't', 'e', 'u']
RDD1 = sc.parallelize(pythonList, 2)

We have created an RDD named RDD1 whose elements are the letters b, d, m, t, e, and u. The elements b, d, m, and t are consonants; the other elements, e and u, are vowels.

#### Writing a Python Function to Check for Vowels

We are going to define a Python function named vowelCheckFunction(). This function will take a letter as input and return 1 if the input is a vowel, or 0 if it is not. After a quick look at the contents of RDD1, let’s implement the function:

In [4]:
RDD1.collect()

                                                                                

['b', 'd', 'm', 't', 'e', 'u']

In [5]:
def vowelCheckFunction(data):
    # Return 1 for a lowercase vowel and 0 for any other input
    if data in ['a', 'e', 'i', 'o', 'u']:
        return 1
    else:
        return 0

#### Creating a Paired RDD

We can create our required RDD by using the map() function. We have to create a paired RDD: the keys will be the elements of RDD1, and the value will be 0 for keys that are consonants, or 1 for keys that are vowels:

In [6]:
pairedRdd = RDD1.map(lambda data: (data, vowelCheckFunction(data)))
pairedRdd.collect()

                                                                                

[('b', 0), ('d', 0), ('m', 0), ('t', 0), ('e', 1), ('u', 1)]
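To check the expected result without a Spark session, the same pairing can be sketched in plain Python. This is only an illustration of what map() computes here: the names vowel_check, VOWELS, and letters are hypothetical, and a list comprehension stands in for the distributed map().

```python
# Plain-Python stand-in for RDD1.map(...); no Spark session is needed.
VOWELS = {'a', 'e', 'i', 'o', 'u'}  # hypothetical helper constant


def vowel_check(letter):
    """Return 1 if the letter is a lowercase vowel, otherwise 0."""
    return 1 if letter in VOWELS else 0


letters = ['b', 'd', 'm', 't', 'e', 'u']

# Each element becomes a (key, value) tuple, just as in the paired RDD.
paired = [(letter, vowel_check(letter)) for letter in letters]
print(paired)
```

The printed list should match the output of pairedRdd.collect() shown above.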

#### Fetching Keys from a Paired RDD

The keys() function fetches all the keys of a paired RDD. It performs a transformation, so it returns an RDD, and we need the collect() function to bring the data to the driver:

In [7]:
rddkeys = pairedRdd.keys()
rddkeys.collect()

['b', 'd', 'm', 't', 'e', 'u']

#### Fetching Values from a Paired RDD

Similar to the keys() function, the values() function will fetch all the values from a paired RDD. It also performs a transformation:


In [8]:
rddvalues = pairedRdd.values()
rddvalues.collect()

[0, 0, 0, 0, 1, 1]
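Conceptually, keys() and values() each project one side of every pair. A minimal plain-Python sketch of that idea follows; it uses an ordinary list named pairs (a hypothetical stand-in for the collected paired RDD) rather than Spark itself.

```python
# The collected pairs from the recipe, held in an ordinary Python list.
pairs = [('b', 0), ('d', 0), ('m', 0), ('t', 0), ('e', 1), ('u', 1)]

# Project the first element of every pair (what keys() yields).
keys = [key for key, _ in pairs]

# Project the second element of every pair (what values() yields).
values = [value for _, value in pairs]

print(keys)
print(values)
```

Because both projections keep the original element order in this sketch, zipping keys and values back together reproduces the original pairs.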

#### Aggregate Data
#### Problem

You want to aggregate data.
#### Solution
You want to perform data aggregation on data from a lightbulb manufacturer, as shown in Table 5-1.

<img src = '430628_1_En_5_Figa_HTML.gif'>
                      
                                           Table 5-1. Filament Data

Company YP manufactures two types of filaments: filamentA and filamentB. Both 100W and 200W electric bulbs can be manufactured from either filament. Table 5-1 indicates the expected life of each bulb.
You want to calculate the following:

    1. Mean life in hours for bulbs of each filament type
    2. Mean life in hours for bulbs of each power level
    3. Mean life in hours based on both filament type and power

We generally encounter aggregation of data in data-science problems. To get an aggregation of data, we can use many PySpark functions.

In this recipe, we’ll use the reduceByKey() function to calculate the mean by using keys. Calculating the mean of complex keys requires creating those complex keys. Complex keys can be created by using the map() function.
#### How It Works

Let’s start with the creation of the RDD.
#### Creating an RDD with Single Elements

In [9]:
filDataSingle = [['filamentA','100W',605],
                  ['filamentB','100W',683],
                  ['filamentB','100W',691],
                  ['filamentB','200W',561],
                  ['filamentA','200W',530],
                  ['filamentA','100W',619],
                  ['filamentB','100W',686],
                  ['filamentB','200W',600],
                  ['filamentB','100W',696],
                  ['filamentA','200W',579],
                  ['filamentA','200W',520],
                  ['filamentA','100W',622],
                  ['filamentA','100W',668],
                  ['filamentB','200W',569],
                  ['filamentB','200W',555],
                  ['filamentA','200W',541]]
filDataSingleRDD = sc.parallelize(filDataSingle,2)
filDataSingleRDD.take(3)

[['filamentA', '100W', 605],
 ['filamentB', '100W', 683],
 ['filamentB', '100W', 691]]

#### Creating a Paired RDD

First we have to calculate the mean lifetime of bulbs, based on their filament type. For that, it is convenient to create a paired RDD with the filament type as the key and the life in hours as the value. So let’s create our required paired RDD and then investigate it:

In [10]:
fildatapairedrdd1 = filDataSingleRDD.map(lambda data: (data[0], data[2]))
fildatapairedrdd1.collect()

[('filamentA', 605),
 ('filamentB', 683),
 ('filamentB', 691),
 ('filamentB', 561),
 ('filamentA', 530),
 ('filamentA', 619),
 ('filamentB', 686),
 ('filamentB', 600),
 ('filamentB', 696),
 ('filamentA', 579),
 ('filamentA', 520),
 ('filamentA', 622),
 ('filamentA', 668),
 ('filamentB', 569),
 ('filamentB', 555),
 ('filamentA', 541)]

We have created a paired RDD, fildatapairedrdd1, by using the map() function defined on the RDD. The paired RDD fildatapairedrdd1 has the filament type as the key, and the life in hours as the value.
#### Finding the Mean Lifetime Based on Filament Type

Now we have our required paired RDD. But is this all we need? No. To calculate the mean, we need a sum and a count. We have to add an extra 1 in our paired RDD so that we can get a sum and a count. So let’s add an extra 1 now to each RDD element:

In [11]:
fildatapairedrdd2 = fildatapairedrdd1.map(lambda data: (data[0], [data[1], 1]))
fildatapairedrdd2.collect()

[('filamentA', [605, 1]),
 ('filamentB', [683, 1]),
 ('filamentB', [691, 1]),
 ('filamentB', [561, 1]),
 ('filamentA', [530, 1]),
 ('filamentA', [619, 1]),
 ('filamentB', [686, 1]),
 ('filamentB', [600, 1]),
 ('filamentB', [696, 1]),
 ('filamentA', [579, 1]),
 ('filamentA', [520, 1]),
 ('filamentA', [622, 1]),
 ('filamentA', [668, 1]),
 ('filamentB', [569, 1]),
 ('filamentB', [555, 1]),
 ('filamentA', [541, 1])]

fildatapairedrdd2 is a paired RDD. Each value of fildatapairedrdd2 is a list; the first element is the lifetime of the bulb (in hours), and the second element is just 1.
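The extra 1 matters because a mean cannot be built up by averaging two values at a time, whereas a running sum and a running count can be combined in any order. The following plain-Python sketch, which uses the first three filamentA lifetimes and hypothetical variable names, illustrates the difference.

```python
from functools import reduce

lifetimes = [605, 530, 619]  # first three filamentA lifetimes from the table

# Naive approach: average two values at a time. Later values get more weight,
# so the result is not the true mean.
naive = reduce(lambda a, b: (a + b) / 2, lifetimes)

# (sum, count) approach: attach a 1 to each value, add sums and counts
# pairwise, and divide only once at the end.
total, count = reduce(
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
    [(hours, 1) for hours in lifetimes],
)
true_mean = total / count

print(naive, true_mean)
```

The naive value depends on the order in which elements are combined, which is not something a distributed reduction lets you control.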

Now we have to calculate the sum of the values of the lifetimes for each filament type as well as the count value, so that we can calculate the mean. Many PySpark functions could be used to do this job, but here we are going to use the reduceByKey() function for paired RDDs.

The reduceByKey() function applies an aggregation key by key. It takes an aggregation function as input and uses it to combine, two at a time, the values that share the same key.
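As a rough mental model, reduceByKey() behaves like the single-machine helper below. This is a sketch, not Spark's implementation: reduce_by_key is a hypothetical name, and the real function combines values within and across partitions, which is why Spark expects the aggregation function to be associative and commutative.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, two values at a time."""
    merged = {}
    for key, value in pairs:
        # First value seen for a key is kept as is; later ones are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


pairs = [('filamentA', [605, 1]), ('filamentB', [683, 1]),
         ('filamentB', [691, 1]), ('filamentA', [530, 1])]

# Add the lifetimes and the counts element by element.
result = reduce_by_key(pairs, lambda a, b: [a[0] + b[0], a[1] + b[1]])
print(result)
```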

Let’s calculate the sum of the total life hours of bulbs based on the filament type, and the count of elements for each filament type:

In [13]:
fildatapairedrdd3 = fildatapairedrdd2.reduceByKey(lambda data, data2: [(data[0] + data2[0]), (data[1] + data2[1])])
fildatapairedrdd3.collect()

[('filamentB', [5041, 8]), ('filamentA', [4684, 8])]

Finally, we have the summation of the life hours of bulbs and the count, based on filament type. The next step is to divide the sum by the count to get the mean value. Let’s do that:

In [16]:
fildatapairedrdd4 = fildatapairedrdd3.map(lambda data: [data[0], data[1][0]/ data[1][1]])
fildatapairedrdd4.collect()

[['filamentB', 630.125], ['filamentA', 585.5]]

Finally, we have our required mean, based on filament type. The mean lifetime of filamentA is 585.5 hours, and the mean lifetime of filamentB is 630.125 hours. In this data set, bulbs made with filamentB have a longer mean life than bulbs made with filamentA.
#### Finding the Mean Lifetime Based on Bulb Power

First, we will start with creating our paired RDD. The key will be the bulb power, and the value will be the life in hours:

In [19]:
fildatapairedrdd_power1 = filDataSingleRDD.map(lambda data: (data[1], data[2]))
fildatapairedrdd_power1.collect()

[('100W', 605),
 ('100W', 683),
 ('100W', 691),
 ('200W', 561),
 ('200W', 530),
 ('100W', 619),
 ('100W', 686),
 ('200W', 600),
 ('100W', 696),
 ('200W', 579),
 ('200W', 520),
 ('100W', 622),
 ('100W', 668),
 ('200W', 569),
 ('200W', 555),
 ('200W', 541)]

In [20]:
fildatapairedrdd_power2 = fildatapairedrdd_power1.map(lambda data: (data[0], [data[1], 1]))
fildatapairedrdd_power2.collect()

[('100W', [605, 1]),
 ('100W', [683, 1]),
 ('100W', [691, 1]),
 ('200W', [561, 1]),
 ('200W', [530, 1]),
 ('100W', [619, 1]),
 ('100W', [686, 1]),
 ('200W', [600, 1]),
 ('100W', [696, 1]),
 ('200W', [579, 1]),
 ('200W', [520, 1]),
 ('100W', [622, 1]),
 ('100W', [668, 1]),
 ('200W', [569, 1]),
 ('200W', [555, 1]),
 ('200W', [541, 1])]

Now we have included 1 in the value part of the RDD. Therefore, each value is a list that consists of the life in hours and a 1.

In [21]:
fildatapairedrdd_power3 = fildatapairedrdd_power2.reduceByKey(lambda data, data1: [data[0] + data1[0], data[1] + data1[1]])
fildatapairedrdd_power3.collect()

[('100W', [5270, 8]), ('200W', [4455, 8])]

In [22]:
fildatapairedrdd_power4 = fildatapairedrdd_power3.map(lambda data: [data[0], data[1][0]/ data[1][1]])
fildatapairedrdd_power4.collect()

[['100W', 658.75], ['200W', 556.875]]

In this last step, we have divided the sum by the count to get the mean. From the result, we can see that the mean life of 100W bulbs is longer than that of 200W bulbs.
#### Finding the Mean Lifetime Based on Filament Type and Power

To solve this part of the exercise, we need a paired RDD with complex keys. You might be wondering what a complex key is. A complex key is made up of more than one field. In our case, the complex key combines the filament type and the bulb power. Let’s start creating our paired RDD with a complex key:

In [23]:
fildatacomplexkeyrdd1 = filDataSingleRDD.map(lambda data: [(data[0], data[1]), data[2]])
fildatacomplexkeyrdd1.collect()

[[('filamentA', '100W'), 605],
 [('filamentB', '100W'), 683],
 [('filamentB', '100W'), 691],
 [('filamentB', '200W'), 561],
 [('filamentA', '200W'), 530],
 [('filamentA', '100W'), 619],
 [('filamentB', '100W'), 686],
 [('filamentB', '200W'), 600],
 [('filamentB', '100W'), 696],
 [('filamentA', '200W'), 579],
 [('filamentA', '200W'), 520],
 [('filamentA', '100W'), 622],
 [('filamentA', '100W'), 668],
 [('filamentB', '200W'), 569],
 [('filamentB', '200W'), 555],
 [('filamentA', '200W'), 541]]
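One practical detail is that the complex key is built as a tuple. Grouping by key relies on hashing, and in Python a tuple of strings is hashable while a list is not. The sketch below uses a plain dictionary as a stand-in for key-based grouping (an assumption made for illustration; the variable names are hypothetical).

```python
lifetimes_by_key = {}

# A tuple of strings is hashable, so it can serve as a dictionary key.
lifetimes_by_key[('filamentA', '100W')] = 605

# A list is mutable and therefore unhashable; using it as a key fails.
try:
    lifetimes_by_key[['filamentA', '100W']] = 605
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(lifetimes_by_key, list_key_ok)
```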

We have created a paired RDD named fildatacomplexkeyrdd1. It can be easily observed that it has complex keys. The keys are a combination of filament type and bulb power. The rest of the exercise proceeds as in the previous steps. In the following code, we are going to include an extra 1 in the values:

In [24]:
fildatacomplexkeyrdd2 = fildatacomplexkeyrdd1.map(lambda data: [data[0], [data[1], 1]])
fildatacomplexkeyrdd2.collect()

[[('filamentA', '100W'), [605, 1]],
 [('filamentB', '100W'), [683, 1]],
 [('filamentB', '100W'), [691, 1]],
 [('filamentB', '200W'), [561, 1]],
 [('filamentA', '200W'), [530, 1]],
 [('filamentA', '100W'), [619, 1]],
 [('filamentB', '100W'), [686, 1]],
 [('filamentB', '200W'), [600, 1]],
 [('filamentB', '100W'), [696, 1]],
 [('filamentA', '200W'), [579, 1]],
 [('filamentA', '200W'), [520, 1]],
 [('filamentA', '100W'), [622, 1]],
 [('filamentA', '100W'), [668, 1]],
 [('filamentB', '200W'), [569, 1]],
 [('filamentB', '200W'), [555, 1]],
 [('filamentA', '200W'), [541, 1]]]

Our required paired RDD, fildatacomplexkeyrdd2, has been created. Now we can apply the reduceByKey() function to get the sum and count, based on the complex keys:

In [25]:
fildatacomplexkeyrdd3 = fildatacomplexkeyrdd2.reduceByKey(lambda data, data1: [data[0]+ data1[0], data[1] + data1[1]])
fildatacomplexkeyrdd3.collect()

[(('filamentB', '100W'), [2756, 4]),
 (('filamentA', '200W'), [2170, 4]),
 (('filamentA', '100W'), [2514, 4]),
 (('filamentB', '200W'), [2285, 4])]

In [26]:
fildatacomplexkeyrdd4 = fildatacomplexkeyrdd3.map(lambda data: [data[0], data[1][0]/data[1][1]])
fildatacomplexkeyrdd4.collect()

[[('filamentB', '100W'), 689.0],
 [('filamentA', '200W'), 542.5],
 [('filamentA', '100W'), 628.5],
 [('filamentB', '200W'), 571.25]]
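All three calculations follow the same pattern: choose a key, accumulate a sum and a count per key, then divide. As a cross-check of the results, here is a plain-Python sketch of that pattern. The helper mean_by_key and the variable names are hypothetical, and the code runs on a single machine rather than on Spark.

```python
from collections import defaultdict


def mean_by_key(records, key_func, value_func):
    """Return {key: mean of value_func(record)} over records grouped by key."""
    totals = defaultdict(lambda: [0, 0])  # key -> [running sum, running count]
    for record in records:
        acc = totals[key_func(record)]
        acc[0] += value_func(record)
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


bulbs = [
    ['filamentA', '100W', 605], ['filamentB', '100W', 683],
    ['filamentB', '100W', 691], ['filamentB', '200W', 561],
    ['filamentA', '200W', 530], ['filamentA', '100W', 619],
    ['filamentB', '100W', 686], ['filamentB', '200W', 600],
    ['filamentB', '100W', 696], ['filamentA', '200W', 579],
    ['filamentA', '200W', 520], ['filamentA', '100W', 622],
    ['filamentA', '100W', 668], ['filamentB', '200W', 569],
    ['filamentB', '200W', 555], ['filamentA', '200W', 541],
]

by_filament = mean_by_key(bulbs, lambda r: r[0], lambda r: r[2])
by_power = mean_by_key(bulbs, lambda r: r[1], lambda r: r[2])
by_both = mean_by_key(bulbs, lambda r: (r[0], r[1]), lambda r: r[2])

print(by_filament)
print(by_power)
print(by_both)
```

Only the key function changes between the three calls, which mirrors how the recipe changes only the first map() step.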

#### Join Data
#### Problem

You want to join data.
#### Solution
We have been given two tables: a Students table (Table 5-2) and a Subjects table (Table 5-3).
<img src='430628_1_En_5_Figb_HTML.gif'>
   
                                        Table 5-2. Students
<img src='430628_1_En_5_Figc_HTML.gif'>

                                        Table 5-3. Subjects
    
You want to perform the following on the Students and Subjects tables:

    Inner join

    Left outer join

    Right outer join

    Full outer join

Joining data tables is an integral part of data preprocessing. We are going to perform four types of data joins in this recipe.

An inner join returns all the keys that are common to both tables. It discards the key elements that are not common to both tables. In PySpark, an inner join is done by using the join() method defined on the RDD.

A left outer join includes all keys in the left table; keys that appear only in the right table are excluded. A left outer join can be performed by using the leftOuterJoin() function defined on the RDD in PySpark.

Another important type of join is a right outer join. In a right outer join, every key of the second table is included, but from the first table, only those keys that are common to both tables are included. We can do a right outer join by using the rightOuterJoin() function in PySpark.

If you want to include all keys from both tables, go for a full outer join. It can be performed by using fullOuterJoin().
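The four join types differ only in which keys they keep. The plain-Python sketch below makes that explicit on a small sample; join_pairs and its how parameter are hypothetical names chosen for illustration, not part of the PySpark API, and the rows are sorted by key only to make the output easy to read.

```python
from collections import defaultdict


def _group(pairs):
    """Collect the values of each key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def join_pairs(left, right, how="inner"):
    """Join two lists of (key, value) pairs; how is inner, left, right, or full."""
    left_g, right_g = _group(left), _group(right)
    if how == "inner":
        keys = left_g.keys() & right_g.keys()
    elif how == "left":
        keys = left_g.keys()
    elif how == "right":
        keys = right_g.keys()
    elif how == "full":
        keys = left_g.keys() | right_g.keys()
    else:
        raise ValueError("how must be inner, left, right, or full")
    rows = []
    for key in sorted(keys):
        # A key missing on one side contributes a single None on that side.
        for left_value in left_g.get(key, [None]):
            for right_value in right_g.get(key, [None]):
                rows.append((key, (left_value, right_value)))
    return rows


students = [('si1', 'Robin'), ('si2', 'Maria'), ('si6', 'William')]
subjects = [('si1', 'Python'), ('si1', 'Java'), ('si2', 'Python'), ('si5', 'C')]

for how in ("inner", "left", "right", "full"):
    print(how, join_pairs(students, subjects, how))
```

In this sample, si6 appears only among the students and si5 only among the subjects, so they show up (paired with None) only in the outer joins that keep their side.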
#### How It Works

We’ll follow the steps in this section to work with joins.
#### Creating Nested Lists

Let’s start creating a nested list of our data from the Students table:

In [27]:
studentData = [['si1','Robin','M'],
                ['si2','Maria','F'],
                ['si3','Julie','F'],
                ['si4','Bob',  'M'],
                ['si6','William','M']]

In [28]:
subjectsData = [['si1','Python'],
                 ['si3','Java'],
                 ['si1','Java'],
                 ['si2','Python'],
                 ['si3','Ruby'],
                 ['si4','C++'],
                 ['si5','C'],
                 ['si4','Python'],
                 ['si2','Java']]

#### Creating a Paired RDD of Students and Subjects

Before creating a paired RDD, we first have to create a single RDD. Let’s create studentDataRDD and subjectsDataRDD. Every element of studentDataRDD is a list with three elements, and every element of subjectsDataRDD is a list with two elements. We then transform both into paired RDDs keyed by student ID:

In [30]:
studentDataRDD = sc.parallelize(studentData, 2)
studentDataRDD.collect()

[['si1', 'Robin', 'M'],
 ['si2', 'Maria', 'F'],
 ['si3', 'Julie', 'F'],
 ['si4', 'Bob', 'M'],
 ['si6', 'William', 'M']]

In [31]:
subjectsDataRDD = sc.parallelize(subjectsData, 2)
subjectsDataRDD.collect()

[['si1', 'Python'],
 ['si3', 'Java'],
 ['si1', 'Java'],
 ['si2', 'Python'],
 ['si3', 'Ruby'],
 ['si4', 'C++'],
 ['si5', 'C'],
 ['si4', 'Python'],
 ['si2', 'Java']]

In [32]:
studentDataPairedRDD = studentDataRDD.map(lambda data: (data[0], [data[1], data[2]]))
studentDataPairedRDD.collect()

[('si1', ['Robin', 'M']),
 ('si2', ['Maria', 'F']),
 ('si3', ['Julie', 'F']),
 ('si4', ['Bob', 'M']),
 ('si6', ['William', 'M'])]

In [33]:
subjectsDataPairedRDD = subjectsDataRDD.map(lambda data: (data[0],data[1]))
subjectsDataPairedRDD.collect()

[('si1', 'Python'),
 ('si3', 'Java'),
 ('si1', 'Java'),
 ('si2', 'Python'),
 ('si3', 'Ruby'),
 ('si4', 'C++'),
 ('si5', 'C'),
 ('si4', 'Python'),
 ('si2', 'Java')]

#### Performing an Inner Join

As we know, an inner join in PySpark is done by using the join() function. We have to apply this function on the paired RDD studentDataPairedRDD, and provide subjectsDataPairedRDD as an argument to the join() function:

In [39]:
studentsubject_inner = studentDataPairedRDD.join(subjectsDataPairedRDD)
studentsubject_inner.collect()

[('si4', (['Bob', 'M'], 'C++')),
 ('si4', (['Bob', 'M'], 'Python')),
 ('si3', (['Julie', 'F'], 'Java')),
 ('si3', (['Julie', 'F'], 'Ruby')),
 ('si1', (['Robin', 'M'], 'Python')),
 ('si1', (['Robin', 'M'], 'Java')),
 ('si2', (['Maria', 'F'], 'Python')),
 ('si2', (['Maria', 'F'], 'Java'))]

Analyzing the output of this inner join reveals that the joined table contains only keys that are common to the Students and Subjects tables. The keys that are not common to both tables (si5 and si6) are not part of the joined table.
#### Performing a Left Outer Join

A left outer join can be performed by using the leftOuterJoin() function:

In [40]:
studentsubject_leftouterjoin = studentDataPairedRDD.leftOuterJoin(subjectsDataPairedRDD)
studentsubject_leftouterjoin.collect()

[('si4', (['Bob', 'M'], 'C++')),
 ('si4', (['Bob', 'M'], 'Python')),
 ('si6', (['William', 'M'], None)),
 ('si3', (['Julie', 'F'], 'Java')),
 ('si3', (['Julie', 'F'], 'Ruby')),
 ('si1', (['Robin', 'M'], 'Python')),
 ('si1', (['Robin', 'M'], 'Java')),
 ('si2', (['Maria', 'F'], 'Python')),
 ('si2', (['Maria', 'F'], 'Java'))]

Student ID si6 is in the Students table but not in the Subjects table. Hence, the left outer join includes si6 in the joined table. Because si6 doesn’t have its counterpart in the Subjects table, it has None in place of the subject.
#### Performing a Right Outer Join

A right outer join on the Students and Subjects tables can be performed by using the rightOuterJoin() function:

In [41]:
studentsubject_rightouterjoin = studentDataPairedRDD.rightOuterJoin(subjectsDataPairedRDD)
studentsubject_rightouterjoin.collect()

[('si4', (['Bob', 'M'], 'C++')),
 ('si4', (['Bob', 'M'], 'Python')),
 ('si3', (['Julie', 'F'], 'Java')),
 ('si3', (['Julie', 'F'], 'Ruby')),
 ('si5', (None, 'C')),
 ('si1', (['Robin', 'M'], 'Python')),
 ('si1', (['Robin', 'M'], 'Java')),
 ('si2', (['Maria', 'F'], 'Python')),
 ('si2', (['Maria', 'F'], 'Java'))]

Student ID si5 is in only the Subjects table; it is not part of the Students table. The right outer join still includes si5 in the joined table, with None in place of the student details.
#### Performing a Full Outer Join

Now let’s perform a full outer join. In a full outer join, keys from both tables will be included:

In [42]:
studentsubject_fullouterjoin = studentDataPairedRDD.fullOuterJoin(subjectsDataPairedRDD)
studentsubject_fullouterjoin.collect()

[('si4', (['Bob', 'M'], 'C++')),
 ('si4', (['Bob', 'M'], 'Python')),
 ('si6', (['William', 'M'], None)),
 ('si3', (['Julie', 'F'], 'Java')),
 ('si3', (['Julie', 'F'], 'Ruby')),
 ('si5', (None, 'C')),
 ('si1', (['Robin', 'M'], 'Python')),
 ('si1', (['Robin', 'M'], 'Java')),
 ('si2', (['Maria', 'F'], 'Python')),
 ('si2', (['Maria', 'F'], 'Java'))]


In the joined table, keys from both tables have been included. Student ID si6 is part of only the Students data, and it appears in the joined table. Similarly, student ID si5 is part of only the Subjects table, but it appears in our joined table.