<a href="https://colab.research.google.com/github/Dhaneshkp/Spark/blob/main/Paired%20RDD%20Operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

<center><h1> Transformations & Actions on Pair RDDs </h1></center>

---


This notebook covers different types of transformations and actions on pair RDDs.
    
### `Transformations`



* 1. **Keys**
* 2. **Values**
* 3. **Group By Key**
* 4. **Reduce By Key**
* 5. **Join**


### `Actions`


* 1. **Count By Key**
* 2. **Look Up**   
    

---

#### `IMPORTING THE REQUIRED LIBRARIES`

---

In [1]:
# importing the spark context
from pyspark.context import SparkContext

In [2]:
# spark context object
sc = SparkContext(appName="pair_rdd_operations")

---


We have student data in the file **students_data.txt**, which has 9 columns. The first four are **Roll No**, **Section**, **Name of Student**, and **City**; the last five are marks in 5 different subjects. Each student's record is on its own row, with the fields separated by spaces.

---

![](https://github.com/Dhaneshkp/Spark/blob/main/images/data_1.png?raw=1)


---


#### `Let's start with creating the rdd of the file - students_data.txt`

---

In [5]:
## rdd of the file
students_marks = sc.textFile("students_data.txt")

In [6]:
## view the data
students_marks.collect()

['101 A Rohit Gurgaon 65 77 43 66 87',
 '102 B Akansha Delhi 55 46 24 66 77',
 '103 A Himanshu Faridabad 75 38 84 38 58',
 '104 A Ekta Delhi 85 84 39 58 85',
 '105 B Deepanshu Gurgaon 34 55 56 23 66',
 '106 B Ayush Delhi 66 62 98 74 87',
 '107 B Aditi Delhi 76 83 75 38 58',
 '108 A Sahil Faridabad 55 32 43 56 66',
 '109 A Krati Delhi 34 53 25 67 75']

---

Next, we will split each line into a list of fields using the `map` transformation, as shown in the cell below.

---

In [7]:
# split the rdd
students_marks = students_marks.map(lambda x: x.split(' '))

In [8]:
# collect the rdd
students_marks.collect()

[['101', 'A', 'Rohit', 'Gurgaon', '65', '77', '43', '66', '87'],
 ['102', 'B', 'Akansha', 'Delhi', '55', '46', '24', '66', '77'],
 ['103', 'A', 'Himanshu', 'Faridabad', '75', '38', '84', '38', '58'],
 ['104', 'A', 'Ekta', 'Delhi', '85', '84', '39', '58', '85'],
 ['105', 'B', 'Deepanshu', 'Gurgaon', '34', '55', '56', '23', '66'],
 ['106', 'B', 'Ayush', 'Delhi', '66', '62', '98', '74', '87'],
 ['107', 'B', 'Aditi', 'Delhi', '76', '83', '75', '38', '58'],
 ['108', 'A', 'Sahil', 'Faridabad', '55', '32', '43', '56', '66'],
 ['109', 'A', 'Krati', 'Delhi', '34', '53', '25', '67', '75']]

---
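
After the split, every field is still a string, including the five marks. If the marks are needed as numbers later, a conversion step is required. The following is a minimal plain-Python sketch (no Spark needed) of such a step. `parse_student` is a hypothetical helper, not part of the original notebook, and it assumes every line has exactly four text fields followed by integer marks.

```python
# Plain-Python sketch; parse_student is a hypothetical helper for illustration.
def parse_student(line):
    """Split one space-separated record and convert the marks to int."""
    fields = line.split(' ')
    roll_no, section, name, city = fields[:4]
    # Everything after the first four fields is assumed to be a mark.
    marks = [int(m) for m in fields[4:]]
    return roll_no, section, name, city, marks

record = parse_student('101 A Rohit Gurgaon 65 77 43 66 87')
print(record)
```

A function like this could be passed to `map` in place of the `lambda`. The rest of this notebook keeps the fields as strings.

---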

#### **`On the above dataset, we will answer the following questions using pair RDDs`**

---

 - How to find out the keys & values of a particular pair RDD?
 - Find out all the data points of the students with roll no 102 & 105?
 - Find out the number of students from each city?
 - How to join 2 different pair RDDs?
 - Find out the number of students in each section?




---

---

#### `How to find out the keys & values of a particular Pair RDD?`


---

We will create the pair RDD with the **roll no** as the key and the **section**, **name** & **city** as the values. Let's see how to do that in the cell below.

---



<center> <img src="https://github.com/Dhaneshkp/Spark/blob/main/images/key-value.png?raw=1"/> </center>

---

In [9]:
# create the pair rdd
students_marks_paired_rdd = students_marks.map(lambda x: (x[0], (x[1],x[2], x[3])))

In [10]:
# collect the paired rdd.
students_marks_paired_rdd.collect()

[('101', ('A', 'Rohit', 'Gurgaon')),
 ('102', ('B', 'Akansha', 'Delhi')),
 ('103', ('A', 'Himanshu', 'Faridabad')),
 ('104', ('A', 'Ekta', 'Delhi')),
 ('105', ('B', 'Deepanshu', 'Gurgaon')),
 ('106', ('B', 'Ayush', 'Delhi')),
 ('107', ('B', 'Aditi', 'Delhi')),
 ('108', ('A', 'Sahil', 'Faridabad')),
 ('109', ('A', 'Krati', 'Delhi'))]

---

#### `Transformation - Keys`

The **`Keys`** transformation returns a new RDD containing only the keys of the pair RDD. Let's try it on the pair RDD above.

---

In [11]:
# keys transformation
students_marks_keys = students_marks_paired_rdd.keys()

In [12]:
# collect the rdd
students_marks_keys.collect()

['101', '102', '103', '104', '105', '106', '107', '108', '109']

---

#### `Transformation - Values`

The **`Values`** transformation returns a new RDD containing only the values of the pair RDD. Let's try it on the pair RDD above.


---

In [13]:
# values transformation
students_marks_value = students_marks_paired_rdd.values()

In [14]:
# collect the rdd
students_marks_value.collect()

[('A', 'Rohit', 'Gurgaon'),
 ('B', 'Akansha', 'Delhi'),
 ('A', 'Himanshu', 'Faridabad'),
 ('A', 'Ekta', 'Delhi'),
 ('B', 'Deepanshu', 'Gurgaon'),
 ('B', 'Ayush', 'Delhi'),
 ('B', 'Aditi', 'Delhi'),
 ('A', 'Sahil', 'Faridabad'),
 ('A', 'Krati', 'Delhi')]

---
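
`keys()` and `values()` simply pick the first or the second element of each pair. The plain-Python sketch below models that on two of the records above, with an ordinary list standing in for the RDD (an assumption made so the example runs without Spark).

```python
# Plain-Python model of keys() and values(); a list stands in for the RDD.
pairs = [
    ('101', ('A', 'Rohit', 'Gurgaon')),
    ('102', ('B', 'Akansha', 'Delhi')),
]

# keys(): first element of every (key, value) pair
keys = [k for k, _ in pairs]
# values(): second element of every (key, value) pair
values = [v for _, v in pairs]

print(keys)
print(values)
```

---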

#### `Find out all the data points of the students with roll no 102 & 105?`



---

#### `Action - Look Up`

The **`Look Up`** action returns the list of values stored under a particular key. Pass the key to the `lookup` function to get the corresponding values.

Let's try this out !!

---

In [15]:
# look for the values of the roll no 102
students_marks_paired_rdd.lookup("102")

[('B', 'Akansha', 'Delhi')]

In [16]:
# look for the values of the roll no 105
students_marks_paired_rdd.lookup("105")

[('B', 'Deepanshu', 'Gurgaon')]

---
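
`lookup` returns a list because a key may appear more than once in a pair RDD. The sketch below models that behaviour in plain Python. The `lookup` function here is a hypothetical stand-in for the RDD method, and the list stands in for the RDD.

```python
# Plain-Python model of the lookup action; not the Spark implementation.
def lookup(pairs, key):
    """Return every value whose key equals the requested key."""
    return [v for k, v in pairs if k == key]

pairs = [
    ('102', ('B', 'Akansha', 'Delhi')),
    ('105', ('B', 'Deepanshu', 'Gurgaon')),
]

found = lookup(pairs, '102')
missing = lookup(pairs, '999')    # key not present
wrong_type = lookup(pairs, 102)   # int key, but the roll numbers are strings

print(found, missing, wrong_type)
```

In this model a key without a match gives an empty list rather than an error. The roll numbers were read from a text file, so they are strings, and the integer `102` does not match the string `"102"`.

---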

#### `Find out the number of students from each city?`


---

First of all, we will see how to do this task without using any pair RDD.

---

In [17]:
# list of distinct cities
students_marks_distinct_cities = students_marks.map(lambda x: x[3]).distinct()

# empty list
result = []

# loop over the distinct cities
for city in students_marks_distinct_cities.collect():
    # count no of data points for the city
    city_count = students_marks.filter(lambda x: x[3] == city).count()
    # append to the result list
    result.append((city,city_count))

print(result)

[('Gurgaon', 2), ('Delhi', 5), ('Faridabad', 2)]


---

#### Let's do the same task using the pair RDD.


---

We will create another RDD where the key is the city name and the values are the roll no & name of the student.

----

In [18]:
# create pair rdd
students_city_as_key = students_marks.map(lambda x: (x[3], (x[0], x[2])) )

In [19]:
# collect the rdd
students_city_as_key.collect()

[('Gurgaon', ('101', 'Rohit')),
 ('Delhi', ('102', 'Akansha')),
 ('Faridabad', ('103', 'Himanshu')),
 ('Delhi', ('104', 'Ekta')),
 ('Gurgaon', ('105', 'Deepanshu')),
 ('Delhi', ('106', 'Ayush')),
 ('Delhi', ('107', 'Aditi')),
 ('Faridabad', ('108', 'Sahil')),
 ('Delhi', ('109', 'Krati'))]

---

#### `Action - Count By Key`

The **countByKey** action returns the number of elements for each key as a dictionary. Let's use it in the cell below to find out the number of students from each city.

---

In [20]:
# count by key action
students_city_as_key.countByKey()

defaultdict(int, {'Gurgaon': 2, 'Delhi': 5, 'Faridabad': 2})

---
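
The loop-based version runs one `filter` and `count` per city, while counting by key needs only a single pass over the pairs. The sketch below models that single pass in plain Python with `collections.Counter`. It illustrates the idea only and is not how Spark implements `countByKey`.

```python
from collections import Counter

# Plain-Python model of countByKey; a list stands in for the pair RDD.
city_pairs = [
    ('Gurgaon', ('101', 'Rohit')),
    ('Delhi', ('102', 'Akansha')),
    ('Faridabad', ('103', 'Himanshu')),
    ('Delhi', ('104', 'Ekta')),
    ('Gurgaon', ('105', 'Deepanshu')),
    ('Delhi', ('106', 'Ayush')),
    ('Delhi', ('107', 'Aditi')),
    ('Faridabad', ('108', 'Sahil')),
    ('Delhi', ('109', 'Krati')),
]

# Only the key matters for the count; the values are ignored.
counts = Counter(key for key, _ in city_pairs)
print(dict(counts))
```

---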


#### `How to join 2 different pair RDDs?`



---

Now, assume these students took an exam and the results are stored in another file, **results.txt**. It has 3 columns: **Roll no**, **Grade** in the exam, and **qualified**, which holds Yes/No depending on whether the exam was cleared.

---


![](https://github.com/Dhaneshkp/Spark/blob/main/images/data_2.png?raw=1)



Let's create an RDD of the file results.txt

---

In [23]:
# create an RDD
results = sc.textFile('results.txt')

In [24]:
# collect the RDD
results.collect()

['101 A Yes',
 '102 D No',
 '103 B Yes',
 '104 B Yes',
 '105 A Yes',
 '106 C No',
 '107 C No',
 '108 D No',
 '109 A Yes']

In [25]:
# split each line into a list of fields using map
results = results.map(lambda x: x.split(' '))

In [26]:
# collect the RDD
results.collect()

[['101', 'A', 'Yes'],
 ['102', 'D', 'No'],
 ['103', 'B', 'Yes'],
 ['104', 'B', 'Yes'],
 ['105', 'A', 'Yes'],
 ['106', 'C', 'No'],
 ['107', 'C', 'No'],
 ['108', 'D', 'No'],
 ['109', 'A', 'Yes']]

---

Let's create a paired RDD where the key will be the **roll no** and the values will be the **grade** and **qualified** columns.

---

In [27]:
# create results pair RDD
results_paired_rdd = results.map(lambda x: (x[0], (x[1],x[2])))

In [28]:
# collect the results pair RDD
results_paired_rdd.collect()

[('101', ('A', 'Yes')),
 ('102', ('D', 'No')),
 ('103', ('B', 'Yes')),
 ('104', ('B', 'Yes')),
 ('105', ('A', 'Yes')),
 ('106', ('C', 'No')),
 ('107', ('C', 'No')),
 ('108', ('D', 'No')),
 ('109', ('A', 'Yes'))]

---


#### `Transformation - Join`

Now, we will join the two RDDs, **students_marks_paired_rdd** and **results_paired_rdd**, using the **`Join`** transformation. It keeps only the keys that are present in both RDDs.

---

In [29]:
# join the RDD students marks and results
joined_data = students_marks_paired_rdd.join(results_paired_rdd)

In [30]:
# collect the RDD
joined_data.collect()

[('102', (('B', 'Akansha', 'Delhi'), ('D', 'No'))),
 ('108', (('A', 'Sahil', 'Faridabad'), ('D', 'No'))),
 ('103', (('A', 'Himanshu', 'Faridabad'), ('B', 'Yes'))),
 ('104', (('A', 'Ekta', 'Delhi'), ('B', 'Yes'))),
 ('105', (('B', 'Deepanshu', 'Gurgaon'), ('A', 'Yes'))),
 ('106', (('B', 'Ayush', 'Delhi'), ('C', 'No'))),
 ('107', (('B', 'Aditi', 'Delhi'), ('C', 'No'))),
 ('101', (('A', 'Rohit', 'Gurgaon'), ('A', 'Yes'))),
 ('109', (('A', 'Krati', 'Delhi'), ('A', 'Yes')))]

---
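
In the data above every roll no appears in both files, so the join keeps all nine students. The plain-Python sketch below models an inner join on a smaller example where the keys only partly overlap. `inner_join` is a hypothetical helper, and roll no `110` is made up for the illustration.

```python
# Plain-Python model of an inner join on key; not the Spark implementation.
def inner_join(left, right):
    """Pair up values from left and right that share the same key."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]

students = [
    ('101', ('A', 'Rohit', 'Gurgaon')),   # no result for 101 in this example
    ('102', ('B', 'Akansha', 'Delhi')),
]
results = [
    ('102', ('D', 'No')),
    ('110', ('C', 'No')),                 # hypothetical roll no with no student
]

joined = inner_join(students, results)
print(joined)
```

Only `102` survives, because it is the only key on both sides. Note also that the collected Spark output above is not in roll no order: a join shuffles the data, so the order of the result should not be relied on.

---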

#### `Left Outer Join`

We will filter the data from the **students_marks_paired_rdd** and keep the data of section A students only.

---

In [31]:
# keep only the data of section A students
students_marks_section_a = students_marks_paired_rdd.filter(lambda x : x[1][0] == "A")

In [32]:
# collect the rdd of section A students
students_marks_section_a.collect()

[('101', ('A', 'Rohit', 'Gurgaon')),
 ('103', ('A', 'Himanshu', 'Faridabad')),
 ('104', ('A', 'Ekta', 'Delhi')),
 ('108', ('A', 'Sahil', 'Faridabad')),
 ('109', ('A', 'Krati', 'Delhi'))]

---

***Now, we will do a left outer join on the pair RDDs `students_marks_section_a` and `results_paired_rdd`.***

----

In [33]:
# left outer join
results_of_section_a = students_marks_section_a.leftOuterJoin(results_paired_rdd)

In [34]:
# collect the results
results_of_section_a.collect()

[('108', (('A', 'Sahil', 'Faridabad'), ('D', 'No'))),
 ('103', (('A', 'Himanshu', 'Faridabad'), ('B', 'Yes'))),
 ('104', (('A', 'Ekta', 'Delhi'), ('B', 'Yes'))),
 ('101', (('A', 'Rohit', 'Gurgaon'), ('A', 'Yes'))),
 ('109', (('A', 'Krati', 'Delhi'), ('A', 'Yes')))]

---
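
In the output above every section A student has a result, so the left outer join looks like an inner join. The difference shows when a key on the left has no match on the right: the pair is kept and the right-hand value becomes `None`. The sketch below models this in plain Python. `left_outer_join` is a hypothetical helper, and the example assumes a results list that lacks an entry for roll no `103`.

```python
# Plain-Python model of a left outer join; not the Spark implementation.
def left_outer_join(left, right):
    """Keep every left pair; use None when the key is missing on the right."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [None])
    ]

section_a = [
    ('101', ('A', 'Rohit', 'Gurgaon')),
    ('103', ('A', 'Himanshu', 'Faridabad')),
]
partial_results = [('101', ('A', 'Yes'))]   # assumed: no result for 103

left_joined = left_outer_join(section_a, partial_results)
print(left_joined)
```

---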

#### `Right Outer Join`

We will filter the data from the **results_paired_rdd** and keep the data of the qualified students only.

---

In [35]:
# filter the data of qualified students
students_qualified = results_paired_rdd.filter(lambda x: x[1][1] == "Yes")

In [36]:
# collect the results
students_qualified.collect()

[('101', ('A', 'Yes')),
 ('103', ('B', 'Yes')),
 ('104', ('B', 'Yes')),
 ('105', ('A', 'Yes')),
 ('109', ('A', 'Yes'))]

---

***Now, we will do the right outer join on the pair RDDs `students_marks_paired_rdd` and `students_qualified`.***

---

In [37]:
# right outer join
students_qualified_details = students_marks_paired_rdd.rightOuterJoin(students_qualified)

In [38]:
# collect the results
students_qualified_details.collect()

[('103', (('A', 'Himanshu', 'Faridabad'), ('B', 'Yes'))),
 ('104', (('A', 'Ekta', 'Delhi'), ('B', 'Yes'))),
 ('105', (('B', 'Deepanshu', 'Gurgaon'), ('A', 'Yes'))),
 ('101', (('A', 'Rohit', 'Gurgaon'), ('A', 'Yes'))),
 ('109', (('A', 'Krati', 'Delhi'), ('A', 'Yes')))]
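
---

A right outer join keeps every pair from the right-hand RDD and fills the left-hand value with `None` when the key is missing on the left. One way to think about it is as a left outer join with the two sides swapped. The plain-Python sketch below models that view. Both helpers are hypothetical, and roll no `110` is made up to show the `None` case.

```python
# Plain-Python model of a right outer join; not the Spark implementation.
def left_outer_join(left, right):
    """Keep every left pair; use None when the key is missing on the right."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [None])
    ]

def right_outer_join(left, right):
    """Left outer join with the sides swapped, then values swapped back."""
    return [
        (key, (left_value, right_value))
        for key, (right_value, left_value) in left_outer_join(right, left)
    ]

students = [('101', ('A', 'Rohit', 'Gurgaon'))]
qualified = [
    ('101', ('A', 'Yes')),
    ('110', ('B', 'Yes')),   # hypothetical roll no with no student record
]

right_joined = right_outer_join(students, qualified)
print(right_joined)
```

In the notebook's data every qualified roll no has a student record, which is why no `None` appears in the collected output above.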