# Spark RDD tutorial

In this Colab notebook, we will practice the concepts learned about RDDs.
Check the materials at ALUD and the Spark RDD documentation (https://spark.apache.org/docs/latest/rdd-programming-guide.html).


## Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [1]:
# Let's import the libraries we will need
import pyspark
from pyspark import SparkContext, SparkConf

Let's initialize the Spark context.


In [2]:
sc = pyspark.SparkContext()

23/10/19 15:31:14 WARN Utils: Your hostname, osboxes resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
23/10/19 15:31:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/19 15:31:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/10/19 15:31:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


You can easily check the current version and get the link to the web interface. In the Spark UI, you can monitor the progress of your job and debug performance bottlenecks (if your Colab is running with a **local runtime**).

In [3]:
sc

# Exercises

In [4]:
# The input data will be a list of integers
import json

with open('integer_file.txt', 'r') as f:
  input = json.loads(f.read())

## Exercise 1: LargestInteger

From a list of integers, we must select the largest (highest, biggest) integer in the list.
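Finding a maximum is a pairwise reduction: keep comparing two values and carry the larger one forward. The sketch below shows that idea in plain Python with `functools.reduce`, so it runs without Spark. The list `sample_numbers` and the helper `larger` are made up for illustration and are not the exercise data; the comments name the RDD calls that would play the same roles.

```python
from functools import reduce

# Illustrative data only; the exercise uses the list loaded from integer_file.txt.
sample_numbers = [12, 99, 7, 42, 99, 3]


def larger(a, b):
    """Return the larger of two values (associative and commutative, so safe to reduce with)."""
    return a if a >= b else b


# With an RDD, the same function could be passed to rdd.reduce(larger)
# after sc.parallelize(...); RDD.max() is a ready-made action for the same result.
largest = reduce(larger, sample_numbers)
print(largest)
```

The reduction function must be associative and commutative because Spark combines partial results from different partitions in no fixed order.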

In [None]:
# Here you must insert your code. You must put your result in a variable named 'result'
# Remember that you must first create the RDD using the parallelize function



In [None]:
# Here is where the result will be checked. Don't modify this!
print('Success!' if result == 99 else 'Fail!')

## Exercise 2: DistinctInteger

From a list of integers, get only the distinct integers, i.e., remove duplicates.
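Removing duplicates can be seen as grouping equal values under the same key and keeping one representative per key. The following plain-Python sketch illustrates that on a made-up list (`sample_numbers` is hypothetical, not the exercise data); with an RDD, the built-in `distinct()` transformation does this in a single step.

```python
# Illustrative data only; the exercise uses the list loaded from integer_file.txt.
sample_numbers = [5, 3, 5, 1, 3, 5]

# Like rdd.map: key every value by itself so that equal values share a key.
keyed = [(n, None) for n in sample_numbers]

# Like rdd.reduceByKey(...) followed by keys(): a dict holds one slot per key,
# so building it from the pairs drops the repeated values.
unique_numbers = list(dict(keyed).keys())
print(unique_numbers)
```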

In [None]:
# Here you must insert your code. You must put your result in a variable named 'result'



In [5]:
# Here is where the result will be checked. Don't modify this!
unique = []
fail = False
for item in result:
  if item not in unique:
    unique.append(item)
  else:
    fail = True
    break
print('Success!') if not fail else print('Fail!')


## Exercise 3: DistinctIntegerCount

From a list of integers, count the occurrences of each number.
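Counting occurrences follows the classic word-count shape: emit a `(number, 1)` pair for each element, then add up the ones that share a key. The sketch below walks through both steps in plain Python on a made-up list (`sample_numbers` is hypothetical); the comments point to the RDD transformations with the same roles.

```python
from collections import defaultdict

# Illustrative data only; the exercise uses the list loaded from integer_file.txt.
sample_numbers = [3, 1, 3, 2, 1, 3]

# Step 1 (like rdd.map): turn every number into a (number, 1) pair.
pairs = [(n, 1) for n in sample_numbers]

# Step 2 (like rdd.reduceByKey with addition): sum the 1s that share a key.
counts = defaultdict(int)
for key, one in pairs:
    counts[key] += one

# A list of (number, count) pairs, the shape the checker cell iterates over.
sample_counts = sorted(counts.items())
print(sample_counts)  # [(1, 2), (2, 1), (3, 3)]
```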

In [None]:
# Here you must insert your code. You must put your result in a variable named 'result'.



In [None]:
# Here is where the result will be checked. Don't modify this!
expected = {69: 135, 20: 129, 37: 123, 65: 123, 88: 121, 94: 120, 23: 118, 36: 117, 35: 117, 92: 116, 82: 114, 90: 114, 62: 113, 33: 112, 96: 110, 38: 110, 25: 110, 85: 110, 11: 110, 55: 110, 22: 107, 75: 107, 17: 106, 93: 106, 4: 105, 30: 105, 42: 105, 2: 105, 68: 105, 64: 105, 51: 105, 59: 105, 15: 105, 16: 104, 18: 104, 13: 104, 61: 104, 43: 103, 48: 102, 77: 102, 12: 101, 0: 101, 80: 101, 54: 101, 78: 101, 6: 101, 45: 101, 34: 100, 1: 100, 81: 100, 66: 99, 67: 99, 31: 99, 28: 98, 95: 98, 47: 98, 19: 97, 97: 97, 49: 97, 99: 97, 3: 97, 21: 97, 7: 96, 63: 96, 41: 96, 5: 96, 27: 96, 39: 96, 53: 96, 91: 95, 72: 94, 60: 94, 8: 94, 89: 94, 56: 93, 87: 92, 73: 92, 50: 91, 32: 91, 24: 91, 83: 91, 74: 90, 79: 90, 71: 90, 10: 89, 14: 88, 98: 88, 57: 87, 29: 87, 86: 86, 26: 85, 76: 85, 44: 85, 58: 84, 84: 83, 40: 83, 70: 79, 46: 78, 9: 77, 52: 76}

fail = False if len(expected.keys()) == len(result) else True
if not fail:
  fail = [expected[item[0]] != item[1] for item in result]
  print("Fail!") if True in fail else print("Success!")
else:
  print("Fail!")


## Exercise 4: Selection

From a set of tuples, return only those that match the specified criterion. Specifically, from the given input relation, you must select the tuples whose position equals 'Analyst'.
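Selection keeps whole rows and discards those that fail a predicate. Here is a minimal plain-Python sketch of that idea. The first row of `sample_employees` is invented for illustration, the other two are copied from the expected output of this exercise, and the helper name `is_analyst` is an assumption; with an RDD the predicate would be passed to `filter`.

```python
# Rows are (id, name, surname, position). The first row is made up.
sample_employees = [
    ("0", "Alex", "Example", "Programmer"),
    ("7", "Matthew", "Hawkins", "Analyst"),
    ("10", "Jeanette", "King", "Analyst"),
]


def is_analyst(row):
    """Predicate: True when the position column (index 3) is 'Analyst'."""
    return row[3] == "Analyst"


# With an RDD this would be rdd.filter(is_analyst).collect().
analysts = [row for row in sample_employees if is_analyst(row)]
print(analysts)
```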

In [None]:
# The input data will be a list of tuples
tuples = []
with open('tuples.txt', 'r') as f:
  lines = f.readlines()
  for line in lines:
    item = line.strip().split(',')
    tuples.append((item[0], item[1], item[2], item[3]))

In [None]:
# Here you must insert your code. You must put your result in a variable named 'result'.



In [None]:
# Here is where the result will be checked. Don't modify this!
expected = [('7', 'Matthew', 'Hawkins', 'Analyst'), ('10', 'Jeanette', 'King', 'Analyst'), ('11', 'Sang', 'Papitto', 'Analyst'), ('15', 'Amy', 'Hackathorn', 'Analyst'), ('16', 'Rene', 'Faulk', 'Analyst'), ('17', 'Karl', 'Warren', 'Analyst'), ('25', 'Greg', 'Sanchez', 'Analyst'), ('28', 'Amy', 'Gaston', 'Analyst'), ('31', 'Robin', 'Pollmann', 'Analyst'), ('33', 'Curtis', 'Franco', 'Analyst'), ('37', 'Nellie', 'Beasley', 'Analyst'), ('38', 'Lola', 'Lusk', 'Analyst'), ('42', 'John', 'Campbell', 'Analyst'), ('44', 'Dian', 'Finkenbinder', 'Analyst'), ('47', 'Courtney', 'Morales', 'Analyst'), ('48', 'Elaine', 'Street', 'Analyst'), ('55', 'Edward', 'Stefani', 'Analyst'), ('56', 'Dennis', 'Chang', 'Analyst'), ('57', 'Christine', 'Scott', 'Analyst'), ('59', 'Veronica', 'Freeman', 'Analyst'), ('60', 'James', 'Buel', 'Analyst'), ('64', 'Joseph', 'Flanders', 'Analyst'), ('65', 'Donna', 'Hernandez', 'Analyst'), ('70', 'Reynaldo', 'Cammarata', 'Analyst'), ('77', 'Morris', 'Stringfellow', 'Analyst'), ('78', 'Shirley', 'Smith', 'Analyst'), ('80', 'Evan', 'White', 'Analyst'), ('81', 'Christopher', 'Kass', 'Analyst'), ('85', 'Theresa', 'Bennett', 'Analyst'), ('92', 'Carl', 'Lehmberg', 'Analyst'), ('93', 'Matthew', 'Holgerson', 'Analyst'), ('95', 'Gary', 'Chan', 'Analyst')]

fail = False if len(expected) == len(result) else True
if not fail:
  fail = [item not in expected for item in result]
  print("Fail!") if True in fail else print("Success!")
else:
  print("Fail!")

## Exercise 5: Projection

From the previous set of tuples, return only the specified attributes. Specifically, from the given input relation, you must select the surname and the position from the given tuples.
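Projection keeps every row but only some of its columns. The sketch below shows the idea in plain Python. The first row of `sample_employees` is invented and the other two come from the expected output of Exercise 4. Note that the checker's expected list contains `('Harvey', 'Programmer')` twice, which suggests duplicates are kept here rather than removed.

```python
# Rows are (id, name, surname, position). The first row is made up.
sample_employees = [
    ("0", "Alex", "Example", "Programmer"),
    ("7", "Matthew", "Hawkins", "Analyst"),
    ("10", "Jeanette", "King", "Analyst"),
]

# With an RDD this would be rdd.map(lambda row: (row[2], row[3])).collect().
# Index 2 is the surname and index 3 is the position.
projected = [(row[2], row[3]) for row in sample_employees]
print(projected)
```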

In [None]:
# Here you must insert your code. You must put your result in a variable named 'result'.



In [None]:
# Here is where the result will be checked. Don't modify this!

expected = [('Chun', 'Team Leader'), ('Petersen', 'Team Leader'), ('Hayes', 'Programmer'), ('Dodson', 'Programmer'), ('Baer', 'Team Leader'), ('Serrato', 'Team Leader'), ('Perkins', 'Programmer'), ('Hawkins', 'Analyst'), ('Voorhees', 'Team Leader'), ('Carter', 'Team Leader'), ('King', 'Analyst'), ('Papitto', 'Analyst'), ('Doan', 'Programmer'), ('Patel', 'Team Leader'), ('Phelps', 'Team Leader'), ('Hackathorn', 'Analyst'), ('Faulk', 'Analyst'), ('Warren', 'Analyst'), ('Jamerson', 'Programmer'), ('Dierks', 'Programmer'), ('Reyes', 'Programmer'), ('Sandoval', 'Team Leader'), ('Cole', 'Team Leader'), ('Heath', 'Programmer'), ('Mcshea', 'Team Leader'), ('Sanchez', 'Analyst'), ('Hendrick', 'Programmer'), ('Connell', 'Team Leader'), ('Gaston', 'Analyst'), ('Mcmurray', 'Programmer'), ('Hammon', 'Programmer'), ('Pollmann', 'Analyst'), ('Knowlton', 'Programmer'), ('Franco', 'Analyst'), ('Slater', 'Team Leader'), ('Finch', 'Programmer'), ('Moneyhun', 'Programmer'), ('Beasley', 'Analyst'), ('Lusk', 'Analyst'), ('Reed', 'Team Leader'), ('Born', 'Team Leader'), ('Perez', 'Team Leader'), ('Campbell', 'Analyst'), ('Otero', 'Team Leader'), ('Finkenbinder', 'Analyst'), ('Hall', 'Programmer'), ('Boudreau', 'Team Leader'), ('Morales', 'Analyst'), ('Street', 'Analyst'), ('Barrington', 'Team Leader'), ('Easton', 'Team Leader'), ('Sager', 'Team Leader'), ('Harvey', 'Programmer'), ('Small', 'Programmer'), ('Woodbury', 'Programmer'), ('Stefani', 'Analyst'), ('Chang', 'Analyst'), ('Scott', 'Analyst'), ('Ayer', 'Team Leader'), ('Freeman', 'Analyst'), ('Buel', 'Analyst'), ('Broten', 'Team Leader'), ('Sandhu', 'Team Leader'), ('Pickett', 'Programmer'), ('Flanders', 'Analyst'), ('Hernandez', 'Analyst'), ('Morgan', 'Programmer'), ('Slane', 'Team Leader'), ('Lynum', 'Programmer'), ('Close', 'Team Leader'), ('Cammarata', 'Analyst'), ('Tunis', 'Programmer'), ('Stansberry', 'Programmer'), ('Kwok', 'Programmer'), ('Wells', 'Team Leader'), ('Ziola', 'Team Leader'), ('Mctiernan', 'Programmer'), 
('Stringfellow', 'Analyst'), ('Smith', 'Analyst'), ('Harris', 'Team Leader'), ('White', 'Analyst'), ('Kass', 'Analyst'), ('Lundsford', 'Programmer'), ('Campo', 'Programmer'), ('Sullivan', 'Programmer'), ('Bennett', 'Analyst'), ('Williams', 'Programmer'), ('Harvey', 'Programmer'), ('Myers', 'Team Leader'), ('Nugent', 'Programmer'), ('Colbert', 'Programmer'), ('Numbers', 'Programmer'), ('Lehmberg', 'Analyst'), ('Holgerson', 'Analyst'), ('Ali', 'Programmer'), ('Chan', 'Analyst'), ('Kovar', 'Team Leader'), ('Cochran', 'Team Leader'), ('Bradley', 'Programmer'), ('Gardner', 'Programmer')]
fail = False if len(expected) == len(result) else True
if not fail:
  fail = [item not in expected for item in result]
  print("Fail!") if True in fail else print("Success!")
else:
  print("Fail!")

## Exercise 6: Union

From two sets of tuples (relationA and relationB), compute the union of both sets.
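The checker for this exercise compares the result length with `len(relationA) + len(relationB)`, so the union appears to be a bag union that keeps every row from both inputs. The plain-Python sketch below illustrates that behaviour on a few rows copied from the expected output (which rows belong to which relation is an assumption); RDD `union` likewise keeps duplicates unless `distinct()` is applied afterwards.

```python
# Sample rows copied from the expected output; the split into A and B is assumed.
sample_a = [("0", "Jesus", "Anderson"), ("1", "Bryan", "Smith")]
sample_b = [("0", "Programmer", "33078"), ("1", "Analyst", "22645")]

# With RDDs this would be rdd_a.union(rdd_b).collect().
# Concatenation keeps all rows, so the sizes simply add up.
combined = sample_a + sample_b
print(len(combined))
```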

In [None]:
# The input data will be two lists of tuples
relationA = []
relationB = []
with open('relationA.txt', 'r') as f:
  lines = f.readlines()
  for line in lines:
    item = line.strip().split(',')
    relationA.append((item[1], item[2], item[3]))

with open('relationB.txt', 'r') as f:
  lines = f.readlines()
  for line in lines:
    item = line.strip().split(',')
    relationB.append((item[1], item[2], item[3]))

In [None]:
# Here you must insert your code. You must put your result in a variable named 'result'.



In [None]:
# Here is where the result will be checked. Don't modify this!

expected = [('0', 'Jesus', 'Anderson'), ('1', 'Bryan', 'Smith'), ('2', 'Wendy', 'Loomis'), ('3', 'Jason', 'Benninger'), ('4', 'Sharon', 'Byerly'), ('5', 'Trevor', 'Rodriguez'), ('6', 'Gerard', 'Stewart'), ('7', 'Stanley', 'Hall'), ('8', 'Pamela', 'Taylor'), ('9', 'Nancy', 'Lewis'), ('10', 'Ronald', 'Bullard'), ('11', 'Audrey', 'Sheffield'), ('12', 'Hope', 'Davis'), ('13', 'Rhonda', 'Reddy'), ('14', 'Santiago', 'Young'), ('15', 'Jacqueline', 'Manuel'), ('16', 'Alan', 'Laverty'), ('17', 'Lisa', 'Webb'), ('18', 'Angie', 'Burnette'), ('19', 'Lorrie', 'Luna'), ('20', 'Micheal', 'Allen'), ('21', 'James', 'Mcgarry'), ('22', 'Troy', 'Conrad'), ('23', 'Norma', 'Mcdonald'), ('24', 'Jose', 'Hayes'), ('25', 'Gary', 'Wesley'), ('26', 'Evelia', 'Doyle'), ('27', 'Virginia', 'Porter'), ('28', 'Jenny', 'Voss'), ('29', 'Ronnie', 'Mitchell'), ('30', 'Lee', 'Cross'), ('31', 'Ramona', 'Berggren'), ('32', 'Ken', 'Ward'), ('33', 'Lakisha', 'Frye'), ('34', 'Julie', 'Duncan'), ('35', 'David', 'Pool'), ('36', 'Ann', 'Morse'), ('37', 'Gwen', 'Kellett'), ('38', 'Wilbur', 'Hughes'), ('39', 'Edward', 'Bergeron'), ('40', 'Floyd', 'Marquez'), ('41', 'Dorothy', 'Gale'), ('42', 'Emily', 'Taylor'), ('43', 'John', 'Averett'), ('44', 'Marcy', 'Aguilar'), ('45', 'Dean', 'Davis'), ('46', 'Brian', 'Cope'), ('47', 'Karen', 'Hallowell'), ('48', 'Susan', 'Ray'), ('49', 'Elizabeth', 'Khensovan'), ('50', 'George', 'Heuer'), ('51', 'Alberta', 'Perez'), ('52', 'Maurice', 'Young'), ('53', 'Charles', 'Williams'), ('54', 'Robert', 'Garrison'), ('55', 'Luis', 'Lynch'), ('56', 'Nancy', 'Hale'), ('57', 'David', 'Navejas'), ('58', 'Susan', 'Bryant'), ('59', 'Irene', 'Bell'), ('60', 'Travis', 'Pineau'), ('61', 'Yvonne', 'Johnson'), ('62', 'Evelyn', 'Smith'), ('63', 'Carmen', 'Featherstone'), ('64', 'Josephine', 'Bartlett'), ('65', 'Albert', 'Mendez'), ('66', 'David', 'Mcelwain'), ('67', 'Jamie', 'Harris'), ('68', 'Amy', 'Fernandez'), ('69', 'Michael', 'Rempel'), ('70', 'Amy', 'Parker'), ('71', 'Pamela', 'Richard'), 
('72', 'Vincent', 'Gray'), ('73', 'Christine', 'Horner'), ('74', 'Nina', 'Campbell'), ('75', 'Diane', 'Hamburg'), ('76', 'Ronnie', 'Heinemann'), ('77', 'Marlene', 'Murdock'), ('78', 'Delores', 'Lord'), ('79', 'Corina', 'Garner'), ('80', 'Alvin', 'Bell'), ('81', 'Kimberly', 'Apodaca'), ('82', 'Kenneth', 'Lowe'), ('83', 'Dawn', 'Gaydosh'), ('84', 'Dave', 'Black'), ('85', 'Randal', 'Krokos'), ('86', 'Horace', 'Castillo'), ('87', 'Hazel', 'Washington'), ('88', 'Mary', 'Adams'), ('89', 'Joel', 'Flower'), ('90', 'Kevin', 'Medina'), ('91', 'Bernice', 'Montes'), ('92', 'Thomas', 'Nichols'), ('93', 'Marquita', 'Wenthold'), ('94', 'Felix', 'Smith'), ('95', 'Robyn', 'Sachs'), ('96', 'Hazel', 'Alexander'), ('97', 'Micheal', 'Lubinski'), ('98', 'William', 'Holland'), ('99', 'Apolonia', 'Elder'), ('0', 'Programmer', '33078'), ('1', 'Analyst', '22645'), ('2', 'Analyst', '35181'), ('3', 'Programmer', '29697'), ('4', 'Team Leader', '44010'), ('5', 'Team Leader', '39357'), ('6', 'Analyst', '48253'), ('7', 'Analyst', '49606'), ('8', 'Team Leader', '23842'), ('9', 'Analyst', '29981'), ('10', 'Team Leader', '27469'), ('11', 'Team Leader', '49461'), ('12', 'Team Leader', '45931'), ('13', 'Team Leader', '30674'), ('14', 'Analyst', '21430'), ('15', 'Analyst', '21126'), ('16', 'Programmer', '34916'), ('17', 'Analyst', '28981'), ('18', 'Team Leader', '21272'), ('19', 'Programmer', '49670'), ('20', 'Programmer', '45154'), ('21', 'Analyst', '29054'), ('22', 'Team Leader', '31798'), ('23', 'Team Leader', '26492'), ('24', 'Analyst', '48868'), ('25', 'Analyst', '22699'), ('26', 'Team Leader', '21041'), ('27', 'Programmer', '45251'), ('28', 'Programmer', '35814'), ('29', 'Analyst', '31136'), ('30', 'Programmer', '31798'), ('31', 'Analyst', '30521'), ('32', 'Analyst', '47403'), ('33', 'Programmer', '44642'), ('34', 'Analyst', '29675'), ('35', 'Analyst', '27145'), ('36', 'Analyst', '44616'), ('37', 'Team Leader', '46773'), ('38', 'Team Leader', '23022'), ('39', 'Team Leader', '40547'), ('40', 'Team Leader', '49321'),
('41', 'Team Leader', '39600'), ('42', 'Team Leader', '38385'), ('43', 'Programmer', '23033'), ('44', 'Programmer', '41338'), ('45', 'Team Leader', '41777'), ('46', 'Team Leader', '31837'), ('47', 'Team Leader', '46977'), ('48', 'Analyst', '40218'), ('49', 'Team Leader', '24137'), ('50', 'Programmer', '28663'), ('51', 'Analyst', '27401'), ('52', 'Team Leader', '44634'), ('53', 'Team Leader', '48544'), ('54', 'Programmer', '30358'), ('55', 'Programmer', '28465'), ('56', 'Analyst', '26994'), ('57', 'Programmer', '47451'), ('58', 'Analyst', '20011'), ('59', 'Analyst', '49819'), ('60', 'Team Leader', '26179'), ('61', 'Team Leader', '46239'), ('62', 'Programmer', '46338'), ('63', 'Analyst', '26656'), ('64', 'Team Leader', '23627'), ('65', 'Programmer', '47978'), ('66', 'Programmer', '44355'), ('67', 'Analyst', '40195'), ('68', 'Team Leader', '42450'), ('69', 'Analyst', '46580'), ('70', 'Programmer', '42596'), ('71', 'Analyst', '45553'), ('72', 'Analyst', '49004'), ('73', 'Programmer', '40950'), ('74', 'Team Leader', '43547'), ('75', 'Analyst', '39278'), ('76', 'Team Leader', '41651'), ('77', 'Programmer', '25761'), ('78', 'Analyst', '40379'), ('79', 'Analyst', '20522'), ('80', 'Analyst', '39092'), ('81', 'Analyst', '21265'), ('82', 'Analyst', '39772'), ('83', 'Programmer', '40170'), ('84', 'Team Leader', '20607'), ('85', 'Team Leader', '31321'), ('86', 'Team Leader', '24847'), ('87', 'Analyst', '20699'), ('88', 'Analyst', '37901'), ('89', 'Team Leader', '31881'), ('90', 'Analyst', '30980'), ('91', 'Team Leader', '47792'), ('92', 'Analyst', '44337'), ('93', 'Analyst', '24150'), ('94', 'Programmer', '27402'), ('95', 'Team Leader', '22950'), ('96', 'Analyst', '39956'), ('97', 'Analyst', '26122'), ('98', 'Programmer', '47170'), ('99', 'Analyst', '34381')]

fail = False if len(expected) == len(result) == len(relationA) + len(relationB) else True
if not fail:
  fail = [item not in expected for item in result]
  print("Fail!") if True in fail else print("Success!")
else:
  print("Fail!")


## Exercise 7: Intersection

From two sets of tuples (intRelationA and intRelationB), compute the intersection of both sets.
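An intersection keeps only the rows that appear in both inputs, with each common row reported once. The plain-Python sketch below shows this with sets. Two rows are copied from the expected output of this exercise, and the rows named "Only" are invented for illustration; RDD `intersection` behaves similarly in that its output contains no duplicate elements.

```python
# Rows are (name, surname, position). The "Only" rows are made up.
sample_a = [
    ("Frank", "Collado", "Analyst"),
    ("Allen", "Edmons", "Team Leader"),
    ("Only", "InA", "Programmer"),
]
sample_b = [
    ("Allen", "Edmons", "Team Leader"),
    ("Only", "InB", "Analyst"),
    ("Frank", "Collado", "Analyst"),
]

# With RDDs this would be rdd_a.intersection(rdd_b).collect().
# Sorting is only there to make the printed order predictable.
common = sorted(set(sample_a) & set(sample_b))
print(common)
```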

In [None]:
# The input data will be two lists of tuples
intRelationA = []
intRelationB = []

with open('intRelationA.txt', 'r') as f:
  lines = f.readlines()
  for line in lines:
    item = line.strip().split(',')
    intRelationA.append((item[1], item[2], item[3]))

with open('intRelationB.txt', 'r') as f:
  lines = f.readlines()
  for line in lines:
    item = line.strip().split(',')
    intRelationB.append((item[1], item[2], item[3]))

In [None]:
# Here you must insert your code. You must put your result in a variable named 'result'.


In [None]:
# Here is where the result will be checked. Don't modify this!

expected = [('Frank', 'Collado', 'Analyst'), ('Allen', 'Edmons', 'Team Leader'), ('Tabatha', 'Sonnier', 'Team Leader'), ('Brandon', 'Jenkins', 'Programmer'), ('Shirley', 'Cheatam', 'Team Leader'), ('Royal', 'Chan', 'Programmer'), ('Santiago', 'Galloway', 'Team Leader'), ('Laura', 'Tyler', 'Analyst'), ('Juan', 'Covarrubias', 'Team Leader'), ('Carlton', 'Swan', 'Programmer'), ('Yuri', 'Kavanagh', 'Programmer'), ('Donald', 'Farmer', 'Analyst'), ('Jonathan', 'Harden', 'Analyst'), ('Rebecca', 'Taylor', 'Team Leader'), ('George', 'Wakefield', 'Programmer'), ('Marjorie', 'Carrillo', 'Analyst'), ('Jean', 'Hoffman', 'Analyst'), ('Melissa', 'Little', 'Programmer'), ('Fred', 'Vinck', 'Programmer'), ('Earl', 'Gohn', 'Analyst'), ('Shirley', 'Robinson', 'Team Leader'), ('Angelina', 'Wiseman', 'Analyst'), ('Larry', 'Saldana', 'Analyst'), ('Richard', 'Carr', 'Programmer'), ('David', 'Stott', 'Analyst'), ('Ernest', 'Kim', 'Analyst'), ('Howard', 'Reynolds', 'Programmer'), ('Deborah', 'Free', 'Programmer'), ('Jasmine', 'Aldana', 'Analyst'), ('Joseph', 'Rodgers', 'Team Leader'), ('Alice', 'George', 'Analyst'), ('David', 'Hopkins', 'Programmer'), ('Fern', 'Townsend', 'Team Leader')]

fail = False if len(expected) == len(result) else True
if not fail:
  fail = [item not in expected for item in result]
  print("Fail!") if True in fail else print("Success!")
else:
  print("Fail!")


## Exercise 8: Join

This exercise performs the relational algebra join operation in distributed mode.
Specifically, from the two given input relations (relationA and relationB), you must compute the join of both relations on the ID attribute.
You must get rid of duplicate tuples and columns.
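A join on ID can be pictured in two steps: key each row by its ID, then glue together the rows that share a key while keeping the ID column only once. The sketch below does this in plain Python on rows taken from the expected output. It assumes each ID appears at most once in the first relation, and the names `sample_names` and `sample_jobs` are hypothetical. With pair RDDs, `join` yields `(key, (left_value, right_value))` elements that a `map` would then flatten into one tuple.

```python
# Rows copied from the expected output, split into the two assumed relations.
sample_names = [("4", "Sharon", "Byerly"), ("10", "Ronald", "Bullard")]
sample_jobs = [("10", "Team Leader", "27469"), ("4", "Team Leader", "44010")]

# Like rdd.map(lambda t: (t[0], t[1:])): key the first relation by ID.
# Assumption: IDs are unique in sample_names, so a dict lookup is enough.
names_by_id = {row[0]: row[1:] for row in sample_names}

joined = []
for row in sample_jobs:
    key, rest = row[0], row[1:]
    if key in names_by_id:
        # Flatten into (id, name, surname, position, salary); the ID appears once.
        joined.append((key,) + names_by_id[key] + rest)

print(sorted(joined))
```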

In [None]:
# Here you must insert your code. You must put your result in a variable named 'result'. The result must be in the form:
# result = [('4', 'Sharon', 'Byerly', 'Team Leader', '44010'), ('10', 'Ronald', 'Bullard', 'Team Leader', '27469'), ...]



In [None]:
# Here is where the result will be checked. Don't modify this!

expected = [('4', 'Sharon', 'Byerly', 'Team Leader', '44010'), ('10', 'Ronald', 'Bullard', 'Team Leader', '27469'), ('12', 'Hope', 'Davis', 'Team Leader', '45931'), ('16', 'Alan', 'Laverty', 'Programmer', '34916'), ('20', 'Micheal', 'Allen', 'Programmer', '45154'), ('24', 'Jose', 'Hayes', 'Analyst', '48868'), ('26', 'Evelia', 'Doyle', 'Team Leader', '21041'), ('40', 'Floyd', 'Marquez', 'Team Leader', '49321'), ('44', 'Marcy', 'Aguilar', 'Programmer', '41338'), ('50', 'George', 'Heuer', 'Programmer', '28663'), ('53', 'Charles', 'Williams', 'Team Leader', '48544'), ('54', 'Robert', 'Garrison', 'Programmer', '30358'), ('56', 'Nancy', 'Hale', 'Analyst', '26994'), ('57', 'David', 'Navejas', 'Programmer', '47451'), ('60', 'Travis', 'Pineau', 'Team Leader', '26179'), ('64', 'Josephine', 'Bartlett', 'Team Leader', '23627'), ('70', 'Amy', 'Parker', 'Programmer', '42596'), ('74', 'Nina', 'Campbell', 'Team Leader', '43547'), ('77', 'Marlene', 'Murdock', 'Programmer', '25761'), ('82', 'Kenneth', 'Lowe', 'Analyst', '39772'), ('83', 'Dawn', 'Gaydosh', 'Programmer', '40170'), ('86', 'Horace', 'Castillo', 'Team Leader', '24847'), ('88', 'Mary', 'Adams', 'Analyst', '37901'), ('3', 'Jason', 'Benninger', 'Programmer', '29697'), ('6', 'Gerard', 'Stewart', 'Analyst', '48253'), ('7', 'Stanley', 'Hall', 'Analyst', '49606'), ('15', 'Jacqueline', 'Manuel', 'Analyst', '21126'), ('18', 'Angie', 'Burnette', 'Team Leader', '21272'), ('23', 'Norma', 'Mcdonald', 'Team Leader', '26492'), ('25', 'Gary', 'Wesley', 'Analyst', '22699'), ('30', 'Lee', 'Cross', 'Programmer', '31798'), ('31', 'Ramona', 'Berggren', 'Analyst', '30521'), ('32', 'Ken', 'Ward', 'Analyst', '47403'), ('36', 'Ann', 'Morse', 'Analyst', '44616'), ('42', 'Emily', 'Taylor', 'Team Leader', '38385'), ('43', 'John', 'Averett', 'Programmer', '23033'), ('47', 'Karen', 'Hallowell', 'Team Leader', '46977'), ('49', 'Elizabeth', 'Khensovan', 'Team Leader', '24137'), ('51', 'Alberta', 'Perez', 'Analyst', '27401'), ('59', 'Irene', 'Bell', 
'Analyst', '49819'), ('61', 'Yvonne', 'Johnson', 'Team Leader', '46239'), ('62', 'Evelyn', 'Smith', 'Programmer', '46338'), ('65', 'Albert', 'Mendez', 'Programmer', '47978'), ('67', 'Jamie', 'Harris', 'Analyst', '40195'), ('71', 'Pamela', 'Richard', 'Analyst', '45553'), ('80', 'Alvin', 'Bell', 'Analyst', '39092'), ('81', 'Kimberly', 'Apodaca', 'Analyst', '21265'), ('85', 'Randal', 'Krokos', 'Team Leader', '31321'), ('94', 'Felix', 'Smith', 'Programmer', '27402'), ('95', 'Robyn', 'Sachs', 'Team Leader', '22950'), ('99', 'Apolonia', 'Elder', 'Analyst', '34381'), ('0', 'Jesus', 'Anderson', 'Programmer', '33078'), ('1', 'Bryan', 'Smith', 'Analyst', '22645'), ('8', 'Pamela', 'Taylor', 'Team Leader', '23842'), ('9', 'Nancy', 'Lewis', 'Analyst', '29981'), ('14', 'Santiago', 'Young', 'Analyst', '21430'), ('17', 'Lisa', 'Webb', 'Analyst', '28981'), ('19', 'Lorrie', 'Luna', 'Programmer', '49670'), ('21', 'James', 'Mcgarry', 'Analyst', '29054'), ('22', 'Troy', 'Conrad', 'Team Leader', '31798'), ('29', 'Ronnie', 'Mitchell', 'Analyst', '31136'), ('33', 'Lakisha', 'Frye', 'Programmer', '44642'), ('34', 'Julie', 'Duncan', 'Analyst', '29675'), ('45', 'Dean', 'Davis', 'Team Leader', '41777'), ('48', 'Susan', 'Ray', 'Analyst', '40218'), ('63', 'Carmen', 'Featherstone', 'Analyst', '26656'), ('66', 'David', 'Mcelwain', 'Programmer', '44355'), ('68', 'Amy', 'Fernandez', 'Team Leader', '42450'), ('69', 'Michael', 'Rempel', 'Analyst', '46580'), ('73', 'Christine', 'Horner', 'Programmer', '40950'), ('84', 'Dave', 'Black', 'Team Leader', '20607'), ('91', 'Bernice', 'Montes', 'Team Leader', '47792'), ('93', 'Marquita', 'Wenthold', 'Analyst', '24150'), ('96', 'Hazel', 'Alexander', 'Analyst', '39956'), ('98', 'William', 'Holland', 'Programmer', '47170'), ('2', 'Wendy', 'Loomis', 'Analyst', '35181'), ('5', 'Trevor', 'Rodriguez', 'Team Leader', '39357'), ('11', 'Audrey', 'Sheffield', 'Team Leader', '49461'), ('13', 'Rhonda', 'Reddy', 'Team Leader', '30674'), ('27', 'Virginia', 'Porter', 
'Programmer', '45251'), ('28', 'Jenny', 'Voss', 'Programmer', '35814'), ('35', 'David', 'Pool', 'Analyst', '27145'), ('37', 'Gwen', 'Kellett', 'Team Leader', '46773'), ('38', 'Wilbur', 'Hughes', 'Team Leader', '23022'), ('39', 'Edward', 'Bergeron', 'Team Leader', '40547'), ('41', 'Dorothy', 'Gale', 'Team Leader', '39600'), ('46', 'Brian', 'Cope', 'Team Leader', '31837'), ('52', 'Maurice', 'Young', 'Team Leader', '44634'), ('55', 'Luis', 'Lynch', 'Programmer', '28465'), ('58', 'Susan', 'Bryant', 'Analyst', '20011'), ('72', 'Vincent', 'Gray', 'Analyst', '49004'), ('75', 'Diane', 'Hamburg', 'Analyst', '39278'), ('76', 'Ronnie', 'Heinemann', 'Team Leader', '41651'), ('78', 'Delores', 'Lord', 'Analyst', '40379'), ('79', 'Corina', 'Garner', 'Analyst', '20522'), ('87', 'Hazel', 'Washington', 'Analyst', '20699'), ('89', 'Joel', 'Flower', 'Team Leader', '31881'), ('90', 'Kevin', 'Medina', 'Analyst', '30980'), ('92', 'Thomas', 'Nichols', 'Analyst', '44337'), ('97', 'Micheal', 'Lubinski', 'Analyst', '26122')]
fail = False if len(expected) == len(result) else True
if not fail:
  fail = [item not in expected for item in result]
  print("Fail!") if True in fail else print("Success!")
else:
  print("Fail!")


## Exercise 9: Matrix-Vector product

This exercise performs the matrix-vector product in distributed mode.
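For a matrix $M$ and a vector $v$, entry $i$ of the product is $y_i = \sum_j M_{ij} v_j$. Each matrix entry therefore contributes one partial product to its row, and the partial products are summed per row. The sketch below illustrates this in plain Python. It assumes the matrix is available as `(row, column, value)` triples with 1-based indices, which is a hypothetical layout and may not match the format of matrix.txt; all the data is made up.

```python
from collections import defaultdict

# Hypothetical 2x2 matrix as (row, column, value) triples, 1-based indices.
sample_entries = [(1, 1, 2), (1, 2, 3), (2, 1, 4), (2, 2, 5)]
sample_vector = [10, 20]

# Like rdd.map: each entry emits (row, value * vector[column]).
# In Spark the vector would be read through the broadcast variable's .value.
partials = [(i, m * sample_vector[j - 1]) for i, j, m in sample_entries]

# Like rdd.reduceByKey with addition: sum the partial products of each row.
sums = defaultdict(int)
for i, partial in partials:
    sums[i] += partial

# Row 1: 2*10 + 3*20 = 80; row 2: 4*10 + 5*20 = 140.
product = sorted(sums.items())
print(product)  # [(1, 80), (2, 140)]
```

Broadcasting the vector suits this pattern because every worker needs the whole vector but only reads it.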

In [None]:
# The input data will be a matrix and a vector

matrix = sc.textFile("matrix.txt")
vector = []

with open('vector.txt', 'r') as f:
  lines = f.readlines()
  for line in lines:
    item = line.strip()
    vector.append(int(item))

In [None]:
# Here you must insert your code. You must put your result in a variable named 'result'. The result must be in the form:
# [(1, 20658), (2, 20300), (3, 21288), ...]

bcVector = sc.broadcast(vector)



In [None]:
# Here is where the result will be checked. Don't modify this!

expected = [(1, 20658),
 (2, 20300),
 (3, 21288),
 (4, 20530),
 (5, 20801),
 (6, 20394),
 (7, 20007),
 (8, 20247),
 (9, 20316),
 (10, 19964),
 (11, 20495),
 (12, 20935),
 (13, 21306),
 (14, 20461),
 (15, 20257),
 (16, 20486),
 (17, 20362),
 (18, 20720),
 (19, 20416),
 (20, 20882),
 (21, 20590),
 (22, 19719),
 (23, 20267),
 (24, 20563),
 (25, 20924),
 (26, 20685),
 (27, 19788),
 (28, 20383),
 (29, 20059),
 (30, 20388),
 (31, 21105),
 (32, 21740),
 (33, 20962),
 (34, 20408),
 (35, 20338),
 (36, 20416),
 (37, 20072),
 (38, 19814),
 (39, 21483),
 (40, 21523),
 (41, 20655),
 (42, 20104),
 (43, 20687),
 (44, 20577),
 (45, 20784),
 (46, 20759),
 (47, 19452),
 (48, 20088),
 (49, 20717),
 (50, 20018)]

fail = False if len(expected) == len(result) else True
if not fail:
  fail = [item not in expected for item in result]
  print("Fail!") if True in fail else print("Success!")
else:
  print("Fail!")

# Congratulations!

You have finished the Spark RDD tutorial. I strongly recommend checking the documentation at https://spark.apache.org/docs/latest/rdd-programming-guide.html