## SparkContext and SparkSession <a class="anchor" name="one"></a>

In [1]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

master = "local[*]"
app_name = "Parallel Search"
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')


Agenda:
1. Data Partitioning
2. RDD Partitioning
3. Parallel Search in RDD
4. Spark DataFrame
5. Parallel Search in Spark DataFrame
6. Parallel Search using Spark SQL

## Data Partitioning <a class="anchor" id="two"></a>


#### 1. Round-robin data partitioning ###
Round-robin data partitioning is the simplest data partitioning method: each record in turn is allocated to the next processing element (or simply, processor). Since it distributes the data evenly among all processors, it is also known as "equal-partitioning".
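
As a rough illustration of the idea (plain Python, not Spark's internal implementation), record `i` can be sent to processor `i % n`. The function name `round_robin_partition` and the sample list are assumptions made for this sketch.

```python
def round_robin_partition(records, n):
    """Assign record i to partition i % n, cycling through the partitions."""
    partitions = [[] for _ in range(n)]
    for i, record in enumerate(records):
        partitions[i % n].append(record)
    return partitions

# Hypothetical sample data
players = ["Ronaldo", "Messi", "Modric", "Xavi", "Iniesta"]
rr_parts = round_robin_partition(players, 3)
print(rr_parts)
# [['Ronaldo', 'Xavi'], ['Messi', 'Iniesta'], ['Modric']]
```

Partition sizes differ by at most one record, which is why the method balances load well but gives no help when searching for a specific key.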

#### 2. Range data partitioning ###
Range data partitioning distributes records based on a given range of the partitioning attribute. For example, a student table can be partitioned on "Last Name" in alphabetical order (i.e. A ~ Z), with each processor holding one range of letters.
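
A minimal sketch of the "Last Name" example in plain Python, assuming three hypothetical letter ranges (A-H, I-P, Q-Z). The helper name `range_partition` and the split points are illustrative choices, not part of the tutorial.

```python
from bisect import bisect_left

def range_partition(records, key, upper_bounds):
    """Place each record in the first range whose upper bound is >= its key.

    Keys above the last bound fall into one extra, final partition.
    """
    partitions = [[] for _ in range(len(upper_bounds) + 1)]
    for record in records:
        partitions[bisect_left(upper_bounds, key(record))].append(record)
    return partitions

# Hypothetical sample data and split points: A-H, I-P, Q-Z
last_names = ["Adams", "Baker", "Lopez", "Nguyen", "Smith", "Young"]
range_parts = range_partition(last_names, key=lambda name: name[0], upper_bounds=["H", "P"])
print(range_parts)
# [['Adams', 'Baker'], ['Lopez', 'Nguyen'], ['Smith', 'Young']]
```

A search for a given last name only needs to visit the one partition whose range covers it.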

#### 3. Hash data partitioning ###
Hash data partitioning assigns records to partitions by applying a hash function to a particular attribute. The result of the hash function determines the processor where the record will be placed. Thus, all records within a partition map to the same processor under the hash function, and records with equal attribute values always end up together.
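
The sketch below illustrates the idea in plain Python. It assumes `zlib.crc32` as the hash function, chosen only because it is deterministic across runs; the helper name `hash_partition` and the sample records are hypothetical.

```python
from zlib import crc32

def hash_partition(records, key, n):
    """Send each record to partition hash(key) % n."""
    partitions = [[] for _ in range(n)]
    for record in records:
        h = crc32(str(key(record)).encode("utf-8"))  # deterministic hash of the key
        partitions[h % n].append(record)
    return partitions

# Hypothetical sample data; note the duplicate key 3
records = [(1, "Ronaldo"), (2, "Messi"), (3, "Modric"), (3, "Valverde"), (18, "Bellingham")]
hash_parts = hash_partition(records, key=lambda r: r[0], n=4)
for index, part in enumerate(hash_parts):
    print(index, part)
```

Records that share a key always land in the same partition, so an exact-match search on the key needs to inspect only one partition.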

## RDD partitioning <a class="anchor" id="three"></a>

By default, Spark partitions the data using <strong>Random equal partitioning</strong> unless a specific transformation uses a different type of partitioning.
In the code that follows, we define two functions to implement custom partitioning using <strong>Range Partitioning</strong> and <strong>Hash Partitioning</strong>.


In [2]:
from pyspark.rdd import RDD

def print_partitions(data):
    if isinstance(data, RDD):
        numPartitions = data.getNumPartitions()
        partitions = data.glom().collect()
    else:
        numPartitions = data.rdd.getNumPartitions()
        partitions = data.rdd.glom().collect()
    
    print(f"NUMBER OF PARTITIONS: {numPartitions}")
    for index, partition in enumerate(partitions):
        if len(partition) > 0:
            print(f"Partition {index}: {len(partition)} records")
            print(partition)

In [3]:
list_players = [(1,'Ronaldo'),(2,'Messi'),(3,'Modric'),(4,'Xavi'),(5,'Iniesta'),
                (10,'Kroos'),(11,'Bale'),(12, 'Benzema'),(3, 'Valverde'),(18,'Bellingham'),(9,'Carvajal')]

#Define the number of partitions
no_of_partitions = 4

### Default Partitioning in Spark RDD <a class="anchor" id="default"></a>

In [4]:
# random equal partition
rdd = sc.parallelize(list_players, no_of_partitions)

In [5]:
print("Number of partitions:{}".format(rdd.getNumPartitions()))
print("Partitioner:{}".format(rdd.partitioner))
print_partitions(rdd)  

Number of partitions:4
Partitioner:None
NUMBER OF PARTITIONS: 4
Partition 0: 2 records
[(1, 'Ronaldo'), (2, 'Messi')]
Partition 1: 4 records
[(3, 'Modric'), (4, 'Xavi'), (5, 'Iniesta'), (10, 'Kroos')]
Partition 2: 2 records
[(11, 'Bale'), (12, 'Benzema')]
Partition 3: 3 records
[(3, 'Valverde'), (18, 'Bellingham'), (9, 'Carvajal')]


### Hash Partitioning in RDD <a class="anchor" id="hash"></a>
Hash partitioning uses the formula <code>partition = hash_function(key) % numPartitions</code> to determine which partition a data item falls into.

In [7]:
# Hash function used to implement hash partitioning.
# It simply computes the sum of the key's digits.
# Example: hash_function(12) returns 1 + 2 = 3,
# so the partition index is hash_function(12) % numPartitions = 3 % 4 = 3.

def hash_function(key):
    total = 0
    for digit in str(key):
        total += int(digit)
    return total

In [8]:
# hash partitioning
hash_partitioned_rdd = rdd.partitionBy(no_of_partitions, hash_function)
print_partitions(hash_partitioned_rdd)            

NUMBER OF PARTITIONS: 4
Partition 0: 1 records
[(4, 'Xavi')]
Partition 1: 5 records
[(1, 'Ronaldo'), (5, 'Iniesta'), (10, 'Kroos'), (18, 'Bellingham'), (9, 'Carvajal')]
Partition 2: 2 records
[(2, 'Messi'), (11, 'Bale')]
Partition 3: 3 records
[(3, 'Modric'), (12, 'Benzema'), (3, 'Valverde')]
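
The placement above can be reproduced without Spark by applying the same formula to each key. In this sketch, `digit_sum` is a stand-in name for the `hash_function` defined earlier, and `placement` is a hypothetical helper dictionary.

```python
def digit_sum(key):
    """Sum of the decimal digits of the key (same idea as hash_function)."""
    return sum(int(d) for d in str(key))

keys = [1, 2, 3, 4, 5, 10, 11, 12, 3, 18, 9]
num_partitions = 4

# Group keys by digit_sum(key) % num_partitions
placement = {}
for k in keys:
    placement.setdefault(digit_sum(k) % num_partitions, []).append(k)

for p in sorted(placement):
    print(p, placement[p])
# 0 [4]
# 1 [1, 5, 10, 18, 9]
# 2 [2, 11]
# 3 [3, 12, 3]
```

Partition 1 receives 5 of the 11 keys, which shows how a simple hash function can produce skewed partitions.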


### Range Partitioning in RDD <a class="anchor" id="range"></a>
This strategy uses a range to distribute the items to respective partitions when the keys fall within the range. 

In [9]:
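no_of_partitions = 4

# Inclusive [low, high] key ranges, one per partition
range_arr = [[1, 4], [5, 9], [10, 14], [15, 19]]

def range_function(key):
    for index, item in enumerate(range_arr):
        if item[0] <= key <= item[1]:
            return index
    # Without this, an out-of-range key would return None and break partitionBy
    raise ValueError(f"Key {key} is outside the defined ranges")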


In [10]:
# range partition
range_partitioned_rdd = rdd.partitionBy(no_of_partitions, range_function)
print_partitions(range_partitioned_rdd)

NUMBER OF PARTITIONS: 4
Partition 0: 5 records
[(1, 'Ronaldo'), (2, 'Messi'), (3, 'Modric'), (4, 'Xavi'), (3, 'Valverde')]
Partition 1: 2 records
[(5, 'Iniesta'), (9, 'Carvajal')]
Partition 2: 3 records
[(10, 'Kroos'), (11, 'Bale'), (12, 'Benzema')]
Partition 3: 1 records
[(18, 'Bellingham')]
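
When there are many ranges, the linear scan can be replaced by a binary search over the upper bounds. The following is a sketch of that alternative, not the tutorial's own function; `upper_bounds` and `range_index` are names introduced here, and the bounds mirror the ranges 1-4, 5-9, 10-14 and 15-19 used above.

```python
from bisect import bisect_left

lower_limit = 1
upper_bounds = [4, 9, 14, 19]  # inclusive upper bound of each range

def range_index(key):
    """Return the index of the range containing key, using binary search."""
    idx = bisect_left(upper_bounds, key)
    if key < lower_limit or idx == len(upper_bounds):
        raise ValueError(f"Key {key} is outside the defined ranges")
    return idx

keys = [1, 2, 3, 4, 5, 10, 11, 12, 3, 18, 9]
indices = [range_index(k) for k in keys]
print(indices)
# [0, 0, 0, 0, 1, 2, 2, 2, 0, 3, 1]
```

The resulting indices agree with the partition contents printed above; a function like this could be passed to `partitionBy` in the same way.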


## Parallel Search using RDDs  <a class="anchor" id="parallel-search-rdd"></a>

Now we will implement basic search functionality and visualise the parallelism built into Spark for these kinds of queries.

This tutorial uses a CSV dataset, **bank.csv**. We won't analyse the case study itself; we will only run some search queries on the data.

In [11]:
bank_rdd = sc.textFile('bank.csv')

print(f"Total partitions: {bank_rdd.getNumPartitions()}")
print(f"Number of lines: {bank_rdd.count()}")

bank_rdd.take(4)

Total partitions: 2
Number of lines: 11163


['age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit',
 '59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes',
 '56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes',
 '41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes']

### Search in RDDs based on multiple conditions

We will filter on four attributes from the data: age, education, marital and balance. The results will also display day, month and deposit.
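
Conceptually, a multi-condition search applies the same predicate to every partition independently and then merges the local matches. The sketch below imitates this with a thread pool in plain Python; it is only an analogy for what Spark does, and the rows, `matches` and `search_partition` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows: (age, marital, education, balance), split into two partitions
partitions = [
    [(59, "married", "secondary", 2343), (29, "married", "secondary", 1135)],
    [(41, "married", "secondary", 1270), (26, "married", "secondary", 1417)],
]

def matches(row):
    """All conditions must hold for a row to be selected."""
    age, marital, education, balance = row
    return (1000 < balance < 2000
            and education in ("primary", "secondary")
            and age < 30
            and marital == "married")

def search_partition(partition):
    """Local search: scan one partition and keep the matching rows."""
    return [row for row in partition if matches(row)]

# Each partition is searched independently; results are merged at the end
with ThreadPoolExecutor() as pool:
    local_results = list(pool.map(search_partition, partitions))

found = [row for part in local_results for row in part]
print(found)
# [(29, 'married', 'secondary', 1135), (26, 'married', 'secondary', 1417)]
```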

In [12]:
# Split each line separated by comma into a list 
bank_rdd1 = bank_rdd.map(lambda line: line.split(','))

# Remove the header
header = bank_rdd1.first()
bank_rdd1 = bank_rdd1.filter(lambda row: row != header)   #filter out header

# Indices for each attribute we will use
# Filter: age, education, marital, balance = 0, 3, 2, 5
# Display additional: day, month, deposit = 9, 10, 16

bank_rdd1 = bank_rdd1.filter(lambda x: int(x[5])>1000 and int(x[5])<2000)
bank_rdd1 = bank_rdd1.filter(lambda x: x[3] in ['primary','secondary'] and int(x[0])<30)
bank_rdd1 = bank_rdd1.filter(lambda x: x[2]=='married' )
bank_rdd1 = bank_rdd1.map(lambda field: (field[0],field[2],field[3],field[5],
                                         field[9],field[10],field[16]))
print(bank_rdd1.count())

18


In [13]:
numPartitions = bank_rdd1.getNumPartitions()
print(f"Total partitions: {numPartitions}")

# glom(): Return an RDD created by coalescing all elements within each partition into a list
partitions = bank_rdd1.glom().collect()
for index,partition in enumerate(partitions):
    print(f'Partition {index}:')
    for record in partition:
        print(record)

Total partitions: 2
Partition 0:
('29', 'married', 'secondary', '1135', '17', 'feb', 'yes')
('27', 'married', 'secondary', '1293', '8', 'apr', 'yes')
('29', 'married', 'secondary', '1180', '17', 'apr', 'yes')
('28', 'married', 'secondary', '1086', '20', 'apr', 'yes')
('26', 'married', 'secondary', '1595', '15', 'jun', 'yes')
('27', 'married', 'secondary', '1596', '1', 'sep', 'yes')
('28', 'married', 'secondary', '1595', '9', 'sep', 'yes')
('27', 'married', 'secondary', '1595', '29', 'dec', 'yes')
Partition 1:
('26', 'married', 'secondary', '1417', '6', 'jun', 'no')
('23', 'married', 'secondary', '1309', '3', 'jun', 'no')
('24', 'married', 'secondary', '1222', '20', 'apr', 'no')
('28', 'married', 'secondary', '1238', '14', 'may', 'no')
('26', 'married', 'secondary', '1595', '2', 'mar', 'no')
('27', 'married', 'secondary', '1303', '21', 'may', 'no')
('25', 'married', 'secondary', '1782', '19', 'jun', 'no')
('28', 'married', 'secondary', '1137', '6', 'feb', 'no')
('28', 'married', 'second

### Searching max/min value of an attribute in an RDD
This task finds the record in the dataset that contains the highest value for a given attribute. In this case the attribute chosen is "balance".

In [14]:
# Read csv but now with 4 partitions
bank_rdd_4 = sc.textFile('bank.csv',4)

# Split and remove the header
bank_rdd_4 = bank_rdd_4.map(lambda line: line.split(','))
header = bank_rdd_4.first()
bank_rdd_4 = bank_rdd_4.filter(lambda row: row != header)   #filter out header


In [15]:
# Get max by value in index 5 (balance)

# Wrong way: balance is still a string, so max() compares lexicographically ('9994' > '81204')
result_max_balance = bank_rdd_4.max(key=lambda x: x[5]) 
print(result_max_balance)

['58', 'self-employed', 'married', 'secondary', 'no', '9994', 'no', 'no', 'cellular', '10', 'jul', '400', '1', '-1', '0', 'unknown', 'no']


In [16]:
# Correct way: convert balance to int before comparing
result_max_balance2 = bank_rdd_4.takeOrdered(1,lambda x: -1*int(x[5]))[0]
print(result_max_balance2)

result_max_balance3 = bank_rdd_4.reduce(lambda x,y: y if int(y[5])>int(x[5]) else x)
print(result_max_balance3)

['84', 'retired', 'married', 'secondary', 'no', '81204', 'no', 'no', 'telephone', '28', 'dec', '679', '1', '313', '2', 'other', 'yes']
['84', 'retired', 'married', 'secondary', 'no', '81204', 'no', 'no', 'telephone', '28', 'dec', '679', '1', '313', '2', 'other', 'yes']
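
The difference between the two results comes from string versus numeric comparison. The plain-Python sketch below shows the pitfall, followed by the two-phase pattern (local maximum per partition, then a global maximum) that a parallel max search relies on. The values and the partition split are hypothetical.

```python
# Hypothetical balances, still stored as strings as in the RDD
balances = ["2343", "45", "9994", "81204", "1270"]

wrong_max = max(balances)           # character-by-character comparison: '9' > '8'
right_max = max(balances, key=int)  # numeric comparison
print(wrong_max, right_max)         # 9994 81204

# Two-phase parallel max: each partition finds its local maximum,
# then the local maxima are compared to get the global maximum
partitions = [["2343", "45"], ["9994", "81204", "1270"]]
local_maxima = [max(p, key=int) for p in partitions]
global_max = max(local_maxima, key=int)
print(local_maxima, global_max)     # ['2343', '81204'] 81204
```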


## DataFrames in Spark <a class="anchor" id="dataframes"></a>


In [17]:
df = spark.createDataFrame([(1,'Ronaldo', 'R'),(2,'Messi', 'M'),(3,'Modric', 'M'),(4,'Xavi', 'X'),(5,'Iniesta', 'I'),
                            (10,'Kroos', 'K'),(11,'Bale', 'B'),(12, 'Benzema', 'B'),(3, 'Valverde', 'V'),(18,'Bellingham', 'B'),(9,'Carvajal', 'C')],
                           ['Id','Name','Initial'])

df.show(5)
df.printSchema()

+---+-------+-------+
| Id|   Name|Initial|
+---+-------+-------+
|  1|Ronaldo|      R|
|  2|  Messi|      M|
|  3| Modric|      M|
|  4|   Xavi|      X|
|  5|Iniesta|      I|
+---+-------+-------+
only showing top 5 rows

root
 |-- Id: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Initial: string (nullable = true)



Another way to create a DataFrame is to use the <strong>spark.read.csv</strong> method, which loads the data from a CSV file into a DataFrame.

### Partitioning in DataFrames <a class="anchor" id="df-partitioning"></a>

In [18]:
# Round-robin data partitioning
df_round = df.repartition(2)

# Range data partitioning
df_range = df.repartitionByRange(2, "Initial")

# Hash data partitioning
column_hash = "Id"
df_hash = df.repartition(column_hash)

### Parallel Search using Spark Dataframe <a class="anchor" id="parallel_search_df"></a>



In [19]:
df = spark.read.csv("bank.csv",header=True)
bank_df = df.repartition(5)

In [20]:
from pyspark.sql.functions import col

bank_df = bank_df.filter(col("balance")>1000)\
            .filter(col("balance")<2000)
bank_df = bank_df.where(col("education").isin(["primary", "secondary"])).filter(col("age")<30)
bank_df = bank_df.filter(bank_df["marital"]=='married')
bank_df = bank_df.select("age", "education", "balance", "day", "month", "deposit")
bank_df.show()

+---+---------+-------+---+-----+-------+
|age|education|balance|day|month|deposit|
+---+---------+-------+---+-----+-------+
| 27|secondary|   1293|  8|  apr|    yes|
| 26|secondary|   1595| 15|  jun|    yes|
| 28|secondary|   1595|  9|  sep|    yes|
| 28|secondary|   1020| 28|  may|     no|
| 27|secondary|   1303| 21|  may|     no|
| 28|secondary|   1086| 20|  apr|    yes|
| 23|secondary|   1309|  3|  jun|     no|
| 29|secondary|   1180| 17|  apr|    yes|
| 26|secondary|   1595|  2|  mar|     no|
| 26|secondary|   1417|  6|  jun|     no|
| 27|secondary|   1595| 29|  dec|    yes|
| 24|secondary|   1222| 20|  apr|     no|
| 28|secondary|   1137|  6|  feb|     no|
| 25|secondary|   1782| 19|  jun|     no|
| 28|secondary|   1238| 14|  may|     no|
| 27|secondary|   1596|  1|  sep|    yes|
| 29|secondary|   1386| 28|  may|     no|
| 29|secondary|   1135| 17|  feb|    yes|
+---+---------+-------+---+-----+-------+



In [21]:
print_partitions(bank_df)

NUMBER OF PARTITIONS: 5
Partition 0: 4 records
[Row(age='27', education='secondary', balance='1293', day='8', month='apr', deposit='yes'), Row(age='26', education='secondary', balance='1595', day='15', month='jun', deposit='yes'), Row(age='28', education='secondary', balance='1595', day='9', month='sep', deposit='yes'), Row(age='28', education='secondary', balance='1020', day='28', month='may', deposit='no')]
Partition 1: 4 records
[Row(age='27', education='secondary', balance='1303', day='21', month='may', deposit='no'), Row(age='28', education='secondary', balance='1086', day='20', month='apr', deposit='yes'), Row(age='23', education='secondary', balance='1309', day='3', month='jun', deposit='no'), Row(age='29', education='secondary', balance='1180', day='17', month='apr', deposit='yes')]
Partition 2: 3 records
[Row(age='26', education='secondary', balance='1595', day='2', month='mar', deposit='no'), Row(age='26', education='secondary', balance='1417', day='6', month='jun', deposit='

### Parallel Search using SQL language in Spark  <a class="anchor" id="parallel_search_sparksql"></a>
#### Spark SQL


In [None]:
# register the original DataFrame as a temp view so that we can query it using SQL
df.createOrReplaceTempView("df_sql")
filter_sql = spark.sql('''
  SELECT 
      age,education,balance,day,month,deposit
  FROM 
      df_sql
  WHERE 
      balance > 1000 AND balance < 2000
  AND education in ('secondary','primary')
  AND age < 30
  AND marital = 'married'
''')
# filter_sql.explain()
filter_sql.collect()