# Lesson 07 - Subsetting and Partitions

In [0]:
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Subsetting RDDs

As mentioned in the previous lesson, we need to be careful about calling the `collect()` method on a large dataset. If the dataset is too large to fit in the memory available to the driver process, the driver will run out of memory and fail, terminating the Spark application. In this lesson we will explore techniques for inspecting the contents of an RDD without collecting all of it onto a single node.

To demonstrate the use of these methods, we will create an RDD named `letters_rdd` that contains 10,000 string values randomly sampled from the letters `'A'`, `'B'`, `'C'`, `'D'`, `'E'`, and `'F'`.

In [0]:
letters_arr = np.random.choice(list('ABCDEF'), size=10000)
letters_rdd = sc.parallelize(letters_arr)

### The take() Method

The **`take()`** action returns the first `num` elements of an RDD to the driver as a list. The single required parameter, `num`, sets the number of elements to return. If the RDD contains fewer than `num` elements, all of them are returned.

In [0]:
print(letters_rdd.take(12))
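
To see why `take()` is safer than `collect()` on a large RDD, it can help to model an RDD as a list of partitions. The plain-Python sketch below is only an illustration, not Spark's implementation: the name `take_sketch` and the list-of-lists model are assumptions made for this example, and Spark's real strategy differs in its details (for example, it may scan several partitions at a time). The sketch walks the partitions in order and stops as soon as `num` elements have been gathered, so later partitions are never read.

```python
def take_sketch(partitions, num):
    """Return the first `num` elements, reading as few partitions as possible."""
    taken = []
    partitions_read = 0
    for part in partitions:
        if len(taken) >= num:
            break  # enough elements gathered; skip the remaining partitions
        partitions_read += 1
        taken.extend(part[:num - len(taken)])
    return taken, partitions_read


# A toy "RDD" with four partitions of three elements each (hypothetical data).
toy_partitions = [['A', 'C', 'F'], ['B', 'B', 'E'], ['D', 'A', 'C'], ['F', 'E', 'D']]

first_four, n_read = take_sketch(toy_partitions, 4)
print(first_four)  # ['A', 'C', 'F', 'B']
print(n_read)      # 2
```

Only two of the four toy partitions are touched here, whereas a `collect()`-style operation would have to read all of them.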

### The distinct() Method

We can use the **`distinct()`** transformation to create a new RDD containing only one copy of each element in the source RDD. When we call this method, each executor determines the unique elements present in the partitions it has been assigned. Equal values found in different partitions are then brought together through a shuffle so that only one copy of each is kept. The result is a new distributed RDD, and nothing is sent to the driver until we call an action on it. The order of the elements in the result is not guaranteed.

Note that since `distinct()` is a transformation, we need to `collect()` the returned RDD in order to view it.

In [0]:
unique_rdd = letters_rdd.distinct()

print(unique_rdd.collect())
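
The idea behind `distinct()` can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's implementation: the RDD is modeled as a list of partitions, and the helper name `distinct_sketch` and its hash-based routing rule are hypothetical. Each partition is deduplicated locally, the surviving values are routed to an output partition chosen by their hash so that equal values always meet, and each output partition keeps one copy of each value.

```python
def distinct_sketch(partitions, num_output_partitions):
    """Model distinct(): local dedup, hash-based routing, then final dedup."""
    # Step 1: each partition removes its own duplicates.
    local_uniques = [set(part) for part in partitions]

    # Step 2: route every surviving value to an output partition by hash,
    # so equal values from different partitions land in the same place.
    # Step 3 happens implicitly because each output partition is a set.
    output = [set() for _ in range(num_output_partitions)]
    for uniques in local_uniques:
        for value in uniques:
            output[hash(value) % num_output_partitions].add(value)

    return [sorted(part) for part in output]


# A toy "RDD" with three partitions (hypothetical data).
toy_partitions = [['A', 'C', 'A', 'F'], ['B', 'B', 'E', 'A'], ['D', 'A', 'C', 'C']]
result = distinct_sketch(toy_partitions, 2)

# Flatten the output partitions, much as collect() would.
flattened = sorted(value for part in result for value in part)
print(flattened)  # ['A', 'B', 'C', 'D', 'E', 'F']
```

Note that the output is still a collection of partitions rather than a single list, which mirrors the fact that `distinct()` returns an RDD.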

### The sample() Method

We can use the **`sample()`** transformation to generate a random sample from an RDD. This method has two required parameters:
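* **`withReplacement`** - Boolean value that controls whether or not the sampling is performed with replacement. 
* **`fraction`** - When sampling without replacement, sets the probability that any given element will be selected. When sampling with replacement, sets the expected number of times each element is selected.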

Note that since `sample()` is a transformation, we need to `collect()` the returned RDD in order to view it. Run the cell below several times. You should get a different result each time. The expected size of the sample is equal to `fraction` times the size of the RDD, but the actual size of the sample will vary with each call. Change the value of the `fraction` parameter in the code below to see how this affects the result.

In [0]:
letters_sample = letters_rdd.sample(withReplacement=False, fraction=0.005)
print(letters_sample.count())
print(letters_sample.collect())
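
Because each element is selected independently with probability `fraction` when sampling without replacement, the sample size follows a binomial distribution. The short calculation below is a sketch that uses only the numbers from this lesson (10,000 elements and `fraction=0.005`) to give a rough idea of how much the printed count can be expected to vary from run to run.

```python
import math

n = 10_000        # number of elements in letters_rdd
fraction = 0.005  # probability that each element is selected

# Mean and standard deviation of a binomial(n, fraction) sample size.
expected_size = n * fraction
std_dev = math.sqrt(n * fraction * (1 - fraction))

print(f'Expected sample size: {expected_size:.0f}')
print(f'Standard deviation:   {std_dev:.2f}')
```

With an expected size of 50 and a standard deviation of roughly 7, counts in the 40s or 50s should be common, and counts a little outside that range are not surprising.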

When sampling without replacement, Spark considers each element of the RDD and determines at random whether or not that particular value will be in the sample. The probability of an individual value being sampled is equal to `fraction`. The code cell below provides a simple example that can be used to explore this idea.

In [0]:
my_rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

sample_rdd = my_rdd.sample(withReplacement=False, fraction=0.3)

print(sample_rdd.collect())
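
This per-element selection process can be mimicked in plain Python. The function below is a sketch, not Spark's sampler: the name `bernoulli_sample` and its `seed` argument are hypothetical and exist only for this illustration. It visits each element once and keeps it with probability `fraction`, which is why no element can be selected more than once and why the sample size changes between runs.

```python
import random


def bernoulli_sample(data, fraction, seed=None):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Each seed gives an independent sample, so the sizes generally vary.
for seed in range(5):
    print(bernoulli_sample(values, fraction=0.3, seed=seed))

# Averaged over many runs, the sample size approaches fraction * len(values).
sizes = [len(bernoulli_sample(values, 0.3, seed=s)) for s in range(2000)]
print(sum(sizes) / len(sizes))
```

The final average should land close to 3, even though individual samples may contain anywhere from zero to ten elements.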

## Partitions

When an RDD is created, its contents are split into several partitions, which are then distributed across the cluster with each worker node receiving some of the partitions. We can determine the number of partitions that an RDD has been split into by calling the `getNumPartitions()` method of the RDD.

In [0]:
random_arr = np.random.choice(range(100), size=100)
random_rdd = sc.parallelize(random_arr)

print('Number of Elements in RDD:  ', random_rdd.count())
print('Number of Partitions in RDD:', random_rdd.getNumPartitions())

Spark takes care of managing the partitions and we typically do not need to worry about how the RDD elements are distributed across the partitions. However, if we do need to explore this, we can use the `glom()` transformation to view the contents of each partition. This method will return an RDD containing several lists, each of which will represent one of the RDD's partitions.

In the cell below, we view the elements in the first of the eight partitions that `random_rdd` has been split into.

In [0]:
partitions_rdd = random_rdd.glom()

print('Number of Partitions:       ', partitions_rdd.count())
print('Contents of First Partition:', partitions_rdd.take(1)[0])

In the next cell, we will determine the number of elements in each of the 8 partitions.

In [0]:
partitions_list = partitions_rdd.collect()
for i, p in enumerate(partitions_list):
    print(f'Partition Number {i} contains {len(p)} elements.')

Notice that `random_rdd` was split into 8 partitions. This number is not fixed, so you may see a different count in your own environment. The default number of partitions to use for RDDs can be set when creating a Spark session, and can be checked using the `defaultParallelism` attribute of our `SparkContext` object, `sc`.

In [0]:
print('Default Parallelism:', sc.defaultParallelism)

We can control the number of partitions that are to be used for an RDD by setting the optional `numSlices` parameter of the `parallelize()` method.

In [0]:
random_rdd_1 = sc.parallelize(random_arr, numSlices=1)
random_rdd_2 = sc.parallelize(random_arr, numSlices=2)
random_rdd_4 = sc.parallelize(random_arr, numSlices=4)
random_rdd_8 = sc.parallelize(random_arr, numSlices=8)

print(random_rdd_1.getNumPartitions())
print(random_rdd_2.getNumPartitions())
print(random_rdd_4.getNumPartitions())
print(random_rdd_8.getNumPartitions())
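
To build intuition for what `numSlices` controls, the sketch below splits a list into a requested number of contiguous chunks in plain Python. The helper `slice_evenly` and its splitting rule are assumptions made for illustration. PySpark's own logic for assigning elements to partitions may produce different partition sizes, so use `glom()` to check the real layout of an RDD.

```python
def slice_evenly(data, num_slices):
    """Split `data` into `num_slices` contiguous chunks of near-equal size."""
    n = len(data)
    return [
        data[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


data = list(range(10))

# Show the chunk sizes for the same slice counts used in the cell above.
for k in [1, 2, 4, 8]:
    chunks = slice_evenly(data, k)
    print(k, [len(c) for c in chunks])
```

Every element ends up in exactly one chunk, and the number of chunks always equals the requested slice count, which matches what `getNumPartitions()` reports for the RDDs above.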