# Partitions in your data

- SparkContext's `textFile()` method takes an optional second argument, `minPartitions`, which sets the minimum number of partitions. In this exercise, you'll create an RDD named `fileRDD_part` with `5` partitions and compare it with the `fileRDD` that you created in the previous exercise.

- Remember, you already have a `SparkContext` `sc`, `file_path` and `fileRDD` available in your workspace.

## Instructions

- Find the number of partitions in the `fileRDD` RDD.
- Create an RDD named `fileRDD_part` from the file path, this time with `5` partitions.
- Confirm the number of partitions in the new `fileRDD_part` RDD.

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [3]:
file_path = 'file:////home/talentum/spark/README.md'
# Print the file_path
print("The file_path is", file_path)

# Create a fileRDD from file_path
fileRDD = sc.textFile(file_path)

# Check the number of partitions in fileRDD
print("Number of partitions in fileRDD is", fileRDD.getNumPartitions())

# Create a fileRDD_part from file_path with 5 partitions
fileRDD_part = sc.textFile(file_path, minPartitions=5)

# Check the number of partitions in fileRDD_part
print("Number of partitions in fileRDD_part is", fileRDD_part.getNumPartitions())

The file_path is file:////home/talentum/spark/README.md
Number of partitions in fileRDD is 2
Number of partitions in fileRDD_part is 5
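
`minPartitions` is a lower bound rather than an exact count. `textFile()` reads through Hadoop's `TextInputFormat`, which derives a target split size from the file size and the requested number of splits. The sketch below is a simplified, pure-Python model of that arithmetic for a single splittable file. The function name `model_splits`, the byte counts, and the 32 MB block-size default are assumptions made for illustration, not values taken from this notebook.

```python
def model_splits(total_bytes, min_partitions, block_size=32 * 1024 * 1024):
    """Return the split lengths for one splittable file (simplified model)."""
    # The last split may run up to 10% over split_size before a new one is cut
    split_slop = 1.1
    # Target size per split if the file were divided into min_partitions parts
    goal_size = total_bytes // max(min_partitions, 1)
    split_size = max(1, min(goal_size, block_size))

    splits = []
    remaining = total_bytes
    while remaining / split_size > split_slop:
        splits.append(split_size)
        remaining -= split_size
    if remaining:
        splits.append(remaining)  # whatever is left becomes the final split
    return splits


# Hypothetical file sizes in bytes
print(len(model_splits(1000, 5)))  # 5
print(len(model_splits(100, 7)))   # 8
```

In the second call the integer division leaves a small remainder, so the model produces one more split than requested. This is why the argument is described as a minimum.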


In [4]:
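# Without numSlices, parallelize() falls back to the default parallelism
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(rdd1.count())             # number of elements
print(rdd1.getNumPartitions())  # number of partitions
print(rdd1.glom().collect())    # elements grouped by partition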

10
6
[[1], [2, 3], [4, 5], [6], [7, 8], [9, 10]]


In [8]:
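# Request exactly 5 partitions with numSlices
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], numSlices=5)

print(rdd1.count())
print(rdd1.getNumPartitions())
# glom() is a lazy transformation, so printing it shows the RDD object, not its contents
print(rdd1.glom())
# collect() triggers the computation and returns the elements grouped by partition
print(rdd1.glom().collect())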

10
5
PythonRDD[19] at RDD at PythonRDD.scala:53
[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
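
The two partition layouts printed above are consistent with cutting the list at evenly spaced index boundaries: partition `i` of `k` covers positions `i*n//k` up to `(i+1)*n//k`. The pure-Python sketch below illustrates that rule without needing a Spark session. The helper name `split_evenly` is hypothetical, and the code is a model of the observed behavior for this small list rather than PySpark's actual implementation.

```python
def split_evenly(items, num_slices):
    """Cut items into num_slices contiguous chunks at evenly spaced index boundaries."""
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


data = list(range(1, 11))

print(split_evenly(data, 6))  # [[1], [2, 3], [4, 5], [6], [7, 8], [9, 10]]
print(split_evenly(data, 5))  # [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
```

With 6 slices, 10 elements cannot be divided evenly, so some partitions hold one element and others hold two. With 5 slices every partition holds exactly two.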
