# PySpark Cookbook: RDDs
## Tomasz Drabas, Denny Lee

#### Version: 0.1

#### 2018-07-01 (Happy Canada Day!)

This notebook is in support of [PySpark Cookbook](): Chapter 2 on RDDs.

## Creating RDDs

In [3]:
myRDD = sc.parallelize(
    [('Amber', 22), ('Alfred', 23), ('Skye', 4), ('Albert', 12), ('Amber', 9)]
)

In [4]:
myRDD.take(5)

## Reading data from files

In [6]:
myRDD = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt')

In [7]:
myRDD.take(5)

In [8]:
myRDD.count()

In [9]:
myRDD = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt').map(lambda line: line.split("\t"))

In [10]:
myRDD.getNumPartitions()

In [11]:
myRDD = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt', minPartitions=4, use_unicode=True).map(lambda line: line.split("\t"))

In [12]:
myRDD.take(5)

In [13]:
myRDD.getNumPartitions()

In [14]:
myRDD = sc.textFile('/databricks-datasets/flights/departuredelays.csv').map(lambda line: line.split(","))
myRDD.count()

In [15]:
myRDD = sc.textFile('/databricks-datasets/flights/departuredelays.csv', minPartitions=8).map(lambda line: line.split(","))
myRDD.count()

In [16]:
myRDD.take(5)

In [17]:
myRDD.getNumPartitions()

#### *Using DataFrame*
Note that the DataFrame read is faster here (2.44s for the DataFrame vs. 2.96s for the RDD with 8 partitions). The DataFrame reader also accounts for the header row and can infer the schema.

In [19]:
myDF = spark.read.csv('/databricks-datasets/flights/departuredelays.csv', header=True, inferSchema=True)
myDF.count()

In [20]:
myDF.show()

In [21]:
myDF.rdd.getNumPartitions()

In [22]:
myDF.printSchema()

## RDD Transformations

#### Getting Ready

In [25]:
airports = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt').map(lambda line: line.split("\t"))
airports.take(5)

In [26]:
flights = sc.textFile('/databricks-datasets/flights/departuredelays.csv').map(lambda line: line.split(","))
flights.take(5)

#### map()

In [28]:
airports.map(lambda c: (c[0], c[1])).take(5)

#### filter()

In [30]:
airports.map(lambda c: (c[0], c[1])).filter(lambda c: c[1] == "WA").take(5)


#### flatMap()

In [32]:
airports.filter(lambda c: c[1] == "WA").map(lambda c: (c[0], c[1])).flatMap(lambda x: x).take(10)
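
To make the difference between `map()` and `flatMap()` concrete, here is a plain-Python sketch (no Spark required) that mimics both on two made-up (city, state) tuples. The sample rows are hypothetical and only stand in for the filtered `airports` RDD.

```python
from itertools import chain

# Hypothetical (city, state) pairs standing in for the filtered airports RDD
rows = [("Bellingham", "WA"), ("Moses Lake", "WA")]

# map(): exactly one output element per input element
mapped = [(c[0], c[1]) for c in rows]

# flatMap(lambda x: x): each tuple is iterated and its items are
# concatenated into a single flat sequence
flattened = list(chain.from_iterable(mapped))

print(mapped)
print(flattened)
```

This is why `take(10)` in the cell above returns alternating city and state strings rather than tuples.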

#### distinct()

In [34]:
airports.map(lambda c: c[2]).distinct().take(5)

#### sample()

In [36]:
flights.map(lambda c: c[3]).sample(False, 0.001, 123).take(5)

#### leftOuterJoin()

In [38]:
flights.map(lambda c: (c[3], c[0])).take(5)

In [39]:
flights.take(5)

In [40]:
airports.map(lambda c: (c[3], c[1])).take(5)

In [41]:
flt = flights.map(lambda c: (c[3], c[0]))
air = airports.map(lambda c: (c[3], c[1]))
flt.join(air).take(5)

In [42]:
flt = flights.map(lambda c: (c[3], c[0]))
air = airports.map(lambda c: (c[3], c[1]))
# leftOuterJoin() keeps every flight, with None where no airport matches
flt.leftOuterJoin(air).take(5)
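
On pair RDDs, `join()` is an inner join, while `leftOuterJoin()` keeps every key from the left RDD and fills in `None` where the right side has no match. The plain-Python sketch below mimics both on hypothetical data; the helper names (`build_lookup`, `inner_join`, `left_outer_join`) and the sample values are illustrative assumptions, and Spark does not guarantee the output order shown here.

```python
# Hypothetical (key, value) pairs: flights keyed by origin, airports keyed by IATA code
flt = [("SEA", "01011245"), ("XXX", "01020600")]
air = [("SEA", "WA"), ("YVR", "BC")]

def build_lookup(pairs):
    """Group the right-hand values by key."""
    lookup = {}
    for key, value in pairs:
        lookup.setdefault(key, []).append(value)
    return lookup

def inner_join(left, right):
    """Emit (key, (left_value, right_value)) only for keys present on both sides."""
    lookup = build_lookup(right)
    return [(k, (lv, rv)) for k, lv in left for rv in lookup.get(k, [])]

def left_outer_join(left, right):
    """Like inner_join, but unmatched left keys are kept with None on the right."""
    lookup = build_lookup(right)
    return [(k, (lv, rv)) for k, lv in left for rv in lookup.get(k, [None])]

print(inner_join(flt, air))
print(left_outer_join(flt, air))
```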

#### repartition()

In [44]:
flights.getNumPartitions()

In [45]:
flights2 = flights.repartition(8)
flights2.getNumPartitions()

In [46]:
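# mapPartitionsWithIndex
# Scala one-liner that drops the header row (first row of partition 0):
#   flights.mapPartitionsWithIndex{ (idx, iter) => if (idx == 0) iter.drop(1) else iter }
rdd = sc.parallelize([1, 2, 3, 4], 4)

# Emit only the partition index, once per partition
def f(splitIndex, iterator):
    yield splitIndex

rdd.mapPartitionsWithIndex(f).sum()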
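
As a rough illustration of what the cell above computes, this plain-Python sketch (no Spark required) models the four partitions as a list of lists and calls the function once per partition. The `partitions` layout is an assumption about how four elements are spread over four slices; the sum depends only on the number of partitions.

```python
# Four elements over four partitions, mimicking sc.parallelize([1, 2, 3, 4], 4)
partitions = [[1], [2], [3], [4]]

def f(split_index, iterator):
    # Ignores the partition's data and emits only the partition index
    yield split_index

# Call f once per partition and concatenate whatever each call yields
emitted = [out for idx, part in enumerate(partitions) for out in f(idx, iter(part))]
total = sum(emitted)

print(emitted, total)
```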


#### zipWithIndex()

In [48]:
# View each row within RDD + the index 
# i.e. output is in form ([row], idx)
ac = airports.map(lambda c: (c[0], c[3]))
ac.zipWithIndex().take(5)

In [49]:
# zipWithIndex
#   Skip header row by 
#   - filter out row 0
#   - extract only row info
ac.zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])\
  .take(5)
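
The header-skipping pattern can be traced step by step in plain Python (no Spark required). The rows below are hypothetical; `enumerate` plays the role of `zipWithIndex()`.

```python
# Hypothetical rows; the first one plays the role of the file's header
rows = [["City", "IATA"], ["Abbotsford", "YXX"], ["Aberdeen", "ABR"]]

# zipWithIndex(): pair every row with its position -> (row, idx)
indexed = [(row, idx) for idx, row in enumerate(rows)]

# Drop index 0 (the header), then keep only the row part of each pair
data = [pair[0] for pair in indexed if pair[1] > 0]

print(data)
```

Indexing into the pair (`pair[0]`, `pair[1]`) instead of unpacking it in the lambda signature keeps the code valid on Python 3, where tuple parameters in lambdas are a syntax error.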

#### sortByKey()

In [51]:
# Remove the header row, then take the origin code and delay,
# sum the delays per origin code via reduceByKey(),
# and sort by the key (origin code)
flights.zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])\
  .map(lambda c: (c[3], int(c[1])))\
  .reduceByKey(lambda x, y: x + y)\
  .sortByKey()\
  .take(50)
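
For intuition, the `reduceByKey()` plus `sortByKey()` combination behaves like the following plain-Python sketch (no Spark required). The (origin, delay) pairs are made up for illustration.

```python
# Hypothetical (origin, delay) pairs
rows = [("SEA", 10), ("ABE", -3), ("SEA", 5), ("ABE", 4)]

# reduceByKey(lambda x, y: x + y): combine all values that share a key
totals = {}
for origin, delay in rows:
    totals[origin] = totals.get(origin, 0) + delay

# sortByKey(): order the (key, value) pairs by key
result = sorted(totals.items())

print(result)
```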

#### union()

In [52]:
# Create `a` RDD of Washington airports
a = airports.zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])\
  .filter(lambda c: c[1] == "WA")

# Create `b` RDD of British Columbia airports
b = airports.zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])\
  .filter(lambda c: c[1] == "BC")

# Union WA and BC airports
a.union(b).take(50)

#### intersection()

In [54]:
# Create first RDD: IATA codes of Washington airports
a = airports.zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])\
  .filter(lambda c: c[1] == "WA")\
  .map(lambda c: c[3])

In [55]:
flights.zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])\
  .take(50)

In [56]:
flights.zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])\
  .filter(lambda c: c[1] == '0')\
  .take(50)

In [57]:
a.take(50)

In [58]:
flights.take(10)
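
The cells in this section prepare the inputs but stop short of calling `intersection()` itself. In PySpark, `rdd.intersection(other)` returns the elements present in both RDDs with duplicates removed. The plain-Python sketch below (no Spark required) shows the same semantics on hypothetical airport codes; which second RDD the notebook meant to intersect with `a` is not stated, so the flight origins here are an assumption.

```python
# Hypothetical inputs: WA airport codes and flight origin codes (with repeats)
wa_codes = ["SEA", "GEG", "BLI"]
flight_origins = ["SEA", "SFO", "SEA", "GEG"]

# intersection(): elements found in both collections, de-duplicated.
# Sorted here only to make the output deterministic.
common = sorted(set(wa_codes) & set(flight_origins))

print(common)
```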

## RDD Actions

This section uses the same `airports` and `flights` RDDs created in Getting Ready under RDD Transformations.

In [60]:
# take(n)
airports.take(3)

In [61]:
# collect()
airports.filter(lambda c: c[1] == "WA").collect()

In [62]:
# reduce(f)
flights\
   .filter(lambda c: c[3] == 'SEA' and c[4] == 'SFO')\
   .map(lambda c: int(c[1]))\
   .reduce(lambda x, y: x + y)

In [63]:
flights.take(5)

In [64]:
# reduceByKey
#   Determine delays by originating city
flights.zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])\
  .map(lambda c: (c[3], int(c[1])))\
  .reduceByKey(lambda x, y: x + y)\
  .take(5)

In [65]:
# count
flights.zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])\
  .count()

In [66]:
# saveAsTextFile
airports.saveAsTextFile("/tmp/denny/airports")

In [67]:
%fs ls /tmp/denny/airports/

## Pitfalls of using RDDs

In [69]:
## Getting Ready
flights = sc.textFile('/databricks-datasets/flights/departuredelays.csv')\
  .map(lambda line: line.split(","))\
  .zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])

In [70]:
flights.take(5)

In [71]:
flightsDF = spark.read.options(header='true', inferSchema='true').csv('/databricks-datasets/flights/departuredelays.csv')
flightsDF.createOrReplaceTempView("flightsDF")

In [72]:
flightsDF.show(5)

In [73]:
# How to do it
flights.map(lambda c: (c[3], int(c[1]))).reduceByKey(lambda x, y: x + y).sortByKey().take(50)

In [74]:
spark.sql("select origin, sum(delay) as TotalDelay from flightsDF group by origin order by origin").show(50)

In [75]:
## Getting Ready
flights = sc.textFile('/databricks-datasets/flights/departuredelays.csv', minPartitions=8)\
  .map(lambda line: line.split(","))\
  .zipWithIndex()\
  .filter(lambda pair: pair[1] > 0)\
  .map(lambda pair: pair[0])

In [76]:
flights.count()

In [77]:
flights.getNumPartitions()

In [78]:
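# Source: https://stackoverflow.com/a/38957067/1100699
def count_in_a_partition(idx, iterator):
  count = 0
  for _ in iterator:
    count += 1
  # Spark iterates over the returned tuple, so collect() produces a
  # flat list: [idx0, count0, idx1, count1, ...]
  return idx, count


flights.mapPartitionsWithIndex(count_in_a_partition).collect()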
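
Because the function passed to `mapPartitionsWithIndex()` must return an iterable, returning a tuple and yielding a tuple give differently shaped results. The plain-Python sketch below (no Spark required) contrasts the two; the `run` helper and the sample partitions are hypothetical stand-ins for Spark's per-partition execution.

```python
from itertools import chain

# Hypothetical partitions holding three rows and one row
partitions = [["a", "b", "c"], ["d"]]

def count_returning_tuple(idx, iterator):
    count = sum(1 for _ in iterator)
    return idx, count   # a tuple, which is itself iterable

def count_yielding_pair(idx, iterator):
    count = sum(1 for _ in iterator)
    yield idx, count    # generator producing one (idx, count) element

def run(func):
    # Call func once per partition and chain the resulting iterables
    return list(chain.from_iterable(func(i, iter(p)) for i, p in enumerate(partitions)))

flat = run(count_returning_tuple)
pairs = run(count_yielding_pair)

print(flat)
print(pairs)
```

With `return`, indices and counts alternate in one flat list; with `yield`, each partition contributes a single `(idx, count)` pair, which may be easier to read when checking for skewed partitions.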

In [79]:
# How to do it
flights.map(lambda c: (c[3], int(c[1]))).reduceByKey(lambda x, y: x + y).sortByKey().take(50)