## Coverage
1. Writing a file from the local file system to Hadoop
2. Starting a Spark session
3. Creating RDDs
  - RDD data types
  - Reading from the local file system or HDFS (not working in Jupyter)
  - From other RDDs using transformations
    - Using Python functional programming style to create RDD transformations (map, filter, reduce)
    - map, reduce, filter in Python
4. Checking lineage (toDebugString)
5. Memory Management
6. Additional transformations on RDDs based on types
  - paired RDDs
  - Other useful transformations
7. Paired RDD operations and case study
8. Memory Management: Persistence, Caching, Serialization
9. Optimization in Spark

#### 1. Writing a file from the local file system to Hadoop

In [None]:
!pwd
! touch purple_cow.txt
!ls
with open('purple_cow.txt', 'w') as con:
    con.write("""I've never seen a purple cow
    I never hope to see one;
    But, I can tell you, anyhow,
    I'd rather see than be one""")

!cat purple_cow.txt
#! hadoop fs -ls fractalUS
! hadoop fs -put purple_cow.txt fractalUS/
!hadoop fs -cat fractalUS/purple_cow.txt

#### 2. Starting a Spark session and accessing the Spark context, which is used to create and access RDDs

Spark 2.3.1 uses SparkSession as the entry point instead of SparkContext.
A SparkSession covers the SQL context (for DataFrames) and the Hive context, and also exposes the
SparkContext, which is used for RDDs.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('rdds').getOrCreate()
sc = spark.sparkContext

#from pyspark import SparkContext
#sc = SparkContext(appName= 'rdds')
#rdd = sc.textFile('/home/sumad/purple_cow.txt')
#rdd.take(2)

spark.version

#### 3. Creating RDDs
- Types of RDDs
- Creating RDDs from files, collections, etc.
- Creating RDDs with common transformations: map, filter, reduce
  - Use of functional programming:
    - passing functions to functions, and use of anonymous functions (lambdas)
    - RDD methods support functional programming, i.e. they take functions as input and apply them
    to each element of the data (each line, for a text file)
- Viewing the DAG of transformations
- DAGs:
   - enable lazy evaluation: transformations are only recorded, and nothing is computed until an action is called
   - allow lost data to be recomputed from the point of failure instead of from the start
   - let transformations build RDDs in memory (unlike MapReduce, which writes intermediate results to disk)
- Row-wise transformations where possible
  - Where possible, transformations happen row by row, i.e. a row is taken through all transformations up to an
  action, instead of a single transformation being applied to all blocks at once
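
The lazy, row-by-row behaviour described above can be imitated in plain Python with generators. This is only an analogy and does not use Spark: the helper names `to_upper` and `starts_with_i`, the sample lines, and the `log` list are all made up for illustration. Building the pipeline plays the role of chaining transformations, and `list(...)` plays the role of an action.

```python
# Plain-Python analogy (no Spark involved): generators are lazy, like RDD transformations.
log = []  # records the order in which the steps actually run


def to_upper(lines):
    """Stand-in for rdd.map(lambda x: x.upper())."""
    for line in lines:
        log.append(('upper', line))
        yield line.upper()


def starts_with_i(lines):
    """Stand-in for rdd.filter(lambda x: x.strip().startswith('I'))."""
    for line in lines:
        log.append(('filter', line))
        if line.strip().startswith('I'):
            yield line


lines = ["I never hope", "but anyhow"]  # hypothetical sample records

# Chaining the steps does no work yet (lazy evaluation).
pipeline = starts_with_i(to_upper(lines))
steps_before_action = len(log)

# Consuming the pipeline acts like an action and triggers the computation.
result = list(pipeline)

print(steps_before_action)   # number of steps run before the "action"
print(result)                # lines that survived the filter
print([step for step, _ in log])  # order in which steps ran
```

The logged order alternates between the two steps, which shows each record passing through the whole chain before the next record is touched, rather than one step finishing over all records first.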

#### 3.1 Types of RDD
An RDD can hold elements of any type:
- Primitive types (integer, character, etc.)
- Sequence RDDs (from dicts, lists, tuples)
- Mixed data types
- Pair RDDs (support special transformations)
- Double RDDs (support numeric transformations)

#### 3.2 Creating RDDs

In [None]:
# From Python collections
sc.parallelize([1, 2, 3, 4])

# From text files
## textFile only works with line-delimited files
## Each line in the text file becomes one record in the RDD
sc.textFile('dir/*.log')  # all files in dir ending with .log

# XML or JSON files
## These are not line-delimited, so each file is read whole
sc.wholeTextFiles('dir')  # pair RDD of (file path, file content), one element per file; check file sizes first

Other input and output formats are listed in the RDD API documentation:
https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.RDD

#### 3.3 From transformations

#### map, filter, reduce in Python
- Find the sum of squares of the even integers among the first 10 integers
- Functional programming: functions serve as the input and output units, and are chained

In [None]:
from functools import reduce

l = range(1,11,1)
f = filter(lambda x: x%2 ==0, l)
m = map(lambda x : x**2, f)
r = reduce(lambda x,y: x+ y, m)
print(r)

In [None]:
# Convert every line of the text file to upper case and count the lines that start with I; a word count follows in a later cell

rdd = sc.textFile('/user/sumad/fractalUS/purple_cow.txt')
# act on each line of text file, each line is a string 
# filter needs a function to operate on each line and return boolean
rdd2 = rdd.map(lambda x: x.upper()) \
   .filter(lambda x: x.strip().startswith('I')) 
rdd2.count()
#3

In [None]:
# print() renders the line breaks in the lineage string, which makes the DAG readable
## The output shows the RDDs created, starting from the text file
print(rdd2.toDebugString())
"""
(2) PythonRDD[4] at RDD at PythonRDD.scala:49 []
 |  /user/sumad/fractalUS/purple_cow.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
 |  /user/sumad/fractalUS/purple_cow.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []
 """

In [None]:
# Word count
rdd3 = (rdd.map(lambda x: x.strip())
           .flatMap(lambda x: x.split(' '))
           .map(lambda x: x.upper())
           .map(lambda x: (x, 1))
           .reduceByKey(lambda x, y: x + y))
rdd3.collect()
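
As a cross-check on what the word count should produce, the same steps can be sketched in plain Python on the poem text from section 1. This is an illustration only and does not use Spark: `chain.from_iterable` stands in for flatMap and `Counter` stands in for the `(word, 1)` pairing plus reduceByKey. The variable names are made up for this sketch.

```python
from collections import Counter
from itertools import chain

# Poem text from section 1 (everything is upper-cased below, so letter case does not matter)
text = """I've never seen a purple cow
    I never hope to see one;
    But, I can tell you, anyhow,
    I'd rather see than be one"""

# Like sc.textFile: one record per line
lines = text.split('\n')

# Like map(strip) followed by flatMap(split): flatten the per-line word lists into one stream
words = chain.from_iterable(line.strip().split(' ') for line in lines)

# Like map(upper), map to (word, 1), then reduceByKey(add)
counts = Counter(word.upper() for word in words)

print(counts['NEVER'], counts['SEE'], counts['I'])
print(counts['ONE'], counts['ONE;'])
```

Because the split is on single spaces only, punctuation stays attached to the words, so `ONE` and `ONE;` are counted as different keys. The Spark version behaves the same way.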

#### 4. Checking lineage 
- The lineage of the word count shows a shuffle operation (ShuffledRDD), caused by reduceByKey

In [None]:
print(rdd3.toDebugString())
"""
(2) PythonRDD[16] at collect at <stdin>:1 []
 |  MapPartitionsRDD[15] at mapPartitions at PythonRDD.scala:129 []
 |  ShuffledRDD[14] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(2) PairwiseRDD[13] at reduceByKey at <stdin>:2 []
    |  PythonRDD[12] at reduceByKey at <stdin>:2 []
    |  /user/sumad/fractalUS/purple_cow.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
    |  /user/sumad/fractalUS/purple_cow.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []
""";

#### 5. Memory Management 
- RDDs are created in RAM. What happens when memory is used up?
  - RDDs are evicted from memory by an LRU (least recently used) rule
  - Persisting RDDs affects this rule, to be covered later in detail.
  - **Also, if RDDs are not persisted, they are cleaned from memory after an action is called, and are recomputed from their lineage if they are needed again**
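
The LRU rule itself can be sketched with a toy cache in plain Python. This shows only the eviction policy and is not how Spark's memory manager is implemented: the `LRUCache` class, its `capacity` counted in entries, and the `rdd_a`/`rdd_b`/`rdd_c` keys are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache holding at most `capacity` entries; evicts the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest use first, most recent use last
        self.evicted = []             # keys removed so far, in eviction order

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            old_key, _ = self.entries.popitem(last=False)  # drop the least recently used
            self.evicted.append(old_key)


cache = LRUCache(capacity=2)
cache.put('rdd_a', [1, 2])
cache.put('rdd_b', [3, 4])
cache.get('rdd_a')           # rdd_a becomes the most recently used entry
cache.put('rdd_c', [5, 6])   # over capacity, so the least recently used entry is evicted

print(cache.evicted)
print(list(cache.entries))
```

Here `rdd_b` is the entry that gets dropped, because `rdd_a` was read after it was stored.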

#### 6. Additional transformations (key ones)
- List of all
  - https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.RDD

- Operate on a single RDD
  - flatMap : applies a function that returns a sequence for each RDD element, then flattens all the sequences, so
    it returns a single RDD in which every item of those sequences is its own element.
  - map, filter, reduce
  - foreach, distinct, top(n), first
  - min, max, mean, stdev
  - sample, randomSplit
- Operate on two RDDs
  - zip, intersection, union, subtract
  - join
- Operate on paired RDDs
  - countByKey
  - groupByKey, reduceByKey, aggregateByKey : the latter two perform better than the first for grouping and aggregating, because they combine values within each partition before the shuffle
- Return counts per unique element
  - countByValue : returns each unique element and its count (to the driver as a dictionary, not as an RDD)
- Explicit repartitioning or operations by partition
  - mapPartitions : applies a function once per partition rather than once per element
  - glom : combines the elements of each partition into a list, giving one list per partition
  - coalesce : returns an RDD reduced to the specified number of partitions
  - foreachPartition
  - getNumPartitions
  - partitionBy, repartition (coalesce can avoid a shuffle operation)
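
The performance difference between groupByKey and reduceByKey comes from how many records have to cross the shuffle. A minimal plain-Python sketch of that idea follows. It does not use Spark, the two-partition sample data is invented, and the function names `shuffle_group_by_key` and `shuffle_reduce_by_key` are hypothetical. Each function returns the per-key sums together with the number of records that would be shuffled.

```python
from collections import defaultdict

# Hypothetical data: two partitions of (key, value) pairs
partitions = [
    [('a', 1), ('b', 1), ('a', 1)],
    [('a', 1), ('b', 1), ('b', 1), ('b', 1)],
]


def shuffle_group_by_key(parts):
    """groupByKey style: every (key, value) record crosses the shuffle, then values are summed."""
    shuffled = [kv for part in parts for kv in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}, len(shuffled)


def shuffle_reduce_by_key(parts):
    """reduceByKey style: combine within each partition first, then shuffle only partial sums."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # combine inside the partition
        shuffled.extend(local.items())   # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_result, grouped_records = shuffle_group_by_key(partitions)
reduced_result, reduced_records = shuffle_reduce_by_key(partitions)

print(grouped_result, grouped_records)
print(reduced_result, reduced_records)
```

Both approaches give the same totals, but the second moves fewer records between partitions. The saving grows with the number of repeated keys per partition.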

#### 7. Paired RDD operations and case study

#### 8. Memory Management - Persistence, Caching, Serialization

#### 9. Optimization in Spark