# Spark Fundamentals

Basic RDDs: don't panic, these tables are just a reference for later use
======================================

Common RDD Constructors
-----------------------

Expression                               |Meaning
----------                               |-------
`sc.parallelize(iterable)`               |Create RDD of elements of some iterable
`sc.textFile(path)`                      |Create RDD of lines from file

Common Transformations
----------------------

Expression                               |Meaning
----------                               |-------
`filter(boolean condition)`              |Keep only the elements for which the condition is True
`map(some function)`                     |Apply some function to every element
`flatMap(some function)`                 |Apply some function that returns an iterable to every element and flatten the results
`sample(withReplacement, fraction)`      |Sample roughly the given fraction of the data, with or without replacement
`distinct()`                             |Remove duplicates in RDD
`sortBy(key function, ascending=True)`   |Sort elements by the key the function returns, in the designated order
`randomSplit([ratio1, ratio2], seed)`    |Split the data into two RDDs, sized roughly in proportion to the ratio array

Common Key Pair RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`groupByKey()`                           |Collapse a key value RDD by the key, and keep the values of each key in an iterable
`reduceByKey(some function)`             |Collapse a key value RDD by the key, and combine the values of each key with some function
`mapValues(some function)`               |Apply some function to each value of a key value RDD, leaving the keys unchanged
`flatMapValues(some function)`           |Apply some function that returns an iterable to each value, and emit one (key, item) pair per item
`keys()`                                 |Returns the keys of a key value RDD
`values()`                               |Returns the values of a key value RDD

Common Multiple RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`union(another rdd)`                     |Append another RDD to current RDD (duplicates are kept)
`join(another rdd)`                      |Inner join: keep only the keys present in both RDDs
`leftOuterJoin(another rdd)`             |Keep every key of the current RDD; the other RDD's value is None where it has no matching key
`rightOuterJoin(another rdd)`            |Keep every key of the other RDD; the current RDD's value is None where it has no matching key
`zip(another rdd)`                       |Pair up the elements of two RDDs by position to form a key value pair RDD (both must have the same number of partitions and elements per partition)

Common Actions
--------------

Expression                             |Meaning
----------                             |-------
`collect()`                            |Convert RDD to in-memory list on the driver
`take(n)`                              |First n elements of RDD
`top(n)`                               |Largest n elements of RDD, in descending order
`takeSample(withReplacement, n)`       |Create a sample of n elements, with or without replacement
`sum()`                                |Find element sum (assumes numeric elements)
`mean()`                               |Find element mean (assumes numeric elements)
`stdev()`                              |Find element standard deviation (assumes numeric elements)
`takeOrdered(n, function)`             |Return the n smallest elements, ordered by the value the function returns

### Step 1: Import pyspark and create a SparkContext

In [1]:
import findspark
findspark.init('/usr/local/spark')
import pyspark as ps    # for the pyspark suite
import warnings         # for issuing warnings
from pyspark.sql import SQLContext

In [2]:
try:
    # try to create a SparkContext that runs locally with 4 worker threads
    # (use 'local[*]' to use all available cores)
    sc = ps.SparkContext('local[4]')
    sqlContext = SQLContext(sc)
    print("Just created a SparkContext")
except ValueError:
    # give a warning if SparkContext already exists (for use inside pyspark)
    warnings.warn("SparkContext already exists in this scope")

Just created a SparkContext


### Step 2: Inspect the Spark context (the entry point for creating RDDs)

In [3]:
sc

<pyspark.context.SparkContext at 0x10571bda0>

### Step 3: Construct an RDD from the data (we will be using churn.csv)

In [4]:
churn_rdd = sc.textFile('churn.csv')

In [5]:
churn_rdd.take(5)

["State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?",
 'KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.',
 'OH,107,415,371-7191,no,yes,26,161.600000,123,27.470000,195.500000,103,16.620000,254.400000,103,11.450000,13.700000,3,3.700000,1,False.',
 'NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,False.',
 'OH,84,408,375-9999,yes,no,0,299.400000,71,50.900000,61.900000,88,5.260000,196.900000,89,8.860000,6.600000,7,1.780000,2,False.']

### Step 4: Let's look at the first two lines to understand the format that textFile creates

In [6]:
churn_rdd.take(2)

["State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?",
 'KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.']

### Step 5: We need to split the data by commas.

In [7]:
churn_rdd = churn_rdd.map(lambda x: x.split(','))

In [8]:
churn_rdd.take(2)

[['State',
  'Account Length',
  'Area Code',
  'Phone',
  "Int'l Plan",
  'VMail Plan',
  'VMail Message',
  'Day Mins',
  'Day Calls',
  'Day Charge',
  'Eve Mins',
  'Eve Calls',
  'Eve Charge',
  'Night Mins',
  'Night Calls',
  'Night Charge',
  'Intl Mins',
  'Intl Calls',
  'Intl Charge',
  'CustServ Calls',
  'Churn?'],
 ['KS',
  '128',
  '415',
  '382-4657',
  'no',
  'yes',
  '25',
  '265.100000',
  '110',
  '45.070000',
  '197.400000',
  '99',
  '16.780000',
  '244.700000',
  '91',
  '11.010000',
  '10.000000',
  '3',
  '2.700000',
  '1',
  'False.']]
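
Splitting on `','` works for this file because no field contains a comma. As a hedged alternative for files with quoted fields, the sketch below uses Python's standard `csv` module; `parse_csv_line` is a hypothetical helper name and the second sample line is made up for illustration.

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into fields, honoring double-quoted fields."""
    # csv.reader accepts any iterable of strings, so wrap the line in a list
    return next(csv.reader([line]))

print(parse_csv_line('KS,128,415,382-4657,no,yes'))
# ['KS', '128', '415', '382-4657', 'no', 'yes']

# Made-up line: a plain split(',') would break the quoted name in two
print(parse_csv_line('"Smith, John",KS,128'))
# ['Smith, John', 'KS', '128']
```

The function could be passed to `map` in place of the lambda, for example `sc.textFile('churn.csv').map(parse_csv_line)`.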

### Step 6: Extract the headers

In [9]:
headers = churn_rdd.first() # first() returns the first element: here, the list of column names

In [10]:
headers

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']
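
The later steps address columns by position (`x[0]`, `x[-2]`, `x[15]`). One possible way to make that more readable is a name-to-index lookup built from the header list. This is a stand-alone sketch: `sample_headers`, `sample_row`, and `col` are illustrative names, and only the first nine columns are copied here.

```python
# First nine column names, copied from the header list above
sample_headers = ['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan",
                  'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls']

# Map each column name to its position in a row
col = {name: i for i, name in enumerate(sample_headers)}

# First nine fields of the first data row shown earlier
sample_row = ['KS', '128', '415', '382-4657', 'no', 'yes', '25', '265.100000', '110']

print(col['VMail Plan'])             # 5
print(sample_row[col['Day Calls']])  # 110
```

In the notebook the same idea would be `col = {name: i for i, name in enumerate(headers)}`, after which a lambda can use `x[col['Night Charge']]` instead of `x[15]`.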

### Step 7: Remove the header from the data

In [11]:
churn_rdd = churn_rdd.filter(lambda x: x != headers)

In [12]:
churn_rdd.first()

['KS',
 '128',
 '415',
 '382-4657',
 'no',
 'yes',
 '25',
 '265.100000',
 '110',
 '45.070000',
 '197.400000',
 '99',
 '16.780000',
 '244.700000',
 '91',
 '11.010000',
 '10.000000',
 '3',
 '2.700000',
 '1',
 'False.']

### Step 8: Finding total churn

In [13]:
(
    churn_rdd.map(lambda x: x[-1] != 'False.')
             .sum()
)

483

In [14]:
(
    churn_rdd.filter(lambda x: x[-1] != 'False.')
             .count()
)

483
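
Both versions give the same count because Python's `bool` is a subclass of `int`, so `True` adds as 1 and `False` as 0. A small plain-Python illustration follows; `is_churn` is a hypothetical helper and the labels are made up.

```python
def is_churn(label):
    """Return True for a churned customer; the file marks these 'True.'."""
    return label.strip() == 'True.'

labels = ['False.', 'True.', 'False.', 'True.', 'True.']  # made-up sample
flags = [is_churn(label) for label in labels]

print(flags)        # [False, True, False, True, True]
print(sum(flags))   # 3
print(True + True)  # 2
```

Note that `is_churn` tests for `'True.'`, while the cells above test for anything other than `'False.'`; the two agree only if every label is one of those two strings.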

### Step 9: Finding total churn per State

In [15]:
(
    churn_rdd.map(lambda x: (x[0], x[-1] != 'False.'))
             .reduceByKey(lambda x, y: x + y)
             .take(5)
)

[('KS', 13), ('OH', 10), ('MO', 7), ('LA', 4), ('WV', 10)]
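
To see what `reduceByKey` does with these pairs, here is a plain-Python model of its result (not Spark's implementation, which combines values within and across partitions). `reduce_by_key` is a hypothetical name and the pairs are made up.

```python
def reduce_by_key(pairs, func):
    """Plain-Python model of reduceByKey: fold the values of each key with func."""
    result = {}
    for key, value in pairs:
        # The first value of a key is stored as-is; later ones are folded in
        result[key] = func(result[key], value) if key in result else value
    return result

pairs = [('KS', False), ('OH', True), ('KS', True), ('OH', True), ('KS', True)]
print(reduce_by_key(pairs, lambda x, y: x + y))
# {'KS': 2, 'OH': 2}
```

Because Spark merges partial results in no fixed order, the function passed to `reduceByKey` should be associative and commutative, as addition is. A key with a single value is never passed through the function, so its value stays a `bool` rather than becoming an `int`.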

### Step 10: Finding total Customer Service Calls per churned customer for each State

#### Tuple of tuples

In [16]:
(
    churn_rdd.map(lambda x: (x[0], (int(x[-2]), x[-1] != 'False.')))
             .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
             .mapValues(lambda x: 1. * x[0] / x[1])
             .take(2)
)

[('KS', 7.461538461538462), ('OH', 10.6)]
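
The lambdas in this step can also be written as named functions, which makes the tuple arithmetic easier to follow and leaves room for a guard: the division would raise `ZeroDivisionError` for any state with no churned customers. The sketch below is plain Python with made-up numbers; `add_pairs` and `calls_per_churn` are hypothetical names.

```python
from functools import reduce

def add_pairs(a, b):
    """Element-wise sum of two (calls, churned) pairs."""
    return (a[0] + b[0], a[1] + b[1])

def calls_per_churn(totals):
    """Total calls divided by churned count; None when nobody churned."""
    calls, churned = totals
    return calls / churned if churned else None

rows = [(1, False), (4, True), (2, True)]  # made-up (calls, churned) rows for one state
totals = reduce(add_pairs, rows)

print(totals)                   # (7, 2)
print(calls_per_churn(totals))  # 3.5
print(calls_per_churn((5, 0)))  # None
```

In the pipeline these would slot in as `.reduceByKey(add_pairs).mapValues(calls_per_churn)`.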

#### Joins

In [17]:
# total customer service calls per state
cust_rdd = (
    churn_rdd.map(lambda x: (x[0], int(x[-2])))
             .reduceByKey(lambda x, y: x + y)
)


In [18]:
# number of churned customers per state
state_rdd = (
    churn_rdd.map(lambda x: (x[0], x[-1] != 'False.'))
             .reduceByKey(lambda x, y: x + y)
)

In [19]:
(
    state_rdd.join(cust_rdd)
             .mapValues(lambda x: 1. * x[1] / x[0])
             .take(2)
)

[('KS', 7.461538461538462), ('OH', 10.6)]
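
After `join`, each value is a tuple `(value from the left RDD, value from the right RDD)`, which is why the lambda divides `x[1]` (calls) by `x[0]` (churn count). The plain-Python model below illustrates the shape of the result for pair lists with unique keys, as is the case after `reduceByKey`. The function names are hypothetical, the KS and OH numbers are back-computed from the outputs above, and `'ZZ'` is a made-up key.

```python
def inner_join(left, right):
    """Model of join: keep keys present on both sides (unique keys assumed)."""
    right_map = dict(right)
    return [(k, (v, right_map[k])) for k, v in left if k in right_map]

def left_outer_join(left, right):
    """Model of leftOuterJoin: keep every left key, None where right has no match."""
    right_map = dict(right)
    return [(k, (v, right_map.get(k))) for k, v in left]

churn_counts = [('KS', 13), ('OH', 10)]
call_totals = [('KS', 97), ('OH', 106), ('ZZ', 5)]

joined = inner_join(churn_counts, call_totals)
print(joined)  # [('KS', (13, 97)), ('OH', (10, 106))]

for state, (n_churned, n_calls) in joined:
    print(state, round(n_calls / n_churned, 2))
# KS 7.46
# OH 10.6

print(left_outer_join(call_totals, churn_counts)[-1])  # ('ZZ', (5, None))
```

With duplicate keys, Spark's joins emit every combination of matching values, which this model does not cover.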

### Caching RDDs to keep them in memory across actions

In [20]:
# persist() is lazy: the RDD is cached the first time an action computes it
cached_churn_rdd = churn_rdd.persist()

### Practice #1: What's the min, mean, and max night charge for users that churned?

In [21]:
headers[-1]

'Churn?'

In [22]:
headers.index('Night Charge')

15

In [23]:
churn_rdd.map(lambda x: x[-1] == 'True.').take(20)

[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False]

In [24]:
churned = churn_rdd.filter(lambda x: x[-1] == 'True.')

In [25]:
churned.map(lambda x: float(x[15])).mean()

9.23552795031056

In [26]:
churned.map(lambda x: float(x[15])).max()

15.97

In [27]:
churned.map(lambda x: float(x[15])).min()

2.13
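
The three cells above each launch a separate job over `churned`. The same numbers can be gathered in one pass with `aggregate(zeroValue, seqOp, combOp)`, or with the built-in `stats()` action. The sketch below shows one possible pair of functions for `aggregate` and simulates two partitions in plain Python; the names and the sample charges are made up.

```python
from functools import reduce

ZERO = (float('inf'), 0.0, 0, float('-inf'))  # (min, sum, count, max)

def seq_op(acc, x):
    """Fold one value into a (min, sum, count, max) accumulator."""
    return (min(acc[0], x), acc[1] + x, acc[2] + 1, max(acc[3], x))

def comb_op(a, b):
    """Merge two accumulators, e.g. the results of two partitions."""
    return (min(a[0], b[0]), a[1] + b[1], a[2] + b[2], max(a[3], b[3]))

# Simulate two partitions of made-up night charges
part1 = reduce(seq_op, [2.0, 9.5], ZERO)
part2 = reduce(seq_op, [16.0, 4.5], ZERO)

lo, total, n, hi = comb_op(part1, part2)
print(lo, total / n, hi)  # 2.0 8.0 16.0
```

On the RDD this would read `churned.map(lambda x: float(x[15])).aggregate(ZERO, seq_op, comb_op)`.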

### Practice #2: How many of the churned users have a VMail plan?

In [28]:
headers.index('VMail Plan')

5

In [29]:
churned.map(lambda x: x[5]).take(5)

['no', 'no', 'no', 'no', 'yes']

In [30]:
churned.filter(lambda x: x[5] == 'yes').count()

80

### Practice #3: Which state has the most day calls?

In [31]:
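(
    # Day Calls is column 8 (headers.index('Day Calls')): sum it per state,
    # then pick the pair with the largest total; max() without a key would
    # compare the tuples by state name first
    churn_rdd.map(lambda x: (x[0], int(x[8])))
             .reduceByKey(lambda x, y: x + y)
             .max(key=lambda kv: kv[1])
)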
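
Python compares tuples element by element, so calling `max()` on `(state, total)` pairs with no key ranks them by state name, not by total. A plain-Python illustration with made-up totals (PySpark's `RDD.max` accepts the same `key` argument):

```python
totals = [('AL', 300), ('TX', 120), ('WY', 77)]  # made-up (state, total) pairs

# No key: tuples are compared by their first element, the state name
print(max(totals))                        # ('WY', 77)

# Key on the second element: the pair with the largest total wins
print(max(totals, key=lambda kv: kv[1]))  # ('AL', 300)
```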