## Basic RDD actions and transformations

PySpark is the Python API for Apache Spark. It lets Python code work with a Spark cluster.

In [1]:
# relevant imports
from pyspark import SparkContext

### Create a SparkContext
- A SparkContext represents the connection to the Spark cluster
- It is used to create RDDs
- It is used to share variables (broadcast variables and accumulators) with the cluster
- Only one SparkContext can be active at any given time

In [2]:
# create the SparkContext
sc = SparkContext()

#### Create a simple text file to show some basic actions and transformations

In [3]:
%%writefile example.txt
first row
second row
third row
fourth row

Overwriting example.txt


### Make an RDD from a text file
Use .textFile() to read a plain text file into an RDD, with one element per line.

In [4]:
file = sc.textFile('example.txt')

In [5]:
# check the type of the object
type(file)

pyspark.rdd.RDD

### Make an RDD from an array/list

In [6]:
a_list = [1,2,3,4,5,6]

In [7]:
rdd_list = sc.parallelize(a_list)

In [8]:
type(rdd_list)

pyspark.rdd.RDD

### Call actions on the RDD

In [9]:
file.count()

4

In [10]:
file.first()

'first row'

In [11]:
file.take(2)

['first row', 'second row']

In [12]:
file.top(2)

['third row', 'second row']
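The result of .top(2) differs from .take(2) because .top() returns the largest elements in descending order rather than the first ones. For strings, "largest" means lexicographic order. A minimal plain-Python sketch of the same ordering, using `heapq.nlargest` purely as a stand-in for illustration (it is not Spark's implementation):

```python
import heapq

# Same lines as in example.txt
rows = ['first row', 'second row', 'third row', 'fourth row']

# take(2) analog: the first two elements in their original order
first_two = rows[:2]

# top(2) analog: the two largest elements, in descending order
# (strings compare lexicographically, so 't' > 's' > 'f')
largest_two = heapq.nlargest(2, rows)

print(first_two)
print(largest_two)
```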

In [13]:
file.collect()

['first row', 'second row', 'third row', 'fourth row']

In [14]:
rdd_list.count()

6

### Transformations
- rdd.map() = applies a function to each element and returns a new RDD with exactly one result per input element (similar to pandas' Series.apply())
- rdd.filter() = applies a predicate function to each element and returns a new RDD containing only the elements for which it returns True
- rdd.flatMap() = applies a function that returns 0 to N elements per input element and flattens the results, so the number of elements in the RDD can change
 

#### map()

In [15]:
file.map(lambda x: x.lower())

PythonRDD[8] at RDD at PythonRDD.scala:53

In [16]:
file.map(lambda x: x.lower()).collect()

['first row', 'second row', 'third row', 'fourth row']

Spark uses lazy evaluation: a transformation such as map() only records what should be computed and returns a new RDD without running anything. The computation is triggered only when an action such as .collect() is called.
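The same idea can be seen without Spark. The sketch below is only an analogy built on Python's built-in lazy `map`, not a description of how Spark schedules work. The helper `lower_and_log` and the `calls` list are hypothetical names introduced here to record when the function is actually applied.

```python
calls = []  # records every time the function is actually applied


def lower_and_log(line):
    """Lowercase a line and remember that we were called."""
    calls.append(line)
    return line.lower()


rows = ['First row', 'Second row']  # illustrative input

# Building the pipeline runs nothing yet (like calling .map() on an RDD)
pipeline = map(lower_and_log, rows)
calls_before = len(calls)

# Consuming the pipeline triggers the work (like calling .collect())
result = list(pipeline)
calls_after = len(calls)

print(calls_before, calls_after, result)
```

The function is not applied while the pipeline is being defined. It runs only once the results are requested.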

#### filter()

In [17]:
file.filter(lambda x: 'two' in x)

PythonRDD[10] at RDD at PythonRDD.scala:53

In [18]:
file.filter(lambda x: 'two' in x).collect()

[]
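The empty list is expected: none of the four lines contains the substring 'two'. A plain-Python sketch of the same predicate logic follows. The second predicate, 'second' in x, is chosen here only for illustration and is not part of the original notebook.

```python
rows = ['first row', 'second row', 'third row', 'fourth row']

# Same predicate as above: no line contains 'two', so nothing is kept
no_match = [x for x in rows if 'two' in x]

# Illustrative predicate that matches exactly one line
one_match = [x for x in rows if 'second' in x]

print(no_match)
print(one_match)
```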

#### Difference between rdd.map() and rdd.flatMap()

In [19]:
split_map = file.map(lambda x: x.split())

In [20]:
split_map.collect()

[['first', 'row'], ['second', 'row'], ['third', 'row'], ['fourth', 'row']]

In [21]:
split_flat = file.flatMap(lambda x: x.split())

In [22]:
split_flat.collect()

['first', 'row', 'second', 'row', 'third', 'row', 'fourth', 'row']
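One way to think about flatMap() is "map, then flatten one level". The sketch below mimics that with plain Python lists and `itertools.chain`. It is an analogy under that assumption, not Spark's implementation, and the empty line in `rows` is a hypothetical addition to show that an input element may contribute zero output elements.

```python
from itertools import chain

# '' is a hypothetical empty line added for illustration
rows = ['first row', 'second row', '', 'fourth row']

# map analog: exactly one output element (a list) per input element
mapped = [x.split() for x in rows]

# flatMap analog: apply the function, then flatten one level
flat = list(chain.from_iterable(x.split() for x in rows))

print(len(mapped), mapped)
print(len(flat), flat)
```

The mapped result keeps one entry per line, including an empty list for the empty line, while the flattened result contains only the individual words.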