## Basic RDD actions and transformations

PySpark is the Python API for Apache Spark; it lets Python code drive a Spark cluster.

In [5]:
# relevant imports
from pyspark import SparkContext

### Create a SparkContext
- A SparkContext represents the connection to the Spark cluster
- It is used to create RDDs
- It is used to create shared variables (broadcast variables and accumulators) on the cluster
- Only one SparkContext can be active at any given time

In [6]:
# create the SparkContext
sc = SparkContext()

#### Create a simple text file to demonstrate some basic actions and transformations

In [7]:
%%writefile example.txt
Erste zeile
zweite zeile
Dritte zeile
vierte zeile

Writing example.txt


### Make an RDD from a text file
Use sc.textFile() to read a plain text file into an RDD; each line of the file becomes one element of the RDD.

In [9]:
file = sc.textFile('example.txt')

In [10]:
# check the type of the returned object
type(file)

pyspark.rdd.RDD

### Make an RDD from an array/list

In [15]:
a_list = [1,2,3,4,5,6]

In [18]:
rdd_list = sc.parallelize(a_list)

In [19]:
type(rdd_list)

pyspark.rdd.RDD

### Call actions on the RDD

In [11]:
file.count()

4

In [12]:
file.first()

'Erste zeile'

In [13]:
file.take(2)

['Erste zeile', 'zweite zeile']

In [14]:
file.top(2)

['zweite zeile', 'vierte zeile']
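
Unlike take(2), top(2) does not return the first two lines: it returns the two largest elements in descending order. For strings the comparison is by code point, which is why the lowercase lines win here. The following plain-Python sketch (no Spark involved; the variable names are only illustrative) reproduces the result above:

```python
lines = ['Erste zeile', 'zweite zeile', 'Dritte zeile', 'vierte zeile']

# Strings compare by code point, so ASCII uppercase letters sort before
# lowercase ones: 'D' < 'E' < 'v' < 'z'
largest_two = sorted(lines, reverse=True)[:2]
print(largest_two)  # ['zweite zeile', 'vierte zeile']
```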

In [23]:
file.collect()

['Erste zeile', 'zweite zeile', 'Dritte zeile', 'vierte zeile']

In [21]:
rdd_list.count()

6

### Transformations
- rdd.map() = applies a function to each element and returns a new RDD with exactly one output element per input element (comparable to apply() on a pandas Series)
- rdd.filter() = applies a predicate to each element and returns a new RDD containing only the elements for which it returns True
- rdd.flatMap() = applies a function that returns 0 to N elements for each input element and flattens the results into a single RDD, so the element count can grow or shrink
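
As a rough analogy only, the three transformations behave like the following plain-Python constructs. This sketch runs eagerly on a local list rather than on a distributed RDD, and the variable names are illustrative.

```python
from itertools import chain

lines = ['Erste zeile', 'zweite zeile']

# map: exactly one output element per input element
mapped = [line.upper() for line in lines]

# filter: keep only the elements for which the predicate is True
filtered = [line for line in lines if 'zwei' in line]

# flatMap: each element yields 0..N outputs, flattened into one sequence
flat = list(chain.from_iterable(line.split() for line in lines))

print(mapped)    # ['ERSTE ZEILE', 'ZWEITE ZEILE']
print(filtered)  # ['zweite zeile']
print(flat)      # ['Erste', 'zeile', 'zweite', 'zeile']
```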
 

#### map()

In [25]:
file.map(lambda x: x.lower())

PythonRDD[8] at RDD at PythonRDD.scala:53

In [26]:
file.map(lambda x: x.lower()).collect()

['erste zeile', 'zweite zeile', 'dritte zeile', 'vierte zeile']

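Spark evaluates transformations lazily: calling map() only records the operation and returns a new RDD, which is why the first cell above shows a PythonRDD object instead of data. The computation runs only when an action such as .collect() requests a result.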

#### filter()

In [27]:
file.filter(lambda x: 'zwei' in x)

PythonRDD[10] at RDD at PythonRDD.scala:53

In [28]:
file.filter(lambda x: 'zwei' in x).collect()

['zweite zeile']

#### Difference between rdd.map() and rdd.flatMap()

In [31]:
split_map = file.map(lambda x: x.split())

In [32]:
split_map.collect()

[['Erste', 'zeile'],
 ['zweite', 'zeile'],
 ['Dritte', 'zeile'],
 ['vierte', 'zeile']]

In [33]:
split_flat = file.flatMap(lambda x: x.split())

In [34]:
split_flat.collect()

['Erste', 'zeile', 'zweite', 'zeile', 'Dritte', 'zeile', 'vierte', 'zeile']