In [1]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.getOrCreate()



# What are the Low-Level APIs?
There are two sets of low-level APIs: one for manipulating distributed data (RDDs) and another for distributing and manipulating shared variables (broadcast variables and accumulators).

## When to use the Low-Level APIs
You should generally use the low-level APIs in three situations:
* You need functionality that the higher-level APIs do not provide; for example, very tight control over physical data placement across the cluster
* You need to maintain a legacy codebase written using RDDs
* You need to do some custom shared variable manipulation


## How to use the Low-Level APIs
A `SparkContext` is the entry point for low-level API functionality. You access it through the SparkSession, as in `spark.sparkContext`.

# Types of RDDs
As a user, you will likely create only two types of RDDs: the generic RDD type, or a key-value RDD that provides additional functions such as aggregating by key.
Internally, each RDD is characterized by five main properties:
* A list of partitions
* A function for computing each split
* A list of dependencies on other RDDs
* Optionally, a Partitioner for key-value RDDs (e.g., to say that the RDD is hash-partitioned)
* Optionally, a list of preferred locations on which to compute each split


These properties determine all of Spark's ability to schedule and execute the user program. Different kinds of RDDs implement their own versions of each of these properties, which is what allows you to define new data sources.
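
To make the five properties concrete, here is a minimal plain-Python sketch. `ToyRDD` is a hypothetical stand-in written for this notebook, not a Spark class, and it runs on a single machine; it only shows how a list of partitions, a per-split compute function, and a dependency list are enough to describe a lazily evaluated `map`.

```python
class ToyRDD:
    """Hypothetical, single-machine model of an RDD's five properties."""

    def __init__(self, partitions, compute, dependencies=(),
                 partitioner=None, preferred_locations=None):
        self.partitions = list(partitions)        # 1. list of partitions (here: split ids)
        self.compute = compute                    # 2. function computing one split
        self.dependencies = list(dependencies)    # 3. parent RDDs
        self.partitioner = partitioner            # 4. optional: key -> partition id
        # 5. optional: preferred hosts per split (none in this toy model)
        self.preferred_locations = preferred_locations or (lambda split: [])

    def map(self, f):
        parent = self
        # The child reuses the parent's partitions and wraps its compute function.
        return ToyRDD(
            partitions=parent.partitions,
            compute=lambda split: (f(x) for x in parent.compute(split)),
            dependencies=[parent],
        )

    def collect(self):
        # Nothing is computed until an action walks every split.
        return [x for split in self.partitions for x in self.compute(split)]


data = [["Spark", "the"], ["Definitive", "Guide"]]  # two splits of made-up input
source = ToyRDD(partitions=range(len(data)),
                compute=lambda split: iter(data[split]))
lengths = source.map(len)

print(lengths.collect())                 # [5, 3, 10, 5]
print(lengths.dependencies == [source])  # True
```

A new data source in this model would only need to supply its own `partitions` and `compute`; the real Spark interfaces are richer than this sketch suggests.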


# Creating RDDs
## From an existing DataFrame or Dataset

In [2]:
spark.range(10).rdd.toDF()

                                                                                

DataFrame[id: bigint]

Calling `.rdd` on a DataFrame yields an RDD of type Row (the `toDF()` call above converts it straight back to a DataFrame). To operate on this data, you need to convert each Row object to the correct data type or extract values from it.

In [3]:
spark.range(10).toDF('id').rdd.map(lambda row: row[0])

PythonRDD[17] at RDD at PythonRDD.scala:53

## From a local collection
To create an RDD from a collection, use the `parallelize` method on a SparkContext (within a SparkSession). This turns a single-node collection into a parallel collection. When creating this parallel collection, you can also explicitly state the number of partitions into which you would like to distribute the array.

In [4]:
myCollection = "Spark the Definitive Guide : Big Data Processing Made Simple".split(" ")

words = spark.sparkContext.parallelize(myCollection, 2)
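
As a rough mental model of what the second argument does, the sketch below splits the same list into contiguous, near-equal chunks in plain Python. `slice_collection` is a hypothetical helper, and the even contiguous split is an assumption made for illustration; Spark's actual assignment of elements to partitions may differ.

```python
def slice_collection(items, num_slices):
    """Split a list into num_slices contiguous chunks of near-equal size.

    Hypothetical helper: a plain-Python model, not Spark's implementation.
    """
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


my_collection = "Spark the Definitive Guide : Big Data Processing Made Simple".split(" ")
partitions = slice_collection(my_collection, 2)

for index, part in enumerate(partitions):
    print(index, part)
# 0 ['Spark', 'the', 'Definitive', 'Guide', ':']
# 1 ['Big', 'Data', 'Processing', 'Made', 'Simple']
```

On a real RDD, `words.glom().collect()` returns the elements grouped by partition, which is the way to check the actual layout.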

In [5]:
words.setName('myWords')
words.name()

'myWords'

# Transformations


## distinct

In [6]:
words.distinct().count()

                                                                                

10

## filter

In [7]:
def startsWithS(individual):
    return individual.startswith('S')

words.filter(lambda word: startsWithS(word)).collect()

['Spark', 'Simple']

## map

In [9]:
words2 = words.map(lambda word: (word, word[0], word.startswith('S')))
words2.filter(lambda record: record[2]).take(5)

[('Spark', 'S', True), ('Simple', 'S', True)]

## flatMap
This is a simple extension of the map function we just looked at. Sometimes each input row should produce multiple output rows instead of exactly one; flatMap expects a function that returns an iterable and flattens the results.

In [10]:
words.flatMap(lambda word: list(word)).take(5)

['S', 'p', 'a', 'r', 'k']
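
The difference from map is easiest to see side by side. This is a plain-Python sketch of the semantics on a two-word sample (the `sample` list is made up for illustration), not Spark code: mapping keeps one output element per input, while flat-mapping concatenates the returned iterables.

```python
from itertools import chain

sample = ["Spark", "the"]

# map-like: one output element (a list of characters) per input word
mapped = [list(word) for word in sample]

# flatMap-like: the per-word lists are flattened into a single sequence
flat_mapped = list(chain.from_iterable(list(word) for word in sample))

print(mapped)       # [['S', 'p', 'a', 'r', 'k'], ['t', 'h', 'e']]
print(flat_mapped)  # ['S', 'p', 'a', 'r', 'k', 't', 'h', 'e']
```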

## sort

In [11]:
words.sortBy(lambda word: len(word) * -1).take(2)

['Definitive', 'Processing']

# Actions

## reduce

In [12]:
spark.sparkContext.parallelize(range(1,21)).reduce(lambda x, y: x + y)

210

In [13]:
def wordLengthReducer(leftWord, rightWord):
    if len(leftWord) > len(rightWord):
        return leftWord
    else:
        return rightWord


words.reduce(wordLengthReducer)

'Processing'
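
Note that 'Definitive' and 'Processing' are both ten letters long, and this reducer resolves ties in favor of its right-hand argument, so it is not commutative. `reduce` is documented as expecting a commutative and associative operator. The plain-Python sketch below assumes a two-partition layout and assumes that partial results could be merged in either order; under those assumptions the answer depends on the merge order.

```python
from functools import reduce


def word_length_reducer(left_word, right_word):
    # Same logic as wordLengthReducer above: ties go to the right-hand word.
    if len(left_word) > len(right_word):
        return left_word
    return right_word


# Assumed partition layout, for illustration only.
partitions = [
    ["Spark", "the", "Definitive", "Guide", ":"],
    ["Big", "Data", "Processing", "Made", "Simple"],
]

# Reduce each partition on its own, then merge the partial results.
partials = [reduce(word_length_reducer, part) for part in partitions]
print(partials)                                       # ['Definitive', 'Processing']
print(reduce(word_length_reducer, partials))          # Processing
print(reduce(word_length_reducer, partials[::-1]))    # Definitive
```

If a stable answer matters, one option is to break ties explicitly, for example by comparing `(len(word), word)` tuples.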

In [14]:
spark.stop()