# SparkSQL and DataFrames 


## RDDs, DataSets, and DataFrames

RDDs are the original interface for Spark programming.

DataFrames were introduced in Spark 1.3.

Datasets were introduced in Spark 1.6 and unified with DataFrames in 2.0.

### Advantages of DataFrames:

From https://www.datacamp.com/community/tutorials/apache-spark-python:

> More specifically, the performance improvements are due to two things, which you’ll often come across when you’re reading up DataFrames: custom memory management (project Tungsten), which will make sure that your Spark jobs [run] much faster given CPU constraints, and optimized execution plans (Catalyst optimizer), of which the logical plan of the DataFrame is a part.

## SparkSQL and DataFrames 


pyspark does not have the Dataset API, which is available only if you use Spark from a statically typed language: Scala or Java.

From https://spark.apache.org/docs/2.2.0/sql-programming-guide.html:

> A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of `Dataset[Row]`. While, in Java API, users need to use `Dataset<Row>` to represent a DataFrame.


### The pyspark.sql module

Important classes of Spark SQL and DataFrames:

* `pyspark.sql.SparkSession` Main entry point for DataFrame and SQL functionality.

* `pyspark.sql.DataFrame` A distributed collection of data grouped into named columns.

* `pyspark.sql.Column` A column expression in a DataFrame.

* `pyspark.sql.Row` A row of data in a DataFrame.

* `pyspark.sql.GroupedData` Aggregation methods, returned by DataFrame.groupBy().

* `pyspark.sql.DataFrameNaFunctions` Methods for handling missing data (null values).

* `pyspark.sql.DataFrameStatFunctions` Methods for statistics functionality.

* `pyspark.sql.functions` List of built-in functions available for DataFrame.

* `pyspark.sql.types` List of data types available.

* `pyspark.sql.Window` For working with window functions.

http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html

https://spark.apache.org/docs/2.2.0/sql-programming-guide.html

## SparkSession

The traditional way to interact with Spark is the SparkContext. In the notebooks we get that from the pyspark driver.

From Spark 2.0 on, we can use SparkSession in place of SparkConf, SparkContext and SQLContext.

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark

In [7]:
spark.sparkContext, spark.sparkContext.textFile

(<SparkContext master=local[*] appName=PySparkShell>,
 <bound method SparkContext.textFile of <SparkContext master=local[*] appName=PySparkShell>>)

#### Passing other options to the Spark session

In [8]:
spark = SparkSession.builder.config('algoporaqui', 'valeesto').getOrCreate()

We can check option values in the resulting session like this:

In [11]:
spark.sparkContext.getConf().getAll() 

[('spark.sql.catalogImplementation', 'hive'),
 ('algoporaqui', 'valeesto'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.port', '38183'),
 ('spark.driver.host', '192.168.2.114'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.app.id', 'local-1572018107789'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.name', 'PySparkShell'),
 ('spark.ui.showConsoleProgress', 'true')]

### Creating DataFrames

SparkSession.createDataFrame: from an RDD, a list or a pandas.DataFrame.

In [13]:
spark.createDataFrame

<bound method SparkSession.createDataFrame of <pyspark.sql.session.SparkSession object at 0x7fe7d82705f8>>

### Creating DataFrames

* From RDDs
* From Hive tables
* From Spark sources: parquet (default), json, jdbc, orc, libsvm, csv, text


#### From RDDs

In [23]:
import random

random.choice(['wizard', 'warrior', 'priest'])

'priest'

In [40]:
random.seed(23)

data = [ (id_, random.choice(['wizard', 'warrior', 'priest'])) for id_ in range(15)]
data

[(0, 'warrior'),
 (1, 'wizard'),
 (2, 'wizard'),
 (3, 'priest'),
 (4, 'warrior'),
 (5, 'warrior'),
 (6, 'warrior'),
 (7, 'priest'),
 (8, 'warrior'),
 (9, 'wizard'),
 (10, 'priest'),
 (11, 'wizard'),
 (12, 'warrior'),
 (13, 'warrior'),
 (14, 'wizard')]

In [42]:
rdd = sc.parallelize(data)
rdd

ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:195

In [43]:
spark.createDataFrame(rdd)

DataFrame[_1: bigint, _2: string]

### Inferring and specifying schemas

In [44]:
df = spark.createDataFrame(data)
df

DataFrame[_1: bigint, _2: string]

In [46]:
df.printSchema()

root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)



In [48]:
df = spark.createDataFrame(data, schema=['id', 'category'])
df

DataFrame[id: bigint, category: string]

In [49]:
df.rdd

MapPartitionsRDD[24] at javaToPython at NativeMethodAccessorImpl.java:0

In [50]:
df.take(5) 

[Row(id=0, category='warrior'),
 Row(id=1, category='wizard'),
 Row(id=2, category='wizard'),
 Row(id=3, category='priest'),
 Row(id=4, category='warrior')]

In [52]:
onerow = df.first()
onerow

Row(id=0, category='warrior')

In [55]:
onerow.category

'warrior'

In [56]:
onerow['category']

'warrior'

#### Fully specifying a schema

We need to create a `StructType` composed of `StructField`s. Each of those specifies a field with name, type and `nullable` properties.

In [65]:
from pyspark.sql import types

types.IntegerType()

IntegerType

Digression: pyspark is a compatibility layer over native (Scala) Spark. A Python error raised inside a worker therefore reaches us wrapped in a Java stack trace:

In [64]:
spark.sparkContext.parallelize([[1, 4], [3,2]]).map(lambda x: triquitritrocotro).collect()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 5.0 failed 1 times, most recent failure: Lost task 3.0 in stage 5.0 (TID 11, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-64-e7071d04e716>", line 1, in <lambda>
NameError: name 'triquitritrocotro' is not defined

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-64-e7071d04e716>", line 1, in <lambda>
NameError: name 'triquitritrocotro' is not defined

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


In [67]:
schema = types.StructType([types.StructField('id', types.IntegerType(), nullable=False),
                           types.StructField('category', types.StringType())])

schema

StructType(List(StructField(id,IntegerType,false),StructField(category,StringType,true)))

In [69]:
df = spark.createDataFrame(data, schema=schema)
df

DataFrame[id: int, category: string]

In [70]:
df.printSchema()

root
 |-- id: integer (nullable = false)
 |-- category: string (nullable = true)



#### From csv files

We can either read them directly into DataFrames, or read them as RDDs and transform those into DataFrames. The second way is very useful when we have unstructured data such as web server logs.

In [83]:
spark.read.csv('coupon150720.csv', inferSchema=True)

DataFrame[_c0: bigint, _c1: int, _c2: string, _c3: string, _c4: string, _c5: string, _c6: double, _c7: string, _c8: int, _c9: string, _c10: string, _c11: string, _c12: int, _c13: string, _c14: string]

In [73]:
spark.sql('SELECT * FROM csv.`coupon150720.csv`')

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string, _c9: string, _c10: string, _c11: string, _c12: string, _c13: string, _c14: string]

In [76]:
!head -n 1 coupon150720.csv

79062005698500,1,MAA,AUH,9W,9W,56.79,USD,1,H,H,0526,150904,OK,IAF0


In [84]:
coupons = spark.sql('''SELECT _c0 AS tkt_number, 
                              _c1 AS coupon_number,
                              _c2 AS origin,
                              CAST(_c6 AS double) AS amount
                       FROM csv.`coupon150720.csv`''') 

coupons

DataFrame[tkt_number: string, coupon_number: string, origin: string, amount: double]
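
The second route mentioned at the start of this section, reading raw lines and structuring them ourselves, can be sketched in plain Python. The log format, the regular expression and the `parse_line` name below are assumptions made for this example; they are not taken from the course data. A function like this could be passed to `rdd.map`, and the resulting RDD of tuples handed to `spark.createDataFrame`.

```python
import re

# Assumed format (similar to Common Log Format): host, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_line(line):
    """Return (host, timestamp, method, path, status, size), or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    size = match.group('size')
    return (match.group('host'),
            match.group('timestamp'),
            match.group('method'),
            match.group('path'),
            int(match.group('status')),
            0 if size == '-' else int(size))

sample = '127.0.0.1 - - [10/Oct/2019:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
parsed = parse_line(sample)
print(parsed)
```

Returning `None` for malformed lines keeps the parsing step from failing the whole job; those entries can be filtered out before building the DataFrame.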

#### From other types of data

Apache Parquet is a free and open-source column-oriented data store of the Apache Hadoop ecosystem. It is similar to the other columnar storage file formats available in Hadoop namely RCFile and Optimized RCFile. It is compatible with most of the data processing frameworks in the Hadoop environment.

In [86]:
spark.read.json
spark.read.parquet

<bound method DataFrameReader.parquet of <pyspark.sql.readwriter.DataFrameReader object at 0x7fe7bd542470>>

### Basic operations with DataFrames

In [88]:
df.show(5)

+---+--------+
| id|category|
+---+--------+
|  0| warrior|
|  1|  wizard|
|  2|  wizard|
|  3|  priest|
|  4| warrior|
+---+--------+
only showing top 5 rows



In [89]:
df.take(5)

[Row(id=0, category='warrior'),
 Row(id=1, category='wizard'),
 Row(id=2, category='wizard'),
 Row(id=3, category='priest'),
 Row(id=4, category='warrior')]

### Filtering and selecting

The syntax is inspired by SQL.

In [90]:
df.select('id')

DataFrame[id: int]

In [99]:
df.select('id').show(3)

+---+
| id|
+---+
|  0|
|  1|
|  2|
+---+
only showing top 3 rows



If we want to filter, we will need to build an instance of `Column`, using square bracket notation.

In [97]:
df['category']

Column<b'category'>

In [100]:
df.filter('id' < 5) 

TypeError: '<' not supported between instances of 'str' and 'int'

That's because Python evaluates `'id' < 5` before `filter` is even called, and a comparison between str and int raises an error. Spark never gets the chance to work out which column we are referring to.

In [101]:
df.filter(df['id'] < 5)  

DataFrame[id: int, category: string]

In [103]:
df.filter(df['id'] < 5) .show()

+---+--------+
| id|category|
+---+--------+
|  0| warrior|
|  1|  wizard|
|  2|  wizard|
|  3|  priest|
|  4| warrior|
+---+--------+



`where` is exactly synonymous with `filter`.

In [91]:
df.where

<bound method filter of DataFrame[id: int, category: string]>

A column is quite different to a Pandas Series. It is just a reference to a column, and can only be used to construct sparkSQL expressions (select, where...). It can't be collected or taken as a one-dimensional sequence:

In [104]:
df['category'].show()

TypeError: 'Column' object is not callable

#### Exercise

Extract all player ids which correspond to priests

In [106]:
df.filter(df['category'] == 'priest').select('id').show()

+---+
| id|
+---+
|  3|
|  7|
| 10|
+---+



In [108]:
df[['id']], df['id'] 

(DataFrame[id: int], Column<b'id'>)

### Adding columns

DataFrames are immutable, since they are built on top of RDDs, so we cannot assign to them. We need to create new DataFrames with the appropriate columns.

In [109]:
df['id'] * 100

Column<b'(id * 100)'>

In [110]:
df['newcolumn'] = df['id'] * 100

TypeError: 'DataFrame' object does not support item assignment

In [111]:
df.select('id', 
          'category', 
          df['id'] * 100)

DataFrame[id: int, category: string, (id * 100): int]

In [112]:
df

DataFrame[id: int, category: string]

In [113]:
df.withColumn('tocoto', df['id'] * 100)

DataFrame[id: int, category: string, tocoto: int]

### User defined functions

There are many useful functions in `pyspark.sql.functions`. These work on whole columns, that is, they are vectorized.

We can write User Defined Functions (`udf`s), which allow us to "vectorize" operations: write a standard function to process single elements, then build a udf with that that works on columns in a DataFrame, like a SQL function.

In [118]:
from pyspark.sql import functions

df[['id', functions.log('id')]].show()

+---+------------------+
| id|           LOG(id)|
+---+------------------+
|  0|              null|
|  1|               0.0|
|  2|0.6931471805599453|
|  3|1.0986122886681098|
|  4|1.3862943611198906|
|  5|1.6094379124341003|
|  6| 1.791759469228055|
|  7|1.9459101490553132|
|  8|2.0794415416798357|
|  9|2.1972245773362196|
| 10| 2.302585092994046|
| 11|2.3978952727983707|
| 12|2.4849066497880004|
| 13|2.5649493574615367|
| 14|2.6390573296152584|
+---+------------------+



In [120]:
import math

math.log1p(0) 

0.0

In [121]:
math.log1p(df['id']) 

TypeError: must be real number, not Column

This errors out because `math.log1p` is not a udf: it doesn't know how to work with strings or Column objects.

But we can transform it into a udf:

In [123]:
functions.udf(math.log1p)

<function math.log1p>

In [126]:
my_udf = functions.udf(math.log1p)
my_udf(df['id'])

Column<b'log1p(id)'>

In [128]:
df[['id', my_udf(df['id'])]].show(5)

+---+------------------+
| id|         log1p(id)|
+---+------------------+
|  0|               0.0|
|  1|0.6931471805599453|
|  2|1.0986122886681096|
|  3|1.3862943611198906|
|  4|1.6094379124341003|
+---+------------------+
only showing top 5 rows



We can do the same with any function we dream up:

In [129]:
df.show(5)

+---+--------+
| id|category|
+---+--------+
|  0| warrior|
|  1|  wizard|
|  2|  wizard|
|  3|  priest|
|  4| warrior|
+---+--------+
only showing top 5 rows



In [158]:
def isprime(n):
    for div in range(2, n):
        if n % div == 0:
            return False
    return True
        
isprime(6)        

False

In [159]:
isprime_udf = functions.udf(isprime)
isprime_udf('id')

Column<b'isprime(id)'>

If we want the resulting columns to be of a particular type, we need to specify the return type, because Python return types cannot be inferred. When no return type is given, the udf produces a string column, as the next two cells show.

In [160]:
df.withColumn('prime', isprime_udf('id'))

DataFrame[id: int, category: string, prime: string]

In [161]:
df.where(isprime_udf('id'))

AnalysisException: "filter expression 'isprime(id)' of type string is not a boolean.;;\nFilter isprime(id#24)\n+- LogicalRDD [id#24, category#25], false\n"

Think about this function: what is its return type?

In [162]:
def incognito(a, b):
    return a + b
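
One way to explore the question is to call the function with different arguments. The quick plain-Python check below (the particular calls are only illustrative) shows that the return type depends entirely on what is passed in, so Spark has no way of knowing it in advance.

```python
def incognito(a, b):
    return a + b

# The return type is decided at call time by the arguments.
results = [incognito(1, 2), incognito(1.5, 2), incognito('a', 'b'), incognito([1], [2])]

for value in results:
    print(type(value).__name__, value)
```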

In [163]:
isprime_udf = functions.udf(isprime, returnType=types.BooleanType())
isprime_udf('id')

Column<b'isprime(id)'>

In [164]:
df.withColumn('prime', isprime_udf('id')).show()

+---+--------+-----+
| id|category|prime|
+---+--------+-----+
|  0| warrior| true|
|  1|  wizard| true|
|  2|  wizard| true|
|  3|  priest| true|
|  4| warrior|false|
|  5| warrior| true|
|  6| warrior|false|
|  7|  priest| true|
|  8| warrior|false|
|  9|  wizard|false|
| 10|  priest|false|
| 11|  wizard| true|
| 12| warrior|false|
| 13| warrior| true|
| 14|  wizard|false|
+---+--------+-----+



In [165]:
df.where(isprime_udf('id')).show()

+---+--------+
| id|category|
+---+--------+
|  0| warrior|
|  1|  wizard|
|  2|  wizard|
|  3|  priest|
|  5| warrior|
|  7|  priest|
| 11|  wizard|
| 13| warrior|
+---+--------+
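
Note that the tables above list 0 and 1 as prime. For those inputs `range(2, n)` is empty, so the loop in `isprime` never runs and the function falls through to `return True`. A possible corrected version, offered as a sketch (the name `isprime_strict` is made up here), rejects everything below 2 and only tries divisors up to the square root:

```python
def isprime_strict(n):
    """Return True only for primes; 0, 1 and negative numbers are not prime."""
    if n < 2:
        return False
    # A composite n always has a divisor no larger than its square root.
    for div in range(2, int(n ** 0.5) + 1):
        if n % div == 0:
            return False
    return True

primes = [n for n in range(15) if isprime_strict(n)]
print(primes)  # [2, 3, 5, 7, 11, 13]
```

It could be wrapped in the same way as the original, with `functions.udf(isprime_strict, returnType=types.BooleanType())`.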



#### Exercise

Create an 'hp' field in our df. Make it 30000 for priests, 40000 for wizards and 70000 for warriors.



In [176]:
hp = functions.udf(lambda cat: {'priest' : 30000, 
                                'wizard' : 40000, 
                                'warrior' : 70000}[cat])

df[[hp('category')]].show(5)

+------------------+
|<lambda>(category)|
+------------------+
|             70000|
|             40000|
|             40000|
|             30000|
|             70000|
+------------------+
only showing top 5 rows
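
The lambda above raises a `KeyError` for any category missing from the dictionary, and an exception inside a udf fails the Spark job. A more forgiving variant, sketched here with hypothetical names (`HP_BY_CATEGORY`, `hp_for`), returns `None` for unknown categories, which Spark stores as null:

```python
HP_BY_CATEGORY = {'priest': 30000, 'wizard': 40000, 'warrior': 70000}

def hp_for(category):
    """Look up hit points; unknown categories give None instead of raising KeyError."""
    return HP_BY_CATEGORY.get(category)

print(hp_for('wizard'))
print(hp_for('bard'))
```

Wrapping it as `functions.udf(hp_for, returnType=types.IntegerType())` would also yield an integer column directly.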



If we have a column that is not the desired type, we can convert it with `cast`.

In [174]:
df.select('id',
         'category',
         hp('category').cast(types.IntegerType()).alias('hp'))

DataFrame[id: int, category: string, hp: int]

In this case in particular, it would be more concise to write:

In [177]:
df2 = df.withColumn('hp', hp('category').cast(types.IntegerType()))
df2

DataFrame[id: int, category: string, hp: int]

### Summary statistics

https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html

In [178]:
df.stat

<pyspark.sql.dataframe.DataFrameStatFunctions at 0x7fe7bc8dc898>

In [180]:
df2.stat.corr('id', 'hp')

-0.017937400083354368

In [181]:
df2.cov('id', 'hp')

-1428.5714285714278

### .crosstab()

Crosstab returns the contingency table for two columns, as a DataFrame.

In [193]:
random.seed(23)
land = functions.udf(lambda: random.choice(['gondor', 'mordor']))

df3 = df2.withColumn('land', land())
df3.show()

+---+--------+-----+------+
| id|category|   hp|  land|
+---+--------+-----+------+
|  0| warrior|70000|gondor|
|  1|  wizard|40000|gondor|
|  2|  wizard|40000|gondor|
|  3|  priest|30000|gondor|
|  4| warrior|70000|mordor|
|  5| warrior|70000|gondor|
|  6| warrior|70000|gondor|
|  7|  priest|30000|gondor|
|  8| warrior|70000|mordor|
|  9|  wizard|40000|gondor|
| 10|  priest|30000|gondor|
| 11|  wizard|40000|gondor|
| 12| warrior|70000|gondor|
| 13| warrior|70000|gondor|
| 14|  wizard|40000|gondor|
+---+--------+-----+------+



In [197]:
df3.crosstab('category', 'land').show()

+-------------+------+------+
|category_land|gondor|mordor|
+-------------+------+------+
|       priest|     2|     1|
|      warrior|     6|     1|
|       wizard|     2|     3|
+-------------+------+------+



In [198]:
df3.cache()

DataFrame[id: int, category: string, hp: int, land: string]

In [204]:
df3.show()

+---+--------+-----+------+
| id|category|   hp|  land|
+---+--------+-----+------+
|  0| warrior|70000|gondor|
|  1|  wizard|40000|gondor|
|  2|  wizard|40000|mordor|
|  3|  priest|30000|gondor|
|  4| warrior|70000|mordor|
|  5| warrior|70000|gondor|
|  6| warrior|70000|mordor|
|  7|  priest|30000|gondor|
|  8| warrior|70000|mordor|
|  9|  wizard|40000|mordor|
| 10|  priest|30000|gondor|
| 11|  wizard|40000|mordor|
| 12| warrior|70000|gondor|
| 13| warrior|70000|mordor|
| 14|  wizard|40000|gondor|
+---+--------+-----+------+
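
Notice that the `land` values differ between the first `show()`, the crosstab and this last `show()`. A likely explanation is that `random.seed(23)` seeds the generator in the driver process only, while the udf runs in separate Python worker processes, and an uncached DataFrame re-runs the udf on every action. Once `cache()` has been called, the values computed by the next action are kept. If reproducible values matter, one option is to draw them on the driver and ship them as ordinary data. A sketch, where `assign_lands` is a hypothetical helper:

```python
import random

def assign_lands(ids, seed=23):
    """Pair each id with a land drawn from a generator seeded on the driver."""
    rng = random.Random(seed)  # private generator, independent of the global state
    return [(id_, rng.choice(['gondor', 'mordor'])) for id_ in ids]

lands = assign_lands(range(15))
print(lands[:3])
```

The resulting list could be turned into a DataFrame with `spark.createDataFrame(lands, schema=['id', 'land'])` and joined to `df2` on `id`.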



In [205]:
spark

In [208]:
spark.sparkContext.uiWebUrl

'http://192.168.2.114:4040'

### Grouping

Grouping works very similarly to Pandas: executing groupby (or groupBy) on a DataFrame will return an object (a GroupedData) that can then be aggregated to obtain the results.

In [211]:
gd = df3.groupby('category')
gd

<pyspark.sql.group.GroupedData at 0x7fe7bcb610f0>

GroupedData has several aggregation functions defined:

In [213]:
gd.mean('id').show()

+--------+-----------------+
|category|          avg(id)|
+--------+-----------------+
|  priest|6.666666666666667|
| warrior|6.857142857142857|
|  wizard|              7.4|
+--------+-----------------+



In [214]:
gd.max('id').show()

+--------+-------+
|category|max(id)|
+--------+-------+
|  priest|     10|
| warrior|     13|
|  wizard|     14|
+--------+-------+



We can do several aggregations in a single step, with a number of different syntaxes:

In [217]:
gd.agg({'id' : 'max', 'hp' : 'mean', 'id' : 'mean'})

DataFrame[category: string, avg(hp): double, avg(id): double]
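
The result above has no `max(id)` column. That is plain Python behaviour rather than anything Spark does: a dict literal cannot hold the same key twice, so the second `'id'` entry silently replaces the first before `agg` ever sees it.

```python
# Same literal as in the cell above: the key 'id' appears twice.
spec = {'id': 'max', 'hp': 'mean', 'id': 'mean'}

# Only the last value given for 'id' survives.
print(spec)  # {'id': 'mean', 'hp': 'mean'}
```

To aggregate one column in several ways, the dictionary syntax is therefore not enough; the column-function syntax in the next cell handles that case.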

In [218]:
gd.agg(functions.max('id'),
       functions.mean('hp'),
       functions.mean('id'))

DataFrame[category: string, max(id): int, avg(hp): double, avg(id): double]

In [220]:
df3.groupby(df['id'] > 5).mean('hp').show()

+--------+------------------+
|(id > 5)|           avg(hp)|
+--------+------------------+
|    true| 51111.11111111111|
|   false|53333.333333333336|
+--------+------------------+



### Intersections

Very much like SQL joins. We can specify the join columns and the join method (left, right, inner, outer). If we specify no columns, Spark attempts a cartesian product, which it refuses by default, as we will see below.

In [228]:
random.seed(42)

data = list(zip(random.choices(range(30), k=10),
                random.choices(['gondor', 'mordor'], k=10),
                random.choices(range(10000, 50000, 1000), k=10)))

data

[(19, 'gondor', 42000),
 (0, 'mordor', 37000),
 (8, 'gondor', 23000),
 (6, 'gondor', 16000),
 (22, 'mordor', 48000),
 (20, 'mordor', 23000),
 (26, 'gondor', 13000),
 (2, 'mordor', 13000),
 (12, 'mordor', 43000),
 (0, 'gondor', 34000)]

In [229]:
right = spark.createDataFrame(data, schema = ['id', 'land', 'hp_bonus'])
right

DataFrame[id: bigint, land: string, hp_bonus: bigint]

In [230]:
df3.join(right)

DataFrame[id: int, category: string, hp: int, land: string, id: bigint, land: string, hp_bonus: bigint]

In [231]:
df3.join(right).show()

AnalysisException: 'Detected implicit cartesian product for INNER join between logical plans\nInMemoryRelation [id#24, category#25, hp#580, land#941], StorageLevel(disk, memory, deserialized, 1 replicas)\n   +- *(1) Project [id#24, category#25, cast(pythonUDF0#1129 as int) AS hp#580, pythonUDF1#1130 AS land#941]\n      +- BatchEvalPython [<lambda>(category#25), <lambda>()], [id#24, category#25, pythonUDF0#1129, pythonUDF1#1130]\n         +- Scan ExistingRDD[id#24,category#25]\nand\nLogicalRDD [id#1507L, land#1508, hp_bonus#1509L], false\nJoin condition is missing or trivial.\nEither: use the CROSS JOIN syntax to allow cartesian products between these\nrelations, or: enable implicit cartesian products by setting the configuration\nvariable spark.sql.crossJoin.enabled=true;'

Spark refuses to do cross joins by default. To get past this error, we can either

a) Allow them explicitly:

```python
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```

b) Specify the join criterion:

```python
df3.join(right, on='id').show()
```



In [232]:
df3.join(right, on='id').show()

+---+--------+-----+------+------+--------+
| id|category|   hp|  land|  land|hp_bonus|
+---+--------+-----+------+------+--------+
|  0| warrior|70000|gondor|mordor|   37000|
|  8| warrior|70000|mordor|gondor|   23000|
|  6| warrior|70000|mordor|gondor|   16000|
|  2|  wizard|40000|mordor|mordor|   13000|
| 12| warrior|70000|gondor|mordor|   43000|
|  0| warrior|70000|gondor|gondor|   34000|
+---+--------+-----+------+------+--------+



In [235]:
df3.join(right, on=['id', 'land'], how='outer').show()

+---+------+--------+-----+--------+
| id|  land|category|   hp|hp_bonus|
+---+------+--------+-----+--------+
| 19|gondor|    null| null|   42000|
|  6|gondor|    null| null|   16000|
|  9|mordor|  wizard|40000|    null|
|  6|mordor| warrior|70000|    null|
|  0|mordor|    null| null|   37000|
| 26|gondor|    null| null|   13000|
| 14|gondor|  wizard|40000|    null|
|  8|gondor|    null| null|   23000|
|  1|gondor|  wizard|40000|    null|
|  4|mordor| warrior|70000|    null|
| 13|mordor| warrior|70000|    null|
|  3|gondor|  priest|30000|    null|
| 12|gondor| warrior|70000|    null|
| 12|mordor|    null| null|   43000|
|  2|mordor|  wizard|40000|   13000|
| 11|mordor|  wizard|40000|    null|
|  5|gondor| warrior|70000|    null|
| 10|gondor|  priest|30000|    null|
|  7|gondor|  priest|30000|    null|
|  8|mordor| warrior|70000|    null|
+---+------+--------+-----+--------+
only showing top 20 rows



In [238]:
joined = df3.join(right, on='id', how='outer')
joined.show()

+---+--------+-----+------+------+--------+
| id|category|   hp|  land|  land|hp_bonus|
+---+--------+-----+------+------+--------+
| 26|    null| null|  null|gondor|   13000|
| 19|    null| null|  null|gondor|   42000|
|  0| warrior|70000|gondor|mordor|   37000|
|  0| warrior|70000|gondor|gondor|   34000|
| 22|    null| null|  null|mordor|   48000|
|  7|  priest|30000|gondor|  null|    null|
|  6| warrior|70000|mordor|gondor|   16000|
|  9|  wizard|40000|mordor|  null|    null|
|  5| warrior|70000|gondor|  null|    null|
|  1|  wizard|40000|gondor|  null|    null|
| 10|  priest|30000|gondor|  null|    null|
|  3|  priest|30000|gondor|  null|    null|
| 12| warrior|70000|gondor|mordor|   43000|
|  8| warrior|70000|mordor|gondor|   23000|
| 11|  wizard|40000|mordor|  null|    null|
|  2|  wizard|40000|mordor|mordor|   13000|
|  4| warrior|70000|mordor|  null|    null|
| 13| warrior|70000|mordor|  null|    null|
| 14|  wizard|40000|gondor|  null|    null|
| 20|    null| null|  null|mordo

In [240]:
joined.select('land')

AnalysisException: "Reference 'land' is ambiguous, could be: land, land.;"

In [242]:
joined.select(df3['land']).show(5)

+------+
|  land|
+------+
|  null|
|  null|
|gondor|
|gondor|
|  null|
+------+
only showing top 5 rows



In [243]:
joined.select(right['land']).show(5)

+------+
|  land|
+------+
|gondor|
|gondor|
|mordor|
|gondor|
|mordor|
+------+
only showing top 5 rows



#### Digression

We can monitor our running jobs and the storage they use in the Spark Web UI. Its URL is available as `sc.uiWebUrl`.

A StorageLevel describes how a DataFrame is cached. Caching saves the results of the computation up to that point, so that if we process the same data several times, only the subsequent steps are recomputed.

In [244]:
spark

`cache` marks a DataFrame to be cached the first time it is computed. We can release the cached data again with `unpersist`.

In [245]:
joined.cache()

DataFrame[id: bigint, category: string, hp: int, land: string, land: string, hp_bonus: bigint]

In [246]:
joined.show(1)

+---+--------+----+----+------+--------+
| id|category|  hp|land|  land|hp_bonus|
+---+--------+----+----+------+--------+
| 26|    null|null|null|gondor|   13000|
+---+--------+----+----+------+--------+
only showing top 1 row



#### Exercise

Calculate the [z-score](http://www.statisticshowto.com/probability-and-statistics/z-score/) of each player's hp for their land

1) Calculate the mean and std of hp for each land

In [256]:
stats = df3.groupby('land').agg(functions.mean('hp').alias('avg'),
                                functions.stddev('hp').alias('std'))
stats.show()

+------+------------------+------------------+
|  land|               avg|               std|
+------+------------------+------------------+
|gondor|           47500.0|19086.270308410552|
|mordor|57142.857142857145|16035.674514745464|
+------+------------------+------------------+



2) Annotate each player with the stats corresponding to their land

In [260]:
annotated = df3.join(stats, on='land')
annotated.show(5)

+------+---+--------+-----+------------------+------------------+
|  land| id|category|   hp|               avg|               std|
+------+---+--------+-----+------------------+------------------+
|gondor|  0| warrior|70000|           47500.0|19086.270308410552|
|gondor|  1|  wizard|40000|           47500.0|19086.270308410552|
|mordor|  2|  wizard|40000|57142.857142857145|16035.674514745464|
|gondor|  3|  priest|30000|           47500.0|19086.270308410552|
|mordor|  4| warrior|70000|57142.857142857145|16035.674514745464|
+------+---+--------+-----+------------------+------------------+
only showing top 5 rows



3) Calculate the z-score

In [262]:
result = annotated.withColumn('z-score', (annotated['hp'] - annotated['avg']) / annotated['std'])
result.show(5)

+------+---+--------+-----+------------------+------------------+-------------------+
|  land| id|category|   hp|               avg|               std|            z-score|
+------+---+--------+-----+------------------+------------------+-------------------+
|gondor|  0| warrior|70000|           47500.0|19086.270308410552| 1.1788578719900638|
|gondor|  1|  wizard|40000|           47500.0|19086.270308410552|-0.3929526239966879|
|mordor|  2|  wizard|40000|57142.857142857145|16035.674514745464|-1.0690449676496976|
|gondor|  3|  priest|30000|           47500.0|19086.270308410552|-0.9168894559922718|
|mordor|  4| warrior|70000|57142.857142857145|16035.674514745464|  0.801783725737273|
+------+---+--------+-----+------------------+------------------+-------------------+
only showing top 5 rows
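
As a sanity check on the numbers above, the same statistics can be reproduced outside Spark for the gondor players. This is a plain-Python sketch, not part of the original notebook: the hp values are read off the tables shown earlier, and it relies on `functions.stddev` being the sample standard deviation, which is also what `statistics.stdev` computes.

```python
import statistics

# hp of the eight gondor players (ids 0, 1, 3, 5, 7, 10, 12, 14),
# read off the tables above
gondor_hp = [70000, 40000, 30000, 70000, 30000, 30000, 70000, 40000]

avg = statistics.mean(gondor_hp)   # mean hp for gondor
std = statistics.stdev(gondor_hp)  # sample standard deviation (divides by n - 1)

# z-score of player 0, a gondor warrior with 70000 hp
z_player_0 = (70000 - avg) / std

print(avg, std, z_player_0)
```

The values agree with the `avg`, `std` and `z-score` columns for player 0 in the Spark output.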



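The same column can also be computed with a user-defined function (UDF):

In [264]:
from pyspark.sql.types import DoubleType

# Declare the return type: without it, udf defaults to StringType and 'z-score' would be a string column
zscore = functions.udf(lambda xi, average, std: (xi - average) / std, DoubleType())
result = annotated.withColumn('z-score', zscore('hp', 'avg', 'std'))
result.show(5)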

+------+---+--------+-----+------------------+------------------+-------------------+
|  land| id|category|   hp|               avg|               std|            z-score|
+------+---+--------+-----+------------------+------------------+-------------------+
|gondor|  0| warrior|70000|           47500.0|19086.270308410552| 1.1788578719900638|
|gondor|  1|  wizard|40000|           47500.0|19086.270308410552|-0.3929526239966879|
|mordor|  2|  wizard|40000|57142.857142857145|16035.674514745464|-1.0690449676496976|
|gondor|  3|  priest|30000|           47500.0|19086.270308410552|-0.9168894559922718|
|mordor|  4| warrior|70000|57142.857142857145|16035.674514745464|  0.801783725737273|
+------+---+--------+-----+------------------+------------------+-------------------+
only showing top 5 rows



Note that we can build more complex boolean conditions for joining, as well as joining on columns that do not have the same name:

In [265]:
right

DataFrame[id: bigint, land: string, hp_bonus: bigint]

In [267]:
df3.join(right, on=(df3['id'] * 10000 > right['hp_bonus']), how='outer').show()


+---+--------+-----+------+---+------+--------+
| id|category|   hp|  land| id|  land|hp_bonus|
+---+--------+-----+------+---+------+--------+
|  5| warrior|70000|gondor| 19|gondor|   42000|
|  6| warrior|70000|mordor| 19|gondor|   42000|
|  7|  priest|30000|gondor| 19|gondor|   42000|
|  8| warrior|70000|mordor| 19|gondor|   42000|
|  9|  wizard|40000|mordor| 19|gondor|   42000|
| 10|  priest|30000|gondor| 19|gondor|   42000|
| 11|  wizard|40000|mordor| 19|gondor|   42000|
| 12| warrior|70000|gondor| 19|gondor|   42000|
| 13| warrior|70000|mordor| 19|gondor|   42000|
| 14|  wizard|40000|gondor| 19|gondor|   42000|
|  4| warrior|70000|mordor|  0|mordor|   37000|
|  5| warrior|70000|gondor|  0|mordor|   37000|
|  6| warrior|70000|mordor|  0|mordor|   37000|
|  7|  priest|30000|gondor|  0|mordor|   37000|
|  8| warrior|70000|mordor|  0|mordor|   37000|
|  9|  wizard|40000|mordor|  0|mordor|   37000|
| 10|  priest|30000|gondor|  0|mordor|   37000|
| 11|  wizard|40000|mordor|  0|mordor|  
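
To see which pairs such a condition selects, here is a plain-Python sketch of the matching step only. It is an illustration under assumptions, not Spark code: the two `right` rows are read off the output above, and `df3` is assumed to hold ids 0 to 14, as the earlier outputs suggest. An outer join would additionally keep the unmatched rows from both sides, padded with nulls.

```python
# (id, hp_bonus) pairs from `right`, read off the output above
right_rows = [(19, 42000), (0, 37000)]

# Assumption: df3 holds ids 0..14
left_ids = range(15)

# Keep every (left, right) pair for which the join condition holds
matches = [(left_id, right_id, hp_bonus)
           for right_id, hp_bonus in right_rows
           for left_id in left_ids
           if left_id * 10000 > hp_bonus]

for left_id, right_id, hp_bonus in matches:
    print(left_id, right_id, hp_bonus)
```

Each row of `right` is paired with every player whose `id * 10000` exceeds its `hp_bonus`, which is why the same right-hand row repeats many times in the result.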

### Handling null values

In [271]:
df_with_nulls = joined.drop(df3['land'])
df_with_nulls

DataFrame[id: bigint, category: string, hp: int, land: string, hp_bonus: bigint]

In [274]:
df_with_nulls.show()

+---+--------+-----+------+--------+
| id|category|   hp|  land|hp_bonus|
+---+--------+-----+------+--------+
| 26|    null| null|gondor|   13000|
| 19|    null| null|gondor|   42000|
|  0| warrior|70000|mordor|   37000|
|  0| warrior|70000|gondor|   34000|
| 22|    null| null|mordor|   48000|
|  7|  priest|30000|  null|    null|
|  6| warrior|70000|gondor|   16000|
|  9|  wizard|40000|  null|    null|
|  5| warrior|70000|  null|    null|
|  1|  wizard|40000|  null|    null|
| 10|  priest|30000|  null|    null|
|  3|  priest|30000|  null|    null|
| 12| warrior|70000|mordor|   43000|
|  8| warrior|70000|gondor|   23000|
| 11|  wizard|40000|  null|    null|
|  2|  wizard|40000|mordor|   13000|
|  4| warrior|70000|  null|    null|
| 13| warrior|70000|  null|    null|
| 14|  wizard|40000|  null|    null|
| 20|    null| null|mordor|   23000|
+---+--------+-----+------+--------+



In [273]:
df_with_nulls.dropna().show()

+---+--------+-----+------+--------+
| id|category|   hp|  land|hp_bonus|
+---+--------+-----+------+--------+
|  0| warrior|70000|mordor|   37000|
|  0| warrior|70000|gondor|   34000|
|  6| warrior|70000|gondor|   16000|
| 12| warrior|70000|mordor|   43000|
|  8| warrior|70000|gondor|   23000|
|  2|  wizard|40000|mordor|   13000|
+---+--------+-----+------+--------+



In [283]:
df_with_nulls.dropna(thresh=3).show()

+---+--------+-----+------+--------+
| id|category|   hp|  land|hp_bonus|
+---+--------+-----+------+--------+
| 26|    null| null|gondor|   13000|
| 19|    null| null|gondor|   42000|
|  0| warrior|70000|mordor|   37000|
|  0| warrior|70000|gondor|   34000|
| 22|    null| null|mordor|   48000|
|  7|  priest|30000|  null|    null|
|  6| warrior|70000|gondor|   16000|
|  9|  wizard|40000|  null|    null|
|  5| warrior|70000|  null|    null|
|  1|  wizard|40000|  null|    null|
| 10|  priest|30000|  null|    null|
|  3|  priest|30000|  null|    null|
| 12| warrior|70000|mordor|   43000|
|  8| warrior|70000|gondor|   23000|
| 11|  wizard|40000|  null|    null|
|  2|  wizard|40000|mordor|   13000|
|  4| warrior|70000|  null|    null|
| 13| warrior|70000|  null|    null|
| 14|  wizard|40000|  null|    null|
| 20|    null| null|mordor|   23000|
+---+--------+-----+------+--------+



In [284]:
df_with_nulls.dropna(subset='land').show()

+---+--------+-----+------+--------+
| id|category|   hp|  land|hp_bonus|
+---+--------+-----+------+--------+
| 26|    null| null|gondor|   13000|
| 19|    null| null|gondor|   42000|
|  0| warrior|70000|mordor|   37000|
|  0| warrior|70000|gondor|   34000|
| 22|    null| null|mordor|   48000|
|  6| warrior|70000|gondor|   16000|
| 12| warrior|70000|mordor|   43000|
|  8| warrior|70000|gondor|   23000|
|  2|  wizard|40000|mordor|   13000|
| 20|    null| null|mordor|   23000|
+---+--------+-----+------+--------+
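
The three `dropna` calls above differ only in the rule used to decide whether a row survives. The sketch below models that rule in plain Python; `keep_row` is a hypothetical helper written for illustration, not a PySpark function, and the three sample rows are copied from `df_with_nulls`.

```python
def keep_row(row, how='any', thresh=None, subset=None):
    """Return True if a dropna-style filter would keep this row (a dict)."""
    values = [row[c] for c in (subset or row.keys())]
    non_null = sum(v is not None for v in values)
    if thresh is not None:      # thresh takes precedence over how
        return non_null >= thresh
    if how == 'any':            # drop the row if any considered value is null
        return non_null == len(values)
    return non_null > 0         # how == 'all': drop only if every value is null

rows = [
    {'id': 26, 'category': None, 'hp': None, 'land': 'gondor', 'hp_bonus': 13000},
    {'id': 7, 'category': 'priest', 'hp': 30000, 'land': None, 'hp_bonus': None},
    {'id': 2, 'category': 'wizard', 'hp': 40000, 'land': 'mordor', 'hp_bonus': 13000},
]

default_ids = [r['id'] for r in rows if keep_row(r)]                  # like dropna()
thresh_ids = [r['id'] for r in rows if keep_row(r, thresh=3)]         # like dropna(thresh=3)
subset_ids = [r['id'] for r in rows if keep_row(r, subset=['land'])]  # like dropna(subset='land')

print(default_ids, thresh_ids, subset_ids)
```

Every row in `df_with_nulls` has at least three non-null values, which is why `dropna(thresh=3)` returns the table unchanged.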



## SQL querying

We need to register our DataFrame as a table in the SQL context in order to be able to query against it.

In [286]:
spark.sql('SELECT * FROM df3 where land="mordor"')

AnalysisException: 'Table or view not found: df3; line 1 pos 14'

In [287]:
df3.createOrReplaceTempView('tocoto')  # registerTempTable is deprecated since Spark 2.0

Once registered, we can perform queries as complex as we want.

In [289]:
spark.sql('SELECT * FROM tocoto where land="mordor"').show()

+---+--------+-----+------+
| id|category|   hp|  land|
+---+--------+-----+------+
|  2|  wizard|40000|mordor|
|  4| warrior|70000|mordor|
|  6| warrior|70000|mordor|
|  8| warrior|70000|mordor|
|  9|  wizard|40000|mordor|
| 11|  wizard|40000|mordor|
| 13| warrior|70000|mordor|
+---+--------+-----+------+



#### Exercise:

Replicate the previous exercise, but with SparkSQL instead of DataFrame methods.

## Interoperation with Pandas

Easy peasy. We can convert a Spark DataFrame into a Pandas one, which will `collect` it to the driver, and vice versa, which will distribute it.

In [293]:
pandas_df = annotated.toPandas()
pandas_df

|   | land | id | category | hp | avg | std |
|---|------|----|----------|----|-----|-----|
| 0 | gondor | 0 | warrior | 70000 | 47500.0 | 19086.270308 |
| 1 | gondor | 1 | wizard | 40000 | 47500.0 | 19086.270308 |
| 2 | mordor | 2 | wizard | 40000 | 57142.857143 | 16035.674515 |
| 3 | gondor | 3 | priest | 30000 | 47500.0 | 19086.270308 |
| 4 | mordor | 4 | warrior | 70000 | 57142.857143 | 16035.674515 |
| 5 | gondor | 5 | warrior | 70000 | 47500.0 | 19086.270308 |
| 6 | mordor | 6 | warrior | 70000 | 57142.857143 | 16035.674515 |
| 7 | gondor | 7 | priest | 30000 | 47500.0 | 19086.270308 |
| 8 | mordor | 8 | warrior | 70000 | 57142.857143 | 16035.674515 |
| 9 | mordor | 9 | wizard | 40000 | 57142.857143 | 16035.674515 |


In [294]:
spark.createDataFrame(pandas_df)

DataFrame[land: string, id: bigint, category: string, hp: bigint, avg: double, std: double]

## Writing out


In [296]:
df_with_nulls.write.csv('out.csv')

#### Exercise

Repeat the exercise from the previous notebook, but this time with DataFrames.

Get stats for all tickets with destination MAD from `coupon150720.csv`.

You will need to extract ticket amounts with destination MAD, and then calculate:

1. Total ticket amounts per origin
2. Top 10 airlines by average amount

1) Extract the fields you need (`_c0`, `_c1`, `_c2`, `_c3`, `_c4` and `_c6`) into a DataFrame with proper names and types

Remember, you want to calculate:

1. Total ticket amounts per origin
2. Top 10 airlines by average amount

In [298]:
!head coupon150720.csv

79062005698500,1,MAA,AUH,9W,9W,56.79,USD,1,H,H,0526,150904,OK,IAF0
79062005698500,2,AUH,CDG,9W,9W,84.34,USD,1,H,H,6120,150905,OK,IAF0
79062005924069,1,CJB,MAA,9W,9W,60.0,USD,1,H,H,2768,150721,OK,IAA0
79065668570385,1,DEL,DXB,9W,9W,160.63,USD,2,S,S,0546,150804,OK,INA0
79065668737021,1,AUH,IXE,9W,9W,152.46,USD,1,V,V,0501,150803,OK,INA0
79062006192650,1,RPR,BOM,9W,9W,68.5,USD,1,K,K,2202,150720,OK,IAE0
79062006192650,2,BOM,RPR,9W,9W,68.5,USD,1,H,H,0377,150721,OK,IAE0
79062005733853,1,DEL,DED,9W,9W,56.16,USD,1,V,V,2839,150801,OK,INA0
79062005836987,1,ATL,LGA,AA,AA,28.3,USD,1,V,V,3237,150903,OK,INB0
79062005836987,2,LGA,EWR,,,0.0,USD,1,,,VOID,,,INA0


In [330]:
coupons = spark.sql('''SELECT _c0 AS tkt_number, 
                              _c1 AS coupon_number,
                              _c2 AS origin,
                              _c3 AS destination,
                              _c4 AS carrier,
                              CAST(_c6 AS double) AS amount
                       FROM csv.`coupon150720.csv`''') 

coupons

DataFrame[tkt_number: string, coupon_number: string, origin: string, destination: string, carrier: string, amount: double]

In [331]:
anonymous_df = spark.read.csv('coupon150720.csv').select(['_c0','_c1','_c2','_c3','_c4','_c6'])

anonymous_df.withColumnRenamed('_c0', 'tkt_number')

DataFrame[tkt_number: string, _c1: string, _c2: string, _c3: string, _c4: string, _c6: string]

In [332]:
old_names = ['_c0','_c1','_c2','_c3','_c4','_c6']
new_names = ['tkt_number', 'coupon_number','origin', 'destination', 'carrier', 'amount']

for old, new in zip(old_names, new_names):
    anonymous_df = anonymous_df.withColumnRenamed(old, new)
    
anonymous_df

DataFrame[tkt_number: string, coupon_number: string, origin: string, destination: string, carrier: string, amount: string]

2) Total ticket amounts per origin

In [336]:
coupons.filter(coupons['destination'] == 'MAD')\
       .groupby('origin')\
       .sum('amount')\
       .show(5)

+------+-----------------+
|origin|      sum(amount)|
+------+-----------------+
|   PMI|40547.17000000003|
|   YUL|           284.44|
|   HEL|          8195.76|
|   SXB|           264.46|
|   UIO|           8547.6|
+------+-----------------+
only showing top 5 rows



In [335]:
result = spark.sql('''SELECT _c2 AS origin,
                              SUM(CAST(_c6 AS double)) AS amount
                      FROM csv.`coupon150720.csv`
                      WHERE _c3="MAD"
                      GROUP BY _c2''') 

result.show(5)

+------+-----------------+
|origin|           amount|
+------+-----------------+
|   PMI|40547.17000000003|
|   YUL|           284.44|
|   HEL|          8195.76|
|   SXB|           264.46|
|   UIO|           8547.6|
+------+-----------------+
only showing top 5 rows



In [338]:
coupons.cache()

DataFrame[tkt_number: string, coupon_number: string, origin: string, destination: string, carrier: string, amount: double]

In [339]:
coupons.createOrReplaceTempView('coupons_table')  # registerTempTable is deprecated since Spark 2.0

In [341]:
%%time
spark.sql('SELECT origin, SUM(amount) FROM coupons_table where destination="MAD" GROUP BY origin').show(5)

+------+-----------------+
|origin|      sum(amount)|
+------+-----------------+
|   PMI|40547.17000000003|
|   YUL|           284.44|
|   HEL|          8195.76|
|   SXB|           264.46|
|   UIO|           8547.6|
+------+-----------------+
only showing top 5 rows

CPU times: user 1.2 ms, sys: 4.37 ms, total: 5.57 ms
Wall time: 3.74 s


In [343]:
%%time 
spark.sql('SELECT origin, SUM(amount) FROM coupons_table where destination="MAD" GROUP BY origin').show(5)

+------+-----------------+
|origin|      sum(amount)|
+------+-----------------+
|   PMI|40547.17000000003|
|   YUL|           284.44|
|   HEL|          8195.76|
|   SXB|           264.46|
|   UIO|           8547.6|
+------+-----------------+
only showing top 5 rows

CPU times: user 3.64 ms, sys: 0 ns, total: 3.64 ms
Wall time: 212 ms


In [344]:
spark

In [345]:
coupons.unpersist()

DataFrame[tkt_number: string, coupon_number: string, origin: string, destination: string, carrier: string, amount: double]

In [348]:
from pyspark import StorageLevel

StorageLevel.MEMORY_ONLY # the default level for rdd.cache(); in Spark 2.x, DataFrame.cache() defaults to MEMORY_AND_DISK

StorageLevel(False, True, False, False, 1)

In [349]:
StorageLevel.MEMORY_AND_DISK

StorageLevel(True, True, False, False, 1)
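
The five values printed for a `StorageLevel` follow the order of its constructor arguments: `useDisk`, `useMemory`, `useOffHeap`, `deserialized` and `replication`. The sketch below decodes the two levels shown above with a plain-Python stand-in; `Level` and `describe` are hypothetical names used for illustration, not part of PySpark.

```python
from collections import namedtuple

# Field order mirrors the arguments of pyspark.StorageLevel
Level = namedtuple('Level',
                   'use_disk use_memory use_off_heap deserialized replication')

memory_only = Level(False, True, False, False, 1)     # as printed for MEMORY_ONLY
memory_and_disk = Level(True, True, False, False, 1)  # as printed for MEMORY_AND_DISK

def describe(level):
    """Summarise where the data may be stored and how many copies are kept."""
    places = [name for name, used in (('memory', level.use_memory),
                                      ('disk', level.use_disk),
                                      ('off-heap', level.use_off_heap)) if used]
    return '{} x{}'.format(' + '.join(places), level.replication)

print(describe(memory_only))
print(describe(memory_and_disk))
```

Read this way, the only difference between the two levels is whether data that does not fit in memory may spill to disk.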

In [350]:
coupons.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

DataFrame[tkt_number: string, coupon_number: string, origin: string, destination: string, carrier: string, amount: double]

In [351]:
coupons.show(5)

+--------------+-------------+------+-----------+-------+------+
|    tkt_number|coupon_number|origin|destination|carrier|amount|
+--------------+-------------+------+-----------+-------+------+
|79062005698500|            1|   MAA|        AUH|     9W| 56.79|
|79062005698500|            2|   AUH|        CDG|     9W| 84.34|
|79062005924069|            1|   CJB|        MAA|     9W|  60.0|
|79065668570385|            1|   DEL|        DXB|     9W|160.63|
|79065668737021|            1|   AUH|        IXE|     9W|152.46|
+--------------+-------------+------+-----------+-------+------+
only showing top 5 rows



3) Top 10 Airlines by average amount



In [354]:
df.orderBy

<bound method DataFrame.sort of DataFrame[id: int, category: string]>

In [353]:
help(df.sort)

Help on method sort in module pyspark.sql.dataframe:

sort(*cols, **kwargs) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` sorted by the specified column(s).
    
    :param cols: list of :class:`Column` or column names to sort by.
    :param ascending: boolean or list of boolean (default True).
        Sort ascending vs. descending. Specify list for multiple sort orders.
        If a list is specified, length of the list must equal length of the `cols`.
    
    >>> df.sort(df.age.desc()).collect()
    [Row(age=5, name='Bob'), Row(age=2, name='Alice')]
    >>> df.sort("age", ascending=False).collect()
    [Row(age=5, name='Bob'), Row(age=2, name='Alice')]
    >>> df.orderBy(df.age.desc()).collect()
    [Row(age=5, name='Bob'), Row(age=2, name='Alice')]
    >>> from pyspark.sql.functions import *
    >>> df.sort(asc("age")).collect()
    [Row(age=2, name='Alice'), Row(age=5, name='Bob')]
    >>> df.orderBy(desc("age"), "name").collect()
    [Row

In [358]:
df.limit(10).show()

+---+--------+
| id|category|
+---+--------+
|  0| warrior|
|  1|  wizard|
|  2|  wizard|
|  3|  priest|
|  4| warrior|
|  5| warrior|
|  6| warrior|
|  7|  priest|
|  8| warrior|
|  9|  wizard|
+---+--------+



In [363]:
coupons.filter(coupons['destination'] == 'MAD')\
       .groupby('carrier')\
       .mean('amount')\
       .sort('avg(amount)', ascending=False)\
       .show(10)

+-------+------------------+
|carrier|       avg(amount)|
+-------+------------------+
|     V0| 5418.098666666667|
|     AC| 740.6200000000001|
|     KE| 688.5261538461539|
|     SV| 553.1742553191489|
|     OB| 535.5044444444444|
|     AR| 513.5304761904762|
|     AV|450.19509554140177|
|     AM|440.73421052631585|
|     C2|            397.87|
|     LA| 379.9537078651686|
+-------+------------------+
only showing top 10 rows



## Further Reading

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

https://www.datacamp.com/community/tutorials/apache-spark-python

https://spark.apache.org/docs/2.2.0/sql-programming-guide.html

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf