<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Intro to Spark

---

![image.png](attachment:image.png)

### Learning Objectives
*After this lesson, you will be able to:*

- Describe the advantages/disadvantages of Spark compared to Hadoop MapReduce
- Define what an RDD is, by its properties and operations
- Explain the difference between transformations and actions on an RDD
- Implement the different transformations through use cases
- Use Spark via Python

<a id='intro'></a>
## What is Spark?
---


- Apache Spark is a very powerful open source, in-memory data processing framework
    - It is also called a **general-purpose processing engine**
- It was originally developed at UC Berkeley in 2009 to work on top of Hadoop, with the aim of solving some of Hadoop's problems
- Its engineers knew MapReduce extremely well. They learned a lot from:
    - why it was so hard to work with
    - what its performance challenges were

They addressed these problems very well:
- Spark was built around **speed, ease of use, and a unified engine** (with libraries for SQL queries, streaming data, machine learning and graph processing)
- It is one of the **largest open source projects in data processing**



## Spark is also referred to as Lightning-Fast Cluster Computing

### Wait, what's a cluster?

![image.png](attachment:image.png)


## What's a cluster?

![image.png](attachment:image.png)


## What's a cluster?

![image.png](attachment:image.png)


## What's a cluster?

![image.png](attachment:image.png)


## What's a cluster?

![image.png](attachment:image.png)


## What's a cluster?

There are also ways to specify:
- how many CPUs you want
- how much memory
- how much time


## What is the link with Hadoop?

![image.png](attachment:image.png)

### Spark visually
![image.png](attachment:image.png)

### MapReduce vs Spark

<img src="images/apache_hadoop_ecosystem.jpg" width="500">


**Hadoop MapReduce limits:**
- your job has to fit the `<key, value>` paradigm
- no interactive use (except by programming)
- each job reads from disk, which is a problem for iterative algorithms (machine learning)
- data is maintained via redundancy.

**How Spark answers this:**
- Spark proposes **processing workflows other than MapReduce**
- highly efficient distributed operations
- Spark runs in memory and on disk
- It can be up to 100x faster than Hadoop MapReduce in memory, and 10x faster on disk.
- Spark keeps everything in memory when possible, and uses lots of it.


### Spark Visually

- It's a set of tools for developing applications.  
- It allows parallel processing of applications, in a distributed environment.

![](https://snag.gy/c9b1Kx.jpg)


<a id='dist'></a>
### Spark is a distributed computing framework for parallelized applications, like _Hadoop_.

_Spark can interact with Hadoop's HDFS to access large amounts of data using high volume, distributed I/O._

![](https://snag.gy/s8gSlG.jpg)

<a id='api'></a>
### Spark is an API for handling large data transformations like _Map Reduce_.

_It is a data transformation and selection tool like **Pandas**: you develop transformations through an API that lets you chain operations together in a modular fashion._

![](https://snag.gy/9G4gJO.jpg)

> However, Spark frames data manipulation in terms of **transformations** and **actions**.  These transformations and actions are performed in parallel, using many different worker nodes (which may be distributed across multiple machines).

> MapReduce is broken down into two main functions, "map" and "reduce".  Spark, on the other hand, runs a set of operations organized as a **Directed Acyclic Graph** (or DAG for short).

## Resilient Distributed Datasets (RDD)

<img src="images/rdd_on_cluster.png" width="200" align="right">
\[[Image Source](http://horicky.blogspot.com/2015/02/big-data-processing-in-spark.html)\]

- created from HDFS, S3, HBase, JSON, text, local... or transformed from another RDD
- distributed across the cluster in partitions (atomic chunks of data)
- can recover from errors (node failure, slow process)
- the lineage of each partition is tracked, so its processing can be re-run
- **immutable** : you cannot *modify* an RDD in place
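
To make partitioning and recovery a little more concrete, here is a minimal plain-Python sketch. It does not use Spark, and the helpers `make_partitions` and `recompute_partition` are hypothetical names invented for this illustration, not Spark internals. It only shows the idea that a lost partition can be rebuilt from the source data plus the recorded list of transformations.

```python
# Plain-Python illustration of partitions and lineage-based recovery.
# Hypothetical helpers; this is NOT how Spark is implemented internally.

def make_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

def recompute_partition(source_partition, lineage):
    """Re-apply the recorded transformations to one source partition."""
    result = source_partition
    for func in lineage:
        result = [func(x) for x in result]
    return result

source = make_partitions(list(range(10)), 3)
lineage = [lambda x: x + 1, lambda x: x * 2]   # the recorded "recipe"

derived = [recompute_partition(p, lineage) for p in source]

# Simulate losing partition 1 on a failed node, then rebuild only that one.
derived[1] = None
derived[1] = recompute_partition(source[1], lineage)
print(derived)
```

Only the lost chunk is recomputed; the other partitions are left alone.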

## A "functional programming paradigm" and DAGs

RDDs are **immutable**! You can only **transform** an existing RDD into another one.

Spark provides many transformation functions. By chaining these functions, you construct a **Directed Acyclic Graph** (DAG).

<img src="images/dag.png">
\[[Image Source]()\]

When you use them, these functions are passed from the **client** to the **master**, which then distributes them to the workers, which apply them across their partitions of the RDD.
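
The idea that each transformation yields a new object and extends a graph can be sketched in plain Python. `ToyRDD` below is a hypothetical class made up for this illustration (it is not part of PySpark), and it only records step names rather than processing data.

```python
# Plain-Python illustration: each transformation returns a NEW object that
# remembers its parent, so a chain of calls forms a (linear) DAG.
# `ToyRDD` is a hypothetical class, not part of PySpark.

class ToyRDD:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def map(self, func_name):
        return ToyRDD("map({})".format(func_name), parent=self)

    def filter(self, func_name):
        return ToyRDD("filter({})".format(func_name), parent=self)

    def lineage(self):
        """Walk back through the parents to list the steps in order."""
        steps, node = [], self
        while node is not None:
            steps.append(node.name)
            node = node.parent
        return list(reversed(steps))

base = ToyRDD("textFile")
cleaned = base.map("split").filter("not_header")

print(cleaned.lineage())
print(base.lineage())   # the original object is unchanged
```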


![](https://snag.gy/ieVW98.jpg)

## Spark architecture : from your coding hands to the cluster

<img src="images/from_rdd_to_cluster.png">
\[[Image Source]()\]

You construct your sequence of transformations in Python.
Spark's functional programming interface builds up a **DAG**.
This DAG is sent by the **driver** to the **cluster manager** for execution.

## Spark Jargon

Excerpt taken from \[[Arush Kharbanda](https://www.quora.com/What-exactly-is-Apache-Spark-and-how-does-it-work) on Quora\]

**Job**: A piece of code which reads some input  from HDFS or local, performs some computation on the data and writes some output data.

**Stages**: Jobs are divided into stages. Stages are classified as map or reduce stages (it's easier to understand if you have worked with Hadoop and want to correlate). Stages are divided based on computational boundaries: not all computations (operators) can be performed in a single stage, so a job runs over many stages.

**Tasks**: Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine).

**DAG**: DAG stands for Directed Acyclic Graph; in the present context it's a DAG of operators.

**Executor**: The process responsible for executing a task.

**Driver**: The program/process responsible for running the Job over the Spark Engine

**Master**: The machine on which the Driver program runs

**Slave/Worker**: The machine on which the Executor program runs

# Operational Spark in Python

<img src="images/spark_flow.png" width="500">

We'll proceed along the usual Spark flow (see above).
1. create the environment to run Spark from Python
2. extract RDDs from files
3. run some transformations
4. execute actions to obtain values (local objects in Python)

**Brainstorming**: So, let's suppose you have this thing called an RDD, which is basically a dataset made of rows and values. What are all the operations you'd like to do to that RDD?

In [1]:
# put your ideas here...

<a id='ml'></a>
### Spark has a machine learning library similar to  _Scikit Learn_.

![](https://snag.gy/RnuX6h.jpg)

_Spark provides an interface to MLlib via Scala, Java, Python, and R.  The most common methods are provided, such as regression, support vector machines, and random forests; however, not all evaluation metrics are available in Python yet.  Spark is written in Scala, so features are prioritized for Scala first throughout the Spark ecosystem._



<a id='stream'></a>
### Spark is a framework for building high volume stream processors.
![](https://snag.gy/RCikuU.jpg)

With **Spark Streaming**, it's possible to build a process that can respond to data in **real time**, using any of Spark's features including MLlib, GraphX, or any kind of transformation you could do within the Spark context.  The streaming capabilities of Spark core make it possible to produce real-time applications such as **ETL, analytics dashboards, data mining, or large scale aggregations**.

In short, you can create a "streaming context" that listens on a specific port, and tie it to any number of operations that can be programmed with Spark.


<a id='sql'></a>
### Spark has a SQL interface into dataframes like _Hive_.

Spark isn't exactly **Hive**, but it uses components from Hive.  Spark DataFrames become easier to use once you register them as temporary SQL views.

>```python
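># Load a dataset as a Spark DataFrame (`spark` is assumed to be an existing SparkSession)
># header=True uses the first line of the file as column names
>df = spark.read.csv("datasets/somedataset/hamburgers_eaten_per_hour.csv", header=True)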
>df.createOrReplaceTempView("hamburgers")
>```



Then voila, you can slice and dice your dataframe as SQL:

>```python
>spark.sql("SELECT * FROM hamburgers").show()
>
># +------+---------+
># | eaten|     name|
># +------+---------+
># |null  |     Jeff|
># |  30  |   Kiefer|
># |  19  |     Hang|
># +------+---------+
>```

<a id='sparkui'></a>
## Spark UI
---

Any time a Spark context is created, a corresponding Spark UI is launched. It is accessible **only** while the Spark application is running. Whenever you launch a Spark standalone instance or PySpark, a Spark UI is created.  Through this web UI, you can monitor how your applications run.  Anything that the Spark context handles, even one-line operations from PySpark, can be observed as separate jobs in the Spark UI.

### Spark Jobs
![](https://snag.gy/JnuSKC.jpg)

### Job Metrics
![](https://snag.gy/JW2fOb.jpg)

<a id='using'></a>
## Using Spark via Pyspark
---


<a id='guide'></a>
### Spark installation guide
>Make sure you have Java 8 and Scala installed (at the time this lesson was written, Spark did not work with more recent Java versions):
```
brew cask install java8
brew install scala
```

>Install Spark, Hive and pyspark:
```
brew install apache-spark
brew install hive
pip install pyspark
```
>Run to check you can access the shell: 
```
pyspark
```

>To silence the INFO logs, copy the log4j `.template` file in Spark's `conf` directory to a file without the `.template` suffix, then change the log level in it from INFO to ERROR.

>Finally install findspark so we can easily point to our pyspark installation:
```
pip install findspark
```

## Initializing a `SparkContext` in Python

IPython / IPython notebook can be a *client* to interact with the *master*.

The client will have a `SparkContext` ("sc") that:

1. Acts as a gateway between the client and Spark master
2. Sends code/data from IPython to the master (who then sends it to the workers)

<img src="https://snag.gy/iCm4G1.jpg" width="600">

**"sc"** represents your interface to a running Spark cluster manager.  A Spark context is defined by the cluster it connects to (the master) and an application name.  All **transformations** and **actions** performed by Spark are handled through the Spark context (aka **sc**).

Using:

```python
import pyspark as ps
sc = ps.SparkContext('local[4]')
```

will create a *"local"* cluster, made of the driver only, running 4 worker threads. Passing `'local[*]'` instead uses one thread per logical core. The MacBook Pro used here has 4 logical cores: for each processor core that is physically present (in our case, 2), the operating system addresses two virtual (logical) cores and shares the workload between them when possible.
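
If you are unsure how many logical cores your machine has, a small sketch like the following can build the master string for you. It only uses the standard library; the variable names are illustrative, and passing the resulting string to `SparkContext` is left out so the snippet runs without Spark installed.

```python
import os

# Number of logical cores reported by the operating system (can be None).
logical_cores = os.cpu_count() or 1

# "local[N]" asks Spark for N worker threads on this machine.
explicit_master = "local[{}]".format(logical_cores)
print(explicit_master)
```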

In [2]:
import findspark
findspark.init('/usr/local/spark')

In [3]:
import pyspark as ps    # for the pyspark suite
import warnings         # for displaying warning
from pyspark.sql import SQLContext

In [4]:
try:
    # we try to create a SparkContext that works locally with 2 worker threads
    sc = ps.SparkContext('local[2]')
    sqlContext = SQLContext(sc)
    print("Just created a SparkContext")
except ValueError:
    # give a warning if SparkContext already exists (for use inside pyspark)
    warnings.warn("SparkContext already exists in this scope")

Just created a SparkContext


In [5]:
# To stop your spark context instance:
sc.stop()

IMPORTANT: Spark does some very heavy lifting for us and works very well most of the time.
It is not perfect though. If things start getting ugly (that is, PySpark is unable to create a Spark context connection), run in a terminal:

```
ps -ax | grep spark
kill <PID>
```

This is quite an extreme measure. Use it only if nothing else works.

<a id='rdd'></a>
## RDDs vs DataFrames

---

The two main types of data objects in Spark are the **Resilient Distributed Dataset** and the **DataFrame**.  Both types represent data in a distributed state.  RDDs store data in a more primitive state such as a list of pairs, integers, floats, or strings.  DataFrames have a rich structure definition called a **schema**, much like a Pandas dataframe.

- You use **RDDs** to manage semi-structured data.
- You use **DataFrames** to operate on typed series.

Both RDDs and DataFrames can contain multiple types of objects.  **DataFrames** are much more constrained because data is represented by a 2-dimensional tabular structure where columns represent variables and rows represent observations.  **RDDs** are more flexible when your data needs less structure than a DataFrame, while still letting you use _transformation_ methods such as **`map()`** and _action_ methods such as **`reduce()`**.

>_Distributed Data in Spark_
>![](https://snag.gy/vxVhri.jpg)

## Creating an RDD (from files)

RDDs are **immutable**. Once created, you cannot modify them directly. You can only transform them into another RDD. 

Functions for creating an RDD from an external source are methods of the SparkContext object `sc`.

| Method | Description |
| - | - |
| [`sc.parallelize(array)`]() | Create an RDD from a Python list (or other iterable) |
| [`sc.textFile(path)`]() | Create an RDD from a text file, one element per line |
| [`sc.pickleFile(path)`]() | Load an RDD previously saved with `RDD.saveAsPickleFile()` |

### Creating RDDs from local files

#### `sc.parallelize()` : create an RDD from a python array/list

In [6]:
sc = ps.SparkContext('local[4]')

# creating an adhoc list
data_array = [['isaac', 18],
              ['lee', 7],
              ['brad', 2],
              ['giovanna', 14],
              ['darren', 10],
              ['cary', 42]]

In [7]:
# reading the array/list using SparkContext
rdd = sc.parallelize(data_array)

# collect() is an action: it runs the computation and returns a local Python list
rdd.collect()

[['isaac', 18],
 ['lee', 7],
 ['brad', 2],
 ['giovanna', 14],
 ['darren', 10],
 ['cary', 42]]

In [8]:
# to output the attributes / methods available
dir(rdd)

['__add__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_computeFractionForSampleSize',
 '_defaultReducePartitions',
 '_id',
 '_jrdd',
 '_jrdd_deserializer',
 '_memory_limit',
 '_pickled',
 '_reserialize',
 '_to_java_object_rdd',
 'aggregate',
 'aggregateByKey',
 'cache',
 'cartesian',
 'checkpoint',
 'coalesce',
 'cogroup',
 'collect',
 'collectAsMap',
 'combineByKey',
 'context',
 'count',
 'countApprox',
 'countApproxDistinct',
 'countByKey',
 'countByValue',
 'ctx',
 'distinct',
 'filter',
 'first',
 'flatMap',
 'flatMapValues',
 'fold',
 'foldByKey',
 'foreach',
 'foreachPartition',
 'fullOuterJoin',
 'getCheckpointFile',
 'getNumPartitions

In [9]:
sc

<pyspark.context.SparkContext at 0x113a5e0f0>

In [10]:
sc.stop()

#### `sc.textFile()`: from a text file

This gives you an RDD made of **strings, one per line of the text file**.

In [11]:
cat 'data/toy_data.txt'

matthew,4
jorge,8
josh,15
evangeline,16
emilie,23
yunjin,42


In [12]:
sc = ps.SparkContext('local[4]')

# displaying the content of the file in stdout
with open('data/toy_data.txt', 'r') as text:
    print(text.read())

# reading the file using SparkContext
rdd = sc.textFile('data/toy_data.txt')

# to output the content in python [in real life, use collect() with great care: it pulls the whole RDD into the driver]
rdd.collect()

matthew,4
jorge,8
josh,15
evangeline,16
emilie,23
yunjin,42



['matthew,4', 'jorge,8', 'josh,15', 'evangeline,16', 'emilie,23', 'yunjin,42']
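
For comparison, the list above is what plain Python would give you by splitting the file content into lines. The sketch below inlines the toy data so it runs without the file; the variable names are illustrative. It also shows that each element is still a raw string, so any splitting or casting is a separate step.

```python
# Plain-Python equivalent of the list returned by rdd.collect() above.
# The file content is inlined here so the snippet runs without the data file.
toy_text = "matthew,4\njorge,8\njosh,15\nevangeline,16\nemilie,23\nyunjin,42\n"

lines = toy_text.splitlines()   # one string per line, newline removed
print(lines)

# Each element is still a raw string; splitting and casting is up to you.
pairs = [(name, int(number))
         for name, number in (line.split(",") for line in lines)]
print(pairs)
```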

In [13]:
rdd

data/toy_data.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [14]:
# to stop your spark session, run:
sc.stop()

## Transformations : transforming an RDD into another

- They are **lazy**: Spark doesn't apply the transformation right away; it just adds it to the **DAG**.
- They transform an RDD into another RDD, because RDDs are **immutable**.
- They can be **wide** or **narrow** (depending on whether or not they shuffle data across partitions).

<img src="images/rdd_narrow_vs_wide_transformations.png" width="400"/>
\[[Image Source](http://horicky.blogspot.com/2013/12/spark-low-latency-massively-parallel.html)\]
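
Laziness can be illustrated without Spark, using Python's built-in `map`, which is also lazy. This is only an analogy: `noisy_double` and `calls` are names made up for the sketch, and `list(...)` plays the role of an action such as `collect()`.

```python
# Plain-Python illustration of lazy evaluation using the built-in lazy map().
# Nothing here is Spark code; `noisy_double` and `calls` are illustrative names.
calls = []

def noisy_double(x):
    calls.append(x)        # record that the function actually ran
    return 2 * x

lazy_result = map(noisy_double, [1, 2, 3])   # "transformation": nothing runs yet
calls_before = len(calls)

materialized = list(lazy_result)             # "action": forces the computation
calls_after = len(calls)

print(calls_before, calls_after, materialized)
```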



| Method | Type | Category | Description |
| - | - | - | - |
| [`.map(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map) | transformation | mapping | Return a new RDD by applying a function to each element of this RDD. |
| [`.flatMap(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.flatMap) | transformation | mapping | Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results. |
| [`.filter(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.filter) | transformation | reduction |  Return a new RDD containing only the elements that satisfy a predicate. |
| [`.sample()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sample) | transformation | reduction | Return a sampled subset of this RDD. |
| [`.distinct()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.distinct) | transformation | reduction |  Return a new RDD containing the distinct elements in this RDD. |
| [`.keys()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.keys) | transformation | `<k,v>` | Return an RDD with the keys of each tuple. |
| [`.values()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.values) | transformation | `<k,v>` | Return an RDD with the values of each tuple. |
| [`.join(rddB)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.join) | transformation | `<k,v>` | Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other. |
| [`.reduceByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) | transformation | `<k,v>` | Merge the values for each key using an associative and commutative reduce function. |
| [`.groupByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.groupByKey) | transformation | `<k,v>` | Group the values for each key into a single sequence, which you can then process with any function, including non-associative ones such as a mean. |
| [`.sortBy(keyfunc)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortBy) | transformation | sorting |  Sorts this RDD by the given keyfunc. |
| [`.sortByKey()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortByKey) | transformation | sorting/`<k,v>` | Sorts this RDD, which is assumed to consist of (key, value) pairs. |



### Applying transformations and chaining them

Recall the spark flow:

<img src="images/spark_flow.png" width="500">

In the cells below, we will, step by step:
1. read an RDD from a text file
2. transform by applying `split`
3. transform by filtering
4. transform by casting some columns to their corresponding type.
5. use an action to output the results

Each transformation is a method of an RDD, and returns another RDD.

In [15]:
# displaying the content of the file in stdout
with open('data/sales.txt', 'r') as fin:
    print(fin.read())

#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
106    11/19/2014     202     CA     331        330.00
103    11/17/2014     101     WA     373        750.00
105    11/19/2014     202     CA     321        200.00



*Recall: Input functions, reading RDDs from files, are functions of the SparkContext.*

In [16]:
sc.stop()

In [17]:
sc = ps.SparkContext('local[4]')

# reads a text file line by line
rdd1 = sc.textFile('data/sales.txt')

rdd1.collect()

['#ID    Date           Store   State  Product    Amount',
 '101    11/13/2014     100     WA     331        300.00',
 '104    11/18/2014     700     OR     329        450.00',
 '102    11/15/2014     203     CA     321        200.00',
 '106    11/19/2014     202     CA     331        330.00',
 '103    11/17/2014     101     WA     373        750.00',
 '105    11/19/2014     202     CA     321        200.00']

In [18]:
# applies split() to each row
rdd2 = rdd1.map(lambda row: row.split())

rdd2.collect()

[['#ID', 'Date', 'Store', 'State', 'Product', 'Amount'],
 ['101', '11/13/2014', '100', 'WA', '331', '300.00'],
 ['104', '11/18/2014', '700', 'OR', '329', '450.00'],
 ['102', '11/15/2014', '203', 'CA', '321', '200.00'],
 ['106', '11/19/2014', '202', 'CA', '331', '330.00'],
 ['103', '11/17/2014', '101', 'WA', '373', '750.00'],
 ['105', '11/19/2014', '202', 'CA', '321', '200.00']]

In [19]:
# filters rows
rdd3 = rdd2.filter(lambda row: not row[0].startswith('#'))

rdd3.collect()

[['101', '11/13/2014', '100', 'WA', '331', '300.00'],
 ['104', '11/18/2014', '700', 'OR', '329', '450.00'],
 ['102', '11/15/2014', '203', 'CA', '321', '200.00'],
 ['106', '11/19/2014', '202', 'CA', '331', '330.00'],
 ['103', '11/17/2014', '101', 'WA', '373', '750.00'],
 ['105', '11/19/2014', '202', 'CA', '321', '200.00']]

In [20]:
def casting_function(X):
    return int(X[0]), X[1], int(X[2]), X[3], int(X[4]), float(X[5])

In [21]:
casting_function(['101', '11/13/2014', '100', 'WA', '331', '300.00'])

(101, '11/13/2014', 100, 'WA', 331, 300.0)
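
`casting_function` leaves the date column as a string. If you also wanted a real date object, a variant could look like the sketch below. The name `casting_function_with_date` is hypothetical, and the `MM/DD/YYYY` format is an assumption based on the sample rows.

```python
from datetime import datetime

def casting_function_with_date(row):
    """Like casting_function, but also parses the date (assumed MM/DD/YYYY)."""
    return (int(row[0]),
            datetime.strptime(row[1], "%m/%d/%Y").date(),
            int(row[2]),
            row[3],
            int(row[4]),
            float(row[5]))

parsed = casting_function_with_date(
    ['101', '11/13/2014', '100', 'WA', '331', '300.00'])
print(parsed)
```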

In [22]:
# applies casting_function to rows
rdd4 = rdd3.map(casting_function)

# shows the result
rdd4.collect()

[(101, '11/13/2014', 100, 'WA', 331, 300.0),
 (104, '11/18/2014', 700, 'OR', 329, 450.0),
 (102, '11/15/2014', 203, 'CA', 321, 200.0),
 (106, '11/19/2014', 202, 'CA', 331, 330.0),
 (103, '11/17/2014', 101, 'WA', 373, 750.0),
 (105, '11/19/2014', 202, 'CA', 321, 200.0)]

**Now, let's see the canonical way to write that in Python...**

In [23]:
sales_rdd = sc.textFile('data/sales.txt')

sales_rdd.collect()

['#ID    Date           Store   State  Product    Amount',
 '101    11/13/2014     100     WA     331        300.00',
 '104    11/18/2014     700     OR     329        450.00',
 '102    11/15/2014     203     CA     321        200.00',
 '106    11/19/2014     202     CA     331        330.00',
 '103    11/17/2014     101     WA     373        750.00',
 '105    11/19/2014     202     CA     321        200.00']

In [24]:
sales_rdd = sc.textFile('data/sales.txt') \
              .map(lambda rowstr: rowstr.split())   # <= JUST ADDED THIS HERE

sales_rdd.collect()

[['#ID', 'Date', 'Store', 'State', 'Product', 'Amount'],
 ['101', '11/13/2014', '100', 'WA', '331', '300.00'],
 ['104', '11/18/2014', '700', 'OR', '329', '450.00'],
 ['102', '11/15/2014', '203', 'CA', '321', '200.00'],
 ['106', '11/19/2014', '202', 'CA', '331', '330.00'],
 ['103', '11/17/2014', '101', 'WA', '373', '750.00'],
 ['105', '11/19/2014', '202', 'CA', '321', '200.00']]

In [25]:
sales_rdd = sc.textFile('data/sales.txt') \
              .map(lambda rowstr: rowstr.split()) \
              .filter(lambda row: not row[0].startswith('#'))    # <= JUST ADDED THIS HERE

sales_rdd.collect()

[['101', '11/13/2014', '100', 'WA', '331', '300.00'],
 ['104', '11/18/2014', '700', 'OR', '329', '450.00'],
 ['102', '11/15/2014', '203', 'CA', '321', '200.00'],
 ['106', '11/19/2014', '202', 'CA', '331', '330.00'],
 ['103', '11/17/2014', '101', 'WA', '373', '750.00'],
 ['105', '11/19/2014', '202', 'CA', '321', '200.00']]

In [26]:

sales_rdd = sc.textFile('data/sales.txt') \
              .map(lambda rowstr: rowstr.split()) \
              .filter(lambda row: not row[0].startswith('#')) \
              .map(casting_function)   # <= JUST ADDED THIS HERE

sales_rdd.collect()

[(101, '11/13/2014', 100, 'WA', 331, 300.0),
 (104, '11/18/2014', 700, 'OR', 329, 450.0),
 (102, '11/15/2014', 203, 'CA', 321, 200.0),
 (106, '11/19/2014', 202, 'CA', 331, 330.0),
 (103, '11/17/2014', 101, 'WA', 373, 750.0),
 (105, '11/19/2014', 202, 'CA', 321, 200.0)]
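
If the chaining style is new to you, it may help to see the same three steps written with Python's built-in `map` and `filter`. This sketch does not use Spark: it inlines only the header and the first three rows of the sales file so it can run on its own, and `local_sales` is an illustrative name.

```python
# Plain-Python version of the same pipeline, on a subset of the sales file.
sales_text = """#ID    Date           Store   State  Product    Amount
101    11/13/2014     100     WA     331        300.00
104    11/18/2014     700     OR     329        450.00
102    11/15/2014     203     CA     321        200.00
"""

def casting_function(X):
    return int(X[0]), X[1], int(X[2]), X[3], int(X[4]), float(X[5])

rows = map(lambda rowstr: rowstr.split(), sales_text.splitlines())
rows = filter(lambda row: not row[0].startswith('#'), rows)
rows = map(casting_function, rows)

local_sales = list(rows)   # plays the role of .collect()
print(local_sales)
```

The difference is that Spark runs each step in parallel across the partitions of the RDD, while this version runs on a single machine.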

<span style="color:red">FROM NOW ON WE'LL RELY ON THESE TWO RDDs</span>

In [27]:
# creating an adhoc list
data_array = [['Harry', 18],
              ['John', 7],
              ['Anna', 2],
              ['Jim', 14],
              ['Debby', 10],
              ['Ally', 42]]

# reading the array/list using SparkContext
names_rdd = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
names_rdd.collect()

[['Harry', 18],
 ['John', 7],
 ['Anna', 2],
 ['Jim', 14],
 ['Debby', 10],
 ['Ally', 42]]

In [28]:
names_rdd

ParallelCollectionRDD[16] at parallelize at PythonRDD.scala:475

In [29]:

sales_rdd = sc.textFile('data/sales.txt') \
              .map(lambda x: x.split()) \
              .filter(lambda x: not x[0].startswith('#')) \
              .map(casting_function)

sales_rdd.collect()

[(101, '11/13/2014', 100, 'WA', 331, 300.0),
 (104, '11/18/2014', 700, 'OR', 329, 450.0),
 (102, '11/15/2014', 203, 'CA', 321, 200.0),
 (106, '11/19/2014', 202, 'CA', 331, 330.0),
 (103, '11/17/2014', 101, 'WA', 373, 750.0),
 (105, '11/19/2014', 202, 'CA', 321, 200.0)]

### Mapping

#### `.map(func)` : applying a function on every row

In [30]:
# applying a lambda function to an rdd
out_rdd = names_rdd.map(lambda X: len(X[0]))

# print out the original rdd
print("before: \n{}".format(names_rdd.collect()))

# print out the new rdd generated
print("\n" + "after: \n{}".format(out_rdd.collect()))

before: 
[['Harry', 18], ['John', 7], ['Anna', 2], ['Jim', 14], ['Debby', 10], ['Ally', 42]]

after: 
[5, 4, 4, 3, 5, 4]


#### `.flatMap(func)` : applying a function on every row and flattening the resulting lists

In [31]:
# first with .map(): each row becomes a list, so the result is a list of lists
out_rdd = names_rdd.map(lambda name_number: [
                            name_number[1], name_number[1] + 2, name_number[1] + len(name_number[0])])

# print out the original rdd
print(("before: \n{}".format(names_rdd.collect())))

# print out the new rdd generated
print(("\n" + "after: \n{}".format(out_rdd.collect())))

before: 
[['Harry', 18], ['John', 7], ['Anna', 2], ['Jim', 14], ['Debby', 10], ['Ally', 42]]

after: 
[[18, 20, 23], [7, 9, 11], [2, 4, 6], [14, 16, 17], [10, 12, 15], [42, 44, 46]]


In [32]:
# now with .flatMap(): the lists produced for each row are flattened into one sequence
out_rdd = names_rdd.flatMap(lambda name_number: [
                            name_number[1], name_number[1] + 2, name_number[1] + len(name_number[0])])

# print out the original rdd
print(("before: \n{}".format(names_rdd.collect())))

# print out the new rdd generated
print(("\n" + "after: \n{}".format(out_rdd.collect())))

before: 
[['Harry', 18], ['John', 7], ['Anna', 2], ['Jim', 14], ['Debby', 10], ['Ally', 42]]

after: 
[18, 20, 23, 7, 9, 11, 2, 4, 6, 14, 16, 17, 10, 12, 15, 42, 44, 46]
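
The difference between the two methods has a direct plain-Python analogue: a list comprehension behaves like `.map()`, and `itertools.chain.from_iterable` adds the flattening step of `.flatMap()`. The sketch below uses only the first three names, and `expand` is an illustrative helper.

```python
from itertools import chain

names = [['Harry', 18], ['John', 7], ['Anna', 2]]

def expand(name_number):
    name, number = name_number
    return [number, number + 2, number + len(name)]

# like .map(): one output element (a list) per input row
mapped = [expand(row) for row in names]

# like .flatMap(): the per-row lists are concatenated into one sequence
flat_mapped = list(chain.from_iterable(expand(row) for row in names))

print(mapped)
print(flat_mapped)
```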


### Row reduction

#### `.filter(func)`: filters an RDD using a function that returns boolean values

In [33]:
# .top(n) is an action: it returns the n largest rows (tuples compare element by element)
sales_rdd.top(2)

[(106, '11/19/2014', 202, 'CA', 331, 330.0),
 (105, '11/19/2014', 202, 'CA', 321, 200.0)]

In [34]:
# .take(n) is an action: it returns the first n rows
sales_rdd.take(2)

[(101, '11/13/2014', 100, 'WA', 331, 300.0),
 (104, '11/18/2014', 700, 'OR', 329, 450.0)]

In [35]:
# filtering an rdd: keep only the rows whose state (index 3) is 'CA'
out_rdd = sales_rdd.filter(lambda row: row[3] == 'CA')

# print out the original rdd
print(("before: \n{}".format(sales_rdd.collect())))

# print out the new rdd generated
print(("\n" + "after: \n{}".format(out_rdd.collect())))

before: 
[(101, '11/13/2014', 100, 'WA', 331, 300.0), (104, '11/18/2014', 700, 'OR', 329, 450.0), (102, '11/15/2014', 203, 'CA', 321, 200.0), (106, '11/19/2014', 202, 'CA', 331, 330.0), (103, '11/17/2014', 101, 'WA', 373, 750.0), (105, '11/19/2014', 202, 'CA', 321, 200.0)]

after: 
[(102, '11/15/2014', 203, 'CA', 321, 200.0), (106, '11/19/2014', 202, 'CA', 331, 330.0), (105, '11/19/2014', 202, 'CA', 321, 200.0)]


<a id='practical-tips'></a>
# <i class="fa fa-thumbs-up" aria-hidden="true"></i> Practical tips
---

### Sampling from enormous datasets
Undoubtedly, you will at some point need to examine a dataset that is too large to inspect in full.  A common operation is to take a sample.  To approximate the characteristics of the full distribution, adjust the sample size until the measures of central tendency are stable, or consider doing a power analysis to determine the sample size.

> The size of your sample generally depends on your application, be it A/B testing, EDA, machine learning, etc.

#### `.sample(withReplacement, fraction, seed)`: sampling an RDD

In [36]:
# sampling an rdd with replacement; 0.4 is the expected fraction, so the sample size varies between runs
out_rdd = sales_rdd.sample(True, 0.4)

# print out the original rdd
print("before: \n{}".format(sales_rdd.collect()))

# print out the new rdd generated
print("\n" + "after: \n{}".format(out_rdd.collect()))

before: 
[(101, '11/13/2014', 100, 'WA', 331, 300.0), (104, '11/18/2014', 700, 'OR', 329, 450.0), (102, '11/15/2014', 203, 'CA', 321, 200.0), (106, '11/19/2014', 202, 'CA', 331, 330.0), (103, '11/17/2014', 101, 'WA', 373, 750.0), (105, '11/19/2014', 202, 'CA', 321, 200.0)]

after: 
[(101, '11/13/2014', 100, 'WA', 331, 300.0), (101, '11/13/2014', 100, 'WA', 331, 300.0), (104, '11/18/2014', 700, 'OR', 329, 450.0), (106, '11/19/2014', 202, 'CA', 331, 330.0), (103, '11/17/2014', 101, 'WA', 373, 750.0)]
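
Two things are worth noticing in the output above. With `withReplacement=True`, a row can be drawn more than once (row 101 appears twice). And `fraction` is an expected proportion, not an exact size. The plain-Python sketch below illustrates the second point for sampling without replacement; `bernoulli_sample` is a hypothetical helper written for this illustration, not Spark's sampler.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))
sample_a = bernoulli_sample(rows, 0.4, seed=42)
sample_b = bernoulli_sample(rows, 0.4, seed=42)

print(len(sample_a))          # close to 400, but not guaranteed to be exactly 400
print(sample_a == sample_b)   # same seed, same sample
```

Fixing the seed is what makes a sample reproducible from one run to the next.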


#### `.distinct()`: obtaining distinct rows

In [37]:
# obtaining distinct values of the "state" column (index 3) of sales_rdd
out_rdd = sales_rdd.map(lambda row: row[3]) \
                   .distinct()

# print out the original rdd
print(("before: \n{}".format(sales_rdd.collect())))

# print out the new rdd generated
print(("\n" + "after: \n{}".format(out_rdd.collect())))

before: 
[(101, '11/13/2014', 100, 'WA', 331, 300.0), (104, '11/18/2014', 700, 'OR', 329, 450.0), (102, '11/15/2014', 203, 'CA', 321, 200.0), (106, '11/19/2014', 202, 'CA', 331, 330.0), (103, '11/17/2014', 101, 'WA', 373, 750.0), (105, '11/19/2014', 202, 'CA', 321, 200.0)]

after: 
['CA', 'WA', 'OR']


### Methods with a `<key, value>` paradigm

#### `.values()`: returns the values of a RDD made of `<key, value>` pairs

In [38]:
# extracting the values (the second element of each pair)
out_rdd = names_rdd.values()

# print out the original rdd
print("before: \n{}".format(names_rdd.collect()))

# print out the new rdd generated
print("\n" + "after: \n{}".format(out_rdd.collect()))

before: 
[['Harry', 18], ['John', 7], ['Anna', 2], ['Jim', 14], ['Debby', 10], ['Ally', 42]]

after: 
[18, 7, 2, 14, 10, 42]


#### `.keys()`: returns the keys of a RDD made of `<k,v>` pairs

In [39]:
# extracting the keys (the first element of each pair)
out_rdd = names_rdd.keys()

# print out the original rdd
print("before: \n{}".format(names_rdd.collect()))

# print out the new rdd generated
print("\n" + "after: \n{}".format(out_rdd.collect()))

before: 
[['Harry', 18], ['John', 7], ['Anna', 2], ['Jim', 14], ['Debby', 10], ['Ally', 42]]

after: 
['Harry', 'John', 'Anna', 'Jim', 'Debby', 'Ally']


#### `rddA.join(rddB)`: join another RDD

In [40]:
# build (state, amount) pairs so that the state can serve as the join key
state_sales_rdd = sales_rdd.map(lambda row: (row[3], row[5]))

state_sales_rdd.collect()

[('WA', 300.0),
 ('OR', 450.0),
 ('CA', 200.0),
 ('CA', 330.0),
 ('WA', 750.0),
 ('CA', 200.0)]

In [41]:
# creating an adhoc list of managers for each state
managers_array = [['CA', 'Hot'],
                  ['OR', 'Cold'],
                  ['WA', 'Boring'],
                  ['TX', 'Gun']]

# reading the array/list using SparkContext
managers_rdd = sc.parallelize(managers_array)

# inner join on the state key: 'TX' has no sales, so it does not appear in the result
state_sales_rdd.join(managers_rdd).collect()

[('CA', (200.0, 'Hot')),
 ('CA', (330.0, 'Hot')),
 ('CA', (200.0, 'Hot')),
 ('OR', (450.0, 'Cold')),
 ('WA', (300.0, 'Boring')),
 ('WA', (750.0, 'Boring'))]
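
Notice that `'TX'` is missing from the result: `.join()` behaves like an inner join, so only keys present in both RDDs survive. The plain-Python sketch below models that behaviour on the same data (the helper `inner_join` is a hypothetical stand-in, not Spark code). PySpark also provides `.leftOuterJoin()`, `.rightOuterJoin()` and `.fullOuterJoin()` for cases where unmatched keys should be kept.

```python
state_sales = [('WA', 300.0), ('OR', 450.0), ('CA', 200.0),
               ('CA', 330.0), ('WA', 750.0), ('CA', 200.0)]
managers = [('CA', 'Hot'), ('OR', 'Cold'), ('WA', 'Boring'), ('TX', 'Gun')]


def inner_join(left, right):
    """Pair every left value with every right value that shares its key."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]


joined = inner_join(state_sales, managers)

# same pairs as the Spark output above (ordering aside)
print(sorted(joined))

# 'TX' has a manager but no sales, so the inner join drops it
print('TX' in {k for k, _ in joined})  # False
```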

#### `.reduceByKey(func)`: reduce `values` by their `key` by applying func (what ?)

The `func` here needs to be associative and commutative... can you guess why ?

In [42]:
# creating an adhoc list
data_array = [['CA', 1],
              ['WA', 1],
              ['CA', 2],
              ['OR', 1],
              ['CA', 5],
              ['OR', 1]]

# reading the array/list using SparkContext
rdd = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd.collect()

[['CA', 1], ['WA', 1], ['CA', 2], ['OR', 1], ['CA', 5], ['OR', 1]]

In [43]:
rdd.reduceByKey(lambda v1, v2: v1 + v2).collect()

[('CA', 8), ('WA', 1), ('OR', 2)]
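
One way to see why `func` must be associative and commutative: the values of a key can be spread over several partitions, each partition is reduced on its own, and the partial results are then merged in no guaranteed order. The plain-Python sketch below models this with two hypothetical partition layouts (the helper `reduce_in_partitions` and both layouts are illustrative assumptions, not Spark internals).

```python
from functools import reduce


def reduce_in_partitions(partitions, func):
    """Reduce each partition separately, then merge the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


# the 'CA' values from the example above, split in two hypothetical ways
layout_a = [[1, 2], [5]]
layout_b = [[5], [2, 1]]

add = lambda v1, v2: v1 + v2   # associative and commutative
sub = lambda v1, v2: v1 - v2   # neither

# addition gives the same answer whatever the layout
print(reduce_in_partitions(layout_a, add), reduce_in_partitions(layout_b, add))  # 8 8

# subtraction depends on how the data happened to be split
print(reduce_in_partitions(layout_a, sub), reduce_in_partitions(layout_b, sub))  # -6 4
```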

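#### `.groupByKey()`: group `values` by their `key` (then apply any function)

Unlike `.reduceByKey(func)`, `.groupByKey()` takes no function: it returns, for each key, an iterable of all its values. You can then apply any function to that iterable, including ones that are not associative or commutative (a mean, for example).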

In [44]:
# creating an adhoc list
data_array = [['CA', 1],
              ['WA', 1],
              ['CA', 2],
              ['OR', 1],
              ['CA', 5],
              ['OR', 1]]

# reading the array/list using SparkContext
rdd = sc.parallelize(data_array)

# to output the content in python [irl, use with great care]
rdd.collect()

[['CA', 1], ['WA', 1], ['CA', 2], ['OR', 1], ['CA', 5], ['OR', 1]]

In [45]:
def mean(iterator):
    total, count = 0.0, 0
    for x in iterator:
        total += x
        count += 1
    return total / count


rdd.groupByKey() \
   .map(lambda state_iterator: (state_iterator[0], mean(state_iterator[1]))) \
   .collect()

[('CA', 2.6666666666666665), ('WA', 1.0), ('OR', 1.0)]
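
The same per-key mean can also be obtained without gathering every value of a key in one place: carry `(sum, count)` pairs and merge them with a function that is associative and commutative, which is the shape `.reduceByKey()` expects. Below is a plain-Python sketch of that idea on the same data; the dictionary stands in for Spark's grouping by key, and the PySpark chain in the comment is an assumed equivalent that was not run here.

```python
data = [('CA', 1), ('WA', 1), ('CA', 2), ('OR', 1), ('CA', 5), ('OR', 1)]


def merge(a, b):
    """Merge two (sum, count) pairs; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])


partials = {}
for state, value in data:
    pair = (value, 1)  # each value starts as (sum=value, count=1)
    partials[state] = merge(partials[state], pair) if state in partials else pair

# divide sum by count once all pairs of a key are merged
means = {state: total / count for state, (total, count) in partials.items()}
print(means)

# Assumed PySpark equivalent (not executed here):
# rdd.mapValues(lambda v: (v, 1)) \
#    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
#    .mapValues(lambda s: s[0] / s[1])
```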

### Sorting methods

#### `.sortBy(keyfunc)`: sorting by the value of a function on rows

In [46]:
# sorting by any function (because why not?)

names_rdd_mod = names_rdd.map(
    lambda name_number: (name_number[0], (13 - name_number[1])**2))

In [47]:
names_rdd_join = names_rdd.join(names_rdd_mod)

In [48]:
names_rdd_sorted = names_rdd_join.sortBy(
    lambda name___number: name___number[1][1], ascending=True)


# print out the original rdd
print(("before: \n{}".format(names_rdd.collect())))

# print out the new rdd generated
print(("\n" + "after: \n{}".format(names_rdd_sorted.collect())))

before: 
[['Harry', 18], ['John', 7], ['Anna', 2], ['Jim', 14], ['Debby', 10], ['Ally', 42]]

after: 
[('Jim', (14, 1)), ('Debby', (10, 9)), ('Harry', (18, 25)), ('John', (7, 36)), ('Anna', (2, 121)), ('Ally', (42, 841))]


#### `.sortByKey()`: sorting by key on a `<k,v>` RDD

In [49]:
# sorting k,v pairs by key
out_rdd = names_rdd.sortByKey(ascending=False)

# print out the original rdd
print(("before: \n{}".format(names_rdd.collect())))

# print out the new rdd generated
print(("\n" + "after: \n{}".format(out_rdd.collect())))

before: 
[['Harry', 18], ['John', 7], ['Anna', 2], ['Jim', 14], ['Debby', 10], ['Ally', 42]]

after: 
[('John', 7), ('Jim', 14), ('Harry', 18), ('Debby', 10), ('Anna', 2), ('Ally', 42)]


## Actions : turning your RDD into something else (local object)

Actions are specific methods of an RDD object. They are usually designed to transform an RDD into something else (a Python object or a statistic).

When used/executed in IPython or in a notebook, they **launch the processing of the DAG**. This is where Spark stops being **lazy**. This is where your script will take time to execute.

| Method | Type | Description |
| - | - | - |
| [`.collect()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collect) | action | Return a list that contains all of the elements in this RDD. Note that this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory. |
| [`.count()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.count) | action | Return the number of elements in this RDD. |
| [`.take(n)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.take) | action | Take the first `n` elements of the RDD. |
| [`.top(n)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.top) | action | Get the top `n` elements from a RDD. It returns the list sorted in descending order. |
| [`.first()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.first) | action | Return the first element in a RDD. |
| [`.sum()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sum) | action | Add up the elements in this RDD. |
| [`.mean()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.mean) | action | Compute the mean of this RDD’s elements. |
| [`.stdev()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.stdev) | action | Compute the standard deviation of this RDD’s elements. |

In [50]:
# creating an adhoc list
data_array = [['Blae', 18],
              ['Jose', 7],
              ['Anna', 2],
              ['Noorah', 14],
              ['Telamon', 10],
              ['Milly', 42]]

# reading the array/list using SparkContext
rdd_names = sc.parallelize(data_array)

### Actions that return portions of an RDD

#### `.collect()` : returning the *full* content of an RDD to "python space"

Returns the rows of an RDD as a list. This can be a bad idea if your RDD is gigantic, because `.collect()` returns everything and puts it in memory for Python to process.

In [51]:
# to output the content in python
collected = names_rdd.collect()

# let's check the type of RDD
print(("type of rdd: {}".format(type(rdd_names))))

# let's check the type of what's collected
print(("\ntype of rdd_collected: {} \n".format(type(collected))))

# let's print the collected content
print(collected)

type of rdd: <class 'pyspark.rdd.RDD'>

type of rdd_collected: <class 'list'> 

[['Harry', 18], ['John', 7], ['Anna', 2], ['Jim', 14], ['Debby', 10], ['Ally', 42]]


#### `.take(n)` : returning the first n lines of an RDD

Returns the first `n` rows of an RDD as a list. These `n` rows are not randomly selected: Spark scans the partitions in order and returns the first `n` rows it can collect.

In [52]:
# to output the content in python
taken = names_rdd.take(2)

# let's check the type of what's collected
print(("type of rdd_taken: {}\n".format(type(taken))))

# let's print the collected content
print(taken)

type of rdd_taken: <class 'list'>

[['Harry', 18], ['John', 7]]


#### `.first()` : returning the first line of an RDD

In [53]:
print((names_rdd.first()))

['Harry', 18]


### Actions that compute some statistics

#### `.count()` : count the number of lines

In [54]:
print((names_rdd.count()))

6


#### `.sum()`: summing every line in an RDD

(The RDD needs to contain summable values)

In [55]:
names_rdd.values().sum()

93

In [56]:
print(names_rdd.values().sum())

93


#### `.mean()`: averaging every line in an RDD

(The RDD needs to contain summable values)

In [57]:
print(names_rdd.values().mean())

15.5


#### `.stdev()`: standard deviation of every line in an RDD (you get that right ?)

In [58]:
print(names_rdd.values().stdev())

12.880864360230902
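
The value above is the population standard deviation (dividing by `n`). As a sanity check, here is a plain-Python sketch on the same six values using the standard `statistics` module; it also shows the sample version (dividing by `n - 1`), which PySpark exposes separately as `.sampleStdev()`.

```python
import statistics

values = [18, 7, 2, 14, 10, 42]

population_sd = statistics.pstdev(values)  # divides by n, like .stdev()
sample_sd = statistics.stdev(values)       # divides by n - 1

print(population_sd)
print(sample_sd)
```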


# Let's design chains of transformations together !

## 1. Computing sales per state

### Input RDD

In [59]:
def casting_function(row):
    # unpack the six fields of a sales record
    (_id, date, store, state, product, amount) = row
    # cast ids and amount to numbers; keep date and state as strings
    return (int(_id), date, int(store), state, int(product), float(amount))


rdd_sales = sc.textFile('data/sales.txt') \
              .map(lambda x: x.split()) \
              .filter(lambda x: not x[0].startswith('#')) \
              .map(casting_function)

rdd_sales.collect()

[(101, '11/13/2014', 100, 'WA', 331, 300.0),
 (104, '11/18/2014', 700, 'OR', 329, 450.0),
 (102, '11/15/2014', 203, 'CA', 321, 200.0),
 (106, '11/19/2014', 202, 'CA', 331, 330.0),
 (103, '11/17/2014', 101, 'WA', 373, 750.0),
 (105, '11/19/2014', 202, 'CA', 321, 200.0)]

### Task

You want to obtain an RDD of the states, sorted by their total sales amount (highest first).

What transformations do you need to apply ?
How would you draw the workflow of these transformations ?

### Code

In [60]:
out_rdd = rdd_sales  # apply transformation here...

out_rdd.collect()

[(101, '11/13/2014', 100, 'WA', 331, 300.0),
 (104, '11/18/2014', 700, 'OR', 329, 450.0),
 (102, '11/15/2014', 203, 'CA', 321, 200.0),
 (106, '11/19/2014', 202, 'CA', 331, 330.0),
 (103, '11/17/2014', 101, 'WA', 373, 750.0),
 (105, '11/19/2014', 202, 'CA', 321, 200.0)]

## Solution (double click)

<span style="color:white;font-family:'Courier New'"><br/>
out_rdd = rdd_sales.map(lambda row: (row[3], row[5])) \
                   .reduceByKey(lambda amount1, amount2: amount1 + amount2) \
                   .sortBy(lambda state_amount: state_amount[1], ascending=False)
<br/>
out_rdd.collect()<br/>
</span>

## Word count (again)

### Input RDD

In [61]:
# displaying the content of the file in stdout
with open('data/input.txt', 'r') as fin:
    print(fin.read())

# reading the file using SparkContext

rdd = sc.textFile('data/input.txt')

hello world
another line
yet another line
yet another another line



### Task
What transformations do you need to apply in order to do the word count?

### Code

In [62]:
out_rdd = rdd  # apply transformation here...

# collect the result
out_rdd.collect()

['hello world', 'another line', 'yet another line', 'yet another another line']

## Solution (double click)

<span style="color:white;font-family:'Courier New'">
out_rdd = rdd.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda v1, v2: v1 + v2)
<br/>
out_rdd.collect()<br/>
</span>
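
To check your result locally, here is a plain-Python sketch that counts the words of the same four lines with `collections.Counter`. It is only a cross-check on a tiny input, not a replacement for the Spark pipeline.

```python
from collections import Counter

lines = ['hello world', 'another line', 'yet another line',
         'yet another another line']

# split every line into words, then count each word
counts = Counter(word for line in lines for word in line.split())

print(sorted(counts.items()))
# [('another', 4), ('hello', 1), ('line', 3), ('world', 1), ('yet', 2)]
```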

## Find the date on which AAPL's stock price was the highest

### Input RDD

In [63]:
appl_raw_rdd = sc.textFile('data/aapl.csv')

print(("lines in file: {}".format(appl_raw_rdd.count())))

appl_raw_rdd.take(5)

lines in file: 254


['Date,Open,High,Low,Close,Volume,Adj Close',
 '2016-10-25,117.949997,118.360001,117.309998,118.25,39190300,118.25',
 '2016-10-24,117.099998,117.739998,117.00,117.650002,23538700,117.650002',
 '2016-10-21,116.809998,116.910004,116.279999,116.599998,23192700,116.599998',
 '2016-10-20,116.860001,117.379997,116.330002,117.059998,24125800,117.059998']

### Task

Now, design a pipeline that would :
1. filter out headers
2. split each line based on comma
3. keep only fields for Date (col 0) and Close (col 4)
4. order by Date in descending order

### Code

In [64]:
out_rdd = appl_raw_rdd  # apply transformation here...

out_rdd.take(5)

['Date,Open,High,Low,Close,Volume,Adj Close',
 '2016-10-25,117.949997,118.360001,117.309998,118.25,39190300,118.25',
 '2016-10-24,117.099998,117.739998,117.00,117.650002,23538700,117.650002',
 '2016-10-21,116.809998,116.910004,116.279999,116.599998,23192700,116.599998',
 '2016-10-20,116.860001,117.379997,116.330002,117.059998,24125800,117.059998']

### Solution

<span style="color:white;font-family:'Courier New'">
out_rdd = appl_raw_rdd.filter(lambda line: not line.startswith("Date")) \
                      .map(lambda line: line.split(",")) \
                      .map(lambda fields: (fields[0], float(fields[4]))) \
                      .sortBy(lambda x: x[0], ascending=False)
<br/>
out_rdd.collect()<br/>
</span>
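
The pipeline described in the task orders rows by Date. To answer the question in the heading, you would rank by Close instead, for instance with `.max(key=lambda x: x[1])` or `.sortBy(lambda x: x[1], ascending=False).first()` on the `(Date, Close)` RDD. The plain-Python sketch below shows the idea on the four data rows printed by `take(5)` above, so its answer only covers that sample, not the full 254-line file.

```python
sample_lines = [
    'Date,Open,High,Low,Close,Volume,Adj Close',
    '2016-10-25,117.949997,118.360001,117.309998,118.25,39190300,118.25',
    '2016-10-24,117.099998,117.739998,117.00,117.650002,23538700,117.650002',
    '2016-10-21,116.809998,116.910004,116.279999,116.599998,23192700,116.599998',
    '2016-10-20,116.860001,117.379997,116.330002,117.059998,24125800,117.059998',
]

# drop the header, split on commas, keep (Date, Close)
rows = [line.split(',') for line in sample_lines if not line.startswith('Date')]
date_close = [(fields[0], float(fields[4])) for fields in rows]

# pick the pair with the highest Close
best_date, best_close = max(date_close, key=lambda x: x[1])
print(best_date, best_close)  # 2016-10-25 118.25
```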


## What's the in-memory y'all talkin' about ?

Recall:
- Spark runs in memory and on disk
- Can be up to 100x faster than Hadoop MapReduce in memory, and 10x faster on disk.
- Spark keeps everything in memory when possible, uses lots of it.


## Caching / Persistency

- An RDD does no work until an action is called. When an action is called, Spark computes the answer and then throws away all the intermediate data.
- If you are going to reuse an RDD in your computation, you can call `cache()` to make Spark keep it around.
- This is especially useful if you have to run the same computation over and over again on one RDD: one use case ? oh I don't know maybe... **MACHINE LEARNING !!!**

## Caching

Consider the following job...

In [65]:
import random
num_count = 5*10**5
num_list = [random.random() for i in range(num_count)]
rdd1 = sc.parallelize(num_list)
rdd2 = rdd1.sortBy(lambda num: num)

In [66]:
%time rdd2.count()


CPU times: user 7.19 ms, sys: 3.32 ms, total: 10.5 ms
Wall time: 1.14 s


500000

- Let's cache it and try again.

In [67]:
rdd2.cache()

PythonRDD[104] at RDD at PythonRDD.scala:48

In [68]:
%time rdd2.count()


CPU times: user 5.75 ms, sys: 2.42 ms, total: 8.18 ms
Wall time: 898 ms


500000

- Caching the RDD speeds up the job because the RDD does not have to be computed from scratch again.
- Calling cache() flips a flag on the RDD.
- The data is not cached until an action is called.
- You can uncache an RDD using unpersist()
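
The points above can be pictured with a toy model. The class below is a hypothetical plain-Python stand-in for an RDD (`LazyDataset` and its attributes are invented for illustration and are not part of Spark): `cache()` only flips a flag, the data is stored by the next action, and later actions reuse it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: work happens only when an action runs."""

    def __init__(self, compute):
        self._compute = compute       # function that rebuilds the data
        self._cache_enabled = False   # the "flag" flipped by cache()
        self._cached = None
        self.computations = 0         # how many times the data was rebuilt

    def cache(self):
        self._cache_enabled = True    # nothing is computed or stored yet
        return self

    def unpersist(self):
        self._cache_enabled = False
        self._cached = None
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached       # reuse the stored data
        self.computations += 1
        data = self._compute()
        if self._cache_enabled:
            self._cached = data       # stored by the first action after cache()
        return data

    def count(self):                  # an "action"
        return len(self._materialize())


ds = LazyDataset(lambda: sorted(range(1000, 0, -1)))

ds.count()
ds.count()
print(ds.computations)  # 2: without caching, every action recomputes

ds.cache()
print(ds.computations)  # 2: cache() alone does no work

ds.count()
ds.count()
print(ds.computations)  # 3: computed once more, then served from the cache
```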

## Persist

- `persist()` lets you choose where a cached RDD is stored: in memory, on disk, or both.
- Each of these options is called a storage level.

| Level | Meaning |
| - | - |
| `MEMORY_ONLY` | Same as `cache()` |
| `MEMORY_AND_DISK` | Cache in memory then overflow to disk |
| `MEMORY_AND_DISK_SER` | Like above; in cache keep objects serialized instead of live |
| `DISK_ONLY` | Cache to disk not to memory |

<a id='activity'></a>
## Check out another dataset using Spark SQL and Spark DataFrames (more on that in the next lesson)

<a id='dtypes'></a>
## Spark data types
---

### RDDs

It's best to think of RDDs as primitive objects that are distributed.  RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.  The RDD type is the oldest type and has been part of the Spark codebase since its earliest releases.


### DataFrames

DataFrames entered the Spark codebase in version 1.3, around the time of the "Tungsten Initiative", which sought to improve the performance of Spark.  The big difference between RDDs and DataFrames is that DataFrames introduce the idea of a "schema", much like Pandas.

<a id='df'></a>
### DataFrames

The big plus is that Spark DataFrames serialize their data at a lower level, to native Java/Scala, so passing it between nodes is much more performant and requires fewer processes to handle computations.  In short, data can be processed faster when it is stored in a common format [the schema], because Spark doesn't have to convert it before performing tasks on it.

Beyond the performance optimizations introduced with a schema-based data structure, the **DataFrame API** provides a convenient set of selectors for transforming data, much like Pandas.  Lastly, it's possible to create temporary views through which **DataFrames** can be queried with SQL: **SparkSQL**.

#### 1. Load up the "Pokemon" basic Pokedex dataset
First try without inferring the schema and without the header.

In [69]:
#
sqlContext = ps.SQLContext(sc)

In [70]:
df = sqlContext.read.csv(
    path="./datasets/pokedex_basic.csv",
    header=True,
    # Poorly formed rows in CSV are dropped rather than erroring entire operation
    mode="DROPMALFORMED",
    # Not always perfect but works well in most cases as of 2.1+
    inferSchema=True
)

#### 2. Check out the dataset with the infer schema parameter but without header.
How does it work with / without?

In [71]:
#


#### 3.  Create a temporary view with the Pokedex DataFrame called "pokemon"
Then run `SELECT * FROM pokemon LIMIT 10`.

In [72]:
#


#### 4.a Which is the strongest Pokemon by `Type`?
Using Spark DataFrame operations.  Research Spark's "grouping" functions.

In [73]:
#


#### 4.b Which is the strongest Pokemon by Type?
Using the Spark SQL temporary view.

In [74]:
#


#### 5.a Which Pokemon has the best combined Attack and Defence?
Using Spark DataFrame operations.

In [75]:
#


#### 5.b Which Pokemon has the best combined Attack and Defence?
Using the Spark SQL temporary view.

In [76]:
#


#### 6. Create a new feature called "Pokevalue" that combines Attack, Defence, and the Pokemon's HP scaled by .2.

Use any means necessary to solve this problem.

In [77]:
#
