## SparkSession

In order to work with Spark, we have to first set up a `SparkSession`.

Once the session is created, we can interact with Apache Spark through the `spark` object.

In [1]:
import findspark
#findspark.init('/home/pascalfares/DataMiningSpark/sparkhome/spark-3.0.1-bin-hadoop2.7')
findspark.init('/opt/spark/spark-3.0.1-bin-hadoop2.7')
from pyspark.sql import SparkSession

>The findspark module helps us initialize Spark in a Python environment such as Jupyter, which is the one we are working in now.

In [2]:
# The builder method is used to set up an app which we name 'HelloWorldApp'
spark = SparkSession.builder.appName("HelloWorldApp").getOrCreate()

Let's break down this code snippet a bit further.
To work with Spark, we have to set up a Spark application, which we name `HelloWorldApp`.

To do this:
- We started building a `SparkSession` through the `.builder` attribute.
- We used `.appName` to tell Spark to name our application `HelloWorldApp`.
- We used `.getOrCreate()` to tell Spark to return the existing `SparkSession` if one is already active, or to create a new one from the builder's settings if none exists yet.
- Finally, the reference to this Spark application is stored in an object we named `spark`.

*__Note__ that without a SparkSession, it is not possible to access and use Spark.
More information about SparkSession can be found [here](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession)*
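
The get-or-create behaviour described above can be pictured with a small plain-Python sketch. This is only a toy model and not Spark's implementation: `ToySession` and `ToyBuilder` are hypothetical names invented for this illustration, and the sketch assumes at most one active session per process.

```python
class ToySession:
    """Hypothetical stand-in for a session object; not part of PySpark."""
    _active = None  # the single active session in this toy model, if any

    def __init__(self, app_name):
        self.app_name = app_name

    def stop(self):
        # Clearing the active session means the next get_or_create() builds a new one
        ToySession._active = None


class ToyBuilder:
    """Hypothetical builder that mimics the chained-call style."""

    def __init__(self):
        self._app_name = "default"

    def app_name(self, name):
        self._app_name = name
        return self  # returning self is what allows the calls to be chained

    def get_or_create(self):
        # Create a session only if none is active; otherwise reuse the existing one
        if ToySession._active is None:
            ToySession._active = ToySession(self._app_name)
        return ToySession._active


first = ToyBuilder().app_name("HelloWorldApp").get_or_create()
second = ToyBuilder().app_name("AnotherName").get_or_create()

print(first is second)    # True: the second call reused the first session
print(second.app_name)    # HelloWorldApp: in this toy model the new name is ignored
```

The point of the sketch is only that calling the builder twice does not give two independent sessions. How Spark treats builder options passed on a later call is not modelled here.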

### Hello World

Next, we will use this newly created `spark` object to create some data.

In [3]:
# Set up the textFile RDD to read the README.md file (older RDD API); see the next section for reading text files with DataFrames and RDDs
#   Note: this is lazy, so nothing is read yet

textFile = spark.sparkContext.textFile("../README.md")

In [4]:
# Only when we perform an action (like a count) is the textFile actually read and the aggregate calculated
#    Click on [View] to see the stages and executors

textFile.count()

117
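
The difference between defining the RDD (lazy) and running an action such as `count()` can be illustrated with a plain-Python analogy. The sketch below does not use Spark: `lazy_lines`, `read_log` and the sample lines are made up for this illustration, and a Python generator stands in for the deferred read.

```python
# Plain-Python analogy for lazy evaluation; it does not use Spark.
read_log = []  # records each line at the moment it is actually read


def lazy_lines(lines):
    """Yield lines one by one, logging every read (hypothetical helper)."""
    for line in lines:
        read_log.append(line)
        yield line


sample = ["# Title", "", "Some text"]  # made-up stand-in for the lines of a file

pending = lazy_lines(sample)       # like textFile(...): nothing has been read yet
reads_before = len(read_log)

total = sum(1 for _ in pending)    # like count(): this forces the read
reads_after = len(read_log)

print(reads_before, total, reads_after)  # 0 3 3
```

One place where the analogy breaks down: a Python generator is exhausted after one pass, whereas an RDD can be evaluated again by running another action on it.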

In [5]:
l = textFile.collect()
display(l)

['# Mastering Big Data Analytics with PySpark [Machine Learning & Data Mining Workshop]',
 'This is the code repository for the lab the 2 first sessions "Machine Learning & Data Mining Workshop".',
 '',
 'Theses hand-on are mainly inspired by this workshop : https://github.com/PacktPublishing/Mastering-Big-Data-Analytics-with-PySpark Authored by: [Danny Meijer](https://www.linkedin.com/in/dannydatascientist). It is in fact a fork with adaptation to windows 10 and add some parts issued form our courses in [Cnam Liban](http://www.cnam-liban.fr).',
 '[Ingénierie de la fouille et de la visualisation de données massives](http://cedric.cnam.fr/vertigo/Cours/RCP216/)',
 'and',
 '[Cours de bases de données documentaires et distribuées](http://b3d.bdpedia.fr/)',
 '',
 '',
 '## About the WorkShop',
 '',
 '[ ] adapt this paragraph',
 '',
 "PySpark helps you perform data analysis at-scale; it enables you to build more scalable analyses and pipelines. This course starts by introducing you to PySpar

In [6]:
# Using Spark SQL, we create a dataframe which holds our `hello world` data
df = spark.sql('SELECT "hello world" as c1')

# We can then use the `show()` method to see what the DataFrame we just created looks like
df.show()

+-----------+
|         c1|
+-----------+
|hello world|
+-----------+



If you did everything right, you should see a table with your Hello World message inside. __Congratulations!__ You've just built your first Spark application that says hello to the world!!

> *__Troubleshooting__: if you run this code snippet without an active SparkSession (Spark Application), for example after it has been stopped, it throws an error like this:*
> ```
> Py4JJavaError: An error occurred while calling o116.showString.
> : java.lang.IllegalStateException: SparkContext has been shutdown
> ```
>
> __Fix this by running the SparkSession builder first (cell above)__

### Stopping our Spark application

It is always good practice to clean up after ourselves.
As we no longer need this application once it has done what we wanted, we can stop it using `.stop()`.
This tells Spark that it can shut down the application and free up its resources.

In [7]:
# To kill the Spark application, use the `stop()` method
spark.stop()

That brings us to the end of this part of our tutorial.

**Happy Sparking!**

# **Lesson 1:** What are RDDs, DataFrames and Datasets?

* Spark RDD APIs – RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records and the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a fault-tolerant manner, which speeds up the task.
* Spark DataFrame APIs – Unlike an RDD, a DataFrame organizes data into named columns, like a table in a relational database. It is an immutable distributed collection of data. A DataFrame allows developers to impose a structure onto a distributed collection of data, allowing a higher-level abstraction.
* Spark Dataset APIs – Datasets in Apache Spark are an extension of the DataFrame API which provides a type-safe, object-oriented programming interface. A Dataset takes advantage of Spark’s Catalyst optimizer by exposing expressions and data fields to a query planner. The typed Dataset API is available in Scala and Java; in Python we work with DataFrames.
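
To make the "named columns" distinction concrete, here is a rough plain-Python analogy. It does not use Spark, the data is made up, and `PersonRow` is a hypothetical schema invented for this sketch: tuples split into groups play the role of partitioned, read-only records, and a `namedtuple` plays the role of rows with named columns.

```python
from collections import namedtuple

# RDD-like view: immutable records split into partitions, fields known only by position
partitions = (
    (("alice", 34), ("bob", 45)),   # partition 0 (made-up data)
    (("carol", 29),),               # partition 1
)
ages_by_position = [record[1] for part in partitions for record in part]

# DataFrame-like view: the same records, now with named columns
PersonRow = namedtuple("PersonRow", ["name", "age"])  # hypothetical schema
rows = [PersonRow(*record) for part in partitions for record in part]
ages_by_name = [row.age for row in rows]

print(ages_by_position)  # [34, 45, 29]
print(ages_by_name)      # [34, 45, 29]
```

Both views hold the same data. The second one simply knows what each field is called, which is the kind of structural information a DataFrame carries and an RDD of plain records does not.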