# Tutorial 1: Familiarize Yourself with PySpark
A good programming environment can save you a lot of trouble and time. 
This tutorial walks you through setting up an environment in which you can start PySpark.
It presents the easiest way to run PySpark in a Jupyter Notebook. 
The whole course is based on Jupyter Notebook.
If you want to install PySpark on another operating system, see the manual provided with this course. 

## 1.0 Install the Required Library
Every time you start the Python engine provided by Google Cloud, you need to install the required libraries again before you can run the code.

In [1]:
!pip install pyspark



## 1.1 Create SparkContext and SparkSession
Before we look at examples, let's first initialize a **SparkSession** and a **SparkContext** using the builder pattern defined in the SparkSession class. 
When initializing, we need to provide the master and the application name, as shown below. 
In a real application, you would usually pass the master through spark-submit instead of hardcoding it in the Spark application.

In [2]:
# Create the entry points to Spark
from pyspark.sql import SparkSession

ss = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Note: in this tutorial `spark` refers to the SparkContext, not the SparkSession
spark = ss.sparkContext

The builder methods used for the session mean the following:

`master()` – If you are running on a cluster, pass the name of your cluster manager as the argument to `master()`. Usually this is either <a href="https://sparkbyexamples.com/hadoop/how-yarn-works/">yarn (Yet Another Resource Negotiator)</a> or mesos, depending on your cluster setup.
- Use `local[x]` when running in local mode (i.e., on a single machine). x should be an integer greater than 0; it sets the number of worker threads Spark uses, which in turn determines the default number of partitions when creating RDDs and DataFrames. Ideally, x should equal the number of CPU cores you have.

`appName()` – Used to set your application name.

`getOrCreate()` – Returns the existing SparkSession if one already exists; otherwise it creates a new one.
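
The advice about choosing x in `local[x]` can be sketched in plain Python. The helper `local_master` below is hypothetical (it is not part of PySpark); it only builds the string you would pass to `master()`, assuming you want one thread per CPU core when no value is given.

```python
import os
from typing import Optional


def local_master(threads: Optional[int] = None) -> str:
    """Build a `local[x]` master string (hypothetical helper, not a PySpark API)."""
    if threads is None:
        # os.cpu_count() may return None if the core count cannot be determined
        threads = os.cpu_count() or 1
    if threads < 1:
        raise ValueError("x in local[x] must be an integer greater than 0")
    return "local[{}]".format(threads)


print(local_master(1))  # local[1]
print(local_master())   # depends on the machine, e.g. local[8] with 8 cores
```

The result can be passed to the builder as `.master(local_master())`. Spark also accepts `local[*]`, which asks it to use as many threads as there are logical cores, so a helper like this is mainly useful when you want an explicit number.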

## 1.2 SparkContext
SparkContext is the main entry point for Spark functionality. 
A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.

Here, we retrieve the settings of the current SparkContext.

In [None]:
spark.getConf().getAll()
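
`getAll()` returns the settings as a list of `(key, value)` pairs, which is awkward to search by eye. The sketch below shows one way to make that list easier to read. To keep it runnable without a Spark installation, it uses a small stand-in list instead of a live call; the two keys are real Spark settings, and the values are assumed to mirror the session built in section 1.1.

```python
# Stand-in for the list returned by spark.getConf().getAll().
# With a running SparkContext you could use: settings = spark.getConf().getAll()
settings = [
    ("spark.master", "local[1]"),
    ("spark.app.name", "SparkByExamples.com"),
]

# Convert the list of pairs into a dict so a setting can be looked up by key
conf = dict(settings)
print(conf["spark.master"])

# Print every setting, sorted by key, one per line
for key, value in sorted(conf.items()):
    print("{} = {}".format(key, value))
```

A real SparkContext reports more settings than these two, but the same conversion applies.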

## 1.3 Log Levels
When you run operations in PySpark, it produces many **log** messages describing the different operations. 
We can adjust the **log level** to control how much information is displayed. 
By default, the log level in PySpark is **WARN**. The available log levels are:

| Log Level | Meaning |
| ---- | ---- |
| DEBUG | The DEBUG level designates fine-grained informational events that are most useful for debugging an application. |
| INFO | The INFO level designates informational messages that highlight the progress of the application at a coarse-grained level. |
| WARN | The WARN level designates potentially harmful situations. |
| ERROR | The ERROR level designates error events that might still allow the application to continue running. |
| TRACE | The TRACE level designates finer-grained informational events than DEBUG. |
| FATAL | The FATAL level designates very severe error events that will presumably lead the application to abort. |
| ALL | The ALL level has the lowest possible rank and is intended to turn on all logging. |
| OFF | The OFF level has the highest possible rank and is intended to turn off logging. |


In [None]:
# To print every log message produced by an operation, set the log level to ALL
spark.setLogLevel("ALL")
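
The table lists the levels but not how they rank against each other. The plain-Python sketch below illustrates the usual ordering (ALL < TRACE < DEBUG < INFO < WARN < ERROR < FATAL < OFF) and which message levels remain visible for a given threshold. The function `visible_levels` is a hypothetical illustration, not part of PySpark.

```python
# Levels ordered from lowest to highest rank
LEVELS = ["ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF"]
RANK = {name: rank for rank, name in enumerate(LEVELS)}


def visible_levels(threshold):
    """Return the message levels that are shown when the log level is `threshold`."""
    # ALL and OFF are thresholds only; real messages use TRACE through FATAL
    message_levels = LEVELS[1:-1]
    return [lvl for lvl in message_levels if RANK[lvl] >= RANK[threshold]]


print(visible_levels("WARN"))  # ['WARN', 'ERROR', 'FATAL']
print(visible_levels("ALL"))   # ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']
print(visible_levels("OFF"))   # []
```

In other words, setting a level shows messages at that level and every more severe one.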

## 1.4 Hello world
If the following code runs, PySpark is working. Let's run a simple program that counts how often each letter occurs. 

In [None]:
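from operator import add

# `spark` is the SparkContext created in section 1.1
data = spark.parallelize(list("Hello World"))

counts = (
    data.map(lambda ch: (ch, 1))                    # pair each character with a count of 1
        .reduceByKey(add)                           # sum the counts for each character
        .sortBy(lambda kv: kv[1], ascending=False)  # most frequent characters first
        .collect()                                  # bring the result back to the driver
)

for letter, count in counts:
    print("{}: {}".format(letter, count))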
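
To check the result, the same count can be reproduced without Spark. The sketch below mirrors the three steps (`map`, `reduceByKey`, `sortBy`) in plain Python; it is only an illustration of the logic, not how Spark executes it.

```python
from collections import defaultdict

text = "Hello World"

# map: pair each character with a count of 1
pairs = [(ch, 1) for ch in text]

# reduceByKey(add): sum the counts for each character
totals = defaultdict(int)
for ch, n in pairs:
    totals[ch] += n

# sortBy(..., ascending=False): most frequent characters first
counts = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for letter, count in counts:
    print("{}: {}".format(letter, count))
```

Both versions should report `l` three times and `o` twice, with every other character (including the space) once. The order of characters with equal counts may differ between Spark and this sketch.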
