## Welcome to this course "Getting started with Apache Spark"
## Here is the 2nd video: Setting up the PySpark environment in Colab

![PySpark](https://drive.google.com/uc?id=1oU2tHXn4Tb4NJ0GQLbFQanLUVWj-3M-G)

## Contents
- Google Colab Introduction
- Introduction and Installation steps
- History of entry points to Spark Application - SparkContext/SQLContext/SparkSession, etc.
- Initialize Spark Session and read data

## Google Colab Introduction
- Google Colab is a hosted notebook service, similar to Jupyter Notebook, that requires no setup. It also provides free computing resources, including GPUs
- You only need a Google account (for example, a Gmail account) to write code and use Colab's resources
- Link: https://colab.research.google.com/

## Introduction and Installation process

- Install OpenJDK
  - Spark is written in Scala and runs on the JVM (Java Virtual Machine), so a Java installation is required
  - OpenJDK is a free and open-source implementation of the Java Platform
  - The JDK (Java Development Kit) is a software development kit for building Java applications
  - It bundles the Java class libraries and the JVM, which executes Java bytecode. JDK builds are platform dependent
  - Older Spark releases (2.x) do not support Java 11, while Java 8 works with both Spark 2.x and 3.x. So, this notebook uses Java 8.
- Install findspark and pyspark python libraries
- Add environment variables
- Start PySpark session
- Load data into this notebook

### Current directory

In [2]:
# print working directory
!pwd

# List files and folders
!ls

# Check the open jdk version on colab
!ls /usr/lib/jvm/

/content
ignore_this_folder  sample_data
default-java		   java-11-openjdk-amd64     java-8-openjdk-amd64
java-1.11.0-openjdk-amd64  java-1.8.0-openjdk-amd64


### Install Java 8

In [3]:
# Download and install Java 8
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Hit:1 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:4 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:5 https://cloud.r-project.org/bin/linux/ubuntu bion

In [4]:
# Check if we have java 8 or not
!ls /usr/lib/jvm/

default-java		   java-11-openjdk-amd64     java-8-openjdk-amd64
java-1.11.0-openjdk-amd64  java-1.8.0-openjdk-amd64


### Download Apache Spark binary

In [5]:
# Download the Apache Spark binary. The link depends on the version: downloads.apache.org keeps only current releases,
# so update the version (or switch to archive.apache.org/dist/spark/) if this link no longer works
!wget -q https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz

# Extract the archive
!tar -xvzf spark-3.0.2-bin-hadoop2.7.tgz

spark-3.0.2-bin-hadoop2.7/
spark-3.0.2-bin-hadoop2.7/R/
spark-3.0.2-bin-hadoop2.7/R/lib/
spark-3.0.2-bin-hadoop2.7/R/lib/sparkr.zip
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/worker/
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/worker/worker.R
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/worker/daemon.R
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/tests/
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/tests/testthat/
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/tests/testthat/test_basic.R
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/profile/
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/profile/shell.R
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/profile/general.R
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/doc/
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/doc/sparkr-vignettes.html
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/doc/sparkr-vignettes.Rmd
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/doc/sparkr-vignettes.R
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/doc/index.html
spark-3.0.2-bin-hadoop2.7/R/lib/SparkR/R/
spark-3.0.2-
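
The download link repeats the Spark and Hadoop versions in several places, which makes it easy to update one and miss another. Below is a minimal sketch that builds the link from the two version strings, assuming the naming pattern seen in the link above (`spark-<version>-bin-hadoop<version>.tgz`) also holds for the release you pick. The helper names `spark_tarball_name` and `spark_download_url` are hypothetical and not part of Spark.

```python
# Hypothetical helpers: build the tarball name and download URL from version strings.
# Assumption: the release follows the same naming pattern as the link used in this notebook.
DEFAULT_BASE = "https://downloads.apache.org/spark"


def spark_tarball_name(spark_version, hadoop_version):
    """Return the archive file name, e.g. spark-3.0.2-bin-hadoop2.7.tgz."""
    return f"spark-{spark_version}-bin-hadoop{hadoop_version}.tgz"


def spark_download_url(spark_version, hadoop_version, base=DEFAULT_BASE):
    """Return the full download URL for the given Spark and Hadoop versions."""
    name = spark_tarball_name(spark_version, hadoop_version)
    return f"{base}/spark-{spark_version}/{name}"


# Same versions as in the cell above
url = spark_download_url("3.0.2", "2.7")
print(url)
```

The printed URL can then be passed to `!wget -q`. If a release is no longer listed under downloads.apache.org, passing `base="https://archive.apache.org/dist/spark"` may work instead; check that the file exists before relying on it.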

### Install necessary libraries

In [6]:
# Install findspark: Adds Pyspark to sys.path at runtime
!pip install -q findspark

# Install pyspark
!pip install pyspark



In [7]:
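# findspark.init() needs to know where Spark is installed.
# SPARK_HOME is not set yet, so this call fails with a ValueError
import findspark
findspark.init()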

ValueError: ignored

### Add OS environment variables

In [8]:
# Add environment variables so that Java and Spark can be located
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"

# findspark will locate spark in the system
import findspark
findspark.init()
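
A wrong path in either variable usually only shows up later as a confusing error. Here is a small sketch for checking both variables before calling `findspark.init()`; the helper `missing_spark_paths` is a hypothetical name, and it only checks that the directories exist, not that they contain a working Java or Spark installation.

```python
import os


def missing_spark_paths(env):
    """Return the required variables that are unset or do not point to an existing directory."""
    required = ("JAVA_HOME", "SPARK_HOME")
    return [name for name in required if not os.path.isdir(env.get(name, ""))]


problems = missing_spark_paths(os.environ)
if problems:
    print("Check these variables before calling findspark.init():", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories")
```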

### Full code

In [None]:
# # Full code for your reference: We will use this block for setting up the environment in our future videos

# # Install java 8
# !apt-get update
# !apt-get install openjdk-8-jdk-headless -qq > /dev/null

# # Download Apache Spark binary: This link can change based on the version. Update this link with the latest version before using
# !wget -q https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz

# # Unzip file
# !tar -xvzf spark-3.0.2-bin-hadoop2.7.tgz

# # Install findspark: Adds Pyspark to sys.path at runtime
# !pip install -q findspark

# # Install pyspark
# !pip install pyspark

# # Add environment variables
# import os
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"

# # findspark will locate spark in the system
# import findspark
# findspark.init()

## SparkSession - Initialize entry point to Spark
- Every Spark application needs an entry point through which it coordinates tasks, reads and writes data, and accesses other Spark features.

<center>
<div><img title="Spark Application" src="https://drive.google.com/uc?id=1f908ipDMGQ03A0UewfdrxA6mmuqdk1Yj" width="600"/>
</div></center>

### Entry points in Spark 1.x and Spark 2.x
- Spark 1.x provides 3 entry points: SparkContext, SQLContext and HiveContext. Each has to be initialized separately.
- Spark 2.x introduces a new entry point, SparkSession, which combines all of the above contexts; their functionality is accessed through the SparkSession object.
- Spark Context:
  - The driver process of a Spark application uses the SparkContext to communicate with the cluster and its resource manager, and to coordinate and execute jobs.
    ```
    from pyspark import SparkContext, SparkConf

    master = "local[*]"  # example master URL: run locally on all available cores
    conf = SparkConf() \
          .setAppName('app') \
          .setMaster(master)
    sc = SparkContext(conf=conf)
    ```
- SQL Context:
  - SQLContext is the entry point to Spark SQL, the Spark module for structured data processing. Once a SQLContext is initialised, it can be used to run SQL-like operations over DataFrames.
    ```
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    master = "local[*]"  # example master URL: run locally on all available cores
    conf = SparkConf() \
        .setAppName('app') \
        .setMaster(master)
    sc = SparkContext(conf=conf)
    sql_context = SQLContext(sc)
    ```
- Spark Session:
  - SparkSession combines the above contexts behind a single object, so there is only one entry point to create and keep track of.
    ```
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
            .master("local") \
            .appName("Hands-on PySpark on Google Colab") \
            .getOrCreate()
    ```

### Initialize SparkSession

In [9]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("Hands-on PySpark on Google Colab") \
        .getOrCreate()

In [10]:
spark

In [11]:
spark.sparkContext
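
The two cells above display the session and the SparkContext it wraps. As a rough sketch, the same information can be collected into a plain dictionary, which is handy for logging which Spark version and master a notebook ran against. `describe_session` is a hypothetical helper name; it only reads the `version` property of the session and the `appName` and `master` properties of its SparkContext.

```python
def describe_session(spark):
    """Collect a few identifying properties of a SparkSession into a dict."""
    sc = spark.sparkContext  # the SparkContext wrapped by the session
    return {
        "spark_version": spark.version,
        "app_name": sc.appName,
        "master": sc.master,
    }


# Usage in this notebook, after `spark` has been created:
# describe_session(spark)
```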

## Load data from Google Colab

In [12]:
!pwd
!ls

/content
ignore_this_folder  spark-3.0.2-bin-hadoop2.7
sample_data	    spark-3.0.2-bin-hadoop2.7.tgz


In [13]:
# Load data. With no options set, the header row is read as data and the columns are named _c0, _c1, ...
spark_data = spark.read.format('csv').load("/content/sample_data/california_housing_train.csv")

In [14]:
# Print the top 5 rows using .show() function
spark_data.show(5, truncate=False)

+-----------+---------+------------------+-----------+--------------+-----------+----------+-------------+------------------+
|_c0        |_c1      |_c2               |_c3        |_c4           |_c5        |_c6       |_c7          |_c8               |
+-----------+---------+------------------+-----------+--------------+-----------+----------+-------------+------------------+
|longitude  |latitude |housing_median_age|total_rooms|total_bedrooms|population |households|median_income|median_house_value|
|-114.310000|34.190000|15.000000         |5612.000000|1283.000000   |1015.000000|472.000000|1.493600     |66900.000000      |
|-114.470000|34.400000|19.000000         |7650.000000|1901.000000   |1129.000000|463.000000|1.820000     |80100.000000      |
|-114.560000|33.690000|17.000000         |720.000000 |174.000000    |333.000000 |117.000000|1.650900     |85700.000000      |
|-114.570000|33.640000|14.000000         |1501.000000|337.000000    |515.000000 |226.000000|3.191700     |73400.000000

In [15]:
# Setting header='true' makes the reader use the first row as column names instead of data
spark_data = spark.read.format('csv').options(header='true').load("/content/sample_data/california_housing_train.csv")
spark_data.show(5, truncate=False)

+-----------+---------+------------------+-----------+--------------+-----------+----------+-------------+------------------+
|longitude  |latitude |housing_median_age|total_rooms|total_bedrooms|population |households|median_income|median_house_value|
+-----------+---------+------------------+-----------+--------------+-----------+----------+-------------+------------------+
|-114.310000|34.190000|15.000000         |5612.000000|1283.000000   |1015.000000|472.000000|1.493600     |66900.000000      |
|-114.470000|34.400000|19.000000         |7650.000000|1901.000000   |1129.000000|463.000000|1.820000     |80100.000000      |
|-114.560000|33.690000|17.000000         |720.000000 |174.000000    |333.000000 |117.000000|1.650900     |85700.000000      |
|-114.570000|33.640000|14.000000         |1501.000000|337.000000    |515.000000 |226.000000|3.191700     |73400.000000      |
|-114.570000|33.570000|20.000000         |1454.000000|326.000000    |624.000000 |262.000000|1.925000     |65500.000000

In [16]:
# Print the schema of the loaded dataframe
spark_data.printSchema()

root
 |-- longitude: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- housing_median_age: string (nullable = true)
 |-- total_rooms: string (nullable = true)
 |-- total_bedrooms: string (nullable = true)
 |-- population: string (nullable = true)
 |-- households: string (nullable = true)
 |-- median_income: string (nullable = true)
 |-- median_house_value: string (nullable = true)



In [17]:
# Setting inferSchema='true' makes Spark scan the data to infer column types; without it, every column is read as a string
spark_data = spark.read.format('csv').options(header='true', inferSchema='true').load("/content/sample_data/california_housing_train.csv")
spark_data.show(5, truncate=False)
spark_data.printSchema()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|-114.31  |34.19   |15.0              |5612.0     |1283.0        |1015.0    |472.0     |1.4936       |66900.0           |
|-114.47  |34.4    |19.0              |7650.0     |1901.0        |1129.0    |463.0     |1.82         |80100.0           |
|-114.56  |33.69   |17.0              |720.0      |174.0         |333.0     |117.0     |1.6509       |85700.0           |
|-114.57  |33.64   |14.0              |1501.0     |337.0         |515.0     |226.0     |3.1917       |73400.0           |
|-114.57  |33.57   |20.0              |1454.0     |326.0         |624.0     |262.0     |1.925        |65500.0           |
+---------+--------+----
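
Schema inference requires an extra pass over the data, so for larger files it can be worth declaring the schema up front. Below is a minimal sketch, assuming every column in this file is numeric (as the values in the output above suggest) and should be read as `DOUBLE`. The helper `double_schema_ddl` is a hypothetical name; the resulting DDL string is the kind of value that the reader's `.schema(...)` method accepts.

```python
# Column names as they appear in the CSV header
COLUMNS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value",
]


def double_schema_ddl(columns):
    """Build a DDL schema string that declares every column as DOUBLE."""
    return ", ".join(f"{name} DOUBLE" for name in columns)


schema_ddl = double_schema_ddl(COLUMNS)
print(schema_ddl)

# With the `spark` session from this notebook, the string can replace inferSchema:
# spark_data = (spark.read.format("csv")
#               .options(header="true")
#               .schema(schema_ddl)
#               .load("/content/sample_data/california_housing_train.csv"))
```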

In [19]:
# Get the number of rows in the dataframe
spark_data.count()

17000

## Connect Google drive

In [None]:
# Use the below code to connect to Google drive
from google.colab import drive
drive.mount('/content/drive')
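
Once the drive is mounted, files in it are visible as ordinary paths under the mount point and can be passed to `spark.read` like the sample data above. This is a small sketch for building such a path, assuming the personal Drive folder appears as `MyDrive` under the mount point (check with `!ls /content/drive` if unsure). The helper `drive_file` and the file `datasets/housing.csv` are hypothetical.

```python
import posixpath

DRIVE_MOUNT = "/content/drive"  # mount point used in the cell above
DRIVE_ROOT = "MyDrive"          # assumed name of the personal Drive folder under the mount


def drive_file(relative_path, mount=DRIVE_MOUNT, root=DRIVE_ROOT):
    """Return the absolute path of a file stored in the mounted Google Drive."""
    # Strip leading slashes so the relative path cannot override the mount point
    return posixpath.join(mount, root, relative_path.lstrip("/"))


csv_path = drive_file("datasets/housing.csv")  # hypothetical file in your Drive
print(csv_path)

# With the `spark` session from this notebook:
# drive_data = spark.read.format("csv").options(header="true").load(csv_path)
```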

## Summary
- How to set up the PySpark environment in Google Colab
- History of entry points (SparkContext, SQLContext, SparkSession)
- Initializing SparkSession
- Reading data using SparkSession object

### References
- https://www.guru99.com/difference-between-jdk-jre-jvm.html
- https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
- https://towardsdatascience.com/sparksession-vs-sparkcontext-vs-sqlcontext-vs-hivecontext-741d50c9486a

### Thank you :)
- That's the end of this video. If you liked it, please like, share and subscribe to my channel.
- If you are on LinkedIn, please tag me and share your thoughts on this video and the series "Getting started with PySpark - Hands on". This will motivate me to make more videos.
<div>
<img src="https://drive.google.com/uc?id=1ttB2gJaw0cXuJfj6GBx5VaYf2ArjiRXM" width="200"/>
</div>