## Welcome to the 1st video of this course "Getting started with Apache Spark": 
# Basics of Apache Spark and PySpark

![PySpark](https://drive.google.com/uc?id=1oU2tHXn4Tb4NJ0GQLbFQanLUVWj-3M-G)

If you haven't watched the introduction video of this course yet, check it out. Here is the [link](https://www.youtube.com/watch?v=2NrWSL_qh3A&list=PLX-qVd8z5JGeolxBVY4APHUnkbFEHJqu5), or click on the video above.

<center><div><img src="https://drive.google.com/uc?id=1VxZTnML7NviUOVB_nx_UBMEyp6GjkLwS" width="400"/>
</div></center>

## Contents
- Introduction to Apache Spark and PySpark
- Features of PySpark
- Run time architecture of PySpark

## Introduction
- Apache Spark is a widely used big data tool: a parallel, distributed processing framework. Its core strength is handling huge amounts of data by spreading the work across many machines.
- Apache Spark supports the following languages: Scala, Java, Python and R
- PySpark is an interface for Apache Spark in Python. In this video, let's spend some time understanding the features and internals of Apache Spark and PySpark.
- Understanding the underlying concepts will help you relate to the high level operations better.

<br>
<center><div><img src="https://drive.google.com/uc?id=1arWo0Pxd8cADKmrK4XMCDqzmMOo0jyhF" width="400"/>
</div></center>

## Features of PySpark

- It can distribute data across nodes and parallelize tasks
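- Immutable: DataFrames cannot be changed in place. Applying a transformation to an existing DataFrame creates a new DataFrame and leaves the original untouched
- Lazy evaluation: Transformations are not run immediately. Spark records them in a DAG (Directed Acyclic Graph) and executes them only when an action asks for a result
- Cache & Persistence: Data can be cached in memory and/or on disk, depending on the situation
- Fault Tolerance: If data is lost (for example, when a node fails), Spark can recompute it from the recorded transformations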
- Supports SQL
- Supports several modes of deployment: Standalone, Apache Mesos, Hadoop YARN, Kubernetes, etc. These are all different cluster managers
  - As we don't have a cluster set up, we will use "local" mode to run Spark on our laptops/Colab. I will tell you more about this when we create the Spark session.
<center>
<div><img src="https://drive.google.com/uc?id=189zqaqweDmomARu3WHCli86p3JV4Ckhk" width="300"/>
</div></center>

- Reads data from PostgreSQL, Cassandra, Amazon S3, HDFS, Blob storage, and many other data sources

Reference: https://spark.apache.org/
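
To make immutability and lazy evaluation a bit more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the `LazyDataset` class and its `map`, `filter`, and `collect` methods are hypothetical names chosen for illustration. Transformations only record a step in a plan and return a new object, and nothing runs until the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: immutable, with lazily applied steps."""

    def __init__(self, source, plan=()):
        self._source = tuple(source)  # the input data is never modified
        self._plan = tuple(plan)      # recorded transformations, in order

    def map(self, fn):
        # Transformation: return a NEW dataset with one more recorded step.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, predicate):
        # Transformation: again, only the plan grows; no data is touched.
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded plan actually executed.
        rows = iter(self._source)
        for kind, fn in self._plan:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every value that double() really processes


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset([1, 2, 3, 4, 5])
doubled = numbers.map(double)            # nothing runs here
large = doubled.filter(lambda x: x > 4)  # nothing runs here either

print(len(calls))         # 0: no work has been done yet
print(large.collect())    # [6, 8, 10]
print(numbers.collect())  # [1, 2, 3, 4, 5]: the original is unchanged
```

Because every dataset keeps the full list of steps that produced it, a lost result could simply be recomputed from the source. That is the same idea, in miniature, behind the fault tolerance feature listed above.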

## Run time Architecture of Apache Spark
- Spark works on a master/worker concept, where the driver is the master and the executors running on worker nodes do the work. When you run spark-submit (submit the code to the driver), the driver creates a Spark context, which is the heart of the Spark application. Your code runs on the executors, and the resources are managed by the cluster manager.
- Below is the Spark Application image

<center>
<div><img title="Spark Application" src="https://drive.google.com/uc?id=1f908ipDMGQ03A0UewfdrxA6mmuqdk1Yj" width="600"/>
</div></center>
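
The driver/executor split can be mimicked in plain Python. This is only a rough sketch under simplifying assumptions: a thread pool stands in for the executors, there is no real cluster manager, and the function names (`split_into_partitions`, `task`, `run_application`) are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: slice the data so that each task gets one partition."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """Executor side: the user code applied to a single partition."""
    return sum(x * x for x in partition)


def run_application(data, num_executors=4):
    partitions = split_into_partitions(data, num_executors)
    # The thread pool plays the role of the executors that a cluster
    # manager would normally provide.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))
    # Partial results come back to the driver, which combines them.
    return sum(partial_results)


total = run_application(list(range(1, 101)))
print(total)  # 338350 (sum of the squares of 1..100)
```

The point of the sketch is the shape of the flow: the driver decides how to split the work, the executors process partitions in parallel, and the driver gathers the partial results.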

- Let's divide the process into steps:
  - When you are working on a cluster, you can use the "spark-submit" command to launch the application on the cluster. In spark-submit, you can specify the resources required (number of executors, driver memory, executor memory, etc.), add credentials for data sources, and pass any other config params.
    ```
    # Spark submit example for YARN
    # Requests 10 executors, each with 8 cores and 8 GB of memory;
    # "arg1" is passed to the script as an argument
    spark-submit --master yarn \
                 --deploy-mode cluster \
                 --num-executors 10 \
                 --executor-cores 8 \
                 --executor-memory 8g \
                 preprocess_data.py arg1

    # Spark submit example for Kubernetes
    # Requests 10 executors with 2 cores each; "12" is passed to the script
    spark-submit --master k8s://192.121.165.213:443 \
                 --deploy-mode cluster \
                 --num-executors 10 \
                 --executor-cores 2 \
                 preprocess_data.py 12
    ```
    - Reference: https://sparkbyexamples.com/spark/spark-submit-command/

  - The driver program is launched and the main method of the code runs in the driver. It creates the Spark context.
    - Spark context establishes a connection to the spark execution environment
    - Spark context acts as a master of Spark application
  - Based on the configuration provided in the spark-submit, the driver requests the cluster manager to provide the necessary resources for launching executors. The cluster manager launches the required executors (for example, 10 executors with 8 GB of memory each).
  - One of the responsibilities of the driver program is to transform the user code into a DAG, convert it into jobs/stages/tasks, and schedule the tasks on the executors.
    - When the code is executed, the Spark context creates a DAG (Directed Acyclic Graph) and optimizes it to save memory and time.
  - Once the tasks are executed, the executors send the output back to the driver program.
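
To see why a DAG is convenient for scheduling, here is a small sketch using Python's standard `graphlib` module (Python 3.9+). The step names in `plan` are hypothetical, and the "waves" computed here are a simplification: Spark's real stages are formed differently, so treat this only as an illustration of ordering work by its dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step maps to the set of steps it depends on.
plan = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
    "write_output": {"aggregate"},
}


def schedule(dag):
    """Group steps into waves; all steps in a wave have their inputs ready."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that can run right now
        waves.append(ready)
        sorter.done(*ready)  # mark them finished to unlock dependents
    return waves


for number, wave in enumerate(schedule(plan)):
    print(number, wave)
# 0 ['read_customers', 'read_orders']
# 1 ['filter_orders']
# 2 ['join']
# 3 ['aggregate']
# 4 ['write_output']
```

The two read steps land in the same wave because neither depends on the other, so they could run in parallel. Because the graph has no cycles, a valid execution order always exists.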

### Summary:
- We have discussed
  - What is PySpark and Apache Spark
  - Features of PySpark
  - Internals of Apache Spark/PySpark
- Understanding the internals of Apache Spark is very important in building scalable applications. 
- NOTE: Writing code alone is not sufficient. Keeping the code and the configuration in sync plays a major role when it comes to handling big data. Unless you see the bigger picture, you cannot write optimized code.

That covers the theory. The coming videos will focus mainly on hands-on PySpark.

### Thank you :)
-  That's the end of this video. If you liked it, please like, share and subscribe to my channel.
<div>
<img src="https://drive.google.com/uc?id=1ttB2gJaw0cXuJfj6GBx5VaYf2ArjiRXM" width="200"/>
</div>

### References:
- https://spark.apache.org/
- https://sparkbyexamples.com/spark/spark-submit-command/
- https://data-flair.training/blogs/how-apache-spark-works/