<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Terminology" data-toc-modified-id="Terminology-0.1">Terminology</a></span><ul class="toc-item"><li><span><a href="#SparkContext" data-toc-modified-id="SparkContext-0.1.1">SparkContext</a></span></li><li><span><a href="#SparkSession" data-toc-modified-id="SparkSession-0.1.2">SparkSession</a></span></li></ul></li></ul></li><li><span><a href="#Spark-SQL:" data-toc-modified-id="Spark-SQL:-1">Spark SQL:</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Creating-a-SparkSession" data-toc-modified-id="Creating-a-SparkSession-1.0.1">Creating a SparkSession</a></span></li><li><span><a href="#Reading-Data" data-toc-modified-id="Reading-Data-1.0.2">Reading Data</a></span></li></ul></li><li><span><a href="#Transformations" data-toc-modified-id="Transformations-1.1">Transformations</a></span><ul class="toc-item"><li><span><a href="#Selecting-Columns" data-toc-modified-id="Selecting-Columns-1.1.1">Selecting Columns</a></span></li><li><span><a href="#Filtering-Rows" data-toc-modified-id="Filtering-Rows-1.1.2">Filtering Rows</a></span></li><li><span><a href="#Grouping-and-Aggretting" data-toc-modified-id="Grouping-and-Aggretting-1.1.3">Grouping and Aggregating</a></span></li><li><span><a href="#Joining" data-toc-modified-id="Joining-1.1.4">Joining</a></span></li><li><span><a href="#Ordering" data-toc-modified-id="Ordering-1.1.5">Ordering</a></span></li><li><span><a href="#Windowing" data-toc-modified-id="Windowing-1.1.6">Windowing</a></span></li></ul></li><li><span><a href="#Actions" data-toc-modified-id="Actions-1.2">Actions</a></span><ul class="toc-item"><li><span><a href="#Counting-Rows" data-toc-modified-id="Counting-Rows-1.2.1">Counting Rows</a></span></li><li><span><a href="#Collecting-Data" data-toc-modified-id="Collecting-Data-1.2.2">Collecting Data</a></span></li><li><span><a href="#Writting-Data" data-toc-modified-id="Writting-Data-1.2.3">Writing Data</a></span></li><li><span><a href="#Showing-Data" data-toc-modified-id="Showing-Data-1.2.4">Showing Data</a></span></li><li><span><a href="#Summarizing-Data" data-toc-modified-id="Summarizing-Data-1.2.5">Summarizing Data</a></span></li></ul></li></ul></li></ul></div>

In [1]:
import pyspark

In [2]:

import datetime

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [27]:
pyspark.__version__

'3.0.1'

- [Spark Programming Guide](https://spark.apache.org/docs/3.0.1/index.html)
- [Spark Python API Docs](https://spark.apache.org/docs/latest/api/python/index.html#)
- [MySQL Notes](https://www.tutorialspoint.com/mysql/mysql-select-database.htm)
- [SQL by W3Schools](https://www.w3schools.com/sql/)
- [MySQL Tutorial](https://www.mysqltutorial.org/)

```txt
The following environment variable holds the path to the Spark home directory:

    $SPARK_HOME

Run the following command in the terminal to print the path:

    $ echo $SPARK_HOME

Use the following command to submit one of the example jobs bundled with Spark:

    $SPARK_HOME/bin/spark-submit $SPARK_HOME/examples/src/main/python/pi.py 10

```

### Terminology

#### SparkContext

The SparkContext is the entry point to the underlying Spark engine and was the primary entry point to Spark before version 2.0. It coordinates resources and orchestrates data processing in a Spark application.

The SparkContext sets up internal services such as the DAG scheduler and the task scheduler, and it connects to the cluster manager (standalone, YARN, Mesos, or Kubernetes) to acquire executors. Through it, an application can also reach external storage systems such as the Hadoop Distributed File System (HDFS) and, with the appropriate connectors, Apache Cassandra and Apache HBase.

In summary, the SparkContext is the low-level API of Spark: it handles job scheduling and task dispatching and communicates with the cluster manager.

#### SparkSession

The SparkSession, introduced in Spark 2.0, is a higher-level entry point that provides a single unified interface to Spark. It combines the functionality of the SparkContext, SQLContext, and HiveContext in a single object.

The SparkSession integrates directly with Spark SQL, the Spark module for structured data processing. It lets Spark applications read and write data in various file formats, such as CSV, JSON, and Parquet, and run SQL queries against that data. In Scala and Java, the SparkSession also exposes the typed Dataset API, a type-safe extension of the DataFrame API; PySpark offers only DataFrames.

In summary, the SparkSession is the high-level API for working with Spark, and the underlying SparkContext remains available as `spark.sparkContext`.

## Spark SQL:

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf     ## udf => UserDefinedFunction
from pyspark.sql.functions import desc, asc, sum as Fsum
from pyspark.sql.types import StringType, IntegerType


In [19]:
data_path = "/Users/a.momin/Data/sparkify_log_small.json"

In [20]:
spark = SparkSession.builder.appName('MySQL').getOrCreate() ## create Spark SQL Session

In [33]:
df = spark.read.json(data_path) # pyspark.sql.dataframe.DataFrame

In [23]:
## creates a temporary view against which we can run SQL queries.
df.createOrReplaceTempView('user_log')

In [26]:
spark.sql("SELECT * FROM user_log LIMIT 2").show()

#### Creating a SparkSession

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("myApp") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()


#### Reading Data

In [None]:
# CSV
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# JSON
df = spark.read.json("path/to/file.json")

# Parquet
df = spark.read.parquet("path/to/file.parquet")


### Transformations

#### Selecting Columns

In [2]:
df.select("column1", "column2")


#### Filtering Rows

In [None]:
df.filter(df.column1 == "value")

#### Grouping and Aggregating

In [None]:
df.groupBy("column1").agg({"column2": "sum"})

#### Joining

In [None]:
joined_df = df1.join(df2, "common_column")

#### Ordering

In [None]:
df.orderBy("column1")

#### Windowing

In [None]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.partitionBy("column1").orderBy("column2")
df.withColumn("row_number", row_number().over(window))

### Actions

#### Counting Rows

In [None]:
df.count()

#### Collecting Data

In [None]:
df.collect()

#### Writing Data

In [None]:
# CSV
df.write.csv("path/to/output.csv", header=True)

# JSON
df.write.json("path/to/output.json")

# Parquet
df.write.parquet("path/to/output.parquet")

#### Showing Data

In [None]:
df.show()

#### Summarizing Data

In [None]:
df.describe().show()