# What is PySpark?

PySpark is the Python API for Apache Spark, an open-source distributed computing platform and set of tools for real-time, large-scale data processing.
If you are already familiar with Python and tools like Pandas, PySpark is a useful tool to learn for building more scalable analyses and pipelines.

Apache Spark is a computational engine that handles large data sets by processing them in parallel and in batches. PySpark was created to connect Python with Spark, which is written in Scala. In addition to exposing the Spark API, PySpark uses the Py4J library to let you work with Resilient Distributed Datasets (RDDs) from Python.

![image.png](attachment:image.png)

Py4J is a popular library bundled with PySpark that lets Python interact dynamically with JVM (Java Virtual Machine) objects.

### PySpark features quite a few libraries:

- **PySparkSQL:** A PySpark library for SQL-like analysis of large volumes of structured or semi-structured data. You can also run SQL queries directly with PySparkSQL.


- **MLlib:** A machine learning (ML) library wrapper for PySpark and Spark. Numerous machine learning methods for dimensionality reduction, collaborative filtering, clustering, classification, regression, and other tasks are supported by MLlib.


- **GraphFrames:** A graph processing toolkit that provides a set of APIs for efficient graph analysis on top of PySpark core and PySparkSQL. It is geared toward fast distributed computing.

***Further Studies:***
- https://www.databricks.com/glossary/pyspark
- https://www.databricks.com/discover/introduction-to-data-analysis-workshop-series/intro-apache-spark
- https://www.dominodatalab.com/blog/considerations-for-using-spark-in-your-data-science-stack/
- https://www.databricks.com/session_na20/from-python-to-pyspark-and-back-again-unifying-single-host-and-distributed-deep-learning-with-maggy

# What is Databricks?

Databricks is a cloud-based data engineering platform that companies widely use to process, transform, analyze, and explore massive amounts of data. Machine learning models can also be explored in the same environment.

## Azure Databricks
Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform.

**Three environments are available using Azure Databricks:**
- Databricks SQL
- Databricks data science and engineering
- Databricks machine learning

**1. Databricks SQL**
This environment lets analysts who work with SQL run queries on Azure Data Lake, build visualizations, and develop and share dashboards.

**2. Databricks Data Science and Engineering**
This environment gives data engineers, data scientists, and machine learning engineers an interactive workspace to collaborate in.

**3. Databricks Machine Learning**
This environment provides managed services for experiment tracking, feature development, and model training.

***Further Studies:***
- https://hevodata.com/learn/what-is-databricks/
- https://intellipaat.com/blog/what-is-azure-databricks/
- https://www.databricks.com/product/azure
- https://learn.microsoft.com/en-us/azure/databricks/introduction/



![image.png](attachment:image.png)

# What is Azure Data Factory?
Azure Data Factory (ADF) is a cloud-based service that primarily handles the complete extract-transform-load (ETL) process for ingesting, preparing, and transforming all your data at scale. The ADF service provides a code-free user interface for simple authoring and single-pane-of-glass management. This allows users to build complex ETL processes that transform data using data flows or other compute services such as Azure HDInsight, Azure Databricks, Azure Synapse Analytics, and Azure SQL Database.
![image.png](attachment:image.png)

# Project Architecture
![image-2.png](attachment:image-2.png)

# Appreciation:
Special Thanks to the following people for their continuous contribution to the Azure Community:
- [Maheer Basha Shaik](https://www.linkedin.com/in/maheer-basha-shaik-b50247102/)
- [Adam Marczak](https://www.linkedin.com/in/adam-marczak/)
- [Krish Naik](https://www.linkedin.com/in/naikkrish/)
- [Ramesh Retnasamy | Cloud Data Engineer/ Architect](https://www.linkedin.com/in/ramesh-retnasamy/)
- [Microsoft Learn](https://learn.microsoft.com/en-us/training/paths/data-engineer-azure-databricks/)

YouTube | Medium | Courses
- [PySpark Tutorial](https://www.youtube.com/watch?v=_C8kWso4ne4)
- [Adam Marczak - Azure for Everyone](https://www.youtube.com/@Azure4Everyone)
- [WafaStudies](https://www.youtube.com/@WafaStudies)
- [Azure Data Factory](https://www.udemy.com/course/learn-azure-data-factory-from-scratch/learn/lecture/23973042#overview)
**Google, YouTube & Chatgpt😂**

# Introduction to PySpark

### Step 1: Install PySpark and other Libraries

**Note:** Create a virtual environment before installing PySpark on your local machine.

!pip install pyspark

### Step 2: Import all necessary libraries

In [2]:
import pyspark
import pandas as pd
from pyspark.sql import SparkSession

In [4]:
# Start a Spark session
# An app name is required; the first start may take a little longer.
spark = SparkSession.builder.config("spark.driver.host", "localhost").appName('tutorial').getOrCreate()

In [5]:
spark #This gives you the config of your spark cluster you just created

**Read Data using Spark**

In [6]:
df_pyspark_2 = spark.read.csv("Big_Data.csv", header=True, inferSchema=True)
# inferSchema=True asks Spark to infer each column's data type from the data

### Step 3: Data Transformation & Cleaning

In [7]:
# View each column's data type
df_pyspark_2.printSchema()

root
 |-- Order NO: integer (nullable = true)
 |-- Order Date: integer (nullable = true)
 |-- Countries: string (nullable = true)
 |-- Pie Flavor3: string (nullable = true)
 |-- Quantity4: integer (nullable = true)
 |-- Price: integer (nullable = true)
 |-- Amount Sold: integer (nullable = true)
 |-- Slice Or Whole Pie: string (nullable = true)
 |-- Pre-Order/In-Store Purchase: string (nullable = true)
 |-- Organic?: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- _c20: string (nullable = true)
 |-- _c21: string (nullable = true)
 |-- Pie Flavor22: string (nullable = true)
 |-- Quantity23: string (nullable = true)
 |-- _c24: string (nullable = true)


In [8]:
#View all columns
df_pyspark_2.columns

['Order NO',
 'Order Date',
 'Countries',
 'Pie Flavor3',
 'Quantity4',
 'Price',
 'Amount Sold',
 'Slice Or Whole Pie',
 'Pre-Order/In-Store Purchase',
 'Organic?',
 '_c10',
 '_c11',
 '_c12',
 '_c13',
 '_c14',
 '_c15',
 '_c16',
 '_c17',
 '_c18',
 '_c19',
 '_c20',
 '_c21',
 'Pie Flavor22',
 'Quantity23',
 '_c24',
 '_c25']

In [9]:
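# Convert "Order Date" to a date type
# Note: the outputs further down show null for "Order Date", which suggests the
# "dd-MM-yyyy" pattern does not match the raw values; adjust it to the actual data.
from pyspark.sql.functions import to_date, col
df_pyspark_2 = df_pyspark_2.withColumn("Order Date", to_date("Order Date","dd-MM-yyyy"))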

In [10]:
from pyspark.sql.types import IntegerType
df_pyspark_2 = df_pyspark_2.withColumn("Amount Sold", df_pyspark_2["Amount Sold"].cast(IntegerType()))

In [11]:
df_pyspark_2 = df_pyspark_2.withColumn("Price", df_pyspark_2["Price"].cast(IntegerType()))

In [12]:
df_pyspark_2.printSchema()

root
 |-- Order NO: integer (nullable = true)
 |-- Order Date: date (nullable = true)
 |-- Countries: string (nullable = true)
 |-- Pie Flavor3: string (nullable = true)
 |-- Quantity4: integer (nullable = true)
 |-- Price: integer (nullable = true)
 |-- Amount Sold: integer (nullable = true)
 |-- Slice Or Whole Pie: string (nullable = true)
 |-- Pre-Order/In-Store Purchase: string (nullable = true)
 |-- Organic?: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- _c20: string (nullable = true)
 |-- _c21: string (nullable = true)
 |-- Pie Flavor22: string (nullable = true)
 |-- Quantity23: string (nullable = true)
 |-- _c24: string (nullable = true)
 |-

**Drop all unwanted columns**

In [13]:
# Drop the empty "_c*" columns and the duplicated flavor/quantity columns
df_pyspark_2 = df_pyspark_2.drop(*('_c10','_c11','_c12','_c13','_c14','_c15','_c16','_c17','_c18',
                                 '_c19','_c20','_c21','Pie Flavor22','Quantity23','_c24','_c25'))

In [15]:
df_pyspark_2.columns

['Order NO',
 'Order Date',
 'Countries',
 'Pie Flavor3',
 'Quantity4',
 'Price',
 'Amount Sold',
 'Slice Or Whole Pie',
 'Pre-Order/In-Store Purchase',
 'Organic?']

In [16]:
df_pyspark_2.printSchema() # Check data Schema

root
 |-- Order NO: integer (nullable = true)
 |-- Order Date: date (nullable = true)
 |-- Countries: string (nullable = true)
 |-- Pie Flavor3: string (nullable = true)
 |-- Quantity4: integer (nullable = true)
 |-- Price: integer (nullable = true)
 |-- Amount Sold: integer (nullable = true)
 |-- Slice Or Whole Pie: string (nullable = true)
 |-- Pre-Order/In-Store Purchase: string (nullable = true)
 |-- Organic?: string (nullable = true)



**Change the Column Names**

In [17]:
df_pyspark_2= df_pyspark_2.withColumnRenamed("Pie Flavor3","Pie_Flavor")
df_pyspark_2= df_pyspark_2.withColumnRenamed("Quantity4","Quantity")

In [18]:
df_pyspark_2.columns

['Order NO',
 'Order Date',
 'Countries',
 'Pie_Flavor',
 'Quantity',
 'Price',
 'Amount Sold',
 'Slice Or Whole Pie',
 'Pre-Order/In-Store Purchase',
 'Organic?']

In [21]:
df_pyspark_2.select("Pie_Flavor").show()

+------------------+
|        Pie_Flavor|
+------------------+
|             Apple|
|             Apple|
|             Apple|
|             Apple|
|             Apple|
|             Apple|
|             Apple|
|             Apple|
|            Cherry|
|Strawberry Rhubarb|
|            Cherry|
|Strawberry Rhubarb|
|            Cherry|
|            Cherry|
|             Apple|
|            Cherry|
|Strawberry Rhubarb|
|             Apple|
|             Apple|
|             Fudge|
+------------------+
only showing top 20 rows



In [22]:
df_pyspark_2.show()

+--------+----------+--------------------+------------------+--------+-----+-----------+------------------+---------------------------+--------+
|Order NO|Order Date|           Countries|        Pie_Flavor|Quantity|Price|Amount Sold|Slice Or Whole Pie|Pre-Order/In-Store Purchase|Organic?|
+--------+----------+--------------------+------------------+--------+-----+-----------+------------------+---------------------------+--------+
|   10001|      null|             Algeria|             Apple|       5| 6721|      33605|             Slice|                  Pre-Order|      No|
|   10002|      null|              Angola|             Apple|       3| 4572|      13716|             Whole|                   In-Store|     Yes|
|   10003|      null|               Benin|             Apple|       3| 3862|      11586|             Whole|                   In-Store|      No|
|   10004|      null|            Botswana|             Apple|       4| 8302|      33208|             Whole|                  Pre-O

**Perform some Analysis**

In [24]:
df_pyspark_2.filter(df_pyspark_2["Amount Sold"] >= 30000).show()

+--------+----------+----------+------------------+--------+-----+-----------+------------------+---------------------------+--------+
|Order NO|Order Date| Countries|        Pie_Flavor|Quantity|Price|Amount Sold|Slice Or Whole Pie|Pre-Order/In-Store Purchase|Organic?|
+--------+----------+----------+------------------+--------+-----+-----------+------------------+---------------------------+--------+
|   10001|      null|   Algeria|             Apple|       5| 6721|      33605|             Slice|                  Pre-Order|      No|
|   10004|      null|  Botswana|             Apple|       4| 8302|      33208|             Whole|                  Pre-Order|      No|
|   10008|      null|  Cameroon|             Apple|       5| 9463|      47315|             Slice|                  Pre-Order|     Yes|
|   10015|      null|  DR Congo|             Apple|       4| 7937|      31748|             Whole|                  Pre-Order|     Yes|
|   10019|      null|  Eswatini|             Apple|    

In [28]:
# Rows with Price between 7000 and 9000 (inclusive), showing three columns
df_pyspark_2.filter((df_pyspark_2['Price']>=7000) & (df_pyspark_2['Price']<=9000)).select(["Countries","Pie_Flavor","Amount Sold"]).show()

+-------------+------------------+-----------+
|    Countries|        Pie_Flavor|Amount Sold|
+-------------+------------------+-----------+
|     Botswana|             Apple|      33208|
|      Burundi|             Apple|      23130|
|         Chad|Strawberry Rhubarb|       8437|
|     DR Congo|             Apple|      31748|
|       Gambia|             Apple|      31404|
|        Ghana|             Fudge|      17682|
|Guinea-Bissau|             Apple|       7620|
|        Libya|             Fudge|      41840|
|   Mauritania|         Blueberry|       7071|
|      Senegal|             Apple|      30176|
|  South Sudan|Strawberry Rhubarb|       8978|
|         Togo|             Apple|      35948|
| Burkina Faso|             Fudge|       7861|
|         Chad|             Fudge|      25629|
|     Ethiopia|Strawberry Rhubarb|      40285|
|      Lesotho|Strawberry Rhubarb|      41465|
|    Mauritius|Strawberry Rhubarb|      16026|
|      Somalia|             Apple|      26367|
|        Suda

In [30]:
# Rows where the pie flavor is either Apple or Cherry
df_pyspark_2.filter((df_pyspark_2['Pie_Flavor']=='Apple') | (df_pyspark_2['Pie_Flavor']=="Cherry")).show()

+--------+----------+--------------------+----------+--------+-----+-----------+------------------+---------------------------+--------+
|Order NO|Order Date|           Countries|Pie_Flavor|Quantity|Price|Amount Sold|Slice Or Whole Pie|Pre-Order/In-Store Purchase|Organic?|
+--------+----------+--------------------+----------+--------+-----+-----------+------------------+---------------------------+--------+
|   10001|      null|             Algeria|     Apple|       5| 6721|      33605|             Slice|                  Pre-Order|      No|
|   10002|      null|              Angola|     Apple|       3| 4572|      13716|             Whole|                   In-Store|     Yes|
|   10003|      null|               Benin|     Apple|       3| 3862|      11586|             Whole|                   In-Store|      No|
|   10004|      null|            Botswana|     Apple|       4| 8302|      33208|             Whole|                  Pre-Order|      No|
|   10005|      null|        Burkina Faso

In [31]:
df_pyspark_2.select(["Order Date","Price","Quantity","Amount Sold"]).show() #Show certain columns

+----------+-----+--------+-----------+
|Order Date|Price|Quantity|Amount Sold|
+----------+-----+--------+-----------+
|      null| 6721|       5|      33605|
|      null| 4572|       3|      13716|
|      null| 3862|       3|      11586|
|      null| 8302|       4|      33208|
|      null| 5603|       2|      11206|
|      null| 7710|       3|      23130|
|      null| 4871|       3|      14613|
|      null| 9463|       5|      47315|
|      null| 4970|       3|      14910|
|      null| 8437|       1|       8437|
|      null| 2565|       4|      10260|
|      null| 4590|       1|       4590|
|      null| 6437|       2|      12874|
|      null| 9574|       1|       9574|
|      null| 7937|       4|      31748|
|      null| 2006|       4|       8024|
|      null| 5652|       3|      16956|
|      null| 1627|       1|       1627|
|      null| 6542|       5|      32710|
|      null| 1465|       5|       7325|
+----------+-----+--------+-----------+
only showing top 20 rows



**Read more on filtering with PySpark:** https://sparkbyexamples.com/pyspark/pyspark-where-filter/

In [32]:
# Rows with Price between 7000 and 9000 (inclusive), showing all columns
df_pyspark_2.filter((df_pyspark_2['Price']>=7000) & (df_pyspark_2['Price']<=9000)).show()

+--------+----------+-------------+------------------+--------+-----+-----------+------------------+---------------------------+--------+
|Order NO|Order Date|    Countries|        Pie_Flavor|Quantity|Price|Amount Sold|Slice Or Whole Pie|Pre-Order/In-Store Purchase|Organic?|
+--------+----------+-------------+------------------+--------+-----+-----------+------------------+---------------------------+--------+
|   10004|      null|     Botswana|             Apple|       4| 8302|      33208|             Whole|                  Pre-Order|      No|
|   10006|      null|      Burundi|             Apple|       3| 7710|      23130|             Whole|                  Pre-Order|      No|
|   10010|      null|         Chad|Strawberry Rhubarb|       1| 8437|       8437|             Slice|                   In-Store|      No|
|   10015|      null|     DR Congo|             Apple|       4| 7937|      31748|             Whole|                  Pre-Order|     Yes|
|   10022|      null|       Gambia

**Create a new column by multiplying two existing columns**

In [33]:
df_pyspark_2 = df_pyspark_2.withColumn("Total_Amount_Sold", col("Quantity") * col("Price"))
df_pyspark_2.select(["Total_Amount_Sold","Amount Sold"]).show()

+-----------------+-----------+
|Total_Amount_Sold|Amount Sold|
+-----------------+-----------+
|            33605|      33605|
|            13716|      13716|
|            11586|      11586|
|            33208|      33208|
|            11206|      11206|
|            23130|      23130|
|            14613|      14613|
|            47315|      47315|
|            14910|      14910|
|             8437|       8437|
|            10260|      10260|
|             4590|       4590|
|            12874|      12874|
|             9574|       9574|
|            31748|      31748|
|             8024|       8024|
|            16956|      16956|
|             1627|       1627|
|            32710|      32710|
|             7325|       7325|
+-----------------+-----------+
only showing top 20 rows



**PySpark GroupBy and Aggregate Functions**

In [34]:
df_pyspark_2.groupBy("Pie_Flavor").sum().show() # Sum every numeric column for each pie flavor

+------------------+-------------+-------------+----------+----------------+----------------------+
|        Pie_Flavor|sum(Order NO)|sum(Quantity)|sum(Price)|sum(Amount Sold)|sum(Total_Amount_Sold)|
+------------------+-------------+-------------+----------+----------------+----------------------+
|             Other|       999563|          279|    477648|         1522151|               1522151|
|Strawberry Rhubarb|      7245335|         1870|   3511829|        10457081|              10457081|
|           Pumpkin|      5775006|         1507|   2804354|         8313987|               8313987|
|             Fudge|      3534468|          968|   1659068|         5139027|               5139027|
|            Cherry|      3667886|          957|   1736496|         5186149|               5186149|
|             Apple|      7996745|         2148|   3791402|        11467975|              11467975|
|         Blueberry|      2357148|          604|   1166427|         3381975|               3381975|


**Update Different Values in a column**

In [35]:
from pyspark.sql.functions import regexp_replace # Import the regexp_replace function

# Replace all occurrences of the value "Other" in the "Pie_Flavor" column with "Mango"
df_pyspark_2 = df_pyspark_2.withColumn("Pie_Flavor", regexp_replace("Pie_Flavor", "Other", "Mango"))

In [36]:
df_pyspark_2.groupBy("Pie_Flavor").sum().show()

+------------------+-------------+-------------+----------+----------------+----------------------+
|        Pie_Flavor|sum(Order NO)|sum(Quantity)|sum(Price)|sum(Amount Sold)|sum(Total_Amount_Sold)|
+------------------+-------------+-------------+----------+----------------+----------------------+
|Strawberry Rhubarb|      7245335|         1870|   3511829|        10457081|              10457081|
|             Mango|       999563|          279|    477648|         1522151|               1522151|
|           Pumpkin|      5775006|         1507|   2804354|         8313987|               8313987|
|             Fudge|      3534468|          968|   1659068|         5139027|               5139027|
|            Cherry|      3667886|          957|   1736496|         5186149|               5186149|
|             Apple|      7996745|         2148|   3791402|        11467975|              11467975|
|         Blueberry|      2357148|          604|   1166427|         3381975|               3381975|


In [38]:
df_pyspark_2.groupBy("Pie_Flavor").agg({"Amount Sold":"sum"}).show()

+------------------+----------------+
|        Pie_Flavor|sum(Amount Sold)|
+------------------+----------------+
|Strawberry Rhubarb|        10457081|
|             Mango|         1522151|
|           Pumpkin|         8313987|
|             Fudge|         5139027|
|            Cherry|         5186149|
|             Apple|        11467975|
|         Blueberry|         3381975|
+------------------+----------------+



In [37]:
df_pyspark_2.filter(df_pyspark_2['Pie_Flavor']=='Mango').show()

+--------+----------+----------+----------+--------+-----+-----------+------------------+---------------------------+--------+-----------------+
|Order NO|Order Date| Countries|Pie_Flavor|Quantity|Price|Amount Sold|Slice Or Whole Pie|Pre-Order/In-Store Purchase|Organic?|Total_Amount_Sold|
+--------+----------+----------+----------+--------+-----+-----------+------------------+---------------------------+--------+-----------------+
|   10073|      null|  Eswatini|     Mango|       2| 3472|       6944|             Whole|                  Pre-Order|     Yes|             6944|
|   10077|      null|     Ghana|     Mango|       5| 2956|      14780|             Slice|                   In-Store|     Yes|            14780|
|   10232|      null|     Egypt|     Mango|       2| 1308|       2616|             Slice|                  Pre-Order|     Yes|             2616|
|   10247|      null|    Malawi|     Mango|       2| 2367|       4734|             Whole|                   In-Store|     Yes|    