In [None]:
!pip install pyspark

In [None]:
'''
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on
computer clusters.

Why Unified: 
Spark is called unified because a single engine + a single execution model supports many different data processing workloads, instead of needing separate systems for each one.

Before Spark, you typically needed a different tool for each job:

Workload            Old-school tool
Batch processing    MapReduce
SQL queries         Hive
Streaming           Storm
Machine learning    Mahout
Graph processing    Giraph

Each had its own engine, its own APIs, and its own data movement, and tying them together took a lot of glue code.

Spark provides one core engine (Spark Core) and builds everything else on top of it:
Spark SQL, Structured Streaming, MLlib, and GraphX.

In practice, unified means you use one pipeline end to end, one execution and optimization engine,
one programming model (for different workloads such as SQL and ML), and one fault-tolerance and scheduling layer.

Compute Engine: Spark is only a compute engine, not a platform. (Databricks is a platform, but a Big Data platform rather than a full-fledged application platform.)
Hadoop, an older sibling in the Big Data toolset, was designed to be performant by storing data local to the machines that process it. Spark is designed as a distributed
computing engine that does not store data itself: it reads data from a variety of locations, including cloud storage, performs computations, and returns the results or writes them back.

Libraries: On top of the unified engine, Spark provides a unified API for common data analysis tasks. It supports both the standard libraries that ship with the engine and a wide array of
external libraries published as third-party packages by the open source community.

Spark Architecture:

Component       Analogy                 Real-World Function
Cluster         The Factory             The physical group of servers.
Driver          The Project Manager     Coordinates work and talks to the user.
SparkSession    The Project Contract    The interface used to start work.
Executors       The Factory Workers     Perform the actual data processing.


Apache Spark follows a master–worker architecture designed for distributed data processing.

SparkSession is the entry point to Spark. It allows users to create DataFrames, run SQL, and configure Spark. It is a handle to Spark, not an execution component.

The Driver is the brain of a Spark application. It runs the main program, creates the logical execution plan (DAG), splits it into stages and tasks, schedules those tasks,
and tracks execution and failures. There is exactly one driver per application.

A Cluster is the set of machines where Spark runs. It provides compute resources (CPU and memory) but does not contain Spark logic itself.

The Cluster Manager (YARN, Kubernetes, Standalone, etc.) is responsible for allocating resources and launching executors on worker nodes.

Executors are worker JVM processes that run on cluster nodes. They execute tasks on data partitions, perform transformations, handle shuffles, 
and store intermediate or cached data in memory or disk.

Overall, Spark separates coordination (driver), resource management (cluster manager), and execution (executors), while relying on external storage for data persistence.
This design enables scalable, fault-tolerant, and flexible distributed computation.

Step-by-step flow

User writes Spark code -> Driver starts and creates the SparkSession -> Driver contacts the cluster manager -> Cluster manager launches executors ->
Driver builds the DAG (DAG -> stages -> tasks) -> Tasks are sent to executors -> Executors process partitions -> Results are sent back to the driver or written to storage.

DAG = what needs to be done (logical plan)

Stages = how the work is split at shuffle boundaries (narrow transformations are pipelined within one stage; each wide transformation in the code starts a new stage)

Tasks = actual parallel work on data partitions
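
The DAG -> stages -> tasks breakdown above can be sketched in plain Python. This is a toy model, not Spark's scheduler: the names WIDE, split_into_stages and count_tasks are hypothetical, the plan is assumed to be a simple linear chain, and the partition counts are made-up inputs.

```python
# Toy model only: Spark's real scheduler is far more involved.
WIDE = {"groupBy", "sort", "join"}  # transformations assumed to need a shuffle


def split_into_stages(transformations):
    """Group a linear chain of transformations into stages, cutting at each shuffle."""
    stages = [[]]
    for name in transformations:
        if name in WIDE:
            # A wide transformation needs shuffled input, so it begins a new stage.
            stages.append([name])
        else:
            # Narrow transformations are pipelined into the current stage.
            stages[-1].append(name)
    return [stage for stage in stages if stage]


def count_tasks(stages, input_partitions, shuffle_partitions):
    """One task per partition: the first stage reads input partitions, later stages read shuffle partitions."""
    return [input_partitions if i == 0 else shuffle_partitions for i in range(len(stages))]


plan = ["read", "filter", "groupBy", "select", "sort"]
stages = split_into_stages(plan)
tasks = count_tasks(stages, input_partitions=8, shuffle_partitions=4)
for stage, n in zip(stages, tasks):
    print(stage, "->", n, "tasks")
```

In this sketch the plan is cut twice, once at groupBy and once at sort, so it runs as three stages.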
'''


In [None]:
'''
Spark APIs: Spark's language APIs let you write Spark programs in Scala, Java, Python, R, and SQL. Code in any of these languages drives Spark through the driver.
Spark has two fundamental sets of APIs: the low-level “unstructured” APIs and the higher-level structured APIs.

DataFrame: Represents a table of data with rows and columns and is part of the structured API. Unlike a pandas DataFrame in Python or a data frame in R, which live on a single machine, a Spark DataFrame is distributed across the cluster.
Spark splits a DataFrame's data into partitions (input partitions, or DataFrame partitions). When reading files, each partition holds at most 128MB by default (spark.sql.files.maxPartitionBytes). The number of shuffle partitions is set separately by spark.sql.shuffle.partitions.
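
As a rough back-of-the-envelope check of the 128MB figure above, the number of input partitions for a file can be estimated with simple arithmetic. This is only an approximation under stated assumptions: the helper estimate_input_partitions is hypothetical, and real Spark also weighs other factors (for example file open cost, available parallelism, and whether the file format is splittable).

```python
import math

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # 128MB, the per-partition size quoted above


def estimate_input_partitions(file_size_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    """Rough estimate: one partition per 128MB chunk, and never fewer than one partition."""
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))


one_gb = 1024 ** 3
large_file_partitions = estimate_input_partitions(one_gb)     # a 1GB file
small_file_partitions = estimate_input_partitions(10 * 1024)  # a 10KB file
print(large_file_partitions, small_file_partitions)
```

A small CSV such as the flight summary used later therefore lands in a single input partition, while shuffle partitions still follow the configured setting.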

Transformations: the core data structures are immutable, meaning they cannot be changed after they’re created. To “change” a DataFrame, you need to instruct Spark how you would like to
modify it to do what you want. These instructions are called transformations.
Ex:
divisBy2 = myRange.where("number % 2 = 0")

There are two types of transformations: those that specify narrow dependencies, and those that specify wide dependencies.

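A transformation is narrow when each input partition contributes to only one output partition (1:1), so no data moves between partitions (ex: filter). It is wide when
an input partition contributes to many output partitions (1:N), which forces Spark to shuffle data across the cluster (ex: groupBy, sort).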
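
The difference can be illustrated with partitions modelled as plain Python lists. This is a conceptual sketch, not how Spark is implemented: the sample rows are made up, the helper names are hypothetical, and keys are assigned to output partitions round-robin over the sorted keys (Spark itself uses hash partitioning) so the result is the same on every run.

```python
from collections import defaultdict

# Two input partitions of (country, count) rows; the data is made up for illustration.
partitions = [
    [("US", 10), ("IN", 3), ("US", 5)],
    [("IN", 7), ("FR", 2), ("US", 1)],
]


def narrow_filter(parts, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def wide_group_sum(parts, num_output_partitions):
    """Wide: rows are shuffled by key, so an output partition may read from every input partition."""
    # Sort the keys so the key-to-partition assignment is deterministic across runs.
    keys = sorted({key for part in parts for key, _ in part})
    target = {key: i % num_output_partitions for i, key in enumerate(keys)}
    shuffled = [defaultdict(int) for _ in range(num_output_partitions)]
    for part in parts:
        for key, value in part:
            # Every row is routed to the output partition that owns its key.
            shuffled[target[key]][key] += value
    return [dict(p) for p in shuffled]


filtered = narrow_filter(partitions, lambda row: row[1] >= 5)
grouped = wide_group_sum(partitions, num_output_partitions=2)
print(filtered)
print(grouped)
```

The filter never looks outside its own partition, while the group-by has to bring all rows for a key together before it can sum them.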

Actions: An action instructs Spark to compute a result from a series of transformations. Ex: count, collect, show, head, etc.

There are three kinds of actions:
> Actions to view data in the console
> Actions to collect data to native objects in the respective language
> Actions to write to output data sources

Lazy Evaluation: Lazy Evaluation means that Spark does not execute your code immediately when you write it. Instead, it records your instructions as a "plan" and only runs them 
when you explicitly ask for a final result.

Transformations (Lazy): filter(), map(), select(), groupBy(), join(). They return a new DataFrame, but no data is read or processed yet. Spark just adds a "step" to the DAG (lineage).
Why lazy is good, an example: "Filter Pushdown"

Imagine you have a 10TB dataset, and your code looks like this:

    Read the 10TB dataset.
    Select 2 columns.
    Filter for ID = 5.
    Show the result.

If Spark were eager: It would load all 10TB into memory, then throw away 99% of the columns, then throw away 99.9% of the rows. It would likely crash.

Because Spark is Lazy: It looks at the whole plan. It realizes it only needs one specific ID. It tells the data source (like Parquet or a Database) to only send the data where ID=5. 
It never loads the 10TB. This is called Predicate Pushdown.
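
The idea can be mimicked with a small pure-Python model. This is a sketch under stated assumptions, not Spark's optimizer: Source and LazyFrame are hypothetical classes, the rows are made up, and only a single filter is supported (a second where() would replace the first).

```python
class Source:
    """Toy data source that can apply a filter itself, like a Parquet reader or a database."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_sent = 0  # how many rows have left the source so far

    def scan(self, columns=None, predicate=None):
        for row in self._rows:
            if predicate is not None and not predicate(row):
                continue  # filtered out at the source, never sent to the engine
            self.rows_sent += 1
            yield {c: row[c] for c in columns} if columns else dict(row)


class LazyFrame:
    """Records transformations as a plan; nothing runs until an action is called."""

    def __init__(self, source, columns=None, predicate=None):
        self._source = source
        self._columns = columns
        self._predicate = predicate

    def select(self, *columns):
        # Transformation: returns a new plan and reads no data.
        return LazyFrame(self._source, columns, self._predicate)

    def where(self, predicate):
        # Transformation: returns a new plan and reads no data.
        return LazyFrame(self._source, self._columns, predicate)

    def collect(self):
        # Action: the whole plan is known here, so the filter and the
        # column list are pushed down into the scan.
        return list(self._source.scan(self._columns, self._predicate))


source = Source([{"id": i, "name": "n" + str(i), "extra": i * 2} for i in range(1000)])
plan = LazyFrame(source).select("id", "name").where(lambda row: row["id"] == 5)
sent_before = source.rows_sent  # nothing has been read yet: the plan is lazy
result = plan.collect()
print(sent_before, result, source.rows_sent)
```

Only the one matching row leaves the source, and only with the two selected columns, even though the source holds 1000 rows.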

Spark UI: Use it to monitor jobs. By default it is served at http://localhost:4040 while the application is running.
'''


In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkInJupyter") \
    .master("local[*]") \
    .getOrCreate()

spark


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/31 19:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [7]:
import os

myRange = spark.range(1000).toDF("number")
os.getcwd()  # show the working directory that relative file paths resolve against

In [9]:
flightData2015 = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("./SparkBasics/2015-summary.csv")


In [5]:
divisBy2 = myRange.where("number % 2 = 0")
print(divisBy2.count())
myRange.show()

500
+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
|    19|
+------+
only showing top 20 rows


In [None]:
flightData2015.show()
print(flightData2015.count())

In [None]:
flightData2015.sort("count").explain()

In [None]:
flightData2015.sort("count").show(flightData2015.count())

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "4")

In [None]:
# collect() is an action that brings every row from all the executors back to the driver as a native Python list.
# This can crash the application: if the driver has, say, 512MB of memory and the data collected from all
# executors is larger than that, the driver fails with an out-of-memory (OOM) error.
collectDF = flightData2015.sort("count").collect()

In [None]:
type(collectDF)

In [None]:
# The number of input partitions is dictated by the size of the data on disk (up to 128MB per partition).

# The number of shuffle partitions is dictated by your config setting, regardless of how small the input data is.
spark.conf.set("spark.sql.shuffle.partitions", "200")  # 200 is Spark's default value

In [None]:
# Register the DataFrame as a temporary view; the name must match the one used in the SQL queries below.
flightData2015.createOrReplaceTempView("V_flight_data_2015")

In [None]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM V_flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")
dataFrameWay = flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.count()
sqlWay.explain()
dataFrameWay.explain()

sqlWay.show()
dataFrameWay.show()

In [None]:
spark.sql("SELECT max(count) from V_flight_data_2015").take(1)

In [None]:
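# Import the functions module under an alias so Spark's max does not shadow Python's built-in max.
from pyspark.sql import functions as F
flightData2015.select(F.max("count")).take(1)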

In [None]:
# top 5 destination countries
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM V_flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.show()
# DataFrame way
from pyspark.sql.functions import desc
flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.sum("count")\
.withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total"))\
.limit(5)\
.show()