# Spark Applications
A Spark application consists of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for the following:
* Maintaining information about the Spark application
* Responding to a user's program or input
* Analyzing, distributing and scheduling work across the executors

The executors are responsible for actually carrying out the work that the driver assigns them. This means that the executor is responsible for:
* Executing the code assigned to it by the driver
* Reporting the state of the computation on that executor back to the driver node
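
The division of labour described above can be sketched in plain Python. This is only an analogy and not how Spark is implemented: the thread pool stands in for the executors, and the names `split_into_partitions`, `executor_task` and `run_job` are hypothetical helpers invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the data into roughly equal chunks of work."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def executor_task(partition):
    """Executor side: run the assigned code and report the result back."""
    return sum(1 for n in partition if n % 2 == 0)


def run_job(data, num_executors=4):
    """Driver side: distribute the work, then combine what executors report."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_counts = list(pool.map(executor_task, partitions))
    return sum(partial_counts)


print(run_job(list(range(1000))))  # 500 even numbers
```

The driver never touches individual rows here; it only plans the split and merges the partial results, which mirrors the responsibilities listed above.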


## The SparkSession
You control your Spark application through a driver process called the SparkSession. The SparkSession instance is the way Spark executes user-defined manipulations across the cluster.

In [1]:
import findspark
findspark.init()
findspark.find()

'/opt/spark/'

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

22/10/31 22:01:27 WARN Utils: Your hostname, kevin resolves to a loopback address: 127.0.1.1; using 192.168.1.6 instead (on interface wlp0s20f3)
22/10/31 22:01:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/31 22:01:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
myRange = spark.range(1000).toDF('number')
myRange.show()

                                                                                

+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
|    19|
+------+
only showing top 20 rows



# Transformations
In Spark, the core data structures are immutable. To 'change' a DataFrame, you instruct Spark how you would like to modify it. These instructions are called transformations. Here is a simple transformation to find all even numbers in our current DataFrame:


In [7]:
divBy2 = myRange.where("number % 2 = 0")

There are two types of transformations:
* Narrow transformations - each input partition contributes to at most one output partition. In the preceding code, the where statement is a narrow transformation: each partition can be filtered without data from any other partition.
* Wide transformations - input partitions contribute to many output partitions. This is often called a shuffle, because Spark exchanges data between partitions across the cluster.
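
To make the difference concrete, here is a small pure-Python sketch that models partitions as lists. It is an illustration under simplifying assumptions, not Spark code: `input_partitions`, `narrow_filter` and `wide_group_by` are hypothetical names, and the data is made up.

```python
from collections import defaultdict

# Hypothetical input: three partitions of integers.
input_partitions = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]


def narrow_filter(partitions, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[x for x in part if predicate(x)] for part in partitions]


def wide_group_by(partitions, key_fn, num_output_partitions):
    """Wide: rows are redistributed by key, so a single input partition
    can feed several output partitions (a shuffle)."""
    outputs = [defaultdict(list) for _ in range(num_output_partitions)]
    for part in partitions:
        for x in part:
            key = key_fn(x)
            outputs[hash(key) % num_output_partitions][key].append(x)
    return [dict(out) for out in outputs]


evens = narrow_filter(input_partitions, lambda x: x % 2 == 0)
by_remainder = wide_group_by(input_partitions, lambda x: x % 3, 3)

print(evens)         # [[0, 2], [4, 6], [8, 10]]
print(by_remainder)  # every output partition holds rows from all three inputs
```

The filter keeps the partition boundaries intact, while the grouping has to move rows between partitions before it can produce a result.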

## Actions
Transformations allow us to build up our logical transformation plan. To trigger the computation, we run an action. An action instructs Spark to compute a result from a series of transformations. The simplest action is count, which returns the total number of records in the DataFrame:

In [8]:
divBy2.count()

500
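
The interplay between immutable data, recorded transformations and actions can be mimicked with a toy class. This is a sketch for intuition only: `LazyRange` is a hypothetical stand-in and not a Spark class, and the `calls` list exists purely to show when the predicate actually runs.

```python
class LazyRange:
    """Toy stand-in for a DataFrame holding the numbers 0..n-1."""

    def __init__(self, n, predicates=()):
        self.n = n
        self.predicates = tuple(predicates)

    def where(self, predicate):
        # Transformation: record the step and return a NEW object.
        return LazyRange(self.n, self.predicates + (predicate,))

    def count(self):
        # Action: only now is the recorded plan actually executed.
        return sum(
            1 for x in range(self.n)
            if all(p(x) for p in self.predicates)
        )


calls = []  # records every value the predicate is evaluated on


def is_even(x):
    calls.append(x)
    return x % 2 == 0


numbers = LazyRange(1000)
evens = numbers.where(is_even)

print(len(calls))       # 0, the transformation did no work yet
print(evens.count())    # 500, the action triggers the computation
print(numbers.count())  # 1000, the original is unchanged
```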

# End-to-End Example
We will add schema inference, which means we want Spark to take a best guess at what the schema of our DataFrame should be. We also want to specify that the first row is the header in the file. To get the schema information, Spark reads a little bit of the data and then attempts to parse the types in those rows according to the types available in Spark.

In [10]:
flightData2015 = spark.read\
                        .option('inferSchema', 'true')\
                        .option('header', 'true')\
                        .csv('/home/kevin/Desktop/Big-Data-with-Pyspark/data/flight-data/csv/2015-summary.csv')

In [11]:
flightData2015.take(3)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]
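
Note that count came back as an integer rather than a string, which is the effect of inferSchema. The idea of sampling rows and guessing types can be sketched in plain Python. This is a deliberately simplified illustration and not Spark's actual inference algorithm: `guess_type` and `infer_schema` are hypothetical helpers, and the sample reuses the three rows shown above.

```python
import csv
import io

# The three rows returned by take(3), written back out as CSV text.
SAMPLE_CSV = """DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
United States,Ireland,344
"""


def guess_type(values):
    """Return the narrowest type name that parses every sampled value."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(text, sample_rows=100):
    """Treat the first row as the header and guess types from a sample."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [row for _, row in zip(range(sample_rows), reader)]
    columns = list(zip(*rows))
    return {name: guess_type(col) for name, col in zip(header, columns)}


print(infer_schema(SAMPLE_CSV))
```

Because the guess is based on a sample, production jobs often declare the schema explicitly instead of relying on inference.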

In [14]:
from pyspark.sql.functions import max
flightData2015.select(max('count')).take(1)

[Row(max(count)=370002)]

In [15]:
# total the counts per destination and sort in descending order
from pyspark.sql.functions import desc
flightData2015\
    .groupBy('DEST_COUNTRY_NAME')\
    .sum('count')\
    .withColumnRenamed('sum(count)', 'destination_total')\
    .sort(desc('destination_total'))\
    .limit(5)\
    .show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [16]:
flightData2015\
    .groupBy('DEST_COUNTRY_NAME')\
    .sum('count')\
    .withColumnRenamed('sum(count)', 'destination_total')\
    .sort(desc('destination_total'))\
    .limit(5)\
    .explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- TakeOrderedAndProject(limit=5, orderBy=[destination_total#86L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#36,destination_total#86L])
   +- HashAggregate(keys=[DEST_COUNTRY_NAME#36], functions=[sum(count#38)])
      +- Exchange hashpartitioning(DEST_COUNTRY_NAME#36, 200), ENSURE_REQUIREMENTS, [plan_id=181]
         +- HashAggregate(keys=[DEST_COUNTRY_NAME#36], functions=[partial_sum(count#38)])
            +- FileScan csv [DEST_COUNTRY_NAME#36,count#38] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/kevin/Desktop/Big-Data-with-Pyspark/data/flight-data/csv/20..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>
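
Read from the bottom up, the plan scans only the two columns it needs, computes a partial sum inside each partition, exchanges rows so that equal destinations land in the same partition, finishes the sum, and finally takes the top five. The same stages can be imitated in plain Python. This is a rough sketch and not Spark's execution engine: the function names are hypothetical, and the rows and partition layout are made-up sample data.

```python
import heapq
from collections import Counter

# Hypothetical (destination, count) rows spread over two partitions.
partitions = [
    [("United States", 15), ("Canada", 3), ("United States", 1)],
    [("Canada", 5), ("Mexico", 4), ("United States", 344)],
]


def partial_sum(partition):
    """HashAggregate (partial_sum): aggregate within one partition."""
    totals = Counter()
    for dest, count in partition:
        totals[dest] += count
    return totals


def exchange(partials, num_partitions):
    """Exchange hashpartitioning: route each key to one partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for totals in partials:
        for dest, subtotal in totals.items():
            shuffled[hash(dest) % num_partitions].append((dest, subtotal))
    return shuffled


def take_ordered(final_partitions, limit):
    """TakeOrderedAndProject: keep the rows with the largest totals."""
    rows = [item for totals in final_partitions for item in totals.items()]
    return heapq.nlargest(limit, rows, key=lambda row: row[1])


partials = [partial_sum(p) for p in partitions]
# The final HashAggregate (sum) reuses the same summing logic on subtotals.
finals = [partial_sum(p) for p in exchange(partials, num_partitions=2)]
top = take_ordered(finals, limit=2)

print(top)  # [('United States', 360), ('Canada', 8)]
```

Summing inside each partition first means only one subtotal per destination has to cross the network during the exchange.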




In [18]:
spark.stop()