##### Introduction as in Apache Website

In [None]:
from pyspark import SparkContext

sc = SparkContext(appName="Test")

##### Reading Text File

In [None]:
textFile = sc.textFile(r"file:///C:\Users\Dell\Desktop\Hadoop\Spark\bootStrap.sh")

In [None]:
textFile.count()

In [None]:
textFile.collect()

##### Filter == Query in pandas

In [None]:
# textFile is an RDD of lines, so filter it with a plain Python predicate
textFile.filter(lambda line: "abc" in line).count()

###### Note that the cells above create an RDD and not a DataFrame, because the file was read using the Spark context and not the SQL context

Let us create a SparkSession

In [None]:
from pyspark import SQLContext
spark = SQLContext(sc).sparkSession

In [None]:
text = spark.read.text(r"file:///C:\Users\Dell\Desktop\Hadoop\Spark\bootStrap.sh")

In [None]:
text[text.value.contains("get")].collect()

In [None]:
from pyspark.sql.functions import *

##### Creating new column in spark dataframe

In [None]:
text1 = text.select(size(split(text.value, r"\s+")).name("wordCount"))
text1.show()

In [None]:
text.select(size(split(text.value, r"\s+")).name("wordCount")).agg(max(col("wordCount"))).show()

##### Simulating MapReduce action

In [None]:
text.select(split(text.value, r"\s+").name("word")).collect()

In [None]:
text.select(split(text.value, r"\s+").name("word")).groupby("word").count().show()

##### Explode, GroupBy and OrderBy

In [None]:
# split on runs of whitespace, emit one row per word, then count and sort by frequency
df = text.select(explode(split(text.value, r"\s+")).name("word")).groupby("word").count().orderBy(desc("count"))

In [None]:
%%timeit
df.show()

In [None]:
df.show()

##### Importance of Cache

In [None]:
df.cache()

In [None]:
%%timeit
df.show()

##### One more way of creating spark session

In [None]:
from pyspark.sql import SparkSession
spark2 = SparkSession.builder.appName("temp").getOrCreate()

In [None]:
spark2.read.text(r"file:///C:\Users\Dell\Desktop\Hadoop\Spark\bootStrap.sh").show()

##### Quick Note on Clusters and Spark program execution on them
* A single Python Spark program has 2 logical components.
    * Driver tasks
        * Assigns tasks to the workers and collects the output of each task.
        * Coordinates with resource managers such as YARN and Mesos to acquire resources. This is transparent to the programmer, who only writes read, write and DataFrame-related operations.
    * Worker tasks
        * Announces its presence to the resource allocator and so gets nominated for a task.
        * Executes the instructions sent by the driver program.
        * Note that each worker operates only on the data available in its own scope, which makes execution faster and parallel processing feasible.
        
        
* Monitoring can be done through the driver's web UI: http://<driver-node>:4040
* There are applications such as Livy which also provide interactive access to jobs and their features.

##### Job Scheduling
* Unlike in most other scheduling contexts, here we are talking about resource scheduling, not about when a job runs.


* Static Partitioning:
    * The resources given to a job can be limited through spark-submit arguments:
        * --num-executors : number of executors for the job
        * --executor-memory : memory per executor
        * --executor-cores : cores per executor. Note that cores are logical units of computation in a CPU.
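As a minimal sketch of the static partitioning flags above, the snippet below assembles a spark-submit command line. The helper name `build_submit_command`, the script name `word_count.py` and the sizes are hypothetical; only the three flags come from the notes, and `--num-executors` only applies on cluster managers that honour it, such as YARN.

```python
def build_submit_command(script, num_executors, executor_memory, executor_cores):
    """Return the spark-submit argument list for a statically sized job."""
    return [
        "spark-submit",
        "--num-executors", str(num_executors),   # how many executors the job gets
        "--executor-memory", executor_memory,    # memory per executor, e.g. "2g"
        "--executor-cores", str(executor_cores), # cores per executor
        script,
    ]


# Hypothetical job: 4 executors, each with 2 GB of memory and 2 cores.
command = build_submit_command("word_count.py", 4, "2g", 2)
print(" ".join(command))
```

The list form can be passed directly to `subprocess.run` if the job is launched from Python.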



* Dynamic Resource Allocation:
    * An application can return resources it is no longer using and request them again later.
    * set spark.dynamicAllocation.enabled to true
    * set spark.shuffle.service.enabled to true
        * This allows an executor to be removed without removing its shuffle files, so intermediate results are retained and can be fetched when another executor needs them.
        * The shuffle service always runs and keeps track of the shuffle files across applications, which avoids the scenario of a new executor trying to access a shuffle file that an old executor is still writing.
            * The old executor can hand its shuffle output over to the shuffle service and terminate gracefully.
            * The shuffle service then handles any new executor that requests the old shuffle file content.

    * spark.dynamicAllocation.schedulerBacklogTimeout is used to trigger the request: executors are requested once tasks have been pending for that many seconds.
        * If the backlog persists after the first request, another request is made every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds.
        * The number of executors requested increases exponentially with every subsequent request.
            * This is because the job has already been waiting for resources and has to catch up for the waiting period.
            * It also acts as a buffer for future use (as the initial request was not fulfilled).

    * Executors are removed after being idle for spark.dynamicAllocation.executorIdleTimeout seconds.
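The request policy described above can be illustrated with a small plain-Python simulation. It assumes the number of executors requested doubles each round starting from 1, and it ignores the upper bound a real cluster imposes; the function name `request_schedule` and the one-second timeouts are illustrative.

```python
def request_schedule(backlog_timeout, sustained_timeout, rounds):
    """Return (seconds_since_backlog_started, executors_requested) for each round."""
    schedule = []
    elapsed = backlog_timeout  # first request fires after schedulerBacklogTimeout
    to_add = 1                 # assumption: the first round asks for one executor
    for _ in range(rounds):
        schedule.append((elapsed, to_add))
        elapsed += sustained_timeout  # later rounds wait sustainedSchedulerBacklogTimeout
        to_add *= 2                   # each round doubles the previous request
    return schedule


schedule = request_schedule(backlog_timeout=1, sustained_timeout=1, rounds=4)
for elapsed, to_add in schedule:
    print(f"t={elapsed}s: request {to_add} more executor(s)")
```

With these values, four rounds of sustained backlog request 1, 2, 4 and 8 executors, 15 in total, which shows how quickly a starved job can ramp up.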
            
        
        
* Scheduling Within an Application:
    * Fair scheduling: jobs within an application (triggered by actions like save, collect etc.) are by default executed in FIFO order, but fair scheduling can allocate resources to smaller jobs even while long jobs are being executed.
        * conf.set("spark.scheduler.mode", "FAIR")
        * Pools can also have 
            * weight, to decide which pool is given higher preference
            * minShare, to guarantee a pool a minimum share even when its weight is lower
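Pool weights and minShare values are normally declared in an XML allocation file. The sketch below generates such a file with the standard library; the pool names `reports` and `adhoc`, their values and the helper `build_allocations` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools):
    """Build a fair scheduler allocation document from {name: (weight, min_share)}."""
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", {"name": name})
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return root


# Hypothetical pools: "reports" gets twice the weight of "adhoc",
# while "adhoc" is still guaranteed a minimum share of 2 cores.
allocations = build_allocations({"reports": (2, 0), "adhoc": (1, 2)})
xml_text = ET.tostring(allocations, encoding="unicode")
print(xml_text)
```

The resulting file would then be referenced through `spark.scheduler.allocation.file`, and a thread can submit its jobs to a pool with `sc.setLocalProperty("spark.scheduler.pool", "reports")`.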
        

##### Shared Variables
* Broadcast Variables:
    * A variable is broadcast to all the nodes in the cluster.
    * Each node holds a read-only copy.
    * Useful when a large dataset is used repeatedly across tasks.
        * broadcastVar = sc.broadcast([1, 2, 3])
        * broadcastVar.value

* Accumulator Variables:
    * Support associative and commutative operations on a variable by different tasks.
    * Tasks on cluster nodes can add values to it through the (+=) operator but cannot read it.
    * Only the driver program can read it.
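A hedged sketch of both kinds of shared variable used together: a broadcast set of stop words is read inside tasks, while an accumulator counts blank lines and is read only on the driver. The names `is_blank` and `run_demo`, the stop words and the sample lines are all illustrative, and `run_demo()` needs a working pyspark installation, so it is defined here but not called.

```python
def is_blank(line):
    """Return True for lines that contain only whitespace."""
    return not line.strip()


def run_demo():
    # Imported here so that is_blank stays usable without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    stop_words = sc.broadcast({"the", "a", "of"})  # read-only copy on every node
    blank_lines = sc.accumulator(0)                # tasks can only add to it
    lines = sc.parallelize(["the art of spark", "", "a quick note", "   "])

    def count_blank(line):
        if is_blank(line):
            blank_lines.add(1)  # same effect as `blank_lines += 1`

    # Accumulators are updated inside an action so each line is counted once.
    lines.foreach(count_blank)

    kept = (lines.flatMap(lambda line: line.split())
                 .filter(lambda word: word not in stop_words.value)
                 .collect())
    # blank_lines.value is read on the driver, after the action has finished.
    print(kept, blank_lines.value)
```

Updating the accumulator inside `foreach` (an action) rather than inside `map` avoids double counting if a transformation is re-executed.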
    
* Spark Streaming:
    * uses the DStream API, built on Spark RDD support
    * the input stream is divided into micro batches, which are then executed
    * each micro batch is an RDD
    * after each micro batch, the source is polled for a new micro batch
    * foreachRDD gives you the data of each micro batch
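The micro batch idea can be shown without Spark. The plain-Python sketch below groups timestamped events into fixed intervals and hands each group to a handler, loosely mirroring what foreachRDD does with each batch; `micro_batches`, `for_each_batch` and the sample events are hypothetical, and unlike Spark Streaming this version skips intervals with no events.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches of batch_interval seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    # Return the batches in time order; intervals without events are omitted.
    return [batches[index] for index in sorted(batches)]


def for_each_batch(batches, handler):
    """Call handler once per batch, in arrival order."""
    for batch in batches:
        handler(batch)


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1)
for_each_batch(batches, lambda batch: print(len(batch), batch))
```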
   
* Spark Structured Streaming:
    * built on top of Spark SQL, leverages the DataFrame APIs
    * each new data input is a row in an unbounded table
    * can handle late event data
    * foreachBatch gives the resultant DataFrame of each micro batch
    * end-to-end latency here can be as low as 100 ms
        * Continuous processing (Spark 2.3 onwards): end-to-end latency can be as low as 1 ms.

##### Example of Spark Structured Streaming:

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

In [2]:
spark = SparkSession.builder.appName("WordCount").getOrCreate()

In [3]:
lines = spark.readStream.format("socket").option("host", "localhost").option("port","9999").load()

In [4]:
words = lines.select(explode(split(lines.value, " ")).alias("word"))

In [None]:
wordCount = words.groupBy("word").count()

In [None]:
query = wordCount.writeStream.format("console").outputMode("complete").start()

In [None]:
query.awaitTermination()