## Motivation for Apache Spark

1. Difficulty of programming in MapReduce and Hadoop
    - a simple word count program takes 60-70 lines of code
    - performance bottlenecks: intermediate results are written to disk multiple times

2. Support for iterative jobs
3. Support for streaming jobs

## Key features:

- in-memory computation
- distributed processing with parallelization
- support for multiple cluster managers
- fault tolerance
- lazy evaluation
- caching and persistence
- built-in optimization and DataFrames
- supports ANSI SQL


## Spark Ecosystem:

- Spark compute engine + Spark Core API (Scala, Python, Java, R)
- high-level APIs on top of the Core API:
    - Spark SQL, DataFrames, Datasets
    - Streaming
    - MLlib
    - GraphX

## Spark Architecture:



## Spark installation:

### SparkContext:

- entry point for Spark functionality
- represents the connection to a cluster
- used to create RDDs and broadcast variables on the cluster
- master: `local[*]` means run locally with as many worker threads as there are logical cores
- appName: name of the application

### SparkSession:

- create dataframe
- register dfs as tables
- execute sql over tables
- cache tables
- read parquet files

#### Example:

In [None]:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("word count").config("spark.some.config.option", "some-value").getOrCreate()

In [None]:
spark

In [None]:
cols = ['city', 'users']
data = [
    ('Banglore', '7654'), 
    ('Delhi','4234'),
    ('Mumbai', '234')
    ]

In [None]:
rdd = spark.sparkContext.parallelize(data)
rdd

In [None]:
# pass the column names defined above; without them the columns are named _1, _2
df1 = rdd.toDF(cols)
df1

In [None]:
df1.printSchema()

In [None]:
df1.show()

In [None]:
spark.stop()

#### What is an RDD (Resilient Distributed Dataset):

- fundamental building block of Spark
- fault-tolerant
- immutable distributed collection of objects
- similar to a Python list, but the data is distributed over the nodes of a cluster
- an RDD splits its data into logical partitions
- it abstracts the parallelization part
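
To make the idea of logical partitions concrete, here is a plain-Python sketch that needs no Spark installation. It splits a list into contiguous slices and processes each slice independently. The helper `split_into_partitions` is hypothetical and only models the idea; it is not Spark's implementation. In PySpark the real layout can be inspected with `rdd.getNumPartitions()` and `rdd.glom().collect()`.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into contiguous slices of near-equal size (a model, not Spark's code)."""
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


partitions = split_into_partitions([1, 2, 3, 4, 5, 6, 7], 3)

# Each partition can be processed on its own, for example by a different executor.
partial_sums = [sum(p) for p in partitions]

# The partial results are then combined into the final answer.
total = sum(partial_sums)

print(partitions)
print(partial_sums, total)
```

The caller only sees one collection and one result; how the data is sliced and where each slice is processed stays hidden, which is what "abstracts the parallelization part" refers to.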

### SparkContext web UI

- Jobs
- Stages
- Tasks
- Storage
- Environment
- Executors
- SQL

In [None]:
from pyspark.context import SparkContext

sc1 = SparkContext('local','test')
# sc2 = SparkContext('local','test2')

#### Note: only one SparkContext can be active at a time


In [None]:
sc1

SparkSession was introduced in Spark 2.0 as a unified entry point that wraps SparkContext. Multiple SparkSession objects can be created (for example with `newSession()`), and they all share the same underlying SparkContext.

In [None]:
sc1.stop()

#### How fast is Spark compared to Hadoop

In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkContext

In [2]:
sc = SparkContext()
spark = SparkSession(sc)

24/02/06 18:51:09 WARN Utils: Your hostname, Navneets-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 192.168.229.113 instead (on interface en0)
24/02/06 18:51:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/06 18:51:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
rdd1 = sc.parallelize([1,2,3,4])
rdd1_first = rdd1.filter(lambda x: x<3)
rdd1_first.collect()

                                                                                

[1, 2]

#### Word count with Spark:

In [4]:
from operator import add

from pyspark.sql import SparkSession


spark = SparkSession\
    .builder\
    .appName("PythonWordCountAnalysis")\
    .getOrCreate()

# lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])


24/02/06 18:51:18 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [2]:
spark

In [5]:
# reading the input file in spark
fname = "/Users/navneet/Documents/python-interview-questions/docker_commands.txt"
df = spark.read.text(fname)
# loads all the lines into a dataframe

In [6]:
df.show()

+--------------------+
|               value|
+--------------------+
|   docker stop mysql|
|     docker rm mysql|
|                    |
|docker run --name...|
|      -p 3308:3306 \|
|    -e MYSQL_ROOT...|
|    -v data:/var/...|
|             mysql:8|
+--------------------+



In [9]:
lines=df.rdd.map(lambda r: r[0])
lines.collect()

list

In [10]:
counts = lines.flatMap(lambda x: x.split(' ')) \
                .map(lambda x: (x, 1)) \
                .reduceByKey(add)
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

spark.stop()

docker: 3
stop: 1
mysql: 3
rm: 1
: 17
run: 1
--name: 1
-d: 1
\: 4
-p: 1
3308:3306: 1
-e: 1
MYSQL_ROOT_PASSWORD=change-me: 1
-v: 1
data:/var/lib/mysql: 1
mysql:8: 1
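
The `: 17` entry in the output above is an empty-string token: `split(' ')` returns an empty string for an empty line and for every leading or repeated space. The plain-Python sketch below reproduces the effect on a few hypothetical sample lines (not the real input file) and compares it with `split()` called without an argument, which splits on runs of whitespace and drops empty tokens.

```python
from collections import Counter

# Hypothetical sample lines: one empty line and one line indented with four spaces.
sample_lines = ["docker stop mysql", "", "    -p 3308:3306 \\"]

# split(' ') keeps an empty string for the empty line and for each leading space.
naive_counts = Counter(tok for line in sample_lines for tok in line.split(' '))

# split() with no argument splits on whitespace runs and never yields empty tokens.
clean_counts = Counter(tok for line in sample_lines for tok in line.split())

print(naive_counts[''])       # how many empty-string tokens were counted
print('' in clean_counts)     # whether the cleaned counts contain an empty token
print(clean_counts['docker'])
```

In the Spark job the same change would be `lines.flatMap(lambda x: x.split())`, assuming empty tokens are not wanted in the counts.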


#### Actions:

- return values to the driver program
- return something other than an RDD (for example a list or a number)
- trigger the computation

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("Actions").getOrCreate()
spark

24/01/07 11:46:49 WARN Utils: Your hostname, Navneets-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 172.20.10.7 instead (on interface en0)
24/01/07 11:46:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/07 11:46:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
data = [('a',3),('b',5),('c',12),('b',13),('b',19)]  # all values are ints so they can be aggregated per key
inputrdd = spark.sparkContext.parallelize(data)
inputrdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289

#### Transformations:

- Spark operations that derive a new RDD (or DataFrame) from an existing one
- lazily evaluated
- executed only when an action is triggered
- https://sparkbyexamples.com/spark/spark-rdd-transformations/
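
Lazy evaluation is easier to see in a toy model. The class below is a hypothetical plain-Python stand-in for an RDD (the name `LazyPipeline` is made up and is not a Spark class): transformations only record what should be done, and the action `collect()` replays the recorded steps.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new pipeline and does no work yet.
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: also only recorded.
        return LazyPipeline(self._data, self._steps + (("filter", fn),))

    def collect(self):
        # Action: replay every recorded step and return a plain list.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def add_one(x):
    calls.append(x)  # records when the function actually runs
    return x + 1


pipeline = LazyPipeline([1, 2, 3, 4]).map(add_one).filter(lambda x: x < 4)
calls_before_action = len(calls)  # nothing has run at this point
result = pipeline.collect()       # the action triggers the computation

print(calls_before_action, result, len(calls))
```

Each transformation returns a new object instead of modifying the old one, which mirrors the immutability of RDDs. Real Spark additionally uses the recorded lineage to plan stages and to recompute lost partitions, which this sketch does not attempt.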

##### Narrow Transformations:

- each output partition depends on exactly one input partition (one-to-one mapping)
- shuffling between partitions is not required

In [4]:
frdd = spark.sparkContext.textFile("/Users/navneet/Documents/python-interview-questions/subject material/spark/[English (auto-generated)] Scaling Privacy in a Spark Ecosystem [DownSub.com].txt")

In [5]:
rdd1 = spark.sparkContext.parallelize([1,2,3,4,5,6])

- map()

In [6]:
trdd = rdd1.map(lambda x :x +1)

In [7]:
trdd.collect()

[2, 3, 4, 5, 6, 7]

##### Wide transformations:

- shuffling is required
- examples: `groupByKey()`, `reduceByKey()`
- data needs to be exchanged between partitions in order to complete the transformation
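
The difference between narrow and wide transformations can be modelled in plain Python. In the sketch below every helper (`partition_of`, `shuffle`, `combine_locally`, `sum_by_key`) is hypothetical and only imitates the idea; the records resemble the sample data used earlier. It also shows why a `reduceByKey`-style aggregation usually moves fewer records than a `groupByKey`-style one: values are combined inside each partition before the shuffle.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Deterministic hash partitioner (an assumption standing in for Spark's partitioner).
    return zlib.crc32(key.encode()) % num_partitions


def shuffle(partitions, num_partitions):
    """Wide step: move every (key, value) record to the partition that owns its key."""
    out = [[] for _ in range(num_partitions)]
    moved = 0
    for part in partitions:
        for key, value in part:
            out[partition_of(key, num_partitions)].append((key, value))
            moved += 1
    return out, moved


def combine_locally(part):
    """Map-side combine: sum the values per key inside a single partition."""
    sums = defaultdict(int)
    for key, value in part:
        sums[key] += value
    return list(sums.items())


def sum_by_key(partitions):
    """Final aggregation after the shuffle."""
    totals = defaultdict(int)
    for part in partitions:
        for key, value in part:
            totals[key] += value
    return dict(totals)


input_partitions = [
    [("a", 3), ("b", 5), ("b", 13)],
    [("c", 12), ("b", 19)],
]

# Narrow: map() works on each partition separately, no records move.
incremented = [[(k, v + 1) for k, v in part] for part in input_partitions]

# groupByKey-style: shuffle every record, aggregate afterwards.
shuffled_all, moved_without_combine = shuffle(input_partitions, 2)
totals_grouped = sum_by_key(shuffled_all)

# reduceByKey-style: combine inside each partition first, then shuffle.
pre_combined = [combine_locally(part) for part in input_partitions]
shuffled_combined, moved_with_combine = shuffle(pre_combined, 2)
totals_reduced = sum_by_key(shuffled_combined)

print(moved_without_combine, moved_with_combine)
print(totals_grouped == totals_reduced, totals_reduced)
```

Both strategies give the same totals; the saving from combining first grows with the number of repeated keys per partition.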

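#### Optimization in Spark:

##### Serialization:

- Java: default
- Kryo: often up to 10x faster and more compact than Java serialization
- trade-off between performance and versatility: Kryo does not support every `Serializable` type and works best when classes are registered in advance # TBR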

##### Memory Tuning:

  

- parallelism: cluster utilization
- memory use of reduce tasks
- broadcasting large variables
- data locality
- prefer `take()` over `collect()`
- persistence in Spark
- avoid `groupByKey()`, use `reduceByKey()`
- aggregate with accumulators
- broadcast variables
- partitioning:
    - depends on the number of cores
    - too few partitions leave resources underutilized
    - too many partitions add scheduling and shuffle overhead
    - rule of thumb: at most about 128 MB of data per partition (the default of `spark.sql.files.maxPartitionBytes`)
- repartition:
    - full data shuffle
- coalesce:
    - intended for decreasing the number of partitions
    - minimizes the data movement
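
The partitioning notes above can be turned into a small plain-Python sketch. All three helpers are hypothetical models, not Spark APIs: `estimate_partitions` applies the 128 MB figure together with an assumed "at least one partition per core" rule, `coalesce` merges neighbouring partitions so records stay together, and `repartition` redistributes every record round-robin, which stands in for a full shuffle.

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # 128 MB rule of thumb


def estimate_partitions(total_bytes, cores, target=TARGET_PARTITION_BYTES):
    """Assumed heuristic: stay under the target size and keep every core busy."""
    by_size = math.ceil(total_bytes / target)
    return max(by_size, cores)


def coalesce(partitions, n):
    """Merge neighbouring partitions; no record is split away from its neighbours."""
    k = len(partitions)
    n = min(n, k)  # merging can only reduce the number of partitions
    return [
        [rec for part in partitions[(i * k) // n:((i + 1) * k) // n] for rec in part]
        for i in range(n)
    ]


def repartition(partitions, n):
    """Full shuffle model: every record is reassigned round-robin."""
    out = [[] for _ in range(n)]
    records = [rec for part in partitions for rec in part]
    for i, rec in enumerate(records):
        out[i % n].append(rec)
    return out


parts = [[1, 2], [3], [4, 5], [6]]

merged = coalesce(parts, 2)       # neighbours are glued together
reshuffled = repartition(parts, 2)  # records are spread evenly

print(estimate_partitions(1024 ** 3, cores=4))          # 1 GiB of input
print(estimate_partitions(100 * 1024 ** 2, cores=4))    # 100 MiB of input
print(merged)
print(reshuffled)
```

In this model `coalesce` keeps whole partitions intact and therefore moves little data, while `repartition` touches every record but yields evenly sized partitions. That is the trade-off the notes describe; actual Spark behaviour depends on the cluster and the data.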


- how are stages and jobs created in Spark
- lambda limitations
- what happens when you submit a Spark job
- Python threading and multiprocessing
- async operations in Python