# PySpark Bootcamp Workshop - Day 1

In [6]:
from IPython.display import Image, HTML, SVG, display
from IPython.lib.display import YouTubeVideo

In [29]:
Image(url="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg")

In [30]:
YouTubeVideo('p8FGC49N-zM')

***About me!***  
https://www.linkedin.com/in/vivek-bombatkar
    
***About you?***    

# Day 1: Spark DataFrames, Big Data & Hadoop

In [31]:
Image(filename = "pics/d1_1.JPG")

<IPython.core.display.Image object>

### "In a top-down approach an overview of the system is formulated, specifying, but not detailing, any first-level subsystems." (Wikipedia)


- Get a good understanding of the breadth of the system first, then go into depth.
- Example: instead of studying every type of JOIN, we execute one basic join and look at what happens behind the scenes.
- Learn through hands-on exercises backed by supporting theory.
- While going through the theory, we can jump into a notebook and validate the concepts ourselves!

### Environment Preparation:
Below are some of the environments available for executing Spark jobs.

- [community.cloud.databricks.com](https://community.cloud.databricks.com/login.html) (Community Edition)
- colab.research.google.com
- Kaggle Kernels (Kaggle kernel > Internet On; `!pip install pyspark`)

In [32]:
Image(filename = "pics/d1_spark_1.JPG",width=700,height=500)

<IPython.core.display.Image object>

[Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf)
- ***Spark Overview***:
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
- ***Speed***:
Engineered from the bottom-up for performance, Spark can be 100x faster than Hadoop for large scale data processing by exploiting in memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting.
- ***Ease of Use***:
Spark has easy-to-use APIs for operating on large datasets. This includes a collection of over 100 operators for transforming data and familiar data frame APIs for manipulating semi-structured data.
- ***A Unified Engine***:
Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.


Source: https://databricks.com/spark/about

Quick Start : http://spark.apache.org/docs/latest/quick-start.html



In [33]:
Image(url = "https://databricks.com/wp-content/uploads/2018/12/PySpark-1024x164.png")

- Apache Spark is written in the Scala programming language.  
- PySpark was released to let Apache Spark and Python work together; it is the Python API for Spark.  
- PySpark also lets you work with Resilient Distributed Datasets (RDDs) from Python.  
- This is achieved through the Py4J library.

Source:  https://databricks.com/glossary/pyspark --> look for 'DataFrame'

## Connection to the Spark...

In [16]:
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session that uses all available cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.submit.deployMode", "client")
    .getOrCreate()
)

In [3]:
# help(spark)

In [4]:
spark.stop()

In [37]:
Image(filename = "pics/d1_2.JPG")

<IPython.core.display.Image object>

# ***Hands-On session***: 

***spark-python/jupyter-pyspark/PySpark DataFrame Skeleton.ipynb***   
- https://github.com/dimajix/spark-training/blob/master/spark-python/jupyter-pyspark/PySpark%20DataFrame%20Skeleton.ipynb  


Data in S3 bucket: https://console.aws.amazon.com/s3/buckets/dimajix-training/data/weather/2003/?region=eu-central-1&tab=overview

### Before we move on: my favorite place to look for PySpark help!

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window

In [38]:
Image(filename = "pics/d1_3.JPG")

<IPython.core.display.Image object>

***Resilient Distributed Datasets (RDDs)*** 

***Why are RDDs important?***  
The RDD paper presents Resilient Distributed Datasets (RDDs) as a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

- At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster.   
- The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.   
- There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.  


Source: http://spark.apache.org/docs/latest/rdd-programming-guide.html

***MapReduce***   
A programming model based on divide and conquer; it can be applied in any language and any framework.    

1) Map --> pick what you want from each record.  
   - Runs in parallel on each node.  
   - The result is shuffled and sorted.  

--> The "magic" shuffle phase --> brings all values of the same key together so that they can be reduced.  

2) Reduce --> aggregate the grouped values produced by the map step.  
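
Since the model is independent of language and framework, the three phases can be sketched in plain Python before trying them in Spark. This is a single-machine illustration only: the function names and the sample records are made up, and nothing here runs in parallel.

```python
from collections import defaultdict

# Made-up input records for illustration.
records = ["spark makes big data simple", "big data needs big clusters"]


def map_phase(record):
    """Map: pick what we want from one record, here a (word, 1) pair per word."""
    return [(word, 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: bring all values of the same key together, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: aggregate the grouped values of one key."""
    return key, sum(values)


mapped = [pair for record in records for pair in map_phase(record)]
shuffled = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in shuffled.items())

print(word_counts)
```

On a cluster, the map calls run on the nodes that hold the data, and the shuffle moves the pairs across the network so that each reducer sees every value for its keys.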

# ***Hands-On session***: 

***spark-python/jupyter-pyspark/PySpark WordCount.ipynb***  
https://github.com/vivek-bombatkar/spark-training/blob/master/spark-python/jupyter-pyspark/PySpark%20WordCount.ipynb

In [7]:
Image(url="http://training.databricks.com/databricks_guide/gentle_introduction/trans_and_actions.png")

Transformations & Actions: http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

In [40]:
Image(url = "https://i.stack.imgur.com/3rF6p.png",width=700,height=500)

In [41]:
Image(url = "https://databricks.com/wp-content/uploads/2016/07/sql-vs-dataframes-vs-datasets-type-safety-spectrum.png",width=700,height=500)

***When to use RDDs?***  

Consider using RDDs in these common scenarios:  

- you want low-level transformation and actions and control on your dataset;  
- your data is unstructured, such as media streams or streams of text;  
- you want to manipulate your data with functional programming constructs rather than domain-specific expressions;  
- you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column; and  
- you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.  

Source: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

In [42]:
Image(filename= "pics/d1_df_2.JPG")

<IPython.core.display.Image object>

# Lunch Time

In [43]:
Image(filename = "pics/d1_4.JPG")

<IPython.core.display.Image object>

***Quiz / Food-For-Thought***: 




# ***Hands-On session***: 

***spark-python/jupyter-weather-df/Weather Analysis Exercise.ipynb***

https://github.com/vivek-bombatkar/spark-training/blob/master/spark-python/jupyter-weather-df/Weather%20Analysis%20Exercise.ipynb


In [44]:
Image(filename = "pics/d1_5.JPG")

<IPython.core.display.Image object>

***Big Data***  



- Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them.  

- Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.    

- The 3 V's of Big Data (IBM)  

1) Volume --> amount of data 

2) Velocity --> speed at which the data arrives

	--> Monthly / Daily / Hourly - Batch
	--> Seconds - Near real-time
	--> Milliseconds - Real-time
    
3) Variety

	a) Structured --> schema-aware, fixed rows and columns, data types
	b) Semi-structured --> textual data with no fixed schema -- emails, logs, blogs, comments, JSON
	c) Unstructured --> 
		images -- cheques 
		video -- CCTV, security
		audio -- traders' calls, customer confirmation calls


In [45]:
Image(url="https://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg",width=1200,height=1000)

In [46]:
Image(url="https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Hadoop_logo_new.svg/1920px-Hadoop_logo_new.svg.png",width=700,height=500)

***Apache Hadoop*** is

1) an open-source software framework   
2) used for distributed storage and distributed processing   
3) of very large data sets,   
4) running on computer clusters built from commodity hardware.  


***Framework for doing 2 things***  

a) Storage --> HDFS (distributed)   
b) Processing --> can be done in 2 ways: batch [ MapReduce ] v/s streaming [ Apache Storm ]  


***Data going to Code v/s Code going to Data***    
Traditional RDBMS v/s Hadoop    
Centralized v/s Distributed  



In [47]:
Image(url="https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2016/10/HADOOP-ECOSYSTEM-Edureka.png")

***Storage***  
- OLTP v/s OLAP    

- Transactional v/s Analytical     
RDBMS ; DWH  
NoSQL ; Hadoop  


- Biggest differentiator between RDBMS and Hadoop   

Centralized v/s Distributed --> Storage + Processing  


- Data Lake Implementation --> Business Term


- ETL v/s ELT  

ETL = Extract, Transform, Load --> process while streaming  
ELT = Extract, Load, Transform --> Hadoop  

***HDFS : Master - Slave Architecture***  
- Master [ Server Grade ] & Slave [ Commodity machine ]  
- Commodity --> No dual power supply, No RAID, No high memory configuration, huge amount of storage.  


***Processing***  
REPL: Read-Eval-Print Loop

		1) Traditional MapReduce
		2) Spark --> in-memory processing [ Scala, Python ]

***Distributions of Hadoop***

1) Cloudera  
2) Hortonworks  
3) Pivotal  
4) MapR  
5) IBM Big Insights  
6) HDInsight by Microsoft  

https://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support

***Why do Hadoop and Spark go hand in hand?***

In [48]:
Image(filename = "pics/d1_6.JPG")

<IPython.core.display.Image object>

## HIVE

In [49]:
Image(url="https://upload.wikimedia.org/wikipedia/commons/thumb/b/bb/Apache_Hive_logo.svg/225px-Apache_Hive_logo.svg.png")

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.


https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-HiveTutorial

***HIVE Storage types in action***


In [8]:
Image(filename = "pics/d1_7.JPG")

<IPython.core.display.Image object>

In [25]:
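from pyspark.sql.functions import udf

# Python UDF that adds 10 to an integer column value.
@udf('int')
def tmp(i):
    return i + 10

sdf = spark.createDataFrame([(x,) for x in range(0, 10)], ["c1"])
sdf.show()

# DataFrames are immutable: withColumn returns a new DataFrame, so keep the result.
sdf = sdf.withColumn("c2", tmp(sdf.c1))

sdf.show()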


+---+
| c1|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

+---+---+
| c1| c2|
+---+---+
|  0| 10|
|  1| 11|
|  2| 12|
|  3| 13|
|  4| 14|
|  5| 15|
|  6| 16|
|  7| 17|
|  8| 18|
|  9| 19|
+---+---+




- https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html  
- https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html      
- https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs     

# ***Hands-On session***: 
***spark-python/jupyter-advanced/05 - UDFs Skeleton.ipynb***   

- https://github.com/vivek-bombatkar/spark-training/blob/master/spark-python/jupyter-advanced/05%20-%20UDFs%20Skeleton.ipynb   
