#### About
Apache Spark using Python a.k.a PySpark
> Basic Details
1. PySpark helps in performing EDA while building machine learning pipelines.
2. It helps in creating ETL pipelines.
3. Big Data refers to extremely large datasets that can be analyzed computationally to reveal patterns and trends.
4. Big Data Analytics is the process of examining large and varied datasets to uncover information such as hidden patterns, market trends and customer preferences.
5. Social media networks such as Facebook, which generated about 4 petabytes of data per day, pushed the industry to develop Big Data techniques.
6. The 5 V's of Big Data: Velocity (the speed at which data is generated and processed), Volume (the quantity of data), Variety (different kinds of data: structured, semi-structured and unstructured), Veracity (the inconsistency and uncertainty of data), Value (turning previously unused data into something valuable).
7. Hadoop is a framework used to store Big Data across a cluster of machines so that it can be processed in parallel.
8. The three major components of the Hadoop ecosystem are HDFS (Hadoop Distributed File System, the parallel storage layer), MapReduce (the parallel processing layer) and YARN (Yet Another Resource Negotiator, the resource negotiation layer).
9. A Hadoop cluster consists of master and slave nodes. A MapReduce job submitted to the master is automatically distributed to and executed on the slave nodes.
10. Between the master and the slaves sits the resource negotiator, YARN. It keeps track of resources such as free storage on the slaves. Suppose the master asks YARN for slaves that can store a 10 TB file split into two 5 TB parts; YARN might answer with slaves 5 and 7. It also runs continuously across the slaves and checks whether they are functioning.
11. Replicas of data blocks are kept on several nodes so that if one server goes down a backup copy is still available, i.e. fault tolerance is maintained.
12. MapReduce is broken into two parts: a mapper function and a reducer function. Hadoop passes each input line to the mapper; the intermediate output is then read by the reducer, which performs the aggregation (group by key).
13. Master and slave nodes run daemons in the background. A daemon is a program that runs as a background process rather than under the direct control of an interactive user on a multitasking OS.
14. Databricks is a platform where requirements such as Hadoop, Spark and Hive come preinstalled on an n-node cluster.
15. HDFS consists of the NameNode (runs on the master and keeps track of blocks) and DataNodes (where the actual data is stored).
16. In more detail, the map function converts one set of data into another in which individual elements are broken down into (key, value) pairs, whereas the reduce function takes the output of the map function as input and aggregates and summarizes it to yield the final output.
17. Hadoop works inefficiently with small datasets; it is suited to large data.
18. Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management and computation engine, it typically uses Hadoop for storage only.
19. Apache Spark is a cluster computing technology designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations.
20. Features of Spark include a) in-memory computation, i.e. no need to fetch data from disk every single time, b) fault tolerance, i.e. little or no data loss when a node fails, c) lazy evaluation, i.e. transformations on an RDD only describe a new RDD and nothing is computed until an action is called.
21. All structured and unstructured data goes into HDFS and is fed to Spark as input; intermediate output is kept in RAM and the final output is written back to HDFS. Spark can take live data via Spark Streaming, whereas Hadoop MapReduce doesn't have this functionality. Spark can optionally use MapReduce for processing and YARN as the resource negotiator.
22. Spark Core is the core layer underlying all of Spark. Spark SQL is a distributed framework for structured data processing.
23. Spark Streaming is an add-on to the core Spark API that allows scalable, high-throughput, fault-tolerant processing of live data streams.
24. Spark Streaming can read data from sources such as Kafka, Flume, Kinesis or a TCP socket. The results can be visualised in live dashboards. Spark uses micro-batching for live streaming.
25. Spark MLlib is a scalable machine learning library. It ships its own distributed implementations of common algorithms and does not contain sklearn or TensorFlow. Spark itself is written mainly in Scala.
26. Spark is a data analytics engine. It has a read-eval-print loop (REPL) shell.
27. Spark GraphX is a library for manipulating and analysing graphs.
28. PySpark is the Python API for Apache Spark.
29. PySpark comes to the rescue when a dataset is too large to handle with pandas.
30. PySpark uses the RDD (Resilient Distributed Dataset).
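
The mapper/reducer split described in points 12 and 16 can be sketched in plain Python. This is only an illustration of the idea, not the Hadoop API: the function names `mapper`, `shuffle`, `reducer` and `map_reduce` and the sample lines are assumptions made for this example.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key (the step between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)

def map_reduce(lines):
    # Each line is passed to the mapper independently, so this step could run in parallel.
    intermediate = [pair for line in lines for pair in mapper(line)]
    # The reducer sees one key together with all of its values.
    return dict(reducer(key, values) for key, values in shuffle(intermediate).items())

lines = ["spark extends map reduce", "map then reduce"]
word_counts = map_reduce(lines)
print(word_counts)
```

In a real cluster the mapper calls run on different nodes and the shuffle moves data over the network; here everything happens in one process to keep the flow visible.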




##### Comparison of pandas with PySpark
1. When the dataset size grows beyond the size of RAM, pandas has to be replaced with PySpark.
2. pandas doesn't have built-in parallelism like PySpark.
3. Operations in PySpark are lazy, but not in pandas.
4. In PySpark we can't change the data in place since the underlying RDD is immutable; we can only transform a DataFrame into a new one. In pandas, DataFrames are mutable and can be modified in place.
5. DataFrame access is slower in PySpark but processing of large data is faster; in pandas it is the other way round.



In [1]:
from pyspark import SparkContext

In [10]:
sc = SparkContext()

In [11]:
a = sc.parallelize(['this','is','pyspark','demo','that','you','are','viewing']) # parallelize distributes a local collection (e.g. a list) as an RDD

In [7]:
a.take(4) # returns the first 4 elements; collect() would bring the whole RDD into driver memory, which defeats the purpose for large data

['this', 'is', 'pyspark', 'demo']

In [15]:
# to stop the spark context
sc.stop()

To access the web UI, open localhost:4040.
> The UI stops being available once the Spark context is stopped.

In [16]:
# second way: create the context, then inspect its configuration
sc = SparkContext()
print(sc.getConf().getAll())

[('spark.driver.extraJavaOptions', '-XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED'), ('spark.app.id', 'local-1678635364399'), ('spark.app.startTime', '1678635364329'), ('spark.driver.host', 'suraj'), ('spark.executor.id', 'driver'), ('spark.app.submitTime', '1678634749769'), ('spark.app.name', 'pyspark-shell'), ('spark

In [18]:
sc.stop()

In [19]:
# third way to define a spark context
# first argument is the master URL, second is the application name
sc = SparkContext("local","Master")

#### RDD
1. An RDD is like a dataset distributed across the cluster.
2. RDD operations are of two kinds: transformations and actions. A transformation creates a new RDD. An action returns a value (e.g. an integer, a string or a list) to the driver. Transformations are lazy: they only build up a Directed Acyclic Graph (DAG), which is executed once an action is applied.
3. Transformations are divided into narrow and wide transformations.
4. Actions include collect(), count(), countByValue(), take(), top(), reduce(), fold(), foreach() and saveAsTextFile(), whereas transformations include map(), flatMap(), filter(), distinct(), reduceByKey(), groupByKey(), mapValues(), flatMapValues() and sortByKey().
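
To make the lazy-transformation idea concrete, here is a minimal plain-Python sketch. `LazyDataset` is a hypothetical toy class written for this note, not part of PySpark; it only records transformations and runs them when an action such as `collect()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions execute them."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # recorded lineage of transformations

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._data, self._steps + (("filter", fn),))

    def collect(self):
        # Action: replay every recorded step and return a concrete list.
        result = self._data
        for kind, fn in self._steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

    def count(self):
        # Action: also triggers execution.
        return len(self.collect())

calls = []

def traced_double(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * 2

base = LazyDataset(range(1, 6))
doubled = base.map(traced_double)
big = doubled.filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has executed
result = big.collect()            # the action triggers the whole chain
print(calls_before_action, result)
```

Note that `base` is never modified; every transformation hands back a new object, which mirrors the immutability of RDDs.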

##### 1. Actions

In [22]:
# creating an RDD and trying basic actions
values = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
print(type(values))

<class 'pyspark.rdd.RDD'>


In [23]:
values.collect() # brings every element into the driver's RAM

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

In [24]:
values.countByValue()

defaultdict(int,
            {1: 1,
             2: 1,
             3: 1,
             4: 1,
             5: 1,
             6: 1,
             7: 1,
             8: 1,
             9: 1,
             10: 1,
             11: 1,
             12: 1,
             13: 1,
             14: 1,
             15: 1,
             16: 1,
             17: 1,
             18: 1,
             19: 1,
             20: 1})

In [25]:
# foreach runs a function on every element for its side effects and returns None
def display_value(x):
    print(x)
a = values.foreach(lambda x: display_value(x))
print(a)

None


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20


In [35]:
# glom turns each partition into a list, one list per partition
values.glom().collect()

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]]

In [26]:
# using take over collect
values.take(7)

[1, 2, 3, 4, 5, 6, 7]

In [43]:
#creating multiple partitions
values1 = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],21)
print(type(values1)) # simple RDD

<class 'pyspark.rdd.RDD'>


In [38]:
values1.take(5)

[1, 2, 3, 4, 5]

In [41]:
values1.glom().collect()

                                                                                

[[],
 [1],
 [2],
 [3],
 [4],
 [5],
 [6],
 [7],
 [8],
 [9],
 [10],
 [11],
 [12],
 [13],
 [14],
 [15],
 [16],
 [17],
 [18],
 [19],
 [20]]
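
The shape of the output above (one empty partition followed by twenty single-element ones) is what an even index split produces. The sketch below is an assumption about how a list could be sliced into partitions, chosen because it reproduces the output shown; `split_into_partitions` is a hypothetical helper, not a PySpark function.

```python
def split_into_partitions(items, num_partitions):
    """Slice items so that partition i covers indices [i*n//k, (i+1)*n//k)."""
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

numbers = range(1, 21)
one = split_into_partitions(numbers, 1)         # a single partition holding everything
four = split_into_partitions(numbers, 4)        # four partitions of five elements
twenty_one = split_into_partitions(numbers, 21) # more partitions than elements
print(twenty_one)
```

With more partitions than elements, at least one partition has to stay empty, which is why the first list in the output is `[]`.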

In [44]:
#pipelined RDD
type(values1.glom())

pyspark.rdd.PipelinedRDD

In [45]:
# reduce vs fold: fold additionally takes an initial (zero) value
values1.reduce(lambda a,b:a+b)

                                                                                

210

In [47]:
#max number
values1.reduce(lambda x,y: x if x>y else y)

                                                                                

20

In [48]:
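# user-defined function; subtraction is neither associative nor commutative, so the result can change with the partitioning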
def func(a,b):
    return a -b

values1.reduce(func)

                                                                                

-208
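
Why the partitioning matters for `reduce` can be seen with a small local simulation. This is a sketch under the assumption that each non-empty partition is reduced first and the partial results are then reduced again; `simulate_reduce` is a hypothetical helper for illustration, not a PySpark API.

```python
from functools import reduce
from operator import add, sub

def simulate_reduce(partitions, fn):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)

numbers = list(range(1, 21))

# Addition is associative and commutative: the split does not matter.
sum_one = simulate_reduce([numbers], add)
sum_two = simulate_reduce([numbers[:10], numbers[10:]], add)

# Subtraction is not: the same data gives different answers.
diff_one = simulate_reduce([numbers], sub)
diff_two = simulate_reduce([numbers[:10], numbers[10:]], sub)
diff_singletons = simulate_reduce([[]] + [[n] for n in numbers], sub)

print(sum_one, sum_two, diff_one, diff_two, diff_singletons)
```

For that reason the function passed to `reduce` should be associative and commutative if the result is meant to be independent of how the data is partitioned.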

In [50]:
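values.fold(1, lambda a,b : a+b) # the zero value 1 is added once per partition and once more when merging: 210 + 1 + 1 = 212 here
# so the result of fold also depends on the number of partitions (second argument of parallelize)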

212
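
The 212 above can be reproduced with a local simulation of `fold`. This is a sketch, assuming the zero value is used as the starting point inside every partition and once more when the per-partition results are merged; `simulate_fold` is a hypothetical helper, not a PySpark function.

```python
from functools import reduce
from operator import add

def simulate_fold(partitions, zero, fn):
    """Fold each partition starting from zero, then fold the partial results from zero."""
    partials = [reduce(fn, part, zero) for part in partitions]
    return reduce(fn, partials, zero)

numbers = list(range(1, 21))

single = simulate_fold([numbers], 1, add)                      # one partition
many = simulate_fold([[]] + [[n] for n in numbers], 1, add)    # 21 partitions
neutral = simulate_fold([[]] + [[n] for n in numbers], 0, add) # neutral zero value

print(single, many, neutral)
```

With a neutral zero value (0 for addition) the number of partitions no longer affects the result, which is the usual way to use `fold`.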

In [51]:
values3 = sc.parallelize(range(1,20))
values3.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [27]:
# reading from text file
firstnames = sc.textFile('first_names.txt')


In [28]:
type(firstnames)

pyspark.rdd.RDD

In [29]:
firstnames.first()

'Adfas'

In [30]:
firstnames.take(5)

['Adfas', 'fafsa', 'gaasf', 'fsafgasg', 'fsadsgasg']

In [31]:
firstnames.count()

21

In [32]:
firstnames.top(6)

['sdbnbnhgfgsd',
 'hnfdhgdhf',
 'hbfgthnsdvbbdf',
 'hbfdsdvhnsfdv',
 'grthjtj',
 'ghsfhrghsfvdhb']

In [33]:
firstnames.distinct().count()

                                                                                

21

##### 2. Transformations
A transformation is a function whose input and output are both RDDs.

- Narrow transformation, also known as pipelining: the data needed to compute each output partition lives in a single partition.
- map, flatMap, mapPartitions, filter, sample and union are examples of this.
- Wide transformation, also known as shuffling: the data needed may live in many partitions.
- intersection, join, distinct, cartesian, repartition, groupByKey and reduceByKey are examples of this.
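
The difference between the two kinds can be sketched with plain lists standing in for partitions. This is an illustration only; `narrow_map` and `wide_reduce_by_key` are hypothetical helpers, and hashing the key to pick a target partition is an assumption made for the example.

```python
from collections import defaultdict
from functools import reduce
from operator import add

def narrow_map(partitions, fn):
    """Narrow: every output partition is built from exactly one input partition."""
    return [[fn(x) for x in part] for part in partitions]

def wide_reduce_by_key(partitions, fn, num_partitions):
    """Wide: records are shuffled so that equal keys meet in the same partition."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # The target partition depends on the key, not on where the record came from.
            buckets[hash(key) % num_partitions][key].append(value)
    return [
        [(key, reduce(fn, values)) for key, values in bucket.items()]
        for bucket in buckets
    ]

words = [["a", "b", "a"], ["b", "c"]]
pairs = narrow_map(words, lambda w: (w, 1))            # partition layout is preserved
reduced = wide_reduce_by_key(pairs, add, num_partitions=2)
merged = dict(kv for part in reduced for kv in part)   # flatten for easy inspection
print(merged)
```

The key "b" starts out in both input partitions, so no single partition can compute its total alone; that need to move records between partitions is what makes the operation wide.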

In [52]:
# show the values before multiplying each element by 0.1
values.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

In [54]:
values.map(lambda a: a*0.1).collect()

[0.1,
 0.2,
 0.30000000000000004,
 0.4,
 0.5,
 0.6000000000000001,
 0.7000000000000001,
 0.8,
 0.9,
 1.0,
 1.1,
 1.2000000000000002,
 1.3,
 1.4000000000000001,
 1.5,
 1.6,
 1.7000000000000002,
 1.8,
 1.9000000000000001,
 2.0]

In [57]:
# flatMap: each input element can produce a varying number of output elements
values.flatMap(lambda x: range(1,x)).collect()

[1,
 1,
 2,
 1,
 2,
 3,
 1,
 2,
 3,
 4,
 1,
 2,
 3,
 4,
 5,
 1,
 2,
 3,
 4,
 5,
 6,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19]

In [62]:
values4 = sc.parallelize([1,2])
values4.flatMap(lambda x:(x,x*2,10)).collect() # each x yields three outputs: x, x*2 and the constant 10

[1, 2, 10, 2, 4, 10]
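
The difference between `map` and `flatMap` is easy to see on a local list. The helpers `local_map` and `local_flat_map` below are hypothetical plain-Python equivalents written for this note, not PySpark functions.

```python
from itertools import chain

def local_map(fn, items):
    """One output element per input element (the output may itself be a tuple)."""
    return [fn(x) for x in items]

def local_flat_map(fn, items):
    """fn returns an iterable per element; the results are flattened into one list."""
    return list(chain.from_iterable(fn(x) for x in items))

def triple(x):
    return (x, x * 2, 10)

mapped = local_map(triple, [1, 2])
flattened = local_flat_map(triple, [1, 2])
ranges = local_flat_map(lambda x: range(1, x), [1, 2, 3])

print(mapped)
print(flattened)
print(ranges)
```

`map` keeps one tuple per input, while `flatMap` unpacks each tuple into the result, so an input that maps to an empty iterable (such as `range(1, 1)`) simply contributes nothing.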

In [64]:
sc.stop()

##### Spark SQL
- It can take data from CSV, JSON etc. and create a Spark DataFrame (a table with named columns, similar to a relational table).
- We can use SQL queries, too.
- To create a Spark DataFrame, we use a SparkSession.

In [65]:
sc = SparkContext()

In [66]:
dataframe = sc.textFile('data.csv')
# textFile reads the CSV as an RDD of raw text lines, not as a DataFrame

In [67]:
type(dataframe)

pyspark.rdd.RDD

In [68]:
dataframe.take(2)

['RecordNumber,Country,City,Zipcode,State', '1,US,PARC PARQUE,704,PR']

In [69]:
dataframe.collect()

['RecordNumber,Country,City,Zipcode,State',
 '1,US,PARC PARQUE,704,PR',
 '2,US,PASEO COSTA DEL SUR,704,PR',
 '10,US,BDA SAN LUIS,709,PR',
 '49347,US,HOLT,32564,FL',
 '49348,US,HOMOSASSA,34487,FL',
 '61391,US,CINGULAR WIRELESS,76166,TX',
 '61392,US,FORT WORTH,76177,TX',
 '61393,US,FT WORTH,76177,TX',
 '54356,US,SPRUCE PINE,35585,AL',
 '76511,US,ASH HILL,27007,NC',
 '4,US,URB EUGENE RICE,704,PR',
 '39827,US,MESA,85209,AZ',
 '39828,US,MESA,85210,AZ',
 '49345,US,HILLIARD,32046,FL',
 '49346,US,HOLDER,34445,FL',
 '3,US,SECT LANAUSSE,704,PR',
 '54354,US,SPRING GARDEN,36275,AL',
 '54355,US,SPRINGVILLE,35146,AL',
 '76512,US,ASHEBORO,27203,NC',
 '76513,US,ASHEBORO,27204,NC']

In [71]:
dataframe.top(5)

['RecordNumber,Country,City,Zipcode,State',
 '76513,US,ASHEBORO,27204,NC',
 '76512,US,ASHEBORO,27203,NC',
 '76511,US,ASH HILL,27007,NC',
 '61393,US,FT WORTH,76177,TX']

In [72]:
temp = dataframe.first()
cols = temp.split(',')
cols

['RecordNumber', 'Country', 'City', 'Zipcode', 'State']

In [75]:
import pyspark
sparksession = pyspark.sql.SparkSession.builder.master("local").appName("BasicApp").getOrCreate() # getOrCreate() returns the existing session or creates a new one

In [77]:
type(sparksession)

pyspark.sql.session.SparkSession

In [80]:
df= sparksession.read.csv('data.csv', header=True, inferSchema=True)
# print the inferred schema as a tree
df.printSchema()


root
 |-- RecordNumber: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- State: string (nullable = true)



In [79]:
type(df)

pyspark.sql.dataframe.DataFrame

In [81]:
df.count() #count of the rows

20

In [82]:
df.first()

Row(RecordNumber=1, Country='US', City='PARC PARQUE', Zipcode=704, State='PR')

In [83]:
df.show(10)

+------------+-------+-------------------+-------+-----+
|RecordNumber|Country|               City|Zipcode|State|
+------------+-------+-------------------+-------+-----+
|           1|     US|        PARC PARQUE|    704|   PR|
|           2|     US|PASEO COSTA DEL SUR|    704|   PR|
|          10|     US|       BDA SAN LUIS|    709|   PR|
|       49347|     US|               HOLT|  32564|   FL|
|       49348|     US|          HOMOSASSA|  34487|   FL|
|       61391|     US|  CINGULAR WIRELESS|  76166|   TX|
|       61392|     US|         FORT WORTH|  76177|   TX|
|       61393|     US|           FT WORTH|  76177|   TX|
|       54356|     US|        SPRUCE PINE|  35585|   AL|
|       76511|     US|           ASH HILL|  27007|   NC|
+------------+-------+-------------------+-------+-----+
only showing top 10 rows



In [85]:
#select statement from SQL
df.select("RecordNumber","ZipCode").show(10)

+------------+-------+
|RecordNumber|ZipCode|
+------------+-------+
|           1|    704|
|           2|    704|
|          10|    709|
|       49347|  32564|
|       49348|  34487|
|       61391|  76166|
|       61392|  76177|
|       61393|  76177|
|       54356|  35585|
|       76511|  27007|
+------------+-------+
only showing top 10 rows



In [86]:
# filter is the equivalent of SQL WHERE
df.filter("ZipCode >=25000").show()


+------------+-------+-----------------+-------+-----+
|RecordNumber|Country|             City|Zipcode|State|
+------------+-------+-----------------+-------+-----+
|       49347|     US|             HOLT|  32564|   FL|
|       49348|     US|        HOMOSASSA|  34487|   FL|
|       61391|     US|CINGULAR WIRELESS|  76166|   TX|
|       61392|     US|       FORT WORTH|  76177|   TX|
|       61393|     US|         FT WORTH|  76177|   TX|
|       54356|     US|      SPRUCE PINE|  35585|   AL|
|       76511|     US|         ASH HILL|  27007|   NC|
|       39827|     US|             MESA|  85209|   AZ|
|       39828|     US|             MESA|  85210|   AZ|
|       49345|     US|         HILLIARD|  32046|   FL|
|       49346|     US|           HOLDER|  34445|   FL|
|       54354|     US|    SPRING GARDEN|  36275|   AL|
|       54355|     US|      SPRINGVILLE|  35146|   AL|
|       76512|     US|         ASHEBORO|  27203|   NC|
|       76513|     US|         ASHEBORO|  27204|   NC|
+---------

In [88]:
# describe, like on a pandas DataFrame
df.describe("Country").show()

+-------+-------+
|summary|Country|
+-------+-------+
|  count|     20|
|   mean|   null|
| stddev|   null|
|    min|     US|
|    max|     US|
+-------+-------+



In [89]:
# drop duplicates, like on a pandas DataFrame
new_df = df.dropDuplicates()

In [90]:
new_df.show()

+------------+-------+-------------------+-------+-----+
|RecordNumber|Country|               City|Zipcode|State|
+------------+-------+-------------------+-------+-----+
|       76513|     US|           ASHEBORO|  27204|   NC|
|       39827|     US|               MESA|  85209|   AZ|
|       49347|     US|               HOLT|  32564|   FL|
|           3|     US|      SECT LANAUSSE|    704|   PR|
|           1|     US|        PARC PARQUE|    704|   PR|
|       61391|     US|  CINGULAR WIRELESS|  76166|   TX|
|       39828|     US|               MESA|  85210|   AZ|
|       61392|     US|         FORT WORTH|  76177|   TX|
|          10|     US|       BDA SAN LUIS|    709|   PR|
|       49345|     US|           HILLIARD|  32046|   FL|
|           4|     US|    URB EUGENE RICE|    704|   PR|
|       49348|     US|          HOMOSASSA|  34487|   FL|
|           2|     US|PASEO COSTA DEL SUR|    704|   PR|
|       49346|     US|             HOLDER|  34445|   FL|
|       54354|     US|      SPR

In [91]:
# drop rows that contain any null, then count what remains
df.dropna('any').count()

20

In [93]:
df.show()

+------------+-------+-------------------+-------+-----+
|RecordNumber|Country|               City|Zipcode|State|
+------------+-------+-------------------+-------+-----+
|           1|     US|        PARC PARQUE|    704|   PR|
|           2|     US|PASEO COSTA DEL SUR|    704|   PR|
|          10|     US|       BDA SAN LUIS|    709|   PR|
|       49347|     US|               HOLT|  32564|   FL|
|       49348|     US|          HOMOSASSA|  34487|   FL|
|       61391|     US|  CINGULAR WIRELESS|  76166|   TX|
|       61392|     US|         FORT WORTH|  76177|   TX|
|       61393|     US|           FT WORTH|  76177|   TX|
|       54356|     US|        SPRUCE PINE|  35585|   AL|
|       76511|     US|           ASH HILL|  27007|   NC|
|           4|     US|    URB EUGENE RICE|    704|   PR|
|       39827|     US|               MESA|  85209|   AZ|
|       39828|     US|               MESA|  85210|   AZ|
|       49345|     US|           HILLIARD|  32046|   FL|
|       49346|     US|         

In [94]:
#reading second df
df2= sparksession.read.csv('data1.csv', header=True, inferSchema=True)
df2.show()

+------------+-------+
|RecordNumber|   Name|
+------------+-------+
|           1|  fdsfd|
|           2|  fsafs|
|          10|    dgg|
|       49347|  hdshd|
|       49348| hdjhjj|
|       61391|    gfj|
|       61392|    hfg|
|       61393|     fd|
|       54356|     gg|
|       76511|     kk|
|           4|     jj|
|       39827|   rhfh|
|       39828|   jytg|
|       49345|   jnnn|
|       49346| mhgnhg|
|           3|  gfdfh|
|       54354|   jfgh|
|       54355|   jgfj|
|       76512|   jgfj|
|       76513|jgfjgfj|
+------------+-------+



In [95]:
# joining the two CSV datasets on RecordNumber
df.join(df2, df.RecordNumber == df2.RecordNumber).show()

+------------+-------+-------------------+-------+-----+------------+-------+
|RecordNumber|Country|               City|Zipcode|State|RecordNumber|   Name|
+------------+-------+-------------------+-------+-----+------------+-------+
|           1|     US|        PARC PARQUE|    704|   PR|           1|  fdsfd|
|           2|     US|PASEO COSTA DEL SUR|    704|   PR|           2|  fsafs|
|          10|     US|       BDA SAN LUIS|    709|   PR|          10|    dgg|
|       49347|     US|               HOLT|  32564|   FL|       49347|  hdshd|
|       49348|     US|          HOMOSASSA|  34487|   FL|       49348| hdjhjj|
|       61391|     US|  CINGULAR WIRELESS|  76166|   TX|       61391|    gfj|
|       61392|     US|         FORT WORTH|  76177|   TX|       61392|    hfg|
|       61393|     US|           FT WORTH|  76177|   TX|       61393|     fd|
|       54356|     US|        SPRUCE PINE|  35585|   AL|       54356|     gg|
|       76511|     US|           ASH HILL|  27007|   NC|       7