# Spark Architecture and the Resilient Distributed Dataset

You learned Python in the preceding chapter. Now it is time to learn PySpark and utilize the power of a distributed system to solve problems related to big data. We generally distribute large amounts of data on a cluster and perform processing on that distributed data. 

Understanding the architecture of Spark will help you understand its various components. Before delving into the recipes, let’s explore this topic.

Figure 4-1 describes the Spark architecture.

<img src='430628_1_En_4_Fig1_HTML.gif'>

The main components of the Spark architecture are the driver and executors. For each PySpark application, there is one driver program and one or more executors running on the cluster's slave machines. You might be wondering, what is an application in the context of PySpark? An application is the complete body of code written to solve a problem.

The driver is the process that coordinates with many executors running on various slave machines. Spark follows a master/slave architecture. The SparkContext object is created by the driver. SparkContext is the main entry point to a PySpark application. You will learn more about SparkContext in upcoming chapters. In this chapter, we will run our PySpark commands in the PySpark shell. After starting the shell, we will find that the SparkContext object is automatically created. We will encounter the SparkContext object in the PySpark shell as the variable sc. The shell itself is working as our driver. The driver breaks our application into small tasks; a task is the smallest unit of your application. Tasks are run on different executors in parallel. The driver is also responsible for scheduling tasks to different executors.

Executors are slave processes. An executor runs tasks. It also has the capability to cache data in memory by using the BlockManager process. Each executor runs in its own Java Virtual Machine (JVM).

The cluster manager manages cluster resources. The driver talks to the cluster manager to negotiate resources. On behalf of the driver, the cluster manager then allocates the executor processes on the slave machines, and the driver schedules its tasks on them. PySpark ships with the Standalone Cluster Manager. PySpark can also be configured on YARN and Apache Mesos. In our recipes, you are going to see how to configure PySpark on the Standalone Cluster Manager and Apache Mesos. On a single machine, PySpark can be started in local mode too.

The most celebrated component of PySpark is the resilient distributed dataset (RDD). The RDD is a data abstraction over a distributed collection. Python collections such as lists, tuples, and sets can be distributed very easily. When a node fails, the affected RDD data is recomputed, and only the part of the data that is required is calculated or recalculated. An RDD is created using various functions defined in the SparkContext class. One important method for creating an RDD is parallelize(), which you will encounter again and again in this chapter.

Figure 4-2 illustrates the creation of an RDD.

<img src='430628_1_En_4_Fig2_HTML.gif'>

Let’s say that we have a Python collection with the elements Data1, Data2, Data3, Data4, Data5, Data6, and Data7. This collection is distributed over the cluster to create an RDD. For simplicity, we can assume that two executors are running. Our collection is divided into two parts. The first executor gets the first part of the collection, which has the elements Data1, Data2, Data3, and Data4. The second part of the collection is sent to the second executor. So, the second executor has the data elements Data5, Data6, and Data7.
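The split described above can be imitated in plain Python. The following is a minimal sketch and not Spark's actual partitioning logic: `split_into_partitions` is a hypothetical helper that cuts a list into contiguous chunks and gives any extra element to the earlier chunks, which reproduces the four-and-three split of this example.

```python
def split_into_partitions(items, num_partitions):
    """Hypothetical helper: split items into contiguous chunks.

    Earlier chunks receive the leftover elements, so 7 items in
    2 partitions become chunks of 4 and 3.
    """
    base, extra = divmod(len(items), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        partitions.append(items[start:start + size])
        start += size
    return partitions


data = ["Data1", "Data2", "Data3", "Data4", "Data5", "Data6", "Data7"]
partitions = split_into_partitions(data, 2)

for index, part in enumerate(partitions, start=1):
    print(f"executor {index}: {part}")
```

Each inner list stands for the slice of the collection held by one executor.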

We can perform two types of operations on an RDD: transformations and actions. A transformation on an RDD returns another RDD. RDDs are immutable, so changing an RDD in place is impossible; hence transformations always return another RDD. Transformations are lazy, whereas actions are eagerly evaluated. A transformation is lazy because when it is applied to an RDD, the operation is not applied to the data at that time. Instead, PySpark records the requested operation, and all the recorded transformations are applied when the first action is called.
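Lazy evaluation can be illustrated without Spark. The toy class below is only a sketch of the idea under simplifying assumptions: `LazyDataset` is a hypothetical name, it is not how PySpark is implemented, and it runs on a single machine. Its `map()` and `filter()` methods merely record the requested operation and return a new object, and nothing is computed until the `collect()` action is called.

```python
class LazyDataset:
    """Toy, single-machine stand-in for an RDD (illustration only)."""

    def __init__(self, data, ops=()):
        self._data = list(data)
        self._ops = tuple(ops)  # recorded transformations, not yet applied

    def map(self, func):
        # Transformation: record the operation and return a NEW dataset.
        return LazyDataset(self._data, self._ops + (("map", func),))

    def filter(self, predicate):
        # Transformation: record the operation and return a NEW dataset.
        return LazyDataset(self._data, self._ops + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded transformations applied.
        result = self._data
        for kind, func in self._ops:
            if kind == "map":
                result = [func(x) for x in result]
            else:
                result = [x for x in result if func(x)]
        return result


calls = []  # remembers every value the mapping function has seen


def double(x):
    calls.append(x)
    return 2 * x


base = LazyDataset([1, 2, 3])
doubled = base.map(double)        # nothing is computed here
calls_before_action = len(calls)  # still 0
result = doubled.collect()        # the action triggers the work

print(calls_before_action, result)
```

The original `base` object is never modified, which mirrors the immutability of RDDs.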

Figure 4-3 illustrates a transformation operation. The transformation on RDD1 creates RDD2. RDD1 has two partitions: the first has four elements (Data1, Data2, Data3, and Data4), and the second has three (Data5, Data6, and Data7). RDD2 has six elements, so the daughter RDD can have a different number of elements than the father RDD. RDD2 also has two partitions, each with three elements: Data8, Data9, and Data10 in the first, and Data11, Data12, and Data13 in the second. In this example both RDDs have two partitions, but in general the daughter RDD can also have a different number of partitions than the father RDD.

<img src= '430628_1_En_4_Fig3_HTML.gif'>

Figure 4-3. RDD transformations

Figure 4-4 illustrates an action performed on an RDD. In this example, we are applying the summation action. Summed data is returned to the driver. In other cases, the result of an action can be saved to a file or to another destination.
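A summation action can be pictured as two stages: each partition is summed where it lives, and only the small partial results travel to the driver. The sketch below uses plain Python lists as stand-ins for partitions, and the numeric values are made up for illustration.

```python
# Two partitions, as in the figures; the numbers are illustrative stand-ins.
partitions = [[1, 2, 3, 4], [5, 6, 7]]

# Stage 1: every executor sums its own partition.
partial_sums = [sum(part) for part in partitions]

# Stage 2: the driver combines the partial results.
total = sum(partial_sums)

print(partial_sums, total)
```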

<img src = '430628_1_En_4_Fig4_HTML.gif'>

You might be wondering, if Spark has been written in Scala, how does Python communicate with Scala? You might guess that a Python wrapper of PySpark has been written using Jython, and that this Jython code is compiled to Java bytecode and run on the JVM. This guess isn’t correct.

A running Python program can access Java objects in a JVM by using Py4J. A running Java program can also access Python objects by using Py4J. A gateway between Python and Java enables Python to use Java objects.

Driver programs use Py4J to communicate between Python and the Java SparkContext object. PySpark uses Py4J, so that PySpark Python code can

On remote cluster machines, the PythonRDD object, which runs in the JVM, creates Python subprocesses and communicates with them by using pipes.

### Create an RDD
#### Problem

You want to create an RDD.
#### Solution

As we know, an RDD is a distributed collection. You have a list with the following data:

pythonList = [2.3,3.4,4.3,2.4,2.3,4.0]

You want to do the following operations:

    Create an RDD of the list

    Get the first element

    Get the first two elements

    Get the number of partitions in the RDD

In PySpark, an RDD can be created in many ways. One way to create an RDD out of a given collection is to use the parallelize() function. The SparkContext object is used to call the parallelize() function. You’ll read more about SparkContext in an upcoming chapter.

In the case of big data, even tabular data, a table might have more than 1,000 columns. Sometimes analysts want to see what those columns of data look like. The first() function is defined on an RDD and will return the first element of the RDD.

To get more than one element from an RDD, we can use the take() function. The number of partitions of an RDD can be fetched by using getNumPartitions().

#### How It Works

Let’s follow the steps in this section to solve the problem.
#### Creating an RDD of the List

Let’s first create a Python list by using the following:

In [5]:
pythonList = [2.3,3.4,4.3,2.4,2.3,4.0]

Parallelization or distribution of data is done using the parallelize() function. This function takes two arguments. The first argument is the collection to be parallelized, and the second argument indicates the number of distributed chunks (partitions) of data you want.

In the following cells, we first create the SparkContext object sc explicitly (in the PySpark shell it already exists) and then use parallelize() to distribute our data in two partitions. To get all the data on the driver, we use the collect() function. Using collect() is not recommended in production; it should be used only for debugging.

In [6]:
from pyspark import SparkContext
sc = SparkContext()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/19 15:20:34 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [7]:
parPythonData = sc.parallelize(pythonList,2)

In [8]:
parPythonData.collect()

                                                                                

[2.3, 3.4, 4.3, 2.4, 2.3, 4.0]

#### Getting the First Element

The first() function can be used to get the first element of an RDD. You might have figured out that the collect() and first() functions perform actions:

In [10]:
parPythonData.first()

                                                                                

2.3

#### Getting the First Two Elements

Sometimes data analysts want to see more than one row of data. The take() function can be used to fetch more than one row from an RDD. The number of rows you want is given as an argument to the take() function:

In [11]:
parPythonData.take(2)

[2.3, 3.4]

#### Getting the Number of Partitions in the RDD

In order to optimize PySpark code, a proper distribution of data is required. The number of partitions of an RDD can be found using the getNumPartitions() function:

In [12]:
parPythonData.getNumPartitions()

2

Recall that we were partitioning our data into two partitions while using the parallelize() function.
### Convert Temperature Data
#### Problem

You want to convert temperature data by writing a temperature unit conversion program on an RDD.
#### Solution
You are given daily temperatures in Fahrenheit. You want to perform some analysis on that data. But your new software takes input in Celsius. Therefore, you want to convert your temperature data from Fahrenheit to Celsius. Table 4-1 shows the data you have.

<img src='430628_1_En_4_Figa_HTML.gif'>

You want to do the following:

    Convert temperature from Fahrenheit to Celsius

    Get all the temperature data points greater than or equal to 13°C

We can convert temperature from Fahrenheit to Celsius by using the following mathematical formula:
°C = (°F − 32) × 5/9
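As a quick check of the formula, the first reading used in this recipe, 59°F, works out as follows:

```latex
^{\circ}\mathrm{C} = (59 - 32) \times \frac{5}{9} = 27 \times \frac{5}{9} = 15
```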

We can see that in PySpark, this is a transformation problem. We can achieve this task by using the map() function on the RDD.

Getting all the temperatures greater than or equal to 13°C is a filtering problem. Filtering of data can be done by using the filter() function on the RDD.
#### How It Works

We’ll follow the steps in this section to complete the conversion and filtering exercises.
#### Step 4-2-1. Parallelizing the Data

We are going to parallelize data by using our parallelize() function. We are going to distribute our data in two partitions, as follows:

In [13]:
tempData = [59,57.2,53.6,55.4,51.8,53.6,55.4]
parTempData = sc.parallelize(tempData, 2)
parTempData.collect()

[59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]

#### Converting Temperature from Fahrenheit to Celsius

Now we are going to convert our temperature in Fahrenheit to Celsius. We’ll write a fahrenheitToCentigrade function, which will take the temperature in Fahrenheit and return a temperature in Celsius for a given input:

In [15]:
def fahrenheitToCentigrade(temperature) :
    centigrade = (temperature-32)*5/9
    return centigrade


We can check the fahrenheitToCentigrade function by hand: with 59 as the input in Fahrenheit, it returns 15, because 59°F is equal to 15°C. Now we apply the function to every element of the RDD by using map():

In [16]:
parCentigradeData = parTempData.map(fahrenheitToCentigrade)
parCentigradeData.collect()

[15.0, 14.000000000000002, 12.0, 13.0, 10.999999999999998, 12.0, 13.0]

We have converted the given temperatures to Celsius. Now let’s keep only the temperatures greater than or equal to 13°C.
#### Filtering Temperatures Greater Than or Equal to 13°C

To filter data, we can use the filter() function on the RDD. We have to provide a predicate as input to the filter() function. A predicate is a function that tests a condition and returns True or False.

Let’s define the predicate tempMoreThanThirteen, which takes a temperature value and returns True if the input is greater than or equal to 13.

We then pass tempMoreThanThirteen as input to the filter() function. The filter() function iterates over each value in the parCentigradeData RDD and applies tempMoreThanThirteen to it. The values for which tempMoreThanThirteen returns True end up in filteredTemprature:

In [17]:
def tempMoreThanThirteen(temperature): return temperature >=13

In [18]:
filteredTemprature = parCentigradeData.filter(tempMoreThanThirteen)
filteredTemprature.collect()

[15.0, 14.000000000000002, 13.0, 13.0]

We can replace our predicate with a lambda function, which keeps a short predicate next to the call that uses it. In the following code, filter() takes a lambda as its predicate and keeps all the values greater than or equal to 13:

In [20]:
filteredTemprature = parCentigradeData.filter(lambda x: x>=13)
filteredTemprature.collect()

[15.0, 14.000000000000002, 13.0, 13.0]

We finally have four elements indicating a temperature greater than or equal to 13°C. So now you understand the way to do basic analysis on data with PySpark.
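The converted values above contain floating-point artifacts such as 14.000000000000002. The following plain-Python sketch mirrors the map-then-filter logic of this recipe without Spark and rounds each result to one decimal place. The rounding step and the snake_case names are additions for illustration and are not part of the recipe.

```python
temp_fahrenheit = [59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]


def fahrenheit_to_celsius(temperature):
    """Same formula as in the recipe: (F - 32) * 5 / 9."""
    return (temperature - 32) * 5 / 9


# "map" step: convert every reading, rounding away floating-point noise.
temp_celsius = [round(fahrenheit_to_celsius(t), 1) for t in temp_fahrenheit]

# "filter" step: keep readings of 13 degrees Celsius or more.
warm_readings = [t for t in temp_celsius if t >= 13]

print(temp_celsius)
print(warm_readings)
```

On an RDD, the same rounding could be folded into the function passed to map().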
### Perform Basic Data Manipulation
#### Problem

You want to do data manipulation and run aggregation operations.
#### Solution
In this recipe, you are given data indicating student grades for a two-year (four-semester) course. Seven students are enrolled in this course. Table 4-2 depicts two years of grade data, divided into semesters, for seven enrolled students.

<img src = '430628_1_En_4_Figb_HTML.gif'>

Table 4-2. Student Grades

You want to calculate the following:

    Average grades per semester, each year, for each student

    Top three students who have the highest average grades in the second year

    Bottom three students who have the lowest average grades in the second year

    All students who have earned more than an 80% average in the second year

Using the map() function is often helpful. In this example, the average grades per semester, for each year, can be calculated using map().

Getting the top k elements, such as the k best-performing bonds, is a common data science problem. The PySpark takeOrdered() function takes the top k or bottom k elements from our RDD.

Students who have earned more than 80% averages in the second year can be filtered using the filter() function.
#### How It Works

Let’s solve our problem in steps. We will start with creating an RDD of our data.
#### Step 4-3-1. Making a List from a Given Table

In this step, we’ll create a nested list. This means that each element of the list is a record, and each record is a list in itself:

In [22]:
studentMarksData = [["si1","year1",62.08,62.4],
 ["si1","year2",75.94,76.75],
 ["si2","year1",68.26,72.95],
 ["si2","year2",85.49,75.8],
 ["si3","year1",75.08,79.84],
 ["si3","year2",54.98,87.72],
 ["si4","year1",50.03,66.85],
 ["si4","year2",71.26,69.77],
 ["si5","year1",52.74,76.27],
 ["si5","year2",50.39,68.58],
 ["si6","year1",74.86,60.8],
 ["si6","year2",58.29,62.38],
 ["si7","year1",63.95,74.51],
 ["si7","year2",66.69,56.92]]

#### Parallelizing the Data

After parallelizing the data by using the parallelize() function, we will find that we have an RDD in which each element is a list itself.

As we know, the collect() function brings the whole RDD to the driver. If the RDD is very large, the driver may run out of memory. To fetch only the first n elements of an RDD, we can use the take() function with n as its input. As an example, in the following code, we parallelize the data into four partitions and then fetch two elements of our RDD. Remember that take() is an action:

In [23]:
studentMarksDataRDD = sc.parallelize(studentMarksData,4)

In [24]:
studentMarksDataRDD.take(2)

[['si1', 'year1', 62.08, 62.4], ['si1', 'year2', 75.94, 76.75]]

#### Calculating Average Semester Grades

Now let me explain what I want to do in the following code. Just consider the first element of the RDD. Our first element of the RDD is ['si1', 'year1', 62.08, 62.4], which is a list of four elements. Our work is to calculate the mean of grades from two semesters. In the first element, the mean is 0.5(62.08 + 62.4). We are going to use the map() function to get our solution.

In [25]:
studentMarksMean = studentMarksDataRDD.map(lambda x: [x[0],x[1],(x[2] + x[3])/2])
studentMarksMean.collect()

[['si1', 'year1', 62.239999999999995],
 ['si1', 'year2', 76.345],
 ['si2', 'year1', 70.605],
 ['si2', 'year2', 80.645],
 ['si3', 'year1', 77.46000000000001],
 ['si3', 'year2', 71.35],
 ['si4', 'year1', 58.44],
 ['si4', 'year2', 70.515],
 ['si5', 'year1', 64.505],
 ['si5', 'year2', 59.485],
 ['si6', 'year1', 67.83],
 ['si6', 'year2', 60.335],
 ['si7', 'year1', 69.23],
 ['si7', 'year2', 61.805]]

#### Filtering Student Average Grades in the Second Year

The following line of code keeps only the second-year data. We have implemented our predicate by using a lambda function. The predicate checks whether the string year2 appears in the second field of each record, x[1]. If the predicate returns True, the record holds second-year grades.

In [26]:
secondYearMarks = studentMarksMean.filter(lambda x: "year2" in x[1])
secondYearMarks.collect()

[['si1', 'year2', 76.345],
 ['si2', 'year2', 80.645],
 ['si3', 'year2', 71.35],
 ['si4', 'year2', 70.515],
 ['si5', 'year2', 59.485],
 ['si6', 'year2', 60.335],
 ['si7', 'year2', 61.805]]

We can clearly see that the RDD output of secondYearMarks has only second-year grades.
#### Finding the Top Three Students

We can get the top three students in two ways. The first method is to sort the full data by average grade in decreasing order and take the first three records. Sorting is done by the sortBy() function.

In our sortBy() function, we provide the keyfunc parameter, which returns the value to sort by for each record. Negating the average grade makes the sort run in decreasing order. Now collect the output and see the result:

In [29]:
sortedMarksData = secondYearMarks.sortBy(keyfunc = lambda x : -x[2]) # descending
# sortedMarksData = secondYearMarks.sortBy(keyfunc = lambda x : x[2]) would sort ascending
sortedMarksData.collect()

[['si2', 'year2', 80.645],
 ['si1', 'year2', 76.345],
 ['si3', 'year2', 71.35],
 ['si4', 'year2', 70.515],
 ['si7', 'year2', 61.805],
 ['si6', 'year2', 60.335],
 ['si5', 'year2', 59.485]]

In [30]:
sortedMarksData.take(3)

[['si2', 'year2', 80.645], ['si1', 'year2', 76.345], ['si3', 'year2', 71.35]]

We have our answer. But can we optimize it further? To get the top three, we sorted the whole RDD. We can avoid the full sort by using the takeOrdered() function. This function takes two arguments: num, the number of elements we require, and key, a function that determines the ordering.

In the following code, we set num to 3 and pass a key that negates the average grade, so that the three highest averages are returned in decreasing order.


In [32]:
topThreeStudents = secondYearMarks.takeOrdered(num=3, key = lambda x :-x[2])
topThreeStudents

[['si2', 'year2', 80.645], ['si1', 'year2', 76.345], ['si3', 'year2', 71.35]]

In order to print the result, we are not using the collect() function to get the data. Remember that transformation creates another RDD, so we require the collect() function to collect data. But an action will directly fetch the data to the driver, and collect() is not required. So you can conclude that the takeOrdered() function is an action.
#### Finding the Bottom Three Students

We have to find the bottom three students in terms of their average grades. One way is to sort the data in increasing order and take the three on top. But that is not an efficient way, so we will use the takeOrdered() function again, but with a different key parameter:

In [34]:
bottomThreeStudents = secondYearMarks.takeOrdered(num=3, key = lambda x :x[2])
bottomThreeStudents

[['si5', 'year2', 59.485], ['si6', 'year2', 60.335], ['si7', 'year2', 61.805]]
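The idea of selecting k elements without sorting everything can be tried in plain Python with the standard-library `heapq` module. This is an analogy on a local list, not a description of how takeOrdered() works on a cluster. The data is the second-year output shown earlier, and the variable names are chosen for this sketch.

```python
import heapq

# Second-year average grades, copied from the secondYearMarks output above.
second_year_marks = [
    ["si1", "year2", 76.345],
    ["si2", "year2", 80.645],
    ["si3", "year2", 71.35],
    ["si4", "year2", 70.515],
    ["si5", "year2", 59.485],
    ["si6", "year2", 60.335],
    ["si7", "year2", 61.805],
]

# Select the three largest and three smallest averages without a full sort.
top_three = heapq.nlargest(3, second_year_marks, key=lambda x: x[2])
bottom_three = heapq.nsmallest(3, second_year_marks, key=lambda x: x[2])

print([record[0] for record in top_three])
print([record[0] for record in bottom_three])
```

Selecting k elements from n this way costs roughly n log k comparisons instead of the n log n of a full sort, which matters when k is small and n is large.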

#### Getting All Students with 80% Averages

Now that you understand the filter() function, it is easy to guess that we can solve this problem by using filter(). We will have to provide a predicate, which will return True if grades are greater than 80; otherwise, it returns False.

In [35]:
moreThan80Marks = secondYearMarks.filter(lambda x: x[2]>80)
moreThan80Marks.collect()

[['si2', 'year2', 80.645]]

### Run Set Operations
#### Problem

You want to run set operations on a research company’s data.
#### Solution

XYZ Research is a company that performs research on many diversified topics. Each research project comes with a research ID. Research may come to a conclusion in one year or may take more than one year. The following data is provided, indicating the number of research projects being conducted in three years:

2001: RIN1, RIN2, RIN3, RIN4, RIN5, RIN6, RIN7

2002: RIN3, RIN4, RIN7, RIN8, RIN9

2003: RIN4, RIN8, RIN10, RIN11, RIN12


Now we have to answer the following questions:

    How many research projects were initiated in the three years?

    How many projects were completed in the first year?

    How many projects were completed in the first two years?

A set is a collection of distinct elements. PySpark performs pseudo set operations. They are called pseudo set operations because some functions do not remove duplicate elements.

Remember, the first question is not asking about completed projects. The total number of research projects initiated in three years is just the union of all three years of data. You can perform a union on two RDDs by using the union() function.

The projects that have been started in the first year and not in the second year are the projects that have been completed in the first year. Every project that is started is completed. We can use the subtract() function to find all the projects that were completed in the first year.

If we make a union of first-year and second-year projects and subtract third-year projects, we are going to get all the projects that have been completed in the first two years.
#### How It Works

Let’s solve this problem step-by-step.
#### Creating a List of Research Data by Year

Let’s start with creating a list of all the projects that the company worked on each year:

In [37]:
data2001 = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002 = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003 = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

The data2001 list contains all the projects started in 2001. Similarly, data2002 contains all the research projects that either are continuing from 2001 or started in 2002. The data2003 list contains all the projects that the company worked on in 2003.
#### Parallelizing the Data (Creating the RDD)

After creating the lists, we have to parallelize our data. Parallelizing gives us three RDDs: parData2001, parData2002, and parData2003. The parallelize() calls appear at the start of the next step.
#### Finding Projects Initiated in Three Years

The total number of projects initiated in three years is determined by getting the union of all the data for the given three years. The RDD union() function takes another RDD as input and returns a new RDD that merges the two. We start by creating the three RDDs:

In [39]:
parData2001 = sc.parallelize(data2001,2)
parData2002 = sc.parallelize(data2002,2)
parData2003 = sc.parallelize(data2003,2)

Next, we calculate the union of the research projects worked on in the first year and the second year. In the output that follows, the unioned data, unionof01_02, has duplicate values. Duplicate values are not allowed in a set, which is why a set operation on an RDD is known as a pseudo set operation. Don’t worry; we will remove these duplicates.

To get all the research projects that have been initiated in three years, we then take the union of unionof01_02 and parData2003:

In [40]:
unionof01_02 = parData2001.union(parData2002)
unionof01_02.collect()

['RIN1',
 'RIN2',
 'RIN3',
 'RIN4',
 'RIN5',
 'RIN6',
 'RIN7',
 'RIN3',
 'RIN4',
 'RIN7',
 'RIN8',
 'RIN9']

In [41]:
all_research = unionof01_02.union(parData2003)
all_research.collect()

['RIN1',
 'RIN2',
 'RIN3',
 'RIN4',
 'RIN5',
 'RIN6',
 'RIN7',
 'RIN3',
 'RIN4',
 'RIN7',
 'RIN8',
 'RIN9',
 'RIN4',
 'RIN8',
 'RIN10',
 'RIN11',
 'RIN12']

#### Making Sets of Distinct Data

We are going to apply the distinct() function to our RDD all_research:

In [42]:
allunique_research = all_research.distinct()

In [45]:
allunique_research.collect()

                                                                                

['RIN1',
 'RIN10',
 'RIN12',
 'RIN2',
 'RIN3',
 'RIN5',
 'RIN8',
 'RIN4',
 'RIN9',
 'RIN11',
 'RIN6',
 'RIN7']

We can see that we have all the research projects that were initiated in the first three years.
#### Counting Distinct Elements

Now count all the distinct research projects by using the count() function on the RDD:

In [46]:
all_research.distinct().count()

                                                                                

12

#### Finding Projects Completed the First Year

Let’s say we have two sets, A and B. Subtracting set B from set A will give us all the elements that are members of set A but not set B. So now it is clear that, in order to know all the projects that have been completed in the first year (2001), we have to subtract the projects in year 2002 from all the projects in year 2001.

Subtraction on a set can be done with the subtract() function:

In [47]:
firstYearCompletion = parData2001.subtract(parData2002)
firstYearCompletion.collect()

['RIN1', 'RIN2', 'RIN5', 'RIN6']

We have all the projects that were completed in 2001. Four projects were completed in 2001.
#### Finding Projects Completed in the First Two Years

A union of RDDs gives us all the projects started in the first two years. After getting all the projects started in the first two years, if we then subtract projects running and started in the third year, we will return all the projects completed in the first two years. The following is the implementation:

In [49]:
first2yearCompletion = parData2001.union(parData2002).subtract(parData2003)
first2yearCompletion.distinct().collect()

                                                                                

['RIN1', 'RIN2', 'RIN3', 'RIN5', 'RIN9', 'RIN6', 'RIN7']

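#### Finding Projects Started in 2001 and Completed in 2002

Projects that appear in both 2001 and 2002 but not in 2003 were started in 2001, continued into 2002, and completed before 2003. This step uses the intersection() method defined in PySpark on the RDD, followed by subtract():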

In [51]:
continuing = parData2001.intersection(parData2002).subtract(parData2003)
continuing.distinct().collect()

                                                                                

['RIN3', 'RIN7']
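The answers in this recipe can be cross-checked on the driver with Python's built-in set type. The sketch below is plain Python rather than PySpark, and the variable names other than data2001, data2002, and data2003 are chosen for illustration. List concatenation plays the role of union() because it keeps duplicates, and converting to a set plays the role of distinct().

```python
data2001 = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002 = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003 = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

# Like RDD union(): duplicates are kept.
union_with_duplicates = data2001 + data2002 + data2003

# Like distinct(): duplicates are removed.
distinct_projects = set(union_with_duplicates)

# Like subtract(): elements of the left side that are absent from the right.
completed_first_year = set(data2001) - set(data2002)
completed_first_two_years = (set(data2001) | set(data2002)) - set(data2003)

# Like intersection() followed by subtract().
in_2001_and_2002_only = (set(data2001) & set(data2002)) - set(data2003)

print(len(union_with_duplicates), len(distinct_projects))
print(sorted(completed_first_year))
print(sorted(in_2001_and_2002_only))
```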

### Calculate Summary Statistics
#### Problem

You want to calculate summary statistics on given data.
#### Solution

Renewable energy sources are gaining in popularity all over the world. The company FindEnergy wants to install windmills at a given location. For efficient operation of windmills, the air requires certain characteristics.
Data is collected as shown in Table 4-3.
 
<img src = '430628_1_En_4_Figc_HTML.gif'>


Table 4-3. Air Velocity Data
                                   
You, as a data scientist, want to calculate the following quantities:

    Number of data points

    Summation of air velocities over a day

    Mean air velocity in a day

    Variance of air data

    Sample variance of air data

    Standard deviation of air data

    Sample standard deviation of air data 

PySpark provides many functions to summarize data on the RDD. The number of elements in an RDD can be found by using the count() function on the RDD. There are two ways to sum all the data in a given RDD. The first is to apply the sum() method to the RDD. The second is to apply the reduce() function to the RDD.

The mean represents the center point of the given data, and it can be calculated in two ways too. We are going to use the mean() method and the fold() method to calculate the mean.

The variance, which indicates the spread of data around the mean, can be calculated using the variance() function. Similarly, the sample variance can be calculated by using the sampleVariance() method on the RDD.

Standard deviation and sample standard deviation will be calculated using the stdev() and sampleStdev() methods, respectively.

PySpark provides the stats() method, which can calculate all the previously mentioned quantities in one go.

#### How It Works

We’ll follow the steps in this section to reach a solution.
#### Parallelizing the Data

Let’s parallelize the air velocity data from a list:

In [52]:
airVelocityKMPH = [12,13,15,12,11,12,11]
parVelocityKMPH = sc.parallelize(airVelocityKMPH,2)

The parVelocityKMPH variable is an RDD.
#### Getting the Number of Data Points

The number of data points gives us an idea of the data size. We apply the count() function to get the number of elements in the RDD:

In [53]:
parVelocityKMPH.count()

7

The total number of data points is seven.
#### Summing Air Velocities in a Day

Let’s get the summation by using the sum() method:

In [54]:
parVelocityKMPH.sum()

86
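The Solution also mentions reduce() as a second way to sum the data. The RDD reduce() function takes a two-argument function and applies it repeatedly until a single value is left. The plain-Python sketch below shows the same idea with `functools.reduce` on a local list. The variable names are chosen for illustration, and the Spark call in the comment is an assumption about how the recipe's RDD would be used.

```python
from functools import reduce
from operator import add

air_velocity_kmph = [12, 13, 15, 12, 11, 12, 11]

# Combine the elements pairwise with add until one value remains.
# On the RDD, the analogous call would be parVelocityKMPH.reduce(add).
total_velocity = reduce(add, air_velocity_kmph)

print(total_velocity)
```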

#### Finding the Mean Air Velocity
Figure 4-5 shows the mathematical formula for finding a mean, where x1, x2, . . . xn are n data points.

<img src = '430628_1_En_4_Fig5_HTML.gif'>

#### Calculating the Mean

We calculate the mean by using the mean() function defined on the RDD:

In [55]:
parVelocityKMPH.mean()

12.285714285714286
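The Solution names fold() as a second way to obtain the mean. A fold starts from a zero value and combines elements with a two-argument function, first within each partition and then across the partition results. The sketch below imitates that in plain Python under stated assumptions: the split into two partitions is assumed, `add_pairs` is a hypothetical helper, and each value is first turned into a (value, 1) pair so that the sum and the count can be accumulated together.

```python
from functools import reduce


def add_pairs(a, b):
    """Add two (sum, count) pairs element by element."""
    return (a[0] + b[0], a[1] + b[1])


zero = (0, 0)  # zero value: empty sum, empty count

# Assumed split of the air velocity data into two partitions.
partitions = [[12, 13, 15], [12, 11, 12, 11]]

# Fold inside each partition, starting from the zero value.
partials = [
    reduce(add_pairs, [(x, 1) for x in part], zero)
    for part in partitions
]

# Fold the per-partition results together, again from the zero value.
total, count = reduce(add_pairs, partials, zero)
mean_velocity = total / count

print(total, count, mean_velocity)
```

On an RDD, the same approach would map each value to a (value, 1) pair and then call fold() with (0, 0) as the zero value and a pair-adding function.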

#### Finding the Variance of Air Data
If we have the data points x1, x2, . . . xn, then Figure 4-6 shows the mathematical formula for calculating variance. We are going to calculate the variance of the given air data by using the variance() function defined on the RDD.

<img src = '430628_1_En_4_Fig6_HTML.gif'>

In [56]:
parVelocityKMPH.variance()

1.63265306122449

#### Calculating Sample Variance

The variance() function calculates the population variance. To calculate the sample variance, we have to use sampleVariance() defined on the RDD.
For data points x1, x2, . . . xn, the sample variance is defined in Figure 4-7.

<img src = '430628_1_En_4_Fig7_HTML.gif'>

In [57]:
parVelocityKMPH.sampleVariance()

1.904761904761905
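The two results can be verified by hand. The block below assumes the standard definitions of the mean, the population variance, and the sample variance, which is what Figures 4-5 through 4-7 are taken to show, and applies them to the seven air velocity values 12, 13, 15, 12, 11, 12, and 11.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{86}{7} \approx 12.2857 \\
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2 = 1068 - \frac{7396}{7} = \frac{80}{7} \approx 11.4286 \\
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{80}{49} \approx 1.6327 \\
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{80}{42} \approx 1.9048
\end{aligned}
```

The only difference between the two variances is the divisor: n for the population variance and n − 1 for the sample variance.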

#### Calculating Standard Deviation

The standard deviation is the square root of the variance value. Let’s calculate the standard deviation by using the stdev() function:

In [58]:
parVelocityKMPH.stdev()

1.2777531299998799
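The recipe also asks for the sample standard deviation, which the Solution assigns to the sampleStdev() method. Both standard deviations can be cross-checked on the driver with Python's standard-library `statistics` module. This is a plain-Python sketch, not PySpark code, and the variable names are chosen for illustration.

```python
import statistics

air_velocity_kmph = [12, 13, 15, 12, 11, 12, 11]

# Population standard deviation: square root of the population variance.
population_stdev = statistics.pstdev(air_velocity_kmph)

# Sample standard deviation: square root of the sample variance.
sample_stdev = statistics.stdev(air_velocity_kmph)

print(population_stdev, sample_stdev)
```

The first value should agree with the stdev() output above, and the second is the square root of the sample variance calculated earlier.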

#### Calculating All Values in One Step

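We can calculate all the values of the summary statistics in one go by using the stats() function, which returns a StatCounter object. Let’s use stats() to calculate the summary statistics of the air velocity data. Note that in the output that follows, the printed StatCounter shows the population standard deviation (1.2777...), whereas the dictionary returned by asDict() shows the sample standard deviation and sample variance (1.3801... and 1.9047...):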

In [59]:
parVelocityKMPH.stats()

(count: 7, mean: 12.285714285714286, stdev: 1.2777531299998799, max: 15.0, min: 11.0)

In [60]:
parVelocityKMPH.stats().asDict()

{'count': 7,
 'mean': 12.285714285714286,
 'sum': 86.0,
 'min': 11.0,
 'max': 15.0,
 'stdev': 1.3801311186847085,
 'variance': 1.904761904761905}

In [61]:
parVelocityKMPH.stats().asDict()['max']

15.0

In [62]:
parVelocityKMPH.stats().asDict()['mean']

12.285714285714286