# Big Data fundamentals with PySpark

- Apache Spark was originally written in the Scala programming language. PySpark was developed to support Python with Spark. Unlike previous versions, the newest version of PySpark provides computation power similar to Scala.

- Spark comes with interactive shells that enable ad-hoc data analysis. The Spark shell is an interactive environment through which one can access Spark's functionality quickly and conveniently.

- The Spark shell is particularly helpful for fast interactive prototyping before running jobs on clusters. Unlike most other shells, the Spark shell allows you to interact with data that is distributed on disk or in memory across many machines, and Spark takes care of automatically distributing this processing. Spark provides the shell in three programming languages: spark-shell for Scala, pyspark for Python, and sparkR for R.

- The PySpark shell is the Python-based command-line tool for developing Spark's interactive applications in Python. PySpark helps data scientists interface with Spark data structures from Python. Similar to the Scala shell, the PySpark shell has been augmented to support connecting to a cluster.


## Understanding SparkContext

A SparkContext represents the entry point to Spark functionality, much like a key to your car. In the PySpark shell, PySpark automatically creates a SparkContext for you (so you don't have to create it yourself) and exposes it via the variable `sc`.

1. Print the version of SparkContext in the PySpark shell.
2. Print the Python version of SparkContext in the PySpark shell.
3. What is the master of SparkContext in the PySpark shell?

In [None]:
# Print the version of SparkContext
print("The version of Spark Context in the PySpark shell is", sc.version)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is", sc.pythonVer)

# Print the master of SparkContext
print("The master of Spark Context in the PySpark shell is", sc.master)

## Interactive Use of PySpark

Spark comes with an interactive Python shell in which PySpark is already installed. The PySpark shell is useful for basic testing and debugging, and it is quite powerful. The easiest way to demonstrate the power of the PySpark shell is to start using it. In this example, you'll load a simple list containing the numbers 1 to 100 in the PySpark shell.

The most important thing to understand here is that we are not creating any SparkContext object, because PySpark automatically creates a SparkContext object named `sc` by default in the PySpark shell.

1. Create a python list named numb containing the numbers 1 to 100.
2. Load the list into Spark using Spark Context's parallelize method and assign it to a variable spark_data.

In [None]:
# Create a python list of numbers from 1 to 100 (range excludes its end value)
numb = list(range(1, 101))

# Load the list into PySpark  
spark_data = sc.parallelize(numb)

## Load File

In PySpark, we express our computation through operations on distributed collections that are automatically parallelized across the cluster. The previous example loaded a list as a parallelized collection; in this example, you'll load data from a local file in the PySpark shell.

Remember that a SparkContext `sc` and a `file_path` variable (the path to the README.md file) are already available in your workspace.

1. Load a local text file README.md in PySpark shell.

In [None]:
# Load a local file into PySpark shell
lines = sc.textFile(file_path)

## Review of functional programming in Python

Understanding PySpark becomes easier if we understand functional programming principles in Python.

Anonymous functions in Python are functions that are not bound to a name at runtime. They are created with the `lambda` construct and are often used in conjunction with the `map` and `filter` functions. Like `def`, `lambda` creates a function to be called later in the program. However, it returns the function instead of assigning it to a name, which is why lambdas are called anonymous functions.

In practice, they are used to inline a function definition or to defer execution of a piece of code.

Lambda functions can be used whenever function objects are required. They can have any number of arguments but only one expression and the expression is evaluated and returned. 

Lambda function syntax: `lambda arguments: expression`

The main difference between `def` and `lambda` is that a `lambda` definition does not include a `return` statement; it always contains a single expression whose value is returned.
Also note that we can put a lambda definition anywhere a function is expected, and we don't have to assign it to a variable at all.
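
As a small illustration of the points above, the following sketch defines the same doubling function once with `def` and once with `lambda`, then passes a lambda inline without naming it. The names `double_def`, `double_lambda`, and `words` are made up for this example.

```python
# A named function defined with def needs an explicit return statement.
def double_def(x):
    return x * 2

# The equivalent lambda: the single expression is the return value.
double_lambda = lambda x: x * 2

# A lambda can be passed inline wherever a function is expected,
# here as the sort key, without ever being bound to a name.
words = ["spark", "is", "fast"]
sorted_by_length = sorted(words, key=lambda w: len(w))

print(double_def(3), double_lambda(3))
print(sorted_by_length)
```

Both functions behave identically when called; the lambda form is mainly convenient for short, throwaway functions like the sort key.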



We use `lambda` functions when we need a nameless function for a short period of time.

The `map()` function takes a function and an iterable (such as a list) and applies the function to each item. In Python 3 it returns an iterator, so wrap it in `list()` to get a list.

General syntax: `map(function, iterable)`

The `filter()` function takes a function and an iterable and keeps only the items for which the function evaluates to true. It also returns an iterator in Python 3.

General syntax: `filter(function, iterable)`





In [None]:
double = lambda x: x*2
print(double(3))

In [2]:
# Example of map() with lambda
items = [1, 2, 3, 4]
list(map(lambda x: x + 2, items))

# Example of filter with lambda
list(filter(lambda x: (x%2 !=0), items))

[1, 3]

## Use of lambda with map() and filter()

The `map()` function in Python applies the given function to each item of a given iterable (list, tuple, etc.). In Python 3 it returns the results as an iterator; wrapping the call in `list()` produces a list.

The general syntax of the `map()` function is `map(fun, iter)`. We can also use lambda functions with `map()`: `map(lambda <argument>: <expression>, iter)`.

Similarly, the general syntax of the `filter()` function with a lambda is `filter(lambda <argument>: <expression>, iter)`.

In [4]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Print my_list in the console
print("Input list is", my_list)

# Square all numbers in my_list
squared_list_lambda = list(map(lambda x: x ** 2, my_list))

# Print the result of the map function
print("The squared numbers are", squared_list_lambda)

my_list2 = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

# Print my_list2 in the console
print("Input list is:", my_list2)

# Filter numbers divisible by 10
filtered_list = list(filter(lambda x: (x%10 == 0), my_list2))

# Print the numbers divisible by 10
print("Numbers divisible by 10 are:", filtered_list)

Input list is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The squared numbers are [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
Input list is: [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]
Numbers divisible by 10 are: [10, 40, 60, 80]


# What is RDD

`RDD` stands for Resilient Distributed Dataset. It is simply a collection of data distributed across the cluster, and it is the fundamental, backbone data type in Spark.

When Spark starts processing data, it divides the data into partitions and distributes them across the cluster nodes, with each node holding a slice of the data.

![RDD](./Images/RDD1.png)


`Resilient` - Ability to withstand failures (and recompute missing or damaged partitions).

`Distributed` - Spanning the job across multiple machines for efficient computation.

`Dataset` - Collection of partitioned data (e.g. arrays, tables, tuples).

## Creating RDD

1. Parallelizing an existing collection of objects (e.g. a list, array, or set) with `sc.parallelize()`.

2. Loading data from external datasets, which is the more common way: files stored in HDFS, objects in Amazon S3 buckets, or lines of a text file stored locally, passed to SparkContext's `textFile` method.

3. Transforming existing RDDs.








In [None]:
numRDD = sc.parallelize([1, 2, 3, 4])
helloRDD = sc.parallelize("Hello World")
fileRDD = sc.textFile("README.md")

# Confirm the type of the object
type(helloRDD)

## Partitioning in PySpark

A partition is a logical division of a large distributed dataset, with the parts stored in multiple locations across the cluster.

By default, Spark partitions the data when the RDD is created, based on several factors such as available resources and external datasets. This behavior can be controlled by passing a second argument that sets the number of partitions: `numSlices` for `sc.parallelize()`, or `minPartitions` for `sc.textFile()`, which defines the minimum number of partitions to create.

The number of partitions in an RDD can be found with the `getNumPartitions()` method.







In [None]:
# parallelize() takes numSlices; textFile() takes minPartitions
numRDD = sc.parallelize(range(10), numSlices = 6)
fileRDD = sc.textFile("README.md", minPartitions = 6)

Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It is an immutable distributed collection of objects. Since RDD is a fundamental and backbone data type in Spark, it is important that you understand how to create it (Example 1). 

PySpark can easily create RDDs from files that are stored in external storage such as HDFS (Hadoop Distributed File System), Amazon S3 buckets, etc. However, the most common method of creating RDDs is from files stored in your local file system. This method takes a file path and reads it as a collection of lines (Example 2).

SparkContext's `textFile()` method takes an optional second argument called `minPartitions` for specifying the minimum number of partitions (Example 3).

In [None]:
# Example 1
# Create an RDD from a list of words
RDD = sc.parallelize(["Spark", "is", "a", "framework", "for", "Big Data processing"])

# Print out the type of the created object
print("The type of RDD is", type(RDD))

# Example 2
# Print the file_path
print("The file_path is", file_path)

# Create a fileRDD from file_path
fileRDD = sc.textFile(file_path)

# Check the type of fileRDD
print("The file type of fileRDD is", type(fileRDD))

# Example 3
# Check the number of partitions in fileRDD
print("Number of partitions in fileRDD is", fileRDD.getNumPartitions())

# Create a fileRDD_part from file_path with 5 partitions
fileRDD_part = sc.textFile(file_path, minPartitions = 5)

# Check the number of partitions in fileRDD_part
print("Number of partitions in fileRDD_part is", fileRDD_part.getNumPartitions())

## Operations supported by RDDs

1. Transformations: Operations that return a new RDD.

2. Actions: Operations that run a computation on an RDD and return a value.

The most important feature that helps RDDs with fault tolerance and resource optimization is **Lazy Evaluation**.

## What is Lazy Evaluation

Spark builds a graph from all the operations you perform on an RDD, and execution of the graph starts only when an action is performed on the RDD.
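
The idea can be illustrated with a plain-Python analogy. This is not Spark code: Python's built-in `map()` merely happens to be lazy in a loosely similar way, and the sketch does not model Spark's operation graph. The `calls` list and the `square` function are made up for the example and only serve to record when work actually happens.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
calls = []  # records every time the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

numbers = [1, 2, 3, 4]

# Like a transformation: map() only builds a recipe; nothing runs yet.
lazy_squares = map(square, numbers)
calls_before_action = len(calls)

# Like an action: consuming the iterator triggers the computation.
result = list(lazy_squares)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

No call to `square` is recorded until the result is materialized, which mirrors how RDD transformations do no work until an action runs.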

## Basic RDD Transformations

`map()`, `filter()`, `flatMap()` and `union()`

**map** transformation takes a function and applies it to every element in the RDD.

The main method by which you can manipulate data in PySpark is `map()`. It can be used to do any number of things, from fetching the website associated with each URL in a collection to simply squaring numbers.

**filter** transformation takes a function that tests a condition and returns only the elements that pass it.

**flatMap** transformation can return multiple values for each element in the source RDD (e.g. splitting an input string into words).

**union** transformation returns the union of one RDD with another RDD.
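
The difference between `map` and `flatMap` is easiest to see side by side. The sketch below imitates the two behaviours on an ordinary Python list, using `itertools.chain` for the flattening step. It is an analogy for the semantics only and does not use Spark.

```python
from itertools import chain

lines = ["Hello World", "How are you"]

# map-like behaviour: exactly one output element per input element,
# so splitting each line yields a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-like behaviour: each input may yield several outputs,
# and the results are flattened into a single sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

The map-like result keeps one list per line, while the flatMap-like result is a single flat list of words.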



In [None]:
# map example
RDD = sc.parallelize([1,2,3,4])
RDD_map = RDD.map(lambda x: x * x)

# filter example
RDD = sc.parallelize([1,2,3,4])
RDD_filter = RDD.filter(lambda x: x > 2)

# flatMap example
RDD = sc.parallelize(["Hello World","How are you"])
RDD_flatmap = RDD.flatMap(lambda x: x.split(" "))

# union example
inputRDD = sc.textFile("logs.txt")
errorRDD = inputRDD.filter(lambda x: "error" in x.split())
warningsRDD = inputRDD.filter(lambda x: "warnings" in x.split())
combinedRDD = errorRDD.union(warningsRDD)

# RDD Actions

Actions are operations that return a value after running a computation on the RDD.

## Four Basic Actions

`collect`, `take(N)`, `first` and `count`.

**collect** action returns the complete list of elements from the RDD.

**take(N)** returns a list with the first N elements from the RDD.

**first** returns the first element of an RDD.

**count** returns the number of elements in an RDD.


In [None]:
# collect example
RDD_map.collect()

# take example
RDD_map.take(2)

# first example
RDD_map.first()

# count example
RDD_flatmap.count()

1. Create map() transformation that cubes all of the numbers in numbRDD.
2. Collect the results in a numbers_all variable.
3. Print the output from numbers_all variable.

In [None]:
# Create map() transformation to cube numbers
cubedRDD = numbRDD.map(lambda x: x*x*x)

# Collect the results
numbers_all = cubedRDD.collect()

# Print the numbers from numbers_all
for numb in numbers_all:
	print(numb)

# Filter and Count

1. Create filter() transformation to select the lines containing the keyword Spark.

2. How many lines in fileRDD_filter contain the keyword Spark?

3. Print the first four lines of the resulting RDD.

In [None]:
# Filter the fileRDD to select lines with Spark keyword
fileRDD_filter = fileRDD.filter(lambda line: 'Spark' in line)

# How many lines contain the keyword Spark?
print("The total number of lines with the keyword Spark is", fileRDD_filter.count())

# Print the first four lines of fileRDD_filter
for line in fileRDD_filter.take(4): 
  print(line)

# Pair RDDs in PySpark

- Real-life datasets are often key/value pairs, a common data type required for many operations in Spark. Each row is a key that maps to one or more values. The pair RDD is a special data structure for working with this kind of dataset.

- Pair RDD: the key is the identifier and the value is the data.

- Two common ways to create pair RDDs:

    1. From a list of key-value tuples

    2. From a regular RDD

- **Note: Get the data into key/value form to build a pair RDD.**

- All regular transformations work on pair RDDs, but you need to pass functions that operate on key-value pairs rather than on individual elements.

### Examples of pair RDD Transformations

1. **reduceByKey(func)**: Combine values with the same key using a function.
 - It operates on key-value (k, v) pairs and merges the values for each key.
 - It runs several parallel operations, one for each key in the dataset.
 - Because datasets can have a very large number of keys, reduceByKey is implemented as a transformation rather than as an action.
 - It returns a new RDD consisting of each key and the reduced value for that key.


2. **groupByKey()**: Group all the values with the same key in the pair RDD.


3. **sortByKey()**: Return an RDD sorted by the key.
 - Sorting of data is necessary for many downstream applications.
 - We can sort a pair RDD as long as there is an ordering defined on the key.
 - It returns an RDD sorted by key in ascending or descending order.

4. **join()**: Join two pair RDDs based on their key.
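
To make the per-key semantics concrete, here is a plain-Python sketch of what `reduceByKey` and `groupByKey` compute on a small list of pairs. The helpers `reduce_by_key` and `group_by_key` are hypothetical names written for this illustration. They run on a single machine and say nothing about how Spark distributes the work.

```python
from collections import defaultdict

pairs = [("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)]

def reduce_by_key(pairs, func):
    """Merge the values of each key with func (hypothetical helper)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def group_by_key(pairs):
    """Collect all values of each key into a list (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

reduced = reduce_by_key(pairs, lambda x, y: x + y)
grouped = group_by_key(pairs)

print(reduced)
print(grouped)
```

Reducing yields one merged value per key, whereas grouping keeps every original value, which is why grouping can produce much larger results for keys with many values.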








In [None]:
# Create RDD from tuple example
my_tuple = [('Sam', 23), ('Mary', 34), ('Peter', 25)]
pairRDD_tuple = sc.parallelize(my_tuple)

# Create RDD from list
my_list = ['Sam 23', 'Mary 34', 'Peter 25']
regularRDD = sc.parallelize(my_list)
pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))

# reduceByKey Example
regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34),("Neymar", 22), ("Messi", 24)])
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x,y : x + y)
pairRDD_reducebykey.collect()

# sortByKey Example
pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))
pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()

# groupByKey Example
airports = [("US", "JFK"),("UK", "LHR"),("FR", "CDG"),("US", "SFO")]
regularRDD = sc.parallelize(airports)
pairRDD_group = regularRDD.groupByKey().collect()
for cont, air in pairRDD_group:
      print(cont, list(air))

# join example
RDD1 = sc.parallelize([("Messi", 34),("Ronaldo", 32),("Neymar", 24)])
RDD2 = sc.parallelize([("Ronaldo", 80),("Neymar", 120),("Messi", 100)])
RDD1.join(RDD2).collect()

Create a pair RDD named Rdd with tuples (1,2),(3,4),(3,6),(4,5).

Transform the Rdd with reduceByKey() into a pair RDD Rdd_Reduced by adding the values with the same key.

Collect the contents of pair RDD Rdd_Reduced and iterate to print the output.

In [None]:
# Create PairRDD Rdd with key value pairs
Rdd = sc.parallelize([(1,2), (3,4), (3,6), (4,5)])

# Apply reduceByKey() operation on Rdd
Rdd_Reduced = Rdd.reduceByKey(lambda x, y: x + y)

# Iterate over the result and print the output
for num in Rdd_Reduced.collect(): 
  print("Key {} has {} Counts".format(num[0], num[1]))

1. Sort the Rdd_Reduced RDD using the key in descending order.

2. Collect the contents and iterate to print the output.

In [None]:
# Sort the reduced RDD with the key by descending order
Rdd_Reduced_Sort = Rdd_Reduced.sortByKey(ascending=False)

# Iterate over the result and print the output
for num in Rdd_Reduced_Sort.collect():
  print("Key {} has {} Counts".format(num[0], num[1]))

## Advanced RDD Actions

The `reduce` action takes a function that operates on two elements of the RDD's element type and returns a new element of the same type.
- The function should be commutative (changing the order of the operands does not change the result) and associative so that it can be computed in parallel. An example is `+`, which we can use to sum an RDD.
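
The following pure-Python sketch shows why these two properties matter. It reduces the same numbers under two hypothetical partitionings, first within each partition and then across the partial results, which is roughly the shape of a parallel reduce. The helper `partitioned_reduce` and both partition layouts are assumptions made for the illustration, not Spark internals.

```python
from functools import reduce

# Two hypothetical ways the same data could be split across partitions.
partitioning_a = [[1, 3], [4, 6]]
partitioning_b = [[1], [3, 4, 6]]

def partitioned_reduce(partitions, func):
    """Reduce each partition separately, then combine the partial results."""
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)

add = lambda x, y: x + y
subtract = lambda x, y: x - y

# Addition gives the same answer regardless of the split.
sum_a = partitioned_reduce(partitioning_a, add)
sum_b = partitioned_reduce(partitioning_b, add)

# Subtraction is neither commutative nor associative,
# so the answer depends on how the data was split.
diff_a = partitioned_reduce(partitioning_a, subtract)
diff_b = partitioned_reduce(partitioning_b, subtract)

print(sum_a, sum_b, diff_a, diff_b)
```

With addition both layouts agree, while subtraction produces different answers for the two layouts, so its result would depend on partitioning.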

`saveAsTextFile`: In many cases it is not advisable to run the collect action on RDDs because of the huge size of the data. In these cases it is common to write the data out to a distributed storage system such as HDFS or Amazon S3.

- By default, `saveAsTextFile` saves an RDD as text inside a particular directory, with each partition as a separate file (Example 1). To write a single file instead, first use the `coalesce` method to return a new RDD that is reduced to a single partition (Example 2).






In [None]:
x = [1,3,4,6]
RDD = sc.parallelize(x)
RDD.reduce(lambda x, y : x + y)

# Example 1
RDD.saveAsTextFile("tempFile")

# Example 2 (a different directory, because saving to an existing one fails)
RDD.coalesce(1).saveAsTextFile("tempFile_single")


Similar to pair RDD transformations, there are also actions available for pair RDDs. Pair RDDs gain some additional PySpark actions that take advantage of the key-value nature of the data.

`countByKey()` is only available for RDDs of type (K, V).

 - The countByKey action counts the number of elements for each key.
 - One thing to note is that countByKey should only be used on a dataset whose size is small enough to fit in memory.

`collectAsMap()` returns the key-value pairs in the RDD as a dictionary.
- Similar to countByKey, this action should only be used if the resulting data is expected to be small, as all the data is loaded into memory.
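
As a rough single-machine picture of what these two actions return, the sketch below reproduces their results with ordinary Python containers. It is an analogy only: the variable names are made up, and the remark about duplicate keys describes a plain Python `dict`, so check the behaviour of `collectAsMap()` in your Spark version if your keys repeat.

```python
from collections import Counter

pairs = [("a", 1), ("b", 1), ("a", 1)]

# countByKey-like: count how many elements carry each key
# (the values themselves are ignored).
counts_by_key = Counter(key for key, _ in pairs)

# collectAsMap-like: build a dictionary from the pairs; if a key
# occurs more than once, a plain dict keeps only the last value seen.
as_map = dict([(1, 2), (3, 4)])

print(dict(counts_by_key))
print(as_map)
```

Both results are ordinary in-memory objects, which is why these actions are only suitable when the number of distinct keys is small.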


In [None]:
# countByKey Example
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
for kee, val in rdd.countByKey().items(): 
  print(kee, val)

# collectAsMap Example
sc.parallelize([(1, 2), (3, 4)]).collectAsMap()


# Example 2
# Transform the rdd with countByKey()
total = Rdd.countByKey()

# What is the type of total?
print("The type of total is", type(total))

# Iterate over the total and print the output
for k, v in total.items(): 
  print("key", k, "has", v, "counts")




In this example, you will write code that finds the most common words in the Complete Works of William Shakespeare.

Here are the brief steps for writing the word counting program:

1. Create a base RDD from Complete_Shakespeare.txt file.
2. Use RDD transformation to create a long list of words from each element of the base RDD.
3. Remove stop words from your data.
4. Create a pair RDD where each element is a tuple of the form (word, 1).
5. Group the elements of the pair RDD by key (word) and add up their values.
6. Swap the keys (words) and values (counts) so that the key is the count and the value is the word.
7. Finally, sort the RDD in descending order and print the 10 most frequent words and their frequencies.

- Create an RDD called baseRDD that reads lines from file_path.
- Transform the baseRDD into a long list of words and create a new splitRDD.
- Count the total words in splitRDD
- Convert the words in splitRDD to lower case and remove the stop words listed in stop_words.
- Create a pair RDD tuple containing the word and the number 1 from each word element in splitRDD.
- Get the count of the number of occurrences of each word (word frequency) in the pair RDD using reduceByKey()
- Print the first 10 words and their frequencies from the resultRDD.
- Swap the keys and values in the resultRDD.
- Sort the keys according to descending order.
- Print the top 10 most frequent words and their frequencies.


In [None]:
# Create a baseRDD from the file path
baseRDD = sc.textFile(file_path)

# Split the lines of baseRDD into words
splitRDD = baseRDD.flatMap(lambda x: x.split(" "))

# Count the total number of words
print("Total number of words in splitRDD:", splitRDD.count())

# Remove stop words, comparing each word in lower case against stop_words
splitRDD_no_stop = splitRDD.filter(lambda x: x.lower() not in stop_words)

# Create a tuple of the word and 1 
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (w, 1))

# Count the number of occurrences of each word
resultRDD = splitRDD_no_stop_words.reduceByKey(lambda x, y: x + y)

# Display the first 10 words and their frequencies
for word in resultRDD.take(10):
	print(word)

# Swap the keys and values 
resultRDD_swap = resultRDD.map(lambda x: (x[1], x[0]))

# Sort the keys in descending order
resultRDD_swap_sort = resultRDD_swap.sortByKey(ascending=False)

# Show the top 10 most frequent words and their frequencies
for word in resultRDD_swap_sort.take(10):
	print("{} has {} counts".format(word[1], word[0]))

## PySpark SQL & DataFrames

RDDs - Spark's core abstraction for working with data.

PySpark SQL - Spark's high-level API for working with structured data.

- PySpark SQL is a Spark library for structured data. Unlike the PySpark RDD API, PySpark SQL provides more information about the structure of the data and the computation being performed.

- PySpark SQL provides a programming abstraction called DataFrames. A DataFrame is an immutable distributed collection of data with named columns, similar to a table in SQL. DataFrames are designed to process large collections of structured data (such as relational databases) and semi-structured data (such as JSON, JavaScript Object Notation). The DataFrame API currently supports several languages, including Python, R, Scala, and Java. DataFrames allow PySpark to query data using SQL, for example `SELECT * FROM table`, or using expression methods, for example `df.select()`.


- SparkContext is the main entry point for creating RDDs. Similarly, `SparkSession` provides a single point of entry to interact with underlying Spark functionality and allows programming Spark with the DataFrame API. The `SparkSession` does for DataFrames what the SparkContext does for RDDs.

- A `SparkSession` can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, etc. Similar to SparkContext, SparkSession is exposed in the PySpark shell as the variable `spark`.

- Two different methods of creating DataFrames in PySpark:
 - From existing RDDs using SparkSession's `createDataFrame()` method
    - When the schema is a list of column names, the type of each column is inferred from the data. When the schema is None, Spark tries to infer both the column names and the types from the data.
 - From various data sources (CSV, JSON, TXT) using SparkSession's `read` property
    - Arguments: the path to the file and, for CSV, two optional parameters
     1. header=True
     2. inferSchema=True
- The schema describes the structure of the data in the DataFrame and helps Spark optimize queries on that data more efficiently.
- The schema provides information about the column name, the type of data in the column, whether null or empty values are allowed, etc.



In [None]:
# Example using createDataFrame method
iphones_RDD = sc.parallelize([("XS", 2018, 5.65, 2.79, 6.24),
                               ("XR", 2018, 5.94, 2.98, 6.84), 
                               ("X10", 2017, 5.65, 2.79, 6.13),
                               ("8Plus", 2017, 6.23, 3.07, 7.12)])

names = ['Model', 'Year', 'Height', 'Width', 'Weight']

iphones_df = spark.createDataFrame(iphones_RDD, schema=names)

type(iphones_df)

# Example using the read property (header and inferSchema apply to CSV only)
df_csv = spark.read.csv("people.csv", header=True, inferSchema=True)
df_json = spark.read.json("people.json")
df_txt = spark.read.text("people.txt")

 

In [None]:
# Create a list of tuples
sample_list = [('Mona',20), ('Jennifer',34), ('John',20), ('Jim',26)]

# Create a RDD from the list
rdd = sc.parallelize(sample_list)

# Create a PySpark DataFrame
names_df = spark.createDataFrame(rdd, schema=['Name', 'Age'])

# Check the type of names_df
print("The type of names_df is", type(names_df))

# Create a DataFrame from file_path
people_df = spark.read.csv(file_path, header=True, inferSchema=True)

# Check the type of people_df
print("The type of people_df is", type(people_df))

## Operating on DataFrames in PySpark

Similar to RDD operations, the DataFrame operations in PySpark can be divided into Transformations and Actions. PySpark DataFrame provides operations to filter, group, or compute aggregates, and can be used with PySpark SQL.

DataFrame Transformations: select(), filter(), groupby(), orderBy(), dropDuplicates() and withColumnRenamed()

DataFrame Actions: printSchema(), head(), show(), count(), columns and describe()

`select()` transformation subsets the columns in the DataFrame.

`show()` action prints the first 20 rows of the DataFrame by default.

`filter()` transformation keeps only the rows that satisfy a condition.

`groupby()` transformation groups the DataFrame by the specified columns so that we can run aggregations on them.

`orderBy()` transformation sorts the DataFrame based on one or more columns.

`dropDuplicates()` removes the duplicate rows of a DataFrame.

`withColumnRenamed()` renames a column in the DataFrame.

`printSchema()` operation prints the types of the columns in the DataFrame.

`columns` attribute returns the names of the columns of a DataFrame.

`describe()` operation computes summary statistics of numerical and string columns in the DataFrame. If we don't specify column names, it calculates summary statistics for all such columns present in the DataFrame.


In [None]:
# Example 1 - select
df_id_age = test_df.select('Age')

# Example 2 - show
df_id_age.show(3)

# Example 3 - filter
new_df_age21 = new_df.filter(new_df.Age > 21)
new_df_age21.show(3)

# Example 4 - groupby
test_df_age_group = test_df.groupby('Age')
test_df_age_group.count().show(3)

# Example 5 -- orderBy
test_df_age_group.count().orderBy('Age').show(3)

# Example 6 - dropDuplicates
test_df_no_dup = test_df.select('User_ID','Gender', 'Age').dropDuplicates()
test_df_no_dup.count()

# Example 7 - withColumnRenamed
test_df_sex = test_df.withColumnRenamed('Gender', 'Sex')
test_df_sex.show(3)

# Example 8 - printSchema
test_df.printSchema()

# Example 9 - columns
test_df.columns

# Example 10 - describe
test_df.describe().show()

### Inspecting data in PySpark DataFrame

Print the first 10 observations in the people_df DataFrame.

Count the number of rows in the people_df DataFrame.

How many columns does people_df DataFrame have and what are their names?

In [None]:
# Print the first 10 observations 
people_df.show(10)

# Count the number of rows 
print("There are {} rows in the people_df DataFrame.".format(people_df.count()))

# Count the number of columns and their names
print("There are {} columns in the people_df DataFrame and their names are {}".format(len(people_df.columns), people_df.columns))

### PySpark DataFrame subsetting and cleaning

Select 'name', 'sex' and 'date of birth' columns from people_df and create people_df_sub DataFrame.

Print the first 10 observations in the people_df_sub DataFrame.

Remove duplicate entries from people_df_sub DataFrame and create people_df_sub_nodup DataFrame.

How many rows are there before and after duplicates are removed?

In [None]:
# Select name, sex and date of birth columns
people_df_sub = people_df.select('name', 'sex', 'date of birth')

# Print the first 10 observations from people_df_sub
people_df_sub.show(10)

# Remove duplicate entries from people_df_sub
people_df_sub_nodup = people_df_sub.dropDuplicates()

# Count the number of rows
print("There were {} rows before removing duplicates, and {} rows after removing duplicates".format(people_df_sub.count(),people_df_sub_nodup.count()))


### Filtering your DataFrame

Filter the people_df DataFrame to select all rows where sex is female into people_df_female DataFrame.

Filter the people_df DataFrame to select all rows where sex is male into people_df_male DataFrame.

Count the number of rows in people_df_female and people_df_male DataFrames.

In [None]:
# Filter people_df to select females 
people_df_female = people_df.filter(people_df.sex == "female")

# Filter people_df to select males
people_df_male = people_df.filter(people_df.sex == "male")

# Count the number of rows 
print("There are {} rows in the people_df_female DataFrame and {} rows in the people_df_male DataFrame".format(people_df_female.count(), people_df_male.count()))

## Interacting with DataFrames using PySpark SQL

- In addition to the DataFrame API, PySpark SQL allows you to manipulate DataFrames with SQL queries. What you can do using the DataFrame API can be done using SQL queries and vice versa.
- The DataFrame API provides a programmatic interface, basically a domain-specific language (DSL) for interacting with data.
- DataFrame queries are much easier to construct programmatically, whereas plain SQL queries can be significantly more concise and easier to understand. SQL queries are also portable and can be used without any modification in every supported language. Many of the DataFrame operations that you have seen in the previous chapter can be done using SQL queries.
- The SparkSession provides a method called `sql` which can be used to execute a SQL query. The `sql` method takes a SQL statement as an argument and returns a DataFrame representing the result of the given query.
- SQL queries cannot be run directly against a DataFrame. To issue SQL queries against an existing DataFrame, we can use the `createOrReplaceTempView` method to build a temporary table.
- After creating the temporary table, we can simply use the `sql` method, which allows us to write SQL code to manipulate data within a DataFrame. Since the result is a DataFrame, you can run DataFrame actions such as collect, first, show, etc.
- SQL queries are not limited to extracting data; we can also write SQL queries to run aggregations.











In [None]:
# Example 1
df.createOrReplaceTempView("table1")
df2 = spark.sql("SELECT field1, field2 FROM table1")
df2.collect()

# Example 2
test_df.createOrReplaceTempView("test_table")
query = '''SELECT Product_ID FROM test_table'''
test_product_df = spark.sql(query)
test_product_df.show(5)

# Example 3 - Aggregations
test_df.createOrReplaceTempView("test_table")
query = '''SELECT Age, max(Purchase) FROM test_table GROUP BY Age'''
spark.sql(query).show(5)

# Example 4 - Filtering (query the view registered from test_df, not table1)
test_df.createOrReplaceTempView("test_table")
query = """SELECT Age, Purchase, Gender FROM test_table WHERE Purchase > 20000 AND Gender = 'F'"""
spark.sql(query).show(5)

### Running SQL Queries Programmatically

Create a temporary table people that's a pointer to the people_df DataFrame.

Construct a query to select the names of the people from the temporary table people.

Assign the result of Spark's query to a new DataFrame - people_df_names.

Print the top 10 names of the people from people_df_names DataFrame.

In [None]:
# Create a temporary table "people"
people_df.createOrReplaceTempView("people")

# Construct a query to select the names of the people from the temporary table "people"
query = '''SELECT name FROM people'''

# Assign the result of Spark's query to people_df_names
people_df_names = spark.sql(query)

# Print the top 10 names of the people
people_df_names.show(10)

### SQL queries for filtering Table

1. Filter the people table to select all rows where sex is female into people_female_df DataFrame.
2. Filter the people table to select all rows where sex is male into people_male_df DataFrame.
3. Count the number of rows in both the people_female_df and people_male_df DataFrames.

In [None]:
# Filter the people table to select female sex
people_female_df = spark.sql("SELECT * FROM people WHERE sex = 'female'")

# Filter the people table to select male sex
people_male_df = spark.sql("SELECT * FROM people WHERE sex = 'male'")

# Count the number of rows in both DataFrames
print("There are {} rows in the people_female_df and {} rows in the people_male_df DataFrames".format(people_female_df.count(), people_male_df.count()))

## Data Visualization in PySpark using DataFrames

Plotting graphs from PySpark DataFrames can be done in three ways:
- pyspark_dist_explore library
  - The pyspark_dist_explore library provides quick insights into DataFrames.
  - It currently offers three functions, `hist()`, `distplot()` and `pandas_histogram()`, which create matplotlib graphs while minimizing the amount of computation needed.
- toPandas() method
  - `toPandas()` converts the PySpark DataFrame into a pandas DataFrame. After conversion, it is easy to create charts with the matplotlib or seaborn plotting tools.
  - Pandas DataFrame vs PySpark DataFrame
    - Pandas DataFrames are in-memory and single-server based (constrained by the limits of a single machine), whereas structures and operations on PySpark DataFrames run in parallel on different nodes in a cluster.
    - In pandas the result is generated as soon as an operation is applied, whereas operations on PySpark DataFrames are lazily evaluated.
    - Pandas DataFrames are mutable, whereas PySpark DataFrames are immutable.
    - The pandas API supports more operations than the PySpark DataFrame API.
- HandySpark library
  - HandySpark is a package designed to improve the PySpark user experience.
  - It makes fetching data or computing statistics for columns easy, returning pandas objects straight away.
  - It brings the long-missing capability of plotting data while retaining the advantage of distributed computation.



In [None]:
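# Example 1 - Using pyspark_dist_explore library
# hist() draws on a matplotlib Axes passed as its first argument
import matplotlib.pyplot as plt
from pyspark_dist_explore import hist
test_df = spark.read.csv("test.csv", header=True, inferSchema=True)
test_df_age = test_df.select('Age')
fig, ax = plt.subplots()
hist(ax, test_df_age, bins=20, color=["red"])

# Example 2 - Using toPandas
# toPandas() collects every row to the driver, so the data must fit in its memory
test_df = spark.read.csv("test.csv", header=True, inferSchema=True)
test_df_sample_pandas = test_df.toPandas()
test_df_sample_pandas.hist('Age')

# Example 3 - HandySpark
# Importing handyspark adds the toHandy() method to PySpark DataFrames
from handyspark import *
test_df = spark.read.csv('test.csv', header=True, inferSchema=True)
hdf = test_df.toHandy()
hdf.cols["Age"].hist()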



### PySpark DataFrame visualization

- Print the names of the columns in names_df DataFrame.
- Convert names_df DataFrame to df_pandas Pandas DataFrame.
- Use the pandas plot() method (built on matplotlib) to create a horizontal bar plot with 'Name' as x and 'Age' as y.

In [None]:
# matplotlib is needed for plt.show() at the end of this cell
import matplotlib.pyplot as plt

# Check the column names of names_df
print("The column names of names_df are", names_df.columns)

# Convert to Pandas DataFrame  
df_pandas = names_df.toPandas()

# Create a horizontal bar plot
df_pandas.plot(kind='barh', x='Name', y='Age', colormap='winter_r')
plt.show()

### Part 1: Create a DataFrame from CSV file

Create a PySpark DataFrame from file_path which is the path to the Fifa2018_dataset.csv file.

Print the schema of the DataFrame.

Print the first 10 observations.

How many rows are there in the DataFrame?

In [None]:
# Load the Dataframe
fifa_df = spark.read.csv(file_path, header=True, inferSchema=True)

# Check the schema of columns
fifa_df.printSchema()

# Show the first 10 observations
fifa_df.show(10)

# Print the total number of rows
print("There are {} rows in the fifa_df DataFrame".format(fifa_df.count()))

### Part 2: SQL Queries on DataFrame

Create the temporary table fifa_df_table from the fifa_df DataFrame.

Construct a "query" to extract the "Age" column for players from Germany.

Apply the SQL "query" to the temporary view table and create a new DataFrame.

Compute basic statistics of the created DataFrame.

In [None]:
# Create a temporary view of fifa_df
fifa_df.createOrReplaceTempView('fifa_df_table')

# Construct the "query"
query = """SELECT Age FROM fifa_df_table WHERE Nationality = 'Germany'"""

# Apply the SQL "query"
fifa_df_germany_age = spark.sql(query)

# Generate basic statistics
fifa_df_germany_age.describe().show()

### Part 3: Data visualization

Convert fifa_df_germany_age to fifa_df_germany_age_pandas Pandas DataFrame.

Generate a density plot of the 'Age' column from the fifa_df_germany_age_pandas Pandas DataFrame.

In [None]:
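# matplotlib is needed for plt.show(); the density plot also requires scipy
import matplotlib.pyplot as plt

# Convert fifa_df_germany_age to the fifa_df_germany_age_pandas DataFrame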
fifa_df_germany_age_pandas = fifa_df_germany_age.toPandas()

# Plot the 'Age' density of Germany Players
fifa_df_germany_age_pandas.plot(kind='density')
plt.show()