In [1]:
from pyspark.sql import SparkSession

In [3]:
spark=SparkSession.builder.appName('Agg').getOrCreate()


PySpark RDD – Resilient Distributed Dataset

- PySpark RDD (Resilient Distributed Dataset) is the fundamental data structure of PySpark: a fault-tolerant, immutable, distributed collection of objects. Once you create an RDD you cannot change it. Each RDD is divided into logical partitions, which can be computed on different nodes of the cluster.
- To create an RDD, we first need a SparkSession, which is the entry point to a PySpark application. A SparkSession can be created with the SparkSession.builder attribute (followed by getOrCreate()) or with the newSession() method of an existing SparkSession.
- A SparkSession internally creates a SparkContext, exposed as its sparkContext attribute. We can create multiple SparkSession objects but only one SparkContext per JVM. If you want to create a new SparkContext, you must first stop the existing one (using stop()).
- If you have tabular data, you should just use a DataFrame. Going down to the RDD layer is helpful for unstructured data, for example, text, images, and signal data.

In [4]:
# Create RDD from list using parallelize

dataList = [("Java", 2000), ("Python", 10000), ("Scala", 3000)]
rdd=spark.sparkContext.parallelize(dataList)

In [5]:
# Create an RDD from an external data source.
# textFile() yields one string per line; the CSV header row is included and nothing is parsed.

rdd2 = spark.sparkContext.textFile('test3.csv')


RDD Transformations
- Transformations on a Spark RDD return another RDD. They are lazy, meaning they don't execute until you call an action on the RDD. Common transformations include flatMap(), map(), reduceByKey(), filter(), and sortByKey(); each returns a new RDD instead of updating the current one.

PySpark DataFrame
- DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

Is PySpark faster than pandas? Explain your thinking.

It depends on the scale of the data, that is, whether it...
- fits in memory on a single machine, or
- is too large for a single machine and must be distributed across multiple machines

PySpark is faster when...
- Large datasets: datasets that don't fit into a single machine's memory can be partitioned and processed in parallel across multiple machines in a cluster. This lets PySpark handle data beyond the scope of pandas, which is limited by a single machine's memory.

Pandas is faster when...
- Small datasets: when the data fits into memory on a single machine, pandas processes it with efficient in-memory operations, whereas PySpark carries considerable overhead from its distributed nature (for example, task scheduling and serialization).
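
For a dataset this small, the single-machine route might look like the pandas sketch below. It assumes pandas is installed and rebuilds the ten rows of test3.csv inline (values taken from the output shown in this notebook) rather than reading the file.

```python
import pandas as pd

# The ten rows of test3.csv, as printed by df_pyspark.show() in this notebook.
rows = [
    ("Krish", "Data Science", 10000),
    ("Krish", "IOT", 5000),
    ("Mahesh", "Big Data", 4000),
    ("Krish", "Big Data", 4000),
    ("Mahesh", "Data Science", 3000),
    ("Sudhanshu", "Data Science", 20000),
    ("Sudhanshu", "IOT", 10000),
    ("Sudhanshu", "Big Data", 5000),
    ("Sunny", "Data Science", 10000),
    ("Sunny", "Big Data", 2000),
]
df = pd.DataFrame(rows, columns=["Name", "Departments", "salary"])

# Everything runs in one process: no cluster, no task scheduling.
total_by_name = df.groupby("Name")["salary"].sum().to_dict()
print(total_by_name)
```

The totals match the PySpark groupBy('Name').sum() result further down; the difference lies only in where and how the work is executed.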


In [8]:
df_pyspark=spark.read.csv('test3.csv',header=True,inferSchema=True)
df_pyspark.show()
df_pyspark.printSchema()


+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|    Krish|Data Science| 10000|
|    Krish|         IOT|  5000|
|   Mahesh|    Big Data|  4000|
|    Krish|    Big Data|  4000|
|   Mahesh|Data Science|  3000|
|Sudhanshu|Data Science| 20000|
|Sudhanshu|         IOT| 10000|
|Sudhanshu|    Big Data|  5000|
|    Sunny|Data Science| 10000|
|    Sunny|    Big Data|  2000|
+---------+------------+------+

root
 |-- Name: string (nullable = true)
 |-- Departments: string (nullable = true)
 |-- salary: integer (nullable = true)



In [9]:
df_pyspark.createOrReplaceTempView("PEOPLE")
df1=spark.sql("select Name, Departments, Salary as salary from PEOPLE")
df1.show()

+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|    Krish|Data Science| 10000|
|    Krish|         IOT|  5000|
|   Mahesh|    Big Data|  4000|
|    Krish|    Big Data|  4000|
|   Mahesh|Data Science|  3000|
|Sudhanshu|Data Science| 20000|
|Sudhanshu|         IOT| 10000|
|Sudhanshu|    Big Data|  5000|
|    Sunny|Data Science| 10000|
|    Sunny|    Big Data|  2000|
+---------+------------+------+



In [11]:
## Groupby

# Group by Name to find the total salary (sum() with no arguments sums every numeric column)
df_pyspark.groupBy('Name').sum().show()

spark.sql("SELECT Name, SUM(Salary) as Total_Salary FROM PEOPLE GROUP BY Name").show() # using PySpark SQL

+---------+-----------+
|     Name|sum(salary)|
+---------+-----------+
|Sudhanshu|      35000|
|    Sunny|      12000|
|    Krish|      19000|
|   Mahesh|       7000|
+---------+-----------+

+---------+------------+
|     Name|Total_Salary|
+---------+------------+
|Sudhanshu|       35000|
|    Sunny|       12000|
|    Krish|       19000|
|   Mahesh|        7000|
+---------+------------+



In [14]:
# using PySpark Dataframe
df_pyspark.groupBy('Name').avg().show()

# using PySpark SQL
spark.sql("SELECT Name, AVG(Salary) as Average_Salary FROM PEOPLE GROUP BY Name").show()

# using PySpark Dataframe
department_salary_sum = df_pyspark.groupBy("Departments").sum("Salary")
department_salary_sum.show()

# using PySpark SQL
department_salary_sum = spark.sql("""
SELECT Departments, SUM(Salary) as Total_Salary
FROM PEOPLE
GROUP BY Departments
ORDER BY Total_Salary DESC
""")

# Show the result
department_salary_sum.show()


+---------+------------------+
|     Name|       avg(salary)|
+---------+------------------+
|Sudhanshu|11666.666666666666|
|    Sunny|            6000.0|
|    Krish| 6333.333333333333|
|   Mahesh|            3500.0|
+---------+------------------+

+---------+------------------+
|     Name|    Average_Salary|
+---------+------------------+
|Sudhanshu|11666.666666666666|
|    Sunny|            6000.0|
|    Krish| 6333.333333333333|
|   Mahesh|            3500.0|
+---------+------------------+

+------------+-----------+
| Departments|sum(Salary)|
+------------+-----------+
|         IOT|      15000|
|    Big Data|      15000|
|Data Science|      43000|
+------------+-----------+

+------------+------------+
| Departments|Total_Salary|
+------------+------------+
|Data Science|       43000|
|         IOT|       15000|
|    Big Data|       15000|
+------------+------------+



In [16]:
# using PySpark Dataframe
department_count = df_pyspark.groupBy("Departments").count()
department_count.show()

# using PySpark SQL
spark.sql("SELECT Departments, COUNT(*) as Department_Count FROM PEOPLE GROUP BY Departments").show()

+------------+-----+
| Departments|count|
+------------+-----+
|         IOT|    2|
|    Big Data|    4|
|Data Science|    4|
+------------+-----+

+------------+----------------+
| Departments|Department_Count|
+------------+----------------+
|         IOT|               2|
|    Big Data|               4|
|Data Science|               4|
+------------+----------------+

