## Parallel Aggregation


## Table of Contents

* [SparkContext and SparkSession](#one)
* [Parallel Aggregation](#two)
    * [Group By](#groupby)        
    * [Sort By](#sortby)    
    * [Distinct](#distinct)    

<a class="anchor" name="one"></a>
## Import Spark classes and create the SparkSession and SparkContext

In [2]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

master = "local[*]"
app_name = "Parallel Aggregation"
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')

<a class="anchor" name="two"></a>
## Parallel Aggregation

We now implement basic aggregation queries and look at how Spark parallelises them, including the execution plan and the operations Spark performs to answer these kinds of queries.


In [9]:
from pyspark.sql.types import StructType, StructField, StringType

schema_events = StructType([
    StructField("ID", StringType(), True),
    StructField("Name", StringType(), True),
    StructField("Sex", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("Height", StringType(), True),
    StructField("Weight", StringType(), True),
    StructField("Team", StringType(), True),
    StructField("NOC", StringType(), True),
    StructField("Games", StringType(), True),
    StructField("Year", StringType(), True),
    StructField("Season", StringType(), True),
    StructField("City", StringType(), True),
    StructField("Sport", StringType(), True),
    StructField("Event", StringType(), True),
    StructField("Medal", StringType(), True),
])

# Sample Data for Athlete Events
data_events = [
    ("1", "John Doe", "M", "24", "180", "80", "United States", "USA", "2000 Summer", "2000", "Summer", "Sydney", "Swimming", "100m Freestyle", "Gold"),
    ("2", "Jane Smith", "F", "22", "165", "60", "Canada", "CAN", "2016 Summer", "2016", "Summer", "Rio", "Athletics", "Marathon", "Silver"),
    ("3", "Emily White", "F", "27", "170", "70", "Great Britain", "GBR", "2014 Winter", "2014", "Winter", "Sochi", "Curling", "Curling Women's", "Bronze"),
    ("4", "Emily Blue", "M", "29", "170", "70", "Great Britain", "GBR", "2014 Winter", "2014", "Winter", "Sochi", "Curling", "Curling Women's", "Bronze"),
]

# Create DataFrame
df_events = spark.createDataFrame(data=data_events, schema=schema_events)

# Repartition the DataFrame into 10 partitions
df_events = df_events.repartition(10)

# Register the DataFrame as a temporary view so it can be queried with SQL
df_events.createOrReplaceTempView("sql_events")
print("EVENTS INFO:")
print(f"Number of partitions: {df_events.rdd.getNumPartitions()}")
df_events.printSchema()



EVENTS INFO:
Number of partitions: 10
root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Weight: string (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)
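
With only four rows spread over ten partitions, most partitions end up empty. The sketch below is a pure-Python illustration (not Spark code) of round-robin distribution, the strategy `repartition(n)` uses when no partitioning columns are given. The `round_robin` helper and its `start` parameter are hypothetical names for this example, and Spark's actual starting offset and row placement may differ.

```python
def round_robin(rows, num_partitions, start=0):
    """Deal rows out to partitions one at a time, like dealing cards."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[(start + i) % num_partitions].append(row)
    return parts

# The four sample IDs, distributed over 10 partitions
ids = ["1", "2", "3", "4"]
parts = round_robin(ids, num_partitions=10)

sizes = [len(p) for p in parts]
empty = sum(1 for s in sizes if s == 0)
print(f"Partition sizes: {sizes}")
print(f"Empty partitions: {empty}")
```

In a real job, empty partitions still cost scheduling overhead, so the partition count is normally chosen relative to the data size and the number of available cores.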



### Group By <a class="anchor" name="groupby"></a>
This part runs a simple aggregation query with both the DataFrame API and SQL. After running it, inspect the query plan and the level of parallelism in the Spark UI.

In [4]:
import pyspark.sql.functions as F

# Aggregate the dataset by 'Year' and count the total number of athletes using the DataFrame API
# (F.count('Year') skips rows where Year is null)
agg_attribute = 'Year'
df_count = df_events.groupby(agg_attribute).agg(F.count(agg_attribute).alias('Total'))

# Aggregate the dataset by 'Year' and count the total number of athletes using SQL
# (COUNT(*) counts every row, including rows where Year is null)
sql_count = spark.sql('''
  SELECT Year, COUNT(*) AS Total
  FROM sql_events
  GROUP BY Year
''')

In [5]:
df_count.take(5)

[Row(Year='2014', Total=2),
 Row(Year='2016', Total=1),
 Row(Year='2000', Total=1)]
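
A grouped count is typically planned as two aggregation stages around a shuffle: a partial count inside each partition, then a final merge per key. The following is a minimal pure-Python sketch of that idea, not Spark's implementation. The partition contents and the helper names (`partial_count`, `shuffle_by_key`, `final_count`) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical layout: the four sample 'Year' values spread over three partitions
partitions = [["2000", "2014"], ["2016"], ["2014"]]

def partial_count(rows):
    """Stage 1: count keys locally inside a single partition."""
    return Counter(rows)

def shuffle_by_key(partials, num_reducers):
    """Send each (key, partial count) pair to the reducer chosen by hashing the key."""
    buckets = [[] for _ in range(num_reducers)]
    for partial in partials:
        for key, count in partial.items():
            buckets[hash(key) % num_reducers].append((key, count))
    return buckets

def final_count(bucket):
    """Stage 2: merge the partial counts for the keys owned by one reducer."""
    totals = Counter()
    for key, count in bucket:
        totals[key] += count
    return totals

partials = [partial_count(p) for p in partitions]
buckets = shuffle_by_key(partials, num_reducers=2)

result = {}
for bucket in buckets:
    result.update(final_count(bucket))

print(sorted(result.items()))
```

Because every occurrence of a key hashes to the same reducer, each reducer can finish its keys independently, which is what lets the final stage run in parallel. Counting locally first also means only one pair per key per partition crosses the shuffle rather than every row.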

### Sort By <a class="anchor" name="sortby"></a>
We can use the `orderBy` operation to sort the DataFrame by one or more columns.


In [6]:
df_events.select('Year','Name','Team').orderBy(df_events.Year).show(15)

+----+-----------+-------------+
|Year|       Name|         Team|
+----+-----------+-------------+
|2000|   John Doe|United States|
|2014|Emily White|Great Britain|
|2014| Emily Blue|Great Britain|
|2016| Jane Smith|       Canada|
+----+-----------+-------------+
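
A global sort in Spark is usually planned as a range-partitioning shuffle followed by a local sort inside each partition, so that reading the partitions in order yields sorted output. The sketch below illustrates that idea in pure Python. The partition contents and the helpers `choose_boundaries` and `range_partition` are hypothetical, and this version picks boundaries from all keys, whereas Spark estimates them from a sample.

```python
import bisect

# Hypothetical layout: the four sample 'Year' values spread over three partitions
partitions = [["2016", "2000"], ["2014"], ["2014"]]

def choose_boundaries(parts, num_ranges):
    """Pick num_ranges - 1 split points so each range holds a similar number of keys."""
    keys = sorted(key for part in parts for key in part)
    step = len(keys) / num_ranges
    return [keys[int(step * i)] for i in range(1, num_ranges)]

def range_partition(parts, boundaries):
    """Shuffle: route every key to the bucket whose range contains it."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for part in parts:
        for key in part:
            buckets[bisect.bisect_left(boundaries, key)].append(key)
    return buckets

boundaries = choose_boundaries(partitions, num_ranges=2)
buckets = range_partition(partitions, boundaries)

# Each bucket is sorted independently; concatenating them in order gives the global sort
sorted_years = [key for bucket in buckets for key in sorted(bucket)]
print(sorted_years)
```

Since every key in one bucket is no larger than any key in the next, the per-bucket sorts need no further merging, and each one can run on a different executor.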



### Distinct <a class="anchor" name="distinct"></a>
This part contains a simple query that gets the distinct values of one attribute and then sorts them by that same attribute in ascending order.


In [11]:
# Get the distinct values for 'Year' in the dataset using the DataFrame API
df_distinct_sort = df_events.select('Year').distinct().sort('Year', ascending=True)

# Get the distinct values for 'Year' in the dataset using SQL
sql_distinct_sort = spark.sql('''
  SELECT DISTINCT Year
  FROM sql_events
  ORDER BY Year
''')
df_distinct_sort.take(10)

[Row(Year='2000'), Row(Year='2014'), Row(Year='2016')]
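
Distinct can be viewed as a grouping with no aggregate function: duplicates are dropped inside each partition first, the surviving keys are shuffled by hash, and each reducer drops duplicates again. A minimal pure-Python sketch under that assumption follows; the partition contents and the `shuffle_keys` helper are hypothetical, and it is not Spark's implementation.

```python
# Hypothetical layout: sample 'Year' values with duplicates within and across partitions
partitions = [["2000", "2014", "2014"], ["2016", "2014"]]

def shuffle_keys(local_sets, num_reducers):
    """Route each key to one reducer by hash; sets drop cross-partition duplicates."""
    buckets = [set() for _ in range(num_reducers)]
    for keys in local_sets:
        for key in keys:
            buckets[hash(key) % num_reducers].add(key)
    return buckets

# Step 1: deduplicate locally so fewer values cross the shuffle
local_sets = [set(part) for part in partitions]

# Step 2: shuffle by key and deduplicate again on the receiving side
buckets = shuffle_keys(local_sets, num_reducers=2)

# Step 3: sort the surviving keys, as the query above does with sort('Year')
distinct_years = sorted(key for bucket in buckets for key in bucket)
print(distinct_years)
```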

In [10]:
df_events.select('Name','Event','Year').sort('Year', ascending=False).show(10)

+-----------+---------------+----+
|       Name|          Event|Year|
+-----------+---------------+----+
| Jane Smith|       Marathon|2016|
|Emily White|Curling Women's|2014|
| Emily Blue|Curling Women's|2014|
|   John Doe| 100m Freestyle|2000|
+-----------+---------------+----+

