# FIT5202 Data processing for Big data

##  Activity: Parallel Aggregation

In this tutorial, we will implement operations and aggregations such as distinct, group by and order by on Spark DataFrames. In the second part, you will combine these operations to answer the lab tasks.

Let's get started.


## Table of Contents

* [SparkContext and SparkSession](#one)
* [Parallel Aggregation](#two)
    * [Group By](#groupby)        
    * [Sort By](#sortby)    
    * [Distinct](#distinct)    
* [Miscellaneous DataFrame Operations](#misc)
    * [Describe a column](#describe_column)
    * [Adding/Dropping Columns](#add_drop_column)    
    * [PySpark Built-in Functions](#pyspark_functions)       
    * [User Defined Functions : UDFs](#udf) 
* [Lab Tasks](#lab-task-1)
    * [Lab Task 1](#lab-task-1)
    * [Lab Task 2](#lab-task-2)
    * [Lab Task 3](#lab-task-3)    

<a class="anchor" name="one"></a>
## Import Spark classes and create Spark Context

<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#006DAE">TODO: </strong>In the cell block below, 
<ul>
    <li>Create a SparkConf object with the application name set to "Parallel Aggregation"</li>
    <li>Specify 2 cores for processing</li>
    <li>Use the configuration object to create a Spark session named <strong>spark</strong>.</li>
    </ul>
    
<p><strong style="color:red">Important:</strong> You cannot proceed to other steps without completing this.</p>
</div>

In [1]:
# Import SparkConf class into program
from pyspark import SparkConf

# local[*]: run Spark in local mode with as many worker threads as logical cores on your machine
# To run Spark locally with 'k' worker threads, specify "local[k]" (e.g. "local[2]" for the 2 cores requested in the task above).
master = "local[*]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Parallel Aggregation"
# Setup configuration parameters for Spark
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

# Import SparkContext and SparkSession classes
from pyspark import SparkContext # Spark
from pyspark.sql import SparkSession # Spark SQL

# Method 1: Using SparkSession
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')

<a class="anchor" name="two"></a>
## Parallel Aggregation

Now we will implement basic aggregations and use the Spark UI to inspect the execution plan and the degree of parallelism Spark applies to these kinds of queries.

In this tutorial, you will use two CSV files as datasets: <code>athlete_events.csv</code>, which contains information about all the athletes who have participated in the Summer and Winter Olympics, and <code>noc_regions.csv</code>, which contains information about their countries.

In [2]:
# Read athlete events data as dataframe
df_events = spark.read.format('csv')\
            .option('header',True).option('escape','"')\
            .load('athlete_events.csv')

# Create Views from Dataframes
df_events.createOrReplaceTempView("sql_events")

## Verifying the number of partitions for each dataframe
## You can explore the data of each csv file with the function printSchema()
print(f"####### DICTIONARY INFO:")
print(f"Number of partitions: {df_events.rdd.getNumPartitions()}")
df_events.printSchema()

####### DICTIONARY INFO:
Number of partitions: 10
root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Weight: string (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)



### Group By <a class="anchor" name="groupby"></a>
This part contains a simple aggregation query. Look into the query plan and level of parallelism in the Spark UI.

In [3]:
import pyspark.sql.functions as F

#### Group the dataset by 'Year' and count the rows (athlete entries) per year using the DataFrame API
agg_attribute = 'Year'
df_count = df_events.groupby(agg_attribute).agg(F.count(agg_attribute).alias('Total'))

#### The same aggregation using SQL
sql_count = spark.sql('''
  SELECT Year, COUNT(*) AS Total
  FROM sql_events
  GROUP BY Year
''')

In [4]:
df_count.take(5)

[Row(Year='1956', Total=6434),
 Row(Year='2016', Total=13688),
 Row(Year='1936', Total=7401),
 Row(Year='2012', Total=12920),
 Row(Year='1972', Total=11959)]

<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#006DAE">NOTE: </strong>
  The same aggregation can be written as 
    <code>groupby(agg_attribute).agg({'Year':'count'})</code>; the result column is then named <code>count(Year)</code> instead of <code>Total</code>.
</div>
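
To build intuition for what the Spark UI shows for a grouped count, here is a minimal pure-Python sketch (not Spark code) of a two-phase aggregation: every partition is counted locally, and the partial counts are then merged. The partition contents and the helper names `local_count` and `merge_counts` are made up for illustration; Spark's actual physical plan may differ in its details.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions of 'Year' values, standing in for the partitions of df_events
partitions = [
    ['1992', '2012', '1992'],
    ['2012', '1920'],
    ['1992', '1920', '1920'],
]

def local_count(partition):
    """Phase 1: count each key inside a single partition."""
    return Counter(partition)

def merge_counts(left, right):
    """Phase 2: merge two partial results by summing the counts per key."""
    return left + right

# Phase 1 is independent per partition, so it could run in parallel
partial_counts = [local_count(p) for p in partitions]

# Phase 2 combines the partial results into the final counts
total_counts = reduce(merge_counts, partial_counts, Counter())

print(sorted(total_counts.items()))
```

Because only the small partial counts travel between the two phases, the amount of data moved is far smaller than the raw rows.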

### Sort By <a class="anchor" name="sortby"></a>
We can use the <code>orderBy</code> operation to sort a DataFrame by one or more columns.
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#006DAE">NOTE: </strong>
    You can sort in descending order with the <strong>desc()</strong> method, e.g.
    <code>orderBy(df_events.Year.desc())</code>    
</div>


In [5]:
df_events.select('Year','Name','Team').orderBy(df_events.Year).show(15)

+----+--------------------+--------------------+
|Year|                Name|                Team|
+----+--------------------+--------------------+
|1896| G. Karagiannopoulos|              Greece|
|1896|         Khatzidakis|              Greece|
|1896|Tilemakhos Karakalos|              Greece|
|1896| Gyula Kakas (Kokas)|             Hungary|
|1896|        Karakatsanis|              Greece|
|1896| Gyula Kakas (Kokas)|             Hungary|
|1896|Konstantinos Kara...|              Greece|
|1896|   Filippos Karvelas|Ethnikos Gymnasti...|
|1896|Alexandros Khalko...|              Greece|
|1896|         N. Katravas|              Greece|
|1896| Gyula Kakas (Kokas)|             Hungary|
|1896|   Frederick Keeping|       Great Britain|
|1896| Pantelis Karasevdas|              Greece|
|1896|   Frederick Keeping|       Great Britain|
|1896| Pantelis Karasevdas|              Greece|
+----+--------------------+--------------------+
only showing top 15 rows
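
A global sort across partitions can be pictured as range partitioning followed by independent local sorts. The plain-Python sketch below illustrates that idea only; it is not Spark's implementation. The sample values and the split points in `boundaries` are assumptions chosen for the example.

```python
from bisect import bisect_right

years = [1992, 2012, 1920, 1900, 1988, 2016, 1896, 1956]  # made-up sample values
boundaries = [1920, 1990]  # hypothetical split points, giving 3 ranges

# Step 1: route each value to the bucket (partition) that covers its range
buckets = [[] for _ in range(len(boundaries) + 1)]
for y in years:
    buckets[bisect_right(boundaries, y)].append(y)

# Step 2: sort each bucket independently (this part could run in parallel)
sorted_buckets = [sorted(b) for b in buckets]

# Step 3: concatenating the buckets in range order gives a globally sorted result
result = [y for b in sorted_buckets for y in b]
print(result)
```

No merge step is needed at the end, because every value in one bucket is smaller than every value in the next.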



### Distinct <a class="anchor" name="distinct"></a>
This part contains a simple query that gets the distinct values of one attribute and then sorts them by the same attribute in ascending order.
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#006DAE">NOTE: </strong>
    We can also use the <code>.sort()</code> method for sorting. Its <code>ascending</code> keyword argument specifies the sort order.
</div>

In [6]:
#### Get the distinct values for 'Year' in the dataset using Dataframe
df_distinct_sort = df_events.select('Year').distinct().sort('Year', ascending=True)

#### Get the distinct values for 'Year' in the dataset using SQL
sql_distinct_sort = spark.sql('''
  SELECT distinct Year
  FROM sql_events
  ORDER BY year
''')
df_distinct_sort.take(10)

[Row(Year='1896'),
 Row(Year='1900'),
 Row(Year='1904'),
 Row(Year='1906'),
 Row(Year='1908'),
 Row(Year='1912'),
 Row(Year='1920'),
 Row(Year='1924'),
 Row(Year='1928'),
 Row(Year='1932')]

<a class="anchor" id="lab-task-1"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">1. Lab Task: </strong>Sort the events DataFrame (<code>df_events</code>) by <strong>Year</strong> in descending order.</div>


In [7]:
df_events.select('Name','Event','Year').sort('Year', ascending=False).show(10)

+--------------------+--------------------+----+
|                Name|               Event|Year|
+--------------------+--------------------+----+
|Alexander Ingvar ...|Rowing Men's Coxl...|2016|
|Andrs Byron Silva...|Athletics Men's 4...|2016|
|Isaac Phillipjuni...|Athletics Men's 1...|2016|
|Andriy Oleksiyovy...|Gymnastics Men's ...|2016|
|           aba Silai|Swimming Men's 10...|2016|
|Arlenis Sierra Ca...|Cycling Women's R...|2016|
|Nicole Sifuentes ...|Athletics Women's...|2016|
|Alex William Pomb...|Judo Men's Lightw...|2016|
|          Alain Sign| Sailing Men's Skiff|2016|
|Altobeli Santos d...|Athletics Men's 3...|2016|
+--------------------+--------------------+----+
only showing top 10 rows



<a class="anchor" name="misc"></a>
## Miscellaneous Dataframe Operations
The following examples show other useful DataFrame operations.

### Describing a Column <a class="anchor" name="describe_column"></a>
The <code>describe()</code> method gives a statistical summary of the specified columns. If no column is specified, it summarises every column of the DataFrame.

In [8]:
df_events.describe().show()

+-------+-----------------+--------------------+------+------------------+------------------+------------------+-----------+------+-----------+------------------+------+-----------+-----------+--------------------+------+
|summary|               ID|                Name|   Sex|               Age|            Height|            Weight|       Team|   NOC|      Games|              Year|Season|       City|      Sport|               Event| Medal|
+-------+-----------------+--------------------+------+------------------+------------------+------------------+-----------+------+-----------+------------------+------+-----------+-----------+--------------------+------+
|  count|           271116|              271116|271116|            271116|            271116|            271116|     271116|271116|     271116|            271116|271116|     271116|     271116|              271116|271116|
|   mean|68248.95439590434|                null|  null|25.556898357297374|175.33896987366376| 70.70239290053351|

### Adding and Dropping a column in dataframe <a class="anchor" name="add_drop_column"></a>

In [9]:
#Here is an example of adding a new column derived from an existing column
df_events_new = df_events.withColumn('Years Ago',2022-df_events.Year).select('Years Ago','Name')
display(df_events_new)

DataFrame[Years Ago: double, Name: string]

<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#006DAE">TODO: </strong>
    You can use the <code>.drop('column_name')</code> method to drop columns from a dataframe. Try this method and drop the column created above.
</div>

In [10]:
display(df_events_new.drop('Years Ago'))

DataFrame[Name: string]

### Using PySpark Functions <a class="anchor" name="pyspark_functions"></a>
You can use PySpark built-in functions along with the <code>withColumn()</code> API.

In [11]:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

#Changing the datatype 
#using the display method to see the columns and datatypes of a dataframe
display(df_events)

DataFrame[ID: string, Name: string, Sex: string, Age: string, Height: string, Weight: string, Team: string, NOC: string, Games: string, Year: string, Season: string, City: string, Sport: string, Event: string, Medal: string]

In [12]:
#use CAST to change the datatype of Age Column 
df_events = df_events.withColumn('Age',F.col('Age').cast(IntegerType()))
display(df_events)

DataFrame[ID: string, Name: string, Sex: string, Age: int, Height: string, Weight: string, Team: string, NOC: string, Games: string, Year: string, Season: string, City: string, Sport: string, Event: string, Medal: string]

In [13]:
#The following example uses another built-in function to extract the year from the Games column
df_events = df_events.withColumn('Games Year',F.split(df_events.Games,' ')[0])
df_events.select('Games Year').show(5)

+----------+
|Games Year|
+----------+
|      1992|
|      2012|
|      1920|
|      1900|
|      1988|
+----------+
only showing top 5 rows



### DataFrame UDFs (User Defined Functions) <a class="anchor" name="udf"></a>
Similar to the map operation on an RDD, we sometimes want to apply a complex operation to a DataFrame that the DataFrame API does not provide. In such scenarios, a Spark UDF can be handy. To create one, we use <code>F.udf</code> to convert a regular Python function into a Spark UDF.


In [14]:
#For example, the following function does the same thing as the built-in function above, but this time using a UDF
#1. Define the function
def extract_year(s):
    return int(s.split(' ')[0])

#2. Calling the UDF with DataFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

#First Register the function as UDF
extract_year_udf = udf(extract_year,IntegerType())

#Call the function
df_events.select('Games',extract_year_udf('Games').alias("Game Year")).show(5)

#3. Calling with Spark SQL
#First Register the function as UDF
spark.udf.register('extract_year',extract_year,IntegerType())

#Call the function 
df_events.createOrReplaceTempView('events')
df_sql = spark.sql('''select Games, extract_year(Games) as Game_Year from events''')
df_sql.show(5)

+-----------+---------+
|      Games|Game Year|
+-----------+---------+
|1992 Summer|     1992|
|2012 Summer|     2012|
|1920 Summer|     1920|
|1900 Summer|     1900|
|1988 Winter|     1988|
+-----------+---------+
only showing top 5 rows

+-----------+---------+
|      Games|Game_Year|
+-----------+---------+
|1992 Summer|     1992|
|2012 Summer|     2012|
|1920 Summer|     1920|
|1900 Summer|     1900|
|1988 Winter|     1988|
+-----------+---------+
only showing top 5 rows
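
The `extract_year` function above assumes every `Games` value is a string that starts with a year. If a column could contain nulls or malformed text (an assumption; the output above shows no such rows), calling `s.split` on `None` would raise an error. A minimal sketch of a more defensive version follows; `extract_year_safe` is a hypothetical name, and it could be wrapped with `udf(..., IntegerType())` in the same way as above.

```python
def extract_year_safe(s):
    """Return the leading year of a string such as '1992 Summer', or None if there is none."""
    if s is None:
        return None  # null input maps to null output
    first_token = s.strip().split(' ')[0]
    # Only convert when the first token consists entirely of decimal digits
    return int(first_token) if first_token.isdecimal() else None

print(extract_year_safe('1992 Summer'))  # 1992
print(extract_year_safe(None))           # None
print(extract_year_safe('Summer 1992'))  # None
```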



## Combining DataFrame operations <a class="anchor" name="combine"></a>
Now that we have covered the main operations for processing data, you will implement several queries using both Spark DataFrames and SQL.
This section uses the two attached CSV files:
* <code>athlete_events.csv</code>
* <code>noc_regions.csv</code>

The first dataset was already used in the first part of this tutorial. The second one contains the countries with some additional information.
In this section, you will need to write most of the code yourself, although some parts provide a hint or the variable names to use.

In [15]:
# Stop the previous Spark Context to clean all the previous executions from the previous section
sc.stop()
# Verify that the Spark UI is not running anymore or that there is no content

<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#006DAE">TODO: </strong>
Since we stopped the Spark Context in the previous code block, start it again by creating the SparkSession object in the next code block.
</div>

In [16]:
# Import SparkConf class into program
from pyspark import SparkConf
from pyspark.sql.functions import col

# local[*]: run Spark in local mode with as many worker threads as logical cores on your machine
# To run Spark locally with 'k' worker threads, specify "local[k]".
master = "local[*]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Parallel Aggregation"
# Setup configuration parameters for Spark
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

# Import SparkContext and SparkSession classes
from pyspark import SparkContext # Spark
from pyspark.sql import SparkSession # Spark SQL

# Method 1: Using SparkSession
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')

### Create Spark data objects (Dataframes and SQL)

In [17]:
# Read athlete events data as dataframe
df_events = spark.read.format('csv')\
            .option('header',True).option('escape','"')\
            .load('athlete_events.csv')

# TODO: Read noc regions (countries) data as dataframe
df_regions = spark.read.format('csv')\
            .option('header',True)\
            .load('noc_regions.csv')

# Create Views from Dataframes
df_events.createOrReplaceTempView("sql_events")
df_regions.createOrReplaceTempView("sql_regions")

# View Schema for both dataframes
df_events.printSchema()
df_regions.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Weight: string (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)

root
 |-- Country: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Population: string (nullable = true)
 |-- GDP per Capita: string (nullable = true)



### Queries/Analysis
For this part, you will need to implement the Dataframe operations and/or the SQL queries to obtain the reports needed for the following questions:

<a class="anchor" id="lab-task-2"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">2. Lab Task: </strong>Get the total number of male athletes per year from 2000 onwards, ordered by year ascending. <strong>Sample Output:</strong>
<pre>
+----+------------------+
|year|number_of_athletes|
+----+------------------+
|2000|             XXXXX|
|2002|              XXXX|
</pre>
</div>


In [18]:
## Dataframe Solution
df_res = df_events.filter(col('Sex')=='M').filter(col('Year')>=2000)\
            .groupby('Year').agg(F.count('Year').alias('number_of_male_athletes'))\
            .sort('Year', ascending=True)

## SQL Solution
sql_res = spark.sql('''
  SELECT year, count(*) as number_of_male_athletes
  FROM sql_events
  WHERE sex='M'
  AND year >= 2000
  GROUP BY year
  ORDER BY year ASC
''')

print(sql_res.collect())

[Row(year='2000', number_of_male_athletes=8390), Row(year='2002', number_of_male_athletes=2527), Row(year='2004', number_of_male_athletes=7897), Row(year='2006', number_of_male_athletes=2625), Row(year='2008', number_of_male_athletes=7786), Row(year='2010', number_of_male_athletes=2555), Row(year='2012', number_of_male_athletes=7105), Row(year='2014', number_of_male_athletes=2868), Row(year='2016', number_of_male_athletes=7465)]


<a class="anchor" id="lab-task-3"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#FF5555">3. Lab Task: </strong> Get the total number of athletes per Olympic season (Summer/Winter) in the 1990s for Australia and New Zealand. <strong>Sample Output:</strong>
<pre>
+-----------+------+------------------+
|    country|season|number_of_athletes|
+-----------+------+------------------+
|  Australia|Summer|               XXX|
</pre>
</div>


In [19]:
## Dataframe Solution
df_res = df_events.join(df_regions,df_events.NOC==df_regions.NOC,how='inner')\
            .filter(F.col("country").isin(["Australia", "New Zealand"]))\
            .filter(F.col('year').between(1990,1999))\
            .groupBy('country','season')\
            .agg(F.count('country').alias('number_of_athletes'))\
            .sort(['country','season'], ascending=True)

df_res_count = df_res.count()
print(df_res_count)
df_res.show()

4
+-----------+------+------------------+
|    country|season|number_of_athletes|
+-----------+------+------------------+
|  Australia|Summer|               940|
|  Australia|Winter|               122|
|New Zealand|Summer|               304|
|New Zealand|Winter|                35|
+-----------+------+------------------+



In [20]:
## Dataframe Solution (same query with the two filters in the opposite order; the result is identical)
df_res = df_events.join(df_regions,df_events.NOC==df_regions.NOC,how='inner')\
            .filter(F.col('year').between(1990,1999))\
            .filter(F.col("country").isin(["Australia", "New Zealand"]))\
            .groupBy('country','season')\
            .agg(F.count('country').alias('number_of_athletes'))\
            .sort(['country','season'], ascending=True)

df_res_count = df_res.count()
print(df_res_count)
df_res.show()

4
+-----------+------+------------------+
|    country|season|number_of_athletes|
+-----------+------+------------------+
|  Australia|Summer|               940|
|  Australia|Winter|               122|
|New Zealand|Summer|               304|
|New Zealand|Winter|                35|
+-----------+------+------------------+



In [21]:
## SQL Solution
sql_res = spark.sql('''
  SELECT country,season,count(*) as number_of_athletes
  FROM sql_regions JOIN sql_events
  USING (NOC)
  WHERE (country='Australia' OR country='New Zealand')
  AND year between 1990 and 1999
  GROUP BY country,season
  ORDER BY country,season
''')
sql_res.show(5)

+-----------+------+------------------+
|    country|season|number_of_athletes|
+-----------+------+------------------+
|  Australia|Summer|               940|
|  Australia|Winter|               122|
|New Zealand|Summer|               304|
|New Zealand|Winter|                35|
+-----------+------+------------------+



<a class="anchor" id="lab-task-4"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#006DAE">TODO: </strong>Obtain the minimum, average and maximum height of the athletes of each country in the Winter Olympics, ordered by the average height in descending order. <strong>Output should be in the following format:</strong>
<pre>
+--------------------+----------+------------------+----------+
|             country|min_height|        avg_height|max_height|
+--------------------+----------+------------------+----------+
</pre>
</div>


In [22]:
## Dataframe Solution
df_res = df_events.join(df_regions,df_events.NOC==df_regions.NOC,how='inner')\
            .filter( (F.col("season")=='Winter') & (F.col("height")!='NA') )\
            .groupBy('country')\
            .agg(F.min('height').alias('min_height'),
                F.avg('height').alias('avg_height'),
                F.max('height').alias('max_height'))\
            .sort('avg_height', ascending=False)

df_res_count = df_res.count()
print(df_res_count)
df_res.show()

105
+--------------------+----------+------------------+----------+
|             country|min_height|        avg_height|max_height|
+--------------------+----------+------------------+----------+
|            Cameroon|       198|             198.0|       198|
|             Senegal|       192|             192.0|       192|
|             Uruguay|       188|             188.0|       188|
|        Puerto Rico*|       178|186.76923076923077|       196|
|British Virgin Is...|       186|             186.0|       186|
|            Bermuda*|       172|184.85714285714286|       196|
|               Tonga|       184|             184.0|       184|
|            Dominica|       183|             183.0|       183|
|          San Marino|       166|182.19230769230768|       192|
|            Zimbabwe|       182|             182.0|       182|
| Trinidad and Tobago|       175|181.57142857142858|       193|
|              Serbia|       167|          181.5625|       205|
|             Jamaica|       168|181

In [23]:
## SQL Solution
sql_res = spark.sql('''
  SELECT country,min(height) as min_height,
  avg(height) as avg_height, max(height) as max_height
  FROM sql_regions JOIN sql_events
  USING (NOC)
  WHERE season = 'Winter'
  AND height is not null AND height != 'NA'
  GROUP BY country
  ORDER BY avg_height DESC
''')
sql_res.show()

+--------------------+----------+------------------+----------+
|             country|min_height|        avg_height|max_height|
+--------------------+----------+------------------+----------+
|            Cameroon|       198|             198.0|       198|
|             Senegal|       192|             192.0|       192|
|             Uruguay|       188|             188.0|       188|
|        Puerto Rico*|       178|186.76923076923077|       196|
|British Virgin Is...|       186|             186.0|       186|
|            Bermuda*|       172|184.85714285714286|       196|
|               Tonga|       184|             184.0|       184|
|            Dominica|       183|             183.0|       183|
|          San Marino|       166|182.19230769230768|       192|
|            Zimbabwe|       182|             182.0|       182|
| Trinidad and Tobago|       175|181.57142857142858|       193|
|              Serbia|       167|          181.5625|       205|
|             Jamaica|       168|181.064
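
One caveat worth checking in both solutions: `height` was read as a string column. Minimum and maximum over strings compare character by character, whereas the average converts the values to numbers. The results above look plausible, presumably because the heights shown all have three digits. The plain-Python sketch below uses made-up values to show what could go wrong if a two-digit value appeared; casting the column first (as done for `Age` earlier with `cast(IntegerType())`) avoids the problem.

```python
heights_as_text = ['198', '99', '172', '205']  # made-up values, one with only two digits

# String comparison: '99' sorts after '205' because the character '9' > '2'
text_min, text_max = min(heights_as_text), max(heights_as_text)
print(text_min, text_max)  # 172 99

# Numeric comparison after converting to integers
heights_as_int = [int(h) for h in heights_as_text]
int_min, int_max = min(heights_as_int), max(heights_as_int)
print(int_min, int_max)  # 99 205
```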

<a class="anchor" id="lab-task-5"></a>
<div style="background:rgba(0,109,174,0.2);padding:10px;border-radius:4px"><strong style="color:#006DAE">TODO: </strong> Get the Olympic teams that have no matching country in noc_regions (e.g. the Soviet Union, since it no longer exists). <strong>Output should be in the following format:</strong>
<pre>
+--------------------+---+
|                team|noc|
+--------------------+---+
|               Almaz|URS|
|         Australasia|ANZ|
</pre>
</div>


In [24]:
df2 = df_regions.withColumnRenamed('NOC', 'noc2')

In [25]:
## Dataframe Solution
df_res = df_events.join(df2,df_events.NOC==df2.noc2,how='left')\
            .filter(F.col('noc2').isNull())\
            .select('team','noc').distinct()\
            .sort('team', ascending=True)

df_res_count = df_res.count()
print(df_res_count)
df_res.show()

70
+--------------------+---+
|                team|noc|
+--------------------+---+
|               Almaz|URS|
|         Australasia|ANZ|
|             Bohemia|BOH|
|           Bohemia-1|BOH|
|           Bohemia-2|BOH|
|           Bohemia-3|BOH|
|Bohemia/Great Bri...|BOH|
|         Burevestnik|URS|
|         Cha-Cha III|YUG|
|              Circus|WIF|
|               Crete|CRT|
|      Czechoslovakia|TCH|
|    Czechoslovakia-1|TCH|
|    Czechoslovakia-2|TCH|
|    Czechoslovakia-3|TCH|
|             Druzhba|URS|
|        East Germany|GDR|
|      East Germany-1|GDR|
|      East Germany-2|GDR|
|      East Germany-3|GDR|
+--------------------+---+
only showing top 20 rows



In [26]:
## SQL Solution
sql_res = spark.sql('''
  SELECT distinct team,e.noc
  FROM sql_events e LEFT JOIN sql_regions r
  USING (NOC)
  WHERE r.noc is null
  ORDER BY team
''')
sql_res.show()

+--------------------+---+
|                team|noc|
+--------------------+---+
|               Almaz|URS|
|         Australasia|ANZ|
|             Bohemia|BOH|
|           Bohemia-1|BOH|
|           Bohemia-2|BOH|
|           Bohemia-3|BOH|
|Bohemia/Great Bri...|BOH|
|         Burevestnik|URS|
|         Cha-Cha III|YUG|
|              Circus|WIF|
|               Crete|CRT|
|      Czechoslovakia|TCH|
|    Czechoslovakia-1|TCH|
|    Czechoslovakia-2|TCH|
|    Czechoslovakia-3|TCH|
|             Druzhba|URS|
|        East Germany|GDR|
|      East Germany-1|GDR|
|      East Germany-2|GDR|
|      East Germany-3|GDR|
+--------------------+---+
only showing top 20 rows
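
Both solutions use the same pattern: a left join followed by a null check keeps exactly the rows on the left side that have no match on the right, which is often called an anti-join. Spark's `DataFrame.join` also accepts `how='left_anti'` for this purpose. The plain-Python sketch below shows the semantics on a few made-up rows; the data and variable names are illustrative only.

```python
# Made-up rows standing in for df_events (team, noc) and the NOC codes in df_regions
events = [
    {'team': 'Australia', 'noc': 'AUS'},
    {'team': 'Soviet Union', 'noc': 'URS'},
    {'team': 'Australasia', 'noc': 'ANZ'},
    {'team': 'Soviet Union', 'noc': 'URS'},
]
region_nocs = {'AUS', 'NZL'}

# Keep events whose NOC has no match, de-duplicate with a set, then sort by team
unmatched = sorted(
    {(e['team'], e['noc']) for e in events if e['noc'] not in region_nocs}
)
print(unmatched)  # [('Australasia', 'ANZ'), ('Soviet Union', 'URS')]
```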



<div style="background:rgba(0,255,0,0.2);padding:10px;border-radius:4px">
 <h3>Assignment 1</h3>
    Once you are done with the lab tasks, please work on your Assignment 1.
</div>

**Congratulations on finishing this activity!**

Having practiced today's activities, we're now ready to embark on the rest of the exciting FIT5202 activities! See you next week!