# Spark SQL mini project exercises.

#### README
The exercises come with this notebook and a "data" folder, which contains the dataset used for the exercises.
Some of the code is already written to help you get started, along with explanatory text to aid the understanding of each exercise.

The first part sets up a database and loads the dataset used for the exercises. This is already done beforehand; just run all the cells until the **Exercise** part.

In [2]:
# Initialising the spark session.
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('Practise').getOrCreate()
spark

In [3]:
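# Creates the database (IF NOT EXISTS makes the cell safe to re-run)
spark.sql("CREATE DATABASE IF NOT EXISTS flightDB")

# Specifies which DB to use
spark.sql("USE flightDB")

# Creates the table; adjust the path below to where 2015-summary.csv is stored on your machine
spark.sql("""
            CREATE TABLE IF NOT EXISTS flights (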
            DEST_COUNTRY_NAME STRING COMMENT "Describes destination country", 
            ORIGIN_COUNTRY_NAME STRING COMMENT "Describes departure country", 
            count LONG COMMENT "Describes number of departures")
            USING csv OPTIONS (header true, path 'C:/Users/Hecter/OneDrive/4. sem/Big Data Systems/Topic_1_Spark_Introduction/Kode/2015-summary.csv')
            """)
spark.sql("SELECT * FROM flights").show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|   15|
|       United States|            Croatia|    1|
|       United States|            Ireland|  344|
|               Egypt|      United States|   15|
|       United States|              India|   62|
|       United States|          Singapore|    1|
|       United States|            Grenada|   62|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|       United States|       Sint Maarten|  325|
|       United States|   Marshall Islands|   39|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|       United States|           Paraguay|    6|
|             Algeri

In [65]:
#METADATA
spark.sql("DESCRIBE flights").show()

+-------------------+---------+--------------------+
|           col_name|data_type|             comment|
+-------------------+---------+--------------------+
|  DEST_COUNTRY_NAME|   string|Describes destina...|
|ORIGIN_COUNTRY_NAME|   string|Describes departu...|
|              count|   bigint|Describes number ...|
+-------------------+---------+--------------------+



## Exercise 1: Basic SQL

**TODO** From *flightDB* use the table *flights* to compute the number of flights for each destination country. Order this from highest to lowest.

In [66]:
# FOR DEVELOPERS. WRITE THE CORRECT SOLUTION HERE.

# Number of flights for each destination country. Order this from highest to lowest.
spark.sql("""SELECT DEST_COUNTRY_NAME AS Country, sum(count) AS Number_of_arriving_flights FROM flights
            GROUP BY Country 
            ORDER BY Number_of_arriving_flights DESC""").show()

+------------------+--------------------------+
|           Country|Number_of_arriving_flights|
+------------------+--------------------------+
|     United States|                    411352|
|            Canada|                      8399|
|            Mexico|                      7140|
|    United Kingdom|                      2025|
|             Japan|                      1548|
|           Germany|                      1468|
|Dominican Republic|                      1353|
|       South Korea|                      1048|
|       The Bahamas|                       955|
|            France|                       935|
|          Colombia|                       873|
|            Brazil|                       853|
|       Netherlands|                       776|
|             China|                       772|
|           Jamaica|                       666|
|        Costa Rica|                       588|
|       El Salvador|                       561|
|            Panama|                    
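The same aggregation can also be expressed with the DataFrame API instead of a SQL string. The following is a minimal sketch, assuming the `spark` session and the *flights* table from the setup cells; the helper name `arrivals_per_destination` and its `source` parameter are made up for illustration.

```python
def arrivals_per_destination(spark, source="flights"):
    """Total flights per destination country, highest first.

    `source` is the name of a table or view that has the columns
    DEST_COUNTRY_NAME and count (hypothetical parameter for this sketch).
    """
    return (
        spark.table(source)
        .groupBy("DEST_COUNTRY_NAME")
        .sum("count")  # the aggregate column is named "sum(count)"
        .withColumnRenamed("DEST_COUNTRY_NAME", "Country")
        .withColumnRenamed("sum(count)", "Number_of_arriving_flights")
        .orderBy("Number_of_arriving_flights", ascending=False)
    )
```

Calling `arrivals_per_destination(spark).show()` should print the same kind of table as the SQL version. Because `source` can be any table or view name, the helper can be reused unchanged on a view.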

## Exercise 2: Views

**TODO** Create a *view* that only contains rows where the country of origin is 'United States', using the table *flights*.

**TODO** Repeat the same process as in exercise 1: compute the number of flights for each destination country. Order this from highest to lowest.

In [67]:
# FOR DEVELOPERS. WRITE THE CORRECT SOLUTION HERE.

#Create a view displaying all departures from United States
spark.sql("""CREATE OR REPLACE VIEW dep_us AS 
            SELECT * FROM flights WHERE ORIGIN_COUNTRY_NAME = 'United States'""")

spark.sql("SELECT * FROM dep_us ORDER BY count DESC").show()


+------------------+-------------------+------+
| DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+------------------+-------------------+------+
|     United States|      United States|370002|
|            Canada|      United States|  8399|
|            Mexico|      United States|  7140|
|    United Kingdom|      United States|  2025|
|             Japan|      United States|  1548|
|           Germany|      United States|  1468|
|Dominican Republic|      United States|  1353|
|       South Korea|      United States|  1048|
|       The Bahamas|      United States|   955|
|            France|      United States|   935|
|          Colombia|      United States|   873|
|            Brazil|      United States|   853|
|       Netherlands|      United States|   776|
|             China|      United States|   772|
|           Jamaica|      United States|   666|
|        Costa Rica|      United States|   588|
|       El Salvador|      United States|   561|
|            Panama|      United States|

In [8]:
#Repeat the same process as in exercise 1 but this time with a view
spark.sql("""CREATE OR REPLACE VIEW all_dept AS
            SELECT * FROM flights""")

spark.sql("""SELECT DEST_COUNTRY_NAME AS Country, sum(count) AS Number_of_arriving_flights FROM all_dept
            GROUP BY Country 
            ORDER BY Number_of_arriving_flights DESC""").show()

+------------------+--------------------------+
|           Country|Number_of_arriving_flights|
+------------------+--------------------------+
|     United States|                    411352|
|            Canada|                      8399|
|            Mexico|                      7140|
|    United Kingdom|                      2025|
|             Japan|                      1548|
|           Germany|                      1468|
|Dominican Republic|                      1353|
|       South Korea|                      1048|
|       The Bahamas|                       955|
|            France|                       935|
|          Colombia|                       873|
|            Brazil|                       853|
|       Netherlands|                       776|
|             China|                       772|
|           Jamaica|                       666|
|        Costa Rica|                       588|
|       El Salvador|                       561|
|            Panama|                    

## Exercise 3: Performance

**TODO** In the Spark UI, determine how the results of exercise 1 and exercise 2 compare. Write down your observations in words and explain them.

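**ANSWER**: In the Spark UI, the query on the table took 91 ms, whereas the query on the view took only 14 ms.
A view stores no data of its own: it is a saved query that Spark expands into the plan of whatever query reads it, so the view *all_dept* reads the same CSV file as the table *flights* and runs essentially the same plan. A gap of this size is therefore more plausibly explained by run-to-run effects (warm-up, code generation, file-system caching) than by the view itself; repeating both queries several times gives a fairer comparison.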
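One way to check an observation like this is to look at the plans Spark generates and to time each query over several runs rather than once. A minimal sketch, assuming the `spark` session, the *flights* table and the *all_dept* view from the cells above; the helper names `explain_plan` and `median_runtime` are hypothetical.

```python
import statistics
import time

QUERY = """SELECT DEST_COUNTRY_NAME AS Country, sum(count) AS Number_of_arriving_flights
FROM {source}
GROUP BY DEST_COUNTRY_NAME
ORDER BY Number_of_arriving_flights DESC"""


def explain_plan(spark, source):
    """Return the plan text that EXPLAIN reports for the query on `source`."""
    return spark.sql("EXPLAIN " + QUERY.format(source=source)).first()[0]


def median_runtime(spark, source, runs=5):
    """Median wall-clock seconds over several runs, collecting all rows."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        spark.sql(QUERY.format(source=source)).collect()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

Printing `explain_plan(spark, "flights")` next to `explain_plan(spark, "all_dept")` shows whether the view adds any work to the plan, and `median_runtime` smooths out the one-off costs of a first run.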

## Exercise 4: Case statements

**TODO** Imagine your boss says the system is outdated. Every row containing the values 'United States' and 'Denmark' should be 'USA' and 'DK' respectively. And for mysterious reasons (the boss won't tell you) all other values should be 0 (for the country column).

**NOTE** Use the table *partitioned_flights* to solve the exercise.

In [12]:
# FOR DEVELOPERS. WRITE THE CORRECT SOLUTION HERE.

#Partitioned flights table
spark.sql("""
            CREATE TABLE partitioned_flights USING parquet PARTITIONED BY (DEST_COUNTRY_NAME)
            AS SELECT DEST_COUNTRY_NAME, count FROM flights LIMIT 10
            """)

DataFrame[]

In [13]:
#Case, when, then statement. Note: ELSE NULL returns NULL for all other countries, while the exercise text asks for 0.
spark.sql("""
            SELECT 
            CASE WHEN DEST_COUNTRY_NAME = 'United States' then 'USA'
            WHEN DEST_COUNTRY_NAME = 'Denmark' then 'DK'
            ELSE NULL END
            FROM partitioned_flights
            """).show()

+---------------------------------------------------------------------------------------------------------------+
|CASE WHEN (DEST_COUNTRY_NAME = United States) THEN USA WHEN (DEST_COUNTRY_NAME = Denmark) THEN DK ELSE NULL END|
+---------------------------------------------------------------------------------------------------------------+
|                                                                                                            USA|
|                                                                                                            USA|
|                                                                                                            USA|
|                                                                                                            USA|
|                                                                                                            USA|
|                                                                                       
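The query above returns NULL for every other country, whereas the exercise text asks for 0. A possible variant that follows the text is sketched below, assuming the `spark` session and the *partitioned_flights* table exist as above; the alias `country_code` and the helper name `recode_countries` are made up for this example. The fallback is written as the string `'0'` so that every branch of the CASE expression has the same type.

```python
RECODE_QUERY = """
SELECT
    CASE WHEN DEST_COUNTRY_NAME = 'United States' THEN 'USA'
         WHEN DEST_COUNTRY_NAME = 'Denmark' THEN 'DK'
         ELSE '0' END AS country_code,
    count
FROM partitioned_flights
"""


def recode_countries(spark):
    """Run the recoding query and return the resulting DataFrame."""
    return spark.sql(RECODE_QUERY)
```

`recode_countries(spark).show()` prints the recoded column under a readable name. Since *partitioned_flights* is built with `LIMIT 10`, 'Denmark' may not appear among its rows at all.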

## Exercise 5: Lists

**TODO** Convert an array into rows. The view *flights_agg* contains an array, use the created view to solve the exercise.

In [24]:
# FOR DEVELOPERS. WRITE THE CORRECT SOLUTION HERE.

#Create flights_agg view
spark.sql("""
            CREATE OR REPLACE TEMP VIEW flights_agg AS
            SELECT DEST_COUNTRY_NAME, collect_set(count) as collected_counts
            FROM flights GROUP BY DEST_COUNTRY_NAME
            """)

# Convert an array into rows
spark.sql("""SELECT explode(collected_counts), DEST_COUNTRY_NAME FROM flights_agg""").show()

+---+--------------------+
|col|   DEST_COUNTRY_NAME|
+---+--------------------+
|  4|             Algeria|
| 15|              Angola|
| 41|            Anguilla|
|126| Antigua and Barbuda|
|180|           Argentina|
|346|               Aruba|
|329|           Australia|
| 62|             Austria|
| 21|          Azerbaijan|
| 19|             Bahrain|
|154|            Barbados|
|259|             Belgium|
|188|              Belize|
|183|             Bermuda|
| 30|             Bolivia|
| 58|Bonaire, Sint Eus...|
|853|              Brazil|
|107|British Virgin Is...|
|  3|            Bulgaria|
|  1|        Burkina Faso|
+---+--------------------+
only showing top 20 rows
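`explode` names its output column `col` by default, as the output above shows. Below is a small sketch of the same conversion with an explicit column name, using Spark SQL's `LATERAL VIEW` syntax. It assumes the `spark` session and the *flights_agg* view from the cell above; the names `exploded`, `flight_count` and `counts_as_rows` are chosen here for illustration.

```python
EXPLODE_QUERY = """
SELECT DEST_COUNTRY_NAME, flight_count
FROM flights_agg
LATERAL VIEW explode(collected_counts) exploded AS flight_count
"""


def counts_as_rows(spark):
    """One output row per element of the collected_counts array."""
    return spark.sql(EXPLODE_QUERY)
```

`counts_as_rows(spark).show()` should list the same rows as before, with the exploded values in a column called `flight_count`.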



## Exercise 6: User defined functions

**TODO** Create a function that determines the ratio between the number of departures and the number of arrivals for each country. **NOTE** Create a view, based on the table *flights*, containing the information needed to compute the ratio.

**TODO** Create a pandas function that also calculates the ratio, using the *pandas_udf* decorator from *pyspark.sql.functions*. Is there a performance difference? Describe your answer and explain. **NOTE** The required packages are pre-imported and no further packages should be needed.

In [89]:
import pandas as pd  # needed for the pd.Series type hints of a pandas UDF
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

In [None]:
# FOR DEVELOPERS. WRITE THE CORRECT SOLUTION HERE.
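No solution is recorded for this exercise, so the following is only one possible approach, sketched under stated assumptions: the `spark` session and the *flights* table from the setup cells, a temporary view called `country_traffic`, and UDF names `dep_arr_ratio` and `ratio_pandas`, all of which are invented for this example. The ratio is left undefined (None) for countries with no arrivals.

```python
TRAFFIC_VIEW = """
CREATE OR REPLACE TEMP VIEW country_traffic AS
SELECT COALESCE(d.country, a.country) AS country,
       COALESCE(d.departures, 0) AS departures,
       COALESCE(a.arrivals, 0) AS arrivals
FROM (SELECT ORIGIN_COUNTRY_NAME AS country, sum(count) AS departures
      FROM flights GROUP BY ORIGIN_COUNTRY_NAME) d
FULL OUTER JOIN
     (SELECT DEST_COUNTRY_NAME AS country, sum(count) AS arrivals
      FROM flights GROUP BY DEST_COUNTRY_NAME) a
ON d.country = a.country
"""


def departure_arrival_ratio(departures, arrivals):
    """Departures divided by arrivals; None when the ratio is undefined."""
    if departures is None or not arrivals:
        return None
    return departures / arrivals


def ratios_with_python_udf(spark):
    """Register the plain Python function as a SQL UDF and apply it."""
    spark.sql(TRAFFIC_VIEW)
    spark.udf.register("dep_arr_ratio", departure_arrival_ratio, "double")
    return spark.sql(
        "SELECT country, dep_arr_ratio(departures, arrivals) AS ratio "
        "FROM country_traffic"
    )


def ratios_with_pandas_udf(spark):
    """Same ratio as a vectorised pandas UDF (needs pandas and pyarrow)."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def ratio_pandas(departures: pd.Series, arrivals: pd.Series) -> pd.Series:
        # Zero arrivals become missing values, so the division gives NaN, not inf.
        return departures / arrivals.where(arrivals != 0)

    spark.sql(TRAFFIC_VIEW)
    spark.udf.register("ratio_pandas", ratio_pandas)
    return spark.sql(
        "SELECT country, ratio_pandas(departures, arrivals) AS ratio "
        "FROM country_traffic"
    )
```

Both helpers return a DataFrame, so `ratios_with_python_udf(spark).show()` and `ratios_with_pandas_udf(spark).show()` can be compared directly. A plain Python UDF is called once per row, while a pandas UDF receives whole batches as `pd.Series`; on a dataset this small the difference in run time may well be negligible.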

# NOTES FOR DEVELOPERS:

In [40]:
spark.sql("""CREATE VIEW IF NOT EXISTS nested_data AS 
SELECT (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME) as country, count FROM flights""")

DataFrame[]

In [42]:
spark.sql("""SELECT * FROM nested_data""").show()

+--------------------+-----+
|             country|count|
+--------------------+-----+
|{United States, R...|   15|
|{United States, C...|    1|
|{United States, I...|  344|
|{Egypt, United St...|   15|
|{United States, I...|   62|
|{United States, S...|    1|
|{United States, G...|   62|
|{Costa Rica, Unit...|  588|
|{Senegal, United ...|   40|
|{Moldova, United ...|    1|
|{United States, S...|  325|
|{United States, M...|   39|
|{Guyana, United S...|   64|
|{Malta, United St...|    1|
|{Anguilla, United...|   41|
|{Bolivia, United ...|   30|
|{United States, P...|    6|
|{Algeria, United ...|    4|
|{Turks and Caicos...|  230|
|{United States, G...|    1|
+--------------------+-----+
only showing top 20 rows



In [44]:
spark.sql("""SELECT DEST_COUNTRY_NAME as new_name, collect_set(count) as flight_counts,
collect_set(ORIGIN_COUNTRY_NAME) as origin_set
FROM flights GROUP BY DEST_COUNTRY_NAME""").show()

+--------------------+-------------+---------------+
|            new_name|flight_counts|     origin_set|
+--------------------+-------------+---------------+
|             Algeria|          [4]|[United States]|
|              Angola|         [15]|[United States]|
|            Anguilla|         [41]|[United States]|
| Antigua and Barbuda|        [126]|[United States]|
|           Argentina|        [180]|[United States]|
|               Aruba|        [346]|[United States]|
|           Australia|        [329]|[United States]|
|             Austria|         [62]|[United States]|
|          Azerbaijan|         [21]|[United States]|
|             Bahrain|         [19]|[United States]|
|            Barbados|        [154]|[United States]|
|             Belgium|        [259]|[United States]|
|              Belize|        [188]|[United States]|
|             Bermuda|        [183]|[United States]|
|             Bolivia|         [30]|[United States]|
|Bonaire, Sint Eus...|         [58]|[United St

In [46]:
spark.sql("""SELECT DEST_COUNTRY_NAME, ARRAY(1, 2, 3) FROM flights""").show()

+--------------------+--------------+
|   DEST_COUNTRY_NAME|array(1, 2, 3)|
+--------------------+--------------+
|       United States|     [1, 2, 3]|
|       United States|     [1, 2, 3]|
|       United States|     [1, 2, 3]|
|               Egypt|     [1, 2, 3]|
|       United States|     [1, 2, 3]|
|       United States|     [1, 2, 3]|
|       United States|     [1, 2, 3]|
|          Costa Rica|     [1, 2, 3]|
|             Senegal|     [1, 2, 3]|
|             Moldova|     [1, 2, 3]|
|       United States|     [1, 2, 3]|
|       United States|     [1, 2, 3]|
|              Guyana|     [1, 2, 3]|
|               Malta|     [1, 2, 3]|
|            Anguilla|     [1, 2, 3]|
|             Bolivia|     [1, 2, 3]|
|       United States|     [1, 2, 3]|
|             Algeria|     [1, 2, 3]|
|Turks and Caicos ...|     [1, 2, 3]|
|       United States|     [1, 2, 3]|
+--------------------+--------------+
only showing top 20 rows



In [47]:
spark.sql("""SELECT DEST_COUNTRY_NAME as new_name, collect_list(count)[0]
FROM flights GROUP BY DEST_COUNTRY_NAME""").show()

+--------------------+----------------------+
|            new_name|collect_list(count)[0]|
+--------------------+----------------------+
|             Algeria|                     4|
|              Angola|                    15|
|            Anguilla|                    41|
| Antigua and Barbuda|                   126|
|           Argentina|                   180|
|               Aruba|                   346|
|           Australia|                   329|
|             Austria|                    62|
|          Azerbaijan|                    21|
|             Bahrain|                    19|
|            Barbados|                   154|
|             Belgium|                   259|
|              Belize|                   188|
|             Bermuda|                   183|
|             Bolivia|                    30|
|Bonaire, Sint Eus...|                    58|
|              Brazil|                   853|
|British Virgin Is...|                   107|
|            Bulgaria|            

In [57]:
spark.sql("""SELECT * FROM flights
WHERE origin_country_name IN (SELECT dest_country_name FROM flights
GROUP BY dest_country_name ORDER BY sum(count) DESC LIMIT 5)""").show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Egypt|      United States|   15|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|             Algeria|      United States|    4|
|Turks and Caicos ...|      United States|  230|
|Saint Vincent and...|      United States|    1|
|               Italy|      United States|  382|
|            Pakistan|      United States|   12|
|             Iceland|      United States|  181|
|    Marshall Islands|      United States|   42|
|          Luxembourg|      United States|  155|
|            Honduras|      United States|  362|
|         The Bahama

# Dropping the table and database.

In [62]:
spark.sql("DROP TABLE IF EXISTS flights")
#spark.sql("DROP DATABASE IF EXISTS flightDB")

DataFrame[]
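The cell above only drops the *flights* table, so the views and the *partitioned_flights* table created during the exercises remain, and cells such as the `CREATE TABLE partitioned_flights` one fail on a second run because the object already exists. A possible fuller cleanup is sketched below, assuming the object names are exactly those used in the cells above and that *flightDB* is still the current database; the helper name `drop_exercise_objects` is hypothetical.

```python
VIEWS = ["dep_us", "all_dept", "nested_data", "flights_agg"]
TABLES = ["partitioned_flights", "flights"]


def drop_exercise_objects(spark, drop_database=False):
    """Drop the views and tables created in this notebook, if they exist."""
    for view in VIEWS:
        spark.sql(f"DROP VIEW IF EXISTS {view}")
    for table in TABLES:
        spark.sql(f"DROP TABLE IF EXISTS {table}")
    if drop_database:
        # Switch to the default database first so flightDB is no longer in use.
        spark.sql("USE default")
        spark.sql("DROP DATABASE IF EXISTS flightDB CASCADE")
```

`drop_exercise_objects(spark)` keeps the database, matching the commented-out line above; passing `drop_database=True` removes *flightDB* as well.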