# Spark SQL mini-project exercises

#### README
The exercises come with this notebook and a "data" folder, which contains the dataset used for the exercises.
Some of the code is already written to help you get started, and explanatory text accompanies each exercise to aid understanding.

The first part sets up a database and loads the dataset used for the exercises. This is already done for you, except that you need to change the path to where you store the data. After editing the path, run all the cells up to the **Exercise** part.

In [1]:
# Initialising the spark session.
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('Practise').getOrCreate()
spark

22/02/17 18:24:54 WARN Utils: Your hostname, pop-os resolves to a loopback address: 127.0.1.1; using 192.168.50.144 instead (on interface wlp4s0)
22/02/17 18:24:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/02/17 18:24:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [8]:
# Creates the database (IF NOT EXISTS makes the cell safe to re-run)
spark.sql("CREATE DATABASE IF NOT EXISTS flightDB")

# Specifies which database to use
spark.sql("USE flightDB")

# Creates the table
spark.sql("""
            CREATE TABLE IF NOT EXISTS flights (
            DEST_COUNTRY_NAME STRING COMMENT "Describes destination country", 
            ORIGIN_COUNTRY_NAME STRING COMMENT "Describes departure country", 
            count LONG COMMENT "Describes number of departures")
            USING csv OPTIONS (header true, path '/home/vugs/Qsync/Private/Kent Vugs Nielsen/DV-AAU/4. Semester/BDS/mini-projects/miniproject1/Mini-project-exercise-sql/data')
            """)
spark.sql("SELECT * FROM flights").show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|   15|
|       United States|            Croatia|    1|
|       United States|            Ireland|  344|
|               Egypt|      United States|   15|
|       United States|              India|   62|
|       United States|          Singapore|    1|
|       United States|            Grenada|   62|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|       United States|       Sint Maarten|  325|
|       United States|   Marshall Islands|   39|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|       United States|           Paraguay|    6|
|             Algeri

In [9]:
# Metadata: lists the table's columns, data types, and comments
spark.sql("DESCRIBE flights").show()

+-------------------+---------+--------------------+
|           col_name|data_type|             comment|
+-------------------+---------+--------------------+
|  DEST_COUNTRY_NAME|   string|Describes destina...|
|ORIGIN_COUNTRY_NAME|   string|Describes departu...|
|              count|   bigint|Describes number ...|
+-------------------+---------+--------------------+



## Exercise 1: Basic SQL

**TODO** From *flightDB* use the table *flights* to compute the number of flights for each destination country. Order this from highest to lowest.

In [None]:
# Write the code for exercise 1 here...

## Exercise 2: Views

**TODO** Create a *view* that only contains rows where the country of origin is 'United States', using the table *flights*.

**TODO** Repeat the process from exercise 1 using the view: compute the number of flights for each destination country. Order this from highest to lowest.

In [None]:
# Write the code for exercise 2 here...

## Exercise 3: Performance

**TODO** In the Spark UI, determine how the results of exercise 1 and exercise 2 compare. Describe your observations in words and explain them.

Write your answers here for exercise 3...

## Exercise 4: Case statements

**TODO** Imagine your boss says the system is outdated. Every row containing the value 'United States' or 'Denmark' should instead contain 'USA' or 'DK', respectively. And for mysterious reasons (the boss won't tell you), all other values in the country column should be 0.

**NOTE** Use the table *partitioned_flights* to solve the exercise.

In [None]:
# Write the code for exercise 4 here...

# Creates the partitioned table used in this exercise (5 rows of flights, partitioned by destination country)
spark.sql("""
            CREATE TABLE partitioned_flights USING parquet PARTITIONED BY (DEST_COUNTRY_NAME)
            AS SELECT DEST_COUNTRY_NAME, count FROM flights LIMIT 5
            """)

## Exercise 5: Lists

**TODO** Convert an array into rows. The view *flights_agg* contains an array column (*collected_counts*); use this view to solve the exercise.

In [None]:
# Write the code for exercise 5 here...
spark.sql("""
            CREATE OR REPLACE TEMP VIEW flights_agg AS
            SELECT DEST_COUNTRY_NAME, collect_set(count) as collected_counts
            FROM flights GROUP BY DEST_COUNTRY_NAME
            """)

## Exercise 6: User defined functions

**TODO** Create a function that determines the ratio between the number of departures and the number of arrivals for each country. **NOTE** Create a view, based on the table *flights*, containing the information needed to compute the ratio.

**TODO** Create a pandas function that also calculates the ratio, using the *pandas_udf* decorator. Is there a performance difference? Describe and explain your answer. **NOTE** The required packages are pre-imported, and no further packages should be needed.

In [None]:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

# Write your solution to exercise 6 here...

# Dropping the table and database

This cell is only here in case you need it, for example to start over.

In [7]:
spark.sql("DROP TABLE IF EXISTS flights")
# CASCADE also drops any tables still in the database, such as partitioned_flights
spark.sql("DROP DATABASE IF EXISTS flightDB CASCADE")

DataFrame[]