
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A json-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your aws configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, s3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (eg: %glue_version 2.0).                               |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %spark_conf                 |  String      |  Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer.                      |

## Using 3rd Party Library

Replace `{S3_BUCKET}` below with your bucket name. The command adds the required third-party library to the session.

In [None]:
%extra_py_files "s3://{S3_BUCKET}/library/pycountry_convert.zip"

The code below contains boilerplate imports that are generally included at the start of every Spark/Glue job, followed by an import statement for the third-party library. It then loads a sample CSV file from S3 into a Spark dataframe. Be sure to replace `{S3_BUCKET}` with your bucket name before running the code.

In [None]:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType, StringType
from pyspark import SparkContext
from pyspark.sql import SQLContext

from datetime import datetime
from pycountry_convert import (
    convert_country_alpha2_to_country_name,
    convert_country_alpha2_to_continent,
    convert_country_name_to_country_alpha2,
    convert_country_alpha3_to_country_alpha2,
)

s3_path = "{S3_BUCKET}"

df = spark.read.load("s3://" + s3_path + "/input/lab2/sample.csv", 
                          format="csv", 
                          sep=",", 
                          inferSchema="true",
                          header="true")

We will define a UDF (user-defined function) to process a Spark dataframe. UDFs let a developer extend the standard Spark functionality with Python code. To do that, wrap a Python function with `udf()` and declare its return type; here the function is passed in as a lambda. The code below creates a Spark UDF, `udf_get_country_code2`, that converts a country name into a two-letter code. Names that cannot be resolved produce an empty string.

In [None]:
def get_country_code2(country_name):
    # Return the two-letter (alpha-2) code, or '' if the name is not recognized.
    try:
        return convert_country_name_to_country_alpha2(country_name)
    except KeyError:
        return ''

udf_get_country_code2 = udf(lambda z: get_country_code2(z), StringType())
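The fallback behavior of the wrapped function can be checked in plain Python before it is registered as a UDF. The sketch below is illustrative only: `lookup_alpha2` and `_SAMPLE_CODES` are hypothetical stand-ins for the `pycountry_convert` lookup, assumed to raise `KeyError` for unknown names as the notebook code expects.

```python
# Hypothetical stand-in for the pycountry_convert name-to-alpha2 lookup,
# so the fallback logic can be exercised without Spark or the library.
_SAMPLE_CODES = {"Germany": "DE", "Japan": "JP", "Brazil": "BR"}


def lookup_alpha2(country_name):
    # A plain dict lookup raises KeyError for names it does not know.
    return _SAMPLE_CODES[country_name]


def get_country_code2(country_name):
    # Same shape as the notebook function: unknown names map to ''.
    try:
        return lookup_alpha2(country_name)
    except KeyError:
        return ""


codes = [get_country_code2(name) for name in ["Germany", "Atlantis", "Japan"]]
print(codes)  # ['DE', '', 'JP']
```

Testing the plain function first keeps errors easy to read, since exceptions raised inside a Spark UDF surface only when an action such as `show()` runs.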


Next, we create a new dataframe that includes a column computed with the UDF defined above. Notice the new column `country_code_2` in the new dataframe's schema.

In [None]:
new_df = df.withColumn('country_code_2', udf_get_country_code2(col("Country")))
new_df.printSchema()


Let's take a look at the data in this new dataframe - notice the new column `country_code_2`. The dataframe now contains two-letter country codes that were determined based on the `Country` column.

In [None]:
new_df.show(10)

## Using Data Catalog

So far, we have been running standard Spark code. Now we will try some Glue-flavored PySpark code by loading a table that we created in Lab 01 into a Glue dynamic frame. After the data is loaded, compare the schema the dynamic frame presents with the schema stored in the Glue Data Catalog table.

Notice that the code does not specify an S3 location: the Glue Data Catalog table definition already records where the data lives.

In [None]:
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

dynamic_frame = glueContext.create_dynamic_frame.from_catalog(database="console_glueworkshop", table_name="console_csv")
dynamic_frame.printSchema()
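One possible way to get the catalog side of that comparison is to read the table definition through the Glue `get_table` API. The sketch below is an assumption about how you might do this, not part of the lab: the helper names are hypothetical, and the sample response is a hand-written fragment with made-up column names.

```python
def catalog_columns(get_table_response):
    """Return (name, type) pairs from a Glue get_table response dict."""
    table = get_table_response["Table"]
    columns = table["StorageDescriptor"]["Columns"]
    # Partition keys are stored separately from the regular columns.
    partition_keys = table.get("PartitionKeys", [])
    return [(c["Name"], c["Type"]) for c in columns + partition_keys]


def fetch_catalog_columns(database, table_name):
    # boto3 is imported lazily so catalog_columns() works without AWS access.
    import boto3

    glue = boto3.client("glue")
    response = glue.get_table(DatabaseName=database, Name=table_name)
    return catalog_columns(response)


# Hand-written response fragment for illustration; the column names are made up.
sample_response = {
    "Table": {
        "Name": "console_csv",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "uuid", "Type": "bigint"},
                {"Name": "country", "Type": "string"},
            ]
        },
        "PartitionKeys": [],
    }
}

print(catalog_columns(sample_response))
```

In the notebook, `fetch_catalog_columns("console_glueworkshop", "console_csv")` would return the catalog columns for the table loaded above, provided the session role is allowed to call `glue:GetTable`.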

We can view the data in the **Glue Dynamic Frame** by first converting it to a Spark **DataFrame** with the `toDF()` function and then calling the standard DataFrame `show()` function.

In [None]:
dynamic_frame.toDF().show(10)

In [None]:
# Stop the current session

%stop_session