# Glue Studio Notebook
You are now running a **Glue Studio** notebook. Before you can use it, you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, s3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0.                                                       |
| %security_configuration     |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |

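The `%%configure` row describes a JSON dictionary that bundles the settings otherwise given one magic at a time. As a rough sketch, the Python below assembles the text of such a cell. The key names are assumed to mirror the magic names in the table, and `render_configure_cell` is a hypothetical helper, not part of the Glue kernel.

```python
import json

# Hypothetical session settings. The keys mirror the magic names from the
# table above, which is an assumption about what %%configure accepts.
session_config = {
    "glue_version": "3.0",
    "worker_type": "G.1X",
    "number_of_workers": 2,
}


def render_configure_cell(config: dict) -> str:
    """Return the text of a notebook cell that starts with %%configure."""
    return "%%configure\n" + json.dumps(config, indent=4)


cell_text = render_configure_cell(session_config)
print(cell_text)
```

If the assumed key names are accepted by your kernel version, pasting the printed text into its own cell before the session starts should have the same effect as setting the three magics individually.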
In [2]:
%number_of_workers 2

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
It looks like there is a newer version of the kernel available. The latest version is 0.31 and you have 0.30 installed.
Please run `pip install --upgrade aws-glue-sessions` to upgrade your kernel
Previous number of workers: 5
Setting new number of workers to: 2


In [25]:
# Initialize the Glue Job
import sys
from pyspark.context import SparkContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglueml.transforms import FindMatches

# Create SparkContext
sparkContext = SparkContext.getOrCreate()

# Create Glue Context
glueContext = GlueContext(sparkContext)

# Get spark session
spark = glueContext.spark_session

# Create Glue Job
job = Job(glueContext)





## Read the tests dataset (without labels)

In [26]:
# gdf = Glue Dynamic Frame
gdf_tests = glueContext.create_dynamic_frame.from_options(
    connection_type="s3", 
    connection_options=dict(
        paths=[
            "s3://aws-data-lab-sanhe-for-everything-us-east-2/poc/2022-05-18-glue-find-matches/find-matches/tests/b5577eb9e7cd43118d7b8d70765853e6.csv"
        ],
        recurse=True,
    ),
    format="csv",
    format_options=dict(
        withHeader=True,
    ),
    transformation_ctx="datasource",
)




In [27]:
# print data schema
gdf_tests.printSchema()

root
|-- id: string
|-- firstname: string
|-- lastname: string
|-- phone: string


In [28]:
# preview the data
gdf_tests.toDF().show(3, truncate=False, vertical=True)

-RECORD 0-------------------
 id        | PersonId-00001 
 firstname | John           
 lastname  | aadden         
 phone     | 672-615-3608   
-RECORD 1-------------------
 id        | PersonId-00002 
 firstname | Jjcn           
 lastname  | aadden         
 phone     | 642-615-3608   
-RECORD 2-------------------
 id        | PersonId-00003 
 firstname | Joln           
 lastname  | aadden         
 phone     | 602-615-3608   
only showing top 3 rows

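The three previewed rows share a last name and differ by a character or two in `firstname` and `phone`, so an exact-equality comparison would treat them as unrelated. The sketch below uses Python's standard `difflib.SequenceMatcher` purely as a stand-in to show that the values are close but not equal. It is not how FindMatches scores records.

```python
from difflib import SequenceMatcher
from itertools import combinations

# First names copied from the three previewed records.
first_names = ["John", "Jjcn", "Joln"]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Score every unordered pair of first names.
pair_scores = {
    (a, b): similarity(a, b) for a, b in combinations(first_names, 2)
}

for (a, b), score in pair_scores.items():
    print(f"{a!r} vs {b!r}: equal={a == b}, similarity={score:.2f}")
```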

## Execute the ML Transformation (Predict)

In [29]:
gdf_predict = FindMatches.apply(
    frame=gdf_tests, 
    transformId="tfm-dfaeb9a7a5565ba554bc7c3e3a8b0009a79746a2",
    transformation_ctx="findmatches1",
    computeMatchConfidenceScores=True,
)




In [30]:
gdf_predict.printSchema()

root
|-- id: string
|-- firstname: string
|-- lastname: string
|-- phone: string
|-- match_id: long
|-- match_confidence_score: double


In [31]:
# preview the data
gdf_predict.toDF().show(3, truncate=False, vertical=True)

-RECORD 0--------------------------------
 id                     | PersonId-00031 
 firstname              | Auctin         
 lastname               | Ortgz          
 phone                  | 730-963-9164   
 match_id               | 30             
 match_confidence_score | 1.0            
-RECORD 1--------------------------------
 id                     | PersonId-00034 
 firstname              | Auwtin         
 lastname               | Ortiz          
 phone                  | 730-963-9164   
 match_id               | 30             
 match_confidence_score | 1.0            
-RECORD 2--------------------------------
 id                     | PersonId-00033 
 firstname              | Auwtin         
 lastname               | Ortiz          
 phone                  | 730-963-9164   
 match_id               | 30             
 match_confidence_score | 1.0            
only showing top 3 rows

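Records that FindMatches assigns to the same group share a `match_id`, as the three previewed rows do. A minimal sketch of how the output could be consumed, using plain Python over those three rows; the `min_score` threshold of 0.9 is an arbitrary value chosen for illustration.

```python
from collections import defaultdict

# The three rows shown in the preview above.
rows = [
    {"id": "PersonId-00031", "firstname": "Auctin", "lastname": "Ortgz",
     "phone": "730-963-9164", "match_id": 30, "match_confidence_score": 1.0},
    {"id": "PersonId-00034", "firstname": "Auwtin", "lastname": "Ortiz",
     "phone": "730-963-9164", "match_id": 30, "match_confidence_score": 1.0},
    {"id": "PersonId-00033", "firstname": "Auwtin", "lastname": "Ortiz",
     "phone": "730-963-9164", "match_id": 30, "match_confidence_score": 1.0},
]


def group_matches(records, min_score=0.9):
    """Map each match_id to the ids of its records scoring at least min_score."""
    groups = defaultdict(list)
    for record in records:
        if record["match_confidence_score"] >= min_score:
            groups[record["match_id"]].append(record["id"])
    return dict(groups)


groups = group_matches(rows)
print(groups)
```

On the full result, the same grouping could be expressed on the Spark DataFrame returned by `gdf_predict.toDF()`, for example with `groupBy("match_id")`.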

## Write the Predict Result to S3

In [32]:
# Write the predictions to S3 as CSV. Glue treats `path` as an output prefix
# and writes part files under it, not a single file named 1.csv.
datasink = glueContext.write_dynamic_frame.from_options(
    frame=gdf_predict,
    connection_type="s3",
    connection_options=dict(
        path="s3://aws-data-lab-sanhe-for-everything-us-east-2/poc/2022-05-18-glue-find-matches/find-matches/predict/1.csv",
    ),
    format="csv",
    format_options=dict(
        withHeader=True,
    ),
    transformation_ctx="datasink",
)

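The read and write cells repeat the same bucket and key prefix. One possible way to keep them in sync is a small helper like the hypothetical `s3_uri` below. The bucket and prefix are copied from the cells above; the helper itself is an illustration, not part of the original notebook.

```python
# Bucket and key prefix copied from the read and write cells.
BUCKET = "aws-data-lab-sanhe-for-everything-us-east-2"
PREFIX = "poc/2022-05-18-glue-find-matches/find-matches"


def s3_uri(*parts: str) -> str:
    """Join the shared bucket and prefix with extra key parts into an s3:// URI."""
    key = "/".join([PREFIX, *parts])
    return f"s3://{BUCKET}/{key}"


tests_uri = s3_uri("tests", "b5577eb9e7cd43118d7b8d70765853e6.csv")
predict_uri = s3_uri("predict", "1.csv")
print(tests_uri)
print(predict_uri)
```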

