# Glue Studio Notebook
This is a **Glue Studio** notebook. You *must* start an interactive session before you can run any cells.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma-separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma-separated list of pip packages, S3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma-separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma-separated list of additional JARs to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0.                                                       |
| %security_configuration     |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
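
As the `%%configure` row notes, session parameters can be set together in one JSON dictionary instead of one magic at a time. The sketch below builds the text of such a cell with the standard `json` module. The key names are an assumption (they mirror the magic names in the table without the leading `%`), and the values are illustrative, taken from the worker settings the kernel reports when this notebook starts its session. `render_configure_cell` is a hypothetical helper; check the kernel's `%help` output for the keys it actually accepts.

```python
import json

# Assumption: key names mirror the magic names from the table, minus the "%".
session_config = {
    "glue_version": "3.0",
    "worker_type": "G.1X",
    "number_of_workers": 5,
}


def render_configure_cell(config: dict) -> str:
    """Return the text of a notebook cell that starts with the %%configure magic."""
    return "%%configure\n" + json.dumps(config, indent=4)


cell_text = render_configure_cell(session_config)
print(cell_text)
```

Generating the cell body with `json.dumps` rather than typing it by hand avoids quoting mistakes such as single quotes or trailing commas, which are not valid JSON.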

# Initialize the Glue Job

In [None]:
# Initialize the Glue Job
import sys
from pyspark.context import SparkContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import DropFields
from awsglueml.transforms import FindMatches, FindIncrementalMatches

# Create SparkContext
sparkContext = SparkContext.getOrCreate()

# Create Glue Context
glueContext = GlueContext(sparkContext)

# Get spark session
spark = glueContext.spark_session

# Create Glue Job
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
It looks like there is a newer version of the kernel available. The latest version is 0.31 and you have 0.30 installed.
Please run `pip install --upgrade aws-glue-sessions` to upgrade your kernel
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::669508176277:role/sanhe-all-service-admin-access
Attempting to use existing AssumeRole session credentials.
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 2d84f319-3d50-4dac-8b51-932c69032013
Applying the following default arguments:
--glue_kernel_version 0.30
--enable-glue-datacatalog true
Waiting for session 2d84f319-3d50-4dac-8b51-932c6903201

# Read the Test Data (without label)

In [1]:
# gdf = Glue Dynamic Frame
gdf_test = glueContext.create_dynamic_frame.from_options(
    connection_type="s3", 
    connection_options=dict(
        paths=[
            "s3://aws-data-lab-sanhe-for-everything-us-east-2/poc/2022-05-18-glue-find-matches/find-incr-matches/04-initial/1.csv"
        ],
        recurse=True,
    ),
    format="csv",
    format_options=dict(
        withHeader=True,
    ),
    transformation_ctx="datasource",
)




In [2]:
# preview the schema
gdf_test.printSchema()

root
|-- id: string
|-- firstname: string
|-- lastname: string
|-- phone: string


In [3]:
# preview the data
gdf_test.toDF().show(3, truncate=False, vertical=True)

-RECORD 0-------------------
 id        | PersonId-40001 
 firstname | Sara           
 lastname  | Whste          
 phone     | 468-400-1568   
-RECORD 1-------------------
 id        | PersonId-40003 
 firstname | Sara           
 lastname  | Whife          
 phone     | 468-400-1568   
-RECORD 2-------------------
 id        | PersonId-40004 
 firstname | Sarr           
 lastname  | Whste          
 phone     | 468-400-1568   
only showing top 3 rows


# Execute the ML Transformation for Initial Match

In [5]:
# run the initial match
gdf_predict = FindMatches.apply(
    frame=gdf_test, 
    transformId="tfm-2aa2fe67f9cb5cb4b06818cdfdd25f8f78ae2ed5",
    transformation_ctx="find_matches_1",
    computeMatchConfidenceScores=True,
)




In [6]:
# preview the schema
gdf_predict.printSchema()

root
|-- id: string
|-- firstname: string
|-- lastname: string
|-- phone: string
|-- match_id: long
|-- match_confidence_score: double
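
`FindMatches` adds a `match_id` column, and records that the transform treats as the same entity share a value in it. A minimal sketch of collecting record ids by `match_id`, in plain Python rather than against the DynamicFrame API. The four rows are copied from the matched-results preview shown later in this notebook, with `match_id` written as an integer to follow the `long` type in the schema above; `records` and `groups` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Rows taken from the matched-results preview; only the columns needed here.
records = [
    {"id": "PersonId-40041", "match_id": 29},
    {"id": "PersonId-40050", "match_id": 29},
    {"id": "PersonId-40055", "match_id": 29},
    {"id": "PersonId-40042", "match_id": 29},
]

# Collect the ids that share each match_id.
groups = defaultdict(list)
for record in records:
    groups[record["match_id"]].append(record["id"])

for match_id, ids in groups.items():
    print(match_id, ids)
```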


In [1]:
# preview the data
gdf_predict.toDF().show(10, truncate=False, vertical=True)

Exception encountered while retrieving session: An error occurred (ExpiredTokenException) when calling the GetSession operation: The security token included in the request is expired 
Traceback (most recent call last):
  File "/home/jupyter-user/.local/lib/python3.7/site-packages/aws_glue_interactive_sessions_kernel/glue_pyspark/GlueKernel.py", line 688, in get_current_session
    current_session = self.glue_client.get_session(Id=self.get_session_id())["Session"]
  File "/home/jupyter-user/.local/lib/python3.7/site-packages/botocore/client.py", line 415, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/jupyter-user/.local/lib/python3.7/site-packages/botocore/client.py", line 745, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ExpiredTokenException) when calling the GetSession operation: The security token included in the request is expired
Failed to retrieve session status 
Ex

In [8]:
# dump prediction result to s3
datasink = glueContext.write_dynamic_frame.from_options(
    frame=gdf_predict,
    connection_type="s3", 
    connection_options=dict(
        path="s3://aws-data-lab-sanhe-for-everything-us-east-2/poc/2022-05-18-glue-find-matches/find-incr-matches/06-match-results/",
    ),
    format="csv",
    format_options=dict(
        withHeader=True,
    ),
    transformation_ctx="datasink",
)





# Execute the Transformation for Incremental Match

In [4]:
gdf_initial_match_results = glueContext.create_dynamic_frame.from_catalog(
    name_space="learn_glue_find_incr_matches", 
    table_name="matched_results",
    transformation_ctx="datasource_initial_match_results",
)
gdf_initial_match_results = DropFields.apply(
    frame=gdf_initial_match_results,
    paths=["match_confidence_score",],
    transformation_ctx="datasource_initial_match_results_drop_confidence_score",
)




In [5]:
# preview the schema
gdf_initial_match_results.printSchema()

# preview the data
gdf_initial_match_results.toDF().show(10, truncate=False, vertical=True)

root
|-- id: string
|-- firstname: string
|-- lastname: string
|-- phone: string
|-- match_id: string

-RECORD 0-------------------
 id        | id             
 firstname | firstname      
 lastname  | lastname       
 phone     | phone          
 match_id  | match_id       
-RECORD 1-------------------
 id        | PersonId-40041 
 firstname | Ian            
 lastname  | Gfewn          
 phone     | 793-728-9836   
 match_id  | 29             
-RECORD 2-------------------
 id        | PersonId-40050 
 firstname | Ian            
 lastname  | Ggeen          
 phone     | 793-728-9836   
 match_id  | 29             
-RECORD 3-------------------
 id        | PersonId-40055 
 firstname | Ias            
 lastname  | Ggeen          
 phone     | 793-728-9836   
 match_id  | 29             
-RECORD 4-------------------
 id        | PersonId-40042 
 firstname | Ias            
 lastname  | Gfewn          
 phone     | 793-728-9236   
 match_id  | 29             
-RECORD 5------------------
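
In the preview above, RECORD 0 repeats the column names as values, which suggests the CSV header line was read as a data row. A minimal sketch of detecting and dropping such rows, written in plain Python over dictionaries copied from the preview rather than against the Glue DynamicFrame API; `rows` and `is_header_row` are hypothetical names introduced for illustration.

```python
# First three records from the preview above.
rows = [
    {"id": "id", "firstname": "firstname", "lastname": "lastname",
     "phone": "phone", "match_id": "match_id"},
    {"id": "PersonId-40041", "firstname": "Ian", "lastname": "Gfewn",
     "phone": "793-728-9836", "match_id": "29"},
    {"id": "PersonId-40050", "firstname": "Ian", "lastname": "Ggeen",
     "phone": "793-728-9836", "match_id": "29"},
]


def is_header_row(row: dict) -> bool:
    """True when every value equals the name of its own column."""
    return all(value == column for column, value in row.items())


data_rows = [row for row in rows if not is_header_row(row)]
print([row["id"] for row in data_rows])
```

In the notebook the same predicate could be applied as a filter on the frame. Whether the catalog table can instead be configured to skip the header line depends on the table definition, which is not shown here.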

In [6]:
gdf_incremental = glueContext.create_dynamic_frame.from_catalog(
    name_space="learn_glue_find_incr_matches", 
    table_name="incremental_test",
    transformation_ctx="datasource_incremental_test",
)




In [7]:
# preview the schema
gdf_incremental.printSchema()

# preview the data
gdf_incremental.toDF().show(10, truncate=False, vertical=True)

root
|-- id: string
|-- firstname: string
|-- lastname: string
|-- phone: string

-RECORD 0-------------------
 id        | id             
 firstname | firstname      
 lastname  | lastname       
 phone     | phone          
-RECORD 1-------------------
 id        | PersonId-40002 
 firstname | Sarr           
 lastname  | Whife          
 phone     | 468-400-1568   
-RECORD 2-------------------
 id        | PersonId-40006 
 firstname | Syra           
 lastname  | White          
 phone     | 468-400-1568   
-RECORD 3-------------------
 id        | PersonId-40008 
 firstname | Sara           
 lastname  | Whste          
 phone     | 468-400-1568   
-RECORD 4-------------------
 id        | PersonId-40011 
 firstname | Sara           
 lastname  | White          
 phone     | 468-400-1568   
-RECORD 5-------------------
 id        | PersonId-40014 
 firstname | Syra           
 lastname  | White          
 phone     | 468-400-1568   
-RECORD 6-------------------
 id        | Person

In [16]:
# run the incremental match
gdf_incremental_predict = FindIncrementalMatches.apply(
    existingFrame=gdf_initial_match_results, 
    incrementalFrame=gdf_incremental,
    transformId="tfm-2aa2fe67f9cb5cb4b06818cdfdd25f8f78ae2ed5",
    transformation_ctx="find_incr_matches_1",
    computeMatchConfidenceScores=True,
)

IllegalArgumentException: 'requirement failed: The existing and incremental records have duplicate value in column: id'
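
The error above says that at least one `id` value occurs in both frames. One candidate, visible in the earlier previews, is the row whose `id` is the literal string `id`: it appears as RECORD 0 in both the existing and the incremental frame. The sketch below checks for shared ids with a set intersection in plain Python. The two lists contain only the ids visible in the previews, so the real frames may share further values; `shared_ids` is a hypothetical helper.

```python
# ids visible in the preview of the existing (initial match results) frame
existing_ids = [
    "id", "PersonId-40041", "PersonId-40050", "PersonId-40055", "PersonId-40042",
]
# ids visible in the preview of the incremental frame
incremental_ids = [
    "id", "PersonId-40002", "PersonId-40006", "PersonId-40008",
    "PersonId-40011", "PersonId-40014",
]


def shared_ids(existing, incremental):
    """Return the sorted id values that occur in both collections."""
    return sorted(set(existing) & set(incremental))


duplicates = shared_ids(existing_ids, incremental_ids)
print(duplicates)  # ['id']
```

Removing the shared rows from the inputs before calling `FindIncrementalMatches.apply` would be one way to satisfy the requirement, assuming the header row is the only overlap.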


In [17]:
# preview the schema
gdf_incremental_predict.printSchema()

# preview the data
gdf_incremental_predict.toDF().show(3, truncate=False, vertical=True)

NameError: name 'gdf_incremental_predict' is not defined


In [None]:
# dump prediction result to s3
datasink = glueContext.write_dynamic_frame.from_options(
    frame=gdf_incremental_predict,
    connection_type="s3", 
    connection_options=dict(
        path="s3://aws-data-lab-sanhe-for-everything-us-east-2/poc/2022-05-18-glue-find-matches/find-incr-matches/07-incr-match-results/",
    ),
    format="csv",
    format_options=dict(
        withHeader=True,
    ),
    transformation_ctx="datasink",
)
