#### Notebook example: running an EMR Serverless job from SageMaker

us-east-1 applications: <br>
* pd-autoencoder-ad-v1 : **00f64bef5869kl09**
* pd-autoencoder-ad-v2 : **00f66ohicnjchu09**
* pd-test-s3-writes : **00f66mmuts7enm09**
<br>

us-west-2 applications: <br>
* pd-autoencoder-ad-container-v1  : **00f672mqiak1fp0l**


Note: when launching a job, check which region you are running it from.
Jobs for a us-east-1 application can only be launched from us-east-1.
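
To make the region rule above concrete, here is a minimal sketch that looks up the region of an application before a job is launched. The application IDs and regions are copied from the lists above; the helper name `check_region` is hypothetical and is not part of `emr_serverless.py`.

```python
# Application IDs and their regions, copied from the lists above.
APPLICATION_REGIONS = {
    "00f64bef5869kl09": "us-east-1",  # pd-autoencoder-ad-v1
    "00f66ohicnjchu09": "us-east-1",  # pd-autoencoder-ad-v2
    "00f66mmuts7enm09": "us-east-1",  # pd-test-s3-writes
    "00f672mqiak1fp0l": "us-west-2",  # pd-autoencoder-ad-container-v1
}


def check_region(application_id, current_region):
    """Return the application's region, or raise if it differs from current_region.

    Hypothetical helper for illustration only.
    """
    expected = APPLICATION_REGIONS.get(application_id)
    if expected is None:
        raise KeyError(f"Unknown application id: {application_id}")
    if expected != current_region:
        raise ValueError(
            f"Application {application_id} lives in {expected}, "
            f"but the job is being launched from {current_region}"
        )
    return expected


# Example: pd-test-s3-writes must be launched from us-east-1.
print(check_region("00f66mmuts7enm09", "us-east-1"))
```

The current region would have to come from your own session or environment configuration; the sketch deliberately takes it as a plain argument.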

#### **Usage Scenario 1: From CLI**

Run the following command <br>

**python emr_serverless.py --job-role-arn <<job_role_arn>> --applicationId <<applicationID>> --s3-bucket <<s3_bucket_name>> --entry-point <<emr_entry_point>> --zipped-env <<zipped_env_path>> --custom-spark-config <<custom_spark_config>>**


optional arguments
- **--job-role-arn**    : default value = 'arn:aws:iam::064047601590:role/hamza-emr-serverless-role'
- **--custom-spark-config**   : default value = default

Without optional arguments : <br>
**python emr_serverless.py --applicationId <<applicationID>> --s3-bucket <<s3_bucket_name>> --entry-point <<emr_entry_point>> --zipped-env <<zipped_env_path>>**
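
As a rough illustration of the command-line interface described above, the flags could be declared with `argparse` as follows. This is a sketch under assumptions, not the actual contents of `emr_serverless.py`: the function name `build_parser` is hypothetical, and the string `"default"` stands in for whatever the script really uses as its default Spark configuration.

```python
import argparse

# Default role ARN as listed under "optional arguments" above.
DEFAULT_JOB_ROLE_ARN = "arn:aws:iam::064047601590:role/hamza-emr-serverless-role"


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(description="Submit a job to EMR Serverless")
    # Required arguments
    parser.add_argument("--applicationId", required=True)
    parser.add_argument("--s3-bucket", required=True)
    parser.add_argument("--entry-point", required=True)
    parser.add_argument("--zipped-env", required=True)
    # Optional arguments
    parser.add_argument("--job-role-arn", default=DEFAULT_JOB_ROLE_ARN)
    # "default" is a placeholder; the real default value is an assumption.
    parser.add_argument("--custom-spark-config", default="default")
    return parser


# Parse the "only required arguments" form of the command.
args = build_parser().parse_args([
    "--applicationId", "00f66mmuts7enm09",
    "--s3-bucket", "emr-serverless-output-pd",
    "--entry-point", "s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/s3_test_emr.py",
    "--zipped-env", "s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/pyspark_deps_all_rec_types_v2.tar.gz",
])
print(args.applicationId, args.job_role_arn)
```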

##### **Run examples** <br>

**1. With only required arguments** <br>

python emr_serverless.py --applicationId 00f66mmuts7enm09 --s3-bucket emr-serverless-output-pd --entry-point s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/s3_test_emr.py --zipped-env s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/pyspark_deps_all_rec_types_v2.tar.gz
<br>

**2. With required arguments and job_role_arn** <br>
python emr_serverless.py --job-role-arn arn:aws:iam::064047601590:role/hamza-emr-serverless-role --applicationId 00f66mmuts7enm09 --s3-bucket emr-serverless-output-pd --entry-point s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/s3_test_emr.py --zipped-env s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/pyspark_deps_all_rec_types_v2.tar.gz
<br>

**3. With required arguments and custom_spark_config** <br>
python emr_serverless.py --applicationId 00f66mmuts7enm09 --s3-bucket emr-serverless-output-pd --entry-point s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/s3_test_emr.py --zipped-env s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/pyspark_deps_all_rec_types_v2.tar.gz --custom-spark-config "--conf spark.driver.maxResultSize=2g --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=15g --conf spark.memory.offHeap.size=2g"
<br>

**4. With all arguments** <br>
python emr_serverless.py --job-role-arn arn:aws:iam::064047601590:role/hamza-emr-serverless-role --applicationId 00f66mmuts7enm09 --s3-bucket emr-serverless-output-pd --entry-point s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/s3_test_emr.py --zipped-env s3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/pyspark_deps_all_rec_types_v2.tar.gz --custom-spark-config "--conf spark.driver.maxResultSize=2g --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=15g --conf spark.memory.offHeap.size=2g"
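
The `--custom-spark-config` value in examples 3 and 4 is a single quoted string of `--conf key=value` pairs. The sketch below splits such a string into a dictionary, which can help when checking a configuration before submitting it. The helper name `parse_spark_conf` is hypothetical; how `emr_serverless.py` handles the string internally is not shown in this notebook.

```python
import shlex


def parse_spark_conf(config_string):
    """Turn '--conf k1=v1 --conf k2=v2' into {'k1': 'v1', 'k2': 'v2'}.

    Hypothetical helper; raises ValueError on anything that is not a
    well-formed '--conf key=value' pair.
    """
    tokens = shlex.split(config_string)
    conf = {}
    i = 0
    while i < len(tokens):
        if tokens[i] != "--conf" or i + 1 >= len(tokens):
            raise ValueError(f"Expected '--conf key=value', got: {tokens[i]}")
        key, sep, value = tokens[i + 1].partition("=")
        if not sep or not key:
            raise ValueError(f"Malformed setting: {tokens[i + 1]}")
        conf[key] = value
        i += 2
    return conf


example = (
    "--conf spark.driver.maxResultSize=2g --conf spark.driver.memory=10g "
    "--conf spark.executor.cores=4 --conf spark.executor.memory=15g "
    "--conf spark.memory.offHeap.size=2g"
)
for key, value in parse_spark_conf(example).items():
    print(f"{key} = {value}")
```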


#### **Usage Scenario 2: From SageMaker Notebook**

In [None]:
pwd

In [None]:
!pip install /root/msspackages/dist/msspackages-0.0.7-py3-none-any.whl

In [None]:
pip install -r requirements.txt

In [None]:
#pip install tensorflow
#import tensorflow

In [2]:
from eks_ml_pipeline import EMRServerless

#### 2. a. When submitting a new job to an EMR Serverless application

In [3]:
# id of the existing application to submit jobs to
application_id = '00f66mmuts7enm09' 

# serverless_job_role_arn - only pass it if you want to use a custom one, else comment it out
#serverless_job_role_arn = "<<include_custom_role_arn>>"

# S3 bucket name where the dependencies, logs and code sit
s3_bucket_name = 'emr-serverless-output-pd'

# Entry point to EMR serverless
emr_entry_point = 's3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/s3_test_emr.py'

# Path to the custom Spark and Python environment to use, with all the dependencies installed
zipped_env_path = 's3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/pyspark_deps_all_rec_types_v2.tar.gz'

In [4]:
# Instantiate class
emr_serverless = EMRServerless()

In [5]:
## use this only if you want to create a new application
#application_id = emr_serverless.create_application("pd-autoencoder-test-3", "emr-6.9.0")

Starting EMR Serverless Spark App


In [6]:
print("Starting EMR Serverless Spark App")
# Start the application; skips this step automatically if the application is already in 'Started' state
emr_serverless.start_application(application_id)
print(emr_serverless)

EMR Serverless SPARK Application: 00f6gd0ibnjj2c09


The cell below shows how to submit and run a new job.

In [13]:
# Run (and wait for) a Spark job
print("Submitting new Spark job")
job_run_id = emr_serverless.run_spark_job(
    script_location=emr_entry_point,
    #job_role_arn=serverless_job_role_arn,
    application_id = application_id,
    arguments=[f"s3://{s3_bucket_name}/emr-serverless/output"],
    s3_bucket_name=s3_bucket_name,
    zipped_env_path = zipped_env_path
)

Submitting new Spark job
job id : 00f6gd11735l1b09


In [8]:
# Get the configuration and status of the job which we just submitted 
emr_serverless.get_job_run()

{'applicationId': '00f6gd0ibnjj2c09',
 'jobRunId': '00f6gd0kgg2mdj09',
 'arn': 'arn:aws:emr-serverless:us-east-1:064047601590:/applications/00f6gd0ibnjj2c09/jobruns/00f6gd0kgg2mdj09',
 'createdBy': 'arn:aws:sts::064047601590:assumed-role/AmazonSageMakerServiceCatalogProductsUseRole/SageMaker',
 'createdAt': datetime.datetime(2022, 12, 21, 17, 49, 19, 590000, tzinfo=tzlocal()),
 'updatedAt': datetime.datetime(2022, 12, 21, 17, 49, 20, 293000, tzinfo=tzlocal()),
 'executionRole': 'arn:aws:iam::064047601590:role/hamza-emr-serverless-role',
 'state': 'SCHEDULED',
 'stateDetails': 'The job has been scheduled and is acquiring resources to run.',
 'releaseLabel': 'emr-6.9.0',
 'configurationOverrides': {'monitoringConfiguration': {'s3MonitoringConfiguration': {'logUri': 's3://emr-serverless-output-pd/logs/'}}},
 'jobDriver': {'sparkSubmit': {'entryPoint': 's3://emr-serverless-output-pd/code/pyspark/pd-autoencoder-ad/s3_test_emr.py',
   'sparkSubmitParameters': '--conf spark.archives=s3://emr-

In [14]:
# Get the current status of the job
emr_serverless.get_job_run().get('state')

'SCHEDULED'
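
A state such as `'SCHEDULED'` is not final, so the status may need to be polled until the job finishes. Below is a minimal polling sketch. It assumes the terminal EMR Serverless job-run states are `SUCCESS`, `FAILED` and `CANCELLED`; the function name `wait_for_job` and its parameters are hypothetical and not part of `EMRServerless`.

```python
import time

# States after which an EMR Serverless job run no longer changes.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Call get_state() repeatedly until it returns a terminal state.

    get_state: zero-argument callable returning the current state string.
    sleep: injectable for testing; defaults to time.sleep.
    Raises TimeoutError if no terminal state is seen within timeout_seconds.
    """
    waited = 0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in state {state} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# In the notebook this could be used as (not executed here):
# final_state = wait_for_job(lambda: emr_serverless.get_job_run().get('state'))
```

Passing the state lookup in as a callable keeps the loop independent of the class, so it works equally well with an instance created for an existing job.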

In [15]:
# Cancel the job if needed; uncomment as required
#emr_serverless.cancel_job_run()
#emr_serverless.cancel_job_run(job_run_id) # pass in specific job_run_id which you want to cancel

In [11]:
# Verify the state post cancellation
#emr_serverless.get_job_run().get('state')

'CANCELLED'

In [12]:
# Fetch and print the logs
emr_serverless.fetch_driver_log(s3_bucket_name)

File output from stdout.gz:
----
  
----
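
For orientation, the sketch below shows where a driver log typically sits in S3 and how a `stdout.gz` payload can be decompressed. It assumes the standard EMR Serverless log layout under the configured `logUri` (`s3://emr-serverless-output-pd/logs/` in the job configuration shown earlier). Both helper names are hypothetical, and whether `fetch_driver_log` works this way internally is an assumption.

```python
import gzip


def driver_log_key(application_id, job_run_id, log_prefix="logs", stream="stdout"):
    """Build the S3 key of a Spark driver log (assumed standard layout)."""
    return (
        f"{log_prefix}/applications/{application_id}"
        f"/jobs/{job_run_id}/SPARK_DRIVER/{stream}.gz"
    )


def decode_log(compressed_bytes):
    """Decompress a gzip-compressed log payload into text."""
    return gzip.decompress(compressed_bytes).decode("utf-8")


# IDs taken from section 2. b. of this notebook.
print(driver_log_key("00f66mmuts7enm09", "00f6fkpig0rlip09"))
```

The bytes passed to `decode_log` would come from downloading that key from the bucket, which is left out here to keep the sketch free of AWS calls.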


In [16]:
## use below code only when the application needs to be stopped

#emr_serverless.stop_application(application_id)  # pass in specific application_id which you want to stop
#emr_serverless.stop_application() # if no application id is given, it automatically takes the current application which we started 

Successfully stopped app


In [17]:
## use below code only to delete your custom applications
## DO NOT use this to delete an existing application created by the admins 

#emr_serverless.delete_application(application_id)   # pass in specific application_id which you want to delete
#emr_serverless.delete_application() # if no application id is given, it automatically takes the current application which we started 

Successfully deleted app


#### 2. b. Retrieving info on existing jobs in an application

In [None]:
application_id = '00f66mmuts7enm09'
job_run_id = '00f6fkpig0rlip09'
s3_bucket_name = 'emr-serverless-output-pd'

In [None]:
# Instantiate class
emr_with_existing_job = EMRServerless(application_id, job_run_id)

In [None]:
# Get job status
emr_with_existing_job.get_job_run()

In [None]:
# Get logs
emr_with_existing_job.fetch_driver_log(s3_bucket_name)
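
The dictionary returned by `get_job_run()` is fairly large, so it can be convenient to reduce it to a few fields. The sketch below does that for the keys visible in the sample output earlier in this notebook (`jobRunId`, `state`, `stateDetails`, `createdAt`, `updatedAt`). The helper name `summarize_job_run` is hypothetical.

```python
import datetime


def summarize_job_run(job_run):
    """Pick the most useful fields out of a get_job_run() style dictionary."""
    created = job_run.get("createdAt")
    updated = job_run.get("updatedAt")
    # Time between creation and the last update, if both timestamps exist.
    elapsed = (updated - created).total_seconds() if created and updated else None
    return {
        "jobRunId": job_run.get("jobRunId"),
        "state": job_run.get("state"),
        "stateDetails": job_run.get("stateDetails"),
        "secondsBetweenCreateAndUpdate": elapsed,
    }


# Values copied from the sample output shown earlier (time zone info omitted).
sample = {
    "jobRunId": "00f6gd0kgg2mdj09",
    "state": "SCHEDULED",
    "stateDetails": "The job has been scheduled and is acquiring resources to run.",
    "createdAt": datetime.datetime(2022, 12, 21, 17, 49, 19, 590000),
    "updatedAt": datetime.datetime(2022, 12, 21, 17, 49, 20, 293000),
}
print(summarize_job_run(sample))
```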