# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell to see available notebook commands ("magics").


In [2]:
%help

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.38.1 



# Available Magic Commands

## Sessions Magic

----
    %help                             Return a list of descriptions and input types for all magic commands. 
    %profile            String        Specify a profile in your aws configuration to use as the credentials provider.
    %region             String        Specify the AWS region in which to initialize a session. 
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %idle_timeout       Int           The number of minutes of inactivity after which a session will timeout. 
                                      Default: 2880 minutes (48 hours).
    %session_id_prefix  String        Define a String that will precede all session IDs in the format 
                                      [session_id_prefix]-[session_id]. If a session ID is not provided,
                                      a random UUID will be generated.
    %status                           Returns the status of the current Glue session including its duration, 
                                      configuration and executing user / role.
    %session_id                       Returns the session ID for the running session. 
    %list_sessions                    Lists all currently running sessions by ID.
    %stop_session                     Stops the current session.
    %glue_version       String        The version of Glue to be used by this session. 
                                      Currently, the only valid options are 2.0, 3.0 and 4.0. 
                                      Default: 2.0.
----

## Selecting Job Types

----
    %streaming          String        Sets the session type to Glue Streaming.
    %etl                String        Sets the session type to Glue ETL.
    %glue_ray           String        Sets the session type to Glue Ray.
----

## Glue Config Magic 
*(common across all job types)*

----

    %%configure         Dictionary    A json-formatted dictionary consisting of all configuration parameters for 
                                      a session. Each parameter can be specified here or through individual magics.
    %iam_role           String        Specify an IAM role ARN to execute your session with.
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %number_of_workers  Int           The number of workers of a defined worker_type that are allocated 
                                      when a session runs.
                                      Default: 5.
    %additional_python_modules  List  Comma separated list of additional Python modules to include in your cluster 
                                      (can be from Pypi or S3).
    %%tags              Dictionary    Specify a json-formatted dictionary consisting of tags to use in the session.
----

                                      
## Magic for Spark Jobs (ETL & Streaming)

----
    %worker_type        String        Set the type of instances the session will use as workers. 
                                      ETL and Streaming support G.1X, G.2X, G.4X and G.8X. 
                                      Default: G.1X.
    %connections        List          Specify a comma separated list of connections to use in the session.
    %extra_py_files     List          Comma separated list of additional Python files From S3.
    %extra_jars         List          Comma separated list of additional Jars to include in the cluster.
    %spark_conf         String        Specify custom spark configurations for your session. 
                                      E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
----
                                      
## Magic for Ray Job

----
    %min_workers        Int           The minimum number of workers that are allocated to a Ray job. 
                                      Default: 1.
    %object_memory_head Int           The percentage of free memory on the instance head node after a warm start. 
                                      Minimum: 0. Maximum: 100.
    %object_memory_worker Int         The percentage of free memory on the instance worker nodes after a warm start. 
                                      Minimum: 0. Maximum: 100.
----

## Action Magic

----

    %%sql               String        Run SQL code. All lines after the initial %%sql magic will be passed
                                      as part of the SQL code.  
----



####  Run this cell to set up and start your interactive session.


In [5]:
%idle_timeout 2880
%glue_version 3.0
%worker_type G.1X
%number_of_workers 5

Current idle_timeout is 2800 minutes.
idle_timeout has been set to 2880 minutes.
Setting Glue version to: 3.0
Previous worker type: G.1X
Setting new worker type to: G.1X
Previous number of workers: 5
Setting new number of workers to: 5
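The help text above says that every parameter can be set either through its own magic or through a single `%%configure` dictionary. The sketch below builds such a dictionary from the four settings used in the setup cell and prints the text of the equivalent cell. The key names are an assumption (the magic names without the leading `%`); they are not confirmed by the `%help` output, so check the Interactive Sessions documentation before relying on them. `render_configure_cell` is a hypothetical helper, not part of the Glue kernel.

```python
import json

# Settings copied from the setup cell above.
# ASSUMPTION: %%configure keys mirror the magic names without the leading "%".
session_config = {
    "idle_timeout": 2880,
    "glue_version": "3.0",
    "worker_type": "G.1X",
    "number_of_workers": 5,
}


def render_configure_cell(config):
    """Return the text of a %%configure cell for the given settings (hypothetical helper)."""
    return "%%configure\n" + json.dumps(config, indent=4)


cell_text = render_configure_cell(session_config)
print(cell_text)
```

Keeping the settings in one dictionary makes them easier to review and reuse across notebooks than four separate magic lines.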


In [1]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::862327261051:role/DE1_1_Glue_Role
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 9e95af5a-3319-4372-900b-cd5ec02833c0
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.38.1
--enable-glue-datacatalog true
Waiting for session 9e95af5a-3319-4372-900b-cd5ec02833c0 to get into ready status...
Session 9e95af5a-3319-4372-900b-cd5ec02833c0 has been created.



#### Example: Create a DynamicFrame from a table in the AWS Glue Data Catalog and display its schema


In [2]:
df = glueContext.create_dynamic_frame.from_catalog(
    database="1-1-test-kms",
    table_name="part_00000_d68b9568_0f0f_413b_aadf_f1a1edf7367f_c000_snappy_parquet",
    transformation_ctx="S3bucket_node1",
)




In [3]:
df.show(2)

{"job_id": "175835", "platform": "wanted", "category": "프론트엔드 개발자", "major_category": "WEB", "middle_category": "프론트엔드 개발자", "sub_category": "프론트엔드 개발자", "company": "뉴스젤리(Newsjelly)", "title": "웹 프론트엔드 개발자", "preferred": "- D3/Highcharts 등의 시각화 라이브러리를 사용해보신 분   - CSV/JSON 등의 다양한 데이터 형식에 익숙하신분   - 반응형 웹 개발을 경험하신 분   - Vue/React 등의 SPA 프레임워크 개발을 경험하신 분   - Vite/Webpack 등의 번들러를 사용해보신 분   - Typescript에 익숙하신 분   - 컨테이너 기반 운영 환경에 익숙하신 분   - 백엔드 개발에도 관심이 있으신 분   - 최신 기술 동향을 파악에 관심이 있으신 분  [이런 분이면 더 좋아요!]   • 개인의 전문성을 바탕으로 자기 주도적으로 일하는 분  • 동료와 유연한 협업이 가능한 분  • 새로운 도전을 두려워하지 않는 분  • ‘할 수 있다, 가보자고’ 긍정 마인드를 소유한 분  • 맡은 일은 끝까지 책임질 줄 아는 분   [ 제출서류 ]     - 이력서 (별도의 지원 양식은 없지만 희망 연봉을 기재해 주세요)    - 포트폴리오    - 포트폴리오 자료 파일 또는 URL    - Github/Gitlab 등의 개인 프로젝트 저장소(선택)    - 개인 기술블로그(선택)  [ 채용절차 ]   서류전형  -＞ 기술 인터뷰 -＞ 임원 면접 -＞ 처우협의 -＞ 최종합격 (* 면접상황에 따라 일부 변경 될 수 있습니다. )   [ 주요 기술 스택 ]    - 언어: Typescript/Javascript, Python   - 프론트엔드: Vue, React   - 백엔드: FastAPI, Django   - 데이터베이스: PostgreSQL, MariaDB   - 네

#### Example: Convert the DynamicFrame to a Spark DataFrame and display a sample of the data


In [4]:
converted_df = df.toDF()
converted_df.show(5)

+------+--------+-----------------+--------------+-----------------+-----------------+-------------------+-----------------------------+------------------------------+---------------------------+-------------------------------+--------------------+----------+--------------------+--------------------------------+-------------------------------+----+---------------------------------+--------------------+
|job_id|platform|         category|major_category|  middle_category|     sub_category|            company|                        title|                     preferred|                   required|         primary_responsibility|                 url|    end_at|              skills|                        location|                        welfare|body|              company_description|          coordinate|
+------+--------+-----------------+--------------+-----------------+-----------------+-------------------+-----------------------------+------------------------------+---------------------

In [5]:
converted_df.printSchema()

root
 |-- job_id: string (nullable = true)
 |-- platform: string (nullable = true)
 |-- category: string (nullable = true)
 |-- major_category: string (nullable = true)
 |-- middle_category: string (nullable = true)
 |-- sub_category: string (nullable = true)
 |-- company: string (nullable = true)
 |-- title: string (nullable = true)
 |-- preferred: string (nullable = true)
 |-- required: string (nullable = true)
 |-- primary_responsibility: string (nullable = true)
 |-- url: string (nullable = true)
 |-- end_at: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- location: string (nullable = true)
 |-- welfare: string (nullable = true)
 |-- body: string (nullable = true)
 |-- company_description: string (nullable = true)
 |-- coordinate: array (nullable = true)
 |    |-- element: double (containsNull = true)


In [6]:
converted_df.createOrReplaceTempView("jd_analysis")




In [7]:
spark.sql("""SELECT major_category, COUNT(DISTINCT(job_id))
FROM jd_analysis
GROUP BY major_category
""").show()

+--------------+----------------------+
|major_category|count(DISTINCT job_id)|
+--------------+----------------------+
|          GAME|                    32|
|        MOBILE|                    93|
|       SUPPORT|                    89|
|           ETC|                    63|
|           WEB|                   305|
|     DEVSECOPS|                   100|
|     SW/HW/IOT|                   143|
|          DATA|                   158|
+--------------+----------------------+
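To illustrate what `COUNT(DISTINCT job_id)` does in the query above, here is a minimal sketch that runs the same kind of aggregation on a tiny in-memory table. It uses Python's built-in `sqlite3` instead of Spark SQL so that it runs anywhere, and the rows are made-up sample data, not values from the catalog table. The `AS job_count` alias is also an addition; it gives the result a simpler column name than the auto-generated `count(DISTINCT job_id)`.

```python
import sqlite3

# Hypothetical sample rows: (job_id, major_category).
# Job "1001" appears twice, as it would if a posting were loaded more than once.
rows = [
    ("1001", "WEB"),
    ("1001", "WEB"),
    ("1002", "WEB"),
    ("2001", "DATA"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jd_analysis (job_id TEXT, major_category TEXT)")
conn.executemany("INSERT INTO jd_analysis VALUES (?, ?)", rows)

# COUNT(*) counts rows; COUNT(DISTINCT job_id) counts each posting once.
result = conn.execute(
    """
    SELECT major_category,
           COUNT(*) AS row_count,
           COUNT(DISTINCT job_id) AS job_count
    FROM jd_analysis
    GROUP BY major_category
    ORDER BY major_category
    """
).fetchall()
conn.close()

for major_category, row_count, job_count in result:
    print(major_category, row_count, job_count)
```

In this toy table the WEB group has three rows but only two distinct postings, which is the difference the `DISTINCT` keyword is there to capture.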


In [8]:
spark.sql("""SELECT major_category, middle_category, COUNT(DISTINCT(job_id))
FROM jd_analysis
GROUP BY major_category, middle_category
""").show()

+--------------+---------------------------+----------------------+
|major_category|            middle_category|count(DISTINCT job_id)|
+--------------+---------------------------+----------------------+
|       SUPPORT|                         PM|                    39|
|          DATA|              데이터 분석가|                    34|
|          DATA|        데이터 사이언티스트|                    33|
|       SUPPORT|                   기술지원|                    32|
|        MOBILE|          안드로이드 개발자|                    44|
|     DEVSECOPS|   데브옵스/인프라 엔지니어|                    79|
|           ETC|          블록체인 엔지니어|                    23|
|        MOBILE|                 iOS 개발자|                    47|
|           WEB|                웹 퍼블리셔|                    20|
|        MOBILE|크로스 플랫폼 모바일 개발자|                    33|
|     DEVSECOPS|            정보보안 담당자|                    28|
|     SW/HW/IOT|         HW/임베디드 개발자|                    41|
|          DATA|                        DBA|                

In [9]:
temp_df = spark.sql("""SELECT major_category, middle_category, COUNT(DISTINCT(job_id))
FROM jd_analysis
GROUP BY major_category, middle_category
""")




In [10]:
temp_df.toPandas()

   major_category  middle_category  count(DISTINCT job_id)
0         SUPPORT               PM                      39
1            DATA       데이터 사이언티스트                      33
2            DATA          데이터 분석가                      34
3          MOBILE        안드로이드 개발자                      44
4         SUPPORT             기술지원                      32
5       DEVSECOPS    데브옵스/인프라 엔지니어                      79
6             WEB           웹 퍼블리셔                      20
7          MOBILE          iOS 개발자                      47
8             ETC        블록체인 엔지니어                      23
9          MOBILE  크로스 플랫폼 모바일 개발자                      33
10      SW/HW/IOT      HW/임베디드 개발자                      41
11      DEVSECOPS         정보보안 담당자                      28
12            ETC               기타                      40
13           DATA              DBA                      28
14        SUPPORT          QA 엔지니어                      24
15            WEB       서버/백엔드 개발자                     2
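The two result sets do not add up the way one might expect: the MOBILE group has 93 distinct job IDs, yet its three middle categories shown above sum to 44 + 47 + 33 = 124. One plausible explanation (an assumption, not verified against the data) is that some postings are listed under more than one middle category, so they are counted once per middle category but only once per major category. The sketch below reproduces that effect with hypothetical job IDs and category labels.

```python
# Hypothetical postings within one major category:
# job_id -> set of middle categories the posting is listed under.
postings = {
    "j1": {"Android"},
    "j2": {"iOS"},
    "j3": {"Android", "iOS"},  # one posting listed under two middle categories
}

# Distinct job IDs per middle category (the GROUP BY major, middle result).
per_middle = {}
for job_id, middle_categories in postings.items():
    for middle_category in middle_categories:
        per_middle.setdefault(middle_category, set()).add(job_id)

sum_of_middle_counts = sum(len(ids) for ids in per_middle.values())

# Distinct job IDs for the whole major category (the GROUP BY major result).
major_count = len(postings)

print(sum_of_middle_counts, major_count)
```

Distinct counts are therefore not additive across a finer grouping, and the per-middle-category figures should not be summed to obtain a per-major-category total.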

### Saving the data

In [11]:
single_partition_df = temp_df.coalesce(1)




In [12]:
# Convert back to DynamicFrame
group_dynamic_frame = DynamicFrame.fromDF(single_partition_df, glueContext, "jd_analysis_sample")





In [13]:
# Store the clean data back to S3
jd_analysis_group_destn = glueContext.write_dynamic_frame.from_options(
    frame=group_dynamic_frame,
    connection_type="s3",
    format="csv",
    connection_options={
        "path": "s3://de-1-1/hajun/glue_spark_sample/",
        "partitionKeys": [],
    },
    transformation_ctx="jd_analysis_cate_group",
)
# Run with 5 workers -> the output gets split up
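The `coalesce(1)` call before the write is what keeps the result in a single file: Spark generally writes one output file per partition, so a frame spread over several partitions ends up as several part files. The sketch below is a plain-Python simulation of that behavior, not Spark code. `write_partitions` and the `part-0000N.csv` file names are hypothetical, and the rows are taken from the major-category output shown earlier.

```python
import csv
import tempfile
from pathlib import Path


def write_partitions(rows, num_partitions, out_dir):
    """Distribute rows round-robin over num_partitions and write one CSV file per partition."""
    out_dir = Path(out_dir)
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    paths = []
    for index, partition in enumerate(partitions):
        path = out_dir / f"part-{index:05d}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(partition)
        paths.append(path)
    return paths


rows = [("WEB", 305), ("DATA", 158), ("SW/HW/IOT", 143), ("DEVSECOPS", 100), ("MOBILE", 93)]

with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp, "split")
    single_dir = Path(tmp, "single")
    split_dir.mkdir()
    single_dir.mkdir()

    # Five partitions -> five part files, each holding a slice of the rows.
    split_count = len(write_partitions(rows, 5, split_dir))

    # One partition (the effect of coalesce(1)) -> a single file with every row.
    single_files = write_partitions(rows, 1, single_dir)
    single_count = len(single_files)
    single_line_count = len(single_files[0].read_text(encoding="utf-8").splitlines())

print(split_count, single_count, single_line_count)
```

Collapsing to one partition is convenient for a small aggregate like this one, but it funnels the whole write through a single task, so it is less suitable for large outputs.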




In [14]:
spark.stop()




#### Example: Visualize data with matplotlib


In [None]:
import matplotlib.pyplot as plt

# Set X-axis and Y-axis values
x = [5, 2, 8, 4, 9]
y = [10, 4, 8, 5, 2]
  
# Create a bar chart 
plt.bar(x, y)
  
# Show the plot
%matplot plt

#### Example: Write the data in the DynamicFrame to a location in Amazon S3 and a table for it in the AWS Glue Data Catalog


In [None]:
s3output = glueContext.getSink(
  path="s3://de-1-1/hajun/glue_spark_sample/",
  connection_type="s3",
  updateBehavior="UPDATE_IN_DATABASE",
  partitionKeys=[],
  compression="snappy",
  enableUpdateCatalog=True,
  transformation_ctx="s3output",
)
s3output.setCatalogInfo(
  catalogDatabase="de1_1_demo", catalogTableName="jd_analysis_cate_group"
)
s3output.setFormat("glueparquet")
s3output.writeFrame(group_dynamic_frame)