
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma-separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma-separated list of pip packages, S3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma-separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma-separated list of additional JARs to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (e.g. %glue_version 2.0).                              |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %spark_conf                 |  String      |  Specify custom Spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer                       |
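
As an illustration of how the magics in the table combine, a configuration cell might look like the sketch below. It uses only magics listed above; the worker type and worker count match the values reported in this notebook's session log, while the Glue version is an illustrative choice and not something the log confirms. Such a cell is normally run before the first code cell, since that is what creates the session.

```
%glue_version 3.0
%worker_type G.1X
%number_of_workers 5
```

Running `%status` afterwards is one way to check which settings the session actually picked up.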

In [None]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.35 
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::601591369946:role/Kshitij_Glue_Service_and_S3
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: e4519cf0-9e11-4560-97eb-0b44ea2c650f
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
Waiting for session e4519cf0-9e11-4560-97eb-0b44ea2c650f to get into ready status...
Session e4519cf0-9e11-4560-97eb-0b44ea2c650f has been created




In [1]:
business_df = spark.read.json("s3://yelp-dataset-kshitij/input/yelp_academic_dataset_business.json")
# Keep only the columns needed downstream.
business_df = business_df.select("business_id", "categories", "city", "review_count", "stars", "state", "type")
business_df.show(20)

+--------------------+--------------------+----------+------------+-----+-----+--------+
|         business_id|          categories|      city|review_count|stars|state|    type|
+--------------------+--------------------+----------+------------+-----+-----+--------+
|vcNAWiLM4dR7D2nww...|[Doctors, Health ...|   Phoenix|           7|  3.5|   AZ|business|
|JwUE5GmEO-sH1FuwJ...|       [Restaurants]| De Forest|          26|  4.0|   WI|business|
|uGykseHzyS5xAMWoN...|[American (Tradit...| De Forest|          16|  4.0|   WI|business|
|LRKJF43s9-3jG9Lgx...|[Food, Ice Cream ...| De Forest|           7|  4.5|   WI|business|
|RgDg-k9S5YD_BaxMc...|[Chinese, Restaur...| De Forest|           3|  4.0|   WI|business|
|oLctHIA1AxmsgOuu4...|[Television Stati...|Mc Farland|          10|  1.5|   WI|business|
|ZW2WeP2Hp20tq0RG1...|[Home Services, H...|Mc Farland|           4|  2.0|   WI|business|
|95p9Xg358BezJyk1w...|[Libraries, Publi...|Mc Farland|           4|  2.5|   WI|business|
|rdAdANPNOcvUtoFgc...

In [4]:
business_df.write.mode("overwrite").parquet("s3://yelp-dataset-kshitij/output/yelp_academic_dataset_business")




In [9]:
business_df = spark.read.parquet("s3://yelp-dataset-kshitij/output/yelp_academic_dataset_business/*.parquet")




In [27]:
# Columns: business_id, categories, city, review_count, stars, state, type
# Number of partitions backing the DataFrame read from Parquet.
business_df.rdd.getNumPartitions()

9


In [28]:
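# repartition() does a full shuffle; the result is not assigned, so business_df itself is unchanged.
business_df.repartition(20).rdd.getNumPartitions()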

20


In [29]:
# coalesce() merges existing partitions without a full shuffle, so it can only lower the count.
business_df.coalesce(5).rdd.getNumPartitions()

5
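
The three counts above (9, 20, 5) follow a simple rule that can be sketched in plain Python. This is a toy model of the resulting partition counts only, not Spark code; both function names are hypothetical, and it assumes the documented behaviour that `coalesce` never increases the number of partitions.

```python
def partitions_after_repartition(current: int, target: int) -> int:
    """Model of repartition(n): a full shuffle into exactly n partitions."""
    if target < 1:
        raise ValueError("target must be a positive integer")
    return target


def partitions_after_coalesce(current: int, target: int) -> int:
    """Model of coalesce(n): merges partitions, so the count cannot grow."""
    if target < 1:
        raise ValueError("target must be a positive integer")
    return min(current, target)


current = 9  # partition count reported for business_df in this notebook
print(partitions_after_repartition(current, 20))  # 20
print(partitions_after_coalesce(current, 5))      # 5
print(partitions_after_coalesce(current, 20))     # 9 (coalesce cannot increase)
```

In practice this means `coalesce` is the cheaper choice when reducing partitions before a write, while `repartition` is needed to increase them or to rebalance skewed data.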


In [30]:
tip_df = spark.read.json("s3://yelp-dataset-kshitij/input/yelp_academic_dataset_tip.json")
tip_df.show(20)

+--------------------+----------+-----+--------------------+----+--------------------+
|         business_id|      date|likes|                text|type|             user_id|
+--------------------+----------+-----+--------------------+----+--------------------+
|JwUE5GmEO-sH1FuwJ...|2012-05-16|    0|Great food, huge ...| tip|Vefj29mjork1DLhAL...|
|JwUE5GmEO-sH1FuwJ...|2014-03-29|    0|Great bakery. Gre...| tip|Bbm6c5CHf5IJG5ju0...|
|JwUE5GmEO-sH1FuwJ...|2011-09-29|    0|The desserts are ...| tip|IORZRljfUkedhh1SG...|
|uGykseHzyS5xAMWoN...|2013-07-20|    0|There are too man...| tip|Bdmk6RQUP0sbXA_V9...|
|uGykseHzyS5xAMWoN...|2013-08-03|    0|Really good place...| tip|fjGh54rTqVn0ECKro...|
|LRKJF43s9-3jG9Lgx...|2012-04-14|    0|     get onion rings| tip|4Z4Bv3gEMbEmncLDD...|
|LRKJF43s9-3jG9Lgx...|2012-05-06|    0|              Amaze!| tip|AHN3LdMw5L0DIJ5an...|
|RgDg-k9S5YD_BaxMc...|2011-08-19|    0|They always give ...| tip|u5xcw6LCnnMhddoxk...|
|RgDg-k9S5YD_BaxMc...|2011-05-29|    0|    

In [31]:
tip_df.write.mode("overwrite").parquet("s3://yelp-dataset-kshitij/output/yelp_academic_dataset_tip")




In [32]:
user_df = spark.read.json("s3://yelp-dataset-kshitij/input/yelp_academic_dataset_user.json")
user_df.show(20)

+-------------+--------------------+-----+----+--------------------+--------+------------+----+--------------------+------------+-------------+
|average_stars|         compliments|elite|fans|             friends|    name|review_count|type|             user_id|       votes|yelping_since|
+-------------+--------------------+-----+----+--------------------+--------+------------+----+--------------------+------------+-------------+
|         3.83|        [,,,,,,,,,,]|   []|   0|                  []|     Lee|           6|user|qtrmBGNqCvupHMHL_...|   [0, 1, 5]|      2012-02|
|          5.0|        [,,,,,,,,,,]|   []|   0|[8Y2EN4XNNhnwssuP...| Matthew|           1|user|MWhR9LvOdRbqtu1I_...|   [0, 0, 0]|      2011-12|
|          5.0|[2,, 1,,, 2,, 1,,...|   []|   1|[8wK7_qZ18mokBxw5...| Jasmine|          22|user|0vscrHoajVRa1Yk19...| [11, 5, 20]|      2010-09|
|          1.0|        [,,,,,,,,,,]|   []|   0|                  []|  Harley|           1|user|5Xh4Qc3rxhAQ_NcNt...|   [0, 0, 1]|      2

In [37]:
# Drop the wide columns (compliments, elite, fans, friends, type) and keep the rest.
user_df = user_df.select("average_stars", "name", "review_count", "user_id", "yelping_since", "votes")
user_df.printSchema()

root
 |-- average_stars: double (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)
 |-- votes: struct (nullable = true)
 |    |-- cool: long (nullable = true)
 |    |-- funny: long (nullable = true)
 |    |-- useful: long (nullable = true)


In [39]:
user_df_new = user_df.select("*", user_df.votes.cool.alias("cool_votes"), user_df.votes.funny.alias("funny_votes"), user_df.votes.useful.alias("useful_votes"))
user_df_new.show()

+-------------+--------+------------+--------------------+-------------+------------+----------+-----------+------------+
|average_stars|    name|review_count|             user_id|yelping_since|       votes|cool_votes|funny_votes|useful_votes|
+-------------+--------+------------+--------------------+-------------+------------+----------+-----------+------------+
|         3.83|     Lee|           6|qtrmBGNqCvupHMHL_...|      2012-02|   [0, 1, 5]|         0|          1|           5|
|          5.0| Matthew|           1|MWhR9LvOdRbqtu1I_...|      2011-12|   [0, 0, 0]|         0|          0|           0|
|          5.0| Jasmine|          22|0vscrHoajVRa1Yk19...|      2010-09| [11, 5, 20]|        11|          5|          20|
|          1.0|  Harley|           1|5Xh4Qc3rxhAQ_NcNt...|      2012-01|   [0, 0, 1]|         0|          0|           1|
|         4.33|   Tyler|           6|4dJLZvpYRcjQ6qDR5...|      2011-08|   [1, 2, 6]|         1|          2|           6|
|         3.19|    Gary|

In [40]:
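# Drop the struct from the flattened frame; dropping it from user_df would discard the new *_votes columns.
user_df_new = user_df_new.drop("votes")
user_df_new.write.mode("overwrite").parquet("s3://yelp-dataset-kshitij/output/yelp_academic_dataset_user")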
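
The reshaping done in the two cells above, promoting the fields of the `votes` struct to top-level columns and then discarding the struct, can be shown on a single record in plain Python. This is a sketch for illustration, not Spark code; `flatten_votes` is a hypothetical helper, and the sample row is taken from the output shown earlier.

```python
def flatten_votes(record: dict) -> dict:
    """Return a copy of record with votes.<field> promoted to <field>_votes."""
    # Copy every top-level key except the nested struct itself.
    flat = {key: value for key, value in record.items() if key != "votes"}
    # Promote each nested field; a missing or null struct adds nothing.
    for field, value in (record.get("votes") or {}).items():
        flat[f"{field}_votes"] = value
    return flat


row = {"name": "Lee", "review_count": 6, "votes": {"cool": 0, "funny": 1, "useful": 5}}
print(flatten_votes(row))
# {'name': 'Lee', 'review_count': 6, 'cool_votes': 0, 'funny_votes': 1, 'useful_votes': 5}
```

Flat columns like these are generally easier to filter and aggregate in downstream SQL than a nested struct.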




In [41]:
review_df = spark.read.json("s3://yelp-dataset-kshitij/input/yelp_academic_dataset_review.json")




In [42]:
review_df.show()

+--------------------+----------+--------------------+-----+--------------------+------+--------------------+---------+
|         business_id|      date|           review_id|stars|                text|  type|             user_id|    votes|
+--------------------+----------+--------------------+-----+--------------------+------+--------------------+---------+
|vcNAWiLM4dR7D2nww...|2007-05-17|15SdjuK7DmYqUAj6r...|    5|dr. goldberg offe...|review|Xqd0DzHaiyRqVH3WR...|[1, 0, 2]|
|vcNAWiLM4dR7D2nww...|2010-03-22|RF6UnRTtG7tWMcrO2...|    2|Unfortunately, th...|review|H1kH6QZV7Le4zqTRN...|[0, 0, 2]|
|vcNAWiLM4dR7D2nww...|2012-02-14|-TsVN230RCkLYKBeL...|    4|Dr. Goldberg has ...|review|zvJCcrpm2yOZrxKff...|[1, 0, 1]|
|vcNAWiLM4dR7D2nww...|2012-03-02|dNocEAyUucjT371NN...|    4|Been going to Dr....|review|KBLW4wJA_fwoWmMhi...|[0, 0, 0]|
|vcNAWiLM4dR7D2nww...|2012-05-15|ebcN2aqmNUuYNoyvQ...|    4|Got a letter in t...|review|zvJCcrpm2yOZrxKff...|[1, 0, 2]|
|vcNAWiLM4dR7D2nww...|2013-04-19|_ePLBPr

In [43]:
review_df.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: long (nullable = true)
 |-- text: string (nullable = true)
 |-- type: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- votes: struct (nullable = true)
 |    |-- cool: long (nullable = true)
 |    |-- funny: long (nullable = true)
 |    |-- useful: long (nullable = true)


In [44]:
review_df = review_df.select("business_id","date","review_id","stars","user_id")
review_df.show()

+--------------------+----------+--------------------+-----+--------------------+
|         business_id|      date|           review_id|stars|             user_id|
+--------------------+----------+--------------------+-----+--------------------+
|vcNAWiLM4dR7D2nww...|2007-05-17|15SdjuK7DmYqUAj6r...|    5|Xqd0DzHaiyRqVH3WR...|
|vcNAWiLM4dR7D2nww...|2010-03-22|RF6UnRTtG7tWMcrO2...|    2|H1kH6QZV7Le4zqTRN...|
|vcNAWiLM4dR7D2nww...|2012-02-14|-TsVN230RCkLYKBeL...|    4|zvJCcrpm2yOZrxKff...|
|vcNAWiLM4dR7D2nww...|2012-03-02|dNocEAyUucjT371NN...|    4|KBLW4wJA_fwoWmMhi...|
|vcNAWiLM4dR7D2nww...|2012-05-15|ebcN2aqmNUuYNoyvQ...|    4|zvJCcrpm2yOZrxKff...|
|vcNAWiLM4dR7D2nww...|2013-04-19|_ePLBPrkrf4bhyiKW...|    1|Qrs3EICADUKNFoUq2...|
|vcNAWiLM4dR7D2nww...|2014-01-02|kMu0knsSUFW2DZXqK...|    5|jE5xVugujSaskAoh2...|
|vcNAWiLM4dR7D2nww...|2014-01-08|onDPFgNZpMk-bT1zl...|    5|QnhQ8G51XbUpVEyWY...|
|JwUE5GmEO-sH1FuwJ...|2008-07-07|I7Kte2FwXWPCwdm7i...|    4|zvNimI98mrmhgNOOr...|
|JwUE5GmEO-sH1Fu

In [45]:
review_df.write.mode("overwrite").parquet("s3://yelp-dataset-kshitij/output/yelp_academic_dataset_review")
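
All four datasets in this notebook go through the same steps: read JSON from the `input/` prefix, optionally keep a subset of columns, and write Parquet to the `output/` prefix in overwrite mode. A minimal sketch of a helper that captures this pattern is shown below. The names `dataset_paths` and `json_to_parquet` are hypothetical; the bucket and file naming are taken from the cells above, and `spark` is assumed to be the session object created in the first cell.

```python
BUCKET = "s3://yelp-dataset-kshitij"  # bucket used throughout this notebook


def dataset_paths(name, bucket=BUCKET):
    """Build the input JSON path and output Parquet path for one dataset."""
    stem = f"yelp_academic_dataset_{name}"
    return f"{bucket}/input/{stem}.json", f"{bucket}/output/{stem}"


def json_to_parquet(spark, name, columns=None):
    """Read one dataset as JSON, optionally select columns, write Parquet."""
    src, dst = dataset_paths(name)
    df = spark.read.json(src)
    if columns:
        df = df.select(*columns)
    # Overwrite mode replaces any previous output at the destination.
    df.write.mode("overwrite").parquet(dst)
    return df


print(dataset_paths("review"))
```

With a live session, the review conversion could then be written as `json_to_parquet(spark, "review", ["business_id", "date", "review_id", "stars", "user_id"])`. The user dataset would still need its extra struct-flattening step before the write.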


