Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: Apache-2.0

## Task 4: Load data into Neptune using the bulk load feature and perform basic insert and query operations using Gremlin

In this task, you use the Amazon Neptune bulk loader to ingest data into a Neptune DB cluster. You then perform basic insert and query operations using Gremlin.

### Task 4.1: Prepare sample data for bulk loading

This notebook uses the [MovieLens 100k dataset](https://grouplens.org/datasets/movielens/100k/) provided by [GroupLens Research](https://grouplens.org/datasets/movielens/). This dataset consists of movies, users, and ratings of those movies by users.

The data was already downloaded from the MovieLens website and formatted as part of the lab provisioning process. The formatted data is available in the Amazon S3 bucket used for this lab. All you need to provide is an S3 bucket URI.
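
For context, the following is a minimal sketch of what bulk-load files in the Neptune Gremlin CSV format can look like. Vertex files use the `~id` and `~label` system columns, edge files use `~id`, `~from`, `~to`, and `~label`, and property columns are declared as `name:Type` (with `Type[]` and semicolon-separated values for multi-valued properties). The IDs, column choices, and rows below are hypothetical and are not taken from the lab's actual files.

```python
import csv
import io

# Hypothetical sample rows; the real lab files may use different IDs and columns.
VERTEX_HEADER = ["~id", "~label", "title:String", "genre:String[]"]
VERTEX_ROWS = [
    ["movie_1", "movie", "Example Movie (1995)", "Comedy;Romance"],
]

EDGE_HEADER = ["~id", "~from", "~to", "~label"]
EDGE_ROWS = [
    ["e_1", "movie_1", "genre_comedy", "included_in"],
]


def to_csv(header, rows):
    """Serialize a header and rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


vertex_csv = to_csv(VERTEX_HEADER, VERTEX_ROWS)
edge_csv = to_csv(EDGE_HEADER, EDGE_ROWS)
print(vertex_csv)
print(edge_csv)
```

Keeping vertices and edges in separate files with these header rows is what lets the loader tell the two kinds of records apart.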

### Task 4.2: Use Neptune's bulk load API to import data

In this task, you load the MovieLens 100k dataset into your Neptune cluster using the `%load` magic command.
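
The `%load` form is a front end for the Neptune bulk loader HTTP API. As a rough sketch of what happens behind the form, the following builds the JSON body that a loader request carries and reads the overall status out of a status response. The helper names `build_load_request` and `overall_status` are hypothetical, the argument values are placeholders, and the sample status response is an abbreviated, hand-written illustration. No request is actually sent.

```python
import json


def build_load_request(bucket_name, iam_role_arn, region):
    """Build the JSON body for a POST to https://<cluster-endpoint>:8182/loader."""
    return {
        "source": f"s3://{bucket_name}/movielens-100k/",
        "format": "csv",
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "TRUE",
    }


def overall_status(status_response_text):
    """Extract the overall status string from a loader status response."""
    response = json.loads(status_response_text)
    return response["payload"]["overallStatus"]["status"]


# Placeholder values; replace them with the values shown for your lab.
request_body = build_load_request("S3_BUCKET_NAME", "IAM_ROLE_ARN", "AWS_REGION_CODE")
print(json.dumps(request_body, indent=2))

# Abbreviated, hand-written status response used only for illustration.
sample_status = '{"status": "200 OK", "payload": {"overallStatus": {"status": "LOAD_COMPLETED"}}}'
print(overall_status(sample_status))
```

The fields in the request body correspond to the **Source**, **Format**, **Load ARN**, **Region**, and **Fail on Error** selections that you make in the form.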

1. To load the sample data, run the `%load` command in the following code cell:

In [None]:
%load

2. When you run the code cell, you are prompted to make the following selections:

- For **Source**, enter `s3://S3_BUCKET_NAME/movielens-100k/`

**Caution:** Replace the *S3_BUCKET_NAME* placeholder value with the value of **S3BucketName** provided to the left of the lab instructions.

- For **Format**, use the dropdown menu and select **csv** if not already selected.
- For **Region**, make sure the Region matches the value of **AwsRegionCode** provided to the left of the lab instructions.
- For **Fail on Error**, use the dropdown menu and select **TRUE** if not already selected.
- For **Load ARN**, copy the value of **IAMRoleARN** provided to the left of the lab instructions and paste it here.

3. Keep the default values for the remaining options and choose **Submit**.

<i aria-hidden="true" class="fas fa-clipboard-check" style="color:#18ab4b"></i> **Expected output:** The output should show that the data was successfully loaded into your Neptune cluster.

```plain
************************
**** EXAMPLE OUTPUT ****
************************

Load ID: edb549a7-6084-46d8-a797-e1bfc05f1783

Overall Status: LOAD_COMPLETED
Total execution time: 0:01:44
Done.
```

4. To verify that the data was loaded correctly and to see the count of nodes by label, run the following code cell:

In [None]:
%%gremlin
g.V().groupCount().by(label).unfold().order().by(keys)

<i aria-hidden="true" class="fas fa-clipboard-check" style="color:#18ab4b"></i> **Expected output:** If the nodes are loaded correctly, you should see the following output:

```plain
************************
**** EXAMPLE OUTPUT ****
************************

    19 genre
    1682 movie
    100000 rating
    943 user
```

5. To check if the edges were loaded correctly, run the following code cell:

In [None]:
%%gremlin
g.E().groupCount().by(label).unfold().order().by(keys)

<i aria-hidden="true" class="fas fa-clipboard-check" style="color:#18ab4b"></i> **Expected output:** If the edges are loaded correctly, you should see the following output:

```plain
************************
**** EXAMPLE OUTPUT ****
************************

    100000 about
    2893 included_in
    100000 rated
    100000 wrote
```
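
As a plain-Python analogy for the two verification queries above, `groupCount().by(label)` tallies elements per label, and `unfold().order().by(keys)` emits one entry per label sorted by label name. The sketch below mimics that with `collections.Counter` over a small, made-up list of labels. It does not connect to Neptune.

```python
from collections import Counter

# Made-up labels standing in for the label of each vertex in a tiny graph.
labels = ["movie", "user", "movie", "genre", "user", "movie"]

# groupCount().by(label): tally how many elements carry each label.
counts = Counter(labels)

# unfold().order().by(keys): one (label, count) entry per label, sorted by label.
ordered = sorted(counts.items())
for label, count in ordered:
    print(count, label)
```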

### Task 4.3: Write and run basic Gremlin queries

In this task, you write and run basic Gremlin queries to interact with the data.

6. To retrieve the properties of five movie vertices from the graph, run the following code cell:

In [None]:
%%gremlin
g.V()
 .hasLabel('movie')
 .limit(5)
 .valueMap()

7. To find specific movies by their titles and display their titles and genres, run the following code cell:

In [None]:
%%gremlin

g.V().has('movie', 'title', within('Apollo 13 (1995)', 'Forrest Gump (1994)', 'Sleepless in Seattle (1993)')).
    valueMap('title', 'genre')

### Task 4.4: Perform additional data inserts using Gremlin

In this task, you perform additional data inserts using Gremlin.

8. To add a new movie vertex to the graph with title and genre properties, run the following code cell:

In [None]:
%%gremlin
g.addV('movie')
 .property('title', 'Avengers: Infinity War (2018)')
 .property('genre', 'Action')
 .property('genre', 'Sci-Fi')

9. To verify the recently added movie using its vertex ID, run the following code cell:

**Caution:** Replace the *VERTEX_ID* placeholder value with the actual vertex ID from the previous output before running the following code cell.

In [None]:
%%gremlin
g.V('VERTEX_ID')
    .valueMap()

10. To delete a specific movie vertex (Forrest Gump) from the graph, run the following code cell:

In [None]:
%%gremlin
g.V()
 .has('title', 'Forrest Gump (1994)')
 .drop()

11. To verify that the movie was deleted from the graph, run the following code cell. The results should no longer include Forrest Gump:

In [None]:
%%gremlin
g.V().has('movie', 'title', within('Apollo 13 (1995)', 'Forrest Gump (1994)', 'Sleepless in Seattle (1993)')).
    valueMap('title', 'genre')

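12. To update the genre property of a specific movie (Sleepless in Seattle) in the graph, run the following code cell. Because Neptune uses set cardinality when no cardinality is specified, this adds Drama to the movie's existing genre values rather than replacing them: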

In [None]:
%%gremlin
g.V()
 .has('title', 'Sleepless in Seattle (1993)')
 .property('genre', 'Drama')

13. To view only the genre property of the updated vertex, using its vertex ID, run the following code cell:

**Caution:** Replace the *VERTEX_ID* placeholder value with the actual vertex ID from the previous output before running the following code cell.

In [None]:
%%gremlin
g.V('VERTEX_ID')
    .valueMap('genre')
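
When no cardinality is given, Neptune applies set cardinality, so calling `property('genre', 'Drama')` on an existing vertex adds a value to the property instead of overwriting it. The following toy model, with a hypothetical `set_property` helper and a made-up vertex, sketches the difference between set and single cardinality. It illustrates the semantics only and is not how Neptune stores properties.

```python
def set_property(vertex, key, value, cardinality="set"):
    """Toy model of Gremlin property cardinality on a dict-based vertex."""
    if cardinality == "single":
        # single: replace every existing value for this key.
        vertex[key] = [value]
    elif cardinality == "set":
        # set: add the value only if it is not already present.
        values = vertex.setdefault(key, [])
        if value not in values:
            values.append(value)
    else:
        raise ValueError(f"unsupported cardinality: {cardinality}")
    return vertex


# Made-up vertex; the genres here are not taken from the dataset.
movie = {"title": ["Example Movie (1993)"], "genre": ["Comedy", "Romance"]}

set_property(movie, "genre", "Drama")
after_set = list(movie["genre"])
print(after_set)

set_property(movie, "genre", "Drama", cardinality="single")
after_single = list(movie["genre"])
print(after_single)
```

In Gremlin, the single-cardinality form is written as `.property(single, 'genre', 'Drama')`, which replaces the existing values for that property.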

You can further interact with the data using different Gremlin queries. Refer to [Gremlin Query Language](https://tinkerpop.apache.org/gremlin.html) for additional information.

**Task complete:** You have successfully used the Amazon Neptune bulk loader to ingest data into a Neptune DB cluster. You then performed basic operations using Gremlin. Close this notebook, and return to the lab instructions to continue with Task 5.