# Module 1: Validating and Importing User-Item-Interaction Data

`
Rev Date           By       Description
PA1 2020-02-15     akirmak  Modified and extended version of PersonalizePoC (github: chrisking@)
`


## Choosing a Dataset or Data Source

Personalized recommendations can be applied to many use cases. A few common examples are:

1. E-commerce platforms
1. Content curation on a per-user basis
   1. Video-on-demand applications
   1. Social media platforms
1. Restaurant reservation platforms that make personalized recommendations based on end-user reservation activity
1. Hotel recommendations for travel websites
1. Credit card recommendations for banks
1. Match recommendations for dating sites

As we mentioned, the interaction data (interactions of users with items in the catalog) is key to getting started with the service.

### Last.FM Dataset
To begin with, we are going to use the Last.FM dataset found [here](https://grouplens.org/datasets/hetrec-2011/). This data fits our guidelines, with a large number of users, items, and interactions.

### Data Exploration
EDA, or Exploratory Data Analysis, is a form of initial data analysis in which the data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data.

For small datasets, you could use Python frameworks; for large datasets, you could use Spark. For today's workshop, we took a big data approach: we first uploaded the dataset to Amazon S3, then crawled and cataloged it using AWS Glue, and finally used Amazon QuickSight to visualize the dataset. We won't go through the procedure, as EDA is not today's focus.


### Data Preparation
Your data will usually not arrive in perfect form. In our example too, we need to make some modifications before uploading the data to S3 for use by Amazon Personalize.

In [1]:
data_dir = "poc_data"
# -p keeps this cell re-runnable when the directory already exists
!mkdir -p $data_dir
!cd $data_dir && wget http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip
!cd $data_dir && unzip hetrec2011-lastfm-2k.zip

--2020-02-15 12:21:04--  http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2589075 (2.5M) [application/zip]
Saving to: ‘hetrec2011-lastfm-2k.zip’


2020-02-15 12:21:04 (11.7 MB/s) - ‘hetrec2011-lastfm-2k.zip’ saved [2589075/2589075]

Archive:  hetrec2011-lastfm-2k.zip
  inflating: user_friends.dat        
  inflating: user_taggedartists.dat  
  inflating: user_taggedartists-timestamps.dat  
  inflating: artists.dat             
  inflating: readme.txt              
  inflating: tags.dat                
  inflating: user_artists.dat        




## Preparing Your Data

The next step is to read the data with Pandas, confirm that it is in a good state, and save it to a CSV file that is ready to be used with Amazon Personalize.

Import the Pandas library as well as a few other data science tools in order to inspect the information.

In [3]:
import boto3
from time import sleep
import subprocess
import pandas as pd
import json
import time
import pprint
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
from datetime import datetime

Next, open the file with Pandas and take a look at the contents.

A first attempt without a delimiter did not work so well. The file is tab-delimited, so the delimiter needs to be specified. Attempt 2:

In [5]:
original_data = pd.read_csv(data_dir + '/user_taggedartists-timestamps.dat', delimiter='\t')
original_data.head(5)

Unnamed: 0,userID,artistID,tagID,timestamp
0,2,52,13,1238536800000
1,2,52,15,1238536800000
2,2,52,18,1238536800000
3,2,52,21,1238536800000
4,2,52,41,1238536800000


The data looks really good here, but let's get some extra insight into it.

In [6]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186479 entries, 0 to 186478
Data columns (total 4 columns):
userID       186479 non-null int64
artistID     186479 non-null int64
tagID        186479 non-null int64
timestamp    186479 non-null int64
dtypes: int64(4)
memory usage: 5.7 MB


In [7]:
original_data.describe()

Unnamed: 0,userID,artistID,tagID,timestamp
count,186479.0,186479.0,186479.0,186479.0
mean,1035.600137,4375.845328,1439.582913,1239204000000.0
std,622.461272,4897.789595,2775.340279,42990910000.0
min,2.0,1.0,1.0,-428720400000.0
25%,488.0,686.0,79.0,1209593000000.0
50%,1021.0,2203.0,195.0,1243807000000.0
75%,1624.0,6714.0,887.0,1275343000000.0
max,2100.0,18744.0,12647.0,1304941000000.0


Each column clearly has a range of values, which is great. Note that the minimum timestamp is negative, which corresponds to a date before 1970 and probably indicates a few bad records. The last thing to be mindful of is that the timestamp should be in Unix epoch format. You can learn more about the format here: https://en.wikipedia.org/wiki/Unix_time

Let us grab an arbitrary value from the timestamp column, convert it to a datetime, and confirm that it feels like a reasonable value for the historical data.

Interpreted as seconds, this particular value rendered a year of 41,132, a bit into the future for us, so somehow we parsed it incorrectly. Attempt number 2...

JavaScript records time in milliseconds, and this is a collection of data from a web application, so divide by 1000 first and see what is returned:

In [10]:
arb_time_stamp = arb_time_stamp/1000
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))

1970-01-15 07:17:42


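The output above shows a date in January 1970, most likely because the cell was executed twice and the value was divided by 1000 two times: 1970-01-15 07:17:42 is 1,235,862 seconds after the epoch. Divided only once, the value is about 1,235,862,000 seconds, which falls on 2009-02-28 UTC. Feb 2009 feels completely reasonable, so now move forward by transforming each row in the dataframe in the same way.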

In [11]:
original_data.head(5)
original_data.timestamp = original_data.timestamp / 1000
original_data.head(5)

Unnamed: 0,userID,artistID,tagID,timestamp
0,2,52,13,1238537000.0
1,2,52,15,1238537000.0
2,2,52,18,1238537000.0
3,2,52,21,1238537000.0
4,2,52,41,1238537000.0
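
The milliseconds-to-seconds conversion above can also be written as a small standalone helper. The sketch below is an illustration rather than part of the original notebook: the function names and the `MILLISECOND_CUTOFF` heuristic are assumptions, and the sample value is the first timestamp shown in the `head()` output earlier.

```python
from datetime import datetime, timezone

# Assumed cutoff: epoch values at or above 1e11 are treated as milliseconds.
# 1e11 seconds is roughly the year 5138, so realistic second-based
# timestamps stay below it. This is a heuristic, not an exact rule:
# millisecond values from the first few years after 1970 fall below it.
MILLISECOND_CUTOFF = 1e11


def to_epoch_seconds(value):
    """Return an epoch timestamp in whole seconds, scaling down milliseconds."""
    if abs(value) >= MILLISECOND_CUTOFF:
        return int(value // 1000)
    return int(value)


def format_utc(epoch_seconds):
    """Render epoch seconds as a UTC date string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


# First timestamp from the head() output (milliseconds).
print(format_utc(to_epoch_seconds(1238536800000)))  # 2009-03-31 22:00:00
```

Because the helper checks the magnitude before scaling, applying it to a value that is already in seconds leaves the value unchanged, so running it twice does not divide twice.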


In [14]:
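# interactions_df (userID, artistID, timestamp) is derived from original_data in cells not shown here.
# astype returns a new DataFrame, so this line only previews the integer dtypes;
# interactions_df itself keeps float timestamps, which to_csv later writes as whole numbers via float_format.
interactions_df.astype({'timestamp': 'int64'}).dtypes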

userID       int64
artistID     int64
timestamp    int64
dtype: object

In [15]:
interactions_df.head()

Unnamed: 0,userID,artistID,timestamp
0,2,52,1238537000.0
1,2,52,1238537000.0
2,2,52,1238537000.0
3,2,52,1238537000.0
4,2,52,1238537000.0


Amazon Personalize expects specific column names for users, items, and timestamps (USER_ID, ITEM_ID, and TIMESTAMP), so now we will rename the columns of our dataset accordingly.

In [16]:
interactions_df.rename(columns = {'userID':'USER_ID', 'artistID':'ITEM_ID', 
                              'timestamp':'TIMESTAMP'}, inplace = True) 


At this point the data is ready to go; we just need to save it as a CSV file.

In [17]:
interactions_filename = "interactions.csv"
interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')
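
Before uploading, it can be worth checking that the saved file has the shape described above: a `USER_ID,ITEM_ID,TIMESTAMP` header and whole-number timestamps. The following is a minimal sketch using only the standard library. The function name `validate_interactions_csv` and the sample rows are hypothetical and are not part of the original notebook.

```python
import csv
import io

EXPECTED_HEADER = ["USER_ID", "ITEM_ID", "TIMESTAMP"]


def validate_interactions_csv(handle):
    """Return a list of problems found in an interactions CSV file object."""
    problems = []
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        problems.append("unexpected header: {}".format(header))
        return problems
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append("line {}: expected 3 fields".format(line_number))
            continue
        timestamp = row[2]
        # Digits only: this rejects fractional values and also
        # negative (pre-1970) timestamps.
        if not timestamp.isdigit():
            problems.append(
                "line {}: bad timestamp {!r}".format(line_number, timestamp))
    return problems


# Hypothetical sample: the second data row has a fractional timestamp.
sample = io.StringIO(
    "USER_ID,ITEM_ID,TIMESTAMP\n2,52,1238536800\n2,53,1238536800.5\n")
print(validate_interactions_csv(sample))
```

To check the real file, open it with `open(data_dir + "/" + interactions_filename, newline="")` and pass the file object to the function; an empty list means no problems were found.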

## Creating Dataset Groups and the Interactions Dataset

The highest level of isolation and abstraction with Amazon Personalize is a Dataset Group. Information stored within one of these has no impact on any other dataset group or models created from one. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data. 

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset Groups can house the following types of information:

* User-Item-Interactions
* Event Streams (real-time interactions)
* User Metadata
* Item Metadata

The cells below will create the dataset group and the dataset for interactions.


Start by creating the SDK clients for Amazon Personalize. The API calls in the cells that follow will confirm that your environment can communicate successfully with the service.

In [18]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

### Create the Dataset Group

In [19]:
create_dataset_group_response = personalize.create_dataset_group(
    name = "personalize-poc-lastfm"
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:924376141954:dataset-group/personalize-poc-lastfm",
  "ResponseMetadata": {
    "RequestId": "8da8db75-4a5c-4feb-a450-61bd1a484d29",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 15 Feb 2020 12:27:37 GMT",
      "x-amzn-requestid": "8da8db75-4a5c-4feb-a450-61bd1a484d29",
      "content-length": "101",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


#### Wait for Dataset Group to Have ACTIVE Status

Before the dataset group can be used in any of the steps below, it must be active. Execute the cell below and wait for it to show ACTIVE.

In [20]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE


### Create the Dataset

First define a schema for the interactions:

In [21]:
interactions_schema = schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "personalize-poc-lastfm-interactions",
    schema = json.dumps(interactions_schema)
)

schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:924376141954:schema/personalize-poc-lastfm-interactions",
  "ResponseMetadata": {
    "RequestId": "e7844ed5-99ab-4cf7-9918-dc9ee649d01f",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 15 Feb 2020 12:29:12 GMT",
      "x-amzn-requestid": "e7844ed5-99ab-4cf7-9918-dc9ee649d01f",
      "content-length": "101",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Now create a dataset with that schema.

In [22]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    name = "personalize-poc-lastfm-ints",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = schema_arn
)

dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:924376141954:dataset/personalize-poc-lastfm/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "8f8562c6-d953-4eab-aa42-fc7a3863a096",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 15 Feb 2020 12:29:22 GMT",
      "x-amzn-requestid": "8f8562c6-d953-4eab-aa42-fc7a3863a096",
      "content-length": "103",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [23]:
interactions_dataset_arn = dataset_arn

## Configuring S3 and IAM 


Amazon Personalize will need an S3 bucket to act as the source of your data, as well as IAM roles for accessing it. The code below will set all that up.

Now, using the metadata stored on this SageMaker Notebook instance, determine the region we are operating in. If you are using a Jupyter Notebook outside of SageMaker, simply define region as the string that indicates the region you would like to use for Personalize and S3.


In [24]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print(region)

us-east-1
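
The cell above relies on the region being the fourth colon-separated field of the resource ARN. A slightly more defensive version of that parsing step might look like the sketch below. The function name `region_from_arn` is hypothetical, and the sample ARN uses a placeholder account ID and resource name.

```python
def region_from_arn(arn):
    """Extract the region field from an ARN string."""
    parts = arn.split(":")
    # An ARN has at least six colon-separated fields:
    # arn:partition:service:region:account-id:resource
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError("not an ARN: {!r}".format(arn))
    if not parts[3]:
        # Global services such as IAM leave the region field empty.
        raise ValueError("ARN has no region field: {!r}".format(arn))
    return parts[3]


# Hypothetical ARN with a placeholder account ID.
sample_arn = "arn:aws:sagemaker:us-east-1:123456789012:notebook-instance/example"
print(region_from_arn(sample_arn))  # us-east-1
```

Raising an error on malformed input makes a missing or unexpected metadata file fail loudly instead of producing an empty region string.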


In [25]:
session = boto3.Session(region_name=region)

Put your initials in the bucket name below, e.g. fs for Frank Sinatra. Bucket names must be lowercase and globally unique, so if the name is not available, choose another one.

In [27]:
print(region)
s3 = boto3.client('s3')
# account_id = boto3.client('sts').get_caller_identity().get('Account')

bucket_name = "<PUT-YOUR-INITIALS-HERE>" + "-ai-personalizepoc"
print(bucket_name)
if region != "us-east-1":
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else:
    s3.create_bucket(Bucket=bucket_name)

us-east-1
hba-ai-personalizepoc
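
If the placeholder in the bucket name is left unchanged, bucket creation fails because the name contains uppercase letters and angle brackets. The sketch below checks a name before calling S3. It covers only a simplified subset of the S3 bucket naming rules, and the function name `is_valid_bucket_name` is hypothetical.

```python
import re

# Simplified subset of the S3 bucket naming rules:
# 3 to 63 characters; lowercase letters, digits, dots, and hyphens only;
# the first and last character must be a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Return True if the name passes the simplified checks above."""
    if ".." in name:  # adjacent dots are not allowed
        return False
    return _BUCKET_NAME.fullmatch(name) is not None


print(is_valid_bucket_name("hba-ai-personalizepoc"))
print(is_valid_bucket_name("<PUT-YOUR-INITIALS-HERE>-ai-personalizepoc"))
```

The first call prints `True` and the second prints `False`. A name that passes this check can still be rejected by S3 if another account already owns it.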


### Attach Policy to S3 Bucket
Amazon Personalize needs to be able to read the contents of the S3 bucket that you created earlier. The bucket policy below grants that access.

In [29]:
s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy" + bucket_name,
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*Object",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))

{'ResponseMetadata': {'RequestId': '4B16F0313E92E10D',
  'HostId': '8N/YV/9LyK7GM/qy7skgTuFa5ONuymsOagW1rqrqPfFKEhgqzJK4lV8GpUeEz8K9AaLV6gF8Qt4=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': '8N/YV/9LyK7GM/qy7skgTuFa5ONuymsOagW1rqrqPfFKEhgqzJK4lV8GpUeEz8K9AaLV6gF8Qt4=',
   'x-amz-request-id': '4B16F0313E92E10D',
   'date': 'Sat, 15 Feb 2020 12:34:18 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}
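
The policy above grants `s3:*Object`, which includes write and delete actions on objects. As a sketch of a narrower alternative, and assuming that read access is enough for importing data, the policy document could be built with only `s3:GetObject` and `s3:ListBucket`. The function name `build_read_only_bucket_policy` is hypothetical; whether these two actions suffice for your workflow is an assumption to verify against the Amazon Personalize documentation.

```python
import json


def build_read_only_bucket_policy(bucket_name):
    """Build a bucket policy that lets Amazon Personalize list and read objects."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy" + bucket_name,
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                # Assumption: read-only access is sufficient for imports.
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name),
                ],
            }
        ],
    }


# Bucket name taken from the example output earlier in this section.
print(json.dumps(build_read_only_bucket_policy("hba-ai-personalizepoc"), indent=2))
```

The resulting dictionary can be passed to `s3.put_bucket_policy` with `json.dumps`, in the same way as the policy in the cell above.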

### Create Personalize Role
Amazon Personalize also needs to assume an IAM role that carries the permissions required to execute certain tasks. The lines below create that role and attach the policies.

In [30]:
iam = boto3.client("iam")

role_name = "PersonalizeRolePOC" + bucket_name
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

arn:aws:iam::924376141954:role/PersonalizeRolePOChba-ai-personalizepoc


### Upload to S3

Before Personalize can import the data, it needs to be in S3.

In [31]:
# Upload Interactions File
interactions_file_path = data_dir + "/" + interactions_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)
interactions_s3DataPath = "s3://"+bucket_name+"/"+interactions_filename

## Importing the Interactions Data

Earlier you created the dataset group and dataset to house your information. Now you will execute an import job that loads the data from S3 into Amazon Personalize for use in building your model.

#### Create Dataset Import Job

In [32]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-poc-import",
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:924376141954:dataset-import-job/personalize-poc-import",
  "ResponseMetadata": {
    "RequestId": "dcc19c1d-2e57-4ba6-afce-ec8fff6f8858",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Sat, 15 Feb 2020 12:38:28 GMT",
      "x-amzn-requestid": "dcc19c1d-2e57-4ba6-afce-ec8fff6f8858",
      "content-length": "110",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


#### Wait for Dataset Import Job to Have ACTIVE Status
It can take a while before the import job completes. Please wait until the output below shows that it is ACTIVE.

In [33]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE
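
The loop above also stops on `CREATE FAILED`, in which case the reason is worth printing before moving on. The sketch below summarizes a describe response. It assumes the response carries a `failureReason` field next to `status` when the job fails, and it falls back to a generic message if the field is absent. The function name `summarize_import_job` and the sample responses are hypothetical.

```python
def summarize_import_job(response):
    """Return the job status, with the failure reason appended on failure."""
    job = response["datasetImportJob"]
    status = job["status"]
    if status == "CREATE FAILED":
        # Assumption: a failed job reports its cause in "failureReason".
        reason = job.get("failureReason", "no reason reported")
        return "{}: {}".format(status, reason)
    return status


# Hypothetical responses shaped like the describe_dataset_import_job output.
print(summarize_import_job({"datasetImportJob": {"status": "ACTIVE"}}))
print(summarize_import_job({
    "datasetImportJob": {
        "status": "CREATE FAILED",
        "failureReason": "example failure message",
    }
}))
```

In the notebook, the function could be called with `describe_dataset_import_job_response` once the loop has finished.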


Now that the dataset import is active, you are ready to start building models with SIMS, Personalized-Ranking, Popularity-Count, and HRNN. Work will continue in other notebooks. Before moving on, run the cell below to store a few values for use in the next notebooks.

In [34]:
%store interactions_dataset_arn
%store dataset_group_arn
%store bucket_name
%store role_arn
%store role_name
%store data_dir

Stored 'interactions_dataset_arn' (str)
Stored 'dataset_group_arn' (str)
Stored 'bucket_name' (str)
Stored 'role_arn' (str)
Stored 'role_name' (str)
Stored 'data_dir' (str)


Congratulations. You have prepared your dataset and ingested it into Amazon Personalize.