## Spotify Data Feature Store

This notebook is based on the SageMaker example notebook `amazon-sagemaker-examples/sagemaker-featurestore/feature_store_introduction`, which demonstrates how to get started with Feature Store, create feature groups, and ingest data into them. These feature groups are stored in your Feature Store.

Feature groups are resources that contain metadata for all data stored in your Feature Store. A feature group is a logical grouping of features, defined in the feature store to describe records. A feature group’s definition is composed of a list of feature definitions, a record identifier name, and configurations for its online and offline store. 
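To make the parts of a definition concrete, here is a minimal sketch using plain dataclasses. The names `FeatureDef`, `FeatureGroupSpec`, `online_store_enabled`, and `offline_store_s3_uri`, as well as the `example-bucket` URI, are hypothetical and exist only for illustration; they are not SageMaker SDK types. The event time field is included on the assumption that every record also needs a timestamp feature, as described later in this notebook.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeatureDef:
    """One feature: a name plus one of Feature Store's three types."""
    name: str
    feature_type: str  # "Integral", "Fractional", or "String"


@dataclass
class FeatureGroupSpec:
    """Hypothetical stand-in for a feature group definition (not an SDK class)."""
    name: str
    record_identifier_name: str
    event_time_feature_name: str
    feature_definitions: List[FeatureDef] = field(default_factory=list)
    online_store_enabled: bool = True
    offline_store_s3_uri: Optional[str] = None

    def validate(self) -> None:
        """Check that the identifier and event time are among the features."""
        names = {f.name for f in self.feature_definitions}
        missing = {self.record_identifier_name, self.event_time_feature_name} - names
        if missing:
            raise ValueError(f"missing required features: {sorted(missing)}")


artists_spec = FeatureGroupSpec(
    name="artists-feature-group",
    record_identifier_name="id",
    event_time_feature_name="EventTime",
    feature_definitions=[
        FeatureDef("id", "String"),
        FeatureDef("followers", "Fractional"),
        FeatureDef("popularity", "Integral"),
        FeatureDef("EventTime", "Fractional"),
    ],
    offline_store_s3_uri="s3://example-bucket/sagemaker-featurestore-introduction",
)
artists_spec.validate()  # raises ValueError if a required feature is absent
```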

### Overview
1. Set up
2. Create a feature group
3. Ingest data into a feature group

### Prerequisites
This notebook uses both `boto3` and the SageMaker Python SDK, and runs on the `Python 3 (Data Science)` kernel. It works with Studio, Jupyter, and JupyterLab.

#### Library dependencies:
* sagemaker>=2.0.0
* numpy
* pandas
* scikit-learn

#### Role requirements:
**IMPORTANT**: You must attach the following policies to your execution role:
* AmazonS3FullAccess
* AmazonSageMakerFeatureStoreAccess 

### Set up

In [1]:
# SageMaker Python SDK version 2.x is required
import sagemaker
import sys

original_version = sagemaker.__version__
%pip install 'sagemaker>=2.0.0'

Note: you may need to restart the kernel to use updated packages.


In [2]:
!pip freeze | grep sagemaker

sagemaker==2.41.0


In [3]:
import boto3
import pandas as pd
import numpy as np
import io
from sagemaker.session import Session
from sagemaker import get_execution_role

prefix = "sagemaker-featurestore-introduction"
role = get_execution_role()

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
s3_bucket_name = sagemaker_session.default_bucket()

In [4]:
s3_bucket_name

'sagemaker-us-west-2-343678298361'

In [29]:
# inspect data
artist_data = pd.read_csv("data/artists.csv")
tracks_data = pd.read_csv("data/tracks.csv")

In [30]:
artist_data.head()

| | id | followers | genres | name | popularity |
|---|---|---|---|---|---|
| 0 | 0DheY5irMjBUeLybbCUEZ2 | 0.0 | [] | Armid & Amir Zare Pashai feat. Sara Rouzbehani | 0 |
| 1 | 0DlhY15l3wsrnlfGio2bjU | 5.0 | [] | ปูนา ภาวิณี | 0 |
| 2 | 0DmRESX2JknGPQyO15yxg7 | 0.0 | [] | Sadaa | 0 |
| 3 | 0DmhnbHjm1qw6NCYPeZNgJ | 0.0 | [] | Tra'gruda | 0 |
| 4 | 0Dn11fWM7vHQ3rinvWEl4E | 2.0 | [] | Ioannis Panoutsopoulos | 0 |


In [31]:
artist_data.shape

(1104349, 5)

#### Fix list data types

In [32]:
artist_data[artist_data.genres=='[]'].shape

(805733, 5)

In [33]:
artist_data[artist_data.genres!='[]'].head()

| | id | followers | genres | name | popularity |
|---|---|---|---|---|---|
| 45 | 0VLMVnVbJyJ4oyZs2L3Yl2 | 71.0 | ['carnaval cadiz'] | Las Viudas De Los Bisabuelos | 6 |
| 46 | 0dt23bs4w8zx154C5xdVyl | 63.0 | ['carnaval cadiz'] | Los De Capuchinos | 5 |
| 47 | 0pGhoB99qpEJEsBQxgaskQ | 64.0 | ['carnaval cadiz'] | Los “Pofesionales” | 7 |
| 48 | 3HDrX2OtSuXLW5dLR85uN3 | 53.0 | ['carnaval cadiz'] | Los Que No Paran De Rajar | 6 |
| 136 | 22mLrN5fkppmuUPsHx6i2G | 59.0 | ['classical harp', 'harp'] | Vera Dulova | 3 |


In [34]:
artist_data.loc[artist_data.genres=='[]','genres'].shape

(805733,)

In [35]:
# replace empty lists with na
artist_data.loc[artist_data.genres=='[]','genres'] = np.nan
artist_data.genres.isna().sum()

805733

In [36]:
%%time
from ast import literal_eval
artist_data.loc[artist_data.genres.notna(),'genres'] = artist_data.loc[artist_data.genres.notna(),'genres'].apply(literal_eval)

CPU times: user 2.36 s, sys: 19.6 ms, total: 2.38 s
Wall time: 2.37 s


In [38]:
artist_data.loc[artist_data.genres.notna(),'genres'].iloc[0]

['carnaval cadiz']
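The conversion above relies on `ast.literal_eval`, which parses a Python literal from a string without executing arbitrary code. A minimal standalone sketch follows; `parse_genres` is a hypothetical helper name, and returning `None` for an empty list is an assumption that mirrors the NaN replacement done earlier.

```python
from ast import literal_eval
from typing import List, Optional


def parse_genres(raw: str) -> Optional[List[str]]:
    """Parse a stringified list such as "['classical harp', 'harp']".

    Returns None for an empty list so that "no genres" is treated as missing.
    """
    parsed = literal_eval(raw)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {type(parsed).__name__}")
    return parsed or None


print(parse_genres("['classical harp', 'harp']"))  # ['classical harp', 'harp']
print(parse_genres("[]"))  # None
```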

In [39]:
tracks_data.head()

| | id | name | popularity | duration_ms | explicit | artists | id_artists | release_date | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35iwgR4jXetI318WEWsa1Q | Carve | 6 | 126903 | 0 | ['Uli'] | ['45tIt06XoI0Iio4LBEVpls'] | 1922-02-22 | 0.645 | 0.445 | 0 | -13.338 | 1 | 0.451 | 0.674 | 0.744 | 0.151 | 0.127 | 104.851 | 3 |
| 1 | 021ht4sdgPcrDgSk7JTbKY | Capítulo 2.16 - Banquero Anarquista | 0 | 98200 | 0 | ['Fernando Pessoa'] | ['14jtPCOoNZwquk5wd9DxrY'] | 1922-06-01 | 0.695 | 0.263 | 0 | -22.136 | 1 | 0.957 | 0.797 | 0.0 | 0.148 | 0.655 | 102.009 | 1 |
| 2 | 07A5yehtSnoedViJAZkNnc | Vivo para Quererte - Remasterizado | 0 | 181640 | 0 | ['Ignacio Corsini'] | ['5LiOoJbxVSAMkBS2fUm3X2'] | 1922-03-21 | 0.434 | 0.177 | 1 | -21.18 | 1 | 0.0512 | 0.994 | 0.0218 | 0.212 | 0.457 | 130.418 | 5 |
| 3 | 08FmqUhxtyLTn6pAh6bk45 | El Prisionero - Remasterizado | 0 | 176907 | 0 | ['Ignacio Corsini'] | ['5LiOoJbxVSAMkBS2fUm3X2'] | 1922-03-21 | 0.321 | 0.0946 | 7 | -27.961 | 1 | 0.0504 | 0.995 | 0.918 | 0.104 | 0.397 | 169.98 | 3 |
| 4 | 08y9GfoqCWfOGsKdwojr5e | Lady of the Evening | 0 | 163080 | 0 | ['Dick Haymes'] | ['3BiJGZsyX9sJchTqcSA7Su'] | 1922 | 0.402 | 0.158 | 3 | -16.9 | 0 | 0.039 | 0.989 | 0.13 | 0.311 | 0.196 | 103.22 | 4 |


In [40]:
tracks_data.id_artists.iloc[0]

"['45tIt06XoI0Iio4LBEVpls']"

In [41]:
tracks_data.artists.iloc[0]

"['Uli']"

In [42]:
%%time
for i in ['id_artists','artists']:
    tracks_data[i] = tracks_data[i].apply(literal_eval)

CPU times: user 7.65 s, sys: 152 ms, total: 7.8 s
Wall time: 7.8 s


In [43]:
tracks_data.id_artists.iloc[0]

['45tIt06XoI0Iio4LBEVpls']

In [44]:
tracks_data.artists.iloc[0]

['Uli']

### Normalize data for Feature Store

In [46]:
%%time
# one hot encode genres
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
genres_oh = pd.DataFrame(mlb.fit_transform(artist_data.loc[artist_data.genres.notna(),'genres']),
                         columns=mlb.classes_,
                         index=artist_data.loc[artist_data.genres.notna(),'genres'].index)
genres_oh.head()

CPU times: user 671 ms, sys: 676 ms, total: 1.35 s
Wall time: 1.37 s


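| | 21st century classical | 432hz | 48g | 8-bit | 8d | a cappella | a3 | aarhus indie | aberdeen indie | abstract | ... | zim hip hop | zim urban groove | zimdancehall | zither | zolo | zouglou | zouk | zouk riddim | zurich indie | zydeco |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 136 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |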
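As a rough picture of what the binarizer computes, the sketch below builds the same kind of multi-hot matrix in plain Python. `multi_hot` is a hypothetical helper, not part of scikit-learn; it sorts the label vocabulary, which matches the alphabetical column order seen in the output above.

```python
from typing import Iterable, List, Sequence, Tuple


def multi_hot(rows: Sequence[Iterable[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return (sorted class labels, one 0/1 row per input row)."""
    classes = sorted({label for row in rows for label in row})
    column = {label: i for i, label in enumerate(classes)}
    matrix = []
    for row in rows:
        encoded = [0] * len(classes)
        for label in row:
            encoded[column[label]] = 1  # mark every genre present in this row
        matrix.append(encoded)
    return classes, matrix


classes, matrix = multi_hot([["carnaval cadiz"], ["classical harp", "harp"]])
print(classes)  # ['carnaval cadiz', 'classical harp', 'harp']
print(matrix)   # [[1, 0, 0], [0, 1, 1]]
```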


In [48]:
# Plan: split artists into fixed columns (artist one, two, ..., n).
# First, find the maximum number of artists on a single track.
tracks_data['num_artists'] = tracks_data['id_artists'].apply(len)
tracks_data['num_artists'].describe()

count    586672.000000
mean          1.290619
std           0.869436
min           1.000000
25%           1.000000
50%           1.000000
75%           1.000000
max          58.000000
Name: num_artists, dtype: float64

In [55]:
tracks_data[tracks_data['num_artists'] <= 3].shape[0] / tracks_data.shape[0] 

0.978524627048845

In [50]:
tracks_data.shape

(586672, 21)
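Since about 97.9% of tracks have at most three artists, one option for the "artist one, two, n" columns is to keep the first three artist IDs and pad shorter lists. The sketch below is an assumption about how that step could look; `split_artists`, the `artist_1` to `artist_3` column names, and the cutoff of three are hypothetical choices rather than something the notebook has settled on.

```python
from typing import Dict, List, Optional, Sequence


def split_artists(artist_ids: Sequence[str], max_artists: int = 3) -> Dict[str, Optional[str]]:
    """Keep the first `max_artists` IDs and pad with None up to that length."""
    kept: List[Optional[str]] = list(artist_ids[:max_artists])
    kept += [None] * (max_artists - len(kept))
    return {f"artist_{i + 1}": artist_id for i, artist_id in enumerate(kept)}


print(split_artists(["45tIt06XoI0Iio4LBEVpls"]))
# {'artist_1': '45tIt06XoI0Iio4LBEVpls', 'artist_2': None, 'artist_3': None}
```

Applied to the data frame, the resulting dictionaries could be expanded into columns, for example with `pd.DataFrame(tracks_data['id_artists'].map(split_artists).tolist(), index=tracks_data.index)`. Artists beyond the third are dropped under this scheme, which affects roughly 2% of tracks.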

### Create a feature group

We start by creating feature group names for `artist_data` and `tracks_data`. We then create two feature groups, one for `tracks_data` and another for `artist_data`.

In [9]:
from time import gmtime, strftime, sleep

artists_feature_group_name = "artists-feature-group-" + strftime("%d-%H-%M-%S", gmtime())
tracks_feature_group_name = "tracks-feature-group-" + strftime("%d-%H-%M-%S", gmtime())
print(artists_feature_group_name,tracks_feature_group_name,sep='\n')

artists-feature-group-02-21-22-34
tracks-feature-group-02-21-22-34


Instantiate a `FeatureGroup` object for each of `artist_data` and `tracks_data`.

In [10]:
from sagemaker.feature_store.feature_group import FeatureGroup

artists_feature_group = FeatureGroup(
    name=artists_feature_group_name, sagemaker_session=sagemaker_session
)
tracks_feature_group = FeatureGroup(
    name=tracks_feature_group_name, sagemaker_session=sagemaker_session
)

In [11]:
import time

current_time_sec = int(round(time.time()))

record_identifier_feature_name = "id"

Append an `EventTime` feature to each data frame. This feature is required, and it timestamps each record.

In [14]:
artist_data["EventTime"] = pd.Series([current_time_sec] * len(artist_data), dtype="float64")
tracks_data["EventTime"] = pd.Series([current_time_sec] * len(tracks_data), dtype="float64")
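Feature Store accepts `EventTime` either as a Fractional feature holding Unix epoch seconds (the form used here) or as a String feature in ISO-8601 UTC format. The following sketch produces both forms from the same instant using only the standard library; the variable names `event_time_fractional` and `event_time_iso` are illustrative.

```python
import time
from datetime import datetime, timezone

# Same timestamp source as the notebook: whole seconds since the Unix epoch.
current_time_sec = int(round(time.time()))

# Fractional form: epoch seconds stored as a float.
event_time_fractional = float(current_time_sec)

# String form: ISO-8601 in UTC, e.g. 2021-06-02T21:22:34Z.
event_time_iso = datetime.fromtimestamp(current_time_sec, tz=timezone.utc).strftime(
    "%Y-%m-%dT%H:%M:%SZ"
)

print(event_time_fractional, event_time_iso)
```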

Load feature definitions into each feature group.

In [13]:
# Infer feature definitions from each data frame's column dtypes.
# Note: depending on the SDK version, object-dtype columns may first need to be cast to the pandas "string" dtype.
artists_feature_group.load_feature_definitions(data_frame=artist_data)
tracks_feature_group.load_feature_definitions(data_frame=tracks_data)
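Feature Store has three feature types: Integral, Fractional, and String. The SDK infers them from each column's dtype. The plain-Python sketch below only illustrates the idea of that mapping and is not the SDK's implementation. `infer_feature_type` is a hypothetical helper, treating lists as strings is an assumption about how list-valued columns such as `genres` would be stored, and the `EventTime` value is a placeholder.

```python
from typing import Any, Dict


def infer_feature_type(value: Any) -> str:
    """Map a Python value to a Feature Store type name (illustrative only)."""
    if isinstance(value, int):
        return "Integral"
    if isinstance(value, float):
        return "Fractional"
    return "String"  # strings, lists, and anything else are stored as text


# Row 45 of artist_data, plus a placeholder EventTime value.
record: Dict[str, Any] = {
    "id": "0VLMVnVbJyJ4oyZs2L3Yl2",
    "followers": 71.0,
    "genres": ["carnaval cadiz"],
    "name": "Las Viudas De Los Bisabuelos",
    "popularity": 6,
    "EventTime": 1600000000.0,
}

feature_types = {name: infer_feature_type(value) for name, value in record.items()}
print(feature_types)
```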