# Music Recommendation on Spotify Data

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

This example demonstrates how you can use KDB.AI to perform similarity recommendations using vector embeddings created from both categorical and numeric music data.

Applications like Spotify and YouTube Music perform hundreds of millions of song recommendations for users every single day. They do this by extracting a vast array of features about every given song and artist and comparing their characteristics.
With this sort of data, KDB.AI can be used to productionize a music recommendation system that quickly and efficiently finds music similar to given input songs.

### Aim

In this tutorial, we'll break down how you might perform similarity search on music, taking some Spotify data as an example and using KDB.AI as the vector database to store and query this data.
This breaks down as follows:

1. Load Song Data
1. Create Song Vector Embeddings
1. Store Embeddings In KDB.AI
1. Search For Similar Songs To A Target Song
1. Delete the KDB.AI Table

---

## 0. Setup

### Install dependencies 

To run this sample, follow the steps that match where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.

- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

In [1]:
!pip install kdbai_client
!pip install gensim nltk



In [2]:
### !!! Only run this cell if you need to download the data into your environment, for example in Colab
### This downloads song data
!mkdir ./data 
!wget -P ./data https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/music_recommendation/data/song_data.csv

mkdir: cannot create directory ‘./data’: File exists
--2024-09-23 15:08:50--  https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/music_recommendation/data/song_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27745759 (26M) [text/plain]
Saving to: ‘./data/song_data.csv.2’


2024-09-23 15:08:52 (13.3 MB/s) - ‘./data/song_data.csv.2’ saved [27745759/27745759]



### Import Packages

We will start by importing all of the Python packages needed to run this music recommendation system example.
This includes packages for reading in the data, embedding it as vectors, and interacting with the vector database.

In [3]:
import pandas as pd
import numpy as np

In [4]:
# embedding categorical data
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

[nltk_data] Downloading package punkt to /home/gflood/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
# timing
from tqdm.auto import tqdm

In [35]:
# vector DB
import os
import kdbai_client as kdbai
from getpass import getpass
import time

### Configure Console

To view our embeddings in full when displaying results, we increase the maximum column width in Pandas DataFrames from its default value.

In [7]:
pd.set_option("max_colwidth", 1000)

The following setting suppresses the warning that Pandas raises when performing in-place column assignment.

In [8]:
pd.options.mode.chained_assignment = None

### Define Helper Functions

Defining these two helper functions will allow us to easily show the shape and head of any Pandas DataFrames or embedding arrays passed.

In [9]:
def show_df(df: pd.DataFrame) -> pd.DataFrame:
    print(df.shape)
    return df.head()

In [10]:
def show_embeddings(embeddings: list) -> list[float]:
    print("Num Embeddings:", len(embeddings))
    print("Embedding Size:", len(embeddings[0]))
    return list(embeddings[0])

## 1. Load Song Data

The song data we will read in is taken from an [open-source Spotify dataset](https://www.kaggle.com/datasets/vatsalmavani/spotify-dataset) on Kaggle. There are 5 files on Kaggle; however, only one is relevant to this analysis.
This dataset contains metadata on over 170,000 songs from 1921 to 2020. This metadata includes:
- Song Name
- Artist Name
- Song Year
- Various features about the song's music, including:
    * acousticness
    * danceability
    * duration_ms
    * energy
    * explicit
    * instrumentalness
    * key
    * liveness
    * loudness
    * mode
    * popularity
    * release_date
    * speechiness
    * tempo
    * valence

### Read In The Spotify Data From The CSV

We can read this song data from a CSV into a Pandas DataFrame and show the resulting table.

In [11]:
raw_song_df = pd.read_csv("data/song_data.csv")

In [12]:
show_df(raw_song_df)

(170653, 19)


Unnamed: 0,id,name,artists,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,release_date,speechiness,tempo,valence,year
0,4BJqT0PrAfrxzMOxytFOIz,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve","['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']",0.982,0.279,831667,0.211,0,0.878,10,0.665,-20.096,1,4,1921,0.0366,80.954,0.0594,1921
1,7xPhfUan2yNtyFG0cUWkt8,Clancy Lowered the Boom,['Dennis Day'],0.732,0.819,180533,0.341,0,0.0,7,0.16,-12.441,1,5,1921,0.415,60.936,0.963,1921
2,1o6I8BglA6ylDMrIELygv1,Gati Bali,['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],0.961,0.328,500062,0.166,0,0.913,3,0.101,-14.85,1,5,1921,0.0339,110.339,0.0394,1921
3,3ftBPsC5vPBKxYSee08FDH,Danny Boy,['Frank Parker'],0.967,0.275,210000,0.309,0,2.8e-05,5,0.381,-9.316,1,3,1921,0.0354,100.109,0.165,1921
4,4d6HGyGT8e121BsdKmw9v6,When Irish Eyes Are Smiling,['Phil Regan'],0.957,0.418,166693,0.193,0,2e-06,3,0.229,-10.096,1,2,1921,0.038,101.665,0.253,1921


### Pre-process The Data

Here we will perform a few operations on this Pandas DataFrame to get it into the correct format for creating the vector embeddings for our vector database.
This will include:
- Adding a column prefix
- Removing excess columns
- Fixing column values
- Combining columns into one
- Removing duplicate rows

Once these pre-processing steps have been carried out, our data will be clean and in the correct format to start creating embeddings.

In [13]:
# add "song_" prefix to col names
song_df = raw_song_df.add_prefix("song_")

In [14]:
# drop unused cols
song_df = song_df.drop(columns=["song_id", "song_release_date"])

In [15]:
# fix artists list names - remove brackets and quotes
def fix_artists(str_list):
    return ", ".join(str_list.rstrip("']").lstrip("['").split("', '"))


song_df["song_artists"] = song_df["song_artists"].apply(fix_artists)
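The string-stripping approach above assumes every artist name is wrapped in single quotes. Python's `repr` switches to double quotes for a name that contains an apostrophe, in which case those quotes would be left in place. As a hedged alternative (not what this notebook runs), the stringified list could be parsed with the standard library's `ast.literal_eval`. The function name `parse_artists` and the second input below are hypothetical and used for illustration only.

```python
import ast


def parse_artists(str_list: str) -> str:
    """Parse a stringified Python list of artist names and join them with ', '."""
    # literal_eval evaluates the list literal safely, so quotes or brackets
    # inside a name are preserved rather than stripped.
    return ", ".join(ast.literal_eval(str_list))


# Input in the same shape as the `artists` column shown earlier
print(parse_artists("['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']"))

# Illustrative input containing an apostrophe (repr uses double quotes here)
print(parse_artists('["Destiny\'s Child"]'))
```

This is slower than plain string stripping, so whether it is worth it depends on how many names in the data actually contain quotes.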

In [16]:
# combine song_name & song_artists into song_description
song_df.insert(
    0, "song_description", song_df["song_name"] + " - " + song_df["song_artists"]
)

In [17]:
# remove duplicate rows
song_data = song_df[
    ~song_df.duplicated(subset=["song_description"], keep="first")
].reset_index(drop=True)

In [18]:
show_df(song_data)

(157685, 18)


Unnamed: 0,song_description,song_name,song_artists,song_acousticness,song_danceability,song_duration_ms,song_energy,song_explicit,song_instrumentalness,song_key,song_liveness,song_loudness,song_mode,song_popularity,song_speechiness,song_tempo,song_valence,song_year
0,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve - Sergei Rachmaninoff, James Levine, Berliner Philharmoniker","Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve","Sergei Rachmaninoff, James Levine, Berliner Philharmoniker",0.982,0.279,831667,0.211,0,0.878,10,0.665,-20.096,1,4,0.0366,80.954,0.0594,1921
1,Clancy Lowered the Boom - Dennis Day,Clancy Lowered the Boom,Dennis Day,0.732,0.819,180533,0.341,0,0.0,7,0.16,-12.441,1,5,0.415,60.936,0.963,1921
2,Gati Bali - KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat,Gati Bali,KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat,0.961,0.328,500062,0.166,0,0.913,3,0.101,-14.85,1,5,0.0339,110.339,0.0394,1921
3,Danny Boy - Frank Parker,Danny Boy,Frank Parker,0.967,0.275,210000,0.309,0,2.8e-05,5,0.381,-9.316,1,3,0.0354,100.109,0.165,1921
4,When Irish Eyes Are Smiling - Phil Regan,When Irish Eyes Are Smiling,Phil Regan,0.957,0.418,166693,0.193,0,2e-06,3,0.229,-10.096,1,2,0.038,101.665,0.253,1921


## 2. Create Song Vector Embeddings

We will create vector embeddings from this data in three steps:

- A. Encoding the categorical `song_description` column as numeric values
- B. Scaling the numeric column values
- C. Joining these two sets of encodings together into one vector embedding

### A. Embed Categorical Song Metadata

To embed the `song_description` column as numeric vectors, we must perform natural language processing on it.
This involves tokenising the descriptions to break them up into their individual sub-parts and then using a `Word2Vec` model to turn these tokenised song descriptions into vectors.

The length of the vectors we turn these descriptions into is configurable. In this case we chose to set it to `15`, as there are also 15 numeric columns which describe the song. We do not want to bias the final embedding vectors in favour of either the categorical columns or the numeric columns, so it made sense to use the same number of values to represent each.
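`Word2Vec` produces one vector per token, so a description with several tokens still has to be reduced to a single fixed-length vector. This notebook does that by averaging the token vectors. The sketch below illustrates the idea in plain Python with made-up 3-dimensional vectors; `toy_vectors` and `average_embedding` are hypothetical names, and the values do not come from a trained model.

```python
# Toy 3-dimensional "word vectors"; the values are invented for illustration.
toy_vectors = {
    "danny": [1.0, 0.0, 2.0],
    "boy": [3.0, 2.0, 0.0],
}


def average_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values position by position across all token vectors
    return [sum(component) / len(known) for component in zip(*known)]


print(average_embedding(["danny", "boy", "-"], toy_vectors, 3))  # [2.0, 1.0, 1.0]
print(average_embedding(["unknown"], toy_vectors, 3))  # [0.0, 0.0, 0.0]
```

Averaging discards word order, which is usually acceptable for short titles and artist names.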


In [19]:
# tokenize the descriptions
tokenised_song_descs = [word_tokenize(v.lower()) for v in song_data["song_description"]]

In [20]:
# create embedding model
embedding_dim = 15

word2Vec_model = Word2Vec(
    sentences=tokenised_song_descs,
    vector_size=embedding_dim,
    window=5,
    min_count=1,
    sg=1,
)

In [21]:
# function to create embedding vector from tokens
def get_embedding(song_desc_tokens, model, embedding_dim):
    vectors = [model.wv[token] for token in song_desc_tokens if token in model.wv]

    # Average of word vectors OR zeros if no valid tokens found
    return sum(vectors) / len(vectors) if vectors else [0] * embedding_dim

In [22]:
# embed song descriptions as vectors
categorical_embeddings = [
    get_embedding(song_desc_tokens, word2Vec_model, embedding_dim)
    for song_desc_tokens in tokenised_song_descs
]

In [23]:
show_embeddings(categorical_embeddings)

Num Embeddings: 157685
Embedding Size: 15


[-1.5568612,
 1.5368818,
 1.2497157,
 -0.1815275,
 0.6017185,
 -0.38578767,
 -1.456556,
 -0.2645854,
 -0.3903007,
 1.1088337,
 0.68473357,
 0.27928087,
 0.6696859,
 -0.07945254,
 -0.8104405]

### B. Embed Numeric Song Metadata

There are 15 numeric columns in our data which we will use to make up the other half of our final embedding vectors.
First, however, we will scale these values to make them more uniform.

The standard scaled score of a sample `x` is calculated as:

$$
    z = \frac{(x - u)}{s}
$$

where `u` is the mean of the training samples and `s` is the standard deviation of the training samples.
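As a small worked example of this formula, the sketch below scales a made-up list of tempo values using only the standard library. It uses the population standard deviation, which is also what `np.std` computes by default; `standard_scale` and `tempos` are hypothetical names.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Return z-scores computed with the population standard deviation."""
    u = fmean(values)   # mean of the samples
    s = pstdev(values)  # population standard deviation of the samples
    return [(x - u) / s for x in values]


tempos = [60.0, 80.0, 100.0, 120.0]  # made-up tempo values
scaled = standard_scale(tempos)
print(scaled)

# After scaling, the values have mean 0 and standard deviation 1
print(round(fmean(scaled), 9), round(pstdev(scaled), 9))
```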

In [24]:
# extract numeric columns
numeric_cols = list(
    song_data.drop(columns=["song_name", "song_artists", "song_description"]).columns
)
numeric_cols

['song_acousticness',
 'song_danceability',
 'song_duration_ms',
 'song_energy',
 'song_explicit',
 'song_instrumentalness',
 'song_key',
 'song_liveness',
 'song_loudness',
 'song_mode',
 'song_popularity',
 'song_speechiness',
 'song_tempo',
 'song_valence',
 'song_year']

In [25]:
# scale these columns
scaled_numeric_cols = [
    (song_data[col] - song_data[col].mean()) / np.std(song_data[col])
    for col in numeric_cols
]

In [26]:
# transpose the array to get row embeddings
numeric_embeddings = list(map(list, zip(*scaled_numeric_cols)))

In [27]:
show_embeddings(numeric_embeddings)

Num Embeddings: 157685
Embedding Size: 15


[1.2703070294949106,
 -1.461259048884883,
 4.752569009266444,
 -1.007676175100162,
 -0.3092011481361043,
 2.262496351074803,
 1.3649563314116429,
 2.6110012104955738,
 -1.5078176079821606,
 0.6453499264358126,
 -1.2499471942272533,
 -0.38364744367670833,
 -1.1655450558051375,
 -1.7786347004763523,
 -2.142666230649]

### C. Merge Categorical & Numeric Embeddings

This leaves us with two sets of vectors: one representing the categorical column and one representing the numeric columns.
Both sets have 15 values each, so when we join these together, the resulting vector will have 30 values.
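One way to see why equal lengths matter: under Euclidean (L2) distance, the squared distance between two concatenated vectors is the sum of the squared distances of their two halves, so each half contributes its own term. The sketch below checks this on made-up 2 + 2 value vectors. All names and numbers are illustrative, and the relative scale of the two halves still affects how much each term contributes.

```python
import math

cat_a, num_a = [0.5, -1.0], [1.0, 2.0]  # made-up halves for song A
cat_b, num_b = [0.0, -1.0], [1.0, 0.0]  # made-up halves for song B


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


# Concatenate the categorical and numeric halves into one vector per song
full_a = cat_a + num_a
full_b = cat_b + num_b

total = sq_dist(full_a, full_b)
parts = sq_dist(cat_a, cat_b) + sq_dist(num_a, num_b)

print(len(full_a))                 # 4
print(math.isclose(total, parts))  # True
```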

In [28]:
row_embeddings = [
    np.concatenate([cat_row, num_row])
    for cat_row, num_row in zip(categorical_embeddings, numeric_embeddings)
]

In [29]:
show_embeddings(row_embeddings)

Num Embeddings: 157685
Embedding Size: 30


[-1.556861162185669,
 1.5368818044662476,
 1.2497156858444214,
 -0.1815274953842163,
 0.6017184853553772,
 -0.3857876658439636,
 -1.456555962562561,
 -0.26458540558815,
 -0.3903006911277771,
 1.10883367061615,
 0.6847335696220398,
 0.2792808711528778,
 0.6696859002113342,
 -0.07945253700017929,
 -0.8104404807090759,
 1.2703070294949106,
 -1.461259048884883,
 4.752569009266444,
 -1.007676175100162,
 -0.3092011481361043,
 2.262496351074803,
 1.3649563314116429,
 2.6110012104955738,
 -1.5078176079821606,
 0.6453499264358126,
 -1.2499471942272533,
 -0.38364744367670833,
 -1.1655450558051375,
 -1.7786347004763523,
 -2.142666230649]

### Create DataFrame With Embeddings

We can take these defined embeddings and create a Pandas DataFrame containing them.
This will be the table we insert into our vector database.

To enable proper filtering of the data once inserted into the KDB.AI vector database, we will pair these embedding vectors with three song description columns: `song_name`, `song_artists`, and `song_year`.

In [30]:
embedded_song_df = song_data[["song_name", "song_artists", "song_year"]].copy()

In [31]:
embedded_song_df["song_embeddings"] = row_embeddings

In [32]:
show_df(embedded_song_df)

(157685, 4)


Unnamed: 0,song_name,song_artists,song_year,song_embeddings
0,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve","Sergei Rachmaninoff, James Levine, Berliner Philharmoniker",1921,"[-1.556861162185669, 1.5368818044662476, 1.2497156858444214, -0.1815274953842163, 0.6017184853553772, -0.3857876658439636, -1.456555962562561, -0.26458540558815, -0.3903006911277771, 1.10883367061615, 0.6847335696220398, 0.2792808711528778, 0.6696859002113342, -0.07945253700017929, -0.8104404807090759, 1.2703070294949106, -1.461259048884883, 4.752569009266444, -1.007676175100162, -0.3092011481361043, 2.262496351074803, 1.3649563314116429, 2.6110012104955738, -1.5078176079821606, 0.6453499264358126, -1.2499471942272533, -0.38364744367670833, -1.1655450558051375, -1.7786347004763523, -2.142666230649]"
1,Clancy Lowered the Boom,Dennis Day,1921,"[-0.3698827922344208, 0.8136828541755676, 1.108202576637268, 0.0015807492891326547, 0.8693060874938965, 0.6672943830490112, -0.8026396632194519, 0.013652309775352478, -0.318189412355423, 0.08227983862161636, 0.7506681084632874, 0.7379704117774963, 0.6048781275749207, 0.005875506438314915, 0.22932788729667664, 0.6055353575765545, 1.600008527700499, -0.3958314054754836, -0.5219481676317469, -0.3092011481361043, -0.5349944602375606, 0.5117210944216953, -0.26640400859153673, -0.1646257835500665, 0.6453499264358126, -1.2044048830284118, 1.872793548131166, -1.8164526201212032, 1.6541851866630062, -2.142666230649]"
2,Gati Bali,KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat,1921,"[-0.4192536473274231, 0.5905762314796448, -0.17144986987113953, 0.39791154861450195, 0.9823921918869019, 0.27525532245635986, 0.24509450793266296, 0.10943138599395752, -0.48348066210746765, 0.2180800586938858, 0.714698851108551, -0.11662854254245758, 0.3113091289997101, 0.42125385999679565, -1.0842307806015015, 1.2144662090537686, -1.183477361379913, 2.1306274127125904, -1.1758127930699978, -0.3092011481361043, 2.374013638541697, -0.6259258882315683, -0.6025761034947833, -0.5873232499193564, 0.6453499264358126, -1.2044048830284118, -0.3997478418740689, -0.21005905433508043, -1.854615663007104, -2.142666230649]"
3,Danny Boy,Frank Parker,1921,"[-0.43691912293434143, 1.1469484567642212, 1.65033757686615, 0.08921252191066742, 0.6098982095718384, 0.6031222343444824, -1.0281951427459717, -0.05181722715497017, -0.16677124798297882, 0.2948986887931824, 0.9466884732246399, 0.8039681315422058, 0.9566766619682312, 0.23975332081317902, 0.39383548498153687, 1.2304207291798093, -1.4839351050077376, -0.16284109162119192, -0.6415119848547415, -0.3092011481361043, -0.5349062022700511, -0.05710239690493648, 0.9928168892663868, 0.3837053008849818, 0.6453499264358126, -1.295489505426095, -0.39080317620886856, -0.5426988976237884, -1.3774552183139837, -2.142666230649]"
4,When Irish Eyes Are Smiling,Phil Regan,1921,"[-0.24179011583328247, 0.7489233016967773, 0.8221080899238586, 0.10220704972743988, 1.1795166730880737, 0.37033796310424805, -1.143221139907837, 0.35877054929733276, -0.32916295528411865, 0.000727369450032711, 1.321073293685913, 0.46000105142593384, 0.9074721932411194, -0.14053796231746674, 0.2806406319141388, 1.203829862303075, -0.6732660986156829, -0.5052618172494476, -1.0749308222880962, -0.3092011481361043, -0.5349891074077622, -0.6259258882315683, 0.12674640748175162, 0.24684186220999385, 0.6453499264358126, -1.3410318166249366, -0.3752990890558546, -0.4921038246856425, -1.0431389831786766, -2.142666230649]"


## 3. Store Embeddings In KDB.AI

With the embeddings created, we need to store them in a vector database to enable efficient searching.

### Define KDB.AI Session

KDB.AI comes in two offerings:

1. [KDB.AI Cloud](https://trykdb.kx.com/kdbai/signup/) - For experimenting with smaller generative AI projects with a vector database in our cloud.
2. [KDB.AI Server](https://trykdb.kx.com/kdbaiserver/signup/) - For evaluating large scale generative AI applications on-premises or on your own cloud provider.

Depending on which you use there will be different setup steps and connection details required.

##### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details - a URL endpoint and an API key.
To get these you can sign up for free [here](https://trykdb.kx.com/kdbai/signup).

You can connect to a KDB.AI Cloud session using `kdbai.Session` and passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system and contain your KDB.AI Cloud portal details, these variables will automatically be used to connect.
If they do not exist, you will be prompted to enter your KDB.AI Cloud portal session URL endpoint and API key details.

In [33]:
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

In [36]:
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

Compatibility with the KDB.AI server is not guaranteed.


##### Option 2. KDB.AI Server

To use KDB.AI Server, you will need to download and run your own container.
To do this, you will first need to sign up for free [here](https://trykdb.kx.com/kdbaiserver/signup/). 

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.

In [35]:
# session = kdbai.Session(endpoint="http://localhost:8082")

### Define Vector DB Table Schema

The next step is to define the schema for the table in KDB.AI which will store our embeddings.

As mentioned above, our table will have four columns:
- Song Name
- Song Artists
- Song Year
- Song Embeddings

When defining the schema, we must supply the types of these columns. We can use the `.dtypes` attribute of the Pandas DataFrame to help with this.

In [37]:
embedded_song_df.dtypes

song_name          object
song_artists       object
song_year           int64
song_embeddings    object
dtype: object

In [38]:
schema = [
    {
        "name": "song_name",
        "type": "str",
    },
    {
        "name": "song_artists",
        "type": "bytes",
    },
    {
        "name": "song_year",
        "type": "int64",
    },
    {
        "name": "song_embeddings",
        "type": "float64s"
    }
]

indexes = [
    {
        "name" : "flat_index",
        "column" : "song_embeddings",
        "type" : "flat",
        "params":{
            "dims": len(numeric_cols) + embedding_dim,
            "metric": "L2"
        }
    }
]
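A flat index is generally understood as an exhaustive search: the query vector is compared against every stored vector and the closest ones are returned. The sketch below shows that idea with the L2 (Euclidean) metric in plain Python. It is a conceptual illustration on made-up 2-dimensional vectors, not KDB.AI's implementation, and `flat_l2_search` is a hypothetical name.

```python
import heapq
import math


def flat_l2_search(vectors, query, k):
    """Return the k (distance, index) pairs closest to `query` by Euclidean distance."""
    # Compare the query against every stored vector (exhaustive search)
    distances = ((math.dist(vec, query), i) for i, vec in enumerate(vectors))
    return heapq.nsmallest(k, distances)


stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]  # made-up vectors
nearest = flat_l2_search(stored, query=[1.0, 1.0], k=2)
print([index for _, index in nearest])  # [1, 3]
```

Exhaustive search gives exact results at the cost of scanning every vector, which is typically reasonable for a dataset of this size.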

### Create Vector DB Table

Use the KDB.AI `create_table` function to create a table that matches the defined schema in the vector database.

In [39]:
# Get database connection. Default database is 'default'.
database = session.database("default")

# First ensure the table does not already exist
try:
    database.table("songs").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

In [40]:
table = database.create_table("songs", schema=schema, indexes=indexes)

### Add Embedded Data to KDB.AI Table

When adding larger amounts of data, you should insert data into an index in chunks.

Before inserting, it helps to check how large the dataset is.

In [41]:
embedded_song_df.memory_usage(deep=True).sum() / (1024**2)

79.63715362548828

This dataset is roughly 80 MB, which exceeds the limit of under 10 MB per insert. As such, we'll insert the data in chunks of 10,000 rows at a time.
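As a rough check on the chunk size, the sketch below estimates the size of one chunk from the figures reported in this notebook and lists the row ranges that cover the table. It treats the Pandas in-memory size as a proxy for the insert payload size, which is an assumption; `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row positions covering n_rows in steps of chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


total_mb = 79.64    # in-memory size reported by memory_usage above
n_rows = 157_685    # row count of the embedded DataFrame
chunk_size = 10_000

# Average size per row, scaled up to one chunk
approx_mb_per_chunk = total_mb / n_rows * chunk_size
print(round(approx_mb_per_chunk, 2))  # 5.05

bounds = list(chunk_bounds(n_rows, chunk_size))
print(len(bounds), bounds[0], bounds[-1])  # 16 (0, 10000) (150000, 157685)
```

Stepping through `range` with the chunk size also avoids producing an empty final chunk when the row count is an exact multiple of the chunk size.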

In [42]:
chunk_size = 10_000

In [43]:
# Replace near-empty artist values (length 1) with the placeholder string 'None',
# as empty values cause issues in search filters.
for index, row in embedded_song_df.iterrows():
    artists = row['song_artists']
    if len(artists) == 1:
        embedded_song_df.loc[index, 'song_artists'] = 'None'

for i in tqdm(range((len(embedded_song_df) // chunk_size) + 1)):
    index = i * chunk_size
    data = embedded_song_df.iloc[index : index + chunk_size].reset_index(drop=True)
    # change data types as per table schema
    data['song_artists'] = data['song_artists'].str.encode('utf-8')
    table.insert(data)

  0%|          | 0/16 [00:00<?, ?it/s]

### Verify Data Has Been Inserted

Running `table.query()` should show us that data has been added.

In [44]:
show_df(table.query())

(157685, 4)


Unnamed: 0,song_name,song_artists,song_year,song_embeddings
0,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve","b'Sergei Rachmaninoff, James Levine, Berliner Philharmoniker'",1921,"[-1.556861162185669, 1.5368818044662476, 1.2497156858444214, -0.1815274953842163, 0.6017184853553772, -0.3857876658439636, -1.456555962562561, -0.26458540558815, -0.3903006911277771, 1.10883367061615, 0.6847335696220398, 0.2792808711528778, 0.6696859002113342, -0.07945253700017929, -0.8104404807090759, 1.2703070294949106, -1.461259048884883, 4.752569009266444, -1.007676175100162, -0.3092011481361043, 2.262496351074803, 1.3649563314116429, 2.6110012104955738, -1.5078176079821606, 0.6453499264358126, -1.2499471942272533, -0.38364744367670833, -1.1655450558051375, -1.7786347004763523, -2.142666230649]"
1,Clancy Lowered the Boom,b'Dennis Day',1921,"[-0.3698827922344208, 0.8136828541755676, 1.108202576637268, 0.0015807492891326547, 0.8693060874938965, 0.6672943830490112, -0.8026396632194519, 0.013652309775352478, -0.318189412355423, 0.08227983862161636, 0.7506681084632874, 0.7379704117774963, 0.6048781275749207, 0.005875506438314915, 0.22932788729667664, 0.6055353575765545, 1.600008527700499, -0.3958314054754836, -0.5219481676317469, -0.3092011481361043, -0.5349944602375606, 0.5117210944216953, -0.26640400859153673, -0.1646257835500665, 0.6453499264358126, -1.2044048830284118, 1.872793548131166, -1.8164526201212032, 1.6541851866630062, -2.142666230649]"
2,Gati Bali,b'KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat',1921,"[-0.4192536473274231, 0.5905762314796448, -0.17144986987113953, 0.39791154861450195, 0.9823921918869019, 0.27525532245635986, 0.24509450793266296, 0.10943138599395752, -0.48348066210746765, 0.2180800586938858, 0.714698851108551, -0.11662854254245758, 0.3113091289997101, 0.42125385999679565, -1.0842307806015015, 1.2144662090537686, -1.183477361379913, 2.1306274127125904, -1.1758127930699978, -0.3092011481361043, 2.374013638541697, -0.6259258882315683, -0.6025761034947833, -0.5873232499193564, 0.6453499264358126, -1.2044048830284118, -0.3997478418740689, -0.21005905433508043, -1.854615663007104, -2.142666230649]"
3,Danny Boy,b'Frank Parker',1921,"[-0.43691912293434143, 1.1469484567642212, 1.65033757686615, 0.08921252191066742, 0.6098982095718384, 0.6031222343444824, -1.0281951427459717, -0.05181722715497017, -0.16677124798297882, 0.2948986887931824, 0.9466884732246399, 0.8039681315422058, 0.9566766619682312, 0.23975332081317902, 0.39383548498153687, 1.2304207291798093, -1.4839351050077376, -0.16284109162119192, -0.6415119848547415, -0.3092011481361043, -0.5349062022700511, -0.05710239690493648, 0.9928168892663868, 0.3837053008849818, 0.6453499264358126, -1.295489505426095, -0.39080317620886856, -0.5426988976237884, -1.3774552183139837, -2.142666230649]"
4,When Irish Eyes Are Smiling,b'Phil Regan',1921,"[-0.24179011583328247, 0.7489233016967773, 0.8221080899238586, 0.10220704972743988, 1.1795166730880737, 0.37033796310424805, -1.143221139907837, 0.35877054929733276, -0.32916295528411865, 0.000727369450032711, 1.321073293685913, 0.46000105142593384, 0.9074721932411194, -0.14053796231746674, 0.2806406319141388, 1.203829862303075, -0.6732660986156829, -0.5052618172494476, -1.0749308222880962, -0.3092011481361043, -0.5349891074077622, -0.6259258882315683, 0.12674640748175162, 0.24684186220999385, 0.6453499264358126, -1.3410318166249366, -0.3752990890558546, -0.4921038246856425, -1.0431389831786766, -2.142666230649]"


## 4. Search For Similar Songs To A Target Song

Now that the data has been inserted into the database, we can perform some queries on the data.

### Find Songs By A Certain Artist

We can query the database to find songs by particular artists using KDB.AI's `.query()` function.

Here, we want to return all songs in the dataset by the DJ `Calvin Harris`, sorted by year. This returns 32 songs; the first ten are shown below.

In [45]:
table.query(filter=[("like", "song_artists", "*Calvin Harris*")], sort_columns=["song_year"])

Unnamed: 0,song_name,song_artists,song_year,song_embeddings
0,Flashback,b'Calvin Harris',2009,"[-0.3945140242576599, 1.0759578943252563, 0.6507813334465027, 0.026049286127090454, 1.2200939655303955, 0.3275987505912781, -0.6032612323760986, -0.09718676656484604, 0.17846214771270752, 0.11843523383140564, 0.4491797685623169, 0.8082219362258911, 0.9458396434783936, 0.3164554536342621, 0.3685174882411957, -1.3351724705550174, -1.5406252453148743, -0.011030115749796763, 1.7460279903169285, -0.3092011481361043, -0.5313940449564866, 1.0805445857483271, -0.7039975151774578, 1.0250333372402145, -1.549546934208044, 0.7539144985217788, -0.307319630000332, 0.36420802995835755, -1.2520866301382436, 1.247808968405853]"
1,You Used To Hold Me,b'Calvin Harris',2009,"[-0.1907130479812622, 1.1836649179458618, 1.0103275775909424, -0.23076722025871277, 1.1485596895217896, 0.3097757399082184, -1.13646399974823, 0.31705376505851746, 0.0680871307849884, 0.03221854940056801, 1.4432165622711182, 0.33792173862457275, 1.2677819728851318, 0.2987383008003235, 0.30935192108154297, -1.279491195315136, 0.46053670752705134, 0.004253770518922344, 1.8244917453695186, -0.3092011481361043, -0.5349904137531296, 1.6493680770749588, 0.6224578016611152, 0.958707209266951, -1.549546934208044, 0.5717452537264123, -0.35562082459241384, 0.3955535635781341, -1.2824790151505443, 1.247808968405853]"
2,I'm Not Alone - Radio Edit,b'Calvin Harris',2009,"[-0.41074925661087036, 1.0968719720840454, 0.9735289812088013, -0.17952172458171844, 1.0419899225234985, 0.41162243485450745, -1.0159991979599, 0.3421984910964966, 0.061519723385572433, 0.21339663863182068, 1.2095028162002563, 0.5175282955169678, 1.0884373188018799, 0.04183227941393852, 0.3326178193092346, -1.3273547556932577, 0.3074733286977821, -0.1491385696844665, 0.7932538218211915, -0.3092011481361043, -0.14309084999676247, 0.5117210944216953, 0.4914076629700189, 0.8665875870818629, 0.6453499264358126, 0.3895760089310457, -0.411077751716656, 0.46191895166730823, -0.36310936852844955, 1.247808968405853]"
3,We Found Love,"b'Rihanna, Calvin Harris'",2011,"[-0.26080122590065, 1.2835898399353027, 0.9702048301696777, -0.20885464549064636, 1.040050983428955, 0.30976125597953796, -1.026407241821289, 0.2817744314670563, 0.05440308153629303, 0.20747336745262146, 1.0128121376037598, 0.7217922806739807, 1.27660071849823, 0.1652015894651413, 0.2759507894515991, -1.2744389306085564, 1.1181423350898372, -0.12151213480453658, 1.0660087798611477, -0.3092011481361043, -0.5305974929031516, -1.1947493795582, -0.5626912786757541, 1.2313812909348116, 0.6453499264358126, 1.9835569008905032, -0.37351015592281456, 0.36375280436636925, 0.2751307167298641, 1.3248652229298272]"
4,Dance Wiv Me - Radio Edit,"b'Dizzee Rascal, Calvin Harris, Chrome'",2011,"[-0.5109012126922607, 0.9720720648765564, 0.799556314945221, -0.21112249791622162, 0.8161373138427734, 0.4482901692390442, -0.6652621030807495, 0.12716078758239746, -0.16659505665302277, 0.218120738863945, 0.6433840394020081, 0.3907545506954193, 0.7545459866523743, 0.25867629051208496, 0.2638013958930969, -1.2143435714671371, 1.9344803555126058, -0.20954668716662583, 0.9912813940967762, -0.3092011481361043, -0.5349944602375606, 1.6493680770749588, -0.30059100129356187, 1.2671763441267319, 0.6453499264358126, 1.5281337889020867, -0.3329610049072396, -0.1561798539118754, 1.00454795702508, 1.3248652229298272]"
5,Feel So Close - Radio Edit,b'Calvin Harris',2012,"[-0.41266751289367676, 1.123816967010498, 0.9636674523353577, -0.15696801245212555, 1.0778895616531372, 0.5142435431480408, -0.9365121126174927, 0.18350185453891754, 0.0030023125000298023, 0.1669534295797348, 1.0780141353607178, 0.5214424729347229, 1.2052561044692993, 0.1277695745229721, 0.4630122184753418, -1.3383314655399736, 0.9650789562605678, -0.19120286091549893, 1.6563551273996828, -0.3092011481361043, -0.5125954164977816, 0.5117210944216953, -0.015699395443352974, 1.5196718418873827, 0.6453499264358126, 2.2112684568847114, -0.4170408621601229, 0.36215951479440944, 1.4870270690953529, 1.363393350191814]"
6,Sweet Nothing (feat. Florence Welch),"b'Calvin Harris, Florence Welch'",2012,"[-0.5268850922584534, 1.3849986791610718, 1.004446268081665, -0.3079139292240143, 0.80889493227005, 0.24300502240657806, -0.9190152287483215, 0.09049827605485916, 0.11909905821084976, 0.09297098219394684, 0.7897297739982605, 0.7790522575378418, 0.8303209543228149, 0.1626654863357544, 0.09249763935804367, -0.8170760203287275, 0.20543107614493594, -0.14259962817167257, 1.6750369738407758, -0.3092011481361043, -0.5346376049176665, 0.7961328400850112, -0.8549900662780685, 1.3266593001662457, -1.549546934208044, 1.8469299672939783, 0.048081752430295104, 0.36206196645326905, 0.20674785045218758, 1.363393350191814]"
7,I Need Your Love (feat. Ellie Goulding),"b'Calvin Harris, Ellie Goulding'",2012,"[-0.5104639530181885, 1.4303083419799805, 0.990503191947937, -0.34354090690612793, 0.8902595639228821, 0.21105189621448517, -0.8916533589363098, 0.11468739807605743, 0.29771849513053894, 0.16011016070842743, 0.9450675845146179, 0.4534563422203064, 1.0149157047271729, 0.05381403863430023, 0.18583163619041443, -0.2506905558542882, 0.8970507878920038, 0.030931386799656055, 1.4508548165476611, -0.3092011481361043, -0.5349944602375606, 0.7961328400850112, 0.17232906441778495, 1.1294355757166477, 0.6453499264358126, 1.8013876560951365, -0.31387905148814554, 0.2663020115671261, 0.1991497541991124, 1.363393350191814]"
8,Let's Go (feat. Ne-Yo),"b'Calvin Harris, Ne-Yo'",2012,"[-0.5206174850463867, 1.508386492729187, 1.1125340461730957, -0.47822290658950806, 0.7240991592407227, 0.4544106721878052, -1.0131542682647705, 0.08151860535144806, 0.27689993381500244, 0.2404632270336151, 0.913619339466095, 0.5367823839187622, 0.9807852506637573, 0.0945235937833786, 0.13261589407920837, -1.3202549942371695, 0.9820859983527089, 0.01743444222608983, 1.4994276172945025, -0.3092011481361043, -0.5104287949127105, -0.34151414256825235, 0.4971054950870231, 1.5038799066556534, -1.549546934208044, 1.5736761001009283, -0.24709221452131636, 0.36472828777777266, 1.3198689515276991, 1.363393350191814]"
9,Thinking About You (feat. Ayah Marar),"b'Calvin Harris, Ayah Marar'",2012,"[-0.395300030708313, 1.2428056001663208, 0.7758424282073975, -0.2259850949048996, 0.7068857550621033, 0.27970224618911743, -0.7813277840614319, 0.1254027783870697, 0.17263813316822052, 0.2683042287826538, 0.7502059936523438, 0.46803006529808044, 0.8620699644088745, 0.13321173191070557, 0.13766808807849884, -1.3339492906786878, 1.067121208813414, 0.1370883744063931, 1.4695366629887539, -0.3092011481361043, -0.5336817424536646, -1.479161125221516, -0.6322048305032051, 1.3664900701396077, -1.549546934208044, 1.61921841129977, -0.36575811234630756, 0.3637202882526556, 0.8373898394574263, 1.363393350191814]"


### Find A Specific Song

We can filter this query further by looking for the song `We Found Love` by `Calvin Harris` in our dataset.

This returns a single song, as the dataset contains only one track with this name that credits him.

In [46]:
table.query(
    filter=[
        ("like", "song_artists", "*Calvin Harris*"),
        ("like", "song_name", "*We Found Love*"),
    ]
)

Unnamed: 0,song_name,song_artists,song_year,song_embeddings
0,We Found Love,"b'Rihanna, Calvin Harris'",2011,"[-0.26080122590065, 1.2835898399353027, 0.9702048301696777, -0.20885464549064636, 1.040050983428955, 0.30976125597953796, -1.026407241821289, 0.2817744314670563, 0.05440308153629303, 0.20747336745262146, 1.0128121376037598, 0.7217922806739807, 1.27660071849823, 0.1652015894651413, 0.2759507894515991, -1.2744389306085564, 1.1181423350898372, -0.12151213480453658, 1.0660087798611477, -0.3092011481361043, -0.5305974929031516, -1.1947493795582, -0.5626912786757541, 1.2313812909348116, 0.6453499264358126, 1.9835569008905032, -0.37351015592281456, 0.36375280436636925, 0.2751307167298641, 1.3248652229298272]"


### Find Similar Songs To This Song

We can then copy and paste the vector associated with this song below and save it as the variable `my_vec`.

We will then use KDB.AI's `.search()` function to find similar songs in the dataset to this song using this vector.
We will pull out the 5 songs most similar to this song from the dataset.

<div class="alert alert-block alert-warning">
    <b>Note: </b>
    The most similar song will be "We Found Love" by "Rihanna, Calvin Harris" itself, as its vector is the one we are using for the search.
</div>

In [47]:
my_vec = [[-0.22046250104904175, 1.4409897327423096, 1.0603516101837158, -0.15190696716308594, 0.7062567472457886, 0.4542839527130127, -1.1391438245773315, 0.2879817485809326, 0.00519922748208046, 0.17919489741325378, 1.0258654356002808, 0.5709000825881958, 1.2584115266799927, -0.15058273077011108, 0.18762657046318054, -1.2744389306085728, 1.1181423350898205, -0.12151213480453628, 1.0660087798611322, -0.30920114813582733, -0.5305974929034635, -1.1947493795585982, -0.562691278675705, 1.2313812909347996, 0.6453499264345925, 1.9835569008902227, -0.3735101559228201, 0.36375280436636764, 0.2751307167298588, 1.324865222929866]]
vector = {'flat_index' : my_vec}

In [48]:
table.search(vectors=vector, n=5)[0]

Unnamed: 0,__nn_distance,song_name,song_artists,song_year,song_embeddings
0,0.316836,We Found Love,"b'Rihanna, Calvin Harris'",2011,"[-0.26080122590065, 1.2835898399353027, 0.9702048301696777, -0.20885464549064636, 1.040050983428955, 0.30976125597953796, -1.026407241821289, 0.2817744314670563, 0.05440308153629303, 0.20747336745262146, 1.0128121376037598, 0.7217922806739807, 1.27660071849823, 0.1652015894651413, 0.2759507894515991, -1.2744389306085564, 1.1181423350898372, -0.12151213480453658, 1.0660087798611477, -0.3092011481361043, -0.5305974929031516, -1.1947493795582, -0.5626912786757541, 1.2313812909348116, 0.6453499264358126, 1.9835569008905032, -0.37351015592281456, 0.36375280436636925, 0.2751307167298641, 1.3248652229298272]"
1,1.09343,Bad At Love,b'Halsey',2017,"[-0.16537222266197205, 1.2058887481689453, 1.2794114351272583, 0.0658358559012413, 0.7568423748016357, 0.4529845714569092, -1.054943323135376, 0.44059205055236816, -0.1214047223329544, 0.20079302787780762, 1.3151286840438843, 0.683242917060852, 0.8072047233581543, 0.12309160083532333, 0.3925808072090149, -1.1803072618649173, 0.7836705072777309, -0.3899329165171471, 1.0099632405378691, -0.3092011481361043, -0.5349944602375606, -1.479161125221516, -0.6692407392637322, 1.3973720768149898, 0.6453499264358126, 1.9380145896916616, -0.4253892167809766, 0.051533080489714826, 0.32071929424831513, 1.556033986501749]"
2,1.332595,9 and Three Quarters (Run Away),b'TOMORROW X TOGETHER',2019,"[-0.6741535663604736, 1.0576506853103638, 1.14078688621521, -0.2761414349079132, 0.6572062969207764, 0.38944244384765625, -1.0103001594543457, 0.13655011355876923, -0.06112262234091759, 0.12571658194065094, 1.214206337928772, 0.3262106776237488, 0.7272558212280273, -0.15216736495494843, 0.23714852333068848, -1.3262911210181882, 0.5795860021720386, -0.14807905558203074, 1.028645086978962, -0.3092011481361043, -0.5349944602375606, -1.1947493795582, -0.4373389721016621, 1.2427865774910607, 0.6453499264358126, 1.8013876560951365, -0.23934017094480936, 0.3946756285078711, 0.016795444125308742, 1.6330902410257229]"
3,1.343494,Stay Gold,b'BTS',2020,"[-0.3429849147796631, 0.9595193266868591, 0.9389689564704895, -0.034431278705596924, 0.7902694344520569, 0.4909695088863373, -1.0156073570251465, 0.11812451481819153, -0.2790054976940155, 0.19211310148239136, 1.1905966997146606, 0.5811991095542908, 0.7114991545677185, -0.043654054403305054, 0.3609951138496399, -1.1037255652599227, 1.0557831807519866, 0.1001951445409801, 0.5728080338162957, -0.3092011481361043, -0.5349944602375606, -1.1947493795582, -0.7296377597039766, 1.0681979602069414, 0.6453499264358126, 2.2112684568847114, -0.2786966998716909, 0.39727691760494743, 0.20674785045218758, 1.67161836828771]"
4,1.391451,Sweet but Psycho,b'Ava Max',2020,"[-0.23769669234752655, 1.0227880477905273, 0.793926477432251, -0.16728071868419647, 0.9293109774589539, 0.3843829929828644, -1.0454347133636475, 0.2793857157230377, -0.145338773727417, 0.6457213163375854, 0.9422516226768494, 0.6954149603843689, 0.7426178455352783, -0.13997024297714233, 0.19155041873455048, -1.15903456836353, 1.0387761386598457, -0.3412506155567211, 0.841826622568033, -0.3092011481361043, -0.5349944602375606, -1.1947493795582, -0.23221701588951166, 1.1903222593323153, 0.6453499264358126, 2.393437701680078, -0.31984216193161247, 0.5268536307530907, 0.3511116792606158, 1.67161836828771]"
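
To make the `__nn_distance` column more concrete, the sketch below shows what an exhaustive ("flat") nearest-neighbour search does conceptually: compute the distance from the query vector to every stored vector and keep the `n` smallest. It is a minimal pure-Python illustration, not KDB.AI's implementation. The Euclidean (L2) metric, the `brute_force_search` helper and the toy three-dimensional vectors are all assumptions made for the example, and the metric configured on the index in this notebook may differ.

```python
import math


def brute_force_search(query, vectors, n=5):
    """Return the n (distance, index) pairs closest to query, nearest first."""
    # Assumption: Euclidean (L2) distance; hypothetical helper, not a KDB.AI API.
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    return sorted(distances)[:n]


# Toy 3-dimensional "song embeddings" (made-up values for illustration only).
toy_vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 3.0],
]

# The stored vector at index 1 is identical to the query, so it ranks first.
results = brute_force_search([1.0, 0.0, 0.0], toy_vectors, n=3)
for distance, idx in results:
    print(f"index={idx} distance={distance:.3f}")
```

A smaller distance means a closer match, which is why the song used as the query appears at the top of the search results.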


### Automate This Song Similarity Search Process

We can define a function to automate this process and find songs which are the most similar to any input song.
This will allow us to use the KDB.AI vector database in a more production-like setting to perform similarity search and music recommendation.

In [49]:
def find_similar_songs(
    vectorDB_song_tab,
    song_name: str,
    song_artists: list[str] = None,
    song_year: int = None,
    n_similar: int = 5,
    exact: bool = False,
) -> None:
    # create filter list
    filter_list = [("like", "song_name", f"{song_name}" if exact else f"*{song_name}*")]
    if song_artists:
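        # a single artist passed as a string becomes a one-element list
        # (list("abc") would split the name into individual characters)
        if isinstance(song_artists, str):
            song_artists = [song_artists]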
        for artist in song_artists:
            filter_list.append(("like", "song_artists", f"*{artist}*"))
    if song_year:
        filter_list.append(("like", "song_year", f"{song_year}"))

    # find songs like this in vector DB
    resulting_song = vectorDB_song_tab.query(filter=filter_list, sort_columns=["song_year"])

    # quality check
    if resulting_song.empty:
        print(
            "Song Not Found! Please double check the values entered or try another song"
        )
        return

    # find vectors associated with these songs
    resulting_vectors = {'flat_index': [v.tolist() for v in resulting_song["song_embeddings"]]}

    # search for similar songs to selected songs
    similar_songs = vectorDB_song_tab.search(vectors=resulting_vectors, n=n_similar + 1)

    # process similar song table
    for i, similar_df in enumerate(similar_songs):
        name = resulting_song.loc[i, "song_name"]
        artists = resulting_song.loc[i, "song_artists"]
        year = resulting_song.loc[i, "song_year"]
        print(f"Songs Similar To '{name}' By '{artists}' ({year})")
        for j, song in similar_df[1:].iterrows():
            print(
                f"   {j}. {song['song_name']} - {song['song_artists']} ({song['song_year']})"
            )
        print()
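
The filter-building step of a function like this can be examined on its own, without a KDB.AI connection. The sketch below is a hypothetical helper (`build_song_filters` is not part of the notebook or of the KDB.AI client) that produces the same kind of `("like", column, pattern)` tuples. It assumes that a single artist passed as a plain string should be matched as one whole name, so the string is wrapped in a one-element list before the loop.

```python
from typing import Optional, Union


def build_song_filters(
    song_name: str,
    song_artists: Optional[Union[str, list]] = None,
    song_year: Optional[int] = None,
    exact: bool = False,
) -> list:
    """Build a list of ("like", column, pattern) filter tuples for a song lookup."""
    # Wrap the name in wildcards unless an exact title match was requested.
    filters = [("like", "song_name", song_name if exact else f"*{song_name}*")]
    if song_artists:
        # A bare string is one artist; wrap it so we do not iterate over characters.
        if isinstance(song_artists, str):
            song_artists = [song_artists]
        for artist in song_artists:
            filters.append(("like", "song_artists", f"*{artist}*"))
    if song_year:
        filters.append(("like", "song_year", f"{song_year}"))
    return filters


print(build_song_filters("Californication", song_artists="Red Hot Chili Peppers"))
```

Keeping this logic separate makes it easy to check the generated patterns before they are sent to the vector database.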

##### Songs by multiple artists

Here, we will call this function to search the KDB.AI vector database for the song `Let's Go` by two artists - `Calvin Harris` and `Ne-Yo`.

In [50]:
find_similar_songs(table, song_name="Let's Go", song_artists=["Calvin Harris", "Ne-Yo"])

Songs Similar To 'Let's Go (feat. Ne-Yo)' By 'b'Calvin Harris, Ne-Yo'' (2012)
   1. I Cry - b'Flo Rida' (2012)
   2. Mmm Yeah (feat. Pitbull) - b'Austin Mahone, Pitbull' (2014)
   3. Too Much (feat. Usher) - b'Marshmello, Imanbek, Usher' (2020)
   4. All Around The World - b'Justin Bieber, Ludacris' (2012)
   5. No Money - b'Galantis' (2016)



If you search for these songs on [YouTube](https://www.youtube.com/), you will find that the results have quite a similar vibe to the song we searched for, which shows the power of similarity search in KDB.AI.
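
The artist names printed by `find_similar_songs` appear as `b'...'`, which is how Python renders a byte string. This suggests the `song_artists` column holds byte strings or their text representation. The sketch below is a hypothetical display helper (`clean_artists` is not part of the notebook) that handles both cases, assuming the bytes are UTF-8 encoded.

```python
import ast


def clean_artists(value) -> str:
    """Return a display-friendly artist string without the b'...' wrapper."""
    # Hypothetical helper: accepts real bytes as well as their text form "b'...'".
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        value = ast.literal_eval(value)  # turn the text form back into bytes
    if isinstance(value, bytes):
        return value.decode("utf-8")  # assumption: UTF-8 encoded names
    return str(value)


print(clean_artists(b"Calvin Harris, Ne-Yo"))
print(clean_artists("b'Calvin Harris, Ne-Yo'"))
```

Both calls print `Calvin Harris, Ne-Yo`. A helper like this could be applied to `song['song_artists']` inside the print loop to tidy up the output.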

##### Specify different number of similar songs

We can adjust the number of results returned to us by specifying the `n_similar` parameter.
Here, we will search for the `8` most similar songs to the song `Californication` by `Red Hot Chili Peppers`.

In [51]:
find_similar_songs(
    table,
    song_name="Californication",
    song_artists="Red Hot Chili Peppers",
    n_similar=8,
)

Songs Similar To 'Californication' By 'b'Red Hot Chili Peppers'' (1999)
   1. Police Station - b'Red Hot Chili Peppers' (2011)
   2. Charlie - b'Red Hot Chili Peppers' (2006)
   3. Dark Necessities - b'Red Hot Chili Peppers' (2016)
   4. Especially in Michigan - b'Red Hot Chili Peppers' (2006)
   5. Don't Forget Me - b'Red Hot Chili Peppers' (2002)
   6. Cabron - b'Red Hot Chili Peppers' (2002)
   7. Look Around - b'Red Hot Chili Peppers' (2011)
   8. Face Down - b'The Red Jumpsuit Apparatus' (2006)



When we return these similar songs, you will notice that a lot of the similar songs are by the same artist as the original song we searched for - `Red Hot Chili Peppers`.
This makes sense as their music usually has pretty unique features. 

##### All songs with a given name by any artist

Finally, we will run a similarity search for every song stored in the KDB.AI vector database that has a given song name - `Love Me`.

In [52]:
find_similar_songs(table, song_name="Love Me", exact=True)

Songs Similar To 'Love Me' By 'b'Elvis Presley'' (1956)
   1. Don't - b'Elvis Presley' (1959)
   2. Without Him - b'Elvis Presley' (1967)
   3. Everything I Have Is Yours - 10'' Version - b'Billie Holiday' (1956)
   4. Fine And Mellow - b'Billie Holiday' (1957)
   5. His Hand in Mine - b'Elvis Presley' (1960)

Songs Similar To 'Love Me' By 'b'Buddy Holly'' (1958)
   1. Johnny be good - Radio Version - b'Jonny Bombastic' (1955)
   2. Midnight Shift - b'Buddy Holly' (1958)
   3. A Love That's Worth Having - b'Willie Hutch' (1969)
   4. Lonely Weekends - b'Wanda Jackson' (1961)
   5. Rock & Roll Guitar - b'Johnny Knight' (1959)

Songs Similar To 'Love Me' By 'b'Sarah Vaughan'' (1958)
   1. Summer Is Gone - b'Carmen McRae' (1956)
   2. Make the World Go Away - b'Ray Price' (1956)
   3. Your Love Has Faded - b'Johnny Hodges' (1961)
   4. I'm Confessin' (That I Love You) - b'Judy Garland' (1958)
   5. I Got It Bad And That Ain't Good - b'Eileen Farrell' (1954)

Songs Similar To 'Love Me' By 

There are 15 songs in our vector database with `Love Me` as their title, and we have returned the most similar songs to each of them.
With that, we have built a recommendation system that is able to recommend music based on both numerical and categorical song data.

## 5. Delete the KDB.AI Table

Once finished with the table, it is best practice to drop it.

In [53]:
table.drop()

## Take Our Survey

We hope you found this sample helpful! Your feedback is important to us, and we would appreciate it if you could take a moment to fill out our brief survey. Your input helps us improve our content.

[**Take the Survey**](https://delighted.com/t/gvqeAOuO)