<a href="https://colab.research.google.com/github/LeonelFNR/Cover-Detection-System/blob/main/Project_Cover_Detection_Leonel_Fernando_Nabaza_Ruibal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎶 Cover Identification System Using Lyrics Only

## Overview

My aim in this project is to build a system that takes a YouTube music video URL and returns the title and artist of the song by matching its lyrics against a database of songs.

The input/output format will be like the following one:

```
covers = get_covers(youtube_url, k)
```
Where `covers` is a list of dicts with length `k` sorted by score.
```
[
    {"title": "Title 1", "artist": "Artist 1", "score": 95.0},
    {"title": "Title 2", "artist": "Artist 2", "score": 89.5},
    ...
]
```

The idea behind this notebook is the following:

Setup:
1. Download a lyrics dataset
2. Extract embeddings for each song's lyrics
3. Create a vector index (database) for fast retrieval of similar lyrics

Then, for each YouTube URL (query):

1. Download the YouTube video to a temporary file
2. Transcribe the lyrics using the Whisper model
3. Extract the embeddings of the transcribed lyrics
4. Search the top-k similar entries in the vector database and return the song title and artist
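
To make the shape of the query steps concrete, here is a toy sketch of the search part only. It is not the implementation used in this notebook: the `SONGS` list, `toy_embed` (plain word counts instead of a neural embedding) and `toy_get_covers` are hypothetical names invented for illustration, and the transcription step is replaced by a hard-coded string.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the real lyrics database.
SONGS = [
    {"title": "Title 1", "artist": "Artist 1", "lyrics": "hold me tight and never let go"},
    {"title": "Title 2", "artist": "Artist 2", "lyrics": "dancing all night under city lights"},
    {"title": "Title 3", "artist": "Artist 3", "lyrics": "let go of the night and hold the light"},
]

def toy_embed(text):
    """Toy embedding: word counts (NOT the model used in this notebook)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def toy_get_covers(transcribed_lyrics, k):
    """Return the k most similar songs as dicts sorted by descending score."""
    query = toy_embed(transcribed_lyrics)
    scored = [
        {"title": s["title"], "artist": s["artist"],
         "score": round(100 * cosine(query, toy_embed(s["lyrics"])), 1)}
        for s in SONGS
    ]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:k]

covers = toy_get_covers("hold me tight and won't let go", 2)
print(covers)
```

The real system swaps the word counts for sentence embeddings and the linear scan for a `faiss` index, but the input and output have the same structure.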



# 🚀Setup

## Set the runtime type

Set the runtime type of this Google Colab notebook to T4 GPU.



## Install `insanely-fast-whisper`

This is a library for running the Whisper model for audio-to-text transcription.

Note that first you need to install `pipx`.

I will check that it works with this URL: https://www.signalogic.com/melp/EngSamples/Orig/male.wav

Notes:
* The installation is slow; it might take a few minutes.
* If the `insanely-fast-whisper` executable is not globally available once installed, just run it with its absolute path: `/root/.local/bin/insanely-fast-whisper`. It might be tricky to make it globally available inside this Colab.
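
As an alternative to typing the absolute path each time, one option is to extend `PATH` from Python. The sketch below is a minimal example under the assumption that the executable lives in `/root/.local/bin`, as stated above; `add_to_path` is a hypothetical helper name. Since commands started from the notebook inherit the process environment, this should make the executable visible to later cells, although I have not relied on it in the rest of this notebook.

```python
import os
import shutil

def add_to_path(directory):
    """Append `directory` to PATH for this process (and its child processes) if missing."""
    parts = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join([p for p in parts if p] + [directory])

add_to_path("/root/.local/bin")

# shutil.which returns the full path of the executable, or None if it is still not found
print(shutil.which("insanely-fast-whisper"))
```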

In [None]:
!pip install pipx

Collecting pipx
  Downloading pipx-1.7.1-py3-none-any.whl.metadata (18 kB)
Collecting argcomplete>=1.9.4 (from pipx)
  Downloading argcomplete-3.5.2-py3-none-any.whl.metadata (16 kB)
Collecting userpath!=1.9,>=1.6 (from pipx)
  Downloading userpath-1.9.2-py3-none-any.whl.metadata (3.0 kB)
Downloading pipx-1.7.1-py3-none-any.whl (78 kB)
Downloading argcomplete-3.5.2-py3-none-any.whl (43 kB)
Downloading userpath-1.9.2-py3-none-any.whl (9.1 kB)
Installing collected packages: userpath, argcomplete, pipx
Successfully installed argcomplete-3.5.2 pipx-1.7.1 userpath-1.9.2


In order to avoid problems with the virtual environment, it is advisable to install `python3.10-venv`.

In [None]:
!apt install python3.10-venv

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  python3-pip-whl python3-setuptools-whl
The following NEW packages will be installed:
  python3-pip-whl python3-setuptools-whl python3.10-venv
0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.
Need to get 2,474 kB of archives.
After this operation, 2,885 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-pip-whl all 22.0.2+dfsg-1ubuntu0.5 [1,680 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-setuptools-whl all 59.6.0-1.2ubuntu0.22.04.2 [788 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3.10-venv amd64 3.10.12-1~22.04.7 [5,718 B]
Fetched 2,474 kB in 1s (2,691 kB/s)
Selecting previously unselected package python3-pip-whl.
(Reading database ... 123633 files and directories currently installed.)
Pr

In [None]:
!pipx install insanely-fast-whisper

  installed package insanely-fast-whisper 0.0.15, installed using Python 3.10.12
  These apps are now globally available
    - insanely-fast-whisper
⚠️  Note: '/root/.local/bin' is not on your PATH environment variable. These apps will not be
    globally accessible until your PATH is updated. Run `pipx ensurepath` to automatically add it,
    or manually modify your PATH in your shell's config file (e.g. ~/.bashrc).
done! ✨ 🌟 ✨

In [None]:
# Small check to verify that the installation works, using the absolute path
!/root/.local/bin/insanely-fast-whisper --help

usage: insanely-fast-whisper [-h] --file-name FILE_NAME [--device-id DEVICE_ID]
                             [--transcript-path TRANSCRIPT_PATH] [--model-name MODEL_NAME]
                             [--task {transcribe,translate}] [--language LANGUAGE]
                             [--batch-size BATCH_SIZE] [--flash FLASH] [--timestamp {chunk,word}]
                             [--hf-token HF_TOKEN] [--diarization_model DIARIZATION_MODEL]
                             [--num-speakers NUM_SPEAKERS] [--min-speakers MIN_SPEAKERS]
                             [--max-speakers MAX_SPEAKERS]

Automatic Speech Recognition

options:
  -h, --help            show this help message and exit
  --file-name FILE_NAME
                        Path or URL to the audio file to be transcribed.
  --device-id DEVICE_ID
                        Device ID for your GPU. Just pass the device number when using CUDA, or
                        "mps" for Macs with Apple Silicon. (default: "0")
  --transcript-path TR

Let's run another small test using the URL given above.

In [None]:
url = "https://www.signalogic.com/melp/EngSamples/Orig/male.wav"
!/root/.local/bin/insanely-fast-whisper --file-name {url}

config.json: 100% 1.27k/1.27k [00:00<00:00, 8.01MB/s]
model.safetensors: 100% 3.09G/3.09G [01:13<00:00, 42.1MB/s]
generation_config.json: 100% 3.90k/3.90k [00:00<00:00, 22.2MB/s]
tokenizer_config.json: 100% 283k/283k [00:00<00:00, 651kB/s]
vocab.json: 100% 1.04M/1.04M [00:00<00:00, 1.13MB/s]
tokenizer.json: 100% 2.48M/2.48M [00:01<00:00, 2.17MB/s]
merges.txt: 100% 494k/494k [00:00<00:00, 724kB/s]
normalizer.json: 100% 52.7k/52.7k [00:00<00:00, 136MB/s]
added_tokens.json: 100% 34.6k/34.6k [00:00<00:00, 110MB/s]
special_tokens_map.json: 100% 2.07k/2.07k [00:00<00:00, 14.8MB/s]
preprocessor_config.json: 100% 340/340 [00:00<00:00, 2.35MB/s]
/root/.local/share/pipx/venvs/insanely-fast-whisper/lib/python3.10/site-packages/transformers/models
make sure to use `input_features` instead.
🤗 Transcribing...

Let's check whether the test worked by looking at the head of the output file `output.json`.

In [None]:
!head output.json

head: cannot open 'output.json' for reading: No such file or directory


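The captured output above reports that `output.json` could not be found, so this check does not actually confirm that the transcription worked. The `--transcript-path` option listed in the help output controls where the transcript is written, so the file location is worth checking before moving on.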
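
A check that does not fail silently can be written in Python. The sketch below is a minimal example, assuming the transcript JSON has a top-level `text` field (the same field read later in this notebook); `read_transcript_text` is a hypothetical helper name.

```python
import json
import os

def read_transcript_text(path="output.json"):
    """Return the `text` field of a transcript JSON file, or None if the file is missing."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data.get("text")

text = read_transcript_text("output.json")
print("No transcript found." if text is None else text[:200])
```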

## Install a python library to download youtube videos


There are a few Python libraries for downloading YouTube videos, but some of them no longer work due to blocking issues. For example, `pytube` used to be a common choice, but it no longer seems to work (see https://www.reddit.com/r/learnpython/comments/1edm1q5/pytube_not_working_please_help/).


Following the answers to that Reddit post, `Pytubefix` would seem like a fine alternative. However, it does not support the `.mp3` format, so the library `yt-dlp` will be used instead.

In [None]:
!pip install yt-dlp

Collecting yt-dlp
  Downloading yt_dlp-2024.12.13-py3-none-any.whl.metadata (172 kB)
Downloading yt_dlp-2024.12.13-py3-none-any.whl (3.2 MB)
Installing collected packages: yt-dlp
Successfully installed yt-dlp-2024.12.13


Let's perform a test. We will try to download the audio for this YouTube video: https://www.youtube.com/watch?v=wagn8Wrmzuc .

In [None]:
video = "https://www.youtube.com/watch?v=wagn8Wrmzuc"
!yt-dlp  -x --audio-format mp3 {video}

[youtube] Extracting URL: https://www.youtube.com/watch?v=wagn8Wrmzuc
[youtube] wagn8Wrmzuc: Downloading webpage
[youtube] wagn8Wrmzuc: Downloading ios player API JSON
[youtube] wagn8Wrmzuc: Downloading mweb player API JSON
[youtube] wagn8Wrmzuc: Downloading player 2f1832d2
[youtube] wagn8Wrmzuc: Downloading m3u8 information
[info] wagn8Wrmzuc: Downloading 1 format(s): 251
[download] Destination: Lady Gaga - Judas (Official Music Video) [wagn8Wrmzuc].webm
[download] 100% of    5.12MiB in 00:00:00 at 12.30MiB/s
[ExtractAudio] Destination: Lady Gaga - Judas (Official Music Video) [wagn8Wrmzuc].mp3
Deleting original file Lady Gaga - Judas (Official Music Video) [wagn8Wrmzuc].webm (pass -k to keep)


The file `Lady Gaga - Judas (Official Music Video) [wagn8Wrmzuc].mp3` has been successfully created. We can download it to check that the installation and the test have worked. **Note:** the argument `-x` extracts only the audio of the downloaded video instead of saving the whole video.

## Download the lyrics dataset

For this project I will download this dataset:
https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information

Using the Python code suggested on the Kaggle page:
```
import kagglehub

# Download latest version
path = kagglehub.dataset_download("carlosgdcj/genius-song-lyrics-with-language-information")

print("Path to dataset files:", path)
```

A very large file `song_lyrics.csv` should appear.


In [None]:
import kagglehub

#download latest version

path = kagglehub.dataset_download("carlosgdcj/genius-song-lyrics-with-language-information")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/carlosgdcj/genius-song-lyrics-with-language-information?dataset_version_number=1...


100%|██████████| 3.04G/3.04G [00:54<00:00, 59.4MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/carlosgdcj/genius-song-lyrics-with-language-information/versions/1


The dataset has been downloaded to the path printed above, inside a version folder named `1`. The complete path is: `/root/.cache/kagglehub/datasets/carlosgdcj/genius-song-lyrics-with-language-information/versions/1` .

## Installing more dependencies

Now the `transformers`, `torch` and `faiss-cpu` packages will be installed.

In [None]:
!pip install transformers torch faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.5/27.5 MB[0m [31m39.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0.post1




# ✏️Development of solution

## Implement `get_lyrics_from_youtube_url(youtube_url)`

First, I will implement a function that extracts the lyrics as a string from a YouTube URL using `insanely-fast-whisper`.

The idea behind this function is straightforward: generate a transcription of the video at the given URL and return it as a readable string. To do so, the audio of the video is downloaded in `.mp3` format using `yt_dlp` with a few extra settings. Then a JSON file with the transcription is generated using `insanely-fast-whisper`. Since this tool is a CLI, and the instructions ask to use it, I call it from the script itself with the `subprocess` module. Once the JSON file is produced, the last step is to open it and keep the only field we are interested in: `text`. Besides returning the text string, the temporary files created along the way are removed at the end.


In [None]:
import yt_dlp
import tempfile
import os
import subprocess
import json

def get_lyrics_from_youtube_url(youtube_url):
    url = youtube_url  # shorter alias

    # temporary path for the audio file
    temp_audio_path = os.path.join(tempfile.gettempdir(), "temp_audio")

    # configuration and options of yt_dlp
    ydl_opts = {
        "format": "bestaudio/best",
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",
                "preferredcodec": "mp3",
                "preferredquality": "0",
            }
        ],
        "outtmpl": temp_audio_path,  # save temp audio here
    }

    # download the audio from the video with the defined config
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

    # path of the newly created file
    processed_audio_path = temp_audio_path + ".mp3"

    # the transcription will be written to this JSON file
    output_json_path = os.path.join(tempfile.gettempdir(), "output.json")

    # build the insanely-fast-whisper command (with its options)
    # that will be called from the command line
    command = [
        "/root/.local/bin/insanely-fast-whisper",
        "--task", "transcribe",
        "--model-name", "openai/whisper-large-v3",
        "--file-name", processed_audio_path,
        "--transcript-path", output_json_path,  # save the result in JSON
    ]

    # execute using subprocess
    result = subprocess.run(command, capture_output=True, text=True)

    # verify that the execution succeeded and, if so, read the JSON file
    if result.returncode == 0:
        if os.path.exists(output_json_path):
            with open(output_json_path, "r") as f:
                data = json.load(f)
            # extract the text part only
            lyrics = data.get("text", "Transcription not found.")
        else:
            lyrics = "Error: Transcription file not created."
    else:
        lyrics = "Error: " + result.stderr

    # remove temp files that were created
    os.remove(processed_audio_path)
    if os.path.exists(output_json_path):
        os.remove(output_json_path)

    return lyrics
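
One design point worth noting: the function above returns error messages as if they were lyrics, so a failed transcription would later be embedded and matched like any other text. A possible alternative is to raise an exception instead. The sketch below shows the idea with a hypothetical helper `run_cli`; the Python interpreter is used as a stand-in command so that the example runs anywhere.

```python
import subprocess
import sys

def run_cli(command):
    """Run a command and return its stdout; raise RuntimeError if it exits with an error."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed ({result.returncode}): {result.stderr.strip()}")
    return result.stdout

# Demonstration with the Python interpreter itself as a stand-in CLI
print(run_cli([sys.executable, "-c", "print('ok')"]).strip())

try:
    run_cli([sys.executable, "-c", "import sys; sys.exit(3)"])
except RuntimeError as err:
    print("Caught:", err)
```

With this pattern the caller decides what to do on failure, instead of receiving a string that looks like valid lyrics.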



Let's do a test with a short video.

In [None]:
video = "https://www.youtube.com/shorts/I-fOINFTS0U"
lyrics = get_lyrics_from_youtube_url(video)
print(lyrics)

[youtube] Extracting URL: https://www.youtube.com/shorts/I-fOINFTS0U
[youtube] I-fOINFTS0U: Downloading webpage
[youtube] I-fOINFTS0U: Downloading ios player API JSON
[youtube] I-fOINFTS0U: Downloading mweb player API JSON
[youtube] I-fOINFTS0U: Downloading m3u8 information
[info] I-fOINFTS0U: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of  191.41KiB in 00:00:00 at 944.33KiB/s 
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)
 I just wanna be part of your symphony Will you hold me tight and won't let go


The test has passed successfully.

## Embeddings extractor

Now we are going to prepare a function that extracts embeddings (for example, with BERT) from a given text. This function will be tested with a sample string.



In [None]:
from transformers import BertTokenizer, BertModel
import torch

def get_text_embedding(text):
  #load the pre-trained model BERT and its tokenizer
  tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
  model = BertModel.from_pretrained('bert-base-uncased')

  #tokenize the text and add the special tokens ([CLS] and [SEP])
  inputs = tokenizer(text, return_tensors='pt', truncation = True, padding = True, max_length = 512)

  #obtain the outputs of the model
  with torch.no_grad():
    outputs = model(**inputs)

  #extract the embedding of the token [CLS] (which represents all the text)
  cls_embedding = outputs.last_hidden_state[:, 0, :].numpy()

  return cls_embedding

In [None]:
#Example case
text = "Remember to always thank Beyonce."
embedding = get_text_embedding(text)
print("Embedding shape: ", embedding.shape)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embedding shape:  (1, 768)


I ran into some problems later when using BERT, so I am going to replace the previous function with one that uses another model, which has turned out to work better in my code for finding semantic relationships.

In [None]:
from sentence_transformers import SentenceTransformer

def get_text_embedding(text):
    # Load the pre-trained model from SentenceTransformers
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

    # Generate the embedding for the input text
    embedding = model.encode(text)

    return embedding

# Test the function with a sample string
text = "Look, I was gonna go easy on you not to hurt your feelings."
embedding = get_text_embedding(text)

print("Embedding shape:", embedding.shape)
# Display the first 5 values
print("First 5 values of the embedding:", embedding[:5])


Embedding shape: (384,)
First 5 values of the embedding: [ 0.13474764  0.27216196  0.21267383  0.07295902 -0.18505321]
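
Both versions of `get_text_embedding` load the model on every call, which can become slow when the function is applied to many lyrics. One common way around this is to cache the loaded model. The sketch below illustrates the pattern only: `get_model` and `embed_many` are hypothetical names, and the dictionary and tuple are placeholders standing in for the real model object and its embeddings.

```python
from functools import lru_cache

LOAD_COUNT = 0  # only here to show how often the loader actually runs

@lru_cache(maxsize=None)
def get_model(name):
    """Load a model once per name. The body is a stand-in for an expensive load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"name": name}  # hypothetical placeholder for the real model object

def embed_many(texts, model_name="paraphrase-MiniLM-L6-v2"):
    """Embed several texts while reusing the cached model."""
    model = get_model(model_name)  # cached after the first call
    return [(model["name"], len(t)) for t in texts]  # placeholder for real embeddings

embed_many(["first lyric", "second lyric"])
embed_many(["third lyric"])
print("Model loads:", LOAD_COUNT)
```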


## Creating a vector database

Using `faiss`, I am going to create an index with a few embeddings, and use it to search the nearest neighbors from it given a query string.

Note that the input to `faiss` must be numpy arrays with proper shape, typically: `(num_items, embedding_dimension)`. For querying only one string, it might require `(1, embedding_dimension)`.



In [None]:
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer


# corpus of song parts and math :)
corpus = [
    "When I am out faith she's my idol when I need to rest she's my ride or die",
    "Show me how to lie you're getting better all the time",
    "Give it a rest my love let's take it slow we both need some room to breathe",
    "I should've bought you flowers and held your hand shoul've gave all my hours",
    "A tensor is a kind of multilinear form over a set of vector fields and maps them to the real numbers"
]

# 1. Generate embeddings of the sentences
embeddings = np.array([get_text_embedding(text) for text in corpus])

# 2. Create FAISS index
embedding_dimension = embeddings.shape[1]  # embeddings dimension (e.g. 384)
index = faiss.IndexFlatL2(embedding_dimension)  # Euclidean metric for search (L2)

# Add the embeddings to the index
index.add(embeddings.astype(np.float32))

# 3. Perform a query with a new sentence
query = "I love linear algebra and love music"

# 4. Generate embedding for the query using the new function
query_embedding = get_text_embedding(query).reshape(1, -1)

# 5. Query the index in order to obtain the k = 3 nearest neighbours
k = 3
distances, indices = index.search(np.array(query_embedding, dtype=np.float32), k)

# print results
print(f"Query: {query}")
print(f"Nearest Neighbors:")
for i in range(k):
    print(f"Neighbor {i+1}: '{corpus[indices[0][i]]}' with distance {distances[0][i]:.4f}")


Query: I love linear algebra and love music
Nearest Neighbors:
Neighbor 1: 'A tensor is a kind of multilinear form over a set of vector fields and maps them to the real numbers' with distance 65.6027
Neighbor 2: 'Give it a rest my love let's take it slow we both need some room to breathe' with distance 81.4390
Neighbor 3: 'When I am out faith she's my idol when I need to rest she's my ride or die' with distance 81.7410
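
The distances printed above are fairly large because `IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances on the raw, unnormalised embeddings. As a rough illustration of what that search computes, here is a pure-Python sketch; `squared_l2` and `brute_force_search` are hypothetical helper names, and the two-dimensional vectors are made up for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to `query`."""
    ranked = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = ranked[:k]
    return [d for d, _ in top], [i for _, i in top]

vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = brute_force_search(vectors, [1.0, 1.0], k=2)
print(distances, indices)
```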


## Loading the lyrics database

From the database in `song_lyrics.csv`, I will use the top 1000 songs by views and build the vector database with them.

Important: this file is huge and does not fit in RAM, so I read it in chunks and keep a running top 1000, in the following way.
```
import pandas as pd

file_path = path + '/song_lyrics.csv'
chunksize = 500000
top_n = 1000

top_views_df = pd.DataFrame()

for chunk in pd.read_csv(file_path, chunksize=chunksize):
    chunk_top = chunk.nlargest(top_n, 'views')
    top_views_df = pd.concat([top_views_df, chunk_top])
    top_views_df = top_views_df.nlargest(top_n, 'views')
```




In [None]:
import pandas as pd

file_path = path + '/song_lyrics.csv'
chunksize = 500000
top_n = 1000

top_views_df = pd.DataFrame()

for chunk in pd.read_csv(file_path, chunksize=chunksize):
    chunk_top = chunk.nlargest(top_n, 'views')
    top_views_df = pd.concat([top_views_df, chunk_top])
    top_views_df = top_views_df.nlargest(top_n, 'views')

In [None]:
top_views_df.head()

| Unnamed: 0 | title | tag | artist | year | views | features | lyrics | id | language_cld3 | language_ft | language |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2029340 | Despacito Remix | pop | Luis Fonsi & Daddy Yankee | 2017 | 23351415 | {"Justin Bieber"} | [Letra de "Despacito (Remix)" ft. Justin Biebe... | 3057010 | es | es | es |
| 212889 | Rap God | rap | Eminem | 2013 | 17575634 | {} | [Intro]\n"Look, I was gonna go easy on you not... | 235729 | en | en | en |
| 3858378 | WAP | rap | Cardi B | 2020 | 16003444 | {"Megan Thee Stallion"} | [Intro: Cardi B, Al "T" McLaran & Megan Thee S... | 5832126 | en | en | en |
| 1950930 | Shape of You | pop | Ed Sheeran | 2017 | 14569727 | {} | [Verse 1]\nThe club isn't the best place to fi... | 2949128 | en | en | en |
| 2015234 | HUMBLE. | rap | Kendrick Lamar | 2017 | 11181199 | {} | [Intro]\nNobody pray for me\nIt been that day ... | 3039923 | en | en | en |


To improve the database a little, let us clean the lyrics column by removing annotations such as [Chorus] and [Verse], line breaks, etc.

In [None]:
import re

def clean_lyrics(lyrics):
    # leave non-string values (e.g. missing lyrics) untouched
    if not isinstance(lyrics, str):
        return lyrics
    # 1. Remove the [Intro], [Chorus], etc. annotations
    cleaned_lyrics = re.sub(r'\[.*?\]', '', lyrics)

    # 2. Collapse line breaks and any run of whitespace into a single space
    cleaned_lyrics = re.sub(r'\s+', ' ', cleaned_lyrics).strip()

    return cleaned_lyrics

# Clean the column lyrics
top_views_df["lyrics"] = top_views_df["lyrics"].apply(clean_lyrics)
top_views_df.head()

| Unnamed: 0 | title | tag | artist | year | views | features | lyrics | id | language_cld3 | language_ft | language |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2029340 | Despacito Remix | pop | Luis Fonsi & Daddy Yankee | 2017 | 23351415 | {"Justin Bieber"} | Comin' over in my direction So thankful for th... | 3057010 | es | es | es |
| 212889 | Rap God | rap | Eminem | 2013 | 17575634 | {} | "Look, I was gonna go easy on you not to hurt ... | 235729 | en | en | en |
| 3858378 | WAP | rap | Cardi B | 2020 | 16003444 | {"Megan Thee Stallion"} | Whores in this house There's some whores in th... | 5832126 | en | en | en |
| 1950930 | Shape of You | pop | Ed Sheeran | 2017 | 14569727 | {} | The club isn't the best place to find a lover ... | 2949128 | en | en | en |
| 2015234 | HUMBLE. | rap | Kendrick Lamar | 2017 | 11181199 | {} | Nobody pray for me It been that day for me Way... | 3039923 | en | en | en |
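
As a quick illustration of what the cleaning does, the sketch below applies the same two kinds of regular expression to a made-up snippet. The sample text and the name `clean_lyrics_demo` are invented for this example.

```python
import re

def clean_lyrics_demo(lyrics):
    """Remove bracketed annotations, then collapse all whitespace into single spaces."""
    without_tags = re.sub(r"\[.*?\]", "", lyrics)      # drop [Chorus], [Verse 1], ...
    return re.sub(r"\s+", " ", without_tags).strip()   # collapse spaces and line breaks

sample = "[Verse 1]\nFirst line of the song\n\n[Chorus]\nSecond   line"
print(clean_lyrics_demo(sample))
```

Note that `.` does not match a line break by default, so an annotation that spans several lines would be left in place by this pattern.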


## Extracting embeddings for lyrics database

Embeddings for the 1000 lyrics in the database will be extracted.



In [None]:
def extract_embeddings(df):
  # apply the previously created function to extract an embedding for each row
  embeddings = df['lyrics'].apply(lambda x: get_text_embedding(x))

  # stack the per-song embeddings into one NumPy array of shape (num_songs, embedding_dimension)
  return np.vstack(embeddings)


In [None]:
#extract the embeddings for the 1000 most viewed songs
embeddings = extract_embeddings(top_views_df)

#check the shape of the embeddings (should be (1000, 384): one 384-dimensional vector per song)
print("Embeddings shape: ", embeddings.shape)

Embeddings shape:  (1000, 384)



## Creating a `faiss` index with lyrics

I will create a `faiss` index with those 1000 lyrics and test it with some example text.


In [None]:
embedding_dimension = embeddings.shape[1]  # embeddings dimension (384 for this model)
index = faiss.IndexFlatL2(embedding_dimension)  # Euclidean metric for search (L2)

# L2-normalise the embeddings in place
faiss.normalize_L2(embeddings)

# Add the embeddings to the index
index.add(np.array(embeddings, dtype=np.float32))

# Query index with a new sentence
query = "Look, I was gonna go easy on you not to hurt your feelings But I'm only going to get this one chance (six minutes-, six minutes-) Something's wrong, I can feel it (six minutes, Slim Shady, you're on!)"
query_embedding = get_text_embedding(query)
query_search_vector = np.array([query_embedding])
faiss.normalize_L2(query_search_vector)
#search for the 5 nearest neighbours
k = 5
distances, indices = index.search(query_search_vector, k)

#show results
print(f"Query: {query}")
print("Nearest Neighbors:")
for i in range(k):
    song_title = top_views_df.iloc[indices[0][i]]['title']
    print(f"Neighbor {i+1}: '{song_title[:100]}' with distance {distances[0][i]:.4f}")


Query: Look, I was gonna go easy on you not to hurt your feelings But I'm only going to get this one chance (six minutes-, six minutes-) Something's wrong, I can feel it (six minutes, Slim Shady, you're on!)
Nearest Neighbors:
Neighbor 1: 'Rap God' with distance 0.6368
Neighbor 2: 'Im Yours' with distance 0.7525
Neighbor 3: 'FourFiveSeconds' with distance 0.8441
Neighbor 4: 'Feels' with distance 0.8694
Neighbor 5: '505' with distance 0.8951



## Implement final function: `get_covers`

As described at the beginning of this notebook.

In [None]:
def get_covers(youtube_url, k):
  #1 and 2: Download the youtube video in a temp file and transcribe the lyrics
  lyrics = get_lyrics_from_youtube_url(youtube_url)

  #3 Extract the embeddings of the transcribed lyrics
  query_embedding = get_text_embedding(lyrics)
  query_search_vector = np.array([query_embedding])
  faiss.normalize_L2(query_search_vector)


  #4 Search the top-k similar entries in the vector database and return the song title and artist
  distances, indices = index.search(query_search_vector, k)

  #Result formatting
  results = []
  for i in range(k):
    song_lyrics = top_views_df.iloc[indices[0][i]]
    results.append({
        "title": song_lyrics['title'],
        "artist": song_lyrics['artist'],
        "score": 100 - 10*distances[0][i] #convert distance into a score
    })

  return results
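
A note on how to read the score. Because both the database embeddings and the query are L2-normalised, the squared Euclidean distance reported by `IndexFlatL2` equals `2 - 2 * cos`, where `cos` is the cosine similarity of the two vectors. The distance therefore lies between 0 and 4, and the score `100 - 10 * distance` lies between 60 and 100; a score of 90, for instance, corresponds to a cosine similarity of 0.5. The sketch below checks this relation on two made-up vectors (the helper names are hypothetical).

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def score_from_distance(d):
    return 100 - 10 * d  # same formula as in get_covers

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])
cosine = sum(x * y for x, y in zip(a, b))
d = squared_l2(a, b)

# For unit vectors the squared distance equals 2 - 2*cos (up to rounding)
print(abs(d - (2 - 2 * cosine)) < 1e-12)

# Best and worst possible scores for normalised vectors
print(score_from_distance(0.0), score_from_distance(4.0))
```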

## 📊Evaluation


In [None]:
covers = get_covers("https://www.youtube.com/watch?v=BDC8Jr-gp_4", 2)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=BDC8Jr-gp_4
[youtube] BDC8Jr-gp_4: Downloading webpage
[youtube] BDC8Jr-gp_4: Downloading ios player API JSON
[youtube] BDC8Jr-gp_4: Downloading mweb player API JSON
[youtube] BDC8Jr-gp_4: Downloading m3u8 information
[info] BDC8Jr-gp_4: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    3.92MiB in 00:00:00 at 22.56MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'Shape of You', 'artist': 'Ed Sheeran', 'score': 99.65201895684004},
 {'title': 'Perfect Duet',
  'artist': 'Ed Sheeran & Beyonc',
  'score': 93.08933615684509}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=W_97b97G5ds", 3)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=W_97b97G5ds
[youtube] W_97b97G5ds: Downloading webpage
[youtube] W_97b97G5ds: Downloading ios player API JSON
[youtube] W_97b97G5ds: Downloading mweb player API JSON
[youtube] W_97b97G5ds: Downloading m3u8 information
[info] W_97b97G5ds: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    3.90MiB in 00:00:00 at 16.12MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'Believer', 'artist': 'Imagine Dragons', 'score': 95.6619393825531},
 {'title': 'Glorious', 'artist': 'Macklemore', 'score': 91.9774866104126},
 {'title': 'Apparently', 'artist': 'J. Cole', 'score': 91.89537405967712}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=L53MZzuE0QY", 3)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=L53MZzuE0QY
[youtube] L53MZzuE0QY: Downloading webpage
[youtube] L53MZzuE0QY: Downloading ios player API JSON
[youtube] L53MZzuE0QY: Downloading mweb player API JSON
[youtube] L53MZzuE0QY: Downloading m3u8 information
[info] L53MZzuE0QY: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    5.87MiB in 00:00:00 at 36.26MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'The Way I Am', 'artist': 'Eminem', 'score': 87.12324023246765},
 {'title': 'Unforgettable',
  'artist': 'French Montana',
  'score': 87.06592559814453},
 {'title': 'Blame Game', 'artist': 'Kanye West', 'score': 86.37513875961304}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=9vmrPrYJPqI", 4)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=9vmrPrYJPqI
[youtube] 9vmrPrYJPqI: Downloading webpage
[youtube] 9vmrPrYJPqI: Downloading ios player API JSON
[youtube] 9vmrPrYJPqI: Downloading mweb player API JSON
[youtube] 9vmrPrYJPqI: Downloading m3u8 information
[info] 9vmrPrYJPqI: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    2.67MiB in 00:00:00 at 14.45MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'Get Lucky', 'artist': 'Daft Punk', 'score': 99.02183197438717},
 {'title': '24K Magic', 'artist': 'Bruno Mars', 'score': 90.73913097381592},
 {'title': 'Happy', 'artist': 'Pharrell Williams', 'score': 90.54934620857239},
 {'title': 'Cheap Thrills',
  'artist': 'Cheap Thrills Lyrics - Sia',
  'score': 89.88998293876648}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=R6ATpAr7rQU", 3)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=R6ATpAr7rQU
[youtube] R6ATpAr7rQU: Downloading webpage
[youtube] R6ATpAr7rQU: Downloading ios player API JSON
[youtube] R6ATpAr7rQU: Downloading mweb player API JSON
[youtube] R6ATpAr7rQU: Downloading m3u8 information
[info] R6ATpAr7rQU: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    5.16MiB in 00:00:00 at 23.48MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'The Way I Am', 'artist': 'Eminem', 'score': 87.12324023246765},
 {'title': 'Unforgettable',
  'artist': 'French Montana',
  'score': 87.06592559814453},
 {'title': 'Blame Game', 'artist': 'Kanye West', 'score': 86.37513875961304}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=RmtP8X4ZErs", 3)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=RmtP8X4ZErs
[youtube] RmtP8X4ZErs: Downloading webpage
[youtube] RmtP8X4ZErs: Downloading ios player API JSON
[youtube] RmtP8X4ZErs: Downloading mweb player API JSON
[youtube] RmtP8X4ZErs: Downloading m3u8 information
[info] RmtP8X4ZErs: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    5.44MiB in 00:00:00 at 31.50MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'Bohemian Rhapsody', 'artist': 'Queen', 'score': 98.78634810447693},
 {'title': 'Kill Yourself Part III',
  'artist': '$UICIDEBOY$',
  'score': 92.69753217697144},
 {'title': 'No Time To Die',
  'artist': 'Billie Eilish',
  'score': 92.50708818435669}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=DfMnRP0pk3A", 3)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=DfMnRP0pk3A
[youtube] DfMnRP0pk3A: Downloading webpage
[youtube] DfMnRP0pk3A: Downloading ios player API JSON
[youtube] DfMnRP0pk3A: Downloading mweb player API JSON
[youtube] DfMnRP0pk3A: Downloading m3u8 information
[info] DfMnRP0pk3A: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    4.09MiB in 00:00:00 at 12.70MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'The Hills', 'artist': 'The Weeknd', 'score': 95.00303238630295},
 {'title': 'Hold On Were Going Home',
  'artist': 'Drake',
  'score': 93.28573346138},
 {'title': 'Blinding Lights',
  'artist': 'The Weeknd',
  'score': 92.8875458240509}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=1BVP72VrGQs", 3)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=1BVP72VrGQs
[youtube] 1BVP72VrGQs: Downloading webpage
[youtube] 1BVP72VrGQs: Downloading ios player API JSON
[youtube] 1BVP72VrGQs: Downloading mweb player API JSON
[youtube] 1BVP72VrGQs: Downloading m3u8 information
[info] 1BVP72VrGQs: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    3.24MiB in 00:00:00 at 18.82MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'Tuyo', 'artist': 'Rodrigo Amarante', 'score': 95.8503919839859},
 {'title': 'Papaoutai', 'artist': 'Stromae', 'score': 93.57823193073273},
 {'title': 'Mi Gente',
  'artist': 'J Balvin & Willy William',
  'score': 93.54008316993713}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=SfgurkrXDSw", 3)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=SfgurkrXDSw
[youtube] SfgurkrXDSw: Downloading webpage
[youtube] SfgurkrXDSw: Downloading ios player API JSON
[youtube] SfgurkrXDSw: Downloading mweb player API JSON
[youtube] SfgurkrXDSw: Downloading m3u8 information
[info] SfgurkrXDSw: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    3.26MiB in 00:00:00 at 19.10MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'Mi Gente',
  'artist': 'J Balvin & Willy William',
  'score': 98.18820387125015},
 {'title': 'Tuyo', 'artist': 'Rodrigo Amarante', 'score': 97.11336493492126},
 {'title': 'Te Boté Remix',
  'artist': 'Nio Garca, Casper Mgico & Bad Bunny',
  'score': 94.80103492736816}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=PXe8POW7Ykw", 3)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=PXe8POW7Ykw
[youtube] PXe8POW7Ykw: Downloading webpage
[youtube] PXe8POW7Ykw: Downloading ios player API JSON
[youtube] PXe8POW7Ykw: Downloading mweb player API JSON
[youtube] PXe8POW7Ykw: Downloading m3u8 information
[info] PXe8POW7Ykw: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    3.17MiB in 00:00:00 at 22.77MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'HUMBLE.', 'artist': 'Kendrick Lamar', 'score': 95.23176550865173},
 {'title': 'Lemonade', 'artist': 'Internet Money', 'score': 93.8145101070404},
 {'title': '679', 'artist': 'Fetty Wap', 'score': 92.98124313354492}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=vlZ9kjCrGJw", 4)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=vlZ9kjCrGJw
[youtube] vlZ9kjCrGJw: Downloading webpage
[youtube] vlZ9kjCrGJw: Downloading ios player API JSON
[youtube] vlZ9kjCrGJw: Downloading mweb player API JSON
[youtube] vlZ9kjCrGJw: Downloading m3u8 information
[info] vlZ9kjCrGJw: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    4.00MiB in 00:00:00 at 37.82MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'Hello', 'artist': 'Adele', 'score': 99.32680197060108},
 {'title': 'Closer', 'artist': 'The Chainsmokers', 'score': 92.52476572990417},
 {'title': 'Before You Go',
  'artist': 'Lewis Capaldi',
  'score': 92.38257646560669},
 {'title': 'Someone You Loved',
  'artist': 'Lewis Capaldi',
  'score': 92.30825185775757}]

In [None]:
covers = get_covers("https://www.youtube.com/watch?v=OVQXiKnx3mE", 2)
covers

[youtube] Extracting URL: https://www.youtube.com/watch?v=OVQXiKnx3mE
[youtube] OVQXiKnx3mE: Downloading webpage
[youtube] OVQXiKnx3mE: Downloading ios player API JSON
[youtube] OVQXiKnx3mE: Downloading mweb player API JSON
[youtube] OVQXiKnx3mE: Downloading m3u8 information
[info] OVQXiKnx3mE: Downloading 1 format(s): 251
[download] Destination: /tmp/temp_audio
[download] 100% of    1.84MiB in 00:00:00 at 12.30MiB/s  
[ExtractAudio] Destination: /tmp/temp_audio.mp3
Deleting original file /tmp/temp_audio (pass -k to keep)


[{'title': 'Moonlight', 'artist': 'XXXTENTACION', 'score': 91.81372463703156},
 {'title': 'The Race', 'artist': 'Tay-K', 'score': 91.61815285682678}]

The results on these test queries were good: 9 of the 12 songs (75%) were identified correctly!
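
The count above was done by hand. A minimal sketch of how the same kind of score could be computed automatically, assuming a manually labelled list with the expected title for each query. The helper `top_k_accuracy` and the placeholder data below are hypothetical and are not part of the notebook's pipeline; only the list-of-dicts shape matches what `get_covers` returns.

```python
def top_k_accuracy(results, expected_titles, k=1):
    """Fraction of queries whose expected title appears among the first k matches.

    results: one list per query, each in the format returned by get_covers
             (dicts with "title", "artist" and "score", sorted by score).
    expected_titles: the manually labelled correct title for each query.
    """
    if len(results) != len(expected_titles):
        raise ValueError("results and expected_titles must have the same length")
    if not results:
        return 0.0

    hits = 0
    for covers, expected in zip(results, expected_titles):
        # Compare titles case-insensitively within the first k candidates
        top_titles = [cover["title"].casefold() for cover in covers[:k]]
        if expected.casefold() in top_titles:
            hits += 1
    return hits / len(results)


# Placeholder data for illustration only (not real query results)
example_results = [
    [{"title": "Song A", "artist": "Artist A", "score": 99.0},
     {"title": "Song B", "artist": "Artist B", "score": 90.0}],
    [{"title": "Song C", "artist": "Artist C", "score": 95.0},
     {"title": "Song D", "artist": "Artist D", "score": 93.0}],
]
example_expected = ["Song A", "Song D"]

print(top_k_accuracy(example_results, example_expected, k=1))  # 0.5
print(top_k_accuracy(example_results, example_expected, k=2))  # 1.0
```

With `k=1` only the best-scoring match counts as a hit, which is the strictest reading of "correctly guessed". A larger `k` also credits queries where the right song appears lower in the returned list.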