# 🚀Setup

## Set the runtime type

Set the runtime type of this Google Collab to T4 GPU.

## Run shell commands

You can run shell commands in a cell by using prefix `!`, for example:
```
!pip install transformers
```



In [None]:
!pip install --upgrade pip
!pip install transformers

Defaulting to user installation because normal site-packages is not writeable
Collecting pip
  Using cached pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Using cached pip-24.3.1-py3-none-any.whl (1.8 MB)


ERROR: To modify pip, please run the following command:
C:\anaconda3\python.exe -m pip install --upgrade pip


In [None]:
import os

#Enviroment variables
VIDEOS_MP3_FOLDER_NAME = "videos_mp3"
VIDEOS_MP3_PATH = os.path.join(os.getcwd(), VIDEOS_MP3_FOLDER_NAME)

TRANSCRIPTIONS_FOLDER_NAME = "transcriptions"
TRANSCRIPTIONS_PATH = os.path.join(os.getcwd(), TRANSCRIPTIONS_FOLDER_NAME)

In [None]:
try:
  os.mkdir(TRANSCRIPTIONS_PATH)
except FileExistsError:
  print("Folder already exists")

try:
  os.mkdir(VIDEOS_MP3_PATH)
except FileExistsError:
  print("Folder already exists")



Folder already exists
Folder already exists


## Install `insanely-fast-whisper`

In [None]:
!conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

Looking in links: https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html


ERROR: Could not find a version that satisfies the requirement torch (from versions: none)

[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip
ERROR: No matching distribution found for torch


In [14]:
"""I found having the model coded directly in Python more convenient than using
the insanely-fast-whisper CLI because YouTube video file names can contain
invalid or unrecognized characters for the bash terminal. Additionally, in my
opinion, it is more elegant. The code below is present in the documentation of
the insanely-fast-whisper repository."""

import torch
from transformers import pipeline
from transformers.utils import is_flash_attn_2_available

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3", # select checkpoint from https://huggingface.co/openai/whisper-large-v3#model-details
    torch_dtype=torch.float16,
    device="cuda:0", # or mps for Mac devices
    model_kwargs={"attn_implementation": "flash_attention_2"} if is_flash_attn_2_available() else {"attn_implementation": "sdpa"},
)

def transcribe(audio_path):
  """Transcribe audio file to text using Insanely Fast Whisper model"""
  return pipe(
  audio_path,
  chunk_length_s=30,
  batch_size=24,
  return_timestamps=True,
)

ModuleNotFoundError: No module named 'torch'

In [None]:
import json

def write_transcription_to_disk(trasnscription_json, filename):
  """Write the transcription to a json file"""
  with open(filename, 'w') as file:
    json.dump(trasnscription_json, file)

In [None]:
#Check if insanely-fast-whisper works fine
test_file_path = os.path.join(TRANSCRIPTIONS_PATH, "test_file.json")

transcription = transcribe("https://www.signalogic.com/melp/EngSamples/Orig/male.wav")
print("\nTest Transcription: \n")
print(transcription["text"])
write_transcription_to_disk(transcription, test_file_path)




Test Transcription: 

 But what if somebody decides to break it? Be careful that you keep adequate coverage, but look for places to save money. Maybe it's taking longer to get things squared away than the bankers expected. Hiring the wife for one's company may win her taxated retirement income. The boost is helpful, but inadequate. New self-deceiving rags are hurriedly tossed on the two naked bones. What a discussion can ensue when the title of this type of song is in question. There is no dyeing or waxing or gassing needed. Paperweight may be personalized on back while clay is leather hardened. Place work on a flat surface and smooth out. The simplest kind of separate system uses a single self-contained unit. The old shop adage still holds. A good mechanic is usually a bad boss. Both figures would go higher in later years. Some make beautiful chairs, cabinets, chests, dollhouses, etc.


## Install a python library to download youtube videos

In [None]:
!pip install pytubefix



In [None]:
from pytubefix import YouTube
from pytubefix.cli import on_progress

def download_ytvid_as_mp3(url):
  """Download a youtube video as mp3 and return the name of the file"""
  yt = YouTube(url, on_progress_callback = on_progress)
  ys = yt.streams.get_audio_only()
  ys.download(mp3=True, output_path=VIDEOS_MP3_PATH)
  return yt.title + ".mp3"

In [None]:
download_ytvid_as_mp3("https://youtu.be/09839DpTctU")

 ↳ |███████████████████████████████████████████████████████| 100.0%

'Eagles - Hotel California (Live 1977) (Official Video) [HD].mp3'

## Download the lyrics dataset

Download this dataset:
https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information

Using the python code suggested in Kaggle web:
```
import kagglehub

# Download latest version
path = kagglehub.dataset_download("carlosgdcj/genius-song-lyrics-with-language-information")

print("Path to dataset files:", path)
```

You should find a very large file `song_lyrics.csv`, check it is there.


In [None]:
import kagglehub

# Download latest version
dataset_directory = kagglehub.dataset_download("carlosgdcj/genius-song-lyrics-with-language-information")
dataset_path = os.path.join(dataset_directory, "song_lyrics.csv")
print("Path to dataset files:", dataset_path)

Path to dataset files: /root/.cache/kagglehub/datasets/carlosgdcj/genius-song-lyrics-with-language-information/versions/1/song_lyrics.csv


## Install more dependencies

In [None]:
"""I will use faiss-gpu for faster processing"""
!pip install faiss-gpu





# ✏️Development of solution

In [None]:
def get_lyrics_from_youtube_url(youtube_url):
  """Download a youtube video as mp3, transcribe it with Insanely Fast Whisper
  Model, and return the text"""

  mp3_name = download_ytvid_as_mp3(youtube_url)
  mp3_path = os.path.join(VIDEOS_MP3_PATH, mp3_name)

  lyric_dict = transcribe(mp3_path)

  transcription_file_path = os.path.join(TRANSCRIPTIONS_PATH, mp3_name + ".json")
  write_transcription_to_disk(lyric_dict, transcription_file_path)

  return lyric_dict["text"]

In [None]:
get_lyrics_from_youtube_url("https://youtu.be/09839DpTctU")



" The សូវាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប់ពីប្រាប� On a dark desert highway Cool wind in my hair Warm smell of colitas Rising up through the air Up ahead in the distance I saw a shimmering light My head grew heavy and my sight grew dim I had to stop for the night There she stood in the doorway, heard the mission bell And I was thinking to myself, this could be heaven or this could be hell Then she lit up a candle, and she showed me the way There were voices down the corridor, I thought I heard them say Welcome to the Hotel California Such a lovely place, such a lovely face There's plenty of room at the Hotel California Room at Hilltale, California Any time of year You can find it here Her mind is definitely twisted She got the Mercedes Benz She got a lot of pretty, pretty boys She calls friends How they dance in the courtyard Sweet su

## Embeddings extractor

In [None]:
from transformers import BertTokenizer, BertModel # Load Tokenizer and pretrained model
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Move the model to the GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def get_bert_embeddings(text_batch):
  inputs = tokenizer(text_batch, return_tensors='pt', truncation=True, padding=True)
  inputs = {k: v.to('cuda') for k, v in inputs.items()}

  with torch.no_grad():
    outputs = model(**inputs)

  # Faiss only receive numbers of simple presicion
  batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy().astype("float32")

  # Clear GPU cache:
  torch.cuda.empty_cache()

  return batch_embeddings

In [None]:
# Test if the embedding extractor works
with open(test_file_path, 'r') as file:
  lyric = json.load(file)
  print("Test embedding: \n")
  emb = get_bert_embeddings(lyric["text"])
  print(emb)
  print("Lenght: ", len(emb[0]))

Test embedding: 

[[-1.54148474e-01 -4.04330134e-01  8.98521423e-01  4.76938523e-02
   1.44223660e-01 -4.33926493e-01 -1.95914671e-01  3.28730762e-01
   5.53251766e-02 -7.29216337e-01  2.24528447e-01  4.01933461e-01
   5.08906879e-03 -6.28654510e-02 -3.26610565e-01  6.63785636e-01
  -1.16367526e-01  3.31107318e-01  6.06401712e-02  6.10687360e-02
  -4.52740550e-01  1.12210184e-01  6.03751361e-01  2.59096712e-01
   1.70627519e-01 -7.42425695e-02 -1.91678196e-01 -3.46381962e-01
  -3.55293065e-01  2.74351925e-01 -1.78753659e-01  4.08199400e-01
  -1.86885491e-01 -7.00560033e-01  2.77274579e-01 -4.05105837e-02
   2.88002163e-01  1.84421152e-01  3.50446612e-01  3.18620563e-01
  -4.95715141e-01  2.74117351e-01  4.98141855e-01 -2.57696331e-01
  -4.91634071e-01  4.01091985e-02 -5.13043499e+00  3.63768786e-01
  -1.13698021e-01 -6.39739335e-01 -1.01860881e-01 -1.23607010e-01
  -1.69595852e-01  5.54351449e-01  7.84768879e-01  4.62487578e-01
   2.56723762e-01  4.48112607e-01 -8.22907314e-02 -3.20638

## Load lyrics database

In [None]:
import pandas as pd

chunksize = 500000
top_n = 1000

top_views_df = pd.DataFrame()

for chunk in pd.read_csv(dataset_path, chunksize=chunksize):
    chunk_top = chunk.nlargest(top_n, 'views')
    top_views_df = pd.concat([top_views_df, chunk_top])
    top_views_df = top_views_df.nlargest(top_n, 'views')

In [None]:
top_views_df.head()

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2029340,Despacito Remix,pop,Luis Fonsi & Daddy Yankee,2017,23351415,"{""Justin Bieber""}","[Letra de ""Despacito (Remix)"" ft. Justin Biebe...",3057010,es,es,es
212889,Rap God,rap,Eminem,2013,17575634,{},"[Intro]\n""Look, I was gonna go easy on you not...",235729,en,en,en
3858378,WAP,rap,Cardi B,2020,16003444,"{""Megan Thee Stallion""}","[Intro: Cardi B, Al ""T"" McLaran & Megan Thee S...",5832126,en,en,en
1950930,Shape of You,pop,Ed Sheeran,2017,14569727,{},[Verse 1]\nThe club isn't the best place to fi...,2949128,en,en,en
2015234,HUMBLE.,rap,Kendrick Lamar,2017,11181199,{},[Intro]\nNobody pray for me\nIt been that day ...,3039923,en,en,en


In [None]:
top_views_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 2029340 to 37482
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          1000 non-null   object
 1   tag            1000 non-null   object
 2   artist         1000 non-null   object
 3   year           1000 non-null   int64 
 4   views          1000 non-null   int64 
 5   features       1000 non-null   object
 6   lyrics         1000 non-null   object
 7   id             1000 non-null   int64 
 8   language_cld3  995 non-null    object
 9   language_ft    995 non-null    object
 10  language       988 non-null    object
dtypes: int64(3), object(8)
memory usage: 93.8+ KB


In [None]:
import re
import string

def clean_lyrics(lyrics):
  # Convert all the letters to lowercase
  lyrics = lyrics.lower()

  # Delete text between parenthesis (for example, [Intro] or (Chorus))
  lyrics = re.sub(r' \[.*?\] ', '', lyrics)
  lyrics = re.sub(r'\(.*?\)', '', lyrics)

  # Delete punctuation
  lyrics = lyrics.translate(str.maketrans('', '', string.punctuation))

  # Delete extra blanc spaces
  lyrics = re.sub(r'\s+', ' ', lyrics).strip()

  return lyrics

In [None]:
# Test clean_lyrics
print(top_views_df.iloc[0]['lyrics'])
print("\n")
print("Cleaned lyric: \n")
print(clean_lyrics(top_views_df.iloc[0]['lyrics']))

[Letra de "Despacito (Remix)" ft. Justin Bieber]

[Intro: Justin Bieber]
Comin' over in my direction
So thankful for that, it's such a blessin', yeah
Turn every situation into heaven, yeah
Oh-oh, you are
My sunrise on the darkest day
Got me feelin' some kind of way
Make me wanna savor every moment slowly, slowly
You fit me tailor-made, love how you put it on
Got the only key, know how to turn it on
The way you nibble on my ear, the only words I wanna hear
Baby, take it slow so we can last long

[Verso 1: Luis Fonsi & Daddy Yankee]
¡Oh! Tú, tú eres el imán y yo soy el metal
Me voy acercando y voy armando el plan
Sólo con pensarlo se acelera el pulso (Oh, yeah)
Ya, ya me está gustando más de lo normal
Todos mis sentidos van pidiendo más
Esto hay que tomarlo sin ningún apuro
[Coro: Justin Bieber & Luis Fonsi, Daddy Yankee]
Despacito
Quiero respirar tu cuello despacito
Deja que te diga cosas al oído
Para que te acuerdes si no estás conmigo
Despacito
Quiero desnudarte a besos despacito
Firm

## Extract embeddings for lyrics database

In [None]:
"""It was attempted to obtain the embeddings for the 1000 song lyrics all at
once, but the GPU's RAM was getting filled up during the process.
Therefore, the extraction of embeddings was divided into batches.
Various batch sizes (32, 64, 128, 256, 512) were experimented with, and
eventually, 128 was found to be the best option."""

batch_size = 128
num_batches = len(top_views_df) // batch_size + 1
embeddings = []

for i in range(num_batches):
    start = i * batch_size
    end = min((i + 1) * batch_size, len(top_views_df))
    batch_lyrics = [clean_lyrics(lyric) for lyric in top_views_df['lyrics'][start:end]]
    batch_embeddings = get_bert_embeddings(batch_lyrics)
    embeddings.extend(batch_embeddings)

embeddings = np.array(embeddings)
print(embeddings.shape)

(1000, 768)



## Create a `faiss` index with lyrics

In [None]:
dimension = embeddings[0].shape[0]

# Using Euclidean distance for distance measure between embeddings
lyric_index = faiss.IndexFlatL2(dimension)
lyric_index.add(embeddings)

In [None]:
query_text = """Pasito a pasito, suave suavecito
Nos vamos pegando, poquito a poquito
Que le enseñes a mi boca
Tus lugares favoritos
(Favorito, favorito, baby)
Pasito a pasito, suave suavecito
Nos vamos pegando, poquito a poquito"""

cleaned_query = clean_lyrics(query_text)
query_embedding = get_bert_embeddings(cleaned_query)[0]

In [None]:
k = 3 # Number of nearest neighbors
D, I = lyric_index.search(np.array([query_embedding]), k)
print(D, I)
# Display the results
for i in range(k):
  print(f"Neighbor {i+1}:")
  print(f"Text: {top_views_df.iloc[I[0][i]]['title']}")
  print(f"Distance: {D[0][i]}")

[[31.326694 32.30318  34.950733]] [[496 616 116]]
Neighbor 1:
Text: Tuyo
Distance: 31.32669448852539
Neighbor 2:
Text: Amorfoda
Distance: 32.30318069458008
Neighbor 3:
Text: Despacito
Distance: 34.95073318481445



## Final function: `get_covers`

In [None]:
def get_covers(youtube_url, k):
  lyrics = get_lyrics_from_youtube_url(youtube_url)
  lyrics = clean_lyrics(lyrics)
  query_embedding = get_bert_embeddings(lyrics)[0]
  D, I = lyric_index.search(np.array([query_embedding]), k)
  covers = []
  for i in range(k):
    covers.append(
        { "title": top_views_df.iloc[I[0][i]]['title'],
          "artist": top_views_df.iloc[I[0][i]]['artist'],
          "distance": D[0][i]
        }
    )
  return covers


## 📊Evaluation of the solution

Let's evaluate the system with 8 youtube videos:

* https://www.youtube.com/watch?v=BDC8Jr-gp_4
* https://www.youtube.com/watch?v=W_97b97G5ds
* https://www.youtube.com/watch?v=L53MZzuE0QY
* https://www.youtube.com/watch?v=9vmrPrYJPqI
* https://www.youtube.com/watch?v=R6ATpAr7rQU
* https://www.youtube.com/watch?v=RmtP8X4ZErs
* https://www.youtube.com/watch?v=DfMnRP0pk3A
* https://www.youtube.com/watch?v=1BVP72VrGQs

In [None]:
youtube_links = [
  "https://www.youtube.com/watch?v=BDC8Jr-gp_4",
  "https://www.youtube.com/watch?v=W_97b97G5ds",
  "https://www.youtube.com/watch?v=L53MZzuE0QY",
  "https://www.youtube.com/watch?v=9vmrPrYJPqI",
  "https://www.youtube.com/watch?v=R6ATpAr7rQU",
  "https://www.youtube.com/watch?v=RmtP8X4ZErs",
  "https://www.youtube.com/watch?v=DfMnRP0pk3A",
  "https://www.youtube.com/watch?v=1BVP72VrGQs",
]

covers = []
for link in youtube_links:
  covers.append(get_covers(link, 3))

 ↳ |███████████████████████████████████████████████████████| 100.0%



 ↳ |███████████████████████████████████████████████████████| 100.0%



 ↳ |███████████████████████████████████████████████████████| 100.0%

Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


 ↳ |███████████████████████████████████████████████████████| 100.0%



 ↳ |███████████████████████████████████████████████████████| 100.0%



 ↳ |███████████████████████████████████████████████████████| 100.0%



 ↳ |███████████████████████████████████████████████████████| 100.0%

Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


 ↳ |███████████████████████████████████████████████████████| 100.0%

Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


In [None]:
for item in covers:
  print(item)
  print("\n")

[{'title': 'Shape of You', 'artist': 'Ed Sheeran', 'distance': 29.463589}, {'title': 'All Too Well 10 Minute Version Taylors Version Live Acoustic', 'artist': 'Taylor Swift', 'distance': 46.21553}, {'title': 'Never Be the Same', 'artist': 'Camila Cabello', 'distance': 46.649014}]


[{'title': 'Believer', 'artist': 'Imagine Dragons', 'distance': 23.28256}, {'title': 'Glorious', 'artist': 'Macklemore', 'distance': 33.632366}, {'title': 'Off The Grid', 'artist': 'Kanye West', 'distance': 37.118755}]


[{'title': 'Rap God', 'artist': 'Eminem', 'distance': 13.217316}, {'title': 'Kamikaze', 'artist': 'Eminem', 'distance': 19.092209}, {'title': 'No More Parties In\xa0LA', 'artist': 'Kanye West', 'distance': 20.67038}]


[{'title': 'Blueberry Faygo', 'artist': 'Lil Mosey', 'distance': 56.441013}, {'title': 'Drowning', 'artist': 'A Boogie wit da Hoodie', 'distance': 60.71165}, {'title': 'Congratulations', 'artist': 'Post Malone', 'distance': 61.537067}]


[{'title': 'Marvins Room', 'artist': 'D

In [None]:
top_views_df[top_views_df['title'] == 'Un Buen Dia']

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


In [None]:
top_views_df[top_views_df['artist'] == 'Un Buen Dia']

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


In [None]:
top_views_df[top_views_df['artist'] == 'Los Planetas']

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


In [None]:
top_views_df[top_views_df['title'] == 'Los Planetas']

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


### Evaluation Discution
In the case of the last YouTube link, "Los Planetas - Un Buen Día," both the song and the artist are not present in the top_views_dataset. Consequently, the FAISS index will never find the exact song as a relevant neighbor. However, despite this, the FAISS index retrieves other Spanish songs as relevant neighbors, indicating that the language of the songs is well encoded in the embeddings. Therefore, it is not a bad result for this specific case. Given the above reasons, we will exclude it from the evaluation set.

### Evaluation Metric
I will use the Mean Average Precision for K=1. This will help determine the proportion of times when the first element retrieved by FAISS is the original song corresponding to the cover we use as input.

In [None]:
eval_set = []
real_songs_info = [
    { "title": 'Shape of You', "artist": 'Ed Sheeran'},
    { "title": 'Believer', "artist": 'Imagine Dragons'},
    { "title": 'Rap God', "artist": 'Eminem' },
    { "title": 'Get Lucky', "artist": 'Daft Punk'},
    { "title": 'Get Lucky', "artist": 'Daft Punk'},
    { "title": 'Bohemian Rhapsody', "artist": 'Queen'},
    { "title": 'The Hills', "artist": 'The Weeknd'},
]

for item, original_song_info in zip(covers, real_songs_info):
  # Only the first item is important in K=1 Precision
  eval_item = (item[0], original_song_info)
  eval_set.append(eval_item)

print(eval_set)

[({'title': 'Shape of You', 'artist': 'Ed Sheeran', 'distance': 29.463589}, {'title': 'Shape of You', 'artist': 'Ed Sheeran'}), ({'title': 'Believer', 'artist': 'Imagine Dragons', 'distance': 23.28256}, {'title': 'Believer', 'artist': 'Imagine Dragons'}), ({'title': 'Rap God', 'artist': 'Eminem', 'distance': 13.217316}, {'title': 'Rap God', 'artist': 'Eminem'}), ({'title': 'Blueberry Faygo', 'artist': 'Lil Mosey', 'distance': 56.441013}, {'title': 'Get Lucky', 'artist': 'Daft Punk'}), ({'title': 'Marvins Room', 'artist': 'Drake', 'distance': 37.253536}, {'title': 'Get Lucky', 'artist': 'Daft Punk'}), ({'title': 'Pink Matter', 'artist': 'Frank Ocean', 'distance': 28.029398}, {'title': 'Bohemian Rhapsody', 'artist': 'Queen'}), ({'title': 'Star Shopping', 'artist': 'Lil Peep', 'distance': 25.51746}, {'title': 'The Hills', 'artist': 'The Weeknd'})]


In [None]:
def get_precision_k1(true_neighbor, faiss_neighbor):
  if true_neighbor['title'] == faiss_neighbor['title'] and true_neighbor['artist'] == faiss_neighbor['artist']:
    return 1
  else:
    return 0

In [None]:
def get_Mean_Average_Precision_K1(eval_set):
  presicions = []
  for faiss_item, original_song_info in eval_set:
    presicions.append(get_precision_k1(original_song_info, faiss_item))
  return np.mean(presicions)


In [None]:
print("Mean Average Precision K=1: ", get_Mean_Average_Precision_K1(eval_set))

Mean Average Precision K=1:  0.42857142857142855
