# Matching Lyrics with Songs

This component matches generated lyrics to existing songs for which MIDI files are available. The MIDI files come from the Lakh Pianoroll Dataset.

## Importing Libraries

In [1]:
import csv
import pickle
import pandas as pd

from langdetect import detect
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

## Importing and Preparing Dataset

In [2]:
origData = pd.read_csv("../matched_lyrics.csv")
print(origData.shape)
origData.head()

(12447, 4)


Unnamed: 0,track_id,title,artist_name,Lyrics
0,TRMMMQN128F4238509,Raspberry Beret (LP Version),Prince & The Revolution,"One, two\nOne, two, three, uh\n\nYeah\n\nI was..."
1,TRMMWTG128F4283F07,You Needed Me,Roger Whittaker,I cried a tear: you wiped it dry.\nI was confu...
2,TRMMZRE128F42B799E,Sauver L'Amour,Daniel Balavoine,Partir effacer sur le Gange\nLa douleur\nPouvo...
3,TRMMZOT128F149E9EE,Prayee,The Chantels,
4,TRMMIZZ128F4289298,It Keep Rainin' (Tears From My Eyes) (Radio Edit),Bitty McLean,Pretty fallin' tears shining (Yes)\n(What's th...


In [3]:
origData.dropna(axis = 0, inplace = True)

In [4]:
lyricList = origData.Lyrics.values.tolist()
trackList = origData.track_id.values.tolist()
len(lyricList)

12029

In [5]:
finalLyricList = []
finalTrackList = []

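# Keep only the lyrics that langdetect classifies as English
for lyric, track in zip(lyricList, trackList):
	try:
		if detect(lyric) == "en":
			# Join lines with a space so words on adjacent lines do not run together
			l = lyric.replace("\n", " ")
			l = l.replace("******* This Lyrics is NOT for Commercial use *******", "")
			finalLyricList.append(l)
			finalTrackList.append(track)
	except Exception:
		# detect() can fail, e.g. on text with no detectable features; skip those rows
		continue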

len(finalLyricList)

9792
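The cleaning done inside the loop above can also be pulled out into a small helper, which makes it easy to check on a single lyric. The sketch below is only an illustration: `clean_lyric` and `WATERMARK` are hypothetical names that do not appear in the notebook, and collapsing line breaks into single spaces (rather than removing them outright) is an assumption about the intended behaviour.

```python
import re

# Watermark string copied from the filtering cell above.
WATERMARK = "******* This Lyrics is NOT for Commercial use *******"

def clean_lyric(text):
    """Hypothetical helper: strip the watermark and flatten line breaks."""
    text = text.replace(WATERMARK, " ")
    # Collapse every run of whitespace (including "\n") into a single space.
    return re.sub(r"\s+", " ", text).strip()

print(clean_lyric("One, two\nOne, two, three, uh\n\nYeah"))
```

Keeping a space between lines matters for the embedding step, since otherwise the last word of one line and the first word of the next would be fused into a single token.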

## Generating Embeddings

We use a pretrained Sentence-BERT (S-BERT) model from Hugging Face, `bert-base-nli-mean-tokens`, to generate the embeddings.

In [6]:
transformer = SentenceTransformer("bert-base-nli-mean-tokens")

In [7]:
embeds = []
for i in finalLyricList:
	e = transformer.encode(i)
	embeds.append(e)

Store the sentences and their embeddings on disk.

In [8]:
with open("../embeddings.pickle", "wb") as file:
	pickle.dump(
		{
			"sentences": finalLyricList,
			"embeddings": embeds
		},
		file,
		protocol = pickle.HIGHEST_PROTOCOL
	)
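For later sessions, the pickle can be read back rather than recomputing the embeddings. The following is a minimal sketch rather than part of the original notebook: `load_embeddings` is a hypothetical helper, the toy values stand in for real lyrics and vectors, and the length check assumes the file stores exactly one sentence per embedding.

```python
import os
import pickle
import tempfile

def load_embeddings(path):
    """Hypothetical helper: load the pickle and check the two lists line up."""
    with open(path, "rb") as file:
        data = pickle.load(file)
    # Assumption: one stored sentence per stored embedding.
    if len(data["sentences"]) != len(data["embeddings"]):
        raise ValueError("sentences and embeddings differ in length")
    return data["sentences"], data["embeddings"]

# Toy round trip with made-up values standing in for the real data.
path = os.path.join(tempfile.mkdtemp(), "embeddings.pickle")
with open(path, "wb") as file:
    pickle.dump(
        {"sentences": ["a", "b"], "embeddings": [[0.1, 0.2], [0.3, 0.4]]},
        file,
        protocol=pickle.HIGHEST_PROTOCOL,
    )

sentences, embeddings = load_embeddings(path)
print(len(sentences), len(embeddings))
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.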

## Matching Embeddings to Similar Songs

We use one of the outputs of the Lyrics Generation component as the query lyric.

In [9]:
songLyric = "Late to bed, early to rise of my head down all the things that you thought of me ohhhohhohoh and you could change my pride is broken my pride is a face of my mind it's like a child in the sand that i could be with i was lost i can't take back i can't stop to let the sound with the hands of my hands when i was wrong tell me tell me tell me tell me i'm forgiven and it's only so fly"

In [10]:
token = transformer.encode(songLyric)
sim = cosine_similarity([token], embeds)
sim.shape

(1, 9792)

This produces one cosine-similarity score between the generated lyric and each of the 9792 songs in `embeds`.
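To turn these scores into actual matches, the highest-scoring entries have to be mapped back to track IDs. The notebook does not show this step, so the sketch below is one possible way to do it: `top_matches` is a hypothetical helper, and the scores and IDs are toy values. In the notebook the call would presumably look like `top_matches(sim[0], finalTrackList)`.

```python
import heapq

def top_matches(scores, track_ids, k=5):
    """Hypothetical helper: return the k (track_id, score) pairs with the highest score."""
    if len(scores) != len(track_ids):
        raise ValueError("scores and track_ids differ in length")
    pairs = zip(track_ids, scores)
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])

# Toy values for illustration only.
toy_scores = [0.12, 0.87, 0.45, 0.91]
toy_tracks = ["TR_A", "TR_B", "TR_C", "TR_D"]

best = top_matches(toy_scores, toy_tracks, k=2)
print(best)  # [('TR_D', 0.91), ('TR_B', 0.87)]
```

`heapq.nlargest` avoids sorting the full score list when only a handful of matches are needed.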

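Write the track list to a CSV file for easier retrieval. The IDs are in the same order as the embeddings, so column `i` of `sim` corresponds to entry `i` of the list.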

In [11]:
with open("../trackList.csv", "w", newline = "") as file:
	wr = csv.writer(file, quoting = csv.QUOTE_ALL)
	wr.writerow(finalTrackList)
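Because `writerow` puts every track ID into a single CSV row, reading the file back means taking the first (and only) row. A minimal sketch under that assumption follows; `read_track_list` is a hypothetical helper, and the IDs are toy values written to a temporary file.

```python
import csv
import os
import tempfile

def read_track_list(path):
    """Hypothetical helper: read the single CSV row back into a list of track IDs."""
    with open(path, newline="") as file:
        return next(csv.reader(file))

# Toy round trip that mirrors how the notebook writes the file.
path = os.path.join(tempfile.mkdtemp(), "trackList.csv")
with open(path, "w", newline="") as file:
    wr = csv.writer(file, quoting=csv.QUOTE_ALL)
    wr.writerow(["TR_A", "TR_B", "TR_C"])

tracks = read_track_list(path)
print(tracks)  # ['TR_A', 'TR_B', 'TR_C']
```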