## **MUSIC RECOMMENDATION USING SONG LYRICS**

## **INTRODUCTION**

In the harmonious tapestry of life, music emerges as a universal language, capable of expressing emotions, telling stories, and painting vivid pictures with its melodies and lyrics. We've all experienced moments where the right song at the right time has the power to heal wounds, spark joy, or offer solace. Yet, the vast musical landscape often conceals the treasures of good music, making it a challenge to discover the perfect tune that resonates with our hearts.

Recognizing the profound impact that music has on our lives, I embark on a journey to create an enchanting project. This project is dedicated to the art of music recommendation, but with a twist that sets it apart from the ordinary. Rather than relying solely on genre, tempo, or artist, this project takes a more lyrical approach. I aim to delve into the very essence of songs, exploring their rich tapestries of words and sentiments.

I aim to assist in discovering the perfect musical companion for any moment by recommending songs based on the topics and themes found in their lyrics. The lyrics of a song are the poetic expressions of the artist's heart and mind, and within them, we often find reflections of our own experiences, emotions, and desires. This project seeks to connect music lovers with songs that speak to the unique stories of their lives, songs that provide solace in difficult times and amplify joy during moments of celebration.

With the charm of song lyrics as my guide, I aspire to create a recommendation system that doesn't just understand musical preferences but also comprehends the lyrical narratives that resonate with the soul. Through this innovative approach, I aim to assist you in finding the perfect soundtrack to your life's unfolding story.

Join me on this captivating journey as I explore the enchanting world of song lyrics to recommend music that not only suits your tastes but also speaks to your heart. Together, we will uncover the hidden gems that complement the various chapters of your life and add an extra layer of magic to your musical experience.

## **OBJECTIVE**

To develop a music recommendation system that intimately connects listeners with songs through the power of lyrics.

## **INSTALL DEPENDENCIES**

In [9]:
# check the status and usage of NVIDIA GPU(s) in the system
!nvidia-smi

Sun Oct 29 21:04:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [10]:
# check the version of the NVIDIA CUDA Compiler (NVCC) installed on the system.
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


In [11]:
#Install necessary packages if needed

!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
!pip install pandarallel



## **DATA**

The dataset used was from https://www.kaggle.com/datasets/edenbd/150k-lyrics-labeled-with-spotify-valence. The dataset downloaded contains 158,353 songs.

The data consists of the following columns:
- artist: Contains artist names
- seq: song lyrics
- song: song's title
- label: Spotify valence feature attribute for this song.

In [12]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
# change the current working directory and list the contents of that directory

%cd /content/drive/MyDrive/Recommendation Engine/Meka
%ls

/content/drive/.shortcut-targets-by-id/1p1E13iyI50ZckNrA8S3PXu023GP-adfI/Meka
'=0.30'                                Llama-2-7b-Chat-GPTQ/         outputs/
'=2.6.1'                               LLM-Lab.ipynb                'Pre-processed dataset'/
 alpaca-lora-7b/                      'Music Datasets'/              Recommendation.ipynb
'Copy of Music Recommendation.ipynb'   Music_recommendation.ipynb    Transformers_Models/
 datasets/                            'Music Recommendation.ipynb'


In [14]:
# import libraries for text cleaning and preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# track progress of data processing tasks
from tqdm.notebook import tqdm
tqdm.pandas()

from pandarallel import pandarallel #enable parallel processing of Pandas DataFrames using parallel computing
import multiprocessing
cores = multiprocessing.cpu_count()
pandarallel.initialize(progress_bar=True, nb_workers=int(cores))
print(cores)

from nltk.tag import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer #for text feature extraction
from sklearn.decomposition import NMF

import numpy as np #for numerical computing
import seaborn as sns #data visualization
import matplotlib.pyplot as plt
import pandas as pd

import pickle

INFO: Pandarallel will run on 2 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
2


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [16]:
#Read the CSV file into a DataFrame
file_path1 = '/content/drive/MyDrive/Recommendation Engine/Meka/Music Datasets/labeled_lyrics_cleaned.csv'
df = pd.read_csv(file_path1)

## **DATA EXPLORATION**

Our data is stored in a dataframe named "df". The head and info commands are used to display the first few rows of the dataframe and to print a concise summary of it, respectively. The info output includes the data type of each column, the number of non-null values, and how much memory the DataFrame uses. It is useful for quickly assessing the integrity of the data, understanding what kind of data is stored in each column, and identifying missing values.

In [17]:
#View the top rows of the dataframe
df.head()

Unnamed: 0.1,Unnamed: 0,artist,seq,song,label
0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626
1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63
2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24
3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536
4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371


In [18]:
#show a summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158353 entries, 0 to 158352
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  158353 non-null  int64  
 1   artist      158353 non-null  object 
 2   seq         158353 non-null  object 
 3   song        158353 non-null  object 
 4   label       158353 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 6.0+ MB


In [None]:
# View the raw lyrics of a single song (row 2)
df.seq[2]

'She don\'t live on planet Earth no more\r\nShe found love on Venus, that\'s her word\r\nSaid she needed space, time to explore\r\nNow she movin\' on and on and on and on\r\n\r\nIs it might fault that you\'re broken, is it my fault that your\'e high?\r\nYou caught me cheating with my ex-girlfriend, but instead of calling me I caught her\r\n\r\nI was like "man, oh shit, my new girl fuckin\' with my old chick"\r\nWe both bad, so she switched, packed her bags, said she had to take a trip\r\nTo the otherside, the otherside, the otherside\r\nShe\'s gone, she\'s gone, she\'s gone, to the otherside\r\n\r\nCan we get back to how it was before?\r\nIf I can\'t have you, at least show me what\'s in store\r\nAnd what I\'ve made is [?] come forward more\r\nThen we can go on and on and on and on\r\n\r\nIs it might fault that you\'re broken, is it my fault that your high?\r\nYou caught me cheating with my ex-girlfriend, but instead of calling me, you called her\r\n\r\nI was like "man, oh shit, my new

## **TEXT CLEANING**

The code in the following cells cleans the lyrics in the column "seq". The output of the cell above shows how dirty the raw lyrics in the seq column are. Functions are defined to clean the text and to remove the stop words from the seq column.

In [19]:
# define function to clean the text and remove all the unnecessary elements.
def clean_corpus(sentence):
    sentence = sentence.lower() # text to lowercase
    if sentence.isdigit():
        sentence = 'Code_' + sentence # prefix purely numeric entries
    if len(sentence) <=1:
        sentence = 'empty' # placeholder for empty or single-character entries

    sentence = re.sub(r'https?://\S+|www\.\S+','', sentence) # remove all internet links
    sentence = re.sub(r'\s\{\$\S*', '',sentence) # Remove tokens that start with "{$"
    sentence = re.sub(r'/\(.*\)/', '',sentence) # Remove parenthesised text wrapped in slashes, i.e. "/(...)/"
    sentence = re.sub(r"[@#$*&]", '',sentence) # Remove special characters
    sentence = re.sub(r'\n', '', sentence) # Remove line breaks
    sentence = re.sub(r'\(\w*\)', '', sentence) # Remove single words within parentheses
    sentence = re.sub(r'(\W\s)|(\W$)|(\W\d*)', ' ',sentence) # Replace punctuation (and digits right after it) with a space
    # Remove masked dates such as xx/xx/xxxx.
    # NOTE: the trailing x* branch also strips every letter "x" from the lyrics.
    sentence = re.sub(r'x+((/xx)*/\d*\s*)|x*', '',sentence)
    sentence = re.sub(r'\d+\s', '', sentence) # Remove numbers followed by whitespace
    sentence = re.sub(r'\d*', '', sentence) # Remove any remaining digits
    sentence = re.sub(r' +', ' ',sentence) # Collapse repeated spaces
    return sentence

In [20]:
# define function to remove stopwords from lyrics
def remove_stop_words(document):
    # build the stop word set once per call instead of reloading the list for every word
    stop_words = set(stopwords.words("english"))
    # tokenize into words (the text is already lower-cased by clean_corpus)
    words = word_tokenize(document)
    # remove stop words
    words = [word for word in words if word not in stop_words]
    # join words to make sentence
    return " ".join(words)

In [None]:
# df.head()

In [None]:
# df_cleaned = df[['artist', 'seq']].copy()
# df_cleaned.head()

Unnamed: 0,artist,seq
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r..."
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m..."
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\..."
4,Elijah Blake,"I see a midnight panther, so gallant and so br..."


In [None]:
#apply clean_corpus to the dataframe
# df_cleaned['seq_cleaned_1'] = df['seq'].progress_apply(clean_corpus)

In [None]:
#apply function to remove stop_words to the dataframe
# df_cleaned['seq_cleaned_2'] = df_cleaned['seq_cleaned_1'].parallel_apply(remove_stop_words)
# df_cleaned.head()

Unnamed: 0,artist,seq,seq_cleaned_1,seq_cleaned_2
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",no no i ain t ever trapped out the bando but o...,ever trapped bando oh lord get wrong know coup...
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",the drinks go down and smoke goes up i feel my...,drinks go smoke goes feel got let go cares get...
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,she don t live on planet earth no more she fou...,live planet earth found love venus word said n...
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",trippin off that grigio mobbin lights low trip...,trippin grigio mobbin lights low trippin grigi...
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",i see a midnight panther so gallant and so bra...,see midnight panther gallant brave found found...


In [None]:
# save the pre-processed dataset
# df_cleaned.to_csv('/content/drive/MyDrive/Recommendation Engine/Meka/Pre-processed dataset/cleaned_lyrics.csv', index = False)

In [None]:
# df_cleaned = pd.read_csv('/content/drive/MyDrive/Recommendation Engine/Meka/Pre-processed dataset/cleaned_lyrics.csv')

In [21]:
# function that keeps only the nouns of a sentence, using POS tags to drop verbs, numbers, adjectives, etc.
def get_clean_pos_tags_2(sentence):
    noun_tags = ('NN', 'NNP', 'NNS', 'NNPS')
    sent_tokens = word_tokenize(sentence)
    pos_tags_ = pos_tag(sent_tokens)
    sent_ = sentence.split(' ')

    # single-word inputs are returned unchanged
    if len(sent_) <= 1:
        return sent_[0]

    # keep only the tokens tagged as nouns
    sel_nouns = [word for word, pos in pos_tags_ if pos in noun_tags]
    return " ".join(sel_nouns)

In [None]:
# df_cleaned['seq_cleaned_3'] = df_cleaned['seq_cleaned_2'].parallel_apply(get_clean_pos_tags_2)
# df_cleaned.head()

Unnamed: 0,artist,seq,seq_cleaned_1,seq_cleaned_2,seq_cleaned_3
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",no no i ain t ever trapped out the bando but o...,ever trapped bando oh lord get wrong know coup...,bando oh lord couple place everybody name atti...
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",the drinks go down and smoke goes up i feel my...,drinks go smoke goes feel got let go cares get...,drinks smoke cares crowd lights eyes live till...
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,she don t live on planet earth no more she fou...,live planet earth found love venus word said n...,planet earth venus word space time movin fault...
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",trippin off that grigio mobbin lights low trip...,trippin grigio mobbin lights low trippin grigi...,trippin grigio lights trippin grigio lights tr...
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",i see a midnight panther so gallant and so bra...,see midnight panther gallant brave found found...,midnight panther gallant brave answers thunder...


In [None]:
# df_cleaned.to_csv('/content/drive/MyDrive/Recommendation Engine/Meka/Pre-processed dataset/cleaned_lyrics_2.csv', index = False)

In [None]:
df_cleaned = pd.read_csv('/content/drive/MyDrive/Recommendation Engine/Meka/Pre-processed dataset/cleaned_lyrics_2.csv')
df_cleaned.head()

Unnamed: 0,artist,seq,seq_cleaned_1,seq_cleaned_2,seq_cleaned_3
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",no no i ain t ever trapped out the bando but o...,ever trapped bando oh lord get wrong know coup...,bando oh lord couple place everybody name atti...
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",the drinks go down and smoke goes up i feel my...,drinks go smoke goes feel got let go cares get...,drinks smoke cares crowd lights eyes live till...
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,she don t live on planet earth no more she fou...,live planet earth found love venus word said n...,planet earth venus word space time movin fault...
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",trippin off that grigio mobbin lights low trip...,trippin grigio mobbin lights low trippin grigi...,trippin grigio lights trippin grigio lights tr...
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",i see a midnight panther so gallant and so bra...,see midnight panther gallant brave found found...,midnight panther gallant brave answers thunder...


In [None]:
#calculate the number of missing values in each column of the DataFrame
df_cleaned.isna().sum()

artist             0
seq                0
seq_cleaned_1     78
seq_cleaned_2     85
seq_cleaned_3    119
dtype: int64

In [None]:
# drop the rows where seq_cleaned_3 is missing
df_cleaned.dropna(inplace = True, subset = ['seq_cleaned_3'])

After applying the text cleaning functions to the dataframe, we notice missing values that were not present in the original dataset. This can be attributed to our functions removing irrelevant text: some lyrics are left with no content at all, and these empty entries appear as missing values when the saved CSV is reloaded. We proceed by dropping all rows with a missing value in seq_cleaned_3 from our dataframe.
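The sketch below uses a toy DataFrame with made-up rows (not the lyrics dataset) to show how an empty string produced by cleaning comes back as a missing value after a CSV round trip with pandas' default settings. The column name `seq_cleaned` and the in-memory buffer are assumptions for illustration only.

```python
import io

import pandas as pd

# Toy frame: the second row's cleaned text is an empty string, not NaN
toy = pd.DataFrame({
    "seq": ["la la la", "123 456"],
    "seq_cleaned": ["la la la", ""],
})
missing_before = int(toy["seq_cleaned"].isna().sum())

# Round-trip through CSV, mimicking a save followed by a reload
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
missing_after = int(reloaded["seq_cleaned"].isna().sum())

print(missing_before, missing_after)
```

If the empty strings should be preserved instead of dropped, `pd.read_csv(..., keep_default_na=False)` is one option to consider.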

In [None]:
# re-check the number of missing values in each column after dropping rows
df_cleaned.isna().sum()

artist           0
seq              0
seq_cleaned_1    0
seq_cleaned_2    0
seq_cleaned_3    0
dtype: int64

## **TOPIC MODELLING**

Topic modeling is a technique in natural language processing and text mining that is used to identify topics or themes within a collection of documents. It's a way to automatically discover abstract topics from a large dataset of text, enabling researchers and analysts to understand the main ideas or subjects discussed in the documents without having to read through them individually.

Below is how topic modeling works in a nutshell:

- Document Collection: Start with a large collection of text documents, such as articles, research papers, or social media posts.

- Vectorization: Convert the text into numerical form using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings. This step transforms words into numerical vectors that can be processed by machine learning algorithms.

- Algorithm Application: Apply a topic modeling algorithm, such as Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF), to the numerical representations of the documents. These algorithms analyze patterns in word co-occurrence to identify topics. I used NMF in my solution.

- Topic Identification: The algorithm identifies groups of words that frequently occur together in the documents. Each group of words represents a topic. Importantly, the algorithm does this without any prior knowledge of the topics or the documents.

- Interpretation: After the algorithm has identified topics, analysts interpret these topics based on the words that are most strongly associated with each topic. They can then label the topics based on these interpretations.

In [None]:
%%time
## Topic Modeling
# min_df=1024 keeps only the terms that appear in at least 1,024 songs
vect = TfidfVectorizer(min_df=1024, stop_words='english')

# Fit and transform
X = vect.fit_transform(df_cleaned['seq_cleaned_3'])



CPU times: user 5.75 s, sys: 50.7 ms, total: 5.8 s
Wall time: 5.81 s


In [None]:
%%time
# save the fitted TF-IDF vectorizer (the file is closed automatically)
with open("/content/drive/MyDrive/Recommendation Engine/Meka/Transformers_Models/tfidf_lyrics_recommendation_mindf1024.pickle", "wb") as f:
    pickle.dump(vect, f)
print('Done...')

Done...
CPU times: user 63.5 ms, sys: 6.7 ms, total: 70.2 ms
Wall time: 89.1 ms


In [None]:
%%time
# Create an NMF instance: model
# the 200 components will be the topics

model = NMF(n_components=200, random_state=87)

# Fit the model to TF-IDF
model.fit(X)

# Transform the TF-IDF: nmf_features
nmf_features = model.transform(X)

CPU times: user 27min 3s, sys: 1min 17s, total: 28min 21s
Wall time: 29min 4s


In [None]:
%%time
# save the fitted NMF model (the file is closed automatically)
with open("/content/drive/MyDrive/Recommendation Engine/Meka/Transformers_Models/model_1_mindf1024_comp200.pickle", "wb") as f:
    pickle.dump(model, f)
print('Done...')

Done...
CPU times: user 1.35 ms, sys: 970 µs, total: 2.32 ms
Wall time: 11.4 ms


In [None]:
#TF-IDF matrix
print(X.shape)

#Features matrix
print(nmf_features.shape)

(158234, 638)
(158234, 200)


In [None]:
# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_, columns=vect.get_feature_names_out())
display(components_df)
print(components_df.shape)

Unnamed: 0,act,age,ah,air,alright,angel,angels,answer,answers,anybody,...,ya,yea,yeah,year,years,yes,yesterday,yo,york,youi
0,0.004178,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.003189,0.0
1,0.000000,0.000000,0.0,0.0,0.000000,0.044410,0.0,0.000000,0.000000,0.000000,...,0.0,0.003309,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
2,0.003562,0.000000,0.0,0.0,0.025710,0.006245,0.0,0.000000,0.000000,0.000000,...,0.0,0.009816,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
3,0.007146,0.002692,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.001268,0.000000,...,0.0,0.000073,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
4,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.010061,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0.000000,0.030789,0.0,0.0,0.000000,0.000000,0.0,0.005045,0.006894,0.000000,...,0.0,0.047154,0.0,0.0,0.0,0.0,0.0,0.0,0.099975,0.0
196,0.033797,0.030896,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.018658,...,0.0,0.008088,0.0,0.0,0.0,0.0,0.0,0.0,0.007965,0.0
197,0.016931,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.109868,0.0
198,0.000000,0.004610,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.010225,0.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


(200, 638)


In [None]:
for topic in range(components_df.shape[0]):
    tmp = components_df.iloc[topic]
    print(f'For topic {topic+1} the words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')

For topic 1 the words with the highest value are:
money      1.923298
pay        0.080683
dollar     0.069519
cash       0.065773
pocket     0.052010
car        0.051350
paper      0.042947
clothes    0.035978
price      0.031293
women      0.029314
Name: 0, dtype: float64


For topic 2 the words with the highest value are:
love       7.843057
speak      0.080198
brings     0.056124
kisses     0.055188
tender     0.050406
lover      0.046224
angel      0.044410
glad       0.036131
thought    0.036042
lovin      0.036023
Name: 1, dtype: float64


For topic 3 the words with the highest value are:
baby       8.928651
lovin      0.129682
babe       0.124223
diamond    0.095592
treat      0.076062
mon        0.069355
feelin     0.068099
crazy      0.056848
come       0.054850
darlin     0.049698
Name: 2, dtype: float64


For topic 4 the words with the highest value are:
time        7.040465
waste       0.164119
clock       0.071538
crime       0.059656
rhyme       0.034994
pass        0.034

In [None]:
prob = "top songs about peace and love"
# Vectorize the query with the fitted TF-IDF vectorizer
X_test = vect.transform([prob])
# Project the TF-IDF vector onto the NMF topics: nmf_features_test
nmf_features_test = model.transform(X_test)
# Index of the topic with the highest weight
pd.DataFrame(nmf_features_test).idxmax(axis=1)

0    1
dtype: int64

In [None]:
print("sentence : ",  prob)
tmp_test = components_df.iloc[pd.DataFrame(nmf_features_test).idxmax(axis=1)[0]]
print(f'For topic {pd.DataFrame(nmf_features_test).idxmax(axis=1)[0]+1} the words with the highest value are:')
print(' '.join(tmp_test.nlargest(20)[:20].index))
print(tmp_test.nlargest(20)[:20].index)

sentence :  top songs about peace and love
For topic 2 the words with the highest value are:
love speak brings kisses tender lover angel glad thought lovin grow trust crazy loves desire touch happiness learn hurt return
Index(['love', 'speak', 'brings', 'kisses', 'tender', 'lover', 'angel', 'glad',
       'thought', 'lovin', 'grow', 'trust', 'crazy', 'loves', 'desire',
       'touch', 'happiness', 'learn', 'hurt', 'return'],
      dtype='object')


In [None]:
def topic_extraction(sentence):
    # Vectorize the sentence with the fitted TF-IDF vectorizer
    X_test = vect.transform([sentence])
    # Project the TF-IDF vector onto the NMF topics
    nmf_features_test = model.transform(X_test)
    # Index of the topic with the highest weight
    dominant_topic = nmf_features_test.argmax(axis=1)[0]
    # Return the 20 highest-weighted words of that topic
    return ' '.join(components_df.iloc[dominant_topic].nlargest(20).index)


In [None]:
df_cleaned['topics'] = df_cleaned['seq_cleaned_3'].parallel_apply(topic_extraction)
df_cleaned.head()

Unnamed: 0,artist,seq,seq_cleaned_1,seq_cleaned_2,seq_cleaned_3,topics
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",no no i ain t ever trapped out the bando but o...,ever trapped bando oh lord get wrong know coup...,bando oh lord couple place everybody name atti...,money pay dollar cash pocket car paper clothes...
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",the drinks go down and smoke goes up i feel my...,drinks go smoke goes feel got let go cares get...,drinks smoke cares crowd lights eyes live till...,till skies sleep babe desire plan tune kiss te...
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,she don t live on planet earth no more she fou...,live planet earth found love venus word said n...,planet earth venus word space time movin fault...,man understand wife plan women band truck dog ...
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",trippin off that grigio mobbin lights low trip...,trippin grigio mobbin lights low trippin grigi...,trippin grigio lights trippin grigio lights tr...,baby lovin babe diamond treat mon feelin crazy...
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",i see a midnight panther so gallant and so bra...,see midnight panther gallant brave found found...,midnight panther gallant brave answers thunder...,love speak brings kisses tender lover angel gl...


In [None]:
# df_cleaned.to_csv('/content/drive/MyDrive/Recommendation Engine/Meka/Pre-processed dataset/cleaned_lyrics_topics.csv', index = False)

In [None]:
df_cleaned = pd.read_csv('/content/drive/MyDrive/Recommendation Engine/Meka/Pre-processed dataset/cleaned_lyrics_topics.csv')
df_cleaned.head(70)

Unnamed: 0,artist,seq,seq_cleaned_1,seq_cleaned_2,seq_cleaned_3,topics
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",no no i ain t ever trapped out the bando but o...,ever trapped bando oh lord get wrong know coup...,bando oh lord couple place everybody name atti...,money pay dollar cash pocket car paper clothes...
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",the drinks go down and smoke goes up i feel my...,drinks go smoke goes feel got let go cares get...,drinks smoke cares crowd lights eyes live till...,till skies sleep babe desire plan tune kiss te...
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,she don t live on planet earth no more she fou...,live planet earth found love venus word said n...,planet earth venus word space time movin fault...,man understand wife plan women band truck dog ...
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",trippin off that grigio mobbin lights low trip...,trippin grigio mobbin lights low trippin grigi...,trippin grigio lights trippin grigio lights tr...,baby lovin babe diamond treat mon feelin crazy...
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",i see a midnight panther so gallant and so bra...,see midnight panther gallant brave found found...,midnight panther gallant brave answers thunder...,love speak brings kisses tender lover angel gl...
...,...,...,...,...,...,...
65,Eliza and the Bear,Don't fall asleep before tonight\r\n'Cause I s...,don t fall asleep before tonight cause i steal...,fall asleep tonight cause steal light eyes cal...,fall cause steal eyes thief broken thief broke...,change scene difference view weather blame poi...
66,Eliza Carthy,Oh no\r\nHere comes that sun again\r\nThat mea...,oh no here comes that sun again that means ano...,oh comes sun means another day without friend ...,means day friend hurts mirror hurts easy peopl...,time waste clock crime rhyme pass remember wea...
67,Eliza Carthy,I've given blowjobs on couches\r\nTo men who d...,i ve given blowjobs on couches to men who didn...,given blowjobs couches men want anymore tell n...,blowjobs couches men hours evenings company me...,men angels earth fields winter gold angel chri...
68,Eliza Carthy,Last night when I saw him\r\nLast night last w...,last night when i saw him last night last week...,last night saw last night last week saw lover ...,night night week cheek hand demon lover man so...,love speak brings kisses tender lover angel gl...


In [None]:
columns = ['artist','seq','seq_cleaned_3','topics']
final_df = df_cleaned[columns].copy()
final_df.head()

Unnamed: 0,artist,seq,seq_cleaned_3,topics
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",bando oh lord couple place everybody name atti...,money pay dollar cash pocket car paper clothes...
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",drinks smoke cares crowd lights eyes live till...,till skies sleep babe desire plan tune kiss te...
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,planet earth venus word space time movin fault...,man understand wife plan women band truck dog ...
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",trippin grigio lights trippin grigio lights tr...,baby lovin babe diamond treat mon feelin crazy...
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",midnight panther gallant brave answers thunder...,love speak brings kisses tender lover angel gl...


## **INFERENCE**

The following code takes a user input, which can be a song lyric or a set of topics. The text is first preprocessed (tokenized and cleaned) to prepare it for analysis. The preprocessed text is then transformed using a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer. TF-IDF quantifies the importance of words in the text relative to a corpus of documents: a word's weight grows with its frequency in the text and shrinks the more documents in the collection contain it.
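For reference, the weighting can be written out as follows. This is a sketch of what scikit-learn's `TfidfVectorizer` computes with its default settings (smoothed IDF, L2 normalisation), as I understand them; the symbols $n$, $\mathrm{tf}$, $\mathrm{df}$, and $w$ are my own notation, not names from the code.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t), \qquad
\hat{w}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
$$

Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$. The last step rescales every document vector to unit length.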

The TF-IDF transformed text is further processed using Non-Negative Matrix Factorization (NMF), a technique used for topic modeling. NMF decomposes the text into topics, represented as distributions of words. In this context, it helps identify the underlying thematic structure within the lyrics.
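In matrix notation, and using the shapes printed during training, the factorization can be sketched as below. The symbols $X$, $W$, and $H$ are my own labels, and the objective shown is scikit-learn's default Frobenius loss without regularisation, as I understand it.

$$
X \approx W H, \qquad
X \in \mathbb{R}_{\ge 0}^{158234 \times 638},\quad
W \in \mathbb{R}_{\ge 0}^{158234 \times 200},\quad
H \in \mathbb{R}_{\ge 0}^{200 \times 638}
$$

$$
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\lVert X - W H \rVert_F^2
$$

Each row of $W$ holds the topic weights of one song (`nmf_features`), and each row of $H$ holds the word weights of one topic (`model.components_`).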

The code then determines the dominant topic in the text by finding the index with the highest value in the NMF-transformed features. This index corresponds to the specific topic number, indicating the main theme of the input text.

To provide insights into the dominant topic, the code extracts and presents the top 20 words associated with this topic. These words are determined based on the coefficients stored in the components_df DataFrame, revealing the key thematic elements identified by the model.

Following this, the code counts the number of words in the input text that match the top words associated with the dominant topic. This word matching process helps gauge the thematic relevance of the input lyrics to the identified topic.

Lastly, the code applies a threshold to filter songs from the final_df DataFrame. Songs are selected where the count of topic-related words exceeds the specified threshold. This filtered list represents song recommendations based on thematic similarity to the provided lyrics, allowing users to explore songs that share similar thematic elements and emotions.
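The word matching and threshold steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's inference code: `count_topic_words`, the `topic_words` list, the toy songs, and the threshold value are all hypothetical.

```python
import pandas as pd


def count_topic_words(text, topic_words):
    """Count how many tokens in `text` appear among the topic's top words."""
    vocabulary = set(topic_words)
    return sum(1 for token in str(text).split() if token in vocabulary)


# Hypothetical top words of the dominant topic found for the user's input
topic_words = ["love", "kisses", "tender", "lover", "angel"]

# Toy stand-in for final_df with cleaned, noun-only lyrics
toy_songs = pd.DataFrame({
    "song": ["A", "B", "C"],
    "seq_cleaned_3": [
        "love lover angel night",
        "money cash car",
        "love time road",
    ],
})

# Count topic-related words in every song
toy_songs["topic_word_count"] = toy_songs["seq_cleaned_3"].apply(
    lambda text: count_topic_words(text, topic_words)
)

# Keep the songs whose count exceeds the threshold
threshold = 1
recommendations = toy_songs[toy_songs["topic_word_count"] > threshold]
print(recommendations[["song", "topic_word_count"]])
```

A higher threshold gives fewer but more on-topic recommendations; a lower one widens the list.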

In [22]:
# Load the cleaned lyrics together with their assigned topics
df_cleaned = pd.read_csv('/content/drive/MyDrive/Recommendation Engine/Meka/Pre-processed dataset/cleaned_lyrics_topics.csv')
columns = ['artist','seq','seq_cleaned_3','topics']
final_df = df_cleaned[columns].copy()
# Attach the 'song' title from df by matching on 'artist' and 'seq' (the raw lyrics).
# Note: the merge starts from df_cleaned, not the column subset above, so every column is kept.
final_df = pd.merge(df_cleaned, df[['artist', 'song', 'seq']], on = ['artist', 'seq'], how = 'left')
final_df.head()

Unnamed: 0,artist,seq,seq_cleaned_1,seq_cleaned_2,seq_cleaned_3,topics,song
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",no no i ain t ever trapped out the bando but o...,ever trapped bando oh lord get wrong know coup...,bando oh lord couple place everybody name atti...,money pay dollar cash pocket car paper clothes...,Everyday
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",the drinks go down and smoke goes up i feel my...,drinks go smoke goes feel got let go cares get...,drinks smoke cares crowd lights eyes live till...,till skies sleep babe desire plan tune kiss te...,Live Till We Die
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,she don t live on planet earth no more she fou...,live planet earth found love venus word said n...,planet earth venus word space time movin fault...,man understand wife plan women band truck dog ...,The Otherside
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",trippin off that grigio mobbin lights low trip...,trippin grigio mobbin lights low trippin grigi...,trippin grigio lights trippin grigio lights tr...,baby lovin babe diamond treat mon feelin crazy...,Pinot
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",i see a midnight panther so gallant and so bra...,see midnight panther gallant brave found found...,midnight panther gallant brave answers thunder...,love speak brings kisses tender lover angel gl...,Shadows & Diamonds


In [25]:
import pickle
# Load the fitted TF-IDF vectorizer and the fitted NMF model saved earlier
vect_ = pickle.load(open('/content/drive/MyDrive/Recommendation Engine/Meka/Transformers_Models/tfidf_lyrics_recommendation_mindf1024.pickle', 'rb'))
model_ = pickle.load(open('/content/drive/MyDrive/Recommendation Engine/Meka/Transformers_Models/model_1_mindf1024_comp200.pickle', 'rb'))
# Topic-term weights: one row per topic, one column per vocabulary term
components_df_ = pd.DataFrame(model_.components_, columns=vect_.get_feature_names_out())

In [43]:
# Input the song lyrics. The lyrics used here are from Majid Jordan's 'Hands Tied'
req = """I was stressed out, goin' out of my mind
When you found me, you know you caught my eye
You really calmed me down, it was a different time
You really showed me, you know me

[Chorus]
Sunlight on the water
Far as I can see
Now we found each other
It's making sense to me

[Verse 2]
Out the window, sneakin' out of your room
While they all sleep, you gotta find me soon
I'm in my car now (Feelin' it, feelin' it)
You know what to do, you come and meet me (That shit so special, that shit so special)
It's after dark now
Put your soft hands on me, show me

[Chorus]
Sunlight on the water
Far as I can see
Now we found each other (Ooh-ooh-ooh)
It's making sense to me
"""

In [44]:
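# Vectorize the input lyrics with the fitted TF-IDF vectorizer
X_test = vect_.transform([req])
# Project the TF-IDF vector onto the NMF topics: nmf_features
nmf_features_test = model_.transform(X_test)
# 0-based index of the topic with the largest weight
pd.DataFrame(nmf_features_test).idxmax(axis=1)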

0    86
dtype: int64

In [45]:
print("sentence : ",  req)
# 0-based index of the dominant topic for the input lyrics
topic_idx = pd.DataFrame(nmf_features_test).idxmax(axis=1)[0]
# Term weights of that topic
tmp_test = components_df_.iloc[topic_idx]
# Topics are reported 1-based, so index 86 is printed as topic 87
print(f'For topic {topic_idx + 1} the words with the highest value are:')
print(' '.join(tmp_test.nlargest(20).index))
print(tmp_test.nlargest(20).index)

sentence :  I was stressed out, goin' out of my mind
When you found me, you know you caught my eye
You really calmed me down, it was a different time
You really showed me, you know me

[Chorus]
Sunlight on the water
Far as I can see
Now we found each other
It's making sense to me

[Verse 2]
Out the window, sneakin' out of your room
While they all sleep, you gotta find me soon
I'm in my car now (Feelin' it, feelin' it)
You know what to do, you come and meet me (That shit so special, that shit so special)
It's after dark now
Put your soft hands on me, show me

[Chorus]
Sunlight on the water
Far as I can see
Now we found each other (Ooh-ooh-ooh)
It's making sense to me

For topic 87 the words with the highest value are:
ooh babe lover huh lovin touch yea brings saw news keeps ho uh darlin minute kisses ones shine kiss heaven
Index(['ooh', 'babe', 'lover', 'huh', 'lovin', 'touch', 'yea', 'brings', 'saw',
       'news', 'keeps', 'ho', 'uh', 'darlin', 'minute', 'kisses', 'ones',
       'shin

In [46]:
top_topics = ' '.join(tmp_test.nlargest(20)[:20].index)

In [47]:
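def word_count(sentence, topics=None):
  # Count the words in `sentence` that appear in the topic word list.
  # Compare against a set of whole words: `x in top_topics` on the raw string
  # would also match substrings (for example "he" inside "heaven").
  topic_words = set((top_topics if topics is None else topics).split())
  return sum(1 for x in sentence.split() if x in topic_words)

def song_recommendation(df, df_column, topics, threshold):
  # Keep the songs whose `df_column` shares at least `threshold` words with `topics`
  df = df.copy()
  df['word_match'] = df[df_column].apply(lambda s: word_count(s, topics))
  return df[df['word_match'] >= threshold]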

In [48]:
word_count(top_topics)

20

In [49]:
song_recommendation(final_df,"topics", top_topics, 15).sort_values(by = "word_match", ascending = False)

Unnamed: 0,artist,seq,seq_cleaned_1,seq_cleaned_2,seq_cleaned_3,topics,song,word_match
794,Judy Roderick,"We're doing it.\r\n\r\nI look around me,\r\nBu...",we re doing it i look around me but all i seem...,look around seem see people going epecting sym...,look people sympathy motions tell inspiration ...,ooh babe lover huh lovin touch yea brings saw ...,When Im Gone,20
102573,BoDeans,Ooh ooh ooh ooh\nOoh ooh ooh ooh\nOoh ooh ooh ...,ooh ooh ooh oohooh ooh ooh oohooh ooh ooh ooho...,ooh ooh ooh oohooh ooh ooh oohooh ooh ooh ooho...,ooh ooh oohooh ooh ooh oohooh ooh ooh oohooh o...,ooh babe lover huh lovin touch yea brings saw ...,Pick Up the Pieces,20
109741,Jesse McCartney,"I'm an addict, a junkie, a fiend.\nI gotta hav...",i m an addict a junkie a fiend i gotta have it...,addict junkie fiend got ta keeps callin need f...,addict junkie keeps hit body hand shakin dope ...,ooh babe lover huh lovin touch yea brings saw ...,In My Veins,20
109863,Jessica Andrews,I spent years and all of this time thinking I ...,i spent years and all of this time thinking i ...,spent years time thinking better cause mine al...,years time cause mine way highwayso life drive...,ooh babe lover huh lovin touch yea brings saw ...,There's More to Me Than You [Version],20
110024,Jessie J,I'm feeling sexy and free\nLike glitter's rain...,i m feeling sey and freelike glitter s raining...,feeling sey freelike glitter raining meyour li...,sey freelike glitter meyour pure goldi taste t...,ooh babe lover huh lovin touch yea brings saw ...,Domino,20
...,...,...,...,...,...,...,...,...
56573,Bananarama,Cheers then here's two old friends \r\nWe thou...,cheers then here s two old friends we thought ...,cheers two old friends thought never say goodb...,cheers friends goodbye times worth try place f...,ooh babe lover huh lovin touch yea brings saw ...,Cheers Then,20
56603,Bananarama,Who needs friends who never show\r\nI'll tell ...,who needs friends who never show i ll tell you...,needs friends never show tell want know could ...,needs know heart nights call til friends ooh o...,ooh babe lover huh lovin touch yea brings saw ...,I Heard a Rumour,20
56645,Bananarama,"Ooh, how do you like your love\r\nOoh, how do ...",ooh how do you like your love ooh how do you l...,ooh like love ooh like love want know really f...,ooh ooh love get cameras action baby take hear...,ooh babe lover huh lovin touch yea brings saw ...,"More, More, More",20
56892,Simple Minds,"Footsteps, I can hear footsteps in the hall\r\...",footsteps i can hear footsteps in the hall i h...,footsteps hear footsteps hall hear footsteps s...,footsteps footsteps footsteps time time things...,ooh babe lover huh lovin touch yea brings saw ...,I Wish You Were Here,20
