# **MUSIC RECOMMENDATION USING SONG LYRICS**

## **INTRODUCTION**

In the harmonious tapestry of life, music emerges as a universal language, capable of expressing emotions, telling stories, and painting vivid pictures with its melodies and lyrics. We've all experienced moments where the right song at the right time has the power to heal wounds, spark joy, or offer solace. Yet, the vast musical landscape often conceals the treasures of good music, making it a challenge to discover the perfect tune that resonates with our hearts.

Recognizing the profound impact that music has on our lives, I embark on a journey to create an enchanting project. This project is dedicated to the art of music recommendation, but with a twist that sets it apart from the ordinary. Rather than relying solely on genre, tempo, or artist, it takes a lyrical approach: I aim to delve into the very essence of songs, exploring their rich tapestries of words and sentiments.

I aim to assist in discovering the perfect musical companion for any moment by recommending songs based on the topics and themes found in their lyrics. The lyrics of a song are the poetic expressions of the artist's heart and mind, and within them, we often find reflections of our own experiences, emotions, and desires. This project seeks to connect music lovers with songs that speak to the unique stories of their lives, songs that provide solace in difficult times and amplify joy during moments of celebration.

With the charm of song lyrics as my guide, I aspire to create a recommendation system that doesn't just understand musical preferences but also comprehends the lyrical narratives that resonate with the soul. Through this innovative approach, I aim to assist you in finding the perfect soundtrack to your life's unfolding story.

Join me on this captivating journey as I explore the enchanting world of song lyrics to recommend music that not only suits your tastes but also speaks to your heart. Together, we will uncover the hidden gems that complement the various chapters of your life and add an extra layer of magic to your musical experience.

## **OBJECTIVE**

To develop a music recommendation system that intimately connects listeners with songs through the power of lyrics.

## **INSTALL DEPENDENCIES**

In [1]:
#Install necessary packages if needed
%pip install pandarallel

Note: you may need to restart the kernel to use updated packages.


## **DATA**

The dataset used was from https://www.kaggle.com/datasets/edenbd/150k-lyrics-labeled-with-spotify-valence. The dataset downloaded contains 158,353 songs.

The data consists of the following columns:
- Unnamed: 0: a row index carried over from the CSV file
- artist: the artist's name
- seq: the song's lyrics
- song: the song's title
- label: the Spotify valence attribute for the song

In [2]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install spotipy

Note: you may need to restart the kernel to use updated packages.


In [4]:
%pip install tqdm

Note: you may need to restart the kernel to use updated packages.


In [5]:
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.


In [6]:
# import libraries for text cleaning and preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# track progress of data processing tasks
from tqdm.notebook import tqdm
tqdm.pandas()

from pandarallel import pandarallel #enable parallel processing of Pandas DataFrames using parallel computing
import multiprocessing
cores = multiprocessing.cpu_count()
pandarallel.initialize(progress_bar=True, nb_workers=int(cores))
print(cores)

from nltk.tag import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer #for text feature extraction
from sklearn.decomposition import NMF

import numpy as np #for numerical computing
import seaborn as sns #data visualization
import matplotlib.pyplot as plt
import pandas as pd

import pickle

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/chukwuemekaugwu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/chukwuemekaugwu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/chukwuemekaugwu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/chukwuemekaugwu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
8


In [7]:
#Read the CSV file into a DataFrame
file_path1 = '~/Kaggle/Music Datasets 2/labeled_lyrics_cleaned.csv'
df = pd.read_csv(file_path1)

## **DATA EXPLORATION**

Our data is stored in a dataframe named "df". The head and info commands were used to display the first few rows of the dataframe and to provide a concise summary of it, respectively. The info command reports the data type of each column, the number of non-null values, and how much memory the DataFrame is using. It's very useful for quickly assessing the integrity of the data, understanding what kind of data is stored in each column, and identifying missing values.

In [8]:
#View the top rows of the dataframe
df.head()

Unnamed: 0.1,Unnamed: 0,artist,seq,song,label
0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626
1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63
2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24
3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536
4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371


In [9]:
#show a summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158353 entries, 0 to 158352
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  158353 non-null  int64  
 1   artist      158353 non-null  object 
 2   seq         158353 non-null  object 
 3   song        158351 non-null  object 
 4   label       158353 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 6.0+ MB


In [10]:
# Inspect the raw lyrics of a single song (row 2)
df.seq[2]

'She don\'t live on planet Earth no more\r\nShe found love on Venus, that\'s her word\r\nSaid she needed space, time to explore\r\nNow she movin\' on and on and on and on\r\n\r\nIs it might fault that you\'re broken, is it my fault that your\'e high?\r\nYou caught me cheating with my ex-girlfriend, but instead of calling me I caught her\r\n\r\nI was like "man, oh shit, my new girl fuckin\' with my old chick"\r\nWe both bad, so she switched, packed her bags, said she had to take a trip\r\nTo the otherside, the otherside, the otherside\r\nShe\'s gone, she\'s gone, she\'s gone, to the otherside\r\n\r\nCan we get back to how it was before?\r\nIf I can\'t have you, at least show me what\'s in store\r\nAnd what I\'ve made is [?] come forward more\r\nThen we can go on and on and on and on\r\n\r\nIs it might fault that you\'re broken, is it my fault that your high?\r\nYou caught me cheating with my ex-girlfriend, but instead of calling me, you called her\r\n\r\nI was like "man, oh shit, my new

## **TEXT CLEANING**

The code in the following cells cleans the lyrics in the "seq" column. The output above shows how noisy the raw lyrics are: they contain line-break characters (\r\n), punctuation, and annotations such as [?]. Functions are defined to clean the text and remove the stop words from the seq column.
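
As a rough illustration of what this cleaning achieves, the sketch below applies a simplified version of the idea (lowercase, replace everything that is not a letter, collapse whitespace) to the opening lines of the lyrics shown above. The helper `simple_clean` is hypothetical and is not the notebook's `clean_corpus`, which uses a longer chain of regular expressions.

```python
import re

def simple_clean(text):
    """Simplified, hypothetical stand-in for the notebook's cleaning step."""
    text = text.lower()                    # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)  # replace anything that is not a letter or whitespace
    text = re.sub(r"\s+", " ", text)       # collapse whitespace runs, including \r\n
    return text.strip()

raw = "She don't live on planet Earth no more\r\nShe found love on Venus, that's her word"
print(simple_clean(raw))
# she don t live on planet earth no more she found love on venus that s her word
```

Note that apostrophes are replaced by spaces, so contractions such as "don't" split into "don" and "t", which matches the look of the cleaned columns later in the notebook.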

In [11]:
# define function to clean the text and remove all the unnecessary elements.
def clean_corpus(sentence):
    sentence = sentence.lower() # text to lowercase
    if sentence.isdigit():
        sentence = 'Code_' + sentence # flag entries that consist only of digits
    if len(sentence) <=1:
        sentence = 'empty' # placeholder for empty or one-character entries

    sentence = re.sub(r'https?://\S+|www\.\S+','', sentence) # remove all internet links
    sentence = re.sub(r'\s\{\$\S*', '',sentence) # Remove whitespace-preceded tokens that start with "{$"
    sentence = re.sub(r'/\(.*\)/', '',sentence) # Remove parenthesized text wrapped in slashes, i.e. /(...)/
    sentence = re.sub(r"[@#$*&]", '',sentence) # Remove special characters
    sentence = re.sub(r'\n', '', sentence) # Remove line breaks
    sentence = re.sub(r'\(\w*\)', '', sentence) # Remove a single word enclosed in parentheses
    sentence = re.sub(r'(\W\s)|(\W$)|(\W\d*)', ' ',sentence) # Replace each non-word character (plus any digits right after it) with a space
    sentence = re.sub(r'x+((/xx)*/\d*\s*)|x*', '',sentence) # Meant for masked dates such as xx/xx/2020; note that the x* branch also deletes every letter "x"
    sentence = re.sub(r'\d+\s', '', sentence) # Remove numbers followed by whitespace
    sentence = re.sub(r'\d*', '', sentence) # Remove any remaining digits
    sentence = re.sub(r' +', ' ',sentence) # Collapse repeated spaces
    return sentence

In [12]:
# define function to remove stopwords from lyrics
# build the stop word set once; rebuilding the list for every word is very slow
STOP_WORDS = set(stopwords.words("english"))

def remove_stop_words(document):
    # tokenize into words (the text is already lowercased by clean_corpus)
    words = word_tokenize(document)
    # remove stop words
    words = [word for word in words if word not in STOP_WORDS]
    # join words to make sentence
    document = " ".join(words)
    return document

In [14]:
df_cleaned = df[['artist', 'seq']].copy()
df_cleaned.head()

Unnamed: 0,artist,seq
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r..."
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m..."
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\..."
4,Elijah Blake,"I see a midnight panther, so gallant and so br..."


In [15]:
#apply clean_corpus to the dataframe
# df_cleaned['seq_cleaned_1'] = df['seq'].progress_apply(clean_corpus)

In [16]:
#apply function to remove stop_words to the dataframe
# df_cleaned['seq_cleaned_2'] = df_cleaned['seq_cleaned_1'].parallel_apply(remove_stop_words)
# df_cleaned.head()

In [17]:
# save the pre-processed dataset
# df_cleaned.to_csv('/content/drive/MyDrive/Recommendation Engine/Pre-processed dataset/cleaned_lyrics.csv', index = False)

In [18]:
df_cleaned = pd.read_csv('~/Kaggle/Pre-processed dataset/cleaned_lyrics.csv')

In [15]:
#function to keep only the nouns (POS tags NN, NNP, NNS, NNPS), dropping verbs, numbers, adjectives, etc.
def get_clean_pos_tags_2(sentence):
    sent_tokens = word_tokenize(sentence)
    pos_tags_ = pos_tag(sent_tokens)
    sent_ = sentence.split(' ')

    try:
        if len(sent_)>1:
            # keep only the tokens tagged as nouns
            sel_nouns = [word for word, pos in pos_tags_ if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
            nn_words_sent = " ".join(sel_nouns)
        else:
            # single-word input: return it unchanged
            nn_words_sent = sent_[0]
    except Exception:
        # fall back to the unfiltered text if filtering fails
        nn_words_sent = sentence
    return nn_words_sent

In [20]:
# df_cleaned['seq_cleaned_3'] = df_cleaned['seq_cleaned_2'].parallel_apply(get_clean_pos_tags_2)
# df_cleaned.head()

In [21]:
# df_cleaned.to_csv('~/Kaggle/Pre-processed dataset/cleaned_lyrics_2.csv', index = False)

In [16]:
df_cleaned = pd.read_csv('~/Kaggle/Pre-processed dataset/cleaned_lyrics_2.csv')
df_cleaned.head()

Unnamed: 0,artist,seq,seq_cleaned_1,seq_cleaned_2,seq_cleaned_3
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",no no i ain t ever trapped out the bando but o...,ever trapped bando oh lord get wrong know coup...,bando oh lord couple place everybody name atti...
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",the drinks go down and smoke goes up i feel my...,drinks go smoke goes feel got let go cares get...,drinks smoke cares crowd lights eyes live till...
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,she don t live on planet earth no more she fou...,live planet earth found love venus word said n...,planet earth venus word space time movin fault...
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",trippin off that grigio mobbin lights low trip...,trippin grigio mobbin lights low trippin grigi...,trippin grigio lights trippin grigio lights tr...
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",i see a midnight panther so gallant and so bra...,see midnight panther gallant brave found found...,midnight panther gallant brave answers thunder...


In [17]:
#calculate the number of missing values in each column of the DataFrame
df_cleaned.isna().sum()

artist             0
seq                0
seq_cleaned_1     78
seq_cleaned_2     85
seq_cleaned_3    119
dtype: int64

In [18]:
#drop the rows that have a missing value in seq_cleaned_3
df_cleaned.dropna(inplace = True, subset = ['seq_cleaned_3'])

After applying the text cleaning functions to the dataframe, we notice missing values that were not present in the original dataset. This can be attributed to our functions removing irrelevant text from the dataset, which likely leaves some lyrics empty so that they are read back from the CSV file as missing values. We proceed by dropping every row with a missing value in seq_cleaned_3; the check that follows confirms that no missing values remain in any column.
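
The likely mechanism can be reproduced on a toy example: when cleaning leaves an empty string, writing the frame to CSV and reading it back turns that cell into a missing value under pandas' default settings. The column names and rows below are made up for illustration.

```python
import io
import pandas as pd

# Toy frame: the second "lyric" has been reduced to an empty string by cleaning
toy = pd.DataFrame({
    "seq": ["la la love", "123 !!!"],
    "seq_cleaned": ["la la love", ""],
})
print(toy.isna().sum().sum())  # 0: no missing values before saving

# Round-trip through CSV, as the notebook does with its pre-processed files
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.isna().sum())   # the empty string is now counted as missing

# Drop the affected rows, mirroring dropna(subset=[...]) in the notebook
cleaned = reloaded.dropna(subset=["seq_cleaned"])
print(len(reloaded), "->", len(cleaned))
```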

In [19]:
#confirm that no missing values remain in any column
df_cleaned.isna().sum()

artist           0
seq              0
seq_cleaned_1    0
seq_cleaned_2    0
seq_cleaned_3    0
dtype: int64

## **TOPIC MODELLING**

Topic modeling is a technique in natural language processing and text mining that is used to identify topics or themes within a collection of documents. It's a way to automatically discover abstract topics from a large dataset of text, enabling researchers and analysts to understand the main ideas or subjects discussed in the documents without having to read through them individually.

Below is how topic modeling works in a nutshell:

- Document Collection: Start with a large collection of text documents, such as articles, research papers, or social media posts.

- Vectorization: Convert the text into numerical form using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings. This step transforms words into numerical vectors that can be processed by machine learning algorithms.

- Algorithm Application: Apply a topic modeling algorithm, such as Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF), to the numerical representations of the documents. These algorithms analyze patterns in word co-occurrence to identify topics. I used NMF in my solution.

- Topic Identification: The algorithm identifies groups of words that frequently occur together in the documents. Each group of words represents a topic. Importantly, the algorithm does this without any prior knowledge of the topics or the documents.

- Interpretation: After the algorithm has identified topics, analysts interpret these topics based on the words that are most strongly associated with each topic. They can then label the topics based on these interpretations.

In [20]:
%%time
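## Topic Modeling
# min_df=1024 keeps only the terms that appear in at least 1,024 songs
vect = TfidfVectorizer(min_df=1024, stop_words='english')

# Fit the vectorizer and transform the noun-only lyrics into a TF-IDF matrix
X = vect.fit_transform(df_cleaned['seq_cleaned_3'])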



CPU times: user 2.99 s, sys: 76.9 ms, total: 3.07 s
Wall time: 3.21 s


In [22]:
%%time
# save the fitted vectorizer; the with-block closes the file handle
with open("Transformers_Models/tfidf_lyrics_recommendation_mindf1024.pickle", "wb") as f:
    pickle.dump(vect, f)
print('Done...')

Done...
CPU times: user 41.4 ms, sys: 23.5 ms, total: 65 ms
Wall time: 78.3 ms


In [None]:
%%time
# Create an NMF instance: model
# the 200 components will be the topics

model = NMF(n_components=200, random_state=87)

# Fit the model to TF-IDF
model.fit(X)

# Transform the TF-IDF: nmf_features
nmf_features = model.transform(X)

In [None]:
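%%time
# open() does not expand "~", so save next to the TF-IDF pickle,
# which is also where the inference cells load the model from
with open("Transformers_Models/model_1_mindf1024_comp200.pickle", "wb") as f:
    pickle.dump(model, f)
print('Done...')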

In [None]:
#TF-IDF matrix
print(X.shape)

#Features matrix
print(nmf_features.shape)

In [None]:
# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_, columns=vect.get_feature_names_out())
display(components_df)
print(components_df.shape)

In [None]:
for topic in range(components_df.shape[0]):
    tmp = components_df.iloc[topic]
    print(f'For topic {topic+1} the words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')

In [None]:
prob = "top songs about peace and love"
# Vectorize the query with the fitted TF-IDF vectorizer
X_test = vect.transform([prob])
# Project the TF-IDF vector onto the NMF topics
nmf_features_test = model.transform(X_test)
# Index of the topic with the highest weight
pd.DataFrame(nmf_features_test).idxmax(axis=1)

In [None]:
print("sentence : ",  prob)
tmp_test = components_df.iloc[pd.DataFrame(nmf_features_test).idxmax(axis=1)[0]]
print(f'For topic {pd.DataFrame(nmf_features_test).idxmax(axis=1)[0]+1} the words with the highest value are:')
print(' '.join(tmp_test.nlargest(20)[:20].index))
print(tmp_test.nlargest(20)[:20].index)

In [None]:
def topic_extraction(sentence):
  # Vectorize the text with the fitted TF-IDF vectorizer
  X_test = vect.transform([sentence])
  # Project the TF-IDF vector onto the NMF topics
  nmf_features_test = model.transform(X_test)
  # Pick the topic with the highest weight
  top_topic = nmf_features_test.argmax(axis=1)[0]
  # Return the 20 highest-weighted words of that topic as one string
  return ' '.join(components_df.iloc[top_topic].nlargest(20).index)


In [None]:
df_cleaned['topics'] = df_cleaned['seq_cleaned_3'].parallel_apply(topic_extraction)
df_cleaned.head()
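
Calling `topic_extraction` once per song repeats the vectorizer and model transforms for every row. If a document-topic matrix such as `nmf_features` is already available and its rows line up with the songs, the dominant topic could in principle be read off for all songs at once with a row-wise `argmax`, which should give essentially the same assignment. The sketch below uses made-up numbers and hypothetical names (`toy_features`, `toy_topic_words`) rather than the notebook's real matrices.

```python
import numpy as np

# Made-up document-topic weights: 3 songs x 4 topics (stand-in for nmf_features)
toy_features = np.array([
    [0.1, 0.7, 0.0, 0.2],
    [0.5, 0.1, 0.3, 0.1],
    [0.0, 0.0, 0.9, 0.4],
])

# Hypothetical top-word strings, one per topic (stand-in for the rows of components_df)
toy_topic_words = np.array(["money cash", "love heart", "night light", "road home"])

# Row-wise argmax gives the index of the dominant topic for every song at once
dominant = toy_features.argmax(axis=1)
print(dominant)                   # [1 0 2]

# Fancy indexing maps each song to the word string of its dominant topic
print(toy_topic_words[dominant])
```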

In [None]:
# df_cleaned.to_csv('~/Kaggle/Pre-processed dataset/cleaned_lyrics_topics.csv', index = False)

In [None]:
df_cleaned = pd.read_csv('~/Kaggle/Pre-processed dataset/cleaned_lyrics_topics.csv')
df_cleaned.head(70)

In [None]:
columns = ['artist','seq','seq_cleaned_3','topics']
final_df = df_cleaned[columns].copy()
final_df.head()

## **INFERENCE**

In [24]:
df_cleaned = pd.read_csv('~/Kaggle/Pre-processed dataset/cleaned_lyrics_topics.csv')
# Attach the 'song' title from the original df by matching on 'artist' and 'seq' (the lyrics)
final_df = pd.merge(df_cleaned, df[['artist', 'song', 'seq']], on=['artist', 'seq'], how='left')
final_df.head()

Unnamed: 0,artist,seq,seq_cleaned_1,seq_cleaned_2,seq_cleaned_3,topics,song
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",no no i ain t ever trapped out the bando but o...,ever trapped bando oh lord get wrong know coup...,bando oh lord couple place everybody name atti...,money pay dollar cash pocket car paper clothes...,Everyday
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",the drinks go down and smoke goes up i feel my...,drinks go smoke goes feel got let go cares get...,drinks smoke cares crowd lights eyes live till...,till skies sleep babe desire plan tune kiss te...,Live Till We Die
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,she don t live on planet earth no more she fou...,live planet earth found love venus word said n...,planet earth venus word space time movin fault...,man understand wife plan women band truck dog ...,The Otherside
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",trippin off that grigio mobbin lights low trip...,trippin grigio mobbin lights low trippin grigi...,trippin grigio lights trippin grigio lights tr...,baby lovin babe diamond treat mon feelin crazy...,Pinot
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",i see a midnight panther so gallant and so bra...,see midnight panther gallant brave found found...,midnight panther gallant brave answers thunder...,love speak brings kisses tender lover angel gl...,Shadows & Diamonds


In [None]:
final_df.to_csv('~/Kaggle/Pre-processed dataset/final_df.csv', index = False)

In [25]:
%%time
# save the final dataframe; the with-block closes the file handle
with open("Transformers_Models/final_df.pickle", "wb") as f:
    pickle.dump(final_df, f)
print('Done...')

Done...
CPU times: user 172 ms, sys: 275 ms, total: 447 ms
Wall time: 949 ms


In [None]:
vect_ = pickle.load(open('Transformers_Models/tfidf_lyrics_recommendation_mindf1024.pickle', 'rb'))
model_ = pickle.load(open('Transformers_Models/model_1_mindf1024_comp200.pickle', 'rb'))
components_df_ = pd.DataFrame(model_.components_, columns=vect_.get_feature_names_out())

In [None]:
req = """I was stressed out, goin' out of my mind
When you found me, you know you caught my eye
You really calmed me down, it was a different time
You really showed me, you know me

[Chorus]
Sunlight on the water
Far as I can see
Now we found each other
It's making sense to me

[Verse 2]
Out the window, sneakin' out of your room
While they all sleep, you gotta find me soon
I'm in my car now (Feelin' it, feelin' it)
You know what to do, you come and meet me (That shit so special, that shit so special)
It's after dark now
Put your soft hands on me, show me

[Chorus]
Sunlight on the water
Far as I can see
Now we found each other (Ooh-ooh-ooh)
It's making sense to me
"""

In [None]:
# Vectorize the request with the loaded TF-IDF vectorizer
X_test = vect_.transform([req])
# Project the TF-IDF vector onto the NMF topics
nmf_features_test = model_.transform(X_test)
# Index of the topic with the highest weight
pd.DataFrame(nmf_features_test).idxmax(axis=1)

In [None]:
print("sentence : ",  req)
tmp_test = components_df_.iloc[pd.DataFrame(nmf_features_test).idxmax(axis=1)[0]]
print(f'For topic {pd.DataFrame(nmf_features_test).idxmax(axis=1)[0]+1} the words with the highest value are:')
print(' '.join(tmp_test.nlargest(20)[:20].index))
print(tmp_test.nlargest(20)[:20].index)

In [None]:
top_topics = ' '.join(tmp_test.nlargest(20)[:20].index)

In [None]:
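def word_count(sentence, topic_words=None):
  # Count the words in `sentence` that are topic words (whole-word match)
  if topic_words is None:
    topic_words = set(top_topics.split())  # fall back to the global top_topics
  return sum(1 for x in sentence.split() if x in topic_words)

def song_recommendation(df, df_column, topics, threshold):
  # Split the topic string into a set so matching is per word, not per substring
  topic_words = set(topics.split())
  df = df.copy()
  df['word_match'] = df[df_column].apply(word_count, topic_words=topic_words)
  return df[df['word_match'] >= threshold]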
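
One subtlety when counting word matches: Python's `in` operator tests substrings when the right-hand side is a string, but tests whole elements when it is a set. A minimal sketch with made-up topic words (the names `top_words`, `word_set`, and `count_matches` are hypothetical) shows how the two can disagree:

```python
# Made-up topic words for illustration
top_words = "heart lover kisses"

# `in` on a string checks substrings, so fragments of words also match
print("art" in top_words)    # True: "art" is inside "heart"
print("kiss" in top_words)   # True: "kiss" is inside "kisses"

# `in` on a set checks whole words only
word_set = set(top_words.split())
print("art" in word_set)     # False
print("heart" in word_set)   # True

def count_matches(sentence, words):
    """Count the words of `sentence` for which `word in words` is true."""
    return sum(1 for w in sentence.split() if w in words)

lyric_topics = "art kiss heart"
print(count_matches(lyric_topics, top_words))  # 3 (substring matching)
print(count_matches(lyric_topics, word_set))   # 1 (whole-word matching)
```

For a match threshold such as 15 out of 20 topic words, whole-word matching is the stricter and more predictable of the two.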

In [None]:
word_count(top_topics)

In [None]:
song_recommendation(final_df,"topics", top_topics, 15).sort_values(by = "word_match", ascending = False)

In [None]:
df_cleaned = pd.read_csv('/content/drive/MyDrive/Recommendation Engine/Pre-processed dataset/cleaned_lyrics_topics.csv')
# df_cleaned = pd.read_csv('Pre-processed dataset/cleaned_lyrics_topics.csv') ## Yassine path for testing

columns = ['artist','seq','seq_cleaned_3','topics']
final_df = df_cleaned[columns].copy()
final_df.head()

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    # GPTQConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)

from peft import LoraConfig
from trl import SFTTrainer

import datasets

In [None]:
# model_id = "Hermes-13B-GPTQ"
model_id = "Llama-2-7b-Chat-GPTQ"

# model_id = "Llma-2-7b-hf"
# model_id = "dr_Llma-2-7b-hf"

In [None]:
## load a dataset and we can train our model.
lyrics_data = datasets.Dataset.from_pandas(final_df[['seq', 'artist', 'topics']], split="train")
lyrics_data

In [None]:
# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"  # Fix for fp16

# Quantization Config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)

In [None]:
# Model
base_model = AutoModelForCausalLM.from_pretrained(model_id,
                                                  torch_dtype=torch.float16,
                                                  load_in_8bit=True,
                                                  # quantization_config=quant_config,
                                                  device_map="auto",
                                                  trust_remote_code=False,
                                                  revision="main")

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

In [None]:
# LoRA Config
peft_parameters = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)

# Training Params
train_params = TrainingArguments(
    output_dir="outputs",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    # optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
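
To make the LoRA settings above more concrete: LoRA keeps a pretrained weight matrix frozen and learns a low-rank update of rank `r`, scaled by `lora_alpha / r`. The NumPy sketch below illustrates that idea on a made-up 64 x 64 layer. It is not how `peft` implements LoRA internally, and the layer size and initialization are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in = 64, 64       # made-up layer size (assumption)
r, lora_alpha = 8, 16      # rank and scaling values from the LoraConfig above

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable factor, small random init
B = np.zeros((d_out, r))               # trainable factor, zero init so the update starts at 0

scaling = lora_alpha / r               # 16 / 8 = 2.0
W_effective = W + scaling * (B @ A)    # weight used in the forward pass

full_params = W.size                   # 64 * 64 = 4096
lora_params = A.size + B.size          # 8 * 64 + 64 * 8 = 1024
print(scaling, full_params, lora_params)  # 2.0 4096 1024
```

In this toy layer the adapter trains a quarter of the parameters of the full matrix; for the much larger matrices of a 7B-parameter model the fraction is far smaller, which is what makes this style of fine-tuning affordable.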

In [None]:
import os
os.environ["CURL_CA_BUNDLE"] = ""

In [None]:
torch.set_grad_enabled(True)  # Called as a function (not as a context manager): turns gradient computation on globally

In [None]:
# Trainer
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=lyrics_data,
    peft_config=peft_parameters,
    dataset_text_field= "seq",
    tokenizer=llama_tokenizer,
    # tokenizer=tokenizer,
    max_seq_length=2048,
    args=train_params
)

In [None]:
# Training
fine_tuning.train()

In [None]:
# Save Model
fine_tuning.model.save_pretrained("Meka_MusicLycs_LLM")

In [None]:
# # Load the tokenizer and model
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# # Display information about the model
# display(model)

# # Display quantization configuration
# display(model.config.quantization_config.to_dict())

In [None]:
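def generate_song_recommendation(prompt, model, tokenizer, max_tokens=128, device=0):
    # Tokenize the prompt and move it to the target device
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    # max_new_tokens caps the generated continuation; max_length would also count the prompt tokens
    output_ids = model.generate(input_ids=input_ids, max_new_tokens=max_tokens, num_return_sequences=1)
    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return generated_text

prompt = "I'm feeling lonely tonight"
# Use the fine-tuned model and the Llama tokenizer defined above: in this notebook the bare
# name `model` refers to the NMF topic model, and `tokenizer` is not defined
recommended_lyrics = generate_song_recommendation(prompt, fine_tuning.model, llama_tokenizer)
print("Recommended Lyrics:", recommended_lyrics)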
