<a href="https://colab.research.google.com/github/Alii-Tavakolii/Song_Lyrics_Clustering/blob/main/lyrics_clustering_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pandas scikit-learn sentence-transformers matplotlib seaborn

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

In [4]:
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/musicLyrics.csv')
print("Dataset loaded successfully from Google Drive!")
print(df.head())
print(df.info())

Dataset loaded successfully from Google Drive!
                                               Lyric
0  Cryptic psalms Amidst the howling winds A scor...
1  Im sleeping tonight with all the wolves Were d...
2  Wings of the darkest descent Fall from the rea...
3  [Verse 1] Norrid Radd was my real name Had a j...
4  Deep in the dungeons of doom and despair Sneak...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2999 entries, 0 to 2998
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Lyric   2999 non-null   object
dtypes: object(1)
memory usage: 23.6+ KB
None
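
The preview shows that row 3 begins with a section tag, `[Verse 1]`. Tags like this describe song structure rather than lyrics, and a cleanup that only strips non-letter characters would keep the word inside the brackets as an ordinary token. Below is a minimal sketch of removing such tags first, assuming they always appear as short bracketed spans. The helper name `strip_section_tags` and its pattern are illustrative and not part of the original notebook.

```python
import re

# Hypothetical helper (not in the original notebook): removes bracketed
# section markers such as "[Verse 1]" or "[Chorus]".
SECTION_TAG = re.compile(r'\[[^\[\]]*\]')

def strip_section_tags(text):
    """Remove bracketed tags and collapse the whitespace left behind."""
    without_tags = SECTION_TAG.sub(' ', text)
    return ' '.join(without_tags.split())

# Sample taken from the visible part of row 3 in the preview above.
sample = "[Verse 1] Norrid Radd was my real name"
cleaned = strip_section_tags(sample)
print(cleaned)
```

On the DataFrame this could be applied with `df['Lyric'].apply(strip_section_tags)` before any further cleaning. Whether dropping the tags helps the clustering is something to check empirically.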


### Reasons for Text Preprocessing


1.  Noise reduction: deleting irrelevant characters (digits, punctuation, stray symbols) keeps the focus on meaningful content.
2.  Standardization: stemming or lemmatization converts words to a common base form.
3.  Dimensionality reduction: removing stop words (such as "the", "a", "am") drops tokens that carry little semantic value.
4.  Better model input: a cleaner, more consistent, and more relevant representation of the text can improve the model's accuracy.
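
Points 1 and 3 can be seen on a toy example. The sketch below uses two made-up lines and a tiny hand-written stop-word set (both are assumptions for illustration; the notebook itself uses NLTK's English list) and compares the vocabulary size before and after cleaning.

```python
import re

# Toy inputs and a tiny stop-word set, invented for illustration only.
lines = [
    "The wolves are howling, the WIND is cold!",
    "A wolf howls at the moon... the wind is colder!!!",
]
toy_stop_words = {"the", "a", "are", "is", "at"}

def vocabulary(texts):
    """Unique whitespace-separated tokens across all texts."""
    return {token for text in texts for token in text.split()}

def clean(text):
    """Lowercase, keep letters and whitespace only, drop toy stop words."""
    text = re.sub(r'[^a-z\s]', '', text.lower())
    return ' '.join(w for w in text.split() if w not in toy_stop_words)

raw_vocab = vocabulary(lines)
clean_vocab = vocabulary(clean(line) for line in lines)

print(len(raw_vocab), len(clean_vocab))
print(sorted(clean_vocab))
```

Case folding and punctuation removal merge variants such as `WIND` and `wind`, and the stop-word filter removes the most frequent filler tokens. `wolves` and `wolf` still count as two separate words after this step, which is the gap that standardization (point 2) is meant to close.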


In [12]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

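def preprocess_text(text, method='lemmatization'):
    """Lowercase, strip non-letters, drop stop words, then stem or lemmatize."""
    text = text.lower()

    # Replace literal "\n" / "\t" escape sequences first. Doing this after the
    # regex below would be a no-op, because the backslashes would already be gone.
    text = text.replace('\\n', ' ').replace('\\t', ' ')

    # Keep only letters and whitespace. This also drops apostrophes, so
    # "don't" becomes "dont" and no longer matches NLTK's stop-word list.
    text = re.sub(r'[^a-z\s]', '', text)

    # split() with no arguments collapses whitespace runs and trims both ends.
    words = [word for word in text.split() if word not in stop_words]

    if method == 'stemming':
        words = [stemmer.stem(word) for word in words]
    elif method == 'lemmatization':
        # WordNetLemmatizer defaults to pos='n', so only noun forms are reduced.
        words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)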


df['processed_lyrics_lem'] = df['Lyric'].apply(preprocess_text, method='lemmatization')
df['processed_lyrics_stem'] = df['Lyric'].apply(preprocess_text, method='stemming')


print("\nProcessed lyrics examples (Lemmatization):")
print(df[['Lyric', 'processed_lyrics_lem']].head())

print("\nProcessed lyrics examples (Stemming):")
print(df[['Lyric', 'processed_lyrics_stem']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!



Processed lyrics examples (Lemmatization):
                                               Lyric  \
0  Cryptic psalms Amidst the howling winds A scor...   
1  Im sleeping tonight with all the wolves Were d...   
2  Wings of the darkest descent Fall from the rea...   
3  [Verse 1] Norrid Radd was my real name Had a j...   
4  Deep in the dungeons of doom and despair Sneak...   

                                processed_lyrics_lem  
0  cryptic psalm amidst howling wind scorching so...  
1  im sleeping tonight wolf dreaming life thats b...  
2  wing darkest descent fall realm dark blackest ...  
3  verse norrid radd real name job hated every da...  
4  deep dungeon doom despair sneak place dark eke...  
