<a href="https://colab.research.google.com/github/Alii-Tavakolii/Song_Lyrics_Clustering/blob/main/lyrics_clustering_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pandas scikit-learn sentence-transformers matplotlib seaborn


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

In [None]:
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/musicLyrics.csv')
print(df.head())
df.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()

Mounted at /content/drive
                                               Lyric
0  Cryptic psalms Amidst the howling winds A scor...
1  Im sleeping tonight with all the wolves Were d...
2  Wings of the darkest descent Fall from the rea...
3  [Verse 1] Norrid Radd was my real name Had a j...
4  Deep in the dungeons of doom and despair Sneak...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2999 entries, 0 to 2998
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Lyric   2999 non-null   object
dtypes: object(1)
memory usage: 23.6+ KB


### Reasons for Text Preprocessing


1.  **Noise reduction:** deleting irrelevant characters (punctuation, digits, markup) so that the model focuses on meaningful content.
2.  **Standardization:** converting words to a common base form, so that "wolves" and "wolf" count as the same token.
3.  **Dimensionality reduction:** removing stop words (e.g., "the", "a", "am") that carry little semantic value.
4.  **Better model input:** a cleaner, more consistent representation of the text generally improves the quality of downstream models.


### Stemming vs. Lemmatization

Both stemming and lemmatization are techniques used in NLP to reduce words to their base or root form.

* **Stemming:** A fast, rule-based method that chops off word suffixes, often producing a form that is not a real word (e.g., "wolves" $\rightarrow$ "wolv"). It is quicker but less precise.
* **Lemmatization:** A more linguistically informed method that maps words to their dictionary base form (lemma), which is a valid word (e.g., "wolves" $\rightarrow$ "wolf"). It is more accurate but computationally heavier, and it works best when it is given each word's part of speech.

For analyzing song lyrics, lemmatization is generally preferred because it preserves more semantic meaning, which is crucial for understanding the themes in the text.
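
To make the difference concrete, here is a minimal sketch using NLTK's `PorterStemmer` and `WordNetLemmatizer`. The helper names `stem_words` and `lemmatize_words` and the sample words are illustrative assumptions, not part of the original notebook. The lemmatizer needs the WordNet corpus (`nltk.download('wordnet')`) and, when no part-of-speech tag is passed, treats every word as a noun.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # WordNet data is loaded lazily on first use


def stem_words(words):
    """Apply Porter's rule-based suffix stripping to each word."""
    return [stemmer.stem(w) for w in words]


def lemmatize_words(words, pos="n"):
    """Look up each word's dictionary form; `pos` is the WordNet POS tag."""
    return [lemmatizer.lemmatize(w, pos=pos) for w in words]


sample = ["wolves", "every", "source"]  # hypothetical sample words
print(stem_words(sample))  # ['wolv', 'everi', 'sourc'] (stems need not be real words)
```

With WordNet available, `lemmatize_words(["wolves"])` should return `['wolf']`. By contrast, `lemmatize_words(["sleeping"])` leaves the word unchanged unless `pos="v"` is passed, which is why verb forms can survive noun-only lemmatization.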

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text, method='lemmatization'):
    """Lowercase, strip non-letters, drop stop words, then stem or lemmatize."""
    text = text.lower()

    # Turn literal "\n" / "\t" escape sequences into spaces first. Doing this
    # after the regex below would be a no-op, because the backslash is already gone.
    text = text.replace('\\n', ' ').replace('\\t', ' ')

    text = re.sub(r'[^a-z\s]', '', text)  # Keep only letters and whitespace

    text = re.sub(r'\s+', ' ', text).strip()  # Collapse whitespace runs, trim the ends

    words = text.split()

    words = [word for word in words if word not in stop_words]

    if method == 'stemming':
        words = [stemmer.stem(word) for word in words]
    elif method == 'lemmatization':
        # Without a POS tag, WordNetLemmatizer treats every word as a noun
        words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)


df['processed_lyrics_lem'] = df['Lyric'].apply(lambda x: preprocess_text(x, method='lemmatization'))
df['processed_lyrics_stem'] = df['Lyric'].apply(lambda x: preprocess_text(x, method='stemming'))


print("\nProcessed lyrics examples (Lemmatization):")
print(df[['Lyric', 'processed_lyrics_lem']].head())

print("\nProcessed lyrics examples (Stemming):")
print(df[['Lyric', 'processed_lyrics_stem']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...



Processed lyrics examples (Lemmatization):
                                               Lyric  \
0  Cryptic psalms Amidst the howling winds A scor...   
1  Im sleeping tonight with all the wolves Were d...   
2  Wings of the darkest descent Fall from the rea...   
3  [Verse 1] Norrid Radd was my real name Had a j...   
4  Deep in the dungeons of doom and despair Sneak...   

                                processed_lyrics_lem  
0  cryptic psalm amidst howling wind scorching so...  
1  im sleeping tonight wolf dreaming life thats b...  
2  wing darkest descent fall realm dark blackest ...  
3  verse norrid radd real name job hated every da...  
4  deep dungeon doom despair sneak place dark eke...  

Processed lyrics examples (Stemming):
                                               Lyric  \
0  Cryptic psalms Amidst the howling winds A scor...   
1  Im sleeping tonight with all the wolves Were d...   
2  Wings of the darkest descent Fall from the rea...   
3  [Verse 1] Norrid Radd w

### Why Feature Extraction?

Machine learning models, including clustering algorithms, cannot directly process raw text. They require numerical input.

Feature extraction transforms raw text (like song lyrics) into numerical representations (vectors) that algorithms can understand and operate on.

It's essential because:
* **Machine Readability:** Clustering algorithms operate on numeric vectors, not on strings.
* **Meaning Capture:** A good representation places texts with similar meaning close together in vector space.
* **Mathematical Operations:** Once text is numerical, mathematical operations (e.g., calculating similarities or distances for clustering) become possible.
* **Efficiency:** It creates structured, usable data from unstructured text, which is vital for effective model training and analysis.
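
As a small illustration of these points, the sketch below turns three short texts into vectors and compares them. It uses TF-IDF rather than the sentence embeddings computed later in this notebook, simply because TF-IDF runs without downloading a model. The mini-corpus is a hypothetical stand-in for preprocessed lyrics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for preprocessed lyrics
docs = [
    "dark wind howling night",
    "howling wind dark realm",
    "love dance summer sun",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document
sim = cosine_similarity(X)          # pairwise similarities, shape (3, 3)

print(X.shape)       # (3, 9): three documents, nine distinct terms
print(sim.round(2))
```

The first two texts share three of their four words, so their similarity is well above zero. The third shares no vocabulary with the others, so its similarity to them is exactly zero. Dense sentence embeddings differ here: they can score texts as similar even when no words overlap.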

In [None]:
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 maps each text to a 384-dimensional vector. By default it
# truncates input longer than 256 word pieces, so only the opening of a long
# lyric contributes to its embedding.
model = SentenceTransformer('all-MiniLM-L6-v2')

df_with_embeddings = df.copy()

# Embed the lemmatized lyrics; encode() returns an array of shape (n_songs, 384)
lyrics_to_embed_lem = df_with_embeddings['processed_lyrics_lem'].tolist()

sentence_embeddings_lem = model.encode(lyrics_to_embed_lem, show_progress_bar=True)

# Store one embedding vector per row
df_with_embeddings['embeddings_lem'] = list(sentence_embeddings_lem)

# Repeat the same steps for the stemmed lyrics
lyrics_to_embed_stem = df_with_embeddings['processed_lyrics_stem'].tolist()

sentence_embeddings_stem = model.encode(lyrics_to_embed_stem, show_progress_bar=True)

df_with_embeddings['embeddings_stem'] = list(sentence_embeddings_stem)

print("\nDataFrame with new embedding columns:")
# Display the head, showing the new embedding columns
print(df_with_embeddings[['Lyric', 'processed_lyrics_lem', 'processed_lyrics_stem', 'embeddings_lem', 'embeddings_stem']].head())
print(f"Number of rows with embeddings (both lemmatized and stemmed): {len(df_with_embeddings)}")



DataFrame with new embedding columns:
                                               Lyric  \
0  Cryptic psalms Amidst the howling winds A scor...   
1  Im sleeping tonight with all the wolves Were d...   
2  Wings of the darkest descent Fall from the rea...   
3  [Verse 1] Norrid Radd was my real name Had a j...   
4  Deep in the dungeons of doom and despair Sneak...   

                                processed_lyrics_lem  \
0  cryptic psalm amidst howling wind scorching so...   
1  im sleeping tonight wolf dreaming life thats b...   
2  wing darkest descent fall realm dark blackest ...   
3  verse norrid radd real name job hated every da...   
4  deep dungeon doom despair sneak place dark eke...   

                               processed_lyrics_stem  \
0  cryptic psalm amidst howl wind scorch sourc ag...   
1  im sleep tonight wolv dream life that better p...   
2  wing darkest descent fall realm dark blackest ...   
3  vers norrid radd real name job hate everi day ...   
4  deep

### Supervised vs. Unsupervised Learning

Two of the main families of machine learning methods are **Supervised** and **Unsupervised** learning:

* **Supervised Learning:**
    * **Goal:** Predict outcomes using **labeled data** (input-output pairs).
    * **Examples:** Classification (e.g., spam detection), Regression (e.g., predicting prices).
* **Unsupervised Learning:**
    * **Goal:** Find hidden patterns in **unlabeled data** (input only).
    * **Examples:** Clustering (grouping similar items, as in this project), Dimensionality Reduction (e.g., PCA).

This project uses **clustering**, an **unsupervised learning** technique, to discover natural groupings in song lyrics without predefined categories.
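
The contrast shows up directly in how the models are fitted. The following sketch uses scikit-learn on hypothetical toy data (six 2-D points, not the lyric embeddings): the supervised model receives both `X` and `y`, while the clustering model receives only `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two well-separated groups of 2-D points
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, available only in the supervised case

# Supervised: the model learns a mapping from X to the known labels y
clf = LogisticRegression().fit(X, y)
pred = clf.predict([[0.1, 0.1], [5.0, 5.0]])
print(pred)

# Unsupervised: the model sees only X and assigns its own cluster ids
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

The cluster ids in `km.labels_` are arbitrary: the first group may be called 0 or 1. Only the grouping itself is meaningful, which is why clustering results are interpreted by inspecting cluster members rather than by comparing ids to labels.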

### Elbow Method for K-Means

The **Elbow Method** is a heuristic used to determine the optimal number of clusters ($K$) for K-Means clustering.


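1.  **Calculate WCSS:** Run K-Means for a range of $K$ values and record the **Within-Cluster Sum of Squares (WCSS)** for each, i.e. the sum of squared distances from every point to its assigned cluster centroid: $\text{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$, where $C_k$ is the $k$-th cluster and $\mu_k$ its centroid.
2.  **Plot WCSS vs. K:** Plot the WCSS values against the number of clusters ($K$).
3.  **Identify the "Elbow":** The plot typically resembles a bent arm, and the "elbow" is the point on the curve where the rate of decrease in WCSS slows markedly.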
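
The steps above can be sketched as follows. This is a minimal example under stated assumptions: the data comes from scikit-learn's `make_blobs` as a synthetic stand-in for the lyric embeddings, and the range of $K$ values (1 to 8) is an arbitrary choice. scikit-learn exposes the WCSS of a fitted `KMeans` model as its `inertia_` attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the lyric embeddings: 300 points around 4 centers
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

k_values = list(range(1, 9))
wcss = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # sum of squared distances to the nearest centroid

for k, value in zip(k_values, wcss):
    print(f"K={k}: WCSS={value:.1f}")
```

Plotting the result with `plt.plot(k_values, wcss, marker='o')` gives the elbow curve. For this notebook's data, `X` could presumably be replaced by the stacked embeddings, for example `np.vstack(df_with_embeddings['embeddings_lem'])`. Real embeddings often produce a much less pronounced elbow than synthetic blobs do.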