# Project Goal
The goal of this project is to build a simple content-based music recommendation engine. The system will recommend songs to a user based on the lyrical similarity of a song they choose.

### 1. Data Loading

I am using a pre-processed version of the Spotify Million Song Dataset. With a file size of only 75 MB, the dataset is small enough to be loaded directly into a single Pandas DataFrame. This allows for more straightforward data manipulation and analysis, as opposed to processing in chunks.

The dataset contains the following key columns:

- **`artist`**: The artist's name.
- **`song`**: The song's title.
- **`link`**: A relative link to the song's lyrics page.
- **`text`**: The lyrics of the song.

In [1]:
import pandas as pd

# Define the path to your dataset file.
file_path = r'.\spotify_millsongdata.csv'

#### Skipped processing in chunks as dataset is small

My initial plan was to process this data in chunks to handle a potentially large file. However, after inspection, the dataset size was found to be much smaller than anticipated (~75MB), making chunked processing unnecessary.

The following code was originally planned but skipped:

    # Set the chunk size (e.g., 10,000 rows at a time).
    chunk_size = 10000

    # Create an iterator that reads the file in chunks.
    chunks = pd.read_csv(file_path, chunksize=chunk_size)

    # You can iterate through the chunks to inspect the data.
    # For now, let's just look at the first chunk to see the column names.
    first_chunk = next(chunks)
    first_chunk.info()  # info() prints its summary itself and returns None
    print(first_chunk.head())

In [2]:
# Load the entire dataset into a single DataFrame.
try:
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully!")
    df.info()  # info() prints its summary itself and returns None
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")

Dataset loaded successfully!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57650 entries, 0 to 57649
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  57650 non-null  object
 1   song    57650 non-null  object
 2   link    57650 non-null  object
 3   text    57650 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


### 6.1. Data Sampling: Working Around the MemoryError (this cell runs here, but is explained in step 6; skip it on a first read and follow the section numbering)

Initial attempts to compute the song similarity matrix on the full dataset raised a MemoryError, because the resulting dense matrix was too large to fit into RAM. To work around this, I sampled the data.

I took a random sample of 5,000 songs from the original dataset, using a fixed `random_state` so the sample is reproducible. The methodology stays the same while the proof-of-concept becomes small enough to run in memory; the trade-off is that only the sampled songs can be queried or recommended.

In [3]:
# Sample the DataFrame to a smaller size (e.g., 5000 songs)
df = df.sample(n=5000, random_state=42).reset_index(drop=True)

# Print the new shape of the DataFrame
print("Shape of the sampled DataFrame:", df.shape)

Shape of the sampled DataFrame: (5000, 4)


### 2. Initial Data Exploration
After loading the dataset, I'll perform an initial check to understand its structure, identify any missing values, and verify the data types of each column. This is a critical step to ensure the data is clean and ready for analysis.

I'll use the following Pandas methods for this exploration:

- **`df.info()`**: Provides a summary of the DataFrame, including the column names, number of non-null values, and data types.
- **`df.head()`**: Displays the first few rows of the DataFrame, giving a quick look at the data.
- **`df.isnull().sum()`**: Counts the number of missing values in each column, which is essential for data cleaning.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  5000 non-null   object
 1   song    5000 non-null   object
 2   link    5000 non-null   object
 3   text    5000 non-null   object
dtypes: object(4)
memory usage: 156.4+ KB


In [5]:
df.head()

| | artist | song | link | text |
|---|---|---|---|---|
| 0 | Wishbone Ash | Right Or Wrong | /w/wishbone+ash/right+or+wrong_20147150.html | Like to have you 'round \r\nWith all the lies... |
| 1 | Aerosmith | This Little Light Of Mine | /a/aerosmith/this+little+light+of+mine_2064448... | This Little Light of Mine (Light of Mine), \r... |
| 2 | Fall Out Boy | Dance, Dance | /f/fall+out+boy/dance+dance_10113666.html | She says she's no good with words but I'm wors... |
| 3 | Janis Joplin | Easy Rider | /j/janis+joplin/easy+rider_10147381.html | Hey mama, mama, come a look at sister, \r\nSh... |
| 4 | Moody Blues | Peak Hour | /m/moody+blues/peak+hour_20291295.html | I see it all through my window it seems. \r\n... |


In [6]:
df.isnull().sum()

artist    0
song      0
link      0
text      0
dtype: int64

### 3. Data Cleaning: No Missing Values

After loading the dataset, I checked for missing values using `df.isnull().sum()`. The results show no missing values in any of the columns, so no imputation or row removal is needed. Note that this check only covers nulls; it does not detect empty strings or duplicate rows.
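
A null count alone can miss rows that are present but unusable. The following is a minimal sketch of two extra checks on a made-up DataFrame (`toy` and its contents are hypothetical and are not taken from the dataset): one counts whitespace-only lyrics, the other counts repeated artist/song pairs.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the checks.
toy = pd.DataFrame({
    "artist": ["A", "B", "B"],
    "song": ["One", "Two", "Two"],
    "text": ["some lyrics", "   ", "   "],
})

# isnull() reports nothing here, because blank strings are not null.
null_count = int(toy.isnull().sum().sum())

# Count lyrics that are empty once surrounding whitespace is stripped.
blank_lyrics = int(toy["text"].str.strip().eq("").sum())

# Count rows that repeat an earlier (artist, song) pair.
duplicate_rows = int(toy.duplicated(subset=["artist", "song"]).sum())

print(null_count, blank_lyrics, duplicate_rows)
```

Whether such rows exist in the real dataset is not something this notebook establishes; the sketch only shows how one could look.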

### 4. Text Preprocessing: Getting the Lyrics Ready for Analysis

To prepare the lyrical data for the recommendation engine, I created a preprocessing function to transform the raw text into a clean and consistent format. This is a crucial step in Natural Language Processing (NLP).

The `preprocess_text` function performs the following steps:
- **Lowercase Conversion**: All text is converted to lowercase to ensure consistency (`'The'` and `'the'` are treated as the same word).
- **Punctuation Removal**: `str.maketrans` and `string.punctuation` from the `string` module are used to strip punctuation characters, which contribute little to the meaning of the lyrics.
- **Tokenization**: The cleaned text is split into individual words.
- **Stop Word Removal**: Common, non-meaningful words (e.g., `'a'`, `'the'`, `'is'`) are filtered out using `nltk`'s built-in stop words list.
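- **Stemming**: I used the `PorterStemmer` to reduce words to a common stem (e.g., `'running'` and `'runs'` become `'run'`). Stems are not always dictionary words (`'people'` becomes `'peopl'`), but different forms of the same word are now counted as one term in the similarity calculation.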

I applied this function to the `text` column of the DataFrame to create a new `processed_text` column, which now contains the cleaned and ready-to-use lyrical data.
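
To make the first steps concrete, here is a minimal sketch that uses only the standard library. It covers lowercasing, punctuation removal, whitespace tokenization, and stop word filtering; stemming is left out. The `normalize` helper and the tiny `DEMO_STOP_WORDS` set are hypothetical stand-ins for illustration, not the notebook's `preprocess_text` or NLTK's stop word list.

```python
import string

# Hypothetical mini stop word list, for illustration only (not NLTK's list).
DEMO_STOP_WORDS = {"a", "the", "is", "with", "but", "no"}

def normalize(text):
    """Lowercase, strip punctuation, split on whitespace, drop demo stop words."""
    text = text.lower()
    # str.maketrans('', '', chars) builds a table that deletes every char in chars.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in DEMO_STOP_WORDS]

tokens = normalize("She says she's no good with words, but I'm worse.")
print(tokens)
# ['she', 'says', 'shes', 'good', 'words', 'im', 'worse']
```

Stripping punctuation first turns contractions such as `she's` into `shes` and `I'm` into `im`, which is why tokens like `im` survive in the `processed_text` output.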

In [7]:
import nltk
# nltk.download('stopwords')
# nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

In [8]:
# preprocess function

def preprocess_text(text):
    """
    Cleans and preprocesses a string of text for natural language processing (NLP).

    The function performs a series of operations to prepare the text:
    1. Converts text to lowercase.
    2. Removes all punctuation.
    3. Tokenizes the text into a list of words.
    4. Removes common English stop words.
    5. Applies stemming to reduce words to their root form.

    Args:
        text (str): The raw text string to be processed.

    Returns:
        str: The processed text as a single string of stemmed words.
    """
    text = text.lower()

    # Create a translation table to delete all punctuation characters
    translator = str.maketrans('', '', string.punctuation)

    # Apply the translation table to the text
    text_without_punctuation = text.translate(translator)

    # Initialize stop words and stemmer
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()

    # Tokenize the text
    tokens = word_tokenize(text_without_punctuation)
    
    # Process tokens in a single loop (list comprehension)
    processed_tokens = [
        stemmer.stem(word) for word in tokens if word not in stop_words
    ]

    # Join the processed tokens back into a single space-separated string
    return " ".join(processed_tokens)
    

In [9]:
# Create a new column 'processed_text' by applying your function to the 'text' column
df['processed_text'] = df['text'].apply(preprocess_text)

# You can now view the original text side-by-side with the processed text
print(df[['text', 'processed_text']].head())

                                                text  \
0  Like to have you 'round  \r\nWith all the lies...   
1  This Little Light of Mine (Light of Mine),  \r...   
2  She says she's no good with words but I'm wors...   
3  Hey mama, mama, come a look at sister,  \r\nSh...   
4  I see it all through my window it seems.  \r\n...   

                                      processed_text  
0  like round lie make thing dark peopl say tast ...  
1  littl light mine light mine im let shine aleil...  
2  say she good word im wors bare stutter joke ro...  
3  hey mama mama come look sister she astand leve...  
4  see window seem never fail like million eel wr...  


### 5. Vectorization: Converting Text to Numbers

The next step was to convert the pre-processed lyrical data into a numerical format. I used **TF-IDF (Term Frequency-Inverse Document Frequency)**, a statistical method that reflects how important a word is to a song within the entire dataset.

While other methods like Bag-of-Words and Word Embeddings exist, I chose TF-IDF because it down-weights words that appear in many songs and up-weights words that are distinctive to a few, which suits a content-based recommendation engine. It is also simple to compute and requires no pre-trained model.

- **`TfidfVectorizer`**: I used scikit-learn's `TfidfVectorizer` to perform this conversion. It handles tokenization, term counting, and TF-IDF weighting in a single `fit_transform` call.
- **`tfidf_matrix`**: The result of this process is a sparse matrix, where each row represents a song and each column represents a unique word. The values in the matrix are the TF-IDF scores for each word, which we will use to calculate song similarity.

The dimensions of the resulting matrix are (Number of Songs, Number of Unique Words), confirming that our text data has been successfully vectorized.
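
To show what the scores mean, here is a small hand-rolled sketch of TF-IDF on three made-up "songs". It assumes whitespace tokenization, raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row, which I believe match `TfidfVectorizer`'s documented defaults. The `tfidf` helper and the toy documents are hypothetical and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency (assumed to mirror scikit-learn's default).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[w] * idf[w] for w in vocab]
        # L2-normalize the row so every document vector has unit length.
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocab, rows

docs = ["love love night", "night radio danc", "love radio"]  # hypothetical toy "songs"
vocab, rows = tfidf(docs)
print(vocab)                     # ['danc', 'love', 'night', 'radio']
print(len(rows), len(rows[0]))   # 3 4
```

In the second toy song, `danc` appears in only one document and therefore gets a higher weight than `night` or `radio`, which each appear in two.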

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
# The TfidfVectorizer handles tokenization, counting, and TF-IDF calculation
vectorizer = TfidfVectorizer()

# Fit and transform the processed text to create the TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(df['processed_text'])

# Print the shape of the matrix to see its dimensions
print("Shape of TF-IDF matrix:", tfidf_matrix.shape)


Shape of TF-IDF matrix: (5000, 19491)


### 6. Calculating Song Similarity

With the lyrical data now in a numerical format, the next step was to measure the similarity between each pair of songs. I used **Cosine Similarity**, a metric that calculates the cosine of the angle between two TF-IDF vectors. Because TF-IDF values are never negative, the resulting score ranges from 0 (no terms in common) to 1 (identical direction) and indicates how similar two songs are in their lyrical content.

- **`cosine_similarity()`**: I used scikit-learn's `cosine_similarity` function to compute this metric on our `tfidf_matrix`.
- **`cosine_sim`**: The output is a square matrix where each cell represents the similarity score between two songs. For example, `cosine_sim[0][1]` holds the similarity score between the first and second songs in our dataset.

This matrix is the core of the recommendation engine, as it provides the foundation for finding and recommending songs similar to a user's selection.
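
As a worked illustration of the metric itself, the sketch below computes cosine similarity for a single pair of vectors with the standard library. The `cosine` helper and the three toy vectors are hypothetical; scikit-learn's `cosine_similarity` performs the same calculation for every pair of rows at once.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF rows over a four-word vocabulary.
song_a = [0.0, 0.8, 0.6, 0.0]
song_b = [0.0, 0.6, 0.8, 0.0]
song_c = [1.0, 0.0, 0.0, 0.0]

print(cosine(song_a, song_b))  # approximately 0.96: mostly the same words
print(cosine(song_a, song_c))  # 0.0: no words in common
```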

#### (Initial Attempt)

An initial attempt to compute the cosine similarity matrix on the full dataset resulted in a MemoryError. The error occurred because the output matrix, which is dense and needs to store a similarity score for every possible pair of songs, was too large to fit in my computer's RAM.

The tfidf_matrix, which is a sparse representation of the data, was small enough, but the cosine_similarity() function from scikit-learn attempted to create a dense matrix of shape (57650, 57650) with over 3.3 billion elements, requiring 23.9 GB of memory.
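
The scale of the problem can be checked with a back-of-the-envelope estimate. The sketch below assumes 8 bytes per value (float64); `dense_matrix_gib` is a hypothetical helper, and the figure reported by an actual allocation error can differ somewhat from this estimate.

```python
def dense_matrix_gib(n_rows, bytes_per_value=8):
    """Approximate size in GiB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 2**30

full = dense_matrix_gib(57650)   # every song in the dataset
sample = dense_matrix_gib(5000)  # the 5,000-song sample

print(f"{57650 ** 2:,} elements in the full matrix")
print(f"full: {full:.1f} GiB, sample: {sample:.2f} GiB")
```

Because the matrix grows with the square of the number of songs, cutting the row count to under a tenth shrinks the memory requirement by a factor of more than a hundred.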

### 6.1 Solution: Data Sampling (the sampling cell itself runs right after step 1)
To solve this, I sampled the data. Taking a random sample of 5,000 songs from the original dataset shrinks the TF-IDF matrix to 5,000 rows, so the cosine similarity matrix becomes 5,000 × 5,000 and can be computed without memory issues. The result is still a fully functional proof-of-concept, limited to the sampled songs.

### 6.2 Calculating Song Similarity (Successful Attempt)
After sampling the data, the process was successful. The code below computes the cosine similarity matrix on the now smaller tfidf_matrix. The resulting matrix is the core of our recommendation engine.

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix)

# Print the shape of the matrix to see its dimensions
print("Shape of cosine similarity matrix:", cosine_sim.shape)


Shape of cosine similarity matrix: (5000, 5000)


### 7. Building the Recommendation Engine

With the cosine similarity matrix computed, the final step was to build the core recommendation function. The `get_recommendations` function takes a song title as input and performs the following tasks:

- **Finds the Index**: It uses the song title to locate its corresponding index in the DataFrame.
- **Retrieves Scores**: It fetches the similarity scores for that song from the `cosine_sim` matrix.
- **Sorts and Filters**: It sorts the scores to find the most similar songs, ensuring the input song itself is not included in the recommendations.
- **Returns Recommendations**: It uses the indices of the top-ranked songs to retrieve and display their titles and artists.

This final function ties all the previous steps—data cleaning, vectorization, and similarity calculation—together into a complete and functional music recommendation engine.
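
The ranking step can be seen in isolation on a toy example. The sketch below uses a made-up 4 x 4 similarity matrix; `toy_sim` and `top_k_similar` are hypothetical names, and the helper returns row indices only, leaving out the title lookup.

```python
import numpy as np

# Hypothetical similarity matrix for four songs (symmetric, ones on the diagonal).
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.5],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.5, 0.4, 1.0],
])

def top_k_similar(sim_matrix, query_index, k):
    """Return the indices of the k rows most similar to query_index."""
    scores = sim_matrix[query_index]
    # argsort sorts ascending, so reverse it to get the most similar first.
    order = np.argsort(scores)[::-1]
    # Drop the query itself, then keep the first k.
    return [int(i) for i in order if i != query_index][:k]

print(top_k_similar(toy_sim, 0, 2))  # [2, 1]
```

Song 0 is most similar to song 2 (0.7), then song 1 (0.2); its own score of 1.0 is filtered out rather than relied on to sort first.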

In [12]:
# get_recommendations function

import numpy as np

def get_recommendations(song_title, cosine_sim):
    """
    Generates a list of song recommendations based on lyrical similarity.

    The function finds songs with similar lyrical content to a given song
    by using a pre-computed cosine similarity matrix.

    Args:
        song_title (str): The title of the song to get recommendations for.
        cosine_sim (np.ndarray): The pre-computed cosine similarity matrix.

    Returns:
        list: A list of recommended songs, with each song represented as a
              Pandas Series containing its information. Returns an empty list
              if the song is not found.
    """
    # Find the index of the song that matches the title
    # .tolist() is used to convert the Index object to a simple list
    song_indices = df.index[df['song'] == song_title].tolist()

    # Check if the song was found in the DataFrame
    if not song_indices:
        print(f"Song '{song_title}' not found in the dataset.")
        return []

    # Get the similarity scores for the chosen song from the cosine similarity matrix
    # [0] is used to get the single index from the list
    sim_scores = cosine_sim[song_indices[0]]

    # Get the indices of all songs ordered by similarity score
    # np.argsort returns the indices that would sort the array in ascending order
    sorted_indices = np.argsort(sim_scores)
    
    # Walk the indices in reverse (most to least similar) and
    # filter out the input song's own index
    rec_indices = [
        index for index in reversed(sorted_indices)
        if index != song_indices[0]
    ]

    # Take the top 10 recommendations from the filtered list
    top_10_rec = rec_indices[:10]

    # Use a list comprehension to retrieve the actual songs from the DataFrame
    # df.iloc[i] is used to get the entire row (song) by its integer index
    recommended_songs = [
        df.iloc[i] for i in top_10_rec
    ]

    return recommended_songs

In [13]:
get_recommendations("Halloween Dance", cosine_sim)

[artist                                                    New Order
 song                                                   Transmission
 link                        /n/new+order/transmission_10191784.html
 text              Radio, live transmission  \r\nRadio, live tran...
 processed_text    radio live transmiss radio live transmiss list...
 Name: 4174, dtype: object,
 artist                                                         Glee
 song                                            Dancing With Myself
 link                      /g/glee/dancing+with+myself_20615475.html
 text              On the floors of Tokyo  \r\nDown in London tow...
 processed_text    floor tokyo london town gogo record select mir...
 Name: 454, dtype: object,
 artist                                             Guided By Voices
 song                                                   Jupiter Spin
 link                 /g/guided+by+voices/jupiter+spin_21077562.html
 text              Feel, listen like no one  \r\