# Project Goal
The goal of this project is to build a simple content-based music recommendation engine. Given a song the user chooses, the system will recommend other songs whose lyrics are similar to it.

### 1. Data Loading

I am using a pre-processed version of the Spotify Million Song Dataset. At about 75 MB, the file fits comfortably in memory, so I load it into a single pandas DataFrame instead of processing it in chunks. This keeps data manipulation and analysis straightforward.

The dataset contains the following key columns:

- **`artist`**: The artist's name.
- **`song`**: The song's title.
- **`link`**: A link to the song's lyrics page.
- **`text`**: The lyrics of the song.

In [None]:
import pandas as pd

# Path to the dataset file, relative to the notebook's working directory.
# A plain relative path works on Windows, macOS, and Linux alike.
file_path = 'spotify_millsongdata.csv'

#### Skipped: processing in chunks (not needed for a dataset this small)

    # Set the chunk size (e.g., 10,000 rows at a time).
    chunk_size = 10000

    # Create an iterator that reads the file in chunks.
    chunks = pd.read_csv(file_path, chunksize=chunk_size)

    # You can iterate through the chunks to inspect the data.
    # For now, just look at the first chunk to see the column names.
    first_chunk = next(chunks)
    first_chunk.info()
    print(first_chunk.head())
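
For comparison, here is a minimal, self-contained sketch of how chunked reading behaves next to a full load. It uses a tiny in-memory CSV with made-up rows as a stand-in for the real file, and a chunk size of 2 chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV: same column names, made-up rows.
csv_text = (
    "artist,song,link,text\n"
    "A,One,/a/one,la la la\n"
    "A,Two,/a/two,na na na\n"
    "B,Three,/b/three,do re mi\n"
    "B,Four,/b/four,fa so la\n"
    "C,Five,/c/five,ti do\n"
)

# Passing chunksize makes read_csv return an iterator of DataFrames
# instead of a single DataFrame.
chunk_sizes = []
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)

# Loading everything at once yields the same rows in one DataFrame.
full_df = pd.read_csv(io.StringIO(csv_text))

print(chunk_sizes, total_rows, len(full_df))
```

Chunking only pays off when the file does not fit in memory, since every per-chunk result then has to be combined by hand.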

In [None]:
# Load the entire dataset into a single DataFrame.
try:
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully!")
    # info() prints its summary directly and returns None, so no print() is needed.
    df.info()
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")

### 2. Initial Data Exploration
With the dataset loaded, I'll run an initial check to understand its structure, identify any missing values, and verify the data type of each column. This confirms whether the data needs cleaning before analysis.

I'll use the following Pandas methods for this exploration:

- **`df.info()`**: Provides a summary of the DataFrame, including the column names, number of non-null values, and data types.
- **`df.head()`**: Displays the first few rows of the DataFrame, giving a quick look at the data.
- **`df.isnull().sum()`**: Counts the number of missing values in each column, which is essential for data cleaning.

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.isnull().sum()

### 3. Data Cleaning: No Missing Values

The `df.isnull().sum()` check shows that there are no missing values in any of the columns, so no imputation or row removal is needed. Note that this check only rules out missing cells. It does not by itself rule out other issues such as duplicate rows or empty lyric strings.
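
As a hedged extra check that is not part of the original notebook, the sketch below looks for two problems that a null count cannot reveal: lyrics that are empty or whitespace-only, and repeated artist/song pairs. The `demo` DataFrame is made up for illustration. Its column names follow the dataset, and the same expressions could be run on `df`.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset.
demo = pd.DataFrame({
    "artist": ["A", "A", "B", "B"],
    "song": ["One", "One", "Two", "Three"],
    "text": ["la la la", "la la la", "   ", "do re mi"],
})

# Missing cells across the whole frame (what isnull().sum() reports per column).
missing = int(demo.isnull().sum().sum())

# Lyrics that are present but contain nothing except whitespace.
blank_lyrics = int(demo["text"].str.strip().eq("").sum())

# Rows that repeat an earlier artist/song pair.
duplicate_songs = int(demo.duplicated(subset=["artist", "song"]).sum())

print(missing, blank_lyrics, duplicate_songs)
```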

### 4. Text Preprocessing: Getting the Lyrics Ready for Analysis

To prepare the lyrical data for the recommendation engine, I created a preprocessing function to transform the raw text into a clean and consistent format. This is a crucial step in Natural Language Processing (NLP).

The `preprocess_text` function performs the following steps:
- **Lowercase Conversion**: All text is converted to lowercase to ensure consistency (`'The'` and `'the'` are treated as the same word).
- **Punctuation Removal**: Every character in `string.punctuation` is deleted with `str.translate`, since punctuation doesn't contribute to the meaning of the lyrics.
- **Tokenization**: The cleaned text is split into individual words.
- **Stop Word Removal**: Common, non-meaningful words (e.g., `'a'`, `'the'`, `'is'`) are filtered out using `nltk`'s built-in stop words list.
- **Stemming**: I used the `PorterStemmer` to reduce words to their root form (e.g., `'running'` and `'runs'` become `'run'`), so that inflected forms of the same word count as one term in the similarity calculation.

I applied this function to the `text` column of the DataFrame to create a new `processed_text` column, which now contains the cleaned and ready-to-use lyrical data.
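
One detail of the step order is worth checking: because punctuation is removed before stop words are filtered, contractions lose their apostrophe and may no longer match the stop-word list. The stdlib-only sketch below illustrates the effect under simplifying assumptions. `DEMO_STOP_WORDS` is a tiny hypothetical set (not NLTK's list), and plain `split()` stands in for `word_tokenize`, so the real pipeline may behave differently.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration.
DEMO_STOP_WORDS = {"i", "the", "you", "don't"}

# Translation table that deletes all punctuation characters.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_then_filter(text):
    """Remove punctuation first, then drop stop words (the order described above)."""
    cleaned = text.lower().translate(PUNCT_TABLE)
    return [word for word in cleaned.split() if word not in DEMO_STOP_WORDS]


def filter_then_strip(text):
    """Drop stop words first (ignoring edge punctuation), then remove punctuation."""
    kept = [
        word for word in text.lower().split()
        if word.strip(string.punctuation) not in DEMO_STOP_WORDS
    ]
    return [word.translate(PUNCT_TABLE) for word in kept]


line = "I don't know the way, you know"
print(strip_then_filter(line))
print(filter_then_strip(line))
```

In the first ordering `don't` becomes `dont` and survives the filter. In the second it is removed as a stop word.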

In [None]:
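import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below.
nltk.download('stopwords')
nltk.download('punkt')
# Newer NLTK releases load the tokenizer data from 'punkt_tab' instead of 'punkt'.
nltk.download('punkt_tab')

# Get the list of English stop words
stop_words = set(stopwords.words('english'))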

In [None]:
# Preprocessing function

# Build the stemmer and the punctuation table once, rather than on every call.
stemmer = PorterStemmer()
# Translation table that deletes all punctuation characters.
punctuation_table = str.maketrans('', '', string.punctuation)

def preprocess_text(text):
    """
    Cleans and preprocesses a string of text for natural language processing (NLP).

    The function performs a series of operations to prepare the text:
    1. Converts text to lowercase.
    2. Removes all punctuation.
    3. Tokenizes the text into a list of words.
    4. Removes common English stop words.
    5. Applies stemming to reduce words to their root form.

    Args:
        text (str): The raw text string to be processed.

    Returns:
        str: The processed text as a single string of stemmed words.
    """
    # Lowercase the text and strip punctuation
    text_without_punctuation = text.lower().translate(punctuation_table)
    # Tokenize the text
    tokens = word_tokenize(text_without_punctuation)
    # Drop stop words (uses the module-level stop_words set) and stem the rest
    processed_tokens = [
        stemmer.stem(word) for word in tokens if word not in stop_words
    ]

    # Join the processed tokens back into a single string
    return " ".join(processed_tokens)

In [None]:
# Create a new column 'processed_text' by applying your function to the 'text' column
df['processed_text'] = df['text'].apply(preprocess_text)

# You can now view the original text side-by-side with the processed text
print(df[['text', 'processed_text']].head())
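
This section stops before the similarity step. As a rough illustration of how `processed_text` could feed a lyric-similarity comparison, here is a stdlib-only sketch that uses bag-of-words cosine similarity. The choice of measure is an assumption, since the text above does not say which one the project uses, and the three processed strings in `songs` are made up.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two processed strings."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up processed lyrics keyed by hypothetical song titles.
songs = {
    "song_a": "love heart night danc",
    "song_b": "love night danc floor",
    "song_c": "road truck dust rain",
}

# Rank every other song by its similarity to the chosen one.
query = "song_a"
ranked = sorted(
    (title for title in songs if title != query),
    key=lambda title: cosine_similarity(songs[query], songs[title]),
    reverse=True,
)
print(ranked)
```

Raw counts weight every word equally. A weighting scheme such as TF-IDF would be a natural refinement, though that too is a suggestion rather than something stated in this section.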