<a href="https://colab.research.google.com/github/Munongedzi/Music_Recommendation_System/blob/main/content_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we download a dataset with Spotify song information.

In [1]:
import kagglehub
import os

# Download dataset
datasets_path = kagglehub.dataset_download("joebeachcapital/30000-spotify-songs")
dataset_path = os.path.join(datasets_path, os.listdir(datasets_path)[0])
print("Using dataset: " + dataset_path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/joebeachcapital/30000-spotify-songs?dataset_version_number=2...


100%|██████████| 3.01M/3.01M [00:00<00:00, 137MB/s]

Extracting files...
Using dataset: /root/.cache/kagglehub/datasets/joebeachcapital/30000-spotify-songs/versions/2/spotify_songs.csv





Next, we clean the dataset. Rows with a missing track name, artist, or album name are dropped, and the numeric features are min-max scaled to the range 0 to 1. Scaling keeps features with large raw ranges, such as tempo and loudness, from dominating the similarity scores.

In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

spotify_data = pd.read_csv(dataset_path)

# Drop rows with a missing track name, artist, or album name
spotify_data = spotify_data.dropna(subset=['track_name', 'track_artist', 'track_album_name'])
print("First track entries:")
print(spotify_data.head())

# Scale each numeric feature to the range [0, 1]
scaler = MinMaxScaler()
numeric_cols = [
    'track_popularity', 'danceability', 'energy', 'loudness', 'speechiness',
    'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'
]
spotify_data[numeric_cols] = scaler.fit_transform(spotify_data[numeric_cols])

First track entries:
                 track_id                                         track_name  \
0  6f807x0ima9a1j3VPbc7VN  I Don't Care (with Justin Bieber) - Loud Luxur...   
1  0r7CVbZTWZgbTCYdfa2P31                    Memories - Dillon Francis Remix   
2  1z1Hg7Vb0AhHDiEmnDE79l                    All the Time - Don Diablo Remix   
3  75FpbthrwQmzHlBJLuGdC7                  Call You Mine - Keanu Silva Remix   
4  1e8PAfcKUYoKkxPhrHqw4x            Someone You Loved - Future Humans Remix   

       track_artist  track_popularity          track_album_id  \
0        Ed Sheeran                66  2oCs0DGTsRO98Gh5ZSl2Cx   
1          Maroon 5                67  63rPSO264uRjW1X5E6cWv6   
2      Zara Larsson                70  1HoSmj2eLcsrR0vE9gThr4   
3  The Chainsmokers                60  1nqYsOef1yKKuGOVchbsk6   
4     Lewis Capaldi                69  7m7vv9wlQ4i0LFuJiE2zsQ   

                                    track_album_name track_album_release_date  \
0  I Don't Care (with Just
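
To make the scaling step concrete, here is a minimal sketch of the min-max formula, `(x - min) / (max - min)` per column, which is what `MinMaxScaler` applies with its default range of 0 to 1. The values in `toy` are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up values standing in for two columns with very different raw ranges
toy = np.array([
    [60.0, 0.2],   # tempo (BPM), danceability
    [120.0, 0.5],
    [180.0, 0.8],
])

# Min-max scaling, column by column: the smallest value maps to 0, the largest to 1
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

After scaling, both columns span the same range, so a 60 BPM difference in tempo no longer outweighs a 0.3 difference in danceability.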

Next, we build the similarity matrix. The similarity between two songs is measured as the cosine similarity of their scaled feature vectors:
https://en.wikipedia.org/wiki/Cosine_similarity
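
As a small illustration of the measure, the sketch below computes cosine similarity by hand: the dot product of two vectors divided by the product of their lengths. The helper `cosine` and the vectors `a`, `b`, and `c` are hypothetical and exist only for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, different length
c = [0.0, 3.0]   # orthogonal to a

sim_ab = cosine(a, b)   # vectors pointing the same way score 1
sim_ac = cosine(a, c)   # orthogonal vectors score 0
print(sim_ab, sim_ac)
```

Because the scaled features are all non-negative, every pairwise score in this notebook falls between 0 and 1, and the measure compares the direction of the feature vectors rather than their length.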

In [3]:
from sklearn.metrics.pairwise import cosine_similarity

# Build a pairwise similarity matrix over the scaled features.
# Rows and columns are labeled by track_id so songs can be looked up by ID.
song_features = spotify_data[numeric_cols]
similarity_matrix = cosine_similarity(song_features)
similarity_df = pd.DataFrame(
    similarity_matrix,
    index=spotify_data['track_id'],
    columns=spotify_data['track_id'],
)
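
A dense pairwise matrix over n tracks holds n² values, which for roughly 30,000 tracks is on the order of several gigabytes in 64-bit floats. If memory is tight, one alternative is to compute only the row that is needed. The sketch below is an assumption about how that could look, not the notebook's method; `similarity_to_all` and `toy_features` are hypothetical names.

```python
import numpy as np

def similarity_to_all(features, row):
    """Cosine similarity between features[row] and every row of features."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero; all-zero rows score 0
    unit = features / norms[:, None]  # normalize each row to unit length
    return unit @ unit[row]           # dot products of unit vectors are cosines

# Made-up feature rows, one per track
toy_features = np.array([
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
    [2.0, 0.0],
])

scores = similarity_to_all(toy_features, row=0)
order = np.argsort(-scores, kind="stable")  # most similar first
print(scores)
print(order)
```

This needs memory proportional to the number of tracks rather than its square, at the cost of recomputing the scores for each query.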

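Finally, we ask the user for a song name, check (ignoring case) that it exists in the dataset, and print the ten most similar songs from the similarity matrix. If several songs share the name, the user picks one from a numbered list.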

In [6]:
song_name = input("Enter name of song: ").lower()
matching_songs = spotify_data[spotify_data['track_name'].str.lower() == song_name]

if len(matching_songs) > 0:
    if len(matching_songs) > 1:
        print("Multiple songs found with that name:")
        for i, (_, row) in enumerate(matching_songs.iterrows(), start=1):
            print(f"{i}: {row['track_name']} by {row['track_artist']} from {row['track_album_name']}")
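        selected_index = int(input("Enter the index of the desired song: "))
        # Reject out-of-range choices; 0 would otherwise silently pick the last match
        if not 1 <= selected_index <= len(matching_songs):
            raise ValueError(f"Index must be between 1 and {len(matching_songs)}.")
        selected_song_id = matching_songs.iloc[selected_index-1]['track_id']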
    else:
        selected_song_id = matching_songs.iloc[0]['track_id']

    # Rank all tracks by similarity to the selected one, highest first
    similar_songs = similarity_df.loc[selected_song_id].sort_values(ascending=False)
    # Drop the selected song by ID instead of assuming it is the first entry
    top_similar_songs = similar_songs.drop(selected_song_id).head(10)

    print("Top 10 most similar songs:")

    # Print details for each similar song
    for song_id, similarity_score in top_similar_songs.items():
        song_details = spotify_data[spotify_data['track_id'] == song_id].iloc[0]
        print(f"({similarity_score:.4f}) {song_details['track_name']} by {song_details['track_artist']} "
              f"from {song_details['track_album_name']}")
else:
    print(f"Error: Song '{song_name}' not found in the dataset.")

Enter name of song: Clocks
Multiple songs found with that name:
1: Clocks by Coldplay from A Rush of Blood to the Head
2: Clocks by Pickin' On Series from The Fantastic Pickin' on Series Bluegrass Sampler, Vol. 2
Enter the index of the desired song: 1
Top 10 most similar songs:
(0.9968) Fuiste Tú by Ricardo Arjona from Independiente + Demos
(0.9967) Ocean (feat. Khalid) by Martin Garrix from Ocean (feat. Khalid)
(0.9967) Ocean (feat. Khalid) by Martin Garrix from Ocean (feat. Khalid)
(0.9955) Livin' Thing by Electric Light Orchestra from A New World Record
(0.9953) Hallucinogenics by Matt Maeson from Bank On The Funeral
(0.9940) Twenty Eight by The Weeknd from Trilogy
(0.9928) Non Avere Paura by Tommaso Paradiso from Non Avere Paura
(0.9927) The Day You Said Goodnight by Hale from Hale
(0.9923) When I See You Smile by Bad English from Bad English
(0.9923) When I See You Smile by Bad English from Bad English
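
The output above lists "Ocean (feat. Khalid)" and "When I See You Smile" twice each, which suggests that some track IDs occur on more than one row of the dataset. One possible way to skip such repeats is sketched below. `top_unique` is a hypothetical helper, not part of the notebook; it accepts any iterable of `(track_id, score)` pairs sorted by descending score, such as `similar_songs.items()`.

```python
def top_unique(ranked_pairs, exclude_id, k=10):
    """Return the first k (track_id, score) pairs with distinct track IDs."""
    seen = {exclude_id}  # never recommend the query song itself
    unique = []
    for track_id, score in ranked_pairs:
        if track_id in seen:
            continue
        seen.add(track_id)
        unique.append((track_id, score))
        if len(unique) == k:
            break
    return unique

# Made-up ranked results with a repeated ID
ranked = [("a", 1.0), ("b", 0.99), ("b", 0.99), ("c", 0.97), ("d", 0.95)]
result = top_unique(ranked, exclude_id="a", k=3)
print(result)
```

Applied to the notebook's data, this would fill the ten slots with ten different tracks instead of showing the same track twice.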
