<a href="https://colab.research.google.com/github/Munongedzi/Music_Recommendation_System/blob/content_filtering/content_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, we download a dataset with Spotify song information.

In [5]:
import kagglehub
import os

# Download dataset
datasets_path = kagglehub.dataset_download("joebeachcapital/30000-spotify-songs")
dataset_path = os.path.join(datasets_path, os.listdir(datasets_path)[0])
print("Using dataset: " + dataset_path)

Using dataset: /root/.cache/kagglehub/datasets/joebeachcapital/30000-spotify-songs/versions/2/spotify_songs.csv


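Next, we clean the dataset. Rows with a missing track name, artist, or album name are dropped, and the numeric features are scaled to values between 0 and 1. Scaling puts all features on a comparable range, so that no single feature dominates when we build the similarity matrix.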
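As a small illustration of the scaling step, min-max scaling maps each value x in a column to (x - min) / (max - min). The sketch below applies that formula in plain Python. The `min_max_scale` helper and the tempo values are made up for this example; the notebook itself relies on scikit-learn's `MinMaxScaler`.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

tempos = [60.0, 90.0, 120.0, 180.0]  # made-up tempo values in BPM
scaled_tempos = min_max_scale(tempos)
print(scaled_tempos)  # [0.0, 0.25, 0.5, 1.0]
```

Without this step, a feature with a large raw range such as tempo would likely outweigh features like danceability, which already lie between 0 and 1.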

In [9]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

spotify_data = pd.read_csv(dataset_path)

# Drop rows with missing values; .copy() gives us an independent DataFrame,
# which avoids pandas' SettingWithCopyWarning when we assign the scaled columns below
spotify_data = spotify_data.dropna(subset=['track_name', 'track_artist', 'track_album_name']).copy()
print("First track entries:")
print(spotify_data.head())

# Scale numeric values to the range [0, 1]
scaler = MinMaxScaler()
numeric_cols = [
    'track_popularity', 'danceability', 'energy', 'loudness', 'speechiness',
    'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'
]
spotify_data[numeric_cols] = scaler.fit_transform(spotify_data[numeric_cols])

First track entries:
                 track_id                                         track_name  \
0  6f807x0ima9a1j3VPbc7VN  I Don't Care (with Justin Bieber) - Loud Luxur...   
1  0r7CVbZTWZgbTCYdfa2P31                    Memories - Dillon Francis Remix   
2  1z1Hg7Vb0AhHDiEmnDE79l                    All the Time - Don Diablo Remix   
3  75FpbthrwQmzHlBJLuGdC7                  Call You Mine - Keanu Silva Remix   
4  1e8PAfcKUYoKkxPhrHqw4x            Someone You Loved - Future Humans Remix   

       track_artist  track_popularity          track_album_id  \
0        Ed Sheeran                66  2oCs0DGTsRO98Gh5ZSl2Cx   
1          Maroon 5                67  63rPSO264uRjW1X5E6cWv6   
2      Zara Larsson                70  1HoSmj2eLcsrR0vE9gThr4   
3  The Chainsmokers                60  1nqYsOef1yKKuGOVchbsk6   
4     Lewis Capaldi                69  7m7vv9wlQ4i0LFuJiE2zsQ   

                                    track_album_name track_album_release_date  \
0  I Don't Care (with Just


Next, we build the similarity matrix. Each entry holds the cosine similarity between the scaled feature vectors of two songs, so the matrix has one row and one column per song:
https://en.wikipedia.org/wiki/Cosine_similarity

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

# Build similarity matrix
song_features = spotify_data[numeric_cols]
similarity_matrix = cosine_similarity(song_features)
similarity_df = pd.DataFrame(similarity_matrix, index=spotify_data['track_id'], columns=spotify_data['track_id'])
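To make the measure concrete, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for three made-up feature vectors. The `cosine` helper and the sample values are assumptions for illustration; the notebook uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical scaled features: (danceability, energy, valence)
song_a = [0.8, 0.6, 0.0]
song_b = [0.4, 0.3, 0.0]  # same direction as song_a, half the magnitude
song_c = [0.0, 0.0, 1.0]  # orthogonal to song_a

sim_ab = cosine(song_a, song_b)  # close to 1: same direction
sim_ac = cosine(song_a, song_c)  # 0: no shared non-zero feature
print(round(sim_ab, 6), round(sim_ac, 6))
```

Because every feature is non-negative after scaling, all similarities fall between 0 and 1, and songs with similar feature profiles score very close to 1.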

In [16]:
import random

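# Simulate listening histories for two users (there are no real users here).
# random.choices samples with replacement, so a user may get the same song twice.
user_data = pd.DataFrame({
    'user_id': [1] * 20 + [2] * 20,  # Two users with 20 songs each
    'song_id': random.choices(spotify_data['track_id'].unique(), k=40),
    'freq': [random.randint(1, 5) for _ in range(40)]  # Random listening frequency
})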
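Since the simulated histories are random, every run of the notebook produces different collaborative recommendations. A seeded generator is one way to make them repeatable. The sketch below is a plain-Python variant under that assumption: `simulate_listens`, its parameters, and the fake track IDs are hypothetical and not part of the notebook.

```python
import random

def simulate_listens(track_ids, n_users=2, songs_per_user=20, seed=0):
    """Return (user_id, song_id, freq) rows drawn with a seeded generator."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    rows = []
    for user_id in range(1, n_users + 1):
        for song_id in rng.choices(track_ids, k=songs_per_user):
            rows.append((user_id, song_id, rng.randint(1, 5)))
    return rows

fake_ids = [f"track_{i}" for i in range(100)]  # hypothetical track IDs
run_1 = simulate_listens(fake_ids)
run_2 = simulate_listens(fake_ids)
print(run_1 == run_2)  # True: same seed, same simulated history
```

Note that if a user draws the same song twice, `pd.pivot_table` with its default aggregation (the mean) averages the two `freq` values into one cell.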


In [11]:
from sklearn.metrics.pairwise import cosine_similarity

class CollaborativeFiltering:
    def __init__(self, user_data):
        self.user_data = user_data
        self.user_song_matrix = None
        self.similarity_matrix = None

    def create_user_song_matrix(self):
        """Create user-song interaction matrix."""
        self.user_song_matrix = pd.pivot_table(
            self.user_data,
            values='freq',
            index='user_id',
            columns='song_id',
            fill_value=0
        )
        return self.user_song_matrix

    def compute_user_similarity(self):
        """Compute user similarity matrix."""
        if self.user_song_matrix is None:
            self.create_user_song_matrix()
        self.similarity_matrix = cosine_similarity(self.user_song_matrix)
        return self.similarity_matrix

    def recommend_songs(self, user_id, content_based_songs, top_n=10):
        """Recommend songs for a user based on collaborative filtering."""
        if self.similarity_matrix is None:
            self.compute_user_similarity()

        # Get the user's row in the matrix
        user_index = self.user_song_matrix.index.get_loc(user_id)

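        # Rank the other users by similarity, most similar first
        similar_users = self.similarity_matrix[user_index]
        similar_user_indices = [
            i for i in similar_users.argsort()[::-1] if i != user_index
        ]  # Exclude the user explicitly; with ties it is not guaranteed to sort first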

        # Collect songs from similar users, most similar user first
        # (a dict removes duplicates and, unlike a set, keeps insertion order)
        recommended_songs = {}
        for other_user_index in similar_user_indices:
            other_user_id = self.user_song_matrix.index[other_user_index]
            other_user_songs = self.user_song_matrix.columns[
                self.user_song_matrix.loc[other_user_id] > 0
            ].tolist()
            recommended_songs.update(dict.fromkeys(other_user_songs))

        # Filter out songs the user has already interacted with
        user_songs = set(self.user_song_matrix.columns[self.user_song_matrix.loc[user_id] > 0])
        recommended_songs = [song for song in recommended_songs if song not in user_songs]

        # Prioritize content-based recommendations
        ranked_recommendations = [
            song for song in content_based_songs if song in recommended_songs
        ] + [song for song in recommended_songs if song not in content_based_songs]

        return ranked_recommendations[:top_n]
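The two steps the class performs before recommending, building a user-song matrix and comparing users by cosine similarity, can be traced on a tiny example. The sketch below does so in plain Python with a made-up listening log; the names `listens`, `matrix`, and `cosine` are assumptions for illustration, while the class itself uses `pd.pivot_table` and scikit-learn.

```python
import math

# Hypothetical listening log: (user_id, song_id, freq)
listens = [
    (1, "song_a", 3), (1, "song_b", 1),
    (2, "song_b", 2), (2, "song_c", 5),
    (3, "song_d", 4),
]

users = sorted({u for u, _, _ in listens})
songs = sorted({s for _, s, _ in listens})

# One row per user, one column per song, 0 where the user never played it
matrix = {u: [0] * len(songs) for u in users}
for u, s, freq in listens:
    matrix[u][songs.index(s)] = freq

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

sim_12 = cosine(matrix[1], matrix[2])  # users 1 and 2 share song_b
sim_13 = cosine(matrix[1], matrix[3])  # users 1 and 3 share no song
print(matrix)
print(round(sim_12, 3), sim_13)
```

Users who share no songs get a similarity of exactly 0. With two simulated users who each draw 20 songs at random from a large catalogue, an overlap is unlikely, so the "most similar" user is probably just the only other user.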


In [21]:
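def hybrid_recommendation(song_name, user_id, top_n=10):
    """Combine content-based and collaborative recommendations for one song and user.

    Returns the combined top_n list, the content-based song IDs, their
    similarity scores, and the collaborative recommendations.
    """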
    # Find the song ID from the song name
    song_id = spotify_data.loc[spotify_data['track_name'].str.lower() == song_name.lower(), 'track_id'].values
    if len(song_id) == 0:
        raise ValueError("Song not found in the dataset.")
    song_id = song_id[0]

    # Content-based recommendations: drop the input song explicitly instead of
    # assuming it sorts first (other songs can tie with it at similarity 1.0)
    content_recommendations = (
        similarity_df[song_id]
        .drop(song_id)
        .sort_values(ascending=False)
        .head(top_n)
    )
    content_based_songs = content_recommendations.index.tolist()
    content_similarity_scores = content_recommendations.values.tolist()

    # Collaborative filtering recommendations
    collaborative_filter = CollaborativeFiltering(user_data)
    collaborative_recommendations = collaborative_filter.recommend_songs(user_id, content_based_songs, top_n=top_n)

    # Combine recommendations (removing duplicates while maintaining order)
    combined_recommendations = list(dict.fromkeys(content_based_songs + collaborative_recommendations))
    return combined_recommendations[:top_n], content_based_songs, content_similarity_scores, collaborative_recommendations
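The combining step relies on `dict.fromkeys`, which keeps the first occurrence of each key in insertion order. The short sketch below shows the effect on two made-up ID lists; the IDs are hypothetical.

```python
content_based = ["s1", "s2", "s2", "s3"]  # hypothetical IDs, one duplicate
collaborative = ["s9", "s3", "s8"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order
combined = list(dict.fromkeys(content_based + collaborative))
print(combined)      # ['s1', 's2', 's3', 's9', 's8']
print(combined[:4])  # ['s1', 's2', 's3', 's9']
```

Because the content-based songs come first and there are already `top_n` of them, a collaborative pick reaches the final list only when the content-based list contains a duplicate that frees up a slot.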


Finally, we ask the user for a song name, check that it exists in the dataset, and print the content-based, collaborative, and combined hybrid recommendations for an example user.

In [23]:
# Example driver code
if __name__ == "__main__":
    user_id = 1  # Example user ID
    input_song = input("Enter a song name: ")

    try:
        # Get recommendations
        recommendations, content_based, content_scores, collaborative = hybrid_recommendation(input_song, user_id)

        # Print categorized recommendations
        print("\nRecommended Songs:\n")

        print("Content-Based Filtering Recommendations (with similarity scores):")
        for song_id, score in zip(content_based, content_scores):
            song_name = spotify_data.loc[spotify_data['track_id'] == song_id, 'track_name'].values[0]
            print(f"- {song_name} (Similarity: {score:.2f})")

        print("\nCollaborative Filtering Recommendations:")
        for song_id in collaborative:
            song_name = spotify_data.loc[spotify_data['track_id'] == song_id, 'track_name'].values[0]
            print(f"- {song_name}")

        print("\nCombined Hybrid Recommendations:")
        for song_id in recommendations:
            song_name = spotify_data.loc[spotify_data['track_id'] == song_id, 'track_name'].values[0]
            print(f"- {song_name}")

    except ValueError as e:
        print(str(e))


Enter a song name: hello

Recommended Songs:

Content-Based Filtering Recommendations (with similarity scores):
- Head In The Clouds (Similarity: 1.00)
- Wanted (Similarity: 1.00)
- If That's Alright (Similarity: 1.00)
- If That's Alright (Similarity: 1.00)
- Amanda (Similarity: 1.00)
- White Flag (Similarity: 1.00)
- Yellow Ledbetter (Similarity: 1.00)
- Wind Of Change (Similarity: 1.00)
- Right My Wrongs (Similarity: 0.99)
- Oceans (Similarity: 0.99)

Collaborative Filtering Recommendations:
- Piscininha Amor
- I'll Never Forget You
- Dreaming My Dreams
- Rudolph The Red-Nosed Reindeer
- Tradición
- Drop The Bomb On 'Em
- Up
- Odessa
- All I Have
- Chilango Blues

Combined Hybrid Recommendations:
- Head In The Clouds
- Wanted
- If That's Alright
- Amanda
- White Flag
- Yellow Ledbetter
- Wind Of Change
- Right My Wrongs
- Oceans
- Piscininha Amor
