S.a.M: Song and Music Recommendation System 

Hi! My name is S.a.M. I'm an AI recommendation system that allows the user to input their favorite movie or song and get recommendations based off it. This is posible through the use of content based filtering, principal component analysis, k-means clustering, cosine similartiy and eueclidean distance. I'm split up into 3 main compoenents, there's a msuic feature, a movie feature, and an interactive menu. Let's begin with the music componenet. 

Step 1: Loading the data
In order to start, we will have to import the necessary libraries and resources.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

We will now read the CSV file containing our dataset and put it in a pandas dataframe. In dataframe format, we will able to show important information such as the format of the dataset and its shape.

In [None]:
df = pd.read_csv(r"C:\Users\kylek\Downloads\CS 450 Project\data\dataset.csv", index_col=0) #index_col=0 used to removed any unnamed columns which our dataset had one
df.head(10) #the head() function usually prints out the first 5 rows of the dataset but we will show the first 10

In [None]:
print("The dimensions of our dataset are:", df.shape)

As seen above, our dataset consists of 114,000 songs with 20 features. These features consists of basic information of the song such as track name, artist, and album. But, it also has more interesting features that might be useful for use such as, loudness, acousticness, tempo, and genre. These can be useful when it comes to creating clusters.

Step 2: Cleaning the Data
Before implementing clusters, we first need to clean our data of unnecessary columns. Clustering works best if we keep features that are numerical values. This will make our recommendation system more interesting as we will be recommending songs that sound similar rather than something simple such as genre. To achieve this, we will drop columns that do not contain numerical data. We will also be dropping NaN columns which are columns that contain missing or incomplete data.

In [None]:
df.dropna(inplace=True) #we will first drop any NaN values for efficiency
df.head(10)

In [None]:
#now we have to manually drop any remaining uneccessary columns such as 'track_id', 'explicit', and 'track_genre'
dropped_df = df.drop(['track_id', 'artists','album_name', 'track_name','explicit', 'track_genre'], axis=1)
dropped_df.head(10)

In [None]:
print("New dimensions of dataset:", dropped_df.shape)

Last step before implementing clusters is to normalize our data. If we did not do that, our graph would be skewed to favored larger values such as duration.

In [None]:
final_df = dropped_df.copy()
final_df = StandardScaler().fit_transform(final_df)
final_df = pd.DataFrame(final_df, columns=dropped_df.columns)
final_df.head(10)

Step 3: Implementing Principal Component Analysis
For our clusters to be as efficient as possible, we will first reduce the dimensionality of our data to make our calculations more efficient. We will be doing this using Principal Component Analysis (PCA).

PCA is a dimension reduction algorithm that reduces the dimensionality of a dataset while still maintaining crucial information. When combined with K-means clustering, it will lead to better defined clusters

We begin by figuring out how many principal components we can reduce our data too. We do this by creating a PCA instance which will use as many components as the number of features in our data. We will then plot the cumulative explained variance ratio for each number of components.

In [None]:
pca = PCA()
pca_df = pca.fit_transform(final_df)
pca.explained_variance_ratio_

As seen above, we have 14 principal components since we have 14 features and each component has there own percentage of explained variance. We will plot the ratio and find at which number of components will we have explained variance ratio above 0.80.

In [None]:
plt.figure(figsize=(10,8))
plt.plot(range(1, len(pca.explained_variance_ratio_.cumsum()) + 1), pca.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title('Explained variance by components')
plt.xlabel('# of components')
plt.ylabel('cumulative explained variance')

According to our graph, at 9 principal components we will have cumulative explained variance ratio above 0.80 which means that will be the number of components we will use.

In [None]:
pca = PCA(n_components=9)
pca_df = pca.fit_transform(final_df)
print(sum(pca.explained_variance_ratio_)) #to verify that 9 is the right number

Step 4. Implementing K-means Clustering
Step 4a. Elbow Method
For K-means clustering to work, we need to figure out the k number of clusters for our data. This is a number that has to be figured out by us. It is important to have the right amount of clusters as having too little clusters leads to underfitting and have too many clusters leads to overfitting.

To figure out how many clusters we will be using, we will be using the Elbow Method. This method involves creating a K-means clustering model with varying amounts of clusters, calculating the inertias of each and plotting it to see at which point do we see a elbow-like bend in the graph. So, we will be taking our Principal Component model and running it through the clustering algorithm to find the optimal amount of clusters.

In [None]:
inertias = []
means = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10)
    kmeans.fit(pca_df)
    means.append(i)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(10,5))
plt.plot(means, inertias, marker='o')
plt.title('Elbow Method')
plt.xlabel("# of clusters")
plt.ylabel("Inertias")
plt.grid(True)
plt.show()

We pick the point in which the inertia starts decreasing in a linear manner which appears to be point 3. We will be forming 3 clusters.

In [None]:
kmeans = KMeans(n_clusters=3, n_init='auto')
kmeans.fit(final_df)
df['Cluster'] = kmeans.labels_ #assigns each song into their respective cluster
final_df['Cluster'] = kmeans.labels_ #does the same into our modified dataframe
df.head(50)

In [None]:
print(df.shape)

In [None]:
final_df.head(50)

In [None]:
print(final_df.shape)

Now we will visualize our data. To do this well be plotting the first two principal components of our dataset while showing the cluster that each song is in. We will only be plotting two as it is not possible to plot 9 components but 2 is enough to showcase our data.

In [None]:
final_df['pca_1'] = pca_df[:,0]
final_df['pca_2'] = pca_df[:,1]
plt.scatter(final_df['pca_1'], final_df['pca_2'], c=final_df['Cluster'])
plt.show()

We have successfully clustered our songs! It is time to create our song recommending functions.

Step 5. Create Song Recommender
With our songs clustered, to create our recommender we need a way to locate song the user inputs in our database, find the cluster number and recommend songs inside that cluster

In [None]:
def get_Song_Index(track_name, df):
    try:
        track_index = df[df['track_name'] == track_name].index[0] #Finds the id of the first matching result
        return track_index
    except IndexError:
        return None

In [None]:
def get_recommendations(track_name, df):
    track_index = get_Song_Index(track_name, df)
    print('You chose: ' + track_name + ' by ' + df.loc[track_index]['artists'])
    print('Here are your recommendations:')
    cluster = df.loc[track_index]['Cluster'] #finds the cluster of the inputted song
    filter = (df['Cluster'] == cluster)
    
    filtered_df = df[filter] #creates a dataframe with only songs of the indicated cluster
    chosen_index = filtered_df.index.get_loc(track_index)
    start_index = max(0, chosen_index - 5)
    end_index = min(len(filtered_df), chosen_index + 6)
    for i in range(start_index, end_index):
        if i != chosen_index:
            recommendation = filtered_df.iloc[i]
            print(recommendation['track_name'] + ' by ' + recommendation['artists'])

In [None]:
get_recommendations('Is This It', df)

Due to the vast nature of our clusters, some of the recommendations may seem outlandish and only vaguely similar. To improve our recommender, we will introduce euclidean distance calculations into our recommender. We will be measuring the distance between our inputted song and every other song in the cluster and recommend the songs with the shortest distance to our song. The calculation will be based off certain features to make sure that our songs will be as close as possible.

In [None]:
def get_euclidean_recommendations(track_name, features, df, final_df, n_songs=10):
    
    track_index = get_Song_Index(track_name, df)
    print('You chose: ' + track_name + ' by ' + df.loc[track_index]['artists'])
    print('Here are your recommendations:')
    track_cluster = final_df.loc[track_index]['Cluster']
    cluster = final_df[final_df['Cluster'] == track_cluster]
    cluster_df = cluster[features]
    
    target_song = cluster_df.loc[track_index, features]
    target_song = target_song.to_frame().T
    
    distances = euclidean_distances(target_song, cluster_df)
    distances = distances.flatten()
    sorted_indices = np.argsort(distances)
        
    most_similar_songs = cluster_df.iloc[sorted_indices[:n_songs + 1]]
    indices = most_similar_songs.index
    songs_in_df = df.loc[indices]
    
    for i in range(n_songs + 1):
        if songs_in_df.iloc[i]['track_name'] != track_name:
            print(songs_in_df.iloc[i]['track_name'] + ' by ' + songs_in_df.iloc[i]['artists'])

In [None]:
features = ['popularity', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
           'instrumentalness', 'liveness', 'valence', 'tempo'] #the features we will be comparing
get_euclidean_recommendations('Is This It', features, df, final_df)

We've successfully implemented the music component of our recommendation system. Now let's move onto the movie portion. 

Step 1: Loading data
 
In order to start, we must import additonal libraries and resources that will be needed. 

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from ast import literal_eval
from datetime import datetime

We use a different dataset for our movie section so we have to read the new CSV files that contains our dataset and put it in a pandas dataframe. In dataframe format, we will be able to show important information such as the format of the dataset and its shape. Df1 contains movie_id, cast, and crew information. Df2 contains budget, genre, homepage, id, keywords, and other movie features. We will eventually combine the datasets.

In [None]:
df1 = pd.read_csv(r"C:\Users\kylek\Downloads\CS 450 Project\data\tmdb_5000_credits.csv")
df1.head()

This is our first dataset. Now let's take a look at our second.

In [None]:
df2 = pd.read_csv(r"C:\Users\kylek\Downloads\CS 450 Project\data\tmdb_5000_movies.csv")
df2.head()

Now lets join the two datasets together and take a look at our new dataset.

In [None]:
df1.columns = ['id','tittle','cast','crew']
movie_df = df2.merge(df1,on='id')
movie_df.head()

Let's take a look at the shape our dataset. This will tell us how many movies and features are in our dataset.

In [None]:
movie_df.shape

We are returned a tupple, (4803, 23). This means that in our dataset, we have 4803 movies, each movie containing 23 features.

We will begin our recommendation system with content based filtering. The content of the movie (overview, cast, crew, keyword, and tagline) are used to compute a similarity score with other movies. Then the movies with the highest similarity score are recommended. We'll start by computing the pairwise similarity scores for all movies based on their plot description. The plot description is given in the overview feature of our dataset.

Let's take a look.

In [None]:
movie_df['overview'].head(5)

Now we must perform data preprocessing in order to transform our raw data into something usable. We want to perform text processing in order to accurately analyze the plot description for each movie. This is achieved by computing the Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview. Term frequency is the relative frequency of a word in a document. It's given as (term instance / total instance). Inverse Document Frequency is the relative count of documents containing the term and is given as log(number of documents / documents with term). The importance of each word to the document is equal to TF * IDF. This gives you a matrix where each column represents a word in the overview vocabulary and each row represents a movie. We perform these actions in order to reduce the importance of words that occur frequently in plot overviews.

In [None]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
movie_df['overview'] = movie_df['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movie_df['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

There's 20,978 words used to describe the 4803 movies in our dataset.

Using this matrix, we can now compute a similarity score. There are multiple methods for finding a similary score such as eculidean, pearson and cosine similarty score. No method is better than another, they all have advantages for certain sceanrios. In this case, we will be using the cosine similarty score to calculate a numeric quanity that represent the similarity between two movies.

Similarity = cos(θ) = (A * B) / ||A|| ||B|| (dot product)

Since we have used the TF-IDF vectorizer, calculating the dot product will direclty give us the cosine similarity score. Therefore, we will use sklearn's linear_kernel().

In [None]:
#Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

Now we will define a function that takes in a movie title as an input and outputs a list of the 10 most similar movies. We need a way to identify the index of a movie in our metadata DataFrame given its title. This can be achieved by reverse mapping the movies titles and DataFrame indices.

In [None]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(movie_df.index, index=movie_df['title']).drop_duplicates()

Basic Steps of Recommendation System

1. Get the index of the movie based off title
2. Compute cosine similarity scores for the particular movie with all movies. Then convert it into a list of tuples where the first element is its position and the second is the similarity score
3. Sort the list of tuples based on the similarity scores
4. Get the top 10 elements of this list. Ignore the first element since it's refering to itself
5. Return the titles corresponding to the indices of the top elements

In [None]:
#Function that takes in movie title as input and outputs most similar movies
def get_content_recommendations(title, cosine_sim=cosine_sim):
    try:
        # Get the index of the movie that matches the title
        idx = indices[title]

        # Get the pairwise similarity scores of all movies with that movie
        sim_scores = list(enumerate(cosine_sim[idx]))

        # Sort the movies based on the similarity scores
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        # Get the scores of the 10 most similar movies
        sim_scores = sim_scores[1:11]

        # Get the movie indices
        movie_indices = [i[0] for i in sim_scores]

        # Create a DataFrame with the top 10 most similar movies and their similarity scores
        result_df = pd.DataFrame(columns=['Title', 'Similarity Score'])
        result_df['Title'] = movie_df['title'].iloc[movie_indices]
        result_df['Similarity Score'] = [i[1] for i in sim_scores]

        # Return the top 10 most similar movies
        return result_df
    except KeyError:
        print(f"Sorry, the movie '{title}' was not found in our database. Please check the title and try again.")
        return None

Let's test out our content based recommendation by inputing our favorite movie.

In [None]:
get_content_recommendations('Batman & Robin')

Our recommendation system does a good job at finding movies with similar plot descriptions but the quality of the recommendation could improve. For example, "Batman & Robin" returns all batman movies rather than movies with similar actors. We can improve our movie recommendation by building a recommendation based on 4 key features - movie plot keywords, the director, the 3 most popular actors, and the genre. In order to achieve this, we must manipulate our data. 

In [None]:
#Credits, Genres and Keywords Based Recommender
attributes = ['cast', 'crew', 'keywords', 'genres']
for attribute in attributes:
    movie_df[attribute] = movie_df[attribute].apply(literal_eval)

Next we'll write functions that allow us to extract information from each feature. 

In [None]:
#Extract director's name, return none if not found
def extract_director(crew_data):
    for crew_member in crew_data:
        if crew_member['job'] == 'Director':
            return crew_member['name']
    return None 

In [None]:
#Returns the top 3 elements from list 
def get_elements(elements):
    if isinstance(elements, list):
        names = [element['name'] for element in elements][:3]
        return names
    return []

In [None]:
#Define new director, cast, genres and keywords features that are in a suitable form.
movie_df['director'] = movie_df['crew'].apply(extract_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    movie_df[feature] = movie_df[feature].apply(get_elements)

In [None]:
#Print the new features of the first 3 films
movie_df[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Now we must normalize our data. We have to convert the names nad keywords into lowercase and delete the spaces between them. This way our vectorizer doesn't count the Ryan of "Ryan Reynolds" and "Ryan Gosling" as the same. 

In [None]:
#Normalize data by converting strings to lowercase and removing spaces 
def normalize_data(data):
    if isinstance(data, list):
        return [str.lower(item.replace(" ", "")) for item in data]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(data, str):
            return str.lower(data.replace(" ", ""))
        else:
            return ''

In [None]:
#Normalize all features
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    movie_df[feature] = movie_df[feature].apply(normalize_data)

Now let's create our metadata soup (a string that contains all the metadata that we want to feed to our vectorizer (actors, director, and keywords))

In [None]:
def create_soup(features):
    return ' '.join(features['keywords']) + ' ' + ' '.join(features['cast']) + ' ' + features['director'] + ' ' + ' '.join(features['genres'])
movie_df['soup'] = movie_df.apply(create_soup, axis=1)

This step is very similar to the text processing we pefromed with TF-IDF. This time we are going to utilize COuntVectorizer(). 

In [None]:
vectorizer = CountVectorizer(stop_words='english')
word_matrix = vectorizer.fit_transform(movie_df['soup'])

We now created a word_matrix. We are able to use this word matrix to calcualte the cosine simialrty score. 

In [None]:
#Calculate the cosine similartiy matrix 
cosine_similarity_matrix = cosine_similarity(word_matrix, word_matrix)

In [None]:
#Reset index of our main DataFrame and construct reverse mapping as before
movie_df = movie_df.reset_index()
indices = pd.Series(movie_df.index, index=movie_df['title'])

We can now reuse our get_recommendation() function by passing in the new cosine_similarity_matrix 

In [None]:
get_content_recommendations('The Dark Knight Rises', cosine_similarity_matrix)

Now that we have implemented our song and movie compoenents we are able to create a user-friendly interactive menu. 

Let's create functions that retrieve our recommendations based on user input. 

Now let's define a function that shows an interactive menu 

In [111]:
def get_movie_recommendations(title, movies):
    if title in movies['title'].values: #check if movie exists in dataset
        print(get_content_recommendations(title, cosine_similarity_matrix)) #if movie exists print recommendations
    else:
        print("So sorry! It seems as if the movie you are searching for is not available.\n")
        print("Either the movie is not present in our database or was typed incorrectly.\n")
        print("Make sure to spell the movie exactly as the title appears and try again.")

def get_song_recommendations(title, songs):
    if title in songs['track_name'].values: #check if song exists in dataset
        print(get_euclidean_recommendations(title, features, df, final_df)) #if song exists print recommendations
    else:
        print("So sorry! It seems as if the song you are trying to search for is not available.\n")
        print("Either the song is not present in our database or was typed incorrectly.\n")
        print("Make sure to spell the song exactly as it appears on Spotify and try again.")

def show_menu():
    movies = df2 #load movie dataset
    songs = df #load song dataset
    
    print("Hi! My name is S.a.M. I'm your Song and Movie recommendation system. I can't wait to get started.")
    
    #Start an infinite loop to continually offer options until the user exists 
    while True:
        print("\nPlease select one of the following options")
        print("1. Get movie recommendations")
        print("2. Get song recommendations")
        print("3. Exit")
        choice = input("Please enter your choice: ")
        
        if choice == '1':
            print("Please enter the movie title you want a recommendation for:")
            title = input("Please enter the movie title you want a recommendations for: ")
            get_movie_recommendations(title, movies)
        elif choice == '2':
            print("Please enter the song title you want a recommednation for")
            title = input("Please enter the song title you want a recommendation for: ")
            get_song_recommendations(title, songs)
        elif choice == '3':
            print("Thank you for listening to my recommednations. Good bye :) ")
            break
        else:
            print("I'm sorry but I didn't understand your answer. Please select 1 for movie recommendations. 2 for song recommednations and 3 to exit the program.")

show_menu()

Hi! My name is S.a.M. I'm your Song and Movie recommendation system. I can't wait to get started.

Please select one of the following options
1. Get movie recommendations
2. Get song recommendations
3. Exit
Please enter the song title you want a recommednation for
You chose: Destroy Everything You Touch by Ladytron
Here are your recommendations:
I Know A Place by Jay Reatard
Is This It by The Strokes
空と虚 by sasanomaly
空と虚 by sasanomaly
Levántate y Anda by Avalanch
Sun King - Remastered 2009 by The Beatles
Mean Mr Mustard - Remastered 2009 by The Beatles
Skiptracing by Mild High Club
If I Could Find You (Eternity) by The Holydrug Couple
空と虚 by sasanomaly
None

Please select one of the following options
1. Get movie recommendations
2. Get song recommendations
3. Exit
Please enter the song title you want a recommednation for
You chose: Is This It by The Strokes
Here are your recommendations:
Destroy Everything You Touch by Ladytron
Skiptracing by Mild High Club
If I Could Find You (Eterni