$$ ITI \space AI-Pro: \space Intake \space 45 $$
$$ Recommender \space Systems $$
$$ Lab \space no. \space 1 $$

# `01` Import Necessary Libraries

## `i` Default Libraries

In [2]:
import numpy as np
import pandas as pd

## `ii` Additional Libraries
Add imports for any additional libraries used throughout the notebook.

In [3]:
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split, cross_validate
from surprise import KNNBasic, SVD, accuracy

----------------------------

# `02` Load Data

In [4]:
ratings = pd.read_csv("Data/songsDataset.csv", names=['userID', 'songID', 'rating'], skiprows=[0])
ratings.head()

,userID,songID,rating
0,0,90409,5
1,4,91266,1
2,5,8063,2
3,5,24427,4
4,5,105433,4


---------------------------------

# `03` Similarity Metrics

## `0` Utility Matrix
Construct utility matrix for the loaded data `ratings`
- Users as Index
- Songs as Columns

**Hint**: you can use `pandas.DataFrame.pivot` method (see [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html))

In [5]:
utility_matrix = ratings.pivot(index='userID', columns='songID', values='rating').fillna(0) # fill missing ratings with 0: valid ratings are 1-5, so 0 marks "not rated"
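To see what the pivot produces, here is a minimal sketch on a hypothetical toy table (the values are made up and are not taken from `songsDataset.csv`). It assumes each (user, song) pair appears at most once, since `pivot` raises an error on duplicate pairs.

```python
import pandas as pd

# Hypothetical toy ratings, invented for illustration only
toy = pd.DataFrame({
    "userID": [0, 0, 1, 2],
    "songID": [10, 20, 10, 30],
    "rating": [5, 3, 4, 1],
})

# Users as index, songs as columns; pairs without a rating become 0
toy_utility = toy.pivot(index="userID", columns="songID", values="rating").fillna(0)
print(toy_utility)
```

If the data could contain duplicate (user, song) pairs, `pivot_table` with an aggregation function such as `"mean"` would be one way to handle them.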

## `i` Cosine Similarity
Finish implementing the function below to calculate `Cosine Similarity` between two vectors

In [6]:
def cosine_sim(vec_a, vec_b):
    """
    Returns the raw cosine similarity score between two vectors.

            Parameters:
                vec_a (pandas.Series): Vector A
                vec_b (pandas.Series): Vector B

            Returns:
                sim_score (float): Similarity score between vectors vec_a and vec_b
    """
    # Calculate the cosine similarity score between two vectors
    # 1. Calculate the dot product of vec_a and vec_b
    dot_product = np.dot(vec_a, vec_b)
    # 2. Calculate the norm of vec_a and vec_b
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    # 3. Calculate the cosine similarity score
    sim_score = dot_product / (norm_a * norm_b) if (norm_a * norm_b) != 0 else 0

    return sim_score


In [7]:
print(f'Cosine Similarity between userID 56 and userID 227 is: {cosine_sim(utility_matrix.iloc[56].copy(), utility_matrix.iloc[227].copy())}')

Cosine Similarity between userID 56 and userID 227 is: 0.7808688094430304
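As a quick sanity check of the formula, the sketch below evaluates cosine similarity on three hypothetical vectors whose answers are known in advance: two that point in the same direction and one that is orthogonal to them. The helper name `cosine` is illustrative and mirrors the logic of `cosine_sim`.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with the same zero-norm guard as cosine_sim."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom != 0 else 0.0

a = np.array([5.0, 0.0, 3.0])
b = np.array([10.0, 0.0, 6.0])  # same direction as a, twice the length
c = np.array([0.0, 4.0, 0.0])   # no overlapping non-zero entries with a

print(cosine(a, b))  # parallel vectors give a score of (approximately) 1
print(cosine(a, c))  # orthogonal vectors give a score of 0
```

Because unrated entries are stored as 0, two users with no songs in common always score 0, regardless of how they rated their own songs.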


## `ii` Adjusted Cosine Similarity
Finish implementing the function below to calculate `Adjusted Cosine Similarity` between two vectors

In [8]:
def adjusted_cosine_sim(vec_a, vec_b):
    """
    Returns the adjusted cosine similarity score between two vectors.

            Parameters:
                vec_a (pandas.Series): Vector A
                vec_b (pandas.Series): Vector B

            Returns:
                sim_score (float): Similarity score between vectors vec_a and vec_b
    """

    # Calculate the adjusted cosine similarity score between two vectors
    # 1. Calculate the mean rating for each vector
    mean_a = vec_a.mean()
    mean_b = vec_b.mean()
    # 2. Center the vectors by subtracting the mean rating
    vec_a_centered = vec_a - mean_a
    vec_b_centered = vec_b - mean_b
    # 3. Calculate the dot product of the centered vectors
    dot_product = np.dot(vec_a_centered, vec_b_centered)
    # 4. Calculate the norm of the centered vectors
    norm_a = np.linalg.norm(vec_a_centered)
    norm_b = np.linalg.norm(vec_b_centered)
    # 5. Calculate the adjusted cosine similarity score
    sim_score = dot_product / (norm_a * norm_b) if (norm_a * norm_b) != 0 else 0

    return sim_score

In [9]:
print(f'Adjusted Cosine Similarity between userID 56 and userID 227 is: {adjusted_cosine_sim(utility_matrix.iloc[56].copy(), utility_matrix.iloc[227].copy())}')

Adjusted Cosine Similarity between userID 56 and userID 227 is: 0.7764278070396685
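Note that `adjusted_cosine_sim` centers each vector on its own mean, which is the same computation `pearson_sim` performs. A common textbook variant of adjusted cosine for item-item similarity instead subtracts each user's mean rating. The sketch below illustrates that variant on a hypothetical toy matrix; the function name, the toy values, and the choice to use co-rated users only are assumptions, not part of the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: rows are items, columns are users, 0 means "not rated"
R = pd.DataFrame(
    [[5, 4, 0],
     [4, 5, 2],
     [1, 2, 5]],
    index=["item_a", "item_b", "item_c"],
    columns=["u1", "u2", "u3"],
    dtype=float,
)

def adjusted_cosine_user_centered(R, i, j):
    """Item-item similarity after subtracting each user's mean rating."""
    rated = R != 0
    # Mean of each user's ratings, ignoring unrated entries
    user_means = R.where(rated).mean(axis=0)
    # Subtract the user mean from rated entries; unrated entries stay at 0
    centered = (R - user_means).where(rated, 0.0)
    # Compare the two items only on users who rated both
    both = rated.loc[i] & rated.loc[j]
    a = centered.loc[i][both].to_numpy()
    b = centered.loc[j][both].to_numpy()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom != 0 else 0.0

print(adjusted_cosine_user_centered(R, "item_a", "item_b"))
print(adjusted_cosine_user_centered(R, "item_a", "item_c"))
```

In this toy data, `item_a` and `item_b` are both rated above average by the same users and come out positively similar, while `item_a` and `item_c` come out negatively similar.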


## `iii` Pearson Correlation Coefficient
Finish implementing the function below to calculate `Pearson Correlation Coefficient` between two vectors

In [10]:
def pearson_sim(vec_a, vec_b):
    """
    Returns the pearson similarity score between two vectors.

            Parameters:
                vec_a (pandas.Series): Vector A
                vec_b (pandas.Series): Vector B

            Returns:
                sim_score (float): Similarity score between vectors vec_a and vec_b
    """

    # Calculate the pearson similarity score between two vectors
    # 1. Calculate the mean rating for each vector
    mean_a = vec_a.mean()
    mean_b = vec_b.mean()
    # 2. Center the vectors by subtracting the mean rating
    vec_a_centered = vec_a - mean_a
    vec_b_centered = vec_b - mean_b
    # 3. Calculate the dot product of the centered vectors (the covariance up to a 1/n factor)
    covariance = np.dot(vec_a_centered, vec_b_centered)
    # 4. Calculate the norms of the centered vectors (the standard deviations up to a sqrt(n) factor)
    norm_a = np.linalg.norm(vec_a_centered)
    norm_b = np.linalg.norm(vec_b_centered)
    # 5. Calculate the pearson similarity score
    sim_score = covariance / (norm_a * norm_b) if (norm_a * norm_b) != 0 else 0

    return sim_score

In [11]:
print(f'Pearson Similarity between songID 3785 and songID 17029 is: {pearson_sim(utility_matrix[3785].copy(), utility_matrix[17029].copy())}')

Pearson Similarity between songID 3785 and songID 17029 is: -0.015085785303531218


## `iv` Mean Squared Difference
Finish implementing the function below to calculate `Mean Squared Difference` between two vectors

**Note**: Make sure you calculate the difference for common dimensions only (i.e. the dimensions both items/users have non-zero values in)

In [12]:
def msd_sim(vec_a, vec_b):
    """
    Returns the mean squared difference similarity score between two vectors.
    Note: Only consider common items between the two vectors

            Parameters:
                vec_a (pandas.Series): Vector A
                vec_b (pandas.Series): Vector B

            Returns:
                sim_score (float): Similarity score between vectors vec_a and vec_b
    """

    # Calculate the mean squared difference similarity score between two vectors
    # 1. Get the common items between the two vectors
    common_items = vec_a.index.intersection(vec_b.index)
    # 2. Keep only the items where both vectors have non-zero values
    common_items = common_items[(vec_a[common_items] != 0) & (vec_b[common_items] != 0)]
    # 3. If there are no common items, return 0
    if len(common_items) == 0:
        return 0
    # 4. Calculate the mean squared difference between the two vectors
    vec_a_common = vec_a[common_items]
    vec_b_common = vec_b[common_items]
    squared_diff = np.square(vec_a_common - vec_b_common)
    mean_squared_diff = np.mean(squared_diff)
    # 5. Calculate the similarity score
    sim_score = 1 / (1 + mean_squared_diff)

    return sim_score

In [13]:
print(f'MSD Similarity between userID 56 and userID 227 is: {msd_sim(utility_matrix.iloc[56].copy(), utility_matrix.iloc[227].copy())}')
print(f'MSD Similarity between songID 3785 and songID 17029 is: {msd_sim(utility_matrix[3785].copy(), utility_matrix[17029].copy())}')

MSD Similarity between userID 56 and userID 227 is: 1.0
MSD Similarity between songID 3785 and songID 17029 is: 0.6363636363636364
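The arithmetic behind the MSD score can be traced on two short hypothetical vectors. The sketch below uses plain NumPy arrays rather than `pandas.Series`, and the helper name `msd_similarity` is illustrative.

```python
import numpy as np

def msd_similarity(a, b):
    """1 / (1 + mean squared difference) over positions where both are non-zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    common = (a != 0) & (b != 0)
    if not common.any():
        return 0.0
    msd = np.mean((a[common] - b[common]) ** 2)
    return float(1.0 / (1.0 + msd))

# Positions 0 and 3 are co-rated: differences 1 and 2, squares 1 and 4, mean 2.5
score = msd_similarity([5, 0, 3, 4], [4, 2, 0, 2])
print(score)  # 1 / (1 + 2.5)
```

A score of 1.0, as for the two users above, only says that the co-rated entries are identical; it can arise from a single shared item, so it is worth checking how many items the score is based on.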


--------------------------

# `04` Collaborative Filtering

Practice for item-based collaborative filtering

## `0` Utility Matrix
Construct utility matrix for the loaded data `ratings`
- Songs as Index
- Users as Columns

In [14]:
utility_matrix = ratings.pivot(index='songID', columns='userID', values='rating').fillna(0) # fill missing ratings with 0: valid ratings are 1-5, so 0 marks "not rated"

In [15]:
utility_matrix.head()

songID \ userID,0,4,5,7,14,20,31,33,40,46,...,199956,199969,199973,199974,199975,199976,199980,199988,199990,199996
2263,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
2726,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0
3785,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8063,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## `i` Item-Item Similarity Matrix

Construct an item-item (Cosine/Adjusted Cosine) similarity matrix from the utility matrix above.

In [16]:
# Build the item-item similarity matrix by calling adjusted_cosine_sim on every pair of rows
sim_mat = np.array([[adjusted_cosine_sim(utility_matrix.iloc[i], utility_matrix.iloc[j])
                     for j in range(len(utility_matrix))]
                    for i in range(len(utility_matrix))]).round(6)

In [17]:
sim_df = pd.DataFrame(sim_mat, index=utility_matrix.index, columns=utility_matrix.index)
sim_df.head()

songID,2263,2726,3785,8063,12709,13859,16548,17029,19299,19670,...,113954,119103,120147,122065,123176,125557,126757,131048,132189,134732
2263,1.0,-0.0065,-0.017511,-0.016326,-0.01752,-0.013347,-0.022847,-0.007725,-0.017581,-0.017882,...,-0.007331,-0.008085,-0.009286,-0.004556,-0.020674,-0.014522,-0.011948,-0.013081,-0.017415,-0.012329
2726,-0.0065,1.0,-0.016699,-0.01094,-0.016806,-0.011452,-0.023635,0.01324,-0.019354,-0.020125,...,-0.000729,0.00947,0.013797,-0.016811,-0.018107,-0.009166,-0.011642,-0.012274,-0.02302,-0.007772
3785,-0.017511,-0.016699,1.0,0.001511,-0.002429,-0.007363,-0.010149,-0.015086,-0.013344,-0.014637,...,-0.015709,-0.00797,-0.015821,-0.015284,-0.005766,-0.010261,-0.0133,-0.007578,-0.00749,-0.003461
8063,-0.016326,-0.01094,0.001511,1.0,-0.003506,-0.001862,-0.013025,-0.005731,0.007944,-0.016066,...,-0.01948,-0.001559,-0.014644,-0.015865,-0.004209,-0.006944,-0.011152,-0.006553,-0.013862,0.005777
12709,-0.01752,-0.016806,-0.002429,-0.003506,1.0,-0.011653,-0.014726,-0.004692,-0.002641,-0.006035,...,-0.014878,-0.011811,-0.006868,-0.007521,-0.013235,-0.011558,-0.016553,-0.009346,0.000393,-0.005
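The nested list comprehension calls the similarity function once per pair of rows, which grows quadratically with the number of songs. A vectorized alternative is sketched below under the assumption that the goal is the same row-mean-centered cosine as `adjusted_cosine_sim`; the function name `centered_cosine_matrix` and the toy matrix are hypothetical.

```python
import numpy as np

def centered_cosine_matrix(M):
    """Pairwise similarity between the rows of M after subtracting each row's mean."""
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    # Rows with zero norm are all-zero after centering, so dividing by 1 keeps them at 0
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = centered / safe_norms[:, None]
    return unit @ unit.T

# Hypothetical toy item-by-user matrix; the last row is constant, so its norm is 0
toy = np.array([
    [5, 0, 3, 0],
    [4, 0, 0, 2],
    [0, 1, 5, 0],
    [2, 2, 2, 2],
], dtype=float)

sim = centered_cosine_matrix(toy)
print(sim.round(3))
```

On the real data this could be wrapped as `pd.DataFrame(centered_cosine_matrix(utility_matrix.to_numpy()), index=utility_matrix.index, columns=utility_matrix.index)`, which should agree with the loop version up to rounding.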


In general, adjusted cosine similarity is the better choice, especially here, because we do not have enough rating information per item.

## `ii` Candidate Generation and Filtering

Select the items that user 199988 has rated and filter the similarity matrix above down to them.

In [18]:
user_id = 199988


# Ratings given by this user, indexed by songID
user_ratings = utility_matrix[user_id]

# Keep only the items the user has rated (rating != 0)
mask = (user_ratings != 0)

# Get the potential items
potential_items = utility_matrix.index[mask]

# Equivalent one-liner (ratings are never negative):
# potential_items = utility_matrix.index[utility_matrix[user_id] > 0]

# Stricter alternative (not used): also require the item to be present in sim_df
# and to have at least one positive similarity
# potential_items = utility_matrix.index[(utility_matrix[user_id] != 0) & 
#                                        utility_matrix.index.isin(sim_df.columns) & 
#                                        (sim_df.loc[sim_df.index, utility_matrix.index] > 0).any(axis=0).values]


potential_items



Index([2726, 19299, 43267, 56660], dtype='int64', name='songID')

In [19]:
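# Rows: the items the user has rated. The column mask keeps songIDs > 0, which here is every column.
filtered_sim_df = sim_df.loc[potential_items, sim_df.columns > 0]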

In [20]:
filtered_sim_df

songID,2263,2726,3785,8063,12709,13859,16548,17029,19299,19670,...,113954,119103,120147,122065,123176,125557,126757,131048,132189,134732
2726,-0.0065,1.0,-0.016699,-0.01094,-0.016806,-0.011452,-0.023635,0.01324,-0.019354,-0.020125,...,-0.000729,0.00947,0.013797,-0.016811,-0.018107,-0.009166,-0.011642,-0.012274,-0.02302,-0.007772
19299,-0.017581,-0.019354,-0.013344,0.007944,-0.002641,-0.003426,-0.001161,-0.012135,1.0,-0.004975,...,-0.018252,-0.005171,-0.015623,-0.002624,0.014827,-0.013707,-0.017614,-0.000595,-0.011755,0.00685
43267,-0.009534,0.012825,-0.008429,-0.010259,-0.013956,-0.012019,-0.016373,0.007456,-0.015588,-0.018029,...,-0.00747,0.016524,0.018883,-0.015632,-0.01338,-0.006118,-0.003468,-0.010407,-0.018965,0.004009
56660,-0.016032,-0.018568,-0.007015,-0.009887,0.004105,-0.014507,-0.020362,-0.007994,-0.00043,-0.00904,...,-0.019435,-0.007092,-0.011,-0.010886,-0.003124,-0.009896,-0.014101,-0.009387,0.002315,0.000836


## `iii` Top-K Candidate Selection

Select the top-K (a k of your choice) most similar items for each item that user 199988 rated, using the filtered similarity matrix above.

In [21]:
k = 5

rated_items = user_ratings[user_ratings > 0].index

## top k similar items for each rated item from the filtered similarity dataframe
# 1. Get the top k similar items for each rated item
# 2. Drop the item itself from the list of similar items
# 3. Sort the similar items by similarity score
# 4. Get the top k similar items
# 5. Store the similar items in a dictionary with the item as the key and the list of similar items as the value
top_k_similar = {}

for item in rated_items:
    # Get the top k similar items for the item
    similar_items = filtered_sim_df.loc[item].nlargest(k+1).index.tolist()
    # Remove the item itself from the list of similar items
    similar_items.remove(item)
    # Store the similar items in the dictionary
    top_k_similar[item] = similar_items



print(f"Top {k} similar items for user {user_id}:")
for item, similar_items in top_k_similar.items():
    print(f"Item {item}: {similar_items}")



Top 5 similar items for user 199988:
Item 2726: [120147, 17029, 43267, 40712, 86341]
Item 19299: [105433, 123176, 8063, 43827, 134732]
Item 43267: [120147, 119103, 2726, 42906, 45026]
Item 56660: [90409, 48731, 12709, 60465, 25182]


## `iv` Candidate Rating Prediction

Calculate the predicted rating for each of the candidate items.

In [22]:
# user_ratings = utility_matrix.loc[potential_items][user_id]

# top_k_similar = filtered_sim_df.apply(lambda row: row.sort_values(ascending=False).index[:k],axis=1)
# # similarity scores
# sim_scores = [filtered_sim_df[top_k_similar[item]].loc[item] for item in top_k_similar.index ]

# sim_scores_df = pd.concat(
#     [
#         pd.DataFrame({
#             'candidate': item.index,    
#             'ref_item': item.name,    
#             'similarity': item.values 
#         })
#         for item in sim_scores
#     ],
#     ignore_index=True  # reset index
# )
# # user ratings
# user_ratings = utility_matrix.loc[potential_items][user_id]

# final_df = sim_scores_df.pivot(index='candidate', columns="ref_item", values="similarity")
# result = []
# for item in sim_scores_df['candidate']:
#     sims = final_df.loc[item]
#     aligned_user_ratings = user_ratings[sims.index]  # Align user_ratings with sims index
#     predicted_rating = np.dot(sims, aligned_user_ratings) / sims.abs().sum()
    
#     # print(f"{item}: {predicted_rating}")
#     result.append(predicted_rating)
# sim_scores_df['predicted_ratings'] = result

# # Combine the predicted ratings, similarity scores, user ratings, and top-k similar items into a DataFrame
# combined_df = sim_scores_df[['candidate', 'predicted_ratings']].copy()
# combined_df['user_rating'] = combined_df['candidate'].map(user_ratings)
# combined_df['top_k_similar_items'] = combined_df['candidate'].map(top_k_similar)

# # Display the combined DataFrame
# print("Combined DataFrame:")
# combined_df


In [31]:
# results = []

# for candidate in filtered_sim_df.index:
#     sims = filtered_sim_df.loc[candidate]

#     # Top K refs with highest similarity
#     top_k = sims.sort_values(ascending=False).head(k)

#     # User ratings for those refs, fill missing with NaN
#     top_k_ratings = user_ratings.reindex(top_k.index)

#     # Calculate predicted rating (weighted average)
#     denom = top_k.abs().sum()
#     if denom > 0:
#         predicted_rating = np.dot(top_k, top_k_ratings.fillna(0)) / denom
#     else:
#         predicted_rating = np.nan

#     # For ref_1 and ref_2 details
#     ref_1 = top_k.index[0] if len(top_k) > 0 else pd.NA
#     ref_1_sim = top_k.iloc[0] if len(top_k) > 0 else pd.NA
#     ref_1_rating = top_k_ratings.iloc[0] if len(top_k_ratings) > 0 else pd.NA

#     ref_2 = top_k.index[1] if len(top_k) > 1 else pd.NA
#     ref_2_sim = top_k.iloc[1] if len(top_k) > 1 else pd.NA
#     ref_2_rating = top_k_ratings.iloc[1] if len(top_k_ratings) > 1 else pd.NA

#     results.append({
#         'candidate': candidate,
#         'predicted_rating': predicted_rating,
#         'ref_1': ref_1,
#         'ref_1_similarity': ref_1_sim,
#         'ref_1_rating': ref_1_rating,
#         'ref_2': ref_2,
#         'ref_2_similarity': ref_2_sim,
#         'ref_2_rating': ref_2_rating
#     })

# final_result_df = pd.DataFrame(results).set_index('candidate')

# print("Final DataFrame:")
# final_result_df


In [30]:
def predict_ratings_for_user(user_id):
    results = []
    user_ratings = utility_matrix[user_id].to_dict()
    top_k = filtered_sim_df.apply(lambda row: row.sort_values(ascending=False).head(k), axis=1)

    for candidate in top_k.columns:
        refs = []
        numerator = 0
        denominator = 0

        for ref_item in top_k.index:  # each row is an item the user has rated
            similarity = top_k.at[ref_item, candidate]

            if pd.isna(similarity):
                continue

            user_rating = user_ratings.get(ref_item)

            if user_rating is not None:
                numerator += similarity * user_rating
                denominator += similarity
                refs.append((ref_item, similarity, user_rating))

        if denominator != 0:
            predicted_rating = numerator / denominator
        else:
            predicted_rating = None

        row = {
            'candidate': candidate,
            'predicted_rating': predicted_rating
        }

        for i in range(2):
            if i < len(refs):
                ref_id, ref_sim, ref_rating = refs[i]
                row[f'ref_{i+1}'] = ref_id
                row[f'ref_{i+1}_similarity'] = ref_sim
                row[f'ref_{i+1}_rating'] = ref_rating
            else:
                row[f'ref_{i+1}'] = pd.NA
                row[f'ref_{i+1}_similarity'] = pd.NA
                row[f'ref_{i+1}_rating'] = pd.NA

        results.append(row)

    results_df = pd.DataFrame(results)
    results_df.set_index('candidate', inplace=True)
    return results_df

predict_ratings_for_user(user_id)

candidate,predicted_rating,ref_1,ref_1_similarity,ref_1_rating,ref_2,ref_2_similarity,ref_2_rating
2726,4.974675,2726,1.0,5.0,43267.0,0.012825,3.0
8063,5.0,19299,0.007944,5.0,,,
12709,5.0,56660,0.004105,5.0,,,
17029,5.0,2726,0.01324,5.0,,,
19299,5.0,19299,1.0,5.0,,,
40712,5.0,2726,0.012574,5.0,,,
42906,3.0,43267,0.011221,3.0,,,
43267,3.025325,2726,0.012825,5.0,43267.0,1.0,3.0
43827,5.0,19299,0.00763,5.0,,,
48731,5.0,56660,0.015657,5.0,,,


candidate,predicted_rating,ref_1,ref_1_similarity,ref_1_rating,ref_2,ref_2_similarity,ref_2_rating
45026,3.0,43267,0.010135,3,,,
86341,5.0,2726,0.009534,5,,,
17029,4.279497,2726,0.01324,5,43267.0,0.007456,3.0
12709,5.0,56660,0.004105,5,,,
40712,5.0,2726,0.012574,5,,,
123176,5.0,19299,0.014827,5,,,
90409,5.0,56660,0.020505,5,,,
134732,5.0,19299,0.00685,5,,,
60465,5.0,56660,0.003673,5,,,
120147,3.844362,2726,0.013797,5,43267.0,0.018883,3.0
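The predicted rating is a similarity-weighted average of the user's ratings on the reference items. As a worked check, the sketch below recomputes the prediction for candidate 17029 from the two reference rows shown in the table above; the similarities are the rounded values from that table, so the result matches the reported 4.279497 only approximately.

```python
# (reference item, similarity to candidate 17029, user's rating of the reference item)
refs = [
    (2726, 0.01324, 5.0),
    (43267, 0.007456, 3.0),
]

numerator = sum(sim * rating for _, sim, rating in refs)
denominator = sum(sim for _, sim, _ in refs)
predicted = numerator / denominator

print(f"Predicted rating for 17029: {predicted:.4f}")
```

With a single reference item the weights cancel and the prediction equals that item's rating, which is why most rows in the table show exactly 5.0 or 3.0. If negative similarities were ever among the top-K, dividing by the sum of absolute similarities would be a safer normalisation.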


------------------------------------------------------

# `05` Additional Tasks

## `i` Explore Surprise Library

- Install Scikit Surprise library.
- Explore the Library Documentation

In [None]:
# - https://surprise.readthedocs.io/en/stable/
# - https://surprise.readthedocs.io/en/stable/getting_started.html
# - https://surprise.readthedocs.io/en/stable/installation.html
# - https://surprise.readthedocs.io/en/stable/getting_started.html#quick-start



## `ii` Implement Item-Based KNN Approach [Bonus]

- Follow the steps explained in the sessions to prepare the KNN approach.
- Generate prediction ratings for user $199988$ on all songs.

In [130]:
# Load the dataset
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['userID', 'songID', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Build the KNN model
sim_options = {
    'name': 'cosine',
    'user_based': False  # Compute similarities between items
}
knn = KNNBasic(sim_options=sim_options)
knn.fit(trainset)

# Predict ratings for the testset
predictions = knn.test(testset)

# Compute RMSE
rmse = accuracy.rmse(predictions)
print(f"RMSE: {rmse}")

# Predict ratings for a specific user
user_id = 199988
song_ids = ratings['songID'].unique()
predictions = []
for song_id in song_ids:
    pred = knn.predict(user_id, song_id)
    predictions.append((song_id, pred.est))

# Sort predictions by estimated rating
predictions.sort(key=lambda x: x[1], reverse=True)

# Display top 10 recommended songs
print("Top 10 recommended songs for user 199988:")
for song_id, rating in predictions[:10]:
    print(f"Song ID: {song_id}, Predicted Rating: {rating:.2f}")


Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.5847
RMSE: 1.5846729916758167
Top 10 recommended songs for user 199988:
Song ID: 52611, Predicted Rating: 4.70
Song ID: 48731, Predicted Rating: 4.64
Song ID: 71582, Predicted Rating: 4.61
Song ID: 56660, Predicted Rating: 4.59
Song ID: 72017, Predicted Rating: 4.56
Song ID: 43827, Predicted Rating: 4.55
Song ID: 86341, Predicted Rating: 4.53
Song ID: 40712, Predicted Rating: 4.53
Song ID: 19299, Predicted Rating: 4.53
Song ID: 90409, Predicted Rating: 4.53
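The top-10 list above includes songs 56660 and 19299, which user 199988 has already rated (see `potential_items`). A recommender usually excludes such items. The sketch below shows one way to do that on a plain list of `(song_id, estimate)` pairs like the `predictions` list built above; the helper name `top_n_unseen` is hypothetical, and the sample values are copied from the outputs in this notebook.

```python
def top_n_unseen(predictions, rated_song_ids, n=10):
    """Return the n highest-scoring (song_id, estimate) pairs the user has not rated."""
    rated = set(rated_song_ids)
    unseen = [(song, est) for song, est in predictions if song not in rated]
    return sorted(unseen, key=lambda pair: pair[1], reverse=True)[:n]

# A few pairs from the KNN output above, plus the user's rated songs
sample_predictions = [(56660, 4.59), (52611, 4.70), (19299, 4.53), (48731, 4.64)]
rated_by_user = [2726, 19299, 43267, 56660]

recommended = top_n_unseen(sample_predictions, rated_by_user, n=2)
print(recommended)
```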


In [139]:
# Load the dataset
data = Dataset.load_from_df(ratings[['userID', 'songID', 'rating']], reader)

# We'll use the famous SVD algorithm.
algo = SVD()

# Evaluate performances of our algorithm on the dataset.
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

print(perf)

# Retrain the algorithm on the full dataset
trainset = data.build_full_trainset()
algo.fit(trainset)

# Predict every (user, song) pair that is NOT in the training data
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

# Note: the anti-testset has no true ratings; Surprise fills them with the global mean,
# so this RMSE is not comparable to the cross-validation RMSE above
rmse = accuracy.rmse(predictions)
print(f"RMSE: {rmse}")

# Predict ratings for a specific user
user_id = 199988
song_ids = ratings['songID'].unique()
predictions = []
for song_id in song_ids:
    pred = algo.predict(user_id, song_id)
    predictions.append((song_id, pred.est))

# Sort predictions by estimated rating
predictions.sort(key=lambda x: x[1], reverse=True)

# Display top 10 recommended songs
print("Top 10 recommended songs for user 199988:")
for song_id, rating in predictions[:10]:
    print(f"Song ID: {song_id}, Predicted Rating: {rating:.2f}")


Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.5006  1.5032  1.5010  1.5016  0.0012  
MAE (testset)     1.3007  1.3038  1.3020  1.3021  0.0013  
Fit time          0.39    0.39    0.43    0.41    0.02    
Test time         0.06    0.06    0.05    0.06    0.00    
{'test_rmse': array([1.50057483, 1.50321693, 1.5010407 ]), 'test_mae': array([1.30069178, 1.30377037, 1.30195695]), 'fit_time': (0.39254164695739746, 0.39478540420532227, 0.43482232093811035), 'test_time': (0.055329322814941406, 0.056224822998046875, 0.05439448356628418)}
RMSE: 0.5486
RMSE: 0.548611923927401
Top 10 recommended songs for user 199988:
Song ID: 56660, Predicted Rating: 4.79
Song ID: 19299, Predicted Rating: 4.77
Song ID: 2726, Predicted Rating: 4.63
Song ID: 55240, Predicted Rating: 4.48
Song ID: 125557, Predicted Rating: 4.44
Song ID: 52611, Predicted Rating: 4.42
Song ID: 92881, Predicted Rating: 4.40
Song ID: 132189, Predicted

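The first cell (item-based KNN) is valid, but it is slightly less accurate here: its RMSE on the 20% hold-out set is about 1.58.

The second cell (SVD) reaches a mean RMSE of about 1.50 under 3-fold cross-validation, which is the figure to compare against KNN. The much lower RMSE of 0.5486 is computed on the anti-testset, whose missing ratings Surprise fills with the global mean, so it does not measure accuracy.

SVD tends to be more accurate because its latent factors can capture user-item patterns that pairwise item similarity misses, especially when ratings are sparse.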
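For reference, the RMSE and MAE values in the outputs above follow the standard definitions. The sketch below computes both on a small hypothetical set of true and predicted ratings; the helper names are illustrative and independent of Surprise.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Hypothetical ratings: errors are 1, 0, -1, -2
true_ratings = [5, 3, 4, 1]
predicted_ratings = [4, 3, 5, 3]

print(f"RMSE: {rmse(true_ratings, predicted_ratings):.4f}")  # sqrt(6 / 4)
print(f"MAE:  {mae(true_ratings, predicted_ratings):.4f}")   # 4 / 4
```

RMSE penalises large errors more heavily than MAE, which is why it is the larger of the two in every fold of the cross-validation table.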

----------------------------------------------

$$ Wish \space you \space all \space the \space best \space ♡ $$
$$ Abdelrahman \space Eid $$