
In [24]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split


In [25]:
# Raw GitHub URLs for the movies and ratings CSV files

movies_url  = "https://raw.githubusercontent.com/Sadiya-Akter-Mim/MovieLens-Recommender-Assignment/main/movies.csv"
ratings_url = "https://raw.githubusercontent.com/Sadiya-Akter-Mim/MovieLens-Recommender-Assignment/main/ratings.csv"

In [26]:
# Load both datasets into DataFrames
movies = pd.read_csv(movies_url)
ratings = pd.read_csv(ratings_url)

In [27]:
# Show the first 5 rows of each dataset
print("Movies dataset:")
print(movies.head())
print("\nRatings dataset:")
print(ratings.head())

Movies dataset:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

Ratings dataset:
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


The data is loaded from GitHub, and the first five rows of each table are printed as a sanity check.
`movies` has the columns movieId, title, and genres;
`ratings` has the columns userId, movieId, rating, and timestamp.
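
Before factorizing, it can help to know how sparse the ratings table is. The following is a minimal sketch on a toy frame that reuses the same column names; the helper name `rating_density` is hypothetical, and it assumes each (userId, movieId) pair appears at most once.

```python
import pandas as pd

def rating_density(ratings: pd.DataFrame) -> float:
    """Fraction of all (user, movie) pairs that actually have a rating."""
    n_users = ratings["userId"].nunique()
    n_movies = ratings["movieId"].nunique()
    return len(ratings) / (n_users * n_movies)

# Toy data: 3 users, 3 movies, 4 ratings
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5],
})
density = rating_density(toy)
print(f"density = {density:.3f}")  # 4 ratings out of 9 possible pairs
```

A low density on the real data means most cells of the user-movie matrix will be filled in rather than observed.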



In [30]:
# Split ratings into train (80%) and test (20%) for evaluation
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

print("Train size:", train.shape)
print("Test size:", test.shape)


Train size: (80668, 4)
Test size: (20168, 4)


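The train set is used to fit the SVD model, and the test set is used to evaluate the recommendations. Because the split is random over rating rows, a user or movie that appears in the test set is not guaranteed to appear in the train set.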
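
One way to check how a random row-level split behaves is to count the test users and movies that never appear in the train set. This is a sketch on toy data; the names `toy`, `train_toy`, and `test_toy` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "userId":  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "movieId": [10, 20, 30, 40, 10, 20, 30, 40, 10, 20],
    "rating":  [4.0, 5.0, 3.0, 2.5, 4.0, 3.5, 1.0, 5.0, 2.0, 4.5],
})
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=42)

# Users and movies in the test set that the model would never see in training
unseen_users = set(test_toy["userId"]) - set(train_toy["userId"])
unseen_movies = set(test_toy["movieId"]) - set(train_toy["movieId"])
print("Test users missing from train:", len(unseen_users))
print("Test movies missing from train:", len(unseen_movies))
```

Running the same two set differences on `train` and `test` shows how many evaluation cases the model cannot possibly get right.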

In [29]:
# Create User-Movie matrix from train set
train_matrix = train.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)

print("User-Movie matrix:")
print(train_matrix.head())


User-Movie matrix:
movieId  1       2       3       4       5       6       7       8       \
userId                                                                    
1           4.0     0.0     4.0     0.0     0.0     4.0     0.0     0.0   
2           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
3           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
4           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
5           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   

movieId  9       10      ...  191005  193565  193571  193573  193579  193581  \
userId                   ...                                                   
1           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
2           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
3           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
4           0.0     0.0  ...     0.0     0.0     0.0   

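Converts the train ratings into a matrix with users as rows and movies as columns. Missing ratings are filled with 0 so that SVD can run on a dense array, which means the model treats an unrated movie as a rating of 0 rather than as unknown.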
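
The effect of the pivot and the zero fill is easier to see on a tiny example. This is a sketch with made-up ratings; the mask at the end assumes that real ratings are always greater than 0, which should be verified against the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 20],
    "rating":  [4.0, 5.0, 3.0],
})

# Same pivot as above: users as rows, movies as columns, gaps filled with 0
toy_matrix = toy.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)
print(toy_matrix)

# Assumption: a stored 0 always means "not rated", never an actual rating
rated_mask = toy_matrix > 0
print("Observed ratings:", int(rated_mask.values.sum()))
```

User 2 never rated movie 10, so that cell holds 0.0 after the fill, and the mask recovers which cells were really observed.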

In [31]:
# Convert to numpy array for SVD
matrix = train_matrix.values

# Apply TruncatedSVD
svd = TruncatedSVD(n_components=20, random_state=42)
latent_matrix = svd.fit_transform(matrix)
reconstructed = np.dot(latent_matrix, svd.components_)

# Convert back to DataFrame
svd_train_df = pd.DataFrame(reconstructed, index=train_matrix.index, columns=train_matrix.columns)

print("SVD trained rating matrix:")
print(svd_train_df.head())


SVD trained rating matrix:
movieId    1         2         3         4         5         6         7       \
userId                                                                          
1        2.276717  1.254279  1.111054  0.009030  0.154842  1.467038  0.274476   
2        0.159419 -0.005200  0.031014  0.006267  0.028691 -0.045036 -0.028494   
3        0.046178  0.021501  0.032577 -0.003379 -0.012925  0.036691 -0.003069   
4        1.716662  0.323975  0.138665  0.029194  0.179403  0.507154  0.358360   
5        1.108818  0.792571  0.270276  0.105957  0.383099  0.462174  0.462996   

movieId    8         9         10      ...    191005    193565    193571  \
userId                                 ...                                 
1        0.041337  0.163454  1.531384  ... -0.014538 -0.011308 -0.012923   
2        0.026206  0.019432 -0.076790  ...  0.011383  0.008854  0.010119   
3        0.006263  0.012388  0.016572  ... -0.000318 -0.000247 -0.000282   
4       -0.050802  0.0070

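Fits a truncated SVD with 20 latent features, then multiplies the user factors by the component matrix to reconstruct a dense matrix. The reconstructed values serve as predicted scores for unrated movies. They are not on the original rating scale (some are negative), so they are best read as ranking scores.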
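
The shapes involved in the factorization can be checked on a small random matrix. This is a sketch only: the 6x5 matrix and `n_components=2` are arbitrary choices for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
toy = rng.integers(0, 6, size=(6, 5)).astype(float)  # 6 users, 5 movies

svd_toy = TruncatedSVD(n_components=2, random_state=42)
user_factors = svd_toy.fit_transform(toy)   # shape (6, 2): one row per user
item_factors = svd_toy.components_          # shape (2, 5): one column per movie
approx = user_factors @ item_factors        # shape (6, 5): low-rank reconstruction

rmse = np.sqrt(np.mean((toy - approx) ** 2))
print("Reconstruction RMSE:", rmse)
print("Explained variance ratio:", svd_toy.explained_variance_ratio_.sum())
```

Comparing the explained variance for a few values of `n_components` is one simple way to judge whether 20 latent features is a reasonable choice.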

In [32]:
def recommend_movies(user_id, N=10):
    """Return the Top-N recommended movie IDs for a given user using SVD."""

    if user_id not in svd_train_df.index:
        return []

    user_ratings = svd_train_df.loc[user_id]

    # Remove movies already rated in train
    watched = train[train["userId"] == user_id]["movieId"].values
    recommendations = user_ratings.drop(watched).sort_values(ascending=False).head(N)

    return recommendations.index.tolist()

# Example: top 10 movies for user 1
top_movies = recommend_movies(1, N=10)
print("Top-10 recommended movie IDs for user 1:", top_movies)


Top-10 recommended movie IDs for user 1: [589, 1198, 1200, 32, 1214, 1580, 2115, 1036, 541, 1259]


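Returns the Top-N movie IDs, ranked by predicted score, that the user has not rated in the train set. Users absent from the train matrix get an empty list. This is the core function required for the assignment.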
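
The drop, sort, and head steps can be traced on a small Series of made-up scores. The `errors="ignore"` argument is an optional safeguard shown here as an assumption about how one might handle IDs that are missing from the index; the function above does not use it.

```python
import pandas as pd

# Made-up predicted scores, indexed by movieId
scores = pd.Series({10: 0.9, 20: 0.1, 30: 0.7, 40: 0.5})
already_rated = [10]

# Remove seen movies, rank the rest by score, keep the best two
top2 = scores.drop(already_rated).sort_values(ascending=False).head(2).index.tolist()
print(top2)  # [30, 40]

# errors="ignore" skips IDs that are not in the index instead of raising KeyError
top2_safe = scores.drop([10, 999], errors="ignore").sort_values(ascending=False).head(2).index.tolist()
print(top2_safe)  # [30, 40]
```

Movie 10 has the highest score but is excluded because the user already rated it.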

In [33]:
# Map movie IDs to titles for readability
def movie_titles(movie_ids):
    return movies[movies["movieId"].isin(movie_ids)][["movieId", "title"]]

# Example: show recommended movie titles
movie_titles(top_movies)



      movieId  title
31         32  Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
474       541  Blade Runner (1982)
507       589  Terminator 2: Judgment Day (1991)
793      1036  Die Hard (1988)
900      1198  Raiders of the Lost Ark (Indiana Jones and the...
902      1200  Aliens (1986)
915      1214  Alien (1979)
958      1259  Stand by Me (1986)
1183     1580  Men in Black (a.k.a. MIB) (1997)
1576     2115  Indiana Jones and the Temple of Doom (1984)


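Converts movie IDs into human-readable titles. The rows come back in the order of the `movies` table (sorted by movieId here), not in recommendation rank order.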
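
If the titles should follow the recommendation ranking, a left merge keeps the order of the ID list. This is a sketch of one possible variant; `movie_titles_ranked` and `movies_toy` are hypothetical names, and the toy table holds three titles taken from the output above.

```python
import pandas as pd

movies_toy = pd.DataFrame({
    "movieId": [32, 541, 589],
    "title": [
        "Twelve Monkeys (a.k.a. 12 Monkeys) (1995)",
        "Blade Runner (1982)",
        "Terminator 2: Judgment Day (1991)",
    ],
})

def movie_titles_ranked(movie_ids, movies_df):
    """Hypothetical variant: return titles in the same order as movie_ids."""
    order = pd.DataFrame({"movieId": movie_ids})
    # A left merge preserves the row order of the left frame
    return order.merge(movies_df[["movieId", "title"]], on="movieId", how="left")

ranked = movie_titles_ranked([589, 32, 541], movies_toy)
print(ranked)
```

IDs that are missing from the movies table would show up with a missing title instead of being dropped silently.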

In [35]:
def precision_at_k(actual, predicted, k=10):
    return len(set(predicted[:k]) & set(actual)) / k

def recall_at_k(actual, predicted, k=10):
    return len(set(predicted[:k]) & set(actual)) / len(actual) if len(actual) > 0 else 0

def ndcg_at_k(actual, predicted, k=10):
    dcg = 0.0
    for i, p in enumerate(predicted[:k]):
        if p in actual:
            dcg += 1 / np.log2(i + 2)
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(actual), k)))
    return dcg / idcg if idcg > 0 else 0


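Implements the Precision@K, Recall@K, and NDCG@K metrics. Relevance is binary: an item counts as relevant if it appears in `actual`. Precision always divides by k, even when fewer than k items are predicted.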
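
A small worked example makes the three formulas concrete. The functions are repeated so the block runs on its own; the `actual` and `predicted` lists are made up for illustration.

```python
import numpy as np

def precision_at_k(actual, predicted, k=10):
    return len(set(predicted[:k]) & set(actual)) / k

def recall_at_k(actual, predicted, k=10):
    return len(set(predicted[:k]) & set(actual)) / len(actual) if len(actual) > 0 else 0

def ndcg_at_k(actual, predicted, k=10):
    dcg = 0.0
    for i, p in enumerate(predicted[:k]):
        if p in actual:
            dcg += 1 / np.log2(i + 2)
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(actual), k)))
    return dcg / idcg if idcg > 0 else 0

actual = [1, 2, 3]        # relevant items
predicted = [1, 4, 2, 5]  # ranked recommendations; hits at ranks 1 and 3

p = precision_at_k(actual, predicted, k=4)  # 2 hits / 4 slots = 0.5
r = recall_at_k(actual, predicted, k=4)     # 2 hits / 3 relevant items
n = ndcg_at_k(actual, predicted, k=4)       # DCG = 1 + 1/2; ideal DCG = 1 + 1/log2(3) + 1/2
print(p, r, n)
```

NDCG comes out at roughly 0.704 here because the second hit sits at rank 3 instead of rank 2, and the third relevant item is missing entirely.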

In [36]:
precisions = []
recalls = []
ndcgs = []

users = test['userId'].unique()
for u in users:
    true_movies = test[test["userId"]==u]["movieId"].values
    pred_movies = recommend_movies(u, N=10)
    precisions.append(precision_at_k(true_movies, pred_movies, k=10))
    recalls.append(recall_at_k(true_movies, pred_movies, k=10))
    ndcgs.append(ndcg_at_k(true_movies, pred_movies, k=10))

print("Average Precision@10:", np.mean(precisions))
print("Average Recall@10:", np.mean(recalls))
print("Average NDCG@10:", np.mean(ndcgs))


Average Precision@10: 0.27524590163934426
Average Recall@10: 0.144280751947409
Average NDCG@10: 0.31575447784345545


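Loops over all test users and averages the metrics to evaluate model performance. A user's relevant set is every movie they rated in the test set, whatever the rating value. Users missing from the train set receive no recommendations and score 0 on every metric.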
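
The per-user filtering in the loop can also be done once with a `groupby`. This is a sketch under stated assumptions: `recommend_stub` is a hypothetical stand-in for `recommend_movies` with hard-coded answers, and the test data is made up.

```python
import numpy as np
import pandas as pd

test_toy = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 30]})

def recommend_stub(user_id, N=10):
    """Hypothetical stand-in for recommend_movies, with fixed recommendations."""
    return {1: [10, 99], 2: [98, 97]}.get(user_id, [])[:N]

# Build each user's set of test movies in a single pass
actual_by_user = test_toy.groupby("userId")["movieId"].apply(set)

k = 2
per_user_precision = [
    len(set(recommend_stub(u, N=k)) & actual) / k
    for u, actual in actual_by_user.items()
]
mean_precision = np.mean(per_user_precision)
print("Average Precision@2:", mean_precision)  # user 1 scores 0.5, user 2 scores 0.0
```

On a large test set this avoids re-scanning the whole ratings table for every user.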