# 🎬 Movie Rating Prediction (Without Movie Titles)

Since the provided dataset (`rating-disposition-2023`) contains only **ratings.csv**  
(with columns `userId, movieId, rating, tstamp`) and no `movies.csv`,  
our recommendations display movies by their **movieId** identifiers instead of titles.

This approach is still valid for collaborative filtering, since the model uses  
**userId–movieId–rating** interactions to learn preferences.

### Step 1: Import Libraries
We import libraries for data handling, preprocessing, collaborative filtering, and evaluation.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

### Step 2: Load Dataset
We use the **MovieLens dataset (ratings.csv)**, which has the columns:
- userId  
- movieId  
- rating  
- tstamp (the rating timestamp)  

In [4]:
df = pd.read_csv("ratings.csv")
df.head()

|   | userId | movieId | rating | tstamp |
|---|--------|---------|--------|--------|
| 0 | 206 | 4803 | 4.0 | 2003-04-07 13:52:01 |
| 1 | 5073 | 72731 | 4.0 | 2020-02-19 16:07:53 |
| 2 | 4739 | 91653 | 4.0 | 2020-12-28 15:35:58 |
| 3 | 535 | 3005 | 3.0 | 2008-12-26 05:38:11 |
| 4 | 465 | 4776 | 3.0 | 2008-08-13 20:22:36 |


### Step 3: Preprocessing
We check for missing values and keep only the three columns that collaborative filtering needs: `userId`, `movieId`, and `rating`.

In [6]:
# Check for nulls
print("Missing values:", df.isnull().sum().sum())

# Keep only relevant columns
df = df[['userId', 'movieId', 'rating']]

Missing values: 0
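
Beyond nulls, two further checks can be worth running before the timestamp column is dropped: duplicate user-movie pairs, and ratings outside the scale later passed to `Reader`. The sketch below uses a small invented frame (`toy`) rather than the real data. The column name `tstamp` follows the `df.head()` output above, and keeping the most recent rating for a duplicated pair is an assumption, not something the dataset prescribes.

```python
import pandas as pd

# Toy stand-in for ratings.csv (values are invented for illustration).
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 10, 10, 20, 20],
    "rating":  [3.0, 4.5, 5.0, 0.0, 2.5],
    "tstamp":  pd.to_datetime([
        "2020-01-01", "2021-06-01", "2019-03-15", "2019-03-16", "2022-11-30",
    ]),
})

# Assumption: valid ratings lie on the 0.5-5 scale used for Reader in Step 5.
in_range = toy["rating"].between(0.5, 5.0)
print("Out-of-range ratings:", int((~in_range).sum()))

# Count rows that repeat an earlier (userId, movieId) pair.
dupes = toy.duplicated(subset=["userId", "movieId"]).sum()
print("Duplicate user-movie pairs:", int(dupes))

# Assumption: keep the most recent rating per pair, after dropping invalid rows.
clean = (
    toy[in_range]
    .sort_values("tstamp")
    .drop_duplicates(subset=["userId", "movieId"], keep="last")
    [["userId", "movieId", "rating"]]
    .reset_index(drop=True)
)
print(clean)
```

If both counts are zero on the real data, the column selection in this step is all the preparation the model needs.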


### Step 4: Baseline Model
As a simple baseline, we predict each movie's mean rating for every user, ignoring who is doing the rating.

In [8]:
movie_means = df.groupby('movieId')['rating'].mean()

# Example: Predict rating for movieId=50
movie_means.loc[50]

4.051533219761499
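
To compare this baseline with the cross-validated SVD scores in Step 5, it helps to score it on ratings it has not seen. The following is a minimal sketch on an invented toy frame, not the notebook's data: it computes movie means on a training part, falls back to the global mean for movies absent from training (an assumption), and reports RMSE and MAE with NumPy.

```python
import numpy as np
import pandas as pd

# Invented toy ratings; the real notebook would use df from Step 3.
toy = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3, 4, 4],
    "movieId": [10, 10, 10, 20, 20, 30, 10, 40],
    "rating":  [4.0, 5.0, 3.0, 2.0, 3.0, 4.0, 4.0, 1.0],
})

# Hold out the last two rows as a fixed test set so the example is deterministic.
train, test = toy.iloc[:6], toy.iloc[6:]

movie_means = train.groupby("movieId")["rating"].mean()
global_mean = train["rating"].mean()

# Movies unseen in training (here movieId 40) get the global mean instead.
preds = test["movieId"].map(movie_means).fillna(global_mean)

errors = test["rating"] - preds
rmse = float(np.sqrt((errors ** 2).mean()))
mae = float(errors.abs().mean())
print(f"Baseline RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```

On real data a random split (or the same folds used for the model) would replace the fixed two-row holdout.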

### Step 5: Collaborative Filtering
We use Surprise's **SVD** algorithm, a popular matrix factorization technique, to predict user-movie ratings. Five-fold cross-validation reports RMSE and MAE on the held-out fold of each split.

In [10]:
# Load dataset into Surprise format
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

# Evaluate SVD with 5-fold cross-validation (each fold trains a fresh copy of the model)
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.7432  0.7433  0.7440  0.7433  0.7440  0.7436  0.0004  
MAE (testset)     0.5550  0.5554  0.5558  0.5553  0.5555  0.5554  0.0003  
Fit time          46.25   47.07   49.89   47.83   49.92   48.19   1.49    
Test time         12.66   12.13   12.08   12.73   11.98   12.32   0.32    


{'test_rmse': array([0.74322708, 0.74329394, 0.74400383, 0.74326938, 0.74397194]),
 'test_mae': array([0.55495395, 0.55543072, 0.55584093, 0.55530752, 0.55548897]),
 'fit_time': (46.24607276916504,
  47.06695294380188,
  49.88764190673828,
  47.833648920059204,
  49.92322325706482),
 'test_time': (12.662708044052124,
  12.125094175338745,
  12.084789037704468,
  12.730836868286133,
  11.97577977180481)}
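
For intuition about what the fitted model computes: Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of a user and an item factor vector, and `predict` clips the result to the rating scale by default. The NumPy sketch below reproduces that rule with made-up parameter values. The names `svd_estimate`, `mu`, `b_u`, `b_i`, `p_u`, and `q_i` are illustrative, and none of the numbers come from the trained model.

```python
import numpy as np

def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(0.5, 5.0)):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u, clipped."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    low, high = rating_scale
    return min(high, max(low, raw))

# Made-up parameters for one user and one movie (not learned values).
mu = 3.5                          # global mean rating
b_u, b_i = 0.2, 0.4               # user and item biases
p_u = np.array([0.5, -0.1, 0.3])  # user factors
q_i = np.array([0.6, 0.2, 0.1])   # item factors

typical = svd_estimate(mu, b_u, b_i, p_u, q_i)
print(typical)

# A large raw score is clipped to the top of the rating scale.
clipped = svd_estimate(mu, 1.0, 1.0, p_u, q_i)
print(clipped)
```

Clipping is why an estimate of exactly 5 can appear among the top recommendations in Step 7.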

### Step 6: Train-Test Evaluation
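We fit SVD on the full dataset and predict a single rating as a sanity check. This cell does not compute a held-out RMSE/MAE; the cross-validation in Step 5 already provides those estimates.

In [12]:
# Build a Surprise trainset from all ratings.
# (A scikit-learn split of the DataFrame is not used: Surprise cannot consume it directly.)
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
trainset = data.build_full_trainset()

# Fit the final model on every available rating
algo = SVD()
algo.fit(trainset)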

# Predict on a sample
pred = algo.predict(uid=1, iid=50)
print("Predicted rating:", pred.est)

Predicted rating: 3.929342011771923


### Step 7: Generate Recommendations
For a given user, we recommend the top-N movies they haven’t rated yet, ranked by predicted rating.

In [15]:
def recommend_movies(user_id, n=5):
    all_movies = df['movieId'].unique()
    # Use a set so each membership test below is O(1) instead of scanning an array
    watched = set(df.loc[df['userId'] == user_id, 'movieId'])
    not_watched = [m for m in all_movies if m not in watched]
    
    predictions = [algo.predict(user_id, m) for m in not_watched]
    predictions.sort(key=lambda x: x.est, reverse=True)
    
    return predictions[:n]

# Example: Recommend for user 1
top_movies = recommend_movies(1, n=5)
top_movies

[Prediction(uid=1, iid=168040, r_ui=None, est=5, details={'was_impossible': False}),
 Prediction(uid=1, iid=26528, r_ui=None, est=4.992946994729287, details={'was_impossible': False}),
 Prediction(uid=1, iid=167392, r_ui=None, est=4.977228532503565, details={'was_impossible': False}),
 Prediction(uid=1, iid=174615, r_ui=None, est=4.966659045696877, details={'was_impossible': False}),
 Prediction(uid=1, iid=133712, r_ui=None, est=4.959956993758615, details={'was_impossible': False})]

### Defining the Recommendation Function

We redefine `recommend_movies()` so that it returns a DataFrame instead of raw `Prediction` objects. The function:  
- Takes a `user_id` and the number of recommendations `n`.  
- Identifies movies the user has **not yet rated**.  
- Uses the trained collaborative filtering model (`algo`) to **predict ratings** for these unrated movies.  
- Returns the top-N movies (by `movieId`) with the **highest predicted ratings**.

In [20]:
def recommend_movies(user_id, n=5):
    """Recommend top-N movies for a given user, showing only movieId and predicted rating."""
    
    all_movies = df['movieId'].unique()
    # Use a set so each membership test below is O(1) instead of scanning an array
    watched = set(df.loc[df['userId'] == user_id, 'movieId'])
    not_watched = [m for m in all_movies if m not in watched]
    
    predictions = [algo.predict(user_id, m) for m in not_watched]
    predictions.sort(key=lambda x: x.est, reverse=True)
    
    top_predictions = predictions[:n]
    
    # Convert into DataFrame
    return pd.DataFrame([(pred.iid, pred.est) for pred in top_predictions],
                        columns=['movieId', 'predicted_rating'])
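
Sorting every candidate is more work than needed when only the top `n` are kept. As an alternative sketch, the helper below takes any scoring function, skips already-rated movies via a set, and selects the best `n` with `heapq.nlargest`. The name `top_n_unrated` and the stub predictor `fake_predict` are hypothetical; with the notebook's model, the scoring function could be `lambda u, m: algo.predict(u, m).est`.

```python
import heapq

def top_n_unrated(predict, user_id, all_movies, rated, n=5):
    """Return the n (movieId, score) pairs with the highest predicted score."""
    rated = set(rated)  # O(1) membership tests
    scored = ((m, predict(user_id, m)) for m in all_movies if m not in rated)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Hypothetical stand-in for a trained model: the score depends only on movieId.
def fake_predict(user_id, movie_id):
    return 1.0 + (movie_id * 7 % 9) / 2.0

movies = [1, 2, 3, 4, 5, 6, 7, 8]
already_rated = [3, 5]
result = top_n_unrated(fake_predict, user_id=1, all_movies=movies,
                       rated=already_rated, n=3)
print(result)
```

The generator keeps memory use flat, since only the current best `n` candidates are held at any time.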

### Generating Recommendations for a User

Here, we generate **top 5 recommendations** for `userId = 1`.  
Since the dataset does not include movie titles, the results are shown using **movieId** and their **predicted ratings**.

In [22]:
recommend_movies(user_id=1, n=5)

|   | movieId | predicted_rating |
|---|---------|------------------|
| 0 | 168040 | 5.0 |
| 1 | 26528 | 4.992947 |
| 2 | 167392 | 4.977229 |
| 3 | 174615 | 4.966659 |
| 4 | 133712 | 4.959957 |
