# Project: Movie Recommendation System

## Name: Amritha Prakash


# **Singular Value Decomposition**

Singular Value Decomposition, or SVD, is a mathematical technique used in many fields such as signal processing, statistics, and machine learning, particularly in the context of recommendation systems. It's a method for decomposing a matrix into three other matrices that reveal its underlying structure.


---

## Basic Concepts

### Matrices
- **Matrix**: A rectangular array of numbers.
- **Dimension of a Matrix**: Given in the form of rows × columns.

### Decomposition
- **Decomposition**: Breaking down a complex matrix into simpler, understandable parts.

## What is SVD?

```
SVD breaks down any given matrix A into three separate matrices named U, Σ, and V*,
i.e. A = UΣV*
```
Where the components are:
```
- A: Original matrix.
- U: Left singular vectors (orthogonal matrix).
- Σ: Diagonal matrix of singular values (non-negative).
- V*: Conjugate transpose of V, the orthogonal matrix whose columns are the right singular vectors.
```
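
As a quick numerical check of the factorization above, here is a minimal sketch using NumPy's `numpy.linalg.svd`. The matrix values are made up for illustration. For a real-valued matrix, V* is simply the transpose of V, which NumPy returns directly (named `Vt` here).

```python
import numpy as np

# A small illustrative 3x2 matrix (values are made up for this example)
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

# full_matrices=False returns the "thin" SVD: U is 3x2, s has 2 entries, Vt is 2x2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A from its factors: A = U @ diag(s) @ Vt
A_rebuilt = U @ np.diag(s) @ Vt

print("Singular values:", s)
print("Reconstruction matches A:", np.allclose(A, A_rebuilt))
```

NumPy returns the singular values as a 1-D array sorted from largest to smallest, so `np.diag(s)` is needed to form the diagonal matrix Σ.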


## Where do we use SVDs?

### Applications in Recommendation Systems

In recommendation systems, SVD is used to predict unknown preferences by decomposing a large matrix of user-item interactions into factors representing latent features. It helps in capturing the underlying patterns in the data.

### Process

1. **Matrix Creation**: Start with a matrix where rows represent users, columns represent items, and entries represent user ratings.
2. **Apply SVD**: Decompose this matrix using SVD.
3. **Latent Features**: The decomposition reveals latent features that explain observed ratings.
4. **Prediction**: Use the decomposed matrices to predict missing ratings.
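
The four steps above can be sketched on a toy matrix. This is only an illustration under stated assumptions: the ratings are made up, missing entries (marked `np.nan`) are filled with each user's mean rating before decomposing, and the rank `k = 2` is an arbitrary choice. It is not how the library used later in this project trains its model.

```python
import numpy as np

# Step 1: toy user-item matrix (rows = users, columns = movies); np.nan = not rated
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, 1.0, 1.0],
              [1.0, 1.0, 5.0, np.nan],
              [np.nan, 1.0, 4.0, 5.0]])

# Assumption: fill each missing rating with that user's mean rating
user_means = np.nanmean(R, axis=1, keepdims=True)
R_filled = np.where(np.isnan(R), user_means, R)

# Step 2: decompose the mean-centred matrix
U, s, Vt = np.linalg.svd(R_filled - user_means, full_matrices=False)

# Step 3: keep only the k strongest latent features
k = 2
R_hat = user_means + U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Step 4: read off predictions for the originally missing entries
for u, i in zip(*np.where(np.isnan(R))):
    print(f"user {u}, movie {i}: predicted rating {R_hat[u, i]:.2f}")
```

Keeping only `k` singular values is what reduces the dimensionality: each user and each movie is described by `k` numbers instead of a full row or column.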

### Advantages of an SVD
- Effective at uncovering latent features in the data.
- Reduces dimensionality, making computations more manageable.

### Limitations of an SVD
- Assumes linear relationships in data.
- Sensitive to missing data and outliers.

#### Through this project, I will build a movie recommendation system using SVD


#### Dataset being used: **MovieLens 100K dataset**

- This specific dataset, often referred to as "ml-100k," contains 100,000 ratings from 943 users on 1,682 movies. The data was collected through the MovieLens website during the seven-month period from September 19th, 1997 to April 22nd, 1998.

- **Data Structure**: The dataset includes user ratings that range from 1 to 5. Additionally, it provides demographic information about the users (age, gender, occupation, etc.) and details about the movies (titles, genres).

- **Usage**: It's a standard dataset used for implementing and testing recommender systems. Its size is manageable, making it a popular choice for educational purposes and for initial experimentation with recommendation algorithms.

- **Significance**: The diversity in the dataset, both in terms of users and movie genres, provides a rich ground for analyzing different recommendation strategies, testing algorithms like SVD, and understanding user preferences and behavioral patterns.

This dataset is an excellent starting point for anyone looking to delve into the world of recommender systems and practice with real-world data.
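
A short calculation with the figures quoted above shows how sparse the user-item matrix is, which is why predicting the missing entries is the core of the problem. The variable names are chosen for this sketch only.

```python
# Figures quoted above for ml-100k
n_ratings = 100_000
n_users = 943
n_movies = 1_682

# Fraction of all possible (user, movie) pairs that actually have a rating
possible_pairs = n_users * n_movies
density = n_ratings / possible_pairs

print(f"Possible user-movie pairs: {possible_pairs:,}")
print(f"Density: {density:.2%}")
```

Only about 6% of the possible user-movie pairs carry a rating; the rest are the unknowns the recommender has to estimate.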


Now, I will write some code to explore and understand the dataset.

In [1]:
#Run this only once, you can comment out this part of the code after.
!pip install surprise


In [8]:
# Downgrade NumPy to a version compatible with surprise
!pip install "numpy<2.0"


In [1]:
#Importing necessary modules for this project
import pandas as pd
from surprise import Dataset
from surprise.model_selection import train_test_split

In [2]:
# Install pandas and scikit-surprise. Run this only once; you can comment out this part of the code afterwards.
!pip install pandas scikit-surprise



### Loading the dataset

In [5]:
# Download the dataset

In [6]:
data = Dataset.load_builtin('ml-100k')
df = pd.DataFrame(data.raw_ratings, columns=["user", "item", "rating", "timestamp"])

print(df.head())

  user item  rating  timestamp
0  196  242     3.0  881250949
1  186  302     3.0  891717742
2   22  377     1.0  878887116
3  244   51     2.0  880606923
4  166  346     1.0  886397596


We see the following columns:

* **User ID** (`user`): A unique identifier for the user who provided the rating.

* **Item ID (Movie ID)** (`item`): A unique identifier for the movie that was rated.

* **Rating** (`rating`): The rating given to the movie by the user. In the MovieLens 100K dataset, ratings are on a scale of 1 to 5.

* **Timestamp** (`timestamp`): The time at which the rating was provided, in Unix time format, which counts seconds since the Unix epoch (January 1, 1970).



### Data Preprocessing

In [7]:
# check for missing values

print(df.isnull().sum())

user         0
item         0
rating       0
timestamp    0
dtype: int64


In [8]:
# Convert the Unix timestamps (loaded as strings) to a readable datetime format

df['timestamp'] = pd.to_datetime(df['timestamp'].astype('int64'), unit='s')
print(df.head())

  user item  rating           timestamp
0  196  242     3.0 1997-12-04 15:55:49
1  186  302     3.0 1998-04-04 19:22:22
2   22  377     1.0 1997-11-07 07:18:36
3  244   51     2.0 1997-11-27 05:02:03
4  166  346     1.0 1998-02-02 05:33:16


### Splitting the data into Train and Test sets

In [10]:
# Split the data into a training set and a test set
trainset, testset = train_test_split(data, test_size=0.20)

# Display the number of users and items in the training set
print(f"Training set :")
print(f"Number of users: {trainset.n_users}")
print(f"Number of items: {trainset.n_items}")

# Display the first few elements of the test set
print(f"\nTest set :")
print(testset[:5])

Training set :
Number of users: 943
Number of items: 1648

Test set :
[('821', '70', 4.0), ('128', '118', 5.0), ('592', '236', 3.0), ('555', '748', 4.0), ('229', '286', 4.0)]
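
Conceptually, the split above shuffles the ratings and holds out a fraction of them. The sketch below mimics that idea in plain Python on made-up `(user, item, rating)` tuples. It is not Surprise's implementation, and the helper `simple_split` is hypothetical. It also hints at why the training set can report fewer items than the full dataset (1648 here versus 1,682 movies overall): an item whose ratings all land in the test set never appears in training.

```python
import random

def simple_split(ratings, test_size=0.20, seed=0):
    """Shuffle the ratings and hold out a fraction of them for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up ratings: (user id, item id, rating) for 10 users and 5 items
ratings = [(str(u), str(i), float((u + i) % 5 + 1))
           for u in range(1, 11) for i in range(1, 6)]

train, test = simple_split(ratings)
print(len(train), "training ratings,", len(test), "test ratings")
print("Distinct items in training:", len({item for _, item, _ in train}))
```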


### Hyperparameter Tuning in SVD
Hyperparameter tuning is a critical step in optimizing the performance of an SVD model. The goal is to find the best combination of parameters that results in the most accurate predictions or lowest error rates.

#### Hyperparameters we will be tuning in this project

1. **`n_factors`**:
   - Represents the number of latent factors (or features) to extract from the dataset.
   - The values `[50, 100, 150]` are chosen to test the model's performance with a varying number of factors. A higher number of factors can capture more complex patterns but may lead to overfitting and increased computation time.

2. **`n_epochs`**:
   - Refers to the number of iterations over the entire dataset during training.
   - The values `[20, 30]` provide a range to evaluate whether more iterations improve model performance or lead to overtraining.

3. **`lr_all`** (Learning Rate):
   - Determines the step size at each iteration while moving toward a minimum of the loss function.
   - The values `[0.005, 0.010]` are chosen to test how fast the model learns. A smaller learning rate may lead to more precise convergence but requires more epochs.

4. **`reg_all`** (Regularization Term):
   - Helps prevent overfitting by penalizing larger model parameters.
   - The values `[0.02, 0.1]` offer a range to assess the impact of regularization on model performance. Higher regularization can reduce overfitting but may lead to underfitting.
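
To make the roles of `n_factors`, `lr_all`, and `reg_all` concrete, here is a minimal sketch of a single stochastic-gradient update of the kind used in Funk-style matrix factorization, where a rating is predicted as `mu + b_u + b_i + p_u · q_i`. All numbers are made up, and this is a simplified illustration rather than Surprise's actual code. `n_epochs` would be the number of times such updates are repeated over every rating in the training set.

```python
# Hyperparameters (same names as in the grid; values are illustrative)
n_factors = 2
lr_all = 0.005
reg_all = 0.02

mu = 3.5                 # global mean rating (made up)
b_u, b_i = 0.1, -0.2     # user and item biases
p_u = [0.3, -0.1]        # user latent factors (length n_factors)
q_i = [0.2, 0.4]         # item latent factors (length n_factors)
r_ui = 5.0               # the observed rating being learned from

def predict(mu, b_u, b_i, p_u, q_i):
    """Baseline (mean + biases) plus the dot product of the latent factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

err_before = r_ui - predict(mu, b_u, b_i, p_u, q_i)

# One gradient step: move along the error, shrink by the regularization term
b_u_new = b_u + lr_all * (err_before - reg_all * b_u)
b_i_new = b_i + lr_all * (err_before - reg_all * b_i)
p_u_new = [p + lr_all * (err_before * q - reg_all * p) for p, q in zip(p_u, q_i)]
q_i_new = [q + lr_all * (err_before * p - reg_all * q) for p, q in zip(p_u, q_i)]

err_after = r_ui - predict(mu, b_u_new, b_i_new, p_u_new, q_i_new)
print(f"error before: {err_before:.4f}, after one step: {err_after:.4f}")
```

A larger `lr_all` takes bigger steps per update, while a larger `reg_all` pulls the parameters more strongly toward zero.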

In [11]:
# Define a grid of SVD hyperparameters explained above for tuning
param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.010],
    'reg_all': [0.02, 0.1]
}


Now, I will train the model with the following parameters:

1. **`SVD`**:
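   - This is the recommendation algorithm being tuned. In Surprise, `SVD` is the matrix-factorization algorithm popularized by Simon Funk during the Netflix Prize. It learns user and item factors by stochastic gradient descent on the observed ratings only, rather than computing the classical decomposition A = UΣV* of a complete matrix.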

2. **`param_grid`**:
   - This is a dictionary where keys are hyperparameter names, and values are lists of parameter settings to try as values. It defines the grid of parameters that will be tested.
   - Example: If `param_grid` is `{'n_factors': [50, 100], 'lr_all': [0.005, 0.01]}`, GridSearchCV will evaluate the SVD algorithm for all combinations of `n_factors` and `lr_all` from these lists.

3. **`measures=['RMSE', 'MAE']`**:
   - These are the performance metrics used to evaluate the algorithm.
   - `RMSE` stands for Root Mean Square Error, and `MAE` stands for Mean Absolute Error. Both are common metrics for evaluating the accuracy of prediction algorithms, with lower values indicating better performance.

4. **`cv=3`**:
   - This specifies the number of folds for cross-validation.
   - In this context, `cv=3` means that a 3-fold cross-validation will be used. The dataset will be split into three parts: in each iteration, two parts will be used for training, and one part will be used for testing. This process repeats three times, each time with a different part used for testing.
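
To get a feel for the cost of the search, the sketch below enumerates the grid with `itertools.product` and counts how many model fits the settings described above imply. It only counts combinations and does not train anything.

```python
from itertools import product

param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.010],
    'reg_all': [0.02, 0.1],
}
cv = 3

# Every combination of one value per hyperparameter
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

print("Combinations:", len(combinations))
print("Model fits (combinations x folds):", len(combinations) * cv)
print("First combination:", combinations[0])
```

With 3 × 2 × 2 × 2 = 24 combinations and 3 folds each, the grid search trains 72 models, which is why it is the slowest cell in the notebook.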

In [12]:
from surprise.model_selection import GridSearchCV, train_test_split
from surprise import SVD, Dataset, accuracy

# Perform grid search with cross-validation to find the best hyperparameters for our model
gs = GridSearchCV(SVD, param_grid, measures=['RMSE', 'MAE'], cv=3)
gs.fit(data)

In [13]:
# Best score and parameters
print(f"Best RMSE: {gs.best_score['rmse']}")
print(f"Best parameters: {gs.best_params['rmse']}")

Best RMSE: 0.9205828491341563
Best parameters: {'n_factors': 150, 'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.1}


In [15]:
# Retrieve the (not yet fitted) SVD instance configured with the best RMSE parameters
algo = gs.best_estimator['rmse']

In [16]:
# Split the data again, this time holding out 25% for testing
trainset, testset = train_test_split(data, test_size=0.25)
# Fit the best model on the training set
algo.fit(trainset)
# Make predictions on the test set
predictions = algo.test(testset)

# Calculate and print RMSE on the predictions made
accuracy.rmse(predictions)

RMSE: 0.9183


0.9182764275230733

In [17]:
#Predict rating for a user and item
user_id = '196'  # replace with a specific user ID
item_id = '302'  # replace with a specific item (movie) ID
predicted_rating = algo.predict(user_id, item_id)
print(f"Predicted rating for user {user_id} and item {item_id}: {predicted_rating.est}")

Predicted rating for user 196 and item 302: 4.184095163685722


In [18]:
# To inspect the predictions in detail, let's print the first 10 predictions made by the model
for idx, prediction in enumerate(predictions[:10]):
    print(f'Prediction {idx}: User {prediction.uid} and item {prediction.iid} has true rating {prediction.r_ui}, and the predicted rating is {prediction.est}')


Prediction 0: User 168 and item 118 has true rating 4.0, and the predicted rating is 3.829540409114962
Prediction 1: User 342 and item 134 has true rating 4.0, and the predicted rating is 4.312825773828511
Prediction 2: User 807 and item 418 has true rating 4.0, and the predicted rating is 4.056131361604987
Prediction 3: User 22 and item 433 has true rating 3.0, and the predicted rating is 4.077679661786188
Prediction 4: User 405 and item 1113 has true rating 1.0, and the predicted rating is 2.4968975598327474
Prediction 5: User 314 and item 120 has true rating 3.0, and the predicted rating is 2.6957245175505404
Prediction 6: User 405 and item 1229 has true rating 1.0, and the predicted rating is 1.2334730481870217
Prediction 7: User 537 and item 346 has true rating 3.0, and the predicted rating is 3.3312579299203025
Prediction 8: User 899 and item 64 has true rating 4.0, and the predicted rating is 4.2146191350659965
Prediction 9: User 844 and item 45 has true rating 4.0, and the pred

Let us round the predicted values so that they fall within the rating categories [1.0, 2.0, 3.0, 4.0, 5.0].

In [21]:
# Round prediction.est with Python's built-in round() before printing

for idx, prediction in enumerate(predictions[:10]):
    print(f'Prediction {idx}: User {prediction.uid} and item {prediction.iid} has true rating {prediction.r_ui}, and the predicted rating is {round(prediction.est)}')


Prediction 0: User 168 and item 118 has true rating 4.0, and the predicted rating is 4
Prediction 1: User 342 and item 134 has true rating 4.0, and the predicted rating is 4
Prediction 2: User 807 and item 418 has true rating 4.0, and the predicted rating is 4
Prediction 3: User 22 and item 433 has true rating 3.0, and the predicted rating is 4
Prediction 4: User 405 and item 1113 has true rating 1.0, and the predicted rating is 2
Prediction 5: User 314 and item 120 has true rating 3.0, and the predicted rating is 3
Prediction 6: User 405 and item 1229 has true rating 1.0, and the predicted rating is 1
Prediction 7: User 537 and item 346 has true rating 3.0, and the predicted rating is 3
Prediction 8: User 899 and item 64 has true rating 4.0, and the predicted rating is 4
Prediction 9: User 844 and item 45 has true rating 4.0, and the predicted rating is 4
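
One detail worth knowing: Python's built-in `round` rounds exact halves to the nearest even integer, so `round(2.5)` gives `2`. The hypothetical helper `to_star_rating` below is one possible alternative that rounds halves up and, as a safeguard, clamps the result to the 1 to 5 scale. It is a sketch for illustration and not part of Surprise.

```python
import math

def to_star_rating(est, lowest=1, highest=5):
    """Round half up, then clamp the result to the rating scale."""
    rounded = math.floor(est + 0.5)
    return max(lowest, min(highest, rounded))

# Compare the built-in rounding with the helper on a few illustrative values
for est in (2.5, 3.5, 0.4, 4.18):
    print(est, "-> built-in round:", round(est), "| to_star_rating:", to_star_rating(est))
```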


Metrics to Evaluate:
1. RMSE (Root Mean Squared Error)   
Measures the square root of the average squared differences between predicted and actual ratings. Lower is better.

2. MAE (Mean Absolute Error)   
Measures the average of the absolute differences between predicted and actual ratings. Also, lower is better.
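
Both metrics can be computed by hand. The sketch below does so for the first nine predictions printed earlier, with the estimates rounded to two decimals, so the numbers it prints describe only this tiny sample and not the full test set.

```python
import math

# True ratings and predicted ratings (rounded) for the first nine printed predictions
true_ratings = [4.0, 4.0, 4.0, 3.0, 1.0, 3.0, 1.0, 3.0, 4.0]
estimates = [3.83, 4.31, 4.06, 4.08, 2.50, 2.70, 1.23, 3.33, 4.21]

errors = [est - true for est, true in zip(estimates, true_ratings)]

# RMSE: square the errors, average them, then take the square root
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE on this sample: {rmse:.4f}")
print(f"MAE on this sample:  {mae:.4f}")
```

Because squaring amplifies large misses, RMSE is never smaller than MAE on the same predictions.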

In [22]:
from surprise import accuracy

# Calculate and print RMSE
rmse = accuracy.rmse(predictions)

# Calculate and print MAE
mae = accuracy.mae(predictions)


RMSE: 0.9183
MAE:  0.7279


RMSE:
* RMSE is the square root of the mean squared error, so the typical prediction error is about 0.92 stars on a scale of 1 to 5, with large misses weighted more heavily than small ones.

* Because RMSE penalizes larger errors more heavily than MAE, a value below 1.0 is a reasonable result for a movie recommendation task.

* It suggests the model is generally predicting ratings fairly close to what users actually gave, but occasionally makes bigger errors.

MAE:
* This means the average absolute difference between predicted and true ratings is about 0.73.

* Since MAE treats all errors equally, this tells us the predictions are off by about 0.7 stars on average. Individual predictions can be closer or further off: in the sample printed above, one true rating of 1.0 was predicted as roughly 2.5.