# IT - 550 Information Retrieval Assignment - 9
### Student ID - 202011032
### User-based and Item-based Collaborative Filtering Recommendation Systems <br/>Dataset - MovieLens 100K Dataset

## Importing the required classes and functions from the `scikit-surprise` package and loading the MovieLens 100K Dataset

In [1]:
from surprise import Dataset
# from surprise import KNNBasic
from surprise import KNNBaseline
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate

In [2]:
data = Dataset.load_builtin(name='ml-100k')

In [3]:
# Splitting the data into 80:20 ratio for training and testing
trainset, testset = train_test_split(data, test_size=0.2)

## User-based Collaborative Filtering
- Using the KNN algorithm with baseline ratings (`KNNBaseline`) and up to 40 nearest neighbours (users) for prediction
- Using cosine similarity as the similarity measure
- Using Alternating Least Squares (ALS), the default in Surprise, for baseline estimation
- Cross-validating the approach with 5 folds on the full dataset
- Checking the accuracy of the approach using RMSE on the 20% test set
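
The cosine similarity named in the list above can be illustrated with a minimal pure-Python sketch. It assumes the similarity is computed only over the items both users have rated; the function name `cosine_sim` and the ratings for `alice` and `bob` are hypothetical and are not taken from Surprise or from MovieLens.

```python
from math import sqrt


def cosine_sim(ratings_a, ratings_b):
    """Cosine similarity over the items that both users have rated.

    ratings_a, ratings_b: dicts mapping item id -> rating.
    Returns 0.0 when the two users share no rated items.
    """
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


# Hypothetical ratings on a 1-5 scale (illustration only)
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 3, "m4": 2}

sim = cosine_sim(alice, bob)  # only m1 and m2 are shared
print(f"similarity = {sim:.4f}")
```

Because ratings are always positive, this measure stays close to 1 for most user pairs, which is one reason the predictions are built on deviations from baseline ratings rather than on raw ratings.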

In [4]:
# Surprise expects the key "user_based" (with an underscore)
sim_options = {"name": "cosine", "user_based": True}
algo_u = KNNBaseline(k=40, sim_options=sim_options)

In [5]:
# Run 5-fold cross-validation and print results
cross_validate(algo_u, data, measures=['RMSE'], cv=5, verbose=True)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9361  0.9317  0.9391  0.9335  0.9302  0.9341  0.0032  
Fit time          2.17    2.33    2.27    2.21    2.27    2.25    0.05    
Test time         4.50    4.53    4.63    4.48    4.30    4.49    0.11    


{'test_rmse': array([0.93606955, 0.93169392, 0.93910061, 0.93353212, 0.93017032]),
 'fit_time': (2.1723716259002686,
  2.3289098739624023,
  2.2706432342529297,
  2.2128779888153076,
  2.270914077758789),
 'test_time': (4.498136758804321,
  4.533735513687134,
  4.628356218338013,
  4.476269483566284,
  4.302533388137817)}
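
As a sanity check on the summary row of the table above, the mean and spread can be recomputed from the per-fold `test_rmse` values. This is a small sketch using only the standard library; it assumes the `Std` column is a population standard deviation, which is consistent with the printed value.

```python
from statistics import fmean, pstdev

# Per-fold RMSE values copied from the 'test_rmse' array above
fold_rmse = [0.93606955, 0.93169392, 0.93910061, 0.93353212, 0.93017032]

mean_rmse = fmean(fold_rmse)   # arithmetic mean over the 5 folds
std_rmse = pstdev(fold_rmse)   # population standard deviation (divides by n)

print(f"RMSE = {mean_rmse:.4f} ± {std_rmse:.4f}")
```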

In [6]:
# Train the algorithm on the trainset, and predict ratings for the testset
preds_u = algo_u.fit(trainset).test(testset)

# Compute RMSE to check accuracy
accuracy.rmse(preds_u)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9343


0.9342813964380853
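
For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below shows that calculation on a few hypothetical (true, estimated) pairs; the helper `rmse` is illustrative and is not Surprise's `accuracy.rmse`.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one prediction is required")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predictions (illustration only)
example = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5)]
value = rmse(example)
print(f"RMSE = {value:.4f}")
```

An RMSE of about 0.93 on a 1 to 5 rating scale therefore means the typical prediction is off by a little under one star.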

## Item-based Collaborative Filtering
- Using the KNN algorithm with baseline ratings (`KNNBaseline`) and up to 40 nearest neighbours (items) for prediction
- Using cosine similarity as the similarity measure
- Using Alternating Least Squares (ALS), the default in Surprise, for baseline estimation
- Cross-validating the approach with 5 folds on the full dataset
- Checking the accuracy of the approach using RMSE on the 20% test set
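
The neighbourhood prediction described in the list above can be sketched as a baseline rating plus a similarity-weighted average of the neighbours' deviations from their own baselines. This is a simplified illustration of the general technique, not Surprise's implementation; the function `knn_baseline_predict`, its argument layout, and all numbers are hypothetical.

```python
import heapq


def knn_baseline_predict(baseline_target, neighbours, k=40):
    """Estimate a rating for one (user, item) pair.

    baseline_target: baseline estimate for the target (user, item) pair.
    neighbours: list of (similarity, rating, baseline) tuples, one per item
        the user has already rated (item-based case).
    k: maximum number of most-similar neighbours to use.
    """
    top_k = heapq.nlargest(k, neighbours, key=lambda n: n[0])
    weighted_dev = 0.0
    sim_total = 0.0
    for sim, rating, baseline in top_k:
        if sim > 0:  # ignore neighbours with no positive similarity
            weighted_dev += sim * (rating - baseline)
            sim_total += sim
    if sim_total == 0:
        return baseline_target  # no usable neighbours: fall back to baseline
    return baseline_target + weighted_dev / sim_total


# Hypothetical neighbours: (similarity, user's rating, baseline for that item)
neighbours = [(0.9, 4.0, 3.5), (0.5, 2.0, 3.0), (0.1, 5.0, 4.0)]
pred = knn_baseline_predict(3.6, neighbours, k=2)
print(f"estimated rating = {pred:.4f}")
```

For the user-based case the same calculation applies with the neighbours being other users who rated the target item.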

In [7]:
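# Surprise expects "user_based" (underscore). The outputs recorded below came from a run with the
# key "user-based", which is ignored, so they reflect the default user-based similarity.
sim_options = {"name": "cosine", "user_based": False}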
algo_i = KNNBaseline(k=40, sim_options=sim_options)

In [8]:
# Run 5-fold cross-validation and print results
cross_validate(algo_i, data, measures=['RMSE'], cv=5, verbose=True)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9371  0.9340  0.9396  0.9277  0.9307  0.9338  0.0043  
Fit time          2.07    2.00    2.03    1.99    2.00    2.02    0.03    
Test time         4.14    4.89    4.30    4.52    4.36    4.44    0.25    


{'test_rmse': array([0.9370892 , 0.93404609, 0.93960291, 0.92768457, 0.9307076 ]),
 'fit_time': (2.071254014968872,
  1.997800588607788,
  2.030951976776123,
  1.9946470260620117,
  1.9982123374938965),
 'test_time': (4.1448023319244385,
  4.892037391662598,
  4.29661226272583,
  4.52424955368042,
  4.36464262008667)}

In [9]:
# Train the algorithm on the trainset, and predict ratings for the testset
preds_i = algo_i.fit(trainset).test(testset)

# Compute RMSE to check accuracy
accuracy.rmse(preds_i)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9343


0.9342813964380853

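## Results
1. **User-based collaborative filtering**:
    - 5-fold cross-validation RMSE on the full dataset = 0.9341 ± 0.0032
    - Test set RMSE = 0.9343
2. **Item-based collaborative filtering**:
    - 5-fold cross-validation RMSE on the full dataset = 0.9338 ± 0.0043
    - Test set RMSE = 0.9343

The item-based figures above were recorded with the `sim_options` key spelled `"user-based"`. Surprise reads `user_based` and defaults it to `True`, so that run used user-based similarity, which is why both test set RMSE values are identical (0.9342813964380853). The item-based numbers need to be regenerated with `"user_based": False` before the two approaches can be compared.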