In [1]:
import pandas as pd
from surprise import SVD
from surprise import KNNBaseline
from surprise.model_selection import train_test_split, LeaveOneOut
from surprise import Reader
from surprise import Dataset
from surprise import accuracy
from collections import defaultdict

* **Surprise** is a Python library for building and analyzing recommender systems that deal with explicit rating data.

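* **SVD (Singular Value Decomposition):** SVD is a matrix factorization technique commonly used in recommendation systems. It decomposes a matrix into three matrices whose product approximates the original matrix when only the largest singular values are kept. In recommendation systems, the original matrix is the user-item interaction matrix: rows represent users, columns represent items, and the values are ratings or preferences. Note that Surprise's `SVD` class does not compute this three-matrix decomposition directly, because most ratings are missing; it learns user and item factor vectors (plus bias terms) by stochastic gradient descent on the known ratings, in the style popularized by Simon Funk during the Netflix Prize.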

The three matrices produced by SVD are:
* **U (User matrix):** Represents the relationship between users and latent factors.

* **Sigma (Singular Values):** Contains the weights of the latent factors.

* **Vt (Item matrix):** Represents the relationship between items and latent factors.


* By decomposing the user-item interaction matrix into these matrices, SVD enables the model to learn latent factors that capture underlying patterns in the data, such as user preferences and item characteristics. This learned representation can then be used to make predictions for missing values (unrated items) or to recommend items to users based on their preferences.
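
As a rough illustration of how learned latent factors turn into a predicted rating, the sketch below computes the biased matrix-factorization estimate `mu + b_u + b_i + q_i . p_u`, which is the prediction form documented for Surprise's `SVD`. The function name `predict_rating` and every numeric value are made up for illustration; a fitted model would learn the biases and factor vectors from the training ratings.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u."""
    # Dot product of the user's and the item's latent-factor vectors
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction

# Hypothetical values: two latent factors per user and per item
estimate = predict_rating(
    global_mean=3.5,
    user_bias=0.2,               # this user rates slightly above average
    item_bias=-0.1,              # this movie is rated slightly below average
    user_factors=[0.5, -1.0],
    item_factors=[1.0, -0.5],
)
print(round(estimate, 2))
```

The dot product is large when the user's and the item's factor vectors point in the same direction, which is how the model encodes "this user likes this kind of movie".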

* **LeaveOneOut** is a **cross-validation strategy**. Specifically, it is a technique for **evaluating the performance of a recommendation system algorithm.**

Here's how LeaveOneOut cross-validation works:

**1. Data Splitting:** For each user in the dataset, one of their interactions (rating or preference) is held out as the test set, and the remaining interactions are used as the training set.

**2. Model Training and Testing:** The recommendation system algorithm is trained on the training set, which includes all interactions except one for each user. Then, the held-out interaction for each user is used as a test case. The algorithm predicts the rating or preference for the held-out item, and this prediction is compared against the actual rating or preference.

**3. Evaluation:** The evaluation metric (such as RMSE, MAE, etc.) is calculated based on the differences between the predicted ratings and the actual ratings for all held-out interactions.

**4. Aggregation:** The evaluation results from all users are aggregated to obtain the overall performance of the recommendation algorithm.

* Because each user contributes exactly one held-out rating, LeaveOneOut suits top-N metrics such as hit rate, which ask whether the held-out item appears in that user's recommendation list. In Surprise, `n_splits` sets how many such splits are generated (5 by default), each leaving out a randomly chosen rating per user; averaging over several splits gives a more stable estimate of how well the algorithm generalizes to unseen data.
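
The data-splitting step can be sketched in plain Python as follows. This is only an illustration of the idea, not Surprise's implementation; the function name `leave_one_out_split` and the toy ratings are assumptions made for the example.

```python
import random
from collections import defaultdict

def leave_one_out_split(ratings, seed=0):
    """Hold out one randomly chosen rating per user.

    `ratings` is a list of (user, item, rating) tuples.
    Returns (train, test), where test has exactly one rating per user.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for user, rows in by_user.items():
        held_out = rng.randrange(len(rows))  # index of the rating to hold out
        for idx, row in enumerate(rows):
            (test if idx == held_out else train).append(row)
    return train, test

# Toy data: u1 rated three movies, u2 rated two
toy_ratings = [
    ("u1", "m1", 4.0), ("u1", "m2", 3.0), ("u1", "m3", 5.0),
    ("u2", "m1", 2.0), ("u2", "m3", 4.5),
]
toy_train, toy_test = leave_one_out_split(toy_ratings)
print(len(toy_train), len(toy_test))
```

Whichever ratings are picked, the test set ends up with one rating per user and the training set with all the rest.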

* The **Reader** class tells Surprise how to parse a ratings dataset: the line format and separator for a file, and the rating scale. When the data is loaded from a pandas DataFrame, as in this notebook, only `rating_scale` needs to be specified.

In [2]:
movies = pd.read_csv('D://New folder//ML//Completed//movies.csv')
ratings = pd.read_csv('D://New folder//ML//Completed//ratings.csv')
df = pd.merge(movies, ratings, on='movieId', how='inner')
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2,5.0,859046895
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,1303501039
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,5.0,858610933
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,4.0,850815810
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,14,4.0,851766286


In [3]:
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df[['userId','movieId','rating']],reader)
trainset, testset = train_test_split(data, test_size=0.25, random_state=0)

In [4]:
algo = SVD(random_state=0)
algo.fit(trainset)
predictions = algo.test(testset)

In [5]:
def MAE(predictions):
    return accuracy.mae(predictions, verbose=False)

In [6]:
def RMSE(predictions):
    return accuracy.rmse(predictions, verbose=False)

In [7]:
print("RMSE:", RMSE(predictions))
print("MAE:", MAE(predictions))

RMSE: 0.8750231268479528
MAE: 0.6735292043111697
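
For reference, the two metrics can be computed by hand as in the sketch below. The helper names `rmse` and `mae` and the toy (actual, estimated) pairs are assumptions for illustration; the notebook itself relies on `accuracy.rmse` and `accuracy.mae`.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    return math.sqrt(sum((a - e) ** 2 for a, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, estimated) rating pairs."""
    return sum(abs(a - e) for a, e in pairs) / len(pairs)

# Toy pairs with errors of 0.5, -1.0 and 0.0
toy_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 5.0)]
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
print("RMSE:", toy_rmse)
print("MAE:", toy_mae)
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does; that is why RMSE is the larger of the two values whenever the errors are not all the same size.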


In [8]:
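def GetTopN(predictions, n=10, minimumRating=4.0):
    """Return {userID: [(movieID, estimatedRating), ...]} with each user's n highest estimates."""
    topN = defaultdict(list)
    # Keep only predictions whose estimated rating reaches the threshold
    for userID, movieID, actualRating, estimatedRating, _ in predictions:
        if estimatedRating >= minimumRating:
            topN[int(userID)].append((int(movieID), estimatedRating))

    # Sort each user's candidates by estimated rating, highest first, and keep n
    for userID, ratings in topN.items():
        ratings.sort(key=lambda x: x[1], reverse=True)
        topN[int(userID)] = ratings[:n]

    return topN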

In [9]:
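# One leave-one-out split: each user has exactly one rating in the test set
loocv = LeaveOneOut(n_splits=1, random_state=1)

for trainset, testset in loocv.split(data):
    algo.fit(trainset)
    # Predictions for the held-out rating of each user
    leftoutpredictions = algo.test(testset)
    # Every (user, item) pair the user has not rated in the training set
    bigTestset = trainset.build_anti_testset()
    allpredictions = algo.test(bigTestset)
    # Top 10 recommendations per user, from estimated ratings of 4.0 or higher
    topNpredicted = GetTopN(allpredictions, n=10)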

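* **bigTestset = trainset.build_anti_testset():** This line builds a "big" test set of every user/item pair in which both the user and the item appear in the training set but the user has not rated the item. Pairs that already have a rating in the training set are excluded, so the result is the complement of the training ratings rather than all possible combinations. Each pair is given the global mean rating as a placeholder for the unknown true rating. Because each user's left-out rating is absent from the training set, its user/item pair belongs to this set, which is what makes a hit possible.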
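
The idea behind the anti-testset can be sketched in plain Python. This is an illustration under stated assumptions, not Surprise's code: the function name `build_anti_testset_sketch` and the toy ratings are hypothetical, and the placeholder rating mirrors the default behaviour of filling with the global mean.

```python
def build_anti_testset_sketch(train_ratings):
    """List every (user, item) pair that has no rating in the training data.

    `train_ratings` is a list of (user, item, rating) tuples. Each returned
    tuple carries the global mean as a placeholder rating.
    """
    users = {u for u, _, _ in train_ratings}
    items = {i for _, i, _ in train_ratings}
    rated = {(u, i) for u, i, _ in train_ratings}
    global_mean = sum(r for _, _, r in train_ratings) / len(train_ratings)
    return [
        (u, i, global_mean)
        for u in sorted(users)
        for i in sorted(items)
        if (u, i) not in rated
    ]

# Toy training data: 2 users x 3 movies = 6 possible pairs, 4 of them rated
toy_train = [("u1", "m1", 4.0), ("u1", "m2", 3.0), ("u2", "m1", 2.0), ("u2", "m3", 5.0)]
anti = build_anti_testset_sketch(toy_train)
print(anti)  # the two unrated pairs: (u1, m3) and (u2, m2)
```

On a real dataset this set is far larger than the training set, since most users have rated only a small fraction of the movies, so predicting on it is the slowest step of the evaluation.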

In [10]:
topNpredicted

defaultdict(list,
            {2: [(1217, 4.8072137877866),
              (527, 4.7980656800206285),
              (589, 4.764267142336458),
              (1104, 4.757772754289017),
              (44555, 4.73665906485584),
              (1207, 4.732952535359771),
              (2324, 4.731909923514756),
              (1212, 4.724470949091673),
              (858, 4.718487411916641),
              (923, 4.716510852816198)],
             5: [(913, 4.578859225315407),
              (1207, 4.45649635989984),
              (318, 4.404778671822453),
              (527, 4.393190051110049),
              (1949, 4.382736931879065),
              (1172, 4.363784136263369),
              (2186, 4.314015777155085),
              (1212, 4.305215863672745),
              (1248, 4.295311142197843),
              (1254, 4.2899632380959645)],
             8: [(527, 4.933659578580601),
              (56782, 4.773398138158881),
              (1207, 4.771750609956454),
              (356, 4.76529174666226

In [11]:
def HitRate(topNpredicted, leftoutpredictions):
    """Fraction of users whose left-out movie appears in their top-N list."""
    hits = 0
    total = 0

    for leftout in leftoutpredictions:
        userid = leftout[0]
        leftoutmovieid = leftout[1]

        # .get avoids adding empty entries to the defaultdict for users with no top-N list
        topN = topNpredicted.get(int(userid), [])
        if any(int(movieid) == int(leftoutmovieid) for movieid, _ in topN):
            hits += 1

        total += 1

    return hits / total

In [12]:
print("Hit Rate:", HitRate(topNpredicted,leftoutpredictions))

Hit Rate: 0.020958083832335328
