
In [1]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357279 sha256=41be15264072d517465e8a22c8edee04e14dee8923d35266afd0611de700c6d0
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Succ

Step 1: Data Acquisition
We'll begin by acquiring the ml-100k dataset. It is published on the MovieLens website, but for simplicity we load it through the Surprise library (installed above as scikit-surprise), which downloads the files on first use. Surprise is a Python library for building and evaluating recommendation systems.

In [2]:
import pandas as pd

from surprise import Dataset

# Load the MovieLens 100k dataset (downloaded on first use)
data = Dataset.load_builtin('ml-100k')

# data.raw_ratings is a list of (user, item, rating, timestamp) tuples,
# so it can be passed to pandas directly; no Reader is needed here
df = pd.DataFrame(data.raw_ratings, columns=['user', 'item', 'rating', 'timestamp'])

# Display the first few rows of the dataset
df.head()


Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


   user  item  rating  timestamp
0   196   242     3.0  881250949
1   186   302     3.0  891717742
2    22   377     1.0  878887116
3   244    51     2.0  880606923
4   166   346     1.0  886397596


Step 2: Data Preprocessing
Next, we'll clean the data. For this dataset, we can focus on ensuring there are no missing values and that all entries are valid.

In [3]:
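# Count missing values per column. The result is kept in a variable because
# a notebook cell only displays its last expression.
missing_counts = df.isnull().sum()

# Drop any rows with missing values
df = df.dropna()

# ml-100k ratings are whole stars (1 to 5), so casting to int loses nothing
df['rating'] = df['rating'].astype(int)

# Show the cleaned dataset
df.head()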


   user  item  rating  timestamp
0   196   242       3  881250949
1   186   302       3  891717742
2    22   377       1  878887116
3   244    51       2  880606923
4   166   346       1  886397596
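
The cell above only covers missing values. A minimal sketch of what checking that "all entries are valid" could look like is shown here, assuming the 1 to 5 star scale of ml-100k and the column names used in this notebook. The helper name `validate_ratings` and the toy frame are illustrative, not part of the original notebook.

```python
import pandas as pd

def validate_ratings(df, min_rating=1, max_rating=5):
    """Return simple validity counts for a ratings DataFrame (hypothetical helper)."""
    return {
        # Ratings outside the expected star scale
        "out_of_range": int((~df["rating"].between(min_rating, max_rating)).sum()),
        # The same user rating the same item more than once
        "duplicate_pairs": int(df.duplicated(subset=["user", "item"]).sum()),
        # Missing values anywhere in the frame
        "missing": int(df.isnull().sum().sum()),
    }

# Toy frame with one out-of-range rating and one repeated (user, item) pair
toy = pd.DataFrame({
    "user": ["196", "186", "186", "22"],
    "item": ["242", "302", "302", "377"],
    "rating": [3, 3, 4, 7],
})
report = validate_ratings(toy)
print(report)
```

On the real data, the same function could be called as `validate_ratings(df)`; all three counts should be zero if the dataset is clean.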


Step 3: Feature Engineering
For this recommendation system, we will use collaborative filtering. Collaborative filtering makes recommendations based on the user's past behavior and the behavior of similar users. We can use a well-known algorithm for this, such as K-Nearest Neighbors (KNN) or matrix factorization.

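We'll proceed with matrix factorization using Surprise's Singular Value Decomposition (SVD) algorithm, which learns a short vector of latent factors for each user and each item and predicts a rating from their dot product plus bias terms.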
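
To make the idea concrete, here is a minimal pure-Python sketch of matrix factorization trained with stochastic gradient descent. It uses the prediction rule Surprise documents for its SVD algorithm (global mean + user bias + item bias + dot product of latent factors), but it is not Surprise's implementation. The function names, the toy ratings, and the hyperparameter values are all assumptions chosen for illustration.

```python
import math
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Learn r_hat(u, i) = mu + b_u + b_i + p_u . q_i with SGD; return a predict function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_u = {u: 0.0 for u in users}                      # user biases
    b_i = {i: 0.0 for i in items}                      # item biases
    # Small random latent factors for every user and item
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient steps on the regularized squared error
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

def training_rmse(predict, ratings):
    """RMSE of a predict(u, i) function on a list of (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Toy (user, item, rating) triples, invented for this example
toy_ratings = [
    ("u1", "i1", 5), ("u1", "i2", 4), ("u1", "i3", 1),
    ("u2", "i1", 4), ("u2", "i2", 5), ("u2", "i3", 2),
    ("u3", "i1", 1), ("u3", "i2", 2), ("u3", "i3", 5),
]
predict = train_mf(toy_ratings)
mean_rating = sum(r for _, _, r in toy_ratings) / len(toy_ratings)
baseline_rmse = training_rmse(lambda u, i: mean_rating, toy_ratings)
model_rmse = training_rmse(predict, toy_ratings)
print(f"baseline RMSE: {baseline_rmse:.3f}, factorization RMSE: {model_rmse:.3f}")
```

The comparison is on the training data only, so it shows that the factors capture structure the global mean cannot; it says nothing about generalization, which is what the train/test split below is for.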

In [4]:
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

# Load the MovieLens 100k dataset
data = Dataset.load_builtin('ml-100k')

# Split the dataset into training and testing sets (80% training, 20% testing)
trainset, testset = train_test_split(data, test_size=0.2)

# Initialize the SVD (Singular Value Decomposition) model
svd = SVD()

# Train the model using the training set
svd.fit(trainset)

# Make predictions on the test set
predictions = svd.test(testset)

# Evaluate the model using RMSE (Root Mean Squared Error);
# verbose=False stops accuracy.rmse from printing its own rounded copy
rmse = accuracy.rmse(predictions, verbose=False)
print(f"RMSE: {rmse}")


RMSE: 0.9283253476063263


In this code:

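train_test_split divides the ratings into a training set and a test set at random, so the exact RMSE varies from run to run.
The SVD class implements the matrix factorization model; fit() trains it and test() predicts a rating for every entry in the test set.
accuracy.rmse() computes the Root Mean Squared Error (RMSE) of those predictions to measure prediction quality.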


Step 4: Model Evaluation
The performance of the model is evaluated using the RMSE metric: the square root of the mean squared difference between predicted and actual ratings, expressed in the same units as the ratings (stars). A lower RMSE indicates a better model.
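
As a worked example of the metric, RMSE over a test set T can be written as sqrt((1/|T|) * sum over (u, i) in T of (r_ui - r̂_ui)^2). The short sketch below computes it by hand for four invented (true, predicted) pairs; the function name `rmse_from_pairs` is illustrative and is not part of Surprise.

```python
import math

def rmse_from_pairs(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    squared_errors = [(true_r - est) ** 2 for true_r, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Squared errors are 1, 4, 0 and 4, so the mean is 2.25 and the RMSE is 1.5
pairs = [(4, 3), (2, 4), (5, 5), (3, 1)]
score = rmse_from_pairs(pairs)
print(score)  # 1.5
```

Because the errors are squared before averaging, the two predictions that are off by 2 stars dominate the score, which is why RMSE penalizes large misses more than many small ones.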

You may be able to improve the model by tuning the hyperparameters of the SVD algorithm (for example n_factors, n_epochs, lr_all and reg_all) or by switching to a different model (e.g., KNN).

In [5]:
from collections import defaultdict

from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

# Load the MovieLens 100k dataset
data = Dataset.load_builtin('ml-100k')

# Split the dataset into training and testing sets (80% training, 20% testing)
trainset, testset = train_test_split(data, test_size=0.2)

# Initialize and train the SVD (Singular Value Decomposition) model
svd = SVD()
svd.fit(trainset)

# Make predictions on the test set
predictions = svd.test(testset)

# Evaluate the model using RMSE (Root Mean Squared Error);
# verbose=False stops accuracy.rmse from printing its own rounded copy
rmse = accuracy.rmse(predictions, verbose=False)
print(f"RMSE: {rmse}")


def precision_recall_at_k(predictions, k=10):
    """Average precision and recall at k over all users in `predictions`.

    Caveat: every candidate item comes from the user's own test ratings, so
    each top-k item counts as a hit. The averages therefore reflect how many
    test ratings each user has, not how well the model ranks items.
    """
    # Group (item, estimated rating) pairs by user
    user_items = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        user_items[uid].append((iid, est))

    precision = 0.0
    recall = 0.0
    for uid, items in user_items.items():
        # Items the user actually rated in the test set
        true_items = {iid for iid, _ in items}

        # Top-k items by estimated rating
        items.sort(key=lambda x: x[1], reverse=True)
        recommended_items = {iid for iid, _ in items[:k]}

        # Precision at k: hits / k; recall at k: hits / items rated in the test set
        hits = true_items & recommended_items
        precision += len(hits) / k
        recall += len(hits) / len(true_items)

    # Return the average precision and recall at k
    total_users = len(user_items)
    return precision / total_users, recall / total_users

# Calculate precision and recall at k (e.g., k=10)
precision, recall = precision_recall_at_k(predictions, k=10)
print(f"Precision at 10: {precision}")
print(f"Recall at 10: {recall}")

RMSE: 0.9336671735113826
Precision at 10: 0.829361702127658
Recall at 10: 0.6744103289247286
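
In the cell above, every recommended item is taken from the user's own test ratings, so the figures do not depend on how good the predictions are. A common alternative, similar in spirit to the precision@k example in the Surprise FAQ, treats an item as relevant when its true rating reaches a threshold and as recommended when its estimated rating does. The sketch below assumes a threshold of 3.5 stars and runs on hand-made prediction tuples shaped like Surprise's (uid, iid, true rating, estimate, details), so its numbers are not results for ml-100k.

```python
from collections import defaultdict

def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Per-user precision@k and recall@k, with relevance defined by a rating threshold."""
    # Collect (estimated, true) rating pairs for each user
    user_est_true = defaultdict(list)
    for uid, _iid, true_r, est, _details in predictions:
        user_est_true[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, user_ratings in user_est_true.items():
        # Rank the user's test items by estimated rating, best first
        user_ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = user_ratings[:k]

        n_rel = sum(true_r >= threshold for _, true_r in user_ratings)  # relevant items
        n_rec_k = sum(est >= threshold for est, _ in top_k)             # recommended in top k
        n_rel_and_rec_k = sum(
            true_r >= threshold and est >= threshold for est, true_r in top_k
        )

        # Define the ratio as 0 when its denominator is 0
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k else 0.0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel else 0.0
    return precisions, recalls

# Hand-made predictions: (uid, iid, true rating, estimated rating, details)
toy_predictions = [
    ("u1", "i1", 5.0, 4.6, None),
    ("u1", "i2", 2.0, 4.1, None),
    ("u1", "i3", 4.0, 3.0, None),
    ("u2", "i1", 4.0, 3.9, None),
    ("u2", "i4", 1.0, 2.2, None),
]
precisions, recalls = precision_recall_at_k(toy_predictions, k=2, threshold=3.5)
print(precisions)  # {'u1': 0.5, 'u2': 1.0}
print(recalls)     # {'u1': 0.5, 'u2': 1.0}
```

Averaging the per-user values, for example `sum(precisions.values()) / len(precisions)`, gives a single figure comparable across models. The choice of threshold is a judgment call and changes the results.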


Step 5: System Implementation
Once the model is trained, we can integrate it into a web application. Here's how you can use Flask to build a simple web API that serves recommendations.

First, create a recommend function that returns top recommendations for a user.