In [5]:
conda install -c conda-forge scikit-surprise

# MovieLens Recommendation System

# 1. Business Understanding
## a. Introduction
The MovieLens dataset is a valuable resource for building and improving recommendation systems, and it can serve several business goals. The primary goal here is to enhance the movie recommendation system so that it gives users personalized and engaging movie suggestions.

Current Situation: Our platform currently uses collaborative filtering to recommend movies based on users' past ratings and behavior. We have identified the following limitations:

- Limited initial user ratings: Many users have a sparse rating history, which makes accurate recommendations difficult.
- Cold-start problem: New users who have not rated any movies give the system nothing to base a recommendation on.
- Narrow recommendation scope: Users may miss movies they would enjoy because of limitations in our current recommendation approach.

Data Collection and User Interaction: To address these challenges, we plan to collect additional data and create more interactive ways for users to rate movies and give feedback. The details are set out under the objectives below.


## b. Problem Statement

## c. Objective
Rating Collection Mechanisms:

- Develop user-friendly interfaces: Create user interfaces (web or mobile) that encourage users to rate movies easily and intuitively.
- Implement incentives: Offer rewards, discounts, or exclusive content access to users who provide ratings, to boost participation.
- Capture explicit and implicit feedback: Collect explicit ratings (e.g., star ratings) and implicit feedback (e.g., user clicks, watch history) to better understand user preferences.

Encouraging User Participation:

- Implement recommendation prompts: Use personalized prompts and notifications to encourage users to rate more movies.
- Gamify the rating process: Introduce gamification elements like badges, leaderboards, or challenges to make rating movies more engaging.

Data Integration and Algorithm Improvement: Combine the new user ratings and feedback with the existing dataset to improve our recommendation algorithms. Here's how we plan to do it:

Hybrid Recommendation Approach:

- Combine collaborative filtering and content-based recommendation techniques to mitigate the cold-start problem.
- Utilize matrix factorization, deep learning, or hybrid models to improve recommendation accuracy.

Diversified Recommendations:

- Implement techniques like item diversification to expand the range of recommended movies, introducing users to a broader set of options.

Key Performance Indicators (KPIs): To measure the success of our efforts in enhancing movie recommendations and user engagement, we will track the following KPIs:

User Engagement Metrics:

- User rating frequency and volume.
- Click-through rates on movie recommendations.
- Time spent on the platform.

Recommendation Effectiveness Metrics:

- Recommendation precision and recall.
- User satisfaction surveys and feedback.
- Conversion rates for recommended movies.

Cold-start Problem Mitigation:

- Percentage of successfully recommended movies for new users.
- Improvement in the recommendation coverage.

By addressing these specific business objectives and implementing data collection and algorithmic improvements, we aim to provide users with more accurate, diverse, and engaging movie recommendations, ultimately leading to higher user satisfaction and increased user retention on our platform.
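The "recommendation precision and recall" KPI listed above could be computed per user roughly as follows. This is a minimal sketch, not the notebook's own evaluation code: the function name `precision_recall_at_k`, the cutoff `k`, and the toy item ids are all assumptions made for illustration, and "relevant" is assumed to mean the set of movies a user actually liked.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Return (precision@k, recall@k) for a single user.

    recommended: item ids ordered from best to worst prediction.
    relevant:    set of item ids the user actually liked.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Precision: share of the recommended items that were relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: share of the relevant items that were recommended.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data (hypothetical ids, not taken from MovieLens).
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 70}
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(f"precision@5 = {precision:.2f}, recall@5 = {recall:.2f}")
```

Averaging the two values over all users gives a single platform-level figure for each metric.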

## d. Defining the metric for success

## e. Experimental Design

## f. Data Understanding


# 2. Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds
from surprise import Reader, Dataset, KNNBasic
from surprise.model_selection import train_test_split
from surprise import accuracy

# 3. Reading The Data

In [3]:
links_data = pd.read_csv("links.csv")
links_data.head()

   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0


In [4]:
movies_data = pd.read_csv("movies.csv")
movies_data.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [5]:
ratings_data = pd.read_csv("ratings.csv")
ratings_data.head()

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [8]:
ratings_data.rating.value_counts()

rating
4.0    26818
3.0    20047
5.0    13211
3.5    13136
4.5     8551
2.0     7551
2.5     5550
1.0     2811
1.5     1791
0.5     1370
Name: count, dtype: int64

In [9]:
tags_data = pd.read_csv("tags.csv")
tags_data.head()

   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992
3       2    89774     Boxing story  1445715207
4       2    89774              MMA  1445715200


# 4. Checking The Data

In [10]:
# checking the shape of the movies dataset
movies_data.shape

(9742, 3)

In [11]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [12]:
# checking the shape of the ratings dataset
ratings_data.shape

(100836, 4)

In [13]:
ratings_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [14]:
# checking how many distinct movies have at least one rating
# (fewer than the 9742 rows in movies_data, so some movies have no ratings)
ratings_data.movieId.nunique()

9724
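The counts above (100,836 ratings spread over 9,724 rated movies) suggest that most user/movie pairs have no rating, which is the sparsity issue raised in the introduction. A minimal sketch of how that density could be measured is shown below; the helper `rating_density` and the toy pairs are illustrative assumptions, not part of the notebook.

```python
def rating_density(pairs):
    """Fraction of the user x item grid that actually holds a rating.

    pairs: iterable of (user_id, item_id) tuples, one per rating.
    """
    unique_pairs = set(pairs)
    users = {user for user, _ in unique_pairs}
    items = {item for _, item in unique_pairs}
    if not users or not items:
        return 0.0
    return len(unique_pairs) / (len(users) * len(items))


# Toy data: 3 users, 3 items, 4 ratings (hypothetical ids).
toy_pairs = [(1, 1), (1, 3), (2, 1), (3, 2)]
density = rating_density(toy_pairs)
print(f"density = {density:.3f}, sparsity = {1 - density:.3f}")
```

On the real data the same ratio would be `len(ratings_data) / (ratings_data.userId.nunique() * ratings_data.movieId.nunique())`.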

# 5. Tidying The Dataset

## a. Item-based filtering

In [16]:
# Creating a reader object to specify the rating scale
# (ratings in this dataset run from 0.5 to 5.0, as the value counts above show)
reader = Reader(rating_scale=(0.5, 5))

In [18]:
# Loading the data into a Surprise Dataset
data = Dataset.load_from_df(ratings_data[['userId', 'movieId', 'rating']], reader)

In [19]:
# Split the data into training set and testing set
trainset, testset = train_test_split(data, test_size=0.2)
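The split above is random, so the RMSE and the recommendation lists will differ from run to run unless the shuffle is seeded. The sketch below illustrates the idea of a seeded 80/20 split in plain Python; `split_ratings` and the seed value are illustrative assumptions, not Surprise's implementation.

```python
import random


def split_ratings(ratings, test_size=0.2, seed=42):
    """Shuffle ratings with a fixed seed and split them into (train, test)."""
    rng = random.Random(seed)      # local generator, so global state is untouched
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Toy ratings as (user_id, item_id, rating) tuples (hypothetical values).
toy_ratings = [(u, i, 3.0) for u in range(1, 3) for i in range(1, 6)]
train, test = split_ratings(toy_ratings)
print(len(train), len(test))
```

In the notebook itself, passing `random_state` to Surprise's `train_test_split` (for example `train_test_split(data, test_size=0.2, random_state=42)`) pins the split in the same way; the value 42 is arbitrary.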

In [20]:
# Creating an item-based collaborative filtering model
sim_options = {'name': 'cosine', 'user_based': False}
model = KNNBasic(sim_options=sim_options)
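With `user_based` set to `False`, similarities are computed between movies rather than between users. Surprise's cosine similarity only takes into account the users who rated both movies. The sketch below shows that calculation for one pair of items; the function `cosine_on_common_users` and the toy ratings are assumptions made for illustration.

```python
import math


def cosine_on_common_users(ratings_a, ratings_b):
    """Cosine similarity between two items over the users who rated both.

    ratings_a, ratings_b: dicts mapping user_id -> rating for one item each.
    """
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = math.sqrt(sum(ratings_a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(ratings_b[u] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (hypothetical users and ratings).
item_a = {1: 4.0, 2: 5.0, 3: 1.0}
item_c = {1: 1.0, 2: 5.0}
similarity = cosine_on_common_users(item_a, item_c)
print(f"similarity = {similarity:.3f}")
```

Because ratings are all positive, this measure always lands between 0 and 1 and often sits near the top of that range, which is one reason mean-centered alternatives are sometimes tried as well.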

In [21]:
# Training the model on the training data
model.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x187c4a9da50>

In [22]:
# Making predictions on the testing set
predictions = model.test(testset)

In [23]:
# Evaluating the model's performance using the Root Mean Squared Error.
# accuracy.rmse prints the score itself by default; verbose=False avoids a duplicate line.
rmse = accuracy.rmse(predictions, verbose=False)
print(f'RMSE: {rmse:.4f}')

RMSE: 0.9824
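For reference, RMSE is the square root of the mean squared difference between the true and the estimated ratings, so a value near 0.98 means a typical error of about one star. A small worked example with made-up numbers follows; the helper name `root_mean_squared_error` is an assumption chosen to avoid clashing with the notebook's `rmse` variable.

```python
import math


def root_mean_squared_error(pairs):
    """RMSE over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (true, estimated) pair")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy pairs: squared errors are 0.25, 1.0 and 1.0, so the mean is 0.75.
toy_pairs = [(4.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
toy_rmse = root_mean_squared_error(toy_pairs)
print(f"RMSE = {toy_rmse:.4f}")
```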


In [24]:
# Group the predictions by user and keep each user's n highest estimated ratings
def get_top_n_recommendations(predictions, n=10):
    top_n = {}
    for uid, iid, true_r, est, _ in predictions:
        if uid not in top_n:
            top_n[uid] = []
        top_n[uid].append((iid, est))
    
    # Sort the predictions for each user and get the top n
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    
    return top_n

In [26]:
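# Getting item-based top-n picks for a specific user.
# Note: `predictions` covers only the test set, so these are held-out movies
# the user has already rated, ranked by estimated rating.
user_id = 1  # Replace with the actual user ID
top_n_recommendations = get_top_n_recommendations(predictions, n=10)
user_recommendations = top_n_recommendations.get(user_id, [])  # empty if the user has no test ratings

# Map movieId -> title once instead of filtering the DataFrame on every iteration
movie_titles = movies_data.set_index('movieId')['title']

# Print the top recommendations for the user
print(f"Top {len(user_recommendations)} recommendations for User {user_id}:")
for movie_id, estimated_rating in user_recommendations:
    movie_title = movie_titles.get(movie_id, 'Unknown title')
    print(f"{movie_title} (MovieId: {movie_id}, Estimated Rating: {estimated_rating:.2f})")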


Top 10 recommendations for User 1:
Grumpier Old Men (1995) (MovieId: 3, Estimated Rating: 4.77)
L.A. Confidential (1997) (MovieId: 1617, Estimated Rating: 4.67)
South Park: Bigger, Longer and Uncut (1999) (MovieId: 2700, Estimated Rating: 4.63)
McHale's Navy (1997) (MovieId: 1445, Estimated Rating: 4.62)
Predator (1987) (MovieId: 3527, Estimated Rating: 4.62)
Spaceballs (1987) (MovieId: 3033, Estimated Rating: 4.60)
Big Trouble in Little China (1986) (MovieId: 3740, Estimated Rating: 4.60)
Indiana Jones and the Last Crusade (1989) (MovieId: 1291, Estimated Rating: 4.60)
Seven (a.k.a. Se7en) (1995) (MovieId: 47, Estimated Rating: 4.58)
So I Married an Axe Murderer (1993) (MovieId: 543, Estimated Rating: 4.57)
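The list above is built from the test-set predictions, so it ranks movies User 1 has already rated (movieId 3 and movieId 47 both appear for this user in the `ratings_data.head()` output). To recommend movies the user has not seen, the candidates would instead be the items with no rating from that user. The following is a minimal sketch of that idea; `recommend_unseen`, the `predict` callback, and the toy data are hypothetical names introduced for illustration.

```python
def recommend_unseen(user_id, ratings, all_items, predict, n=10):
    """Rank the items a user has not rated by their predicted rating.

    ratings:   iterable of (user_id, item_id, rating) tuples.
    all_items: iterable of every candidate item id.
    predict:   callable (user_id, item_id) -> estimated rating.
    """
    seen = {item for user, item, _ in ratings if user == user_id}
    candidates = [item for item in all_items if item not in seen]
    scored = [(item, predict(user_id, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy data and a stand-in predictor (hypothetical values).
toy_ratings = [(1, 1, 4.0), (1, 3, 4.0), (2, 2, 5.0)]
toy_items = [1, 2, 3, 4]
toy_scores = {2: 3.5, 4: 4.2}


def toy_predict(user_id, item_id):
    return toy_scores.get(item_id, 3.0)


unseen_top = recommend_unseen(1, toy_ratings, toy_items, toy_predict, n=10)
print(unseen_top)
```

With the fitted model from the notebook, `predict` could be `lambda u, i: model.predict(u, i).est`. Surprise's `trainset.build_anti_testset()` builds the same kind of candidate list for every user at once, at the cost of a much larger set of predictions.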


## b. User-based filtering

In [27]:
# Creating a reader object to specify the rating scale
# (ratings in this dataset run from 0.5 to 5.0, as the value counts above show)
reader = Reader(rating_scale=(0.5, 5))

In [29]:
# Loading the data into a Surprise Dataset
data = Dataset.load_from_df(ratings_data[['userId', 'movieId', 'rating']], reader)

In [30]:
# Splitting the data into a train set and a test set
trainset, testset = train_test_split(data, test_size=0.2)

In [31]:
# Creating a User-Based Collaborative Filtering method
sim_options = {'name': 'cosine', 'user_based': True}
model = KNNBasic(sim_options=sim_options)
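With `user_based` set to `True`, `KNNBasic` estimates a rating as a similarity-weighted average of the ratings that the most similar users gave to the same movie (its default neighbourhood size is `k=40`). The sketch below is a simplified illustration of that weighted average, not Surprise's actual code; `knn_basic_estimate` and the toy neighbours are assumptions.

```python
def knn_basic_estimate(neighbors, k=40):
    """Similarity-weighted average of the k most similar neighbours' ratings.

    neighbors: list of (similarity, rating) pairs, one per user who rated the item.
    Returns None when no neighbour has a positive similarity.
    """
    # Keep the k most similar neighbours, then ignore non-positive similarities.
    nearest = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    nearest = [(sim, rating) for sim, rating in nearest if sim > 0]
    total_similarity = sum(sim for sim, _ in nearest)
    if total_similarity == 0:
        return None
    return sum(sim * rating for sim, rating in nearest) / total_similarity


# Toy neighbours (hypothetical similarities and ratings).
toy_neighbors = [(0.9, 5.0), (0.5, 3.0), (0.1, 1.0)]
estimate = knn_basic_estimate(toy_neighbors, k=2)
print(f"estimated rating = {estimate:.2f}")
```

With `k=2` only the two closest neighbours count, so the estimate is (0.9 × 5.0 + 0.5 × 3.0) / 1.4.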

In [32]:
# Training the model on the training set
model.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x187cb743580>

In [33]:
# Making predictions on the test set
predictions = model.test(testset)

In [34]:
# Evaluating the model's performance using Root Mean Squared Error.
# accuracy.rmse prints the score itself by default; verbose=False avoids a duplicate line.
rmse = accuracy.rmse(predictions, verbose=False)
print(f'RMSE: {rmse:.4f}')

RMSE: 0.9673


In [35]:
# Group the predictions by user and keep each user's n highest estimated ratings
def get_top_n_recommendations(predictions, n=10):
    top_n = {}
    for uid, iid, true_r, est, _ in predictions:
        if uid not in top_n:
            top_n[uid] = []
        top_n[uid].append((iid, est))
    
    # Sort the predictions for each user and get the top n
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    
    return top_n

In [37]:
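# Get user-based top-n picks for a specific user.
# Note: `predictions` covers only the test set, so these are held-out movies
# the user has already rated, ranked by estimated rating.
user_id = 1  # Replace with the actual user ID
top_n_recommendations = get_top_n_recommendations(predictions, n=10)
user_recommendations = top_n_recommendations.get(user_id, [])  # empty if the user has no test ratings

# Map movieId -> title once instead of filtering the DataFrame on every iteration
movie_titles = movies_data.set_index('movieId')['title']

# Print the top recommendations for the user
print(f"Top {len(user_recommendations)} recommendations for User {user_id}:")
for movie_id, estimated_rating in user_recommendations:
    movie_title = movie_titles.get(movie_id, 'Unknown title')
    print(f"{movie_title} (MovieId: {movie_id}, Estimated Rating: {estimated_rating:.2f})")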


Top 10 recommendations for User 1:
Forrest Gump (1994) (MovieId: 356, Estimated Rating: 4.38)
Star Wars: Episode IV - A New Hope (1977) (MovieId: 260, Estimated Rating: 4.35)
Back to the Future (1985) (MovieId: 1270, Estimated Rating: 4.29)
Saving Private Ryan (1998) (MovieId: 2028, Estimated Rating: 4.29)
Psycho (1960) (MovieId: 1219, Estimated Rating: 4.18)
Road Warrior, The (Mad Max 2) (1981) (MovieId: 3703, Estimated Rating: 4.12)
Run Lola Run (Lola rennt) (1998) (MovieId: 2692, Estimated Rating: 4.05)
Blues Brothers, The (1980) (MovieId: 1220, Estimated Rating: 3.98)
South Park: Bigger, Longer and Uncut (1999) (MovieId: 2700, Estimated Rating: 3.97)
Edward Scissorhands (1990) (MovieId: 2291, Estimated Rating: 3.78)
