# MovieLens Recommendation System

# 1. Business Understanding
## a. Introduction
The MovieLens dataset is a widely used resource for building and improving recommendation systems, and it can serve a range of business goals. This section frames a specific business problem and objective within the general context of recommending movies.


## b. Problem Statement
Our platform currently recommends movies using collaborative filtering, based on users' past movie ratings and behavior. We have identified three limitations of this approach:

- Sparse rating histories: Many users have rated only a few movies, which makes it hard to provide accurate recommendations.
- Cold-start problem: Recommending movies to new users who have not yet provided any ratings is especially challenging.
- Narrow recommendation scope: Users may be missing out on movies they would find interesting because of limitations in the current recommendation approach.

## c. Objective
The primary goal is to enhance the movie recommendation system so that it provides users with personalized and engaging movie suggestions.

Data Collection and User Interaction: To address these challenges, we plan to collect additional data and give users more interactive ways to rate movies and provide feedback.

Rating Collection Mechanisms:

- Develop user-friendly interfaces: Create web or mobile interfaces that make rating movies easy and intuitive.
- Implement incentives: Offer rewards, discounts, or exclusive content access to users who provide ratings, to boost participation.
- Capture explicit and implicit feedback: Collect explicit ratings (e.g., star ratings) and implicit feedback (e.g., user clicks, watch history) to better understand user preferences.

Encouraging User Participation:

- Implement recommendation prompts: Use personalized prompts and notifications to encourage users to rate more movies.
- Gamify the rating process: Introduce gamification elements such as badges, leaderboards, or challenges to make rating movies more engaging.

Data Integration and Algorithm Improvement: We will combine the new user ratings and feedback with the existing dataset to improve our recommendation algorithms.

Hybrid Recommendation Approach:

- Combine collaborative filtering and content-based recommendation techniques to mitigate the cold-start problem.
- Use matrix factorization, deep learning, or hybrid models to improve recommendation accuracy.

Diversified Recommendations:

- Apply techniques such as item diversification to expand the range of recommended movies, introducing users to a broader set of options.

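
As a rough illustration of the matrix factorization idea mentioned above, the sketch below factorizes a tiny, made-up user-item rating matrix with `scipy.sparse.linalg.svds`. The toy ratings, the choice of `k=2` latent factors, and the treatment of unrated entries as zeros are all simplifying assumptions for illustration, not necessarily the approach this project ends up using.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Hypothetical toy ratings (not taken from the MovieLens files)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 20, 10, 30, 20, 40, 30, 40, 10, 40],
    "rating":  [4.0, 5.0, 3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 5.0, 1.0],
})

# Map raw ids to consecutive row/column positions
user_pos, user_ids = pd.factorize(toy_ratings["userId"])
movie_pos, movie_ids = pd.factorize(toy_ratings["movieId"])

# Sparse user-item matrix; unrated cells are implicitly 0 (a simplification)
user_item = csc_matrix(
    (toy_ratings["rating"].to_numpy(), (user_pos, movie_pos)),
    shape=(len(user_ids), len(movie_ids)),
)

# Truncated SVD with k latent factors (k must be smaller than both dimensions)
U, sigma, Vt = svds(user_item, k=2)

# Low-rank reconstruction: one predicted score per (user, movie) pair
predicted = pd.DataFrame(U @ np.diag(sigma) @ Vt, index=user_ids, columns=movie_ids)
print(predicted.round(2))
```

In practice the ratings are often mean-centered per user before factorizing, so that missing entries are not read as very low ratings.
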
Key Performance Indicators (KPIs): To measure the success of our efforts to improve movie recommendations and user engagement, we will track the following KPIs.

User Engagement Metrics:

- User rating frequency and volume.
- Click-through rates on movie recommendations.
- Time spent on the platform.

Recommendation Effectiveness Metrics:

- Recommendation precision and recall.
- User satisfaction surveys and feedback.
- Conversion rates for recommended movies.

Cold-start Problem Mitigation:

- Percentage of successful recommendations for new users.
- Improvement in recommendation coverage.

By addressing these specific business objectives and implementing data collection and algorithmic improvements, we aim to provide users with more accurate, diverse, and engaging movie recommendations, ultimately leading to higher user satisfaction and increased user retention on our platform.
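
Of the KPIs listed above, recommendation precision and recall are the most direct to compute. The sketch below shows one common way to define them over a top-k list; the helper name `precision_recall_at_k` and the sample ids are hypothetical, and what counts as a "relevant" movie (for example, a rating above some threshold) is an assumption that would need to be agreed on separately.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for a single user.

    recommended: ranked list of movie ids, best first
    relevant:    collection of movie ids the user actually liked
    """
    top_k = recommended[:k]
    relevant = set(relevant)
    hits = len(set(top_k) & relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 ranked recommendations, 3 relevant movies
recommended_ids = [1, 2, 3, 4, 5]
relevant_ids = {2, 5, 9}
precision, recall = precision_recall_at_k(recommended_ids, relevant_ids, k=3)
print(f"precision@3={precision:.2f}, recall@3={recall:.2f}")
```

Averaging these per-user values over all users gives a single figure that can be tracked over time.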


# 2. Importing Libraries

In [2]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# 3. Reading The Data

In [3]:
links_data = pd.read_csv("links.csv")
links_data.head()

|   | movieId | imdbId | tmdbId |
|---|---------|--------|--------|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


In [4]:
movies_data = pd.read_csv("movies.csv")
movies_data.head()

|   | movieId | title | genres |
|---|---------|-------|--------|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
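
The `genres` column packs several genres into one pipe-separated string. If genres are later used as content features, one possible way to unpack them is `Series.str.get_dummies`, sketched below on the first three rows shown above. The names `movies_sample`, `genre_flags`, and `movies_encoded` are illustrative and not part of the notebook.

```python
import pandas as pd

# First three rows of movies.csv, copied from the output above
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 indicator column per genre, split on the "|" separator
genre_flags = movies_sample["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the movie identifiers
movies_encoded = pd.concat([movies_sample[["movieId", "title"]], genre_flags], axis=1)
print(movies_encoded)
```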


In [5]:
ratings_data = pd.read_csv("ratings.csv")
ratings_data.head()

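|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |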
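
The `timestamp` values are large integers; MovieLens documents them as seconds since midnight UTC on January 1, 1970. Assuming that convention, they can be made readable with `pd.to_datetime`, as in this sketch on the first three rows above. The column name `rated_at` is an illustrative choice.

```python
import pandas as pd

# First three rows of ratings.csv, copied from the output above
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 3, 6],
    "rating": [4.0, 4.0, 4.0],
    "timestamp": [964982703, 964981247, 964982224],
})

# Interpret the integers as Unix epoch seconds (assumed unit) in UTC
ratings_sample["rated_at"] = pd.to_datetime(
    ratings_sample["timestamp"], unit="s", utc=True
)
print(ratings_sample[["userId", "movieId", "rated_at"]])
```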


In [6]:
tags_data = pd.read_csv("tags.csv")
tags_data.head()

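|   | userId | movieId | tag | timestamp |
|---|--------|---------|-----|-----------|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |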


# 4. Checking The Data

In [7]:
# checking the shape of the links dataset
links_data.shape

(9742, 3)

In [12]:
links_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB
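
The summary above shows 9734 non-null `tmdbId` values out of 9742 rows, so 8 entries are missing; the missing values are also why pandas stores this id column as `float64`. The sketch below shows one way to count the gaps and switch to pandas' nullable `Int64` dtype. The sample frame, including the position of the missing value, is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing tmdbId (the NaN position is invented)
links_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, np.nan, 15602.0],
})

# Count missing values per column
missing_per_column = links_sample.isna().sum()
print(missing_per_column)

# Nullable integer dtype keeps ids as integers while allowing missing values
links_sample["tmdbId"] = links_sample["tmdbId"].astype("Int64")
print(links_sample.dtypes)
```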


In [8]:
# checking the shape of the movies dataset
movies_data.shape

(9742, 3)

In [13]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [9]:
# checking the shape of the ratings dataset
ratings_data.shape

(100836, 4)

In [14]:
ratings_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [10]:
# checking the shape of the tags dataset
tags_data.shape

(3683, 4)

In [15]:
tags_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [11]:
# checking the number of unique movies in the ratings dataset
ratings_data.movieId.nunique()

9724
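
The ratings cover 9724 distinct movies, while the movies dataset has 9742 rows. Assuming every rated movie also appears in the movies dataset, that leaves 18 movies with no ratings at all. A minimal sketch of how such movies could be listed, using small made-up frames in place of `movies_data` and `ratings_data`:

```python
import pandas as pd

# Hypothetical stand-ins for movies_data and ratings_data
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
})
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 5.0, 3.5],
})

# Movies whose movieId never occurs in the ratings
is_rated = movies_sample["movieId"].isin(ratings_sample["movieId"])
unrated_movies = movies_sample[~is_rated]
print(unrated_movies)
```

Collaborative filtering has no signal for these movies, so they are an item-side instance of the cold-start problem.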

# 5. Tidying The Dataset

We merge the movies dataset and the links dataset as they share the same movieId column.
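
Before merging, it can be worth confirming that `movieId` is unique in both tables, since duplicate keys would silently multiply rows. The sketch below illustrates this on three sample rows taken from the outputs above; the `validate="one_to_one"` argument and the variable names are illustrative choices, not necessarily what the notebook uses.

```python
import pandas as pd

# Three sample rows from movies.csv and links.csv
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})
links_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, 8844.0, 15602.0],
})

# Each movieId should appear once per table
print(movies_sample["movieId"].is_unique, links_sample["movieId"].is_unique)

# validate raises MergeError if the keys are not one-to-one
merged_sample = pd.merge(
    movies_sample, links_sample, on="movieId", how="inner", validate="one_to_one"
)
print(merged_sample)
```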

In [None]:
# merging the movies dataset and the links dataset