# Recommendation System - Movies

I am going to build a recommendation system for movies using the MovieLens dataset (the `ml-latest-small` release). The dataset contains roughly 100,000 ratings and 3,600 tag applications applied to about 9,000 movies by about 600 users. It was generated on September 26, 2018. Users were selected at random for inclusion, and every selected user had rated at least 20 movies. Each user is represented only by an id; no other information is provided.

Citation:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

In [2]:
import pandas as pd

movies_df = pd.read_csv('./data/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('./data/ml-latest-small/ratings.csv')


In [3]:
print(f"The dimensions of the movies dataframe are: {movies_df.shape}")
print(f"The dimensions of the ratings dataframe are: {ratings_df.shape}")

The dimensions of the movies dataframe are: (9742, 3)
The dimensions of the ratings dataframe are: (100836, 4)


In [4]:
# Let's take a look at the movies dataframe
movies_df.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [5]:
# Let's take a look at the ratings dataframe
ratings_df.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


## Comment
As we can see, the same userId appears on several rows of the ratings dataframe. These are not duplicate rows: each row is one user's rating of a different movie. That is expected, because every user in the dataset rated more than one movie (at least 20 each). <br>
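
A quick way to check this is to count ratings per user and to look for repeated (userId, movieId) pairs, which would be true duplicates. The sketch below is only illustrative: the `ratings_per_user` helper and the small `sample_ratings` frame are assumptions made up for the example, and on the real data the helper would be called with `ratings_df` instead.

```python
import pandas as pd

def ratings_per_user(ratings: pd.DataFrame) -> pd.Series:
    """Count how many ratings each user has submitted."""
    return ratings.groupby("userId").size()

# Hypothetical sample for illustration; the real notebook would pass ratings_df.
sample_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2],
    "movieId": [1, 3, 6, 1, 47],
    "rating":  [4.0, 4.0, 4.0, 3.5, 5.0],
})

counts = ratings_per_user(sample_ratings)
print(counts)

# A repeated userId is expected; a repeated (userId, movieId) pair would be a real duplicate.
n_true_duplicates = int(sample_ratings.duplicated(subset=["userId", "movieId"]).sum())
print(f"Duplicate (userId, movieId) pairs: {n_true_duplicates}")
```

On the full dataset, `ratings_per_user(ratings_df).min()` should be at least 20 if the dataset description holds.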
## Movies Dataframe
The movies dataframe has 3 columns: movieId, title and genres. The movieId is a unique identifier for each movie. The title is the name of the movie, followed by its release year in parentheses. The genres column holds the movie's genres as a single string, with the genres separated by a pipe (|). <br>
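
Because the genres are stored as one string, they usually need to be split before they can be counted or compared. A minimal sketch, using three rows copied from the `movies_df.head()` output; the names `sample_movies`, `genre_lists` and `genre_counts` are assumptions for this example:

```python
import pandas as pd

# Rows copied from the movies_df.head() output shown earlier.
sample_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split the pipe-separated string into a Python list per movie.
genre_lists = sample_movies["genres"].str.split("|")

# explode() gives one row per (movie, genre) pair, which makes counting straightforward.
genre_counts = genre_lists.explode().value_counts()
print(genre_counts)
```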
## Ratings Dataframe
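The ratings dataframe has 4 columns: userId, movieId, rating and timestamp. Each row is one rating. The userId identifies the user who gave the rating. The movieId identifies the rated movie and matches the movieId in the movies dataframe. The rating is the score given by the user to the movie, on a 5-star scale with half-star increments (0.5 to 5.0). The timestamp is the time when the user rated the movie, stored as seconds since midnight UTC on January 1, 1970. <br>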
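
The raw timestamps are hard to read as integers. Assuming they are Unix timestamps in seconds (which is how the MovieLens documentation describes them), they can be converted with `pd.to_datetime`. The sketch below uses the rows from the `ratings_df.head()` output; the `rated_at` column name is an assumption for this example:

```python
import pandas as pd

# Rows copied from the ratings_df.head() output shown earlier.
sample_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 1, 1],
    "movieId": [1, 3, 6, 47, 50],
    "rating": [4.0, 4.0, 4.0, 5.0, 5.0],
    "timestamp": [964982703, 964981247, 964982224, 964983815, 964982931],
})

# Interpret the integers as seconds since 1970-01-01 00:00 UTC.
sample_ratings["rated_at"] = pd.to_datetime(
    sample_ratings["timestamp"], unit="s", utc=True
)
print(sample_ratings[["movieId", "rating", "rated_at"]])
```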


## Theoretical Background
### Collaborative Filtering
Collaborative filtering relies on user behavior and preferences. It makes recommendations based on the idea that if two users have agreed on many items in the past, they are likely to agree on future items as well. There are two types: <br>
- User-based collaborative filtering: Recommends items that similar users have liked.<br>
- Item-based collaborative filtering: Recommends items that are similar to the ones the user has liked or interacted with in the past, where two items count as similar when the same users tend to rate them alike.<br>

Example (user-based): If user A and user B both liked movies X and Y, and user A also liked movie Z, the system might recommend movie Z to user B.
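
The user-based idea can be sketched on a toy rating table. Everything here is an assumption made for illustration: the users A, B and C, the movies W to Z, the ratings, the choice of Pearson correlation as the similarity measure, and the helper names `pearson_on_shared` and `predict_user_based`. It is one simple variant, not the method used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movies, NaN means "not rated".
ratings = pd.DataFrame(
    {
        "W": [1.0, 2.0, 5.0],
        "X": [5.0, 5.0, 1.0],
        "Y": [4.0, 5.0, 2.0],
        "Z": [5.0, np.nan, 1.0],
    },
    index=["A", "B", "C"],
)

def pearson_on_shared(u: pd.Series, v: pd.Series) -> float:
    """Pearson correlation computed only on the movies both users rated."""
    shared = u.notna() & v.notna()
    if shared.sum() < 2:
        return 0.0
    a = u[shared].to_numpy() - u[shared].mean()
    b = v[shared].to_numpy() - v[shared].mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict_user_based(ratings: pd.DataFrame, user: str, movie: str) -> float:
    """Similarity-weighted average of the ratings given to `movie` by similar users."""
    weighted_sum, weight_total = 0.0, 0.0
    for other in ratings.index:
        if other == user or pd.isna(ratings.loc[other, movie]):
            continue
        sim = pearson_on_shared(ratings.loc[user], ratings.loc[other])
        if sim > 0:  # this simple sketch ignores users with opposite tastes
            weighted_sum += sim * ratings.loc[other, movie]
            weight_total += sim
    return weighted_sum / weight_total if weight_total > 0 else float("nan")

sim_BA = pearson_on_shared(ratings.loc["B"], ratings.loc["A"])
sim_BC = pearson_on_shared(ratings.loc["B"], ratings.loc["C"])
pred = predict_user_based(ratings, "B", "Z")
print(f"sim(B, A) = {sim_BA:.2f}, sim(B, C) = {sim_BC:.2f}, predicted B on Z = {pred:.2f}")
```

In this toy table B agrees with A and disagrees with C, so the prediction for B on movie Z is driven by A's rating.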
### Content-Based Filtering
Content-based filtering recommends items based on the attributes of the items themselves and the user's past behavior. It focuses on the features of items (such as genre, author, keywords, etc.) and suggests items similar to those the user has shown interest in. <br>
Example: If a user likes action movies, the system might recommend more action movies based on their genre, director, or cast.
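
A minimal content-based sketch can be built from the genres column alone. It uses the five movies from the `movies_df.head()` output, one-hot encodes their genres, and ranks movies by cosine similarity. The names `sample_movies`, `similarity` and `most_similar` are assumptions for this example, and genre overlap is only one of many possible item features.

```python
import numpy as np
import pandas as pd

# Rows copied from the movies_df.head() output shown earlier.
sample_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Grumpier Old Men (1995)",
        "Waiting to Exhale (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
        "Comedy|Drama|Romance",
        "Comedy",
    ],
})

# One column per genre, 1.0 if the movie has that genre and 0.0 otherwise.
genre_matrix = sample_movies["genres"].str.get_dummies(sep="|").to_numpy(dtype=float)

# Cosine similarity between every pair of movies.
unit = genre_matrix / np.linalg.norm(genre_matrix, axis=1, keepdims=True)
similarity = unit @ unit.T

def most_similar(title: str, top_n: int = 2) -> list[str]:
    """Return the titles whose genre vectors are closest to the given movie."""
    idx = int(np.flatnonzero((sample_movies["title"] == title).to_numpy())[0])
    order = np.argsort(-similarity[idx])
    return [sample_movies["title"].iloc[i] for i in order if i != idx][:top_n]

print(most_similar("Toy Story (1995)"))
```

For Toy Story, the closest match among these five movies is Jumanji, which shares three of its five genres.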
### Key Difference:
- Collaborative filtering leverages user preferences and interactions.
- Content-based filtering uses item features and user history to make recommendations.
### What are we going to do with our dataset?
We will primarily use collaborative filtering, specifically matrix factorization.
## Matrix Factorization
Matrix factorization is a technique often used in collaborative filtering methods for recommendation systems, where the goal is to predict missing entries in a matrix, typically a user-item interaction matrix (e.g., user ratings of movies, products, etc.). The idea is to approximate the matrix as the product of two lower-dimensional matrices, one for users and one for items. Their rows are often called latent factors or embeddings, and they capture the latent features underlying the interactions between users and items.
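
One common way to write this down is shown below. The symbols are assumptions of this sketch, not notation from the dataset or a specific library: $R$ is the $m \times n$ rating matrix, $P$ ($m \times k$) and $Q$ ($n \times k$) are the user and item factor matrices with rows $p_u$ and $q_i$, $\mathcal{K}$ is the set of observed (user, item) pairs, and $\lambda$ is a regularization weight.

```latex
R \approx P Q^{\top}, \qquad \hat{r}_{ui} = p_u^{\top} q_i

\min_{P,\,Q} \sum_{(u,i) \in \mathcal{K}}
  \left[ \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right]
```

Only the observed ratings enter the sum, so the missing entries do not need to be filled in before fitting.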
### In the context of recommendation systems:
- The matrix: This is typically a large, sparse matrix, where rows represent users, columns represent items (e.g., movies), and the entries are ratings (or interactions) provided by users to items.
- The goal: The objective of matrix factorization is to approximate the original matrix by decomposing it into two smaller matrices that can be multiplied together to predict the missing ratings.
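
To make the two points above concrete, here is a small NumPy sketch that factorizes a toy rating matrix with stochastic gradient descent. The matrix `R`, the `factorize` function and all hyperparameter values are assumptions chosen for illustration; real projects often use a dedicated library instead.

```python
import numpy as np

def factorize(ratings, n_factors=2, n_epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Fit ratings ≈ P @ Q.T with SGD, using only the observed (non-NaN) entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
    observed = list(zip(*np.where(~np.isnan(ratings))))
    for _ in range(n_epochs):
        for u, i in observed:
            err = ratings[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()  # keep the old user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

# Hypothetical toy matrix: rows are users, columns are movies, NaN is a missing rating.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])

P, Q = factorize(R)
R_hat = P @ Q.T  # dense matrix: every entry, including the missing ones, gets a prediction

observed_mask = ~np.isnan(R)
rmse = np.sqrt(np.mean((R[observed_mask] - R_hat[observed_mask]) ** 2))
print(np.round(R_hat, 2))
print(f"RMSE on observed ratings: {rmse:.3f}")
```

The entries of `R_hat` at the positions that were NaN in `R` are the predicted ratings that a recommender would rank.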