This project implements a Content-Based Recommendation System for movies. The primary objective is to recommend movies to a user based on the content and attributes of movies they have previously enjoyed. Unlike collaborative filtering, which relies on user-item interactions, this approach focuses on the intrinsic properties of the items themselves.
The system works by creating a profile for each user based on their movie ratings and then suggesting new movies that are similar to their highest-rated films. The similarity is calculated using movie genres, which are vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. This project serves as a practical demonstration of building a simple yet effective recommendation engine from scratch.
This project is developed in Python and leverages several key libraries for data manipulation and machine learning:
- Python 3: The core programming language.
- Pandas: Used extensively for data loading, manipulation, and structuring. It is the primary tool for managing the movies and ratings data.
- NumPy: Provides support for numerical operations, although its use is minimal in this specific notebook.
- Scikit-learn: A fundamental machine learning library used for:
  - `TfidfVectorizer`: Converts the textual genre data into a matrix of TF-IDF features. This is crucial for quantifying the importance of each genre in a movie's description.
  - `linear_kernel`: Computes the dot product between the TF-IDF vectors of the movies. Because `TfidfVectorizer` L2-normalizes each vector by default, this dot product equals the cosine similarity. The resulting score is used to determine how alike two movies are based on their genres.
- Jupyter Notebook: The interactive development environment used to write, execute, and document the analysis.
The analysis is performed using two datasets that are part of the MovieLens dataset collection:
- `movies.csv`:
  - Description: Contains information about movies, including `movieId`, `title`, and `genres`.
  - Source: Loaded from an IBM Cloud Object Storage URL.
- `ratings.csv`:
  - Description: Contains user ratings for movies, including `userId`, `movieId`, and `rating`.
  - Source: Loaded from an IBM Cloud Object Storage URL.
The recommendation system is built following a structured, content-based filtering approach.
- The `movies.csv` and `ratings.csv` datasets are loaded into Pandas DataFrames.
- The `genres` column in the `movies` DataFrame is preprocessed by replacing the `|` separator with spaces. This prepares the data for text-based feature extraction.
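The loading and preprocessing steps might look roughly like the sketch below. The rows are hypothetical stand-ins that only mirror the column names described above, and an in-memory buffer replaces the real file or URL so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample rows with the same columns as movies.csv and ratings.csv.
MOVIES_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Heat (1995),Action|Crime|Thriller
"""
RATINGS_CSV = """userId,movieId,rating
1,1,5.0
1,3,3.5
2,2,4.0
"""

# With the real data, pass the file path or URL to pd.read_csv instead of a buffer.
movies = pd.read_csv(io.StringIO(MOVIES_CSV))
ratings = pd.read_csv(io.StringIO(RATINGS_CSV))

# Replace the literal "|" separator with a space.
# regex=False prevents "|" from being interpreted as regex alternation.
movies["genres"] = movies["genres"].str.replace("|", " ", regex=False)

print(movies[["title", "genres"]])
```

Passing `regex=False` is a defensive choice here: it makes the replacement behave the same way regardless of the pandas version's default for that parameter.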
- TF-IDF Vectorization: The `TfidfVectorizer` from Scikit-learn is used to transform the `genres` column into a TF-IDF matrix. This matrix represents each movie as a vector where each dimension corresponds to a genre, and the value indicates the importance of that genre to the movie.
- The `stop_words='english'` parameter is used to ignore common English words, although it has a limited effect in this context since the genres are single-word terms.
- Cosine Similarity: The `linear_kernel` function is used to compute the cosine similarity between all pairs of movie TF-IDF vectors. This results in a similarity matrix where each entry `(i, j)` represents the similarity score between movie `i` and movie `j`. A score closer to 1 indicates a higher similarity.
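The similarity step can be sketched as follows, again using hypothetical genre strings. The example also checks numerically that `linear_kernel` (a plain dot product) agrees with `cosine_similarity` here, which holds because the TF-IDF rows are unit-length by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical genre strings for three movies.
genres = [
    "Adventure Animation Children Comedy Fantasy",
    "Adventure Children Fantasy",
    "Action Crime Thriller",
]

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(genres)

# Dot products between all pairs of rows; the result is a dense (n_movies, n_movies) array.
similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

# Sanity check: on unit-length rows, the dot product equals cosine similarity.
matches_cosine = np.allclose(similarity, cosine_similarity(tfidf_matrix))

print(np.round(similarity, 3))
print(matches_cosine)
```

Using `linear_kernel` skips the normalization that `cosine_similarity` would repeat, which is a small saving on a toy example but can matter for a full pairwise matrix over thousands of movies.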
- User Profile Creation: The system takes a sample user's ratings as input. This user has provided ratings for several movies.
- Identifying Top-Rated Movies: The user's highest-rated movies are identified. These movies form the basis of the user's "profile" and are used as the seed for generating recommendations.
- Finding Similar Movies: For each of the user's top-rated movies, the system looks up its similarity scores with all other movies in the similarity matrix.
- Aggregating and Sorting: The similarity scores are aggregated, and the movies are sorted based on their total similarity to the user's favorite movies.
- Filtering and Final Recommendation:
  - Movies that the user has already watched are removed from the recommendation list.
  - The top 20 movies with the highest similarity scores are returned as the final recommendation list.
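The recommendation steps above can be sketched end to end as shown below. This is an illustration under stated assumptions, not the notebook's actual code: the data is hypothetical, the `recommend` function and its `min_rating` threshold for "top-rated" are invented for this example, and aggregation is done by summing similarity scores across the user's liked movies.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical data with the same columns as the real datasets.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5, 6],
    "title": ["A", "B", "C", "D", "E", "F"],
    "genres": [
        "Adventure Children Fantasy",
        "Adventure Fantasy",
        "Action Crime Thriller",
        "Crime Thriller",
        "Comedy Romance",
        "Adventure Children",
    ],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 3, 5],
    "rating": [5.0, 4.5, 2.0],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["genres"])
similarity = linear_kernel(tfidf_matrix, tfidf_matrix)


def recommend(user_id, min_rating=4.0, top_n=20):
    """Hypothetical helper: rank unseen movies by summed similarity to liked ones."""
    user_ratings = ratings[ratings["userId"] == user_id]
    # Assumption: "top-rated" means a rating of at least min_rating.
    liked_ids = user_ratings.loc[user_ratings["rating"] >= min_rating, "movieId"]

    # Row positions of the liked movies in the similarity matrix.
    liked_pos = np.flatnonzero(movies["movieId"].isin(liked_ids).to_numpy())

    # Aggregate: total similarity of every movie to the liked movies.
    scores = pd.Series(similarity[liked_pos].sum(axis=0))

    # Filter out everything the user has already rated.
    seen = movies["movieId"].isin(user_ratings["movieId"]).to_numpy()
    top = scores[~seen].sort_values(ascending=False).head(top_n)

    result = movies.iloc[top.index.to_numpy()][["movieId", "title"]]
    return result.assign(score=top.to_numpy())


recommendations = recommend(user_id=1)
print(recommendations)
```

Summing favors movies that resemble several of the user's favorites at once. Averaging or weighting each liked movie by its rating would be equally plausible alternatives.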
This project implements a content-based movie recommendation system using TF-IDF and cosine similarity. By analyzing the genre information of movies, the system can provide personalized recommendations based on a user's viewing history.
The approach is straightforward, demonstrating how item features alone (here, genres) can be leveraged to build a recommendation engine. This project serves as a starting point for anyone interested in building recommendation systems and exploring content-based filtering techniques.