AnderCruz/Content-Based-Movie-Recommendation-System

Content-Based Movie Recommendation System

Project Overview

This project implements a Content-Based Recommendation System for movies. The primary objective is to recommend movies to a user based on the content and attributes of movies they have previously enjoyed. Unlike collaborative filtering, which relies on user-item interactions, this approach focuses on the intrinsic properties of the items themselves.

The system builds a profile for each user from their movie ratings and then suggests new movies that are similar to their highest-rated films. Similarity is calculated from movie genres, which are vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. This project serves as a practical demonstration of building a simple recommendation engine with standard Python libraries.

Tools and Libraries Used

This project is developed in Python and leverages several key libraries for data manipulation and machine learning:

  • Python 3: The core programming language.
  • Pandas: Used extensively for data loading, manipulation, and structuring. It is the primary tool for managing the movies and ratings data.
  • NumPy: Provides support for numerical operations, although its use is minimal in this specific notebook.
  • Scikit-learn: A fundamental machine learning library used for:
    • TfidfVectorizer: To convert the textual genre data into a matrix of TF-IDF features. This is crucial for quantifying the importance of each genre in a movie's description.
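    • linear_kernel: To compute the dot product between the TF-IDF vectors of the movies. Because TfidfVectorizer L2-normalizes each vector by default, this dot product equals the cosine similarity. The resulting score is used to determine how alike two movies are based on their genres.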
  • Jupyter Notebook: The interactive development environment used to write, execute, and document the analysis.

Dataset

The analysis uses two files from the MovieLens dataset collection:

  1. movies.csv:

    • Description: Contains information about movies, including movieId, title, and genres.
    • Source: Loaded from an IBM Cloud Object Storage URL.
  2. ratings.csv:

    • Description: Contains user ratings for movies, including userId, movieId, and rating.
    • Source: Loaded from an IBM Cloud Object Storage URL.

Methodology & How It Works

The recommendation system is built following a structured, content-based filtering approach.

1. Data Loading and Preprocessing

  • The movies.csv and ratings.csv datasets are loaded into Pandas DataFrames.
  • The genres column in the movies DataFrame is preprocessed by replacing the | separator with spaces. This prepares the data for text-based feature extraction.
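A minimal sketch of this preprocessing step is shown below. The rows are illustrative stand-ins for movies.csv (the notebook loads the real file from a URL that is not reproduced here), and the variable name movies is an assumption.

```python
import pandas as pd

# Illustrative stand-in for movies.csv; the real data would come from pd.read_csv(...)
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Action|Crime|Thriller",
    ],
})

# regex=False treats "|" as a literal character rather than regex alternation
movies["genres"] = movies["genres"].str.replace("|", " ", regex=False)

print(movies["genres"].tolist())
```

Passing regex=False matters here: as a regular expression, "|" means alternation and would not match the literal separator.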

2. Content-Based Feature Extraction (TF-IDF)

  • TF-IDF Vectorization: The TfidfVectorizer from Scikit-learn is used to transform the genres column into a TF-IDF matrix. This matrix represents each movie as a vector where each dimension corresponds to a genre, and the value indicates the importance of that genre to the movie.
  • The stop_words='english' parameter is used to ignore common English words, although it has a limited effect in this context since genre labels are short terms rather than free text.
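The vectorization step might look like the following sketch. The three genre strings are made-up examples in the preprocessed (space-separated) form, and the names genres, tfidf, and tfidf_matrix are assumptions rather than the notebook's own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative genre strings, already space-separated
genres = [
    "Adventure Animation Children Comedy Fantasy",
    "Adventure Children Fantasy",
    "Action Crime Thriller",
]

tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(genres)  # sparse matrix: movies x genre terms

print(tfidf_matrix.shape)        # (3, 8): three movies, eight distinct genre terms
print(sorted(tfidf.vocabulary_)) # lowercased genre terms, one per column

# With the default norm="l2", every row has unit length
row_norms = np.sqrt(tfidf_matrix.multiply(tfidf_matrix).sum(axis=1))
print(np.allclose(row_norms, 1.0))
```

Genres that appear in fewer movies receive a higher inverse document frequency, so a rare genre contributes more to a movie's vector than a common one.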

3. Calculating Movie Similarity

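  • Cosine Similarity: The linear_kernel function computes the dot product between all pairs of movie TF-IDF vectors. Because TfidfVectorizer L2-normalizes each vector by default, this dot product equals the cosine similarity. The result is a similarity matrix where each entry (i, j) is the similarity score between movie i and movie j. Scores range from 0 (no shared genres) to 1 (identical genre vectors), and a score closer to 1 indicates a higher similarity.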
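The sketch below illustrates why linear_kernel can stand in for cosine similarity on TF-IDF output. It reuses made-up genre strings, and the name cosine_sim is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Illustrative genre strings, already space-separated
genres = [
    "Adventure Animation Children Comedy Fantasy",
    "Adventure Children Fantasy",
    "Action Crime Thriller",
]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(genres)

# Plain dot products between all pairs of rows
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Rows are unit length, so the dot product matches cosine similarity
print(np.allclose(cosine_sim, cosine_similarity(tfidf_matrix)))

print(cosine_sim[0, 1] > 0)   # movies 0 and 1 share genres
print(cosine_sim[0, 2] == 0)  # movies 0 and 2 share none
```

Skipping the normalization that cosine_similarity performs saves some work, which is a common reason to prefer linear_kernel on already-normalized vectors. Note that the full matrix has one entry per pair of movies, so its memory use grows quadratically with the catalog size.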

4. Generating Recommendations

  • User Profile Creation: The system takes a sample user's ratings as input. This user has provided ratings for several movies.
  • Identifying Top-Rated Movies: The user's highest-rated movies are identified. These movies form the basis of the user's "profile" and are used as the seed for generating recommendations.
  • Finding Similar Movies: For each of the user's top-rated movies, the system looks up its similarity scores with all other movies in the similarity matrix.
  • Aggregating and Sorting: The similarity scores are aggregated, and the movies are sorted based on their total similarity to the user's favorite movies.
  • Filtering and Final Recommendation:
    • Movies that the user has already rated are removed from the recommendation list.
    • The top 20 movies with the highest similarity scores are returned as the final recommendation list.
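The steps above can be sketched end to end as follows. This is one possible reading, not the notebook's code: the data is made up, the function name recommend and its parameters are hypothetical, "highest-rated" is assumed to mean a rating at or above min_rating, and aggregation is assumed to be a plain sum of similarity scores.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up data in the same shape as movies.csv and ratings.csv
movies = pd.DataFrame({
    "movieId": [10, 20, 30, 40, 50],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "genres": ["Adventure Children Fantasy", "Adventure Fantasy",
               "Action Crime Thriller", "Crime Thriller", "Comedy Romance"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [10, 30, 50],
    "rating": [5.0, 2.0, 4.5],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["genres"])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

def recommend(user_id, top_n=20, min_rating=4.0):
    """Hypothetical helper: rank unseen movies by summed similarity to liked ones."""
    user = ratings[ratings["userId"] == user_id]
    # Assumption: "highest-rated" means rating >= min_rating
    liked_ids = user.loc[user["rating"] >= min_rating, "movieId"]
    # Row positions of the liked movies in the similarity matrix
    liked_pos = np.flatnonzero(movies["movieId"].isin(liked_ids).to_numpy())
    # Assumption: aggregate by summing similarity to every liked movie
    scores = pd.Series(cosine_sim[liked_pos].sum(axis=0), index=movies["movieId"])
    # Remove movies the user has already rated
    scores = scores[~scores.index.isin(user["movieId"])]
    top = scores.sort_values(ascending=False).head(top_n)
    return movies.set_index("movieId").loc[top.index, ["title"]].assign(score=top)

recs = recommend(1)
print(recs)
```

In this toy example, user 1 liked movies 10 and 50, so movie 20 (which shares Adventure and Fantasy with movie 10) ranks first, while movie 40 shares no genres with either liked movie and scores zero. A summed score favors movies that resemble several liked films at once; averaging or taking the maximum would be reasonable alternatives.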

Conclusion

This project implements a content-based movie recommendation system using TF-IDF and cosine similarity. By analyzing the genre information of movies, the system provides personalized recommendations based on a user's rating history.

The approach is straightforward and demonstrates how item features can be used to build a recommendation engine. Because similarity is computed from genres alone, the recommendations reflect genre overlap rather than any other movie attribute. The project is a reasonable starting point for anyone interested in building recommendation systems and exploring content-based filtering techniques.
