Content-Based Movie Recommendation System

Project Overview

This project implements a Content-Based Recommendation System for movies. The primary objective is to recommend movies to a user based on the content and attributes of movies they have previously enjoyed. Unlike collaborative filtering, which relies on user-item interactions, this approach focuses on the intrinsic properties of the items themselves.

The system works by creating a profile for each user based on their movie ratings and then suggesting new movies that are similar to their highest-rated films. The similarity is calculated using movie genres, which are vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. This project serves as a practical demonstration of building a simple yet effective recommendation engine from scratch.

Tools and Libraries Used

This project is developed in Python and leverages several key libraries for data manipulation and machine learning:

Python 3: The core programming language.
Pandas: Used extensively for data loading, manipulation, and structuring. It is the primary tool for managing the movies and ratings data.
NumPy: Provides support for numerical operations, although its use is minimal in this specific notebook.
Scikit-learn: A fundamental machine learning library used for:
- TfidfVectorizer: To convert the textual genre data into a matrix of TF-IDF features. This is crucial for quantifying the importance of each genre in a movie's description.
- linear_kernel: To compute the dot product between the TF-IDF vectors of the movies. Because TfidfVectorizer L2-normalizes each vector by default, this dot product equals the cosine similarity, which is used to determine how alike two movies are based on their genres.
Jupyter Notebook: The interactive development environment used to write, execute, and document the analysis.

Dataset

The analysis uses two files from the MovieLens dataset collection:

movies.csv:
- Description: Contains information about movies, including movieId, title, and genres.
- Source: Loaded from an IBM Cloud Object Storage URL.
ratings.csv:
- Description: Contains user ratings for movies, including userId, movieId, and rating.
- Source: Loaded from an IBM Cloud Object Storage URL.

Methodology & How It Works

The recommendation system is built following a structured, content-based filtering approach.

1. Data Loading and Preprocessing

The movies.csv and ratings.csv datasets are loaded into Pandas DataFrames.
The genres column in the movies DataFrame is preprocessed by replacing the | separator with spaces. This prepares the data for text-based feature extraction.
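The preprocessing step can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the three-row DataFrame is a hypothetical stand-in for movies.csv (the real file would be read with pd.read_csv from its storage URL), and only the column names movieId, title, and genres come from the dataset description.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv, using the documented column names.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Action|Crime|Thriller",
    ],
})

# Replace the pipe separator with spaces so each genre becomes its own token.
# regex=False treats "|" as a literal character rather than regex alternation.
movies["genres"] = movies["genres"].str.replace("|", " ", regex=False)

print(movies["genres"].tolist())
```

Passing regex=False matters here: interpreted as a regular expression, a bare | means "empty string or empty string" and would not behave as a plain separator replacement.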

2. Content-Based Feature Extraction (TF-IDF)

TF-IDF Vectorization: The TfidfVectorizer from Scikit-learn is used to transform the genres column into a TF-IDF matrix. This matrix represents each movie as a vector where each dimension corresponds to a genre token (the default tokenizer splits a hyphenated label such as Sci-Fi into two tokens), and the value indicates the importance of that genre to the movie.
The stop_words='english' parameter is used to ignore common English words, although it has a limited effect in this context since genre labels are rarely common English words.
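A small sketch of the vectorization step, assuming the genres have already been converted to space-separated strings. The three sample strings are illustrative and not taken from the dataset files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative genre strings (already space-separated).
genres = [
    "Adventure Animation Children Comedy Fantasy",
    "Adventure Children Fantasy",
    "Action Crime Thriller",
]

# Same stop_words setting as described above; other parameters keep defaults,
# which includes lowercasing and L2 normalization of each row.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(genres)

# One row per movie, one column per distinct genre token.
print(tfidf_matrix.shape)
print(sorted(tfidf.vocabulary_))
```

The result is a sparse matrix, which keeps memory use low when the movie catalogue is large and each movie has only a handful of genres.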

3. Calculating Movie Similarity

Cosine Similarity: The linear_kernel function computes the dot product between all pairs of movie TF-IDF vectors. Because TfidfVectorizer L2-normalizes each vector by default, this dot product equals the cosine similarity. This results in a similarity matrix where each entry (i, j) represents the similarity score between movie i and movie j. Scores range from 0 to 1, and a score closer to 1 indicates a higher similarity.
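The sketch below illustrates the similarity step on a few made-up genre strings and checks that linear_kernel agrees with cosine_similarity when the vectorizer's default normalization is in effect. The sample data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Illustrative genre strings (already space-separated).
genres = [
    "Adventure Animation Children Comedy Fantasy",
    "Adventure Children Fantasy",
    "Action Crime Thriller",
]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(genres)

# Pairwise dot products between all movie vectors (dense n x n array).
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Rows are unit length, so the dot product matches the cosine similarity.
assert np.allclose(cosine_sim, cosine_similarity(tfidf_matrix))

print(np.round(cosine_sim, 2))
```

Note that the full matrix has one entry per pair of movies, so its size grows quadratically with the catalogue.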

4. Generating Recommendations

User Profile Creation: The system takes as input the ratings of a sample user who has rated several movies.
Identifying Top-Rated Movies: The user's highest-rated movies are identified. These movies form the basis of the user's "profile" and are used as the seed for generating recommendations.
Finding Similar Movies: For each of the user's top-rated movies, the system looks up its similarity scores with all other movies in the similarity matrix.
Aggregating and Sorting: The similarity scores are aggregated, and the movies are sorted based on their total similarity to the user's favorite movies.
Filtering and Final Recommendation:
- Movies that the user has already watched are removed from the recommendation list.
- The top 20 movies with the highest similarity scores are returned as the final recommendation list.
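The steps above can be sketched end to end as follows. This is one possible reading of the description rather than the notebook's code: the recommend function, its n_seeds parameter (how many top-rated movies seed the profile), the use of a plain sum to aggregate similarities, and the small sample DataFrames are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical sample data with the documented column names.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)",
              "GoldenEye (1995)", "Balto (1995)"],
    "genres": ["Adventure Animation Children Comedy Fantasy",
               "Adventure Children Fantasy",
               "Action Crime Thriller",
               "Action Adventure Thriller",
               "Adventure Animation Children"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 3, 2],
    "rating": [5.0, 2.0, 4.5],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["genres"])
sim = linear_kernel(tfidf_matrix, tfidf_matrix)


def recommend(user_id, ratings, movies, sim, n_seeds=2, top_n=20):
    """Return up to top_n unseen titles most similar to the user's top-rated movies."""
    user_ratings = ratings[ratings["userId"] == user_id]
    # Seeds: the user's n_seeds highest-rated movies (n_seeds is an assumed knob).
    seeds = user_ratings.nlargest(n_seeds, "rating")["movieId"].tolist()
    # Map each movieId to its row position in the similarity matrix.
    positions = pd.Series(range(len(movies)), index=movies["movieId"])
    seed_rows = positions.loc[seeds].to_numpy()
    # Aggregate by summing each movie's similarity to every seed.
    scores = pd.Series(sim[seed_rows].sum(axis=0), index=movies["movieId"])
    # Remove movies the user has already rated.
    scores = scores.drop(user_ratings["movieId"].tolist())
    top = scores.nlargest(top_n)
    return movies.set_index("movieId").loc[top.index, "title"].tolist()


recs = recommend(1, ratings, movies, sim)
print(recs)
```

Summing favours movies that resemble several of the user's favourites at once; averaging or weighting by rating would be reasonable alternatives.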

Conclusion

This project implements a content-based movie recommendation system using TF-IDF and cosine similarity. By analyzing the genre information of movies, the system provides personalized recommendations based on the movies a user has rated most highly.

The approach is straightforward and demonstrates how item features can be leveraged to build a recommendation engine. This project serves as a starting point for anyone interested in building recommendation systems and exploring content-based filtering techniques.

