Radwanbaba13/MovieRecommenderSystem

Introduction

This project analyzes a movie database together with user ratings. The dataset contains metadata for about 45,000 movies and 26 million ratings of those movies. The main goal is to build a movie recommendation system based on the ratings and votes of users on the platform, and to test different recommendation models of the kind widely used in today’s online media consumption.

Dataset Description

This dataset can be found on Kaggle.

The dataset comprises the 45,000 movies listed in the full MovieLens Dataset, all released before July 2017. The first dataset, the movie metadata, has 24 features:

  • Adult: Whether the movie is rated as an adult film or not (boolean)
  • Belongs_to_collection: The collection (film series) the movie belongs to, if any (dict)
  • Budget: The budget of the film (int)
  • Genres: Genres of the movie (dict)
  • Homepage: Homepage link of the film (string)
  • Id: Unique identifier for the film (int)
  • Imdb_id: Unique IMDB ID, for example "tt0114709" (string)
  • Original_language: Original language of the movie (string)
  • Original_title: Movie title (string)
  • Overview: Short description of the film (string)
  • Popularity: Popularity scale from 0 to 30 (int)
  • Poster_path: Link to movie poster (string)
  • Production_companies: Names of the production companies involved (dict)
  • Production_countries: Countries involved in the production of the film (dict)
  • Release_date: Date of the release of the movie (date)
  • Revenue: Total revenue accrued from the movie (int)
  • Runtime: Total length of the film in minutes (double)
  • Spoken_languages: Languages spoken by the characters of the movie (dict)
  • Status: Status of the movie: released, rumored, in production, etc. (string)
  • Tagline: Tagline of the movie (string)
  • Title: Title of the film (string)
  • Video
  • Vote_average: Average vote or rating given by viewers (float)
  • Vote_count: Total count of votes received for the movie (int)

The second dataset is much larger: it contains over 26 million ratings of the movies in the first dataset, from 270,000 users in total. Its features are:

  • Userid: Unique identifier for the user (int)
  • Movieid: ID of the movie being reviewed (int)
  • Rating: User rating on a scale of 0.5 to 5, in half-star steps (double)
  • Timestamp: Date of the rating (date)
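
As a rough illustration of the ratings schema above, the sketch below parses ratings rows into typed records using only the standard library. The header names (`userId`, `movieId`, `rating`, `timestamp`), the `Rating` class, and the sample rows are assumptions made for this example, and the timestamp is assumed to be stored as Unix epoch seconds; adjust these to match the actual file.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Rating:
    """One row of the ratings dataset (hypothetical record type)."""
    user_id: int
    movie_id: int
    rating: float
    rated_at: datetime


def parse_ratings(csv_text: str) -> list[Rating]:
    """Parse CSV text into Rating records; header names are assumed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Rating(
            user_id=int(row["userId"]),
            movie_id=int(row["movieId"]),
            rating=float(row["rating"]),
            # Assumption: the timestamp column holds Unix epoch seconds.
            rated_at=datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
        )
        for row in reader
    ]


# Made-up sample rows, for illustration only.
SAMPLE = (
    "userId,movieId,rating,timestamp\n"
    "1,110,1.0,1425941529\n"
    "1,147,4.5,1425942435\n"
)
ratings = parse_ratings(SAMPLE)
```

For the full 26 million rows, a streaming or chunked reader would be a better fit than holding every record in a list.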

Research Questions

We want to use this project to learn how Big Data is applied commercially. Streaming and e-commerce have become major sources of sales and revenue across many industries, so we would like hands-on experience with a project that uses techniques similar to those used in practice.

The project aims to address the following research questions:

  1. How can a movie recommendation system be built from past user ratings?
  2. Which movie feature contributes most to the effectiveness of recommendations?
  3. Which model yields the most promising results?

Model Design

Class of Models

We plan to use recommender system models to recommend movies to different user profiles. Although hybrid models are more common in practice, we selected the following two classes of models:

  • Collaborative Filtering Model: Collaborative filtering models recommend items based on user-item interactions and on similarities between users or between items. They use past user data to make recommendations, which helps capture user-movie relations and may provide more diverse recommendations.
  • Content-Based Filtering Model: Content-based filtering models recommend items whose attributes are similar to those of items a user has liked or interacted with. They connect movies a user liked in the past to other movies in the same category, which helps match recommendations to the user’s personal preferences and interests.
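
To make the content-based idea concrete, here is a minimal sketch that compares movies by the text of their overviews, using TF-IDF weights (the algorithm this project selects for content-based filtering) and cosine similarity. It is an illustration under assumptions, not the project's implementation: the overviews are made up, the helper names are hypothetical, and the IDF formula is one common smoothed variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (tf * smoothed idf)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, vectors, top_n=2):
    """Indices of the top_n documents most similar to vectors[index]."""
    scores = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_n]]


# Made-up overviews, for illustration only.
OVERVIEWS = [
    "A cowboy toy and a space ranger toy compete for a boy's attention",
    "Toys come to life and a space ranger toy meets a cowboy",
    "A detective hunts a serial killer in a rainy city",
]
vectors = tfidf_vectors(OVERVIEWS)
closest = most_similar(0, vectors, top_n=1)
```

Here `closest` holds the index of the overview that shares the most distinctive terms with the first one. A real pipeline would likely also remove stop words, since common words such as "a" still carry some weight in this sketch.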

Algorithms

The algorithms to be researched include:

  • Matrix Factorization: From the ratings dataset, we can build a user-item matrix. Matrix factorization then decomposes this large matrix into two smaller ones, one for users and one for movies. The model can then learn latent patterns and recommend movies based on each user's preferences.
  • TF-IDF (Term Frequency-Inverse Document Frequency): This is the algorithm selected for our Content-Based Filtering System. TF-IDF helps assess the relevance of movie attributes by weighting the terms in textual features such as summaries, tags, and user reviews.
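
The matrix factorization idea in the list above can be sketched as follows. This is a minimal illustration, assuming a plain stochastic gradient descent update with L2 regularization and no bias terms; the function names, hyperparameters, and toy ratings are all made up for the example and are not taken from the project's notebooks.

```python
import math
import random


def factorize(ratings, n_factors=2, n_epochs=500, lr=0.05, reg=0.02, seed=0):
    """Learn user and movie factor vectors from (user, movie, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    # Small random starting values for both factor tables.
    user_f = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    movie_f = {m: [rng.gauss(0, 0.1) for _ in range(n_factors)] for m in movies}
    mean = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(n_epochs):
        for u, m, r in ratings:
            p, q = user_f[u], movie_f[m]
            err = r - (mean + sum(pk * qk for pk, qk in zip(p, q)))
            for k in range(n_factors):
                pk, qk = p[k], q[k]
                # Gradient step on squared error with L2 regularization.
                p[k] += lr * (err * qk - reg * pk)
                q[k] += lr * (err * pk - reg * qk)
    return mean, user_f, movie_f


def predict(model, user, movie):
    """Predicted rating; falls back to the global mean for unseen ids."""
    mean, user_f, movie_f = model
    if user not in user_f or movie not in movie_f:
        return mean
    return mean + sum(pk * qk for pk, qk in zip(user_f[user], movie_f[movie]))


def rmse(model, ratings):
    """Root-mean-square error of the model on the given ratings."""
    squared = [(r - predict(model, u, m)) ** 2 for u, m, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Toy (user, movie, rating) triples, for illustration only.
RATINGS = [
    (1, 10, 5.0), (1, 11, 4.5), (1, 20, 1.0),
    (2, 10, 4.5), (2, 11, 5.0), (2, 20, 1.5),
    (3, 20, 5.0), (3, 21, 4.5), (3, 10, 1.0),
    (4, 20, 4.5), (4, 21, 5.0), (4, 10, 1.5), (4, 11, 1.0),
]
model = factorize(RATINGS)
train_rmse = rmse(model, RATINGS)
```

Only observed ratings are used in the updates, so the full user-item matrix never has to be materialized. With millions of ratings, an optimized library implementation would be the more realistic choice.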

Model Comparison

We will compare the performance of the algorithms using metrics such as accuracy, precision, recall, and F1-score, which measure how well the recommended movies match the movies a user actually likes. Models that predict ratings can also be compared using the Root-Mean-Square Error (RMSE) between predicted and actual ratings. Together, these metrics will provide insights into the strengths and weaknesses of each algorithm.
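
As a small worked example of these metrics, the sketch below computes RMSE for predicted ratings and precision, recall, and F1-score for a recommendation list. The function names and the sample values are assumptions made for illustration; which movies count as "relevant" (for example, those rated above some threshold) is a choice the evaluation has to define.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length rating lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def precision_recall_f1(recommended, relevant):
    """Precision, recall, and F1 of a recommended set against a relevant set."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up values, for illustration only.
error = rmse([4.0, 2.0], [3.0, 3.0])
precision, recall, f1 = precision_recall_f1(
    recommended=[1, 2, 3, 4],  # movie ids the model suggested
    relevant=[2, 4, 6],        # movie ids the user actually liked
)
```

In this example two of the four recommended movies are relevant, so precision is 2/4, recall is 2/3, and each prediction is off by exactly one star, giving an RMSE of 1.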

Conclusion

The project summary outlines the objectives, dataset characteristics, research questions, and model design for analyzing "The Movies Dataset" found on Kaggle.
