---
title: "Movie Recommender System Project"
authors: "Jiajia Feng, Tess Ivinjack, Leslie Cervantes Rivera, Valerie De La Fuente"
date: 03-19-15
editor: visual
format: html
code-copy: true
execute:
  message: false
  warning: false
  echo: false
  cache: true
---


## Introduction

### Objective

This project aims to develop a movie recommendation system using data from The Movie Database (TMDb) API. By applying Python-based machine learning techniques, the system will analyze user preferences and suggest relevant movies. The goal is to create a personalized, efficient recommendation system that enhances content discovery for users.

### Motivation

In today's digital entertainment landscape, users face an overwhelming number of content options, which makes personalized recommendations more essential than ever. Recommendation systems address this by offering tailored suggestions that improve user engagement and satisfaction. Platforms like Netflix and Hulu rely on data-driven algorithms, so effective recommendation systems are central to the viewer experience. Drawing on an API like TMDb, which provides extensive movie data, allows for more accurate and relevant suggestions and, in turn, deeper interaction with the platform.

### Data Description

### Methodology

Briefly mention the key methods, tools, and datasets used.

## Data Collection and Preparation

### API Data Retrieval

Describe the API used (e.g., TMDb), endpoints queried, and relevant data fields collected (e.g., genres, ratings, cast).

### Data Cleaning

Explain key preprocessing steps like handling missing values, formatting genre data, and transforming ratings.

### Feature Engineering

Describe features created for both content-based (e.g., TF-IDF for movie descriptions) and collaborative filtering (e.g., user-item matrices).

## Exploratory Data Analysis (EDA)

In this section, we use exploratory data analysis (EDA) to better understand the data. This stage gives us insight into the structure and characteristics of the data, such as missing values.

### Variables


```{python}
from prettytable import PrettyTable

# Data dictionary: one row per variable used in the project
table = PrettyTable()
table.field_names = ["Variable Name", "Description"]
table.add_row(['movie_id', 'Unique identifier of the movie.'])
table.add_row(['title', 'Title of the movie.'])
table.add_row(['release_year', 'The year the movie was released.'])
table.add_row(['genre_ids', 'List of genres associated with a movie.'])
table.add_row(['original_language', 'Original language of the movie.'])
table.add_row(['cast_names', 'List of actors in the movie.'])
table.add_row(['watch_providers', 'Streaming platforms where the movie is available.'])
table.add_row(['rating_average', 'A quantitative assessment of the overall quality of a movie.'])
table.add_row(['vote_count', 'The total number of people who voted for the movie.'])
table.add_row(['author', 'Unique identifier for the user.'])
table.add_row(['user_rating', 'Rating given by user.'])
print(table)
```

### Missing Data

![](images/clipboard-1834072139.png){width="307" height="424"}

The dataset `movie_content_df` is missing some observations in `genre_ids`, `cast_names`, and `watch_providers`.

![](images/clipboard-3610589772.png){width="323"}

In `movie_reviews_df`, the majority of `user_rating` values are missing.
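A per-column missing-value summary like the ones pictured above can be sketched with pandas. This is a minimal illustration, not the code that produced the figures: the toy `movie_content_df` below and the helper name `missing_summary` are assumptions, and the sketch assumes missing entries are stored as `None`/`NaN` (empty lists would need separate handling).

```python
import pandas as pd

# Hypothetical stand-in for `movie_content_df`; the real data comes from TMDb.
movie_content_df = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "genre_ids": [[18, 35], None, [53], [18]],
    "cast_names": [["Actor A"], ["Actor B"], None, None],
    "watch_providers": [["Apple TV"], None, None, None],
})


def missing_summary(df):
    """Return the count and share of missing values for each column."""
    counts = df.isna().sum()
    return pd.DataFrame({
        "missing_count": counts,
        "missing_share": counts / len(df),
    })


summary = missing_summary(movie_content_df)
print(summary)
```

The same helper could be applied to `movie_reviews_df` to quantify how much of `user_rating` is missing.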

#### Popular Genres

![](images/clipboard-1866196344.png){width="394"}

About 1,600 movies list Drama among their genres. Comedy and Thriller are the next most popular genres, with approximately 1,400 and 1,000 movies, respectively.
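Because each movie carries a list of genres, the counts above are per-movie mentions rather than exclusive categories. A minimal sketch of that tally, assuming genres are available as lists of names (the sample data and the `count_genres` helper are hypothetical):

```python
from collections import Counter

# Hypothetical sample of per-movie genre lists, shown as names for readability.
genre_lists = [
    ["Drama", "Comedy"],
    ["Drama", "Thriller"],
    ["Comedy"],
    None,  # a movie with missing genres
    ["Drama"],
]


def count_genres(rows):
    """Count how many movies list each genre, skipping missing entries."""
    counts = Counter()
    for genres in rows:
        if not genres:
            continue
        # set() so that a genre repeated within one movie counts only once
        counts.update(set(genres))
    return counts


genre_counts = count_genres(genre_lists)
print(genre_counts.most_common(3))
```

The same pattern would apply to the actor, provider, and reviewer counts in the following subsections.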

#### Popular Movie Languages

![](images/clipboard-1260077373.png){width="394"}

Most movies in our dataset have English as their original language, with a count of approximately 2,800. French and Japanese are the next most common languages, with approximately 500 and 100 movies, respectively.

#### Popular Actors

![](images/clipboard-1337669541.png){width="397"}

"Jr." appears as the most common actor in our movie dataset, with a count of approximately 85. We do not know exactly who "Jr." is, but we believe the name was truncated when the data was retrieved. Other actors, such as Samuel L. Jackson and Bruce Willis, were also common, with counts of approximately 45 and 42, respectively.

#### Providers

![](images/clipboard-3952843974.png){width="394"}

Our top providers are 'Apple TV', 'Google Play Movies', and 'Amazon Video'. They carry most of the movies in our dataset, approximately 3,000.

#### Reviewers

![](images/clipboard-577182410.png){width="394"}

'CinemaSerf' is our top reviewer, with approximately 300 movie reviews. 'r96sk' and 'Manuel São Bento' follow, with approximately 170 and 110 movie reviews, respectively.

## Modeling Approach

### Content-Based Filtering

Content-based filtering recommends movies based on their features rather than user interactions. We use TF-IDF vectorization to encode movie attributes and compute similarity using cosine similarity and linear kernel similarity.

**Implementation Steps:**

1.  **Feature Engineering**
    -   Combine `genre_ids`, `cast_names`, and `watch_providers` into a single text feature.
2.  **TF-IDF Vectorization**
    -   Convert text features into numerical vectors.
    -   Use `min_df=2` to filter out infrequent terms.
3.  **Computing Similarity**
    -   Use cosine similarity for text-based similarity.
    -   Integrate `rating_average` and `vote_count` with linear kernel similarity for better ranking.
4.  **Recommendation System**
    -   Use fuzzy matching to correct user input.
    -   Retrieve and rank the most similar movies based on the similarity matrix.

![Cosine Similarity Recommendation Results](images/Screenshot 2025-03-14 at 13.42.54.png)
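The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the toy `movies` table, the helpers `to_tokens` and `recommend`, the use of scikit-learn, and the use of `difflib.get_close_matches` for fuzzy matching are all our assumptions. The sketch also leaves out the `rating_average` and `vote_count` ranking step.

```python
from difflib import get_close_matches

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data; the real column values come from TMDb.
movies = pd.DataFrame({
    "title": ["Die Hard", "Pulp Fiction", "Unbreakable", "Amelie"],
    "genre_ids": [["Action", "Thriller"], ["Crime", "Thriller"],
                  ["Drama", "Thriller"], ["Comedy", "Romance"]],
    "cast_names": [["Bruce Willis"], ["Samuel L. Jackson", "Bruce Willis"],
                   ["Bruce Willis", "Samuel L. Jackson"], ["Audrey Tautou"]],
    "watch_providers": [["Apple TV"], ["Apple TV"],
                        ["Amazon Video"], ["Amazon Video"]],
})


def to_tokens(values):
    """Join list items into one string, keeping multi-word names as one token."""
    return " ".join(v.replace(".", "").replace(" ", "_") for v in values)


# Step 1: combine the three list columns into a single text feature.
movies["features"] = (
    movies["genre_ids"].apply(to_tokens) + " "
    + movies["cast_names"].apply(to_tokens) + " "
    + movies["watch_providers"].apply(to_tokens)
)

# Step 2: TF-IDF vectors; min_df=2 drops terms found in fewer than two movies.
vectorizer = TfidfVectorizer(min_df=2)
tfidf_matrix = vectorizer.fit_transform(movies["features"])

# Step 3: pairwise cosine similarity between all movies.
similarity = cosine_similarity(tfidf_matrix)


def recommend(query, n=2):
    """Fuzzy-match the query to a known title, then return the n most similar titles."""
    titles = movies["title"].tolist()
    match = get_close_matches(query, titles, n=1, cutoff=0.6)
    if not match:
        return []
    idx = titles.index(match[0])
    ranked = similarity[idx].argsort()[::-1]  # highest similarity first
    return [titles[i] for i in ranked if i != idx][:n]


recommendations = recommend("Pulp Fictoin")  # misspelled on purpose
print(recommendations)
```

Because `TfidfVectorizer` L2-normalizes each row by default, scikit-learn's `linear_kernel` returns the same values as `cosine_similarity` on this matrix, so the two differ only once unnormalized features such as `rating_average` or `vote_count` are added.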

### Collaborative Filtering

-   Explain the technique (e.g., matrix factorization with Surprise library or k-nearest neighbors).

-   Describe how user-item interactions are utilized.

### Evaluation

-   Describe performance metrics used (e.g., precision, recall, RMSE).

-   Provide results with brief interpretation.

## Results and Analysis

-   Highlight key findings from both models.

-   Compare the strengths and weaknesses of content-based and collaborative filtering approaches.

-   Discuss any challenges faced (e.g., cold start problem) and how you addressed them.

## Conclusion

-   Summarize the project's achievements.

-   Suggest potential improvements (e.g., hybrid models, enhanced data sources, or improved user interface).

-   Mention real-world applications or extensions of the system.