# Movie Recommendation System

## Project overview:

This project is about the recommendation of movies you want to watch. I used the script when I was younger to get updates about the latest movies. I revitalized it: It basically means I updated the code to Python 3.12.x.

By incorporating these advanced algorithms into the project, I aim to provide users with a diverse range of recommendations that cater to their individual preferences and viewing habits. It's fascinating to observe how companies like Amazon and Netflix leverage these methods to enhance user experience and engagement, and I'm excited to bring similar capabilities to my own project.

__1. Content-based Filtering__: Content-based Filtering recommends movies based on the attributes or features of the movies themselves, as well as the user's past preferences. It's useful for suggesting movies with similar content or characteristics to those the user has enjoyed in the past.

__2. Popularity-based Filtering__: This method recommends movies based on their overall popularity or trends, without considering the specific preferences of the user. It's simple and easy to implement, making it a good starting point for recommendation systems.

__3. Collaborative Filtering__: Collaborative Filtering recommends movies based on the preferences of users who have similar tastes to the target user. It's powerful because it doesn't require explicit information about the movies or users, but rather relies on the collective behavior of users.

## Good to know:

In addition to the migration to Python 3.12.x and the incorporation of popular movie recommendation algorithms, there are several other key aspects that developers should consider when creating the Jupyter Notebook (ipynb) for this project:

Data Preprocessing: Before applying recommendation algorithms, it's essential to preprocess the data to ensure its quality and relevance. This may involve tasks such as handling missing values, removing duplicates, and encoding categorical variables.

Exploratory Data Analysis (EDA): EDA is crucial for gaining insights into the characteristics of the movie dataset. Developers can use visualization techniques to explore trends, distributions, and relationships within the data, helping them make informed decisions during the recommendation process.

Algorithm Implementation: Developers should implement the selected recommendation algorithms using appropriate libraries and frameworks. This may involve writing code to calculate movie similarities, predict user preferences, and generate personalized recommendations.

Evaluation Metrics: It's important to evaluate the performance of the recommendation system using relevant metrics. Common evaluation metrics for recommendation systems include precision, recall, accuracy, and mean average precision (MAP). Developers can use these metrics to assess the effectiveness of their algorithms and fine-tune their models accordingly.

User Interface (UI): Consider designing a user-friendly interface for the Jupyter Notebook, allowing users to interact with the recommendation system seamlessly. This may involve incorporating widgets, interactive visualizations, and input forms to enhance the user experience.

Documentation and Collaboration: Provide comprehensive documentation within the Jupyter Notebook, explaining the project's objectives, methodologies, and implementation details. Additionally, consider collaborating with other developers and stakeholders to gather feedback, share insights, and improve the project iteratively.

# 1. Content Based Filtering

In [151]:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [152]:
# return the data of the csv file as a DataFrame object
movies = pandas.read_csv(r"data/movies_small.csv", sep=";")

In [153]:
# convert text data into TF-IDF (Term Frequency-Inverse Document Frequency) vectors
tfidf = TfidfVectorizer(stop_words="english")
movies["overview"] = movies["overview"].fillna("")

In [154]:
# transform the text data in the "overview" column of the movies DataFrame into TF-IDF vectors
tfidf_matrix = tfidf.fit_transform(movies["overview"])

In [155]:
# converting the sparse matrix tfidf_matrix into a dense pandas DataFrame for easier inspection and manipulation
pandas.DataFrame(tfidf_matrix.toarray()) # columns=tfidf.get_feature_names())
# returns a tuple representing its dimensions
tfidf_matrix.shape

(6, 121)

In [156]:
# computes the linear kernel between two matrices. In this case, you're computing the linear kernel between the TF-IDF matrix (tfidf_matrix) and itself, resulting in a similarity matrix
similarity_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
# return the second row of the similarity matrix. 
similarity_matrix[1]

array([0.        , 1.        , 0.        , 0.        , 0.        ,
       0.02259057])

In [157]:
# searching for the index of a movie titled "John Carter" in the DataFrame movies
movie_title = "John Carter"
idx = movies.loc[movies["title"]==movie_title].index[0]

In [158]:
# iterate over the similarity scores between the movie corresponding to the index idx and all other movies
scores = enumerate(similarity_matrix[idx])

In [159]:
# sorts the similarity scores obtained from the enumeration in descending order based on the similarity score (the second element of each tuple)
scores = sorted(scores, key=lambda x: x[1], reverse=True)

In [160]:
# extracts the indices of the most similar movies from the sorted scores list
movies_indices = [tpl[0] for tpl in scores[1:4]]

In [161]:
# retrieves the titles of the most similar movies based on their indices
list(movies["title"].iloc[movies_indices])

['Kung Fu Panda 3', 'Cars 2', 'The Dark Knight Rises']

In [162]:
def similar_movies(movie_title, nr_movies):
    """Find similar movies based on a given movie title.

    Parameters:
    - movie_title (str): The title of the movie for which similar movies are to be found.
    - nr_movies (int): The number of similar movies to be returned.

    Returns:
    - list: A list of titles of similar movies to the input movie_title, sorted by similarity score.
    """
    # locate the index of the specified movie title within the DataFrame
    idx = movies.loc[movies["title"]==movie_title].index[0] 
    # calculate similarity scores between the specified movie and all other movies
    scores = list(enumerate(similarity_matrix[idx]))
    # sort the similarity scores in descending order
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    # select the indices of the top similar movies
    movies_indices = [tpl[0] for tpl in scores[1:nr_movies+1]]
    # retrieve the titles of the similar movies corresponding to the selected indices
    similar_titles = list(movies["title"].iloc[movies_indices])
    
    return similar_titles 

In [163]:
# find similar movies to "Kung Fu Panda 3"
num_of_closest = 3
actual_movie = "Kung Fu Panda 3"
similar_movies(actual_movie, num_of_closest)
print(f"The top {num_of_closest} similar movies based on overview or descriptions to '{actual_movie}' are: {similar_movies(actual_movie, num_of_closest)}")

The top 3 similar movies based on overview or descriptions to 'Kung Fu Panda 3' are: ['John Carter', 'Furious 7', 'Cars 2']


# Interpretation of the output

The function similar_movies was called with the movie title "Kung Fu Panda 3" and
requested to find the top 3 similar movies based on their overviews or descriptions.
The output represents the titles of these similar movies.

Interpreting the output:

"John Carter": This movie is considered similar to "Kung Fu Panda 3" based on their
content. Despite being of a different genre (action-adventure versus animated family),
there might be certain thematic or narrative similarities that the algorithm has identified.
    
"Furious 7": This is another action-packed movie, possibly sharing some thematic elements or
audience appeal with "Kung Fu Panda 3". Both movies might offer high-energy entertainment,
albeit in different styles.
    
"Cars 2": This animated film shares the same genre as "Kung Fu Panda 3" (animated family),
and the algorithm likely identified similarities in themes, target audience, or storytelling
approach.

In summary, while the output may seem unexpected at first glance, it reflects the nuances of the
similarity calculation based on the content of movie overviews. It demonstrates the algorithm's
capability to identify connections and associations between movies beyond superficial genre categorizations.
This information can be valuable for users seeking diverse yet thematically aligned movie recommendations.

# 2. Popularity-based Filtering

In [164]:
import pandas

In [165]:
# read in data
movies = pandas.read_csv(r"data/movies.csv")
credits = pandas.read_csv(r"data/credits.csv")
ratings = pandas.read_csv(r"data/ratings.csv")

In [166]:
# Calculate weighted rating
# --------------------------
# wr = (v+(v+w))*r+(m+(v+m))*C
# v = number of votes of movies
# m = minimum number of votes required
# R = average ratingof movie
# C = average rating across all movies

In [167]:
# calculates the 90th percentile of the "vote_count" column in the DataFrame movies and assigns it to the variable m
m = movies["vote_count"].quantile(0.9)

In [168]:
# calculates the mean (average) of the "vote_average" column in the DataFrame movies and assigns it to the variable C
C = movies["vote_average"].mean()

In [169]:
# creates a filtered copy of the DataFrame movies, containing only those rows where the "vote_count" column meets a certain threshold (m), which is typically set to the 90th percentile.
movies_filtered = movies.copy().loc[movies['vote_count'] >= m]

In [170]:
def weight_rating(df, m=m, C=C):
    """Calculate weighted rating for each movie in a DataFrame,
    based on the vote count and average rating of each movie, along
    with predefined values for m and C.

    wr = (v+(v+w))*r+(m+(v+m))*C
    """
    r = df['vote_average'] # vote average (rating) of the movie
    v = df['vote_count'] # vote count of the movie
    wr = ((v / v+m) * r) + (m / (v+m) * C) # compute the weighted rating
    
    return wr

In [171]:
# assign new colum to the DataFrame where the weighted rating for each movie is displayed
movies_filtered["weighted_rating"] = movies_filtered.apply(weight_rating, axis=1)

In [172]:
# descending order and selects the top 10 movies with the highest weighted ratings. 
movies_filtered.sort_values("weighted_rating", ascending=False)[["title", "weighted_rating"]].head(10).to_dict()

{'title': {1881: 'The Shawshank Redemption',
  3337: 'The Godfather',
  2731: 'The Godfather: Part II',
  2294: 'Spirited Away',
  3865: 'Whiplash',
  1818: "Schindler's List",
  3232: 'Pulp Fiction',
  662: 'Fight Club',
  2247: 'Princess Mononoke',
  1987: "Howl's Moving Castle"},
 'weighted_rating': {1881: 15636.01514508981,
  3337: 15452.408618386708,
  2731: 15269.183636541795,
  2294: 15268.992359853999,
  3865: 15268.85833106739,
  1818: 15268.835975645323,
  3232: 15268.110922640362,
  662: 15268.015418187517,
  2247: 15086.0108233095,
  1987: 15086.004700526171}}

# Interpretation of the output

The output dictionary provides a concise representation of the top 10 movies with the highest weighted ratings, allowing for quick identification and analysis of the most highly rated movies in the dataset.

# 3. Collaborative Filtering

In [173]:
import pandas
from surprise import Dataset, Reader
from surprise import SVD
from surprise import model_selection

In [174]:
# Load ratings data from a CSV file into a DataFrame
ratings = pandas.read_csv(r"data/ratings.csv")

In [175]:
# Create a Reader object specifying the rating scale (from 1 to 5)
reader = Reader(rating_scale=(1,5))
# Load data into a surprise Dataset from a DataFrame with specific columns
dataset = Dataset.load_from_df(ratings[["userId","movieId","rating"]], reader)

In [176]:
# build the training set
trainset = dataset.build_full_trainset()

In [177]:
# Retrieve all ratings from the trainset
list(trainset.all_ratings())

[(0, 0, 2.5),
 (0, 1, 3.0),
 (0, 2, 3.0),
 (0, 3, 2.0),
 (0, 4, 4.0),
 (0, 5, 2.0),
 (0, 6, 2.0),
 (0, 7, 2.0),
 (0, 8, 3.5),
 (0, 9, 2.0),
 (0, 10, 2.5),
 (0, 11, 1.0),
 (0, 12, 4.0),
 (0, 13, 4.0),
 (0, 14, 3.0),
 (0, 15, 2.0),
 (0, 16, 2.0),
 (0, 17, 2.5),
 (0, 18, 1.0),
 (0, 19, 3.0),
 (1, 20, 4.0),
 (1, 21, 5.0),
 (1, 22, 5.0),
 (1, 23, 4.0),
 (1, 24, 4.0),
 (1, 25, 3.0),
 (1, 26, 3.0),
 (1, 27, 4.0),
 (1, 28, 3.0),
 (1, 29, 5.0),
 (1, 30, 4.0),
 (1, 31, 3.0),
 (1, 32, 3.0),
 (1, 33, 3.0),
 (1, 34, 3.0),
 (1, 35, 3.0),
 (1, 36, 3.0),
 (1, 37, 5.0),
 (1, 38, 1.0),
 (1, 39, 3.0),
 (1, 40, 3.0),
 (1, 41, 3.0),
 (1, 42, 4.0),
 (1, 43, 4.0),
 (1, 44, 5.0),
 (1, 45, 5.0),
 (1, 46, 3.0),
 (1, 47, 4.0),
 (1, 48, 3.0),
 (1, 49, 4.0),
 (1, 50, 3.0),
 (1, 51, 4.0),
 (1, 52, 2.0),
 (1, 53, 1.0),
 (1, 54, 3.0),
 (1, 55, 4.0),
 (1, 56, 4.0),
 (1, 57, 3.0),
 (1, 58, 3.0),
 (1, 59, 3.0),
 (1, 60, 3.0),
 (1, 61, 2.0),
 (1, 62, 3.0),
 (1, 63, 3.0),
 (1, 64, 3.0),
 (1, 65, 3.0),
 (1, 66, 2.0),
 (1, 

In [178]:
# initialize Singular Value Decomposition (SVD) algorithm
svd = SVD()
# train the SVD model using the trainset
svd.fit(trainset)
# predict the rating that user 15 would give to item (movie) 1956
svd.predict(15, 1956)

Prediction(uid=15, iid=1956, r_ui=None, est=3.0393524416389788, details={'was_impossible': False})

In [179]:
# perform cross-validation on the SVD model
model_selection.cross_validate(svd, dataset, measures=["RMSE", "MAE"])

{'test_rmse': array([0.89971287, 0.90339145, 0.89109571, 0.88964482, 0.8952063 ]),
 'test_mae': array([0.6926912 , 0.6926692 , 0.68529421, 0.68748454, 0.69175952]),
 'fit_time': (0.34098124504089355,
  0.373248815536499,
  0.34330010414123535,
  0.38664770126342773,
  0.3646054267883301),
 'test_time': (0.049631357192993164,
  0.024857044219970703,
  0.03126049041748047,
  0.041982173919677734,
  0.039472341537475586)}

# Interpretation of the output

The output of the cross_validate function provides performance metrics such as RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) that help evaluate the accuracy of the Singular Value Decomposition (SVD) model for collaborative filtering. Let's break down the interpretation of the output:

    Mean RMSE (Root Mean Squared Error):
        RMSE measures the average difference between the predicted ratings and the actual ratings in the test set.
        Lower RMSE values indicate better predictive accuracy.
        The mean RMSE across all folds of cross-validation provides an overall assessment of the model's accuracy in predicting user ratings.

    Mean MAE (Mean Absolute Error):
        MAE measures the average absolute difference between the predicted ratings and the actual ratings in the test set.
        Like RMSE, lower MAE values indicate better predictive accuracy.
        The mean MAE across all folds of cross-validation provides an additional measure of the model's accuracy, particularly focusing on the magnitude of prediction errors.

    Interpretation:
        If the mean RMSE and MAE values are low, it suggests that the SVD model accurately predicts user ratings, with small differences between predicted and actual ratings.
        Conversely, higher RMSE and MAE values indicate that the model's predictions deviate more from the actual ratings, suggesting lower accuracy.
        Comparing RMSE and MAE values can provide insights into the nature of prediction errors: RMSE emphasizes larger errors (due to squaring), while MAE provides a more balanced view of all errors.

    Standard Deviation (Optional):
        In addition to mean values, the output may include standard deviation values for RMSE and MAE across folds.
        Standard deviation measures the variability or spread of the error metrics across different folds.
        Higher standard deviation values suggest greater variability in the model's performance across different subsets of the data.

    Cross-Validation Folds:
        Cross-validation typically involves splitting the dataset into multiple folds (e.g., 5 or 10).
        The output may include RMSE and MAE values for each fold, providing insights into the consistency of the model's performance across different subsets of the data.

In summary, the output of cross_validate provides comprehensive information about the accuracy and consistency of the SVD model in predicting user ratings. Lower mean RMSE and MAE values, along with low standard deviation, indicate better predictive performance. These metrics help assess the effectiveness of the collaborative filtering model and guide further improvements if needed.


