<a href="https://colab.research.google.com/github/BD157/MLE-Capstone-BD/blob/main/Student_MLE_MiniProject_Recommendation_Engines_BD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Project: Recommendation Engines

Recommendation engines are algorithms designed to provide personalized suggestions or recommendations to users. These systems analyze user behavior, preferences, and interactions with items (products, movies, music, articles, etc.) to predict and offer items that users are likely to be interested in. Recommendation engines play a crucial role in enhancing user experience, driving engagement, and increasing conversion rates in various applications, including e-commerce, entertainment, content platforms, and more.

There are generally two approaches to building recommendation engines: collaborative filtering and content-based recommendation.

**1. Collaborative Filtering:**
Collaborative Filtering is a popular approach to building recommendation systems that leverages the collective behavior of users to make personalized recommendations. It is based on the idea that users who have agreed in the past will likely agree in the future. There are two main types of collaborative filtering:

- **User-based Collaborative Filtering:** This method finds users similar to the target user based on their past interactions (e.g., ratings or purchases). It then recommends items that similar users have liked but the target user has not interacted with yet.

- **Item-based Collaborative Filtering:** In this approach, the system identifies similar items based on user interactions. It recommends items that are similar to the ones the target user has already liked or interacted with.

Collaborative filtering does not require any explicit information about items but relies on the similarity between users or items. It is effective in capturing complex patterns and can provide serendipitous recommendations. However, it suffers from the cold-start problem (i.e., difficulty in recommending to new users or items with no interactions) and scalability challenges in large datasets.
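As a minimal sketch of the difference between the two types, the same rating matrix can be compared row-wise (users) or column-wise (items). The matrix `ratings_matrix` and its values are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings_matrix = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
])

# User-based: compare rows (users) with each other.
user_similarity = cosine_similarity(ratings_matrix)

# Item-based: compare columns (items) with each other.
item_similarity = cosine_similarity(ratings_matrix.T)

print(user_similarity.shape)  # one row and column per user
print(item_similarity.shape)  # one row and column per item
```

In this toy matrix the first two users rate items almost identically, so their similarity is much higher than the similarity between the first and the third user.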

**2. Content-Based Recommendation:**
Content-based recommendation is an alternative approach to building recommendation systems that focuses on the attributes or features of items and users. It leverages the characteristics of items to make recommendations. The key steps involved in content-based recommendation are:

- **Feature Extraction:** For each item, relevant features are extracted. For movies, these features could be genre, director, actors, and plot summary.

- **User Profile:** A user profile is created based on the items they have interacted with in the past. The user profile contains the weighted importance of features based on their interactions.

- **Similarity Calculation:** The similarity between items or between items and the user profile is calculated using similarity metrics like cosine similarity or Euclidean distance.

- **Recommendation:** Items that are most similar to the user profile are recommended to the user.

Content-based recommendation systems are less affected by the cold-start problem as they can still recommend items based on their features. They are also more interpretable as they rely on item attributes. However, they may miss out on providing serendipitous recommendations and can be limited by the quality of feature extraction and user profiles.
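The four steps above can be sketched on a toy example. The feature matrix, item names, and ratings below are made up for illustration, and the rating-weighted sum is only one possible way to build a user profile.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Feature extraction (hypothetical): columns are [Action, Comedy, Drama].
item_features = np.array([
    [1, 0, 0],  # item 0: action
    [1, 1, 0],  # item 1: action comedy
    [0, 0, 1],  # item 2: drama
    [0, 1, 0],  # item 3: comedy
])
item_names = ["item 0", "item 1", "item 2", "item 3"]

# The user rated items 0 and 1 with 5 and 4 stars.
rated_idx = [0, 1]
user_ratings = np.array([5.0, 4.0])

# User profile: rating-weighted sum of the rated items' features.
user_profile = user_ratings @ item_features[rated_idx]

# Similarity between the profile and every item.
scores = cosine_similarity(user_profile.reshape(1, -1), item_features).ravel()

# Recommendation: rank the unseen items by similarity to the profile.
unseen = [i for i in range(len(item_names)) if i not in rated_idx]
ranked = sorted(unseen, key=lambda i: scores[i], reverse=True)
print([item_names[i] for i in ranked])
```

The comedy item ranks above the drama item here because the profile carries some comedy weight but no drama weight.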

**Choosing Between Collaborative Filtering and Content-Based:**
Both collaborative filtering and content-based approaches have their strengths and weaknesses. The choice between them depends on the specific requirements of the recommendation system, the type of data available, and the user base. Hybrid approaches that combine collaborative filtering and content-based techniques are also common, aiming to leverage the strengths of both methods and mitigate their weaknesses.
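One simple way to think about a hybrid is a weighted blend of the two scores. The sketch below assumes both scores are already on the same 0 to 1 scale; the function `hybrid_score`, the weight `alpha`, and all score values are hypothetical.

```python
def hybrid_score(cf_score, content_score, alpha=0.5):
    """Blend two scores that are assumed to be on the same 0-1 scale."""
    return alpha * cf_score + (1 - alpha) * content_score


# Made-up scores for three titles from each recommender.
cf_scores = {"Movie A": 0.9, "Movie B": 0.2, "Movie C": 0.6}
content_scores = {"Movie A": 0.1, "Movie B": 0.8, "Movie C": 0.7}

blended = {
    title: hybrid_score(cf_scores[title], content_scores[title], alpha=0.5)
    for title in cf_scores
}
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking[0])
```

A title that both recommenders rate moderately well can outrank one that only a single recommender favors strongly.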

In this mini-project, you'll be building both content-based and collaborative filtering engines for the [MovieLens 25M dataset](https://grouplens.org/datasets/movielens/25m/). The MovieLens 25M dataset is one of the most widely used and popular datasets for building and evaluating recommendation systems. It is provided by the GroupLens Research project, which collects and studies datasets related to movie ratings and recommendations. The MovieLens 25M dataset contains movie ratings and other related information contributed by users of the MovieLens website.

**Dataset Details:**
- **Size:** The dataset contains approximately 25 million movie ratings.
- **Users:** It includes ratings from over 162,000 users.
- **Movies:** The dataset consists of ratings for more than 62,000 movies.
- **Ratings:** The ratings are provided on a scale of 0.5 to 5 stars in half-star increments, where 0.5 is the lowest rating and 5 is the highest.
- **Timestamps:** Each rating is associated with a timestamp, indicating when the rating was given.

**Data Files:**
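The three CSV files most relevant to this project are listed below (the download also contains additional files such as links and tag genome data):

1. **movies.csv:** Contains information about movies, including the movie ID, title (with the release year in parentheses), and genres.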
   - Columns: movieId, title, genres

2. **ratings.csv:** Contains movie ratings provided by users, including the user ID, movie ID, rating, and timestamp.
   - Columns: userId, movieId, rating, timestamp

3. **tags.csv:** Contains user-generated tags for movies, including the user ID, movie ID, tag, and timestamp.
   - Columns: userId, movieId, tag, timestamp
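As a small sketch of working with the `movies.csv` layout, the release year can be pulled out of the title and the genres string can be split into a list. The two sample rows, the dataframe name `movies_25m`, and the assumption that genres are pipe-separated are illustrative; real files may contain titles without a year.

```python
import io

import pandas as pd

# Two illustrative rows in the movieId, title, genres layout described above.
sample_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)
movies_25m = pd.read_csv(sample_csv)

# Pull a four-digit year out of a trailing "(YYYY)" in the title.
year_text = movies_25m["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies_25m["year"] = pd.to_numeric(year_text)

# Split the pipe-separated genres string into a list per movie.
movies_25m["genre_list"] = movies_25m["genres"].str.split("|")

print(movies_25m[["title", "year", "genre_list"]])
```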

First, import all the libraries you'll need.

In [2]:
import zipfile
import numpy as np
import pandas as pd
from urllib.request import urlretrieve
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Next, download the relevant components of the MovieLens dataset. Note that the code below downloads the smaller MovieLens 100K dataset (`ml-100k.zip`) rather than the 25M version described above, and that its files use a different layout (`u.data`, `u.item`). These instructions are roughly based on the colab [here](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/recommendation-systems/recommendation-systems.ipynb?utm_source=ss-recommendation-systems&utm_campaign=colab-external&utm_medium=referral&utm_content=recommendation-systems#scrollTo=O3bcgduFo4s6).

In [3]:
print("Downloading movielens data...")

urlretrieve('http://files.grouplens.org/datasets/movielens/ml-100k.zip', 'movielens.zip')
zip_ref = zipfile.ZipFile('movielens.zip', 'r')
zip_ref.extractall()
print("Done. Dataset contains:")
print(zip_ref.read('ml-100k/u.info'))

ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')

# The movies file contains a binary feature for each genre.
genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]
movies_cols = [
    'movie_id', 'title', 'release_date', "video_release_date", "imdb_url"
] + genre_cols
movies = pd.read_csv(
    'ml-100k/u.item', sep='|', names=movies_cols, encoding='latin-1')

Downloading movielens data...
Done. Dataset contains:
b'943 users\n1682 items\n100000 ratings\n'


Before doing any kind of machine learning, it's always good to familiarize yourself with the datasets you'll be working with.

Here are your tasks:

1. Spend some time familiarizing yourself with both the `movies` and `ratings` dataframes. How many unique user ids are present? How many unique movies are there?
2. Create a new dataframe that merges the `movies` and `ratings` tables on 'movie_id'. Only keep the 'user_id', 'title', 'rating' fields in this new dataframe.
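A minimal sketch of the second task on toy data: merge on the shared key, then select the three requested fields. The dataframes `toy_ratings` and `toy_movies` and their values are made up. Note that the later cells in this notebook rely on `movie_id` and the genre columns, so they work with the full merged dataframe instead of the three-column version.

```python
import pandas as pd

# Tiny stand-ins for the `ratings` and `movies` dataframes (values are made up).
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [4, 3, 5],
})
toy_movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A", "Movie B"],
    "release_date": ["01-Jan-1995", "01-Jan-1996"],
})

# Merge on the shared key, then keep only the three requested fields.
toy_merged = pd.merge(toy_ratings, toy_movies, on="movie_id")
toy_slim = toy_merged[["user_id", "title", "rating"]]

print(toy_slim)
print("Unique users:", toy_ratings["user_id"].nunique())
```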

In [4]:
# Spend some time familiarizing yourself with both the movies and ratings
# dataframes. How many unique user ids are present? How many unique movies
# are there?

# Number of unique users
# We calculate this by counting the number of unique values in the user_id column
num_unique_users = ratings['user_id'].nunique()
print(f"Number of unique users: {num_unique_users}")

# Number of unique movies
# We calculate this by counting the number of unique values in the movie_id column
num_unique_movies = movies['movie_id'].nunique()
print(f"Number of unique movies: {num_unique_movies}")

Number of unique users: 943
Number of unique movies: 1682


In [5]:
# Merge movies and ratings dataframes
# Before merging the dataset we look at the contents and the structure of both
# datasets that need to be merged, that is the ratings and the movies datasets.
# Ratings dataset
print("Ratings Dataset:")
print("Column Names:", ratings.columns.tolist())
print("Number of Rows and Columns:", ratings.shape)

# Movies dataset
print("\nMovies Dataset:")
print("Column Names:", movies.columns.tolist())
print("Number of Rows and Columns:", movies.shape)

Ratings Dataset:
Column Names: ['user_id', 'movie_id', 'rating', 'unix_timestamp']
Number of Rows and Columns: (100000, 4)

Movies Dataset:
Column Names: ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url', 'genre_unknown', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
Number of Rows and Columns: (1682, 24)


In [6]:
# The two datasets can be merged based on the 'movie_id' as it is the key variable
merged_df = pd.merge(ratings, movies, on='movie_id')

# Display the first few rows of the merged dataset
print(merged_df.head())

# Print the column names and shape to make sure the merge succeeded.
# Merged dataset
print("\nMerged Dataset:")
print("Column Names:", merged_df.columns.tolist())
print("Number of Rows and Columns:", merged_df.shape)

   user_id  movie_id  rating  unix_timestamp                       title  \
0      196       242       3       881250949                Kolya (1996)   
1      186       302       3       891717742    L.A. Confidential (1997)   
2       22       377       1       878887116         Heavyweights (1994)   
3      244        51       2       880606923  Legends of the Fall (1994)   
4      166       346       1       886397596         Jackie Brown (1997)   

  release_date  video_release_date  \
0  24-Jan-1997                 NaN   
1  01-Jan-1997                 NaN   
2  01-Jan-1994                 NaN   
3  01-Jan-1994                 NaN   
4  01-Jan-1997                 NaN   

                                            imdb_url  genre_unknown  Action  \
0    http://us.imdb.com/M/title-exact?Kolya%20(1996)              0       0   
1  http://us.imdb.com/M/title-exact?L%2EA%2E+Conf...              0       0   
2  http://us.imdb.com/M/title-exact?Heavyweights%...              0       0  

In [9]:
# List all unique user_ids in the merged_df
unique_user_ids = merged_df['user_id'].unique()
print(unique_user_ids)

[196 186  22 244 166 298 115 253 305   6  62 286 200 210 224 303 122 194
 291 234 119 167 299 308  95  38 102  63 160  50 301 225 290  97 157 181
 278 276   7  10 284 201 287 246 242 249  99 178 251  81 260  25  59  72
  87  42 292  20  13 138  60  57 223 189 243  92 241 254 293 127 222 267
  11   8 162 279 145  28 135  32  90 216 250 271 265 198 168 110  58 237
  94 128  44 264  41  82 262 174  43  84 269 259  85 213 121  49 155  68
 172  19 268   5  80  66  18  26 130 256   1  56  15 207 232  52 161 148
 125  83 272 151  54  16  91 294 229  36  70  14 295 233 214 192 100 307
 297 193 113 275 219 218 123 158 302  23 296  33 154  77 270 187 170 101
 184 112 133 215  69 104 240 144 191  61 142 177 203  21 197 134 180 236
 263 109  64 114 239 117  65 137 257 111 285  96 116  73 221 235 164 281
 182 129  45 131 230 126 231 280 288 152 217  79  75 245 282  78 118 283
 171 107 226 306 173 185 150 274 188  48 311 165 208   2 205 248  93 159
 146  29 156  37 141 195 108  47 255  89 140 190  2

As mentioned in the introduction, content-based filtering is a recommendation engine approach that focuses on the attributes or features of items (products, movies, music, articles, etc.) and leverages these features to make personalized recommendations. The underlying idea is to match the characteristics of items with the preferences of users to suggest items that align with their interests. Content-based filtering is particularly useful when explicit user-item interactions (e.g., ratings or purchases) are sparse or unavailable.

**Key Steps in Content-Based Filtering:**

1. **Feature Extraction:**
   - For each item, relevant features are extracted. These features are typically descriptive attributes that can be represented numerically, such as genre, director, actors, author, publication date, and keywords.
   - In the case of text-based items, natural language processing techniques may be used to extract features like TF-IDF (Term Frequency-Inverse Document Frequency) scores.

2. **User Profile Creation:**
   - A user profile is created based on the items they have interacted with in the past. The user profile contains the weighted importance of features based on their interactions.
   - For example, if a user has watched several action movies, the action genre feature would receive a higher weight in their profile.

3. **Similarity Calculation:**
   - The similarity between items or between items and the user profile is calculated using similarity metrics like cosine similarity, Euclidean distance, or Pearson correlation.
   - Cosine similarity is commonly used as it measures the cosine of the angle between two vectors, which represents their similarity.

4. **Recommendation:**
   - Items that are most similar to the user profile are recommended to the user. These are items whose features have the highest similarity scores with the user profile.
   - The recommended items are presented as a list sorted by their similarity scores.
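To make the similarity step concrete, here is a small sketch that computes cosine similarity by hand, as the dot product divided by the product of the vector lengths, and compares it with scikit-learn's `cosine_similarity`. The vectors `a` and `b` are arbitrary examples.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two example feature vectors, e.g. [Action, Comedy, Drama] flags.
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])

# Cosine of the angle between a and b: dot product over the product of norms.
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The library function expects 2-D inputs (one row per vector).
library = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, library)
```

Both values agree up to floating-point rounding, which is also why similarity scores of identical vectors can print as slightly more than 1.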

**Advantages of Content-Based Filtering:**
1. **Reduced Cold-Start Problem for Items:** Content-based filtering can recommend new items that have no interaction history because it relies on item features. A new user still needs at least a few interactions (or stated preferences) before a profile can be built.

2. **User Independence:** The recommendations are based solely on the features of items and do not require knowledge of other users' preferences or behavior.

3. **Transparency:** Content-based recommendations are interpretable, as they depend on the features of items, making it easier for users to understand why specific items are recommended.

4. **Novel Items:** Content-based filtering can recommend items that no other user has interacted with yet, as long as their features match the user's profile. Recommendations outside the user's known interests are less likely.

5. **Diversity in Recommendations:** The method can offer diverse recommendations since it suggests items with different feature combinations.

**Limitations of Content-Based Filtering:**
1. **Limited Discovery:** Content-based filtering may struggle to recommend items outside the scope of users' historical interactions or interests.

2. **Over-Specialization:** Users may receive recommendations that are too similar to their previous choices, leading to a lack of exposure to new item categories.

3. **Dependency on Feature Quality:** The quality and relevance of item features significantly influence the quality of recommendations.

4. **Limited for Cold Items:** Content-based filtering can struggle to recommend new items with limited feature information.

Here is your task:

1. Write a function that takes in a user id and the dataframe you created before that contains 'user_id', 'title', and 'rating'. The function should return content-based recommendations for this user. Here are steps you can take:

  A. Get the user's rated movies

  B. Create a TF-IDF matrix using movie genres. Note, this can be extracted from the `movies` dataframe.

  C. Compute the cosine similarity between movie genres. Use the [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function.

  D. Get the indices of similar movies to those rated by the user based on cosine similarity. Keep only the top 5.

  E. Remove duplicates and movies already rated by the user.

In [8]:
# Content-Based Filtering using Movie Genres
def content_based_recommendation(user_id, merged_df):
    # Get the user's rated movies
    # First filter the merged dataset to get all rows where the 'user_id' column
    # matches the user_id supplied as the input to the function.
    # Then extract the ids of all movies rated by that user into a list called rated_movie_ids.
    user_ratings = merged_df[merged_df['user_id'] == user_id]
    rated_movie_ids = user_ratings['movie_id'].tolist()

    # Create a TF-IDF matrix using movie genres
    # First, combine the movie genres into a single description for each movie
    # This is done by joining all genres with a value of 1 (e.g., Action, Comedy) into a space-separated string
    # Then, convert these genre descriptions into numerical representations using the TF-IDF method
    genre_columns = [
        "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
        "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
        "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
    ]
    # To do this, we create a new column 'genres' in the dataset by applying a lambda function across all rows.
    # This function concatenates the names of the genres with a value of 1 into a single string for each row.
    # Note that merged_df has one row per rating, so the same movie appears in many rows.
    merged_df['genres'] = merged_df[genre_columns].apply(lambda row: ' '.join([genre for genre, val in row.items() if val == 1]), axis=1)

    # Use TF-IDF Vectorizer on the 'genres' column
    # Use TfidfVectorizer from scikit-learn
    # Convert the genres column into a TF-IDF matrix, which assigns weights to
    # words (genres) based on their frequency in the dataset and how unique
    # they are across the dataset.
    # stop_words='english' ensures common words are ignored during vectorization.
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['genres'])

    # Compute the cosine similarity between movie genres
    # Now we use cosine similarity to measure how similar two rows are based on
    # their genre descriptions.
    # This helps recommend movies that are close to the ones a user has already rated.
    # Cosine similarity ranges from -1 to 1 in general; TF-IDF vectors are
    # non-negative, so here it ranges from 0 (no shared genres) to 1 (same genre profile).
    # The resulting matrix cosine_sim stores the similarity scores for all pairs of rows.
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

    # Get the indices of the similar movies based on cosine similarity
    recommendations = {}

    # Find the top 5 most similar movies based on cosine similarity.
    # Loop through each movie that the user has rated (rated_movie_ids).
    # For each movie, find the index of its first row in the data (movie_index).
    # Look up the similarity scores of that row against all other rows
    # in the cosine_sim matrix and sort them in descending order.
    # Select the top 5 similar movies, skipping the first entry (assumed to be
    # the movie itself), and store each title and similarity score
    # in the recommendations dictionary.

    for movie_id in rated_movie_ids:
        movie_index = merged_df[merged_df['movie_id'] == movie_id].index[0]
        similar_movies = list(enumerate(cosine_sim[movie_index]))

        # Sort the movies based on cosine similarity and get the top 5 (not including the movie itself)
        similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:6]

        for idx, score in similar_movies:
            movie_title = merged_df.iloc[idx]['title']
            if movie_title not in recommendations:
                recommendations[movie_title] = score

    # Remove duplicates and movies already rated by the user
    # Keep only those movies that the user hasn't rated
    recommended_movies = [(movie, score) for movie, score in recommendations.items() if movie not in user_ratings['title'].values]

    # Sort in descending order of similarity score
    recommended_movies = sorted(recommended_movies, key=lambda x: x[1], reverse=True)

    # Return only the top 5 movie recommendations
    return recommended_movies[:5]

# Example:
user_id = 22
recommended_movies = content_based_recommendation(user_id, merged_df)

print("Top 5 content-based movie recommendations:")
for movie, score in recommended_movies:
    print(f"{movie} - Similarity Score: {score}")

KeyboardInterrupt: 

In [None]:
# I was getting a "session crashed" error message due to excessive RAM usage,
# which is most likely caused by the cosine similarity calculation:
# cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# The TF-IDF matrix is built from merged_df, which has one row per rating
# (100,000 rows), not one row per movie (1,682). A matrix of 1,682 × 1,682
# (2,829,124 values) would fit in memory easily, but 100,000 × 100,000 is
# 10 billion values, roughly 80 GB as 64-bit floats, which overloads memory.
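A quick back-of-the-envelope check of how much memory a dense similarity matrix needs, assuming 8 bytes per value (64-bit floats). It compares one row per movie (1,682) with one row per rating (100,000, the shape of the ratings table printed earlier). The helper name `dense_similarity_bytes` is hypothetical.

```python
def dense_similarity_bytes(n_rows, bytes_per_value=8):
    """Memory needed for a dense n_rows x n_rows matrix of 64-bit floats."""
    return n_rows * n_rows * bytes_per_value


per_movie = dense_similarity_bytes(1_682)     # one row per movie
per_rating = dense_similarity_bytes(100_000)  # one row per rating

print(f"{per_movie / 1e6:.1f} MB")   # 22.6 MB
print(f"{per_rating / 1e9:.1f} GB")  # 80.0 GB
```

The per-movie matrix is small, so a crash is far more plausible when the similarity is computed over every rating row.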

In [11]:
# To avoid the memory problem caused by the code above, do not calculate the cosine
# similarity for all rows at once. Keep the TF-IDF matrix in its sparse format
# (which TfidfVectorizer returns by default) and compare only one rated movie
# at a time against all the others.

def content_based_recommendation(user_id, merged_df):
    # Get the user's rated movies
    user_ratings = merged_df[merged_df['user_id'] == user_id]
    rated_movie_ids = user_ratings['movie_id'].tolist()
    # Create a TF-IDF matrix using movie genres
    genre_columns = [
        "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
        "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
        "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
    ]
    # Combine genres into a single string
    merged_df['genres'] = merged_df[genre_columns].apply(lambda row: ' '.join(
        [genre for genre, val in row.items() if val == 1]), axis=1)

    # use sparse matrix for efficiency
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(merged_df['genres'])

    # Compute cosine similarity only for rated movies using sparse matrix format
    recommendations = {}

    for movie_id in rated_movie_ids:
        movie_index = merged_df[merged_df['movie_id'] == movie_id].index[0]

        # Compute similarity only for the current movie
        movie_similarities = cosine_similarity(tfidf_matrix[movie_index], tfidf_matrix).flatten()

        # Get top 5 similar movies, do not include the movie itself
        similar_movies = np.argsort(movie_similarities)[::-1][1:6]

        for idx in similar_movies:
            movie_title = merged_df.iloc[idx]['title']
            score = movie_similarities[idx]
            if movie_title not in recommendations:
                recommendations[movie_title] = score

    # Remove duplicates and movies already rated by the user
    recommended_movies = [(movie, score) for movie, score in recommendations.items()
                          if movie not in user_ratings['title'].values]

    # Sort recommendations by score
    recommended_movies = sorted(recommended_movies, key=lambda x: x[1], reverse=True)

    return recommended_movies[:5]

# Example
user_id = 22
recommended_movies = content_based_recommendation(user_id, merged_df)

print("Top 5 content-based movie recommendations:")
for movie, score in recommended_movies:
    print(f"{movie} - Similarity Score: {score}")


Top 5 content-based movie recommendations:
George of the Jungle (1997) - Similarity Score: 1.0000000000000002
Mouse Hunt (1997) - Similarity Score: 1.0000000000000002
Santa Clause, The (1994) - Similarity Score: 1.0000000000000002
Mediterraneo (1991) - Similarity Score: 1.0000000000000002
Adventures of Robin Hood, The (1938) - Similarity Score: 1.0000000000000002
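The function above builds its TF-IDF matrix from the merged dataframe, which has one row per rating, so many rows share exactly the same genres and tie at a score of 1. As an alternative sketch (not the method used above), the similarity can be computed over one row per movie, using the binary genre flags directly instead of TF-IDF and a rating-weighted user profile. The function name `recommend_by_genre` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def recommend_by_genre(user_id, ratings_df, movies_df, genre_cols, top_n=5):
    """Rank unrated movies by cosine similarity to the user's genre profile."""
    # One row per movie: the binary genre flags act as feature vectors.
    features = movies_df.set_index("movie_id")[genre_cols].astype(float)
    titles = movies_df.set_index("movie_id")["title"]

    user_rows = ratings_df[ratings_df["user_id"] == user_id]
    rated_ids = user_rows["movie_id"].tolist()

    # User profile: rating-weighted sum of the genre vectors of rated movies.
    weights = user_rows["rating"].to_numpy(dtype=float)
    profile = weights @ features.loc[rated_ids].to_numpy()

    # Score every movie against the profile, then drop the ones already rated.
    scores = cosine_similarity(profile.reshape(1, -1), features.to_numpy()).ravel()
    ranked = pd.Series(scores, index=features.index).drop(rated_ids)
    top = ranked.sort_values(ascending=False).head(top_n)
    return [(titles[movie_id], float(score)) for movie_id, score in top.items()]


# Toy data with made-up titles and two genre columns.
toy_movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Action": [1, 1, 0, 1],
    "Comedy": [0, 1, 1, 0],
})
toy_ratings = pd.DataFrame({"user_id": [7], "movie_id": [1], "rating": [5]})

recs = recommend_by_genre(7, toy_ratings, toy_movies, ["Action", "Comedy"])
for title, score in recs:
    print(f"{title} - Similarity Score: {score:.3f}")
```

With the notebook's data, this would be called with the `ratings` and `movies` dataframes and the `genre_cols` list defined earlier; the results would differ from the output shown above.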


The key idea behind collaborative filtering is that users who have agreed in the past will likely agree in the future. Instead of relying on item attributes or user profiles, collaborative filtering identifies patterns of user behavior and item preferences from the interactions present in the data.

**Types of Collaborative Filtering:**
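There are two main types of collaborative filtering:

- **User-based:** finds users similar to the target user and recommends items those users liked.
- **Item-based:** finds items similar to the ones the target user has already liked.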

**Collaborative Filtering Process:**
The collaborative filtering process typically involves the following steps:

1. **Data Collection:**
   - Gather data on user-item interactions, such as movie ratings, product purchases, or article clicks.

2. **User-Item Matrix:**
   - Organize the data into a user-item matrix, where rows represent users, columns represent items, and the entries contain the users' interactions (e.g., ratings).

3. **Similarity Calculation:**
   - Calculate the similarity between users or items using similarity metrics such as cosine similarity, Pearson correlation, or Jaccard similarity.
   - For user-based collaborative filtering, user similarities are calculated, and for item-based collaborative filtering, item similarities are calculated.

4. **Neighborhood Selection:**
   - For each user or item, select the most similar users or items as the neighborhood.
   - The size of the neighborhood (the number of similar users or items to consider) is an important parameter to control the system's behavior.

5. **Prediction Generation:**
   - Predict the ratings for items that the target user has not yet interacted with by combining the ratings of neighboring users or items.

6. **Recommendation Generation:**
   - Recommend items with the highest predicted ratings to the target user.
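The six steps above can be sketched on a small, made-up user-item matrix. The prediction rule used here, a similarity-weighted average over the neighbors who rated the item, is one common choice and an assumption of this sketch; the user and item labels are hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Steps 1-2: hypothetical interactions arranged as a user-item matrix (0 = unrated).
user_item = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 4, 0],
     [4, 4, 5, 1],
     [1, 0, 2, 5]],
    index=["u1", "u2", "u3", "u4"],
    columns=["A", "B", "C", "D"],
    dtype=float,
)

# Step 3: user-user cosine similarity.
sim = pd.DataFrame(
    cosine_similarity(user_item), index=user_item.index, columns=user_item.index
)

# Step 4: neighborhood = the k most similar other users.
target, k = "u1", 2
neighbors = sim[target].drop(target).nlargest(k)

# Step 5: predict each unrated item as a similarity-weighted average of the
# neighbors' ratings, counting only neighbors who actually rated the item.
unrated_items = user_item.columns[(user_item.loc[target] == 0).to_numpy()]
predictions = {}
for item in unrated_items:
    neighbor_ratings = user_item.loc[neighbors.index, item]
    rated = neighbor_ratings > 0
    if rated.any():
        weighted_sum = (neighbors[rated] * neighbor_ratings[rated]).sum()
        predictions[item] = float(weighted_sum / neighbors[rated].sum())

# Step 6: recommend the items with the highest predicted rating first.
recommended = sorted(predictions, key=predictions.get, reverse=True)
print(recommended, predictions)
```

Changing `k` changes which users contribute to each prediction, which is why the neighborhood size matters.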

**Advantages of Collaborative Filtering using User-Item Interactions:**
- Collaborative filtering is based solely on user interactions and does not require knowledge of item attributes, making it useful for cases where item data is sparse or unavailable.
- It can provide serendipitous recommendations, suggesting items that users may not have discovered on their own.
- Collaborative filtering can be applied in various domains, including e-commerce, music, movie, and content recommendations.

**Limitations of Collaborative Filtering:**
- The cold-start problem: Collaborative filtering struggles to recommend to new users or items with no or limited interaction history.
- It may suffer from sparsity when data is limited or when users have only interacted with a small subset of items.
- Scalability issues can arise with large datasets and an increasing number of users or items.

Here is your task:

1. Write a function that takes in a user id and the dataframe you created before that contains 'user_id', 'title', and 'rating'. The function should return collaborative filtering recommendations for this user based on a user-item interaction matrix. Here are steps you can take:

  A. Create the user-item matrix using Pandas' [pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html).

  B. Fill missing values with zeros in this matrix.

  C. Calculate user-user similarity matrix using cosine similarity.

  D. Get the array of similarity scores of the target user with all other users from the similarity matrix.

  E. Extract, say, the top 5 most similar users (excluding the target user).

  F. Generate movie recommendations based on the most similar users.

  G. Remove duplicate movie recommendations.

In [13]:
# Collaborative Filtering using User-Item Interactions
def collaborative_filtering_recommendation(user_id, df):
    # Create the user-item matrix
    user_item_matrix = df.pivot_table(index='user_id', columns='title', values='rating')

    # Fill missing values with 0 (indicating no rating)
    user_item_matrix = user_item_matrix.fillna(0)

    # Calculate user-user similarity matrix using cosine similarity
    user_similarity = cosine_similarity(user_item_matrix)
    user_similarity_df = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index)

    # Get the similarity scores of the target user with all other users
    similarity_scores = user_similarity_df[user_id]

    # Find the top 5 most similar users (not including the target user)
    top_similar_users = similarity_scores.drop(user_id).nlargest(5).index

    # Generate movie recommendations based on the most similar users
    recommended_movies = set()
    for similar_user in top_similar_users:
        similar_user_rated_movies = df[df['user_id'] == similar_user]['title'].tolist()
        recommended_movies.update(similar_user_rated_movies)

    # Remove movies the target user has already rated
    # (the set above has already removed duplicates)
    user_rated_movies = df[df['user_id'] == user_id]['title'].tolist()
    final_recommendations = [movie for movie in recommended_movies if movie not in user_rated_movies]

    # Return 5 recommendations (a set is unordered, so these are not ranked)
    return final_recommendations[:5]

# Example usage
user_id = 22
recommended_movies = collaborative_filtering_recommendation(user_id, merged_df)

print("Top 5 collaborative filtering movie recommendations:")
for movie in recommended_movies:
    print(movie)

Top 5 collaborative filtering movie recommendations:
Kingpin (1996)
Good Will Hunting (1997)
Rosencrantz and Guildenstern Are Dead (1990)
Apocalypse Now (1979)
Judge Dredd (1995)
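The function above collects titles in a set, so the five titles it returns are not ordered by any score. As a hedged variant (not the method used above), candidate titles can be ranked by the similarity-weighted sum of the neighbors' ratings. The function name `ranked_cf_recommendation`, its parameters, and the toy dataframe are hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def ranked_cf_recommendation(user_id, df, n_neighbors=5, top_n=5):
    """Rank unseen titles by the similarity-weighted sum of neighbor ratings."""
    matrix = df.pivot_table(index="user_id", columns="title", values="rating").fillna(0)
    sim = pd.DataFrame(
        cosine_similarity(matrix), index=matrix.index, columns=matrix.index
    )
    neighbors = sim[user_id].drop(user_id).nlargest(n_neighbors)

    # Weight each neighbor's ratings by that neighbor's similarity to the target.
    scores = matrix.loc[neighbors.index].mul(neighbors, axis=0).sum()

    # Keep only titles the target user has not rated, then sort by score.
    unseen = scores[matrix.loc[user_id] == 0]
    return unseen.sort_values(ascending=False).head(top_n).index.tolist()


# Toy data with made-up titles: user 2 is similar to user 1, user 3 is not.
toy_df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 2, 3, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B",
              "Movie C", "Movie D", "Movie C", "Movie D"],
    "rating": [5, 4, 5, 4, 5, 2, 1, 5],
})

print(ranked_cf_recommendation(1, toy_df, n_neighbors=2))
```

Because the ranking is driven by explicit scores, repeated calls return the same order for the same data.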


Now, test your recommendation engines! Select a few user ids and generate recommendations using both functions you've written. Are the recommendations similar? Do the recommendations make sense?

In [15]:
# Test the recommendation engines

# User_ids to test
user_ids_to_test = [682, 315, 47, 788, 238, 322, 537, 319]

# Test both recommendation functions for each user
for user_id in user_ids_to_test:
    print(f"\nUser ID: {user_id}")

    # Content-based recommendations
    content_recommendations = content_based_recommendation(user_id, merged_df)
    print("\nContent-Based Recommendations:")
    for movie, score in content_recommendations:
        print(f"{movie} - Score: {score}")

    # Collaborative filtering recommendations
    collaborative_recommendations = collaborative_filtering_recommendation(user_id, merged_df)
    print("\nCollaborative Filtering Recommendations:")
    for movie in collaborative_recommendations:
        print(f"{movie}")


User ID: 682

Content-Based Recommendations:
Mouse Hunt (1997) - Score: 1.0000000000000002
Santa Clause, The (1994) - Score: 1.0000000000000002
Diabolique (1996) - Score: 1.0000000000000002
Dolores Claiborne (1994) - Score: 1.0000000000000002
Adventures of Robin Hood, The (1938) - Score: 1.0000000000000002

Collaborative Filtering Recommendations:
Tin Men (1987)
Microcosmos: Le peuple de l'herbe (1996)
Deer Hunter, The (1978)
Rosencrantz and Guildenstern Are Dead (1990)
Judge Dredd (1995)

User ID: 315

Content-Based Recommendations:
Hackers (1995) - Score: 1.0000000000000002
Indiana Jones and the Last Crusade (1989) - Score: 1.0000000000000002
Adventures of Robin Hood, The (1938) - Score: 1.0000000000000002
Raiders of the Lost Ark (1981) - Score: 1.0000000000000002
Conan the Barbarian (1981) - Score: 1.0000000000000002

Collaborative Filtering Recommendations:
Tin Men (1987)
Kingpin (1996)
Good Will Hunting (1997)
Deer Hunter, The (1978)
Rosencrantz and Guildenstern Are Dead (1990)



In [None]:
# If a user has liked a movie before, content-based recommendations will suggest movies
# in similar genres, so the user gets recommendations for similar types of films.
# Collaborative filtering looks at what other users with similar likes and dislikes have rated.
# It may recommend movies that do not match the genres of the user's previously liked movies
# but that are popular with people who rated movies similarly.
# The two methods therefore produce different suggestions:
# content-based recommendations focus on movie genres, while collaborative filtering
# focuses on what similar users liked. Both methods are useful
# and offer different ways to find new movies.
# Content-based recommendations help discover movies that the user is likely to enjoy based on
# their favorite genres, while collaborative filtering suggests movies liked by other
# people with similar tastes, even if they fall outside the user's past preferences.
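One way to put a number on "are the recommendations similar?" is the Jaccard overlap between the two title lists. This is a small sketch; the helper `jaccard_overlap` is hypothetical, and the titles are copied from the user 682 output above.

```python
def jaccard_overlap(list_a, list_b):
    """Share of titles the two lists have in common (0 = disjoint, 1 = identical)."""
    set_a, set_b = set(list_a), set(list_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Titles from the user 682 output shown earlier.
content_titles = [
    "Mouse Hunt (1997)",
    "Santa Clause, The (1994)",
    "Diabolique (1996)",
    "Dolores Claiborne (1994)",
    "Adventures of Robin Hood, The (1938)",
]
collab_titles = [
    "Tin Men (1987)",
    "Microcosmos: Le peuple de l'herbe (1996)",
    "Deer Hunter, The (1978)",
    "Rosencrantz and Guildenstern Are Dead (1990)",
    "Judge Dredd (1995)",
]

print(jaccard_overlap(content_titles, collab_titles))  # 0.0
```

For this user the two lists share no titles, which matches the observation that the two methods suggest different movies.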