## **Notebook Summary**

This notebook demonstrates how to build and compare movie recommendation systems using two main approaches: Collaborative Filtering and Content-Based Filtering.

**Dataset:** The analysis utilizes the **MovieLens dataset**, which contains movie metadata and user ratings. Specifically, the `movies_metadata.csv` and `ratings_small.csv` files are loaded and processed.

**Techniques Used:**

1.  **Collaborative Filtering:**
    *   A User-Item Matrix is created to represent user ratings for different movies.
    *   The `NearestNeighbors` model from `sklearn` is used with cosine similarity to find similar movies based on user rating patterns.
    *   Recommendations are generated by finding movies similar to the input movie based on how users have rated them.

2.  **Content-Based Filtering:**
    *   The movie overviews are used as the content for analysis.
    *   Two different vectorization techniques are applied to the movie overviews:
        *   **CountVectorizer:** Creates a Document-Term Matrix (DTM) of raw counts for the 500 most frequent word pairs (bigrams), with English stop words removed.
        *   **TF-IDF Vectorizer:** Creates a TF-IDF matrix over the same kind of bigram vocabulary, weighting each term by its importance in a document relative to the entire corpus.
    *   Cosine similarity is used to calculate the similarity between movies based on their vectorized overviews.
    *   Recommendations are generated by finding movies with similar content to the input movie.

**Conclusions:**

The notebook provides a comparative view of the recommendations generated by each method. The results demonstrate that:

*   **Collaborative Filtering** recommends movies that are often enjoyed by the same users, even if the movies themselves are not similar in content.
*   **Content-Based Filtering** recommends movies that share similar textual content in their overviews, regardless of how users have rated them.

The comparison highlights the different types of recommendations each approach provides and suggests that a hybrid approach combining both methods could offer more comprehensive and relevant recommendations.

#### **Load Libraries and Dataset**

In [None]:
!pip install fuzzywuzzy
!pip install python-Levenshtein



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
moviesdf=pd.read_csv("movies_metadata.csv",low_memory=False)

In [None]:
ratings=pd.read_csv("ratings_small.csv")

In [None]:
moviesdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [None]:
moviesdf.isnull().sum()

Unnamed: 0,0
adult,0
belongs_to_collection,40972
budget,0
genres,0
homepage,37684
id,0
imdb_id,17
original_language,11
original_title,0
overview,954


In [None]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int64  
 1   movieId    100004 non-null  int64  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


#### **Data Preprocessing**

In [None]:
# Rename Columns in ratings csv - movieId as id
ratings.columns=['userId','id','rating','timestamp']

In [None]:
# Convert id variable in moviesdf into numeric
moviesdf.id=pd.to_numeric(moviesdf.id,errors="coerce")
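
As a small illustration of what `errors="coerce"` does, the sketch below converts a hypothetical id column (the values are made up, not taken from `movies_metadata.csv`). Entries that cannot be parsed become `NaN`, and the column ends up as floats, which is consistent with `id` appearing as `949.0` after the merge.

```python
import pandas as pd

# Hypothetical id column: two valid ids and one malformed entry
raw_ids = pd.Series(["862", "8844", "not-an-id"])

# errors="coerce" converts what it can and replaces the rest with NaN
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")

print(numeric_ids.tolist())
print(int(numeric_ids.isna().sum()), "value(s) could not be parsed")
```

Rows whose id became `NaN` simply find no partner in the inner merge that follows, so they drop out of the merged frame.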

In [None]:
# Merge both moviesdf and ratings into one dataframe
moviesdf_new=moviesdf.merge(ratings,on="id")
moviesdf_new.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,userId,rating,timestamp
0,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949.0,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,23,3.5,1148721092
1,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949.0,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,102,4.0,956598942
2,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949.0,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,232,2.0,955092697


#### **Collaborative Filtering Based Recommendation**

In [None]:
# Create the User Item Matrix
user_item_matrix=moviesdf_new.pivot_table(index=['userId'],columns=['title'],
                                          values="rating",aggfunc="mean").fillna(0)
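
To make the shape of this matrix concrete, here is a minimal sketch on a hypothetical five-row ratings table (the users and ratings are invented for illustration). It uses the same `pivot_table` call pattern as the cell above.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title":  ["Heat", "Speed", "Heat", "Speed", "Friday"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# One row per user, one column per title; unrated cells become 0
toy_matrix = toy.pivot_table(index="userId", columns="title",
                             values="rating", aggfunc="mean").fillna(0)
print(toy_matrix)
```

Note that filling with 0 makes "not rated" indistinguishable from a rating of 0, which is a simplification that cosine-based neighbours tolerate reasonably well on sparse data.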

In [None]:
user_item_matrix.head(2)

title,!Women Art Revolution,'Gator Bait,'Twas the Night Before Christmas,...And God Created Woman,00 Schneider - Jagd auf Nihil Baxter,10 Items or Less,10 Things I Hate About You,"10,000 BC",11'09''01 - September 11,12 Angry Men,...,Zodiac,Zombie Flesh Eaters,Zombie Holocaust,Zozo,eXistenZ,xXx,¡Three Amigos!,À nos amours,Ödipussi,Şaban Oğlu Şaban
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Several methods can be used for collaborative similarity:
# 1) Nearest Neighbors, 2) Singular Value Decomposition (SVD),
# 3) Non-negative Matrix Factorization (NMF)
# This notebook uses Nearest Neighbors.

In [None]:
from sklearn.neighbors import NearestNeighbors

In [None]:
cf_nn_model=NearestNeighbors(metric="cosine",     # Calculates the cosine
                             algorithm="brute",   # similarity between all
                             n_neighbors=10,      # the user vectors.
                             n_jobs=-1)



In [None]:
cf_nn_model.fit(user_item_matrix)

In [None]:
distances,indices=cf_nn_model.kneighbors(user_item_matrix)
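
With `NearestNeighbors`, the rows of the fitted array are the things being compared: fitting on a user x title matrix compares users, while fitting on its transpose compares movies. The sketch below shows the movie-as-row case on a hypothetical 3 x 3 matrix; the titles and ratings are assumptions made up for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical item x user matrix: each row is one movie's ratings
titles = ["A", "B", "C"]
item_vectors = np.array([
    [5.0, 4.0, 0.0],   # A
    [4.0, 5.0, 0.0],   # B: rated much like A
    [0.0, 0.0, 5.0],   # C: rated by a different user
])

toy_model = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
toy_model.fit(item_vectors)

# Query with movie A; the nearest neighbour is A itself at distance ~0
dist, idx = toy_model.kneighbors(item_vectors[[0]])
nearest = [titles[i] for i in idx[0]]
print(nearest, dist[0].round(3))
```

The `metric="cosine"` option returns cosine distance (1 minus cosine similarity), so smaller values mean more similar rating patterns.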

In [None]:
from fuzzywuzzy import process

In [None]:
def movie_recommender_engine(movie_name, matrix, cf_model, n_recs):
    # Item-based filtering needs one row per movie, so transpose the
    # user x title matrix before fitting the model passed in
    item_matrix = matrix.T
    cf_model.fit(item_matrix)

    # Fuzzy-match the input against the titles that actually have ratings
    movie_title = process.extractOne(movie_name, item_matrix.index.tolist())[0]

    # Query n_recs neighbours; the closest one is the input movie itself
    movie_vector = item_matrix.loc[movie_title].values.reshape(1, -1)
    distances, indices = cf_model.kneighbors(movie_vector, n_neighbors=n_recs)

    # Pair each neighbour with its cosine distance, closest first,
    # and drop the first entry (the input movie)
    movie_rec_ids = sorted(zip(indices.squeeze().tolist(),
                               distances.squeeze().tolist()),
                           key=lambda x: x[1])[1:]

    # List to store recommendations
    cf_recs = []
    for idx, dist in movie_rec_ids:
        cf_recs.append({'Title': item_matrix.index[idx], 'Distance': dist})

    # Number the n_recs - 1 recommendations starting from 1
    df = pd.DataFrame(cf_recs, index=range(1, n_recs))

    return df

In [None]:
n_recs=10
movie_recommender_engine("Heat",user_item_matrix,cf_nn_model,n_recs)
#movie_recommender_engine("Vampire in Brooklyn",user_item_matrix,cf_nn_model,n_recs)



Unnamed: 0,Title,Distance
1,Pushing Hands,0.6772
2,Friday,0.671089
3,Kids in the Hall: Brain Candy,0.666445
4,The Neverending Story III: Escape from Fantasia,0.662785
5,Beautiful Girls,0.661726
6,A Pyromaniac's Love Story,0.652426
7,No Escape,0.649934
8,Stargate,0.646235
9,Farewell My Concubine,0.614588


#### **Content Based Recommendation**

**First we build feature matrices from the movie overviews (referred to below as similarity matrices), then use the cosine similarity score to calculate similarity between movies**

##### **1. Similarity Matrix Calculation using Count Vectorizer**

In [None]:
# Content Based recommender system
pd.set_option("display.max_colwidth",None)
moviesdf.overview.head(2) # NLP - text data

Unnamed: 0,overview
0,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."
1,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures."


In [None]:
# Fill missing overviews with an empty string
moviesdf.overview=moviesdf.overview.fillna("")

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
DTM=CountVectorizer(max_features=500, stop_words="english",ngram_range=(2,2))

In [None]:
X_DTM=DTM.fit_transform(moviesdf.overview) # Document-term matrix of bigram counts

In [None]:
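# vocabulary_ maps term -> column index; get_feature_names_out() lists terms in column order
pd.DataFrame(X_DTM.toarray(),columns=DTM.get_feature_names_out()).head()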

Unnamed: 0,los angeles,cat mouse,young boy,accused murder,las vegas,new york,life story,serial killer,young man,coming age,...,tv series,make life,forced confront,hard working,feature documentary,based book,true events,comedy central,comedy special,stand special
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
X_DTM.shape

(45466, 500)

##### **2. Similarity matrix calculation using TF-IDF vectorizer**

Term Frequency-Inverse Document Frequency (TF-IDF) is the product of TF and IDF: TF-IDF = TF * IDF

**Why this is an improvement over count vectorizer:**

1. **Downweights common words:** By multiplying TF by IDF, words that are very common across the corpus (high TF in many documents, but low IDF) get a lower TF-IDF score. This prevents common words from dominating the representation and giving misleading similarity scores.
2. **Highlights important words:** Words that are frequent in a specific document but rare in the overall corpus (high TF in one document, high IDF overall) get a higher TF-IDF score. These words are often more discriminative and indicative of the document's unique content.
3. **Better for similarity calculations:** When you calculate similarity between documents (for example with cosine similarity, as the recommender in this notebook does), TF-IDF weights lead to more meaningful results because they emphasize the terms that are most relevant and unique to each document. This helps distinguish documents that are truly similar in content from those that just happen to share many common words.
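
The weighting described above can be sketched in plain Python. The example assumes the smoothed IDF form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is how scikit-learn documents the default behaviour of `TfidfVectorizer`; the counts and terms are hypothetical.

```python
import math

# Hypothetical bigram counts per document (rows) for three terms
terms = ["new york", "serial killer", "young man"]
counts = [
    [1, 1, 0],   # doc 0
    [1, 0, 1],   # doc 1
    [1, 0, 0],   # doc 2
]

n_docs = len(counts)
# Document frequency: number of documents containing each term
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf_rows = [tfidf_row(row) for row in counts]
for term, weight in zip(terms, tfidf_rows[0]):
    print(f"{term}: {weight:.3f}")
```

In doc 0 both bigrams occur once, yet "serial killer" ends up with the larger weight because it appears in only one of the three documents while "new york" appears in all of them.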

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf=TfidfVectorizer(max_features=500,stop_words="english",ngram_range=(2,2))

In [None]:
X_tfidf=tfidf.fit_transform(moviesdf.overview) # TF-IDF weighted bigram matrix

In [None]:
# Label columns in column order rather than with the vocabulary_ dict keys
pd.DataFrame(X_tfidf.toarray(),columns=tfidf.get_feature_names_out()).head()

Unnamed: 0,los angeles,cat mouse,young boy,accused murder,las vegas,new york,life story,serial killer,young man,coming age,...,tv series,make life,forced confront,hard working,feature documentary,based book,true events,comedy central,comedy special,stand special
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### **Recommendation Using Cosine Similarity**

Here's why cosine similarity is needed on top of the vectorized matrix:

1. **The matrix provides the representation, not the comparison:** Each row of the DTM or TF-IDF matrix (called the similarity matrix in this notebook) is a vector for one movie, essentially a point in a high-dimensional space where each dimension corresponds to a term in the vocabulary and the value along that dimension is the term's count or TF-IDF score for that movie.
2. **Cosine similarity measures the angle between vectors:** It calculates the cosine of the angle between two such vectors, so movies whose vectors point in the same direction score close to 1 and movies with no terms in common score 0.
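
A short worked example may help here. The sketch below computes cosine similarity by hand for a few hypothetical count vectors; it illustrates one way scores of exactly 1.0 and 0.707107 can arise when each overview contains only one or two of the 500 bigrams. It is not a claim about which bigrams any particular movie in the dataset contains.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical bigram counts over a three-term vocabulary
movie_a = [1, 0, 0]   # contains only the first bigram
movie_b = [2, 0, 0]   # same bigram, repeated
movie_c = [1, 1, 0]   # shares one bigram, adds another
movie_d = [0, 0, 1]   # no bigram in common

print(round(cosine(movie_a, movie_b), 6))
print(round(cosine(movie_a, movie_c), 6))
print(round(cosine(movie_a, movie_d), 6))
```

Because only the direction of a vector matters, `movie_a` and `movie_b` are treated as identical even though their counts differ, while sharing one of two bigrams gives 1/sqrt(2).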

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def content_based_recommender(movie_title, movies_df, similarity_matrix, n_recommendations=10):
    """
    Recommends similar movies based on content using a pre-calculated similarity matrix
    (either from CountVectorizer or TF-IDF) and cosine similarity.

    Args:
        movie_title (str): The title of the movie to get recommendations for.
        movies_df (pd.DataFrame): The DataFrame containing movie information, including 'title'.
        similarity_matrix (scipy.sparse.csr_matrix or np.ndarray): The matrix representing
                                                                    the movie content features
                                                                    (e.g., X_DTM or X_tfidf).
        n_recommendations (int): The number of recommendations to return.

    Returns:
        pd.DataFrame: A DataFrame containing the top N recommended movies and their similarity scores.
    """

    # Find the index of the input movie
    try:
        movie_index = movies_df[movies_df['title'] == movie_title].index[0]
    except IndexError:
        print(f"Movie '{movie_title}' not found in the dataset.")
        return pd.DataFrame()

    # Get the vector for the input movie from the similarity matrix
    input_movie_vector = similarity_matrix[movie_index]

    # Calculate the similarity scores between the input movie and all other movies
    # This calculates a 1 x n matrix of similarity scores
    similarity_scores = cosine_similarity(input_movie_vector, similarity_matrix).flatten()

    # Create a list of (index, score) tuples
    similarity_scores = list(enumerate(similarity_scores))

    # Sort the movies by similarity score, highest first
    sorted_similar_movies = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Keep the top N similar movies, skipping the first entry (expected to be the input movie)
    recommended_movies_indices = [i[0] for i in sorted_similar_movies[1:n_recommendations+1]]
    recommended_movies_scores = [i[1] for i in sorted_similar_movies[1:n_recommendations+1]]

    # Get the titles and create DataFrame
    recommended_movies_titles = movies_df['title'].iloc[recommended_movies_indices].tolist()

    recommendations_df = pd.DataFrame({
        'Title': recommended_movies_titles,
        'Similarity Score': recommended_movies_scores
    })

    return recommendations_df
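
Skipping the first sorted entry assumes the input movie always ranks first. When another movie with a lower row index ties with it at 1.0, the stable sort places that movie first instead, so the slice drops it and keeps the input movie. A possible alternative is to mask the query index explicitly; `top_similar` below is a hypothetical helper written for this illustration, not part of the notebook.

```python
import numpy as np

def top_similar(scores, query_index, n):
    """Return the n highest-scoring indices, never the query itself (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float).copy()
    scores[query_index] = -np.inf               # mask the query explicitly
    order = np.argsort(-scores, kind="stable")  # descending, ties keep index order
    return order[:n].tolist()

# Index 2 is the query; index 0 ties with it at 1.0
toy_scores = [1.0, 0.2, 1.0, 0.7]
print(top_similar(toy_scores, query_index=2, n=2))
```

Masking by index keeps the tied movie at index 0 in the results, whereas slicing off the first sorted entry would have removed it and returned the query.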

**Set the parameters**

In [None]:
# Specify the movie title you want recommendations for
movie_to_recommend = "Heat" # You can change this to any movie title from your dataset

# Specify the number of recommendations you want
num_recommendations = 10

##### **Recommendation using Document-Term Matrix (X_DTM)**

In [None]:
# Call the function to get recommendations
recommended_movies = content_based_recommender(
    movie_to_recommend,
    moviesdf,
    X_DTM,
    num_recommendations
)

# Display the recommended movies
if not recommended_movies.empty:
    print(f"Recommendations Similar to '{movie_to_recommend}' Using CountVectorizer:")
    display(recommended_movies)
else:
    print(f"Could not find recommendations similar to '{movie_to_recommend}'.")

Recommendations Similar to 'Heat' Using CountVectorizer:


Unnamed: 0,Title,Similarity Score
0,The Hunt for Red October,1.0
1,The Lodger,1.0
2,Friday,0.707107
3,My Family,0.707107
4,Speed,0.707107
5,Rising Sun,0.707107
6,Blade Runner,0.707107
7,Son in Law,0.707107
8,T-Men,0.707107
9,Bean,0.707107


##### **Recommendation using TF-IDF Matrix (X_tfidf)**

In [None]:
# Call the function to get recommendations
recommended_movies = content_based_recommender(
    movie_to_recommend,
    moviesdf,
    X_tfidf,
    num_recommendations
)

# Display the recommended movies
if not recommended_movies.empty:
    print(f"Recommendations Similar to '{movie_to_recommend}' Using TfidfVectorizer:")
    display(recommended_movies)
else:
    print(f"Could not find recommendations similar to '{movie_to_recommend}'.")

Recommendations Similar to 'Heat' Using TfidfVectorizer:


Unnamed: 0,Title,Similarity Score
0,The Hunt for Red October,1.0
1,The Lodger,1.0
2,Masterminds,0.78895
3,Duel,0.78895
4,Quigley Down Under,0.78895
5,Show Me,0.78895
6,Someone's Watching Me!,0.78895
7,The Groundstar Conspiracy,0.78895
8,Open Windows,0.78895
9,A House In The Hills,0.78895


#### **Final Comparison**

In [None]:
def compare_movie_recommendations(movie_name, movies_df, user_item_matrix, X_DTM, X_tfidf, cf_model, n_recommendations=10):
    """
    Generates and compares movie recommendations using Collaborative Filtering
    and Content-Based methods (CountVectorizer and TF-IDF).

    Args:
        movie_name (str): The title of the movie to get recommendations for.
        movies_df (pd.DataFrame): The DataFrame containing movie information, including 'title'.
        user_item_matrix (pd.DataFrame): User-item matrix for Collaborative Filtering.
        X_DTM (scipy.sparse.csr_matrix): Document-Term Matrix from CountVectorizer.
        X_tfidf (scipy.sparse.csr_matrix): TF-IDF matrix from TfidfVectorizer.
        cf_model (sklearn.neighbors.NearestNeighbors): Nearest Neighbors model; it is refit here on the transposed matrix.
        n_recommendations (int): The number of recommendations to return for each method.

    Returns:
        pd.DataFrame: A DataFrame containing recommendations from all methods for comparison.
    """

    all_recommendations = []

    # --- Collaborative Filtering Recommendations ---
    # Transpose the user_item_matrix to have movies as rows and users as columns
    user_item_matrix_T = user_item_matrix.T
    # Fit the collaborative filtering model on the transposed matrix
    cf_model.fit(user_item_matrix_T)

    # Find the index of the input movie in the transposed matrix
    try:
        cf_movie_index_T = user_item_matrix_T.index.get_loc(movie_name)
        # Get the vector for the input movie from the transposed matrix
        input_movie_vector_cf = user_item_matrix_T.iloc[cf_movie_index_T].values.reshape(1, -1)

        distances, indices = cf_model.kneighbors(input_movie_vector_cf, n_neighbors=n_recommendations + 1)

        # Get the indices of the recommended movies (excluding the input movie itself)
        cf_rec_indices = indices.squeeze().tolist()[1:]
        cf_rec_distances = distances.squeeze().tolist()[1:]

        # Get the titles and distances of the recommended movies
        for i in range(len(cf_rec_indices)):
             all_recommendations.append({'Title': user_item_matrix_T.index[cf_rec_indices[i]],
                                         'Score/Distance': cf_rec_distances[i],
                                         'Method': 'Collaborative Filtering'})

    except KeyError:
        print(f"Movie '{movie_name}' not found in the Collaborative Filtering matrix.")


    # --- Content Based Recommendations (CountVectorizer) ---
    dtm_recommendations_df = content_based_recommender(movie_name, movies_df, X_DTM, n_recommendations)
    if not dtm_recommendations_df.empty:
        dtm_recommendations_df['Method'] = 'Content Based (CountVectorizer)'
        dtm_recommendations_df.rename(columns={'Similarity Score': 'Score/Distance'}, inplace=True)
        all_recommendations.extend(dtm_recommendations_df.to_dict('records'))


    # --- Content Based Recommendations (TF-IDF) ---
    tfidf_recommendations_df = content_based_recommender(movie_name, movies_df, X_tfidf, n_recommendations)
    if not tfidf_recommendations_df.empty:
        tfidf_recommendations_df['Method'] = 'Content Based (TF-IDF)'
        tfidf_recommendations_df.rename(columns={'Similarity Score': 'Score/Distance'}, inplace=True)
        all_recommendations.extend(tfidf_recommendations_df.to_dict('records'))

    # Combine and display results
    if all_recommendations:
        comparative_df = pd.DataFrame(all_recommendations)
        return comparative_df
    else:
        return pd.DataFrame()

In [None]:
# Specify the movie title and the number of recommendations
movie_to_compare = "Heat"  # Replace with the movie title you want to compare
num_recommendations_compare = 5

# Call the comparison function
comparison_results = compare_movie_recommendations(
    movie_to_compare,
    moviesdf,
    user_item_matrix,
    X_DTM,
    X_tfidf,
    cf_nn_model,
    num_recommendations_compare
)

# Display the comparative table
if not comparison_results.empty:
    print(f"Comparative Movie Recommendations for '{movie_to_compare}':")
    display(comparison_results)
else:
    print(f"Could not generate comparative recommendations for '{movie_to_compare}'.")

Comparative Movie Recommendations for 'Heat':


Unnamed: 0,Title,Score/Distance,Method
0,Good Neighbor Sam,0.432895,Collaborative Filtering
1,Nell,0.450922,Collaborative Filtering
2,Adaptation.,0.467349,Collaborative Filtering
3,Lucky You,0.48683,Collaborative Filtering
4,eXistenZ,0.491969,Collaborative Filtering
5,The Hunt for Red October,1.0,Content Based (CountVectorizer)
6,The Lodger,1.0,Content Based (CountVectorizer)
7,Friday,0.707107,Content Based (CountVectorizer)
8,My Family,0.707107,Content Based (CountVectorizer)
9,Speed,0.707107,Content Based (CountVectorizer)
