Machine Learning Project - 8: **Movie Recommendation System using Cosine Similarity**

**Load Data into Colab:**

In [1]:
import pandas as pd

df = pd.read_csv("movies.csv")

print(df.head())

   Movie_ID      Title             Genre  \
0         1  Inception  Sci-Fi, Thriller   
1         2     Avatar    Sci-Fi, Action   
2         3    Titanic    Drama, Romance   

                                        Overview             Keywords  
0     A thief who enters the dreams of others...  dream, subconscious  
1                 A marine on an alien planet...        alien, future  
2  A love story on the ill-fated Titanic ship...      love, shipwreck  


**Handling Missing Data:**

In [3]:
# Fill missing overviews with an empty string.
# Assign the result back instead of calling fillna(..., inplace=True) on the
# column selection, which pandas warns may operate on a copy.
df["Overview"] = df["Overview"].fillna("")
print(df.isnull().sum())                # Check if any missing values remain

Movie_ID    0
Title       0
Genre       0
Overview    0
Keywords    0
dtype: int64


**Convert Text to Numerical Features (TF-IDF Vectorization):**

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert movie overviews to TF-IDF features
vectorizer = TfidfVectorizer(stop_words = "english")
tfidf_matrix = vectorizer.fit_transform(df["Overview"])

print(tfidf_matrix.shape)                               # (number of movies, number of unique terms after stop-word removal)

(3, 12)
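
To make the numbers in the TF-IDF matrix less opaque, here is a minimal hand-rolled sketch of the weighting. It assumes raw term counts, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation, which is my understanding of the `TfidfVectorizer` defaults. The tokenised `docs` are hypothetical and are not the contents of `movies.csv`.

```python
import math
from collections import Counter

# Hypothetical, already-tokenised overviews (stop words removed by hand)
docs = [
    ["thief", "dreams", "thief"],
    ["marine", "alien", "planet"],
    ["love", "ship", "dreams"],
]

# Columns of the matrix: every distinct term, in sorted order
vocabulary = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: in how many overviews each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(doc):
    """Return the L2-normalised TF-IDF vector for one tokenised overview."""
    counts = Counter(doc)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

tfidf_rows = [tfidf_row(doc) for doc in docs]

# Rows x columns, analogous to tfidf_matrix.shape
print(len(tfidf_rows), len(vocabulary))
```

A term that appears in many overviews ("dreams" here) gets a lower IDF than a term unique to one overview ("thief"), so it contributes less to the similarity scores computed later.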


**Compute Cosine Similarity:**

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarity between all movies
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(cosine_sim.shape)                                     # Should be (number of movies, number of movies)

(3, 3)
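
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on three hypothetical feature vectors to show what each entry of the similarity matrix means. The helper name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical feature vectors for three movies
vectors = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 2.0],
]

# Pairwise similarity matrix: entry [i][j] compares movie i with movie j
similarity = [[cosine(u, v) for v in vectors] for u in vectors]
for row in similarity:
    print([round(s, 3) for s in row])
```

The matrix is symmetric, its diagonal is 1 (every movie is identical to itself), and vectors with no shared non-zero feature score 0. The score depends only on direction, so the third vector's larger magnitude does not change its similarity to the others.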


**Build Recommendation Function:**

In [8]:
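# Create a mapping of movie title to row index.
# Keep only the first row for any repeated title so the lookup returns a single index.
movie_index = pd.Series(df.index, index = df["Title"])
movie_index = movie_index[~movie_index.index.duplicated(keep="first")]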

def recommend_movies(movie_title, num_recommendations = 5):
  if movie_title not in movie_index:
    return "Movie not found in dataset!"

  idx = movie_index[movie_title]

  # Get similarity score for all movies with the given movie
  sim_scores = list(enumerate(cosine_sim[idx]))

  # Sort movies based on similarity score
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

  # Get top N similar movies (excluding itself)
  sim_scores = sim_scores[1:num_recommendations+1]

  # Get movie indices
  movie_indices = [i[0] for i in sim_scores]

  return df["Title"].iloc[movie_indices]

# Test the function
print(recommend_movies("Inception"))

1     Avatar
2    Titanic
Name: Title, dtype: object
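
The slice `sim_scores[1:num_recommendations+1]` assumes the query movie always sorts into first place. That can fail when another movie ties with it, for example when two rows share an identical overview. A minimal sketch of a more defensive selection step filters the query out by index before sorting. The helper `top_similar` and the similarity row are hypothetical.

```python
def top_similar(sim_row, query_idx, num_recommendations=5):
    """Return indices of the most similar items, skipping the query itself."""
    candidates = [(i, score) for i, score in enumerate(sim_row) if i != query_idx]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:num_recommendations]]

# Hypothetical similarity row for movie 1; movie 0 has an identical
# overview, so it ties with the query at 1.0
sim_row = [1.0, 1.0, 0.2, 0.6]

print(top_similar(sim_row, query_idx=1, num_recommendations=2))  # [0, 3]
```

With the positional slice, the same row would return indices 1 and 3, recommending the query movie to itself and dropping its closest match.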


**Try Your Own Movie:**

In [13]:
print(recommend_movies("The Dark Knight Rises"))
print(recommend_movies("Avatar"))

Movie not found in dataset!
0    Inception
2      Titanic
Name: Title, dtype: object
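
The lookup is an exact string match, so a typo or a different capitalisation produces "Movie not found in dataset!" even when the movie is present. One possible refinement, sketched here with the standard-library `difflib` module, suggests the closest known title. The helper `suggest_title`, the hard-coded title list, and the 0.6 cutoff are assumptions for illustration.

```python
from difflib import get_close_matches

# Titles taken from the sample output above
titles = ["Inception", "Avatar", "Titanic"]

def suggest_title(query, known_titles, cutoff=0.6):
    """Return the closest known title (case-insensitive), or None if nothing is close."""
    lookup = {title.lower(): title for title in known_titles}
    matches = get_close_matches(query.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

print(suggest_title("inceptoin", titles))
print(suggest_title("The Dark Knight Rises", titles))
```

The first call resolves the misspelling to "Inception". The second returns `None` because no title in the list is close enough, so the caller can still report that the movie is missing.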
