# Movie Recommendation System 🎬
Hi there! Welcome to my project, where I’m learning Data Science by building a movie recommendation system from scratch. I’ve added insights, explanations, and extra features to enhance learning. This project is inspired by @siddhardhan's YouTube tutorial [link](https://www.youtube.com/watch?v=YRkN5k47NSY). Suggestions and improvements are always welcome. Let’s explore and learn together! 🚀

## Step 1: Importing the Required Libraries
Before we start, we need some libraries to help us handle data, process text, and calculate similarity.

In [99]:
# Import dependencies
import pandas as pd
import numpy as np
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

- pandas (pd): Load, manipulate, and analyze data in tabular form.
- numpy (np): Perform numerical operations and handle arrays.
- difflib: Find close matches (helps handle typos in movie titles).
- TfidfVectorizer: Convert text data into numerical vectors using TF-IDF.
- cosine_similarity: Measure similarity between two vectors (used to compare movies).


## Step 2: Loading the Dataset


In [100]:
# Reading the CSV file into a pandas DataFrame
# Ensure that the file path is correct
df = pd.read_csv('movies.csv')

# Print the first 5 rows of the dataset to understand its structure
display(df.head())

# Print the shape (number of rows and columns) of the dataset
print(df.shape)

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


(4803, 24)


## Step 3: Choosing Important Features
- When recommending movies, we want to consider what makes them unique. Let’s pick useful features!

In [101]:
# Create a list of the features we consider important
selected_features = ["genres", "keywords", "tagline", "cast", "director", "overview"]
print(selected_features)
# Prints the list of selected features

['genres', 'keywords', 'tagline', 'cast', 'director', 'overview']


## Step 4: Handling Missing Data
- Not all movies will have complete information, so we’ll fill missing values with an empty string:

In [102]:
# Handle missing values by replacing NaNs with an empty string using a for loop
for feature in selected_features:   # iterate through our selected_features list
    df[feature] = df[feature].fillna("")    # replace each NaN value with an empty string ("")

- fillna(""):
    - Replaces any missing values (NaNs) with an empty string to avoid issues when concatenating text data.
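
As a minimal sketch of the same idea, the snippet below counts the missing values before and after filling, and fills all selected columns in one call instead of a loop. The `toy` DataFrame and its rows are invented for illustration and merely stand in for `movies.csv`.

```python
import pandas as pd

# Toy frame standing in for movies.csv; the rows are invented for illustration.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["Action", None],
    "tagline": [None, "A tagline"],
})

text_columns = ["genres", "tagline"]

# Count missing values per column before filling
missing_before = toy[text_columns].isna().sum()

# Fill every selected column in a single call instead of looping
toy[text_columns] = toy[text_columns].fillna("")

# Count again to confirm nothing is left to fill
missing_after = toy[text_columns].isna().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

Checking the counts first is optional, but it shows how many movies actually lack a tagline or keywords before they are silently replaced.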

## Step 5: Combining Features
- Now, we concatenate the selected features to form a single string representation for each movie
- This merged text acts as a “fingerprint” for each movie!

In [103]:
# Combine the selected features into a single string for each movie
df["combined_features"] = (
    df["genres"]
    + " "
    + df["keywords"]
    + " "
    + df["tagline"]
    + " "
    + df["cast"]
    + " "
    + df["director"]
    + " "
    + df["overview"]
)

# Preview the combined features
print(df["combined_features"])

0       Action Adventure Fantasy Science Fiction cultu...
1       Adventure Fantasy Action ocean drug abuse exot...
2       Action Adventure Crime spy based on novel secr...
3       Action Crime Drama Thriller dc comics crime fi...
4       Action Adventure Science Fiction based on nove...
                              ...                        
4798    Action Crime Thriller united states\u2013mexic...
4799    Comedy Romance  A newlywed couple's honeymoon ...
4800    Comedy Drama Romance TV Movie date love at fir...
4801      A New Yorker in Shanghai Daniel Henney Eliza...
4802    Documentary obsession camcorder crush dream gi...
Name: combined_features, Length: 4803, dtype: object


- Combines the selected columns into a single string for each movie.
- This combined string serves as the movie profile.
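
The concatenation above can also be driven by a list of column names, which may be handy if the feature list changes. This is only a sketch on an invented `toy` DataFrame; the column subset and the whitespace clean-up step are assumptions, not part of the original notebook.

```python
import pandas as pd

# Invented rows; the empty keywords field mimics a value filled by fillna("").
toy = pd.DataFrame({
    "genres": ["Action Adventure", "Comedy"],
    "keywords": ["space war", ""],
    "director": ["Jane Doe", "John Roe"],
})
selected = ["genres", "keywords", "director"]

# Join the selected columns row by row with a single space
toy["combined_features"] = toy[selected].agg(" ".join, axis=1)

# Collapse the double spaces left behind by empty fields (cosmetic only)
toy["combined_features"] = toy["combined_features"].str.split().str.join(" ")

print(toy["combined_features"].tolist())
```

The clean-up is purely cosmetic: `TfidfVectorizer` tokenizes on word boundaries, so extra spaces do not change the resulting vectors.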

## Step 6: Vectorizing the Text Data
Computers don’t understand text directly — we must convert it into numbers.

In [104]:
# Convert the combined text data into numerical feature vectors using TF-IDF
vectorizer = TfidfVectorizer()   # create the TfidfVectorizer
feature_vectors = vectorizer.fit_transform(df["combined_features"])   # fit and transform the combined features
print(feature_vectors)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 307355 stored elements and shape (4803, 30592)>
  Coords	Values
  (0, 561)	0.05971816344971169
  (0, 703)	0.06846420517510078
  (0, 9754)	0.08513696797398294
  (0, 23977)	0.07941905576010944
  (0, 10023)	0.07960231361105431
  (0, 6601)	0.1498786462809525
  (0, 5279)	0.1549075340655008
  (0, 10796)	0.11095111375730655
  (0, 25413)	0.24369151759694266
  (0, 29469)	0.08637114089261566
  (0, 5592)	0.17322386697661618
  (0, 25232)	0.1336739112380023
  (0, 9065)	0.14505772980982928
  (0, 27118)	0.08328687324810813
  (0, 30121)	0.07272849549697494
  (0, 19410)	0.0318696249813729
  (0, 20039)	0.37747447361884223
  (0, 23619)	0.11047375838269538
  (0, 30145)	0.18243919685121024
  (0, 30543)	0.15064979633862852
  (0, 23578)	0.16802034155542864
  (0, 24802)	0.158925784923944
  (0, 29614)	0.15064979633862852
  (0, 25868)	0.11590870590502833
  (0, 15569)	0.171366691592621
  :	:
  (4802, 9588)	0.10415492266309316
  (4802, 21386)	0.0897551

- TfidfVectorizer():
    - Converts the combined text into numerical feature vectors.
- How does it work?
    - TF-IDF (Term Frequency-Inverse Document Frequency) measures the importance of a word:
        - Term Frequency (TF): Measures how frequently a word appears in a document.
        - Inverse Document Frequency (IDF): Reduces the weight of common words (like "the", "and") that appear in many movies, so they don’t dominate the recommendations.
        
- Imagine TF-IDF as a highlighter — it emphasizes rare, meaningful words while dimming down common words that appear in most movies. This way, the model can focus on distinctive features rather than generic words.
- Example: In a dataset with lots of sci-fi movies, words like "future" may appear often — IDF lowers their importance to avoid bias.

## Step 7: Calculating Cosine Similarity
Next, we measure how similar two movies are using cosine similarity.


In [105]:
# Compute the cosine similarity matrix
similarity = cosine_similarity(feature_vectors)
print(similarity)

# Prints a matrix of pairwise similarity values between 0 and 1

[[1.         0.05083168 0.0332947  ... 0.02749812 0.0304889  0.0072518 ]
 [0.05083168 1.         0.04356836 ... 0.05077045 0.03100979 0.01521198]
 [0.0332947  0.04356836 1.         ... 0.02646984 0.04751623 0.01372603]
 ...
 [0.02749812 0.05077045 0.02646984 ... 1.         0.03481447 0.03546821]
 [0.0304889  0.03100979 0.04751623 ... 0.03481447 1.         0.03098945]
 [0.0072518  0.01521198 0.01372603 ... 0.03546821 0.03098945 1.        ]]


- Cosine Similarity:
- Measures how similar two vectors are, based on the cosine of the angle between them.
- Formula: $\text{cosine similarity}(A, B) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$

- Range: 0 (no terms in common) to 1 (identical profile). TF-IDF weights are never negative, so the scores here stay within this range.
- Example:
    - If Movie A and Movie B have similar genres and cast, their similarity score will be closer to 1.
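
The formula can be checked by hand on small vectors. The sketch below uses three invented vectors and a hypothetical helper named `cosine`, then compares the result with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Invented vectors for illustration
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, only longer
c = np.array([0.0, 0.0, 3.0])   # shares no non-zero component with a


def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_ab = cosine(a, b)
sim_ac = cosine(a, c)

# scikit-learn computes the same values for every pair of rows at once
pairwise = cosine_similarity(np.vstack([a, b, c]))

print(sim_ab, sim_ac)
print(pairwise)
```

Because only the angle matters, `b` scores as a perfect match for `a` even though it is twice as long, while `c` scores zero.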

## Step 8: Finding Movie Matches
Now, let’s take the user’s input and match it to the closest movie title:

In [106]:
# we take an input from the user and convert it to lowercase
input_movie = input("Enter a movie title: ").lower()

# convert the titles in our dataset to lowercase to make comparison easier
df["title"] = df["title"].str.lower()

# Try to find a close match using difflib.get_close_matches
# Pass the input movie and the title column as arguments
find_close_match = difflib.get_close_matches(input_movie, df['title'])

if find_close_match:  # If a match is found:
    close_match = find_close_match[0]  # Grab the first match from the list, which is the most similar title.
    print(f"Closest match: {close_match}")  # Prints the closest matching movie title.
else:  # If no match is found:
    print("Movie not found. Please try another title.")  # Prints a message asking the user to try another title.

Closest match: iron man


- Why difflib?
- It helps us handle typos and spelling errors — so typing “Incepton” instead of “Inception” still works!
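
Here is a small sketch of how `difflib.get_close_matches` behaves on a typo and on an input that matches nothing. The title list is invented; `n` and `cutoff` are real parameters of the function, shown here with explicitly chosen values.

```python
import difflib

# A tiny invented list of lowercase titles
titles = ["inception", "interstellar", "iron man", "iron man 2"]

# n caps how many matches are returned; cutoff (default 0.6) is the
# minimum similarity ratio a candidate needs to be included
matches = difflib.get_close_matches("incepton", titles, n=3, cutoff=0.6)

# An input that resembles no title yields an empty list
no_match = difflib.get_close_matches("zzzz", titles, n=3, cutoff=0.6)

print(matches)
print(no_match)
```

Raising `cutoff` makes matching stricter, while lowering it tolerates heavier typos at the risk of picking an unrelated title.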

## Step 9: Finding Similar Movies


In [107]:
# Once we have the correct movie title, we’ll grab its index
index_of_movie = df[df.title == close_match].index[0]
# df[df.title == close_match]: This filters the DataFrame to only include the row where the title matches close_match (the closest movie title we found earlier).
# .index[0]: Extracts the index of the matched movie. This index will help us locate the similarity scores for that specific movie.

# Retrieve similarity scores using enumerate function for all movies:
similarity_score = list(enumerate(similarity[index_of_movie]))
# similarity[index_of_movie]: Retrieves the row of similarity scores for the matched movie.
# For movie at index 3, this would grab all similarity scores for that movie.
# enumerate(): Wraps the similarity scores in (index, score) pairs so we can keep track of movie indices while sorting.

# Sort them in descending order:
sorted_similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)
# sorted(): Sorts the list of (index, similarity score) pairs.
# key=lambda x: x[1]: Tells Python to sort the list by the similarity score (x[1]).
# reverse=True: Sorts the list in descending order — most similar movies appear first.
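
As an alternative sketch, the ranking can be done with NumPy while removing the query movie by index rather than by position. The scores below are made up, and `movie_index` and `top_n` are hypothetical names. Excluding the index explicitly may matter if another movie has an identical profile: it would tie at 1.0 and could take the first slot in the sorted list.

```python
import numpy as np

# Made-up similarity row for one movie against five movies;
# index 2 is the query movie itself (score 1.0)
scores = np.array([0.10, 0.45, 1.00, 0.30, 0.45])
movie_index = 2
top_n = 3

# Indices sorted by descending score; a stable sort keeps ties in index order
order = np.argsort(-scores, kind="stable")

# Drop the query movie explicitly instead of assuming it is first
order = order[order != movie_index][:top_n]

print(order.tolist())
```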

## Step 10: Displaying Recommendations

In [108]:
print(f"\nMovies similar to '{close_match.title()}':\n")

i = 1  # Initializes a counter to number the recommendations.

# Loops through the sorted similarity scores.
# [1:11]: Skips the first entry (which is the movie itself) and gets the top 10 similar movies.
for movie in sorted_similarity_score[1:11]:  
    index = movie[0]   # extract the movie's index
    title_from_index = df.loc[index, "title"]  # look up the title based on the index
    print(f"{i}. {title_from_index.title()}")
    i += 1


Movies similar to 'Iron Man':

1. Iron Man 2
2. Iron Man 3
3. Avengers: Age Of Ultron
4. The Avengers
5. X-Men
6. The Helix... Loaded
7. Captain America: Civil War
8. X-Men: Apocalypse
9. Ant-Man
10. Made


## Step 11: Function for Movie Recommendation

In [109]:
# Function to get movie recommendations
def recommend_movies():
    input_movie = input("Type the name of a movie: ")

    # Convert all titles to lowercase to handle case sensitivity
    input_movie = input_movie.lower()
    df["title"] = df["title"].str.lower()

    # Create a list of all movie titles
    list_of_titles = df["title"].tolist()

    # Find the closest match to the input movie
    find_close_match = difflib.get_close_matches(input_movie, list_of_titles)

    if not find_close_match:
        print("Movie not found. Please try another title.")
        return

    close_match = find_close_match[0]
    print(f"\nClosest match found: {close_match.title()}\n")

    # Find the index of the matched movie
    index_of_movie = df[df.title == close_match].index[0]

    # Get a list of similar movies based on similarity scores
    similarity_score = list(enumerate(similarity[index_of_movie]))

    # Sort movies by similarity score in descending order
    sorted_similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)

    # Print the top 10 recommended movies
    print("Top 10 Recommended Movies:\n")
    i = 1
    # Skip the first one because it's the input movie itself
    for movie in sorted_similarity_score[1:11]:
        index = movie[0]
        title_from_index = df.iloc[index]["title"].title()
        print(f"{i}. {title_from_index}")
        i += 1

In [110]:
# Run the recommendation function
recommend_movies()


Closest match found: Avatar

Top 10 Recommended Movies:

1. Lifeforce
2. Moonraker
3. Alien
4. Star Trek Beyond
5. Gattaca
6. Aliens
7. Guardians Of The Galaxy
8. Gravity
9. Lockout
10. Apollo 18
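
The function above reads from `input()` and lowercases `df["title"]` in place on every call. A possible variant, sketched below under stated assumptions, takes the title as an argument and returns a list, which makes it easier to reuse and test. The name `recommend`, its parameters, and the three toy movies are all hypothetical.

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(title, titles, similarity, top_n=10):
    """Return up to top_n titles most similar to the closest match for title."""
    # Compare in lowercase without modifying the original titles
    lowered = [t.lower() for t in titles]
    matches = difflib.get_close_matches(title.lower(), lowered, n=1)
    if not matches:
        return []
    position = lowered.index(matches[0])

    # Rank every movie by its similarity to the matched one
    ranked = sorted(
        enumerate(similarity[position]), key=lambda pair: pair[1], reverse=True
    )
    # Exclude the matched movie itself, then keep the top_n titles
    return [titles[i] for i, _ in ranked if i != position][:top_n]


# Invented toy data standing in for the real dataset
toy = pd.DataFrame({
    "title": ["Space War", "Space Battle", "Love Story"],
    "combined_features": [
        "space war aliens fleet",
        "space battle aliens fleet",
        "romance paris love letters",
    ],
})
vectors = TfidfVectorizer().fit_transform(toy["combined_features"])
similarity = cosine_similarity(vectors)
titles = toy["title"].tolist()

result = recommend("space wr", titles, similarity, top_n=2)
print(result)
```

Returning the titles instead of printing them leaves the display to the caller, and an empty list signals that no close match was found.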


# Summary: How Our Movie Recommendation System Works
- This flowchart summarizes the entire process, from taking a movie title as input to generating content-based recommendations. It highlights the key libraries and steps that power the recommendation engine!

**Start: User Inputs a Movie Title**  
        ↓  
**Step 1: Load and Preprocess Data (pandas)**  
- Read the CSV file  
- Fill missing values in relevant features (genres, keywords, tagline, cast, director, overview)  
        ↓  
**Step 2: Combine Features**  
- Merge selected features into a single string  
        ↓  
**Step 3: Text Vectorization (TfidfVectorizer from sklearn)**  
- Convert combined text into numerical feature vectors  
        ↓  
**Step 4: Calculate Similarity (cosine_similarity from sklearn)**  
- Compute cosine similarity scores between movies  
        ↓  
**Step 5: Match User Input (difflib)**  
- Find the closest matching movie title (handles typos)  
        ↓  
**Step 6: Sort and Recommend Movies**  
- Rank movies based on similarity scores  
- Display the top N recommended movies  
        ↓  
**End: Show Recommendations to the User**  
