In [1]:
import numpy as np
import pandas as pd

In [2]:
credits = pd.read_csv(
    "tmdb_5000_credits.csv",
    engine="python",
    on_bad_lines="skip"
)
movies = pd.read_csv(
    "tmdb_5000_movies.csv",
    engine="python",
    on_bad_lines="skip"
)


### Dataset Parsing Issue and Resolution

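While loading the dataset, a CSV parsing error was encountered due to improperly formatted text entries containing unmatched quotation marks.

Such issues are common in real-world datasets containing free-text fields.

The error was worked around by switching to the Python parsing engine, which is more tolerant of formatting inconsistencies, and by passing `on_bad_lines="skip"`. Note that `on_bad_lines="skip"` silently drops every row the parser cannot read, so a loaded frame may contain fewer rows than its source file. The shape check later in the notebook reports 2,087 rows for credits against 4,803 for movies, which suggests that many credits rows were dropped this way.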
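
To measure how many rows are discarded instead of skipping them silently, the Python engine also accepts a callable for `on_bad_lines` (pandas 1.4 or later). The sketch below is illustrative only: it uses a small in-memory CSV rather than the TMDB files, `record_bad_line` and `bad_rows` are hypothetical names, and the callable is triggered here by a row with too many fields, which may not be the exact failure seen in the real files.

```python
import io
import pandas as pd

# Toy CSV: the second data row has one field too many.
raw = "id,title\n1,Avatar\n2,Spectre,extra\n3,Batman\n"

bad_rows = []

def record_bad_line(fields):
    """Remember the malformed row, then return None so pandas skips it."""
    bad_rows.append(fields)
    return None

df = pd.read_csv(io.StringIO(raw), engine="python", on_bad_lines=record_bad_line)

print(len(df), "rows kept;", len(bad_rows), "rows skipped")
```

Logging the skipped rows this way makes it possible to decide whether the loss is acceptable or whether the source file needs repair.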

In [3]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [4]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [5]:
#Check dataset shape
print("Movies dataset shape: ",movies.shape)
print("Credits dataset shape: ",credits.shape)

Movies dataset shape:  (4803, 20)
Credits dataset shape:  (2087, 4)


In [6]:
#Check columns
print("Movies dataset columns:")
movies.columns

Movies dataset columns:


Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

In [7]:
print("Credits dataset columns:")
credits.columns

Credits dataset columns:


Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')

In [8]:
#Understanding dataset information
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [9]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2087 entries, 0 to 2086
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  2087 non-null   int64 
 1   title     2087 non-null   object
 2   cast      2087 non-null   object
 3   crew      2087 non-null   object
dtypes: int64(1), object(3)
memory usage: 65.3+ KB


In [10]:
#Merge datasets on title
movies = movies.merge(credits,on='title')

### Pandas performs an INNER JOIN on the title column.

#### What happens internally:

- Only titles that are present in both frames are kept.

- If a movie title appears more than once in either frame, pandas emits every matching combination of rows.

- This causes row duplication.

In [11]:
movies.shape

(2089, 23)

In [12]:
movies['title'].value_counts().head()

title,count
The Host,2
Batman,2
Malcolm X,1
This Is 40,1
Old Dogs,1


## Dataset Merge Observation

- After merging the movies and credits datasets, the result has 2,089 rows: two more than the 2,087 rows in credits, and far fewer than the 4,803 rows in movies.
- The row count drops relative to movies because the inner join keeps only titles that also appear in credits.
- The two extra rows arise because some titles (The Host and Batman in the output above) appear more than once, so the merge produces multiple matching rows for them.
- Such duplication is common in real-world datasets and must be handled during preprocessing.
- Duplicate entries will be addressed in the data preprocessing stage.
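
The duplication effect can be reproduced on toy data. The sketch below is not the notebook's merge: the frames, ids, and cast values are made up, and joining on the id columns is shown only as one possible alternative, assuming `id` in movies and `movie_id` in credits refer to the same film.

```python
import pandas as pd

# Two different films share the title "Batman" (ids are made up).
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title pairs every "Batman" row with every other "Batman" row.
by_title = movies_toy.merge(credits_toy, on="title")

# Joining on the unique id keeps exactly one row per film.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title), len(by_id))  # 5 3
```

A key that is unique per film avoids both the extra rows and the later need to drop films that merely share a title.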


In [13]:
#Check for missing values
movies.isnull().sum()

Unnamed: 0,0
budget,0
genres,0
homepage,1196
id,0
keywords,0
original_language,0
original_title,0
overview,0
popularity,0
production_companies,0


## Dataset Exploration & Integration

- The movie information is distributed across two datasets: movies and credits.
- Both datasets were merged using the movie title to create a unified dataset.
- The merged dataset contains textual attributes such as genres, keywords, overview, cast, and crew.
- These attributes provide rich semantic information required for content-based filtering.
- Some fields contain missing values, indicating the need for preprocessing in the next step.

In [14]:
movies[['title', 'genres', 'overview', 'keywords', 'cast', 'crew']].head()

Unnamed: 0,title,genres,overview,keywords,cast,crew
0,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","In the 22nd century, a paraplegic Marine is di...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","Captain Barbossa, long believed to be dead, ha...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",A cryptic message from Bond’s past sends him o...,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",Following the death of District Attorney Harve...,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","John Carter is a war-weary, former military ca...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [15]:
movies.shape

(2089, 23)

In [16]:
movies.drop_duplicates(subset='title', inplace=True)

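## Data Cleaning: Column Selection & Duplicate Removal

- The columns relevant to content-based recommendation (title, genres, overview, keywords, cast, crew) were inspected. No columns were dropped, so the frame still has all 23.
- Duplicate movie titles were removed so that each title appears only once. This also discards distinct films that happen to share a title.
- Removing duplicates prevents repeated or biased recommendations.
- `drop_duplicates` keeps the original index labels, so the index has gaps after this step and no longer matches row positions.
- This step improves data consistency and prepares the dataset for further preprocessing.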
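
A small sketch of what `drop_duplicates` does to the index, using a made-up four-row frame. Renumbering with `reset_index(drop=True)` is shown as one way to keep labels and row positions aligned; the notebook itself does not do this.

```python
import pandas as pd

df = pd.DataFrame({"title": ["A", "B", "B", "C"]})

# The second "B" (label 2) is removed, but the surviving labels are unchanged.
deduped = df.drop_duplicates(subset="title")
print(list(deduped.index))     # [0, 1, 3]

# Renumbering makes label i refer to row position i again.
renumbered = deduped.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2]
```

This matters later because matrices built from the frame (TF-IDF, cosine similarity) are addressed by row position, not by index label.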

In [17]:
movies.isnull().sum()

Unnamed: 0,0
budget,0
genres,0
homepage,1195
id,0
keywords,0
original_language,0
original_title,0
overview,0
popularity,0
production_companies,0


In [18]:
movies.fillna('',inplace=True)

In [19]:
movies.isnull().sum()

Unnamed: 0,0
budget,0
genres,0
homepage,0
id,0
keywords,0
original_language,0
original_title,0
overview,0
popularity,0
production_companies,0


## Handling Missing Values

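- In the merged dataset, the visible part of the null-count output shows missing values only in homepage (1,195 rows after duplicate removal); genres, keywords, and overview show none.
- All missing values were replaced with empty strings so that later string operations do not fail on NaN.
- This approach prevents errors during feature extraction using NLP techniques.
- Because `fillna('')` is applied to every column, any numeric column that contained NaN would now hold empty strings as well.
- Handling missing data improves the robustness and reliability of the recommendation system.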
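
If numeric columns should keep their dtype, the fill can be restricted to text columns. The following is a sketch on made-up data; the column list `text_cols` is an assumption and would need to match the columns actually used downstream.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "overview": ["A marine on a distant moon", None],
    "tagline": [None, "The Legend Ends"],
    "runtime": [162.0, np.nan],
})

# Fill only the text columns; numeric NaN values are left untouched.
text_cols = ["overview", "tagline"]
df[text_cols] = df[text_cols].fillna("")

print(df.isna().sum())
print(df.dtypes)
```

Here `runtime` stays numeric with its NaN intact, while the text columns are safe to concatenate or split.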


In [20]:
import ast

In [21]:
#Helper function to extract names from JSON-like text
def convert(text):
    #Parse the stringified list of dicts and keep each entry's 'name'
    return [i['name'] for i in ast.literal_eval(text)]

In [22]:
#Apply conversion to genres and keywords
movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)

In [23]:
#Extract top 3 cast members
def convert_cast(text):
    #Slicing keeps the first three entries (fewer if the list is shorter)
    return [i['name'] for i in ast.literal_eval(text)[:3]]

movies['cast'] = movies['cast'].apply(convert_cast)

In [24]:
#Extract director from crew
def fetch_director(text):
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            return [i['name']]
    return []

movies['crew'] = movies['crew'].apply(fetch_director)

In [25]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[Action, Adventure, Fantasy, Science Fiction]",http://www.avatarmovie.com/,19995,"[culture clash, future, space war, space colon...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,300000000,"[Adventure, Fantasy, Action]",http://disney.go.com/disneypictures/pirates/,285,"[ocean, drug abuse, exotic island, east india ...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,245000000,"[Action, Adventure, Crime]",http://www.sonypictures.com/movies/spectre/,206647,"[spy, based on novel, secret agent, sequel, mi...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,250000000,"[Action, Crime, Drama, Thriller]",http://www.thedarkknightrises.com/,49026,"[dc comics, crime fighter, terrorist, secret i...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,"[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,260000000,"[Action, Adventure, Science Fiction]",http://movies.disney.com/john-carter,49529,"[based on novel, mars, medallion, space travel...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,"[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


## Text Parsing and Feature Extraction

- The genres, keywords, cast, and crew columns were originally stored in JSON-like format.
- These fields were parsed to extract only meaningful textual information such as names.
- For the cast column, only the top three actors were selected to reduce noise.
- The director was extracted from the crew data as it plays an important role in movie similarity.
- This transformation converts raw text into structured tokens suitable for NLP processing.
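
The parsing steps above can also be sketched with `json.loads`, since the fields are JSON text. This is an alternative to the notebook's `ast.literal_eval`, not what the notebook runs. `parse_names` and `parse_director` are hypothetical helpers, and the `"profile": null` entry is invented to show a JSON literal that `ast.literal_eval` would reject.

```python
import json

def parse_names(text, limit=None):
    """Return the 'name' of each entry in a JSON list, optionally truncated."""
    if not text:  # empty strings left behind by fillna('')
        return []
    names = [item["name"] for item in json.loads(text)]
    return names if limit is None else names[:limit]

def parse_director(text):
    """Return the first crew member whose job is 'Director', as a list."""
    if not text:
        return []
    for member in json.loads(text):
        if member.get("job") == "Director":
            return [member["name"]]
    return []

genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
crew = (
    '[{"job": "Editor", "name": "Some Editor", "profile": null},'
    ' {"job": "Director", "name": "James Cameron", "profile": null}]'
)

print(parse_names(genres))           # ['Action', 'Adventure']
print(parse_names(genres, limit=1))  # ['Action']
print(parse_director(crew))          # ['James Cameron']
```

Returning a list in every case, including for empty input, keeps the later list concatenation step uniform.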

In [26]:
#Remove spaces within tokens
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(" ", "") for i in x])

In [27]:
#Convert overview into list of words
movies['overview'] = movies['overview'].apply(lambda x: x.split())

In [28]:
#Create a single content column
movies['content'] = movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew'] + movies['overview']

In [29]:
#Convert list to string
movies['content'] = movies['content'].apply(lambda x: " ".join(x))

In [30]:
#Convert text to lowercase
movies['content'] = movies['content'].apply(lambda x: x.lower())

In [31]:
movies[['title', 'content']].head()

Unnamed: 0,title,content
0,Avatar,action adventure fantasy sciencefiction cultur...
1,Pirates of the Caribbean: At World's End,adventure fantasy action ocean drugabuse exoti...
2,Spectre,action adventure crime spy basedonnovel secret...
3,The Dark Knight Rises,action crime drama thriller dccomics crimefigh...
4,John Carter,action adventure sciencefiction basedonnovel m...


## Feature Combination and Text Normalization

- Spaces inside multi-word genres, keywords, and names were removed (for example, "Science Fiction" becomes "sciencefiction") so that each entity is treated as a single token; the combined text was then converted to lowercase.
- The movie overview was split on whitespace into individual words.
- Relevant features such as genres, keywords, cast, director, and overview were combined into a single content column.
- This unified text representation serves as the final input for feature extraction using TF-IDF.
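
The normalization steps can be traced on a single made-up row. The helper name `collapse` and the sample values are illustrative only.

```python
def collapse(tokens):
    """Remove inner spaces so a multi-word name becomes one token."""
    return [t.replace(" ", "") for t in tokens]

genres = ["Science Fiction"]
cast = ["Sam Worthington"]
crew = ["James Cameron"]
overview = "A paraplegic Marine"

# Same order of operations as the notebook: collapse, concatenate, join, lowercase.
tokens = collapse(genres) + collapse(cast) + collapse(crew) + overview.split()
content = " ".join(tokens).lower()

print(content)  # sciencefiction samworthington jamescameron a paraplegic marine
```

Collapsing names prevents the vectorizer from matching two unrelated people on a shared first name such as "sam" or "james".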

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [33]:
#Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')

In [34]:
#Fit and transform the content column
tfidf_matrix = tfidf.fit_transform(movies['content'])

In [35]:
#Check TF-IDF matrix shape
tfidf_matrix.shape

(2087, 5000)

## TF-IDF Feature Extraction

- The combined content column was converted into numerical feature vectors using TF-IDF vectorization.
- English stop words were removed to reduce noise, and the vocabulary was capped at 5,000 terms (`max_features=5000`), which gives the (2087, 5000) matrix shown above.
- TF-IDF gives higher weight to terms that occur often in one movie's content but rarely across the whole collection, and lower weight to terms that occur in many movies.
- The resulting matrix represents each movie as a numerical vector suitable for similarity computation.

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

In [37]:
#Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix)

In [38]:
#Check similarity matrix shape
cosine_sim.shape

(2087, 2087)

## Cosine Similarity Computation

- Cosine similarity was used to measure the similarity between movie feature vectors.
- The similarity score represents how closely two movies are related based on their content.
- A cosine similarity matrix was generated where each value indicates the similarity between a pair of movies.
- This matrix forms the foundation for generating movie recommendations.
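
For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below computes it directly with NumPy on made-up vectors; `cosine` is an illustrative helper, not a function from the notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 1.0])

print(cosine(a, b))  # one shared term out of two each
print(cosine(a, a))  # a vector compared with itself
```

Because TF-IDF weights are non-negative, scores for the movie vectors fall between 0 (no shared terms) and 1 (identical direction).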


In [39]:
#Create a positional index mapping for movie titles
#(drop_duplicates left gaps in movies.index, so index labels do not line up with rows of cosine_sim)
movie_index = pd.Series(range(len(movies)), index=movies['title'])

In [40]:
#Define the recommendation function
def get_recommendations(movie_title, cosine_sim=cosine_sim, movies_df=movies, top_n=10):
    #Return an empty list for unknown titles so callers can test `if not recommendations`
    if movie_title not in movie_index:
        return []

    idx = movie_index[movie_title]
    similarity_scores = list(enumerate(cosine_sim[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    #Drop the query movie itself, then keep the top_n most similar
    top_movies = [s for s in similarity_scores if s[0] != idx][:top_n]
    movie_indices = [i[0] for i in top_movies]

    return movies_df['title'].iloc[movie_indices].tolist()


In [41]:
#Test the recommendation function
get_recommendations("Avatar")

['Battle: Los Angeles',
 'Star Trek Into Darkness',
 'The Book of Life',
 'Titan A.E.',
 'Predators',
 "Ender's Game",
 'The Inhabited Island',
 'Lifeforce',
 'Alien³',
 'The Lovers']

## Recommendation Function Development

- A recommendation function was developed to generate movie suggestions based on content similarity.
- The function takes a movie title as input and retrieves its index from the dataset.
- The row of precomputed cosine similarity scores for the selected movie is looked up in the similarity matrix.
- Movies are sorted by similarity score in descending order, and the top recommendations are returned.
- This function represents the core logic of the movie recommendation system.
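
The lookup-and-rank logic can be sketched on a tiny, made-up similarity matrix. `top_similar`, `titles`, and `sim` are illustrative names; the sketch uses `np.argsort` and masks the query row explicitly rather than assuming the query is always ranked first.

```python
import numpy as np

titles = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.4, 0.1],
    [0.9, 0.4, 1.0, 0.3],
    [0.5, 0.1, 0.3, 1.0],
])

def top_similar(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    if title not in titles:
        return []
    idx = titles.index(title)
    scores = sim[idx].copy()
    scores[idx] = -np.inf                     # never recommend the query itself
    order = np.argsort(scores)[::-1][:top_n]  # highest scores first
    return [titles[i] for i in order]

print(top_similar("A"))  # ['C', 'D']
print(top_similar("Z"))  # []
```

Returning an empty list for an unknown title lets a caller distinguish "not found" from a genuine list of recommendations.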


In [43]:
def recommend_movies():
    print("-"*60)
    print("MOVIE RECOMMENDATION SYSTEM")
    print("-"*60)

    movie_name = input("Enter a movie you like: ").strip()

    print("\nSearching for similar movies...\n")

    recommendations = get_recommendations(movie_name)

    if not recommendations:
        print("Movie not found in the dataset.")
        return

    print(f"Because you liked '{movie_name}', you may also like:\n")

    for idx, movie in enumerate(recommendations, start=1):
        print(f"{idx}. {movie}")

    print("\n" + "-"*60)
    print("End of Recommendations")
    print("-"*60)

In [44]:
recommend_movies()

------------------------------------------------------------
MOVIE RECOMMENDATION SYSTEM
------------------------------------------------------------
Enter a movie you like: The Lovers

Searching for similar movies...

Because you liked 'The Lovers', you may also like:

1. A Passage to India
2. Beyond Borders
3. Nomad: The Warrior
4. The Horseman on the Roof
5. The Time Traveler's Wife
6. Kate & Leopold
7. The Age of Innocence
8. The Tourist
9. Legends of the Fall
10. Dangerous Liaisons

------------------------------------------------------------
End of Recommendations
------------------------------------------------------------


## Conclusion

- In this project, a content-based movie recommendation system was developed using Natural Language Processing and similarity-based techniques.

- The system uses movie metadata (genres, keywords, plot overviews, top-billed cast, and director) to identify content similarities between movies. Textual features were converted into numerical representations using TF-IDF vectorization, and cosine similarity was applied to rank candidate recommendations.

- The model was spot-checked with two titles (Avatar and The Lovers). The returned lists appear thematically related, but no quantitative evaluation was performed, so these results are illustrative rather than a validation of the approach.

- This project walks through an end-to-end workflow: data loading, preprocessing, feature engineering, similarity computation, and qualitative inspection of the results. It addresses the problem of content discovery and shows how text-based similarity can be applied in a recommendation system.
