Importing Required Libraries

We use:

Pandas & NumPy → Data handling

Matplotlib & Seaborn → Visualization

Scikit-learn (TF-IDF + Cosine Similarity) → Feature extraction & similarity measure

difflib → Fuzzy matching of the title the user types against the dataset titles

In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Loading the Dataset

Here we load the movies dataset and preview the first few rows.

In [6]:
movies_data = pd.read_csv('/content/movies.csv')
movies_data.head()

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


Exploring the Dataset

Shape → Number of rows and columns

Missing values check → Identify null data before preprocessing

In [7]:
movies_data.shape

(4803, 24)

In [8]:
movies_data.isnull().sum()

Unnamed: 0,0
index,0
budget,0
genres,28
homepage,3091
id,0
keywords,412
original_language,0
original_title,0
overview,3
popularity,0
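
As a complement to the raw null counts, the share of missing values per column is often easier to read. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the `report` name are assumptions for illustration, not part of the movies dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for movies_data
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Action", np.nan, "Drama", "Comedy"],
    "homepage": [np.nan, np.nan, np.nan, "http://example.com"],
})

# isnull().sum() counts missing cells; isnull().mean() gives the fraction
report = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "share": toy.isnull().mean(),
}).sort_values("missing", ascending=False)

print(report)
```

Columns with a high share (such as `homepage` in the real dataset) are usually poor candidates for features, while columns with only a few gaps can simply be filled.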


Data Preprocessing

We select only the features relevant to recommendations and replace missing values with empty strings, so that the text columns can be concatenated safely.

In [9]:
selected_features = ['genres' , 'keywords' , 'tagline' , 'cast' , 'director' ]
print(selected_features)

['genres', 'keywords', 'tagline', 'cast', 'director']


In [10]:
for feature in selected_features:
  movies_data[feature] = movies_data[feature].fillna('')
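
To see why the fill step matters, here is a small sketch on a hypothetical two-row frame (`toy`, `raw`, and `clean` are illustrative names). In pandas, concatenating a string with a missing value yields a missing value, so a single gap would wipe out the whole combined text for that movie:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing genre
toy = pd.DataFrame({
    "genres": ["Action", np.nan],
    "director": ["James Cameron", "Sam Mendes"],
})

# Without filling: the NaN propagates through the concatenation
raw = toy["genres"] + " " + toy["director"]

# After filling with empty strings the director text survives
filled = toy.fillna("")
clean = filled["genres"] + " " + filled["director"]

print(raw.isna().tolist())
print(clean.tolist())
```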

Feature Engineering

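We combine all selected features into a single text string for each movie. Each term must be a column lookup such as movies_data['keywords']; a bare list like ['keywords'] appends the literal word "keywords" to every row. The outputs shown in the rest of this walkthrough were produced with the bare-list form, so they reflect the genres only.

In [11]:
combined_features = movies_data['genres']+' '+movies_data['keywords']+' '+movies_data['tagline']+' '+movies_data['cast']+' '+movies_data['director']
print(combined_features)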

0       Action Adventure Fantasy Science Fiction keywo...
1       Adventure Fantasy Action keywords tagline cast...
2       Action Adventure Crime keywords tagline cast d...
3       Action Crime Drama Thriller keywords tagline c...
4       Action Adventure Science Fiction keywords tagl...
                              ...                        
4798    Action Crime Thriller keywords tagline cast di...
4799        Comedy Romance keywords tagline cast director
4800    Comedy Drama Romance TV Movie keywords tagline...
4801                       keywords tagline cast director
4802           Documentary keywords tagline cast director
Name: genres, Length: 4803, dtype: object
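
One way to make the concatenation less error-prone is to join the selected columns row by row instead of chaining `+`. This is a sketch under the assumption of a hypothetical toy frame; `toy`, `features`, and `combined` are illustrative names, and the join is an alternative to the notebook's approach rather than what it does:

```python
import pandas as pd

# Hypothetical toy frame with the same kind of text columns
toy = pd.DataFrame({
    "genres": ["Action Adventure", "Comedy"],
    "keywords": ["space war", None],
    "director": ["James Cameron", "Gore Verbinski"],
})

features = ["genres", "keywords", "director"]

# Fill gaps first, then join the values of each row with a single space
combined = toy[features].fillna("").agg(" ".join, axis=1)

print(combined.tolist())
```

Because the column names appear only once, in `features`, there is no chance of accidentally adding a literal column name to the text.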


TF-IDF Vectorization

We convert text data into numerical vectors using TF-IDF.

In [12]:
vectorizer = TfidfVectorizer()

In [13]:
vector_feature=vectorizer.fit_transform(combined_features)

In [14]:
print(vector_feature)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 31869 stored elements and shape (4803, 26)>
  Coords	Values
  (0, 0)	0.3428146609317517
  (0, 1)	0.3961809909951843
  (0, 10)	0.48596434970205704
  (0, 20)	0.4524925628205004
  (0, 11)	0.4524925628205004
  (0, 15)	0.14129609214982725
  (0, 21)	0.14129609214982725
  (0, 3)	0.14129609214982725
  (0, 6)	0.14129609214982725
  (1, 0)	0.4461173177836259
  (1, 1)	0.5155648844750456
  (1, 10)	0.6324032689801039
  (1, 15)	0.18387379778876506
  (1, 21)	0.18387379778876506
  (1, 3)	0.18387379778876506
  (1, 6)	0.18387379778876506
  (2, 0)	0.4727769452370094
  (2, 1)	0.5463746450475276
  (2, 15)	0.1948619543836292
  (2, 21)	0.1948619543836292
  (2, 3)	0.1948619543836292
  (2, 6)	0.1948619543836292
  (2, 5)	0.5710271291600073
  (3, 0)	0.4672690252002193
  (3, 15)	0.19259178433035043
  :	:
  (4798, 22)	0.4778919419803599
  (4799, 15)	0.2555661153378224
  (4799, 21)	0.2555661153378224
  (4799, 3)	0.2555661153378224
  (4799, 6)	0.2555661153

Cosine Similarity

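We measure the similarity between every pair of movies using cosine similarity, which yields a 4803 × 4803 matrix whose entry (i, j) is the similarity between movie i and movie j.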

In [15]:
similarity = cosine_similarity(vector_feature)
print(similarity)

[[1.         0.7684406  0.48867105 ... 0.05068858 0.28259218 0.10931628]
 [0.7684406  1.         0.63592559 ... 0.06596291 0.3677476  0.1422573 ]
 [0.48867105 0.63592559 1.         ... 0.06990481 0.38972391 0.15075848]
 ...
 [0.05068858 0.06596291 0.06990481 ... 1.         0.17937007 0.06938645]
 [0.28259218 0.3677476  0.38972391 ... 0.17937007 1.         0.38683406]
 [0.10931628 0.1422573  0.15075848 ... 0.06938645 0.38683406 1.        ]]
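
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction. The sketch below, using three hypothetical vectors and an illustrative helper named `cosine`, computes it by hand and compares the result with scikit-learn:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a


def cosine(u, v):
    """Dot product divided by the product of the vector norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


manual_ab = cosine(a, b)
manual_ac = cosine(a, c)

# Pairwise similarities between all rows; the result is symmetric
matrix = cosine_similarity(np.vstack([a, b, c]))

print(manual_ab, manual_ac)
print(matrix)
```

Vectors pointing the same way score 1 regardless of length, and vectors with no shared non-zero components score 0, which is why the diagonal of the matrix above is all ones.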


In [16]:
user =input('enter your favourite movies:')

enter your favourite movies:iron man


In [17]:
list_of_all_titles = movies_data['title'].tolist()
print(list_of_all_titles)

['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre', 'The Dark Knight Rises', 'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron', 'Harry Potter and the Half-Blood Prince', 'Batman v Superman: Dawn of Justice', 'Superman Returns', 'Quantum of Solace', "Pirates of the Caribbean: Dead Man's Chest", 'The Lone Ranger', 'Man of Steel', 'The Chronicles of Narnia: Prince Caspian', 'The Avengers', 'Pirates of the Caribbean: On Stranger Tides', 'Men in Black 3', 'The Hobbit: The Battle of the Five Armies', 'The Amazing Spider-Man', 'Robin Hood', 'The Hobbit: The Desolation of Smaug', 'The Golden Compass', 'King Kong', 'Titanic', 'Captain America: Civil War', 'Battleship', 'Jurassic World', 'Skyfall', 'Spider-Man 2', 'Iron Man 3', 'Alice in Wonderland', 'X-Men: The Last Stand', 'Monsters University', 'Transformers: Revenge of the Fallen', 'Transformers: Age of Extinction', 'Oz: The Great and Powerful', 'The Amazing Spider-Man 2', 'TRON: Legacy', 'Cars 2', 'Green Lant

In [18]:
find_close_match = difflib.get_close_matches(user , list_of_all_titles)
print(find_close_match)

['Iron Man', 'Iron Man 3', 'Iron Man 2']


In [19]:
close_match = find_close_match[0]
print(close_match)

Iron Man
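
`difflib.get_close_matches` returns an empty list when nothing is similar enough, and it is case-sensitive, so indexing `[0]` directly can fail on unusual input. The following is a minimal sketch of a more defensive lookup; `closest_title` is a hypothetical helper and the short `titles` list is a stand-in for the dataset titles:

```python
import difflib

# Hypothetical stand-in for list_of_all_titles
titles = ["Iron Man", "Iron Man 2", "Iron Man 3", "Avatar"]


def closest_title(query, candidates, cutoff=0.6):
    """Return the best case-insensitive match, or None if nothing is close."""
    # Map lower-cased titles back to their original spelling
    lowered = {title.lower(): title for title in candidates}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


print(closest_title("iron man", titles))
print(closest_title("zzzzqqq", titles))
```

Returning `None` lets the caller decide how to react, for example by asking the user to retype the title.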


In [21]:
index_of_the_movie = movies_data[movies_data.title == close_match]['index'].values[0]
print(index_of_the_movie)

68


In [23]:
similarity_score = list(enumerate(similarity[index_of_the_movie]))
print(similarity_score)

[(0, np.float64(0.8739786329302666)), (1, np.float64(0.5276046150517384)), (2, np.float64(0.5591338606543207)), (3, np.float64(0.30782965203630197)), (4, np.float64(1.0)), (5, np.float64(0.5276046150517384)), (6, np.float64(0.11681932608059682)), (7, np.float64(1.0)), (8, np.float64(0.3279509052397385)), (9, np.float64(0.5276046150517384)), (10, np.float64(0.8739786329302666)), (11, np.float64(0.5088042314560765)), (12, np.float64(0.5276046150517384)), (13, np.float64(0.4346414511712149)), (14, np.float64(0.8739786329302666)), (15, np.float64(0.3279509052397385)), (16, np.float64(1.0)), (17, np.float64(0.5276046150517384)), (18, np.float64(0.8366850976812034)), (19, np.float64(0.5276046150517384)), (20, np.float64(0.5276046150517384)), (21, np.float64(0.6810971100379901)), (22, np.float64(0.3939965506255375)), (23, np.float64(0.3939965506255375)), (24, np.float64(0.6296538340533792)), (25, np.float64(0.14571779791393435)), (26, np.float64(1.0)), (27, np.float64(0.935391325873821)), (28

In [24]:
len(similarity_score)

4803

In [29]:
sorted_movies_similarity = sorted(similarity_score, key = lambda x:x[1] , reverse = True)
print(sorted_movies_similarity)

[(4, np.float64(1.0)), (7, np.float64(1.0)), (16, np.float64(1.0)), (26, np.float64(1.0)), (31, np.float64(1.0)), (35, np.float64(1.0)), (36, np.float64(1.0)), (39, np.float64(1.0)), (47, np.float64(1.0)), (51, np.float64(1.0)), (52, np.float64(1.0)), (56, np.float64(1.0)), (59, np.float64(1.0)), (68, np.float64(1.0)), (79, np.float64(1.0)), (85, np.float64(1.0)), (91, np.float64(1.0)), (94, np.float64(1.0)), (101, np.float64(1.0)), (102, np.float64(1.0)), (111, np.float64(1.0)), (158, np.float64(1.0)), (169, np.float64(1.0)), (174, np.float64(1.0)), (182, np.float64(1.0)), (183, np.float64(1.0)), (193, np.float64(1.0)), (207, np.float64(1.0)), (229, np.float64(1.0)), (230, np.float64(1.0)), (233, np.float64(1.0)), (242, np.float64(1.0)), (260, np.float64(1.0)), (400, np.float64(1.0)), (466, np.float64(1.0)), (483, np.float64(1.0)), (495, np.float64(1.0)), (507, np.float64(1.0)), (508, np.float64(1.0)), (511, np.float64(1.0)), (577, np.float64(1.0)), (1079, np.float64(1.0)), (1490, np.
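
The same ranking can be obtained with NumPy instead of sorting a list of tuples, which also makes it easy to drop the query movie itself from the results. This is a sketch on a hypothetical score vector (`scores`, `query_index`, and `top_k` are illustrative names):

```python
import numpy as np

# Hypothetical similarity row for the movie at position 1
scores = np.array([0.2, 1.0, 0.9, 0.4, 0.9])
query_index = 1

# Sort positions by descending score; a stable sort keeps ties in row order
order = np.argsort(-scores, kind="stable")

# Remove the query movie, which always has similarity 1.0 with itself
order = order[order != query_index]

top_k = order[:3]
print(top_k.tolist())
```

When many movies tie at the same score, as in the output above, the order among them is determined only by their row position, not by any additional signal.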

Movie Recommendation Function

We print the 29 titles with the highest similarity to the user's input, in descending order of similarity. The list can include the input movie itself.

In [34]:
print('movies recommended for you: \n')

# Only the top 29 entries are printed, so slice the list instead of scanning all rows
for i, (index, score) in enumerate(sorted_movies_similarity[:29], start=1):
  title_from_index = movies_data[movies_data.index==index]['title'].values[0]
  print(i , '.' ,title_from_index)


movies recommended for you: 

1 . John Carter
2 . Avengers: Age of Ultron
3 . The Avengers
4 . Captain America: Civil War
5 . Iron Man 3
6 . Transformers: Revenge of the Fallen
7 . Transformers: Age of Extinction
8 . TRON: Legacy
9 . Star Trek Into Darkness
10 . Pacific Rim
11 . Transformers: Dark of the Moon
12 . Star Trek Beyond
13 . 2012
14 . Iron Man
15 . Iron Man 2
16 . Captain America: The Winter Soldier
17 . Independence Day: Resurgence
18 . Guardians of the Galaxy
19 . X-Men: First Class
20 . The Hunger Games: Mockingjay - Part 2
21 . Transformers
22 . Star Trek
23 . Captain America: The First Avenger
24 . The Incredible Hulk
25 . Ant-Man
26 . The Hunger Games: Catching Fire
27 . After Earth
28 . Total Recall
29 . Star Wars: Episode III - Revenge of the Sith
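
The steps above can also be wrapped in a reusable function. The following is a sketch, not the notebook's own code: `recommend`, `toy_movies`, and `toy_similarity` are hypothetical names, and the four-row frame is invented so that the example runs on its own:

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for movies_data and combined_features
toy_movies = pd.DataFrame({
    "title": ["Space War", "Galaxy Fight", "Love Letters", "Funny Hearts"],
    "combined": [
        "action space alien war",
        "action space alien battle",
        "romance drama letters",
        "romance comedy hearts",
    ],
})
toy_similarity = cosine_similarity(
    TfidfVectorizer().fit_transform(toy_movies["combined"])
)


def recommend(query, movies, similarity, top_n=2):
    """Return up to top_n titles most similar to the closest match of query."""
    titles = movies["title"].tolist()
    matches = difflib.get_close_matches(query, titles, n=1)
    if not matches:
        return []  # nothing close enough to the query
    position = titles.index(matches[0])
    ranked = sorted(
        enumerate(similarity[position]), key=lambda pair: pair[1], reverse=True
    )
    # Skip the matched movie itself, then keep the best top_n
    return [titles[i] for i, _ in ranked if i != position][:top_n]


print(recommend("Space Wars", toy_movies, toy_similarity, top_n=1))
```

Passing the frame and the similarity matrix as arguments, rather than reading globals, makes the function easy to test on small data like this before running it on the full dataset.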


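Complete Recommendation Pipeline

Here we combine the previous steps into a single cell: read a title, find its closest match, rank all movies by similarity, and print the top results.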

In [38]:
user = input('enter your favourite movies:')
list_of_all_titles = movies_data['title'].tolist()
find_close_match = difflib.get_close_matches(user, list_of_all_titles)

if not find_close_match:
  # get_close_matches returns an empty list when nothing is similar enough
  print('no close match found for:', user)
else:
  close_match = find_close_match[0]
  index_of_the_movie = movies_data[movies_data.title == close_match]['index'].values[0]
  similarity_score = list(enumerate(similarity[index_of_the_movie]))
  sorted_movies_similarity = sorted(similarity_score, key=lambda x: x[1], reverse=True)

  print('movies recommended for you: \n')

  # Only the top 29 entries are printed, so slice the list instead of scanning all rows
  for i, (index, score) in enumerate(sorted_movies_similarity[:29], start=1):
    title_from_index = movies_data[movies_data.index == index]['title'].values[0]
    print(i, '.', title_from_index)

enter your favourite movies:Pirates of the Caribbean
movies recommended for you: 

1 . Pirates of the Caribbean: At World's End
2 . Spider-Man 3
3 . Batman v Superman: Dawn of Justice
4 . Pirates of the Caribbean: Dead Man's Chest
5 . Pirates of the Caribbean: On Stranger Tides
6 . The Hobbit: The Battle of the Five Armies
7 . The Amazing Spider-Man
8 . Spider-Man 2
9 . The Amazing Spider-Man 2
10 . The Mummy: Tomb of the Dragon Emperor
11 . The Hobbit: An Unexpected Journey
12 . Warcraft
13 . Thor: The Dark World
14 . Thor
15 . Pirates of the Caribbean: The Curse of the Black Pearl
16 . Clash of the Titans
17 . The 13th Warrior
18 . The Lord of the Rings: The Fellowship of the Ring
19 . The Mummy Returns
20 . The Lord of the Rings: The Return of the King
21 . The Lord of the Rings: The Two Towers
22 . Conan the Barbarian
23 . The Last Witch Hunter
24 . The Scorpion King
25 . Reign of Fire
26 . The Monkey King 2
27 . The Forbidden Kingdom
28 . Krull
29 . Conan the Destroyer


Conclusion

We implemented a content-based filtering system.

It uses TF-IDF vectorization and cosine similarity to score how alike two movies are.

It recommends movies whose content attributes (genres, keywords, tagline, cast, director) are similar to those of the title the user enters.