# Building a Movie Recommendation Program from User Input

In this notebook, we walk through an example project whose goal is to recommend movies to a user based on the movie title they enter.

## 1. Problem Definition
> Develop Python code that recommends movies to a user based on the content (characteristics) of the movies and the user's past preferences for similar content.

## 2. Data
The data is downloaded from Kaggle: https://www.kaggle.com/datasets/utkarshx27/movies-dataset

> The movies dataset is a comprehensive collection of information about 4,803 movies. It provides a wide range of details about each movie, including budget, genres, production companies, release date, revenue, runtime, language, popularity, and more.

## 3. Evaluation

## 4. Features
index, budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count, cast, crew, director.

### Importing the required Libraries

In [52]:
import numpy as np
import pandas as pd
import difflib
# difflib is needed because the user may mistype the movie name, so we look up the closest matching title in the dataset.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Data Collection and Pre-processing

In [53]:
# Loading the data from the CSV file into a pandas DataFrame

movie_df = pd.read_csv("movies.csv")

In [54]:
# Viewing the data
movie_df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| index | 0 | 1 | 2 | 3 | 4 |
| budget | 237000000 | 300000000 | 245000000 | 250000000 | 260000000 |
| genres | Action Adventure Fantasy Science Fiction | Adventure Fantasy Action | Action Adventure Crime | Action Crime Drama Thriller | Action Adventure Science Fiction |
| homepage | http://www.avatarmovie.com/ | http://disney.go.com/disneypictures/pirates/ | http://www.sonypictures.com/movies/spectre/ | http://www.thedarkknightrises.com/ | http://movies.disney.com/john-carter |
| id | 19995 | 285 | 206647 | 49026 | 49529 |
| keywords | culture clash future space war space colony so... | ocean drug abuse exotic island east india trad... | spy based on novel secret agent sequel mi6 | dc comics crime fighter terrorist secret ident... | based on novel mars medallion space travel pri... |
| original_language | en | en | en | en | en |
| original_title | Avatar | Pirates of the Caribbean: At World's End | Spectre | The Dark Knight Rises | John Carter |
| overview | In the 22nd century, a paraplegic Marine is di... | Captain Barbossa, long believed to be dead, ha... | A cryptic message from Bond’s past sends him o... | Following the death of District Attorney Harve... | John Carter is a war-weary, former military ca... |
| popularity | 150.437577 | 139.082615 | 107.376788 | 112.31295 | 43.926995 |


In [55]:
# Finding the shape of the imported data
movie_df.shape

(4803, 24)

In [56]:
# Selecting the relevant features for recommendation

selected_feature = ['genres', 'keywords', 'tagline', 'cast', 'director']
selected_feature

['genres', 'keywords', 'tagline', 'cast', 'director']

In [57]:
print(movie_df['genres'].isna().sum())

28


In [58]:
# Counting the missing values in each selected feature
missing_values = [movie_df[col].isna().sum() for col in selected_feature]
missing_values

[28, 412, 844, 43, 30]

In [59]:
# Replacing the missing values with empty strings
movie_df[selected_feature] = movie_df[selected_feature].fillna('')

In [60]:
missing_values = [movie_df[col].isna().sum() for col in selected_feature]
missing_values

[0, 0, 0, 0, 0]

In [66]:
# Combining all the 5 selected features

combined_features = movie_df['genres']+' '+movie_df['keywords']+' '+movie_df['tagline']+' '+movie_df['cast']+' '+movie_df['director']

In [67]:
len(combined_features)

4803
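
The fill-then-concatenate step above matters because adding a string to a missing value in pandas yields a missing value, so a single empty field would blank out the whole combined row. A minimal sketch on a made-up three-column frame (`toy_df`, `text_columns` and `toy_combined` are illustrative names, not part of the notebook) shows the same idea with a row-wise join instead of chained `+`:

```python
import pandas as pd

# Toy stand-in for movie_df; the rows below are made up for illustration.
toy_df = pd.DataFrame({
    "genres": ["Action Adventure", None],
    "keywords": ["space war", "spy mi6"],
    "tagline": [None, "A plan no one escapes"],
})
text_columns = ["genres", "keywords", "tagline"]

# Replace missing values with empty strings so concatenation never yields NaN.
toy_df[text_columns] = toy_df[text_columns].fillna("")

# Join the columns row by row with a single space between them.
toy_combined = toy_df[text_columns].agg(" ".join, axis=1)
print(toy_combined.tolist())
```

The row-wise join scales to any number of columns without lengthening the expression. The stray spaces left by empty fields should be harmless here, since the vectorizer splits on whitespace anyway.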

In [68]:
# Converting the text data to feature vectors
vectorizer = TfidfVectorizer()

In [69]:
feature_vectors = vectorizer.fit_transform(combined_features)
print(feature_vectors)

  (0, 201)	0.07860022416510505
  (0, 274)	0.09021200873707368
  (0, 5274)	0.11108562744414445
  (0, 13599)	0.1036413987316636
  (0, 5437)	0.1036413987316636
  (0, 3678)	0.21392179219912877
  (0, 3065)	0.22208377802661425
  (0, 5836)	0.1646750903586285
  (0, 14378)	0.33962752210959823
  (0, 16587)	0.12549432354918996
  (0, 3225)	0.24960162956997736
  (0, 14271)	0.21392179219912877
  (0, 4945)	0.24025852494110758
  (0, 15261)	0.07095833561276566
  (0, 16998)	0.1282126322850579
  (0, 11192)	0.09049319826481456
  (0, 11503)	0.27211310056983656
  (0, 13349)	0.15021264094167086
  (0, 17007)	0.23643326319898797
  (0, 17290)	0.20197912553916567
  (0, 13319)	0.2177470539412484
  (0, 14064)	0.20596090415084142
  (0, 16668)	0.19843263965100372
  (0, 14608)	0.15150672398763912
  (0, 8756)	0.22709015857011816
  :	:
  (4801, 403)	0.17727585190343229
  (4801, 4835)	0.24713765026964
  (4801, 17266)	0.28860981849329476
  (4801, 13835)	0.27870029291200094
  (4801, 13175)	0.28860981849329476
  (4801, 171
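
Each line of the printed sparse matrix above reads `(row, column)  weight`: the row is the movie, the column is a token from the learned vocabulary, and the value is that token's TF-IDF weight for that movie. The following sketch reproduces this on three made-up documents (`toy_docs`, `toy_vectorizer` and `toy_vectors` are illustrative names), small enough to inspect by eye:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "combined feature" strings, not rows from the dataset.
toy_docs = [
    "action space war",
    "action spy sequel",
    "romance drama",
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per distinct token.
print(toy_vectors.shape)
# Maps each token to its column index in the matrix.
print(toy_vectorizer.vocabulary_)
# A dense view is fine for a toy matrix; avoid it on the full 4803-row matrix.
print(toy_vectors.toarray().round(2))
```

In the dense view, "action" appears in two documents and so should receive a lower weight than tokens that appear in only one.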

### Cosine Similarity

In [70]:
# Getting the similarity scores using cosine similarity

similarity = cosine_similarity(feature_vectors)
similarity

array([[1.        , 0.07219487, 0.037733  , ..., 0.        , 0.        ,
        0.        ],
       [0.07219487, 1.        , 0.03281499, ..., 0.03575545, 0.        ,
        0.        ],
       [0.037733  , 0.03281499, 1.        , ..., 0.        , 0.05389661,
        0.        ],
       ...,
       [0.        , 0.03575545, 0.        , ..., 1.        , 0.        ,
        0.02651502],
       [0.        , 0.        , 0.05389661, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.02651502, 0.        ,
        1.        ]])

In [71]:
similarity.shape

(4803, 4803)
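
The matrix above is symmetric with ones on the diagonal because cosine similarity compares every movie with every other movie, including itself. As a small check of what the function computes, the sketch below compares the textbook definition (dot product divided by the product of the vector norms) with scikit-learn's result on two made-up vectors; `a`, `b`, `manual` and `pairwise` are illustrative names:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up feature vectors.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# Cosine similarity by definition: dot product divided by the product of norms.
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2-D input, so stack the vectors as rows.
pairwise = cosine_similarity(np.vstack([a, b]))

print(manual, pairwise[0, 1])
```

Both printed values should agree up to floating-point error, and `pairwise` has the same layout as the 4803 x 4803 matrix: entry `[i, j]` is the similarity between item `i` and item `j`.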

In [72]:
# Getting the movie name from the user

movie_name = input('Enter your favourite movie name : ')

In [73]:
# Creating a list with all the movie names given in the dataset


list_of_all_titles = movie_df['title'].to_list()
list_of_all_titles

['Avatar',
 "Pirates of the Caribbean: At World's End",
 'Spectre',
 'The Dark Knight Rises',
 'John Carter',
 'Spider-Man 3',
 'Tangled',
 'Avengers: Age of Ultron',
 'Harry Potter and the Half-Blood Prince',
 'Batman v Superman: Dawn of Justice',
 'Superman Returns',
 'Quantum of Solace',
 "Pirates of the Caribbean: Dead Man's Chest",
 'The Lone Ranger',
 'Man of Steel',
 'The Chronicles of Narnia: Prince Caspian',
 'The Avengers',
 'Pirates of the Caribbean: On Stranger Tides',
 'Men in Black 3',
 'The Hobbit: The Battle of the Five Armies',
 'The Amazing Spider-Man',
 'Robin Hood',
 'The Hobbit: The Desolation of Smaug',
 'The Golden Compass',
 'King Kong',
 'Titanic',
 'Captain America: Civil War',
 'Battleship',
 'Jurassic World',
 'Skyfall',
 'Spider-Man 2',
 'Iron Man 3',
 'Alice in Wonderland',
 'X-Men: The Last Stand',
 'Monsters University',
 'Transformers: Revenge of the Fallen',
 'Transformers: Age of Extinction',
 'Oz: The Great and Powerful',
 'The Amazing Spider-Man 2',

In [79]:
# Finding the closest match for the movie name given by the user.

# get_close_matches returns a list of close matches, best first, so we take the first element.
# If no title is close enough the list is empty and the [0] lookup raises an IndexError.
close_match = difflib.get_close_matches(movie_name, list_of_all_titles)[0]
close_match

'Iron Man'
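
Because `get_close_matches` returns an empty list when nothing is similar enough, a defensive wrapper can be useful. The sketch below is one possible shape for it, not the notebook's method: `find_title` and `sample_titles` are hypothetical names, the typed inputs are made up, and the lowercasing step is an added assumption to make matching case-insensitive. The `cutoff=0.6` value is simply difflib's default.

```python
import difflib

# A few titles copied from the listing above; the typed input is made up.
sample_titles = ["Iron Man", "Iron Man 2", "Avatar", "Spectre"]


def find_title(query, titles, cutoff=0.6):
    """Return the best fuzzy match for query, or None if nothing is close enough."""
    # Compare lowercased strings so that capitalisation does not reduce the score.
    lowered = {title.lower(): title for title in titles}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


print(find_title("iron mna", sample_titles))
print(find_title("zzzzzz", sample_titles))
```

Returning `None` lets the caller ask the user to retype the name instead of crashing on an empty list.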

In [82]:
# Finding the index of the movie with the matched title

index_of_the_movie = movie_df[movie_df.title == close_match]['index'].values[0]
index_of_the_movie

68

In [85]:
# Pairing every movie index with its similarity score to the chosen movie

similarity_score = list(enumerate(similarity[index_of_the_movie]))
similarity_score

[(0, 0.033570748780675445),
 (1, 0.0546448279236134),
 (2, 0.013735500604224325),
 (3, 0.006468756104392058),
 (4, 0.03268943310073387),
 (5, 0.013907256685755475),
 (6, 0.07692837576335508),
 (7, 0.23944423963486416),
 (8, 0.007882387851851008),
 (9, 0.07599206098164224),
 (10, 0.07536074882460439),
 (11, 0.01192606921174529),
 (12, 0.013707618139948932),
 (13, 0.01237607492508997),
 (14, 0.09657127116284187),
 (15, 0.007286271383816743),
 (16, 0.22704403782296806),
 (17, 0.013112928084103857),
 (18, 0.04140526820609594),
 (19, 0.07883282546834255),
 (20, 0.07981173664799916),
 (21, 0.011266873271064948),
 (22, 0.006892575895462364),
 (23, 0.006599097891242659),
 (24, 0.012665208122549735),
 (25, 0.0),
 (26, 0.21566241096831162),
 (27, 0.030581282093826635),
 (28, 0.061074402219665376),
 (29, 0.014046184258938901),
 (30, 0.0807734659476981),
 (31, 0.3146705244947752),
 (32, 0.02878209913426701),
 (33, 0.13089810941050173),
 (34, 0.0),
 (35, 0.0353500906748656),
 (36, 0.031853252699375

In [89]:
# Sorting the movies by their similarity score, highest first
sorted_similar_movie = sorted(similarity_score, key = lambda x:x[1], reverse= True)
sorted_similar_movie[:5]

[(68, 1.0),
 (79, 0.40890433998005965),
 (31, 0.3146705244947752),
 (7, 0.23944423963486416),
 (16, 0.22704403782296806)]
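
The top entry, `(68, 1.0)`, is the chosen movie compared with itself, so it carries no information as a recommendation. One way to rank neighbours while skipping the query is sketched below with `np.argsort` on a made-up score row; `toy_scores`, `query_index`, `ranked` and `top_indices` are illustrative names and the numbers are not taken from the dataset.

```python
import numpy as np

# Made-up similarity row for one query movie (index 2 is the query itself).
toy_scores = np.array([0.10, 0.40, 1.00, 0.00, 0.25])
query_index = 2

# argsort is ascending, so reverse it to rank from most to least similar.
ranked = np.argsort(toy_scores)[::-1]

# Drop the query movie, then keep the top three neighbours.
top_indices = [int(i) for i in ranked if i != query_index][:3]
print(top_indices)
```

Working on the NumPy row directly also avoids building a Python list of 4803 tuples for every query.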

In [90]:
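# Print the titles of the 29 most similar movies (the first entry is the chosen movie itself)

print("Movies suggested for you : \n")
# Slicing the sorted list avoids looping over all 4803 movies
for i, (index, _) in enumerate(sorted_similar_movie[:29], start=1):
    # movie_df keeps its default integer index, so the row label equals the position in the similarity matrix
    title_from_index = movie_df.loc[index, 'title']
    print(i, '.', title_from_index)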

Movies suggested for you : 

1 . Iron Man
2 . Iron Man 2
3 . Iron Man 3
4 . Avengers: Age of Ultron
5 . The Avengers
6 . Captain America: Civil War
7 . Captain America: The Winter Soldier
8 . Ant-Man
9 . X-Men
10 . Made
11 . X-Men: Apocalypse
12 . X2
13 . The Incredible Hulk
14 . The Helix... Loaded
15 . X-Men: First Class
16 . X-Men: Days of Future Past
17 . Captain America: The First Avenger
18 . Kick-Ass 2
19 . Guardians of the Galaxy
20 . Deadpool
21 . Thor: The Dark World
22 . G-Force
23 . X-Men: The Last Stand
24 . Duets
25 . Mortdecai
26 . The Last Airbender
27 . Southland Tales
28 . Zathura: A Space Adventure
29 . Sky Captain and the World of Tomorrow


In [97]:
movie_name = input('Enter your favourite movie name: ')
print(movie_name)

list_of_all_titles = movie_df['title'].tolist()

close_match = difflib.get_close_matches(movie_name, list_of_all_titles)[0]

index_of_the_movie = movie_df[movie_df.title == close_match]['index'].values[0]

similarity_score = list(enumerate(similarity[index_of_the_movie]))

sorted_similar_movie = sorted(similarity_score, key = lambda x:x[1], reverse= True)

# Print the 30 most similar titles; slicing avoids looping over all 4803 movies
for i, (index, _) in enumerate(sorted_similar_movie[:30], start=1):
    title_from_index = movie_df.loc[index, 'title']
    print(i, '.', title_from_index)

avatar
1 . Avatar
2 . Alien
3 . Aliens
4 . Guardians of the Galaxy
5 . Star Trek Beyond
6 . Star Trek Into Darkness
7 . Galaxy Quest
8 . Alien³
9 . Cargo
10 . Trekkies
11 . Gravity
12 . Moonraker
13 . Jason X
14 . Pocahontas
15 . Space Cowboys
16 . The Helix... Loaded
17 . Lockout
18 . Event Horizon
19 . Space Dogs
20 . Machete Kills
21 . Gettysburg
22 . Clash of the Titans
23 . Star Wars: Clone Wars: Volume 1
24 . The Right Stuff
25 . Terminator Salvation
26 . The Astronaut's Wife
27 . Planet of the Apes
28 . Star Trek
29 . Wing Commander
30 . Sunshine
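
The steps in the last cell can also be packaged as a reusable function. The following is a minimal sketch under stated assumptions rather than the notebook's own code: `recommend` is a hypothetical helper, and the four-movie catalogue `toy_movies` is made up so that the example runs without `movies.csv`. Unlike the cell above, it returns an empty list when no title is close enough and leaves the query movie out of the results.

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(title, titles, similarity_matrix, top_n=5):
    """Return up to top_n titles most similar to the closest match for title."""
    matches = difflib.get_close_matches(title, titles, n=1)
    if not matches:
        return []
    query_index = titles.index(matches[0])
    scores = similarity_matrix[query_index]
    # Rank every movie by similarity, highest first, skipping the query itself.
    ranked = sorted(range(len(titles)), key=lambda i: scores[i], reverse=True)
    return [titles[i] for i in ranked if i != query_index][:top_n]


# Made-up miniature catalogue so the sketch runs without movies.csv.
toy_movies = pd.DataFrame({
    "title": ["Space Fight", "Space Fight 2", "Quiet Romance", "Undercover Agent"],
    "features": [
        "action space war alien",
        "action space war sequel",
        "romance drama paris",
        "spy action secret agent",
    ],
})
toy_titles = toy_movies["title"].tolist()
toy_similarity = cosine_similarity(
    TfidfVectorizer().fit_transform(toy_movies["features"])
)
print(recommend("Space Figth", toy_titles, toy_similarity, top_n=2))
```

With the real data, the same function could be called with `list_of_all_titles` and `similarity` from the cells above, provided the title list is in the same row order as the similarity matrix.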
