### Project: MovieMatcher

#### Introduction

This project aims to construct a content-based movie recommendation system using machine learning. The system is designed to predict user preferences and offer personalised recommendations based on their likes and dislikes.

#### Business Problem

The movie industry has an abundance of data on movies, such as plot summaries, ratings, and reviews. It can be difficult for users to go through this large amount of information to find the films they want. A movie recommendation system can therefore be a useful tool to help users discover relevant movies quickly and conveniently.

##### Import Libraries and data

In [1]:
import pandas as pd
import numpy as np

# Read CSV
credits = pd.read_csv("credits.csv")
movies = pd.read_csv("movies.csv")

# First 2 rows
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [2]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [3]:
print("Credits:",credits.shape)
print("Movies Dataframe:",movies.shape)

Credits: (4803, 4)
Movies Dataframe: (4803, 20)


In [4]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

Foreign Key: A foreign key is a column or set of columns in one table that refers to the primary key (or a unique key) in another table. It establishes a link between two tables and enables the tables to be related to each other.

In the credits DataFrame, the key column is movie_id; in the movies DataFrame, it is id.

We will rename movie_id to id and merge the two DataFrames on that key.
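The key-based merge described above can be sketched on toy data. The two small DataFrames and their values are hypothetical stand-ins, not rows from the real dataset, and the `validate` argument is an optional safeguard that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values)
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame({"movie_id": [1, 2, 3], "cast": ["x", "y", "z"]})

# Align the key name, then merge on it. validate="one_to_one" makes pandas
# raise a MergeError if a key value repeats on either side.
merged_toy = movies_toy.merge(
    credits_toy.rename(columns={"movie_id": "id"}),
    on="id",
    how="inner",
    validate="one_to_one",
)
print(merged_toy.shape)
```

Checking the shape before and after the merge is a cheap way to confirm that no rows were lost or duplicated.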

In [36]:
# Rename the key column so both DataFrames use the name 'id'
credits_renamed = credits.rename(columns={"movie_id": "id"})

# Merge the two DataFrames on the shared key
data = movies.merge(credits_renamed, on='id')

print(data.head())

      budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  300000000  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  245000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  250000000  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  260000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  [{"id": 1463, "name": "culture clash"}, {"id":...                en   
1  [{"id": 270, "name": "ocean"}, {"id": 726, "na...                en   
2  [{"id": 470, "nam

Some of the columns repeat the title or would not serve our model. Let us check them and drop them.

In [6]:
data.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title_x', 'vote_average',
       'vote_count', 'title_y', 'cast', 'crew'],
      dtype='object')

Let us get rid of the columns that repeat information and then drop all rows that contain NaN values.

In [7]:
data[['title_x', 'title_y', 'homepage', 'status','production_countries', 'production_companies', 'tagline']].head(3)

Unnamed: 0,title_x,title_y,homepage,status,production_countries,production_companies,tagline
0,Avatar,Avatar,http://www.avatarmovie.com/,Released,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...",Enter the World of Pandora.
1,Pirates of the Caribbean: At World's End,Pirates of the Caribbean: At World's End,http://disney.go.com/disneypictures/pirates/,Released,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","At the end of the world, the adventure begins."
2,Spectre,Spectre,http://www.sonypictures.com/movies/spectre/,Released,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",A Plan No One Escapes


In [8]:
data_cleaned = data.drop(columns=['title_x', 'title_y', 'homepage', 'status', 'production_countries', 'production_companies', 'tagline'])

print(data_cleaned.info())
data_cleaned.head(3)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 16 columns):
budget               4803 non-null int64
genres               4803 non-null object
id                   4803 non-null int64
keywords             4803 non-null object
original_language    4803 non-null object
original_title       4803 non-null object
overview             4800 non-null object
popularity           4803 non-null float64
release_date         4802 non-null object
revenue              4803 non-null int64
runtime              4801 non-null float64
spoken_languages     4803 non-null object
vote_average         4803 non-null float64
vote_count           4803 non-null int64
cast                 4803 non-null object
crew                 4803 non-null object
dtypes: float64(3), int64(4), object(9)
memory usage: 469.0+ KB
None


Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,release_date,revenue,runtime,spoken_languages,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",6.3,4466,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


In [16]:
data_cleaned = data_cleaned.dropna(axis=0)

data_cleaned.isnull().sum()


budget               0
genres               0
id                   0
keywords             0
original_language    0
original_title       0
overview             0
popularity           0
release_date         0
revenue              0
runtime              0
spoken_languages     0
vote_average         0
vote_count           0
cast                 0
crew                 0
dtype: int64

Creating a TF-IDF vectorizer with an ngram_range of 1 to 3 words, disregarding English stop words.
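To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF over 1- to 3-word n-grams. It assumes scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation. The three short documents and the helper names `ngrams` and `tfidf` are hypothetical, and stop-word removal and `min_df` filtering are left out for brevity.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the movie overviews
docs = ["dark knight returns", "dark city", "space marine"]

def ngrams(tokens, n_min=1, n_max=3):
    """All word n-grams of length n_min..n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokenised = [ngrams(d.split()) for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents each term occurs
df = Counter(term for doc in tokenised for term in set(doc))

def tfidf(doc):
    """L2-normalised TF-IDF weights for one tokenised document."""
    tf = Counter(doc)
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, c in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(doc) for doc in tokenised]
print(vectors[0])
```

In the first document, "dark" also occurs in the second document, so it receives a lower weight than the rarer "dark knight".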

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF over 1- to 3-grams; terms found in fewer than 3 overviews are ignored
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 3),
                      stop_words='english')

In [18]:
# Fitting the TF-IDF on the 'overview' text
tfv_matrix = tfv.fit_transform(data_cleaned['overview'])
print(tfv_matrix)
print(tfv_matrix.shape)

  (0, 148)	0.3091371723189815
  (0, 1670)	0.27815413610097817
  (0, 431)	0.2108413295939603
  (0, 7055)	0.2686774882613109
  (0, 6447)	0.2566772823097974
  (0, 3582)	0.21787716706422044
  (0, 9392)	0.24143973860066492
  (0, 5907)	0.17991690004805883
  (0, 9716)	0.24435186815985976
  (0, 6543)	0.29591523792195684
  (0, 5972)	0.27473495736009795
  (0, 2634)	0.28189942265833556
  (0, 5658)	0.26104797888037473
  (0, 1514)	0.20118105634370556
  (0, 147)	0.3091371723189815
  (1, 1810)	0.36794361320381896
  (1, 7159)	0.3031053671236092
  (1, 2916)	0.30082340684198766
  (1, 9608)	0.3355244901637141
  (1, 2848)	0.21555947963880698
  (1, 2872)	0.3232534752773416
  (1, 4205)	0.3080331741124809
  (1, 5263)	0.13328087934556093
  (1, 1806)	0.21045212959628706
  (1, 2318)	0.21891500721538523
  :	:
  (4798, 671)	0.15980111581570794
  (4798, 2360)	0.15010143939644302
  (4798, 677)	0.13488458793471353
  (4798, 3736)	0.14858681983687555
  (4798, 3480)	0.14339731171521328
  (4798, 1252)	0.1517283521734410

In [31]:
from sklearn.metrics.pairwise import sigmoid_kernel

# Compute the sigmoid kernel
kernel = sigmoid_kernel(tfv_matrix, tfv_matrix)
print(kernel[0])

# Reverse mapping of indices and movie titles
indices = pd.Series(data_cleaned.index, index = data_cleaned['original_title']).drop_duplicates()
print(indices)

# Get the index corresponding to original_title
print(indices['The Dark Knight Rises'])
print(kernel[3])

# Get the pairwise similarity scores
print(list(enumerate(kernel[indices['The Dark Knight Rises']])))

# Sorting and enumerating 
print(sorted(list(enumerate(kernel[indices['The Dark Knight Rises']])), key=lambda x: x[1], reverse=True))

[0.76163447 0.76159416 0.76159416 ... 0.76159416 0.76159416 0.76159416]
original_title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4799, dtype: int64
3
[0.76159506 0.76159416 0.76159441 ... 0.76159558 0.76159573 0.76159616]
[(0, 0.7615950636376972), (1, 0.7615941559557649), (2, 0.7615944076915708), (3, 0.7616344731250241), (4, 0.7615950454825771), (5, 0.7615945108439179), (6, 0.7615953505035772), (7, 0.7615954025197594), (8, 0.7615950491551496), (9, 0.7616005201569384), (10, 0.761594155955764
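The scores above all sit close to 0.7616, which follows from how the kernel is defined. scikit-learn's `sigmoid_kernel` computes tanh(gamma * <x, y> + coef0), with defaults gamma = 1 / n_features and coef0 = 1. With a vocabulary of thousands of terms, gamma is tiny, so every score lands just above tanh(1) ≈ 0.76159416. The pure-Python sketch below reproduces this for a single pair of vectors; the function name `sigmoid_kernel_pair` and the vocabulary size are assumptions for illustration only.

```python
import math

def sigmoid_kernel_pair(x, y, gamma=None, coef0=1.0):
    """tanh(gamma * <x, y> + coef0) for two dense vectors of equal length."""
    if gamma is None:
        gamma = 1.0 / len(x)  # same default as scikit-learn: 1 / n_features
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)

n_features = 10_000  # hypothetical vocabulary size

# Two unit-length vectors that share no terms
x = [0.0] * n_features
x[0] = 1.0
y = [0.0] * n_features
y[1] = 1.0

self_score = sigmoid_kernel_pair(x, x)      # dot product is 1
disjoint_score = sigmoid_kernel_pair(x, y)  # dot product is 0, so tanh(1)

print(self_score, disjoint_score)
```

Because tanh is monotonic, ranking by this kernel gives the same order as ranking by the raw dot product of the TF-IDF rows. Only the scale of the scores is compressed.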

In [32]:
def give_recomendations(title, sig=kernel):
    # Row label of the movie with this original_title
    label = indices[title]

    # dropna() left gaps in the row labels, so convert the label to the
    # positional row of the kernel matrix
    index = data_cleaned.index.get_loc(label)

    # Get the pairwise similarity scores
    kernel_scores = list(enumerate(sig[index]))

    # Sort by score, highest first
    kernel_scores = sorted(kernel_scores, key=lambda x: x[1], reverse=True)

    # 10 most similar movies (the top entry is normally the movie itself, so skip it)
    kernel_scores = kernel_scores[1:11]

    # Positional indices of those movies
    movie_indices = [i[0] for i in kernel_scores]

    # Top 10 similar movies
    return data_cleaned['original_title'].iloc[movie_indices]

In [33]:
give_recomendations('The Dark Knight Rises')

299                              Batman Forever
65                              The Dark Knight
1359                                     Batman
428                              Batman Returns
2507                                  Slow Burn
119                               Batman Begins
1181                                        JFK
9            Batman v Superman: Dawn of Justice
3854    Batman: The Dark Knight Returns, Part 2
210                              Batman & Robin
Name: original_title, dtype: object

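Most of the recommended movies are Batman movies, which makes sense because the similarity is computed from the overview text. Slow Burn and JFK are the two exceptions.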
