Movie recommendation system project

Aim of the project:
- Build a recommendation system using Machine Learning

1. Problem statement\
Our goal is to build a recommendation system that displays the 5 most relevant movies, out of a list of 10 000, for a movie title typed by the user


The data set includes:
- General information about the 10 000 movies (title, genre, release date...)
- The popularity of the movie
- The overview of the movie

2. Data collection

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import warnings
warnings.filterwarnings('ignore')

In [2]:
movies_df = pd.read_csv("..\\content\\top10K-TMDB-movies.csv")

In [3]:
movies_df.head(5)

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811


2.2 Dataset information

In [4]:

# Display the titles of the columns
movies_df.keys()

Index(['id', 'title', 'genre', 'original_language', 'overview', 'popularity',
       'release_date', 'vote_average', 'vote_count'],
      dtype='object')

Feature Summary:
- id: Movie's unique identification number
- title: Title of the movie
- genre: One or more genres of the movie
- original_language: Language of the movie's original version
- overview: Summary of the movie's plot
- popularity: Popularity of the movie
- release_date: Release date of the movie
- vote_average: Average rating given to the movie by voters
- vote_count: Number of votes

In [5]:
movies_df.shape

(10000, 9)

There are indeed 10 000 rows, one per movie, and 9 columns

3. Data checks

3.1 Check missing values

In [6]:
# Count missing values (null)
movies_df.isnull().sum()

id                    0
title                 0
genre                 3
original_language     0
overview             13
popularity            0
release_date          0
vote_average          0
vote_count            0
dtype: int64

In [7]:
# Display the rows that contain at least one missing value
null_data = movies_df[movies_df.isnull().any(axis=1)]
null_data

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
3361,50472,Anplagghed al cinema,,it,"A queue at the ATM machine, a displaced family...",4.42,2006-11-26,7.0,313
4150,38537,Nati stanchi,Comedy,it,,5.671,2002-03-01,6.8,211
6973,31359,Would I Lie to You? 2,Comedy,fr,,4.741,2001-02-07,6.2,325
7821,43211,7 Kilos in 7 Days,,it,Two not very clever young doctors open a fitne...,5.885,1986-02-02,6.0,212
7941,2029,Tanguy,Comedy,fr,,5.449,2001-11-21,6.0,387
8518,57114,"Amore, bugie e calcetto",,en,,4.709,2008-04-04,5.8,200
9293,17413,Incognito,Comedy,fr,,5.602,2009-04-28,5.5,213
9440,516043,Arrivano i prof,Comedy,it,,6.558,2018-05-01,5.4,337
9620,154512,Lightning Strike,Comedy,it,,4.07,2012-12-13,5.3,216
9792,42426,A spasso nel tempo - L'avventura continua,"Comedy,Fantasy",it,,5.02,1997-12-11,5.1,209


Since there are only a few missing values, we can do some research and fill in the missing genres by hand

In [8]:
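# Use .loc to avoid chained assignment, which may silently fail to update movies_df
movies_df.loc[3361, "genre"] = "Comedy"
movies_df.loc[8518, "genre"] = "Comedy"
movies_df.loc[7821, "genre"] = "Comedy"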

In [9]:
movies_df.loc[[7821]]

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
7821,43211,7 Kilos in 7 Days,Comedy,it,Two not very clever young doctors open a fitne...,5.885,1986-02-02,6.0,212


We can drop the movies without an overview, because looking up each missing overview would take too much time

In [10]:
movies_df = movies_df.dropna(axis=0)
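Called without arguments, `dropna` removes every row that has a missing value in any column. A minimal sketch, on a made-up three-row frame (the titles and texts are placeholders, not rows from the dataset), of restricting the drop to the `overview` column so that the intent stays explicit:

```python
import pandas as pd

# Toy frame (placeholder values): one row lacks an overview, another lacks a genre
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genre": ["Comedy", None, "Drama"],
    "overview": ["A short plot summary.", "Another plot summary.", None],
})

# Drop only the rows whose overview is missing; the row with a missing genre is kept
with_overview = toy_df.dropna(subset=["overview"])
print(with_overview["title"].tolist())
```

In this notebook both calls give the same result, because the missing genres were already filled in before the drop.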

3.2 Check for duplicates

In [11]:
movies_df.duplicated().sum()

0

In [12]:
movies_df.loc[movies_df.duplicated()]

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count


3.3 Check the data types of the features

In [13]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9987 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 9987 non-null   int64  
 1   title              9987 non-null   object 
 2   genre              9987 non-null   object 
 3   original_language  9987 non-null   object 
 4   overview           9987 non-null   object 
 5   popularity         9987 non-null   float64
 6   release_date       9987 non-null   object 
 7   vote_average       9987 non-null   float64
 8   vote_count         9987 non-null   int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 780.2+ KB


We observe that release_date has the object dtype; converting it to a datetime dtype makes date operations easier

Conversion of release_date from object to datetime:

In [14]:
movies_df["release_date"] = pd.to_datetime(movies_df["release_date"])

3.4 Checking the number of unique values

In [15]:
movies_df.nunique()

id                   9987
title                9648
genre                2123
original_language      43
overview             9985
popularity           8499
release_date         6109
vote_average           42
vote_count           3191
dtype: int64

In [16]:
print(movies_df['id'].nunique() == movies_df.shape[0])

True


The id column only contains unique values, so it will not help the model learn anything and can be dropped

In [17]:
#movies_df = movies_df.drop(["id"], axis=1)

In [18]:
# to_csv writes the file and returns None, so the result is not assigned
movies_df.to_csv("..\\content\\clean_dataset.csv", index=False)

3.5 Check statistics of data set

4. Build a recommendation system based on the overview and genre of the movies

4.1 Create the new dataset for this recommendation system

In [None]:
# Load the clean dataset

clean_dataset = pd.read_csv("..\\content\\clean_dataset.csv")
clean_dataset.head(5)


4.1.1 Create a new dataframe with the columns we want to use for the recommendation system

In [None]:
saved_col = ["title", "genre", "overview"]
# Copy the selection so that adding a column later does not act on a view of clean_dataset
new_df = clean_dataset[saved_col].copy()
new_df.head(10)

In [None]:
new_df["tags"] = new_df["genre"] + " " + new_df["overview"]

In [None]:
new_df = new_df.drop(["genre", "overview"], axis=1)

In [None]:
new_df

4.2 Convert text to vector\
2 Methods:
- Bag of words
- TF-IDF

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=9987, stop_words="english", ngram_range=(1, 2))

In [None]:
vectorized_tags = cv.fit_transform(new_df["tags"].values.astype("U")).toarray()

In [None]:
vectorized_tags.shape

In [None]:
vectorized_tags

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity_vector = cosine_similarity(vectorized_tags)

In [None]:
similarity_vector
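For reference, the cosine similarity between two count vectors is their dot product divided by the product of their norms. It is 1 for vectors pointing in the same direction and 0 for vectors with no term in common. A minimal sketch on made-up vectors (the numbers are illustrative, not taken from `vectorized_tags`; `manual_cosine` is a hypothetical helper) that checks a manual computation against `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up term-count vectors for three documents
vectors = np.array([
    [1, 1, 0, 0],
    [2, 2, 0, 0],   # same direction as the first row
    [0, 0, 3, 1],   # shares no term with the first row
])

def manual_cosine(u, v):
    # Dot product divided by the product of the Euclidean norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine_similarity(vectors)  # 3 x 3 matrix, one row and one column per document
print(sim.round(3))
print(manual_cosine(vectors[0], vectors[1]), manual_cosine(vectors[0], vectors[2]))
```

Row `i` of the result holds the similarity of document `i` to every document, which is why a single row can be sorted to rank candidates.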

In [None]:
# Look up a movie by its title and list the movies most similar to it
class Retrieve_data():
    def __init__(self, dataset, title):
        self.dataset = dataset
        self.title = title

    def get_index_from_title(self):
        # Return the index of the first movie matching the title, or None if not found
        try:
            return self.dataset[self.dataset["title"] == self.title].index[0]
        except (IndexError, KeyError, TypeError):
            print("Either the movie or the dataset doesn't exist")
            return None

    def recommended_movies(self, index, number):
        # Rank every movie by its cosine similarity to the movie at `index`
        self.distance = sorted(enumerate(similarity_vector[index]), reverse=True, key=lambda vector: vector[1])

        # Skip the first entry, which is normally the movie itself
        recommended_movies = [self.dataset.iloc[movie[0]].title for movie in self.distance[1:number + 1]]
        print(recommended_movies)
        return recommended_movies


In [None]:
data_retrieved = Retrieve_data(new_df, "Iron Man")

In [None]:
index = data_retrieved.get_index_from_title()
index

In [None]:
recommendations = data_retrieved.recommended_movies(index, 10)
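The ranking step can also be written with `numpy.argsort`, which avoids building and sorting a Python list of tuples. The sketch below is one possible variant under stated assumptions: `top_k_similar` is a hypothetical helper name, and the titles and similarity values are made up for illustration.

```python
import numpy as np

def top_k_similar(similarity_matrix, titles, index, k=5):
    """Return the k titles most similar to the movie at `index` (hypothetical helper)."""
    scores = similarity_matrix[index]
    # Sort indices by decreasing similarity (a stable sort keeps ties in index order)
    order = np.argsort(-scores, kind="stable")
    # Exclude the query movie explicitly instead of assuming it is ranked first
    order = order[order != index]
    return [titles[i] for i in order[:k]]

# Made-up titles and a made-up symmetric similarity matrix
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

print(top_k_similar(similarity, titles, index=0, k=2))
```

Removing the query index explicitly matters when another movie has an identical overview: it would also score 1.0 and could otherwise be the entry that gets skipped.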

Save the model

In [None]:
import pickle
# Save the similarity matrix; the with block closes the file once it is written
with open("similarity_vector.pkl", "wb") as f:
    pickle.dump(similarity_vector, f)

In [None]:
# Save the dataset
with open("movies.pkl", "wb") as f:
    pickle.dump(new_df, f)


In [None]:
pickle.load(open('movies.pkl', 'rb'))

4.2.2 TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
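This section stops at the import. Below is a minimal sketch of how the TF-IDF variant might continue, mirroring the bag-of-words steps of section 4.2. It runs on a made-up `tags` column so that it is self-contained; the names `toy_df`, `tfidf`, `tfidf_tags` and `tfidf_similarity` and the sample texts are assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-in for new_df, with the same "title" and "tags" columns
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "tags": [
        "Action a billionaire builds an armored suit to fight crime",
        "Action a soldier builds a suit of armor to fight an army",
        "Romance two strangers fall in love on a night train",
    ],
})

# Same stop-word and n-gram settings as the CountVectorizer cell
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf_tags = tfidf.fit_transform(toy_df["tags"])

# cosine_similarity accepts the sparse matrix directly, so no .toarray() is needed
tfidf_similarity = cosine_similarity(tfidf_tags)
print(tfidf_similarity.round(2))
```

On the real data, `toy_df["tags"]` would be replaced by `new_df["tags"]`, and the resulting matrix could be passed to the same ranking logic as `similarity_vector`.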
