Movie recommendation system project

Aim of the project:
- Use different techniques to build a recommendation system using Machine Learning

1. Problem statement\
Our goal is to build a recommendation system that displays the 5 most relevant movies, chosen from a list of 10 000, based on a movie title typed by the user.


The data set includes:
- General information about the 10 000 movies (title, length...)
- The popularity of each movie
- The overview of each movie

2. Data collection

In [38]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import warnings
warnings.filterwarnings('ignore')

In [9]:
movies_df = pd.read_csv("..\\content\\top10K-TMDB-movies.csv")

In [49]:
movies_df.head(5)

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811


2.2 Dataset information

In [11]:

# Display the titles of the columns
movies_df.keys()

Index(['id', 'title', 'genre', 'original_language', 'overview', 'popularity',
       'release_date', 'vote_average', 'vote_count'],
      dtype='object')

Feature summary:
- id: Movie's unique identification number
- title: Title of the movie
- genre: One or more genres of the movie
- original_language: Language of the movie's original version
- overview: Summary of the movie's plot
- popularity: Popularity score of the movie
- release_date: Release date of the movie
- vote_average: Average rating given to the movie
- vote_count: Number of votes

In [12]:
movies_df.shape

(10000, 9)

There are indeed 10 000 rows, one for each of the 10 000 movies, and 9 columns

3. Data checks

3.1 Check missing values

In [37]:
# Count missing values (null)
movies_df.isnull().sum()

id                   0
title                0
genre                0
original_language    0
overview             0
popularity           0
release_date         0
vote_average         0
vote_count           0
dtype: int64

In [19]:
# Display the rows that contain at least one missing value
null_data = movies_df[movies_df.isnull().any(axis=1)]
null_data

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
3361,50472,Anplagghed al cinema,,it,"A queue at the ATM machine, a displaced family...",4.42,2006-11-26,7.0,313
4150,38537,Nati stanchi,Comedy,it,,5.671,2002-03-01,6.8,211
6973,31359,Would I Lie to You? 2,Comedy,fr,,4.741,2001-02-07,6.2,325
7821,43211,7 Kilos in 7 Days,,it,Two not very clever young doctors open a fitne...,5.885,1986-02-02,6.0,212
7941,2029,Tanguy,Comedy,fr,,5.449,2001-11-21,6.0,387
8518,57114,"Amore, bugie e calcetto",,en,,4.709,2008-04-04,5.8,200
9293,17413,Incognito,Comedy,fr,,5.602,2009-04-28,5.5,213
9440,516043,Arrivano i prof,Comedy,it,,6.558,2018-05-01,5.4,337
9620,154512,Lightning Strike,Comedy,it,,4.07,2012-12-13,5.3,216
9792,42426,A spasso nel tempo - L'avventura continua,"Comedy,Fantasy",it,,5.02,1997-12-11,5.1,209


Since there are only a few missing values, we can do some research and fill in the blanks for the genre

In [39]:
# Use .loc to assign in place (chained indexing may write to a temporary copy)
movies_df.loc[3361, "genre"] = "Comedy"
movies_df.loc[8518, "genre"] = "Comedy"
movies_df.loc[7821, "genre"] = "Comedy"

In [48]:
movies_df.loc[[7821]]

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
7821,43211,7 Kilos in 7 Days,Comedy,it,Two not very clever young doctors open a fitne...,5.885,1986-02-02,6.0,212


We drop the movies without an overview, because looking up each one would take too much time

In [36]:
movies_df = movies_df.dropna(axis=0)
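
A narrower variant of this step, sketched on a toy frame: `dropna(subset=["overview"])` removes only the rows whose overview is missing, whereas `dropna(axis=0)` removes every row with any missing value. The toy data below is hypothetical and only illustrates the difference between the two calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for movies_df
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Comedy", np.nan, "Drama"],
    "overview": ["Some plot", "Another plot", np.nan],
})

# Drop only the rows whose overview is missing
only_overview = toy_df.dropna(subset=["overview"])

# Drop every row that has at least one missing value
any_missing = toy_df.dropna(axis=0)

print(len(only_overview), len(any_missing))
```

Once the genres have been filled in by hand, both calls should remove the same rows, so the choice mostly matters if other columns could contain gaps.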

3.2 Check for duplicates

In [84]:
movies_df.duplicated().sum()

0

In [85]:
movies_df.loc[movies_df.duplicated()]

Unnamed: 0,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count


3.3 Check the datatypes of the features

In [47]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9987 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 9987 non-null   int64         
 1   title              9987 non-null   object        
 2   genre              9987 non-null   object        
 3   original_language  9987 non-null   object        
 4   overview           9987 non-null   object        
 5   popularity         9987 non-null   float64       
 6   release_date       9987 non-null   datetime64[ns]
 7   vote_average       9987 non-null   float64       
 8   vote_count         9987 non-null   int64         
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 1.0+ MB


The release_date column is loaded as an object (string) type, so we convert it to a datetime type to make it easier to work with

Converting release_date from object to datetime:

In [45]:
movies_df["release_date"] = pd.to_datetime(movies_df["release_date"])

3.4 Checking the number of unique values

In [83]:
movies_df.nunique()

title                9648
genre                2123
original_language      43
overview             9985
popularity           8499
release_date         6109
vote_average           42
vote_count           3191
dtype: int64

In [51]:
print(movies_df['id'].nunique() == movies_df.shape[0])

True


Every value of id is unique, so this column carries no information the model can learn from and we can drop it

In [82]:
movies_df = movies_df.drop(["id"], axis=1)

In [87]:
# to_csv returns None when a path is given, so there is nothing to assign
movies_df.to_csv("..\\content\\clean_dataset.csv", index=False)

3.5 Check statistics of data set
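
A minimal sketch of this step, assuming a summary of the numeric columns is what is wanted: `DataFrame.describe()` reports the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column. The toy frame below reuses the first three rows shown earlier instead of the full dataset.

```python
import pandas as pd

# First three rows of the numeric columns, copied from the head() output above
toy_df = pd.DataFrame({
    "popularity": [94.075, 25.408, 90.585],
    "vote_average": [8.7, 8.7, 8.7],
    "vote_count": [21862, 3731, 16280],
})

# Summary statistics for every numeric column
summary = toy_df.describe()
print(summary)
```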

4. Build a recommendation system based on the overview and genre of the movies

4.1 Create the new dataset for this recommendation system

In [89]:
# Load the clean dataset

clean_dataset = pd.read_csv("..\\content\\clean_dataset.csv")
clean_dataset.head(5)


Unnamed: 0,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811


4.1.1 Create a new dataframe with the columns we want to use for the recommendation system

In [91]:
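saved_col = ["title", "genre", "overview"]
# Take an explicit copy so that adding columns later does not modify a view of clean_dataset
new_df = clean_dataset[saved_col].copy()
new_df.head(10)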

Unnamed: 0,title,genre,overview
0,The Shawshank Redemption,"Drama,Crime",Framed in the 1940s for the double murder of h...
1,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance","Raj is a rich, carefree, happy-go-lucky second..."
2,The Godfather,"Drama,Crime","Spanning the years 1945 to 1955, a chronicle o..."
3,Schindler's List,"Drama,History,War",The true story of how businessman Oskar Schind...
4,The Godfather: Part II,"Drama,Crime",In the continuing saga of the Corleone crime f...
5,Impossible Things,"Family,Drama","Matilde is a woman who, after the death of her..."
6,Spirited Away,"Animation,Family,Fantasy","A young girl, Chihiro, becomes trapped in a st..."
7,Your Eyes Tell,"Romance,Drama","A tragic accident lead to Kaori's blindness, b..."
8,Dou kyu sei – Classmates,"Romance,Animation","Rihito Sajo, an honor student with a perfect s..."
9,Your Name.,"Romance,Animation,Drama",High schoolers Mitsuha and Taki are complete s...


In [92]:
new_df["tags"] = new_df["genre"] + " " + new_df["overview"]

In [96]:
new_df = new_df.drop(["genre", "overview"], axis=1)

In [97]:
new_df

Unnamed: 0,title,tags
0,The Shawshank Redemption,"Drama,Crime Framed in the 1940s for the double..."
1,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance Raj is a rich, carefree, ..."
2,The Godfather,"Drama,Crime Spanning the years 1945 to 1955, a..."
3,Schindler's List,"Drama,History,War The true story of how busine..."
4,The Godfather: Part II,"Drama,Crime In the continuing saga of the Corl..."
...,...,...
9982,The Last Airbender,"Action,Adventure,Fantasy The story follows the..."
9983,Sharknado 3: Oh Hell No!,"Action,TV Movie,Science Fiction,Comedy,Adventu..."
9984,Captain America,"Action,Science Fiction,War During World War II..."
9985,In the Name of the King: A Dungeon Siege Tale,"Adventure,Fantasy,Action,Drama A man named Far..."


4.2 Convert text to vector\
Two methods:
- Bag of words
- TF-IDF

4.2.1 Bag of words

In [103]:
from sklearn.feature_extraction.text import CountVectorizer
# Keep the 9987 most frequent unigrams and bigrams, ignoring English stop words
cv = CountVectorizer(max_features=9987, stop_words="english", ngram_range=(1, 2))

In [114]:
vectorized_tags = cv.fit_transform(new_df["tags"].values.astype("U")).toarray()

In [115]:
vectorized_tags.shape

(9987, 9987)

In [116]:
vectorized_tags

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [118]:
from sklearn.metrics.pairwise import cosine_similarity

In [119]:
similarity_vector = cosine_similarity(vectorized_tags)

In [120]:
similarity_vector

array([[1.        , 0.0571662 , 0.14142136, ..., 0.06681531, 0.09284767,
        0.05986843],
       [0.0571662 , 1.        , 0.07276069, ..., 0.        , 0.03184649,
        0.        ],
       [0.14142136, 0.07276069, 1.        , ..., 0.01889822, 0.05252257,
        0.07620008],
       ...,
       [0.06681531, 0.        , 0.01889822, ..., 1.        , 0.02481458,
        0.02400077],
       [0.09284767, 0.03184649, 0.05252257, ..., 0.02481458, 1.        ,
        0.03335187],
       [0.05986843, 0.        , 0.07620008, ..., 0.02400077, 0.03335187,
        1.        ]])
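
Each entry (i, j) of this matrix is the cosine of the angle between the count vectors of movies i and j: their dot product divided by the product of their norms. The diagonal is 1 because every movie is compared with itself. The sketch below checks the formula by hand on two hypothetical vectors, which are not rows of vectorized_tags.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical count vectors (not rows of vectorized_tags)
a = np.array([[1, 1, 0, 2]])
b = np.array([[1, 0, 1, 2]])

# Manual formula: dot(a, b) / (||a|| * ||b||)
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# Same value computed by scikit-learn
from_sklearn = cosine_similarity(a, b)[0, 0]

print(manual, from_sklearn)
```

Here the dot product is 5 and both norms are the square root of 6, so both values equal 5/6.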

In [174]:
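# Look up a movie by title and list the movies most similar to it
class Retrieve_data():
    def __init__(self, dataset, title):
        self.dataset = dataset
        self.title = title

    def get_index_from_title(self):
        # Return the index of the first row whose title matches exactly
        try:
            return self.dataset[self.dataset["title"]==self.title].index[0]
        except (IndexError, KeyError, TypeError):
            print("Either the movie or the dataset doesn't exist")

    def recommended_movies(self, index, number):
        # Pair each movie position with its similarity to the chosen movie,
        # then sort from most to least similar
        self.distance = sorted(list(enumerate(similarity_vector[index])), reverse=True, key=lambda vector:vector[1])

        # Skip the first entry (the movie itself), then print and return the next `number` titles
        titles = [self.dataset["title"].iloc[movie[0]] for movie in self.distance[1:number+1]]
        for title in titles:
            print(title)
        return titles
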


In [175]:
data_retrieved = Retrieve_data(new_df, "The Godfather")

In [176]:
index = data_retrieved.get_index_from_title()
index

2

In [177]:
recommendations = data_retrieved.recommended_movies(index, 10)

The Godfather: Part II
Blood Ties
Bomb City
Proud Mary
Gotti
In the Fade
Joker
House of Gucci
The Unforgivable
Batman: The Killing Joke
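
The same ranking can also be sketched as a standalone function using `numpy.argsort`. The function name `top_n_similar` and the toy matrix are hypothetical; the default of 5 follows the problem statement. Unlike skipping the first sorted entry, this version filters the query movie out by position, which stays correct even if another movie has a similarity of exactly 1.

```python
import numpy as np

def top_n_similar(similarity_matrix, titles, index, n=5):
    """Return the n titles most similar to the movie at position `index`."""
    scores = np.asarray(similarity_matrix[index])
    # Sort positions by decreasing similarity
    order = np.argsort(-scores, kind="stable")
    # Remove the query movie itself, then keep the first n positions
    order = [i for i in order if i != index][:n]
    return [titles[i] for i in order]

# Hypothetical 4-movie similarity matrix, for illustration only
toy_similarity = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.5],
    [0.3, 0.4, 0.5, 1.0],
])
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]

print(top_n_similar(toy_similarity, toy_titles, index=0, n=2))
```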


4.2.2 TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
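
As a minimal sketch of how this section might continue, assuming the same kind of tags text and the same cosine-similarity step as in the bag-of-words part: TF-IDF weights each term by how often it appears in a text and down-weights terms that appear in many texts. The three strings and the vectorizer settings below are assumptions for illustration, not the notebook's final choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for new_df["tags"]
toy_tags = [
    "Drama,Crime a crime family fights for power",
    "Drama,Crime the crime family saga continues",
    "Animation,Family a young girl enters a spirit world",
]

# Assumed settings, mirroring the CountVectorizer cell
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf_tags = tfidf.fit_transform(toy_tags)  # sparse matrix

# cosine_similarity accepts sparse input directly, so no .toarray() is needed
tfidf_similarity = cosine_similarity(tfidf_tags)
print(tfidf_similarity.round(2))
```

Keeping the matrix sparse avoids building a dense array of counts, which may matter at the scale of roughly 10 000 movies.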
