# Content Recommendation for Disney+ 
## by Diego Garrocho
---

### Project Overview
1. Intro
2. Data Wrangling
3. Data Cleaning
4. Model Selection
5. Model Testing
6. Conclusion

## 1. Intro
---
The aim of this project is to build a recommendation system for movies and TV shows. The recommendation system should be able to provide a list of similar movies or TV shows based on the input of a user. The recommendation system will use several features, including title, description, genres, production countries, actors, and more, to calculate similarity scores between movies or TV shows.

The data used for this project comes from a publicly available dataset that includes information about movies and TV shows. The dataset contains information such as title, description, genre, production country, actors, runtime, and rating. The data is in a tabular format and stored in a CSV file.

## 2. Data Wrangling
---
Before building the recommendation system, several preprocessing steps are required to clean and transform the data. This includes handling missing values, tokenizing the text, removing stop words, lemmatizing the text, and transforming the data into a format that can be used by the recommendation system.

### Imports and set up

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
import random
import seaborn as sns
import ast
import re
import nltk

In [2]:
credits_df = pd.read_csv(r'C:\Users\logan\Desktop\disneycont\credits.csv')
titles_df = pd.read_csv(r'C:\Users\logan\Desktop\disneycont\titles.csv')

In [3]:
print(credits_df.columns.values, titles_df.columns.values)
credits_df.shape, titles_df.shape

['person_id' 'id' 'name' 'character' 'role'] ['id' 'title' 'type' 'description' 'release_year' 'age_certification'
 'runtime' 'genres' 'production_countries' 'seasons' 'imdb_id'
 'imdb_score' 'imdb_votes' 'tmdb_popularity' 'tmdb_score']


((30689, 5), (1854, 15))

In [4]:
full_df = pd.merge(titles_df, credits_df, on='id')
full_df.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,person_id,name,character,role
0,tm89464,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,tt0039628,7.9,50969.0,23.515,7.388,35549,Maureen O'Hara,Doris Walker,ACTOR
1,tm89464,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,tt0039628,7.9,50969.0,23.515,7.388,57832,John Payne,Fred Gailey,ACTOR
2,tm89464,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,tt0039628,7.9,50969.0,23.515,7.388,57833,Edmund Gwenn,Kris Kringle,ACTOR
3,tm89464,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,tt0039628,7.9,50969.0,23.515,7.388,25096,Natalie Wood,Susan Walker,ACTOR
4,tm89464,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,tt0039628,7.9,50969.0,23.515,7.388,27185,Porter Hall,Granville Sawyer,ACTOR


In [5]:
full_df.info()
print('_'*99)
full_df.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30689 entries, 0 to 30688
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    30689 non-null  object 
 1   title                 30689 non-null  object 
 2   type                  30689 non-null  object 
 3   description           30686 non-null  object 
 4   release_year          30689 non-null  int64  
 5   age_certification     27498 non-null  object 
 6   runtime               30689 non-null  int64  
 7   genres                30689 non-null  object 
 8   production_countries  30689 non-null  object 
 9   seasons               2379 non-null   float64
 10  imdb_id               26878 non-null  object 
 11  imdb_score            26712 non-null  float64
 12  imdb_votes            26581 non-null  float64
 13  tmdb_popularity       30689 non-null  float64
 14  tmdb_score            30407 non-null  float64
 15  person_id          

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,person_id
count,30689.0,30689.0,2379.0,26712.0,26581.0,30689.0,30407.0,30689.0
mean,2005.619636,91.208413,2.770912,6.628036,201362.2,48.316732,6.780725,398550.4
std,16.666517,35.892204,3.101734,1.024169,308263.7,125.831614,0.878787,633979.5
min,1928.0,2.0,1.0,1.6,5.0,0.6,2.0,45.0
25%,2000.0,80.0,1.0,6.0,5525.0,10.052,6.273,12834.0
50%,2010.0,96.0,2.0,6.7,47260.0,20.771,6.802,57974.0
75%,2017.0,113.0,3.0,7.3,269035.0,52.384,7.352,598711.0
max,2023.0,182.0,36.0,9.5,1403757.0,2159.377,10.0,2770632.0


In [6]:
full_df.describe(include=['O'])

Unnamed: 0,id,title,type,description,age_certification,genres,production_countries,imdb_id,name,character,role
count,30689,30689,30689,30686,27498,30689,30689,26878,30689,28794,30689
unique,1733,1695,2,1730,11,700,84,1295,19851,18235,2
top,tm84668,Enchanted,MOVIE,The beautiful princess Giselle is banished by ...,PG,['documentation'],['US'],tt0461770,Jim Cummings,Self,ACTOR
freq,245,245,28310,245,11439,1778,26039,245,61,1209,29122


### Initial notes
---
An initial look shows that some columns are incomplete, which is understandable since the dataset combines movies and TV shows. The columns sourced from IMDb and TMDB also contain many null values, which may limit their usefulness, so using them might do more harm than good. In addition, some columns (the ids, for example) hold no relevance for the objective of the project and should not be taken into consideration. Some features could still be transformed for easier use, others could be combined, and the data still has to be checked for typos. The genres column appears to have many unique values, which should be looked into.
Given the nature of the information available for the project, a content-based filtering approach is the best-suited option.
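
As a rough way to quantify the completeness concerns above, the share of missing values per column can be computed directly. This is a minimal sketch on a made-up frame (`toy_df` and its values are illustrative assumptions, not rows from the dataset); the same call would apply to `full_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for full_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.9, np.nan, 6.1, np.nan],
    "seasons": [np.nan, np.nan, 2.0, np.nan],
})

# isna() gives a boolean frame; its column-wise mean is the fraction of nulls.
null_share = toy_df.isna().mean().sort_values(ascending=False)
print(null_share)
```

Sorting the result puts the least complete columns first, which makes it easier to decide which ones to drop or impute.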

In [7]:
full_df['genres'].unique()

array(["['family', 'comedy', 'drama']",
       "['horror', 'fantasy', 'animation', 'family', 'comedy']",
       "['fantasy', 'animation', 'family', 'romance']",
       "['animation', 'drama', 'family', 'fantasy']",
       "['animation', 'family', 'fantasy', 'music']",
       "['animation', 'drama', 'family']",
       "['fantasy', 'animation', 'romance', 'family', 'thriller', 'drama']",
       "['animation', 'documentation']", "['comedy']",
       "['family', 'action']",
       "['animation', 'family', 'fantasy', 'comedy']", "['animation']",
       "['animation', 'romance', 'comedy', 'family', 'fantasy']",
       "['animation', 'comedy', 'family']",
       "['animation', 'comedy', 'family', 'fantasy']",
       "['animation', 'music', 'comedy', 'family']", "['thriller']",
       "['fantasy', 'animation']", "['comedy', 'family', 'animation']",
       "['animation', 'family']", "['romance', 'animation']",
       "['action', 'family']", "['documentation']",
       "['action', 'scifi', 'fant

Each genres value is stored as a single string that looks like a list (for example "['family', 'comedy', 'drama']"), which will prove complicated to work with if left in that state.
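
To illustrate why the string form is awkward, here is a small sketch on invented rows (`toy_df` is an assumption for illustration). It parses the strings with `ast.literal_eval` and then counts individual genres, something that is not possible while each cell is one opaque string.

```python
import ast
import pandas as pd

toy_df = pd.DataFrame({
    "title": ["A", "B"],
    "genres": ["['family', 'comedy']", "['comedy']"],
})

# Each cell is a str such as "['family', 'comedy']", not a Python list.
print(type(toy_df["genres"].iloc[0]))

# literal_eval safely turns the string into the list it describes.
parsed = toy_df["genres"].apply(ast.literal_eval)

# With real lists, explode() yields one row per genre so they can be counted.
genre_counts = parsed.explode().value_counts()
print(genre_counts)
```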

Removing some columns that hold no impact for the project:

In [8]:
drop_df = full_df.drop(['id', 'imdb_id', 'person_id', 'character', 'role'], axis=1)
drop_df.head()


Unnamed: 0,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,name
0,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,7.9,50969.0,23.515,7.388,Maureen O'Hara
1,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,7.9,50969.0,23.515,7.388,John Payne
2,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,7.9,50969.0,23.515,7.388,Edmund Gwenn
3,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,7.9,50969.0,23.515,7.388,Natalie Wood
4,Miracle on 34th Street,MOVIE,"Kris Kringle, seemingly the embodiment of Sant...",1947,G,96,"['family', 'comedy', 'drama']",['US'],,7.9,50969.0,23.515,7.388,Porter Hall


The actors column could still be grouped per title, and the age certifications should be inspected along with the seasons. All missing score values should be replaced with the column average.

In [9]:
drop_df.age_certification.unique()

array(['G', nan, 'PG', 'TV-G', 'TV-PG', 'TV-MA', 'PG-13', 'TV-Y7',
       'TV-Y7-FV', 'TV-Y', 'TV-14', 'R'], dtype=object)

The age_certification column mixes several rating scales, which can be homogenized into a single ordinal scale. Because it is important to keep a proper hierarchy among the ratings, all rows with NaN values will be dropped. The new rating format is:<br> 
'G', 'TV-G', 'TV-Y' = '0' <br>
'PG', 'TV-PG', 'TV-Y7', 'TV-Y7-FV' = '1'<br>
'PG-13', 'TV-14' = '2'<br>
'TV-MA', 'R' = '3'<br>
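
The mapping above can also be kept in a single named dictionary and applied with `Series.map`. This is only a sketch on a made-up series (`RATING_LEVEL` and `toy` are hypothetical names introduced here); it follows the same rule of dropping missing ratings before mapping.

```python
import pandas as pd

# Mapping copied from the list above: a higher number means an older audience.
RATING_LEVEL = {
    "G": 0, "TV-G": 0, "TV-Y": 0,
    "PG": 1, "TV-PG": 1, "TV-Y7": 1, "TV-Y7-FV": 1,
    "PG-13": 2, "TV-14": 2,
    "TV-MA": 3, "R": 3,
}

# Invented sample, including one missing rating.
toy = pd.Series(["G", "TV-14", None, "R", "TV-Y7"])

# Drop missing ratings first, then translate each label to its level.
levels = toy.dropna().map(RATING_LEVEL)
print(levels.tolist())
```

Unlike `replace`, `map` turns any label missing from the dictionary into NaN, which makes an unexpected rating easy to spot.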

In [10]:
agefix_df = drop_df.dropna(subset = ['age_certification'])
agechange_df = agefix_df.replace({'age_certification' : {'G' : 0, 'TV-G' : 0,'TV-Y' : 0, 'PG' : 1, 'TV-PG' : 1,
                                                         'TV-Y7' : 1, 'TV-Y7-FV' : 1, 'PG-13' : 2, 'TV-14' : 2,
                                                        'TV-MA' : 3, 'R' : 3}})
agechange_df.age_certification.unique()

array([0, 1, 3, 2], dtype=int64)

Now replace the missing values in the rating columns (imdb_score, imdb_votes, tmdb_popularity, tmdb_score) with their respective averages.

In [11]:
mean_imsc = agechange_df['imdb_score'].mean()
mean_imvo = agechange_df['imdb_votes'].mean()
mean_tmpo = agechange_df['tmdb_popularity'].mean()
mean_tmsc = agechange_df['tmdb_score'].mean()
mean_df = agechange_df.fillna({'imdb_score': mean_imsc, 
                               'imdb_votes': mean_imvo, 
                               'tmdb_popularity': mean_tmpo, 
                               'tmdb_score': mean_tmsc})
mean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27498 entries, 0 to 30688
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   title                 27498 non-null  object 
 1   type                  27498 non-null  object 
 2   description           27498 non-null  object 
 3   release_year          27498 non-null  int64  
 4   age_certification     27498 non-null  int64  
 5   runtime               27498 non-null  int64  
 6   genres                27498 non-null  object 
 7   production_countries  27498 non-null  object 
 8   seasons               2061 non-null   float64
 9   imdb_score            27498 non-null  float64
 10  imdb_votes            27498 non-null  float64
 11  tmdb_popularity       27498 non-null  float64
 12  tmdb_score            27498 non-null  float64
 13  name                  27498 non-null  object 
dtypes: float64(5), int64(3), object(6)
memory usage: 3.1+ MB
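
The mean imputation can also be expressed without one variable per column. The following is a sketch on an invented frame (`toy_df` and `score_cols` are assumptions for illustration): `fillna` accepts a Series indexed by column name, so the column means can be passed in directly.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "imdb_score": [8.0, np.nan, 6.0],
    "tmdb_score": [7.0, 5.0, np.nan],
})

score_cols = ["imdb_score", "tmdb_score"]

# mean() skips NaN by default; fillna matches each mean to its column by name.
filled = toy_df.fillna(toy_df[score_cols].mean())
print(filled)
```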


Something could be done with the 'seasons' column, but first it should be checked whether every show has a seasons value.

In [12]:
mean_df.type.unique()

array(['MOVIE', 'SHOW'], dtype=object)

In [13]:
def show_check(df):
    # True if any row of type 'SHOW' is missing its 'seasons' value
    is_show = df['type'] == 'SHOW'
    return bool(df.loc[is_show, 'seasons'].isna().any())

show_check(mean_df)

False

All shows have valid 'seasons' values so only the movie types need to be filled.

In [14]:
mean_df = mean_df.fillna({'seasons':0})

In [15]:
lean_df = mean_df.groupby(['title','type','description','release_year','age_certification','runtime','genres','production_countries',
                          'imdb_score','imdb_votes','tmdb_popularity','tmdb_score'])['name'].apply(lambda x: list(x)).reset_index()
lean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1327 entries, 0 to 1326
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   title                 1327 non-null   object 
 1   type                  1327 non-null   object 
 2   description           1327 non-null   object 
 3   release_year          1327 non-null   int64  
 4   age_certification     1327 non-null   int64  
 5   runtime               1327 non-null   int64  
 6   genres                1327 non-null   object 
 7   production_countries  1327 non-null   object 
 8   imdb_score            1327 non-null   float64
 9   imdb_votes            1327 non-null   float64
 10  tmdb_popularity       1327 non-null   float64
 11  tmdb_score            1327 non-null   float64
 12  name                  1327 non-null   object 
dtypes: float64(4), int64(3), object(6)
memory usage: 134.9+ KB
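
The grouping step collapses the one-row-per-actor layout into one row per title. A minimal sketch on invented rows (`toy_df` and the actor names are placeholders) shows the effect:

```python
import pandas as pd

# One row per (title, actor) pair, as in the merged frame.
toy_df = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B"],
    "release_year": [1999, 1999, 2005],
    "name": ["Actor 1", "Actor 2", "Actor 3"],
})

# Group on the title-level columns and collect the actor names into a list.
grouped = (
    toy_df.groupby(["title", "release_year"])["name"]
    .agg(list)
    .reset_index()
)
print(grouped)
```

Only the columns named in `groupby` (plus the aggregated one) survive, so any column left out of the key list disappears from the result.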


In [16]:
lean_df.head(10)

Unnamed: 0,title,type,description,release_year,age_certification,runtime,genres,production_countries,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,name
0,'Twas the Night,MOVIE,A mischievous 14-year-old boy and his irrespon...,2001,0,84,"['comedy', 'family']",['US'],5.2,1126.0,3.911,5.365,"[Bryan Cranston, Josh Zuckerman, Brenda Grate,..."
1,10 Things I Hate About You,MOVIE,"On the first day at his new school, Cameron in...",1999,2,97,"['drama', 'romance', 'comedy']",['US'],7.3,351998.0,32.094,7.56,"[Julia Stiles, Heath Ledger, Joseph Gordon-Lev..."
2,101 Dalmatian Street,SHOW,Follow the adventures of eldest siblings Dolly...,2019,1,17,"['animation', 'family', 'comedy']",['GB'],6.0,419.0,17.398,7.5,"[Bert Davis, Margot Powell, Michaela Dietz, Jo..."
3,101 Dalmatians,MOVIE,"An evil, high-fashion designer plots to steal ...",1996,0,103,"['comedy', 'family', 'crime']","['US', 'GB']",5.7,113287.0,34.265,5.896,"[Glenn Close, Jeff Daniels, Joely Richardson, ..."
4,101 Dalmatians II: Patch's London Adventure,MOVIE,"Being one of 101 takes its toll on Patch, who ...",2002,0,74,['comedy'],['US'],6.647706,217526.383949,32.759,6.002,"[Barry Bostwick, Jason Alexander, Martin Short..."
5,101 Dalmatians: The Series,SHOW,After foiling Cruella DeVil's plot to make a f...,1997,0,22,"['action', 'comedy', 'family', 'animation']",['US'],6.1,1708.0,27.092,6.837,"[Kath Soucie, Tara Strong, Jeff Bennett, Frank..."
6,102 Dalmatians,MOVIE,Get ready for a howling good time as an all ne...,2000,0,100,"['family', 'comedy']",['US'],4.8,38296.0,15.584,5.476,"[Glenn Close, Ioan Gruffudd, Alice Evans, Tim ..."
7,"20,000 Leagues Under the Sea",MOVIE,A ship sent to investigate a wave of mysteriou...,1955,0,127,"['action', 'drama', 'family', 'fantasy', 'scifi']",['US'],7.2,34959.0,19.999,7.065,"[Kirk Douglas, James Mason, Paul Lukas, Peter ..."
8,22 vs. Earth,MOVIE,"Set before the events of ‘Soul’, 22 refuses to...",2021,0,9,"['animation', 'fantasy']",['US'],6.647706,217526.383949,29.662,7.1,"[Tina Fey, Alice Braga, Richard Ayoade, Micah ..."
9,3 Men and a Baby,MOVIE,Three bachelors find themselves forced to take...,1987,1,102,"['comedy', 'drama', 'family']",['US'],6.1,55164.0,12.93,6.185,"[Tom Selleck, Steve Guttenberg, Ted Danson, Na..."


## 3. Data Cleaning
---

The different rating scores could be consolidated. After some research, the imdb_votes and tmdb_popularity columns should be dropped, since there is no clear description of how their values are obtained. To still use data from both platforms, a new column holding the mean of both scores will be created and used instead. Doing so will probably have little impact, since the two scores are very similar.

In [17]:
lean_df['combined_score'] = lean_df[['imdb_score','tmdb_score']].mean(axis=1).round(2)
clear_df = lean_df.drop(['imdb_score','imdb_votes','tmdb_popularity','tmdb_score'], axis=1)

clear_df.head()

Unnamed: 0,title,type,description,release_year,age_certification,runtime,genres,production_countries,name,combined_score
0,'Twas the Night,MOVIE,A mischievous 14-year-old boy and his irrespon...,2001,0,84,"['comedy', 'family']",['US'],"[Bryan Cranston, Josh Zuckerman, Brenda Grate,...",5.28
1,10 Things I Hate About You,MOVIE,"On the first day at his new school, Cameron in...",1999,2,97,"['drama', 'romance', 'comedy']",['US'],"[Julia Stiles, Heath Ledger, Joseph Gordon-Lev...",7.43
2,101 Dalmatian Street,SHOW,Follow the adventures of eldest siblings Dolly...,2019,1,17,"['animation', 'family', 'comedy']",['GB'],"[Bert Davis, Margot Powell, Michaela Dietz, Jo...",6.75
3,101 Dalmatians,MOVIE,"An evil, high-fashion designer plots to steal ...",1996,0,103,"['comedy', 'family', 'crime']","['US', 'GB']","[Glenn Close, Jeff Daniels, Joely Richardson, ...",5.8
4,101 Dalmatians II: Patch's London Adventure,MOVIE,"Being one of 101 takes its toll on Patch, who ...",2002,0,74,['comedy'],['US'],"[Barry Bostwick, Jason Alexander, Martin Short...",6.32


One-hot and label encoding can be applied to some columns. The genres and production_countries columns also need to be turned into actual lists, and the actor names should be checked to confirm they are stored as actual lists.

In [18]:
type(clear_df['name'].iloc[0])

list

In [19]:
# Not used: a better fix using a lambda was found.
# Modified version of ast.literal_eval that strips the extra characters, written because the plain call was not running.
def custom_literal_eval(s):
    
    s = re.sub(r"'([^']*)'", r'\1', s)
    s = re.sub(r'\[|\]|\s', '', s)
    
    return ast.literal_eval(s)



In [20]:
#lean_df['genres'] = lean_df['genres'].apply(custom_literal_eval)
#lean_df['production_countries'] = lean_df['production_countries'].apply(custom_literal_eval)
clear_df['genres'] = clear_df['genres'].apply(lambda x: ast.literal_eval(x))


clear_df.head()

Unnamed: 0,title,type,description,release_year,age_certification,runtime,genres,production_countries,name,combined_score
0,'Twas the Night,MOVIE,A mischievous 14-year-old boy and his irrespon...,2001,0,84,"[comedy, family]",['US'],"[Bryan Cranston, Josh Zuckerman, Brenda Grate,...",5.28
1,10 Things I Hate About You,MOVIE,"On the first day at his new school, Cameron in...",1999,2,97,"[drama, romance, comedy]",['US'],"[Julia Stiles, Heath Ledger, Joseph Gordon-Lev...",7.43
2,101 Dalmatian Street,SHOW,Follow the adventures of eldest siblings Dolly...,2019,1,17,"[animation, family, comedy]",['GB'],"[Bert Davis, Margot Powell, Michaela Dietz, Jo...",6.75
3,101 Dalmatians,MOVIE,"An evil, high-fashion designer plots to steal ...",1996,0,103,"[comedy, family, crime]","['US', 'GB']","[Glenn Close, Jeff Daniels, Joely Richardson, ...",5.8
4,101 Dalmatians II: Patch's London Adventure,MOVIE,"Being one of 101 takes its toll on Patch, who ...",2002,0,74,[comedy],['US'],"[Barry Bostwick, Jason Alexander, Martin Short...",6.32


In [21]:
clear_df['production_countries'] = clear_df['production_countries'].apply(lambda x: ast.literal_eval(x))
type(clear_df['production_countries'].iloc[0])

list

In [22]:
clear_df.rename(columns = {'name':'actors'}, inplace = True)
clear_df.head()

Unnamed: 0,title,type,description,release_year,age_certification,runtime,genres,production_countries,actors,combined_score
0,'Twas the Night,MOVIE,A mischievous 14-year-old boy and his irrespon...,2001,0,84,"[comedy, family]",[US],"[Bryan Cranston, Josh Zuckerman, Brenda Grate,...",5.28
1,10 Things I Hate About You,MOVIE,"On the first day at his new school, Cameron in...",1999,2,97,"[drama, romance, comedy]",[US],"[Julia Stiles, Heath Ledger, Joseph Gordon-Lev...",7.43
2,101 Dalmatian Street,SHOW,Follow the adventures of eldest siblings Dolly...,2019,1,17,"[animation, family, comedy]",[GB],"[Bert Davis, Margot Powell, Michaela Dietz, Jo...",6.75
3,101 Dalmatians,MOVIE,"An evil, high-fashion designer plots to steal ...",1996,0,103,"[comedy, family, crime]","[US, GB]","[Glenn Close, Jeff Daniels, Joely Richardson, ...",5.8
4,101 Dalmatians II: Patch's London Adventure,MOVIE,"Being one of 101 takes its toll on Patch, who ...",2002,0,74,[comedy],[US],"[Barry Bostwick, Jason Alexander, Martin Short...",6.32


## After previous failures with one-hot encoding, a different approach will be tried.

Text needs to be properly processed in this version. Title as index, everything else as bag of words.

### Experimental transformation
---

Checking for how many unique values there are in the columns that have to be encoded.

In [23]:
unique_genres = set()
for genre_list in clear_df['genres']:
    unique_genres.update(genre_list)
num_unique_genres = len(unique_genres)
print("Number of unique genres:", num_unique_genres)

unique_actors = set()
for actor_list in clear_df['actors']:
    unique_actors.update(actor_list)
num_unique_actors = len(unique_actors)
print("Number of unique actors:", num_unique_actors)

unique_countries = set()
for country_list in clear_df['production_countries']:
    unique_countries.update(country_list)
num_unique_countries = len(unique_countries)
print("Number of unique countries:", num_unique_countries)

Number of unique genres: 19
Number of unique actors: 18155
Number of unique countries: 36


The number of actors is very high. MultiLabelBinarizer might prove useful for encoding them, but the resulting number of columns was the main issue with the past approach, so a bag of words will be used instead.
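
To make the size concern concrete, here is a small sketch of what `MultiLabelBinarizer` produces on invented cast lists (`casts` is a placeholder, not data from the project). Each distinct name becomes its own column, so roughly 18,000 actors would mean roughly 18,000 mostly-zero columns.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Invented cast lists, one per title.
casts = [
    ["Actor 1", "Actor 2"],
    ["Actor 2", "Actor 3"],
    ["Actor 4"],
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(casts)

# One column per distinct actor, one row per title.
print(list(mlb.classes_))
print(encoded.shape)
```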

Now text processing should be considered: lowercasing, stop-word removal (NLTK or spaCy), and tokenization, possibly followed by PCA.

In [24]:
pra_df = clear_df  # alias, not a copy: changes made through pra_df also appear in clear_df

In [25]:
pra_df['description'] = clear_df['description'].str.lower()

nltk.download('stopwords')
from nltk.corpus import stopwords

trimwords = set(stopwords.words('english'))

pra_df['description'] = clear_df['description'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in trimwords))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\logan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
clear_df['description'] = clear_df['description'].apply(
    lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split()))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\logan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [27]:
nltk.download('punkt')
from nltk.tokenize import word_tokenize
def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

pra_df['description'] = clear_df['description'].apply(lambda x: tokenize_text(x))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\logan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


For the model selection, given the objective of the project and the features available, there are no labels to train against, so an unsupervised, similarity-based approach would be optimal, and nearest neighbor might be the best suited.

In [28]:
pre_df = pra_df.set_index('title')
pre_df.head()

Unnamed: 0_level_0,type,description,release_year,age_certification,runtime,genres,production_countries,actors,combined_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
'Twas the Night,MOVIE,"[mischievous, 14-year-old, boy, irresponsible,...",2001,0,84,"[comedy, family]",[US],"[Bryan Cranston, Josh Zuckerman, Brenda Grate,...",5.28
10 Things I Hate About You,MOVIE,"[first, day, new, school, ,, cameron, instantl...",1999,2,97,"[drama, romance, comedy]",[US],"[Julia Stiles, Heath Ledger, Joseph Gordon-Lev...",7.43
101 Dalmatian Street,SHOW,"[follow, adventure, eldest, sibling, dolly, dy...",2019,1,17,"[animation, family, comedy]",[GB],"[Bert Davis, Margot Powell, Michaela Dietz, Jo...",6.75
101 Dalmatians,MOVIE,"[evil, ,, high-fashion, designer, plot, steal,...",1996,0,103,"[comedy, family, crime]","[US, GB]","[Glenn Close, Jeff Daniels, Joely Richardson, ...",5.8
101 Dalmatians II: Patch's London Adventure,MOVIE,"[one, 101, take, toll, patch, ,, feel, unique,...",2002,0,74,[comedy],[US],"[Barry Bostwick, Jason Alexander, Martin Short...",6.32


## 4. Model Selection
---
Several techniques can be used to build the recommendation system, including cosine similarity, Euclidean distance, and Jaccard similarity. Each of them calculates a similarity (or distance) score between movies or TV shows based on their features. The system then returns a list of similar movies or TV shows for the title a user provides.
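
For reference, the three measures named above can be sketched in a few lines of NumPy. The helper names (`cosine_sim`, `euclidean_dist`, `jaccard_sim`) and the toy inputs are assumptions introduced for illustration only.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between two vectors: 1 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    # Straight-line distance: 0 means identical vectors.
    return float(np.linalg.norm(a - b))

def jaccard_sim(set_a, set_b):
    # Shared items divided by all distinct items across both sets.
    return len(set_a & set_b) / len(set_a | set_b)

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])

print(cosine_sim(a, b))
print(euclidean_dist(a, b))
print(jaccard_sim({"comedy", "family"}, {"comedy", "drama"}))
```

Cosine similarity ignores vector length and looks only at direction, which is one reason it is a common choice for sparse text vectors; Jaccard works on sets, so it fits columns such as genres.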

After further research on what could be the best way to use the dataset, one-hot encoding the actors appears to be a bad approach for its intended use. In the end, it seems that keeping the actors as a bag of words and finding the cosine similarity would be better.

This run will use TfidfVectorizer, which turns a collection of texts into a matrix of TF-IDF weights. For simplicity, all text from each row will be combined into a single bag of words. To do so, the list columns need to be turned from lists into strings.

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

# 'type' is already a plain string ('MOVIE' or 'SHOW'); joining it would split it into single letters
pre_df['description'] = pre_df['description'].apply(lambda x: ' '.join(x))
pre_df['genres'] = pre_df['genres'].apply(lambda x: ' '.join(x))
pre_df['production_countries'] = pre_df['production_countries'].apply(lambda x: ' '.join(x))
pre_df['actors'] = pre_df['actors'].apply(lambda x: ' '.join(x))

pre_df['text'] = pre_df['type'] + ' ' + pre_df['description'] + ' ' + pre_df['genres'] + ' ' + pre_df['production_countries'] + ' ' + pre_df['actors']

tfidf = TfidfVectorizer()
text_features = tfidf.fit_transform(pre_df['text'])
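
A toy run helps show what the vectorizer returns. This is a sketch on invented documents (`docs`, `tfidf_toy`, and `letters_vocab` are names assumed for the example). It also shows that, with default settings, tokens of a single character are ignored, which matters for any text that has been split into individual letters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented "bag of words" strings, one per title.
docs = [
    "movie comedy family dog",
    "movie comedy family cat",
    "show documentary nature",
]

tfidf_toy = TfidfVectorizer()
matrix = tfidf_toy.fit_transform(docs)

# One row per document, one column per distinct token.
print(sorted(tfidf_toy.vocabulary_))
print(matrix.shape)

# The default token pattern keeps only tokens of two or more characters.
letters_vocab = TfidfVectorizer().fit(["M O V I E comedy"]).vocabulary_
print(letters_vocab)
```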

For the numerical data, StandardScaler will be used. The release year could be switched to years passed since release to have a better scale.

In [30]:
from sklearn.preprocessing import StandardScaler

num_features = StandardScaler().fit_transform(pre_df[['release_year','age_certification','runtime', 'combined_score']])
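
The years-since-release idea can be sketched as follows, on invented rows. The reference year of 2023 is an assumption taken from the maximum `release_year` in the summary statistics. One thing the sketch makes visible: after standardization, the new column is simply the scaled release year with its sign flipped, so the change affects readability more than scale.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy_df = pd.DataFrame({"release_year": [1947, 1999, 2019]})

# Assumed reference point: the latest release year seen in the data summary.
reference_year = 2023
toy_df["years_since_release"] = reference_year - toy_df["release_year"]

scaled_year = StandardScaler().fit_transform(toy_df[["release_year"]])
scaled_age = StandardScaler().fit_transform(toy_df[["years_since_release"]])

# Both columns have mean 0 and unit variance; they differ only in sign.
print(np.allclose(scaled_age, -scaled_year))
```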

Now the text and numerical feature matrices can be joined, and the cosine similarity computed as a matrix.

In [31]:
features = np.concatenate([text_features.toarray(), num_features], axis=1)

In [32]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(features)

Now, for a query, the index (which holds the movie titles) can be used to look up a title's row and return similar items.
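
As a possible variation, the text matrix does not have to be converted to a dense array before joining. The sketch below uses small invented matrices (`text_part` and `num_part` are placeholders) and `scipy.sparse.hstack` to keep the combined features sparse; `cosine_similarity` accepts sparse input. Whether this is worthwhile depends on the vocabulary size, so treat it as an option rather than a requirement.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for the TF-IDF matrix (sparse) and the scaled numeric columns (dense).
text_part = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]))
num_part = np.array([[0.5], [-1.0], [0.5]])

# Stack side by side without densifying the text features.
combined = hstack([text_part, csr_matrix(num_part)]).tocsr()

# Pairwise similarity: entry [i, j] compares row i with row j.
sim = cosine_similarity(combined)
print(sim.shape)
print(np.round(sim, 3))
```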

## 5. Model Testing
---
Metrics such as precision, recall, and F1-score are the usual way to evaluate how accurately a recommendation system serves its users, but they require labels that say which recommendations are relevant. In this notebook, the test is a qualitative spot check: a random title is queried and its most similar titles are inspected.
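
If relevance labels were available, a metric such as precision at k could be computed as in the sketch below. Everything here is hypothetical: `precision_at_k`, the recommended list, and the set of relevant titles are placeholders, since the dataset itself carries no such labels.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended titles that are marked relevant."""
    top_k = recommended[:k]
    hits = sum(1 for title in top_k if title in relevant)
    return hits / k

# Hypothetical output of the recommender and hypothetical ground truth.
recommended_titles = ["Title B", "Title D", "Title C"]
relevant_titles = {"Title B", "Title C"}

print(precision_at_k(recommended_titles, relevant_titles, k=2))
```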

In [35]:
titles = pre_df.index.tolist()

In [67]:
query_title = random.choice(titles)
print(query_title)

A Goofy Movie


In [68]:
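query_index = pre_df.index.get_loc(query_title)  # assumes the title is unique in the index
similarities = similarity_matrix[query_index]
# The highest score is expected to be the query itself, so take the next five
most_similar_indices = similarities.argsort()[-6:-1]
# Reverse so the most similar title is listed first
most_similar_titles = pre_df.iloc[most_similar_indices].index.tolist()[::-1]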

print(f'The 5 most similar titles to "{query_title}" are:')
print('\n'.join(most_similar_titles))

The 5 most similar titles to "A Goofy Movie" are:
The Fox and the Hound
The Muppet Movie
The Great Muppet Caper
Robin Hood
The Great Mouse Detective
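
The query steps can be wrapped in a reusable function. This is a minimal sketch, not the notebook's own code: `recommend` and the toy inputs are names introduced here, and the function removes the query's own position explicitly instead of relying on it being the last entry after sorting.

```python
import numpy as np

def recommend(query_title, titles, similarity_matrix, k=5):
    """Return the k titles most similar to query_title, best match first."""
    query_index = titles.index(query_title)
    scores = similarity_matrix[query_index]
    # Sort from highest to lowest score, then skip the query itself.
    order = [i for i in np.argsort(scores)[::-1] if i != query_index]
    return [titles[i] for i in order[:k]]

# Invented titles and a hand-written symmetric similarity matrix.
toy_titles = ["A", "B", "C", "D"]
toy_sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

print(recommend("A", toy_titles, toy_sim, k=2))
```

With the notebook's objects, the call would take `titles` and `similarity_matrix` as built in the cells above.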


## 6. Conclusion
---
This project built a content-based recommendation system for movies and TV shows. The system combines text features (type, description, genres, production countries, and actors) with scaled numerical features (release year, age certification, runtime, and combined score) and uses cosine similarity to score pairs of titles. Given a title as input, it returns a list of the most similar movies or TV shows. So far the results have only been spot-checked with a randomly chosen title; a quantitative evaluation is still needed to determine how accurate the recommendations are.