# Problem Statement

* Develop a content-based recommender system using the genres and/or descriptions.
* Identify the main content available on the streaming platform.
* Perform exploratory data analysis to find interesting insights.

[You can download the dataset from here.](https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/netflix-tv-shows-and-movies/credits.csv
/kaggle/input/netflix-tv-shows-and-movies/titles.csv


In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize
import ast

In [3]:
credits = pd.read_csv('/kaggle/input/netflix-tv-shows-and-movies/credits.csv')
titles = pd.read_csv('/kaggle/input/netflix-tv-shows-and-movies/titles.csv')

In [4]:
credits.head()

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR


In [5]:
titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.6


In [6]:
titles.shape, credits.shape

((5850, 15), (77801, 5))

### About the features

#### Features in titles.csv

* **id**: The title ID on JustWatch.
* **title**: The name of the title.
* **type**: SHOW or MOVIE.
* **description**: A brief description.
* **release_year**: The release year.
* **age_certification**: The age certification.
* **runtime**: The length of the episode (SHOW) or movie.
* **genres**: A list of genres.
* **production_countries**: A list of countries that produced the title.
* **seasons**: Number of seasons if it's a SHOW.
* **imdb_id**: The title ID on IMDB.
* **imdb_score**: Score on IMDB.
* **imdb_votes**: Votes on IMDB.
* **tmdb_popularity**: Popularity on TMDB.
* **tmdb_score**: Score on TMDB.

#### Features in credits.csv

* **person_id**: The person ID on JustWatch.
* **id**: The title ID on JustWatch.
* **name**: The actor or director's name.
* **character**: The character name.
* **role**: ACTOR or DIRECTOR.

In [7]:
titles[titles['imdb_score']==8.4]

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
93,ts22176,Stargate SG-1,SHOW,The story of Stargate SG-1 begins about a year...,1997,TV-PG,44,"['scifi', 'drama', 'action']","['CA', 'US']",10.0,tt0118480,8.4,90196.0,88.851,8.300
219,ts21465,Supernatural,SHOW,"When they were boys, Sam and Dean Winchester l...",2005,TV-14,45,"['scifi', 'horror', 'thriller', 'drama', 'fant...",['US'],15.0,tt0460681,8.4,434081.0,388.093,8.278
229,ts22011,Heartland,SHOW,Life is hard on the Flemings' ranch in the Alb...,2007,TV-PG,44,"['drama', 'family']",['CA'],15.0,tt1094229,8.4,16337.0,74.638,8.300
245,ts20305,Naruto,SHOW,"In another world, ninja are the ultimate power...",2002,TV-PG,23,"['animation', 'action', 'scifi', 'comedy', 'fa...",['JP'],6.0,tt0409591,8.4,96729.0,218.843,8.363
316,tm142564,3 Idiots,MOVIE,Rascal. Joker. Dreamer. Genius... You've never...,2009,PG-13,170,"['drama', 'comedy']",['IN'],,tt1187043,8.4,390739.0,44.999,8.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4999,ts341767,Kotaro Lives Alone,SHOW,A lonely little boy moves into a ramshackle ap...,2022,TV-14,27,"['animation', 'drama', 'comedy']",['JP'],1.0,tt15490038,8.4,2281.0,13.897,6.900
5015,ts271867,Vincenzo,SHOW,Vincenzo Cassano is an Italian lawyer and Mafi...,2021,,81,"['action', 'drama', 'comedy', 'crime', 'romance']",['KR'],1.0,tt13433812,8.4,16358.0,50.764,8.800
5044,ts297483,Hometown Cha-Cha-Cha,SHOW,A big-city dentist opens up a practice in a cl...,2021,TV-14,78,"['comedy', 'romance', 'drama']",['KR'],1.0,tt14518756,8.4,11060.0,40.151,8.200
5053,ts222864,Falling Into Your Smile,SHOW,Student Tong Yao makes two vows: to never be i...,2021,TV-14,44,"['drama', 'comedy', 'romance']",['CN'],1.0,tt11290960,8.4,2267.0,39.102,9.000


In [8]:
df = pd.merge(credits,titles,on='id',how='left')
df.head()

Unnamed: 0,person_id,id,name,character,role,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,7064,tm84618,Albert Brooks,Tom,ACTOR,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179


In [9]:
required_columns = ['person_id','name','character','title','description','genres','imdb_score','imdb_votes']
df1 = df[required_columns]
df1.head()

Unnamed: 0,person_id,name,character,title,description,genres,imdb_score,imdb_votes
0,3748,Robert De Niro,Travis Bickle,Taxi Driver,A mentally unstable Vietnam War veteran works ...,"['drama', 'crime']",8.2,808582.0
1,14658,Jodie Foster,Iris Steensma,Taxi Driver,A mentally unstable Vietnam War veteran works ...,"['drama', 'crime']",8.2,808582.0
2,7064,Albert Brooks,Tom,Taxi Driver,A mentally unstable Vietnam War veteran works ...,"['drama', 'crime']",8.2,808582.0
3,3739,Harvey Keitel,Matthew 'Sport' Higgins,Taxi Driver,A mentally unstable Vietnam War veteran works ...,"['drama', 'crime']",8.2,808582.0
4,48933,Cybill Shepherd,Betsy,Taxi Driver,A mentally unstable Vietnam War veteran works ...,"['drama', 'crime']",8.2,808582.0


# Search Based Recommendation System

In [10]:
titles.shape

(5850, 15)

In [11]:
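# Assign the result back instead of using inplace=True on an attribute-accessed column,
# which may act on a temporary copy in recent pandas versions
titles['imdb_votes'] = titles['imdb_votes'].fillna(0)
titles['description'] = titles['description'].fillna(' ')
# Fill every remaining missing value with 0
titles = titles.fillna(0)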

In [12]:
tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(titles['description'])
cosine_sim = cosine_similarity(matrix, matrix)

In [13]:
cosine_sim.shape

(5850, 5850)

In [14]:
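def search(string, cosine_sim = cosine_sim):
    """Return the 11 highest-scoring titles (the query itself included) for every title matching `string`."""

    # Case-insensitive exact match; several rows can share one title
    index = titles[titles['title'].str.lower()==string.lower()].index
    all_movies = []

    for i in index:
        # Pair every row position with its similarity to row i
        scores = list(enumerate(cosine_sim[i]))
        # Highest similarity first; the query itself normally ranks at the top
        scores = sorted(scores,key=lambda x: x[1], reverse=True)[0:11]
        movies = [titles.iloc[n]['title'] for n,j in scores]
        all_movies.extend(movies)
    return all_movies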

In [15]:
search('Avatar: The Last Airbender')

['Avatar: The Last Airbender',
 'The Legend of Korra',
 'Blood and Bone',
 'The Dragon Prince',
 'Vivo',
 'Five Came Back',
 'Shadow and Bone',
 'Violet Evergarden: The Movie',
 'The Worthy',
 'The Giver',
 'The Liberator']

Here we go! Avatar: The Last Airbender is an animated series. The Legend of Korra is its sequel, and The Dragon Prince and Violet Evergarden: The Movie are also animated titles. Let's try some more.
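To see why descriptions that share uncommon words end up close together, here is a minimal sketch on three made-up descriptions (they are illustrative assumptions, not rows from titles.csv). It uses the same `TfidfVectorizer` and `cosine_similarity` calls as the cells above; the `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up descriptions for illustration only (not taken from the dataset)
toy_descriptions = [
    "A young airbender must master the elements to end a war",
    "Another avatar learns airbending while the elements fall out of balance",
    "A taxi driver roams the city streets at night",
]

# Same settings as the notebook: English stop words are dropped before weighting
toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_descriptions)

# Entry [i, j] is the cosine similarity between description i and description j
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim.round(2))
```

The first two descriptions share the word "elements", so their similarity is above zero, while the third shares no non-stop word with either and scores zero against both.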

In [16]:
search("Monty Python's Flying Circus")

["Monty Python's Flying Circus",
 'Standup and Away! with Brian Regan',
 'Monty Python Conquers America',
 'Parrot Sketch Not Included: Twenty Years of Monty Python',
 'I Think You Should Leave with Tim Robinson',
 'The Who Was? Show',
 'Shor in the City',
 'Plastic Cup Boyz: Laughing My Mask Off!',
 'Hot Date',
 'Horrid Henry',
 'All That']

In [17]:
search('Violet Evergarden')

['Violet Evergarden',
 'Violet Evergarden: Eternity and the Auto Memories Doll',
 'Violet Evergarden: The Movie',
 'The Doll',
 'Nappily Ever After',
 'The Boy',
 'Brahms: The Boy II',
 'Dirty Lines',
 'Family Blood',
 'Robin Robin',
 'Polly Pocket']

These are some of the recommendations. To surface only well-rated titles, the results could additionally be filtered by a minimum `imdb_score` and a minimum `imdb_votes`.
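A rough sketch of such a threshold filter is below. The helper name `filter_by_rating`, the default thresholds, and the toy frame are assumptions made for illustration; only the `title`, `imdb_score` and `imdb_votes` column names come from titles.csv. Because missing scores and votes were filled with 0 earlier, unrated titles would simply fail the thresholds.

```python
import pandas as pd

def filter_by_rating(candidates, titles, min_score=7.0, min_votes=10000):
    """Keep only candidate titles whose IMDb score and vote count clear both thresholds."""
    good = titles[(titles['imdb_score'] >= min_score) & (titles['imdb_votes'] >= min_votes)]
    allowed = set(good['title'])
    # Preserve the original ranking of the candidates
    return [t for t in candidates if t in allowed]

# Toy data for illustration only (hypothetical titles and numbers)
toy_titles = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'imdb_score': [8.2, 6.0, 8.5, 7.5],
    'imdb_votes': [50000, 200000, 500, 20000],
})

filtered = filter_by_rating(['A', 'B', 'C', 'D'], toy_titles)
print(filtered)
```

In the notebook this could be applied as `filter_by_rating(search('Violet Evergarden'), titles)`. Matching on the title string means that two titles with the same name are treated as one.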

# Character, Actor and Genre based Recommendation System

In [18]:
df1 = df1.drop_duplicates()

In [19]:
titles['crew'] = titles.title
titles['crew'] = titles['crew'].transform(lambda x: ' '.join(df1[df1['title']==x]['name'].to_list()))

In [20]:
' '.join(df1[df1['title']=='Violet Evergarden']['character'].to_list())

'Violet Evergarden (voice) Gilbert Bougainvillea (voice) Cattleya Baudelaire (voice) Iris Cannary (voice) Benedict Blue (voice) Erica Brown (voice) Claudia Hodgins (voice)'

In [21]:
titles['characters'] = titles.title
titles['characters'] = titles['characters'].transform(lambda x: df1[df1['title']==x]['character'].to_list())
titles['characters'] = titles['characters'].transform(lambda x: ' '.join(str(i) for i in x))

In [22]:
titles['genres'] = titles['genres'].transform(lambda x: ' '.join(ast.literal_eval(x)))

In [23]:
titles['key_words'] = titles.index
titles['key_words'] = titles['key_words'].transform(lambda x:
                                                            str(titles.iloc[x]['crew']) +
                                                            str(titles.iloc[x]['characters']) +
                                                            str(titles.iloc[x]['genres']))
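One detail worth checking: the cell above concatenates the three strings without a separator, so the last crew name fuses with the first character name, and the last character name fuses with the first genre. The sketch below illustrates the effect on a single toy row (values borrowed from the Taxi Driver rows shown earlier) and one possible alternative using `Series.str.cat` with a space separator. The `toy`, `fused` and `spaced` names are hypothetical, and switching to this version would change the tokens seen by `CountVectorizer` and therefore possibly the results below.

```python
import pandas as pd

# One toy row built from values shown in the earlier outputs
toy = pd.DataFrame({
    'crew': ['Robert De Niro Jodie Foster'],
    'characters': ['Travis Bickle Iris Steensma'],
    'genres': ['drama crime'],
})

# Plain concatenation, as in the cell above: neighbouring words run together
fused = toy['crew'] + toy['characters'] + toy['genres']

# Same concatenation with a space between the three parts
spaced = toy['crew'].str.cat([toy['characters'], toy['genres']], sep=' ')

print(fused[0])
print(spaced[0])
```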

In [24]:
df2 = titles[['title','key_words']]
df2.head()

Unnamed: 0,title,key_words
0,Five Came Back: The Reference Films,documentation
1,Taxi Driver,Robert De Niro Jodie Foster Albert Brooks Harv...
2,Deliverance,Jon Voight Burt Reynolds Ned Beatty Ronny Cox ...
3,Monty Python and the Holy Grail,Graham Chapman John Cleese Eric Idle Terry Gil...
4,The Dirty Dozen,Lee Marvin Ernest Borgnine Charles Bronson Jim...


In [25]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(titles['key_words'])

In [26]:
cosine_distance = cosine_similarity(count_matrix, count_matrix)

In [27]:
search('Avatar: The Last Airbender',cosine_sim=cosine_distance)

['Avatar: The Last Airbender',
 'Cloudy with a Chance of Meatballs',
 'Cloudy with a Chance of Meatballs',
 'Words Bubble Up Like Soda Pop',
 'She-Ra and the Princesses of Power',
 'The Legend of Korra',
 'Drifting Dragons',
 'Tiger & Bunny: The Beginning',
 'Watership Down',
 'Hey Arnold! The Jungle Movie',
 'Dorohedoro']

Now the recommendations look better than the previous ones: almost all of them are animated titles.

In [28]:
search('Violet Evergarden',cosine_sim=cosine_distance)

['Violet Evergarden',
 'Violet Evergarden: The Movie',
 'Violet Evergarden: Eternity and the Auto Memories Doll',
 'Words Bubble Up Like Soda Pop',
 'Cloudy with a Chance of Meatballs',
 'Bubble',
 'Cloudy with a Chance of Meatballs',
 'Record of Ragnarok',
 'Dorohedoro',
 'She-Ra and the Princesses of Power',
 'Mobile Suit Gundam Unicorn']

In [29]:
search('Taxi Driver',cosine_sim=cosine_distance)

['Taxi Driver',
 'Taxi Driver',
 'Dirty Harry',
 'Contagion',
 'The Imitation Game',
 'War Dogs',
 'Luck by Chance',
 'White Christmas',
 'Search Party',
 'Cleaner',
 'Ava',
 'Taxi Driver',
 'Taxi Driver',
 'Dirty Harry',
 'Contagion',
 'The Imitation Game',
 'War Dogs',
 'Luck by Chance',
 'White Christmas',
 'Search Party',
 'Cleaner',
 'Ava']

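In this case the output has 22 entries: two titles are named Taxi Driver, so `search` builds one list of 11 for each match and concatenates them. Because `crew` and `characters` were collected by title, both entries share the same cast and character keywords, which likely explains why the two lists match.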
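One way to avoid the repeated entries is to merge the matches and drop duplicate names. The following is a sketch under assumptions, not the notebook's method: `search_unique` is a hypothetical helper, it takes the best score across all rows matching the query, it skips the query title itself, and it assumes the similarity matrix rows are in the same order as the rows of `titles`. The toy similarity values are invented.

```python
import numpy as np
import pandas as pd

def search_unique(query, titles, sim, top_n=10):
    """Return up to top_n distinct titles, excluding the query, ranked by best similarity."""
    mask = (titles['title'].str.lower() == query.lower()).to_numpy()
    rows = np.flatnonzero(mask)
    if rows.size == 0:
        return []
    # For each candidate, keep its highest similarity to any row matching the query
    best = sim[rows].max(axis=0)
    order = np.argsort(-best, kind='stable')
    results, seen = [], {query.lower()}
    for n in order:
        name = titles['title'].iloc[n]
        key = name.lower()
        if key not in seen:
            seen.add(key)
            results.append(name)
        if len(results) == top_n:
            break
    return results

# Toy data for illustration only (similarity values are made up)
toy_titles = pd.DataFrame({'title': ['Taxi Driver', 'Taxi Driver', 'Dirty Harry', 'Contagion', 'Ava']})
toy_sim = np.array([
    [1.0, 0.9, 0.5, 0.2, 0.1],
    [0.9, 1.0, 0.4, 0.3, 0.1],
    [0.5, 0.4, 1.0, 0.0, 0.0],
    [0.2, 0.3, 0.0, 1.0, 0.0],
    [0.1, 0.1, 0.0, 0.0, 1.0],
])

unique_results = search_unique('taxi driver', toy_titles, toy_sim, top_n=2)
print(unique_results)
```

In the notebook the call would look like `search_unique('Taxi Driver', titles, cosine_distance)`.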

In [30]:
search('The Legend of Korra',cosine_sim=cosine_distance)

['The Legend of Korra',
 'Cloudy with a Chance of Meatballs',
 'Cloudy with a Chance of Meatballs',
 'Words Bubble Up Like Soda Pop',
 'Tiger & Bunny: The Beginning',
 'Maya and the Three',
 'She-Ra and the Princesses of Power',
 'Dorohedoro',
 'Hey Arnold! The Jungle Movie',
 'Black Is Beltza',
 'Avatar: The Last Airbender']

In [31]:
search('3 Idiots',cosine_sim=cosine_distance)

['3 Idiots',
 'Ek Main Aur Ekk Tu',
 'Mrs. Serial Killer',
 'Rang De Basanti',
 'Billu',
 'Humpty Sharma Ki Dulhania',
 'Do Dooni Chaar',
 'Lean On Me',
 'Damini',
 'Chaman Bahar',
 'PK']