![](https://i.postimg.cc/ZqrQRHMs/image.jpg)

# Data loading

## Importing libraries

In [1]:
import numpy as np 
import pandas as pd 

import plotly.express as px
import plotly.figure_factory as ff

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings('ignore')

/kaggle/input/netflix-shows/netflix_titles.csv


## Short description

* **show_id** - ID of the movie or TV show
* **type** - product type: Movie or TV Show
* **title** - name of the movie or TV show
* **director** - director of the movie or TV show
* **cast** - main actors of the project
* **country** - country of production
* **date_added** - date the title was added to Netflix
* **release_year** - year of release
* **rating** - age rating (e.g. PG-13, TV-MA)
* **duration** - duration in minutes for movies and in seasons for TV shows
* **listed_in** - genre
* **description** - short description

# EDA and visualization

In [2]:
netflix_full = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
netflix_full.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [3]:
# Inspect dtypes and missing values. The dataset has 12 columns.

netflix_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [4]:
netflix_full.isna().sum() 

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

In [5]:
duration = netflix_full['duration'].value_counts()
duration

duration
1 Season     1793
2 Seasons     425
3 Seasons     199
90 min        152
94 min        146
             ... 
16 min          1
186 min         1
193 min         1
189 min         1
191 min         1
Name: count, Length: 220, dtype: int64

In [6]:
chart_colors = ['#2ca02c', '#8c564b', '#ff7f0e', '#1f77b4',  '#FF9900', '#d1d6d5', '#333333', '#FFFFFF']

fig = px.histogram(duration, duration.index, duration.values, 
                   template= "plotly_dark",
                   color_discrete_sequence= chart_colors, 
                   width = 800, height = 500, 
                   title = 'The most popular projects duration wise'
                  )
fig.update_layout(bargap=0.5)

fig.show()

* Single-season TV shows are the most common titles. Among movies, the most frequent durations are 90 and 94 minutes.

* Let's split the data and analyze TV shows and movies separately from here on.
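
Because `duration` mixes two units (minutes for movies, seasons for TV shows), it can help to separate the number from the unit before comparing values. A minimal sketch, assuming a small hand-made frame in place of `netflix_full`; the names `toy` and `parsed` are illustrative only:

```python
import pandas as pd

# Toy stand-in for netflix_full; the strings mimic the real 'duration' values.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "1 Season", None],
})

# Capture the leading number and the unit word that follows it.
parsed = toy["duration"].str.extract(r"(?P<value>\d+)\s*(?P<unit>min|Season)")
parsed["value"] = pd.to_numeric(parsed["value"])  # missing durations stay NaN

toy = toy.join(parsed)
print(toy)
```

With the unit in its own column, movies and TV shows can be filtered or plotted without string matching on `duration`.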

In [7]:
movies = netflix_full[netflix_full['type'] == 'Movie'].copy()

In [8]:
tv_shows = netflix_full[netflix_full['type'] == 'TV Show'].copy()

In [9]:
chart_colors = ['#8c564b', '#2ca02c', '#ff7f0e', '#1f77b4',  '#FF9900', '#d1d6d5', '#333333', '#FFFFFF']

fig = px.histogram(netflix_full, 'type', 
                   template= "plotly_dark",
                   color_discrete_sequence= chart_colors, 
                   width = 800, height = 400,
                   text_auto=True, 
                   title = 'Qty of Movies and TV Shows on Netflix'
                  )

fig.show()

* Movies appear more than twice as often as TV shows in the dataset.

### TV Shows

* Let's start with TV shows.

#### Release

In [10]:
tv_shows_release = tv_shows[['date_added']]
tv_shows_release = tv_shows_release.dropna()
tv_shows_release[:5]

Unnamed: 0,date_added
1,"September 24, 2021"
2,"September 24, 2021"
3,"September 24, 2021"
4,"September 24, 2021"
5,"September 24, 2021"


In [11]:
tv_shows_release['month'] = tv_shows_release['date_added'].apply(lambda x: x.split(' ')[0])
tv_shows_release['year'] = tv_shows_release['date_added'].apply(lambda x: x.split(', ')[1])
tv_shows_release.head()

Unnamed: 0,date_added,month,year
1,"September 24, 2021",September,2021
2,"September 24, 2021",September,2021
3,"September 24, 2021",September,2021
4,"September 24, 2021",September,2021
5,"September 24, 2021",September,2021
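
Splitting on spaces works for well-formed strings, but if a `date_added` value carried leading whitespace, `x.split(' ')[0]` would return an empty string instead of the month. A hedged alternative is to strip the strings and parse them as dates. The sketch below uses a toy series (`dates` is an illustrative name, and the second value is made up to show the whitespace case):

```python
import pandas as pd

# Toy stand-in for the 'date_added' column; note the leading space.
dates = pd.Series(["September 24, 2021", " August 4, 2017", "April 1, 2016"])

# Strip stray whitespace, then parse with an explicit format.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")

release = pd.DataFrame({
    "date_added": dates,
    "month": parsed.dt.month_name(),  # e.g. 'September'
    "year": parsed.dt.year,           # integer year
})
print(release)
```

Note that `year` is an integer here, whereas the `split`-based version above produces strings.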


In [12]:
months_list = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

In [13]:
tv_shows_release_matrix =  tv_shows_release.groupby('year')['month'].value_counts().unstack()[months_list].fillna(0).T
tv_shows_release_matrix

year,2008,2013,2014,2015,2016,2017,2018,2019,2020,2021
January,0.0,0.0,0.0,0.0,26.0,14.0,18.0,35.0,52.0,36.0
February,1.0,0.0,1.0,0.0,6.0,16.0,23.0,42.0,42.0,44.0
March,0.0,1.0,0.0,1.0,2.0,36.0,32.0,52.0,44.0,37.0
April,0.0,0.0,1.0,4.0,7.0,25.0,27.0,42.0,50.0,53.0
May,0.0,0.0,0.0,0.0,2.0,22.0,25.0,48.0,52.0,38.0
June,0.0,0.0,0.0,1.0,7.0,27.0,27.0,46.0,41.0,83.0
July,0.0,0.0,0.0,2.0,9.0,30.0,25.0,57.0,43.0,88.0
August,0.0,1.0,0.0,0.0,11.0,33.0,33.0,44.0,47.0,61.0
September,0.0,1.0,0.0,0.0,17.0,32.0,42.0,36.0,53.0,65.0
October,0.0,1.0,0.0,4.0,19.0,28.0,44.0,63.0,51.0,0.0
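
In recent pandas versions, selecting `[months_list]` after `unstack()` raises a `KeyError` if some month never occurs in the data, and `fillna(0)` leaves the counts as floats. A possible variant, sketched on toy data under the assumption that the frame has `month` and `year` columns like `tv_shows_release`, uses `reindex` instead:

```python
import pandas as pd

months_list = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Toy stand-in for tv_shows_release.
toy = pd.DataFrame({
    "month": ["January", "March", "March", "December"],
    "year": ["2020", "2020", "2021", "2021"],
})

matrix = (
    toy.groupby(["month", "year"]).size()   # count rows per (month, year)
       .unstack(fill_value=0)               # years become columns, gaps become 0
       .reindex(months_list, fill_value=0)  # calendar order, absent months kept as 0
)
print(matrix)
```

The result has one row per calendar month with integer counts, so it can be passed to `px.imshow` in the same way.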


In [14]:
fig = px.imshow(tv_shows_release_matrix, 
                template= "plotly_dark",
                text_auto=True,
                width = 800, height = 800, 
                title = 'Release matrix')
fig.show()

* The lighter the cell, the more titles were added in that month.

In [15]:
fig = px.histogram(tv_shows_release,  'year', 
             template= "plotly_dark",
             color_discrete_sequence= ['#F9917A'],
             width = 800, height = 500,
             text_auto=True,
             title = 'Number of released TV Shows (Year split)'
            )

fig.show()

#### Rating

In [16]:
tvshow_rating = tv_shows['rating'].reset_index()
tvshow_rating = tvshow_rating.groupby('rating').count().reset_index().sort_values(by = 'index', ascending=False)
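
The same frequency table can also be built with `value_counts`, which sorts in descending order and skips missing ratings by default. A small sketch on toy data (the column name `count` is a choice made here, not part of the notebook):

```python
import pandas as pd

toy = pd.DataFrame({"rating": ["TV-MA", "TV-14", "TV-MA", None, "TV-PG", "TV-MA"]})

rating_counts = (
    toy["rating"]
    .value_counts()               # descending counts, NaN excluded
    .rename_axis("rating")        # name the index so it becomes a column
    .reset_index(name="count")
)
print(rating_counts)
```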

In [17]:
fig = px.bar(tvshow_rating,  'rating', 'index', 
             template= "plotly_dark",
             color_discrete_sequence= ['#1F77B4'],
             width = 800, height = 500,
             text_auto=True,
             title = 'The most frequent tv shows (rating split)'
            )

fig.show()

#### Production country

In [18]:
countries = tv_shows['country'].fillna('No_info')
countries.isna().sum()

0

In [19]:
# Split fields that list several countries and count each country.
# Note: all spaces are removed first, so 'United States' becomes 'UnitedStates'.

contry_dict = {}

for i in countries:
    i = i.replace(' ', '')
    i = i.split(',')

    for j in i:
        contry_dict[j] = contry_dict.get(j, 0) + 1

In [20]:
countries_sorted = sorted(contry_dict.items(), key = lambda item: item[1], reverse = True)
countries_sorted = countries_sorted[:15]

In [21]:
countries_sorted

[('UnitedStates', 938),
 ('No_info', 391),
 ('UnitedKingdom', 272),
 ('Japan', 199),
 ('SouthKorea', 170),
 ('Canada', 126),
 ('France', 90),
 ('India', 84),
 ('Taiwan', 70),
 ('Australia', 66),
 ('Spain', 61),
 ('Mexico', 58),
 ('China', 48),
 ('Germany', 44),
 ('Colombia', 32)]
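
The names above read `UnitedStates` and `SouthKorea` because the loop removes every space before splitting. A sketch of an alternative that trims only the whitespace around each name, using pandas string methods on a toy series (the values are illustrative):

```python
import pandas as pd

# Toy stand-in for tv_shows['country'].
countries = pd.Series(["United States, Canada", "United States", None, "South Korea"])

country_counts = (
    countries.fillna("No_info")
    .str.split(",")     # one list of names per title
    .explode()          # one row per (title, country) pair
    .str.strip()        # trim surrounding whitespace only
    .value_counts()
)
top = country_counts.head(15)
print(top)
```

Here multi-word names stay readable in chart labels, and `top.index` and `top.values` can serve as `names` and `values` for the pie chart.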

In [22]:
# getting countries' names

states = [i for i, value in countries_sorted]
states 

['UnitedStates',
 'No_info',
 'UnitedKingdom',
 'Japan',
 'SouthKorea',
 'Canada',
 'France',
 'India',
 'Taiwan',
 'Australia',
 'Spain',
 'Mexico',
 'China',
 'Germany',
 'Colombia']

In [23]:
# getting values

qty = [value for i, value in countries_sorted]
qty 

[938, 391, 272, 199, 170, 126, 90, 84, 70, 66, 61, 58, 48, 44, 32]

In [24]:
chart_colors = [ '#2CA02C', '#D62728', '#9467BD', '#E377C2', '#1F77B4', '#FF7F0E',] 

fig = px.pie(values = qty, names = states,
             template= "plotly_dark", 
             title = 'Top-15 tv show production countries', 
             width = 800, height = 500, 
             color_discrete_sequence=  chart_colors)


fig.show()

#### TV Shows duration

In [25]:
tvshow_duration = tv_shows[['title', 'duration']].copy()
tvshow_duration.head(3)

Unnamed: 0,title,duration
1,Blood & Water,2 Seasons
2,Ganglands,1 Season
3,Jailbirds New Orleans,1 Season


In [26]:
# Getting number of seasons

tvshow_duration['seasons_no'] = tvshow_duration['duration'].apply(
    lambda x:  x.replace('Seasons', '') if 'Seasons' in x else x.replace('Season', ''))

In [27]:
tvshow_duration['seasons_no'] = tvshow_duration['seasons_no'].astype('int')
tvshow_duration = tvshow_duration[['title', 'seasons_no']]
tvshow_duration[:3]

Unnamed: 0,title,seasons_no
1,Blood & Water,2
2,Ganglands,1
3,Jailbirds New Orleans,1


In [28]:
# TV shows with 10 or more seasons

tvshow_duration10 = tvshow_duration[tvshow_duration['seasons_no'] >= 10]

In [29]:
tvshow_duration10 = tvshow_duration10.sort_values(by = 'seasons_no', ascending = False)
tvshow_duration10 

Unnamed: 0,title,seasons_no
548,Grey's Anatomy,17
4798,NCIS,15
2423,Supernatural,15
4220,COMEDIANS of the world,13
7847,Red vs. Blue,13
1354,Heartland,13
4964,Trailer Park Boys,12
5412,Criminal Minds,12
6456,Cheers,11
6795,Frasier,11


In [30]:
fig = px.bar(tvshow_duration10,  'title', 'seasons_no', 
             template= "plotly_dark",
             color_discrete_sequence= ['#F94233'],
             width = 800, height = 500,
             text_auto=True,
             title = 'Tv shows with the biggest number of seasons +10'
            )

fig.show()

In [31]:
x = tvshow_duration['seasons_no']
hist_data = [x]
group_labels = ['distplot'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels)

fig.update_layout(title="TV Shows duration distribution", 
                  width = 900, height = 600, 
                  template= "plotly_dark")                    
                   
fig.show()

* Single-season shows clearly dominate among TV shows.

#### Genres

In [32]:
genre = tv_shows['listed_in']

In [33]:
genre_list = []

for i in genre:
    i = i.split(',')
    for j in i:
        genre_list.append(' '.join(j.split())) # merging all values into one list and deleting extra spaces

In [34]:
from collections import Counter

genre = Counter(genre_list)
genre = sorted(genre.items(), key = lambda item: item[1], reverse = True)
genre = genre[:15]

In [35]:
value = [count for _, count in genre] # extracting counts
name = [genre_name for genre_name, _ in genre] # extracting genre names
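
`Counter` can also return the top entries directly through `most_common`, which avoids the manual sort and slice. A minimal sketch with made-up genre labels:

```python
from collections import Counter

# Toy stand-in for genre_list.
genre_list = ["TV Dramas", "Kids' TV", "TV Dramas", "Docuseries", "TV Dramas", "Kids' TV"]

top = Counter(genre_list).most_common(2)  # list of (genre, count), highest first
names, counts = zip(*top)                 # unzip into two parallel tuples

print(names)
print(counts)
```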

In [36]:
chart_colors = ['#D62728', '#9467BD', '#E377C2', '#1F77B4', '#FF7F0E', '#2CA02C'] 

fig = px.pie(values = value, names = name,
             template= "plotly_dark", 
             title = 'Top 15 genres', 
             width = 800, height = 500, 
             color_discrete_sequence=  chart_colors)


fig.show()

#### Directors

In [37]:
d = tv_shows['director']
d = d.dropna()

In [38]:
director = []

for i in d:
    i = i.split(',')
    for j in i:
        director.append(' '.join(j.split()))

In [39]:
director = Counter(director)

In [40]:
director = sorted(director.items(), key = lambda item: director[item[0]], reverse = True)[:11]

In [41]:
director

[('Alastair Fothergill', 3),
 ('Ken Burns', 3),
 ('Gautham Vasudev Menon', 2),
 ('Hsu Fu-chun', 2),
 ('Rob Seidenglanz', 2),
 ('Joe Berlinger', 2),
 ('Jung-ah Im', 2),
 ('Lynn Novick', 2),
 ('Shin Won-ho', 2),
 ('Stan Lathan', 2),
 ('Iginio Straffi', 2)]

In [42]:
name = [name for name, value in director]
value = [value for name, value in director]

In [43]:
fig = px.bar(x = name, y = value, 
             template= "plotly_dark",
             color_discrete_sequence= ['#033695'],
             width = 800, height = 500,
             text_auto=True,
             title = 'Top-11 directors by produced tv shows q-ty'
            )

fig.update_layout(
    xaxis_title=dict(text="director"), 
    yaxis_title=dict(text="TvShows_number")
)

fig.show()

### Movies

* Now let's switch to the movies analysis.

#### Release

In [44]:
movies_release = movies[['date_added']]
movies_release = movies_release.dropna()
movies_release[:5]

Unnamed: 0,date_added
0,"September 25, 2021"
6,"September 24, 2021"
7,"September 24, 2021"
9,"September 24, 2021"
12,"September 23, 2021"


In [45]:
movies_release['month'] = movies_release['date_added'].apply(lambda x: x.split(' ')[0])
movies_release['year'] = movies_release['date_added'].apply(lambda x: x.split(', ')[1])
movies_release.head()

Unnamed: 0,date_added,month,year
0,"September 25, 2021",September,2021
6,"September 24, 2021",September,2021
7,"September 24, 2021",September,2021
9,"September 24, 2021",September,2021
12,"September 23, 2021",September,2021


* Let's build a matrix with the number of titles added in each month of each year. Any seasonality it reveals could inform the timing of new launches.

In [46]:
months_list = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

In [47]:
movies_release_matrix =  movies_release.groupby('year')['month'].value_counts().unstack()[months_list].fillna(0).T
movies_release_matrix

year,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
January,1.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,15.0,58.0,105.0,116.0,152.0,96.0
February,0.0,0.0,0.0,0.0,1.0,0.0,1.0,3.0,9.0,65.0,63.0,103.0,72.0,65.0
March,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,14.0,87.0,138.0,119.0,93.0,75.0
April,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,14.0,66.0,87.0,119.0,127.0,135.0
May,0.0,1.0,0.0,1.0,0.0,0.0,0.0,5.0,9.0,63.0,70.0,91.0,105.0,94.0
June,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,11.0,65.0,50.0,122.0,115.0,124.0
July,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,19.0,45.0,125.0,98.0,103.0,169.0
August,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,23.0,77.0,130.0,87.0,82.0,117.0
September,0.0,0.0,0.0,1.0,0.0,1.0,1.0,6.0,29.0,81.0,81.0,86.0,115.0,118.0
October,0.0,0.0,0.0,11.0,0.0,1.0,4.0,10.0,32.0,97.0,146.0,128.0,116.0,0.0


In [48]:
fig = px.imshow(movies_release_matrix, 
                template= "plotly_dark",
                text_auto=True,
                width = 800, height = 800, 
                title = 'Release matrix')
fig.show()

* The lighter the cell, the more titles were added in that month.

In [49]:
fig = px.histogram(movies_release,  'year', 
             template= "plotly_dark",
             color_discrete_sequence= ['white'],
             width = 800, height = 500,
             text_auto=True,
             title = 'Number of released Movies (year split)'
            )

fig.show()

* 2019, 2020 and 2018 are the top three years.

#### Rating

In [50]:
movies_rating = movies['rating'].reset_index()
movies_rating = movies_rating.groupby('rating').count().reset_index().sort_values(by = 'index', ascending=False)

In [51]:
fig = px.bar(movies_rating,  'rating', 'index', 
             template= "plotly_dark",
             color_discrete_sequence= ['#ff7f0e'],
             width = 800, height = 500,
             text_auto=True,
             title = 'The most frequent movies (rating split)'
            )
fig.update_layout(
    xaxis_title=dict(text="rating"), 
    yaxis_title=dict(text="movies_number")
)

fig.show()

* TV-MA ranks first. TV-14, R, TV-PG and PG-13 complete the top five.

#### Production country

In [52]:
countries = movies['country'].fillna('No_info')
countries.isna().sum()

0

In [53]:
# Split fields that list several countries and count each country.
# Note: all spaces are removed first, so 'United States' becomes 'UnitedStates'.

contry_dict = {}

for i in countries:
    i = i.replace(' ', '')
    i = i.split(',')

    for j in i:
        contry_dict[j] = contry_dict.get(j, 0) + 1

In [54]:
countries_sorted = sorted(contry_dict.items(), key = lambda item: item[1], reverse = True)
countries_sorted = countries_sorted[:15]

In [55]:
countries_sorted

[('UnitedStates', 2752),
 ('India', 962),
 ('UnitedKingdom', 534),
 ('No_info', 440),
 ('Canada', 319),
 ('France', 303),
 ('Germany', 182),
 ('Spain', 171),
 ('Japan', 119),
 ('China', 114),
 ('Mexico', 111),
 ('Egypt', 102),
 ('HongKong', 100),
 ('Nigeria', 94),
 ('Australia', 94)]

In [56]:
# getting countries' names

states = [i for i, value in countries_sorted]
states 

['UnitedStates',
 'India',
 'UnitedKingdom',
 'No_info',
 'Canada',
 'France',
 'Germany',
 'Spain',
 'Japan',
 'China',
 'Mexico',
 'Egypt',
 'HongKong',
 'Nigeria',
 'Australia']

In [57]:
# getting values

qty = [value for i, value in countries_sorted]
qty 

[2752, 962, 534, 440, 319, 303, 182, 171, 119, 114, 111, 102, 100, 94, 94]

In [58]:
chart_colors = ['#1F77B4', '#FF7F0E', '#2CA02C', '#D62728', '#9467BD', '#E377C2'] 

fig = px.pie(values = qty, names = states,
             template= "plotly_dark", 
             title = 'Top 15 production movies countries', 
             width = 800, height = 500, 
             color_discrete_sequence=  chart_colors)


fig.show()

* The US leads the top 15 by a wide margin: it is listed as a production country for 2,752 movies, nearly three times as many as second-placed India (962).
* India, the UK, Canada and France complete the top five (excluding the 'No_info' placeholder used for missing countries).

#### Duration

In [59]:
movies_duration = movies['duration'].str.replace('min', '')
movies_duration = movies_duration.astype('float').dropna()

In [60]:
x = movies_duration.values
hist_data = [x]
group_labels = ['distplot'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels)

fig.update_layout(title="Movies duration distribution", 
                  width = 900, height = 600, 
                  template= "plotly_dark")                    
                   
fig.show()

* Most movies run between 85 and 105 minutes.

#### Genres

In [61]:
genre = movies['listed_in']

In [62]:
genre_list = []

for i in genre:
    i = i.split(',')
    for j in i:
        genre_list.append(' '.join(j.split())) # merging all values into one list and deleting extra spaces

In [63]:
from collections import Counter

genre = Counter(genre_list)
genre = sorted(genre.items(), key = lambda item: item[1], reverse = True)
genre = genre[:15]

In [64]:
value = [count for _, count in genre] # extracting counts
name = [genre_name for genre_name, _ in genre] # extracting genre names

In [65]:
chart_colors = ['#1F77B4', '#FF7F0E', '#2CA02C', '#D62728', '#9467BD', '#E377C2'] 

fig = px.pie(values = value, names = name,
             template= "plotly_dark", 
             title = 'Top 15 genres', 
             width = 800, height = 500, 
             color_discrete_sequence=  chart_colors)


fig.show()

* International Movies, Dramas and Comedies make up the top three, in that order.

#### Directors

In [66]:
d = movies['director']
d = d.dropna()

In [67]:
director = []

for i in d:
    i = i.split(',')
    for j in i:
        director.append(' '.join(j.split()))

In [68]:
director = Counter(director)

In [69]:
director = sorted(director.items(), key = lambda item: director[item[0]], reverse = True)[:10]

In [70]:
director

[('Rajiv Chilaka', 22),
 ('Jan Suter', 21),
 ('Raúl Campos', 19),
 ('Suhas Kadav', 16),
 ('Jay Karas', 15),
 ('Marcus Raboy', 15),
 ('Cathy Garcia-Molina', 13),
 ('Youssef Chahine', 12),
 ('Martin Scorsese', 12),
 ('Jay Chapman', 12)]

In [71]:
name = [name for name, value in director]
value = [value for name, value in director]

In [72]:
fig = px.bar(x = name, y = value, 
             template= "plotly_dark",
             color_discrete_sequence= ['#ff7f0e'],
             width = 800, height = 500,
             text_auto=True,
             title = 'Top-10 directors by produced movies q-ty'
            )

fig.update_layout(
    xaxis_title=dict(text="director"), 
    yaxis_title=dict(text="movies_number")
)

fig.show()

# Recommendation system.

![](https://i.postimg.cc/7Zf0Qq86/image.jpg)

* We are going to use TF-IDF on the movie descriptions.

In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [74]:
movies['description'] = movies['description'].fillna('')
movies['description'].isna().sum()

0

In [75]:
movies['description'].head(3)

0    As her father nears the end of his life, filmm...
6    Equestria's divided. But a bright-eyed hero be...
7    On a photo shoot in Ghana, an American model s...
Name: description, dtype: object

In [76]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['description'])
tfidf_matrix.shape

(6131, 15483)
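
The shape reads as 6131 movies (rows) by 15483 vocabulary terms (columns). To see what the vectorizer does on a smaller scale, here is a sketch on three made-up descriptions. The sentences are invented for illustration, while `TfidfVectorizer(stop_words='english')` is the same call as above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus standing in for movies['description'].
docs = [
    "A filmmaker stages the death of her father",
    "Doctors change our approach to life and death",
    "A pony saves the kingdom",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)  # sparse matrix: one row per document

print(matrix.shape)                 # (number of documents, vocabulary size)
print(sorted(tfidf.vocabulary_))    # terms kept after stop-word removal
```

Stop words such as "the" do not appear in the vocabulary, and a term shared by several documents (here "death") gets a lower weight than a term unique to one document.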

In [77]:
from sklearn.metrics.pairwise import cosine_similarity

cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cos_sim

array([[1.        , 0.        , 0.        , ..., 0.        , 0.01624469,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.03650444],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.01624469, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.03650444, ..., 0.        , 0.        ,
        1.        ]])

In [78]:
cos_sim.shape

(6131, 6131)
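
The result is a square matrix with one row and one column per movie, which as 64-bit floats takes roughly 300 MB for 6131 titles. Each entry is the cosine of the angle between two description vectors. A small NumPy sketch of the definition on toy vectors (not the notebook's data):

```python
import numpy as np

# Three toy "document vectors"; row 1 is row 0 scaled by two.
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Cosine similarity = dot product of the L2-normalized rows.
norms = np.linalg.norm(X, axis=1, keepdims=True)
unit = X / norms
cos = unit @ unit.T

print(cos)
```

Rows pointing in the same direction score 1 regardless of length, rows with no terms in common score 0, and the matrix is symmetric with ones on the diagonal, as in the output above.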

In [79]:
indexes = pd.Series(movies.index, index=movies['title']).drop_duplicates()
indexes

title
Dick Johnson Is Dead                   0
My Little Pony: A New Generation       6
Sankofa                                7
The Starling                           9
Je Suis Karl                          12
                                    ... 
Zinzana                             8801
Zodiac                              8802
Zombieland                          8804
Zoom                                8805
Zubaan                              8806
Length: 6131, dtype: int64

In [80]:
cos_sim_df = pd.DataFrame(cos_sim)

In [81]:
cos_sim_df.columns = indexes.index

In [82]:
cos_sim_df['title'] = indexes.index

In [83]:
cos_sim_df = cos_sim_df.set_index('title')

In [84]:
cos_sim_df.head(5)

title,Dick Johnson Is Dead,My Little Pony: A New Generation,Sankofa,The Starling,Je Suis Karl,Confessions of an Invisible Girl,Europe's Most Dangerous Man: Otto Skorzeny in Spain,Intrusion,Avvai Shanmughi,Go! Go! Cory Carson: Chrissy Takes the Wheel,...,Young Tiger,"Yours, Mine and Ours",اشتباك,Zed Plus,Zenda,Zinzana,Zodiac,Zombieland,Zoom,Zubaan
Dick Johnson Is Dead,1.0,0.0,0.0,0.018285,0.0,0.0,0.014848,0.0,0.023927,0.0,...,0.0,0.02889,0.0,0.0,0.0,0.0,0.0,0.0,0.016245,0.0
My Little Pony: A New Generation,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sankofa,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.029736,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036504
The Starling,0.018285,0.0,0.0,1.0,0.029133,0.0,0.016294,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.061674,0.017826,0.045998
Je Suis Karl,0.0,0.0,0.0,0.029133,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028473


In [85]:
def show_recommendation(title, cos_sim = cos_sim_df):
    # similarity of `title` to every movie, paired with the movie's row position
    sim_scores = list(enumerate(cos_sim.loc[title]))

    # most similar first; the first entry is the movie itself, so skip it
    sim_scores = sorted(sim_scores, key= lambda x: x[1], reverse = True)
    sim_scores = sim_scores[1:11]

    movies_indexes = [i[0] for i in sim_scores]

    return movies['title'].iloc[movies_indexes]

In [86]:
# Take 'Dick Johnson Is Dead' as an example.

show_recommendation('Dick Johnson Is Dead')

4877                                   End Game
1066                                   The Soul
7506                                       Moon
5047                    The Cloverfield Paradox
5233    The Death and Life of Marsha P. Johnson
2674                                      Alelí
6327                           Black Snake Moan
4241                  Secrets in the Hot Spring
1731               A New York Christmas Wedding
2380                               Riding Faith
Name: title, dtype: object
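
Since the similarity frame is labelled by title on both axes, the same lookup can be written with `drop` and `nlargest`, which also returns the scores. A sketch on a toy 4x4 similarity frame, assuming titles are unique; `recommend` and the titles A to D are hypothetical names:

```python
import pandas as pd

titles = ["A", "B", "C", "D"]
sim = pd.DataFrame(
    [[1.0, 0.2, 0.7, 0.0],
     [0.2, 1.0, 0.1, 0.4],
     [0.7, 0.1, 1.0, 0.3],
     [0.0, 0.4, 0.3, 1.0]],
    index=titles, columns=titles,
)

def recommend(title, sim_df, n=2):
    """Return the n titles most similar to `title`, with their scores."""
    scores = sim_df.loc[title].drop(title)  # remove the movie itself by label
    return scores.nlargest(n)

print(recommend("A", sim))
```

Dropping the queried title by label avoids relying on it being the first entry after sorting.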

In [87]:
cos_sim_df.loc[(cos_sim_df.index == 'Dick Johnson Is Dead') | (cos_sim_df.index == 'End Game'), 
               ['Dick Johnson Is Dead', 'End Game']]

title,Dick Johnson Is Dead,End Game
Dick Johnson Is Dead,1.0,0.156451
End Game,0.156451,1.0


In [88]:
movies[movies['title'] == 'Dick Johnson Is Dead'].description.iloc[0]

'As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.'

In [89]:
movies[movies['title'] == 'End Game'].description.iloc[0]

'Facing an inevitable outcome, terminally ill patients meet extraordinary medical practitioners seeking to change our approach to life and death.'
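
To see where the 0.156 similarity comes from, it helps to list the words the two descriptions share. The sketch below uses plain Python and a hand-picked stop-word set, which is an assumption for this illustration and not scikit-learn's full English list:

```python
import re

desc_a = ("As her father nears the end of his life, filmmaker Kirsten Johnson "
          "stages his death in inventive and comical ways to help them both "
          "face the inevitable.")
desc_b = ("Facing an inevitable outcome, terminally ill patients meet "
          "extraordinary medical practitioners seeking to change our approach "
          "to life and death.")

# Hand-picked stop words for this illustration only.
stop_words = {"and", "to", "the", "of", "an", "in", "as",
              "her", "his", "our", "them", "both"}

def tokens(text):
    """Lowercase the text and return its set of words minus stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - stop_words

shared = tokens(desc_a) & tokens(desc_b)
print(sorted(shared))
```

Only a few content words overlap. "face" and "Facing" do not match because no stemming is applied, which is also the case for the default `TfidfVectorizer`.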

* Thanks for your attention.
* If you liked this work or found it useful, please share your feedback.

* P.S.: special thanks to Ms. Julia Ponomareva for the inspiration - https://t.me/machine_learrrning