## Introduction

This notebook explores a dataset of Netflix titles, including movies and TV shows. 
The main tasks involve loading, viewing, and analyzing this data using Python libraries.

In [1]:
import pandas as pd # Importing the Pandas Library 
file = pd.read_csv('netflix_titles.csv') # function from Pandas that reads a CSV file and loads into a DataFrame
file.head(2) # displays the first 2 rows of the DataFrame.

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."


### Importing Libraries
- __pandas__: Essential for handling and analyzing data in tables (DataFrames).
- __numpy__: Useful for numerical operations (not directly used here but commonly paired with pandas).
- __matplotlib.pyplot__: A plotting library for visualizing data.
- __os__: Used for working with file and directory paths.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

### Loading the Data
- `pd.read_csv()`: Reads a CSV file into a DataFrame, the tabular structure pandas uses for data manipulation.
- `movies.head(2)`: Displays the first two rows of the dataset for a quick look at its structure and content.

In [3]:
movies = pd.read_csv('netflix_titles.csv')
movies.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."


### Checking Dataset Dimensions
- `movies.shape` returns a `(rows, columns)` tuple. Example: `(8807, 12)` means 8,807 entries and 12 columns.

In [4]:
movies.shape # return a tuple like (rows, columns)

(8807, 12)

### Accessing a Single Value or a Single Column

__`movies.iloc[1]['type']`__
- `.iloc[1]`: selects the row based on its position in the DataFrame. `movies.iloc[1]` retrieves the data from the second row (index 1) of the DataFrame, since indexing starts at 0.
- `['type']`: This part specifies the column from which you want to retrieve the data in that row. By appending `['type']`, it selects the value in the "type" column of the second row.

| Index | type | title |
| -- | -- | -- |
| 0 | Movie | Movie A |
| 1 | TV Show | TV Show B |
| 2 | Movie | Movie C |

----
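__`movies['type']`__ selects the whole "type" column as a Series. Common follow-ups:
- `movies['type'].unique()`: returns the distinct values in the column.
- `movies['type'].value_counts()`: counts how often each value occurs.
- `movies[movies['type'] == 'Movie']`: keeps only the rows whose type is "Movie".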
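
The access patterns listed above can be tried on a tiny table first. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and mirror the small example table, not the real Netflix file.

```python
import pandas as pd

# Toy stand-in for the Netflix table (values are made up for illustration).
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "title": ["Movie A", "TV Show B", "Movie C"],
})

second_type = toy.iloc[1]["type"]          # value at row position 1, column "type"
kinds = toy["type"].unique().tolist()      # distinct values, in order of appearance
counts = toy["type"].value_counts()        # occurrences of each value
only_movies = toy[toy["type"] == "Movie"]  # boolean mask keeps the matching rows

print(second_type)                     # TV Show
print(kinds)                           # ['Movie', 'TV Show']
print(counts.to_dict())                # {'Movie': 2, 'TV Show': 1}
print(only_movies["title"].tolist())   # ['Movie A', 'Movie C']
```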

In [5]:
movies.iloc[1]['type']  

'TV Show'

In [6]:
movies['type']

0         Movie
1       TV Show
2       TV Show
3       TV Show
4       TV Show
         ...   
8802      Movie
8803    TV Show
8804      Movie
8805      Movie
8806      Movie
Name: type, Length: 8807, dtype: object

In [7]:
movies['type'].value_counts() # used to count the occurrences of each unique value in the type column 

type
Movie      6131
TV Show    2676
Name: count, dtype: int64

In [8]:
movies.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [9]:
movies[['type','title','rating','duration']].head()

Unnamed: 0,type,title,rating,duration
0,Movie,Dick Johnson Is Dead,PG-13,90 min
1,TV Show,Blood & Water,TV-MA,2 Seasons
2,TV Show,Ganglands,TV-MA,1 Season
3,TV Show,Jailbirds New Orleans,TV-MA,1 Season
4,TV Show,Kota Factory,TV-MA,2 Seasons


### Checking for Missing Values
- `movies.isnull()`: Identifies where data is missing by creating a table of True/False values (True if data is missing).
- `sum()`: Calculates the total missing values for each column.
- This helps decide if and how to handle missing data (e.g., removing rows, filling in values).
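
As a small illustration of the points above, the sketch below counts missing values on a made-up three-row table and then shows the two usual follow-ups (dropping rows or filling values). The `toy` data and the `"Unknown"` placeholder are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up table with a few gaps.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Someone", np.nan, np.nan],
    "rating": ["PG", "TV-MA", None],
})

# isnull() gives a True/False table; sum() adds up the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column.to_dict())  # {'title': 0, 'director': 2, 'rating': 1}

# Two common follow-ups: drop incomplete rows, or fill the gaps with a placeholder.
complete_rows = toy.dropna()
filled = toy.fillna("Unknown")
print(len(complete_rows))            # 1
```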

In [10]:
movies.isnull().sum() # count the number of missing (null or NaN) values in each column 

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

In [11]:
movies.duplicated().sum()

np.int64(0)

## Splitting text / Categorizing and Tagging Data


- Retrieve and split the genres or categories listed in the `listed_in` column for a specific row in the DataFrame.
- This method is helpful for breaking down a single string of categories into separate items, allowing you to analyze or work with each genre/category individually. This approach is commonly used in data cleaning and preparation.
- Text data, like genres, tags, or keywords, often comes in a single string separated by commas, spaces, or other delimiters. Splitting allows you to work with each category separately.
- Example: Splitting "Action, Adventure, Comedy" into individual genres `["Action", "Adventure", "Comedy"]` makes it easier to analyze the frequency of each genre or tag.

- Columns in this dataset that hold text worth splitting: `listed_in`, `cast`, `description`, `director`, `country`.
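
A plain `split(',')` leaves the space that follows each comma attached to the next item. The sketch below is one possible variant that also strips that whitespace and then counts genres. The helper name `split_and_strip` and the sample strings are assumptions made for this example.

```python
import pandas as pd

genres = pd.Series([
    "International TV Shows, TV Dramas, TV Mysteries",
    "Documentaries",
])

def split_and_strip(text):
    """Split on commas and drop the whitespace around each piece."""
    return [part.strip() for part in text.split(",")]

as_lists = genres.apply(split_and_strip)
print(as_lists[0])  # ['International TV Shows', 'TV Dramas', 'TV Mysteries']

# explode() gives one row per genre, which makes frequency counts easy.
genre_counts = as_lists.explode().value_counts()
print(genre_counts["Documentaries"])  # 1
```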

In [12]:
movies.iloc[1]['listed_in'].split(',')

['International TV Shows', ' TV Dramas', ' TV Mysteries']

In [13]:
def convert(text):
    return text.split(',')

In [14]:
movies['listed_in'] = movies['listed_in'].apply(convert)

In [15]:
movies.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,[Documentaries],"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"[International TV Shows, TV Dramas, TV Myste...","After crossing paths at a party, a Cape Town t..."


In [16]:
movies.loc[2]['cast'].split(',')

['Sami Bouajila',
 ' Tracy Gotoas',
 ' Samuel Jouy',
 ' Nabiha Akkari',
 ' Sofia Lesaffre',
 ' Salim Kechiouche',
 ' Noureddine Farihi',
 ' Geert Van Rampelberg',
 ' Bakary Diombera']

In [17]:
def convert_cast(text):
    # Keep at most the first three comma-separated names.
    return text.split(',')[:3]
# convert_cast("hello, their how, are, you")  # -> ['hello', ' their how', ' are']

### Explanation

The next cell handles missing values (NaN) in the "cast" column of the movies DataFrame by applying the custom function `convert_cast` only to non-NaN values.

1. `movies['cast']`: Accesses the "cast" column in the DataFrame.
2. `.apply(lambda x: ...)`: Applies a function to each element of the "cast" column.
    - Here the lambda applies different logic depending on whether the value is NaN.
3. `pd.notna(x)`: Checks whether the value x is not NaN (meaning it holds an actual value).
    - If `pd.notna(x)` is True: it calls `convert_cast(x)`.
    - If `pd.notna(x)` is False: it returns an empty list `[]`.
4. `convert_cast(x)`: The function defined above. It splits the comma-separated cast string and keeps at most the first three names. Calling it only on non-NaN values avoids an error, because NaN is a float and has no `.split()` method.

### Why This is Useful in Data Science

- Handles Missing Data: Missing (NaN) values in the "cast" column become empty lists, which simplifies later processing steps (no further NaN checks needed).
- Transforms Data for Analysis: Because `convert_cast` turns each cast string into a short list of names, the data is easier to use for further analysis or visualization.
- Ensures Consistency: Every entry in "cast" is either a list of actor names or an empty list, so the column has one consistent format.
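
The NaN-safe pattern described above can be checked on a three-element Series. This is a minimal sketch: `first_three` is a hypothetical stand-in for `convert_cast`, and the names in `cast` are made up.

```python
import numpy as np
import pandas as pd

def first_three(text):
    """Keep at most the first three comma-separated names (hypothetical stand-in for convert_cast)."""
    return text.split(",")[:3]

cast = pd.Series(["A One, B Two, C Three, D Four", np.nan, "E Five"])

# Real strings go through first_three; NaN becomes an empty list.
cleaned = cast.apply(lambda x: first_three(x) if pd.notna(x) else [])
print(cleaned.tolist())  # [['A One', ' B Two', ' C Three'], [], ['E Five']]
```

Every element of `cleaned` is now a list, so later steps can loop over it without checking for NaN first.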

In [18]:
# Apply convert_cast to real values; replace NaN with an empty list
movies['cast'] = movies['cast'].apply(lambda x: convert_cast(x)  if pd.notna(x) else [] )


In [19]:
movies.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,[],United States,"September 25, 2021",2020,PG-13,90 min,[Documentaries],"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"[Ama Qamata, Khosi Ngema, Gail Mabalane]",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"[International TV Shows, TV Dramas, TV Myste...","After crossing paths at a party, a Cape Town t..."


In [20]:
movies['description'] = movies['description'].apply(lambda x: x.split()  if pd.notna(x) else [] )

In [21]:
movies.iloc[3]['description']

['Feuds,',
 'flirtations',
 'and',
 'toilet',
 'talk',
 'go',
 'down',
 'among',
 'the',
 'incarcerated',
 'women',
 'at',
 'the',
 'Orleans',
 'Justice',
 'Center',
 'in',
 'New',
 'Orleans',
 'on',
 'this',
 'gritty',
 'reality',
 'series.']

In [22]:
movies.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,[],United States,"September 25, 2021",2020,PG-13,90 min,[Documentaries],"[As, her, father, nears, the, end, of, his, li..."
1,s2,TV Show,Blood & Water,,"[Ama Qamata, Khosi Ngema, Gail Mabalane]",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"[International TV Shows, TV Dramas, TV Myste...","[After, crossing, paths, at, a, party,, a, Cap..."


In [23]:
# Replace NaN with an empty list and split the remaining values into lists of names
movies['director'] = movies['director'].apply(lambda x: x.split(",")  if pd.notna(x) else [] )

In [24]:
movies.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,[Kirsten Johnson],[],United States,"September 25, 2021",2020,PG-13,90 min,[Documentaries],"[As, her, father, nears, the, end, of, his, li..."
1,s2,TV Show,Blood & Water,[],"[Ama Qamata, Khosi Ngema, Gail Mabalane]",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"[International TV Shows, TV Dramas, TV Myste...","[After, crossing, paths, at, a, party,, a, Cap..."


In [25]:
movies['country'] = movies['country'].apply(lambda x: x.split(",")  if pd.notna(x) else [] )

# Text Camel Casing: Kido Chan -> KidoChan

## Preserve Named Entities and Consistent Tokenization
- In movie data, names (like "Kido Chan") represent specific entities, like actors or characters, and should be treated as single units. When you leave spaces between names, vectorization might split "Kido Chan" into "Kido" and "Chan", treating them as two separate words.
- Merging these into a single token (like "KidoChan") ensures that vectorization doesn’t separate related entities, thus preserving the meaning and making recommendations more accurate.

## Improved Similarity Scoring
- In recommendation systems, having consistent tokens allows for better similarity scoring. If two movies have "Kido Chan" in their cast, but this is split into "Kido" and "Chan" separately, the similarity score may not reflect their commonality as well as it would if "KidoChan" were treated as a single unit.
- This approach helps capture relationships and reduces noise in the vector space, as vectorization will not assign separate weights to "Kido" and "Chan" independently.

## Better Handling of Stop Words
- Some names or entities could include words that are often removed as stop words (e.g., "The Rock"). By camel casing them (e.g., "TheRock"), you avoid unintentionally discarding important tokens that would have been removed as stop words. This way, names retain their full structure and are not affected by stop-word filtering.
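
The effect described in the three sections above can be sketched without any library. The code below imitates the tokenization step only: `TOKEN_PATTERN` is the default `token_pattern` of scikit-learn's `CountVectorizer`, while `STOP_WORDS` is a tiny made-up list standing in for the much longer built-in English list, and `tokenize` is a hypothetical helper.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer: words of 2+ characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Tiny illustrative stop-word list; the real 'english' list is much longer.
STOP_WORDS = {"the", "and", "of"}

def tokenize(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("Kido Chan"))  # ['kido', 'chan']  -> two unrelated tokens
print(tokenize("KidoChan"))   # ['kidochan']      -> one token for the person
print(tokenize("The Rock"))   # ['rock']          -> "the" is dropped as a stop word
print(tokenize("TheRock"))    # ['therock']       -> the full name survives
```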

In [26]:
def remove_space(words):
    l = []
    for i in words:
        l.append(i.replace(" ","") )
    return l
remove_space(["Kido Chan"])

['KidoChan']

In [27]:
movies['cast'] = movies['cast'].apply(remove_space)

In [28]:
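# Note: this overwrites 'cast' with the director names; assign to movies['director'] to clean that column instead.
movies['cast'] = movies['director'].apply(remove_space)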

In [29]:
movies['listed_in'] = movies['listed_in'].apply(remove_space)

In [30]:
movies['country'] = movies['country'].apply(remove_space)

In [31]:
movies.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,[Kirsten Johnson],[KirstenJohnson],[UnitedStates],"September 25, 2021",2020,PG-13,90 min,[Documentaries],"[As, her, father, nears, the, end, of, his, li..."
1,s2,TV Show,Blood & Water,[],[],[SouthAfrica],"September 24, 2021",2021,TV-MA,2 Seasons,"[InternationalTVShows, TVDramas, TVMysteries]","[After, crossing, paths, at, a, party,, a, Cap..."


In [32]:
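# Creates 'tags' from description + country + cast. Note: the chained assignment also
# overwrites 'listed_in' with the same lists, so the genres are not part of the tags.
movies['tags'] = movies['listed_in'] = movies['description'] + movies['country'] + movies['cast']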
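
For comparison, here is a sketch of a tag column that keeps every list column as an input and leaves them all unchanged. It is an assumption about what a fuller combination could look like, not the cell that produced the outputs in this notebook. The one-row `toy` table and all its values are made up.

```python
import pandas as pd

# One made-up row; every cell already holds a list of tokens.
toy = pd.DataFrame({
    "description": [["a", "heist", "story"]],
    "listed_in": [["CrimeTVShows"]],
    "country": [["CountryX"]],
    "cast": [["ActorOne"]],
    "director": [["DirectorOne"]],
})

# Adding columns of lists concatenates the lists row by row.
toy["tags"] = (
    toy["description"] + toy["listed_in"] + toy["country"]
    + toy["cast"] + toy["director"]
)

print(toy.loc[0, "tags"])
# ['a', 'heist', 'story', 'CrimeTVShows', 'CountryX', 'ActorOne', 'DirectorOne']
print(toy.loc[0, "listed_in"])  # ['CrimeTVShows'] (unchanged)
```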

In [33]:
movies.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,tags
0,s1,Movie,Dick Johnson Is Dead,[Kirsten Johnson],[KirstenJohnson],[UnitedStates],"September 25, 2021",2020,PG-13,90 min,"[As, her, father, nears, the, end, of, his, li...","[As, her, father, nears, the, end, of, his, li...","[As, her, father, nears, the, end, of, his, li..."
1,s2,TV Show,Blood & Water,[],[],[SouthAfrica],"September 24, 2021",2021,TV-MA,2 Seasons,"[After, crossing, paths, at, a, party,, a, Cap...","[After, crossing, paths, at, a, party,, a, Cap...","[After, crossing, paths, at, a, party,, a, Cap..."
2,s3,TV Show,Ganglands,[Julien Leclercq],[JulienLeclercq],[],"September 24, 2021",2021,TV-MA,1 Season,"[To, protect, his, family, from, a, powerful, ...","[To, protect, his, family, from, a, powerful, ...","[To, protect, his, family, from, a, powerful, ..."
3,s4,TV Show,Jailbirds New Orleans,[],[],[],"September 24, 2021",2021,TV-MA,1 Season,"[Feuds,, flirtations, and, toilet, talk, go, d...","[Feuds,, flirtations, and, toilet, talk, go, d...","[Feuds,, flirtations, and, toilet, talk, go, d..."
4,s5,TV Show,Kota Factory,[],[],[India],"September 24, 2021",2021,TV-MA,2 Seasons,"[In, a, city, of, coaching, centers, known, to...","[In, a, city, of, coaching, centers, known, to...","[In, a, city, of, coaching, centers, known, to..."


In [34]:
movies.iloc[2]['tags']

['To',
 'protect',
 'his',
 'family',
 'from',
 'a',
 'powerful',
 'drug',
 'lord,',
 'skilled',
 'thief',
 'Mehdi',
 'and',
 'his',
 'expert',
 'team',
 'of',
 'robbers',
 'are',
 'pulled',
 'into',
 'a',
 'violent',
 'and',
 'deadly',
 'turf',
 'war.',
 'JulienLeclercq']

In [35]:
# New DataFrame that keeps only the id, the title, and the concatenated tags
new_df = movies[['show_id','title','tags']]

In [36]:
new_df.head(10)

Unnamed: 0,show_id,title,tags
0,s1,Dick Johnson Is Dead,"[As, her, father, nears, the, end, of, his, li..."
1,s2,Blood & Water,"[After, crossing, paths, at, a, party,, a, Cap..."
2,s3,Ganglands,"[To, protect, his, family, from, a, powerful, ..."
3,s4,Jailbirds New Orleans,"[Feuds,, flirtations, and, toilet, talk, go, d..."
4,s5,Kota Factory,"[In, a, city, of, coaching, centers, known, to..."
5,s6,Midnight Mass,"[The, arrival, of, a, charismatic, young, prie..."
6,s7,My Little Pony: A New Generation,"[Equestria's, divided., But, a, bright-eyed, h..."
7,s8,Sankofa,"[On, a, photo, shoot, in, Ghana,, an, American..."
8,s9,The Great British Baking Show,"[A, talented, batch, of, amateur, bakers, face..."
9,s10,The Starling,"[A, woman, adjusting, to, life, after, a, loss..."


In [37]:
new_df.iloc[2]['tags']

['To',
 'protect',
 'his',
 'family',
 'from',
 'a',
 'powerful',
 'drug',
 'lord,',
 'skilled',
 'thief',
 'Mehdi',
 'and',
 'his',
 'expert',
 'team',
 'of',
 'robbers',
 'are',
 'pulled',
 'into',
 'a',
 'violent',
 'and',
 'deadly',
 'turf',
 'war.',
 'JulienLeclercq']

In [38]:
new_df = new_df.copy()  # explicit copy, so the assignment below does not write to a slice of movies
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))


In [39]:
new_df.iloc[2]['tags']

'To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war. JulienLeclercq'

In [40]:
# Lower Casing
new_df['tags'] = new_df['tags'].apply(lambda x: x.lower() )


In [41]:
new_df.iloc[2]['tags']

'to protect his family from a powerful drug lord, skilled thief mehdi and his expert team of robbers are pulled into a violent and deadly turf war. julienleclercq'

## Overview
- Goal: Build a movie recommendation system that suggests similar movies based on content.
- Approach: Combine movie tags, genres, and descriptions into a single text block (or "paragraph") for each movie. This allows us to find similarities in movie content by analyzing these text blocks.


## Vector Matrix (Word Counts)
- Structure: A matrix whose rows are titles and whose columns are unique words (in this notebook, 8,807 titles x 5,000 words).


## Stemming (play, playing, played) => play

In [42]:
# Stemming reduces inflected forms to a common stem: play, playing, played -> play
import nltk
from nltk.stem import PorterStemmer

In [43]:
ps = PorterStemmer()

In [44]:
def stems(text):
    l = []
    for i in text.split():
        l.append( ps.stem(i) )
        # print(i)
    return " ".join(l)

In [45]:
new_df['tags'] = new_df['tags'].apply(stems)


In [46]:
new_df.iloc[2]['tags']

'to protect hi famili from a power drug lord, skill thief mehdi and hi expert team of robber are pull into a violent and deadli turf war. julienleclercq'

## Converting Text to Numerical Data (Vectorization)
- Problem: Models need numbers, not text.
- Solution: Use CountVectorizer from sklearn to turn words into numerical data.
    - The vectorizer creates a matrix where:
        - Each row represents a movie.
        - Each column represents a unique word from the data.
        - The values in the matrix represent how often each word appears.

In [47]:
# max_features keeps the 5,000 most frequent terms; stop_words='english' drops common English stop words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')

In [48]:
vector = cv.fit_transform(new_df['tags']).toarray()

In [49]:
# Inspect the count matrix (mostly zeros, since each title uses only a few of the 5,000 words)
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [50]:
vector.shape

(8807, 5000)

In [51]:
# len(cv.get_feature_names_out())  # vocabulary size; get_feature_names() was removed in scikit-learn 1.2

In [52]:
# Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

In [53]:
similarity = cosine_similarity(vector)

In [54]:
similarity.shape

(8807, 8807)
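
Each entry of that matrix is the cosine of the angle between two count vectors: their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on three made-up vectors. The helper name `cosine` is an assumption for this example, not part of scikit-learn.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1, 1, 0])  # shares one word with b
b = np.array([1, 0, 1])
c = np.array([0, 0, 2])  # shares no word with a

print(round(cosine(a, a), 3))  # 1.0 (a vector compared with itself)
print(round(cosine(a, b), 3))  # 0.5
print(round(cosine(a, c), 3))  # 0.0 (no words in common)
```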

In [55]:
# Row index of a given title in new_df
new_df[ new_df['title'] == 'The Starling' ].index[0].tolist()

9

In [56]:
def recommend(movie):
    # 1. Find the row index of the requested title (.index[0] takes the first match).
    index = new_df[new_df['title'] == movie].index[0]
    # 2. similarity[index] holds the scores between this title and every other title.
    #    enumerate() pairs each score with its row position; sorting by the score (x[1])
    #    in descending order puts the most similar titles first.
    distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x: x[1])
    # 3. Skip position 0 (normally the title itself, with a score of 1) and print the next five.
    for i in distances[1:6]:
        print(new_df.iloc[i[0]].title)

In [57]:
recommend('The Starling')

My Daddy is in Heaven
God Knows Where I Am
Spivak
Emma' (Mother)
The Bygone
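
A variant that returns the titles instead of printing them, and that copes with an unknown title, could look like the sketch below. It is an illustration under assumptions: `titles`, `sim`, and `top_similar` are hypothetical names, and the 4x4 similarity values are made up.

```python
import numpy as np

titles = ["A", "B", "C", "D"]
# Hypothetical 4x4 similarity matrix (symmetric, 1.0 on the diagonal).
sim = np.array([
    [1.0, 0.2, 0.8, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.8, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

def top_similar(title, k=2):
    """Return the k titles most similar to `title`, or [] if the title is unknown."""
    if title not in titles:
        return []
    idx = titles.index(title)
    order = np.argsort(sim[idx])[::-1]  # row positions from most to least similar
    # Drop the title itself by position rather than assuming it sorts first.
    return [titles[i] for i in order if i != idx][:k]

print(top_similar("A"))  # ['C', 'D']
print(top_similar("Z"))  # []
```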


In [58]:
import pickle

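In [59]:
os.makedirs('artifacts', exist_ok=True)  # create the output folder if it is missing
with open('artifacts/movie_list.pkl', 'wb') as f:
    pickle.dump(new_df, f)
with open('artifacts/similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)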
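
Reading pickled objects back works the same way in reverse, with `pickle.load` and a file opened in binary read mode. The sketch below does a round trip in a temporary folder, so it does not touch the `artifacts` directory. The `payload` dictionary and the file name `example.pkl` are made up for the example.

```python
import os
import pickle
import tempfile

# Small made-up object standing in for the DataFrame and the similarity matrix.
payload = {"titles": ["A", "B"], "similarity": [[1.0, 0.3], [0.3, 1.0]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")
    with open(path, "wb") as f:   # write in binary mode
        pickle.dump(payload, f)
    with open(path, "rb") as f:   # read back in binary mode
        restored = pickle.load(f)

print(restored == payload)  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.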