# Determine Movie Genre by Neighboring Movies

Using the k-nearest neighbors method, predict a target movie's genre from the genres of the top-k movies most similar to it.

Measure similarity between two movies as the Jaccard similarity of their casts (the sets of actors in each movie), rank all candidate movies by that score, and keep the k highest-scoring ones.
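As a minimal sketch of the idea described above (not code from this notebook), the ranking and vote could look like the following. The function names `jaccard` and `predict_genre`, the dictionary keys `cast` and `genres`, and the toy movies are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_genre(target_cast, movies, k=3):
    """Return the most common genre among the k movies whose casts best match target_cast."""
    # Rank every candidate by cast similarity to the target, most similar first.
    ranked = sorted(movies, key=lambda m: jaccard(target_cast, m["cast"]), reverse=True)
    # Each of the top-k neighbors votes once for each of its genres.
    votes = Counter(genre for m in ranked[:k] for genre in m["genres"])
    return votes.most_common(1)[0][0] if votes else None


# Toy data (hypothetical, not taken from netflix_shows.csv).
movies = [
    {"title": "A", "cast": {"Ann", "Bob", "Cy"}, "genres": ["Drama"]},
    {"title": "B", "cast": {"Ann", "Bob"}, "genres": ["Drama", "Mystery"]},
    {"title": "C", "cast": {"Dee", "Eli"}, "genres": ["Comedy"]},
    {"title": "D", "cast": {"Ann", "Eli"}, "genres": ["Comedy"]},
]

target_cast = {"Ann", "Bob", "Flo"}
print(predict_genre(target_cast, movies, k=2))
```

With the dataset used here, the `cast` and `listed_in` columns would first need to be split on commas into sets and lists, and rows with a missing cast would need to be skipped or handled explicitly.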

In [145]:
import json
import pandas as pd
import numpy as np
from scipy.sparse import lil_matrix
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


In [146]:
df = pd.read_csv("netflix_shows.csv")   
df.head()

Unnamed: 0,title,cast,country,date_added,release_year,rating,duration,listed_in,description
0,Blood & Water,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,24-Sep-21,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
1,Ganglands,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",France,24-Sep-21,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
2,Jailbirds New Orleans,,United States,24-Sep-21,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
3,Kota Factory,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,24-Sep-21,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
4,Midnight Mass,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",United States,24-Sep-21,2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...


In [147]:
df['date_added'] = pd.to_datetime(df['date_added'])


In [148]:
filtered_df = df[df['release_year'].isin([2020, 2021])]
filtered_df

Unnamed: 0,title,cast,country,date_added,release_year,rating,duration,listed_in,description
0,Blood & Water,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
1,Ganglands,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",France,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
2,Jailbirds New Orleans,,United States,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
3,Kota Factory,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
4,Midnight Mass,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",United States,2021-09-24,2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
...,...,...,...,...,...,...,...,...,...
1190,BoJack Horseman,"Will Arnett, Aaron Paul, Amy Sedaris, Alison B...",United States,2019-10-25,2020,TV-MA,6 Seasons,TV Comedies,Meet the most beloved sitcom horse of the '90s...
1213,The Hook Up Plan,"Marc Ruchmann, Zita Hanrot, Sabrina Ouazani, J...",France,2019-10-11,2020,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...","When Parisian Elsa gets hung up on her ex, her..."
2537,Super Wings,"Luca Padovan, Evan Smolin, Junah Jang, Colin C...","United States, South Korea, China",2020-12-01,2020,TV-Y,3 Seasons,"Kids' TV, Korean TV Shows",A cheerful jet and his transforming pals striv...
2539,Surviving R. Kelly Part II: The Reckoning,R. Kelly,United States,2020-04-13,2020,TV-MA,1 Season,"Crime TV Shows, Docuseries",As more women come forward with harrowing accu...


In [149]:
shows_count = filtered_df['release_year'].value_counts()
print(shows_count)

release_year
2020    435
2021    314
Name: count, dtype: int64


In [150]:
most_frequent_country = filtered_df['country'].mode()[0]
most_frequent_country

'United States'

In [151]:
X = df[['release_year']]
y = df[['country']]

In [152]:
model = DecisionTreeClassifier(max_depth = 1)
model.fit(X, y)

In [153]:
future_years = [2022, 2023, 2024, 2025]
future_countries = model.predict(pd.DataFrame(future_years, columns=['release_year']))

predicted_counts = pd.Series(future_countries).value_counts()

most_frequent_country = predicted_counts.idxmax()
print("Predicted country with the most possible Netflix show releases:", most_frequent_country)


Predicted country with the most possible Netflix show releases: United States


In [154]:
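import random

# Row labels of the 2020-2021 shows (DataFrame index values, not country names).
all_countries = list(filtered_df.index)
# Share of rows used for training; named so it does not shadow sklearn's train_test_split.
train_fraction = 0.8
split_index = int(train_fraction * len(all_countries))
# Shuffle in place so the train/test split is random.
random.shuffle(all_countries)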

In [155]:
train_countries = all_countries[:split_index]
test_countries = all_countries[split_index:]

In [156]:
X_train = X.loc[train_countries]
y_train = y.loc[train_countries]

X_test = X.loc[test_countries]
y_test = y.loc[test_countries]
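The manual split above shuffles without a fixed seed, so the train and test rows change on every run. A minimal sketch of a reproducible variant is shown below; the helper name `split_indices` and its `seed` parameter are assumptions made for illustration and are not part of the notebook.

```python
import random


def split_indices(indices, train_fraction=0.8, seed=0):
    """Return (train, test) lists from a seeded shuffle of the given row labels."""
    shuffled = list(indices)               # copy so the caller's sequence is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, so the global random state is unchanged
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy usage with ten row labels.
train_rows, test_rows = split_indices(range(10))
print(len(train_rows), len(test_rows))
```

In the notebook this would presumably be called as `split_indices(filtered_df.index)`, with the two returned lists passed to `X.loc[...]` and `y.loc[...]` as before.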

In [157]:
# model was fit on every row of df above, so these test rows were also seen during training.
y_predict = model.predict(X_test)
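The predictions are computed here but never scored. One way to score them, sketched under the assumption that plain accuracy is the metric of interest, is to compare the model against a baseline that always predicts the most frequent training label. The helpers `accuracy` and `majority_baseline` and the toy labels are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    pairs = list(zip(y_true, y_pred))
    return sum(t == p for t, p in pairs) / len(pairs)


def majority_baseline(y_train, n):
    """Predict the most frequent training label for each of n test rows."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n


# Toy labels (hypothetical, not results from this notebook).
y_train_toy = ["United States", "United States", "Japan", "India"]
y_test_toy = ["United States", "Japan", "United States", "France"]
y_pred_toy = ["United States"] * 4

print("model accuracy:   ", accuracy(y_test_toy, y_pred_toy))
print("baseline accuracy:", accuracy(y_test_toy, majority_baseline(y_train_toy, len(y_test_toy))))
```

With the notebook's objects the call would presumably be `accuracy(y_test['country'], y_predict)`. If the model scores no better than the baseline, then `release_year` alone is not adding information about the country.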

In [163]:
future_data = filtered_df[['release_year']]
future_pred = model.predict(future_data)
future_pred

array(['United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', 'United States', 'United States', 'United States',
       'United States', '

In [159]:
pd.DataFrame(y_train.value_counts())

country,count
United States,241
South Korea,48
United Kingdom,38
Japan,28
India,26
...,...
Lebanon,1
"Norway, Denmark",1
"Poland, United States",1
Russia,1


In [160]:
pd.DataFrame(y_test.value_counts())

country,count
United States,64
Japan,8
India,8
South Korea,8
United Kingdom,6
France,5
Spain,5
Turkey,4
Germany,4
Brazil,3


In [174]:
# Actual country of each 2020-2021 show, used as the row label for its prediction.
test_countries = filtered_df['country'].tolist()

prediction_df = pd.DataFrame(future_pred, index=test_countries, columns=['Predicted_Country'])
prediction_df.index.name = 'Test_Country'
prediction_df[:20]

Test_Country,Predicted_Country
South Africa,United States
France,United States
United States,United States
India,United States
United States,United States
United Kingdom,United States
United Kingdom,United States
Thailand,United States
India,United States
United States,United States
