# Recommending movies with clustering

Based on https://www.kaggle.com/airalex/recommend-movie-with-clustering

Here we use clustering to suggest movies based on one you like. Every movie belongs to one or more genres, and we know who directed it and who the main actors are. We will therefore say that two movies are similar if they share many genres and many actors.

In [4]:
import pandas as pd
import numpy as np

## Reading and cleaning the data

In [5]:
data = pd.read_csv('../../datasets/imdb-5000-movie-dataset/movie_metadata.csv')

to_use = ['genres', 'plot_keywords', 'movie_title', 'actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name', 'imdb_score']
data_use = data[to_use].copy()
data_use['movie_title'] = [i.replace("\xa0","") for i in list(data_use['movie_title'])]

In [6]:
clean_data = data_use.dropna(axis=0)
clean_data = clean_data.drop_duplicates(['movie_title'])
clean_data = clean_data.reset_index(drop=True)

In [7]:
people_list = []
merge_columns = ['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
for i in range(clean_data.shape[0]):
    people_list.append('|'.join(clean_data.iloc[i][col].replace(' ', '_') for col in merge_columns))
clean_data['people'] = people_list

clean_data.head()

,genres,plot_keywords,movie_title,actor_1_name,actor_2_name,actor_3_name,director_name,imdb_score,people
0,Action|Adventure|Fantasy|Sci-Fi,avatar|future|marine|native|paraplegic,Avatar,CCH Pounder,Joel David Moore,Wes Studi,James Cameron,7.9,CCH_Pounder|Joel_David_Moore|Wes_Studi|James_C...
1,Action|Adventure|Fantasy,goddess|marriage ceremony|marriage proposal|pi...,Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski,7.1,Johnny_Depp|Orlando_Bloom|Jack_Davenport|Gore_...
2,Action|Adventure|Thriller,bomb|espionage|sequel|spy|terrorist,Spectre,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes,6.8,Christoph_Waltz|Rory_Kinnear|Stephanie_Sigman|...
3,Action|Thriller,deception|imprisonment|lawlessness|police offi...,The Dark Knight Rises,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan,8.5,Tom_Hardy|Christian_Bale|Joseph_Gordon-Levitt|...
4,Action|Adventure|Sci-Fi,alien|american civil war|male nipple|mars|prin...,John Carter,Daryl Sabara,Samantha Morton,Polly Walker,Andrew Stanton,6.6,Daryl_Sabara|Samantha_Morton|Polly_Walker|Andr...


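As features, we use whether each genre, keyword, or person is associated with the movie: 1 if it is present and 0 otherwise. (In the code below, the people columns are additionally multiplied by 2, so a shared person weighs more than a shared genre or keyword.)
The distance between two movies is therefore small if they have many of these in common.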
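
To make the idea of distance concrete, here is a minimal sketch with three made-up movies and four made-up features. The feature names and the vectors are hypothetical and only illustrate the principle; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical toy features and movies, purely for illustration.
features = ['genres_action', 'genres_drama', 'pp_actor_a', 'pp_actor_b']
film_x = np.array([1, 0, 1, 1])  # action movie with actor A and actor B
film_y = np.array([1, 0, 1, 0])  # action movie with actor A only
film_z = np.array([0, 1, 0, 0])  # drama with neither actor

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

d_xy = euclidean(film_x, film_y)  # the vectors differ in one feature
d_xz = euclidean(film_x, film_z)  # the vectors differ in all four features
print(d_xy, d_xz)  # 1.0 2.0
```

Movies X and Y share a genre and an actor and end up close to each other, while X and Z share nothing and end up further apart. K-Means groups points using this kind of (squared) Euclidean distance.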

In [8]:
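from sklearn.feature_extraction.text import CountVectorizer

def token(text):
    # Each field is a '|'-separated list, so split on '|' instead of on whitespace.
    return text.split('|')

# token_pattern=None silences the warning that the default pattern is unused with a custom tokenizer.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
cv_kw = CountVectorizer(max_features=100, tokenizer=token, token_pattern=None)
keywords = cv_kw.fit_transform(clean_data['plot_keywords'])
keywords_list = ['kw_' + i for i in cv_kw.get_feature_names_out()]

cv_ge = CountVectorizer(tokenizer=token, token_pattern=None)
genres = cv_ge.fit_transform(clean_data['genres'])
genres_list = ['genres_' + i for i in cv_ge.get_feature_names_out()]

cv_pp = CountVectorizer(max_features=100, tokenizer=token, token_pattern=None)
people = cv_pp.fit_transform(clean_data['people'])
people_list = ['pp_' + i for i in cv_pp.get_feature_names_out()]

# toarray() gives plain ndarrays; recent scikit-learn versions reject the np.matrix returned by todense().
# The people columns are multiplied by 2, which makes a shared person weigh more than a shared genre or keyword.
cluster_data = np.hstack([keywords.toarray(), genres.toarray(), people.toarray() * 2])
criterion_list = keywords_list + genres_list + people_list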

## Clustering with K-Means

In [9]:
from sklearn.cluster import KMeans

mod = KMeans(n_clusters=100)
category = mod.fit_predict(cluster_data)
category_dataframe = pd.DataFrame({'category': category}, index=clean_data['movie_title'])

In [10]:
clean_data.iloc[list(category_dataframe['category'] == 0)][['genres', 'movie_title', 'people']]

,genres,movie_title,people
379,Biography|Drama|Sport,Cinderella Man,Paddy_Considine|Bruce_McGill|Rosemarie_DeWitt|...
491,Biography|Drama,A Beautiful Mind,Adam_Goldberg|Austin_Pendleton|Judd_Hirsch|Ron...
640,Biography|Drama|Sport|War,Unbroken,Finn_Wittrock|Jack_O'Connell|Alex_Russell|Ange...
1158,Biography|Drama,The Social Network,Andrew_Garfield|Dustin_Fitzsimons|Marcella_Len...
1165,Biography|Drama|Romance,Julie & Julia,Meryl_Streep|Mary_Lynn_Rajskub|Vanessa_Ferlito...
1173,Biography|Drama|Music,Ray,Bokeem_Woodbine|Curtis_Armstrong|Harry_Lennix|...
1218,Biography|Drama|Music|Musical,The Doors,Michael_Wincott|Kevin_Dillon|Kathleen_Quinlan|...
1230,Biography|Crime|Drama|Music,Get Rich or Die Tryin',50_Cent|Bill_Duke|Marc_John_Jefferies|Jim_Sher...
1260,Biography|Drama|Romance,Big Miracle,Ted_Danson|Tim_Blake_Nelson|Andrew_Daly|Ken_Kw...
1292,Biography|Drama|Romance|Sport,Against the Ropes,Omar_Epps|Charles_S._Dutton|Tim_Daly|Charles_S...


## Recommending movies

We can use the `recommend` function to get movies similar to the one we pass in. The titles are returned sorted by IMDb score, highest first.

In [11]:
def recommend(movie_name, recommend_number=5):
    """Return up to `recommend_number` titles from the same cluster, highest IMDb score first."""
    if movie_name not in set(clean_data['movie_title']):
        print("Can't find this movie!")
        return None
    movie_cluster = category_dataframe.loc[movie_name, 'category']
    # Boolean mask over the rows; category_dataframe has the same row order as clean_data.
    in_cluster = (category_dataframe['category'] == movie_cluster).to_numpy()
    score = clean_data.loc[in_cluster, ['imdb_score', 'movie_title']]
    sort_score = score.sort_values('imdb_score', ascending=False)
    # Do not recommend the movie we started from.
    sort_score = sort_score[sort_score['movie_title'] != movie_name]
    return list(sort_score['movie_title'].head(recommend_number))

In [12]:
recommend('Harry Potter and the Prisoner of Azkaban', 10)

['Monty Python and the Holy Grail',
 'The Princess Bride',
 'The Hobbit: The Desolation of Smaug',
 'The Hobbit: An Unexpected Journey',
 'Stardust',
 'Harry Potter and the Goblet of Fire',
 'The Hobbit: The Battle of the Five Armies',
 "Harry Potter and the Sorcerer's Stone",
 'Harry Potter and the Half-Blood Prince',
 'Harry Potter and the Order of the Phoenix']

**Exercise:** Do you think sorting by IMDb rating is the best way to order the output? Can you think of an alternative?

Solution: for example, sort by distance to the movie within its cluster.
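
One possible way to implement the distance-based ordering is sketched below. It is self-contained and uses made-up data: `titles`, `features` and `labels` are hypothetical stand-ins for the movie titles, `cluster_data` and the K-Means output from the notebook, and `recommend_by_distance` is a name chosen for this illustration.

```python
import numpy as np

# Hypothetical toy data standing in for movie titles and cluster_data.
titles = ['A', 'B', 'C', 'D', 'E', 'F']
features = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
], dtype=float)
# Hard-coded cluster labels, standing in for the output of KMeans.fit_predict.
labels = np.array([0, 0, 0, 1, 1, 1])

def recommend_by_distance(title, titles, features, labels, n=5):
    """Return up to n titles from the same cluster, nearest first."""
    idx = titles.index(title)
    # Indices of the other movies in the same cluster.
    same = np.flatnonzero(labels == labels[idx])
    same = same[same != idx]
    # Euclidean distance from the chosen movie to each cluster member.
    dist = np.linalg.norm(features[same] - features[idx], axis=1)
    order = same[np.argsort(dist, kind='stable')]
    return [titles[i] for i in order[:n]]

print(recommend_by_distance('A', titles, features, labels))  # ['B', 'C']
```

With this ordering, the movies that share the most features with the chosen one come first, regardless of their rating. A combination of the two criteria, for example distance first and IMDb score as a tie-breaker, would also be possible.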