## Movie Recommender Systems
Recommender systems are systems whose main task is to give the user a recommendation of a potentially interesting item, in this case a film. To do so, the system must predict the utility of each item, compare those utilities, and select the best items based on that comparison. By the way they produce recommendations, recommender systems can be:
- Collaborative filtering: recommend films based on the ratings that other users with similar interests gave to individual films
- Content-based: recommend films whose content is similar to a given film
- Hybrid: a combination of the methods from the previous two

Loading the required packages

In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval
from scipy.sparse.linalg import svds
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import warnings; warnings.simplefilter('ignore')

### Collaborative Filtering
Collaborative recommender systems recommend films that were interesting to users with similar interests (users with similar taste). The similarity in taste between two users is computed from the items they chose or rated in the past. The underlying assumption is that users who had similar interests in the past will have similar interests in the future.
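
As a rough illustration of "similar taste", the sketch below computes cosine similarity between users' rating vectors on a tiny made-up matrix. The matrix values and the choice of cosine similarity are assumptions for illustration only; this notebook itself relies on matrix factorization (SVD) rather than explicit user-to-user similarity.

```python
import numpy as np

# Toy user x movie rating matrix (0 = not rated); all values are made up.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0
    [4.0, 5.0, 0.0, 0.0],   # user 1: taste close to user 0
    [0.0, 1.0, 5.0, 4.0],   # user 2: different taste
])

def cosine(a, b):
    """Cosine similarity between two 1-D rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_01 = cosine(toy_ratings[0], toy_ratings[1])
sim_02 = cosine(toy_ratings[0], toy_ratings[2])
print(sim_01 > sim_02)
```

Under this measure, user 1 would be a better source of recommendations for user 0 than user 2.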

We load the datasets that we will use

In [2]:
# We use the smaller ratings set; the full set is too large to process here
ratings = pd.read_csv('input/ratings.csv')
# Also load the movie metadata
movies = pd.read_csv('input/movies_metadata.csv')

In [3]:
# Keep only the first 5000 users, because the full ratings set is too large
ratings['userId'] = ratings['userId'].astype('int64')
ratings = ratings[ratings['userId'] <= 5000]
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


In [4]:
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


Now we clean the loaded datasets: we drop the rows that do not fit and cast columns to suitable types

In [5]:
# Three films have an id that looks like a date; we will drop them to make the data easier to handle
for i in movies.id:
    if "-" in i:
        print(i)

1997-08-20
2012-09-29
2014-01-01


In [6]:
# Drop them
movies = movies.drop(movies[movies.id.str.contains("-")].index)

In [7]:
# Cast the id column of movies to a 64-bit integer; same for movieId in ratings
movies['id'] = movies['id'].astype('int64')
ratings['movieId'] = ratings['movieId'].astype('int64')

In [8]:
# For the genres column we want only the genre names
movies['genres'] = movies['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x])
# We want only the year, not the full date; dates that cannot be parsed become NaN
movies['year'] = pd.to_datetime(movies['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)
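
A note on missing-value checks: a comparison such as `x != np.nan` cannot detect missing values, because NaN compares unequal to everything, itself included. A dedicated check (in pandas, `pd.notnull` or `pd.isna`) is the usual choice. The sketch below shows the effect with plain Python floats; `is_missing` is a hypothetical helper introduced only for illustration.

```python
import math

def is_missing(value):
    """Return True for a float NaN (hypothetical helper for illustration)."""
    return isinstance(value, float) and math.isnan(value)

nan = float("nan")

# NaN is never equal to anything, so this comparison is True even for NaN itself.
naive_check = (nan != nan)

# An explicit NaN test gives the intended answer.
robust_check = is_missing(nan)
print(naive_check, robust_check, is_missing(1995.0))
```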

We build a user-by-movie matrix from the ratings DataFrame

In [9]:
Ratings = ratings.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)
Ratings.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,175475,175579,175589,175773,175775,175777,175779,175945,176165,176271
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
Ratings.shape

(5000, 15443)

In [11]:
R = Ratings.to_numpy()  # as_matrix() was removed in pandas 1.0

SVD

In [12]:
# Apply truncated SVD to the Ratings matrix (used as is, without normalization); compute the 50 largest singular values and the corresponding vectors
U, sigma, Vt = svds(R, k = 50)

In [13]:
sigma = np.diag(sigma)

In [14]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt)
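
The SVD above is applied to the zero-filled ratings matrix as is. A common variant centers each user's ratings first and adds the mean back after reconstruction. The sketch below shows that variant on a made-up 4x4 matrix. It uses `np.linalg.svd` with manual truncation in place of `svds`, and, as a simplification, the user mean includes the zero-filled entries; both are assumptions of this sketch, not steps taken in the notebook.

```python
import numpy as np

# Toy user x movie matrix (0 = not rated); values are made up.
toy = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Center each user's row on that user's mean rating.
user_means = toy.mean(axis=1, keepdims=True)
centered = toy - user_means

# Full SVD; np.linalg.svd returns singular values in descending order.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep only the k largest singular values and add the user means back.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_means

# Sanity check: keeping every singular value reproduces the original matrix.
full = U @ np.diag(s) @ Vt + user_means
print(np.allclose(full, toy))
```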

In [15]:
# Build a DataFrame from the resulting matrix; these are our predictions
predictions = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)
predictions.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,175475,175579,175589,175773,175775,175777,175779,175945,176165,176271
0,-0.264379,0.030233,0.113729,0.006246,0.201768,-0.318568,0.169156,0.016351,0.035743,-0.298139,...,-0.00021,-0.01619,0.006745,0.006745,0.006745,0.006745,0.006745,-0.010019,-0.013309,-0.003263
1,1.380755,0.232886,0.757812,0.023073,0.705719,0.758108,0.921725,0.024023,0.236622,0.340099,...,-0.000228,-0.000806,0.001104,0.001104,0.001104,0.001104,0.001104,-0.001431,0.003256,-0.001041
2,0.061426,0.177972,0.029013,0.003471,-0.026881,-0.139021,0.016289,0.007653,-0.000161,-0.204009,...,-0.000142,0.005687,0.002141,0.002141,0.002141,0.002141,0.002141,-0.002818,-0.007014,-0.003172
3,0.859852,0.388358,0.110531,-0.040485,0.184659,0.179706,0.155032,0.033255,-0.027188,0.271105,...,-0.000414,0.005428,0.000723,0.000723,0.000723,0.000723,0.000723,-0.003162,-0.009497,-0.005963
4,0.604004,0.135514,0.075281,0.010332,0.176027,-0.335847,0.176959,0.060572,0.004275,-0.153369,...,-9.8e-05,0.000385,0.001641,0.001641,0.001641,0.001641,0.001641,-0.005717,-0.008134,0.006054


#### We recommend the films with the highest predicted values that the user has not yet seen, as follows:
 - first we exclude all films the user has already seen
 - then we merge the movie dataset with the sorted predictions on the movie id
 - then we rename the added column and sort in descending order, so that the films with the highest predicted rating are at the top
 - finally, we take as many films as we want
 
The function returns the data on the films the user has already rated, together with the recommendations
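
The steps above can be sketched on a toy example. The movie ids, scores and the helper name `top_unseen` are made up for illustration; the notebook's own function works on DataFrames instead of arrays.

```python
import numpy as np

movie_ids = np.array([10, 20, 30, 40, 50])        # hypothetical movie ids
predicted = np.array([0.9, 0.2, 1.4, 0.7, 1.1])   # one user's predicted scores
already_rated = {30, 40}                          # ids the user has already rated

def top_unseen(movie_ids, predicted, already_rated, n):
    """Return the n unseen (movie id, score) pairs with the highest score."""
    unseen = np.array([m not in already_rated for m in movie_ids])
    candidate_ids = movie_ids[unseen]
    candidate_scores = predicted[unseen]
    order = np.argsort(-candidate_scores)[:n]     # descending by score
    return [(int(candidate_ids[i]), float(candidate_scores[i])) for i in order]

recommended = top_unseen(movie_ids, predicted, already_rated, n=2)
print(recommended)
```

Movie 30 has the highest score overall, but it is skipped because the user has already rated it.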

In [16]:
def recommend_movies(predictions, userID, movies, ratings, num_recommendations):
    
    user_row_number = userID - 1  # user ids start at 1, row positions start at 0
    
    # Sort the values of this user's row in the predictions matrix
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False) 
    
    # Take the user's ratings and join them with the movie info
    user_data = ratings[ratings.userId == (userID)]
    user_full = (user_data.merge(movies, how = 'inner', left_on = 'movieId', right_on = 'id').
                     sort_values(['rating'], ascending=False))
    
    # Recommend the films with the highest predicted values that the user has not yet rated
    recommendations = (movies[~movies['id'].isin(user_full['movieId'])].
                       merge(pd.DataFrame(sorted_user_predictions).reset_index(), 
                             how = 'left', left_on = 'id', right_on = 'movieId').
                       rename(columns = {user_row_number: 'predictions'}).
                       sort_values('predictions', ascending = False).
                       iloc[:num_recommendations])
    
    return user_full, recommendations[['title', 'genres', 'vote_average', 'vote_count', 'predictions']]
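
The merges in `recommend_movies` join `ratings.movieId` directly against `movies.id`. Later in this notebook, `movies['id']` is matched against the `tmdbId` column of `links_small`, which suggests that the two id columns may follow different numbering schemes. If that is the case, the ratings would need to be translated through a links table before the join. A minimal sketch of that translation, with made-up rows and hypothetical names (`links_rows`, `movie_to_tmdb`):

```python
# Hypothetical rows from a links table: (movieId, tmdbId).
links_rows = [(1, 862), (2, 8844), (3, 15602)]
movie_to_tmdb = {movie_id: tmdb_id for movie_id, tmdb_id in links_rows}

# Hypothetical user ratings as (movieId, rating); movieId 99 has no link.
user_ratings = [(1, 4.0), (3, 2.5), (99, 5.0)]

# Translate each rating to the other id scheme, dropping unmapped movies.
translated = [
    (movie_to_tmdb[movie_id], rating)
    for movie_id, rating in user_ratings
    if movie_id in movie_to_tmdb
]
print(translated)
```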

Testing: we request 20 recommendations for user 99 from the given datasets

In [17]:
already_rated, recommendations = recommend_movies(predictions, 99, movies, ratings, 20)

In [18]:
recommendations

Unnamed: 0,title,genres,vote_average,vote_count,predictions
897,2001: A Space Odyssey,"[Science Fiction, Mystery, Adventure]",7.9,3075.0,1.356063
5321,Men in Black II,"[Action, Adventure, Comedy, Science Fiction]",6.1,3188.0,1.317915
4744,Donnie Darko,"[Fantasy, Drama, Mystery]",7.7,3574.0,1.288236
1807,Armageddon,"[Action, Thriller, Science Fiction, Adventure]",6.5,2540.0,1.242371
3656,Shaft in Africa,"[Adventure, Action, Thriller, Crime, Mystery]",5.4,6.0,1.069367
474,Judgment Night,"[Action, Thriller, Crime]",6.4,79.0,1.002592
495,Mrs. Doubtfire,"[Comedy, Drama, Family]",7.0,1638.0,0.976028
10511,Jarhead,"[Drama, War]",6.6,776.0,0.927562
12941,Shadows in Paradise,"[Drama, Comedy]",7.1,35.0,0.924316
2050,Rosemary's Baby,"[Horror, Drama, Mystery]",7.5,892.0,0.872552


In [19]:
already_rated

Unnamed: 0,userId,movieId,rating,timestamp,adult,belongs_to_collection,budget,genres,homepage,id,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year
0,99,260,5.0,866605992,False,,0,"[Action, Thriller, Mystery]",,260,...,0.0,86.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Handcuffed to the girl who double-crossed him,The 39 Steps,False,7.4,217.0,1935
1,99,648,4.0,866605941,False,,0,"[Drama, Fantasy, Romance]",,648,...,0.0,96.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,Once upon a time,Beauty and the Beast,False,7.8,133.0,1946
4,99,802,4.0,866606031,False,,2000000,"[Drama, Romance]",,802,...,9250000.0,153.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,How did they ever make a movie of ...,Lolita,False,7.3,409.0,1962
2,99,780,3.0,866605941,False,,0,"[Drama, History]",,780,...,0.0,110.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,An Immortal Screen Classic that will live Fore...,The Passion of Joan of Arc,False,8.2,159.0,1928
3,99,786,3.0,866605992,False,,60000000,"[Drama, Music]",,786,...,47383689.0,122.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Experience it. Enjoy it. Just don't fall for it.,Almost Famous,False,7.4,807.0,2000
5,99,1073,3.0,866605992,False,,21500000,"[Drama, Thriller, Mystery]",,1073,...,0.0,117.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Your Paranoia Is Real.,Arlington Road,False,7.0,246.0,1999


#### Function for the top n films
- We take each film's average rating and its number of votes from the movies dataset
- We consider only films that have more votes than 80% of the films
- We compute the rating with a weighted-rating formula (found online)
- Finally, we sort by that rating and take the first n
- If a genre is given, the first n from that genre are taken
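
The weighted rating used in the next cells has the form WR = v/(v+m) * R + m/(v+m) * C, where v is the film's vote count, R its average rating, m the vote-count threshold and C the mean rating across all films. The sketch below evaluates it for two made-up films; the values of C and m and the scalar helper `weighted_rating_scalar` are assumptions for illustration.

```python
def weighted_rating_scalar(v, R, C, m):
    """Blend a film's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 6.0   # hypothetical global mean rating
m = 100   # hypothetical vote-count threshold

# A film with few votes is pulled halfway toward the global mean.
few_votes = weighted_rating_scalar(v=100, R=9.0, C=C, m=m)

# A film with many votes stays close to its own average.
many_votes = weighted_rating_scalar(v=9900, R=8.0, C=C, m=m)
print(few_votes, many_votes)
```

With these numbers, the film rated 9.0 by only 100 voters ends up below the film rated 8.0 by 9900 voters, which is the intended effect of the formula.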


In [20]:
# Compute the weighted rating from vote count v, average R, global mean C and threshold m
def weighted_rating(x, C, m):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [21]:
def top_n_movies(movies, n=250, genre='', percentile=0.8):
    
    # If a genre is given, consider only films of that genre
    if genre != '':
        s = movies.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
        s.name = 'genre'
        gen_movies = movies.drop('genres', axis=1).join(s)
        top_movies = gen_movies[gen_movies['genre'] == genre]
    else:
        top_movies = movies
    
    # Compute the mean rating C across films and the vote-count threshold m
    vote_counts = top_movies[top_movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = top_movies[top_movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    # Keep only the films that qualify
    qualified = top_movies[(top_movies['vote_count'] >= m) & (top_movies['vote_count'].notnull()) & (top_movies['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    

    qualified['wr'] = weighted_rating(qualified, C, m)
    
    qualified = qualified.sort_values('wr', ascending=False).head(n)
    
    return qualified

We test this first for all genres, and then for a specific genre

In [22]:
top_n_movies(movies)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457,8.735928
15480,Inception,2010,14075,8,29.1081,7.990247
12481,The Dark Knight,2008,12269,8,123.167,7.988818
22879,Interstellar,2014,11187,8,32.2135,7.987741
2843,Fight Club,1999,9678,8,63.8696,7.985839
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.0707,7.984595
292,Pulp Fiction,1994,8670,8,140.95,7.984202
314,The Shawshank Redemption,1994,8358,8,51.6454,7.983616
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.3244,7.983355
351,Forrest Gump,1994,8147,8,48.3072,7.983194


Top 20 by genre

In [23]:
top_n_movies(movies, 20, 'Thriller')

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
15480,Inception,2010,14075,8,29.1081,7.973053
12481,The Dark Knight,2008,12269,8,123.167,7.969131
292,Pulp Fiction,1994,8670,8,140.95,7.956516
46,Se7en,1995,5915,8,18.4574,7.936721
24860,The Imitation Game,2014,5895,8,31.5959,7.936511
586,The Silence of the Lambs,1991,4549,8,4.30722,7.918275
11354,The Prestige,2006,4510,8,16.9456,7.917589
289,Leon: The Professional,1994,4293,8,20.4773,7.913552
4099,Memento,2000,4168,8,15.4508,7.911042
1213,The Shining,1980,3890,8,19.6116,7.904901


### Content Based Filtering
Content-based recommender systems try to recommend films similar to a given film. Research on these systems is rooted in information retrieval and information filtering.

We need the credits, keywords and links_small datasets

In [24]:
credits = pd.read_csv('input/credits.csv')
keywords = pd.read_csv('input/keywords.csv')
links_small = pd.read_csv('input/links_small.csv')

In [25]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [26]:
# Merge the movies dataset with the new datasets
movies = movies.merge(credits, on='id')
movies = movies.merge(keywords, on='id')
movies.shape

(46628, 28)

We consider only the films listed in links_small. We do this only to keep the matrices at a manageable size.

In [27]:
smovies = movies[movies['id'].isin(links_small)].copy()  # copy so that later column assignments do not act on a view
smovies.shape

(9219, 28)

We extract the first 5 cast members, the director (listed 5 times, so that it carries weight comparable to the cast), and the keywords

In [28]:
smovies['cast'] = smovies['cast'].apply(literal_eval)
smovies['crew'] = smovies['crew'].apply(literal_eval)
smovies['keywords'] = smovies['keywords'].apply(literal_eval)

In [29]:
# Extract the director's name from the crew list
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [30]:
smovies['director'] = smovies['crew'].apply(get_director)
smovies['director'] = smovies['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smovies['director'] = smovies['director'].apply(lambda x: [x,x,x,x,x])

In [31]:
# Keep only the first five cast members
smovies['cast'] = smovies['cast'].apply(lambda x: [i['name'] for i in x])
smovies['cast'] = smovies['cast'].apply(lambda x: x[:5] if len(x) >= 5 else x)
smovies['cast'] = smovies['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [32]:
smovies['keywords'] = smovies['keywords'].apply(lambda x: [i['name'] for i in x])

In [33]:
smovies.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,year,cast,crew,keywords,director
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,,Toy Story,False,7.7,5415.0,1995,"[tomhanks, timallen, donrickles, jimvarney, wa...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy, friendship, friends, riva...","[johnlasseter, johnlasseter, johnlasseter, joh..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995,"[robinwilliams, jonathanhyde, kirstendunst, br...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[board game, disappearance, based on children'...","[joejohnston, joejohnston, joejohnston, joejoh..."


We drop the keywords that appear only once, to improve the predictions
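
The idea can be sketched with `collections.Counter` on a few made-up keyword lists; the cells below do the same on the real data with pandas.

```python
from collections import Counter

# Made-up keyword lists, one per film.
movie_keywords = [
    ["toy", "friendship", "rivalry"],
    ["boardgame", "friendship"],
    ["toy", "jealousy"],
]

# Count how often each keyword occurs across all films.
counts = Counter(kw for kws in movie_keywords for kw in kws)

# Keep only keywords that occur more than once.
frequent = {kw for kw, c in counts.items() if c > 1}
filtered = [[kw for kw in kws if kw in frequent] for kws in movie_keywords]
print(filtered)
```

A keyword that occurs in a single film cannot link that film to any other, so removing it only reduces noise and vocabulary size.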

In [34]:
s = smovies.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [35]:
s = s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

In [36]:
s = s[s > 1]

In [37]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [38]:
smovies['keywords'] = smovies['keywords'].apply(filter_keywords)
smovies['keywords'] = smovies['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
smovies.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,year,cast,crew,keywords,director
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,,Toy Story,False,7.7,5415.0,1995,"[tomhanks, timallen, donrickles, jimvarney, wa...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy, friendship, friends, riva...","[johnlasseter, johnlasseter, johnlasseter, joh..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995,"[robinwilliams, jonathanhyde, kirstendunst, br...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[boardgame, disappearance, basedonchildren'sbo...","[joejohnston, joejohnston, joejohnston, joejoh..."


In [39]:
smovies['genres'] = smovies['genres'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

We now combine all of this data into a single column, tags, which we use for prediction

In [40]:
smovies['tags'] = smovies['keywords'] + smovies['cast'] + smovies['director'] + smovies['genres']
smovies['tags'] = smovies['tags'].apply(lambda x: ' '.join(x))

In [41]:
# Check what it looks like
smovies['tags'][0]

'jealousy toy boy friendship friends rivalry boynextdoor newtoy toycomestolife tomhanks timallen donrickles jimvarney wallaceshawn johnlasseter johnlasseter johnlasseter johnlasseter johnlasseter animation comedy family'

We convert the tag strings into a sparse matrix of token counts and compute the pairwise similarity

In [42]:
count = CountVectorizer(analyzer='word')
count_matrix = count.fit_transform(smovies['tags'])

In [43]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)
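
To see what the count vectors and cosine similarity do, and why the director token is repeated five times, here is a small pure-Python stand-in. It splits on whitespace, which is a simplification of `CountVectorizer`'s tokenization, and the tag strings are made up for illustration.

```python
import math
from collections import Counter

def cosine_from_tags(tags_a, tags_b):
    """Cosine similarity between the token-count vectors of two tag strings."""
    a, b = Counter(tags_a.split()), Counter(tags_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two films that share only the director (made-up tags).
film_a = "toy friendship tomhanks johnlasseter"
film_b = "bug adventure davefoley johnlasseter"

# Director listed once versus five times in each film's tags.
sim_once = cosine_from_tags(film_a, film_b)
extra = " johnlasseter" * 4
sim_repeated = cosine_from_tags(film_a + extra, film_b + extra)
print(sim_once, sim_repeated)
```

Repeating the token raises the similarity of the two films from 0.25 to about 0.89, so a shared director outweighs a single shared keyword or actor.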

We build a Series that maps each title to its row position in smovies

In [44]:
smovies = smovies.reset_index()
titles = pd.Series(smovies.index, index=smovies['title'])

### Function that recommends films based on a given film:

 - first we look up the index of the given film
 - then we take that film's row in the similarity matrix and sort it
 - we take the 25 most similar films
 - after that, we extract their indices
 - finally, we call top_n_movies, which returns the films that qualify
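
The lookup of the most similar rows can be sketched with NumPy on a made-up similarity matrix. The helper `most_similar` is hypothetical; unlike the function below, which skips the first entry of the sorted list, it removes the query film by index, which also behaves well if another film ties with it at similarity 1.0.

```python
import numpy as np

# Made-up symmetric similarity matrix for four films.
sim = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.9],
    [0.4, 0.3, 0.9, 1.0],
])

def most_similar(sim, idx, k):
    """Indices of the k films most similar to film idx, excluding idx itself."""
    order = np.argsort(-sim[idx], kind="stable")   # descending similarity
    order = order[order != idx]                    # drop the query film
    return order[:k].tolist()

neighbours = most_similar(sim, idx=0, k=2)
print(neighbours)
```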


In [45]:
def improved_recommendations(title):

    idx = titles[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indexes = [i[0] for i in sim_scores]
    
    similar_movies = smovies.iloc[movie_indexes][['title', 'vote_count', 'vote_average', 'year', 'id', 'genres', 'popularity']]
    
    return top_n_movies(similar_movies, percentile=0.5)

We test this on a few example titles

In [46]:
improved_recommendations('Cloud Atlas')

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
1589,Metropolis,1927,666,8,14.4879,7.344
2079,The Matrix,1999,9079,7,33.3663,6.967193
6622,Children of Men,2006,2120,7,14.1146,6.875468
7284,Moon,2009,1831,7,13.3368,6.859109
6632,Perfume: The Story of a Murderer,2006,1198,7,10.2973,6.802195
2171,Run Lola Run,1998,672,7,7.76538,6.702248
7226,The International,2009,373,6,6.58361,6.079569
3401,Bridget Jones's Diary,2001,1397,6,10.7805,6.033431
4928,The Matrix Revolutions,2003,3155,6,15.3682,6.016754
4651,The Matrix Reloaded,2003,3500,6,16.2293,6.01526


In [47]:
improved_recommendations('Inside Out')

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
7307,Up,2009,7048,7,19.3309,6.828748
3833,"Monsters, Inc.",2001,6150,7,26.42,6.808936
8432,Despicable Me 2,2013,4729,7,24.8236,6.766119
7629,Toy Story 3,2010,4710,7,16.9665,6.765416
521,Aladdin,1992,3495,7,16.3574,6.709606
1263,Hercules,1997,1741,7,14.0487,6.557698
1662,One Hundred and One Dalmatians,1961,1643,6,15.7275,6.039619
7334,Ice Age: Dawn of the Dinosaurs,2009,2330,6,12.9806,6.032714
8371,The Croods,2013,2447,6,14.7579,6.031771
9182,Finding Dory,2016,4333,6,14.477677,6.021692


### Hybrid Recommender System
Hybrid systems are developed to avoid the main drawbacks of the individual recommender systems; they arise as a combination of content-based and collaborative recommender systems.

### Function that recommends films for a given user and film:

 - we find the 25 most similar films, as in the previous function
 - then we apply recommend_movies to obtain a list of those films sorted for the given user
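
The two steps can be sketched as a re-ranking of content-based candidates by the user's predicted scores. The ids, scores and the helper `rerank` are made up for illustration. A candidate without a predicted score gets NaN and is placed last, which mirrors the empty `predictions` cells in the outputs below.

```python
import math

candidates = [101, 102, 103, 104]                          # hypothetical ids from the content step
predicted_for_user = {101: 0.57, 103: -0.33, 104: 0.02}    # no prediction for 102

def rerank(candidates, predicted_for_user):
    """Sort candidates by predicted score, descending; unknown scores go last."""
    scored = [(m, predicted_for_user.get(m, math.nan)) for m in candidates]
    known = sorted(
        (p for p in scored if not math.isnan(p[1])),
        key=lambda p: p[1],
        reverse=True,
    )
    unknown = [p for p in scored if math.isnan(p[1])]
    return known + unknown

ranking = rerank(candidates, predicted_for_user)
print([m for m, _ in ranking])
```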

In [48]:
def hybrid(userId, title):
    
    idx = titles[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indexes = [i[0] for i in sim_scores]
    
    recomended_movies = smovies.iloc[movie_indexes][['title', 'vote_count', 'vote_average', 'year', 'id', 'genres']]
    
    user_full, recomendations = recommend_movies(predictions, userId, recomended_movies, ratings,20)
    
    return recomendations

We test this for arbitrary users and titles

In [49]:
hybrid(4532, 'Harry Potter and the Half-Blood Prince')

Unnamed: 0,title,genres,vote_average,vote_count,predictions
7,Harry Potter and the Goblet of Fire,"[adventure, fantasy, family]",7.5,5758.0,0.574832
5,Harry Potter and the Philosopher's Stone,"[adventure, fantasy, family]",7.5,7188.0,0.568126
23,Zathura: A Space Adventure,"[family, fantasy, sciencefiction, adventure]",6.1,808.0,0.015008
12,Spirited Away,"[fantasy, adventure, animation, family]",8.3,3968.0,0.006516
20,Inkheart,"[adventure, family, fantasy]",6.0,610.0,-9.6e-05
13,In the Name of the King: A Dungeon Siege Tale,"[adventure, fantasy, action, drama]",4.1,227.0,-0.081125
8,Harry Potter and the Prisoner of Azkaban,"[adventure, fantasy, family]",7.7,6037.0,-0.32719
0,Harry Potter and the Order of the Phoenix,"[adventure, fantasy, family, mystery]",7.4,5633.0,
1,Harry Potter and the Deathly Hallows: Part 1,"[adventure, fantasy, family]",7.5,5708.0,
2,Harry Potter and the Deathly Hallows: Part 2,"[family, fantasy, adventure]",7.9,6141.0,


In [50]:
hybrid(1000, "Pan's Labyrinth")

Unnamed: 0,title,genres,vote_average,vote_count,predictions
3,Hellboy,"[fantasy, action, sciencefiction]",6.5,2278.0,0.022612
1,The Devil's Backbone,"[fantasy, drama, horror, thriller, sciencefict...",7.2,277.0,0.000348
17,Heavenly Creatures,"[drama, fantasy]",6.9,299.0,0.000241
5,Mimic,"[fantasy, horror, thriller]",5.7,255.0,-0.005733
0,Cronos,"[drama, horror, thriller]",6.4,153.0,
2,Hellboy II: The Golden Army,"[adventure, fantasy, sciencefiction]",6.5,1555.0,
4,Pacific Rim,"[action, sciencefiction, adventure]",6.7,4903.0,
6,Blade II,"[fantasy, horror, action, thriller]",6.3,1556.0,
7,Hellboy: The Seeds of Creation,[],8.5,2.0,
8,Goya in Bordeaux,"[drama, war]",5.5,8.0,
