## Preporuke filmova pomoću liste najpopularnijih i content-based algoritma preporuke

Preporuka filmova pomoću liste najpopularnijih svim korisnicima preporuča iste filmove poredane po izračunatom ratingu
**IMDb-formule**

\begin{equation}
\text TežinskiRating (\bf WR) = \left({{\bf v} \over {\bf v} + {\bf m}} \cdot R\right) + \left({{\bf m} \over {\bf v} + {\bf m}} \cdot C\right)
\end{equation}

gdje je:
v = broj ocjena filma ,
m = minimalni broj ocjena da bi se prikazao u listi (*izračunat kao 90% kvantil*),
R = prosjecna ocjena filma,
C = prosječna ocjena svih filmova.

Content based algoritam pomoću sličnosti između filmova traži filmove s sličnim atributima i te preporuča korisniku.

Atributi koje ovdje koristimo su žanrovi iz movies.csv i svi tagovi koje su korisnici upisali u tags.csv za film.
Što je više atributa dostupno za svaki film ovaj će algoritam bolje raditi i pravilnije preproručavati filmove.

Kako bi odredili težinu (_važnost_) atributa koristimo **TF-IDF** score (*term frequency-inverted document frequency score*).

\begin{equation}
\text tfidf_{i,j} = tf_{i,j} \cdot \log (\frac{N}{df_i})
\end{equation}

where:
Kada smo izračunali TF-IDF gradimo **Vector Space Model** uz **Cosine similarity**.

Rezultat nam je kvadratna matrica veličine broja filmova gdje na svakom sjecištu leži ocjena njihove sličnosti.

In [30]:
import pandas as pd

# Više izlaznih linija
from IPython.core.interactiveshell import InteractiveShell
from typing import List
InteractiveShell.ast_node_interactivity = "all"

movies = pd.read_csv('ml-latest/movies.csv')
movies.head()

ratings = pd.read_csv('ml-latest/ratings.csv',usecols=['userId','movieId','rating'])
ratings.head()

tags = pd.read_csv('ml-latest/tags.csv',usecols=['movieId','tag'])
tags.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Unnamed: 0,userId,movieId,rating
0,1,307,3.5
1,1,481,3.5
2,1,1091,1.5
3,1,1257,4.5
4,1,1449,4.5


Unnamed: 0,movieId,tag
0,110,epic
1,110,Medieval
2,260,sci-fi
3,260,space action
4,318,imdb top 250


## Najpopularniji preporučavanje

Izračun srednje ocjene filma iz svih njegovih ocjena i zbrajanje koliko je puta ocjenjen.
Ta dva podataka natrag upisujemo u dataset filma.

In [31]:
movie_rating_temp_df = pd.merge(movies,ratings,how='inner',on='movieId')[['movieId','rating']]

avg_rating_df = movie_rating_temp_df.groupby('movieId', as_index=False).mean().rename(columns={'rating': 'avgRating'})

movie_rating_count = movie_rating_temp_df.groupby('movieId', as_index=False).count().rename(columns={'rating': 'ratingCount'})

movies = pd.merge(pd.merge(movies,avg_rating_df,how='left',on='movieId'),movie_rating_count,how='left',on='movieId')

movies['avgRating'] = movies['avgRating'].fillna(0)
movies['ratingCount'] = movies['ratingCount'].fillna(0)

movies.head()

Unnamed: 0,movieId,title,genres,avgRating,ratingCount
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.886649,68469.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.246583,27143.0
2,3,Grumpier Old Men (1995),Comedy|Romance,3.173981,15585.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.87454,2989.0
4,5,Father of the Bride Part II (1995),Comedy,3.077291,15474.0


Računanje srednje ocjene svih filmova i računanje granice 90% kvantila kao donje granice top filmova.

In [32]:
mean_rating = ratings['rating'].mean()
print(mean_rating)

low_vote_number = movies['ratingCount'].quantile(0.90)
print(low_vote_number)

3.5304452124932677
453.0


Filtiriraje filmova koji su u top 10% po broju ocjena i njihovo pohranjivanje u novi dataset kvalificiranih filmova.

In [33]:
q_movies = movies.copy().loc[movies['ratingCount']>=low_vote_number]
q_movies.shape

movies.shape

(5813, 5)

(58098, 5)

Fukncija za računanje težinskog rating pomoću **IMDb-formule**
Upis u dataset kvalificiranih filmova njihov tezinski rating.

In [34]:
def weighed_rating(x, m=low_vote_number, C=mean_rating):
    v = x['ratingCount']
    R = x['avgRating']

    return (v/(v+m) * R) + (m/(m+v) * C)

q_movies['score'] = q_movies.apply(weighed_rating,axis=1)

Sortiranje filmova po izračunatom ratingu i ispis top 20 i najgorih 10.

In [35]:
q_movies = q_movies.sort_values(by='score',ascending=False)

q_movies[['title', 'avgRating', 'ratingCount', 'score']].head(20)
q_movies[['title', 'avgRating', 'ratingCount', 'score']].tail(10)

Unnamed: 0,title,avgRating,ratingCount,score
315,"Shawshank Redemption, The (1994)",4.424188,97999.0,4.420076
843,"Godfather, The (1972)",4.332893,60904.0,4.326968
49,"Usual Suspects, The (1995)",4.291959,62180.0,4.286451
1195,"Godfather: Part II, The (1974)",4.263035,38875.0,4.254597
523,Schindler's List (1993),4.257502,71516.0,4.252925
1936,Seven Samurai (Shichinin no samurai) (1954),4.254116,14578.0,4.232306
42845,Planet Earth (2006),4.458092,1384.0,4.229337
2874,Fight Club (1999),4.230663,65678.0,4.225867
1178,12 Angry Men (1957),4.237075,17931.0,4.219663
887,Rear Window (1954),4.230799,22264.0,4.216833


Unnamed: 0,title,avgRating,ratingCount,score
7983,Catwoman (2004),1.852258,2325.0,2.125915
3182,Stop! Or My Mom Will Shoot (1992),1.811079,2067.0,2.120155
6373,Dumb and Dumberer: When Harry Met Lloyd (2003),1.76245,2008.0,2.087888
1648,Home Alone 3 (1997),1.889405,3445.0,2.080116
1506,Speed 2: Cruise Control (1997),1.957457,6276.0,2.063351
4680,Glitter (2001),1.141026,741.0,2.047564
6478,Gigli (2003),1.205556,810.0,2.039423
1694,Spice World (1997),1.826339,3193.0,2.038067
11628,Epic Movie (2007),1.472287,1281.0,2.009972
3503,Battlefield Earth (2000),1.610675,4965.0,1.771187


## Content based preporučvanje

Čistimo tag dataset tako da pretvaramo sve u mala slova i spajamo riječi s razmacima iz jednog taga.
Mergamo sve tagove jednoga filma od različitih korisnika u dataset movie tako da spojimo sve rijeci u
jednu konkatenacijom.

In [36]:
tags['tag'] = tags['tag'].apply(lambda x: str(x).replace(" ","").lower())

movies = pd.merge(tags.groupby(['movieId'], as_index=False)['tag'].apply(lambda x: " ".join(x)), movies,how='right',on='movieId')

movies['tag']=movies['tag'].fillna('')

Funkcija koja izbacuje sve duplikate tagova iz jednoga filma i ubacuje imena žanrova te stvara novi stupac
u movie datasetu, soup koji sadži sve te ključne riječi.

In [37]:
def createSoup(x):
    infoSet = set(x['tag'].split(' '))

    data = ' '.join(infoSet)

    for genre in x['genres'].split('|'):
        data+=(' ' + genre.lower())
    return data

movies['soup'] = movies.apply(createSoup,axis=1)

movies.head()

Unnamed: 0,movieId,tag,title,genres,avgRating,ratingCount,soup
0,1,animated buddymovie cartoon cgi comedy compute...,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.886649,68469.0,childrencartoon toys dolls action classic toyc...
1,2,fantasy adaptedfrom:book animals badcgi basedo...,Jumanji (1995),Adventure|Children|Fantasy,3.246583,27143.0,jungle family game basedonchildren'sbook time ...
2,3,moldy old annmargaret burgessmeredith darylhan...,Grumpier Old Men (1995),Comedy|Romance,3.173981,15585.0,comedinhadevelhinhosengraãƒâ§ada burgessmeredi...
3,4,characters girlmovie characters chickflick bas...,Waiting to Exhale (1995),Comedy|Drama|Romance,2.87454,2989.0,basedonnovelorbook clv characters chickflick s...
4,5,stevemartin stevemartin pregnancy remake aging...,Father of the Bride Part II (1995),Comedy,3.077291,15474.0,family pregnancy worstmoviesever midlifecrisis...


Stvaranje TF-IDF Vectorizera i slanje mu svih ključih riječi za filmove.

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')

movies['soup'] = movies['soup'].fillna('')

tfidf_matrix = tfidf.fit_transform(movies['soup'])

tfidf_matrix.shape

tfidf.get_feature_names()[1000:1020]

(58098, 64665)

['adolescent',
 'adolescentdrama',
 'adolescentfantasy',
 'adolescentgays',
 'adolescentphilosophy',
 'adolescentpsychology',
 'adolescentrebellion',
 'adolescents',
 'adolfeichmann',
 'adolfhitler',
 'adolfi',
 'adolfomarsillach',
 'adoorgopalakrishnan',
 'adopted',
 'adoptedbrother',
 'adoptedchild',
 'adopteddaughter',
 'adoptedfrom',
 'adoptedson',
 'adoptee']

Stvaranje linear_kernel objekta koji ce pomocu cosine similarity ocjene moći ocjenjivati filmove.

In [39]:
from sklearn.metrics.pairwise import linear_kernel

cosin_sin = linear_kernel(tfidf_matrix,tfidf_matrix)

cosin_sin.shape

cosin_sin[1]

(58098, 58098)

array([0.07776567, 1.        , 0.01078806, ..., 0.        , 0.05633312,
       0.        ])

In [40]:
indices = pd.Series(movies.index, index = movies['movieId']).drop_duplicates()

indices[:10]

movieId
1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     7
9     8
10    9
dtype: int64

Funkcija koja za dan id filma pomoću sličnosti s drugim filmovima ispisuje slične filmove.

In [41]:
def get_recommendations_by_id(movieId,n=10, cosine_sim=cosin_sin):

    idx = indices[movieId]

    sim_scores = list(enumerate(cosine_sim[idx]))

    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse=True)

    sim_scores = sim_scores[1:n+1]

    movie_indices = [i[0] for i in sim_scores]

    return movies['movieId'].iloc[movie_indices]

movies.loc[movies['movieId'].isin(get_recommendations_by_id(1))]

Unnamed: 0,movieId,tag,title,genres,avgRating,ratingCount,soup
2271,2355,animation disney pixar insects kevinspacey opp...,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,3.575722,25521.0,animation oscarnominee kevinspacey johnlassete...
3028,3114,pixar sequelbetterthanoriginal abandonment ani...,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,3.809977,29820.0,toys dolls airplane daringrescues animation os...
4791,4886,funny pixar comedy funny pixar animated animat...,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,3.853349,37112.0,billycrystal classic conveyorbelt dvd own cute...
6272,6377,pixar funny pixar family father-sonrelationshi...,Finding Nemo (2003),Adventure|Animation|Children|Comedy,3.845176,37000.0,memorylack aftercreditsstinger mydvds saturnaw...
8278,8961,funny stylized super-hero animation powers ani...,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,3.855338,31857.0,harnoglemegetfineting.mendeterikkeheltitop.den...
11048,45517,redemption villainnonexistentornotneededforgoo...,Cars (2006),Animation|Children|Comedy,3.347277,8391.0,aftercreditsstinger carrace instillsgoodmoralv...
15471,78499,boring overrated story pixar children adventur...,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,3.87009,14841.0,toys dolls score sad boring toycomestolife bet...
25071,115875,pixaranimation short toystory pixar short gary...,Toy Story Toons: Hawaiian Vacation (2011),Adventure|Animation|Children|Comedy|Fantasy,3.005882,85.0,pixar garyrydstrom toystory short pixaranimati...
25073,115879,fastfood fastfoodrestaurant pixaranimation toy...,Toy Story Toons: Small Fry (2011),Adventure|Animation|Children|Comedy|Fantasy,3.205882,68.0,toys short angusmaclane fastfoodrestaurant fun...
26550,120474,animation familyfriendly short computeranimati...,Toy Story That Time Forgot (2014),Animation|Children,3.229258,229.0,cute toys animated pixar animation dinosaurs e...


Funkcija koja za dani id usera računa filmove koje bi se tom useru svidjeli.

In [42]:
def get_recommendations_by_user(userId, n = 10, cosine_sim = cosin_sin):
    userMovies = ratings.loc[ratings['userId']==userId]
    watched = pd.Series(userMovies['movieId']).tolist()

    scores=list()

    for item in userMovies.itertuples():
        movieId = item[2]
        rating = item[3]

        idx = indices[movieId]
        sim_scores = list(enumerate(cosine_sim[idx]*rating))
        sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse=True)
        sim_scores = sim_scores[1:n+1]

        scores.extend(sim_scores)

    scores = sorted(scores, key = lambda  x: x[1], reverse=True)

    movie_recommendations = []

    num=n
    while True:
        val = scores[num]
        if not watched.__contains__(val[0]) and not movie_recommendations.__contains__(val[0]):
            movie_recommendations.append(val[0])
            num-=1
        if num==0:
            break

    return movies['movieId'].iloc[movie_recommendations]

movies.loc[movies['movieId'].isin(get_recommendations_by_user(1))]

Unnamed: 0,movieId,tag,title,genres,avgRating,ratingCount,soup
303,306,rewatch holes90s krzysztofkieslowski threecolo...,Three Colors: Red (Trois couleurs: Rouge) (1994),Drama,4.057025,8470.0,mannequin french gentle bibliothek john geneva...
515,519,cyborgs franchise franchise badacting surreal ...,RoboCop 3 (1993),Action|Crime|Drama|Sci-Fi|Thriller,2.244411,5904.0,policeman dystopia cyborg crappysequel scifi c...
2061,2144,80'sclassic teenangst johnhughes 1980s 80'scla...,Sixteen Candles (1984),Comedy|Romance,3.618039,10721.0,highschool mysoginistic birthday farce nerds b...
4422,4516,robertdowneyjr. sport robertdowneyjr. betamax,Johnny Be Good (1988),Comedy,2.279661,295.0,betamax sport robertdowneyjr. comedy
6783,6892,robertdowneyjr robertdowneyjr. robertdowneyjr ...,"Singing Detective, The (2003)",Comedy|Drama|Musical|Mystery,2.783333,180.0,melgibson robertdowneyjr robertdowneyjr. comed...
14018,70008,highway hitchhike roadmovie writer,Kill Your Darlings (2006),Comedy|Drama,2.571429,7.0,roadmovie highway hitchhike writer comedy drama
20573,100103,writer,Day and Night (Le jour et la nuit) (1997),Drama,0.5,2.0,writer drama
26087,118782,pizza pizzeria,Fat Pizza (2003),Action|Adventure|Comedy|Crime|Thriller,3.1,10.0,pizzeria pizza action adventure comedy crime t...
29707,128661,writer,The Man Who Wouldn't Die (1994),Crime|Drama|Thriller,4.5,1.0,writer crime drama thriller
47344,169700,blackout heatwave hip-hop newyorkcity punk,NY77: The Coolest Year in Hell (2007),Documentary,3.833333,3.0,blackout hip-hop newyorkcity punk heatwave doc...
