## Movie recommendations using a most-popular list and a content-based recommendation algorithm

The most-popular recommender suggests the same films to every user, ranked by a score computed with the **IMDb formula**:

\begin{equation}
\text{WeightedRating}\ (\mathbf{WR}) = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)
\end{equation}

where:
v = the number of ratings of the film,
m = the minimum number of ratings a film needs to appear in the list (*computed as the 90% quantile of the rating counts*),
R = the mean rating of the film,
C = the mean rating across all films (computed in this notebook as the mean of all individual ratings).
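As a quick check of the formula, the sketch below recomputes the score of two films from values that appear in the outputs further down in this notebook (m = 531.0, C ≈ 3.5304, and the `avgRating` and `ratingCount` of two rows at the bottom of the ranking). The function name `weighted_rating` and the variable names are assumptions made for this illustration.

```python
def weighted_rating(v, R, m, C):
    """IMDb-style weighted rating: a blend of the film's own mean R and the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Values copied from the outputs printed later in this notebook.
m = 531.0                 # 90% quantile of ratingCount
C = 3.5304452124932677    # mean of all ratings

glitter = weighted_rating(v=741, R=1.141026, m=m, C=C)
battlefield_earth = weighted_rating(v=4965, R=1.610675, m=m, C=C)

print(round(glitter, 4))            # close to the 2.138496 listed for Glitter (2001)
print(round(battlefield_earth, 4))  # close to the 1.796155 listed for Battlefield Earth (2000)
```

With only 741 ratings, the low average of Glitter (1.14) is pulled strongly toward C, so it ends up above Battlefield Earth, whose 4965 ratings let its own average (1.61) dominate the score.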

The content-based algorithm relies on the similarity between films: it looks for films whose attributes resemble those of a given film and recommends them to the user.

The attributes used here are the genres from movies.csv and all tags that users entered for a film in tags.csv.
The more attributes are available for each film, the better this algorithm works and the more accurate its recommendations become.

To determine the weight (_importance_) of an attribute we use the **TF-IDF** score (*term frequency-inverse document frequency*).

\begin{equation}
\text{tfidf}_{i,j} = \text{tf}_{i,j} \cdot \log\left(\frac{N}{\text{df}_i}\right)
\end{equation}

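where:
tf_{i,j} = the number of occurrences of term i in document j,
df_i = the number of documents that contain term i,
N = the total number of documents.

Here a document is the keyword string of one film and a term is a single keyword.

Once the TF-IDF scores are computed, we build a **Vector Space Model** and compare films with **cosine similarity**.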

The result is a square matrix with one row and one column per film, where each entry holds the similarity score of the corresponding pair of films.
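The two steps can be sketched in plain Python on a toy corpus. The film names and keyword strings below are hypothetical, and the sketch follows the formula exactly as written above; scikit-learn's `TfidfVectorizer` uses a smoothed variant of the idf term and normalises each row, so its numbers differ from the ones produced here.

```python
import math
from collections import Counter

# Hypothetical keyword strings for three films (illustrative data only).
docs = {
    "Film A": "toys pixar animation comedy",
    "Film B": "toys pixar animation adventure",
    "Film C": "mafia crime drama",
}

names = list(docs)
tokens = [docs[name].split() for name in names]
N = len(tokens)

# df_i: number of documents that contain term i.
df = Counter(term for doc in tokens for term in set(doc))

def tfidf_vector(doc):
    """Map each term of one document to tf * log(N / df)."""
    tf = Counter(doc)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

vectors = [tfidf_vector(doc) for doc in tokens]

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Square matrix: entry [i][j] is the similarity between film i and film j.
similarity = [[cosine(a, b) for b in vectors] for a in vectors]

for name, row in zip(names, similarity):
    print(name, [round(value, 3) for value in row])
```

Films A and B share three of their four keywords and therefore get a positive similarity, while Film C shares none and scores 0 against both. Rare keywords (here `comedy` and `adventure`) receive a larger weight than keywords shared by several films.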

In [14]:
import pandas as pd

# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

movies = pd.read_csv('ml-latest/movies.csv')
movies.head()

ratings = pd.read_csv('ml-latest/ratings.csv', usecols=['userId', 'movieId', 'rating'])
ratings.head()

tags = pd.read_csv('ml-latest/tags.csv', usecols=['movieId', 'tag'])
tags.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 307 | 3.5 |
| 1 | 1 | 481 | 3.5 |
| 2 | 1 | 1091 | 1.5 |
| 3 | 1 | 1257 | 4.5 |
| 4 | 1 | 1449 | 4.5 |


| | movieId | tag |
|---|---|---|
| 0 | 110 | epic |
| 1 | 110 | Medieval |
| 2 | 260 | sci-fi |
| 3 | 260 | space action |
| 4 | 318 | imdb top 250 |


## Most-popular recommendation

We compute each film's mean rating from all of its ratings and count how many times it has been rated.
Both values are then written back into the movies dataset.

In [15]:
# Keep only ratings that belong to a known film.
movie_rating_temp_df = pd.merge(ratings, movies, how='inner', on='movieId')[['movieId', 'rating']]

# Mean rating per film.
avg_rating_df = movie_rating_temp_df.groupby('movieId', as_index=False).mean().rename(columns={'rating': 'avgRating'})

# Number of ratings per film.
movie_rating_count = movie_rating_temp_df.groupby('movieId', as_index=False).count().rename(columns={'rating': 'ratingCount'})

# Attach both columns to movies; the inner joins drop films without any rating.
movies = pd.merge(pd.merge(movies, avg_rating_df, how='inner', on='movieId'), movie_rating_count, how='inner', on='movieId')

movies.head()

| | movieId | title | genres | avgRating | ratingCount |
|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 3.886649 | 68469 |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 3.246583 | 27143 |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 3.173981 | 15585 |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 2.87454 | 2989 |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | 3.077291 | 15474 |


We compute the mean of all ratings (C) and the 90% quantile of the rating counts, which serves as the minimum number of ratings (m) a film needs to qualify for the top list.

In [16]:
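# C in the formula: mean of all individual ratings.
mean_rating = ratings['rating'].mean()
print(mean_rating)

# m in the formula: 90% quantile of the per-film rating counts.
low_vote_number = movies['ratingCount'].quantile(0.90)
print(low_vote_number)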

3.5304452124932677
531.0
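To illustrate how the threshold m is derived, the sketch below computes a 0.90 quantile with linear interpolation between neighbouring values, which is the default behaviour of `Series.quantile`. The helper `quantile` and the list of rating counts are hypothetical and only serve the example.

```python
import math

def quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical rating counts for eleven films (heavily skewed, as rating counts tend to be).
rating_counts = [1, 2, 3, 5, 8, 13, 40, 120, 600, 5000, 90000]

threshold = quantile(rating_counts, 0.90)
qualified = [count for count in rating_counts if count >= threshold]

print(threshold)
print(qualified)
```

Only the films whose count reaches the threshold remain, which mirrors the filtering step applied to `ratingCount` in this notebook.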


We filter the films that are in the top 10% by number of ratings and store them in a new dataset of qualified films.

In [17]:
# Select the qualifying rows first, then copy them so q_movies is independent of movies.
q_movies = movies.loc[movies['ratingCount'] >= low_vote_number].copy()
q_movies.shape

movies.shape

(5392, 5)

(53889, 5)

A function that computes the weighted rating with the **IMDb formula**.
The weighted rating of each film is then written into the dataset of qualified films.

In [18]:
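def weighed_rating(x, m=low_vote_number, C=mean_rating):
    """Weighted rating of one film row: blends its own mean R with the global mean C."""
    v = x['ratingCount']  # number of ratings of the film
    R = x['avgRating']    # mean rating of the film

    return (v/(v+m) * R) + (m/(m+v) * C)

# Store the score as a new column of the qualified films.
q_movies['score'] = q_movies.apply(weighed_rating, axis=1)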

We sort the films by the computed score and display the top 20 and the bottom 10.

In [19]:
q_movies = q_movies.sort_values(by='score',ascending=False)

q_movies[['title', 'avgRating', 'ratingCount', 'score']].head(20)
q_movies[['title', 'avgRating', 'ratingCount', 'score']].tail(10)

| | title | avgRating | ratingCount | score |
|---|---|---|---|---|
| 315 | Shawshank Redemption, The (1994) | 4.424188 | 97999 | 4.419371 |
| 843 | Godfather, The (1972) | 4.332893 | 60904 | 4.325957 |
| 49 | Usual Suspects, The (1995) | 4.291959 | 62180 | 4.285511 |
| 1195 | Godfather: Part II, The (1974) | 4.263035 | 38875 | 4.253164 |
| 523 | Schindler's List (1993) | 4.257502 | 71516 | 4.252143 |
| 1936 | Seven Samurai (Shichinin no samurai) (1954) | 4.254116 | 14578 | 4.228683 |
| 2874 | Fight Club (1999) | 4.230663 | 65678 | 4.225047 |
| 1178 | 12 Angry Men (1957) | 4.237075 | 17931 | 4.216752 |
| 887 | Rear Window (1954) | 4.230799 | 22264 | 4.214484 |
| 1169 | One Flew Over the Cuckoo's Nest (1975) | 4.22292 | 42181 | 4.214311 |


| | title | avgRating | ratingCount | score |
|---|---|---|---|---|
| 1350 | Grease 2 (1982) | 1.991641 | 4247 | 2.162655 |
| 3182 | Stop! Or My Mom Will Shoot (1992) | 1.811079 | 2067 | 2.162497 |
| 4680 | Glitter (2001) | 1.141026 | 741 | 2.138496 |
| 6373 | Dumb and Dumberer: When Harry Met Lloyd (2003) | 1.76245 | 2008 | 2.132204 |
| 6478 | Gigli (2003) | 1.205556 | 810 | 2.126149 |
| 1648 | Home Alone 3 (1997) | 1.889405 | 3445 | 2.108568 |
| 1506 | Speed 2: Cruise Control (1997) | 1.957457 | 6276 | 2.080163 |
| 11623 | Epic Movie (2007) | 1.472287 | 1281 | 2.075423 |
| 1694 | Spice World (1997) | 1.826339 | 3193 | 2.069325 |
| 3503 | Battlefield Earth (2000) | 1.610675 | 4965 | 1.796155 |


## Content-based recommendation

We clean the tag dataset by converting everything to lowercase and removing the spaces inside multi-word tags, so that each tag becomes a single token.
We then merge all tags that different users gave to a film into the movies dataset by concatenating them into one space-separated string.

In [20]:
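# Lowercase each tag and remove inner spaces so a multi-word tag becomes one token.
tags['tag'] = tags['tag'].apply(lambda x: str(x).replace(" ", "").lower())

# Join all tags of a film into one string; the inner merge keeps only films with at least one tag.
tags_per_movie = tags.groupby('movieId')['tag'].agg(' '.join).reset_index()
movies = pd.merge(tags_per_movie, movies, how='inner', on='movieId')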
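The effect of the cleaning and concatenation step can be traced on the first five rows of tags.csv printed earlier. The sketch below uses plain Python instead of pandas so that each step is visible; the names `raw_tags`, `tags_per_movie` and `joined` are assumptions made for this example.

```python
from collections import defaultdict

# The first rows of tags.csv as printed earlier: (movieId, tag).
raw_tags = [
    (110, "epic"),
    (110, "Medieval"),
    (260, "sci-fi"),
    (260, "space action"),
    (318, "imdb top 250"),
]

tags_per_movie = defaultdict(list)
for movie_id, tag in raw_tags:
    # Lowercase and drop inner spaces so a multi-word tag becomes one token.
    tags_per_movie[movie_id].append(str(tag).replace(" ", "").lower())

# One space-separated string of tags per film.
joined = {movie_id: " ".join(tag_list) for movie_id, tag_list in tags_per_movie.items()}

print(joined)
```

Removing the inner spaces matters because the tags are later split on whitespace: without it, `space action` would count as the two unrelated keywords `space` and `action`.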

A function that removes duplicate tags for a film, appends the genre names, and creates a new column in the movies dataset, `soup`, that holds all of these keywords.

In [21]:
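def createSoup(x):
    # A set removes duplicate tags (its iteration order is arbitrary).
    infoSet = set(x['tag'].split(' '))

    data = ' '.join(infoSet)

    # Append the lowercased genre names.
    for genre in x['genres'].split('|'):
        data += ' ' + genre.lower()
    return data

movies['soup'] = movies.apply(createSoup, axis=1)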

movies.head()

| | movieId | tag | title | genres | avgRating | ratingCount | soup |
|---|---|---|---|---|---|---|---|
| 0 | 1 | animated buddymovie cartoon cgi comedy compute... | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 3.886649 | 68469 | goodtime greatmovie feel-good kids woody fanta... |
| 1 | 2 | fantasy adaptedfrom:book animals badcgi basedo... | Jumanji (1995) | Adventure\|Children\|Fantasy | 3.246583 | 27143 | timetravel animals kids dynamiccgiaction satur... |
| 2 | 3 | moldy old annmargaret burgessmeredith darylhan... | Grumpier Old Men (1995) | Comedy\|Romance | 3.173981 | 15585 | comedinhadevelhinhosengraãƒâ§ada oldpeoplethat... |
| 3 | 4 | characters girlmovie characters chickflick bas... | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 2.87454 | 2989 | revenge girlmovie divorce basedonnovelorbook s... |
| 4 | 5 | stevemartin stevemartin pregnancy remake aging... | Father of the Bride Part II (1995) | Comedy | 3.077291 | 15474 | remake gynecologist childhoodclassics touching... |
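Because `createSoup` deduplicates with a `set`, the order of the keywords in `soup` can differ from run to run. This is harmless for TF-IDF, which ignores word order, but it makes the column harder to compare between runs. A possible order-preserving variant is sketched below; the function `create_soup_ordered` is hypothetical and its input is a shortened version of the visible start of one row above.

```python
def create_soup_ordered(tag_string, genres):
    """Deduplicate tags in first-seen order, then append the lowercased genres."""
    # dict.fromkeys keeps the first occurrence of each tag and preserves insertion order.
    unique_tags = list(dict.fromkeys(tag_string.split()))
    genre_tokens = [genre.lower() for genre in genres.split("|")]
    return " ".join(unique_tags + genre_tokens)

# Shortened, illustrative input based on the row for Father of the Bride Part II (1995).
soup = create_soup_ordered("stevemartin stevemartin pregnancy remake", "Comedy")
print(soup)
```

Unlike the original function, this sketch splits on any whitespace, so accidental double spaces do not produce empty keywords.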


We create a TF-IDF vectorizer and fit it on the keywords of all films.

In [22]:
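from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore common English stop words when building the vocabulary.
tfidf = TfidfVectorizer(stop_words='english')

movies['soup'] = movies['soup'].fillna('')

# Rows are films, columns are vocabulary terms.
tfidf_matrix = tfidf.fit_transform(movies['soup'])

tfidf_matrix.shape

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
tfidf.get_feature_names_out()[1000:1020].tolist()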

(41772, 64065)

['adolescentpsychology',
 'adolescentrebellion',
 'adolescents',
 'adolfeichmann',
 'adolfhitler',
 'adolfomarsillach',
 'adoorgopalakrishnan',
 'adopted',
 'adoptedbrother',
 'adoptedchild',
 'adopteddaughter',
 'adoptedfrom',
 'adoptedson',
 'adoptee',
 'adoption',
 'adoptivefather',
 'adoptivemother',
 'adoptiveparents',
 'adorable',
 'adoration']
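One detail worth keeping in mind is how the vectorizer splits the keyword string. Assuming the default `token_pattern` of scikit-learn's text vectorizers is in effect, as in the cell above, punctuation acts as a separator and single-character tokens are dropped. The sketch below applies that regular expression directly to a hypothetical string built from tags seen in this notebook; it does not reproduce the stop-word filtering.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters; punctuation acts as a separator.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical keyword string assembled from tags that appear in this notebook.
soup = "feel-good sci-fi adaptedfrom:book imdbtop250 a"
tokens = TOKEN_PATTERN.findall(soup.lower())

print(tokens)
```

Tags joined without spaces, such as `imdbtop250`, stay a single term, whereas tags containing a hyphen or a colon are split into several terms.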

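We compute the pairwise similarity matrix with `linear_kernel`. Since `TfidfVectorizer` L2-normalises every row by default, the dot product that `linear_kernel` returns is equal to the cosine similarity, so the matrix can be used directly to score how similar two films are.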

In [23]:
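from sklearn.metrics.pairwise import linear_kernel

# Dense (n_films x n_films) matrix; with 41772 films in float64 this needs roughly 14 GB of memory.
cosin_sin = linear_kernel(tfidf_matrix, tfidf_matrix)

cosin_sin.shape

cosin_sin[1]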

(41772, 41772)

array([0.07708015, 1.        , 0.01042336, ..., 0.        , 0.01620595,
       0.0094338 ])

In [24]:
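# Map each title to its row position; keep the first row when a title occurs more than once.
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]

indices[:10]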

title
Toy Story (1995)                      0
Jumanji (1995)                        1
Grumpier Old Men (1995)               2
Waiting to Exhale (1995)              3
Father of the Bride Part II (1995)    4
Heat (1995)                           5
Sabrina (1995)                        6
Tom and Huck (1995)                   7
Sudden Death (1995)                   8
GoldenEye (1995)                      9
dtype: int64

A function that takes a film title and returns the films most similar to it, based on the similarity scores.

In [25]:
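def get_recommendations(title, n=10, cosine_sim=cosin_sin):
    """Return the titles of the n films most similar to `title`."""
    # Row position of the requested film.
    idx = indices[title]

    # Pair every film position with its similarity to the requested film.
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Most similar first.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry, which is expected to be the film itself.
    sim_scores = sim_scores[1:n+1]

    movie_indices = [i[0] for i in sim_scores]

    return movies['title'].iloc[movie_indices]

get_recommendations('Toy Story (1995)')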

2866                            Toy Story 2 (1999)
23824            Toy Story That Time Forgot (2014)
10558                                  Cars (2006)
2132                          Bug's Life, A (1998)
22785    Toy Story Toons: Hawaiian Vacation (2011)
5970                           Finding Nemo (2003)
14686                           Toy Story 3 (2010)
22787            Toy Story Toons: Small Fry (2011)
4548                         Monsters, Inc. (2001)
7904                       Incredibles, The (2004)
Name: title, dtype: object
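`get_recommendations` drops the first element of the sorted list on the assumption that the requested film sorts first. If another film has an identical keyword string (similarity 1.0) and a lower row position, that assumption may not hold and the film could end up recommending itself. A possible alternative, sketched here with the hypothetical helper `top_n_similar` and made-up similarity values, excludes the query position explicitly and selects the top n without sorting the whole row.

```python
import heapq

def top_n_similar(similarity_row, query_index, n=10):
    """Return the positions of the n most similar items, never including the query itself."""
    candidates = (
        (score, index)
        for index, score in enumerate(similarity_row)
        if index != query_index
    )
    best = heapq.nlargest(n, candidates, key=lambda pair: pair[0])
    return [index for score, index in best]

# Hypothetical similarity row for item 1; item 3 has an identical keyword string (score 1.0).
row = [0.08, 1.0, 0.01, 1.0, 0.30]
print(top_n_similar(row, query_index=1, n=2))
```

The returned positions could then be passed to `movies['title'].iloc[...]` in the same way as `movie_indices` in the function above.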