# Movie Recommendation System
### Building a predictive movie recommendation model with the MovieLens dataset

The goal is an exploratory analysis and the development of a predictive movie recommendation model using machine learning algorithms on the MovieLens dataset.
The model would recommend new movies to a user based on what it knows about that user,
such as which genres they prefer and which movies they rated well or poorly, and based on
the data of very similar users.

[MovieLens dataset](https://grouplens.org/datasets/movielens/)

In this demonstration we use the smallest version of the dataset, "ml-latest-small".

The initial idea is to suggest to users the movies
that have been rated, and thus watched, most often.

In [2]:
# Package imports

from math import sqrt
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# Show the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

movies = pd.read_csv('ml-latest-small/movies.csv')
movies.head()

ratings = pd.read_csv('ml-latest-small/ratings.csv',usecols=['userId','movieId','rating'])
ratings.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       1       47     5.0
4       1       50     5.0


In [8]:
df1 = pd.merge(ratings, movies, on=['movieId'])

df1.groupby('title').size().sort_values(ascending=False)[:15]

title
Forrest Gump (1994)                                      329
Shawshank Redemption, The (1994)                         317
Pulp Fiction (1994)                                      307
Silence of the Lambs, The (1991)                         279
Matrix, The (1999)                                       278
Star Wars: Episode IV - A New Hope (1977)                251
Jurassic Park (1993)                                     238
Braveheart (1995)                                        237
Terminator 2: Judgment Day (1991)                        224
Schindler's List (1993)                                  220
Fight Club (1999)                                        218
Toy Story (1995)                                         215
Star Wars: Episode V - The Empire Strikes Back (1980)    211
Usual Suspects, The (1995)                               204
American Beauty (1999)                                   204
dtype: int64

The problem with this initial approach is that we suggest
the same movies to everyone, even though different users
have different preferences and rate movies differently.

The next idea is to use regression to find parameters
that let us predict the rating a user would give
to a movie they have not yet watched.

The feature vector X of a movie has a one in the position of every genre
the movie belongs to and zeros everywhere else.
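
The genre vector described above can also be built in one step. The sketch below is an alternative to the loop-based construction used in this notebook, not the code the results were produced with. It assumes the `genres` column uses `|` as a separator, as in the `movies.head()` output; the names `movies_demo` and `genre_features` are hypothetical and the three rows are copied from that output.

```python
import pandas as pd

# Tiny stand-in for movies.csv (rows copied from the head() output above)
movies_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One column per genre: 1 if the movie has that genre, 0 otherwise
genre_features = movies_demo["genres"].str.get_dummies(sep="|")
print(genre_features)
```

Each row of `genre_features` is the vector X for one movie, with the genre columns in alphabetical order.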

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.merge(ratings, movies, on=['movieId'])

# Select the first user as a test case
data = data.query('userId == 1')

# Build a DataFrame with the genre features
values = movies['genres'].str.split("|")
column_names1 = []
for v in values:
    for i in v:
        if i not in column_names1:
            column_names1.append(i)

# Column names for a second DataFrame, used later to compute scores
column_names2 = column_names1.copy()
column_names2.append('movieId')
column_names2.append('title')

column_names1.append('rating')
df2 = pd.DataFrame(columns=column_names1)

for index, row in data.iterrows():
    tmp = row['genres'].split("|")
    df2.loc[index, 'rating'] = data.loc[index, 'rating']
    
    for i in tmp:
        df2.loc[index, i] = 1

df2 = df2.fillna(0)

# The first 20 columns are the genre indicators; the last column is the rating
X = df2.iloc[:, :20]
y = df2.iloc[:, 20:21]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size=0.20)

lin = LinearRegression()
lin.fit(X_train, y_train)

lin.score(X_train, y_train)


0.11167022695367346

In [12]:
lin.score(X_test, y_test)

0.31243616691024434

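The results are poor: `score` returns the coefficient of determination R², which is about 0.11 on the training set
and 0.31 on the test set. Moreover, the training set scores worse than the test set, which may indicate
that the estimate depends heavily on this particular random split of a single user's ratings.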
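
One way to check how much the score depends on a single split is k-fold cross-validation. The sketch below is not part of the original analysis: it runs on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins for one user's genre features and ratings), and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for one user's data: 200 movies, 20 binary genre flags
X_demo = rng.integers(0, 2, size=(200, 20)).astype(float)
true_weights = rng.normal(0.0, 0.3, size=20)
# Ratings = baseline + genre effects + noise (purely illustrative)
y_demo = 3.5 + X_demo @ true_weights + rng.normal(0.0, 0.8, size=200)

# R^2 on each of 5 shuffled folds instead of a single train/test split
cv = KFold(n_splits=5, shuffle=True, random_state=0)
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
print(r2_scores.mean(), r2_scores.std())
```

A large spread across folds would suggest that a single test score, such as the one above, should be read with caution.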

In [5]:
df3 = pd.DataFrame(columns=column_names2)

for index, row in movies.iterrows():
    tmp = row['genres'].split("|")
    df3.loc[index, 'movieId'] = movies.loc[index, 'movieId']
    df3.loc[index, 'title'] = movies.loc[index, 'title']
    
    for i in tmp:
        df3.loc[index, i] = 1

df3 = df3.fillna(0)

scores = df3.iloc[:, :20].dot(lin.coef_.T)
df3 = df3.assign(scores = scores)
print(df3.sort_values(ascending=False, by='scores').iloc[:15, 21:23])

                                         title    scores
1483     This World, Then the Fireworks (1997)  0.843567
5184                           Suddenly (1954)  0.843567
5207                         White Heat (1949)  0.843567
4471                     Double Life, A (1947)  0.843567
6275                       Born to Kill (1947)  0.843567
884                       Grifters, The (1990)  0.843567
5837                 Call Northside 777 (1948)  0.843567
2568                   Double Indemnity (1944)  0.843567
2745                       Blood Simple (1984)  0.843567
5096  I Am a Fugitive from a Chain Gang (1932)  0.843567
1203                            Hoodlum (1997)  0.843567
7062             Limits of Control, The (2009)  0.843567
6176                              Brick (2005)  0.783897
1967                     Mildred Pierce (1945)  0.770527
3273             Sweet Smell of Success (1957)  0.770527


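As we can see, the results are not very satisfactory.
The user might rank the movies in this (sorted) order,
but certainly not with these ratings: every score is below 1, while the ratings in the dataset go up to 5.0.
Note that the scores are computed from `lin.coef_` alone, without `lin.intercept_`,
so they are offsets from the model's baseline rather than full predicted ratings.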
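
The scores above come from `df3.iloc[:, :20].dot(lin.coef_.T)`, a dot product with the coefficients only. The minimal sketch below, on a hypothetical two-genre toy dataset (`X_demo`, `y_demo`), shows the difference between that dot product and a full prediction that includes the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: two binary genre flags and made-up ratings
X_demo = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
y_demo = np.array([4.0, 3.0, 4.5, 2.5])

model = LinearRegression().fit(X_demo, y_demo)

# Dot product with the coefficients only, as in the cell above
coef_only = X_demo @ model.coef_
# Adding the intercept restores the baseline rating
full = coef_only + model.intercept_

print(coef_only)
print(full)
print(model.predict(X_demo))  # same values as `full`
```

The ordering of the movies is the same in both cases, since the intercept is a constant shift; only the scale of the scores changes.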

NEXT STEPS:
Fit the regression with other methods (neural network, RandomForest, SVR).
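
As a rough sketch of what this could look like, the block below fits two of the regressors named above with scikit-learn. It runs on synthetic data (`X_demo`, `y_demo` are hypothetical), and the hyperparameters are untuned assumptions, so the printed scores say nothing about MovieLens.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 movies, 20 binary genre flags, noisy ratings
X_demo = rng.integers(0, 2, size=(200, 20)).astype(float)
y_demo = 3.5 + X_demo @ rng.normal(0.0, 0.3, size=20) + rng.normal(0.0, 0.8, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Untuned, illustrative settings
models = {
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR": SVR(kernel="rbf", C=1.0),
}

test_r2 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_r2[name] = model.score(X_test, y_test)  # R^2 on the held-out set
    print(name, test_r2[name])
```

A neural network could be tried the same way, for example with `sklearn.neural_network.MLPRegressor`, which exposes the same `fit` and `score` interface.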

Reformulate the problem as classification so that
ML classification methods can be tried as well.
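
One possible classification framing, sketched here under explicit assumptions, is to turn each rating into a binary "liked" label. The threshold of 4.0, the synthetic data, and the names `ratings_demo` and `liked` are all hypothetical; logistic regression is used only as a simple example of a classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 binary genre flags and ratings on a 0.5-5.0 half-star scale
X_demo = rng.integers(0, 2, size=(200, 20)).astype(float)
raw = 3.5 + X_demo @ rng.normal(0.0, 0.3, size=20) + rng.normal(0.0, 0.8, size=200)
ratings_demo = np.clip(np.round(raw * 2) / 2, 0.5, 5.0)

# Assumed threshold: a rating of 4.0 or higher counts as "liked"
liked = (ratings_demo >= 4.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, liked, test_size=0.20, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct labels on the test set
print(accuracy)
```

With this framing, movies could be ranked by `clf.predict_proba` instead of by a predicted rating.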