Collaborative Filtering v_1

U ovom pristupu, radimo prema user-user collaborative filteringu. Pronalazimo najsličnijeg korisnika
i predlažemo filmove koje je taj korisnik najbolje ocijenio, a da prvi korisnik još nije pogledao.


In [1]:
import pandas as pd
import numpy as np

ratings = pd.read_csv("ml-latest-small/ratings.csv")
movies = pd.read_csv("ml-latest-small/movies.csv")

df = pd.DataFrame(columns=movies.loc[:, 'movieId'])
df

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609


Df će nam na kraju postupka biti user-item matrica. Uvidom u csv file vidimo da ima 600 korisnika pa hardkodiramo
tu vrijednost.

In [2]:
#Making user-item matrix
for index in range(600):
    userRating = ratings.query("userId == {}".format(index + 1))
    for ind, r in userRating.iterrows():
        df.loc[index + 1, r['movieId']] = r['rating']

df

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4,,4,,,4,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
596,4,,,,,,,,,,...,,,,,,,,,,
597,4,,,,,3,1,,,3,...,,,,,,,,,,
598,,,,,,,,,,,...,,,,,,,,,,
599,3,2.5,1.5,,,4.5,2.5,,1.5,3.5,...,,,,,,,,,,


Problem u ovom pristupu je što imamo veliki broj NaN vrijednosti.
Zato ćemo centrirati ocjene filmova koje je jedan korisnik dao tako da srednja vrijednost ocjene uvijek bude 0.
Tako možemo neutralno za NaN vrijednosti postaviti nule i nećemo povećati 'bias effect'.

In [9]:
df = pd.DataFrame(columns=movies.loc[:, 'movieId'])
# Making user-item matrix
for index in range(600):
    userRating = ratings.query("userId == {}".format(index + 1))
    mean = userRating['rating'].mean()
    for ind, r in userRating.iterrows():
        df.loc[index + 1, r['movieId']] = r['rating'] - mean
        
df = df.fillna(0)
df

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,-0.366379,0.000000,-0.366379,0.000000,0.000000,-0.366379,0.000000,0.0,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.363636,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
596,0.504866,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
597,0.022573,0.000000,0.000000,0.000000,0.000000,-0.977427,-2.977427,0.0,0.00000,-0.977427,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
598,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
599,0.357950,-0.142050,-1.142050,0.000000,0.000000,1.857950,-0.142050,0.0,-1.14205,0.857950,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
from sklearn.metrics.pairwise import cosine_similarity

def findBestFitUserAndRecommend(n):
    user = np.array(df.loc[n, :])
    user = np.array([user])
    max = 0
    userBestId = 0
    for i in range(600):
        if((i + 1) == n):
            continue
        secondUser = np.array(df.loc[i + 1, :])
        secondUser = np.array([secondUser])
        result = cosine_similarity(user, secondUser)
        if(result > max):
            max = result
            userBestId = i + 1

    bestFit = ratings.query("userId == {}".format(userBestId))

    bestFit = bestFit.sort_values('rating', ascending=False)

    for index, row in bestFit.head(20).iterrows():
        movie = ratings.query("userId == 1 & movieId == {}".format(row['movieId']))
        # Ako korisnik nije pogledao film
        if (len(movie) == 0):
            print(movies.query("movieId == {}".format(row['movieId'])).loc[:, 'title'])


findBestFitUserAndRecommend(50)

694    Casablanca (1942)
Name: title, dtype: object
602    Dr. Strangelove or: How I Learned to Stop Worr...
Name: title, dtype: object
913    Third Man, The (1949)
Name: title, dtype: object
5695    Old Boy (2003)
Name: title, dtype: object
685    Vertigo (1958)
Name: title, dtype: object
951    Chinatown (1974)
Name: title, dtype: object
4769    Nausicaä of the Valley of the Wind (Kaze no ta...
Name: title, dtype: object
3544    Mulholland Drive (2001)
Name: title, dtype: object
2789    Conversation, The (1974)
Name: title, dtype: object
930    Annie Hall (1977)
Name: title, dtype: object
937    Seventh Seal, The (Sjunde inseglet, Det) (1957)
Name: title, dtype: object
706    2001: A Space Odyssey (1968)
Name: title, dtype: object
962    Deer Hunter, The (1978)
Name: title, dtype: object
3167    Scarface (1983)
Name: title, dtype: object
1422    On the Waterfront (1954)
Name: title, dtype: object
596    Ghost in the Shell (Kôkaku kidôtai) (1995)
Name: title, dtype: object
2619    Net