### Recommender system

Seminar project on building a recommender system.
By Samo Pritržnik

## Data

For this seminar project I use the MovieLens data. The data are described in readme.txt.

In [32]:
# Libraries used throughout the notebook (pandas reads the tab-separated .dat files)
import numpy as np
import matplotlib as mpl
import pandas as pd
from csv import DictReader
import pickle as pkl
import random
from scipy.spatial.distance import cosine


# Reading the ratings

In [14]:
class UserItemData:
    def __init__(self, path, from_date=None, to_date=None, min_ratings=None):
        self.data = pd.read_csv(path, delimiter='\t')
        self.process_data(from_date, to_date, min_ratings)

    def process_data(self, from_date, to_date, min_ratings):
        # Convert date columns to a single datetime column
        self.data['datetime'] = pd.to_datetime(self.data['date_year'].astype(str) + '-' +
                                               self.data['date_month'].astype(str).str.zfill(2) + '-' +
                                               self.data['date_day'].astype(str).str.zfill(2) + ' ' +
                                               self.data['date_hour'].astype(str).str.zfill(2) + ':' +
                                               self.data['date_minute'].astype(str).str.zfill(2) + ':' +
                                               self.data['date_second'].astype(str).str.zfill(2))

        # Filter by date range if specified
        if from_date:
            from_date = pd.to_datetime(from_date, dayfirst=True)
            self.data = self.data[self.data['datetime'] >= from_date]
        
        if to_date:
            to_date = pd.to_datetime(to_date, dayfirst=True)
            self.data = self.data[self.data['datetime'] <= to_date]
        
        # Filter by minimum ratings for each movie if specified
        if min_ratings:
            movie_counts = self.data['movieID'].value_counts()
            movies_with_min_ratings = movie_counts[movie_counts >= min_ratings].index
            self.data = self.data[self.data['movieID'].isin(movies_with_min_ratings)]

    def nratings(self):
        return len(self.data)

# Example usage
uim = UserItemData('podatki/user_ratedmovies.dat')
print(uim.nratings())

uim = UserItemData('podatki/user_ratedmovies.dat', from_date='12.1.2007', to_date='16.2.2008', min_ratings=100)
print(uim.nratings())

855598
73584
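
The filters applied in process_data can be tried on a small in-memory table. The sketch below assumes the same column names as the class above; the sample rows and the helper name `filter_ratings` are made up for illustration and are not part of the notebook's code. It assembles the date with `pd.to_datetime` on year/month/day columns, which is one possible alternative to concatenating strings.

```python
import pandas as pd

# Made-up sample rows using the column names that UserItemData expects
sample = pd.DataFrame({
    'userID':     [1, 2, 3, 1, 2],
    'movieID':    [10, 10, 10, 20, 20],
    'rating':     [4.0, 3.5, 5.0, 2.0, 4.5],
    'date_day':   [11, 12, 15, 1, 17],
    'date_month': [1, 1, 6, 2, 2],
    'date_year':  [2007, 2007, 2007, 2008, 2008],
})


def filter_ratings(df, from_date=None, to_date=None, min_ratings=None):
    """Hypothetical helper mirroring the filters in UserItemData.process_data."""
    df = df.copy()
    # pd.to_datetime can assemble dates from columns named year/month/day
    parts = df[['date_year', 'date_month', 'date_day']].rename(
        columns={'date_year': 'year', 'date_month': 'month', 'date_day': 'day'})
    df['datetime'] = pd.to_datetime(parts)

    if from_date is not None:
        # dayfirst=True reads '12.1.2007' as 12 January 2007
        df = df[df['datetime'] >= pd.to_datetime(from_date, dayfirst=True)]
    if to_date is not None:
        df = df[df['datetime'] <= pd.to_datetime(to_date, dayfirst=True)]
    if min_ratings:
        # Keep only movies that still have enough ratings after the date filter
        counts = df['movieID'].value_counts()
        df = df[df['movieID'].isin(counts[counts >= min_ratings].index)]
    return df


filtered = filter_ratings(sample, from_date='12.1.2007',
                          to_date='16.2.2008', min_ratings=2)
print(filtered[['userID', 'movieID', 'rating']])
```

Note that the minimum-ratings filter runs after the date filter, so a movie is counted only by the ratings that fall inside the chosen date range.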


# Reading the movies

In [18]:
class MovieData:
    def __init__(self, path):
        self.data = pd.read_csv(path, delimiter='\t', encoding='latin1')

    def get_title(self, movie_id):
        return self.data[self.data['id'] == movie_id]['title'].values[0]
    
md = MovieData('podatki/movies.dat')
print(md.get_title(499))
        

Mr. Wonderful


## Predictor

We use the word "predictor" for classes that estimate, in some way, what rating a given user would give to the movies or products available to them. These classes have a method fit(self, X), where X is of type UserItemData, and a method predict(self, user_id), where user_id is the user's ID. The fit method is used to train the model, and predict to compute the predicted values for the given user.
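
To make the fit/predict contract concrete, here is a minimal sketch. `ConstantPredictor` and `FakeData` are hypothetical names used only for this illustration; `FakeData` stands in for UserItemData by exposing a `data` DataFrame with a `movieID` column.

```python
import pandas as pd


class FakeData:
    """Hypothetical stand-in for UserItemData: only exposes a .data DataFrame."""
    def __init__(self, rows):
        self.data = pd.DataFrame(rows, columns=['userID', 'movieID', 'rating'])


class ConstantPredictor:
    """Hypothetical predictor that gives every movie the same score."""
    def __init__(self, value):
        self.value = value
        self.movie_ids = []

    def fit(self, X):
        # "Learning" here only means remembering which movies exist
        self.movie_ids = [int(m) for m in X.data['movieID'].unique()]

    def predict(self, user_id):
        # Return a dict that maps movie ID -> predicted rating for this user
        return {movie_id: self.value for movie_id in self.movie_ids}


X = FakeData([(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0)])
cp = ConstantPredictor(3.5)
cp.fit(X)
pred = cp.predict(1)
print(pred)
```

Any class with these two methods can be plugged into the same recommendation code, which is the point of keeping the interface this small.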

# Random predictor

In [25]:
class RandomPredictor:
    def __init__(self, min_rating, max_rating):
        self.min_rating = min_rating
        self.max_rating = max_rating

    def fit(self, user_item_data):
        # Nothing to learn; just keep the data so predict() knows which movies exist
        self.user_item_data = user_item_data

    def predict(self, user_id):
        movie_ids = self.user_item_data.data['movieID'].unique()
        return {movie_id: random.uniform(self.min_rating, self.max_rating) for movie_id in movie_ids}

# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat')
rp = RandomPredictor(1, 5)
rp.fit(uim)
pred = rp.predict(78)
print(type(pred))
items = [1, 3, 20, 50, 100]
for item in items:
    print("Film: {}, ocena: {}".format(md.get_title(item), pred[item]))

<class 'dict'>
Film: Toy story, ocena: 2.039608946218791
Film: Grumpy Old Men, ocena: 1.1707559756861001
Film: Money Train, ocena: 1.0012051750252389
Film: The Usual Suspects, ocena: 4.199859417762402
Film: City Hall, ocena: 3.9793208996496205


# Recommending

In [28]:
class Recommender:
    def __init__(self, predictor):
        self.predictor = predictor

    def fit(self, X):
        self.X = X
        self.predictor.fit(X)

    def recommend(self, userID, n=10, rec_seen=True):
        predictions = self.predictor.predict(userID)
        if not rec_seen:
            # Filter out movies the user has already rated
            rated_movies = set(self.X.data[self.X.data['userID'] == userID]['movieID'])
            predictions = {movie_id: rating for movie_id, rating in predictions.items() if movie_id not in rated_movies}

        # Sort the predictions and return the top n
        sorted_predictions = sorted(predictions.items(), key=lambda x: x[1], reverse=True)
        return sorted_predictions[:n]

# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat')
rp = RandomPredictor(1, 5)
rec = Recommender(rp)
rec.fit(uim)
rec_items = rec.recommend(78, n=5, rec_seen=False)
for idmovie, val in rec_items:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

Film: Love Is All There Is, ocena: 4.999475612031708
Film: Viskningar och rop, ocena: 4.998946622783171
Film: At the Circus, ocena: 4.99758319544617
Film: BubbaHo-Tep, ocena: 4.99717266543475
Film: Seven Girlfriends, ocena: 4.996090193142638


# Predicting with the average

In [29]:
class AveragePredictor:
    def __init__(self, b):
        self.b = b
        self.avg_ratings = {}

    def fit(self, user_item_data):
        # Calculate the global average rating
        g_avg = user_item_data.data['rating'].mean()

        # Calculate the average rating for each movie
        for movie_id in user_item_data.data['movieID'].unique():
            movie_data = user_item_data.data[user_item_data.data['movieID'] == movie_id]
            vs = movie_data['rating'].sum()  # Sum of ratings for the movie
            n = len(movie_data)  # Number of ratings for the movie

            # Calculate the adjusted average
            self.avg_ratings[movie_id] = (vs + self.b * g_avg) / (n + self.b)

    def predict(self, user_id):
        # Return the calculated average ratings
        # This predictor ignores the user_id because it does not do personalized predictions
        return self.avg_ratings

# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat')

# Using AveragePredictor with b=0
avg_pred = AveragePredictor(b=0)
rec = Recommender(avg_pred)
rec.fit(uim)
rec_items = rec.recommend(78, n=5, rec_seen=False)
for idmovie, val in rec_items:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

# Using AveragePredictor with b=100
avg_pred_b100 = AveragePredictor(b=100)
rec_b100 = Recommender(avg_pred_b100)
rec_b100.fit(uim)
rec_items_b100 = rec_b100.recommend(78, n=5, rec_seen=False)
for idmovie, val in rec_items_b100:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

Film: Brother Minister: The Assassination of Malcolm X, ocena: 5.0
Film: Synthetic Pleasures, ocena: 5.0
Film: Adam & Steve, ocena: 5.0
Film: Gabbeh, ocena: 5.0
Film: Eve and the Fire Horse, ocena: 5.0
Film: The Usual Suspects, ocena: 4.225944245560473
Film: The Godfather: Part II, ocena: 4.146907937910189
Film: Cidade de Deus, ocena: 4.116538340205236
Film: The Dark Knight, ocena: 4.10413904093503
Film: 12 Angry Men, ocena: 4.103639627096175
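
The effect of b is easiest to see on numbers. The adjusted average is (vs + b * g_avg) / (n + b), where vs is the sum of a movie's ratings, n their count, and g_avg the global average. The figures below are made up for illustration, and `adjusted_average` is a hypothetical helper, not part of the class above.

```python
def adjusted_average(vs, n, g_avg, b):
    """Hypothetical helper: mean of n ratings (summing to vs),
    pulled toward g_avg as if b extra ratings of g_avg were added."""
    return (vs + b * g_avg) / (n + b)


g_avg = 3.5  # made-up global average

# A movie with a single 5.0 rating
single_b0 = adjusted_average(vs=5.0, n=1, g_avg=g_avg, b=0)
single_b100 = adjusted_average(vs=5.0, n=1, g_avg=g_avg, b=100)

# A movie with 1000 ratings that average 4.2
popular_b100 = adjusted_average(vs=4200.0, n=1000, g_avg=g_avg, b=100)

print(single_b0, single_b100, popular_b100)
```

With b=0 a single 5.0 rating is enough to reach the top of the list, which would explain the 5.0 scores in the first output above; with b=100 such movies fall back toward the global average and heavily rated movies come out ahead.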


# Recommending the most-viewed movies

In [30]:
class ViewsPredictor:
    def __init__(self):
        self.views_count = {}

    def fit(self, user_item_data):
        # Count the number of views for each movie
        self.views_count = user_item_data.data['movieID'].value_counts().to_dict()

    def predict(self, user_id):
        # Return the views count for each movie
        # This predictor ignores the user_id because it does not do personalized predictions
        return self.views_count

# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat')

views_pred = ViewsPredictor()
rec = Recommender(views_pred)
rec.fit(uim)
rec_items = rec.recommend(78, n=5, rec_seen=False)
for idmovie, val in rec_items:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

Film: The Lord of the Rings: The Fellowship of the Ring, ocena: 1576
Film: The Lord of the Rings: The Two Towers, ocena: 1528
Film: The Lord of the Rings: The Return of the King, ocena: 1457
Film: The Silence of the Lambs, ocena: 1431
Film: Shrek, ocena: 1404


# Predicting ratings from item similarity

In [35]:
class ItemBasedPredictor:
    def __init__(self, min_values=0, threshold=0):
        self.min_values = min_values
        self.threshold = threshold
        self.similarities = {}

    def fit(self, user_item_data):
        # Pivot table to create a matrix of users and movie ratings
        matrix = user_item_data.data.pivot_table(index='userID', columns='movieID', values='rating')
        
        # Calculate the global mean for normalization
        global_mean = user_item_data.data['rating'].mean()

        # Initialize dictionary for each movie
        movies = matrix.columns
        for movie in movies:
            self.similarities[movie] = {}

        # Calculate similarities for each pair of movies
        for i in range(len(movies)):
            for j in range(i+1, len(movies)):
                # Keep only the users who rated both movies
                common = matrix.loc[:, [movies[i], movies[j]]].dropna()

                # Check for minimum number of common users
                if len(common) < self.min_values:
                    continue

                # Center ratings on the global mean (note: the usual adjusted
                # cosine subtracts each user's own mean rating instead)
                adjusted_ratings1 = common[movies[i]] - global_mean
                adjusted_ratings2 = common[movies[j]] - global_mean

                # Compute similarity
                similarity = 1 - cosine(adjusted_ratings1, adjusted_ratings2)

                # Apply threshold; undefined (NaN) similarities also become 0
                if np.isnan(similarity) or similarity < self.threshold:
                    similarity = 0

                self.similarities[movies[i]][movies[j]] = similarity
                self.similarities[movies[j]][movies[i]] = similarity  # Symmetric

    def similarity(self, p1, p2):
        # Return the calculated similarity
        return self.similarities.get(p1, {}).get(p2, 0)

    def predict(self, user_id):
        # Placeholder: rating prediction from the similarities is not implemented yet,
        # so this returns no predictions and recommend() yields an empty list
        return {}

# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat', min_ratings=1000)
rp = ItemBasedPredictor()
rec = Recommender(rp)
rec.fit(uim)

print("Podobnost med filmoma 'Men in black'(1580) in 'Ghostbusters'(2716): ", rp.similarity(1580, 2716))
print("Podobnost med filmoma 'Men in black'(1580) in 'Schindler's List'(527): ", rp.similarity(1580, 527))
print("Podobnost med filmoma 'Men in black'(1580) in 'Independence day'(780): ", rp.similarity(1580, 780))

print("Predictions for 78: ")
rec_items = rec.recommend(78, n=15, rec_seen=False)
for idmovie, val in rec_items:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

Podobnost med filmoma 'Men in black'(1580) in 'Ghostbusters'(2716):  0.4547352985965746
Podobnost med filmoma 'Men in black'(1580) in 'Schindler's List'(527):  0
Podobnost med filmoma 'Men in black'(1580) in 'Independence day'(780):  0.5481157140454866
Predictions for 78: 
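
The predict method above is a placeholder, so no recommendations are printed. One common way to fill it in is a similarity-weighted average of the user's own ratings: for each item, sum similarity × rating over the other items the user has rated and divide by the sum of those similarities. The sketch below shows this on made-up data; `predict_item_based`, `user_ratings`, and `similarities` are hypothetical names, and this is not necessarily the formula the assignment intends.

```python
def predict_item_based(user_ratings, similarities, all_items):
    """Hypothetical sketch: similarity-weighted average of a user's ratings.

    user_ratings: dict item -> rating given by the user
    similarities: dict item -> dict item -> similarity (assumed non-negative)
    all_items:    iterable of all item IDs
    """
    predictions = {}
    for item in all_items:
        weighted_sum = 0.0
        weight_total = 0.0
        for rated_item, rating in user_ratings.items():
            if rated_item == item:
                continue  # an item does not vote for itself
            sim = similarities.get(item, {}).get(rated_item, 0)
            weighted_sum += sim * rating
            weight_total += sim
        # Only predict when at least one similar rated item exists
        if weight_total > 0:
            predictions[item] = weighted_sum / weight_total
    return predictions


# Made-up similarities between three items, and one user's ratings
similarities = {
    'A': {'B': 0.5, 'C': 0.0},
    'B': {'A': 0.5, 'C': 0.25},
    'C': {'A': 0.0, 'B': 0.25},
}
user_ratings = {'A': 4.0, 'B': 2.0}

pred = predict_item_based(user_ratings, similarities, ['A', 'B', 'C'])
print(pred)
```

Item C is similar only to B, so its prediction equals the user's rating of B. Items with no positive similarity to anything the user rated get no prediction at all, which is a design choice of this sketch rather than a requirement.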
