# Recommender System

Seminar assignment: building a recommender system.
By Samo Pritržnik

## Data

For this seminar assignment I will use the MovieLens data. The data are described in readme.txt.

In [2]:
# Imports used throughout the notebook (pandas reads the tab-separated .dat files)
import numpy as np
import matplotlib as mpl
import pandas as pd
from csv import DictReader
import pickle as pkl
import random
from scipy.spatial.distance import cosine


### Reading the ratings

In [6]:
class UserItemData:
    def __init__(self, path, from_date=None, to_date=None, min_ratings=None):
        self.data = pd.read_csv(path, delimiter='\t')
        self.process_data(from_date, to_date, min_ratings)

    def process_data(self, from_date, to_date, min_ratings):
        # Convert date columns to a single datetime column
        self.data['datetime'] = pd.to_datetime(self.data['date_year'].astype(str) + '-' +
                                               self.data['date_month'].astype(str).str.zfill(2) + '-' +
                                               self.data['date_day'].astype(str).str.zfill(2) + ' ' +
                                               self.data['date_hour'].astype(str).str.zfill(2) + ':' +
                                               self.data['date_minute'].astype(str).str.zfill(2) + ':' +
                                               self.data['date_second'].astype(str).str.zfill(2))

        # Filter by date range if specified
        if from_date:
            from_date = pd.to_datetime(from_date, dayfirst=True)
            self.data = self.data[self.data['datetime'] >= from_date]
        
        if to_date:
            to_date = pd.to_datetime(to_date, dayfirst=True)
            self.data = self.data[self.data['datetime'] <= to_date]
        
        # Filter by minimum ratings for each movie if specified
        if min_ratings:
            movie_counts = self.data['movieID'].value_counts()
            movies_with_min_ratings = movie_counts[movie_counts >= min_ratings].index
            self.data = self.data[self.data['movieID'].isin(movies_with_min_ratings)]

    def nratings(self):
        return len(self.data)

uim = UserItemData('podatki/user_ratedmovies.dat')
print(uim.nratings())

uim = UserItemData('podatki/user_ratedmovies.dat', from_date='12.1.2007', to_date='16.2.2008', min_ratings=100)
print(uim.nratings())

855598
73584
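
The two filters applied in process_data can be traced on a tiny table. This is a minimal sketch with made-up ratings; it assumes the same column names as the loader above (`userID`, `movieID`, `rating`, `datetime`) and does not read the real file.

```python
import pandas as pd

# Toy ratings table (made-up values) with the column names used by the loader.
ratings = pd.DataFrame({
    "userID":  [1, 2, 3, 1, 2, 1],
    "movieID": [10, 10, 10, 20, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
    "datetime": pd.to_datetime([
        "2007-01-11", "2007-01-12", "2007-06-01",
        "2006-03-01", "2008-02-16", "2008-02-17",
    ]),
})

# With dayfirst=True, '12.1.2007' is read as 12 January 2007, not 1 December.
from_date = pd.to_datetime("12.1.2007", dayfirst=True)
to_date = pd.to_datetime("16.2.2008", dayfirst=True)

# Step 1: keep ratings inside the closed interval [from_date, to_date].
in_range = ratings[(ratings["datetime"] >= from_date) & (ratings["datetime"] <= to_date)]

# Step 2: keep only movies that still have at least min_ratings ratings.
min_ratings = 2
counts = in_range["movieID"].value_counts()
kept = in_range[in_range["movieID"].isin(counts[counts >= min_ratings].index)]

print(from_date.date())                   # 2007-01-12
print(kept["movieID"].unique().tolist())  # [10]
print(len(kept))                          # 2
```

Because the minimum-ratings filter runs after the date filter, movie 20 is dropped here: only one of its ratings falls inside the date range. Note also that a date-only to_date is parsed as midnight, so under this comparison ratings made later on that day fall outside the range.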


### Reading the movies

In [4]:
class MovieData:
    def __init__(self, path):
        self.data = pd.read_csv(path, delimiter='\t', encoding='latin1')

    def get_title(self, movie_id):
        return self.data[self.data['id'] == movie_id]['title'].values[0]
    
md = MovieData('podatki/movies.dat')
print(md.get_title(499))
        

Mr. Wonderful


## Predictor

We use the word "predictor" for classes that estimate, in some way, what rating a given user would give to the movies (or products) available. These classes have a method fit(self, X), where X is of type UserItemData, and a method predict(self, user_id), where user_id is the user's ID. The fit method is used to train the model, and predict to compute the predicted values for the given user.
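
The interface described above can be sketched as follows. This is only an illustration and not part of the assignment code: `Predictor` and `ConstantPredictor` are hypothetical names, and a plain list of movie IDs stands in for UserItemData.

```python
from typing import Dict, Iterable, Protocol


class Predictor(Protocol):
    """Hypothetical description of the interface every predictor follows."""

    def fit(self, X) -> None: ...

    def predict(self, user_id: int) -> Dict[int, float]: ...


class ConstantPredictor:
    """Toy predictor (an assumption for illustration): every movie gets the same score."""

    def __init__(self, value: float):
        self.value = value
        self.movie_ids = []

    def fit(self, X: Iterable[int]) -> None:
        # "Training" here only means remembering which movies exist.
        self.movie_ids = list(X)

    def predict(self, user_id: int) -> Dict[int, float]:
        # Map every known movie ID to the predicted value for this user.
        return {movie_id: self.value for movie_id in self.movie_ids}


cp = ConstantPredictor(3.0)
cp.fit([1, 3, 20])
print(cp.predict(78))  # {1: 3.0, 3: 3.0, 20: 3.0}
```

Returning a dictionary keyed by movie ID lets any predictor be combined with the same downstream code, regardless of how the values were computed.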

### Random predictor

In [14]:
class RandomPredictor:
    def __init__(self, min_rating, max_rating):
        self.min_rating = min_rating
        self.max_rating = max_rating

    def fit(self, user_item_data):
        self.user_item_data = user_item_data

    def predict(self, user_id):
        movie_ids = self.user_item_data.data['movieID'].unique()
        return {movie_id: round(random.uniform(self.min_rating, self.max_rating)) for movie_id in movie_ids}

# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat')
rp = RandomPredictor(1, 5)
rp.fit(uim)
pred = rp.predict(78)
print(type(pred))
items = [1, 3, 20, 50, 100]
for item in items:
    print("Film: {}, ocena: {}".format(md.get_title(item), pred[item]))

<class 'dict'>
Film: Toy story, ocena: 3
Film: Grumpy Old Men, ocena: 3
Film: Money Train, ocena: 3
Film: The Usual Suspects, ocena: 3
Film: City Hall, ocena: 5


### Recommending

In [17]:
class Recommender:
    def __init__(self, predictor):
        self.predictor = predictor

    def fit(self, X):
        self.X = X
        self.predictor.fit(X)

    def recommend(self, userID, n=10, rec_seen=True):
        predictions = self.predictor.predict(userID)
        if not rec_seen:
            # Filter out movies the user has already rated
            rated_movies = set(self.X.data[self.X.data['userID'] == userID]['movieID'])
            predictions = {movie_id: rating for movie_id, rating in predictions.items() if movie_id not in rated_movies}

        # Sort the predictions and return the top n
        sorted_predictions = sorted(predictions.items(), key=lambda x: x[1], reverse=True)
        return sorted_predictions[:n]

# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat')
rp = RandomPredictor(1, 5)
rec = Recommender(rp)
rec.fit(uim)
rec_items = rec.recommend(78, n=10, rec_seen=False)
for idmovie, val in rec_items:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

Film: Congo, ocena: 5
Film: Beverly Hills Cop III, ocena: 5
Film: The Abyss, ocena: 5
Film: Tears of the Sun, ocena: 5
Film: King of Kings, ocena: 5
Film: Hudson Hawk, ocena: 5
Film: The Haunted Mansion, ocena: 5
Film: 50 First Dates, ocena: 5
Film: 13 Going on 30, ocena: 5
Film: Fat Albert, ocena: 5
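
The selection step in recommend (drop already-rated movies, then take the n highest scores) can be sketched on a plain dictionary. The scores and the set of rated movies below are made up. `heapq.nlargest` is used here as an alternative to a full sort; it is not what the class above does, though it selects the same top-n items.

```python
import heapq

# movie ID -> predicted score (made-up values)
predictions = {1: 3.2, 2: 4.8, 3: 4.1, 4: 2.0, 5: 4.8}
already_rated = {2}  # movies the user has already rated

# rec_seen=False: remove movies the user has already rated.
candidates = {m: s for m, s in predictions.items() if m not in already_rated}

# Take the 3 best-scored candidates without sorting the whole dictionary.
top = heapq.nlargest(3, candidates.items(), key=lambda item: item[1])
print(top)  # [(5, 4.8), (3, 4.1), (1, 3.2)]
```

With a random predictor that rounds to whole ratings, many movies share the top score, which is why every recommendation above has a score of 5.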


### Predicting with the average

In [18]:
class AveragePredictor:
    def __init__(self, b):
        self.b = b
        self.avg_ratings = {}

    def fit(self, user_item_data):
        # Calculate the global average rating
        g_avg = user_item_data.data['rating'].mean()

        # Per-movie sum (vs) and number (n) of ratings, computed in one pass;
        # sort=False keeps movies in order of first appearance
        stats = user_item_data.data.groupby('movieID', sort=False)['rating'].agg(['sum', 'size'])

        # Adjusted average: (vs + b * g_avg) / (n + b)
        adjusted = (stats['sum'] + self.b * g_avg) / (stats['size'] + self.b)
        self.avg_ratings = adjusted.to_dict()

    def predict(self, user_id):
        # Return the calculated average ratings
        # This predictor ignores the user_id because it does not do personalized predictions
        return self.avg_ratings

# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat')

# Using AveragePredictor with b=0
avg_pred = AveragePredictor(b=0)
rec = Recommender(avg_pred)
rec.fit(uim)
rec_items = rec.recommend(78, n=5, rec_seen=False)
for idmovie, val in rec_items:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

# Using AveragePredictor with b=100
avg_pred_b100 = AveragePredictor(b=100)
rec_b100 = Recommender(avg_pred_b100)
rec_b100.fit(uim)
rec_items_b100 = rec_b100.recommend(78, n=5, rec_seen=False)
for idmovie, val in rec_items_b100:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

Film: Brother Minister: The Assassination of Malcolm X, ocena: 5.0
Film: Synthetic Pleasures, ocena: 5.0
Film: Adam & Steve, ocena: 5.0
Film: Gabbeh, ocena: 5.0
Film: Eve and the Fire Horse, ocena: 5.0
Film: The Usual Suspects, ocena: 4.225944245560473
Film: The Godfather: Part II, ocena: 4.146907937910189
Film: Cidade de Deus, ocena: 4.116538340205236
Film: The Dark Knight, ocena: 4.10413904093503
Film: 12 Angry Men, ocena: 4.103639627096175
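
The effect of the parameter b in the formula (vs + b * g_avg) / (n + b) can be checked by hand. The sketch below uses made-up ratings and an assumed global average of 3.5; `shrunk_average` is a hypothetical helper, not part of the class above.

```python
def shrunk_average(ratings, b, global_avg):
    """(sum of ratings + b * global average) / (number of ratings + b)."""
    return (sum(ratings) + b * global_avg) / (len(ratings) + b)


global_avg = 3.5        # assumed global mean, for illustration only
obscure = [5.0]         # a movie with a single enthusiastic rating
popular = [4.5] * 1000  # a movie with many consistently high ratings

for b in (0, 100):
    print(b,
          round(shrunk_average(obscure, b, global_avg), 3),
          round(shrunk_average(popular, b, global_avg), 3))
# 0 5.0 4.5
# 100 3.515 4.409
```

With b=0 the single 5.0 rating wins, which matches the first list above: little-known movies with a perfect average. With b=100 such movies are pulled toward the global average, so widely rated movies rise to the top.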


### Recommending the most-watched movies

In [30]:
class ViewsPredictor:
    def __init__(self):
        self.views_count = {}

    def fit(self, user_item_data):
        # Count the number of views for each movie
        self.views_count = user_item_data.data['movieID'].value_counts().to_dict()

    def predict(self, user_id):
        # Return the views count for each movie
        # This predictor ignores the user_id because it does not do personalized predictions
        return self.views_count

# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat')

views_pred = ViewsPredictor()
rec = Recommender(views_pred)
rec.fit(uim)
rec_items = rec.recommend(78, n=5, rec_seen=False)
for idmovie, val in rec_items:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

Film: The Lord of the Rings: The Fellowship of the Ring, ocena: 1576
Film: The Lord of the Rings: The Two Towers, ocena: 1528
Film: The Lord of the Rings: The Return of the King, ocena: 1457
Film: The Silence of the Lambs, ocena: 1431
Film: Shrek, ocena: 1404


## Predicting ratings using item similarity

In [21]:
class ItemBasedPredictor:
    def __init__(self, min_values=0, threshold=0):
        self.min_values = min_values
        self.threshold = threshold
        self.similarities = {}
        self.user_item_data = None  # Set in fit(); needed later by predict()


    def fit(self, user_item_data):
        self.user_item_data = user_item_data  # Keep the ratings for predict()

        # Pivot table to create a matrix of users and movie ratings
        matrix = user_item_data.data.pivot_table(index='userID', columns='movieID', values='rating')
        
        # Calculate the global mean for normalization
        global_mean = user_item_data.data['rating'].mean()

        # Initialize dictionary for each movie
        movies = matrix.columns
        for movie in movies:
            self.similarities[movie] = {}

        # Calculate similarities for each pair of movies
        for i in range(len(movies)):
            for j in range(i+1, len(movies)):
                movie1_id = movies[i]
                movie2_id = movies[j]

                # Keep only users who rated both movies
                common = matrix.loc[:, [movie1_id, movie2_id]].dropna()

                # Skip pairs with no common users (the cosine would be undefined)
                # or with fewer common users than required
                if len(common) < max(self.min_values, 1):
                    continue

                # Adjust ratings by subtracting the global mean
                adjusted_ratings1 = common[movie1_id] - global_mean
                adjusted_ratings2 = common[movie2_id] - global_mean

                # Compute similarity
                similarity = 1 - cosine(adjusted_ratings1, adjusted_ratings2)

                # Apply threshold
                if similarity < self.threshold:
                    similarity = 0

                self.similarities[movie1_id][movie2_id] = similarity
                self.similarities[movie2_id][movie1_id] = similarity  # Symmetric

    def similarity(self, p1, p2):
        # Return the calculated similarity
        return self.similarities.get(p1, {}).get(p2, 0)

    def predict(self, user_id):
        user_ratings = self.user_item_data.data[self.user_item_data.data['userID'] == user_id]

        # Initialize a dictionary to store the predicted ratings
        predictions = {}

        # Iterate over all movies in the dataset
        for movie in self.similarities.keys():
            # Initialize the sum of similarities and weighted ratings
            sim_sum = 0
            weighted_ratings_sum = 0

            # Iterate over movies the user has rated
            for rated_movie, row in user_ratings.iterrows():
                rated_movie_id = row['movieID']
                rated_movie_rating = row['rating']

                # Check if similarity exists between the rated movie and the current movie
                if rated_movie_id in self.similarities[movie]:
                    similarity = self.similarities[movie][rated_movie_id]
                    sim_sum += similarity
                    weighted_ratings_sum += similarity * rated_movie_rating

            # Calculate the predicted rating
            if sim_sum > 0:
                predicted_rating = weighted_ratings_sum / sim_sum
                predictions[movie] = predicted_rating
            else:
                predictions[movie] = 0  # Default prediction when no similar movies are found

        return predictions


# Example usage
md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat', min_ratings=1000)
rp = ItemBasedPredictor()
rec = Recommender(rp)
rec.fit(uim)

print("Podobnost med filmoma 'Men in black'(1580) in 'Ghostbusters'(2716): ", rp.similarity(1580, 2716))
print("Podobnost med filmoma 'Men in black'(1580) in 'Schindler's List'(527): ", rp.similarity(1580, 527))
print("Podobnost med filmoma 'Men in black'(1580) in 'Independence day'(780): ", rp.similarity(1580, 780))

print("Predictions for 78: ")
rec_items = rec.recommend(78, n=15, rec_seen=False)
for idmovie, val in rec_items:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

Podobnost med filmoma 'Men in black'(1580) in 'Ghostbusters'(2716):  0.4547352985965746
Podobnost med filmoma 'Men in black'(1580) in 'Schindler's List'(527):  0
Podobnost med filmoma 'Men in black'(1580) in 'Independence day'(780):  0.5481157140454866
Predictions for 78: 
Film: The Usual Suspects, ocena: 4.228043285847402
Film: Shichinin no samurai, ocena: 4.177395447109325
Film: The Silence of the Lambs, ocena: 4.135398935441527
Film: Sin City, ocena: 4.106810339647614
Film: The Lord of the Rings: The Fellowship of the Ring, ocena: 4.043119228683461
Film: The Incredibles, ocena: 4.029512949977216
Film: The Lord of the Rings: The Return of the King, ocena: 3.9912446413928375
Film: Batman Begins, ocena: 3.975295132734058
Film: Good Will Hunting, ocena: 3.958036740972706
Film: The Lord of the Rings: The Two Towers, ocena: 3.9466268423676336
Film: A Beautiful Mind, ocena: 3.9371294773749
Film: Rain Man, ocena: 3.933872547332966
Film: Die Hard, ocena: 3.925370224555466
Film: Indiana Jones
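

The two steps of the item-based predictor can be sketched without pandas: the similarity is the cosine between two movies' ratings over their common users, and the prediction is a similarity-weighted average of the user's own ratings. The sketch mirrors the normalization used in fit above (subtracting the global mean); the data, the assumed global mean of 3.0, and the function names are made up for illustration.

```python
import math


def item_similarity(r1, r2, mean, threshold=0.0):
    """Cosine similarity of two movies over users who rated both."""
    common = sorted(set(r1) & set(r2))
    x = [r1[u] - mean for u in common]
    y = [r2[u] - mean for u in common]
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:  # no common users, or a zero vector
        return 0.0
    sim = sum(a * b for a, b in zip(x, y)) / norm
    return sim if sim >= threshold else 0.0


def predict_rating(user_ratings, sims):
    """Similarity-weighted average of the ratings the user has given."""
    num = sum(sims.get(m, 0.0) * r for m, r in user_ratings.items())
    den = sum(sims.get(m, 0.0) for m in user_ratings)
    return num / den if den > 0 else 0.0


# movie -> {user ID -> rating}; user 4 rated only movie B and is ignored
movie_a = {1: 5.0, 2: 4.0, 3: 1.0}
movie_b = {1: 4.0, 2: 5.0, 3: 2.0, 4: 3.0}
sim_ab = item_similarity(movie_a, movie_b, mean=3.0)
print(round(sim_ab, 4))  # 0.8165

# The user rated A with 5 and B with 3; the target movie is more similar to A.
print(predict_rating({"A": 5.0, "B": 3.0}, {"A": 0.75, "B": 0.25}))  # 4.5
```

Negative similarities are clipped to 0 by the threshold, which is consistent with the similarity of exactly 0 reported above for one of the pairs.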

### Most similar movies

In [22]:
def print_top_similar_pairs(predictor, movie_data, top_n=20):
    # Flatten the similarity matrix into a list of tuples (movie1, movie2, similarity)
    similarity_list = []
    for movie1 in predictor.similarities:
        for movie2, similarity in predictor.similarities[movie1].items():
            if movie1 < movie2:  # Ensure each pair is only counted once
                similarity_list.append((movie1, movie2, similarity))

    # Sort the list by similarity in descending order
    sorted_similarities = sorted(similarity_list, key=lambda x: x[2], reverse=True)

    # Print the top N pairs
    for movie_pair in sorted_similarities[:top_n]:
        movie1_title = movie_data.get_title(movie_pair[0])
        movie2_title = movie_data.get_title(movie_pair[1])
        print(f"Film1: {movie1_title}, Film2: {movie2_title}, podobnost: {movie_pair[2]}")

# Example usage
# Assuming md is an instance of MovieData and rp is an instance of ItemBasedPredictor
print_top_similar_pairs(rp, md, top_n=20)

Film1: The Lord of the Rings: The Two Towers, Film2: The Lord of the Rings: The Return of the King, podobnost: 0.8844354394609556
Film1: The Lord of the Rings: The Fellowship of the Ring, Film2: The Lord of the Rings: The Two Towers, podobnost: 0.866171197727726
Film1: The Lord of the Rings: The Fellowship of the Ring, Film2: The Lord of the Rings: The Return of the King, podobnost: 0.8560599049871471
Film1: Kill Bill: Vol. 2, Film2: Kill Bill: Vol. 2, podobnost: 0.7996686702947231
Film1: Star Wars, Film2: Star Wars: Episode V - The Empire Strikes Back, podobnost: 0.7809640630519074
Film1: Star Wars: Episode V - The Empire Strikes Back, Film2: Star Wars: Episode VI - Return of the Jedi, podobnost: 0.7195582469322426
Film1: Ace Ventura: Pet Detective, Film2: The Mask, podobnost: 0.7150004096605594
Film1: Star Wars, Film2: Star Wars: Episode VI - Return of the Jedi, podobnost: 0.6900018201651158
Film1: Speed, Film2: Pretty Woman, podobnost: 0.6369561144208051
Film1: The Mask, Film2: Mrs.

## Recommending based on currently viewed content

In [26]:
class ItemBasedPredictor:
    def __init__(self, min_values=0, threshold=0):
        self.min_values = min_values
        self.threshold = threshold
        self.similarities = {}
        self.user_item_data = None  # Set in fit(); needed later by predict()


    def fit(self, user_item_data):
        self.user_item_data = user_item_data  # Keep the ratings for predict()

        # Pivot table to create a matrix of users and movie ratings
        matrix = user_item_data.data.pivot_table(index='userID', columns='movieID', values='rating')
        
        # Calculate the global mean for normalization
        global_mean = user_item_data.data['rating'].mean()

        # Initialize dictionary for each movie
        movies = matrix.columns
        for movie in movies:
            self.similarities[movie] = {}

        # Calculate similarities for each pair of movies
        for i in range(len(movies)):
            for j in range(i+1, len(movies)):
                movie1_id = movies[i]
                movie2_id = movies[j]

                # Keep only users who rated both movies
                common = matrix.loc[:, [movie1_id, movie2_id]].dropna()

                # Skip pairs with no common users (the cosine would be undefined)
                # or with fewer common users than required
                if len(common) < max(self.min_values, 1):
                    continue

                # Adjust ratings by subtracting the global mean
                adjusted_ratings1 = common[movie1_id] - global_mean
                adjusted_ratings2 = common[movie2_id] - global_mean

                # Compute similarity
                similarity = 1 - cosine(adjusted_ratings1, adjusted_ratings2)

                # Apply threshold
                if similarity < self.threshold:
                    similarity = 0

                self.similarities[movie1_id][movie2_id] = similarity
                self.similarities[movie2_id][movie1_id] = similarity  # Symmetric

    def similarity(self, p1, p2):
        # Return the calculated similarity
        return self.similarities.get(p1, {}).get(p2, 0)

    def predict(self, user_id):
        user_ratings = self.user_item_data.data[self.user_item_data.data['userID'] == user_id]

        # Initialize a dictionary to store the predicted ratings
        predictions = {}

        # Iterate over all movies in the dataset
        for movie in self.similarities.keys():
            # Initialize the sum of similarities and weighted ratings
            sim_sum = 0
            weighted_ratings_sum = 0

            # Iterate over movies the user has rated
            for rated_movie, row in user_ratings.iterrows():
                rated_movie_id = row['movieID']
                rated_movie_rating = row['rating']

                # Check if similarity exists between the rated movie and the current movie
                if rated_movie_id in self.similarities[movie]:
                    similarity = self.similarities[movie][rated_movie_id]
                    sim_sum += similarity
                    weighted_ratings_sum += similarity * rated_movie_rating

            # Calculate the predicted rating
            if sim_sum > 0:
                predicted_rating = weighted_ratings_sum / sim_sum
                predictions[movie] = predicted_rating
            else:
                predictions[movie] = 0  # Default prediction when no similar movies are found

        return predictions
    
    def similarItems(self, item, n):
        # Check if the item exists in the similarity matrix
        if item not in self.similarities:
            return []

        # Retrieve all items and their similarity scores to the given item
        similar_items = self.similarities[item].items()

        # Sort the items based on similarity scores in descending order
        sorted_items = sorted(similar_items, key=lambda x: x[1], reverse=True)

        # Return the top n items
        return sorted_items[:n]


md = MovieData('podatki/movies.dat')
uim = UserItemData('podatki/user_ratedmovies.dat', min_ratings=1000)
rp = ItemBasedPredictor()
rec = Recommender(rp)
rec.fit(uim)

# Assuming the predictor has been fitted with user-item data
rec_items = rp.similarItems(4993, 10)
print('Filmi podobni "The Lord of the Rings: The Fellowship of the Ring":')
for idmovie, val in rec_items:
    print("Film: {}, ocena: {}".format(md.get_title(idmovie), val))

Filmi podobni "The Lord of the Rings: The Fellowship of the Ring":
Film: The Lord of the Rings: The Two Towers, ocena: 0.866171197727726
Film: The Lord of the Rings: The Return of the King, ocena: 0.8560599049871471
Film: Star Wars: Episode V - The Empire Strikes Back, ocena: 0.419423520538885
Film: Star Wars, ocena: 0.40500493964569284
Film: The Matrix, ocena: 0.39793497278477485
Film: Raiders of the Lost Ark, ocena: 0.38860539594709076
Film: Star Wars: Episode VI - Return of the Jedi, ocena: 0.3550792764455102
Film: Schindler's List, ocena: 0.3383137916061756
Film: The Usual Suspects, ocena: 0.3353498207221435
Film: Indiana Jones and the Last Crusade, ocena: 0.31770306842834906


### My recommendations

63200236 - user