The task is to build a Collaborative Filtering (CF) recommender system for the provided dataset.
This kind of recommender relies on the "wisdom of the crowd" principle: a recommendation for a given user is derived from the behavior of other users with similar preferences.
The data is stored in the archive «задание6».
The dataset 'BX-Book-Ratings.csv' contains the following fields: "User-ID" is the user, "ISBN" is the book code, "Book-Rating" is the rating that this user gave to this book.
The table 'BX-Books.csv' maps each book code to its actual title. The table 'BX-Users.csv' contains data about the users.
Based on the user ratings, generate book recommendations for all users.
P.S.: to open the .csv files, use pd.read_csv('filename', encoding='latin1', on_bad_lines='skip', sep=';')

In [1]:
import pandas as pd
import numpy as np
import random



In [2]:
user_data = pd.read_csv('BX-Users.csv', encoding ='latin1', on_bad_lines='skip', sep=';')
book_data = pd.read_csv('BX-Books.csv', encoding ='latin1', on_bad_lines='skip', sep=';')
rating_data = pd.read_csv('BX-Book-Ratings.csv', encoding ='latin1', on_bad_lines='skip', sep=';')



In [3]:
rating_data = pd.merge(rating_data, book_data[['Book-Title', 'ISBN']], on='ISBN', how='inner')
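An inner merge silently drops every rating whose ISBN has no entry in the books table. The sketch below shows one way to count such rows before merging, using the `indicator=True` option of `merge`. The frames `ratings_toy` and `books_toy` are made-up stand-ins for `rating_data` and `book_data`, not the real data.

```python
import pandas as pd

# Toy stand-ins for rating_data and book_data (values invented for illustration)
ratings_toy = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "ISBN": ["A", "A", "Z"],
    "Book-Rating": [0, 5, 7],
})
books_toy = pd.DataFrame({"ISBN": ["A", "B"], "Book-Title": ["Book A", "Book B"]})

# A left merge with indicator=True adds a "_merge" column that marks
# rows for which no matching title was found ("left_only")
checked = ratings_toy.merge(books_toy, on="ISBN", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
n_unmatched = len(unmatched)

print(f"{n_unmatched} of {len(ratings_toy)} ratings have no title "
      "and would be dropped by how='inner'")
```

On the real data the same check would use `rating_data` and `book_data[['Book-Title', 'ISBN']]` in place of the toy frames.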

In [4]:
rating_data.head(3)

,User-ID,ISBN,Book-Rating,Book-Title
0,276725,034545104X,0,Flesh Tones: A Novel
1,2313,034545104X,5,Flesh Tones: A Novel
2,6543,034545104X,0,Flesh Tones: A Novel
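Two of the three rows shown above have a rating of 0. In the Book-Crossing data a 0 is usually described as an implicit interaction (the user has the book but gave no score) rather than an explicit rating of zero. Treating that as an assumption, the sketch below measures the share of zero ratings and separates the explicit ones. The notebook itself keeps the zeros, so this is an optional check and not part of the original pipeline. The name `toy` is hypothetical.

```python
import pandas as pd

# The three rows printed by rating_data.head(3), rebuilt as a toy frame
toy = pd.DataFrame({
    "User-ID": [276725, 2313, 6543],
    "ISBN": ["034545104X"] * 3,
    "Book-Rating": [0, 5, 0],
})

# Share of ratings equal to 0 (assumed here to be implicit feedback)
share_implicit = (toy["Book-Rating"] == 0).mean()

# Keep only explicit ratings (1 to 10)
explicit = toy[toy["Book-Rating"] > 0]

print(f"implicit share: {share_implicit:.2f}, explicit rows: {len(explicit)}")
```

If zeros are kept, they pull each user's mean rating down, which affects the mean-centering step later in the notebook.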


In [5]:
rating_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031136 entries, 0 to 1031135
Data columns (total 4 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1031136 non-null  int64 
 1   ISBN         1031136 non-null  object
 2   Book-Rating  1031136 non-null  int64 
 3   Book-Title   1031136 non-null  object
dtypes: int64(2), object(2)
memory usage: 39.3+ MB


In [6]:
# Keep only books with more than 100 ratings to reduce memory use
agg_ratings = rating_data.groupby('ISBN').agg(mean_rating = ('Book-Rating', 'mean'), 
                                              number_of_ratings = ('Book-Rating', 'count')).reset_index()

agg_ratings_popular = agg_ratings[agg_ratings['number_of_ratings']>100]
agg_ratings_popular.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 717 entries, 1759 to 240806
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ISBN               717 non-null    object 
 1   mean_rating        717 non-null    float64
 2   number_of_ratings  717 non-null    int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 22.4+ KB


In [7]:
# Keep only the ratings that belong to the "popular" books selected above
df_popular = pd.merge(rating_data, agg_ratings_popular[['ISBN']], on='ISBN', how='inner')
df_popular.info()
# The row count below, compared with rating_data.info() above, confirms that the filter was applied

<class 'pandas.core.frame.DataFrame'>
Int64Index: 136423 entries, 0 to 136422
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   User-ID      136423 non-null  int64 
 1   ISBN         136423 non-null  object
 2   Book-Rating  136423 non-null  int64 
 3   Book-Title   136423 non-null  object
dtypes: int64(2), object(2)
memory usage: 5.2+ MB


In [8]:
# Pivot the ratings into a user-item matrix: rows are users, columns are book ISBNs
matrix = df_popular.pivot_table(index='User-ID', columns='ISBN', values = 'Book-Rating')
matrix.head()

User-ID,002542730X,0060008032,0060096195,006016848X,0060173289,0060175400,006019491X,0060199652,0060391626,0060392452,...,1558744630,1558745157,1559029838,1573225517,1573225789,1573227331,1573229326,1573229571,1592400876,1878424319
9,,,,,,,,,,,...,,,,,,,,,,
14,,,,,,,,,,,...,,,,,,,,,,
16,,,,,,,,,,,...,,,,,,,,,,
26,,,,,,,,,,,...,,,,,,,,,,
39,,,,,,,,,,,...,,,,,,,,,,
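The matrix above is almost entirely NaN because most users rated only a few of the popular books. A small sketch of two properties of `pivot_table` that matter here: duplicate (user, book) pairs are averaged by the default `aggfunc='mean'`, and unrated pairs become NaN. The toy frame and the `density` measure are illustrative assumptions.

```python
import pandas as pd

# Toy ratings: user 1 rated book A twice, user 1 never rated book B
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 2],
    "ISBN": ["A", "A", "A", "B"],
    "Book-Rating": [4, 8, 10, 6],
})

m = toy.pivot_table(index="User-ID", columns="ISBN", values="Book-Rating")

# Fraction of user-book cells that actually hold a rating
density = m.notna().to_numpy().mean()

print(m)
print(f"density: {density:.2f}")
```

Running the same density calculation on `matrix` gives a quick sense of how sparse the real user-item matrix is.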


In [9]:
# Normalize the ratings by subtracting each user's mean rating, because users rate on different scales.
# A book rated above the user's average gets a positive value; one rated below it gets a negative value.
matrix_norm = matrix.subtract(matrix.mean(axis=1), axis='rows')
matrix_norm

User-ID,002542730X,0060008032,0060096195,006016848X,0060173289,0060175400,006019491X,0060199652,0060391626,0060392452,...,1558744630,1558745157,1559029838,1573225517,1573225789,1573227331,1573229326,1573229571,1592400876,1878424319
9,,,,,,,,,,,...,,,,,,,,,,
14,,,,,,,,,,,...,,,,,,,,,,
16,,,,,,,,,,,...,,,,,,,,,,
26,,,,,,,,,,,...,,,,,,,,,,
39,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
278832,,,,,,,,,,,...,,,,,,,,,,
278836,,,,,,,,,,,...,,,,,,,,,,
278843,,,,,5.133333,,,,,,...,,,,,,,,,,
278844,,,,,,,,,,,...,,,,,,,,,,
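The effect of the mean-centering step is easier to see on a tiny example. In the sketch below (toy values, hypothetical names `m` and `m_norm`), user 1 rates generously and user 2 rates harshly, yet after subtracting each user's own mean both show the same preference for book A.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: user 1 uses the top of the scale, user 2 the bottom
m = pd.DataFrame(
    {"A": [10.0, 4.0], "B": [8.0, np.nan], "C": [np.nan, 2.0]},
    index=pd.Index([1, 2], name="User-ID"),
)

# Row means skip NaN, so each user's mean covers only the books they rated
user_means = m.mean(axis=1)

# Subtract each user's mean from that user's row; NaN cells stay NaN
m_norm = m.subtract(user_means, axis=0)

print(m_norm)
```

Here user 1 has mean 9 and user 2 has mean 3, so book A ends up at +1 for both of them.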


In [10]:
# User similarity is calculated through Pearson's correlation
user_similarity = matrix_norm.T.corr()
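`DataFrame.corr` computes Pearson's correlation for each pair of users over the books both of them rated. With only two co-rated books the coefficient is always +1 or -1, which can make barely overlapping users look highly similar. The sketch below, on toy data, shows this and how the `min_periods` argument of `corr` can require a minimum overlap. The threshold of 3 is an arbitrary assumption for illustration, not a value from the notebook.

```python
import numpy as np
import pandas as pd

# Toy mean-centered ratings: users 1 and 2 share three books, users 1 and 3 only two
m_norm = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [-1.0, -2.0, -3.0], "C": [0.0, 0.0, np.nan]},
    index=pd.Index([1, 2, 3], name="User-ID"),
)

# Same call as in the notebook: correlate users (columns of the transposed matrix)
sim = m_norm.T.corr()

# Require at least 3 co-rated books before reporting a correlation
sim_strict = m_norm.T.corr(min_periods=3)

print(sim.loc[1, [2, 3]])
print(sim_strict.loc[1, [2, 3]])
```

With `min_periods=3` the pair (1, 3) becomes NaN and would no longer pass a similarity threshold, while the pair (1, 2) keeps its correlation.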

In [11]:
user_similarity

User-ID,9,14,16,26,39,42,44,51,67,75,...,278800,278807,278813,278819,278828,278832,278836,278843,278844,278854
9,1.0,,,,,,,,,,...,,,,,,,,,,
14,,,,,,,,,,,...,,,,,,,,,,
16,,,1.0,,,,,,,,...,,,,,,,,,,
26,,,,1.0,,,,,,,...,,,,,,,,,,
39,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
278832,,,,,,,,,,,...,,,,,,,,,,
278836,,,,,,,,,,,...,,,,,,,,,,
278843,,,,,,,,,,,...,,,,,,,,1.0,,
278844,,,,,,,,,,,...,,,,,,,,,,


In [120]:
def find_recommendations(picked_userid: int) -> None:
    """
    Print book recommendations for the user with the given ID.
    """
    # Minimum Pearson correlation for another user to count as similar
    user_similarity_threshold = 0.3

    # Number of books to recommend
    number_of_top = 10

    # Users whose similarity to the picked user exceeds the threshold, excluding the user themselves
    similar_users = user_similarity[user_similarity[picked_userid] > user_similarity_threshold][picked_userid]
    similar_users = similar_users[similar_users.index != picked_userid]

    # Books the picked user has already rated; these are excluded from the recommendations
    picked_user_read = matrix_norm[matrix_norm.index == picked_userid].dropna(axis=1, how='all')

    if similar_users.empty:
        # No similar users found: fall back to a random selection of unread books
        books_for_rand_choice_matrix = matrix_norm.drop(picked_user_read.columns, axis=1, errors='ignore')
        books_for_rand_choice_list = list(books_for_rand_choice_matrix.columns)
        # random.sample draws without replacement, so no book is picked twice
        books_recommendation_list = random.sample(books_for_rand_choice_list,
                                                  k=min(number_of_top, len(books_for_rand_choice_list)))
        books_recommendation_list = pd.DataFrame(books_recommendation_list, columns=['ISBN']).\
                                    merge(book_data[['ISBN', 'Book-Title']])['Book-Title']

        print("Congratulations, you are a unique person! We can offer you a random selection of books "
              f"that you have not read yet: \n{books_recommendation_list}")
    else:
        # Ratings given by similar users, limited to books at least one of them rated
        similar_user_books = matrix_norm[matrix_norm.index.isin(similar_users.index)].dropna(axis=1, how='all')
        similar_user_books = similar_user_books.drop(picked_user_read.columns, axis=1, errors='ignore')

        item_score = {}
        # Loop through the candidate books
        for i in similar_user_books.columns:
            # Get the ratings for book i
            book_rating = similar_user_books[i]
            # Running sum of similarity-weighted ratings
            total = 0
            # Number of similar users who rated the book
            count = 0
            # Loop through similar users
            for u in similar_users.index:
                # Only users who rated the book contribute
                if not pd.isna(book_rating[u]):
                    # Score is the user's similarity multiplied by their normalized rating
                    score = similar_users[u] * book_rating[u]
                    total += score
                    count += 1
            # Average score for the book (count >= 1 because all-NaN columns were dropped)
            item_score[i] = total / count
        # Convert the dictionary to a pandas dataframe
        item_score = pd.DataFrame(item_score.items(), columns=['ISBN', 'Book-Score'])

        # Sort the books by score and keep the top ones
        ranked_item_score = item_score.sort_values(by='Book-Score', ascending=False)
        books_recommendation_matrix = ranked_item_score.head(number_of_top)
        books_recommendation_list = books_recommendation_matrix.\
                                    merge(book_data[['ISBN', 'Book-Title']])[['Book-Title', 'Book-Score']]

        print(f"Based on other users' ratings, we recommend that you read: \n{books_recommendation_list}")
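The nested loop over books and similar users in the function above can also be written as a single pandas expression. The sketch below is one possible vectorized equivalent, shown on toy data: each similar user's row is multiplied by that user's similarity, then averaged per book while skipping NaN. The toy values are assumptions; `similar_users` and `similar_user_books` mirror the names used in the function.

```python
import numpy as np
import pandas as pd

# Toy inputs: similarity of two neighbours and their mean-centered ratings
similar_users = pd.Series({10: 0.5, 20: 1.0})
similar_user_books = pd.DataFrame(
    {"A": [2.0, 1.0], "B": [np.nan, -1.0]},
    index=[10, 20],
)

# Multiply each user's row by that user's similarity (aligned on the index),
# then average per book; mean() ignores NaN, matching the loop's total / count
item_score = similar_user_books.mul(similar_users, axis=0).mean(axis=0)

ranked = item_score.sort_values(ascending=False)
print(ranked)
```

Like the loop, this averages over the number of raters and not over the sum of their similarities, so it is a plain mean of weighted ratings.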
        


In [122]:
find_recommendations(278844)

Congratulations you are a unique person! We can offer you a random selection of books         that you have not read yet: 
0                         Daughter of Fortune: A Novel
1                The Sweet Potato Queens' Book of Love
2       Harry Potter and the Sorcerer's Stone (Book 1)
3    The Hobbit : The Enchanting Prelude to The Lor...
4                                         The Alienist
5    Cruel &amp; Unusual (Kay Scarpetta Mysteries (...
6         I Know This Much Is True (Oprah's Book Club)
7                                  The Mists of Avalon
8                                    The Next Accident
9     The Book of Ruth (Oprah's Book Club (Paperback))
Name: Book-Title, dtype: object
