# Recommender Systems Challenge, 2023/2024 @ PoliMi

# Introduction

## Problem Description
The application domain is book recommendation. The dataset contains interactions between users and books: an interaction is recorded when a user gave a book a rating of at least 4. The main goal of the competition is to predict which items (books) a user will interact with.

The dataset includes around 600k interactions, 13k users, and 22k items (books).
The training-test split is done via random holdout: 80% training, 20% test.
The goal is to recommend a list of 10 potentially relevant items for each user.
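
As an illustration of the random holdout described above, the following is a minimal sketch of an 80/20 split over interaction indices. The function name `random_holdout`, the seed, and the toy size of 1000 interactions are assumptions made for the example; the organisers' actual splitting code is not part of this notebook.

```python
import numpy as np

def random_holdout(n_interactions, train_fraction=0.8, seed=42):
    """Return boolean masks (train, test) over interaction indices.

    Hypothetical helper: every interaction lands in exactly one split.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_interactions)
    n_train = int(round(train_fraction * n_interactions))

    train_mask = np.zeros(n_interactions, dtype=bool)
    train_mask[shuffled[:n_train]] = True
    return train_mask, ~train_mask

# Toy usage: 1000 interactions split 80/20
train_mask, test_mask = random_holdout(1000)
print(train_mask.sum(), test_mask.sum())
```

The same masks could be applied to the rows of an interaction table to build a local validation split.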


## Datasets
All files are comma-separated (columns are separated by ',').

- **data_train.csv**:
Contains the training set, describing implicit preferences expressed by the users.
    - **user_id** : identifier of the user
    - **item_id** : identifier of the item (Book)
    - **data** : "1.0" if the user liked the book, i.e. gave it a rating of at least 4.


- **data_target_users_test.csv**:
Contains the ids of the users that should appear in your submission file.
The submission file should contain all and only these users.

- **sample_submission.csv**:
A sample submission file in the correct format: [user_id],[ordered list of recommended items].
The items in the list are separated by single spaces. Be careful with the spaces and be sure to recommend the correct number of items (10) to every user.
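
The training file described above has three columns (`user_id`, `item_id`, `data`), which map naturally onto a sparse user-item matrix. The sketch below builds one with `scipy.sparse` from a toy dataframe. The helper name `build_urm` is hypothetical, and the code assumes the ids are non-negative integers that can be used directly as row and column indices.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sps

def build_urm(interactions, n_users=None, n_items=None):
    """Build a CSR user-item matrix from (user_id, item_id, data) rows.

    Assumption: ids are non-negative integers usable as matrix indices.
    """
    rows = interactions["user_id"].to_numpy()
    cols = interactions["item_id"].to_numpy()
    vals = interactions["data"].to_numpy(dtype=np.float32)

    if n_users is None:
        n_users = int(rows.max()) + 1
    if n_items is None:
        n_items = int(cols.max()) + 1

    return sps.csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))

# Toy data with the same column names as data_train.csv
toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [3, 5, 3],
    "data": [1.0, 1.0, 1.0],
})
urm = build_urm(toy)
print(urm.shape, urm.nnz)
```

With the real file, the dataframe would come from `pd.read_csv` on `data_train.csv` instead of the toy data.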

# Requirements

The working environment and the required libraries are defined here.

In [1]:
# Import libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy.sparse as sps # creation of sparse matrix
import matplotlib.pyplot as pyplot # plot
import re

from tqdm import tqdm

In [2]:
# Open the target-users file (the users that must appear in the submission)
submission_sample = open('Input/data_target_users_test.csv', 'r')
type(submission_sample)

_io.TextIOWrapper

In [3]:
# Read the dataframe
submission_df = pd.read_csv(filepath_or_buffer=submission_sample)
submission_df

Unnamed: 0,user_id
0,1
1,2
2,3
3,4
4,5
...,...
10877,13020
10878,13021
10879,13022
10880,13023


In [4]:
# Convert the user_id column to a numpy array
user_list = submission_df['user_id'].to_numpy()
user_list

array([    1,     2,     3, ..., 13022, 13023, 13024], dtype=int64)

# Combine result files

In [2]:
# Handle the input file
input_file = open('Results/t_1.csv', 'r')
type(input_file)

_io.TextIOWrapper

In [3]:
# Build the dataframe from the input file
input_df = pd.read_csv(
    filepath_or_buffer=input_file,
    names=['user_id','item_list'],
    header=0
)

input_df.shape

(10882, 2)

In [4]:
# Show the first rows of the dataframe
input_df.head(n=10)

Unnamed: 0,user_id,item_list
0,1,101 36 123 506 515 403 694 1546 52 1422
1,2,1095 47 12 50 1522 949 102 3176 54 28
2,3,59 857 2172 4252 4 648 4623 259 956 536
3,4,249 28 50 314 171 136 146 254 128 5
4,5,1570 77 170 5138 95 471 1511 131 116 135
5,6,886 9 35 395 874 88 168 104 14 663
6,8,210 443 451 600 95 722 3916 480 1094 1749
7,9,9018 2821 10108 17799 2282 115 21322 13958 248...
8,10,1816 1446 561 2617 1668 2565 2423 3905 1767 3721
9,11,31 40 58 67 955 185 34 52 44 32


In [5]:
input_df.item_list[1]

'1095 47 12 50 1522 949 102 3176 54 28'

In [17]:
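# One score vector per target user, indexed by item_id.
# 22350 is the hard-coded size of the item-id space (assumed to exceed the largest item_id).
storage = [[0] * 22350 for _ in user_list]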

In [18]:
weights = [0.14043, 0.14119, 0.14115, 0.1411, 0.14109, 0.14145, 0.14136, 0.14146, 0.14164, 0.14114, 0.14074, 0.14127, 0.14081, 0.14112, 0.14025, 0.14193, 0.14149, 0.14141, 0.14122, 0.14188, 0.14176, 0.1418, 0.14134, 0.14136, 0.1414, 0.14154, 0.14014, 0.14081, 0.14171, 0.14145, 0.14145, 0.14161]

In [19]:
# Use uniform weights instead: this overrides the per-file weights defined above
weights = [1] * 32

In [20]:
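for i in tqdm(range(32)):
    # Load the i-th result file; passing the path lets pandas open and close the file
    input_df = pd.read_csv(
        filepath_or_buffer='Results/t_{}.csv'.format(i+1),
        names=['user_id','item_list'],
        header=0
    )

    # Row j of every result file is assumed to refer to user_list[j]
    for j in range(len(user_list)):

        # Parse the space-separated recommendation list
        intList = [int(x) for x in input_df.item_list[j].split()]

        # Rank-based score: 5 for the first item down to 3.2 for the tenth,
        # scaled by the weight of the file
        for k in range(10):
            storage[j][intList[k]] += ((5-0.2*k) * weights[i])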

100%|██████████████████████████████████████████████████████████████████████████████████| 33/33 [00:04<00:00,  6.86it/s]
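
The loop above is a rank-weighted vote: each result file gives an item `(5 - 0.2 * rank) * weight` points and the points are summed per item. The following sketch restates that rule for a single user on a toy example. The function name `fuse_rankings`, the toy lists, and the weights `[1.0, 0.5]` are illustrative assumptions, not values from the notebook.

```python
import numpy as np

def fuse_rankings(ranked_lists, weights, n_items, top_k=10):
    """Combine several ranked lists for one user into a single ranking.

    Each list contributes (5 - 0.2 * rank) * weight to every item it contains,
    the same scoring rule used in the cell above.
    """
    scores = np.zeros(n_items)
    for items, w in zip(ranked_lists, weights):
        for rank, item in enumerate(items[:top_k]):
            scores[item] += (5 - 0.2 * rank) * w
    # Sort by descending score; the stable sort breaks ties by lower item id
    fused = np.argsort(-scores, kind="stable")[:top_k].tolist()
    return fused, scores

# Toy usage: two hypothetical models, three recommendations each
fused, scores = fuse_rankings([[3, 1, 2], [1, 3, 4]], [1.0, 0.5], n_items=6, top_k=3)
print(fused)
```

Note that the tie-breaking here (lower item id first) may differ from `argsort()[::-1]`, which the notebook uses when extracting the final lists.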


In [21]:
sample = 0

In [22]:
np.nonzero(storage[sample])

(array([  36,   52,  101,  123,  403,  506,  515,  592,  694, 1422, 1546],
       dtype=int64),)

In [23]:
arr = np.array(storage[sample]).argsort()[::-1][:10]
arr

array([ 101,   36,  123,  506,  515,  403,  694, 1546,   52, 1422],
      dtype=int64)

In [24]:
for elem in arr.tolist():
    print(storage[sample][elem])

162.00000000000006
161.39999999999998
149.59999999999994
147.40000000000006
138.00000000000003
132.4
124.19999999999995
120.19999999999992
110.20000000000006
101.00000000000004


In [16]:
all_recommend = []

for i in tqdm(range(len(user_list))):
    arr = np.array(storage[i]).argsort()[::-1][:10]
    rec_list = arr.tolist()
    rec_row = ' '.join(str(s) for s in rec_list)
    all_recommend.append(rec_row)

100%|███████████████████████████████████████████████████████████████████████████| 10882/10882 [00:12<00:00, 852.06it/s]
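
The cell above sorts the full score vector of every user to keep only ten items. If the scores were held in a 2-D numpy array instead of a list of lists, the same selection could be sketched with `np.argpartition`, which avoids a full sort per row. This is an optional alternative under that assumption; `top_k_items` is a hypothetical helper and ties may be ordered differently than with `argsort()[::-1]`.

```python
import numpy as np

def top_k_items(score_matrix, k=10):
    """Return, for each row, the indices of the k highest scores in descending order."""
    # Unordered indices of the k largest scores per row
    part = np.argpartition(-score_matrix, k - 1, axis=1)[:, :k]
    # Order those k candidates by descending score
    part_scores = np.take_along_axis(score_matrix, part, axis=1)
    order = np.argsort(-part_scores, axis=1, kind="stable")
    return np.take_along_axis(part, order, axis=1)

# Toy usage: 2 users, 4 items, keep the best 2 per user
toy_scores = np.array([[0.1, 0.9, 0.5, 0.7],
                       [3.0, 1.0, 2.0, 0.0]])
top2 = top_k_items(toy_scores, k=2)
print(top2)
```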


In [17]:
all_recommend[:10]

['101 36 123 506 515 403 694 1546 52 1422',
 '1095 12 47 50 1522 11 949 54 102 3176',
 '59 857 2172 4252 648 956 4 1281 536 259',
 '249 28 50 314 171 136 146 254 7 5',
 '1570 170 77 5138 131 95 1511 471 8 1220',
 '886 35 874 9 395 88 168 104 14 184',
 '210 443 451 600 3916 722 95 480 1094 1130',
 '9018 2821 2282 10108 17799 21322 115 20151 4535 2350',
 '1816 1446 561 1668 2617 2565 2423 3905 1767 3721',
 '31 40 58 67 955 185 34 52 32 44']

In [18]:
submission_df['item_list'] = all_recommend

In [19]:
submission_df.head(10)

Unnamed: 0,user_id,item_list
0,1,101 36 123 506 515 403 694 1546 52 1422
1,2,1095 12 47 50 1522 11 949 54 102 3176
2,3,59 857 2172 4252 648 956 4 1281 536 259
3,4,249 28 50 314 171 136 146 254 7 5
4,5,1570 170 77 5138 131 95 1511 471 8 1220
5,6,886 35 874 9 395 88 168 104 14 184
6,8,210 443 451 600 3916 722 95 480 1094 1130
7,9,9018 2821 2282 10108 17799 21322 115 20151 453...
8,10,1816 1446 561 1668 2617 2565 2423 3905 1767 3721
9,11,31 40 58 67 955 185 34 52 32 44


In [20]:
# Export the dataframe to a CSV file
submission_df.to_csv('Output/OutputHybridwMAP.csv', index=False)
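
Since the submission format is strict about spacing and list length, a quick check before uploading can be useful. The sketch below is one possible check, not part of the original workflow: `find_bad_rows` is a hypothetical helper that verifies the user column matches the target users and that every row holds the expected number of single-space-separated integer ids.

```python
import pandas as pd

def find_bad_rows(df, target_users, n_items=10):
    """Return a list of human-readable problems found in a submission dataframe."""
    problems = []
    if list(df["user_id"]) != list(target_users):
        problems.append("user_id column does not match the target users")
    for user, row in zip(df["user_id"], df["item_list"]):
        items = row.split(" ")
        # A double space produces an empty token, which fails the isdigit check
        if len(items) != n_items or not all(tok.isdigit() for tok in items):
            problems.append("user {}: malformed item list".format(user))
    return problems

# Toy usage with 3 items per user instead of 10
good = pd.DataFrame({"user_id": [1, 2], "item_list": ["1 2 3", "4 5 6"]})
bad = pd.DataFrame({"user_id": [1, 2], "item_list": ["1 2 3", "4  5"]})
print(find_bad_rows(good, [1, 2], n_items=3))
print(find_bad_rows(bad, [1, 2], n_items=3))
```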

# Compare two output files

In [207]:
# Open the two output files to compare
input_1 = open('Output/Test_62.csv', 'r')
input_2 = open('Output/NewXGB30_tot.csv', 'r')

In [208]:
# Build a dataframe from each input file
df1 = pd.read_csv(
    filepath_or_buffer=input_1,
    names=['user_id','item_list'],
    header=0
)

df2 = pd.read_csv(
    filepath_or_buffer=input_2,
    names=['user_id','item_list'],
    header=0
)

In [209]:
num_users = len(user_list)

In [210]:
equal_rows = []
equal_users = []

perm_rows = []
perm_users = []

diff_rows = []
diff_users = []

changed_item_count = 0

for i in range(num_users):

    if df1.item_list[i] == df2.item_list[i]:
        equal_rows.append(i)
        equal_users.append(user_list[i])
        print("Row {}".format(i) + " (User {}): ".format(user_list[i]) + " identical")

    else:
        
        list1 = re.split(' ', df1.item_list[i])
        list2 = re.split(' ', df2.item_list[i])
        
        if set(list1) == set(list2):
            perm_rows.append(i)
            perm_users.append(user_list[i])
            print("Row {}".format(i) + " (User {}): ".format(user_list[i]) + " different ordering!")
            print("[" + df1.item_list[i] + "] vs [" + df2.item_list[i] + "]")
            
        else:
            new_items = list(set(list1) - set(list2))
            changed_item_count += len(new_items)
        
            diff_rows.append(i)
            diff_users.append(user_list[i])
            print("Row {}".format(i) + " (User {}): ".format(user_list[i]) + " different elements!")
            print("[" + df1.item_list[i] + "] vs [" + df2.item_list[i] + "]")
        

Row 0 (User 1):  identical
Row 1 (User 2):  different elements!
[1095 12 47 1522 50 949 54 102 11 196] vs [1095 47 12 1522 50 28 196 11 102 3176]
Row 2 (User 3):  different elements!
[59 857 2172 4252 4 1281 648 4623 259 956] vs [59 857 4252 259 2172 536 584 648 956 1281]
Row 3 (User 4):  different elements!
[249 28 50 314 171 146 7 136 254 128] vs [28 249 50 171 314 136 139 254 7 146]
Row 4 (User 5):  different elements!
[1570 170 77 1511 471 95 131 5138 1220 8] vs [1570 77 170 5138 95 1220 131 471 116 1511]
Row 5 (User 6):  different elements!
[886 874 35 9 395 88 168 104 184 9329] vs [886 9 35 874 88 14 395 184 104 168]
Row 6 (User 8):  different ordering!
[210 443 451 3916 600 722 95 1094 480 1130] vs [210 443 451 3916 722 600 95 480 1130 1094]
Row 7 (User 9):  different elements!
[9018 2821 2282 10108 17799 115 2350 4535 20151 21322] vs [9018 2821 1 17799 316 115 10108 2282 227 248]
Row 8 (User 10):  different elements!
[1816 1446 561 1668 2617 2565 3905 2423 67 1767] vs [561 1446

[618 376 313 31 636 94 58 55 1214 76] vs [31 94 55 58 76 376 618 313 636 1214]
Row 9078 (User 10821):  different elements!
[324 344 14 1598 56 104 144 174 184 8345] vs [324 344 14 56 1598 104 144 35 4 24]
Row 9079 (User 10822):  different elements!
[1196 275 4324 983 599 343 126 657 1274 1100] vs [1196 4324 275 599 126 983 657 1100 89 293]
Row 9080 (User 10823):  different elements!
[2844 162 182 284 1199 102 63 817 12742 11652] vs [4 2 1199 2844 63 102 162 182 284 270]
Row 9081 (User 10824):  different ordering!
[138 344 94 16 279 262 170 1362 339 4741] vs [94 344 16 138 262 170 279 1362 339 4741]
Row 9082 (User 10825):  different elements!
[168 104 144 240 56 224 372 35 7647 258] vs [56 104 168 144 35 224 372 240 9 413]
Row 9083 (User 10826):  different elements!
[6723 859 6722 9568 11633 269 7056 9046 4881 7395] vs [60 11619 238 9568 11633 859 6723 6722 67 269]
Row 9084 (User 10827):  different elements!
[928 639 174 1195 2283 2 2284 1795 532 4352] vs [928 639 174 1195 1795 2283 228

In [211]:
input_1.close()
input_2.close()

In [212]:
print("Number of identical rows: {}".format(len(equal_rows)))
print("Number of rows with different order: {}".format(len(perm_rows)))
print("Number of different rows: {}".format(len(diff_rows)))

Number of identical rows: 10
Number of rows with different order: 1282
Number of different rows: 9590


In [213]:
n_users_to_recommend = num_users
n_recommended_items = n_users_to_recommend * 10
print("Changed {} items, {:.2f}% of total".format(changed_item_count, changed_item_count/n_recommended_items * 100))

Changed 21478 items, 19.74% of total
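
The three-way classification used above (identical, same items in a different order, different items) can be isolated in a small function, which makes it easier to test on single rows. This is a sketch of the same logic; `compare_lists` is a hypothetical name, and the two example rows are copied from the printed output above (rows 6 and 1).

```python
def compare_lists(row_a, row_b):
    """Classify two space-separated recommendation lists.

    Returns (label, n_changed), where n_changed counts the items of row_a
    that do not appear in row_b.
    """
    if row_a == row_b:
        return "identical", 0
    items_a, items_b = row_a.split(), row_b.split()
    if set(items_a) == set(items_b):
        return "different ordering", 0
    return "different elements", len(set(items_a) - set(items_b))

# Rows taken from the comparison output above
row6 = compare_lists("210 443 451 3916 600 722 95 1094 480 1130",
                     "210 443 451 3916 722 600 95 480 1130 1094")
row1 = compare_lists("1095 12 47 1522 50 949 54 102 11 196",
                     "1095 47 12 1522 50 28 196 11 102 3176")
print(row6, row1)
```

Summing `n_changed` over all rows and dividing by the total number of recommended items gives the percentage reported above.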


In [214]:
# print("Identical row index: {}".format(equal_rows))

In [56]:
# print("Identical users: {}".format(equal_users))

In [57]:
# print("Different row index: {}".format(diff_rows))

In [58]:
# print("Different users: {}".format(diff_users))