Collaborative filtering model.

We build an interaction matrix and then decompose it into lower-dimensional matrices using TruncatedSVD from scikit-learn. Because games are the dominant class in the store, we build separate matrices for games and non-games.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

Preparing the interaction matrix

In [2]:
sessionsDataPath = '../notebooks/data/v2/sessions.jsonl'
productsDataPath = '../notebooks/data/v2/products.jsonl'
sessionsDF = pd.read_json(sessionsDataPath, lines=True)
productsDF = pd.read_json(productsDataPath, lines=True)

df = sessionsDF.drop(columns=["session_id", "timestamp", "event_type", "offered_discount", "purchase_id"])
df["count"] = 1
# rows: users, columns: products, cells: number of session events for that pair
interactionMatrixDF = pd.pivot_table(df, index="user_id", columns="product_id", values="count", aggfunc="sum", fill_value=0)
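
To make the shape of the interaction matrix concrete, here is a minimal sketch on a toy session log. The `toy_sessions` frame and its ids are made up for illustration; only the `pivot_table` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy session log; the real sessions.jsonl has more columns.
toy_sessions = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3],
    "product_id": [10, 10, 11, 10, 12, 11],
})
toy_sessions["count"] = 1

# One row per user, one column per product; each cell holds the number
# of events the user generated for that product (0 if none).
toy_matrix = pd.pivot_table(
    toy_sessions,
    index="user_id",
    columns="product_id",
    values="count",
    aggfunc="sum",
    fill_value=0,
)
print(toy_matrix)
```

User 1 interacted with product 10 twice, so that cell is 2, while pairs that never occur in the log are filled with 0.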

Function that casts the product category

In [3]:
separator = ';'
newGroups = ['Gry komputerowe', 'Gry na konsole', 'Sprzęt RTV', 'Komputery', 'Telefony i akcesoria']

def castCategoryPath(categoryPath):
    categories = categoryPath.split(separator)
    foundGroups = [group for group in newGroups if group in categories]
    if len(foundGroups) != 1:
        raise RuntimeError('wrong group cast: {}'.format(foundGroups))
    return foundGroups[0]

# cast each full category path to a single top-level group
productsCastedDF = productsDF.copy()
productsCastedDF['category_path'] = productsDF['category_path'].apply(castCategoryPath)
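
As a quick illustration of what the cast does, the sketch below repeats the function so that it runs on its own and applies it to two category paths that appear in the product tables of this notebook. The third path (`'Dom i ogród;Meble'`) is a hypothetical one, used only to show the error branch.

```python
separator = ';'
newGroups = ['Gry komputerowe', 'Gry na konsole', 'Sprzęt RTV', 'Komputery', 'Telefony i akcesoria']

def castCategoryPath(categoryPath):
    # split the path into its levels and keep the levels that are known groups
    categories = categoryPath.split(separator)
    foundGroups = [group for group in newGroups if group in categories]
    # exactly one known group is expected per path
    if len(foundGroups) != 1:
        raise RuntimeError('wrong group cast: {}'.format(foundGroups))
    return foundGroups[0]

castConsole = castCategoryPath('Gry i konsole;Gry na konsole;Gry Xbox 360')
castPc = castCategoryPath('Gry i konsole;Gry komputerowe')
print(castConsole)  # Gry na konsole
print(castPc)       # Gry komputerowe

# Hypothetical path with no known group: the function raises instead of guessing.
try:
    castCategoryPath('Dom i ogród;Meble')
    unknownRaised = False
except RuntimeError:
    unknownRaised = True
print(unknownRaised)  # True
```

Because membership is tested against whole path levels, `'Gry i konsole'` does not match any group on its own; only the more specific level does.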

Creating DataFrames containing games and non-games.  
Creating dictionaries used to recover the product_id from the results.

In [4]:
gamesDF = productsCastedDF[productsCastedDF['category_path'].isin(['Gry komputerowe', 'Gry na konsole'])]
nonGamesDF = productsCastedDF[~productsCastedDF['category_path'].isin(['Gry komputerowe', 'Gry na konsole'])]

# product_id Series re-indexed from 0, so that positions can be used as matrix indices
gamesList = gamesDF['product_id'].reset_index(drop=True)
nonGamesList = nonGamesDF['product_id'].reset_index(drop=True)

# position -> product_id dicts for faster lookup
gamesIdxNameDict = gamesList.to_dict()
nonGamesIdxNameDict = nonGamesList.to_dict()
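
The dictionaries above are only correct if the order of products in `productsDF` matches the column order of the interaction matrix. A more defensive variant, sketched here on a hypothetical toy matrix, derives the mapping from the matrix columns themselves. This is an alternative under that assumption, not what the notebook does.

```python
import pandas as pd

# Hypothetical toy interaction matrix: 2 users x 3 products.
toyMatrix = pd.DataFrame(
    [[1, 0, 2], [0, 3, 0]],
    index=[101, 102],
    columns=[1004, 1009, 1049],
)

# Column position -> product_id, taken directly from the matrix, so the
# mapping cannot drift from the column order used by the decomposition.
idxToProductId = dict(enumerate(toyMatrix.columns))
print(idxToProductId)
```

Building the mapping this way also covers the case where some products never appear in the session log and are therefore missing from the matrix.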

Normalizing the values in the interaction matrices.

In [5]:
gamesInteractionMatrixDF = interactionMatrixDF.drop(columns=nonGamesList)
nonGamesInteractionMatrixDF = interactionMatrixDF.drop(columns=gamesList)

gamesInteractionMatrixDF = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(gamesInteractionMatrixDF))
nonGamesInteractionMatrixDF = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(nonGamesInteractionMatrixDF))
interactionMatrixDF = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(interactionMatrixDF))
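
`MinMaxScaler` rescales each column (here, each product) to the range [0, 1] independently, and `fit_transform` returns a plain NumPy array, so wrapping it in `pd.DataFrame` without arguments replaces the user and product labels with 0..n-1. The sketch below shows both effects on a hypothetical toy frame, along with one way to keep the labels if they are needed later.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy matrix: 3 users (rows) x 2 products (columns).
toy = pd.DataFrame({1001: [0, 2, 4], 1002: [1, 1, 3]}, index=[101, 102, 103])

# As in the notebook: labels are replaced by positional indices.
scaledPlain = pd.DataFrame(MinMaxScaler().fit_transform(toy))

# Variant that keeps the original user and product labels.
scaledLabeled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy),
    index=toy.index,
    columns=toy.columns,
)

print(scaledPlain)
print(scaledLabeled)
```

The notebook relies on the positional form, which is why the index-to-id dictionaries are needed when reading the recommendations.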


Decomposing the resulting matrix into sub-matrices for users and for products.


In [6]:
from sklearn.decomposition import TruncatedSVD

# initial hyperparameters
epsilon = 1e-9      # small offset added to every latent feature
latentFactors = 10  # number of latent dimensions kept by the SVD

# game latent features
gamesSVD = TruncatedSVD(n_components=latentFactors)
gamesFeatures = gamesSVD.fit_transform(gamesInteractionMatrixDF.transpose()) + epsilon  # transpose because items are columns

# non-game latent features
nonGamesSVD = TruncatedSVD(n_components=latentFactors)
nonGamesFeatures = nonGamesSVD.fit_transform(nonGamesInteractionMatrixDF.transpose()) + epsilon  # transpose because items are columns

# user latent features
userSVD = TruncatedSVD(n_components=latentFactors)
userFeatures = userSVD.fit_transform(interactionMatrixDF) + epsilon

# in a notebook only the last expression of a cell is rendered
pd.DataFrame(gamesFeatures)
pd.DataFrame(nonGamesFeatures)
pd.DataFrame(userFeatures)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.383352,-0.005340,0.106952,0.090407,0.031147,0.059937,0.032949,0.034099,-0.115362,-0.013931
1,4.743837,0.511675,-1.358887,-0.617235,-0.335662,-0.794700,0.590200,-0.123164,0.008213,-0.204597
2,4.002009,0.354899,0.789688,1.489023,-0.012757,0.428438,-0.159704,0.708473,-0.560596,0.104846
3,5.828327,0.841767,-0.457951,-1.000367,1.184680,0.127947,-0.375160,-0.328533,-1.354306,0.450858
4,3.836404,-0.205994,0.177453,-0.743968,0.501588,0.326312,-0.501907,0.284142,0.024499,-0.538360
...,...,...,...,...,...,...,...,...,...,...
195,4.982290,0.249174,0.112670,-0.719098,-0.166870,-0.595272,0.498162,0.928648,-0.559530,-0.168564
196,2.812221,-0.270846,-0.436285,0.668807,0.174951,0.214495,0.220745,-0.006480,-0.681519,0.280170
197,1.070717,-0.055831,0.014460,-0.251186,-0.002369,0.264489,0.132134,0.092821,-0.172030,0.061074
198,3.880630,-1.262774,-0.438388,0.525878,0.212223,-0.073060,1.458608,-0.067967,0.719025,-0.380007
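
The value `latentFactors = 10` is described as an initial choice. One way to judge it, sketched below on random data rather than on the notebook's matrices, is to look at the `explained_variance_ratio_` attribute of a fitted `TruncatedSVD`. The matrix size and the `random_state` are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a scaled interaction matrix: 50 users x 30 products.
rng = np.random.default_rng(0)
toyInteractions = rng.random((50, 30))

svd = TruncatedSVD(n_components=10, random_state=0)
# Transpose so that each row of the result describes one product.
itemFeatures = svd.fit_transform(toyInteractions.T)

# Share of the variance captured by the first 1, 2, ..., 10 components.
cumulative = np.cumsum(svd.explained_variance_ratio_)
print(itemFeatures.shape)
print(cumulative)
```

If the cumulative share flattens out well before the last component, fewer latent factors may be enough; if it is still climbing, more may help.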


Definition of a function that returns the top k most similar items based on *cosine similarity*

In [7]:
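def top_k(item_id, k, corr_mat, map_name):
    # indices of the k highest similarities in the item's row, best first;
    # the item itself (similarity 1 with itself) is normally among them
    topItems = corr_mat[item_id, :].argsort()[-k:][::-1]
    # translate matrix indices back to product ids
    topItems = [map_name[e] for e in topItems]
    return topItems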
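
The behaviour of the function is easiest to see on a small, hand-written similarity matrix. The sketch below repeats the function so that it runs on its own; `toySim` and the ids in `toyMap` are hypothetical. It also shows one possible way to exclude the query item from its own recommendations, which the notebook does not do.

```python
import numpy as np

def top_k(item_id, k, corr_mat, map_name):
    # indices of the k highest similarities in the item's row, best first
    topItems = corr_mat[item_id, :].argsort()[-k:][::-1]
    return [map_name[e] for e in topItems]

# Hypothetical symmetric similarity matrix for three items.
toySim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])
toyMap = {0: 1001, 1: 1002, 2: 1003}

withSelf = top_k(0, 2, toySim, toyMap)
print(withSelf)  # [1001, 1003]

# Ask for one extra item and drop the query item itself.
withoutSelf = [p for p in top_k(0, 3, toySim, toyMap) if p != toyMap[0]][:2]
print(withoutSelf)  # [1003, 1002]
```

Item 0 is most similar to itself, so it takes the first slot unless it is filtered out.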

Recommendation

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

# games only
itemCorrMat = cosine_similarity(gamesFeatures)

# A dict mapping matrix indices to product ids has to be created before splitting:
# TruncatedSVD rows correspond to rows of productsDF, but after the games / non-games split the indices no longer match.
recommendations = top_k(0, 10, itemCorrMat, gamesIdxNameDict)
display(productsDF.loc[productsDF['product_id'].isin(recommendations)])


# non-games only
itemCorrMat = cosine_similarity(nonGamesFeatures)

# same index-to-id mapping as for games, this time using the non-games dict
recommendations = top_k(1, 10, itemCorrMat, nonGamesIdxNameDict)
display(productsDF.loc[productsDF['product_id'].isin(recommendations)])

Unnamed: 0,product_id,product_name,category_path,price,user_rating
3,1004,Fallout 3 (Xbox 360),Gry i konsole;Gry na konsole;Gry Xbox 360,49.99,4.06397
8,1009,Kinect Joy Ride (Xbox 360),Gry i konsole;Gry na konsole;Gry Xbox 360,69.0,4.80192
9,1010,BioShock 2 (Xbox 360),Gry i konsole;Gry na konsole;Gry Xbox 360,89.99,3.510874
10,1011,BioShock Infinite (Xbox 360),Gry i konsole;Gry na konsole;Gry Xbox 360,139.99,3.251818
48,1049,Max Payne 3 (PC),Gry i konsole;Gry komputerowe,17.9,1.495826
49,1050,Bioshock 2 (PC),Gry i konsole;Gry komputerowe,37.9,4.959925
51,1052,Duke Nukem Forever (PC),Gry i konsole;Gry komputerowe,78.9,2.176047
54,1055,Call of Duty Modern Warfare 3 (PC),Gry i konsole;Gry komputerowe,32.99,0.518008
62,1063,Air Conflicts (PC),Gry i konsole;Gry komputerowe,75.99,0.597509
89,1090,Władca Pierścieni Wojna Na Północy (PC),Gry i konsole;Gry komputerowe,33.99,1.764497


Unnamed: 0,product_id,product_name,category_path,price,user_rating
1,1002,Kyocera FS-1135MFP,Komputery;Drukarki i skanery;Biurowe urządzeni...,2048.5,1.875949
2,1003,Kyocera FS-3640MFP,Komputery;Drukarki i skanery;Biurowe urządzeni...,7639.0,1.493143
75,1076,Samsung CLX-6260FR ### Gadżety Samsung ### Eks...,Komputery;Drukarki i skanery;Biurowe urządzeni...,2399.0,4.436501
76,1077,Kyocera FS-C2026MFP,Komputery;Drukarki i skanery;Biurowe urządzeni...,3777.0,4.822859
78,1079,Kyocera FS-3040MFP,Komputery;Drukarki i skanery;Biurowe urządzeni...,4598.0,0.69548
79,1080,Kyocera FS-3140MFP,Komputery;Drukarki i skanery;Biurowe urządzeni...,5301.9,4.834499
296,1297,Telmor DSP-860,Sprzęt RTV;Video;Telewizory i akcesoria;Anteny...,119.0,3.376318
309,1310,One For All SV 9125,Sprzęt RTV;Video;Telewizory i akcesoria;Anteny...,79.99,2.384448
314,1315,Jabra Talk,Telefony i akcesoria;Akcesoria telefoniczne;Ze...,54.99,0.335627
315,1316,Plantronics Voyager Legend,Telefony i akcesoria;Akcesoria telefoniczne;Ze...,249.0,4.287721
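
Note that selecting rows with `isin` returns them in `productsDF` order, so the tables above are sorted by product id rather than by similarity. If the ranking matters, the rows can be looked up in the order of the recommendation list instead. A minimal sketch on a hypothetical toy catalogue:

```python
import pandas as pd

# Hypothetical toy catalogue and a ranked recommendation list (best first).
toyProducts = pd.DataFrame({
    "product_id": [1001, 1002, 1003, 1004],
    "product_name": ["A", "B", "C", "D"],
})
recommendations = [1003, 1001, 1004]

# isin keeps the catalogue order and loses the ranking.
unordered = toyProducts.loc[toyProducts["product_id"].isin(recommendations)]

# Looking the ids up by label preserves the order of the list.
ranked = toyProducts.set_index("product_id").loc[recommendations].reset_index()

print(unordered["product_id"].tolist())  # [1001, 1003, 1004]
print(ranked["product_id"].tolist())     # [1003, 1001, 1004]
```

The label-based lookup assumes that every recommended id exists in the catalogue and that product ids are unique.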
