Collaborative filtering model.

To do this, we build an interaction matrix and then decompose it into lower-dimensional matrices using TruncatedSVD from sklearn. Because games are the dominant class in the store, we build separate matrices for games and non-games.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

Preparing the interaction matrix

In [2]:
sessionsDataPath = '../notebooks/data/v2/sessions.jsonl'
productsDataPath = '../notebooks/data/v2/products.jsonl'
sessionsDF = pd.read_json(sessionsDataPath, lines=True)
productsDF = pd.read_json(productsDataPath, lines=True)

df = sessionsDF.drop(columns=["session_id", "timestamp", "event_type", "offered_discount", "purchase_id"])
df["count"] = 1

# the interaction matrix is built only from the training part of sessionsDF
train, test = train_test_split(df, test_size=0.2, stratify=df['product_id'])

interactionMatrixDF = pd.pivot_table(train, index="user_id", columns="product_id", values="count", aggfunc=np.sum, fill_value=0)
display(interactionMatrixDF)

user_id,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,...,1310,1311,1312,1313,1314,1315,1316,1317,1318,1319
102,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
103,0,2,4,15,1,3,2,3,3,9,...,0,1,1,1,0,1,6,0,0,7
104,0,0,6,5,5,8,0,4,2,6,...,1,1,1,0,1,0,3,1,5,1
105,0,1,7,7,4,4,0,3,2,3,...,0,1,1,0,2,1,8,5,0,1
106,2,1,2,8,1,1,1,2,1,6,...,0,0,1,3,1,1,6,2,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,3,0,5,12,4,2,0,3,8,1,...,0,2,0,1,0,0,9,1,3,3
298,0,0,6,1,2,2,0,5,0,1,...,1,0,0,0,1,2,7,3,0,0
299,0,0,2,0,0,0,0,1,0,1,...,0,0,1,0,2,1,2,1,0,1
300,3,4,8,6,3,5,0,3,4,6,...,1,1,0,0,4,4,7,2,0,2
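
To make the pivot step easier to follow, here is a minimal sketch on a hypothetical toy event log (the `events` frame and its ids are invented for illustration, not taken from `sessions.jsonl`). It assumes, as in the cell above, that every row counts as one interaction.

```python
import pandas as pd

# Hypothetical toy event log: one row per user-product interaction.
events = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3],
    "product_id": [10, 10, 11, 10, 12, 11],
})
events["count"] = 1

# Rows are users, columns are products, cells are interaction counts;
# user-product pairs with no events are filled with 0.
toy_matrix = pd.pivot_table(
    events, index="user_id", columns="product_id",
    values="count", aggfunc="sum", fill_value=0,
)
print(toy_matrix)
```

User 1 interacted with product 10 twice, so that cell holds 2, while pairs that never occur in the log hold 0.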


Function that casts a product's category path to a single top-level group

In [3]:
separator = ';'
newGroups = ['Gry komputerowe', 'Gry na konsole', 'Sprzęt RTV', 'Komputery', 'Telefony i akcesoria']

def castCategoryPath(categoryPath):
    categories = categoryPath.split(separator)
    foundGroups = [group for group in newGroups if group in categories]
    if len(foundGroups) != 1:
        raise RuntimeError('wrong group cast: {}'.format(foundGroups))
    return foundGroups[0]

#casting category
productsCastedDF = productsDF.copy()
productsCastedDF['category_path'] = productsDF['category_path'].apply(castCategoryPath)
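
As a quick illustration of the casting rule, the sketch below restates the function so it runs on its own and applies it to two category paths that appear in the product listings at the end of this notebook. The third call uses a made-up path with no recognised group to show the error branch.

```python
separator = ';'
newGroups = ['Gry komputerowe', 'Gry na konsole', 'Sprzęt RTV', 'Komputery', 'Telefony i akcesoria']

def castCategoryPath(categoryPath):
    # Split the path into its levels and keep the levels that are known groups.
    categories = categoryPath.split(separator)
    foundGroups = [group for group in newGroups if group in categories]
    # Exactly one group must match, otherwise the path is ambiguous or unknown.
    if len(foundGroups) != 1:
        raise RuntimeError('wrong group cast: {}'.format(foundGroups))
    return foundGroups[0]

console = castCategoryPath('Gry i konsole;Gry na konsole;Gry Xbox 360')
monitor = castCategoryPath('Komputery;Monitory;Monitory LCD')
print(console)  # Gry na konsole
print(monitor)  # Komputery

# A path without any recognised group (hypothetical input) raises RuntimeError.
try:
    castCategoryPath('Gry i konsole')
except RuntimeError as err:
    print(err)
```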

Create DataFrames containing games and non-games.  
Create dictionaries for recovering product_id from the result indices.

In [4]:
gamesDF = productsCastedDF[productsCastedDF['category_path'].isin(['Gry komputerowe', 'Gry na konsole'])]
nonGamesDF = productsCastedDF[~productsCastedDF['category_path'].isin(['Gry komputerowe', 'Gry na konsole'])]

#creating lists
gamesList = gamesDF['product_id']
gamesList.reset_index(drop=True, inplace=True)
nonGamesList = nonGamesDF['product_id']
nonGamesList.reset_index(drop=True, inplace=True)

#creating dicts for faster lookup of product_id by row position
gamesIdxNameDict = pd.Series(gamesList.values, index=gamesList.index).to_dict()
nonGamesIdxNameDict = pd.Series(nonGamesList.values, index=nonGamesList.index).to_dict()
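
The position-to-id mapping can be sketched on a small, hypothetical product table (the four rows below are invented). The point is that after `reset_index(drop=True)` the keys are consecutive positions within the games subset, not positions in the full product table.

```python
import pandas as pd

# Hypothetical toy product table with already-cast categories.
products = pd.DataFrame({
    "product_id": [1001, 1002, 1003, 1004],
    "category_path": ["Gry komputerowe", "Komputery", "Gry na konsole", "Sprzęt RTV"],
})

is_game = products["category_path"].isin(["Gry komputerowe", "Gry na konsole"])

# Keep only game ids and renumber them 0..n-1.
games_ids = products.loc[is_game, "product_id"].reset_index(drop=True)

# Position within the games subset -> product_id.
idx_to_id = games_ids.to_dict()
print(idx_to_id)  # {0: 1001, 1: 1003}
```

Such a mapping only decodes results correctly if the rows of the feature matrix follow the same order as this list of ids.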

Normalizing the values in the interaction matrices.

In [5]:
gamesInteractionMatrixDF = interactionMatrixDF.drop(columns=nonGamesList)
nonGamesInteractionMatrixDF = interactionMatrixDF.drop(columns=gamesList)

gamesInteractionMatrixDF = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(gamesInteractionMatrixDF))
nonGamesInteractionMatrixDF = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(nonGamesInteractionMatrixDF))
interactionMatrixDF = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(interactionMatrixDF))
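
A small sketch of what `MinMaxScaler` does here, on a hypothetical 3x2 count matrix: each column (that is, each product) is rescaled independently to the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical counts: 3 users (rows) x 2 products (columns).
counts = np.array([[0.0, 2.0],
                   [5.0, 4.0],
                   [10.0, 6.0]])

# Each column is mapped to [0, 1] using its own minimum and maximum.
scaled = MinMaxScaler().fit_transform(counts)
print(scaled)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```

With default settings `fit_transform` returns a plain NumPy array, so wrapping it in `pd.DataFrame(...)` as in the cell above yields positional row and column labels instead of `user_id` and `product_id`. This is why the index-to-id dictionaries are needed later.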


Decomposing the interaction matrix into sub-matrices of latent features for users and for products.


In [6]:
from sklearn.decomposition import TruncatedSVD

#initial hyperparameters
epsilon = 1e-9
latentFactors = 10

#generate item latent features for games
gamesSVD = TruncatedSVD(n_components=latentFactors)
gamesFeatures = gamesSVD.fit_transform(gamesInteractionMatrixDF.transpose()) + epsilon #transpose because items are columns

#generate item latent features for non-games
nonGamesSVD = TruncatedSVD(n_components=latentFactors)
nonGamesFeatures = nonGamesSVD.fit_transform(nonGamesInteractionMatrixDF.transpose()) + epsilon #transpose because items are columns

#generate user latent features
userSVD = TruncatedSVD(n_components=latentFactors)
userFeatures = userSVD.fit_transform(interactionMatrixDF) + epsilon

pd.DataFrame(gamesFeatures)
pd.DataFrame(nonGamesFeatures)
pd.DataFrame(userFeatures)

,0,1,2,3,4,5,6,7,8,9
0,0.385380,-0.078308,-0.068981,-0.017276,-0.045856,-0.077127,-0.023555,-0.025681,-0.025794,0.115807
1,4.296812,0.448854,-0.116722,0.617638,-0.175199,-0.458337,-0.724674,0.499969,-0.184143,0.564528
2,3.738206,-0.127462,-0.289848,-0.289504,0.585055,-0.761815,-0.073545,-1.031211,-0.992413,0.245779
3,5.672823,0.213148,1.016267,0.402229,-1.408584,0.881047,-0.274027,-1.943494,0.272526,0.097953
4,3.720176,-0.034922,-0.473810,0.111053,-0.769521,0.304753,0.049197,0.793772,-0.031394,0.123723
...,...,...,...,...,...,...,...,...,...,...
195,4.605532,-0.019214,-0.092146,1.596455,-0.137108,0.153326,0.459553,-0.414355,-0.119076,0.279106
196,2.769166,-0.421469,0.341737,-0.339693,-0.044866,-0.134932,-0.526786,-0.134933,-0.353671,0.308401
197,0.996871,0.009082,-0.097710,0.037190,-0.163675,0.040941,-0.148666,0.036887,0.079010,0.130167
198,3.669375,-0.956387,-0.284069,-0.198463,0.576230,0.541525,-0.455337,0.550668,0.593661,0.785344
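
The shapes involved in the decomposition can be checked with a minimal sketch on a hypothetical random 6x5 matrix (6 users, 5 items). The matrix, `n_components=2` and `random_state=0` are illustrative choices, not the notebook's settings.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
toy = rng.random((6, 5))  # hypothetical: 6 users (rows) x 5 items (columns)

# Fitting on the matrix as-is gives one latent vector per user.
user_svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = user_svd.fit_transform(toy)

# Fitting on the transpose gives one latent vector per item,
# because items become the rows.
item_svd = TruncatedSVD(n_components=2, random_state=0)
item_factors = item_svd.fit_transform(toy.T)

print(user_factors.shape)  # (6, 2)
print(item_factors.shape)  # (5, 2)
```

This matches the table above: one row per user and one column per latent factor.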


Definition of a function returning the top k most similar items based on *cosine similarity* values

In [7]:
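def top_k(item_id, k, corr_mat, map_name):
    # sort the similarities of item_id ascending, keep the k largest, most similar first
    # (the query item itself is normally among them, since its self-similarity is 1)
    topItems = corr_mat[item_id,:].argsort()[-k:][::-1]
    # translate row positions back to product ids
    topItems = [map_name[e] for e in topItems]
    return topItems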
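
A short usage sketch on hypothetical data: four items with made-up two-dimensional feature vectors and letter ids. The helper is restated (with the second parameter called `k`) so that the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k(item_id, k, corr_mat, map_name):
    # Indices of the k highest similarities in the item's row, best first.
    top_items = corr_mat[item_id, :].argsort()[-k:][::-1]
    return [map_name[e] for e in top_items]

# Hypothetical latent features for four items and a position -> id mapping.
features = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [0.0, 1.0],
                     [0.7, 0.7]])
idx_to_id = {0: 'A', 1: 'B', 2: 'C', 3: 'D'}

sim = cosine_similarity(features)
result = top_k(0, 3, sim, idx_to_id)
print(result)  # ['A', 'B', 'D']
```

The query item `'A'` comes first because its similarity to itself is 1. A caller who wants only other items could request `k + 1` results and drop the query item.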

Recommendation

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

#games and non-games are handled separately; games first
itemCorrMat = cosine_similarity(gamesFeatures)

#a dict mapping product row indexes in productsDF to labels or ids must be created before the train/test split,
#because the TruncatedSVD rows correspond to rows in productsDF, but after the split they are no longer the same
recommendations = top_k(1, 10, itemCorrMat, gamesIdxNameDict)
display(productsDF.loc[productsDF['product_id'].isin(recommendations)])


itemCorrMat = cosine_similarity(nonGamesFeatures)

#same lookup for non-games, using the non-games index-to-product_id dict
recommendations = top_k(4, 10, itemCorrMat, nonGamesIdxNameDict)
display(productsDF.loc[productsDF['product_id'].isin(recommendations)])

,product_id,product_name,category_path,price,user_rating
4,1005,Szalone Króliki Na żywo i w kolorze (Xbox 360),Gry i konsole;Gry na konsole;Gry Xbox 360,49.99,2.949198
10,1011,BioShock Infinite (Xbox 360),Gry i konsole;Gry na konsole;Gry Xbox 360,139.99,3.251818
11,1012,Fallout New Vegas (Xbox 360),Gry i konsole;Gry na konsole;Gry Xbox 360,69.0,2.386605
26,1027,Skate 3 (Xbox 360),Gry i konsole;Gry na konsole;Gry Xbox 360,56.0,4.774147
47,1048,Gra o tron (PC),Gry i konsole;Gry komputerowe,63.49,2.346389
48,1049,Max Payne 3 (PC),Gry i konsole;Gry komputerowe,17.9,1.495826
53,1054,Call of Duty 2 (PC),Gry i konsole;Gry komputerowe,32.99,4.628316
54,1055,Call of Duty Modern Warfare 3 (PC),Gry i konsole;Gry komputerowe,32.99,0.518008
55,1056,Call of Duty Black Ops (PC),Gry i konsole;Gry komputerowe,29.99,0.364934
271,1272,The Ball (PC),Gry i konsole;Gry komputerowe,1.0,3.863771


,product_id,product_name,category_path,price,user_rating
16,1017,LCD Dell U2412M,Komputery;Monitory;Monitory LCD,399.0,4.003806
24,1025,LCD BenQ GL2250,Komputery;Monitory;Monitory LCD,349.0,2.409741
31,1032,LCD Iiyama E2280WSD,Komputery;Monitory;Monitory LCD,688.78,4.360233
32,1033,LCD Iiyama T1932MSC,Komputery;Monitory;Monitory LCD,3029.0,3.394407
35,1036,LCD Asus VK228H,Komputery;Monitory;Monitory LCD,639.0,2.791702
37,1038,LCD Asus VK278Q,Komputery;Monitory;Monitory LCD,1117.01,3.026183
38,1039,LCD Asus VS197D,Komputery;Monitory;Monitory LCD,269.0,0.544427
68,1069,LCD NEC EA224WMi,Komputery;Monitory;Monitory LCD,979.0,4.4497
291,1292,Philips SDV8622,Sprzęt RTV;Video;Telewizory i akcesoria;Anteny...,189.0,1.009917
298,1299,Vivanco TVA 301,Sprzęt RTV;Video;Telewizory i akcesoria;Anteny...,109.0,1.874503
