Collaborative filtering model.

We build a user-product interaction matrix and then decompose it into lower-dimensional matrices using
TruncatedSVD from sklearn.
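
As a rough illustration of the idea, the sketch below factors a tiny, made-up interaction matrix (4 users, 5 products; the numbers are hypothetical and not taken from the dataset). The names `interactions`, `user_factors` and `item_factors` are assumptions used only in this example.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical interaction counts: rows are users, columns are products.
interactions = np.array([
    [3, 0, 1, 0, 0],
    [2, 0, 0, 1, 0],
    [0, 4, 0, 0, 2],
    [0, 3, 0, 1, 2],
], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(interactions)  # shape (4, 2): one row per user
item_factors = svd.components_                  # shape (2, 5): one column per product

# Multiplying the two small matrices gives a low-rank approximation
# of the original interaction matrix.
approximation = user_factors @ item_factors
```

Users with similar rows in `interactions` end up with similar rows in `user_factors`, which is what the clustering step later relies on.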

The implementation below walks through how the model is built and illustrates the idea behind it. The final implementation (which follows the same construction scheme but is not a one-to-one copy) lives in ./models/advanced.py

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

Preparing the interaction matrix

In [2]:
sessionsDataPath = '../notebooks/data/v2/sessions.jsonl'
productsDataPath = '../notebooks/data/v2/products.jsonl'
usersDataPath = '../notebooks/data/v2/users.jsonl'
sessionsDF = pd.read_json(sessionsDataPath, lines=True)
productsDF = pd.read_json(productsDataPath, lines=True)
usersDF = pd.read_json(usersDataPath, lines=True)

df = sessionsDF.drop(
    columns=["timestamp", "event_type", "offered_discount", "purchase_id"])
df["count"] = 1

train, test = train_test_split(df, test_size=0.2, stratify=df['user_id'])
interactionMatrixDF = pd.pivot_table(train,
                                     index="user_id",
                                     columns="product_id",
                                     values="count",
                                     aggfunc="sum",
                                     fill_value=0)
display(interactionMatrixDF)


user_id \ product_id,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,...,1310,1311,1312,1313,1314,1315,1316,1317,1318,1319
102,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
103,0,2,3,11,1,2,2,3,1,9,...,0,1,2,1,1,3,6,1,0,6
104,0,1,4,5,5,9,0,5,2,5,...,1,1,1,0,1,1,6,2,5,1
105,0,2,7,8,4,3,0,4,2,4,...,0,1,2,0,1,0,9,4,0,1
106,2,1,2,9,2,2,2,3,0,4,...,0,0,1,3,1,1,6,2,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,1,1,5,12,7,1,0,3,8,6,...,0,1,0,2,0,1,9,1,2,2
298,0,0,5,1,3,3,0,6,1,1,...,1,0,0,0,0,2,7,2,0,0
299,0,0,3,0,1,0,0,1,0,0,...,0,0,1,0,1,1,2,1,0,1
300,3,4,6,6,2,5,0,5,2,8,...,1,1,0,0,4,2,9,1,0,4
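
To make the pivot step easier to follow, here is a minimal sketch on a handful of made-up events (the `events` frame and its values are hypothetical). Each event contributes a count of 1, and `pivot_table` sums those counts per (user, product) pair, filling missing pairs with 0.

```python
import pandas as pd

# Hypothetical session events: one row per user-product interaction.
events = pd.DataFrame({
    "user_id":    [102, 102, 103, 103, 103],
    "product_id": [1003, 1003, 1002, 1004, 1004],
})
events["count"] = 1

# Rows become users, columns become products, cells hold interaction counts.
matrix = pd.pivot_table(events,
                        index="user_id",
                        columns="product_id",
                        values="count",
                        aggfunc="sum",
                        fill_value=0)
```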


Scaling the values in the interaction matrix.

In [3]:
interactionMatrixDF = pd.DataFrame(
    preprocessing.MinMaxScaler().fit_transform(interactionMatrixDF))
display(interactionMatrixDF)

,0,1,2,3,4,5,6,7,8,9,...,309,310,311,312,313,314,315,316,317,318
0,0.000000,0.0,0.055556,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0000,...,0.0,0.000000,0.0,0.000000,0.000000,0.00,0.000000,0.2,0.000000,0.000000
1,0.000000,0.4,0.166667,0.379310,0.071429,0.153846,0.5,0.250000,0.090909,0.5625,...,0.0,0.142857,0.4,0.333333,0.166667,0.75,0.260870,0.2,0.000000,0.857143
2,0.000000,0.2,0.222222,0.172414,0.357143,0.692308,0.0,0.416667,0.181818,0.3125,...,0.5,0.142857,0.2,0.000000,0.166667,0.25,0.260870,0.4,0.714286,0.142857
3,0.000000,0.4,0.388889,0.275862,0.285714,0.230769,0.0,0.333333,0.181818,0.2500,...,0.0,0.142857,0.4,0.000000,0.166667,0.00,0.391304,0.8,0.000000,0.142857
4,0.285714,0.2,0.111111,0.310345,0.142857,0.153846,0.5,0.250000,0.000000,0.2500,...,0.0,0.000000,0.2,1.000000,0.166667,0.25,0.260870,0.4,0.000000,0.142857
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0.142857,0.2,0.277778,0.413793,0.500000,0.076923,0.0,0.250000,0.727273,0.3750,...,0.0,0.142857,0.0,0.666667,0.000000,0.25,0.391304,0.2,0.285714,0.285714
196,0.000000,0.0,0.277778,0.034483,0.214286,0.230769,0.0,0.500000,0.090909,0.0625,...,0.5,0.000000,0.0,0.000000,0.000000,0.50,0.304348,0.4,0.000000,0.000000
197,0.000000,0.0,0.166667,0.000000,0.071429,0.000000,0.0,0.083333,0.000000,0.0000,...,0.0,0.000000,0.2,0.000000,0.166667,0.25,0.086957,0.2,0.000000,0.142857
198,0.428571,0.8,0.333333,0.206897,0.142857,0.384615,0.0,0.416667,0.181818,0.5000,...,0.5,0.142857,0.0,0.000000,0.666667,0.50,0.391304,0.2,0.000000,0.571429
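
Note that the scaled frame above is indexed by position (0, 1, 2, ...) rather than by user_id and product_id, because `fit_transform` returns a plain array. If keeping the labels is preferred, one option (a sketch on hypothetical data, not what this notebook does) is to pass the original index and columns back in. `MinMaxScaler` scales each column, i.e. each product, independently to the range [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical counts: rows are user ids, columns are product ids.
counts = pd.DataFrame([[0, 2], [4, 1], [2, 3]],
                      index=[102, 103, 104],
                      columns=[1001, 1002])

# Re-attach the labels that fit_transform drops.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(counts),
                      index=counts.index,
                      columns=counts.columns)
```

With the labels preserved, rows can later be looked up by user_id directly instead of through a separate position-to-id dictionary.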


Building lookup dictionaries that map row positions to product and user ids.

In [4]:
idxNameDict = pd.Series(productsDF["product_id"].values,
                        index=productsDF.index).to_dict()
idxUserDict = pd.Series(usersDF["user_id"].values,
                        index=usersDF.index).to_dict()


Decomposing the matrix into latent-feature matrices for users and for products.

In [5]:
from sklearn.decomposition import TruncatedSVD

# initial hyperparameters
epsilon = 1e-9
latentFactors = 10

# generate user latent features
userSVD = TruncatedSVD(n_components=latentFactors)
userFeatures = userSVD.fit_transform(interactionMatrixDF) + epsilon

# generate item latent features (not used by the current implementation)
itemSVD = TruncatedSVD(n_components=latentFactors)
itemFeatures = itemSVD.fit_transform(interactionMatrixDF.transpose(
)) + epsilon  # transpose because items are columns

pd.DataFrame(itemFeatures)
pd.DataFrame(userFeatures)


,0,1,2,3,4,5,6,7,8,9
0,0.367281,0.029281,0.024246,0.102174,-0.002553,-0.092535,0.059707,-0.041813,-0.011968,-0.043646
1,4.583006,0.307588,-0.912538,-1.113317,-0.156357,-0.795703,0.774616,-0.146545,-0.518854,0.147870
2,3.765190,0.640754,-0.656752,1.046037,0.714344,0.462094,0.656338,0.008864,-0.532910,-0.810272
3,5.506008,1.025717,0.879809,-1.467996,-0.987770,-0.112879,-0.059698,1.365018,-0.675826,-0.141256
4,3.701279,-0.290698,0.100126,-0.642942,-0.274867,0.504570,-0.063024,0.221612,-0.181342,0.104109
...,...,...,...,...,...,...,...,...,...,...
195,4.600296,-0.172850,0.325911,-0.466063,0.915778,-0.050839,0.361075,-0.330044,0.550269,-0.019744
196,2.499357,-0.062864,-0.381131,0.052593,-0.338035,-0.531264,-0.156245,0.228092,-0.105152,-0.147906
197,0.963855,-0.054570,0.116466,-0.051311,-0.054170,-0.140578,-0.129505,0.188384,-0.090507,-0.193744
198,3.660492,-0.859937,0.060752,1.098366,0.267928,-0.072937,0.247381,-0.387993,0.371458,0.515733
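
The number of latent factors (10) is only an initial choice. One common way to sanity-check it, shown here as a sketch on random stand-in data and not as part of this notebook's pipeline, is to look at the cumulative `explained_variance_ratio_` of the fitted `TruncatedSVD`. The `matrix` below is a hypothetical placeholder for the scaled interaction matrix.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random stand-in for the scaled interaction matrix (50 users x 20 products).
rng = np.random.default_rng(0)
matrix = rng.random((50, 20))

svd = TruncatedSVD(n_components=10, random_state=0)
svd.fit(matrix)

# cumulative[i] is the share of variance captured by the first i + 1 components.
cumulative = np.cumsum(svd.explained_variance_ratio_)
```

If the curve flattens early, fewer components may be enough; if it is still climbing at the last component, a larger value may be worth trying.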


Splitting users into groups (clustering).

In [6]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=8, n_init=1000)

model = kmeans.fit_predict(userFeatures)
modelDF = pd.DataFrame(model).rename(columns={0: 'group'}, index=idxUserDict)
modelDF


,group
102,0
103,4
104,1
105,4
106,2
...,...
297,1
298,5
299,0
300,1
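
The notebook fixes the number of clusters at 8. A possible way to compare alternatives, offered here only as a sketch on synthetic data, is the silhouette score from `sklearn.metrics`. The `features` array and the candidate range are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for userFeatures: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
features = np.vstack(
    [center + rng.normal(scale=1.0, size=(30, 2)) for center in centers])

# Score each candidate number of clusters; higher silhouette is better.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)
```

On real user features the peak is usually less clear-cut, so the score is best read as a hint rather than a decision rule.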


Building user profiles

In [7]:
separator = ';'
newGroups = [
    'Gry komputerowe', 'Gry na konsole', 'Sprzęt RTV', 'Komputery',
    'Telefony i akcesoria'
]


def castCategoryPath(categoryPath):
    categories = categoryPath.split(separator)
    foundGroups = [group for group in newGroups if group in categories]
    if len(foundGroups) != 1:
        raise RuntimeError('wrong group cast: {}'.format(foundGroups))
    return foundGroups[0]
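
A short usage sketch of the function above. It is repeated here so the example is self-contained, and both category paths are hypothetical strings made up for illustration, not values taken from products.jsonl.

```python
separator = ';'
newGroups = [
    'Gry komputerowe', 'Gry na konsole', 'Sprzęt RTV', 'Komputery',
    'Telefony i akcesoria'
]


def castCategoryPath(categoryPath):
    # Split the path into its levels and keep the single level that is a known group.
    categories = categoryPath.split(separator)
    foundGroups = [group for group in newGroups if group in categories]
    if len(foundGroups) != 1:
        raise RuntimeError('wrong group cast: {}'.format(foundGroups))
    return foundGroups[0]


# Hypothetical path containing exactly one known group.
result = castCategoryPath('Gry i konsole;Gry komputerowe')

# A path with no known group (or more than one) raises RuntimeError.
try:
    castCategoryPath('Ogród;Narzędzia')
    unknownRaised = False
except RuntimeError:
    unknownRaised = True
```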

In [8]:
def build_user_profile(userID: int, merged: pd.DataFrame) -> dict:
    profile = {}
    userActDF = merged.loc[merged['user_id'] == userID].drop(
        columns=['user_id'])

    for group in newGroups:
        g = userActDF.loc[userActDF['category_path'] == group].drop(
            columns=['category_path']).value_counts().to_frame().rename(
                columns={
                    0: 'count'
                }).reset_index()
        g = g[g['count'] >= g['count'].quantile(0.9)]
        profile[group] = g.set_index('product_id').to_dict()
    return profile
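
The quantile filter in `build_user_profile` keeps only the products a user interacted with most often within a category. The sketch below applies the same filter to a small, made-up table of counts (the values are hypothetical).

```python
import pandas as pd

# Hypothetical per-product interaction counts for one user in one category.
g = pd.DataFrame({
    "product_id": [1050, 1049, 1053, 1054, 1041],
    "count": [14, 6, 6, 6, 1],
})

# The 0.9 quantile of the counts (linear interpolation) is the cut-off.
threshold = g["count"].quantile(0.9)
kept = g[g["count"] >= threshold]

# Same shape as the per-category entries stored in the profile.
profile = kept.set_index("product_id").to_dict()
```

Here the threshold is 10.8, so only product 1050 survives. With many products per category, roughly the top 10% by count are kept.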


In [9]:
merged = pd.merge(sessionsDF, productsDF, how='inner', on='product_id')
merged.drop(columns=[
    'session_id', "timestamp", "event_type", "offered_discount", "purchase_id",
    'product_name', 'price', 'user_rating'
],
            inplace=True)
merged['category_path'] = merged['category_path'].apply(castCategoryPath)

usersID = usersDF['user_id'].unique()
profileDict = {}

for userID in usersID:
    profileDict[userID] = build_user_profile(userID, merged)

# The dict has to be repacked: numpy's int64-to-int conversion has a bug that yields int32 instead. The issue is open on GitHub and affects Windows only.
profileDict = {int(key): value for key, value in profileDict.items()}


In [10]:
groupsLabels = modelDF['group'].unique()
groupsTop = {}

for label in groupsLabels:
    group = modelDF.loc[modelDF['group'] == label].index
    categoryDict = {}
    for category in newGroups:
        dfListPre = []
        for user in group:
            dfListPre.append(pd.DataFrame(profileDict[user][category]))

        dfList = [
            df.reset_index().rename(columns={'index': 'product_id'})
            for df in dfListPre
        ]

        df = pd.concat(dfList).groupby(['product_id']).sum().reset_index().sort_values(
            by=['count'], ascending=False)

        df.reset_index(drop=True, inplace=True)
        df.drop(columns=['count'], inplace=True)
        categoryDict[category] = df['product_id'].to_list()

    groupsTop[label] = categoryDict

# The dict has to be repacked: numpy's int64-to-int conversion has a bug that yields int32 instead. The issue is open on GitHub and affects Windows only.
groupsTop = {int(key): value for key, value in groupsTop.items()}


Saving the model and the profiles to files.

In [11]:
import json

modelDF.to_json('../models/similar_users.json')

with open('../models/profiles.json', 'w') as file:
    json.dump(profileDict, file, sort_keys=True, indent=4)

with open('../models/groups_profiles.json', 'w') as file:
    json.dump(groupsTop, file, sort_keys=True, indent=4)
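
Independently of the Windows issue mentioned in the comments above, converting numpy integer keys to plain `int` matters for `json`: the standard encoder rejects `numpy.int64` dictionary keys. The sketch below shows this on a tiny hypothetical profile, together with the fact that integer keys come back as strings after a round trip.

```python
import json
import numpy as np

# Hypothetical profile keyed by a numpy integer, as produced by .unique().
profiles = {np.int64(102): {"count": {1049: 6}}}

# The standard JSON encoder does not accept numpy.int64 as a dict key.
try:
    json.dumps(profiles)
    numpyKeysAccepted = True
except TypeError:
    numpyKeysAccepted = False

# Repacking the keys to built-in int makes the dict serializable.
repacked = {int(key): value for key, value in profiles.items()}
restored = json.loads(json.dumps(repacked))

# JSON object keys are always strings, so they are converted back on load.
restoredIntKeys = {int(key): value for key, value in restored.items()}
```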


Test read of the saved data

In [12]:
savedModel = pd.read_json('../models/similar_users.json')

with open('../models/profiles.json', 'r') as file:
    data = json.load(file)

# json.load reads keys as str; repacking them to int makes the model easier to work with
savedProfiles = {int(key): value for key, value in data.items()}

display(savedModel)
savedProfiles[102]

,group
102,0
103,4
104,1
105,4
106,2
...,...
297,1
298,5
299,0
300,1


{'Gry komputerowe': {'count': {'1049': 6, '1050': 14, '1053': 6, '1054': 6}},
 'Gry na konsole': {'count': {'1041': 2}},
 'Komputery': {'count': {'1017': 3, '1077': 3, '1276': 3}},
 'Sprzęt RTV': {'count': {'1233': 3}},
 'Telefony i akcesoria': {'count': {'1317': 1}}}

Prediction example  
The user with id 102 is browsing a product from the *Gry komputerowe* (computer games) category.

Below is the prediction made with the in-memory model.

In [13]:
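query = [102, 'Gry komputerowe']  # [user_id, category]

# modelDF is indexed by user_id, so the user's group is looked up by label (.loc)
group = modelDF.loc[modelDF['group'] == modelDF.loc[query[0], 'group']].index
dfListPre = []

for user in group:
    dfListPre.append(pd.DataFrame(profileDict[user][query[1]]))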

dfList = [
    df.reset_index().rename(columns={'index': 'product_id'})
    for df in dfListPre
]

df = pd.concat(dfList).groupby(['product_id']).sum().reset_index().sort_values(
    by=['count'], ascending=False)
df

,product_id,count
5,1053,1690
2,1050,1446
0,1048,1009
1,1049,649
7,1055,602
4,1052,559
6,1054,443
8,1056,422
3,1051,196
10,1272,134
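
The prediction steps above can be wrapped in a single function. The version below is a minimal sketch under stated assumptions: `recommend`, `groups`, `profiles` and `top_n` are hypothetical names, and the miniature data only mimics the structure of `modelDF` and `profileDict`. It is not the implementation from ./models/advanced.py.

```python
import pandas as pd

# Hypothetical miniature of modelDF (user_id -> group) and profileDict.
groups = pd.DataFrame({"group": [0, 1, 0]}, index=[102, 103, 104])
profiles = {
    102: {"Gry komputerowe": {"count": {1050: 14, 1053: 6}}},
    103: {"Gry komputerowe": {"count": {1048: 9}}},
    104: {"Gry komputerowe": {"count": {1053: 10, 1049: 2}}},
}


def recommend(user_id, category, groups, profiles, top_n=10):
    """Rank products by summed counts across all users in the same group."""
    label = groups.loc[user_id, "group"]
    members = groups.index[groups["group"] == label]

    totals = {}
    for member in members:
        counts = profiles[int(member)][category]["count"]
        for product_id, count in counts.items():
            totals[product_id] = totals.get(product_id, 0) + count

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [product_id for product_id, _ in ranked[:top_n]]


top = recommend(102, "Gry komputerowe", groups, profiles)
```

Users 102 and 104 share a group, so their counts are summed (1053: 16, 1050: 14, 1049: 2), while user 103 is ignored.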


Prediction for the model read back from .json.  
The predictions are identical, which indicates that the serialization is correct.

In [14]:
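query = [102, 'Gry komputerowe']  # [user_id, category]

# savedModel is indexed by user_id, so the user's group is looked up by label (.loc)
group = savedModel.loc[savedModel['group'] == savedModel.loc[query[0],
                                                             'group']].index
dfListPre = []

for user in group:
    dfListPre.append(pd.DataFrame(savedProfiles[user][query[1]]))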

dfList = [
    df.reset_index().rename(columns={'index': 'product_id'})
    for df in dfListPre
]

df = pd.concat(dfList).groupby(['product_id']).sum().reset_index().sort_values(
    by=['count'], ascending=False)
df

,product_id,count
5,1053,1690
2,1050,1446
0,1048,1009
1,1049,649
7,1055,602
4,1052,559
6,1054,443
8,1056,422
3,1051,196
10,1272,134
