# LightFM model epic 5

Deze notebook implementeert een hybride LightFM aanbevelingssysteem met zowel item- als gebruikerskenmerken. Deze kenmerken worden gebruikt om cold start problemen aan te pakken, zodat nieuwe gebruikers en items kunnen worden aanbevolen. Data wordt geladen vanuit een materialized view en zorgvuldig voorbereid voor gebruik in het model. Het model wordt getraind en voornamelijk geëvalueerd op basis van de AUC-score. AUC berekent de kans dat een willekeurig gekozen relevant item hoger scoort dan een willekeurig gekozen irrelevant item. Een hoge AUC-score betekent dat het model beter is in het voorspellen van relevante items en dus een goed aanbevelingssysteem vormt. 

Hier heeft het hybride model een lagere AUC-score (0.64) dan een zuiver collaboratief filteringmodel (0.78). Dit is waarschijnlijk te wijten aan het feit dat de kenmerken niet zo informatief blijken te zijn, vooral de thema's in de gebruikerskenmerken. Het is echter mogelijk dat het model beter presteert als de kenmerken worden verbeterd. De hybrid wordt toch verkozen boven het collaboratief filteringmodel omdat het model meer flexibiliteit biedt en *cold start* problemen aanpakt. De kenmerken kunnen eventueel in det toekomst verbeterd worden binnen de datacollectie zelf, en kunnen desnoods verwijderd worden.

## 0. Documentatie

[Artikel LightFM](https://medium.com/@speedfirefox1/games-recommender-system-using-lightfm-on-steam-dataset-76b05de4c187)

[Documentatie LightFM](https://making.lyst.com/lightfm/docs/index.html)

## 1. Imports

In [3]:
import os
import pickle
import sqlalchemy
import numpy as np
import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv
from lightfm.data import Dataset
from lightfm.evaluation import auc_score
from sqlalchemy import create_engine, text
from lightfm.evaluation import recall_at_k
from lightfm.evaluation import precision_at_k
from lightfm import LightFM, cross_validation

## 2. Variabelen

In [4]:
# LightFM model parameters
SEED = 42
NO_THREADS = 8
NO_EPOCHS = 15
NO_COMPONENTS = 20
TEST_PERCENTAGE = 0.1
LEARNING_RATE = 0.2
ITEM_ALPHA = 1e-7
USER_ALPHA = 1e-7
LOSS = 'logistic'
# Pickle
CHECKPOINT = 'LightFM'

## 3. Data inladen

In [5]:
load_dotenv()
DB_URL = os.getenv("DB_URL")
engine = create_engine(DB_URL)

try:
    connection = engine.connect()
    print("Successfully connected to the database")
except Exception as e:
    print(f"Failed to connect to the database: {e}")

print(f"SQLAlchemy version: {sqlalchemy.__version__}")

Successfully connected to the database
SQLAlchemy version: 2.0.21


In [6]:
# materialized view
query = text('SELECT * FROM epic_5')

try:
    df = pd.read_sql_query(query, connection)
except Exception as e:
    print(f"Failed to execute query: {e}")

df.head()

Unnamed: 0,PersoonId,CampagneId,aantal_sessies,aantal_bezoeken,SessieThema,SoortCampagne,TypeCampagne,ThemaDuurzaamheid,ThemaFinancieelFiscaal,ThemaInnovatie,ThemaInternationaalOndernemen,ThemaMobiliteit,ThemaOmgeving,ThemaSalesMarketingCommunicatie,ThemaStrategieEnAlgemeenManagement,ThemaTalent,ThemaWelzijn
0,CA75D311-2F3A-EB11-8118-001DD8B72B62,CA56CAFB-4E35-EB11-8116-001DD8B72B61,1,0,Communicatie,Online,Opleiding,0,0,0,0,0,0,0,0,0,0
1,9442B53E-E267-E111-A00F-00505680000A,7E04D7C8-59FC-EC11-82E5-000D3A3A954E,1,0,Netwerking,Offline,Netwerkevenement,0,0,0,0,0,0,0,0,0,0
2,969F7BEB-BB2B-E611-BEEF-005056B06EB4,0BB4BF28-C3C4-EC11-A7B6-000D3A497E09,1,1,Netwerking,Offline,Netwerkevenement,0,0,0,0,0,0,0,0,0,0
3,AC1392A3-EBD4-E711-80EE-001DD8B72B61,023C77E0-5ADE-E711-80EE-001DD8B72B61,3,0,Opvolging en Overname,Offline,Opleiding,0,0,0,0,0,0,0,0,0,0
4,0DEE595B-4B69-E111-B43A-00505680000A,320F556D-A72D-EC11-8124-001DD8B72B61,1,0,Familiebedrijven,Offline,Netwerkevenement,0,0,0,0,0,0,0,0,0,0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52567 entries, 0 to 52566
Data columns (total 17 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   PersoonId                           52567 non-null  object
 1   CampagneId                          52567 non-null  object
 2   aantal_sessies                      52567 non-null  int64 
 3   aantal_bezoeken                     52567 non-null  int64 
 4   SessieThema                         52567 non-null  object
 5   SoortCampagne                       52567 non-null  object
 6   TypeCampagne                        52567 non-null  object
 7   ThemaDuurzaamheid                   52567 non-null  int64 
 8   ThemaFinancieelFiscaal              52567 non-null  int64 
 9   ThemaInnovatie                      52567 non-null  int64 
 10  ThemaInternationaalOndernemen       52567 non-null  int64 
 11  ThemaMobiliteit                     52567 non-null  in

## 4. Data voorbereiden

In [8]:
# binaire integers van persoon thema's omzetten naar binaire strings voor betere leesbaarheid

columns = ['ThemaDuurzaamheid', 'ThemaFinancieelFiscaal', 'ThemaInnovatie', 'ThemaInternationaalOndernemen', 
           'ThemaMobiliteit', 'ThemaOmgeving', 'ThemaSalesMarketingCommunicatie', 
           'ThemaStrategieEnAlgemeenManagement', 'ThemaTalent', 'ThemaWelzijn']

for col in columns:
    df[col] = df[col].replace({0: col + '_False', 1: col + '_True'})

In [9]:
# aantal bezoeken laten vallen, focus ligt op aantal_sessies
df.drop('aantal_bezoeken', axis=1, inplace=True)

In [10]:
df.head(1)

Unnamed: 0,PersoonId,CampagneId,aantal_sessies,SessieThema,SoortCampagne,TypeCampagne,ThemaDuurzaamheid,ThemaFinancieelFiscaal,ThemaInnovatie,ThemaInternationaalOndernemen,ThemaMobiliteit,ThemaOmgeving,ThemaSalesMarketingCommunicatie,ThemaStrategieEnAlgemeenManagement,ThemaTalent,ThemaWelzijn
0,CA75D311-2F3A-EB11-8118-001DD8B72B62,CA56CAFB-4E35-EB11-8116-001DD8B72B61,1,Communicatie,Online,Opleiding,ThemaDuurzaamheid_False,ThemaFinancieelFiscaal_False,ThemaInnovatie_False,ThemaInternationaalOndernemen_False,ThemaMobiliteit_False,ThemaOmgeving_False,ThemaSalesMarketingCommunicatie_False,ThemaStrategieEnAlgemeenManagement_False,ThemaTalent_False,ThemaWelzijn_False


In [11]:
# item_features en user_features invullen voor lightFM

item_cols = ['SessieThema', 'SoortCampagne', 'TypeCampagne']
user_cols = ['ThemaDuurzaamheid', 'ThemaFinancieelFiscaal', 'ThemaInnovatie', 'ThemaInternationaalOndernemen', 'ThemaMobiliteit', 'ThemaOmgeving', 'ThemaSalesMarketingCommunicatie', 'ThemaStrategieEnAlgemeenManagement', 'ThemaTalent', 'ThemaWelzijn']

all_item_features = np.concatenate([df[col].unique() for col in item_cols]).tolist()
all_user_features = np.concatenate([df[col].unique() for col in user_cols]).tolist()

print(all_item_features)
print(all_user_features)

['Communicatie', 'Netwerking', 'Opvolging en Overname', 'Familiebedrijven', 'Marketing & Sales', 'Duurzaam Ondernemen', 'Unknown', 'Haven', 'Ruimtelijke ordening en Infrastructuur', 'Human Resources', 'Innovatie', 'Algemeen Management', 'Internationaal Ondernemen', 'Starten', 'Financieel', 'Digitalisering, IT & Technologie', 'Welt 2.0-2023', 'Welt 2.0', 'Groeien', 'Bryo', 'Arbeidsmarkt', 'Jong Voka', 'Opleidingen', 'Plato', 'Onderwijs', 'Economie', 'Strategie', 'Milieu', 'Logistiek en Transport', 'Lidmaatschap', 'Welt', 'Juridisch', 'Veiligheid & Preventie', 'Mobiliteit', 'Energie', 'Supply Chain', 'Coronavirus', 'Persoonlijke vaardigheden', 'Aankoop', 'Retail', 'Aantrekkelijke regio', 'Online', 'Offline', 'On en Offline', 'Opleiding', 'Netwerkevenement', 'Project', 'Infosessie', 'Campagne', 'Projectgebonden']
['ThemaDuurzaamheid_False', 'ThemaDuurzaamheid_True', 'ThemaFinancieelFiscaal_False', 'ThemaFinancieelFiscaal_True', 'ThemaInnovatie_False', 'ThemaInnovatie_True', 'ThemaInternat

In [12]:
# dataframes van items en users, id's met features

items = df[['CampagneId'] + item_cols]
users = df[['PersoonId'] + user_cols]

In [13]:
# toch groeperen op aantal sessies per persoon per campagne, geen scheiding tussen sessiethema
df = df.groupby(['PersoonId', 'CampagneId'])['aantal_sessies'].sum().reset_index()

In [14]:
# zie LightFM documentatie voor uitleg over Dataset

dataset = Dataset()

dataset.fit(
    users=df['PersoonId'],
    items=df['CampagneId'],
    user_features=all_user_features,
    item_features=all_item_features
)

num_users, num_items = dataset.interactions_shape()
print(f'Num users: {num_users}, num_items: {num_items}')

Num users: 16688, num_items: 1979


In [15]:
# interactie matrix tussen users en items, met aantal sessies als rating
(interactions, weights) = dataset.build_interactions(zip(df['PersoonId'], df['CampagneId'], df['aantal_sessies']))

In [16]:
# interactie matrix splitsen in train- en testset

train_interactions, test_interactions = cross_validation.random_train_test_split(
    interactions, test_percentage=TEST_PERCENTAGE,
    random_state=np.random.RandomState(seed=SEED)
)

In [17]:
# testen of de splitsing gelijk verdeeld is

print(f"Shape of train interactions: {train_interactions.shape}")
print(f"Shape of test interactions: {test_interactions.shape}")

Shape of train interactions: (16688, 1979)
Shape of test interactions: (16688, 1979)


In [18]:
# tupels maken van item id's met bijhorende features

def item_feature_generator():
    for _, row in items.iterrows():
        features = row.values[1:]
        yield (row['CampagneId'], features)

def user_feature_generator():
    for _, row in users.iterrows():
        features = row.values[1:]
        yield (row['PersoonId'], features)

In [19]:
# test print om de item tupels te bekijken

item_id, item_features = next(item_feature_generator())
print(f"Item ID: {item_id}, Item Features: {item_features}")

Item ID: CA56CAFB-4E35-EB11-8116-001DD8B72B61, Item Features: ['Communicatie' 'Online' 'Opleiding']


In [20]:
# test print om de user tupels te bekijken

user_id, user_features = next(user_feature_generator())
print(f"User ID: {user_id}, User Features: {user_features}")

User ID: CA75D311-2F3A-EB11-8118-001DD8B72B62, User Features: ['ThemaDuurzaamheid_False' 'ThemaFinancieelFiscaal_False'
 'ThemaInnovatie_False' 'ThemaInternationaalOndernemen_False'
 'ThemaMobiliteit_False' 'ThemaOmgeving_False'
 'ThemaSalesMarketingCommunicatie_False'
 'ThemaStrategieEnAlgemeenManagement_False' 'ThemaTalent_False'
 'ThemaWelzijn_False']


In [21]:
# feature matrices maken om later te gebruiken in het model

item_features = dataset.build_item_features((item_id, item_feature) for item_id, item_feature in item_feature_generator())
user_features = dataset.build_user_features((user_id, user_feature) for user_id, user_feature in user_feature_generator())

## 5. Model trainen

In [22]:
# meegegeven parameters voor LightFM model, zie variabelen bovenaan

model = LightFM(
    no_components=NO_COMPONENTS,
    learning_rate=LEARNING_RATE,
    random_state=np.random.RandomState(SEED),
    loss=LOSS,
    item_alpha=ITEM_ALPHA,
    user_alpha=USER_ALPHA
)

In [23]:
# model exporteren

def save_model(model):
    with open(f'{CHECKPOINT}.pickle', 'wb') as fle:
        pickle.dump(model, fle, protocol=pickle.HIGHEST_PROTOCOL)

In [24]:
# model trainen met epochs, de iteratie met de beste test AUC score wordt opgeslagen als model

train_auc_history = []
test_auc_history = []

best_score = 0

for epoch in tqdm(range(NO_EPOCHS)):
    model.fit_partial(
        interactions=train_interactions,
        user_features=user_features,
        item_features=item_features,
        epochs=NO_EPOCHS,
        num_threads=NO_THREADS
    )

    train_auc = auc_score(model, train_interactions, user_features=user_features, item_features=item_features).mean()
    test_auc = auc_score(model, test_interactions, train_interactions, user_features=user_features, item_features=item_features).mean()

    train_auc_history.append(train_auc)
    test_auc_history.append(test_auc)

    if test_auc > best_score:
        best_score = test_auc
        save_model(model)

    print(f'Epoch {epoch + 1}/{NO_EPOCHS}, Train AUC: {train_auc}, Test AUC: {test_auc}')

  7%|▋         | 1/15 [00:12<02:53, 12.43s/it]

Epoch 1/15, Train AUC: 0.619840681552887, Test AUC: 0.6328714489936829


 13%|█▎        | 2/15 [00:24<02:41, 12.45s/it]

Epoch 2/15, Train AUC: 0.621476411819458, Test AUC: 0.6345963478088379


 20%|██        | 3/15 [00:36<02:27, 12.28s/it]

Epoch 3/15, Train AUC: 0.6225326657295227, Test AUC: 0.6356953382492065


 27%|██▋       | 4/15 [00:49<02:17, 12.47s/it]

Epoch 4/15, Train AUC: 0.6233119368553162, Test AUC: 0.6364351511001587


 33%|███▎      | 5/15 [01:02<02:06, 12.60s/it]

Epoch 5/15, Train AUC: 0.6238833665847778, Test AUC: 0.6369789838790894


 40%|████      | 6/15 [01:14<01:52, 12.50s/it]

Epoch 6/15, Train AUC: 0.6242479681968689, Test AUC: 0.6373438239097595


 47%|████▋     | 7/15 [01:27<01:41, 12.68s/it]

Epoch 7/15, Train AUC: 0.6246291399002075, Test AUC: 0.6377169489860535


 53%|█████▎    | 8/15 [01:40<01:28, 12.61s/it]

Epoch 8/15, Train AUC: 0.6249518990516663, Test AUC: 0.6380428075790405


 60%|██████    | 9/15 [01:52<01:15, 12.56s/it]

Epoch 9/15, Train AUC: 0.6251904964447021, Test AUC: 0.6382948160171509


 67%|██████▋   | 10/15 [02:05<01:02, 12.47s/it]

Epoch 10/15, Train AUC: 0.6254053711891174, Test AUC: 0.6385499835014343


 73%|███████▎  | 11/15 [02:17<00:49, 12.49s/it]

Epoch 11/15, Train AUC: 0.6256507039070129, Test AUC: 0.6388140320777893


 80%|████████  | 12/15 [02:29<00:37, 12.45s/it]

Epoch 12/15, Train AUC: 0.6258955001831055, Test AUC: 0.639079213142395


 87%|████████▋ | 13/15 [02:42<00:24, 12.47s/it]

Epoch 13/15, Train AUC: 0.6261444091796875, Test AUC: 0.6392804384231567


 93%|█████████▎| 14/15 [02:54<00:12, 12.45s/it]

Epoch 14/15, Train AUC: 0.6263439655303955, Test AUC: 0.6394516825675964


100%|██████████| 15/15 [03:07<00:00, 12.51s/it]

Epoch 15/15, Train AUC: 0.6265164017677307, Test AUC: 0.6396564245223999





## 6. Model evalueren

[Interpretatie metrics](https://stackoverflow.com/questions/45451161/evaluating-the-lightfm-recommendation-model/45466481#45466481)

In [25]:
# metrics berekenen, vooral de AUC score is belangrijk hier voor de recommender system

def calculate_metrics(model, test, train, item_features, user_features, k):
    precision = precision_at_k(model=model, test_interactions=test, train_interactions=train, item_features=item_features, user_features=user_features, k=k).mean().round(5)
    recall = recall_at_k(model=model, test_interactions=test, train_interactions=train, item_features=item_features, user_features=user_features, k=k).mean().round(5)
    auc = auc_score(model=model, test_interactions=test, train_interactions=train, item_features=item_features, user_features=user_features).mean().round(5)
    return print('Precision: ', precision, '\nRecall: ', recall, '\nAUC: ', auc)

In [26]:
calculate_metrics(model, test_interactions, train_interactions, item_features, user_features, 1)

Precision:  0.02914 
Recall:  0.02006 
AUC:  0.63966


## 7. Model gebruiken

In [27]:

def get_top_items_for_user(user_id):
    """
    Retrieves the top 5 recommended items for a given user.

    Parameters:
    user_id (str): The ID of the user.

    Returns:
    str: A string indicating the top 5 recommended items for the user.
    """
    
    user_id_internal = dataset.mapping()[0][user_id]

    item_ids_internal = np.array(list(dataset.mapping()[2].values()))

    scores = model.predict(user_id_internal, item_ids_internal)

    top_items_indices = np.argsort(-scores)[:5]

    top_items_ids = [list(dataset.mapping()[2].keys())[i] for i in top_items_indices]

    return f'Top 5 recommended items for user: {top_items_ids}'

print(get_top_items_for_user('6E42A199-9F70-E911-80FE-001DD8B72B62'))

Top 5 recommended items for user: ['73134661-6938-EB11-8116-001DD8B72B61', '37245A1A-8082-EB11-811D-001DD8B72B62', '0B8EF091-0C47-EC11-8C62-6045BD8D273E', '334DEF40-3B80-ED11-81AD-6045BD895B5A', 'C7D7D4CB-6106-EC11-8123-001DD8B72B61']


Onderstaande functie is het antwoord op epic 5

In [28]:
def get_top_users_for_item(item_id):
    """
    Returns the top 20 recommended users for a given item.

    Parameters:
    item_id (str): The ID of the item.

    Returns:
    str: A string containing the top 20 recommended users for the item.
    """
    item_id_internal = dataset.mapping()[2][item_id]
    
    user_ids_internal = np.array(list(dataset.mapping()[0].values()))

    scores = model.predict(user_ids_internal, np.repeat(item_id_internal, len(user_ids_internal)))

    top_users_indices = np.argsort(-scores)[:20]

    top_users_ids = [list(dataset.mapping()[0].keys())[i] for i in top_users_indices]

    return f'Top 20 recommended users for item: {top_users_ids}'

print(get_top_users_for_item('8FCA1D31-1EB7-E811-80F4-001DD8B72B62'))

Top 20 recommended users for item: ['F49C652D-F83D-EB11-8116-001DD8B72B61', 'ACD7EB0E-7DCC-EC11-A7B5-000D3ABB7D90', '9243B91F-13D1-E811-80F6-001DD8B72B62', '1B9EC4B9-0568-E111-A00F-00505680000A', '93BFA2D9-1368-E111-A00F-00505680000A', 'D4EDF4D6-5169-E111-B43A-00505680000A', '6D728DFF-0792-E711-80EB-001DD8B72B62', '79068084-DE8E-EA11-810F-001DD8B72B61', '32AF1A80-5269-E111-B43A-00505680000A', '8AAE9250-FC4D-ED11-BBA3-6045BD895D85', 'B69A7355-F467-E111-A00F-00505680000A', '29D9D2DD-5BBA-E311-9A5C-005056B06EB4', '414DF096-2268-E111-A00F-00505680000A', '8E16B4A4-0CD6-EA11-8114-001DD8B72B62', '4E30ADDE-23FA-ED11-8849-000D3A4AB78E', '27510C40-3A8F-ED11-AAD1-6045BD895BFB', 'F2721807-ED09-EA11-8107-001DD8B72B62', '4AF8600F-B83A-EB11-8116-001DD8B72B61', '8AF61AD2-7A6F-EA11-8110-001DD8B72B62', 'BF389053-3491-EB11-811E-001DD8B72B62']
