In this notebook I try a different approach to recommendation, based on random walks over the graph that links each user to the groups they have joined, with the number of messages sent as the weight of each edge.

In [1]:
import pandas as pd
import networkx as nx
import csrgraph as cg
import numpy as np
from tqdm.notebook import tqdm
from joblib import Parallel, delayed
import pickle

In [2]:
group_membership = pd.read_pickle("df_membership.pkl")
users = pd.read_pickle("users.pkl")
groups = pd.read_pickle("groups.pickle")
group_membership = group_membership.loc[:,["user_id", "group_id","messages_count"]]
group_membership[group_membership.user_id==350313104].head()

Unnamed: 0,user_id,group_id,messages_count
190,350313104,-1001568744489,3
1802,350313104,-1001461601993,2
2018,350313104,-1001405226631,3
2203,350313104,-1001342690802,7
3196,350313104,-1001357462160,9


In [3]:
G = nx.Graph()
for _,row in group_membership.iterrows():
    G.add_edge(row.user_id, row.group_id, weight=row.messages_count)

For each node (representing a user or a group) I perform 50 random walks of length 6.

In [4]:
G = cg.csrgraph(G)
node_names = G.names
random_walks = G.random_walks(walklen=6, epochs=50, return_weight=1, neighbor_weight=1)
labeled_walks = np.vectorize(lambda x : node_names[x])(random_walks)
labeled_walks[:2]

array([[    1292286374, -1001568744489,      903461122, -1001568744489,
             806726832, -1001568744489],
       [-1001568744489,       66579321, -1001489401579,      187927651,
        -1001254729041,      439203500]])
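As a rough illustration of what a weighted random walk does on this user-group graph, here is a minimal pure-Python sketch. It is not csrgraph's implementation: the toy edges and the names `adjacency` and `weighted_walk` are hypothetical, and it assumes that with `return_weight=1` and `neighbor_weight=1` each step picks a neighbor with probability proportional to the edge weight.

```python
import random

# Toy bipartite graph (hypothetical data): positive ids are users,
# negative ids are groups, weights are message counts.
edges = [
    (1, -10, 3),
    (2, -10, 1),
    (2, -20, 5),
    (3, -20, 2),
]

# Build an undirected weighted adjacency map.
adjacency = {}
for user, group, weight in edges:
    adjacency.setdefault(user, {})[group] = weight
    adjacency.setdefault(group, {})[user] = weight


def weighted_walk(start, walklen, rng):
    """Return a walk of `walklen` nodes, choosing each step proportionally to edge weight."""
    walk = [start]
    while len(walk) < walklen:
        neighbors = adjacency[walk[-1]]
        next_node = rng.choices(list(neighbors), weights=list(neighbors.values()), k=1)[0]
        walk.append(next_node)
    return walk


rng = random.Random(0)
# 50 walks of 6 nodes from every node, mirroring epochs=50 and walklen=6.
walks = [weighted_walk(node, 6, rng) for node in adjacency for _ in range(50)]
```

Every step crosses from a user to a group or back, so consecutive nodes in a walk always alternate between the two kinds.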

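Then I record the start and end node of each walk, together with the number of walks from that start node that ended there. Since edges only connect users to groups and a walk of 6 nodes takes 5 steps, a walk that starts at a user always ends at a group.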

In [5]:
start_end_points = list()
for walk in labeled_walks:
    start_end_points.append((walk[0], walk[len(walk)-1]))

walk_results = pd.DataFrame(start_end_points)\
            .rename({0:"start", 1:"ends"}, axis=1)\
            .reset_index()\
            .groupby(["start","ends"])\
            .count()\
            .reset_index()\
            .rename({"index":"visits"}, axis=1)
walk_results.head()

Unnamed: 0,start,ends,visits
0,-1001787166958,64701764,1
1,-1001787166958,113374506,1
2,-1001787166958,136953086,1
3,-1001787166958,144897765,1
4,-1001787166958,278035059,2
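The same start/end counting can be sketched without pandas. This is only an illustration on hypothetical walks, using `collections.Counter` in place of the `groupby` above.

```python
from collections import Counter

# Hypothetical walks: the first node is the start, the last node is the end.
walks = [
    [1, -10, 2, -20],
    [1, -10, 2, -20],
    [1, -10, 2, -10],
    [2, -20, 3, -20],
]

# Count how many walks from each start node ended at each end node.
visits = Counter((walk[0], walk[-1]) for walk in walks)
```

Here `visits[(1, -20)]` is the number of walks that started at user 1 and ended in group -20.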


In [6]:
merged_walks = walk_results[walk_results.start > 0]\
                        .rename({"start":"user_id", "ends":"group_id", "index":"visits"}, axis=1)\
                        .merge(users)\
                        .merge(groups)

merged_walks[merged_walks.username=="acetimarco"].sort_values(by="visits", ascending=False)[:8]

Unnamed: 0,user_id,group_id,visits,first_name,last_name,username,title
12116,26170256,-1001456212600,4,Marco,Aceti,acetimarco,Chat | Studenti UniMi
19169,26170256,-1001396181733,4,Marco,Aceti,acetimarco,Algoritmi e strutture dati - Informatica
49409,26170256,-1001388941161,3,Marco,Aceti,acetimarco,Matematica del continuo - Informatica
23258,26170256,-1001379726039,3,Marco,Aceti,acetimarco,Ingegneria del software [PSS + PMD] (Club del ...
161472,26170256,-1001569149223,2,Marco,Aceti,acetimarco,Lingue e letterature straniere | StudentiUniMi
157771,26170256,-1001699979466,2,Marco,Aceti,acetimarco,International students | StudentiUniMi
68973,26170256,-1001157628331,2,Marco,Aceti,acetimarco,"Programmazione 1 - Informatica, Informatica Mu..."
38242,26170256,-1001167949464,2,Marco,Aceti,acetimarco,Alloggi - StudentiUniMi


To recommend groups to a user, I take the most visited groups among those the user has not already joined.

In [7]:
def get_not_in_groups(user_id):
    in_groups = group_membership[group_membership.user_id == user_id].group_id
    not_in_groups = groups[~(
        groups.group_id.isin(in_groups))].group_id
    return not_in_groups

def get_top_k(user, k):
    return merged_walks[(merged_walks.user_id==user) & (merged_walks.group_id.isin(get_not_in_groups(user)))]\
        .sort_values(by="visits", ascending=False)[:k]

get_top_k(26170256, 10)

Unnamed: 0,user_id,group_id,visits,first_name,last_name,username,title
23258,26170256,-1001379726039,3,Marco,Aceti,acetimarco,Ingegneria del software [PSS + PMD] (Club del ...
38242,26170256,-1001167949464,2,Marco,Aceti,acetimarco,Alloggi - StudentiUniMi
4820,26170256,-1001530261207,1,Marco,Aceti,acetimarco,Natural interaction - Informatica magistrale
106280,26170256,-1001300632521,1,Marco,Aceti,acetimarco,Crittografia 1 - Informatica | Informatica mus...
146924,26170256,-1001534900689,1,Marco,Aceti,acetimarco,Biotecnologie mediche | StudentiUniMi
136532,26170256,-1001215728849,1,Marco,Aceti,acetimarco,Fondamenti di social media digitali (Informati...
124971,26170256,-1001552980805,1,Marco,Aceti,acetimarco,Programmazione avanzata - Informatica magistrale
115443,26170256,-1001226553523,1,Marco,Aceti,acetimarco,Principi e modelli della percezione - Informat...
111649,26170256,-1001524770629,1,Marco,Aceti,acetimarco,Metodi probabilistici per l'informatica - Info...
109232,26170256,-1001212847374,1,Marco,Aceti,acetimarco,Matematica del Continuo - Informatica per la c...
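The selection step can be sketched in plain Python as follows. The data and the helper `top_k_new_groups` are hypothetical and only mirror the logic of `get_top_k`: drop the groups the user is already in, then sort by visit count.

```python
# Hypothetical visit counts for one user: group_id -> number of walks ending there.
visits = {-10: 4, -20: 3, -30: 3, -40: 1}
# Hypothetical set of groups the user has already joined.
joined = {-10, -30}


def top_k_new_groups(visits, joined, k):
    """Return the k most visited groups the user is not yet a member of."""
    candidates = [(group, count) for group, count in visits.items() if group not in joined]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [group for group, _ in candidates[:k]]


recommended = top_k_new_groups(visits, joined, 2)
```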


To narrow the search, instead of working with the whole graph, I build a subgraph for each user and run the random walks on it. The subgraph is chosen from the user's probable degree, inferred from the degrees of the groups the user is in, and contains the groups of that degree plus the extra groups that are not tied to any degree.

In [8]:
group_description = pd.read_pickle("group_description.pkl")
group_description.head()

-1001563734995    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
-1001557200491    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
-1001774201871    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, ...
-1001724030284    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
-1001531478970    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
dtype: object

In [9]:
group_degrees = pd.read_pickle("groups_degrees.pkl")
trainset = pd.read_pickle("trainset.pkl")

# Returns the 1-based id of the most likely degree for a user,
# -1 if no probable degree is detected, and 158 for FOR24 groups.
def get_probable_degree_id(user_id):
    in_groups = trainset[trainset.user_id == user_id].group_id
    # Sum the feature vectors of the user's groups; only the first 158 entries are used
    features_sum = sum(group_description[in_groups])[:158]
    if sum(features_sum) == 0:
        return -1
    return np.argmax(features_sum)+1

degrees = pd.read_pickle("degrees.pkl")
degrees[degrees.degree_id==get_probable_degree_id(26170256)]

Unnamed: 0,degree_id,degree_name,degree_type,degree_group_id
66,1,Informatica,B,-1001437343087
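The degree inference amounts to summing the feature vectors of the user's groups and taking the position of the largest entry. A minimal sketch on hypothetical vectors, assuming (as the `[:158]` slice and the `+1` above suggest) that the leading entries correspond to degree ids starting at 1; `probable_degree_id` and `n_degrees` are illustrative names:

```python
def probable_degree_id(feature_vectors, n_degrees):
    """Sum per-group vectors and return the 1-based index of the largest entry, or -1."""
    totals = [0.0] * n_degrees
    for vector in feature_vectors:
        for i in range(n_degrees):
            totals[i] += vector[i]
    if sum(totals) == 0:
        return -1
    # Like np.argmax, ties resolve to the first maximum.
    return max(range(n_degrees), key=totals.__getitem__) + 1


# Hypothetical vectors: 3 degree slots plus one trailing non-degree feature.
vectors = [
    [0.0, 0.6, 0.0, 1.0],
    [0.0, 0.4, 0.2, 0.0],
]
```

With these vectors the second slot has the largest total, so the sketch returns degree id 2; a user whose groups carry no degree weight gets -1.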


In [10]:
def get_user_subgraph(user_id):
    degree_id = get_probable_degree_id(user_id)
    
    if degree_id == -1:
        subgroups = group_degrees[(group_degrees.degree_id.isna())].group_id.values
    else: 
        subgroups = group_degrees[(group_degrees.degree_id.isna()) | (
            group_degrees.degree_id == degree_id)].group_id.values

    sub_group_membership = trainset[trainset.group_id.isin(
        subgroups)]
        
    assert len(sub_group_membership[sub_group_membership.user_id == user_id]) != 0
    
    G = nx.Graph()
    for _,row in sub_group_membership.iterrows():
        G.add_edge(row.user_id, row.group_id, weight=row.messages_count)
    return G
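The group filter inside `get_user_subgraph` can be illustrated on toy data. The rows below are hypothetical, with `None` standing in for a missing `degree_id`, and `subgraph_edges` is an illustrative helper rather than part of the notebook.

```python
# Hypothetical rows: (group_id, degree_id); None means the group is not tied to a degree.
group_degrees = [(-10, 1), (-20, 2), (-30, None)]
# Hypothetical membership rows: (user_id, group_id, messages_count).
membership = [(1, -10, 3), (1, -30, 1), (2, -20, 4), (2, -30, 2)]


def subgraph_edges(degree_id):
    """Keep edges to degree-less groups plus, if a degree is known, that degree's groups."""
    allowed = {
        group
        for group, degree in group_degrees
        if degree is None or (degree_id != -1 and degree == degree_id)
    }
    return [row for row in membership if row[1] in allowed]
```

For a user with degree 1 this keeps the edges to groups -10 and -30, while a user with no detected degree (-1) only keeps the edges to group -30.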


In [11]:
def get_graph_predictions(user_id, walklen=6, epochs=50, return_weight=1, neighbor_weight=1):
    
    user_subgraph = get_user_subgraph(user_id)

    G = cg.csrgraph(user_subgraph)
    node_names = G.names
    random_walks = G.random_walks(walklen=walklen, epochs=epochs, \
        return_weight=return_weight, neighbor_weight=neighbor_weight)

    labeled_walks = np.vectorize(lambda x : node_names[x])(random_walks)

    start_end_points = list()
    for walk in labeled_walks:
        start_end_points.append((walk[0], walk[len(walk)-1]))

    walk_results = pd.DataFrame(start_end_points)\
                .rename({0:"start", 1:"ends"}, axis=1)\
                .reset_index()\
                .groupby(["start","ends"])\
                .count()\
                .reset_index()

    merged_walks = walk_results[walk_results.start > 0]\
                            .rename({"start":"user_id", "ends":"group_id", "index":"prediction"}, axis=1)\
                            .merge(users)\
                            .merge(groups)
    return merged_walks


In [12]:
def get_not_in_groups(user_id):
    in_groups = trainset[trainset.user_id == user_id].group_id
    not_in_groups = groups[~(
        groups.group_id.isin(in_groups))].group_id
    return not_in_groups

def get_top_k_graph(user, k):
    merged_walks = get_graph_predictions(user)
    return merged_walks[(merged_walks.user_id==user) & (merged_walks.group_id.isin(get_not_in_groups(user)))]\
        .sort_values(by="prediction", ascending=False)[:k]

user_id = 350313104
graph_predictions = get_top_k_graph(user_id, 8).reset_index().drop(columns=["index"])
graph_predictions

Unnamed: 0,user_id,group_id,prediction,first_name,last_name,username,title
0,350313104,-1001350552358,2,Alessia,,aleceres,"Intelligent Systems for Industry, Supply Chain..."
1,350313104,-1001181636991,2,Alessia,,aleceres,Test di Inglese - StudentiUniMi
2,350313104,-1001348261542,2,Alessia,,aleceres,Modellazione e analisi di sistemi - Informatic...
3,350313104,-1001260565113,1,Alessia,,aleceres,Compravendita libri e appunti - StudentiUniMi
4,350313104,-1001167949464,1,Alessia,,aleceres,Alloggi - StudentiUniMi
5,350313104,-1001220140042,1,Alessia,,aleceres,Advent Of Code | StudentiUniMi
6,350313104,-1001288175726,1,Alessia,,aleceres,Visione artificiale - Informatica magistrale
7,350313104,-1001452073794,1,Alessia,,aleceres,Statistical methods for machine learning - Inf...


In [13]:
# Run the per-user predictions in parallel to reduce computation time
k = 8
graphs = Parallel(n_jobs=-1, prefer="processes")(
    delayed(get_top_k_graph)(user, k) for user in tqdm(trainset.user_id.unique(), total=len(trainset.user_id.unique()))
)


  0%|          | 0/4183 [00:00<?, ?it/s]

In [14]:
# Concatenate the per-user predictions into a single DataFrame
graph_predictions = pd.concat(graphs)

graph_predictions

with open("predictions_graphs.pkl", "wb+") as f:
    pickle.dump(graph_predictions, f)


In [15]:
graph_predictions[graph_predictions.user_id==26170256]

Unnamed: 0,user_id,group_id,prediction,first_name,last_name,username,title
16983,26170256,-1001379726039,5,Marco,Aceti,acetimarco,Ingegneria del software [PSS + PMD] (Club del ...
10588,26170256,-1001440238500,3,Marco,Aceti,acetimarco,Reti di calcolatori - Informatica
30304,26170256,-1001181636991,3,Marco,Aceti,acetimarco,Test di Inglese - StudentiUniMi
7792,26170256,-1001456212600,2,Marco,Aceti,acetimarco,Chat | Studenti UniMi
52563,26170256,-1001213858785,2,Marco,Aceti,acetimarco,Ripetizioni - Studenti UniMi
2156,26170256,-1001469541498,1,Marco,Aceti,acetimarco,Sicurezza e Privatezza - Informatica | Informa...
3679,26170256,-1001466214340,1,Marco,Aceti,acetimarco,Elaborazione dei segnali - Informatica Musical...
35648,26170256,-1001362716461,1,Marco,Aceti,acetimarco,Intelligenza Artificiale 1 - Informatica


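I compute precision, hit rate (HR), MAP and MRR at k for this implementation. As written, `average_precision` and `RR` compute the same quantity (the sum of the reciprocal ranks of the relevant recommended groups), so the MAP and MRR values coincide.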

In [16]:
testset = pd.read_pickle("testset.pkl")

def is_relevant(user, item):
    return len(
            testset[
                (testset.user_id == user) & (testset.group_id == item)
            ]
            ) != 0


def HR(user, k):
    recommended_items = graph_predictions[graph_predictions.user_id==user][:k]
    return sum(is_relevant(user, item) for item in recommended_items.group_id)


def average_precision(user, k):
    recommended_items = graph_predictions[graph_predictions.user_id==user][:k]
    return sum(
        is_relevant(user, row[1].group_id) * (1 / (rank + 1) * 1)
        for rank, row in enumerate(recommended_items.iterrows())
    )

def RR(user,k):
    recommended_items = graph_predictions[graph_predictions.user_id==user][:k]
    return (sum(is_relevant(user,row[1].group_id)*(1/(rank+1)) for rank,row in enumerate(recommended_items.iterrows())))

def precision(user,k):
    recommended_items = graph_predictions[graph_predictions.user_id==user][:k]
    return sum(is_relevant(user,item) for item in recommended_items.group_id)*1/k

In [17]:
k=8
print(f"Average P@{k}: {np.mean([precision(user,k) for user in testset.user_id])}")
print(f"HR@{k}: {np.mean([HR(user,k) for user in testset.user_id])}")
print(f"MAP@{k}: {np.mean([average_precision(user,k) for user in testset.user_id])}")
print(f"MRR@{k}: {np.mean([RR(user,k) for user in testset.user_id])}")

Average P@8: 0.059637278263171827
HR@8: 0.4770982261053746
MAP@8: 0.25094840954650327
MRR@8: 0.25094840954650327
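For comparison, here is a sketch of common textbook definitions of these metrics on plain lists. They are not identical to the functions above, and the normalization of average precision varies between sources; this sketch divides by the number of hits, which is only one common variant.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(item in relevant for item in recommended[:k]) / k


def reciprocal_rank(recommended, relevant, k):
    """1 / rank of the first relevant item in the top k, or 0 if there is none."""
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1 / rank
    return 0.0


def average_precision(recommended, relevant, k):
    """Mean of precision@i over the ranks i that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical ranking and ground truth.
recommended = [-10, -20, -30, -40]
relevant = {-10, -30}
```

When a user has a single relevant item, the reciprocal rank and this average precision give the same value; they differ once several relevant items appear in the top k.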


Then I combine the output of the content-based recommender system (implemented in contentbased.ipynb) with the graph-based one (implemented here) using the reciprocal rank method: a group at position n in a recommender's ranking gets a score of 1/n. For each group that appears in either recommendation, I sum its two scores and keep the groups with the highest totals.
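The idea can be sketched on two plain ranked lists. This is an illustration under the 1/n scoring described above, with hypothetical group ids and a hypothetical helper `reciprocal_rank_fusion`; it is not the notebook's own code.

```python
def reciprocal_rank_fusion(rankings, k):
    """Sum 1/position over every ranking that contains an item; return the top-k items."""
    scores = {}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1 / position
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical rankings from the two recommenders, best group first.
graph_ranking = [-10, -20, -30]
content_ranking = [-20, -40, -10]

fused = reciprocal_rank_fusion([graph_ranking, content_ranking], 3)
```

Group -20 scores 1/2 + 1 = 1.5 and group -10 scores 1 + 1/3, so a group ranked well by both recommenders overtakes one ranked first by only one of them.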

In [18]:
content_based_predictions = pd.read_pickle("all_merged_predictions.pkl")
graph_predictions_rank = graph_predictions.reset_index()
graph_predictions_rank.drop(columns=["index"], inplace=True)

In [19]:
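def get_reciprocal_rank(user, group, predictions):
    # Note: because of operator precedence this evaluates as
    # (1 / len(predictions)) - position + 1, where position is the row's index
    # label in the whole `predictions` frame, rather than 1 / (position + 1).
    # This is why the scores in the output below are large negative numbers.
    return 1/len(predictions) - \
                 predictions[(predictions.user_id == user) & \
                                   (predictions.group_id == group)]\
                 .index[0] + 1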

user = content_based_predictions.user_id[0]
reciprocal_rank_predictions = list()

In [20]:
user = 26170256
k=8

def get_rr_recommendation(user, k):
    top_k_graph = graph_predictions_rank[graph_predictions_rank.user_id==user]
    top_k_cb = content_based_predictions[content_based_predictions.user_id==user]\
        .sort_values(by= ["prediction"], ascending=False)[:k]

    reciprocal_rank_predictions = list()
    for group in set(list(top_k_graph.group_id.values) + list(top_k_cb.group_id.values)):
        rank = 0 
        if group in top_k_graph.group_id.values:
            rank+= get_reciprocal_rank(user, group, graph_predictions_rank)
        if group in top_k_cb.group_id.values:
            rank+= get_reciprocal_rank(user, group, content_based_predictions)
        reciprocal_rank_predictions.append((user, group, rank))

    return pd.DataFrame(reciprocal_rank_predictions)\
            .rename({0:"user_id", 1:"group_id", 2:"pred"}, axis=1)\
            .sort_values(by= ["pred"], ascending=False)\
            .merge(groups)[:k]

get_rr_recommendation(user, k)



Unnamed: 0,user_id,group_id,pred,title
0,26170256,-1001379726039,-652.99997,Ingegneria del software [PSS + PMD] (Club del ...
1,26170256,-1001440238500,-653.99997,Reti di calcolatori - Informatica
2,26170256,-1001181636991,-654.99997,Test di Inglese - StudentiUniMi
3,26170256,-1001456212600,-655.99997,Chat | Studenti UniMi
4,26170256,-1001213858785,-656.99997,Ripetizioni - Studenti UniMi
5,26170256,-1001362716461,-659.99997,Intelligenza Artificiale 1 - Informatica
6,26170256,-1001189502801,-619243.0,Linguaggi e traduttori - Informatica | Informa...
7,26170256,-1001466214340,-908957.999969,Elaborazione dei segnali - Informatica Musical...


In [21]:
testset = pd.read_pickle("testset.pkl")

def is_relevant(user, item):
    return len(
            testset[
                (testset.user_id == user) & (testset.group_id == item)
            ]
            ) != 0


def HR(user, k):
    recommended_items = get_rr_recommendation(user, k)
    return sum(is_relevant(user, item) for item in recommended_items.group_id)


def average_precision(user, k):
    recommended_items = get_rr_recommendation(user, k)
    return sum(
        is_relevant(user, row[1].group_id) * (1 / (rank + 1) * 1)
        for rank, row in enumerate(recommended_items.iterrows())
    )

def RR(user,k):
    recommended_items = get_rr_recommendation(user, k)
    return (sum(is_relevant(user,row[1].group_id)*(1/(rank+1)) for rank,row in enumerate(recommended_items.iterrows())))

def precision(user,k):
    recommended_items = get_rr_recommendation(user, k)
    return sum(is_relevant(user,item) for item in recommended_items.group_id)*1/k

In [22]:
k=8
print(f"Average P@{k}: {np.mean([precision(user,k) for user in testset.user_id])}")
print(f"HR@{k}: {np.mean([HR(user,k) for user in testset.user_id])}")

Average P@8: 0.059604183214191156
HR@8: 0.47683346571352925


In [23]:
print(f"MAP@{k}: {np.mean([average_precision(user,k) for user in testset.user_id])}")
print(f"MRR@{k}: {np.mean([RR(user,k) for user in testset.user_id])}")

MAP@8: 0.3268797987821022
MRR@8: 0.3268797987821022
