This notebook accompanies the blog post https://engineering.taboola.com/think-your-data-different.

In [27]:
import pandas as pd
import numpy as np
import itertools
from sklearn.cluster import KMeans
import pprint

## 1. Prepare input for node2vec
We'll use a CSV file in which each row represents a single recommendable item: it contains a comma-separated list of the named entities that appear in the item's title.

In [2]:
named_entities_df = pd.read_csv('named_entities.csv')
named_entities_df.head()

Unnamed: 0,named_entities
0,"basketball,Kobe Bryant"
1,"basketball,Lebron James"


First, we'll have to tokenize the named entities, since `node2vec` expects integers.

In [3]:
tokenizer = dict()
named_entities_df['named_entities'] = named_entities_df['named_entities'].apply(
    lambda named_entities: [tokenizer.setdefault(named_entity, len(tokenizer))
                            for named_entity in named_entities.split(',')])
named_entities_df.head()

Unnamed: 0,named_entities
0,"[0, 1]"
1,"[0, 2]"


In [18]:
pprint.pprint(list(tokenizer.items())[0:5])

[('basketball', 0), ('Kobe Bryant', 1), ('Lebron James', 2)]
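The tokenization step relies on `dict.setdefault`: an entity seen for the first time is stored with the current size of the dictionary as its id, and an entity seen before returns the id it already has. A minimal standalone sketch, using two hypothetical titles instead of the CSV file:

```python
# Hypothetical titles, each already reduced to its comma-separated named entities.
titles = ["basketball,Kobe Bryant", "basketball,Lebron James"]

tokenizer = {}
tokenized = [
    # setdefault returns the stored id if the entity is known;
    # otherwise it stores len(tokenizer) as the new id and returns it.
    [tokenizer.setdefault(entity, len(tokenizer)) for entity in title.split(",")]
    for title in titles
]

print(tokenized)
print(tokenizer)
```

Because ids are handed out in order of first appearance, the mapping depends on the row order of the input.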


In order to construct the graph on which we'll run node2vec, we first need to understand which named entities appear together.

In [20]:
pairs_df = named_entities_df['named_entities'].apply(lambda named_entities: list(itertools.combinations(named_entities, 2)))
pairs_df = pairs_df[pairs_df.apply(len) > 0]
pairs_df = pd.DataFrame(np.concatenate(pairs_df.values), columns=['named_entity_1', 'named_entity_2'])
pairs_df.head()

Unnamed: 0,named_entity_1,named_entity_2
0,0,1
1,0,2
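Note that `itertools.combinations` keeps entities in the order they appear in a title, so the pairs `(0, 1)` and `(1, 0)` would later be counted as two different edges. Whether this matters depends on the data. The sketch below shows one possible refinement, which is not part of the original notebook: sorting the ids of each title first so that every pair has a canonical order. The input list is hypothetical.

```python
import itertools

# Hypothetical tokenized titles; the second lists the same two entities in reverse order.
tokenized_titles = [[0, 1], [1, 0], [0, 2], [3]]

pairs = []
for entities in tokenized_titles:
    # sorted(set(...)) drops repeats within a title and fixes the order inside each pair.
    for pair in itertools.combinations(sorted(set(entities)), 2):
        pairs.append(pair)

print(pairs)
```

Titles with a single entity, such as `[3]`, contribute no pairs, which mirrors the `len > 0` filter in the cell above.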


Now we can construct the graph. The weight of an edge connecting two named entities will be the number of times these named entities appear together in our dataset.

In [31]:
pairs_df.groupby(['named_entity_1', 'named_entity_2']).size().reset_index(name='weight')

Unnamed: 0,named_entity_1,named_entity_2,weight
0,0,1,1
1,0,2,1


In [33]:
NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD = 0
# By default, 25

edges_df = pairs_df.groupby(['named_entity_1', 'named_entity_2']).size().reset_index(name='weight')
edges_df = edges_df[edges_df['weight'] > NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD]
# node2vec reads the edge list as plain text, so the fields must be separated by ' '
# https://github.com/aditya-grover/node2vec/issues/42
edges_df[['named_entity_1', 'named_entity_2', 'weight']].to_csv('edges.csv', header=False, index=False, sep=' ')
edges_df.head()

Unnamed: 0,named_entity_1,named_entity_2,weight
0,0,1,1
1,0,2,1
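The same counting and thresholding logic can be sketched without pandas, which makes the role of the threshold easier to see. The pairs and the threshold value below are hypothetical; each resulting line has the space-separated `node1 node2 weight` layout written to `edges.csv`.

```python
from collections import Counter

# Hypothetical co-occurrence pairs, one entry per title in which the pair appears.
pairs = [(0, 1), (0, 1), (0, 2), (1, 2), (0, 1)]
threshold = 1  # hypothetical value; the notebook sets it to 0 for its tiny dataset

# Edge weight = number of times the pair appears together.
weights = Counter(pairs)

# Keep only edges whose weight is strictly greater than the threshold.
edge_lines = [
    f"{a} {b} {w}"
    for (a, b), w in sorted(weights.items())
    if w > threshold
]

print(edge_lines)
```

A higher threshold prunes rare co-occurrences, which removes noisy edges at the cost of dropping entities that appear only a few times.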


Next, we'll run `node2vec`, which will write the resulting embeddings to a file called `emb`.  
We'll use the open-source implementation developed by [Stanford](https://github.com/snap-stanford/snap/tree/master/examples/node2vec).

In [36]:
!python node2vec/src/main.py --input edges.csv --output emb --weighted

Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10


## 2. Read embeddings and run KMeans clustering:

In [37]:
emb_df = pd.read_csv('emb', sep=' ', skiprows=[0], header=None)
emb_df.set_index(0, inplace=True)
emb_df.index.name = 'named_entity'
emb_df.head()

named_entity,1,2,3,4,5,6,7,8,9,10,...,119,120,121,122,123,124,125,126,127,128
0,-0.017839,-0.01554,0.014009,0.011204,0.001812,0.016809,-0.029363,0.019553,0.017015,-0.042528,...,0.00544,0.007404,0.008619,-0.002957,0.007757,-0.027168,0.001521,0.009814,0.003208,-0.026657
1,-0.015903,-0.012227,0.009864,0.006678,0.006132,0.015084,-0.021753,0.01121,0.015354,-0.031373,...,0.009581,0.000854,0.00906,-0.001659,0.005635,-0.015787,-0.001362,0.005597,0.005464,-0.018249
2,-0.014181,-0.006827,0.011194,0.00144,0.001613,0.013619,-0.019055,0.011773,0.012155,-0.028162,...,0.006589,0.00317,0.002821,-0.004832,0.00182,-0.018488,0.004074,0.000793,0.003839,-0.0173


In [38]:
emb_df.shape

(3, 128)
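The `skiprows=[0]` argument above skips the first line of `emb`. The sketch below assumes the file follows the word2vec text layout: a header line with the number of nodes and the embedding dimension, then one line per node with its id followed by its vector. The file content and the helper name `read_embeddings` are hypothetical, and the vectors are truncated to two dimensions for brevity.

```python
import io

# Hypothetical file content in the assumed layout:
# "<number of nodes> <dimensions>", then "<node id> <value> <value> ..." per node.
emb_text = (
    "3 2\n"
    "0 -0.017839 -0.01554\n"
    "1 -0.015903 -0.012227\n"
    "2 -0.014181 -0.006827\n"
)

def read_embeddings(handle):
    """Parse the assumed layout into {node_id: [float, ...]}."""
    num_nodes, dim = map(int, handle.readline().split())
    embeddings = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        embeddings[int(parts[0])] = [float(value) for value in parts[1:]]
    # Check the body against what the header line announced.
    if len(embeddings) != num_nodes:
        raise ValueError("node count does not match the header")
    if any(len(vector) != dim for vector in embeddings.values()):
        raise ValueError("vector length does not match the header")
    return embeddings

emb = read_embeddings(io.StringIO(emb_text))
print(emb[1])
```

Checking the parsed data against the header is a cheap way to catch a truncated or mis-delimited file before clustering.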

> Each column is a dimension in the embedding space. Each row holds the embedding of one named entity.

> We'll now cluster the embeddings using a simple clustering algorithm such as k-means.

In [40]:
NUM_CLUSTERS = 2
# By default 10

kmeans = KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(emb_df)
labels = kmeans.predict(emb_df)
emb_df['cluster'] = labels
clusters_df = emb_df.reset_index()[['named_entity','cluster']]
clusters_df.head()

Unnamed: 0,named_entity,cluster
0,0,1
1,1,0
2,2,0
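Cluster labels are easier to inspect once the integer ids are mapped back to entity names. A minimal sketch, assuming the tokenizer and the cluster labels shown in the outputs above (k-means labels are arbitrary, so a different run may swap `0` and `1`):

```python
from collections import defaultdict

# Values copied from the outputs above; the variable names are hypothetical.
tokenizer = {"basketball": 0, "Kobe Bryant": 1, "Lebron James": 2}
cluster_of = {0: 1, 1: 0, 2: 0}  # named_entity id -> cluster label

# Invert the tokenizer so ids can be turned back into names.
id_to_name = {entity_id: name for name, entity_id in tokenizer.items()}

members = defaultdict(list)
for entity_id, cluster in sorted(cluster_of.items()):
    members[cluster].append(id_to_name[entity_id])

for cluster in sorted(members):
    print(cluster, members[cluster])
```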


## 3. Prepare input for Gephi:

[Gephi](https://gephi.org) (requires Java 1.8 or higher) is a nice visualization tool for graph data.  
We'll output our data into a format recognizable by Gephi.

In [41]:
id_to_named_entity = {named_entity_id: named_entity
                      for named_entity, named_entity_id in tokenizer.items()}

with open('clusters.gdf', 'w') as f:
    f.write('nodedef>name VARCHAR,cluster_id VARCHAR,label VARCHAR\n')
    for index, row in clusters_df.iterrows():
        f.write('{},{},{}\n'.format(row['named_entity'], row['cluster'], id_to_named_entity[row['named_entity']]))
    f.write('edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\n')
    for index, row in edges_df.iterrows(): 
        f.write('{},{},{}\n'.format(row['named_entity_1'], row['named_entity_2'], row['weight']))

Finally, we can open `clusters.gdf` using Gephi in order to inspect the clusters.
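To see what the resulting file looks like, the sketch below builds the same two-section layout (a `nodedef>` block followed by an `edgedef>` block) as a string for the three-entity example. The helper `build_gdf` is hypothetical. As a simplification, it rejects labels containing a comma, since an unquoted comma would shift the columns.

```python
def build_gdf(nodes, edges):
    """Return GDF text for (id, cluster, label) nodes and (node1, node2, weight) edges."""
    lines = ["nodedef>name VARCHAR,cluster_id VARCHAR,label VARCHAR"]
    for node_id, cluster, label in nodes:
        if "," in label:
            # Simplification for this sketch: no quoting, so commas are not allowed.
            raise ValueError(f"label contains a comma: {label!r}")
        lines.append(f"{node_id},{cluster},{label}")
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    for node1, node2, weight in edges:
        lines.append(f"{node1},{node2},{weight}")
    return "\n".join(lines) + "\n"

# Values taken from the outputs earlier in the notebook.
nodes = [(0, 1, "basketball"), (1, 0, "Kobe Bryant"), (2, 0, "Lebron James")]
edges = [(0, 1, 1), (0, 2, 1)]

gdf_text = build_gdf(nodes, edges)
print(gdf_text)
```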