This notebook accompanies the blog post https://engineering.taboola.com/think-your-data-different.

In [1]:
import pandas as pd
import numpy as np
import itertools
from sklearn.cluster import KMeans
import pprint

## 1. Prepare input for node2vec
We'll use a CSV file in which each row represents a single recommendable item: it contains a comma-separated list of the named entities that appear in the item's title.

In [2]:
named_entities_df = pd.read_csv('named_entities.csv')
named_entities_df.head()

Unnamed: 0,named_entities
0,"CONCEPT-certification mark,CONCEPT-i swear,CON..."
1,"CONCEPT-middle school,CONCEPT-gun,CONCEPT-scho..."
2,"Facility-rush university medical center,CONCEP..."
3,CONCEPT-web browser
4,"CONCEPT-types of companies,Person-saquon barkl..."


First, we'll have to tokenize the named entities, since `node2vec` expects integers.

In [3]:
# Maps each named entity string to an integer id, assigned in order of first appearance.
tokenizer = dict()
named_entities_df['named_entities'] = named_entities_df['named_entities'].apply(
    lambda named_entities: [tokenizer.setdefault(named_entity, len(tokenizer))
                            for named_entity in named_entities.split(',')])
named_entities_df.head()

Unnamed: 0,named_entities
0,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]"
1,"[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 2..."
2,"[28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 3..."
3,[41]
4,"[42, 43, 44, 45, 46, 9]"
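
To make the `setdefault` trick concrete, here is a minimal sketch on made-up rows (the entity strings and the `toy_` names are hypothetical, not taken from the dataset). An entity gets the next free id the first time it is seen, and the same id every time after that.

```python
# Hypothetical rows in the same comma-separated format as named_entities.csv.
toy_rows = ['CONCEPT-gun,CONCEPT-school', 'CONCEPT-school,Person-someone']

toy_tokenizer = {}
# setdefault returns the existing id, or stores and returns len(toy_tokenizer)
# (evaluated before the insertion) for an entity that has not been seen yet.
toy_token_lists = [[toy_tokenizer.setdefault(entity, len(toy_tokenizer))
                    for entity in row.split(',')]
                   for row in toy_rows]

print(toy_token_lists)
print(toy_tokenizer)
```

Note that `CONCEPT-school` receives the same id in both rows, which is what lets co-occurrences be counted across items later on.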


In [4]:
# list() keeps the slice working on Python 3, where items() returns a view.
pprint.pprint(dict(list(tokenizer.items())[:5]))

{'CONCEPT-gal gadot': 20918,
 'CONCEPT-irish singles chart number one singles': 59693,
 'CONCEPT-tarantula': 83904,
 'Organization-ohio republican party': 93001,
 'Person-billy donovan': 32857}


In order to construct the graph on which we'll run node2vec, we first need to understand which named entities appear together.

In [5]:
pairs_df = named_entities_df['named_entities'].apply(lambda named_entities: list(itertools.combinations(named_entities, 2)))
pairs_df = pairs_df[pairs_df.apply(len) > 0]
pairs_df = pd.DataFrame(np.concatenate(pairs_df.values), columns=['named_entity_1', 'named_entity_2'])
pairs_df.head()

Unnamed: 0,named_entity_1,named_entity_2
0,0,1
1,0,2
2,0,3
3,0,4
4,0,5
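
One thing worth checking at this step: `itertools.combinations` emits pairs in the order the ids appear in each list, so the same two entities can show up as both `(a, b)` and `(b, a)` if their order differs between items. The sketch below, on toy token lists, shows the effect and one possible way to canonicalize the pairs by sorting first. This is an optional variation, not what the notebook does.

```python
import itertools
from collections import Counter

# Toy token lists: entity 2 comes before entity 0 in the second item.
toy_items = [[0, 1, 2], [2, 0]]

# Pairs taken in list order: (0, 2) and (2, 0) are counted under different keys.
raw_counts = Counter(pair
                     for item in toy_items
                     for pair in itertools.combinations(item, 2))

# Sorting (and de-duplicating) each list first gives every unordered pair one key.
canonical_counts = Counter(pair
                           for item in toy_items
                           for pair in itertools.combinations(sorted(set(item)), 2))

print(raw_counts[(0, 2)], raw_counts[(2, 0)])
print(canonical_counts[(0, 2)])
```

Whether this matters depends on the data. If it does, split counts can keep a frequently co-occurring pair below the edge threshold used in the next step.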


Now we can construct the graph. The weight of an edge connecting two named entities will be the number of times these named entities appear together in our dataset.

In [6]:
NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD = 25

edges_df = pairs_df.groupby(['named_entity_1', 'named_entity_2']).size().reset_index(name='weight')
edges_df = edges_df[edges_df['weight'] > NAMED_ENTITIES_CO_OCCURRENCE_THRESHOLD]
edges_df[['named_entity_1', 'named_entity_2', 'weight']].to_csv('edges.csv', header=False, index=False, sep=' ')
edges_df.head()

Unnamed: 0,named_entity_1,named_entity_2,weight
49,3,9,34
988,9,41,1142
1275,11,127,31
1281,11,134,35
1290,11,149,61
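
The weighting and thresholding can be seen on a small example. This is a sketch on a made-up pairs table with a toy threshold of 1 (both are assumptions for illustration): `groupby(...).size()` counts how often each pair occurs, and the strict `>` comparison drops pairs whose count equals the threshold.

```python
import pandas as pd

# Hypothetical pairs table: the pair (0, 1) occurs twice, the others once.
toy_pairs = pd.DataFrame({'named_entity_1': [0, 0, 0, 1],
                          'named_entity_2': [1, 1, 2, 2]})
TOY_THRESHOLD = 1  # illustrative value only

# One row per distinct pair, with the number of occurrences as the edge weight.
toy_edges = (toy_pairs.groupby(['named_entity_1', 'named_entity_2'])
             .size()
             .reset_index(name='weight'))

# Keep only edges that are strictly heavier than the threshold.
toy_strong_edges = toy_edges[toy_edges['weight'] > TOY_THRESHOLD]
print(toy_strong_edges)
```

Only the `(0, 1)` edge survives here. Entities left without any surviving edge never reach `edges.csv`, so they will not receive an embedding.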


Next, we'll run `node2vec`, which will write the resulting embeddings to a file called `emb`.  
We'll use the open-source implementation developed by [Stanford](https://github.com/snap-stanford/snap/tree/master/examples/node2vec).

In [7]:
!python node2vec/src/main.py --input edges.csv --output emb --weighted

Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10


## 2. Read embedding and run KMeans clustering:

In [8]:
emb_df = pd.read_csv('emb', sep=' ', skiprows=[0], header=None)
emb_df.set_index(0, inplace=True)
emb_df.index.name = 'named_entity'
emb_df.head()

named_entity,1,2,3,4,5,6,7,8,9,10,...,119,120,121,122,123,124,125,126,127,128
45,0.193684,0.199515,-0.55807,0.193501,-0.151151,-0.108368,-0.080395,0.483877,-0.216687,-0.027689,...,-0.020264,-0.21916,-0.006211,-0.11605,-0.208311,-0.238917,0.416022,-0.069208,0.382213,-0.198407
41,0.116208,-0.013772,0.270675,0.22748,-0.123978,-0.076915,-0.080015,0.338822,0.007791,-0.028516,...,-0.250689,-0.219996,-0.346024,0.006914,-0.185476,0.09912,0.231357,0.326392,0.197053,-0.103405
478,0.326508,-0.080868,-0.534134,0.137786,-0.262377,-0.071972,-0.187409,0.533022,-0.314909,-0.019874,...,-0.160482,-0.192272,-0.132486,-0.058005,-0.182971,-0.2016,0.317926,0.059988,0.380023,-0.127033
88,-0.053936,-0.098514,-0.116975,0.194783,-0.127855,0.310879,-0.050054,-0.002542,0.094705,-0.104536,...,0.025011,-0.357876,-0.238409,0.247654,0.082463,-0.147044,0.15385,-0.535327,-0.435655,0.259705
83,0.013028,-0.122749,-0.029661,0.059336,-0.258743,0.397353,-0.082249,0.078653,0.102366,0.091354,...,0.141847,-0.456273,-0.119102,0.301741,0.072765,-0.035528,0.042997,-0.511059,-0.263644,0.366281
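
The `skiprows=[0]` argument above suggests that the first line of `emb` is a header rather than an embedding. Assuming the file follows the word2vec text format (a first line with the node count and the dimension, then one line per node), which is an assumption about the output and not something the notebook states, a quick sanity check on a toy string could look like this:

```python
import io
import pandas as pd

# Toy stand-in for the `emb` file: "<num_nodes> <dim>" followed by one row per node.
toy_emb_text = "2 3\n45 0.1 0.2 0.3\n41 -0.4 0.5 0.6\n"

# Read the header line to learn the expected shape.
num_nodes, dim = map(int, toy_emb_text.splitlines()[0].split())

# Same read_csv call as in the notebook, applied to the toy text.
toy_emb_df = pd.read_csv(io.StringIO(toy_emb_text), sep=' ', skiprows=[0], header=None)
toy_emb_df.set_index(0, inplace=True)

# The parsed table should match what the header promised.
assert toy_emb_df.shape == (num_nodes, dim)
```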


Each column is a dimension in the embedding space. Each row contains the coordinates of one named entity's embedding.  
We'll now cluster the embeddings using a simple clustering algorithm, k-means.

In [9]:
NUM_CLUSTERS = 10

kmeans = KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(emb_df)
labels = kmeans.predict(emb_df)
emb_df['cluster'] = labels
clusters_df = emb_df.reset_index()[['named_entity','cluster']]
clusters_df.head()

Unnamed: 0,named_entity,cluster
0,45,2
1,41,3
2,478,2
3,88,1
4,83,1
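
Because k-means starts from a random initialization, the cluster ids (and sometimes the clusters themselves) can change between runs. A minimal sketch of one way to make a run repeatable, using toy 2-D points and scikit-learn's `random_state` and `n_init` parameters (the values chosen here are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy blobs standing in for the embedding rows.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# random_state fixes the initialization; fit_predict fits and labels in one call.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_points)

# Cluster sizes are a quick first look at how balanced the result is.
toy_sizes = np.bincount(toy_labels)
print(toy_labels, toy_sizes)
```

The label values themselves carry no meaning. Only the grouping does, so comparisons between runs should look at which points share a label.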


## 3. Prepare input for Gephi:

[Gephi](https://gephi.org) is a nice visualization tool for graph data.  
We'll output our data into a format recognizable by Gephi.

In [10]:
id_to_named_entity = {named_entity_id: named_entity
                      for named_entity, named_entity_id in tokenizer.items()}

with open('clusters.gdf', 'w') as f:
    f.write('nodedef>name VARCHAR,cluster_id VARCHAR,label VARCHAR\n')
    for index, row in clusters_df.iterrows():
        f.write('{},{},{}\n'.format(row['named_entity'], row['cluster'], id_to_named_entity[row['named_entity']]))
    f.write('edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\n')
    for index, row in edges_df.iterrows(): 
        f.write('{},{},{}\n'.format(row['named_entity_1'], row['named_entity_2'], row['weight']))

Finally, we can open `clusters.gdf` using Gephi in order to inspect the clusters.
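
For a quick look before opening Gephi, the clusters can also be listed directly in pandas. The sketch below uses a hypothetical tokenizer and cluster table with the same column names as `clusters_df`. It maps ids back to entity names and collects the members of each cluster.

```python
import pandas as pd

# Hypothetical stand-ins for `tokenizer` and `clusters_df`.
toy_tokenizer = {'Person-a': 0, 'Person-b': 1, 'CONCEPT-c': 2}
toy_clusters_df = pd.DataFrame({'named_entity': [0, 1, 2],
                                'cluster': [1, 1, 0]})

# Invert the tokenizer so ids can be turned back into readable names.
toy_id_to_name = {entity_id: name for name, entity_id in toy_tokenizer.items()}
toy_clusters_df['label'] = toy_clusters_df['named_entity'].map(toy_id_to_name)

# Collect the entity names belonging to each cluster.
toy_members = toy_clusters_df.groupby('cluster')['label'].apply(list).to_dict()
print(toy_members)
```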