# User Clustering

This notebook clusters users based on a vector representation of each user. It can additionally calculate clusters based on temporal splits.

To run the calculation cell, the notebook requires a pandas dataframe with the following structure:

| user_id | fn_spreader | pol_bias | vector |
|-|-|-|-|
| USER_ID1 | 0 | 1.21 | [0.2,...,-0.1] |
| USER_ID2 | 1 | -0.34 | [0.7,...,0.5] |
| ... | ... | ... | ... |

Here fn_spreader is a binary variable, pol_bias is a float in the range [-3, 3], and vector is a normalized vector of some dimension. The dataframe without vectors can be found %LINK%. The cells below show how you can add your own vectors.
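As a quick illustration of the expected layout, the sketch below builds a toy dataframe and checks it against the constraints listed above. The helper name check_user_frame and all values are hypothetical and not part of this notebook; the kind of normalization applied to the vectors is not specified here, so it is not checked.

```python
import pandas as pd

# Toy data only: the user IDs and values below are made up for illustration.
toy_df = pd.DataFrame({
    "user_id": ["USER_ID1", "USER_ID2"],
    "fn_spreader": [0, 1],
    "pol_bias": [1.21, -0.34],
    "vector": [[0.2, 0.5, 0.3], [0.7, 0.1, 0.2]],
})


def check_user_frame(frame):
    """Hypothetical helper: raise ValueError if the frame breaks the expected layout."""
    required = ["user_id", "fn_spreader", "pol_bias", "vector"]
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not frame["fn_spreader"].isin([0, 1]).all():
        raise ValueError("fn_spreader must be 0 or 1")
    if not frame["pol_bias"].between(-3, 3).all():
        raise ValueError("pol_bias must lie in [-3, 3]")
    # All vectors should share one dimension so they can be stacked for clustering.
    dims = {len(v) for v in frame["vector"]}
    if len(dims) > 1:
        raise ValueError(f"vectors differ in dimension: {sorted(dims)}")


check_user_frame(toy_df)  # passes silently for a well-formed frame
```

Running a check like this before clustering can catch ragged vectors or out-of-range values early.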

The clustering appends an additional column containing the cluster label to the dataframe. The function score_clustering(dataframe) returns a score based on inter-cluster similarity. The score of the whole clustering is the weighted mean of the scores of its clusters.
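The weighted-mean scoring described above can be sketched as follows. This is not the notebook's score_clustering, whose implementation is not shown here. The sketch assumes that a cluster's score is the mean pairwise cosine similarity of its members, that clusters are weighted by their size, and that the label column is called cluster; all three are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def score_clustering_sketch(frame, label_col="cluster"):
    """Size-weighted mean of per-cluster mean pairwise cosine similarity (assumed metric)."""
    total = 0.0
    for _, group in frame.groupby(label_col):
        X = np.array(group["vector"].tolist(), dtype=float)
        n = len(X)
        if n < 2:
            # Assumption: a singleton cluster counts as fully coherent.
            cluster_score = 1.0
        else:
            norms = np.linalg.norm(X, axis=1, keepdims=True)
            unit = X / np.where(norms == 0, 1.0, norms)  # avoid division by zero
            sims = unit @ unit.T
            # Average over all ordered pairs of distinct members.
            cluster_score = (sims.sum() - np.trace(sims)) / (n * (n - 1))
        total += n * cluster_score  # weight each cluster by its size
    return total / len(frame)


# Toy example: cluster 0 holds identical vectors, cluster 1 orthogonal ones.
toy = pd.DataFrame({
    "vector": [[1, 0], [1, 0], [0, 1], [1, 0]],
    "cluster": [0, 0, 1, 1],
})
toy_score = score_clustering_sketch(toy)
```

In the toy example the first cluster scores 1.0 and the second 0.0, so with equal cluster sizes the overall score is 0.5.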

In [None]:
import pandas as pd
import pickle
from sklearn.cluster import k_means

In [None]:
df = pd.read_csv('../data/blank_user_frame.csv')
df

## Adding named entity vectors

In [None]:
with open('../data/named_entities.pickel', 'rb') as f:
    user_entities = pickle.load(f)

named_entities = ['ORG', 'PERSON', 'DATE', 'GPE', 'CARDINAL', 'NORP',
                  'PERCENT', 'MONEY', 'ORDINAL', 'WORK_OF_ART', 'LOC',
                  'TIME', 'LAW', 'PRODUCT', 'FAC', 'EVENT', 'QUANTITY',
                  'LANGUAGE']

vectors = []
for user in df['user_id']:
    # entity counts for this user, keyed by entity label
    ents = user_entities[user]['all']

    # total count, used to turn raw counts into relative frequencies
    N = sum(ents.values())

    # one slot per entity type; 0 stays for types absent from ents
    vector = [0]*len(named_entities)

    for ind, entity in enumerate(named_entities):
        if entity in ents:
            vector[ind] = ents[entity]/N
    vectors.append(vector)

df['vector'] = vectors
df
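Once the vector column is filled, the clustering step described in the introduction (appending a cluster label column) might look like the sketch below. It uses the k_means function from sklearn.cluster that is imported at the top of the notebook. The helper name add_cluster_labels, the column name cluster, the number of clusters, and the toy data are assumptions for illustration, not the notebook's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import k_means


def add_cluster_labels(frame, n_clusters, label_col="cluster", seed=0):
    """Hypothetical helper: return a copy of frame with a k-means label column."""
    # Stack the per-user lists into a 2-D array of shape (n_users, n_dims).
    X = np.array(frame["vector"].tolist(), dtype=float)
    # k_means returns (centroids, labels, inertia); only the labels are used here.
    _, labels, _ = k_means(X, n_clusters=n_clusters, n_init=10, random_state=seed)
    out = frame.copy()
    out[label_col] = labels
    return out


# Toy data: two well-separated groups of users.
toy_users = pd.DataFrame({
    "user_id": ["A", "B", "C", "D"],
    "vector": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
})
clustered = add_cluster_labels(toy_users, n_clusters=2)
```

The helper returns a copy rather than modifying the input, so the same frame can be clustered repeatedly with different values of n_clusters.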