In [1]:
%load_ext autoreload
%autoreload 2

# Clustering

## 1. Data loading

Load the dataset configuration from settings.

In [2]:
from settings import DATASETS_CONFIG

config = DATASETS_CONFIG["tiny"]
config

tiny [Docs: 10000; Terms: 8000; Approximation error: 0.35]

In [3]:
from model.dataset import RCV1Loader

mat = RCV1Loader(config=config).load(sort_docs=True, sort_terms=True).embed()

INFO: Loading matrix. 
I/O: Loading /home/sebaq/Documents/GitHub/IR_project/dataset/data.npz. 
INFO: Removing non informative terms. 
INFO: Sorting documents by terms count. 
INFO: Sorting terms by their frequency. 


The data is now loaded and embedded.

## 3. Clustering

We perform clustering in the embedded space. The number of clusters is set to the square root of the number of documents.

In [4]:
from model.clustering import KMeansClustering
from math import sqrt

n_clusters = int(sqrt(config.docs))

n_clusters

100
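The square-root rule is a rule of thumb rather than a tuned choice. As a minimal sketch, the hypothetical helper `sqrt_rule` below (not part of the project) computes the same value with integer arithmetic, which avoids floating-point rounding for large counts and guards against returning zero clusters.

```python
from math import isqrt


def sqrt_rule(n_items: int) -> int:
    """Return floor(sqrt(n_items)), with a minimum of one cluster."""
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    return max(1, isqrt(n_items))


print(sqrt_rule(10_000))  # 100, matching n_clusters for the tiny dataset
```

For a count that is not a perfect square the result is rounded down, as with `int(sqrt(...))`.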

In [5]:
kmeans = KMeansClustering(
    mat=mat,
    name=config.name,
    k=n_clusters
)

kmeans

KMeansClustering[Items: 10000; k: 100;  Labeling-available: True]

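Fit the clustering model. The labels are computed on first access, or loaded from disk if a saved labeling already exists.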

In [6]:
%%time

kmeans.labels;

INFO: Labels already computed. Loading from disk. 
I/O: Loading /home/sebaq/Documents/GitHub/IR_project/dataset/tiny/labeling.json 
CPU times: user 3.95 ms, sys: 0 ns, total: 3.95 ms
Wall time: 3.68 ms


Inspect the resulting labeling.

In [7]:
kmeans.labels

array([80, 80, 99, ...,  4,  4,  4])

Save the labeling to disk.

In [8]:
kmeans.save_labels()

I/O: Overwriting /home/sebaq/Documents/GitHub/IR_project/dataset/tiny/labeling.json 


## 4. Clustering split

Split the dataset according to the resulting clusters.

In [9]:
clusters = kmeans.clusters
clusters

ClusterDataSplit [Data: 10000, Clusters: 100, Mean-per-Cluster: 100.000]
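Splitting by cluster amounts to grouping item indices by their label. The sketch below illustrates the idea on a toy labeling; `split_by_label` is a hypothetical helper and is not how `ClusterDataSplit` is necessarily implemented. Note that the mean cluster size is always the number of items divided by the number of non-empty clusters (10000 / 100 = 100 here), so it says nothing about how balanced the clusters are.

```python
from collections import defaultdict


def split_by_label(labels):
    """Map each cluster label to the list of item indices assigned to it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return dict(groups)


labels = [2, 0, 2, 1, 0, 2]  # toy labeling, not taken from the dataset
groups = split_by_label(labels)
mean_size = len(labels) / len(groups)

print(groups)     # {2: [0, 2, 5], 0: [1, 4], 1: [3]}
print(mean_size)  # 2.0
```

Checking the individual group sizes, rather than the mean, would show whether some clusters are much larger than others.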

We compute the medoid for each cluster.

In [10]:
clusters.medoids;

INFO: Loading medoids from disk
I/O: Loading /home/sebaq/Documents/GitHub/IR_project/dataset/tiny/medoids.npy. 


In [11]:
clusters.save_medoids()

INFO: Saving medoids 
I/O: Directory /home/sebaq/Documents/GitHub/IR_project/dataset/tiny already exists 
I/O: Overwriting /home/sebaq/Documents/GitHub/IR_project/dataset/tiny/medoids.npy. 


In [12]:
clusters.medoids

array([[ 0.        ,  0.01463922,  0.        , ..., -0.01463922,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ..., -0.01957361,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.02364533]])
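Each row above is a medoid, which is an actual document of the cluster rather than an average of its members. As a minimal sketch, assuming the usual definition (the member with the smallest total distance to all other members) and Euclidean distance, a medoid can be computed as follows. The `medoid` function is hypothetical; the project's implementation may use a different distance or a faster approximation.

```python
from math import dist


def medoid(points):
    """Return the point with the smallest total Euclidean distance to the others."""
    if not points:
        raise ValueError("cannot compute the medoid of an empty cluster")
    # The distance from a point to itself is zero, so including it is harmless.
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


cluster = [(0.0, 0.0), (2.0, 0.0), (3.0, 0.0)]  # toy data
print(medoid(cluster))  # (2.0, 0.0)
```

In this toy cluster the mean would be roughly (1.67, 0.0), which is not one of the points, whereas the medoid is always a member of the cluster. The brute-force search is quadratic in the cluster size, which is acceptable for clusters of about 100 items.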