In [1]:
%load_ext autoreload
%autoreload 2

# Clustering

## 1. Data loading

We load the data as in the previous analysis: documents are sorted by their term count and terms by their frequency.

In [2]:
from model.dataset import RCV1Loader

loader = RCV1Loader()
loader

RCV1Loader [File: /home/sebaq/Documents/GitHub/IR_project/dataset/data.npz]

In [3]:
from settings import DATASETS

dataset = DATASETS["tiny"]

data = loader.load(
    docs=dataset.docs, terms=dataset.terms,
    sort_docs=True, sort_terms=True
)

INFO: Loading matrix. 
I/O: Loading /home/sebaq/Documents/GitHub/IR_project/dataset/data.npz. 
INFO: Removing non informative terms. 
INFO: Sorting documents by terms count. 
INFO: Sorting terms by their frequency. 


In [4]:
data

DocumentsCollection[Docs: 10000; Terms: 5170; Nonzero: 110075]

## 2. Dimensionality reduction

We apply the [Johnson-Lindenstrauss lemma](https://scikit-learn.org/stable/modules/random_projection.html) to perform dimensionality reduction and dramatically shrink the vector space in which clustering is applied.

In [5]:
embedding = data.embed(eps=dataset.eps)

Shape of the embedding; the target dimension is determined by the approximation error `eps`:

In [6]:
embedding.shape

(10000, 784)
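
To make the link between the approximation error and the target dimension concrete, here is a minimal sketch of the standard Johnson-Lindenstrauss bound, the same formula scikit-learn documents for `johnson_lindenstrauss_min_dim`. Whether `data.embed` uses exactly this bound is an assumption, and `jl_min_dim` is a hypothetical helper name. The value `eps=0.35` is illustrative only and is not necessarily `dataset.eps`.

```python
from math import log


def jl_min_dim(n_samples: int, eps: float) -> int:
    """Dimension given by the JL bound so that squared pairwise distances
    are preserved within a factor (1 +/- eps)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    # Truncate to an integer, as scikit-learn does.
    return int(4 * log(n_samples) / denominator)


# Illustrative value of eps (an assumption, not read from the settings).
print(jl_min_dim(10_000, 0.35))  # 784
```

Note that the bound depends only on the number of samples and on `eps`, not on the original number of terms. A tight error such as `eps=0.1` would require more dimensions than the 5170 terms available here, so the projection only pays off for looser errors.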

In [7]:
embedding

array([[-0.03492827,  0.01056052,  0.00830388, ...,  0.00584377,
         0.02920113,  0.        ],
       [-0.03445972,  0.00461305,  0.01182234, ...,  0.01039861,
         0.00465934,  0.        ],
       [-0.01390731,  0.00449085, -0.0287742 , ...,  0.01459819,
         0.02523996,  0.01071666],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

## 3. Clustering

We perform clustering in the embedded space. We set the number of clusters to the square root of the number of items.

In [8]:
from model.clustering import KMeansClustering
from math import sqrt

n_clusters = int(sqrt(len(embedding)))

kmeans = KMeansClustering(
    mat=embedding,
    k=n_clusters
)

kmeans

KMeansClustering[Items: 10000; k: 100;  Fitted: False]

Fit clustering model.

In [9]:
%%time

kmeans.fit()

INFO: Fitting K-Means model. 
CPU times: user 11 s, sys: 1.25 s, total: 12.3 s
Wall time: 3.56 s


Save labeling.

In [10]:
kmeans.save(name=dataset.name)

I/O: Saving /home/sebaq/Documents/GitHub/IR_project/dataset/tiny/labeling.json 


Split the dataset according to the resulting clusters.

In [11]:
clusters = kmeans.clusters
clusters

ClusterDataSplit [Data: 10000, Clusters: 100, Mean-per-Cluster: 100.000]
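
As a rough illustration of what such a split involves, the sketch below groups item indices by their cluster label. It is not the implementation behind `kmeans.clusters`; `split_by_label` and the toy `labels` list are hypothetical.

```python
from collections import defaultdict


def split_by_label(labels):
    """Map each cluster label to the list of item indices assigned to it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return dict(groups)


# Toy labeling: six items assigned to three clusters.
labels = [0, 2, 1, 0, 2, 2]
groups = split_by_label(labels)
mean_size = len(labels) / len(groups)

print(groups)     # {0: [0, 3], 2: [1, 4, 5], 1: [2]}
print(mean_size)  # 2.0
```

The mean cluster size always equals the number of items divided by the number of non-empty clusters, so a value of 100.000 for 10000 items and 100 clusters says nothing about how balanced the clusters are. Inspecting the individual group sizes would be needed for that.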

Take the medoid for each cluster.

In [12]:
clusters.medoids

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        , -0.01944182, ...,  0.0533081 ,
         0.        , -0.02283529],
       [ 0.        ,  0.03383169,  0.        , ..., -0.03825251,
         0.        ,  0.        ],
       [ 0.        ,  0.012582  ,  0.        , ...,  0.        ,
        -0.01616018,  0.        ]])
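
For reference, a medoid is an actual member of the cluster, unlike a centroid, which is an average. The sketch below uses one common definition, the member with the smallest total Euclidean distance to all other members. Whether `clusters.medoids` follows this definition or, for example, picks the member closest to the centroid is not shown here, so treat the choice as an assumption; `medoid` is a hypothetical helper.

```python
from math import dist


def medoid(points):
    """Return the member with the smallest summed distance to all members."""
    if not points:
        raise ValueError("cluster is empty")
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


# Four corners of a unit square plus its center: the center is the medoid.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (1.5, 1.5)]
print(medoid(cluster))  # (1.5, 1.5)
```

This brute-force version compares every pair of members, which is quadratic in the cluster size; that is acceptable for clusters of around a hundred items but would need a smarter strategy for much larger ones.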