# Clustering

Clustering is the task of finding underlying coherent groups in the data.

* hard clustering - assign each item to exactly one cluster
* soft clustering - give each item a degree of membership in each cluster
* hierarchical clustering - build a hierarchy of clusters, usually bottom-up

A nice comparison of the clustering algorithms and their results is in the [scikit-learn clustering documentation](http://scikit-learn.org/stable/modules/clustering.html#clustering).

Depending on the clustering algorithm, we might need:
* a similarity metric on the items being clustered (function or matrix)
* a way to establish cluster centres
* a way to measure between-cluster distance
* a way to measure cluster coherence

A notable difference between the algorithms is whether they can work with a matrix of pairwise similarities between arbitrary objects, or whether they need to be given a vector representation of the data items.
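
To make that difference concrete, here is a minimal sketch in plain Python (toy data, and the function names are made up for illustration). Given vectors, you can always compute a pairwise similarity matrix, here with cosine similarity. Going the other way is not straightforward, which is why algorithms that only need the matrix can cluster arbitrary objects.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Entry [i][j] holds the similarity of item i and item j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Three toy items as 2-dimensional vectors (made-up data).
items = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sim = similarity_matrix(items)
for row in sim:
    print([round(x, 2) for x in row])
```

K-means, for instance, needs the vectors themselves because it averages them into cluster centres, while bottom-up hierarchical clustering can work from the matrix alone.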

## K-means

The go-to clustering algorithm, especially if your data does not badly violate its assumptions and you have some idea of the number of clusters you'd like to obtain.

Material: http://scikit-learn.org/stable/modules/clustering.html#k-means

* Builds a single set of clusters
* Explained in detail in the lecture
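
As a reminder of the mechanics, here is a minimal sketch of the standard iteration (Lloyd's algorithm) in plain Python. It is an illustration under simplifying assumptions, not the scikit-learn implementation: the starting centres are supplied by the caller, the iteration count is fixed, and the function name and toy points are made up.

```python
def kmeans(points, centres, iterations=20):
    """Alternate between assigning points and moving centres."""
    centres = [list(c) for c in centres]  # work on a copy
    k = len(centres)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre
        # (squared Euclidean distance).
        assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centres[c])))
            for point in points
        ]
        # Update step: each centre moves to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster went empty
                centres[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centres

# Two visibly separate groups of toy points.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
# Deliberately poor start: both centres inside the first group.
labels, centres = kmeans(points, centres=points[:2])
print(labels)
print(centres)
```

Even with both starting centres in the same group, the two steps pull the centres apart on this toy data. On real data the result depends on the initialisation, which is why library implementations use smarter starting points and several restarts.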

## Hierarchical clustering

Material: 
* http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
* https://en.wikipedia.org/wiki/Hierarchical_clustering

* Builds nested clusters - cluster tree
* Top: one cluster containing everything, bottom: one cluster for each document
* Horizontal cut: one possible clustering of the data
* Explained in detail in the lecture
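
The bottom-up construction can be sketched as follows. This is a minimal illustration in plain Python, assuming single linkage (cluster distance = distance of the two closest members) and stopping once a requested number of clusters remains, which corresponds to one horizontal cut of the tree. The function name and the toy values are made up.

```python
def agglomerative(items, distance, n_clusters):
    """Bottom-up clustering with single linkage, stopped at n_clusters."""
    clusters = [[i] for i in range(len(items))]  # bottom: one cluster per item
    merges = []                                  # the tree, recorded bottom to top
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(distance(items[i], items[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters, merges

values = [1.0, 1.5, 2.1, 10.0, 10.4, 25.0]
clusters, merges = agglomerative(values, lambda x, y: abs(x - y), n_clusters=3)
print(clusters)  # lists of indices into `values`
```

Note that the function only ever calls `distance` on pairs of items, so it would work equally well with a precomputed distance matrix between arbitrary objects.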

# Clustering and Solr

* Carrot2 document clustering engine 
* Builds on top of Solr - clusters search results
* http://search.carrot2.org
* [LINGO algorithm](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.5370&rep=rep1&type=pdf)
  * A bit too involved for the lecture, but in a nutshell:
  * Choose common terms and phrases (filtering out stop-words)
  * Find the most informative ones
  * Match documents to these phrases
  * Relies on a technique called SVD (Singular Value Decomposition)
  * Note: reverses the typical approach of first clustering documents, then assigning labels to the clusters
  * Here: labels exist first, then the clusters around them
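
To illustrate only the "labels first, clusters second" order, here is a toy sketch in plain Python. It is not LINGO: there is no SVD and no phrase extraction, "most informative" is crudely replaced by "appears in many documents but not in all of them", and the stop-word list, function name, and documents are made up.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "is", "from", "to"}  # tiny illustrative list

def label_first_clusters(docs, n_labels=2):
    # Step 1: candidate labels are the non-stop-word terms,
    # counted once per document (document frequency).
    df = Counter()
    tokenised = []
    for text in docs:
        terms = set(text.lower().split()) - STOP_WORDS
        tokenised.append(terms)
        df.update(terms)
    # Step 2: keep the most widespread terms, skipping any term that
    # occurs in every document (it cannot separate anything).
    candidates = [(term, n) for term, n in df.items() if n < len(docs)]
    candidates.sort(key=lambda tn: (-tn[1], tn[0]))
    labels = [term for term, _ in candidates[:n_labels]]
    # Step 3: only now form the clusters, by matching documents to labels.
    return {label: [i for i, terms in enumerate(tokenised) if label in terms]
            for label in labels}

docs = [
    "Greetings from Finland",
    "Helsinki is the capital of Finland",
    "Greetings from Helsinki in Finland",
    "Forests of Finland",
    "Helsinki forests in Finland",
]
result = label_first_clusters(docs)
print(result)
```

In this sketch a document can land in several clusters, or in none at all.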

## Test Carrot2

* Download Carrot2 Workbench
* Point it to a running Solr
    * Solr as source
    * Correct URL
    * Fill in field names
    * Test with a query

## Clustering inside Solr

* Carrot2 is now part of Solr, so it is easy to get clustered results out of the box
* You need to configure it though, in the core's own config:
  `solr-6.4.1/server/solr/YOURCORENAME/conf/solrconfig.xml`
* In the place in that file which mentions clustering config, insert this:

```
  <searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
    <!-- Lingo clustering algorithm -->
    <lst name="engine">
      <str name="name">lingo</str>
      <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    </lst>
  </searchComponent>

  <requestHandler name="/clustering"
                  class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="clustering">true</bool>
      <bool name="clustering.results">true</bool>

      <!-- Logical field to physical field mapping. -->
      <str name="carrot.url">id</str>
      <str name="carrot.title">stext</str>
      <str name="carrot.snippet">stext</str>

      <!-- Configure any other request handler parameters. We will cluster the
           top 100 search results so bump up the 'rows' parameter. -->
      <str name="rows">100</str>
      <str name="fl">*,score</str>
    </lst>

    <!-- Append clustering at the end of the list of search components. -->
    <arr name="last-components">
      <str>clustering</str>
    </arr>
  </requestHandler>
```

You need to modify the logical-to-physical field mapping (`carrot.url`, `carrot.title`, `carrot.snippet`) so that it names fields which exist in your particular core. Then reload the core and try querying it, changing `browse` to `clustering` in the request URL.
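
As a small sketch of those last two steps, the snippet below only builds the URLs involved. It assumes Solr runs at `http://localhost:8983/solr` and uses Solr's CoreAdmin `RELOAD` action. The helper names are made up, and `YOURCORENAME` is the same placeholder as in the config path.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumption: default local Solr address

def reload_url(core):
    """URL that asks Solr's CoreAdmin API to reload one core."""
    return SOLR + "/admin/cores?" + urlencode({"action": "RELOAD", "core": core})

def handler_url(core, handler):
    """URL of a request handler (e.g. browse, clustering) of one core."""
    return "{}/{}/{}".format(SOLR, core, handler)

print(reload_url("YOURCORENAME"))
print(handler_url("YOURCORENAME", "browse"))      # the handler used so far
print(handler_url("YOURCORENAME", "clustering"))  # the handler configured here
```

You can open these in a browser or pass them to `requests.get`. The core can also be reloaded from the Solr admin UI.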




```
import requests
import json

# A simple call using requests
response = requests.get(
    "http://localhost:8983/solr/ENCOW/clustering",
    params={"q": "stext:finland", "wt": "json", "fl": "id,stext"},
)
resp = json.loads(response.text)  # decode the JSON response
print(resp.keys())  # ...it now also has "clusters"
```

Output:

```
dict_keys(['responseHeader', 'response', 'clusters'])
```


```
print(resp["response"]["docs"][:3])  # first three documents
```

Output:

```
[{'id': 284325932, 'stext': 'finland , finland , finland , finland has it all .'}, {'id': 191290311, 'stext': 'finland forests finland forests .'}, {'id': 104129958, 'stext': "finland , finland finland , the country where i want to be , you 're so sadly neglected , and often ignored , finland , finland , finland , finland has it all ."}]
```


```
print(resp["clusters"][:3])  # first three clusters
```

Output:

```
[{'docs': [253339729, 8622119, 193331210, 247346889, 310673615, 269366772, 180257582], 'labels': ['Helsinki'], 'score': 11.008298691572627}, {'docs': [7763468, 3751396, 8663180, 194199679, 307908027], 'labels': ['Greetings from Finland'], 'score': 10.677767616504626}, {'docs': [104129958, 238872649, 317407179, 194845455], 'labels': ['Country'], 'score': 8.298028871876642}]
```


`resp["response"]["docs"]` is the list of documents returned, as before. `resp["clusters"]` is a new part of the result and contains the cluster structure: each cluster lists the `id`s of its documents, its `labels`, and a `score`. You just need to piece the two together.
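
One way to piece them together is sketched below. The helper name `docs_by_cluster` is made up, and `resp_sample` is a hand-trimmed excerpt of the output shown above rather than a live response, so most of the ids its one cluster lists are missing from it and get skipped.

```python
def docs_by_cluster(resp):
    """Return (label, [document texts]) pairs, one per cluster."""
    # Index the returned documents by id for quick lookup.
    by_id = {doc["id"]: doc for doc in resp["response"]["docs"]}
    grouped = []
    for cluster in resp["clusters"]:
        label = ", ".join(cluster["labels"])
        # Skip ids that are not among the returned documents.
        texts = [by_id[i]["stext"] for i in cluster["docs"] if i in by_id]
        grouped.append((label, texts))
    return grouped

# Trimmed excerpt of the response shown above: three documents, one cluster.
resp_sample = {
    "response": {"docs": [
        {"id": 284325932, "stext": "finland , finland , finland , finland has it all ."},
        {"id": 191290311, "stext": "finland forests finland forests ."},
        {"id": 104129958, "stext": "finland , finland finland , the country where i want to be , "
                                   "you 're so sadly neglected , and often ignored , "
                                   "finland , finland , finland , finland has it all ."},
    ]},
    "clusters": [
        {"docs": [104129958, 238872649, 317407179, 194845455],
         "labels": ["Country"], "score": 8.298028871876642},
    ],
}

grouped = docs_by_cluster(resp_sample)
for label, texts in grouped:
    print(label, "->", len(texts), "document(s) found")
    for text in texts:
        print("   ", text[:60])
```

With a live response you would call `docs_by_cluster(resp)` instead. The field name `stext` matches the `fl` parameter of the query above, so adjust it for your own core.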