# Clustering



In [7]:
from google.colab import drive
drive.mount('/content/drive/')
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
img = mpimg.imread('/content/drive/My Drive/digits_cluster.png')
plt.figure(figsize = (10,20))
plt.imshow(img)

ModuleNotFoundError: No module named 'google.colab'

### 1. Metrics in multidimensional space. Distance between clusters

##### 1. Nearest distance (single linkage)
- the distance between the two closest objects (nearest neighbors) that belong to different clusters;

##### 2. Full distance (complete linkage)
- the greatest distance between any two objects in different clusters (i.e. the most distant neighbors);

##### 3. Weighted/Non-weighted paired average
- the average distance between all pairs of objects taken from the two clusters; the weighted variant uses the cluster sizes as weights;

##### 4. Weighted/Non-weighted centroid (median)
- the distance between the centroids (centers of mass) of the two clusters; the weighted variant (median) takes the cluster sizes into account.

https://habr.com/ru/post/101338/
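The distances listed above can be computed by hand for two small clusters. The sketch below uses made-up 2-D points (an assumption for illustration only) and the non-weighted variants; the centroid value is the distance between the two cluster centroids, which is how the centroid method is usually defined.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small hypothetical clusters in 2-D (illustrative data only)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between objects of the two clusters
pairwise = cdist(cluster_a, cluster_b)

nearest = pairwise.min()       # nearest distance (single linkage)
full = pairwise.max()          # full distance (complete linkage)
paired_avg = pairwise.mean()   # non-weighted paired average
# distance between the two cluster centroids
centroid = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

print(nearest, full, paired_avg, centroid)
```

The same criteria are available in `scipy.cluster.hierarchy.linkage` through `method="single"`, `"complete"`, `"average"` and `"centroid"`.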


### 2. Distance between objects:


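The distance between $x = (x_{1}, \dots, x_{n})$ and $x' = (x'_{1}, \dots, x'_{n})$, both in $\mathbb{R}^{n}$, is

$$
D(x,x') = \left( \sum_{i=1}^n \left| x_{i} - x'_{i} \right|^{q} \right)^{p}.
$$

For $p = 1$, $q = 1$ we get the taxicab (Manhattan) distance; for $p = 1/2$, $q = 2$ we get the Euclidean distance. More generally, $p = 1/q$ gives the Minkowski distance of order $q$.

The higher the inner power $q$, the more heavily large differences in individual coordinates are weighted; the higher the outer power $p$, the more heavily large distances between objects are weighted. Both increase the penalty on outliers.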
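As a quick check of the formula, here is a minimal sketch of the power distance in NumPy. The function name `power_distance` and the sample points are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np

def power_distance(x, y, q, p):
    """D(x, y) = (sum_i |x_i - y_i|^q)^p, as in the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** q) ** p)

a = [0.0, 0.0]
b = [3.0, 4.0]

taxicab = power_distance(a, b, q=1, p=1)      # |3| + |4|
euclidean = power_distance(a, b, q=2, p=0.5)  # sqrt(3^2 + 4^2)
print(taxicab, euclidean)
```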

### Classification of algorithms:


##### A. Hierarchical and flat.
Hierarchical algorithms (or taxonomy algorithms) build a system of nested partitions. The output is a cluster tree: the _root_ of the tree is the entire sample, and the _leaves_ are the smallest clusters.
Flat algorithms build one partition of objects into clusters.
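To make the idea of nested partitions concrete, the sketch below builds a cluster tree with SciPy and cuts it at two levels. The toy 1-D data and the choice of single linkage are assumptions for illustration; any linkage method would show the same nesting.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 1-D toy data with three visible groups
points = np.array([[0.0], [0.1], [5.0], [5.1], [20.0], [20.1]])

# Build the cluster tree (single linkage = nearest distance)
tree = linkage(points, method="single")

# Cut the same tree at two levels: the partitions are nested
two_clusters = fcluster(tree, t=2, criterion="maxclust")
three_clusters = fcluster(tree, t=3, criterion="maxclust")
print(two_clusters)
print(three_clusters)
```

A flat algorithm such as KMeans would return only one of these partitions, for the number of clusters requested up front.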

##### B. Hard (crisp) and fuzzy (soft).
Hard (crisp) algorithms assign a single cluster number to each object in the sample, i.e. each object belongs to exactly one cluster. Fuzzy (soft) algorithms assign each object a set of real values that indicate how strongly it is related to each cluster, so every object belongs to every cluster with a certain probability.
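The difference between one cluster id per object and a probability per cluster can be seen on toy data. In this sketch `KMeans` gives the single-id assignment, and `GaussianMixture` is used as a stand-in for a probabilistic method (it is a mixture model, not fuzzy c-means); the random blobs are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated 2-D blobs
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# One cluster id per object
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# One probability per (object, cluster) pair; each row sums to 1
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
soft_probs = gmm.predict_proba(data)

print(hard_labels[:5])
print(soft_probs[:5].round(3))
```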

In [None]:
img = mpimg.imread('/content/drive/My Drive/Clusters.png')
plt.figure(figsize = (20,20))
plt.imshow(img)

In [None]:
img = mpimg.imread('/content/drive/My Drive/Table.png')
plt.figure(figsize = (15,15))
plt.imshow(img)

https://scikit-learn.org/stable/modules/clustering.html

## Data

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

In [8]:
# texts and messages from different topics

from sklearn.datasets import fetch_20newsgroups

train_all = fetch_20newsgroups(subset='train')
# print topic names
print (train_all.target_names)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [9]:
# let's have 3 VERY different topics
new_dataset = fetch_20newsgroups(
    subset='train', 
    categories=['comp.sys.mac.hardware', 'soc.religion.christian', 'rec.sport.hockey'])

In [10]:
print (new_dataset.data[100])

From: generous@nova.sti.nasa.gov (Curtis Generous)
Subject: Apple Tape backup 40SC under System 7.x
Keywords: backup, tape,
Organization: NASA STI
Lines: 12


I need to get an Apple 40SC tape backup unit working under
Sys 7.0.x, but do not have any drivers/software to access
the device.  Does anyone know where I can fidn the tools
to access this device?

Appreciate any info/comments.

--curtis
-- 
Curtis C. Generous	generous@sti.nasa.gov		(703) 685-1140
NASA STI, Code JTT, Washington, DC 20546



In [11]:
# labels
import numpy as np
print(new_dataset.target)
print(np.unique(new_dataset.target,return_counts=True))

[0 0 1 ... 0 1 2]
(array([0, 1, 2], dtype=int64), array([578, 600, 599], dtype=int64))


In [12]:
print (new_dataset.data[-2])

From: scialdone@nssdca.gsfc.nasa.gov (John Scialdone)
Subject: CUT Vukota and Pilon!!!
News-Software: VAX/VMS VNEWS 1.41    
Organization: NASA - Goddard Space Flight Center
Lines: 32

I have been to all 3 Isles/Caps tilts at the Crap Centre this year, all Isles
wins and there is no justification for Vukota and Pilon to play for the Isles.
Vukota is absolutely the worst puck handler in the world!! He couldn't hit a
bull in the ass with a banjo!! Al must remember a few years back when Mick 
scored 3 goals in one period against the Caps in a 5-3 Isles win. I was there
and was astonished as was the rest of the crowd. Wake-up Al!!! Years later he's
gotten worse. He's a cheap shot artist and always ends up getting
stupid/senseless penalties. I think he would make a good police officier!!!

As for Pilon, he can't carry the puck out to center ice by himself. He either
makes a bad pass resulting in a turnover, or he attempts to bring the puck 
towards the neutral zone and skates right into an 

In [13]:
# number of entries
print (len(new_dataset.data))

1777


## Features

In [14]:
# build a document-term matrix: one row per document, one column per word
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# keep only words that occur in at least 10 and at most 500 documents
# (10 and 500 are tunable; other thresholds are worth trying)
#vectorizer = CountVectorizer(max_df=500, min_df=10)  # raw word counts
vectorizer = TfidfVectorizer(max_df=500, min_df=10)  # TF-IDF-weighted counts

matrix = vectorizer.fit_transform(new_dataset.data)

In [15]:
print(matrix)

  (0, 877)	0.111806579628099
  (0, 880)	0.11851815596020333
  (0, 2243)	0.0925317447411647
  (0, 1136)	0.05005067187805151
  (0, 2358)	0.08407975300726757
  (0, 2820)	0.10804290934262964
  (0, 2108)	0.12357635170024023
  (0, 3035)	0.10154477899829953
  (0, 1229)	0.07360450638969387
  (0, 3262)	0.13021297950442332
  (0, 1366)	0.0836339497399446
  (0, 3305)	0.06783179408623191
  (0, 551)	0.09781400308186793
  (0, 1961)	0.22885999498507148
  (0, 619)	0.1704733433340938
  (0, 294)	0.051771062722228284
  (0, 3322)	0.09538902048354801
  (0, 2325)	0.05419206470797136
  (0, 620)	0.111806579628099
  (0, 1831)	0.1250526461821779
  (0, 3285)	0.11264581593118288
  (0, 2422)	0.09427236106959616
  (0, 301)	0.05563256696046784
  (0, 855)	0.20162554088271675
  (0, 2868)	0.08596208970791507
  :	:
  (1776, 2582)	0.05609109765839554
  (1776, 3688)	0.050520649858871766
  (1776, 1392)	0.05585170073485205
  (1776, 2517)	0.13350627179343605
  (1776, 331)	0.055151655839178956
  (1776, 3383)	0.0471671358117620
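To see what the `min_df` and `max_df` thresholds do, here is a small sketch on a made-up four-document corpus (the sentences and the thresholds 2 and 3 are assumptions for illustration). Integer values are interpreted as document counts, so words that are too rare or too common are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not from 20 Newsgroups)
docs = [
    "the puck hit the net",
    "the goalie stopped the puck",
    "the mac needs a new drive",
    "the drive failed on my mac",
]

# Integer min_df/max_df are document counts:
# keep words found in at least 2 and at most 3 documents
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=3)
toy_matrix = toy_vectorizer.fit_transform(docs)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_matrix.shape)
```

Here "the" is removed because it occurs in all four documents, and words that occur in only one document are removed as too rare.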

## KMeans

In [16]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, random_state=1)
preds = model.fit_predict(matrix.toarray())
print (list(preds))

[2, 2, 1, 0, 1, 2, 2, 2, 0, 2, 1, 2, 1, 2, 1, 2, 2, 0, 2, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 2, 1, 0, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 0, 1, 1, 1, 1, 0, 1, 2, 2, 0, 1, 1, 0, 1, 0, 0, 2, 0, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 0, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 0, 1, 1, 2, 2, 2, 2, 0, 0, 1, 0, 1, 2, 2, 0, 0, 0, 0, 2, 1, 2, 2, 2, 1, 0, 1, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 1, 1, 2, 0, 2, 0, 1, 0, 1, 0, 0, 2, 0, 0, 1, 2, 2, 1, 0, 0, 2, 2, 2, 2, 2, 2, 0, 0, 0, 1, 0, 0, 0, 1, 2, 1, 1, 2, 2, 1, 1, 0, 0, 0, 1, 2, 1, 2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 2, 1, 0, 1, 1, 0, 2, 2, 1, 0, 1, 2, 0, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 2, 1, 1, 0, 1, 2, 2, 2, 1, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 1, 1, 1, 2, 2, 0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 2, 0, 0, 2, 1, 2, 1, 0, 1, 0, 1, 0, 2, 2, 1, 2, 1, 1, 0, 2, 1, 0, 2, 1, 2, 2, 2, 0, 2, 2, 2, 0, 2, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 2, 1, 1, 2, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1, 0, 0, 0, 2, 0, 0, 1, 2, 2, 0, 2, 1, 1, 1, 1, 1, 0, 0, 1, 2, 2, 0, 2, 0, 0, 0, 0, 2, 1, 1, 0, 0, 

In [17]:
print (list(new_dataset.target))

[0, 0, 1, 2, 1, 0, 0, 0, 2, 0, 1, 0, 1, 0, 1, 0, 0, 2, 0, 1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 0, 1, 2, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 2, 1, 1, 1, 1, 2, 1, 0, 0, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 2, 0, 0, 2, 2, 0, 1, 2, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 2, 2, 1, 2, 1, 0, 0, 2, 2, 2, 2, 0, 1, 0, 1, 0, 1, 2, 1, 1, 0, 0, 2, 1, 0, 2, 2, 0, 2, 1, 1, 0, 2, 0, 2, 1, 2, 1, 2, 2, 0, 2, 2, 1, 0, 0, 1, 2, 2, 0, 0, 0, 2, 0, 2, 2, 2, 2, 1, 2, 2, 2, 1, 0, 1, 1, 0, 0, 1, 1, 2, 2, 2, 1, 0, 1, 0, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 0, 1, 2, 1, 1, 2, 0, 0, 1, 2, 1, 0, 2, 0, 2, 2, 0, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 0, 1, 1, 0, 1, 2, 0, 0, 1, 2, 1, 1, 2, 2, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 2, 0, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 0, 2, 0, 1, 2, 1, 2, 1, 2, 0, 2, 1, 0, 1, 1, 2, 0, 1, 2, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 0, 1, 1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 0, 0, 2, 0, 2, 2, 2, 2, 0, 1, 1, 2, 2, 

In [18]:
# KMeans cluster ids are arbitrary: here clusters "0" and "2" are swapped
# relative to the true labels, so remap them before computing the error rate

mapping = {2: 0, 1: 1, 0: 2}
mapped_preds = np.array([mapping[pred] for pred in preds])
print(np.mean(mapped_preds != new_dataset.target))

0.05289814293753517
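The manual mapping above can also be found automatically. One common approach, sketched here under the assumption that both the labels and the cluster ids are the integers 0..k-1, is to build a confusion matrix and solve the assignment problem with the Hungarian algorithm. The helper name `best_label_mapping` and the toy arrays are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def best_label_mapping(true_labels, cluster_ids):
    """Map each cluster id to the label that maximizes total agreement.

    Assumes labels and cluster ids are both integers 0..k-1.
    """
    # Rows: true labels, columns: cluster ids
    counts = confusion_matrix(true_labels, cluster_ids)
    # Hungarian algorithm; negate counts to maximize agreement
    label_idx, cluster_idx = linear_sum_assignment(-counts)
    return {int(c): int(l) for l, c in zip(label_idx, cluster_idx)}

# Hypothetical toy example: cluster ids 0 and 2 are swapped
toy_target = np.array([0, 0, 1, 2, 1, 0, 2, 2])
toy_preds = np.array([2, 2, 1, 0, 1, 2, 0, 1])

toy_mapping = best_label_mapping(toy_target, toy_preds)
toy_mapped = np.array([toy_mapping[c] for c in toy_preds])
print(toy_mapping)
print(np.mean(toy_mapped != toy_target))
```

On the notebook's data the call would be `best_label_mapping(new_dataset.target, preds)`.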


In [19]:
# ~95% of the documents land in the right cluster!
# compare with a supervised classifier trained on the true labels:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
clf = LogisticRegression()
print (cross_val_score(clf, matrix, new_dataset.target).mean())


0.9859313182465581


## Let's have 3 closer topics

In [20]:
dataset = fetch_20newsgroups(
    subset='train', 
    categories=['comp.sys.mac.hardware', 'comp.os.ms-windows.misc', 'comp.graphics'])

In [21]:
matrix = vectorizer.fit_transform(dataset.data)
model = KMeans(n_clusters=3, random_state=42)
preds = model.fit_predict(matrix.toarray())
print (preds)
print (dataset.target)

[2 1 0 ... 2 0 2]
[2 1 1 ... 2 0 2]


In [22]:
# no remapping here: the cluster ids appear to line up with the label ids already
# (otherwise build a mapping as before, e.g. mapping = {2: 0, 1: 1, 0: 2})
print(float(sum(preds != dataset.target)) / len(dataset.target))

0.22304620650313747
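Since cluster ids are arbitrary, the raw error rate is only meaningful once the ids are aligned with the labels. A permutation-invariant score avoids that step. The sketch below uses `adjusted_rand_score` and `normalized_mutual_info_score` from scikit-learn on made-up labels (an assumption for illustration), where the partition is perfect but the ids are permuted.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical toy labels: same partition, different cluster ids
labels_true = np.array([0, 0, 1, 1, 2, 2])
labels_pred = np.array([2, 2, 1, 1, 0, 0])

raw_error = np.mean(labels_pred != labels_true)   # misleading without remapping
ari = adjusted_rand_score(labels_true, labels_pred)
nmi = normalized_mutual_info_score(labels_true, labels_pred)

print(raw_error, ari, nmi)
```

Both scores reach their maximum for identical partitions regardless of how the clusters are numbered, so they could be applied to `dataset.target` and `preds` directly.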


In [23]:
# both the clustering and the classifier do worse here because the topics are much closer to each other

clf = LogisticRegression()
print (cross_val_score(clf, matrix, dataset.target).mean())

0.9264143264143264
