# Clustering

Clustering is an unsupervised learning technique: it groups data points into clusters based on how well they fit together, and it doesn't need labels to find patterns. There are several different clustering algorithms.

## K Means Clustering

K-means clustering is an algorithm that groups the training data into clusters. The centers of those clusters can then be used to predict which cluster other (testing) data points would belong to.

1. Clustering starts off with a random assortment of **centroids**. The number of centroids is given by the user and denoted as **K**.
2. The distance from each data point to each centroid is calculated, and each data point is assigned to the centroid that is closest to it. The distance can be calculated using several metrics, including **Euclidean distance** and **Manhattan distance**.
3. After all the data points are assigned to a centroid, each centroid's location changes to the center (the mean) of the data points that are assigned to it. *The change in location of a centroid also changes the distance each data point is from the different centroids.* **The cycle is therefore repeated, as a data point may now be closer to another centroid.**
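The two distance metrics named in step 2 can be sketched as follows. This is a minimal illustration using NumPy; the point and centroid coordinates are made up for the example, and the helper names `euclidean` and `manhattan` are hypothetical. It shows that the choice of metric can change which centroid counts as the closest.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two points."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Sum of the absolute differences along each coordinate."""
    return float(np.sum(np.abs(a - b)))

# Illustrative values only (not from the notes above)
point = np.array([0.0, 0.0])
centroids = np.array([[3.0, 3.0],
                      [0.0, 5.0]])

# Index of the nearest centroid under each metric
nearest_euclidean = min(range(len(centroids)),
                        key=lambda i: euclidean(point, centroids[i]))
nearest_manhattan = min(range(len(centroids)),
                        key=lambda i: manhattan(point, centroids[i]))

print(nearest_euclidean, nearest_manhattan)
```

With these values the point is nearest to the first centroid under Euclidean distance but nearest to the second under Manhattan distance.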

The algorithm converges when the data points assigned to each centroid no longer change.
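The whole loop can be sketched in a few lines of NumPy. This is a rough sketch under stated assumptions rather than a reference implementation: the function names `k_means` and `predict` are hypothetical, the starting centroids are drawn from the data points themselves (one common way to get a random start), Euclidean distance is used, and the sample data is invented for the example.

```python
import numpy as np

def k_means(data, k, max_iters=100, seed=0):
    """Sketch of k-means: returns (centroids, labels)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the starting centroids (assumption)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iters):
        # Step 2: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Converged: no data point changed cluster since the last pass
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def predict(points, centroids):
    """Assign new points to the nearest learned centroid."""
    points = np.asarray(points, dtype=float)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Two well-separated groups of made-up points
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = k_means(data, k=2)
new_labels = predict([[0.5, 0.5], [10.5, 10.5]], centroids)
print(labels, new_labels)
```

Which cluster ends up numbered 0 or 1 depends on the random start, so only the grouping is meaningful, not the label values themselves.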

## Hidden Markov Models

A hidden Markov model works with probability distributions. It predicts future events or states given the probabilities of them occurring.

It is made up of a few key components: states, observations, and transitions.

* **States:** states are the things that define what is happening. For example, a state could be a hot day.
* **Observations:** observations are things that happen and have probabilities associated with them. Each observation is tied to a state: the state determines the probability distribution of what is observed.
* **Transitions:** transitions are the probabilities that the state changes from one to another. For example, a hot day has some probability that the next day will be a cold day.
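To make the three components concrete, here is a small NumPy sketch that samples a sequence from a two-state model. Everything in it is an assumption made for illustration: the state names, the probabilities, the use of a normally distributed temperature as the observation, and the helper name `sample_sequence`.

```python
import numpy as np

# States (hypothetical labels for state 0 and state 1)
states = ["cold", "hot"]

# Illustrative probabilities, chosen only for this example
initial_probs = np.array([0.8, 0.2])        # chance of starting cold / hot
transition_probs = np.array([[0.7, 0.3],    # cold -> cold, cold -> hot
                             [0.2, 0.8]])   # hot  -> cold, hot  -> hot

# Observations: a temperature drawn from a normal distribution per state
obs_mean = np.array([0.0, 15.0])
obs_std = np.array([5.0, 10.0])

def sample_sequence(num_steps, seed=0):
    """Sample hidden states and observations for num_steps time steps."""
    rng = np.random.default_rng(seed)
    state = rng.choice(len(states), p=initial_probs)
    hidden, observed = [], []
    for _ in range(num_steps):
        hidden.append(states[state])
        # The current state decides which distribution the observation comes from
        observed.append(float(rng.normal(obs_mean[state], obs_std[state])))
        # The transition row for the current state decides the next state
        state = rng.choice(len(states), p=transition_probs[state])
    return hidden, observed

hidden, observed = sample_sequence(7)
print(hidden)
print(observed)
```

Each row of `transition_probs` sums to 1, because from any state the model must move to some state (possibly the same one) at the next step.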

In [1]:
import tensorflow as tf
import tensorflow_probability as tfp

In [3]:
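tfd = tfp.distributions  # shorthand for the distributions module

# Probability of starting in state 0 or state 1
initial_distribution = tfd.Categorical(probs=[0.8, 0.2])
# Row i holds the probabilities of moving from state i to state 0 or state 1
transition_distribution = tfd.Categorical(probs=[[0.7, 0.3],
                                                 [0.2, 0.8]])
# Each state emits a normally distributed observation (loc = mean, scale = standard deviation)
observation_distribution = tfd.Normal(loc=[0., 15.], scale=[5., 10.])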

In [4]:
model = tfd.HiddenMarkovModel(
    initial_distribution=initial_distribution,
    transition_distribution=transition_distribution,
    observation_distribution=observation_distribution,
    num_steps=7  # number of time steps to model
)

In [5]:
mean = model.mean()  # expected observation at each of the 7 steps

# TensorFlow 2.x executes eagerly, so no session is needed to read the value
print(mean.numpy())

[2.9999998 5.9999995 7.4999995 8.25      8.625001  8.812501  8.90625  ]
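The output above can be checked by hand. The sketch below uses plain NumPy instead of TensorFlow Probability and reuses the same probabilities as the cells above: at each step the expected observation is the state probabilities weighted by each state's mean, and the state probabilities are then advanced one step through the transition matrix. The variable names are chosen for this example only.

```python
import numpy as np

initial = np.array([0.8, 0.2])
transition = np.array([[0.7, 0.3],
                       [0.2, 0.8]])
obs_loc = np.array([0.0, 15.0])  # mean observation for each state

state_probs = initial
expected = []
for _ in range(7):
    # Expected observation: probability of each state times that state's mean
    expected.append(state_probs @ obs_loc)
    # Advance the state probabilities by one time step
    state_probs = state_probs @ transition
expected = np.array(expected)

print(expected)
```

The result agrees with `model.mean()` up to floating-point rounding, since the TensorFlow output is computed in 32-bit floats.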
