# CLUSTERING

Now that we've covered regression and classification, it's time to talk about clustering data!

Clustering is a Machine Learning technique that involves the grouping of data points. In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. (https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68)

Unfortunately, there are issues with the current version of TensorFlow and its implementation of KMeans. This means we cannot use KMeans without writing the algorithm from scratch. We aren't quite at that level yet, so we'll just explain the basics of clustering for now.

### Basic Algorithm for K-Means

    Step 1: Randomly pick K points to place K centroids
    Step 2: Assign all the data points to the centroids by distance. The closest centroid to a point is the one it is assigned to.
    Step 3: Average all the points belonging to each centroid to find the middle of those clusters (center of mass). Place the corresponding centroids into that position.
    Step 4: Reassign every point once again to the closest centroid.
    Step 5: Repeat steps 3-4 until no point changes which centroid it belongs to.
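
For readers who want to see the five steps in code, here is a minimal sketch in plain Python (not TensorFlow). The function name `k_means`, the 2-D toy points, and the choice to start the centroids on randomly chosen data points are all assumptions made for illustration, not part of the course material.

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as the starting centroids (assumption).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Steps 2 and 4: assign every point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once no point changes which centroid it belongs to.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean (center of mass) of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignments


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

On this toy data the first three points should end up in one cluster and the last three in the other. Which cluster gets label 0 depends on the random starting centroids.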

Please refer to the video for an explanation of KMeans clustering.

# HIDDEN MARKOV MODEL

"The Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution []. Transitions among the states are governed by a set of probabilities called transition probabilities." (http://jedlik.phy.bme.hu/~gerjanos/HMM/node4.html)

A hidden Markov model works with probabilities to predict future events or states based on past events. In this section we will learn how to create a hidden Markov model that can predict the weather.

This section is based on the following TensorFlow tutorial. https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/HiddenMarkovModel

In [1]:
import tensorflow_probability as tfp  # We are using a different module from tensorflow this time
import tensorflow as tf

### Weather Model

Taken directly from the TensorFlow documentation (https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/HiddenMarkovModel).

We will model a simple weather system and try to predict the temperature on each day given the following information.

    1) Cold days are encoded by a 0 and hot days are encoded by a 1.
    2) The first day in our sequence has an 80% chance of being cold.
    3) A cold day has a 30% chance of being followed by a hot day.
    4) A hot day has a 20% chance of being followed by a cold day.
    5) On each day the temperature is normally distributed with mean and standard deviation 0 and 5 on a cold day and mean and standard deviation 15 and 10 on a hot day.

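In this example, the temperature on a hot day averages 15 and, since the standard deviation is 10, falls between 5 and 25 about 68% of the time (within one standard deviation of the mean).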

TensorFlow Markov Method:
<pre>
tfp.distributions.HiddenMarkovModel(  
    initial_distribution,  
    transition_distribution,  
    observation_distribution,  
    num_steps,  
    validate_args=False,  
    allow_nan_stats=True,  
    time_varying_transition_distribution=False,  
    time_varying_observation_distribution=False,  
    mask=None,  
    name='HiddenMarkovModel'  
)
</pre>

In [2]:
tfd = tfp.distributions

# Since we have 2 states (cold and hot) we have 2 probabilities
initial_distribution = tfd.Categorical(probs=[0.8, 0.2]) # Point 2 above
transition_distribution = tfd.Categorical(probs=[[0.7, 0.3],
                                                 [0.2, 0.8]]) # Refers to points 3 & 4 above
# loc = mean & scale = std.dev
observation_distribution = tfd.Normal(loc=[0., 15.], scale=[5., 10.]) # Refers to point 5 above

In [3]:
# num_steps = number of days we want to predict over
model = tfd.HiddenMarkovModel(
    initial_distribution=initial_distribution,
    transition_distribution=transition_distribution,
    observation_distribution=observation_distribution,
    num_steps=7)

In [5]:
mean = model.mean()

# TensorFlow 2.x runs eagerly by default, so the value of this tensor can be
# read directly with .numpy(); no session is needed.
# (In TensorFlow 1.x style graph mode you would instead evaluate it with
# sess.run(mean) inside a tf.compat.v1.Session().)
print(mean.numpy())

[3.        5.9999995 7.4999995 8.25      8.625001  8.812501  8.90625  ]
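
The expected temperatures printed above can be checked by hand. The sketch below, in plain Python with no TensorFlow, pushes the cold/hot probabilities forward one day at a time and weights the two mean temperatures by them. The variable names are illustrative; the numbers are the ones from points 2 to 5 of the weather model.

```python
# Plain-Python check of the expected temperatures (illustrative sketch).
initial = [0.8, 0.2]        # P(cold), P(hot) on day 1
transition = [[0.7, 0.3],   # from cold: P(cold), P(hot)
              [0.2, 0.8]]   # from hot:  P(cold), P(hot)
state_means = [0.0, 15.0]   # mean temperature on cold / hot days

state_probs = initial
expected_temps = []
for day in range(7):
    # Expected temperature = sum over states of P(state) * mean(state).
    expected_temps.append(sum(p * m for p, m in zip(state_probs, state_means)))
    # Push the state probabilities one day forward with the transition matrix.
    state_probs = [
        sum(state_probs[i] * transition[i][j] for i in range(2))
        for j in range(2)
    ]

print(expected_temps)
```

Up to floating-point rounding, the values should match the output of `model.mean()`. They climb toward 9 because the chain settles into a long-run split of 40% cold and 60% hot days, and 0.6 × 15 = 9.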


In [6]:
model.log_prob(tf.zeros(shape=[7]))

<tf.Tensor: shape=(), dtype=float32, numpy=-19.855635>
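
To see roughly what `log_prob` computes, here is a sketch of the forward algorithm in plain Python. It is an illustration under the same weather-model numbers, not TensorFlow Probability's actual implementation. The helper names `normal_pdf` and `hmm_log_prob` are made up for this example.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def hmm_log_prob(observations, initial, transition, means, stds):
    """Forward algorithm, normalising at every step to avoid underflow."""
    n_states = len(initial)
    log_likelihood = 0.0
    belief = list(initial)  # P(state today) before seeing today's temperature
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict today's state from yesterday's belief.
            belief = [
                sum(belief[i] * transition[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        # Weight each state by how well it explains the observed temperature.
        weighted = [
            belief[s] * normal_pdf(obs, means[s], stds[s])
            for s in range(n_states)
        ]
        step_likelihood = sum(weighted)
        log_likelihood += math.log(step_likelihood)
        # Renormalise to get P(state today | temperatures so far).
        belief = [w / step_likelihood for w in weighted]
    return log_likelihood


week_log_prob = hmm_log_prob(
    observations=[0.0] * 7,
    initial=[0.8, 0.2],
    transition=[[0.7, 0.3], [0.2, 0.8]],
    means=[0.0, 15.0],
    stds=[5.0, 10.0],
)
print(week_log_prob)
```

The printed value should be close to the -19.855635 reported by `model.log_prob` above. Small differences are expected because TensorFlow works in float32 here.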

In [None]:
tfd = tfp.distributions

# A simple weather model.

# Represent a cold day with 0 and a hot day with 1.
# Suppose the first day of a sequence has a 0.8 chance of being cold.
# We can model this using the categorical distribution:

initial_distribution = tfd.Categorical(probs=[0.8, 0.2])

# Suppose a cold day has a 30% chance of being followed by a hot day
# and a hot day has a 20% chance of being followed by a cold day.
# We can model this as:

transition_distribution = tfd.Categorical(probs=[[0.7, 0.3],
                                                 [0.2, 0.8]])

# Suppose additionally that on each day the temperature is
# normally distributed with mean and standard deviation 0 and 5 on
# a cold day and mean and standard deviation 15 and 10 on a hot day.
# We can model this with:

observation_distribution = tfd.Normal(loc=[0., 15.], scale=[5., 10.])

# We can combine these distributions into a single week long
# hidden Markov model with:

model = tfd.HiddenMarkovModel(
    initial_distribution=initial_distribution,
    transition_distribution=transition_distribution,
    observation_distribution=observation_distribution,
    num_steps=7)

# The expected temperatures for each day are given by:

model.mean()  # shape [7], elements approach 9.0

# The log pdf of a week of temperature 0 is:

model.log_prob(tf.zeros(shape=[7]))
