# Day 13: Unsupervised Learning


How do we find patterns in data when we know very little about it? We want to group observations, but without knowing in advance which groups they belong to.

Examples of Unsupervised Learning:

*   Dividing customers into groups that are not known in advance, to inform marketing plans.
*   Identifying similar songs in a Spotify database (without genre labels).
*   Trying to find members of a secret organization using nothing but their emails to and from one another.


In [None]:
#our second Pip, a very big install.
!pip install pyclustertend

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyclustertend
  Downloading pyclustertend-1.6.2-py3-none-any.whl (7.1 kB)
Collecting scikit-learn<0.25.0,>=0.24.0 (from pyclustertend)
  Downloading scikit-learn-0.24.2.tar.gz (7.5 MB)

In [None]:
import numpy as np
import pandas as pd
import plotly.figure_factory as ff  #Simpler than scipy's clustering approach, also has great other features!
from sklearn import preprocessing #Gotta normalize several variables, better than doing it manually each time.
from sklearn.preprocessing import LabelEncoder #Label encoding, also a pain to do manually.
from sklearn.cluster import KMeans
# from pyclustertend.hopkins import hopkins #Hopkins Clustering

from mpl_toolkits import mplot3d #Registers the '3d' projection on older matplotlib versions
import matplotlib.pyplot as plt

## Hierarchical Clustering

This procedure is deceptively simple.

What makes a good group? Well, members of a group should be similar to one another. More similar means a smaller "distance" between their values, so the distance between members of a group should be small.

So, let's take the two closest data points and gob them together. We track which groups are glued together and how far apart they were (in a dendrogram), and continue "gobbing" points together until we have one group. This way we can see which clusters of points are most naturally associated with one another and make a determination about appropriate groupings.
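The "gobbing" steps above can be sketched on a handful of toy points. This is a minimal illustration, not the notebook's Titanic analysis: the five one-dimensional points are made up, and SciPy's `linkage` is used here only because it returns the merge history directly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five made-up 1-D points: two tight pairs and one outlier.
points = np.array([[0.0], [0.1], [1.0], [1.2], [5.0]])

# Each row of the result records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
# Original points are numbered 0..4; each new cluster gets the next free number.
merges = linkage(points, method="single")

for a, b, dist, size in merges:
    print(f"merge {int(a)} and {int(b)} at distance {dist:.2f} -> size {int(size)}")
```

The closest pair is merged first, and the outlier joins last at the largest distance. That merge history is exactly what a dendrogram draws.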

In [None]:
dfTitanic = pd.read_csv('titanic3.csv')
dfTitanic.head() #Time to look at data and select what seems like it might be good for grouping passengers together, can use several numerical values


In [None]:
#Everything has to be numeric, no NA's
labelInstance = LabelEncoder #The class itself; each call below builds a fresh encoder
minMaxInstance = preprocessing.MinMaxScaler()
dfTitanic['encSex'] = labelInstance().fit_transform(dfTitanic['sex'])
#Assumption: the lifeboat column is named 'boat'. astype(str) turns missing boats into their own 'nan' category.
dfTitanic['encBoat'] = labelInstance().fit_transform(dfTitanic['boat'].astype(str)) #Note that this encoding assumes a relationship between the boats in some quasilinear fashion!
#It is vital that you standardize/normalize everything you can, otherwise the distance metrics are dominated by things like fare price and not by other elements. Your choice here matters.
dfTitanic[['stdFare','stdAge']] = minMaxInstance.fit_transform(dfTitanic[['fare','age']])
dfClean = dfTitanic[['pclass','survived','encSex','encBoat','stdFare','stdAge']].dropna(how='any')

In [None]:
fig = ff.create_dendrogram(dfClean[1:50]) #Only draw a few rows so the tree stays readable
fig.update_layout(width=800, height=500)
fig.show()

We can see from the figure above that there are multiple group types. The most distinct split separates passenger 0 and those to its right on the dendrogram (red) from all other passengers; the two groups sit at a distance of roughly 25 from each other.

We can take horizontal cross sections (say at distance 20) to identify how many groups there are and what each group contains. Some candidates: distance 25 gives 2 groups, 12 gives 3 groups, 7 gives 4 groups, etc.
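Taking a cross section can also be done in code. The sketch below uses made-up toy points and SciPy's `fcluster` with `criterion="distance"`, which is one way (not necessarily the one used for the figure above) to cut a tree at a chosen height.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1-D points: two tight pairs and one outlier.
points = np.array([[0.0], [0.1], [1.0], [1.2], [5.0]])
merges = linkage(points, method="single")

# Cut the tree at several heights and count the resulting groups.
groups_at_cut = {}
for cut in (0.5, 2.0, 4.0):
    labels = fcluster(merges, t=cut, criterion="distance")
    groups_at_cut[cut] = len(set(labels))
    print(f"cut at {cut}: {groups_at_cut[cut]} group(s), labels {labels}")
```

Lower cuts give more, tighter groups; higher cuts give fewer, looser ones.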

There are different ways to measure the distance between groups. The most obvious are perhaps 'average linkage' (the mean distance over all pairs of points, one from each cluster) and 'single linkage' (the distance between the two closest points, one from each cluster). Check your textbook for variations, but these generally cover the ideas of note.
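To see how the linkage choice changes the reported distance, here is a small sketch on made-up points using SciPy's `linkage`. The `"complete"` method (distance between the two farthest points) is included for contrast and is not discussed above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two made-up pairs of 1-D points: {0.0, 0.1} and {1.0, 1.2}.
points = np.array([[0.0], [0.1], [1.0], [1.2]])

final_merge_distance = {}
for method in ("single", "complete", "average"):
    merges = linkage(points, method=method)
    # The last row is the merge that joins the two pairs into one group.
    final_merge_distance[method] = merges[-1, 2]
    print(f"{method:>8}: pairs join at distance {merges[-1, 2]:.2f}")
```

The same two pairs are joined in every case, but the height at which they join differs, so a cut at a fixed distance can yield different group counts under different linkages.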

Critically, because this actually measures distance, one **must** use normalized/standardized data. **Results will vary by normalization/standardization choices.**
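A small sketch of why scaling matters, using four hypothetical passengers (the fare and age values are invented for illustration). It asks which passenger is nearest to the first one, before and after min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical passengers; columns are [fare, age].
passengers = np.array([
    [ 10.0, 20.0],   # 0: reference passenger
    [200.0, 22.0],   # 1: similar age, very different fare
    [ 40.0, 70.0],   # 2: similar fare, very different age
    [500.0, 80.0],   # 3: extreme on both
])

def nearest_to_first(data):
    """Return the row index (1, 2, ...) closest to row 0 by Euclidean distance."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

raw_nearest = nearest_to_first(passengers)
scaled_nearest = nearest_to_first(MinMaxScaler().fit_transform(passengers))
print("nearest on raw data:   ", raw_nearest)
print("nearest on scaled data:", scaled_nearest)
```

On the raw values the wide fare range dominates the distance; after scaling, age carries comparable weight and the nearest neighbor changes.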

## K-Means Clustering

Rather than start with the data and find the number of groups (at a given distance), one could designate a number of groups that we believe to be present in the data, then search for an appropriate match.

Matches are determined by randomly assigning points to a cluster.
We then find a centroid for all points in each cluster.
Then we reassign each observation to the closest centroid, and repeat until the assignments stabilize. Note that because this starts from a random assignment, it is **nondeterministic** in outcome.
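The loop described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy data, and the rule for re-seeding an empty cluster are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(data, k, seed=0, max_iter=100):
    """Toy k-means: random assignment, then alternate centroid/reassign steps."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(data))      # random initial assignment
    for _ in range(max_iter):
        # Centroid of each cluster; an empty cluster is re-seeded at a random point.
        centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j)
            else data[rng.integers(len(data))]
            for j in range(k)
        ])
        # Distance from every point to every centroid, then pick the closest.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):       # assignments are stable
            break
        labels = new_labels
    return labels, centroids

# Made-up 1-D data with two obvious lumps.
toy = np.array([[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]])
labels, centroids = kmeans_sketch(toy, k=2, seed=0)
print(labels, centroids.ravel())
```

Changing `seed` can swap which lump is called cluster 0 and which is cluster 1, which is one visible face of the nondeterminism.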

Note that a labeled group is not guaranteed to be found in this data. This is looking for unlabeled groups. For example, if I feed it every factor except survival, it will almost certainly not identify the groups survived/did not survive.

In [None]:
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1234) #random_state sets the (arbitrary) random seed; n_init is the number of restarts
kmeans.fit(dfClean)

ax = plt.axes(projection='3d')
ax.scatter3D(dfClean['encBoat'], dfClean['stdFare'], dfClean['stdAge'], c=kmeans.labels_);

ax.set_xlabel('Boat')
ax.set_ylabel('Fare')
ax.set_zlabel('Age')
ax.set_box_aspect(aspect=None, zoom=0.75)
plt.show()

Boats were a major component of the grouping. The encoding treats boats as contiguous numbers, but the boats were likely stored on different floors, etc. This is worth a long discussion and several sketches.

## Association Rules

We have a series of items (usually events occurring in time), and we are looking to see how often the antecedent (event before) is followed by a consequent (event after). We will call the antecedent A and the consequent B.

$P(B|A) = \text{Confidence} = \frac{P(A \text{ and } B)}{P(A)}$

However, this does not mean A *causes* B; those things are distinct. For example, if B simply happens all the time, then this confidence will be very high regardless of A, because $P(B) \approx 1$. One way to correct for this specific problem is to use the lift ratio (though it doesn't solve the problem entirely).

$\text{Lift Ratio} = \frac{\text{Confidence}}{P(B)}$

If this lift ratio exceeds 1, it suggests the rule might be useful, or at least more useful than ignoring A. However, one should be cautious about how frequently both A and B occur in the dataset; one would not want to perform this evaluation when either A or B is particularly sparse.
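The two formulas can be checked by hand on a tiny example. The sketch below uses made-up shopping baskets rather than timed events, and the helper names (`support`, `confidence`, `lift`) are chosen for illustration only.

```python
# Five made-up baskets; each is the set of items bought together.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter"},
]

def support(items, baskets):
    """Fraction of baskets containing every item in `items`, i.e. P(items)."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(a, b, baskets):
    """P(B | A) = P(A and B) / P(A)."""
    return support(set(a) | set(b), baskets) / support(a, baskets)

def lift(a, b, baskets):
    """Confidence divided by P(B)."""
    return confidence(a, b, baskets) / support(b, baskets)

for consequent in ({"butter"}, {"milk"}):
    c = confidence({"bread"}, consequent, transactions)
    l = lift({"bread"}, consequent, transactions)
    print(f"bread -> {sorted(consequent)}: confidence {c:.2f}, lift {l:.2f}")
```

In this toy data the bread-to-butter rule has a lift above 1, while bread-to-milk falls below 1 even though its confidence is not small, because milk is common on its own.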



## Measuring Clustering

Broadly, we would like to know whether a given dataset is suitable for unsupervised clustering. We can judge this with *a priori* knowledge or by visual inspection, and then complement that judgment with an automated measure (like the Hopkins statistic).

We also would like to know if a clustering is well done, given that some clusters are good and others are clearly... not. One automated measure is the *Silhouette coefficient*.

### Hopkins Statistic


In [None]:
from pyclustertend import hopkins #Needs the pyclustertend install from the top of the notebook
#Generally returns about 0.5 if uniformly distributed; closer to 0 means a stronger tendency to cluster.
#Note that in large data sets natural lumps will shrink this value, because real data deviates from theoretical norms and larger data sets are more likely to have such lumps.
#One might argue such lumps are indeed clusters of note, but the roughness of the surface limits the ability to find whatever true groupings we believe to be present.
hopkins(dfClean, 200) #Second argument is the number of sampled points

### Silhouette Coefficient

$s(i) = \frac{b(i)-a(i)}{\max(a(i),\, b(i))}$

Where *a(i)* is the mean distance between point *i* and the other members of its own cluster A, and *b(i)* is the mean distance from *i* to the members of the nearest other cluster B.

Ranges from -1 to 1: values near 1 mean point *i* sits well within cluster A, and values near -1 mean *i* is more similar to the members of cluster B.
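As a check on the formula, the sketch below computes $s(i)$ by hand for four made-up points and compares the result with scikit-learn's `silhouette_samples`. The helper `silhouette_of` is written for this illustration and takes cluster A to be the point's own cluster and cluster B to be the nearest other cluster.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four made-up 1-D points in two clusters.
points = np.array([[0.0], [1.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of(i, data, labels):
    """s(i) = (b - a) / max(a, b); assumes point i's cluster has other members."""
    dists = np.linalg.norm(data - data[i], axis=1)
    own = labels == labels[i]
    own[i] = False                                   # exclude the point itself
    a = dists[own].mean()                            # mean distance within own cluster
    b = min(dists[labels == other].mean()            # mean distance to nearest other cluster
            for other in set(labels) if other != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i, points, labels) for i in range(len(points))])
print("by hand:     ", np.round(manual, 3))
print("scikit-learn:", np.round(silhouette_samples(points, labels), 3))
```

Points on the outer edge of each cluster score higher than the points facing the other cluster, since their *b(i)* is larger.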

# Exercises

1) Discuss and explain why the silhouette coefficient is vulnerable to choices in scaling. Try to sketch out an example verbally.

2) Load a fresh dataset and make a call about the number of clusters using a hierarchical clustering method.

3) Load a fresh dataset and make a call about the number of clusters using K-means clustering.

4) Construct an elbow plot for the K-means clustering and discuss.