# Context Matters - Clustering

Let's begin by assuming that we have collected GPS location data for various fossil samples. A dinosaur bone fragment is often similar in color, texture, and detail to other fossils. So how can we gain confidence that what we found is bone?

### Petrified Wood

![assets/PetrifiedWoodcropped1.PNG](assets/PetrifiedWoodcropped1.PNG)

![assets/PetrifiedWoodcropped2.PNG](assets/PetrifiedWoodcropped2.PNG)

### Shark Teeth

![assets/SharkTeeth20220520_113552.JPG](assets/SharkTeeth20220520_113552.JPG)

### Crocodilian Skutes

![assets/CrocodilianSkutes20220520_112944.JPG](assets/CrocodilianSkutes20220520_112944.JPG)

### Bones

![assets/DinosaurBonesLookLike.png](assets/DinosaurBonesLookLike.png)


Let's put all of our GPS data into a spreadsheet and use unsupervised learning to begin identifying context.

https://towardsdatascience.com/simple-example-of-2d-density-plots-in-python-83b83b934f67
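
As a minimal sketch of the spreadsheet step, assume the finds were exported as CSV. The column names `find_id`, `latitude`, and `longitude`, the helper `load_coordinates`, and the coordinate values below are all hypothetical and do not come from the original data.

```python
import csv
import io

# Hypothetical CSV export of the field spreadsheet; the column names
# and coordinate values are made up for illustration.
csv_text = """find_id,latitude,longitude
1,36.9101,-108.1002
2,36.9103,-108.1005
3,36.9250,-108.0801
"""

def load_coordinates(text):
    """Return a list of (longitude, latitude) float pairs from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["longitude"]), float(row["latitude"])) for row in reader]

coords = load_coordinates(csv_text)
print(coords)
```

The resulting list can be turned into an array with `np.array(coords)` and used in place of the synthetic `X` created later in this notebook.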


-------------
| 1 | 2 | 3 | 4 |
|---|---|---|---|
| 2 | 4 | 6 | 8 |


# Context matters!

![assets/NGMDB_DNM.PNG](assets/NGMDB_DNM.PNG)

Consult a geologic map to determine whether the fossils you found lie within a particular geological context. This will give you an indication of whether you found shark teeth in a marine environment or bones in a terrestrial environment. It may also indicate that your specimen is from the Triassic, Jurassic, or Cretaceous periods, which are the dinosaur periods. If geologists have already identified your search area as Tertiary or Quaternary, then what you found are NOT dinosaur fossils.
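
The period check described above can be sketched as a simple lookup. This is only an illustration: the function name `could_be_dinosaur` and the idea of passing the mapped period as a plain string are assumptions, not part of any geologic map API.

```python
# Periods of the Mesozoic era, the dinosaur periods named above
DINOSAUR_PERIODS = {"triassic", "jurassic", "cretaceous"}

def could_be_dinosaur(period):
    """Return True if the mapped period is one of the dinosaur periods."""
    return period.strip().lower() in DINOSAUR_PERIODS

for period in ["Jurassic", "Quaternary", "Cretaceous", "Tertiary"]:
    print(period, could_be_dinosaur(period))
```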

Once you find a cluster of fossils in close proximity, it is more likely that the other fossils nearby are from the same or similar specimens, or at least from the same time period.

In some cases, we find thousands of samples of petrified wood and no bone.

Other times we find hundreds of bone fragments, but no petrified wood.

The same goes for shark teeth. Areas with shark teeth TEND to rule out nearby bone fragments being dinosaur bones (but this is not a hard and fast rule).

In some cases we find petrified wood and crocodilian skutes mingled together, and then a few hundred yards away we find dinosaur bones grouped together.

BUT - MANY, MANY times, finding a few definitive bone samples helps us conclude that the samples very nearby are likely also bone. The same goes for shark teeth: find a few definitive ones and the partial fossils nearby are most likely also shark teeth, and so on.

# Goal: Identify clusters

What we would like is to plot our GPS coordinates and look for patterns suggesting that fossils are clustering together.

In some cases the clusters will overlap, but in many other cases the clusters reveal that fossils of a similar nature are found together.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
from sklearn.datasets import make_blobs
n_components = 4

labels = ['bone', 'shark teeth', 'crocodilian', 'petrified wood']

X, truth = make_blobs(n_samples=3000, centers=n_components, 
                      cluster_std = [2, 1.5, 1, 1], 
                      random_state=42)
plt.scatter(X[:, 0], X[:, 1], s=50, c = truth, cmap='Set1')
plt.title(f"Map of a mixture of {len(X)} fossils of {n_components} fossil types")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.grid()

# In the real world

We typically cannot identify each fossil specifically:
- the fossil is too small (shape does not help)
- the fossil is too similar to other types of fossils
- too many species were buried in the same area

## The point is: Our map typically looks like this 

Because we don't even know what kinds of fossils we actually have, our first step is to identify any naturally occurring clusters. Since we don't yet have any other information about the fossils, they are all colored the same: gray.

In [None]:
plt.scatter(X[:, 0], X[:, 1], s=50, c = 'lightgray')
plt.title(f"Map of a mixture of {len(X)} fossils of {n_components} fossil types")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.grid()

# Clustering with KMeans 

For simple distributions like the one above, we might try to cluster with KMeans.

FYI: Applying `patch_sklearn()` from the Intel Extension for Scikit-learn* yields a speedup of 4 to 5X during the fitting of the model.

## Exercise: Experiment with patching and unpatching

You can patch or unpatch subsequent imports from sklearn. Just be sure to patch before you import from sklearn.



In [None]:
from sklearnex import patch_sklearn, unpatch_sklearn
patch_sklearn()
from sklearn.cluster import KMeans
import time

start = time.time()
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
print(f"elapsed: {time.time() - start} seconds")
kmeans.labels_

fig, ax = plt.subplots()


scatter = plt.scatter(X[:, 0], X[:, 1], s=50, c = kmeans.labels_)
plt.title(f"Map of a mixture of {len(X)} fossils of {n_components} fossil types")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.grid()

legend1 = ax.legend(*scatter.legend_elements(),
                    loc="lower right", title="Classes")
ax.add_artist(legend1)

plt.show()
labels = kmeans.labels_

# Build a dictionary that maps each cluster ID found by KMeans
# to the indices of the points assigned to that cluster
index = {}
for i in range(4):
    index[i] = np.where(labels == i)

# 3000 points across 4 clusters: roughly 750 points per cluster
# if KMeans recovers the original blobs



# Context! 

Now identification has gotten easier.

We can take photographs, video, GPS coordinates, and measurements, then head back for an internet search to try to identify even a small number of fossils from each cluster. We can choose the best specimens, the most representative ones, or the ones with outer curved surfaces that may help identify the specimen.

Context can help identify the other unknown fossils once we get a positive identification of a few specimens.

The problem with KMeans is that the cluster IDs it assigns are arbitrary: from one run to the next it may give the same group of points a different ID, so the blob at the lower left might be identified as cluster 2 on one run and as cluster 3 on another, for example.

Still, it has given us great insight and brought us to a point where we can classify these points in a more consistent way.
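
One simple way to see the ID problem, and one possible workaround, is to renumber clusters by the position of their centers. This is not the approach taken in this notebook (which trains a classifier next). It is a minimal sketch, and the helper `relabel_by_center` and the toy arrays are hypothetical.

```python
import numpy as np

def relabel_by_center(labels, centers):
    """Renumber clusters so IDs follow the centers sorted by x, then y."""
    # np.lexsort sorts by the last key first: primary key x, secondary key y
    order = np.lexsort((centers[:, 1], centers[:, 0]))
    # mapping[old_id] = new_id
    mapping = np.empty(len(centers), dtype=int)
    mapping[order] = np.arange(len(centers))
    return mapping[labels], centers[order]

# Two hypothetical runs that found the same clusters under different IDs
centers_a = np.array([[5.0, 1.0], [-7.0, -7.0], [-2.0, 9.0]])
labels_a = np.array([0, 0, 1, 2, 2])
centers_b = centers_a[[2, 0, 1]]          # same centers, shuffled IDs
labels_b = np.array([1, 1, 2, 0, 0])      # same grouping, renamed

new_a, _ = relabel_by_center(labels_a, centers_a)
new_b, _ = relabel_by_center(labels_b, centers_b)
print(new_a, new_b)
```

With a fitted model, the same helper could be called as `relabel_by_center(kmeans.labels_, kmeans.cluster_centers_)`. It only stabilizes IDs when the runs find essentially the same centers.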

### Classifier Will Color Sites Consistently

#### Exercise: Experiment with patching and unpatching SVC

In [None]:
# Let's use 8 points from each class to create a classifier.
# The classifier is fit on 8 values of X and 8 values of y per class,
# then used to predict a fossil type based on the locality of the find.

n_per_class = 8
# indices of the first 8 points in each of the 4 KMeans clusters
train_idx = np.concatenate([index[k][0][:n_per_class] for k in range(4)])

X_train = X[train_idx]
y_train = labels[train_idx]

patch_sklearn("SVC") # patch SVC algo 
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC  # accelerated with Intel Extensions for Scikit-learn
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
start = time.time()
clf.fit(X_train,  y_train.ravel())
y_pred = clf.predict(X)
print(f"{time.time() - start} seconds")
plt.scatter(X[:,0], X[:,1], c = y_pred)
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.grid()

# Clustering with DBSCAN for more complicated data distributions

We will use DBSCAN from scikit-learn to attempt to identify our clusters.

BUT - DBSCAN requires a parameter called `eps`, which is the maximum distance between two points for them to be considered neighbors (and so potentially part of the same cluster).

To find a good value for `eps`, we will use scikit-learn's NearestNeighbors to compute each point's distance to its nearest neighbor and plot those distances, which helps identify a good starting value for DBSCAN.

We plot the sorted distance curve and look for the distance at the knee of the curve, or sometimes one of the larger values on the curve.
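
The knee is usually read off the plot by eye. As a rough sketch of how it could be estimated programmatically, the heuristic below picks the point of the sorted curve that lies farthest below the straight line joining its first and last points. The helper `find_knee` and the synthetic curve are hypothetical, and this heuristic is only one of several ways to locate a knee.

```python
import numpy as np

def find_knee(sorted_distances):
    """Return (index, value) of the point farthest from the straight line
    joining the first and last points of an ascending distance curve."""
    y = np.asarray(sorted_distances, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Scale both axes to [0, 1] so neither dominates the comparison
    xn = x / x[-1]
    yn = (y - y[0]) / (y[-1] - y[0])
    # After scaling, the chord is the line y = x; the knee of a curve that
    # stays flat and then rises sharply lies farthest below that line
    gap = xn - yn
    i = int(np.argmax(gap))
    return i, y[i]

# Synthetic curve: nearly flat for most finds, then rising sharply
curve = np.concatenate([np.linspace(0.05, 0.4, 90), np.linspace(0.5, 3.0, 10)])
knee_index, knee_value = find_knee(curve)
print(knee_index, round(knee_value, 3))
```

On the notebook's data, this could be applied to `np.sort(distances[:, 1])`. The result is a starting value for `eps`, not a final answer.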

In [None]:
from sklearn.neighbors import NearestNeighbors

patch_sklearn() # patch all remaining sklearn imports

nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nn.kneighbors(X)
plt.plot(sorted(distances[:,1]))
plt.ylabel('Distance')
plt.xlabel('Number of finds')
plt.grid()


In [None]:
knee = .4

In [None]:
from sklearn.cluster import DBSCAN
import numpy as np

fig, ax = plt.subplots()

# eps=1.4 is the value used in this run; a knee read from the
# distance curve above is only a starting point for tuning eps
clustering = DBSCAN(eps=1.4, min_samples=5).fit(X)
color = clustering.labels_

scatter = ax.scatter(X[:, 0], X[:, 1], c=color)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
# one legend entry per cluster label found by DBSCAN
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="lower right", title="Classes")
ax.add_artist(legend1)

ax.grid()

**Note: DBSCAN Identifies Outliers as class -1**
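
A small sketch of how the `-1` label might be handled after clustering. The `cluster_labels` array below is a made-up stand-in for `clustering.labels_`, not output from this notebook.

```python
import numpy as np

# Hypothetical DBSCAN output: cluster IDs per find, with -1 marking noise
cluster_labels = np.array([0, 0, 1, -1, 1, 0, -1, 2, 2, 2])

is_noise = cluster_labels == -1
n_noise = int(is_noise.sum())
# Count clusters without counting the noise label
n_clusters = len(set(cluster_labels.tolist()) - {-1})

print(f"{n_clusters} clusters, {n_noise} noise points")

# Keep only the finds that belong to a cluster
clustered_ids = cluster_labels[~is_noise]
```

The same boolean mask can be used to drop the outlier rows from `X` (for example `X[~is_noise]`) before plotting or training a classifier.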

# Kernel Density Estimation

The KDE is evaluated on a single long vector of grid positions, so the result has to be reshaped back into a 2D grid.

Say that when you get home you often drop your keys and relax for dinner. Later you decide to drive to a grocery store for snacks. You've lost your keys!

But you have lost your keys many times before and have kept a record of exactly where you found them over the course of time.

It seems reasonable that the likelihood of finding your keys near locations where you found them previously is high. Finding them in your dining room, living room, bathroom, or bedroom each has a historical probability.

On the other hand, seeing as you've never been to Nome, Alaska, your likelihood of finding your keys there is very nearly zero.

Even if you have never found your keys in the bathroom before, it makes sense that the probability of finding them in the bathroom is higher than the probability of finding them in Nome. There is a sense in which locations close to a dense set of previous finds have higher probabilities than locations very, very far away.

This is similar to finding dinosaur bones, petrified wood, prehistoric shark teeth, or prehistoric crocodilian remains. The probability of finding a dinosaur bone is higher when other dinosaur bones have been found nearby.

This is where clustering and kernel density estimation can help determine what a given sample is likely to be and how likely it is to find a bone in a particular location.
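
To make the idea concrete, here is a hand-rolled sketch of a 2D Gaussian kernel density estimate. It assumes a fixed, isotropic bandwidth chosen by hand, which is a simplification: `scipy.stats.gaussian_kde`, used later in this notebook, selects its bandwidth automatically and accounts for the covariance of the data. The helper `gaussian_kde_at` and the toy `finds` array are hypothetical.

```python
import numpy as np

def gaussian_kde_at(query, points, bandwidth):
    """Average of isotropic 2D Gaussian kernels centered on each point."""
    diff = points - query                      # shape (n, 2)
    sq_dist = np.sum(diff ** 2, axis=1)        # squared distance to each find
    norm = 1.0 / (2.0 * np.pi * bandwidth ** 2)
    return float(np.mean(norm * np.exp(-sq_dist / (2.0 * bandwidth ** 2))))

# Hypothetical previous finds, grouped around (0, 0)
finds = np.array([[0.0, 0.0], [0.5, 0.2], [-0.3, 0.4], [0.1, -0.6]])

near = gaussian_kde_at(np.array([0.2, 0.1]), finds, bandwidth=0.5)
far = gaussian_kde_at(np.array([50.0, 50.0]), finds, bandwidth=0.5)
print(near, far)
```

The density near the previous finds is much larger than the density far away, which mirrors the keys example: a spot you have never searched can still have a meaningful density if it sits close to earlier finds.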






In [None]:
# Extract x and y
x = X[:, 0]  # one entry per find (n_samples values)
y = X[:, 1]
# Define the borders, padded by 10% of the data range on each side
deltaX = (max(x) - min(x))/10
deltaY = (max(y) - min(y))/10
xmin = min(x) - deltaX
xmax = max(x) + deltaX
ymin = min(y) - deltaY
ymax = max(y) + deltaY
print(xmin, xmax, ymin, ymax)
# Create a 100 x 100 meshgrid
xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]

positions = np.vstack([xx.ravel(), yy.ravel()]) # flatten the grid into a 2 x N array for the KDE
values = np.vstack([x, y])
kernel = st.gaussian_kde(values)
f = np.reshape(kernel(positions).T, xx.shape) # reshape back to the 2D grid

In [None]:
fig = plt.figure(figsize=(8,8))
ax = fig.gca()
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
cfset = ax.contourf(xx, yy, f, cmap='coolwarm')
ax.imshow(np.rot90(f), cmap='coolwarm', extent=[xmin, xmax, ymin, ymax])
cset = ax.contour(xx, yy, f, colors='k')
ax.clabel(cset, inline=1, fontsize=10)
ax.set_xlabel('X')
ax.set_ylabel('Y')
plt.title('2D Gaussian Kernel density estimation')

In [None]:
# idx = f > .006
# fig = plt.figure(figsize=(8,8))
# ax = fig.gca()
# ax.set_xlim(xmin, xmax)
# ax.set_ylim(ymin, ymax)
# ax.scatter(xx[idx], yy[idx])
# plt.grid()

In [None]:
# Optional sanity check: the density should integrate to roughly 1 over the grid
# (scipy.integrate.simps has been replaced by scipy.integrate.simpson)
# from scipy.integrate import simpson

# grid bounds printed earlier (xmin, xmax, ymin, ymax):
# -11.097585529030251 9.858180523990058 -11.545696279950441 16.81828998331647
# simpson(simpson(f, x=yy[0, :]), x=xx[:, 0])


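# Normalize the highest peak

Dividing the density by its maximum rescales it so that the highest peak has a value of 1. Read this as a score of 1 at the most likely location (like a 100% chance at Mammoth Mesa) and smaller values everywhere else. This is a relative scale, not a true probability.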

In [None]:
fNorm = f/f.max()
fig = plt.figure(figsize=(8,8))
ax = fig.gca()
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
cfset = ax.contourf(xx, yy, fNorm, cmap='coolwarm')
ax.imshow(np.rot90(fNorm), cmap='coolwarm', extent=[xmin, xmax, ymin, ymax])
cset = ax.contour(xx, yy, fNorm, colors='k')
ax.clabel(cset, inline=1, fontsize=10)
ax.set_xlabel('X')
ax.set_ylabel('Y')
plt.title('Normalized 2D Gaussian kernel density estimation')
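
The peak-normalized map gives relative scores. If an actual probability for a region is wanted, one option is to integrate the un-normalized density over that region. The sketch below is only an illustration under stated assumptions: it uses a synthetic standard normal density on a grid as a stand-in for `f`, a simple sum over grid cells instead of a more careful integration rule, and a hypothetical helper `region_probability`.

```python
import numpy as np

# Synthetic stand-in for the KDE grid: a standard 2D normal density
gx, gy = np.mgrid[-5:5:201j, -5:5:201j]
density = np.exp(-(gx ** 2 + gy ** 2) / 2.0) / (2.0 * np.pi)

# Area of one grid cell
cell_area = (gx[1, 0] - gx[0, 0]) * (gy[0, 1] - gy[0, 0])

def region_probability(density, mask, cell_area):
    """Approximate the probability mass inside the masked grid cells."""
    return float(np.sum(density[mask]) * cell_area)

# Mass over the whole grid (should be close to 1) and inside the unit circle
total = region_probability(density, np.ones_like(density, dtype=bool), cell_area)
inside = region_probability(density, gx ** 2 + gy ** 2 <= 1.0, cell_area)
print(round(total, 3), round(inside, 3))
```

With the notebook's arrays, the same helper could be applied to `f` with a mask built from `xx` and `yy`, using the spacing of that grid for the cell area.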