We use [RAPIDS](https://rapids.ai/) to cluster the faces in the train dataset. RAPIDS is a suite of libraries developed and maintained by NVIDIA that runs its computations on the GPU.

The faces were cropped using the PyTorch version of facenet. Each crop is 160x160 pixels. A sample of approximately 2000 images can be found in the following [dataset](https://www.kaggle.com/skylord/sample-face-crop).

Inspiration is taken :) from the following awesome notebooks: 

- @Bojan's MNIST 2-D t-sne with rapids: [Link](https://www.kaggle.com/tunguz/mnist-2d-t-sne-with-rapids)
- @Henrique's Proper clustering with facenet embeddings + PCA: [Link](https://www.kaggle.com/hmendonca/proper-clustering-with-facenet-embeddings-eda/)



So who is the fastest! 
![FastestSuperHero](https://www.kaggle.com/skylord/sample-face-crop#best-flash-super-hero-dc-laser-time.jpg)


In [None]:
%%time
# We add the RAPIDS Kaggle dataset: https://www.kaggle.com/cdeotte/rapids
# This installs the package offline. Installation takes under a minute!
import sys
!cp ../input/rapids/rapids.0.11.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz
sys.path = ["/opt/conda/envs/rapids/lib"] + ["/opt/conda/envs/rapids/lib/python3.6"] + ["/opt/conda/envs/rapids/lib/python3.6/site-packages"] + sys.path
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
import cudf,cuml
import pandas as pd
import numpy as np
from cuml.manifold import TSNE
from cuml import PCA  
#from cuml.decomposition import PCA << this is also supported
from cuml.cluster import DBSCAN
#from cuml import DBSCAN << this is also supported
import matplotlib.pyplot as plt
%matplotlib inline

The scatter function below is defined but not used in this notebook. You can call it yourself for some interesting visualizations.

In [None]:
def scatter_thumbnails(data, images, zoom=0.12, colors=None):
    assert len(data) == len(images)

    # reduce embedding dimensions to 2
    x = PCA(n_components=2).fit_transform(data) if len(data[0]) > 2 else data

    # create a scatter plot.
    f = plt.figure(figsize=(22, 15))
    ax = plt.subplot(aspect='equal')
    sc = ax.scatter(x[:,0], x[:,1], s=4)
    _ = ax.axis('off')
    _ = ax.axis('tight')

    # add thumbnails :) The thumbnail display below is commented out.
#     from matplotlib.offsetbox import OffsetImage, AnnotationBbox
#     for i in range(len(images)):
#         image = plt.imread(images[i])
#         im = OffsetImage(image, zoom=zoom)
#         bboxprops = dict(edgecolor=colors[i]) if colors is not None else None
#         ab = AnnotationBbox(im, x[i], xycoords='data',
#                             frameon=(bboxprops is not None),
#                             pad=0.02,
#                             bboxprops=bboxprops)
#         ax.add_artist(ab)
    return ax


- Read the pre-encoded embeddings, which were created in the [original notebook](https://www.kaggle.com/skylord/face-clustering).
- That notebook encodes the first-frame face crops using the following code block:

```
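import torch
from PIL import Image
from tqdm import tqdm
from torchvision.transforms import ToTensor

# `resnet`, `device` and `face_files` are assumed to be defined earlier in the original notebook.
tf_img = lambda i: ToTensor()(i).unsqueeze(0)  # PIL image -> (1, C, H, W) tensor
embeddings = lambda batch: resnet(batch)       # run the face-embedding model

list_embs = []
with torch.no_grad():  # inference only, no gradients needed
    for face in tqdm(face_files):
        t = tf_img(Image.open(face)).to(device)
        e = embeddings(t).squeeze().cpu().tolist()
        list_embs.append(e)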
```


In [None]:
%%time
import pickle

embeddings = pd.read_pickle('/kaggle/input/sample-face-crop/embeddings_face_clusters.pkl')
print(embeddings.shape)
embeddings.head()

In [None]:
# Convert the embeddings to columns
# One column name per embedding dimension (512 in total)
colnames = ['colname_' + str(idx) for idx in range(512)]

embeddings[colnames] = pd.DataFrame(embeddings['embedding'].values.tolist(), index=embeddings.index)

In [None]:
# Convert to a NumPy array
embed_numpy = embeddings[colnames].to_numpy()
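
If only the numeric matrix is needed, the list-valued `embedding` column could also be stacked directly, without creating the per-dimension columns first. This is a minimal sketch, assuming every row holds a list of floats of the same length. The toy `rows` list is a hypothetical stand-in for `embeddings['embedding'].tolist()`, with 4 dimensions instead of 512:

```python
import numpy as np

# Hypothetical stand-in for embeddings['embedding'].tolist():
# three "faces", each with a 4-dimensional embedding instead of 512.
rows = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

# Stack the lists into a single (n_samples, n_dims) float32 matrix.
embed_matrix = np.asarray(rows, dtype=np.float32)

print(embed_matrix.shape)
```

Keeping the named columns, as the notebook does, is still useful when the embeddings need to stay alongside the other DataFrame columns.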


In [None]:
%%time
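# PCA down to 50 components, intended to speed up the later steps.
# Note: the t-SNE cells below are fitted on embed_numpy directly, and `x` is overwritten there.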
x = PCA(n_components=50).fit_transform(embed_numpy)
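
To illustrate what the PCA step computes, here is a minimal NumPy sketch based on the singular value decomposition. It is not how cuML implements PCA. The function name `pca_svd` and the random toy data are assumptions made for this example, and the signs of the components may differ from those of a library implementation:

```python
import numpy as np

def pca_svd(data, n_components):
    """Project `data` (n_samples, n_features) onto its top principal components."""
    # PCA works on zero-mean features.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 16))   # stand-in for the 512-dimensional embeddings
reduced = pca_svd(toy, n_components=5)

print(reduced.shape)
```

The columns of `reduced` come out ordered by decreasing variance, which is why keeping only the first few retains most of the structure.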


By default, t-SNE uses n_components=2, which uses the fast Barnes-Hut approximation.
With more dimensions, the exact method for calculating t-SNE is used.

In [None]:
%%time
tsne = TSNE(random_state=99)
x = tsne.fit_transform(embed_numpy)

The total time for the fit and transform was ~7.35 secs!!!

Compare this to the 3-5+ hours it can take with sklearn's t-SNE.
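
The `%%time` magic only works inside a notebook. To repeat such a comparison in a plain script, a small timing helper is one option. This is a sketch only: `timed` is a hypothetical helper, and the `sum` call stands in for a call such as `tsne.fit_transform(embed_numpy)`:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy workload standing in for something like tsne.fit_transform(embed_numpy).
result, seconds = timed(sum, range(1_000_000))

print(f"finished in {seconds:.4f} s")
```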

In [None]:
%%time
tsne50 = TSNE(random_state=99, n_components=50)
x50 = tsne50.fit_transform(embed_numpy)

DBSCAN’s main benefit is that the number of clusters is not a hyperparameter, and that it can find non-linearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.
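
The idea can be sketched in a few lines of plain Python. This is a brute-force illustration of the algorithm, not the cuML implementation. It assumes that a point counts as its own neighbour and that noise is labelled -1, and the toy `points` are made up for the example:

```python
import math

NOISE = -1

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)

    def neighbours(i):
        # Brute-force eps-neighbourhood, including the point itself.
        return [j for j in range(n) if euclidean(points[i], points[j]) <= eps]

    labels = [None] * n              # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE        # may later be claimed as a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_samples:
                queue.extend(reach)  # j is also a core point: keep expanding
    return labels

# Two tight groups of four points and one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)

print(labels)
```

The number of clusters is never passed in. It follows from `eps` and `min_samples`, and the isolated point ends up labelled as noise instead of being forced into a cluster.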


In [None]:
%%time 
dbscan = DBSCAN(eps=1.5, verbose=True)  # min_samples defaults to 5
clusters =  dbscan.fit_predict(x)
embeddings['RapidDBSCAN'] = clusters
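
A quick sanity check on the result is to count the clusters and the noise points. This is a minimal sketch, assuming that noise is labelled -1 (the scikit-learn convention, which cuML is assumed to follow here) and that the labels have been converted to a plain Python list. The `labels` values are hypothetical:

```python
from collections import Counter

# Hypothetical labels standing in for the `clusters` values computed above.
labels = [0, 0, 1, -1, 1, 0, 2, -1, 0, 2]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)          # -1 marks points left unclustered
n_clusters = len(counts)
largest = counts.most_common(1)[0]   # (cluster_id, size)

print(f"{n_clusters} clusters, {n_noise} noise points, largest: {largest}")
```

A very large noise count or a single giant cluster usually suggests that `eps` or `min_samples` needs tuning.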

In [None]:
embeddings.to_pickle('/kaggle/working/embeddings.pkl')
