# Point Cloud Analysis
In this notebook, we attempt to classify proteins by symmetry type using the point cloud of atoms in each protein. We equip the space of point clouds with the Hausdorff metric and run a k-nearest neighbors (KNN) classifier. Although this approach is more geometrically motivated, the model does not perform as well as our other approaches.

In [1]:
import os
import sys
import glob

import pandas as pd
import numpy as np
import pickle
from random import sample
import math

from sklearn.neighbors import KNeighborsClassifier

sys.path.append('../src/')
from ProteinPointCloud import *

Now let's load the data. First is a list of proteins that have C2 symmetry. Our classifier will predict whether or not a protein is in this list.

In [2]:
with open("../symmetry-lists/C2_list.pkl", "rb") as file:
    entry_list = pickle.load(file)

We first split the protein files into a training set and a testing set, holding out a random fifth of the files for testing.

In [3]:
location = os.path.join('../proteins/large-batch-1', '*.cif')
files = glob.glob(location)
testing_files = sample(files, len(files)//5)
training_files = [f for f in files if f not in testing_files]
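
The split above is unseeded, so the counts reported in the following cells change from run to run. As a minimal sketch of a reproducible variant, the helper below draws the sample from a seeded generator. The name `split_files`, its parameters, and the toy file names are assumptions made for illustration and are not part of this notebook.

```python
import random

def split_files(files, test_fraction=0.2, seed=0):
    """Return (training, testing) lists using a seeded random sample."""
    rng = random.Random(seed)  # private generator, leaves the global state untouched
    testing = rng.sample(files, int(len(files) * test_fraction))
    held_out = set(testing)  # set membership avoids rescanning the list per file
    training = [f for f in files if f not in held_out]
    return training, testing

# Hypothetical file names, used only to exercise the helper.
toy_files = ["file_{}.cif".format(i) for i in range(10)]
train, test = split_files(toy_files)
print(len(train), len(test))  # 8 2
```

Calling the helper twice with the same seed returns the same split, which makes the class counts and the later accuracy figures repeatable.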

Let's see how many of these files have C2 symmetry. Since KNN classifies by majority vote among neighbors, we would prefer something close to a 50/50 split between proteins with and without C2 symmetry.

In [6]:
symmetries = 0
for f in training_files:
    entry = f[-8:-4].upper()
    if entry+'-1' in entry_list:
        symmetries = symmetries + 1
print('C2 symmetries in training data:  ',symmetries,'/',len(training_files))

C2 symmetries in training data:   126 / 200


In [7]:
symmetries = 0
for f in testing_files:
    entry = f[-8:-4].upper()
    if entry+'-1' in entry_list:
        symmetries = symmetries + 1
print('C2 symmetries in testing data:  ',symmetries,'/',len(testing_files))

C2 symmetries in testing data:   33 / 50


For each protein file in the training and testing sets, we extract the point cloud given by the positions of the atoms in the protein. This is stored as a NumPy array whose rows are 3-vectors, one per point in space.

There are three formatting steps employed to standardize the collection of point clouds:

 - Translate each cloud so that its center of mass lies at the origin.
 - Scale each cloud so that it lies within a ball of radius 1.
 - Rotate each cloud so that the point farthest away from the center is at (1,0,0).
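
The three steps above can be sketched in NumPy as follows. This is not the `regularize_cloud` function from `ProteinPointCloud`, whose source is not shown here. The name `regularize_cloud_sketch` is hypothetical, the sketch uses the unweighted centroid in place of a mass-weighted center, and it picks one particular rotation (via Rodrigues' formula) out of the many that send the farthest point to (1,0,0).

```python
import numpy as np

def regularize_cloud_sketch(points):
    """Center, scale, and rotate an (n, 3) array of coordinates."""
    # 1. Translate: move the unweighted centroid to the origin (atomic masses are ignored).
    centered = points - points.mean(axis=0)
    # 2. Scale: divide by the largest norm so every point lies in the unit ball.
    norms = np.linalg.norm(centered, axis=1)
    scaled = centered / norms.max()
    # 3. Rotate: send the farthest point (now a unit vector) to (1, 0, 0).
    farthest = scaled[norms.argmax()]
    target = np.array([1.0, 0.0, 0.0])
    axis = np.cross(farthest, target)
    cos_angle = float(farthest @ target)
    if np.isclose(cos_angle, -1.0):
        # Antipodal case: a half-turn about the z-axis maps (-1, 0, 0) to (1, 0, 0).
        rotation = np.diag([-1.0, -1.0, 1.0])
    else:
        # Rodrigues' formula for the rotation taking `farthest` to `target`.
        skew = np.array([[0.0, -axis[2], axis[1]],
                         [axis[2], 0.0, -axis[0]],
                         [-axis[1], axis[0], 0.0]])
        rotation = np.eye(3) + skew + skew @ skew / (1.0 + cos_angle)
    # Points are stored as rows, so apply the rotation on the right.
    return scaled @ rotation.T

cloud = np.array([[0.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 2.0]])
print(np.round(regularize_cloud_sketch(cloud), 3))
```

Note that fixing the farthest point still leaves the cloud free to rotate about the x-axis, so two copies of the same protein in different orientations are not guaranteed to coincide after these steps.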

In [8]:
cloud_list = []
for f in training_files:
    pcd = atom_cloud(f)
    cloud_list.append(regularize_cloud(pcd))
for f in testing_files:
    pcd = atom_cloud(f)
    cloud_list.append(regularize_cloud(pcd))

We can now build a KNN classifier on this list of point clouds. Note that we refer to each cloud by its index in the list rather than by its coordinates, so the metric receives indices and looks up the corresponding clouds. Building and fitting the model takes a substantial amount of time because the Hausdorff metric is computationally expensive and does not benefit from the standard techniques that make KNN run faster.

In [11]:
# Build a KNN classifier on cloud indices, using the Hausdorff distance between clouds as the metric.
neigh = KNeighborsClassifier(
    n_neighbors=14,
    metric=lambda i, j: hausdorff_distance(cloud_list[int(i[0])], cloud_list[int(j[0])]))
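
The `hausdorff_distance` function used above comes from `ProteinPointCloud`, and its implementation is not shown in this notebook. For reference, a brute-force sketch of the symmetric Hausdorff distance between two point arrays might look like the following. The name `hausdorff_distance_sketch` is hypothetical, and the real function may differ in both method and performance.

```python
import numpy as np

def hausdorff_distance_sketch(a, b):
    """Symmetric Hausdorff distance between an (n, 3) and an (m, 3) point array."""
    # Pairwise Euclidean distances, shape (n, m); time and memory grow as n * m.
    diffs = a[:, None, :] - b[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Directed distance: the farthest any point lies from its nearest neighbor in the other cloud.
    a_to_b = dists.min(axis=1).max()
    b_to_a = dists.min(axis=0).max()
    # The symmetric distance is the larger of the two directed distances.
    return max(a_to_b, b_to_a)

triangle = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shifted = triangle + np.array([0.0, 0.0, 2.0])
print(hausdorff_distance_sketch(triangle, shifted))  # 2.0
```

In this brute-force form, every distance evaluation touches all pairs of atoms, which gives a sense of why each comparison between two proteins is costly.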

In [12]:
X = [[i] for i in range(len(training_files))]
y = [training_files[i][-8:-4].upper()+'-1' in entry_list for i in range(len(training_files))]
neigh.fit(X,y)
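
Since each metric call is expensive, one alternative is to compute every test-to-training distance once and classify from the stored matrix. scikit-learn's `KNeighborsClassifier` accepts `metric='precomputed'` for this purpose. The sketch below instead spells out the majority vote in NumPy so the mechanics are visible. The names `distance_matrix` and `knn_vote` are hypothetical, and the toy data uses numbers with absolute difference as a stand-in for clouds with `hausdorff_distance`.

```python
import numpy as np

def distance_matrix(queries, references, metric):
    """Evaluate `metric` once for every (query, reference) pair."""
    return np.array([[metric(q, r) for r in references] for q in queries])

def knn_vote(dist_to_train, train_labels, k):
    """Predict a boolean label per row by majority vote over the k nearest columns."""
    train_labels = np.asarray(train_labels, dtype=bool)
    nearest = np.argsort(dist_to_train, axis=1)[:, :k]
    votes = train_labels[nearest].sum(axis=1)
    # Ties (possible when k is even) resolve to False here; this is an arbitrary choice.
    return votes * 2 > k

# Toy stand-ins: numbers instead of clouds, absolute difference instead of hausdorff_distance.
train_points = [0.0, 0.1, 0.2, 1.0, 1.1]
train_labels = [True, True, True, False, False]
test_points = [0.05, 1.05]
d = distance_matrix(test_points, train_points, lambda a, b: abs(a - b))
print(knn_vote(d, train_labels, k=3).tolist())  # [True, False]
```

Storing the matrix also makes it cheap to try several values of k without recomputing any distances. With two classes and an even k such as 14, tied votes are possible, so the tie-breaking rule can affect predictions.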

Now let's see how well this model predicts whether a protein has C2 symmetry on the testing set.

In [14]:
predictions = {}
actuals = {}

for i in range(len(testing_files)):
    txt = (("\rPredicting symmetry on protein {current} of {total}.   {percent}% complete.")
                   .format(current = i+1, total = len(testing_files), 
                           percent = math.floor(i/len(testing_files)*100)))
    sys.stdout.write(txt)
    sys.stdout.flush()
    
    # Test clouds are stored after the training clouds in cloud_list.
    key = testing_files[i][-8:-4].upper()+'-1'
    predictions[key] = neigh.predict([[i+len(training_files)]])[0]
    actuals[key] = key in entry_list

Predicting symmetry on protein 50 of 50.   98% complete.

In [16]:
accuracy = len([key for key in predictions.keys() if predictions[key] == actuals[key]])/len(predictions.keys())
print('Accuracy of model:',accuracy)
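# Baseline: always predict the more common class (symmetries holds the testing-set count from above).
print('Accuracy using mode:',max(symmetries/len(testing_files), 1-symmetries/len(testing_files)))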

Accuracy of model: 0.58
Accuracy using mode: 0.66


The accuracy is not impressive: at 0.58, the model does worse than the 0.66 obtained by always predicting the more common class. Given the computational cost of the metric, it's hard to say whether adding more data to the model would substantially improve its performance.

In [17]:
precision = (len([key for key in predictions if predictions[key] and actuals[key]])
             /len([key for key in predictions if predictions[key]]))
print('Precision:',precision)

Precision: 0.6578947368421053


In [18]:
recall = (len([key for key in predictions if predictions[key] and actuals[key]])
          /len([key for key in predictions if actuals[key]]))
print('Recall:',recall)

Recall: 0.7575757575757576
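
The three metrics above share the same underlying counts, so they can be computed together from a single pass over the predictions. The helper below is a sketch under that assumption. The name `classification_summary` and the toy entry IDs are hypothetical and are not taken from the data set.

```python
def classification_summary(predictions, actuals):
    """Return accuracy, precision, and recall for two dicts of booleans with the same keys."""
    keys = list(predictions)
    tp = sum(1 for k in keys if predictions[k] and actuals[k])      # true positives
    fp = sum(1 for k in keys if predictions[k] and not actuals[k])  # false positives
    fn = sum(1 for k in keys if not predictions[k] and actuals[k])  # false negatives
    tn = len(keys) - tp - fp - fn                                   # true negatives
    return {
        "accuracy": (tp + tn) / len(keys),
        # Guard against division by zero when nothing is predicted or labelled positive.
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }

# Hypothetical entries, used only to exercise the helper.
toy_predictions = {"AAAA-1": True, "BBBB-1": True, "CCCC-1": False, "DDDD-1": False}
toy_actuals = {"AAAA-1": True, "BBBB-1": False, "CCCC-1": True, "DDDD-1": False}
print(classification_summary(toy_predictions, toy_actuals))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5}
```

Keeping the four counts explicit also makes it easy to see how often the model predicts the negative class at all, which accuracy, precision, and recall do not show directly.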
