
**Using Various Nearest Neighbour Algorithms**

**About the Dataset**

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.


In [39]:
import pandas as pds

In [40]:
# reading the CSV file
file = ('/content/diabetes 2.csv')
dataset = pds.read_csv(file)
  
# displaying the contents of the CSV file
dataset.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


**Importing faiss and pickle**

In [3]:
!pip install faiss

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss
  Downloading faiss-1.5.3-cp37-cp37m-manylinux1_x86_64.whl (4.7 MB)
Installing collected packages: faiss
Successfully installed faiss-1.5.3


In [4]:
!sudo apt-get install libomp-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  libomp5
Suggested packages:
  libomp-doc
The following NEW packages will be installed:
  libomp-dev libomp5
0 upgraded, 2 newly installed, 0 to remove and 5 not upgraded.
Need to get 239 kB of archives.
After this operation, 804 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp5 amd64 5.0.1-1 [234 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp-dev amd64 5.0.1-1 [5,088 B]
Fetched 239 kB in 1s (238 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 2.)

In [5]:
import pickle
import faiss
import numpy as np

**Splitting dataset into Vectors and Labels (Age)**

In [6]:
# Age is held out as the label; all remaining columns, Outcome included, form the vector
data_vectors = dataset.drop(['Age'], axis = 1)
data_vectors

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Outcome
0,6,148,72,35,0,33.6,0.627,1
1,1,85,66,29,0,26.6,0.351,0
2,8,183,64,0,0,23.3,0.672,1
3,1,89,66,23,94,28.1,0.167,0
4,0,137,40,35,168,43.1,2.288,1
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,0
764,2,122,70,27,0,36.8,0.340,0
765,5,121,72,23,112,26.2,0.245,0
766,1,126,60,0,0,30.1,0.349,1


In [7]:
data_vectors = data_vectors.values
# FAISS expects a C-contiguous float32 array
data_vectors = np.ascontiguousarray(data_vectors, dtype=np.float32)

In [8]:
data_labels = dataset['Age']
data_labels.head()

0    50
1    31
2    32
3    21
4    33
Name: Age, dtype: int64

In [9]:
data_labels = data_labels.values

**Locality Sensitive Hashing**

In [10]:
class LSHIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors
        self.labels = labels

    def build(self, num_bits=10):
        # Each vector is hashed to a binary code of num_bits bits
        self.index = faiss.IndexLSH(self.dimension, num_bits)
        self.index.add(self.vectors)

    def query(self, vectors, k=10):
        # search() returns one row of neighbours per query vector;
        # only the first query's neighbours are mapped to labels
        distances, indices = self.index.search(vectors, k)
        return [self.labels[i] for i in indices[0]]

In [11]:
lsh_index = LSHIndex(data_vectors, data_labels)
lsh_index.build()

In [12]:
lsh_index.query(np.array([data_vectors[282]]))

[21, 30, 31, 27, 51, 40, 25, 23, 22, 22]

In [13]:
lsh_index.query(np.array([data_vectors[7]]))

[24, 21, 25, 29, 26, 22, 37, 40, 32, 32]
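
The idea behind locality sensitive hashing can be sketched without FAISS. The block below is a minimal illustration, assuming random-hyperplane (sign) hashing and a Hamming-distance ranking; it is not a description of FAISS internals, and `make_hyperplanes`, `hash_vector`, `hamming`, `lsh_query` and the toy vectors are hypothetical names introduced only for this example.

```python
import random

def make_hyperplanes(dimension, num_bits, seed=0):
    """Draw num_bits random hyperplane normals (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dimension)]
            for _ in range(num_bits)]

def hash_vector(vector, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(code_a, code_b):
    """Number of positions where two binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

def lsh_query(query, vectors, hyperplanes, k=3):
    """Rank stored vectors by Hamming distance between their codes and the query's code."""
    query_code = hash_vector(query, hyperplanes)
    codes = [hash_vector(v, hyperplanes) for v in vectors]
    ranked = sorted(range(len(vectors)),
                    key=lambda i: hamming(query_code, codes[i]))
    return ranked[:k]

# Toy data (assumed, not taken from the diabetes dataset)
planes = make_hyperplanes(dimension=3, num_bits=10)
toy_vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
print(lsh_query([1.0, 2.0, 3.0], toy_vectors, planes, k=2))
```

The first two toy vectors point in the same direction, so they receive the same code and rank ahead of the third. More bits give finer codes at the cost of more memory per vector.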

**Exhaustive Search**

In [14]:
class ExhaustiveIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors
        self.labels = labels

    def build(self):
        # IndexFlatL2 stores the raw vectors and compares a query against every one of them
        self.index = faiss.IndexFlatL2(self.dimension)
        self.index.add(self.vectors)

    def query(self, vectors, k=10):
        distances, indices = self.index.search(vectors, k)
        return [self.labels[i] for i in indices[0]]

In [15]:
exact_index = ExhaustiveIndex(data_vectors, data_labels)
exact_index.build()

In [16]:
exact_index.query(
  np.array([data_vectors[260]])
)

[34, 31, 55, 24, 33, 36, 47, 52, 28, 24]
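
For intuition, exhaustive search amounts to computing the distance from the query to every stored vector and keeping the k smallest. The following is a minimal pure-Python sketch of that idea using squared L2 distance; `squared_l2`, `exhaustive_query` and the toy data are hypothetical and are not part of the notebook.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exhaustive_query(query, vectors, labels, k=3):
    """Compare the query with every vector and return the labels of the k closest."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: squared_l2(query, vectors[i]))
    return [labels[i] for i in ranked[:k]]

# Toy data (assumed): four 2-D points with made-up age labels
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]]
toy_labels = [21, 30, 55, 42]

nearest = exhaustive_query([0.9, 0.1], toy_vectors, toy_labels, k=2)
print(nearest)  # [30, 21]
```

The cost grows linearly with the number of stored vectors, which is the reason the approximate indexes in the other sections exist.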

**Product Quantization**

In [37]:
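class ProductQuantizationIndex():
    def __init__(self, vectors, Age):
        self.dimension = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.Age = Age

    def build(self, number_of_partition=1, number_of_subquantizers=1, bits_per_code=2):
        # IndexIVFPQ takes (quantizer, d, nlist, M, nbits): nlist coarse partitions,
        # M sub-vectors per vector (d must be divisible by M), nbits bits per sub-vector code
        quantizer = faiss.IndexFlatL2(self.dimension)
        self.index = faiss.IndexIVFPQ(quantizer,
                                      self.dimension,
                                      number_of_partition,
                                      number_of_subquantizers,
                                      bits_per_code)
        # The coarse and product quantizers must be trained before vectors are added
        self.index.train(self.vectors)
        self.index.add(self.vectors)

    def query(self, vectors, k=10):
        # The number of partitions probed per query is index.nprobe (left at its default here)
        distances, indices = self.index.search(vectors, k)
        return [self.Age[i] for i in indices[0]]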

In [38]:
product_quantization_index = ProductQuantizationIndex(data_vectors, data_labels)
product_quantization_index.build()


In [41]:
product_quantization_index.query(np.array([data_vectors[117]]))

[29, 54, 34, 31, 30, 30, 32, 57, 32, 50]
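
Product quantization compresses a vector by splitting it into sub-vectors and replacing each one with the id of its nearest centroid in a per-subspace codebook. The sketch below shows only the encode and decode steps, assuming the codebooks are already given; in practice they are learned (typically with k-means) during training. `pq_encode`, `pq_decode` and `toy_codebooks` are hypothetical names with hand-picked values.

```python
def nearest_centroid(subvector, centroids):
    """Index of the centroid closest to subvector (squared L2)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(subvector, centroids[c])),
    )

def pq_encode(vector, codebooks):
    """Split vector into len(codebooks) equal parts and store one centroid id per part."""
    sub_len = len(vector) // len(codebooks)
    return [
        nearest_centroid(vector[m * sub_len:(m + 1) * sub_len], centroids)
        for m, centroids in enumerate(codebooks)
    ]

def pq_decode(code, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    approx = []
    for m, centroid_id in enumerate(code):
        approx.extend(codebooks[m][centroid_id])
    return approx

# Hand-picked codebooks (assumed): 2 subspaces, 2 centroids each, i.e. 1 bit per sub-vector
toy_codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[1.0, 1.0], [5.0, 5.0]],
]

code = pq_encode([9.0, 8.0, 1.5, 0.5], toy_codebooks)
approx = pq_decode(code, toy_codebooks)
print(code, approx)  # [1, 0] [10.0, 10.0, 1.0, 1.0]
```

The four-float vector is stored as two small integers, and distances are then computed against the reconstructed approximation, which is where the loss in accuracy comes from.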

**Trees and Graphs**

In [24]:
!pip install annoy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting annoy
  Downloading annoy-1.17.1.tar.gz (647 kB)
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.1-cp37-cp37m-linux_x86_64.whl size=395180 sha256=f8b08a617a20650465f57b9be538077e968b61ea21630e53d9101dfc2ada2944
  Stored in directory: /root/.cache/pip/wheels/81/94/bf/92cb0e4fef8770fe9c6df0ba588fca30ab7c306b6048ae8a54
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.1


In [25]:
import annoy

In [26]:
class AnnoyIndex():
    def __init__(self, vectors, Age):
        self.dimension = vectors.shape[1]
        self.vectors = vectors
        self.Age = Age

    def build(self, number_of_trees=5):
        # 'angular' is passed explicitly; it is the metric Annoy falls back to when none is given
        self.index = annoy.AnnoyIndex(self.dimension, 'angular')
        for i, vec in enumerate(self.vectors):
            self.index.add_item(i, vec.tolist())
        self.index.build(number_of_trees)

    def query(self, vector, k=10):
        indices = self.index.get_nns_by_vector(vector.tolist(), k, search_k=7)
        # Map neighbour row ids back to their Age labels
        return [self.Age[i] for i in indices]

In [27]:
annoy_index = AnnoyIndex(data_vectors, data_labels)
annoy_index.build()

  


**Hierarchical Navigable Small World**

In [28]:
!pip install nmslib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nmslib
  Downloading nmslib-2.1.1-cp37-cp37m-manylinux2010_x86_64.whl (13.5 MB)
Collecting pybind11<2.6.2
  Downloading pybind11-2.6.1-py2.py3-none-any.whl (188 kB)
Installing collected packages: pybind11, nmslib
Successfully installed nmslib-2.1.1 pybind11-2.6.1


In [29]:
import nmslib

In [34]:
class NMSLIBIndex():
    def __init__(self, vectors, Age):
        self.dimension = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.Age = Age

    def build(self):
        self.index = nmslib.init(method='hnsw', space='cosinesimil')
        self.index.addDataPointBatch(self.vectors)
        self.index.createIndex({'post': 2})

    def query(self, vector, k=10):
        # knnQuery returns a pair of arrays: neighbour ids and their distances
        indices, distances = self.index.knnQuery(vector, k=k)
        return [self.Age[i] for i in indices]

In [35]:
hnsw_index = NMSLIBIndex(data_vectors, data_labels)
hnsw_index.build()

In [36]:
hnsw_index.query(data_vectors[1])

[31, 23, 24, 29, 41, 38, 32, 25, 43, 26]
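
The queries above return ages and each index is queried with a different row, so the result lists cannot be compared directly. One common way to compare an approximate index with the exhaustive one is recall@k: query both with the same vector and measure how many of the exact neighbour row ids the approximate index recovered. The helper below is a hypothetical sketch of that measure and is not part of the notebook; the id lists are made up.

```python
def recall_at_k(approximate_ids, exact_ids):
    """Fraction of the exact neighbour ids that also appear in the approximate result."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact.intersection(approximate_ids)) / len(exact)

# Made-up neighbour ids for one query (assumed values, for illustration only)
exact_ids = [12, 40, 7, 301]
approximate_ids = [12, 7, 88, 519]

print(recall_at_k(approximate_ids, exact_ids))  # 0.5
```

Using it with the classes in this notebook would require their `query` methods to return row ids rather than ages.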