<a href="https://colab.research.google.com/github/Sravani-05/Assignment5/blob/main/for_nearest_neighbor_algorithms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using Various Nearest Neighbor Algorithms** 

**About this dataset**

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease) need early detection and management, and a machine learning model can be of great help here.

In [51]:
import pandas as pds

In [3]:
# reading the CSV file
file = '/content/heart_failure_clinical_records_dataset.csv'
dataset = pds.read_csv(file)
  
# displaying the contents of the CSV file
dataset.head()

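| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |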


**Importing faiss and pickle**

In [4]:
!pip install faiss


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss
  Downloading faiss-1.5.3-cp37-cp37m-manylinux1_x86_64.whl (4.7 MB)
Installing collected packages: faiss
Successfully installed faiss-1.5.3


In [5]:
!sudo apt-get install libomp-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  libomp5
Suggested packages:
  libomp-doc
The following NEW packages will be installed:
  libomp-dev libomp5
0 upgraded, 2 newly installed, 0 to remove and 5 not upgraded.
Need to get 239 kB of archives.
After this operation, 804 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp5 amd64 5.0.1-1 [234 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp-dev amd64 5.0.1-1 [5,088 B]
Fetched 239 kB in 1s (281 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 2.)

In [6]:
import pickle
import faiss
import numpy as np

**Splitting dataset into Vectors and Time**

In [8]:
data_vectors = dataset.drop(['time'], axis = 1)
data_vectors

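| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | 0 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | 1 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | 0 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 62.0 | 0 | 61 | 1 | 38 | 1 | 155000.00 | 1.1 | 143 | 1 | 1 | 0 |
| 295 | 55.0 | 0 | 1820 | 0 | 38 | 0 | 270000.00 | 1.2 | 139 | 0 | 0 | 0 |
| 296 | 45.0 | 0 | 2060 | 1 | 60 | 0 | 742000.00 | 0.8 | 138 | 0 | 0 | 0 |
| 297 | 45.0 | 0 | 2413 | 0 | 38 | 0 | 140000.00 | 1.4 | 140 | 1 | 1 | 0 |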


In [9]:
data_vectors = data_vectors.values
data_vectors = np.ascontiguousarray(data_vectors, dtype=np.float32)

In [10]:
data_labels = dataset['time']
data_labels.head()

0    4
1    6
2    7
3    7
4    8
Name: time, dtype: int64

In [11]:
data_labels = data_labels.values

**Locality Sensitive Hashing**

In [12]:
class LSHIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors
        self.labels = labels    
   
    def build(self, num_bits=10):
        self.index = faiss.IndexLSH(self.dimension, num_bits)
        self.index.add(self.vectors)
        
    def query(self, vectors, k=10):
        distances, indices = self.index.search(vectors, k) 
        return [self.labels[i] for i in indices[0]]

In [13]:
lsh_index = LSHIndex(data_vectors, data_labels)
lsh_index.build()

In [15]:
lsh_index.query(np.array([data_vectors[282]]))

[7, 8, 10, 6, 8, 10, 10, 10, 7, 4]

In [17]:
lsh_index.query(np.array([data_vectors[7]]))

[7, 8, 10, 6, 8, 10, 10, 10, 7, 4]
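
To illustrate the idea behind locality sensitive hashing, here is a minimal sketch in plain Python, assuming sign-based random hyperplane hashing. It is not FAISS's implementation of `IndexLSH`; the names `make_hyperplanes`, `hash_vector`, `hamming` and `lsh_query` and the toy data are hypothetical and only serve the illustration.

```python
import random

def make_hyperplanes(dimension, num_bits, seed=0):
    """Draw one random hyperplane (normal vector) per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dimension)] for _ in range(num_bits)]

def hash_vector(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(v * h for v, h in zip(vector, plane)) >= 0.0)
        for plane in hyperplanes
    )

def hamming(code_a, code_b):
    """Number of positions where two binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

def lsh_query(query, vectors, hyperplanes, k=3):
    """Rank stored vectors by Hamming distance between their code and the query's code."""
    query_code = hash_vector(query, hyperplanes)
    codes = [hash_vector(v, hyperplanes) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: hamming(codes[i], query_code))
    return ranked[:k]

# Toy data (assumption, not taken from the heart failure dataset)
toy = [[1.0, 0.2], [0.9, 0.3], [-1.0, 0.1], [-0.8, -0.4], [0.1, 1.0]]
planes = make_hyperplanes(dimension=2, num_bits=8)
print(lsh_query(toy[0], toy, planes, k=2))
```

In this sketch, vectors pointing in nearly the same direction receive nearly the same code regardless of their magnitude, so unscaled features with very different ranges can make many rows hard to tell apart.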

**Exhaustive Search**

In [23]:
class ExhaustiveIndex():
    def __init__(self, vectors, time):
        self.dimension = vectors.shape[1]
        self.vectors = vectors
        self.time = time   
   
    def build(self):
        self.index = faiss.IndexFlatL2(self.dimension)
        self.index.add(self.vectors)
        
    def query(self, vectors, k=10):
        distances, indices = self.index.search(vectors, k) 
        return [self.time[i] for i in indices[0]]

In [25]:
exact_index = ExhaustiveIndex(data_vectors, data_labels)
exact_index.build()

In [26]:
exact_index.query(
  np.array([data_vectors[260]])
)

[233, 94, 195, 73, 8, 172, 26, 237, 235, 77]
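
For comparison, exhaustive search can be sketched in a few lines of plain Python: compute the distance from the query to every stored vector and keep the `k` smallest. This is an illustrative sketch with hypothetical names (`squared_l2`, `exhaustive_search`) and toy data, not how FAISS implements `IndexFlatL2` internally. Like `IndexFlatL2`, it ranks by squared L2 distance.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exhaustive_search(query, vectors, k=3):
    """Return (distance, index) pairs for the k closest vectors, nearest first."""
    scored = ((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

# Toy data (assumption, not taken from the heart failure dataset)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(exhaustive_search([0.9, 0.1], toy_vectors, k=2))
```

Every query touches every stored vector, so the cost grows linearly with the dataset size. The result is exact, which makes it a useful baseline for the approximate indexes.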

**Product Quantization**

In [27]:
class ProductQuantizationIndex():
    def __init__(self, vectors, time):
        self.dimension = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.time = time
    
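    def build(self, number_of_partition=1, num_subquantizers=1, bits_per_code=2):
        # Positional arguments of IndexIVFPQ: coarse quantizer, dimension, nlist, M, nbits.
        # M (num_subquantizers) must divide the dimension evenly; each sub-vector is
        # encoded with bits_per_code bits. The number of partitions searched at query
        # time is set separately through self.index.nprobe.
        quantizer = faiss.IndexFlatL2(self.dimension)
        self.index = faiss.IndexIVFPQ(quantizer, 
                                      self.dimension, 
                                      number_of_partition, 
                                      num_subquantizers, 
                                      bits_per_code)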
        self.index.train(self.vectors)
        self.index.add(self.vectors)
        
    def query(self, vectors, k=10):
        distances, indices = self.index.search(vectors, k) 
        return [self.time[i] for i in indices[0]]

In [29]:
product_quantization_index = ProductQuantizationIndex(data_vectors, data_labels)
product_quantization_index.build()


In [30]:
product_quantization_index.query(np.array([data_vectors[117]]))

[94, 192, 250, 88, 205, 212, 278, 88, 67, 10]
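
The core of product quantization can be sketched without FAISS: split each vector into sub-vectors, learn a small codebook per sub-space, and store only the index of the nearest centroid for each sub-vector. The sketch below is a simplified illustration under stated assumptions. It uses a tiny k-means with a deterministic initialisation, omits the coarse partitioning that `IndexIVFPQ` adds on top, and all function names and the toy data are hypothetical.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, num_centroids, iterations=10):
    """Tiny k-means used only to learn one codebook per sub-space."""
    # Assumption: evenly spaced points as initial centroids keep the sketch deterministic.
    step = len(points) // num_centroids
    centroids = [list(points[i * step]) for i in range(num_centroids)]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: squared_l2(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(vector, num_subvectors):
    """Cut a vector into num_subvectors consecutive pieces of equal size."""
    size = len(vector) // num_subvectors
    return [vector[i * size:(i + 1) * size] for i in range(num_subvectors)]

def train_codebooks(vectors, num_subvectors, num_centroids):
    """Learn one codebook (list of centroids) per sub-space."""
    columns = zip(*(split(v, num_subvectors) for v in vectors))
    return [kmeans(list(sub), num_centroids) for sub in columns]

def encode(vector, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    parts = split(vector, len(codebooks))
    return [min(range(len(book)), key=lambda c: squared_l2(part, book[c]))
            for part, book in zip(parts, codebooks)]

def decode(code, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    return [x for c, book in zip(code, codebooks) for x in book[c]]

# Toy data (assumption): two obvious groups in each half of the vector
toy = [[0.0, 0.1, 5.0, 5.1], [0.1, 0.0, 5.1, 5.0],
       [4.0, 4.1, 0.0, 0.2], [4.1, 4.0, 0.1, 0.0]]
codebooks = train_codebooks(toy, num_subvectors=2, num_centroids=2)
codes = [encode(v, codebooks) for v in toy]
print(codes)
```

Each vector is stored as a short list of centroid indices instead of its full floating-point values, which trades accuracy for memory.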

**Trees and Graphs**

In [31]:
!pip install annoy


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting annoy
  Downloading annoy-1.17.1.tar.gz (647 kB)
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.1-cp37-cp37m-linux_x86_64.whl size=395183 sha256=852745c10e3c76f0d3a698fc5fe82219a6fdb808f379b41abded32a77ff68446
  Stored in directory: /root/.cache/pip/wheels/81/94/bf/92cb0e4fef8770fe9c6df0ba588fca30ab7c306b6048ae8a54
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.1


In [32]:
import annoy

In [38]:
class AnnoyIndex():
    def __init__(self, vectors, time):
        self.dimension = vectors.shape[1]
        self.vectors = vectors
        self.time = time    
   
    def build(self, number_of_trees=5):
        # Pass the metric explicitly; 'angular' is what Annoy falls back to when it is omitted
        self.index = annoy.AnnoyIndex(self.dimension, 'angular')
        for i, vec in enumerate(self.vectors):
            self.index.add_item(i, vec.tolist())
        self.index.build(number_of_trees)
        
    def query(self, vector, k=10):
        indices = self.index.get_nns_by_vector(vector.tolist(), k, search_k=7)                                           
        return [self.time[i] for i in indices]

In [37]:
annoy_index = AnnoyIndex(data_vectors, data_labels)
annoy_index.build()
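
The tree idea can be sketched in plain Python: pick two random points, split the data with the hyperplane equidistant from them, and recurse until the leaves are small. This is a simplified single-tree sketch with hypothetical names (`build_tree`, `find_leaf`) and toy data. Annoy itself builds a forest of such trees and searches several branches, which this sketch does not attempt.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(points, indices, rng, leaf_size=2):
    """Recursively split `indices` with random hyperplanes until leaves are small."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    # Hyperplane equidistant from two randomly sampled points
    a, b = (points[i] for i in rng.sample(indices, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = dot(normal, [(x + y) / 2 for x, y in zip(a, b)])
    left = [i for i in indices if dot(normal, points[i]) < offset]
    right = [i for i in indices if dot(normal, points[i]) >= offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": indices}
    return {"normal": normal, "offset": offset,
            "left": build_tree(points, left, rng, leaf_size),
            "right": build_tree(points, right, rng, leaf_size)}

def find_leaf(tree, query):
    """Walk down the tree and return the candidate indices in the query's leaf."""
    while "leaf" not in tree:
        side = "left" if dot(tree["normal"], query) < tree["offset"] else "right"
        tree = tree[side]
    return tree["leaf"]

# Toy data (assumption, not taken from the heart failure dataset)
points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0], [9.1, 0.3]]
tree = build_tree(points, list(range(len(points))), random.Random(0))
print(find_leaf(tree, [5.1, 5.0]))
```

A single tree can miss true neighbours that fall just across a split, which is why building more trees and inspecting more nodes at query time generally improves accuracy at the cost of speed.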

  


**Hierarchical Navigable Small World**

In [43]:
!pip install nmslib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nmslib
  Downloading nmslib-2.1.1-cp37-cp37m-manylinux2010_x86_64.whl (13.5 MB)
Collecting pybind11<2.6.2
  Downloading pybind11-2.6.1-py2.py3-none-any.whl (188 kB)
Installing collected packages: pybind11, nmslib
Successfully installed nmslib-2.1.1 pybind11-2.6.1


In [44]:
import nmslib

In [45]:
class NMSLIBIndex():
    def __init__(self, vectors, time):
        self.dimension = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.time = time

    def build(self):
        self.index = nmslib.init(method='hnsw', space='cosinesimil')
        self.index.addDataPointBatch(self.vectors)
        self.index.createIndex({'post': 2})
        
    def query(self, vector, k=10):
        # knnQuery returns a tuple of (ids, distances)
        ids, distances = self.index.knnQuery(vector, k=k)
        return [self.time[i] for i in ids]

In [48]:
hnsw_index = NMSLIBIndex(data_vectors, data_labels)
hnsw_index.build()

In [49]:
hnsw_index.query(data_vectors[1])

[6, 126, 72, 186, 87, 246, 206, 60, 107, 280]
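
One common way to compare an approximate index against the exhaustive one is recall@k: the fraction of the true nearest neighbours that the approximate search also returns. The helper below is a minimal sketch with a hypothetical name (`recall_at_k`). It expects row ids, whereas the `query` methods in this notebook return `time` values, so using it here would assume the raw indices are exposed as well.

```python
def recall_at_k(approximate_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    exact = set(exact_ids)
    return len(set(approximate_ids) & exact) / len(exact)

# Made-up ids for illustration: 3 of the 4 exact neighbours were found
print(recall_at_k([3, 7, 9, 12], [3, 9, 12, 40]))  # 0.75
```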