# Question Answering using spaCy and Nmslib

This notebook introduces a very simple question answering example. It embeds questions as vectors, uses vector similarity search to find the stored question most similar to the user's input, and returns the answer stored for that question.

- spaCy is a popular natural language processing library that is easy to use and understand. It provides text embeddings by averaging word embeddings (such as word2vec or GloVe vectors). See more [here](https://spacy.io/) and [here](https://spacy.io/usage/spacy-101), and also in their [course](https://course.spacy.io/en/).

- nmslib is a fast vector search engine, similar to faiss. It uses approximate nearest neighbour search, which can search millions of records in a second or less. See more [here](https://github.com/nmslib/nmslib).
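
To make the two ideas above concrete, here is a minimal pure-Python sketch of averaging word vectors into a text vector and comparing two texts by cosine distance. The `TOY_VECTORS` table and the helper names `text_vector` and `cosine_distance` are made up for illustration; real spaCy vectors have hundreds of dimensions, and this sketch simply skips unknown words, which is a simplification and not necessarily how spaCy treats them.

```python
import math

# Toy 3-dimensional word vectors (made-up values, for illustration only).
TOY_VECTORS = {
    "how": [0.1, 0.3, 0.0],
    "many": [0.2, 0.1, 0.4],
    "rings": [0.9, 0.0, 0.1],
    "olympic": [0.8, 0.2, 0.1],
    "flag": [0.7, 0.1, 0.3],
}


def text_vector(text, vectors=TOY_VECTORS):
    """Average the vectors of the known words in `text`."""
    dim = len(next(iter(vectors.values())))
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        # No known words: fall back to a zero vector.
        return [0.0] * dim
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity (0 means the vectors point the same way)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        # Cosine similarity is undefined for a zero vector; treat it as far away.
        return 1.0
    return 1.0 - dot / norm


if __name__ == "__main__":
    same_words = cosine_distance(text_vector("how many rings"),
                                 text_vector("rings how many"))
    other_words = cosine_distance(text_vector("how many rings"),
                                  text_vector("olympic flag"))
    print(same_words, other_words)
```

Because the text vector is an average, word order does not affect it, so the first distance printed is approximately zero while the second is larger.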

# Imports

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [2]:
!pip install spacy
!pip install nmslib
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.0/en_core_web_lg-2.3.0.tar.gz (782.7MB)
     |████████████████████████████████| 782.7MB 4.8MB/s eta 0:00:01
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.3.0-cp37-none-any.whl size=782931392 sha256=418c695d360df504742be85ba517bb715e7ee8e9909e69399c1c9b7b92b7df50
  Stored in directory: /private/var/folders/t2/72xb54sx38d48l3rgd_9qy4m0000gn/T/pip-ephem-wheel-cache-27u9pu5r/wheels/75/27/ff/04e56916ef31a537960f4f4f857b224edd4cd997227c27686c
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.3.0
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [3]:
import pandas as pd
import os
import json
import re
import spacy
import nmslib

# Read the QA data from a CSV file

In [5]:
!wget https://raw.githubusercontent.com/AnzorGozalishvili/QA_using_spacy_and_nmslib/master/data/sample_qa_data.csv

--2020-06-26 12:57:09--  https://raw.githubusercontent.com/AnzorGozalishvili/QA_using_spacy_and_nmslib/master/data/sample_qa_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7198 (7.0K) [text/plain]
Saving to: ‘sample_qa_data.csv’


2020-06-26 12:57:09 (23.5 MB/s) - ‘sample_qa_data.csv’ saved [7198/7198]



In [6]:
data = pd.read_csv('sample_qa_data.csv', index_col=0)

In [7]:
!rm -rf sample_qa_data.csv

In [8]:
data.head(2)

Unnamed: 0,question,answer
0,Carl and the Passions changed band name to what,Beach Boys
1,How many rings on the Olympic flag,Five


# Create QA model class

In [9]:
class QA(object):
    def __init__(self, data):
        self.nlp = spacy.load('en_core_web_lg')
        self.questions = data.question.tolist()
        self.answers = data.answer.tolist()
        self.index = None  # built by build_nmslib_index()

    def to_vectors(self, texts):
        """Convert texts into their document vectors."""
        return [self.nlp(text).vector for text in texts]

    def build_nmslib_index(self):
        """Build an nmslib HNSW index over the vectors of the question texts."""
        self.index = nmslib.init(method='hnsw', space='cosinesimil')
        self.index.addDataPointBatch(self.to_vectors(self.questions))
        self.index.createIndex({'post': 2}, print_progress=True)

    def search(self, text, max_distance=0.2):
        """
        K-nearest-neighbour search over the indexed questions, keeping the
        closest match only if it is within the distance threshold.

        Args:
            text: (str) sample question text
            max_distance: (float) maximum allowed cosine distance for a match

        Returns:
            result: (dict) 'index' and 'distance' of the best match,
                or an empty dict if no neighbour is close enough
        """
        result = {}
        vector = self.nlp(text).vector

        # knnQuery returns two NumPy arrays: neighbour ids and their distances
        ids, distances = self.index.knnQuery(vector)

        # Guard against an empty result before calling distances.min()
        if len(ids) > 0:
            best_indices_mask = (distances == distances.min()) & (distances < max_distance)
            if best_indices_mask.sum() != 0:
                result = {'index': ids[best_indices_mask][0], 'distance': distances[best_indices_mask][0]}

        return result

    def query(self, question, max_distance=0.2):
        """Return the answer of the closest stored question, or "N/A" if none is close enough."""
        search_result = self.search(question, max_distance)
        index = search_result.get('index', -1)
        result = "N/A"
        if index != -1:
            result = self.answers[index]

        return result

In [10]:
qa = QA(data)
qa.build_nmslib_index()

In [11]:
qa.query('Carl and the Passions day changed band name to what', max_distance=0.05)

'Beach Boys'
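
The effect of `max_distance` can be illustrated without spaCy or nmslib. The sketch below is a brute-force stand-in for the approximate index, not the notebook's actual implementation: it scans every stored vector, picks the closest one by cosine distance, and returns "N/A" when even the closest one is not under the threshold. The function name `nearest_answer` and the 2-dimensional vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # undefined for zero vectors; treat as far away
    return 1.0 - dot / norm


def nearest_answer(query_vec, question_vecs, answers, max_distance=0.2):
    """Exact (brute-force) nearest-neighbour lookup with a distance threshold."""
    best_index, best_distance = -1, float("inf")
    for i, vec in enumerate(question_vecs):
        distance = cosine_distance(query_vec, vec)
        if distance < best_distance:
            best_index, best_distance = i, distance
    # Reject the match when nothing was found or the closest item is too far.
    if best_index == -1 or best_distance >= max_distance:
        return "N/A"
    return answers[best_index]


if __name__ == "__main__":
    # Made-up 2-dimensional "question vectors" and their answers.
    question_vecs = [[1.0, 0.0], [0.0, 1.0]]
    answers = ["Beach Boys", "Five"]
    print(nearest_answer([0.9, 0.1], question_vecs, answers))  # close to the first vector
    print(nearest_answer([1.0, 1.0], question_vecs, answers))  # equally far from both
```

A smaller `max_distance` makes the lookup stricter: queries that are only loosely related to any stored question fall back to "N/A" instead of returning a poor match.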

In [13]:
preds = data.question.apply(lambda x: qa.query(x))
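
One way to check these predictions is to compare them with the stored answers. The helper below is a hypothetical sketch, not part of the original notebook: `exact_match_rate` is an assumed name, and it works on plain lists so it can be tried without the model. Since the queries here are the indexed questions themselves, a high score would only be a sanity check of the index, not a measure of how well the approach handles new questions.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions equal to the reference answer (case-insensitive)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        str(pred).strip().lower() == str(ref).strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


if __name__ == "__main__":
    # Toy example; with the notebook's objects one could instead call
    # exact_match_rate(preds.tolist(), data.answer.tolist()).
    print(exact_match_rate(["Beach Boys", "N/A"], ["Beach Boys", "Five"]))
```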