[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/question-answering/question-answering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/question-answering/question-answering.ipynb)

# Question Answering with Similarity Search

This notebook demonstrates how Pinecone's similarity search as a service helps you build a question answering application. We will index a set of questions and retrieve the most similar stored questions for a new (unseen) question. That way, we can link a new question to answers we might already have.

You can build a question answering application with Pinecone in three steps:
- Represent questions as vector embeddings so that semantically similar questions are in close proximity within the same vector space.
- Index the vectors using Pinecone.
- Given a new question, query the index to fetch similar questions. The answers already associated with those stored questions can then be reused for the new one.

In this notebook we will index a set of questions and retrieve similar questions for a new, unseen question.
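
To make the idea of "close proximity" concrete, the sketch below compares hand-made three-dimensional vectors using cosine similarity. The vectors, the example questions in the comments, and the helper name `cosine_similarity` are invented for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any real model).
q_money_1 = [0.9, 0.1, 0.0]  # e.g. "How do I make money online?"
q_money_2 = [0.8, 0.2, 0.1]  # e.g. "How can I earn money on the internet?"
q_baking = [0.0, 0.2, 0.9]   # e.g. "How do I bake bread?"

sim_related = cosine_similarity(q_money_1, q_money_2)
sim_unrelated = cosine_similarity(q_money_1, q_baking)
print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores close to 0, which is the property the similarity search relies on.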



## Dependencies

### My environment

```bash
conda create -n Airify python=3.9
conda activate Airify
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install pinecone-client sentence-transformers matplotlib ipywidgets
```

In [2]:
import pandas as pd
import numpy as np

%matplotlib inline

# PART 1: Indexing Questions to Pinecone

We will start by indexing a set of questions to Pinecone. We will use the `sentence-transformers` library to convert the questions into vector embeddings. We will then index these embeddings using Pinecone.

## Pinecone Installation and Setup

In [3]:
from pinecone import Pinecone, ServerlessSpec
import os

# load the Pinecone API key from the environment; never hard-code or print a real key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"

pinecone = Pinecone(api_key=api_key)
spec = ServerlessSpec(cloud="aws", region="us-west-2")
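
Hard-coding or printing an API key makes it easy to leak through saved notebook output. As one possible alternative, the sketch below reads the key from the `PINECONE_API_KEY` environment variable, fails loudly when it is missing, and displays only a masked form. `load_api_key` and `mask_secret` are hypothetical helper names, not part of the Pinecone client.

```python
import os


def load_api_key(var_name="PINECONE_API_KEY"):
    """Return the key stored in the given environment variable, or raise."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


def mask_secret(secret, visible=4):
    """Hide all but the last `visible` characters of a secret."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Demonstration with a dummy value (not a real key).
print(mask_secret("example-key-1234"))
```

In the notebook, `api_key = load_api_key()` could then be passed to `Pinecone(api_key=api_key)`.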


## Create a New Pinecone Index

In [4]:
# pick a name for the new index
index_name = "question-answering"

**Create index**

In [5]:
# delete any existing index with the same name, then create a fresh one
# (dimension=300 matches the 300-dimensional GloVe embeddings used below)
if index_name in pinecone.list_indexes().names():
    pinecone.delete_index(index_name)
pinecone.create_index(name=index_name, dimension=300, spec=spec)

**Connect to the index**

The index object, an instance of `pinecone.Index`, will be reused for optimal performance.

In [6]:
host = pinecone.describe_index(name=index_name).host
index = pinecone.Index(host=host)

## Uploading Questions

The dataset used in this notebook is the [Quora Question Pairs Dataset](https://www.kaggle.com/c/quora-question-pairs).

Let's download the dataset and load the data.

In [55]:
# download dataset from the url
import requests

DATA_DIR = "tmp"
DATA_FILE = f"{DATA_DIR}/quora_duplicate_questions.tsv"
DATA_URL = "https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"


def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(DATA_FILE):
        r = requests.get(DATA_URL, timeout=60)  # fetch the dataset over HTTP
        r.raise_for_status()  # fail early instead of saving an error page
        with open(DATA_FILE, "wb") as f:
            f.write(r.content)


download_data()
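
Because `download_data` skips the download whenever the file already exists, an interrupted run can leave a partial file that is never refreshed. The sketch below is one way to guard against that: it streams the response to a temporary `.part` file and renames it only after the copy completes. It uses the standard library's `urllib` instead of `requests`, and `download_file` is a hypothetical helper name. The demo copies from a local `file://` URL so that it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(url, dest, timeout=30.0):
    """Stream `url` to `dest`, skipping the download if `dest` already exists."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        return dest
    partial = dest.with_name(dest.name + ".part")
    # urlopen raises an exception on HTTP error statuses such as 404.
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(partial, "wb") as f:
        shutil.copyfileobj(resp, f)  # copy in chunks rather than all at once
    partial.replace(dest)  # rename only after the full file has been written
    return dest


# Demo: "download" a small local file through a file:// URL.
with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "source.tsv"
    source.write_bytes(b"qid1\tquestion1\n1\tWho is Redd Foxx?\n")
    target = download_file(source.as_uri(), Path(tmp_dir) / "data" / "copy.tsv")
    copied = target.read_bytes()
    leftover_partial = target.with_name(target.name + ".part").exists()

print(copied.decode())
```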

In [56]:
pd.set_option("display.max_colwidth", 500)

df = pd.read_csv(
    f"{DATA_FILE}", sep="\t", usecols=["qid1", "question1"], index_col=False
)
df = df.sample(frac=1).reset_index(drop=True)
df.drop_duplicates(inplace=True)
df['qid1'] = df['qid1'].apply(str)
df['question1'] = df['question1'].apply(str)
print(df.head())

     qid1  \
0  401595   
1   13728   
2   67930   
3  223923   
4  316481   

                                                                  question1  
0  Which day of the week has the hardest crossword puzzle in the newspaper?  
1                                                         Who is Redd Foxx?  
2   What is the difference between a turbo charger and a turbo intercooler?  
3                            How do I get my contacts back after rebooting?  
4                              What are some cultural faux pas at LinkedIn?  


### Define the model: Generate embeddings for questions

We will use the [Average Word Embeddings Model](https://nlp.stanford.edu/projects/glove/) for this example. This model is fast to compute but produces relatively low-quality embeddings. For higher-quality embeddings, consider other sentence embedding models such as the [Sentence Embeddings Models trained on Paraphrases](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1).
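
An average word embeddings model represents a sentence as the mean of its word vectors, which is why it is fast but insensitive to word order. The sketch below illustrates the idea with a hypothetical two-dimensional vocabulary; `toy_vectors` and `embed` are illustrative names and the tokenization is deliberately naive. It assumes that words missing from the vocabulary are skipped, so a sentence with no known words ends up as an all-zero vector.

```python
# Hypothetical 2-dimensional word vectors (real GloVe vectors have 300 dimensions).
toy_vectors = {
    "best": [1.0, 0.0],
    "way": [0.0, 1.0],
    "money": [1.0, 1.0],
}
DIM = 2


def embed(sentence):
    """Average the vectors of all known words; unknown words are skipped."""
    tokens = sentence.lower().replace("?", "").split()
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return [0.0] * DIM  # nothing to average
    return [sum(component) / len(known) for component in zip(*known)]


print(embed("Best way money?"))
print(embed("money way best"))  # same words in another order
print(embed("???"))             # no known words
```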

In [8]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the model from huggingface model hub
model = SentenceTransformer("average_word_embeddings_glove.6B.300d", device=device)

### Creating Vector Embeddings

In [59]:
# create embedding for each question
question_vectors = model.encode(list(df.question1), show_progress_bar=True).tolist()


In [60]:
# add question embeddings to dataframe
df["question_vector"] = question_vectors

In [61]:
# find rows whose question vector is all zeros
# (np.any is used because a zero sum alone does not imply an all-zero vector)
zero_rows = df[df.question_vector.apply(lambda x: not np.any(x))].index
df_zero = df.loc[zero_rows].question_vector
df_non_zero = df.drop(zero_rows)
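
Cosine similarity divides by the vector norm, so an all-zero embedding has no defined similarity to anything, which is a reason to keep such rows out of the index. A zero sum is not the same as an all-zero vector, because positive and negative components can cancel. A small sketch of the distinction, using a hypothetical helper `is_zero_vector`:

```python
def is_zero_vector(vec):
    """True only when every component is zero."""
    return not any(vec)


cancelling = [0.5, -0.5, 0.0]       # sums to zero but is a valid embedding
empty_embedding = [0.0, 0.0, 0.0]   # genuinely all zeros

print(sum(cancelling) == 0.0, is_zero_vector(cancelling))
print(sum(empty_embedding) == 0.0, is_zero_vector(empty_embedding))
```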

### Index the Vectors

In [62]:
import itertools

def chunks(iterable, batch_size=1000):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))
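
As a quick check of how `chunks` behaves, the sketch below repeats the helper so that it runs on its own and splits seven items into batches of three. The final batch is shorter, so the number of batches is the ceiling of the item count divided by the batch size.

```python
import itertools
import math


def chunks(iterable, batch_size=1000):
    """Yield successive tuples of at most `batch_size` items."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))


batches = list(chunks(range(7), batch_size=3))
print(batches)  # [(0, 1, 2), (3, 4, 5), (6,)]
print(len(batches) == math.ceil(7 / 3))  # True
```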

In [64]:
batch_size = 1000
total_size = len(df_non_zero)
print("Total size:", total_size)
print("Num of batches:", total_size/batch_size)

done = 0
for i, batch in enumerate(chunks(zip(df_non_zero.qid1, df_non_zero.question_vector), batch_size=batch_size)):
    done += len(batch)
    print("Batch: {}, Done: {}".format(i, done))
    index.upsert(vectors=batch)

Total size: 289952
Num of batches: 289.952
Batch: 0, Done: 1000
Batch: 1, Done: 2000
Batch: 2, Done: 3000
...

# PART 2: Retrieval

Once indexing is done, we can retrieve similar questions for a new question.

In [1]:
import pandas as pd
import numpy as np

%matplotlib inline

In [2]:
from pinecone import Pinecone, ServerlessSpec
import os

# load the Pinecone API key from the environment; never hard-code or print a real key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"

pinecone = Pinecone(api_key=api_key)
spec = ServerlessSpec(cloud="aws", region="us-west-2")


**Connect to the index**

The index object, an instance of `pinecone.Index`, will be reused for optimal performance.

In [3]:
# pick the name for index
index_name = "question-answering"

host = pinecone.describe_index(name=index_name).host
index = pinecone.Index(host=host)

### Define the model: Generate embeddings for questions

We will use the [Average Word Embeddings Model](https://nlp.stanford.edu/projects/glove/) for this example. This model is fast to compute but produces relatively low-quality embeddings. For higher-quality embeddings, consider other sentence embedding models such as the [Sentence Embeddings Models trained on Paraphrases](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1).

In [4]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the model from huggingface model hub
model = SentenceTransformer("average_word_embeddings_glove.6B.300d", device=device)

## Load Dataset

In [5]:
# download dataset from the url
import requests

DATA_DIR = "tmp"
DATA_FILE = f"{DATA_DIR}/quora_duplicate_questions.tsv"
DATA_URL = "https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"


def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(DATA_FILE):
        r = requests.get(DATA_URL, timeout=60)  # fetch the dataset over HTTP
        r.raise_for_status()  # fail early instead of saving an error page
        with open(DATA_FILE, "wb") as f:
            f.write(r.content)


download_data()

In [6]:
pd.set_option("display.max_colwidth", 500)

df = pd.read_csv(
    f"{DATA_FILE}", sep="\t", usecols=["qid1", "question1"], index_col=False
)
df = df.sample(frac=1).reset_index(drop=True)
df.drop_duplicates(inplace=True)
df['qid1'] = df['qid1'].apply(str)
df['question1'] = df['question1'].apply(str)
print(df.head())

     qid1  \
0   94563   
1  119479   
2  479223   
3  286210   
4  408408   

                                                                                           question1  
0  Did you enjoy your school life? Did you ever wonder you would have been better off home schooled?  
1                      What is the best thing to do to start being involved in open source projects?  
2                                 Which is more important for overall well-being, sleep or exercise?  
3                     What are some examples of romantic Korean dramas that are popular in the West?  
4                                                                        Did Quora get hacked today?  


## Search

Once the vectors are indexed, querying the index is straightforward:
- Select the questions you want to query with.
- Use the Average Word Embeddings Model to transform the questions into embeddings.
- Send each question vector to the Pinecone index and retrieve the most similar indexed questions.
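
Conceptually, a query ranks the stored vectors by similarity to the query vector and returns the `top_k` best matches. The brute-force sketch below shows that idea on a hypothetical in-memory store with made-up IDs and vectors. It is not how Pinecone is implemented, and it assumes the index uses the cosine metric.

```python
import heapq
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_matches(store, query, top_k=2):
    """Return (id, score) pairs for the top_k most similar stored vectors."""
    q = normalize(query)
    scored = (
        # The dot product of unit vectors equals their cosine similarity.
        (vec_id, sum(a * b for a, b in zip(normalize(vec), q)))
        for vec_id, vec in store.items()
    )
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


# Hypothetical store mapping question IDs to toy 2-dimensional embeddings.
toy_store = {
    "q1": [1.0, 0.0],
    "q2": [0.9, 0.1],
    "q3": [0.0, 1.0],
}

matches = top_k_matches(toy_store, [1.0, 0.02], top_k=2)
for vec_id, score in matches:
    print(vec_id, round(score, 3))
```

A linear scan like this becomes slow as the collection grows, which is the gap a dedicated vector index is meant to fill.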

In [10]:
# define questions to query the vector index
query_questions = [
    "What is best way to make money online?",
    "How can i build an e-commerce website?"
]

# extract embeddings for the questions
query_vectors = model.encode(query_questions).tolist()

# query pinecone
query_results = [index.query(vector=xq, top_k=5) for xq in query_vectors]

# show the results
for question, res in zip(query_questions, query_results):
    print("\n\n\n Original question : " + str(question))
    print("\n Most similar questions based on pinecone vector search: \n")

    ids = [match.id for match in res.matches]
    scores = [match.score for match in res.matches]
    df_result = pd.DataFrame(
        {
            "id": ids,
            "question": [
                df[df.qid1 == _id].question1.values[0] for _id in ids
            ],
            "score": scores,
        }
    )
    print(df_result)




 Original question : What is best way to make money online?

 Most similar questions based on pinecone vector search: 

       id                                             question     score
0      57               What is best way to make money online?  0.999613
1  297469           What is the best way to make money online?  0.999613
2   55585        What is the best way for making money online?  0.989048
3  157045  What is the best way to make money on the internet?  0.980163
4   28280         What are the best ways to make money online?  0.979453



 Original question : How can i build an e-commerce website?

 Most similar questions based on pinecone vector search: 

       id                                                   question     score
0  119383                   How can I develop an e-commerce website?  0.924485
1    1713                 How would I develop an e-commerce website?  0.924485
2    1714                     How do I create an e-commerce website?  0.919350
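
The display loop filters the whole DataFrame once per returned ID, and `.values[0]` raises an `IndexError` if an ID is not present locally. With many queries, a dictionary built once from the ID and question columns gives constant-time lookups and a safe fallback. A minimal sketch using two rows taken from the output above; `id_to_question` and `matched_ids` are hypothetical names, and `"999"` is a made-up ID standing in for a missing entry.

```python
# Two (qid1, question1) rows; with the DataFrame this could be built as
# dict(zip(df.qid1, df.question1)).
rows = [
    ("57", "What is best way to make money online?"),
    ("1714", "How do I create an e-commerce website?"),
]
id_to_question = dict(rows)

# IDs as they might come back from a query; "999" is deliberately unknown.
matched_ids = ["1714", "57", "999"]
questions = [id_to_question.get(_id, "<not in local data>") for _id in matched_ids]
print(questions)
```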



## Delete the Index

Delete the index once you are sure that you do not want to use it anymore. Once it is deleted, you cannot reuse it.


In [None]:
# pinecone.delete_index(index_name)