# Text Similarity

> This notebook shows how to generate embeddings with a Transformer model, upsert them into a Pinecone index, and query the index for the questions most similar to a given query.

### 1. Import Libraries

In [44]:
import pinecone
import pandas as pd
import random
import itertools

### 2. Connect to Pinecone

In [3]:
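import os


def initialize_pinecone():
    """
    Initialize the Pinecone client. An API key is required.
    The environment used here is us-west1-gcp-free.
    :return: None
    """
    # Assumption: the key is stored in the PINECONE_API_KEY environment
    # variable instead of being hard-coded in the notebook.
    PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
    pinecone.init(api_key=PINECONE_API_KEY, environment="us-west1-gcp-free")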


def connect_pinecone_index(pinecone_index_name):
    """
    Connect to an existing Pinecone index.
    :param pinecone_index_name: name of the index to connect to
    :return: handle to the Pinecone index
    """
    initialize_pinecone()
    pinecone_index = pinecone.Index(index_name=pinecone_index_name)
    return pinecone_index

### 3. Create Index

In [40]:
# The index name must be lowercase.
# The dimension must match the embedding size of the model (768 here).
# The metric can be changed to suit the use case.
initialize_pinecone()
pinecone_index_name = "pinecone-examples"
pinecone.create_index(pinecone_index_name, index_type="approximated", dimension=768, metric="cosine")
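
The constraints in the comments above can be checked before the index is created. The following is a minimal sketch, not part of the original notebook: `is_valid_index_name` and `check_dimension` are hypothetical helpers, and the naming rule (lowercase letters, digits and hyphens) is an assumption that should be confirmed against the Pinecone documentation.

```python
import re

# Assumed naming rule: lowercase letters, digits and hyphens,
# starting and ending with a letter or digit.
_INDEX_NAME_PATTERN = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def is_valid_index_name(name):
    """Return True if `name` satisfies the assumed naming rule."""
    return _INDEX_NAME_PATTERN.fullmatch(name) is not None


def check_dimension(vector, expected_dimension):
    """Raise ValueError if the vector length differs from the index dimension."""
    if len(vector) != expected_dimension:
        raise ValueError(
            f"expected {expected_dimension} values, got {len(vector)}"
        )


print(is_valid_index_name("pinecone-examples"))  # True
print(is_valid_index_name("Pinecone_Examples"))  # False
check_dimension([0.0] * 768, 768)  # passes silently
```

Running `check_dimension` on one embedding before upserting catches a mismatch between the model and the index early.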

### 4. Connect to the Index

In [41]:
index = connect_pinecone_index(pinecone_index_name)

### 5. Prepare Dataset

In [5]:
from datasets import load_dataset

dataset = load_dataset('quora', split='train')
dataset

Dataset(features: {'questions': Sequence(feature={'id': Value(dtype='int32', id=None), 'text': Value(dtype='string', id=None)}, length=-1, id=None), 'is_duplicate': Value(dtype='bool', id=None)}, num_rows: 404290)

In [6]:
questions = []

for record in dataset['questions']:
    questions.extend(record['text'])
  
# remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))


Can we drop some devices playing videos in North Sentinel and try to teach the people living there?
What are the reviews of the OnePlus 2?
What are some tips on making it through the job interview process at National Interstate?
Is it predicted by the Qur'an that Hillary Clinton will become the President of the USA? They say that the Qur'an has the answer to everything.
537362


In [16]:
ids=[]
for record in dataset['questions']:
    ids.extend(record['id'])

In [17]:
ids[0:5]

[1, 2, 3, 4, 5]
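
Note that `questions` was deduplicated with `set`, which does not preserve order, while `ids` was not deduplicated. As a result `ids[i]` and `questions[i]` are not guaranteed to describe the same question, and the order of `questions` may differ between runs. One possible way to keep them aligned is sketched below. `unique_questions` is a hypothetical helper, and the sample records are hand-made to mimic the shape of `dataset['questions']`.

```python
def unique_questions(records):
    """
    Flatten question pairs and drop repeated texts, keeping first-seen order.
    Each record is expected to look like {'id': [...], 'text': [...]}.
    Returns two parallel lists: ids and texts.
    """
    first_id_by_text = {}
    for record in records:
        for question_id, text in zip(record['id'], record['text']):
            # setdefault keeps the id of the first occurrence only
            first_id_by_text.setdefault(text, question_id)
    texts = list(first_id_by_text)
    ids = [first_id_by_text[t] for t in texts]
    return ids, texts


# Tiny hand-made sample in the same shape as dataset['questions']
sample = [
    {'id': [1, 2], 'text': ['What is AI?', 'What is ML?']},
    {'id': [3, 4], 'text': ['What is AI?', 'What is NLP?']},
]
sample_ids, sample_texts = unique_questions(sample)
print(sample_ids)    # [1, 2, 4]
print(sample_texts)  # ['What is AI?', 'What is ML?', 'What is NLP?']
```

Because a `dict` keeps insertion order, the result is the same on every run, and each text stays paired with the id it first appeared under.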

### 6. Generate Embeddings

In [9]:
from sentence_transformers import SentenceTransformer
import tensorflow.compat.v1 as tf
from torch.quantization import quantize_dynamic
from torch.nn import Embedding, Linear
import pandas as pd
import numpy as np

In [20]:
# Load the Transformer model
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

In [None]:
## Generate embeddings for a sample of 1,000 data points

In [97]:
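def generate_embeddings(ids, text, count):
    """
    Encode `count` questions and pair each vector with an id and metadata.
    The slices start at index 1, so the first element of `ids` and `text` is skipped.
    :return: list of (id, vector, metadata) tuples
    """
    data_id = [str(i) for i in ids[1:count + 1]]
    data = text[1:count + 1]
    metadata = [{'text': question} for question in data]
    embedding = model.encode(
        data,
        show_progress_bar=True,
        convert_to_tensor=False,
        convert_to_numpy=True,
        batch_size=32,
    )
    vectors = list(zip(data_id, embedding.tolist(), metadata))
    return vectors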

In [98]:
embedded_vectors = generate_embeddings(ids, questions, 1000)

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

### 6. Insert the data
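
The cell that performed the upsert is not shown in this notebook. The following is a minimal sketch of how it could look, under these assumptions: the vectors are written to the `text-search` namespace that the query step reads from, the installed client accepts `index.upsert(vectors=..., namespace=...)`, and a batch size of 100 is acceptable. `chunks` and `upsert_in_batches` are hypothetical helper names.

```python
import itertools


def chunks(iterable, batch_size=100):
    """Yield successive tuples of at most `batch_size` items."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))


def upsert_in_batches(index, vectors, namespace="text-search", batch_size=100):
    """Send (id, values, metadata) tuples to the index one batch at a time."""
    total = 0
    for batch in chunks(vectors, batch_size):
        # Assumed client call; check it against the installed Pinecone version.
        index.upsert(vectors=list(batch), namespace=namespace)
        total += len(batch)
    return total


# Intended usage in this notebook (requires a live index):
# upsert_in_batches(index, embedded_vectors)
```

Sending the vectors in batches keeps each request small, which matters once more than a few hundred vectors are inserted.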

### 7. Query Test Data

In [104]:
query = "What are reviews of Oneplus phone?"

# create the query vector
xq = model.encode(query).tolist()

# query the index for the single closest match
xc = index.query(xq, top_k=1, include_metadata=True, namespace="text-search")
xc

{'matches': [{'id': '3',
              'metadata': {'text': 'What are the reviews of the OnePlus 2?'},
              'score': 0.756250858,
              'values': []}],
 'namespace': 'text-search'}
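
The response can be reduced to the fields of interest. The sketch below works on a plain dictionary copied from the output above and assumes the real response supports the same dictionary-style access. `matches_above` and its `min_score` threshold of 0.7 are illustrative choices, not part of the original notebook.

```python
def matches_above(response, min_score=0.7):
    """Return (id, score, text) for each match whose score is at least min_score."""
    results = []
    for match in response['matches']:
        if match['score'] >= min_score:
            results.append(
                (match['id'], match['score'], match['metadata']['text'])
            )
    return results


# Response copied from the output above
response = {
    'matches': [{'id': '3',
                 'metadata': {'text': 'What are the reviews of the OnePlus 2?'},
                 'score': 0.756250858,
                 'values': []}],
    'namespace': 'text-search',
}

for match_id, score, text in matches_above(response):
    print(f"{score:.3f}  {match_id}  {text}")
# 0.756  3  What are the reviews of the OnePlus 2?
```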

> The returned text either has the same meaning as the query or is closely related to it.
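
Since the index was created with `metric="cosine"`, the `score` in the response is a cosine similarity between the query vector and the stored vector. A minimal pure-Python sketch of that measure is shown below for illustration only. It is not how Pinecone computes the score internally, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A score close to 1 means the two embeddings point in nearly the same direction, while a score near 0 means they are unrelated.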

### 8. Delete Index

In [None]:
pinecone.delete_index(pinecone_index_name)