# Semantic Search

In this notebook we will use Pinecone for semantic search. To begin, we install the prerequisite libraries:

----

pip install "pinecone-client[grpc]==2.2.1" datasets==2.12.0 sentence-transformers==2.2.2

----

In [13]:
import pandas as pd
import numpy as np
import pinecone

## Data Preprocessing

The dataset preparation process requires a few steps:

1. Load the dataset from our directory.

2. Embed the text content of the dataset into vectors.

3. Reformat the results into an `(id, vector, metadata)` structure to be added to Pinecone.

We will see how steps `1`, `2`, and `3` are done in this section, but we won't implement `2` and `3` across the whole dataset until we reach the *upsert loop* as we will iteratively perform these two steps.

In either case, this can take some time. If you'd rather skip the data preparation step and get straight to upserts and testing the semantic search functionality, refer to the [**fast notebook**](https://github.com/pinecone-io/examples/blob/master/docs/semantic-search.ipynb).

In [43]:
# load course dataset
dataset_df = pd.read_csv('data/all_minus_med.csv')
print('Dataframe Shape:', dataset_df.shape)
dataset_df.head()

Dataframe Shape: (43482, 21)


Unnamed: 0,Course ID,Class Nbr,Subject,Catalog,Descr,PI Name,Course Long Descr,Section,Component,Mode,...,Mtg Start,Mtg End,Pat,Start Date,End Date,Descr 1,Attribute Formal Desc,Term Descr,Career,Location
0,19,8275,AMES,165S,THE WORLD OF JAPANESE POP CULT,"Maude,Daryl J",An examination of modern Japanese culture thro...,1,SEM,In Person,...,11:45:00.000000AM,1:00:00.000000PM,TTH,8/28/23,12/8/23,Languages 207,"(ALP) Arts, Literature & Performance",2023 Fall Term,UGRD,DURHAM
1,19,8275,AMES,165S,THE WORLD OF JAPANESE POP CULT,"Maude,Daryl J",An examination of modern Japanese culture thro...,1,SEM,In Person,...,11:45:00.000000AM,1:00:00.000000PM,TTH,8/28/23,12/8/23,Languages 207,(CCI) Cross Cultural Inquiry,2023 Fall Term,UGRD,DURHAM
2,19,8275,AMES,165S,THE WORLD OF JAPANESE POP CULT,"Maude,Daryl J",An examination of modern Japanese culture thro...,1,SEM,In Person,...,11:45:00.000000AM,1:00:00.000000PM,TTH,8/28/23,12/8/23,Languages 207,(CZ) Civilizations,2023 Fall Term,UGRD,DURHAM
3,19,8275,AMES,165S,THE WORLD OF JAPANESE POP CULT,"Maude,Daryl J",An examination of modern Japanese culture thro...,1,SEM,In Person,...,11:45:00.000000AM,1:00:00.000000PM,TTH,8/28/23,12/8/23,Languages 207,Crosslisted in another department,2023 Fall Term,UGRD,DURHAM
4,19,8275,AMES,165S,THE WORLD OF JAPANESE POP CULT,"Maude,Daryl J",An examination of modern Japanese culture thro...,1,SEM,In Person,...,11:45:00.000000AM,1:00:00.000000PM,TTH,8/28/23,12/8/23,Languages 207,Seminar,2023 Fall Term,UGRD,DURHAM


The dataset has 43,482 rows describing courses at Duke University, with columns for the description, subject, catalog number, meeting times, and so on. As the preview shows, the same course can appear in several rows (once per attribute), so the number of distinct courses is smaller than the row count.

We can extract all course descriptions into a single `descriptions` list.

In [44]:
# convert course descriptions to list
descriptions = dataset_df['Course Long Descr'].tolist()
# remove duplicates
descriptions = list(set(descriptions))

# print first 5 descriptions
for i in range(5):
    print(i+1, ')', descriptions[i])
    print()

# convert to csv and store in data directory
descriptions_df = pd.DataFrame(descriptions, columns=['Course Long Descr'])
descriptions_df.to_csv('data/descriptions.csv', index=False)

1 ) See African & African American Studies 391. Consent of both instructor and director of undergraduate studies required.

2 ) Topics vary.  May be repeated for credit

3 ) This course focuses on creating a solid foundation for nursing care of individuals across the lifespan. Students use clinical reasoning, therapeutic communication, and the nursing process to provide competent, evidence-based, safe and holistic care. Emphasis is placed on health assessment and the introduction of skills necessary to maintain wellness and promote the health of diverse populations in all stages of life. Corequisite: Nursing 393

4 ) Probability models, random variables with discrete and continuous distributions. Independence, joint distributions, conditional distributions. Expectations, functions of random variables, central limit theorem. An assignment will ask the student to relate this course to their research.

5 ) Tutorial course for Bass Connections yearlong project team. Topics vary by semester
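
One thing to keep in mind with `list(set(...))` is that a set has no guaranteed order, so the position of each description (and any ID later derived from that position) can differ between runs. A minimal sketch of an order-preserving alternative is shown here; `unique_texts` is a hypothetical helper, not part of the original notebook, and it also skips non-string values such as NaN.

```python
def unique_texts(values):
    """Return unique strings in first-seen order, skipping non-strings such as NaN."""
    # dict keys are unique and keep insertion order
    seen = dict.fromkeys(v for v in values if isinstance(v, str))
    return list(seen)


sample = ["Topics vary.", "Seminar.", "Topics vary.", float("nan"), "Lab."]
print(unique_texts(sample))
```

In the notebook this could be applied as `descriptions = unique_texts(dataset_df['Course Long Descr'])`, assuming the same column name as above.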

We find the mean, median, minimum, and maximum number of words in the descriptions.

In [16]:
# Calculate the number of words for each description
num_words = [len(description.split()) for description in descriptions if isinstance(description, str)]

# Calculate the minimum, maximum, median, and mean number of words
minimum = min(num_words)
maximum = max(num_words)
median = sorted(num_words)[len(num_words) // 2]
mean = sum(num_words) / len(num_words)

# Print the results
print("Minimum number of words:", minimum)
print("Maximum number of words:", maximum)
print("Median number of words:", median)
print("Mean number of words:", mean)

Minimum number of words: 2
Maximum number of words: 262
Median number of words: 61
Mean number of words: 60.75444635685599
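
The same summary can be sketched with Python's standard `statistics` module. Indexing the middle of a sorted list, as in the cell above, returns the upper of the two middle values when the list has an even length, whereas `statistics.median` averages them. The helper name `word_count_summary` and the sample texts are made up for illustration.

```python
import statistics


def word_count_summary(texts):
    """Summarize whitespace-separated word counts, ignoring non-string entries."""
    counts = [len(t.split()) for t in texts if isinstance(t, str)]
    return {
        "min": min(counts),
        "max": max(counts),
        "median": statistics.median(counts),
        "mean": statistics.mean(counts),
    }


sample = [
    "Topics vary.",
    "May be repeated for credit",
    "Seminar",
    "An examination of modern culture",
]
print(word_count_summary(sample))
```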


With our descriptions ready to go we can move on to demoing steps **2** and **3** above.

### Building Embeddings and Upsert Format

To create our embeddings we will use the `MiniLM-L6` sentence transformer model. This is a very efficient semantic similarity embedding model from the `sentence-transformers` library. We initialize it like so:

In [21]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print(f"You are using {device}. This is much slower than using "
          "a CUDA-enabled GPU. If on Colab you can change this by "
          "clicking Runtime > Change runtime type > GPU.")

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

You are using cpu. This is much slower than using a CUDA-enabled GPU. If on Colab you can change this by clicking Runtime > Change runtime type > GPU.


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

There are *three* interesting bits of information in the above model printout. Those are:

* `max_seq_length` is `256`. That means that the maximum number of tokens (like words) that can be encoded into a single vector embedding is `256`. Anything beyond this *must* be truncated.

* `word_embedding_dimension` is `384`. This number is the dimensionality of vectors output by this model. It is important that we know this number later when initializing our Pinecone vector index.

* `Normalize()`. This final normalization step indicates that all vectors produced by the model are normalized. That means that models for which we would typically measure similarity using *cosine similarity* can also use the *dotproduct* similarity metric. In fact, with normalized vectors *cosine* and *dotproduct* are equivalent.
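
The equivalence in the last point can be checked numerically. The sketch below uses random NumPy vectors as stand-ins for model output (it does not call the model): after scaling both vectors to unit length, their dot product matches the cosine similarity of the original vectors.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
a = rng.normal(size=384)  # same dimensionality as the model's embeddings
b = rng.normal(size=384)

# scale each vector to unit length, as the Normalize() layer does
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

dot = float(np.dot(a_unit, b_unit))
cos = cosine_similarity(a, b)
print(np.isclose(dot, cos))
```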

Moving on, we can create a sentence embedding using this model like so:

In [22]:
query = 'classes that teach about machine learning'

xq = model.encode(query)
xq.shape

(384,)

Encoding this single sentence leaves us with a `384` dimensional sentence embedding (aligned to the `word_embedding_dimension` above).

To prepare this for `upsert` to Pinecone, all we do is this:

In [23]:
_id = '0'
metadata = {'text': query}

# convert the NumPy array to a plain list so the client can serialize it
vectors = [(_id, xq.tolist(), metadata)]

Later, when we upsert our data to Pinecone, we will do so in batches, meaning `vectors` will be a list of `(id, embedding, metadata)` tuples.
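
As a rough sketch of that batching, the hypothetical generator below groups texts and their embeddings into lists of `(id, embedding, metadata)` tuples. It assumes the embeddings are already plain Python lists (for example from `model.encode(...).tolist()`); the two-dimensional vectors here are placeholders, not real model output.

```python
def batched_records(texts, embeddings, batch_size=128):
    """Yield lists of (id, vector, metadata) tuples, batch_size records at a time."""
    for start in range(0, len(texts), batch_size):
        end = min(start + batch_size, len(texts))
        yield [
            (str(i), embeddings[i], {"text": texts[i]})
            for i in range(start, end)
        ]


texts = ["intro to statistics", "machine learning", "modern poetry"]
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # placeholder vectors

batches = list(batched_records(texts, embeddings, batch_size=2))
print([len(batch) for batch in batches])
```

Each yielded list has the shape the upsert call expects, with string IDs derived from the position of each text.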

## Creating an Index

Now that the data is ready, we can set up our index to store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [24]:
import os
import pinecone

from dotenv import load_dotenv
load_dotenv()
# get api key from app.pinecone.io
api_key = os.environ.get('PINECONE_API_KEY')
# find your environment next to the api key in pinecone console
env = os.environ.get('PINECONE_ENVIRONMENT', 'gcp-starter')

pinecone.init(
    api_key=api_key,
    environment=env
)
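
If the API key is missing, `os.environ.get` quietly returns `None` and the problem only surfaces later. A small guard along these lines can fail early instead; `read_pinecone_settings` is a hypothetical helper, and the `gcp-starter` default is an assumption taken from the cell above.

```python
import os


def read_pinecone_settings(environ=os.environ):
    """Return (api_key, environment), raising early if the key is missing."""
    api_key = environ.get("PINECONE_API_KEY")
    if not api_key:
        raise RuntimeError(
            "PINECONE_API_KEY is not set; add it to your environment or .env file"
        )
    # assumed default, matching the starter environment used in this notebook
    environment = environ.get("PINECONE_ENVIRONMENT", "gcp-starter")
    return api_key, environment


# demo with an in-memory mapping rather than real credentials
print(read_pinecone_settings({"PINECONE_API_KEY": "demo-key"}))
```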

Now we create a new index called `duke-course-desc`. It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.

In [25]:
index_name = 'duke-course-desc'
hostname='https://duke-course-desc-ea5f15c.svc.gcp-starter.pinecone.io'

In [26]:
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric='cosine'
    )
# now connect to the index
index = pinecone.Index(index_name)

Now we upsert the data in batches of `128`.

_**Note:** On Google Colab with GPU expected runtime is ~7 minutes. If using CPU this will be significantly longer. If you'd like to get this running faster refer to the [fast notebook](https://github.com/pinecone-io/examples/blob/master/search/semantic-search/semantic-search-fast.ipynb)._

In [30]:
from tqdm.auto import tqdm

batch_size = 128
vector_limit = 100000

# keep only string descriptions (pandas loads missing values as NaN) and cap the total
descriptions = [d for d in descriptions if isinstance(d, str)][:vector_limit]

for i in tqdm(range(0, len(descriptions), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(descriptions))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in descriptions[i:i_end]]
    # create embeddings as plain lists; passing NumPy arrays raises
    # "ApiValueError: Unable to prepare type ndarray for serialization"
    xc = model.encode(descriptions[i:i_end]).tolist()
    # create records list for upsert
    records = list(zip(ids, xc, metadatas))
    # upsert to Pinecone
    index.upsert(vectors=records)

# check number of records in the index
index.describe_index_stats()

## Making Queries

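Now that our index is populated we can begin making queries. We embed the query with the same model and search the index for its nearest neighbors. The example query and outputs shown in this section are questions rather than course descriptions; they appear to come from a run against the Quora question-pairs dataset (~400K question pairs), where the task is finding *similar questions*, so we embed and search with another question. Let's begin.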

In [20]:
query = "which city has the highest population in the world?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '31072',
              'metadata': {'text': 'What country has the biggest population?'},
              'score': 0.7655585,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '23769',
              'metadata': {'text': 'What is the biggest city?'},
              'score': 0.7271395,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '65783',
              'metadata': {'text': 'What is the most isolated city in the '
                                   'world, with over a million metro area '
                                   'inhabitants?'},
              'score': 0.7020447,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '104484',
              'metadata': {'text': 'Which is the most beautiful city in '
                                   'world?'},
              'score': 0.69991666,
    

In the returned response `xc` we can see the most relevant questions to our particular query. We can reformat this response to be a little easier to read:

In [21]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.77: What country has the biggest population?
0.73: What is the biggest city?
0.7: What is the most isolated city in the world, with over a million metro area inhabitants?
0.7: Which is the most beautiful city in world?
0.7: Where is the most beautiful city in the world?


These are good results. Let's modify the wording of the query to see if we still surface similar results.

In [23]:
query = "which metropolis has the highest number of people?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(xq, top_k=5, include_metadata=True)
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.67: What is the most isolated city in the world, with over a million metro area inhabitants?
0.64: What is the biggest city?
0.61: Which place has the highest Asian Indian population in the USA?
0.6: What is the most dangerous city in USA?
0.59: What country has the biggest population?


Here we used different terms in our query than those in the returned documents. We replaced **"city"** with **"metropolis"** and **"highest population"** with **"highest number of people"**.

Despite these very different terms and the *lack* of term overlap between query and returned documents, we get highly relevant results. This is the power of *semantic search*.
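
To put a number on that lack of term overlap, the sketch below computes the Jaccard overlap of word sets between the reworded query and two of the returned texts. This is a simple lexical measure chosen for illustration, not something Pinecone or the model computes; the helper names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Shared tokens divided by total distinct tokens across both texts."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


query = "which metropolis has the highest number of people?"
results = [
    "What is the biggest city?",
    "What country has the biggest population?",
]

for text in results:
    print(f"{jaccard(query, text):.2f}: {text}")
```

The shared tokens are only function words such as "the" and "has", so a purely keyword-based match would score these pairs poorly even though the embedding model ranks them highly.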

You can go ahead and ask more questions above. When you're done, delete the index to save resources:

In [24]:
pinecone.delete_index(index_name)

---