This notebook investigates the behavior of OpenAI text embeddings. It explores some of their strengths and weaknesses and tries to answer questions such as "What makes a good embedding query for information retrieval?".

In [1]:
from doc_qa.embed import generate_pdf_embeddings

In [2]:
PDF_FILE = "test_docs/2304.02643.pdf" # The Segment Anything Model (SAM) paper.

In [5]:
index = generate_pdf_embeddings(PDF_FILE)

Found existing document embeddings at 'output/embeddings/2304.02643'. Not re-generating.


Using embedded DuckDB with persistence: data will be stored in: output/embeddings/2304.02643


# False Positive Example (Authors)

In this example, a query for the authors of the paper returns names from the acknowledgments and from the author lists of referenced papers, not the paper's own authors.

In [7]:
index.vectorstore.similarity_search("authors")

[Document(page_content='Robert Kuo for help with data annotation platform. We\nthank Allen Goodman and Bram Wasti for help in optimiz-\ning web-version of our model. Finally, we thank Morteza\nBehrooz, Ashley Gabriel, Ahuva Goldstand, Sumanth Gur-\nram, Somya Jain, Devansh Kukreja, Joshua Lane, Lilian\nLuong, Mallika Malhotra, William Ngan, Omkar Parkhi,\nNikhil Raina, Dirk Rowe, Neil Sejoor, Vanessa Stark, Bala\nVaradarajan, and Zachary Winstrom for their help in mak-\ning the demo, dataset viewer, and other assets and tooling.\n12', metadata={'source': 'test_docs/2304.02643.pdf', 'page': 11}),
 Document(page_content='garajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona\nRyan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhong-\ncong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car-\ntillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli,\nChristoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Chris-\ntian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, Jame

Posing the query as a question did not help:

In [9]:
index.vectorstore.similarity_search("Who are the authors of this paper?")

[Document(page_content='garajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona\nRyan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhong-\ncong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car-\ntillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli,\nChristoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Chris-\ntian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, James Hillis,\nXuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Ko-\nlar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li,\nYanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Mod-\nhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will\nPrice, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran\nSomasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao,\nMinh V o, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu,\nPablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria\nFarinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V . Jawahar,\nHanbyul Joo, Kris Kitani, Haizhou Li
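One possible mitigation is to post-filter the results using the `page` metadata that each returned chunk carries. The sketch below is only an illustration. It assumes that the author list sits on the first page (page index 0), which is not verified in this notebook. `Doc` and `keep_pages` are hypothetical names introduced here, and the toy results only mimic the shape of the real output.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for the Document objects returned by similarity_search."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def keep_pages(docs, pages):
    """Keep only the chunks whose 'page' metadata value is in `pages`."""
    allowed = set(pages)
    return [d for d in docs if d.metadata.get("page") in allowed]


# Toy results with placeholder text; only the metadata shape matches the output above.
results = [
    Doc("Robert Kuo for help with data annotation platform.", {"page": 11}),
    Doc("placeholder for a chunk from the title page", {"page": 0}),
]

# Assumption: the paper's own authors are listed on page index 0.
first_page_hits = keep_pages(results, pages=[0])
print([d.metadata["page"] for d in first_page_hits])
```

In the notebook, the same helper could be applied to a larger result set, for example `keep_pages(index.vectorstore.similarity_search("authors", k=20), pages=[0])`, assuming the vector store accepts a `k` argument as LangChain vector stores generally do.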

# Comparing Query Styles

In this section, we test different query styles for information retrieval via embedding search.

The queries are designed to retrieve the size of the SA-1B dataset. (The correct answer is 11M images.)

All of the query styles successfully retrieved documents that contained the correct answer. A much more thorough analysis would be required to draw any meaningful comparisons.
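A rough way to make this comparison less manual is to record the rank of the first returned chunk that mentions the answer. The following is a minimal sketch, not the method used in this notebook. `Doc`, `ANSWER`, and `first_hit_rank` are hypothetical names, the regular expression is an assumption about how the answer is written (for example "11M" or "11 million"), and the toy results are short excerpts of the outputs shown in the subsections.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for the Document objects returned by similarity_search."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Assumption: the answer appears as "11M" or "11 million" in the chunk text.
ANSWER = re.compile(r"\b11\s?(?:M|million)\b")


def first_hit_rank(docs, pattern=ANSWER):
    """Return the 1-based rank of the first chunk matching `pattern`, or None."""
    for rank, doc in enumerate(docs, start=1):
        if pattern.search(doc.page_content):
            return rank
    return None


# Shortened excerpts of the outputs for two of the query styles.
search_term_docs = [
    Doc("Our dataset, SA-1B, consists of 11M diverse, high-\nresolution images"),
]
full_question_docs = [
    Doc("of the 7 selected datasets.\n23"),
    Doc("There\nare 11 million images."),
]

for name, docs in [("Search Term", search_term_docs),
                   ("Full Question", full_question_docs)]:
    print(name, first_hit_rank(docs))
```

With real results, `docs` would be the list returned by `index.vectorstore.similarity_search(query)`. A lower rank suggests that the query style surfaces the answer earlier, though a single question is far too small a sample to support a conclusion.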

## Search Term

Use a short search term to find related text.

In [18]:
index.vectorstore.similarity_search("Dataset size")

[Document(page_content='in our dataset, producing a total of 1.1B high-quality masks.\nWe describe and analyze the resulting dataset, SA-1B, next.\nFigure 5: Image-size normalized mask center distributions.\n5. Segment Anything Dataset\nOur dataset, SA-1B, consists of 11M diverse, high-\nresolution, licensed, and privacy protecting images and\n1.1B high-quality segmentation masks collected with our\ndata engine. We compare SA-1B with existing datasets\nand analyze mask quality and properties. We are releasing\nSA-1B to aid future development of foundation models for\ncomputer vision. We note that SA-1B will be released un-\nder a favorable license agreement for certain research uses\nand with protections for researchers.\nImages . We licensed a new set of 11M images from a\nprovider that works directly with photographers. These im-\nages are high resolution (3300 \x024950 pixels on average),\nand the resulting data size can present accessibility and stor-\nage challenges. Therefore, we

## Full Question

Use the question that we are trying to answer as the vector embedding query.

In [19]:
index.vectorstore.similarity_search("What is the dataset size?")

[Document(page_content='of the 7 selected datasets.\n23', metadata={'source': 'test_docs/2304.02643.pdf', 'page': 22}),
 Document(page_content='and edges)? Please provide a description. All of the instances in the dataset\nare photos. The photos vary in subject matter; common themes of the photo\ninclude: locations, objects, scenes. All of the photos are distinct, however\nthere are some sets of photos that were taken of the same subject matter.\n2.How many instances are there in total (of each type, if appropriate)? There\nare 11 million images.\n3.Does the dataset contain all possible instances or is it a sample (not nec-\nessarily random) of instances from a larger set? If the dataset is a sample,\nthen what is the larger set? Is the sample representative of the larger set\n(e.g., geographic coverage)? If so, please describe how this representa-\ntiveness was validated/veriﬁed. If it is not representative of the larger set,\nplease describe why not ( e.g., to cover a more diverse ra

## Imaginary Statement

Write a statement that resembles the one you are *hoping* to find in the doc store, and use that imaginary statement as the embedding query.

In [20]:
index.vectorstore.similarity_search("The size of the dataset that we introduced in this paper is __ images.")

[Document(page_content='images with their shortest side set to 1500 pixels. Even af-\nter downsampling, our images are signiﬁcantly higher reso-\nlution than many existing vision datasets ( e.g., COCO [66]\nimages are \x18480\x02640 pixels). Note that most models today\noperate on much lower resolution inputs. Faces and vehicle\nlicense plates have been blurred in the released images.\nMasks . Our data engine produced 1.1B masks, 99.1% of\nwhich were generated fully automatically. Therefore, the\nquality of the automatic masks is centrally important. We\ncompare them directly to professional annotations and look\nat how various mask properties compare to prominent seg-\nmentation datasets. Our main conclusion, as borne out in\nthe analysis below and the experiments in §7, is that our\nautomatic masks are high quality and effective for training\nmodels. Motivated by these ﬁndings, SA-1B only includes\nautomatically generated masks.\nMask quality. To estimate mask quality, we randomly sa