# Semantic Search with LangChain and Azure OpenAI

This notebook demonstrates how to build a semantic search system using vector embeddings. Unlike traditional keyword-based search, semantic search understands the **meaning** of queries and documents, enabling more intelligent and context-aware information retrieval.

## What is Semantic Search?

Semantic search uses machine learning models to convert text into numerical vectors (embeddings) that capture semantic meaning. Documents with similar meanings will have similar vector representations, allowing us to find relevant information based on conceptual similarity rather than just keyword matching.
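To make "similar meanings have similar vectors" concrete, here is a minimal sketch that ranks texts by cosine similarity. The three-dimensional vectors are hand-written, hypothetical stand-ins for real embeddings, which have far more dimensions and come from a model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for illustration only.
toy_vectors = {
    "shoe sales rose": [0.9, 0.1, 0.0],
    "footwear revenue grew": [0.8, 0.2, 0.1],
    "the office is in Oregon": [0.0, 0.2, 0.9],
}

query = toy_vectors["shoe sales rose"]
# Sort texts from most to least similar to the query vector.
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query, toy_vectors[text]),
    reverse=True,
)
print(ranked)
```

The two sales-related texts share no keywords, yet their vectors point in nearly the same direction, so they rank next to each other.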

## Workflow Overview

1. **Load Documents**: Import PDF documents into processable format
2. **Split Text**: Divide large documents into smaller, focused chunks
3. **Generate Embeddings**: Convert text chunks into vector representations
4. **Store Vectors**: Index embeddings in a vector database
5. **Search**: Query the system to find semantically similar content

Let's dive into each step!

## Step 1: Loading Documents

**Document loading** is the foundation of any information retrieval system. We'll load a PDF file containing Nike's 10-K financial report from 2023.

### Why PDFs?
- Common format for reports, papers, and documentation
- Contains structured and unstructured data
- Requires specialized parsing to extract text accurately

### Import PDF Loader and Load Document

**PyPDFLoader** is a specialized document loader for PDF files in LangChain.

### How it works:
- **Reads PDF**: Parses the binary PDF format
- **Extracts Text**: Pulls text content from each page
- **Creates Documents**: Converts each page into a LangChain `Document` object with:
  - `page_content`: The actual text from the page
  - `metadata`: Information like file path and page number

### The Process:
1. Specify the file path to the PDF
2. Create a loader instance
3. Call `load()` to extract all pages

**Result**: A list of Document objects, one per PDF page.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../../data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

107


### Understanding Document Structure

`PyPDFLoader` creates one `Document` object per PDF page, making it easy to work with the content programmatically.

### Each Document contains:
- **`page_content`**: The string content of the page (all text extracted)
- **`metadata`**: A dictionary with:
  - `source`: File path to the original PDF
  - `page`: Page number (0-indexed)

This structure allows us to:
- Track where information came from
- Reference specific pages when displaying results
- Filter or process specific sections of documents
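As a rough illustration of filtering by metadata, the sketch below selects pages by their 0-indexed `page` value. `PageDoc` is a hypothetical stand-in that only mimics the two fields described above, and the sample pages are invented; with real data you would apply the same list comprehension to `docs`.

```python
from dataclasses import dataclass, field

@dataclass
class PageDoc:
    """Hypothetical stand-in exposing the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Invented sample pages, for illustration only.
pages = [
    PageDoc("Cover page text", {"source": "report.pdf", "page": 0}),
    PageDoc("Business overview", {"source": "report.pdf", "page": 1}),
    PageDoc("Risk factors", {"source": "report.pdf", "page": 2}),
]

def pages_in_range(docs, first, last):
    """Keep documents whose 0-indexed page number falls in [first, last]."""
    return [d for d in docs if first <= d.metadata.get("page", -1) <= last]

selected = pages_in_range(pages, 1, 2)
for d in selected:
    # page is 0-indexed, so add 1 for a human-readable page label
    print(f"{d.metadata['source']} p.{d.metadata['page'] + 1}: {d.page_content}")
```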

### Inspect Document Content and Metadata

Let's examine what was extracted from the first page to verify the loading process worked correctly.

**This preview shows:**
- First 200 characters of the page content
- Metadata dictionary with source file and page number

This verification step helps ensure:
- Text extraction is working properly
- Content quality is sufficient for semantic search
- Metadata is correctly populated

In [3]:
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': '../../data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}


## Step 2: Document Splitting Strategy

### Why Split Documents?

**Problem**: A full PDF page is often too large and contains multiple topics.
- Mixing different concepts in one chunk dilutes the semantic meaning
- Large chunks make it harder to pinpoint specific information
- Retrieval accuracy suffers when relevant details are buried in irrelevant context

**Solution**: Split documents into smaller, focused chunks.

### Text Splitter Configuration

We use `RecursiveCharacterTextSplitter` with these parameters:
- **`chunk_size=1000`**: At most 1000 characters per chunk (roughly 150-200 words of English text)
- **`chunk_overlap=200`**: Consecutive chunks share up to 200 characters
  - Reduces the chance that a sentence cut at a chunk boundary loses its surrounding context
  - Maintains context continuity at chunk boundaries
- **`add_start_index=True`**: Tracks original position in source document
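The effect of `chunk_size` and `chunk_overlap` is easiest to see with a naive fixed-width window. This is a simplified sketch, not how `RecursiveCharacterTextSplitter` works internally: the real splitter cuts on separators, so its chunks are usually shorter than the limit and the overlap is not exact. The helper name `fixed_window_chunks` is hypothetical.

```python
def fixed_window_chunks(text, chunk_size, chunk_overlap):
    """Naive chunking: each chunk starts (chunk_size - chunk_overlap) after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Record the start offset, similar in spirit to add_start_index=True.
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
windows = fixed_window_chunks(text, chunk_size=10, chunk_overlap=4)
for start, chunk in windows:
    print(start, chunk)
```

With a size of 10 and an overlap of 4, each new chunk begins 6 characters after the previous one and repeats its last 4 characters.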

### Why "Recursive"?

The splitter tries to split on natural boundaries in this order:
1. Paragraphs (double newlines)
2. Lines (single newlines)
3. Words (spaces)
4. Characters (as last resort)

This preserves semantic coherence better than arbitrary character splitting.
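The fallback order can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, assuming the four separators listed above; it ignores overlap and is not LangChain's actual implementation. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators before finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one piece
        else:
            if current:
                pieces.append(current)
            if len(part) <= chunk_size:
                current = part
            else:
                # Still too large: retry this part with finer separators.
                pieces.extend(recursive_split(part, chunk_size, finer))
                current = ""
    if current:
        pieces.append(current)
    return pieces

sample = (
    "First paragraph.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
pieces = recursive_split(sample, chunk_size=30)
for piece in pieces:
    print(repr(piece))
```

Short paragraphs survive intact, while the long middle paragraph falls through to word-level splitting.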

**Learn more**: [LangChain PDF Guide](/docs/how_to/document_loader_pdf/)

### Split Documents into Manageable Chunks

**Execute the splitting** to transform the page-level documents into smaller, focused segments.

### What happens:
1. Each page document is processed by the text splitter
2. Pages are divided into chunks of at most 1000 characters
3. Neighboring chunks overlap by up to 200 characters
4. Original document metadata is preserved in each chunk
5. Start index is added to track position in original document

### Expected Result:
The number of chunks will be significantly larger than the number of pages, as each page typically produces multiple chunks.

**Benefits for Search:**
- More precise retrieval (find specific paragraphs, not whole pages)
- Better embedding quality (focused semantic meaning)
- Improved answer accuracy in downstream tasks

In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

Document(metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': '../../data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1', 'start_index': 0}, page_content="Table of Contents\nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 10-K\n(Mark One)\n☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\nFOR THE FISCAL YEAR ENDED MAY 31, 2023\nOR\n☐  TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\nFOR THE TRANSITION PERIOD FROM                         TO                         .\nCommission File No. 1-1063

## Step 3: Configure Azure OpenAI Embeddings

**Embeddings** are the heart of semantic search - they convert text into numerical vectors that capture meaning.

### What are Embeddings?
- **Input**: Text string (query or document)
- **Output**: Vector of numbers (e.g., 1536 dimensions for text-embedding-ada-002)
- **Property**: Similar meanings → Similar vectors

### Model: text-embedding-ada-002
- **Provider**: Azure OpenAI
- **Dimensions**: 1536-dimensional vectors
- **Use Cases**: Semantic search, clustering, similarity comparison
- **Quality**: Captures nuanced semantic relationships

### Configuration:
- **`azure_endpoint`**: Your Azure OpenAI resource URL
- **`api_key`**: Authentication credential (loaded securely)
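- **`model`**: The embedding model name (this works as written when the Azure deployment is named after the model; otherwise pass the deployment name via `azure_deployment`)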
- **`api_version`**: Azure API version for compatibility

### Security Note:
The code checks for an existing environment variable before prompting, supporting both:
- Local development (with manual input)
- Production environments (with pre-configured env vars)
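For environments where everything is pre-configured, one option is to read all settings from environment variables and fail early when something is missing. This is a sketch under assumptions: the helper `load_azure_settings` is hypothetical, and the variable names `AZURE_OPENAI_ENDPOINT` and `OPENAI_API_VERSION` are an assumed convention (only `AZURE_OPENAI_API_KEY` appears in this notebook).

```python
import os

def load_azure_settings(env=os.environ):
    """Collect Azure OpenAI settings from environment variables; fail early if any is missing."""
    required = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "api_key": env["AZURE_OPENAI_API_KEY"],
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
        # Optional: fall back to the API version used elsewhere in this notebook.
        "api_version": env.get("OPENAI_API_VERSION", "2024-12-01-preview"),
    }

# Demonstration with a fake environment (dummy values, not real credentials).
settings = load_azure_settings({
    "AZURE_OPENAI_API_KEY": "dummy-key",
    "AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com/",
})
print(sorted(settings))
```

The resulting dictionary could then be unpacked into the embeddings constructor alongside the `model` argument, which also keeps the endpoint URL out of the notebook source.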

In [21]:
import getpass
import os

if not os.environ.get("AZURE_OPENAI_API_KEY"):
    os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass("Enter API key for Azure: ")

from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    azure_endpoint="https://aoi-ext-eus-aiml-profx-01.openai.azure.com/",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    model="text-embedding-ada-002",
    api_version="2024-12-01-preview"
)

### Generate and Inspect Sample Embeddings

**Test the embeddings model** by converting two document chunks into vectors and examining the results.

### What this demonstrates:
1. **`embed_query()`**: Converts text to a vector embedding
2. **Consistency**: Both embeddings have the same dimensionality
3. **Vector Format**: Shows the actual numerical values

### Understanding the Output:
- **Vector Length**: Typically 1536 for text-embedding-ada-002
- **Vector Values**: Floating-point numbers (usually between -1 and 1)
- **First 10 Values**: Preview of the embedding (each dimension captures different semantic features)

### Key Insight:
Even though the text content is different, the embeddings are comparable (same length, same format). We can measure similarity between these vectors using mathematical operations like cosine similarity or Euclidean distance.

In [22]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 1536

[-0.00860656425356865, -0.03344116732478142, -0.009941618889570236, -0.0050745029002428055, 0.009079665876924992, 0.009442593902349472, -0.028230568394064903, -0.01646135002374649, 0.002953645773231983, -0.012832076288759708]
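Cosine similarity and Euclidean distance are closely related when vectors have unit length, which OpenAI documents for its embedding models: the squared distance equals `2 - 2 * cosine`. The sketch below checks this on two made-up vectors that are normalized first; with real data you could substitute `vector_1` and `vector_2`.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

# Made-up vectors, normalized so they behave like unit-length embeddings.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

cosine = sum(x * y for x, y in zip(a, b))  # dot product equals cosine for unit vectors
distance = math.dist(a, b)                 # Euclidean distance (Python 3.8+)

print(f"cosine similarity:  {cosine:.4f}")
print(f"euclidean distance: {distance:.4f}")
print(f"distance squared:   {distance ** 2:.4f}")
print(f"2 - 2 * cosine:     {2 - 2 * cosine:.4f}")
```

Because of this relationship, ranking unit-length vectors by highest cosine similarity or by lowest Euclidean distance gives the same order.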


## Step 4: Initialize the Vector Store

**Vector Stores** (or Vector Databases) are specialized databases optimized for storing and searching high-dimensional vectors.

### What is InMemoryVectorStore?
- **Type**: In-memory storage (not persisted to disk)
- **Speed**: Fast for development and prototyping
- **Scope**: Data exists only during the session
- **Best For**: Testing, small datasets, demonstrations

### How it Works:
- Stores document text alongside their embedding vectors
- Enables fast similarity searches using vector math
- Automatically uses the embeddings model we configured
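Conceptually, an in-memory vector store can be as small as a list of (text, vector) pairs plus a brute-force comparison. The sketch below is a hypothetical toy, not the `InMemoryVectorStore` source. Its `toy_embed` function counts keywords from a tiny invented vocabulary, so it is not semantic at all; it only stands in for a real embeddings model.

```python
import math

VOCAB = ("store", "revenue", "warehouse", "shoes")  # invented toy vocabulary

def toy_embed(text):
    """Hypothetical stand-in for an embeddings model: keyword counts per vocabulary word."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

class ToyVectorStore:
    """Brute-force store: keeps every (text, vector) pair in a list."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []

    def add_texts(self, texts):
        for text in texts:
            self.rows.append((text, self.embed(text)))

    def search(self, query, k=4):
        q = self.embed(query)
        # Compare the query against every stored vector, best matches first.
        scored = [(text, cosine(q, vec)) for text, vec in self.rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

store = ToyVectorStore(toy_embed)
store.add_texts([
    "revenue grew as shoes sold",
    "each warehouse ships to a store",
    "shoes shoes shoes",
])
hits = store.search("shoes revenue", k=2)
print(hits)
```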

### Production Alternatives:
For production use with large datasets or persistence requirements, consider:
- **Chroma**: Local, persistent vector database
- **Pinecone**: Managed cloud vector database
- **Azure AI Search**: Azure's vector search service
- **Weaviate**: Open-source vector database

In [23]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

## Step 5: Index Documents in the Vector Store

**Indexing** is the process of storing documents with their embeddings for later retrieval.

### What happens in `add_documents()`:
1. **For each document chunk**:
   - Send the text to the embeddings model
   - Receive back a 1536-dimensional vector
   - Store both the text and vector in the database
   - Generate a unique ID for tracking

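2. **Keep vectors searchable**:
   - `InMemoryVectorStore` holds the entries in a plain in-memory mapping rather than a specialized index
   - Each query is compared against every stored vector; dedicated vector databases add approximate nearest-neighbor indexes to speed this up on large collections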

### Performance Note:
This operation sends every chunk to Azure OpenAI for embedding (the client may group chunks into batched requests), so it may take some time depending on:
- Number of chunks
- API rate limits
- Network latency

### Output:
Returns a list of unique IDs for each stored document, confirming successful indexing.

**Your knowledge base is now ready for semantic search!**

In [24]:
ids = vector_store.add_documents(documents=all_splits)
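If rate limits become a problem, one option is to index in smaller batches and pause between them. The sketch below shows only the batching helper, which is hypothetical and runs on invented placeholder strings; the commented lines indicate how it might wrap `add_documents`, with the batch size of 100 chosen arbitrarily.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder data standing in for all_splits.
chunks = [f"chunk {i}" for i in range(10)]
batch_sizes = [len(batch) for batch in batched(chunks, 4)]
print(batch_sizes)

# Possible use with the real store (not executed here):
# ids = []
# for batch in batched(all_splits, 100):
#     ids.extend(vector_store.add_documents(documents=batch))
```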

## Step 6: Perform Your First Similarity Search

**Semantic search in action!** Let's query the system to find information about Nike's distribution centers.

### How Similarity Search Works:
1. **Convert Query**: The question is converted to an embedding vector
2. **Compare Vectors**: The system compares the query vector to all stored document vectors
3. **Calculate Similarity**: `InMemoryVectorStore` uses cosine similarity; other vector stores may use different similarity or distance metrics
4. **Rank Results**: Returns the most similar documents
5. **Default**: Returns top 4 most relevant chunks

### Query: "How many distribution centers does Nike have in the US?"

**Why this is powerful:**
- No exact keyword matching required
- Understands "distribution centers" relates to "facilities," "warehouses," etc.
- Captures semantic intent of the question
- Finds relevant content even with different wording

### Expected Output:
The most relevant document chunk containing information about Nike's distribution infrastructure.

In [25]:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])

page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213 
NIKE Brand in-line stores (including employee-only stores) 74 
Converse stores (including factory stores) 82 
TOTAL 369 
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': '../../data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 4, 'page_label': '5', 'start_index': 3125}


## Step 7: Async Similarity Search

**Asynchronous search** enables non-blocking operations, crucial for production applications.

### What is Async?
- **Traditional (Sync)**: Code waits for search to complete before continuing
- **Async**: Search runs in background, allowing other operations simultaneously
- **Use Cases**: Web applications, API endpoints, concurrent queries

### Benefits of Async Search:
- **Responsiveness**: UI remains interactive during search
- **Scalability**: Handle multiple search requests concurrently
- **Performance**: Better resource utilization in I/O-bound operations

### Query: "When was Nike incorporated?"

**Note**: In Jupyter notebooks, async functions are called with `await` and work seamlessly. In regular Python scripts, you'd need to use `asyncio.run()` or an event loop.
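As a sketch of the script case, the example below runs two searches concurrently with `asyncio.gather()` and starts the event loop with `asyncio.run()`. `fake_search` is a hypothetical stand-in for an awaitable call such as `vector_store.asimilarity_search(...)`, and the second query is invented for illustration.

```python
import asyncio

async def fake_search(query):
    """Hypothetical stand-in for an awaitable search call."""
    await asyncio.sleep(0.01)  # simulate waiting on network I/O
    return f"top result for: {query}"

async def main():
    queries = ["When was Nike incorporated?", "Where is Nike headquartered?"]
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_search(q) for q in queries))

# In a plain Python script, asyncio.run() creates and closes the event loop.
# Inside a notebook cell, use `await main()` instead.
answers = asyncio.run(main())
print(answers)
```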

**Expected Output:**
Document chunk containing information about Nike's founding/incorporation date.

In [11]:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'cr

## Step 8: Similarity Search with Confidence Scores

**Understanding search quality** by examining similarity scores alongside results.

### What are Similarity Scores?
- **Purpose**: Quantify how well each result matches the query
- **Range**: Depends on the distance metric used
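- **Interpretation**: For distance metrics, lower scores mean higher similarity; for similarity metrics, such as the cosine similarity returned by `InMemoryVectorStore`, higher scores mean higher similarity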

### Distance Metrics:
Different vector stores use different scoring methods:
- **Cosine Distance**: Measures angle between vectors (0 = identical, 2 = opposite)
- **Euclidean Distance**: Straight-line distance in vector space
- **Dot Product**: Inner product of vectors

### Why Scores Matter:
- **Filtering**: Set thresholds to exclude low-quality matches
- **Ranking**: Sort results by relevance
- **Confidence**: Determine if results are trustworthy
- **Debugging**: Identify when queries aren't matching well
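Threshold filtering can be sketched as follows. This assumes scores where higher means a closer match, as with cosine similarity; for a distance metric the comparison would be reversed. The helper name, the sample pairs, and the 0.70 cutoff are all hypothetical, and a useful threshold has to be tuned per model and dataset.

```python
def filter_by_score(scored_results, min_score):
    """Keep (item, score) pairs scoring at least min_score, best first.

    Assumes higher scores mean closer matches (similarity, not distance).
    """
    kept = [(item, score) for item, score in scored_results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Invented (text, score) pairs shaped like similarity_search_with_score results.
scored = [("store counts", 0.74), ("revenue table", 0.88), ("legal notice", 0.61)]
confident = filter_by_score(scored, min_score=0.70)
print(confident)
```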

### Query: "What was Nike's revenue in 2023?"

**Expected Output:**
- The most relevant document chunk
- A numerical score indicating match quality

In [12]:
# Note that providers implement different scores; InMemoryVectorStore
# returns cosine similarity, so a higher score means a closer match.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.8807336730352323

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTSThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
• NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
• NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metad

## Step 9: Search with Pre-computed Embeddings

**Advanced technique**: Search using vectors directly instead of text queries.

### Why Use Pre-computed Embeddings?

**Scenario 1 - Performance Optimization:**
- Generate query embedding once
- Reuse it for multiple searches
- Reduces API calls and latency

**Scenario 2 - Advanced Workflows:**
- Search with modified/combined embeddings
- Implement custom similarity logic
- Build hybrid search systems

**Scenario 3 - Cross-modal Search:**
- Search documents using image embeddings (requires a multimodal model that maps images and text into the same vector space)
- Find similar concepts across different data types

### The Two-Step Process:
1. **`embed_query()`**: Convert text to embedding vector
2. **`similarity_search_by_vector()`**: Search using the vector directly

### Query: "How were Nike's margins impacted in 2023?"

**Use Case:**
This pattern is especially useful when:
- Building search APIs (cache embeddings)
- Implementing recommendation systems
- Creating multi-step search pipelines

**Expected Output:**
Same quality results as text-based search, but with more control over the embedding process.

In [15]:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
• Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
• Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
• Unfavorable changes in net foreign currency exchange rates, including hedges; and
• Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:'