# Semantic Search with LangChain and Azure OpenAI

This notebook demonstrates how to build a semantic search system using vector embeddings. Unlike traditional keyword-based search, semantic search understands the **meaning** of queries and documents, enabling more intelligent and context-aware information retrieval.

## What is Semantic Search?

Semantic search uses machine learning models to convert text into numerical vectors (embeddings) that capture semantic meaning. Documents with similar meanings will have similar vector representations, allowing us to find relevant information based on conceptual similarity rather than just keyword matching.
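
To make "similar meanings, similar vectors" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are hand-picked illustrative values, not output from a real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" chosen by hand for illustration only.
toy_vectors = {
    "warehouse": [0.9, 0.1, 0.0],
    "distribution center": [0.8, 0.2, 0.1],
    "quarterly dividend": [0.0, 0.2, 0.9],
}

query = toy_vectors["warehouse"]
for text, vector in toy_vectors.items():
    print(f"{text!r}: {cosine_similarity(query, vector):.3f}")
```

In this toy data, "distribution center" scores much closer to "warehouse" than "quarterly dividend" does, even though the two phrases share no words.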

## Workflow Overview

1. **Load Documents**: Import PDF documents into a processable format
2. **Split Text**: Divide large documents into smaller, focused chunks
3. **Generate Embeddings**: Convert text chunks into vector representations
4. **Store Vectors**: Index embeddings in a vector database
5. **Search**: Query the system to find semantically similar content

Let's dive into each step!

## Step 1: Loading Documents

**Document loading** is the foundation of any information retrieval system. We'll load a PDF file containing Nike's 10-K financial report from 2023.

### Why PDFs?
- Common format for reports, papers, and documentation
- Contains structured and unstructured data
- Requires specialized parsing to extract text accurately

### Your Task:
Import the `PyPDFLoader` class from LangChain and load the PDF document located at `../../data/nke-10k-2023.pdf`

**Steps:**
1. Import `PyPDFLoader` from `langchain_community.document_loaders`
2. Create a variable `file_path` with the value `"../../data/nke-10k-2023.pdf"`
3. Create a loader instance: `loader = PyPDFLoader(file_path)`
4. Call the `load()` method to extract all pages: `docs = loader.load()`
5. Print the number of documents loaded: `print(len(docs))`

**Expected Output:** A number equal to the total pages in the PDF (roughly 100-120 for this filing)

In [None]:
# TODO: Import PyPDFLoader from langchain_community.document_loaders


# TODO: Set file_path variable to "../../data/nke-10k-2023.pdf"


# TODO: Create a PyPDFLoader instance


# TODO: Load the documents using the load() method


# TODO: Print the number of documents loaded


### Understanding Document Structure

Each `Document` object created by `PyPDFLoader` contains:
- **`page_content`**: The string content of the page (all text extracted)
- **`metadata`**: A dictionary with:
  - `source`: File path to the original PDF
  - `page`: Page number (0-indexed)
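
As a rough picture of that structure, the sketch below uses a stand-in dataclass with the same two fields. `PageDocument` is a hypothetical name, not LangChain's `Document` class, and the sample text is invented. It also shows the off-by-one conversion from the 0-indexed `page` value to the page number a reader would see.

```python
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """Hypothetical stand-in with the same two fields described above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Invented sample values for illustration.
first_page = PageDocument(
    page_content="Sample text extracted from the first page of a report.",
    metadata={"source": "../../data/nke-10k-2023.pdf", "page": 0},
)

# `page` is 0-indexed, so add 1 to get the human-readable page number.
human_page_number = first_page.metadata["page"] + 1
print(first_page.page_content[:20])
print(f"{first_page.metadata['source']} (page {human_page_number})")
```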

### Your Task:
Inspect the first document to understand its structure.

**Steps:**
1. Print the first 200 characters of the first document's content: `print(f"{docs[0].page_content[:200]}\n")`
2. Print the metadata of the first document: `print(docs[0].metadata)`

**Expected Output:** 
- A preview of text from the first page
- Metadata showing the source file path and page number (0)

In [None]:
# TODO: Print the first 200 characters of the first document's page_content


# TODO: Print the metadata of the first document


## Step 2: Document Splitting Strategy

### Why Split Documents?

**Problem**: A full PDF page is often too large and contains multiple topics.
- Mixing different concepts in one chunk dilutes the semantic meaning
- Large chunks make it harder to pinpoint specific information
- Retrieval accuracy suffers when relevant details are buried in irrelevant context

**Solution**: Split documents into smaller, focused chunks.

### Your Task:
Use `RecursiveCharacterTextSplitter` to split documents into manageable chunks.

**Steps:**
1. Import `RecursiveCharacterTextSplitter` from `langchain_text_splitters`
2. Create a text splitter with these parameters:
   - `chunk_size=1000` (target 1000 characters per chunk)
   - `chunk_overlap=200` (overlap consecutive chunks by 200 characters)
   - `add_start_index=True` (track original position)
3. Split the documents: `all_splits = text_splitter.split_documents(docs)`
4. Print the number of chunks created: `print(len(all_splits))`
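
To see what `chunk_size`, `chunk_overlap`, and `add_start_index` mean in practice, here is a simplified fixed-width sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural boundaries); `sliding_chunks` is a hypothetical helper that only illustrates the arithmetic.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut text into fixed-width windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk["start_index"], chunk["text"])
```

Each window repeats the last four characters of the previous one, which is the role overlap plays: a sentence that straddles a boundary still appears whole in at least one chunk.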

**Why "Recursive"?** The splitter tries to split on natural boundaries in this order:
1. Paragraphs (double newlines)
2. Lines (single newlines)
3. Words (spaces)
4. Characters (as last resort)
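
The fallback order listed above can be sketched as a short recursive function. This is a toy illustration, not LangChain's implementation: unlike the real splitter, the hypothetical `recursive_split` below does not merge small neighbouring pieces back together or add overlap, and it assumes the separator list ends with the empty string.

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ", "")) -> list[str]:
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators before finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    separator, *finer = separators
    if separator == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(separator):
        # Parts that are still too long are split again with the next separator.
        pieces.extend(recursive_split(part, chunk_size, tuple(finer)))
    return pieces


text = "Intro paragraph.\n\nLine one of section\nLine two of section\n\nEnd"
for piece in recursive_split(text, chunk_size=20):
    print(repr(piece))
```

The paragraph that fits within the limit is kept whole, while the oversized middle paragraph is broken on its single newline.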

**Expected Output:** A number much larger than the original page count (typically 500-800 chunks)

In [None]:
# TODO: Import RecursiveCharacterTextSplitter from langchain_text_splitters


# TODO: Create a RecursiveCharacterTextSplitter with chunk_size=1000, chunk_overlap=200, add_start_index=True


# TODO: Split the documents using split_documents() method


# TODO: Print the number of chunks created


## Step 3: Configure Azure OpenAI Embeddings

**Embeddings** are the heart of semantic search - they convert text into numerical vectors that capture meaning.

### What are Embeddings?
- **Input**: Text string (query or document)
- **Output**: Vector of numbers (e.g., 1536 dimensions for text-embedding-ada-002)
- **Property**: Similar meanings → Similar vectors

### Your Task:
Set up Azure OpenAI embeddings to convert text into vectors.

**Steps:**
1. Import necessary modules:
   - `getpass` and `os` for secure API key handling
   - `AzureOpenAIEmbeddings` from `langchain_openai`
2. Check if `AZURE_OPENAI_API_KEY` exists in environment variables
3. If not, prompt the user to enter it securely using `getpass.getpass()`
4. Create an `AzureOpenAIEmbeddings` instance with:
   - `azure_endpoint="https://aoi-ext-eus-aiml-profx-01.openai.azure.com/"`
   - `api_key=os.environ["AZURE_OPENAI_API_KEY"]`
   - `model="text-embedding-ada-002"`
   - `api_version="2024-12-01-preview"`

**Security Note:** Never hardcode API keys! Always use environment variables or secure input methods.
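
The key-handling pattern from the steps above can be sketched as a small helper. `ensure_api_key` is a hypothetical name; its `prompt` parameter defaults to `getpass.getpass`, which reads input without echoing it, and can be swapped out so the logic can be exercised without a terminal.

```python
import getpass
import os
from typing import Callable


def ensure_api_key(
    name: str = "AZURE_OPENAI_API_KEY",
    prompt: Callable[[str], str] = getpass.getpass,
) -> str:
    """Return the key from the environment, prompting only if it is missing."""
    if not os.environ.get(name):
        # getpass hides the typed characters, so the key never shows on screen.
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]
```

Calling `ensure_api_key()` once near the top of a notebook would leave the key in `os.environ` for later cells without it ever appearing in the saved file.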

**Expected Output:** No output, but the `embeddings` object will be ready to use

In [None]:
# TODO: Import getpass and os modules


# TODO: Check if AZURE_OPENAI_API_KEY exists in environment, if not, prompt for it


# TODO: Import AzureOpenAIEmbeddings from langchain_openai


# TODO: Create an AzureOpenAIEmbeddings instance with the required parameters


### Test the Embeddings Model

Let's verify the embeddings are working by converting two document chunks into vectors.

### Your Task:
Generate embeddings for the first two document chunks and verify they work correctly.

**Steps:**
1. Use `embeddings.embed_query()` to convert the first chunk's content to a vector: `vector_1 = embeddings.embed_query(all_splits[0].page_content)`
2. Do the same for the second chunk: `vector_2 = embeddings.embed_query(all_splits[1].page_content)`
3. Verify both vectors have the same length using an assertion: `assert len(vector_1) == len(vector_2)`
4. Print the vector length: `print(f"Generated vectors of length {len(vector_1)}\n")`
5. Print the first 10 values of vector_1: `print(vector_1[:10])`

**Expected Output:** 
- Message showing vectors of length 1536
- A list of 10 floating-point numbers representing the first dimensions of the embedding

In [None]:
# TODO: Generate embedding vector for the first chunk


# TODO: Generate embedding vector for the second chunk


# TODO: Assert that both vectors have the same length


# TODO: Print the vector length


# TODO: Print the first 10 values of vector_1


## Step 4: Initialize the Vector Store

**Vector Stores** are specialized databases optimized for storing and searching high-dimensional vectors.

### What is InMemoryVectorStore?
- **Type**: In-memory storage (not persisted to disk)
- **Speed**: Fast for development and prototyping
- **Scope**: Data exists only during the session
- **Best For**: Testing, small datasets, demonstrations

### Your Task:
Create an in-memory vector store to hold our document embeddings.

**Steps:**
1. Import `InMemoryVectorStore` from `langchain_core.vectorstores`
2. Create a vector store instance: `vector_store = InMemoryVectorStore(embeddings)`

**Production Alternatives:**
For production use with large datasets or persistence requirements, consider:
- **Chroma**: Local, persistent vector database
- **Pinecone**: Managed cloud vector database
- **Azure AI Search**: Azure's vector search service
- **Weaviate**: Open-source vector database

**Expected Output:** No output, but the vector store is now ready to store embeddings

In [None]:
# TODO: Import InMemoryVectorStore from langchain_core.vectorstores


# TODO: Create an InMemoryVectorStore instance with the embeddings object


## Step 5: Index Documents in the Vector Store

**Indexing** is the process of storing documents with their embeddings for later retrieval.

### What happens during indexing:
1. **For each document chunk**:
   - Send the text to the embeddings model
   - Receive back a 1536-dimensional vector
   - Store both the text and vector in the database
   - Generate a unique ID for tracking

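2. **Build the index**:
   - Organize vectors so they can be searched by similarity
   - `InMemoryVectorStore` simply keeps the vectors in memory and compares the query against each one; dedicated vector databases add data structures for fast approximate nearest-neighbor lookup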
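
The indexing loop can be pictured with a short stdlib-only sketch. Everything here is a stand-in: `toy_embed` is a hypothetical function that counts vowels instead of calling an embeddings model, and the plain dictionary only mimics the idea of an in-memory store.

```python
import uuid


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embeddings model: counts of each vowel."""
    lowered = text.lower()
    return [float(lowered.count(letter)) for letter in "aeiou"]


def index_chunks(chunks: list[str]) -> dict[str, dict]:
    """Embed each chunk and store its text and vector under a fresh unique ID."""
    store = {}
    for chunk in chunks:
        doc_id = str(uuid.uuid4())
        store[doc_id] = {"text": chunk, "vector": toy_embed(chunk)}
    return store


store = index_chunks(["Nike sells footwear.", "Revenue grew in 2023."])
for doc_id, entry in store.items():
    print(doc_id, entry["vector"], entry["text"])
```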

### Your Task:
Add all document chunks to the vector store.

**Steps:**
1. Call `vector_store.add_documents()` with the `all_splits` list: `ids = vector_store.add_documents(documents=all_splits)`
2. The method returns a list of unique IDs for each stored document

**Performance Note:**
This operation sends every chunk to Azure OpenAI for embedding (the client groups chunks into batched requests), so it may take some time depending on:
- Number of chunks
- API rate limits
- Network latency
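
One common way to keep the number of requests down is to embed chunks in batches rather than one at a time. The sketch below shows only the grouping step; `batched` is a hypothetical helper and the batch size of 16 is an arbitrary example, not a documented Azure limit.

```python
from typing import Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items, each holding at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = [f"chunk {i}" for i in range(50)]  # placeholder texts
batches = list(batched(chunks, batch_size=16))
print(f"{len(chunks)} chunks -> {len(batches)} requests")
```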

**Expected Output:** The indexing will complete (may take 1-2 minutes), and you'll have a searchable knowledge base!

**⚠️ Note:** This step may take a while to complete. Be patient!

In [None]:
# TODO: Add all document chunks to the vector store using add_documents()
# This will take some time as it needs to generate embeddings for all chunks


## Step 6: Perform Your First Similarity Search

**Semantic search in action!** Let's query the system to find information about Nike's distribution centers.

### How Similarity Search Works:
1. **Convert Query**: The question is converted to an embedding vector
2. **Compare Vectors**: The system compares the query vector to all stored document vectors
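3. **Calculate Similarity**: Scores each stored vector against the query vector (`InMemoryVectorStore` uses cosine similarity; other stores may use a distance metric)
4. **Rank Results**: Returns the most similar documents
5. **Default**: Returns the top 4 chunks (`k=4`) unless you pass a different `k`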
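
Steps 2 to 5 amount to scoring every stored vector against the query and keeping the best `k`. Here is a minimal brute-force sketch, with hand-made three-dimensional vectors standing in for real embeddings and `top_k` as a hypothetical helper name:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vector, stored, k=4):
    """Score every (text, vector) pair against the query and return the k best."""
    scored = [(text, cosine(query_vector, vector)) for text, vector in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:k]


# Invented vectors for illustration; a real store would hold model embeddings.
stored = [
    ("distribution centers", [0.9, 0.1, 0.0]),
    ("retail stores", [0.6, 0.4, 0.1]),
    ("share repurchases", [0.1, 0.2, 0.9]),
    ("warehouse leases", [0.8, 0.3, 0.0]),
    ("board of directors", [0.0, 0.9, 0.3]),
]

for text, score in top_k([1.0, 0.1, 0.0], stored, k=4):
    print(f"{score:.3f}  {text}")
```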

### Your Task:
Search for information about Nike's distribution centers.

**Steps:**
1. Call `vector_store.similarity_search()` with the query: `"How many distribution centers does Nike have in the US?"`
2. Store the results: `results = vector_store.similarity_search("How many distribution centers does Nike have in the US?")`
3. Print the first result: `print(results[0])`

**Why this is powerful:**
- No exact keyword matching required
- Understands "distribution centers" relates to "facilities," "warehouses," etc.
- Captures semantic intent of the question
- Finds relevant content even with different wording

**Expected Output:** The most relevant document chunk containing information about Nike's distribution infrastructure

In [None]:
# TODO: Perform a similarity search with the query about distribution centers


# TODO: Print the first result


## Step 7: Async Similarity Search

**Asynchronous search** enables non-blocking operations, crucial for production applications.

### What is Async?
- **Traditional (Sync)**: Code waits for search to complete before continuing
- **Async**: `await` hands control back to the event loop while the search is in flight, so other tasks can run in the meantime
- **Use Cases**: Web applications, API endpoints, concurrent queries

### Your Task:
Perform an asynchronous similarity search.

**Steps:**
1. Use `await` with `vector_store.asimilarity_search()` to search for: `"When was Nike incorporated?"`
2. Store the results: `results = await vector_store.asimilarity_search("When was Nike incorporated?")`
3. Print the first result: `print(results[0])`

**Benefits of Async Search:**
- **Responsiveness**: UI remains interactive during search
- **Scalability**: Handle multiple search requests concurrently
- **Performance**: Better resource utilization in I/O-bound operations

**Note**: In Jupyter notebooks, async functions work seamlessly with `await`. In regular Python scripts, you'd need to use `asyncio.run()` or an event loop.
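
Outside a notebook, the same idea can be sketched with `asyncio.run()` and `asyncio.gather()`. To stay self-contained, `fake_search` is a hypothetical coroutine that sleeps briefly instead of querying a vector store; in this notebook you would await `vector_store.asimilarity_search(...)` instead.

```python
import asyncio
import time


async def fake_search(query: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an async search: waits, then returns a label."""
    await asyncio.sleep(delay)  # simulates waiting on network I/O
    return f"results for {query!r}"


async def main() -> list[str]:
    queries = ["When was Nike incorporated?", "What was Nike's revenue in 2023?"]
    # gather() schedules both coroutines and awaits them concurrently.
    return await asyncio.gather(*(fake_search(q) for q in queries))


started = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - started

print(results)
print(f"elapsed: {elapsed:.2f}s")  # close to one delay, not the sum of both
```

Inside Jupyter an event loop is already running, so `asyncio.run()` raises an error there; use `await main()` in a cell instead.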

**Expected Output:** Document chunk containing information about Nike's founding/incorporation date

In [None]:
# TODO: Perform an async similarity search about Nike's incorporation


# TODO: Print the first result


## Step 8: Similarity Search with Confidence Scores

**Understanding search quality**: examine similarity scores alongside the results to judge how well each one matches.

### What are Similarity Scores?
- **Purpose**: Quantify how well each result matches the query
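- **Range**: Depends on the metric the vector store uses
- **Interpretation**: Depends on the metric: for similarity measures such as cosine similarity (used by `InMemoryVectorStore`), higher scores = closer match; for distance metrics, lower scores = closer match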

### Your Task:
Search with scores to understand result quality.

**Steps:**
1. Use `vector_store.similarity_search_with_score()` to search for: `"What was Nike's revenue in 2023?"`
2. Store the results: `results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")`
3. Extract the first document and score: `doc, score = results[0]`
4. Print the score: `print(f"Score: {score}\n")`
5. Print the document: `print(doc)`

### Why Scores Matter:
- **Filtering**: Set thresholds to exclude low-quality matches
- **Ranking**: Sort results by relevance
- **Confidence**: Determine if results are trustworthy
- **Debugging**: Identify when queries aren't matching well

**Note:** Providers implement different scores. `InMemoryVectorStore` returns cosine similarity, so a higher score means a closer match; stores that return a distance metric work the other way around (lower = more similar).
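
Threshold filtering can be sketched as a small helper over `(document, score)` pairs. Because the direction of the score depends on the vector store, the hypothetical `filter_by_score` function takes a `higher_is_better` flag. The scores and the 0.75 cutoff below are invented for illustration and would need tuning on real data.

```python
def filter_by_score(results, threshold, higher_is_better=True):
    """Keep only the (document, score) pairs that pass the threshold."""
    if higher_is_better:
        return [(doc, score) for doc, score in results if score >= threshold]
    return [(doc, score) for doc, score in results if score <= threshold]


# Invented (text, score) pairs standing in for similarity_search_with_score output.
results = [
    ("Revenue table for fiscal 2023", 0.86),
    ("Gross margin discussion", 0.79),
    ("List of executive officers", 0.41),
]

kept = filter_by_score(results, threshold=0.75, higher_is_better=True)
print([doc for doc, _ in kept])
```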

**Expected Output:** 
- A numerical score indicating match quality
- The most relevant document chunk about Nike's revenue

In [None]:
# TODO: Perform a similarity search with scores for Nike's revenue query


# TODO: Extract the first document and its score from the results


# TODO: Print the score


# TODO: Print the document


## Step 9: Search with Pre-computed Embeddings

**Advanced technique**: Search using vectors directly instead of text queries.

### Why Use Pre-computed Embeddings?

**Scenario 1 - Performance Optimization:**
- Generate query embedding once
- Reuse it for multiple searches
- Reduces API calls and latency

**Scenario 2 - Advanced Workflows:**
- Search with modified/combined embeddings
- Implement custom similarity logic
- Build hybrid search systems

**Scenario 3 - Cross-modal Search:**
- Search documents using image embeddings
- Find similar concepts across different data types

### Your Task:
Perform a two-step search using pre-computed embeddings.

**Steps:**
1. Generate an embedding for the query: `embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")`
2. Search using the vector: `results = vector_store.similarity_search_by_vector(embedding)`
3. Print the first result: `print(results[0])`

**Use Case:**
This pattern is especially useful when:
- Building search APIs (cache embeddings)
- Implementing recommendation systems
- Creating multi-step search pipelines

**Expected Output:** Same quality results as text-based search, but with more control over the embedding process

In [None]:
# TODO: Generate an embedding for the query about Nike's margins


# TODO: Perform a similarity search using the pre-computed embedding vector


# TODO: Print the first result


## Congratulations! 🎉

You've successfully built a semantic search system using LangChain and Azure OpenAI! 

### What You've Learned:
- ✅ Load and parse PDF documents
- ✅ Split large documents into meaningful chunks
- ✅ Generate embeddings using Azure OpenAI
- ✅ Store embeddings in a vector database
- ✅ Perform semantic similarity searches
- ✅ Use both synchronous and asynchronous search methods
- ✅ Understand similarity scores and their importance
- ✅ Work with pre-computed embeddings for advanced use cases

### Next Steps:
- Experiment with different chunk sizes and overlap values
- Try different queries to test the semantic understanding
- Explore other vector stores like Chroma or Pinecone
- Move on to the RAG notebook to build a complete question-answering system!

### Challenge Exercises:
1. Try searching for "Nike's sustainability efforts" and see what you find
2. Experiment with chunk_size values (500, 1500, 2000) and compare results
3. Search for the same query using all three methods: `similarity_search()`, `asimilarity_search()`, and `similarity_search_by_vector()`