## **Section 1: Import Required Libraries and Modules**

This section imports the libraries the pipeline uses for vector storage, retrieval, text processing, and web scraping. Each one plays a specific role:

- **Milvus**: Manages vector-based storage and retrieval.
  - `connections`, `utility`, `Collection`, etc., support Milvus operations like connecting, creating schemas, and managing collections.
  
- **LangChain Community Embeddings**: 
  - `HuggingFaceEmbeddings`: Converts text data into vector embeddings for efficient processing and retrieval.
  
- **LangChain Core Utilities**:
  - `StrOutputParser`: Formats the output of language models as strings.
  - `PromptTemplate`: Structures prompts for consistent and accurate model responses.
  - `RunnablePassthrough`: Passes data through various steps without modifications, supporting flexible workflows.

- **Milvus Hybrid Search Retriever**:
  - `MilvusCollectionHybridSearchRetriever` and `BM25SparseEmbedding`: Combines dense and sparse search capabilities for optimized retrieval.

- **ChatMistralAI**:
  - Provides a conversational interface with Mistral’s language model for interactive querying.

- **Web Scraping Tools**:
  - `requests` and `BeautifulSoup`: Make HTTP requests and parse HTML content for extracting data from web pages.

- **Environment and Utility Modules**:
  - `nltk`: Tokenizes text, preparing it for NLP processing.
  - `os` and `dotenv`: Manage environment variables for secure configurations.
  - `urljoin`: Joins URLs for dynamic web scraping.

With these imports in place, the notebook can scrape text from the web, embed it, store the embeddings in Milvus, and query them through a language model.

In [None]:
# Import Milvus utilities for connections, schema creation, and managing collections.
from pymilvus import connections, utility, Collection, CollectionSchema, FieldSchema, DataType

# Import HuggingFace embeddings to convert text into vector representations.
from langchain_community.embeddings import HuggingFaceEmbeddings  # Adjust if necessary.

# Import parser to ensure the model outputs are formatted as strings.
from langchain_core.output_parsers import StrOutputParser

# Import PromptTemplate to structure and format prompts for the language model.
from langchain_core.prompts import PromptTemplate

# Import RunnablePassthrough to allow passing data unchanged through multiple steps.
from langchain_core.runnables import RunnablePassthrough

# Import Milvus to interact with the vector database for storage and retrieval.
from langchain_milvus import Milvus

# Import hybrid search retriever from Milvus, combining dense and sparse search techniques.
from langchain_milvus.retrievers import MilvusCollectionHybridSearchRetriever

# Import BM25 for sparse embedding, typically used for keyword-based search.
from langchain_milvus.utils.sparse import BM25SparseEmbedding

# Import ChatMistralAI to enable conversational interaction with the Mistral AI model.
from langchain_mistralai.chat_models import ChatMistralAI

# Import BeautifulSoup to parse and extract text content from HTML web pages.
from bs4 import BeautifulSoup

# Import requests to make HTTP requests for web scraping or API calls.
import requests

# Import nltk (Natural Language Toolkit) for tokenizing text into sentences.
import nltk

# Import os for interacting with the operating system, such as handling environment variables.
import os

# Import urljoin to join relative and absolute URLs for web scraping.
from urllib.parse import urljoin

# Import load_dotenv to load environment variables from a .env file into the program.
from dotenv import load_dotenv

## **Section 2: Load Environment Variables and Setup NLTK**

This section sets up essential configurations for secure access and text processing:

- **Environment Variables**: Loads configuration settings, like API keys, from a `.env` file using `load_dotenv()`.

- **NLTK Tokenizer**: Downloads the `punkt` tokenizer data so that NLTK can split text into sentences. Newer NLTK releases look for the `punkt_tab` resource instead, so the cell below requests both.

In [None]:
# Load environment variables from a .env file to access configuration settings, such as API keys.
load_dotenv()

# Download the 'punkt' tokenizer data, which allows nltk to split text into sentences.
nltk.download('punkt')

# Newer NLTK releases use 'punkt_tab' for sentence tokenization; requesting both is harmless.
nltk.download('punkt_tab')

## **Section 3: Configure Mistral API Key**

This section reads the Mistral API key from the environment so that it never has to appear in the source code. Keeping keys out of the codebase reduces the risk of leaking them through version control or shared notebooks.

- **Environment Variables**: 
  - The Mistral API key is retrieved from the environment variables. This is accomplished using the `os.getenv()` method, which fetches the value associated with the `MISTRAL_API_KEY` environment variable. Storing sensitive information like API keys in environment variables helps to keep them secure and out of the source code.
  
- **Setting the API Key**: 
  - Once retrieved, the API key is set as an environment variable using `os.environ`. This makes the API key accessible throughout the code, allowing seamless interaction with Mistral services without hardcoding the key in the codebase.


In [None]:
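# Retrieve the Mistral API key from the environment variables using os.getenv().
mistral_api_key = os.getenv("MISTRAL_API_KEY")

# Fail early with a clear message: os.getenv() returns None when the variable is missing,
# and assigning None to os.environ would raise a less helpful TypeError.
if not mistral_api_key:
    raise ValueError("MISTRAL_API_KEY is not set. Add it to your .env file or environment.")

# Set the Mistral API key as an environment variable so it can be accessed throughout the code.
os.environ["MISTRAL_API_KEY"] = mistral_api_key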

## **Section 4: Connect to Milvus and Setup Collection Schema**

In this section, we will establish a connection to the Milvus database and define the schema for a collection that will store embeddings. Milvus is a vector database designed for efficient similarity search and analytics of unstructured data.

- **Establish Connection**: 
  - We will connect to the Milvus server running on `localhost` at the default port `19530`. This allows our application to communicate with the Milvus instance for data operations.

- **Check for Existing Collection**: 
  - Before creating a new collection, we check if a collection named `my_collection` already exists. If it does, we drop the existing collection to avoid duplicate entries and ensure a clean slate for data insertion.

- **Define Collection Schema**: 
  - The schema for the collection is defined using `FieldSchema` objects. The schema includes:
    - An `id` field of type `INT64`, which serves as the primary key for the collection.
    - An `embedding` field of type `FLOAT_VECTOR` with a dimensionality of `384`, which matches the output size of the `sentence-transformers/all-MiniLM-L6-v2` model used in Section 5. (BERT-base-style models produce `768`-dimensional vectors and would need `dim=768` instead.)

- **Create the Collection**: 
  - Finally, we create the collection in Milvus with the defined schema, allowing us to store and retrieve vector embeddings efficiently.

In [None]:
# Establish a connection to Milvus, using localhost as the host and default port 19530.
connections.connect(alias="default", host="localhost", port="19530")

# Check if the collection "my_collection" already exists in Milvus.
if utility.has_collection("my_collection"):
    # If the collection exists, drop it to avoid duplicate entries.
    utility.drop_collection("my_collection")

# Define fields for the Milvus collection schema. The ID field acts as the primary key.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),  # Integer primary key field.
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384)  # Must match the embedding model's output size (384 for all-MiniLM-L6-v2).
]

# Create a collection schema using the defined fields.
schema = CollectionSchema(fields, description="My Collection for Embeddings")

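# Initialize the collection with the name "my_collection" and the created schema.
collection = Collection(name="my_collection", schema=schema)

# Milvus needs an index on the vector field and a loaded collection before it can search.
# FLAT with inner product is an assumed, simple choice for a small demo collection.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "FLAT", "metric_type": "IP", "params": {}},
)
collection.load()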

## **Section 5: Generate Embeddings and Insert Data into Milvus**

In this section, we will generate vector embeddings for textual data and insert these embeddings into the Milvus collection we created in the previous section. This allows for efficient storage and retrieval of high-dimensional data.

- **Initialize the Embeddings Model**: 
  - We will use the Hugging Face embeddings model `sentence-transformers/all-MiniLM-L6-v2`, which is optimized for generating vector representations from text. This model converts sentences into dense vector embeddings that can be easily compared and analyzed.

- **Define Sample Data**: 
  - For demonstration purposes, we define a list of sample IDs corresponding to the text data we want to embed. In this example, we have three IDs: `[1, 2, 3]`.

- **Generate Embeddings**: 
  - We generate embeddings for three sample sentences: "Hello world", "How are you?", and "Goodbye". The `embed_documents` method of the embeddings model is used to produce the corresponding vector representations.

- **Insert Data into Milvus**: 
  - Finally, we insert the generated embeddings along with their corresponding IDs into the Milvus collection. This step ensures that the embeddings are stored and can be accessed for future retrieval and analysis.

In [None]:
# Initialize the HuggingFace embeddings model to generate vector representations from text.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Define some sample data to insert into the Milvus collection.
ids = [1, 2, 3]  # List of IDs for the data points.

# Generate embeddings for three sample sentences.
vectors = embeddings.embed_documents(["Hello world", "How are you?", "Goodbye"])

# Insert the IDs and corresponding vector embeddings into the Milvus collection.
collection.insert([ids, vectors])
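
Milvus rejects vectors whose length differs from the `dim` declared for the vector field, so it can help to check the embeddings before inserting them. The sketch below is one possible pre-insert check; `check_dimensions` is a hypothetical helper and the toy vectors stand in for real embeddings.

```python
def check_dimensions(vectors, expected_dim):
    """Return the number of vectors, raising ValueError on any length mismatch."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, expected {expected_dim}"
            )
    return len(vectors)


# Toy 4-dimensional vectors standing in for real embeddings (illustrative only).
toy_vectors = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
checked = check_dimensions(toy_vectors, expected_dim=4)
print(f"{checked} vectors passed the dimension check")
```

In the notebook, the same check could be run on `vectors` with `expected_dim` set to whatever the schema's `embedding` field declares.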

## **Section 6: Load the Mistral Chat Model**

In this section, we will load the Mistral Chat model, which allows us to utilize its capabilities for generating responses in a conversational format. This model can be integrated into applications that require natural language understanding and response generation.

- **Load the Chat Model**: 
  - We will instantiate the `ChatMistralAI` model, passing the Mistral API key for authorization. This key is essential for authenticating requests made to the Mistral service, ensuring that only authorized applications can access the model's features.


In [None]:
# Load the ChatMistralAI model using the Mistral API key for authorization.
chat_model = ChatMistralAI(api_key=mistral_api_key)

## **Section 7: Configure the Milvus Hybrid Search Retriever**

In this section, we set up a hybrid search retriever over the Milvus collection. Dense search captures semantic similarity, while sparse search rewards exact keyword overlap; combining the two tends to surface results that either method alone would miss.

- **Hybrid Search Retriever**: 
  - The hybrid search retriever integrates different retrieval techniques, enabling efficient searching across various types of embeddings. In this case, we will utilize both dense embeddings generated from the Hugging Face model and sparse embeddings based on the BM25 algorithm, which is effective for keyword-based searches.

- **Milvus Collection**: 
  - We will specify the Milvus collection that we previously created. This collection will serve as the source of the data being searched.

- **Embeddings**: 
  - We will use the Hugging Face embeddings for semantic search, allowing us to retrieve documents based on the meaning and context of the queries.

- **BM25 Sparse Embedding**: 
  - For keyword-based search, we use the BM25 algorithm, which scores documents by term frequency, inverse document frequency, and document length. This improves retrieval of documents that contain the exact query keywords.

- **Top K Results**: 
  - We will configure the retriever to return the top 2 most relevant documents based on the hybrid search criteria.
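
To make the BM25 bullet more concrete, here is a minimal pure-Python sketch of BM25 scoring. It is not what `BM25SparseEmbedding` does internally (that class produces sparse vectors for Milvus); the whitespace tokenization, the `k1=1.5` and `b=0.75` defaults, the Lucene-style IDF variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization dampens the score of long documents.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "milvus is a vector database",
    "bm25 ranks documents by keyword overlap",
    "vector search finds similar embeddings",
]
scores = bm25_scores("vector database", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The first document matches both query terms and therefore ranks highest, while the second matches none and scores zero.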


In [None]:
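# Set up a hybrid search retriever over the Milvus collection.
# Assumptions: BM25 has to be fitted on a corpus (the Section 5 samples are reused here), and
# the collection needs a sparse vector field ("sparse_vector") and a text field ("text"),
# which the two-field schema from Section 4 would have to be extended with.
from pymilvus import WeightedRanker
sparse_embeddings = BM25SparseEmbedding(corpus=["Hello world", "How are you?", "Goodbye"])
retriever = MilvusCollectionHybridSearchRetriever(
    collection=collection,  # Specify the Milvus collection to search.
    rerank=WeightedRanker(0.5, 0.5),  # Weight the dense and sparse scores equally.
    anns_fields=["embedding", "sparse_vector"],  # Dense field first, then the sparse field.
    field_embeddings=[embeddings, sparse_embeddings],  # One embedding function per field.
    top_k=2,  # Return the top 2 most relevant documents.
    text_field="text",  # Field whose value becomes each document's page_content.
)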
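
For intuition about how a dense ranking and a sparse ranking can be merged into one result list, the sketch below uses reciprocal rank fusion (RRF), a common fusion technique. It is an illustration only, not the retriever's internal logic; the document IDs and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs; higher fused score means more relevant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank), so top positions count the most.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from a dense (semantic) and a sparse (keyword) search.
dense_ranking = ["doc_b", "doc_a", "doc_c"]
sparse_ranking = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused[:2])  # top 2 after fusion
```

Documents that appear near the top of both lists (`doc_a` and `doc_b` here) outrank documents that only one method found.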

## **Section 8: Create a Prompt Template and Output Parser**

In this section, we will create a prompt template that structures user queries for the language model and establish an output parser to format the model's responses. This ensures that the input to the model is well-defined and that the output is in the desired format.

- **Prompt Template**: 
  - We will define a `PromptTemplate` to structure user queries. This template specifies the expected input variables and the overall format of the prompt. By organizing the input in a consistent manner, we enhance the model's understanding and response accuracy.

- **Input Variables**: 
  - The template will include an input variable called `question`, which represents the user’s query that we want the model to answer. 

- **Template Structure**: 
  - The prompt will be formatted as: "Answer the following question: {question}". This clear and straightforward structure helps guide the model in generating relevant responses.

- **Output Parser**: 
  - An output parser is created to ensure that the model’s responses are appropriately converted to strings. This is crucial for maintaining consistency and ease of use in further processing or display.


In [None]:
# Define a prompt template to structure user queries for the language model.
prompt = PromptTemplate(
    input_variables=["question"],  # Specify the expected input variable.
    template="Answer the following question: {question}"  # Format of the prompt.
)

# Create an output parser to ensure the model response is converted to a string.
output_parser = StrOutputParser()
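
To see what the rendered prompt looks like before it reaches the model, the template string can be filled in with plain `str.format`, which uses the same `{question}` placeholder syntax. This is a standard-library sketch of the substitution step, not a replacement for `PromptTemplate`; the sample question is made up.

```python
# The same template string that is passed to PromptTemplate above.
template = "Answer the following question: {question}"

# Fill in the placeholder with a sample question (illustrative only).
rendered = template.format(question="What is a vector database?")
print(rendered)

# Leaving out a required variable fails loudly instead of sending a broken prompt.
try:
    template.format()
except KeyError as missing:
    print(f"missing input variable: {missing}")
```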

## **Section 9: Set Up a RunnablePassthrough**

In this section, we will set up a `RunnablePassthrough`, which is a utility that allows data to be passed unchanged through various steps of a workflow. This is useful for scenarios where we need to maintain the integrity of the data while still allowing for the flexibility of the processing pipeline.

- **RunnablePassthrough**: 
  - The `RunnablePassthrough` acts as a simple pass-through mechanism in the data processing workflow. It allows the data to flow through different components of the pipeline without any alterations.

- **Use Cases**: 
  - This can be particularly beneficial when you want to integrate different processing steps without modifying the input data. For instance, it can serve as a placeholder in a workflow where certain transformations are conditionally applied.


In [None]:
# Create a passthrough to pass data unchanged through different steps of the workflow.
passthrough = RunnablePassthrough()
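
The typical use of a passthrough is to forward the original question alongside values computed from it, as in the common LangChain pattern `{"context": retriever, "question": RunnablePassthrough()}`. The plain-Python analogy below sketches that idea without LangChain; `fake_retriever` and `run_parallel` are hypothetical stand-ins.

```python
def passthrough(value):
    """Return the input unchanged (plain-Python analogue of a passthrough step)."""
    return value


def fake_retriever(question):
    """Hypothetical stand-in for a retriever: returns a canned context string."""
    return f"(context retrieved for: {question})"


def run_parallel(steps, value):
    """Apply every step to the same input and collect the results by key."""
    return {key: step(value) for key, step in steps.items()}


steps = {"context": fake_retriever, "question": passthrough}
inputs = run_parallel(steps, "What is Milvus?")
print(inputs)
```

The `question` entry arrives unchanged, so a later prompt step can use both the retrieved context and the user's original wording.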

## **Section 10: Define a Web Scraping Function**

In this section, we will define a function that scrapes text content from a specified website URL. This function will be useful for gathering textual data from the web for further processing or analysis.

- **Web Scraping Function**: 
  - The function `scrape_website(url)` is designed to retrieve and extract text from a given webpage. It utilizes libraries like `requests` for making HTTP requests and `BeautifulSoup` for parsing HTML content.

- **HTTP GET Request**: 
  - The function begins by sending an HTTP GET request to the specified URL using `requests.get(url)`. This retrieves the content of the webpage.

- **HTML Parsing**: 
  - The HTML content is then parsed with BeautifulSoup. This library simplifies the process of navigating and manipulating HTML documents, making it easy to extract relevant information.

- **Text Extraction**: 
  - We extract and join all visible text from the HTML using the `soup.stripped_strings` generator, which collects all text elements while stripping unnecessary whitespace.

- **Tokenization**: 
  - The extracted text is further processed by tokenizing it into individual sentences using NLTK's `sent_tokenize` function. This step is essential for breaking down the text into manageable pieces for analysis.

- **Return Value**: 
  - Finally, the function returns a list of sentences extracted from the webpage, which can be used for various applications such as natural language processing (NLP) or text analysis.


In [None]:
# Define a function to scrape text from a given website URL.
def scrape_website(url):
    # Send an HTTP GET request; the timeout keeps a slow server from hanging the notebook.
    response = requests.get(url, timeout=10)

    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()

    # Parse the HTML content using BeautifulSoup.
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the text strings from the HTML, stripped of surrounding whitespace, and join them.
    text = ' '.join(soup.stripped_strings)

    # Tokenize the extracted text into individual sentences using nltk.
    sentences = nltk.sent_tokenize(text)

    # Return the list of sentences extracted from the webpage.
    return sentences
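
`urljoin` is imported in Section 1 but not used by `scrape_website`. If the scraper were extended to follow links, relative `href` values would have to be resolved against the page URL first. The sketch below shows how `urljoin` handles the usual cases; the URLs are made-up examples.

```python
from urllib.parse import urljoin

# Page on which the (hypothetical) links were found.
base_url = "https://example.com/docs/index.html"

same_dir = urljoin(base_url, "page2.html")           # relative to the current directory
parent_dir = urljoin(base_url, "../about")           # one directory up
site_root = urljoin(base_url, "/contact")            # relative to the site root
external = urljoin(base_url, "https://other.org/x")  # absolute URLs are kept as-is

for resolved in (same_dir, parent_dir, site_root, external):
    print(resolved)
```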

## **Section 11: Scrape a Website and Insert Data into Milvus**

In this section, we will scrape text data from a specified website, generate embeddings for the extracted sentences, and insert these embeddings into the Milvus collection. This process allows us to enrich the Milvus database with real-time data collected from the web.

- **Website URL for Scraping**: 
  - We start by defining the URL of the website we want to scrape. For this example, we use a placeholder URL (`"https://example.com"`), which you can replace with any valid website.

- **Scraping Sentences**: 
  - We will use the previously defined `scrape_website(url)` function to extract sentences from the specified webpage. This function retrieves the content, parses the HTML, and tokenizes the text into individual sentences.

- **Generating Embeddings**: 
  - Once we have the list of sentences, we generate embeddings for these sentences using the `embed_documents` method from the Hugging Face embeddings model. This step transforms the sentences into numerical vector representations suitable for storage and retrieval.

- **Inserting Data into Milvus**: 
  - Finally, we insert the embeddings into the Milvus collection. Because IDs `1` to `3` are already used by the sample data from Section 5, the new IDs continue from `max(ids) + 1` so that primary keys stay unique. Note that the schema holds only `id` and `embedding`, so the sentence text itself is not stored in Milvus; keep the `sentences` list (or add a text field to the schema) if you need to map IDs back to text.



In [None]:
# Define a website URL for scraping (example URL).
website_url = "https://example.com"

# Scrape the website to get a list of sentences.
sentences = scrape_website(website_url)

# Generate embeddings for the scraped sentences.
sentence_embeddings = embeddings.embed_documents(sentences)

# Continue numbering after the sample IDs from Section 5 so primary keys stay unique.
start_id = max(ids) + 1
sentence_ids = list(range(start_id, start_id + len(sentences)))

# Insert the new IDs and their embeddings into the Milvus collection.
collection.insert([sentence_ids, sentence_embeddings])

## **Section 12: Perform a Hybrid Search and Display Results**

In this section, we will perform a hybrid search on the Milvus collection using a user-defined query. We will retrieve the most relevant documents based on this query and display the results. This process demonstrates the practical application of our previous configurations and data insertions.

- **User Query Definition**: 
  - We start by defining a user query that we want to search for in the Milvus collection. In this example, the query is `"What is this website about?"`, which reflects the kind of information we wish to retrieve.

- **Retrieving Relevant Documents**: 
  - Using the previously configured hybrid search retriever, we obtain the documents that best match the user query. Retrievers are called with `retriever.invoke(query)`; the older `get_relevant_documents(query)` method is deprecated in recent LangChain releases. The retriever combines the dense and sparse rankings to return the top results.

- **Displaying Results**: 
  - We then iterate through the retrieved results and print the content of each document. This step allows us to view the actual sentences or text snippets associated with the query, providing insights into the information available in the Milvus collection.


In [None]:
# Define a user query to search the collection.
query = "What is this website about?"

# Use the retriever to get the most relevant documents based on the query.
results = retriever.invoke(query)

# Iterate through the retrieved results and print their content.
for result in results:
    print(result.page_content)