## **Section 1: Import Required Libraries and Modules**

In this section, we import the libraries needed for data storage, retrieval, and processing in our NLP pipeline. Together they provide an environment for NLP tasks, data management, and integration with external sources. Each library and module plays a specific role:

Milvus (`pymilvus`): Manages vector storage and retrieval. `connections` opens the database connection, `utility` and `Collection` handle collection management, and `CollectionSchema`, `FieldSchema`, and `DataType` define the collection schema.

In [None]:
# Import Milvus utilities for connections, schema creation, and managing collections.
from pymilvus import connections, utility, Collection, CollectionSchema, FieldSchema, DataType

  LangChain Community Embeddings: Utilizes `HuggingFaceEmbeddings` to convert text data into vector embeddings for efficient processing and retrieval.

In [None]:
# Import HuggingFace embeddings to convert text into vector representations.
from langchain_community.embeddings import HuggingFaceEmbeddings  # Adjust if necessary.


LangChain Core Utilities: Includes `StrOutputParser` for formatting model outputs as strings, `PromptTemplate` for structuring prompts, and `RunnablePassthrough` for passing data through steps without modification. The `Milvus` vector store class from `langchain_milvus` is imported in the same cell.



In [None]:
# Import parser to ensure the model outputs are formatted as strings.
from langchain_core.output_parsers import StrOutputParser

# Import PromptTemplate to structure and format prompts for the language model.
from langchain_core.prompts import PromptTemplate

# Import RunnablePassthrough to allow passing data unchanged through multiple steps.
from langchain_core.runnables import RunnablePassthrough

# Import Milvus to interact with the vector database for storage and retrieval.
from langchain_milvus import Milvus

  Milvus Hybrid Search Retriever: Utilizes `MilvusCollectionHybridSearchRetriever` and `BM25SparseEmbedding` to combine dense and sparse search capabilities for optimized retrieval.



In [None]:
# Import hybrid search retriever from Milvus, combining dense and sparse search techniques.
from langchain_milvus.retrievers import MilvusCollectionHybridSearchRetriever

# Import BM25 for sparse embedding, typically used for keyword-based search.
from langchain_milvus.utils.sparse import BM25SparseEmbedding

Mistral Chat Model: `ChatMistralAI` offers a conversational interface with Mistral’s language model for interactive querying.



In [None]:
# Import ChatMistralAI to enable conversational interaction with the Mistral AI model.
from langchain_mistralai.chat_models import ChatMistralAI

Web Scraping Tools: Use `requests` for HTTP requests and `BeautifulSoup` to parse HTML content for data extraction.



In [None]:
# Import BeautifulSoup to parse and extract text content from HTML web pages.
from bs4 import BeautifulSoup

# Import requests to make HTTP requests for web scraping or API calls.
import requests

  Environment and Utility Modules: `nltk` tokenizes text for NLP, `os` and `dotenv` manage environment variables securely, and `urljoin` joins URLs for web scraping.



In [None]:
# Import nltk (Natural Language Toolkit) for tokenizing text into sentences.
import nltk

# Import os for interacting with the operating system, such as handling environment variables.
import os

# Import urljoin to join relative and absolute URLs for web scraping.
from urllib.parse import urljoin

# Import load_dotenv to load environment variables from a .env file into the program.
from dotenv import load_dotenv
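
As a quick illustration of what `urljoin` does when following links found on a scraped page, the sketch below resolves a few `href` values against a base URL. The URLs are placeholders chosen for this example, not pages used elsewhere in the notebook.

```python
from urllib.parse import urljoin

# Hypothetical page URL and link targets, used only for illustration.
base_url = "https://example.com/docs/intro.html"
hrefs = ["guide.html", "/about", "https://other.example.org/faq"]

# Relative paths resolve against the base; absolute URLs are returned unchanged.
resolved = [urljoin(base_url, href) for href in hrefs]

for link in resolved:
    print(link)
```

Resolving links this way keeps a crawler from requesting malformed URLs when a page mixes relative and absolute links.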

## **Section 2: Load Environment Variables and Setup NLTK**

This section configures secure access and text processing. Environment variables are loaded from a `.env` file using `load_dotenv()`, and the `punkt` tokenizer from NLTK is downloaded for sentence-level tokenization in NLP tasks.



In [None]:
# Load environment variables from a .env file to access configuration settings, such as API keys.
load_dotenv()

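# Download the 'punkt' tokenizer, which allows nltk to split text into sentences.
nltk.download('punkt')
# Recent NLTK releases look for the 'punkt_tab' resource instead, so fetch it as well.
nltk.download('punkt_tab')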

## **Section 3: Configure Mistral API Key**

This section loads the `MISTRAL_API_KEY` from the environment rather than hardcoding it. The API key is retrieved using `os.getenv()` and set as an environment variable with `os.environ`, so Mistral services can be accessed without sensitive information appearing in the code.



In [None]:
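# Retrieve the Mistral API key from the environment variables using os.getenv().
mistral_api_key = os.getenv("MISTRAL_API_KEY")

# Fail early with a clear message if the key is missing; assigning None to os.environ raises a TypeError.
if not mistral_api_key:
    raise ValueError("MISTRAL_API_KEY is not set. Add it to your .env file or environment.")

# Set the Mistral API key as an environment variable so it can be accessed throughout the code.
os.environ["MISTRAL_API_KEY"] = mistral_api_key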

## **Section 4: Connect to Milvus and Setup Collection Schema**

In this section, we connect to the Milvus database on localhost at port 19530 so that our application can communicate with Milvus for data operations. We then check for an existing collection named `my_collection`, dropping it if necessary to avoid duplicates. Next, we define the collection schema using `FieldSchema` objects, including an `id` field of type `INT64` as the primary key and an `embedding` field of type `FLOAT_VECTOR` with a dimensionality of `384`, which matches the output size of the `sentence-transformers/all-MiniLM-L6-v2` model used in Section 5. Finally, we create the collection in Milvus to store and retrieve vector embeddings.

In [None]:
# Establish a connection to Milvus, using localhost as the host and default port 19530.
connections.connect(alias="default", host="localhost", port="19530")

# Check if the collection "my_collection" already exists in Milvus.
if utility.has_collection("my_collection"):
    # If the collection exists, drop it to avoid duplicate entries.
    utility.drop_collection("my_collection")

# Define fields for the Milvus collection schema. The ID field acts as the primary key.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),  # Integer primary key field.
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384)  # 384-dimensional float vector field, matching all-MiniLM-L6-v2.
]

# Create a collection schema using the defined fields.
schema = CollectionSchema(fields, description="My Collection for Embeddings")

# Initialize the collection with the name "my_collection" and the created schema.
collection = Collection(name="my_collection", schema=schema)

## **Section 5: Generate Embeddings and Insert Data into Milvus**

This section generates vector embeddings for text and inserts them into the Milvus collection. We initialize the Hugging Face embeddings model `sentence-transformers/all-MiniLM-L6-v2` to convert text into dense vector embeddings. Sample IDs `[1, 2, 3]` are defined, and embeddings are generated for the sentences `"Hello world"`, `"How are you?"`, and `"Goodbye"` using the `embed_documents` method. Finally, the IDs and their embeddings are inserted into the Milvus collection.

In [None]:
# Initialize the HuggingFace embeddings model to generate vector representations from text.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Define some sample data to insert into the Milvus collection.
ids = [1, 2, 3]  # List of IDs for the data points.

# Generate embeddings for three sample sentences.
vectors = embeddings.embed_documents(["Hello world", "How are you?", "Goodbye"])

# Insert the IDs and corresponding vector embeddings into the Milvus collection.
collection.insert([ids, vectors])
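
Milvus rejects an insert when a vector's length differs from the `dim` declared for the field, so it can help to check the embedding size before calling `insert`. The helper below is a minimal sketch: `check_dimensions` is a hypothetical name, and the short vectors stand in for real embeddings.

```python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, expected {expected_dim}"
            )
    # Return how many vectors were checked.
    return len(vectors)


# Toy 4-dimensional vectors standing in for model output.
sample_vectors = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
checked = check_dimensions(sample_vectors, expected_dim=4)
print(f"{checked} vectors passed the check")
```

In the notebook, the same check could be run on `vectors` with the `dim` value used in the schema before the insert.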

## **Section 6: Load the Mistral Chat Model**

In this section, we load the `ChatMistralAI` model to enable conversational response generation. By passing the Mistral API key for authorization, we ensure secure access to the model’s features for natural language understanding and response generation in applications.


In [None]:
# Load the ChatMistralAI model using the Mistral API key for authorization.
chat_model = ChatMistralAI(api_key=mistral_api_key)

## **Section 7: Configure the Milvus Hybrid Search Retriever**

In this section, we establish a hybrid search retriever to combine dense and sparse retrieval methods for better search results in the Milvus collection. Utilizing dense embeddings from the Hugging Face model and sparse embeddings with the BM25 algorithm, we enable efficient searches by specifying the previously created Milvus collection. This setup enhances semantic and keyword-based searches and configures the retriever to return the top 2 most relevant documents.

In [None]:
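# Set up a hybrid search retriever using the Milvus collection and various embeddings.
# Note: these keyword arguments may need adapting to the installed langchain_milvus version,
# whose documented constructor takes anns_fields, field_embeddings, and rerank, and whose
# BM25SparseEmbedding is documented as being built from a text corpus.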
retriever = MilvusCollectionHybridSearchRetriever(
    collection=collection,  # Specify the Milvus collection to search.
    dense_embedding=embeddings,  # Use HuggingFace dense embeddings for semantic search.
    sparse_embedding=BM25SparseEmbedding(),  # Use BM25 for sparse, keyword-based search.
    top_k=2  # Return the top 2 most relevant documents.
)
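
Hybrid search has to merge two ranked lists, one from the dense search and one from the sparse search. One common fusion scheme is reciprocal rank fusion (RRF), where each document receives the sum of `1 / (k + rank)` over the lists it appears in. The sketch below is a plain-Python illustration of that idea with made-up document IDs; it does not describe how this retriever is configured, and `k=60` is a conventional default assumed here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; a higher fused score means more relevant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (semantic) and a sparse (keyword) search.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused[:2])  # top 2 after fusion
```

Documents that rank well in both lists rise to the top, which is the behavior a hybrid retriever aims for.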

## **Section 8: Create a Prompt Template and Output Parser**

This section creates a `PromptTemplate` to structure user queries and an output parser to format the model's responses. The template takes a single input variable, `question`, and renders it as "Answer the following question: {question}", so every query reaches the model in the same form. The `StrOutputParser` converts the model's response to a plain string for consistent downstream use.



In [None]:
# Define a prompt template to structure user queries for the language model.
prompt = PromptTemplate(
    input_variables=["question"],  # Specify the expected input variable.
    template="Answer the following question: {question}"  # Format of the prompt.
)

# Create an output parser to ensure the model response is converted to a string.
output_parser = StrOutputParser()
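
The `{question}` placeholder follows Python's format-string syntax, so the effect of filling the template can be previewed with plain `str.format`. This is a stdlib-only sketch of the substitution, not a call into LangChain, and the sample question is made up.

```python
template = "Answer the following question: {question}"

# Substitute the input variable, much as the prompt template does when it is formatted.
rendered = template.format(question="What is a vector database?")
print(rendered)
```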

## **Section 9: Set Up a RunnablePassthrough**

This section sets up a `RunnablePassthrough`, a utility that returns its input unchanged. In a pipeline it lets data, such as the user's original question, flow past a step without modification. This is useful when one branch of a chain should forward the input as-is while other branches transform it.



In [None]:
# Create a passthrough to pass data unchanged through different steps of the workflow.
passthrough = RunnablePassthrough()
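
A passthrough is easiest to see next to a step that does transform its input. The sketch below is a plain-Python analogy that uses nothing from LangChain: `forward_unchanged` and `run_steps` are hypothetical names, and the dict-of-steps shape mirrors how a passthrough is often used to forward the original question alongside derived values.

```python
def forward_unchanged(value):
    """Return the input exactly as received."""
    return value


def run_steps(steps, value):
    """Apply every named step to the same input and collect the results."""
    return {name: step(value) for name, step in steps.items()}


steps = {
    "question": forward_unchanged,                  # forwarded as-is
    "word_count": lambda text: len(text.split()),   # derived from the input
}

outputs = run_steps(steps, "What is this website about?")
print(outputs)
```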

## **Section 10: Define a Web Scraping Function**

  This section defines a function, `scrape_website(url)`, to extract text content from a website URL for further analysis. Using `requests` for HTTP GET requests and `BeautifulSoup` for HTML parsing, the function retrieves webpage content. It then extracts and joins visible text using `soup.stripped_strings`, and tokenizes the text into sentences with NLTK's `sent_tokenize`. The function returns a list of sentences, useful for NLP or text analysis.




In [None]:
# Define a function to scrape text from a given website URL.
def scrape_website(url):
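    # Send an HTTP GET request to the specified URL; the timeout avoids hanging on a slow server.
    response = requests.get(url, timeout=10)

    # Raise an exception for HTTP error responses (4xx or 5xx) instead of parsing an error page.
    response.raise_for_status()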

    # Parse the HTML content using BeautifulSoup.
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract and join all visible text from the HTML.
    text = ' '.join(soup.stripped_strings)

    # Tokenize the extracted text into individual sentences using nltk.
    sentences = nltk.sent_tokenize(text)

    # Return the list of sentences extracted from the webpage.
    return sentences

## **Section 11: Scrape a Website and Insert Data into Milvus**

In this section, we scrape text data from a specified website, generate embeddings for the extracted sentences, and insert these embeddings into the Milvus collection. Using the placeholder URL `"https://example.com"`, we extract sentences with the `scrape_website(url)` function and generate embeddings via the Hugging Face `embed_documents` method. Sequential IDs are created with `range`, starting after the sample IDs from Section 5 so that primary keys do not collide, and the collection's `insert` method stores the IDs and embeddings. The schema has no text field, so the sentence text itself is not stored.



In [None]:
# Define a website URL for scraping (example URL).
website_url = "https://example.com"

# Scrape the website to get a list of sentences.
sentences = scrape_website(website_url)

# Generate embeddings for the scraped sentences.
sentence_embeddings = embeddings.embed_documents(sentences)

# Assign sequential IDs that continue after the sample IDs inserted earlier, avoiding duplicate primary keys.
start_id = max(ids) + 1
sentence_ids = list(range(start_id, start_id + len(sentences)))

# Insert the IDs and their embeddings into the Milvus collection.
collection.insert([sentence_ids, sentence_embeddings])

## **Section 12: Perform a Hybrid Search and Display Results**

This section performs a hybrid search on the Milvus collection using a user-defined query and displays the most relevant documents. We begin by defining a user query, such as `"What is this website about?"`. We then call the hybrid search retriever's `invoke(query)` method, which replaces the deprecated `get_relevant_documents(query)` in current LangChain releases, to combine dense and sparse embeddings and find semantically and lexically relevant documents. Finally, we print the content of the retrieved documents.



In [None]:
# Define a user query to search the collection.
query = "What is this website about?"

# Use the retriever to get the most relevant documents based on the query.
results = retriever.invoke(query)

# Iterate through the retrieved results and print their content.
for result in results:
    print(result.page_content)