#### Instructions for Setting Up the Environment:
1. If this is a new instance:
   - Create a new Python virtual environment so that your project dependencies stay clean and isolated. You can create one from your IDE or directly from your terminal.
   - Ensure that Jupyter Notebook is installed in that environment, and then launch it.
   - Confirm that the correct environment is selected as the kernel.
2. If an existing environment is available:
   - Activate the existing virtual environment where you want to run this notebook.
   - Ensure that Jupyter Notebook is installed in that environment, and then launch it.
   - Confirm that the correct environment is selected as the kernel.
3. Install Required Dependencies:
   - Once you have the correct environment set as the kernel, use the `%pip` magic command in a new notebook cell to install the required dependencies.
   - This command will ensure all the necessary libraries are installed in the environment associated with the notebook kernel.

In [1]:
%pip install -q -r requirements.txt

Note: you may need to restart the kernel to use updated packages.
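A quick way to confirm that the kernel is running in the environment you expect is to inspect the interpreter from inside a notebook cell. The helper name `kernel_environment` below is hypothetical, and the `in_venv` check is a minimal sketch that assumes a standard `venv`-style environment (Conda environments may report `False` here).

```python
import sys

def kernel_environment():
    """Return the interpreter path and environment prefix used by this kernel."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # In a venv, sys.prefix points at the environment and differs
        # from sys.base_prefix, which points at the base installation.
        "in_venv": sys.prefix != sys.base_prefix,
    }

info = kernel_environment()
print("Interpreter:", info["executable"])
print("Virtual environment active:", info["in_venv"])
```

If the printed interpreter path is not inside the environment you created, switch the kernel before running `%pip install`.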


#### Standard Library Imports
Built-in libraries in Python that provide functionalities such as file handling, working with paths, and generating unique identifiers.

#### Third-Party Imports
External libraries installed previously (requirements.txt) for specialised use in this project.

- pymupdf4llm: A library for processing PDF files.
- SentenceTransformer: A framework for sentence embedding using state-of-the-art models.
- QdrantClient: A client library for interacting with Qdrant, a vector search engine.
- Flashrank: A library for re-ranking search results.
- ChatGroq: A library for integrating Groq-based models with LangChain.
- PromptTemplate: A utility from LangChain for creating customisable prompts.
- ConversationBufferMemory: A memory buffer module from LangChain for handling conversations.
- MarkdownTextSplitter: A tool from LangChain for splitting Markdown text into smaller chunks.

In [None]:
# Standard library imports
import os
import pathlib
import uuid

# Third-party imports
import pymupdf4llm
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from flashrank import Ranker, RerankRequest
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate
from langchain.memory.buffer import ConversationBufferMemory
from langchain.text_splitter import MarkdownTextSplitter

#### Converting PDF Files to Markdown

The `convert_pdf_to_text` function converts a single PDF file to Markdown and saves the result in the output folder under an upper-cased file name.

The loop that follows iterates over the files in the `docs` folder, selects those with a `.pdf` extension, and writes each converted file to the `parsed_docs` folder.

In [3]:
def convert_pdf_to_text(pdf_path, output_folder):
    """
    Converts a PDF file to Markdown text and saves it to a specified
    output folder with the file name fully capitalised.

    Args:
        pdf_path (str): The path to the input PDF file.
        output_folder (str): The folder where the converted Markdown
                             file will be saved.

    Returns:
        None: This function does not return any value. It saves the
              converted text as a Markdown (.md) file in the specified
              output folder.
    """

    # Convert the PDF to Markdown
    import_doc = pymupdf4llm.to_markdown(pdf_path)

    # Get the file name without its extension and convert it to upper case
    base_name = os.path.splitext(os.path.basename(pdf_path))[0].upper()
    output_file = os.path.join(output_folder, f"{base_name}.md")

    # Create the output folder if needed, then save the converted document
    os.makedirs(output_folder, exist_ok=True)
    pathlib.Path(output_file).write_bytes(import_doc.encode())

# Convert all PDFs in the docs folder
for pdf_file in os.listdir('docs'):
    if pdf_file.lower().endswith('.pdf'):
        convert_pdf_to_text(os.path.join('docs', pdf_file), 'parsed_docs')


Processing docs\fed_gov_guide.pdf...
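The file discovery step can also be written with `pathlib`. The sketch below is an illustration only: `find_pdfs` is a hypothetical helper, and it matches the `.pdf` suffix case-insensitively, which is an assumption about what you want rather than something the conversion library requires. It runs against a temporary folder so it does not touch `docs`.

```python
import tempfile
from pathlib import Path

def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, ignoring suffix case."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )

# Demonstrate on a throwaway folder containing two PDFs and one other file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "B.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    found_names = [path.name for path in find_pdfs(tmp)]

print(found_names)
```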


#### Setting Up Path and Listing Markdown Files

This section sets the path to the directory containing parsed Markdown files and lists all Markdown files within it.

- `notebook_path`: Gets the current directory.
- `docs_path`: Path to `parsed_docs` where Markdown files are stored.
- `documents`: List of full paths to all `.md` files in `parsed_docs`.

In [4]:
# Path to the directory containing parsed Markdown files
notebook_path = os.getcwd()
docs_path = os.path.join(notebook_path, 'parsed_docs')

# List all Markdown files in the directory
documents = [
    os.path.join(docs_path, file)
    for file in os.listdir(docs_path)
    if file.endswith('.md')
]

#### Splitting Markdown Files into Chunks

This section initialises a text splitter to divide the Markdown files into smaller chunks for easier processing.

- `text_splitter`: Initialises a splitter that divides text into chunks of at most 500 characters, with up to 50 characters of overlap between neighbouring chunks.
- `chunks`: A list containing all chunks extracted from the Markdown files.
- Output Check: Prints the total number of chunks and displays the first chunk to verify the result.

In [5]:
# Initialise the text splitter
text_splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)

# Split all documents into chunks
chunks = []
for doc in documents:
    with open(doc, 'r', encoding='utf-8') as file:
        text = file.read()
        chunk = text_splitter.split_text(text)
        chunks.extend(chunk)

# Check: Print the number of chunks and the first chunk
print(len(chunks))
print(chunks[:1])

93
['# Choose Health:\n Be Active\n\n\n##### A physical activity guide for older Australians\n\n###### An initiative of the Australian Government in\n\n association with Sports Medicine Australia\n\n\n-----\n\n**Choose Health: Be Active**\n\nFirst printed April 2005\nRevised and reprinted April 2008\nRevised and reprinted June 2008\nISBN 978-1-920720-2856']
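To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. This is not how `MarkdownTextSplitter` works internally (it prefers to break at Markdown separators such as headings and paragraphs); `split_with_overlap` is a hypothetical helper that only illustrates how an overlap makes neighbouring chunks share text.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return pieces

sample_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(sample_chunks)  # → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a sentence cut at a boundary still appears intact in at least part of a neighbouring chunk.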


#### Initialising the Sentence Encoder

We initialise a sentence encoder model to generate embeddings for the text chunks.

- `encoder`: A `SentenceTransformer` model (`all-MiniLM-L6-v2`) is used for generating text embeddings to represent the meaning of sentences or chunks.

`all-MiniLM-L6-v2` is a lightweight, efficient open-source transformer model fine-tuned for sentence embedding tasks. It offers a good balance between speed and accuracy, making it suitable for applications such as semantic search, clustering, and text classification. It produces 384-dimensional vectors and was trained mainly on English text, so results on other languages may be weaker.

In [None]:
encoder = SentenceTransformer("all-MiniLM-L6-v2")


#### Generating Embeddings for Text Chunks

In this step, we use the initialised sentence encoder to generate embeddings for the text chunks. These embeddings are numerical representations of the text, capturing their semantic meaning.

- `embeddings`: A NumPy array with one vector per text chunk, generated by the `encoder`. These embeddings can be used for downstream tasks such as similarity search, clustering, or classification, where the semantic content of the text matters.

In [7]:
embeddings = encoder.encode(chunks)

#### Preparing Data for Upload with Unique Identifiers

In this section, we generate unique IDs for each text chunk and prepare the data for uploading to a vector database.

- `ids`: A list of unique identifiers generated for each text chunk using UUIDs.
- `points`: A list of `PointStruct` objects, each containing:
  - `id`: A unique identifier for the chunk.
  - `vector`: The embedding representing the chunk's semantic content.
  - `payload`: An optional dictionary with the original text, useful for retrieval tasks or additional metadata.

In [8]:
# Generate unique IDs
ids = [str(uuid.uuid4()) for _ in range(len(chunks))]

# Prepare data for upload
points = [
    PointStruct(
        id=point_id,
        vector=embedding,
        payload={"text": chunk}  # Optional payload with original text
    )
    for point_id, embedding, chunk in zip(ids, embeddings, chunks)
]
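Random `uuid4` identifiers change on every run, so re-running the notebook against a persistent collection would add duplicate points. One possible alternative, shown here as a sketch rather than as the notebook's method, is to derive the ID from the chunk text with `uuid.uuid5`. The helper name `chunk_id` and the choice of `uuid.NAMESPACE_URL` as the namespace are assumptions made for illustration.

```python
import uuid

def chunk_id(text, namespace=uuid.NAMESPACE_URL):
    """Return a deterministic UUID string derived from the chunk text."""
    return str(uuid.uuid5(namespace, text))

first = chunk_id("Choose Health: Be Active")
second = chunk_id("Choose Health: Be Active")
other = chunk_id("Useful Contacts")

print(first == second)  # → True
print(first == other)   # → False
```

The trade-off is that two chunks with identical text map to the same ID, so an upsert would keep only one of them.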

#### Verifying Prepared Data Points

This step checks the structure and content of the prepared data points by printing the first entry.

In [9]:
print(points[:1])

[PointStruct(id='e83b6727-c040-47d3-b165-a47db5f356bb', vector=[0.09132414311170578, 0.003608893370255828, 0.00802591722458601, 0.09255637228488922, -0.0016212842892855406, 0.10537403076887131, 0.07642673701047897, -0.00259967939928174, -0.06672266870737076, 0.12426268309354782, 0.020014192909002304, 1.565547972859349e-05, -0.041562583297491074, 0.03374786674976349, 0.12447652220726013, 0.04611389711499214, -0.012749030254781246, -0.04728269949555397, -0.0045511312782764435, 0.04754595831036568, -0.046240612864494324, 0.11829520016908646, 0.048751913011074066, 0.04791689291596413, -0.034289389848709106, 0.01574784144759178, -0.012838504277169704, 0.005514912307262421, -0.04133659601211548, 0.028025276958942413, 0.005489352159202099, 0.04444870352745056, 0.06721746921539307, -0.006648452952504158, 0.0006244006799533963, -0.052473317831754684, -0.02547653391957283, -0.044095054268836975, -0.12334436923265457, -0.011725603602826595, 0.0466545931994915, -0.08336395770311356, 0.009355216287

#### Initialising the Qdrant Client and Creating a Collection

This section initialises a Qdrant client and creates a collection to store the text chunk embeddings with a specific configuration.

- `client`: Initialises a Qdrant client using an in-memory database (`:memory:`), suitable for testing or temporary storage.
- `COLLECTION_NAME`: Defines the name of the collection where the text embeddings will be stored.
- `create_collection`: Creates a new collection in Qdrant with:
  - `size=384`: The dimensionality of the vectors (matching the output size of the `all-MiniLM-L6-v2` model).
  - `distance=Distance.COSINE`: Specifies the use of cosine similarity as the distance metric for comparing vectors.

In [10]:
client = QdrantClient(":memory:")

COLLECTION_NAME = 'my_text_chunks'

# Create a collection with specific configuration
client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

True
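Because the collection is created with `size=384`, every vector uploaded to it has to have exactly that length. A small pre-upload check can catch a mismatch early, for example after swapping the encoder for a different model. `check_dimensions` is a hypothetical helper and does not call Qdrant.

```python
def check_dimensions(vectors, expected_size=384):
    """Return the indices of vectors whose length differs from `expected_size`."""
    return [
        index for index, vector in enumerate(vectors)
        if len(vector) != expected_size
    ]

# Two dummy vectors: the second one is one element short
dummy_vectors = [[0.0] * 384, [0.0] * 383]
bad_indices = check_dimensions(dummy_vectors)
print(bad_indices)  # → [1]
```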

#### Uploading Data Points to Qdrant Collection

This step uploads the prepared data points to the specified Qdrant collection.

- `upsert`: Adds or updates the data points in the specified Qdrant collection.
  - `collection_name`: The name of the collection where data points are stored (`my_text_chunks`).
  - `wait=True`: Ensures the operation is completed before proceeding, providing immediate feedback on the upload status.
  - `points`: The list of `PointStruct` objects containing the unique IDs, embeddings, and optional payloads to be stored.

In [11]:
client.upsert(
    collection_name=COLLECTION_NAME,
    wait=True,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

#### Querying Qdrant for Closest Matches

This function queries the Qdrant vector database to find the top-k closest matches to a given query embedding.

- Search Qdrant - Retrieves the top-k matches based on the query embedding.
- Format Results - Returns a list of dictionaries containing the match ID, text, and similarity score.

In [33]:
def query_qdrant(query_embedding, collection_name, top_k=3):
    """
    Queries the Qdrant vector database to retrieve the top-k closest matches
    to a given embedding.

    Args:
        query_embedding (list): The vector embedding representing the query.
        collection_name (str): The name of the Qdrant collection to search.
        top_k (int, optional): The number of closest matches to retrieve.

    Returns:
        list: A list of dictionaries, one per match, each with the keys
              "id", "text", and "score".
    """

    search_result = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        limit=top_k,
        with_payload=True
    )

    # Format the results
    formatted_results = [
        {
            "id": hit.id,
            "text": hit.payload.get("text", ""),
            "score": hit.score
        }
        for hit in search_result
    ]

    return formatted_results
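Conceptually, a top-k query scores the query vector against stored vectors and keeps the best matches. The brute-force sketch below shows that idea in plain Python using cosine similarity, and returns the same dictionary shape as `query_qdrant`. It is an illustration only: Qdrant's own search is implemented differently, and the names `cosine_similarity` and `brute_force_top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def brute_force_top_k(query_vector, stored_points, top_k=3):
    """Score every stored point against the query and keep the best top_k."""
    scored = [
        {
            "id": point["id"],
            "text": point["text"],
            "score": cosine_similarity(query_vector, point["vector"]),
        }
        for point in stored_points
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]

toy_points = [
    {"id": "a", "text": "same direction", "vector": [1.0, 0.0]},
    {"id": "b", "text": "orthogonal", "vector": [0.0, 1.0]},
    {"id": "c", "text": "in between", "vector": [1.0, 1.0]},
]
top_hits = brute_force_top_k([1.0, 0.0], toy_points, top_k=2)
print([hit["id"] for hit in top_hits])  # → ['a', 'c']
```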


#### Searching and Reranking Vector Matches

This example shows how to perform a search using Qdrant, display the initial results, and then rerank these results using a ranking model.

1. Search for Initial Results: Encode the query and search Qdrant to find the closest matches.
2. Display Initial Results: Print the top matches retrieved from Qdrant.
3. Rerank Results: Use the `Ranker` model to reorder the initial results based on relevance.
4. Display Reranked Results: Print the matches after reranking for comparison.

In [52]:
# Search for a vector before re-ranking
QUERY_TEXT = "Sport and Recreation Tasmania"
query_vector = encoder.encode([QUERY_TEXT])[0]  # Encode the search query

# Initial search using Qdrant
initial_results_test = query_qdrant(query_vector, COLLECTION_NAME, top_k=4)
print("Initial Results Before Re-ranking:\n")
for result in initial_results_test:
    print(f"ID: {result['id']}\n"
          f"Text: {result['text'][:30]}...\n"
          f"Old Score: {result.get('score', 'N/A')}\n")

# Re-rank the initial results
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
rerankrequest_test = RerankRequest(QUERY_TEXT, initial_results_test)
results = ranker.rerank(rerankrequest_test)
print("\nResults After Re-ranking:\n")
for result in results:
    print(f"ID: {result['id']}\n"
          f"Text: {result['text'][:30]}...\n"
          f"New Score: {result.get('score', 'N/A')}\n")

Batches: 100%|██████████| 1/1 [00:00<00:00, 41.67it/s]


Initial Results Before Re-ranking:

ID: adb6738f-ae06-4870-91c0-3c530fc5b30f
Text: ###### TAS Sport and Recreatio...
Old Score: 0.668080747127533

ID: a32c097c-2f48-4dd3-b30b-67086af3574a
Text: -----

### Useful Contacts

##...
Old Score: 0.48196303844451904

ID: da2b37bc-f8f0-4ee2-9ecd-0ed4cace9bfe
Text: This booklet was produced by t...
Old Score: 0.47543221712112427

ID: e83b6727-c040-47d3-b165-a47db5f356bb
Text: # Choose Health:
 Be Active


...
Old Score: 0.4655294418334961


Results After Re-ranking:

ID: adb6738f-ae06-4870-91c0-3c530fc5b30f
Text: ###### TAS Sport and Recreatio...
New Score: 0.9982047080993652

ID: a32c097c-2f48-4dd3-b30b-67086af3574a
Text: -----

### Useful Contacts

##...
New Score: 0.06663087755441666

ID: da2b37bc-f8f0-4ee2-9ecd-0ed4cace9bfe
Text: This booklet was produced by t...
New Score: 0.0004697415861301124

ID: e83b6727-c040-47d3-b165-a47db5f356bb
Text: # Choose Health:
 Be Active


...
New Score: 0.0001939923531608656
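The output above shows the pattern behind re-ranking: the order of the candidates can stay the same while the scores separate much more sharply. The sketch below reproduces only the mechanics (rescore each candidate against the query, then sort). It uses a deliberately naive word-overlap scorer as a stand-in, which is an assumption for illustration and not what the `Ranker` model computes; `toy_rerank` is a hypothetical name.

```python
def toy_rerank(query, passages):
    """Rescore passages by the share of query words they contain, then sort."""
    query_terms = set(query.lower().split())
    rescored = []
    for passage in passages:
        passage_terms = set(passage["text"].lower().split())
        if query_terms:
            overlap = len(query_terms & passage_terms) / len(query_terms)
        else:
            overlap = 0.0
        # Copy the passage and replace the retrieval score with the new one
        rescored.append({**passage, "score": overlap})
    return sorted(rescored, key=lambda p: p["score"], reverse=True)

candidates = [
    {"id": 1, "text": "useful contacts", "score": 0.9},
    {"id": 2, "text": "sport and recreation tasmania", "score": 0.5},
]
reranked = toy_rerank("Sport and Recreation Tasmania", candidates)
print([(p["id"], p["score"]) for p in reranked])  # → [(2, 1.0), (1, 0.0)]
```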



#### Setting Environment Variables for API Access
This step sets the environment variable needed to authenticate and access the Groq API.

- `%env`: A Jupyter magic command used to set environment variables within the notebook.
- `GROQ_API_KEY`: The environment variable that stores the API key required for authenticating requests to the Groq API.

**Note:** Be cautious when sharing API keys to avoid unauthorised access!

In [None]:
%env GROQ_API_KEY=
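Before creating the model, it can help to fail early with a clear message if the key is missing or blank. The sketch below is one way to do that; `require_env` is a hypothetical helper, and the optional `environ` argument exists only so the function can be tried without touching the real environment.

```python
import os

def require_env(name, environ=None):
    """Return the value of an environment variable or raise if it is unset or blank."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demonstration with a stand-in mapping instead of the real environment
demo_key = require_env("GROQ_API_KEY", {"GROQ_API_KEY": " example-key "})
print(demo_key)  # → example-key
```

To keep the key out of the saved notebook entirely, you could also prompt for it with the standard library's `getpass.getpass()` and assign the result to `os.environ["GROQ_API_KEY"]`.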

#### Initialising the ChatGroq Model

This section initialises the `ChatGroq` model with streaming enabled, allowing real-time interaction with the model.

- `chat_model`: An instance of the `ChatGroq` model initialised with:
  - `model_name`: Specifies the model version to use (`llama-3.1-8b-instant`).
  - `api_key`: The API key retrieved from the environment variable (`GROQ_API_KEY`) for authentication.
  - `streaming=True`: Enables streaming, allowing for real-time output as the model processes input, which is useful for interactive sessions.

In [43]:
# Initialise the ChatGroq model with streaming enabled
chat_model = ChatGroq(
    model_name='llama-3.1-8b-instant',
    api_key=os.getenv("GROQ_API_KEY"), streaming=True
)

#### Generating Responses Using Context and Re-ranking

This section demonstrates how to generate a response using the retrieved context and re-ranked search results.

**Set-up:**

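1. Initialise Conversation Memory: Creates a buffer for the conversation history. Note that `generate_response` does not read from or write to this buffer, so each question is answered independently.
2. Define the Chat Prompt Template: Formats the input for the chat model, combining context and the user's question.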

The `generate_response` function handles user input, performs a search, reranks the results, and generates a response using the ChatGroq model.

- Embedding Query: Converts user input into a query embedding.
- Initial Search and Reranking: Retrieves and reranks search results to provide relevant context.
- Context Extraction: Combines text from the top reranked results.
- Response Generation: Uses the `ChatGroq` model to generate a response based on the provided context.

In [44]:
# Initialise ConversationBufferMemory
memory = ConversationBufferMemory(return_messages=True)

# Define the ChatPromptTemplate for user interaction
TEMPLATE = """Answer the following question from the context

context = {context}

question = {question}
"""
prompt_template = PromptTemplate(
    input_variables=["context", "question"], template=TEMPLATE
)

def generate_response(user_input: str) -> str:
    """
    Generates a response to a given user input by encoding the input,
    performing a search, reranking results, and using the ChatGroq model
    to generate a coherent response.

    Args:
        user_input (str): The input provided by the user.

    Returns:
        str: The generated response or an error message if an exception occurs.
    """
    try:
        query_embedding = encoder.encode([user_input])[0]
        # Get initial search results
        initial_results = query_qdrant(query_embedding, COLLECTION_NAME)
        # Re-rank the results
        rerankrequest = RerankRequest(user_input, initial_results)
        reranked_results = ranker.rerank(rerankrequest)

        # Extract the context from the reranked results
        context = "\n".join([result['text'] for result in reranked_results])

        # Generate a response using ChatGroq model
        full_response = chat_model.predict(
            prompt_template.format(question=user_input, context=context)
        )
        return full_response.strip()
    except (TypeError, KeyError) as te:
        print(f"TypeError or KeyError in generate_response: {str(te)}")
        return f"Error: {str(te)}"
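The context extraction and prompt formatting steps can be tried without calling the model. The sketch below uses plain `str.format` in place of LangChain's `PromptTemplate`, which is a simplification for illustration; `TEMPLATE_SKETCH`, `build_prompt`, and the `max_results` parameter are hypothetical names that are not part of the notebook.

```python
TEMPLATE_SKETCH = (
    "Answer the following question from the context\n\n"
    "context = {context}\n\n"
    "question = {question}\n"
)

def build_prompt(question, results, max_results=3):
    """Join the text of the top results and fill in the prompt template."""
    context = "\n".join(result["text"] for result in results[:max_results])
    return TEMPLATE_SKETCH.format(context=context, question=question)

example_results = [
    {"id": "1", "text": "First passage.", "score": 0.9},
    {"id": "2", "text": "Second passage.", "score": 0.1},
]
example_prompt = build_prompt("What does the first passage say?", example_results)
print(example_prompt)
```

Capping the number of passages keeps the prompt short, at the cost of possibly dropping a relevant chunk that ranked lower.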

#### Example Usage

This example demonstrates how to use the `generate_response` function to generate an answer based on a user's query.

- `generate_response`: The function is called with the `USER_QUERY` to generate a response based on the available context and data.
- `print(response)`: Outputs the generated response to the console for the user to see.

This example shows how to use the function to interact with the chat model, retrieve relevant information, and produce an answer.

In [53]:
# Example usage
USER_QUERY = "What is the phone number of Sport and Recreation Tasmania?"
response = generate_response(USER_QUERY)
print(response)

Batches: 100%|██████████| 1/1 [00:00<00:00, 68.97it/s]
INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


The phone number of Sport and Recreation Tasmania is 1800 252 476.


#### Testing the Question Without Context (For Comparison)

This test queries the `ChatGroq` model directly with the user's question, without providing any additional context, as a point of comparison with the vector database implementation above.

In [54]:
# Function to generate response without context
def generate_response_without_context(user_input: str) -> str:
    """
    Generates a response from the chat model without using any context.

    Args:
        user_input (str): The input string from the user for which a response
                          is to be generated.

    Returns:
        str: The generated response from the chat model.
    """
    try:
        # Directly use the chat model without any context
        response_nc = chat_model.predict(
            prompt_template.format(question=user_input, context="")
        )
        return response_nc.strip()
    except (TypeError, KeyError) as te:
        print(f"TypeError or KeyError in generate_response_without_context: {str(te)}")
        return f"Error: {str(te)}"

# Example usage: Testing the question without any context
USER_QUERY = "What is the phone number of Sport and Recreation Tasmania?"
response_without_context = generate_response_without_context(USER_QUERY)
print(response_without_context)

INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


However, I don't see the phone number of Sport and Recreation Tasmania in the provided context. I can suggest a possible solution for you. 

You can try contacting the Sport and Recreation Tasmania directly. You can look up their official website or social media channels to find their contact information, including their phone number.

Alternatively, you can try searching online for the phone number of Sport and Recreation Tasmania. You can also try contacting the Tasmanian government's customer service number, which is 1300 65 64 63 (from within Tasmania), for assistance.
