# General Purpose Document Parsing and Index Creation Notebook for Generative AI PoC Projects
Notebook and associated modules for parsing source documents and creating indexes for Generative AI projects. It is recommended to use this notebook to evaluate the parsing output on a set of sample documents prior to attempting to parse the full document library. Often the parsing functions will need to be modified directly to ensure the best possible parsing results.

### ToDo
* Add functionality for other cloud providers
* Add additional index types
* Add csv/excel parsing
* Update to allow AML pipelines for document processing (for large numbers of files)
* Create Bicep or TF files for resource deployment

## Environment setup
It is recommended to set up a Python virtual environment in which to install the needed packages, and to work from there. Most of this notebook assumes a Linux OS, but most items will also work on Windows. There are two exceptions: converting Word documents into PDFs must be done in Windows (see the example PowerShell script), and the utilities for extracting text directly from Word documents assume Bash as the command-line interface.

To create a new virtual environment, run one of the following commands from the command line (bash or cmd):  
Create a virtual environment - Linux:  
`$ python -m venv /path/to/new/virtual/environment`  

Create a virtual environment - Windows:  
if PATH and PATHEXT are configured:  
`c:\>python -m venv c:\path\to\new\virtual\environment`  
or if not:  
`c:\>replace with python path\python -m venv c:\path\to\new\virtual\environment`

A new directory will be created, if it does not exist.

Activate the virtual environment (where env is replaced with the path to your virtual environment) - Linux:  
`$ source env/bin/activate`  
Activate the virtual environment (where env is replaced with the path to your virtual environment) - Windows:  
`c:\> .\env\Scripts\activate`

Once you are in your virtual environment, install the necessary packages listed in the Pipfile, for example:  
`python -m pipenv install`  
The exact command depends on your system and preferred Python package manager.

To leave the virtual environment, just run the command `deactivate`
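
To confirm from Python (for example, from a notebook cell) that the interpreter is actually running inside the virtual environment, a minimal standard-library sketch such as the following can be used. The helper name `in_virtual_environment` is illustrative only and is not part of this notebook's modules.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtual_environment())
print("interpreter:", sys.executable)
```

If the first line prints `False`, the notebook kernel is most likely using the system interpreter rather than the environment created above.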


Additional references:  
https://docs.python.org/3/library/venv.html#venv-def  
https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#creating-a-virtual-environment  
https://detox.sourceforge.net/

In [1]:
# required imports
import os
import re
import json
from data_utils import chunk_directory, FILE_FORMAT_DICT

# optional imports depending on index method
import weaviate



## Download files from an Azure Blob store
* To explicitly download the documents from an Azure Blob store, use the cell below:
* Add the connection string and container names. This will download all blobs from the containers.
* If you want to limit the files to specific extensions, update the logic in the function below, or run it in a loop for each file extension.
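
One possible way to limit the download to specific extensions is to filter the blob names before downloading. The sketch below is an assumption about how that logic could look, not the notebook's own implementation; `filter_by_extension` is a hypothetical helper, and the blob names are made-up examples.

```python
from pathlib import PurePosixPath


def filter_by_extension(blob_names, file_extensions=None):
    """Keep only names whose extension is listed; None keeps everything."""
    if not file_extensions:
        return list(blob_names)
    # Compare case-insensitively so ".PDF" and ".pdf" are treated alike.
    wanted = {ext.lower() for ext in file_extensions}
    return [
        name for name in blob_names
        if PurePosixPath(name).suffix.lower() in wanted
    ]


# Blob names use "/" as a separator, hence PurePosixPath.
names = ["reports/2023/summary.PDF", "notes/readme.md", "raw/data.csv"]
print(filter_by_extension(names, [".pdf", ".md"]))
```

The same check could be applied to `blob.name` inside the download loop, skipping any blob that does not match.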


In [None]:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

download_blobs = False
blob_connection_string = '<replace with blob connection string>'
blob_containers = ['<replace with blob container names>']
file_extensions = None  # placeholder: get_blobs does not filter by extension yet

def get_blobs(blob_service_client = None, blob_containers = None, file_extensions = None):
    # download every blob in each container into the local data/ directory
    os.makedirs('data/', exist_ok=True)
    for container in blob_containers:
        container_client = blob_service_client.get_container_client(container)
        blob_list = container_client.list_blobs()

        for blob in blob_list:
            # use the last path component of the blob name as the local file name
            with open('data/' + re.split('/', blob.name)[-1], 'wb') as f:
                f.write(container_client.download_blob(blob.name).readall())

if download_blobs:
    # setup blob connection
    blob_service_client = BlobServiceClient.from_connection_string(blob_connection_string)
    get_blobs(blob_service_client=blob_service_client, blob_containers=blob_containers, file_extensions=file_extensions)

## Parse and Examine Source Documents (locally)
Parse a set of files locally. This will not create a vector index or do any embedding. It is recommended to use this method on at least a few test files of each type that you want to parse, and to examine the output. In many, if not most, cases the parsing will need some tuning to produce clean results.
* Set and/or uncomment the appropriate variables below
* If you want to use Azure Form Recognizer to crack the PDFs (if any), supply the appropriate credentials and information.
* * Export the Form Recognizer endpoint and key to your environment
* * If the source documents (PDF) contain a lot of formatted data, e.g., lots of tables, then it is recommended to create a specific layout in the Form Recognizer to use
 
### Notes:
* Start with a token overlap of approximately 10% of the chunk size (e.g., if num_tokens = 1024, set token_overlap to about 128, as token counts are normally multiples of 16)
* Increasing njobs may help speed things up if there are many files to process. The maximum is 32
* Currently, adding vectors to an Azure Cognitive Search index during parsing does NOT work well on the Azure end. It is NOT recommended to attempt to embed vectors if using this service. Investigations are underway to support this in a more reliable fashion
* If Azure Document Intelligence is to be used for parsing, note that it does NOT support Word (.doc, .docx) documents. They must be converted to PDF. See the PowerShell script below to do this
* If using the shell script to parse Word documents in a Linux OS, the file paths need to be cleaned up (e.g., no spaces or diacritical marks). The detox utility is recommended; see the references above.
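
To make the effect of `num_tokens` and `token_overlap` concrete, the sketch below splits a token sequence into overlapping windows. It is an illustration of the general sliding-window idea only, and is not how `chunk_directory` is implemented; `split_with_overlap` is a hypothetical helper, and plain integers stand in for real tokens.

```python
def split_with_overlap(tokens, num_tokens=1024, token_overlap=128):
    """Split a token list into windows that share token_overlap tokens."""
    if not 0 <= token_overlap < num_tokens:
        raise ValueError("token_overlap must be >= 0 and smaller than num_tokens")
    # Each new window starts this many tokens after the previous one.
    step = num_tokens - token_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + num_tokens])
        # Stop once a window has reached the end of the sequence.
        if start + num_tokens >= len(tokens):
            break
    return chunks


tokens = list(range(2500))  # integers standing in for real tokens
chunks = split_with_overlap(tokens)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-128:] == chunks[1][:128])
```

A larger overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of more chunks to store and search.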

In [2]:
# set and uncomment these variables if you want to use Azure Form Recognizer to crack PDFs;
# otherwise the script will use pypdf to crack PDF files (environment values must be strings)
# os.environ["FORM_RECOGNIZER_ENDPOINT"] = '<replace with form recognizer endpoint>'
# os.environ["FORM_RECOGNIZER_KEY"] = '<replace with form recognizer key>'

# you must set the directory path. The other variables are optional
# and are listed with their default values. The extensions list includes txt, html, pdf, py, md files.
# Word files are supported, but some prior preparation is needed
directory_path = '<top_level_directory_containing_documents>'
ignore_errors = True
num_tokens = 1024
min_chunk_size = 10
url_prefix = None
token_overlap = 128
extensions_to_process = list(FILE_FORMAT_DICT.keys())
form_recognizer_client = None
use_layout = False
njobs=4
add_embeddings = False
azure_credential = None
embedding_endpoint = None


If you want to use Azure Document Intelligence to process Word files (.doc, .docx), you will need to convert them to PDF files first. An example PowerShell script is shown below.
```
$Folders = Get-ChildItem <directory-name-here> -Directory -Recurse

ForEach ($Folder in $Folders)
{
    $wdFormatPDF = 17
    $word = New-Object -ComObject word.application
    $word.visible = $false
    $folderpath = "$($Folder.FullName)\*"
    $fileTypes = "*.docx","*.doc"
    Get-ChildItem -path $folderpath -include $fileTypes |
    foreach-object `
    {
        $path = ($_.fullname).substring(0,($_.FullName).lastindexOf("."))
        "Converting $path to pdf ..."
        $doc = $word.documents.open($_.fullname)
        $doc.saveas([ref] $path, [ref]$wdFormatPDF)
        $doc.close()
    }
    $word.Quit()
}
```

### Parse and Chunk the Documents
* Running this function will recursively parse and chunk all the files in the supplied directory_path and its subdirectories
* A chunking result will be returned, which consists of an object with the following attributes:
* * chunks (List[Document]): List of chunks
* * total_files (int): Total number of files.
* * num_unsupported_format_files (int): Number of files with unsupported format.
* * num_files_with_errors (int): Number of files with errors.
* * skipped_chunks (int): Number of chunks skipped.


* The chunks will be a list of objects with type Document, with the following attributes:
* * content (str): The content of the document.
* * id (Optional[str]): The id of the document.
* * title (Optional[str]): The title of the document.
* * filepath (Optional[str]): The filepath of the document.
* * url (Optional[str]): The url of the document.
* * metadata (Optional[Dict]): The metadata of the document.  
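
When examining the parsing output, a short summary plus a preview of the first few chunks is often enough to spot problems. The sketch below assumes only the attribute names listed above; the `Document` and `ChunkingResult` dataclasses are hypothetical stand-ins so the example runs on its own, and `summarize` is an illustrative helper rather than part of `data_utils`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Document:  # hypothetical stand-in mirroring the attributes listed above
    content: str
    id: Optional[str] = None
    title: Optional[str] = None
    filepath: Optional[str] = None
    url: Optional[str] = None
    metadata: Optional[Dict] = None


@dataclass
class ChunkingResult:  # hypothetical stand-in for the returned object
    chunks: List[Document] = field(default_factory=list)
    total_files: int = 0
    num_unsupported_format_files: int = 0
    num_files_with_errors: int = 0
    skipped_chunks: int = 0


def summarize(result, preview_chars=80, max_chunks=3):
    """Build a short text report for a chunking result."""
    lines = [
        f"files: {result.total_files} "
        f"(unsupported: {result.num_unsupported_format_files}, "
        f"errors: {result.num_files_with_errors})",
        f"chunks: {len(result.chunks)} (skipped: {result.skipped_chunks})",
    ]
    for doc in result.chunks[:max_chunks]:
        # Collapse whitespace so each preview fits on a single line.
        preview = " ".join(doc.content.split())[:preview_chars]
        lines.append(f"- {doc.filepath}: {preview}")
    return "\n".join(lines)


demo = ChunkingResult(
    chunks=[Document(content="Hello   world\nsecond line", filepath="a.txt")],
    total_files=2,
    num_unsupported_format_files=1,
)
print(summarize(demo))
```

The same `summarize` function should work on the object returned by `chunk_directory`, provided it exposes the attributes described above.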

In [5]:
chunks = chunk_directory(
    directory_path=directory_path,
    ignore_errors=ignore_errors,
    num_tokens=num_tokens,
    min_chunk_size=min_chunk_size,
    url_prefix=url_prefix,
    token_overlap=token_overlap,
    extensions_to_process=extensions_to_process,
    form_recognizer_client=form_recognizer_client,
    use_layout=use_layout,
    njobs=njobs,
    add_embeddings=add_embeddings,
    azure_credential=azure_credential,
    embedding_endpoint=embedding_endpoint
    )

## Create Index and/or Vector Store
For the information contained in the source documents to be accessible and relevant to Generative AI, the parsed chunks of text must be used to create an index of some sort. The examples below illustrate using Azure Cognitive Search and Weaviate as the searchable index.

If possible, a separate index should be created for each language represented in the source documents.

Important: ensure that there are no duplicate documents before index creation. Duplicates lead to poor search results.
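
One simple way to catch exact duplicates is to hash the normalized text of each document or chunk and keep only the first occurrence. The sketch below is an assumption about how such a check could look; `normalized_digest` and `drop_duplicates` are hypothetical helpers, and this approach will not detect near-duplicates such as two revisions of the same file.

```python
import hashlib


def normalized_digest(text):
    """Hash text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(contents):
    """Return contents with later duplicates removed, order preserved."""
    seen = set()
    unique = []
    for text in contents:
        digest = normalized_digest(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


texts = ["Reset the  router.", "reset the router.", "Replace the filter."]
print(drop_duplicates(texts))
```

Applied to the parsed chunks, the `content` attribute of each chunk would be the natural input for the digest.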

### Choosing between Azure Cognitive Search and Weaviate
Azure Cognitive Search integrates better with other Azure services for some use cases. In particular, if certain fields need to be restricted based on Azure authentication, or if an ETL pipeline feeding off an Azure storage method is anticipated, then this may be the better choice. However, it is also slower and more expensive, and if a very large number of documents are to be inserted into the index, then the pricing tier needs to be upgraded. Weaviate is open source and very fast, and allows easier customization of how the data is stored and retrieved. It is also deployed on Kubernetes, making it cloud/on-prem agnostic. It is limited to BM25 keyword and vector searching, although it does allow filtering on metadata and fields.

Both allow extensive customization of the index fields, including searchability, formatting, types, restrictions, filtering, and data returned.

### Azure Cognitive Search
Azure Cognitive Search can be used as the backing index for Generative AI. Azure Cognitive Search supports several types of searches, depending somewhat on how the index is built. It supports full text search, using Lucene; vector search, if embedding vectors are created during index creation; hybrid search, a combination of full text and vector search; and other search types that are not normally used directly by Generative AI applications, but are useful for things like geospatial search or complex fielded searching. See https://learn.microsoft.com/en-us/azure/search/ for additional information on how Azure Cognitive Search works and the various options allowed.

Once the parsing and chunking operations are tuned, bulk document parsing and index creation can be completed:
* Create a config file like `config.json`. You can create multiple indices for various configurations and data sets concurrently. The format of `config.json` should be an array of JSON objects, with each object specifying a configuration of local (or Azure Blob store) data path and target search service and index. If the search service or index does not exist, it will be created. You can insert additional items into the index at any point after the initial creation. A sample `config.json` file is provided, with a single object for the creation of one index from one data path (which can contain subdirectories). Your `config.json` file should look similar to the example below:
```
[
    {
        "data_path": "<local path or blob URL>",
        "location": "<azure region, e.g. 'westus2'>",
        "subscription_id": "<subscription id>",
        "resource_group": "<resource group name>",
        "search_service_name": "<search service name to use or create>",
        "index_name": "<index name to use or create>",
        "chunk_size": 1024, // set to null to disable chunking before ingestion
        "token_overlap": 128, // number of tokens to overlap between chunks
        "semantic_config_name": "default",
        "language": "en", // language of your documents. Change if your documents are not in English. See SUPPORTED_LANGUAGE_CODES in data_preparation_acs.py
        "vector_config_name": "default" // used if adding vectors to index - NOT recommended at this time.
    }
]
```

The data_path can be a local directory or an Azure Blob URL (like `https://<storage account name>.blob.core.windows.net/<container name>/<path>/`). If using a blob store, the files will be downloaded to a temporary directory on your local machine.
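
Before starting a long indexing run, it can be worth checking the config file for missing entries. The sketch below is a hypothetical pre-flight check, not part of `data_preparation_acs.py`: the set of keys treated as required is an assumption based on the sample above, and the config is assumed to have had its `//` comments removed, since Python's `json` module does not accept them.

```python
import json

# Assumption: these keys from the sample config are treated as required.
REQUIRED_KEYS = {
    "data_path",
    "location",
    "subscription_id",
    "resource_group",
    "search_service_name",
    "index_name",
}


def validate_config(config_text):
    """Return a list of problems found in a config.json string."""
    entries = json.loads(config_text)
    if not isinstance(entries, list):
        raise ValueError("config must be a JSON array of objects")
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not a JSON object")
            continue
        missing = sorted(REQUIRED_KEYS - entry.keys())
        if missing:
            problems.append(f"entry {i}: missing {', '.join(missing)}")
    return problems


sample = '[{"data_path": "data/", "index_name": "poc-index"}]'
print(validate_config(sample))
```

An empty list means every entry contains all of the keys checked here; it does not verify that the Azure resources themselves exist.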

Run one of the following commands in your python virtual environment:  
`python data_preparation_acs.py --config config.json --njobs=4` (or however many concurrent jobs you would like, max=32)  
Or if using Azure Document Intelligence (Form Recognizer) for cracking PDF files:  
`python data_preparation_acs.py --config config.json --njobs=4 --form-rec-resource <form-rec-resource-name> --form-rec-key <form-rec-key>`  
This will use the Form Recognizer Read model by default. If you have a layout, you can pass in the argument `--form-rec-use-layout`.

Notes:
* See the list in data_preparation_acs.py for supported language codes
* It is NOT recommended to use this process for adding vectors to an ACS index at this time. This service is in preview, and does not work well.
* You may have to turn on semantic search (if using) through the Azure Portal. The REST interface for ACS management does not always activate semantic search; however, it is generally recommended to use semantic search for most GenAI applications for better search results.