# **Enforcing Access Permissions with Document Metadata and Constrained Search**
Large organizations have many documents and many members, and not every member may access every document. For example, a member may only have access to:
- documents within their department
- documents with matching authorization levels tied to their position in the organization
- a fine-grained set of documents

This notebook demonstrates how you can use NeuralDB's document metadata and constrained search to enforce these restrictions. At a high level, the process is as follows:
1. Index each document with metadata containing its ID, department, and authorization level.
2. When a user fires a query, get the user's department, authorization level, and set of allowed files.
3. Constrain the search results to documents that satisfy every restriction that applies to the user: department, authorization level, and allowed files set.
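
The three steps above can be sketched in plain Python, independent of NeuralDB. The sample documents, the `permitted` helper, and the hard-coded user are hypothetical and exist only for illustration; in the rest of this notebook the filtering is done by NeuralDB's constrained search instead.

```python
# Illustrative sketch only: plain-Python filtering, not NeuralDB's implementation.
from typing import Optional

# Step 1: each document carries metadata (hypothetical sample values).
docs = [
    {"id": 0, "dept": "Maintenance", "auth": 0},
    {"id": 1, "dept": "Promotion", "auth": 1},
    {"id": 2, "dept": "Maintenance", "auth": 2},
]


def permitted(doc: dict, auth: Optional[int] = None,
              depts: Optional[list] = None, ids: Optional[list] = None) -> bool:
    """Return True if the document passes every restriction that is set."""
    if auth is not None and doc["auth"] > auth:
        return False
    if depts is not None and doc["dept"] not in depts:
        return False
    if ids is not None and doc["id"] not in ids:
        return False
    return True


# Step 2: look up the user's permissions (hard-coded here).
user = {"auth": 1, "depts": ["Maintenance"]}

# Step 3: keep only the documents the user may see.
visible_ids = [d["id"] for d in docs if permitted(d, **user)]
print(visible_ids)
```

A restriction left as `None` is simply skipped, which mirrors how a user without departmental limits is handled later in the notebook.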

We will use the CUAD V1 dataset, a collection of 510 contracts organized into three parts and 17 contract types. The parts and contract types serve as proxies for "authorization level" and "department", respectively.

### Downloading CUAD V1
Note: This dataset may take around an hour to download and unzip.

In [1]:
import os

cuad_path = "CUAD_v1"

if not os.path.exists(cuad_path):
    os.system(f"wget -O {cuad_path}.zip 'https://zenodo.org/record/4595826/files/CUAD_v1.zip?download=1'")
    os.system(f"unzip {cuad_path}.zip -d .")

### Document Metadata
Let's curate IDs, permission levels, and departments for each document. The following command collects the paths to all contract files.

In [2]:
import glob

contract_paths = glob.glob(cuad_path + "/full_contract_pdf/*/*/*")

print(contract_paths[0])

CUAD_v1/full_contract_pdf/Part_II/Promotion/MIDDLEBROOKPHARMACEUTICALS,INC_03_18_2010-EX-10.1-PROMOTION AGREEMENT.PDF


When we print the paths, we can see that the last three segments are "Part", contract type, and contract file name. We will treat the "Part" as the permission level and treat the contract type as the department. 
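
As a quick check of that path layout, the sketch below splits the example path printed above. Using `PurePosixPath` is a choice made here for illustration, not something the notebook depends on; the variable names are likewise hypothetical.

```python
from pathlib import PurePosixPath

# Example path copied from the output above.
path = PurePosixPath(
    "CUAD_v1/full_contract_pdf/Part_II/Promotion/"
    "MIDDLEBROOKPHARMACEUTICALS,INC_03_18_2010-EX-10.1-PROMOTION AGREEMENT.PDF"
)

# The last three segments are the "Part", the contract type, and the file name.
part, contract_type, file_name = path.parts[-3:]
print(part)
print(contract_type)
```

For this path, `part` is `Part_II` and `contract_type` is `Promotion`.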

In [None]:
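# Map each CUAD "Part" folder to a numeric authorization level.
auth_levels = {
    "Part_I": 0,
    "Part_II": 1,
    "Part_III": 2,
}
# Normalize separators so the split also works on Windows paths.
path_segments = [os.path.normpath(path).split(os.sep) for path in contract_paths]
contract_parts = [segments[-3] for segments in path_segments]
contract_auths = [auth_levels[part] for part in contract_parts]

# The contract-type folder stands in for the department.
contract_depts = [segments[-2] for segments in path_segments]

# Use each contract's position in the list as its document ID.
contract_ids = list(range(len(contract_paths)))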

### Inserting into NeuralDB
Insert the documents into NeuralDB with metadata containing the document's ID, department, and authorization level.

In [None]:
from thirdai import neural_db as ndb

db = ndb.NeuralDB()

documents = [
    ndb.PDF(path, metadata={"id": iden, "dept": dept, "auth": auth}) 
    for path, iden, dept, auth in zip(contract_paths, contract_ids, contract_depts, contract_auths)
]

db.insert(documents, train=False);

### Constrained Search
To apply constraints to your search results, pass a `constraints` keyword argument to the `search()` method like this:

In [None]:
db.search(
    "hello world",
    top_k=10,
    constraints={
        # Exact-match constraints on metadata fields.
        "auth": 1,
        "dept": "Affiliate_Agreements",
    },
)

You can also use range and set-membership constraints:

In [None]:
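db.search(
    "hello world",
    top_k=10,
    constraints={
        # Match documents with auth <= 1.
        "auth": ndb.LessThan(1, include_equal=True),
        # Match documents whose department is any of the listed values.
        "dept": ndb.AnyOf(["Affiliate_Agreements", "Co_Branding", "Maintenance"]),
    },
)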

## **Permission-Constrained Search Service**
Imagine you need to build a search system for your organization. Each user should only get search results from documents that they have permission to access. To build this system, we will assume that there are services we can use to get a user's permissions.

The following block of code simulates a service for getting a user's permissions.

In [None]:
user_registry = {
    "user_with_auth_2_in_select_departments": {
        "auth": 2,
        "dept": ["Affiliate_Agreements", "Co_Branding", "Maintenance"],
    },
    "user_with_auth_2_in_all_departments": {
        "auth": 2,
    },
    "user_with_auth_1_in_select_departments": {
        "auth": 1,
        "dept": ["Affiliate_Agreements", "Co_Branding", "Maintenance"],
    },
    "user_with_specific_document_permissions": {
        "id": [5, 302, 602],
    },
}

def user_auth_level(user_id: str):
    if user_id in user_registry:
        if "auth" in user_registry[user_id]:
            return user_registry[user_id]["auth"]
        # This user does not have auth level constraints
        return None
    raise ValueError(f"User {user_id} cannot be found in the registry.")

def user_departments(user_id: str):
    if user_id in user_registry:
        if "dept" in user_registry[user_id]:
            return user_registry[user_id]["dept"]
        # This user does not have departmental constraints
        return None
    raise ValueError(f"User {user_id} cannot be found in the registry.")

def user_allowed_documents(user_id: str):
    if user_id in user_registry:
        if "id" in user_registry[user_id]:
            return user_registry[user_id]["id"]
        # This user does not have document-specific permissions
        return None
    raise ValueError(f"User {user_id} cannot be found in the registry.")
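
The three lookup functions above share the same shape, so they could be collapsed into one helper. The following is a minimal sketch under that assumption; `registry_sketch`, its two users, and `user_permission` are hypothetical names introduced here, not part of the notebook's simulated service.

```python
from typing import Any, Optional

# Hypothetical registry with the same shape as `user_registry` above.
registry_sketch = {
    "alice": {"auth": 2, "dept": ["Maintenance"]},
    "bob": {"id": [5]},
}


def user_permission(user_id: str, key: str) -> Optional[Any]:
    """Return the user's value for `key`, or None if that restriction is unset."""
    if user_id not in registry_sketch:
        raise ValueError(f"User {user_id} cannot be found in the registry.")
    return registry_sketch[user_id].get(key)


print(user_permission("alice", "auth"))
print(user_permission("bob", "dept"))
```

As in the original functions, an unknown user raises `ValueError`, while a known user without a given restriction yields `None`.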

Finally, we can use these services and NeuralDB's constrained search to define a secure search endpoint. We assume that this code runs on a remote server, the user cannot access the NeuralDB instance except through this endpoint, and the user ID is provided by an authentication layer.

In [None]:
def search_endpoint(query: str, user_id: str, top_k: int):
    auth_level = user_auth_level(user_id)
    departments = user_departments(user_id)
    allowed_documents = user_allowed_documents(user_id)

    # The keys of the constraint dictionary match the keys of the document
    # metadata.
    constraints = {}
    if auth_level is not None:
        # Match any document with auth <= user's auth level.
        constraints["auth"] = ndb.LessThan(auth_level, include_equal=True)
    if departments is not None:
        constraints["dept"] = ndb.AnyOf(departments)
    if allowed_documents is not None:
        constraints["id"] = ndb.AnyOf(allowed_documents)
    
    results = db.search(query, top_k=top_k, constraints=constraints)

    # Compare against None so that auth level 0 is still reported as a limit.
    print("Auth level constraints:", auth_level, "or below" if auth_level is not None else "")
    print("Department constraints:", departments)
    print("Document-level constraints:", allowed_documents)
    print("")
    print("Results:")
    print("====================================================")
    for r in results:
        print("Content:", r.text)
        print("Auth level:", r.metadata["auth"])
        print("Department:", r.metadata["dept"])
        print("Document ID:", r.metadata["id"])
        print("====================================================")
    return 
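
One caveat with the endpoint above: a user who is in the registry but has no restrictions configured ends up with an empty `constraints` dictionary. Assuming an empty dictionary means the search is unrestricted (an assumption about NeuralDB's behavior that is worth verifying), such a user would see every document. The sketch below shows one way to fail closed instead; `require_constraints` is a hypothetical helper, not part of NeuralDB.

```python
def require_constraints(constraints: dict) -> dict:
    """Reject an empty constraint set so a misconfigured user sees nothing."""
    if not constraints:
        raise PermissionError("No permissions are configured for this user.")
    return constraints


# A non-empty constraint set passes through unchanged.
print(require_constraints({"auth": 1}))

# An empty constraint set is refused.
try:
    require_constraints({})
except PermissionError as err:
    print("Denied:", err)
```

In `search_endpoint`, the helper would be called on `constraints` just before `db.search`.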

Let's see it in action. In the first example, the user has the highest auth level (level 2) but is restricted to a few departments. Notice that our search results span multiple permission levels but only contain results from "Affiliate_Agreements", "Co_Branding", and "Maintenance".

In [None]:
search_endpoint("Term of agreement", user_id="user_with_auth_2_in_select_departments", top_k=10)

Without departmental restrictions, we now get results from other departments as well.

In [None]:
search_endpoint("Term of agreement", user_id="user_with_auth_2_in_all_departments", top_k=10)

In contrast, if the user has auth level 1, we no longer get results from documents with auth level 2.

In [None]:
search_endpoint("Term of agreement", user_id="user_with_auth_1_in_select_departments", top_k=10)

Finally, if the user does not have auth restrictions or departmental restrictions and instead has document-specific permissions, we will only get results from those documents.

In [None]:
search_endpoint("Terms of the agreement", user_id="user_with_specific_document_permissions", top_k=10)

## Row-level Constraints for CSVs
When you insert a CSV file, constraints are not limited to the document level: they can also apply to individual rows. To demonstrate this, we will insert a CSV file that contains population data from different countries.

In [1]:
from thirdai import neural_db as ndb

db = ndb.NeuralDB()


In [2]:
populations = ndb.CSV(
    path="country_populations.csv",
    id_column="id",
    strong_columns=["Region/Country/Area"],
    weak_columns=["Dummy Description"],
    reference_columns=["Region/Country/Area"],
    # IMPORTANT: you have to pass the columns that will be used for 
    # constrained search to the index_columns argument.
    index_columns=[
        "Region/Country/Area",
        "Population aged 0 to 14 years old (percentage)",
        "Population aged 60+ years old (percentage)",
        "Population density",
        "Population mid-year estimates (millions)",
        "Population mid-year estimates for females (millions)",
        "Population mid-year estimates for males (millions)",
        "Sex ratio (males per 100 females)"
    ]
)

db.insert([populations], train=True)

loading data | source 'Documents:
country_populations.csv'
loaded data | source 'Documents:
country_populations.csv' | vectors 468 | batches 1 | time 0.002s | complete

train | epoch 0 | train_steps 1 | train_hash_precision@5=0.017094  | train_batches 1 | time 1.135s

train | epoch 1 | train_steps 2 | train_hash_precision@5=0.278205  | train_batches 1 | time 0.228s

train | epoch 2 | train_steps 3 | train_hash_precision@5=0.160684  | train_batches 1 | time 0.204s

train | epoch 3 | train_steps 4 | train_hash_precision@5=0.125641  | train_batches 1 | time 0.194s

train | epoch 4 | train_steps 5 | train_hash_precision@5=0.125214  | train_batches 1 | time 0.192s

train | epoch 5 | train_steps 6 | train_hash_precision@5=0.144444  | train_batches 1 | time 0.197s

train | epoch 6 | train_steps 7 | train_hash_precision@5=0.181624  | train_batches 1 | time 0.192s

train | epoch 7 | train_steps 8 | train_hash_precision@5=0.212821  | train_batches 1 | time 0.195s

train | epoch 8 | train_steps 9

['60e929dc9d59d2f07b3d6971222ab8b1b0aa5615']

In [6]:
results = db.search(
    query="A country", 
    top_k=10,
    constraints={
        "Region/Country/Area": ndb.InRange("A", "F"),
        "Population density": ndb.GreaterThan(100),
        "Population aged 0 to 14 years old (percentage)": ndb.LessThan(50),
        "Population mid-year estimates (millions)": ndb.InRange(10, 20),
    },
)

for result in results:
    print(result.text)
    print(result.metadata)
    print("")

Region/Country/Area: Benin
{'id': 21, 'Region/Country/Area': 'Benin', 'Population aged 0 to 14 years old (percentage)': 42.4, 'Population aged 60+ years old (percentage)': 4.9, 'Population density': 118.4, 'Population mid-year estimates (millions)': 13.35, 'Population mid-year estimates for females (millions)': 6.66, 'Population mid-year estimates for males (millions)': 6.69, 'Sex ratio (males per 100 females)': 100.4, 'Dummy Description': 'Benin is a country in the UN populations dataset.'}

Region/Country/Area: Czechia
{'id': 54, 'Region/Country/Area': 'Czechia', 'Population aged 0 to 14 years old (percentage)': 16.0, 'Population aged 60+ years old (percentage)': 26.3, 'Population density': 135.9, 'Population mid-year estimates (millions)': 10.49, 'Population mid-year estimates for females (millions)': 5.32, 'Population mid-year estimates for males (millions)': 5.17, 'Sex ratio (males per 100 females)': 97.1, 'Dummy Description': 'Czechia is a country in the UN populations dataset.'}
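
To see why these two rows pass, the sketch below re-checks the four constraints in plain Python, using metadata values copied from the results above. It assumes that `InRange` includes both endpoints and that string ranges are compared lexicographically; both are assumptions about NeuralDB's semantics, and `satisfies` is a hypothetical helper.

```python
# Metadata values copied from the two results above.
rows = [
    {
        "Region/Country/Area": "Benin",
        "Population density": 118.4,
        "Population aged 0 to 14 years old (percentage)": 42.4,
        "Population mid-year estimates (millions)": 13.35,
    },
    {
        "Region/Country/Area": "Czechia",
        "Population density": 135.9,
        "Population aged 0 to 14 years old (percentage)": 16.0,
        "Population mid-year estimates (millions)": 10.49,
    },
]


def satisfies(row: dict) -> bool:
    """Check one row against all four constraints from the query above."""
    # Assumption: ranges are inclusive and strings compare lexicographically.
    return (
        "A" <= row["Region/Country/Area"] <= "F"
        and row["Population density"] > 100
        and row["Population aged 0 to 14 years old (percentage)"] < 50
        and 10 <= row["Population mid-year estimates (millions)"] <= 20
    )


print(all(satisfies(row) for row in rows))

# Under lexicographic ordering, a longer name starting with "F" sorts after "F".
print("A" <= "France" <= "F")
```

Under these assumptions, a string range of `"A"` to `"F"` covers names starting with A through E, plus the exact string `"F"`, so widening the upper bound (for example to `"G"`) would be needed to include countries whose names start with F.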