# Use `Uniflow` to Extract PDF and Ingest into OpenSearch (Resources Set Up)

### Before running the code

You will need to create a `uniflow` conda environment to run this notebook. Set up the environment by following the [installation instructions](https://github.com/CambioML/uniflow/tree/main#installation).

Next, you will need a valid [AWS CLI profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) to run the code. Set one up by running `aws configure --profile <profile_name>` in your terminal and entering your AWS Access Key ID and AWS Secret Access Key. Both are listed in the [Security Credentials](https://console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials) section of the AWS console.

```bash
$ aws configure --profile <profile_name>
AWS Access Key ID [None]: <your_access_key_id>
AWS Secret Access Key [None]: <your_secret_access_key>
Default region name [None]: us-west-2
Default output format [None]: json
```

Make sure to set `Default output format` to `json`.

> Note: If you don't have the AWS CLI installed, you will get a `command not found: aws` error. To install it, follow the [installation instructions](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
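
The check described in the note can also be done from Python before running the rest of the notebook. The following is a minimal sketch that uses only the standard library; the helper name `aws_cli_version` is hypothetical and not part of Uniflow or boto3.

```python
import shutil
import subprocess


def aws_cli_version():
    """Return the `aws --version` string, or None if the CLI is not on PATH."""
    aws_path = shutil.which("aws")
    if aws_path is None:
        return None
    result = subprocess.run([aws_path, "--version"], capture_output=True, text=True)
    # Some older CLI releases print the version to stderr instead of stdout.
    return (result.stdout or result.stderr).strip()


version = aws_cli_version()
print(version if version else "AWS CLI not found; install it before running `aws configure`.")
```

If the helper returns `None`, install the CLI first; otherwise continue with `aws configure`.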

### Install Extra Libraries

In [1]:
import sys

!{sys.executable} -m pip install -q boto3

### Import Libraries

In [2]:
import boto3
from datetime import datetime
import time
import json

In [3]:
# Replace 'default' with the profile name you configured above, if different
session = boto3.Session(profile_name='default')
account_id = session.client('sts').get_caller_identity().get('Account')
region = session.region_name

current_time = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

### Create an S3 bucket
In this section, we will create an S3 bucket to store the PDF files. You can create an S3 bucket by running the following command in your terminal:
    
```bash
$ aws s3api create-bucket --bucket your-bucket-here --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2
```

Or you can use the boto3 code below to create an S3 bucket.


In [None]:
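s3_client = session.client('s3')
s3_bucket_name = f"uniflow-es-sample-bucket-{account_id}-{region}"

def create_bucket(bucket_name, region="us-west-2"):
    """Create an S3 bucket in the given region and print any error."""
    try:
        if region == "us-east-1":
            # us-east-1 is the default location and rejects a LocationConstraint
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            location = {'LocationConstraint': region}
            s3_client.create_bucket(Bucket=bucket_name,
                                    CreateBucketConfiguration=location)
    except Exception as e:
        print("Error in creating bucket: ", e)

# Create the bucket in the session's region so it matches the bucket name
create_bucket(s3_bucket_name, region)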
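
Because the bucket name is assembled from the account ID and the session region, it can help to validate it before calling S3. The sketch below checks only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, hyphens, and dots; starting and ending with a letter or digit; no consecutive dots). The helper name `is_valid_bucket_name` and the account ID in the example are hypothetical, and S3 enforces further rules that this sketch does not cover.

```python
import re

# Basic S3 naming rules: 3-63 characters, lowercase letters, digits,
# hyphens and dots, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Return True if `name` passes the basic checks above."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None and ".." not in name


# Hypothetical account ID, used for illustration only
good_name = "uniflow-es-sample-bucket-123456789012-us-west-2"
# What the f-string produces when the profile has no default region
bad_name = f"uniflow-es-sample-bucket-123456789012-{None}"

print(is_valid_bucket_name(good_name))
print(is_valid_bucket_name(bad_name))
```

The second name fails because a missing region is formatted as `None`, which contains an uppercase letter. This is an easy way to catch a profile without a default region before the bucket is created.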

### Create an OpenSearch domain
We highly recommend creating the OpenSearch domain in the AWS console by following [these instructions](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html#createupdatedomains-console). Alternatively, you can use the boto3 code below.

Specify master user credentials for your OpenSearch domain.

The master user password must contain at least one uppercase letter, one lowercase letter, one number, and one special character.
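
The password rules above can be checked locally before the domain request is sent. This is a minimal sketch, assuming that "special character" means any character in Python's `string.punctuation`; the set AWS actually accepts may differ, and any length requirement is not checked here. The helper name `password_problems` is hypothetical.

```python
import string


def password_problems(password):
    """Return a list of the rules above that `password` does not satisfy."""
    problems = []
    if not any(c.isupper() for c in password):
        problems.append("missing an uppercase letter")
    if not any(c.islower() for c in password):
        problems.append("missing a lowercase letter")
    if not any(c.isdigit() for c in password):
        problems.append("missing a number")
    # Assumption: "special character" means ASCII punctuation
    if not any(c in string.punctuation for c in password):
        problems.append("missing a special character")
    return problems


print(password_problems("Example-Passw0rd!"))
print(password_problems("password"))
```

An empty list means all four rules are met; otherwise the list names what to fix.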

In [5]:
master_username = 'your_master_username'
master_password = 'your_master_password'

In [None]:
es_client = session.client("opensearch")
domain_name = "uniflow-es-sample-domain"

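# Open access policy; requests are still authenticated by the
# master user credentials configured below
access_policies = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "es:*",
            # Use the session's region so the ARN matches where the domain is created
            "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain_name}/*",
        }
    ],
}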


def create_opensearch_domain(domain_name):
    try:
        response = es_client.create_domain(
            DomainName=domain_name,
            EngineVersion="OpenSearch_2.7",
            ClusterConfig={
                "InstanceType": "r6g.large.search",
                "InstanceCount": 3,
                "DedicatedMasterEnabled": True,
                "ZoneAwarenessEnabled": True,
                "DedicatedMasterType": "m6g.large.search",
                "DedicatedMasterCount": 3,
                "MultiAZWithStandbyEnabled": True,
            },
            EBSOptions={
                "EBSEnabled": True,
                "VolumeType": "gp3",
                "VolumeSize": 100,
            },
            AccessPolicies=json.dumps(access_policies),
            EncryptionAtRestOptions={"Enabled": True},  # required for fine-grained access control
            NodeToNodeEncryptionOptions={"Enabled": True},
            DomainEndpointOptions={"EnforceHTTPS": True},
            AdvancedSecurityOptions={
                "Enabled": True,
                "InternalUserDatabaseEnabled": True,
                "MasterUserOptions": {
                    "MasterUserName": master_username,
                    "MasterUserPassword": master_password,
                },
            },
        )
        print("Domain created:", response)
    except Exception as e:
        print("Error in creating domain: ", e)


create_opensearch_domain(domain_name)

### Write our configuration in .env
After creating the S3 bucket and the OpenSearch domain, we upload a sample PDF to the bucket, wait for the domain to finish processing, and then write the configuration to a `.env` file in the same directory as this notebook.

In [7]:
s3_sample_prefix = "uniflow-es-sample/pdf/nike-paper.pdf"
s3_client.upload_file('es_sample_files/pdf/nike-paper.pdf', s3_bucket_name, s3_sample_prefix)

In [8]:
# Return the Processing flag of the newly created domain
def get_opensearch_domain_status(domain_name):
    response = es_client.describe_domain(DomainName=domain_name)
    return response['DomainStatus']['Processing']

# Poll once a minute until the domain has finished processing
while True:
    processing = get_opensearch_domain_status(domain_name)
    print("Domain Processing status:", processing)
    if not processing:
        break
    time.sleep(60)


Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: True
Domain Processing status: False
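
The polling loop above runs until the domain reports that it is done, however long that takes. A variant with an upper bound on the wait might look like the sketch below. The names `wait_for_domain` and `StubClient` are hypothetical; the stub stands in for the boto3 OpenSearch client so the example runs without AWS access, and in the notebook you would pass `es_client` instead.

```python
import time


def wait_for_domain(client, domain_name, poll_seconds=60, timeout_seconds=3600,
                    sleep=time.sleep):
    """Poll describe_domain until Processing is False.

    Returns True when the domain is ready, False if the timeout is reached first.
    """
    waited = 0
    while True:
        status = client.describe_domain(DomainName=domain_name)["DomainStatus"]
        if not status["Processing"]:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


class StubClient:
    """Stand-in for the boto3 OpenSearch client so the sketch runs offline."""

    def __init__(self, processing_states):
        self._states = iter(processing_states)

    def describe_domain(self, DomainName):
        return {"DomainStatus": {"Processing": next(self._states)}}


def no_sleep(seconds):
    """Skip the real delay in the offline demonstration."""


finished = wait_for_domain(StubClient([True, True, False]),
                           "uniflow-es-sample-domain", sleep=no_sleep)
timed_out = wait_for_domain(StubClient([True, True, True]),
                            "uniflow-es-sample-domain",
                            timeout_seconds=60, sleep=no_sleep)
print("Domain ready:", finished)
print("Domain ready before timeout:", timed_out)
```

Returning a flag instead of looping forever lets the caller decide whether to retry or stop when domain creation stalls.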


In [12]:
describe_domain_response = es_client.describe_domain(DomainName=domain_name)
opensearch_url = describe_domain_response["DomainStatus"]["Endpoint"]

with open('.env', 'w') as f:
    f.write(f"OPENSEARCH_URL={opensearch_url}\n")
    f.write(f"ES_USERNAME={master_username}\n")
    f.write(f"ES_PASSWORD={master_password}\n")
    f.write(f"S3_BUCKET={s3_bucket_name}\n")
    f.write(f"S3_PREFIX={s3_sample_prefix}\n")
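
To confirm what was written, or to read the values back in another notebook, the `KEY=VALUE` lines can be parsed with a few lines of standard-library Python. This is a minimal sketch; the helper name `parse_env` is hypothetical, and the endpoint and password in the sample text are placeholders rather than real values.

```python
def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "="
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Placeholder content in the same shape as the .env file written above
sample = (
    "OPENSEARCH_URL=search-example.us-west-2.es.amazonaws.com\n"
    "ES_USERNAME=your_master_username\n"
    "ES_PASSWORD=abc=123\n"
)
config = parse_env(sample)
print(config["OPENSEARCH_URL"])
```

In the notebook, you could pass the contents of the `.env` file, for example `parse_env(open('.env').read())`. Note that the file stores the master password in plain text, so keep it out of version control.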

## End of the notebook

Check out more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>
