# Available Functions
This document summarizes the functions available in the SDK.

---
## 1. Create UDP client (Unstructured Data Processor)


The `udp` Python SDK streamlines interactions with the Unstructured Data Processing (UDP) service on Cloud Pak for Data (CP4D). It provides a high-level interface for creating, managing, and executing data flows. At the heart of the SDK is the `UDPClient` class, which lets developers define and run UDP flows with minimal code and integrate unstructured data processing into Python-based workflows.

### Configuration for UDPClient
To initialize the `UDPClient`, provide a configuration dictionary with the following keys:
* ### Example
    ```python
    config = {
        'base_url': "<URL of the UDP service>",  # e.g., "https://dummy.com/"
        'token': None,                           # Optional: Use if you have a bearer token
        'project_id': "420f8ed1589d48c",         # Required: Your project identifier
        'user_name': "",                         # Optional: Used if token is not provided
        'password': "",                          # Optional: Used with user_name
        'api_key': "",                           # Optional: Alternative to token or username/password
        'env': 'cpd'                             # Required: Environment type
    }
    ```

For CPD environments, the base URL should be in the format:
`https://cpd-wkc.apps.udptest7.cp.fyre.ibm.com`

For Cloud environments, the base URL should be in the format:
`https://api.dai.dev.cloud.ibm.com/`

### Authentication Options
You must provide one of the following authentication methods:

* token
* user_name and password
* api_key

⚠️ If more than one method is provided, the client prioritizes them in the order: token → api_key → user_name/password.
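
The priority order above can be illustrated with a small sketch. The helper `pick_auth_method` is hypothetical and not part of the SDK; it only mirrors the documented order so you can check which credential a given configuration dictionary would likely use.

```python
def pick_auth_method(config):
    """Return the credential type the documented priority order would select.

    Hypothetical helper for illustration; the SDK's internal logic may differ.
    """
    if config.get("token"):
        return "token"
    if config.get("api_key"):
        return "api_key"
    if config.get("user_name") and config.get("password"):
        return "user_name/password"
    raise ValueError("Provide a token, an api_key, or a user_name and password")


example = {"token": None, "api_key": "my-key", "user_name": "admin", "password": "secret"}
print(pick_auth_method(example))  # api_key
```

Running a check like this before creating the client makes it obvious when an unused credential has been left in the configuration.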

### Environment (env)
The `env` key specifies the environment in which the client will operate. Valid values include:

* "cpd" - Cloud Pak for Data
* "cloud-dev" - Development environment
* "cloud-test" - Testing environment
* "cloud-prod" - Production environment
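
As a minimal sketch, the configuration can be checked before it is handed to the client. The helper `config_problems` is hypothetical, treating `base_url` as required is an assumption, and the set of environments is limited to the values listed above (the service may accept others).

```python
# Environments listed in this document; the service may accept others.
VALID_ENVS = {"cpd", "cloud-dev", "cloud-test", "cloud-prod"}
# project_id and env are documented as required; base_url is assumed to be.
REQUIRED_KEYS = ("base_url", "project_id", "env")


def config_problems(config):
    """Return a list of human-readable problems found in the config dict."""
    problems = [f"missing required key: {key}" for key in REQUIRED_KEYS if not config.get(key)]
    env = config.get("env")
    if env and env not in VALID_ENVS:
        problems.append(f"unknown env: {env!r}")
    return problems


candidate = {"base_url": "https://dummy.com/", "project_id": "420f8ed1589d48c", "env": "cloud"}
print(config_problems(candidate))  # ["unknown env: 'cloud'"]
```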

## 2. Get metadata
`get_metadata()` returns all operator names together with their attributes, features, and required values. Here `uc` is the initialized `UDPClient` instance.
* Example
    `metadata = uc.get_metadata()`

              

# 3. Available Operators
This section summarizes the available operators, grouped by functional category.

---
Each operator is defined by both an attribute schema and a feature schema, which together describe how the operator functions and how its data can be processed and filtered.

🧩 Attributes: <br>
Each operator has its own attributes and parameters.<br>
Each attribute usually includes:

        name: The display name shown in user interfaces or documentation.
        description: The purpose of the attribute.
        default: The value used if the user does not provide one.
        required: Whether the attribute is mandatory.
        valid_values: The allowed options for the attribute.

🧩 Features:<br>
Each operator adds its own features to the table. These features can be used with the sql_filter operator to run queries. You can retrieve the list of features added by each operator.<br>
Each feature usually includes:

        name: The display name of the feature, shown in interfaces or documentation.
        description: What the feature contains (for example, a unique identifier for each document).
        available_for_filter: Whether the feature can be used to filter documents (e.g., search by ID).
        available_for_vector_db: Whether the feature can be stored with the document in a vector database.
        mandatory_for_vector_db: Whether the feature is required when storing the document in a vector database.
        type: The data type of the feature.
        
<div style="
    border-left: 6px solid #fbc02d;
    background-color: #fff8e1;
    padding: 12px 16px;
    border-radius: 6px;
    font-family: sans-serif;
">

💡 <b style="color:#f57f17;">Tip:</b>  
You can use the <code>get_attributes()</code> and <code>get_features()</code> functions to inspect what settings (attributes) and data properties (features) are available for any operator.

<div style="
    background: #fff3cd;
    padding: 10px;
    border-radius: 4px;
    font-family: monospace;
    font-size: 90%;
    line-height: 1.4;
    margin-top: 10px;
    white-space: pre;
">
operator_attributes = get_attributes(metadata, "extract_cpd")
operator_features   = get_features(metadata, "extract_cpd")
</div>

</div>

## Ingest Operator
These operators bring data into the system from various sources.
* ### `ingest_cpd_assets`
    The `ingest_cpd_assets` operator ingests assets of type `data_asset` in the project. It lets users specify which assets to ingest, filter by file type, and skip files that exceed a given size.
    *  ####  Example for get_attributes
        `operator_attributes = get_attributes(metadata, "ingest_cpd_assets")`<br>
        `print(json.dumps(operator_attributes, indent=2))`
        ```json
        {
            "max_file_size": {
                "name": "Max File Size",
                "description": "If the document is larger than the given max file size, then it will be skipped",
                "default": 100,
                "required": true
            },
            "cp4d_asset_ids": {
                "name": "Asset ID's",
                "description": "IDs of the assets to be ingested",
                "default": null,
                "required": true
            },
            "include_filter": {
                "name": "Include File Type",
                "description": "File types to be included",
                "default": "pdf,txt,md",
                "required": false
            }
        }
        ```
    *  ####  Example for get_features

        `operator_features = get_features(metadata, "ingest_cpd_assets")`<br>
        `print(json.dumps(operator_features, indent=2))`
        
        ```json 
        {
            "id": {
                "name": "ID",
                "description": "The ID of the document",
                "available_for_filter": true,
                "available_for_vector_db": true,
                "mandatory_for_vector_db": true,
                "type": "string"
            },
            "name": {
                "name": "Document Name",
                "description": "The name of the document",
                "available_for_vector_db": true,
                "type": "string"
            },
            "size": {
                "name": "Size",
                "description": "The size of the document",
                "available_for_filter": true,
                "available_for_vector_db": true,
                "type": "int64"
            },
            "created_time": {
                "name": "Created Time",
                "description": "When the document was created in the project",
                "available_for_filter": true,
                "available_for_vector_db": true,
                "type": "int64"
            },
            "modified_time": {
                "name": "Modified Time",
                "description": "When the document was last modified",
                "available_for_filter": true,
                "available_for_vector_db": true,
                "type": "int64"
            }
        }
        ```




* ### `ingest_cpd_connections`: <small>Ingests the selected documents or folders from connections such as Amazon S3 or Box.</small>


* ### `ingest_document_set`: <small>Ingests the selected document set or base document set.</small>
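
The attribute and feature dictionaries shown above for `ingest_cpd_assets` can be inspected programmatically. The sketch below uses a trimmed copy of that sample output; the helper names `required_without_default` and `filterable_features` are hypothetical and not part of the SDK.

```python
# Trimmed copy of the sample output shown for ingest_cpd_assets.
attributes = {
    "max_file_size": {"name": "Max File Size", "default": 100, "required": True},
    "cp4d_asset_ids": {"name": "Asset ID's", "default": None, "required": True},
    "include_filter": {"name": "Include File Type", "default": "pdf,txt,md", "required": False},
}
features = {
    "id": {"available_for_filter": True, "available_for_vector_db": True, "type": "string"},
    "name": {"available_for_vector_db": True, "type": "string"},
    "size": {"available_for_filter": True, "available_for_vector_db": True, "type": "int64"},
}


def required_without_default(attrs):
    """Attributes the caller must set: marked required and without a default."""
    return [key for key, spec in attrs.items()
            if spec.get("required") and spec.get("default") is None]


def filterable_features(feats):
    """Features flagged as usable for filtering (for example with sql_filter)."""
    return [key for key, spec in feats.items() if spec.get("available_for_filter")]


print(required_without_default(attributes))  # ['cp4d_asset_ids']
print(filterable_features(features))         # ['id', 'size']
```

The same helpers should work on the output of `get_attributes()` and `get_features()` for other operators, assuming they return dictionaries of the same shape.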
   

## Extract Operator
This operator extracts useful information from ingested data.
* ### `extract_cpd`: <small>Extracts metadata or content from CPD assets.</small>

## Quality Operator
These operators help assess and improve the quality of data and documents.

* ### `lang_detect`: <small>Detects the language of a document.</small>

* ### `doc_quality`: <small>Evaluates the quality of a document.</small>

* ### `sql_filter`: <small>Applies SQL-based filtering to datasets.</small>

* ### `data_class_assignment`: <small>Assigns data to predefined classes.</small>

* ### `term_assignment_operator`: <small>Assigns terms or tags to content.</small>

* ### `pii_and_hap_extract_redact`: <small>Identifies and redacts sensitive information (PII/HAP).</small>

* ### `redaction`: <small>Removes or masks sensitive content.</small>


## Functional Operator
These operators provide core processing capabilities.

* ### `chunker`: <small>Breaks documents into smaller, manageable chunks.</small>

* ### `embeddings`: <small>Converts text into vector representations for machine learning.</small>


## VectorDB Operator
These operators interact with vector databases for advanced search and retrieval.

* ### `milvusdb_cp4d`: <small>Connects to the Milvus vector database in CPD.</small>

* ### `document_set`: <small>Manages sets of documents in vector format.</small>
     



## 4. Create Pipeline for execution
To create a pipeline using `UDPClient`, define a list of operators in the order they should be executed. Each operator is a dictionary containing:

* "type": The type of operation to perform.
* "parameters" (optional): A dictionary of parameters required by that operator. You can get these parameters from the CAMS API for the particular asset. If you do not provide them, only the UI view is affected.

* ### Example
```python
# Operators run in the order listed.
operators = [
    {
        "type": "ingest_document_set",
        "parameters": {
            "document_set_id": "1b6d97c5-4585-4789-b075-873a1536d2e1",
            "input_assets": {
                "document_set_name": "prince_16june",
                "created_on": "2025-05-16T13:33:25Z",
                "document_set_id": "1b6d97c5-4585-4789-b075-873a1536d2e1"
            }
        }
    },
    {"type": "extract_cpd"},
    {"type": "chunker"},
    {"type": "embeddings"},
    {"type": "data_class_assignment"},
    {"type": "milvusdb_cp4d"}
]

# Global configuration: where the flow stores its data
global_config = {
    "global_config": {
        "data_local_config": {
            "output_folder": "./test/flows/output"
        },
        "data_storage_type": "local"
    }
}

flow_name = "xyz"  # Name of your flow
pipeline = {
    "flow_name": flow_name,
    "project_id": config.get('project_id'),  # `config` is the UDPClient configuration dictionary
    "orchestrator": "python",
    "flow": operators,
    "global_config": global_config
}
```
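
A missing `"type"` key or a stray typo in the operator list is easy to overlook, so a small check before submitting the pipeline can help. The following is a minimal sketch; `build_pipeline` is a hypothetical helper, not an SDK function, and it only assembles the same dictionary shape used in the example above.

```python
def build_pipeline(flow_name, project_id, operators, global_config, orchestrator="python"):
    """Assemble a pipeline dict after checking that every operator has a 'type'."""
    for position, operator in enumerate(operators):
        if not isinstance(operator, dict) or "type" not in operator:
            raise ValueError(f"operator at position {position} needs a 'type' key")
    return {
        "flow_name": flow_name,
        "project_id": project_id,
        "orchestrator": orchestrator,
        "flow": list(operators),
        "global_config": global_config,
    }


storage = {"global_config": {"data_storage_type": "local"}}
demo = build_pipeline("xyz", "420f8ed1589d48c", [{"type": "extract_cpd"}, {"type": "chunker"}], storage)
print([op["type"] for op in demo["flow"]])  # ['extract_cpd', 'chunker']
```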

## 5. Get the Milvus Operator Feature Mapping Related Metadata

#### Initial `milvusdb_cp4d` Type Operator  
  ``` python
      {
        "type": "milvusdb_cp4d",
        "parameters": {
          "connection_id": "<cp4d-connection-id>",
          "collection_name":"<collection_name>"
        }
      }
  ```
  If `collection_name` is not provided, the system uses `datasift` as the default.

Calling `milvus_feature_mapping_metadata = uc.get_milvus_feature_mapping_metadata(pipeline_details=pipeline)` returns the Milvus feature mapping metadata, which includes:
- `available_collections` - Lists all collections available in the Milvus connection.
- `collection_columns` - Lists columns present in the specified collection.
- `available_features` - Available features in the given pipeline that can be mapped to collection columns.
- `milvus_feature_mappings` - Default feature mapping for the specified collection.
* #### Example
`available_collections = milvus_feature_mapping_metadata.get("available_collections")`
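
Because the example reads the result with `.get()`, the metadata appears to behave like a dictionary. Under that assumption, the sketch below checks that the four documented keys are present before they are used. The helper `missing_metadata_keys` and the sample values are hypothetical.

```python
# The four keys documented for the Milvus feature mapping metadata.
EXPECTED_KEYS = (
    "available_collections",
    "collection_columns",
    "available_features",
    "milvus_feature_mappings",
)


def missing_metadata_keys(metadata):
    """Return the documented keys that are absent (or None) in the metadata."""
    return [key for key in EXPECTED_KEYS if metadata.get(key) is None]


# Hypothetical partial response, used only to exercise the helper.
sample = {"available_collections": ["datasift"], "collection_columns": []}
print(missing_metadata_keys(sample))  # ['available_features', 'milvus_feature_mappings']
```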

Use this metadata to update the milvus_embedding.

Once the pipeline is defined, create and run the flow:

```python
from udp.flows import Flow

flow = Flow(uc)

try:
    flow.create(flow_name, pipeline)
    flow.run()
    print("Status:", flow.status())
except Exception as e:
    print("Error:", e)
```