# Ingest metadata of basic entities into OpenMetadata

In this tutorial, we show how to ingest metadata into OpenMetadata.

There are several ways to ingest metadata into OpenMetadata, such as:
- connectors
- REST API
- Python SDK

In this tutorial, we only show how to use the `Python SDK` to ingest metadata.

## 1. Set up the Python virtual environment

Open a conda shell of **Python 3.11** in `Bureau`->`Raccourci`->`Python` (that is, Desktop -> Shortcuts -> Python), then enter the commands below:

```shell
# 1. Check if conda exists in the current shell
conda --version

# 2. create a virtual environment
conda create --name om-ingestion python --offline
# view existing virtual environment list
conda env list
# list the virtual environments again, this time via conda info
conda info --envs

# 3. activate a virtual environment
conda activate om-ingestion

# 4. install packages
# check installed package list
pip list

# install package via requirements.txt
pip install -r requirements.txt

# 5. verify that you have the required packages
pip show pandas
pip show openmetadata-ingestion
```
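As a complement to `pip show`, the same verification can be done from Python. The sketch below relies only on the standard library module `importlib.metadata`; the helper name `installed_version` is illustrative and not part of OpenMetadata.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two packages this tutorial depends on.
for pkg in ("pandas", "openmetadata-ingestion"):
    version = installed_version(pkg)
    print(f"{pkg}: {version if version else 'NOT INSTALLED'}")
```

Running this inside the activated `om-ingestion` environment should print a version for both packages before you continue.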

## 2. Ingest metadata of basic entities

The most basic entities in OpenMetadata are the descriptive metadata of data assets, for example:
- Databases
- Tables
- Columns
- Filesystem
- Folder
- Files
- Etc.

In the examples below, we insert the descriptive metadata of a file system (storage service and containers), then of a database, its schema, tables, and columns.

In [5]:
import pandas as pd
from metadata.ingestion.ometa.ometa_api import OpenMetadata
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
    OpenMetadataConnection, AuthProvider)
from metadata.generated.schema.security.client.openMetadataJWTClientConfig import OpenMetadataJWTClientConfig
from metadata.generated.schema.api.services.createStorageService import CreateStorageServiceRequest
from metadata.generated.schema.entity.services.storageService import StorageServiceType, StorageConnection
from metadata.generated.schema.entity.services.connections.storage.customStorageConnection import \
    CustomStorageConnection, CustomStorageType

from metadata.generated.schema.entity.services.storageService import StorageService

### 2.1 Check OpenMetadata API server connectivity

The Python SDK that we use to ingest metadata is an `OM client`: it needs to connect to an `OM server` to ingest metadata.
Let's check the connectivity of the server from the client.

In [2]:
# you need to modify this value to match your target open metadata server url
target_om_server = "http://om-dev.casd.local/api"

In [3]:

from conf.creds import om_oidc_token

server_config = OpenMetadataConnection(
    hostPort=target_om_server,
    authProvider=AuthProvider.openmetadata,
    securityConfig=OpenMetadataJWTClientConfig(
        jwtToken=om_oidc_token,
    ),
)
om_con = OpenMetadata(server_config)

In [4]:
# if it returns True, the connection is successful
om_con.health_check()

True
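If the server is still starting up or the network is flaky, a single check may fail spuriously. The sketch below shows one possible retry wrapper; `wait_until_healthy` is a hypothetical helper (not part of the SDK) that accepts any zero-argument callable returning a boolean, so it runs without a server.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool], retries: int = 3, delay_s: float = 1.0) -> bool:
    """Call `check` up to `retries` times; return True as soon as it succeeds."""
    for attempt in range(1, retries + 1):
        try:
            if check():
                return True
        except Exception as exc:  # e.g. network error or rejected token
            print(f"attempt {attempt} failed: {exc}")
        if attempt < retries:
            time.sleep(delay_s)  # wait before the next attempt
    return False
```

With the connection created above, it could be called as `wait_until_healthy(om_con.health_check)`.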

### 2.2 Ingest metadata of a file system

In this section, we learn how to ingest the metadata of a file system into the OpenMetadata server. Suppose we have a file system (e.g. a data lake) with the following layout:

```text
|Constances
|   |- geospatial
|   |   |- vector
|   |   |    |- file1.wkb
|   |   |    |- file2.geojson
|   |   |- raster
|   |   |    |- file3.tif
|   |   |    |- file4.nc
|   |- clinical
|   |   |- file5.parquet
|   |   |- file6.csv
```
To represent a file system, the OpenMetadata server provides a concept called `StorageService`. A `StorageService` may contain one or more `Containers` (directories or files).
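Because a child container refers to its parent, parents have to be created before their children. The following sketch, which uses plain Python only, derives a creation order from the example tree; the nested-dict encoding and the `plan_containers` helper are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Nested dict mirroring the example tree: a dict is a directory, None is a file.
TREE = {
    "geospatial": {
        "vector": {"file1.wkb": None, "file2.geojson": None},
        "raster": {"file3.tif": None, "file4.nc": None},
    },
    "clinical": {"file5.parquet": None, "file6.csv": None},
}


def plan_containers(tree: dict, parent_path: str = "") -> List[Tuple[str, str, Optional[str]]]:
    """Return (name, full_path, parent_path) tuples, parents always before children."""
    plan = []
    for name, children in tree.items():
        full_path = f"{parent_path}/{name}"
        plan.append((name, full_path, parent_path or None))
        if children:  # a directory: recurse into it
            plan.extend(plan_containers(children, full_path))
    return plan


for name, full_path, parent in plan_containers(TREE):
    print(f"{full_path}  (parent: {parent})")
```

Each tuple carries what a container request needs: a name, a full path, and the parent to attach to (`None` means the container sits directly under the storage service).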

#### 2.2.1 Create a storage service

You can consider the storage service as the abstraction of a file system which allows you to store data.

In [6]:
# --------------- CONFIGURATION ---------------

STORAGE_SERVICE_NAME = "Constances-Datalake"
STORAGE_SERVICE_DESC = "Main constances datalake which host all constances related data"

# Step 1: Create the CustomStorageConnection
cs_conn = CustomStorageConnection(
    type=CustomStorageType.CustomStorage,
    connectionOptions={
    }
)

# Step 2: Wrap it inside StorageConnection
storage_conn = StorageConnection(config=cs_conn)

# Step 3: Create StorageServiceRequest
storage_service_request = CreateStorageServiceRequest(
    name=STORAGE_SERVICE_NAME,
    serviceType=StorageServiceType.CustomStorage,
    displayName=STORAGE_SERVICE_NAME,
    description=STORAGE_SERVICE_DESC,
    connection=storage_conn  # <-- must be StorageConnection, not CustomStorageConnection directly
)

# Step 4: Create or update in OpenMetadata
storage_service_entity = om_con.create_or_update(data=storage_service_request)
print(f"StorageService created: {storage_service_entity}")

StorageService created: id=Uuid(root=UUID('cc3361c7-85db-4dbd-9ba0-27ce4d91e230')) name=EntityName(root='Constances-Datalake') fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Datalake') displayName='Constances-Datalake' serviceType=<StorageServiceType.CustomStorage: 'CustomStorage'> description=Markdown(root='Main constances datalake which host all constances related data') connection=StorageConnection(config=CustomStorageConnection(type=<CustomStorageType.CustomStorage: 'CustomStorage'>, sourcePythonClass=None, connectionOptions=ConnectionOptions(root={}), containerFilterPattern=None, supportsMetadataExtraction=SupportsMetadataExtraction(root=True))) pipelines=None testConnectionResult=None tags=[] version=EntityVersion(root=0.1) updatedAt=Timestamp(root=1757407607305) updatedBy='ingestion-bot' href=Href(root=AnyUrl('http://localhost:8585/v1/services/storageServices/cc3361c7-85db-4dbd-9ba0-27ce4d91e230')) owners=EntityReferenceList(root=[]) changeDescription=None incremen

#### 2.2.2 Create containers

OpenMetadata provides a concept called `Container` to represent directories and files. In the sections below, we first create
containers that represent directories, then containers that represent files.

We will create the following directories:
- geospatial
- geospatial/vector
- geospatial/raster
- clinical

In [7]:
# config containers parameters
GEO_PATH = "/geospatial"
GEO_CONTAINER_NAME = "geospatial"
GEO_CONTAINER_DESC = "This folder contains all constances related geospatial data"

VECTOR_PATH = "/geospatial/vector"
VECTOR_CONTAINER_NAME = "vector"
VECTOR_CONTAINER_DESC = "This folder contains all constances related geospatial data in vector format"

RASTER_PATH = "/geospatial/raster"
RASTER_CONTAINER_NAME = "raster"
RASTER_CONTAINER_DESC = "This folder contains all constances related geospatial data in raster format"

CLINICAL_PATH = "/clinical"
CLINICAL_CONTAINER_NAME = "clinical"
CLINICAL_CONTAINER_DESC = "This folder contains all constances related clinical data"

In [8]:
from metadata.generated.schema.api.data.createContainer import CreateContainerRequest
from metadata.generated.schema.entity.data.container import ContainerDataModel
from metadata.generated.schema.entity.data.table import Column, DataType
from metadata.generated.schema.type.entityReference import EntityReference

# -------------------------
#  Create geospatial container under storage service
# -------------------------
geo_dir_request = CreateContainerRequest(
    name=GEO_CONTAINER_NAME,
    displayName=GEO_CONTAINER_NAME,
    description=GEO_CONTAINER_DESC,
    service=storage_service_entity.fullyQualifiedName,  # StorageService FQN
    fullPath=GEO_PATH,
)

# Register with OpenMetadata
geo_dir_entity = om_con.create_or_update(data=geo_dir_request)
print(f"✅ Container created: {geo_dir_entity.fullyQualifiedName}")

# -------------------------
#  Create vector container under geospatial container
# -------------------------
vector_dir_request = CreateContainerRequest(
    name=VECTOR_CONTAINER_NAME,
    displayName=VECTOR_CONTAINER_NAME,
    description=VECTOR_CONTAINER_DESC,
    service=storage_service_entity.fullyQualifiedName,
    parent=EntityReference(id=geo_dir_entity.id, type="container"),
    fullPath=VECTOR_PATH,
)
vector_dir_entity = om_con.create_or_update(data=vector_dir_request)
print(f"✅ Container created: {vector_dir_entity.fullyQualifiedName}")
# -------------------------
#  Create raster container under geospatial container
# -------------------------
raster_dir_request = CreateContainerRequest(
    name=RASTER_CONTAINER_NAME,
    displayName=RASTER_CONTAINER_NAME,
    description=RASTER_CONTAINER_DESC,
    service=storage_service_entity.fullyQualifiedName,
    parent=EntityReference(id=geo_dir_entity.id, type="container"),
    fullPath=RASTER_PATH
)
raster_dir_entity = om_con.create_or_update(data=raster_dir_request)
print(f"✅ Container created: {raster_dir_entity.fullyQualifiedName}")
# -------------------------
#  Create clinical container under storage service
# -------------------------
clinical_dir_request = CreateContainerRequest(
    name=CLINICAL_CONTAINER_NAME,
    displayName=CLINICAL_CONTAINER_NAME,
    description=CLINICAL_CONTAINER_DESC,
    service=storage_service_entity.fullyQualifiedName,  # StorageService FQN
    fullPath=CLINICAL_PATH,
)

# Register with OpenMetadata
clinical_dir_entity = om_con.create_or_update(data=clinical_dir_request)
print(f"✅ Container created: {clinical_dir_entity.fullyQualifiedName}")

✅ Container created: root='Constances-Datalake.geospatial'
✅ Container created: root='Constances-Datalake.geospatial.vector'
✅ Container created: root='Constances-Datalake.geospatial.raster'
✅ Container created: root='Constances-Datalake.clinical'


#### 2.2.3 Create containers to represent files

We have created all the required directories; now let's create the files. In OpenMetadata, a file can also carry a schema. In the examples below, we first define the schema (i.e. the columns), then attach these columns to a container (i.e. the file) through a `ContainerDataModel`.


In [9]:
# config file1
file1_name = "file1.wkb"
file1_desc = "All hospital data in France"
file1_path = f"{VECTOR_PATH}/{file1_name}"

# define columns of file1
file1_columns = [
    Column(
        name="hospital_id",
        displayName="Hospital ID",
        dataType=DataType.INT,
        description="Unique identifier for the hospital"
    ),
    Column(
        name="hospital_name",
        displayName="Hospital Name",
        dataType=DataType.STRING,
        description="Name of the hospital"
    ),
    Column(
        name="location",
        displayName="location",
        dataType=DataType.STRING,
        description="gps coordinates where the hospital is located"
    ),
    Column(
        name="capacity",
        displayName="Capacity",
        dataType=DataType.INT,
        description="Number of beds available"
    ),
]

# Build the data model for the container
file1_data_model = ContainerDataModel(
    isPartitioned=False,
    columns=file1_columns
)

# Create the container request
file1_request = CreateContainerRequest(
    name=file1_name,
    displayName=file1_name,
    description=file1_desc,
    service=storage_service_entity.fullyQualifiedName,  # must be the FQN of your StorageService
    parent=EntityReference(id=vector_dir_entity.id, type="container"),  # reference to the parent container by id
    dataModel=file1_data_model,
    fullPath=file1_path,
    numberOfObjects=1,
    size=123.4,
    fileFormats=["csv"],
)

# Register with OpenMetadata
file1_entity = om_con.create_or_update(data=file1_request)
print(f"✅ Container created: {file1_entity.fullyQualifiedName}")

✅ Container created: root='Constances-Datalake.geospatial.vector."file1.wkb"'
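The output above shows that a name containing a dot (such as `file1.wkb`) is wrapped in double quotes inside the fully qualified name. The sketch below reproduces that behavior as inferred from the printed output only; `build_fqn` is a hypothetical helper, and the server's actual quoting rules may cover more cases.

```python
from typing import Iterable


def build_fqn(parts: Iterable[str]) -> str:
    """Join name parts with '.', double-quoting any part that itself contains a dot."""
    quoted = [f'"{part}"' if "." in part else part for part in parts]
    return ".".join(quoted)


print(build_fqn(["Constances-Datalake", "geospatial", "vector", "file1.wkb"]))
# prints: Constances-Datalake.geospatial.vector."file1.wkb"
```

This is useful when you need to look up a file container by its fully qualified name later on.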


In [10]:
# config file2
file2_name = "file2.geojson"
file2_desc = "All patients in constances cohort"
file2_path = f"{VECTOR_PATH}/{file2_name}"

# define columns of file2
file2_columns = [
    Column(
        name="patient_id",
        displayName="Patient ID",
        dataType=DataType.INT,
        description="Unique identifier of patient"
    ),
    Column(
        name="patient_name",
        displayName="Patient Name",
        dataType=DataType.STRING,
        description="Name of the patient"
    ),
    Column(
        name="location",
        displayName="location",
        dataType=DataType.STRING,
        description="gps coordinates where the patient live"
    )
]

# Build the data model for the container
file2_data_model = ContainerDataModel(
    isPartitioned=False,
    columns=file2_columns
)

# Create the container request
file2_request = CreateContainerRequest(
    name=file2_name,
    displayName=file2_name,
    description=file2_desc,
    service=storage_service_entity.fullyQualifiedName,  # must be the FQN of your StorageService
    parent=EntityReference(id=vector_dir_entity.id, type="container"),  # reference to the parent container by id
    dataModel=file2_data_model,
    fullPath=file2_path,
    numberOfObjects=1,
    size=456.7,
    fileFormats=["json"],
)

# Register with OpenMetadata
file2_entity = om_con.create_or_update(data=file2_request)
print(f"✅ Container created: {file2_entity.fullyQualifiedName}")

✅ Container created: root='Constances-Datalake.geospatial.vector."file2.geojson"'


In [11]:
# config file3
file3_name = "file3.tif"
file3_desc = "AVG Temperature of France city by month in geotiff format"
file3_path = f"{RASTER_PATH}/{file3_name}"

# define columns of file3
file3_columns = [
    Column(
        name="avg_temperature",
        displayName="Average temperature",
        dataType=DataType.INT,
        description="Average temperature of a pixel in geotiff"
    ),
    Column(
        name="pixel",
        displayName="pixel",
        dataType=DataType.STRING,
        description="Pixel of french city in geotiff"
    ),

]

# Build the data model for the container
file3_data_model = ContainerDataModel(
    isPartitioned=False,
    columns=file3_columns
)

# Create the container request
file3_request = CreateContainerRequest(
    name=file3_name,
    displayName=file3_name,
    description=file3_desc,
    service=storage_service_entity.fullyQualifiedName,  # must be the FQN of your StorageService
    parent=EntityReference(id=raster_dir_entity.id, type="container"),  # reference to the parent container by id
    dataModel=file3_data_model,
    fullPath=file3_path,
    numberOfObjects=1,
    size=789.7,
    fileFormats=["csv"],
)

# Register with OpenMetadata
file3_entity = om_con.create_or_update(data=file3_request)
print(f"✅ Container created: {file3_entity.fullyQualifiedName}")

✅ Container created: root='Constances-Datalake.geospatial.raster."file3.tif"'


In [12]:
# config file4
file4_name = "file4.nc"
file4_desc = "AVG air pollution of France city by month in netcdf format"
file4_path = f"{RASTER_PATH}/{file4_name}"

# define columns of file4
file4_columns = [
    Column(
        name="avg_air_pollution",
        displayName="Average air pollution",
        dataType=DataType.INT,
        description="Average air pollution of a pixel in netcdf"
    ),
    Column(
        name="pixel",
        displayName="pixel",
        dataType=DataType.STRING,
        description="Pixel of french city in netcdf"
    ),

]

# Build the data model for the container
file4_data_model = ContainerDataModel(
    isPartitioned=False,
    columns=file4_columns
)

# Create the container request
file4_request = CreateContainerRequest(
    name=file4_name,
    displayName=file4_name,
    description=file4_desc,
    service=storage_service_entity.fullyQualifiedName,  # must be the FQN of your StorageService
    parent=EntityReference(id=raster_dir_entity.id, type="container"),  # reference to the parent container by id
    dataModel=file4_data_model,
    fullPath=file4_path,
    numberOfObjects=1,
    size=666.7,
    fileFormats=["csv"],
)

# Register with OpenMetadata
file4_entity = om_con.create_or_update(data=file4_request)
print(f"✅ Container created: {file4_entity.fullyQualifiedName}")

✅ Container created: root='Constances-Datalake.geospatial.raster."file4.nc"'


In [13]:
def create_file_entity(om_server_con, file_name: str, file_desc: str, file_path: str, file_columns, storage_service,
                       parent_dir, file_size: float):
    """
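    Create a file entity (container) under a directory entity (container) inside a storage service.
    Note: the file format is hard-coded to "csv" in the request below.
    :param om_server_con: OpenMetadata server connection
    :param file_name: name of the file, used as container name and display name
    :param file_desc: description of the file
    :param file_path: full path of the file inside the storage service
    :param file_columns: a list of Column objects describing the file schema
    :param storage_service: the storage service entity that hosts the file
    :param parent_dir: the parent directory (container) entity
    :param file_size: size of the file
    :return: None; the fully qualified name of the created container is printed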
    """
    # Build the data model for the container
    file_data_model = ContainerDataModel(
        isPartitioned=False,
        columns=file_columns
    )

    # Create the container request
    file_request = CreateContainerRequest(
        name=file_name,
        displayName=file_name,
        description=file_desc,
        service=storage_service.fullyQualifiedName,  # must be the FQN of your StorageService
        parent=EntityReference(id=parent_dir.id, type="container"),  # reference to the parent container by id
        dataModel=file_data_model,
        fullPath=file_path,
        numberOfObjects=1,
        size=file_size,
        fileFormats=["csv"],
    )

    # Register with OpenMetadata
    file_entity = om_server_con.create_or_update(data=file_request)
    print(f"✅ Container created: {file_entity.fullyQualifiedName}")

In [14]:
# config file5
file5_name = "file5.parquet"
file5_desc = "blood test of a patient"
file5_path = f"{CLINICAL_PATH}/{file5_name}"
file5_size = 345.6
# define columns of file5
file5_columns = [
    Column(
        name="blood_test_id",
        displayName="blood test id",
        dataType=DataType.INT,
        description="Unique identifier of the blood test"
    ),
    Column(
        name="patient_name",
        displayName="patient Name",
        dataType=DataType.STRING,
        description="Name of the patient"
    ),
    Column(
        name="red_cell_count",
        displayName="red cell number count",
        dataType=DataType.INT,
        description="Number of red cells"
    ),

]

create_file_entity(om_con, file5_name, file5_desc, file5_path, file5_columns, storage_service_entity,
                   clinical_dir_entity, file5_size)

✅ Container created: root='Constances-Datalake.clinical."file5.parquet"'


In [15]:
# config file6
file6_name = "file6.csv"
file6_desc = "general test of a patient"
file6_path = f"{CLINICAL_PATH}/{file6_name}"
file6_size = 8888.8
# define columns of file6
file6_columns = [
    Column(
        name="general_test_id",
        displayName="general test id",
        dataType=DataType.INT,
        description="Unique identifier of the general test"
    ),
    Column(
        name="patient_name",
        displayName="patient Name",
        dataType=DataType.STRING,
        description="Name of the patient"
    ),
    Column(
        name="patient_weight",
        displayName="patient weight",
        dataType=DataType.INT,
        description="weight of a patient"
    ),

    Column(
        name="patient_height",
        displayName="patient height",
        dataType=DataType.INT,
        description="height of a patient"
    ),

]

create_file_entity(om_con, file6_name, file6_desc, file6_path, file6_columns, storage_service_entity,
                   clinical_dir_entity, file6_size)

✅ Container created: root='Constances-Datalake.clinical."file6.csv"'


### 2.3 Ingest the metadata of a database

We have seen how to ingest the metadata of a file system. Now let's see how to ingest the metadata of a database. Suppose we have a MySQL database called `hospitals_in_france` and we want to ingest its metadata into OM, so that other users can discover and use it.

To ingest the metadata of a database, the following hierarchy must be respected:
`DatabaseService` -> `Database` -> `Schema` -> `Table` -> `Column`
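Each level is created with a reference to the fully qualified name of its parent, so the order of creation matters. The sketch below prints that order for the names used in this tutorial; it is plain Python with no server call, and `creation_plan` is a hypothetical helper. The dotted form of the names is an assumption based on the container names printed earlier.

```python
from typing import List, Tuple

# Names taken from this tutorial; "fr_communes_raw" is one of the tables used later on.
LEVELS = [
    ("DatabaseService", "Constances-Geography"),
    ("Database", "hospitals_in_france"),
    ("Schema", "Geography"),
    ("Table", "fr_communes_raw"),
]


def creation_plan(levels: List[Tuple[str, str]]) -> List[str]:
    """Return one line per level, each showing the parent FQN it must reference."""
    lines, parent_fqn = [], None
    for kind, name in levels:
        fqn = name if parent_fqn is None else f"{parent_fqn}.{name}"
        lines.append(f"{kind}: {fqn} (parent: {parent_fqn})")
        parent_fqn = fqn
    return lines


for line in creation_plan(LEVELS):
    print(line)
```

Columns are not created on their own: they are passed as part of the table request.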

In [26]:
from metadata.generated.schema.api.services.createDatabaseService import CreateDatabaseServiceRequest
from metadata.generated.schema.entity.services.connections.database.common.basicAuth import BasicAuth
from metadata.generated.schema.entity.services.connections.database.mysqlConnection import MysqlConnection
from metadata.generated.schema.entity.services.databaseService import (DatabaseConnection, DatabaseService,
                                                                       DatabaseServiceType, )

# name of the db service
DB_SERVICE_NAME = "Constances-Geography"
# description of the service
DB_SERVICE_DESC = "This database service stores all geography databases of INSERM"

DB_AUTH_LOGIN = "db_login"
DB_AUTH_PWD = "db_pwd"
DB_URL = "db_url:1234"  # hostPort is expected as host:port, without a URL scheme

db_service = CreateDatabaseServiceRequest(
    name=DB_SERVICE_NAME,
    serviceType=DatabaseServiceType.Mysql,
    connection=DatabaseConnection(
        config=MysqlConnection(
            username=DB_AUTH_LOGIN,
            authType=BasicAuth(password=DB_AUTH_PWD),
            hostPort=DB_URL,
        )
    ),
    description=DB_SERVICE_DESC,
)

# `create_or_update` returns the entity that was created (or updated) on the server
db_service_entity = om_con.create_or_update(data=db_service)

In [None]:
# you can view the content of the returned object to check if your request is executed correctly.
print(db_service_entity)

In [None]:
from metadata.generated.schema.api.data.createDatabase import CreateDatabaseRequest

DB_NAME = "hospitals_in_france"

db_entity_req = CreateDatabaseRequest(
    name=DB_NAME,
    service=db_service_entity.fullyQualifiedName,
    description="In this database, we store all tables which contain geographical information in Constances",
)

db_entity = om_con.create_or_update(data=db_entity_req)

In [None]:
from metadata.generated.schema.api.data.createDatabaseSchema import CreateDatabaseSchemaRequest

SCHEMA_NAME = "Geography"
create_schema_req = CreateDatabaseSchemaRequest(
    name=SCHEMA_NAME,
    database=db_entity.fullyQualifiedName,
    description="In this schema, we group all tables which contain geographical information of hospitals in France", )

# `create_or_update` returns the created schema entity; its fullyQualifiedName is used later to attach tables
schema_entity = om_con.create_or_update(data=create_schema_req)

#### Step 2: Get metadata from source files

Here we use two files to describe the metadata:
- `<project_name>_tables`: describes the metadata of the tables in this project
- `<project_name>_vars`: describes the metadata of the columns in this project
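To make the expected layout of these two files concrete, here is a small sketch that uses the standard `csv` module. The column names (`table`, `description`, `var`, `var_type`, `var_size`) are the ones read by the code in this tutorial; the sample rows, including the table `fr_hospitals`, are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample content; only the column names come from the tutorial code.
TABLES_CSV = """table,description
fr_communes_raw,Raw list of French communes
fr_hospitals,Hospitals located in France
"""

VARS_CSV = """table,var,var_type,var_size,description
fr_communes_raw,insee_code,string,5,INSEE code of the commune
fr_communes_raw,population,int,,Number of inhabitants
"""


def group_columns_by_table(vars_csv: str) -> dict:
    """Map each table name to the list of its column rows."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(vars_csv)):
        grouped[row["table"]].append(row)
    return dict(grouped)


columns_by_table = group_columns_by_table(VARS_CSV)
for table in csv.DictReader(io.StringIO(TABLES_CSV)):
    n_cols = len(columns_by_table.get(table["table"], []))
    print(f"{table['table']}: {n_cols} column(s)")
```

A table that reports zero columns (here `fr_hospitals`) is worth checking before ingestion, since it would be created without any schema.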

In [None]:
import pathlib

project_root = pathlib.Path.cwd().parent
metadata_path = project_root / "data"

print(metadata_path)

In [None]:
table_spec_path = f"{metadata_path}/constances_tables.csv"
col_spec_path = f"{metadata_path}/constances_vars.csv"



In [None]:
table_df = pd.read_csv(table_spec_path, header=0)
print(table_df.head(5))

In [None]:
col_df = pd.read_csv(col_spec_path, header=0)
print(col_df.head(5))

In [None]:
from metadata.generated.schema.api.data.createTable import CreateTableRequest
from metadata.generated.schema.entity.data.table import Column, DataType


def getColDetailsByTabName(table_name: str, col_df):
    # filter the rows that belongs to the given table name
    table_col_list = col_df[col_df["table"] == table_name].to_dict(orient="records")
    return table_col_list


target_tab_name = "fr_communes_raw"
tab_col_list = getColDetailsByTabName(target_tab_name, col_df)

for item in tab_col_list:
    print(f"table name: {item['table']}")
    print(f"column name: {item['var']}")
    print(f"column type: {item['var_type']}")
    print(f"column size: {item['var_size']}")
    print(f"column description: {item['description']}")

#### Step 3: Clean the metadata before ingestion

We need to clean the raw metadata before ingestion, because some values may not be compatible with OpenMetadata.
For example, the column types in OpenMetadata are predefined, and only valid values can be inserted into the OpenMetadata server.
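The mapping from raw type names to predefined types can be sketched as a lookup table. In the sketch below, plain strings stand in for the `DataType` members so that it runs without OpenMetadata installed; `normalize_type`, `TYPE_ALIASES` and `DIRECT_TYPES` are illustrative names, not part of the SDK.

```python
# Strings stand in for DataType members so the sketch runs without OpenMetadata installed.
TYPE_ALIASES = {
    "str": "STRING", "string": "STRING", "": "STRING",  # empty type defaults to STRING
    "int": "INT", "integer": "INT",
    "bigint": "BIGINT", "long": "BIGINT",
    "byte": "BYTEINT",
}
# Types whose raw name already matches the target name once upper-cased.
DIRECT_TYPES = {
    "tinyint", "smallint", "numeric", "number", "float", "double", "date", "time",
    "char", "varchar", "text", "ntext", "binary", "varbinary",
    "array", "map", "struct", "geometry",
}


def normalize_type(raw_type: str) -> str:
    """Return the normalized type name, or "UNKNOWN" for unrecognized types."""
    if not isinstance(raw_type, str):
        raise ValueError(f"The input value {raw_type!r} is not a string")
    key = raw_type.strip().lower()
    if key in TYPE_ALIASES:
        return TYPE_ALIASES[key]
    if key in DIRECT_TYPES:
        return key.upper()
    return "UNKNOWN"


print(normalize_type("Integer"), normalize_type("VARCHAR"), normalize_type("blob"))
```

A table-driven mapping keeps every accepted alias in one place, which makes it easier to review and extend than a long conditional chain.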

In [None]:
from metadata.generated.schema.entity.data.table import Column, DataType
from typing import Dict, List, Optional

# util func
authorized_str_type = ["string", "str", ]
authorized_int_type = ["int", "integer"]
authorized_long_type = ["bigint", "long"]


def get_om_dtype(in_type: str) -> DataType:
    # the input must be a string; an empty string is accepted and mapped to STRING below
    if isinstance(in_type, str):
        # cast it to lower case to ignore case
        in_type_val = in_type.lower()
        # we create a mapping case for all sql types
        if in_type_val == "tinyint":
            return DataType.TINYINT
        elif in_type_val == "byte":
            return DataType.BYTEINT
        elif in_type_val == "smallint":
            return DataType.SMALLINT
        elif in_type_val in authorized_int_type:
            return DataType.INT
        elif in_type_val in authorized_long_type:
            return DataType.BIGINT
        elif in_type_val == 'numeric':
            return DataType.NUMERIC
        elif in_type_val == 'number':
            return DataType.NUMBER
        elif in_type_val == 'float':
            return DataType.FLOAT
        elif in_type_val == 'double':
            return DataType.DOUBLE
        elif in_type_val == 'date':
            return DataType.DATE
        elif in_type_val == 'time':
            return DataType.TIME
        elif in_type_val == "char":
            return DataType.CHAR
        elif in_type_val == "varchar":
            return DataType.VARCHAR
        elif in_type_val == "text":
            return DataType.TEXT
        elif in_type_val == "ntext":
            return DataType.NTEXT
        elif in_type_val == "binary":
            return DataType.BINARY
        elif in_type_val == "varbinary":
            return DataType.VARBINARY
        # other types
        elif in_type_val in authorized_str_type:
            return DataType.STRING
        # for complex map such as array<int>, map<int,string>
        # we must use dataTypeDisplay to show the details. In dataType, we can only put array, map
        elif in_type_val == "array":
            return DataType.ARRAY
        elif in_type_val == "map":
            return DataType.MAP
        elif in_type_val == "struct":
            return DataType.STRUCT
        # for geometry type
        elif in_type_val == "geometry":
            return DataType.GEOMETRY
        # for empty string, we use string as default value
        elif in_type_val == "":
            return DataType.STRING

        else:
            return DataType.UNKNOWN
    else:
        raise ValueError(f"The input value {in_type!r} is not a valid string type")


def build_type_display_name(type_val: str, length: Optional[int], precision: Optional[int]) -> str:
    """
    This function builds a data type display value. It only considers three cases, because the result returned by
    split_length_precision only has three possible cases.
    :param type_val: data type value (e.g. string, int, etc.) 
    :type type_val: str
    :param length: full length of the type 
    :type length: Optional[int]
    :param precision: precision of the type 
    :type precision: Optional[int]
    :return: data type display value
    :rtype: str
    """
    if length and precision:
        return f"{type_val}({length},{precision})"
    elif length and not precision:
        return f"{type_val}({length})"
    else:
        return type_val


def split_length_precision(raw_type_size: str) -> tuple[Optional[int], Optional[int]]:
    """
    Parse the raw type size (e.g. "3" or "5,3") into a tuple of (length, precision).
    Some examples:
     - "3" to (3, None)
     - "5,3" to (5, 3)
     - None or a non-string value to (None, None)
     - "" to (None, None)
     - ",3" to (None, None), because a precision without a length does not make sense
    :param raw_type_size: raw size as a string
    :type raw_type_size: str
    :return: the parsed (length, precision)
    :rtype: tuple[Optional[int], Optional[int]]
    """
    length = None
    precision = None
    # if it's null or not string, return none,none
    if raw_type_size and isinstance(raw_type_size, str):
        # if the size is not empty string, do split
        if len(raw_type_size) > 0:
            split_res = raw_type_size.split(",", 1)
            # if it has two items after split, it has length and precision
            try:
                if len(split_res) == 2:
                    length = int(split_res[0])
                    precision = int(split_res[1])
                else:
                    length = int(split_res[0])
            except ValueError:
                print(f"The raw type size {raw_type_size!r} can't be cast to int.")

    return length, precision


def generate_om_column_entity(col_details: List[Dict]) -> List[Column]:
    """
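    Take the column details of a table and generate a list of OpenMetadata column entities.
    :param col_details: list of dicts with the keys `var`, `var_type`, `var_size` and `description`
    :type col_details: List[Dict]
    :return: the generated column entities
    :rtype: List[Column]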
    """
    columns: List[Column] = []
    for col_detail in col_details:
        col_name = col_detail['var']
        type_val = col_detail['var_type'].lower()
        type_size = col_detail['var_size']
        length, precision = split_length_precision(type_size)
        data_type = get_om_dtype(type_val)
        type_display_val = build_type_display_name(type_val, length, precision)
        col_desc = col_detail['description']
        # for array data type, we must also provide the datatype inside the array, here we set string for simplicity
        if data_type == DataType.ARRAY:
            array_data_type = DataType.STRING
        else:
            array_data_type = None
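        # for struct data type, the children must be Column objects; here we use two example fields
        if data_type == DataType.STRUCT:
            children = [Column(name="version", dataType=DataType.INT),
                        Column(name="timestamp", dataType=DataType.TIME)]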
        else:
            children = None
        col_entity = Column(name=col_name, dataType=data_type, arrayDataType=array_data_type, children=children,
                            dataTypeDisplay=type_display_val, dataLength=length, precision=precision,
                            description=col_desc)
        columns.append(col_entity)
    return columns
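When the column specification is loaded with pandas, a size cell may arrive as a number or as NaN rather than as a string, in which case `split_length_precision` returns `(None, None)`. The sketch below shows one way to normalize such a cell first; `size_to_str` is a hypothetical helper, and treating fractional values as missing is an assumption.

```python
import math
from typing import Any, Optional


def size_to_str(raw_size: Any) -> Optional[str]:
    """Convert a raw size cell (str, int, float or NaN) into a string such as '5' or '5,3'."""
    if raw_size is None:
        return None
    if isinstance(raw_size, float):
        if math.isnan(raw_size):
            return None  # empty CSV cell read as NaN
        if raw_size.is_integer():
            return str(int(raw_size))  # 5.0 -> '5'
        return None  # a fractional size such as 5.3 is ambiguous; treat it as missing
    if isinstance(raw_size, int):
        return str(raw_size)
    text = str(raw_size).strip()
    return text or None


print(size_to_str(5.0), size_to_str(" 5,3 "), size_to_str(float("nan")))
```

The result could then be passed on as `split_length_precision(size_to_str(type_size))`.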

In [None]:
## Load metadata of all tables
from metadata.generated.schema.api.data.createTable import CreateTableRequest

# step1: loop the table list to get table name and description
table_list = table_df[['table', 'description']].to_dict(orient="records")

for tab in table_list:
    tab_name = tab['table']
    tab_desc = tab['description']
    print(f"tab_name:{tab_name}, tab_desc:{tab_desc}")
    # step2: get tab col list
    tab_col_list = getColDetailsByTabName(tab_name, col_df)
    # step3: loop through the col list and build the OM column list
    columns = generate_om_column_entity(tab_col_list)
    # step4: create table
    table_create = CreateTableRequest(
        name=tab_name,
        description=tab_desc,
        databaseSchema=schema_entity.fullyQualifiedName,
        columns=columns)
    table_entity = om_con.create_or_update(data=table_create)

### 2.4 Ingest metadata of an external storage server

We have seen how to ingest the metadata of a file system and of a database. If the storage server is hosted in a public cloud such as AWS or GCP, it is also possible to ingest its metadata into the OpenMetadata server. In this section, we ingest the metadata of an S3 storage server in AWS.

In [19]:
from metadata.generated.schema.security.credentials.awsCredentials import AWSCredentials
from metadata.generated.schema.entity.services.connections.storage.s3Connection import S3Connection

# 1. Create S3 Storage Service
s3_conn = S3Connection(
    awsConfig=AWSCredentials(
        awsAccessKeyId="YOUR_ACCESS_KEY",
        awsSecretAccessKey="YOUR_SECRET_KEY",
        awsRegion="us-east-1",
        assumeRoleArn=None
    ),
    bucketNames=["Constance"]  # must be a list
)

s3_service_req = CreateStorageServiceRequest(
    name="Constance-AWS",
    serviceType=StorageServiceType.S3,
    connection=StorageConnection(config=s3_conn),
    description="Constances AWS S3 data lake"
)
s3_service_entity = om_con.create_or_update(data=s3_service_req)

# 2. Create a container for development
dev_dir_name = "development"
dev_dir_req = CreateContainerRequest(
    name=dev_dir_name,
    displayName=dev_dir_name,
    service=s3_service_entity.fullyQualifiedName,
)

dev_dir_entity = om_con.create_or_update(data=dev_dir_req)
print(dev_dir_entity)

id=Uuid(root=UUID('c108cb6f-8123-4fea-8c58-b52ec2bd8c67')) name=EntityName(root='development') fullyQualifiedName=FullyQualifiedEntityName(root='Constance-AWS.development') displayName='development' description=None version=EntityVersion(root=0.1) updatedAt=Timestamp(root=1757423980780) updatedBy='ingestion-bot' href=Href(root=AnyUrl('http://localhost:8585/v1/containers/c108cb6f-8123-4fea-8c58-b52ec2bd8c67')) owners=EntityReferenceList(root=[]) service=EntityReference(id=Uuid(root=UUID('241cb5ff-4706-4766-810a-b00725c9e08e')), type='storageService', name='Constance-AWS', fullyQualifiedName='Constance-AWS', description=Markdown(root='Constances AWS S3 data lake'), displayName='Constance-AWS', deleted=False, inherited=None, href=Href(root=AnyUrl('http://localhost:8585/v1/services/storageServices/241cb5ff-4706-4766-810a-b00725c9e08e'))) parent=None children=None dataModel=None prefix=None numberOfObjects=None size=None fileFormats=None serviceType=<StorageServiceType.S3: 'S3'> followers=N
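
The output above shows that the `fullyQualifiedName` of the container is `Constance-AWS.development`, that is, the service name and the container name joined by a dot. The sketch below illustrates this naming pattern, which helps predict the name to pass to `get_by_name` later. `build_container_fqn` is a hypothetical helper, not part of the SDK, and the quoting of names that themselves contain a dot is an assumption; the value returned by the server remains the source of truth.

```python
def build_container_fqn(service_name, *container_names):
    """Join a service name and container names into a dotted name (illustrative)."""

    def quote(part):
        # Assumption: a part that contains a dot is wrapped in double quotes
        return f'"{part}"' if "." in part else part

    return ".".join(quote(part) for part in (service_name, *container_names))


print(build_container_fqn("Constance-AWS", "development"))
# prints: Constance-AWS.development
```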

In [20]:
# create a data model
prs_columns = [
    Column(
        name="patient_id",
        displayName="Patient ID",
        dataType=DataType.INT,
        description="Unique identifier of patient"
    ),
    Column(
        name="patient_name",
        displayName="Patient Name",
        dataType=DataType.STRING,
        description="Name of the patient"
    ),
    Column(
        name="location",
        displayName="location",
        dataType=DataType.STRING,
        description="GPS coordinates of where the patient lives"
    ),
    Column(
        name="total_sum",
        displayName="total_sum",
        dataType=DataType.FLOAT,
        description="Total payment of the social security"
    ),
]

# Build the data model for the container
prs_data_model = ContainerDataModel(
    isPartitioned=False,
    columns=prs_columns
)

# Create a container. It must belong to a service; here we use the S3 storage service.
container_req = CreateContainerRequest(
    name='prs_2015_03_08',
    displayName='prs_2015_03_08',
    description='this parquet dataset contains the prs data',
    parent=EntityReference(id=dev_dir_entity.id, type="container"),
    service=s3_service_entity.fullyQualifiedName,
    dataModel=prs_data_model,
    numberOfObjects=3,
    size=123456.75,
    fileFormats=['parquet'],
)

container_entity = om_con.create_or_update(data=container_req)
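
The column list above is written by hand. When the schema of a file is already known, for example as a mapping from column name to a type string, the column definitions can be generated instead. The following sketch shows the idea without importing the SDK. `SOURCE_TO_OM` and `to_column_specs` are hypothetical names, and the type strings on the left are illustrative. The values on the right are `DataType` member names: `INT`, `FLOAT`, and `STRING` appear in this tutorial, while `BIGINT`, `DOUBLE`, and `BOOLEAN` are assumed to exist in the enum.

```python
# Hypothetical mapping from source type strings to DataType member names
SOURCE_TO_OM = {
    "int32": "INT",
    "int64": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "string": "STRING",
    "bool": "BOOLEAN",
}


def to_column_specs(schema):
    """Turn {column_name: type_string} into keyword dictionaries for Column(...)."""
    specs = []
    for name, type_str in schema.items():
        if type_str not in SOURCE_TO_OM:
            raise ValueError(f"No mapping for type {type_str!r} of column {name!r}")
        specs.append({
            "name": name,
            "displayName": name,
            "dataType": SOURCE_TO_OM[type_str],
        })
    return specs


specs = to_column_specs({"patient_id": "int32", "total_sum": "float"})
print(specs)
```

Each dictionary can then be turned into a column with something like `Column(name=s["name"], displayName=s["displayName"], dataType=DataType[s["dataType"]])`, where `DataType[...]` is the standard Python enum lookup by member name.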


## Clean up

We have created many metadata entities. To remove them, we can call the functions below.


In [30]:
def delete_storage_service(storage_service_name: str):
    """
    Delete a storage service and, recursively, all metadata entities under it.
    If the service cannot be found or deleted, a warning message is shown.
    :param storage_service_name: name of the storage service to delete
    :return: None
    """
    try:
        # look up the storage service by name to get its id
        service_id = om_con.get_by_name(
            entity=StorageService, fqn=storage_service_name
        ).id
        print(f"Find the storage service with id: {service_id}")
        print("Start the delete process")
        # delete the service by id, including all entities under it
        om_con.delete(
            entity=StorageService,
            entity_id=service_id,
            recursive=True,
            hard_delete=True,
        )
    except Exception as e:
        print(f"Can't find or delete the storage service {storage_service_name}: {e}")
        return

In [32]:
# delete the custom storage service

STORE_SER_NAME = "Constances-Datalake"
delete_storage_service(STORE_SER_NAME)

Find the storage service with id: root=UUID('cc3361c7-85db-4dbd-9ba0-27ce4d91e230')
Start the delete process


In [None]:
def delete_db_service(db_service_name: str):
    """
    Delete a database service and, recursively, all metadata entities under it.
    If the service cannot be found or deleted, a warning message is shown.
    :param db_service_name: name of the database service to delete
    :return: None
    """
    try:
        # look up the database service by name to get its id
        service_id = om_con.get_by_name(
            entity=DatabaseService, fqn=db_service_name
        ).id
        print(f"Find the database service with id: {service_id}")
        print("Start the delete process")
        # delete the service by id, including all entities under it
        om_con.delete(
            entity=DatabaseService,
            entity_id=service_id,
            recursive=True,
            hard_delete=True,
        )
    except Exception as e:
        print(f"Can't find or delete the database service {db_service_name}: {e}")
        return


In [None]:
DB_SERVICE_NAME = ""  # fill in the name of the database service to delete
delete_db_service(DB_SERVICE_NAME)

In [36]:
# delete the S3 storage
s3_storage_name = "Constance_datalake"
delete_storage_service(s3_storage_name)

Find the storage service with id: root=UUID('754f4ae3-e3b8-4d10-94c4-99b1c57c6461')
Start the delete process
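
The two delete functions above differ only in the entity class they pass to `get_by_name` and `delete`. A minimal sketch of a single helper that takes the client and the entity class as parameters is shown below. `delete_service` is a hypothetical name. It relies only on the two client calls already used in this tutorial, with the same keyword arguments, and it assumes that a missing service shows up either as a `None` result or as an exception.

```python
def delete_service(client, entity, service_name):
    """Hard-delete a service and everything under it; return True if deleted."""
    try:
        service = client.get_by_name(entity=entity, fqn=service_name)
    except Exception as e:  # lookup failed, e.g. the server is unreachable
        print(f"Lookup of {service_name} failed: {e}")
        return False
    if service is None:
        print(f"No {entity.__name__} named {service_name}")
        return False
    # Same arguments as in the two functions above
    client.delete(
        entity=entity,
        entity_id=service.id,
        recursive=True,
        hard_delete=True,
    )
    print(f"Deleted {entity.__name__} {service_name}")
    return True
```

With the objects from this notebook, a call would look like `delete_service(om_con, StorageService, "Constance-AWS")`. The boolean return value makes it easy to check whether anything was actually removed.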
