# Ingest metadata of basic entities into OpenMetadata

In this tutorial, we show how to ingest metadata into OpenMetadata (OM).

There are several ways to ingest metadata into OpenMetadata, such as:
- connectors
- REST API
- Python SDK

In this tutorial, we only show how to use the `Python SDK` to ingest metadata.

## 1. Set up the python virtual environment

Open a conda shell of **Python 3.11** in `Bureau`->`Raccourci`->`Python` (Desktop -> Shortcut -> Python), then enter the commands below:

```shell
# 1. Check if conda exists in the current shell
conda --version

# 2. create a virtual environment
conda create --name om-ingestion python --offline
# view existing virtual environment list
conda env list
# check status of a virtual environment
conda info --envs

# 3. activate a virtual environment
conda activate om-ingestion

# 4. install packages
# check installed package list
pip list

# install package via requirements.txt
pip install -r requirements.txt

# 5. verify that you have the required packages
pip show pandas
pip show openmetadata-ingestion
```
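
The same check can also be done from inside Python, which is convenient at the top of a notebook. The sketch below uses only the standard library; the helper name `installed_version` is ours and is not part of any package.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# the two packages checked with `pip show` above
for name in ("pandas", "openmetadata-ingestion"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```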

## 2. Ingest metadata of basic entities

The most basic entities in OpenMetadata are the descriptive metadata of data assets, for example:
- Databases
- Tables
- Columns
- Filesystem
- Folder
- Files
- Etc.

In the example below, we insert the descriptive metadata of a database, a schema, tables, and columns.

In [1]:
import pandas as pd
from metadata.ingestion.ometa.ometa_api import OpenMetadata
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (OpenMetadataConnection, AuthProvider)
from metadata.generated.schema.security.client.openMetadataJWTClientConfig import OpenMetadataJWTClientConfig

### 2.1 Check open metadata api server connectivity

The Python SDK that we use to ingest metadata is an `OM client`; it must connect to an `OM server` to ingest metadata.
Let's check the connectivity to the server through the client.

In [2]:
# you need to modify this value to match your target open metadata server url
target_om_server="http://om-dev.casd.local/api"

In [3]:

from conf.creds import om_oidc_token
server_config = OpenMetadataConnection(
    hostPort=target_om_server,
    authProvider=AuthProvider.openmetadata,
    securityConfig=OpenMetadataJWTClientConfig(
        jwtToken=om_oidc_token,
    ),
)
om_con = OpenMetadata(server_config)

In [4]:
# if it returns True, the connection is successful
om_con.health_check()

True
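
In a script rather than a notebook, it can help to stop early when this check fails. The sketch below wraps `health_check()` in a small guard; `require_healthy` and `StubClient` are hypothetical names, and the stub stands in for `om_con` so that the block runs without a server.

```python
class StubClient:
    """Stand-in for the OpenMetadata client; it only mimics health_check()."""

    def __init__(self, healthy: bool):
        self._healthy = healthy

    def health_check(self) -> bool:
        return self._healthy


def require_healthy(client, server_url: str) -> None:
    """Raise ConnectionError if the client reports that the server is not healthy."""
    if not client.health_check():
        raise ConnectionError(f"OpenMetadata server at {server_url} is not reachable")


# passes silently when the server answers
require_healthy(StubClient(healthy=True), "http://om-dev.casd.local/api")
```

With the real client, the call would be `require_healthy(om_con, target_om_server)`.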

### 2.2 Ingest the metadata of a database

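Suppose we have a MySQL database called `hospitals_in_france`. We want to ingest the metadata of this database into OM so that other users can discover and use it. In OM, a database belongs to a database service and a schema belongs to a database, so we create the service first, then the database, then the schema.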

In [5]:
from metadata.generated.schema.api.services.createDatabaseService import CreateDatabaseServiceRequest
from metadata.generated.schema.entity.services.connections.database.common.basicAuth import BasicAuth
from metadata.generated.schema.entity.services.connections.database.mysqlConnection import MysqlConnection
from metadata.generated.schema.entity.services.databaseService import (DatabaseConnection, DatabaseService, DatabaseServiceType,)

# name of the db service
DB_SERVICE_NAME = "Constances-Geography"
# description of the service
DB_SERVICE_DESC = "This database service stores all geography databases of INSERM"

DB_AUTH_LOGIN = "db_login"
DB_AUTH_PWD = "db_pwd"
DB_URL = "http://db_url:1234"

db_service = CreateDatabaseServiceRequest(
    name=DB_SERVICE_NAME,
    serviceType=DatabaseServiceType.Mysql,
    connection=DatabaseConnection(
        config=MysqlConnection(
            username=DB_AUTH_LOGIN,
            authType=BasicAuth(password=DB_AUTH_PWD),
            hostPort=DB_URL,
        )
    ),
    description=DB_SERVICE_DESC,
)

# `create_or_update` returns the created (or updated) entity
db_service_entity = om_con.create_or_update(data=db_service)

In [6]:
# you can view the content of the returned object to check if your request is executed correctly.
print(db_service_entity)

id=Uuid(root=UUID('c726e413-a5de-41d0-a9b7-25fb11197d6b')) name=EntityName(root='Constances-Geography') fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography') displayName=None serviceType=<DatabaseServiceType.Mysql: 'Mysql'> description=Markdown(root='This database service stores all geography databases of INSERM') connection=DatabaseConnection(config=MysqlConnection(type=<MySQLType.Mysql: 'Mysql'>, scheme=<MySQLScheme.mysql_pymysql: 'mysql+pymysql'>, username='db_login', authType=BasicAuth(password=SecretStr('**********')), hostPort='http://db_url:1234', databaseName=None, databaseSchema=None, sslConfig=None, connectionOptions=None, connectionArguments=None, schemaFilterPattern=FilterPattern(includes=[], excludes=['^information_schema$', '^performance_schema$']), tableFilterPattern=None, databaseFilterPattern=None, supportsMetadataExtraction=SupportsMetadataExtraction(root=True), supportsDBTExtraction=SupportsDBTExtraction(root=True), supportsProfiler=SupportsProfile

In [7]:
from metadata.generated.schema.api.data.createDatabase import CreateDatabaseRequest
DB_NAME = "hospitals_in_france"

db_entity_req = CreateDatabaseRequest(
    name=DB_NAME,
    service=db_service_entity.fullyQualifiedName,
    description="In this database, we store all tables which contain geographical information in Constances",
)

db_entity = om_con.create_or_update(data=db_entity_req)

In [8]:
from metadata.generated.schema.api.data.createDatabaseSchema import CreateDatabaseSchemaRequest
SCHEMA_NAME = "Geography"
create_schema_req = CreateDatabaseSchemaRequest(
    name=SCHEMA_NAME, 
    database=db_entity.fullyQualifiedName,
    description="In this schema, we group all tables which contain geographical information of hospitals in France",)

# create_or_update returns the created schema entity; its fully qualified name (FQN) is used later to attach the tables
schema_entity = om_con.create_or_update(data=create_schema_req)
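
Each entity is attached to its parent through the parent's fully qualified name (FQN), so the names form a dotted path: service, database, schema, table. The sketch below only illustrates that naming pattern with the values used in this tutorial; `build_fqn` is a hypothetical helper, and it assumes that none of the names contains a dot.

```python
def build_fqn(*parts: str) -> str:
    """Join entity names into a dotted fully qualified name (no dots allowed in parts)."""
    for part in parts:
        if not part or "." in part:
            raise ValueError(f"unsupported name for this sketch: {part!r}")
    return ".".join(parts)


service_fqn = build_fqn("Constances-Geography")
database_fqn = build_fqn("Constances-Geography", "hospitals_in_france")
schema_fqn = build_fqn("Constances-Geography", "hospitals_in_france", "Geography")
print(schema_fqn)
```

In the notebook itself, prefer the `fullyQualifiedName` attribute of the returned entity rather than rebuilding the string by hand.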

### 2.3 Get metadata from source files

Here we use two CSV files to describe the metadata:
- `<project_name>_tables.csv`: describes the metadata of the tables in this project
- `<project_name>_vars.csv`: describes the metadata of the columns in this project
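
Because the two files are joined on the table name, a quick consistency check before ingestion can save a failed run. The sketch below uses the standard `csv` module on small in-memory samples; the column names `table` and `var` follow the layout of the files used in this tutorial, while the sample rows and the helper name `orphan_tables` are illustrative.

```python
import csv
import io

tables_csv = """table,description
fr_communes_raw,Raw communes
osm_france_raw,Raw OSM extract
"""

vars_csv = """table,var,var_type
fr_communes_raw,nom,string
fr_communes_raw,insee,string
unknown_table,foo,int
"""


def orphan_tables(tables_text: str, vars_text: str) -> set:
    """Return table names referenced by the column file but absent from the table file."""
    known = {row["table"] for row in csv.DictReader(io.StringIO(tables_text))}
    used = {row["table"] for row in csv.DictReader(io.StringIO(vars_text))}
    return used - known


print(orphan_tables(tables_csv, vars_csv))
```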

In [9]:
import pathlib
project_root = pathlib.Path.cwd().parent
metadata_path = project_root / "data"

print(metadata_path)

C:\Users\PLIU\Documents\git\Seminare_data_catalog\data


In [10]:
table_spec_path = f"{metadata_path}/constances_tables.csv"
col_spec_path = f"{metadata_path}/constances_vars.csv"



In [11]:
table_df = pd.read_csv(table_spec_path,header=0)
print(table_df.head(5))

       domain                  table  \
0       INSEE        fr_communes_raw   
1  Constances      fr_communes_clean   
2         OSM         osm_france_raw   
3  Constances    osm_hospitals_clean   
4  Constances  hospitals_in_communes   

                                         description  creation  suppression  
0  This table contains all geographical informati...      2022          NaN  
1  This table is built based on fr_communes_raw w...      2024          NaN  
2  This table is the open street map of france. I...      2020          NaN  
3  This table is build based on osm_france_raw. I...      2024          NaN  
4  This table contains the number of hospitals in...      2024          NaN  


In [12]:
col_df = pd.read_csv(col_spec_path,header=0)
print(col_df.head(5))

             table        var                                    description  \
0  fr_communes_raw   geometry  geo location of the commune in a polygon form   
1  fr_communes_raw  wikipedia            url of the wiki page of the commune   
2  fr_communes_raw    surf_ha          number of habitats inside the commune   
3  fr_communes_raw        nom                            name of the commune   
4  fr_communes_raw      insee                      code insee of the commune   

   var_type var_size    nomencalure  creation  suppression  
0  geometry       18  geometry_type      2024          NaN  
1    string       28            NaN      2024          NaN  
2    number        8            NaN      2024          NaN  
3    string       26            NaN      2024          NaN  
4    string        5     code_insee      2024          NaN  


In [13]:
from metadata.generated.schema.api.data.createTable import CreateTableRequest
from metadata.generated.schema.entity.data.table import Column, DataType

def getColDetailsByTabName(table_name:str, col_df):
    # filter the rows that belong to the given table name
    table_col_list=col_df[col_df["table"]==table_name].to_dict(orient="records")
    return table_col_list
    
target_tab_name = "fr_communes_raw"
tab_col_list=getColDetailsByTabName(target_tab_name, col_df)

for item in tab_col_list:
    print(f"table name: {item['table']}")
    print(f"column name: {item['var']}")
    print(f"column type: {item['var_type']}")
    print(f"column size: {item['var_size']}")
    print(f"column description: {item['description']}")

table name: fr_communes_raw
column name: geometry
column type: geometry
column size: 18
column description: geo location of the commune in a polygon form
table name: fr_communes_raw
column name: wikipedia
column type: string
column size: 28
column description: url of the wiki page of the commune
table name: fr_communes_raw
column name: surf_ha
column type: number
column size: 8
column description: number of habitats inside the commune
table name: fr_communes_raw
column name: nom
column type: string
column size: 26
column description: name of the commune
table name: fr_communes_raw
column name: insee
column type: string
column size: 5
column description: code insee of the commune


### 2.4 Clean the metadata before ingestion

We need to clean the raw metadata before ingestion, because some values may not be compatible with `OpenMetadata`.
For example, the column types in `OpenMetadata` are predefined, and the server only accepts values from that list.
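
As a minimal illustration of this cleaning step, a lookup table can translate the spellings found in the source files into the type names the server accepts. The sketch below works on plain strings so that it runs without the SDK; `RAW_TO_OM` covers only a few spellings, and `normalise_type` is a hypothetical helper.

```python
# raw spelling -> OpenMetadata type name (a small, illustrative subset)
RAW_TO_OM = {
    "int": "INT",
    "integer": "INT",
    "bigint": "BIGINT",
    "long": "BIGINT",
    "str": "STRING",
    "string": "STRING",
    "number": "NUMBER",
    "geometry": "GEOMETRY",
}


def normalise_type(raw: str) -> str:
    """Map a raw type spelling to a type name, ignoring case and surrounding spaces."""
    if not isinstance(raw, str):
        raise ValueError(f"expected a string type name, got {raw!r}")
    key = raw.strip().lower()
    if key == "":
        return "STRING"  # empty values default to string
    return RAW_TO_OM.get(key, "UNKNOWN")


raw_types = ["String", "geometry", "NUMBER", "", "polygon"]
print([normalise_type(t) for t in raw_types])
```

Values that come back as `UNKNOWN` are the ones worth reviewing in the source file before ingestion.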

In [14]:
from metadata.generated.schema.entity.data.table import Column, DataType
from typing import Dict, List, Optional

# util func
authorized_str_type=["string","str",]
authorized_int_type=["int","integer"]
authorized_long_type=["bigint","long"]

def get_om_dtype(in_type:str)->DataType:
    # the input must be a string (an empty string is allowed); None and other types are rejected below
    if isinstance(in_type, str):
        # cast it to lower case to ignore case
        in_type_val=in_type.lower()
        # map each supported raw type name to its OM data type
        if in_type_val == "tinyint":
            return DataType.TINYINT
        elif in_type_val == "byte":
            return DataType.BYTEINT
        elif in_type_val == "smallint":
            return DataType.SMALLINT
        elif in_type_val in authorized_int_type:
            return DataType.INT
        elif in_type_val in authorized_long_type:
            return DataType.BIGINT
        elif in_type_val=='numeric':
            return DataType.NUMERIC
        elif in_type_val=='number':
            return DataType.NUMBER
        elif in_type_val=='float':
            return DataType.FLOAT
        elif in_type_val=='double':
            return DataType.DOUBLE
        elif in_type_val=='date':
            return DataType.DATE
        elif in_type_val=='time':
            return DataType.TIME
        elif in_type_val=="char":
            return DataType.CHAR
        elif in_type_val=="varchar":
            return DataType.VARCHAR
        elif in_type_val=="text":
            return DataType.TEXT
        elif in_type_val=="ntext":
            return DataType.NTEXT
        elif in_type_val=="binary":
            return DataType.BINARY
        elif in_type_val=="varbinary":
            return DataType.VARBINARY
        # other types
        elif in_type_val in authorized_str_type:
            return DataType.STRING
        # for complex map such as array<int>, map<int,string>
        # we must use dataTypeDisplay to show the details. In dataType, we can only put array, map
        elif in_type_val=="array":
            return DataType.ARRAY
        elif in_type_val=="map":
            return DataType.MAP
        elif in_type_val=="struct":
            return DataType.STRUCT
        # for geometry type
        elif in_type_val=="geometry":
            return DataType.GEOMETRY
        # for empty string, we use string as default value
        elif in_type_val=="":
            return DataType.STRING
        
        else:
            return DataType.UNKNOWN
    else:
        raise ValueError(f"The input value {in_type!r} is not a valid string type")
    

def build_type_display_name(type_val: str, length: Optional[int], precision: Optional[int]) -> str:
    """
    This function builds a data type display value. It only considers three cases, because the result returned by
    split_length_precision only has three possible cases
    :param type_val: data type value (e.g. string, int, etc.) 
    :type type_val: str
    :param length: full length of the type 
    :type length: Optional[int]
    :param precision: precision of the type 
    :type precision: Optional[int]
    :return: data type display value
    :rtype: str
    """
    if length and precision:
        return f"{type_val}({length},{precision})"
    elif length and not precision:
        return f"{type_val}({length})"
    else:
        return type_val

def split_length_precision(raw_type_size: str) -> tuple[Optional[int], Optional[int]]:
    """
    This function parses the raw type size (e.g. 3 or 5,3) into a tuple of (length, precision).
    Some examples:
     - 3 to (3,None)
     - 5,3 to (5,3).
     - None or not string to (None,None)
     - "" to (None,None)
     - ,3 to (None,None) because it does not make sense to return only a precision
    :param raw_type_size: raw size of the type, e.g. "3" or "5,3"
    :type raw_type_size: str
    :return: a tuple of (length, precision); an item is None when it is missing or invalid
    :rtype: tuple[Optional[int], Optional[int]]
    """
    length = None
    precision = None
    # if it's null or not string, return none,none
    if raw_type_size and isinstance(raw_type_size, str):
        # if the size is not empty string, do split
        if len(raw_type_size) > 0:
            split_res = raw_type_size.split(",", 1)
            # if it has two items after split, it has length and precision
            try:
                if len(split_res) == 2:
                    length = int(split_res[0])
                    precision = int(split_res[1])
                else:
                    length = int(split_res[0])
            except ValueError:
                # report the raw value; indexing split_res[1] here would fail when there is no precision part
                print(f"The type size {raw_type_size!r} can't be cast to int.")

    return length, precision
    
def generate_om_column_entity(col_details:List[Dict])->List[Column]:
    """
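    This function takes the column details of a table and generates a list of OpenMetadata column entities.
    :param col_details: column records of one table, as returned by getColDetailsByTabName
    :type col_details: List[Dict]
    :return: the column entities of the table
    :rtype: List[Column]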
    """
    columns:List[Column]=[]
    for col_detail in col_details:
        col_name=col_detail['var']
        type_val=col_detail['var_type'].lower()
        type_size=col_detail['var_size']
        length, precision=split_length_precision(type_size)
        data_type=get_om_dtype(type_val)
        type_display_val=build_type_display_name(type_val,length,precision)
        col_desc=col_detail['description']
        # for array data type, we must also provide the datatype inside the array, here we set string for simplicity
        if data_type==DataType.ARRAY:
            array_data_type=DataType.STRING
        else:
            array_data_type=None
        # for struct data type, the children must be Column entities; the two fields below are placeholders
        if data_type==DataType.STRUCT:
            children=[
                Column(name="version", dataType=DataType.INT),
                Column(name="timestamp", dataType=DataType.TIME),
            ]
        else:
            children=None
        col_entity=Column(
            name=col_name,
            dataType=data_type,
            arrayDataType=array_data_type,
            children=children,
            dataTypeDisplay=type_display_val,
            dataLength=length,
            precision=precision,
            description=col_desc,
        )
        columns.append(col_entity)
    return columns
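
The size handling can be checked in isolation. The sketch below is a compact, regex-based variant of the same idea (it is not the functions above): it parses a raw size such as `5,3` and builds the display string. `parse_size` and `display_name` are illustrative names, and unlike the tutorial code it uses explicit `None` checks, so a precision of 0 is kept.

```python
import re
from typing import Optional, Tuple

# digits, optionally followed by a comma and more digits
_SIZE_RE = re.compile(r"^\s*(\d+)\s*(?:,\s*(\d+)\s*)?$")


def parse_size(raw) -> Tuple[Optional[int], Optional[int]]:
    """Parse '3' into (3, None) and '5,3' into (5, 3); anything else gives (None, None)."""
    if not isinstance(raw, str):
        return None, None
    match = _SIZE_RE.match(raw)
    if match is None:
        return None, None
    length = int(match.group(1))
    precision = int(match.group(2)) if match.group(2) is not None else None
    return length, precision


def display_name(type_name: str, length: Optional[int], precision: Optional[int]) -> str:
    """Build 'type(length,precision)', 'type(length)' or just 'type'."""
    if length is not None and precision is not None:
        return f"{type_name}({length},{precision})"
    if length is not None:
        return f"{type_name}({length})"
    return type_name


for raw in ["3", "5,3", "", ",3", None]:
    print(repr(raw), "->", display_name("number", *parse_size(raw)))
```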

In [15]:
## Load metadata of all tables
from metadata.generated.schema.api.data.createTable import CreateTableRequest
# step1: loop the table list to get table name and description
table_list=table_df[['table','description']].to_dict(orient="records")

for tab in table_list:
    tab_name=tab['table']
    tab_desc=tab['description']
    print(f"tab_name:{tab_name}, tab_desc:{tab_desc}")
    # step2: get tab col list
    tab_col_list=getColDetailsByTabName(tab_name, col_df)
    # step3: loop through the col list and build the OM column list
    columns = generate_om_column_entity(tab_col_list)
    # step4: create table
    table_create=CreateTableRequest(
        name=tab_name,
        description=tab_desc,
        databaseSchema=schema_entity.fullyQualifiedName,
        columns=columns,
    )
    table_entity=om_con.create_or_update(data=table_create)

tab_name:fr_communes_raw, tab_desc:This table contains all geographical information of french communes
tab_name:fr_communes_clean, tab_desc:This table is built based on fr_communes_raw which is suitable for Contances related analysis
tab_name:osm_france_raw, tab_desc:This table is the open street map of france. It contains all geographical information such as roads hospitals in france
tab_name:osm_hospitals_clean, tab_desc:This table is build based on osm_france_raw. It only contains geographical information of hospitals in france
tab_name:hospitals_in_communes, tab_desc:This table contains the number of hospitals in each communes
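
The loop above filters the full column table once per table. For larger specifications, grouping the column rows by table in a single pass is a possible alternative. The sketch below uses plain dictionaries shaped like the records produced by `to_dict(orient="records")`; the helper name `group_columns_by_table` and the sample rows are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def group_columns_by_table(col_rows: List[dict]) -> Dict[str, List[dict]]:
    """Group column records by their 'table' key, preserving the input order."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for row in col_rows:
        grouped[row["table"]].append(row)
    return dict(grouped)


col_rows = [
    {"table": "fr_communes_raw", "var": "nom", "var_type": "string"},
    {"table": "osm_france_raw", "var": "geometry", "var_type": "geometry"},
    {"table": "fr_communes_raw", "var": "insee", "var_type": "string"},
]
columns_by_table = group_columns_by_table(col_rows)
print([c["var"] for c in columns_by_table["fr_communes_raw"]])
```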


### 2.5 Ingest metadata of a file system

In [None]:
# --------------- IMPORTS ---------------
from metadata.generated.schema.api.data.createContainer import CreateContainerRequest
from metadata.generated.schema.api.services.createStorageService import CreateStorageServiceRequest
from metadata.generated.schema.entity.data.container import Container, ContainerDataModel
from metadata.generated.schema.entity.services.connections.storage.customStorageConnection import CustomStorageConnection
from metadata.generated.schema.entity.services.storageService import StorageConnection, StorageService, StorageServiceType

# --------------- CONFIGURATION ---------------
LOCAL_FOLDER = "/constances"
STORAGE_SERVICE_NAME = "Constances-Datalake"
CONTAINER_NAME = "root"

# ---------------- CREATE STORAGE SERVICE ----------------
def create_or_get_storage_service(om_server_con):
    # If the wanted storage service already exists, return the service entity.
    # get_by_name expects the entity class, not its name as a string.
    service = om_server_con.get_by_name(entity=StorageService, fqn=STORAGE_SERVICE_NAME)
    if service:
        return service

    custom_storage_conn = CustomStorageConnection(
        type="CustomStorage",
        sourcePythonClass="metadata.ingestion.source.storage.local.LocalStorageSource",
        connectionOptions={"configSource": LOCAL_FOLDER, "bucketName": CONTAINER_NAME}
    )
    storage_conn_entity = StorageConnection(config=custom_storage_conn)

    service_request = CreateStorageServiceRequest(
        name=STORAGE_SERVICE_NAME,
        serviceType=StorageServiceType.CustomStorage,
        description="Local filesystem storage service",
        connection=storage_conn_entity
    )

    return om_server_con.create_or_update(data=service_request)

# ---------------- CREATE ROOT CONTAINER ----------------
def create_root_container(om_server_con, storage_service_entity):
    # If the root container already exists, return it
    container = om_server_con.get_by_name(entity=Container, fqn=f"{STORAGE_SERVICE_NAME}.{CONTAINER_NAME}")
    if container:
        return container

    container_request = CreateContainerRequest(
        name=CONTAINER_NAME,
        displayName="Root Container",
        service=storage_service_entity.fullyQualifiedName,
        dataModel=ContainerDataModel(isPartitioned=False)
    )
    return om_server_con.create_or_update(data=container_request)

# ---------------- CREATE FILE ENTITIES ----------------
# def register_files(om_server_con, storage_service_entity, container_entity, folder_path):
#     for root, dirs, files in os.walk(folder_path):
#         for f in files:
#             file_path = os.path.join(root, f)
#             size = os.path.getsize(file_path)
#             ext = os.path.splitext(f)[1]
#             mime = "text/csv" if ext == ".csv" else None  # simple heuristic
#             file_request = CreateFileRequest(
#                 name=f,
#                 displayName=f,
#                 service=storage_service_entity.fullyQualifiedName,
#                 directory=container_entity.fullyQualifiedName,
#                 fileType=FileType.CSV if ext==".csv" else FileType.Other,
#                 mimeType=mime,
#                 fileExtension=ext,
#                 path=file_path,
#                 size=size
#             )
#             om_server_con.create_or_update(data=file_request)
#             print(f"Registered file: {file_path}")
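
The folder walk in the commented-out `register_files` draft can be tried on its own, without calling the server. The sketch below only collects the descriptive fields (name, extension, size, MIME type) with the standard library; `describe_files` is a hypothetical helper, and the demo runs on a temporary folder rather than on `LOCAL_FOLDER`.

```python
import mimetypes
import os
import tempfile
from typing import Dict, List


def describe_files(folder_path: str) -> List[Dict]:
    """Walk a folder and return one descriptive record per file."""
    records = []
    for root, _dirs, files in os.walk(folder_path):
        for file_name in sorted(files):
            file_path = os.path.join(root, file_name)
            mime, _encoding = mimetypes.guess_type(file_name)
            records.append(
                {
                    "name": file_name,
                    "extension": os.path.splitext(file_name)[1],
                    "path": file_path,
                    "size": os.path.getsize(file_path),  # in bytes
                    "mime": mime,  # None when the type cannot be guessed
                }
            )
    return records


# demo on a throw-away folder
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "hospitals.csv"), "wb") as fh:
        fh.write(b"id,name\n1,CHU\n")
    for record in describe_files(tmp):
        print(record["name"], record["extension"], record["size"], record["mime"])
```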

In [5]:
from metadata.generated.schema.api.services.createStorageService import CreateStorageServiceRequest
from metadata.generated.schema.entity.services.storageService import StorageConnection, StorageService, StorageServiceType
from metadata.generated.schema.entity.services.connections.storage.customStorageConnection import (
    CustomStorageConnection,
    CustomStorageType,
)

# Step 1: Create the CustomStorageConnection
custom_conn = CustomStorageConnection(
    type="CustomStorage",
    sourcePythonClass="metadata.ingestion.source.storage.local.LocalStorageSource",
    # the connection options are left empty here; possible entries, for example:
    #   "configSource": "/data",
    #   "bucketName": "local-bucket"
    connectionOptions={},
)

# Step 2: Wrap it inside StorageConnection
storage_conn = StorageConnection(config=custom_conn)

# Step 3: Create StorageServiceRequest
storage_service_request = CreateStorageServiceRequest(
    name="Local-Filesystem",
    serviceType=StorageServiceType.CustomStorage,
    displayName="Local Filesystem",
    description="Custom storage service for local files",
    connection=storage_conn  # <-- must be StorageConnection, not CustomStorageConnection directly
)

# Step 4: Create or update in OpenMetadata
storage_service_entity = om_con.create_or_update(data=storage_service_request)
print(f"StorageService created: {storage_service_entity}")

StorageService created: id=Uuid(root=UUID('36592ca7-8dda-4fbe-ab45-d31c592976a3')) name=EntityName(root='Local-Filesystem') fullyQualifiedName=FullyQualifiedEntityName(root='Local-Filesystem') displayName='Local Filesystem' serviceType=<StorageServiceType.CustomStorage: 'CustomStorage'> description=Markdown(root='Custom storage service for local files') connection=StorageConnection(config=CustomStorageConnection(type=<CustomStorageType.CustomStorage: 'CustomStorage'>, sourcePythonClass='metadata.ingestion.source.storage.local.LocalStorageSource', connectionOptions=ConnectionOptions(root={}), containerFilterPattern=None, supportsMetadataExtraction=SupportsMetadataExtraction(root=True))) pipelines=None testConnectionResult=None tags=[] version=EntityVersion(root=0.1) updatedAt=Timestamp(root=1757347640097) updatedBy='ingestion-bot' href=Href(root=AnyUrl('http://localhost:8585/v1/services/storageServices/36592ca7-8dda-4fbe-ab45-d31c592976a3')) owners=None changeDescription=None incrementa

In [6]:
from metadata.generated.schema.api.data.createContainer import CreateContainerRequest
from metadata.generated.schema.entity.data.container import ContainerDataModel

# 1. Create the root container (Drive)
container_request = CreateContainerRequest(
    name="local-bucket",   # logical bucket name (must match your connectionOptions.bucketName if used)
    displayName="Local Bucket",
    service=storage_service_entity.fullyQualifiedName,
    dataModel=ContainerDataModel(
        isPartitioned=False,
        columns=[]
    )
)
container_entity = om_con.create_or_update(data=container_request)
print(f"Container created: {container_entity.fullyQualifiedName}")


Container created: root='Local-Filesystem.local-bucket'


In [12]:
from metadata.generated.schema.api.data.createFile import CreateFileRequest
from metadata.generated.schema.entity.data.file import FileType


file_request = CreateFileRequest(
    name="hospitals.csv",
    displayName="Hospitals in France",
    description="Dataset of French hospitals with location information",
    service=container_entity.fullyQualifiedName,      # points to the Container
    fileType=FileType.CSV,
    mimeType="text/csv",
    fileExtension=".csv",
    path="/INSERM/geo/datasets/raw/hospitals.csv",
    size=123456,  # in bytes
    checksum="d41d8cd98f00b204e9800998ecf8427e",  # optional md5/sha1/sha256 hash
    isShared=False,
    fileVersion="v1.0",
)

file_entity = om_con.create_or_update(data=file_request)
print(f"File created: {file_entity.fullyQualifiedName}")


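APIError: driveService instance for "Local-Filesystem.local-bucket" not found

The request fails because the server looks up the `service` value of a file as a drive service, whereas `Local-Filesystem.local-bucket` is the FQN of a container in a storage service.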
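
Independently of this error, the `size` and `checksum` values in the request above are hard-coded. A sketch for computing them from a real file is shown below, using only the standard library; `file_size_and_checksum` is a hypothetical helper, and the choice of algorithm is left to the caller.

```python
import hashlib
import os
import tempfile


def file_size_and_checksum(path: str, algorithm: str = "sha256") -> tuple:
    """Return (size in bytes, hex digest) of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


# demo on an empty temporary file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp_path = tmp.name
try:
    print(file_size_and_checksum(tmp_path, "md5"))
finally:
    os.remove(tmp_path)
```

The MD5 digest of an empty file is `d41d8cd98f00b204e9800998ecf8427e`, which is the placeholder value used in the request above.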

## 3. Clean up

In [10]:

# get database service id
service_id = om_con.get_by_name(
    entity=DatabaseService, fqn=DB_SERVICE_NAME
).id

# delete service by using id
om_con.delete(
    entity=DatabaseService,
    entity_id=service_id,
    recursive=True,
    hard_delete=True,
)

In [13]:
# get the file system (storage) service entity by name
STORE_SER_NAME = "Local-Filesystem"
storage_service_entity = om_con.get_by_name(
    entity=StorageService, fqn=STORE_SER_NAME
)
print(storage_service_entity)

id=Uuid(root=UUID('b9e113ba-2ea4-4d75-9713-38d90ce7bf98')) name=EntityName(root='Local-Filesystem') fullyQualifiedName=FullyQualifiedEntityName(root='Local-Filesystem') displayName='Local Filesystem' serviceType=<StorageServiceType.CustomStorage: 'CustomStorage'> description=Markdown(root='Local filesystem containing raw data files.') connection=StorageConnection(config=CustomStorageConnection(type=<CustomStorageType.CustomStorage: 'CustomStorage'>, sourcePythonClass='metadata.ingestion.source.storage.local.LocalStorageSource', connectionOptions=ConnectionOptions(root={'bucketName': 'local-bucket', 'configSource': '/data'}), containerFilterPattern=None, supportsMetadataExtraction=SupportsMetadataExtraction(root=True))) pipelines=None testConnectionResult=None tags=[] version=EntityVersion(root=0.2) updatedAt=Timestamp(root=1757340315204) updatedBy='ingestion-bot' href=Href(root=AnyUrl('http://localhost:8585/v1/services/storageServices/b9e113ba-2ea4-4d75-9713-38d90ce7bf98')) owners=None

In [14]:
# delete service by using id
om_con.delete(
    entity=StorageService,
    entity_id=storage_service_entity.id,
    recursive=True,
    hard_delete=True,
)
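
Running a clean-up cell a second time is likely to fail once the service is gone, because the lookup then yields no entity to read the `id` from. A defensive variant is sketched below, assuming `get_by_name` returns `None` for a missing entity (the existence checks earlier in this tutorial rely on the same assumption). `delete_if_exists` is a hypothetical helper, and `StubClient` stands in for `om_con` so that the block runs without a server.

```python
from types import SimpleNamespace


class StubClient:
    """Minimal stand-in exposing the two client methods used in this section."""

    def __init__(self, ids_by_fqn: dict):
        self._ids_by_fqn = dict(ids_by_fqn)

    def get_by_name(self, entity, fqn):
        entity_id = self._ids_by_fqn.get(fqn)
        return SimpleNamespace(id=entity_id) if entity_id is not None else None

    def delete(self, entity, entity_id, recursive=False, hard_delete=False):
        self._ids_by_fqn = {k: v for k, v in self._ids_by_fqn.items() if v != entity_id}


def delete_if_exists(client, entity, fqn: str) -> bool:
    """Hard-delete an entity and its children; return False if it was not found."""
    found = client.get_by_name(entity=entity, fqn=fqn)
    if found is None:
        return False
    client.delete(entity=entity, entity_id=found.id, recursive=True, hard_delete=True)
    return True


client = StubClient({"Local-Filesystem": "b9e113ba"})
print(delete_if_exists(client, "StorageService", "Local-Filesystem"))  # first call deletes
print(delete_if_exists(client, "StorageService", "Local-Filesystem"))  # second call finds nothing
```

With the real client, the call would be `delete_if_exists(om_con, StorageService, STORE_SER_NAME)`.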