# Tags, Glossary and terms

In metadata management, we often add two further categories of metadata:
- Classification/Tags (labels): provide flexible classification and annotation of data assets.
- Glossary/Terms: provide a unified vocabulary across the organization.

In [None]:
from typing import Dict

from metadata.ingestion.ometa.ometa_api import OpenMetadata
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
    OpenMetadataConnection, AuthProvider)
from metadata.generated.schema.security.client.openMetadataJWTClientConfig import OpenMetadataJWTClientConfig

import chardet
import pandas as pd

from metadata.generated.schema.entity.data.table import Table
from metadata.ingestion.ometa.mixins.patch_mixin_utils import PatchOperation
from metadata.ingestion.models.table_metadata import ColumnTag
from metadata.generated.schema.type.tagLabel import TagLabel, TagSource, State, LabelType

In [4]:
# modify this value to match the URL of your target OpenMetadata server
target_om_server = "http://om-dev.casd.local/api"

In [5]:
from conf.creds import om_oidc_token

server_config = OpenMetadataConnection(
    hostPort=target_om_server,
    authProvider=AuthProvider.openmetadata,
    securityConfig=OpenMetadataJWTClientConfig(
        jwtToken=om_oidc_token,
    ),
)
om_conn = OpenMetadata(server_config)

In [6]:
# if this returns True, the connection succeeded
om_conn.health_check()

True

In [7]:
import pathlib

project_root = pathlib.Path.cwd().parent
metadata_path = project_root / "data"

print(metadata_path)

C:\Users\PLIU\Documents\git\Seminare_data_catalog\data


## 1. Classification and tags

In OpenMetadata, a `Classification` entity contains hierarchical terms called **tags**, which are used to categorize and classify data assets and other entities.

For example, at CASD we value data security, so we want to classify data with the security levels below:
- **TopSecret**: Such material would cause "exceptionally grave damage" to national security if made publicly available
- **Secret**: Secret material would cause "serious damage" to national security if it were publicly available
- **Confidential**: Confidential material would cause "damage" or be prejudicial to national security if publicly available
- **Restricted**: Restricted material would cause "undesirable effects" if publicly available.
- **Official**: Official material forms the generality of government business, public service delivery and commercial activity. This includes a diverse range of information, of varying sensitivities, and with differing consequences resulting from compromise or loss.
- **Unclassified**: Unclassified information is low-impact, and therefore does not require any special protection, such as vetting of personnel.

To classify data with the above tags, we need the following steps:
1. Create a classification called **SecurityLevel**
2. Create tags in the classification **SecurityLevel**

### 1.1 Create classification entity

To create a new classification entity, we build a `CreateClassificationRequest` and submit it to the server.

In [None]:
from metadata.generated.schema.api.classification.createClassification import CreateClassificationRequest
from metadata.generated.schema.api.classification.createTag import CreateTagRequest

#### Config #######
classification_name = "SecurityLevel"
classification_desc = "CASD data confidentiality classification"

# build the creation request
classification_request=CreateClassificationRequest(
    name=classification_name,
    description=classification_desc,
)

# submit the request
classification_entity=om_conn.create_or_update(classification_request)

### 1.2 Create tags in the classification entity


In [None]:
ts_tag_request=CreateTagRequest(
    classification=classification_request.name,
    name="TopSecret",
    displayName="TopSecret",
    description="Such material would cause `exceptionally grave damage` to national security if made publicly available",
)

s_tag_request=CreateTagRequest(
    classification=classification_request.name,
    name="Secret",
    displayName="Secret",
    description="Secret material would cause `serious damage` to national security if it were publicly available",
)


conf_tag_request=CreateTagRequest(
    classification=classification_request.name,
    name="Confidential",
    displayName="Confidential",
    description="Confidential material would cause `damage` or be prejudicial to national security if publicly available",
)

res_tag_request=CreateTagRequest(
    classification=classification_request.name,
    name="Restricted",
    displayName="Restricted",
    description="Restricted material would cause `undesirable effects` if publicly available",
)

off_tag_request=CreateTagRequest(
    classification=classification_request.name,
    name="Official",
    displayName="Official",
    description="Official material forms the generality of government business, public service delivery and commercial activity. This includes a diverse range of information, of varying sensitivities, and with differing consequences resulting from compromise or loss",
)


un_tag_request=CreateTagRequest(
    classification=classification_request.name,
    name="Unclassified",
    displayName="Unclassified",
    description="Unclassified information is low-impact, and therefore does not require any special protection, such as vetting of personnel.",
)



ts_tag_entity=om_conn.create_or_update(ts_tag_request)
s_tag_entity=om_conn.create_or_update(s_tag_request)
conf_tag_entity=om_conn.create_or_update(conf_tag_request)
res_tag_entity=om_conn.create_or_update(res_tag_request)
off_tag_entity=om_conn.create_or_update(off_tag_request)
un_tag_entity=om_conn.create_or_update(un_tag_request)
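The six requests above differ only in name and description, so they could also be generated from a single mapping. The sketch below is one possible way to do that; `SECURITY_LEVELS` and `build_tag_specs` are hypothetical names that do not come from the SDK, and the SDK call itself is left as a comment so the block runs without a server.

```python
from typing import Dict, List

# Tag names and descriptions, copied from the list in section 1.
SECURITY_LEVELS: Dict[str, str] = {
    "TopSecret": "Such material would cause `exceptionally grave damage` to national security if made publicly available",
    "Secret": "Secret material would cause `serious damage` to national security if it were publicly available",
    "Confidential": "Confidential material would cause `damage` or be prejudicial to national security if publicly available",
    "Restricted": "Restricted material would cause `undesirable effects` if publicly available",
    "Official": "Official material forms the generality of government business, public service delivery and commercial activity.",
    "Unclassified": "Unclassified information is low-impact, and therefore does not require any special protection.",
}


def build_tag_specs(classification: str, levels: Dict[str, str]) -> List[Dict[str, str]]:
    """Return one keyword-argument dict per tag, in the order of the mapping."""
    return [
        {
            "classification": classification,
            "name": name,
            "displayName": name,
            "description": description,
        }
        for name, description in levels.items()
    ]


tag_specs = build_tag_specs("SecurityLevel", SECURITY_LEVELS)
for spec in tag_specs:
    print(spec["name"])
    # With the objects from the cells above, each spec could be submitted as:
    # om_conn.create_or_update(CreateTagRequest(**spec))
```

Adding a level then only requires a new entry in the mapping.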



### 1.3 Tag/Classify data

We have created the tags; now we need to associate data assets with them. This is called tagging.



In [1]:
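def get_table_entity_by_name(om_con, table_fqn: str):
    """
    Return the table entity with the given fully qualified name, or None if it does not exist.
    :param om_con: an OpenMetadata connection
    :param table_fqn: fully qualified name of the table
    :return: the Table entity, or None
    """
    return om_con.get_by_name(entity=Table, fqn=table_fqn)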

In [None]:
from metadata.generated.schema.entity.data.table import Table
from metadata.ingestion.ometa.mixins.patch_mixin_utils import PatchOperation
from metadata.generated.schema.type.tagLabel import TagLabel, TagSource, State, LabelType

# To attach a tag to another entity, we first need to build a tag label.
# A tag label only needs the tag FQN (a string) to identify the tag.
ts_tag_label=TagLabel(tagFQN=ts_tag_entity.fullyQualifiedName,source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)
s_tag_label=TagLabel(tagFQN=s_tag_entity.fullyQualifiedName,source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)
conf_tag_label=TagLabel(tagFQN=conf_tag_entity.fullyQualifiedName,source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)

# if we already know the FQN, we can build the tag label directly from the string value
res_tag_label=TagLabel(tagFQN="SecurityLevel.Restricted",source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)
off_tag_label=TagLabel(tagFQN="SecurityLevel.Official",source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)
un_tag_label=TagLabel(tagFQN="SecurityLevel.Unclassified",source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)

# table_a_entity must be a Table entity fetched beforehand,
# e.g. with get_table_entity_by_name(om_conn, <table fqn>)
om_conn.patch_tags(
    entity=Table,
    source=table_a_entity,
    tag_labels=[s_tag_label],
    operation=PatchOperation.ADD
)
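
The hard-coded strings such as `"SecurityLevel.Restricted"` follow the pattern `<classification>.<tag>`. A small helper can build them consistently; this is only a sketch, `build_tag_fqn` is a hypothetical name, and it assumes that neither part contains a dot (names with dots would need special handling that is not covered here).

```python
def build_tag_fqn(classification: str, tag: str) -> str:
    """Join a classification name and a tag name into '<classification>.<tag>'."""
    for part in (classification, tag):
        # Assumption: plain names only; empty names or names with dots are rejected.
        if not part or "." in part:
            raise ValueError(f"unsupported name: {part!r}")
    return f"{classification}.{tag}"


print(build_tag_fqn("SecurityLevel", "Restricted"))  # SecurityLevel.Restricted
```

The resulting string can then be passed as `tagFQN` when building a `TagLabel`, as in the cell above.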

In [8]:
nomen_spec_path = f"{metadata_path}/constances_nomenclatures.csv"
column_spec_path = f"{metadata_path}/constances_vars.csv"
desc_dir_path = f"{metadata_path}/nomenclature_values"
nomen_om_term_out_path = f"{metadata_path}/om_nomenclatures.csv"

In [9]:
def create_term_row(name: str, display_name: str, description: str, parent: str = "", synonyms: str = None,
                    related_terms: str = "", references: str = "", tags: str = "", reviewers: str = "",
                    owner: str = "user;admin", status: str = "Approved") -> Dict[str, str]:
    """
    Build one row of the glossary-term dataframe, which can be imported into the OM server via the web UI.
    :param name: glossary term name
    :type name: str
    :param display_name: The display name of the glossary term
    :type display_name: str
    :param description: The description of the glossary term; Markdown is accepted
    :type description: str
    :param parent: The parent term of the current term, given as the FQN of the parent
    :type parent: str
    :param synonyms: A list of synonyms of the current term, separated by ;
    :type synonyms: str
    :param related_terms: A list of related terms, given as FQNs and separated by ;
    :type related_terms: str
    :param references: Links to the sources from which the term is derived. The references must be in the format (name;url;name;url)
    :type references: str
    :param tags: Tags which already exist in OpenMetadata. The tags must be in the format (PII.Sensitive;PersonalData.Personal)
    :type tags: str
    :param reviewers: An existing user who reviews the term. It must be in the format (user;uid)
    :type reviewers: str
    :param owner: An existing user who owns the term. It must be in the format (user;uid)
    :type owner: str
    :param status: The status of the term (an enum value)
    :type status: str
    :return: a dict which maps the import CSV column names to their values
    :rtype: Dict[str, str]
    """
    # For an unknown reason, synonyms can't be empty; otherwise the term is rejected as invalid
    if synonyms is None:
        synonyms = display_name.lower()
    return {
        "parent": parent,
        "name*": name.strip(),
        "displayName": display_name.strip(),
        "description": description,
        "synonyms": synonyms,
        "relatedTerms": related_terms,
        "references": references,
        "tags": tags,
        "reviewers": reviewers,
        "owner": owner,
        "status": status}
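
To make the shape of the import file concrete, the sketch below writes one such row to CSV text with the standard `csv` module instead of pandas. The column order is taken from the dict returned above; `TERM_COLUMNS` and `sample_row` are illustrative names, and the row values are hand-written examples rather than real output.

```python
import csv
import io

# Column order taken from the keys of the dict returned by create_term_row.
TERM_COLUMNS = [
    "parent", "name*", "displayName", "description", "synonyms",
    "relatedTerms", "references", "tags", "reviewers", "owner", "status",
]

# A hand-written row with the same keys; the values are illustrative only.
sample_row = {
    "parent": "",
    "name*": "geometry_type",
    "displayName": "geometry_type",
    "description": "The description of possible type values in a geometry column",
    "synonyms": "geometry_type",
    "relatedTerms": "",
    "references": "",
    "tags": "",
    "reviewers": "",
    "owner": "user;admin",
    "status": "Approved",
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=TERM_COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerow(sample_row)
csv_text = buffer.getvalue()
print(csv_text)
```

Multi-value fields use `;` inside a cell, so they do not clash with the `,` column separator.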

In [10]:
def detect_file_encoding(file_path: str) -> str:
    # Detect the encoding of the CSV file
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
        return result['encoding']


def parse_linked_col_str(cols_str: str):
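    """
    Parse the raw string of linked columns into a list of column names.
    :param cols_str: comma-separated column names, or "Aucune variable liée" (no linked variable)
    :type cols_str: str
    :return: the list of linked column names (empty if there is none)
    :rtype: list
    """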
    if cols_str.strip() == "Aucune variable liée":
        linked_cols = []
    else:
        # if the value contains ", remove it
        clean_str = cols_str.replace("\"", "")
        linked_cols = [item.strip() for item in clean_str.split(",")]
    return linked_cols


def parse_nomenclature_row(row: dict, desc_dir_path: str):
    """
    Parse one row of the nomenclature dataframe and return the name of the nomenclature,
    its description (including the value table in Markdown), and the list of linked column names.
    :param desc_dir_path: The root dir which contains the description file of each nomenclature
    :type desc_dir_path: str
    :param row: A dict which represents a row of the snds_nomenclature table
    :type row: Dict
    :return: term_name, full_desc, linked_cols
    :rtype: tuple
    """
    # 1.get the name of the term
    term_name = row["nomenclature"]

    # 2. get the description of the term
    desc = row["titre"]
    # get the value table path,

    desc_file_path = f"{desc_dir_path}/{term_name}.csv"
    file_encoding = detect_file_encoding(desc_file_path)
    # get the value table content
    desc_detail = pd.read_csv(desc_file_path, sep=";", encoding=file_encoding).to_markdown(index=False)
    # build the description in Markdown format
    full_desc = f"{desc} \n {desc_detail}"

    # 3. get the linked columns
    linked_cols_str = row["variables_liees"]
    linked_cols = parse_linked_col_str(linked_cols_str)

    return term_name, full_desc, linked_cols
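
As a quick check of `parse_linked_col_str`, the sketch below repeats the same logic in a self-contained form and runs it on a few inputs. The input strings are made-up examples; "Aucune variable liée" is French for "no linked variable".

```python
def parse_linked_col_str(cols_str: str) -> list:
    """Same logic as the function above, repeated so this example runs on its own."""
    if cols_str.strip() == "Aucune variable liée":
        return []
    # remove double quotes, then split on commas
    clean_str = cols_str.replace('"', "")
    return [item.strip() for item in clean_str.split(",")]


examples = ["Aucune variable liée", "geometry", '"insee", "geometry"']
results = [parse_linked_col_str(example) for example in examples]
for example, result in zip(examples, results):
    print(f"{example!r} -> {result}")
```

Note that an empty string yields `['']` rather than an empty list, so callers may want to filter out empty names.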

In [11]:
# 2. Read the nomenclature file
nom_df = pd.read_csv(nomen_spec_path).drop_duplicates(subset=["nomenclature"])

print(nom_df.head(5))


    nomenclature                                              titre  \
0  geometry_type  The description of possible type values in a g...   
1     code_insee  The administrative code of a french commune is...   

  variables_liees  nombre_lignes  
0        geometry            NaN  
1           insee            NaN  


In [12]:
generated_term_rows = []
# 3. For each row in the nomenclature file, generate a new row for OM term
for index, row in nom_df.iterrows():
    row_dict = row.to_dict()
    print(row_dict)
    term_name, full_desc, linked_cols = parse_nomenclature_row(row_dict, desc_dir_path)
    print(f"treating term: {term_name}")
    generated_term_rows.append(create_term_row(term_name, term_name, full_desc))

print(generated_term_rows)

{'nomenclature': 'geometry_type', 'titre': 'The description of possible type values in a geometry column', 'variables_liees': 'geometry', 'nombre_lignes': nan}
treating term: geometry_type
{'nomenclature': 'code_insee', 'titre': 'The administrative code of a french commune issued by INSEE ', 'variables_liees': 'insee', 'nombre_lignes': nan}
treating term: code_insee
[{'parent': '', 'name*': 'geometry_type', 'displayName': 'geometry_type', 'description': 'The description of possible type values in a geometry column \n | geo_type   | description                                                     |\n|:-----------|:----------------------------------------------------------------|\n| point      | A pair of (latitude, longitude) which represent the geolocation |\n| line       | A list of points which represent a line                         |\n| polygon    | A list of points which represent a polygone                     |', 'synonyms': 'geometry_type', 'relatedTerms': '', 'references': '',

In [13]:
# 4. build a dataframe and export to csv
nomen_om_term_df = pd.DataFrame(generated_term_rows)
nomen_om_term_df.to_csv(nomen_om_term_out_path, index=False, sep=",", encoding="utf-8")

## 2. Ingest the nomenclature


As of 09-2025, **the Python SDK does not support nomenclature ingestion**, and we have not found an API for it either. The official documentation for importing glossaries and terms is [here](https://docs.open-metadata.org/latest/how-to-guides/data-governance/glossary/import). So we will `load the glossary terms via the OM GUI`. The goal is:

1. Create a glossary called "constances_geo_terms"
2. Import the generated `om_nomenclatures.csv` into it
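
Since the import happens by hand in the GUI, it may help to sanity-check the CSV first. The sketch below is an assumption-laden pre-check, not an official validator: `validate_term_csv` is a hypothetical helper, the column list mirrors the keys produced by `create_term_row`, and the non-empty `synonyms` rule comes from the remark in that function.

```python
import csv
import io
from typing import List

REQUIRED_COLUMNS = [
    "parent", "name*", "displayName", "description", "synonyms",
    "relatedTerms", "references", "tags", "reviewers", "owner", "status",
]


def validate_term_csv(csv_text: str) -> List[str]:
    """Return a list of problems found; an empty list means none was found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [col for col in REQUIRED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    # count data rows rather than file lines, since descriptions may span several lines
    for row_no, row in enumerate(reader, start=1):
        if not row["name*"].strip():
            problems.append(f"row {row_no}: empty name")
        if not row["synonyms"].strip():
            problems.append(f"row {row_no}: empty synonyms")
    return problems


header = ",".join(REQUIRED_COLUMNS)
good_csv = header + "\n,code_insee,code_insee,desc,code_insee,,,,,user;admin,Approved\n"
bad_csv = header + "\n,code_insee,code_insee,desc,,,,,,user;admin,Approved\n"
print(validate_term_csv(good_csv))
print(validate_term_csv(bad_csv))
```

In the notebook, the text to check would be the content of the file at `nomen_om_term_out_path`.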



## 3. Link terms with target columns

The terms in the glossary have been created; now we need to associate them with the matching columns.

> If you have already moved the database to the domain constances, you also need to move the glossary to the domain constances. Otherwise, you cannot associate the terms with the target columns.

In [14]:
DB_SERVICE_NAME = "Constances-Geography"
DB_NAME = "hospitals_in_france"
SCHEMA_NAME = "Geography"
glossary_name = "constances_geo_terms"

In [25]:
from typing import Optional

from metadata.generated.schema.entity.data.glossaryTerm import GlossaryTerm


def get_glossary_term_by_fqn(om_conn, term_fqn: str) -> Optional[GlossaryTerm]:
    """
    Take the fully qualified name of a term and return the corresponding term entity.
    If the target term does not exist or is not valid, return None.
    :param om_conn: an OpenMetadata connection
    :param term_fqn: fully qualified name of the term, in the form glossary.term
    :type term_fqn: str
    :return: the term entity, or None
    :rtype: Optional[GlossaryTerm]
    """
    try:
        return om_conn.get_by_name(entity=GlossaryTerm, fqn=term_fqn)
    except Exception as e:
        # report the problem and return None, as documented above
        print(f"Cannot find the target term, or the term is not valid: {e}")
        return None

In [26]:
term_name = "code_insee"
t1_entity= get_glossary_term_by_fqn(om_conn,f"{glossary_name}.{term_name}")
print(t1_entity)

id=Uuid(root=UUID('67da81ea-7942-4610-a895-d21ea679f04d')) name=EntityName(root='code_insee') displayName='code_insee' description=Markdown(root='<p>The administrative code of a French commune issued by INSEE</p><p>| insee_code | commune_name |</p><p>|-------------:|:---------------|</p><p>| 1 | toto |</p><p>| 2 | titi |</p><p>| 3 | tata |</p><p></p><p></p>') style=Style(color=None, iconURL=None) fullyQualifiedName=FullyQualifiedEntityName(root='constances_geo_terms.code_insee') synonyms=[] glossary=EntityReference(id=Uuid(root=UUID('354c661a-bf2d-4f19-9fe6-2436f62738ee')), type='glossary', name='constances_geo_terms', fullyQualifiedName='constances_geo_terms', description=Markdown(root='<p>This glossary is for constances geospatial dataset</p>'), displayName='constances_geo_terms', deleted=False, inherited=None, href=Href(root=AnyUrl('http://localhost:8585/v1/glossaries/354c661a-bf2d-4f19-9fe6-2436f62738ee'))) parent=None children=None relatedTerms=None references=[] version=EntityVer

In [27]:
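# NOTE: this FQN stops at the schema level (service.database.schema).
# A table FQN needs the table name as a fourth part, which is why the lookup below prints None.
table_fqn = "Constances-Geography.hospitals_in_france.Geography"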
table_entity = om_conn.get_by_name(entity=Table, fqn=table_fqn)
print(table_entity)

None
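
The lookup above returns `None`, most likely because the FQN has only three parts and therefore names a schema rather than a table. The sketch below illustrates the idea by counting the parts of an FQN; `describe_fqn` is a hypothetical helper, and it assumes the usual `service.database.schema.table.column` layout with no dots inside the names.

```python
def describe_fqn(fqn: str) -> str:
    """Guess which level of the hierarchy an FQN points to from its number of parts."""
    levels = {1: "service", 2: "database", 3: "schema", 4: "table", 5: "column"}
    # Assumption: no part contains a dot, so a plain split is enough.
    return levels.get(len(fqn.split(".")), "unknown")


schema_fqn = "Constances-Geography.hospitals_in_france.Geography"
print(describe_fqn(schema_fqn))                         # schema
print(describe_fqn(f"{schema_fqn}.fr_communes_clean"))  # table
```

The table name `fr_communes_clean` is taken from the column specification file used later in this notebook.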


In [20]:



def search_constance_column_name(tab_col_path: str, target_col_name: str) -> list:
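    """
    Read the constances_vars file and find all rows whose column name matches the
    target column name. Return the list of matching table/column pairs.
    :param tab_col_path: path of the CSV file which lists the columns of each table
    :type tab_col_path: str
    :param target_col_name: the column name to search for
    :type target_col_name: str
    :return: a list of dicts with the keys "table" and "column"
    :rtype: list
    """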
    res_tab_col = []
    col_df = pd.read_csv(tab_col_path, header=0)
    target_tab_col = col_df[col_df["var"] == target_col_name]
    if len(target_tab_col) > 0:
        for index, row in target_tab_col.iterrows():
            row_dict = row.to_dict()
            res_tab_col.append({
                "table": row_dict["table"],
                "column": row_dict["var"],
            })

    return res_tab_col
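
For readers without the data files, the same lookup can be tried on an in-memory sample with the standard `csv` module. This is a sketch, not the notebook's implementation: `search_column_name` is a hypothetical variant that takes CSV text instead of a path, the first three sample rows mirror the output shown further below, and the last row is invented for illustration.

```python
import csv
import io
from typing import Dict, List

# Sample content with the two columns the function relies on: "table" and "var".
SAMPLE_VARS_CSV = """table,var
fr_communes_raw,geometry
fr_communes_clean,geometry
hospitals_in_communes,geometry
fr_communes_clean,insee
"""


def search_column_name(csv_text: str, target_col_name: str) -> List[Dict[str, str]]:
    """Return every table/column pair whose column name equals target_col_name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"table": row["table"], "column": row["var"]}
        for row in reader
        if row["var"] == target_col_name
    ]


matches = search_column_name(SAMPLE_VARS_CSV, "geometry")
print(matches)
```

An unknown column name simply yields an empty list, which the calling loops handle without a special case.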





def patch_term_to_column(om_conn, term_fqn: str, db_fqn: str, tab_name: str, column_name: str):
    """
    Attach a glossary term to a column of the given table.
    :param om_conn: an OpenMetadata connection
    :param term_fqn: fully qualified name of the glossary term
    :param db_fqn: FQN prefix of the table, in the form service.database.schema
    :param tab_name: name of the table which contains the target column
    :param column_name: name of the target column
    """
    # get the target table entity which contains the target column
    table_fqn = f"{db_fqn}.{tab_name}"
    table_entity = om_conn.get_by_name(entity=Table, fqn=table_fqn)
    if not table_entity:
        print(f"Can't find the table entity: {table_fqn}")
        return

    # get the term entity
    term_entity = get_glossary_term_by_fqn(om_conn, term_fqn)
    if not term_entity:
        print(f"Can't find the term entity: {term_fqn}")
        return

    # build a tag label with the given term entity
    tag_label = TagLabel(tagFQN=str(term_entity.fullyQualifiedName), source=TagSource.Glossary, state=State.Suggested,
                         labelType=LabelType.Automated)
    col_tag = ColumnTag(column_fqn=f"{table_fqn}.{column_name}", tag_label=tag_label)

    om_conn.patch_column_tags(table=table_entity,
                              column_tags=[col_tag],
                              operation=PatchOperation.ADD)

In [21]:
resu_list = search_constance_column_name(column_spec_path, "geometry")
print(resu_list)

[{'table': 'fr_communes_raw', 'column': 'geometry'}, {'table': 'fr_communes_clean', 'column': 'geometry'}, {'table': 'hospitals_in_communes', 'column': 'geometry'}]


In [22]:
# define the target database fqn
db_fqn = f"{DB_SERVICE_NAME}.{DB_NAME}.{SCHEMA_NAME}"

# get nomenclature df
nom_df = pd.read_csv(nomen_spec_path, header=0).drop_duplicates(subset=["nomenclature"])
print(nom_df.head(5))

    nomenclature                                              titre  \
0  geometry_type  The description of possible type values in a g...   
1     code_insee  The administrative code of a french commune is...   

  variables_liees  nombre_lignes  
0        geometry            NaN  
1           insee            NaN  


In [23]:
for index, row in nom_df.iterrows():
    row_dict = row.to_dict()
    print(row_dict)

    term_name, _, linked_cols = parse_nomenclature_row(row_dict, desc_dir_path)
    print(f"treating term: {term_name}")

    # --- via term name ---
    term_linked_tab_cols = search_constance_column_name(column_spec_path, term_name) or []
    for term_linked_tab_col in term_linked_tab_cols:
        print(f"target column: {term_linked_tab_col}")
        patch_term_to_column(
            om_conn,
            f"{glossary_name}.{term_name}",
            db_fqn,
            term_linked_tab_col["table"],
            term_linked_tab_col["column"],
        )

    # --- via linked columns ---
    if linked_cols:
        for linked_col in linked_cols:
            print(f"linked col: {linked_col}")
            col_linked_tab_cols = search_constance_column_name(column_spec_path, linked_col) or []
            for col_linked_tab_col in col_linked_tab_cols:
                patch_term_to_column(
                    om_conn,
                    f"{glossary_name}.{term_name}",
                    db_fqn,
                    col_linked_tab_col["table"],
                    col_linked_tab_col["column"],
                )
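
The loop above can reach the same column twice when a term's name is also one of its linked columns. One possible refinement, shown here only as a sketch, is to collect the unique table/column pairs first and patch each of them once; `unique_targets` is a hypothetical helper and the sample lists are made up.

```python
from typing import Dict, Iterable, List, Tuple


def unique_targets(*groups: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Merge several lists of table/column dicts, keeping the first occurrence of each pair."""
    seen = set()
    ordered = []
    for group in groups:
        for item in group:
            key = (item["table"], item["column"])
            if key not in seen:
                seen.add(key)
                ordered.append(key)
    return ordered


# Made-up search results: the first pair appears in both lists.
by_term_name = [{"table": "fr_communes_raw", "column": "geometry"}]
by_linked_cols = [
    {"table": "fr_communes_raw", "column": "geometry"},
    {"table": "fr_communes_clean", "column": "geometry"},
]
targets = unique_targets(by_term_name, by_linked_cols)
print(targets)
```

Each `(table, column)` pair in `targets` could then be passed to `patch_term_to_column` exactly once.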



{'nomenclature': 'geometry_type', 'titre': 'The description of possible type values in a geometry column', 'variables_liees': 'geometry', 'nombre_lignes': nan}
treating term: geometry_type
linked col: geometry




{'nomenclature': 'code_insee', 'titre': 'The administrative code of a french commune issued by INSEE ', 'variables_liees': 'insee', 'nombre_lignes': nan}
treating term: code_insee
linked col: insee




In [24]:
for index, row in nom_df.iterrows():
    row_dict = row.to_dict()
    print(row_dict)
    term_name, _, linked_cols = parse_nomenclature_row(row_dict, desc_dir_path)
    print(f"treating term: {term_name}")
    # find linked table column via term name
    term_linked_tab_cols = search_constance_column_name(column_spec_path, term_name)
    print(term_linked_tab_cols)
    # for each linked col, add term to the column
    for term_linked_tab_col in term_linked_tab_cols:
        print(f"target column: {term_linked_tab_col}")
        patch_term_to_column(om_conn, f"{glossary_name}.{term_name}", db_fqn, term_linked_tab_col["table"],
                             term_linked_tab_col["column"])
    # find linked table column via linked_cols
    print(linked_cols)
    if len(linked_cols) > 0:
        for linked_col in linked_cols:
            print(linked_col)
            col_linked_tab_cols = search_constance_column_name(column_spec_path, linked_col)
            # for each find table column, add tag
            for col_linked_tab_col in col_linked_tab_cols:
                patch_term_to_column(om_conn, f"{glossary_name}.{term_name}", db_fqn, col_linked_tab_col["table"],
                                     col_linked_tab_col["column"])



{'nomenclature': 'geometry_type', 'titre': 'The description of possible type values in a geometry column', 'variables_liees': 'geometry', 'nombre_lignes': nan}
treating term: geometry_type
[]
['geometry']
geometry




{'nomenclature': 'code_insee', 'titre': 'The administrative code of a french commune issued by INSEE ', 'variables_liees': 'insee', 'nombre_lignes': nan}
treating term: code_insee
[]
['insee']
insee


