# Tags, Glossary and terms

In metadata management, we often add two further categories of metadata:
- Classification/Tags (Labels): provide flexible classification and annotation of data assets.
- Glossary/Terms: provide a unified vocabulary across the organization.

In [2]:
from typing import Dict

from metadata.ingestion.ometa.ometa_api import OpenMetadata
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
    OpenMetadataConnection, AuthProvider)
from metadata.generated.schema.security.client.openMetadataJWTClientConfig import OpenMetadataJWTClientConfig

import chardet
import pandas as pd

from metadata.generated.schema.entity.data.table import Table
from metadata.ingestion.ometa.mixins.patch_mixin_utils import PatchOperation
from metadata.ingestion.models.table_metadata import ColumnTag
from metadata.generated.schema.type.tagLabel import TagLabel, TagSource, State, LabelType

In [3]:
# modify this value to match the URL of your target OpenMetadata server
target_om_server = "http://om-dev.casd.local/api"

In [4]:
from conf.creds import om_oidc_token

server_config = OpenMetadataConnection(
    hostPort=target_om_server,
    authProvider=AuthProvider.openmetadata,
    securityConfig=OpenMetadataJWTClientConfig(
        jwtToken=om_oidc_token,
    ),
)
om_conn = OpenMetadata(server_config)

In [5]:
# if this returns True, the connection was successful
om_conn.health_check()

True

In [6]:
import pathlib

project_root = pathlib.Path.cwd().parent
metadata_path = project_root / "data"

print(metadata_path)

C:\Users\PLIU\Documents\git\Seminare_data_catalog\data


## 1. Classification and tags

In OpenMetadata, a `Classification` entity contains hierarchical terms called **tags**, which are used to categorize and classify data assets and other entities.

For example, at CASD we value data security, so we want to classify data with the security levels below:
- **TopSecret**: Such material would cause "exceptionally grave damage" to national security if made publicly available.
- **Secret**: Secret material would cause "serious damage" to national security if it were publicly available.
- **Confidential**: Confidential material would cause "damage" or be prejudicial to national security if publicly available.
- **Restricted**: Restricted material would cause "undesirable effects" if publicly available.
- **Official**: Official material forms the generality of government business, public service delivery and commercial activity. This includes a diverse range of information, of varying sensitivities, and with differing consequences resulting from compromise or loss.
- **Unclassified**: Unclassified information is low-impact, and therefore does not require any special protection, such as vetting of personnel.

To classify data with the tags above, we need to carry out the following steps:
1. Create a classification called **SecurityLevel**
2. Create tags in the classification **SecurityLevel**
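
The six levels are listed above from most to least sensitive, so they can be treated as an ordered scale. The following is a minimal sketch and not part of OpenMetadata: the `SecurityLevel` enum and the `strictest` helper are hypothetical names, and the rule that a dataset takes the strictest level of its parts is an assumption made only for illustration.

```python
from enum import IntEnum


class SecurityLevel(IntEnum):
    """Hypothetical ranking of the levels; a higher value means more sensitive."""
    Unclassified = 0
    Official = 1
    Restricted = 2
    Confidential = 3
    Secret = 4
    TopSecret = 5


def strictest(levels):
    """Return the most sensitive level among the given ones."""
    return max(levels)


def tag_fqn(level):
    """Build the tag name as <classification>.<tag> (assumed naming scheme)."""
    return f"SecurityLevel.{level.name}"


# Assumed example: three columns of one table carry different levels.
column_levels = [SecurityLevel.Official, SecurityLevel.Secret, SecurityLevel.Restricted]
table_level = strictest(column_levels)
print(tag_fqn(table_level))
```

An `IntEnum` keeps the comparison logic in one place, so adding or reordering levels only requires changing the enum values.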

### 1.1 Create classification entity

To create a new classification entity, we build a `CreateClassificationRequest` and submit it with `om_conn.create_or_update`.

In [7]:
from metadata.generated.schema.api.classification.createClassification import CreateClassificationRequest
from metadata.generated.schema.api.classification.createTag import CreateTagRequest

#### Config #######
classification_name = "SecurityLevel"
classification_desc = "CASD data confidentiality classification"

# build the creation request
classification_request=CreateClassificationRequest(
    name=classification_name,
    description=classification_desc,
)

# submit the request
classification_entity=om_conn.create_or_update(classification_request)

### 1.2 Create tags in the classification entity


In [9]:
ts_tag_request=CreateTagRequest(
    classification=classification_name,
    name="TopSecret",
    displayName="TopSecret",
    description="Such material would cause `exceptionally grave damage` to national security if made publicly available",
)

s_tag_request=CreateTagRequest(
    classification=classification_name,
    name="Secret",
    displayName="Secret",
    description="Secret material would cause `serious damage` to national security if it were publicly available",
)


conf_tag_request=CreateTagRequest(
    classification=classification_name,
    name="Confidential",
    displayName="Confidential",
    description="Confidential material would cause `damage` or be prejudicial to national security if publicly available",
)

res_tag_request=CreateTagRequest(
    classification=classification_name,
    name="Restricted",
    displayName="Restricted",
    description="Restricted material would cause `undesirable effects` if publicly available",
)

off_tag_request=CreateTagRequest(
    classification=classification_name,
    name="Official",
    displayName="Official",
    description="Official material forms the generality of government business, public service delivery and commercial activity. This includes a diverse range of information, of varying sensitivities, and with differing consequences resulting from compromise or loss",
)


un_tag_request=CreateTagRequest(
    classification=classification_name,
    name="Unclassified",
    displayName="Unclassified",
    description="Unclassified information is low-impact, and therefore does not require any special protection, such as vetting of personnel.",
)



ts_tag_entity=om_conn.create_or_update(ts_tag_request)
s_tag_entity=om_conn.create_or_update(s_tag_request)
conf_tag_entity=om_conn.create_or_update(conf_tag_request)
res_tag_entity=om_conn.create_or_update(res_tag_request)
off_tag_entity=om_conn.create_or_update(off_tag_request)
un_tag_entity=om_conn.create_or_update(un_tag_request)
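
The six requests above differ only in their name and description, so they could be generated from a small lookup table. The sketch below uses plain dictionaries instead of the SDK classes so that it runs without a server; `TAG_SPECS` and `build_tag_payloads` are hypothetical names introduced for illustration.

```python
# Tag names and descriptions copied from the requests above.
TAG_SPECS = {
    "TopSecret": "Such material would cause `exceptionally grave damage` to national security if made publicly available",
    "Secret": "Secret material would cause `serious damage` to national security if it were publicly available",
    "Confidential": "Confidential material would cause `damage` or be prejudicial to national security if publicly available",
    "Restricted": "Restricted material would cause `undesirable effects` if publicly available",
    "Official": "Official material forms the generality of government business, public service delivery and commercial activity.",
    "Unclassified": "Unclassified information is low-impact, and therefore does not require any special protection, such as vetting of personnel.",
}


def build_tag_payloads(classification, specs):
    """Return one dict of keyword arguments per tag, in insertion order."""
    return [
        {
            "classification": classification,
            "name": name,
            "displayName": name,
            "description": description,
        }
        for name, description in specs.items()
    ]


payloads = build_tag_payloads("SecurityLevel", TAG_SPECS)
for payload in payloads:
    print(payload["name"])
```

Each dictionary uses the same keyword names as the requests above, so it could be passed on as `CreateTagRequest(**payload)` and submitted with `om_conn.create_or_update`.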



### 1.3 Tag/Classify data

We have created the tags; now we need to associate data with them. We call this tagging.

> If you have deleted the metadata of the basic entities from TP2, you need to regenerate them so that we can tag them with the newly created tags.


In [18]:
# config of the target database which we want to tag
DB_SERVICE_NAME = "Constances-Geography"
DB_NAME = "hospitals_in_france"
SCHEMA_NAME = "Geography"

In [31]:
from metadata.generated.schema.entity.data.table import Table, Column

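def get_table_entity_by_name(om_con, table_fqn: str):
    """
    Return the table entity with the given fully qualified name (FQN), or None if it does not exist.
    :param om_con: an OpenMetadata client
    :param table_fqn: fully qualified name of the table
    :return: the Table entity, or None
    """
    return om_con.get_by_name(entity=Table, fqn=table_fqn)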
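
Because the lookup returns `None` when nothing matches, a typo in the FQN would only surface later, when the missing entity is passed to another call. A minimal sketch of a guard, assuming any lookup that signals "not found" with `None`; `get_or_fail` and `fake_catalog` are hypothetical and only stand in for the real client.

```python
def get_or_fail(lookup, fqn):
    """Call lookup(fqn) and raise LookupError when it returns None."""
    entity = lookup(fqn)
    if entity is None:
        raise LookupError(f"No entity found for FQN: {fqn}")
    return entity


# Stand-in for the real lookup, used only to demonstrate the helper.
fake_catalog = {"svc.db.schema.known_table": {"name": "known_table"}}

found = get_or_fail(fake_catalog.get, "svc.db.schema.known_table")
print(found["name"])

try:
    get_or_fail(fake_catalog.get, "svc.db.schema.typo_table")
except LookupError as err:
    print(err)
```

With the real client, `lookup` could be `lambda fqn: get_table_entity_by_name(om_conn, fqn)`.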


In [32]:
# get the table entity so that we can tag it afterwards
hic_table_name = "hospitals_in_communes"
hospitals_in_communes_fqn = f"{DB_SERVICE_NAME}.{DB_NAME}.{SCHEMA_NAME}.{hic_table_name}"
hospitals_in_communes_table_entity = get_table_entity_by_name(om_conn, hospitals_in_communes_fqn)
print(hospitals_in_communes_table_entity)

id=Uuid(root=UUID('855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')) name=EntityName(root='hospitals_in_communes') displayName=None fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes') description=Markdown(root='This table contains the number of hospitals in each communes') version=EntityVersion(root=0.3) updatedAt=Timestamp(root=1757592307286) updatedBy='ingestion-bot' href=Href(root=AnyUrl('http://localhost:8585/v1/tables/855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')) tableType=None columns=[Column(name=ColumnName(root='name'), displayName=None, dataType=<DataType.STRING: 'STRING'>, arrayDataType=None, dataLength=26, precision=None, scale=None, dataTypeDisplay='string(26)', description=Markdown(root='name of the commune and all letters are in lower case'), fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes.name'), tags=[], constraint=None, ordinalPosition=Non
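
The FQN used above is the service, database, schema and table names joined with dots, and the printed output shows that a column FQN appends the column name in the same way. A small sketch of that construction; `build_fqn` is a hypothetical helper, and it assumes that none of the names contains a dot (it rejects such names rather than trying to quote them).

```python
def build_fqn(*parts):
    """Join name parts with dots; reject empty parts and parts containing a dot."""
    for part in parts:
        if not part:
            raise ValueError("FQN parts must be non-empty")
        if "." in part:
            raise ValueError(f"Part {part!r} contains a dot; this sketch does not handle quoting")
    return ".".join(parts)


table_parts = ("Constances-Geography", "hospitals_in_france", "Geography", "hospitals_in_communes")
table_fqn = build_fqn(*table_parts)
column_fqn = build_fqn(*table_parts, "name")

print(table_fqn)
print(column_fqn)
```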

In [17]:
from metadata.ingestion.ometa.mixins.patch_mixin_utils import PatchOperation
from metadata.generated.schema.type.tagLabel import TagLabel, TagSource, State, LabelType

# we first need to create a tag label before we can attach a tag to another entity
# only the tag FQN is needed to identify the tag
ts_tag_label=TagLabel(tagFQN=ts_tag_entity.fullyQualifiedName,source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)
s_tag_label=TagLabel(tagFQN=s_tag_entity.fullyQualifiedName,source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)
conf_tag_label=TagLabel(tagFQN=conf_tag_entity.fullyQualifiedName,source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)

# if we know the value of the fqn, we can create the tag label from string value directly
res_tag_label=TagLabel(tagFQN="SecurityLevel.Restricted",source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)
off_tag_label=TagLabel(tagFQN="SecurityLevel.Official",source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)
un_tag_label=TagLabel(tagFQN="SecurityLevel.Unclassified",source=TagSource.Classification,  state=State.Suggested, labelType=LabelType.Automated,)

om_conn.patch_tags(
    entity=Table,
    source=hospitals_in_communes_table_entity,
    tag_labels=[s_tag_label],
    operation=PatchOperation.ADD
)

Table(id=Uuid(root=UUID('855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')), name=EntityName(root='hospitals_in_communes'), displayName=None, fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes'), description=Markdown(root='This table contains the number of hospitals in each communes'), version=EntityVersion(root=0.2), updatedAt=Timestamp(root=1757590379233), updatedBy='ingestion-bot', href=Href(root=AnyUrl('http://localhost:8585/v1/tables/855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')), tableType=None, columns=[Column(name=ColumnName(root='name'), displayName=None, dataType=<DataType.STRING: 'STRING'>, arrayDataType=None, dataLength=26, precision=None, scale=None, dataTypeDisplay='string(26)', description=Markdown(root='name of the commune and all letters are in lower case'), fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes.name'), tags=[], constraint=None, ord
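
The printed entity is long, so checking by eye whether the tag was attached is error-prone. The sketch below reads the tag FQNs from an entity instead. It assumes that the returned entity exposes a `tags` list whose items have a `tagFQN` attribute, either as a plain string or as a wrapper with a `.root` field like the other values in the output above; `unwrap`, `tag_fqns` and `has_tag` are hypothetical helpers, and the `SimpleNamespace` objects only stand in for a real entity.

```python
from types import SimpleNamespace


def unwrap(value):
    """Return value.root when the value is a wrapper object, else the value itself."""
    return getattr(value, "root", value)


def tag_fqns(entity):
    """Collect the tag FQNs attached at entity level as plain strings."""
    return [str(unwrap(label.tagFQN)) for label in (entity.tags or [])]


def has_tag(entity, tag_fqn):
    """Tell whether the entity carries the given tag FQN."""
    return tag_fqn in tag_fqns(entity)


# Stand-in for the entity returned by the patch call (assumed shape).
fake_table = SimpleNamespace(
    tags=[SimpleNamespace(tagFQN=SimpleNamespace(root="SecurityLevel.Secret"))]
)

print(has_tag(fake_table, "SecurityLevel.Secret"))
print(has_tag(fake_table, "SecurityLevel.TopSecret"))
```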

## 2. Glossary and terms

A glossary (nomenclature) and its terms help us unify the definition of vocabulary across the organization and make sure that `everyone agrees on what a concept means`.

As with tags, we need two steps:
1. Define and insert the glossary and its terms into OM
2. Associate a term with an entity (e.g. table, column)


### 2.1 Define and insert the glossary

Suppose we have two terms:
- geometry_type: The description of possible type values in a geometry column
- code_insee: The administrative code of a French commune issued by INSEE

We also want to give all the possible values, such as:

| geo_type   | description                                                     |
|:-----------|:----------------------------------------------------------------|
| point      | A pair of (latitude, longitude) which represent the geolocation |
| line       | A list of points which represent a line                         |
| polygon    | A list of points which represent a polygon                      |

|   insee_code | commune_name   |
|-------------:|:---------------|
|            1 | toto           |
|            2 | titi           |
|            3 | tata           |
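
The value tables above are plain Markdown, so the text of a term description can be generated from the data rather than typed by hand. A minimal sketch, assuming the generated text is then pasted into the term description; `markdown_table` is a hypothetical helper, not an OpenMetadata function.

```python
def markdown_table(headers, rows):
    """Render headers and rows as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Values copied from the code_insee table above.
insee_values = [(1, "toto"), (2, "titi"), (3, "tata")]
description = (
    "The administrative code of a French commune issued by INSEE\n\n"
    + markdown_table(["insee_code", "commune_name"], insee_values)
)
print(description)
```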



As of September 2025, **the Python SDK does not support nomenclature ingestion**, and we have not found an API for it either. The official documentation for importing glossaries and terms is [here](https://docs.open-metadata.org/latest/how-to-guides/data-governance/glossary/import). We will therefore `load the glossary terms via the OM GUI`. The goal is:

1. Create a glossary called `constances_geo_terms`
2. Create the two terms `geometry_type` and `code_insee` in glossary `constances_geo_terms`

> Because the Python SDK does not support this functionality, we need to use the web GUI.

## 3. Link terms to target entity

The terms in the glossary have been created; now we need to associate them with the matching columns.

> If you have already moved the database to the domain constances, you also need to move the glossary to the domain constances. Otherwise, you cannot associate the terms with the target columns.

### 3.1 Link terms to a table

In the example below, we link the term `geometry_type` to the table `hospitals_in_communes`.

In [22]:
glossary_name = "constances_geo_terms"
geo_term_name = "geometry_type"
insee_term_name = "code_insee"

In [23]:
from typing import Optional

from metadata.generated.schema.entity.data.glossaryTerm import GlossaryTerm


def get_glossary_term_by_fqn(om_conn, term_fqn: str) -> Optional[GlossaryTerm]:
    """
    Return the glossary term entity that matches the given fully qualified name (FQN).
    Any exception raised during the lookup is reported and re-raised.
    :param om_conn: an OpenMetadata client
    :param term_fqn: fully qualified name of the term, i.e. <glossary>.<term>
    :return: the matching term entity, or None if the client returns no entity
    """
    try:
        term_entity = om_conn.get_by_name(entity=GlossaryTerm, fqn=term_fqn)
    except Exception as e:
        print(f"Cannot find the target term, or the term is not valid: {e}")
        raise
    return term_entity

In [24]:

insee_term_entity= get_glossary_term_by_fqn(om_conn,f"{glossary_name}.{insee_term_name}")
print(insee_term_entity)

id=Uuid(root=UUID('67da81ea-7942-4610-a895-d21ea679f04d')) name=EntityName(root='code_insee') displayName='code_insee' description=Markdown(root='<p>The administrative code of a French commune issued by INSEE</p><p>| insee_code | commune_name |</p><p>|-------------:|:---------------|</p><p>| 1 | toto |</p><p>| 2 | titi |</p><p>| 3 | tata |</p><p></p><p></p>') style=Style(color=None, iconURL=None) fullyQualifiedName=FullyQualifiedEntityName(root='constances_geo_terms.code_insee') synonyms=[] glossary=EntityReference(id=Uuid(root=UUID('354c661a-bf2d-4f19-9fe6-2436f62738ee')), type='glossary', name='constances_geo_terms', fullyQualifiedName='constances_geo_terms', description=Markdown(root='<p>This glossary is for constances geospatial dataset</p>'), displayName='constances_geo_terms', deleted=False, inherited=None, href=Href(root=AnyUrl('http://localhost:8585/v1/glossaries/354c661a-bf2d-4f19-9fe6-2436f62738ee'))) parent=None children=None relatedTerms=None references=[] version=EntityVer
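
Since the terms are created by hand in the GUI, it can help to check that every expected term exists before linking any of them. The sketch below assumes a lookup that returns `None` for an unknown FQN; `missing_terms` and `fake_terms` are hypothetical and only stand in for the real client.

```python
def missing_terms(lookup, glossary, term_names):
    """Return the term FQNs for which lookup(fqn) yields None."""
    fqns = [f"{glossary}.{name}" for name in term_names]
    return [fqn for fqn in fqns if lookup(fqn) is None]


# Stand-in lookup: only code_insee has been created so far.
fake_terms = {"constances_geo_terms.code_insee": object()}

missing = missing_terms(
    fake_terms.get,
    "constances_geo_terms",
    ["geometry_type", "code_insee"],
)
print(missing)
```

With the real client, `lookup` could be `lambda fqn: get_glossary_term_by_fqn(om_conn, fqn)`.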

In [25]:
# get the table entity so that we can link the term to it afterwards
hic_table_name = "hospitals_in_communes"
hospitals_in_communes_fqn = f"{DB_SERVICE_NAME}.{DB_NAME}.{SCHEMA_NAME}.{hic_table_name}"
hospitals_in_communes_table_entity = get_table_entity_by_name(om_conn, hospitals_in_communes_fqn)
print(hospitals_in_communes_table_entity)

id=Uuid(root=UUID('855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')) name=EntityName(root='hospitals_in_communes') displayName=None fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes') description=Markdown(root='This table contains the number of hospitals in each communes') version=EntityVersion(root=0.2) updatedAt=Timestamp(root=1757590379233) updatedBy='ingestion-bot' href=Href(root=AnyUrl('http://localhost:8585/v1/tables/855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')) tableType=None columns=[Column(name=ColumnName(root='name'), displayName=None, dataType=<DataType.STRING: 'STRING'>, arrayDataType=None, dataLength=26, precision=None, scale=None, dataTypeDisplay='string(26)', description=Markdown(root='name of the commune and all letters are in lower case'), fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes.name'), tags=[], constraint=None, ordinalPosition=Non

In [27]:
geo_term_label=TagLabel(tagFQN=f"{glossary_name}.{geo_term_name}",source=TagSource.Glossary,  state=State.Confirmed, labelType=LabelType.Manual,)
om_conn.patch_tags(
    entity=Table,
    source=hospitals_in_communes_table_entity,
    tag_labels=[geo_term_label],
    operation=PatchOperation.ADD
)

Table(id=Uuid(root=UUID('855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')), name=EntityName(root='hospitals_in_communes'), displayName=None, fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes'), description=Markdown(root='This table contains the number of hospitals in each communes'), version=EntityVersion(root=0.3), updatedAt=Timestamp(root=1757592307286), updatedBy='ingestion-bot', href=Href(root=AnyUrl('http://localhost:8585/v1/tables/855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')), tableType=None, columns=[Column(name=ColumnName(root='name'), displayName=None, dataType=<DataType.STRING: 'STRING'>, arrayDataType=None, dataLength=26, precision=None, scale=None, dataTypeDisplay='string(26)', description=Markdown(root='name of the commune and all letters are in lower case'), fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes.name'), tags=[], constraint=None, ord

### 3.2 Link tags and terms to a column

Now we want to link the following to the column `geometry` of the table `hospitals_in_communes`:
- term `geometry_type`
- tag `TopSecret`

In [33]:
geo_column_name = "geometry"
geo_column_fqn = f"{hospitals_in_communes_fqn}.{geo_column_name}"
# create a column tag for column geometry and term geometry_type
geo_col_geo_term=ColumnTag(column_fqn=geo_column_fqn,tag_label=geo_term_label)
# create a column tag for column geometry and tag TopSecret
geo_col_ts_tag=ColumnTag(column_fqn=geo_column_fqn,tag_label=ts_tag_label)

om_conn.patch_column_tags(table=hospitals_in_communes_table_entity,
                          column_tags=[geo_col_geo_term,geo_col_ts_tag],
                          operation=PatchOperation.ADD,)

Table(id=Uuid(root=UUID('855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')), name=EntityName(root='hospitals_in_communes'), displayName=None, fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes'), description=Markdown(root='This table contains the number of hospitals in each communes'), version=EntityVersion(root=0.4), updatedAt=Timestamp(root=1757596151985), updatedBy='ingestion-bot', href=Href(root=AnyUrl('http://localhost:8585/v1/tables/855c4ae4-20e8-4ce6-96f9-94a36f03e2cf')), tableType=None, columns=[Column(name=ColumnName(root='name'), displayName=None, dataType=<DataType.STRING: 'STRING'>, arrayDataType=None, dataLength=26, precision=None, scale=None, dataTypeDisplay='string(26)', description=Markdown(root='name of the commune and all letters are in lower case'), fullyQualifiedName=FullyQualifiedEntityName(root='Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes.name'), tags=[], constraint=None, ord
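
When several columns need labels, the pairs of column FQN and label can be planned from a mapping before any request is sent. A minimal sketch using plain strings in place of the SDK objects; `plan_column_tags` is a hypothetical helper, and the label strings are the tag and term FQNs used earlier in this notebook.

```python
def plan_column_tags(table_fqn, column_to_labels):
    """Expand {column: [label, ...]} into a flat list of (column_fqn, label) pairs."""
    pairs = []
    for column, labels in column_to_labels.items():
        for label in labels:
            pairs.append((f"{table_fqn}.{column}", label))
    return pairs


table_fqn = "Constances-Geography.hospitals_in_france.Geography.hospitals_in_communes"
plan = plan_column_tags(
    table_fqn,
    {"geometry": ["constances_geo_terms.geometry_type", "SecurityLevel.TopSecret"]},
)
for column_fqn, label in plan:
    print(column_fqn, "<-", label)
```

Each pair corresponds to one `ColumnTag(column_fqn=..., tag_label=...)`, once the label string has been replaced by the matching `TagLabel` object.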