#  Standardising heterogeneous tags

by ***Dumindu Madithiyagasthenna***



The algorithm below is an instance of programming by example: it uses syntactic clustering to infer and annotate sensor tags from Building Management Systems (BMSs) semi-automatically. 
The aim is to standardise the heterogeneous tags coming from different BMSs.


To get started, make sure the files from the repository are in the same folder as the Jupyter notebook.


In [5]:
!pip install -q distance
import csv
import itertools
import json
import re
from pprint import pprint
import os
import distance
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

These are the functions to read the files.

**To see the full files please click on the *files* tab on the left**

The ``categorised_tags`` file stores the standard (convention) of tags together with their categories. (This standard was defined by Mirek and is a small subset based on *Project Haystack*; for our purposes we can use the vanilla Haystack convention.)


![sorted_sm.json snippet ](https://i.imgur.com/oPKVusJ.png)


The other file that is read stores the list of tags that need to be annotated.
Below is a snippet from that file:
![chw.json](https://imgur.com/bLyXJdd.png)

## Functions

### Feature extraction function
Several steps are taken to convert the sensor names into feature vectors (this is a version adapted from the previously mentioned paper):
1.	Strip the prefix common to the whole set; this makes it easier to distinguish between sensors.
2.	Replace alphabetic characters with 1, numeric characters with 2, spaces with 3, and all other characters with 4.
3.	Replace the combinations 21 and 12 (i.e. adjacent alphabetic and numeric characters) with 2. Location and equipment references usually combine letters and digits (e.g. 2A, 3C); this gives them the same weight as a plain number.
4.	Merge the repeating features.
5.	Pad the vectors with 0s (on the right) to normalise their dimensions.
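The steps above can be traced on a few tags with a minimal sketch. The helper name `to_feature` and the third tag (`FC 2_Status`) are made up for illustration and are not part of the notebook; the other two tags come from the cluster output further down.

```python
import itertools
import os


def to_feature(tag, prefix=""):
    """Encode one tag as a feature string (steps 1 to 4)."""
    # step 1: drop the common prefix (hypothetical helper, illustration only)
    body = tag[len(prefix):] if tag.startswith(prefix) else tag

    # step 2: map every character to a class code
    codes = []
    for ch in body:
        if ch.isalpha():
            codes.append("1")
        elif ch.isdigit():
            codes.append("2")
        elif ch.isspace():
            codes.append("3")
        else:
            codes.append("4")

    # step 3: adjacent digit/letter pairs count as a plain number
    joined = "".join(codes).replace("21", "2").replace("12", "2")

    # step 4: merge runs of the same code
    return "".join(ch for ch, _ in itertools.groupby(joined))


tags = ["FC 1_Exhaust Air Temp", "FC 5A_Return Air Damper", "FC 2_Status"]
prefix = os.path.commonprefix(tags)
features = [to_feature(t, prefix) for t in tags]

# step 5: right-pad with "0" so all vectors have the same length
width = max(len(f) for f in features)
padded = [f.ljust(width, "0") for f in features]
print(padded)
```

The first two tags end up with the same feature string even though one reference is `1` and the other is `5A`, which is the purpose of step 3.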

In [6]:
# transforms the raw sensor into feature vectors
# gets rid of the common prefix (if any) among the whole dataset
# Replace characters with a single number:
# Alphabets with 1
# Digits with 2
# Spaces with 3
# Non-alphanumerics with 4
# Merge repeating numbers
# Merge adjacent alphabet and numeric characters into 1 feature
# Pad these transformed vectors with 0 to normalise the dimensions
def transform_tags_to_features():
    transformed_list = []  # holds all the tags in their transformed state

    # prefix that is common across all the sensor names
    # getting rid of it leads to better clustering
    # (long common prefixes lead to larger similarities, so everything ends up clustered together)
    common_substring = os.path.commonprefix(raw_sensor_list)

    for tag in raw_sensor_list:
        tag_list = []  # holds the values of one tag

        # remove only the leading common prefix
        # (str.replace would also delete later occurrences of the same substring)
        vector = tag[len(common_substring):]
        
        for character in vector:
            if character.isalpha():
                tag_list.append('1')
            elif character.isdigit():
                tag_list.append('2')
            elif character.isspace():
                tag_list.append('3')
            else:
                tag_list.append('4')
        transformed_list.append(tag_list)  # adds it to the main list

    feature_vector_list = []
    for transformedTag in transformed_list:
        # joins the list values into single string
        joined_tag = ''.join(transformedTag)

        # combines combinations of adjacent alphabetic and numeric characters into 1 feature
        joined_tag = joined_tag.replace("21", "2")
        joined_tag = joined_tag.replace("12", "2")

        # merges the repeating characters of the string
        feature_vector = ''.join(ch for ch, _ in itertools.groupby(joined_tag))

        feature_vector_list.append(feature_vector)

    # pads the vectors with 0
    # to make them all have the same dimensions
    norm_vector_list = []
    max_len = np.max([len(a) for a in feature_vector_list])
    for fv in feature_vector_list:
        # norm_vector_list.append(fv.rjust(max_len, '0')) #left pad
        norm_vector_list.append(fv.ljust(max_len, '0'))  # right pad
        # norm_vector_list.append(fv) #no padding
        
    # convert to numpy array
    return np.asarray(norm_vector_list)

print("Done")

Done


### Clustering function
The feature vectors are then clustered by agglomerative (bottom-up) hierarchical clustering. The Jaccard metric is used to calculate distances between feature vectors, linkage uses the UPGMA (average) algorithm, and flat clusters are formed with a distance threshold of 0.
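With a distance threshold of 0, a flat cluster should only contain vectors whose pairwise distance is exactly 0, which for this encoding means identical feature strings. Under that assumption the effect can be illustrated without SciPy by grouping equal strings. `group_identical` is a hypothetical helper for illustration only; it is not the notebook's implementation and does not reproduce `fclusterdata` for non-zero thresholds.

```python
from collections import defaultdict


def group_identical(features):
    """Group indices of identical feature strings (illustration only)."""
    groups = defaultdict(list)
    for index, feature in enumerate(features):
        groups[feature].append(index)
    # dicts keep insertion order, so clusters appear in order of first occurrence
    return list(groups.values())


features = ["2413131", "2413131", "2410000", "2413131"]
clusters = group_identical(features)
print(clusters)
```

Each inner list holds indices into the original tag list, the same shape that the `cluster` function below returns.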

In [7]:
# clusters the feature vectors
# using the given params and
# hierarchical clustering
def cluster(type, threshold, c, d, m):
    fclust = fclusterdata(numpy_feature_vectors.reshape(-1, 1), threshold, criterion=c, metric=type, depth=d, method=m)

    cluster = [[] for j in range(max(fclust))]

    for index, f in enumerate(fclust):
        cluster[f - 1].append(index)
    return cluster


# returns the clusters with the original tag name in json format
def json_cluster_w_labels(cluster_list):
    json_dict = {}  # json object which holds all info
    for idx, clus in enumerate(cluster_list):
        cluster_dict = {"example": raw_sensor_list[clus[0]]}  # object which holds all cluster specific info

        # example presented to the user
        # the first tag from the list

        cluster_tag_list = []  # holds the list of tags in the current cluster

        for idx2, itm in enumerate(clus):
            cluster_tag_list.append(raw_sensor_list[itm])

        cluster_dict["cluster_list"] = cluster_tag_list
        json_dict[idx] = cluster_dict
    return json_dict


print("Done")

Done


### Get Answers to the examples

At this point the user provides answers to the examples shown above (one example from each cluster).
The expert uses the categories and the standard shown above to provide the answers.

A form for entering the answer for one example is shown below (from the UI designed for Mirek).

![UI for entering answers](https://imgur.com/cN1iozK.png)
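As a rough sketch of what happens to an answer once it is entered: every non-reference field and its standard value can be kept as a small lookup table and reused for other tags. The data below is the first answer printed later in this notebook; `known_translations` is a hypothetical name used only for this illustration.

```python
answers = {
    "0": {
        "example": "FC 1_Exhaust Air Temp",
        "answer": {
            "fields": ["FC", "1", "Exhaust", "Air", "Temp"],
            "values": ["fan cooler", "equipRef", "exhaust", "air", "temp"],
            "types": ["equip", "ref", "comp", "attr", "point"],
        },
    }
}


def known_translations(answers):
    """Map each non-reference field to the value the expert gave it."""
    lookup = {}
    for entry in answers.values():
        answer = entry["answer"]
        for field, value, kind in zip(answer["fields"], answer["values"], answer["types"]):
            # references (e.g. equipment numbers) differ per tag, so they are skipped
            if kind != "ref":
                lookup[field] = value
    return lookup


lookup = known_translations(answers)
print(lookup)
```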

In [8]:

variable_name = ""

# gets the answer values as JSON
# TODO: get these directly from the site
def get_answers():
    with open('answers.json') as f:
        answers = json.load(f)  # avoid shadowing the built-in dict
    return answers

# get a temporary dictionary from the already provided answers to the examples
def get_dict_from_answers():
    temp_dict = {}
    for i1 in range(len(answers_dict)):
        for i2 in range(len(answers_dict[str(i1)]["answer"]["fields"])):
            if (answers_dict[str(i1)]["answer"]["types"][i2]!="ref"):
                temp_dict[answers_dict[str(i1)]["answer"]["fields"][i2]]=answers_dict[str(i1)]["answer"]["values"][i2]
    return temp_dict
  
print("Done")

Done


### String Splitting Function

Based on how the user broke up the example provided, the rest of the tags in the cluster are split in the same way and assigned the same types.
I have used a simple approach based on a regex and delimiters to split them, but a much better approach is the one by Sumit Gulwani from Microsoft (it ships with Excel and has also been published):
- https://www.microsoft.com/en-us/research/wp-content/uploads/2016/12/pbe16.pdf
- https://www.microsoft.com/en-us/research/wp-content/uploads/2016/12/popl11-synthesis.pdf
- https://microsoft.github.io/prose/
- A demo can be viewed [here](https://prose-splittext.azurewebsites.net/) and [here](https://www.microsoft.com/en-us/research/publication/automating-string-processing-spreadsheets-using-input-output-examples/?from=https%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fpeople%2Fsumitg%2Fflashfill.html)

The splitting would be much more effective if this method were used.
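The simple delimiter-based approach can be sketched as follows: collect the delimiter characters found between the sub-tags of the example, split on them, record the positions of the sub-tags, and reuse those positions on the other tags of the cluster. `build_delim_regex` is a hypothetical helper, and the delimiters are passed in directly here rather than derived from the answer.

```python
import re


def build_delim_regex(delims):
    """Build a character class matching any of the delimiter characters."""
    chars = sorted(set("".join(delims)))
    return "[" + "".join(re.escape(c) for c in chars) + "]"


example = "FC 1_Exhaust Air Temp"
fields = ["FC", "1", "Exhaust", "Air", "Temp"]

# delimiters between consecutive sub-tags of the example
delim_regex = build_delim_regex([" ", "_", " ", " "])

parts = re.split(delim_regex, example)
# position of each sub-tag in the split example
# (list.index is enough here because no sub-tag repeats in this example)
positions = [parts.index(f) for f in fields]

# apply the same positions to another tag from the same cluster
other_parts = re.split(delim_regex, "FC 5A_Return Air Damper")
picked = [other_parts[p] for p in positions]
print(picked)
```

When a sub-tag occurs more than once in the example, a plain `index` lookup is not enough, which is why the notebook function scans the positions in order.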

In [9]:
def string_spliting():
    split_sub_tags_array = {}
    for i1 in range(len(answers_dict)):
        delim_array = []  # holds the delimiters
        # for 0 to the number of delimiters
        # delimiters = (no. of sub-tags) - 1
        for i2 in range(len(answers_dict[str(i1)]["answer"]["fields"]) - 1):
            # position of the left sub tag
            pos1 = answers_dict[str(i1)]["example"].find(answers_dict[str(i1)]["answer"]["fields"][i2])
            # position of the right sub-tag
            pos2 = answers_dict[str(i1)]["example"].find(answers_dict[str(i1)]["answer"]["fields"][i2 + 1])
            # the string in between the left and the right sub tag
            # i.e. the delimiter
            temp_delim = (answers_dict[str(i1)]["example"][(pos1 + len(answers_dict[str(i1)]["answer"]["fields"][i2])):pos2])
            delim_array.append(temp_delim)

        stripped_array = []  # holds the delimiters stripped of alphanumeric characters
        for d in delim_array:
            d = ''.join([i for i in d if not i.isdigit()])  # numeric
            d = ''.join([i for i in d if not i.isalpha()])  # alpha
            stripped_array.append(d)

        # turn this array into 1 string
        delims_string = ''.join(stripped_array)
        # remove duplicated delimiter characters (sorted for a stable order)
        unique = sorted(set(delims_string))

        # holds the final delimiter regex: a character class of the escaped delimiters
        # (joining with "|" inside [...] would make "|" itself a delimiter)
        # NOTE: assumes at least one delimiter character was found; "[]" is not a valid regex
        delim_regex = "[" + ''.join(re.escape(x) for x in unique) + "]"

        #split the example using the regex
        split_example = re.split(delim_regex, str(answers_dict[str(i1)]["example"]))

        #holds the positions of all the indexes
        position_list = []

        #get indexes of the sub-tags needed
        #these indexes now can be applied to obtain every sub-tag in each tag in the whole cluster
        for a in answers_dict[str(i1)]["answer"]["fields"]:
            #all occurrences of the tag
            indices = [i for i, x in enumerate(split_example) if x == a]
            for idx in indices:
                #gets the actual position
                #Example:
                #for FC_1_CHW_1, if the user wants the second "1"
                #this makes sure that the index obtained is 3 and not 1
                # **** VITAL THAT THE USER ENTERS THE SUB-TAGS IN ORDER *******
                if len(position_list) == 0 or position_list[-1] < idx:
                    position_list.append(idx)
                    break

        # splits the sub_tags using the delim regex
        for t in clustered_tag_labels[i1]["cluster_list"]:
            split_sub_tags = re.split(delim_regex, str(t))
            
            # holds 2D array of all information of sub tags
            sub_tag_info_array = []
            #for each index
            for p_idx, p in enumerate(position_list):
                temp_array = [split_sub_tags[p],
                #TODO
                str(answers_dict[str(i1)]["answer"]["values"][p_idx]), 
                str(answers_dict[str(i1)]["answer"]["types"][p_idx])]  # holds the field, value, type of each sub-tag
                sub_tag_info_array.append(temp_array)
            split_sub_tags_array[t] = sub_tag_info_array
    return split_sub_tags_array

print("Done")

Done


### String Matching Function
String matching only needs to be done for the sub-tags that have not been given the type “ref”.

All these unique sub-tags are run through a standard Levenshtein string-distance comparison and matched against a comprehensive list of standardised tags (e.g. Haystack). 

The unique sub-tags and their matches are sent to the user for verification. Since most sensor names in a data set are combinations of the same set of words, this is the only verification step needed.
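To make the matching step concrete, here is a small self-contained sketch of Levenshtein-based matching. It implements the edit distance directly instead of using the `distance` package, and the `standard` list is a made-up stand-in for the categorised tags file; both are assumptions for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_match(sub_tag, candidates):
    """Return the candidate with the smallest edit distance to sub_tag."""
    return min(candidates, key=lambda c: levenshtein(sub_tag, c))


# hypothetical subset of a standard tag list
standard = ["damper", "flow", "leaving", "outside", "temp"]

print(closest_match("Damper", standard))
print(closest_match("Flw", standard))
```

The closest candidate is only a suggestion. Abbreviations such as `HHW` have no near neighbour by edit distance, which is why the expert verification step is still needed.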

In [10]:
# compares distance using different methods
# returns closest match
def distance_compare(sub_tag, tag_list, method):
    # no closest match or shortest distance found, yet
    closest = None
    shortest = -1

    # loop through words to find the closest
    categorised = ["equip","comp","attr","point"]
    for cat in categorised:
        for t in tag_list[cat]:
            if method == "levenshtein":
                dist = distance.levenshtein(sub_tag, t)
            elif method == "jaccard":
                dist = distance.jaccard(sub_tag, t)
            elif method == "nlevenshtein":
                dist = distance.nlevenshtein(sub_tag, t)
            elif method == "sorensen":
                dist = distance.sorensen(sub_tag, t)
            elif method == "damerau_levenshtein":
                # NOTE: damerau_levenshtein_distance is not defined or imported in this notebook;
                # it has to be provided separately before this method can be used
                dist = damerau_levenshtein_distance(sub_tag, t)
            else:
                raise ValueError("unknown distance method: " + method)

            # exact match: no need to look any further
            if dist == 0:
                return t

            # if this distance is no greater than the shortest found so far,
            # OR if a shortest distance has not yet been found
            if dist <= shortest or shortest < 0:
                # set the closest match, and shortest distance
                closest = t
                shortest = dist
    return closest
print("Done")

Done


### Other helper functions

In [11]:
def get_categorised_tags():
    with open('categorised_standard.json') as f:
        categorised = json.load(f)
    return categorised

# gets the sensor data and puts it into a list
def get_sensor_data():
    with open('sensor_list.json') as f:
        sensor_data = json.load(f)
    return sensor_data["sensor_list"]
  
def get_fields_to_match(split_tags_info_dict):
    # for each cluster
    fields = []
    for key_tag in split_tags_info_dict:
        for sub_tag in split_tags_info_dict[key_tag]:
            if (sub_tag[2] != "ref") and (sub_tag[0] not in fields):
                fields.append(sub_tag[0])
    return fields

# returns the tags that are not in the temp_dict
# these still need to be string matched
def match_against_temp_dict(temp_dict_from_answers, all_to_be_matched):
    not_in_temp_array = []
    for st in all_to_be_matched:
        if st not in temp_dict_from_answers:
            not_in_temp_array.append(st)
    return not_in_temp_array
  
# gets matches
# and returns a dictionary (JSON)
def get_matches(categorised_tags, fields_array):
    matches = {}
    for sub_tag in fields_array:
        #TODO instead of haystack list use the specific type of list
        # check which type it is and call the specific type

        matches[sub_tag] = distance_compare(sub_tag, categorised_tags, "levenshtein")
    return matches


# gets the verified sub-tag answers from the user
# reads them in from a JSON file
# TODO: connect to site
def get_string_match_answers(filename):
    with open(filename + '.json') as f:
        verified = json.load(f)
    return verified
  
#does the final tagging/annotation of the tags
#and returns a python dict/JSON object
def tagging(split_tags_info_dict, complete_matched_dict):
    tagged_output = {} #holds all the meta_data
    for key_tag in split_tags_info_dict:
        #holds the information of each tag
        sub_tag_dict = {}
        for sub_tag_info_array in split_tags_info_dict[key_tag]:
            #if the type is "ref" use the exact tag given by the user for that position
            #if multiple ref with the same values, put them in an array
            if sub_tag_info_array[2] == "ref":
                if sub_tag_info_array[0] not in sub_tag_dict:
                    sub_tag_dict[sub_tag_info_array[0]] = []                    
                sub_tag_dict[sub_tag_info_array[0]].append(sub_tag_info_array[1]) 
            else:                
                # sub_tag_dict[sub_tag_info_array[0]] = "error"
                sub_tag_dict[sub_tag_info_array[0]] = complete_matched_dict[sub_tag_info_array[0]]
        tagged_output[key_tag] = sub_tag_dict
    return tagged_output
  
print("Done")

Done




## Running the Algorithm

### Run Feature extraction

In [12]:
# dictionary of categorised tags
categorised_tags = get_categorised_tags()

# contains all the raw sensor tags from the BMS
raw_sensor_list = get_sensor_data()

# corresponding feature vector list
numpy_feature_vectors = transform_tags_to_features()

print("Feature vector created")

Feature vector created


### Run Clustering

In [13]:
# params: type, threshold, criterion, depth, method
# type: "braycurtis", "canberra", "chebyshev", "cityblock", "cosine", "dice", "euclidean", "hamming"
#       "jaccard", "kulsinski", "mahalanobis", "matching", "minkowski", "rogerstanimoto", "russellrao"
#       "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean"
# criterion: "inconsistent", "distance"
# depth: only needed for inconsistent criterion (1-3)
# method: "single", "complete", "average", "weighted", "ward"
cluster_list = cluster('jaccard', 0, 'distance', 3, 'average')

# holds the json object with all the cluster data
# {
#     cluster number : {
#         example : ""
#         cluster list : []
#     },
#     {},..
# }
clustered_tag_labels = json_cluster_w_labels(cluster_list)
print(json.dumps(clustered_tag_labels, indent=4, separators=(", ", ": ")))

{
    "0": {
        "example": "FC 1_Exhaust Air Temp", 
        "cluster_list": [
            "FC 1_Exhaust Air Temp", 
            "FC 1_Exhaust Air Volume", 
            "FC 1_Outside Air Temp", 
            "FC 1_Return Air Temp", 
            "FC 1_Supply Air Temp", 
            "FC 1_Supply Air Volume", 
            "FC 2_Exhaust Air Temp", 
            "FC 2_Exhaust Air Volume", 
            "FC 2_HHW Leaving Temp", 
            "FC 2_Return Air Temp", 
            "FC 2_Supply Air Temp", 
            "FC 2_Supply Air Volume", 
            "FC 4_CHW Leaving Temp", 
            "FC 4_HHW Leaving Temp", 
            "FC 4_Mixed Air Temp", 
            "FC 4_Supply Air Temp", 
            "FC 5A_CHW Leaving Temp", 
            "FC 5A_HHW Leaving Temp", 
            "FC 5A_Return Air Damper", 
            "FC 5A_Return Air Volume", 
            "FC 5A_Supply Air Temp", 
            "FC 5B_Outside Air Temp", 
            "FC 5B_Return Air Temp", 
            "FC 5B_Supply Air Temp",

### Get answers from expert
*(see above for how the user can enter the answers using the categories standard)*

In [14]:
# get answers as JSON object
# {
#     cluster number : {
#         example : ""
#         answer : {
#             fields : ["f1","f2",..],
#             values : ["v1","v2",..],
#             types : ["t1","t2",..]
#         }
#     },
#     {},..
# }
answers_dict = get_answers()
print(json.dumps(answers_dict, indent=4, separators=(", ", ": ")))

{
    "0": {
        "example": "FC 1_Exhaust Air Temp", 
        "answer": {
            "fields": [
                "FC", 
                "1", 
                "Exhaust", 
                "Air", 
                "Temp"
            ], 
            "values": [
                "fan cooler", 
                "equipRef", 
                "exhaust", 
                "air", 
                "temp"
            ], 
            "types": [
                "equip", 
                "ref", 
                "comp", 
                "attr", 
                "point"
            ]
        }
    }, 
    "1": {
        "example": "FC 4_Supply Air Fan Volume", 
        "answer": {
            "fields": [
                "FC", 
                "4", 
                "Supply", 
                "Air", 
                "Volume"
            ], 
            "values": [
                "fan cooler", 
                "equipRef", 
                "supply", 
                "air", 
                "volume"
      

### String match and verify with the expert

In [15]:
# holds all information about tags and their splits in a dictionary
# {
#     full tag name : [
#         [sub-tag-1-field, sub-tag-1-value, sub-tag-1-type],
#         [sub-tag-2-field, sub-tag-2-value, sub-tag-2-type],
#         [],...
#     ],
#     [],...
# }
split_tags_info_dict = string_spliting()
# pprint(split_tags_info_dict)

#get temporary dictionary from the answers
temp_dict_from_answers = get_dict_from_answers()
# pprint(temp_dict_from_answers)

#match the fields with the temp dictionary and find perfect matches
#rest of the fields with no perfect matches go through the normal get_matches function
#   with them being matched against the tags from their category (type)
#only send these string matched ones for verification


# all the fields to be matched
var_fields = get_fields_to_match(split_tags_info_dict)
#pprint(var_fields)


var_fields_not_matched_w_temp  = match_against_temp_dict(temp_dict_from_answers, var_fields)
# pprint(var_fields_not_matched_w_temp)


fields_with_match_dict = get_matches(categorised_tags, var_fields_not_matched_w_temp)

# print(json.dumps(fields_with_match_dict, indent=4, separators=(", ", ": ")))  # send to screen for verification

# TODO: get directly from user
# TODO only for fields_with_match_dict, the ones that were matched against categorised tags
# not the temp dict
verified_matched_dict = get_string_match_answers("string_match_answers")
print("Verified dictionary of string matched sub-tags")
pprint(verified_matched_dict)

#combine the tags matched from temp and haystack categories

complete_matched_dict = {**verified_matched_dict, **temp_dict_from_answers}
tagged_dict = tagging(split_tags_info_dict, complete_matched_dict)
#json output for printing to screen or writing to db
# print(json.dumps(tagged_dict, indent=4, separators=(", ", ": ")))



Verified dictionary of string matched sub-tags
{'Damper': 'damper',
 'Flow': 'flow',
 'HHW': 'heating hot water',
 'Leaving': 'leaving',
 'Mixed': 'air',
 'Outside': 'outside'}


### Tag the rest of the tags in the cluster

Tag the rest of the tags using the answers provided and the verified string-matched sub-tags, then output the desired result.
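The shape of the output for one tag can be sketched as below. This is a simplified illustration, not the notebook's implementation: `build_document` is a hypothetical helper that takes the category straight from the expert's answer type, whereas the notebook looks the value up in the categorised tags.

```python
import json


def build_document(data_point, sub_tags):
    """Assemble the nested output document for one tag.

    sub_tags is a list of (field, value, type) triples.
    """
    doc = {"data_point": data_point}
    for field, value, kind in sub_tags:
        if kind == "ref":
            # 'equipRef' is stored under its parent category 'equip'
            parent = value.replace("Ref", "")
            doc.setdefault(parent, {})["ref"] = field
        else:
            doc.setdefault(kind, {})["type"] = value
    return doc


sub_tags = [
    ("FC", "fan cooler", "equip"),
    ("1", "equipRef", "ref"),
    ("Exhaust", "exhaust", "comp"),
    ("Air", "air", "attr"),
    ("Temp", "temp", "point"),
]
doc = build_document("FC 1_Exhaust Air Temp", sub_tags)
print(json.dumps(doc, sort_keys=True, indent=4))
```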

In [16]:
# gets categories for the tagged tags
def get_category(t, categorised_tags):
    for k,v in categorised_tags.items():
        if t in v:
            return k
def print_tagged(tagged_dict):   
    for key_tag, value_meta_tags in tagged_dict.items():
        ### PLEASE SEE: see schema for intended output
        # mongo document object
        mongo_doc={}
        mongo_doc["data_point"] = key_tag
        
        # for each sub-tag   
        for key_sub_tag, value_sub_tag_meta in value_meta_tags.items():
            #is a ref
            if isinstance(value_sub_tag_meta, list):
                category = "ref"
            else:
                category = get_category(value_sub_tag_meta, categorised_tags)
            
            # location tags            
            if category == "loc":
                # create the "loc" object on first use to avoid a KeyError
                mongo_doc.setdefault("loc", {})[value_sub_tag_meta] = key_sub_tag
                continue         
            
            object_key = "type" # default key
            
            if category == "ref": # if it is a reference
                for ref_tag in value_sub_tag_meta:                    
                    # lets the category equal its parent
                    # eg: 'equipRef' becomes 'equip'
                    category_ref = ref_tag.replace('Ref', '')
                    object_key = "ref" # reference key
                    if category_ref not in mongo_doc:
                        mongo_doc[category_ref] = {}
                    mongo_doc[category_ref][object_key] = key_sub_tag
                continue
            
            if category not in mongo_doc:
                    mongo_doc[category] = {}
                    
            mongo_doc[category][object_key] = value_sub_tag_meta
            
        valuestr = json.dumps(mongo_doc, separators=(',', ':'), sort_keys=True, indent=4)
        value = json.loads(valuestr)
        # result=db_conn.mapped.insert(value)
        print(valuestr)
        
print_tagged(tagged_dict)

{
    "attr":{
        "type":"air"
    },
    "comp":{
        "type":"exhaust"
    },
    "data_point":"FC 1_Exhaust Air Temp",
    "equip":{
        "ref":"1",
        "type":"fan cooler"
    },
    "point":{
        "type":"temp"
    }
}
{
    "attr":{
        "type":"air"
    },
    "comp":{
        "type":"exhaust"
    },
    "data_point":"FC 1_Exhaust Air Volume",
    "equip":{
        "ref":"1",
        "type":"fan cooler"
    },
    "point":{
        "type":"volume"
    }
}
{
    "attr":{
        "type":"air"
    },
    "comp":{
        "type":"outside"
    },
    "data_point":"FC 1_Outside Air Temp",
    "equip":{
        "ref":"1",
        "type":"fan cooler"
    },
    "point":{
        "type":"temp"
    }
}
{
    "attr":{
        "type":"air"
    },
    "comp":{
        "type":"return"
    },
    "data_point":"FC 1_Return Air Temp",
    "equip":{
        "ref":"1",
        "type":"fan cooler"
    },
    "point":{
        "type":"temp"
    }
}
{
    "attr":{
        "type":