# Automating Parser Generation and Easier Serialization Efforts
In this demo we cluster and parse a list of building point labels from a CSV file, then package the results into a class for easier and more efficient serialization.


In [1]:
from buildingmotif.label_parsing.SerializedParserMetrics import SerializedParserMetrics

## Overview
1. **Setup SerializedParserMetrics Class**
    1. Import BuildingMOTIF and associated packages
    1. Load BuildingMOTIF Libraries
    1. Initialize the SerializedParserMetrics class with the required arguments `csv_filename` and `csv_column_name` (the column where the data is located). Optionally, modify
       `llm_tries` (the number of times the LLM is prompted to predict relevant classes for sections of the point label) and `list_of_dicts`
1. **View associated data of Parser Generation Process and Results**
    1. View generated parsers and their clusters
    1. View distance metrics and clustering calculations applied to each point label
    1. View packaged, serialized format of parsers, clusters, and data

In [2]:
# instantiate SerializedParserMetrics class, args are csv_filename and csv_column_name
serializedParsers = SerializedParserMetrics("examples/basic_len102.csv", "BuildingNames")

# can also accept additional args: llm_tries (defaults to 3) and list_of_dicts (a list of dictionaries mapping abbreviations to Brick classes;
# defaults to the provided COMMON_EQUIP_ABBREVIATIONS and COMMON_POINT_ABBREVIATIONS)

In [3]:
# print the default abbreviation dicts in use

for abbrev_dict in serializedParsers.list_of_dicts:
    print(abbrev_dict)

{'AHU': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Air_Handling_Unit'), 'FCU': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Fan_Coil_Unit'), 'VAV': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Variable_Air_Volume_Box'), 'CRAC': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Computer_Room_Air_Conditioner'), 'HX': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Heat_Exchanger'), 'PMP': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Pump'), 'RVAV': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Variable_Air_Volume_Box_With_Reheat'), 'HP': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Heat_Pump'), 'RTU': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Rooftop_Unit'), 'DMP': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Damper'), 'STS': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Status'), 'VLV': rdflib.term.URIRef('https://brickschema.org/schema/Brick#Valve'), 'CHVLV': r

In [5]:
print("all generated parsers: ")
for parser in serializedParsers.parsers:
    print(parser)
    print("\n")

print("all generated clusters: ")
for cluster in serializedParsers.clusters:
    print(cluster)
    print("\n")

all generated parsers: 
parser_lencluster_7_935 = sequence(until(delimiters, Constant), delimiters, until(COMBINED_ABBREVIATIONS, Identifier), COMBINED_ABBREVIATIONS, until(delimiters, Constant), delimiters, regex(r"[a-zA-Z]+", Constant))


parser_lencluster_10_396 = sequence(until(delimiters, Constant), delimiters, until(COMBINED_ABBREVIATIONS, Identifier), COMBINED_ABBREVIATIONS, until(delimiters, Identifier), delimiters, regex(r"[a-zA-Z0-9]{1,4}", Identifier), delimiters, COMBINED_ABBREVIATIONS, regex(r"[a-zA-Z]+", Constant))


parser_lencluster_12_509 = sequence(until(delimiters, Constant), delimiters, until(COMBINED_ABBREVIATIONS, Identifier), COMBINED_ABBREVIATIONS, until(delimiters, Identifier), delimiters, regex(r"[a-zA-Z]{1,2}", Constant), until(delimiters, Identifier), delimiters, until(delimiters, Constant), delimiters, regex(r"[a-zA-Z]+", Constant))


parser_noise_8_502 = sequence(until(delimiters, Constant), delimiters, until(COMBINED_ABBREVIATIONS, Identifier), COMBINED_A

In [15]:
# print each serialized parser
for serializedParser in serializedParsers.serializers_list:
    print(serializedParser)

{'parser': 'sequence', 'args': {'parsers': [{'parser': 'until', 'args': {'parser': {'parser': 'regex', 'args': {'r': '[._&:/\\- ]', 'type_name': {'token': 'Delimiter'}}}, 'type_name': {'token': 'Constant'}}}, {'parser': 'regex', 'args': {'r': '[._&:/\\- ]', 'type_name': {'token': 'Delimiter'}}}, {'parser': 'until', 'args': {'parser': {'parser': 'abbreviations', 'args': {'patterns': {'CHVLV': 'https://brickschema.org/schema/Brick#Chilled_Water_Valve', 'HWVLV': 'https://brickschema.org/schema/Brick#Hot_Water_Valve', 'CHWST': 'https://brickschema.org/schema/Brick#Leaving_Chilled_Water_Temperature_Sensor', 'CHWRT': 'https://brickschema.org/schema/Brick#Entering_Chilled_Water_Temperature_Sensor', 'CRAC': 'https://brickschema.org/schema/Brick#Computer_Room_Air_Conditioner', 'RVAV': 'https://brickschema.org/schema/Brick#Variable_Air_Volume_Box_With_Reheat', 'HWST': 'https://brickschema.org/schema/Brick#Leaving_Hot_Water_Temperature_Sensor', 'HWRT': 'https://brickschema.org/schema/Brick#Enteri

### Potential Abbreviations
The LLM (Ollama3) is asked to classify each token of a point label as one of four types: constant, identifier, abbreviation, or delimiter. If the LLM **predicts** a token to be an abbreviation and it is **not** found in the provided `list_of_dicts`, the token is added to the `flagged_abbreviations` list.
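The flagging rule can be illustrated with a minimal sketch. The function name `flag_unknown_abbreviations` and the `(token, predicted_type)` input format are assumptions made for illustration; they are not part of the BuildingMOTIF API, and the library's own implementation may differ.

```python
def flag_unknown_abbreviations(predictions, list_of_dicts):
    """Return tokens predicted as abbreviations that no dictionary knows.

    predictions: iterable of (token, predicted_type) pairs (hypothetical format).
    list_of_dicts: list of dicts keyed by known abbreviation.
    """
    # Collect every known abbreviation across all provided dictionaries.
    known = set()
    for abbrev_dict in list_of_dicts:
        known.update(abbrev_dict)

    flagged = []
    for token, predicted_type in predictions:
        is_new = token not in known and token not in flagged
        if predicted_type == "abbreviation" and is_new:
            flagged.append(token)
    return flagged


# Illustrative data only: "ERV" is predicted as an abbreviation but is unknown.
example_predictions = [
    ("AHU", "abbreviation"),
    ("ERV", "abbreviation"),
    ("01", "identifier"),
    ("ERV", "abbreviation"),
]
example_dicts = [{"AHU": "Air_Handling_Unit"}, {"SAT": "Supply_Air_Temperature_Sensor"}]
print(flag_unknown_abbreviations(example_predictions, example_dicts))
```

An empty result, as in the output of the next cell, means every token the LLM labeled as an abbreviation was already covered by the dictionaries.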

In [6]:
print("potential abbreviations: ", serializedParsers.flagged_abbreviations)

potential abbreviations:  []


### Distance Metric
Tokens were classified into **3 types** based on their composition: characters, numbers, and special characters. The metric compares the tokens from a pair of point labels and computes the ratio of **the number of identically classified tokens to the number of tokens in the longer point label**. It ranges from **0 (no similarity) to 1 (identical token structure)**, so despite the heading, higher values mean more similar labels. This approach is helpful for evaluating parser performance: tokens are parsed in sequence, so their order matters.
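As a rough illustration of the ratio described here, the sketch below tokenizes two labels, classifies each token, and compares the token types position by position. The tokenization rule (runs of letters, runs of digits, single other characters), the position-wise comparison, and the names `tokenize`, `classify_token`, and `similarity` are all assumptions; the class's actual computation may differ.

```python
import re


def tokenize(label):
    """Split a label into runs of letters, runs of digits, and single other characters."""
    return re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]", label)


def classify_token(token):
    """Map a token to one of three coarse types."""
    if token.isalpha():
        return "characters"
    if token.isdigit():
        return "numbers"
    return "special"


def similarity(label_a, label_b):
    """Ratio of position-wise matching token types to the longer label's token count."""
    types_a = [classify_token(t) for t in tokenize(label_a)]
    types_b = [classify_token(t) for t in tokenize(label_b)]
    longer = max(len(types_a), len(types_b))
    if longer == 0:
        return 1.0  # two empty labels are treated as identical
    matches = sum(a == b for a, b in zip(types_a, types_b))
    return matches / longer


# Same token structure (letters, digits, delimiter, letters) gives 1.0.
print(similarity("AHU1.SAT", "VAV2.ZNT"))
# A missing numeric token shifts every later position, so only 1 of 4 tokens matches.
print(similarity("AHU1.SAT", "AHU.SAT"))
```

The second call shows why token order matters: one missing token misaligns everything after it, which mirrors how a sequential parser would fail on that label.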

In [8]:
print("distance matrix statistics: ")
for k, v in serializedParsers.distance_metrics.items():
    print(k, v)

distance matrix statistics: 
mean 0.8501318937160921
median 1.0
std 0.16728128336649273
min 0.4166666666666667
max 1.0
range 0.5833333333333333


### Clustering
Density-based spatial clustering of applications with noise (DBSCAN) was used to group the most **similar** point labels together, which improves parser generation and the quality of the tokens emitted when each parser is applied. Labels that do not fit any cluster are reported as noise points.
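To make the clustering step concrete, here is a minimal pure-Python DBSCAN over a precomputed distance matrix. It is a sketch of the general algorithm, not the code this class runs; the `eps` and `min_samples` values and the toy matrix are assumptions chosen for illustration. If the pairwise values are similarities in the range 0 to 1, one plausible conversion to a distance is `1 - similarity`.

```python
def dbscan(dist, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    dist: square matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster_id = 0

    def neighbors(i):
        # A point counts as its own neighbor, since dist[i][i] is 0.
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
        cluster_id += 1
    return labels


# Toy data: two tight groups of three labels each and one outlier.
groups = [0, 0, 0, 1, 1, 1, 2]
toy_dist = [[0.0 if gi == gj else 1.0 for gj in groups] for gi in groups]
print(dbscan(toy_dist, eps=0.2, min_samples=2))
```

In practice a library implementation such as scikit-learn's `DBSCAN` with `metric="precomputed"` does the same job on a distance matrix.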

In [18]:
print("clustering info statistics: ")
for k, v in serializedParsers.clustering_metrics.items():
    print(k, v)

clustering info statistics: 
clusters 3
noise points 2
clustering_score 0.9803685207905768


In [11]:
print("total parsed points from all clusters: ", serializedParsers.parsed_count)
print("total unparsed from all clusters: ", serializedParsers.unparsed_count)
print("total number of points from all clusters: ", serializedParsers.total_count)

total parsed points from all clusters:  85
total unparsed from all clusters:  17
total number of points from all clusters:  102
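The three counts can be combined into a single parse rate. The helper below is a small illustrative sketch (the name `parse_rate` is hypothetical, not an attribute of the class), using the numbers printed above.

```python
def parse_rate(parsed, total):
    """Fraction of point labels that were successfully parsed."""
    if total == 0:
        return 0.0
    return parsed / total


parsed_count, unparsed_count, total_count = 85, 17, 102

# Sanity check: every label is either parsed or unparsed.
assert parsed_count + unparsed_count == total_count

print(f"parse rate: {parse_rate(parsed_count, total_count):.1%}")  # parse rate: 83.3%
```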


## All Together
Now that we have seen the results of the parser generation and clustering process, along with the associated metrics, we package all the data into a class attribute, `combined_clusters` (a list of dictionaries, one per cluster), for easier access.

In [12]:
for cluster_dict in serializedParsers.combined_clusters:
    for k, v in cluster_dict.items():
        print(k)
        print(v)
        print("\n")

parser
{'parser': 'sequence', 'args': {'parsers': [{'parser': 'until', 'args': {'parser': {'parser': 'regex', 'args': {'r': '[._&:/\\- ]', 'type_name': {'token': 'Delimiter'}}}, 'type_name': {'token': 'Constant'}}}, {'parser': 'regex', 'args': {'r': '[._&:/\\- ]', 'type_name': {'token': 'Delimiter'}}}, {'parser': 'until', 'args': {'parser': {'parser': 'abbreviations', 'args': {'patterns': {'CHVLV': 'https://brickschema.org/schema/Brick#Chilled_Water_Valve', 'HWVLV': 'https://brickschema.org/schema/Brick#Hot_Water_Valve', 'CHWST': 'https://brickschema.org/schema/Brick#Leaving_Chilled_Water_Temperature_Sensor', 'CHWRT': 'https://brickschema.org/schema/Brick#Entering_Chilled_Water_Temperature_Sensor', 'CRAC': 'https://brickschema.org/schema/Brick#Computer_Room_Air_Conditioner', 'RVAV': 'https://brickschema.org/schema/Brick#Variable_Air_Volume_Box_With_Reheat', 'HWST': 'https://brickschema.org/schema/Brick#Leaving_Hot_Water_Temperature_Sensor', 'HWRT': 'https://brickschema.org/schema/Brick

**Each `cluster_dict` contains:**

1. **Serialized** parser
2. **Source code** for parser
3. **Parsed** point labels list
4. **Emitted tokens** from running parser on its cluster
5. **Unparsed** point labels list
6. **Parser Metrics** (how many parsed/unparsed/total in that cluster)
