# Avantis Mapping Instructions

The following describes the process for mapping the current Avantis classes to the new classification system.

## Load SPARQL data

The first step is to run SPARQL queries in Protégé to retrieve the relevant metadata from the new classes. The SPARQL queries should look for classes with the following annotation properties:
1. is_equivalent_to_Avantis_class
2. is_equivalent_to_Avantis_category
3. is_equivalent_to_tag_code
4. is_superclass_of_Avantis_class
5. is_superclass_of_Avantis_category
6. is_superclass_of_tag_code
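
The following query can be modified to accomplish this task. As written it matches the first property; the other five are queried by swapping the property name.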

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX tw: <http://www.toronto.ca/TWONTO#>

SELECT (STR(?label) as ?OWL) (STR(?object) as ?Avantis)
WHERE { 
    ?entityIRI tw:is_equivalent_to_Avantis_class ?object ;
              rdfs:label ?label .
}
```
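
Since the six queries differ only in the property name, they can be generated from one template instead of being edited by hand. The following is a minimal sketch, assuming the queries are pasted into Protégé manually afterwards; `PROPERTIES`, `QUERY_TEMPLATE`, and `build_query` are illustrative names, not part of the original workflow.

```python
# Sketch: build one SPARQL query per annotation property from a shared template.
# The names below are hypothetical helpers introduced for illustration.
PREFIXES = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX tw: <http://www.toronto.ca/TWONTO#>"""

PROPERTIES = [
    "is_equivalent_to_Avantis_class",
    "is_equivalent_to_Avantis_category",
    "is_equivalent_to_tag_code",
    "is_superclass_of_Avantis_class",
    "is_superclass_of_Avantis_category",
    "is_superclass_of_tag_code",
]

# Doubled braces are literal braces in str.format templates.
QUERY_TEMPLATE = """{prefixes}

SELECT (STR(?label) AS ?OWL) (STR(?object) AS ?Avantis)
WHERE {{
    ?entityIRI tw:{prop} ?object ;
              rdfs:label ?label .
}}"""


def build_query(prop: str) -> str:
    """Return the SPARQL query text for one annotation property."""
    return QUERY_TEMPLATE.format(prefixes=PREFIXES, prop=prop)


queries = {prop: build_query(prop) for prop in PROPERTIES}
print(queries["is_superclass_of_tag_code"])
```
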
Once the SPARQL query results are saved (here as Excel files), they are loaded into pandas DataFrames in Python. This is done with the following code:

In [119]:
import pandas as pd

df = pd.read_excel("SPARQL_class.xlsx", header= 1)
Class = dict(zip(df['Avantis_Class'],df['TWONTO']))
df

| | TWONTO | Avantis_Class |
|---|---|---|
| 0 | outfall or discharge point | Outfall |
| 1 | control loop | Control Loop |
| 2 | UPS | Uninterruptible Power Supply |
| 3 | pipe segment | Piping |
| 4 | instrument transmitter | Transmitter |
| ... | ... | ... |
| 117 | tool | Tool |
| 118 | air damper | Damper/Louver |
| 119 | weight scale | Weigh Scale |
| 120 | operator interface terminal | SCADA computer terminal |


There is an additional step in which the two columns are saved as a dictionary, with the Avantis value as the key and the TWONTO label as the value. A Python dictionary is a collection of key-value pairs.
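
To illustrate how such a lookup behaves, here is a small sketch using plain lists as stand-ins for the two DataFrame columns; the first three values are taken from the output above, and the names `avantis_classes`, `twonto_labels`, and `class_lookup` are illustrative only.

```python
# Stand-ins for df['Avantis_Class'] and df['TWONTO'] (values from the output above).
avantis_classes = ["Outfall", "Control Loop", "Piping"]
twonto_labels = ["outfall or discharge point", "control loop", "pipe segment"]

# zip pairs the columns row by row; dict turns the pairs into a lookup table.
class_lookup = dict(zip(avantis_classes, twonto_labels))

print(class_lookup["Piping"])               # pipe segment
print(class_lookup.get("Not A Class", ""))  # prints an empty line: no match

# If a key appears more than once, the last pair wins.
duplicates = dict(zip(["A", "A"], ["first", "second"]))
print(duplicates)                           # {'A': 'second'}
```

The last case matters when an Avantis value is annotated on more than one ontology class: only one of the labels survives in the dictionary, so checking the query results for duplicate keys may be worthwhile.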

In [120]:
df = pd.read_excel("SPARQL_superclass.xlsx", header= 1)
Super_Class = dict(zip(df['Super_Class'],df['TWONTO']))

df = pd.read_excel("SPARQL_tag.xlsx", header= 1)
Tag = dict(zip(df['Tag'],df['TWONTO']))

df = pd.read_excel("SPARQL_supertag.xlsx", header= 1)
Super_Tag = dict(zip(df['Super_Tag'],df['TWONTO']))

df = pd.read_excel("SPARQL_category.xlsx", header= 1)
Category = dict(zip(df['Category'],df['TWONTO']))

df = pd.read_excel("SPARQL_supercategory.xlsx", header= 1)
Super_Category = dict(zip(df['Super_Category'],df['TWONTO']))

df = pd.read_excel("manualMatch.xlsx", sheet_name = 'LLM capability Test Dataset')
Manual_Match = dict(zip(df['Entity_number'],df['Valid_Class']))

Tag

{'MX': 'mixer or agitator',
 'CBL': 'cable segment',
 'UV': 'UV disinfection assembly',
 'HU': 'humidifier',
 'DP': 'material distribution panel',
 'SC': 'screen',
 'BU': 'burner component of asset',
 'DD': 'display panel',
 'SM': 'large stationary tool',
 'TE': 'temperature sensor element',
 'D': 'dehumidifier',
 'FS': 'flow switch',
 'PIP': 'pipe segment',
 'SPL': 'spill kit',
 'LDR': 'ladder',
 'DR': 'roll up door',
 'HF': 'power harmonic filter',
 'CMP': 'compactor',
 'DS': 'disconnect switch',
 'FA': 'annunciator panel',
 'FM': 'pressurized sewer segment',
 'CAP': 'capacitor',
 'DU': 'air duct segment',
 'HYD': 'hydrant',
 'VSS': 'surge suppressor',
 'TR': 'transformer',
 'GEN': 'generator-set',
 'ELV': 'elevator',
 'PCV': 'valve',
 'C': 'compressor',
 'BFP': 'backflow preventer',
 'DIT': 'slude density meter',
 'LP': 'LV electrical panel',
 'LS': 'level switch',
 'SCBA': 'self-contained breathing apparatus',
 'STR': 'strainer',
 'CU': 'AC condenser unit',
 'PT': 'pressure transmi

## SQL connection
The next step connects to the Avantis SQL Server database to retrieve the list of entities.

In [121]:
import pyodbc
import os
from sqlalchemy.engine import URL
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_theme()

connect = 'DSN=Avantis6-P;UID='+ os.environ['Avantis_User'] + ';PWD=' + os.environ['Avantis_Pass']
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connect})

engine = create_engine(connection_url)

SQL1 = """SELECT Distinct MAINTENT.id as [Entity_number],
[MAINTENT].[aenm] as [Description],
MAINTENT2.id as [Parent],
MAINTENT2.aenm as [Parent_Description],
[contname] as [Category],
[entclsid] as [Class],
SUSPEND.suspoi as [Suspended]

FROM  [AvantisP].[mc].[MAINTENT]
	  Left Join [AvantisP].[mc].[ENTCLASS] on MAINTENT.entclsref_oi = ENTCLASS.entcloi
	  Left Join [AvantisP].[mc].CATVAL ON MAINTENT.cat1_oi = CATVAL.cvoi
	  Left Join MC.SUSPEND ON MAINTENT.susp_oi = SUSPEND.suspoi
	  Left Join MC.MELINK ON MAINTENT.mtnoi = MELINK.mtnchild_oi
	  Left Join MC.MAINTENT MAINTENT2 ON MELINK.mtnparn_oi = MAINTENT2.mtnoi"""


df1 = pd.read_sql(SQL1,engine)
# Drop rows without an entity number (notna covers both None and NaN)
df1 = df1[df1['Entity_number'].notna()]
df1['Tag'] = df1['Entity_number'].str.extract(r"-([a-zA-Z]+)-\d+")
df1

| | Entity_number | Description | Parent | Parent_Description | Category | Class | Suspended | Tag |
|---|---|---|---|---|---|---|---|---|
| 0 | \tFCL_ELS_CBL_001L | Electrical Power Line,4.16KV,From BUS-00B1-A t... | FCL_ELS_4.16KV_LINES | Electrical Power Line,4.16KV | Air Handling Unit | Electrical Power Line | | |
| 1 | \tFCL_ELS_CBL_002D | Electrical Power Line,4.16KV,From BUS-00B2-A t... | FCL_ELS_4.16KV_LINES | Electrical Power Line,4.16KV | Air Handling Unit | Electrical Power Line | | |
| 2 | \tFCL_ELS_CBL_002F | Electrical Power Line,4.16KV,From BUS-00B2 to ... | FCL_ELS_4.16KV_LINES | Electrical Power Line,4.16KV | Air Handling Unit | Electrical Power Line | | |
| 3 | \tFCL_ELS_CBL_002H | Electrical Power Line,4.16KV,From BUS-00B1 to ... | FCL_ELS_4.16KV_LINES | Electrical Power Line,4.16KV | Air Handling Unit | Electrical Power Line | | |
| 4 | \tFCL_ELS_CBL_002L | Electrical Power Line,4.16KV,From BUS-00B2-A t... | FCL_ELS_4.16KV_LINES | Electrical Power Line,4.16KV | Air Handling Unit | Electrical Power Line | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 148118 | YX2413A | Pump, Dewatering, Old PS House #3 | TAB-PRM-P-SUMP | P Bldg & Old PS Buildings, Primary Treatment S... | Pump,Non Positive Displacement | Pump | | |
| 148119 | ZDATA PILOT | ZCity of Toronto | | | | | 39026.0 | |
| 148120 | ZXDP ENTITY COL | DO NOT USE - City of Toronto | | | | | | |
| 148121 | ZXDP OTHER COL | DO NOT USE - DP Other Entities | | | | | | |
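
The `Tag` column is empty for every row shown because the regular expression only matches a run of letters enclosed by hyphens and followed by digits. The sketch below shows this on a few sample values; the first two entity numbers are hypothetical and made up for illustration, while the last two come from the output above.

```python
import pandas as pd

# First two values are hypothetical; the last two appear in the query output.
entity_numbers = pd.Series(
    ["ABC-PT-1001", "ABC-MX-0002A", "FCL_ELS_CBL_001L", "ZDATA PILOT"]
)

# Same pattern as in the notebook; expand=False returns a Series, not a DataFrame.
tags = entity_numbers.str.extract(r"-([a-zA-Z]+)-\d+", expand=False)

for number, tag in zip(entity_numbers, tags):
    print(f"{number!r:22} -> {tag}")
```

The two hyphenated values yield `PT` and `MX`. The underscore-delimited and space-delimited values yield a missing value, so those entities can only be matched through their class or category.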


In [122]:
#df1 = df1[df1['Suspended'].isna()]

In [123]:
entityMatch = pd.DataFrame()
entityMatch['Entity_number'] = df1['Entity_number'] 
entityMatch['Class_Match'] = [Class.get(x,"") for x in df1['Class']]
entityMatch['Super_Class_Match'] = [Super_Class.get(x,"") for x in df1['Class']]
entityMatch['Tag_Match'] = [Tag.get(x,"") for x in df1['Tag']]
entityMatch['Super_Tag_Match'] = [Super_Tag.get(x,"") for x in df1['Tag']]
entityMatch['Category_Match'] = [Category.get(x,"") for x in df1['Category']]
entityMatch['Super_Category_Match'] = [Super_Category.get(x,"") for x in df1['Category']]
entityMatch

| | Entity_number | Class_Match | Super_Class_Match | Tag_Match | Super_Tag_Match | Category_Match | Super_Category_Match |
|---|---|---|---|---|---|---|---|
| 0 | \tFCL_ELS_CBL_001L | | cable segment | | | | air handler unit |
| 1 | \tFCL_ELS_CBL_002D | | cable segment | | | | air handler unit |
| 2 | \tFCL_ELS_CBL_002F | | cable segment | | | | air handler unit |
| 3 | \tFCL_ELS_CBL_002H | | cable segment | | | | air handler unit |
| 4 | \tFCL_ELS_CBL_002L | | cable segment | | | | air handler unit |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 148118 | YX2413A | | | | | | |
| 148119 | ZDATA PILOT | | | | | | |
| 148120 | ZXDP ENTITY COL | | | | | | |
| 148121 | ZXDP OTHER COL | | | | | | |


In [126]:
entityMatch = pd.DataFrame()
entityMatch['Entity_number'] = df1['Entity_number']

# Step 1: Class Match
entityMatch['First_Match'] = df1['Class'].map(Class).fillna("")
class_match_count = entityMatch['First_Match'].str.len().gt(0).sum()
print(f"Class Match: {class_match_count}/{len(df1)} ({class_match_count/len(df1)*100:.2f}%)")

# Step 2: Super Class Match (if no Class Match)
entityMatch['First_Match'] = entityMatch['First_Match'].mask(entityMatch['First_Match'] == "", df1['Class'].map(Super_Class).fillna(""))
super_class_match_count = entityMatch['First_Match'].str.len().gt(0).sum() - class_match_count
print(f"Super Class Match: {super_class_match_count}/{len(df1)} ({super_class_match_count/len(df1)*100:.2f}%)")

# Step 3: Tag Match (if no Class or Super Class Match)
entityMatch['First_Match'] = entityMatch['First_Match'].mask(entityMatch['First_Match'] == "", df1['Tag'].map(Tag).fillna(""))
tag_match_count = entityMatch['First_Match'].str.len().gt(0).sum() - (class_match_count + super_class_match_count)
print(f"Tag Match: {tag_match_count}/{len(df1)} ({tag_match_count/len(df1)*100:.2f}%)")

# Step 4: Super Tag Match (if no Class, Super Class, or Tag Match)
entityMatch['First_Match'] = entityMatch['First_Match'].mask(entityMatch['First_Match'] == "", df1['Tag'].map(Super_Tag).fillna(""))
super_tag_match_count = entityMatch['First_Match'].str.len().gt(0).sum() - (class_match_count + super_class_match_count + tag_match_count)
print(f"Super Tag Match: {super_tag_match_count}/{len(df1)} ({super_tag_match_count/len(df1)*100:.2f}%)")

# Step 5: Category Match (if no match yet)
entityMatch['First_Match'] = entityMatch['First_Match'].mask(entityMatch['First_Match'] == "", df1['Category'].map(Category).fillna(""))
category_match_count = entityMatch['First_Match'].str.len().gt(0).sum() - (class_match_count + super_class_match_count + tag_match_count + super_tag_match_count)
print(f"Category Match: {category_match_count}/{len(df1)} ({category_match_count/len(df1)*100:.2f}%)")

# Step 6: Super Category Match (final fallback)
entityMatch['First_Match'] = entityMatch['First_Match'].mask(entityMatch['First_Match'] == "", df1['Category'].map(Super_Category).fillna(""))
super_category_match_count = entityMatch['First_Match'].str.len().gt(0).sum() - (class_match_count + super_class_match_count + tag_match_count + super_tag_match_count + category_match_count)
print(f"Super Category Match: {super_category_match_count}/{len(df1)} ({super_category_match_count/len(df1)*100:.2f}%)")

# Total matches
total_match_count = entityMatch['First_Match'].str.len().gt(0).sum()
print(f"Auto Matches: {total_match_count}/{len(df1)} ({total_match_count/len(df1)*100:.2f}%)")

# Step 7: Manual Match
entityMatch['First_Match'] = df1['Entity_number'].map(Manual_Match).fillna(entityMatch['First_Match'])
manual_match_count = df1['Entity_number'].map(Manual_Match).str.len().gt(0).sum()
print(f"\nManual Match: {manual_match_count}/{len(df1)} ({manual_match_count/len(df1)*100:.2f}%)")
total_match_count = entityMatch['First_Match'].str.len().gt(0).sum()
print(f"\nTotal Matches: {total_match_count}/{len(df1)} ({total_match_count/len(df1)*100:.2f}%)")


Class Match: 124895/148123 (84.32%)
Super Class Match: 479/148123 (0.32%)
Tag Match: 4071/148123 (2.75%)
Super Tag Match: 7219/148123 (4.87%)
Category Match: 785/148123 (0.53%)
Super Category Match: 191/148123 (0.13%)
Auto Matches: 137640/148123 (92.92%)

Manual Match: 24664/148123 (16.65%)

Total Matches: 139110/148123 (93.92%)
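
Steps 1 to 6 repeat one pattern: map a column through a dictionary, then fill only the rows that are still empty. As a possible simplification, the cascade could be written as a loop over an ordered list of lookups. This is a minimal sketch on toy data, not the notebook's actual code; `first_match`, `toy`, and the shortened dictionaries are hypothetical.

```python
import pandas as pd


def first_match(df, lookups):
    """Return the first non-empty match per row and the rows filled by each step.

    lookups is an ordered list of (step name, source column, dictionary).
    """
    result = pd.Series("", index=df.index, dtype=object)
    counts = {}
    for name, column, mapping in lookups:
        candidate = df[column].map(mapping).fillna("")
        before = result.ne("").sum()
        # Only rows that are still empty take the candidate value.
        result = result.mask(result == "", candidate)
        counts[name] = int(result.ne("").sum() - before)
    return result, counts


# Toy data: row 0 matches on class, row 1 only on tag, row 2 not at all.
toy = pd.DataFrame({
    "Class": ["Piping", "Unknown", None],
    "Tag": ["PIP", "MX", "ZZ"],
})
lookups = [
    ("Class", "Class", {"Piping": "pipe segment"}),
    ("Tag", "Tag", {"PIP": "pipe segment", "MX": "mixer or agitator"}),
]

matches, counts = first_match(toy, lookups)
print(matches.tolist())  # ['pipe segment', 'mixer or agitator', '']
print(counts)            # {'Class': 1, 'Tag': 1}
```

Because each step only counts the rows it newly fills, the per-step counts sum to the total of automatic matches, as in the output above. The manual match in Step 7 behaves differently: it overrides existing values, so it would stay outside such a loop.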


In [127]:
#excel_class = ["Control Panel,MCC"]

#with pd.ExcelWriter('missing.xlsx', engine='openpyxl', mode='w') as writer:
#   for Value in excel_class:  
#      df1[df1['Class'] == Value].to_excel(writer, sheet_name = Value.replace(" ", "_"))
#   
#   df1[df1['Class'].isna()].to_excel(writer, sheet_name = "None")
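
If the export above is re-enabled for other classes, note that Excel sheet names are limited to 31 characters and may not contain the characters `\ / ? * [ ] :`, so a class such as `Damper/Louver` would need cleaning before it can be used as a sheet name. The helper below is a hypothetical sketch of such cleaning; `safe_sheet_name` is not part of the original notebook.

```python
import re


def safe_sheet_name(value: str) -> str:
    """Turn a class name into a string Excel accepts as a sheet name."""
    # Replace spaces as the notebook does, then the characters Excel forbids.
    cleaned = re.sub(r"[\[\]:*?/\\]", "_", value.replace(" ", "_"))
    # Excel allows at most 31 characters per sheet name.
    return cleaned[:31]


print(safe_sheet_name("Control Panel,MCC"))  # Control_Panel,MCC
print(safe_sheet_name("Damper/Louver"))      # Damper_Louver
```

Truncation can make two long class names collide, so the resulting names should be checked for duplicates before writing.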