In [98]:
import pandas as pd
from pyzotero import zotero 
from IPython.display import display
from sklearn.cluster import AffinityPropagation
import distance
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Zotero add-on

We can extract all tags in a library with the **zotero-tag** add-on by right-clicking the collection and exporting all tags as CSV. The export has 4 columns: tag, count, items and ids.
The items and ids columns contain comma-separated lists, which trips up the Python CSV parser.  
I opened the file in LibreOffice Calc and saved it with semicolons as the separator.


In [99]:
df = pd.read_csv('tags_fixed.csv',sep=';')
df.head()

Unnamed: 0,TAG,COUNT,ITEMS,IDS
0,/unread,551,A BIM-integrated Fuzzy Multi-criteria Decision...,"474,306,163,54,461,580,103,395,493,165,525,127..."
1,Building Information Modeling (BIM),72,Semantic information alignment of BIMs to comp...,"306,54,493,128,518,472,15,45,458,482,440,155,5..."
2,Automated Compliance Checking (ACC),30,Semantic information alignment of BIMs to comp...,"306,54,15,458,590,448,102,447,58,577,425,600,3..."
3,ontology,42,Semantic information alignment of BIMs to comp...,"306,493,165,792,128,472,45,482,9,140,288,46,29..."
4,Information extraction,2,Semantic information alignment of BIMs to comp...,306159
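
After loading, the ITEMS and IDS cells are still plain comma-separated strings. A minimal sketch of turning an IDS cell into a list of integers follows; `parse_ids` is a hypothetical helper (not part of the add-on or pandas), and it assumes IDs are non-negative integers.

```python
def parse_ids(cell):
    """Split a comma-separated IDS cell into a list of ints."""
    # A cell holding a single ID may be read as a number, so coerce to str first.
    parts = str(cell).split(",")
    # Skip anything that is not a plain run of digits (empty cells, NaN, stray text).
    return [int(p) for p in parts if p.strip().isdigit()]

print(parse_ids("474,306,163,54"))
print(parse_ids(306))
```

With the DataFrame above this could be applied as `df['IDS'].map(parse_ids)`.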


In [100]:
tags_addon = df['TAG']
print(tags_addon.size)

866


# Zotero API

The [Zotero Web API](https://www.zotero.org/support/dev/web_api/v3/basics) gives more detailed information on group and personal libraries in JSON format.
The pyzotero library makes it easier to use. Let's compare the results.  

The returned list contains duplicates. Strangely, we also get two tags that are not in the add-on export (maybe because of the set transformation).

In [101]:
# The API key is personal; the ID is that of the group library
zot = zotero.Zotero(3007408, 'group', 'Yv2xY0CH9vjjf40sQiVcCQt3')
tags_api = pd.Series(zot.everything(zot.tags()))

In [102]:
len(tags_api)

921

In [103]:
tags_api.duplicated().sum()

55

In [104]:
tags_addon.duplicated().sum()

0

In [105]:
dif = set(tags_api.tolist()) - set(tags_addon.tolist())
print(dif)

{'Object-Property, Method, Relation', 'Rule language, engine and checking'}
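
Both tags reported above contain a comma, the same character that caused trouble in the CSV export, so one possible explanation (an assumption, not verified here) is that the add-on export stores them in an altered form. A rough way to check is to compare tags after stripping everything except ASCII letters and digits. The helper names below are hypothetical.

```python
import re

def tag_key(tag):
    """Reduce a tag to lowercase ASCII letters and digits so punctuation differences vanish."""
    return re.sub(r"[^0-9a-z]+", "", str(tag).casefold())

def near_matches(missing, candidates):
    """Map each missing tag to the candidates that share its reduced key."""
    by_key = {}
    for cand in candidates:
        by_key.setdefault(tag_key(cand), []).append(cand)
    return {tag: by_key.get(tag_key(tag), []) for tag in missing}

# Toy data standing in for `dif` and `tags_addon`; the add-on spelling here is invented.
missing = {"Rule language, engine and checking"}
addon = ["Rule language engine and checking", "ontology"]
print(near_matches(missing, addon))
```

On the real data this would be `near_matches(dif, tags_addon)`; an empty list for a tag would suggest it is truly absent from the export rather than merely reformatted.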


Here is an example of a single item from the library and its metadata:

In [106]:
items = zot.items()
print(items[0])

{'key': 'IHC2TIKN', 'version': 1987, 'library': {'type': 'group', 'id': 3007408, 'name': 'Semantic BIM', 'links': {'alternate': {'href': 'https://www.zotero.org/groups/semantic_bim', 'type': 'text/html'}}}, 'links': {'self': {'href': 'https://api.zotero.org/groups/3007408/items/IHC2TIKN', 'type': 'application/json'}, 'alternate': {'href': 'https://www.zotero.org/groups/semantic_bim/items/IHC2TIKN', 'type': 'text/html'}}, 'meta': {'createdByUser': {'id': 10307964, 'username': 'aleksandrositsyn', 'name': 'Aleksandr Ositsyn', 'links': {'alternate': {'href': 'https://www.zotero.org/aleksandrositsyn', 'type': 'text/html'}}}, 'creatorSummary': 'Corry et al.', 'parsedDate': '2014-10', 'numChildren': 0}, 'data': {'key': 'IHC2TIKN', 'version': 1987, 'itemType': 'journalArticle', 'title': 'Using semantic web technologies to access soft AEC data', 'creators': [{'creatorType': 'author', 'firstName': 'Edward', 'lastName': 'Corry'}, {'creatorType': 'author', 'firstName': 'James', 'lastName': 'O’Donn
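
The printed item is cut off before its tags. In the Zotero Web API v3 item format, tags sit under `data['tags']` as a list of objects with a `tag` field; assuming that layout, a sketch of counting tags across a list of items could look like this. The sample items are invented for illustration.

```python
from collections import Counter

def count_item_tags(items):
    """Count how often each tag appears across Zotero item dicts."""
    counts = Counter()
    for item in items:
        # Items without a 'tags' entry (assumed possible) contribute nothing.
        for entry in item.get("data", {}).get("tags", []):
            counts[entry["tag"]] += 1
    return counts

sample_items = [
    {"key": "AAAA1111", "data": {"tags": [{"tag": "ontology"}, {"tag": "SPARQL"}]}},
    {"key": "BBBB2222", "data": {"tags": [{"tag": "ontology"}]}},
    {"key": "CCCC3333", "data": {}},
]
print(count_item_tags(sample_items).most_common())
```

Applied to the library this would be `count_item_tags(zot.everything(zot.items()))`, which could then be compared against the COUNT column of the add-on export.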

# Normalization

Even though Vlado suggested using the algorithms available in OpenRefine, I found a string clustering approach that applies sklearn's AffinityPropagation to Levenshtein distances.
The algorithm gives a warning of failed convergence, so the results may not be optimal.
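
For reference, Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain-Python version for illustration only; the notebook itself relies on `distance.levenshtein`, and `levenshtein` here is a stand-in name.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

tags = ["Model view definition (MVD)", "Model View Definition (MVD)", "SPARQL"]
# Negated distances act as similarities: 0 on the diagonal, more negative for less alike.
similarity = [[-levenshtein(x, y) for x in tags] for y in tags]
print(similarity[0][1])
```

The distance is case-sensitive, so the two MVD spellings differ by 2; lowercasing tags before clustering would make them identical.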

In [107]:
# Using the tags from the add-on export
words = np.asarray(tags_addon)
# Affinity propagation expects similarities, so negate the pairwise Levenshtein distances
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
# damping=0.5 is the scikit-learn default
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
# Print each exemplar followed by the members of its cluster
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

 - *ontology:* Airport ontology, B-Prolog, Basic Formal Ontology, Bozen-Bolzano, CQIEOntology, CSCOntology, EUnet4DBP, Energy, Geology, IndoorGML, Meteorology, Prolog, Topology, Vocabulary, energy, ontology, ontology mapping, point cloud, pythonOCC, review methodology, taxonomy, terminology
 - *Visual Compliance Checking Language:* Visual Code Checking Language, Visual Compliance Checking Language
 - *spatial reasoning:* Declarative Reasoning, as-designed/as-built, defeasible reasoning, logic-based reasoning, spatial computation, spatial indexing, spatial operators, spatial reasoning, topological reasoning
 - *International Foundation Classes (IFC):* International Foundation Classes (IFC)
 - *Model view definition (MVD):* Model View Definition (MVD), Model view definition (MVD), Model view definitions (MVD)
 - *Hazard recognition and communication:* Hazard recognition and communication
 - *Occupational construction safety and health:* Occupational construction safety and health
 - *Sch



In [108]:
clusters = [words[center] for center in affprop.cluster_centers_indices_]
clusters

['ontology',
 'Visual Compliance Checking Language',
 'spatial reasoning',
 'International Foundation Classes (IFC)',
 'Model view definition (MVD)',
 'Hazard recognition and communication',
 'Occupational construction safety and health',
 'Schema',
 'Smart Buildings',
 'Common Data Environenment (CDE)',
 'Sustainable Development Goals (SDG)',
 'DAta Linked Through Occurrences Network',
 'Formalisation of conformance requirements',
 'Ontological approach for conformance checking',
 'Semantic annotation and organisation of building codes',
 'information extraction',
 'Compliance Audit Procedures',
 'Legal Knowledge Model',
 'SPARQL',
 'Information Container for Document Delivery (ICDD)',
 'ISO 21597',
 'geometry',
 'Level of Detail (LOD)',
 'conflict-driven constraint learning',
 'easoner',
 'Performance-based Building Code',
 'Built Environment',
 'Concepts',
 'Technologies',
 'building administration permission service',
 'energy performance certificates',
 'Building management system

In [95]:
len(clusters)

138
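
Because the fit above reported failed convergence, one option is to rerun it with stronger damping and more iterations, then check whether it stopped before the iteration limit. The sketch below uses toy tags and a plain-Python edit distance so that it runs on its own; the parameter values are guesses rather than tuned settings, `random_state` assumes scikit-learn 0.23 or later, and `n_iter_ < max_iter` is only a rough convergence indicator.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def edit_distance(a, b):
    """Plain-Python Levenshtein distance, standing in for distance.levenshtein."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Toy tags for illustration; replace with the real `words` array.
toy_words = np.array(["ontology", "Ontology", "ontologies", "SPARQL",
                      "SPARQL query", "geometry", "Geometry"])
sim = -np.array([[edit_distance(a, b) for a in toy_words] for b in toy_words], dtype=float)

max_iter = 1000  # assumed value, well above the default of 200
model = AffinityPropagation(affinity="precomputed", damping=0.9,
                            max_iter=max_iter, random_state=0)
model.fit(sim)

# If the loop used every iteration, the run most likely did not converge.
print("stopped early:", model.n_iter_ < max_iter)
print("clusters:", len(model.cluster_centers_indices_))
```

Higher damping slows the message updates and tends to reduce oscillation, at the cost of more iterations; whether it helps on the full tag list would need to be checked.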