# Fix automatic identifiers from VocBench

First, we retrieve the new concepts that still carry VocBench's auto-generated IDs, i.e. those without a `dc:identifier`, through SPARQL:

```sparql
SELECT ?concept ?it_str ?en_str WHERE {
    ?concept a skos:Concept .
    ?concept skosxl:prefLabel ?label_it .
    ?label_it skosxl:literalForm ?it .
    ?concept skosxl:prefLabel ?label_en .
    ?label_en skosxl:literalForm ?en .
    BIND ( STR(?it) as ?it_str)
    BIND ( STR(?en) as ?en_str)
    FILTER (
        langMatches(lang(?it), "it")
        && langMatches(lang(?en), "en")
        && NOT EXISTS {
            ?concept dc:identifier ?id .
        }
    )
}
```

We save the query results as a CSV file (`no_identifier.csv`) and load it with pandas.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('no_identifier.csv')
data.head()

Unnamed: 0,concept,it_str,en_str
0,https://w3id.org/diga/terms/c_ab40f3d2,connessure labiali,lip dimples
1,https://w3id.org/diga/terms/c_8a2a0463,vulva,vulva
2,https://w3id.org/diga/terms/c_53d7423a,tridente,trident
3,https://w3id.org/diga/terms/c_cff8466e,"colonna e pilastro, semicolonna e parasta","column and pillar, semi-column and pilaster"
4,https://w3id.org/diga/terms/c_4a243ddb,ūṣṇīṣa separata,separated ūṣṇīṣa


In [3]:
import hashlib

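def generate_id(label):
    """Derive a numeric ID from a label: the first 32 bits of its SHA-1 digest, in decimal."""
    h = hashlib.sha1(label.encode('utf-8')).hexdigest()
    h_short = h[0:8]  # first 8 hex digits = 32 bits
    id_int = int(h_short, 16)
    return str(id_int)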

generate_id('DiGA')

'2443743546'
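
To make explicit what `generate_id` provides, here is a minimal sketch that repeats the function and checks two of its properties. Because only the first 8 hex digits of the SHA-1 digest are kept, every ID is an unsigned 32-bit integer (at most 4294967295) written in decimal, and the same label always yields the same ID. The names `is_deterministic` and `fits_in_32_bits` are illustrative and not part of the notebook.

```python
import hashlib


def generate_id(label):
    # Same logic as the cell above: truncate the SHA-1 digest to 32 bits.
    digest = hashlib.sha1(label.encode('utf-8')).hexdigest()
    return str(int(digest[:8], 16))


# Label built the same way as in the notebook: "<Italian> | <English>".
label = 'tridente | trident'

first = generate_id(label)
second = generate_id(label)

# Hashing is deterministic, so repeated calls agree.
is_deterministic = first == second
# 8 hex digits can encode at most 2**32 - 1.
fits_in_32_bits = 0 <= int(first) < 2**32

print(is_deterministic, fits_in_32_bits)  # True True
```

Truncating the digest also means that two distinct labels can, in principle, map to the same ID, which is why the clash check against existing identifiers is worth running.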

In [4]:
data['id'] = data.apply(lambda row: generate_id(f'{row["it_str"]} | {row["en_str"]}'), axis=1)
data.head()

Unnamed: 0,concept,it_str,en_str,id
0,https://w3id.org/diga/terms/c_ab40f3d2,connessure labiali,lip dimples,648615685
1,https://w3id.org/diga/terms/c_8a2a0463,vulva,vulva,524880040
2,https://w3id.org/diga/terms/c_53d7423a,tridente,trident,1775892328
3,https://w3id.org/diga/terms/c_cff8466e,"colonna e pilastro, semicolonna e parasta","column and pillar, semi-column and pilaster",292026662
4,https://w3id.org/diga/terms/c_4a243ddb,ūṣṇīṣa separata,separated ūṣṇīṣa,2529761007


Check we don’t have clashes with existing identifiers.

In [5]:
import rdflib

g = rdflib.Graph()
g.parse('diga_terms_vocbench.ttl')

diga_terms = rdflib.Namespace('https://w3id.org/diga/terms/')
g.bind('diga_terms', diga_terms)

Now we can check that a given identifier is not yet used. The example below is known to exist and serves as a sanity check:

In [6]:
(diga_terms['2838159259'], None, None) in g

True

In [7]:
for id_ in data['id']:
    if (diga_terms[id_], None, None) in g:
        print(f'ID {id_} already in use!')
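
The loop above only compares the new IDs against the existing graph. A complementary check, sketched here as an assumption rather than a step from the original workflow, is to look for duplicates within the new batch itself, since two different label pairs could hash to the same 32-bit value. The helper `find_duplicates` and the `sample_ids` list are hypothetical; the sample reuses three IDs from the table above and repeats one on purpose.

```python
from collections import Counter


def find_duplicates(ids):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(ids)
    return sorted(id_ for id_, n in counts.items() if n > 1)


# Illustrative sample only: one ID is deliberately repeated.
sample_ids = ['648615685', '524880040', '1775892328', '524880040']

for id_ in find_duplicates(sample_ids):
    print(f'ID {id_} generated more than once!')
```

In the notebook, calling `find_duplicates(data['id'])` should run the same check on the real column, and an empty result means the batch is internally consistent.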

Okay, we’re safe, so save the file.

In [8]:
data.to_csv('new_identifiers.csv')