# Fix automatic identifiers from VocBench

First, we use SPARQL to retrieve the new concepts that have auto-generated IDs, that is, concepts with Italian and English preferred labels but no `dc:identifier` yet:

```sparql
SELECT ?concept ?it_str ?en_str WHERE {
    ?concept a skos:Concept .
    ?concept skosxl:prefLabel ?label_it .
    ?label_it skosxl:literalForm ?it .
    ?concept skosxl:prefLabel ?label_en .
    ?label_en skosxl:literalForm ?en .
    BIND ( STR(?it) as ?it_str)
    BIND ( STR(?en) as ?en_str)
    FILTER (
        langMatches(lang(?it), "it")
        && langMatches(lang(?en), "en")
        && NOT EXISTS {
            ?concept dc:identifier ?id .
        }
    )
}
```

We save the query result as a CSV file and load it with pandas.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('no_identifier.csv')
data

Unnamed: 0,concept,it_str,en_str
0,https://w3id.org/diga/terms/c_c81bb8c2,il lavoro dello scultore,the sculptor's work
1,https://w3id.org/diga/terms/c_f55b180f,architettura,architecture
2,https://w3id.org/diga/terms/c_0d640c10,motivi decorativi,decorative motifs
3,https://w3id.org/diga/terms/c_ef524e8b,persone,people
4,https://w3id.org/diga/terms/c_90b07019,fauna,fauna
5,https://w3id.org/diga/terms/c_41a4be3f,flora,flora
6,https://w3id.org/diga/terms/c_4ba24f3c,strumenti musicali,musical instruments
7,https://w3id.org/diga/terms/c_6c3d7bae,oggetti da cerimonia,ceremonial objects
8,https://w3id.org/diga/terms/c_7c1d4a8d,oggetti di vita quotidiana,everyday objects
9,https://w3id.org/diga/terms/c_81f28c3c,mobilio,furniture


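To derive a numeric identifier from a label, we take the first eight hex digits of its SHA-1 hash and convert them to a decimal integer.

In [3]: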
import hashlib

def generate_id(label):
    h = hashlib.sha1(label.encode('utf-8')).hexdigest()
    h_short = h[0:8]
    id_int = int(h_short, 16)
    return str(id_int)

generate_id('DiGA')

'2443743546'
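
Because `generate_id` keeps only the first eight hex digits of the digest, every identifier is a 32-bit integer, so collisions become more likely as the vocabulary grows. The following is a rough sketch of the usual birthday-bound estimate, assuming the truncated hashes behave like uniformly random values; `collision_probability` and `ID_SPACE` are hypothetical names that are not part of the original notebook.

```python
import math

# 8 hex digits give 16**8 == 2**32 possible identifiers.
ID_SPACE = 16 ** 8

def collision_probability(n, space=ID_SPACE):
    """Approximate chance that n uniformly random IDs contain a duplicate.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * space)).
    expm1 keeps the result accurate when the probability is tiny.
    """
    return -math.expm1(-n * (n - 1) / (2 * space))

# Print the estimate for a few vocabulary sizes.
for n in (10, 1_000, 100_000):
    print(n, f'{collision_probability(n):.3g}')
```

For a handful of concepts the risk is negligible, but it is still worth checking explicitly, since the estimate says nothing about any particular set of labels.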

We apply this function to each concept, hashing the Italian and English labels joined by ` | `.

In [4]:
data['id'] = data.apply(lambda row: generate_id(f'{row["it_str"]} | {row["en_str"]}'), axis=1)
data

Unnamed: 0,concept,it_str,en_str,id
0,https://w3id.org/diga/terms/c_c81bb8c2,il lavoro dello scultore,the sculptor's work,2642759048
1,https://w3id.org/diga/terms/c_f55b180f,architettura,architecture,1629844188
2,https://w3id.org/diga/terms/c_0d640c10,motivi decorativi,decorative motifs,2896914945
3,https://w3id.org/diga/terms/c_ef524e8b,persone,people,234251970
4,https://w3id.org/diga/terms/c_90b07019,fauna,fauna,3945627877
5,https://w3id.org/diga/terms/c_41a4be3f,flora,flora,2298336107
6,https://w3id.org/diga/terms/c_4ba24f3c,strumenti musicali,musical instruments,2054426851
7,https://w3id.org/diga/terms/c_6c3d7bae,oggetti da cerimonia,ceremonial objects,3066298879
8,https://w3id.org/diga/terms/c_7c1d4a8d,oggetti di vita quotidiana,everyday objects,1022807358
9,https://w3id.org/diga/terms/c_81f28c3c,mobilio,furniture,573338218


Next, we check that the new identifiers don’t clash with existing ones.

In [5]:
import rdflib

g = rdflib.Graph()
g.parse('repertorio.ttl')

diga_terms = rdflib.Namespace('https://w3id.org/diga/terms/')
g.bind('diga_terms', diga_terms)

Now we can check that a given identifier is not yet in use. As a sanity check, we first try an identifier that is known to exist:

In [6]:
(diga_terms['2838159259'], None, None) in g

True

In [7]:
for id_ in data['id']:
    if (diga_terms[id_], None, None) in g:
        print(f'ID {id_} already in use!')
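
The loop above only compares each new ID against the existing graph; two new concepts could in principle also hash to the same value. A minimal sketch of that extra check, using only the standard library, is shown here. `find_duplicates` is a hypothetical helper and the sample list is illustrative (three IDs from the table above plus one deliberate repeat); in the notebook one would pass `data['id']` instead.

```python
from collections import Counter

def find_duplicates(ids):
    """Return the IDs that occur more than once, sorted."""
    counts = Counter(ids)
    return sorted(id_ for id_, n in counts.items() if n > 1)

# Illustrative input only: the last entry repeats the second one on purpose.
sample_ids = ['2642759048', '1629844188', '2896914945', '1629844188']

for id_ in find_duplicates(sample_ids):
    print(f'ID {id_} generated more than once!')
```

An empty result means the new batch is internally consistent.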

Okay, no clashes were reported, so we’re safe and can save the file.

In [8]:
data.to_csv('new_identifiers.csv')