## Classification
- **Culturally Exclusive**: 'caponata' -> Italy
- **Cultural Representative**: 'pizza' -> Italy
- **Cultural Agnostic**: 'bread'

#### Exploiting the Wikidata knowledge graph structure
- **Entity**: the item (e.g., "pizza")
- **P495**: country of origin
- **P2596**: culture
- **P172**: ethnic group
- **P37**: official language
- **P407**: Language of work or name
- **P135**: Movement (art, literature, philosophy)
- **P136**: Genre
- **P921**: Main subject
- **P547**: Memorialized by
- **P793**: Significant event
- **P840**: Narrative location
- **P17**: Country
- **P1843**: Taxon common name
- **P1001**: Applies to jurisdiction
- **P144**: Based on
- **P361**: Part of
- **P1705**: Native label
- **P2012**: cuisine
- **P2541**: Operating area
- **P1535**: Used by
- **P366**: Use
- **P1142**: Political ideology
- **P140**: Religion
- **P102**: Member of political party
- **P1344**: Participant in
- **P183**: Endemic to
- **P2341**: Indigenous to
- **P1532**: Country for sport
- **P279**: Subclass of
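
Since each property above maps to one query variable, the `SELECT` list and the `OPTIONAL` clauses could be generated from a single table rather than typed by hand. The sketch below is one possible way to do that. `PROPERTY_VARS` and `build_query` are hypothetical names that do not appear in the notebook, and only a subset of the properties is listed for brevity.

```python
# Hypothetical helper: map Wikidata property IDs to SPARQL variable names.
# Only a few of the properties listed above are included here.
PROPERTY_VARS = {
    'P495': 'origin',
    'P2596': 'culture',
    'P172': 'ethnic_group',
    'P2012': 'cuisine',
    'P279': 'subclass_of',
}


def build_query(item_id, property_vars=PROPERTY_VARS):
    """Build a SPARQL query with one OPTIONAL clause per property."""
    # One ?<var>Label column per property, resolved by the label service
    select = ' '.join(f'?{var}Label' for var in property_vars.values())
    optionals = '\n'.join(
        f'    OPTIONAL {{ wd:{item_id} wdt:{pid} ?{var} . }}'
        for pid, var in property_vars.items()
    )
    return (
        f'SELECT {select}\n'
        'WHERE {\n'
        f'{optionals}\n'
        '    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }\n'
        '}\n'
        'LIMIT 1'
    )


query = build_query('Q177')
print(query)
```

Keeping the mapping in one place means the list of `...Label` column names used for parsing can be derived from the same dictionary, so the query and the parser cannot drift apart.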


In [1]:
import requests
import json
from wikidata.client import Client
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd


In [24]:
properties = ['originLabel', 'cultureLabel', 'ethnic_groupLabel', 'off_languageLabel',
                'nameLabel', 'movementLabel', 'genreLabel', 'main_subjectLabel',
                'memorializedLabel', 'sign_eventLabel', 'narrative_locLabel', 'countryLabel',
                'taxonLabel', 'jurisdictionLabel', 'based_onLabel', 'part_ofLabel',
                'native_labelLabel', 'cuisineLabel', 'areaLabel', 'used_byLabel',
                'useLabel', 'political_ideoLabel', 'religionLabel', 'political_partyLabel', 
                'participant_inLabel', 'endemicLabel', 'indigenousLabel', 
                'country_sportLabel', 'subclass_ofLabel'
              ]

In [25]:
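def get_query_result(data, properties):
    """Map each requested property to its value in the first result row (None if unbound)."""
    # Guard against a failed request (empty data) or an empty result set
    bindings = data['results']['bindings'] if data else []
    first_row = bindings[0] if bindings else {}

    query_result = {}
    for prop in properties:
        # Variables left unbound by OPTIONAL are absent from the row
        query_result[prop] = first_row[prop]['value'] if prop in first_row else None

    return query_result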

In [26]:
sparql = SPARQLWrapper('https://query.wikidata.org/sparql')
sparql.setReturnFormat(JSON)

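To handle properties that an item does not have, we wrap each triple pattern in `OPTIONAL`. Without it, a single missing property would make the whole query return no rows.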
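
In the SPARQL JSON results format, a variable left unbound by `OPTIONAL` is simply absent from the row's binding, so the parsing code has to check for missing keys. A minimal sketch of that check, using a hand-written response (not real endpoint output) and a hypothetical `binding_value` helper:

```python
# Hand-written example in the shape of a SPARQL JSON response.
# It is illustrative only, not output captured from the endpoint.
sample = {
    'head': {'vars': ['originLabel', 'cultureLabel']},
    'results': {
        'bindings': [
            {'originLabel': {'type': 'literal', 'value': 'Italy'}}
        ]
    },
}


def binding_value(binding, var):
    """Return the value bound to `var`, or None if it is unbound."""
    cell = binding.get(var)
    return cell['value'] if cell is not None else None


row = sample['results']['bindings'][0]
print(binding_value(row, 'originLabel'))   # Italy
print(binding_value(row, 'cultureLabel'))  # None
```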

In [24]:
sparql.setQuery(
'''
SELECT ?originLabel ?culture ?ethnic_group 
?off_language ?related_trad ?name ?country_use ?cuisineLabel
WHERE {
    OPTIONAL { wd:Q177 wdt:P495 ?origin. }
    OPTIONAL { wd:Q177 wdt:P2596 ?culture. }
    OPTIONAL { wd:Q177 wdt:P172 ?ethnic_group. }
    OPTIONAL { wd:Q177 wdt:P37 ?off_language. }
    OPTIONAL { wd:Q177 wdt:P4950 ?related_trad. }
    OPTIONAL { wd:Q177 wdt:P407 ?name. }
    OPTIONAL { wd:Q177 wdt:P17 ?country_use. }
    OPTIONAL { wd:Q177 wdt:P2012 ?cuisine. }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
'''
)

In [25]:
data = ''
try:
    data = sparql.queryAndConvert()
    
except Exception as e:
    print(e)

In [28]:
query_result = get_query_result(data, properties)
table = pd.DataFrame.from_dict(query_result, orient='index')
print(table)

                            0
originLabel             Italy
culture                  None
ethnic_group             None
off_language             None
related_trad             None
name                     None
country_use              None
cuisineLabel  Italian cuisine


In [8]:
from datasets import load_dataset, load_from_disk

In [None]:
# dataset = load_dataset('sapienzanlp/nlp2025_hw1_cultural_dataset')
# dataset.save_to_disk('./training_dataset')

In [9]:
dataset = load_from_disk('./datasets')

In [11]:
print(len(dataset['train']))

6251


In [10]:
print(dataset['train'][5])

{'item': 'http://www.wikidata.org/entity/Q104414508', 'name': '100 percent corner', 'description': 'term for city center', 'type': 'concept', 'category': 'geography', 'subcategory': 'city', 'label': 'cultural agnostic'}


In what follows, we extract additional information from Wikidata. Starting from the items in the training dataset, we write a SPARQL query that retrieves the properties listed above for each of them.
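
Each `item` field in the dataset is a full entity URI, so the Wikidata ID is its last path segment. A small sketch of that extraction with a sanity check on the ID format follows. `extract_qid` is a hypothetical helper name, and the check assumes that every item is a Q-prefixed entity.

```python
import re

# Wikidata item IDs are 'Q' followed by digits (assumed for every dataset item).
QID_PATTERN = re.compile(r'Q\d+')


def extract_qid(uri):
    """Return the trailing Q-identifier of a Wikidata entity URI."""
    qid = uri.strip().rstrip('/').split('/')[-1]
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f'not a Wikidata item URI: {uri!r}')
    return qid


print(extract_qid('http://www.wikidata.org/entity/Q104414508'))  # Q104414508
```

Validating the ID before it is interpolated into a query string also keeps a malformed row from producing an invalid SPARQL query.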

In [13]:
dataset['train'][0]['item'].strip().split('/')[-1]

'Q32786'

In [15]:
# First we extract the items' IDs (the last path segment of each entity URI)
rows = len(dataset['train'])
IDs = [
    dataset['train'][index]['item'].strip().split('/')[-1]
    for index in range(rows)
]

In [17]:
print(IDs[:6])

['Q32786', 'Q371', 'Q3729947', 'Q158611', 'Q280375', 'Q104414508']


In [None]:
sparql.setQuery(
    '''
        SELECT ?originLabel ?cultureLabel ?ethnic_groupLabel ?off_languageLabel
                ?nameLabel ?movementLabel ?genreLabel ?main_subjectLabel
                ?memorializedLabel ?sign_eventLabel ?narrative_locLabel ?countryLabel
                ?taxonLabel ?jurisdictionLabel ?based_onLabel ?part_ofLabel
                ?native_labelLabel ?cuisineLabel ?areaLabel ?used_byLabel
                ?useLabel ?political_ideoLabel ?religionLabel ?political_partyLabel 
                ?participant_inLabel ?endemicLabel ?indigenousLabel 
                ?country_sportLabel ?subclass_ofLabel
        WHERE {
            OPTIONAL { wd:Q177 wdt:P495 ?origin .}
            OPTIONAL { wd:Q177 wdt:P2596 ?culture .}
            OPTIONAL { wd:Q177 wdt:P172 ?ethnic_group .}
            OPTIONAL { wd:Q177 wdt:P37 ?off_language .}
            OPTIONAL { wd:Q177 wdt:P407 ?name .}
            OPTIONAL { wd:Q177 wdt:P135 ?movement .}
            OPTIONAL { wd:Q177 wdt:P136 ?genre .}
            OPTIONAL { wd:Q177 wdt:P921 ?main_subject .}
            OPTIONAL { wd:Q177 wdt:P547 ?memorialized .}
            OPTIONAL { wd:Q177 wdt:P793 ?sign_event .}
            OPTIONAL { wd:Q177 wdt:P840 ?narrative_loc .}
            OPTIONAL { wd:Q177 wdt:P17 ?country .}
            OPTIONAL { wd:Q177 wdt:P1843 ?taxon .}
            OPTIONAL { wd:Q177 wdt:P1001 ?jurisdiction .}
            OPTIONAL { wd:Q177 wdt:P144 ?based_on .}
            OPTIONAL { wd:Q177 wdt:P361 ?part_of .}
            OPTIONAL { wd:Q177 wdt:P1705 ?native_label .}
            OPTIONAL { wd:Q177 wdt:P2012 ?cuisine .}
            OPTIONAL { wd:Q177 wdt:P2541 ?area .}
            OPTIONAL { wd:Q177 wdt:P1535 ?used_by .}
            OPTIONAL { wd:Q177 wdt:P366 ?use .}
            OPTIONAL { wd:Q177 wdt:P1142 ?political_ideo .}
            OPTIONAL { wd:Q177 wdt:P140 ?religion .}
            OPTIONAL { wd:Q177 wdt:P102 ?political_party .}
            OPTIONAL { wd:Q177 wdt:P1344 ?participant_in .}
            OPTIONAL { wd:Q177 wdt:P183 ?endemic .}
            OPTIONAL { wd:Q177 wdt:P2341 ?indigenous .}
            OPTIONAL { wd:Q177 wdt:P1532 ?country_sport .}
            OPTIONAL { wd:Q177 wdt:P279 ?subclass_of .}      

            SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }

        }

        LIMIT 1
    '''
)

In [31]:
def retrieve_info(items):
    i = 0
    for item in items:
        sparql.setQuery(
            f'''
            SELECT ?originLabel ?cultureLabel ?ethnic_groupLabel ?off_languageLabel
                    ?nameLabel ?movementLabel ?genreLabel ?main_subjectLabel
                    ?memorializedLabel ?sign_eventLabel ?narrative_locLabel ?countryLabel
                    ?taxonLabel ?jurisdictionLabel ?based_onLabel ?part_ofLabel
                    ?native_labelLabel ?cuisineLabel ?areaLabel ?used_byLabel
                    ?useLabel ?political_ideoLabel ?religionLabel ?political_partyLabel 
                    ?participant_inLabel ?endemicLabel ?indigenousLabel 
                    ?country_sportLabel ?subclass_ofLabel
            WHERE {{
                OPTIONAL {{ wd:{item} wdt:P495 ?origin .}}  
                OPTIONAL {{ wd:{item} wdt:P2596 ?culture .}}
                OPTIONAL {{ wd:{item} wdt:P172 ?ethnic_group .}}
                OPTIONAL {{ wd:{item} wdt:P37 ?off_language .}}
                OPTIONAL {{ wd:{item} wdt:P407 ?name .}}
                OPTIONAL {{ wd:{item} wdt:P135 ?movement .}}
                OPTIONAL {{ wd:{item} wdt:P136 ?genre .}}
                OPTIONAL {{ wd:{item} wdt:P921 ?main_subject .}}
                OPTIONAL {{ wd:{item} wdt:P547 ?memorialized .}}
                OPTIONAL {{ wd:{item} wdt:P793 ?sign_event .}}
                OPTIONAL {{ wd:{item} wdt:P840 ?narrative_loc .}}
                OPTIONAL {{ wd:{item} wdt:P17 ?country .}}
                OPTIONAL {{ wd:{item} wdt:P1843 ?taxon .}}
                OPTIONAL {{ wd:{item} wdt:P1001 ?jurisdiction .}}
                OPTIONAL {{ wd:{item} wdt:P144 ?based_on .}}
                OPTIONAL {{ wd:{item} wdt:P361 ?part_of .}}
                OPTIONAL {{ wd:{item} wdt:P1705 ?native_label .}}
                OPTIONAL {{ wd:{item} wdt:P2012 ?cuisine .}}
                OPTIONAL {{ wd:{item} wdt:P2541 ?area .}}
                OPTIONAL {{ wd:{item} wdt:P1535 ?used_by .}}
                OPTIONAL {{ wd:{item} wdt:P366 ?use .}}
                OPTIONAL {{ wd:{item} wdt:P1142 ?political_ideo .}}
                OPTIONAL {{ wd:{item} wdt:P140 ?religion .}}
                OPTIONAL {{ wd:{item} wdt:P102 ?political_party .}}
                OPTIONAL {{ wd:{item} wdt:P1344 ?participant_in .}}
                OPTIONAL {{ wd:{item} wdt:P183 ?endemic .}}
                OPTIONAL {{ wd:{item} wdt:P2341 ?indigenous .}}
                OPTIONAL {{ wd:{item} wdt:P1532 ?country_sport .}}
                OPTIONAL {{ wd:{item} wdt:P279 ?subclass_of .}}        

                SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }}

            }}

            LIMIT 1
        '''
        )

        try:
            data = sparql.queryAndConvert()
        except Exception as e:
            # Skip this item if the request fails instead of parsing empty data
            print(e)
            continue

        query_result = get_query_result(data, properties)
        table = pd.DataFrame.from_dict(query_result, orient='index')
        print(table)
        # Stop after 5 successful items while testing
        i += 1
        if i == 5:
            break

In [32]:
retrieve_info(IDs)

                               0
originLabel                India
cultureLabel                None
ethnic_groupLabel           None
off_languageLabel           None
nameLabel                   None
movementLabel               None
genreLabel            drama film
main_subjectLabel           None
memorializedLabel           None
sign_eventLabel             None
narrative_locLabel          None
countryLabel                None
taxonLabel                  None
jurisdictionLabel           None
based_onLabel               None
part_ofLabel                None
native_labelLabel           None
cuisineLabel                None
areaLabel                   None
used_byLabel                None
useLabel                    None
political_ideoLabel         None
religionLabel               None
political_partyLabel        None
participant_inLabel         None
endemicLabel                None
indigenousLabel             None
country_sportLabel          None
subclass_ofLabel            None
          