# Wikidata - Constraint Violations

Returns the Wikidata items with mandatory constraint violations, as scraped from the web page https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/Mandatory_constraints/Violations.

The items are returned using a Turtle and RDF-Star encoding that can be loaded into a graph database and used to understand the "noisiness" of the data, to selectively delete triples from the RDF dump (such as properties with erroneous values), or to insert triples (for example, adding inverse relationships where they are missing). At a minimum, the existence of a violation statement indicates reduced confidence in the information conveyed by the specified property for the indicated item.

Results are stored in the file, __constraint_violations.ttl__.

Specifically, each violation is output as a triple of the form:
```
<itemIRI> wikidata-owl:violatedProperty { 
                  optional_other_property_or_value
                  wikidata-owl:violationType <violationTypeIRI> ; 
                  wikidata-owl:constraintText "stringTextFromViolationsReport"
} <propertyIRIThatIsViolated> .
```

For example, a violation of the P21 constraint regarding sex or gender for Q921090 (a grove of quaking aspen in Utah) would be:
```
wd:Q921090 wikidata-owl:violatedProperty {
                   wikidata-owl:conflictingProperty wd:P131 ;
                   wikidata-owl:violationType wd:Q21502838 ; 
                   wikidata-owl:constraintText "sex or gender (P21): Conflicts with located in the administrative territorial entity (P131)"
} wd:P21 .
```
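
As a rough illustration of how a statement of this form might be assembled, the sketch below builds the Turtle text from its parts. The helper name `format_violation` and its parameters are hypothetical and are not part of the notebook code, which uses string templates instead. `json.dumps` is used only as a convenient way to produce a double-quoted, escaped string, on the assumption that its escapes are also acceptable in a Turtle string literal.

```python
import json


def format_violation(item, violated_property, violation_type, constraint_text, details=()):
    """Assemble one violation statement in the form shown above (hypothetical helper).

    `details` is an iterable of (predicate, value) pairs for the optional,
    constraint-specific lines, e.g. ('wikidata-owl:conflictingProperty', 'wd:P131').
    """
    lines = [f'{item} wikidata-owl:violatedProperty {{']
    for predicate, value in details:
        lines.append(f'    {predicate} {value} ;')
    lines.append(f'    wikidata-owl:violationType {violation_type} ;')
    # json.dumps is an assumption here: it quotes the text and escapes embedded quotes
    lines.append(f'    wikidata-owl:constraintText {json.dumps(constraint_text)}')
    lines.append(f'}} {violated_property} .')
    return '\n'.join(lines)


statement = format_violation(
    'wd:Q921090', 'wd:P21', 'wd:Q21502838',
    'sex or gender (P21): Conflicts with located in the '
    'administrative territorial entity (P131)',
    details=[('wikidata-owl:conflictingProperty', 'wd:P131')])
print(statement)
```

Passing an empty `details` sequence gives the shape used for constraint types that have no optional_other_property_or_value.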

## Supported Violation Types

At this time, violation triples are generated for the following constraint types: 
* Allowed-entity-type constraint (Q52004125); Violations indicate that the property should not be used with/is invalid for the entity
  * There is no optional_other_property_or_value
* Conflicts-with constraint (Q21502838); Violations indicate that the semantics of two different properties are not logical/not consistent for the item, and therefore the properties should not be used together
  * The “conflicting” property is referenced by the predicate, wikidata-owl:conflictingProperty
* Contemporary constraint (Q25796498); Violations indicate that the subject and object of the property triple should coexist at some point in time but do not
  * The other entity which is not contemporary is referenced by the predicate, wikidata-owl:nonContemporaryWith 
* Format constraint (Q21502404); Violations indicate that there is a formatting error in the property value
  * There is no optional_other_property_or_value
  * Further work is needed to extract the regular expression defining the format from the property's constraint definition (the format is identified by the predicate, pq:P1793)
* Integer constraint (Q52848401); Violations indicate that the property value should be an integer but is not
  * There is no optional_other_property_or_value
* Inverse constraint (Q21510855); Violations indicate that there is only a triple defined relating the subject to the object, but a corresponding triple in the reverse direction should exist
  * The other entity which should have an inverse relationship is referenced by the predicate, wikidata-owl:missedInverse
  * Further work would be needed to obtain the property defined as the “inverse”, which is referenced by the predicate, pq:P2306, in the original property's constraint definition
* None-of constraint (Q52558054); Violations indicate that the property value is erroneous, that a better alternative exists, or that the value should not be used for other reasons
  * The invalid property value is referenced by the predicate, wikidata-owl:invalidItem
* One-of constraint (Q21510859); Violations indicate that the property value (the object of the property triple) is not defined as an item from a predefined set, and therefore is erroneous
  * The entity that is referenced by the property but has an invalid value is specified using the predicate, wikidata-owl:invalidItem (which is not declared if the referenced entity is not an instance or subclass of some item)
  * Further work is needed to obtain the allowed instance of/subclass of types, which are specified using the predicate, pq:P2305, in the original property’s constraint definition
* Item-requires-statement constraint (Q21503247); Violations indicate that the item using the property does not itself declare a triple with another, specific predicate, or that the triple is declared but its property value is not one of a predefined set (as an example of the latter, if an item’s Google Knowledge Graph ID (P2671) value is a string and not a graph reference, then the item would carry this violation) 
  * The missing property declaration or declaration with an invalid item value is specified using the predicate, wikidata-owl:missingOrInvalidProperty
  * The allowed item value(s) for the wikidata-owl:missingOrInvalidProperty are provided using the predicate, wikidata-owl:allowedValue
* Single-value constraint (Q19474404); Violations indicate that more than one value is defined for a property that should be single-valued
  * There is no optional_other_property_or_value
* Subject-type constraint (Q21503250); Violations indicate that the referencing entity (the subject of the property triple) is not a subclass or instance of one of the required types
  * The allowed item value(s) are referenced by the predicate, wikidata-owl:allowedValue
  * The current, invalid type(s) for the item are referenced by the predicate, wikidata-owl:invalidItem (which is not declared if the referenced entity is not an instance or subclass of some item)
* Symmetric constraint (Q21510862); Violations indicate that the subject/object of the property triple should also be defined reversing the order (e.g., object – property – subject), but is not
  * The other entity which is missing a triple relating it to the original subject is referenced by the predicate, wikidata-owl:missedSymmetric
* Unique-value constraint (Q21502410); Violations indicate that the combination of the property and value should be used by only one item, but is used by several
  * The duplicated value is given by the predicate, wikidata-owl:duplicatedValue, or if the value is a string, by the predicate, wikidata-owl:duplicatedStringValue
  * Note that a violation triple is defined for _each item_ that uses the property-value pair, which enables querying for all entities with that value 
* Value-type constraint (Q21510865); Violations indicate that the referenced entity (the object of the property triple) is not a subclass or instance of one of the required types
  * The allowed item value(s) are referenced by the predicate, wikidata-owl:allowedValue
  * The invalid referenced object is identified using the predicate, wikidata-owl:invalidItem
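
The mapping from report headings to the constraint types listed above can be pictured as a small lookup table. The sketch below is only illustrative: the names `CONSTRAINT_MARKERS` and `classify` are hypothetical, and the notebook itself uses an if/elif chain rather than a table. The marker strings are the substrings that the code looks for in each heading, and the second sample heading is made up.

```python
# (marker found in the report heading, constraint-type QID); the first match wins,
# so the order mirrors the order in which the markers are tested
CONSTRAINT_MARKERS = [
    (': Entity types', 'Q52004125'),    # Allowed-entity-type
    (': Conflicts with', 'Q21502838'),  # Conflicts-with
    (': Contemporary', 'Q25796498'),    # Contemporary
    (': Format', 'Q21502404'),          # Format
    (': Integer', 'Q52848401'),         # Integer
    (': Inverse', 'Q21510855'),         # Inverse
    (': None of', 'Q52558054'),         # None-of
    (': One of', 'Q21510859'),          # One-of
    (') one of', 'Q21503247'),          # Item-requires-statement (set of allowed values)
    (' = ', 'Q21503247'),               # Item-requires-statement (single expected value)
    (': Single value', 'Q19474404'),    # Single-value
    (': Type', 'Q21503250'),            # Subject-type
    (': Symmetric', 'Q21510862'),       # Symmetric
    (': Unique value', 'Q21502410'),    # Unique-value
    (': Value type', 'Q21510865'),      # Value-type
]


def classify(heading):
    """Return the constraint-type QID for a report heading, or None if unsupported."""
    for marker, qid in CONSTRAINT_MARKERS:
        if marker in heading:
            return qid
    return None


print(classify('sex or gender (P21): Conflicts with located in the '
               'administrative territorial entity (P131)'))
print(classify('example property (P1): Single value'))   # made-up heading
```

A heading that contains none of the markers yields `None`, which corresponds to the violations that are skipped.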


In [1]:
from bs4 import BeautifulSoup
from rdflib import Literal
from rfc3987 import parse
import requests
import urllib.parse

violations_url = \
    'https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/'\
    'Mandatory_constraints/Violations'

prefixes = '@prefix wd: <http://www.wikidata.org/entity/> .\n'\
           '@prefix wikidata-owl: <urn:wikidata-owl#> .\n'

rdf_star_details = '{ \n'\
                   'optional_details'\
                   '    wikidata-owl:violationType wd:QViolationType ;\n'\
                   '    wikidata-owl:constraintText violation_text \n}'

def find_entities(input_string) -> list:
    entities = []
    for entity_string in input_string.split('('):   
        if ')' not in entity_string or \
            entity_string[0] not in ('Q', 'P', 'L') or not entity_string[1].isdigit():
            # Account for parentheses in the text that do not refer to items, properties or lexemes
            continue
        entities.append(entity_string.split(')')[0])
    return entities

def process_anchor(anchor_item) -> str:
    anchor_item = anchor_item.replace('Lexeme:', '').replace('Property:', '')
    if anchor_item.startswith('/wiki/'):
        return f'wd:{anchor_item.replace("/wiki/", "")}'
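    elif anchor_item.startswith('http'):
        if '\\' in anchor_item:    # Invalid backslash in anchor
            anchor_item = anchor_item.replace('\\', '')
        try:
            # Percent-encode unencoded spaces, etc. in anchors, but keep the URL's reserved
            # characters and existing escapes; the default safe='/' would also encode the
            # ':' after the scheme, so every URL would fail validation
            url_anchor_item = urllib.parse.quote(anchor_item, safe=":/?#[]@!$&'()*+,;=%")
            parse(url_anchor_item, rule='IRI')
        except ValueError:
            # Instead of trying to fix the URL, just encode it as an (escaped) string
            return Literal(anchor_item).n3()
        return f'<{url_anchor_item}>'
    else:
        # Literal.n3() escapes quotes and backslashes that would otherwise break the Turtle
        return Literal(anchor_item).n3()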
    
def get_violations():
    resp = requests.get(violations_url)
    if resp.status_code == 200:    
        html = resp.text
        soup = BeautifulSoup(html, "html.parser")
        return soup.body
    else:
        # Fail here rather than returning None, which would only surface later as an AttributeError
        raise RuntimeError(
            f'** Failed to retrieve constraint violations, response code {resp.status_code}')

def get_violation_data(violation) -> list:
    next_element = violation.next_element
    if next_element.name == 'ul':
        violation_data = []
        list_items = next_element.find_all('li')
        for list_item in list_items:
            all_anchors = list_item.find_all('a')
            anchors = []
            for anchor in all_anchors:
                anchors.append(process_anchor(anchor['href']))
            violation_data.append('->'.join(anchors))
        return violation_data
    return get_violation_data(next_element)
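
The parenthesised identifiers that `find_entities` pulls out of a heading can also be matched with a regular expression. This is only a sketch of an alternative, not the notebook's method: the name `find_entities_regex` is hypothetical, and the pattern is stricter than the original because it assumes that each identifier is a single Q, P or L followed only by digits, alone inside its parentheses.

```python
import re

# An item (Q), property (P) or lexeme (L) identifier that fills a pair of parentheses
ENTITY_PATTERN = re.compile(r'\(([QPL]\d+)\)')


def find_entities_regex(input_string):
    """Return the identifiers found in parentheses, in order of appearance."""
    return ENTITY_PATTERN.findall(input_string)


text = ('sex or gender (P21): Conflicts with located in the '
        'administrative territorial entity (P131)')
print(find_entities_regex(text))   # ['P21', 'P131']
```

Parenthesised text that is not an identifier, such as "(disambiguation)", is ignored by both approaches.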

In [2]:
violations_body = get_violations()

In [3]:
with open('constraint_violations.ttl', 'w') as turtle:
    turtle.write(prefixes)
    for violation in violations_body.find_all('h3'):
        violation_header = violation.find('span', class_="mw-headline")
        if violation_header is None:
            continue  
        # Address invalid characters (such as double quotes) in the violation string
        violation_text = Literal(violation_header['id'].replace('_', ' ')).n3()
        if not (': Conflicts with' in violation_text or ': Contemporary' in violation_text or 
                ': Entity types' in violation_text or ': Format' in violation_text or 
                ': Integer' in violation_text or ': Inverse' in violation_text or 
                ': None of' in violation_text or ': One of' in violation_text or 
                ') one of' in violation_text or ': Single value' in violation_text or 
                ': Symmetric' in violation_text or ': Type' in violation_text or 
                ': Unique value' in violation_text or ': Value type' in violation_text or 
                ' = ' in violation_text):
            continue
        # Property that is violated identified as the first referenced entity in the violation text, '(Pxxx):'
        referenced_entities = find_entities(violation_text) 
        rdf_star_ttl = rdf_star_details.replace('violation_text', violation_text)
        for anchor_details in get_violation_data(violation):
            anchors = anchor_details.split('->')  # Violating item is always identified by the first anchor
            if ': Entity types' in violation_text:
                # 'Allowed-entity-type' constraint, wd:Q52004125
                # Property that is violated (that is not valid for use) identified as referenced_entities[0]
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q52004125')\
                                       .replace('optional_details', '')
            elif ': Conflicts with' in violation_text:
                # 'Conflicts-with' constraint, wd:Q21502838
                # Property that is violated (one of the conflicting properties) identified as referenced_entities[0]
                # Other conflicting property identified as referenced_entities[1]
                optional_details = f'    wikidata-owl:conflictingProperty wd:{referenced_entities[1]} ;\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21502838')\
                                       .replace('optional_details', optional_details)
            elif ': Contemporary' in violation_text:
                # 'Contemporary' constraint, wd:Q25796498
                # Property that is violated (whose value must be contemporary) identified as referenced_entities[0]
                # Non-contemporary item is identified as anchors[1]
                optional_details = f'    wikidata-owl:nonContemporaryWith {anchors[1]} ;\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q25796498')\
                                       .replace('optional_details', optional_details)
            elif ': Format' in violation_text:
                # 'Format' constraint, wd:Q21502404 
                # Property that is violated (whose value has an invalid format) identified as referenced_entities[0]
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21502404')\
                                       .replace('optional_details', '')
            elif ': Integer' in violation_text:
                # 'Integer' constraint, wd:Q52848401
                # Property that is violated (whose value is an invalid integer) identified as referenced_entities[0]
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q52848401')\
                                       .replace('optional_details', '')
            elif ': Inverse' in violation_text:
                # 'Inverse' constraint, wd:Q21510855
                # Property that is violated (where the object reference does not have an inverse relationship triple) 
                #   identified as referenced_entities[0]
                # Item missing the inverse is identified as anchors[1]
                optional_details = f'    wikidata-owl:missedInverse {anchors[1]} ;\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21510855')\
                                       .replace('optional_details', optional_details)
            elif ': None of' in violation_text:
                # 'None-of' constraint, wd:Q52558054
                # Property that is violated (that has an invalid value) identified as referenced_entities[0]
                # Invalid value identified as anchors[1]
                optional_details = f'    wikidata-owl:invalidItem {anchors[1]} ;\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q52558054')\
                                       .replace('optional_details', optional_details)
            elif ': One of' in violation_text:
                # 'One-of' constraint, wd:Q21510859
                # Property that is violated (that references an entity of an invalid type) 
                #   identified as referenced_entities[0]
                # Property value(s) that are invalid identified as anchors[1], [2], ...
                invalid_list = []
                for anchor in anchors[1:]:
                    invalid_list.append(f'    wikidata-owl:invalidItem {anchor} ;')
                if not invalid_list:
                    optional_details = ''
                else:
                    optional_details = '\n'.join(invalid_list) + '\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21510859')\
                                       .replace('optional_details', optional_details)
            elif ') one of' in violation_text:
                # 'Item-requires-statement' constraint, wd:Q21503247
                # Property that is violated (that is present but which is missing a corresponding property 
                #   or the corresponding property's referenced entity has an invalid type) identified 
                #   as referenced_entities[0]
                # Missing, corresponding property or property with an invalid value type identified 
                #   as referenced_entities[1]
                # Allowed types for the value of the corresponding property identified as 
                #   referenced_entities[2], [3], ...
                allowed_list = []
                for referenced_ent in referenced_entities[2:]:
                    allowed_list.append(f'    wikidata-owl:allowedValue wd:{referenced_ent} ;')
                optional_details = '\n'.join(allowed_list) + '\n' + \
                                   f'    wikidata-owl:missingOrInvalidProperty wd:{referenced_entities[1]} ;\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21503247')\
                                       .replace('optional_details', optional_details)
            elif ' = ' in violation_text:
                # 'Item-requires-statement' constraint, wd:Q21503247
                # Property that is violated (that is present but which is missing a corresponding property or the
                #   corresponding property is present but its referenced entity has an invalid type) 
                #   identified as referenced_entities[0]
                # Missing, corresponding property or property with an invalid value type identified 
                #   as referenced_entities[1]
                # Expected value for the corresponding property identified as referenced_entities[2]
                # Current/invalid value(s) for the corresponding property identified as 
                #   anchors[1], [2], ...
                invalid_list = []
                for anchor in anchors[1:]:
                    invalid_list.append(f'    wikidata-owl:invalidItem {anchor} ;')
                if not invalid_list:
                    optional_details = f'    wikidata-owl:missingOrInvalidProperty wd:{referenced_entities[1]} ;\n'\
                                       f'    wikidata-owl:allowedValue wd:{referenced_entities[2]} ;\n'
                else:
                    optional_details = '\n'.join(invalid_list) + '\n' + \
                                       f'    wikidata-owl:missingOrInvalidProperty wd:{referenced_entities[1]} ;\n'\
                                       f'    wikidata-owl:allowedValue wd:{referenced_entities[2]} ;\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21503247')\
                                       .replace('optional_details', optional_details)
            elif ': Single value' in violation_text:
                # 'Single value' constraint, wd:Q19474404
                # Property that is violated (that has multiple values) identified as referenced_entities[0]
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q19474404')\
                                       .replace('optional_details', '')
            elif ': Type' in violation_text:
                # 'Subject-type' constraint, wd:Q21503250
                # Property that is violated (that is invalid because the violating item is not of a specific 
                #   set of types) identified as referenced_entities[0]
                # Allowed types for the item identified as referenced_entities[1], [2], ...
                # The current type(s) of the item identified in anchors[1], [2], ...
                allowed_list = []
                for referenced_ent in referenced_entities[1:]:
                    allowed_list.append(f'    wikidata-owl:allowedValue wd:{referenced_ent} ;')
                invalid_list = []
                for anchor in anchors[1:]:
                    invalid_list.append(f'    wikidata-owl:invalidItem {anchor} ;')
                if not invalid_list:
                    optional_details = '\n'.join(allowed_list) + '\n'
                else:
                    optional_details = '\n'.join(allowed_list) + '\n' + '\n'.join(invalid_list) + '\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21503250')\
                                       .replace('optional_details', optional_details)
            elif ': Symmetric' in violation_text:
                # 'Symmetric' constraint, wd:Q21510862
                # Property that is violated (for which a symmetric relationship should exist) 
                #   identified as referenced_entities[0]
                # Item that is missing the symmetric relationship is identified as anchors[1]
                optional_details = f'    wikidata-owl:missedSymmetric {anchors[1]} ;\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21510862')\
                                       .replace('optional_details', optional_details)
            elif ': Unique value' in violation_text:
                # 'Unique-value' constraint, wd:Q21502410
                # Violation for the item(s) identified as anchors[1], [2], ...
                # Property that is violated (for which multiple values are defined) 
                #   identified as referenced_entities[0]
                # The value that should be unique, but is not, is identified as anchors[0] (and may be a string)
                if anchors[0].startswith('"'):
                    optional_details = f'    wikidata-owl:duplicatedStringValue {anchors[0]} ;\n'
                else:
                    optional_details = f'    wikidata-owl:duplicatedValue {anchors[0]} ;\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21502410')\
                                       .replace('optional_details', optional_details)
                for anchor in anchors[1:]:
                    turtle.write(
                        f'{anchor} wikidata-owl:violatedProperty {rdf_star} wd:{referenced_entities[0]} .\n')
                rdf_star = ''     # Turtle handled above
            elif ': Value type' in violation_text:
                # 'Value-type' constraint, wd:Q21510865
                # Property that is violated (where its value is not one of a set of predefined types) 
                #   identified as referenced_entities[0]
                # Referenced item which is not of the correct type identified as anchors[1]
                # The allowed types for the referenced item identified as referenced_entities[1], [2], ...
                allowed_list = []
                for referenced_ent in referenced_entities[1:]:
                    allowed_list.append(f'    wikidata-owl:allowedValue wd:{referenced_ent} ;')
                if len(anchors) > 1:
                    optional_details = '\n'.join(allowed_list) + '\n' + f'    wikidata-owl:invalidItem {anchors[1]} ;\n'
                else:
                    optional_details = '\n'.join(allowed_list) + '\n'
                rdf_star = rdf_star_ttl.replace('QViolationType', 'Q21510865')\
                                       .replace('optional_details', optional_details)
            else:
                rdf_star = ''
            if rdf_star:
                turtle.write(
                    f'{anchors[0]} wikidata-owl:violatedProperty {rdf_star} '\
                    f'wd:{referenced_entities[0]} .\n')
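
Once the file has been written, a quick way to gauge the "noisiness" per property is to count the statements by their violated property. The sketch below is a rough text-based tally rather than a Turtle parser: it assumes that every statement ends with a line of the form `} wd:Pnnn .`, as produced by the template above. The function name `count_violations_by_property` is hypothetical, and the second statement in the sample is made up.

```python
import re
from collections import Counter

# Assumed closing line of each statement: "} wd:Pnnn ."
STATEMENT_END = re.compile(r'^\} (wd:P\d+) \.$', re.MULTILINE)


def count_violations_by_property(turtle_text):
    """Tally violation statements by the property that is violated."""
    return Counter(STATEMENT_END.findall(turtle_text))


sample = '\n'.join([
    'wd:Q921090 wikidata-owl:violatedProperty {',
    '    wikidata-owl:conflictingProperty wd:P131 ;',
    '    wikidata-owl:violationType wd:Q21502838 ;',
    '    wikidata-owl:constraintText "sex or gender (P21): Conflicts with located in '
    'the administrative territorial entity (P131)"',
    '} wd:P21 .',
    # Made-up statement, only to show a second property in the tally
    'wd:Q921090 wikidata-owl:violatedProperty {',
    '    wikidata-owl:violationType wd:Q19474404 ;',
    '    wikidata-owl:constraintText "example property (P999999999): Single value"',
    '} wd:P999999999 .',
])
counts = count_violations_by_property(sample)
print(counts.most_common())
```

To run it against the real output, the text could be read with `open('constraint_violations.ttl').read()` and passed to the same function.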