# Wikidata PropertyFinder

In [1]:
from PropertyFinder2 import PropertyFinder
f = PropertyFinder()

## Goal Description

Combined with T2WML, this project (`PropertyFinder`) tries to link dataset columns to wikidata properties.

Ideally, `PropertyFinder` will automatically suggest a property based on the `header name` and annotation fields (e.g. `role` & `type`). If ambiguities remain, users should be able to refine their query string/type/scope so that the desired property appears in the top 5 of the returned ranked list.




## Summary of results

Using a combination of string similarity and usage counts, with the `world-modelers` dataset as a development set, the current `PropertyFinder` ranks the expected property first for `63%` of columns and within the top 5 for `67%`. If users are allowed to refine their query string to reduce ambiguity, the expected property ranks first for `86%` of columns and within the top 5 for `94%`.

## Inputs

Currently, `PropertyFinder` accepts the following parameters:
1. label (aka query string): column name, or user-defined input. 
   - Note: Due to a restriction of the KGTK-search API, only the first 10 characters are accepted by the index. The consequences are discussed in the failure analysis below.

2. type_: wikidata type of the target property, for example quantity, item, time, etc.

3. scope: qualifier/main value
4. constraint: If input is a qualifier, the main property it modifies

There are some other inputs, which are not used in this report.

## Finding Relevant Properties

1. Send a query to `kgtk/search` and obtain a list of properties, filter by type
2. Generate properties related to those in the list, using P1696, P1647, P6609, and P1659 (see also), and filter by type. The candidates are organized into layers:
    - Layer 1: Properties returned by the index (highest priority)
    - Layer 2: Properties related through P1696, P1647, or P6609
    - Layer 3: Properties related through P1659 (see also)
3. Properties that break one of the constraints (e.g. scope) are moved to Layer 4 (lowest priority)
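
The layering described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual `PropertyFinder` code: the function `assign_layers`, the `related` lookup table, and the toy links are hypothetical, and type filtering is left out for brevity.

```python
from typing import Dict, List, Set

# Relation groupings mirror the list above.
STRONG_RELATIONS = ("P1696", "P1647", "P6609")  # Layer 2
SEE_ALSO = "P1659"                               # Layer 3


def assign_layers(
    index_hits: List[str],
    related: Dict[str, Dict[str, List[str]]],
    violates_constraint: Set[str],
) -> Dict[str, int]:
    """Map each candidate property to its layer (1 = highest priority).

    `related[pid][relation]` is a hypothetical lookup table listing the
    properties linked from `pid` through `relation`.
    """
    layers: Dict[str, int] = {}

    def add(pid: str, layer: int) -> None:
        # Constraint violations are always demoted to Layer 4.
        if pid in violates_constraint:
            layer = 4
        # Keep the best (lowest) layer if a property is reached more than once.
        layers[pid] = min(layer, layers.get(pid, layer))

    for pid in index_hits:
        add(pid, 1)
    for pid in index_hits:
        links = related.get(pid, {})
        for relation in STRONG_RELATIONS:
            for other in links.get(relation, []):
                add(other, 2)
        for other in links.get(SEE_ALSO, []):
            add(other, 3)
    return layers


# Toy input: the property IDs are real, but the links are made up for illustration.
hits = ["P2076"]
links = {"P2076": {"P1647": ["P6879"], "P1659": ["P6591"]}}
print(assign_layers(hits, links, violates_constraint={"P2076"}))
```

Keeping the layer as a plain integer makes it easy to sort first by layer and then by any per-property score.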

## Ranking Properties
After PropertyFinder fetches and compiles the list of relevant properties, it will rank each property by two metrics:
1. Similarity between the **names** (`label` & `aliases`) of the property and the **query string**
2. Current property usages in wikidata

The intuition behind the ranking is simple:
1. If the property label/alias is very similar to the query string, the property should be ranked higher.
2. If the property is used more often in wikidata, it should be ranked higher. However, a large usage count could dwarf the similarity measure, so we take the log of the counts.

Let $p$ be a property from wikidata, $s$ be the query string, and $scope$ be the required scope (qualifier/main value) of the property. The score of $p$ can be expressed as:

$$\text{score}(p,s,scope) = \text{reduce}(sim(p_{names}, s)) \times ln (\text{count}(p|scope)) $$

`reduce` is simply a function to summarize the list of similarities. It can be mean or max, or some other weighting functions. Max is used in this project.

`sim` is calculated by `difflib.SequenceMatcher`. Other similarity measures can also be applied.

The ranking formula currently in this project is
$$\text{score}(p,s,scope) = \text{max}(sim(p_{names}, s)) \times ln (\text{count}(p|scope)) $$

Finally, within each layer, the properties are sorted by `score` in descending order.
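
A minimal sketch of this score, assuming the names are compared case-insensitively and that a property with no usages scores zero (both are assumptions, as are the function names and the placeholder counts):

```python
import math
from difflib import SequenceMatcher
from typing import Iterable


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] from difflib (case-folded here by assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(names: Iterable[str], query: str, count: int) -> float:
    """Max over names of sim(name, query), multiplied by ln(count)."""
    if count < 1:
        return 0.0  # assumption: ln is undefined at 0, so unused properties score zero
    best = max(similarity(name, query) for name in names)
    return best * math.log(count)


# Usage counts below are placeholders, not real wikidata statistics.
candidates = {
    "P6591": (["maximum temperature record"], 1000),
    "P2076": (["temperature"], 1000),
}
ranked = sorted(
    candidates,
    key=lambda pid: score(candidates[pid][0], "Temperature", candidates[pid][1]),
    reverse=True,
)
print(ranked)
```

With equal counts, the ordering is decided by string similarity alone; with a count of 1 the score is zero regardless of similarity, because $\ln 1 = 0$.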

### Test Data

To examine the performance of PropertyFinder, I looked up the `world modeler` datasets from December, collected their column header names, roles, and types, and inferred the ground-truth properties by consulting `wikidata.org`.

Note that a dataset may contain multiple spreadsheets, and the spreadsheets may be very similar. Columns with the same header name are counted only once per dataset, so a header such as `admin1` that appears in several datasets may still be counted multiple times.

In [2]:
import pandas as pd
test_data = pd.read_csv('world_modelers.csv')
test_data = test_data.where(test_data.notnull(), None)
print('Number of columns fetched:', len(test_data))
print('Number of columns with inferred properties:', test_data['expected property'].apply(lambda x: x is not None).sum())

Number of columns fetched: 101
Number of columns with inferred properties: 51


About half of the columns are assigned a wikidata property, which is good.

In [3]:
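# Copy the filtered rows so that columns can be added later without a SettingWithCopyWarning
test_data_wlabel = test_data[test_data['expected property'].apply(lambda x: x is not None)].copy()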
test_data_wlabel.head(9)

Unnamed: 0,dataset_id,cname,role,type,expected property,label*,fixed name
1,ACLED,event_date_formatted,qualifier,time,P585,,event_date_formatted
2,ACLED,year,qualifier,time,P585,,year
13,ACLED,country,main value,country,P17,,country
14,ACLED,admin1,qualifier,location,P131,,admin1
15,ACLED,admin2,qualifier,location,P131,,admin2
16,ACLED,admin3,qualifier,location,P131,,admin3
18,ACLED,latitude,qualifier,globe-coordinate,P625,,latitude
19,ACLED,longitude,qualifier,globe-coordinate,P625,,longitude
23,ACLED,fatalities,main value,quantity,P1120,number of deaths,fatalities


Most of the columns map to `P585`, `P17`, or `P131`, but some variable columns also map to wikidata properties.

---
### Metrics

Let's design a few metrics for our task. Since users can manipulate the query string and choose whether to apply the returned property to the column, recall is considered more important than precision here. For each column from the `world modeler` dataset, we check whether the ground truth appears among the top-ranked properties returned by our property finder. Here are the metrics:
- `find@1`: Whether the ground truth is the top-ranked property returned by the finder. This metric measures performance when the property finder assigns properties automatically, without user input.
- `find@5`: Whether the ground truth appears among the top five ranked properties returned by the finder. This metric measures whether the finder can provide satisfactory search results for the user.
- `no find`: Whether the ground truth is absent from the ranked properties returned by the finder. This metric measures how often the finder fails to identify the expected property at all.

The result is as follows:

In [4]:
def find_index(lst, node):
    if node in lst:
        return lst.index(node)
    return -1

In [5]:
test_data_wlabel['candidates'] = test_data_wlabel.apply(lambda r: [x[0] for x in f.generate_top_candidates(r['cname'], type_=r['type'], scope=r['role'], size=100)], axis=1)
print('find@1:', test_data_wlabel.apply(lambda r: r['expected property'] in r['candidates'][:1], axis=1).sum() / len(test_data_wlabel))
print('find@5:', test_data_wlabel.apply(lambda r: r['expected property'] in r['candidates'][:5], axis=1).sum() / len(test_data_wlabel))
print('no find', test_data_wlabel.apply(lambda r: r['expected property'] not in r['candidates'], axis=1).sum() / len(test_data_wlabel))

find@1: 0.6274509803921569
find@5: 0.6666666666666666
no find 0.29411764705882354
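
As a cross-check of how the three metrics relate, the sketch below computes them from a per-row rank on three rows copied from outputs in this notebook. The helper names `rank_of` and `summarize` are hypothetical and not part of `PropertyFinder`.

```python
from typing import List, Optional, Tuple


def rank_of(expected: str, candidates: List[str]) -> Optional[int]:
    """0-based rank of the expected property, or None if it is absent."""
    return candidates.index(expected) if expected in candidates else None


def summarize(rows: List[Tuple[str, List[str]]]) -> dict:
    """Compute find@1, find@5 and no-find rates from (expected, candidates) pairs."""
    ranks = [rank_of(expected, cands) for expected, cands in rows]
    n = len(ranks)
    return {
        "find@1": sum(r is not None and r < 1 for r in ranks) / n,
        "find@5": sum(r is not None and r < 5 for r in ranks) / n,
        "no find": sum(r is None for r in ranks) / n,
    }


# Three rows taken from outputs elsewhere in this notebook.
rows = [
    ("P131", ["P131", "P137", "P159"]),
    ("P2076", ["P6591", "P3252", "P7422", "P5066", "P3253", "P3251", "P2076"]),
    ("P585", []),
]
print(summarize(rows))
```

The second row is found, but only at rank 7, so it counts toward neither `find@5` nor `no find`. This is why the two rates need not sum to 1.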


Here we list the columns for which the expected wikidata property does not appear among the candidates returned by the property finder.

In [6]:
test_data_wlabel[test_data_wlabel.apply(lambda r: r['expected property'] not in r['candidates'], axis=1)]

Unnamed: 0,dataset_id,cname,role,type,expected property,label*,fixed name,candidates
25,CHIRPS,DateTime,qualifier,time,P585,,Date Time,[]
28,ERA5,DateTime,qualifier,time,P585,,Date Time,[]
31,ERA5,Mean Temperature,main value,quantity,P2076,temperature,Temperature,[]
32,MERRA2,DateTime,qualifier,time,P585,,Date Time,[]
37,TerraClimate,DateTime,qualifier,time,P585,,Date Time,[]
47,Kimetrica_Ethiopia_Crop_Production,Zone,qualifier,location,P131,,admin,"[P421, P3610, P8194, P8193]"
51,Kimetrica_Ethiopia_Crop_Production,Area in hectare,main value,quantity,P2046,area,area,[]
53,Kimetrica_Ethiopia_Crop_Production,Yield,main value,quantity,P2197,production rate,production,"[P2145, P5529, P5677]"
63,FAO_Locust,STARTDATE,qualifier,time,P580,,START DATE,[]
64,FAO_Locust,FINISHDATE,qualifier,time,P582,,END DATE,[]


### Failure Analysis

We can divide the failure cases into the following two categories.

#### 1. Some column names do not give information about the column
For example, indices `81`, `83`, `84` are all about locations (admin), and the `KGTK-SEARCH` API cannot return relevant properties because the input is too vague. This is easy for users to fix: they have domain knowledge about the dataset and can therefore supply more informative query strings. For example, we can change `ADM1_NAME` at index `84` to a clearer name such as `admin1` so that the finder returns relevant properties.

Indices `25`, `28`, `32`, `37`, `63`, `64`, `51`, `67`, `73` also fall into this category.
Indices `47` and `53` use aliases for the property that are not present in wikidata.

In [7]:
# This one returns nothing
f.generate_top_candidates('ADM1_NAME', scope='qualifier', type_='location')

[]

In [8]:
# This one returns the correct properties
f.generate_top_candidates('admin1', scope='qualifier', type_='location', size=3)

[('P131',
  'located in the administrative territorial entity',
  6.62258060978968),
 ('P137', 'operator', 5.074493790515764),
 ('P159', 'headquarters location', 3.419636959130033)]

#### 2. Truncation issues with the `KGTK-SEARCH` API
For index `31`, the issue is truncation: the `KGTK-SEARCH` API currently only supports search strings of at most 10 characters. Longer strings are truncated to 10 characters to comply with this requirement. Hence, the query string `Mean Temperature` is converted to `Mean Tempe`, which does not give the finder much to work with.

Hence, users should keep the query string to at most 10 characters.
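
The effect of the limit can be previewed before sending a query. The snippet below is a small sketch that only mimics the truncation described above; `MAX_QUERY_LEN`, `effective_query`, and `is_truncated` are hypothetical names and do not call the KGTK-search API.

```python
MAX_QUERY_LEN = 10  # limit described above for the KGTK-search index


def effective_query(query: str) -> str:
    """The prefix of `query` that survives truncation."""
    return query[:MAX_QUERY_LEN]


def is_truncated(query: str) -> bool:
    """True if `query` is longer than the limit."""
    return len(query) > MAX_QUERY_LEN


for q in ("Mean Temperature", "Temperature", "admin1"):
    note = "(truncated)" if is_truncated(q) else ""
    print(repr(q), "->", repr(effective_query(q)), note)
```

Note that a single word slightly over the limit, such as `Temperature`, is cut to `Temperatur`, which is still a prefix of the intended word, whereas `Mean Temperature` is cut in the middle of its second word.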

In [9]:
# Returns nothing
f.generate_top_candidates('Mean Temperature', scope='qualifier', type_='location', size=3)

[]

##### KGTK-SEARCH does not accept query strings longer than 10 characters

In [10]:
# Use the API directly; returns nothing because of truncation
f._query('Mean Temperature')

[]

---

### Improvement
After users fix their query strings, we can run the property finder again to see how the search results improve.

In [11]:
test_data_wlabel['candidates2'] = test_data_wlabel.apply(lambda r: [x[0] for x in f.generate_top_candidates(r['fixed name'], type_=r['type'], scope=r['role'], size=100)], axis=1)
print('find@1:', test_data_wlabel.apply(lambda r: r['expected property'] in r['candidates2'][:1], axis=1).sum() / len(test_data_wlabel))
print('find@5:', test_data_wlabel.apply(lambda r: r['expected property'] in r['candidates2'][:5], axis=1).sum() / len(test_data_wlabel))
print('no find', test_data_wlabel.apply(lambda r: r['expected property'] not in r['candidates2'], axis=1).sum() / len(test_data_wlabel))

find@1: 0.8627450980392157
find@5: 0.9411764705882353
no find 0.0


The results improve greatly: the expected property now falls within the top 5 candidates in over 90% of cases. What went wrong in the remaining cases (about 6%)?

In [12]:
misses = test_data_wlabel[test_data_wlabel.apply(lambda r: r['expected property'] not in r['candidates2'][:5], axis=1)]
misses

Unnamed: 0,dataset_id,cname,role,type,expected property,label*,fixed name,candidates,candidates2
31,ERA5,Mean Temperature,main value,quantity,P2076,temperature,Temperature,[],"[P6879, P6591, P7422, P5066, P2199, P5067, P21..."
39,TerraClimate,Max Temperature,main value,quantity,P2076,temperature,Max Temperature,"[P6591, P3252, P7422, P5066, P3253, P3251, P2076]","[P6591, P3252, P7422, P5066, P3253, P3251, P2076]"
40,TerraClimate,Min Temperature,main value,quantity,P2076,temperature,Min Temperature,"[P7422, P3251, P6591, P5067, P3253, P3252, P2076]","[P7422, P3251, P6591, P5067, P3253, P3252, P2076]"


### Temperature
The case of temperature is actually an interesting one.

In [13]:
# Without a scope constraint, P2076 ranks second in Layer 1
f.find_property('Temperature', 'quantity')

{1: [('P6879', 'effective temperature', 8.35644007575381),
  ('P2076', 'temperature', 7.283954217570722),
  ('P6591', 'maximum temperature record', 3.662040962227032),
  ('P7422', 'minimum temperature record', 3.3243321142103706),
  ('P5066', 'operating temperature', 2.618534213766516),
  ('P2199', 'autoignition temperature', 2.4111472601006323),
  ('P5067', 'non-operating temperature', 2.200084343834319),
  ('P2113', 'sublimation temperature', 2.0386681781174865),
  ('P3253', 'optimum viable temperature', 1.187688960722281),
  ('P3251', 'minimum viable temperature', 0.5938444803611405),
  ('P3252', 'maximum viable temperature', 0.5938444803611405),
  ('P5682', 'heat deflection temperature', 0.3648143055578659),
  ('P5670', 'glass transition temperature', 0.0)]}

In [14]:
# With scope == main value, P2076 moves to bottom tier for breaking the scope constraint
f.find_property('Temperature', 'quantity', 'main value')

{1: [('P6879', 'effective temperature', 8.3564361737107),
  ('P6591', 'maximum temperature record', 3.662040962227032),
  ('P7422', 'minimum temperature record', 3.3243321142103706),
  ('P5066', 'operating temperature', 2.618534213766516),
  ('P2199', 'autoignition temperature', 2.4111472601006323),
  ('P5067', 'non-operating temperature', 2.200084343834319),
  ('P2113', 'sublimation temperature', 2.019992473226557),
  ('P3253', 'optimum viable temperature', 1.1240224549620734),
  ('P3251', 'minimum viable temperature', 0.3746741516540245),
  ('P3252', 'maximum viable temperature', 0.3746741516540245),
  ('P5682', 'heat deflection temperature', 0.3648143055578659),
  ('P5670', 'glass transition temperature', 0.0)],
 4: [('P2076', 'temperature', 5.7379407355935585)]}

The reason behind this is that, according to wikidata:

> https://www.wikidata.org/wiki/Property:P2076

`P2076` has a scope constraint of `qualifier`. However, judging by the score generated by the property finder, some people clearly use `P2076` as a `main value` (at least $e^{5.73} \approx 308$ usages), and our dataset uses it as a main value as well.
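
The "at least" bound follows from the scoring formula: since the score is a similarity times $\ln(\text{count})$ and the similarity cannot exceed 1, $\ln(\text{count})$ is at least the score. A short check, assuming the Layer 4 score of `P2076` was produced by that formula:

```python
import math

score_as_main_value = 5.7379407355935585  # Layer 4 score of P2076 in the output above
max_similarity = 1.0                      # upper bound of the similarity measure

# score = sim * ln(count) with sim <= 1, so ln(count) >= score / max_similarity.
min_count = math.exp(score_as_main_value / max_similarity)
print(round(min_count))
```

If the actual similarity between the query and the property name is below 1, the real usage count is higher than this bound.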

It is debatable whether a rule-based system built on constraints is ideal, since some wikidata constraints are not enforced even loosely.