# Import properties from Wikidata into the MaRDI Portal

Used [Protege online](https://webprotege.stanford.edu/) to model the required classes and properties.

The resulting model was saved in OWL/XML format (`data/knowledge-graph.owl`).

Used [WebVOWL](http://vowl.visualdataweb.org/webvowl.html) to visualize the model.

![title](data/knowledge-graph.owl.svg)

1. Start a local wikibase, e.g. using [MaRDI4NFDI/portal-compose](https://github.com/MaRDI4NFDI/portal-compose)
2. Increase memory limit by setting `ini_set( 'memory_limit', '1024M' );` in LocalSettings.d/LocalSettings.override.php
3. ~Enable federated properties from Wikidata by setting `$wgWBRepoSettings['federatedPropertiesEnabled'] = true;` in LocalSettings.d/LocalSettings.override.php~
4. Load the required pages from Wikidata by importing the file `data/Wikidata_export.xml` into the local wikibase. This file contains item and property pages that have been exported from Wikidata.

### The pages that need to be imported:
Import `data/Wikidata-properties.xml` to namespace "Main" with interwiki prefix "en"
* instance of (Property:P31)
* zbMATH author ID (Property:P1556)
* zbMATH work ID (Property:P894) 
* swMATH work ID (Property:P6830) 
* formatter URL (Property:P1630)
* published in (Property:P1433)
* DOI (Property:P356)
* describes a project that uses (Property:P4510)

Import `data/Wikidata-items.xml` to namespace **"Item"** with interwiki prefix "en"
* human (Q5)
* software (Q7397)
* scholarly article (Q13442814) 
* academic journal (Q737498)

In [1]:
# load the list of authors
import pandas as pd

# load the list of zbMath authors
authors_df = pd.read_csv('data/all_authors.csv.zip')
#authors_df.head()

Map the columns to MaRDI-Portal properties, reformat the data according to [the CSV file syntax expected by Quickstatements](https://www.wikidata.org/wiki/Help:QuickStatements#CSV_file_syntax). 

In [2]:
from datetime import datetime

import_authors_df = pd.DataFrame()
import_authors_df['qid'] = len(authors_df) * [''] # leave empty to create new item
import_authors_df['Len'] = authors_df['author_name']
import_authors_df['P31'] = len(authors_df) * ['Q5'] # instance of 'human'
import_authors_df['P1556'] = authors_df['author_id'] # zbMath author id
import_authors_df['#'] = len(authors_df) * ['{}: imported from zbMath Open API'.format(datetime.now())]

In [3]:
import_authors_df.head()

Unnamed: 0,qid,Len,P31,P1556,#
0,,"Aardal, Karen",Q5,aardal.karen-i,2022-01-28 15:46:15.264376: imported from zbMa...
1,,"Aarts, Gert",Q5,aarts.gert,2022-01-28 15:46:15.264376: imported from zbMa...
2,,"Abad, Alberto",Q5,abad.alberto-j,2022-01-28 15:46:15.264376: imported from zbMa...
3,,"Abada, Asmaa",Q5,abada.asmaa,2022-01-28 15:46:15.264376: imported from zbMa...
4,,"Abánades, Miguel A.",Q5,abanades.miguel-angel,2022-01-28 15:46:15.264376: imported from zbMa...


In [4]:
# save as csv
import_authors_df.to_csv('data/qs_import_authors.csv', index=False) # suppress index to make valid CSV for import

~Copy and paste the data in the csv into Quickstatements.~ 

Send each table row separately to [Quickstatements](https://www.wikidata.org/wiki/Help:QuickStatements) (to avoid NaN values, see below).

If you started the local wikibase using [MaRDI4NFDI/portal-compose](https://github.com/MaRDI4NFDI/portal-compose), then Quickstatements can be found at http://localhost:8840.

In [26]:
import urllib.parse
QS_URL = 'http://localhost:8840/#/v2' # the Quickstatements instance to be used

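def send_Quickstatements(row):
    """Build and print the Quickstatements URL for one table row (nothing is sent here)."""
    query = ''
    for key, val in row.items():
        if pd.isna(val):
            continue  # skip empty cells, Quickstatements can't handle NaN
        query += '{}|{}|'.format(key, val)  # format() also accepts non-string values
    # URL encode
    query = urllib.parse.quote_plus(query)
    request_url = "{}={}".format(QS_URL, query)
    print(request_url)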
    
"""
Array(
    [action] => run_single_command
    [command] => {
        "action":"create",
        "type":"item",
        "data":{
            "labels":{
                "en":{
                    "language":"en",
                    "value":"Aardal, Karen"
                }
            },
            "claims":[
                {"mainsnak":{
                    "snaktype":"value",
                    "property":"P31",
                    "datavalue":{
                        "type":"wikibase-entityid",
                        "value":{
                            "entity-type":"item",
                            "id":"Q5"
                        }
                    }
                },
                 "type":"statement",
                 "rank":"normal"
                },
                {
                    "mainsnak":{
                        "snaktype":"value",
                        "property":"P1556",
                        "datavalue":{
                            "type":"string",
                            "value":"aardal.karen-i"
                        }
                    },
                    "type":"statement",
                    "rank":"normal"
                }
            ]
        },
        "meta":{
            "message":"",
            "status":"RUN",
            "id":0
        },
        "summary":"#temporary_batch_1643387603932"
    }
    [site] => wikibase-docker
    [last_item] => 
    [tokenKey] => 021eeae4c03ea855d32dfc68f00497c9
    [tokenSecret] => 80246b8460e353f36036f5c237fa01a26c5a9cfe
    [qs_test_wikiUserName] => Admin
    [my_wikiUserName] => Admin
    [qs_test_wikiUserID] => 1
    [VEE] => visualeditor
    [privacy_policy] => "user_accepted_policy"
    [grafana_session] => 83e6cbb09f8cba88f224023b9ba1b5cd
    [_xsrf] => 2|c95c67f9|06640ef7db94e7f8f135a3d2b35c5003|1642498859
    [my_wiki_session] => rou6803452ah756m83p5ok3mgdaqber8
    [my_wikiUserID] => 1
    [quickstatements] => 7a4e35fa3ec7f5cc71d2cd2599ac42a1
)
"""

In [27]:
for _,row in import_authors_df[:10].iterrows():
    send_Quickstatements(row)

http://localhost:8840/#/v2=qid%7C%7CLen%7CAardal%2C+Karen%7CP31%7CQ5%7CP1556%7Caardal.karen-i%7C%23%7C2022-01-28+15%3A46%3A15.264376%3A+imported+from+zbMath+Open+API%7C
http://localhost:8840/#/v2=qid%7C%7CLen%7CAarts%2C+Gert%7CP31%7CQ5%7CP1556%7Caarts.gert%7C%23%7C2022-01-28+15%3A46%3A15.264376%3A+imported+from+zbMath+Open+API%7C
http://localhost:8840/#/v2=qid%7C%7CLen%7C+Abad%2C+Alberto%7CP31%7CQ5%7CP1556%7Cabad.alberto-j%7C%23%7C2022-01-28+15%3A46%3A15.264376%3A+imported+from+zbMath+Open+API%7C
http://localhost:8840/#/v2=qid%7C%7CLen%7CAbada%2C+Asmaa%7CP31%7CQ5%7CP1556%7Cabada.asmaa%7C%23%7C2022-01-28+15%3A46%3A15.264376%3A+imported+from+zbMath+Open+API%7C
http://localhost:8840/#/v2=qid%7C%7CLen%7C+Ab%C3%A1nades%2C+Miguel+A.%7CP31%7CQ5%7CP1556%7Cabanades.miguel-angel%7C%23%7C2022-01-28+15%3A46%3A15.264376%3A+imported+from+zbMath+Open+API%7C
http://localhost:8840/#/v2=qid%7C%7CLen%7C+Abate%2C+J.%7CP31%7CQ5%7CP1556%7Cabate.joseph%7C%23%7C2022-01-28+15%3A46%3A15.264376%3A+imported+from+

## Import the software list
All software entries have already been imported into the MaRDI portal.
Here I will import the first 1000 (out of 40000) software entries into the local wiki for testing.

In [5]:
# load the list of swMath software
software_df = pd.read_csv('data/swMATH-software-list.csv')
software_df = software_df[:1000]
software_df.head()

Unnamed: 0,qid,P13,Len,#
0,,'0',swMATH,initial csv import 2021-12-17
1,,'1',FORTRAN,initial csv import 2021-12-17
2,,'2',SuperLU-DIST,initial csv import 2021-12-17
3,,'3',WHISPAR,initial csv import 2021-12-17
4,,'4',MULTI2D,initial csv import 2021-12-17


Map the columns to MaRDI-Portal properties, reformat the data according to [the CSV file syntax expected by Quickstatements](https://www.wikidata.org/wiki/Help:QuickStatements#CSV_file_syntax). The swMATH software id (column `P13` of the source file) is mapped to the swMATH work ID property (P6830).

In [6]:
import_software_df = pd.DataFrame()
import_software_df['qid'] = len(software_df) * [''] # leave empty to create new item
import_software_df['Len'] = software_df['Len']
import_software_df['P31'] = len(software_df) * ['Q7397'] # instance of 'software'
import_software_df['P6830'] = software_df['P13'] # swMATH work ID
import_software_df['#'] = len(software_df) * ['{}: imported from zbMath Open API'.format(datetime.now())]
import_software_df.head()

Unnamed: 0,qid,Len,P31,P6830,#
0,,swMATH,Q7397,'0',2022-01-28 15:46:37.523766: imported from zbMa...
1,,FORTRAN,Q7397,'1',2022-01-28 15:46:37.523766: imported from zbMa...
2,,SuperLU-DIST,Q7397,'2',2022-01-28 15:46:37.523766: imported from zbMa...
3,,WHISPAR,Q7397,'3',2022-01-28 15:46:37.523766: imported from zbMa...
4,,MULTI2D,Q7397,'4',2022-01-28 15:46:37.523766: imported from zbMa...


In [7]:
# save as csv
import_software_df.to_csv('data/qs_import_software.csv', index=False) # suppress index to make valid CSV for import

Copy and paste the data in the csv into Quickstatements. If you started the local wikibase using [MaRDI4NFDI/portal-compose](https://github.com/MaRDI4NFDI/portal-compose), then Quickstatements can be found at http://localhost:8840.

## Import the papers list
A subsample of the papers list was created in notebook `filter_papers_by_software.ipynb`. This list contains the papers related to the first 1000 software entries in the software list (`data/swMATH-software-list.csv`). The list of papers is in file `data/all_papers.csv.zip`.

In [8]:
# load the list of zbMath papers
papers_df = pd.read_csv('data/all_papers.csv.zip')
#papers_df.head(3)

In [9]:
import_papers_df = pd.DataFrame()
import_papers_df['qid'] = len(papers_df) * [''] # leave empty to create new item
import_papers_df['Len'] = papers_df['document_title']
import_papers_df['P31'] = len(papers_df) * ['Q13442814'] # instance of 'scholarly article'
import_papers_df['P894'] = papers_df['id'] # zbMath work id
import_papers_df['#'] = len(papers_df) * ['{}: imported from zbMath Open API'.format(datetime.now())]
import_papers_df.head()

Unnamed: 0,qid,Len,P31,P894,#
0,,A hybrid approach to solve the high-frequency ...,Q13442814,oai:zbmath.org:7073637,2022-01-28 15:47:04.488126: imported from zbMa...
1,,Computational and numerical challenges in envi...,Q13442814,oai:zbmath.org:5181224,2022-01-28 15:47:04.488126: imported from zbMa...
2,,A domain-decomposing parallel sparse linear sy...,Q13442814,oai:zbmath.org:5969837,2022-01-28 15:47:04.488126: imported from zbMa...
3,,Sparse direct factorizations through unassembl...,Q13442814,oai:zbmath.org:5982908,2022-01-28 15:47:04.488126: imported from zbMa...
4,,2LEV-D2P4: a package of high-performance preco...,Q13442814,oai:zbmath.org:5187737,2022-01-28 15:47:04.488126: imported from zbMa...


### Append the paper-to-software relations
**Papers may use more than one software package**. The relations between papers and software are stored in an additional file, `data/all_papers_software.csv.zip`.

In [10]:
# load the paper-to-software relations
papers_software_df = pd.read_csv('data/all_papers_software.csv.zip')
papers_software_df.head(3)

Unnamed: 0.1,Unnamed: 0,id,software
0,0,oai:zbmath.org:7073637,SuperLU-DIST
1,1,oai:zbmath.org:5181224,SuperLU-DIST
2,2,oai:zbmath.org:5969837,SuperLU-DIST


In [11]:
for i, paper in import_papers_df.iterrows():
    # find all software related to this paper by looking at the zbMath id of the paper
    idx = papers_software_df['id'] == paper['P894'] # zbMATH work ID of the paper
    # append a column for each combination of paper and software
    counter = 0
    for _,paper_software in papers_software_df[idx].iterrows():
        column_name = 'P4510.{}'.format(counter) # some papers are related to multiple software entries: add one column P4510.* per entry
        import_papers_df.loc[i, column_name] = paper_software['software']
        counter += 1

This results in a table with one `P4510.*` column per software slot, i.e. as many columns as the largest number of software entries on a single paper. While some papers have software values in multiple columns, most don't.
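The same numbered columns can also be derived in a single pass over the relation table, which avoids scanning `papers_software_df` once per paper. The following is only a sketch using the standard library; `software_columns` is a hypothetical helper name, and the sample rows merely mirror the shape of `data/all_papers_software.csv.zip`.

```python
from collections import defaultdict


def software_columns(relations):
    """Map each paper id to its numbered P4510 columns.

    `relations` is an iterable of (paper_id, software_name) tuples.
    """
    per_paper = defaultdict(list)
    for paper_id, software in relations:
        per_paper[paper_id].append(software)
    return {
        paper_id: {'P4510.{}'.format(n): name for n, name in enumerate(names)}
        for paper_id, names in per_paper.items()
    }


# sample rows in the shape of the relation table (id, software)
sample = [
    ('oai:zbmath.org:5187737', 'SuperLU-DIST'),
    ('oai:zbmath.org:5187737', '2LEV-D2P4'),
    ('oai:zbmath.org:7073637', 'SuperLU-DIST'),
]
columns = software_columns(sample)
print(columns['oai:zbmath.org:5187737'])
```

With the DataFrame from this notebook, the relations could be passed as `zip(papers_software_df['id'], papers_software_df['software'])`.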

**Currently, Quickstatements can't handle a csv file that has empty values (NaN) in some rows.**

Possible solutions:
* Patch Quickstatements so that it skips NaN
* Send each line separately to Quickstatements
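
Sending each line separately requires the empty cells to be dropped from every row first. A minimal sketch of that step, using only the standard library: `row_to_pairs` is a hypothetical helper (not part of Quickstatements), and it assumes that the numbered columns `P4510.0`, `P4510.1`, ... should all map back to the property `P4510`. The sample row is made up.

```python
import math


def row_to_pairs(row):
    """Return the (column, value) pairs of one table row without empty cells.

    `row` is any mapping of column name to cell value, e.g. a dict or a
    pandas Series (both provide .items()).
    """
    pairs = []
    for key, val in row.items():
        # pandas marks empty cells with float NaN; drop those cells
        if isinstance(val, float) and math.isnan(val):
            continue
        # assumption: 'P4510.0', 'P4510.1', ... all stand for property 'P4510'
        prop = key.split('.', 1)[0] if key.startswith('P') else key
        pairs.append((prop, val))
    return pairs


# made-up sample row with one empty software column
example_row = {
    'qid': '',
    'Len': 'Example paper title',
    'P31': 'Q13442814',
    'P4510.0': 'SuperLU-DIST',
    'P4510.1': '2LEV-D2P4',
    'P4510.2': float('nan'),
}
print(row_to_pairs(example_row))
```

The resulting pairs contain no NaN and can then be joined into one request per row.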

In [12]:
import_papers_df.head(20)

Unnamed: 0,qid,Len,P31,P894,#,P4510.0,P4510.1,P4510.2,P4510.3,P4510.4,P4510.5
0,,A hybrid approach to solve the high-frequency ...,Q13442814,oai:zbmath.org:7073637,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,,,,,
1,,Computational and numerical challenges in envi...,Q13442814,oai:zbmath.org:5181224,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,,,,,
2,,A domain-decomposing parallel sparse linear sy...,Q13442814,oai:zbmath.org:5969837,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,,,,,
3,,Sparse direct factorizations through unassembl...,Q13442814,oai:zbmath.org:5982908,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,,,,,
4,,2LEV-D2P4: a package of high-performance preco...,Q13442814,oai:zbmath.org:5187737,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,2LEV-D2P4,,,,
5,,A shared- and distributed-memory parallel gene...,Q13442814,oai:zbmath.org:5187739,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,,,,,
6,,SPIKE: A parallel environment for solving band...,Q13442814,oai:zbmath.org:5128211,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,,,,,
7,,On fast factorization pivoting methods for spa...,Q13442814,oai:zbmath.org:5134450,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,,,,,
8,,Newton-Krylov-BDDC solvers for nonlinear cardi...,Q13442814,oai:zbmath.org:7055073,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,,,,,
9,,Adjoint parameter sensitivity analysis for the...,Q13442814,oai:zbmath.org:6054722,2022-01-28 15:47:04.488126: imported from zbMa...,SuperLU-DIST,,,,,


# Example queries
Some example SPARQL queries that *should* be possible with the model and can be used to test it:
* List all papers that use a certain software
* List all papers by one author, sort by date
* List all papers published by a certain journal, sort by author
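
As a starting point, the first query could look like the sketch below. It has not been run against the portal and rests on several assumptions: the local query service predefines the `wd:`, `wdt:` and `rdfs:` prefixes, the imported entities keep their Wikidata IDs (P31, P4510, Q13442814), and the P4510 statements point to software items rather than to plain strings. `papers_using_software_query` is a hypothetical helper that only builds the query string.

```python
def papers_using_software_query(software_label):
    """Build a SPARQL query listing all papers that use the given software."""
    # escape backslashes and double quotes so the label is a valid SPARQL string
    label = software_label.replace('\\', '\\\\').replace('"', '\\"')
    return '''
SELECT ?paper ?paperLabel WHERE {
  ?paper wdt:P31 wd:Q13442814 ;   # instance of 'scholarly article'
         wdt:P4510 ?software ;    # describes a project that uses
         rdfs:label ?paperLabel .
  ?software rdfs:label "%s"@en .
  FILTER(LANG(?paperLabel) = "en")
}
''' % label


print(papers_using_software_query('SuperLU-DIST'))
```

The other two queries would follow the same pattern, with the author and journal properties in place of P4510 and an `ORDER BY` clause for the sorting.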
