# Populate the portal using the Wikibase API
Goal of this notebook: import the data prepared in `filter_papers_by_software.ipynb` into the data structure set up in `import_wikidata_properties.ipynb`.

The API documentation is [here](https://www.wikidata.org/w/api.php?action=help&modules=wbeditentity).

In [1]:
# import common definitions and functions
%run WB_common.ipynb

## Login to the Wikibase

### Network settings
* Make sure the Wikibase is running, e.g. using [MaRDI4NFDI/portal-compose](https://github.com/MaRDI4NFDI/portal-compose)
* Make sure this Jupyter notebook is on the same Docker network as the wiki. This is configured in docker-compose:

```
networks:
  default:
    external: true
    name: portal-compose_default
```

Networks can be listed using `docker network ls`. Here, "portal-compose_default" is the name of the network started by portal-compose.
* Verify that this notebook is in the correct network with `docker network inspect portal-compose_default`

The wiki is then accessible from the notebook container at `http://mardi-wikibase`.

In [2]:
import requests
import json 
import configparser

# url of the API endpoint
WIKIBASE_API = 'http://mardi-wikibase/w/api.php?format=json'

def login(username, botpwd):
    """
    Starts a new session and logs in using a bot account.
    @username, @botpwd string: credentials of an existing bot user
    @returns requests.sessions.Session object
    """
    # create a new session
    session = requests.Session()

    # get login token
    r1 = session.get(WIKIBASE_API, params={
        'format': 'json',
        'action': 'query',
        'meta': 'tokens',
        'type': 'login'
    })
    # login with bot account
    r2 = session.post(WIKIBASE_API, data={
        'format': 'json',
        'action': 'login',
        'lgname': username,
        'lgpassword': botpwd,
        'lgtoken': r1.json()['query']['tokens']['logintoken'],
    })
    # raise when login failed
    if r2.json()['login']['result'] != 'Success':
        raise WBAPIException(r2.json()['login'])
        
    return session

### Bot user
* Login to the wiki as admin
* Go to Special:BotPasswords, create a bot user, call it "import", grant it "High-volume editing", "Edit existing pages", "Create, edit, and move pages"
* Copy `data/credentials.tpl` to `data/credentials.ini`. Replace the username and password by those of the newly created bot user (make sure not to commit this file)
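
For orientation, the next cell reads a `default` section with the keys `username` and `password` from `data/credentials.ini`. The sketch below parses a file of that shape from a string. The values are placeholders, and the exact content of `data/credentials.tpl` is an assumption; bot logins created through Special:BotPasswords usually take the form `User@botname`.

```python
import configparser

# Placeholder content mirroring the keys read by this notebook;
# the real data/credentials.tpl may contain additional fields.
EXAMPLE_INI = """
[default]
username = Admin@import
password = replace-with-generated-bot-password
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_INI)

example_username = example_config['default']['username']
example_botpwd = example_config['default']['password']
print(example_username)
```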

In [67]:
# read bot username and password from data/credentials.ini
config = configparser.ConfigParser()
config.read('data/credentials.ini')
username = config['default']['username']
botpwd = config['default']['password']

session = login(username, botpwd)

## Create a wikibase property
A function that creates a new Wikibase property and returns the new id.

If the property label already exists in the wiki, the function does not overwrite it but raises an error.

In [4]:
def get_csrf_token(session):
    """Gets a security (CSRF) token."""
    params1 = {
        "action": "query",
        "meta": "tokens",
        "type": "csrf"
    }
    r1 = session.get(WIKIBASE_API, params=params1)
    token = r1.json()['query']['tokens']['csrftoken']

    return token
    

def create_property(session, data):
    """
    Creates a wikibase property.
    @session requests.sessions.Session: session obtained from login 
    @data python dict: creation parameters of the property
    @returns string: id of the new property
    """
    token = get_csrf_token(session)
    
    params = {
        "action": "wbeditentity",
        "format": "json",
        'new': 'property',
        'data': json.dumps(data),
        'token': token
    }
    r1 = session.post(WIKIBASE_API, data=params)
    # parse the response once (avoid overwriting the r1.json method)
    result = r1.json()
    
    # raise when edit failed
    if 'error' in result:
        raise WBAPIException(result['error'])

    return result['entity']['id']

Creating a property returns an id string of the form 'Px' (where x is a number).

The property can then be seen in the wiki under `http://localhost:8080/wiki/Property:Px`
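
As an illustration of the `data` argument, the following sketch builds a property definition and serializes it the way `create_property` does. The label, description and datatype are hypothetical values chosen for the example; a new property needs a `datatype` in addition to its labels.

```python
import json

# Hypothetical property definition; label, description and datatype
# are illustrative values, not taken from the import data.
example_property = {
    'labels': {'en': {'language': 'en', 'value': 'zbMath author ID'}},
    'descriptions': {'en': {'language': 'en', 'value': 'identifier of an author in zbMath'}},
    'datatype': 'external-id',
}

# wbeditentity expects the entity data as a JSON string
payload = json.dumps(example_property)
print(payload)

# With a logged-in session, the call would be:
# new_pid = create_property(session, example_property)
```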

## Create a wikibase entity
A function that creates a new Wikibase entity (item) and returns the new id.

If the entity label already exists in the wiki, the function does not overwrite it but creates a new entity.

In [5]:
def create_entity(session, data):
    """
    Creates a wikibase entity.
    @session requests.sessions.Session: session obtained from login 
    @data python dict: creation parameters of the entity
    @returns string: id of the new entity
    """
    token = get_csrf_token(session)
    
    params = {
        'action': 'wbeditentity',
        'format': 'json',
        'new': 'item',
        'data': json.dumps(data),
        'token': token
    }
    r1 = session.post(WIKIBASE_API, data=params)
    # parse the response once (avoid overwriting the r1.json method)
    result = r1.json()
    
    # raise when edit failed
    if 'error' in result:
        raise WBAPIException(result['error'])

    return result['entity']['id']

Creating an item returns an id string of the form 'Qx' (where x is a number).

The item can then be seen in the wiki under `http://localhost:8080/wiki/Item:Qx`
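
The import cells in this notebook all build the same nested claim dictionaries by hand. A minimal sketch of two helpers that produce those structures is shown here; `item_claim` and `string_claim` are hypothetical names introduced for illustration and are not defined elsewhere in the notebook, and the example author is made up.

```python
def item_claim(prop, qid):
    """Statement whose value is another item, e.g. instance of (P31) human (Q5)."""
    return {
        'mainsnak': {
            'snaktype': 'value',
            'property': prop,
            'datavalue': {
                'type': 'wikibase-entityid',
                'value': {'entity-type': 'item', 'id': qid},
            },
        },
        'type': 'statement',
        'rank': 'normal',
    }


def string_claim(prop, text):
    """Statement whose value is a plain string, e.g. an external identifier."""
    return {
        'mainsnak': {
            'snaktype': 'value',
            'property': prop,
            'datavalue': {'type': 'string', 'value': text},
        },
        'type': 'statement',
        'rank': 'normal',
    }


# made-up author, same shape as the data passed to create_entity
example_item = {
    'labels': {'en': {'language': 'en', 'value': 'Doe, Jane'}},
    'claims': [item_claim('P31', 'Q5'), string_claim('P1556', 'doe.jane')],
}
print(example_item['claims'][1]['mainsnak']['datavalue'])
```

With a logged-in session, `create_entity(session, example_item)` would then create the item.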

## Import the authors list
Before importing anything, make sure the corresponding items and properties have been imported from wikidata. See notebook `WB_wikidata_properties.ipynb`.

A subsample of the authors list was created in the notebook `filter_papers_by_software.ipynb`. This list contains the authors of papers related to the first 1000 software entries in the software list (`data/swMATH-software-list.csv`). The list of authors is in the file `data/all_authors.csv.zip`.

In [6]:
# load the list of authors
import pandas as pd

# load the list of zbMath authors
authors_df = pd.read_csv('data/all_authors.csv.zip') 
authors_df.head()

Unnamed: 0,author_id,author_name
0,aidun.cyrus-k,"Aidun, Cyrus K."
1,babuska.ivo-m,"Babuška, I."
2,banerjee.uday,"Banerjee, U."
3,bartels.alexander,"Bartels, Alexander"
4,basko.mikhail-m,"Basko, M. M."


Use the create_entity function to import the authors into the wiki.
The Q-id returned by the wikibase is appended to the pandas dataframe of authors.

In [7]:
for i,current in authors_df.iterrows():
    author_name = current['author_name'].strip()
    author_id = current['author_id'].strip()
    data = {
        'labels':{'en':{'language':'en','value':author_name}},
        'claims': [
            # instance of 'human'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q5'}}},
            'type': 'statement', 'rank': 'normal'},
            # zbMath author id
            {'mainsnak':{
                'snaktype':'value', 'property':'P1556', 'datavalue': {'type':'string', 'value': author_id}},
            'type': 'statement', 'rank': 'normal'}
            ]
    }
    # import into wikibase, save Qid
    authors_df.loc[i, 'qid'] = create_entity(session, data)

In [8]:
authors_df.head()

Unnamed: 0,author_id,author_name,qid
0,aidun.cyrus-k,"Aidun, Cyrus K.",Q1583
1,babuska.ivo-m,"Babuška, I.",Q1584
2,banerjee.uday,"Banerjee, U.",Q1585
3,bartels.alexander,"Bartels, Alexander",Q1586
4,basko.mikhail-m,"Basko, M. M.",Q1587


## Import the software list
All software entries have already been imported into the MaRDI portal.
Here, I import only the first 10 (out of 40000) software entries into the local wiki for testing.

In [9]:
MAX_ENTRIES = 10 # number of software entries to process

# load the list of swMath software
software_df = pd.read_csv('data/swMATH-software-list.csv')
software_df = software_df[:MAX_ENTRIES]
software_df.head()

Unnamed: 0,qid,P13,Len,#
0,,'0',swMATH,initial csv import 2021-12-17
1,,'1',FORTRAN,initial csv import 2021-12-17
2,,'2',SuperLU-DIST,initial csv import 2021-12-17
3,,'3',WHISPAR,initial csv import 2021-12-17
4,,'4',MULTI2D,initial csv import 2021-12-17


Use the create_entity function to import the software into the wiki.
The Q-id returned by the wikibase is appended to the pandas dataframe of software.

In [10]:
for i,current in software_df[:MAX_ENTRIES].iterrows():
    software_name = current['Len'].strip()
    software_id = current['P13'].strip().replace("'", '')

    data = {
        'labels':{'en':{'language':'en','value':software_name}},
        'claims': [
            # instance of 'software'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q7397'}}},
            'type': 'statement', 'rank': 'normal'},
            # swMath work id
            {'mainsnak':{
                'snaktype':'value', 'property':'P6830', 'datavalue': {'type':'string', 'value': software_id}},
            'type': 'statement', 'rank': 'normal'}
            ]
    }
    # import into wikibase, save Qid
    software_df.loc[i, 'qid'] = create_entity(session, data)

In [11]:
software_df.head()

Unnamed: 0,qid,P13,Len,#
0,Q1751,'0',swMATH,initial csv import 2021-12-17
1,Q1752,'1',FORTRAN,initial csv import 2021-12-17
2,Q1753,'2',SuperLU-DIST,initial csv import 2021-12-17
3,Q1754,'3',WHISPAR,initial csv import 2021-12-17
4,Q1755,'4',MULTI2D,initial csv import 2021-12-17


## Import the papers list
A subsample of the papers list was created in the notebook `filter_papers_by_software.ipynb`. This list contains the papers related to the first 10 software entries in the software list (`data/swMATH-software-list.csv`). The list of papers is in the file `data/all_papers.csv.zip`.

In [12]:
# load the list of zbMath papers
papers_df = pd.read_csv('data/all_papers.csv.zip', dtype={'doi': 'string'}) # force DOI to be a string
papers_df.head(3)

Unnamed: 0,id,author,author_ids,document_title,source,classifications,language,keywords,doi,publication_year,serial,links
0,oai:zbmath.org:7188252,"Fohrmeister, Volker; Bartels, Alexander; Mosle...",fohrmeister.volker;bartels.alexander;mosler.jorn,Variational updates for thermomechanically cou...,"Comput. Methods Appl. Mech. Eng. 339, 239-261 ...",74C05;74S05;65M60;65Y05,English,hyper-dual numbers;numerical differentiation;v...,10.1016/j.cma.2018.04.047,2018,Computer Methods in Applied Mechanics and Engi...,
1,oai:zbmath.org:7206170,"Berardocco, Luca; Kronbichler, Martin; Graveme...",berardocco.luca;kronbichler.martin;gravemeier....,A hybridizable discontinuous Galerkin method f...,"Comput. Methods Appl. Mech. Eng. 366, Article ...",78A25;65M60,English,Maxwell's equations;electromagnetic diffusion ...,10.1016/j.cma.2020.113071,2020,Computer Methods in Applied Mechanics and Engi...,https://arxiv.org/abs/1904.10257
2,oai:zbmath.org:7138575,"Hu, Xiukun; Douglas, Craig C.",hu.xiukun;douglas.craig-c,Performance and scalability analysis of a coup...,"Japan J. Ind. Appl. Math. 36, No. 3, 1039-1054...",76S05;76M10;65M60;65M12,English,domain decomposition;finite element method;par...,10.1007/s13160-019-00381-3,2019,Japan Journal of Industrial and Applied Mathem...,


In [15]:
for i,current in papers_df.iterrows():
    document_title = current['document_title'].strip()
    document_id = current['id'].strip()
    publication_date = "+{}-01-01T00:00:00Z".format(current['publication_year'])
    doi = current['doi']
    
    data = {
        'labels':{'en':{'language':'en','value':document_title}},
        'claims': [
            # instance of 'scholarly article'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q13442814'}}},
            'type': 'statement', 'rank': 'normal'},
            # title
            {'mainsnak':{
                'snaktype':'value', 'property':'P1476', 'datavalue': {'type':'monolingualtext', 'value': {'text': document_title, 'language': 'en'}}},
            'type': 'statement', 'rank': 'normal'},
            # zbMath work id
            {'mainsnak':{
                'snaktype':'value', 'property':'P894', 'datavalue': {'type':'string', 'value': document_id}},
            'type': 'statement', 'rank': 'normal'},
            # publication date
            {'mainsnak':{
                'snaktype':'value', 'property':'P577', 'datavalue': {'type':'time', 'value': { 
                    "time": publication_date, 'precision': 9, 'timezone':0, 'before':0, 'after':0, 'calendarmodel':'http://www.wikidata.org/entity/Q1985727'
                }}},
            'type': 'statement', 'rank': 'normal'}
            ]
    }
    if not pd.isna(doi):
        data['claims'].append(
            # DOI
            {'mainsnak':{
                'snaktype':'value', 'property':'P356', 'datavalue': {'type':'string', 'value': doi}},
            'type': 'statement', 'rank': 'normal'}
        )

    # import into wikibase, save qid
    papers_df.loc[i, 'qid'] = create_entity(session, data)
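
The publication date claim uses a Wikibase time value in which `precision` 9 means "year" and the calendar model `Q1985727` is the proleptic Gregorian calendar. The following sketch wraps that structure in a helper; `year_to_time_value` is a hypothetical name, and the January 1st placeholder date follows the convention already used in the cell above.

```python
def year_to_time_value(year):
    """Build a Wikibase time value with year precision."""
    return {
        'time': '+{:04d}-01-01T00:00:00Z'.format(int(year)),
        'precision': 9,  # 9 = year
        'timezone': 0,
        'before': 0,
        'after': 0,
        'calendarmodel': 'http://www.wikidata.org/entity/Q1985727',
    }


print(year_to_time_value(2018)['time'])
```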

### Append the paper-to-software relations
**Papers may use multiple software packages**. The relation between papers and software is stored in an additional file, `data/all_papers_software.csv.zip`.
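
Each row of the relation file is one paper/software pair, so both sides have to be resolved to Q-ids before a claim can be written. One possible way to do this in bulk is a pair of pandas merges, sketched here on toy data with made-up Q-ids; it also makes unmatched rows visible as `NaN` instead of failing on an empty lookup.

```python
import pandas as pd

# Toy stand-ins for papers_df, software_df and papers_software_df;
# the Q-ids are made up for illustration.
toy_papers = pd.DataFrame({'id': ['oai:zbmath.org:1', 'oai:zbmath.org:2'],
                           'qid': ['Q100', 'Q101']})
toy_software = pd.DataFrame({'Len': ['SuperLU-DIST', 'WHISPAR'],
                             'qid': ['Q200', 'Q201']})
toy_relations = pd.DataFrame({
    'id': ['oai:zbmath.org:1', 'oai:zbmath.org:1', 'oai:zbmath.org:2'],
    'software': ['SuperLU-DIST', 'WHISPAR', 'SuperLU-DIST'],
})

# attach the paper Q-id, then the software Q-id, to every relation row
resolved = (
    toy_relations
    .merge(toy_papers.rename(columns={'qid': 'paper_qid'}), on='id', how='left')
    .merge(toy_software.rename(columns={'Len': 'software', 'qid': 'software_qid'}),
           on='software', how='left')
)

# rows where either lookup failed carry NaN and can be inspected first
missing = resolved[resolved[['paper_qid', 'software_qid']].isna().any(axis=1)]
print(resolved[['paper_qid', 'software_qid']])
print(len(missing))
```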

In [16]:
# load the list of papers-to-software relations
papers_software_df = pd.read_csv('data/all_papers_software.csv.zip')
papers_software_df.head(3)

Unnamed: 0,id,software
0,oai:zbmath.org:7188252,SuperLU-DIST
1,oai:zbmath.org:7206170,SuperLU-DIST
2,oai:zbmath.org:7138575,SuperLU-DIST


### Edit a wikibase entity

In [17]:
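def edit_entity(session, qid, data):
    """
    Edits an existing wikibase entity.
    @session requests.sessions.Session: session obtained from login 
    @qid string: id of the entity to edit
    @data python dict: data to add to the entity
    @returns string: id of the edited entity
    """
    token = get_csrf_token(session)
    
    params = {
        'id': qid,
        'action': 'wbeditentity',
        'format': 'json',
        'data': json.dumps(data),
        'token': token
    }
    r1 = session.post(WIKIBASE_API, data=params)
    # parse the response once (avoid overwriting the r1.json method)
    result = r1.json()
    
    # raise when edit failed
    if 'error' in result:
        raise WBAPIException(result['error'])

    return result['entity']['id']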


In [18]:
groups = papers_software_df.groupby('id')
for paper_ref,group in groups:
    paper_id = papers_df[papers_df['id'] == paper_ref]['qid'].values[0]
    for software_name in group['software']:
        software_id = software_df[software_df['Len'] == software_name]['qid']
        software_id = software_id.values[0]
        data = {
            'claims': [
                {
                    # describes a project that uses
                    'mainsnak':{
                        'snaktype':'value', 'property':'P4510', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':software_id}}},
                        'type': 'statement', 'rank': 'normal'}]
        }
        edit_entity(session, paper_id, data)

### Append the paper-to-author relations

In [19]:
# load the list of papers to authors relations
papers_authors_df = pd.read_csv('data/all_papers_authors.csv.zip')
papers_authors_df.head(3)

Unnamed: 0,paper_id,author_id
0,oai:zbmath.org:7188252,fohrmeister.volker
1,oai:zbmath.org:7188252,bartels.alexander
2,oai:zbmath.org:7188252,mosler.jorn


In [20]:
groups = papers_authors_df.groupby('paper_id')
for paper_ref,group in groups:
    paper_id = papers_df[papers_df['id'] == paper_ref]['qid'].values[0]
    for author_ref in group['author_id']:
        author_qid = authors_df[authors_df['author_id'] == author_ref]['qid']
        author_qid = author_qid.values[0]
        data = {
            'claims': [
                {
                    # paper has author
                    'mainsnak':{
                        'snaktype':'value', 'property':'P50', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':author_qid}}},
                        'type': 'statement', 'rank': 'normal'}]
        }
        edit_entity(session, paper_id, data)

In [21]:
authors_df.head()

Unnamed: 0,author_id,author_name,qid
0,aidun.cyrus-k,"Aidun, Cyrus K.",Q1583
1,babuska.ivo-m,"Babuška, I.",Q1584
2,banerjee.uday,"Banerjee, U.",Q1585
3,bartels.alexander,"Bartels, Alexander",Q1586
4,basko.mikhail-m,"Basko, M. M.",Q1587


### Import journals and publishers

In [22]:
# load the list of journals and publishers
all_journals_df = pd.read_csv('data/all_journals.csv.zip')
all_journals_df.head(3)

Unnamed: 0,paper_id,journal,publisher
0,oai:zbmath.org:7188252,Computer Methods in Applied Mechanics and Engi...,"Elsevier (North-Holland), Amsterdam"
1,oai:zbmath.org:7206170,Computer Methods in Applied Mechanics and Engi...,"Elsevier (North-Holland), Amsterdam"
2,oai:zbmath.org:7138575,Japan Journal of Industrial and Applied Mathem...,"Springer Japan, Tokyo"


In [41]:
# remove duplicated journal/publisher combinations
unique_journals = all_journals_df.drop(columns=['paper_id'])
idx = unique_journals.duplicated()
unique_journals = unique_journals[~idx]

# add publishers
for i,row in unique_journals.iterrows():
    publisher = row['publisher'].strip()
    data = {
        'labels':{'en':{'language':'en','value':publisher}},
        'claims': [
            # instance of 'publisher'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q2085381'}}},
            'type': 'statement', 'rank': 'normal'}
        ]}
    publisher_qid = create_entity(session, data)
    unique_journals.loc[i, 'publisher_qid'] = publisher_qid
    
# add journals
for i,row in unique_journals.iterrows():
    journal_title = row['journal'].strip()
    publisher_qid = row['publisher_qid']
    data = {
        'labels':{'en':{'language':'en','value':journal_title}},
        'claims': [
            # instance of 'scientific publication'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q591041'}}},
            'type': 'statement', 'rank': 'normal'},
            # title
            {'mainsnak':{
                'snaktype':'value', 'property':'P1476', 'datavalue': {'type':'monolingualtext', 'value': {'text': journal_title, 'language': 'en'}}},
            'type': 'statement', 'rank': 'normal'},
            # publisher
            {'mainsnak':{
                'snaktype':'value', 'property':'P123', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':publisher_qid}}},
            'type': 'statement', 'rank': 'normal'}
        ]}
    journal_qid = create_entity(session, data)
    unique_journals.loc[i, 'journal_qid'] = journal_qid
    

In [42]:
unique_journals.head()

Unnamed: 0,journal,publisher,publisher_qid,journal_qid
0,Computer Methods in Applied Mechanics and Engi...,"Elsevier (North-Holland), Amsterdam",Q1976,Q2003
2,Japan Journal of Industrial and Applied Mathem...,"Springer Japan, Tokyo",Q1977,Q2004
3,Computational Geosciences,"Springer International Publishing, Cham",Q1978,Q2005
4,Journal of Computational and Applied Mathematics,"Elsevier (North-Holland), Amsterdam",Q1979,Q2006
5,Engineering Analysis with Boundary Elements,"Elsevier, Oxford; Association with the Interna...",Q1980,Q2007
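
In the output above, "Elsevier (North-Holland), Amsterdam" appears with two different publisher Q-ids (Q1976 and Q1979), because the publisher loop runs once per journal/publisher combination. A possible way to create each publisher only once is sketched here on toy data; `create_once` and `fake_create` are hypothetical helpers, the latter standing in for a call to `create_entity`.

```python
import pandas as pd


def create_once(labels, create):
    """Call create(label) once per distinct label; return a label -> id dict."""
    ids = {}
    for label in labels:
        if label not in ids:
            ids[label] = create(label)
    return ids


# toy data; the journal and publisher names are shortened for illustration
toy_journals = pd.DataFrame({
    'journal': ['Journal A', 'Journal B', 'Journal C'],
    'publisher': ['Elsevier', 'Springer', 'Elsevier'],
})

# fake create function handing out consecutive ids instead of calling the API
counter = iter(range(1, 100))


def fake_create(label):
    return 'Q{}'.format(next(counter))


publisher_names = toy_journals['publisher'].str.strip()
publisher_ids = create_once(publisher_names, fake_create)
toy_journals['publisher_qid'] = publisher_names.map(publisher_ids)
print(toy_journals)
```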


### Append journal reference to papers

In [68]:
for paper_ref,paper in papers_df.iterrows():
    paper_qid = paper['qid']
    # Journals and publishers are stored in the 'serial' column, ';'-separated, e.g. journal;publisher
    journal_name = paper['serial'].split(';', 1)[0] # serial is split in 2 (maxsplit=1), keep the journal part
    journal_qid = unique_journals[unique_journals['journal'] == journal_name]['journal_qid'].values[0]
    data = {
        'claims': [
            {
                # paper published in journal
                'mainsnak':{
                    'snaktype':'value', 'property':'P1433', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':journal_qid}}},
                    'type': 'statement', 'rank': 'normal'}]
    }
    edit_entity(session, paper_qid, data)

### Create an entity for each keyword

In [104]:
import matplotlib

# a flat list of unique keywords
keywords = []
keywords.append( [ keyword.split(';') for keyword in papers_df['keywords'] if not pd.isna(keyword)] )
keywords = list(set(matplotlib.cbook.flatten(keywords)))
keywords = pd.DataFrame(keywords, columns=['keyword'])

# add keywords to wikibase
for i,row in keywords.iterrows():
    keyword = row['keyword']
    data = {
        'labels':{'en':{'language':'en','value':keyword}},
        'claims': [
            # instance of 'index term'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q1128340'}}},
            'type': 'statement', 'rank': 'normal'}
        ]}
    keyword_qid = create_entity(session, data)
    keywords.loc[i, 'keyword_qid'] = keyword_qid

In [105]:
keywords.head()

Unnamed: 0,keyword,keyword_qid
0,velocity post-processing,Q2030
1,moment-based acceleration,Q2031
2,sparse skew symmetric matrices,Q2032
3,geostatistics,Q2033
4,jet,Q2034
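
The list of unique keywords can also be built without `matplotlib`, using a set comprehension. The sketch below runs on a toy series; stripping whitespace and sorting are additions that the cell above does not perform. Sorting makes the order, and therefore the assignment of Q-ids, reproducible between runs, since the iteration order of a set is not fixed.

```python
import pandas as pd

# toy stand-in for papers_df['keywords']
toy_keywords = pd.Series(['jet;geostatistics', None, 'geostatistics;finite element method'])

unique_keywords = sorted({
    keyword.strip()
    for entry in toy_keywords.dropna()   # skip papers without keywords
    for keyword in entry.split(';')      # one entry holds several keywords
    if keyword.strip()                   # ignore empty fragments
})
keywords_example = pd.DataFrame(unique_keywords, columns=['keyword'])
print(keywords_example)
```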


### Append keyword references to papers

In [None]:
for _,row in keywords.iterrows():
    keyword = row['keyword']
    keyword_qid = row['keyword_qid']
    for _,paper in papers_df.iterrows():
        paper_qid = paper['qid']
        # compare whole keywords, not substrings (e.g. 'jet' must not match 'jets')
        if (not pd.isna(paper['keywords'])) and (keyword in paper['keywords'].split(';')):
            data = {
                'claims': [
                    {
                        # main subject
                        'mainsnak':{
                            'snaktype':'value', 'property':'P921', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':keyword_qid}}},
                            'type': 'statement', 'rank': 'normal'}]
            }
            edit_entity(session, paper_qid, data)            
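
The nested loop sends one API request per keyword/paper pair. A possible alternative, sketched here on toy data, first collects all keyword Q-ids per paper so that a single `edit_entity` call per paper would suffice. `group_keyword_claims` is a hypothetical helper; it compares whole `;`-separated keywords, so "jet" does not match "jets".

```python
from collections import defaultdict


def group_keyword_claims(papers, keyword_qids):
    """
    Map each paper qid to the list of qids of its keywords.
    papers: iterable of (paper_qid, keywords field or missing value)
    keyword_qids: dict keyword -> qid
    """
    grouped = defaultdict(list)
    for paper_qid, keywords_field in papers:
        # missing values (None or NaN) are not strings and are skipped
        if not isinstance(keywords_field, str):
            continue
        for keyword in keywords_field.split(';'):
            qid = keyword_qids.get(keyword)
            if qid is not None:
                grouped[paper_qid].append(qid)
    return dict(grouped)


# toy data with made-up paper Q-ids
toy_papers = [('Q100', 'jet;geostatistics'), ('Q101', None), ('Q102', 'jets')]
toy_keyword_qids = {'jet': 'Q2034', 'geostatistics': 'Q2033'}
claims_per_paper = group_keyword_claims(toy_papers, toy_keyword_qids)
print(claims_per_paper)
```

Each list could then be turned into one `claims` list with property `P921` and passed to `edit_entity` in a single request per paper.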
