# Populate the portal using the Wikibase API
Goal of this notebook: import the data prepared in `filter_papers_by_software.ipynb` into the data structure set up in `import_wikidata_properties.ipynb`.

The API documentation is [here](https://www.wikidata.org/w/api.php?action=help&modules=wbeditentity).
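
As a rough sketch of what a call to this module looks like, the helper below only assembles the POST parameters for creating a new item. The name `build_wbeditentity_params` and the placeholder token are assumptions for illustration; the actual request is sent by `create_entity` from `WB_common.ipynb`, which is not shown here and may differ.

```python
import json

def build_wbeditentity_params(data, csrf_token):
    """Assemble POST parameters for wbeditentity (hypothetical helper)."""
    return {
        'action': 'wbeditentity',
        'format': 'json',
        'new': 'item',             # create a new item instead of editing one
        'data': json.dumps(data),  # the entity data is passed as a JSON string
        'token': csrf_token,       # CSRF token obtained after login
    }

params = build_wbeditentity_params(
    {'labels': {'en': {'language': 'en', 'value': 'Example item'}}},
    csrf_token='PLACEHOLDER',
)
```

The `data` dictionaries built in the cells below (labels plus claims) are the kind of structure that ends up in the `data` parameter.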

In [5]:
# import common definitions and functions
%run WB_common.ipynb

## Login to the Wikibase

### Network settings
* Make sure the Wikibase is running, e.g. using [MaRDI4NFDI/portal-compose](https://github.com/MaRDI4NFDI/portal-compose)
* Make sure this Jupyter notebook is in the same network as the wiki. This is done in the docker-compose configuration:

```
networks:
  default:
    external: true
    name: portal-compose_default
```

Networks can be listed using `docker network ls`. Here, `portal-compose_default` is the name of the network started by portal-compose.
* Verify that this notebook is in the correct network with `docker network inspect portal-compose_default`.

The wiki is then accessible in the browser at http://localhost:8080 and **from the notebook container at `http://mardi-wikibase`.**

### Import the properties and items from Wikidata
See the notebook `WB_wikidata_properties.ipynb`.

### Bot user
* Log in to the wiki as admin
* Go to Special:BotPasswords, create a bot user called "import", and grant it "High-volume editing", "Edit existing pages", and "Create, edit, and move pages"
* Copy `data/credentials.tpl` to `data/credentials.ini`. Replace the username and password with those of the newly created bot user (make sure not to commit this file)
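
For orientation, here is a sketch of the layout that the next cell expects from `data/credentials.ini`: a `[default]` section with the keys `username`, `password` and `WIKIBASE_API`. All values are placeholders. The `/w/api.php` path in particular is an assumption, not a value taken from `data/credentials.tpl`.

```python
import configparser

# Placeholder content; the real file must not be committed
EXAMPLE_INI = """
[default]
username = Admin@import
password = replace-with-bot-password
WIKIBASE_API = http://mardi-wikibase/w/api.php
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_INI)
example_username = example_config['default']['username']
# configparser lower-cases option names, so key lookup is case-insensitive
example_api = example_config['default']['WIKIBASE_API']
```

Bot passwords created via Special:BotPasswords are used with a login name of the form `<wiki user>@<bot name>`, hence `Admin@import` in the placeholder.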

In [6]:
# read bot username and password from data/credentials.ini
config = configparser.ConfigParser()
config.read('data/credentials.ini')
username = config['default']['username']
botpwd = config['default']['password']
WIKIBASE_API = config['default']['WIKIBASE_API']

session = login(username, botpwd)

## Import the authors list
Before importing anything, make sure the corresponding items and properties have been imported from Wikidata. See the notebook `WB_wikidata_properties.ipynb`.

A subsample of the authors list was created in the notebook `filter_papers_by_software.ipynb`. This list contains the authors of the papers related to the first 1000 software entries in the software list (`data/swMATH-software-list.csv`). The list of authors is in the file `data/all_authors.csv.zip`.

In [26]:
# load the list of zbMath authors
authors_df = pd.read_csv('data/all_authors.csv.zip') 
authors_df.head()

Unnamed: 0,author_id,author_name
0,aidun.cyrus-k,"Aidun, Cyrus K."
1,babuska.ivo-m,"Babuška, I."
2,banerjee.uday,"Banerjee, U."
3,bartels.alexander,"Bartels, Alexander"
4,basko.mikhail-m,"Basko, M. M."


Use the `create_entity` function to import the authors into the wiki.
The Q-id returned by the Wikibase is stored in a new `qid` column of the pandas dataframe of authors.
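
The statement dictionaries in this notebook all follow two shapes: a claim pointing to another item and a claim holding a plain string. A minimal sketch of helpers that build them is given below. The names `item_claim`, `string_claim` and `author_data` are hypothetical and not part of `WB_common.ipynb`.

```python
def item_claim(prop, qid):
    """Statement whose value is another item, e.g. P31 -> Q5."""
    return {
        'mainsnak': {
            'snaktype': 'value',
            'property': prop,
            'datavalue': {
                'type': 'wikibase-entityid',
                'value': {'entity-type': 'item', 'id': qid},
            },
        },
        'type': 'statement',
        'rank': 'normal',
    }

def string_claim(prop, value):
    """Statement whose value is a plain string, e.g. an external identifier."""
    return {
        'mainsnak': {
            'snaktype': 'value',
            'property': prop,
            'datavalue': {'type': 'string', 'value': value},
        },
        'type': 'statement',
        'rank': 'normal',
    }

def author_data(author_name, author_id):
    """Entity data for one author: English label, instance of human, zbMath author id."""
    return {
        'labels': {'en': {'language': 'en', 'value': author_name}},
        'claims': [item_claim('P31', 'Q5'), string_claim('P1556', author_id)],
    }

example = author_data('Aidun, Cyrus K.', 'aidun.cyrus-k')
```

The resulting `example` has the same structure as the `data` dictionary assembled inline in the next cell.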

In [29]:
for i,current in authors_df.iterrows():
    author_name = current['author_name'].strip()
    author_id = current['author_id'].strip()
    data = {
        'labels':{'en':{'language':'en','value':author_name}},
        'claims': [
            # instance of 'human'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q5'}}},
            'type': 'statement', 'rank': 'normal'},
            # zbMath author id
            {'mainsnak':{
                'snaktype':'value', 'property':'P1556', 'datavalue': {'type':'string', 'value': author_id}},
            'type': 'statement', 'rank': 'normal'}
            ]
    }
    # import into wikibase, save Qid
    authors_df.loc[i, 'qid'] = create_entity(session, data)

In [30]:
authors_df.head()

Unnamed: 0,author_id,author_name,qid
0,aidun.cyrus-k,"Aidun, Cyrus K.",Q6
1,babuska.ivo-m,"Babuška, I.",Q7
2,banerjee.uday,"Banerjee, U.",Q8
3,bartels.alexander,"Bartels, Alexander",Q9
4,basko.mikhail-m,"Basko, M. M.",Q10


## Import the software list
All software entries have already been imported into the MaRDI portal.
Here I will import the first 10 (out of 40000) software entries into the local wiki for testing.

In [30]:
MAX_ENTRIES = 10 # number of software entries to process

# load the list of swMath software
software_df = pd.read_csv('data/swMATH-software-list.csv')
software_df = software_df[:MAX_ENTRIES]
software_df.head()

Unnamed: 0,qid,P13,Len,#
0,,'0',swMATH,initial csv import 2021-12-17
1,,'1',FORTRAN,initial csv import 2021-12-17
2,,'2',SuperLU-DIST,initial csv import 2021-12-17
3,,'3',WHISPAR,initial csv import 2021-12-17
4,,'4',MULTI2D,initial csv import 2021-12-17


Use the `create_entity` function to import the software into the wiki.
* If the software is already in the wiki, the existing Qid is stored in the pandas dataframe of software.
* If the software is not yet in the wiki, it is added and the new Qid returned by the Wikibase is stored in the pandas dataframe of software.
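
The two cases above amount to a "get or create" pattern. The sketch below isolates that logic with stand-in functions instead of real API calls. `get_or_create`, `fake_create` and the Q-ids starting at `Q1001` are made up for illustration.

```python
def get_or_create(name, lookup, create):
    """Return (qid, created): reuse the Qid found by lookup, else create the item."""
    qid = lookup(name)
    if qid is not None:
        return qid, False
    return create(name), True

# Stand-ins for the wiki, for illustration only
existing = {'FORTRAN': 'Q175'}
created_names = []

def fake_create(name):
    created_names.append(name)
    return 'Q{}'.format(1000 + len(created_names))

result_existing = get_or_create('FORTRAN', existing.get, fake_create)
result_new = get_or_create('WHISPAR', existing.get, fake_create)
```

In the notebook, `lookup` corresponds to `read_entity_by_title` and `create` to a call to `create_entity` with the software's entity data.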

In [28]:
def read_entity_by_title(session, title):
    """Return the Qid of the first item found for `title`, or None if there is no hit."""
    params = {
        'action': 'wbsearchentities',
        'format': 'json',
        'search': title,
        'language': 'en',
        'type': 'item',
        'limit': 1
    }
    r1 = session.post(WIKIBASE_API, data=params)
    result = r1.json()  # keep the parsed body in its own variable, do not overwrite r1.json
    search_hits = result.get('search', [])
    # only the first hit is used
    return search_hits[0]['id'] if search_hits else None

read_entity_by_title(session, 'FORTRAN')

'Q175'
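
The lookup above returns the first search hit. Since `wbsearchentities` searches labels and aliases and can return items whose label merely starts with the search text, the first hit is not guaranteed to be an exact match. A possible refinement, sketched here on a made-up response, is to request several hits (a larger `limit`) and keep only one whose label equals the title. `pick_exact_label` and the sample Q-ids are hypothetical.

```python
def pick_exact_label(search_hits, title):
    """Return the id of the first hit whose label equals title exactly, else None."""
    for hit in search_hits:
        if hit.get('label') == title:
            return hit['id']
    return None

# Made-up search hits shaped like the 'search' list of a wbsearchentities response
sample_hits = [
    {'id': 'Q900', 'label': 'FORTRAN 77'},
    {'id': 'Q175', 'label': 'FORTRAN'},
]

exact_qid = pick_exact_label(sample_hits, 'FORTRAN')
no_qid = pick_exact_label(sample_hits, 'SuperLU-DIST')
```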

In [33]:
for i,current in software_df[:MAX_ENTRIES].iterrows():
    software_name = current['Len'].strip()
    qid = read_entity_by_title(session, software_name)
    if qid is not None:
        # software already in the wiki
        software_df.loc[i, 'qid'] = qid
    else:
        # software is not already in the wiki
        software_swmath_id = current['P13'].strip().replace("'", '')

        data = {
            'labels':{'en':{'language':'en','value':software_name}},
            'claims': [
                # instance of 'software'
                {'mainsnak':{
                    'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q7397'}}},
                'type': 'statement', 'rank': 'normal'},
                # swMath work id
                {'mainsnak':{
                    'snaktype':'value', 'property':'P6830', 'datavalue': {'type':'string', 'value': software_swmath_id}},
                'type': 'statement', 'rank': 'normal'}
                ]
        }
        # import into wikibase, save Qid
        software_df.loc[i, 'qid'] = create_entity(session, data)

In [34]:
software_df.head()

Unnamed: 0,qid,P13,Len,#
0,Q174,'0',swMATH,initial csv import 2021-12-17
1,Q175,'1',FORTRAN,initial csv import 2021-12-17
2,Q176,'2',SuperLU-DIST,initial csv import 2021-12-17
3,Q177,'3',WHISPAR,initial csv import 2021-12-17
4,Q178,'4',MULTI2D,initial csv import 2021-12-17


## Import the papers list
A subsample of the papers list was created in the notebook `filter_papers_by_software.ipynb`. This list contains the papers related to the first 10 software entries in the software list (`data/swMATH-software-list.csv`). The list of papers is in the file `data/all_papers.csv.zip`.

In [34]:
# load the list of zbMath papers
papers_df = pd.read_csv('data/all_papers.csv.zip', dtype={'doi': 'string'}) # force DOI to be a string
papers_df.head(3)

Unnamed: 0,id,author,author_ids,document_title,source,classifications,language,links,keywords,doi,publication_year,serial
0,oai:zbmath.org:6648376,"Casoni, E.; Jérusalem, A.; Samaniego, C.; Eguz...",casoni.e;jerusalem.antoine;samaniego.cristobal...,Alya: computational solid mechanics for superc...,"Arch. Comput. Methods Eng. 22, No. 4, 557-576 ...",74-04;74Sxx;65N22;65Y05;65Y15,English,http://hdl.handle.net/2117/84764,computational mechanics;finite element method;...,10.1007/s11831-014-9126-8,2015,Archives of Computational Methods in Engineeri...
1,oai:zbmath.org:6911772,"Povich, T. J.; Dawson, C. N.; Farthing, M. W.;...",povich.t-j;dawson.clint-n;farthing.matthew-w;k...,Finite element methods for variable density fl...,"Comput. Geosci. 17, No. 3, 529-549 (2013).",76M10;76S05;65M60,English,,saltwater intrusion;variable density;stabilize...,10.1007/s10596-012-9330-2,2013,Computational Geosciences;Springer Internation...
2,oai:zbmath.org:6910452,"Chávez, Gustavo; Turkiyyah, George; Zampini, S...",chavez.gustavo;turkiyyah.george-m;zampini.stef...,Parallel accelerated cyclic reduction precondi...,"J. Comput. Appl. Math. 344, 760-781 (2018).",65F08;65N22;65Y05,English,,preconditioning;cyclic reduction;hierarchical ...,10.1016/j.cam.2017.11.035,2018,Journal of Computational and Applied Mathemati...


### Create language items

In [35]:
language_rows = []
for language in papers_df['language'].unique():
    data = {
        'labels':{'en':{'language':'en','value':language}},
        'claims': [
            # instance of 'language'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q34770'}}},
            'type': 'statement', 'rank': 'normal'}
        ]}
    qid = create_entity(session, data)
    language_rows.append({'language': language, 'qid': qid})
# DataFrame.append was removed in pandas 2.0; build the frame from a list of rows instead
languages_df = pd.DataFrame(language_rows)
languages_df.set_index('language', inplace=True)
languages_df.head()

language,qid
English,Q184


In [36]:
for i,current in papers_df.iterrows():
    document_title = current['document_title'].strip()
    document_id = current['id'].strip()
    publication_date = "+{}-01-01T00:00:00Z".format(current['publication_year'])
    doi = current['doi']
    language = languages_df.loc[current['language'], 'qid']
    link = current['links']
    
    data = {
        'labels':{'en':{'language':'en','value':document_title}},
        'claims': [
            # instance of 'scholarly article'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q13442814'}}},
            'type': 'statement', 'rank': 'normal'},
            # title
            {'mainsnak':{
                'snaktype':'value', 'property':'P1476', 'datavalue': {'type':'monolingualtext', 'value': {'text': document_title, 'language': 'en'}}},
            'type': 'statement', 'rank': 'normal'},
            # zbMath work id
            {'mainsnak':{
                'snaktype':'value', 'property':'P894', 'datavalue': {'type':'string', 'value': document_id}},
            'type': 'statement', 'rank': 'normal'},
            # publication date
            {'mainsnak':{
                'snaktype':'value', 'property':'P577', 'datavalue': {'type':'time', 'value': { 
                    "time": publication_date, 'precision': 9, 'timezone':0, 'before':0, 'after':0, 'calendarmodel':'http://www.wikidata.org/entity/Q1985727'
                }}},
            'type': 'statement', 'rank': 'normal'},
            # language of work or name
            {'mainsnak':{
                'snaktype':'value', 'property':'P407', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':language}}},
            'type': 'statement', 'rank': 'normal'},
        ]
    }
    if not pd.isna(doi):
        data['claims'].append(
            # DOI
            {'mainsnak':{
                'snaktype':'value', 'property':'P356', 'datavalue': {'type':'string', 'value': doi}},
            'type': 'statement', 'rank': 'normal'}
        )

    if not pd.isna(link):
        data['claims'].append(
            # exact match (link)
            {'mainsnak':{
                'snaktype':'value', 'property':'P2888', 'datavalue': {'type':'string', 'value': link}},
            'type': 'statement', 'rank': 'normal'},
        )
    # import into wikibase, save qid
    papers_df.loc[i, 'qid'] = create_entity(session, data)
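
The publication date claim in the loop above uses a Wikibase time value. Only the year is known, so the timestamp is set to 1 January and `precision` is 9, which marks the value as accurate to the year. A small sketch that isolates this step follows; the helper name `year_time_value` is hypothetical.

```python
# Calendar model used above: proleptic Gregorian calendar
GREGORIAN = 'http://www.wikidata.org/entity/Q1985727'

def year_time_value(year):
    """Wikibase time value for a publication year (year precision)."""
    return {
        'time': '+{:04d}-01-01T00:00:00Z'.format(int(year)),
        'precision': 9,   # 9 = year
        'timezone': 0,
        'before': 0,
        'after': 0,
        'calendarmodel': GREGORIAN,
    }

value_2015 = year_time_value(2015)
```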

### Append the paper-to-software relations
**Papers may use multiple software packages.** The relation between papers and software is stored in an additional file, `data/all_papers_software.csv.zip`.

In [37]:
# load the list of papers-to-software relations
papers_software_df = pd.read_csv('data/all_papers_software.csv.zip')
papers_software_df.head(3)

Unnamed: 0,id,software
0,oai:zbmath.org:6648376,SuperLU-DIST
1,oai:zbmath.org:6911772,SuperLU-DIST
2,oai:zbmath.org:6910452,SuperLU-DIST


In [38]:
groups = papers_software_df.groupby('id')
for paper_ref,group in groups:
    paper_id = papers_df[papers_df['id'] == paper_ref]['qid'].values[0]
    for software_name in group['software']:
        software_id = software_df[software_df['Len'] == software_name]['qid']
        software_id = software_id.values[0]
        data = {
            'claims': [
                {
                    # describes a project that uses
                    'mainsnak':{
                        'snaktype':'value', 'property':'P4510', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':software_id}}},
                        'type': 'statement', 'rank': 'normal'}]
        }
        edit_entity(session, paper_id, data)
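
The loop above sends one edit per paper/software pair. As a sketch of an alternative, the claims could first be grouped per paper so that each paper needs only one edit. Plain dictionaries stand in for the dataframes, and the paper Q-ids (`Q200`, `Q201`) are made up. `claims_per_paper` is a hypothetical helper.

```python
from collections import defaultdict

def claims_per_paper(relations, paper_qids, software_qids):
    """Group (paper_ref, software_name) pairs into one P4510 claims list per paper Qid."""
    grouped = defaultdict(list)
    for paper_ref, software_name in relations:
        grouped[paper_qids[paper_ref]].append({
            'mainsnak': {
                'snaktype': 'value',
                'property': 'P4510',
                'datavalue': {
                    'type': 'wikibase-entityid',
                    'value': {'entity-type': 'item', 'id': software_qids[software_name]},
                },
            },
            'type': 'statement',
            'rank': 'normal',
        })
    return dict(grouped)

# Illustrative inputs; the paper Q-ids are placeholders
relations = [
    ('oai:zbmath.org:6648376', 'SuperLU-DIST'),
    ('oai:zbmath.org:6648376', 'FORTRAN'),
    ('oai:zbmath.org:6911772', 'SuperLU-DIST'),
]
paper_qids = {'oai:zbmath.org:6648376': 'Q200', 'oai:zbmath.org:6911772': 'Q201'}
software_qids = {'SuperLU-DIST': 'Q176', 'FORTRAN': 'Q175'}

batches = claims_per_paper(relations, paper_qids, software_qids)
```

Each entry of `batches` could then be passed to `edit_entity(session, paper_qid, {'claims': claims})`, assuming `edit_entity` accepts several claims in one call.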

### Append the paper-to-author relations

In [39]:
# load the list of paper-to-author relations
papers_authors_df = pd.read_csv('data/all_papers_authors.csv.zip')
papers_authors_df.head(3)

Unnamed: 0,paper_id,author_id
0,oai:zbmath.org:6648376,casoni.e
1,oai:zbmath.org:6648376,jerusalem.antoine
2,oai:zbmath.org:6648376,samaniego.cristobal


In [40]:
groups = papers_authors_df.groupby('paper_id')
for paper_ref,group in groups:
    paper_id = papers_df[papers_df['id'] == paper_ref]['qid'].values[0]
    for author_ref in group['author_id']:
        author_qid = authors_df[authors_df['author_id'] == author_ref]['qid']
        author_qid = author_qid.values[0]
        data = {
            'claims': [
                {
                    # paper has author
                    'mainsnak':{
                        'snaktype':'value', 'property':'P50', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':author_qid}}},
                        'type': 'statement', 'rank': 'normal'}]
        }
        edit_entity(session, paper_id, data)
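
The lookup `...['qid'].values[0]` raises an `IndexError` when an author id has no row in `authors_df`, for example if the authors list is only a subsample. A hedged sketch of a more forgiving lookup, using a plain dictionary in place of the dataframe (`resolve_authors` is a hypothetical helper):

```python
def resolve_authors(author_refs, author_qids):
    """Split author ids into resolved Qids and ids without a known item."""
    found, missing = [], []
    for ref in author_refs:
        qid = author_qids.get(ref)
        if qid is None:
            missing.append(ref)
        else:
            found.append(qid)
    return found, missing

# Illustrative mapping; in the notebook it could be built with
# dict(zip(authors_df['author_id'], authors_df['qid']))
author_qids = {'aidun.cyrus-k': 'Q6', 'babuska.ivo-m': 'Q7'}

found, missing = resolve_authors(
    ['aidun.cyrus-k', 'unknown.author', 'babuska.ivo-m'], author_qids)
```

Ids collected in `missing` can be reported or imported first, instead of aborting the whole loop.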

In [41]:
authors_df.head()

Unnamed: 0,author_id,author_name,qid
0,aidun.cyrus-k,"Aidun, Cyrus K.",Q6
1,babuska.ivo-m,"Babuška, I.",Q7
2,banerjee.uday,"Banerjee, U.",Q8
3,bartels.alexander,"Bartels, Alexander",Q9
4,basko.mikhail-m,"Basko, M. M.",Q10


### Import journals and publishers

In [42]:
# load the list of journals and publishers for each paper
all_journals_df = pd.read_csv('data/all_journals.csv.zip')
all_journals_df.head(3)

Unnamed: 0,paper_id,journal,publisher
0,oai:zbmath.org:6648376,Archives of Computational Methods in Engineering,"Springer Netherlands, Dordrecht; International..."
1,oai:zbmath.org:6911772,Computational Geosciences,"Springer International Publishing, Cham"
2,oai:zbmath.org:6910452,Journal of Computational and Applied Mathematics,"Elsevier (North-Holland), Amsterdam"


In [43]:
# remove duplicated journal/publisher combinations
unique_journals = all_journals_df.drop(columns=['paper_id']).drop_duplicates()

# add publishers
for i,row in unique_journals.iterrows():
    publisher = row['publisher'].strip()
    data = {
        'labels':{'en':{'language':'en','value':publisher}},
        'claims': [
            # instance of 'publisher'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q2085381'}}},
            'type': 'statement', 'rank': 'normal'}
        ]}
    publisher_qid = create_entity(session, data)
    unique_journals.loc[i, 'publisher_qid'] = publisher_qid
    
# add journals
for i,row in unique_journals.iterrows():
    journal_title = row['journal'].strip()
    publisher_qid = row['publisher_qid']
    data = {
        'labels':{'en':{'language':'en','value':journal_title}},
        'claims': [
            # instance of 'scientific publication'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q591041'}}},
            'type': 'statement', 'rank': 'normal'},
            # title
            {'mainsnak':{
                'snaktype':'value', 'property':'P1476', 'datavalue': {'type':'monolingualtext', 'value': {'text': journal_title, 'language': 'en'}}},
            'type': 'statement', 'rank': 'normal'},
            # publisher
            {'mainsnak':{
                'snaktype':'value', 'property':'P123', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':publisher_qid}}},
            'type': 'statement', 'rank': 'normal'}
        ]}
    journal_qid = create_entity(session, data)
    unique_journals.loc[i, 'journal_qid'] = journal_qid
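
Note that the publisher loop above creates one publisher item per journal/publisher row, so a publisher with several journals may be created more than once. A minimal sketch of creating each distinct name only once, with a stand-in for the API call (`create_once`, `fake_create` and the generated Q-ids are made up):

```python
def create_once(names, create):
    """Map each distinct name to a Qid, calling create only the first time it is seen."""
    qids = {}
    for name in names:
        if name not in qids:
            qids[name] = create(name)
    return qids

# Stand-in for create_entity, for illustration only
calls = []

def fake_create(name):
    calls.append(name)
    return 'Q{}'.format(500 + len(calls))

publisher_names = ['Publisher X', 'Publisher Y', 'Publisher X']
publisher_qids = create_once(publisher_names, fake_create)
```

The journal loop could then read the publisher Qid from the resulting dictionary instead of a per-row column.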
    

In [44]:
unique_journals.head()

Unnamed: 0,journal,publisher,publisher_qid,journal_qid
0,Archives of Computational Methods in Engineering,"Springer Netherlands, Dordrecht; International...",Q245,Q272
1,Computational Geosciences,"Springer International Publishing, Cham",Q246,Q273
2,Journal of Computational and Applied Mathematics,"Elsevier (North-Holland), Amsterdam",Q247,Q274
3,Engineering Analysis with Boundary Elements,"Elsevier, Oxford; Association with the Interna...",Q248,Q275
4,Computers and Fluids,"Elsevier (Pergamon), Oxford",Q249,Q276


### Append journal reference to papers

In [45]:
for paper_ref,paper in papers_df.iterrows():
    paper_qid = paper['qid']
    # Journals and publishers are stored in the 'serial' column, ';'-separated, e.g. journal;publisher
    journal_name = paper['serial'].split(';', 1)[0] # 'serial' is split in 2 (maxsplit=1); the journal is the first part
    journal_qid = unique_journals[unique_journals['journal'] == journal_name]['journal_qid'].values[0]
    data = {
        'claims': [
            {
                # paper published in journal
                'mainsnak':{
                    'snaktype':'value', 'property':'P1433', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':journal_qid}}},
                    'type': 'statement', 'rank': 'normal'}]
    }
    edit_entity(session, paper_qid, data)
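
A small sketch of the `serial` parsing used above, written with `str.partition` so that a value without `;` still yields a journal name. The function name `split_serial` and the sample strings are hypothetical.

```python
def split_serial(serial):
    """Split 'journal;publisher' at the first ';'; the publisher may be absent."""
    journal, _, publisher = serial.partition(';')
    return journal.strip(), (publisher.strip() or None)

with_publisher = split_serial('Journal A;Publisher X, City; Society Y')
without_publisher = split_serial('Journal B')
```

Only the first `;` separates journal from publisher; any later `;` stays inside the publisher part, as with `maxsplit=1` above.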

### Create an entity for each keyword

In [46]:
# a flat list of unique keywords (each paper stores its keywords ';'-separated)
keywords = {
    keyword
    for keyword_list in papers_df['keywords'].dropna()
    for keyword in keyword_list.split(';')
}
keywords = pd.DataFrame(list(keywords), columns=['keyword'])

# add keywords to wikibase
for i,row in keywords.iterrows():
    keyword = row['keyword']
    data = {
        'labels':{'en':{'language':'en','value':keyword}},
        'claims': [
            # instance of 'index term'
            {'mainsnak':{
                'snaktype':'value', 'property':'P31', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':'Q1128340'}}},
            'type': 'statement', 'rank': 'normal'}
        ]}
    keyword_qid = create_entity(session, data)
    keywords.loc[i, 'keyword_qid'] = keyword_qid

In [47]:
keywords.head()

Unnamed: 0,keyword,keyword_qid
0,ICF simulation,Q299
1,parameterization,Q300
2,Navier-Stokes and energy equations,Q301
3,variable density,Q302
4,interior point optimization,Q303


### Append keyword references to papers

In [48]:
for _,row in keywords.iterrows():
    keyword = row['keyword']
    keyword_qid = row['keyword_qid']
    for _,paper in papers_df.iterrows():
        paper_qid = paper['qid']
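        # match whole keywords only; a substring test would also match e.g. 'density' inside 'variable density'
        if (not pd.isna(paper['keywords'])) and (keyword in paper['keywords'].split(';')):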
            data = {
                'claims': [
                    {
                        # main subject
                        'mainsnak':{
                            'snaktype':'value', 'property':'P921', 'datavalue':{'type':'wikibase-entityid', 'value': {'entity-type':'item','id':keyword_qid}}},
                            'type': 'statement', 'rank': 'normal'}]
            }
            edit_entity(session, paper_qid, data)            
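
A sketch of an alternative that processes each paper once: split its `keywords` field on `;` and look up every keyword by exact match in a dictionary. `keyword_claim_qids` is a hypothetical helper, and the entry `'density': 'Q999'` is made up to show that partial matches are ignored.

```python
def keyword_claim_qids(keywords_field, keyword_qids):
    """Qids for the ';'-separated keywords of one paper, matched exactly."""
    if not isinstance(keywords_field, str):  # missing values are read as NaN (a float)
        return []
    return [keyword_qids[k] for k in keywords_field.split(';') if k in keyword_qids]

# Illustrative mapping; in the notebook it could be built with
# dict(zip(keywords['keyword'], keywords['keyword_qid']))
keyword_qids = {'variable density': 'Q302', 'parameterization': 'Q300', 'density': 'Q999'}

qids_for_paper = keyword_claim_qids('saltwater intrusion;variable density', keyword_qids)
qids_for_missing = keyword_claim_qids(float('nan'), keyword_qids)
```

Each Qid in the result would become one `P921` (main subject) claim on the paper.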
