An example pipeline for building a Scopus co-authorship network
---------------

### Step 1a: Access the Scopus collection and download the paper metadata you are interested in

To reach the Scopus document search module, you need to connect from an academic IP address. If your institute is registered with Scopus, you have permission to search documents there. The service is not free of charge: your university pays Scopus a fee to provide it to its academic researchers.

If you do not have an academic IP, please skip to [Step 1b](#step-1b) to download the existing CSV files.

**Why Scopus?**	

Scopus has very comprehensive paper data. In particular, its metadata contains details of authors' affiliations, countries and paper keywords, which are not available on other paper search websites.

**How?**		

1. The number of papers involving Dutch researchers in just one year is 50,000+, and the Scopus API is not suited to handling such a large amount of data. Therefore, I use the [Scopus Document Search website](https://www.scopus.com/search/form.uri?display=basic#basic) (which requires an academic IP, for example via the UvA VPN). The [Advanced Document Search](https://www.scopus.com/search/form.uri?display=advanced) query string is as follows:

    `PUBYEAR  >  2012  AND  PUBYEAR  <  2024  AND  (  LIMIT-TO ( OA ,  "all" ) )  AND  ( LIMIT-TO ( AFFILCOUNTRY ,  "Netherlands" ) )  AND  ( LIMIT-TO ( PUBSTAGE ,  "final" ) )  AND  ( LIMIT-TO ( PUBYEAR ,  2022 ) )  AND  ( LIMIT-TO ( LANGUAGE ,  "English" ) )`
    
    This query returns the papers from 2022 (open access, final publication stage, in English) that have at least one author working at a Dutch institution. The authors in the resulting data are therefore mostly Dutch researchers, plus researchers from other countries who have collaborated with them.

2. Limit the data scope by choosing one particular **Subject area** on the web page, then click the *CSV export* button and select the information that you want to export. In this project, the following information is used:

    <img src=../images/scopus_export_setting.png width=50% />
    
    (*Export restrictions*: If more than 2000 papers are selected, the Affiliations and Author Keywords fields are not available, so please split the data into CSV files containing no more than 2000 papers each.)
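
Because a large selection has to be exported in several chunks, the chunks may need to be combined again before analysis. The following is a minimal sketch using only the standard library. The function name `merge_exports` is hypothetical, and the sketch assumes that every export has an `EID` column that uniquely identifies a paper (change `key` if your export differs).

```python
import csv


def merge_exports(paths, key="EID"):
    """Concatenate several CSV exports, keeping the first row seen per `key`."""
    seen = set()
    merged = []
    for path in paths:
        # utf-8-sig also tolerates a byte order mark at the start of the file
        with open(path, newline="", encoding="utf-8-sig") as handle:
            for row in csv.DictReader(handle):
                identifier = row.get(key)
                if identifier:
                    if identifier in seen:
                        continue  # the same paper appeared in an earlier chunk
                    seen.add(identifier)
                # Rows without an identifier are kept, since they cannot be compared
                merged.append(row)
    return merged
```

Calling `merge_exports(["Medicine1.csv", "Medicine2.csv"])` returns a list of dictionaries, one per distinct paper, in the order in which the papers were first encountered.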


### Step 1b (Optional): Get the pre-downloaded CSV files containing the last ten years of papers with Dutch researchers

You can get the metadata for the 2022 Scopus papers with Dutch researchers [here](https://nlesc-my.sharepoint.com/:f:/g/personal/z_bai_esciencecenter_nl/Eig3gDDIRvRAgz9LzP7br1kBVa9e8vMQu6s6y9GDBmDsOQ?e=9zdHF2).

### Step 2: Import CSV files to Neo4j Database

1. Neo4j provides a [fully-managed cloud service](https://neo4j.com/cloud/platform/aura-graph-database/?ref=nav-get-started-cta). Each user gets one free AuraDB instance, limited to 20,000 nodes and 40,000 relationships.

2. You can also download [Neo4j Desktop](https://neo4j.com/download/) (recommended), which has no size limit. After installing it, add a *Project* and *Start* it.
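
To judge whether an export fits within the free AuraDB limits quoted above, you can estimate the graph size before importing. This is a rough sketch and not part of `rcn_py`. It assumes that the Scopus export has an `Author(s) ID` column with semicolon-separated IDs, and it counts one node per distinct author, one node per paper, and one relationship per author-paper pair. The limits are the ones stated in this section and may have changed since.

```python
# Limits for the free AuraDB tier as quoted in this tutorial (may be outdated)
MAX_NODES = 20_000
MAX_RELATIONSHIPS = 40_000


def estimate_graph_size(rows, id_column="Author(s) ID", sep=";"):
    """Return (nodes, relationships) for an author-publication graph."""
    authors = set()
    publications = 0
    relationships = 0
    for row in rows:
        ids = {part.strip() for part in (row.get(id_column) or "").split(sep)}
        ids.discard("")  # trailing separators produce empty strings
        authors |= ids
        publications += 1
        relationships += len(ids)
    return len(authors) + publications, relationships


def fits_free_tier(nodes, relationships):
    """Check the estimate against the limits defined above."""
    return nodes <= MAX_NODES and relationships <= MAX_RELATIONSHIPS
```

`rows` can be any iterable of dictionaries, for example a `csv.DictReader` opened on one of the exported files.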


In [1]:
import os
import sys
sys.path.append("..")
from rcn_py import neo4j_scopus
from rcn_py import neo4j_rsd
from neo4j import GraphDatabase
import pandas as pd

Connect to your Neo4j DB server using your own **uri, username** and **password**.

In [2]:
# Local Neo4j example (e.g. Neo4j Desktop on the default Bolt port)
uri = "bolt://localhost:7687"
user = "neo4j"
password = "zhiningbai"
# uri = "your URI"
# user = "your username"
# password = "your password"
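
Hard-coding a password in a notebook makes it easy to leak when the notebook is shared. One alternative is to read the settings from environment variables. The sketch below is only a suggestion: the function name and the variable names `NEO4J_URI`, `NEO4J_USERNAME` and `NEO4J_PASSWORD` are a naming choice made for this example, not something the Neo4j driver reads by itself.

```python
import os


def load_neo4j_settings(env=None):
    """Read connection settings from environment variables, with local defaults."""
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USERNAME", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    if not password:
        # Refuse to guess a password; the caller has to provide one
        raise ValueError("Set NEO4J_PASSWORD before connecting")
    return uri, user, password
```

The result can be unpacked directly with `uri, user, password = load_neo4j_settings()`, so the rest of the notebook stays unchanged.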

Check for connection

In [3]:
# verify_connectivity() raises an exception if the server cannot be reached
with GraphDatabase.driver(uri, auth=(user, password)) as check_verify:
    check_verify.verify_connectivity()

#### 2.1 Scopus data storage

Change the following to your CSV file path.

If you downloaded the CSV file from Scopus by filtering on *Subject area*, set `subject` below to that subject area.

In [3]:
filepath = "/Users/jennifer/scopus_data/year2022/Medicine1.csv"
subject = "Medicine"
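
When an export is split into several files, typing the subject by hand for each file becomes repetitive. Assuming the files follow the naming pattern of the example path (`<Subject><chunk number>.csv`, which is only a convention inferred from `Medicine1.csv`), the subject could be derived from the file name. The helper name below is hypothetical.

```python
import re
from pathlib import Path


def subject_from_filename(filepath):
    """Drop the extension and any trailing chunk number from the file name."""
    stem = Path(filepath).stem
    # Remove digits at the end of the name, then surrounding whitespace
    return re.sub(r"\d+$", "", stem).strip()
```

For the path used above this gives `"Medicine"`. File names that do not follow the assumed pattern are returned without their extension and should be checked by hand.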

Create Constraints

In [5]:
driver = GraphDatabase.driver(uri, auth=(user, password))
# session = driver.session(database="neo4j")
# session.execute_write(neo4j_scopus.add_constraint) 

# with GraphDatabase.driver(uri, auth=(user, password)) as driver:
with driver.session(database="neo4j") as session:
    session.execute_write(neo4j_scopus.add_constraint) 

Insert the person nodes, publication nodes and authorship edges into the Neo4j DB

In [4]:
with GraphDatabase.driver(uri, auth=(user, password)) as driver:
    driver.verify_connectivity()
    with driver.session(database="neo4j") as session:
        # Create nodes & edges
        if os.path.exists(filepath):
            # To skip malformed lines (a very rare occurrence), replace the next line with:
            # df = pd.read_csv(filepath, on_bad_lines='skip')
            df = pd.read_csv(filepath)
                    
            # Create "Person" nodes (scopus_id, name, affiliation, country, keywords, year, subject)
            session.execute_write(neo4j_scopus.neo4j_create_people, df, subject) 
            # Create "Publication" nodes (doi, title, year, cited, keywords, subject)
            session.execute_write(neo4j_scopus.neo4j_create_publication, df, subject)
            # Create Relationship "IS_AUTHOR_OF" (scopus_id, doi, author_name, title, year)
            session.execute_write(neo4j_scopus.neo4j_create_author_pub_edge, df)
            print ("Successfully insert " + subject + " csv file.")  
        else:
            print("The file path does not exist!") 

Successfully insert Medicine csv file.


##### Now you can find data in your Neo4j DB.
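
The database now holds a two-mode graph: people linked to publications through `IS_AUTHOR_OF`. A co-authorship network is the projection of that graph onto people, where two authors are connected if they share at least one publication. The sketch below illustrates this projection in plain Python on toy data. It is not how `rcn_py` or the web application computes it, and the function name and input shape are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def coauthor_weights(authorship):
    """Map each unordered author pair to the number of papers they share.

    `authorship` maps a publication identifier to an iterable of author identifiers.
    """
    weights = Counter()
    for authors in authorship.values():
        # sorted() makes (a, b) and (b, a) the same key; set() drops duplicates
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights


# Toy data: two papers, three authors
papers = {
    "doi:1": ["alice", "bob", "carol"],
    "doi:2": ["alice", "bob"],
}
weights = coauthor_weights(papers)
```

The resulting counts can serve as edge weights, so that pairs who publish together more often are drawn with a stronger link.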

Close DB connection if necessary

In [None]:
session.close()
driver.close()

#### 2.2 [Research Software Directory (RSD)](https://research-software-directory.org/) data storage

In [6]:
projects, authors_proj, software, contributor_soft = neo4j_rsd.request_rsd_data()

#### Because the Scopus request limit is strict, I saved the `get_scopus_info_from_orcid` results to local JSON files

So first, let's load the response data (saved as dictionaries) from these files.

In [9]:
# load json module
import json


In [10]:
# Each file holds one dictionary saved from earlier Scopus lookups
with open('../json/rsd_scopus_id_dict.json') as scopus_id_file:
    scopus_id_dict = json.load(scopus_id_file)
with open('../json/rsd_preferred_name_dict.json') as preferred_name_file:
    preferred_name_dict = json.load(preferred_name_file)
with open('../json/rsd_profilelink_dict.json') as profile_link_file:
    author_link_dict = json.load(profile_link_file)
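
The caching approach described above can be sketched as follows. `lookup_with_cache` and its `fetch` argument are hypothetical names: `fetch` stands in for whatever function performs the rate-limited Scopus request (for example a wrapper around `get_scopus_info_from_orcid`, whose signature is not shown in this notebook).

```python
import json
from pathlib import Path


def lookup_with_cache(key, fetch, cache_path):
    """Return the cached value for `key`, calling `fetch(key)` only on a cache miss."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if key not in cache:
        cache[key] = fetch(key)  # the only place where the remote API is hit
        path.write_text(json.dumps(cache, indent=2))
    return cache[key]
```

Rewriting the whole file on every miss is simple but slow for large caches. For a few thousand ORCIDs it is usually acceptable, and it means an interrupted run loses at most one lookup.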

OK, now that we have the Scopus-related info retrieved by ORCID, let's save the data to our Neo4j DB.

### Run only once:

In [11]:
with GraphDatabase.driver(uri, auth=(user, password)) as driver:
    driver.verify_connectivity()
    with driver.session(database="neo4j") as session:
        # Start creating nodes & edges
        
        # Create "Person" nodes (scopus_id, orcid, name, affiliation)
        session.execute_write(neo4j_rsd.create_person_nodes, authors_proj, scopus_id_dict, preferred_name_dict, author_link_dict) 
        session.execute_write(neo4j_rsd.create_person_nodes, contributor_soft, scopus_id_dict, preferred_name_dict, author_link_dict) 

        # Create "Project" nodes (project_id, title, year, description)
        session.execute_write(neo4j_rsd.create_project_nodes, projects)
        # Create "Software" nodes (software_id, doi, brand_name, year, description)
        session.execute_write(neo4j_rsd.create_software_nodes, software)

        # Create Relationship "IS_AUTHOR_OF" 
        # (scopus_id, project_id/software_id, author_name, title, year)
        session.execute_write(neo4j_rsd.create_author_project_edge, authors_proj, scopus_id_dict, preferred_name_dict, author_link_dict)
        session.execute_write(neo4j_rsd.create_author_software_edge, contributor_soft, scopus_id_dict, preferred_name_dict, author_link_dict)
        

Software nodes added
Author-Project relationship added
Contributor-Software relationship added


Close the driver if it is no longer in use.

In [None]:
session.close()
driver.close()

### Step 3: Read database and map the network

These are the components of our Web Application:

| Component | Technology |
| --- | --- |
| Application type | Python web application |
| Web framework | Flask (micro web framework) |
| Neo4j database connector | Neo4j Python Driver for Cypher |
| Database | Neo4j server |
| Frontend | jQuery, Bootstrap, d3.js |


#### The visualization currently provides the following main functions:
1. A default display (keyword: "Deep learning", year: 2022)
2. Simple topic search and simple author search (the two searches are completely separate)
3. Information tables and tooltips when clicking on a node

    The tooltip buttons are currently unlabelled:

    a. The top-left button is a "lock" that fixes the node's position. The node will not move even after the tooltip is removed; click the "lock" button again to unlock it.

    b. The top-right button is a "remove" button which, when clicked, removes the node and the nodes that are associated only with that selected node.

    c. The bottom button is an "expand" button; click it to get all other relations of that node.

4. Drag and zoom
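
The "remove" rule from item 3b can be stated precisely: the selected node disappears, together with every neighbour whose only connection is the selected node. The front end is built with d3.js, so the Python below is only an illustration of that rule on an adjacency dictionary, not the application's code; the function name is hypothetical.

```python
def remove_with_exclusive_neighbours(adjacency, node):
    """Return a new adjacency dict without `node` and its exclusive neighbours.

    `adjacency` maps each node to the set of nodes it is linked to.
    """
    neighbours = adjacency.get(node, set())
    # A neighbour is exclusive when the selected node is its only connection
    exclusive = {n for n in neighbours if adjacency.get(n, set()) <= {node}}
    removed = exclusive | {node}
    return {
        n: links - removed
        for n, links in adjacency.items()
        if n not in removed
    }


graph = {
    "a": {"b", "c"},
    "b": {"a"},        # linked only to "a", so it goes when "a" is removed
    "c": {"a", "d"},   # still linked to "d", so it stays
    "d": {"c"},
}
remaining = remove_with_exclusive_neighbours(graph, "a")
```

The original dictionary is left untouched, which makes it easy to offer an undo by keeping the previous state.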

In [21]:
# uri = "bolt://localhost:7687"
# username = "neo4j"
# password = "zhiningbai"

In [21]:
%run ../rcn_d3.py "bolt://localhost:7687" "neo4j" "zhiningbai"

INFO:root:Starting on port 8081, database is at bolt://localhost:7687


 * Serving Flask app 'rcn_d3'
 * Debug mode: off


 * Running on http://127.0.0.1:8081
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [21/Apr/2023 14:53:06] "GET / HTTP/1.1" 304 -
INFO:werkzeug:127.0.0.1 - - [21/Apr/2023 14:53:07] "GET /graph HTTP/1.1" 200 -


##### Or you can run:

In [28]:
!python ../rcn_d3.py "bolt://localhost:7687" "neo4j" "zhiningbai"

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/jennifer/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jennifer/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jennifer/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
INFO:root:Starting on port 8081, database is at bolt://localhost:7687
 * Serving Flask app 'rcn_d3'
 * Debug mode: off
 * Running on http://127.0.0.1:8081
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [10/Apr/2023 03:20:45] "GET / HTTP/1.1" 304 -
INFO:werkzeug:127.0.0.1 - - [10/Apr/2023 03:20:46] "GET /graph HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [10/Apr/2023 03:20:50] "GET /search?keyword=Deep%20learning&year=2014 HTTP/1.1" 200 -
^C

In [17]:


import requests

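MYAPIKEY="3d120b6ddb7d069272dfc2bc68af4028"
conf = "egu"
# Scopus Search API endpoint; the query selects documents by conference name
url = "https://api.elsevier.com/content/search/scopus"
params = {'query': f"CONFNAME({conf})"}

header = {'Accept' : 'application/json', 
          'X-ELS-APIKey' : MYAPIKEY}
resp = requests.get(url, headers=header, params=params)
resp.raise_for_status()  # fail early on HTTP errors such as an invalid API key
results = resp.json()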
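
The Scopus Search API returns results one page at a time (25 items per page in the response shown here), and each response carries a `link` list with `self`, `first`, `next` and `last` entries. A minimal sketch for walking through all pages is given below. The function names are hypothetical, `fetch` is a caller-supplied function that downloads one URL and returns its parsed `search-results` dictionary, and the sketch assumes that the last page simply has no `next` link, which I have not verified against the API documentation.

```python
def find_link(search_results, ref="next"):
    """Return the URL of the link with the given '@ref', or None if absent."""
    for link in search_results.get("link", []):
        if link.get("@ref") == ref:
            return link.get("@href")
    return None


def iter_entries(first_page, fetch):
    """Yield the entries of all pages, following 'next' links.

    `fetch(url)` must return the parsed 'search-results' dictionary for that URL.
    """
    page = first_page
    while page is not None:
        yield from page.get("entry", [])
        next_url = find_link(page, "next")
        page = fetch(next_url) if next_url else None
```

With the request above, `first_page` would be `results['search-results']`, and `fetch` could wrap `requests.get(url, headers=header).json()['search-results']`. Keep the request limits in mind when looping over many pages.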

In [24]:
results['search-results']

{'opensearch:totalResults': '401',
 'opensearch:startIndex': '0',
 'opensearch:itemsPerPage': '25',
 'opensearch:Query': {'@role': 'request',
  '@searchTerms': 'CONFNAME(egu)',
  '@startPage': '0'},
 'link': [{'@_fa': 'true',
   '@ref': 'self',
   '@href': 'https://api.elsevier.com/content/search/scopus?start=0&count=25&query=CONFNAME%28egu%29',
   '@type': 'application/json'},
  {'@_fa': 'true',
   '@ref': 'first',
   '@href': 'https://api.elsevier.com/content/search/scopus?start=0&count=25&query=CONFNAME%28egu%29',
   '@type': 'application/json'},
  {'@_fa': 'true',
   '@ref': 'next',
   '@href': 'https://api.elsevier.com/content/search/scopus?start=25&count=25&query=CONFNAME%28egu%29',
   '@type': 'application/json'},
  {'@_fa': 'true',
   '@ref': 'last',
   '@href': 'https://api.elsevier.com/content/search/scopus?start=376&count=25&query=CONFNAME%28egu%29',
   '@type': 'application/json'}],
 'entry': [{'@_fa': 'true',
   'link': [{'@_fa': 'true',
     '@ref': 'self',
     '@href': 