An example pipeline for building a Scopus co-authorship network
---------------

### Step 1a: Access the Scopus collection and download the paper metadata you are interested in

The Scopus document search module can only be reached from an academic IP address. If your institution subscribes to Scopus, you are allowed to search documents there. The service is not free of charge: your university pays Scopus to make it available to its researchers.

If you do not have an academic IP, skip to [Step 1b](#step-1b) and download the existing CSV files.

**Why Scopus?**	

Scopus offers very comprehensive paper data. In particular, its metadata include authors' affiliations, countries and paper keywords, which are not available on other paper search websites.

**How?**		

1. The number of papers involving Dutch researchers exceeds 50,000 in a single year, which is more than the Scopus API can handle. Therefore, I use the [Scopus Document Search website](https://www.scopus.com/search/form.uri?display=basic#basic) (which requires an academic IP, for example through the UvA VPN). The [Advanced Document Search](https://www.scopus.com/search/form.uri?display=advanced) query string is as follows: 

    `PUBYEAR  >  2012  AND  PUBYEAR  <  2024  AND  (  LIMIT-TO ( OA ,  "all" ) )  AND  ( LIMIT-TO ( AFFILCOUNTRY ,  "Netherlands" ) )  AND  ( LIMIT-TO ( PUBSTAGE ,  "final" ) )  AND  ( LIMIT-TO ( PUBYEAR ,  2022 ) )  AND  ( LIMIT-TO ( LANGUAGE ,  "English" ) )`
    
    This query returns open-access, English-language papers from 2022 in their final publication stage that have at least one author working at a Dutch institution. The authors in the resulting data are therefore mostly Dutch researchers, plus researchers from other countries who have collaborated with them.

2. Limit the data scope by choosing one particular **Subject area** on the web page, then click the *CSV export* button and select the information you want to export. In this project, the following information is used:

    <img src=../images/scopus_export_setting.png width=50% />
    
    (*Export restrictions*: If more than 2,000 papers are selected, the Affiliations and Author Keywords fields are not available, so split the data into CSV files containing no more than 2,000 papers each.)
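
Because of the export cap, one subject area may end up spread over several CSV files. The following is a minimal sketch of how such pieces could be recombined with pandas before import. The function name `combine_scopus_exports` is hypothetical, and using `EID` as the de-duplication column is an assumption about your export settings; if the column is absent, the sketch falls back to dropping fully identical rows.

```python
import pandas as pd


def combine_scopus_exports(paths, id_column="EID"):
    """Concatenate several Scopus CSV exports into one DataFrame.

    id_column is assumed to identify a paper uniquely; "EID" is a guess
    about the export layout, so check the header of your own files.
    """
    frames = [pd.read_csv(path) for path in paths]
    if not frames:
        raise ValueError("no CSV files given")
    combined = pd.concat(frames, ignore_index=True)
    if id_column in combined.columns:
        # The same paper can appear in two overlapping exports.
        combined = combined.drop_duplicates(subset=id_column)
    else:
        combined = combined.drop_duplicates()
    return combined.reset_index(drop=True)


# Example call (hypothetical file names):
# df = combine_scopus_exports(["medicine_part1.csv", "medicine_part2.csv"])
```

The combined DataFrame can then be written back with `df.to_csv(...)` or passed on directly, depending on how you prefer to feed the import step.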


### Step 1b (Optional): Get the pre-downloaded CSV files containing papers with Dutch researchers from the last ten years

The metadata for Scopus papers from 2022 with Dutch researchers are available [here](https://nlesc-my.sharepoint.com/:f:/g/personal/z_bai_esciencecenter_nl/Eig3gDDIRvRAgz9LzP7br1kBVa9e8vMQu6s6y9GDBmDsOQ?e=9zdHF2).

### Step 2: Import CSV files to Neo4j Database

1. Neo4j provides a [fully-managed cloud service](https://neo4j.com/cloud/platform/aura-graph-database/?ref=nav-get-started-cta). Each user gets one free AuraDB instance, limited to 20,000 nodes and 40,000 relationships.

2. Alternatively, download [Neo4j Desktop](https://neo4j.com/download/) (recommended), which has no size limit. Then add a *Project* and *Start* it.
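
To decide between the two options, it can help to estimate how large the graph for one export will be. The sketch below counts one node per paper, one node per distinct author and one relationship per (author, paper) pair. It assumes that each paper's author IDs sit in a single semicolon-separated cell (for example the `Author(s) ID` column of a Scopus export, which is an assumption about your files). The limits are the figures quoted above and may differ from Neo4j's current terms.

```python
# Limits as quoted in this document; verify against Neo4j's current terms.
AURA_FREE_MAX_NODES = 20_000
AURA_FREE_MAX_RELATIONSHIPS = 40_000


def estimate_graph_size(author_id_cells, separator=";"):
    """Return (n_nodes, n_relationships) for one export.

    author_id_cells: one entry per paper, each a string of author IDs
    joined by `separator` (assumed layout). Non-string entries such as
    NaN are treated as papers without author IDs.
    """
    authors = set()
    n_papers = 0
    n_relationships = 0
    for cell in author_id_cells:
        n_papers += 1
        if not isinstance(cell, str):
            continue
        ids = {part.strip() for part in cell.split(separator) if part.strip()}
        authors.update(ids)
        n_relationships += len(ids)
    return len(authors) + n_papers, n_relationships


def fits_aura_free(n_nodes, n_relationships):
    """Check the estimate against the free-tier limits quoted above."""
    return (n_nodes <= AURA_FREE_MAX_NODES
            and n_relationships <= AURA_FREE_MAX_RELATIONSHIPS)


# Example call (column name is an assumption):
# n_nodes, n_rels = estimate_graph_size(df["Author(s) ID"])
```

This only covers Person and Publication nodes and authorship edges; any additional node types would add to the total.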


In [1]:
import os
import sys
sys.path.append("..")
from rcn_py import neo4j_scopus
from rcn_py import neo4j_rsd
from neo4j import GraphDatabase
import pandas as pd

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/jennifer/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jennifer/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jennifer/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Connect to your Neo4j DB server using your own **URI, username** and **password**.

In [None]:
# local Neo4j Desktop example
# uri = "bolt://localhost:7687"
# user = "neo4j"
# password = "zhiningbai"
uri = "your URI"
user = "your username"
password = "your password"

Check for connection

In [None]:
check_verify = GraphDatabase.driver(uri, auth=(user, password))
check_verify.verify_connectivity()
# Release the test driver; the cells below open their own connection
check_verify.close()
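
If you prefer not to type credentials into the notebook, one option is to read them from environment variables. This is a sketch only: the variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` and the helper `load_neo4j_credentials` are hypothetical and not part of `rcn_py`.

```python
import os


def load_neo4j_credentials(env=os.environ):
    """Read (uri, user, password) from environment variables.

    The variable names are hypothetical choices for this sketch.
    """
    names = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)


# Example call, after exporting the three variables in your shell:
# uri, user, password = load_neo4j_credentials()
```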

#### 2.1 Scopus data storage

Change the following to your CSV file path.

If you downloaded the CSV file from Scopus by filtering on *Subject area*, enter that *Subject* below.

In [None]:
filepath = "The csv file path that you want to insert into the database"
subject = "The official scopus subject classification if available"

Create Constraints

In [None]:
driver = GraphDatabase.driver(uri, auth=(user, password))
session = driver.session(database="neo4j")
session.write_transaction(neo4j_scopus.add_constraint) 
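
The body of `neo4j_scopus.add_constraint` is not shown in this notebook. As an illustration of what a transaction function of this kind might look like, here is a sketch that creates uniqueness constraints on the identifying properties mentioned in this document (`scopus_id` for `Person`, `doi` for `Publication`). The function name, constraint names and Cypher text are assumptions, and the `FOR ... REQUIRE` syntax needs Neo4j 4.4 or newer.

```python
# Hypothetical constraint statements; not taken from rcn_py.
CONSTRAINT_QUERIES = [
    "CREATE CONSTRAINT person_scopus_id IF NOT EXISTS "
    "FOR (p:Person) REQUIRE p.scopus_id IS UNIQUE",
    "CREATE CONSTRAINT publication_doi IF NOT EXISTS "
    "FOR (p:Publication) REQUIRE p.doi IS UNIQUE",
]


def add_uniqueness_constraints(tx):
    """Transaction function: tx is any object with a run(query) method,
    such as the transaction the Neo4j driver passes in."""
    for query in CONSTRAINT_QUERIES:
        tx.run(query)


# Example call, mirroring the cell above:
# session.write_transaction(add_uniqueness_constraints)
```

With `IF NOT EXISTS`, rerunning the cell should not fail when the constraints are already in place.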

Insert Person nodes, Publication nodes and authorship edges into the Neo4j DB

In [None]:
with GraphDatabase.driver(uri, auth=(user, password)) as driver:
    driver.verify_connectivity()
    with driver.session(database="neo4j") as session:
        # Create nodes & edges
        if os.path.exists(filepath):
            # To skip bad lines (a very rare occurrence), replace the next line with:
            # df = pd.read_csv(filepath, on_bad_lines="skip")
            df = pd.read_csv(filepath)
                    
            # Create "Person" nodes (scopus_id, name, affiliation, country, keywords, year, subject)
            session.write_transaction(neo4j_scopus.neo4j_create_people, df, subject) 
            # Create "Publication" nodes (doi, title, year, cited, keywords, subject)
            session.write_transaction(neo4j_scopus.neo4j_create_publication, df, subject)
            # Create Relationship "IS_AUTHOR_OF" (scopus_id, doi, author_name, title, year)
            session.write_transaction(neo4j_scopus.neo4j_create_author_pub_edge, df)
            print("Successfully inserted " + subject + " csv file.")
        else:
            print("The file path does not exist!") 
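
The comment in the cell above mentions `on_bad_lines` for malformed rows. As a small illustration of what that option does (it requires pandas 1.3 or newer), the sketch below parses a made-up CSV snippet in which one row has an extra field. The data are invented for the demonstration and are not a real Scopus export.

```python
import io

import pandas as pd

# Invented stand-in for an export; the third data row has one field too many.
raw = io.StringIO(
    "Title,Year\n"
    "Paper A,2022\n"
    "Paper B,2022\n"
    "Paper C,2022,unexpected extra field\n"
    "Paper D,2022\n"
)

# With on_bad_lines="skip", rows with too many fields are dropped
# instead of raising pandas.errors.ParserError.
df_demo = pd.read_csv(raw, on_bad_lines="skip")
print(df_demo["Title"].tolist())
```

Skipped rows are lost silently, so it is worth comparing the row count with the number of papers you exported.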

##### Now you can find data in your Neo4j DB.

Close DB connection if necessary

In [None]:
session.close()
driver.close()

#### 2.2 [Research Software Directory (RSD)](https://research-software-directory.org/) data storage

In [None]:
projects, authors_proj, software, contributor_soft = neo4j_rsd.request_rsd_data()

Run only once.

In [None]:
with GraphDatabase.driver(uri, auth=(user, password)) as driver:
    driver.verify_connectivity()
    with driver.session(database="neo4j") as session:
        # Start creating nodes & edges
        
        # Create "Person" nodes (scopus_id, orcid, name, affiliation)
        session.write_transaction(neo4j_rsd.create_person_nodes, authors_proj) 
        session.write_transaction(neo4j_rsd.create_person_nodes, contributor_soft) 

        # Create "Project" nodes (project_id, title, year, description)
        session.write_transaction(neo4j_rsd.create_project_nodes, projects)
        # Create "Software" nodes (software_id, doi, brand_name, year, description)
        session.write_transaction(neo4j_rsd.create_software_nodes, software)

        # Create Relationship "IS_AUTHOR_OF" 
        # (scopus_id, project_id/software_id, author_name, title, year)
        session.write_transaction(neo4j_rsd.create_author_project_edge, authors_proj)
        session.write_transaction(neo4j_rsd.create_author_software_edge, contributor_soft)
        

Close the driver if it is no longer in use.

In [None]:
session.close()
driver.close()

### Step 3: Read database and map the network

These are the components of our Web Application:

| Component | Technology |
| --- | --- |
| Application Type | Python web application |
| Web framework | Flask (micro web framework) |
| Neo4j Database Connector | Neo4j Python Driver (Cypher) |
| Database | Neo4j Server |
| Frontend | jQuery, Bootstrap, d3.js |
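
The code of `rcn_d3.py` is not shown here. To make the mapping step more concrete, the sketch below projects (author, paper) pairs, such as those stored as `IS_AUTHOR_OF` relationships, onto a co-authorship graph in the nodes/links layout that d3.js force layouts commonly consume. The function name and the output field names (`id`, `source`, `target`, `value`) are assumptions and not necessarily what the application uses.

```python
from collections import Counter
from itertools import combinations


def coauthor_graph(authorships):
    """Project (author, paper) pairs onto a co-authorship graph.

    Returns {"nodes": [...], "links": [...]}; a link's "value" is the
    number of papers the two authors share.
    """
    authors_by_paper = {}
    all_authors = set()
    for author, paper in authorships:
        authors_by_paper.setdefault(paper, set()).add(author)
        all_authors.add(author)

    weights = Counter()
    for authors in authors_by_paper.values():
        # Every unordered pair of authors on a paper is one co-authorship.
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1

    nodes = [{"id": author} for author in sorted(all_authors)]
    links = [
        {"source": a, "target": b, "value": count}
        for (a, b), count in sorted(weights.items())
    ]
    return {"nodes": nodes, "links": links}


# Example call with invented data:
# coauthor_graph([("A", "p1"), ("B", "p1"), ("A", "p2"), ("B", "p2")])
```

In a Flask application, a dictionary of this shape could be returned as JSON from a route and handed to the d3.js frontend.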

In [21]:
# uri = "bolt://localhost:7687"
# username = "neo4j"
# password = "zhiningbai"

In [27]:
%run ../rcn_d3.py "bolt://localhost:7687" "neo4j" "zhiningbai"

INFO:root:Starting on port 8081, database is at bolt://localhost:7687


 * Serving Flask app 'rcn_d3'
 * Debug mode: off


 * Running on http://127.0.0.1:8081
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [10/Apr/2023 03:19:51] "GET / HTTP/1.1" 304 -
INFO:werkzeug:127.0.0.1 - - [10/Apr/2023 03:19:51] "GET /graph HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [10/Apr/2023 03:19:56] "GET /search?keyword=Deep%20learning&year=2014 HTTP/1.1" 200 -


Or:

In [28]:
!python ../rcn_d3.py "bolt://localhost:7687" "neo4j" "zhiningbai"

INFO:root:Starting on port 8081, database is at bolt://localhost:7687
 * Serving Flask app 'rcn_d3'
 * Debug mode: off
 * Running on http://127.0.0.1:8081
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [10/Apr/2023 03:20:45] "GET / HTTP/1.1" 304 -
INFO:werkzeug:127.0.0.1 - - [10/Apr/2023 03:20:46] "GET /graph HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [10/Apr/2023 03:20:50] "GET /search?keyword=Deep%20learning&year=2014 HTTP/1.1" 200 -
^C
Exception ignored in: <function Driver.__del__ at 0x7f8292337f70>
Traceback (most recent call l