An example pipeline for building a Scopus co-authorship network
---------------

### Step 1a: Access the Scopus collection and download the paper metadata you are interested in

The Scopus document search module is only reachable from an academic IP address. If your institution is listed in the Scopus database, you have permission to search documents in Scopus. The service is not free of charge: your university pays Scopus to provide it to its academic researchers.

If you do not have an academic IP, skip to [Step 1b](#step-1b) to download the existing CSV files.

**Why Scopus?**	

Scopus has comprehensive paper data. In particular, its metadata includes authors' affiliations, countries, and paper keywords, which many other paper search websites do not provide.

**How?**		

1. More than 50,000 papers involve Dutch researchers in a single year, which is more than the Scopus API is suited to handle. Therefore, I use the [Scopus Document Search website](https://www.scopus.com/search/form.uri?display=basic#basic) (which requires an academic IP, such as the UvA VPN). The [Advanced Document Search](https://www.scopus.com/search/form.uri?display=advanced) query string is as follows:

    `PUBYEAR  >  2012  AND  PUBYEAR  <  2024  AND  (  LIMIT-TO ( OA ,  "all" ) )  AND  ( LIMIT-TO ( AFFILCOUNTRY ,  "Netherlands" ) )  AND  ( LIMIT-TO ( PUBSTAGE ,  "final" ) )  AND  ( LIMIT-TO ( PUBYEAR ,  2022 ) )  AND  ( LIMIT-TO ( LANGUAGE ,  "English" ) )`
    
    This query returns the 2022 papers that have at least one author working at a Dutch institution, so the authors in the resulting data are mostly Dutch researchers, plus researchers from other countries who have collaborated with them.

2. Limit the data scope by choosing one particular **Subject area** on the webpage, then click the *CSV export* button and select the information that you want to export. In this project, the following information will be used:

    <img src=../images/scopus_export_setting.png width=50% />
    
    (*Export restrictions*: If more than 2,000 papers are selected, the Affiliations and Author Keywords fields are not available, so split the data into CSV files of at most 2,000 papers each.)
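
Because the export has to be split into several files, it can be convenient to recombine them locally afterwards. The following is a minimal sketch of one way to do that with the standard library; the helper name `merge_scopus_exports` is hypothetical, and it assumes every exported file was produced with the same export settings and therefore has an identical header row.

```python
import csv


def merge_scopus_exports(csv_paths, merged_path):
    """Concatenate several CSV exports that share one header row.

    Returns the number of data rows written (header excluded).
    """
    header = None
    rows_written = 0
    with open(merged_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in csv_paths:
            # utf-8-sig drops a byte-order mark if the file starts with one
            with open(path, newline="", encoding="utf-8-sig") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For example, `merge_scopus_exports(["Medicine1.csv", "Medicine2.csv"], "Medicine.csv")` would write one combined file, where the file names are placeholders for your own exports.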


### Step 1b (Optional): Get the downloaded CSV files that contain the last ten years of papers with Dutch researchers

You can get the metadata for the 2022 Scopus papers with Dutch researchers [here](https://nlesc-my.sharepoint.com/:f:/g/personal/z_bai_esciencecenter_nl/Eig3gDDIRvRAgz9LzP7br1kBVa9e8vMQu6s6y9GDBmDsOQ?e=9zdHF2).

### Step 2: Import CSV files to Neo4j Database

1. Neo4j provides a [fully managed cloud service](https://neo4j.com/cloud/platform/aura-graph-database/?ref=nav-get-started-cta). Each user gets one free AuraDB instance, limited to 20,000 nodes and 40,000 relationships.

2. You can also download [Neo4j Desktop](https://neo4j.com/download/) (recommended), which has no size limit. After installing it, add a *Project* and *Start* it.


In [1]:
import os
import sys
sys.path.append("..")
from rcn_py import neo4j_scopus
from rcn_py import neo4j_rsd
from rcn_py import openalex
from rcn_py import topic_modeling
from neo4j import GraphDatabase
import pandas as pd

Connect to your Neo4j DB server using your own **uri**, **username**, and **password**.

In [None]:
# local Neo4j example
uri = "bolt://localhost:7687"
user = "neo4j"
password = "zhiningbai"
# uri = "your URI"
# user = "your username"
# password = "your password"
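
To avoid keeping a password in the notebook itself, the connection settings could instead be read from environment variables. This is only a sketch: the helper `load_neo4j_credentials` and the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` are assumptions made for illustration, not something the `rcn_py` package defines.

```python
import os


def load_neo4j_credentials(env=os.environ):
    """Read connection settings from environment variables.

    The URI and user fall back to the local defaults used in this notebook;
    the password has no default and must be set explicitly.
    """
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    if password is None:
        raise RuntimeError("Set NEO4J_PASSWORD before connecting")
    return uri, user, password
```

With the variables set in your shell, `uri, user, password = load_neo4j_credentials()` yields the three values used in the cells that follow.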

Check for connection

In [None]:
# The context manager closes the test driver once the check is done
with GraphDatabase.driver(uri, auth=(user, password)) as check_verify:
    check_verify.verify_connectivity()

#### 2.1 Scopus data storage

Change the following to your CSV file path.

If you downloaded the CSV file from Scopus by filtering on *Subject area*, enter that subject below.

In [None]:
filepath = "/Users/jennifer/scopus_data/year2022/Medicine1.csv"
subject = "Medicine"

Create Constraints

In [None]:
driver = GraphDatabase.driver(uri, auth=(user, password))
# session = driver.session(database="neo4j")
# session.execute_write(neo4j_scopus.add_constraint) 

# with GraphDatabase.driver(uri, auth=(user, password)) as driver:
with driver.session(database="neo4j") as session:
    session.execute_write(neo4j_scopus.add_constraint) 
# have to do the same thing for rsd!
# CREATE CONSTRAINT orcid IF NOT EXISTS FOR (p:Person) REQUIRE p.orcid IS UNIQUE
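
For readers who want to see what such a constraint step can look like, here is a hedged sketch of a transaction function in the same style as the commented Cypher statement above. It is not the actual body of `neo4j_scopus.add_constraint`; the function name, the constraint names, and the choice of `scopus_id` and `doi` as unique keys are assumptions based on the node properties listed in this notebook.

```python
def add_unique_constraints(tx):
    """Create uniqueness constraints for Person and Publication nodes.

    `tx` is expected to be a Neo4j transaction, but any object with a
    `run(statement)` method works, which keeps the function easy to test.
    """
    statements = [
        # Assumed key: one Person node per Scopus author ID
        "CREATE CONSTRAINT person_scopus_id IF NOT EXISTS "
        "FOR (p:Person) REQUIRE p.scopus_id IS UNIQUE",
        # Assumed key: one Publication node per DOI
        "CREATE CONSTRAINT publication_doi IF NOT EXISTS "
        "FOR (n:Publication) REQUIRE n.doi IS UNIQUE",
    ]
    for statement in statements:
        tx.run(statement)
    return statements
```

It would be passed to the session in the same way as above, for example `session.execute_write(add_unique_constraints)`.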

Insert person nodes, publication nodes, and authorship edges into the Neo4j DB

In [None]:
with GraphDatabase.driver(uri, auth=(user, password)) as driver:
    driver.verify_connectivity()
    with driver.session(database="neo4j") as session:
        # Create nodes & edges
        if os.path.exists(filepath):
            # Malformed lines are very rare; to skip them, use this line instead:
            # df = pd.read_csv(filepath, on_bad_lines='skip')
            df = pd.read_csv(filepath)

            # Create "Person" nodes (scopus_id, name, affiliation, country, keywords, year, subject)
            session.execute_write(neo4j_scopus.neo4j_create_people, df, subject)
            # Create "Publication" nodes (doi, title, year, cited, keywords, subject)
            session.execute_write(neo4j_scopus.neo4j_create_publication, df, subject)
            # Create Relationship "IS_AUTHOR_OF" (scopus_id, doi, author_name, title, year)
            session.execute_write(neo4j_scopus.neo4j_create_author_pub_edge, df)
            print("Successfully inserted " + subject + " csv file.")
        else:
            print("The file path does not exist!")
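
The co-authorship links implied by these authorship edges can also be previewed without a database. The sketch below counts how often two author IDs share a paper. It assumes the export has a column named `Author(s) ID` whose values are separated by `;`; check the header of your own CSV file, since both are assumptions and the helper `coauthor_pairs` is not part of `rcn_py`.

```python
from collections import Counter
from itertools import combinations


def coauthor_pairs(rows, id_column="Author(s) ID", separator=";"):
    """Count how often each pair of author IDs appears on the same paper.

    `rows` is an iterable of dicts, e.g. from csv.DictReader.
    """
    pair_counts = Counter()
    for row in rows:
        raw_ids = row.get(id_column) or ""
        # Deduplicate and sort so (a, b) and (b, a) count as the same pair
        author_ids = sorted(
            {part.strip() for part in raw_ids.split(separator) if part.strip()}
        )
        pair_counts.update(combinations(author_ids, 2))
    return pair_counts
```

With pandas, the rows can be supplied as `df.to_dict("records")`; the resulting counts give a rough edge weight for each co-author pair.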

##### Now you can find data in your Neo4j DB.

Close DB connection if necessary

In [None]:
session.close()
driver.close()

#### 2.2 [Research Software Directory (RSD)](https://research-software-directory.org/) data storage

In [None]:
projects, authors_proj, software, contributor_soft = neo4j_rsd.request_rsd_data()

#### Due to the strict Scopus limit (not very friendly), I saved the "get_scopus_info_from_orcid" results to a local JSON file

So first let's get the response data (which are saved as dictionaries) from the files

In [None]:
# load json module
import json


In [None]:
# json.load parses each file directly into a dictionary
with open('../json/rsd_scopus_id_dict.json') as scopus_id_file:
    scopus_id_dict = json.load(scopus_id_file)
with open('../json/rsd_preferred_name_dict.json') as preferred_name_file:
    preferred_name_dict = json.load(preferred_name_file)
with open('../json/rsd_profilelink_dict.json') as profile_link_file:
    author_link_dict = json.load(profile_link_file)
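
The same save-then-reload idea can be wrapped in a small helper so that a rate-limited lookup runs only when no cached file exists yet. This is a sketch of the general pattern, not the code that produced the JSON files above; `load_or_fetch` and its `fetch` callback are hypothetical names.

```python
import json
import os


def load_or_fetch(cache_path, fetch):
    """Return cached JSON data if the file exists; otherwise call fetch() and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as cache_file:
            return json.load(cache_file)
    data = fetch()  # the expensive or rate-limited call
    with open(cache_path, "w", encoding="utf-8") as cache_file:
        json.dump(data, cache_file)
    return data
```

The first call pays for the lookup and writes the file; later calls read the file and never invoke `fetch`.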

Add constraints

In [None]:
driver = GraphDatabase.driver(uri, auth=(user, password))

with driver.session(database="rsd") as session:
    session.execute_write(neo4j_rsd.rsd_add_constraint) 


OK, now that we have the Scopus-related info looked up by ORCID, let's save the data to our Neo4j DB.

### Run only once:

In [None]:
with GraphDatabase.driver(uri, auth=(user, password)) as driver:
    driver.verify_connectivity()
    with driver.session(database="rsd") as session:
        # Start creating nodes & edges
        
        # Create "Person" nodes (scopus_id, orcid, name, affiliation)
        session.execute_write(neo4j_rsd.create_person_nodes, authors_proj, scopus_id_dict, preferred_name_dict, author_link_dict) 
        session.execute_write(neo4j_rsd.create_person_nodes, contributor_soft, scopus_id_dict, preferred_name_dict, author_link_dict) 

        # Create "Project" nodes (project_id, title, year, description)
        session.execute_write(neo4j_rsd.create_project_nodes, projects)
        # Create "Software" nodes (software_id, doi, brand_name, year, description)
        session.execute_write(neo4j_rsd.create_software_nodes, software)

        # Create Relationship "IS_AUTHOR_OF" 
        # (scopus_id, project_id/software_id, author_name, title, year)
        session.execute_write(neo4j_rsd.create_author_project_edge, authors_proj, scopus_id_dict, preferred_name_dict, author_link_dict)
        session.execute_write(neo4j_rsd.create_author_software_edge, contributor_soft, scopus_id_dict, preferred_name_dict, author_link_dict)
        

Close the driver if it is no longer in use.

In [None]:
session.close()
driver.close()

### Step 3: Read database and map the network

These are the components of our Web Application:

|  |  |
| --- | --- |
| Application Type | Python-Web Application |
| Web framework | Flask (Micro-Webframework)|
| Neo4j Database Connector | Neo4j Python Driver for Cypher Docs |
| Database | Neo4j-Server |
| Frontend | jquery, bootstrap, d3.js |


#### The visualization currently provides the following main functions:
1. A default display (keyword: "Deep learning", year: 2022)
2. Simple topic search and simple author search (the two searches are completely separate)
3. Information tables and tooltips when clicking on a node

    The tooltip buttons are currently unlabelled:

    - Top left: a "lock" button that locks the node's position. The node stays in place even after the tooltip is removed; click the "lock" button again to unlock it.
    - Top right: a "remove" button that removes the node together with the nodes associated only with that node.
    - Bottom: an "expand" button that fetches all other relations for that node.

4. Drag and zoom
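
A d3.js force layout is commonly fed a JSON object with a `nodes` array and a `links` array whose `source` and `target` refer to node positions. The sketch below shows how such a structure could be built from co-author pairs. It is an assumption about the shape of the data, not the actual response format of `rcn_d3.py`, and `to_node_link` is a hypothetical helper.

```python
def to_node_link(edges):
    """Convert (source, target) name pairs into a nodes/links dictionary.

    Each distinct name becomes one node; links refer to nodes by index.
    """
    index = {}  # name -> position in the nodes list
    nodes = []
    links = []
    for source, target in edges:
        for name in (source, target):
            if name not in index:
                index[name] = len(nodes)
                nodes.append({"id": index[name], "name": name})
        links.append({"source": index[source], "target": index[target]})
    return {"nodes": nodes, "links": links}
```

The result can be serialized with `json.dumps` and returned from a web endpoint for the frontend to draw.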

In [None]:
# uri = "bolt://localhost:7687"
# username = "neo4j"
# password = "zhiningbai"

In [None]:
%run ../rcn_d3.py "bolt://localhost:7687" "neo4j" "zhiningbai"

##### Or you can run:

In [None]:
!python ../rcn_d3.py "bolt://localhost:7687" "neo4j" "zhiningbai"

### Let's try OpenAlex

Preload some affiliation-related works

In [3]:
institution_name = "Netherlands eScience Center"
all_works_of_one_institution = openalex.get_all_works_of_one_institution(institution_name)

In [None]:
all_works_of_one_institution

In [4]:
first_work = all_works_of_one_institution[0]

In [6]:
first_work['concepts']

[{'id': 'https://openalex.org/C2777950569',
  'wikidata': 'https://www.wikidata.org/wiki/Q17021836',
  'display_name': 'Stewardship (theology)',
  'level': 3,
  'score': 0.7497333},
 {'id': 'https://openalex.org/C206588197',
  'wikidata': 'https://www.wikidata.org/wiki/Q846574',
  'display_name': 'Reuse',
  'level': 2,
  'score': 0.69452894},
 {'id': 'https://openalex.org/C137981799',
  'wikidata': 'https://www.wikidata.org/wiki/Q1369184',
  'display_name': 'Reusability',
  'level': 3,
  'score': 0.6708406},
 {'id': 'https://openalex.org/C26713055',
  'wikidata': 'https://www.wikidata.org/wiki/Q245962',
  'display_name': 'Implementation',
  'level': 2,
  'score': 0.64504087},
 {'id': 'https://openalex.org/C41008148',
  'wikidata': 'https://www.wikidata.org/wiki/Q21198',
  'display_name': 'Computer science',
  'level': 0,
  'score': 0.5673512},
 {'id': 'https://openalex.org/C177264268',
  'wikidata': 'https://www.wikidata.org/wiki/Q1514741',
  'display_name': 'Set (abstract data type)',

In [None]:
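institution_info_list = []
# Track seen IDs: comparing whole dicts stops matching once 'node_id' is added
seen_inst_ids = set()
node_id = 0  # named node_id to avoid shadowing the built-in id()
for work in all_works_of_one_institution:
    for au in work['authorships']:
        for inst in au['institutions']:
            if inst['id'] and inst['id'] not in seen_inst_ids:
                seen_inst_ids.add(inst['id'])
                inst['node_id'] = node_id
                institution_info_list.append(inst)
                node_id += 1
institution_info_list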

In [None]:
institution_info_list

In [None]:

    
all_institution_info_list = []
for work in all_works_of_one_institution:
    current_work_inst_list = []
    for au in work['authorships']:
        for inst in au['institutions']:
            if inst['id'] and inst not in current_work_inst_list:
                current_work_inst_list.append(inst)
    all_institution_info_list.append(current_work_inst_list)

all_institution_info_list

In [None]:
node_id = 0 # node_id is from 0
all_institution_info_list = [] # Node info (no duplication)
all_inst_id_list = [] # Used to check whether the institution has been saved
for work in all_works_of_one_institution:
    current_work_inst_id_list = []
    for au in work['authorships']:
        for inst in au['institutions']:
            if inst['id']:
                if inst['id'] in all_inst_id_list:
                    # Get the existing node id (list position equals node id)
                    existing_index = all_inst_id_list.index(inst['id'])
                    if existing_index not in current_work_inst_id_list:
                        current_work_inst_id_list.append(existing_index)
                else:
                    inst['node_id'] = node_id
                    all_institution_info_list.append(inst)
                    all_inst_id_list.append(inst['id'])
                    current_work_inst_id_list.append(node_id)

                    node_id += 1
    print(current_work_inst_id_list)
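
The per-work lists of node IDs are the raw material for an institution collaboration network: every pair of institutions on the same work is one edge. A possible sketch follows, using a dictionary for the ID lookup instead of `list.index`. The helper `index_institutions` is hypothetical, and it assumes each work has the `authorships` and `institutions` structure used in the cells above.

```python
from itertools import combinations


def index_institutions(works):
    """Assign a node id to each institution and collect per-work institution pairs."""
    node_id_by_inst = {}  # OpenAlex institution id -> node id
    nodes = []
    edges = []
    for work in works:
        work_node_ids = []
        for authorship in work.get("authorships", []):
            for inst in authorship.get("institutions", []):
                inst_id = inst.get("id")
                if not inst_id:
                    continue
                if inst_id not in node_id_by_inst:
                    node_id_by_inst[inst_id] = len(nodes)
                    # Copy the dict so the input data is left unchanged
                    nodes.append({**inst, "node_id": len(nodes)})
                node_id = node_id_by_inst[inst_id]
                if node_id not in work_node_ids:
                    work_node_ids.append(node_id)
        # Every pair of institutions on one work is a collaboration edge
        edges.extend(combinations(work_node_ids, 2))
    return nodes, edges
```

Calling `nodes, edges = index_institutions(all_works_of_one_institution)` would return the node list and the pairs of node IDs in one pass.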

In [3]:
import gensim
from gensim import corpora

In [17]:
keywords_list = []
works = openalex.get_works_of_one_institution_by_year("Netherlands eScience Center", '2023','2023')
for w in works:
    keywords = openalex.publication_keywords(w['doi'])
    keywords_list.append(keywords)
keywords_list

[['Sustainability',
  'Software',
  'Computer science',
  'Business',
  'Software engineering',
  'Environmental planning',
  'Environmental resource management',
  'Environmental science',
  'Operating system',
  'Biology'],
 ['Ecosystem',
  'Environmental science',
  'Environmental resource management',
  'Psychological resilience',
  'Climate change',
  'Ecosystem services',
  'Resilience (materials science)',
  'Ecology',
  'Biology',
  'Physics'],
 ['Mesoscale meteorology',
  'Cloud computing',
  'Meteorology',
  'Convection',
  'Large eddy simulation',
  'Scale (ratio)',
  'Geology',
  'Atmospheric sciences',
  'Computer science',
  'Geography'],
 ['Transformer',
  'Computer science',
  'Architecture',
  'Deep learning',
  'Artificial intelligence',
  'Proteomics',
  'Mass spectrometry',
  'Machine learning',
  'Engineering',
  'Chemistry'],
 ['Python (programming language)',
  'Scripting language',
  'Upload',
  'Computer science',
  'Software',
  'Download',
  'Programming lang

In [18]:
dictionary = corpora.Dictionary(keywords_list)
corpus = [dictionary.doc2bow(keywords) for keywords in keywords_list]

# Train LDA Model
# num_topics = 10  # Specify the number of topics
lda_model = gensim.models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)

# Get Topics and Assign Topic Names
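num_words = 5  # Specify the number of words per topic

# show_topic returns (word, probability) pairs, so keywords that contain
# "+" or "*" (e.g. "C++") are not broken up by string splitting
topic_lists = []
for topic_id in range(lda_model.num_topics):
    top_keywords = [word for word, _ in lda_model.show_topic(topic_id, topn=num_words)]
    topic_lists.append(top_keywords)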

# Count the frequency of each word
word_counts = {}
for topic in topic_lists:
    for word in topic:
        word_counts[word] = word_counts.get(word, 0) + 1

# Remove words that appear in more than half of the lists
filtered_topics = []
threshold = len(topic_lists) / 2
for topic in topic_lists:
    filtered_topic = [word for word in topic if word_counts[word] <= threshold]
    filtered_topics.append(filtered_topic)

# Print and Save Topic Names
topic_names = {}
for i, publication in enumerate(works):
    publication_topics = lda_model.get_document_topics(corpus[i])
    # Get the topic with the highest weight for the publication
    top_topic = max(publication_topics, key=lambda x: x[1])[0]
    top_keywords = filtered_topics[top_topic]
    topic_names[top_topic] = ", ".join(top_keywords)
    # topic_names[i] = {"topic_name": ", ".join(top_keywords), "group_id": top_topic}

    # publication["topic_name"] = top_keywords
    # publication["group_id"] = top_topic
print(topic_names)

{3: 'Biology, Climate change, Business', 0: 'Workflow, Geology, Python (programming language), Climate change', 4: 'Programming language, Software, Operating system, Python (programming language)', 2: 'Ecosystem, Biology, Geography', 1: 'Ecology, Workflow, Hydrology (agriculture)'}
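
The filtering step in the cell above, which drops words shared by more than half of the topics, can be tried in isolation without training a model. The function below is a small restatement of that step under the same threshold; the name `filter_shared_words` is introduced here for illustration only.

```python
from collections import Counter


def filter_shared_words(topic_lists):
    """Drop words that appear in more than half of the topic keyword lists."""
    counts = Counter(word for topic in topic_lists for word in topic)
    threshold = len(topic_lists) / 2
    return [
        [word for word in topic if counts[word] <= threshold]
        for topic in topic_lists
    ]
```

A word that occurs in exactly half of the lists is kept, because the comparison is `<=`; only words above the threshold are removed.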


In [None]:

openalex.get_works_of_one_institution("Netherlands eScience Center", 50)

In [None]:
works_url = openalex.find_institution("Netherlands eScience Center")['results'][0]['works_api_url']

In [None]:
import requests

start_year = '2022'
end_year = '2023'
api_url = works_url+',publication_year:'+start_year+'-'+ end_year + '&sort=publication_date:desc'
header = {'Accept' : 'application/json'}

params = {
    'per_page': 100,  # Number of results per page
    'page': 1         # Initial page number
}
openalex_aff_search = []

while True:
    resp = requests.get(api_url, headers=header, params=params)
    resp.raise_for_status()  # Stop on HTTP errors instead of parsing an error body
    work_search_per_page = resp.json()
    openalex_aff_search.extend(work_search_per_page['results'])
    total_results = work_search_per_page['meta']['count']
    current_page = work_search_per_page['meta']['page']
    results_per_page = work_search_per_page['meta']['per_page']

    # Ceiling division: pages needed to hold all results
    total_pages = (total_results + results_per_page - 1) // results_per_page
    # ">=" also ends the loop when there are no results (total_pages == 0)
    if current_page >= total_pages:
        break  # Break the loop when all pages have been retrieved

    params['page'] += 1
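
The paging logic can be separated from the HTTP call, which makes it possible to check the loop without network access. In this sketch, `fetch_all_pages` and its `fetch_page` callback are hypothetical names. The callback is assumed to return a decoded page with a `results` list and a `meta` dictionary containing `count`, matching the fields read in the loop above.

```python
def fetch_all_pages(fetch_page, per_page=100):
    """Collect results from a paginated API.

    `fetch_page(page, per_page)` must return one decoded page as a dict
    with a 'results' list and a 'meta' dict that contains 'count'.
    """
    results = []
    page = 1
    while True:
        payload = fetch_page(page, per_page)
        results.extend(payload["results"])
        total = payload["meta"]["count"]
        # Ceiling division: number of pages needed to hold `total` results
        total_pages = (total + per_page - 1) // per_page
        if page >= total_pages:
            break  # also covers the case of zero results
        page += 1
    return results
```

In the notebook, `fetch_page` could wrap the `requests.get` call and return `resp.json()` for the requested page.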


In [None]:
openalex_aff_search


In [2]:
works = openalex.get_works_of_one_institution_by_year("Netherlands eScience Center", '2023','2023')
topic_modeling.openalex_build_corpus(works,5)

({4: 'Workflow, Environmental science, Software engineering, Software',
  0: 'Python (programming language), Biology, Artificial intelligence, Geology',
  3: 'Atmospheric sciences, Environmental science, Ecology, Ecosystem, Chemistry',
  1: 'Climate model, Climate change, Grid, Earth system science',
  2: 'Chemistry, Gammarus pulex, Pulex, Amphipoda'},
 [{'id': 'https://openalex.org/W4378737443',
   'doi': 'https://doi.org/10.5281/zenodo.7951155',
   'title': 'Report on the Workshop on Sustainable Software Sustainability 2021 (WoSSS21)',
   'display_name': 'Report on the Workshop on Sustainable Software Sustainability 2021 (WoSSS21)',
   'publication_year': 2023,
   'publication_date': '2023-05-30',
   'ids': {'openalex': 'https://openalex.org/W4378737443',
    'doi': 'https://doi.org/10.5281/zenodo.7951155'},
   'language': 'en',
   'primary_location': {'is_oa': True,
    'landing_page_url': 'https://zenodo.org/record/7951155',
    'pdf_url': 'https://zenodo.org/record/7951155/files/R