#

# Article Graph Example

This notebook contains a quick overview of the `article_graph` module together with the `topic_modeling`, `similarity` and `ner` modules.

## Processing Papers with Grobid

In [1]:
# We make available the packages inside all the modules

import sys
import os
sys.path.append(os.path.dirname(os.getcwd()))
os.path.dirname(os.getcwd())

from pdf_analyzer.config_load import load_config
from pdf_analyzer.api import API
from omegaconf import OmegaConf
from pdf_analyzer.config_load import load_config

In [2]:
server_config = load_config("config/api/grobid-server-config.yaml")
extract_config = load_config("config/api/api-base-config.yaml")
print("SERVER_CONFIG\n"+OmegaConf.to_yaml(server_config))
print("CLOUD_CONFIG\n"+OmegaConf.to_yaml(extract_config))

base_api = API.BaseAPI(extract_config,server_config)

files = base_api.proccesed_files

SERVER_CONFIG
url:
  protocol: http
  api_domain: yordi111nas.synology.me
  port: 8070

CLOUD_CONFIG
data:
  data_dir: data/PDFs
  format: .pdf
  recursive: true
grobid:
  cache: true
  cache_dir: data/xmls
  operation_key: processFulltextDocument
  format: .grobid.tei.xml
  recursive: true

GROBID server is up and running
data/xmls/word2vec.grobid.tei.xml already exist, skipping... (use --force to reprocess pdf input files)
data/xmls/Dont_stop_pretraining.grobid.tei.xml already exist, skipping... (use --force to reprocess pdf input files)
data/xmls/LIME.grobid.tei.xml already exist, skipping... (use --force to reprocess pdf input files)
data/xmls/LoRA.grobid.tei.xml already exist, skipping... (use --force to reprocess pdf input files)
data/xmls/Bert.grobid.tei.xml already exist, skipping... (use --force to reprocess pdf input files)
data/xmls/SORA.grobid.tei.xml already exist, skipping... (use --force to reprocess pdf input files)
data/xmls/Transformers.grobid.tei.xml already exist, s

## Adding the Papers to the Graph

In this section, we will be adding all the papers to the graph!

In [3]:
from article_graph.article_graph import ArticleGraph
from get_paper_metadata import get_paper_metadata

# We create the graph
g = ArticleGraph()

# We add the documents to the graph
for paper_id, file in enumerate(files):
    paper_info = get_paper_metadata(file)
    g.add_paper(paper_id=paper_id,
                title=paper_info['title'],
                abstract=paper_info['abstract'],
                release_date=paper_info['release_date'])
    
# Explore the graph by printing the titles of the papers
for s, p, o in g.graph.triples((None, g.ns.title, None)):
    print(s, p, o)

http://open_science.com/paper#0 http://open_science.com/title A Robustly Optimized BERT Pre-training Approach with Post-training
http://open_science.com/paper#1 http://open_science.com/title Pronunciation and good language learners
http://open_science.com/paper#2 http://open_science.com/title LORA: LOW-RANK ADAPTATION OF LARGE LAN-GUAGE MODELS
http://open_science.com/paper#3 http://open_science.com/title "Why Should I Trust You?"
http://open_science.com/paper#4 http://open_science.com/title Preface to the book Fast Processes in Large Scale Atmospheric Models: Progress, Challenges and Opportunities
http://open_science.com/paper#5 http://open_science.com/title Attention Is All You Need
http://open_science.com/paper#6 http://open_science.com/title DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
http://open_science.com/paper#7 http://open_science.com/title Exploring teachers’ confidence in addressing mental health issues in learners with Profound and Multiple 

## Topic Modeling

In this section, we will be exploring the use of topic modeling inside the **Article Graph** !

### Generating the Topics

In this subsection, we will be extracting topics from the papers' abstracts using the `topic_modeling` module!

In [4]:
from topic_modeling.lda import LDA
from article_graph._utils import get_abstract

# Create the LDA model for Topic Modeling with the optimal number of topics
# The optimal number of topics was calculated in `examples/topic_modeling.ipynb`
lda_model = LDA(corpus=[get_abstract(file) for file in files],
                num_topics=6,
                num_words=7)
lda_model.fit()

# Display the generated topics
for i, topic in enumerate(lda_model.topics):
    print(f'Topic {i}: {topic}')

Topic 0: ['high', 'based', 'observe', 'similarities', 'measured', 'vector', 'very']
Topic 1: ['prediction', 'an', 'trust', 'predictions', 'classifier', 'one', 'explanations']
Topic 2: ['and', 'the', 'to', 'of', 'in', 'video', 'sora']
Topic 3: ['pretraining', 'of', 'to', 'task', 'adaptive', 'we', 'domain']
Topic 4: ['of', 'and', 'the', 'on', 'we', 'to', 'model']
Topic 5: ['and', 'our', 'the', 'work', 'in', 'with', 'codebase']


### Adding Topics to Graph

In this subsection, we will be adding the generated topics to the graph!

In [5]:
# We add the topics to the graph
for topic_id, keywords in enumerate(lda_model.topics):
    g.add_topic(topic_id, keywords)

# We explore the graph by printing the keywords of each topic
for s, p, o in g.graph.triples((None, g.ns.keyword, None)):
    print(s, p, o)

http://open_science.com/topic#0 http://open_science.com/keyword high
http://open_science.com/topic#0 http://open_science.com/keyword based
http://open_science.com/topic#0 http://open_science.com/keyword observe
http://open_science.com/topic#0 http://open_science.com/keyword similarities
http://open_science.com/topic#0 http://open_science.com/keyword measured
http://open_science.com/topic#0 http://open_science.com/keyword vector
http://open_science.com/topic#0 http://open_science.com/keyword very
http://open_science.com/topic#1 http://open_science.com/keyword prediction
http://open_science.com/topic#1 http://open_science.com/keyword an
http://open_science.com/topic#1 http://open_science.com/keyword trust
http://open_science.com/topic#1 http://open_science.com/keyword predictions
http://open_science.com/topic#1 http://open_science.com/keyword classifier
http://open_science.com/topic#1 http://open_science.com/keyword one
http://open_science.com/topic#1 http://open_science.com/keyword expl

### Adding TopicBelongings to Graph

In this subsection, we will be adding the topic belonging relationships to the graph! These relationships represent the topic dostributions of each paper to every topic in the graph.

In [6]:
# We predict the topic distributions for each paper to all the topics
lda_model.predict_all()

# We add the topic belonging for each topic and paper storing the degree of belonging
for paper_id, paper_info in enumerate(lda_model.topic_distributions):
    for topic_id, topic_dist in paper_info.items():
        g.add_topic_belonging(paper_id, topic_id, topic_dist)

# We explore the graph by printing the topic belonging for each paper to all the topics
for s, p, o in g.graph.triples((None, g.ns.belongs_to_topic, None)):
    for _, p1, o1 in g.graph.triples((o, g.ns.degree, None)):
        print(s, p, o, p1, o1)

http://open_science.com/paper#0 http://open_science.com/belongs_to_topic http://open_science.com/topic_belonging#0-0 http://open_science.com/degree 0.0013022417207319654
http://open_science.com/paper#0 http://open_science.com/belongs_to_topic http://open_science.com/topic_belonging#0-1 http://open_science.com/degree 0.0013021680174009603
http://open_science.com/paper#0 http://open_science.com/belongs_to_topic http://open_science.com/topic_belonging#0-2 http://open_science.com/degree 0.001305713362396965
http://open_science.com/paper#0 http://open_science.com/belongs_to_topic http://open_science.com/topic_belonging#0-3 http://open_science.com/degree 0.0013107402399422056
http://open_science.com/paper#0 http://open_science.com/belongs_to_topic http://open_science.com/topic_belonging#0-4 http://open_science.com/degree 0.9934719743350049
http://open_science.com/paper#0 http://open_science.com/belongs_to_topic http://open_science.com/topic_belonging#0-5 http://open_science.com/degree 0.0013

## Named Entity Recognition

In this section, we will be exploring the use of named entity recognition inside the **Article Graph** !

In [7]:
from transformers import pipeline

# Init the BERT model
pipe = pipeline("token-classification", model="dslim/bert-base-NER")

  from .autonotebook import tqdm as notebook_tqdm
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
from ner.extract_ner import get_all_ners

# Obtain all the NERs
all_orgs_rel, all_orgs = get_all_ners(files, pipe)

In [9]:
from article_graph._utils import get_acknowledgements
from ner.extract_ner import extract_ners

# Obtain the NERs inside the Acknowledgements section
all_ners = None
for acno in [get_acknowledgements(file) for file in files]:
    all_ners = extract_ners(acno, pipe)

In [10]:
# All the Named entities
all_ners

[{'name': 'Office of Naval Research', 'score': 0.99672663, 'type': 'ORG'},
 {'name': 'Google', 'score': 0.99887425, 'type': 'ORG'}]

In [11]:
# All the Organizations
all_orgs

[{'name': 'OpenAI', 'type': 'ORG', 'org_id': 0},
 {'name': 'TerraSwarm', 'type': 'ORG', 'org_id': 1},
 {'name': 'STARnet', 'type': 'ORG', 'org_id': 2},
 {'name': 'Semiconductor Research Corporation', 'type': 'ORG', 'org_id': 3},
 {'name': 'MARCO', 'type': 'ORG', 'org_id': 4},
 {'name': 'DARPA', 'type': 'ORG', 'org_id': 5},
 {'name': 'Office of Naval Research', 'type': 'ORG', 'org_id': 6},
 {'name': 'Google', 'type': 'ORG', 'org_id': 7}]

## Projects Extraction

Regex can be custom if necessary, in case their are not setup default values will be used

In [12]:
regex_patterns = {
    "NIH": r'(?:#)?\b[1-9][A-Z\d]{3}[A-Z]{2}\d{6}(?:-[AS]?\d+)?\b',
    "DOD": r'(?:#)?\b[A-Z\d]{6}-\d{2}-[123]-\d{4}\b',
    "NASA": r'(?:#)?\b(?:80|NN)[A-Z]+\d{2}[A-Z\d]+\b',
    "Education": r'(?:#)?\b[A-Z]+\d+[A-Z]\d{2}[A-Z\d]+\b',
    "Universal":r'[A-Z]{3,}[0-9]+-[0-9]+'
}

In [13]:
from ner.extract_ner import get_all_projects,get_projects_names,extract_award_identifiers

In [14]:
# Obtain all the projects in the files and their relations
all_projects, all_projects_relation = get_all_projects(files)

In [15]:
# Obtain all the projects in the files and their relations with custom RegExps
all_projects,all_projects_relation=get_all_projects(files,regex_patterns)

In [16]:
all_projects

[{'project_name': 'ONR',
  'project_federal_id': '#W911NF-13-1-0246',
  'project_id': 0},
 {'project_name': 'MURI',
  'project_federal_id': 'N00014-18-1-2670',
  'project_id': 1}]

In [17]:
# Projects associated with organizations

from article_graph._utils import get_acknowledgements
[
    {
        "project_federal_ids": extract_award_identifiers(get_acknowledgements(file)),
        "project_names": get_projects_names(get_acknowledgements(file))
    }
    for file in files]

[{'project_federal_ids': [], 'project_names': []},
 {'project_federal_ids': [], 'project_names': []},
 {'project_federal_ids': [], 'project_names': []},
 {'project_federal_ids': ['#W911NF-13-1-0246', '#N00014-13-1-0023'],
  'project_names': ['ONR']},
 {'project_federal_ids': [], 'project_names': []},
 {'project_federal_ids': [], 'project_names': []},
 {'project_federal_ids': [], 'project_names': []},
 {'project_federal_ids': [], 'project_names': []},
 {'project_federal_ids': [], 'project_names': []},
 {'project_federal_ids': ['N00014-18-1-2670'], 'project_names': ['MURI']}]

## Author Extraction

In [18]:
from extract_authors import get_all_author_metadata

# Obtain all the data about the authors of the papers
authors_list, relation_author_paper, all_orgs, relation_author_org = get_all_author_metadata(files, all_orgs)

In [19]:
relation_author_org

[{'name': 'Johns Hopkins University', 'org_id': 8, 'author_id': 8},
 {'name': 'Microsoft Corporation', 'org_id': 9, 'author_id': 34},
 {'name': 'Microsoft Corporation', 'org_id': 9, 'author_id': 35},
 {'name': 'Microsoft Corporation', 'org_id': 9, 'author_id': 36},
 {'name': 'Microsoft Corporation', 'org_id': 9, 'author_id': 37},
 {'name': 'Microsoft Corporation', 'org_id': 9, 'author_id': 38},
 {'name': 'Microsoft Corporation', 'org_id': 9, 'author_id': 39},
 {'name': 'Microsoft Corporation', 'org_id': 9, 'author_id': 40},
 {'name': 'Microsoft Corporation', 'org_id': 9, 'author_id': 41},
 {'name': 'University of Washington Seattle', 'org_id': 10, 'author_id': 42},
 {'name': 'University of Washington Seattle', 'org_id': 10, 'author_id': 43},
 {'name': 'University of Washington Seattle', 'org_id': 10, 'author_id': 44},
 {'name': 'Lehigh University', 'org_id': 11, 'author_id': 45},
 {'name': 'Lehigh University', 'org_id': 11, 'author_id': 46},
 {'name': 'Lehigh University', 'org_id': 11,

In [20]:
authors_list

[{'name': 'Zhuang  ',
  'last_name': 'Liu ',
  'label': 'Zhuang   Liu ',
  'email': None,
  'author_id': 0},
 {'name': 'Wayne  ',
  'last_name': 'Lin ',
  'label': 'Wayne   Lin ',
  'email': None,
  'author_id': 1},
 {'name': 'Ya  ',
  'last_name': 'Shi ',
  'label': 'Ya   Shi ',
  'email': None,
  'author_id': 2},
 {'name': 'Jun  ',
  'last_name': 'Zhao ',
  'label': 'Jun   Zhao ',
  'email': None,
  'author_id': 3},
 {'name': 'Tom  B ',
  'last_name': 'Brown ',
  'label': 'Tom  B  Brown ',
  'email': None,
  'author_id': 4},
 {'name': 'Benjamin  ',
  'last_name': 'Mann ',
  'label': 'Benjamin   Mann ',
  'email': None,
  'author_id': 5},
 {'name': 'Nick  ',
  'last_name': 'Ryder ',
  'label': 'Nick   Ryder ',
  'email': None,
  'author_id': 6},
 {'name': 'Melanie  ',
  'last_name': 'Subbiah ',
  'label': 'Melanie   Subbiah ',
  'email': None,
  'author_id': 7},
 {'name': 'Jared  ',
  'last_name': 'Kaplan ',
  'label': 'Jared   Kaplan ',
  'email': None,
  'author_id': 8},
 {'name': '

## Adding Organizations to graph

In [21]:
from article_graph.recon import get_organizations_info

# Reconcile with wikidata
orgs_info = get_organizations_info(list(map(lambda o: o['name'], all_orgs)))

for org in all_orgs:
    org_name = org['name']
    if org_name not in orgs_info:
        continue
    org['wikidata_id'] = orgs_info[org_name]['wikidata_id'] if 'wikidata_id' in orgs_info[org_name] else None
    org['icon'] = orgs_info[org_name]['icon'] if 'icon' in orgs_info[org_name] else None
    org['coordinates'] = orgs_info[org_name]['coordinates'] if 'coordinates' in orgs_info[org_name] else None

100%|██████████| 3/3 [00:24<00:00,  8.06s/it]


In [22]:
all_orgs

[{'name': 'OpenAI',
  'type': 'ORG',
  'org_id': 0,
  'wikidata_id': 'Q21708200',
  'icon': None,
  'coordinates': {'lon': -122.416388888, 'lat': 37.7775}},
 {'name': 'TerraSwarm', 'type': 'ORG', 'org_id': 1},
 {'name': 'STARnet',
  'type': 'ORG',
  'org_id': 2,
  'wikidata_id': 'Q4050035',
  'icon': None,
  'coordinates': {'lon': 28.835277777, 'lat': 47.022777777}},
 {'name': 'Semiconductor Research Corporation',
  'type': 'ORG',
  'org_id': 3,
  'wikidata_id': 'Q7449388',
  'icon': None,
  'coordinates': None},
 {'name': 'MARCO',
  'type': 'ORG',
  'org_id': 4,
  'wikidata_id': 'Q3395689',
  'icon': None,
  'coordinates': {'lon': -8.721111111, 'lat': 42.235833333}},
 {'name': 'DARPA',
  'type': 'ORG',
  'org_id': 5,
  'wikidata_id': 'Q207361',
  'icon': None,
  'coordinates': {'lon': -77.108333333, 'lat': 38.880277777}},
 {'name': 'Office of Naval Research',
  'type': 'ORG',
  'org_id': 6,
  'wikidata_id': 'Q1063818',
  'icon': None,
  'coordinates': {'lon': -77.108333333, 'lat': 38.

In [23]:
from rdflib.namespace import OWL

# Add the extended organizations to the graph
for org in all_orgs:
    g.add_organization(org_id=org["org_id"],
                       org_name=org["name"],
                       icon=org['icon'] if 'icon' in org else None,
                       wikidata_id=org['wikidata_id'] if 'wikidata_id' in org else None,
                       coordinates=org['coordinates'] if 'coordinates' in org else None)

# We explore the graph by printing the organizations' wikidata_ids in the graph
for s, p, o in g.graph.triples((None, None, g.ns.Organization)):
    for _, _, wikidata_id in g.graph.triples((s, OWL.sameAs, None)):
        print(s, wikidata_id)

http://open_science.com/organization#0 https://www.wikidata.org/entity/Q21708200
http://open_science.com/organization#2 https://www.wikidata.org/entity/Q4050035
http://open_science.com/organization#3 https://www.wikidata.org/entity/Q7449388
http://open_science.com/organization#4 https://www.wikidata.org/entity/Q3395689
http://open_science.com/organization#5 https://www.wikidata.org/entity/Q207361
http://open_science.com/organization#6 https://www.wikidata.org/entity/Q1063818
http://open_science.com/organization#7 https://www.wikidata.org/entity/Q95
http://open_science.com/organization#8 https://www.wikidata.org/entity/Q193727
http://open_science.com/organization#9 https://www.wikidata.org/entity/Q2283
http://open_science.com/organization#10 https://www.wikidata.org/entity/Q219563
http://open_science.com/organization#11 https://www.wikidata.org/entity/Q622137
http://open_science.com/organization#12 https://www.wikidata.org/entity/Q1144725
http://open_science.com/organization#16 https://

In [24]:
# Add what organizations are acknowledged
for relation in all_orgs_rel:
    g.add_organization_paper_relation(relation["paper_id"], relation["org_id"])

# We explore the graph by printing the organizations that are acknowledged
for s, p, o in g.graph.triples((None, g.ns.acknowledges, None)):
    if str(o).startswith(str(g.ns.Organization)):
        print(s, p, o)

In [25]:
# Add the organization that every author is member of
for relation in relation_author_org:
    g.add_organization_author_relation(relation["author_id"],relation["org_id"])

# We explore the graph by printing the organization that every author is member of
for s, p, o in g.graph.triples((None, g.ns.member, None)):
    print(s,p,o)

http://open_science.com/person#8 http://open_science.com/member http://open_science.com/organization#8
http://open_science.com/person#34 http://open_science.com/member http://open_science.com/organization#9
http://open_science.com/person#35 http://open_science.com/member http://open_science.com/organization#9
http://open_science.com/person#36 http://open_science.com/member http://open_science.com/organization#9
http://open_science.com/person#37 http://open_science.com/member http://open_science.com/organization#9
http://open_science.com/person#38 http://open_science.com/member http://open_science.com/organization#9
http://open_science.com/person#39 http://open_science.com/member http://open_science.com/organization#9
http://open_science.com/person#40 http://open_science.com/member http://open_science.com/organization#9
http://open_science.com/person#41 http://open_science.com/member http://open_science.com/organization#9
http://open_science.com/person#42 http://open_science.com/member 

## Adding Projects to graph

In [26]:
# Add all the projects to the graph
for project in all_projects:
    g.add_project(project["project_id"], project["project_name"], project["project_federal_id"])

# We explore the graph by printing the projects' names
for s, p, o in g.graph.triples((None, None, g.ns.Project)):
    for _, _, org_name in g.graph.triples((s, g.ns.name, None)):
        print(s,org_name)

http://open_science.com/project#0 ONR
http://open_science.com/project#1 MURI


In [27]:
# Add what projects are acknowledged
for relation in all_projects_relation:
    g.add_project_relation(relation["paper_id"],relation["project_id"])

# We explore the graph by printing the topic belonging for each paper to all the topics
for s, p, o in g.graph.triples((None, g.ns.acknowledges, None)):
    if str(o).startswith(str(g.ns.Project)):
        print(s, p, o)

## Add Authors to Graph

In [28]:
authors_list

[{'name': 'Zhuang  ',
  'last_name': 'Liu ',
  'label': 'Zhuang   Liu ',
  'email': None,
  'author_id': 0},
 {'name': 'Wayne  ',
  'last_name': 'Lin ',
  'label': 'Wayne   Lin ',
  'email': None,
  'author_id': 1},
 {'name': 'Ya  ',
  'last_name': 'Shi ',
  'label': 'Ya   Shi ',
  'email': None,
  'author_id': 2},
 {'name': 'Jun  ',
  'last_name': 'Zhao ',
  'label': 'Jun   Zhao ',
  'email': None,
  'author_id': 3},
 {'name': 'Tom  B ',
  'last_name': 'Brown ',
  'label': 'Tom  B  Brown ',
  'email': None,
  'author_id': 4},
 {'name': 'Benjamin  ',
  'last_name': 'Mann ',
  'label': 'Benjamin   Mann ',
  'email': None,
  'author_id': 5},
 {'name': 'Nick  ',
  'last_name': 'Ryder ',
  'label': 'Nick   Ryder ',
  'email': None,
  'author_id': 6},
 {'name': 'Melanie  ',
  'last_name': 'Subbiah ',
  'label': 'Melanie   Subbiah ',
  'email': None,
  'author_id': 7},
 {'name': 'Jared  ',
  'last_name': 'Kaplan ',
  'label': 'Jared   Kaplan ',
  'email': None,
  'author_id': 8},
 {'name': '

In [29]:
from article_graph.recon import reconcile_persons

# Reconcile the authors with Wikidata
reconciled = reconcile_persons(list(map(lambda p: p['label'], authors_list)))

for author in authors_list:
    author_name = author['label']
    if author_name not in reconciled:
        continue
    author['wikidata_id'] = reconciled[author_name]['wikidata_id'] if 'wikidata_id' in reconciled[author_name] else None


100%|██████████| 9/9 [01:04<00:00,  7.21s/it]


In [30]:
authors_list

[{'name': 'Zhuang  ',
  'last_name': 'Liu ',
  'label': 'Zhuang   Liu ',
  'email': None,
  'author_id': 0,
  'wikidata_id': 'Q7271'},
 {'name': 'Wayne  ',
  'last_name': 'Lin ',
  'label': 'Wayne   Lin ',
  'email': None,
  'author_id': 1,
  'wikidata_id': 'Q716022'},
 {'name': 'Ya  ',
  'last_name': 'Shi ',
  'label': 'Ya   Shi ',
  'email': None,
  'author_id': 2,
  'wikidata_id': 'Q45420261'},
 {'name': 'Jun  ',
  'last_name': 'Zhao ',
  'label': 'Jun   Zhao ',
  'email': None,
  'author_id': 3,
  'wikidata_id': 'Q3357852'},
 {'name': 'Tom  B ',
  'last_name': 'Brown ',
  'label': 'Tom  B  Brown ',
  'email': None,
  'author_id': 4,
  'wikidata_id': 'Q115662131'},
 {'name': 'Benjamin  ',
  'last_name': 'Mann ',
  'label': 'Benjamin   Mann ',
  'email': None,
  'author_id': 5,
  'wikidata_id': 'Q91736166'},
 {'name': 'Nick  ',
  'last_name': 'Ryder ',
  'label': 'Nick   Ryder ',
  'email': None,
  'author_id': 6,
  'wikidata_id': 'Q96211491'},
 {'name': 'Melanie  ',
  'last_name': '

In [31]:
from rdflib.namespace import OWL

# Add persons to the graph
for author in authors_list:
    g.add_author(author_id=author["author_id"],
                 label=author['label'],
                 first_name=author['name'],
                 last_name=author["last_name"],
                 email=author["email"],
                 wikidata_id=author['wikidata_id'] if 'wikidata_id' in author else None)

# We explore the graph by printing the person and its full name
for s, p, o in g.graph.triples((None, None, g.ns.Person)):
    for _, _, wikidata_uri in g.graph.triples((s, OWL.sameAs, None)):
        print(s, wikidata_uri)

http://open_science.com/person#0 https://www.wikidata.org/entity/Q7271
http://open_science.com/person#1 https://www.wikidata.org/entity/Q716022
http://open_science.com/person#2 https://www.wikidata.org/entity/Q45420261
http://open_science.com/person#3 https://www.wikidata.org/entity/Q3357852
http://open_science.com/person#4 https://www.wikidata.org/entity/Q115662131
http://open_science.com/person#5 https://www.wikidata.org/entity/Q91736166
http://open_science.com/person#6 https://www.wikidata.org/entity/Q96211491
http://open_science.com/person#7 https://www.wikidata.org/entity/Q115664942
http://open_science.com/person#8 https://www.wikidata.org/entity/Q102649624
http://open_science.com/person#9 https://www.wikidata.org/entity/Q115662059
http://open_science.com/person#10 https://www.wikidata.org/entity/Q55444792
http://open_science.com/person#11 https://www.wikidata.org/entity/Q115662093
http://open_science.com/person#12 https://www.wikidata.org/entity/Q115661401
http://open_science.com

In [32]:
relation_author_paper

[{'author_id': 0, 'paper_id': 0},
 {'author_id': 1, 'paper_id': 0},
 {'author_id': 2, 'paper_id': 0},
 {'author_id': 3, 'paper_id': 0},
 {'author_id': 0, 'paper_id': 1},
 {'author_id': 1, 'paper_id': 1},
 {'author_id': 2, 'paper_id': 1},
 {'author_id': 3, 'paper_id': 1},
 {'author_id': 4, 'paper_id': 1},
 {'author_id': 5, 'paper_id': 1},
 {'author_id': 6, 'paper_id': 1},
 {'author_id': 7, 'paper_id': 1},
 {'author_id': 8, 'paper_id': 1},
 {'author_id': 9, 'paper_id': 1},
 {'author_id': 10, 'paper_id': 1},
 {'author_id': 11, 'paper_id': 1},
 {'author_id': 12, 'paper_id': 1},
 {'author_id': 13, 'paper_id': 1},
 {'author_id': 14, 'paper_id': 1},
 {'author_id': 15, 'paper_id': 1},
 {'author_id': 16, 'paper_id': 1},
 {'author_id': 17, 'paper_id': 1},
 {'author_id': 18, 'paper_id': 1},
 {'author_id': 19, 'paper_id': 1},
 {'author_id': 20, 'paper_id': 1},
 {'author_id': 21, 'paper_id': 1},
 {'author_id': 22, 'paper_id': 1},
 {'author_id': 23, 'paper_id': 1},
 {'author_id': 24, 'paper_id': 1},

In [33]:
for relation_author in relation_author_paper:
    g.add_author_paper_relation(relation_author["author_id"],relation_author["paper_id"])

for s, p, o in g.graph.triples((None, g.ns.author, None)):
    if str(o).startswith(str(g.ns.paper)):
        print(s, p, o)

http://open_science.com/person#0 http://open_science.com/author http://open_science.com/paper#0
http://open_science.com/person#1 http://open_science.com/author http://open_science.com/paper#0
http://open_science.com/person#2 http://open_science.com/author http://open_science.com/paper#0
http://open_science.com/person#3 http://open_science.com/author http://open_science.com/paper#0
http://open_science.com/person#0 http://open_science.com/author http://open_science.com/paper#1
http://open_science.com/person#1 http://open_science.com/author http://open_science.com/paper#1
http://open_science.com/person#2 http://open_science.com/author http://open_science.com/paper#1
http://open_science.com/person#3 http://open_science.com/author http://open_science.com/paper#1
http://open_science.com/person#4 http://open_science.com/author http://open_science.com/paper#1
http://open_science.com/person#5 http://open_science.com/author http://open_science.com/paper#1
http://open_science.com/person#6 http://

## Similarity

In this section, we will be exploring the use of similarity inside the **Article Graph** !

### Calculating similarity

In this subsection, we will be calculating the similarity between the papers' abstracts using the `similarity` module!

In [34]:
# pip install -U sentence-transformers
from similarity.Model import Model

# Name of the SentenceTransformer model to use
model_name = 'sentence-transformers/all-mpnet-base-v2'

# Create an instance of the class
Model_instance = Model([get_abstract(file) for file in files], model_name)

# Calculate similarity and retrieve the results
similarity_results = Model_instance.calculate_similarity()

# Print similarity results
print("Similarity results:")
for result in similarity_results:
    print(result)

Similarity results:
{'text_id1': 0, 'text_id2': 1, 'similarity': 0.68489057}
{'text_id1': 0, 'text_id2': 2, 'similarity': 0.6559324}
{'text_id1': 0, 'text_id2': 3, 'similarity': 0.39857095}
{'text_id1': 0, 'text_id2': 4, 'similarity': 0.36316535}
{'text_id1': 0, 'text_id2': 5, 'similarity': 0.44492385}
{'text_id1': 0, 'text_id2': 6, 'similarity': 0.698509}
{'text_id1': 0, 'text_id2': 7, 'similarity': 0.64963436}
{'text_id1': 0, 'text_id2': 8, 'similarity': 0.48471498}
{'text_id1': 0, 'text_id2': 9, 'similarity': 0.6291123}
{'text_id1': 1, 'text_id2': 2, 'similarity': 0.7720689}
{'text_id1': 1, 'text_id2': 3, 'similarity': 0.43565458}
{'text_id1': 1, 'text_id2': 4, 'similarity': 0.45362663}
{'text_id1': 1, 'text_id2': 5, 'similarity': 0.4939827}
{'text_id1': 1, 'text_id2': 6, 'similarity': 0.76379037}
{'text_id1': 1, 'text_id2': 7, 'similarity': 0.72956586}
{'text_id1': 1, 'text_id2': 8, 'similarity': 0.58569795}
{'text_id1': 1, 'text_id2': 9, 'similarity': 0.7682411}
{'text_id1': 2, 't

### Adding similarity to Graph

In this subsection, we will be adding the calculated similarity to the graph!

In [35]:
# Iterate over the similarity results and add them to the graph
for result in similarity_results:
    text_id1 = result['text_id1']
    text_id2 = result['text_id2']
    similarity_score = result['similarity']
    
    # Add the similarity to the graph using the add_similarity method
    g.add_similarity(text_id1, text_id2, similarity_score)

# Iterate over the graph to print the similarity between papers
for paper1, _, similarity in g.graph.triples((None, g.ns.similar_to, None)):
    for _, _, paper2 in g.graph.triples((similarity, g.ns.related_paper, None)):
        for _, _, degree in g.graph.triples((similarity, g.ns.degree, None)):
            print(f'Paper 1: {paper1}, Paper 2: {paper2}, Similarity Score: {degree}')

 

## Generating the Graph

In [36]:
# Print the graph in the notebook
print(g.graph.serialize(format='ttl'))

@prefix ns1: <http://open_science.com/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://open_science.com/person#0> a ns1:Person ;
    ns1:author <http://open_science.com/paper#0>,
        <http://open_science.com/paper#1>,
        <http://open_science.com/paper#2>,
        <http://open_science.com/paper#3>,
        <http://open_science.com/paper#4>,
        <http://open_science.com/paper#5>,
        <http://open_science.com/paper#6>,
        <http://open_science.com/paper#7>,
        <http://open_science.com/paper#8>,
        <http://open_science.com/paper#9> ;
    ns1:first_name "Zhuang  "^^xsd:string ;
    ns1:label "Zhuang   Liu "^^xsd:string ;
    ns1:last_name "Liu "^^xsd:string ;
    owl:sameAs <https://www.wikidata.org/entity/Q7271> .

<http://open_science.com/person#1> a ns1:Person ;
    ns1:author <http://open_science.com/paper#0>,
        <http://open_science.

In [37]:
# Generate it inside ../rdf/graph.ttl
g.graph.serialize(format='ttl', destination='../rdf/graph.ttl')

<Graph identifier=N4ab82aefb73f4f5abd75e0e6899376b2 (<class 'rdflib.graph.Graph'>)>