# BioRxivist

BioRxivist is a tool that helps obtain full-text articles from [BioRxiv](https://biorxiv.org) and makes it easier to feed that text to LLMs and to build knowledge graph infrastructure with [Neo4J](https://neo4j.com/).


In [1]:
import os
import sys
sys.version

'3.10.13 (main, Nov  6 2023, 22:35:59) [GCC 9.4.0]'

In [2]:
from dotenv import find_dotenv, load_dotenv
from os import environ
import openai
# import classes from BioRxivist
from biorxivist.webtools import BioRxivPaper, BioRxivDriver, SearchResult
from biorxivist.vectorstore import Neo4JDatabase

In [3]:
# load environment variables such as our API key and Neo4J credentials
load_dotenv(find_dotenv())

True
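
Later cells read `OPENAI_API_KEY`, `NEO4J_HOST`, `NEO4J_BOLT_PORT`, and `NEO4J_USERNAME` from the environment. A minimal sketch of a pre-flight check follows. `missing_env_vars` is a hypothetical helper, not part of BioRxivist, and the list only covers the variables this notebook reads explicitly; `Neo4JDatabase.from_environment()` may expect others.

```python
from os import environ

# Variables this notebook reads explicitly in later cells.
REQUIRED_VARS = ("OPENAI_API_KEY", "NEO4J_HOST", "NEO4J_BOLT_PORT", "NEO4J_USERNAME")


def missing_env_vars(env=environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv` surfaces a missing setting early instead of as a `KeyError` several cells later.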

# 1. Using BioRxivist to find and load text from papers

The first thing we'll want to do is find a paper of interest. We'll use three objects from this package. `BioRxivDriver` gives us access to BioRxiv's search utility. `SearchResult` helps us manage the results of our search. `BioRxivPaper` manages how we access text from the papers.

In [4]:
driver = BioRxivDriver()

In [5]:
r = driver.search_biorxiv('TGF-Beta 1 Signaling in monocyte maturation')

In [6]:
r.response.request.url

'https://www.biorxiv.org/search/TGF-Beta+1+Signaling+in+monocyte+maturation%20numresults:75'
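
The request URL above shows how the query is encoded: spaces become `+` and a `numresults` filter is appended after an encoded space. A sketch that reproduces that shape with the standard library follows. `build_search_url` is a hypothetical helper, and whether `BioRxivDriver` builds its URL this way internally is an assumption.

```python
from urllib.parse import quote_plus


def build_search_url(query, num_results=75):
    """Build a search URL shaped like the one shown above (an assumption)."""
    # quote_plus turns spaces into "+" and leaves "-" untouched.
    return f"https://www.biorxiv.org/search/{quote_plus(query)}%20numresults:{num_results}"


print(build_search_url("TGF-Beta 1 Signaling in monocyte maturation"))
```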

In [7]:
type(r)

biorxivist.webtools.SearchResult

In [8]:
len(r.results)

75

In [9]:
# load more results
r.more()

In [10]:
len(r.results)

150

In [11]:
# Results are a list of BioRxivPaper objects
r.results[0:5]

[[Cross-species analysis identifies conserved transcriptional mechanisms of neutrophil maturation](https://biorxiv.org/content/10.1101/2022.11.28.518146v1),
 [Melanogenic Activity Facilitates Dendritic Cell Maturation via FMOD](https://biorxiv.org/content/10.1101/2022.05.14.491976v2),
 [Microglia integration into human midbrain organoids leads to increased neuronal maturation and functionality](https://biorxiv.org/content/10.1101/2022.01.21.477192v1),
 [Long-term culture of fetal monocyte precursors in vitro allowing the generation of bona fide alveolar macrophages in vivo](https://biorxiv.org/content/10.1101/2021.06.04.447115v2),
 [Pathologic α-Synuclein Species Activate LRRK2 in Pro-Inflammatory Monocyte and Macrophage Responses](https://biorxiv.org/content/10.1101/2020.05.04.077065v1)]

# BioRxivPaper Objects:
`BioRxivPaper` objects link us to BioRxiv resources related to individual papers. Once instantiated, a paper loads only a minimal amount of information into memory: the URI of the paper's homepage and its title. Other features, like the paper's abstract and full text, are lazy-loaded properties. They are fetched only when we first access them. After that they are served from memory, so we don't have to hit the URL again.

They also feed directly into our LangChain pipeline. In a later section we will make use of their `BioRxivPaper.langchain_doc` attribute.
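
The lazy-loading behavior described above can be sketched with `functools.cached_property`. This is a toy illustration under assumptions, not the BioRxivist implementation: `LazyPaper` and `fake_fetch` are hypothetical names, and the real class may cache its properties differently.

```python
from functools import cached_property


class LazyPaper:
    """Toy stand-in for a paper object; not the BioRxivist implementation."""

    def __init__(self, uri, title, fetch):
        self.uri = uri
        self.title = title
        self._fetch = fetch  # hypothetical callable: uri -> abstract text

    @cached_property
    def abstract(self):
        # Runs on first access only; the result is then stored on the instance.
        return self._fetch(self.uri)


calls = []


def fake_fetch(uri):
    """Stand-in for an HTTP request; records each call so we can count them."""
    calls.append(uri)
    return f"abstract for {uri}"


paper = LazyPaper("https://example.org/paper", "A title", fake_fetch)
paper.abstract
paper.abstract
print(len(calls))  # the fetch ran once despite two accesses
```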

# Interact with individual papers:
The results are a collection of BioRxivPaper objects.  We can interact with them by indexing into the list or we can pull them out and interact with them here:

In [12]:
paper3 = r.results[3]

In [13]:
paper3.title

'Long-term culture of fetal monocyte precursors in vitro allowing the generation of bona fide alveolar macrophages in vivo'

In [14]:
paper3.abstract

'Tissue-resident macrophage-based immune therapies have been proposed for various diseases. However, generation of sufficient numbers that possess tissue-specific functions remains a major handicap. Here, we show that fetal liver monocytes (FLiMo) cultured with GM-CSF (also known as CSF2) rapidly differentiate into a long-lived, homogeneous alveolar macrophage (AM)-like population in vitro. CSF2-cultured FLiMo remain the capacity to develop into bona fide AM upon transfer into Csf2ra-/- neonates and prevent development of alveolar proteinosis and efferocytosis of apoptotic cells for at least 1 year in vivo. Compared to transplantation of AM-like cells derived from bone marrow macrophages (BMM), CSF2-cFliMo more efficiently engraft empty AM niches in the lung and protect mice from respiratory viral infection. Harnessing the potential of this approach for gene therapy, we restored a disrupted Csf2ra gene in FLiMo and their capacity to develop into AM in vivo. Together, we provide a novel

# Accessing the paper's text

There are a few ways to access the paper's text:

In [15]:
# through the paper.text property:
# print(paper3.text)
# or by slicing the paper object directly, which slices its text:
print(f'...{paper3[392:1000]}...')

... CSF2-cultured FLiMo remain the capacity to develop into bona fide AM upon transfer into Csf2ra-/- neonates and prevent development of alveolar proteinosis and efferocytosis of apoptotic cells for at least 1 year in vivo. Compared to transplantation of AM-like cells derived from bone marrow macrophages (BMM), CSF2-cFliMo more efficiently engraft empty AM niches in the lung and protect mice from respiratory viral infection. Harnessing the potential of this approach for gene therapy, we restored a disrupted Csf2ra gene in FLiMo and their capacity to develop into AM in vivo. Together, we provide a novel ...
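
Slicing an object as if it were a string works when the class forwards `__getitem__` to its text. A sketch of that pattern follows. `TextBacked` is a hypothetical class, and how `BioRxivPaper` implements this is an assumption.

```python
class TextBacked:
    """Toy object that behaves like its text when printed, measured, or sliced."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    def __len__(self):
        return len(self.text)

    def __getitem__(self, key):
        # Delegate integer indexes and slices to the underlying string.
        return self.text[key]


doc = TextBacked("Tissue-resident macrophage-based immune therapies")
print(f"...{doc[0:15]}...")
```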


# Build vector Embeddings:

In [16]:
openai.api_key = environ['OPENAI_API_KEY']

In [17]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
# TODO: There is an HTML text splitter. maybe skip bs4 and cut to chase?
from langchain.vectorstores import Neo4jVector
from langchain.document_loaders import TextLoader
from langchain.docstore.document import Document

In [18]:
len(paper3.langchain_doc)

Created a chunk of size 1197, which is longer than the specified 1000
Created a chunk of size 1274, which is longer than the specified 1000
Created a chunk of size 1647, which is longer than the specified 1000
Created a chunk of size 1902, which is longer than the specified 1000
Created a chunk of size 1529, which is longer than the specified 1000
Created a chunk of size 1496, which is longer than the specified 1000
Created a chunk of size 1123, which is longer than the specified 1000
Created a chunk of size 1078, which is longer than the specified 1000
Created a chunk of size 1122, which is longer than the specified 1000
Created a chunk of size 1342, which is longer than the specified 1000
Created a chunk of size 1292, which is longer than the specified 1000
Created a chunk of size 1133, which is longer than the specified 1000
Created a chunk of size 1627, which is longer than the specified 1000
Created a chunk of size 1483, which is longer than the specified 1000
Created a chunk of s

43
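
The warnings above appear when a piece of text between two separators is already longer than the requested chunk size: a splitter that cuts only at separators has to keep such a piece whole. A simplified sketch of that behavior follows. It is not LangChain's code, it ignores chunk overlap, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A piece that is longer than chunk_size on its own is kept whole,
    which is the situation the warnings above report.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "\n\n".join(["a" * 400, "b" * 400, "c" * 1200, "d" * 100])
print([len(chunk) for chunk in split_on_separator(sample)])
```

The two short paragraphs are merged into one chunk, while the 1200-character paragraph becomes a chunk of its own that exceeds the limit.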

In [19]:
embeddings = OpenAIEmbeddings()

In [20]:
# TODO: let's make an object in BioRxivist that does this
vec = Neo4jVector.from_documents(
    paper3.langchain_doc,
    embeddings,  # reuse the embeddings object created above
    url=f'bolt://{environ["NEO4J_HOST"]}:{environ["NEO4J_BOLT_PORT"]}',
    username=environ['NEO4J_USERNAME'],
    password=''
)

In [21]:
type(vec)

langchain.vectorstores.neo4j_vector.Neo4jVector

In [23]:
docs_with_score = vec.similarity_search_with_score('What is the role of CSF2?', k=5)

In [24]:
docs_with_score[0]

(Document(page_content='In addition to the homeostatic function, AM play an essential role in protecting influenza virus-infected mice from morbidity by maintaining lung integrity through the removal of dead cells and excess surfactant (Schneider et al, 2014). To assess the functional capacity of CSF2-cFLiMo-derived AM during pulmonary virus infection, we reconstituted Csf2ra-/- neonates with CSF2-cFLiMo and infected adults 10 weeks later with influenza virus PR8 (Fig. 5A). Without transfer, Csf2ra-/- mice succumbed to infection due to lung failure (Fig. 5B-E), as reported previously (Schneider et al, 2017). Notably, the presence of CSF2-cFLiMo-derived-AM protected Csf2ra-/- mice from severe morbidity (Fig. 5B, C) and completely restored viability (Fig. 5D) and O2 saturation (Fig. 5E) compared to infected WT mice.', metadata={'title': 'Long-term culture of fetal monocyte precursors in vitro allowing the generation of bona fide alveolar macrophages in vivo', 'source': 'https://biorxiv.o

In [25]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.9213951230049133
In addition to the homeostatic function, AM play an essential role in protecting influenza virus-infected mice from morbidity by maintaining lung integrity through the removal of dead cells and excess surfactant (Schneider et al, 2014). To assess the functional capacity of CSF2-cFLiMo-derived AM during pulmonary virus infection, we reconstituted Csf2ra-/- neonates with CSF2-cFLiMo and infected adults 10 weeks later with influenza virus PR8 (Fig. 5A). Without transfer, Csf2ra-/- mice succumbed to infection due to lung failure (Fig. 5B-E), as reported previously (Schneider et al, 2017). Notably, the presence of CSF2-cFLiMo-derived-AM protected Csf2ra-/- mice from severe morbidity (Fig. 5B, C) and completely restored viability (Fig. 5D) and O2 saturation (Fig. 5E) compared to infected WT mice.
--------------------------------------------------------------------------------
---------
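
`similarity_search_with_score` returns `(document, score)` pairs ranked by how close each chunk's embedding is to the query's embedding. A sketch of that ranking with toy three-dimensional vectors follows, assuming plain cosine similarity. The exact scoring Neo4j applies may differ, and real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, docs, k=5):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors for illustration; these are not real embeddings.
toy_docs = [
    ("chunk about CSF2", [0.9, 0.1, 0.0]),
    ("chunk about methods", [0.0, 1.0, 0.0]),
    ("chunk about funding", [0.0, 0.0, 1.0]),
]
for text, score in top_k([1.0, 0.0, 0.0], toy_docs, k=2):
    print(f"{score:.3f}  {text}")
```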

# Connect to an existing vector Store

In [26]:
db = Neo4JDatabase.from_environment()

In [27]:
db.url

'neo4j://localhost:7687'

In [28]:
labels = db.fetch_labels_and_properties()

In [29]:
labels

[{"labels": ["Chunk"], "propkeys": ["id", "text", "title", "source", "embedding"]}]

In [30]:
labels.data

[{'labels': ['Chunk'],
  'propkeys': ['id', 'text', 'title', 'source', 'embedding']}]

In [31]:
labels.metadata.query_type

'r'

In [32]:
print(labels)
# This tells us the node label and property keys we need when we make our vector store.

[{"labels": ["Chunk"], "propkeys": ["id", "text", "title", "source", "embedding"]}]


In [33]:
with db as session:
    r = session.execute_read(db.transaction, "SHOW CONSTRAINTS")
r
# this tells us that "id" carries a uniqueness constraint, which is backed by an index

[{"id": 5, "name": "constraint_1dc138a", "type": "UNIQUENESS", "entityType": "NODE", "labelsOrTypes": ["Chunk"], "properties": ["id"], "ownedIndex": "constraint_1dc138a", "propertyType": null}]
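
The constraint row above can also be read programmatically. A small sketch follows that maps each node label to its uniquely constrained properties, using the row copied from the output as a plain Python literal. `unique_properties` is a hypothetical helper, and treating the query result as a list of dicts is an assumption about the object `db.transaction` returns.

```python
# The SHOW CONSTRAINTS row from the output above, written as a Python literal.
constraints = [
    {
        "id": 5,
        "name": "constraint_1dc138a",
        "type": "UNIQUENESS",
        "entityType": "NODE",
        "labelsOrTypes": ["Chunk"],
        "properties": ["id"],
        "ownedIndex": "constraint_1dc138a",
        "propertyType": None,
    }
]


def unique_properties(rows):
    """Map each node label to the properties covered by a uniqueness constraint."""
    result = {}
    for row in rows:
        if row["type"] == "UNIQUENESS" and row["entityType"] == "NODE":
            for label in row["labelsOrTypes"]:
                result.setdefault(label, []).extend(row["properties"])
    return result


print(unique_properties(constraints))
```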

In [34]:
vec2 = db.make_vectorstore(
    embedding=OpenAIEmbeddings(),
    index_name="id",
    node_label="Chunk",
    text_node_properties=["id", "text", "title", "source", "embedding"],
    embedding_node_property="embedding"
)

In [35]:
type(vec2)

langchain.vectorstores.neo4j_vector.Neo4jVector