Exploring a Textual Corpus with ArchiTXT
========================================

This tutorial provides a **step-by-step guide** on how to use **ArchiTXT** to efficiently process and analyze textual corpora.

ArchiTXT allows loading a corpus as a set of syntax trees, where each tree is enriched by incorporating named entities.
These enriched trees form a **forest**, which can then be automatically structured into a valid **database instance** for further analysis.

By following this tutorial, you'll learn how to:
- Load a corpus
- Parse textual data with **Berkeley Neural Parser (Benepar)**
- Extract structured data using **ArchiTXT**

In [None]:
import itables

itables.init_notebook_mode(connected=True)

## Downloading the MACCROBAT Corpus
The **MACCROBAT** corpus is a collection of **200 annotated medical documents**, specifically **clinical case reports**, extracted from **PubMed Central**.
The annotations focus on key medical concepts such as **diseases, treatments, medications, and symptoms**, making it a valuable resource for biomedical text analysis.

The **MACCROBAT** corpus is available for download on [Figshare](https://figshare.com/articles/dataset/MACCROBAT2018/9764942) or on [Kaggle](https://www.kaggle.com/datasets/okolojeremiah/maccrobat).

Let's download the corpus.

In [None]:
import urllib.request

urllib.request.urlretrieve(
    'https://www.kaggle.com/api/v1/datasets/download/okolojeremiah/maccrobat',
    filename='MACCROBAT.zip',
)
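
Before parsing, it can help to check what the archive actually contains. The sketch below is not part of ArchiTXT: `summarize_archive` is a hypothetical helper that counts the files of a zip archive by extension using only the standard library. If the corpus is distributed in the brat standoff format, as we assume here, the result should mostly consist of paired `.txt` and `.ann` files.

```python
import zipfile
from collections import Counter
from pathlib import Path, PurePosixPath


def summarize_archive(path):
    """Count the files of a zip archive by extension, ignoring directories."""
    with zipfile.ZipFile(path) as archive:
        return Counter(
            PurePosixPath(info.filename).suffix or '<none>'
            for info in archive.infolist()
            if not info.is_dir()
        )


# Only inspect the corpus if the download above succeeded.
if Path('MACCROBAT.zip').exists():
    print(summarize_archive('MACCROBAT.zip'))
```

If the counts look unexpected, or the file is not a valid zip archive, the download most likely failed and should be retried from one of the links above.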

## Installing and Configuring NLP Models

ArchiTXT can parse sentences using either **Benepar** with SpaCy or a **CoreNLP** server.
In this tutorial, we will use the **SpaCy parser** with the default model, but you can use any other SpaCy model, such as one from **SciSpaCy**, a collection of models designed for biomedical text processing by **AllenAI**.

To download the default SpaCy model, run:

In [None]:
!spacy download en_core_web_sm

We also need to download the Benepar model for English:

In [None]:
import benepar

benepar.download('benepar_en3')

## Parsing the Corpus with ArchiTXT

Before processing the corpus, we need to configure the **BeneparParser**, specifying which SpaCy model to use for each language.

In [None]:
import warnings

from architxt.nlp.parser.benepar import BeneparParser

# Initialize the parser
parser = BeneparParser(
    spacy_models={
        'English': 'en_core_web_sm',
    }
)

# Suppress all warnings (mostly raised for unsupported annotations)
warnings.filterwarnings("ignore")

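Named entity resolution standardizes the named entities by linking them to a knowledge base, which helps to build a database instance.
To enable it, we need to specify which knowledge base to use.
For this tutorial, we will use the **UMLS (Unified Medical Language System)** resolver.
The cell below also creates a `FlairEntityExtractor`, which is passed to the loader alongside the resolver.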

In [None]:
from architxt.nlp.entity_extractor import FlairEntityExtractor
from architxt.nlp.entity_resolver import ScispacyResolver

resolver = ScispacyResolver(kb_name='umls')
extractor = FlairEntityExtractor()

Let's parse a sample of the corpus. To verify that everything is functioning as expected, we will inspect the tallest enriched tree using the :py:meth:`~architxt.tree.Tree.pretty_print` method.

In [None]:
from architxt.nlp import raw_load_corpus

forest = [
    tree
    async for tree in raw_load_corpus(
        ['MACCROBAT.zip'],
        ['English'],
        cache=False,
        parser=parser,
        resolver=resolver,
        extractor=extractor,
        sample=600,
        entities_filter={
            'OTHER_ENTITY',
            'OTHER_EVENT',
            'COREFERENCE',
        },
        entities_mapping={
            'QUANTITATIVE_CONCEPT': 'VALUE',
            'QUALITATIVE_CONCEPT': 'VALUE',
            'LAB_VALUE': 'VALUE',
            'THERAPEUTIC_PROCEDURE': 'TREATMENT',
            'MEDICATION': 'TREATMENT',
            'OUTCOME': 'SIGN_SYMPTOM',
        },
    )
]
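
To make the effect of `entities_filter` and `entities_mapping` concrete, here is a minimal sketch in plain Python. It is not ArchiTXT's implementation: `normalize_labels` is a hypothetical helper, and it assumes that filtered labels are dropped and that the remaining labels are renamed through the mapping. With the values used above the two sets of labels are disjoint, so the order of the two steps does not matter.

```python
def normalize_labels(labels, entities_filter, entities_mapping):
    """Drop the filtered labels, then rename the remaining ones (assumed semantics)."""
    return [
        entities_mapping.get(label, label)  # labels without a mapping are kept as-is
        for label in labels
        if label not in entities_filter
    ]


raw_labels = ['MEDICATION', 'COREFERENCE', 'AGE', 'LAB_VALUE', 'OUTCOME']

print(
    normalize_labels(
        raw_labels,
        entities_filter={'OTHER_ENTITY', 'OTHER_EVENT', 'COREFERENCE'},
        entities_mapping={
            'LAB_VALUE': 'VALUE',
            'MEDICATION': 'TREATMENT',
            'OUTCOME': 'SIGN_SYMPTOM',
        },
    )
)
```

Merging closely related labels this way reduces the number of distinct entity types, which should make recurring patterns easier to detect later on.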

In [None]:
# Look at the tallest tree
max(forest, key=lambda tree: tree.height).pretty_print()

Let's look at the distribution of the entities inside this sample.

In [None]:
from collections import Counter

import plotly.express as px

entity_count = Counter(entity.label.name for tree in forest for entity in tree.entities())

sorted_entities = sorted(entity_count.items(), key=lambda x: x[1], reverse=True)
entities = [label for label, count in sorted_entities]
counts = [count for label, count in sorted_entities]

fig = px.histogram(y=entities, x=counts, orientation='h')
fig.update_layout(xaxis_title='Count', yaxis_title='Entities')
fig.show()

**ArchiTXT** can then automatically structure parsed text into a **database-friendly format**.
Let's start with a simple rewrite!

In [None]:
from copy import deepcopy

from architxt.simplification.simple_rewrite import simple_rewrite

forest_copy = deepcopy(forest)
simple_rewrite(forest_copy)

In [None]:
# Look at the tallest tree
max(forest_copy, key=lambda tree: tree.height).pretty_print()

Now that we have a structured instance, we can extract its schema.
The schema provides a **formal representation** of the extracted data.

In [None]:
from architxt.schema import Schema

schema = Schema.from_forest(forest_copy, keep_unlabelled=False)
print(schema.as_cfg())
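
To give an intuition of what a grammar-like schema captures, the sketch below derives production rules from a toy tree. Everything in it is an assumption made for illustration: the nested `(label, children)` tuples, the `productions` helper, and the labels are hypothetical, and the output of `as_cfg()` may be formatted differently.

```python
def productions(tree):
    """Yield 'PARENT -> CHILD ...' rules from nested (label, children) tuples."""
    label, children = tree
    subtrees = [child for child in children if isinstance(child, tuple)]
    if subtrees:
        yield f"{label} -> {' '.join(child[0] for child in subtrees)}"
        for child in subtrees:
            yield from productions(child)


# Toy tree with hypothetical labels; leaves are plain strings.
toy_tree = (
    'ROOT',
    [
        ('GROUP_1', [('DISEASE', ['pneumonia']), ('TREATMENT', ['antibiotics'])]),
        ('GROUP_1', [('DISEASE', ['asthma']), ('TREATMENT', ['salbutamol'])]),
    ],
)

# Each distinct rule is kept once, whatever the number of trees that use it.
rules = sorted(set(productions(toy_tree)))
print('\n'.join(rules))
```

Both `GROUP_1` subtrees share the same shape, so they contribute a single rule. The fewer distinct rules a forest produces, the more regular its structure is.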

We've successfully built a basic database schema from our corpus, but there's significant potential for improvement.
Let's explore how we can enhance it using the **ArchiTXT** simplification algorithm!

First, let's look at the distribution of equivalence classes inside the forest.

In [None]:
from architxt.similarity import equiv_cluster

clusters = equiv_cluster(forest, tau=0.95)
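
The following sketch illustrates the general idea of threshold-based clustering, assuming `tau` acts as a minimum similarity. It is a simplified stand-in, not ArchiTXT's algorithm: each subtree is reduced to its set of entity labels, similarity is the Jaccard index, and `jaccard` and `cluster_by_similarity` are hypothetical helpers.

```python
def jaccard(a, b):
    """Jaccard index between two sets of labels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_similarity(groups, tau):
    """Greedily put each group in the first cluster whose representative is similar enough."""
    clusters = []
    for group in groups:
        for members in clusters:
            if jaccard(group, members[0]) >= tau:
                members.append(group)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([group])
    return clusters


toy_groups = [
    frozenset({'DISEASE', 'TREATMENT'}),
    frozenset({'DISEASE', 'TREATMENT'}),
    frozenset({'DISEASE', 'TREATMENT', 'DOSAGE'}),
    frozenset({'AGE', 'SEX'}),
]

strict = cluster_by_similarity(toy_groups, tau=0.95)
loose = cluster_by_similarity(toy_groups, tau=0.6)
print([len(members) for members in strict])
print([len(members) for members in loose])
```

In this toy setting, a high threshold only merges identical label sets, while a lower one also absorbs the group that carries an extra `DOSAGE` label. The same trade-off applies when choosing `tau`: higher values give more, purer classes.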

Let's visualize the clustering result as a bar chart to better understand the distribution of groups across equivalence classes.

In [None]:
clusters_names = sorted(filter(lambda klass: len(clusters[klass]) >= 5, clusters.keys()))
fig = px.bar(y=clusters_names, x=[len(clusters[klass]) for klass in clusters_names], orientation='h')

fig.update_layout(xaxis_title='Count', yaxis_title='Equivalence Class')
fig.show()

It's now time to use **ArchiTXT** to automatically structure the data.

In [None]:
from architxt.simplification.tree_rewriting import rewrite

rewrite(forest, epoch=30, min_support=5, tau=0.95)

In [None]:
# Look at the tallest tree
max(forest, key=lambda tree: tree.height).pretty_print()

We now have a more granular structure. Let's take a closer look at the schema.

In [None]:
schema = Schema.from_forest(forest, keep_unlabelled=False)
print(schema.as_cfg())

The schema is now much smaller, and the groups are more meaningful.

However, not all extracted trees provide valuable insights, so we could filter the structured instance to keep only the **valid trees** using `schema.extract_valid_trees(forest)`.

Let's explore the different **semantic groups**, which represent common patterns across the corpus.
The cell below displays the dataset of the group with the most rows.

In [None]:
all_datasets = schema.extract_datasets(forest)
group, dataset = max(all_datasets.items(), key=lambda x: len(x[1]))

print(f'Group: {group}')

dataset

## Exporting as a Property Graph

Now that we've structured our corpus, we can export the result as a **property graph**.

We use `testcontainers` to spin up a **disposable Neo4j instance** for safe experimentation.

In [None]:
from testcontainers.neo4j import Neo4jContainer

neo4j = Neo4jContainer('neo4j:5')
neo4j.start()
uri = neo4j.get_connection_url()

ArchiTXT makes it easy to export structured data like a tree or forest directly into a property graph.

In [None]:
from architxt.database.export import export_cypher
from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=('neo4j', 'password'))

with driver.session() as session:
    export_cypher(forest, session=session)

Let's explore the generated graph database.

In [None]:
from yfiles_jupyter_graphs_for_neo4j import Neo4jGraphWidget

g = Neo4jGraphWidget(driver)
g.show_cypher("""
MATCH (n)
OPTIONAL MATCH path = (n)-[*..4]-()
RETURN n, path
LIMIT 50
""")