### Embed PubMed journal articles into Weaviate

PubMed MultiLabel Text Classification Dataset MeSH: https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification

In [1]:
from weaviate.util import generate_uuid5
import weaviate
import json
import pandas as pd


In [27]:
# Read in the PubMed data
df = pd.read_csv("PubMed Multi Label Text Classification Dataset Processed.csv")

In [8]:
client = weaviate.Client(
    url = "XXX",  # Replace with your Weaviate endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="XXX"),  # Replace with your Weaviate instance API key
    additional_headers = {
        "X-OpenAI-Api-Key": "XXX"  # Replace with your inference API key
    }
)
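
The placeholders above are meant to be replaced with real credentials. A minimal sketch of one way to keep them out of the notebook is to read them from environment variables. The variable names `WEAVIATE_URL`, `WEAVIATE_API_KEY` and `OPENAI_API_KEY` and both helper functions are assumptions made for illustration, not part of the original notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook")
    return value

def load_settings() -> dict:
    # Variable names are hypothetical; rename them to match your environment.
    return {
        "url": require_env("WEAVIATE_URL"),
        "weaviate_api_key": require_env("WEAVIATE_API_KEY"),
        "openai_api_key": require_env("OPENAI_API_KEY"),
    }
```

The returned values could then be passed to `weaviate.Client` in place of the `"XXX"` strings.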

In [28]:
# Keep the first 10,000 rows; copy so later in-place edits do not act on a slice
df = df[:10000].copy()

In [29]:
class_obj = {
    # Class definition
    "class": "articles",

    # Property definitions
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
        },
        {
            "name": "abstractText",
            "dataType": ["text"],
        },
    ],

    # Specify a vectorizer
    "vectorizer": "text2vec-openai",

    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": True,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        },
        "qna-openai": {
          "model": "gpt-3.5-turbo-instruct"
        },
        "generative-openai": {
          "model": "gpt-3.5-turbo"
        }
    },
}

In [30]:
client.schema.create_class(class_obj)


In [31]:
import logging
import numpy as np

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')

# Replace infinity values with NaN and then fill NaN values
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna('', inplace=True)

# Convert columns to string type
df['Title'] = df['Title'].astype(str)
df['abstractText'] = df['abstractText'].astype(str)

# Log the data types
logging.info(f"Title column type: {df['Title'].dtype}")
logging.info(f"abstractText column type: {df['abstractText'].dtype}")

with client.batch(
    batch_size=10,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    for index, row in df.iterrows():
        try:
            # Build the object to store; the UUID is derived from its content
            article_object = {
                "title": row.Title,
                "abstractText": row.abstractText,
            }
            batch.add_data_object(
                article_object,
                class_name="articles",
                uuid=generate_uuid5(article_object)
            )
        except Exception as e:
            logging.error(f"Error processing row {index}: {e}")


2024-08-23 13:30:23,330 INFO Title column type: object
2024-08-23 13:30:23,331 INFO abstractText column type: object
            Use the `client.batch.configure()` method to configure your batch process, and `client.batch` to enter the context manager.

            See https://weaviate.io/developers/weaviate/client-libraries/python for details.


In [32]:
client.query.aggregate("articles").with_meta_count().do()

{'data': {'Aggregate': {'Articles': [{'meta': {'count': 9997}}]}}}
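
The count above is 9,997 although 10,000 rows were sent to the batch. One plausible explanation, which the notebook does not confirm, is that a few rows share the same title and abstract: an ID derived from the object's content is deterministic, so identical objects map to the same ID and overwrite each other. The sketch below illustrates the idea with the standard library's `uuid.uuid5`. The helper `content_uuid` is a stand-in written for this example and is not Weaviate's implementation.

```python
import json
import uuid

def content_uuid(obj: dict) -> str:
    # Stand-in for a content-derived ID: hash a canonical JSON form of the object.
    canonical = json.dumps(obj, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))

rows = [
    {"title": "A", "abstractText": "same text"},
    {"title": "A", "abstractText": "same text"},   # exact duplicate row
    {"title": "B", "abstractText": "other text"},
]

# Identical rows collapse to one ID, so fewer objects than rows are stored.
unique_ids = {content_uuid(r) for r in rows}
print(len(rows), "rows ->", len(unique_ids), "distinct IDs")
```

Checking `df.duplicated(subset=["Title", "abstractText"]).sum()` on the DataFrame would show whether duplicates account for the three missing objects.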

In [33]:

res = (
    client.query
    .get("articles", ["title", "abstractText"])
    .with_additional(["id"])
    .with_limit(2)
    .do()
)

print(json.dumps(res, indent=4))


{
    "data": {
        "Get": {
            "Articles": [
                {
                    "_additional": {
                        "id": "00083969-facc-5ba7-a200-49a065aff97c"
                    },
                    "abstractText": "Streptomyces sp. GSL-6B was isolated from sediment collected from the Great Salt Lake and investigation of its organic extract led to the isolation of three new linear heptapeptides, bonnevillamides A (1), B (2), and C (3). The bonnevillamides represent a new class of linear peptides featuring unprecedented non-proteinogenic amino acids. All three peptides contain the newly characterized bonnevillic acid moiety (3-(3,5-dichloro-4-methoxyphenyl)-2-hydroxyacrylic acid), as well as a heavily modified proline residue. Moreover, in bonnevillamide A, the terminal proline residue found in bonnevillamides B and C is replaced with 4-methyl-azetidine-2-carboxylic acid methyl ester. The structures of the three heptapeptides were elucidated by NMR, high-resol

In [421]:
response = (
    client.query
    .get("Articles", ["title", "abstractText"])
    .with_additional(["id"])
    .with_near_text({"concepts": ["Mouth Neoplasms"]})  # Semantic search on the concept
    .with_limit(10)
    .do()
)

print(json.dumps(response, indent=4))


{
    "data": {
        "Get": {
            "Articles": [
                {
                    "_additional": {
                        "id": "a7690f03-66b9-5d17-b765-8c6eb21f99c8"
                    },
                    "abstractText": "INTRODUCTION: Metastatic malignant mesothelioma to the oral cavity is extremely rare. They are more common in the jaw bones than the soft tissue. Occurrence of the malignant disease typically carries an average survival rate of 9-12 monthsMETHODS: : Thirteen patients underwent neoadjuvant chemotherapy and radical pleurectomy decortication, followed by radiotherapy from August 2012 to September 2013. Patients were followed up with computed tomography of the chest and the abdomen every 3 months. All patients were followed up until February 2014.RESULTS: In January 2014, 11 patients were still alive with a median survival of 11 months, eight patients developed a recurrence and two patients died at 8 and 9 months after surgery. After 1 year from macro

In [422]:
response = (
    client.query
    .get("articles", ["title", "abstractText"])
    .with_near_object({
        "id": "a7690f03-66b9-5d17-b765-8c6eb21f99c8"
    })
    .with_limit(10)
    .with_additional(["distance"])
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "Articles": [
        {
          "_additional": {
            "distance": -4.7683716e-07
          },
          "abstractText": "INTRODUCTION: Metastatic malignant mesothelioma to the oral cavity is extremely rare. They are more common in the jaw bones than the soft tissue. Occurrence of the malignant disease typically carries an average survival rate of 9-12 monthsMETHODS: : Thirteen patients underwent neoadjuvant chemotherapy and radical pleurectomy decortication, followed by radiotherapy from August 2012 to September 2013. Patients were followed up with computed tomography of the chest and the abdomen every 3 months. All patients were followed up until February 2014.RESULTS: In January 2014, 11 patients were still alive with a median survival of 11 months, eight patients developed a recurrence and two patients died at 8 and 9 months after surgery. After 1 year from macroscopic radical pleurectomy decortication, a 68-year-old man suffered from gingiv

In [429]:
response = (
    client.query
    .get("articles", ["title", "abstractText"])
    .with_near_text({"concepts": ["Gingival metastasis as first sign of multiorgan dissemination of epithelioid malignant mesothelioma"]})
    .with_generate(single_prompt="Please explain this article {title} like you would to someone without a medical degree.")
    .with_limit(1)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Articles": [
                {
                    "_additional": {
                        "generate": {
                            "error": null,
                            "singleResult": "Sure! This article is talking about a case where a person had a type of cancer called epithelioid malignant mesothelioma. This cancer usually starts in the lining of the lungs or abdomen. However, in this case, the first sign of the cancer spreading to other parts of the body was seen in the gums (gingiva). This is called gingival metastasis.\n\nMetastasis means that cancer cells have spread from the original tumor to other parts of the body. In this case, the cancer had spread to the gums before spreading to other organs. This is important because it shows that the cancer was already advanced and had spread to multiple organs before it was even detected.\n\nOverall, this article highlights the importance of early detection and monitoring of cancer, 

In [431]:
# Grouped RAG: one generated answer across all retrieved articles
response = (
    client.query
    .get("Articles", ["title", "abstractText"])
    .with_near_text({"concepts": ["Mouth Neoplasms"]})
    .with_limit(3)
    .with_generate(grouped_task="Summarize the key information here in bullet points. Make it understandable to someone without a medical degree.")
    .do()
)

print(response["data"]["Get"]["Articles"][0]["_additional"]["generate"]["groupedResult"])

- Metastatic malignant mesothelioma to the oral cavity is rare, with more cases in jaw bones than soft tissue
- Average survival rate for this type of cancer is 9-12 months
- Study of 13 patients who underwent neoadjuvant chemotherapy and surgery showed a median survival of 11 months
- One patient had a gingival mass as the first sign of multiorgan recurrence of mesothelioma
- Biopsy of new growing lesions, even in uncommon sites, is important for patients with a history of mesothelioma
- Myoepithelioma of minor salivary gland origin can show features indicative of malignant potential
- Metastatic neuroblastoma in the mandible is very rare and can present with osteolytic jaw defects and looseness of deciduous molars in children


### Turn metadata into a KG

In [45]:
from rdflib import Graph, RDF, RDFS, Namespace, URIRef, Literal
from rdflib.namespace import SKOS, XSD
import pandas as pd
import urllib.parse
import random
from datetime import datetime, timedelta

# Create a new RDF graph
g = Graph()

# Define namespaces
schema = Namespace('http://schema.org/')
ex = Namespace('http://example.org/')
prefixes = {
    'schema': schema,
    'ex': ex,
    'skos': SKOS,
    'xsd': XSD
}
for p, ns in prefixes.items():
    g.bind(p, ns)

# Define classes and properties
Article = URIRef(ex.Article)
MeSHTerm = URIRef(ex.MeSHTerm)
g.add((Article, RDF.type, RDFS.Class))
g.add((MeSHTerm, RDF.type, RDFS.Class))

title = URIRef(schema.name)
abstract = URIRef(schema.description)
date_published = URIRef(schema.datePublished)
access = URIRef(ex.access)

g.add((title, RDF.type, RDF.Property))
g.add((abstract, RDF.type, RDF.Property))
g.add((date_published, RDF.type, RDF.Property))
g.add((access, RDF.type, RDF.Property))

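# Function to clean and parse MeSH terms.
# Note: this is a plain comma split of the stringified list, so quote characters
# stay attached to most terms and terms that contain commas are split in two.
def parse_mesh_terms(mesh_list):
    if pd.isna(mesh_list):
        return []
    return [term.strip().replace(' ', '_') for term in mesh_list.strip("[]'").split(',')]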

# Function to create a valid URI
def create_valid_uri(base_uri, text):
    if pd.isna(text):
        return None
    sanitized_text = urllib.parse.quote(text.strip().replace(' ', '_').replace('"', '').replace('<', '').replace('>', '').replace("'", "_"))
    return URIRef(f"{base_uri}/{sanitized_text}")

# Function to generate a random date within the last 5 years
def generate_random_date():
    start_date = datetime.now() - timedelta(days=5*365)
    random_days = random.randint(0, 5*365)
    return start_date + timedelta(days=random_days)

# Function to generate a random access value between 1 and 10
def generate_random_access():
    return random.randint(1, 10)

# Load your DataFrame here
# df = pd.read_csv('your_data.csv')

# Loop through each row in the DataFrame and create RDF triples
for index, row in df.iterrows():
    article_uri = create_valid_uri("http://example.org/article", row['Title'])
    if article_uri is None:
        continue
    
    # Add Article instance
    g.add((article_uri, RDF.type, Article))
    g.add((article_uri, title, Literal(row['Title'], datatype=XSD.string)))
    g.add((article_uri, abstract, Literal(row['abstractText'], datatype=XSD.string)))
    
    # Add random datePublished and access
    random_date = generate_random_date()
    random_access = generate_random_access()
    g.add((article_uri, date_published, Literal(random_date.date(), datatype=XSD.date)))
    g.add((article_uri, access, Literal(random_access, datatype=XSD.integer)))
    
    # Add MeSH Terms
    mesh_terms = parse_mesh_terms(row['meshMajor'])
    for term in mesh_terms:
        term_uri = create_valid_uri("http://example.org/mesh", term)
        if term_uri is None:
            continue
        
        # Add MeSH Term instance
        g.add((term_uri, RDF.type, MeSHTerm))
        g.add((term_uri, RDFS.label, Literal(term.replace('_', ' '), datatype=XSD.string)))
        
        # Link Article to MeSH Term
        g.add((article_uri, schema.about, term_uri))

# Serialize the graph to a file; the similarity search section reloads it from ontology.ttl
g.serialize(destination='ontology.ttl', format='turtle')


<Graph identifier=N6eec4d8e56b4491598e953312a77df35 (<class 'rdflib.graph.Graph'>)>
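
The `meshMajor` column appears to hold a stringified Python list, and splitting it on commas breaks terms such as `Tomography, X-Ray Computed` into two pieces and leaves quote characters attached. A minimal sketch of an alternative parser, assuming each cell really is a valid list literal, uses `ast.literal_eval`. The function name `parse_mesh_list` is hypothetical.

```python
import ast

def parse_mesh_list(raw):
    """Parse a stringified Python list of MeSH terms into a list of clean strings."""
    if not isinstance(raw, str) or not raw.strip():
        return []  # covers NaN/None and empty cells
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a valid list literal; the caller decides how to handle it
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(term).strip() for term in parsed]

print(parse_mesh_list("['DNA Probes, HPV', 'DNA, Viral', 'Female']"))
```

Swapping this in would change the generated MeSH URIs, so any code that builds URIs to match the existing graph would need to be adjusted at the same time.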

### Semantic search using KG

In [376]:
from SPARQLWrapper import SPARQLWrapper, JSON

def get_concept_triples_for_term(term):
    sparql = SPARQLWrapper("https://id.nlm.nih.gov/mesh/sparql")
    query = f"""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
    PREFIX mesh: <http://id.nlm.nih.gov/mesh/>

    SELECT ?subject ?p ?pLabel ?o ?oLabel
    FROM <http://id.nlm.nih.gov/mesh>
    WHERE {{
        ?subject rdfs:label "{term}"@en .
        ?subject ?p ?o .
        FILTER(CONTAINS(STR(?p), "concept"))
        OPTIONAL {{ ?p rdfs:label ?pLabel . }}
        OPTIONAL {{ ?o rdfs:label ?oLabel . }}
    }}
    """
    
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    
    triples = set()  # Using a set to avoid duplicate entries
    for result in results["results"]["bindings"]:
        obj_label = result.get("oLabel", {}).get("value", "No label")
        triples.add(obj_label)
    
    # Add the term itself to the list
    triples.add(term)
    
    return list(triples)  # Convert back to a list for easier handling

def get_narrower_concepts_for_term(term):
    sparql = SPARQLWrapper("https://id.nlm.nih.gov/mesh/sparql")
    query = f"""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
    PREFIX mesh: <http://id.nlm.nih.gov/mesh/>

    SELECT ?narrowerConcept ?narrowerConceptLabel
    WHERE {{
        ?broaderConcept rdfs:label "{term}"@en .
        ?narrowerConcept meshv:broaderDescriptor ?broaderConcept .
        ?narrowerConcept rdfs:label ?narrowerConceptLabel .
    }}
    """
    
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    
    concepts = set()  # Using a set to avoid duplicate entries
    for result in results["results"]["bindings"]:
        subject_label = result.get("narrowerConceptLabel", {}).get("value", "No label")
        concepts.add(subject_label)
    
    return list(concepts)  # Convert back to a list for easier handling

def get_all_narrower_concepts(term, depth=2, current_depth=1):
    # Create a dictionary to store the terms and their narrower concepts
    all_concepts = {}

    # Initial fetch for the primary term
    narrower_concepts = get_narrower_concepts_for_term(term)
    all_concepts[term] = narrower_concepts
    
    # If the current depth is less than the desired depth, fetch narrower concepts recursively
    if current_depth < depth:
        for concept in narrower_concepts:
            # Recursive call to fetch narrower concepts for the current concept
            child_concepts = get_all_narrower_concepts(concept, depth, current_depth + 1)
            all_concepts.update(child_concepts)
    
    return all_concepts

# Fetch alternative names and narrower concepts
term = "Mouth Neoplasms"
alternative_names = get_concept_triples_for_term(term)
all_concepts = get_all_narrower_concepts(term, depth=4)  # Adjust depth as needed

# Output alternative names
print("Alternative names:", alternative_names)
print()

# Output narrower concepts
for broader, narrower in all_concepts.items():
    print(f"Broader concept: {broader}")
    print(f"Narrower concepts: {narrower}")
    print("---")


Alternative names: ['Mouth Neoplasms', 'Cancer of Mouth']

Broader concept: Mouth Neoplasms
Narrower concepts: ['Salivary Gland Neoplasms', 'Tongue Neoplasms', 'Lip Neoplasms', 'Palatal Neoplasms', 'Gingival Neoplasms', 'Leukoplakia, Oral']
---
Broader concept: Salivary Gland Neoplasms
Narrower concepts: ['Sublingual Gland Neoplasms', 'Submandibular Gland Neoplasms', 'Parotid Neoplasms']
---
Broader concept: Sublingual Gland Neoplasms
Narrower concepts: []
---
Broader concept: Submandibular Gland Neoplasms
Narrower concepts: []
---
Broader concept: Parotid Neoplasms
Narrower concepts: []
---
Broader concept: Tongue Neoplasms
Narrower concepts: []
---
Broader concept: Lip Neoplasms
Narrower concepts: []
---
Broader concept: Palatal Neoplasms
Narrower concepts: []
---
Broader concept: Gingival Neoplasms
Narrower concepts: []
---
Broader concept: Leukoplakia, Oral
Narrower concepts: ['Leukoplakia, Hairy']
---
Broader concept: Leukoplakia, Hairy
Narrower concepts: []
---
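
The recursive expansion above issues one SPARQL request per term and would query a term again if it were reachable by more than one path. The sketch below shows the same depth-limited expansion as a breadth-first walk that visits each term once. It runs against a small in-memory hierarchy copied from the output above rather than the live endpoint; `TOY_HIERARCHY`, `lookup_narrower` and `expand_terms` are names invented for this illustration.

```python
from collections import deque

# Toy hierarchy taken from the printed output; stands in for the MeSH SPARQL endpoint.
TOY_HIERARCHY = {
    "Mouth Neoplasms": ["Salivary Gland Neoplasms", "Leukoplakia, Oral"],
    "Salivary Gland Neoplasms": ["Parotid Neoplasms"],
    "Leukoplakia, Oral": ["Leukoplakia, Hairy"],
}

def lookup_narrower(term):
    # Hypothetical replacement for the SPARQL lookup of narrower concepts.
    return TOY_HIERARCHY.get(term, [])

def expand_terms(root, depth, fetch=lookup_narrower):
    """Return root plus narrower terms down to `depth` levels, each term once."""
    ordered = [root]
    seen = {root}
    queue = deque([(root, 1)])
    while queue:
        term, level = queue.popleft()
        if level > depth:
            continue  # do not look below the requested depth
        for child in fetch(term):
            if child not in seen:
                seen.add(child)
                ordered.append(child)
                queue.append((child, level + 1))
    return ordered

print(expand_terms("Mouth Neoplasms", depth=2))
```

Because the result is already a flat, duplicate-free list, it could be passed straight to the URI conversion step.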


In [413]:
def flatten_concepts(concepts_dict):
    # Collect every broader and narrower term once, preserving first-seen order
    flat_list = []
    seen = set()
    for term, narrower_terms in concepts_dict.items():
        for candidate in [term, *narrower_terms]:
            if candidate not in seen:
                seen.add(candidate)
                flat_list.append(candidate)
    return flat_list

# Flatten the concepts dictionary
flat_list = flatten_concepts(all_concepts)

In [411]:
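# Convert the MeSH terms to URIs.
# The leading and trailing underscores mirror the quote characters that
# parse_mesh_terms leaves around each term, which create_valid_uri turns into "_".
def convert_to_mesh_uri(term):
    formatted_term = term.replace(" ", "_").replace(",", "_").replace("-", "_")
    return URIRef(f"http://example.org/mesh/_{formatted_term}_")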


# Convert terms to URIs
mesh_terms = [convert_to_mesh_uri(term) for term in flat_list]

In [435]:
from rdflib import URIRef

query = """
PREFIX schema: <http://schema.org/>
PREFIX ex: <http://example.org/>

SELECT ?article ?title ?abstract ?datePublished ?access ?meshTerm
WHERE {
  ?article a ex:Article ;
           schema:name ?title ;
           schema:description ?abstract ;
           schema:datePublished ?datePublished ;
           ex:access ?access ;
           schema:about ?meshTerm .

  ?meshTerm a ex:MeSHTerm .
}
"""

# Dictionary to store articles and their associated MeSH terms
article_data = {}

# Run the query for each MeSH term
for mesh_term in mesh_terms:
    results = g.query(query, initBindings={'meshTerm': mesh_term})

    # Process results
    for row in results:
        article_uri = row['article']

        if article_uri not in article_data:
            article_data[article_uri] = {
                'title': row['title'],
                'abstract': row['abstract'],
                'datePublished': row['datePublished'],
                'access': row['access'],
                'meshTerms': set()
            }

        # Add the MeSH term to the set for this article
        article_data[article_uri]['meshTerms'].add(str(row['meshTerm']))

# Rank articles by the number of matching MeSH terms
ranked_articles = sorted(
    article_data.items(),
    key=lambda item: len(item[1]['meshTerms']),
    reverse=True
)

# Get the top 3 articles
top_3_articles = ranked_articles[:3]

# Output results
for article_uri, data in top_3_articles:
    print(f"Title: {data['title']}")
    print(f"Abstract: {data['abstract']}")
    print("MeSH Terms:")
    for mesh_term in data['meshTerms']:
        print(f"  - {mesh_term}")
    print()


Title: Myoepithelioma of minor salivary gland origin. Light and electron microscopical study.
Abstract: A gingival tumor that invaded the anterior maxilla was removed from a 14-year-old boy and studied by light and electron microscopy. The tumor was composed exclusively of myoepithelial cells and appeared to be malignant. By light microscopy, the tumor appeared to be a poorly differentiated epithelial neoplasm of undetermined origin; however, electron microscopical examination showed myoepithelial differentiation, indicative of a salivary gland origin. To our knowledge, the present case represents the only confirmed myoepithelioma that shows features indicative of malignant potential. Myoepitheliomas may be related to mixed tumors of salivary glands.
MeSH Terms:
  - http://example.org/mesh/_Gingival_Neoplasms_
  - http://example.org/mesh/_Salivary_Gland_Neoplasms_

Title: [Feasability study of screening for malignant lesions in the oral cavity targeting tobacco users].
Abstract: INTROD
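
The ranking step above orders articles by how many of the expanded MeSH terms they are tagged with. A minimal sketch of that scoring on plain Python sets, independent of the RDF graph, is shown below. The article keys and the function name `rank_by_matches` are made up for the example.

```python
def rank_by_matches(article_terms, query_terms, top_n=3):
    """Rank articles by the number of query terms they are tagged with."""
    query = set(query_terms)
    scored = []
    for article, terms in article_terms.items():
        matched = query & set(terms)
        if matched:
            scored.append((article, sorted(matched)))
    # Most matches first; ties broken by article key for a stable order.
    scored.sort(key=lambda item: (-len(item[1]), item[0]))
    return scored[:top_n]

# Hypothetical tagging data.
article_terms = {
    "art1": {"Gingival Neoplasms", "Salivary Gland Neoplasms", "Humans"},
    "art2": {"Tongue Neoplasms"},
    "art3": {"Humans"},
}
query_terms = ["Gingival Neoplasms", "Salivary Gland Neoplasms", "Tongue Neoplasms"]
print(rank_by_matches(article_terms, query_terms))
```

Articles with no matching term are dropped, which mirrors the behaviour of querying the graph once per term.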

### Similarity search using a KG

In [416]:
from rdflib import Graph, URIRef
from rdflib.namespace import RDF, RDFS, Namespace, SKOS
import urllib.parse

# Define namespaces
schema = Namespace('http://schema.org/')
ex = Namespace('http://example.org/')
rdfs = Namespace('http://www.w3.org/2000/01/rdf-schema#')

# Function to calculate Jaccard similarity and return overlapping terms
def jaccard_similarity(set1, set2):
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    similarity = len(intersection) / len(union) if len(union) != 0 else 0
    return similarity, intersection

# Load the RDF graph
g = Graph()
g.parse('ontology.ttl', format='turtle')

def get_article_uri(title):
    # Convert the title to a URI-safe string
    safe_title = urllib.parse.quote(title.replace(" ", "_"))
    return URIRef(f"http://example.org/article/{safe_title}")

def get_mesh_terms(article_uri):
    query = """
    PREFIX schema: <http://schema.org/>
    PREFIX ex: <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?meshTerm
    WHERE {
      ?article schema:about ?meshTerm .
      ?meshTerm a ex:MeSHTerm .
      FILTER (?article = <""" + str(article_uri) + """>)
    }
    """
    results = g.query(query)
    mesh_terms = {str(row['meshTerm']) for row in results}
    return mesh_terms

def find_similar_articles(title):
    article_uri = get_article_uri(title)
    mesh_terms_given_article = get_mesh_terms(article_uri)

    # Query all articles and their MeSH terms
    query = """
    PREFIX schema: <http://schema.org/>
    PREFIX ex: <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?article ?meshTerm
    WHERE {
      ?article a ex:Article ;
               schema:about ?meshTerm .
      ?meshTerm a ex:MeSHTerm .
    }
    """
    results = g.query(query)

    mesh_terms_other_articles = {}
    for row in results:
        article = str(row['article'])
        mesh_term = str(row['meshTerm'])
        if article not in mesh_terms_other_articles:
            mesh_terms_other_articles[article] = set()
        mesh_terms_other_articles[article].add(mesh_term)

    # Calculate Jaccard similarity
    similarities = {}
    overlapping_terms = {}
    for article, mesh_terms in mesh_terms_other_articles.items():
        if article != str(article_uri):
            similarity, overlap = jaccard_similarity(mesh_terms_given_article, mesh_terms)
            similarities[article] = similarity
            overlapping_terms[article] = overlap

    # Sort by similarity and get top 15
    top_similar_articles = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:15]
    
    # Print results
    print(f"Top 15 articles similar to '{title}':")
    for article, similarity in top_similar_articles:
        print(f"Article URI: {article}")
        print(f"Jaccard Similarity: {similarity:.4f}")
        print(f"Overlapping MeSH Terms: {overlapping_terms[article]}")
        print()

# Example usage
article_title = "Gingival metastasis as first sign of multiorgan dissemination of epithelioid malignant mesothelioma."
find_similar_articles(article_title)


Top 15 articles similar to 'Gingival metastasis as first sign of multiorgan dissemination of epithelioid malignant mesothelioma.':
Article URI: http://example.org/article/Calcific_tendinitis_of_the_vastus_lateralis_muscle._A_report_of_three_cases.
Jaccard Similarity: 0.1923
Overlapping MeSH Terms: {'http://example.org/mesh/_Tomography', 'http://example.org/mesh/Aged_', 'http://example.org/mesh/_Male_', 'http://example.org/mesh/_Humans_', 'http://example.org/mesh/X-Ray_Computed'}

Article URI: http://example.org/article/What_is_the_optimal_duration_of_androgen_deprivation_therapy_in_prostate_cancer_patients_presenting_with_prostate-specific_antigen_levels__20_ng/ml%3F
Jaccard Similarity: 0.1852
Overlapping MeSH Terms: {'http://example.org/mesh/_Radiotherapy', 'http://example.org/mesh/Aged_', 'http://example.org/mesh/_Male_', 'http://example.org/mesh/_Humans_', 'http://example.org/mesh/Adjuvant_'}

Article URI: http://example.org/article/CT_scan_cerebral_hemispheric_asymmetries%3A_predic
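
In the results above, the overlapping terms are mostly generic check tags such as Humans, Male and Aged, which is why unrelated articles score highest. The worked example below computes the Jaccard similarity for two small tag sets and then repeats it after removing a stop list of generic tags. The tag sets and the `GENERIC_TAGS` list are assumptions chosen for illustration, not values taken from the dataset.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical tag sets for two unrelated articles.
given = {"Humans", "Male", "Aged", "Mesothelioma", "Gingival Neoplasms"}
other = {"Humans", "Male", "Aged", "Prostatic Neoplasms"}

# Hypothetical stop list of demographic tags that carry little topical signal.
GENERIC_TAGS = {"Humans", "Male", "Female", "Aged"}

raw_score = jaccard(given, other)                                      # 3 shared / 6 total
filtered_score = jaccard(given - GENERIC_TAGS, other - GENERIC_TAGS)   # 0 shared / 3 total
print(raw_score, filtered_score)
```

Whether filtering generic tags is appropriate depends on the use case; for demographic queries those tags are exactly the signal of interest.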

In [508]:
from rdflib import URIRef

# Function to retrieve title and abstract for a given article URI
def get_article_details(article_uri):
    query = """
    PREFIX schema: <http://schema.org/>
    PREFIX ex: <http://example.org/>

    SELECT ?title ?abstract ?datePublished ?access
    WHERE {{
      <{}> a ex:Article ;
            schema:name ?title ;
            schema:description ?abstract ;
            schema:datePublished ?datePublished ;
            ex:access ?access .
    }}
    """.format(article_uri)

    # Execute the query
    results = g.query(query)

    # Process results
    for row in results:
        title = row['title']
        abstract = row['abstract']
        date_published = row['datePublished']
        access = row['access']
        return title, abstract, date_published, access

    # Return None if no results found
    return None, None, None, None

# Example usage
article_uri = "http://example.org/article/CT_scan_cerebral_hemispheric_asymmetries%3A_predictors_of_recovery_from_aphasia."  # Use the URI as a string
title, abstract, date_published, access = get_article_details(article_uri)

if title and abstract:
    print(f"Title: {title}")
    print(f"Abstract: {abstract}")
    print(f"Date Published: {date_published}")
    print(f"Access: {access}")
else:
    print("No article found with the given URI.")


Title: CT scan cerebral hemispheric asymmetries: predictors of recovery from aphasia.
Abstract: Individual variations in anatomic cerebral asymmetries have been linked with specific neurodevelopmental processes, with patterns of cognitive ability, and with recovery from focal brain damage. The present study investigated relationships between cerebral asymmetries and recovery from aphasia. Aphasic patients (N = 25) were assessed for language recovery for 1 year poststroke, and linear measurements of cerebral asymmetries were performed on CT scans. Increasing left occipital width asymmetry was associated with faster rate of language recovery and with higher final language scores during the first year poststroke. There was, moreover, a tendency for increasing left occipital width asymmetry to be associated with less initial impairment. It is hypothesized that those aspects of neural organization conferring better premorbid language skills are the same factors conferring greater recovery o

### RAG using a KG

In [443]:
# Function to combine titles and abstracts
def combine_abstracts(top_3_articles):
    combined_text = "".join(
        [f"Title: {data['title']} Abstract: {data['abstract']}" for article_uri, data in top_3_articles]
    )
    return combined_text

# Combine abstracts from the top 3 articles
combined_text = combine_abstracts(top_3_articles)
print(combined_text)


Title: Myoepithelioma of minor salivary gland origin. Light and electron microscopical study. Abstract: A gingival tumor that invaded the anterior maxilla was removed from a 14-year-old boy and studied by light and electron microscopy. The tumor was composed exclusively of myoepithelial cells and appeared to be malignant. By light microscopy, the tumor appeared to be a poorly differentiated epithelial neoplasm of undetermined origin; however, electron microscopical examination showed myoepithelial differentiation, indicative of a salivary gland origin. To our knowledge, the present case represents the only confirmed myoepithelioma that shows features indicative of malignant potential. Myoepitheliomas may be related to mixed tumors of salivary glands.Title: [Feasability study of screening for malignant lesions in the oral cavity targeting tobacco users]. Abstract: INTRODUCTION: Oral cavity cancer is frequent. Prognosis of this cancer is closely linked to the development. Although the or
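
In the combined text above, one abstract runs directly into the next title ("glands.Title:"), and nothing limits the total length sent to the model. A minimal sketch of a variant that separates articles with a blank line and stops before a character budget is exceeded is given below. The function name `combine_articles` and the default budget of 4000 characters are assumptions made for this example.

```python
def combine_articles(articles, max_chars=4000, separator="\n\n"):
    """Join (title, abstract) pairs, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for title, abstract in articles:
        block = f"Title: {title}\nAbstract: {abstract}"
        extra = len(block) + (len(separator) if parts else 0)
        if used + extra > max_chars:
            break  # keep whole articles only; drop the rest
        parts.append(block)
        used += extra
    return separator.join(parts)

print(combine_articles([("First title", "First abstract."), ("Second title", "Second abstract.")]))
```

A character budget is only a rough proxy for the model's token limit, so it should be set conservatively.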

In [60]:
import pandas as pd
import openai

# Set up your OpenAI API key
api_key = "XXX"
openai.api_key = api_key


In [444]:
def generate_summary(combined_text):
    # Note: openai.Completion is the legacy interface and requires openai<1.0
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=f"Summarize the key information here in bullet points. Make it understandable to someone without a medical degree:\n\n{combined_text}",
        max_tokens=1000,
        temperature=0.3
    )
    
    # Get the raw text output
    raw_summary = response.choices[0].text.strip()
    
    # Split the text into lines and clean up whitespace
    lines = raw_summary.split('\n')
    lines = [line.strip() for line in lines if line.strip()]
    
    # Join the lines back together with actual line breaks
    formatted_summary = '\n'.join(lines)
    
    return formatted_summary

# Generate and print the summary
summary = generate_summary(combined_text)
print(summary)


- A 14-year-old boy had a gingival tumor in his anterior maxilla that was removed and studied by light and electron microscopy
- The tumor was made up of myoepithelial cells and appeared to be malignant
- Electron microscopy showed that the tumor originated from a salivary gland
- This is the only confirmed case of a myoepithelioma with features of malignancy
- A feasibility study was conducted to improve early detection of oral cancer and premalignant lesions in a high incidence region
- Tobacco vendors were involved in distributing flyers to invite smokers for free examinations by general practitioners
- 93 patients were included in the study and 27% were referred to a specialist
- 63.6% of those referred actually saw a specialist and 15.3% were confirmed to have a premalignant lesion
- A study found a correlation between increased expression of the protein HuR and the enzyme COX-2 in oral squamous cell carcinoma (OSCC)
- Cytoplasmic HuR expression was associated with COX-2 expressio

### Step 3: use a vector-powered knowledge graph to test data retrieval

Create a column that stores the URI of each article.

In [76]:
# Function to create a valid URI
def create_valid_uri(base_uri, text):
    if pd.isna(text):
        return None
    # Encode text to be used in URI
    sanitized_text = urllib.parse.quote(text.strip().replace(' ', '_').replace('"', '').replace('<', '').replace('>', '').replace("'", "_"))
    return URIRef(f"{base_uri}/{sanitized_text}")

# Add a new column to the DataFrame for the article URIs
df['Article_URI'] = df['Title'].apply(lambda title: create_valid_uri("http://example.org/article", title))
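
`urllib.parse.quote` leaves `/` unescaped by default, so a title containing a slash produces a URI with an extra path segment (one of the article URIs printed earlier contains `ng/ml%3F`). The sketch below passes `safe=''` so that the slash is percent-encoded as well. The helper name `make_uri` is hypothetical, and it returns a plain string rather than an rdflib `URIRef` to stay self-contained.

```python
from urllib.parse import quote, unquote

def make_uri(base_uri: str, text: str) -> str:
    """Build a URI whose last segment is the fully percent-encoded text."""
    slug = text.strip().replace(" ", "_")
    # safe='' also escapes "/", which quote() would otherwise leave as-is.
    return f"{base_uri}/{quote(slug, safe='')}"

uri = make_uri("http://example.org/article", "levels 20 ng/ml?")
print(uri)
print(unquote(uri.rsplit("/", 1)[1]))  # recover the slug from the last segment
```

Changing the encoding changes the URIs, so the graph and the `Article_URI` column would both need to be rebuilt with the same function to stay in sync.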


Test that the URI works: we should be able to find all MeSH terms associated with a given URI.

In [77]:
from rdflib import Graph, Namespace, URIRef

# Assuming your RDF graph (g) is already loaded

# Define namespaces
schema = Namespace('http://schema.org/')
ex = Namespace('http://example.org/')
rdfs = Namespace('http://www.w3.org/2000/01/rdf-schema#')

def get_mesh_terms_for_article(graph, article_uri):
    # Define the SPARQL query using the article URI
    query = f"""
    PREFIX schema: <http://schema.org/>
    PREFIX ex: <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?meshTermLabel
    WHERE {{
      <{article_uri}> a ex:Article ;
               schema:about ?meshTerm .
      ?meshTerm rdfs:label ?meshTermLabel .
    }}
    """
    
    # Execute the query
    results = graph.query(query)
    
    # Extract the MeSH terms from the results
    mesh_terms = [str(row['meshTermLabel']) for row in results]
    
    return mesh_terms

# Example usage with the provided URI
article_uri = "http://example.org/article/Expression_of_p53_and_coexistence_of_HPV_in_premalignant_lesions_and_in_cervical_cancer."
mesh_terms = get_mesh_terms_for_article(g, article_uri)

# Output the results
print(f"MeSH terms associated with the article '{article_uri}':")
for term in mesh_terms:
    print(term)


MeSH terms associated with the article 'http://example.org/article/Expression_of_p53_and_coexistence_of_HPV_in_premalignant_lesions_and_in_cervical_cancer.':
DNA Probes
HPV'
'DNA
Viral'
'Female'
'Humans'
'Immunohistochemistry'
'Papillomaviridae'
'Tumor Suppressor Protein p53'
'Uterine Cervical Dysplasia'
'Uterine Cervical Neoplasms
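
The stray quotes in the output above (`HPV'`, `'DNA`, `Viral'`) suggest that the `meshMajor` strings were split on commas when the graph was built, which breaks terms that contain a comma themselves, such as "DNA Probes, HPV". A minimal sketch of a safer parse follows, assuming each cell holds a stringified Python list; `parse_mesh_major` and the sample value are hypothetical and not part of the original notebook.

```python
import ast

def parse_mesh_major(raw):
    """Parse a stringified Python list of MeSH terms into a real list."""
    try:
        terms = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: return nothing rather than guess
        return []
    if not isinstance(terms, list):
        return []
    return [str(term).strip() for term in terms]

# Hypothetical cell value, shaped like the output above
raw = "['DNA Probes, HPV', 'DNA, Viral', 'Female']"
print(parse_mesh_major(raw))
```

Parsing the literal keeps "DNA Probes, HPV" as one term instead of two fragments.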


Create the schema for the new embedding, which includes the MeSH tags and the article URI

In [181]:
class_obj = {
    # Class definition
    "class": "articles_with_abstracts_and_URIs",

    # Property definitions
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
        },
        {
            "name": "abstractText",
            "dataType": ["text"],
        },
        {
            "name": "meshMajor",
            "dataType": ["text"],
        },
        {
            "name": "article_URI",
            "dataType": ["text"],
        },
    ],

    # Specify a vectorizer
    "vectorizer": "text2vec-openai",

    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": True,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        },
        "qna-openai": {
          "model": "gpt-3.5-turbo-instruct"
        },
        "generative-openai": {
          "model": "gpt-3.5-turbo"
        }
    },
}

In [182]:
client.schema.create_class(class_obj)

In [183]:
import logging
import numpy as np

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')

# Replace infinity values with NaN and then fill NaN values
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna('', inplace=True)

# Convert columns to string type
df['Title'] = df['Title'].astype(str)
df['abstractText'] = df['abstractText'].astype(str)
df['meshMajor'] = df['meshMajor'].astype(str)
df['Article_URI'] = df['Article_URI'].astype(str)


# Log the data types
logging.info(f"Title column type: {df['Title'].dtype}")
logging.info(f"abstractText column type: {df['abstractText'].dtype}")
logging.info(f"meshMajor column type: {df['meshMajor'].dtype}")
logging.info(f"Article_URI column type: {df['Article_URI'].dtype}")


with client.batch(
    batch_size=10,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    for index, row in df.iterrows():
        try:
            question_object = {
                "title": row.Title,
                "abstractText": row.abstractText,
                "meshMajor": row.meshMajor,
                "article_URI": row.Article_URI,
            }
            batch.add_data_object(
                question_object,
                class_name="articles_with_abstracts_and_URIs",
                uuid=generate_uuid5(question_object)
            )
        except Exception as e:
            logging.error(f"Error processing row {index}: {e}")


2024-08-12 20:50:55,063 INFO Title column type: object
2024-08-12 20:50:55,064 INFO abstractText column type: object
2024-08-12 20:50:55,064 INFO meshMajor column type: object
2024-08-12 20:50:55,065 INFO Article_URI column type: object
            Use the `client.batch.configure()` method to configure your batch process, and `client.batch` to enter the context manager.

            See https://weaviate.io/developers/weaviate/client-libraries/python for details.
[ERROR] Batch ConnectionError Exception occurred! Retrying in 2s. [1/3]
[ERROR] Batch ConnectionError Exception occurred! Retrying in 2s. [1/3]
[ERROR] Batch SSLError Exception occurred! Retrying in 2s. [1/3]
[ERROR] Batch SSLError Exception occurred! Retrying in 2s. [1/3]
[ERROR] Batch SSLError Exception occurred! Retrying in 2s. [1/3]
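
The retries logged above mean the same batch may be sent more than once. Giving each object an ID derived from its content, as the import loop does with `generate_uuid5`, should let a resend overwrite the earlier copy instead of duplicating it. The sketch below illustrates that idea with the standard library's `uuid.uuid5`; `deterministic_id` is a hypothetical stand-in and is not Weaviate's implementation.

```python
import uuid

def deterministic_id(obj):
    """Derive a stable UUID from an object's content (illustrative only)."""
    # Sort the items so key order does not change the result
    canonical = repr(sorted(obj.items()))
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))

first = deterministic_id({"title": "T", "abstractText": "A"})
again = deterministic_id({"abstractText": "A", "title": "T"})
other = deterministic_id({"title": "T2", "abstractText": "A"})

print(first == again)  # same content, same ID
print(first == other)  # different content, different ID
```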


### Semantic search with vectorized KG

In [446]:
response = (
    client.query
    .get("articles_with_abstracts_and_URIs", ["title","abstractText","meshMajor","article_URI"])
    .with_additional(["id"])
    .with_near_text({"concepts": ["mouth neoplasms"]})
    .with_limit(10)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Articles_with_abstracts_and_URIs": [
                {
                    "_additional": {
                        "id": "37b695c4-5b80-5f44-a710-e84abb46bc22"
                    },
                    "abstractText": "INTRODUCTION: Metastatic malignant mesothelioma to the oral cavity is extremely rare. They are more common in the jaw bones than the soft tissue. Occurrence of the malignant disease typically carries an average survival rate of 9-12 monthsMETHODS: : Thirteen patients underwent neoadjuvant chemotherapy and radical pleurectomy decortication, followed by radiotherapy from August 2012 to September 2013. Patients were followed up with computed tomography of the chest and the abdomen every 3 months. All patients were followed up until February 2014.RESULTS: In January 2014, 11 patients were still alive with a median survival of 11 months, eight patients developed a recurrence and two patients died at 8 and 9 months after surgery.

In [447]:
# Extract article URIs
article_uris = [article["article_URI"] for article in response["data"]["Get"]["Articles_with_abstracts_and_URIs"]]
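
The list comprehension above assumes the query succeeded. A slightly more defensive sketch is shown below; it assumes that a failed query returns a dictionary with an `errors` key and that the class key may be missing or `None`. `extract_article_uris` is a hypothetical helper, not part of the Weaviate client.

```python
def extract_article_uris(response, class_name="Articles_with_abstracts_and_URIs"):
    """Return article_URI values from a Get response, or [] if the query failed."""
    if "errors" in response:
        return []
    items = response.get("data", {}).get("Get", {}).get(class_name) or []
    # Skip objects that have no article_URI value
    return [item["article_URI"] for item in items if item.get("article_URI")]

# Hypothetical response, shaped like the output above
sample = {
    "data": {
        "Get": {
            "Articles_with_abstracts_and_URIs": [
                {"article_URI": "http://example.org/article/A"},
                {"title": "object without a URI"},
            ]
        }
    }
}
print(extract_article_uris(sample))
```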

In [448]:
article_uris

['http://example.org/article/Gingival_metastasis_as_first_sign_of_multiorgan_dissemination_of_epithelioid_malignant_mesothelioma.',
 'http://example.org/article/Angiocentric_Centrofacial_Lymphoma_as_a_Challenging_Diagnosis_in_an_Elderly_Man.',
 'http://example.org/article/Mandibular_pseudocarcinomatous_hyperplasia.',
 'http://example.org/article/Metastatic_neuroblastoma_in_the_mandible._Report_of_a_case.',
 'http://example.org/article/Malignant_fibrous_histiocytoma_of_the_pharynx.',
 'http://example.org/article/%5BChondroma_adjacent_to_Meckel_s_cave_mimicking_a_fifth_cranial_nerve_neurinoma._A_case_report%5D.',
 'http://example.org/article/Histogenesis_and_Morphology_of_periosteal_sarcomas_induced_by_FBJ_virus_in_NIH_Swiss_mice.',
 'http://example.org/article/Myoepithelioma_of_minor_salivary_gland_origin._Light_and_electron_microscopical_study.',
 'http://example.org/article/March_2000%3A_5_month_old_boy_with_occipital_bone_mass.',
 'http://example.org/article/Disease-on-a-chip%3A_mimi

In [450]:
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS, XSD

# Define namespaces
schema = Namespace('http://schema.org/')
ex = Namespace('http://example.org/')
rdfs = Namespace('http://www.w3.org/2000/01/rdf-schema#')
xsd = Namespace('http://www.w3.org/2001/XMLSchema#')

def get_articles_after_date(graph, article_uris, date_cutoff):
    # Create a dictionary to store results for each URI
    results_dict = {}

    # Define the SPARQL query using a list of article URIs and a date filter
    uris_str = " ".join(f"<{uri}>" for uri in article_uris)
    query = f"""
    PREFIX schema: <http://schema.org/>
    PREFIX ex: <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?article ?title ?datePublished
    WHERE {{
      VALUES ?article {{ {uris_str} }}
      
      ?article a ex:Article ;
               schema:name ?title ;
               schema:datePublished ?datePublished .
      
      FILTER (?datePublished > "{date_cutoff}"^^xsd:date)
    }}
    """
    
    # Execute the query
    results = graph.query(query)
    
    # Extract the details for each article
    for row in results:
        article_uri = str(row['article'])
        results_dict[article_uri] = {
            'title': str(row['title']),
            'date_published': str(row['datePublished'])
        }
    
    return results_dict

date_cutoff = "2023-01-01"
articles_after_date = get_articles_after_date(g, article_uris, date_cutoff)

# Output the results
for uri, details in articles_after_date.items():
    print(f"Article URI: {uri}")
    print(f"Title: {details['title']}")
    print(f"Date Published: {details['date_published']}")
    print()


Article URI: http://example.org/article/Angiocentric_Centrofacial_Lymphoma_as_a_Challenging_Diagnosis_in_an_Elderly_Man.
Title: Angiocentric Centrofacial Lymphoma as a Challenging Diagnosis in an Elderly Man.
Date Published: 2024-07-28

Article URI: http://example.org/article/Mandibular_pseudocarcinomatous_hyperplasia.
Title: Mandibular pseudocarcinomatous hyperplasia.
Date Published: 2023-04-15

Article URI: http://example.org/article/%5BChondroma_adjacent_to_Meckel_s_cave_mimicking_a_fifth_cranial_nerve_neurinoma._A_case_report%5D.
Title: [Chondroma adjacent to Meckel's cave mimicking a fifth cranial nerve neurinoma. A case report].
Date Published: 2023-08-16

Article URI: http://example.org/article/Myoepithelioma_of_minor_salivary_gland_origin._Light_and_electron_microscopical_study.
Title: Myoepithelioma of minor salivary gland origin. Light and electron microscopical study.
Date Published: 2023-05-09

Article URI: http://example.org/article/March_2000%3A_5_month_old_boy_with_occip
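
Because the query in `get_articles_after_date` is assembled by string interpolation, it can help to check the inputs before they reach SPARQL. The sketch below uses only the standard library; `build_values_clause` and `validated_cutoff` are hypothetical helpers, and the URIs are placeholders.

```python
from datetime import date

def build_values_clause(article_uris):
    """Format URIs the way the VALUES clause expects: <uri1> <uri2> ..."""
    return " ".join(f"<{uri}>" for uri in article_uris)

def validated_cutoff(date_cutoff):
    """Return the date as YYYY-MM-DD, raising ValueError if it is not a valid date."""
    return date.fromisoformat(date_cutoff).isoformat()

uris = ["http://example.org/article/A", "http://example.org/article/B"]
print(build_values_clause(uris))
print(validated_cutoff("2023-01-01"))
```

Failing early in Python on a bad date gives a clear `ValueError` instead of a query that errors or quietly matches nothing.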

### Similarity search with vectorized KG

In [475]:
response = (
    client.query
    .get("articles_with_abstracts_and_URIs", ["title","abstractText","meshMajor","article_URI"])
    .with_near_object({
        "id": "37b695c4-5b80-5f44-a710-e84abb46bc22"
    })
    .with_limit(50)
    .with_additional(["distance"])
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "Articles_with_abstracts_and_URIs": [
        {
          "_additional": {
            "distance": 1.1920929e-07
          },
          "abstractText": "INTRODUCTION: Metastatic malignant mesothelioma to the oral cavity is extremely rare. They are more common in the jaw bones than the soft tissue. Occurrence of the malignant disease typically carries an average survival rate of 9-12 monthsMETHODS: : Thirteen patients underwent neoadjuvant chemotherapy and radical pleurectomy decortication, followed by radiotherapy from August 2012 to September 2013. Patients were followed up with computed tomography of the chest and the abdomen every 3 months. All patients were followed up until February 2014.RESULTS: In January 2014, 11 patients were still alive with a median survival of 11 months, eight patients developed a recurrence and two patients died at 8 and 9 months after surgery. After 1 year from macroscopic radical pleurectomy decortication, a 68-year-old m

In [478]:
# Assuming response is the data structure with your articles
article_uris = [URIRef(article["article_URI"]) for article in response["data"]["Get"]["Articles_with_abstracts_and_URIs"]]

In [481]:
from rdflib import URIRef

# Constructing the SPARQL query with a FILTER for the article URIs
query = """
PREFIX schema: <http://schema.org/>
PREFIX ex: <http://example.org/>

SELECT ?article ?title ?abstract ?datePublished ?access ?meshTerm
WHERE {
  ?article a ex:Article ;
           schema:name ?title ;
           schema:description ?abstract ;
           schema:datePublished ?datePublished ;
           ex:access ?access ;
           schema:about ?meshTerm .

  ?meshTerm a ex:MeSHTerm .

  # Filter to include only articles from the list of URIs
  FILTER (?article IN (%s))
}
"""


# Convert the list of URIRefs into a string suitable for SPARQL
article_uris_string = ", ".join([f"<{str(uri)}>" for uri in article_uris])

# Insert the article URIs into the query
query = query % article_uris_string

# Dictionary to store articles and their associated MeSH terms
article_data = {}

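# Run the query once per MeSH term, binding ?meshTerm each time
# (`mesh_terms` is assumed to be a list of MeSH term URIRefs defined in an earlier cell)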
for mesh_term in mesh_terms:
    results = g.query(query, initBindings={'meshTerm': mesh_term})

    # Process results
    for row in results:
        article_uri = row['article']

        if article_uri not in article_data:
            article_data[article_uri] = {
                'title': row['title'],
                'abstract': row['abstract'],
                'datePublished': row['datePublished'],
                'access': row['access'],
                'meshTerms': set()
            }

        # Add the MeSH term to the set for this article
        article_data[article_uri]['meshTerms'].add(str(row['meshTerm']))

# Rank articles by the number of matching MeSH terms
ranked_articles = sorted(
    article_data.items(),
    key=lambda item: len(item[1]['meshTerms']),
    reverse=True
)


# Output results
for article_uri, data in ranked_articles:
    print(f"Title: {data['title']}")
    print(f"Abstract: {data['abstract']}")
    print("MeSH Terms:")
    for mesh_term in data['meshTerms']:
        print(f"  - {mesh_term}")
    print()


Title: Myoepithelioma of minor salivary gland origin. Light and electron microscopical study.
Abstract: A gingival tumor that invaded the anterior maxilla was removed from a 14-year-old boy and studied by light and electron microscopy. The tumor was composed exclusively of myoepithelial cells and appeared to be malignant. By light microscopy, the tumor appeared to be a poorly differentiated epithelial neoplasm of undetermined origin; however, electron microscopical examination showed myoepithelial differentiation, indicative of a salivary gland origin. To our knowledge, the present case represents the only confirmed myoepithelioma that shows features indicative of malignant potential. Myoepitheliomas may be related to mixed tumors of salivary glands.
MeSH Terms:
  - http://example.org/mesh/_Gingival_Neoplasms_
  - http://example.org/mesh/_Salivary_Gland_Neoplasms_

Title: [Feasability study of screening for malignant lesions in the oral cavity targeting tobacco users].
Abstract: INTROD
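
The ranking step above boils down to counting how many of the target MeSH terms each article matches. Here is a minimal sketch of that logic on toy data, with no graph involved; `rank_by_mesh_overlap` and the `ex:`/`mesh:` identifiers are hypothetical.

```python
def rank_by_mesh_overlap(rows):
    """rows: iterable of (article_uri, mesh_term) pairs, one per query match."""
    matches = {}
    for article_uri, mesh_term in rows:
        # A set ignores the same term being reported twice
        matches.setdefault(article_uri, set()).add(mesh_term)
    # Most matching terms first
    return sorted(matches.items(), key=lambda item: len(item[1]), reverse=True)

rows = [
    ("ex:a1", "mesh:Gingival_Neoplasms"),
    ("ex:a2", "mesh:Mouth_Neoplasms"),
    ("ex:a1", "mesh:Salivary_Gland_Neoplasms"),
    ("ex:a1", "mesh:Gingival_Neoplasms"),
]
ranked = rank_by_mesh_overlap(rows)
for uri, terms in ranked:
    print(uri, len(terms))
```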

In [483]:
# List to store the URIs of the ranked articles
ranked_article_uris = [URIRef(article_uri) for article_uri, data in ranked_articles]

In [486]:
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS, XSD, SKOS

# Assuming your RDF graph (g) is already loaded

# Define namespaces
schema = Namespace('http://schema.org/')
ex = Namespace('http://example.org/')
rdfs = Namespace('http://www.w3.org/2000/01/rdf-schema#')

def filter_articles_by_access(graph, article_uris, access_values):
    # Construct the SPARQL query with a dynamic VALUES clause
    uris_str = " ".join(f"<{uri}>" for uri in article_uris)
    query = f"""
    PREFIX schema: <http://schema.org/>
    PREFIX ex: <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?article ?title ?abstract ?datePublished ?access ?meshTermLabel
    WHERE {{
      VALUES ?article {{ {uris_str} }}
      
      ?article a ex:Article ;
               schema:name ?title ;
               schema:description ?abstract ;
               schema:datePublished ?datePublished ;
               ex:access ?access ;
               schema:about ?meshTerm .
      ?meshTerm rdfs:label ?meshTermLabel .
      
      FILTER (?access IN ({", ".join(map(str, access_values))}))
    }}
    """
    
    # Execute the query
    results = graph.query(query)
    
    # Extract the details for each article
    results_dict = {}
    for row in results:
        article_uri = str(row['article'])
        if article_uri not in results_dict:
            results_dict[article_uri] = {
                'title': str(row['title']),
                'abstract': str(row['abstract']),
                'date_published': str(row['datePublished']),
                'access': str(row['access']),
                'mesh_terms': []
            }
        results_dict[article_uri]['mesh_terms'].append(str(row['meshTermLabel']))
    
    return results_dict

access_values = [3,5,7]
filtered_articles = filter_articles_by_access(g, ranked_article_uris, access_values)

# Output the results
for uri, details in filtered_articles.items():
    print(f"Article URI: {uri}")
    print(f"Title: {details['title']}")
    print(f"Abstract: {details['abstract']}")
    print(f"Date Published: {details['date_published']}")
    print(f"Access: {details['access']}")
    print()


Article URI: http://example.org/article/Myoepithelioma_of_minor_salivary_gland_origin._Light_and_electron_microscopical_study.
Title: Myoepithelioma of minor salivary gland origin. Light and electron microscopical study.
Abstract: A gingival tumor that invaded the anterior maxilla was removed from a 14-year-old boy and studied by light and electron microscopy. The tumor was composed exclusively of myoepithelial cells and appeared to be malignant. By light microscopy, the tumor appeared to be a poorly differentiated epithelial neoplasm of undetermined origin; however, electron microscopical examination showed myoepithelial differentiation, indicative of a salivary gland origin. To our knowledge, the present case represents the only confirmed myoepithelioma that shows features indicative of malignant potential. Myoepitheliomas may be related to mixed tumors of salivary glands.
Date Published: 2023-05-09
Access: 5

Article URI: http://example.org/article/%5BFeasability_study_of_screening_

### RAG with a vectorized KG

In [488]:
response = (
    client.query
    .get("Articles_with_abstracts_and_URIs", ["title", "abstractText", "article_URI", "meshMajor"])
    .with_near_text({"concepts": ["therapies for mouth neoplasms"]})
    .with_limit(3)
    .with_generate(grouped_task="Summarize the key information here in bullet points. Make it understandable to someone without a medical degree.")
    .do()
)

print(response["data"]["Get"]["Articles_with_abstracts_and_URIs"][0]["_additional"]["generate"]["groupedResult"])

- Metastatic malignant mesothelioma to the oral cavity is rare, with an average survival rate of 9-12 months.
- Neoadjuvant chemotherapy and radical pleurectomy decortication followed by radiotherapy were used in 13 patients from August 2012 to September 2013.
- In January 2014, 11 patients were still alive with a median survival of 11 months, while 8 patients had a recurrence and 2 patients died at 8 and 9 months after surgery.
- A 68-year-old man had a gingival mass that turned out to be a metastatic deposit of malignant mesothelioma, leading to multiorgan recurrence.
- Biopsy is important for new growing lesions, even in uncommon sites, when there is a history of mesothelioma.

- Neoadjuvant radiochemotherapy for locally advanced rectal carcinoma can be effective, but some patients may not respond well.
- Genetic alterations may be associated with sensitivity or resistance to neoadjuvant therapy in rectal cancer.
- Losses of chromosomes 1p, 8p, 17p, and 18q, and gains of 1q and 13q 

In [489]:
# Extract article URIs
article_uris = [article["article_URI"] for article in response["data"]["Get"]["Articles_with_abstracts_and_URIs"]]

# Function to filter the response for only the given URIs
def filter_articles_by_uri(response, article_uris):
    filtered_articles = []
    
    articles = response['data']['Get']['Articles_with_abstracts_and_URIs']
    for article in articles:
        if article['article_URI'] in article_uris:
            filtered_articles.append(article)
    
    return filtered_articles

# Filter the response
filtered_articles = filter_articles_by_uri(response, article_uris)

# Output the filtered articles
print("Filtered articles:")
for article in filtered_articles:
    print(f"Title: {article['title']}")
    print(f"URI: {article['article_URI']}")
    print(f"Abstract: {article['abstractText']}")
    print(f"MeshMajor: {article['meshMajor']}")
    print("---")

Filtered articles:
Title: Gingival metastasis as first sign of multiorgan dissemination of epithelioid malignant mesothelioma.
URI: http://example.org/article/Gingival_metastasis_as_first_sign_of_multiorgan_dissemination_of_epithelioid_malignant_mesothelioma.
Abstract: INTRODUCTION: Metastatic malignant mesothelioma to the oral cavity is extremely rare. They are more common in the jaw bones than the soft tissue. Occurrence of the malignant disease typically carries an average survival rate of 9-12 monthsMETHODS: : Thirteen patients underwent neoadjuvant chemotherapy and radical pleurectomy decortication, followed by radiotherapy from August 2012 to September 2013. Patients were followed up with computed tomography of the chest and the abdomen every 3 months. All patients were followed up until February 2014.RESULTS: In January 2014, 11 patients were still alive with a median survival of 11 months, eight patients developed a recurrence and two patients died at 8 and 9 months after surge

### RAG on a vectorized knowledge graph with filters

In [492]:
response = (
    client.query
    .get("articles_with_abstracts_and_URIs", ["title", "abstractText", "meshMajor", "article_URI"])
    .with_additional(["id"])
    .with_near_text({"concepts": ["therapies for mouth neoplasms"]})
    .with_limit(20)
    .do()
)

# Assuming response is the data structure with your articles
article_uris = [URIRef(article["article_URI"]) for article in response["data"]["Get"]["Articles_with_abstracts_and_URIs"]]

In [493]:
from rdflib import URIRef

# Constructing the SPARQL query with a FILTER for the article URIs
query = """
PREFIX schema: <http://schema.org/>
PREFIX ex: <http://example.org/>

SELECT ?article ?title ?abstract ?datePublished ?access ?meshTerm
WHERE {
  ?article a ex:Article ;
           schema:name ?title ;
           schema:description ?abstract ;
           schema:datePublished ?datePublished ;
           ex:access ?access ;
           schema:about ?meshTerm .

  ?meshTerm a ex:MeSHTerm .

  # Filter to include only articles from the list of URIs
  FILTER (?article IN (%s))
}
"""


# Convert the list of URIRefs into a string suitable for SPARQL
article_uris_string = ", ".join([f"<{str(uri)}>" for uri in article_uris])

# Insert the article URIs into the query
query = query % article_uris_string

# Dictionary to store articles and their associated MeSH terms
article_data = {}

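# Run the query once per MeSH term, binding ?meshTerm each time
# (`mesh_terms` is assumed to be a list of MeSH term URIRefs defined in an earlier cell)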
for mesh_term in mesh_terms:
    results = g.query(query, initBindings={'meshTerm': mesh_term})

    # Process results
    for row in results:
        article_uri = row['article']

        if article_uri not in article_data:
            article_data[article_uri] = {
                'title': row['title'],
                'abstract': row['abstract'],
                'datePublished': row['datePublished'],
                'access': row['access'],
                'meshTerms': set()
            }

        # Add the MeSH term to the set for this article
        article_data[article_uri]['meshTerms'].add(str(row['meshTerm']))

# Rank articles by the number of matching MeSH terms
ranked_articles = sorted(
    article_data.items(),
    key=lambda item: len(item[1]['meshTerms']),
    reverse=True
)


# Output results
for article_uri, data in ranked_articles:
    print(f"Title: {data['title']}")
    print(f"Abstract: {data['abstract']}")
    print("MeSH Terms:")
    for mesh_term in data['meshTerms']:
        print(f"  - {mesh_term}")
    print()


Title: [Feasability study of screening for malignant lesions in the oral cavity targeting tobacco users].
Abstract: INTRODUCTION: Oral cavity cancer is frequent. Prognosis of this cancer is closely linked to the development. Although the oral cavity is a potentially accessible site for examination, up to 50% of oral cancers are not detected until the disease is well advanced.PATIENTS AND METHOD: In a region where incidence rate is particularly high, local teams involved in screening, in epidemiological survey, in diagnosis and treatment of oral cancer performed a pilot feasibility study to improve strategy of early detection of oral cancer and premalignant lesion. Tobacco venders were solicited to distribute a flyer, which invite smokers to a free examination by general practitioner. General practitioners were invited to examine smokers, and to fill a predeterminate systematic oral cavity examination record during 3 months. They were asked to refer to a specialist if there was a potent

In [495]:
# Function to combine titles and abstracts
def combine_abstracts(ranked_articles):
    combined_text = "".join(
        [f"Title: {data['title']} Abstract: {data['abstract']}" for article_uri, data in ranked_articles]
    )
    return combined_text

# Combine titles and abstracts from all ranked articles
combined_text = combine_abstracts(ranked_articles)
print(combined_text)


Title: [Feasability study of screening for malignant lesions in the oral cavity targeting tobacco users]. Abstract: INTRODUCTION: Oral cavity cancer is frequent. Prognosis of this cancer is closely linked to the development. Although the oral cavity is a potentially accessible site for examination, up to 50% of oral cancers are not detected until the disease is well advanced.PATIENTS AND METHOD: In a region where incidence rate is particularly high, local teams involved in screening, in epidemiological survey, in diagnosis and treatment of oral cancer performed a pilot feasibility study to improve strategy of early detection of oral cancer and premalignant lesion. Tobacco venders were solicited to distribute a flyer, which invite smokers to a free examination by general practitioner. General practitioners were invited to examine smokers, and to fill a predeterminate systematic oral cavity examination record during 3 months. They were asked to refer to a specialist if there was a potent

In [497]:
# Function to combine titles and abstracts into one chunk of text
def combine_abstracts(ranked_articles):
    combined_text = " ".join(
        [f"Title: {data['title']} Abstract: {data['abstract']}" for _, data in ranked_articles]
    )
    return combined_text

# Combine abstracts from the filtered articles
combined_text = combine_abstracts(ranked_articles)
print(combined_text)


Title: [Feasability study of screening for malignant lesions in the oral cavity targeting tobacco users]. Abstract: INTRODUCTION: Oral cavity cancer is frequent. Prognosis of this cancer is closely linked to the development. Although the oral cavity is a potentially accessible site for examination, up to 50% of oral cancers are not detected until the disease is well advanced.PATIENTS AND METHOD: In a region where incidence rate is particularly high, local teams involved in screening, in epidemiological survey, in diagnosis and treatment of oral cancer performed a pilot feasibility study to improve strategy of early detection of oral cancer and premalignant lesion. Tobacco venders were solicited to distribute a flyer, which invite smokers to a free examination by general practitioner. General practitioners were invited to examine smokers, and to fill a predeterminate systematic oral cavity examination record during 3 months. They were asked to refer to a specialist if there was a potent
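
`combine_abstracts` joins every ranked article, so the prompt grows with the result set. If only the best matches should reach the summarizer, a variant like the one below could cap both the number of articles and the total length. This is a sketch: `combine_top_abstracts`, `top_n`, and `max_chars` are hypothetical names, and the defaults are arbitrary.

```python
def combine_top_abstracts(ranked_articles, top_n=3, max_chars=4000):
    """Join the titles and abstracts of the first top_n articles, capped at max_chars."""
    parts = [
        f"Title: {data['title']} Abstract: {data['abstract']}"
        for _, data in ranked_articles[:top_n]
    ]
    return " ".join(parts)[:max_chars]

# Toy input with the same shape as ranked_articles: (uri, data) pairs
toy_ranked = [
    (f"ex:a{i}", {"title": f"T{i}", "abstract": f"A{i}"})
    for i in range(1, 5)
]
print(combine_top_abstracts(toy_ranked))
```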

In [498]:
# Generate and print the summary
summary = generate_summary(combined_text)
print(summary)


- Oral cavity cancer is common and often not detected until it is advanced
- A feasibility study was conducted to improve early detection of oral cancer and premalignant lesions in a high-risk region
- Tobacco vendors were involved in distributing flyers to smokers for free examinations by general practitioners
- 93 patients were included in the study, with 27% being referred to a specialist
- 63.6% of referred patients actually saw a specialist, with 15.3% being diagnosed with a premalignant lesion
- Photodynamic therapy (PDT) was studied as an experimental cancer therapy in rats with chemically-induced premalignant lesions and squamous cell carcinoma of the palatal mucosa
- PDT was performed using Photofrin and two different activation wavelengths, with better results seen in the 514.5 nm group
- Gingival metastasis from malignant mesothelioma is extremely rare, with a low survival rate
- A case study showed a patient with a gingival mass as the first sign of multiorgan recurrence of