# Europe PMC Annotations API Tutorial

This notebook demonstrates how to use PyEuropePMC's **AnnotationsClient** to retrieve and parse text-mining annotations from scientific literature.

## What are Annotations?

The Europe PMC Annotations API provides access to:
- **Entity annotations**: Mentions of genes, diseases, chemicals, organisms, etc.
- **Sentence annotations**: Contextual text containing entity mentions
- **Relationship annotations**: Associations between entities (e.g., gene-disease relationships)

All annotations follow the [W3C Open Annotation Data Model](https://www.w3.org/TR/annotation-model/) and are provided in JSON-LD format.
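
To make the data model concrete, the sketch below builds one annotation by hand with field names from the W3C vocabulary (`body`, `target`, and a `TextQuoteSelector` carrying `prefix`, `exact`, and `suffix`) and reads it back. The record and its URLs are illustrative only; the exact shape returned by Europe PMC may differ. `describe_annotation` is a hypothetical helper, not part of PyEuropePMC.

```python
import json

# Hand-written example in the style of the W3C annotation model.
# Assumption: the values are illustrative, not a real Europe PMC response.
sample_annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": "http://purl.obolibrary.org/obo/CHEBI_16236",
    "target": {
        "source": "http://europepmc.org/articles/PMC3359311",
        "selector": {
            "type": "TextQuoteSelector",
            "prefix": "cells were fixed in ",
            "exact": "ethanol",
            "suffix": " for ten minutes",
        },
    },
}


def describe_annotation(annotation: dict) -> str:
    """Return a one-line summary: quoted text, its source, and the body URI."""
    target = annotation.get("target", {})
    selector = target.get("selector", {})
    exact = selector.get("exact", "")
    return f"'{exact}' in {target.get('source', '?')} -> {annotation.get('body', '?')}"


# JSON-LD is plain JSON, so a round trip through the json module is lossless here.
round_tripped = json.loads(json.dumps(sample_annotation))
print(describe_annotation(round_tripped))
```

The `exact` string is the annotated text itself, while `prefix` and `suffix` give the surrounding context used to locate it in the article.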

## Setup and Installation

First, confirm that PyEuropePMC is installed by importing it and printing its version:

In [None]:
import pyeuropepmc
from pyeuropepmc.cache.cache import CacheConfig

print(f"PyEuropePMC version: {pyeuropepmc.__version__}")

## Initialize the AnnotationsClient

We'll create a client with caching enabled for better performance:

In [None]:
# Create client with caching
cache_config = CacheConfig(enabled=True, ttl=3600)  # Cache for 1 hour
client = pyeuropepmc.AnnotationsClient(cache_config=cache_config)

print("✓ AnnotationsClient initialized successfully")

## Example 1: Retrieve Annotations by Article IDs

Let's fetch annotations for specific PMC articles:

In [None]:
# Fetch annotations for sample articles
article_ids = ["PMC3359311", "PMC3359312"]

annotations = client.get_annotations_by_article_ids(
    article_ids=article_ids,
    section="abstract",  # Options: 'all', 'abstract', 'fulltext'
    format="JSON-LD"
)

print(f"Total annotations: {annotations.get('totalCount', 0)}")
print(f"Annotations in response: {len(annotations.get('annotations', []))}")
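
A quick format check can catch typos in identifiers before a request is sent. The sketch below assumes PMC-style IDs like the ones used above (`PMC` followed by digits); `split_valid_pmcids` is a hypothetical helper and does not check whether an article actually exists. Whether the client accepts other identifier styles is not covered here.

```python
import re

# Assumption: only PMC-style identifiers ("PMC" followed by digits) are checked.
PMCID_PATTERN = re.compile(r"^PMC\d+$")


def split_valid_pmcids(ids):
    """Partition IDs into (valid, invalid) by format only; no network check."""
    valid, invalid = [], []
    for raw in ids:
        candidate = raw.strip().upper()  # tolerate stray spaces and lowercase
        (valid if PMCID_PATTERN.match(candidate) else invalid).append(candidate)
    return valid, invalid


valid, invalid = split_valid_pmcids(["PMC3359311", " pmc3359312 ", "3359313", "PMC12x"])
print("valid:", valid)
print("invalid:", invalid)
```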

## Example 2: Parse Annotations

Use the `parse_annotations` function to extract structured data from the JSON-LD response:

In [None]:
from pyeuropepmc.processing.annotation_parser import parse_annotations

# Parse the annotations
parsed = parse_annotations(annotations)

print(f"Entities found: {len(parsed['entities'])}")
print(f"Sentences found: {len(parsed['sentences'])}")
print(f"Relationships found: {len(parsed['relationships'])}")

# Display metadata
print("\nMetadata:")
for key, value in parsed['metadata'].items():
    print(f"  {key}: {value}")

## Example 3: Explore Entity Annotations

Let's examine the entities found in the text:

In [None]:
import pandas as pd

# Convert entities to DataFrame for easier exploration
if parsed['entities']:
    entities_df = pd.DataFrame(parsed['entities'])
    
    print("First 10 entities:")
    display(entities_df[['name', 'type', 'exact', 'section']].head(10))
    
    # Count entities by type
    print("\nEntity type distribution:")
    entity_counts = entities_df['type'].value_counts()
    display(entity_counts)
else:
    print("No entities found in the annotations")
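
Because the same entity is often mentioned several times, it can help to collapse mentions into one row per entity. The sketch below uses hand-written rows with the same column names as the cell above (`name`, `type`, `exact`, `section`); the values and type labels are illustrative, not real API output.

```python
import pandas as pd

# Assumption: illustrative rows that mimic the columns used in the cell above.
sample_entities = [
    {"name": "ethanol", "type": "Chemical", "exact": "ethanol", "section": "Abstract"},
    {"name": "ethanol", "type": "Chemical", "exact": "EtOH", "section": "Methods"},
    {"name": "TP53", "type": "Gene", "exact": "p53", "section": "Abstract"},
]
sample_df = pd.DataFrame(sample_entities)

# One row per distinct entity: number of mentions and the surface forms seen.
summary = (
    sample_df.groupby(["name", "type"])
    .agg(
        mentions=("exact", "size"),
        surface_forms=("exact", lambda s: ", ".join(sorted(set(s)))),
    )
    .reset_index()
    .sort_values("mentions", ascending=False)
)
print(summary)
```

Grouping on both `name` and `type` keeps apart entities that share a name but belong to different categories.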

## Example 4: Retrieve Annotations for Specific Entities

Retrieve annotations that mention a specific entity (e.g., a chemical compound), one page at a time:

In [None]:
# Search for ethanol (CHEBI:16236) mentions
entity_id = "CHEBI:16236"
entity_type = "CHEBI"

entity_annotations = client.get_annotations_by_entity(
    entity_id=entity_id,
    entity_type=entity_type,
    section="all",
    page=1,
    page_size=20
)

print(f"Total annotations for {entity_id}: {entity_annotations.get('totalCount', 0)}")

## Example 5: Filter by Annotation Provider

Get annotations from a specific provider:

In [None]:
# Fetch Europe PMC annotations
provider_annotations = client.get_annotations_by_provider(
    provider="Europe PMC",
    article_ids=["PMC3359311"],
    annotation_type="Disease"
)

print(f"Europe PMC disease annotations: {provider_annotations.get('totalCount', 0)}")

## Example 6: Extract Relationships

Find relationships between entities (e.g., gene-disease associations):

In [None]:
from pyeuropepmc.processing.annotation_parser import extract_relationships

# Extract relationships from annotations
if annotations.get('annotations'):
    relationships = extract_relationships(annotations['annotations'])
    
    print(f"Relationships found: {len(relationships)}")
    
    if relationships:
        # Display first few relationships
        print("\nFirst 5 relationships:")
        for i, rel in enumerate(relationships[:5], 1):
            subj = rel['subject']
            pred = rel['predicate']
            obj = rel['object']
            print(f"{i}. {subj['name']} ({subj['type']}) {pred} {obj['name']} ({obj['type']})")
else:
    print("No annotations available to extract relationships from")
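
Once relationships are extracted, counting them by entity types and predicate gives a quick overview. The sketch below uses hand-written records with the same keys the loop above reads (`subject`, `predicate`, `object`); the values are illustrative, and `count_typed_pairs` is a hypothetical helper.

```python
from collections import Counter

# Assumption: illustrative records shaped like the ones printed above.
sample_relationships = [
    {"subject": {"name": "BRCA1", "type": "Gene"}, "predicate": "associated_with",
     "object": {"name": "breast cancer", "type": "Disease"}},
    {"subject": {"name": "TP53", "type": "Gene"}, "predicate": "associated_with",
     "object": {"name": "lung cancer", "type": "Disease"}},
    {"subject": {"name": "aspirin", "type": "Chemical"}, "predicate": "treats",
     "object": {"name": "headache", "type": "Disease"}},
]


def count_typed_pairs(relationships):
    """Count relationships by (subject type, predicate, object type)."""
    return Counter(
        (r["subject"]["type"], r["predicate"], r["object"]["type"])
        for r in relationships
    )


for (subj_type, pred, obj_type), n in count_typed_pairs(sample_relationships).most_common():
    print(f"{subj_type} {pred} {obj_type}: {n}")
```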

## Example 7: Visualize Entity Distribution

Create visualizations of entity types:

In [None]:
import matplotlib.pyplot as plt

if parsed['entities']:
    # Count entity types
    entity_types = {}
    for entity in parsed['entities']:
        etype = entity.get('type', 'Unknown')
        entity_types[etype] = entity_types.get(etype, 0) + 1
    
    # Create bar chart
    plt.figure(figsize=(10, 6))
    plt.bar(entity_types.keys(), entity_types.values())
    plt.xlabel('Entity Type')
    plt.ylabel('Count')
    plt.title('Distribution of Entity Types in Annotations')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
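
The manual counting loop can also be written with `collections.Counter`, which additionally returns the types sorted by frequency so the bars appear in descending order. The sample list is illustrative; the resulting `labels` and `counts` could be passed to `plt.bar` in place of the dictionary keys and values.

```python
from collections import Counter

# Assumption: illustrative entities; one has no 'type' key on purpose.
sample_entities = [
    {"type": "Gene"},
    {"type": "Disease"},
    {"type": "Gene"},
    {"name": "untyped"},
]

# Missing types fall back to 'Unknown', as in the loop above.
type_counts = Counter(e.get("type", "Unknown") for e in sample_entities)

# most_common() sorts by count, highest first.
labels, counts = zip(*type_counts.most_common())
print(labels, counts)
```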

## Example 8: Cache Management

Monitor and manage the cache:

In [None]:
# Get cache statistics
stats = client.get_cache_stats()

print("Cache Statistics:")
print(f"  Hit rate: {stats.get('hit_rate', 0):.2%}")
print(f"  Total hits: {stats.get('hits', 0)}")
print(f"  Total misses: {stats.get('misses', 0)}")

# Get cache health
health = client.get_cache_health()
print(f"\nCache health: {health.get('status', 'unknown')}")

# Clear cache if needed
# client.clear_cache()
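
For reference, a hit rate is the share of lookups served from the cache: hits divided by hits plus misses. The sketch below recomputes it from the `hits` and `misses` keys read above. `hit_rate` is a hypothetical helper, and whether the library derives its own `hit_rate` value this way is an assumption.

```python
def hit_rate(stats: dict) -> float:
    """Return hits / (hits + misses), or 0.0 when nothing has been looked up."""
    hits = stats.get("hits", 0)
    misses = stats.get("misses", 0)
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative numbers: 30 hits and 10 misses.
example_stats = {"hits": 30, "misses": 10}
print(f"Hit rate: {hit_rate(example_stats):.2%}")
```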

## Advanced Use Case: Gene-Disease Association Mining

Let's combine multiple features to mine gene-disease associations:

In [None]:
from pyeuropepmc.processing.annotation_parser import extract_entities

def find_gene_disease_cooccurrences(article_ids):
    """Find genes and diseases mentioned in the same articles."""
    
    # Fetch annotations
    annotations = client.get_annotations_by_article_ids(
        article_ids=article_ids,
        section="all"
    )
    
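    if not annotations.get('annotations'):
        # Return the same shape as the normal result so callers can index it
        return {'genes': [], 'diseases': [], 'gene_count': 0, 'disease_count': 0}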
    
    # Extract entities
    entities = extract_entities(annotations['annotations'])
    
    # Split entities into genes and diseases (no per-article grouping here)
    genes = [e for e in entities if e['type'] in ['Gene', 'GENE']]
    diseases = [e for e in entities if e['type'] in ['Disease', 'DISEASE']]
    
    return {
        'genes': genes,
        'diseases': diseases,
        'gene_count': len(genes),
        'disease_count': len(diseases)
    }

# Example usage
results = find_gene_disease_cooccurrences(["PMC3359311"])
print(f"Found {results['gene_count']} genes and {results['disease_count']} diseases")
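
The function above lists genes and diseases but does not pair them. A minimal sketch of per-article pairing follows, assuming each entity record carries an article identifier under a key such as `article_id`. That key name is hypothetical, as is `cooccurrence_pairs`; adjust it to whatever field the parsed entities actually expose.

```python
from collections import defaultdict
from itertools import product


def cooccurrence_pairs(entities, article_key="article_id"):
    """Map (gene, disease) -> set of articles in which both are mentioned."""
    genes, diseases = defaultdict(set), defaultdict(set)
    for entity in entities:
        article = entity.get(article_key)
        if article is None:
            continue  # cannot place this entity in an article
        etype = str(entity.get("type", "")).lower()
        if etype == "gene":
            genes[article].add(entity["name"])
        elif etype == "disease":
            diseases[article].add(entity["name"])

    pairs = defaultdict(set)
    # Only articles that mention at least one gene and one disease contribute.
    for article in genes.keys() & diseases.keys():
        for gene, disease in product(genes[article], diseases[article]):
            pairs[(gene, disease)].add(article)
    return dict(pairs)


# Illustrative entities, not real API output.
sample = [
    {"name": "BRCA1", "type": "Gene", "article_id": "PMC1"},
    {"name": "breast cancer", "type": "DISEASE", "article_id": "PMC1"},
    {"name": "TP53", "type": "Gene", "article_id": "PMC2"},
    {"name": "BRCA1", "type": "GENE", "article_id": "PMC3"},
    {"name": "breast cancer", "type": "Disease", "article_id": "PMC3"},
]
for pair, articles in cooccurrence_pairs(sample).items():
    print(pair, sorted(articles))
```

Co-occurrence in the same article is only a weak signal of association; the relationship annotations from Example 6 are the more direct evidence.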

## Cleanup

Always close the client when done:

In [None]:
client.close()
print("✓ Client closed successfully")
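
To make sure `close()` runs even when a request raises, the standard library's `contextlib.closing` can wrap any object that has a `close()` method. The sketch below uses a stand-in class so it runs offline; `DemoClient` is hypothetical, and whether `AnnotationsClient` also supports the `with` statement directly is not assumed here.

```python
from contextlib import closing


class DemoClient:
    """Stand-in with a close() method, used so the sketch runs offline."""

    def __init__(self):
        self.closed = False

    def fetch(self):
        raise RuntimeError("simulated failure")

    def close(self):
        self.closed = True


demo = DemoClient()
try:
    # closing() calls demo.close() when the block exits, even on error.
    with closing(demo) as c:
        c.fetch()
except RuntimeError as exc:
    print(f"Request failed: {exc}")

print(f"Client closed: {demo.closed}")
```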

## Summary

In this tutorial, we covered:

1. ✓ Initializing the AnnotationsClient
2. ✓ Retrieving annotations by article IDs
3. ✓ Parsing JSON-LD annotations
4. ✓ Extracting entities, sentences, and relationships
5. ✓ Filtering by entity types and providers
6. ✓ Cache management
7. ✓ Advanced use cases for text mining

## Additional Resources

- [Europe PMC Annotations API Documentation](https://europepmc.org/AnnotationsApi)
- [W3C Open Annotation Data Model](https://www.w3.org/TR/annotation-model/)
- [PyEuropePMC Documentation](https://github.com/JonasHeinickeBio/pyEuropePMC)