# Advanced Analytics and Filtering Demo

This notebook demonstrates the advanced analytics and filtering capabilities of pyEuropePMC, including:

1. Result analysis tools (publication year distribution, citation patterns)
2. Pandas DataFrame conversion
3. Duplicate detection and removal
4. Quality metrics
5. Visualization and trend analysis charts

In [None]:
# Import required modules
from pyeuropepmc import (
    SearchClient,
    to_dataframe,
    publication_year_distribution,
    citation_statistics,
    detect_duplicates,
    remove_duplicates,
    quality_metrics,
    publication_type_distribution,
    journal_distribution,
    plot_publication_years,
    plot_citation_distribution,
    plot_quality_metrics,
    plot_publication_types,
    plot_journals,
    plot_trend_analysis,
    create_summary_dashboard,
)

import matplotlib.pyplot as plt

# Set up matplotlib for inline display
%matplotlib inline

## 1. Search for Papers

Let's start by searching for papers on a specific topic.

In [None]:
# Initialize the search client
client = SearchClient()

# Search for papers on CRISPR gene editing
query = "CRISPR gene editing"
response = client.search(query, pageSize=100, resultType="core")

# Extract papers from the response (one call returns at most pageSize records)
papers = response.get("resultList", {}).get("result", [])
print(f"Retrieved {len(papers)} papers for '{query}'")
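
The extraction step above relies on the response being a nested dictionary. The sketch below uses a hypothetical `sample_response` with the same `resultList` / `result` nesting to show a slightly more defensive extraction; `extract_results` is an illustrative helper, not part of pyEuropePMC.

```python
# Hypothetical response with the same nesting the cell above reads from.
sample_response = {
    "resultList": {
        "result": [
            {"id": "1", "title": "Paper A", "pubYear": "2021"},
            {"id": "2", "title": "Paper B", "pubYear": "2019"},
        ]
    },
}


def extract_results(response):
    """Return the list of result records, or [] if any level is missing or None."""
    if not isinstance(response, dict):
        return []
    result_list = response.get("resultList") or {}
    return result_list.get("result") or []


print(len(extract_results(sample_response)))
print(extract_results({"resultList": None}))
```

Chained `.get(..., {})` calls only cover a missing key; the `or {}` form also covers a key that is present but set to `None`.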

## 2. Convert to Pandas DataFrame

Convert the search results to a pandas DataFrame for easier manipulation and analysis.

In [None]:
# Convert to DataFrame
df = to_dataframe(papers)

# Display basic information
print(f"DataFrame shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print("\nFirst few rows:")
df.head()
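
To make the conversion step concrete, here is a minimal pandas sketch of turning a list of record dictionaries into a DataFrame. It is not how `to_dataframe` is implemented (that is not shown in this notebook); the sample records and the type-coercion choices are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records; the second one has no citation count.
records = [
    {"id": "1", "title": "Paper A", "pubYear": "2021", "citedByCount": 12},
    {"id": "2", "title": "Paper B", "pubYear": "2019"},
]

sample_df = pd.DataFrame(records)

# Keys missing from a record become NaN, so coerce the columns used for
# arithmetic to numbers and fill the gaps explicitly.
sample_df["pubYear"] = pd.to_numeric(sample_df["pubYear"], errors="coerce")
sample_df["citedByCount"] = (
    pd.to_numeric(sample_df["citedByCount"], errors="coerce").fillna(0).astype(int)
)

print(sample_df.dtypes)
```

Checking `dtypes` right after conversion is a cheap way to catch columns that arrived as strings before they are used in comparisons or aggregations.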

## 3. Publication Year Distribution

Analyze how publications are distributed across years.

In [None]:
# Get year distribution
year_dist = publication_year_distribution(papers)
print("Publication Year Distribution:")
print(year_dist)

# Visualize
fig = plot_publication_years(papers, title=f"Publications on '{query}' by Year")
plt.show()
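
Conceptually, a year distribution is a count of records per publication year. The sketch below does this with the standard library on hypothetical records; `year_counts` is an illustrative helper and may differ from what `publication_year_distribution` returns.

```python
from collections import Counter

# Hypothetical records; the last one has no publication year.
sample_papers = [
    {"pubYear": "2021"},
    {"pubYear": "2021"},
    {"pubYear": "2019"},
    {"title": "no year recorded"},
]


def year_counts(papers):
    """Count papers per year, skipping records without a usable year."""
    years = (paper.get("pubYear") for paper in papers)
    counts = Counter(int(year) for year in years if year and str(year).isdigit())
    return dict(sorted(counts.items()))


print(year_counts(sample_papers))
```

Converting years to integers before counting keeps the result sorted chronologically, whether the source field is a string or a number.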

## 4. Citation Statistics

Analyze citation patterns in the search results.

In [None]:
# Get citation statistics
cite_stats = citation_statistics(papers)

print("Citation Statistics:")
print(f"  Total papers: {cite_stats['total_papers']}")
print(f"  Mean citations: {cite_stats['mean_citations']:.2f}")
print(f"  Median citations: {cite_stats['median_citations']:.0f}")
print(f"  Max citations: {cite_stats['max_citations']}")
print(f"  Papers with citations: {cite_stats['papers_with_citations']}")
print(f"  Papers without citations: {cite_stats['papers_without_citations']}")

print("\nCitation Distribution (Percentiles):")
for percentile, value in cite_stats['citation_distribution'].items():
    print(f"  {percentile}: {value:.1f}")

In [None]:
# Visualize citation distribution
fig = plot_citation_distribution(papers, title=f"Citation Distribution for '{query}'")
plt.show()
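
The figures reported above are standard descriptive statistics. As a rough sketch of how such a summary could be computed with the standard library (Python 3.8+), assuming a plain list of citation counts; the dictionary keys here are hypothetical and do not mirror the keys returned by `citation_statistics`:

```python
import statistics


def summarize_citations(counts):
    """Summarize a list of non-negative citation counts."""
    if not counts:
        return {}
    summary = {
        "total": len(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        "cited": sum(1 for count in counts if count > 0),
    }
    if len(counts) >= 2:
        # Quartiles with linear interpolation between data points.
        q1, q2, q3 = statistics.quantiles(counts, n=4, method="inclusive")
        summary["quartiles"] = {"25th": q1, "50th": q2, "75th": q3}
    return summary


sample_counts = [0, 0, 3, 5, 12, 40]
result = summarize_citations(sample_counts)
print(result)
```

Citation counts are usually right-skewed, so a few heavily cited papers can pull the mean well above the median; comparing the two is a quick check on how representative the mean is.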

## 5. Quality Metrics

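Summarize record-level quality indicators for the result set: open-access status, presence of an abstract and a DOI, availability in PMC and as PDF, and an estimated share of peer-reviewed papers.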

In [None]:
# Get quality metrics
metrics = quality_metrics(papers)

print("Quality Metrics:")
print(f"  Total papers: {metrics['total_papers']}")
print(f"  Open access: {metrics['open_access_count']} ({metrics['open_access_percentage']:.1f}%)")
print(f"  With abstract: {metrics['with_abstract_count']} ({metrics['with_abstract_percentage']:.1f}%)")
print(f"  With DOI: {metrics['with_doi_count']} ({metrics['with_doi_percentage']:.1f}%)")
print(f"  In PMC: {metrics['in_pmc_count']} ({metrics['in_pmc_percentage']:.1f}%)")
print(f"  With PDF: {metrics['with_pdf_count']} ({metrics['with_pdf_percentage']:.1f}%)")
print(f"  Peer reviewed (estimated): {metrics['peer_reviewed_estimate']} ({metrics['peer_reviewed_percentage']:.1f}%)")

# Visualize quality metrics
fig = plot_quality_metrics(papers, title=f"Quality Metrics for '{query}'")
plt.show()
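
Each metric above is a count paired with a percentage of the total. A minimal sketch of that pattern follows, using hypothetical records; the field names `doi` and `abstractText` and the placeholder DOI values are assumptions, and this is not the implementation of `quality_metrics`.

```python
# Hypothetical records; the DOIs are placeholders, not real identifiers.
sample_papers = [
    {"isOpenAccess": "Y", "doi": "10.0000/example-1", "abstractText": "Some abstract."},
    {"isOpenAccess": "N", "doi": "10.0000/example-2"},
    {"isOpenAccess": "Y"},
    {"isOpenAccess": "Y"},
]


def count_and_share(papers, predicate):
    """Return (count, percentage) of papers for which predicate(paper) is true."""
    if not papers:
        return 0, 0.0
    count = sum(1 for paper in papers if predicate(paper))
    return count, 100.0 * count / len(papers)


checks = {
    "Open access": lambda p: p.get("isOpenAccess") == "Y",
    "With DOI": lambda p: bool(p.get("doi")),
    "With abstract": lambda p: bool(p.get("abstractText")),
}

summary = {label: count_and_share(sample_papers, check) for label, check in checks.items()}
for label, (count, pct) in summary.items():
    print(f"{label}: {count} ({pct:.1f}%)")
```

Guarding against an empty list avoids a division by zero when a query returns no results.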

## 6. Duplicate Detection

Detect duplicate papers by title, then remove them while keeping the most-cited copy of each.

In [None]:
# Detect duplicates by title
duplicates = detect_duplicates(papers, method="title")
print(f"Found {len(duplicates)} sets of duplicate papers (by title)")

if duplicates:
    print("\nDuplicate sets:")
    for i, dup_indices in enumerate(duplicates[:3], 1):  # Show first 3 sets
        print(f"\nSet {i}: {len(dup_indices)} papers")
        for idx in dup_indices:
            print(f"  - {(papers[idx].get('title') or '')[:80]}...")

In [None]:
# Remove duplicates, keeping the most cited version
unique_papers_df = remove_duplicates(papers, method="title", keep="most_cited")
print(f"Original papers: {len(papers)}")
print(f"After removing duplicates: {len(unique_papers_df)}")
print(f"Duplicates removed: {len(papers) - len(unique_papers_df)}")
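
Title-based deduplication generally comes down to normalizing titles, grouping records that share a normalized title, and choosing one record per group. The sketch below illustrates that idea on hypothetical records; the helper names and the normalization rule are assumptions, not the logic used by `detect_duplicates` or `remove_duplicates`.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase the title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", (title or "").lower()).strip()


def group_duplicates(papers):
    """Return lists of indices whose normalized titles match."""
    groups = defaultdict(list)
    for index, paper in enumerate(papers):
        key = normalize_title(paper.get("title"))
        if key:  # untitled records are never treated as duplicates
            groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


def keep_most_cited(papers):
    """Keep one record per normalized title, preferring the higher citation count."""
    best = {}
    for index, paper in enumerate(papers):
        key = normalize_title(paper.get("title")) or f"__untitled_{index}"
        current = best.get(key)
        if current is None or (paper.get("citedByCount") or 0) > (current.get("citedByCount") or 0):
            best[key] = paper
    return list(best.values())


sample_papers = [
    {"title": "CRISPR Screening in Cells.", "citedByCount": 3},
    {"title": "crispr screening in cells", "citedByCount": 10},
    {"title": "Base editing review", "citedByCount": 1},
    {"title": None},
]

print(group_duplicates(sample_papers))
print(len(keep_most_cited(sample_papers)))
```

Exact matching on a normalized title is conservative: it catches differences in case and punctuation but not retitled or translated versions, which would need fuzzy matching or identifier-based comparison.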

## 7. Publication Type Analysis

Analyze the distribution of publication types.

In [None]:
# Get publication type distribution
pub_types = publication_type_distribution(papers)
print("Top 10 Publication Types:")
print(pub_types.head(10))

# Visualize
fig = plot_publication_types(papers, title=f"Publication Types for '{query}'", top_n=10)
plt.show()

## 8. Journal Analysis

Identify the top journals publishing on this topic.

In [None]:
# Get journal distribution
journals = journal_distribution(papers, top_n=15)
print("Top 15 Journals:")
print(journals)

# Visualize
fig = plot_journals(papers, title=f"Top Journals for '{query}'", top_n=15)
plt.show()
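
A top-N journal ranking is a frequency count over journal names. A small standard-library sketch follows, assuming each record carries the journal name under a hypothetical `journalTitle` key; the actual field that `journal_distribution` reads is not shown in this notebook.

```python
from collections import Counter

# Hypothetical records; the last one has no journal information.
sample_papers = [
    {"journalTitle": "Journal A"},
    {"journalTitle": "Journal B"},
    {"journalTitle": "Journal A"},
    {"journalTitle": "Journal A"},
    {"journalTitle": "Journal B"},
    {"journalTitle": "Journal C"},
    {},
]


def top_journals(papers, top_n=15, key="journalTitle"):
    """Return the top_n (journal, count) pairs, ignoring records without a journal."""
    names = (paper.get(key) for paper in papers)
    return Counter(name for name in names if name).most_common(top_n)


print(top_journals(sample_papers, top_n=2))
```

Records without a journal name (preprints, for example) are skipped here rather than counted under an empty label.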

## 9. Trend Analysis

Analyze publication and citation trends over time.

In [None]:
# Create trend analysis plot
fig = plot_trend_analysis(papers, title=f"Trends for '{query}'")
plt.show()

## 10. Comprehensive Dashboard

Create a comprehensive dashboard with all key visualizations.

In [None]:
# Create comprehensive dashboard
fig = create_summary_dashboard(papers, title=f"Literature Analysis Dashboard: '{query}'")
plt.show()

## 11. Advanced DataFrame Operations

Use pandas to perform advanced filtering and analysis.

In [None]:
import pandas as pd

# Coerce to numbers so comparisons work whether the columns hold strings or integers
pub_year = pd.to_numeric(df['pubYear'], errors='coerce')
citations = pd.to_numeric(df['citedByCount'], errors='coerce')

# Filter papers with high citations
highly_cited = df[citations > citations.quantile(0.75)]
print(f"\nHighly cited papers (above the 75th percentile): {len(highly_cited)}")

# Filter open access papers from recent years
recent_oa = df[(pub_year >= 2020) & (df['isOpenAccess'] == 'Y')]
print(f"Recent open access papers (2020+): {len(recent_oa)}")

# Group by year and calculate average citations
yearly_citations = citations.groupby(pub_year).mean().sort_index()
print("\nAverage citations by year:")
print(yearly_citations.tail(10))
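
Filters like these are sensitive to column types and missing values. The following self-contained sketch runs the same kind of filter and group-by on a small hypothetical DataFrame, so the effect of a missing year is easy to see; the data are invented for illustration.

```python
import pandas as pd

# Hypothetical data: years stored as strings, one of them missing.
sample_df = pd.DataFrame(
    {
        "pubYear": ["2018", "2020", "2021", "2021", None],
        "isOpenAccess": ["Y", "Y", "N", "Y", "Y"],
        "citedByCount": [50, 4, 9, 1, 0],
    }
)

# Missing or malformed years become NaN, and NaN >= 2020 evaluates to False.
years = pd.to_numeric(sample_df["pubYear"], errors="coerce")

recent_oa = sample_df[(years >= 2020) & (sample_df["isOpenAccess"] == "Y")]

# Rows without a year are dropped from the per-year average.
mean_by_year = (
    sample_df.assign(year=years)
    .dropna(subset=["year"])
    .groupby("year")["citedByCount"]
    .mean()
)

print(len(recent_oa))
print(mean_by_year)
```

Keeping the coerced years in a separate Series leaves the original column untouched, which helps when the DataFrame is exported later.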

## 12. Export Results

Export the DataFrame to various formats for further analysis.

In [None]:
# Export to CSV
# df.to_csv('search_results.csv', index=False)
# print("Results exported to search_results.csv")

# Export to Excel (pandas needs an Excel engine such as openpyxl installed)
# df.to_excel('search_results.xlsx', index=False)
# print("Results exported to search_results.xlsx")

print("Uncomment the above lines to export the results")

## Summary

This notebook demonstrated the advanced analytics and filtering capabilities of pyEuropePMC:

1. **Data Conversion**: Easy conversion of API results to pandas DataFrames
2. **Publication Analysis**: Year-wise distribution and trends
3. **Citation Analysis**: Comprehensive citation statistics and distributions
4. **Quality Assessment**: Multiple quality metrics including open access, abstracts, DOIs
5. **Duplicate Management**: Detection and removal of duplicate papers
6. **Publication Types**: Distribution and analysis of different publication types
7. **Journal Analysis**: Identification of top journals in the field
8. **Trend Analysis**: Time-based trends in publications and citations
9. **Visualizations**: Rich visualizations including dashboards
10. **DataFrame Operations**: Advanced filtering and grouping operations

These tools enable comprehensive analysis of scientific literature retrieved from Europe PMC.