# Interacting with the German Digital Library API

*Notebook based on https://github.com/Digital-History-Bielefeld/llm-supported-workflow-for-processing-faulty-ocr*

This notebook shows how to interact with the Application Programming Interface (API) of the German Digital Library (Deutsche Digitale Bibliothek, DDB). An API is like a bridge that allows different software programs to talk to each other. It sets the rules for how they can request and share information. Think of it as a menu in a restaurant: it tells you what you can order and how to ask for it, without needing to know how the kitchen prepares the food. APIs make it easier for different apps and services to work together smoothly. Consider this Jupyter Notebook (the environment in which you are reading this) as an app that exchanges information with the German Digital Library through its API. The code is written in Python, a programming language that aims to be easy to understand, even for inexperienced programmers.
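
To make the menu analogy concrete, the following minimal sketch builds the kind of web address a program sends to an API. The endpoint `https://api.example.org/search` and the parameter names `query` and `rows` are placeholders invented for illustration and are not part of the DDB API; no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only (not the DDB API)
base_url = "https://api.example.org/search"
params = {"query": "demokratie", "rows": 10}

# An API request is often just a URL: the address plus the "order" as parameters
request_url = f"{base_url}?{urlencode(params)}"
print(request_url)
```

Packages such as `ddbapi`, used below, assemble and send requests like this for you, so you rarely have to write URLs by hand.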

## German Historical Newspaper Portal

German historical newspapers from the German Digital Library can be accessed via the DDB API. This API is open access and allows you to query the historical newspapers available in the German Newspaper Portal ([Deutsches Zeitungsportal](https://www.deutsche-digitale-bibliothek.de/newspaper)). An introduction by Karl Krägelin, provided by the German Newspaper Portal, can be found [here](https://deepnote.com/app/karl-kragelin-b83c/Zeitungsportal-API-d9224dda-8e26-4b35-a6d7-40e9507b1151).

Using the DDB API in a Python app (as we will do here) is easy if you use the `ddbapi` Python package. A Python package is like a collection of tools that you can use in your program. Imagine having a toolbox with different tools for various tasks, like a hammer, a screwdriver, and pliers. A Python package contains many useful functions and modules that help you perform specific tasks in your code more easily and quickly. Instead of writing everything from scratch, you can use a package that already has the functions you need. This makes programming more efficient and organized.

In [None]:
# Install the "Deutsche Digitale Bibliothek API" package (ddbapi: https://pypi.org/project/ddbapi/)
%pip install ddbapi

### Define the newspapers to be processed for case studies

Every newspaper accessible through the German Digital Library has an ID assigned by the Zeitschriftendatenbank (ZDB). With these ZDB IDs, we can target specific newspapers. For a case study, the relevant ZDB IDs can be saved in a list, so that the selection can easily be reused in later steps of the workflow. Likewise, the time periods relevant to the case study can be saved in a variable to narrow the scope further in those later steps.

In [None]:
zdb_ids = [
    # Vorwärts : Berliner Volksblatt ; das Abendblatt der Hauptstadt Deutschlands
    "2814128-3"
]

periods = [
    "[1921-08-10T00:00:01Z TO 1921-08-12T23:59:59Z]",
    "[1922-08-10T00:00:01Z TO 1922-08-12T23:59:59Z]",
    "[1923-08-10T00:00:01Z TO 1923-08-12T23:59:59Z]",
    "[1924-08-10T00:00:01Z TO 1924-08-12T23:59:59Z]",
    "[1925-08-10T00:00:01Z TO 1925-08-12T23:59:59Z]",
    "[1926-08-10T00:00:01Z TO 1926-08-12T23:59:59Z]",
    "[1927-08-10T00:00:01Z TO 1927-08-12T23:59:59Z]",
    "[1928-08-10T00:00:01Z TO 1928-08-12T23:59:59Z]",
    "[1929-08-10T00:00:01Z TO 1929-08-12T23:59:59Z]",
    "[1930-08-10T00:00:01Z TO 1930-08-12T23:59:59Z]",
    "[1931-08-10T00:00:01Z TO 1931-08-12T23:59:59Z]",
    "[1932-08-10T00:00:01Z TO 1932-08-12T23:59:59Z]"
]
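
The twelve entries of `periods` follow one pattern and differ only in the year, so the same list could also be generated. The following sketch assumes the same three-day window (10 to 12 August) for every year from 1921 to 1932; the name `generated_periods` is chosen for illustration.

```python
# Build one date range per year instead of typing each string by hand
generated_periods = [
    f"[{year}-08-10T00:00:01Z TO {year}-08-12T23:59:59Z]"
    for year in range(1921, 1933)  # range() excludes the end value, so this covers 1921 to 1932
]

print(len(generated_periods))
print(generated_periods[0])
```

Generating the list makes it easier to change the window or the range of years later without editing twelve lines.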

### Define search terms to effectively query the German Historical Newspaper Portal

Searching for specific terms or phrases retrieves only those newspaper pages that contain them. We can define search terms individually or collect them in thematic lists to use in the queries. Below you will find a selection of basic keyword lists. They can be combined into more specific keyword lists for more targeted queries.

In [None]:
# A list of keywords centered on discussions about democracy in general
keywords_democracy = [
    "demokrat*",
    "volksherrschaft",
    "demokratisier*",
    "verfassung*"
]

# A list of keywords focused on social democratic topoi in discourses on democracy
keywords_social_democrats = [
    "sozialisier*",
    "sozialistisch~",
    "verfassung*"
]
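
The trailing `*` and `~` in these keywords are search operators, not part of the words. Assuming they follow the common Lucene-style convention (`*` as a wildcard, `~` as a fuzzy match), which this notebook does not state explicitly, the sketch below shows one way to match such keywords against text locally. The helper `keyword_to_regex` is hypothetical and only approximates the server-side search: the fuzzy operator is reduced to a plain prefix match.

```python
import re

def keyword_to_regex(keyword: str) -> re.Pattern:
    """Hypothetical helper: turn a query keyword such as 'demokrat*' into a regex."""
    stem = keyword.rstrip("*~")  # drop trailing search operators
    # A trailing wildcard means "any further word characters may follow"
    suffix = r"\w*" if keyword.endswith("*") else ""
    return re.compile(r"\b" + re.escape(stem) + suffix, re.IGNORECASE)

match = keyword_to_regex("demokrat*").search("Die Demokratie lebt")
print(match.group(0))
```

Here the pattern finds the whole word that starts with the stem, regardless of upper or lower case.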

### Helper functions for the DDB-API interaction

Below we define helper functions for using the DDB API. In Python, a function is like a recipe that you can use to perform a specific task. Imagine you have a recipe for making a sandwich. Every time you want a sandwich, you follow the same steps in the recipe. Similarly, a function in Python is a set of instructions that you can use whenever you need to do a particular job. You give it a name, and whenever you call that name, it runs the instructions inside. This helps you avoid repeating the same code and makes your programs easier to manage and understand.
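
As a small illustration of the recipe idea, the following function is defined once and then called twice with different ingredients. The name `make_sandwich` and its parameters are made up for this example and are not used elsewhere in the notebook.

```python
def make_sandwich(bread: str, filling: str) -> str:
    """Return a short description of a sandwich."""
    return f"{filling} on {bread}"

# Calling the function by name runs the instructions inside it
print(make_sandwich("rye", "cheese"))
print(make_sandwich("wheat", "tomato"))
```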

In [None]:
# First we need to import the packages that we want to use for our helper functions
import ddbapi
import pandas as pd
import requests

In [None]:
def get_pages(publication_dates: list[str], zdb_ids: list[str], keywords: list[str]) -> pd.DataFrame:
    """ Get data on newspaper pages from the DDB API for every combination of period, ZDB ID, and keyword. """
    data_frames = []

    for publication_date in publication_dates:
        for zdb_id in zdb_ids:
            for keyword in keywords:
                try:
                    df = ddbapi.zp_pages(publication_date=publication_date, zdb_id=zdb_id, plainpagefulltext=keyword)
                    if isinstance(df, pd.DataFrame):
                        data_frames.append(df)
                    else:
                        print(f"Warning: The result for date {publication_date}, zdb_id {zdb_id}, and keyword {keyword} is not a DataFrame.")
                except requests.exceptions.HTTPError as e:
                    print(f"HTTPError for date {publication_date}, zdb_id {zdb_id}, and keyword {keyword}: {e}")
                except Exception as e:
                    print(f"An error occurred for date {publication_date}, zdb_id {zdb_id}, and keyword {keyword}: {e}")

    if len(data_frames) == 0:
        return pd.DataFrame()

    combined_df = pd.concat(data_frames, ignore_index=True)
    return combined_df.drop_duplicates(subset=['page_id'])
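
Because the same page can match several keywords, `get_pages` ends by concatenating all partial results and dropping duplicate `page_id` values. The sketch below reproduces just that step on two hand-made DataFrames; the IDs and the `keyword` column are invented sample data, not real API output.

```python
import pandas as pd

# Invented sample data: page "p2" was found by both keyword queries
hits_a = pd.DataFrame({"page_id": ["p1", "p2"], "keyword": ["demokrat*", "demokrat*"]})
hits_b = pd.DataFrame({"page_id": ["p2", "p3"], "keyword": ["verfassung*", "verfassung*"]})

combined = pd.concat([hits_a, hits_b], ignore_index=True)
unique_pages = combined.drop_duplicates(subset=["page_id"])

print(len(combined))      # rows before de-duplication
print(len(unique_pages))  # rows afterwards
```

By default `drop_duplicates` keeps the first occurrence of each `page_id`, so each page appears once no matter how many keywords it matched.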

In [None]:
def combine_keyword_lists(*keyword_lists: list[str]) -> list[str]:
    """ Combines multiple lists of keywords into one list and removes duplicates. """
    # Use a set to combine all keywords and automatically remove duplicates
    combined_set = set()
    for keyword_list in keyword_lists:
        combined_set.update(keyword_list)
    # Convert the set back to a list
    keywords_combined = list(combined_set)
    return keywords_combined
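
A Python `set` does not preserve insertion order, so the order of the combined keywords may vary between runs. If a stable order matters, for example when comparing results between runs, one alternative is `dict.fromkeys`, which also removes duplicates but keeps the first occurrence of each keyword in place. The function name `combine_keyword_lists_ordered` is hypothetical.

```python
from itertools import chain

def combine_keyword_lists_ordered(*keyword_lists: list[str]) -> list[str]:
    """Combine keyword lists, dropping duplicates but keeping first-seen order."""
    # Dictionary keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(chain.from_iterable(keyword_lists)))

ordered = combine_keyword_lists_ordered(
    ["demokrat*", "verfassung*"],
    ["sozialisier*", "verfassung*"],
)
print(ordered)
```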


### Helper functions in action

In the next step we use our helper functions and receive the data as a pandas DataFrame. A pandas DataFrame is like a table or a spreadsheet that you can use in your programs. Imagine an Excel sheet where you have rows and columns to organize your data. Each column can hold a different type of information, like names, dates, or numbers, and each row represents a different entry or record. A pandas DataFrame works the same way, allowing you to store, organize, and manipulate data easily. It is a powerful tool for handling data because you can quickly sort, filter, and analyze the information, just like you would in a spreadsheet.
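
The following sketch builds a tiny DataFrame by hand and applies the spreadsheet-style operations mentioned above. The values are invented; only the column names `page_id` and `publication_date` also occur in the data used in this notebook.

```python
import pandas as pd

# Each dictionary key becomes a column, each list position a row
table = pd.DataFrame({
    "page_id": ["p1", "p2", "p3"],
    "publication_date": ["1921-08-11", "1921-08-10", "1922-08-12"],
})

# Filter: keep only the rows from 1921
from_1921 = table[table["publication_date"].str.startswith("1921")]

# Sort: order those rows by date
sorted_1921 = from_1921.sort_values("publication_date")
print(sorted_1921)
```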

In [None]:
# Create a case-specific list of keywords by combining basic keyword lists
keywords = combine_keyword_lists(
    keywords_democracy,
    keywords_social_democrats
)

# Query the DDB API for newspaper pages that match the periods, ZDB IDs, and keywords
pages = get_pages(
    periods,
    zdb_ids,
    keywords
)

In [None]:
# Display the number of entries and the first few rows of the DataFrame
print(f"Number of entries: {len(pages)}")
pages.head()

## Visualizing keyword findings through frequency charts

First, we count the keyword search hits for each day in the DataFrame above. We store these counts in a new pandas DataFrame to get a first impression.

In [None]:
# Convert the 'publication_date' column to datetime format (if not already)
pages['publication_date'] = pd.to_datetime(pages['publication_date'])

# Extract year, month, and day from the 'publication_date' column and create new columns for each
pages['year'] = pages['publication_date'].dt.year
pages['month'] = pages['publication_date'].dt.month
pages['day'] = pages['publication_date'].dt.day

# Create a new DataFrame with the counts of pages per day
daily_counts = pages.groupby(['year', 'month', 'day']).size().reset_index(name='count')

# Display the first few rows of the "daily_counts" DataFrame
daily_counts.head()
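
For a quick look at per-day counts, the grouping can also be done on the date directly, without separate helper columns. This is a sketch on invented dates, not on the real `pages` data; `sample` and `counts_by_day` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    "publication_date": pd.to_datetime(["1921-08-10", "1921-08-10", "1921-08-11"])
})

# Group on the calendar date and count the rows in each group
counts_by_day = (
    sample.groupby(sample["publication_date"].dt.date)
    .size()
    .reset_index(name="count")
)
print(counts_by_day)
```

The `year`, `month`, and `day` columns created above are still needed by the plotting function further down, so this variant is an alternative for inspection only.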

For more advanced frequency-based visualizations we use two more Python packages, `seaborn` and `matplotlib`. They need to be installed in the same way as the `ddbapi` package.

In [None]:
# Install the `seaborn` and `matplotlib` packages for visualization
%pip install seaborn matplotlib

In [None]:
# First we need to import the new packages that we want to use for our helper functions
import seaborn as sns
from matplotlib import pyplot as plt
import pandas as pd

In [None]:
def linegraph_search_hits(df: pd.DataFrame, keywords: list[str], time_granularity: str):
    ''' Plots keyword search hits over time with specified granularity (pass 'day', 'month', or 'year'). '''

    xlabel = time_granularity.capitalize()
    ylabel = 'Count'

    sns.set_theme(style="darkgrid", font_scale=1.0)

    # Create a figure for plotting
    plt.figure(figsize=(18, 12))

    # Create a DataFrame to hold all the data for plotting
    plot_data = pd.DataFrame()

    for keyword in keywords:
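        # Strip the search operators (* and ~) so that the keyword is matched as plain text,
        # then keep the rows whose 'plainpagefulltext' column contains it
        search_term = keyword.rstrip('*~')
        keyword_df = df[df['plainpagefulltext'].str.contains(search_term, case=False, na=False, regex=False)].copy()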

        # Group by the specified time granularity
        if time_granularity == 'year':
            grouped_df = keyword_df.groupby('year').size().reset_index(name='count')
            grouped_df['time'] = pd.to_datetime(grouped_df['year'].astype(str))
        elif time_granularity == 'month':
            grouped_df = keyword_df.groupby(['year', 'month']).size().reset_index(name='count')
            grouped_df['time'] = pd.to_datetime(grouped_df['year'].astype(str) + '-' + grouped_df['month'].astype(str).str.zfill(2))
        elif time_granularity == 'day':
            grouped_df = keyword_df.groupby(['year', 'month', 'day']).size().reset_index(name='count')
            grouped_df['time'] = pd.to_datetime(grouped_df[['year', 'month', 'day']])
        else:
            raise ValueError("time_granularity must be 'year', 'month', or 'day'")

        # Add a column for the keyword to use in hue and style
        grouped_df['keyword'] = keyword

        # Append to the plot data
        plot_data = pd.concat([plot_data, grouped_df], ignore_index=True)

    # Check if plot_data is empty
    if plot_data.empty:
        print("No data available for the given keywords.")
        return

    # Sort plot_data by keyword first, then by time
    plot_data = plot_data.sort_values(by=['keyword', 'time'])

    # Format 'time' as a text label that matches the chosen granularity
    # (time_granularity has already been validated in the loop above)
    time_formats = {'year': '%Y', 'month': '%Y-%m', 'day': '%Y-%m-%d'}
    plot_data['time'] = plot_data['time'].dt.strftime(time_formats[time_granularity])

    # Plot using hue and style for different keywords
    sns.lineplot(data=plot_data, x='time', y='count', hue='keyword', style='keyword', markers=True, markersize=8, dashes=False)

    # Customize the plot's appearance
    plt.xlabel(xlabel, fontsize=13)
    plt.ylabel(ylabel, fontsize=13)
    plt.title('Search hits for keywords', fontsize=16)
    plt.xticks(rotation=65)
    plt.tick_params(axis='both', which='major', labelsize=11)
    plt.legend(loc='upper left', fontsize=12, title='Keywords', title_fontsize=14, framealpha=0.6)
    plt.tight_layout()
    plt.show()

In [None]:
# Plot the number of pages per month that contain each keyword
linegraph_search_hits(
    pages,  # the DataFrame of pages returned by get_pages
    keywords,  # the keyword(s) to plot
    time_granularity='month',  # 'day', 'month', or 'year'; also used as the x-axis label
)