Deutsche Digitale Bibliothek REST API: https://labs.deutsche-digitale-bibliothek.de/app/ddbapi/#/search/getSolrSearch

Fragen & Antworten zum Deutschen Zeitungsportal: https://www.deutsche-digitale-bibliothek.de/content/newspaper/fragen-antworten

Wrapper for the DDB API: https://pypi.org/project/ddbapi/

# Functions for extracting and analyzing data from Das Deutsche Zeitungsportal

In [1]:
import pandas as pd
from ddbapi import zp_pages
import folium
from geopy import geocoders 

In [2]:
def article_extractor(search_dict):
    """
    Fetch newspaper pages in a given language published between two dates.

    Takes a dictionary with three keys as its only argument, for example:

    search_dict = {
        'language': 'ger',
        'date_begin': f'{year}-01-01',
        'date_end': f'{year}-12-31'
        }

    Returns the DataFrame produced by zp_pages.
    """
    # Range query on publication_date; both bounds are fixed at 12:00 UTC.
    df = zp_pages(
        language=search_dict['language'],
        publication_date=f"[{search_dict['date_begin']}T12:00:00Z TO {search_dict['date_end']}T12:00:00Z]",
    )
    return df
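
The `publication_date` filter above is assembled by hand from the two dates. A minimal sketch of that step as a separate helper, which also rejects malformed dates before any request is sent. The name `solr_date_range` is hypothetical and not part of `ddbapi`; the noon-UTC timestamps simply mirror the string used in `article_extractor`.

```python
from datetime import date


def solr_date_range(date_begin: str, date_end: str) -> str:
    """Build the range string used for publication_date (hypothetical helper).

    Both arguments are ISO dates (YYYY-MM-DD). date.fromisoformat raises
    ValueError on malformed input, so typos fail early.
    """
    begin = date.fromisoformat(date_begin)
    end = date.fromisoformat(date_end)
    if begin > end:
        raise ValueError("date_begin must not be later than date_end")
    return f"[{begin.isoformat()}T12:00:00Z TO {end.isoformat()}T12:00:00Z]"


print(solr_date_range("1925-09-01", "1925-12-31"))
# prints: [1925-09-01T12:00:00Z TO 1925-12-31T12:00:00Z]
```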

Download the newspaper data for each year in three four-month chunks (January to April, May to August, September to December):

In [3]:
year= 1925

In [4]:
# search_dict_ger= {
#     'language': 'ger',
#     'date_begin': f'{year}-01-01',
#     'date_end': f'{year}-04-30'
#     }
    
# df_challenge_ger= article_extractor(search_dict_ger)
# df_challenge_ger.to_pickle(f"./data_deutsches_zeitungsportal_1914_1945/newspapers_ger_{year}_part_1")

In [5]:
# search_dict_ger= {
#     'language': 'ger',
#     'date_begin': f'{year}-05-01',
#     'date_end': f'{year}-08-31'
#     }
    
# df_challenge_ger= article_extractor(search_dict_ger)
# df_challenge_ger.to_pickle(f"./data_deutsches_zeitungsportal_1914_1945/newspapers_ger_{year}_part_2")

In [None]:
search_dict_ger= {
    'language': 'ger',
    'date_begin': f'{year}-09-01',
    'date_end': f'{year}-12-31'
    }
    
df_challenge_ger= article_extractor(search_dict_ger)
df_challenge_ger.to_pickle(f"./data_deutsches_zeitungsportal_1914_1945/newspapers_ger_{year}_part_3")

https://api.deutsche-digitale-bibliothek.de/search/index/newspaper-issues/select?rows=1000&sort=id+ASC&q=type%3Apage+AND+language%3A%22ger%22+AND+publication_date%3A%22%5B1925-09-01T12%3A00%3A00Z%5C+TO%5C+1925-12-31T12%3A00%3A00Z%5D%22&cursorMark=%2A
Getting 1000 of 113678
Getting 2000 of 113678
Getting 3000 of 113678
Getting 4000 of 113678
Getting 5000 of 113678
Getting 6000 of 113678
Getting 7000 of 113678
Getting 8000 of 113678
Getting 9000 of 113678
Getting 10000 of 113678
Getting 11000 of 113678
Getting 12000 of 113678
Getting 13000 of 113678
Getting 14000 of 113678
Getting 15000 of 113678
Getting 16000 of 113678
Getting 17000 of 113678
Getting 18000 of 113678
Getting 19000 of 113678
Getting 20000 of 113678
Getting 21000 of 113678
Getting 22000 of 113678
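
The three download cells above differ only in their date ranges. A sketch of how those ranges could be generated instead of edited by hand; `year_chunks` is a hypothetical helper, and the equal-length split assumes the number of chunks divides 12.

```python
import calendar
from datetime import date


def year_chunks(year, n_chunks=3, language="ger"):
    """Return one search_dict per chunk, covering the whole year (hypothetical helper)."""
    if 12 % n_chunks:
        raise ValueError("n_chunks must divide 12")
    months_per_chunk = 12 // n_chunks
    chunks = []
    for i in range(n_chunks):
        first_month = i * months_per_chunk + 1
        last_month = first_month + months_per_chunk - 1
        # monthrange returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, last_month)[1]
        chunks.append({
            "language": language,
            "date_begin": date(year, first_month, 1).isoformat(),
            "date_end": date(year, last_month, last_day).isoformat(),
        })
    return chunks


for part, search_dict in enumerate(year_chunks(1925), start=1):
    print(part, search_dict["date_begin"], search_dict["date_end"])
# prints:
# 1 1925-01-01 1925-04-30
# 2 1925-05-01 1925-08-31
# 3 1925-09-01 1925-12-31
```

Each dictionary has the shape expected by `article_extractor`, so the loop body could call it and pickle the result under the matching `part_{part}` file name.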


In [None]:
# Test whether the pickled DataFrame can be loaded:

test_year = f"{year}_part_3"
columns = ['paper_title', 'publication_date', 'place_of_distribution']
try:
    print(len(pd.read_pickle(f"./data_deutsches_zeitungsportal_1914_1945/newspapers_ger_{test_year}")[columns]))
except EOFError:
    # An EOFError usually means the pickle file is empty or was not written completely.
    print(f"Error: EOFError occurred while loading data for {test_year}.")
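
A small illustration of why this check is useful, using only the standard `pickle` module rather than pandas. It assumes that `pd.read_pickle` surfaces the same exceptions as `pickle.load`; the helper name `is_loadable` and the temporary files are made up for the demonstration. Note that an incomplete file can raise `pickle.UnpicklingError` as well as `EOFError`, depending on where the data stops.

```python
import pickle
import tempfile
from pathlib import Path


def is_loadable(path):
    """Return True if the file unpickles cleanly, False if it is empty or truncated."""
    try:
        with open(path, "rb") as fh:
            pickle.load(fh)
        return True
    except (EOFError, pickle.UnpicklingError):
        return False


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pkl"
    good.write_bytes(pickle.dumps(list(range(1000))))

    # Simulate an interrupted write by keeping only the first 100 bytes.
    truncated = Path(tmp) / "truncated.pkl"
    truncated.write_bytes(good.read_bytes()[:100])

    empty = Path(tmp) / "empty.pkl"
    empty.write_bytes(b"")

    results = {p.name: is_loadable(p) for p in (good, truncated, empty)}

print(results)
```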