
# Large Language Models and Article Extraction


Created by Sarah Oberbichler [ORCID](https://orcid.org/0000-0002-1031-2759)

### Using LLMs via APIs

For this course, we use the NVIDIA API, which provides up to 4,000 free credits for accessing the open-source model llama-3.1-nemotron-70b-instruct on NVIDIA's GPU infrastructure. Outside of chatbot applications, larger models demand significant computational resources.
While APIs offer access to models and GPU power through third parties when no local computing power is available, they typically:

*   Require payment beyond free trial credits
*   Should not be used with sensitive data
*   Should not be used with copyright-restricted data


### Using LLMs via APIs for the Analysis of Historical Newspapers
Historical newspapers published before 1940 are generally free from copyright protection and, when accessed through public newspaper platforms, are not classified as sensitive data. However, important considerations include:

*   Library licensing agreements may restrict usage
*   Cultural heritage institutions might have specific terms of use
*   Access and processing policies may vary by institution

When using APIs provided by third parties, make sure to check the licensing agreements of the data provider (e.g., the library). For example, newspapers marked with **Public Domain Mark 1.0 Universell** carry no usage restrictions.

In [1]:
!git clone https://github.com/ieg-dhr/NLP-Course4Humanities_2024.git

Cloning into 'NLP-Course4Humanities_2024'...
remote: Enumerating objects: 1501, done.
remote: Counting objects: 100% (237/237), done.
remote: Compressing objects: 100% (118/118), done.
remote: Total 1501 (delta 189), reused 119 (delta 119), pack-reused 1264 (from 2)
Receiving objects: 100% (1501/1501), 61.53 MiB | 21.47 MiB/s, done.
Resolving deltas: 100% (866/866), done.


# Setting up the Large Language Model

To use the large language model via the API, you need an API key: https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct. Add your private key to your Colab notebook under *Secrets* as NVIDIA_TOKEN. Then run the next cell to check that everything works as intended.

In [2]:
!pip uninstall -y httpx
!pip install httpx==0.27.2
from openai import OpenAI
from google.colab import userdata

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=userdata.get('NVIDIA_TOKEN'),
    # Remove any default timeout settings
    timeout=None
)


# Send a short test message and stream the reply
completion = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-70b-instruct",
    messages=[{"role": "user", "content": "Hello?"}],
    temperature=0.3,
    top_p=1,
    max_tokens=10024,
    stream=True,
)

for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")

Found existing installation: httpx 0.28.1
Uninstalling httpx-0.28.1:
  Successfully uninstalled httpx-0.28.1
Collecting httpx==0.27.2
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.4/76.4 kB 4.1 MB/s eta 0:00:00
Installing collected packages: httpx
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-genai 1.8.0 requires httpx<1.0.0,>=0.28.1, but you have httpx 0.27.2 which is incompatible.
Successfully installed httpx-0.27.2
Hello!

It's nice to meet you. Is there something I can help you with or would you like to:

1. **Chat about a topic** (e.g., hobbies, movies, books, or news)?
2. **Ask a question** on a specific subject (e.g., history, science, technology, or more)?
3. **Play a ga
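
The streaming loop in the cell above prints each delta as it arrives. If the full reply is needed afterwards, the deltas can also be joined into one string. The following is a minimal sketch, assuming chunks shaped like those in that loop; `collect_stream` and `_fake_chunk` are hypothetical helper names, and the fake chunks stand in for a real API response so the example runs without a key.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    pieces = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        # Skip chunks whose delta holds no text.
        if delta is not None:
            pieces.append(delta)
    return "".join(pieces)


def _fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (illustration only)."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo_chunks = [_fake_chunk("Hel"), _fake_chunk(None), _fake_chunk("lo!")]
demo_reply = collect_stream(demo_chunks)
print(demo_reply)
```

With a real response, `collect_stream(completion)` would replace the printing loop, since a stream can only be consumed once.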

# Importing the Dataset

In [3]:
import pandas as pd

# Load the dataset from the cloned course repository.
# Replace the path if you want to use your own Excel file.
df = pd.read_excel('/content/NLP-Course4Humanities_2024/datasets/Süddeutsche_Zeitung_Messina.xlsx')

# Keep only the first four rows, then display them to verify the data loaded correctly.
df = df[:4]
df.head()

Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext
0,22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0014_DDB...,14,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1909-05-23 12:00:00,['Hamburg'],['ger'],d9443de6-7d7f-49ae-bc86-2c04b4b242bf,['/data/altos/22/PD/22PDTLUY4AQ5TWJXNYQ52E2RIM...,FILE_0014_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Seite 14 Nr. 119 yamvurger Fremdenblatt. Sonnt...
1,22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0045_DDB...,45,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1909-05-23 12:00:00,['Hamburg'],['ger'],d9443de6-7d7f-49ae-bc86-2c04b4b242bf,['/data/altos/22/PD/22PDTLUY4AQ5TWJXNYQ52E2RIM...,FILE_0045_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Seite 45 ^Gruppen', sondern um .Bilder'. Das d..."
2,25DAA2WIW7V63U44WIHK2OTQUX55SE6M-FILE_0010_DDB...,10,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1911-01-14 12:00:00,['Hamburg'],['ger'],0a4b6b89-bd99-4dac-8e0c-f93890e41677,['/data/altos/25/DA/25DAA2WIW7V63U44WIHK2OTQUX...,FILE_0010_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"«eite 10 Hamburger Fremdenblatt. Sonnabend, 14..."
3,25JJY2NKEIQWL2FV3XBLV3HFXNRKENRV-FILE_0018_DDB...,18,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1909-02-26 12:00:00,['Hamburg'],['ger'],e98c28c0-1334-4f46-af04-6b011ac54ec2,['/data/altos/25/JJ/25JJY2NKEIQWL2FV3XBLV3HFXN...,FILE_0018_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,s eile 18. Haulburgcr ^reinvenblatt. Freitag. ...
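
Before sending pages to the API, it can help to check that the text column exists and how much text each row holds, because a missing or empty page would still trigger an API call in the cells further down. This is a small sketch under stated assumptions: `summarize_text_column` is a hypothetical helper, and the demo frame is made up for illustration (only the column name `plainpagefulltext` comes from the dataset).

```python
import pandas as pd


def summarize_text_column(frame: pd.DataFrame, text_column: str) -> pd.DataFrame:
    """Report, per row, whether the text is empty and how many characters it has."""
    if text_column not in frame.columns:
        raise KeyError(f"Column {text_column!r} not found in DataFrame")
    # Treat missing values as empty strings so the string methods work on every row.
    texts = frame[text_column].fillna("").astype(str)
    return pd.DataFrame(
        {
            "is_empty": texts.str.strip().eq(""),
            "n_chars": texts.str.len(),
        },
        index=frame.index,
    )


# Made-up demo data; the real notebook would pass `df` instead.
demo_df = pd.DataFrame({"plainpagefulltext": ["Seite 14 Nr. 119", None, "   "]})
demo_summary = summarize_text_column(demo_df, "plainpagefulltext")
print(demo_summary)
```

Rows flagged as empty could be dropped before the extraction step, for example with `df[~summary["is_empty"]]`.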


# Importing a Text File Containing an Example of How to Structure the Output

In [4]:

with open('/content/NLP-Course4Humanities_2024/datasets/structure_example_AS.txt', 'r') as file:
    examples = file.read()
examples

'**Extracted Article 1:**\n\n* **Relevant Topics:** Erdbeben\n\n**Original Text (unchanged):**\n\n**Verification:**\n* **Coherent Unit:** Yes\n* **Topic Presence:** Yes (Erdbeben)\n* **Completeness:** Yes (short report, completely exracted)\n* **Human Control Needed:** Yes, the article is too long or I might have overseen relevant articles\n'
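
Because the structure example asks the model for labelled bullet fields, a reply that follows it can be turned into a dictionary for filtering (for instance, to find all rows where human control is needed). This is a sketch, assuming the model reproduces the `* **Label:** value` pattern exactly; `parse_fields` is a hypothetical helper, and real replies may deviate from the pattern.

```python
import re
from typing import Dict

# Matches lines such as "* **Coherent Unit:** Yes".
FIELD_PATTERN = re.compile(
    r"^\*[ \t]+\*\*(?P<key>[^*:]+):\*\*[ \t]*(?P<value>.*)$",
    re.MULTILINE,
)


def parse_fields(model_output: str) -> Dict[str, str]:
    """Collect every '* **Label:** value' line into a label -> value mapping."""
    return {
        match.group("key").strip(): match.group("value").strip()
        for match in FIELD_PATTERN.finditer(model_output)
    }


# Shortened sample in the same shape as the structure example.
demo_output = (
    "**Extracted Article 1:**\n\n"
    "* **Relevant Topics:** Erdbeben\n\n"
    "**Verification:**\n"
    "* **Coherent Unit:** Yes\n"
    "* **Topic Presence:** Yes (Erdbeben)\n"
)
demo_fields = parse_fields(demo_output)
print(demo_fields)
```

If a reply contains several articles, later labels overwrite earlier ones in this sketch, so it would need to be applied per article.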

In [5]:
import pandas as pd
from typing import List, Dict
from openai import OpenAI

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key = userdata.get('NVIDIA_TOKEN')
)

def analyze_dataframe(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    def analyze_text(text: str) -> List[Dict[str, str]]:
        system_prompt = f"""
# System Instructions
You are an expert text analyst and information retrieval specialist and hate summarization as well as enumerations. Use {examples} for structuring your answer.
Your task is to carefully analyze given texts and extract complete articles that contain specific themes. You never change original texts.

Classify as relevant if the text contains:
- Primary earthquake terminology from the 19th and 20th century
- Official earthquake reports
- Geology and seismology
- Impact descriptions
- Solution descriptions
- Technical descriptions
- Aid
- Honors
- Political discussion and opinions on earthquakes
- Stories from victims and refugees
- Reports on refugees and victims
- Lives of victims
- Historical references
- Comparisons

Your output should consist of the extracted articles and the verification

Maintain a neutral, objective stance throughout the analysis. Focus on accuracy and completeness in your extractions
"""
        user_prompt = f"""
# Task Instructions
Bitte führe die folgenden Schritte aus:
1. Lies jeden Text aufmerksam durch. Behandle jeden Text als eigene Einheit, ohne auf andere Texte zu referieren
2. Identifiziere alle Artikel zum Thema Erdbeben und Erdstoß
3. Für jedes Vorkommen des Themas:
   a. Bestimme den Anfang des Artikels, in dem das Thema vorkommt.
   b. Kontrolliere Satz für Satz, ob diese zusammengehören. Beende den Artikel, wenn die Sätze nicht mehr zusammengehören.
   c. Markiere den vollständigen Artikel von Anfang bis Ende.
   d. Wenn der Artikel zu lang für eine Antwort ist, antworte mit Ja auf "article too long, human addition needed":
   e. Berücksichtige auch sehr kurze und sehr lange Artikel
4. Überprüfe jeden markierten Artikel:
   a. Stelle sicher, dass er eine Einheit bildet, auch wenn es nicht mehr um Erdbeben geht.
   b. Vergewissere dich, dass er eines der genannten Themen enthält.
   c. Prüfe, ob der extrahierte Text tatsächlich im Dokument ist
5. Extrahiere jeden überprüften Artikel als Originaltext, der nichts als den originalen Text enthält
6. Korrigiere OCR-Fehler
7. Wenn keine Artikel gefunden wurden, gib "Keine Artikel mit dem angegebenen Thema gefunden." aus.

Führe nun diese Schritte für den folgenden Text aus:
{text}
"""
        try:
            messages = [
                {
                    'role': 'system',
                    'content': system_prompt
                },
                {
                    'role': 'user',
                    'content': user_prompt
                }
            ]

            completion = client.chat.completions.create(
                model="nvidia/llama-3.1-nemotron-70b-instruct",
                messages=messages,
                temperature=0.0,
                max_tokens=20000
            )

            content = completion.choices[0].message.content

            # Split the content into individual articles
            articles = []
            if "Keine Artikel mit dem angegebenen Thema gefunden." in content:
                return []

            # Split by "**END OF ARTICLE**" if present, otherwise treat as single article
            if "**END OF ARTICLE**" in content:
                parts = content.split("**END OF ARTICLE**")
                articles = [{"article": part.strip()} for part in parts if part.strip()]
            else:
                articles = [{"article": content.strip()}]

            return articles

        except Exception as e:
            print(f"Error in AI processing: {str(e)}")
            return []

    # Apply the analysis to each row in the DataFrame
    all_articles = []
    for index, row in df.iterrows():
        articles = analyze_text(row[text_column])
        for i, article in enumerate(articles, 1):
            new_row = row.to_dict()
            new_row['extracted_article'] = article['article']
            new_row['article_part'] = i
            new_row['total_parts'] = len(articles)
            all_articles.append(new_row)

    # Create a new DataFrame with individual rows for each article
    result_df = pd.DataFrame(all_articles)

    return result_df

# Usage example
text_column = 'plainpagefulltext'
result_df = analyze_dataframe(df, text_column)

# Save the results to an Excel file
result_df.to_excel('test_1.xlsx', index=False)

# Display the first few rows of the result
print(result_df.head())

                                             page_id  pagenumber  \
0  22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0014_DDB...          14   
1  22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0045_DDB...          45   
2  25DAA2WIW7V63U44WIHK2OTQUX55SE6M-FILE_0010_DDB...          10   
3  25JJY2NKEIQWL2FV3XBLV3HFXNRKENRV-FILE_0018_DDB...          18   

              paper_title                   provider_ddb_id  \
0  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
1  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
2  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
3  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   

                                            provider     zdb_id  \
0  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
1  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
2  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
3  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   

     pu
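
Step 4c of the prompt asks the model to check that the extracted text really occurs in the document, but that check can also be made outside the model. Since the prompt also asks for OCR corrections, an exact substring test would often fail, so the sketch below measures how much of the extracted text aligns with the source page. It is a rough signal only: `coverage_in_source` is a hypothetical helper, scattered single-character matches also count, and `SequenceMatcher` can be slow on very long pages.

```python
from difflib import SequenceMatcher


def coverage_in_source(extracted: str, source: str) -> float:
    """Return the share of characters in `extracted` that align with `source`."""
    if not extracted:
        return 0.0
    # autojunk=False disables difflib's heuristic that ignores frequent characters.
    matcher = SequenceMatcher(None, extracted, source, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(extracted)


# Made-up strings for illustration.
demo_source = "Seite 14. Erdbeben in Messina. Sonntag"
demo_full = coverage_in_source("Erdbeben in Messina", demo_source)
demo_none = coverage_in_source("xyz", "Erdbeben")
print(demo_full, demo_none)
```

In the notebook, this could be applied per row to `extracted_article` and `plainpagefulltext`; a low value would suggest that the model paraphrased or invented text and that the row needs manual review.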

Write a prompt for OCR Post-Correction

In [6]:
#Research for OCR Post-Correction Prompting: https://aclanthology.org/2024.lt4hala-1.14.pdf
#https://essay.utwente.nl/102117/1/Veninga_MA_EEMCS.pdf

import pandas as pd
from openai import OpenAI

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key = userdata.get('NVIDIA_TOKEN')
)

# Process the DataFrame
all_articles = []
for index, row in result_df.iterrows():
    try:
        # Make API call
        completion = client.chat.completions.create(
            model="nvidia/llama-3.1-nemotron-70b-instruct",
            messages=[
                {
                    'role': 'system',
                    'content': """
# System Instructions: You are a language model. Your task is to fix the OCR errors in the provided German text. Please use your understanding of the German language to correct wrongly spelled words, wrongly inserted spaces, and wrongly inserted punctuation marks. Please focus on correcting misread characters that look like numbers.

Please maintain the original meaning and structure of the provided text. Your output should consist of the corrected articles.
                    """
                },
                {
                    'role': 'user',
                    'content': f"""# Task Instructions:
Text to analyze:
{row['extracted_article']}"""
                }
            ],
            temperature=0.0,
            max_tokens=20000
        )

        content = completion.choices[0].message.content

        # Process articles
        if content and "Keine Artikel mit dem angegebenen Thema gefunden." not in content:
            new_row = row.to_dict()
            new_row['article_corrected'] = content.strip()
            all_articles.append(new_row)

    except Exception as e:
        print(f"Error processing row {index}: {str(e)}")
        continue

# Create final DataFrame
result_2_df = pd.DataFrame(all_articles)

# Save to Excel
result_2_df.to_excel('test_1b.xlsx', index=False)

# Display results
print(result_2_df.head())

                                             page_id  pagenumber  \
0  22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0014_DDB...          14   
1  22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0045_DDB...          45   
2  25DAA2WIW7V63U44WIHK2OTQUX55SE6M-FILE_0010_DDB...          10   
3  25JJY2NKEIQWL2FV3XBLV3HFXNRKENRV-FILE_0018_DDB...          18   

              paper_title                   provider_ddb_id  \
0  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
1  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
2  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
3  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   

                                            provider     zdb_id  \
0  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
1  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
2  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
3  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   

     pu
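
The correction prompt asks the model to keep the original meaning and structure, so a corrected text should stay close to its input. One way to spot rows where the model rewrote rather than corrected is a character-level similarity score. This is a minimal sketch: `correction_similarity` and `flag_heavy_rewrites` are hypothetical helpers, and the threshold of 0.8 is an arbitrary assumption that would need tuning on real data.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def correction_similarity(before: str, after: str) -> float:
    """Similarity between the OCR text and its correction (1.0 means identical)."""
    return SequenceMatcher(None, before, after, autojunk=False).ratio()


def flag_heavy_rewrites(
    pairs: Sequence[Tuple[str, str]], threshold: float = 0.8
) -> List[int]:
    """Return the positions of pairs whose similarity falls below `threshold`."""
    return [
        position
        for position, (before, after) in enumerate(pairs)
        if correction_similarity(before, after) < threshold
    ]


# Made-up pairs: a light correction and a complete rewrite.
demo_pairs = [
    ("in Braud.", "in Brand."),
    ("fin gen Feirer", "Das Haus brannte vollständig nieder."),
]
demo_flagged = flag_heavy_rewrites(demo_pairs)
print(demo_flagged)
```

In the notebook, the pairs would come from the `extracted_article` and `article_corrected` columns of `result_2_df`.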

In [7]:
#Attempt with fire descriptions

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key = userdata.get('NVIDIA_TOKEN')
)

def analyze_dataframe(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    def analyze_text(text: str) -> List[Dict[str, str]]:
        system_prompt = f"""
# System Instructions
You are an expert text analyst and information retrieval specialist and hate summarization as well as enumerations. Use {examples} for structuring your answer.
Your task is to carefully analyze given texts and extract complete articles that contain specific themes. You never change original texts.

Classify as relevant if the text contains:
- Fire terminology from the 19th and 20th century
- Reports on fire, flames, and ashes
- Official fire reports
- City fires
- City areas, companies, buildings destroyed by fire
- City areas, companies, buildings damaged by fire
- Damage descriptions
- Causes of fires
- Reports on victims, deaths, injuries
- Loss of life, property, goods
- Fire-fighting attempts
- Financial aid
- Comparisons to other disasters

Your output should consist of the extracted articles and the verification

Maintain a neutral, objective stance throughout the analysis. Focus on accuracy and completeness in your extractions
"""
        user_prompt = f"""
# Task Instructions
Bitte führe die folgenden Schritte aus:
1. Lies jeden Text aufmerksam durch. Behandle jeden Text als eigene Einheit, ohne auf andere Texte zu referieren
2. Identifiziere alle Artikel zum Thema Feuer, Flammen und Brand
3. Für jedes Vorkommen des Themas:
   a. Bestimme den Anfang des Artikels, in dem das Thema vorkommt.
   b. Kontrolliere Satz für Satz, ob diese zusammengehören. Beende den Artikel, wenn die Sätze nicht mehr zusammengehören.
   c. Markiere den vollständigen Artikel von Anfang bis Ende.
   d. Wenn der Artikel zu lang für eine Antwort ist, antworte mit Ja auf "article too long, human addition needed":
   e. Berücksichtige auch sehr kurze und sehr lange Artikel
4. Überprüfe jeden markierten Artikel:
   a. Stelle sicher, dass er eine Einheit bildet, auch wenn es nicht mehr um Feuer geht.
   b. Vergewissere dich, dass er eines der genannten Themen enthält.
   c. Prüfe, ob der extrahierte Text tatsächlich im Dokument ist
5. Extrahiere jeden überprüften Artikel als Originaltext, der nichts als den originalen Text enthält
6. Korrigiere OCR-Fehler
7. Wenn keine Artikel gefunden wurden, gib "Keine Artikel mit dem angegebenen Thema gefunden." aus.

Führe nun diese Schritte für den folgenden Text aus:
{text}
"""
        try:
            messages = [
                {
                    'role': 'system',
                    'content': system_prompt
                },
                {
                    'role': 'user',
                    'content': user_prompt
                }
            ]

            completion = client.chat.completions.create(
                model="nvidia/llama-3.1-nemotron-70b-instruct",
                messages=messages,
                temperature=0.0,
                max_tokens=20000
            )

            content = completion.choices[0].message.content

            # Split the content into individual articles
            articles = []
            if "Keine Artikel mit dem angegebenen Thema gefunden." in content:
                return []

            # Split by "**END OF ARTICLE**" if present, otherwise treat as single article
            if "**END OF ARTICLE**" in content:
                parts = content.split("**END OF ARTICLE**")
                articles = [{"article": part.strip()} for part in parts if part.strip()]
            else:
                articles = [{"article": content.strip()}]

            return articles

        except Exception as e:
            print(f"Error in AI processing: {str(e)}")
            return []

    # Apply the analysis to each row in the DataFrame
    all_articles = []
    for index, row in df.iterrows():
        articles = analyze_text(row[text_column])
        for i, article in enumerate(articles, 1):
            new_row = row.to_dict()
            new_row['extracted_article'] = article['article']
            new_row['article_part'] = i
            new_row['total_parts'] = len(articles)
            all_articles.append(new_row)

    # Create a new DataFrame with individual rows for each article
    result_df = pd.DataFrame(all_articles)

    return result_df

# Usage example
text_column = 'plainpagefulltext'
result_df = analyze_dataframe(df, text_column)

# Save the results to an Excel file
result_df.to_excel('test_2.xlsx', index=False)

# Display the first few rows of the result
print(result_df.head())

                                             page_id  pagenumber  \
0  22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0014_DDB...          14   
1  22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0045_DDB...          45   
2  25DAA2WIW7V63U44WIHK2OTQUX55SE6M-FILE_0010_DDB...          10   
3  25JJY2NKEIQWL2FV3XBLV3HFXNRKENRV-FILE_0018_DDB...          18   

              paper_title                   provider_ddb_id  \
0  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
1  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
2  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
3  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   

                                            provider     zdb_id  \
0  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
1  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
2  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
3  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   

     pu
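
The earthquake and fire cells differ only in their relevance criteria and topic words, so the system prompt could be assembled from a list instead of copying the whole function. This is a sketch under stated assumptions: `build_system_prompt` and `TOPIC_CRITERIA` are hypothetical names, the prompt wording is abbreviated, and the criteria lists are shortened for illustration.

```python
from typing import Dict, List


def build_system_prompt(criteria: List[str], structure_example: str) -> str:
    """Assemble a system prompt from a list of relevance criteria."""
    bullets = "\n".join(f"- {item}" for item in criteria)
    return (
        "# System Instructions\n"
        "You are an expert text analyst and information retrieval specialist. "
        f"Use {structure_example} for structuring your answer.\n"
        "You never change original texts.\n\n"
        "Classify as relevant if the text contains:\n"
        f"{bullets}\n\n"
        "Your output should consist of the extracted articles and the verification\n"
    )


# Shortened criteria lists, for illustration only.
TOPIC_CRITERIA: Dict[str, List[str]] = {
    "earthquake": ["Official earthquake reports", "Impact descriptions"],
    "fire": ["Official fire reports", "City fires", "Causes of fires"],
}

# "<structure example>" is a placeholder for the loaded `examples` text.
fire_prompt = build_system_prompt(TOPIC_CRITERIA["fire"], "<structure example>")
print(fire_prompt)
```

Keeping the criteria in one mapping also reduces the risk of topic words from one prompt being left behind in another when a cell is copied.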

In [8]:
#Prompting OCR Post-Correction with Examples

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key = userdata.get('NVIDIA_TOKEN')
)

# Process the DataFrame
all_articles = []
for index, row in result_df.iterrows():
    try:
        # Make API call
        completion = client.chat.completions.create(
            model="nvidia/llama-3.1-nemotron-70b-instruct",
            messages=[
                {
                    'role': 'system',
                    'content': """
# System Instructions: You are a language model. Your task is to fix the OCR errors in the provided German text. Please use your understanding of the German language to correct wrongly spelled words, wrongly inserted spaces, and wrongly inserted punctuation marks. Please focus on correcting misread characters that look like numbers.

Here are some examples of OCR corrections:
OCR Text: 'in Braud.' Corrected text: 'in Brand.'
OCR Text: 'fin gen Feirer' Corrected text: 'fingen Feuer'
OCR Text: 'D r e i M ä d ch e n wurden ge tötet' Corrected text: 'Drei Mädchen wurden getötet'
OCR Text: 'e x p l o d i ei r t e ein Zünder' Corrected text: 'explodierte ein Zünder'

Please maintain the original meaning and structure of the provided text. Your output should consist of the corrected articles.
"""
                },
                {
                    'role': 'user',
                    'content': f"""# Task Instructions:
Text to analyze:
{row['extracted_article']}"""
                }
            ],
            temperature=0.0,
            max_tokens=20000
        )

        content = completion.choices[0].message.content

        # Process articles
        if content and "Keine Artikel mit dem angegebenen Thema gefunden." not in content:
            new_row = row.to_dict()
            new_row['article_corrected'] = content.strip()
            all_articles.append(new_row)

    except Exception as e:
        print(f"Error processing row {index}: {str(e)}")
        continue

# Create final DataFrame
result_2_df = pd.DataFrame(all_articles)

# Save to Excel
result_2_df.to_excel('test_2b.xlsx', index=False)

# Display results
print(result_2_df.head())

                                             page_id  pagenumber  \
0  22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0014_DDB...          14   
1  22PDTLUY4AQ5TWJXNYQ52E2RIMQI2RNP-FILE_0045_DDB...          45   
2  25DAA2WIW7V63U44WIHK2OTQUX55SE6M-FILE_0010_DDB...          10   
3  25JJY2NKEIQWL2FV3XBLV3HFXNRKENRV-FILE_0018_DDB...          18   

              paper_title                   provider_ddb_id  \
0  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
1  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
2  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   
3  Hamburger Fremdenblatt  BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4   

                                            provider     zdb_id  \
0  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
1  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
2  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   
3  Staats- und Universitätsbibliothek Hamburg Car...  3024925-9   

     pu
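
The few-shot examples in the prompt above follow a fixed pattern, so they can be generated from a list of pairs, which makes it easier to add or swap examples without editing the prompt text by hand. This is a minimal sketch: `format_ocr_examples` is a hypothetical helper, and the pairs are taken from the prompt above.

```python
from typing import List, Sequence, Tuple


def format_ocr_examples(pairs: Sequence[Tuple[str, str]]) -> str:
    """Render (OCR text, corrected text) pairs as few-shot example lines."""
    lines: List[str] = ["Here are some examples of OCR corrections:"]
    for noisy, clean in pairs:
        lines.append(f"OCR Text: '{noisy}' Corrected text: '{clean}'")
    return "\n".join(lines)


demo_examples = [
    ("in Braud.", "in Brand."),
    ("fin gen Feirer", "fingen Feuer"),
]
demo_block = format_ocr_examples(demo_examples)
print(demo_block)
```

The resulting block could then be inserted between the task description and the closing instruction of the system prompt.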