# Notebook: Scraping the N.S. Frauen-Warte and exporting to CSV

To scrape the *Frauen-Warte*, enter `year_number` as it appears in the Heidelberg URL (for example `frauenwarte1944`), and enter `start_page` and `end_page` as integers without leading zeros.

For the available years and the page numbers of each volume, see: https://digi.ub.uni-heidelberg.de/diglit/frauenwarte
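
To make the parameter format concrete, the sketch below shows how the values could map to a page URL. The helper name `build_page_url` is hypothetical and not part of this notebook; the URL pattern is the one used in `scrape_page`, and the page number is zero-padded to four digits as in the scraping loop.

```python
BASE_URL = "https://digi.ub.uni-heidelberg.de/diglit"

def build_page_url(year_number: str, page: int) -> str:
    """Return the OCR view URL for one page (hypothetical helper)."""
    # The notebook zero-pads page numbers to four digits, so 1 becomes '0001'
    return f"{BASE_URL}/{year_number}/{page:04d}/image,info,text_ocr#col_text_ocr"

print(build_page_url("frauenwarte1944", 1))
```

Opening the printed URL in a browser is a quick way to check that a volume identifier and page number exist before starting a longer run.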


In [None]:
#@title 0. Enter Parameters
year_number = 'frauenwarte1944' #@param {type:"string"}
start_page =  1 #@param {type:"integer"}
end_page = 44 #@param {type:"integer"}

## 1. Install and import required libraries

In [None]:
!pip install requests beautifulsoup4



In [None]:
import requests
from bs4 import BeautifulSoup

## 2. Function to fetch and parse webpage

In [None]:
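def scrape_page(page_number):
    """Fetch one page of the volume `year_number` and return (page label, OCR text)."""
    # page_number is expected as a zero-padded string such as '0001'
    url = f"https://digi.ub.uni-heidelberg.de/diglit/{year_number}/{page_number}/image,info,text_ocr#col_text_ocr"
    try:
        # A timeout keeps a stalled connection from hanging the whole run
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        return 'Failed to fetch page', ''
    if response.status_code != 200:
        return 'Failed to fetch page', ''

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the page label from the navigation bar; either element may be missing
    page_num_element = soup.find('li', class_='navbar-text ml-4')
    badge = page_num_element.find('div', class_='badge badge-secondary') if page_num_element else None
    page_num_text = badge.text if badge else 'Page number not found'

    # Extract the OCR text by joining all text parts inside the OCR column
    ocr_text_element = soup.find(id='column_text_ocr')
    ocr_texts = ocr_text_element.find_all(class_='text_part') if ocr_text_element else []
    ocr_text = ' '.join(text.get_text(strip=True) for text in ocr_texts)

    return page_num_text, ocr_text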
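
Because `scrape_page` returns `'Failed to fetch page'` instead of raising an error, a temporary server problem silently produces an empty row. One possible way to handle this is to retry a few times with a pause between attempts. The sketch below is an assumption, not part of the notebook: `scrape_with_retry`, `attempts`, and `delay` are hypothetical names, and the scraper is passed in as an argument so the wrapper can be tried without network access.

```python
import time

def scrape_with_retry(scrape, page_number, attempts=3, delay=1.0):
    """Call scrape(page_number) until it succeeds or attempts run out."""
    result = ('Failed to fetch page', '')
    for attempt in range(attempts):
        result = scrape(page_number)
        # scrape_page signals failure through this sentinel string
        if result[0] != 'Failed to fetch page':
            return result
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before the next try
    return result
```

In the loop in section 3, this could be called as `scrape_with_retry(scrape_page, formatted_number)` in place of the direct call.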


## 3. Iterate through page numbers and collect data

In [None]:
# Note: range() excludes its upper bound, so the last page scraped is end_page - 1
page_numbers = range(start_page, end_page)

# dictionary to store the results, with page numbers as keys
results = {}

for number in page_numbers:
    formatted_number = f"{number:04d}"  # Format number to 4 digits with leading zeros
    page_text, ocr_text = scrape_page(formatted_number)
    results[formatted_number] = {'page_text': page_text, 'ocr_text': ocr_text}

    # print the progress
    print(f"Scraped page {formatted_number}")


Scraped page 0001
Scraped page 0002
Scraped page 0003
Scraped page 0004
Scraped page 0005
Scraped page 0006
Scraped page 0007
Scraped page 0008
Scraped page 0009
Scraped page 0010
Scraped page 0011
Scraped page 0012
Scraped page 0013
Scraped page 0014
Scraped page 0015
Scraped page 0016
Scraped page 0017
Scraped page 0018
Scraped page 0019
Scraped page 0020
Scraped page 0021
Scraped page 0022
Scraped page 0023
Scraped page 0024
Scraped page 0025
Scraped page 0026
Scraped page 0027
Scraped page 0028
Scraped page 0029
Scraped page 0030
Scraped page 0031
Scraped page 0032
Scraped page 0033
Scraped page 0034
Scraped page 0035
Scraped page 0036
Scraped page 0037
Scraped page 0038
Scraped page 0039
Scraped page 0040
Scraped page 0041
Scraped page 0042
Scraped page 0043
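
The log above stops at 0043 although `end_page` is 44, because `range(start_page, end_page)` excludes its upper bound. If the last page should be included, one option is to extend the range by one. A minimal sketch, where `page_labels` is a hypothetical helper and not part of the notebook:

```python
def page_labels(start_page, end_page, inclusive=True):
    """Return zero-padded page labels, optionally including end_page."""
    stop = end_page + 1 if inclusive else end_page
    return [f"{n:04d}" for n in range(start_page, stop)]

labels = page_labels(1, 44)
print(labels[0], labels[-1], len(labels))  # prints: 0001 0044 44
```

With `inclusive=False` the helper reproduces the behaviour of the loop as written.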


## 4. Save and download scraped data

In [None]:
import pandas as pd

df = pd.DataFrame.from_dict(results, orient='index')

In [None]:
df

|  | page_text | ocr_text |
|---|---|---|
| 1 | Seite: 1 | srauen-Varktc! i s e i n 2: i g e p 3 ^ t 6 i ... |
| 2 | Seite: 2 | All«s Schwe« viklitrt «n Btixutung, wtnn Aeir ... |
| 3 | Seite: 3 | Ois Ssrictitsr clsc fcsusn isigsc, I.isbs unci... |
| 4 | Seite: 4 | Am Tage ror der Begebenhei« waren wir auf unse... |
| 5 | Seite: 5 | LIHL LL2lllI-v^6 LV8 VLLl X06L6LLILI VO^ U-ö. ... |
| 6 | Seite: 6 | kot ^nrpruct, ouf «ien kiausarbsilrlog?2)ie Fr... |
| 7 | Seite: 7 | pcreirlX vno^«xt«I!« : 5ur« 8«»kk»ff. ^s^ckt««... |
| 8 | Seite: 8 | 4-9157 V 4S119^tt 49149 49159 V 1451Vin durchg... |
| 9 | Seite: 9 | Dar flvircknkmalr irt unr in «Ivr Irucr«n 2«il... |
| 10 | Seite: 10 | 2»ln Bau einrr provisorischen Feurrstelle werd... |


In [None]:
from google.colab import files

df_name = f"{year_number}.csv"

df.to_csv(df_name)
files.download(df_name)
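
`google.colab.files` is only available inside Google Colab. Outside Colab, the same `results` dictionary could be written with the standard library instead of pandas. The following is a sketch under that assumption; `write_results_csv` and the `page` column name are hypothetical, and the sample data is made up for illustration.

```python
import csv
import io

def write_results_csv(results, handle):
    """Write {page: {'page_text': ..., 'ocr_text': ...}} rows as CSV."""
    writer = csv.writer(handle)
    writer.writerow(['page', 'page_text', 'ocr_text'])
    for page, row in results.items():
        # Page keys stay strings, so leading zeros such as '0001' are preserved
        writer.writerow([page, row['page_text'], row['ocr_text']])

# Illustrative data only; in the notebook, pass the real `results` dict
sample = {'0001': {'page_text': 'Seite: 1', 'ocr_text': 'example, text'}}
buffer = io.StringIO()
write_results_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of a string buffer, open the target with `open(f"{year_number}.csv", "w", newline="", encoding="utf-8")` and pass that handle to the function.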
