# Notebook introduction

In this notebook, travel-related texts are fetched from the API of the Biodiversity Heritage Library (BHL), https://www.biodiversitylibrary.org/.
A travel-related search term is entered manually, and all texts whose title contains that term are fetched.
The result is stored as a dataframe and then saved as a .csv file.

These .csv files are further processed and enriched with metadata in other notebooks.

In [None]:
import requests
import pandas as pd

In [None]:
#user key for BHL API
key = "d8ffd12f-b06d-49a6-acaa-d830fcb30083"
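
Hard-coding the key in a cell means it is shared along with the notebook. As a minimal sketch of an alternative, the key could be read from an environment variable. The variable name `BHL_API_KEY` and the helper `load_bhl_key` are assumptions made for illustration; they are not part of the BHL API or of this notebook.

```python
import os

def load_bhl_key(env_var="BHL_API_KEY"):
    """Return the BHL API key stored in an environment variable.

    The variable name is a hypothetical convention, not something BHL prescribes.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the environment variable {env_var} to your BHL API key")
    return key

# Usage in the notebook (after setting the variable in the shell):
# key = load_bhl_key()
```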

# Fetch BHL data


In [None]:
#define user agent
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"

In [None]:
headers = {"User-Agent": ua}

In [None]:
#enter travel-related search term
search = "reys"

In [None]:
#define request
title_IDS = requests.get(f"https://www.biodiversitylibrary.org/api2/httpquery.ashx?op=TitleSearchSimple&title={search}&apikey={key}&format=json", headers = headers)
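
The request above pastes the search term straight into the URL, which works for a single word such as "reys" but may break for terms that contain spaces or special characters. The sketch below shows one way to build the same query string with URL encoding, using only the standard library. The helper name `build_bhl_url` and the placeholder key are assumptions for illustration; passing a dictionary to the `params` argument of `requests.get` should have a similar effect.

```python
from urllib.parse import urlencode

BHL_ENDPOINT = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"

def build_bhl_url(op, apikey, **params):
    """Build a BHL API v2 URL with properly encoded query parameters."""
    query = {"op": op, **params, "apikey": apikey, "format": "json"}
    return f"{BHL_ENDPOINT}?{urlencode(query)}"

# "YOUR-KEY" is a placeholder, not a real key
url = build_bhl_url("TitleSearchSimple", "YOUR-KEY", title="reys naar")
print(url)
```

The resulting URL can be passed to `requests.get(url, headers = headers)` exactly as in the cell above.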

In [None]:
#check number of titles fetched
x = title_IDS.json()
print("Number of title_IDS:")
len(x["Result"])

Number of title_IDS:


6

In [None]:
#keep only results that have a title id, so nothing is fetched for missing ones
title_IDS = [result["TitleID"] for result in x["Result"] if result["TitleID"] is not None]

In [None]:
#Fetch books from Biodiversity Heritage Library based on presence of travel-related term in the title.
#Assumption: books with travel-related term in the title are travel-related.

books = []
failed = [] #item_ids whose OCR request failed, saved to attempt again (never needed this)
cnt = 0
for title_id in title_IDS:
  title_metadata = requests.get(f"https://www.biodiversitylibrary.org/api2/httpquery.ashx?op=GetTitleMetadata&titleid={title_id}&items=t&apikey={key}&format=json", headers = headers)
  title_metadata = title_metadata.json()

  item_id = [item["ItemID"] for item in title_metadata["Result"]["Items"]] #fetch item_ids (identifiers of the volumes of the work)
  for item_i in item_id:
    try:
      title = title_metadata["Result"]["FullTitle"] #fetch title of the work
      author = title_metadata["Result"]["Authors"][0]["Name"] #fetch name of the first author
      pub_year = title_metadata["Result"]["Items"][0]["Year"] #publication year (taken from the first item)
      language = title_metadata["Result"]["Items"][0]["Language"] #language (taken from the first item)
    except (KeyError, IndexError, TypeError):
      continue #skip items whose metadata is incomplete
    cnt += 1
    print(cnt)
    print(f"Fetching: {title} \n By: {author} \n Written in: {pub_year} \n Language: {language}")
    print(item_i)

    #fetch OCR'ed text for every volume pertaining to every title_id
    text = requests.get(f"https://www.biodiversitylibrary.org/api2/httpquery.ashx?op=GetItemPages&itemid={item_i}&ocr=t&apikey={key}&format=json", headers = headers)
    text = text.json()
    if text["Status"] == "error":
      failed.append(item_i)
      continue
    ocr = [page["OcrText"] for page in text["Result"]] #fetch OCR for 1 item_ID
    if ocr:
      print("OCR available")

    #populate dictionary "book" with text and metadata
    #(keys match the column names used in the cells below)
    book = {}
    book["Title"] = title
    book["Author"] = author
    book["Publication_year"] = pub_year
    book["Language"] = language
    book["FullText"] = " ".join(ocr).replace("\n", "")
    book["Item_id"] = item_i

    books.append(book)


1
Fetching: Fünfter Theil der Orientalischen Indien, eygentlicher Bericht vnd warhafftige Beschreibung der gantzen volkommenen Reyse oder Schiffart, so die Holländer mit acht Schiffen in die Orientalische Indien, sonderlich aber in die Javanische vnd Molukische Inseln, als Bantam, Banda, vnd Ternate, &c. gethan haben, welche von Amsterdam abgefahren im Jahr 1598. vnd zum Theil Anno 1599. zum Theil aber in Jüngst abgelauffenen 1600. Jahr, : mit grossen Reichthumb von Pfeffer Muscaten, Regelein, vnd anderer köstlichen Würtz, wider anheym gelanget, darinn fleissig beschrieben vnd angezeigt, was ihnen auff der gantzen Reyse Denckwürdiges begegnet vnd zuhanden gangen 
 By: Neck, Jacob Cornelisz. van, 
 Written in:  
 Language: German
171866
OCR available
2
Fetching: Fünfter Theil der Orientalischen Indien, eygentlicher Bericht vnd warhafftige Beschreibung der gantzen volkommenen Reyse oder Schiffahrt, so die Holländer mit 8. Schiffen in die Orientalische Indien, sond[er]lich aber in die Jav

In [None]:
len(books)

6
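
The loop above joins the page texts with `" ".join(ocr).replace("\n", "")`, which removes line breaks without inserting a space and would fail if a page had no OCR text. The sketch below shows one possible alternative that tolerates missing page text and collapses all whitespace into single spaces. The helper name `join_ocr` is hypothetical, and the assumption that `OcrText` can be absent or `None` is not confirmed by the output shown here.

```python
def join_ocr(pages):
    """Join the OCR text of all pages of one item into a single string.

    Pages without text (missing key or None) are treated as empty,
    and every run of whitespace, including line breaks, becomes one space.
    """
    texts = [page.get("OcrText") or "" for page in pages]
    return " ".join(" ".join(texts).split())

# Made-up sample in the shape of a GetItemPages "Result" list
sample_pages = [
    {"OcrText": "Reyse nach\nIndien"},
    {"OcrText": None},
    {"OcrText": "Anno 1598"},
]
print(join_ocr(sample_pages))
```

Inside the loop this could replace the list comprehension and the join, for example `book["FullText"] = join_ocr(text["Result"])`. Note that it changes the stored text, because words at line ends are no longer glued to the next line.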

## To csv

In [None]:
df = pd.DataFrame(books) #change fetched book texts and metadata to dataframe object

In [None]:
df = df.reset_index(drop = True)

In [None]:
df.tail() #check

Unnamed: 0,Title,Author,Publication_year,Language,FullText,Item_id
20,Reize door de binnenlanden van Noord-Amerika,"Carver, Jonathan,",,Dutch,IMAGE EVALUATION TEST TARGET (MT-3) 1.0 l.l ■...,101068
21,Reize naar Arabie?? en andere omliggende landen,"Niebuhr, Carsten,",1776.0,Dutch,""" .. Q) CD sz o s 1q - * — naturalis nationaa...",211502
22,"Reize om de wereld gedaan in de jaren 1803, 18...","Kruzenshtern, Ivan Fedorovich,",,Dutch,t IMAGE EVALUATION TEST TARGET (MT-S) RHl! 1....,85276
23,"Reize van Aleppo naar Jeruzalem, op paasschen ...","Maundrell, Henry,",1705.0,Dutch,"£3 Z hljrpozoc < , • < - • ' ' ■ A BIBUCTHSC...",209966
24,Untersuchungen über Reizerscheinungen bei den ...,"Polowzow, Warwara,",,German,m Untersuchungen über Rcizcrscbeinungen ...,46882


In [None]:
df["Language"].value_counts() #check language distribution

German    3
Dutch     3
Name: Language, dtype: int64

In [None]:
df.to_csv(f"{search}.csv") #save to .csv-file with name of travel-related term

In [None]:
df[df["FullText"] == ""] #check for entries where texts are not available

Unnamed: 0,Title,Author,Publication_year,Language,FullText,Item_id
4,Apparent triploidy in the unisexual brahminy b...,"Cole, Charles J.",,English,,171114
15,"Chromosome evolution in selected treefrogs, in...","Cole, Charles J.",,English,,168661
152,"On North American moths, with the description ...","Beutenmüller, William,",,English,,167304
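
The comparison with `""` only finds texts that are exactly an empty string. If the dataframe is reloaded from a .csv file, pandas reads empty fields as missing values (NaN) by default, so such rows would be overlooked. The sketch below, run on a small made-up dataframe, catches empty, whitespace-only, and missing texts in one mask. The sample rows are invented for illustration.

```python
import pandas as pd

# Made-up sample with the same column names as the dataframe above
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "FullText": ["", None, "   ", "some OCR text"],
})

# Treat missing values as empty strings, strip whitespace, then test for emptiness
no_text = sample["FullText"].fillna("").str.strip().eq("")
print(sample[no_text])
```

Applied to the real data, the same mask would be built from `df["FullText"]`.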
