# Noscemus_ETF metadata extraction notebook
- Run every cell in the notebook.
- Afterwards, a pandas DataFrame named "metadata_tabulka" holds the metadata of all 994 works in the Noscemus corpus.
- The first row of the DataFrame holds the column headers. There are 26 columns, but only the first 15 have headers; the data from the 16th column onward come from Noscemus internal notes.
- The last cell exports all the data to a .csv file. You will be prompted for the destination path, for example "~/Documents/noscemus_metadata.csv".
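
Because the header labels are stored as the first data row rather than as real column names, it can be convenient to promote them before analysis. The following is a minimal sketch on a toy frame, not part of the original notebook; the helper `promote_header` and the placeholder names `extra_<n>` for unlabelled columns are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for metadata_tabulka: row 0 holds the header labels and
# the last column has no label (like the trailing columns in the notebook).
toy = pd.DataFrame([
    ["Author", "Full title", "Year", None],
    ["Pardies, Ignace Gaston", "A Latin Letter", "1672", "Internal notes"],
    ["Addison, Joseph", "Ad insignissimum virum", "1698", "Internal notes"],
])


def promote_header(df):
    """Use row 0 as column names; unlabelled columns get the placeholder 'extra_<n>'."""
    header = [
        label if isinstance(label, str) and label else f"extra_{i}"
        for i, label in enumerate(df.iloc[0])
    ]
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = header
    return body


tidy = promote_header(toy)
print(tidy.columns.tolist())
```

With real data the same call would be `promote_header(metadata_tabulka)`, assuming the first row still holds the labels.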

In [61]:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [62]:
%%time
# Web test: extracts the column headers and creates the DataFrame. (Also times the execution; time for the whole corpus ~= "time" * 994.)
url = "https://wiki.uibk.ac.at/noscemus/A_Latin_Letter_containing_some_Animadversions_upon_Mr._Isaac_Newton,_his_Theory_of_Light"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table", class_="wikitable")
    #data = [item.get_text(strip=True) for item in table.find_all("td")]
    indices = [item.get_text(strip=True) for item in table.find_all("th")]
    metadata_tabulka = pd.DataFrame([indices])
else:
    print("Request error, response code is:", response.status_code)

CPU times: user 30.3 ms, sys: 2.34 ms, total: 32.6 ms
Wall time: 301 ms
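
The header and value extraction used above can be tried offline. This is a minimal sketch, assuming a simplified page layout: `SAMPLE_HTML` is a hypothetical stand-in for a Noscemus work page, and `parse_wikitable` is an illustrative helper, not a function from the notebook. It also guards against pages that lack a `wikitable`, where `soup.find` returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the metadata table of a work page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Author</th><td> Pardies, Ignace Gaston </td></tr>
  <tr><th>Year</th><td>1672</td></tr>
  <tr><th>Place</th><td>London</td></tr>
</table>
"""


def parse_wikitable(html):
    """Return (headers, values) from the first table with class 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        # No metadata table on this page: return empty lists instead of failing.
        return [], []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    values = [td.get_text(strip=True) for td in table.find_all("td")]
    return headers, values


headers, values = parse_wikitable(SAMPLE_HTML)
print(headers)
print(values)
```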


In [63]:
%%time
# Build the list of Noscemus works that the next cell iterates over.
url = [
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pageuntil=De+curandis+vulneribus+sclopettorum#mw-pages",
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pagefrom=De+curandis+vulneribus+sclopettorum#mw-pages",
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pagefrom=Discursus+astronomicus+novissimus#mw-pages",
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pagefrom=In+opus+revolutionum+Nicolai+Copernici+Torunnaei+dialogus#mw-pages",
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pagefrom=Petri+Nonii+Salaciensis+opera#mw-pages",
]
seznam = []
for item in url:
    response = requests.get(item)
    soup = BeautifulSoup(response.content, "html.parser")
    tag = soup.find("div", class_="mw-category")
    seznam.extend(re.findall("(?<=href=\").*(?=\" title)", str(tag)))

CPU times: user 291 ms, sys: 9.07 ms, total: 300 ms
Wall time: 2.08 s
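
The regular expression in the cell above works as long as every link sits on its own line of the serialized HTML, because `.*` is greedy and would otherwise run on to the last `" title` on the line. An alternative sketch reads the `href` attribute of each link directly. The markup in `SAMPLE_CATEGORY_HTML` and the paths in it are hypothetical, and `extract_links` is an illustrative helper, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical category listing; the second item puts two links on one line,
# which is the case where a greedy regex over the raw HTML would misfire.
SAMPLE_CATEGORY_HTML = """
<div class="mw-category">
  <ul>
    <li><a href="/noscemus/Work_A" title="Work A">Work A</a></li>
    <li><a href="/noscemus/Work_B" title="Work B">Work B</a> <a href="/noscemus/Work_C" title="Work C">Work C</a></li>
  </ul>
</div>
"""


def extract_links(html):
    """Return the href of every link inside the 'mw-category' div."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", class_="mw-category")
    if tag is None:
        return []
    return [a["href"] for a in tag.find_all("a", href=True)]


links = extract_links(SAMPLE_CATEGORY_HTML)
print(links)
```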


In [64]:
%%time
# Write the metadata of every work in the Noscemus corpus into the DataFrame "metadata_tabulka".
for page in seznam:
    url = "https://wiki.uibk.ac.at"+page
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table", class_="wikitable")
    data = [item.get_text(strip=True) for item in table.find_all("td")]
    added = pd.DataFrame([data])
    metadata_tabulka = pd.concat([metadata_tabulka, added], ignore_index=True)

CPU times: user 39.6 s, sys: 1.65 s, total: 41.3 s
Wall time: 6min 24s
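
Calling `pd.concat` inside the loop copies the growing frame on every iteration. Most of the wall time here is presumably network latency, so the gain may be small, but a common alternative is to collect the rows in a list and build the frame once. The sketch below uses made-up rows of unequal length to show that pandas pads the shorter ones with missing values; `build_table` is a hypothetical helper.

```python
import pandas as pd

header = ["Author", "Year", "Place"]

# Hypothetical per-page cell lists; pages may yield different numbers of cells.
scraped = [
    ["Pardies, Ignace Gaston", "1672", "London", "Internal notes"],
    ["Colonna, Fabio", "1592", "Naples"],
]


def build_table(header, rows):
    """Build one DataFrame from the header row plus all data rows in a single call."""
    return pd.DataFrame([header] + rows)


table = build_table(header, scraped)
print(table.shape)
```

In the notebook this would mean appending `data` to a list inside the loop and constructing the DataFrame after it finishes.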


In [67]:
metadata_tabulka

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,Author,Full title,In,Year,Place,Publisher/Printer,Era,Form/Genre,Discipline/Content,Digital copies,...,,,,,,,,,,
1,"Pardies, Ignace Gaston",A Latin Letter written to the Publisher April ...,Philosophical Transactions of the Royal Societ...,1672,London,Martyn,17th century,"Letter, Review",Physics,Original,...,"Pardies, Ignace Gaston:A Latin Letter containi...",Internal notes,RECENSIO,Of interest to,,Transkribus text available,Yes,Written by,IT,
2,"Scheuchzer, Johann Jakob","Acarnania sive Relatio eorum, quae hactenus el...","ΟΥΡΕΣΙΦΟΙΤΗΣ (Ouresiphoites) Helveticus, 609–35",1723,Leiden,"van der Aa, Pieter",18th century,"Biography, Bibliography","Mathematics, Physics, Geography/Cartography, M...",Original,...,Internal notes,,Of interest to,MK,Transkribus text available,Yes,Written by,MK,,
3,"Morabito, Giuseppe",Ad astronautas Americanos carmen Iosephi Morab...,Fons pacis. Nova aetas. Ad astronautas America...,1969,Amsterdam,Nord-Hollandsche Uitgevers Maatschapij,After 1800,Panegyric poem,Astronomy/Astrology/Cosmography,Original,...,"Morabito, Giuseppe:Ad astronautas Americanos, ...",Internal notes,"The Earthrise picture and ""Please be informed ...",Of interest to,IT,Transkribus text available,,Written by,IT,
4,"Addison, Joseph",Ad insignissimum virum dominum Thomam Burnettu...,"Examen poeticum duplex, sive, Musarum anglican...",1698,London,Richard Wellington I.,17th century,Panegyric poem,Meteorology/Earth sciences,Original,...,Internal notes,,Of interest to,"MK, IT",Transkribus text available,Yes,Written by,MK,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,"Scheuchzer, Johann Jakob",ΟΥΡΕΣΙΦΟΙΤΗΣ Helveticus sive itinera per Helve...,1723,Leiden,"van der Aa, Pieter",18th century,"Report, Bibliography","Geography/Cartography, Meteorology/Earth scien...",Original,ΟΥΡΕΣΙΦΟΙΤΗΣ (Ouresiphoites) Helveticus(e-rara...,...,Internal notes,"Tomus primus (= Itinera 1702, 1703, 1704)Praef...",Of interest to,MK,Transkribus text available,Yes,Written by,MK,,
991,"Bauhin, Caspar",ΠΙΝΑΞ (Pinax) theatri botanici Caspari Bauhini...,1623,Basel,König,17th century,"Dictionary/Lexicon, Historia, Encyclopedic work",Biology,Original,Pinax theatri botanici(e-rara.ch)Alternative l...,...,Internal notes,,Of interest to,"DB, MK",Transkribus text available,Yes,Written by,DB,,
992,"Colonna, Fabio",ΦΥΤΟΒΑΣΑΝΟΣ (Phytobasanos) sive plantarum aliq...,1592,Naples,"Salviani, Orazio",16th century,Historia,"Biology, Medicine",Original,Phytobasanos(Biodiversity Heritage Library),...,,Of interest to,DB,Transkribus text available,Yes,Written by,DB,,,
993,"Scultetus, Johannes","ΧΕΙΡΟΠΛΟΘΗΚΗ seu domini Ioannis Sculteti, phys...",1655,Ulm,Kühn,17th century,"Monograph, Report, Other (see description)",Medicine,Original,ΧΕΙΡΟΠΛΟΘΗΚΗ(Google Books)German translation (...,...,Indications regarding the size of the instrume...,Of interest to,MK,Transkribus text available,Yes,Written by,MK,,,


In [70]:
# Export the DataFrame to a .csv file. Enter the desired file location when prompted (without quotes), for example: ~/Documents/noscemus_metadata.csv
metadata_tabulka.to_csv(path_or_buf=input("Path for the .csv file: "), index=False)
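
The export step can also be wrapped in a small helper that makes the handling of a leading "~" explicit. This is a sketch under assumptions: `export_csv` is a hypothetical name, and the demo writes a two-row toy frame to a temporary directory instead of prompting for a path.

```python
import os
import tempfile

import pandas as pd


def export_csv(df, path):
    """Write df to path without the index, expanding a leading '~'; return the resolved path."""
    resolved = os.path.expanduser(path)
    df.to_csv(resolved, index=False)
    return resolved


# Toy frame shaped like metadata_tabulka: the labels live in the first data row.
demo = pd.DataFrame([["Author", "Year"], ["Addison, Joseph", "1698"]])

with tempfile.TemporaryDirectory() as tmp:
    written = export_csv(demo, os.path.join(tmp, "noscemus_metadata.csv"))
    # The file starts with pandas' numeric column labels (0,1); skip that line
    # so the first row read back is the label row.
    reloaded = pd.read_csv(written, header=None, skiprows=1, dtype=str)

print(reloaded)
```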