# Noscemus_ETF metadata extraction notebook
- Run every cell in the notebook.
- Afterwards, the pandas DataFrame "metadata_table" contains the metadata of all 994 works in the Noscemus corpus.
- Until the column-renaming cell runs, the first row of the DataFrame holds the column headers. There are 26 columns, but only the first 15 have headers. Data from the 16th column onward come from the Noscemus internal notes.
- The last cell exports all data to a .csv file. Set the desired location of that file in the cell before running it, for example "~/Documents/noscemus_metadata.csv".

In [2]:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [7]:
%%time
# Web test: extract the column headers and initialise the DataFrame. (%%time also reports the execution time; the time for the whole corpus is roughly this time * 994.)
url = "https://wiki.uibk.ac.at/noscemus/A_Latin_Letter_containing_some_Animadversions_upon_Mr._Isaac_Newton,_his_Theory_of_Light"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table", class_="wikitable")
    #data = [item.get_text(strip=True) for item in table.find_all("td")]
    indices = [item.get_text(strip=True) for item in table.find_all("th")]
    metadata_table = pd.DataFrame([indices])
else:
    print("Request error, response code is:", response.status_code)

CPU times: user 27 ms, sys: 3.43 ms, total: 30.4 ms
Wall time: 545 ms
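
The estimate mentioned in the cell comment can be worked out directly. This is a minimal sketch that assumes every page takes about as long as the single test request, which is only an approximation; the variable names are illustrative.

```python
# Rough estimate of the full-corpus run time from one timed request.
single_request_s = 0.545   # wall time of the test request above, in seconds
n_works = 994              # number of works stated in the introduction

estimate_s = single_request_s * n_works
minutes, seconds = divmod(round(estimate_s), 60)
print(f"Estimated wall time: about {minutes} min {seconds} s")
```

The result is close to the 9 min 14 s wall time recorded for the full run further down.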


In [8]:
# Build the list of Noscemus works that the next cell iterates over.
# The category listing is paginated, so each listing page is fetched separately.
category_urls = [
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pageuntil=De+curandis+vulneribus+sclopettorum#mw-pages",
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pagefrom=De+curandis+vulneribus+sclopettorum#mw-pages",
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pagefrom=Discursus+astronomicus+novissimus#mw-pages",
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pagefrom=In+opus+revolutionum+Nicolai+Copernici+Torunnaei+dialogus#mw-pages",
    "https://wiki.uibk.ac.at/noscemus/_-_/index.php?title=Category:Works&pagefrom=Petri+Nonii+Salaciensis+opera#mw-pages",
]
seznam = []  # "seznam" is Czech for "list"; it collects the relative links to the work pages
for item in category_urls:
    response = requests.get(item)
    soup = BeautifulSoup(response.content, "html.parser")
    tag = soup.find("div", class_="mw-category")
    # Non-greedy match so that each href is captured separately even if several links share a line.
    seznam.extend(re.findall(r'(?<=href=").*?(?=" title)', str(tag)))
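
When links are pulled out of HTML with a regular expression, it matters whether the wildcard between `href="` and `" title` is greedy or lazy. The sketch below uses a made-up single-line snippet (the `/noscemus/A` and `/noscemus/B` paths are hypothetical) to show the difference; it illustrates the regex behaviour only and is not a description of the wiki's actual markup.

```python
import re

# Hypothetical single-line snippet imitating two category links.
html = (
    '<li><a href="/noscemus/A" title="A">A</a></li>'
    '<li><a href="/noscemus/B" title="B">B</a></li>'
)

# Greedy: ".*" runs to the LAST '" title' on the line, merging both links.
greedy = re.findall(r'(?<=href=").*(?=" title)', html)

# Lazy: ".*?" stops at the FIRST '" title' after each href.
lazy = re.findall(r'(?<=href=").*?(?=" title)', html)

print(len(greedy), lazy)
```

A greedy pattern gives one link per match only when every link sits on its own line, because `.` does not cross line breaks by default.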

In [9]:
%%time
# Write the metadata of every work in the Noscemus corpus into the DataFrame "metadata_table".
for page in seznam:
    url = "https://wiki.uibk.ac.at" + page
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        # Skip pages without a metadata table instead of failing with AttributeError.
        continue
    # One row per work: the text of every table cell, in document order.
    data = [item.get_text(strip=True) for item in table.find_all("td")]
    added = pd.DataFrame([data])
    metadata_table = pd.concat([metadata_table, added], ignore_index=True)

CPU times: user 43.2 s, sys: 1.39 s, total: 44.6 s
Wall time: 9min 14s


In [18]:
metadata_table.T[0]

0                     Author
1                 Full title
2                         In
3                       Year
4                      Place
5          Publisher/Printer
6                        Era
7                 Form/Genre
8         Discipline/Content
9             Digital copies
10               Description
11                References
12                  Cited in
13    How to cite this entry
14            Internal notes
15                       NaN
16                       NaN
17                       NaN
18                       NaN
19                       NaN
20                       NaN
21                       NaN
22                       NaN
23                       NaN
24                       NaN
25                       NaN
Name: 0, dtype: object
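
The trailing NaN labels arise because the header row has fewer entries than the widest data row. A minimal sketch on toy data shows how pandas pads the shorter header row and one way to label the unnamed columns explicitly. The toy values are shortened, and the `extra_N` naming scheme is an illustrative assumption, not something the notebook does.

```python
import pandas as pd

# Toy data: a 3-entry header row and data rows with up to 5 cells.
header = pd.DataFrame([["Author", "Year", "Internal notes"]])
rows = pd.DataFrame([
    ["Pardies, Ignace Gaston", "1672", "Internal notes", "Written by", "IT"],
    ["Lipsius, Justus", "1583", "Internal notes"],
])
toy = pd.concat([header, rows], ignore_index=True)

# Promote the first row to column labels; give the unnamed tail explicit names.
labels = [
    name if pd.notna(name) else f"extra_{i}"  # "extra_N" is a made-up naming scheme
    for i, name in enumerate(toy.iloc[0])
]
toy.columns = labels
toy = toy.iloc[1:].reset_index(drop=True)
print(toy.columns.tolist())
```

Explicit names avoid several columns sharing the label NaN, which makes selecting a single one of them by name awkward.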

In [21]:
metadata_table.columns = metadata_table.T[0]
metadata_table = metadata_table[1:]
metadata_table.head(10)

Unnamed: 0,Author,Full title,In,Year,Place,Publisher/Printer,Era,Form/Genre,Discipline/Content,Digital copies,...,NaN,NaN.1,NaN.2,NaN.3,NaN.4,NaN.5,NaN.6,NaN.7,NaN.8,NaN.9
1,"Pardies, Ignace Gaston",A Latin Letter written to the Publisher April ...,Philosophical Transactions of the Royal Societ...,1672,London,Martyn,17th century,"Letter, Review",Physics,Original,...,"Pardies, Ignace Gaston:A Latin Letter containi...",Internal notes,RECENSIO,Of interest to,,Transkribus text available,Yes,Written by,IT,
2,"Scheuchzer, Johann Jakob","Acarnania sive Relatio eorum, quae hactenus el...","ΟΥΡΕΣΙΦΟΙΤΗΣ (Ouresiphoites) Helveticus, 609–35",1723,Leiden,"van der Aa, Pieter",18th century,"Biography, Bibliography","Mathematics, Physics, Geography/Cartography, M...",Original,...,Internal notes,,Of interest to,MK,Transkribus text available,Yes,Written by,MK,,
3,"Morabito, Giuseppe",Ad astronautas Americanos carmen Iosephi Morab...,Fons pacis. Nova aetas. Ad astronautas America...,1969,Amsterdam,Nord-Hollandsche Uitgevers Maatschapij,After 1800,Panegyric poem,Astronomy/Astrology/Cosmography,Original,...,"Morabito, Giuseppe:Ad astronautas Americanos, ...",Internal notes,"The Earthrise picture and ""Please be informed ...",Of interest to,IT,Transkribus text available,,Written by,IT,
4,"Addison, Joseph",Ad insignissimum virum dominum Thomam Burnettu...,"Examen poeticum duplex, sive, Musarum anglican...",1698,London,Richard Wellington I.,17th century,Panegyric poem,Meteorology/Earth sciences,Original,...,Internal notes,,Of interest to,"MK, IT",Transkribus text available,Yes,Written by,MK,,
5,"Lipsius, Justus",Ad Clusii nomen lusus,"L'Ecluse, Charles de, Rariorum aliquot stirpiu...",1583,Antwerp,Plantin,16th century,Panegyric poem,"Biology, Medicine, Other (see description)",Original,...,"Lipsius, Justus:Ad Clusii nomen lusus, in: Nos...",Internal notes,"Possibly, this epigram could be found in Lipsi...",Of interest to,IT,Transkribus text available,Yes,Written by,IT,
6,"Owen, John",Ad Dominum Gilbertum,Epigrammatum libri tres. Auctore Ioanne Owen B...,1606,London,"Windet, John, Waterson, Simon",17th century,Other (see description),Astronomy/Astrology/Cosmography,Original,...,"Owen, John:Ad Gilbertum, in: Noscemus Wiki, UR...",Internal notes,First edition in sharefolder.The epigram was a...,Of interest to,"JL, IT",Transkribus text available,Yes,Written by,IT,
7,"Costus, Petrus",Petrus Costus ad Gulielmum Rondeletium medicum...,"Aquatilium historia, vol. 1, fol. α3r",1554,Lyon,Bonhomme,16th century,Panegyric poem,Biology,Original,...,Internal notes,,Of interest to,MK,Transkribus text available,Yes,Written by,MK,,
8,"Acidalius, Valens","Ad Iordanum Brunum Nolanum, Italum","Poematum Iani Lernutii, Iani Gulielmi, Valenti...",1603,"Liegnitz, Wrocław","Albert, David",17th century,Panegyric poem,Astronomy/Astrology/Cosmography,Original,...,"Acidalius, Valens:Ad Iordanum Brunum, in: Nosc...",Internal notes,Kühlmann must have overlooked the poem in the ...,Of interest to,"MK, IT",Transkribus text available,Yes,Written by,MK,
9,"Paulinus, Fabius","Ad clarissimum virum Laurentium Massam, sereni...","Avicennae, Arabum medicorum principis, ex Gera...",1595,Venice,I Giunti,16th century,Panegyric poem,Medicine,Original,...,"Paulinus, Fabius:Ad Laurentium Massam pro Avic...",Internal notes,"On the title page of the edition, the many acc...",Of interest to,MK,Transkribus text available,Yes,Written by,MK,
10,"Sands, Patrick",Ad lectorem trigonometriae studiosum,Mirifici logarithmorum canonis descriptioeiusq...,1614,Edinburgh,Andro Hart,17th century,Panegyric poem,Mathematics,Original,...,"Sands, Patrick:Ad lectorem trigonometriae stud...",Internal notes,Sands also wrote a poem for Napier'sRabdologia...,Of interest to,IT,Transkribus text available,Yes,Written by,IT,


In [22]:
# Export the DataFrame to a .csv file. Replace the path below with your desired file location.
metadata_table.to_csv("../data/metadata_table.csv", index=False)
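
If the export location should be configurable, a small helper can expand "~" and create missing folders before writing. This is a sketch under assumptions: `export_metadata` is a hypothetical helper, not part of the notebook, and the toy DataFrame stands in for "metadata_table".

```python
from pathlib import Path
import tempfile

import pandas as pd


def export_metadata(df: pd.DataFrame, destination: str) -> Path:
    """Write df to destination as CSV, expanding '~' and creating parent folders."""
    path = Path(destination).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Usage on a toy frame written to a temporary directory.
toy = pd.DataFrame({"Author": ["Owen, John"], "Year": ["1606"]})
with tempfile.TemporaryDirectory() as tmp:
    written = export_metadata(toy, f"{tmp}/out/noscemus_metadata.csv")
    reloaded = pd.read_csv(written, dtype=str)  # read everything back as text
print(reloaded)
```

In the notebook the call would look like `export_metadata(metadata_table, "~/Documents/noscemus_metadata.csv")`.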