# Description
The textual data, as we obtained them from the developers of Noscemus, are structured so that each individual work is represented by a single text file named after the author, the title of the work, and the place and year of publication (e.g. `Bacon,_Francis_-_Instauratio_magna__London_1620.pdf.txt`). Within the original folder structure, each of these files sits in its own directory, named by its "Digital sourcebook" ID (e.g. `1031760`). This makes iterating over the text files cumbersome. Therefore, in this notebook, we reorganize and rename the files: we create one flat directory in which each work is represented by one text file named by its ID (e.g. `1031760.txt`). These IDs can later be easily mapped onto the metadata.

INPUT: "NOSCEMUS_FULL" subdirectory on sciencedata (raw data as we received them from Noscemus)
OUTPUT: "noscemus_raw" subdirectory on sciencedata (reorganized raw data files from Noscemus)
OUTPUT: "../data/ids_filenames_df.csv" table mapping the "Digital sourcebook" IDs (e.g. `1031760`) onto the original names of the raw text files (e.g. `Bacon,_Francis_-_Instauratio_magna__London_1620.pdf.txt`); we later map it onto other metadata
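
The reorganization described above can be sketched as a single function. This is a minimal illustration, not the exact code of the cells below: the name `flatten_works` and its return value are hypothetical, and it assumes every ID directory holds at most one relevant `.txt` file.

```python
import shutil
from pathlib import Path


def flatten_works(source_dir, target_dir):
    """Copy the first .txt file of every ID directory to target_dir/<ID>.txt.

    Hypothetical helper: returns the IDs that were copied and the IDs
    whose directory contained no .txt file.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied, skipped = [], []
    for work_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        txt_files = sorted(work_dir.glob("*.txt"))
        if not txt_files:
            # nothing to copy for this ID; remember it instead of failing silently
            skipped.append(work_dir.name)
            continue
        shutil.copyfile(txt_files[0], target_dir / (work_dir.name + ".txt"))
        copied.append(work_dir.name)
    return copied, skipped
```

Returning the skipped IDs makes it easy to see which works did not make it into the flat directory.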


In [4]:
import pandas as pd
import sddk
import os
import shutil  # needed for copying the files further below

In [2]:
# we keep the data in a shared folder on sciencedata.dk
# to run this step, you need a) a sciencedata.dk account and b) access to the TOME directory
# s = sddk.cloudSession(provider="sciencedata.dk", shared_folder_name="TOME/DATA/NOSCEMUS", owner="kase@zcu.cz")

In [8]:
# in the "NOSCEMUS_FULL" directory we received, each individual work is in its own directory (named by its ID), which makes navigation cumbersome
# dir_ids_list = s.list_directories("NOSCEMUS_FULL/")
# dir_ids_list[:10]

In [9]:
local_path = "/srv/data/tome/noscemus/NOSCEMUS FULL"
dir_ids_list = os.listdir(local_path)
dir_ids_list[:10]

['902259',
 '841474',
 '659199',
 '697193',
 '897258',
 '662869',
 '845319',
 '905698',
 '929376',
 '660341']

In [46]:
len(dir_ids_list)

1010

In [10]:
%%time
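# map IDs onto filenames
ids_filenames = []
for dir_id in dir_ids_list:  # "dir_id" avoids shadowing the built-in id()
    id_filenames = []
    for filename in os.listdir(os.path.join(local_path, dir_id)):
        # keep only files with the .txt extension
        if filename.endswith(".txt"):
            id_filenames.append(filename)
    ids_filenames.append((dir_id, id_filenames))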

CPU times: user 8.37 ms, sys: 4.25 ms, total: 12.6 ms
Wall time: 11.8 ms


In [8]:
%%time
# with sciencedata...
# map ids on filenames
#ids_filenames = []
#for id in dir_ids_list:
#    id_filenames = []
#    for filename in s.list_filenames("NOSCEMUS_FULL/" + id, "txt"):
#        id_filenames.append(filename)
#    ids_filenames.append((id, id_filenames))

CPU times: user 6.44 s, sys: 54.8 ms, total: 6.5 s
Wall time: 4min 21s


In [14]:
ids_filenames[:10]

[('902259',
  ['Metz,_Andreas_-_De_adaequata_exponentis_notione__Würzburg_1820_pdf.txt']),
 ('841474',
  ['Hippocrates_&_Galenus,_&_Cruser,_Hermann_-_Hippocratis_De_natura_humana_liber_cum_commentariis_Galeni__Paris_1534_pdf.txt']),
 ('659199',
  ['Radius,_Justus_-_Scriptores_ophthalmologici_minores__Vol__3__Leipzig_1830_pdf.txt']),
 ('697193',
  ['Jussieu,_Antoine_Laurent_de_-_Genera_plantarum_secundum_ordines_naturales_disposita__Paris_1789_pdf.txt']),
 ('897258', ['Fehr,_Johann_Michael_-_Hiera_picra__Leipzig_1668_pdf.txt']),
 ('662869',
  ['Kies,_Johann_-_Dissertatio_physica_de_iride__Tübingen_1772_pdf.txt']),
 ('845319',
  ['Scheiner,_Christoph_-_Refractiones_coelestes__Ingolstadt_1617_pdf.txt']),
 ('905698', ['Stansel,_Valentin_-_Legatus_uranicus__Prague_1683_pdf.txt']),
 ('929376', ['Gassendi,_Pierre_-_Opera_omnia__Vol__5__Lyon_1658_pdf.txt']),
 ('660341',
  ['Regiomontanus,_Johannes_-_Disputationes_contra_Cremonensia_deliramenta__Nuremberg_1475_pdf.txt'])]
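
Since the later copying step takes only the first `.txt` file per directory, it may be worth checking that every ID really maps onto exactly one filename. A minimal sketch, with `find_irregular_ids` as a hypothetical helper name:

```python
def find_irregular_ids(ids_filenames):
    """Return the (id, filenames) pairs that do not hold exactly one filename."""
    return [(work_id, names) for work_id, names in ids_filenames if len(names) != 1]


# illustrative input in the same shape as ids_filenames above
sample = [
    ("897258", ["Fehr,_Johann_Michael_-_Hiera_picra__Leipzig_1668_pdf.txt"]),
    ("000001", []),  # made-up ID with no text file
]
irregular = find_irregular_ids(sample)
```

An empty result for the real `ids_filenames` would confirm the one-file-per-work assumption.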

In [15]:
# make a dataframe of ids and filenames
ids_filenames_df = pd.DataFrame(ids_filenames, columns=["id", "filenames_list"])

In [3]:
ids_filenames_df.to_csv("../data/ids_filenames_df.csv")
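
Note that the `filenames_list` column holds Python lists, which end up in the CSV as their string representation. When the table is read back, that string has to be parsed again. The sketch below uses only the standard library; `read_ids_filenames` is a hypothetical helper, and the sample row imitates the layout written by the cell above (unnamed index column first).

```python
import ast
import csv
import io


def read_ids_filenames(csv_file):
    """Map each ID onto its list of filenames, parsing the stringified lists."""
    reader = csv.DictReader(csv_file)
    return {row["id"]: ast.literal_eval(row["filenames_list"]) for row in reader}


# illustrative file content with one row
sample_csv = io.StringIO(
    ",id,filenames_list\n"
    "0,897258,\"['Fehr,_Johann_Michael_-_Hiera_picra__Leipzig_1668_pdf.txt']\"\n"
)
mapping = read_ids_filenames(sample_csv)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice here than `eval`.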

In [35]:
target_path = "/srv/data/tome/noscemus/noscemus_raw/"
# create the target directory; do nothing if it already exists
os.makedirs(target_path, exist_ok=True)

In [33]:
os.listdir("/srv/data/tome/noscemus/")

['NOSCEMUS FULL', 'NOSCEMUS_FULL.zip']

In [29]:
source_path = "/srv/data/tome/noscemus/NOSCEMUS FULL/"

In [48]:
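for dir_id in dir_ids_list:
    try:
        # take the first text file in the work's directory and copy it as <ID>.txt
        filename = [f for f in os.listdir(source_path + dir_id) if f.endswith(".txt")][0]
        shutil.copyfile(source_path + dir_id + "/" + filename, target_path + dir_id + ".txt")
    except (IndexError, OSError) as error:
        # report works without a text file or with an unreadable directory
        print(dir_id, error)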

In [49]:
os.listdir(target_path)[:10]

['913059.txt',
 '801742.txt',
 '888136.txt',
 '720097.txt',
 '663952.txt',
 '694621.txt',
 '868572.txt',
 '664562.txt',
 '747384.txt',
 '795562.txt']
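
The listing shows only the first ten files, so it does not tell us whether all works were copied. A minimal sketch of a completeness check, assuming `missing_ids` as a hypothetical helper name:

```python
import os


def missing_ids(dir_ids, target_dir):
    """Return the IDs that have no <ID>.txt file in target_dir."""
    copied = {
        os.path.splitext(name)[0]
        for name in os.listdir(target_dir)
        if name.endswith(".txt")
    }
    return sorted(set(dir_ids) - copied)
```

In this notebook it could be called as `missing_ids(dir_ids_list, target_path)`; an empty list would mean that every ID directory produced a file.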

In [None]:
os.listdir(target_path)