# Description
The textual data, as we obtained them from the developers of Noscemus, are structured so that each individual work is represented by a single text file named after the author, the title of the work, and the place and year of publication (e.g. `Bacon,_Francis_-_Instauratio_magna__London_1620.pdf.txt`). Within the original folder structure, each of these works sits in its own directory, named by its "Digital sourcebook" ID (e.g. `1031760`). This makes iterating over the text files cumbersome. Therefore, in this notebook, we reorganize and rename the files: we create one flat directory in which each work is represented by one text file named by its ID (e.g. `1031760.txt`). If a directory contains more than one text file, the files are numbered (`<ID>_1.txt`, `<ID>_2.txt`, and so on). These IDs can later be mapped easily onto the metadata.

INPUT: "NOSCEMUS_FULL" subdirectory on sciencedata (raw data as we received them from Noscemus)
OUTPUT: "noscemus_raw" subdirectory on sciencedata (reorganized raw data files from Noscemus)
OUTPUT: "../data/ids_filenames_df.csv", a table mapping the "Digital sourcebook" IDs (e.g. `1031760`) to the original names of the raw text files (e.g. `Bacon,_Francis_-_Instauratio_magna__London_1620.pdf.txt`); we later map it onto other metadata
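
Because the original filenames encode author, title, place, and year, they can in principle be split into separate fields before being joined with other metadata. The following is only a minimal sketch: the pattern `<author>_-_<title>__<place>_<year>.pdf.txt` is an assumption inferred from the single complete example above, and `FILENAME_PATTERN` and `parse_raw_filename` are hypothetical names that are not part of this notebook. Filenames that deviate from the assumed pattern simply return `None`.

```python
import re

# Assumed pattern, inferred from one example filename only:
# <author>_-_<title>__<place>_<year>.pdf.txt
FILENAME_PATTERN = re.compile(
    r"^(?P<author>.+?)_-_(?P<title>.+)__(?P<place>.+)_(?P<year>\d{4})\.pdf\.txt$"
)


def parse_raw_filename(filename):
    """Split a raw filename into its parts; return None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    # underscores stand in for spaces in the raw filenames
    parts = {key: value.replace("_", " ") for key, value in match.groupdict().items()}
    parts["year"] = int(parts["year"])
    return parts


parsed = parse_raw_filename("Bacon,_Francis_-_Instauratio_magna__London_1620.pdf.txt")
print(parsed)
```

Returning `None` instead of raising keeps the function usable in a loop over all filenames, so that non-matching names can be collected and inspected by hand.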


In [1]:
import pandas as pd
import sddk

In [2]:
# we keep the data on a shared folder on sciencedata.dk
s = sddk.cloudSession(provider="sciencedata.dk", shared_folder_name="TOME/DATA/NOSCEMUS", owner="kase@zcu.cz")

connection with shared folder established with you as its ordinary user
endpoint variable has been configured to: https://sciencedata.dk/sharingout/kase%40zcu.cz/TOME/DATA/NOSCEMUS/


In [6]:
# in the "NOSCEMUS_FULL" directory, each individual work is in its own directory (named by its ID), which makes navigation cumbersome
# here we list those directories

dir_ids_list = s.list_directories("NOSCEMUS_FULL/")
dir_ids_list[:10]

['1031760',
 '1085290',
 '1285853',
 '1285854',
 '1285855',
 '1285856',
 '1365811',
 '1370560',
 '1378359',
 '1424044']

In [9]:
len(dir_ids_list)

1009

In [13]:
%%time
# map ids on filenames
ids_filenames = []
for dir_id in dir_ids_list:
    # collect the names of all .txt files stored in this ID's directory
    id_filenames = list(s.list_filenames("NOSCEMUS_FULL/" + dir_id, "txt"))
    ids_filenames.append((dir_id, id_filenames))

CPU times: user 9.23 s, sys: 360 ms, total: 9.59 s
Wall time: 5min 48s
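
Before renaming anything, it can be useful to check how many IDs have no text file, exactly one, or several. A minimal sketch of such a check follows; `count_files_per_id` is a hypothetical helper, and the sample list is invented for illustration only (apart from the Bacon filename), mirroring the `(id, [filenames])` shape of `ids_filenames`.

```python
from collections import Counter


def count_files_per_id(ids_filenames):
    """Tally how many IDs have 0, 1, 2, ... text files."""
    return Counter(len(filenames) for _, filenames in ids_filenames)


# Hypothetical sample in the same (id, [filenames]) shape as ids_filenames
sample = [
    ("1031760", ["Bacon,_Francis_-_Instauratio_magna__London_1620.pdf.txt"]),
    ("0000001", []),
    ("0000002", ["part_a.pdf.txt", "part_b.pdf.txt"]),
]
file_counts = count_files_per_id(sample)
print(file_counts)
```

In the notebook itself, the same call could be made on the real list as `count_files_per_id(ids_filenames)`.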


In [14]:
# make a dataframe of ids and filenames
ids_filenames_df = pd.DataFrame(ids_filenames, columns=["id", "filenames_list"])

In [1]:
ids_filenames_df.to_csv("../data/ids_filenames_df.csv")

NameError: name 'ids_filenames_df' is not defined

In [16]:
ids_filenames_df

Unnamed: 0,id,filenames_list
0,1031760,"[Bacon,_Francis_-_Instauratio_magna__London_16..."
1,1085290,"[Linden,_Johannes_Antonides_van_der_-_Lindeniu..."
2,1285853,"[de_Conde,_Ioannes_Baptista_-_Aphorismi_seu_ax..."
3,1285854,"[van_Poort,_Henricus_-_Hippocratis_Aphorismi_m..."
4,1285855,"[Hippocrates_&_Denisot,_Gérard_-_Hippocratis_A..."
...,...,...
1004,929714,"[Merian,_Maria_Sibylla_-_Metamorphosis_insecto..."
1005,933014,"[Trotter,_Thomas_-_Dissertatio_de_ebrietate__E..."
1006,949394,"[Botallo,_Leonardo_-_De_curandis_vulneribus_sc..."
1007,971293,"[Béguin,_Jean_-_Tyrocinium_chymicum__Paris_161..."


In [18]:
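# copy every text file into the flat "noscemus_raw" directory, named by its ID
filenames_ids_dict = {}
missing_ids = []  # IDs whose directory has no text file or could not be processed
for dir_id in dir_ids_list:
    try:
        filenames = s.list_filenames("NOSCEMUS_FULL/" + dir_id, "txt")
        if len(filenames) > 1:
            # several text files under one ID: number them as <id>_1.txt, <id>_2.txt, ...
            for n, filename in enumerate(filenames, start=1):
                filenames_ids_dict[filename] = dir_id
                text = s.read_file("NOSCEMUS_FULL/" + dir_id + "/" + filename, "str")
                s.write_file("noscemus_raw/{0}_{1}.txt".format(dir_id, n), text)
        else:
            # exactly one text file; an empty list raises IndexError, handled below
            filename = filenames[0]
            filenames_ids_dict[filename] = dir_id
            text = s.read_file("NOSCEMUS_FULL/" + dir_id + "/" + filename, "str")
            s.write_file("noscemus_raw/{0}.txt".format(dir_id), text)
    except Exception:
        missing_ids.append(dir_id)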
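
The naming rule used in the cell above can be isolated into a small pure function, which makes it possible to preview the source-to-target mapping without touching the remote storage. This is only a sketch: `target_names` is a hypothetical helper that is not part of the notebook or of `sddk`, and the filenames in the usage lines are invented for illustration.

```python
def target_names(dir_id, filenames):
    """Return (original filename, new filename) pairs for one ID directory.

    One file      -> <id>.txt
    Several files -> <id>_1.txt, <id>_2.txt, ...
    No files      -> empty list (the ID would end up among the missing ones)
    """
    if len(filenames) == 1:
        return [(filenames[0], "{0}.txt".format(dir_id))]
    return [
        (filename, "{0}_{1}.txt".format(dir_id, n))
        for n, filename in enumerate(filenames, start=1)
    ]


single = target_names("1031760", ["Bacon,_Francis_-_Instauratio_magna__London_1620.pdf.txt"])
several = target_names("0000002", ["part_a.pdf.txt", "part_b.pdf.txt"])  # invented names
empty = target_names("0000001", [])
print(single, several, empty)
```

Separating the naming logic from the read and write calls also makes it easier to test the rule locally before running the slow remote loop.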