In [1]:
import os
import pandas as pd

## Preparation for Annif
In this Jupyter notebook we prepare the data for Annif. Annif expects a folder "docs" that holds two files for each record:
+ A tsv file containing the RVK URI and its description.
+ A txt file containing the text. In our case, text means discipline, title, and keywords. Tables of contents may be added later.

In [3]:
df = pd.read_csv("../cleaning_and_analysis/records_cleaned.csv")
df.head(2)

Unnamed: 0,UID,Discipline,Title,Keywords,Language,RVK,Length_RVK
0,46842,Wirtschaft / Recht,Die interne Revision als Internal Consultant -...,Hochschulschrift,de,PN 216,3
1,252219,Europäisches Management,Der Krieg um die Talente geht weiter: Organisa...,Organisationsentwicklung|Dienstleistungsbetrie...,de,QP 120,3


### Merging Discipline, Keywords and Title to "text"
Replace the pipes in "Keywords" with spaces, and add separating spaces so that the fields do not run together, as in "Wirtschaft / RechtDie interne Revision als...".

In [4]:
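# Turn the pipe-separated keywords into a space-separated string with a leading space.
df["Keywords"] = df["Keywords"].apply(lambda x: " " + x.replace("|", " "))
# Add a trailing space so the discipline does not run into the title.
df["Discipline"] = df["Discipline"].apply(lambda x: x + " ")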
df.head()

Unnamed: 0,UID,Discipline,Title,Keywords,Language,RVK,Length_RVK
0,46842,Wirtschaft / Recht,Die interne Revision als Internal Consultant -...,Hochschulschrift,de,PN 216,3
1,252219,Europäisches Management,Der Krieg um die Talente geht weiter: Organisa...,Organisationsentwicklung Dienstleistungsbetri...,de,QP 120,3
2,252220,Europäisches Management,Der zentralamerikanischen Zollunion: Fortschri...,Zollunion Freier Warenverkehr Hochschulschrif...,de,QP 120,3
3,48555,Betriebswirtschaft,Global Sourcing im Einkauf von Fertigwaren - d...,Hochschulschrift,de,QP 120,3
4,4721,Allgemeiner Maschinenbau,Konstruktion eines Werkzeugeinstellplanes für ...,Werkzeugplanung Konstruktion Hochschulschrift...,de,ZL 3000,4


New column "text":

In [5]:
df["text"] = df["Discipline"] + df["Title"] + df["Keywords"]
df["text"]

0        Wirtschaft / Recht Die interne Revision als In...
1        Europäisches Management Der Krieg um die Talen...
2        Europäisches Management Der zentralamerikanisc...
3        Betriebswirtschaft Global Sourcing im Einkauf ...
4        Allgemeiner Maschinenbau Konstruktion eines We...
                               ...                        
15041    Business Management Digitale Geschäftsmodelle ...
15042    Wirtschaftsingenieurwesen Auswahl von mindeste...
15043    Wirtschaft / Recht Die Verfassungsmäßigkeit de...
15044    Europäisches Management Chancen und Risiken ei...
15045    Öffentliche Verwaltung Brandenburg Die Werbesa...
Name: text, Length: 15046, dtype: object
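
The two cells above can also be folded into one helper. The following is only a sketch: the function name `build_text` and the handling of missing values with `fillna("")` are assumptions, not part of the original notebook. It joins the three fields with single spaces and then collapses any repeated whitespace.

```python
import pandas as pd


def build_text(frame: pd.DataFrame) -> pd.Series:
    """Join Discipline, Title and Keywords into one space-separated string."""
    # Pipes separate individual keywords; turn them into spaces.
    keywords = frame["Keywords"].fillna("").str.replace("|", " ", regex=False)
    joined = (
        frame["Discipline"].fillna("") + " "
        + frame["Title"].fillna("") + " "
        + keywords
    )
    # Collapse runs of whitespace and trim the ends.
    return joined.str.replace(r"\s+", " ", regex=True).str.strip()


# Made-up sample rows; the second one has no keywords.
sample = pd.DataFrame({
    "Discipline": ["Wirtschaft / Recht", "Betriebswirtschaft"],
    "Title": ["Die interne Revision", "Global Sourcing"],
    "Keywords": ["Revision|Hochschulschrift", None],
})
text = build_text(sample)
```

With this variant the source columns stay unchanged, so the cell can be re-run without adding extra spaces each time.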

### Create URI for each RVK
We build the URIs the same way as in the file "../rvk_transformation/rvk_classification_flattened_with_uri.tsv": a fixed prefix followed by the RVK notation without spaces.

In [6]:
uri_base = "th-wildau:"
df["uri"] = uri_base + df["RVK"].str.replace(" ", "")
df["uri"]

0         th-wildau:PN216
1         th-wildau:QP120
2         th-wildau:QP120
3         th-wildau:QP120
4        th-wildau:ZL3000
               ...       
15041     th-wildau:QP120
15042     th-wildau:QP500
15043     th-wildau:PN216
15044     th-wildau:QP120
15045     th-wildau:PN216
Name: uri, Length: 15046, dtype: object
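
For a single notation, the same rule can be written as a small function. This is a sketch; the name `rvk_to_uri` is hypothetical. Unlike the cell above, which removes only plain spaces, it removes every kind of whitespace, which is an assumption about what stray input might look like.

```python
URI_BASE = "th-wildau:"  # prefix taken from the cell above


def rvk_to_uri(notation: str, base: str = URI_BASE) -> str:
    """Strip all whitespace from an RVK notation and prepend the prefix."""
    return base + "".join(notation.split())


print(rvk_to_uri("PN 216"))  # prints th-wildau:PN216
```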

### Create a TSV File for Step 03_train_project
In the following steps, we create a tsv file for step 03_train_project containing 2 columns:
+ The column "text", which in our case holds discipline, title, and keywords.
+ The column "uri" for each record.


In [7]:
columns = ["text", "uri"]
df2 = df[columns]
df2.to_csv("records_for_step_03_train_project_annif.tsv", sep="\t", index=False)
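
A quick round trip can confirm that the file has the expected two columns. The sketch below writes to an in-memory buffer instead of the real file, and the sample row is made up. It also checks for tab characters inside "text": pandas quotes such fields, but a reader that splits lines on tabs without honouring quotes would see extra columns.

```python
import io

import pandas as pd

# Made-up sample record in the same two-column layout.
records = pd.DataFrame({
    "text": ["Wirtschaft / Recht Die interne Revision Hochschulschrift"],
    "uri": ["th-wildau:PN216"],
})

# Write to a buffer and read it back with the same separator.
buffer = io.StringIO()
records.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")

# True if any text field contains a literal tab character.
has_tabs = records["text"].str.contains("\t", regex=False).any()
```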

### A Folder with Data for Step 07
We collect all data needed for step_07 in a new dataframe, df3.
+ We need UID to create a unique file name.
+ We need "text" and "uri" to feed Annif.

Then we load "../rvk_transformation/rvk_classification_flattened_with_uri.tsv" as a dataframe (df_rvks) and merge it with df3. A left join on "uri" adds the RVK description to each record in df3.

In [8]:
columns = ["UID", "text", "uri"]
df3 = df[columns]
df3.head(2)

Unnamed: 0,UID,text,uri
0,46842,Wirtschaft / Recht Die interne Revision als In...,th-wildau:PN216
1,252219,Europäisches Management Der Krieg um die Talen...,th-wildau:QP120


In [9]:
df_rvks = pd.read_csv("../rvk_transformation/rvk_classification_flattened_with_uri.tsv", delimiter="\t")
df_rvks.head(2)

Unnamed: 0,notation,description,uri
0,A,Allgemeines,th-wildau:A
1,AA,"Bibliografien der Bibliografien, Universalbibl...",th-wildau:AA


In [10]:
# Left join: each record in df3 is supplemented with the RVK "description".
combined_df = df3.merge(df_rvks, on="uri", how="left")
combined_df.head(2)

Unnamed: 0,UID,text,uri,notation,description
0,46842,Wirtschaft / Recht Die interne Revision als In...,th-wildau:PN216,PN 216,Allgemeines
1,252219,Europäisches Management Der Krieg um die Talen...,th-wildau:QP120,QP 120,Gesamtdarstellungen der Betriebswirtschaftslehre
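
A left join keeps records whose URI is missing from the RVK table and leaves their description empty. One way to spot such rows is the `indicator` option of `merge`. The sketch below uses made-up sample data; `th-wildau:XX999` is an invented URI that has no match.

```python
import pandas as pd

# Made-up sample data; the second URI does not exist in the RVK table.
records = pd.DataFrame({
    "UID": [1, 2],
    "uri": ["th-wildau:QP120", "th-wildau:XX999"],
})
rvks = pd.DataFrame({
    "uri": ["th-wildau:QP120"],
    "description": ["Gesamtdarstellungen der Betriebswirtschaftslehre"],
})

merged = records.merge(
    rvks,
    on="uri",
    how="left",
    indicator=True,          # adds a "_merge" column naming the source of each row
    validate="many_to_one",  # raises if a URI occurs twice in the RVK table
)
unmatched = merged.loc[merged["_merge"] == "left_only", "UID"].tolist()
print(unmatched)  # prints [2]
```

The `validate` argument also guards against duplicate URIs in the RVK table, which would otherwise multiply records silently.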


### A Folder called "docs"
We create a folder "docs" to hold the per-record files.

In [11]:
# Create folder "docs" if it does not exist.
os.makedirs("docs", exist_ok=True)

### Create a tsv and txt file for each record
For each record we write a tsv and a txt file into that folder.
* The tsv file contains one or more RVKs (as of 2023-08-24 only one RVK).
* The txt file contains the "full text" which is discipline + title + keywords.
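
If a record has no matching RVK description, formatting the value directly writes the literal string "nan" into the tsv file. The helper below is a sketch of one way to guard against that. The name `subject_line` is hypothetical, and falling back to an empty label is an assumption; whether Annif accepts an empty label has not been checked here.

```python
import math


def subject_line(uri: str, description) -> str:
    """Build one tab-separated line; use an empty label if the description is missing."""
    missing = description is None or (
        isinstance(description, float) and math.isnan(description)
    )
    # Tabs inside the label would break the two-column layout, so replace them.
    label = "" if missing else str(description).replace("\t", " ").strip()
    return f"{uri}\t{label}\n"
```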

In [12]:
# Called once per row: writes <UID>.tsv and <UID>.txt into "docs".
def create_files(row):
    tsv_filename = os.path.join("docs", f"{row['UID']}.tsv")
    txt_filename = os.path.join("docs", f"{row['UID']}.txt")

    # Explicit UTF-8 keeps umlauts intact regardless of the platform default.
    with open(tsv_filename, "w", encoding="utf-8") as tsv_file:
        # Write the URI and its RVK description, separated by a tab.
        tsv_file.write(f"{row['uri']}\t{row['description']}\n")

    with open(txt_filename, "w", encoding="utf-8") as txt_file:
        # Write "text" to file.
        txt_file.write(row["text"])

# axis=1 passes each row to the function.
combined_df.apply(create_files, axis=1)

0        None
1        None
2        None
3        None
4        None
         ... 
15041    None
15042    None
15043    None
15044    None
15045    None
Length: 15046, dtype: object
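
After writing, it can be worth checking that every record really has both files. The following sketch uses a temporary folder with made-up file names so that it runs on its own; the function name `find_incomplete_records` is hypothetical. Pointing it at "docs" would check the real output.

```python
import os
import tempfile


def find_incomplete_records(folder: str) -> list[str]:
    """Return the UIDs that have only one of the two expected files."""
    stems_by_ext = {".tsv": set(), ".txt": set()}
    for name in os.listdir(folder):
        stem, ext = os.path.splitext(name)
        if ext in stems_by_ext:
            stems_by_ext[ext].add(stem)
    # Symmetric difference: stems present for exactly one extension.
    return sorted(stems_by_ext[".tsv"] ^ stems_by_ext[".txt"])


with tempfile.TemporaryDirectory() as folder:
    # Record 252219 deliberately lacks its tsv file.
    for name in ("46842.tsv", "46842.txt", "252219.txt"):
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write("placeholder\n")
    incomplete = find_incomplete_records(folder)
```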