## Imports

In [1]:
import pandas as pd
import os
import glob

## Configuration
*input_dir:* The path to the directory that contains your text files. Make sure the path ends with a '/' (slash), for example `path/to/texts/`.

*dataframe_filename:* The filename for the resulting pandas DataFrame. You may use the **.p** extension indicating a pickled file, but you are free to use whatever you like. Just make sure this is consistent in the subsequent sentiment analysis step.

In [2]:
input_dir = "texts/mimotext/"
dataframe_filename = "texts_mimotext.p"
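
The file pattern used later is built by plain string formatting, so a path without the trailing slash would not match the files inside the directory. A minimal sketch of a guard, where `normalize_input_dir` is a hypothetical helper that is not part of the original tool chain:

```python
def normalize_input_dir(path):
    """Return path with exactly one trailing '/' (hypothetical helper)."""
    # Strip trailing separators of either kind, then append a single '/'.
    return path.rstrip("/\\") + "/"


input_dir = normalize_input_dir("texts/mimotext")
print(input_dir)  # texts/mimotext/
```

Note that an empty string would turn into `/`, so the helper assumes a non-empty path.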

## Directory Setup (Optional)
Creates directories according to the configuration if not already created manually.

In [3]:
if not os.path.exists(input_dir):
    os.makedirs(input_dir)

## Data Preparation

### Load texts

In [12]:
text_file_names = glob.glob("{}*.txt".format(input_dir))
print("found {} texts".format(len(text_file_names)))
texts = []
for text_file_name in text_file_names:
    # Keep only the file name; os.path.basename handles the platform's path separators
    # and always returns a value, even if the path contains no separator at all.
    corrected_filename = os.path.basename(text_file_name)
    with open(text_file_name, "r", encoding="utf-8") as input_file:
        texts.append([corrected_filename, input_file.read()])
print("loaded {} texts".format(len(texts)))

found 10 texts
loaded 10 texts


### Create DataFrame

In [13]:
print("searching files for attributes and text")
prepared_texts = []
num_attributes = 0
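for filename, text in texts:
    lines = text.split("\n")
    prepared_text = {"filename": filename}
    cur_line = 0
    # Leading lines are "attribute=value" pairs; the text starts at the "text=" line.
    for line in lines:
        # partition() tolerates lines without "=" and keeps any "=" inside the value.
        line_type, _, line_content = line.partition("=")
        if line_type != "text":
            try:
                # Numeric attributes (for example a year) are stored as floats.
                line_content = float(line_content)
            except ValueError:
                pass
            prepared_text.update({line_type: line_content})
        else:
            break
        cur_line += 1
    num_attributes = max(num_attributes, cur_line)
    # Join the remaining lines and drop the leading "text=" prefix (5 characters).
    prepared_text.update({"text": " ".join(lines[cur_line:])[5:]})
    prepared_texts.append(prepared_text)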

print("found {} additional attributes in .txt files".format(num_attributes))

texts_df = pd.DataFrame(prepared_texts)
texts_df.set_index("filename", inplace=True)

searching files for attributes and text
found 2 additional attributes in .txt files
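
The loop above expects each file to start with `name=value` attribute lines, followed by a line that begins with `text=`. The sketch below applies the same idea to an in-memory sample. The sample content and the helper name `parse_text_file` are made up for illustration.

```python
def parse_text_file(raw):
    """Split raw file content into attributes and text (hypothetical helper)."""
    lines = raw.split("\n")
    record = {}
    for index, line in enumerate(lines):
        key, _, value = line.partition("=")
        if key == "text":
            # Everything from the "text=" line onwards belongs to the text body.
            record["text"] = " ".join([value] + lines[index + 1:])
            break
        try:
            value = float(value)  # numeric attributes such as a year
        except ValueError:
            pass  # keep non-numeric attributes as strings
        record[key] = value
    return record


sample = "year=1758\ntitle=Example title\ntext=First line\nsecond line"
print(parse_text_file(sample))
# {'year': 1758.0, 'title': 'Example title', 'text': 'First line second line'}
```

The numeric attribute comes back as a float, which matches the `float()` conversion in the cell above.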


In [14]:
texts_df.dtypes

﻿year    float64
title     object
text      object
dtype: object

The "year" column has the data type float64 because the loop above converts numeric attributes with `float()`, so I assigned the years again as integers:

In [15]:
texts_df["year"] = [1758, 1784, 1800, 1778, 1783, 1798, 1774, 1778, 1767, 1759]

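The DataFrame now has an integer "year" column. The float64 column is still listed because its name begins with an invisible character, so the assignment added a new column instead of overwriting the old one. Checking the column data types again: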

In [9]:
texts_df.dtypes

﻿year    float64
title     object
text      object
year       int64
dtype: object
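
The invisible character in front of the first "year" column name looks like a UTF-8 byte order mark (BOM) carried over from the first line of each file. Assuming that is the cause, the sketch below shows that opening a file with `encoding="utf-8-sig"` drops the BOM, while plain `"utf-8"` keeps it. The temporary file and its content are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small sample file that starts with a UTF-8 BOM.
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "wb") as sample_file:
        sample_file.write(b"\xef\xbb\xbfyear=1758\ntext=Example")

    # Plain UTF-8 keeps the BOM as the first character of the content.
    with open(sample_path, "r", encoding="utf-8") as input_file:
        key_with_bom = input_file.read().split("=")[0]
    # utf-8-sig removes a leading BOM if one is present.
    with open(sample_path, "r", encoding="utf-8-sig") as input_file:
        key_without_bom = input_file.read().split("=")[0]

print(repr(key_with_bom))     # '\ufeffyear'
print(repr(key_without_bom))  # 'year'
```

With the BOM removed at load time there would be a single "year" column, which could then be converted with `texts_df["year"].astype(int)` instead of retyping the values, provided every file contains a year.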

### Save DataFrame

In [11]:
texts_df.to_pickle(dataframe_filename)
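
The subsequent sentiment analysis step presumably loads this file again. A minimal round-trip check with `pd.read_pickle`, using a small made-up DataFrame and a temporary path rather than the real data:

```python
import os
import tempfile

import pandas as pd

# Made-up single-row DataFrame with the same column layout as texts_df.
example_df = pd.DataFrame(
    [{"filename": "a.txt", "year": 1758, "title": "Example", "text": "Some text"}]
).set_index("filename")

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example.p")
    example_df.to_pickle(pickle_path)
    # read_pickle restores the index, columns and dtypes as they were saved.
    restored_df = pd.read_pickle(pickle_path)

print(restored_df.equals(example_df))  # True
```

Pickle files can execute code when loaded, so only open files that come from a trusted source.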

# Reference

Koncar, P., Druml, L., Ertler, K.-D., Fuchs, A., Geiger, B. C., Glatz, C., Hobisch, E., Mayer, P., Saric, S., Scholger, M. & Voelkl, Y. (2021) A Sentiment Tool Chain for Languages of the 18th Century. https://github.com/philkon/sentiment-tool-chai
