This is a zero-shot attempt using luvnpce83/ancient-greek-emotion-bert, a fine-tuned version of pranaydeeps/Ancient-Greek-BERT. The base model was trained on ancient Greek text, and the fine-tuned model performs 8-class emotion classification on Koine Greek.

In [6]:
from transformers import pipeline
import pandas as pd

In [2]:
classifier = pipeline(
    "text-classification", model="luvnpce83/ancient-greek-emotion-bert"
)

Device set to use cpu


In [3]:
sample_text = "ἐγώ ὑμᾶς ἐπαινῶ"
result = classifier(sample_text)
print(result)

[{'label': 'Joy', 'score': 0.9599943161010742}]


In [54]:
def predict_sentiment(df, text_column):
    """Classify each row of df[text_column] one at a time."""
    result_list = []
    for _, row in df.iterrows():
        sequence_to_classify = row[text_column]
        # For a single input string the pipeline returns a one-element list of dicts.
        result = classifier(sequence_to_classify)
        result[0]["sequence"] = sequence_to_classify
        result_list.append(result[0])
    return pd.DataFrame(result_list)


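def predict_sentiment_batch(df, text_column, batch_size):
    """Classify df[text_column] in chunks of batch_size; df must have an "English" column."""
    texts = df[text_column].tolist()
    english = df["English"].tolist()
    result_list = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        batch_english = english[i : i + batch_size]
        results = classifier(batch)  # one {"label", "score"} dict per input text

        # Pair each text with its translation and its predicted label.
        for text, translation, r in zip(batch, batch_english, results):
            result_list.append(
                {
                    "sequence": text,
                    "translation": translation,
                    "sentiment": r["label"],
                    "score": r["score"],
                }
            )

    return pd.DataFrame(result_list)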

In [55]:
df = pd.read_csv("./greek_training_data/greek_sentences.csv", encoding="utf8")
result = predict_sentiment(df, "text")
print(result)

       label     score                           sequence
0        Joy  0.959994                    ἐγώ ὑμᾶς ἐπαινῶ
1        Joy  0.715231         ὁ στρατιώτης ἐδωκε χρήματα
2      Trust  0.981736               ἐγώ οἶδα τούς Ὅρκους
3      Trust  0.821070     ὁ ποιητής τόν στρατιώτην τιμᾷ.
4      Anger  0.637150  Οἱ κύνες τούς διώκοντας φεύγουσι.
5        Joy  0.824887              ἐγραψα την ἐπιστολήν.
6        Joy  0.967243                    Δῶρα ἠγαγόμην. 
7        Joy  0.419573             Βίοτος πολλά διδάσκει.
8      Trust  0.542304                 ἡ γυνή ἐστί ἀγαθή.
9    Sadness  0.419064         ὁ ποιητής κάλλιστος ἐστὶν.
10   Sadness  0.611068          ἡ πόλις γίγνεται πλούσια.
11  Surprise  0.985877         το στράτευμα ἐφάνη πάμπολυ
12  Surprise  0.329601   μεγάλα τα τόξα τα Περσικά ἐστιν.


On this small sample, the high-confidence outputs are fairly accurate. The biggest outlier is sequence 10, which is labeled Sadness (score 0.61) despite translating to "the city becomes rich/prosperous". There are other bad labels, such as sequence 9, but that one at least has a low score (0.42).
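
One way to make the "high confidence" observation concrete is to split the predictions on a score cutoff. The sketch below is only illustrative: the rows are copied from the printed output above, while the 0.9 threshold and the helper name `split_by_confidence` are assumptions and not part of the original notebook.

```python
# (index, label, score) rows copied from the printed result above.
predictions = [
    (0, "Joy", 0.959994),
    (1, "Joy", 0.715231),
    (2, "Trust", 0.981736),
    (3, "Trust", 0.821070),
    (4, "Anger", 0.637150),
    (5, "Joy", 0.824887),
    (6, "Joy", 0.967243),
    (7, "Joy", 0.419573),
    (8, "Trust", 0.542304),
    (9, "Sadness", 0.419064),
    (10, "Sadness", 0.611068),
    (11, "Surprise", 0.985877),
    (12, "Surprise", 0.329601),
]

THRESHOLD = 0.9  # assumed cutoff, chosen for illustration only


def split_by_confidence(rows, threshold):
    """Split rows into (confident, uncertain) lists based on the score."""
    confident = [r for r in rows if r[2] >= threshold]
    uncertain = [r for r in rows if r[2] < threshold]
    return confident, uncertain


confident, uncertain = split_by_confidence(predictions, THRESHOLD)
print("confident:", [index for index, _, _ in confident])
print("uncertain:", [index for index, _, _ in uncertain])
```

With this cutoff, sequences 9 and 10 both fall into the uncertain group, so a filter of this kind would set them aside for manual review rather than accept their labels.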

In [56]:
import warnings

warnings.filterwarnings(
    "ignore", category=UserWarning, module="torch.utils.data.dataloader"
)
dictionary_df = pd.read_csv(
    "../Lemmatizer-GRK/greek_dictionary/nouns.csv", encoding="utf8", sep="\t"
)
results_df = predict_sentiment_batch(dictionary_df, "FPP", 16)
# results_df.to_csv("./greekBert_sentiment")

In [57]:
print(results_df)
results_df.to_csv("./emotionBert_sentiment.csv")

        sequence                     translation sentiment     score
0      αβδηριτης                 a man of Abdera     Trust  0.418732
1      αβελτερια   silliness, stupidity, fatuity   Disgust  0.531659
2          αβιοι      without a living, starving   Disgust  0.398857
3        αβλαβια               freedom from harm       Joy  0.952323
4          αβλης                      not thrown      Fear  0.510754
...          ...                             ...       ...       ...
12585     ωφελιη  help, aid, succour, assistance       Joy  0.642819
12586     ωφελημ                        a useful       Joy  0.894010
12587   ωφελησις              a helping, aiding;       Joy  0.837792
12588      ωχρος               paleness, wanness   Sadness  0.430239
12589    ωχροτης                        paleness   Disgust  0.910967

[12590 rows x 4 columns]
