This is a zero-shot attempt using a local ancient-greek-text-classification-BERT-2 model, which is fine-tuned from luvnpce83/ancient-greek-emotion-bert. The model was trained on ancient Greek text to perform 8-class emotion classification on Koine Greek. It must be trained on your local machine and placed in ./greekbert_v1/ancient-greek-text-classification-BERT-2
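
The note above names a local directory, while the cell below loads a Hub id. A minimal sketch of choosing between the two at runtime follows. It assumes the local directory was written with `save_pretrained` (which produces a `config.json`) and that the Hub checkpoint is an acceptable fallback; the helper name `resolve_model_source` is hypothetical.

```python
from pathlib import Path

# Both values are copied from this notebook; whether they refer to the
# same checkpoint is an assumption, not something the notebook confirms.
LOCAL_MODEL_DIR = "./greekbert_v1/ancient-greek-text-classification-BERT-2"
HUB_MODEL_ID = "rtwins/greekbert_for_text_classification"


def resolve_model_source(local_dir=LOCAL_MODEL_DIR, hub_id=HUB_MODEL_ID):
    """Return local_dir if it contains a config.json, otherwise hub_id."""
    if (Path(local_dir) / "config.json").is_file():
        return local_dir
    return hub_id


model_source = resolve_model_source()
print(model_source)
```

The returned string can then be passed as the `model` argument of `pipeline("text-classification", ...)`.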

In [2]:
from transformers import pipeline
import pandas as pd

In [6]:
classifier = pipeline(
    "text-classification", model="rtwins/greekbert_for_text_classification"
)

Device set to use cpu


In [7]:
sample_text = "ἐγώ ὑμᾶς ἐπαινῶ"
result = classifier(sample_text)
print(result)

[{'label': 'Joy', 'score': 0.9869072437286377}]


In [8]:
def predict_sentiment(df, text_column):
    """Classify df[text_column] row by row; return label, score and sequence per row."""
    result_list = []
    for _, row in df.iterrows():
        sequence_to_classify = row[text_column]
        # The pipeline returns a list with one {"label", "score"} dict per input string,
        # so a single string yields a one-element list.
        result = classifier(sequence_to_classify)
        result[0]["sequence"] = sequence_to_classify
        result_list.append(result[0])
    return pd.DataFrame(result_list)


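def predict_sentiment_batch(df, text_column, batch_size, translation_column="English"):
    """Classify df[text_column] in batches, keeping each text paired with its translation."""
    texts = df[text_column].tolist()
    translations = df[translation_column].tolist()
    result_list = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        # Pass batch_size through so the pipeline batches the slice instead of
        # falling back to its default of one input at a time.
        results = classifier(batch, batch_size=batch_size)

        # Slice the translations with the same bounds so they stay aligned with the batch.
        for text, translation, r in zip(batch, translations[i : i + batch_size], results):
            result_list.append(
                {
                    "sequence": text,
                    "translation": translation,
                    "sentiment": r["label"],
                    "score": r["score"],
                }
            )

    return pd.DataFrame(result_list)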
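
The batch function relies on slicing with a stride of `batch_size`, which leaves a shorter final batch whenever the number of texts is not a multiple of the batch size. A minimal sketch of that slicing on its own, without the model, is shown here; the helper name `iter_batches` is hypothetical and the sample sentences are taken from the outputs in this notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


texts = [
    "ἐγώ ὑμᾶς ἐπαινῶ",
    "ὁ στρατιώτης ἐδωκε χρήματα",
    "ἐγώ οἶδα τούς Ὅρκους",
]
batches = list(iter_batches(texts, 2))
print([len(batch) for batch in batches])  # [2, 1]
```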

In [9]:
df = pd.read_csv("./greek_training_data/greek_sentences.csv", encoding="utf8")
result = predict_sentiment(df, "text")
print(result)

       label     score                           sequence
0        Joy  0.986907                    ἐγώ ὑμᾶς ἐπαινῶ
1      Trust  0.997800         ὁ στρατιώτης ἐδωκε χρήματα
2      Trust  0.999936               ἐγώ οἶδα τούς Ὅρκους
3      Trust  0.999926     ὁ ποιητής τόν στρατιώτην τιμᾷ.
4      Anger  0.651851  Οἱ κύνες τούς διώκοντας φεύγουσι.
5        Joy  0.999741              ἐγραψα την ἐπιστολήν.
6        Joy  0.999965                    Δῶρα ἠγαγόμην. 
7    Sadness  0.997095             Βίοτος πολλά διδάσκει.
8      Trust  0.999907                 ἡ γυνή ἐστί ἀγαθή.
9        Joy  0.981954         ὁ ποιητής κάλλιστος ἐστὶν.
10       Joy  0.998732          ἡ πόλις γίγνεται πλούσια.
11  Surprise  0.999900         το στράτευμα ἐφάνη πάμπολυ
12  Surprise  0.812118   μεγάλα τα τόξα τα Περσικά ἐστιν.


On this small sample, the high-confidence outputs are mostly accurate. The most questionable high-confidence label is sequence 7, which is labeled Sadness (0.997) although it translates to roughly "life teaches many things". Sequence 4 (roughly "the dogs flee their pursuers") also looks mislabeled as Anger, but at least this is low confidence (0.65).
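
Since the doubtful labels tend to come with lower scores, one option is to set low-confidence rows aside for manual review. The sketch below works on the list-of-dicts shape the pipeline returns; the helper name `split_by_confidence` and the 0.9 threshold are assumptions for illustration, not values tuned on this data. The three rows are copied from the output above.

```python
def split_by_confidence(predictions, threshold=0.9):
    """Split predictions into (confident, uncertain) lists by score."""
    confident, uncertain = [], []
    for prediction in predictions:
        if prediction["score"] >= threshold:
            confident.append(prediction)
        else:
            uncertain.append(prediction)
    return confident, uncertain


predictions = [
    {"label": "Joy", "score": 0.986907, "sequence": "ἐγώ ὑμᾶς ἐπαινῶ"},
    {"label": "Anger", "score": 0.651851, "sequence": "Οἱ κύνες τούς διώκοντας φεύγουσι."},
    {"label": "Surprise", "score": 0.812118, "sequence": "μεγάλα τα τόξα τα Περσικά ἐστιν."},
]
confident, uncertain = split_by_confidence(predictions)
print([p["label"] for p in uncertain])  # ['Anger', 'Surprise']
```

With a DataFrame such as the one returned by `predict_sentiment`, the same split can be written as a boolean mask on the `score` column.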

In [11]:
import warnings

warnings.filterwarnings(
    "ignore", category=UserWarning, module="torch.utils.data.dataloader"
)
dictionary_df = pd.read_csv(
    "../Lemmatizer-GRK/greek_dictionary/nouns.csv", encoding="utf8", sep="\t"
)
results_df = predict_sentiment_batch(dictionary_df, "FPP", 16)
results_df.to_csv("./greekbert_v2_sentiment.csv")