# TF-IDF with Greek

Use this notebook to experiment with calculating TF-IDF with Greek. You won't be able to use scikit-learn's `"english"` stop words, and in general you will have to be in charge of a much larger part of the pipeline.

But this could be a useful exercise for those of you interested in the additional complexity of working with Greek!

In [2]:
%pip install -r requirements.txt

Collecting appnope==0.1.4 (from -r requirements.txt (line 1))
  Downloading appnope-0.1.4-py2.py3-none-any.whl.metadata (908 bytes)
Collecting contourpy==1.3.0 (from -r requirements.txt (line 4))
  Downloading contourpy-1.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.4 kB)
Collecting cycler==0.12.1 (from -r requirements.txt (line 5))
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting debugpy==1.8.7 (from -r requirements.txt (line 6))
  Downloading debugpy-1.8.7-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.1 kB)
Collecting fonttools==4.54.1 (from -r requirements.txt (line 9))
  Downloading fonttools-4.54.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (163 kB)
Collecting jedi==0.19.1 (from -r requirements.txt (line 12))
  Downloading jedi-0.19.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting joblib==1.4.2 (from -r requirements.txt (line 13))
  Down

In [1]:
# eff it, thucy time

from lxml import etree
from MyCapytain.common.constants import Mimetypes
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText
import pandas as pd

with open("tei/tlg0003.tlg001.perseus-grc2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0003.tlg001.perseus-grc2", resource=f)

urns = []
raw_xmls = []
unannotated_strings = []

for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
history_df = pd.DataFrame(d)
history_df['whitespaced_tokens'] = history_df['unannotated_strings'].str.split()
history_df

Unnamed: 0,urn,raw_xml,unannotated_strings,whitespaced_tokens
0,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν ...,"[Θουκυδίδης, Ἀθηναῖος, ξυνέγραψε, τὸν, πόλεμον..."
1,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",κίνησις γὰρ αὕτη μεγίστη δὴ τοῖς Ἕλλησιν ἐγένε...,"[κίνησις, γὰρ, αὕτη, μεγίστη, δὴ, τοῖς, Ἕλλησι..."
2,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",τὰ γὰρ πρὸ αὐτῶν καὶ τὰ ἔτι παλαίτερα σαφῶς μὲ...,"[τὰ, γὰρ, πρὸ, αὐτῶν, καὶ, τὰ, ἔτι, παλαίτερα,..."
3,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",φαίνεται γὰρ ἡ νῦν Ἑλλὰς καλουμένη οὐ πάλαι βε...,"[φαίνεται, γὰρ, ἡ, νῦν, Ἑλλὰς, καλουμένη, οὐ, ..."
4,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","τῆς γὰρ ἐμπορίας οὐκ οὔσης, οὐδ’ ἐπιμειγνύντες...","[τῆς, γὰρ, ἐμπορίας, οὐκ, οὔσης,, οὐδ’, ἐπιμει..."
...,...,...,...,...
3582,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:8...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","καὶ ὁ Τισσαφέρνης ἀπὸ τῆς Ἀσπένδου, ὡς ἐπύθετο...","[καὶ, ὁ, Τισσαφέρνης, ἀπὸ, τῆς, Ἀσπένδου,, ὡς,..."
3583,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:8...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","ὄντων δὲ τῶν Πελοποννησίων ἐν τῷ Ἑλλησπόντῳ, Ἀ...","[ὄντων, δὲ, τῶν, Πελοποννησίων, ἐν, τῷ, Ἑλλησπ..."
3584,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:8...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",φοβούμενοι οὖν αὐτὸν διὰ τοῦτο τὸ ἔργον μήποτε...,"[φοβούμενοι, οὖν, αὐτὸν, διὰ, τοῦτο, τὸ, ἔργον..."
3585,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:8...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",ὁ δὲ Τισσαφέρνης αἰσθόμενος καὶ τοῦτο τῶν Πελ...,"[ὁ, δὲ, Τισσαφέρνης, αἰσθόμενος, καὶ, τοῦτο, τ..."


In [8]:
import re
from collections import Counter
from math import log

def tokenize(d: str):
    return re.findall(r"\w+|[^\w\s]", d, re.UNICODE)

# tf(t,d) = count of t in d / number of words in d
def tf(t: str, d):
    doc = [str(e) for e in tokenize(d)]
    #print(doc)
    abs_freq = Counter(doc)[t] # why is this 0?
    return abs_freq / (len(doc))

def idf(t: str, D):
    return log(len([1 for d in D if t in tokenize(t)]))

def tf_idf(t: str, D, summation):
    raw_freq = tf(t, "".join(D))
    better_tf = raw_freq / summation
    #better_tf = log(1+raw_freq)
    better_idf = idf(t, D)
    return better_tf * better_idf

history_df['minis'] = history_df['unannotated_strings'].str.lower().str.replace(r'[^\w\s]+', '', regex=True).dropna()
summation = history_df['minis'].apply(lambda dock: tf("ἔργοις", dock)).sum()
history_df['some_tfidf'] = history_df['minis'].apply(lambda exx: tf_idf("ἔργοις", exx, summation))
history_df.loc[history_df['some_tfidf'] > 0]


Unnamed: 0,urn,raw_xml,unannotated_strings,whitespaced_tokens,minis,some_tfidf
38,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",περιουσίαν δὲ εἰ ἦλθον ἔχοντες τροφῆς καὶ ὄντε...,"[περιουσίαν, δὲ, εἰ, ἦλθον, ἔχοντες, τροφῆς, κ...",περιουσίαν δὲ εἰ ἦλθον ἔχοντες τροφῆς καὶ ὄντε...,0.318695
39,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",ἀλλὰ δι’ ἀχρηματίαν τά τε πρὸ τούτων ἀσθενῆ ἦ...,"[ἀλλὰ, δι’, ἀχρηματίαν, τά, τε, πρὸ, τούτων, ἀ...",ἀλλὰ δι ἀχρηματίαν τά τε πρὸ τούτων ἀσθενῆ ἦν...,0.635845
69,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","ἐπιπόνως δὲ ηὑρίσκετο, διότι οἱ παρόντες τοῖς ...","[ἐπιπόνως, δὲ, ηὑρίσκετο,, διότι, οἱ, παρόντες...",ἐπιπόνως δὲ ηὑρίσκετο διότι οἱ παρόντες τοῖς ἔ...,0.886337
513,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","ταῦτα λαβὼν ὁ Παυσανίας τὰ γράμματα, ὢν καὶ πρ...","[ταῦτα, λαβὼν, ὁ, Παυσανίας, τὰ, γράμματα,, ὢν...",ταῦτα λαβὼν ὁ παυσανίας τὰ γράμματα ὢν καὶ πρό...,0.36031
575,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",ἀλλ’ ἐκεῖνα μὲν καὶ ἐν ἄλλῳ λόγῳ ἅμα τοῖς ἔργο...,"[ἀλλ’, ἐκεῖνα, μὲν, καὶ, ἐν, ἄλλῳ, λόγῳ, ἅμα, ...",ἀλλ ἐκεῖνα μὲν καὶ ἐν ἄλλῳ λόγῳ ἅμα τοῖς ἔργοι...,0.273117
737,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:2...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",ἔνι τε τοῖς αὐτοῖς οἰκείων ἅμα καὶ πολιτικῶν ἐ...,"[ἔνι, τε, τοῖς, αὐτοῖς, οἰκείων, ἅμα, καὶ, πολ...",ἔνι τε τοῖς αὐτοῖς οἰκείων ἅμα καὶ πολιτικῶν ἐ...,0.404423
750,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:2...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",‘καὶ οἵδε μὲν προσηκόντως τῇ πόλει τοιοίδε ἐγέ...,"[‘καὶ, οἵδε, μὲν, προσηκόντως, τῇ, πόλει, τοιο...",καὶ οἵδε μὲν προσηκόντως τῇ πόλει τοιοίδε ἐγέν...,0.256676
1322,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:3...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","ἀλλ’ ἢν οἱ ἡγεμόνες, ὥσπερ νῦν ὑμεῖς, κεφαλαιώ...","[ἀλλ’, ἢν, οἱ, ἡγεμόνες,, ὥσπερ, νῦν, ὑμεῖς,, ...",ἀλλ ἢν οἱ ἡγεμόνες ὥσπερ νῦν ὑμεῖς κεφαλαιώσαν...,0.97847
1863,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:4...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",ὧν χρὴ μνησθέντας ἡμᾶς τούς τε πρεσβυτέρους ὁμ...,"[ὧν, χρὴ, μνησθέντας, ἡμᾶς, τούς, τε, πρεσβυτέ...",ὧν χρὴ μνησθέντας ἡμᾶς τούς τε πρεσβυτέρους ὁμ...,0.318572
2152,urn:cts:greekLit:tlg0003.tlg001.perseus-grc2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",καὶ τὴν διὰ μέσου ξύμβασιν εἴ τις μὴ ἀξιώσει π...,"[καὶ, τὴν, διὰ, μέσου, ξύμβασιν, εἴ, τις, μὴ, ...",καὶ τὴν διὰ μέσου ξύμβασιν εἴ τις μὴ ἀξιώσει π...,0.375776


I could make this work with all forms of 'deed' but also I'm busy. what I would have done is just run each form of 'deed' through tf & idf, summing all forms' tfs and idfs separately, then doing the big fun tf * idf. Anyway, I think HW6 is done now, yay!