# Text Comparison of Corpora Using Different Methods

This notebook compares the texts of a corpus sentence by sentence, using three different methods:
- TF-IDF
- fastText embeddings
- MinHash

For each method, the notebook provides
1. a function that builds a DataFrame containing the pairwise similarities between the sentences, and
2. a shared function that lists all sentence pairs whose similarity exceeds a given threshold.

These functions can be applied to an NLTK corpus to compare the target text with all other texts.

The notebook comes from the course "Intertextualität und Text Re-Use" and was modified where necessary: I do not need tokenization, since I use texts that are already lemmatized. Instead, the sentences have to be split (Ancient Greek, however, uses different punctuation marks). For the fastText comparison I trained my own model (see TrainWordembedding.py; the training did not work in Jupyter). I did not compute sentence embeddings, for lack of a model for Ancient Greek.

### Importing packages

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from datasketch import MinHash
import unicodedata

### Processing functions



In [2]:
# Split a Greek text into sentences. Since Ancient Greek uses different punctuation marks, I wrote my own function.
def make_sentences_str(text):
    sentences = []
    sentence = ""
    tokens = text.split()
    for word in tokens:
        word = word.replace('᾽', '' ).replace('῞', '') 
        if word in [',', '\n', "'", "᾽", '῞',""]:
            continue
        elif word in [';', ':', '.']:
            if sentence:
                sentences.append(sentence[:-1])
                sentence = ""
        else:
            sentence += word+" "
    return sentences

In [3]:
def make_sentences_list(text):
    sentences = []
    sentence = []
    text = text.replace('.', ' . ')
    tokens = text.split()
    for word in tokens:
        word = word.replace('᾽', '' ).replace('῞', '') 
        if word in [',', '\n', "'", "᾽", ""]:
            continue
        elif word in [';', ':', '.']:
            if sentence:
                sentences.append(sentence)
                sentence = []
        else:
            sentence.append(word)
    return sentences
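
The two splitters above only recognise ';', ':' and '.' as sentence boundaries. As a hedged sketch, and not the method used in this notebook, the following regex-based variant also treats the Greek question mark (U+037E) and the ano teleia (U+0387, or the middle dot U+00B7) as boundaries. The helper name `split_greek_sentences` is hypothetical, and whether these characters actually occur in the lemmatized files is an assumption.

```python
import re
import unicodedata

# Sentence-final marks: full stop, semicolon / Greek question mark (U+037E),
# colon, and ano teleia (U+0387) or the equivalent middle dot (U+00B7).
SENTENCE_END = re.compile("[.;:\u037e\u0387\u00b7]")

def split_greek_sentences(text):
    """Hypothetical helper: split a text on Greek sentence punctuation."""
    # NFC maps U+037E to ';' and U+0387 to U+00B7, so both spellings match.
    text = unicodedata.normalize("NFC", text)
    parts = SENTENCE_END.split(text)
    # Drop commas and collapse whitespace inside each sentence.
    cleaned = (" ".join(p.replace(",", " ").split()) for p in parts)
    return [c for c in cleaned if c]

example = split_greek_sentences("θεός φημί λέγω · ἀλλ , ἔχω χείρ ; τί")
```

Unlike `make_sentences_str`, this variant also keeps a final sentence that has no closing punctuation.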

### Text comparison with TF-IDF

The sentences of both texts are combined, and the TF-IDF vectors are computed on this combined collection.

In [4]:
def calculate_tfidf_similarity(text1, text2):
    # Split the texts into sentences
    sentences1 = make_sentences_str(text1)
    sentences2 = make_sentences_str(text2)

    # Combine all sentences for vectorization
    all_sentences = sentences1 + sentences2

    # Compute TF-IDF vectors
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_sentences)

    # Split TF-IDF matrix into two parts: one for each text
    tfidf1 = tfidf_matrix[:len(sentences1)]
    tfidf2 = tfidf_matrix[len(sentences1):]

    # Compute pairwise cosine similarity
    similarity_matrix = cosine_similarity(tfidf1, tfidf2)

    # Create a pandas DataFrame
    df = pd.DataFrame(
        similarity_matrix,
        index=range(len(sentences1)),
        columns=range(len(sentences2))
    )

    return df, sentences1, sentences2
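
To make the weighting step more tangible, here is a minimal pure-Python sketch of TF-IDF plus cosine similarity on toy sentences. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalisation and plain whitespace tokenization; `TfidfVectorizer` additionally lowercases and applies its own token pattern, so its numbers can differ. The helper names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return one {term: weight} dict per sentence (smoothed IDF, L2-normalised)."""
    tokenized = [s.split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(u, v):
    # Both vectors are already L2-normalised, so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

sentences1 = ["θεός φημί λέγω", "ἀλλ ἔχω χείρ"]
sentences2 = ["φημί λέγω", "ἀλλ πάνυ ἔχω"]

# Fit on the combined sentences, then split the vectors per text again.
vectors = tfidf_vectors(sentences1 + sentences2)
v1, v2 = vectors[:len(sentences1)], vectors[len(sentences1):]
matrix = [[cosine(a, b) for b in v2] for a in v1]
```

Because the IDF weights are computed on the combined sentences, a term that is frequent in either text is down-weighted in both.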

### Text comparison with fastText

The word vectors are not trained inside this notebook: they are loaded from the model I trained separately and saved as cbow_keyedvectors.kv.

It might be more reliable to load a model pre-trained on a larger corpus.

In [5]:
from gensim.models import KeyedVectors

In [6]:
def calculate_fasttext_similarity(text1, text2):
    # Split the texts into sentences
    sentences1 = make_sentences_list(text1)
    sentences2 = make_sentences_list(text2)

    # Load the separately trained word vectors
    keyed_vectors = KeyedVectors.load("cbow_keyedvectors.kv")

    # Compute sentence embeddings by averaging word vectors
    embeddings1 = [sum(keyed_vectors[unicodedata.normalize("NFC", word) ] for word in sentence)/ len(sentence) for sentence in sentences1]
    embeddings2 = [sum(keyed_vectors[unicodedata.normalize("NFC", word) ] for word in sentence)/ len(sentence) for sentence in sentences2]

    # Compute pairwise cosine similarity
    similarity_matrix = cosine_similarity(embeddings1, embeddings2)

    # Create a pandas DataFrame
    df = pd.DataFrame(
        similarity_matrix,
        index=range(len(sentences1)),
        columns=range(len(sentences2))
    )

    return df, sentences1, sentences2
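
The averaging step can be illustrated without a trained model. The sketch below uses made-up three-dimensional toy vectors (a real model has far more dimensions) and a hypothetical helper `sentence_embedding`; unlike the function above, it skips tokens that have no vector, which is an assumption about how out-of-vocabulary words might be handled.

```python
import math

# Hypothetical toy vectors, invented purely for illustration.
toy_vectors = {
    "θεός": [1.0, 0.0, 0.0],
    "φημί": [0.0, 1.0, 1.0],
    "λέγω": [0.0, 1.0, 0.0],
}

def sentence_embedding(tokens, vectors):
    """Average the vectors of all known tokens; return None if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

e1 = sentence_embedding(["θεός", "φημί", "λέγω"], toy_vectors)
e2 = sentence_embedding(["φημί", "λέγω"], toy_vectors)
similarity = cosine(e1, e2)
```

Averaged vectors tend to produce fairly high cosine values even for loosely related sentences, which may be one reason why the fastText DataFrame further down contains far more rows than the TF-IDF or MinHash ones.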


### Text comparison with MinHash

In [7]:
def calculate_minhash_similarity(text1, text2):
    # Split the texts into sentences
    sentences1 = make_sentences_list(text1)
    sentences2 = make_sentences_list(text2)

    # Create MinHash objects for each sentence
    def get_minhash(sentence):
        minhash = MinHash()
        for word in sentence:
            minhash.update(word.encode('utf8'))
        return minhash

    minhashes1 = [get_minhash(sentence) for sentence in sentences1]
    minhashes2 = [get_minhash(sentence) for sentence in sentences2]

    # Compute pairwise MinHash similarity
    similarity_matrix = [
        [m1.jaccard(m2) for m2 in minhashes2]
        for m1 in minhashes1
    ]

    # Create a pandas DataFrame
    df = pd.DataFrame(
        similarity_matrix,
        index=range(len(sentences1)),
        columns=range(len(sentences2))
    )

    return df, sentences1, sentences2
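
MinHash estimates the Jaccard similarity of two token sets, so it helps to see the exact value it approximates. The following sketch computes the exact Jaccard index in plain Python; the helper name `jaccard` is hypothetical. With the default of 128 permutations in datasketch, the estimates are multiples of 1/128, which is why the notebook output shows values such as 0.5078 (65/128) where the exact value is 0.5.

```python
def jaccard(tokens1, tokens2):
    """Exact Jaccard index of two token lists, treated as sets."""
    set1, set2 = set(tokens1), set(tokens2)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

# Sentence pairs taken from the MinHash output in this notebook.
exact_a = jaccard(["ἀλλ", "ἔχω", "χείρ"], ["ἀλλ", "πάνυ", "ἔχω"])   # MinHash: 0.5078
exact_b = jaccard(["θεός", "φημί", "λέγω"], ["λέγω"])               # MinHash: 0.4062
exact_c = jaccard(["θεός", "φημί", "λέγω"], ["φημί", "λέγω"])       # MinHash: 0.7266
```

Since both approaches work on sets, repeated words within a sentence and word order have no influence on the score.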


### Function that finds all sentence pairs with similarity values above a given threshold

In [8]:
def get_sentence_pairs_above_threshold(df, sentences1, sentences2, threshold):
    """Get all sentence pairs with similarity above the given threshold."""
    pairs = []
    for i, row in df.iterrows():
        for j, similarity in row.items():
            if similarity > threshold:
                pairs.append({
                    "sentence1": sentences1[i],
                    "sentence2": sentences2[j],
                    "similarity": similarity
                })
    return pairs
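
For large similarity matrices, `iterrows` can become slow. As a hedged alternative, the sketch below does the same filtering on a plain nested list (such as the `similarity_matrix` built in `calculate_minhash_similarity` before it is wrapped in a DataFrame) and additionally sorts the result. The name `pairs_above_threshold` and the toy data are hypothetical.

```python
def pairs_above_threshold(matrix, sentences1, sentences2, threshold):
    """Collect all pairs above the threshold, most similar first."""
    pairs = [
        {"sentence1": sentences1[i], "sentence2": sentences2[j], "similarity": sim}
        for i, row in enumerate(matrix)
        for j, sim in enumerate(row)
        if sim > threshold
    ]
    return sorted(pairs, key=lambda p: p["similarity"], reverse=True)

# Toy data: two sentences per text and made-up similarity values.
toy_matrix = [[0.2, 0.6], [0.9, 0.1]]
toy_pairs = pairs_above_threshold(toy_matrix, ["a", "b"], ["c", "d"], 0.5)
```

Sorting is optional; it simply puts the strongest candidates for text re-use at the top of the list.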

## Execution

### Creating the NLTK corpus

In [9]:
from nltk.corpus import CategorizedPlaintextCorpusReader

In [10]:
path = 'Lemma Text Files'
    

In [11]:
mycorpus = CategorizedPlaintextCorpusReader(path, r'.*\.txt', cat_pattern = r'(.*?)_.*')
print(len(mycorpus.fileids()), "files imported")

37 files imported


## Comparing the Bacchae with Plato's works

Select the target text:

In [12]:
text1name='Bakchen.with_lemma.txt'

text1 = ' '.join(mycorpus.words(fileids=text1name))
    

Iterate over the corpus:

In [13]:
threshold = 0.4  # one value for both the filter and the printed message
print("Comparing text", text1name, "with")

for f in mycorpus.fileids():
    if f != text1name:
        print(f)
        text2 = ' '.join(mycorpus.words(fileids=f))
        minhash_df, sentences1, sentences2 = calculate_minhash_similarity(text1, text2)
        print(f"Sentence pairs with similarity above {threshold} (MinHash):")
        pairs = get_sentence_pairs_above_threshold(minhash_df, sentences1, sentences2, threshold)
        for pair in pairs:
            print(f"  - Sentence 1: {pair['sentence1']}")
            print(f"    Sentence 2: {pair['sentence2']}")
            print(f"    Similarity: {pair['similarity']:.4f}")
        

Comparing text Bakchen.with_lemma.txt with
Alc.+1.with_lemma.txt
Sentence pairs with similarity above 0.4 (MinHash):
  - Sentence 1: ['ἀλλ', 'ἔχω', 'χείρ']
    Sentence 2: ['ἀλλ', 'πάνυ', 'ἔχω']
    Similarity: 0.5078
  - Sentence 1: ['θεός', 'φημί', 'λέγω']
    Sentence 2: ['λέγω']
    Similarity: 0.4062
  - Sentence 1: ['θεός', 'φημί', 'λέγω']
    Sentence 2: ['λέγω']
    Similarity: 0.4062
  - Sentence 1: ['θεός', 'φημί', 'λέγω']
    Sentence 2: ['λέγω']
    Similarity: 0.4062
  - Sentence 1: ['θεός', 'φημί', 'λέγω']
    Sentence 2: ['φημί', 'λέγω']
    Similarity: 0.7266
  - Sentence 1: ['θεός', 'φημί', 'λέγω']
    Sentence 2: ['λέγω']
    Similarity: 0.4062
  - Sentence 1: ['θεός', 'φημί', 'λέγω']
    Sentence 2: ['λέγω']
    Similarity: 0.4062
  - Sentence 1: ['θεός', 'φημί', 'λέγω']
    Sentence 2: ['λέγω']
    Similarity: 0.4062
  - Sentence 1: ['θεός', 'φημί', 'λέγω']
    Sentence 2: ['λέγω']
    Similarity: 0.4062
  - Sentence 1: ['θεός', 'φημί', 'λέγω']
    Sentence 2: 

Iterate over the corpus and collect the output in a DataFrame:

### Similarity based on TF-IDF:

In [61]:
# The old version was not suitable for large amounts of text; pandas becomes terribly slow there.
columns = ['text1', 'text2', 'sentence1','sentence2','similarity']
all_rows = []
for f in mycorpus.fileids():
    if f != text1name:
        print(f"Processing {f}...")

        text2 = ' '.join(mycorpus.words(fileids=f))
        tfidf_df, sentences1, sentences2 = calculate_tfidf_similarity(text1, text2)
        pairs = get_sentence_pairs_above_threshold(tfidf_df, sentences1, sentences2, 0.5)

        for pair in pairs:
            all_rows.append([text1name, f, pair['sentence1'], pair['sentence2'], pair['similarity']])
            
df_tfidf = pd.DataFrame(all_rows, columns=columns)

Processing Alc.+1.with_lemma.txt...
Processing Alc.+2.with_lemma.txt...
Processing Apol..with_lemma.txt...
Processing Charm..with_lemma.txt...
Processing Cleit..with_lemma.txt...
Processing Crat..with_lemma.txt...
Processing Criti..with_lemma.txt...
Processing Crito.with_lemma.txt...
Processing Epin..with_lemma.txt...
Processing Epistels.with_lemma.txt...
Processing Euthyd..with_lemma.txt...
Processing Euthyph..with_lemma.txt...
Processing Gorg..with_lemma.txt...
Processing Hipp.+Maj.with_lemma.txt...
Processing Hipp.+Min.with_lemma.txt...
Processing Hipparch..with_lemma.txt...
Processing Ion.with_lemma.txt...
Processing Lach..with_lemma.txt...
Processing Laws.with_lemma.txt...
Processing Lovers.with_lemma.txt...
Processing Lysis.with_lemma.txt...
Processing Menex..with_lemma.txt...
Processing Meno.with_lemma.txt...
Processing Minos.with_lemma.txt...
Processing Parm..with_lemma.txt...
Processing Phaedo.with_lemma.txt...
Processing Phaedrus.with_lemma.txt...
Processing Phileb..with_lemm

In [14]:
df_tfidf

Unnamed: 0,text1,text2,sentence1,sentence2,similarity
0,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,ἀμφί νάρθηκας ὑβριστής ὁσιοῦσθ,ὑβριστής Σώκρατες,0.407940
1,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,ἅμα δέ έὐασμα τοιόσδε ἐπιβρέμω,λέγω τοιόσδε,0.382890
2,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,εἶμι εἰσάγγελλε Τειρεσίας ζητέω νιν,ζητέω,0.441513
3,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,ἥκω δέ ἑτοῖμος ὅδε ἔχω σκευή θεός,δέ ἔχω,0.338595
4,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,ἐξηγέομαι γέρων γέρων τειρεσίης,ἔχω ἐξηγέομαι,0.399596
...,...,...,...,...,...
22684,Bakchen.with_lemma.txt,Tim..with_lemma.txt,γιγνώσκω ταῦτ,οὐδαμῶς ταῦτ λέγω Σώκρατες,0.322417
22685,Bakchen.with_lemma.txt,Tim..with_lemma.txt,χαλεπός δέ ὅδε ἥκω,φάρμακον χαλεπός,0.354181
22686,Bakchen.with_lemma.txt,Tim..with_lemma.txt,χαλεπός δέ ὅδε ἥκω,ὅδε λέγω,0.350836
22687,Bakchen.with_lemma.txt,Tim..with_lemma.txt,ἔρχομαι δέ ὅπου μήτε κιθαιρών ἔμ εἶδον μιαρὸς ...,διό γεγονότος ὁρατός πάντως αἰσθητός μήτηρ ὑπο...,0.369104


### Similarity based on fastText:

In [66]:
columns = ['text1', 'text2', 'sentence1','sentence2','similarity']
all_rows = []
for f in mycorpus.fileids():
    if f != text1name:
        print(f"Processing {f}...")

        text2 = ' '.join(mycorpus.words(fileids=f))
        fasttext_df, sentences1, sentences2 = calculate_fasttext_similarity(text1, text2)
        pairs = get_sentence_pairs_above_threshold(fasttext_df, sentences1, sentences2, 0.5)

        for pair in pairs:
            all_rows.append([text1name, f, pair['sentence1'], pair['sentence2'], pair['similarity']])
            
df_fasttext = pd.DataFrame(all_rows, columns=columns)
            

Processing Alc.+1.with_lemma.txt...
Processing Alc.+2.with_lemma.txt...
Processing Apol..with_lemma.txt...
Processing Charm..with_lemma.txt...
Processing Cleit..with_lemma.txt...
Processing Crat..with_lemma.txt...
Processing Criti..with_lemma.txt...
Processing Crito.with_lemma.txt...
Processing Epin..with_lemma.txt...
Processing Epistels.with_lemma.txt...
Processing Euthyd..with_lemma.txt...
Processing Euthyph..with_lemma.txt...
Processing Gorg..with_lemma.txt...
Processing Hipp.+Maj.with_lemma.txt...
Processing Hipp.+Min.with_lemma.txt...
Processing Hipparch..with_lemma.txt...
Processing Ion.with_lemma.txt...
Processing Lach..with_lemma.txt...
Processing Laws.with_lemma.txt...
Processing Lovers.with_lemma.txt...
Processing Lysis.with_lemma.txt...
Processing Menex..with_lemma.txt...
Processing Meno.with_lemma.txt...
Processing Minos.with_lemma.txt...
Processing Parm..with_lemma.txt...
Processing Phaedo.with_lemma.txt...
Processing Phaedrus.with_lemma.txt...
Processing Phileb..with_lemm

In [17]:
df_fasttext

Unnamed: 0,text1,text2,sentence1,sentence2,similarity
0,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[παῖς, Κλεινίας, οἶμαι, θαυμάζω, πρῶτος, ἐραστ...",0.606096
1,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[αἴτιος, γίγνομαι, ἀνθρώπειος, δαιμόνιον, ἐναν...",0.382597
2,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[νῦν, ἐπεί, οὐκέτι, ἐναντιόω, προσελήλυθα]",0.396342
3,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[εὔελπις, δέ, λοιπός, ἐναντιόομαι]",0.457176
4,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[σχεδόν, κατανενόηκα, χρόνος, σκοπέω, ἐραστής,...",0.300826
...,...,...,...,...,...
18104900,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[τοιόσδε, ἀποβαίνω, πρᾶγμα]","[τέταρτος, γένος, ἔνυδρος, γίγνομαι, μάλιστα, ...",0.451645
18104901,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[τοιόσδε, ἀποβαίνω, πρᾶγμα]","[ὅθεν, ἰχθύς, ἔθνος, ὀστρέον, συνάπας, ὅσος, ἔ...",0.449337
18104902,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[τοιόσδε, ἀποβαίνω, πρᾶγμα]","[πᾶς, τότε, νῦν, διαμείβω, ζῷον, ἀλλήλων, νόος...",0.555502
18104903,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[τοιόσδε, ἀποβαίνω, πρᾶγμα]","[καὶ, τέλος, πᾶς, νῦν, ἤδη, λόγος, φάω, ἔχω]",0.571284


### Similarity based on MinHash:

In [71]:
columns = ['text1', 'text2', 'sentence1','sentence2','similarity']
all_rows = []
for f in mycorpus.fileids():
    if f != text1name:
        print(f"Processing {f}...")

        text2 = ' '.join(mycorpus.words(fileids=f))
        minhash_df, sentences1, sentences2 = calculate_minhash_similarity(text1, text2)
        pairs = get_sentence_pairs_above_threshold(minhash_df, sentences1, sentences2, 0.5)

        for pair in pairs:
            all_rows.append([text1name, f, pair['sentence1'], pair['sentence2'], pair['similarity']])
            
df_minhash = pd.DataFrame(all_rows, columns=columns)
            


Processing Alc.+1.with_lemma.txt...
Processing Alc.+2.with_lemma.txt...
Processing Apol..with_lemma.txt...
Processing Charm..with_lemma.txt...
Processing Cleit..with_lemma.txt...
Processing Crat..with_lemma.txt...
Processing Criti..with_lemma.txt...
Processing Crito.with_lemma.txt...
Processing Epin..with_lemma.txt...
Processing Epistels.with_lemma.txt...
Processing Euthyd..with_lemma.txt...
Processing Euthyph..with_lemma.txt...
Processing Gorg..with_lemma.txt...
Processing Hipp.+Maj.with_lemma.txt...
Processing Hipp.+Min.with_lemma.txt...
Processing Hipparch..with_lemma.txt...
Processing Ion.with_lemma.txt...
Processing Lach..with_lemma.txt...
Processing Laws.with_lemma.txt...
Processing Lovers.with_lemma.txt...
Processing Lysis.with_lemma.txt...
Processing Menex..with_lemma.txt...
Processing Meno.with_lemma.txt...
Processing Minos.with_lemma.txt...
Processing Parm..with_lemma.txt...
Processing Phaedo.with_lemma.txt...
Processing Phaedrus.with_lemma.txt...
Processing Phileb..with_lemm

In [19]:
df_minhash

Unnamed: 0,text1,text2,sentence1,sentence2,similarity
0,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, δέ, ἑτοῖμος, ὅδε, ἔχω, σκευή, θεός]","[δέ, ἔχω]",0.375000
1,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἀλλ, ὁμοίως, θεός, τιμή, ἔχω]","[ἀλλ, πάνυ, ἔχω]",0.312500
2,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἀλλ, ἔχω, χείρ]","[ἀλλ, πάνυ, ἔχω]",0.507812
3,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἀλλ, ἔχω, χείρ]",[ἔχω],0.367188
4,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἀλλ, ἔχω, χείρ]","[ἀλλ, εἴπερ, ἔχω, εἶπον]",0.398438
...,...,...,...,...,...
15144,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[τοῦτ, λέγω]","[εὖ, λέγω]",0.343750
15145,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[τοῦτ, λέγω]","[πῶς, τοῦτ, πῆ, εἰκότως, διαπορέω, λέγω]",0.406250
15146,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[τοῦτ, λέγω]","[ὅδε, λέγω]",0.351562
15147,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[δύστην, ἀλήθει, καιρός, πάρειμι]","[τάχ, καιρός, πρέπω, πάρειμι, διακριβολογεῖσθαι]",0.312500


### Result

The output shows that the most similar sentences are those consisting of only one word. This raises the question of whether further words should have been added to the stop word list. It is therefore worth trying to filter by sentence length:

In [50]:
def filter_data_frame(df, n):
    """Keep rows whose sentence1 has at least n words (string or token list)."""
    def n_words(sentence):
        return len(sentence.split()) if isinstance(sentence, str) else len(sentence)
    # Works per row, so an empty DataFrame no longer raises an IndexError
    return df[df['sentence1'].apply(n_words) >= n]

In [72]:
filter_data_frame(df_minhash, 4)

Unnamed: 0,text1,text2,sentence1,sentence2,similarity
297,Bakchen.with_lemma.txt,Euthyd..with_lemma.txt,"[ἐκἀ, ἐγὼ, δι, αἰδώς, εἶπον]","[ἐκἀ, ἐγὼ, εἶπον]",0.585938
298,Bakchen.with_lemma.txt,Euthyd..with_lemma.txt,"[ἐκἀ, ἐγὼ, δι, αἰδώς, εἶπον]","[ἐκἀ, ἐγὼ, εἶπον]",0.585938
661,Bakchen.with_lemma.txt,Laws.with_lemma.txt,"[δέ, ἀμαθία, κἀσεβοῦντ, θεός]","[δέ, θεός]",0.554688
1045,Bakchen.with_lemma.txt,Meno.with_lemma.txt,"[διδάσκω, σ, καλός, ἔχω]","[ἔχω, διδάσκω, ἔχω]",0.523438
1620,Bakchen.with_lemma.txt,Prot..with_lemma.txt,"[ζεύς, δέ, ἀντεμηχανήσαθ, οἷος, θεός]","[δέ, ζεύς, θεός]",0.65625
1641,Bakchen.with_lemma.txt,Prot..with_lemma.txt,"[ἀλλ, αἰδώς, μ, ἔχω]","[ἀλλ, ἔχω]",0.507812
1642,Bakchen.with_lemma.txt,Prot..with_lemma.txt,"[ἀλλ, αἰδώς, μ, ἔχω]","[ἀλλ, ἔχω]",0.507812
2022,Bakchen.with_lemma.txt,Soph..with_lemma.txt,"[τοῦτ, αὖ, παρωχέτευσας, εὖ, λέγω]","[πῶς, αὖ, τοῦτ, λέγω]",0.539062
2120,Bakchen.with_lemma.txt,Soph..with_lemma.txt,"[ἀλλ, αἰδώς, μ, ἔχω]","[ἀλλ, ἔχω]",0.507812


In [73]:
filter_data_frame(df_fasttext, 4)

Unnamed: 0,text1,text2,sentence1,sentence2,similarity
0,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[παῖς, Κλεινίας, οἶμαι, θαυμάζω, πρῶτος, ἐραστ...",0.606096
1,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[οἴει, πρῶτος, κάλλιστός, μέγας, —, πᾶς, δῆλος...",0.531936
2,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[σύμπας, εἶπον, μέγας, οἴει, δύναμις, ὑπάρχω, ...",0.569192
3,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[προστίθημι, πλουσίων]",0.588525
4,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,"[ἥκω, ζεύς, παῖς, θηβαῖος, χθών, Διόνυσος, τίκ...","[λέγω, φίλος, παῖς, Κλεινίας, Δεινομάχη]",0.567349
...,...,...,...,...,...
5526779,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[δοκηθέντ, τελέω, δέ, ἀδοκήτων, πόρος, ηὗρε, θ...","[κατ, ἐκεῖνος, χρόνος, θεός, συνουσία, ἔρως, ἐ...",0.693636
5526780,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[δοκηθέντ, τελέω, δέ, ἀδοκήτων, πόρος, ηὗρε, θ...","[τετράπους, γένος, φύω, πολύπους, προφάσεως, θ...",0.646157
5526781,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[δοκηθέντ, τελέω, δέ, ἀδοκήτων, πόρος, ηὗρε, θ...","[τέταρτος, γένος, ἔνυδρος, γίγνομαι, μάλιστα, ...",0.528396
5526782,Bakchen.with_lemma.txt,Tim..with_lemma.txt,"[δοκηθέντ, τελέω, δέ, ἀδοκήτων, πόρος, ηὗρε, θ...","[ὅθεν, ἰχθύς, ἔθνος, ὀστρέον, συνάπας, ὅσος, ἔ...",0.514328


In [74]:
filter_data_frame(df_tfidf, 4)

Unnamed: 0,text1,text2,sentence1,sentence2,similarity
9,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,διδάσκω σ καλός ἔχω,ἱκανός διδάσκω,0.510363
19,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,ὁράω ὁράω δίδωμι ὄργια,ὁράω,0.731223
20,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,ὁράω ὁράω δίδωμι ὄργια,ὁράω,0.731223
21,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,δέ ἱερός νύκτωρ μεθ ἡμέρα τελέω,οἱόμαι νύκτωρ μεθ ἡμέρα ἐξιὼν ἔνδοθεν,0.521312
78,Bakchen.with_lemma.txt,Alc.+1.with_lemma.txt,μήν ξυνεθέμην γ θεός,μήν,0.532532
...,...,...,...,...,...
4543,Bakchen.with_lemma.txt,Theag..with_lemma.txt,νῦν δέ ὁράω χρή σ ὁράω,νῦν — ὁράω,0.816640
4548,Bakchen.with_lemma.txt,Theag..with_lemma.txt,μόνος πόλις ὅδε ὑπερκάμνω μόνος,ναί ἀλλ λέγω μόνος πόλις,0.565922
4556,Bakchen.with_lemma.txt,Theag..with_lemma.txt,ποῖος ἔρχομαι οἶκος ὑμεναῖος,ποῖος,0.510251
4559,Bakchen.with_lemma.txt,Theag..with_lemma.txt,δῆτα μέλλεθ ἀναγκαῖος ἔχω,δῆτα,0.515521


### Average similarity

Filtering helps to pick out the longer sentences among the similar ones. Now, however, we want to see which of Plato's works are most similar to the Bacchae according to the different similarity measures. To do this, I summed the respective similarity scores per work and divided the sum by the length of that work, so that the values are comparable. I also kept the option to filter: n removes all sentences shorter than n words.

In [56]:
len_dict = {}
for f in mycorpus.fileids():
    len_dict[f] = len(mycorpus.words(fileids=f))
len_dict

{'Alc.+1.with_lemma.txt': 7418,
 'Alc.+2.with_lemma.txt': 3002,
 'Apol..with_lemma.txt': 5859,
 'Bakchen.with_lemma.txt': 7513,
 'Charm..with_lemma.txt': 5963,
 'Cleit..with_lemma.txt': 1065,
 'Crat..with_lemma.txt': 12400,
 'Criti..with_lemma.txt': 3436,
 'Crito.with_lemma.txt': 2776,
 'Epin..with_lemma.txt': 4482,
 'Epistels.with_lemma.txt': 11522,
 'Euthyd..with_lemma.txt': 9416,
 'Euthyph..with_lemma.txt': 3515,
 'Gorg..with_lemma.txt': 17921,
 'Hipp.+Maj.with_lemma.txt': 6038,
 'Hipp.+Min.with_lemma.txt': 2993,
 'Hipparch..with_lemma.txt': 1569,
 'Ion.with_lemma.txt': 2728,
 'Lach..with_lemma.txt': 5145,
 'Laws.with_lemma.txt': 71845,
 'Lovers.with_lemma.txt': 1791,
 'Lysis.with_lemma.txt': 5120,
 'Menex..with_lemma.txt': 3246,
 'Meno.with_lemma.txt': 6737,
 'Minos.with_lemma.txt': 1941,
 'Parm..with_lemma.txt': 9988,
 'Phaedo.with_lemma.txt': 15472,
 'Phaedrus.with_lemma.txt': 11914,
 'Phileb..with_lemma.txt': 12873,
 'Prot..with_lemma.txt': 12337,
 'Republic.with_lemma.txt': 663
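
The normalisation described above can be sketched in plain Python. The two word counts are taken from `len_dict`; the similarity rows are made-up toy values, and the variable names are hypothetical.

```python
from collections import defaultdict

# Word counts from len_dict; the similarity rows below are invented toy values.
lengths = {"Minos.with_lemma.txt": 1941, "Laws.with_lemma.txt": 71845}
rows = [
    ("Minos.with_lemma.txt", 0.6),
    ("Laws.with_lemma.txt", 0.6),
    ("Laws.with_lemma.txt", 0.7),
    ("Laws.with_lemma.txt", 0.8),
]

# Sum the similarity scores per compared text.
totals = defaultdict(float)
for text2, similarity in rows:
    totals[text2] += similarity

# Divide by the length of the compared text to make the sums comparable.
normalized = {name: total / lengths[name] for name, total in totals.items()}
ranking = sorted(normalized, key=normalized.get, reverse=True)
```

In this toy case the long text collects the larger raw sum, but the short text ranks first after dividing by length. Dividing by word count can therefore favour short works, which is worth keeping in mind when reading the rankings.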

In [57]:
import matplotlib.pyplot as plt
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

### With TF-IDF:

In [62]:
# Define interactive plotting function
def plot_similarity(n):
    df_filtered = filter_data_frame(df_tfidf, n).groupby('text2', as_index=False)['similarity'].sum()
    for index, row in df_filtered.iterrows():
        df_filtered.loc[index, 'similarity'] = row['similarity']/len_dict[row['text2']]
    
    plt.figure(figsize=(8, 5))
    plt.bar(df_filtered['text2'], df_filtered['similarity'], color='skyblue')
    plt.xlabel('Text Name')
    plt.ylabel('Normalized Similarity')
    plt.title(f'Similarity Sum for n >= {n}')
    plt.xticks(rotation=90)
    plt.show()

# Create slider and display plot interactively
slider = widgets.FloatSlider(min=1, max=10, step=1, value=2, description='n:')
widgets.interactive(plot_similarity, n=slider)


interactive(children=(FloatSlider(value=2.0, description='n:', max=10.0, min=1.0, step=1.0), Output()), _dom_c…

### Results for TF-IDF: top three for n >=
- 1: Sophistes, Alkibiades Major, Parmenides
- 2: Alkibiades Major, Sophistes, Lysis
- 3: Lysis, Alkibiades Major, Minos
- 4: Theages, Ion, Minos

### Results for TF-IDF with similarity threshold 0.7 instead of 0.5
- 1: Sophistes, Parmenides, Lysis
- 2: Parmenides, Sophistes, Lysis
- 3: Theages, Sophistes, Alkibiades Major
- 4: Theages, Sophistes, Euthydemus

### With fastText:

In [67]:
# Define interactive plotting function
def plot_similarity(n):
    df_filtered = filter_data_frame(df_fasttext, n).groupby('text2', as_index=False)['similarity'].sum()
    for index, row in df_filtered.iterrows():
        df_filtered.loc[index, 'similarity'] = row['similarity']/len_dict[row['text2']]
    
    plt.figure(figsize=(8, 5))
    plt.bar(df_filtered['text2'], df_filtered['similarity'], color='skyblue')
    plt.xlabel('Text Name')
    plt.ylabel('Normalized Similarity')
    plt.title(f'Similarity Sum for n >= {n}')
    plt.xticks(rotation=90)
    plt.show()

# Create slider and display plot interactively
slider = widgets.FloatSlider(min=1, max=12, step=1, value=2, description='n:')
widgets.interactive(plot_similarity, n=slider)

interactive(children=(FloatSlider(value=2.0, description='n:', max=12.0, min=1.0, step=1.0), Output()), _dom_c…

### Results for fastText: top three for n >=
- 1, 2, 3 & 4: Minos, Theages, Menexenos

### Results for fastText with similarity threshold 0.7 instead of 0.5
- 1: Critias, Ion, Minos
- 2, 3 & 4: Critias, Ion, Menexenos

### With MinHash:

In [75]:
# Define interactive plotting function
def plot_similarity(n):
    df_filtered = filter_data_frame(df_minhash, n).groupby('text2', as_index=False)['similarity'].sum()
    for index, row in df_filtered.iterrows():
        df_filtered.loc[index, 'similarity'] = row['similarity']/len_dict[row['text2']]
    
    plt.figure(figsize=(8, 5))
    plt.bar(df_filtered['text2'], df_filtered['similarity'], color='skyblue')
    plt.xlabel('Text Name')
    plt.ylabel('Normalized Similarity')
    plt.title(f'Similarity Sum for n >= {n}')
    plt.xticks(rotation=90)
    plt.show()

# Create slider and display plot interactively
slider = widgets.FloatSlider(min=1, max=8, step=1, value=2, description='n:')
widgets.interactive(plot_similarity, n=slider)

interactive(children=(FloatSlider(value=2.0, description='n:', max=8.0, min=1.0, step=1.0), Output()), _dom_cl…

### Results for MinHash: top three for n >=
- 1: Sophistes, Philebus, Parmenides
- 2: Sophistes, Alkibiades Major, Parmenides
- 3: Alkibiades Major, Meno, Sophistes
- 4: numbers too small

### Results for MinHash with similarity threshold 0.7 instead of 0.5
- 1: Sophistes, Parmenides, Lysis
- 2: Crito, Philebus, Sophistes
- 3: Menexenus, Lysis, Apology
- 4: numbers too small