# Cosine Similarity (TF-IDF Vectorizer)

### Description of the Code

The script below compares sentences from disputed Latin plays against the plays of Seneca and flags pairs that are highly similar. The workflow has the following steps:

1. **Import Required Libraries**:
    - The script imports necessary libraries for file handling (`os`), CSV writing (`csv`), string manipulation (`string`), Latin text processing (`cltk`), and text similarity analysis (`sklearn`).

2. **Setup and Initialize**:
    - The script downloads and imports Latin language models using CLTK (Classical Language Toolkit) and initializes a sentence tokenizer for Latin.

3. **Define Directories and Texts**:
    - It sets up paths for the directory containing the corpus of texts and a results directory where the output will be saved.
    - A list of disputed texts (plays) is specified.

4. **Function Definitions**:
    - **`read_file(file_path)`**: Reads the content of a file given its path.
    - **`tokenize_sentences(text)`**: Splits a text into sentences, strips ASCII punctuation, and lowercases each sentence.
    - **`compare_sentences(play1, play2, threshold=0.6)`**: Converts the sentences of both plays to TF-IDF vectors over character 4-grams and computes the cosine similarity of every sentence pair. Pairs scoring at or above the threshold are returned.

5. **Process Senecan Plays**:
    - The script iterates through the files in the specified directory, reads each Senecan play (any file whose name starts with `sen_`, excluding the disputed texts), and tokenizes it into sentences. The sentences are stored in a dictionary keyed by play name.

6. **Process and Compare Disputed Plays**:
    - For each disputed text, the script reads and tokenizes the content.
    - It then compares the sentences of the disputed text with each Senecan play using cosine similarity.
    - Similar sentences are identified and the results are saved in CSV files for detailed examination.

7. **Modify Disputed Texts**:
    - Sentences in the disputed texts that are similar to those in Senecan plays are removed.
    - The percentage of sentences removed is calculated and printed.
    - The modified versions of the disputed texts, with similar sentences removed, are saved to a specified directory.
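
The comparison and removal steps above can be sketched on a toy input without CLTK or any files. This is a minimal illustration, not the script itself: `normalize` and `similar_pairs` are hypothetical helper names, the vectorizer settings are copied from `compare_sentences`, and the sentences are taken from the output shown further down with invented punctuation added.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# translation table that deletes ASCII punctuation (built once, reused)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def normalize(sentence):
    """Strip ASCII punctuation and lowercase (hypothetical helper)."""
    return sentence.translate(PUNCT_TABLE).lower()


def similar_pairs(disputed, reference, threshold=0.6):
    """Return (i, j, score) for every pair scoring at or above the threshold."""
    vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True,
                                 analyzer='char', ngram_range=(4, 4))
    # fit on both lists together so they share one vocabulary and one set of IDF weights
    vectors = vectorizer.fit_transform(disputed + reference)
    # rows: disputed sentences, columns: reference sentences
    scores = cosine_similarity(vectors[:len(disputed)], vectors[len(disputed):])
    return [(i, j, float(scores[i, j]))
            for i in range(len(disputed))
            for j in range(len(reference))
            if scores[i, j] >= threshold]


# toy input (assumption: punctuation and capitalization are invented for the demo)
disputed = [normalize(s) for s in
            ["Scelus occupandum est.", "Et hoc sat est!", "Parere dubitas?"]]
reference = [normalize(s) for s in
             ["scelus occupandum est", "habet, peractum est"]]

pairs = similar_pairs(disputed, reference)

# count each disputed sentence once, even if it matches several reference sentences
removed = {i for i, _, _ in pairs}
percentage_removed = len(removed) / len(disputed) * 100

for i, j, score in pairs:
    print(f"{score:.4f}  {disputed[i]!r}  ~  {reference[j]!r}")
print(f"Removed {len(removed)} of {len(disputed)} sentences ({percentage_removed:.2f}%)")
```

On this input, the identical pair `scelus occupandum est` is the only one expected to clear the 0.6 threshold, since the other sentences share very few character 4-grams. Collecting the matched indices in a set keeps the removal percentage tied to distinct sentences rather than to the number of matching pairs.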

In [1]:
%%time

import os
import csv
import string
from cltk.data.fetch import FetchCorpus
from cltk.sentence.lat import LatinPunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# import the Latin model for the sentence tokenizer
corpus_downloader = FetchCorpus(language='lat')
corpus_downloader.import_corpus('lat_models_cltk')
corpus_downloader.import_corpus('latin_training_set_sentence_cltk')

# initialize CLTK sentence tokenizer for Latin
sentence_tokenizer = LatinPunktSentenceTokenizer(strict=True)

# directory containing the corpus of texts
directory_path = '../../corpora/corpus_imposters/'

# directory to write the results
results_directory = os.path.join('..', 'lines-similarity', 'results_line_sim_cosine')
os.makedirs(results_directory, exist_ok=True)

# list of disputed texts
disputed_texts = ['sen_oct.txt', 'sen_her_o.txt']

# dictionary to save sentences for each play
play_sentences = {}

# dictionary to store the count of removed sentences for each disputed play
removed_sentence_count = {}

# function to read files
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# function to tokenize sentences
def tokenize_sentences(text):
    sentences = sentence_tokenizer.tokenize(text)
    # build the ASCII punctuation-stripping table once instead of once per sentence
    punct_table = str.maketrans('', '', string.punctuation)
    return [sentence.translate(punct_table).lower() for sentence in sentences]

# function to compare sentences using cosine similarity
def compare_sentences(play1, play2, threshold=0.6):
    """
    Convert sentences into vectors and compare them using cosine similarity.
    - strip_accents='unicode': removes accents from characters using Unicode
    - lowercase=True: converts all characters to lowercase
    - analyzer='char': analyzes the input as a sequence of characters
    - ngram_range=(4, 4): considers 4-grams as features
    """
    vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True, analyzer='char', ngram_range=(4, 4))
    vectors = vectorizer.fit_transform(play1 + play2)

    # compute cosine similarity by comparing the vectors
    similarities = cosine_similarity(vectors[:len(play1)], vectors[len(play1):])

    # find similar sentences
    similar_lines = []
    for i in range(len(play1)):
        for j in range(len(play2)):
            if similarities[i, j] >= threshold:
                similar_lines.append((i, j, similarities[i, j], play1[i], play2[j]))

    return similar_lines

# process each file in the directory
for filename in os.listdir(directory_path):
    file_path = os.path.join(directory_path, filename)

    # check if the file is a Senecan play and not disputed
    if filename.startswith("sen_") and filename not in disputed_texts:
        # clean filename to make the result more readable
        clean_filename = os.path.splitext(os.path.basename(filename))[0].replace('_', ' ').capitalize()
        # read the file
        text = read_file(file_path)
        # split into sentences
        sentences = tokenize_sentences(text)
        # save the results into the dictionary
        play_sentences[clean_filename] = sentences

# process disputed plays
for disputed_text in disputed_texts:
    # clean filename to make results more readable
    disputed_text_name = os.path.splitext(os.path.basename(disputed_text))[0].replace('_', ' ').capitalize()
    disputed_path = os.path.join(directory_path, disputed_text)
    # read the contents of the file
    disputed_text_content = read_file(disputed_path)
    # split into sentences
    disputed_sentences = tokenize_sentences(disputed_text_content)

    # compare sentences with each Senecan play using cosine similarity
    for senecan_play, senecan_sentences in play_sentences.items():
        similar_lines = compare_sentences(disputed_sentences, senecan_sentences)

        # save results to CSV for closer examination
        clean_senecan_play = senecan_play  # dictionary keys are already cleaned play names
        csv_filename = os.path.join(results_directory, f'similarity_{disputed_text_name}_vs_{clean_senecan_play}_res.csv')
        with open(csv_filename, 'w', encoding='utf-8', newline='') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Disputed Sentence Index', 'Senecan Sentence Index', 'Similarity', 'Disputed Sentence', 'Senecan Sentence'])
            for i, j, similarity, disputed_sentence, senecan_sentence in similar_lines:
                csv_writer.writerow([i, j, similarity, disputed_sentence, senecan_sentence])

        print(f"Examples for {disputed_text_name} vs {clean_senecan_play}:")
        for i, j, similarity, disputed_sentence, senecan_sentence in similar_lines:
            print(f"Similarity: {similarity:.4f}")
            print(f"Disputed Sentence: {disputed_sentence}")
            print(f"Senecan Sentence: {senecan_sentence}")

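        # count and store the number of distinct disputed sentences removed
        # (a sentence matching several Senecan sentences is counted once)
        removed_sentence_count[(disputed_text_name, clean_senecan_play)] = len({i for i, *_ in similar_lines})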

        # remove similar sentences from disputed text
        for i, _, _, _, _ in similar_lines:
            disputed_sentences[i] = ""  # Replace similar sentences with an empty string

    # calculate and print the percentage of removed sentences for this disputed play only
    total_sentences = len(disputed_sentences)
    removed_sentences = sum(
        count for (name, _), count in removed_sentence_count.items()
        if name == disputed_text_name
    )
    percentage_removed = (removed_sentences / total_sentences) * 100
    print(f"Total Sentences in {disputed_text_name}: {total_sentences}")
    print(f"Removed Sentences: {removed_sentences}")
    print(f"Percentage Removed: {percentage_removed:.2f}%")

    # write the modified disputed text (similar sentences removed) to the cento corpus directory;
    # the file itself is saved under a lowercased, underscore-separated name (e.g. sen_oct.txt)
    modified_disputed_path = os.path.join('../../corpora/corpus_imposters_cento/', f'{disputed_text_name}.txt')
    os.makedirs(os.path.dirname(modified_disputed_path), exist_ok=True)
    with open(modified_disputed_path.lower().replace(" ", "_"), 'w', encoding='utf-8') as modified_file:
        modified_file.write('\n'.join([sentence for sentence in disputed_sentences if sentence]))

    print(f"Modified disputed text saved to: {modified_disputed_path}")

Examples for Sen oct vs Sen ag:
Examples for Sen oct vs Sen thy:
Examples for Sen oct vs Sen her f:
Examples for Sen oct vs Sen phaed:
Examples for Sen oct vs Sen phoen:
Similarity: 0.7437
Disputed Sentence: et hoc sat est
Senecan Sentence: nec hoc sat est
Examples for Sen oct vs Sen oed:
Examples for Sen oct vs Sen med:
Similarity: 0.6417
Disputed Sentence: parere dubitas
Senecan Sentence: profugere dubitas
Examples for Sen oct vs Sen tro:
Total Sentences in Sen oct: 422
Removed Sentences: 2
Percentage Removed: 0.47%
Modified disputed text saved to: ../../corpora/corpus_imposters_cento/Sen oct.txt
Examples for Sen her o vs Sen ag:
Similarity: 1.0000
Disputed Sentence: scelus occupandum est
Senecan Sentence: scelus occupandum est
Similarity: 0.7694
Disputed Sentence: peractum est
Senecan Sentence: habet peractum est
Similarity: 0.6146
Disputed Sentence: habet peractum est quas petis poenas dedit
Senecan Sentence: habet peractum est
Similarity: 0.7574
Disputed Sentence: heu quid hoc
Sen