# Introduction

The basic intuition behind fuzzy string matching (also known as approximate string matching) is fuzzy logic. Fuzzy logic is not fixed and exact but approximate. In numerical terms, it returns a value between 0 and 1, unlike Boolean logic, which returns either `0` or `1` (`False` or `True`).

Following this intuition, we can compare two strings and measure how similar they are to each other. Plagiarism detection works the same way: it looks not only for identical sentences but for similar ones as well.

To do this, we will use the Python library [`fuzzywuzzy`](https://pypi.org/project/fuzzywuzzy/). According to the documentation, this library "uses the [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to calculate the differences between sequences".
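To make the idea of an edit distance concrete, here is a minimal pure-Python sketch of the Levenshtein distance, together with one common way of turning it into a score between 0 and 1. The function names `levenshtein` and `similarity` are illustrative, and the normalization by the longer string's length is an assumption; it is not necessarily how `fuzzywuzzy` derives its scores.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # keep the inner row as short as possible
    if len(a) < len(b):
        a, b = b, a
    # distances from the empty prefix of a to every prefix of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Illustrative normalization (an assumption): 1.0 means identical,
    0.0 means every position has to be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("kitten", "kitten"))    # 1.0
```

A Boolean comparison (`a == b`) would only tell us that "kitten" and "sitting" differ; the distance tells us by how much.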

The library provides several functions that can be used to find similar lines. Some of them are the following:
- `ratio()`
- `partial_ratio()`
- `token_set_ratio()`
- `token_sort_ratio()`

The `ratio()`, `partial_ratio()`, `token_set_ratio()`, and `token_sort_ratio()` functions from the `fuzzywuzzy` library are all suitable for comparing strings for similarity. However, when it comes to finding reworked lines in two texts, they may not always be the best choice.

- `ratio()` compares the two full strings character by character and computes their similarity from the number of matching characters relative to the total number of characters. `partial_ratio()` does the same, but scores the shorter string against the best-matching substring of the longer one. While these functions can detect similar lines, they may not be able to distinguish between verbatim and reworked lines.

- `token_set_ratio()` and `token_sort_ratio()` functions compare two strings based on the set of their unique tokens or sorted tokens, respectively. These functions can handle reordering and additional tokens in the compared strings, but they still do not distinguish between verbatim and reworked lines.
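The behaviour of the four scorers described above can be approximated with the standard library alone. The sketch below uses `difflib.SequenceMatcher` as the underlying character-level comparison and is a simplified stand-in, not the library's implementation: `fuzzywuzzy`'s own functions differ in detail (for example in how they preprocess strings and in how `partial_ratio()` chooses the substrings to compare), so scores may not match exactly. The sample strings are only illustrations.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # character-level similarity of the full strings, scaled to 0-100
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a, b):
    # best score of the shorter string against every equally long
    # window of the longer string (brute-force simplification)
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    windows = (longer[i:i + len(shorter)]
               for i in range(len(longer) - len(shorter) + 1))
    return max(ratio(shorter, window) for window in windows)


def token_sort_ratio(a, b):
    # sort the tokens first, so that word order no longer matters
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return ratio(sorted_a, sorted_b)


def token_set_ratio(a, b):
    # compare the shared tokens with each string's "shared + extra"
    # tokens, so that additional words are penalized less
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    return max(ratio(common, combined_a),
               ratio(common, combined_b),
               ratio(combined_a, combined_b))


line = "arma virumque cano"
reordered = "cano arma virumque"
extended = "arma virumque cano Troiae qui primus"

print(ratio(line, reordered))             # below 100: order matters
print(token_sort_ratio(line, reordered))  # 100: same tokens
print(token_set_ratio(line, extended))    # 100: extra tokens ignored
print(partial_ratio("virumque", line))    # 100: exact substring
```

Note that a reordered line and a verbatim line both receive a score of 100 from `token_sort_ratio()`, which is why these scores alone cannot separate verbatim from reworked lines.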

When comparing two texts for reworked lines, you may need to consider the context and the meaning of the lines. 
Two lines may have different words but convey the same meaning, or they may have the same words but different meanings in different contexts.
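A small experiment illustrates this limitation. The sketch below scores an invented sentence against a paraphrase and against its negation, using `difflib.SequenceMatcher` from the standard library as a stand-in for a character-based scorer (an assumption; `fuzzywuzzy` scores may differ in value).

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    # character-based similarity between 0.0 and 1.0
    return SequenceMatcher(None, a, b).ratio()


original = "the king is dead"
paraphrase = "the monarch has died"  # same meaning, different words
negation = "the king is not dead"    # same words, opposite meaning

paraphrase_score = char_similarity(original, paraphrase)
negation_score = char_similarity(original, negation)

# the negation looks more similar than the paraphrase, because the
# score only reflects shared characters, not meaning
print(paraphrase_score < negation_score)  # True
```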

In [10]:
from glob import glob
import re
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


def remove_punctuation(filename):
    """Read a text file and return its content without punctuation."""
    with open(filename, 'r') as inp:
        text = inp.read()
    # remove everything that is neither a word character nor whitespace
    # (\s already covers \n; note that \w keeps underscores)
    text_no_punctuation = re.sub(r'[^\w\s]', '', text)
    # split into lines to maintain the original format
    lines = text_no_punctuation.splitlines()
    # join the lines into a single string again
    return "\n".join(lines)


# read the disputed text and remove punctuation
disputed_text_octavia = remove_punctuation(
    '../verse_corpus_imposters/sen_oct.txt')


# define a function that will merge all the original texts into a list
# this is a prerequisite for the extract() function to work
# https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py
def merge_original_texts(directory):
    # `directory` is a glob pattern such as "some/folder/*.txt"
    # files to leave out: Hercules Oetaeus and Octavia
    excluded = (
        "../verse_corpus_imposters/sen_her_o.txt",
        "../verse_corpus_imposters/sen_oct.txt",
    )
    known_texts = []
    # sort the paths so that the order of the texts is reproducible
    for filename in sorted(glob(directory)):
        if filename not in excluded:
            known_texts.append(remove_punctuation(filename))
    return known_texts


# define a function that will compare the similarity of lines from the disputed text
# against all the other texts in a given corpus
def find_similar_lines(disputed_text, non_disputed_texts, output_filename,
                       threshold=80):
    """
    Write every disputed line whose best match in one of the original
    texts scores at least `threshold` (0-100) to `output_filename`.

    requirements (of fuzzywuzzy):
     - python >= 2.7
     - difflib
     - python-Levenshtein (optional, speeds up the matching)
    """
    # split the disputed text into lines, skipping blank ones
    lines_disputed_text = [
        line for line in disputed_text.splitlines() if line.strip()]
    # split each original text into its own list of non-blank lines
    lines_per_known_text = [
        [line for line in text.splitlines() if line.strip()]
        for text in non_disputed_texts]
    matching_lines = []  # list to populate with the matching lines
    for disputed_line in lines_disputed_text:
        for known_lines in lines_per_known_text:
            # best-scoring line of this text (default scorer: fuzz.WRatio);
            # the list is empty if the text has no lines
            matches = process.extract(disputed_line, known_lines, limit=1)
            if matches and matches[0][1] >= threshold:
                matching_lines.append((
                    disputed_line,  # disputed line text
                    matches[0][0],  # matching line in original text
                    matches[0][1],  # matching score
                ))
    with open(output_filename, 'w') as out:
        for line in matching_lines:
            out.write("Disputed line: {}\n".format(line[0]))
            out.write("Matching line: {}\n".format(line[1]))
            out.write("Similarity score: {}\n\n".format(line[2]))


# set the path to the directory containing the original texts
path_to_corpus = "../verse_corpus_imposters/*.txt"

# merge all the original texts into a single list
original_texts = merge_original_texts(path_to_corpus)

# find matching lines and write them to a file
output_filename = "matching_lines_octavia.txt"
find_similar_lines(disputed_text_octavia, original_texts, output_filename)
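The pipeline above depends on the corpus files and on `fuzzywuzzy` being installed. To try the same steps (strip punctuation, find the best match per known text, keep matches at or above a threshold) on in-memory data, the following sketch may help. It is a simplified stand-in under stated assumptions: it scores with `difflib.SequenceMatcher` instead of `fuzz.WRatio`, the default scorer of `process.extract()`, so its scores will differ from the real run. The helper names `strip_punctuation`, `score` and `best_matches` and the sample strings are invented for illustration.

```python
import re
from difflib import SequenceMatcher


def strip_punctuation(text):
    # same idea as remove_punctuation(), but on a string instead of a file
    return re.sub(r"[^\w\s]", "", text)


def score(a, b):
    # stand-in scorer (assumption): plain character ratio, scaled to 0-100
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches(disputed_lines, known_texts, threshold=80):
    """Return (disputed line, best match, score, text index) tuples."""
    results = []
    for line in disputed_lines:
        if not line.strip():
            continue  # nothing to compare
        for text_index, known_lines in enumerate(known_texts):
            candidates = [c for c in known_lines if c.strip()]
            if not candidates:
                continue  # empty text: no possible match
            best = max(candidates, key=lambda c: score(line, c))
            best_score = score(line, best)
            if best_score >= threshold:
                results.append((line, best, best_score, text_index))
    return results


# invented toy data: one disputed text, two known texts
disputed = strip_punctuation(
    "Sing, goddess, the wrath!\nA line with no counterpart.").splitlines()
known = [
    ["Sing goddess the wrath", "Something else entirely"],
    ["Unrelated words here"],
]

for match in best_matches(disputed, known):
    print(match)
```

Recording the index of the matching text, as done here, is one way to trace each hit back to its source; the original output file lists only the two lines and their score.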