Drew Lickman\
CSCI 4820-001\
Project #1\
Due: 9/9/24

# Minimum Distance Edit Algorithm

## Assignment Requirements:

### Input
---

- [words.txt](words.txt) is the input, which holds **lowercase** sets of words on each line.
    - The first word of each line is the <u>target</u>, and all the other words in the line are <u>source</u> words that will transform into the target
- Sample input files
    - [costs.csv](costs.csv) uses Levenshtein substitution costs
    - [costs2.csv](costs2.csv) uses confusion matrix substitution costs

### Processing
---

- Insertions and Deletions cost 1
- Substitution costs are read from the costs.csv files
- For each pair of source and target words, use the Minimum Edit Distance algorithm (using both cost methods)
    - Then output the backtrace of operations (K I S D)
    - Must be able to capture all possible sources for the minimum cost at each cell
    - Randomly select one of the possible cells that provide the minimum cost
    - Do NOT seed random number generator

### Output
---

- 4 lines per method (costs and costs2)
1. Source word
2. Vertical bar for each operation per character
3. Target word
4. Operation for each character and sum of edit cost 
    - k = keep
    - i = insert
    - s = substitute
    - d = delete
- 50 hyphens will separate a pair of words from the next pair

## Python Code

In [226]:
# This block of code processes words.txt and defines the targets and sources
targets = []
sources = []
pairTargSources = []

costsFile = "costs.csv"
costs2File = "costs2.csv"
costMethods = [costsFile, costs2File]

# Read & save the target and source words from words.txt
with open("words.txt") as wordList:
    lines = wordList.readlines()

    for line in lines:
        currentLine = line.split()      # Split the line into words
        targets.append(currentLine[0])  # Target word is first in each line
        sources.append(currentLine[1:]) # Source words are everything after the first word of the line
                                        # Also, the [1:] saves it as an array, even if it's just one source word

for i in range(len(targets)):
    pairTargSources.append([targets[i], sources[i]]) # Explicitly pair the sources to their target

print("targets:", targets)
print("sources:", sources)
print("\nTargets with matching sources: ")
for pair in pairTargSources:
    print(pair)


targets: ['mischievous', 'execute']
sources: [['mischief', 'devious'], ['intention']]

Targets with matching sources: 
['mischievous', ['mischief', 'devious']]
['execute', ['intention']]


In [227]:
# This block of code reads the CSV files and calculates the cost of substituting letters

alphabet = {letter: index for index, letter in enumerate("abcdefghijklmnopqrstuvwxyz")}
substitutionCost = []
# Function able to work with costs.csv and costs2.csv
# Can improve by saving file first time it's read
def readCostFromCSV(file):
    with open(file) as costList:
        costLines = costList.readlines()

        for costLine in costLines:
            currentCostLine = costLine.split(",") # Split each value by commas
            substitutionCost.append(currentCostLine) # Save substitution cost to array, indexed by 2D alphabet
        substitutionCost.pop(0) # Remove the first line of cost.csv

        # Cleanup
        for letter in substitutionCost:
            letter.pop(0) # Remove the letter from each cost array
            letter[-1].rstrip() # Remove the newlines  #I don't think this is working

    # Accessible with alphabet dictionary
    #print(substitutionCost[alphabet["k"]][alphabet["z"]])
    return substitutionCost # Return the entire cost 2D array

# Returns either Levenshtein cost or Confusion Matrix cost
def getCostFromCSV(file, letter1, letter2):
    cost = readCostFromCSV(file)
    return int(cost[alphabet[letter1]][alphabet[letter2]]) # Calculate the specific substitution cost

getCostFromCSV(costs2File, "a", "i") # "a" substitution TO "i" returns 118

118

In [228]:
file = costsFile
operation = {"keep": 0, "insert": 1, "delete": 1, "substitute": getCostFromCSV(file, letter1="a", letter2="b")}

print(operation["substitute"])

9


In [240]:
# This block of code outputs the results of the targets and sources

for pair in pairTargSources: # mischevious, execute
    #print("TARGET: ", pair[0])

    for sourceList in pair[1:]: # [mischief, devious], [intention]
        #print("SOURCELIST: ", sourceList, end="\n")

        for singleSource in sourceList: # mischief, devious
            #print("SOURCE WORD: ", singleSource)

            for costMethod in costMethods: # costsFile, costs2File
                #print("Levenshtein" if costMethod == costsFile else "Confusion Matrix")

                for letter in singleSource: # print letters in source
                    print(letter, "", end="")
                print()

                for letter in pair[0]: # | times target length
                    print("| ", end="")
                print()

                for letter in pair[0]: # print letters in target
                    print(letter, "", end="")
                print()

                # MED Algo here?
                for i in range(len(pair[0])):
                    # If the letters are the same, print k for Keep
                    #if (pair[0][i] == singleSource[i]) and (pair[0][i] is not None) and (singleSource[i] is not None):
                        print("k ", end="")
                    #else:
                    #   print("! ", end="")


                print("\n")

            print("-" * 50)


#print(backtrace())

m i s c h i e f 
| | | | | | | | | | | 
m i s c h i e v o u s 
k k k k k k k k k k k 

m i s c h i e f 
| | | | | | | | | | | 
m i s c h i e v o u s 
k k k k k k k k k k k 

--------------------------------------------------
d e v i o u s 
| | | | | | | | | | | 
m i s c h i e v o u s 
k k k k k k k k k k k 

d e v i o u s 
| | | | | | | | | | | 
m i s c h i e v o u s 
k k k k k k k k k k k 

--------------------------------------------------
i n t e n t i o n 
| | | | | | | 
e x e c u t e 
k k k k k k k 

i n t e n t i o n 
| | | | | | | 
e x e c u t e 
k k k k k k k 

--------------------------------------------------
