Drew Lickman\
CSCI 4820-001\
Project #1\
Due: 9/9/24

# Minimum Edit Distance Algorithm

## Assignment Requirements:

### Input
---

- [words.txt](words.txt) is the input file; each line holds a set of **lowercase** words.
    - The first word of each line is the <u>target</u>, and every other word on the line is a <u>source</u> word to be transformed into the target
- Sample input files
    - [costs.csv](costs.csv) uses Levenshtein substitution costs
    - [costs2.csv](costs2.csv) uses confusion matrix substitution costs

### Processing
---

- Insertions and deletions each cost 1
- Substitution costs are read from costs.csv and costs2.csv
- For each pair of source and target words, run the Minimum Edit Distance algorithm once per cost method
    - Then output the backtrace of operations (K I S D)
    - Each cell must capture every neighboring cell that can produce its minimum cost
    - During the backtrace, randomly select one of the cells that provide the minimum cost
    - Do NOT seed the random number generator
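
The processing rules above can be sketched as a single table-filling pass that also records, for every cell, which operations tie for the minimum. This is a minimal illustration and not the notebook's implementation: `med_with_pointers` and `levenshtein_sub_cost` are hypothetical names, and the 0/2 substitution cost is an assumption standing in for the values read from costs.csv.

```python
def levenshtein_sub_cost(a, b):
    """Assumed stand-in for costs.csv: 0 for a match, 2 for a substitution."""
    return 0 if a == b else 2


def med_with_pointers(source, target, sub_cost=levenshtein_sub_cost):
    """Fill the edit distance table and record every operation that ties for the minimum."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    ptrs = [[[] for _ in range(cols)] for _ in range(rows)]

    # First column: only deletions are possible. First row: only insertions.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + 1
        ptrs[i][0] = ["d"]
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + 1
        ptrs[0][j] = ["i"]

    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            candidates = {
                "d": dist[i - 1][j] + 1,   # cell above: delete source[i-1]
                "i": dist[i][j - 1] + 1,   # cell to the left: insert target[j-1]
                ("k" if same else "s"): dist[i - 1][j - 1]
                + sub_cost(source[i - 1], target[j - 1]),  # diagonal cell
            }
            best = min(candidates.values())
            dist[i][j] = best
            # Keep every operation that reaches the minimum, not just the first one.
            ptrs[i][j] = [op for op, cost in candidates.items() if cost == best]
    return dist, ptrs


dist, ptrs = med_with_pointers("mischief", "mischievous")
print(dist[-1][-1], ptrs[-1][-1])  # 5 ['d', 'i', 's']
```

Storing the tied operations per cell means the backtrace only has to pick one entry at random from each list instead of recomputing costs.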

### Output
---

- 4 lines per method (costs and costs2):
1. Source word
2. One vertical bar per operation
3. Target word
4. Operation for each character, followed by the total edit cost
    - k = keep
    - i = insert
    - s = substitute
    - d = delete
- 50 hyphens separate each pair of words from the next pair
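
As a rough illustration of the four-line layout, the sketch below turns a list of operations into aligned output. `format_alignment` is a hypothetical helper, and two details are assumptions not stated in the requirements: a `*` marks the gap opposite an inserted or deleted letter, and the cost is printed in parentheses at the end of the operation line.

```python
def format_alignment(source, target, ops, cost, gap="*"):
    """Return the four output lines for one source/target pair.

    ops is a front-to-back list of "k", "s", "i", "d".
    The gap character is an assumption made for this sketch.
    """
    src_line, tgt_line = [], []
    si = ti = 0
    for op in ops:
        if op in ("k", "s"):      # both words advance
            src_line.append(source[si])
            tgt_line.append(target[ti])
            si += 1
            ti += 1
        elif op == "i":           # only the target advances
            src_line.append(gap)
            tgt_line.append(target[ti])
            ti += 1
        elif op == "d":           # only the source advances
            src_line.append(source[si])
            tgt_line.append(gap)
            si += 1
        else:
            raise ValueError(f"Unknown operation: {op}")
    bars = ["|"] * len(ops)
    return "\n".join([
        " ".join(src_line),
        " ".join(bars),
        " ".join(tgt_line),
        f"{' '.join(ops)} ({cost})",
    ])


print(format_alignment("cat", "cart", ["k", "k", "i", "k"], 1))
print("-" * 50)
```

Because every line has exactly one entry per operation, the letters, bars, and operation codes stay in the same columns.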

## Python Code

In [269]:
# This block of code processes words.txt and defines the targets and sources
targets = []
sources = []
pairTargSources = []

costsFile = "costs.csv"
costs2File = "costs2.csv"
costMethods = [costsFile, costs2File]

# Read & save the target and source words from words.txt
with open("words.txt") as wordList:
    lines = wordList.readlines()

    for line in lines:
        currentLine = line.split()      # Split the line into words
        if not currentLine:             # Skip blank lines so currentLine[0] cannot fail
            continue
        targets.append(currentLine[0])  # Target word is first in each line
        sources.append(currentLine[1:]) # Source words are everything after the first word of the line
                                        # The [1:] slice keeps it a list, even if there is only one source word

for row in range(len(targets)):
    pairTargSources.append([targets[row], sources[row]]) # Explicitly pair the sources to their target

print("targets:", targets)
print("sources:", sources)
print("\nTargets with matching sources: ")
for pair in pairTargSources:
    print(pair)


targets: ['mischievous', 'execution', 'tagctatcacgaccgcggtcgatttgcccgac']
sources: [['mischief', 'devious'], ['intention'], ['aggctatcacctgacctccaggccgatgccc']]

Targets with matching sources: 
['mischievous', ['mischief', 'devious']]
['execution', ['intention']]
['tagctatcacgaccgcggtcgatttgcccgac', ['aggctatcacctgacctccaggccgatgccc']]


In [270]:
# This block of code reads the CSV files and calculates the cost of substituting letters

alphabet = {letter: index for index, letter in enumerate("abcdefghijklmnopqrstuvwxyz")}
substitutionCost = []
substitutionCost2 = []
# Works with both costs.csv and costs2.csv; each file is read once below and kept in its own table
def readCostFromCSV(file, substitutionTable):
    with open(file) as costList:
        costLines = costList.readlines()

        for costLine in costLines:
            currentCostLine = costLine.rstrip("\n").split(",") # Strip the newline, then split each value by commas
            substitutionTable.append(currentCostLine) # Save the row; the table is indexed by the alphabet dictionary on both axes
        substitutionTable.pop(0) # Remove the header line of the CSV

        # Cleanup
        for letter in substitutionTable:
            letter.pop(0) # Remove the row-label letter from each cost row

    # Accessible with alphabet dictionary
    #print(substitutionTable[alphabet["a"]][alphabet["i"]])
    return substitutionTable # Return the entire 2D cost table

costsCSV = readCostFromCSV(costsFile, substitutionCost)
costs2CSV = readCostFromCSV(costs2File, substitutionCost2) #Manual, since it's only two methods

# Returns either Levenshtein cost or Confusion Matrix cost
def getCostFromCSV(csv, letter1, letter2):
    #print("costsCSV table: ", costsCSV)
    #print("costs2CSV table: ", costs2CSV)
    #print()
    intCost = int(csv[alphabet[letter1]][alphabet[letter2]]) # Calculate the specific substitution cost
    return intCost

print("costsCSV a -> i: ", getCostFromCSV(costsCSV, "a", "i")) # "a" substitution TO "i" returns 2
print("costs2CSV a -> i: ", getCostFromCSV(costs2CSV, "a", "i")) # "a" substitution TO "i" returns 118


costsCSV a -> i:  2
costs2CSV a -> i:  118
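
For comparison, here is a small sketch of the same lookup built on Python's standard `csv` module, which handles the comma splitting and trailing newlines. It assumes the layout implied by the cleanup steps above (a header row of letters and a leading letter in each row); `read_cost_table` and the three-letter `SAMPLE` table are made up for illustration and are not the contents of costs.csv.

```python
import csv
import io

# Made-up three-letter table in the assumed layout: header row plus a row label per line.
SAMPLE = ",a,b,c\na,0,2,2\nb,2,0,2\nc,2,2,0\n"


def read_cost_table(handle):
    """Read a cost CSV into a nested dict: table[source_letter][target_letter] -> int."""
    reader = csv.reader(handle)
    header = next(reader)[1:]          # drop the empty corner cell
    table = {}
    for row in reader:
        if not row:                    # skip blank lines
            continue
        table[row[0]] = {tgt: int(value) for tgt, value in zip(header, row[1:])}
    return table


costs = read_cost_table(io.StringIO(SAMPLE))
print(costs["a"]["b"])  # 2
```

A nested dictionary keyed by letters removes the need for a separate alphabet-to-index mapping; with a real file, `open("costs.csv", newline="")` would take the place of the `io.StringIO` object.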


In [271]:
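# Helper function: returns the cost of a single edit operation
# (match/case requires Python 3.10 or newer)

def operation(string, file, letter1, letter2):
    match string:
        case "keep":
            return 0
        case "insert":
            return 1
        case "delete":
            return 1
        case "substitute":
            if (file is not None) and (letter1 is not None) and (letter2 is not None):
                cost = getCostFromCSV(file, letter1, letter2) # Look up the cost in the chosen table
                return cost
            raise ValueError("substitute needs a cost table and two letters") # Fail loudly instead of returning a negative cost
        case _:
            raise ValueError(f"Unknown operation: {string}")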

print("Substitution: ", operation("substitute", costsCSV, "a", "i")) #debug

Substitution:  2


In [272]:
# Minimum Edit Distance Algorithm

DistTable = []

def MED(source, target):
    # Save the source and target words for use in the backtrace
    global savedSource, savedTarget
    savedSource = source
    savedTarget = target

    sourceLen = len(source)
    targetLen = len(target)
    #print("Table dimensions: ", sourceLen, targetLen)
    DistTable = [[0 for i in range(targetLen+1)] for j in range(sourceLen+1)] # 2D table with sourceLen+1 rows and targetLen+1 columns

    # Initialize the first column and the first row
    for i in range(1, sourceLen+1):
        DistTable[i][0] = DistTable[i-1][0] + operation("delete", "", "", "") # First column: delete every source letter
    for j in range(1, targetLen+1):
        DistTable[0][j] = DistTable[0][j-1] + operation("insert", "", "", "") # First row: insert every target letter

    # Recurrence relation:
    for row in range(1, sourceLen+1):
        for col in range(1, targetLen+1):
            tryDelete = DistTable[row-1][col] + operation("delete", "", "", "") # Cell above: delete source[row-1], cost 1
            trySubstitute = DistTable[row-1][col-1] + operation("substitute", file, source[row-1], target[col-1]) # Diagonal cell: cost comes from the global table `file` (0 if same or 2 if different with costs.csv)
            tryInsert = DistTable[row][col-1] + operation("insert", "", "", "") # Cell to the left: insert target[col-1], cost 1

            DistTable[row][col] = min(
                            tryDelete,
                            trySubstitute,
                            tryInsert)

    printing = True #debug: displays visual table
    if printing:
        # Print table
        print("      ", end="")
        # Print the target on top
        for letter in target: 
            print(letter, end="  ")
        print()
        # print the source on the left
        for row in range(len(DistTable)):
            if row != 0:
                print(source[row-1], end="")
            else:
                print(" ", end="")
            # print the minimum edit distance cost table
            for value in DistTable[row]:
                print(f'{value : >3}', end="") # Padding of 3, aligned right
            print()

    return DistTable # Returns the full table; the minimum edit distance is the bottom-right cell

file = costsCSV
DistTable = MED("mischief", "mischievous")


      m  i  s  c  h  i  e  v  o  u  s  
   0  1  2  3  4  5  6  7  8  9 10 11
m  1  0  1  2  3  4  5  6  7  8  9 10
i  2  1  0  1  2  3  4  5  6  7  8  9
s  3  2  1  0  1  2  3  4  5  6  7  8
c  4  3  2  1  0  1  2  3  4  5  6  7
h  5  4  3  2  1  0  1  2  3  4  5  6
i  6  5  4  3  2  1  0  1  2  3  4  5
e  7  6  5  4  3  2  1  0  1  2  3  4
f  8  7  6  5  4  3  2  1  2  3  4  5


In [275]:
# Backtracing

import random

backtracePathPositions = []
backtracePathOperations = []

def shortestPath(): 
    return DistTable[-1][-1]

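# Walks back from the bottom-right cell of DistTable, choosing randomly among the queued operations.
# Currently returns the visited (x, y) positions; the goal is a list of operations like [D, S, I, K, ...]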
def backtrace():
    print(savedSource)
    print(savedTarget)
    # Reversed DistTable
    #for x in range(len(DistTable)-1, 0, -1):
    #    for y in range(len(DistTable[0])-1, 0, -1): 
    #        #print(DistTable[x][y], end = " ")
    #        diagonalCost = DistTable[x][y] - DistTable[x-1][y-1]
    #        horizontalCost = DistTable[x][y] - DistTable[x-1][y]
    #        verticalCost = DistTable[x][y] - DistTable[x][y-1]
    #        
            #print("`XY: [",x,y,"]", DistTable[x][y], ", DHV: ", diagonalCost, horizontalCost, verticalCost)
            #print("\t", DistTable[x][y], "-", DistTable[x-1][y-1], diagonalCost)
            #print("\t", DistTable[x][y], "-", DistTable[x-1][y], horizontalCost)
            #print("\t", DistTable[x][y], "-", DistTable[x][y-1], verticalCost)
            
    #        backtracePath.append(min(diagonalCost, horizontalCost, verticalCost))
    #    print()

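    x = len(DistTable[0])-1 # Column index (position in the target)
    y = len(DistTable)-1    # Row index (position in the source)
    while (x > 0 and y > 0): #TODO: stops when either index reaches 0, so the remaining first-row or first-column steps are not traced
        currentTargetLetter = x-1
        currentSourceLetter = y-1

        # Neighboring cells; DistTable is indexed [row][column], i.e. [y][x]
        currentCellCost = DistTable[y][x]
        diagonalCell = DistTable[y-1][x-1]
        horizontalCell = DistTable[y-1][x] # Despite the name, this is the cell above (one row up)
        verticalCell = DistTable[y][x-1]   # Despite the name, this is the cell to the left (one column back)

        # Cost differences between the current cell and each neighbor
        diagonalCost = currentCellCost - diagonalCell  # check diag
        horizontalCost = currentCellCost - horizontalCell  # check the cell above
        verticalCost = currentCellCost - verticalCell  # check the cell to the left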

        backtracePathPositions.append((x,y)) #need to swap to operations
        randomQueue = []
        #print(savedTarget[currentTargetLetter])
        #print(savedSource[currentSourceLetter])
        

        #if len(savedSource) < len(savedTarget):
            # insert
        #if len(savedSource) > len(savedTarget):
            # delete
        print(savedSource[currentSourceLetter], savedTarget[currentTargetLetter])

        print(diagonalCell, verticalCell)
        print(horizontalCell, currentCellCost)

        #print("same?", savedSource[currentSourceLetter] == savedTarget[currentTargetLetter])
        #print("diag?", currentCellCost-2 == diagonalCell)
        #print("horz?", currentCellCost-1 == horizontalCell)
        #print("vert?", currentCellCost-1 == verticalCell)

        if savedSource[currentSourceLetter] == savedTarget[currentTargetLetter]: # Letters match: only the diagonal "keep" move is queued
            randomQueue.append("k")
            print("k added to random queue")
        else: # Letters differ: queue every neighbor that could have produced this cell's cost
            if currentCellCost-2 == diagonalCell: # Diagonal; 2 is the Levenshtein substitution cost #TODO: look up the cost so this works with the confusion matrix
                randomQueue.append("s")
                print("s added to random queue")
            if currentCellCost-1 == horizontalCell: # horizontalCell is the cell above #TODO: moving up consumes a source letter, so this should be labeled "d"
                randomQueue.append("i")
                print("i added to random queue")
            if currentCellCost-1 == verticalCell: # verticalCell is the cell to the left #TODO: moving left consumes a target letter, so this should be labeled "i"
                randomQueue.append("d")
                print("d added to random queue")

        randomChoice = random.choice(randomQueue) #choose random path
        print("RandomQueue: ", randomQueue, "Choice: ", randomChoice)
        
        # Traverse backtrace
        if randomChoice == "k" or randomChoice == "s":
            print("keep/substitute", savedTarget[currentTargetLetter])
            x-=1
            y-=1
        elif randomChoice == "i":
            print("insert", savedTarget[currentTargetLetter])
            y-=1
        elif randomChoice == "d":
            print("delete", savedSource[currentSourceLetter])
            x-=1

        print(backtracePathPositions)

    print("-")

    for row in DistTable:
        for cell in row:
            print(cell, end=" ")
        print()

    return backtracePathPositions

backtrace()
shortestPath()


mischief
mischievous
f s
3 4
4 5
s added to random queue
i added to random queue
d added to random queue
RandomQueue:  ['s', 'i', 'd'] Choice:  d
delete f
[(11, 8)]
f u
2 3
3 4
s added to random queue
i added to random queue
d added to random queue
RandomQueue:  ['s', 'i', 'd'] Choice:  i
insert u
[(11, 8), (10, 8)]
e u
3 2
4 3
d added to random queue
RandomQueue:  ['d'] Choice:  d
delete e
[(11, 8), (10, 8), (10, 7)]
e o
2 1
3 2
d added to random queue
RandomQueue:  ['d'] Choice:  d
delete e
[(11, 8), (10, 8), (10, 7), (9, 7)]
e v
1 0
2 1
d added to random queue
RandomQueue:  ['d'] Choice:  d
delete e
[(11, 8), (10, 8), (10, 7), (9, 7), (8, 7)]
e e
0 1
1 0
k added to random queue
RandomQueue:  ['k'] Choice:  k
keep/substitute e
[(11, 8), (10, 8), (10, 7), (9, 7), (8, 7), (7, 7)]
i i
0 1
1 0
k added to random queue
RandomQueue:  ['k'] Choice:  k
keep/substitute i
[(11, 8), (10, 8), (10, 7), (9, 7), (8, 7), (7, 7), (6, 6)]
h h
0 1
1 0
k added to random queue
RandomQueue:  ['k'] Choice: 

5
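
The backtrace is still a work in progress in the cell above, so here is a self-contained sketch of one way the walk could look once it is finished. It is not the notebook's code: `sub_cost`, `build_table`, and `random_backtrace` are hypothetical names, and the 0/2 substitution cost is an assumption standing in for the CSV lookup. The walk continues until it reaches the top-left cell, compares each neighbor against the actual operation cost rather than a fixed 2, and leaves the random number generator unseeded.

```python
import random


def sub_cost(a, b):
    """Assumed Levenshtein-style cost; a CSV lookup could be swapped in here."""
    return 0 if a == b else 2


def build_table(source, target):
    """Standard edit distance table with sources on rows and targets on columns."""
    dist = [[0] * (len(target) + 1) for _ in range(len(source) + 1)]
    for i in range(1, len(source) + 1):
        dist[i][0] = i
    for j in range(1, len(target) + 1):
        dist[0][j] = j
    for i in range(1, len(source) + 1):
        for j in range(1, len(target) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,
                dist[i][j - 1] + 1,
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist


def random_backtrace(source, target, dist):
    """Walk from the bottom-right cell to the top-left, picking randomly among ties."""
    i, j = len(source), len(target)
    ops = []
    while i > 0 or j > 0:
        choices = []
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            choices.append("d")   # came from the cell above: delete source[i-1]
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            choices.append("i")   # came from the cell to the left: insert target[j-1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]):
            choices.append("k" if source[i - 1] == target[j - 1] else "s")
        op = random.choice(choices)   # unseeded, so each run may differ
        ops.append(op)
        if op in ("k", "s"):
            i -= 1
            j -= 1
        elif op == "d":
            i -= 1
        else:
            j -= 1
    ops.reverse()                     # operations were collected back to front
    return ops


table = build_table("mischief", "mischievous")
print(table[-1][-1], random_backtrace("mischief", "mischievous", table))
```

Every path this walk can return has the same total cost as the bottom-right cell; only the order and mix of operations varies between runs.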

In [274]:
# This block of code outputs the results of the targets and sources

for pair in pairTargSources: # mischievous, execution
    #print("TARGET: ", pair[0])
    for sourceList in pair[1:]: # [mischief, devious], [intention]
        #print("SOURCELIST: ", sourceList, end="\n")
        for singleSource in sourceList: # mischief, devious
            #print("SOURCE WORD: ", singleSource)
            for costMethod in costMethods: # costsFile, costs2File
                #print("Levenshtein" if costMethod == costsFile else "Confusion Matrix")
                
                for letter in singleSource: # print letters in source
                    print(letter, "", end="")
                print()

                for letter in pair[0]: # | times target length
                    print("| ", end="")
                print()

                for letter in pair[0]: # print letters in target
                    print(letter, "", end="")
                print()

                #backtrace()
                for row in range(len(pair[0])):
                    # Placeholder: prints k for every target letter until the backtrace is wired in
                    #if (pair[0][i] == singleSource[i]) and (pair[0][i] is not None) and (singleSource[i] is not None):
                    print("k ", end="")
                    #else:
                    #   print("! ", end="")

                print("(",shortestPath(),")")
                print()

            print("-" * 50)


#print(backtrace())

m i s c h i e f 
| | | | | | | | | | | 
m i s c h i e v o u s 
k k k k k k k k k k k ( 5 )

m i s c h i e f 
| | | | | | | | | | | 
m i s c h i e v o u s 
k k k k k k k k k k k ( 5 )

--------------------------------------------------
d e v i o u s 
| | | | | | | | | | | 
m i s c h i e v o u s 
k k k k k k k k k k k ( 5 )

d e v i o u s 
| | | | | | | | | | | 
m i s c h i e v o u s 
k k k k k k k k k k k ( 5 )

--------------------------------------------------
i n t e n t i o n 
| | | | | | | | | 
e x e c u t i o n 
k k k k k k k k k ( 5 )

i n t e n t i o n 
| | | | | | | | | 
e x e c u t i o n 
k k k k k k k k k ( 5 )

--------------------------------------------------
a g g c t a t c a c c t g a c c t c c a g g c c g a t g c c c 
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 
t a g c t a t c a c g a c c g c g g t c g a t t t g c c c g a c 
k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k k ( 5 )

a g g c t a t c a c c t g a c c t c c a g g c c g a t g