# NLP Course 2 Week 1 Lesson : Building The Model - Lecture Exercise 02


# Candidates from String Edits
Create a list of candidate strings by applying an edit operation to a word.

### Imports and Data

In [1]:
# data
word = 'dearz' # 🦌

### Splits
Find all the ways you can split a word into two parts!

In [2]:
# Splits with a loop

splits_a = []       # Initialize an empty list to store the split results

for i in range(len(word) + 1):  # Iterate over each index from 0 to the length of 'word' (inclusive)
    
    splits_a.append([word[:i], word[i:]]) # For each index 'i', split the word into two parts:
                                          # The first part is the substring from the start up to index 'i' (not inclusive).
                                          # The second part is the substring from index 'i' to the end of the word.
                                          # Append this pair of splits as a list to 'splits_a'.

for i in splits_a:  # Iterate over each split in 'splits_a'
    print(i)        # Print the current split, which is a list containing the two parts of the word

['', 'dearz']
['d', 'earz']
['de', 'arz']
['dea', 'rz']
['dear', 'z']
['dearz', '']


In [3]:
# Same splits, done using a list comprehension

splits_b = [(word[:i], word[i:])  # Create a tuple containing two parts:
                                  # - The first part is the substring from the start up to index 'i' (not inclusive).
                                  # - The second part is the substring from index 'i' to the end of the word.
            
          for i in range(len(word) + 1)]  # Iterate over each index from 0 to the length of 'word' (inclusive)
                                          # and generate the split for each index. The list comprehension collects
                                          # all these splits into the list 'splits_b'.

for i in splits_b:  # Iterate over each split (tuple) in 'splits_b'
    print(i)        # Print the current split tuple, which contains the two parts of the word

('', 'dearz')
('d', 'earz')
('de', 'arz')
('dea', 'rz')
('dear', 'z')
('dearz', '')
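
Since every edit type starts from the same splits, it can help to wrap the comprehension in a small helper. The sketch below is one way to do that; the function name `split_word` is hypothetical and not taken from the course code.

```python
def split_word(word):
    """Return every (left, right) split of `word`, including the two empty edges."""
    # A word of length n has n + 1 split points: before each letter, plus the end.
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]


splits = split_word('dearz')
print(len(splits))   # 6 splits for a 5-letter word
print(splits[2])     # ('de', 'arz')
```

Returning tuples rather than lists is a small design choice here: each split is a fixed pair, and both forms unpack the same way in `for L, R in splits`.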


### Delete Edit
For each pair in the `splits` list, delete the first letter of the right-hand part.
<br>
Applied across all the splits, this deletes each letter of the original word in turn.

In [4]:
splits_a    # [L,R]

[['', 'dearz'],
 ['d', 'earz'],
 ['de', 'arz'],
 ['dea', 'rz'],
 ['dear', 'z'],
 ['dearz', '']]

In [5]:
# Deletes with a loop

splits = splits_a  # Assign the list of splits from 'splits_a' to the variable 'splits'
deletes = []       # Initialize an empty list to store the results of deletions (not used in this code snippet)

print('word : ', word)  # Print the original word to provide context

for L, R in splits:     # Iterate over each split in 'splits', where each split is a pair (L, R)
    
    if R:               # Check if the right part (R) of the split is not empty
        
        print(L + R[1:], ' <-- delete ', R[0])  # Print the result of deleting the first character of R
                                                # - Concatenate L with R[1:], which is R without its first character
                                                # - Indicate which character (R[0]) was deleted

word :  dearz
earz  <-- delete  d
darz  <-- delete  e
derz  <-- delete  a
deaz  <-- delete  r
dear  <-- delete  z


It's worth taking a closer look at how this is executing a 'delete'.
<br>
Taking the first item from the `splits` list :

In [6]:
# Breaking it down

print('word : ', word) # Print the original word

one_split = splits[0]  # Take the first item from the splits list and store it in one_split

print('first item from the splits list : ', one_split) # Print the first item from the splits list
print()

L = one_split[0]  # Assign the first part of one_split (before the split point) to L
R = one_split[1]  # Assign the second part of one_split (after the split point) to R


print('L : ', L)    # Print the value of L (the left part of the split)
print('R : ', R)    # Print the value of R (the right part of the split)
print()

# Print a message indicating the next operation: deleting the leading letter of R
print('*** now implicit delete by excluding the leading letter ***')
print('L + R[1:] : ',L + R[1:], ' <-- delete ', R[0]) # Print the result of concatenating L with R[1:], 
                                                      # which is R without its first character
                                                      # Also, show which character is being "deleted" (R[0])

word :  dearz
first item from the splits list :  ['', 'dearz']

L :  
R :  dearz

*** now implicit delete by excluding the leading letter ***
L + R[1:] :  earz  <-- delete  d


So the end result transforms **'dearz'** to **'earz'** by deleting the first character.
<br>
You can use a **loop** (as in the earlier deletes cell) or a **list comprehension** (code block below) to do this for the entire `splits` list.

In [7]:
# Deletes with a list comprehension

splits = splits_a      # Assign the list of splits from 'splits_a' to the variable 'splits'

deletes = [L + R[1:]           # Create a new list where each element is the result of deleting the first character of R
           for L, R in splits  # Iterate over each split in 'splits', where each split is a pair (L, R)
           if R]               # Only include the split if the right part (R) is not empty

print(deletes)         # Print the list of deletions
print()


print('*** which is the same as ***')  # Print a separator for clarity

for i in deletes:  # Iterate over each item in the 'deletes' list
    print(i)       # Print each deletion result

['earz', 'darz', 'derz', 'deaz', 'dear']

*** which is the same as ***
earz
darz
derz
deaz
dear
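
The split and delete steps above can be combined into a single function. This is a minimal sketch, assuming you want a self-contained helper; the name `delete_letter` is used here for illustration only.

```python
def delete_letter(word):
    """Return all strings formed by removing exactly one letter from `word`."""
    # Build every (L, R) split, then drop the first character of R when R is non-empty.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return [L + R[1:] for L, R in splits if R]


print(delete_letter('dearz'))  # ['earz', 'darz', 'derz', 'deaz', 'dear']
print(delete_letter('deer'))   # ['eer', 'der', 'der', 'dee']
```

Note that a word with a repeated adjacent letter, such as 'deer', produces duplicate candidates. Converting the result to a `set` removes them if the order does not matter.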


### Ungraded Exercise
You now have a list of ***candidate strings*** created after performing a **delete** edit.
<br>
The next step is to filter this list for ***candidate words*** found in a vocabulary.
<br>
Given the example vocab below, can you think of a way to create a list of candidate words?
<br>
Remember, you already have a list of candidate strings, some of which are certainly not actual words you might find in your vocabulary!
<br>
<br>
So, from the above list **earz, darz, derz, deaz, dear**, you're really only interested in **dear**.

In [8]:
vocab = ['dean', 'deer', 'dear', 'fries', 'and', 'coke']  # Define a list of words representing the vocabulary

edits = deletes.copy()    # Create a copy of the 'deletes' list and assign it to 'edits'

print('vocab : ', vocab)  # Print the vocabulary list
print('edits : ', edits)  # Print the list of words after deletions

candidates = []           # Initialize an empty list to store candidate words

### START CODE HERE ###

candidates = set(edits).intersection(set(vocab))  # Find the intersection of the 'edits' list and 'vocab'.
                                                  # Convert both lists to sets and get common elements.
                                                  # The result is assigned to 'candidates'.

### END CODE HERE ###

print('candidate words : ', candidates)  # Print the set of candidate words found in both 'edits' and 'vocab'

vocab :  ['dean', 'deer', 'dear', 'fries', 'and', 'coke']
edits :  ['earz', 'darz', 'derz', 'deaz', 'dear']
candidate words :  {'dear'}


Expected Outcome:

vocab :  ['dean', 'deer', 'dear', 'fries', 'and', 'coke']
<br>
edits :  ['earz', 'darz', 'derz', 'deaz', 'dear']
<br>
candidate words :  {'dear'}
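
The set intersection used in the cell returns an unordered set. If you would rather keep the candidates in the order the edits were generated, one alternative is a list comprehension with a set lookup. This is a sketch, not the course's reference solution; the names `vocab_set` and `candidates_ordered` are introduced here for illustration.

```python
vocab = ['dean', 'deer', 'dear', 'fries', 'and', 'coke']
edits = ['earz', 'darz', 'derz', 'deaz', 'dear']

# Convert the vocabulary to a set once so each membership test is fast on average.
vocab_set = set(vocab)

# Keep only the edited strings that are real vocabulary words, preserving edit order.
candidates_ordered = [w for w in edits if w in vocab_set]

print(candidates_ordered)  # ['dear']
```

Both approaches give the same words; they differ only in the container type and in whether order and duplicates are preserved.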

### Summary
You've unpacked an integral part of the assignment by breaking down **splits** and **edits**, specifically looking at **deletes** here.
<br>
Implementation of the other edit types (insert, replace, switch) follows a similar methodology and should now feel somewhat familiar when you see them.
<br>
This bit of the code isn't as intuitive as other sections, so well done!
<br>
You should now feel confident facing some of the more technical parts of the assignment at the end of the week.
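
To illustrate how the other edit types mentioned in the summary can reuse the same `(L, R)` splits, here is a rough sketch. It is an assumption about how they might look, not the assignment's solution: the function name `edit_sketches`, the lowercase ASCII alphabet, and the choice to skip replacing a letter with itself are all illustrative.

```python
import string


def edit_sketches(word, letters=string.ascii_lowercase):
    """Return (switches, replaces, inserts) for `word`, built from its splits."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

    # Switch: swap the first two letters of R (needs at least two letters in R).
    switches = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) >= 2]

    # Replace: substitute the first letter of R with a different letter.
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters if c != R[0]]

    # Insert: place a new letter between L and R (works at every split point).
    inserts = [L + c + R for L, R in splits for c in letters]

    return switches, replaces, inserts


switches, replaces, inserts = edit_sketches('dearz')
print(switches)       # ['edarz', 'daerz', 'deraz', 'deazr']
print(len(replaces))  # 125 = 5 positions * 25 other letters
print(len(inserts))   # 156 = 6 split points * 26 letters
```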