
# Task

*   Get a word count given a corpus
*   Get a word probability in the corpus
*   Manipulate strings
*   Filter strings
*   Implement minimum edit distance to compare strings and to help find the optimal path for the edits.
*   Understand how dynamic programming works



### Edit Distance
In this assignment, you will implement models that correct words that are one or two edits away.

Two words are at edit distance n from each other when n edits are needed to change one word into the other.
An edit is one of the following:

*   Delete (remove a letter): 'hat' => 'at', 'ha', 'ht'
*   Switch (swap two adjacent letters): 'eta' => 'tea', 'eat'
*   Replace (change one letter to another): 'jat' => 'hat', 'rat', 'cat', 'mat', ...
*   Insert (add a letter): 'te' => 'the', 'ten', 'ate', ...

In [None]:
# Let's start by importing all I will need and download the Shakespeare file

import re
from collections import Counter
import numpy as np
import pandas as pd

!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W1/shakespeare.txt

--2020-09-30 20:11:28--  https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W1/shakespeare.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 306996 (300K) [text/plain]
Saving to: ‘shakespeare.txt’


2020-09-30 20:11:29 (5.91 MB/s) - ‘shakespeare.txt’ saved [306996/306996]



### Exercise 1
Implement the function process_data which

1) Reads in a corpus (text file)

2) Changes everything to lowercase

3) Returns a list of words.


In [None]:
def process_data():
    """
    Read shakespeare file
    Make it lowercase
    Return the list of words
    I won't pass the path or the file as an argument this time
    """

    with open("shakespeare.txt") as f:
        data = f.read()

    data = data.lower()
    words = re.findall(r"\w+", data) # If I put (\W+) It will also return the spaces as characters
    # I could probably use .split()

    return words



In [None]:
words = process_data()

# print(words)

Now I convert the whole list to a set to avoid duplicates

In [None]:
print("Total words in file:", len(words))
vocab = set(words)
print("Unique words in file:", len(vocab))

Total words in file: 53614
Unique words in file: 6116
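
The comment in process_data wonders whether .split() could replace the regular expression. Here is a minimal sketch of the difference, using an illustrative sentence rather than the corpus file (the sample string and the variable names are assumptions made for this example). The pattern `\w+` keeps only runs of word characters, while .split() cuts at whitespace and leaves punctuation attached to the words.

```python
import re

# Illustrative sample text (an assumption for this sketch, not read from shakespeare.txt)
sample = "O for a Muse of fire, that would ascend!"

regex_tokens = re.findall(r"\w+", sample.lower())  # runs of word characters only
split_tokens = sample.lower().split()              # cuts at whitespace only

print(regex_tokens)
print(split_tokens)
```

With .split(), 'fire,' and 'fire' would be counted as two different words, so the vocabulary would contain spurious entries.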


### Exercise 2
Implement a get_count function that returns a dictionary.

The dictionary's keys are words. The value for each word is the number of times that word appears in the corpus.

* Try implementing this using a for loop and a regular dictionary. This may be good practice for similar coding interview questions
* You can also use defaultdict instead of a regular dictionary, along with the for loop
* Otherwise, to skip using a for loop, you can use Python's Counter class
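
The second option in the list above, defaultdict with a for loop, can be sketched as follows. The function name get_count_defaultdict is hypothetical and only used for this illustration.

```python
from collections import defaultdict

def get_count_defaultdict(word_list):
    """Count words with a defaultdict (hypothetical helper for illustration)."""
    counts = defaultdict(int)  # missing keys start at 0
    for word in word_list:
        counts[word] += 1
    return dict(counts)        # convert back to a plain dictionary

toy_counts = get_count_defaultdict(["the", "cat", "the"])
print(toy_counts)
```

Compared with dict.get(word, 0) + 1, the defaultdict version moves the "start at zero" logic into the container itself.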

In [None]:
def get_count_auto(word_list):
    """
    Implement a get_count function that returns a dictionary.
    The dictionary's keys are words.
    The value for each word is the number of times that word appears in the corpus.
    """

    word_count_dict = Counter(word_list)

    return word_count_dict

def get_count_manual(word_list):
    """
    Implement a get_count function that returns a dictionary.
    The dictionary's keys are words.
    The value for each word is the number of times that word appears in the corpus.
    """

    word_count_dict = {}

    for i in word_list:
        word_count_dict[i] = word_count_dict.get(i, 0) + 1

    return word_count_dict

In [None]:
auto_voc = get_count_auto(words)
man_voc = get_count_manual(words)

print(list(auto_voc.items())[:10])
print(list(man_voc.items())[:10])

[('o', 157), ('for', 474), ('a', 757), ('muse', 18), ('of', 1094), ('fire', 22), ('that', 785), ('would', 138), ('ascend', 1), ('the', 1525)]
[('o', 157), ('for', 474), ('a', 757), ('muse', 18), ('of', 1094), ('fire', 22), ('that', 785), ('would', 138), ('ascend', 1), ('the', 1525)]


### Exercise 3

Given the dictionary of word counts, compute the probability that each word will appear if randomly selected from the corpus of words. Remember:

$P(w) = \frac{\text{Count}(w)}{N}$, where $N$ is the total number of words in the corpus.

In [None]:
def get_probs(voc):
    """
    Get a dictionary with the vocabulary and calculate the probability of a word
    using the count
    """

    voc_prob = {}
    total_words = sum(voc.values())
    for w, count in voc.items():
        voc_prob[w] = count / total_words

    return voc_prob

In [None]:
probs = get_probs(auto_voc)

print(len(probs))
print(sum(probs.values()))

6116
0.999999999999934
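
The sum printed above is slightly below 1, most likely because each probability is a rounded floating-point value and the rounding errors accumulate over thousands of additions. Here is a minimal sketch, on a toy word list (an assumption for illustration), of checking the sum with a tolerance instead of exact equality:

```python
import math
from collections import Counter

# Toy corpus (an assumption for this sketch, not the Shakespeare file)
toy_words = ["to", "be", "or", "not", "to", "be", "that"]

counts = Counter(toy_words)
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}

print(probs["to"])                                   # 2 occurrences out of 7 words
print(math.isclose(sum(probs.values()), 1.0))        # tolerant comparison
print(math.isclose(math.fsum(probs.values()), 1.0))  # fsum reduces accumulated error
```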


## Part 2: String Manipulations
Now that you have computed $P(w_i)$ for all the words in the corpus, you will write a few functions to manipulate strings so that you can edit the erroneous strings and return the right spellings of the words. In this section, you will implement four functions:

* *delete_letter*: given a word, it returns all the possible strings that have one character removed.
* *switch_letter*: given a word, it returns all the possible strings that have two adjacent letters switched.
* *replace_letter*: given a word, it returns all the possible strings that have one character replaced by another different letter.
* *insert_letter*: given a word, it returns all the possible strings that have an additional character inserted.

### Exercise 4
Instructions for delete_letter(): Implement a delete_letter() function that, given a word, returns a list of strings with one character deleted.

For example, given the word 'nice', it would return the strings 'ice', 'nce', 'nie', and 'nic'.

**Step 1:** Create a list of 'splits'. This is all the ways you can split a word into Left and Right. For example,
'nice' is split into: [('', 'nice'), ('n', 'ice'), ('ni', 'ce'), ('nic', 'e'), ('nice', '')]. This is common to all four functions (delete, replace, switch, insert).

**Step 2:** This is specific to delete_letter. Here, we generate all words that result from deleting one character.
This can be done in a single line with a list comprehension. You can make use of this type of syntax:
[f(a,b) for a, b in splits if condition]

For our 'nice' example you get: ['ice', 'nce', 'nie', 'nic'].
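
The two steps can be sketched directly on the 'nice' example. This is a minimal illustration with assumed variable names, not the graded solution. The condition `if R` skips the final split, whose Right part is empty and has nothing to delete.

```python
word = "nice"

# Step 1: every way to cut the word into a Left and a Right part
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# Step 2: drop the first character of Right, skipping the empty Right part
deletes = [L + R[1:] for L, R in splits if R]

print(splits)
print(deletes)
```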

In [None]:
def delete_letter(word):
    """
    take one word, split it into L and R substrings. Then delete 
    the first letter of R and combine the result.
    """

    splits = [(word[:i], word[i:]) for i in range(len(word))]
    results = [(L + R[1:]) for L, R in splits]

    return results

In [None]:
cip = delete_letter("cipote")
print(cip) #Check that it doesn't return the full word (range(len(w) +1))

['ipote', 'cpote', 'ciote', 'cipte', 'cipoe', 'cipot']


### Exercise 5
Instructions for switch_letter(): Now implement a function that switches two letters in a word. It takes in a word and returns a list of all the possible switches of two letters that are adjacent to each other. It shouldn't return the original word.

For example, given the word 'eta', it returns 'tea' and 'eat', but does not return 'ate', because that would swap two non-adjacent letters.

Step 1: Same as in delete_letter().

Step 2: A list comprehension or for loop which forms strings by swapping adjacent letters. This is of the form:
[f(L,R) for L, R in splits if condition] where 'condition' will test the length of R in a given iteration.
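
One way to read the hint about testing the length of R is to swap the first two letters of Right whenever Right has at least two characters. This is a minimal sketch under that reading; the name switch_letter_sketch is hypothetical.

```python
def switch_letter_sketch(word):
    """Swap each pair of adjacent letters (hypothetical helper for illustration)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Only splits whose Right part has two or more letters have a pair to swap
    return [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) >= 2]

eta_swaps = switch_letter_sketch("eta")
print(eta_swaps)
```

A word with a doubled letter, such as 'aa', still yields the original word under this form, so an extra filter would be needed if that case matters.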

In [None]:
def switch_letter(word):
    """
    Get a word and produce all the combinations of two adjacent letters swapped
    It shouldn't return the original word
    """

    # Keep only splits where both halves are non-empty, so L[-1] and R[0] exist
    splits = [(word[:i], word[i:]) for i in range(1, len(word))]

    # Swap the last letter of L with the first letter of R
    swaps = [L[:-1] + R[0] + L[-1] + R[1:] for L, R in splits]

    return swaps

In [None]:
test = switch_letter("Cipote")
print(test)

['iCpote', 'Cpiote', 'Ciopte', 'Ciptoe', 'Cipoet']


### Exercise 6
Instructions for replace_letter(): Now implement a function that takes in a word and returns a list of strings with one replaced letter from the original word.

Step 1: Same as in delete_letter().

Step 2: A list comprehension or for loop which forms strings by replacing letters. This can be of the form:
[f(a,b,c) for a, b in splits if condition for c in string]
**Note the use of the second for loop.**

It is expected in this routine that one or more of the replacements will reproduce the original word. For example, replacing the first letter of 'ear' with 'e' will return 'ear'.

Step 3: Remove the original input word from the output.

* To remove a word from a list, first store its contents inside a set().
* Use set.discard('the_word') to remove a word from a set. If the word does not exist in the set, it does not throw a KeyError.
* Using set.remove('the_word') throws a KeyError if the word does not exist in the set.
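
The difference between the two set methods can be seen in a small sketch. The set contents and variable names are assumptions chosen for illustration.

```python
candidates = {"ear", "bar", "car"}

# discard() removes the word if present and stays silent otherwise
candidates.discard("ear")
candidates.discard("not_there")

# remove() raises KeyError when the word is missing
try:
    candidates.remove("not_there")
    raised = False
except KeyError:
    raised = True

print(sorted(candidates))
print(raised)
```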

In [None]:
def replace_letter(word):
    """
    Split, replace, and then remove the original
    """
    letters = 'abcdefghijklmnopqrstuvwxyz'

    splits = [(word[:i], word[i:]) for i in range(len(word))] # Same as before

    replaced = [L + alph + R[1:] for L, R in splits for alph in letters]

    # Use a set to keep only unique strings, then drop the original word
    replaced_set = set(replaced)
    replaced_set.discard(word)  # discard() does not raise if the word is absent (e.g. empty input)
    results = sorted(replaced_set)

    return results



In [None]:
a = replace_letter("cipote") # Test with aa in case of doubt. It works
print(len(a))


150


### Exercise 7
Instructions for insert_letter(): Now implement a function that takes in a word and returns a list with a letter inserted at every offset.

* Step 1: Same as in delete_letter().

* Step 2: This can be a list comprehension of the form:
[f(a,b,c) for a, b in splits if condition for c in string]

In [None]:
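def insert_letter(word):
    """
    Split the word at every offset (including the end) and insert
    each letter of the alphabet between the L and R parts.
    """

    letters = 'abcdefghijklmnopqrstuvwxyz'

    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

    inserts = [L + alph + R for L, R in splits for alph in letters]

    return inserts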


In [None]:
a = insert_letter("cipote")
print(len(a)) # Should be 78 with input = "at"

182
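
The counts 78 and 182 follow from having len(word) + 1 insertion points and 26 letters for each. The sketch below redefines the same list comprehension under a hypothetical name, insert_letter_sketch, so that it runs on its own. It checks that arithmetic and also shows that the returned list contains duplicates: inserting a letter just before or just after an identical letter gives the same string.

```python
import string

def insert_letter_sketch(word, letters=string.ascii_lowercase):
    """Insert every letter at every offset (hypothetical helper for illustration)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return [L + c + R for L, R in splits for c in letters]

at_inserts = insert_letter_sketch("at")

print(len(at_inserts))       # (2 + 1) offsets * 26 letters
print(len(set(at_inserts)))  # 'aat' and 'att' each appear twice in the list
```

If only distinct candidates are wanted, wrapping the result in set() removes the repeats.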
