# Replacing and Correcting Words

- Stemming words
- Lemmatizing words with WordNet
- Replacing words matching regular expressions
- Removing repeating characters
- Spelling correction with Enchant
- Replacing synonyms
- Replacing negations with antonyms

## Stemming words

- A technique for removing affixes from a word, ending up with the stem
 - e.g., cooking -> cook
- Common stemming algorithms
 - Porter stemming algorithm
   - removes and replaces well-known suffixes of English words
   - generally the default choice
 - Lancaster stemming algorithm
 - SnowballStemmer
   - supports 13 non-English languages
   - provides the original Porter algorithm as well as the new English stemming algorithm
 - DIY by using the RegexpStemmer
   - takes a single regular expression (either compiled or as a string) and removes whatever matches it; anchor the pattern with `^` or `$` to restrict it to a prefix or suffix

In [10]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import RegexpStemmer

words = ['cooking', 'cookery', 'ingleside']
# Instantiate one stemmer of each kind
p_stemmer = PorterStemmer()
l_stemmer = LancasterStemmer()
s_stemmer = SnowballStemmer('english')
r_stemmer = RegexpStemmer('ing')
string_format = '{:^12} {:^12} {:^12} {:^12} {:^12}'
print(string_format.format('Word', 'Porter', 'Lancaster', 'Snowball', 'Regexp'))
for word in words:
    print(string_format.format(word,
                               p_stemmer.stem(word),
                               l_stemmer.stem(word),
                               s_stemmer.stem(word),
                               r_stemmer.stem(word)))

    Word        Porter     Lancaster     Snowball      Regexp   
  cooking        cook         cook         cook         cook    
  cookery      cookeri      cookery      cookeri      cookery   
 ingleside     inglesid     inglesid     inglesid      leside   
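
The `leside` result in the last column comes from the unanchored pattern `'ing'`, which also matches at the start of `ingleside`. The sketch below uses only Python's `re` module and assumes that `RegexpStemmer` simply deletes every match of its pattern, which is consistent with the output above. `strip_matches` is a hypothetical helper, not part of NLTK.

```python
import re

def strip_matches(pattern, word):
    # Delete every match of pattern from word (assumed RegexpStemmer behaviour)
    return re.sub(pattern, '', word)

for word in ['cooking', 'cookery', 'ingleside']:
    # 'ing' matches anywhere in the word; 'ing$' only matches a trailing suffix
    print(word, strip_matches('ing', word), strip_matches('ing$', word))
```

Anchoring the pattern with `$` leaves `ingleside` intact while still reducing `cooking` to `cook`.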


## Lemmatizing words with WordNet

- Lemmatizing is similar to stemming, but is more akin to synonym replacement.
- Unlike stemming, we are always left with a valid word.
- A side effect is that the resulting word may be completely different from the original.

### Combining stemming with lemmatization

Stemming and lemmatization can be combined to compress words more than either process can by itself.

In [13]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print('cooking   ', lemmatizer.lemmatize('cooking'))
print('cooking   ', lemmatizer.lemmatize('cooking', pos='v'))
print('cookbooks ', lemmatizer.lemmatize('cookbooks'))
print('cookery   ', lemmatizer.lemmatize('cookery'))

cooking    cooking
cooking    cook
cookbooks  cookbook
cookery    cookery


## Replacing words matching regular expressions

This recipe deals with contractions by replacing them with their expanded forms.

### How to do it?
- Define a number of replacement patterns.
- Create a RegexpReplacer class that compiles the patterns and provides a replace() method to substitute every match with its replacement.

In [19]:
import re

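# Specific contractions come first; the generic patterns below act as fallbacks.
# Replacement strings are raw so that \g<1> is not treated as a string escape.
replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
    (r'(\w+)\'d', r'\g<1> would')
]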

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile each pattern once so replace() can reuse it
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
    
    def replace(self, text):
        s = text
        # Apply the patterns in list order; earlier rules take precedence
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s


replacer = RegexpReplacer()
for sentence in ["can't is a contraction", "I should've done that thing I didn't do"]:
    print(sentence)
    print(replacer.replace(sentence))
    print('')

can't is a contraction
cannot is a contraction

I should've done that thing I didn't do
I should have done that thing I did not do
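
The order of the replacement patterns matters, because each substitution runs on the result of the previous one. The sketch below illustrates this with a shortened pattern list; `expand`, `specific_first`, and `generic_first` are hypothetical names introduced only for this example.

```python
import re

# Same rules in two different orders
specific_first = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"(\w+)n't", r"\g<1> not"),
]
generic_first = [
    (r"(\w+)n't", r"\g<1> not"),
    (r"won't", "will not"),
    (r"can't", "cannot"),
]

def expand(text, patterns):
    # Apply each (regex, replacement) pair in list order
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

text = "i won't go and i can't stay"
print(expand(text, specific_first))  # i will not go and i cannot stay
print(expand(text, generic_first))   # i wo not go and i ca not stay
```

When the generic `n't` rule runs first, it consumes `won't` and `can't` before the specific rules get a chance to match.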



### Replacement before tokenization

In [21]:
from nltk.tokenize import word_tokenize
replacer = RegexpReplacer()

sentence = "can't is a contraction"

# tokenization only
print(word_tokenize(sentence))

# Replace contractions first, then tokenize
print(word_tokenize(replacer.replace(sentence)))

['ca', "n't", 'is', 'a', 'contraction']
['can', 'not', 'is', 'a', 'contraction']


## Removing repeating characters

- This recipe removes repeated characters from a word by using a backreference.
- A backreference is a way to refer to a previously matched group in a regular expression.

In [26]:
import re
from nltk.corpus import wordnet

class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
    
    def replace(self, word):
        if wordnet.synsets(word):
            return word

        repl_word = self.repeat_regexp.sub(self.repl, word)

        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

### How it works?

The repeat_regexp pattern matches three groups:
- 0 or more starting characters (\w*)
- A single character (\w) that is followed by another instance of that character (\2)
- 0 or more ending characters (\w*)

The replacement string is then used to keep all the matched groups, while discarding the backreference to the second group.

For example:
looooooove => (looooo)(o)o(ve) => loooooove => ... => love
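
The same steps can be traced with the regular expression alone. This is a minimal sketch that leaves out the WordNet lookup, so it also shows why that lookup is needed; `collapse_once` and `collapse_all` are hypothetical helper names.

```python
import re

repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')

def collapse_once(word):
    # Keep groups 1-3 and drop the duplicate matched by the backreference
    return repeat_regexp.sub(r'\1\2\3', word)

def collapse_all(word):
    # Repeat until no adjacent duplicate characters remain (no WordNet check)
    while True:
        shorter = collapse_once(word)
        if shorter == word:
            return word
        word = shorter

print(collapse_once('looooooove'))  # one character shorter than the input
print(collapse_all('looooooove'))   # love
print(collapse_all('goose'))        # gose
```

Without the dictionary check, a valid word such as `goose` is over-corrected to `gose`, which is why `RepeatReplacer.replace` stops as soon as WordNet recognizes the word.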
     

In [28]:
replacer = RepeatReplacer()
words = ['looooooove', 'oooooooh', 'goose']
string_format = '{:^16} {:>20}'
print(string_format.format('Replaced word', 'Original word'))
for word in words:
    print(string_format.format(replacer.replace(word), word))

 Replaced word          Original word
      love                 looooooove
      ooh                    oooooooh
     goose                      goose


## Spelling correction with Enchant

### What is Enchant?
- A spelling correction API.
- An offshoot of the AbiWord open source word processor.
- Always make sure to use the correct dictionary for the language being spell-checked.
- Supports personal word lists.
  - e.g., assuming 'nltk' is listed in 'mywords.txt'
  
  > d = enchant.Dict('en_US')  
  d.check('nltk') => False  
  d = enchant.DictWithPWL('en_US', 'mywords.txt')  
  d.check('nltk') => True
- For dictionaries, Aspell is a good open source spellchecker

In [33]:
import enchant

print(enchant.list_languages())

dUS = enchant.Dict('en_US')
dGB = enchant.Dict('en_GB')

print(dUS.check('theater'))
print(dGB.check('theater'))
print(dGB.check('theatre'))

['de_DE', 'fr_FR', 'en_GB', 'en_AU', 'en_US']
True
False
True


In [31]:
import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
    def __init__(self, dict_name='en', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist
    
    def replace(self, word):
        # Correctly spelled words are returned unchanged
        if self.spell_dict.check(word):
            return word

        suggestions = self.spell_dict.suggest(word)
        
        # Accept the top suggestion only if it is within max_dist edits of the input
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

In [34]:
replacer = SpellingReplacer()
replacer.replace('cookbo')

us_replacer = SpellingReplacer('en_US')
gb_replacer = SpellingReplacer('en_GB')

print(us_replacer.replace('theater'))
print(gb_replacer.replace('theater'))

theater
theatre
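
The `max_dist` threshold in `SpellingReplacer` is compared against an edit distance. As a rough illustration of what that number means, here is a plain-Python Levenshtein distance with unit costs. `levenshtein` is a hypothetical stand-in written for this example, and it is assumed to agree with the default behaviour of `nltk.metrics.edit_distance`.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance with unit costs
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('cookbo', 'cookbook'))  # 2
print(levenshtein('theater', 'theatre'))  # 2
```

Both distances are 2, so with the default `max_dist` of 2 a suggestion such as `theatre` for `theater` is still accepted.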


## Replacing synonyms

### Why do we want to replace synonyms?
- It is often useful to reduce the vocabulary of a text.
- We can save memory in cases such as 
 - frequency analysis (https://en.wikipedia.org/wiki/Frequency_analysis)
 - text indexing (https://en.wikipedia.org/wiki/Full_text_search)
 
### How to do it?
- Option 1: Maintain a hard-coded lookup table of synonyms, as WordReplacer does.
 - It is simply a class wrapper around a Python dictionary.
 - This is not a good long-term solution, since every new synonym means editing the code.
- Option 2: Store the synonyms in a CSV file or in a YAML file.
 - The CSV file holds one word,synonym pair per line, e.g. bday,birthday
 - The YAML file should be a simple mapping of word: synonym, such as bday: birthday

In [40]:
class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map
    
    def replace(self, word):
        return self.word_map.get(word, word)

replacer = WordReplacer({'bday': 'birthday'})
for word in ['bday', 'happy']:
    print('{:>10} {:>10}'.format(word, replacer.replace(word)))

      bday   birthday
     happy      happy


In [44]:
import csv

class CsvWordReplacer(WordReplacer):
    def __init__(self, fname):
        word_map = {}
        # Each row is expected to hold exactly two columns: word,synonym
        with open(fname, newline='') as f:
            for line in csv.reader(f):
                word, syn = line
                word_map[word] = syn
        super(CsvWordReplacer, self).__init__(word_map)
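
To make the expected file layout concrete, the sketch below parses CSV text from memory instead of a file on disk. The second row and the variable names are made up for illustration.

```python
import csv
import io

# Hypothetical file contents: one word,synonym pair per line
csv_text = "bday,birthday\ncongrats,congratulations\n"

word_map = {}
for word, syn in csv.reader(io.StringIO(csv_text)):
    word_map[word] = syn

print(word_map)                        # {'bday': 'birthday', 'congrats': 'congratulations'}
print(word_map.get('happy', 'happy'))  # happy
```

The last line mirrors `WordReplacer.replace`: words that are not in the map come back unchanged.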

In [45]:
import yaml # pip3 install pyyaml

class YamlWordReplacer(WordReplacer):
    def __init__(self, fname):
        # safe_load parses plain mappings without constructing arbitrary Python objects
        with open(fname) as f:
            word_map = yaml.safe_load(f)
        super(YamlWordReplacer, self).__init__(word_map)


## Replacing negations with antonyms

In [59]:
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

class AntonymReplacer(object):
    def replace(self, word, pos=None):
        # Collect the antonyms of every lemma in every synset of the word
        antonyms = set()
        for syn in wordnet.synsets(word, pos=pos):
            for lemma in syn.lemmas():
                for antonym in lemma.antonyms():
                    antonyms.add(antonym.name())
        # Only replace when the antonym is unambiguous
        if len(antonyms) == 1:
            return antonyms.pop()
        else:
            return None
    
    def replace_negations(self, sent):
        i, l = 0, len(sent)
        words = []

        while i < l:
            word = sent[i]
            if word == 'not' and i + 1 < l:
                ant = self.replace(sent[i + 1])
                if ant:
                    words.append(ant)
                    i += 2
                    continue
            words.append(word)
            i += 1

        return words

# example 1
replacer = AntonymReplacer()
for word in ['good', 'uglify']:
    print(replacer.replace(word))
sentence = "let's not uglify our code"
replacer.replace_negations(word_tokenize(sentence))

None
beautify


['let', "'s", 'beautify', 'our', 'code']

In [60]:
class AntonymWordReplacer(WordReplacer, AntonymReplacer):
    pass

replacer = AntonymWordReplacer({'evil': 'good'})
sentence = "good is not evil"
replacer.replace_negations(word_tokenize(sentence))

['good', 'is', 'good']
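
One thing to watch in `AntonymWordReplacer`: `replace()` resolves to `WordReplacer.replace`, which returns the word itself when it is not in the map, so `replace_negations` would turn `not happy` into `happy`. The sketch below is one possible way around that, returning `None` for unmapped words. `MappedAntonymReplacer` is a hypothetical class written for this example, and it reimplements the negation loop so that it runs without WordNet.

```python
class MappedAntonymReplacer(object):
    def __init__(self, antonym_map):
        self.antonym_map = antonym_map

    def replace(self, word):
        # None signals "no known antonym", so the negation is left in place
        return self.antonym_map.get(word)

    def replace_negations(self, sent):
        words = []
        i = 0
        while i < len(sent):
            word = sent[i]
            if word == 'not' and i + 1 < len(sent):
                ant = self.replace(sent[i + 1])
                if ant:
                    # Replace the pair "not <word>" with the antonym
                    words.append(ant)
                    i += 2
                    continue
            words.append(word)
            i += 1
        return words

replacer = MappedAntonymReplacer({'evil': 'good'})
print(replacer.replace_negations(['good', 'is', 'not', 'evil']))  # ['good', 'is', 'good']
print(replacer.replace_negations(['he', 'is', 'not', 'happy']))   # ['he', 'is', 'not', 'happy']
```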