# Sorri not veri gud in inglish

Have you ever googled someone's name without knowing exactly how it should be written? Were you ever reluctant to look up the correct spelling of a query you typed? Or simply unable to type properly because you were in a rush? Modern search engines usually do a pretty good job of deciphering defective user input. To do that, a search procedure needs a good spell-checking mechanism. Today we will take one more step towards building a good search engine and work on tolerant retrieval with respect to user queries. We will consider two cases:

1. The user knows that they don't know the correct spelling OR they want results that follow some known pattern, so they use so-called wildcards - queries like 'retr*val';
2. The user doesn't know the correct spelling OR doesn't care OR is in a rush OR expects the mistakes to be corrected OR your option, so they make mistakes and we need to handle them using:

    2.1. Simple spellchecker by Peter Norvig;
    
    2.2. Phonetic correction by means of Soundex algorithm;
    
    2.3. Trigrams with Jaccard coefficient.

## 1. Handling wildcards

We will handle wildcard queries using k-grams. The k-grams of a string are all its substrings of k consecutive characters, taken after padding the string with '\$' on both sides - i.e., for the word *'star'* with k=3 they are '*\$st*', '*sta*', '*tar*', and '*ar\$*'. See *chapter 3.2.2* of the [book](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf) to understand how k-grams help to match a wildcard against dictionary words efficiently. Here we will only consider wildcards with star symbols (possibly several).
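As a quick illustration of the definition above, here is a minimal sketch that extracts padded k-grams in plain Python. The helper name `k_grams` is an assumption made for this example and is not part of the assignment API.

```python
def k_grams(word, k=3):
    # Pad with '$' so that word boundaries produce their own k-grams
    padded = "$" + word + "$"
    # Slide a window of width k over the padded word
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

print(k_grams("star"))  # ['$st', 'sta', 'tar', 'ar$']
```

The boundary k-grams ('$st', 'ar$') are what later lets a wildcard such as 'st*' match only words that start with "st".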

Notice that to build the k-gram index, **we will need a vocabulary of original word forms**, because words in the user input are compared against the vocabulary of "correct" words (think about why the inverted index that we built for stemmed words doesn't work here).   

You need to implement the following:

- `build_inverted_index_orig_forms` - creates inverted index of original word forms from `facts` list, which is already given to you.  
    Output format: `term:[collection_frequency, (doc_id_1, doc_freq_1), (doc_id_2, doc_freq_2), ...]`
    

- `build_k_gram_index` - creates k-gram index which maps every k-gram encountered in facts collection to a list of words containing this k-gram. Use the abovementioned inverted index of original words to construct this index.  
    Output format: `'k_gram': ['word1_with_k_gram', 'word2_with_k_gram', ...]`
    
    
- `generate_wildcard_options` - produces a list of vocabulary words matching a given wildcard by intersecting the postings of the k-grams present in the wildcard (refer to *ch 3.2.2*). 

- `search_wildcard` - returns a list of facts that contain words matching a wildcard query.
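To make the idea behind these functions concrete, here is a small self-contained sketch on a toy vocabulary. It is not the assignment solution: the names `build_toy_k_gram_index`, `match_wildcard` and `toy_words` are hypothetical, and the regular-expression post-filter is one possible way to discard the false positives that k-gram intersection can produce.

```python
import re
from collections import defaultdict

def k_grams(text, k=3):
    # All substrings of length k (empty list if the text is shorter than k)
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def build_toy_k_gram_index(words, k=3):
    # Map every k-gram of the padded word to the set of words containing it
    index = defaultdict(set)
    for word in words:
        for gram in k_grams("$" + word + "$", k):
            index[gram].add(word)
    return index

def match_wildcard(wildcard, words, index, k=3):
    # Pad the wildcard, then split it into the fragments between stars
    parts = ("$" + wildcard + "$").split("*")
    grams = {g for part in parts for g in k_grams(part, k)}
    if grams:
        # Candidates must contain every k-gram of the wildcard
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # All fragments are shorter than k: fall back to the whole vocabulary
        candidates = set(words)
    # Post-filter against the wildcard itself to drop false positives
    pattern = re.compile(".*".join(re.escape(p) for p in parts))
    return sorted(w for w in candidates if pattern.fullmatch("$" + w + "$"))

toy_words = ["reduced", "received", "recorded", "red", "retired", "bored"]
toy_index = build_toy_k_gram_index(toy_words)
print(match_wildcard("re*ed", toy_words, toy_index))
# ['received', 'recorded', 'reduced', 'retired']
```

Note that "red" contains both '$re' and 'ed$' and therefore survives the intersection; only the post-filter removes it, because "re" and "ed" would have to overlap.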


We will use the dataset with curious facts for testing.

In [0]:
import urllib.request
data_url = "https://raw.githubusercontent.com/hsu-ai-course/hsu.ai/master/code/datasets/nlp/facts.txt"
local_filename, headers = urllib.request.urlretrieve(data_url)

facts = []
with open(local_filename, encoding="utf8", errors='ignore') as fp:
    for line in fp:
        facts.append(line.strip('\n'))
        
print(*facts[-5:], sep='\n')

151. Women have twice as many pain receptors on their body than men. But a much higher pain tolerance.
152. There are more stars in space than there are grains of sand on every beach in the world.
153. For every human on Earth there are 1.6 million ants.
154. The total weight of all those ants, however, is about the same as all the humans.
155. On Jupiter and Saturn it rains diamonds.


In [0]:
# remove the leading fact number (e.g. "151. ") from every line
facts = [fact.lstrip('0123456789- ').lstrip('. ') for fact in facts]

# facts_d maps a document id (position in the list) to the fact text
facts_d = dict(enumerate(facts))


In [0]:
# convert a tuple of characters into a string
def convertTuple(tup):
    # avoid naming a variable `str`, which would shadow the built-in type
    return ''.join(tup)

In [0]:
import nltk
nltk.download('punkt')
from collections import Counter
import re
from nltk.util import ngrams

def build_inverted_index_orig_forms(files_data):
    # build an inverted index of original word forms
    # (without stemming, just word tokenized and lowercased)
    inverted_index = {}
    for name, strings in files_data.items():
        tokens = nltk.word_tokenize(strings.lower())
        file_index = Counter(tokens)
        # update global index
        for term, file_freq in file_index.items():
            # terms are stored padded with '$' so that k-grams can be built from the keys;
            # the membership test must use the padded key too, otherwise postings are overwritten
            key = '$' + term + '$'
            if key not in inverted_index:
                inverted_index[key] = [file_freq, (name, file_freq)]
            else:
                inverted_index[key][0] += file_freq
                inverted_index[key].append((name, file_freq))

    return inverted_index


def build_k_gram_index(inverted_index, k=3):
    # build index of k-grams for dictionary words
    # keys are used as they are, so they must already be padded with '$' ($word$)
    # if boundary k-grams are needed
    k_gram_index = {}
    for word in inverted_index.keys():
        # set() keeps a word from being listed twice under a repeated k-gram
        for gram in set(ngrams(word, k)):
            # a single pass over the vocabulary instead of scanning it once per k-gram
            k_gram_index.setdefault(convertTuple(gram), []).append(word)
    return k_gram_index


def generate_wildcard_options(wildcard, k_gram_index, inverted_index, k=3):
    # for a given wildcard return all words matching it using k-grams
    # refer to book chapter 3.2.2
    # pad the wildcard with '$' so that prefix and suffix parts produce boundary k-grams
    padded = "$" + wildcard + "$"
    # fragments between the stars
    parts = padded.split("*")
    # collect the k-grams of every fragment (fragments shorter than k give none)
    grams = set()
    for part in parts:
        for gram in ngrams(part, k):
            grams.add(convertTuple(gram))
    if grams:
        # intersect the postings of all k-grams; an unseen k-gram yields no candidates
        candidates = set.intersection(*(set(k_gram_index.get(g, [])) for g in grams))
    else:
        # every fragment is shorter than k, so fall back to the whole vocabulary
        candidates = set(inverted_index.keys())
    # post-filter: the intersection may contain false positives,
    # so check fragment order and position against the padded word
    pattern = re.compile(".*".join(re.escape(p) for p in parts))
    return [word for word in candidates if pattern.fullmatch(word)]


def search_wildcard(wildcard, k_gram_index, index, docs, k=3):
    # retrieve the list of documents (facts) that contain words matching the wildcard
    wildcard_options = generate_wildcard_options(wildcard, k_gram_index, index, k)
    # strip the '$' padding before looking the words up in the documents
    words = [option.strip('$').upper() for option in wildcard_options]
    matching_docs = set()
    for line in docs.values():
        uppered = line.upper()
        # note: this is a case-insensitive substring test,
        # so a word is also found inside a longer word
        if any(word in uppered for word in words):
            matching_docs.add(line)

    return list(matching_docs)




### 1.2 Tests

In [0]:
index_orig_forms = build_inverted_index_orig_forms(facts_d)
k_gram_index = build_k_gram_index(index_orig_forms, 3)
wildcard = "re*ed"

wildcard_options = generate_wildcard_options(wildcard, k_gram_index, index_orig_forms)
print(wildcard_options)
assert(len(wildcard_options) >= 3)

wildcard_results = search_wildcard(wildcard, k_gram_index, index_orig_forms, facts_d)
# some pretty printing
for r in wildcard_results:
    # highlight terms for visual evaluation
    for term in wildcard_options:
        r = re.sub(r'(' + term + ')', r'\033[1m\033[91m\1\033[0m', r, flags=re.I)
    print(r)

assert(len(wildcard_results) >=3)

assert "James Buchanan, the 15th U.S. president continuously bought slaves with his own money in order to free them." in search_wildcard("pres*dent", k_gram_index, index_orig_forms, facts_d)
assert "9 out of 10 Americans are deficient in Potassium." in search_wildcard("p*tas*um", k_gram_index, index_orig_forms, facts_d)
assert "A man from Britain changed his name to Tim Pppppppppprice to make it harder for telemarketers to pronounce." in search_wildcard("*price", k_gram_index, index_orig_forms, facts_d)

['$reduced$', '$received$', '$recorded$']
A person can live without food for about a month, but only about a week without water. If the amount of water in your body is reduced by just 1%, youll feel thirsty. If its reduced by 10%, youll die.
More than 50% of the people in the world have never made or received a telephone call.
The largest recorded snowflake was in Keogh, MT during year 1887, and was 15 inches wide.




## 2. Handling typos

### 2.1 Dataset 

Download github typo dataset from [here](https://github.com/mhagiwara/github-typo-corpus).
Load it with this code:

In [0]:
!pip install jsonlines
import jsonlines

dataset_file = "github-typo-corpus.v1.0.0.jsonl"

dataset = []
other_langs = set()

with jsonlines.open(dataset_file) as reader:
    for obj in reader:
        for edit in obj['edits']:
            if edit['src']['lang'] != 'eng':
                other_langs.add(edit['src']['lang'])
                continue

            if edit['is_typo']:
                src, tgt = edit['src']['text'], edit['tgt']['text']
                if src.lower() != tgt.lower():
                    dataset.append((edit['src']['text'], edit['tgt']['text']))
                
print(f"Dataset size = {len(dataset)}")

Dataset size = 245909


#### Explore sample typos
Please explore the dataset. You will see that:
- it is mostly markdown
- it contains some common mistakes with do/does
- some edits are just punctuation typos (which we do not consider)

In [0]:
for pair in dataset[1010:1020]:
    print(f"{pair[0]} => {pair[1]}")

        """Make am instance. =>         """Make an instance.
* travis: test agains Node.js 11 => * travis: test against Node.js 11
The parser receive a string and returns an array inside a user-provided  => The parser receives a string and returns an array inside a user-provided 
CSV data is send through the `write` function and the resulted data is obtained => CSV data is sent through the `write` function and the resulting data is obtained
One useful function part of the Stream API is `pipe` to interact between  => One useful function of the Stream API is `pipe` to interact between 
source to a `stream.Writable` object destination. This example available as  => source to a `stream.Writable` object destination. This example is available as 
`node samples/pipe.js` read the file, parse its content and transform it. => `node samples/pipe.js` and reads the file, parses its content and transforms it.
Most of the generator is imported from its parent project [CSV][csv] in a effort  => Most o

#### Build a dataset vocabulary
We will need it for Norvig's spellchecker as well as for estimating overall correction quality. Consider the word level only. Be careful, there is markdown (e.g. \`name\`, \[url\]\(http://url)) and comment symbols (\#, //, \*).
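One possible way to deal with the markup mentioned above is to strip it with a few regular expressions before extracting words. The sketch below is only an assumption about what such cleaning could look like; `strip_markup` and `words_only` are hypothetical helpers and are not used by the cells that follow, which simply keep alphabetical tokens.

```python
import re

def strip_markup(sentence):
    # Drop inline code spans such as `name`
    sentence = re.sub(r"`[^`]*`", " ", sentence)
    # Keep the link text of [text](url) and drop the URL
    sentence = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", sentence)
    # Drop bare URLs
    sentence = re.sub(r"https?://\S+", " ", sentence)
    # Drop leading comment or list symbols (#, //, *)
    sentence = re.sub(r"^\s*(#+|//+|\*+)\s*", "", sentence)
    return sentence

def words_only(sentence):
    # Simplification: only runs of ASCII letters count as words,
    # so "don't" is split into "don" and "t"
    return re.findall(r"[a-z]+", strip_markup(sentence).lower())

print(words_only("* See [the docs](http://example.com/x) for `pipe` usage"))
# ['see', 'the', 'docs', 'for', 'usage']
```

Whether to drop code spans entirely or keep their contents is a design choice: identifiers inside backticks are rarely dictionary words, so this sketch discards them.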

In [0]:
def sent_to_words(sent):
    # splits sentence to words, filtering out non-alphabetical terms
    words = nltk.word_tokenize(sent)    
    words_filtered = filter(lambda x: x.isalpha(), words)
    return words_filtered

In [0]:
vocabulary = Counter()
for pair in dataset:
    for word in sent_to_words(pair[1].lower()):
        vocabulary[word] += 1
len(vocabulary)

58392

In [0]:
from itertools import islice
print(list(islice(vocabulary.items(), 10)))

[('function', 6100), ('de', 80), ('deutsch', 4), ('nocomments', 2), ('you', 41999), ('can', 26004), ('disable', 527), ('comments', 351), ('for', 44674), ('the', 206912)]


### 2.2 Implement context-independent spellchecker ##

0) Write code to compute edit distance

1) [Norvig's corrector](https://norvig.com/spell-correct.html)

2) [Soundex](https://en.wikipedia.org/wiki/Soundex)

3) Trigrams with Jaccard coefficient.

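#### Edit distance

Edit distance is a frequently used distance measure between two character sequences: the minimal number of single-character edits needed to turn one sequence into the other. We will use this distance to sort Soundex search results.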
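For intuition, the distance can be computed with dynamic programming. The sketch below implements the restricted variant (optimal string alignment), which counts insertions, deletions, substitutions and transpositions of adjacent characters. The function name `osa_distance` is made up for this example, and its results may differ from the library used below on inputs where the unrestricted Damerau-Levenshtein distance is smaller.

```python
def osa_distance(a, b):
    # dist[i][j] is the distance between the prefixes a[:i] and b[:j]
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[len(a)][len(b)]

print(osa_distance("soem", "some"))  # 1
```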

In [0]:
!pip install pyxDamerauLevenshtein
from pyxdameraulevenshtein import damerau_levenshtein_distance

# source https://github.com/gfairchild/pyxDamerauLevenshtein


#### Tests

In [0]:
assert damerau_levenshtein_distance("korrectud", "corrected") == 2, "Edit distance is computed incorrectly"
assert damerau_levenshtein_distance("soem", "some") == 1, "Edit distance is computed incorrectly"
assert damerau_levenshtein_distance("one", "one") == 0, "Edit distance is computed incorrectly"

#### Norvig's spellchecker

In [0]:
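def fix_typo_norvig(word) -> str:
  # Brute-force nearest neighbour: return the vocabulary word with the smallest
  # Damerau-Levenshtein distance to the input (ties go to the word seen first).
  # Unlike Norvig's original corrector, word frequency is not used for ranking.
  return min(vocabulary,
             key=lambda candidate: damerau_levenshtein_distance(word, candidate),
             default=word)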

#### Tests

In [0]:
assert fix_typo_norvig("korrectud") == "corrected", "Norvig's correcter doesn't work"
assert fix_typo_norvig("speling") == "spelling", "Norvig's correcter doesn't work"
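For comparison, `fix_typo_norvig` above scans the whole vocabulary for the closest word, whereas Norvig's article generates all strings within one or two edits of the input and picks the most frequent known one. The sketch below follows that idea on a toy frequency table; `toy_counts` and the function names are assumptions for illustration and are not wired into the rest of the notebook.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # All strings one edit away: deletes, transposes, replaces, inserts
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # All strings two edits away
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def known(words, counts):
    # Keep only the words present in the vocabulary
    return {w for w in words if w in counts}

def correct(word, counts):
    # Prefer the word itself, then distance 1, then distance 2, else give up
    candidates = (known([word], counts) or known(edits1(word), counts)
                  or known(edits2(word), counts) or {word})
    # Among candidates at the same distance, pick the most frequent one
    return max(candidates, key=lambda w: counts.get(w, 0))

toy_counts = Counter({"spelling": 5, "spewing": 1, "corrected": 3, "the": 100})
print(correct("speling", toy_counts))  # spelling
```

Here both "spelling" and "spewing" are one edit away from "speling", and the frequency count breaks the tie.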

#### Soundex 

For cases when the exact spelling is unknown, phonetic algorithms such as Soundex can be very helpful - they allow the user to type a word the way they think it sounds, and then suggest the correct version. Go through *chapter 3.4* to understand how the Soundex algorithm works.

In [0]:
# Soundex digit for each letter (book chapter 3.4); vowels and h, w, y map to '0'
SOUNDEX_CODES = {}
for letters, digit in (('aeiouhwy', '0'), ('bfpv', '1'), ('cgjkqsxz', '2'),
                       ('dt', '3'), ('l', '4'), ('mn', '5'), ('r', '6')):
  for ch in letters:
    SOUNDEX_CODES[ch] = digit

def mapa(letter, previous):
  # return the Soundex digit for the letter, or '' if it repeats the previous digit;
  # characters outside the table are treated like vowels ('0')
  digit = SOUNDEX_CODES.get(letter.lower(), '0')
  return '' if digit == previous else digit

def produce_soundex_code(word):
    # Soundex algorithm, version from book chapter 3.4
    # input word is already lowercased
    # returns Soundex 4-character code, like 'k450'
    # keep the first letter, then append digits with consecutive duplicates collapsed
    transcript = word[0]
    for letter in word[1:]:
        transcript += mapa(letter, transcript[-1])
    # remove zeros, then pad with trailing zeros and cut to 4 characters
    code = transcript.replace('0', '')
    return code.ljust(4, '0')[:4]


def build_soundex_index(dictionary):
    # build soundex index for dictionary words.
    # dictionary is a vocabulary of original words
    # output format: 'code1': ['word1_with_code1', 'word2_with_code1', ...]
    soundex_index = {}
    # iterate over the argument, not over the global vocabulary
    for element in dictionary.keys():
      key = produce_soundex_code(element)
      if key not in soundex_index:
        soundex_index[key] = []
      soundex_index[key].append(element)
    return soundex_index


def fix_typo_soundex(word, soundex_index) -> list:
  # return vocabulary words that share the Soundex code of the input word,
  # ordered by edit distance to it
  sndx = produce_soundex_code(word)
  if sndx not in soundex_index:
    # no candidates: fall back to the input word, wrapped in a list
    # so that callers can always index the result
    return [word]
  return sorted(soundex_index[sndx],
                key=lambda candidate: damerau_levenshtein_distance(word, candidate))

#### Tests

In [0]:
soundex_index = build_soundex_index(vocabulary)

code1 = produce_soundex_code("britney")
code2 = produce_soundex_code("breatany")
print(code1, code2)
assert code1 == code2

print(fix_typo_soundex("enouhg", soundex_index))
assert "enough" in fix_typo_soundex("enouhg", soundex_index), "Assert soundex failed"

b635 b635
['enough', 'ensue', 'eng', 'enjoy', 'emoji', 'enqueue', 'ens', 'enc', 'emojii', 'enki', 'enso', 'enzo', 'enwiki', 'emesh', 'emg', 'emacs', 'emc', 'emas', 'euank', 'enmasse', 'emac', 'emmc', 'emgo']


#### Trigrams with Jaccard coefficient
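The Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. Applied to the trigram sets of a misspelled word and a vocabulary word, it gives a similarity score between 0 and 1. A minimal sketch, with helper names `trigrams` and `jaccard` chosen only for this example:

```python
def trigrams(word, k=3):
    # Set of k-grams of the word padded with '$'
    padded = "$" + word + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# 'enouh' and 'enough' share '$en', 'eno', 'nou' out of 8 distinct trigrams
print(jaccard(trigrams("enouh"), trigrams("enough")))  # 0.375
```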

In [0]:
def fix_typo_kgram(word, k_gram_index) -> list:
  # infer k from the keys of the index
  k = len(next(iter(k_gram_index)))
  # k-grams of the padded query word
  query_grams = {convertTuple(g) for g in ngrams('$' + word + '$', k)}
  # candidate words share at least one k-gram with the query
  # (use the index passed as an argument, not a global one)
  candidates = set()
  for gram in query_grams:
    candidates.update(k_gram_index.get(gram, []))
  if not candidates:
    # nothing to suggest: return the input word, wrapped in a list
    return [word]
  # Jaccard coefficient between the k-gram sets of the query and of each candidate
  scores = {}
  for candidate in candidates:
    cand_grams = {convertTuple(g) for g in ngrams('$' + candidate + '$', k)}
    overlap = len(query_grams & cand_grams)
    scores[candidate] = overlap / (len(query_grams) + len(cand_grams) - overlap)
  # most similar candidates first
  return sorted(scores, key=scores.get, reverse=True)


#### Tests

In [0]:
# the index must exist before the calls below; building it can take a while on a large vocabulary
k_gram_index_github = build_k_gram_index(vocabulary, 3)
print(fix_typo_kgram("enouh", k_gram_index_github)[:20])
assert "enough" in fix_typo_kgram("enouh", k_gram_index_github), "Assert k-gram failed"

['enough', 'eno', 'enought', 'endogenous', 'enomem', 'enospc', 'enosys', 'enormous', 'renounce', 'exogenous', 'enormously', 'homogenous', 'hetrogenous', 'heterogenous', 'noun', 'nous', 'deno', 'menoh', 'nouns', 'inout']


### 2.3 Estimate quality

In [0]:
norvig, soundex, kgram = 0, 0, 0
limit = 10000
for i, (src, target) in enumerate(dataset):
    if i == limit:
        break
    words = sent_to_words(src.lower())
    # one copy of the sentence per correction method
    sn, ss, sk = src.lower(), src.lower(), src.lower()
    for word in words:
        # word suspected for typos
        if word not in vocabulary and word.isalpha():
            # top-1 accuracy
            wn = fix_typo_norvig(word)
            ws = fix_typo_soundex(word, soundex_index)[0]
            wk = fix_typo_kgram(word, k_gram_index_github)[0]

            sn = sn.replace(word, wn)
            ss = ss.replace(word, ws)
            sk = sk.replace(word, wk)
    # a sentence counts as correct only if it matches the target exactly
    norvig += int(sn == target.lower())
    soundex += int(ss == target.lower())
    kgram += int(sk == target.lower())

print(f"Norvig accuracy ({norvig}) = {norvig / limit}")
print(f"Soundex accuracy ({soundex}) = {soundex / limit}")
print(f"k-gram accuracy ({kgram}) = {kgram / limit}")

# Norvig accuracy (2346) = 0.2346
# Soundex accuracy (1673) = 0.1673
# k-gram accuracy (1566) = 0.1566


In [0]:
print(f"Norvig accuracy ({norvig}) = {norvig / limit}")
print(f"Soundex accuracy ({soundex}) = {soundex / limit}")
print(f"k-gram accuracy ({kgram}) = {kgram / limit}")

Norvig accuracy (2432) = 0.2432
Soundex accuracy (1798) = 0.1798
k-gram accuracy (1560) = 0.156
