## Spelling Correction Tool
- By spelling correction, we mean predicting the correct word from an incorrectly typed word. 

__The steps in spell checking:__
- Read in a large corpus of words
- Count how many times each word appears
- Generate candidate words from the input word:
    - The input word itself
    - Words at edit distance 1 (one insert, delete, transpose, or replace)
    - Words at edit distance 2 (two such edits)
- Among the candidates that appear in the corpus, pick the one with the highest probability of occurring
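
As a minimal sketch of the counting and selection steps, assuming a toy corpus (the `toy_corpus` string and the `probability` helper are illustrative names, not part of this notebook), word counts can be turned into probabilities and used to rank candidates:

```python
import re
from collections import Counter

# Toy corpus standing in for the real word list (an assumption for illustration).
toy_corpus = "biopsy showed remission patient biopsy patient biopsy"

counts = Counter(re.findall(r"\w+", toy_corpus.lower()))
total = sum(counts.values())

def probability(word):
    """Relative frequency of `word` in the toy corpus (0.0 if unseen)."""
    return counts[word] / total

# Among several candidates, keep the most probable one.
candidates = ["biopsy", "patient", "remission"]
best = max(candidates, key=probability)
```

Because every probability shares the same denominator, ranking by raw count gives the same winner, which is why the code further down can rank candidates by count alone.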


1. [Building word list](#Building-word-list)

2. [Spelling Correction](#Spelling-Correction)

3. [Test](#Test)


In [1]:
import configparser
import re
import string
from collections import Counter

import inflect as inf
import nltk
import pandas as pd
import psycopg2
import pymysql
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')
stop_words = stopwords.words('english')

config = configparser.ConfigParser()
config.read('config.ini')
config.sections()

[nltk_data] Downloading package wordnet to /Users/lexiew/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['SOS_DB', 'docdx_db_production']

### Building word list
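__STEPS__
- Fetch all answer data from docdx (top-level comments in the `comment` table)
- Clean the data:
    - normalize line breaks, tabs, and the `w/` shorthand
    - remove punctuation
    - remove digits
    - collapse repeated whitespace
    - remove stopwords
    - convert plurals to singular
- Store the result in a text file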
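
The cleaning steps can be sketched on a single string as follows. This is a simplified illustration, not the notebook's function: `TOY_STOP_WORDS` is a hypothetical mini stopword list (the notebook uses NLTK's English list), and the plural-to-singular step is left out because it depends on the `inflect` package.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "is", "with", "to", "in", "of", "a"}
PUNCTUATION = set(",.:;'\"-?!/{}()+%&")

def clean(text):
    """Apply the cleaning steps in order and return the surviving tokens."""
    text = re.sub(r"[\n\t]", " ", text)   # line breaks and tabs become spaces
    text = text.replace("w/", "with")     # expand the "w/" shorthand
    text = "".join(" " if ch in PUNCTUATION else ch for ch in text).lower()
    text = re.sub(r"\d+", "", text)       # drop digits
    tokens = text.split()                 # also collapses repeated spaces
    return [t for t in tokens if t not in TOY_STOP_WORDS]

cleaned = clean("The risk w/ biologics\tis 30% higher,\nin 2 patients!")
```

The order matters in one place: `w/` has to be expanded before punctuation is removed, because `/` is itself in the punctuation set.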

In [3]:
import_db_section = config['docdx_db_production']
host = import_db_section['host']
user = import_db_section['user']
password = import_db_section['password']
db_dx = import_db_section['db']
port = int(import_db_section.get('port', 5432))  # fall back to the default PostgreSQL port

# Keyword arguments include the port and avoid quoting problems with special characters in the password.
db_docdx = psycopg2.connect(host=host, port=port, dbname=db_dx, user=user, password=password)
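
The connection cell expects a `config.ini` section with the keys `host`, `user`, `password`, `db`, and optionally `port`. The sketch below shows that layout and the port fallback. All values are placeholders assumed for illustration, and `demo_config` is a hypothetical name chosen so the notebook's own `config` is not overwritten.

```python
import configparser

# Placeholder values only; real credentials belong in config.ini, not in the notebook.
sample = """
[docdx_db_production]
host = localhost
user = example_user
password = example_password
db = example_db
"""

demo_config = configparser.ConfigParser()
demo_config.read_string(sample)
section = demo_config["docdx_db_production"]

# The sample has no `port` key, so the fallback value is used.
port = int(section.get("port", 5432))
```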

In [4]:
## Get data + data cleaning 
def get_comment():
    query_comment = """select id comment_id, topic_id, created_by, text from comment
                        where reply_to_comment_id is null and modified_by is null and deleted_at is null
                        order by created_at """
    comment = pd.read_sql(query_comment, db_docdx)
    return comment

In [5]:
comment = get_comment()

In [8]:
#remove punctuation
def remove_punc(x):
    exclude = set(",.:;'\"-?!/{}()+%&")
    return "".join([(ch if ch not in exclude else " ") for ch in x])

def data_cleaning(df):
    """Return a cleaned copy of `df` with a new `cleaned_text` column."""
    p = inf.engine()
    df = df.copy()

    # normalize line breaks, tabs, and the "w/" shorthand
    df['text'] = df.text.apply(lambda x: re.sub(r'[\n\t]', " ", x))
    df['text'] = df.text.apply(lambda x: re.sub('w/', "with", x))

    # remove punctuation and lowercase
    df['cleaned_text'] = df['text'].apply(remove_punc).str.lower()

    # remove digits
    df['cleaned_text'] = df['cleaned_text'].apply(lambda x: re.sub(r'\d+', '', x))

    # collapse repeated whitespace
    df['cleaned_text'] = df['cleaned_text'].apply(lambda x: re.sub(r'\s+', ' ', x))

    # remove stopwords; split() without arguments also drops empty tokens
    df['cleaned_text'] = df['cleaned_text'].apply(
        lambda x: ' '.join(word for word in x.split() if word not in stop_words))

    # drop rows that are empty after cleaning; copy() avoids a SettingWithCopyWarning below
    df = df[df.cleaned_text != ''].copy()

    # plural to singular; singular_noun returns False when the word is not a plural noun
    def singular(word):
        return p.singular_noun(word) or word

    df['cleaned_text'] = df['cleaned_text'].apply(
        lambda x: ' '.join(singular(word) for word in x.split()))
    return df

In [9]:
comment = data_cleaning(comment)

In [11]:
comment.head()

| | comment_id | topic_id | created_by | text | cleaned_text |
|---|---|---|---|---|---|
| 0 | 24 | 2 | 560652 | In patients wth moderate to severe IBD, biolog... | patient wth moderate severe ibd biologic clear... |
| 1 | 25 | 2 | 563658 | The problem with stopping Biologics is the po... | problem stopping biologic potential formation... |
| 2 | 26 | 2 | 522329 | Yes they can- I've had many patient achieve re... | ye many patient achieve remission using biolog... |
| 3 | 27 | 2 | 585883 | The risk is developing HACA antibodies against... | risk developing haca antibody biologic trainin... |
| 4 | 28 | 2 | 568457 | If biologics were not antigenic (inducing neut... | biologic antigenic inducing neutralizing antib... |


In [12]:
# store all words in a file
def store_file(df):
    text = ' '.join(df.cleaned_text.str.lower())
    # the context manager closes the file even if the write fails
    with open("comment_up_to_date.txt", "w") as file:
        file.write(text)

store_file(comment)
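
`store_file` writes with the platform's default encoding, while the cell that later loads the file reads it as ISO-8859-1. If the comments contain non-ASCII characters, using one explicit encoding on both sides keeps them intact. A minimal round-trip sketch, assuming a toy string and a temporary file (both hypothetical, not the notebook's data):

```python
import os
import tempfile
from collections import Counter

text = "patient biopsy café patient"

# Write and read with the same explicit encoding so non-ASCII characters survive.
path = os.path.join(tempfile.mkdtemp(), "toy_words.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    counts = Counter(f.read().split())
```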

### Spelling Correction


- The input word is first split into every possible (first, rest) pair of substrings.
- From these pairs, a set of candidate words is generated by deletion, transposition, replacement, and insertion at edit distance 1. Applying the same edits to each of those results gives the candidates at edit distance 2.
- The candidates are then checked against the corpus, and the one with the highest probability of occurrence in the corpus is chosen.
    - Note that candidates are tried in priority order: the input word itself if it is in the corpus, otherwise known words at edit distance 1, otherwise known words at edit distance 2. If none of these is known, the input word is returned unchanged.
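
The priority rule is easiest to see on a toy example. The sketch below is self-contained and only covers edit distance 1. `TOY_WORDS`, `one_edit_away`, and `toy_correction` are illustrative names with made-up counts, not the notebook's corpus or functions.

```python
from collections import Counter

# Toy frequency table (illustrative counts, not taken from the real corpus).
TOY_WORDS = Counter({"loss": 40, "los": 1, "lot": 5})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def one_edit_away(word):
    """All strings reachable from `word` by one delete, transpose, replace, or insert."""
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in pairs if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in pairs if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in pairs if r for c in LETTERS]
    inserts = [l + c + r for l, r in pairs for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def toy_correction(word):
    """Prefer the word itself, then known words one edit away, else return it unchanged."""
    vocabulary = set(TOY_WORDS)
    candidates = ({word} & vocabulary) or (one_edit_away(word) & vocabulary) or {word}
    return max(candidates, key=lambda w: TOY_WORDS[w])
```

With this table, `toy_correction("los")` keeps `los` because it is a known word, even though the far more frequent `loss` is one edit away. `toy_correction("lss")` has no exact match, so it chooses between the known neighbours `los` and `loss` by count. The same rule is one plausible reason `los` is left unchanged in the Test section.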

In [13]:
def splits(word):
    "Return a list of all possible (first, rest) pairs that comprise word."
    return [(word[:i], word[i:]) for i in range(len(word)+1)]

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    pairs      = splits(word)  # reuse splits() instead of shadowing it with a local list
    deletes    = [L + R[1:]               for L, R in pairs if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in pairs if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in pairs if R for c in letters]
    inserts    = [L + c + R               for L, R in pairs for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def known(words): return set(w for w in words if w in WORDS)

def correction(word):
    "Find the best spelling correction for word."
    candidates = (known([word]) or 
                  known(edits1(word)) or 
                  known(edits2(word)) or 
                  [word])
    return max(candidates, key=WORDS.get)


# The following is a minor modification that preserves case and punctuation in a sentence
# while spell-correcting each word in it.


def correct_text(text):
    "Correct all the words within a text, returning the corrected text."
    return re.sub('[a-zA-Z]+', correct_match, text)

def correct_match(match):
    "Spell-correct word in match, and preserve proper upper/lower/title case."
    word = match.group()
    return case_of(word)(correction(word.lower()))

def case_of(text):
    "Return the case-function appropriate for text: upper, lower, title, or just str."
    return (str.upper if text.isupper() else
            str.lower if text.islower() else
            str.title if text.istitle() else
            str)
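
`correct_text` relies on `re.sub` accepting a function as the replacement: only runs of letters are passed to the callback, so punctuation, digits, and spacing are copied through untouched. The sketch below illustrates this with a hypothetical lookup table, `TOY_FIXES`, standing in for `correction`, and a simplified version of the casing logic.

```python
import re

# Hypothetical lookup table standing in for correction().
TOY_FIXES = {"biosy": "biopsy", "efficay": "efficacy"}

def fix_match(match):
    """Correct one matched word and re-apply its original casing."""
    word = match.group()
    fixed = TOY_FIXES.get(word.lower(), word.lower())
    if word.isupper():
        return fixed.upper()
    if word.istitle():
        return fixed.title()
    return fixed  # lower-case and mixed-case input both come back lower-case

result = re.sub(r"[a-zA-Z]+", fix_match, "Biosy (2nd), EFFICAY: don't re-test")
```

One consequence of the `[a-zA-Z]+` pattern is that `don't` and `re-test` are each seen as two separate words, so with a real corrector the fragments `don`, `t`, `re`, and `test` would be corrected independently.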

In [15]:
def words(text): return re.findall(r'\w+', text.lower())

# the context manager closes the file once the counts are built
with open('comment_up_to_date.txt', encoding="ISO-8859-1") as f:
    WORDS = Counter(words(f.read()))


print("Total number of unique words in docdx: %d " % len(WORDS))

Total number of unique words in docdx: 22403 


### Test

In [16]:
correct_text("""biopsy biosy biossy""")

'biopsy biopsy biopsy'

In [17]:
correct_text('stopped biologic patiens los efficay develop intolerance drug')

'stopped biologic patient los efficacy develop intolerance drug'

In [18]:
correct_text("""migraine migrainne""")

'migraine migraine'

In [20]:
correct_text("""cancer canner canccer""")

'cancer cancer cancer'
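
The manual checks above can be collected into a small accuracy measurement. This is a sketch: the pairs are the misspellings and outputs shown in the cells above, while `accuracy` and `stand_in` are hypothetical helpers. `stand_in` exists only so the block runs on its own; in the notebook, `correction` would be passed instead.

```python
def accuracy(correct_fn, cases):
    """Fraction of (typo, expected) pairs that `correct_fn` maps correctly."""
    hits = sum(1 for typo, expected in cases if correct_fn(typo) == expected)
    return hits / len(cases)

# Pairs taken from the manual checks above.
CASES = [
    ("biosy", "biopsy"),
    ("biossy", "biopsy"),
    ("patiens", "patient"),
    ("efficay", "efficacy"),
    ("migrainne", "migraine"),
    ("canner", "cancer"),
    ("canccer", "cancer"),
]

# Stand-in corrector so the sketch is self-contained.
_lookup = dict(CASES)

def stand_in(word):
    return _lookup.get(word, word)

score = accuracy(stand_in, CASES)
```

A held-out list of real misspellings, rather than hand-picked ones, would give a more honest estimate of how well the corrector works.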