## Preprocessing Steps

Raw NLP datasets usually contain inconsistencies such as stray symbols, mixed casing, and misspellings, so they cannot be used directly. The steps below clean the text first.

### Removing Non-Alphanumeric (Special) Characters

In [1]:
import re
def remove_nonalphanumeric(text):
    text = re.sub("[^a-zA-Z0-9 ]", " ", text)
    text = text.strip()
    return text
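
To see what this step does to the sample sentence used later in this section, here is a small self-contained sketch. It repeats `remove_nonalphanumeric` from the cell above and adds `collapse_whitespace`, a hypothetical helper that is not part of the original notebook, because every removed character leaves a space behind.

```python
import re

def remove_nonalphanumeric(text):
    # Same logic as the cell above: replace anything that is not a letter, digit, or space
    return re.sub("[^a-zA-Z0-9 ]", " ", text).strip()

def collapse_whitespace(text):
    # Hypothetical helper (an assumption, not in the notebook): squeeze runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

cleaned = remove_nonalphanumeric("ThES is a very*hard to READ sentance.")
print(cleaned)   # ThES is a very hard to READ sentance

squeezed = collapse_whitespace(remove_nonalphanumeric("a -- b"))
print(squeezed)  # a b
```

Since the tokenization step further down splits on arbitrary whitespace, the extra spaces are harmless in this pipeline; the helper only matters if the cleaned string is used as-is.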

### Case Folding

In [2]:
def convert_lower(text):
    text = text.lower()
    return text
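
For plain English text `lower()` is usually enough. As a hedged aside, Python also offers `str.casefold()`, which is more aggressive and intended for caseless matching; the sketch below only illustrates the difference and is not part of the original pipeline.

```python
words = ["READ", "Straße"]

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)  # ['read', 'straße']
print(folded)   # ['read', 'strasse']
```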

### Tokenization
        Splitting text into tokens, here on whitespace. E.g., This is a sentence -> [This, is, a, sentence]

In [3]:
def tokenization(text):
    tokens = text.split()
    return tokens
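
Whitespace splitting works here because special characters were removed in an earlier step. On raw text, punctuation stays glued to the words. The sketch below contrasts the two cases; `regex_tokenize` and its pattern are assumptions chosen for illustration, not the notebook's method.

```python
import re

def whitespace_tokenize(text):
    # Same behaviour as tokenization() above
    return text.split()

def regex_tokenize(text):
    # Assumption for illustration: a token is a run of word characters
    # or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Hello, world!"
print(whitespace_tokenize(sample))  # ['Hello,', 'world!']
print(regex_tokenize(sample))       # ['Hello', ',', 'world', '!']
```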

### Stopwords
        Removing words that appear in a stopword list. E.g., This is a sentence -> This sentence

In [4]:
import nltk
# Requires the stopword corpus: nltk.download('stopwords')
english_stopwords = set(nltk.corpus.stopwords.words('english'))
def remove_stopwords(tokens):
    # Set membership keeps the lookup cheap for every token
    return [t for t in tokens if t not in english_stopwords]
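
Stopword matching is case-sensitive, which is why case folding comes first in the combined pipeline. The sketch below uses `TOY_STOPWORDS`, a tiny hand-made set that stands in for the NLTK list (an assumption so the example runs without downloading a corpus), and assumes that stopwords are stored in lowercase, as in NLTK's English list.

```python
# Hypothetical three-word stopword list, for illustration only
TOY_STOPWORDS = {"this", "is", "a"}

def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    return [t for t in tokens if t not in stopwords]

tokens = ["This", "is", "a", "sentence"]
print(remove_stopwords(tokens))                       # ['This', 'sentence']
print(remove_stopwords([t.lower() for t in tokens]))  # ['sentence']
```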

### Normalization
       - Stemming: Reducing a word to its stem using heuristic rules such as stripping "ing", "s", etc. The stem need not be a dictionary word.
       - E.g., runs -> run, walking -> walk, sentence -> sentenc
       - Lemmatization: Converting a word into its dictionary form (lemma) using morphological analysis
       - E.g., ran -> run
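
To make "heuristic rules" concrete, here is a deliberately naive suffix stripper. It is a toy under stated assumptions (the suffix list and the minimum stem length are made up) and is far simpler than the Snowball stemmer used below, but it shows why rule-based stemming handles "runs" and cannot handle an irregular form like "ran".

```python
def naive_stem(word, suffixes=("ing", "ed", "s"), min_stem_len=3):
    # Strip the first matching suffix, unless the remainder would be too short
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["runs", "walking", "ran", "bus"]:
    print(w, "->", naive_stem(w))
# runs -> run
# walking -> walk
# ran -> ran
# bus -> bus
```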


In [5]:
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")

def stemmer(tokens):
    stems = []
    for token in tokens:
        stems.append(snowball_stemmer.stem(token))
    return stems

In [6]:
stemmer(['runs', 'walking'])

['run', 'walk']

In [7]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the matching WordNet POS constant
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # '' is not a valid WordNet POS; fall back to noun, the lemmatizer's default
        return wordnet.NOUN


def lemmatizer(tokens, pos_tags):
    lemmas = []
    for token, pos_tag in zip(tokens, pos_tags):
        wordnet_tag = get_wordnet_pos(pos_tag)
        lemmas.append(wordnet_lemmatizer.lemmatize(token, wordnet_tag))
    return lemmas

In [8]:
lemmatizer(['ran'], ['VB'])

['run']
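
The tag mapping used by the lemmatizer can also be written as a lookup table. The sketch below runs without NLTK by using WordNet's single-letter POS codes directly (`'a'`, `'v'`, `'n'`, `'r'`, the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`). `penn_to_wordnet` is a hypothetical name, and falling back to noun for unmapped tags is an assumption that mirrors the lemmatizer's default part of speech.

```python
# First letter of a Penn Treebank tag -> WordNet POS code
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(treebank_tag, default="n"):
    # Unmapped tags (determiners, prepositions, ...) fall back to `default`
    return PENN_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)

for tag in ["JJ", "VBD", "NNP", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
# JJ -> a
# VBD -> v
# NNP -> n
# RB -> r
# DT -> n
```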

### What is WordNet?

WordNet is a freely available lexical database of English. It resembles a thesaurus: words are grouped into sets of synonyms (synsets), each representing one sense of a word, and synsets are linked by relations such as antonymy and hypernymy.

In [35]:
wordnet.synsets("bank")

[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10'),
 Synset('bank.v.01'),
 Synset('bank.v.02'),
 Synset('bank.v.03'),
 Synset('bank.v.04'),
 Synset('bank.v.05'),
 Synset('deposit.v.02'),
 Synset('bank.v.07'),
 Synset('trust.v.01')]

In [43]:
wordnet.synsets("bank")[5].definition()

'the funds held by a gambling house or the dealer in some gambling games'

In [44]:
w1 = wordnet.synset('bank.n.05')
w2 = wordnet.synset('bank.n.06')
print(w1.wup_similarity(w2))

0.5
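
`wup_similarity` is the Wu-Palmer measure: twice the depth of the deepest common ancestor of two senses, divided by the sum of their depths. The sketch below applies that formula to a hypothetical five-node taxonomy (the `PARENT` table is made up for illustration). NLTK computes it over the real WordNet hypernym graph and handles details this toy ignores, such as senses with several parents.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root)
PARENT = {"entity": None, "animal": "entity", "plant": "entity",
          "dog": "animal", "cat": "animal"}

def path_to_root(node):
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path  # node first, root last

def depth(node):
    return len(path_to_root(node))  # the root has depth 1

def wu_palmer(a, b):
    ancestors_a = path_to_root(a)
    # First ancestor of b shared with a: the deepest common ancestor
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(round(wu_palmer("dog", "cat"), 3))    # 0.667
print(round(wu_palmer("dog", "plant"), 3))  # 0.4
```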


### Spell Correct

In [9]:
# Older autocorrect releases expose a module-level `spell` function;
# newer releases provide a `Speller` class instead.
from autocorrect import spell
def get_spell_corrections(tokens):
    return [spell(token) for token in tokens]
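
The first entry under Further Readings describes a simple corrector built from edit distance and word frequencies. The sketch below follows that idea in a reduced form: it only considers candidates one edit away, and `WORD_COUNTS` is a tiny made-up frequency table. It is an illustration of the idea, not a description of how the `autocorrect` package works internally.

```python
from collections import Counter

# Hypothetical word frequencies; a real corrector would count words in a large corpus
WORD_COUNTS = Counter({"the": 80, "a": 70, "is": 60, "this": 50, "to": 40,
                       "very": 10, "read": 8, "hard": 5, "sentence": 4})

def edits1(word):
    # All strings one delete, transpose, replace, or insert away from `word`
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    if word in WORD_COUNTS:
        return word  # known words are left alone
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing close enough; keep the input
    return max(candidates, key=WORD_COUNTS.__getitem__)

print(correct("sentance"))  # sentence
print(correct("thes"))      # the
```

With this table "thes" becomes "the" rather than "this" because both are one edit away and "the" has the higher count; the outcome depends entirely on the frequencies supplied.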

### Preprocessing Steps combined together

In [10]:
text = "ThES is a very*hard to READ sentance."

In [11]:
stemmer(remove_stopwords(get_spell_corrections(tokenization(convert_lower(remove_nonalphanumeric(text))))))

['hard', 'read', 'sentenc']
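
The nested call above has to be read inside-out. One hedged alternative is to list the steps in order and fold over them. `run_pipeline` is a hypothetical helper, and the two built-in steps below are stand-ins so the sketch runs without NLTK or autocorrect.

```python
from functools import reduce

def run_pipeline(value, steps):
    # Feed the output of each step into the next, left to right
    return reduce(lambda acc, step: step(acc), steps, value)

# Stand-in steps: lowercase, then split on whitespace
steps = [str.lower, str.split]
result = run_pipeline("Hard TO read", steps)
print(result)  # ['hard', 'to', 'read']
```

With the functions defined in this section, the list would read `[remove_nonalphanumeric, convert_lower, tokenization, get_spell_corrections, remove_stopwords, stemmer]`, which matches the order of the nested call.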

### Part-Of-Speech Tagging

Part-of-speech (POS) tagging assigns a grammatical category (e.g., noun, verb, adjective) to each word of a sentence. These tags help identify the grammatical structure of a sentence, along with its intents and objects.

The tags used here come from the Penn Treebank tag set. Ref: [2]

In [12]:
def get_pos_tags(sent):
    pos_tags = nltk.pos_tag(nltk.word_tokenize(sent))
    return pos_tags

In [13]:
get_pos_tags("Ram killed Ravan")

[('Ram', 'NNP'), ('killed', 'VBD'), ('Ravan', 'NNP')]
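
The tagger returns a list of `(word, tag)` pairs, which is easy to post-process. As a small sketch, the snippet below takes the output shown above as a literal (so it runs without NLTK) and groups the words by tag.

```python
from collections import defaultdict

# Output of get_pos_tags("Ram killed Ravan"), copied from the cell above
tagged = [("Ram", "NNP"), ("killed", "VBD"), ("Ravan", "NNP")]

words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(dict(words_by_tag))  # {'NNP': ['Ram', 'Ravan'], 'VBD': ['killed']}
```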

### Noun Phrase Extraction / Chunking
A noun phrase is a word or group of words that contains a noun and functions in a sentence as subject, object, or prepositional object. E.g.,
    - [We] are attending [NLP Classes].
    - [Ram] killed [Ravan].

Chunking is an important step in an NLP pipeline because the extracted noun phrases are often used to draw insights from text datasets.

In [14]:
def get_noun_phrases(pos_tags):
    # Penn Treebank tags possessive pronouns as PRP$ (PP$ is the Brown corpus tag)
    grammar = r"""
              NP: {<PRP\$>?<JJ>*<NN>+}         # optional possessive pronoun, adjectives, nouns
                  {<NN>*<NNP>+}                # chunk sequences of proper nouns
                  {<NN>*<NNS>+}                # nouns ending in a plural noun
                  {<NN>+}                      # one or more singular nouns
            """
    phrases = []
    chunkParser = nltk.RegexpParser(grammar)
    chunked = chunkParser.parse(pos_tags)
    for subtree in chunked.subtrees(filter=lambda t: t.label() == 'NP'):
        phrase = " ".join([word for word, tag in subtree.leaves()])
        phrases.append(phrase)
    return phrases


In [15]:
get_noun_phrases(get_pos_tags("Jim bought 300 shares of Acme Corp. in 2006."))

['Jim', 'shares', 'Acme Corp.']
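
The core idea behind the grammar is to group adjacent tokens whose tags form a noun pattern. The sketch below is a much cruder stand-in that runs without NLTK: it joins maximal runs of tokens whose tag starts with `NN`. The tags are hand-written assumptions (a real tagger may differ), and `noun_runs` is a hypothetical helper that ignores adjectives and possessives, unlike the `RegexpParser` grammar.

```python
from itertools import groupby

# Hand-written Penn Treebank tags (an assumption; a tagger may differ)
tagged = [("Jim", "NNP"), ("bought", "VBD"), ("300", "CD"), ("shares", "NNS"),
          ("of", "IN"), ("Acme", "NNP"), ("Corp.", "NNP"), ("in", "IN"),
          ("2006", "CD"), (".", ".")]

def noun_runs(tagged_tokens):
    phrases = []
    # groupby yields consecutive tokens that agree on "is this a noun tag?"
    for is_noun, group in groupby(tagged_tokens, key=lambda pair: pair[1].startswith("NN")):
        if is_noun:
            phrases.append(" ".join(word for word, _ in group))
    return phrases

print(noun_runs(tagged))  # ['Jim', 'shares', 'Acme Corp.']
```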

### Named Entity Recognition

Named entity recognition (NER) is the task of identifying and extracting entities of type PERSON, ORGANIZATION, LOCATION, TIME, MONETARY VALUE, etc.

    - E.g., [Jim/PERSON] bought 300 shares of [Acme Corp./ORGANIZATION] in [2006/TIME].

In [21]:
from collections import defaultdict
from nltk.chunk import tree2conlltags
def get_ner_tags(pos_tags):
    ner_chunks = nltk.ne_chunk(pos_tags)
    # Flatten the chunk tree into (word, pos, IOB-tag) triples
    ner_tagged = tree2conlltags(ner_chunks)
    ner_type = ''
    ner_str = ''
    ner_dict = defaultdict(lambda: [])

    for tags in ner_tagged:
        tag = tags[2]
        if tag == 'O':
            # Outside any entity: flush the entity collected so far
            if len(ner_str) > 0:
                ner_dict[ner_type].append(ner_str)
            ner_str = ''
            ner_type = ''
            continue

        if tag[:2] == "B-":
            # A new entity begins: flush the previous one first
            if len(ner_str) > 0:
                ner_dict[ner_type].append(ner_str)

            ner_str = tags[0]
            ner_type = tag[2:]

        if tag[:2] == 'I-':
            # Continuation of the current entity
            ner_str += " " + tags[0]

    # Flush an entity that runs to the end of the sentence
    if len(ner_str) > 0:
        ner_dict[ner_type].append(ner_str)

    return ner_dict

In [24]:
get_ner_tags(get_pos_tags('Jim bought 300 shares of Acme Corp. in 2006.'))

defaultdict(<function __main__.get_ner_tags.<locals>.<lambda>>,
            {'ORGANIZATION': ['Acme Corp.'], 'PERSON': ['Jim']})
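
The function above is easier to follow once the intermediate IOB format is visible: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The sketch below decodes hand-written triples in the `(word, pos, tag)` shape that `tree2conlltags` returns. The triples are assumed for illustration, and `iob_to_spans` is a hypothetical helper that returns entities in sentence order instead of a dictionary.

```python
# Hand-written (word, POS, IOB) triples; the tags are assumptions for illustration
iob = [("Jim", "NNP", "B-PERSON"), ("bought", "VBD", "O"), ("300", "CD", "O"),
       ("shares", "NNS", "O"), ("of", "IN", "O"),
       ("Acme", "NNP", "B-ORGANIZATION"), ("Corp.", "NNP", "I-ORGANIZATION"),
       ("in", "IN", "O"), ("2006", "CD", "O"), (".", ".", "O")]

def iob_to_spans(triples):
    spans, current = [], None
    for word, _pos, tag in triples:
        if tag.startswith("B-"):
            # Start a new entity and remember it as the open one
            current = (tag[2:], [word])
            spans.append(current)
        elif tag.startswith("I-") and current is not None:
            # Extend the open entity
            current[1].append(word)
        else:
            # 'O' (or a stray 'I-') closes any open entity
            current = None
    return [(label, " ".join(words)) for label, words in spans]

print(iob_to_spans(iob))  # [('PERSON', 'Jim'), ('ORGANIZATION', 'Acme Corp.')]
```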

### Further Readings
1. http://norvig.com/spell-correct.html
2. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
3. https://wordnet.princeton.edu/