## Natural Language Processing

1. Tokenization - the process of converting text into 'tokens'
2. Tokens - words or entities present in text
3. Text Object - a sentence, phrase, or word 
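
As a rough illustration of tokenization, the sketch below splits a text object into word and punctuation tokens with a regular expression. The helper name `simple_tokenize` is hypothetical, and real tokenizers (such as the NLTK one used later) handle many more cases.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Today I went to the gym."))
```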

### 2. Preprocessing

Preprocessing involves 'cleaning' the text and (hopefully) making it as noise free as possible.

1. Noise removal
2. Lexicon normalization
3. Object standardization
4. Grammar check
5. Spellcheck

### Noise Removal

Stopwords are generally the most common words in a language, such as 'the', 'is', 'at', 'a', and 'which'. Noise can also include hashtags, shortened URLs, or unwanted ASCII characters.

In [1]:
def removeNoise(s):
    """Takes a string and removes the noise."""
    
    noise = ['the', 'is', 'at', 'a', 'which']

    a = []
    
    for index, value in enumerate(s.split()):
        if value not in noise:
            a.append(value)
            
    return " ".join(a)

removeNoise('Steve is at the bar.')

'Steve bar.'

We'd also like to remove punctuation. We can do that using regex:

In [2]:
import re

def removeNoise(s):
    """Takes a string and removes the noise."""
    
    noise = ['the', 'is', 'at', 'a', 'which']

    # Remove every character that is neither a word character nor whitespace (i.e. punctuation)
    s = re.sub(r'[^\w\s]','',s) 
    
    a = []
    
    for index, value in enumerate(s.split()):
        if value not in noise:
            a.append(value)
    
    
    return a

removeNoise('Steve is at the bar.')

['Steve', 'bar']
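
Hashtags and shortened URLs can be stripped in a similar way. The following is a minimal sketch, assuming URLs start with `http://` or `https://` and hashtags consist of `#` followed by word characters; the function name `strip_social_noise` is hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")   # assumption: URLs carry an explicit scheme
HASHTAG_PATTERN = re.compile(r"#\w+")       # assumption: hashtags are '#' plus word characters

def strip_social_noise(s):
    """Remove URLs and hashtags, then collapse the leftover whitespace."""
    s = URL_PATTERN.sub("", s)
    s = HASHTAG_PATTERN.sub("", s)
    return " ".join(s.split())

print(strip_social_noise("Great talk today #nlp https://tinyurl.com/abc123 see you"))
```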

### Lexicon Normalization

Adding a suffix to a word does not change its root meaning. For example, 'running' and 'runs' both have the root 'run'. Likewise, 'play', 'player', 'played', 'playing', and 'plays' have the same root, 'play'.

- Stemming is the process of stripping suffixes ('ing', 's', 'es', 'ly', 'y', etc.)
- Lemmatization is the process of obtaining the root (lemma) of the word using vocabulary and morphological analysis

We can use the Natural Language Toolkit to carry out these tasks (https://www.nltk.org/)

In [3]:
from nltk.stem.wordnet import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
stemmer = PorterStemmer()


print(stemmer.stem('programming'))
print(lemmatizer.lemmatize('running', 'v')) # second argument is the POS tag: 'a' = adjective, 'v' = verb

program
run
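
To show what plain suffix stripping does on its own, here is a deliberately naive sketch. It is not the Porter algorithm, which applies many more rules; `naive_stem`, `SUFFIXES`, and `min_stem_len` are hypothetical names chosen for illustration.

```python
SUFFIXES = ("ing", "ly", "es", "s", "y")  # checked in this order

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["playing", "plays", "quickly", "running"]:
    print(w, "->", naive_stem(w))
```

Note that 'running' becomes 'runn' rather than 'run': pure suffix stripping does not know about doubled consonants, which is one reason a lemmatizer can give cleaner results.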


### Object Standardization

Sometimes we abbreviate or truncate words and use acronyms. These are not necessarily noise and can be valuable pieces of data.

In [4]:
table = {'wyd':'what are you doing', 'dm':'direct message', 'rt' : 'retweet'}
# We can continue to append words to our dictionary via table['key'] = 'value'

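def standardizeObjects(s):
    """Takes a string and expands the abbreviations found in `table`."""
    
    a = []
    
    for value in s.split():
        # Look up the lowercased token so 'DM' and 'dm' both match;
        # tokens that are not in the table are kept unchanged
        a.append(table.get(value.lower(), value))
        
    return " ".join(a)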
            


standardizeObjects('wyd Charlie, dm me')

'what are you doing Charlie, direct message me'
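
Splitting on whitespace misses abbreviations that have punctuation attached (for example 'dm,'). One possible alternative, sketched here under the assumption that abbreviations should only match as whole words, uses a regular expression with word boundaries. The name `standardize_objects` is hypothetical, and `table` is repeated so the block runs on its own.

```python
import re

table = {'wyd': 'what are you doing', 'dm': 'direct message', 'rt': 'retweet'}

# Match any key of the table as a whole word, ignoring case
pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b", re.IGNORECASE)

def standardize_objects(s):
    """Replace whole-word abbreviations with their expansions."""
    return pattern.sub(lambda m: table[m.group(0).lower()], s)

print(standardize_objects("RT this, then dm, not admin"))
```

The word boundaries keep the 'dm' inside 'admin' from being expanded.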

## Feature Engineering - Text Data

Depending upon the usage of preprocessed data, text features can be constructed using various techniques:
1. Syntactical Parsing
2. Entities or N-grams
3. Word-Based Features
4. Statistical Features
5. Word Embeddings

### Syntactical Parsing

Syntactical parsing involves analyzing the words in a sentence for their grammar and observing how they are arranged in a sentence. This involves:
1. Dependency Grammar
2. Part of Speech Tags

- Dependency Trees - Display the relationships among words in a sentence
- Part of Speech Tagging - Defines the usage and function of a word in a sentence (nouns, verbs, adjectives, adverbs, etc.)

In [5]:
from nltk import word_tokenize, pos_tag

tokens = word_tokenize("Today I went to the gym.")
print(pos_tag(tokens))

[('Today', 'NN'), ('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('gym', 'NN'), ('.', '.')]


Some words have multiple meanings according to their usage:
- I was stuck in a traffic jam
- I like jam on my toast

The Lesk algorithm can be used for this kind of word sense disambiguation (https://en.wikipedia.org/wiki/Lesk_algorithm)
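
The core idea of Lesk is to pick the sense whose dictionary gloss shares the most words with the surrounding context. The sketch below is a simplified version of that idea; the glosses in `jam_senses` are written by hand for illustration (a real implementation would pull them from a dictionary such as WordNet), and `simplified_lesk` is a hypothetical name.

```python
def simplified_lesk(context_sentence, sense_glosses):
    """Return the sense whose gloss overlaps most with the context words."""
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hand-written glosses, for illustration only
jam_senses = {
    "traffic_jam": "a line of vehicles stuck in traffic and unable to move",
    "fruit_jam": "a sweet spread made from fruit and sugar eaten on toast or bread",
}

print(simplified_lesk("I was stuck in a traffic jam", jam_senses))
print(simplified_lesk("I like jam on my toast", jam_senses))
```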

#### Improving Word-Based Features

When words are used as features, attaching POS tags lets a learning model distinguish the different contexts in which a word is used.

"Book my flight, I will read this book."

Tokens: (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1) 

Tokens with POS: (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
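
The two feature sets above can be reproduced with a `Counter`. This sketch assumes the tokens have already been tagged; the `tagged` list is typed in by hand to match the example rather than produced by a tagger.

```python
from collections import Counter

# Hand-tagged tokens for "Book my flight, I will read this book."
tagged = [("book", "VB"), ("my", "PRP$"), ("flight", "NN"), ("I", "PRP"),
          ("will", "MD"), ("read", "VB"), ("this", "DT"), ("book", "NN")]

plain_features = Counter(word for word, _ in tagged)
pos_features = Counter(f"{word}_{tag}" for word, tag in tagged)

print(plain_features["book"])                            # both uses collapse into one feature
print(pos_features["book_VB"], pos_features["book_NN"])  # verb and noun are kept apart
```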

#### Normalization and Lemmatization
Part of Speech (POS) tags are the basis of lemmatization for converting a word to its base form (lemma).

#### Efficient Stopword Removal
POS tags are also useful in the removal of stopwords.

### Entity Extraction (Entities as Features)

Entities are defined as the most important chunks of a sentence: noun phrases, verb phrases, or both. Entity detection algorithms are generally ensemble models of rule-based parsing, dictionary lookups, POS tagging, and dependency parsing.

The application of entity detection can be seen in automated chat bots, content analyzers, and consumer insights.

"At the W party Thursday night at Chateau Marmont, Cate Blanchett barely made it up the elevator."

From here we can see that:
- Date: Thursday
- Time: night
- Location: Chateau Marmont
- Person: Cate Blanchett

#### Named Entity Recognition (NER)

The process of detecting named entities such as person names, location names, and company names in text is called NER.

"Sergey Brin, the manager of Google Inc. is walking in the streets of New York."

- Named Entities:  
    - (“person” : “Sergey Brin” ), (“org” : “Google Inc.”), (“location” : “New York”)
    

#### Noun Phrase Identification
These steps deal with extracting all the noun phrases from a text using dependency parsing and part of speech tagging.

#### Phrase Classification
This is the classification step in which all the extracted noun phrases are classified into their respective categories (locations, names, etc.).

The Google Maps API provides a good way to disambiguate locations, and DBpedia can be used to extract data from Wikipedia (persons/orgs). We can also create lookup tables and dictionaries by combining information from different sources.
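
The lookup-table approach can be sketched in a few lines. The `GAZETTEER` dictionary below is a hypothetical, hand-built table covering only the NER example sentence, and `lookup_entities` is an illustrative name, not part of any library.

```python
# Hypothetical lookup table; in practice it would be built from several sources
GAZETTEER = {
    "Sergey Brin": "person",
    "Google Inc.": "org",
    "New York": "location",
}

def lookup_entities(text, gazetteer=GAZETTEER):
    """Return (label, name) pairs for every gazetteer entry found in text, in order of appearance."""
    found = []
    for name, label in gazetteer.items():
        position = text.find(name)
        if position != -1:
            found.append((position, label, name))
    return [(label, name) for _, label, name in sorted(found)]

sentence = "Sergey Brin, the manager of Google Inc. is walking in the streets of New York."
print(lookup_entities(sentence))
```

Exact string lookup cannot handle unseen names or ambiguous ones, which is why it is usually combined with the other techniques listed above.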

#### Entity Disambiguation
Sometimes entities will be misclassified, hence creating a validation layer on top of the results is very useful.

### Topic Modeling

Topic modeling is the process of automatically identifying the topics present in a text corpus. It derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as 'a repeating pattern of co-occurring terms in a corpus'.

Given a topic of 'Healthcare', a good topic model result set could be:
    - 'health', 'doctor', 'patient', 'hospital'

The topic of 'Farming' could yield:
    - 'farm', 'crops', 'wheat'
    

Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques. (https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

In [6]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above. 
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results 
print(ldamodel.print_topics())

[(0, '0.029*"driving" + 0.029*"My" + 0.029*"my" + 0.029*"sister" + 0.029*"to" + 0.029*"and" + 0.029*"stress" + 0.029*"Doctors" + 0.029*"suggest" + 0.029*"may"'), (1, '0.053*"driving" + 0.053*"sister" + 0.053*"My" + 0.053*"my" + 0.053*"to" + 0.053*"time" + 0.053*"spends" + 0.053*"practice." + 0.053*"a" + 0.053*"father"'), (2, '0.063*"to" + 0.036*"but" + 0.036*"father." + 0.036*"have" + 0.036*"not" + 0.036*"likes" + 0.036*"sugar," + 0.036*"Sugar" + 0.036*"consume." + 0.036*"bad"')]
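
The topics above contain tokens such as "My", "my", and "practice." because the documents were only split on whitespace. A cleaning step applied before building the dictionary would likely give more meaningful terms. The sketch below is one possible version; `clean_document` and the small `STOPWORDS` set are hypothetical and chosen just for these three documents.

```python
import re

# Small hand-picked stopword list (an assumption, not a standard list)
STOPWORDS = {"is", "to", "my", "but", "not", "a", "of", "that", "and", "may", "the", "have"}

def clean_document(doc):
    """Lowercase, strip punctuation, and drop stopwords."""
    tokens = re.sub(r"[^\w\s]", "", doc.lower()).split()
    return [t for t in tokens if t not in STOPWORDS]

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
print(clean_document(doc1))
```

Using `[clean_document(doc) for doc in doc_complete]` in place of the plain `split()` would feed these cleaned tokens to the model.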


### N-Grams as Features

A combination of N consecutive words is called an N-gram. N-grams with N > 1 are often more informative features than single words (unigrams) because they capture some local context. Bigrams (N = 2) are a common choice.

In [7]:
def generate_ngrams(text, n):
    words = text.split()
    output = []  
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

generate_ngrams('this is a sample text', 2)

[['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]
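
To use N-grams as features, they are typically counted. A minimal sketch, assuming whitespace tokenization and lowercasing (the name `ngram_counts` is hypothetical):

```python
from collections import Counter

def ngram_counts(text, n):
    """Count each n-gram in text, joining its words with a space."""
    words = text.lower().split()
    # zip over n shifted copies of the word list to form consecutive n-grams
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

counts = ngram_counts("the cat sat on the mat the cat slept", 2)
print(counts.most_common(1))
```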

### Statistical Features

Text data can also be quantified directly into numbers.


#### Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF converts the text into vector models based on the occurrence of words in the document without taking into consideration the exact ordering.

- Term Frequency (TF): Defined as the count of a term 'T' in a document 'D'.
- Inverse Document Frequency (IDF): Logarithmic ratio of total documents available in the corpus and number of documents containing term 'T'

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

obj = TfidfVectorizer()

corpus = ['This is sample document.', 'another random document.', 'third sample document text']

X = obj.fit_transform(corpus)
print(X)

  (0, 7)	0.5844829010200651
  (0, 2)	0.5844829010200651
  (0, 4)	0.444514311537431
  (0, 1)	0.34520501686496574
  (1, 1)	0.3853716274664007
  (1, 0)	0.652490884512534
  (1, 3)	0.652490884512534
  (2, 4)	0.444514311537431
  (2, 1)	0.34520501686496574
  (2, 6)	0.5844829010200651
  (2, 5)	0.5844829010200651
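
To make the two definitions concrete, the sketch below computes TF-IDF by hand using the plain formula `tf * log(N / df)`. It will not reproduce the numbers above, because scikit-learn's `TfidfVectorizer` by default smooths the IDF term and L2-normalizes each row; the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({term: count * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return scores

corpus = ['this is sample document', 'another random document', 'third sample document text']
scores = tf_idf(corpus)
print(scores[0])
```

A term that appears in every document ('document') gets a score of 0, while a term unique to one document ('this') gets the highest weight.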


#### Count/Density/Readability Features

Count or Density based features can also be used in models and analysis. 
1. Word Count
2. Sentence Count
3. Punctuation Count
4. Industry Specific Words

Textstat can be used to create such features (https://github.com/shivam5992/textstat)
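
The simpler counts can also be computed with the standard library alone. This is a rough sketch that assumes sentences end with '.', '!', or '?'; `count_features` is a hypothetical helper, not part of Textstat.

```python
import re
import string

def count_features(text):
    """Compute basic count features for a piece of text."""
    words = text.split()
    # Assumption: a sentence ends at one or more of '.', '!' or '?'
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "punctuation_count": punctuation,
    }

print(count_features("Today I went to the gym. It was busy!"))
```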


#### Word Embedding (Text Vectors)

Word embedding is a modern way of representing words as vectors. The aim of word embedding is to map high-dimensional word features to low-dimensional feature vectors while preserving contextual similarity in the corpus. Embeddings are widely used in deep learning models such as Convolutional Neural Networks and Recurrent Neural Networks.

Word2Vec and GloVe are two popular models for creating word embeddings from text. These models take a text corpus as input and produce word vectors as output.

Word2Vec (https://code.google.com/archive/p/word2vec/)
GloVe (https://nlp.stanford.edu/projects/glove/)

The Word2Vec model is composed of a preprocessing module and two shallow neural network architectures: Continuous Bag of Words (CBOW) and skip-gram. It first constructs a vocabulary from the training corpus and then learns word embedding representations, which are widely used as input features for other NLP tasks.
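
The skip-gram network is trained on (target, context) word pairs taken from a sliding window over each sentence. The sketch below only generates those pairs; it does not train anything, and `skipgram_pairs` and its `window` parameter are hypothetical names used for illustration.

```python
def skipgram_pairs(sentence, window=1):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

print(skipgram_pairs(['google', 'science', 'data', 'analytics'], window=1))
```

CBOW uses the same windows in the opposite direction: it predicts the target word from its context words.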

In [9]:
from gensim.models import Word2Vec
sentences = [['data', 'science'], ['google', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus  
model = Word2Vec(sentences, min_count = 1)

print(model.wv.similarity('data', 'science'))

-0.010777194



### Important Tasks of NLP

#### Text Classification

Text classification is one of the classical problems of NLP.
- Identifying email spam
- News article classifications

Text classification is defined as a technique to systematically classify a text object (document or sentence) into one of a fixed set of categories. It is especially helpful when the amount of data is large.

A typical natural language classifier consists of two parts: 
1. Training
2. Prediction

In [10]:
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus) 
print(model.classify("Their codes are amazing."))
print(model.accuracy(test_corpus))

Class_A
0.8333333333333334
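
To show what the training and prediction steps involve, here is a from-scratch sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is a simplification and not how TextBlob implements its classifier; the class `TinyNaiveBayes` and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    def fit(self, samples):
        """Training: count documents per class and words per class."""
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in samples:
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        """Prediction: pick the class with the highest log-probability."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Laplace (add-one) smoothing avoids zero probabilities
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

toy_training = [("I love this burger", "pos"), ("what an awesome view", "pos"),
                ("I do not like this dish", "neg"), ("my management is poor", "neg")]
toy_model = TinyNaiveBayes().fit(toy_training)
print(toy_model.predict("awesome burger"), toy_model.predict("poor dish"))
```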


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm 

# preparing data for SVM model (using the same training_corpus, test_corpus from naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = [] 
test_labels = [] 
for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

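# Create the TF-IDF vectorizer
# (min_df=4 keeps only terms found in at least 4 training documents,
# which leaves very few features on a corpus this small)
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
# Learn the vocabulary from the training data and vectorize it
train_vectors = vectorizer.fit_transform(train_data)
# Vectorize the test data with the same vocabulary
test_vectors = vectorizer.transform(test_data)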

# Perform classification with SVM, kernel=linear 
model = svm.SVC(kernel='linear') 
model.fit(train_vectors, train_labels) 
prediction = model.predict(test_vectors)

print(prediction)
print('\n')
print(classification_report(test_labels, prediction))

['Class_A' 'Class_A' 'Class_B' 'Class_B' 'Class_A' 'Class_A']


              precision    recall  f1-score   support

     Class_A       0.50      0.67      0.57         3
     Class_B       0.50      0.33      0.40         3

    accuracy                           0.50         6
   macro avg       0.50      0.50      0.49         6
weighted avg       0.50      0.50      0.49         6



### Text Matching / Similarity

One of the more important areas of NLP is the matching of text objects to find similarities. Important applications of text matching include automatic spelling correction, data de-duplication, genome analysis, etc.

#### Levenshtein Distance
Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with allowable edit operations being insertion, deletion, or substitution of a single character.

In [12]:
def levenshtein(s1,s2): 
    if len(s1) > len(s2):
        s1,s2 = s2,s1 
    distances = range(len(s1) + 1) 
    for index2,char2 in enumerate(s2):
        newDistances = [index2+1]
        for index1,char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1]) 
            else:
                newDistances.append(1 + min((distances[index1], distances[index1+1], newDistances[-1]))) 
        distances = newDistances 
    return distances[-1]

print(levenshtein("analyze", "analyse"))

1


#### Phonetic Matching
A phonetic matching algorithm takes a keyword as input (person's name, location, etc.) and produces a character string that identifies a set of words that are (roughly) phonetically similar. It is very useful for searching large text corpora, correcting spelling errors, and matching relevant names.

Soundex and Metaphone are two of the main phonetic algorithms used for this purpose. The Python module Fuzzy can be used to compute Soundex strings for different words.
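
For reference, the Soundex idea can be sketched without any third-party package. The function below follows the common American Soundex rules (keep the first letter, encode consonants as digits, collapse repeated codes, pad to four characters); it is an illustrative implementation, not the code used by the Fuzzy module.

```python
# Map each consonant to its Soundex digit; vowels, h, w and y have no code
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character Soundex code of a name."""
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    encoded = name[0].upper()
    previous = CODES.get(name[0], "")
    for ch in name[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        # 'h' and 'w' do not separate letters that share a code; vowels do
        if ch not in "hw":
            previous = code
    return (encoded + "000")[:4]

print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))
```

'Robert' and 'Rupert' receive the same code, which is what makes them match phonetically.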

#### Flexible String Matching
A complete text matching system includes different algorithms pipelined together to handle a variety of text variations. Regular expressions are really helpful for this purpose as well. Other common techniques include exact string matching, lemmatized matching, and compact matching (which takes care of spaces, punctuation, slang words, etc.).

#### Cosine Similarity
When text is represented as vectors, cosine similarity can be applied to measure how similar two vectors are.

In [13]:
import math
from collections import Counter
def get_cosine(vec1, vec2):
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()]) 
    sum2 = sum([vec2[x]**2 for x in vec2.keys()]) 
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
   
    if not denominator:
        return 0.0 
    else:
        return float(numerator) / denominator

def text_to_vector(text): 
    words = text.split() 
    return Counter(words)

text1 = 'This is an article on analytics vidhya' 
text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1) 
vector2 = text_to_vector(text2) 
cosine = get_cosine(vector1, vector2)
print('{:.3f}'.format(cosine))

0.630


### Important Libraries for NLP


1. Scikit-learn: Machine learning in Python
2. Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
3. Pattern – A web mining module for Python with tools for NLP and machine learning.
4. TextBlob – Easy-to-use NLP tools API, built on top of NLTK and Pattern.
5. spaCy – Industrial-strength NLP with Python and Cython.
6. Gensim – Topic Modelling for Humans
7. Stanford Core NLP – NLP services and packages by Stanford NLP Group.
