# Week 7 - Modern Digital Technologies in Text Analysis

# Advanced Natural Language Processing

In this seminar, we cover several advanced NLP techniques and use machine learning algorithms to extract information from text data. We also look at some advanced NLP applications, each with a solution approach and an implementation.

1. Noun Phrase extraction
2. Text similarity
3. Parts of speech tagging
4. Information extraction – NER – Entity recognition
5. Topic modeling
6. Text classification
7. Sentiment analysis
8. Word sense disambiguation
9. Speech recognition and speech to text
10. Text to speech
11. Language detection and translation

Before going further, let's look at the NLP pipeline and life cycle. This course introduces many concepts, and the amount of material can feel overwhelming. To keep things manageable, let's walk through the flow that an NLP solution typically follows.

For example, let’s consider customer sentiment analysis and prediction for a product or brand or service.

- **Define the Problem:** Understand the customer sentiment across the products.
- **Understand the depth and breadth of the problem:** Understand the customer/user sentiments across the product. Why are we doing this? What is the business impact?

- **Data requirement brainstorming:** Have a brainstorming activity to list out all possible data points.
     - All the reviews from customers on e-commerce platforms like Amazon, Flipkart, etc.
     - Emails sent by customers
     - Warranty claim forms
     - Survey data
     - Call center conversations using speech to text
     - Feedback forms
     - Social media data like Twitter, Facebook, and LinkedIn
 
- **Data collection:** We learned different techniques for collecting data from different sources earlier (Seminar: Extracting the Data). Based on the data and the problem, we might have to combine several data collection methods. In this case, we can use web scraping and Twitter APIs.

- **Text Preprocessing:** We know that data won’t always be clean. We need to spend a significant amount of time to process it and extract insight out of it using different methods that we discussed earlier. (Seminar: Exploring and Processing Text Data)

- **Text to feature:** As we discussed, texts are characters and machines will have a tough time understanding them. We have to convert them to features that machines and algorithms can understand using any of the methods we learned in the previous seminar.

- **Machine learning/Deep learning:** Machine learning and deep learning fall under the artificial intelligence umbrella and let systems learn patterns in the data automatically without being explicitly programmed. Most NLP solutions are based on them, and since we have converted text to features, we can use machine learning or deep learning algorithms for goals such as text classification, natural language generation, etc.

- **Insights and deployment:** There is absolutely no use in building NLP solutions without proper insights being communicated to the business. Always take time to connect the dots between model/analysis output and the business, thereby creating the maximum impact.
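The steps above can be sketched end to end on a toy scale. This is only an illustration: the three reviews, the tiny `POSITIVE`/`NEGATIVE` word lists, and all function names are made up for this example, and a simple word-count score stands in for the machine learning step.

```python
import re
from collections import Counter

# Hypothetical data standing in for the "data collection" step
reviews = [
    "Great phone, I love the camera!",
    "Terrible battery. Very bad experience.",
    "The screen is good but the speaker is bad.",
]

# Tiny hand-made word lists (an assumption, not a real sentiment lexicon)
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"terrible", "bad"}

def preprocess(text):
    """Text preprocessing: lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def to_features(tokens):
    """Text to feature: a bag-of-words count of each token."""
    return Counter(tokens)

def score(features):
    """Stand-in for the model: positive minus negative word counts."""
    pos = sum(features[w] for w in POSITIVE)
    neg = sum(features[w] for w in NEGATIVE)
    return pos - neg

def label(s):
    """Map a numeric score to a sentiment label."""
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"

# Insights: one label per review
results = [label(score(to_features(preprocess(r)))) for r in reviews]
for review, sentiment in zip(reviews, results):
    print(sentiment, "|", review)
```

In a real project each function would be replaced by the techniques from the corresponding seminar, but the order of the steps stays the same.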


## 1. Extracting Noun Phrases

Let us extract a noun phrase from the text data (a sentence or the documents).

### Problem

You want to extract a noun phrase.

### Solution

Noun Phrase extraction is important when you want to analyze the “who” in a sentence. Let’s see an example below using TextBlob.

### How It Works

Execute the following code to extract noun phrases.

In [None]:
#Import libraries
import nltk
from textblob import TextBlob

In [None]:
# Extract noun phrases
blob = TextBlob("John is learning natural language processing")

for np in blob.noun_phrases:
    print(np)

In [None]:
blob.noun_phrases
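To build some intuition for what a noun phrase extractor does, here is a minimal sketch that groups runs of adjective and noun tags into phrases. It is not how TextBlob is implemented. The part-of-speech tags are written by hand (a real tagger may assign different ones), and `chunk_noun_phrases` is a hypothetical helper name.

```python
# Hand-tagged tokens (an assumption: a real tagger may assign other tags)
tagged = [
    ("John", "NNP"), ("is", "VBZ"), ("learning", "VBG"),
    ("natural", "JJ"), ("language", "NN"), ("processing", "NN"),
]

def chunk_noun_phrases(tagged_tokens):
    """Group consecutive adjective/noun tokens; keep groups containing a noun."""
    phrases, current = [], []
    # The sentinel at the end flushes the last open group
    for word, tag in tagged_tokens + [("", "END")]:
        if tag.startswith("JJ") or tag.startswith("NN"):
            current.append((word, tag))
            continue
        if any(t.startswith("NN") for _, t in current):
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases

phrases = chunk_noun_phrases(tagged)
print(phrases)
```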

## 2. Finding Similarity Between Texts

We are going to discuss how to find the similarity between two documents or texts. There are many similarity metrics, such as Euclidean, cosine, and Jaccard. Applications of text similarity can be found in areas like spelling correction and data deduplication.

Here are a few of the similarity measures:

- **Cosine similarity:** Calculates the cosine of the angle between the two vectors.
- **Jaccard similarity:** The score is the size of the intersection of the two word sets divided by the size of their union.
- **Jaccard Index:** The same measure expressed as a percentage: (the number in both sets) / (the number in either set) * 100.
- **Levenshtein distance:** Minimal number of insertions, deletions, and replacements required for transforming string “a” into string “b.”
- **Hamming distance:** Number of positions at which the two strings have different symbols. It is defined only for strings of equal length.
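Three of these measures are short enough to write by hand. The sketch below uses only the standard library; the function names are chosen for this example. Jaccard similarity is computed on lowercase word sets, Hamming distance counts the positions at which two equal-length strings differ, and Levenshtein distance uses the usual dynamic programming recurrence.

```python
def jaccard_similarity(a, b):
    """Size of the word intersection divided by the size of the word union."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(jaccard_similarity("I like NLP", "I like advanced NLP"))  # 0.75
print(hamming_distance("karolin", "kathrin"))                   # 3
print(levenshtein_distance("kitten", "sitting"))                # 3
```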

### Problem

You want to find the similarity between texts/documents.

### Solution

The simplest way to do this is by using **cosine** similarity from the sklearn library.

### How It Works

Let’s follow the steps to compute the similarity score between text documents.

### 2.1 Create/read the text data

In [None]:
documents = ("I like NLP",
             "I am exploring NLP",
             "I am a beginner in NLP",
             "I want to learn NLP",
             "I like advanced NLP")

### 2.2 Find the similarity

In [None]:
# Import libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Compute tfidf : feature engineering (refer previous seminar)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
tfidf_matrix.shape

In [None]:
# compute similarity for first sentence with rest of the sentences
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

Looking at the scores, the first and last sentences are more similar to each other than the first sentence is to any of the others, because they share the word “like” in addition to “NLP”.
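To see what the cosine score measures, here is a minimal sketch that computes it by hand on raw word counts. It is an illustration under simplifying assumptions: it uses plain counts and whitespace tokens rather than the TF-IDF weights and tokenizer of `TfidfVectorizer`, so the numbers differ from the sklearn output, and `cosine` is a helper name chosen for this example.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine of the angle between two word-count vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Dot product over the words of the first text (missing words count as 0)
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = ["I like NLP", "I am exploring NLP", "I like advanced NLP"]
scores = [round(cosine(docs[0], d), 3) for d in docs]
print(scores)  # [1.0, 0.577, 0.866]
```

The ranking is the same as before: the sentence that shares the most words with the first one gets the highest score.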

### Phonetic matching

The next kind of similarity checking is phonetic matching, which encodes a word as a short alphanumeric string based on how it sounds, so that words with similar pronunciation receive the same or similar codes. It is very useful for searching large text corpora, correcting spelling errors, and matching relevant names. **Soundex** and **Metaphone** are the two main phonetic algorithms used for this purpose. Libraries such as fuzzy and jellyfish implement them; the code below uses jellyfish.

1. Install and import the library

In [None]:
# Install the libraries (jellyfish provides the soundex function used below)

# !pip install Fuzzy
# !pip install fuzzywuzzy
# !pip install jellyfish

In [None]:
import fuzzy
import jellyfish
from fuzzywuzzy import fuzz

2. Generate the phonetic form

In [None]:
soundex1 = jellyfish.soundex('natural')
soundex1

In [None]:
soundex2 = jellyfish.soundex('natuaral')
soundex2

In [None]:
soundex3 = jellyfish.soundex('language')
soundex3

In [None]:
soundex4 = jellyfish.soundex('processing')
soundex4

Soundex treats “natural” and “natuaral” as the same: the phonetic code for both strings is “N364”. For “language” and “processing”, the codes are “L522” and “P622” respectively.
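The encoding itself is simple enough to sketch by hand. The function below is a minimal version of American Soundex written for illustration, not the jellyfish source code, and `simple_soundex` is a name chosen for this example. It keeps the first letter, maps the following consonants to digits, skips a digit that repeats the previous one, and pads the code to four characters.

```python
# Letter groups and their Soundex digits
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def simple_soundex(word):
    """Return a 4-character Soundex code for an alphabetic word."""
    word = word.upper()
    if not word:
        return ""
    result = word[0]
    last = CODES.get(word[0])       # digit of the previous coded letter
    for letter in word[1:]:
        digit = CODES.get(letter)
        if digit is not None:
            if digit != last:       # skip repeats of the same digit
                result += digit
            last = digit
        elif letter not in "HW":    # vowels reset the repeat check; H and W do not
            last = None
        if len(result) == 4:
            break
    return result.ljust(4, "0")     # pad short codes with zeros

for w in ("natural", "natuaral", "language"):
    print(w, simple_soundex(w))
```

Because vowels are dropped, the extra "a" in the misspelled word does not change the code.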

## 3. Tagging Part of Speech

Part of speech (POS) tagging is another crucial part of natural language processing. It labels each word with its part of speech, such as noun, verb, or adjective. POS tagging is the basis for Named Entity Recognition, Sentiment Analysis, Question Answering, and Word Sense Disambiguation.

### Problem

Tagging the parts of speech for a sentence.

### Solution

There are two ways to build a tagger.

- Rule based - Manually created rules that assign a word to a particular POS.
- Stochastic based - Algorithms that capture the sequence of the words and assign the most probable sequence of tags, for example using hidden Markov models.
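A rule-based tagger can be sketched in a few lines. The rules below are deliberately crude and made up for this example (they are not the rules of any real tagger): each token gets the tag of the first regular expression it matches, and everything else falls back to `NN`.

```python
import re

# Ordered (pattern, tag) rules; the first pattern that matches wins
RULES = [
    (r"^\d+$", "CD"),                       # digits -> cardinal number
    (r"^(the|a|an)$", "DT"),                # articles -> determiner
    (r"^(and|or|but)$", "CC"),              # coordinating conjunctions
    (r"^(will|can|could|should)$", "MD"),   # modal verbs
    (r".*ing$", "VBG"),                     # -ing -> gerund/present participle
    (r".*ed$", "VBD"),                      # -ed -> past tense
    (r".*s$", "NNS"),                       # trailing s -> plural noun
]

def rule_based_tag(tokens):
    """Tag each token with the first matching rule, defaulting to NN."""
    tagged = []
    for token in tokens:
        for pattern, tag in RULES:
            if re.match(pattern, token.lower()):
                tagged.append((token, tag))
                break
        else:
            tagged.append((token, "NN"))  # fallback: treat as a singular noun
    return tagged

tagged = rule_based_tag("the students will learn NLP in 2 months".split())
print(tagged)
```

Note that "learn" and "in" fall through to `NN`, which is wrong. Hand-written rules cannot cover every word, and this is the gap that stochastic taggers fill by learning from tagged text.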

### How It Works

NLTK provides a ready-to-use POS tagging function. `nltk.pos_tag(tokens)` takes a list of word tokens and returns a (word, tag) pair for each one. Use a for loop over the sentences in the document, tokenize each sentence, and tag its words.

### 3.1 Store the text in a variable

In [None]:
text = "I love NLP and I will learn NLP in 2 month"

### 3.2 NLTK for POS

In [None]:
# Import the necessary packages and the stopword list
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

# Split the text into sentences
sentences = sent_tokenize(text)

# Tag every sentence: tokenize, drop stopwords, then run the POS tagger
tags = []
for sentence in sentences:
    words = word_tokenize(sentence)
    words = [w for w in words if w not in stop_words]
    # Collect the tags of all sentences, not only the last one
    tags.extend(nltk.pos_tag(words))

tags

Below are the short forms and explanations of the POS tags. For example, the word `“love”` is tagged VBP, which means verb, non-3rd person singular present.

- CC coordinating conjunction
- CD cardinal digit
- DT determiner
- EX existential there (like: “there is” ... think of it like “there exists”)
- FW foreign word
- IN preposition/subordinating conjunction
- JJ adjective ‘big’
- JJR adjective, comparative ‘bigger’
- JJS adjective, superlative ‘biggest’
- LS list marker 1)
- MD modal could, will
- NN noun, singular ‘desk’
- NNS noun plural ‘desks’
- NNP proper noun, singular ‘Harrison’
- NNPS proper noun, plural ‘Americans’
- PDT predeterminer ‘all the kids’
- POS possessive ending parent’s
- PRP personal pronoun I, he, she
- PRP\$ possessive pronoun my, his, hers
- RB adverb very, silently
- RBR adverb, comparative better
- RBS adverb, superlative best
- RP particle give up
- TO to go ‘to’ the store
- UH interjection
- VB verb, base form take
- VBD verb, past tense took
- VBG verb, gerund/present participle taking
- VBN verb, past participle taken
- VBP verb, non-3rd person sing. present take
- VBZ verb, 3rd person sing. present takes
- WDT wh-determiner which
- WP wh-pronoun who, what
- WP$ possessive wh-pronoun whose
- WRB wh-adverb where, when

## 4. Extract Entities from Text

We are going to discuss how to identify and extract entities from text, a task called Named Entity Recognition (NER). Several libraries perform this task, such as the NLTK chunker, StanfordNER, SpaCy, OpenNLP, and NeuroNER. There are also many APIs, such as WatsonNLU, AlchemyAPI, NERD, Google Cloud NLP API, and more.

### Problem

You want to identify and extract entities from the text.

### Solution

The simplest way to do this is by using the **ne_chunk** from NLTK or **SpaCy**.

### How It Works

Let’s follow the steps in this section to perform NER.

### 4.1 Read/create the text data

In [None]:
# !pip install svgling

In [None]:
sent = "John is studying at Stanford University in California"

### 4.2 Extract the entities

In [None]:
#import libraries
import nltk
from nltk import ne_chunk
from nltk import word_tokenize

#NER
ne_chunk(nltk.pos_tag(word_tokenize(sent)), binary = False)

Here "*John*" is tagged as "*PERSON*", "*Stanford*" as "*ORGANIZATION*", and "*California*" as "*GPE*" (geopolitical entity, i.e. countries, cities, states).

### Using SpaCy

In [None]:
# !pip install spacy

# !python3 -m spacy download en_core_web_sm

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

# Read/create a sentence
doc = nlp(u'Apple is ready to launch new phone worth $10000 in New york time square ')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

According to the output,
* Apple is an organization
* 10000 is a monetary value
* New York is a place

The entities are recognized correctly here, and output of this kind can be used in many NLP applications.
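Each entity printed above comes with a start and an end character offset, so the entity can be located again in the original string. The sketch below imitates that output format with a plain dictionary lookup and one regular expression. It is only an illustration of the `(text, start_char, end_char, label)` structure, not of how SpaCy's statistical model works; the `GAZETTEER` table, the `MONEY_PATTERN` expression, and the `find_entities` function are all hypothetical.

```python
import re

# Hypothetical lookup table of known entity strings and their labels
GAZETTEER = {
    "Apple": "ORG",
    "New York": "GPE",
    "Stanford University": "ORG",
}
# Simple pattern for dollar amounts such as $10000 or $9.99
MONEY_PATTERN = re.compile(r"\$\d+(?:\.\d+)?")

def find_entities(text):
    """Return (text, start_char, end_char, label) tuples sorted by position."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.group(), match.start(), match.end(), label))
    for match in MONEY_PATTERN.finditer(text):
        found.append((match.group(), match.start(), match.end(), "MONEY"))
    return sorted(found, key=lambda item: item[1])

sentence = "Apple is ready to launch a new phone worth $10000 in New York"
entities = find_entities(sentence)
for ent_text, start, end, label in entities:
    print(ent_text, start, end, label)
```

Slicing the sentence with `sentence[start:end]` returns the entity text again, which is handy for highlighting entities or replacing them in place. A lookup table only finds names it already knows, which is why trained models are used in practice.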