# "Topic 02: Building an NLP Pipeline(PART-1)"
"Introduction to NLP pipeline"

*   toc: true
*   badges: true 
*  comments: true
*  categories: [nlp-pipeline]
*  hide: true
*  sticky_rank: 2

If we were asked to build an NLP application, think about how we would approach doing so at an organization. We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them. Since language processing is involved, we would also list all the forms of text processing needed at each step. 

 

*This step-by-step processing of text is known as an NLP pipeline. It is the series of steps involved in building any NLP model.*

<img src="my_icons/topic_02.a.1.png">

The key stages in the pipeline are as follows: 

1. Data acquisition 

2. Text cleaning 

3. Pre-processing 

4. Feature engineering 

5. Modeling 

6. Evaluation 

7. Deployment 

8. Monitoring and model updating 



Before we dive into implementing NLP applications, the first and foremost thing is to get a clear picture of the pipeline. Hence, below is a detailed overview of each component in the pipeline.



NOTE: This blog post is divided into two parts. The first part deals with data collection, cleaning, pre-processing and feature engineering, and the second part deals with model building, evaluation, deployment, and monitoring and updating the model.

 

# DATA ACQUISITION

Data plays a major role in the NLP pipeline, so how we collect relevant data for our NLP project is quite important.

Sometimes the data is readily available to us, but sometimes extra effort is needed to collect this precious data.

## 1).Scrape web pages:  
Suppose we want to create an application that summarizes the top news stories in just 100 words. For that, we need to scrape the data from current affairs websites and web pages.

 

## 2).Data Augmentation:
NLP has a bunch of techniques through which we can take a small dataset and use some tricks to create more data. These tricks are also called data augmentation, and they try to exploit language properties to create text that is syntactically similar to source text data. They may appear as hacks, but they work very well in practice. Let’s look at some of them: 

 

## 3).Back translation : 
Let's say we have a sentence S1 in French. We translate it into another language (in this case English), which gives sentence S2. Now we translate S2 back into French, which gives S3. We’ll find that S1 and S3 are very similar in meaning but are slight variations of each other, so we can add S3 to our dataset.

 

## 4).Replacing Entities: 
To create more data, we replace entity names with other entities of the same type. Let's say S1 is "I want to go to New York"; here we can replace New York with another entity name, e.g. New Jersey.
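
As a minimal sketch of entity replacement, the snippet below swaps a known entity for another one of the same type. The `LOCATIONS` list and the `replace_entity` helper are hypothetical names made up for illustration; in a real project the entities would come from a named entity recognizer or a curated list.

```python
import random

# Hypothetical list of interchangeable location entities (illustration only).
LOCATIONS = ["New York", "New Jersey", "Boston", "Chicago"]

def replace_entity(sentence, entity, candidates, rng=random):
    """Return a copy of `sentence` with `entity` swapped for a different candidate."""
    options = [c for c in candidates if c != entity]
    return sentence.replace(entity, rng.choice(options))

s1 = "I want to go to New York"
# A seeded generator keeps the example reproducible.
s2 = replace_entity(s1, "New York", LOCATIONS, random.Random(0))
print(s2)
```

Each call yields a new sentence with the same structure but a different location, which can be added to the dataset alongside the original.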

 
## 5).Synonym Replacement:
Randomly choose “k” words in a sentence that are not stop words. Replace these words with their synonyms. 
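
The idea can be sketched as follows. The stop word set and the synonym table here are tiny, hypothetical stand-ins; a real system would use a proper stop word list and a thesaurus such as WordNet. The function name `synonym_replace` is also just an illustrative choice.

```python
import random

# Toy resources, both hypothetical and deliberately small.
STOP_WORDS = {"i", "the", "a", "is", "to", "and", "of"}
SYNONYMS = {"happy": ["glad", "joyful"], "big": ["large", "huge"], "house": ["home"]}

def synonym_replace(sentence, k, rng=random):
    """Replace up to k non-stop-words that have a known synonym."""
    words = sentence.split()
    # Indices of words that are not stop words and appear in the synonym table
    candidates = [i for i, w in enumerate(words)
                  if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(k, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

original = "the happy family bought a big house"
augmented = synonym_replace(original, k=2, rng=random.Random(42))
print(augmented)
```

Words without an entry in the synonym table are left untouched, so the sketch never replaces more words than it can find synonyms for.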

 
## 6).Bigram flipping: 
Divide the sentence into bigrams. Take one bigram at random and flip it. For example: “I am going to the supermarket.” Here, we take the bigram “going to” and replace it with the flipped one, “to going”, which gives “I am to going the supermarket.”
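
A minimal sketch of bigram flipping, assuming simple whitespace tokenization (the helper name `flip_random_bigram` is made up for this example):

```python
import random

def flip_random_bigram(sentence, rng=random):
    """Swap the two words of one randomly chosen bigram."""
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to flip
    i = rng.randrange(len(words) - 1)  # start index of the chosen bigram
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

s = "I am going to the supermarket"
flipped = flip_random_bigram(s, random.Random(1))
print(flipped)
```

The flipped sentence contains exactly the same words as the source, with only one adjacent pair swapped.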


# TEXT CLEANING 


After collecting the data, it is also important that the data be in a form the computer can understand. The text may contain symbols and words that convey no meaning to the model during training, so we remove them before feeding the text to the model. This process is called text cleaning. The different text cleaning steps are as follows:

## HTML tag cleaning : 

When collecting data, we scrape various web pages. Libraries such as Beautiful Soup and Scrapy provide a range of utilities to parse web pages.
 

In [1]:
from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import re

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
try:
    page = urllib.request.urlopen(url)  # connect to the website
except urllib.error.URLError as err:
    print("An error occurred:", err)
    raise
soup = BeautifulSoup(page, 'html.parser')
# Table-of-contents entries are <li> elements whose class starts with "tocsection-"
regex = re.compile('^tocsection-')
content_lis = soup.find_all('li', attrs={'class': regex})
content = []
for li in content_lis:
    # Keep only the first line of each entry (the section title)
    content.append(li.getText().split('\n')[0])
print(content)

['1 History', '2 Basics', '3 Challenges', '3.1 Reasoning, problem solving', '3.2 Knowledge representation', '3.3 Planning', '3.4 Learning', '3.5 Natural language processing', '3.6 Perception', '3.7 Motion and manipulation', '3.8 Social intelligence', '3.9 General intelligence', '4 Approaches', '4.1 Cybernetics and brain simulation', '4.2 Symbolic', '4.2.1 Cognitive simulation', '4.2.2 Logic-based', '4.2.3 Anti-logic or scruffy', '4.2.4 Knowledge-based', '4.3 Sub-symbolic', '4.3.1 Embodied intelligence', '4.3.2 Computational intelligence and soft computing', '4.4 Statistical', '4.5 Integrating the approaches', '5 Tools', '6 Applications', '7 Philosophy and ethics', '7.1 The limits of artificial general intelligence', '7.2 Ethical machines', '7.2.1 Artificial moral agents', '7.2.2 Machine ethics', '7.2.3 Malevolent and friendly AI', '7.3 Machine consciousness, sentience and mind', '7.3.1 Consciousness', '7.3.2 Computationalism and functionalism', '7.3.3 Strong AI hypothesis', '7.3.4 Robo

## Unicode Normalization:  

While cleaning the data we may also encounter various Unicode characters, including symbols, emojis, and other graphic characters. To handle such non-textual symbols and special characters, we use Unicode normalization, which maps different representations of the same character to one standard form. The text we see must then be converted into some form of binary representation to be stored in a computer. This process is known as text encoding.
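
As a small illustration using only Python's standard `unicodedata` module, the sketch below shows two strings that look identical on screen but are stored differently, and how normalizing both to the NFC form makes them comparable before encoding. The example strings are our own and not from any dataset.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings render the same but do not compare equal.
print(composed == decomposed)

# Normalize both to the same canonical form (NFC).
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)

# Encode the normalized text into bytes for storage.
encoded = nfc_b.encode("utf-8")
print(encoded)
```

Normalizing before encoding means that visually identical strings end up as identical bytes, which matters for later steps such as counting words or looking them up in a vocabulary.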

 



In [2]:
#hide
!pip install emoji

Collecting emoji
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-0.6.0


In [3]:
import emoji
text = emoji.emojize("Python is fun :red_heart:")
print(text)

Python is fun ❤


In [4]:
encoded = text.encode("utf-8")  # UTF-8 bytes of the string, emoji included
print(encoded)

b'Python is fun \xe2\x9d\xa4'


## Spelling Correction:  

The data we have might contain spelling mistakes, caused by fast typing or by the shorthand and slang used on social media such as Twitter. Using such data as-is may hurt our model's predictions, so it is important to handle these errors before feeding the text to the model. We don’t have a robust method to fix this, but we can still make good attempts to mitigate the issue. For example, Microsoft released a REST API that can be used from Python for spell checking.
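
To give a feel for what a spell checker does, here is a rough sketch that does not use the Microsoft API at all. Instead it relies on `difflib.get_close_matches` from Python's standard library and a tiny, hypothetical vocabulary. The helper names `correct_word` and `correct_text` and the cutoff of 0.8 are our own choices for illustration.

```python
import difflib

# Hypothetical toy vocabulary; a real checker would use a large dictionary.
VOCABULARY = ["spelling", "mistake", "language", "processing", "pipeline", "model"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w) for w in text.split())

corrected = correct_text("speling mistke in the langage procesing pipline")
print(corrected)
```

Words that are not close to anything in the vocabulary, such as "in" and "the", are left unchanged. This kind of string similarity ignores context, so it is only a mitigation and not a robust fix.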

 


### System-Specific Error Correction: 

* What if we need to extract data from a PDF? Different PDF documents are encoded differently, and sometimes we may not be able to extract the full text, or the structure of the text may get messed up. There are several libraries, such as PyPDF and PDFMiner, that extract text from PDF documents, but they are far from perfect.

* Another common source of textual data is scanned documents. Text extraction from scanned documents is typically done through optical character recognition (OCR), using libraries such as Tesseract. 

 

 

# PRE-PROCESSING:  

To pre-process your text simply means to bring your text into a form that is predictable and analysable for your task. A task here is a combination of approach and domain. For example, extracting top keywords with TF-IDF (approach) from tweets (domain) is a task.

 

                      Task = approach + domain 
 

One task’s ideal pre-processing can become another task’s worst nightmare. So take note: text pre-processing is not directly transferable from task to task. 

Let’s take a very simple example: say you are trying to discover commonly used words in a news dataset. If your pre-processing step removes stop words because some other task did so, then you are probably going to miss some of the common words, as you have already eliminated them. So really, it’s not a one-size-fits-all approach.

Here are some common pre-processing steps used in NLP software: 

* Preliminaries:  
Sentence segmentation and word tokenization. 

 
* Frequent steps:  
Stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing, etc. 

 

* Advanced processing:  
  POS tagging, parsing, coreference resolution, etc. 

 

## Preliminaries :  

While not all steps will be followed in all the NLP pipelines we encounter, the first two are more or less seen everywhere. Let’s take a look at what each of these steps mean. 

 

An NLP system can analyze text by breaking it into sentences (sentence segmentation) and then further into words (word tokenization).

### SENTENCE SEGMENTATION : 

We could simply divide the text into sentences at each full stop (.). But what happens if the text contains "Dr. Joy" or an ellipsis (...)?
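
To see the problem concretely, here is a small sketch of the naive rule using only the standard `re` module. The example sentence is our own.

```python
import re

text = "Dr. Joy teaches NLP. Her course is fun."

# Naive rule: a sentence ends at every full stop that is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", text)
print(naive_sentences)
# → ['Dr.', 'Joy teaches NLP.', 'Her course is fun.']
```

The abbreviation "Dr." is wrongly treated as the end of a sentence, so the text is split into three pieces instead of two.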

NLP libraries such as NLTK help overcome these issues.

 

In [30]:
#hide
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
import nltk
from nltk.tokenize import sent_tokenize
text = "It's fun to study NLP. Would recommend to all."
print(sent_tokenize(text))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
["It's fun to study NLP.", 'Would recommend to all.']


### WORD TOKENIZATION : 

To tokenize a sentence into words, we can start with a simple rule to split text into words based on the presence of punctuation marks. The NLTK library allows us to do that. 

In [None]:
from nltk.tokenize import word_tokenize
text = "It's fun to study NLP. Would recommend to all."
print(word_tokenize(text))


['It', "'s", 'fun', 'to', 'study', 'NLP', '.', 'Would', 'recommend', 'to', 'all', '.']


## Frequent Steps : 

Some frequent pre-processing steps are:

* Lower casing 
* Removal of Punctuations 
* Removal of Stopwords 
* Removal of Frequent words 
* Stemming 
* Lemmatization 
* Removal of emojis 
* Removal of emoticons 

 
These steps are frequent, but they vary from problem to problem. For example, suppose we want to predict whether a given text belongs to news, music, or some other field. For this problem we cannot simply remove frequent words: a news article might contain the word "news" many times, and that word helps us categorize the article.
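
The first three steps in the list can be sketched with the standard library alone. The stop word set below is a deliberately tiny, hypothetical one (libraries such as NLTK ship much fuller lists), and `basic_preprocess` is just an illustrative name.

```python
import string

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "it"}

def basic_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    text = text.lower()                                                # lower casing
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]      # remove stop words

tokens = basic_preprocess("The Da Vinci Code is an amazing book, and it is full of suspense!")
print(tokens)
# → ['da', 'vinci', 'code', 'amazing', 'book', 'full', 'suspense']
```

Whether each of these steps should be applied at all depends on the task, as discussed above.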

Since this is the first time we encounter the terms stemming and lemmatization, let's clarify what they mean.

Stemming and lemmatization help us obtain the root forms (sometimes called synonyms in a search context) of inflected (derived) words.

 

###  Stemming : 

Stemming is faster because it chops words without knowing the context of the words in the given sentences.

* It is a rule-based approach.
* It is less accurate.
* When converting a word into its root form, stemming may produce a word that does not exist in the language.
* Stemming is preferred when the meaning of the word is not important for the analysis. Example: spam detection.
* For example: Studies => Studi

In [None]:
import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

word_data = "Da Vinci Code is such an amazing book to read. The book is full of suspense and Thriller. One of the best work of Dan Brown."
# First, tokenize the text into words
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the stem of each word
for w in nltk_tokens:
    print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))

Actual: Da  Stem: Da
Actual: Vinci  Stem: vinci
Actual: Code  Stem: code
Actual: is  Stem: is
Actual: such  Stem: such
Actual: an  Stem: an
Actual: amazing  Stem: amaz
Actual: book  Stem: book
Actual: to  Stem: to
Actual: read  Stem: read
Actual: .  Stem: .
Actual: The  Stem: the
Actual: book  Stem: book
Actual: is  Stem: is
Actual: full  Stem: full
Actual: of  Stem: of
Actual: suspense  Stem: suspens
Actual: and  Stem: and
Actual: Thriller  Stem: thriller
Actual: .  Stem: .
Actual: One  Stem: one
Actual: of  Stem: of
Actual: the  Stem: the
Actual: best  Stem: best
Actual: work  Stem: work
Actual: of  Stem: of
Actual: Dan  Stem: dan
Actual: Brown  Stem: brown
Actual: .  Stem: .


### Lemmatization :  

Lemmatization is slower than stemming, but it takes the context of the word into account before proceeding.

* It is a dictionary-based approach.
* It is more accurate than stemming.
* Lemmatization always returns a dictionary word when converting a word into its root form.
* Lemmatization is recommended when the meaning of the word is important for the analysis. Example: question answering.
* For example: “Studies” => “Study”

 

In [19]:
#hide
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [20]:

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "Da Vinci Code is such an amazing book to read. The book is full of suspense and Thriller. One of the best work of Dan Brown."
# Tokenize the text into words, then look up the lemma of each word
nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
    print("Actual: %s  Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))

Actual: Da  Lemma: Da
Actual: Vinci  Lemma: Vinci
Actual: Code  Lemma: Code
Actual: is  Lemma: is
Actual: such  Lemma: such
Actual: an  Lemma: an
Actual: amazing  Lemma: amazing
Actual: book  Lemma: book
Actual: to  Lemma: to
Actual: read  Lemma: read
Actual: .  Lemma: .
Actual: The  Lemma: The
Actual: book  Lemma: book
Actual: is  Lemma: is
Actual: full  Lemma: full
Actual: of  Lemma: of
Actual: suspense  Lemma: suspense
Actual: and  Lemma: and
Actual: Thriller  Lemma: Thriller
Actual: .  Lemma: .
Actual: One  Lemma: One
Actual: of  Lemma: of
Actual: the  Lemma: the
Actual: best  Lemma: best
Actual: work  Lemma: work
Actual: of  Lemma: of
Actual: Dan  Lemma: Dan
Actual: Brown  Lemma: Brown
Actual: .  Lemma: .


Some other pre-processing steps that are not that common are: 

### Text Normalization: 

Text normalization is the process of transforming a text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good”, their canonical form. Another example is mapping near-identical words such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.
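
One simple way to do this is a lookup table, sketched below. The `CANONICAL` table and the `normalize_text` helper are hypothetical and only cover the examples mentioned above; a real table would be built for the specific dataset.

```python
# Hypothetical lookup table mapping variants to their canonical form.
CANONICAL = {"gooood": "good", "gud": "good", "stop-words": "stopwords"}

def normalize_text(text):
    """Lowercase the text and map known variants to their canonical form."""
    # Handle the multi-word variant first, since it spans two tokens.
    text = text.lower().replace("stop words", "stopwords")
    tokens = [CANONICAL.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize_text("This movie is gooood and the music is gud"))
print(normalize_text("Remove stop-words and stop words"))
```

Tokens that are not in the table pass through unchanged, so the table can grow gradually as new variants are found.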

 

### Language Detection: 

What happens if our text is in a language other than English? Our whole pipeline expects English text, so we need to detect the language before building the rest of the pipeline.


## Advanced Processing

### POS Tagging : 
Imagine we’re asked to develop a system to identify person and organization names in our company’s collection of one million documents. The common pre-processing steps we discussed earlier may not be relevant in this context. Identifying names requires us to be able to do POS tagging, as identifying proper nouns can be useful in identifying person and organization names. Pre-trained and readily usable POS taggers are implemented in NLP libraries such as NLTK, spaCy and Parsey McParseface Tagger. 

 

In [12]:
tokens = nltk.word_tokenize("The quick brown fox jumps over a lazy dog")
print("Part of speech",nltk.pos_tag(tokens))

Part of speech [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('a', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


### Parse Tree: 

Now, if we have to find the relationship between a person and an organization, we need to analyze the sentence in depth, and for that the parse tree plays a major role. A parse tree is a tree representation of the different syntactic categories of a sentence. It helps us understand the syntactic structure of a sentence.

 

In [21]:
#hide
nltk.download('punkt') 
nltk.download('averaged_perceptron_tagger') 


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [22]:

from nltk import pos_tag, word_tokenize, RegexpParser

# Example text
sample_text = "The quick brown fox jumps over the lazy dog"

# Tag each token in the sentence with its part of speech
tagged = pos_tag(word_tokenize(sample_text))

# Define a chunk grammar over the POS tags
chunker = RegexpParser("""
                       NP: {<DT>?<JJ>*<NN>}    #To extract Noun Phrases
                       P: {<IN>}               #To extract Prepositions
                       V: {<V.*>}              #To extract Verbs
                       PP: {<P> <NP>}          #To extract Prepositional Phrases
                       VP: {<V> <NP|PP>*}      #To extract Verb Phrases
                       """)

# Parse the tagged sentence into a chunk tree and print it
output = chunker.parse(tagged)
print("After Extracting\n", output)

After Extracting
 (S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  (VP (V jumps/VBZ) (PP (P over/IN) (NP the/DT lazy/JJ dog/NN))))


### Coreference Resolution :

Coreference resolution is the task of finding all expressions that refer to the same entity in a text.

<img src="my_icons/topic_02.a.2.png">

# FEATURE ENGINEERING: 

When we use ML methods to perform our modeling step later, we’ll still need a way to feed this pre-processed text into an ML algorithm. Feature engineering refers to the set of methods that will accomplish this task. It’s also referred to as feature extraction. The goal of feature engineering is to capture the characteristics of the text into a numeric vector that can be understood by the ML algorithms. 

Two different approaches are taken in practice for feature engineering: one in the classical NLP and traditional ML pipeline, and one in the DL pipeline.

## Classical NLP and traditional ML pipeline

Feature engineering is an integral step in any ML pipeline. Feature engineering steps convert the raw data into a format that can be consumed by a machine. These transformation functions are usually handcrafted in the classical ML pipeline, aligning to the task at hand. For example, imagine a task of sentiment classification on product reviews in e-commerce. One way to convert the reviews into meaningful “numbers” that helps predict the reviews’ sentiments (positive or negative) would be to count the number of positive and negative words in each review. There are statistical measures for understanding if a feature is useful for a task or not. 

One of the advantages of handcrafted features is that the model remains interpretable—it’s possible to quantify exactly how much each feature is influencing the model prediction. 
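
The sentiment example above can be sketched as a handcrafted feature extractor. The two word lists are tiny and hypothetical, and `sentiment_features` is an illustrative name; a real system would use a proper sentiment lexicon and better tokenization.

```python
# Hypothetical, tiny sentiment lexicons (illustration only).
POSITIVE_WORDS = {"good", "great", "love", "excellent", "amazing"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible", "broken"}

def sentiment_features(review):
    """Map a review to a numeric vector: [positive word count, negative word count]."""
    tokens = review.lower().split()
    n_pos = sum(tok in POSITIVE_WORDS for tok in tokens)
    n_neg = sum(tok in NEGATIVE_WORDS for tok in tokens)
    return [n_pos, n_neg]

reviews = [
    "great phone and amazing battery",
    "terrible screen and poor sound but good price",
]
features = [sentiment_features(r) for r in reviews]
print(features)
# → [[2, 0], [1, 2]]
```

Each review becomes a two-dimensional vector that an ML algorithm can consume, and because each dimension has a clear meaning, it stays easy to see how a feature influences the prediction.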

 

## DL pipeline 

In the DL pipeline, the raw data (after pre-processing) is fed directly to a model. The model is capable of “learning” features from the data. Hence, these features are more in line with the task at hand, so they generally give improved performance. But since all these features are learned via model parameters, the model loses interpretability.

 

 

# RECAP: 

The first step in the process of developing any NLP system is to collect data relevant to the given task. Even if we’re building a rule-based system, we still need some data to design and test our rules. The data we get is seldom clean, and this is where text cleaning comes into play. After cleaning, text data often has a lot of variations and needs to be converted into a canonical (standard, pre-defined) form. This is done in the pre-processing step. This is followed by feature engineering, where we carve out indicators that are most suitable for the task at hand.

These indicators/features are converted into a format that is understandable by modeling algorithms.  


{{ 'Notes are compiled from [ Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems](https://www.oreilly.com/library/view/practical-natural-language/9781492054047/), [GeeksforGeeks](https://www.geeksforgeeks.org/syntax-tree-natural-language-processing/), [tutorialspoint-Stemming and Lemmatization](https://www.tutorialspoint.com/python_data_science/python_stemming_and_lemmatization.htm), [gfg-parse tree](https://www.geeksforgeeks.org/syntax-tree-natural-language-processing/), [morioh](https://morioh.com/p/a7b8982e5a5a) and [Medium-Tokenization and Parts of Speech(POS) Tagging in Python’s NLTK library](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b)' | fndetail: 2 }}
{{ 'If you face any problem or have any feedback/suggestions feel free to comment.' | fndetail: 3 }}