# NLP
 Book Reference: Practical Natural Language Processing - June 2020

 <img src="images/book.png">

<img src="images/h1.png">

<img src="images/h2.png">

NLP project pipeline

<img src="images/Pipeline_NLP.png">

## 1. Data Acquisition
a) Use a public dataset  
b) Scrape data  
c) Product intervention  
d) Data augmentation
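
As a rough illustration of option d), the sketch below generates new training sentences by swapping one word at a time for a synonym. The `SYNONYMS` table and the `augment` helper are hypothetical names invented for this example; a real project would more likely draw synonyms from a lexical resource or use a dedicated augmentation library.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {"bites": ["nips"], "man": ["person"], "eats": ["consumes"]}


def augment(sentence):
    """Return variants of `sentence` with exactly one word replaced by a synonym."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            variants.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return variants


print(augment("dog bites man"))
# ['dog nips man', 'dog bites person']
```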


## 2. Text Extraction and Cleanup
Text extraction and cleanup is the process of extracting raw text from the input data by removing all non-textual information, such as markup and metadata, and converting the text to the required encoding format.
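
For input that arrives as HTML, the cleanup step might look like the sketch below. It uses only the Python standard library; the `TextExtractor` class and `extract_text` function are illustrative names, not part of the book's code, and production scrapers usually rely on a dedicated HTML parsing library instead.

```python
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(raw_bytes, encoding="utf-8"):
    # Decode bytes to str, strip the markup, then normalize Unicode and whitespace.
    html = raw_bytes.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFKC", " ".join(parser.parts))
    return " ".join(text.split())


sample = (b"<html><head><style>p {color: red}</style></head>"
          b"<body><p>Dog bites&nbsp;man.</p><script>var x = 1;</script></body></html>")
print(extract_text(sample))
# Dog bites man.
```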


<img src="images/text_topy.png">

In [3]:
from PIL import Image
from pytesseract import image_to_string
filename = "images/text_topy.png"
text = image_to_string(Image.open(filename))
print(text)

 

 

In the nineteenth century the only kind of linguistics considered
seriously was this comparative and historical study of words in languages
known or believed to be raguate say the Semitic languages, ot the Indo-
European languages. It is significant that the Germans who really made
the subject what it was, used the term Indo-germanisch. ‘Those who know
the popular works of Otto Jespersen will remember how firmly he
declares that linguistic science is historical. And those who have noticed

 

 



## 3. Pre-Processing


a) Preliminaries: Sentence segmentation and word tokenization.  
b) Frequent steps: Stop word removal, stemming and lemmatization, removing digits/punctuation,
lowercasing, etc.  
c) Other steps: Normalization, language detection, code mixing, transliteration, etc.  
d) Advanced processing: POS tagging, parsing, coreference resolution, etc.  
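
The frequent steps in b) can be sketched with the standard library alone. The stop word set here is a tiny hand-picked assumption, and `preprocess` is an illustrative helper; the NLTK-based cells later in these notes show the usual tooling.

```python
import re
import string

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "on", "and", "to"}


def preprocess(text):
    """Lowercase, drop digits and punctuation, split on whitespace, remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(preprocess("The 2 dogs bit a man in the park, on Monday!"))
# ['dogs', 'bit', 'man', 'park', 'monday']
```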

### Sentence Segmentation

In [7]:
from nltk.tokenize import sent_tokenize, word_tokenize
mytext = """In the previous chapter, we saw examples of some common NLP
applications that we might encounter in everyday life. If we were asked to
build such an application, think about how we would approach doing so at our
organization. We would normally walk through the requirements and break the
problem down into several sub-problems, then try to develop a step-by-step
procedure to solve them. Since language processing is involved, we would also
list all the forms of text processing needed at each step. This step-by-step
processing of text is known as pipeline. It is the series of steps involved in
building any NLP model. These steps are common in every NLP project, so it
makes sense to study them in this chapter. Understanding some common procedures
in any NLP pipeline will enable us to get started on any NLP problem encountered
in the workplace. Laying out and developing a text-processing pipeline is seen
as a starting point for any NLP application development process. In this
chapter, we will learn about the various steps involved and how they play
important roles in solving the NLP problem and we’ll see a few guidelines
about when and how to use which step. In later chapters, we’ll discuss
specific pipelines for various NLP tasks (e.g., Chapters 4–7)."""
my_sentences = sent_tokenize(mytext)
print(my_sentences)

['In the previous chapter, we saw examples of some common NLP\napplications that we might encounter in everyday life.', 'If we were asked to\nbuild such an application, think about how we would approach doing so at our\norganization.', 'We would normally walk through the requirements and break the\nproblem down into several sub-problems, then try to develop a step-by-step\nprocedure to solve them.', 'Since language processing is involved, we would also\nlist all the forms of text processing needed at each step.', 'This step-by-step\nprocessing of text is known as pipeline.', 'It is the series of steps involved in\nbuilding any NLP model.', 'These steps are common in every NLP project, so it\nmakes sense to study them in this chapter.', 'Understanding some common procedures\nin any NLP pipeline will enable us to get started on any NLP problem encountered\nin the workplace.', 'Laying out and developing a text-processing pipeline is seen\nas a starting point for any NLP application develo

In [10]:
for sentence in my_sentences[0:2]:
    print(sentence)
    print(word_tokenize(sentence))

In the previous chapter, we saw examples of some common NLP
applications that we might encounter in everyday life.
['In', 'the', 'previous', 'chapter', ',', 'we', 'saw', 'examples', 'of', 'some', 'common', 'NLP', 'applications', 'that', 'we', 'might', 'encounter', 'in', 'everyday', 'life', '.']
If we were asked to
build such an application, think about how we would approach doing so at our
organization.
['If', 'we', 'were', 'asked', 'to', 'build', 'such', 'an', 'application', ',', 'think', 'about', 'how', 'we', 'would', 'approach', 'doing', 'so', 'at', 'our', 'organization', '.']


### Frequent Steps
Stop word removal: filtering out common words

In [35]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = """This is a sample sentence,
                  showing off the stop words filtration."""
stop_words = set(stopwords.words("english"))
word_tokens = word_tokenize(example_sent)

# The comparison is case-sensitive, so "This" is kept;
# compare w.lower() instead to also drop capitalized stop words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)


['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In [36]:
example_sent = """Esta es una muestra de stop words"""
stop_words = set(stopwords.words("spanish"))
word_tokens = word_tokenize(example_sent)

# The comparison is case-sensitive, so "Esta" is kept;
# compare w.lower() instead to also drop capitalized stop words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

['Esta', 'es', 'una', 'muestra', 'de', 'stop', 'words']
['Esta', 'muestra', 'stop', 'words']


In [37]:
print(set(stopwords.words('spanish')))

{'habríais', 'quien', 'estuvisteis', 'tuvierais', 'fuisteis', 'eres', 'un', 'tuvieses', 'tendremos', 'en', 'tendrían', 'sí', 'y', 'otra', 'tuvimos', 'habíamos', 'hubisteis', 'hubo', 'sentidas', 'contra', 'de', 'le', 'fuerais', 'hubimos', 'tenidos', 'siente', 'estará', 'tengamos', 'estando', 'tuviesen', 'hubieron', 'otro', 'fueran', 'míos', 'otras', 'sentida', 'ellas', 'las', 'tendrá', 'del', 'esos', 'suya', 'tenga', 'otros', 'teníais', 'poco', 'tuviésemos', 'hemos', 'seas', 'tendré', 'tu', 'están', 'estás', 'desde', 'vosotros', 'hayáis', 'estéis', 'estuviesen', 'mío', 'seamos', 'la', 'estuviésemos', 'estaba', 'nuestras', 'estaría', 'mucho', 'una', 'mis', 'tuvieran', 'tuvieseis', 'han', 'eras', 'fuésemos', 'tenían', 'estén', 'estuviese', 'hubieras', 'yo', 'habéis', 'lo', 'qué', 'donde', 'estaríamos', 'sentidos', 'esa', 'ellos', 'estado', 'éramos', 'todo', 'estuviera', 'fueseis', 'todos', 'sean', 'tienen', 'estemos', 'estuviste', 'estad', 'e', 'son', 'serías', 'estáis', 'seréis', 'hubiér

### Stemming and Lemmatization

Stemming: the process of removing suffixes to reduce a word to some base form, which need not be a valid word.  

Lemmatization: the process of mapping all the different forms of a word to its base word, or lemma. It involves linguistic analysis, so it takes longer to run than stemming, and it is an optional step.

<img src="images/stem_lemma.png">

In [43]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"he was better meeting")
for token in doc:
    print(f'{token}-->{token.lemma_}')

he-->he
was-->be
better-->well
meeting-->meeting


In [50]:
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

words = ["fishing", "fishes", "fished"]
for word in words:
    print(f"word = {word}")
    print(f"stemmed_word = {stemmer.stem(word)}")
    print(f"lemma = {lemmatizer.lemmatize(word)}")
    print("")

word = fishing
stemmed_word = fish
lemma = fishing

word = fishes
stemmed_word = fish
lemma = fish

word = fished
stemmed_word = fish
lemma = fished



In [40]:
stemmer = SnowballStemmer("spanish")

words = ["pensar", "pescando", "pasado"]
for word in words:
    print(f"word = {word}")
    print(f"stemmed_word = {stemmer.stem(word)}")
    #print(f"lemma = {lemmatizer.lemmatize(word)}")
    print("")

word = pensar
stemmed_word = pens

word = pescando
stemmed_word = pesc

word = pasado
stemmed_word = pas



In [52]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("was", pos="v")) # verb: pos defaults to noun, so it must be set explicitly

be


In [54]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a")) # adjective

good


In [56]:
import spacy
sp = spacy.load('en_core_web_sm')
token = sp(u'better')
for word in token:
    print(word.text, word.lemma_)

better well


<img src="images/preproc_steps.png">

## Other Pre-Processing Steps

- Text normalization: convert digits to text, expand abbreviations.  
- Language detection: We can use the library Polyglot to detect the language of the text.
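
A minimal sketch of text normalization, assuming small hand-written lookup tables: `ABBREVIATIONS`, `DIGITS`, and `normalize` are hypothetical names for this example, and digits are spelled out one by one rather than converted to full number words.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"u": "you", "pls": "please", "thx": "thanks"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


def normalize(text):
    """Lowercase the text, expand known abbreviations, and spell out ASCII digits."""
    tokens = []
    for tok in re.findall(r"\w+|[^\w\s]", text.lower()):
        if tok in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[tok])
        elif tok.isascii() and tok.isdigit():
            # Spell out each digit; full number-to-words conversion is out of scope.
            tokens.extend(DIGITS[d] for d in tok)
        else:
            tokens.append(tok)
    return " ".join(tokens)


print(normalize("Pls call me at 9 thx"))
# please call me at nine thanks
```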

## Advanced Processing
- POS: Part of Speech Tagging

In [16]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Charles Spencer Chaplin was born on 16 April 1889 toHannah Chaplin (born Hannah Harriet Pedlingham Hill) and Charles Chaplin Sr')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.shape_, token.is_alpha, token.is_stop)

Charles Charles PROPN Xxxxx True False
Spencer Spencer PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
was be AUX xxx True True
born bear VERB xxxx True False
on on ADP xx True True
16 16 NUM dd False False
April April PROPN Xxxxx True False
1889 1889 NUM dddd False False
toHannah toHannah PROPN xxXxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
( ( PUNCT ( False False
born bear VERB xxxx True False
Hannah Hannah PROPN Xxxxx True False
Harriet Harriet PROPN Xxxxx True False
Pedlingham Pedlingham PROPN Xxxxx True False
Hill Hill PROPN Xxxx True False
) ) PUNCT ) False False
and and CCONJ xxx True True
Charles Charles PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
Sr Sr PROPN Xx True False


<img src="images/Preproc_pipeline.png">

## 4. Feature Engineering or Feature Extraction

The goal of feature engineering is to capture the characteristics of the text in a numeric
vector that ML algorithms can understand. This step is also called "text representation".

- Classical NLP/ML pipeline: features are designed by hand, e.g., counting the words in a review for a sentiment analysis task.
- DL pipeline: uses word embeddings (vector representations of words); the resulting vectors are difficult to interpret.
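
To make the classical pipeline concrete, the sketch below turns a review into a two-element feature vector of positive and negative word counts. The `POSITIVE` and `NEGATIVE` word sets are invented for this example; a real system would use a curated sentiment lexicon.

```python
from collections import Counter

# Hypothetical mini-lexicons; a real system would use a curated resource.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def sentiment_features(review):
    """Return [positive_word_count, negative_word_count] for a review."""
    counts = Counter(review.lower().split())
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return [pos, neg]


print(sentiment_features("great plot great cast but terrible ending"))
# [2, 1]
```

Each number has a direct meaning, which is what makes such features easy to interpret compared with learned embeddings.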

<img src="images/feat_engineering.png">

## Text Representation
Transform a given text into numerical form so that it can be fed into NLP and ML algorithms.  
- Converting images and sound to numeric representations is straightforward.  
- Converting text to numbers is not straightforward. The approaches fall into four categories:
1. Basic vectorization approaches  
2. Distributed representations  
3. Universal language representation  
4. Handcrafted features  


## Vector Space Models


<img src="images/vector_space.png">


### 1. Basic Vectorization Approaches:

### One hot Encoding

In [3]:
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [4]:
#Build the vocabulary
vocab = {}
count = 0
for doc in processed_docs:
    for word in doc.split():
        if word not in vocab:
            count = count +1
            vocab[word] = count
print(vocab)

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}


In [5]:

#Get one hot representation for any string based on this vocabulary. 
#If the word exists in the vocabulary, its representation is returned. 
#If not, a list of zeroes is returned for that word. 
def get_onehot_vector(somestring):
    onehot_encoded = []
    for word in somestring.split():
        temp = [0]*len(vocab)
        if word in vocab:
            temp[vocab[word]-1] = 1 # -1 is to take care of the fact indexing in array starts from 0 and not 1
        onehot_encoded.append(temp)
    return onehot_encoded

In [17]:
print(processed_docs[0])
onehot = get_onehot_vector(processed_docs[0]) # one-hot representation for a text from our corpus
print(onehot)
print(len(onehot[0]))

dog bites man
[[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
6


In [18]:
get_onehot_vector("man and dog are good")

[[0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0]]

### Disadvantages
- The size of a one-hot vector is directly proportional to the size of the vocabulary.  
- This representation does not give a fixed-length representation for text: a text with n words yields n vectors.  
- It treats words as atomic units and has no notion of (dis)similarity between
words, so it is very poor at capturing the meaning of a word in relation to other words.  
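
The last two points can be checked directly. The sketch below rebuilds a small vocabulary (with 0-based indices, unlike the cell above) and shows that the number of vectors depends on the text length and that any two distinct words have a dot product of zero. `one_hot`, `encode`, and `dot` are illustrative helpers.

```python
vocab = {"dog": 0, "bites": 1, "man": 2, "eats": 3, "meat": 4, "food": 5}


def one_hot(word):
    """One-hot vector for a single in-vocabulary word."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def encode(text):
    """One vector per word, so the output length varies with the text."""
    return [one_hot(w) for w in text.split()]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


print(len(encode("dog bites man")), len(encode("man eats")))  # 3 2
print(dot(one_hot("meat"), one_hot("food")))  # 0: related words look unrelated
print(dot(one_hot("meat"), one_hot("man")))   # 0: same score as an unrelated pair
print(dot(one_hot("meat"), one_hot("meat")))  # 1
```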

### Bag of Words
See 2_Bag_of_Words.ipynb

### Advantages
- Like one-hot encoding, BoW is fairly simple to understand and implement.  
- The vector space resulting from the BoW scheme captures the semantic similarity of documents. So if two documents have similar vocabulary, they’ll be closer to each other in the vector space and vice versa.
- We have a fixed-length encoding for any sentence of arbitrary length.

### Disadvantages
- The size of the vector increases with the size of the vocabulary.  
- It does not capture the similarity between different words that mean the same
thing (e.g., “I run” and “I ran”).
- As the name indicates, it is a “bag” of words: word order information is lost in
this representation.
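
The fixed-length property and the loss of word order can both be seen in a few lines. This is a plain-Python sketch using `collections.Counter`, not the code from the referenced notebook; `bow_vector` is an illustrative helper.

```python
from collections import Counter

documents = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Sorted vocabulary: ['bites', 'dog', 'eats', 'food', 'man', 'meat']
vocab = sorted({word for doc in documents for word in doc.split()})


def bow_vector(text):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


print(bow_vector("dog bites man"))            # [1, 1, 0, 0, 1, 0]
print(bow_vector("man bites dog"))            # [1, 1, 0, 0, 1, 0] (same vector, order is lost)
print(bow_vector("dog and dog are friends"))  # [0, 2, 0, 0, 0, 0] (still six entries)
```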

### 2. Distributed Representations
These methods gained momentum in the six to seven years leading up to 2020. They use neural network architectures to create dense, low-dimensional representations of words and texts.

Key ideas:
- Distributional representation: BoW and one-hot vectors use high-dimensional, sparse vectors to represent words.
- Distributed representation: Word2Vec and GloVe use low-dimensional, dense vectors.
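
The sketch below shows why dense vectors are useful: related words can end up close together under cosine similarity. The three-dimensional vectors in `dense` are made-up numbers for illustration, not values from Word2Vec or GloVe, whose vectors are learned from a corpus and typically have a few hundred dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical dense vectors, invented for this example.
dense = {
    "meat": [0.8, 0.1, 0.3],
    "food": [0.7, 0.2, 0.4],
    "dog": [-0.5, 0.9, 0.0],
}

print(round(cosine(dense["meat"], dense["food"]), 2))  # close to 1: similar words
print(round(cosine(dense["meat"], dense["dog"]), 2))   # much lower: dissimilar words
```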


### 3. Universal Text Representations (State of the art)
Contextual word representations are used to obtain word vectors:  
- Example: the word "bank" should get a different vector in "river bank" than in "bank account".  

Neural architectures such as recurrent neural networks (RNNs) and transformers
were used to develop large-scale language models (ELMo, BERT), which
can be used as pre-trained models to get text representations by applying transfer learning.

- Vector embeddings with Transformer architectures (BERT)  


<img src="images/hf_vector1.png">


Each word of a sentence contributes to the vectorization of a word.  
<img src="images/hf_vector2.png">
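
The idea that every word in the sentence contributes to a word's vector can be illustrated with a toy attention step: the vector for a target word becomes a weighted average of all word vectors in the sentence, with weights given by a softmax over dot products. This is a heavily simplified sketch with invented two-dimensional embeddings (`EMB`); real transformers such as BERT use learned query, key, and value projections, multiple heads, and many layers.

```python
import math

# Hypothetical 2-d static embeddings, invented for illustration.
EMB = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextual_vector(sentence, target):
    """Attention-weighted average of all word vectors, seen from `target`."""
    words = sentence.split()
    query = EMB[target]
    scores = [sum(q * k for q, k in zip(query, EMB[w])) for w in words]
    weights = softmax(scores)
    return [sum(wt * EMB[w][i] for wt, w in zip(weights, words))
            for i in range(len(query))]


print(contextual_vector("river bank", "bank"))  # [0.75, 0.25]
print(contextual_vector("money bank", "bank"))  # [0.25, 0.75]
```

The static embedding of "bank" is the same in both calls, yet the resulting vectors differ because the surrounding words differ.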


An encoder is composed of bidirectional neural networks and attention mechanisms.


<img src="images/hf_encoder.png">


ELMo uses recurrent neural networks (LSTMs) inside its architecture.

<img src="images/elmo_lstm.png">
