# Natural Language Processing (NLP) 


spaCy is a leading NLP library.
- It relies on language-specific models that come in different sizes.

In [7]:
import spacy


In [8]:
nlp = spacy.load('en_core_web_sm')  # load the English language model

In [16]:
doc = nlp("Tea is healthy and I love cats and I don't like people messing with my kitties! ") # process text like this

## Tokenizing  
A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't".


In [17]:
for token in doc:
    print(token)

Tea
is
healthy
and
I
love
cats
and
I
do
n't
like
people
messing
with
my
kitties
!


Iterating through a document gives you token objects, and each token comes with additional information.  
`token.lemma_` and `token.is_stop` are important in many cases.  
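
Beyond the lemma and stopword flag, tokens expose positional and lexical attributes. The sketch below prints a few of them. It assumes a blank English pipeline (`spacy.blank('en')`, tokenizer only) is sufficient, because these particular attributes do not depend on a trained model; the name `nlp_blank` and the sample sentence are illustrative and not part of the original notebook.

```python
import spacy

# Assumption: a tokenizer-only pipeline is enough for these attributes.
nlp_blank = spacy.blank('en')
sample = nlp_blank("I don't like cats!")

for token in sample:
    # i: token index in the doc, idx: character offset in the text,
    # is_alpha / is_punct: lexical flags computed from the token text
    print(token.i, token.idx, token.text, token.is_alpha, token.is_punct, sep='\t')
```

Model-dependent attributes such as `token.lemma_` still require a loaded model like the one used in the cells above.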

## Text Preprocessing  
Preprocessing the text can improve how a model handles words.  
The first technique is **lemmatizing**. The **lemma** of a word is its base form. For example, *walk* is the lemma of *walking*, so lemmatizing *walking* converts it to *walk*.  
 
Another common technique is removing *stopwords*. These are words that occur frequently in the language and don't contain much information. In English these include "the", "is", "and", "but", and "not".  
  
Using spaCy, `token.lemma_` returns the lemma, while `token.is_stop` returns a boolean `True` if the token is a stopword.  
  
    
Removing stopwords might help the predictive model focus on relevant words. Lemmatizing similarly helps by combining multiple forms of the same word into one base form.  
  
However, lemmatizing and dropping stopwords can also lower performance, so treat this preprocessing as part of your hyperparameter optimization process.  
  
 



In [18]:
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
I		I		True
love		love		False
cats		cat		False
and		and		True
I		I		True
do		do		True
n't		n't		True
like		like		False
people		people		False
messing		mess		False
with		with		True
my		my		True
kitties		kitty		False
!		!		False
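
One way to treat this preprocessing as a tunable choice is to put each step behind a flag. The sketch below is a minimal illustration, not the notebook's method: `preprocess` and its parameters are hypothetical names, and the rows are copied by hand from the table above so the example runs without spaCy.

```python
# (text, lemma, is_stop) rows copied from the table above
rows = [
    ("Tea", "tea", False), ("is", "be", True), ("healthy", "healthy", False),
    ("and", "and", True), ("I", "I", True), ("love", "love", False),
    ("cats", "cat", False), ("and", "and", True), ("I", "I", True),
    ("do", "do", True), ("n't", "n't", True), ("like", "like", False),
    ("people", "people", False), ("messing", "mess", False),
    ("with", "with", True), ("my", "my", True), ("kitties", "kitty", False),
    ("!", "!", False),
]

def preprocess(rows, lemmatize=True, drop_stopwords=True):
    """Hypothetical helper: apply each preprocessing step only if its flag is set."""
    kept = []
    for text, lemma, is_stop in rows:
        if drop_stopwords and is_stop:
            continue  # skip stopwords when requested
        kept.append(lemma if lemmatize else text)
    return kept

both = preprocess(rows)
no_stop_removal = preprocess(rows, drop_stopwords=False)
print(both)
# ['tea', 'healthy', 'love', 'cat', 'like', 'people', 'mess', 'kitty', '!']
```

With stopwords dropped, "do" and "n't" disappear, so "don't like" collapses to "like". That is one plausible way this preprocessing could hurt a model, which is why it is worth comparing the flag combinations rather than fixing them up front.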


## Pattern Matching  
Another common NLP task is matching tokens and phrases within chunks of text or whole documents.

In spaCy, to match individual tokens, you create a `Matcher`. When matching a list of terms, it's easier and more efficient to use `PhraseMatcher`. For example, if you want to find where different smartphone models show up in some text, you can create patterns for the model names of interest.    
To start, create the `PhraseMatcher`:

In [19]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

The matcher is created using the vocabulary of your model. Here we're using the small English model you loaded earlier. Setting `attr='LOWER'` matches the phrases on lowercased text, which provides case-insensitive matching.  
  
Next, create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the nlp model. 

In [20]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add('TerminologyList', patterns)

Then you create a document from the text to search and use the phrase matcher to find where the terms occur in the text.

In [21]:
# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.")

matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 17, 19), (3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]


`matches` is a list of tuples, each of the form (match id, start position, end position). The start and end positions are token indices into the document.

In [22]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])

TerminologyList iPhone 11
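
To list every match rather than only the first, loop over the tuples. The sketch below is self-contained and makes two assumptions that differ from the cells above: it uses a blank English pipeline (`spacy.blank('en')`), since matching on `LOWER` only needs the tokenizer, and it builds the patterns with `nlp.make_doc`, which runs the tokenizer without the rest of the pipeline. The names `nlp_blank`, `phrase_matcher`, `review`, and `found` are illustrative, and the `add` call follows the same form the notebook uses.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline stands in for the loaded model.
nlp_blank = spacy.blank('en')
phrase_matcher = PhraseMatcher(nlp_blank.vocab, attr='LOWER')

terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
# make_doc only tokenizes, which is all the patterns require
phrase_matcher.add('TerminologyList', [nlp_blank.make_doc(t) for t in terms])

review = nlp_blank.make_doc(
    "Glowing review overall, and some really interesting side-by-side "
    "photography tests pitting the iPhone 11 Pro against the "
    "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3."
)

# Collect (label, matched text) for every match, ordered by start position
found = [
    (nlp_blank.vocab.strings[match_id], review[start:end].text)
    for match_id, start, end in sorted(phrase_matcher(review), key=lambda m: m[1])
]
for label, span_text in found:
    print(label, span_text)
```

Slicing the document with `review[start:end]` gives a span, and `.text` returns the original casing from the text even though the match itself was made on lowercased tokens.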


---
## Reference
https://www.kaggle.com/matleonard/intro-to-nlp  