# Natural Language Processing (NLP) 


spaCy is a leading NLP library for Python.  
- It relies on trained models that are language-specific and come in different sizes (for example, `en_core_web_sm` is the small English model). 

In [7]:
import spacy


In [8]:
nlp = spacy.load('en_core_web_sm')  # load the small English language model

In [16]:
doc = nlp("Tea is healthy and I love cats and I don't like people messing with my kitties! ") # process text like this

## Tokenizing  
A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". 


In [17]:
for token in doc:
    print(token)

Tea
is
healthy
and
I
love
cats
and
I
do
n't
like
people
messing
with
my
kitties
!
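
The contraction split described above comes from the tokenizer's rules rather than from the trained model. As a minimal sketch, assuming spaCy is installed, a blank English pipeline (tokenizer only) is enough to see it. The name `blank_nlp` and the sample sentence are illustrative and not part of the original notebook.

```python
import spacy

# A blank English pipeline has only the rule-based tokenizer,
# so no trained model needs to be downloaded for this check.
blank_nlp = spacy.blank("en")

sample = blank_nlp("I don't like Mondays.")
tokens = [token.text for token in sample]
print(tokens)  # "don't" should appear as the two tokens "do" and "n't"
```

The rest of this notebook still needs the trained `en_core_web_sm` model, since lemmas are not produced by the tokenizer alone.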


Iterating through a document gives you token objects, and each token comes with additional information.   
Two attributes that matter in many cases are `token.lemma_` and `token.is_stop`.  

## Text Preprocessing  
Preprocessing the text can improve how a model handles words.  
The first technique is **lemmatizing**. The **lemma** of a word is its base form. For example, *walk* is the lemma of *walking*, so lemmatizing *walking* converts it to *walk*.  
 
Another common technique is to remove *stopwords*. These are words that occur frequently in the language and don't contain much information. In English these include "the", "is", "and", "but", and "not".  
  
Using spaCy, `token.lemma_` returns the lemma, while `token.is_stop` returns a boolean `True` if the token is a stopword.  
  
    
Removing stopwords might help the predictive model focus on relevant words. Lemmatizing similarly helps by combining multiple forms of the same word into one base form.  
  
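However, lemmatizing and dropping stopwords can also lower performance, because both steps discard information (dropping "not", for instance, can change the meaning of a sentence). Treat this preprocessing as part of your hyperparameter optimization process.  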
  
 



In [18]:
print("Token \t\tLemma \t\tStopword")  # plain string: nothing to interpolate in the header
print("-"*40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
I		I		True
love		love		False
cats		cat		False
and		and		True
I		I		True
do		do		True
n't		n't		True
like		like		False
people		people		False
messing		mess		False
with		with		True
my		my		True
kitties		kitty		False
!		!		False
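
Since lemmatizing and stopword removal are best treated as tunable choices, one way to make them easy to switch on and off is a small helper with two flags. This is only a sketch: the function name `preprocess`, its parameters, and the decision to always drop punctuation and lowercase the result are assumptions for illustration, not something the original notebook defines. It only relies on the `text`, `lemma_`, `is_stop`, and `is_punct` attributes that spaCy tokens expose.

```python
from typing import Iterable, List


def preprocess(tokens: Iterable, lemmatize: bool = True,
               drop_stopwords: bool = True) -> List[str]:
    """Turn spaCy-style tokens into a list of lowercased terms.

    Hypothetical helper: each token must expose .text, .lemma_,
    .is_stop and .is_punct, as spaCy Token objects do.
    """
    terms = []
    for token in tokens:
        if token.is_punct:
            continue  # punctuation is always dropped in this sketch
        if drop_stopwords and token.is_stop:
            continue  # skip stopwords only when the flag is on
        term = token.lemma_ if lemmatize else token.text
        terms.append(term.lower())
    return terms
```

Going by the lemma and stopword columns printed above, `preprocess(doc)` should keep `tea`, `healthy`, `love`, `cat`, `like`, `people`, `mess`, and `kitty`. Calling `preprocess(doc, lemmatize=False, drop_stopwords=False)` instead keeps every word in its original (lowercased) form, so the two flags can be compared like any other hyperparameter.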


# Reference
https://www.kaggle.com/matleonard/intro-to-nlp  