# Introduction to Natural Language Processing (NLP) with spaCy



This code is from the NLP course available on Kaggle: https://www.kaggle.com/learn/natural-language-processing.

I wanted to rebuild it from scratch to make sure I fully understand what happens behind the scenes.



# Part 1 - Intro to NLP

In this part, we will focus on three main concepts:
* Tokenizing
* Text Preprocessing
* Pattern Matching

I will use spaCy, one of the most widely used NLP libraries.


## NLP with spaCy

spaCy is an open-source Python library for NLP. To find out more, have a look at its documentation: https://spacy.io/usage.
To install the English language model, you may need to run:
```
python -m spacy download en_core_web_sm
python -m spacy download en
```

First of all, we have to import the library and load the English language model.


In [6]:
import spacy

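# Load the small English model by its full package name
# (the 'en' shortcut only works in spaCy 2.x)
nlp = spacy.load('en_core_web_sm')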

With the model loaded, we can process text:


In [12]:
doc = nlp("This coffe is awful. I shouldn't have taken a big cup!")
doc

This coffe is awful. I shouldn't have taken a big cup!

## Tokenizing

A document is made of tokens. A token is a unit of text, such as a word or a punctuation mark. Iterating over a document yields its tokens in order.

In [13]:
for token in doc:
    print(token)

This
coffe
is
awful
.
I
should
n't
have
taken
a
big
cup
!
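
Tokens are objects rather than plain strings, so they can be inspected further. The following is a minimal sketch, assuming spaCy is installed; it uses `spacy.blank("en")` (a tokenizer-only pipeline, not the model loaded above) so that no model download is needed, and the variable name `token_texts` is only illustrative.

```python
import spacy

# A blank English pipeline contains only the tokenizer,
# so this sketch does not depend on a downloaded model.
nlp = spacy.blank("en")
doc = nlp("I shouldn't have taken a big cup!")

# Each token knows its index in the document, its text,
# and a few lexical flags such as is_punct.
for token in doc:
    print(token.i, token.text, token.is_punct)

# Collect the raw strings when plain text is all that is needed.
token_texts = [token.text for token in doc]

# Tokenization is non-destructive: joining every token with its
# trailing whitespace gives back the original text.
rebuilt = "".join(token.text_with_ws for token in doc)
```

Note that "shouldn't" is split into "should" and "n't", which matches the output shown above.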


## Text preprocessing

Each token carries additional information.
It has a "lemma", which is the base (dictionary) form of the word.
For instance, "speak" is the lemma of "speaking".

A token may also be a stopword: a very common word, such as "the" or "is", that carries little meaning on its own.

In [15]:
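# Header row (a plain string, no f-string needed)
print("Token \t\tLemma \t\t Stopword")
print("-"*40)
# One row per token: its text, its lemma, and whether it is a stopword
for token in doc:
    print(f"{token.text} \t\t{token.lemma_} \t\t{token.is_stop}")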

Token 		Lemma 		 Stopword
----------------------------------------
This 		this 		True
coffe 		coffe 		False
is 		be 		True
awful 		awful 		False
. 		. 		False
I 		-PRON- 		True
should 		should 		True
n't 		not 		True
have 		have 		True
taken 		take 		False
a 		a 		True
big 		big 		False
cup 		cup 		False
! 		! 		False


Knowing whether a token is a stopword and what its lemma is lets you preprocess your text so that it keeps only the most informative content.
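
As an illustration of that idea, here is a minimal sketch of a preprocessing step that drops stopwords and punctuation. It assumes spaCy is installed and uses a blank English pipeline, which has the stopword list but no lemmatizer, so it keeps the lowercased token text; with a trained model loaded, `token.lemma_` could be used in its place. The helper name `content_words` is hypothetical.

```python
import spacy

# Tokenizer and stopword list only; no lemmatizer is available here.
nlp = spacy.blank("en")


def content_words(doc):
    """Return lowercased tokens that are not stopwords, punctuation, or whitespace."""
    return [
        token.lower_
        for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]


doc = nlp("This coffe is awful. I shouldn't have taken a big cup!")
cleaned = content_words(doc)
print(cleaned)
```

What remains are the words that carry the meaning of the review, which is usually what a downstream model should see.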

## Pattern Matching

You will often need to match specific words or phrases in a text. A common tool for that is regular expressions. spaCy makes this easy with its PhraseMatcher, which matches sequences of tokens.
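
For comparison, the regular-expression approach might look like the sketch below. It relies only on Python's standard `re` module; the sample sentence and the variable names are made up for illustration.

```python
import re

terms = ["Galaxy Note", "iPhone 11", "iPhone XS", "Google Pixel"]

# re.escape protects any character that has a special meaning in a regex;
# re.IGNORECASE makes the search case insensitive.
pattern = re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)

text = "I compared the iphone 11 with the Google Pixel."

# Each hit is reported with character offsets into the raw string.
found = [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]
print(found)
```

A regular expression works on raw characters and reports character offsets, whereas the spaCy matcher used next works on tokens and reports token positions.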



In [18]:
from spacy.matcher import PhraseMatcher

# We specify attr='LOWER' to make the matcher case insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

In [20]:
# We define the terms
terms = ["Galaxy Note", "iPhone 11", "iPhone XS", "Google Pixel"]

# We process the terms through spaCy
patterns = [nlp(term) for term in terms]

# We add our terms to the matcher
matcher.add("TerminologyList", patterns)

In [32]:
review = "Glowing review overall, and some really interesting side-by-side  photography tests pitting the iPhone 11 Pro against the  Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3."

doc = nlp(review)

matches = matcher(doc)
print(matches)

[(3766102292120407359, 18, 20), (3766102292120407359, 24, 26), (3766102292120407359, 32, 34), (3766102292120407359, 35, 37)]


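Each result is a tuple holding the match id followed by the start and end positions of the phrase. The positions are token indices and the end index is exclusive, so doc[start:end] returns the matched phrase.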


In [33]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], doc[start:end])

TerminologyList iPhone 11
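
To list every match rather than only the first, one can loop over the results. The sketch below is self-contained and makes a few assumptions: spaCy is installed, its `PhraseMatcher.add` accepts a list of patterns as in the cell above, and a blank English pipeline stands in for the loaded model (the tokenizer is enough for `LOWER` matching). The sample sentence is made up.

```python
import spacy
from spacy.matcher import PhraseMatcher

# The tokenizer alone is enough for matching on the LOWER attribute.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["Galaxy Note", "iPhone 11", "iPhone XS", "Google Pixel"]
# make_doc only tokenizes, which is all the matcher needs for these patterns.
matcher.add("TerminologyList", [nlp.make_doc(term) for term in terms])

doc = nlp("The iPhone 11 Pro was tested against the galaxy note 10 Plus.")

found = []
for match_id, start, end in matcher(doc):
    span = doc[start:end]  # start and end are token indices; end is exclusive
    label = nlp.vocab.strings[match_id]  # turn the hash back into the rule name
    found.append((label, span.text, start, end))

print(found)
```

Because the matcher compares lowercased tokens, "galaxy note" is found even though the term was declared as "Galaxy Note".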
