# Natural Language Processing Basics
## 1. Tokenization
### Tokenization:
* The spacy library lets you load text and process it
* The nlp object runs a series of processing steps on the text by default
* Tokenization is the first step: it splits the text into tokens (i.e. its constituent parts)
* Certain tokens are kept together rather than fully split out (e.g. U.S.)
* This is due to the tokenization rules and exceptions in the language library that has been loaded
* **Documents** are chunks of text which are to be parsed
* **Modules** (e.g. nlp below) are the objects which let you parse text using their knowledge from the library you've loaded
* **Spans** are sub-sections of documents

In [1]:
# import libraries
import spacy

# load language library (small ENG) into spacy module (i.e. nlp)
nlp = spacy.load('en_core_web_sm')

# process text using nlp object
doc = nlp(u"Tesla isn\'t looking at buying U.S. startup for $6 million")

# iterate through tokens from text
for token in doc:
    # show components
    print(token, token.pos_, token.dep_) # token, part of speech (e.g. verb, noun), dependency label
    
# grab specific token
#doc[3]

# grab span (i.e. sub-section of doc)
# this refers to index of token, not of char
#span = doc[5:15]

Tesla PROPN nsubj
is AUX aux
n't PART neg
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj
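
The commented-out lines in the cell above index into the document. Here is a minimal sketch of what they return, assuming spaCy v3 is installed. `blank_nlp` is a hypothetical name for a tokenizer-only pipeline built with `spacy.blank`, used so the sketch does not depend on a downloaded model (so it produces no part-of-speech or dependency labels).

```python
import spacy

# tokenizer-only English pipeline (assumption: enough for indexing, no trained model)
blank_nlp = spacy.blank("en")
doc = blank_nlp("Tesla isn't looking at buying U.S. startup for $6 million")

# a single token, selected by its token index
token = doc[3]
print(token.text)

# a span is a slice over token indices, not character offsets
span = doc[5:8]
print(span.text)

# the character offsets of a span are available separately
print(span.start_char, span.end_char)
```

Slicing the raw text with `span.start_char` and `span.end_char` gives back the same string as `span.text`.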


### Sentences
* You can split documents into sentences
* You can also check various things such as whether or not a word is the start of a sentence

In [2]:
# load sentence into module
doc = nlp(u"This sentence. Is separate. To this sentence.")

# iterate through sentences
for sentence in doc.sents:
    print(sentence)

This sentence.
Is separate.
To this sentence.


In [3]:
# check if word is start of sentence
print(doc[3])
print(doc[3].is_sent_start)

Is
True


### Pipeline:
* Tokenization is just one of the steps within the NLP pipeline
* You can see that Spacy runs multiple stages when parsing
* We will cover the below stages in more detail later
* You can add in custom pipeline stages if you want as well
* These stages can either be simple rule-based steps or statistically trained models

In [4]:
# show series of operations run when reading text into doc
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1ba4de87be8>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1ba50ac8f48>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1ba50ad7278>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1ba50ad73c8>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1ba50bb2988>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1ba50bb2448>)]
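
As a rough sketch of adding a custom stage, assuming the spaCy v3 `@Language.component` API: the component name `token_counter`, the `n_tokens` key and the `demo_nlp` variable are hypothetical, chosen only for illustration. A blank pipeline is used so nothing needs to be downloaded.

```python
import spacy
from spacy.language import Language

# register a simple rule-based stage under a hypothetical name
@Language.component("token_counter")
def token_counter(doc):
    # store the token count in the doc's free-form user_data dict
    doc.user_data["n_tokens"] = len(doc)
    return doc

demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")               # built-in rule-based sentence splitter
demo_nlp.add_pipe("token_counter", last=True)  # our custom stage runs last

# stage names in the order they run
print(demo_nlp.pipe_names)

doc = demo_nlp("This sentence. Is separate.")
print(doc.user_data["n_tokens"], len(list(doc.sents)))
```

Each stage receives the doc produced by the previous stage and must return it, which is why the function ends with `return doc`.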

### Examples of Tokenization
* Certain tokens are kept together for various reasons
* Below, you can see that times, monetary values, emails etc. are kept together
* Meanwhile, certain forms of punctuation are split out specifically (e.g. !, ., ")

In [5]:
# store text in module
doc = nlp(u'We\'re here to help! Send notes to snail-mail@hotmail.co.uk before 9:00a.m. at a cost of $10.30')

# show all tokens
for token in doc:
    print(token)

We
're
here
to
help
!
Send
notes
to
snail-mail@hotmail.co.uk
before
9:00a.m
.
at
a
cost
of
$
10.30


In [6]:
# check length of document (i.e. token #)
len(doc)

19
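
Each token also carries simple flags describing what it looks like. The sketch below prints a few of them, again assuming spaCy v3 and a tokenizer-only pipeline (`blank_nlp` is a hypothetical name). The sentence is a shortened version of the one above.

```python
import spacy

blank_nlp = spacy.blank("en")
doc = blank_nlp("We're here to help! Send notes to snail-mail@hotmail.co.uk at a cost of $10.30")

# columns: token text, is punctuation, looks like a number, looks like an email
for token in doc:
    print(f"{token.text:<26} {str(token.is_punct):<6} {str(token.like_num):<6} {token.like_email}")

# collect tokens by flag
emails = [token.text for token in doc if token.like_email]
numbers = [token.text for token in doc if token.like_num]
print(emails, numbers)
```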

In [7]:
# check number of lexemes (vocabulary entries) currently stored in the vocab
len(doc.vocab)

# returns the same number as the above, because doc.vocab and nlp.vocab are the same shared object
# the starting size comes from the library you load (i.e. en_core_web_sm)
# and it can grow as new words are processed
len(nlp.vocab)

785

In [8]:
# tokens cannot be reassigned
doc[3] = "new text"

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment

### Named Entities
* Spacy can identify persons, companies, locations etc.
* These are known as named entities
* There are a number of default rules and entities loaded into the Spacy libraries
* You can extract the entities specifically from the tokens (see below)
* Spacy provides a number of attributes for entities
    * Label: the entity type (e.g. organisation, money, country), available as label_
    * Explain: spacy.explain describes what a given label means

In [9]:
# text containing named entities
doc = nlp(u'Apple to build Hong Kong factory for $6 million.')

# iterate through entities and show their labels
for entity in doc.ents:
    print(entity, entity.label_, str(spacy.explain(entity.label_)))

Apple ORG Companies, agencies, institutions, etc.
Hong Kong GPE Countries, cities, states
$6 million MONEY Monetary values, including unit
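
The entities above come from the trained `ner` stage. For comparison, here is a purely rule-based sketch using spaCy v3's built-in `entity_ruler` component on a blank pipeline. The two patterns and the `rule_nlp` name are assumptions made for this example, and this is not how the trained model finds entities.

```python
import spacy

rule_nlp = spacy.blank("en")
ruler = rule_nlp.add_pipe("entity_ruler")

# hand-written patterns: an exact string and a token-level pattern
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "hong"}, {"LOWER": "kong"}]},
])

doc = rule_nlp("Apple to build Hong Kong factory for $6 million.")

# entity text, label, and token indices of the entity span
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)
```

Because no rule covers monetary values here, "$6 million" is not labelled, unlike in the trained model's output.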


### Noun Chunks
* Noun chunks can also be split out from text
* Noun chunks are small sections of text where you have a noun with any attached descriptors
* This helps you extract meaningful phrases, which can be used for things like sentiment analysis or summarising the general meaning of a text

In [10]:
# text with noun chunks
doc = nlp(u'Autonomous cars are an insurance liability for manufacturers')

# iterate through noun chunks
for nc in doc.noun_chunks:
    print(nc)

Autonomous cars
an insurance liability
manufacturers


### Visualizing Tokenization
* You can actually visualize the tokenization steps Spacy is performing
* **displacy** is the library to use for rendering this information
* In the below example we've selected the dependency visualizer style
* It shows you the labelled dependencies between each word and the part of speech label

In [11]:
# load libraries
from spacy import displacy

# text
doc = nlp(u'Apple to build a U.K. factory for $6 million.')

# render the tokenization process
# specify that it's being performed in Jupyter (rather than in a script etc.)
# specify dependency style to display dependencies specifically
# distance lets you control space between tokens
displacy.render(doc, style='dep', jupyter=True, options={'distance':110})

* There are other views, such as the entity view
* This highlights the different entities within your text and colour-codes them for ease of interpretation
* [Spacy Visualization Options](https://spacy.io/usage/visualizers)

In [12]:
# text
doc = nlp(u'Over the last quarter, Apple sold almost 25,000 iPods for a total profit of $25 million.')

# render using entity style
displacy.render(doc, style='ent', jupyter=True)

In [None]:
# text
doc = nlp(u'This is a sentence.')

# you can display your visualizations on a separate server
# this can be useful if you're running a script external to Jupyter
displacy.serve(doc, style='dep');
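
Outside Jupyter, another option is to have `displacy.render` return the markup as a string instead of starting a server. The sketch below assumes spaCy v3 and uses `manual=True`, which accepts a hand-built dictionary rather than a parsed doc. The example text and entity offsets are made up for illustration.

```python
from spacy import displacy

# manual=True lets displacy render a plain dict instead of a Doc
example = {
    "text": "Apple sold almost 25,000 iPods.",
    "ents": [{"start": 0, "end": 5, "label": "ORG"}],  # character offsets
    "title": None,
}

# jupyter=False returns the markup; page=True wraps it in a full HTML page
html = displacy.render(example, style="ent", manual=True, jupyter=False, page=True)
print(len(html))
```

The returned string can then be written to an `.html` file and opened in a browser.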

## 2. Stemming
### Stemming Basics
* Stemming is the process of reducing a word down to its stem (e.g. Caresses > Caress, Meeting > Meet)
* It's quite a crude process as it basically just hacks bits off words to get to the root
* This can result in quite a few errors or misclassifications
* As such, **lemmatization** is the preferred technique (we'll look at this next) but stemming is useful to know for reference
* Stemming essentially applies multiple stages of cropping (based on different rules) to get to the stem of the word
* Porter's Algorithm is one of the most commonly used methods (5 stages of cropping) whilst Snowball is a slightly newer, more efficient method (a.k.a. Porter's 2)
* We will use **NLTK** to implement stemming here as Spacy does not include stemming (it uses lemmatization instead)
* [Why Spacy doesn't use stemming](https://github.com/explosion/spaCy/issues/327)
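
To make "stages of cropping" concrete, here is a deliberately tiny toy stemmer. It is not Porter's algorithm: the two stages and their rules are simplified assumptions, and `toy_stem` is a hypothetical name.

```python
VOWELS = "aeiou"

def toy_stem(word):
    """Crop a word in two rule-based stages (toy illustration only)."""
    word = word.lower()

    # stage 1: plural endings
    if word.endswith("sses"):
        word = word[:-2]          # caresses -> caress
    elif word.endswith("ies"):
        word = word[:-2]          # ponies -> poni
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]          # runs -> run

    # stage 2: -ing / -ed, only if what remains still contains a vowel
    for suffix in ("ing", "ed"):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and any(ch in VOWELS for ch in stem):
            word = stem
            break

    return word

for w in ["caresses", "meeting", "runs", "sing", "nothing"]:
    print(w + " ----> " + toy_stem(w))
```

The last word shows how crude rule-based cropping can be: "nothing" is cut to "noth" because the rule only looks at the suffix.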

In [20]:
# load libraries
import nltk
from nltk.stem.porter import PorterStemmer

# create stemming object
p_stemmer = PorterStemmer()

# list of words
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness'] # mix of verbs and nouns

# show word and stem
for word in words:
    print(word + ' ----> ' + p_stemmer.stem(word))

run ----> run
runner ----> runner
ran ----> ran
runs ----> run
easily ----> easili
fairly ----> fairli
fairness ----> fair


Notes:
* You can see that Porter's method above behaves differently to the Snowball method below
* For both, runner is left unchanged: stemmers only apply suffix rules and know nothing about parts of speech, and neither rule set crops runner down to run
* Easily is cropped to easili by both methods, whilst fairly becomes fairli in Porter's but fair in Snowball
* Essentially, both methods have different sets of algorithmic rules and will work better/worse than one another in different scenarios

In [21]:
# load libraries
from nltk.stem.snowball import SnowballStemmer

# create stemming object (requires language)
s_stemmer = SnowballStemmer(language='english')

# stem words again
for word in words:
    print(word + ' ----> ' + s_stemmer.stem(word))

run ----> run
runner ----> runner
ran ----> ran
runs ----> run
easily ----> easili
fairly ----> fair
fairness ----> fair


Notes:
* The important thing here is not necessarily the exact root that is determined
* The main thing is that words with the same root (in reality) are given the same exact root by the algorithm
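* For example, in Porter's method above, fairly and fairness receive different roots (fairli and fair) whereas in the Snowball method they are given the same root (fair); note that both methods leave ran as ran, so neither groups it with run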
* Neither of these is necessarily right or wrong, it depends on how you're looking to group similar words
* So knowing the algorithmic steps implemented by your selected method is important to ensure you're grouping words that you want to be seen as having the same root
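
One way to check which words a stemmer treats as equivalent is to group them by stem. The sketch below assumes NLTK is installed. `group_by_stem` and `sample_words` are hypothetical names.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

def group_by_stem(stemmer, words):
    """Map each stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)

sample_words = ["run", "runs", "ran", "fairly", "fairness"]

porter_groups = group_by_stem(PorterStemmer(), sample_words)
snowball_groups = group_by_stem(SnowballStemmer(language="english"), sample_words)

print(porter_groups)
print(snowball_groups)
```

Words that share a dictionary key are the ones the stemmer considers to have the same root.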

In [22]:
# new set of words
words = ['generous', 'generously', 'generate', 'generation']

# stem words again
for word in words:
    print(word + ' ----> ' + s_stemmer.stem(word))

generous ----> generous
generously ----> generous
generate ----> generat
generation ----> generat


## 3. Lemmatization
### Lemmatization Basics
* Lemmatization is a far more advanced method of getting to a word's root than stemming
* Instead of simply cropping parts off a word, it has knowledge of different words and their roots (e.g. running > run)
* It also looks at the context of a word (i.e. words before/after, parts of speech etc.) to determine a word's root
* For example, it understands that meeting (noun) should remain as meeting, whilst meeting (verb) can be rooted to meet
* Below, you can see that lemmatization breaks our text down in a specific way
    * The **lemma** is the root form produced by lemmatization
    * The **lemma_** attribute holds the lemma as readable text
    * Nouns retain their original text (e.g. runner > runner)
    * Whilst verbs are rooted (e.g. running > run)
    * The **lemma** attribute (no underscore) holds the same lemma as a hash value, which is identical wherever that lemma appears
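
The idea that the same word can have different roots depending on its part of speech can be sketched with a plain lookup table. This is a toy illustration only, not how Spacy's lemmatizer is implemented. `LEMMA_TABLE` and `toy_lemma` are hypothetical names.

```python
# hypothetical lookup table keyed by (lowercased word, part of speech)
LEMMA_TABLE = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("runner", "NOUN"): "runner",
}

def toy_lemma(word, pos):
    # fall back to the lowercased word when the pair is unknown
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemma("meeting", "NOUN"))
print(toy_lemma("meeting", "VERB"))
print(toy_lemma("Ran", "VERB"))
```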

In [25]:
# text
doc = nlp(u'I am a runner running in a race because I love to run since I ran today.')

# method to format lemmas
def show_lemmas(text):
    # iterate through tokens
    for token in text:
        # show token text and lemma attributes
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

# show lemmas for text
show_lemmas(doc)

I            PRON   4690420944186131903    I
am           AUX    10382539506755952630   be
a            DET    11901859001352538922   a
runner       NOUN   12640964157389618806   runner
running      VERB   12767647472892411841   run
in           ADP    3002984154512732771    in
a            DET    11901859001352538922   a
race         NOUN   8048469955494714898    race
because      SCONJ  16950148841647037698   because
I            PRON   4690420944186131903    I
love         VERB   3702023516439754181    love
to           PART   3791531372978436496    to
run          VERB   12767647472892411841   run
since        SCONJ  10066841407251338481   since
I            PRON   4690420944186131903    I
ran          VERB   12767647472892411841   run
today        NOUN   11042482332948150395   today
.            PUNCT  12646065887601541794   .
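
The integer column in the output above is the hash of each lemma string, which is why all three rows whose lemma is run show the same number. Here is a small sketch of the mapping between strings and hashes, assuming spaCy v3's `StringStore` (`store` is a hypothetical variable name).

```python
from spacy.strings import StringStore

# a standalone string store containing a single string
store = StringStore(["run"])

# string -> hash
run_hash = store["run"]
print(run_hash)

# hash -> string
print(store[run_hash])
```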


## 4. Stop Words
### Stop Words
* Stop words are words like 'a', 'the' etc.
* Essentially all the really common words which add little or nothing to the meaning of your text
* They can harm your learning models because they add redundant noise to your text data
* Spacy contains ~330 stop words in a **set** which you can access via the below method in order to filter them out of your text
    * **NOTE:** sets are unordered, unindexed collections of unique items
    * They are 1 of 4 built-in python collection types (others being list, tuple and dictionary)
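
The set behaviour described in the note can be seen with plain Python, without Spacy. The tiny stop word set here is a made-up example.

```python
# duplicates are dropped when a set is created
stop_words = {"a", "the", "is", "the"}
print(len(stop_words))

# membership checks are the main use of a set
print("the" in stop_words)

# sets have no positions, so indexing raises an error
try:
    stop_words[0]
except TypeError as err:
    print("sets cannot be indexed:", err)

# filter stop words out of a list of tokens
tokens = ["the", "cat", "is", "asleep"]
content = [t for t in tokens if t not in stop_words]
print(content)
```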

In [31]:
# show all stop words
print(len(nlp.Defaults.stop_words))
print(nlp.Defaults.stop_words)

326
{'part', 'than', 'cannot', 'ourselves', 'these', 'yours', 'besides', 'become', 'elsewhere', 'a', 'same', 'due', 'other', 'through', 'thereby', 'eleven', 'are', 'each', 'back', 'well', 'ca', 'be', 'himself', 'us', 'nowhere', 'between', 'mine', 'move', 'since', 'hence', 'few', 'beyond', 'an', 'behind', 'of', 'below', 'down', 'formerly', 'that', 'n’t', 'should', 'together', 'upon', 'became', 'five', 'toward', 'using', '‘re', 'most', 'whether', 'front', 'wherever', 'therein', 'has', 'six', 'everything', 'her', 'whereafter', 'onto', 'otherwise', 'however', 'least', '’m', 'they', 'ever', 'very', 'less', 'out', 'full', 'when', 'seem', 'nor', 'beside', 'twenty', 'its', 'whatever', 'doing', 'therefore', 'them', 'every', 'put', 'amount', 'after', 'never', 'further', 'though', 'i', 'it', "'re", 'am', 'seems', 'almost', 'although', 'he', 'noone', 'on', 'themselves', 'can', 'somewhere', 'take', 'either', 'others', 'amongst', 'hereupon', 'him', 'no', 'afterwards', 'yet', 'whole', 'the', 'moreove

In [32]:
# check if specific word is a stop word
# looking a word up in the vocab returns its lexeme, which carries context-independent flags such as is_stop
nlp.vocab['is'].is_stop

True

In [33]:
# add specific word to list of stop words
# both lines are required to make it a stop word
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True

# remove words from stop words
nlp.Defaults.stop_words.remove('btw')
nlp.vocab['btw'].is_stop = False
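
In practice the stop word flag is mostly used to filter tokens. Here is a minimal sketch, assuming spaCy v3. A blank English pipeline (`blank_nlp`, a hypothetical name) is enough here because the stop word list ships with the language defaults.

```python
import spacy

blank_nlp = spacy.blank("en")
doc = blank_nlp("the quick fox is on a hill")

# keep only the tokens that are not flagged as stop words
content_tokens = [token.text for token in doc if not token.is_stop]
print(content_tokens)
```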

## 5. Phrase Matching & Vocab
### Rule-Based Matching
* Rule-based matching can be thought of as an advanced form of regular expressions
* Again, you are looking for patterns to match and extract from text
* However, this time you are matching on specific Spacy token attributes (e.g. parts of speech) to help you more accurately match sections of text
* You create a **matcher** object to run the matching
* You define specific patterns (see tables further down for quantifiers and attributes) to match on
* Each pattern is a list of dictionaries, where each dictionary references one token
* [Spacy rule-based matching](https://spacy.io/usage/rule-based-matching)

Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>
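
As a small sketch of combining attributes from the table, the pattern below matches a literal dollar sign followed by a number-like token. It assumes spaCy v3. The rule name `Money` and the variable names are made up for this example.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank("en")
money_matcher = Matcher(blank_nlp.vocab)

# one dictionary per token: exact text '$', then anything that looks like a number
money_matcher.add("Money", [[{"ORTH": "$"}, {"LIKE_NUM": True}]])

doc = blank_nlp("Send notes at a cost of $10.30 or pay $6 later")

# sort matches by start token so the output order is predictable
matches = sorted(money_matcher(doc), key=lambda m: m[1])
amounts = [doc[start:end].text for _, start, end in matches]
print(amounts)
```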

In [52]:
# load libraries
from spacy.matcher import Matcher

# create matcher object (based on loaded language library)
matcher = Matcher(nlp.vocab)

# define pattern to match
# trying to match all of the below cases
# SolarPower, Solar-power, Solar power
pattern1 = [{'LOWER':'solarpower'}] # for all one word, any case (e.g. SolarPower, SOLARPOWER)
pattern2 = [{'LOWER':'solar'}, {'IS_PUNCT':True}, {'LOWER':'power'}] # match hyphenated word (e.g. Solar-power)
pattern3 = [{'LOWER':'solar'}, {'LOWER':'power'}] # match two separate words, any case (e.g. solar power)

# store all patterns in a list
patterns = [pattern1, pattern2, pattern3]

# alternate shorthand for writing above
#patterns = [
#    [{'LOWER':'solarpower'}],
#    [{'LOWER':'solar'}, {'IS_PUNCT':True}, {'LOWER':'power'}],
#    [{'LOWER':'solar'}, {'LOWER':'power'}]
#]

# add patterns to matcher object
# name the specific matcher so you can access it
# add any and all patterns you'd like to match within this matcher
matcher.add('SolarPower', patterns)

# create text
doc = nlp(u'The Solar Power industry continues to grow as solar-power increases. SolarPower is awesome!')

# find matches
found_matches = matcher(doc)


# function to print out match strings
# note: spans are read from the global doc that was matched
def show_matches(matches):
    for match_id, start, stop in matches:
        # get string using match_id
        string_id = nlp.vocab.strings[match_id]

        # extract matched span using start and stop
        span = doc[start:stop]

        # show matched values
        print(match_id, string_id, start, stop, span.text)
        
# show matches (shows unique ID for match, start and stop token index of match)
show_matches(found_matches)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 11 solar-power
8656102463236116519 SolarPower 13 14 SolarPower


Notes:
* The name you give when adding patterns (e.g. SolarPower above) is stored as a hash, and this hash is the match ID returned with each match
* All the patterns added under that name share the same ID
* You can convert the ID back to the name using nlp.vocab.strings
* This is very useful for working out which rule produced a given match

### Operators/Quantifiers
* Just like regex, you can use additional **operators/quantifiers** to enhance your pattern matching
* In the below code we ask our matcher to identify patterns where the words 'solar' and 'power' are separated by any number of punctuation tokens (including none)
* The below table shows other options for operators

<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>

In [54]:
# remove specific matcher pattern(s)
matcher.remove('SolarPower')

# create new patterns
patterns = [
    [{'LOWER':'solarpower'}],
    [{'LOWER':'solar'}, {'IS_PUNCT':True, 'OP':'*'}, {'LOWER':'power'}] # '*' operator captures any amount of punctuation (see table above)
]

# add patterns to matcher
matcher.add('SolarPower', patterns)

# create text
doc = nlp(u'Solar--Power is solarpower yeah!')

# match patterns in text
found_matches = matcher(doc)

# show matches
show_matches(found_matches)

8656102463236116519 SolarPower 0 3 Solar--Power
8656102463236116519 SolarPower 4 5 solarpower
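
The other operators work the same way. As a sketch, assuming spaCy v3, the `?` operator below makes the punctuation token optional, so one pattern covers both the two-word and the hyphenated form. The variable names are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank("en")
optional_matcher = Matcher(blank_nlp.vocab)

# '?' allows zero or one punctuation token between the two words
pattern = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "power"}]
optional_matcher.add("SolarPower", [pattern])

doc = blank_nlp("Solar power and solar-power")

# sort matches by start token so the output order is predictable
matches = sorted(optional_matcher(doc), key=lambda m: m[1])
spans = [doc[start:end].text for _, start, end in matches]
print(spans)
```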


### Phrase Matching
* Besides defining rules for pattern matching, you can look for specific phrases too
* This might be helpful if you're trying to extract a subset of information based on specific criteria
* The matcher object here works in much the same way as the rule-based matcher above

In [71]:
# load libraries
from spacy.matcher import PhraseMatcher

# create matcher object
matcher = PhraseMatcher(nlp.vocab)

# load text file
with open('NLP Course Files/TextFiles/reaganomics.txt') as f:
    doc = nlp(f.read())
    
# create list of phrases to search for
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# process each phrase into spacy
phrase_patterns = [nlp(text) for text in phrase_list]

# load phrases into matcher
matcher.add('EconMatcher', phrase_patterns)

# get matches
found_matches = matcher(doc)

# show matches
show_matches(found_matches)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2987 2991 trickle-down economics
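
By default the phrase matcher compares the exact token text, so capitalised variants are missed. The sketch below assumes spaCy v3 and passes `attr="LOWER"` to make the comparison case-insensitive. The example sentence is made up, and `make_doc` is used to tokenize the phrases without running the rest of the pipeline.

```python
import spacy
from spacy.matcher import PhraseMatcher

blank_nlp = spacy.blank("en")

# compare on the lowercase form of each token instead of the exact text
ci_matcher = PhraseMatcher(blank_nlp.vocab, attr="LOWER")

phrases = ["voodoo economics", "supply-side economics"]
ci_matcher.add("EconMatcher", [blank_nlp.make_doc(p) for p in phrases])

doc = blank_nlp("Critics said Voodoo Economics; fans said Supply-Side Economics.")

# sort matches by start token so the output order is predictable
found = sorted(ci_matcher(doc), key=lambda m: m[1])
texts = [doc[start:end].text for _, start, end in found]
print(texts)
```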
