# Natural Language Processing Demystified | Preprocessing
## Sources
Largely based on
https://nlpdemystified.org<br>
https://github.com/nitinpunjabi/nlp-demystified

# Installation and dependencies

### spaCy upgrade and package installation

At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy.
<br><br>
**IMPORTANT**<br>
If you're running this in the cloud rather than on a local Jupyter server, the notebook will **time out** after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the required statistical models.
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

In [1]:
!pip install -U spacy

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.

In [2]:
!python3 -m spacy info

spaCy version    3.4.1                         
Location         /Users/jospolfliet/Library/Python/3.8/lib/python/site-packages/spacy
Platform         macOS-12.5.1-arm64-arm-64bit  
Python version   3.8.9                         
Pipelines        en_core_web_sm (3.4.0)        



In [3]:
import spacy

After importing spaCy, the next thing we need to do is load a suitable statistical model for our project. spaCy offers a variety of models for different languages. These models help with tokenization, part-of-speech tagging, named entity recognition, and more.

Here, we're loading the **en_core_web_sm** model which is the smallest English model spaCy offers and is a good starting point for NLP tasks.<br>
https://spacy.io/models/en#en_core_web_sm

Since we upgraded spaCy, we'll need to download the statistical model as well.

In [4]:
!python3 -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
nlp = spacy.load('en_core_web_sm')
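
Because a cloud runtime can reset and lose the downloaded model, it can help to fail softly when the model is missing. The following is a minimal sketch, not part of the original notebook: `load_pipeline` is a hypothetical helper name, and falling back to a blank English pipeline is an assumption made for illustration (a blank pipeline only tokenizes).

```python
import spacy


def load_pipeline(name: str = "en_core_web_sm"):
    """Hypothetical helper: load a trained pipeline or fall back to a blank one."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed,
        # for example after a Colab runtime reset.
        print(f"'{name}' is not installed; run: python -m spacy download {name}")
        # Assumption for this sketch: a blank English pipeline is an acceptable
        # stand-in. It tokenizes but has no tagger, parser, or NER.
        return spacy.blank("en")


nlp = load_pipeline()
print(nlp.lang, nlp.pipe_names)
```

If the fallback is used, attributes such as `lemma_` and `ent_type_` will be empty, so the later sections still need the real model.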

**en_core_web_sm** is trained on OntoNotes 5 which is an annotated corpus comprising news, blogs, transcripts, etc. Put simply, this means a bunch of documents were labelled with information such as how each sentence should be parsed, whether a particular word is a noun or adjective or other part-of-speech, whether a word is a special entity like a person or a real-world organization, and other language-related labels. A statistical model was then generated from these labelled documents.<br>
https://catalog.ldc.upenn.edu/LDC2013T19
<br><br>
You can learn more about the available spaCy models at these links:<br>
https://spacy.io/models<br>
https://spacy.io/usage/models

After loading the model, the _nlp_ variable now references a **Language** class instance which contains language-specific rules for various tasks (e.g. tokenization) and a processing pipeline.<br>
https://spacy.io/api/language

In [6]:
type(nlp) 

spacy.lang.en.English

# Tokenization

Course module for this demo:
https://www.nlpdemystified.org/course/tokenization


In [None]:
# Re-import spaCy and reload the model so you can start here if you have already completed the installation steps.
import spacy
nlp = spacy.load('en_core_web_sm')

### Tokenization with spaCy

We pass whatever text we want to process to _nlp_, which returns a **Doc** container object containing the tokenized text and a number of annotations for each token. These annotations are discussed in follow-up videos. You can learn more about the **Doc** object here:<br>
https://spacy.io/api/doc

In [50]:
# Sample sentence.
s = "It's not about the money (only $20.15), it's about sending a message :). 🚀💎🙌"
doc = nlp(s)

We can iterate over this **Doc** object and view the tokens.

In [51]:
print([t.text for t in doc])

['It', "'s", 'not', 'about', 'the', 'money', '(', 'only', '$', '20.15', ')', ',', 'it', "'s", 'about', 'sending', 'a', 'message', ':)', '.', '🚀', '💎', '🙌']


Note how:
- `It's` is split into `It` and `'s`.
- The currency symbol `$` and the amount `20.15` are separate tokens.
- The `.` inside `20.15` is not split off, because it is an indivisible part of the token.
- The period at the end of the sentence is its own token.
- The emoticon `:)` is kept as a single token.
- Each emoji is its own token.

The **Doc** object can be indexed and sliced like a regular list. The **Doc** object contains **Token** and **Span** objects, which offer different views into the text.

In [21]:
# We can view an individual token by indexing into the Doc object.
print(doc[9])

20.15


In [22]:
# A Doc object is a container of other objects, namely Token and Span objects.
print(type(doc[0]))

<class 'spacy.tokens.token.Token'>


In [23]:
# Slicing a Doc object returns a Span object.
print(doc[0:3])
print(type(doc[0:3]))

It's not
<class 'spacy.tokens.span.Span'>


In [24]:
# Access each token's index within the Doc.
print([(t.text, t.i) for t in doc])

[('It', 0), ("'s", 1), ('not', 2), ('about', 3), ('the', 4), ('money', 5), ('(', 6), ('only', 7), ('$', 8), ('20.15', 9), (')', 10), (',', 11), ('it', 12), ("'s", 13), ('about', 14), ('sending', 15), ('a', 16), ('message', 17), (':)', 18), ('.', 19), ('🚀', 20), ('💎', 21), ('🙌', 22)]


spaCy's tokenization is _non-destructive_, which means the original input can be reconstructed from the tokens.

In [25]:
# You can view the original input like so:
print(doc.text)

It's not about the money (only $20.15), it's about sending a message :). 🚀💎🙌
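
To make the non-destructive property concrete, here is a small sketch that rebuilds the input from the tokens themselves. It assumes a blank English pipeline (`spacy.blank("en")`), which uses the same rule-based tokenizer and needs no model download. `text_with_ws` is the token text plus any trailing whitespace, and `idx` is the token's character offset in the original string.

```python
import spacy

# Tokenizer only; no trained model is required for this check.
nlp = spacy.blank("en")
s = "It's not about the money (only $20.15), it's about sending a message :). 🚀💎🙌"
doc = nlp(s)

# Concatenating each token with its trailing whitespace restores the input.
rebuilt = "".join(t.text_with_ws for t in doc)
print(rebuilt == s)

# Each token also remembers where it starts in the original string.
offsets_ok = all(s[t.idx : t.idx + len(t.text)] == t.text for t in doc)
print(offsets_ok)
```

Both checks should print `True`, since the tokenizer never discards or rewrites characters.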


In [46]:
# Tokens have many useful properties
print(doc[10])
doc[10].is_punct


)


True

In [47]:
print(doc[15])
list(doc[15].subtree)

sending


[sending, a, message]

You can learn more about the **Token** and **Span** objects here:<br>
https://spacy.io/api/token<br>
https://spacy.io/api/span


We can also tokenize multiple sentences and access each sentence individually using the **Doc** object's _sents_ property.

In [29]:
s = """JUST PUT IN ANOTHER 30K IN NOK CALLS LET'S GO! $GME $NOK BUY AND HOLD 🚀🚀 🚀🚀. We need to stick together and 💎🖐 the ever lovin shit out of this opportunity. We will leave no man or woman behind! Forward! Together!"""

doc = nlp(s)

# Look at individual sentences (there should be multiple 'Span' objects).
for sent in doc.sents:
    print(f"##### {sent}")

##### JUST PUT IN ANOTHER 30K IN NOK CALLS LET'S GO!
##### $GME $NOK BUY AND HOLD 🚀🚀 🚀🚀.
##### We need to stick together and 💎🖐 the ever lovin shit out of this opportunity.
##### We will leave no man or woman behind!
##### Forward!
##### Together!
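
The sentence boundaries above come from the trained pipeline. As a lighter alternative, spaCy also ships a rule-based `sentencizer` component that splits on sentence-final punctuation. The sketch below assumes a blank English pipeline and a shortened example text; it is an illustration, not the method used in the cell above.

```python
import spacy

# A blank pipeline has no sentence segmentation, so add the rule-based one.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Forward! Together! We will leave no one behind.")

# Each sentence is a Span; .text gives the sentence as a string.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Punctuation-based splitting is fast but can struggle with noisy text such as abbreviations or missing punctuation, which is where the trained pipeline usually does better.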


### Tokenization Exercises

In [33]:
#
# EXERCISE:
# 1) Tokenize the following text
# 2) Iterate through the tokens to check whether there's a currency symbol.
# 3) If there is, and the currency symbol is followed by a number, print
#    both the symbol and the number.
# 
# Look through https://spacy.io/api/token#attributes on how to check whether
# a token is a currency symbol or a number.
#
# Expected output: "$20.15".
s = "It's not about the money (only $20.15), it's about sending a message. 🚀💎🙌"
doc = nlp(s)

In [None]:
# Exercise: find the longest token in the WallStreetBets dataset.

In [34]:
#
# EXERCISE: Learn how the spaCy tokenizer works and how to customize it:
# https://spacy.io/usage/linguistic-features#tokenization
#

In [None]:
#
# EXERCISE: Read through spaCy-101 and if you're interested, check out their course
# on spaCy itself (link on the page).
# https://spacy.io/usage/spacy-101
#

In [None]:
#
# EXERCISE: Look up how to tokenize the sentence below using NLTK. The imports 
# are done for you. Does the NLTK tokenizer handle "N.Y.C." correctly?
#
import nltk
from nltk.tokenize import TreebankWordTokenizer
s = "Let's go to N.Y.C. for the weekend."

**NOTE**: Different tokenizers will give subtly different results based on the rules they use. Experiment with different tokenizers and use the one best suited for your project.

# Basic Preprocessing
## Case-Folding, Stop Word Removal, Stemming, and Lemmatization.

Course module for this demo:
https://www.nlpdemystified.org/course/basic-preprocessing

spaCy performs all these preprocessing steps (except stemming) behind the scenes for you. In line with its non-destructive policy, the tokens aren't modified directly. Rather, each **Token** object has a number of attributes that give you views of your document with these preprocessing steps applied. The attributes of a **Token** can be found here:<br>
https://spacy.io/api/token#attributes
<br><br>
More information about spaCy's processing pipeline:<br>
https://spacy.io/usage/processing-pipelines
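
Since spaCy does not include a stemmer, stemming has to come from another library. Below is a minimal sketch using NLTK's `PorterStemmer` (NLTK is also used in the tokenization exercises). The word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words only. Stemming chops suffixes with rules, so a stem
# is not guaranteed to be a dictionary word.
words = ["running", "cats", "fundamentals", "driven"]
stems = {w: stemmer.stem(w) for w in words}
print(stems)
```

Compare the stems with the lemmas spaCy produces later in this section: a lemmatizer maps to a valid base form, while a stemmer only strips endings.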

In [52]:
import spacy
nlp = spacy.load('en_core_web_sm')
s = "Once you're done with GME - $AG and $SLV, the gentleman's short squeeze, driven by macro fundamentals :/"
doc = nlp(s)

### Case-Folding

View your document with case-folding using the *lower_* attribute.

In [53]:
print([t.lower_ for t in doc])

['once', 'you', "'re", 'done', 'with', 'gme', '-', '$', 'ag', 'and', '$', 'slv', ',', 'the', 'gentleman', "'s", 'short', 'squeeze', ',', 'driven', 'by', 'macro', 'fundamentals', ':/']


You can also apply conditions when generating these views. For example, we can skip case-folding if a token is the start of a sentence.

In [54]:
print([t.lower_ if not t.is_sent_start else t.text for t in doc])

['Once', 'you', "'re", 'done', 'with', 'gme', '-', '$', 'ag', 'and', '$', 'slv', ',', 'the', 'gentleman', "'s", 'short', 'squeeze', ',', 'driven', 'by', 'macro', 'fundamentals', ':/']
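
Another condition that may suit this kind of text is to leave all-uppercase tokens (such as ticker symbols) untouched and lowercase everything else. This is a sketch under that assumption, using a blank English pipeline because only lexical attributes are needed.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Once you're done with GME - $AG and $SLV, the gentleman's short squeeze")

# Keep tokens that are entirely uppercase (e.g. tickers); lowercase the rest.
view = [t.text if t.is_upper else t.lower_ for t in doc]
print(view)
```

Whether preserving case for tickers helps depends on the downstream task; it is a choice, not a rule.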


### Stop Word Removal

spaCy comes with a default stop word list. To view your document with stop words removed, you can use the *is_stop* attribute.

In [55]:
# spaCy's default stop word list.
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))

{'his', 'becomes', 'whereafter', 'few', 'we', 'anyway', 'one', 'within', 'or', 'something', 'too', 'these', 'fifty', 'which', 'front', 'almost', 'me', 'from', 'five', '‘m', 'get', 'then', 'my', 'they', 'on', 'unless', 'everyone', 'keep', 'around', 'to', 'onto', 'via', 'anywhere', 'hence', 'yourself', 'two', 'since', 'even', '‘ve', 'always', 'and', 'former', 'four', 'under', '‘re', 'all', 'been', 'forty', 'will', 'though', 'using', 'besides', 'except', 'put', 'noone', '’ll', 'has', 'namely', 'during', 'serious', 'the', 'beside', 'hereafter', 'in', 'back', 'another', 'where', 'here', 'thru', 'whom', 'than', 'until', 'do', '‘d', 'sometimes', 'thereupon', 'take', 'else', 'must', 'but', 'are', 'various', 'top', 'n’t', 'am', 'make', 'once', 'cannot', 'him', 'be', 'other', 'hers', "'d", 'without', 'please', 'already', "'s", '’re', 'you', 'those', 'three', 'amongst', 'thereby', 'therein', 'into', 'formerly', 'who', 'whereby', 'off', 'bottom', 'as', 'made', 'would', 'across', 'themselves', 'lat

In [56]:
print([t for t in doc if not t.is_stop])

[GME, -, $, AG, $, SLV, ,, gentleman, short, squeeze, ,, driven, macro, fundamentals, :/]
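
Token attributes can be combined to build a cleaner view in one pass. The sketch below, which assumes a blank English pipeline, drops stop words, punctuation, and whitespace tokens and lowercases what remains.

```python
import spacy

nlp = spacy.blank("en")
s = "Once you're done with GME - $AG and $SLV, the gentleman's short squeeze, driven by macro fundamentals :/"
doc = nlp(s)

# is_stop, is_punct and is_space are lexical attributes, so they work
# without a trained model.
content = [t.lower_ for t in doc if not (t.is_stop or t.is_punct or t.is_space)]
print(content)
```

Note that `$` is a currency symbol rather than punctuation, so it survives this filter; add `t.is_currency` to the condition if you want it removed as well.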


### Lemmatization

Lemmatization works the same way. You can view your document with lemmatization applied through the *lemma_* attribute.

In [57]:
[(t.text, t.lemma_) for t in doc]

[('Once', 'once'),
 ('you', 'you'),
 ("'re", 'be'),
 ('done', 'do'),
 ('with', 'with'),
 ('GME', 'GME'),
 ('-', '-'),
 ('$', '$'),
 ('AG', 'AG'),
 ('and', 'and'),
 ('$', '$'),
 ('SLV', 'SLV'),
 (',', ','),
 ('the', 'the'),
 ('gentleman', 'gentleman'),
 ("'s", "'s"),
 ('short', 'short'),
 ('squeeze', 'squeeze'),
 (',', ','),
 ('driven', 'drive'),
 ('by', 'by'),
 ('macro', 'macro'),
 ('fundamentals', 'fundamental'),
 (':/', ':/')]

*Question:* Why would we do this?

In [None]:
#
# EXERCISE: Find out how to add and remove your own stop words in spaCy. Add the 
# word 'told' as a stop word, test that it works, then remove it from 
# the stop word list.
#

In [None]:
#
# EXERCISE: Read up on how to add your own custom attributes to Token objects
# and try adding one of your own.
# https://spacy.io/usage/processing-pipelines#custom-components-attributes
#

# Advanced Preprocessing
## Named Entity Recognition and Parsing

Course module for this demo:
https://www.nlpdemystified.org/course/advanced-preprocessing

spaCy performs Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and parsing as part of its default pipeline in the *nlp* object.

In [66]:
import spacy
nlp = spacy.load('en_core_web_sm')
s = "Once you're done with GME, $AG and $SLV, the gentleman's short squeeze, driven by macro fundamentals."
doc = nlp(s)

### Named Entity Recognition

There are multiple ways to access named entities. One way is through the *ent_type_* attribute.


In [67]:

[(t.text, t.ent_type_) for t in doc]

[('Once', ''),
 ('you', ''),
 ("'re", ''),
 ('done', ''),
 ('with', ''),
 ('GME', 'ORG'),
 (',', ''),
 ('$', ''),
 ('AG', 'ORG'),
 ('and', ''),
 ('$', ''),
 ('SLV', 'ORG'),
 (',', ''),
 ('the', ''),
 ('gentleman', ''),
 ("'s", ''),
 ('short', ''),
 ('squeeze', ''),
 (',', ''),
 ('driven', ''),
 ('by', ''),
 ('macro', ''),
 ('fundamentals', ''),
 ('.', '')]

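You can also check whether a token is part of an entity before printing it by testing whether its `ent_type` attribute (note the lack of a trailing underscore) is non-zero. `ent_type` holds the integer ID of the label, while `ent_type_` holds the label as a string.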

In [71]:
print([(t.text, t.ent_type_) for t in doc if t.ent_type != 0])

[('GME', 'ORG'), ('AG', 'ORG'), ('SLV', 'ORG')]


Another way is through the _ents_ property of the **Doc** object. Here, we iterate through _ents_ and print the entity itself and its label.

In [70]:
print([(ent.text, ent.label_) for ent in doc.ents])

[('GME', 'ORG'), ('AG', 'ORG'), ('SLV', 'ORG')]


Note that _ents_ yields **Span** objects, so an entity made up of several tokens is returned as a single span rather than token by token.
<br><br>
You can also access the positions of entities:

In [72]:
print([(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents])

[('GME', 'ORG', 22, 25), ('AG', 'ORG', 28, 30), ('SLV', 'ORG', 36, 39)]
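
The entities in the sample sentence are all single tokens, so the difference between token-level and span-level access is easy to miss. The sketch below uses a made-up sentence and sets a two-token entity by hand on a blank pipeline. The annotation is an assumption for illustration, not a model prediction.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
text = "GameStop shares rallied in New York."
doc = nlp(text)

# Hand-written annotation for illustration: tokens 4 and 5 ("New York").
doc.ents = [Span(doc, 4, 6, label="GPE")]

# Token-level view: one row per token, with B(egin)/I(nside) markers.
token_view = [(t.text, t.ent_iob_, t.ent_type_) for t in doc if t.ent_type_]
# Span-level view: one row per entity, with character offsets.
span_view = [(e.text, e.label_, e.start_char, e.end_char) for e in doc.ents]

print(token_view)
print(span_view)
```

The token-level view lists `New` and `York` separately, while `doc.ents` returns them as one span whose character offsets index back into the original text.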


spaCy is bundled with visualizers for both parsing and named entities.<br>
https://spacy.io/usage/visualizers
<br><br>
Here, we visualize the entities in our sample sentence.

In [73]:
from spacy import displacy

# Pass jupyter=True to render the visualization inline in the notebook.
# With jupyter=False, render() returns the raw HTML markup instead.
displacy.render(doc, style='ent', jupyter=True)
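
Outside a notebook, the raw HTML can be captured as a string and saved or embedded elsewhere. A minimal sketch, again assuming a blank pipeline with a hand-set entity so that it runs without a trained model:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("GameStop shares rallied in New York.")
# Hand-set entity for illustration only; not a model prediction.
doc.ents = [Span(doc, 4, 6, label="GPE")]

# With jupyter=False, render() returns the markup instead of displaying it.
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:80])
```

The returned string can then be written to a file (for example with `pathlib.Path.write_text`) and opened in a browser.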

### Fine-tuning models

You can also fine-tune a spaCy NER model on your own data. See the spaCy documentation if you need this.

### Using spaCy's Matcher to find patterns
spaCy comes with a host of pattern-matching functionality. Beyond regex, spaCy can match on a variety of attributes such as POS tags, entity labels, lemmas, dependencies, entire phrases, and a lot more. You can learn more here:<br>
https://spacy.io/usage/rule-based-matching<br>
https://explosion.ai/demos/matcher
<br><br>
Here, we search for patterns that may be useful for our r/WallStreetBets analyser.
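
As one possible pattern, the sketch below matches ticker mentions written as a dollar sign followed by an all-uppercase alphabetic token (e.g. `$AG`). The pattern name `TICKER` and the pattern itself are assumptions chosen for illustration. Only lexical attributes are used, so a blank English pipeline is enough.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: a literal "$", then an alphabetic, all-uppercase token.
ticker_pattern = [{"ORTH": "$"}, {"IS_ALPHA": True, "IS_UPPER": True}]
matcher.add("TICKER", [ticker_pattern])

doc = nlp("Once you're done with GME - $AG and $SLV, the gentleman's short squeeze")

# Each match is (match_id, start, end); sort by start to get document order.
matches = sorted(matcher(doc), key=lambda m: m[1])
tickers = [doc[start:end].text for _, start, end in matches]
print(tickers)
```

This pattern deliberately misses bare tickers such as `GME` without the dollar sign; catching those reliably would need extra rules or the NER output shown earlier.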


# Additional Reading and Resources

- https://spacy.io/usage/processing-pipelines
- Take the free and succinct spaCy course (available in multiple languages):<br>
https://course.spacy.io/
- https://spacy.io/usage/spacy-101

