If you are running in Google Colab, you can upload `requirements.txt` using the following cell:

In [None]:
from google.colab import files
uploaded = files.upload()


Saving requirements.txt to requirements.txt


In [None]:
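import os, sys
from google.colab import drive
drive.mount('/content/drive')
nb_path = '/content/notebooks'
# Link the Drive folder to a short path; skip if the link already exists (e.g. when re-running the cell)
if not os.path.lexists(nb_path):
    os.symlink('/content/drive/My Drive/Colab Notebooks', nb_path)
sys.path.insert(0, nb_path)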

Mounted at /content/drive


In [None]:
!pip install --target=$nb_path -r requirements.txt


## NLP Preprocessing

### Removing Stopwords

In [None]:
tweet = """I’m amazed how often in practice, not only does a @huggingface NLP model solve your problem, but one of their public finetuned checkpoints, is good enough for the job.

Both impressed, and a little disappointed how rarely I get to actually train a model that matters :("""

In [None]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
stop_words = stopwords.words('english')

In [None]:
stop_words = set(stop_words)

In [None]:
tweet = tweet.lower().split()

In [None]:
tweet_no_stopwords = [word for word in tweet if word not in stop_words]

print("With stopwords:", ' '.join(tweet))
print("Without:", ' '.join(tweet_no_stopwords))

With stopwords: i’m amazed how often in practice, not only does a @huggingface nlp model solve your problem, but one of their public finetuned checkpoints, is good enough for the job. both impressed, and a little disappointed how rarely i get to actually train a model that matters :(
Without: i’m amazed often practice, @huggingface nlp model solve problem, one public finetuned checkpoints, good enough job. impressed, little disappointed rarely get actually train model matters :(
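
Note that splitting on whitespace leaves punctuation attached to words (`practice,`, `job.`), so a stopword followed by a comma would not be removed. A minimal sketch of one way around this, assuming a simple regex tokenizer and a tiny hand-picked stopword set (`SMALL_STOP_WORDS` and `remove_stopwords` are hypothetical names, not part of NLTK):

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# the NLTK English list used above is much longer.
SMALL_STOP_WORDS = {"i", "a", "the", "is", "for", "and", "how", "to", "that"}

def remove_stopwords(text, stop_words=SMALL_STOP_WORDS):
    """Lowercase the text, extract word tokens, and drop stopwords."""
    # The character class keeps handles like @huggingface and simple
    # contractions together while discarding surrounding punctuation.
    tokens = re.findall(r"[a-z0-9@']+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

sample = "The model is good enough for the job, and that matters."
print(remove_stopwords(sample))
```

Stripping punctuation this way is a design choice: it also discards tokens such as `:(`, which may carry meaning for tasks like sentiment analysis.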


### Tokens

* A word
* Part of a word
* A single character
* Punctuation mark [,!-.]
* Special token like < URL >, or < NAME >
* Model-specific special tokens, like [CLS] and [SEP] for BERT

The BERT transformer model uses *five* special tokens:

| Token | Meaning |
| --- | --- |
| **[PAD]** | Padding token, allows us to maintain same-length sequences (up to 512 tokens for BERT) even when sentences of different lengths are fed in |
| **[UNK]** | Used when a word is unknown to BERT |
| **[CLS]** | Appears at the start of every sequence |
| **[SEP]** | Indicates a separator or end of sequence |
| **[MASK]** | Used when masking tokens, for example in training with masked language modelling (MLM) |

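To make the roles of `[CLS]`, `[SEP]`, and `[PAD]` concrete, here is a minimal pure-Python sketch. It is not BERT's actual tokenizer: `add_special_tokens` is a hypothetical helper, and the short `max_len` is chosen only to keep the output readable.

```python
def add_special_tokens(tokens, max_len=8):
    """Wrap tokens in [CLS]/[SEP] and pad with [PAD] up to max_len."""
    # Truncate so the tokens plus [CLS] and [SEP] still fit in max_len
    tokens = tokens[: max_len - 2]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    # Fill the remaining positions so every sequence has the same length
    padding = max_len - len(sequence)
    return sequence + ["[PAD]"] * padding

short = add_special_tokens(["hello", "world"], max_len=6)
long = add_special_tokens(["a", "b", "c", "d", "e", "f"], max_len=6)
print(short)
print(long)
```

Both results have exactly `max_len` entries: the short input is padded, while the long input is truncated before the special tokens are added.
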
### Stemming

Stemming is a text normalization method used in NLP to simplify text before it is processed by a model. When stemming, we strip the final few characters of a word in order to find a common form of the word.

In [None]:
txt = "I am amazed by how amazingly amazing you are"

In [None]:
words_to_stem = ['happy', 'happiest', 'happier', 'cactus', 'cactii', 'elephant', 'elephants', 'amazed', 'amazing', 'amazingly', 'cement', 'owed', 'maximum', 'maxim']

In [None]:
from nltk.stem import PorterStemmer, LancasterStemmer

In [None]:
porter = PorterStemmer()
lancaster = LancasterStemmer()

In [None]:
stemmed = [(word, porter.stem(word), lancaster.stem(word)) for word in words_to_stem]

In [None]:
stemmed

[('happy', 'happi', 'happy'),
 ('happiest', 'happiest', 'happiest'),
 ('happier', 'happier', 'happy'),
 ('cactus', 'cactu', 'cact'),
 ('cactii', 'cactii', 'cacti'),
 ('elephant', 'eleph', 'eleph'),
 ('elephants', 'eleph', 'eleph'),
 ('amazed', 'amaz', 'amaz'),
 ('amazing', 'amaz', 'amaz'),
 ('amazingly', 'amazingli', 'amaz'),
 ('cement', 'cement', 'cem'),
 ('owed', 'owe', 'ow'),
 ('maximum', 'maximum', 'maxim'),
 ('maxim', 'maxim', 'maxim')]
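
The core idea behind these results, stripping suffixes by rule, can be sketched in a few lines. This is a deliberately naive illustration and not the Porter or Lancaster algorithm: the suffix list, `min_stem`, and `naive_stem` are all assumptions made for this example.

```python
# Hypothetical suffix list, ordered so longer suffixes are tried first
SUFFIXES = ("ingly", "ing", "est", "ed", "er", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    # No rule applied (or the stem would be too short): return unchanged
    return word

for word in ["amazed", "amazing", "amazingly", "elephants", "owed", "cactus"]:
    print(word, "->", naive_stem(word))
```

Even this toy version shows why stems are often not real words: a blind rule such as "strip a trailing s" turns `cactus` into `cactu`.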

### Lemmatization

Lemmatization is very similar to stemming in that it reduces a set of inflected words down to a common word. The difference is that lemmatization reduces inflections down to their real root word, which is called a lemma.

In [None]:
words = ['amaze', 'amazed', 'amazing']


In [None]:
import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
[lemmatizer.lemmatize(word) for word in words]

['amaze', 'amazed', 'amazing']

Nothing changed because the lemmatizer treats each word as a noun by default. We can instead tell it to treat each word as a verb, like so:

In [None]:
[lemmatizer.lemmatize(word, wordnet.VERB) for word in words]

['amaze', 'amaze', 'amaze']
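
Why the part of speech matters can be illustrated with a toy lookup table. This is only a sketch of the idea, not how WordNet is implemented: `LEMMAS` and `toy_lemmatize` are hypothetical, and the table holds just the entries needed for this example.

```python
# Hypothetical toy lemma table keyed by (word, part of speech)
LEMMAS = {
    ("amazed", "v"): "amaze",
    ("amazing", "v"): "amaze",
    ("amazing", "a"): "amazing",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if absent."""
    return LEMMAS.get((word, pos), word)

words = ["amaze", "amazed", "amazing"]
as_nouns = [toy_lemmatize(w) for w in words]
as_verbs = [toy_lemmatize(w, "v") for w in words]
print(as_nouns)
print(as_verbs)
```

With the default noun tag no entry matches, so the words come back unchanged; with the verb tag they all map to the same lemma.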

### Unicode Normalization

In [None]:
import unicodedata

We use Unicode normalization to convert equivalent characters into a single, matching representation. As there are different forms of equivalence (canonical and compatibility), there are also different forms of normalization. These are all called **N**ormal **F**orms, and there are four of them:

| Name | Abbreviation | Description | Example |
| --- | --- | --- | --- |
| Form D | NFD | *Canonical* decomposition | `Ç` → `C ̧` |
| Form C | NFC | *Canonical* decomposition followed by *canonical* composition | `Ç` → `C ̧` → `Ç` |
| Form KD | NFKD | *Compatibility* decomposition | `ℌ ̧` → `H ̧` |
| Form KC | NFKC | *Compatibility* decomposition followed by *canonical* composition | `ℌ ̧` → `H ̧` → `Ḩ` |

Let's take a look at each of these forms in action. Our C with cedilla character Ç can be represented in two ways, as a single character called *Latin capital C with cedilla* (*\u00C7*), or as two characters, *Latin capital C* (*\u0043*) followed by *combining cedilla* (*\u0327*):

In [None]:
c_with_cedilla = "\u00C7"  # Latin capital C with cedilla (single character)
c_with_cedilla

'Ç'

In [None]:
c_plus_cedilla = "\u0043\u0327"  # \u0043 = Latin capital C, \u0327 = 'combining cedilla' (two characters)
c_plus_cedilla

'Ç'

In [None]:
# These two versions do not match when compared:
c_with_cedilla == c_plus_cedilla

False

If we perform **NFD** on our C with cedilla character `\u00C7`, we **decompose** the character into its smaller components, the *Latin capital C* character and the *combining cedilla* character (`\u0043` + `\u0327`). This means that if we compare the **NFD**-normalized C with cedilla to the two-character sequence, the comparison returns `True`:

In [None]:
unicodedata.normalize('NFD', c_with_cedilla) == c_plus_cedilla

True

If we instead apply **NFC** to our two characters `\u0043` + `\u0327`, they are first **decomposed** (which does nothing, as they are already decomposed) and then **composed** into the single `\u00C7` character:

In [None]:
c_with_cedilla == unicodedata.normalize('NFC', c_plus_cedilla)

True
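
In practice it is usually simpler to normalize both sides of a comparison to the same form. A minimal sketch of this, where `same_text` is a hypothetical helper name and not part of `unicodedata`:

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after applying the same normal form to both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00C7"          # single code point: Latin capital C with cedilla
decomposed = "\u0043\u0327"  # Latin capital C + combining cedilla

print(composed == decomposed)                   # raw comparison
print(same_text(composed, decomposed))          # both normalized with NFC
print(same_text(composed, decomposed, "NFD"))   # both normalized with NFD
```

The raw comparison fails, while both normalized comparisons succeed, regardless of whether the composed or decomposed form is chosen as the common representation.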

In [None]:
unicodedata.normalize('NFKC', 'ℌ')

'H'
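
To compare all four forms side by side, we can apply each one to the same input and print the resulting code points. The sketch below uses the `ℌ` plus combining cedilla example from the table; `code_points` is a hypothetical helper added only for readability.

```python
import unicodedata

def code_points(text):
    """Render a string as a space-separated list of Unicode code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Black-letter capital H (U+210C) followed by a combining cedilla (U+0327)
sample = "\u210C\u0327"

results = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}
for form, text in results.items():
    print(f"{form}: {code_points(text)}")
```

The canonical forms (NFD, NFC) leave `ℌ` untouched because it has no canonical decomposition. The compatibility forms replace it with a plain `H`, and NFKC then composes `H` and the cedilla into the single character `Ḩ` (U+1E28).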