Feature extraction can rarely retain all the information content of the input data in any machine learning pipeline. In NLP, you have to find the right balance: adjust your tokenizer to extract more, or different, information depending on the particular application.

# Stemming
How do you cut a word into its semantically significant bits?

You want to remove the "ing" from verbs, so that "ending" becomes "end" and "running" becomes "run", but "sing" should remain intact. Letters can be very misleading.
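
As a minimal sketch of why this is harder than it looks, here is a naive rule-based stemmer. The function name `naive_stem` and its two rules (require a vowel before the suffix, undouble a final consonant) are illustrative assumptions, not an established stemming algorithm.

```python
import re


def naive_stem(word):
    """Strip a trailing 'ing' when the remainder still contains a vowel."""
    match = re.match(r"^(.*[aeiou].*)ing$", word)
    if not match:
        return word  # e.g. 'sing': nothing but 's' would be left
    stem = match.group(1)
    # Undouble a final doubled consonant: 'runn' -> 'run'
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
        stem = stem[:-1]
    return stem


for word in ["ending", "running", "sing", "falling"]:
    print(word, "->", naive_stem(word))
```

These rules handle "ending", "running" and "sing" as intended, but they also turn "falling" into "fal", which is exactly the kind of trap that hand-written letter rules fall into.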

Approaches include statistically finding the "semantic stems" in a large collection of natural language text.

# Tokenization
Tokenization is a particular kind of document segmentation. Segmentation breaks text up into smaller chunks, or segments, with more focused information. Tokenization segments the document into tokens.

A tokenizer can also be found in a compiler, where it is known as a scanner or lexer: it reads code and matches it against the *lexicon*, the set of all valid tokens, i.e. the vocabulary. A *scannerless parser* is a parser that incorporates the tokenizer rather than running it as a separate stage.
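
To make the compiler analogy concrete, here is a small sketch using Python's standard-library `tokenize` module, which is the scanner for Python source code. The one-line `source` string is a made-up example, and the filter keeps only names, operators and numbers for readability.

```python
import io
import tokenize

source = "x = 1 + 2\n"  # hypothetical snippet of code to scan

# generate_tokens expects a readline-style callable that yields source lines
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)
]
print(tokens)
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2')]
```

Unlike natural language, the lexicon here is fixed by the language grammar, so the scanner never has to guess where a token ends.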

As the first step in an NLP pipeline, tokenization has a big impact on everything that follows.

Basic approach: use Python's `str.split` to tokenize, then build one-hot vectors from the tokens.

In [16]:
import numpy as np

sentence = "Along the drifting cloud, the eagle searching down on the land."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))
print(vocab)

num_tokens = len(token_sequence)
vocab_size = len(vocab)
onehot_vectors = np.zeros((num_tokens,
                           vocab_size), int)
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1
onehot_vectors

['Along', 'cloud,', 'down', 'drifting', 'eagle', 'land.', 'on', 'searching', 'the']


array([[1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0, 0]])

A Pandas DataFrame makes this a little easier to interpret (think of R's data frames).

In [15]:
import pandas as pd
pd.DataFrame(onehot_vectors, columns=vocab)

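|    | Along | cloud, | down | drifting | eagle | land. | on | searching | the |
|---:|------:|-------:|-----:|---------:|------:|------:|---:|----------:|----:|
|  0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|  1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
|  2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
|  3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|  4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
|  5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
|  6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
|  7 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
|  8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
|  9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |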


This is a very sparse data structure, but no information is lost, which makes it useful for neural nets such as seq2seq models and generative language models.
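
The "no information is lost" claim can be checked directly. The sketch below rebuilds the one-hot table for the same sentence and decodes it back into text; the names `decoded` and `reconstructed` are introduced here for illustration only.

```python
import numpy as np

sentence = "Along the drifting cloud, the eagle searching down on the land."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

onehot_vectors = np.zeros((len(token_sequence), len(vocab)), int)
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

# Each row holds exactly one 1; its column index identifies the token
decoded = [vocab[j] for j in onehot_vectors.argmax(axis=1)]
reconstructed = " ".join(decoded)
print(reconstructed == sentence)  # True
```

Only the exact whitespace between tokens is not recoverable; this sentence uses single spaces, so the round trip is exact.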

It immensely increases the size of the data, though. With a vocabulary of a million tokens, a corpus of 3,000 books with 3,500 sentences each and 15 words per sentence contains 157.5 million tokens, so its one-hot table comes to 157.5 terabytes at one byte per cell. Fortunately, all this data only cycles through RAM and never needs to be stored.
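
The arithmetic behind that figure can be sketched as follows. It assumes that the "million tokens" is the vocabulary size (the number of columns) and that each cell of the table takes one byte; both are assumptions made to reproduce the number quoted above.

```python
books = 3000
sentences_per_book = 3500
words_per_sentence = 15
vocab_size = 1_000_000  # assumed: one column per vocabulary entry
bytes_per_cell = 1      # assumed: one byte per table cell

# One row per token in the corpus
num_rows = books * sentences_per_book * words_per_sentence
num_bytes = num_rows * vocab_size * bytes_per_cell
terabytes = num_bytes / 1e12

print(num_rows)   # 157500000
print(terabytes)  # 157.5
```

Storing each cell as a single bit instead of a byte would divide the total by eight, which is still far too large to keep on disk.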

Preferably, though, we want to compress a document down to a single vector rather than a big table, which is possible if we're willing to give up the ability to recall the text perfectly. By summing up all the one-hot vectors, we get a bag of words, which contains enough information to roughly infer the meaning of the document. And if you limit the vocabulary to the 10k most important tokens, the aforementioned 3k books go down to about 30 MB (one 10k-entry vector per book).
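
A minimal sketch of that summing step, reusing the example sentence from earlier. The storage estimate at the end assumes one bag-of-words vector per book and one byte per count, which is an assumption chosen to match the 30 MB figure.

```python
import numpy as np

sentence = "Along the drifting cloud, the eagle searching down on the land."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

onehot_vectors = np.zeros((len(token_sequence), len(vocab)), int)
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

# Summing over the rows collapses the table into one count per vocabulary entry
bag_of_words = dict(zip(vocab, onehot_vectors.sum(axis=0).tolist()))
print(bag_of_words)

# Storage estimate (assumed: one vector per book, one byte per count)
num_books = 3000
vocab_limit = 10_000
megabytes = num_books * vocab_limit / 1e6
print(megabytes)  # 30.0
```

The word order is gone after the sum: "the" is counted three times, but nothing records where in the sentence it occurred.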

A middle ground between one-hot vectors and a single bag of words is a bag of words for each sentence. Another approach is a binary array that keeps track of the presence or absence of each word in a sentence.

In [17]:
sentence_bow = {}
for token in sentence.split():
    sentence_bow[token] = 1
sorted(sentence_bow.items())

[('Along', 1),
 ('cloud,', 1),
 ('down', 1),
 ('drifting', 1),
 ('eagle', 1),
 ('land.', 1),
 ('on', 1),
 ('searching', 1),
 ('the', 1)]

Using a dictionary (a hash table) here ensures the data is only as big as it needs to be, because absent words take up no space. The dictionary can be made even more efficient by storing an integer index for each word instead of the word itself.
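
One way to sketch the integer idea: keep a single lookup table from token to id, and key the sparse bag of words by id. The name `token_ids` and the choice to number tokens in order of first appearance are assumptions made for this example.

```python
sentence = "Along the drifting cloud, the eagle searching down on the land."

# Assign each distinct token an integer id in order of first appearance
token_ids = {}
for token in sentence.split():
    token_ids.setdefault(token, len(token_ids))

# The sparse binary bag of words now stores small integers instead of strings
sentence_bow = {token_ids[token]: 1 for token in sentence.split()}

print(token_ids["the"])      # 1
print(sorted(sentence_bow))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The strings are stored once in `token_ids`, however many sentences refer to them, so the saving grows with the size of the corpus.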

A Pandas `Series` can serve as an even more efficient dictionary.

In [21]:
import pandas as pd
df = pd.DataFrame(pd.Series(dict([(token, 1) for token in sentence.split()])), columns=['sent']).T
df

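|      | Along | the | drifting | cloud, | eagle | searching | down | on | land. |
|:-----|------:|----:|---------:|-------:|------:|----------:|-----:|---:|------:|
| sent | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |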


In [23]:
sentences = [
    "Catching the swirling wind, the sailor sees the rim of the land.",
    "The eagle's dancing wings create as weather spins out of hand.",
    "Go closer hold the land feel partly no more than grains of sand.",
    "We stand to lose all time a thousand answers by in our hand.",
    "Next to your deeper fears we stand surrounded by million years.",
    "I'll be the roundabout."
]

corpus = {}
for i, sent in enumerate(sentences):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
df[df.columns[:10]]

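|       | Catching | the | swirling | wind, | sailor | sees | rim | of | land. | The |
|:------|---------:|----:|---------:|------:|-------:|-----:|----:|---:|------:|----:|
| sent0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| sent1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| sent2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| sent3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |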


Clearly, very little overlap exists between the sentences. The overlap between two sentences can be computed as the dot product of their binary vectors.

In [31]:
df = df.T
df.sent2

Catching       0
the            1
swirling       0
wind,          0
sailor         0
sees           0
rim            0
of             1
land.          0
The            0
eagle's        0
dancing        0
wings          0
create         0
as             0
weather        0
spins          0
out            0
hand.          0
Go             1
closer         1
hold           1
land           1
feel           1
partly         1
no             1
more           1
than           1
grains         1
sand.          1
We             0
stand          0
to             0
lose           0
all            0
time           0
a              0
thousand       0
answers        0
by             0
in             0
our            0
Next           0
your           0
deeper         0
fears          0
we             0
surrounded     0
million        0
years.         0
I'll           0
be             0
roundabout.    0
Name: sent2, dtype: int64

In [32]:
df.sent0.dot(df.sent1)

1

The two sentences share exactly one word ("of").
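
To see which word accounts for that overlap, the same dot product can be sketched with plain dictionaries and a set intersection. The variable names `bow0`, `bow1`, `overlap` and `shared` are introduced here for illustration.

```python
sent0 = "Catching the swirling wind, the sailor sees the rim of the land."
sent1 = "The eagle's dancing wings create as weather spins out of hand."

bow0 = {tok: 1 for tok in sent0.split()}
bow1 = {tok: 1 for tok in sent1.split()}

# Dot product of two sparse binary vectors: only tokens present in both contribute
overlap = sum(bow0[tok] * bow1[tok] for tok in bow0 if tok in bow1)
shared = sorted(set(bow0) & set(bow1))

print(overlap)  # 1
print(shared)   # ['of']
```

Note that "the" and "The" count as different tokens, as do "land." and "hand.", because `split` neither normalizes case nor strips punctuation.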