## Building a vocabulary for word tokenization 


This is my practice run at understanding a foundational NLP concept: word tokenization as a way to analyse text. The text is segmented into tokens according to a certain pattern (usually words separated by whitespace), ideally ignoring the punctuation attached to a word.
<br>
In this way we build a lexicon (vocabulary) - the set of valid words/tokens.
<br>
***Natural Language Processing In Action (NLPIA) by Lane, Howard and Hapke (2019)*** is what I'm closely following to rehearse this workflow (esp. chp 2.2).


### Generalized workflow 

Sentence gathered from [Your dictionary](https://sentence.yourdictionary.com/nlp)
<br>
This is a rough example that relies mostly on built-ins and base scientific libraries (numpy and pandas)

In [1]:
sentence = 'NLP is a method of examining the way a person thinks and acts through language and using this knowledge to affect change.'

In [2]:
# split() with no arguments breaks the sentence on whitespace and returns the tokens as a list
sentence.split()

['NLP',
 'is',
 'a',
 'method',
 'of',
 'examining',
 'the',
 'way',
 'a',
 'person',
 'thinks',
 'and',
 'acts',
 'through',
 'language',
 'and',
 'using',
 'this',
 'knowledge',
 'to',
 'affect',
 'change.']

Here we can see the only mistake is at the end, where 'change.' still includes the punctuation. For eventually developing models, this would be messy and impractical. Since tokenization is the first stage of an NLP pipeline, it's pretty crucial to do it as accurately as possible.
<br>
The tokenizer in this way breaks unstructured data (natural language text) into chunks of information that can be represented as meaningful discrete elements.
<br>
Nevertheless, as a basic/coarse example, we can generally see the process of utilising textual data.
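
One way to avoid the trailing 'change.' is to tokenize with a regular expression instead of plain whitespace splitting. The snippet below is only a minimal sketch: the pattern `\w+` is my own assumption (not necessarily the pattern the book uses), and it would also break contractions such as "don't" into two tokens.

```python
import re

sentence = ('NLP is a method of examining the way a person thinks and acts '
            'through language and using this knowledge to affect change.')

# Match runs of word characters, so whitespace and punctuation both act as separators.
tokens = re.findall(r"\w+", sentence)

print(tokens[-1])   # change
print(len(tokens))  # 22
```

The token count stays the same as with `split()`; only the final token loses its full stop.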


In [3]:
# Will be used to create numerical (one-hot) vectors that capture where each word occurs in the document's sequence of words
import numpy as np

In [4]:
token_sequence = str.split(sentence) # calling split through the built-in str class, equivalent to sentence.split()
vocab = sorted(set(token_sequence)) # take the distinct set of words and sort them

In [5]:
vocab

['NLP',
 'a',
 'acts',
 'affect',
 'and',
 'change.',
 'examining',
 'is',
 'knowledge',
 'language',
 'method',
 'of',
 'person',
 'the',
 'thinks',
 'this',
 'through',
 'to',
 'using',
 'way']

In [18]:
# Show the vocab as a single comma-separated string rather than as a list
', '.join(vocab)

'NLP, a, acts, affect, and, change., examining, is, knowledge, language, method, of, person, the, thinks, this, through, to, using, way'

In [7]:
num_tokens = len(token_sequence)
vocab_size = len(vocab)

In [8]:
print(num_tokens)
print(vocab_size)

22
20


The difference of two comes from repeated words: the vocab keeps each distinct token only once, and both 'a' and 'and' appear twice in the sentence
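
To check which tokens account for the gap, a quick count with `collections.Counter` from the standard library works. This is a small sketch that rebuilds `token_sequence` from the same sentence so that it runs on its own; `counts` and `repeated` are names introduced here.

```python
from collections import Counter

sentence = ('NLP is a method of examining the way a person thinks and acts '
            'through language and using this knowledge to affect change.')
token_sequence = sentence.split()

# Count how often each token occurs, then keep only the repeated ones.
counts = Counter(token_sequence)
repeated = {word: n for word, n in counts.items() if n > 1}

print(repeated)                            # {'a': 2, 'and': 2}
print(len(token_sequence) - len(counts))   # 2
```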

In [9]:
# constructing the one-hot (oh) vectors
oh_vectors = np.zeros((num_tokens, vocab_size), int)
oh_vectors.shape # 22 rows by 20 columns

(22, 20)

In [10]:
# List each token alongside its position in the sentence (counting from 1)
for i, word in enumerate(token_sequence, start=1):
    print(i, word)

1 NLP
2 is
3 a
4 method
5 of
6 examining
7 the
8 way
9 a
10 person
11 thinks
12 and
13 acts
14 through
15 language
16 and
17 using
18 this
19 knowledge
20 to
21 affect
22 change.


When interpreting the **oh_vectors** array:
* Rows - one per token, in the order the tokens appear in the sentence (the listing directly above counts from 1, while the array itself is zero-indexed)
* Columns - one per distinct word, in the sorted order of the **vocab** variable
* Value - a '1' marks the **vocab** column that matches the token on that row; every other entry in the row is '0'
  
~~Technical note - this is a type of 'bag of words (BOW)' representation, which is lossless since every unique attribute (token) has its own indicator (index) in the vector/array~~

In [11]:
for i, word in enumerate(token_sequence):
    oh_vectors[i, vocab.index(word)] = 1
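
The loop above calls `vocab.index(word)` for every token, which rescans the list each time. As an alternative sketch (my own variation, not taken from the book), a dictionary lookup plus NumPy integer-array indexing fills the same array in one assignment. The names `word_to_col` and `cols` are introduced here for illustration.

```python
import numpy as np

sentence = ('NLP is a method of examining the way a person thinks and acts '
            'through language and using this knowledge to affect change.')
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# Map each word to its column once, instead of searching the list per token.
word_to_col = {word: col for col, word in enumerate(vocab)}
cols = [word_to_col[word] for word in token_sequence]

# Row i gets a 1 in the column of token i.
oh_vectors = np.zeros((len(token_sequence), len(vocab)), dtype=int)
oh_vectors[np.arange(len(token_sequence)), cols] = 1

print(oh_vectors.shape)  # (22, 20)
```

For a single sentence the difference is negligible; the lookup table mainly matters once the vocabulary gets large.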

In [12]:
' '.join(vocab)

'NLP a acts affect and change. examining is knowledge language method of person the thinks this through to using way'

In [13]:
oh_vectors

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0,

Example - row 3 (index 2) holds the token 'a' from **token_sequence**, so its '1' sits in the second column (index 1), which is where 'a' falls in the sorted **vocab** list
* Because 'a' appears twice in **token_sequence**, row 9 (index 8) has its '1' in that same column.
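
The same check can be done in code. This is a short sketch that rebuilds the array the same way as the cells above so it runs on its own; `np.flatnonzero` returns the zero-based rows whose entry in the 'a' column is non-zero, and adding 1 gives the row numbers counted from 1.

```python
import numpy as np

sentence = ('NLP is a method of examining the way a person thinks and acts '
            'through language and using this knowledge to affect change.')
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

oh_vectors = np.zeros((len(token_sequence), len(vocab)), dtype=int)
for i, word in enumerate(token_sequence):
    oh_vectors[i, vocab.index(word)] = 1

# Rows (zero-based) that have a 1 in the column belonging to 'a'.
rows = np.flatnonzero(oh_vectors[:, vocab.index('a')])

print(vocab.index('a'))  # 1
print(rows)              # [2 8]
print(rows + 1)          # [3 9]
```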

To make this even clearer, the last step is to turn this array into a DataFrame, which lets us view the data in tabular form with labelled columns

In [14]:
import pandas as pd 
df = pd.DataFrame(oh_vectors, columns=vocab)
df # bear in mind the inherent zero-indexing

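|  | NLP | a | acts | affect | and | change. | examining | is | knowledge | language | method | of | person | the | thinks | this | through | to | using | way |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |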


### Data Manipulation

In [16]:
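# Blank out the zeros so the 1s stand out (for display only: the columns are no longer purely numeric afterwards)
df[df == 0] = ''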
df

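|  | NLP | a | acts | affect | and | change. | examining | is | knowledge | language | method | of | person | the | thinks | this | through | to | using | way |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 1 |  |  |  |  |  |  |  | 1.0 |  |  |  |  |  |  |  |  |  |  |  |  |
| 2 |  | 1.0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 3 |  |  |  |  |  |  |  |  |  |  | 1.0 |  |  |  |  |  |  |  |  |  |
| 4 |  |  |  |  |  |  |  |  |  |  |  | 1.0 |  |  |  |  |  |  |  |  |
| 5 |  |  |  |  |  |  | 1.0 |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 6 |  |  |  |  |  |  |  |  |  |  |  |  |  | 1.0 |  |  |  |  |  |  |
| 7 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 1.0 |
| 8 |  | 1.0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 9 |  |  |  |  |  |  |  |  |  |  |  |  | 1.0 |  |  |  |  |  |  |  |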


With the 20 distinct words as columns, we can scan any row for its '1' to see which word occupies that position in the original document
* Hence the name 'one-hot' - each row (word vector) has a single '1' in the column of its token and '0' in every other vocab column

End Goal:
* Vector representation of words/documents - no information is lost, since each row records exactly which vocab word occurred at that position, so the original token sequence can be rebuilt from the array
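
As a rough check of the 'no information is lost' claim, the sentence can be rebuilt from the array alone by reading off the column of the '1' in each row. This sketch rebuilds the array first so that it is self-contained; `recovered` is a name introduced here.

```python
import numpy as np

sentence = ('NLP is a method of examining the way a person thinks and acts '
            'through language and using this knowledge to affect change.')
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

oh_vectors = np.zeros((len(token_sequence), len(vocab)), dtype=int)
for i, word in enumerate(token_sequence):
    oh_vectors[i, vocab.index(word)] = 1

# argmax along each row gives the column of the single 1, i.e. the vocab index.
recovered = ' '.join(vocab[col] for col in oh_vectors.argmax(axis=1))

print(recovered == sentence)  # True
```

The round trip works here because the sentence uses single spaces between words; the array stores tokens and their order, not the original whitespace.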