### A Quick Tour of NLP Concepts - Corpus, Tokens, and Types

NLP methods begin with a text dataset, also called a corpus. A corpus usually contains raw text (in ASCII or UTF-8) and any metadata associated with the text.
The raw text is a sequence of characters (bytes), but it is usually more useful to group those characters into contiguous units called tokens.
In English, tokens correspond to words and numeric sequences separated by whitespace characters or punctuation.
In machine learning, a text together with its metadata is called an instance or data point. The corpus (Figure 1), a collection of instances, is also known as a dataset.


Rao, Delip, and McMahan, Brian. Natural Language Processing with PyTorch. O'Reilly Media.

Figure 1

![Corpus.jpg](attachment:Corpus.jpg)
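
As a rough illustration of the terms above, a corpus can be sketched as a list of instances, each holding raw text and its metadata. The `Instance` class, the sample texts, and the metadata keys are assumptions made for this sketch; they do not come from the book.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One data point: raw text plus any metadata attached to it."""
    text: str
    metadata: dict = field(default_factory=dict)


# A toy corpus (dataset): a collection of instances.
# The texts and metadata keys below are made up for illustration.
corpus = [
    Instance("Take a flight to Mumbai", {"source": "example", "lang": "en"}),
    Instance("Book a hotel in Chennai", {"source": "example", "lang": "en"}),
]

for instance in corpus:
    # Whitespace splitting is the simplest possible way to get tokens.
    tokens = instance.text.split()
    print(len(tokens), tokens, instance.metadata["lang"])
```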

### Tokenization
The process of breaking a text down into tokens is called tokenization. Tokenization is the identification of linguistically meaningful units (LMUs) in the surface text. For example, the sentence “Take a flight to Mumbai” contains five tokens.
Tokenization can be more complicated than simply splitting text on non-alphanumeric characters. For agglutinative languages such as Japanese, Turkish, and Tamil, splitting on whitespace and punctuation might not be sufficient to recover the meaningful units of a sentence, so more specialized techniques are required.
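
A minimal sketch of the naive approach, using only Python's standard `re` module: every run of word characters becomes a token and everything else acts as a separator. The function name `simple_tokenize` and the Japanese sample sentence are assumptions chosen for illustration. Because that sentence is written without spaces, the naive splitter returns it as a single unit, which hints at why specialized tokenizers are needed.

```python
import re


def simple_tokenize(text):
    """Return maximal runs of word characters; everything else separates tokens."""
    return re.findall(r"\w+", text)


print(simple_tokenize("Take a flight to Mumbai"))
# ['Take', 'a', 'flight', 'to', 'Mumbai']

# Illustrative Japanese sentence ("I am going to Tokyo"), written without spaces.
print(simple_tokenize("私は東京に行きます"))
# ['私は東京に行きます']
```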

### Tokenizing Tweets
Let's tokenize the following tweet.
![Tweet.jpg](attachment:Tweet.jpg)

Full text:

President @realDonaldTrump took a firsthand look at border wall construction and received a progress update from Border Patrol officials.

Tokenizing tweets involves, in some cases, preserving hashtags, @handles, and URLs as single units. Should a hashtag be one token or several? Most research papers give little attention to such questions. In fact, many tokenization decisions are somewhat arbitrary, yet they can affect accuracy in practice more than is usually acknowledged.
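
One way to keep hashtags, @handles, and URLs intact is a regular expression that tries those patterns before the generic word rule. The sketch below uses only the standard library. The pattern, the name `tokenize_tweet`, and the sample tweet are assumptions made for illustration, not the behavior of any particular library.

```python
import re

# Alternatives are tried left to right, so URLs, @handles, and #hashtags
# are matched before the generic word and single-symbol rules.
TWEET_TOKEN = re.compile(r"https?://\S+|[@#]\w+|\w+|[^\w\s]")


def tokenize_tweet(text):
    """Lowercase the text and split it into tweet-aware tokens."""
    return TWEET_TOKEN.findall(text.lower())


# Hypothetical tweet, made up for this example.
tweet = "Update from @example_user on #BorderNews: https://example.com/report"
print(tokenize_tweet(tweet))
# ['update', 'from', '@example_user', 'on', '#bordernews', ':', 'https://example.com/report']
```

Note that the URL rule here is deliberately crude: it consumes everything up to the next whitespace, so punctuation that directly follows a URL would be absorbed into the URL token.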

### Example - Tokenizing text

In [1]:
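## Run "python -m spacy download en_core_web_sm" first if the model is not installed.
import spacy
# The 'en' shortcut was removed in spaCy 3; load the model by its full name.
nlp = spacy.load('en_core_web_sm')
text = "President @realDonaldTrump took a firsthand look at border wall construction and received a progress update from Border Patrol officials."
# Lowercase the text, run the pipeline, and print each token as a string.
print([str(token) for token in nlp(text.lower())])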

['president', '@realdonaldtrump', 'took', 'a', 'firsthand', 'look', 'at', 'border', 'wall', 'construction', 'and', 'received', 'a', 'progress', 'update', 'from', 'border', 'patrol', 'officials', '.']


### Types
After tokenizing a text, the first figure we can calculate is the word frequency, that is, the number of times each token occurs in the text.
When talking about word frequency, we distinguish between types and tokens. Types are the distinct words in a corpus, whereas tokens are the running words, including repeats.

Let's see how we can use Python to calculate these figures. First, let's tokenize the sentence with a tokenizer that uses whitespace as the separator.

In [2]:
from nltk.tokenize.regexp import WhitespaceTokenizer
my_str = "Types are the distinct words in a corpus, whereas tokens are the running words."
# Split on whitespace only; punctuation stays attached to the adjacent word.
tokens = WhitespaceTokenizer().tokenize(my_str)
print(len(tokens))

14


Now let's perform the same operation with a different tokenizer.

In [3]:
from nltk.tokenize.regexp import WordPunctTokenizer
my_str = "Types are the distinct words in a corpus, whereas tokens are the running words."
# Split into runs of word characters and runs of punctuation.
my_toks = WordPunctTokenizer().tokenize(my_str)
print(len(my_toks))

16


What is the difference between the two examples?
In the first one, punctuation stays attached to the neighboring word, so the vocabulary ends up containing "words" and "words." as two distinct types. In the second one, the period is split off as its own token, so both occurrences map to the single type "words" and "." becomes a separate type. The same happens with the comma after "corpus", which accounts for the two extra tokens.
The WordPunctTokenizer splits a text into runs of word characters (letters, digits, and underscores) and runs of other non-whitespace characters such as punctuation, using the regular expression `\w+|[^\w\s]+`.
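
The token and type counts discussed above can be reproduced with the standard library alone. In this sketch, `str.split()` and `re.findall` are stand-ins for the two NLTK tokenizers (an assumption that holds for this sentence, not a claim about NLTK's internals), and `collections.Counter` gives the frequency of each type.

```python
import re
from collections import Counter

my_str = "Types are the distinct words in a corpus, whereas tokens are the running words."

# Standard-library stand-ins for the two NLTK tokenizers used above.
whitespace_tokens = my_str.split()
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", my_str)

for name, tokens in [("whitespace", whitespace_tokens), ("wordpunct", wordpunct_tokens)]:
    freq = Counter(tokens)  # maps each type to its frequency
    print(name, "tokens:", len(tokens), "types:", len(freq))
    print("  repeated:", {t: n for t, n in freq.items() if n > 1})
# whitespace tokens: 14 types: 12
#   repeated: {'are': 2, 'the': 2}
# wordpunct tokens: 16 types: 13
#   repeated: {'are': 2, 'the': 2, 'words': 2}
```

Only the second scheme counts "words" twice, because the first treats "words" and "words." as different types.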

##### References
Rao, Delip, and McMahan, Brian. Natural Language Processing with PyTorch. O'Reilly Media.

An Introduction to Natural Language Processing (NLP). https://port.sas.ac.uk/mod/book/view.php?id=583&chapterid=381

Why do I need a tokenizer for each language? https://stackoverflow.com/questions/17314506/why-do-i-need-a-tokenizer-for-each-language

Hanlon, C. (2018). Tokenization of Japanese Text: Using a Morphological Transducer.