# Converting Raw Text into Sequence Data

## Reading the Dataset

Here, we will work with H. G. Wells’ The Time Machine, a book containing just over 30,000 words. While real applications typically involve significantly larger datasets, this is sufficient to demonstrate the preprocessing pipeline. The following code downloads the book from Project Gutenberg using Downloads.download and then reads the raw text into a string.



In [1]:
using Downloads

file_path = Downloads.download("https://www.gutenberg.org/cache/epub/35/pg35.txt")

"/tmp/jl_eD65mPmPM9"

In [28]:
raw_text = read(file_path, String)  # read the whole file into a String
raw_text[begin:60]  # byte-indexed slice; first(raw_text, 60) is safer for non-ASCII text

"\ufeffThe Project Gutenberg eBook of The Time Machine\r\n    \r\nTh"

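For simplicity, we ignore punctuation and capitalization when preprocessing the raw text: every run of non-letter characters (including digits, line breaks, and the leading byte-order mark) is replaced by a single space, and the result is lowercased.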

In [29]:
str = lowercase(replace(raw_text,r"[^A-Za-z]+"=>" "))
str[begin:60]

" the project gutenberg ebook of the time machine this ebook "
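To see what this normalization does, the following minimal sketch applies the same regular expression to a short made-up string (the sample sentence is an assumption for illustration and is not taken from the book). Note that the apostrophe in a contraction is also treated as a separator, and that digits are dropped.

```julia
# Hypothetical sample string, used only for illustration.
sample = "Hello, World! It's 2024."

# Replace each run of non-letter characters with one space, then lowercase.
cleaned = lowercase(replace(sample, r"[^A-Za-z]+" => " "))

println(repr(cleaned))  # prints "hello world it s "
```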

## Tokenization

Tokens are the atomic (indivisible) units of text. Each time step corresponds to one token, but what precisely constitutes a token is a design choice. For example, we could represent the sentence “Baby needs a new pair of shoes” as a sequence of 7 words, where the set of all words comprises a large vocabulary (typically tens or hundreds of thousands of words). Alternatively, we could represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (ASCII defines only 128 distinct characters, or 256 in its 8-bit extensions). Below, we tokenize our preprocessed text into a sequence of characters.

In [31]:
tokens = collect(str)  # split the string into a Vector{Char}
join(tokens[begin:30],",")

" ,t,h,e, ,p,r,o,j,e,c,t, ,g,u,t,e,n,b,e,r,g, ,e,b,o,o,k, ,o"
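The trade-off between the two granularities can be checked directly on the example sentence. The sketch below uses only Base functions (split for words, collect for characters); the variable names are illustrative.

```julia
sentence = "Baby needs a new pair of shoes"

word_tokens = split(sentence)    # split on whitespace into words
char_tokens = collect(sentence)  # one token per character, spaces included

println(length(word_tokens))  # prints 7
println(length(char_tokens))  # prints 30
```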

## Vocabulary

We now construct a vocabulary for our dataset, assigning each distinct token a numerical index so that the sequence of tokens can be converted into a list of indices. Note that we have not lost any information and can easily convert our dataset back to its original (string) representation.

In [32]:
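vocab = unique(tokens)                        # distinct tokens, in order of first appearance
vocab_dict = Dict(vocab .=> 1:length(vocab))  # token => index
indices_dict = Dict(idx => tok for (tok, idx) in vocab_dict)  # index => token

# Map tokens to indices and back.
to_indices(v::Vector{Char}) = [vocab_dict[i] for i in v]
to_vocab(v::Vector{Int}) = [indices_dict[i] for i in v]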

indices = to_indices(vocab[begin:10])
println("indices:$(indices)")
println("words:$(to_vocab(indices))")

indices:[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
words:[' ', 't', 'h', 'e', 'p', 'r', 'o', 'j', 'c', 'g']
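The claim that no information is lost can be checked with a round trip: encode the tokens as indices, decode them again, and compare the result with the original string. The following is a minimal, self-contained sketch on a short sample; the names sample_text, tok2idx, encode, and decode are illustrative and not part of the code above. Because indices are assigned in order of first appearance, the vocabulary vector itself can serve as the index-to-token lookup.

```julia
sample_text = "the time machine"
sample_chars = collect(sample_text)

# Distinct tokens in order of first appearance, and a token => index map.
sample_vocab = unique(sample_chars)
tok2idx = Dict(tok => i for (i, tok) in enumerate(sample_vocab))

encode(toks) = [tok2idx[t] for t in toks]     # tokens -> indices
decode(ids) = [sample_vocab[i] for i in ids]  # indices -> tokens

ids = encode(sample_chars)
restored = join(decode(ids))

println(restored == sample_text)  # prints true
```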


## Exploratory Language Statistics

Using the real corpus and a word-level lexicon built with the TextAnalysis package, we can inspect basic statistics concerning word use in our corpus. Below, we construct a vocabulary from the words used in The Time Machine and print the ten most frequent ones.

In [57]:
using TextAnalysis

sd = StringDocument(raw_text)
remove_case!(sd)
prepare!(sd, strip_non_letters)
crps = Corpus([sd])
update_lexicon!(crps)
lex_dict = lexicon(crps)

partialsort([lex_dict...], 1:10; by = x -> x[2], rev=true)

10-element view(::Vector{Pair{String, Int64}}, 1:10) with eltype Pair{String, Int64}:
  "the" => 2468
  "and" => 1296
   "of" => 1281
    "i" => 1242
    "a" => 864
   "to" => 760
   "in" => 605
  "was" => 550
 "that" => 451
   "my" => 439
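Word counts of this kind can also be sketched with Base Julia alone. The code below is a swapped-in alternative, not the approach used above: word_counts is a hypothetical helper, and its results on the full text may differ slightly from those of the lexicon, since the two pipelines need not clean and split the text identically.

```julia
# Hypothetical helper: count word occurrences using only Base Julia.
function word_counts(text::AbstractString)
    # Same normalization as earlier: non-letters become spaces, then lowercase.
    cleaned = lowercase(replace(text, r"[^A-Za-z]+" => " "))
    counts = Dict{String,Int}()
    for w in split(cleaned)
        key = String(w)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

sample_sentence = "The Time Traveller and the time machine"
counts = word_counts(sample_sentence)

# Sort (word => count) pairs by count, most frequent first.
ranked = sort(collect(counts); by = last, rev = true)

println(counts["the"])  # prints 2
```

Applied to the book, word_counts(raw_text) returns a dictionary that can be splatted into the same partialsort call as lex_dict.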

Word frequency tends to follow a power-law distribution (specifically, Zipf’s law) as we go down the ranks. To get a better idea, we plot the word frequencies.
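
For reference, Zipf’s law is commonly written as follows, where $n_i$ is the frequency of the $i$-th most frequent word, $\alpha$ is the exponent characterizing the distribution, and $c$ is a constant (these symbols are notation introduced here, not names from the code above):

$$n_i \propto \frac{1}{i^{\alpha}}, \qquad \text{equivalently} \qquad \log n_i = -\alpha \log i + c.$$

On a log-log plot of frequency against rank, such a relationship appears as an approximately straight line with slope $-\alpha$.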