# Text Classification

We'll use the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) dataset for this task. It is a binary sentiment classification dataset containing 25,000 highly polarized movie reviews for training and 25,000 for testing. Additional unlabeled data is also available.

In [1]:
using FastAI

In [4]:
data, blocks = load(datarecipes()["imdb"])


((mapobs(loadfile, ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64})), mapobs(parentname, ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64}))), (Paragraph(), Label{String}(["neg", "pos"])))

Each sample is a review, which will be our input data. The output is the sentiment of the review, either positive or negative.

In [13]:
println("Sample = ",getobs(data, 1))
println("Block = ",blocks)

Sample = ("Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.", "neg")
Block = (Paragraph(), Label{String}(["neg", "pos"]))


In [15]:
task = FastAI.TextClassificationSingle(blocks, data)

SupervisedTask(Paragraph -> Label{String})

The task consists of encodings that need to be applied to the input data and the output data.

The encodings for the input data are as follows:
- **Sanitize**: Cleans the text by lowercasing it, removing punctuation and stop words, and applying some fastai-specific preprocessing steps (inserting marker tokens such as xxbos and xxup).
- **Tokenize**: Splits the text into word tokens.
- **EmbedVocabulary**: Converts the tokens to numbers. This step constructs the vocabulary from the training data and returns, for each input token, the integer stored for it in that vocabulary.
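To make the order of the three input encodings concrete, here is a minimal sketch in plain Julia. The `toy_*` functions and `toy_vocab` are hypothetical stand-ins written for illustration; they are not the FastAI.jl implementations and skip stop-word removal and the marker tokens.

```julia
# Toy stand-ins for the three input encodings (NOT the FastAI.jl implementations).

# Sanitize: lowercase the text and drop everything except letters, digits and whitespace.
toy_sanitize(text::AbstractString) = replace(lowercase(text), r"[^a-z0-9\s]" => "")

# Tokenize: split the cleaned text on whitespace.
toy_tokenize(text::AbstractString) = String.(split(text))

# Vocabulary lookup: map each token to an integer, using 0 for unknown tokens (an assumption).
toy_lookup(tokens, vocab::Dict{String,Int}) = [get(vocab, t, 0) for t in tokens]

# Hypothetical three-word vocabulary.
toy_vocab = Dict("terrific" => 1, "absurd" => 2, "comedy" => 3)

toy_tokens = toy_tokenize(toy_sanitize("Terrific, absurd comedy!"))
toy_ids = toy_lookup(toy_tokens, toy_vocab)

println(toy_tokens)
println(toy_ids)
```

Each function consumes the output of the previous one, which mirrors how the encodings in the task are applied one after another.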

In [16]:
task.encodings

(Sanitize(Function[FastAI.Textual.replace_all_caps, FastAI.Textual.replace_sentence_case, FastAI.Textual.convert_lowercase, FastAI.Textual.remove_punctuations, FastAI.Textual.basic_preprocessing, FastAI.Textual.remove_extraspaces]), Tokenize([FastAI.Textual.tokenize]), FastAI.Textual.EmbedVocabulary(OrderedCollections.OrderedDict("redeemiing" => 1, "poulange" => 1, "inattentive" => 1, "sleepwalking" => 20, "photosynthesis" => 1, "lunk" => 1, "henry" => 407, "whiz" => 16, "redresses" => 1, "gathered" => 38…)), OneHot{DataType}(Float32, 0.5f0))

In [23]:
input, target = getobs(data, 1)

("Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.", "neg")

In [35]:
encoded_input, encoded_output = encodesample(task, Training(), (input, target))

println(encoded_input)
println(encoded_output)

[25000, 633779, 11990, 46, 395, 102, 633779, 1220, 5383, 433, 1374, 306, 3246, 122678, 27, 47, 2198, 241, 523, 157, 657, 1, 79, 633779, 1353, 182, 306, 122678, 12727, 424, 720, 369, 633779, 614, 633779, 18, 1542, 633779, 296, 802, 739, 28, 633779, 305, 963, 985, 900, 633779, 4, 633779, 4, 633779, 900, 1700, 633779, 135, 633779, 24, 633779, 13, 633779, 42, 6696, 136]
Float32[1.0, 0.0]
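The encoded target is a two-element vector because the label is one-hot encoded against the classes `["neg", "pos"]` (the `OneHot` entry in `task.encodings`). The idea can be sketched as follows; `toy_onehot` is a hypothetical helper for illustration, not the library's function.

```julia
# Hypothetical helper: one-hot encode a label against an ordered list of classes.
# The position of the 1.0 corresponds to the position of the label in `classes`.
toy_onehot(label, classes) = Float32.(classes .== label)

toy_classes = ["neg", "pos"]

println(toy_onehot("neg", toy_classes))
println(toy_onehot("pos", toy_classes))
```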


Let us now look at each step of the above encoding process.

##### Sanitize

The sanitized text has stop words and punctuation removed and is converted to lowercase. It also contains fastai-specific marker tokens such as xxbos (beginning of the text), xxup (the next word was all uppercase in the original text), and xxmaj (the first letter of the next word was uppercase in the original text).

In [40]:
encoding_1 = Textual.Sanitize()
sanitized_data = FastAI.encode(encoding_1, Training(), Paragraph(), input)

"xxbos xxmaj story unnatural feelings pig xxmaj starts scene terrific example absurd comedy xxup formal orchestra audience insane violent mob crazy chantings singers xxmaj unfortunately stays absurd xxup time narrative eventually putting xxmaj era xxmaj cryptic dialogue xxmaj shakespeare easy third grader xxmaj technical level cinematography future xxmaj vilmos xxmaj zsigmond xxmaj future stars xxmaj sally xxmaj kirkland xxmaj frederic xxmaj forrest seen briefly "
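The case markers can be illustrated with a small sketch in plain Julia. `mark_case` is a hypothetical helper, not the code behind `Textual.Sanitize`, and it assumes input without punctuation; it only shows how the information lost by lowercasing can be preserved as extra tokens.

```julia
# Hypothetical helper (NOT FastAI.jl's implementation): lowercase every word and
# record its original casing with fastai-style marker tokens.
function mark_case(text::AbstractString)
    out = String[]
    for word in split(text)
        if all(isuppercase, word)
            push!(out, "xxup")      # the whole word was uppercase
        elseif isuppercase(first(word))
            push!(out, "xxmaj")     # only the first letter was uppercase
        end
        push!(out, lowercase(word))
    end
    # Mark the beginning of the text once.
    return join(["xxbos"; out], " ")
end

marked = mark_case("Unfortunately it stays absurd the WHOLE time")
println(marked)
```

Unlike the output above, this sketch does not remove stop words, so words such as "it" and "the" are still present in `marked`.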

##### Tokenize

This step splits the sanitized text into a vector of word tokens.

In [42]:
encoding_2 = Textual.Tokenize()
tokenized_data = FastAI.encode(encoding_2, Training(), Paragraph(), sanitized_data)

64-element Vector{String}:
 "xxbos"
 "xxmaj"
 "story"
 "unnatural"
 "feelings"
 "pig"
 "xxmaj"
 "starts"
 "scene"
 "terrific"
 ⋮
 "sally"
 "xxmaj"
 "kirkland"
 "xxmaj"
 "frederic"
 "xxmaj"
 "forrest"
 "seen"
 "briefly"

##### EmbedVocabulary

This is the step that turns tokens into numbers. It constructs the vocabulary from the training data and then replaces each token of the input with the integer stored for it in the vocabulary, so a repeated token such as "xxmaj" always maps to the same value.

In [44]:
vocab = setup(Textual.EmbedVocabulary, data)
encoding_3 = Textual.EmbedVocabulary(vocab = vocab.vocab)

FastAI.Textual.EmbedVocabulary(OrderedCollections.OrderedDict("redeemiing" => 1, "poulange" => 1, "inattentive" => 1, "sleepwalking" => 20, "photosynthesis" => 1, "lunk" => 1, "henry" => 407, "whiz" => 16, "redresses" => 1, "gathered" => 38…))

In [49]:
vector_data = encode(encoding_3, Training(), Textual.Tokens(), tokenized_data)

64-element Vector{Int64}:
  25000
 633779
  11990
     46
    395
    102
 633779
   1220
   5383
    433
      ⋮
    135
 633779
     24
 633779
     13
 633779
     42
   6696
    136
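The following sketch shows one way a vocabulary could be built from tokenized training data and then used for lookup. It is written under assumptions: `build_vocab` and `lookup_tokens` are hypothetical names, storing occurrence counts as the dictionary values is a choice made for this example, and mapping unknown tokens to 0 is likewise assumed rather than taken from FastAI.jl.

```julia
# Hypothetical sketch, not the FastAI.jl implementation.

# Build a vocabulary holding one integer per distinct token
# (in this example: how often the token occurs in the corpus).
function build_vocab(corpus::Vector{Vector{String}})
    vocab = Dict{String,Int}()
    for tokens in corpus, tok in tokens
        vocab[tok] = get(vocab, tok, 0) + 1
    end
    return vocab
end

# Replace every token by the integer stored for it; unknown tokens map to `unk`.
lookup_tokens(vocab, tokens; unk = 0) = [get(vocab, tok, unk) for tok in tokens]

# Two tiny, made-up "documents" that are already sanitized and tokenized.
toy_corpus = [["xxbos", "xxmaj", "story", "pig"],
              ["xxbos", "xxmaj", "pig", "xxmaj", "farm"]]

corpus_vocab = build_vocab(toy_corpus)
encoded = lookup_tokens(corpus_vocab, ["xxbos", "xxmaj", "pig", "unseen"])
println(encoded)
```

Because the lookup is a plain dictionary access, the vocabulary has to be built once from the training data before any sample can be encoded, which matches the `setup` call preceding `encode` above.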