
word and character level word tokenizer. #28

Closed
wants to merge 47 commits

Conversation

samuelzxu (Contributor)

Some miscellaneous changes, with the tokenizer being the main focus. Implements the LearnBase getobs and nobs methods and uses the WordTokenizers module.

Closes #24

@ToucheSir (Member) left a comment

Thanks for the PR! I've left some comments and considerations, but it shouldn't take much to get this landed :)

tokenize(type, input)

Tokenizes an input string or stream into pieces depending on selected type.
"""
Member

Is there any advantage to this over the default WordTokenizers.tokenize? I think it's fine to have a separate path for characters, but for words we'd prefer to rely on their implementation. One way to accommodate this may be to point users towards the experimental split API and simply define a function overload which uses your character tokenizer.

Contributor Author

The advantage was that I can control the parameters, so that if WordTokenizers is updated we're not pointing to a function that no longer exists. I don't quite understand what you mean by the last sentence, though. Are you suggesting we not implement the tokenizer and instead tell the user to use the WordTokenizers module's split functions, that we implement the same thing with WordTokenizers.tokenize() in place of the case for :words, or something else? The confusion is over what you mean by "point".

Member

That's correct. We can always re-export tokenize from WordTokenizers if need be. I mention the split API because it uses multiple dispatch to select which tokenizer to use and thus doesn't require a conditional over a symbol parameter like :words or :chars. This allows one to swap out tokenizer functions (see https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/split_api.jl for the built-in list), so doing the equivalent of :chars could be as easy as calling collect on the input string.
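For illustration, a rough sketch of that approach (the Chars sentinel type and its split overload below are hypothetical additions; Words comes from WordTokenizers' split API linked above):

using WordTokenizers

struct Chars end   # hypothetical sentinel type, analogous to Words/Sentences

# With the split API, picking a tokenizer is a dispatch choice rather than a
# conditional over :words/:chars. We own Chars, so extending Base.split for
# it is not piracy.
Base.split(str::AbstractString, ::Type{Chars}) = string.(collect(str))

split("The quick rabbit jumps.", Words)   # word tokens via WordTokenizers
split("abc", Chars)                       # ["a", "b", "c"]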

Tokenizes an input string or stream into pieces depending on selected type.
"""

function tokenize(type::Symbol,input)::AbstractArray{AbstractString}
Member

The type declarations here can be safely omitted.
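For clarity, the annotation-free signature would look like this; the body is only a guess at the dispatch based on this thread, not the PR's actual code:

import WordTokenizers

function tokenize(type, input)
    type === :chars ? string.(collect(input)) : WordTokenizers.tokenize(input)
end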


# LearnBase functions {nobs, getobs} for AbstractArray{AbstractString}

LearnBase.nobs(data::AbstractArray{AbstractString}) = size(data,1)
Member

length(data) does the same thing as size(data, 1) here. We may want to look into a smarter data container eventually, as special-casing a certain array type is suboptimal, but that shouldn't hold up this PR.

Member

Ah, also we shouldn't be implementing getobs for Base types like AbstractArray and AbstractString.

@test getobs(tdata2,3) == "rabbit"
@test getobs(tdata3,3) == "e"
@test getobs(tdata4,3) == "h"
#TODO: add stream tests
Member

By "stream tests", do you mean enumerating more than one token? If so, I agree those should be added.

Contributor Author

No, I meant taking a stream as input, which I have yet to figure out how to create in a test. By "enumerating more than one token", do you mean checking more than one index?
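For what it's worth, Julia's built-in IOBuffer gives an in-memory stream for tests; whether tokenize accepts an IO argument is exactly the assumption such a test would exercise (sketch):

using Test

str = "The quick rabbit jumps over the lazy fox."
io = IOBuffer(str)   # an in-memory stream standing in for a file or socket
@test Datasets.tokenize(:words, io) == Datasets.tokenize(:words, str)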

README.md Outdated
@@ -10,7 +10,7 @@ As an example, training an image classification model from scratch is as simple
using FastAI
path = datasetpath("imagenette2-160")
data = loadtaskdata(path, ImageClassificationTask)
- method = ImageClassification(Datasets.loadclassesclassification("imagenette2-160"), (160, 160))
+ method = ImageClassification(Datasets.getclassesclassification("imagenette2-160"), (160, 160))
Member

This and the change below appear unrelated to the PR. Can you revert them?

Contributor Author

Sure, but the current code does cause bugs (namely, loadclasses... gave an undefined-function error) during the tutorial.

Member

In that case it's better to open another quick PR. Generally, if something isn't blocking CI, it's better not to include unrelated changes.

Member

A PR with just that change would be appreciated, so it can be fixed swiftly 👍

@darsnack (Member) left a comment

To match what @lorenzoh suggested, we don't want to be defining getobs on generic types like AbstractArray{<:AbstractString}.

In addition to the existing suggestions, I think what we want is something like

struct Tokenizer{T, ...}
    data::T
    # ...
end

function LearnBase.getobs(tokenizer::Tokenizer, i)
    # fetch a new sample from data if necessary
    sample = getobs(tokenizer.data, ??)

    # tokenize the sample
    tokens = tokenize(sample)

    # return correct token
    return tokens[??]
end

This is pseudocode, but the idea is that the Tokenizer wraps data, which is any stream of text. This could be a string, or a text dataset on disk wrapped in a lazy loader that reads line by line. The Tokenizer is responsible for the indexing logic: knowing when it needs to fetch more text from data, tokenizing that text, and buffering the excess tokens. Certain calls to getobs(::Tokenizer, ...) will not require fetching more text from data, just returning what's already in the buffer.

@ToucheSir (Member)

As cool as incrementally reading text would be, I'm not sure it's feasible since you'd need to tokenize the entire stream to implement nobs. Hence the question becomes how to represent a collection of tokenized text samples rather than an individual one. Here it's worth looking at how fastai, spaCy and more handle this.

@darsnack (Member) commented Apr 9, 2021

True, I didn't think about that. At the very least, we should tokenize the full stream, store the tokens in Tokenizer, then return those on getobs(::Tokenizer, i). As opposed to tokenizing the data into a generic vector of tokens and writing getobs(::AbstractVector{<:AbstractString}, i).
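A minimal sketch of that eager approach, reusing the names from this thread (not the final implementation):

import LearnBase
using WordTokenizers

# Tokenize the full text up front and store the tokens in the wrapper.
struct Tokenizer{T<:AbstractVector{<:AbstractString}}
    tokens::T
end

Tokenizer(text::AbstractString) = Tokenizer(WordTokenizers.tokenize(text))

LearnBase.nobs(t::Tokenizer) = length(t.tokens)
LearnBase.getobs(t::Tokenizer, i) = t.tokens[i]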

@samuelzxu (Contributor Author)

Ah, I see. I thought about this, and my logic going in was that since there was only one element in the struct, and we're working in an NLP context, it would be simpler for a coder to use this version. Can you point me to a place where I can read more about these design choices (like not implementing methods for Base types, @lorenzoh)? It feels like there is a lot of context I'm missing.

@samuelzxu (Contributor Author)

@darsnack How do I update the PR? I read online that I only have to push to my branch, but it doesn't seem to be showing up.

@samuelzxu (Contributor Author)

Oh, wow, I must be blind! I see it now.

@samuelzxu (Contributor Author)

@darsnack @ToucheSir could you guys help me out with the importing and extending bug?

@lorenzoh (Member)

This section of the Julia documentation explains why type piracy (defining methods of functions you don't own for types you don't own) should be avoided:
https://docs.julialang.org/en/v1/manual/style-guide/#Avoid-type-piracy
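A quick illustration of the distinction (TokenData is a hypothetical wrapper, not code from this PR):

import LearnBase

# Type piracy: we own neither the function (LearnBase's) nor the type (Base's),
# so this method could silently change behavior for unrelated packages.
LearnBase.getobs(data::AbstractArray{<:AbstractString}, i) = data[i]

# Safe: the method is attached to a type this package owns.
struct TokenData
    tokens::Vector{String}
end
LearnBase.getobs(d::TokenData, i) = d.tokens[i]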

@@ -41,8 +42,9 @@ struct NamedTupleData{TData, F}
    namedfs::NamedTuple{F}
end

LearnBase.nobs(data::NamedTupleData) = nobs(getfield(data, :data))
# LearnBase functions {nobs, getobs} for NamedTupleData
Member

These changes are not related to this PR. We like to separate changes into different PRs when possible. Can you move these to another PR?

Contributor Author

Will do.

Member

I think separating out these changes is still outstanding.


LearnBase.nobs(data::NamedTupleData) = nobs(getfield(data, :data))
Member

Same here (move to another PR)

Comment on lines +138 to +139
# LearnBase functions {nobs, getobs} for JoinedData

Member

Same here (move to another PR)

Comment on lines +164 to +169
"""
tokenize(type, input)
type = :words or :chars

Tokenizes an input string or stream into pieces depending on selected type.
"""
Member

This should go above the actual function definition.
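That is, immediately preceding the method it documents, since Julia attaches a docstring to the definition that follows it:

"""
    tokenize(type, input)

type = :words or :chars

Tokenizes an input string or stream into pieces depending on selected type.
"""
function tokenize(type, input)
    # ...
end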

Comment on lines 171 to 173
struct Tokenizer
    data::AbstractArray{AbstractString}
end
Member

For performance reasons, in Julia we use type parameters to get concretely typed fields:

Suggested change
- struct Tokenizer
-     data::AbstractArray{AbstractString}
- end
+ struct Tokenizer{T<:AbstractArray{<:AbstractString}}
+     data::T
+ end

Also, should this be an AbstractVector or an AbstractArray?

Contributor Author

I chose AbstractArray because AbstractVector <: AbstractArray

Comment on lines 191 to 193
function LearnBase.getobs(toks::Tokenizer, idx)
    return toks.data[idx]
end
Member

Suggested change
- function LearnBase.getobs(toks::Tokenizer, idx)
-     return toks.data[idx]
- end
+ LearnBase.getobs(toks::Tokenizer, idx) = toks.data[idx]

Just to make things cleaner.

@darsnack (Member)

You need to add WordTokenizers.jl to the test/Project.toml.
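For reference, one way to do that with the standard Pkg workflow, run from the repository root:

using Pkg
Pkg.activate("test")       # activate the test environment (test/Project.toml)
Pkg.add("WordTokenizers")  # record the dependency in [deps]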

samuelzxu and others added 2 commits April 12, 2021 00:06
tutorial errors: changed {load->get}classesclassification
@darsnack (Member) left a comment

Can you rebase?

Comment on lines 38 to 39
tdata1 = Datasets.Tokenizer(Datasets.tokenize(:words,"The quick rabbit jumps over the lazy fox."))
tdata2 = Datasets.Tokenizer(Datasets.tokenize(:chars,"The quick rabbit jumps over the lazy fox."))
Member

Doesn't tokenize return a Tokenizer already?

…t to avoid conflicting with imported WordTokenizers.tokenize. While I assumed tokenizers and WordTokenizers.tokenize would not conflict, I do get warnings about it and it possibly may have been the cause of some errors. Adjusted a test to include period. Fixed testing semantics usage issue.
@samuelzxu (Contributor Author)

@darsnack


# utilities
isimagefile,
loadfile,
filename,
tokenize,
Member

Suggested change
- tokenize,
+ tokenize_input,

Contributor Author

@darsnack got it.

@lorenzoh closed this Jul 6, 2022

Successfully merging this pull request may close these issues.

Add character and word level tokenizers
4 participants