Word and character level tokenizer #28
Conversation
Thanks for the PR! I've left some comments and considerations, but it shouldn't take much to get this landed :)
tokenize(type, input)

Tokenizes an input string or stream into pieces depending on selected type.
"""
Is there any advantage to this over the default WordTokenizers.tokenize? I think it's fine to have a separate path for characters, but for words we'd prefer to rely on their implementation. One way to accommodate this may be to point users towards the experimental split API and simply define a function overload which uses your character tokenizer.
The advantage was that I can control the parameters, so that if WordTokenizers is updated we're not pointing to a function that doesn't exist anymore. I don't quite understand what you mean by the last sentence. Are you suggesting not implementing the tokenizer and instead telling the user to use the WordTokenizers split functions, implementing the same thing with WordTokenizers.tokenize() in place of the case for :words, or something else? The confusion is over what you mean by "point".
That's correct. We can always re-export tokenize from WordTokenizers if need be. I mention the split API because it uses multiple dispatch to select which tokenizer to use and thus doesn't require a conditional over a symbol parameter like :words or :chars. This allows one to swap out tokenizer functions (see https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/split_api.jl for the built-in list), so doing the equivalent of :chars could be as easy as calling collect on the input string.
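A minimal sketch of what that could look like (this assumes WordTokenizers' experimental split API; only collect is needed for the character case):

using WordTokenizers

# Word-level: the split API dispatches on the `Words` marker and
# delegates to the currently-set word tokenizer.
split("The quick rabbit jumps.", Words)

# Character-level equivalent: collect returns a Vector{Char};
# broadcast `string` if String tokens are needed.
string.(collect("The quick rabbit jumps."))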
src/datasets/transformations.jl
Outdated
Tokenizes an input string or stream into pieces depending on selected type.
"""
function tokenize(type::Symbol,input)::AbstractArray{AbstractString}
The type declarations here can be safely omitted.
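For illustration, a version without the annotations might look like this (the body is a sketch, not this PR's actual code):

import WordTokenizers

# The annotations on `type` and on the return value can simply be
# dropped; dispatch and inference are unaffected.
function tokenize(type, input)
    type === :chars ? string.(collect(input)) : WordTokenizers.tokenize(input)
end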
src/datasets/transformations.jl
Outdated
# LearnBase functions {nobs, getobs} for AbstractArray{AbstractString}
LearnBase.nobs(data::AbstractArray{AbstractString}) = size(data,1)
length(data) does the same thing as size(data, 1) here. We may want to look into a smarter data container eventually, as special-casing a certain array type is non-optimal, but that shouldn't hold up this PR.
Ah, also we shouldn't be implementing getobs for Base types like AbstractArray and AbstractString.
test/datasets/transformations.jl
Outdated
@test getobs(tdata2,3) == "rabbit"
@test getobs(tdata3,3) == "e"
@test getobs(tdata4,3) == "h"
#TODO: add stream tests
By "stream tests", do you mean enumerating more than one token? If so, I agree those should be added.
No, I meant taking a stream as input, which I have yet to figure out how to create in a test. By "enumerating more than one token", do you mean checking more than one index?
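For reference, an in-memory stream can be created with IOBuffer, so no file fixture is needed. A minimal sketch (assuming tokenize(:words, s) returns the token vector, as in the outdated diff above):

using Test

# IOBuffer behaves like an IO stream but lives entirely in memory.
io = IOBuffer("The quick rabbit jumps over the lazy fox.")
text = read(io, String)
@test Datasets.tokenize(:words, text)[3] == "rabbit"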
README.md
Outdated
@@ -10,7 +10,7 @@ As an example, training an image classification model from scratch is as simple
 using FastAI
 path = datasetpath("imagenette2-160")
 data = loadtaskdata(path, ImageClassificationTask)
-method = ImageClassification(Datasets.loadclassesclassification("imagenette2-160"), (160, 160))
+method = ImageClassification(Datasets.getclassesclassification("imagenette2-160"), (160, 160))
This and the change below appear unrelated to the PR. Can you revert them?
Sure, but they do cause bugs during the tutorial (namely, loadclasses... gave an "undefined function" error).
In that case it's better to open another quick PR. Generally, if a change isn't blocking CI, it's better not to include unrelated changes.
A PR with just that change would be appreciated, so it can be fixed swiftly 👍
To match what @lorenzoh suggested, we don't want to be defining getobs on generic types like AbstractArray{<:AbstractString}.
In addition to the existing suggestions, I think what we want is something like
struct Tokenizer{T, ...}
    data::T
    # ...
end

function LearnBase.getobs(tokenizer::Tokenizer, i)
    # fetch a new sample from data if necessary
    sample = getobs(tokenizer.data, ??)
    # tokenize the sample
    tokens = tokenize(sample)
    # return correct token
    return tokens[??]
end
This is pseudocode, but the idea is that the Tokenizer wraps data, which is any stream of text. This could be a string, or a text dataset on disk wrapped in a lazy loader that reads line by line. The Tokenizer is responsible for the index logic: knowing when it needs to fetch more text from data, tokenizing that text, and buffering the excess tokens. Certain calls to getobs(::Tokenizer, ...) will not require fetching more text from data, just returning what's already in the buffer.
As cool as incrementally reading text would be, I'm not sure it's feasible, since you'd need to tokenize the entire stream to implement nobs.
True, I didn't think about that. At the very least, we should tokenize the full stream and store the tokens in the struct.
Ah, I see. I thought about this, and my logic going in was that since there was only one element in the struct, and we're working in an NLP context, it would be simpler for a coder to use this version. Can you point me to a place where I can read more about these design choices (like not implementing methods for Base types, @lorenzoh)? It feels like there is a lot of context I'm missing.
@darsnack How do I update the PR? I read online I only have to push to my branch, but it doesn't seem to be showing up.
Oh, wow, I must be blind! I see it now.
@darsnack @ToucheSir could you guys help me out with the importing and extending bug?
This section of the Julia documentation explains why type piracy (defining methods of functions you don't own for types you don't own) should be avoided: https://docs.julialang.org/en/v1/manual/style-guide/#Avoid-type-piracy
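A sketch of the distinction, reusing names from this PR:

import LearnBase

# Type piracy — avoid: neither the function (LearnBase.getobs) nor the
# type (AbstractArray) is owned by this package.
LearnBase.getobs(data::AbstractArray{<:AbstractString}, idx) = data[idx]

# Fine: `Tokenizer` is defined in this package, so extending the
# foreign function for our own type is not piracy.
struct Tokenizer{T}
    data::T
end
LearnBase.getobs(toks::Tokenizer, idx) = toks.data[idx]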
@@ -41,8 +42,9 @@ struct NamedTupleData{TData, F}
    namedfs::NamedTuple{F}
end

LearnBase.nobs(data::NamedTupleData) = nobs(getfield(data, :data))
# LearnBase functions {nobs, getobs} for NamedTupleData
These changes are not related to this PR. We like to separate changes into different PRs when possible. Can you move these to another PR?
Will do.
I think separating out these changes is still outstanding
LearnBase.nobs(data::NamedTupleData) = nobs(getfield(data, :data))
Same here (move to another PR)
# LearnBase functions {nobs, getobs} for JoinedData
Same here (move to another PR)
""" | ||
tokenize(type, input) | ||
type = :words or :chars | ||
|
||
Tokenizes an input string or stream into pieces depending on selected type. | ||
""" |
This should go above the actual function definition.
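I.e. something like this (a sketch; only the placement of the docstring matters here):

"""
    tokenize(type, input)

type = :words or :chars

Tokenizes an input string or stream into pieces depending on selected type.
"""
function tokenize(type, input)
    # ...
end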
src/datasets/transformations.jl
Outdated
struct Tokenizer
    data::AbstractArray{AbstractString}
end
For performance reasons, in Julia we use type parameters to get concretely typed fields, i.e. replacing

struct Tokenizer
    data::AbstractArray{AbstractString}
end

with

struct Tokenizer{T<:AbstractArray{<:AbstractString}}
    data::T
end
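A quick way to see the difference (a sketch using the parametric version above):

# With the type parameter, the field type is inferred concretely from
# the argument, so code using the field stays type-stable.
t = Tokenizer(["the", "quick", "rabbit"])
typeof(t)                                    # Tokenizer{Vector{String}}
isconcretetype(fieldtype(typeof(t), :data))  # true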
Also, should this be an AbstractVector or AbstractArray?
I chose AbstractArray because AbstractVector <: AbstractArray
src/datasets/transformations.jl
Outdated
function LearnBase.getobs(toks::Tokenizer, idx)
    return toks.data[idx]
end
function LearnBase.getobs(toks::Tokenizer, idx)
    return toks.data[idx]
end

can be shortened to

LearnBase.getobs(toks::Tokenizer, idx) = toks.data[idx]
Just to make things cleaner
You need to add WordTokenizers.jl to the Project.toml.
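For reference, one way to do that (a sketch; run from the repository root, and Pkg records the dependency in Project.toml):

using Pkg

Pkg.activate(".")          # activate this package's environment
Pkg.add("WordTokenizers")  # adds the entry to Project.toml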
tutorial errors: changed {load->get}classesclassification
Can you rebase?
test/datasets/transformations.jl
Outdated
tdata1 = Datasets.Tokenizer(Datasets.tokenize(:words,"The quick rabbit jumps over the lazy fox."))
tdata2 = Datasets.Tokenizer(Datasets.tokenize(:chars,"The quick rabbit jumps over the lazy fox."))
Doesn't tokenize return a Tokenizer already?
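If so, the extra wrapper in the test setup would be redundant, e.g. (a sketch assuming tokenize constructs the Tokenizer itself):

tdata1 = Datasets.tokenize(:words, "The quick rabbit jumps over the lazy fox.")
tdata2 = Datasets.tokenize(:chars, "The quick rabbit jumps over the lazy fox.")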
…t to avoid conflicting with imported WordTokenizers.tokenize. While I assumed tokenizers and WordTokenizers.tokenize would not conflict, I do get warnings about it, and it may have been the cause of some errors. Adjusted a test to include a period. Fixed testing semantics usage issue.
src/datasets/Datasets.jl
Outdated
# utilities
isimagefile,
loadfile,
filename,
tokenize,
tokenize,

can be renamed to

tokenize_input,
@darsnack got it.
Some miscellaneous changes with the tokenizer being the main focus. Implements the LearnBase getobs and nobs methods. Uses the WordTokenizers module.
Closes #24