
word and character level word tokenizer. #28

Closed
wants to merge 47 commits

Conversation

samuelzxu (Contributor)

Some miscellaneous changes, with the tokenizer being the main focus. Implements the LearnBase getobs and nobs methods and uses the WordTokenizers module.

Closes #24

@ToucheSir (Member) left a comment

Thanks for the PR! I've left some comments and considerations, but it shouldn't take much to get this landed :)

tokenize(type, input)

Tokenizes an input string or stream into pieces depending on selected type.
"""
Member

Is there any advantage to this over the default WordTokenizers.tokenize? I think it's fine to have a separate path for characters, but for words we'd prefer to rely on their implementation. One way to accommodate this may be to point users towards the experimental split API and simply define a function overload which uses your character tokenizer.

Contributor Author

The advantage was that I can control the parameters, so that if WordTokenizers is updated we're not pointing to a function that no longer exists. I don't quite understand what you mean by the last sentence, though. Are you suggesting we not implement the tokenizer and instead tell the user to use the WordTokenizers module's split functions, that we implement the same thing with WordTokenizers.tokenize() in place of the case for :words, or something else? The confusion is over what you mean by "point".

Member

That's correct. We can always re-export tokenize from WordTokenizers if need be. I mention the split API because it uses multiple dispatch to select which tokenizer to use and thus doesn't require a conditional over a symbol parameter like :words or :chars. This allows one to swap out tokenizer functions (see https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/split_api.jl for the built-in list), so doing the equivalent of :chars could be as easy as calling collect on the input string.
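For illustration, a rough sketch of that approach (the Chars sentinel type and its split overload below are hypothetical additions; Words comes from WordTokenizers' split API linked above):

using WordTokenizers

struct Chars end   # hypothetical sentinel type, analogous to Words/Sentences

# With the split API, picking a tokenizer is a dispatch choice rather than a
# conditional over :words/:chars. We own Chars, so extending Base.split for
# it is not piracy.
Base.split(str::AbstractString, ::Type{Chars}) = string.(collect(str))

split("The quick rabbit jumps.", Words)   # word tokens via WordTokenizers
split("abc", Chars)                       # ["a", "b", "c"]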

Tokenizes an input string or stream into pieces depending on selected type.
"""

function tokenize(type::Symbol,input)::AbstractArray{AbstractString}
Member

The type declarations here can be safely omitted.
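For clarity, the annotation-free signature would look like this; the body is only a guess at the dispatch based on this thread, not the PR's actual code:

import WordTokenizers

function tokenize(type, input)
    type === :chars ? string.(collect(input)) : WordTokenizers.tokenize(input)
end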


# LearnBase functions {nobs, getobs} for AbstractArray{AbstractString}

LearnBase.nobs(data::AbstractArray{AbstractString}) = size(data,1)
Member

length(data) does the same thing as size(data, 1) here. We may want to look into a smarter data container eventually, as special-casing a certain array type is suboptimal, but that shouldn't hold up this PR.

Member

Ah, also we shouldn't be implementing getobs for Base types like AbstractArray and AbstractString.

@test getobs(tdata2,3) == "rabbit"
@test getobs(tdata3,3) == "e"
@test getobs(tdata4,3) == "h"
#TODO: add stream tests
Member

By "stream tests", do you mean enumerating more than one token? If so, I agree those should be added.

Contributor Author

No, I meant taking a stream as input, which I have yet to figure out how to create in a test. By "enumerating more than one token", do you mean checking more than one index?
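For what it's worth, Julia's built-in IOBuffer gives an in-memory stream for tests; whether tokenize accepts an IO argument is exactly the assumption such a test would exercise (sketch):

using Test

str = "The quick rabbit jumps over the lazy fox."
io = IOBuffer(str)   # an in-memory stream standing in for a file or socket
@test Datasets.tokenize(:words, io) == Datasets.tokenize(:words, str)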

README.md Outdated
@@ -10,7 +10,7 @@ As an example, training an image classification model from scratch is as simple
using FastAI
path = datasetpath("imagenette2-160")
data = loadtaskdata(path, ImageClassificationTask)
- method = ImageClassification(Datasets.loadclassesclassification("imagenette2-160"), (160, 160))
+ method = ImageClassification(Datasets.getclassesclassification("imagenette2-160"), (160, 160))
Member

This and the change below appear unrelated to the PR. Can you revert them?

Contributor Author

Sure, but the current code does cause bugs (namely, loadclasses... gave an undefined-function error) during the tutorial.

Member

In that case it's better to open another quick PR. Generally, if something isn't blocking CI, it's better not to include unrelated changes.

Member

A PR with just that change would be appreciated, so it can be fixed swiftly 👍

@darsnack (Member) left a comment

To match what @lorenzoh suggested, we don't want to be defining getobs on generic types like AbstractArray{<:AbstractString}.

In addition to the existing suggestions, I think what we want is something like

struct Tokenizer{T, ...}
    data::T
    # ...
end

function LearnBase.getobs(tokenizer::Tokenizer, i)
    # fetch a new sample from data if necessary
    sample = getobs(tokenizer.data, ??)

    # tokenize the sample
    tokens = tokenize(sample)

    # return correct token
    return tokens[??]
end

This is pseudocode, but the idea is that the Tokenizer wraps data, which is any stream of text. This could be a string, or a text dataset on disk wrapped in a lazy loader that reads line by line. The Tokenizer is responsible for the indexing logic: knowing when it needs to fetch more text from data, tokenizing that text, and buffering the excess tokens. Certain calls to getobs(::Tokenizer, ...) will not require fetching more text from data, just returning what's already in the buffer.

@ToucheSir (Member)

As cool as incrementally reading text would be, I'm not sure it's feasible since you'd need to tokenize the entire stream to implement nobs. Hence the question becomes how to represent a collection of tokenized text samples rather than an individual one. Here it's worth looking at how fastai, spaCy and more handle this.

@darsnack (Member) commented Apr 9, 2021

True, I didn't think about that. At the very least, we should tokenize the full stream, store the tokens in Tokenizer, then return those on getobs(::Tokenizer, i). As opposed to tokenizing the data into a generic vector of tokens and writing getobs(::AbstractVector{<:AbstractString}, i).
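A minimal sketch of that eager approach, reusing the names from this thread (not the final implementation):

import LearnBase
using WordTokenizers

# Tokenize the full text up front and store the tokens in the wrapper.
struct Tokenizer{T<:AbstractVector{<:AbstractString}}
    tokens::T
end

Tokenizer(text::AbstractString) = Tokenizer(WordTokenizers.tokenize(text))

LearnBase.nobs(t::Tokenizer) = length(t.tokens)
LearnBase.getobs(t::Tokenizer, i) = t.tokens[i]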

@samuelzxu (Contributor Author)

Ah, I see. I thought about this, and my logic going in was that since there was only one element in the struct, and we're working in an NLP context, it would be simpler for a coder to use this version. Can you point me to a place where I can read more about these design choices (like not implementing methods for Base types, @lorenzoh)? It feels like there is a lot of context I'm missing.

@samuelzxu (Contributor Author)

@darsnack How do I update the PR? I read online that I only have to push to my branch, but it doesn't seem to be showing up.

@samuelzxu (Contributor Author)

Oh, wow, I must be blind! I see it now.

@samuelzxu (Contributor Author)

@darsnack @ToucheSir could you guys help me out with the importing and extending bug?

@lorenzoh (Member)

This section of the Julia documentation explains why type piracy (defining methods of functions you don't own for types you don't own) should be avoided:
https://docs.julialang.org/en/v1/manual/style-guide/#Avoid-type-piracy
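A quick illustration of the distinction (TokenData is a hypothetical wrapper, not code from this PR):

import LearnBase

# Type piracy: we own neither the function (LearnBase's) nor the type (Base's),
# so this method could silently change behavior for unrelated packages.
LearnBase.getobs(data::AbstractArray{<:AbstractString}, i) = data[i]

# Safe: the method is attached to a type this package owns.
struct TokenData
    tokens::Vector{String}
end
LearnBase.getobs(d::TokenData, i) = d.tokens[i]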

@@ -41,8 +42,9 @@ struct NamedTupleData{TData, F}
    namedfs::NamedTuple{F}
end

LearnBase.nobs(data::NamedTupleData) = nobs(getfield(data, :data))
# LearnBase functions {nobs, getobs} for NamedTupleData
Member

These changes are not related to this PR. We like to separate changes into different PRs when possible. Can you move these to another PR?

Contributor Author

Will do.

Member

I think separating out these changes is still outstanding.


LearnBase.nobs(data::NamedTupleData) = nobs(getfield(data, :data))
Member

Same here (move to another PR)

Comment on lines +138 to +139
# LearnBase functions {nobs, getobs} for JoinedData

Member

Same here (move to another PR)

Comment on lines +164 to +169
"""
tokenize(type, input)
type = :words or :chars

Tokenizes an input string or stream into pieces depending on selected type.
"""
Member

This should go above the actual function definition.
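That is, immediately preceding the method it documents, since Julia attaches a docstring to the definition that follows it:

"""
    tokenize(type, input)

type = :words or :chars

Tokenizes an input string or stream into pieces depending on selected type.
"""
function tokenize(type, input)
    # ...
end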

Comment on lines 171 to 173
struct Tokenizer
    data::AbstractArray{AbstractString}
end
Member

For performance reasons, in Julia we use type parameters to get concretely typed fields:

Suggested change
- struct Tokenizer
-     data::AbstractArray{AbstractString}
- end
+ struct Tokenizer{T<:AbstractArray{<:AbstractString}}
+     data::T
+ end

Also, should this be an AbstractVector or an AbstractArray?

Contributor Author

I chose AbstractArray because AbstractVector <: AbstractArray

Comment on lines 191 to 193
function LearnBase.getobs(toks::Tokenizer, idx)
    return toks.data[idx]
end
Member

Suggested change
- function LearnBase.getobs(toks::Tokenizer, idx)
-     return toks.data[idx]
- end
+ LearnBase.getobs(toks::Tokenizer, idx) = toks.data[idx]

Just to make things cleaner.

@darsnack (Member)

You need to add WordTokenizers.jl to the test/Project.toml.
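For reference, one way to do that with the standard Pkg workflow, run from the repository root:

using Pkg
Pkg.activate("test")       # activate the test environment (test/Project.toml)
Pkg.add("WordTokenizers")  # record the dependency in [deps]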

samuelzxu and others added 2 commits April 12, 2021 00:06
tutorial errors: changed {load->get}classesclassification
@darsnack (Member) left a comment

Can you rebase?

Comment on lines 38 to 39
tdata1 = Datasets.Tokenizer(Datasets.tokenize(:words,"The quick rabbit jumps over the lazy fox."))
tdata2 = Datasets.Tokenizer(Datasets.tokenize(:chars,"The quick rabbit jumps over the lazy fox."))
Member

Doesn't tokenize return a Tokenizer already?

…t to avoid conflicting with imported WordTokenizers.tokenize. While I assumed tokenizers and WordTokenizers.tokenize would not conflict, I do get warnings about it and it possibly may have been the cause of some errors. Adjusted a test to include period. Fixed testing semantics usage issue.
@samuelzxu (Contributor Author)

@darsnack


# utilities
isimagefile,
loadfile,
filename,
tokenize,
Member

Suggested change
- tokenize,
+ tokenize_input,

Contributor Author

@darsnack got it.

@lorenzoh closed this Jul 6, 2022

Successfully merging this pull request may close these issues.

Add character and word level tokenizers
4 participants