Retraining on floydhub.com is not available #20

Closed
cnlinxi opened this issue Jul 27, 2020 · 5 comments

@cnlinxi

cnlinxi commented Jul 27, 2020

Could you provide the corpus? We cannot retrain this model on floydhub.com as described in the README. Thanks a lot.

@cnlinxi cnlinxi changed the title Retraining on floydhub.com is not availuable Retraining on floydhub.com is not available Jul 27, 2020
@p16i
Collaborator

p16i commented Jul 28, 2020

@cnlinxi sorry for the inconvenience. I thought using floydhub would be sustainable, but it turned out to be very costly in the long run. So I decided to cancel my subscription, and the datasets hosted there were lost as a result.

I'll get back to you regarding the corpus. Would you mind sharing a bit on what you plan to do with the code?

@cnlinxi
Author

cnlinxi commented Aug 1, 2020

@heytitle Sorry for the late reply. I hope to use this model to segment Thai words, and to improve it. I also hope to provide a good Thai text regularization method.

@p16i
Collaborator

p16i commented Aug 8, 2020

@cnlinxi sorry again for my late response. You can find the data at https://codeforthailand.s3-ap-southeast-1.amazonaws.com/attacut-related/data.zip

Please unzip it and make sure the root directory is at ./data. The archive contains the following:
[image: listing of the archive contents]

Only the first two are relevant for training: sampling-0 means the whole dataset is used, while sampling-10 means only 10 files are used. You can use sampling-10 for quick training.
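
Unpacking the downloaded archive so that its contents end up under ./data could be sketched as below. This is only a sketch: the helper name extract_data is hypothetical, and it assumes the zip file already has a top-level data/ folder, which the thread does not confirm. If the check fails, the files would need to be moved into ./data by hand.

```python
import zipfile
from pathlib import Path


def extract_data(zip_path, project_root="."):
    """Extract the archive into project_root and return the ./data path.

    Assumption: the archive contains a top-level "data/" folder.
    """
    project_root = Path(project_root)
    project_root.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(project_root)

    data_dir = project_root / "data"
    if not data_dir.is_dir():
        # The layout differs from the assumption; fix it manually.
        raise FileNotFoundError(f"expected a data directory at {data_dir}")
    return data_dir
```

For example, extract_data("data.zip") run from the repository root would leave the datasets under ./data, provided the assumption about the archive layout holds.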

Before running the training command below, make sure that you have the ./artifacts directory.

python ./scripts/train.py --model-name seq_sy_ch_conv_concat \
 --model-params "embc:8|embs:8|conv:8|l1:6|do:0.1" \
 --data-dir ./data/best-syllable-crf-and-character-seq-feature-sampling-0  \
 --output-dir ./artifacts/model-xx  \
 --epoch 2 \
 --batch-size 1024 \
 --lr 0.001 \
 --lr-schedule "step:5|gamma:0.5"
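
The two preconditions mentioned above (the dataset under ./data and an existing ./artifacts directory) could be checked before launching training with a small sketch like this one. The helper name prepare_dirs is hypothetical and not part of the repository; the default dataset directory name is copied from the --data-dir argument in the command.

```python
from pathlib import Path

# Directory name taken from the --data-dir argument of the training command.
DEFAULT_DATASET = "best-syllable-crf-and-character-seq-feature-sampling-0"


def prepare_dirs(root=".", dataset=DEFAULT_DATASET):
    """Check that the dataset directory exists and create ./artifacts if needed."""
    root = Path(root)

    data_dir = root / "data" / dataset
    if not data_dir.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {data_dir}")

    # train.py is said to require this directory to exist beforehand.
    artifacts_dir = root / "artifacts"
    artifacts_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, artifacts_dir
```

For a quick run, the sampling-10 directory can be passed as dataset instead; its exact name should be read from the unpacked archive rather than guessed.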

@cnlinxi
Author

cnlinxi commented Aug 9, 2020

@heytitle thank you very much. I have trained this model on BEST 2010. Great work:)

@cnlinxi cnlinxi closed this as completed Aug 9, 2020
@charlesfufu

https://codeforthailand.s3-ap-southeast-1.amazonaws.com/attacut-related/data.zip

Are words split by "~" in the "best-syllable-tokenized" dataset?
