Add new word lists - top10000, top25000 and commonly-misspelled #27
Samyak2 merged 8 commits into Samyak2:main from notjedi:main
|
@Samyak2 there is a 10k word list too, i can add that if you want. |
|
Thank you for this!
Yes, that would be great. The commonly misspelled list would be a nice addition too. I had a few concerns though:
Sorry for the late reply. I may be late to reply for the next 3-4 days too. |
|
it's weird. the check passes for me locally and i don't have any local changes. any idea what causes this? @Samyak2 |
EDIT: i just mailed the author of monkeytype asking if we can use the word lists and the source of the word lists. will let you know once i get a reply from him. |
Looks like a locale issue. The script runs fine on my system too. But when I change line 6 in the script to:
LC_COLLATE=POSIX sort -c -d "src/word_lists/$f"
I can reproduce the issue locally too. The CI can be fixed by changing line 6 to:
LC_COLLATE=en_US.UTF-8 sort -c -d "src/word_lists/$f" |
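To illustrate why the collation locale matters here, below is a minimal sketch, not the project's actual script. The helper name `check_sorted` and the sample words are made up for illustration. It assumes a `sort` that supports `-c` and `-d`, and it clears `LC_ALL` for the call because `LC_ALL`, when set, overrides `LC_COLLATE`.

```shell
#!/bin/sh
# check_sorted LOCALE FILE (hypothetical helper, not part of the repo's script)
# Succeeds if FILE is already in dictionary order under the given collation locale.
check_sorted() {
    # An empty LC_ALL is treated as unset, so LC_COLLATE actually takes effect.
    LC_ALL= LC_COLLATE="$1" sort -c -d "$2" 2>/dev/null
}

# Made-up sample: a lowercase word followed by a capitalised one.
sample=$(mktemp)
printf 'apple\nBanana\n' > "$sample"

# POSIX collation compares bytes, and 'B' (66) sorts before 'a' (97),
# so this file counts as unsorted under POSIX.
if check_sorted POSIX "$sample"; then
    echo "sorted under POSIX"
else
    echo "not sorted under POSIX"
fi

rm -f "$sample"
```

Under a locale such as en_US.UTF-8, collation is typically case-insensitive at the first level, so the same file would usually pass. That difference would explain a check that passes locally and fails in CI when the two environments default to different locales.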
That's not right. Monkeytype is licensed under GPL-v3. This means that any work deriving from it must also be licensed under GPL-v3. Copying wordlists from it can also be considered a derivative work and will require changing the license of toipe to GPL-v3, which I wouldn't want to do.
Thanks! Though, unless there's a special license given by the author, we cannot use these wordlists. We could use word lists directly from the source if we get it though. |
|
This PR is towards #17 (mentioning it to create a back link) |
|
cool, fixed the scripts. do you think we should ping him here in this issue? EDIT: his username is |
Looks good. Thanks!
I don't think that's a good idea. Could you cc me in the email instead? My email can be found on this page. |
|
cool, i'll do it tomorrow? |
Sure |
|
@Samyak2 sorry for the delay, i totally forgot about this. i cc'ed you in that mail, please do check in on that. |
|
did you check your mail? he is okay with us using the word lists. good for us ig |
I haven't received this mail. Can you forward it to me? Sorry for the late reply |
|
Got the mail - that's great! I'll take a final look and merge the PR |
Also linked to exact files in monkeytype instead of just the main repo.
LGTM!
Thank you for contributing! And thank you for doing all of the extra work to get an approval from Monkeytype's maintainer ❤️
I changed some of the docstrings to make the cargo doc look nicer and more appropriate for the new word lists.
One improvement/suggestion for the future - these new word lists add around 300KB of size to the release binary, which is quite significant compared to the total binary size (1.3MB -> 1.6MB). As we'll have more word lists, it will get even larger. Perhaps it would be a good idea to have some kind of compile-time compression on these files. They will then have to be decompressed at run time to read the words. I'll open an issue for this.
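As a rough way to gauge how much compression might save before committing to a build-time approach, the sizes can be compared from the shell. This is only a sketch: the helper name `size_report` is hypothetical, `gzip` is used as a stand-in for whatever compressor would eventually be chosen, and the numbers say nothing about the run-time decompression cost.

```shell
#!/bin/sh
# size_report FILE (hypothetical helper)
# Prints the raw and gzip-compressed size of FILE in bytes.
# Leaves the two values in the shell variables `raw` and `packed`.
size_report() {
    # The arithmetic expansion strips padding that some `wc` versions emit.
    raw=$(( $(wc -c < "$1") ))
    packed=$(( $(gzip -9 -c "$1" | wc -c) ))
    echo "$1: $raw bytes raw, $packed bytes gzipped"
}
```

It could then be run over the word list directory, for example `for f in src/word_lists/*; do size_report "$f"; done`.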

the word list is taken from monkeytype.
here is a link to a few other word lists: