Skip to content

Add new word lists - top10000, top25000 and commonly-misspelled#27

Merged
Samyak2 merged 8 commits intoSamyak2:mainfrom
notjedi:main
May 7, 2022
Merged

Add new word lists - top10000, top25000 and commonly-misspelled#27
Samyak2 merged 8 commits intoSamyak2:mainfrom
notjedi:main

Conversation

@notjedi
Copy link
Contributor

@notjedi notjedi commented Apr 14, 2022

the word list is taken from monkeytype.

here is a to few other word lists:

  1. english commonly mispelled
  2. english 10k
  3. english 450k

@notjedi
Copy link
Contributor Author

notjedi commented Apr 14, 2022

@Samyak2 there is a 10k word list too, i can add that if you want.

@Samyak2
Copy link
Owner

Samyak2 commented Apr 15, 2022

Thank you for this!

@Samyak2 there is a 10k word list too, i can add that if you want.

Yes, that would be great. The commonly misspelled list would be a nice addition too.

I had a few concerns though:

  • Is there a common source where these words were taken from? If so, they should be credited. We should also check the license for the word list (what if it's an exclusive license for use in monkeytype?)
  • Nevertheless, monkeytype should be credited as the source for this wordlist in the code, as a doc comment (/// Doc comment).
  • The tests are failing because the word list is not sorted correctly. Check the assumptions made by toipe for wordlists.

Sorry for the late reply. I may be late to reply for the next 3-4 days too.

@Samyak2 Samyak2 added the words Word selection, word lists, language support, etc. label Apr 15, 2022
@notjedi
Copy link
Contributor Author

notjedi commented Apr 16, 2022

it's weird. the check passes for me locally and i don't have any local changes. any idea what causes this? @Samyak2

image

@notjedi
Copy link
Contributor Author

notjedi commented Apr 16, 2022

  1. as for the lists are concerned, i will add the 10k and commonly misspelled lists too.
  2. the word lists are contributed by the users, so i don't think there is any license to that. but we should of course credit monkeytype for the word lists.

EDIT: i just mailed the author of monkeytype asking if we can use the word lists and the source of the word lists. will let you know once i get a reply from him.

@Samyak2
Copy link
Owner

Samyak2 commented Apr 17, 2022

it's weird. the check passes for me locally and i don't have any local changes. any idea what causes this? @Samyak2

Looks like a locale issue. The script runs fine on my system too.

But when I change line 6 in the script to:

LC_COLLATE=POSIX sort -c -d "src/word_lists/$f"

I can reproduce the issue locally too. POSIX (or C) is the default locale that is used if not set, which is what is happening in GitHub CI environment I suppose. The locale for me was en_IN.UTF-8 (which you can check using echo $LANG) which probably considers 'A' and 'a' to be the same char when sorting.

The CI can be fixed by changing line 6 to:

LC_COLLATE=en_US.UTF-8 sort -c -d "src/word_lists/$f"

@Samyak2
Copy link
Owner

Samyak2 commented Apr 17, 2022

  1. the word lists are contributed by the users, so i don't think there is any license to that

That's not right. Monkeytype is licensed under GPL-v3. This means that any work deriving from it must also be licensed under GPL-v3. Copying wordlists from it can also be considered a derivative work and will require changing the license of toipe to GPL-v3, which I wouldn't want to do.

EDIT: i just mailed the author of monkeytype asking if we can use the word lists and the source of the word lists. will let you know once i get a reply from him.

Thanks! Though, unless there's a special license given by the author, we cannot use these wordlists. We could use word lists directly from the source if we get it though.

@Samyak2
Copy link
Owner

Samyak2 commented Apr 17, 2022

This PR is towards #17 (mentioning it to create a back link)

@notjedi
Copy link
Contributor Author

notjedi commented Apr 18, 2022

cool, fixed the scripts. do you think we should ping him here in this issue?

EDIT: his username is Miodec.

@Samyak2
Copy link
Owner

Samyak2 commented Apr 20, 2022

cool, fixed the scripts

Looks good. Thanks!

do you think we should ping him here in this issue?

I don't think that's a good idea. Could you cc me in the email instead? My email can be found on this page.

@notjedi
Copy link
Contributor Author

notjedi commented Apr 20, 2022

cool, i'll do it tomorrow?

@Samyak2
Copy link
Owner

Samyak2 commented Apr 20, 2022

cool, i'll do it tomorrow?

Sure

@notjedi
Copy link
Contributor Author

notjedi commented Apr 26, 2022

@Samyak2 sorry for the delay, i totally forgot about this. i cc'ed you in that mail, please do check in on that.

@notjedi
Copy link
Contributor Author

notjedi commented Apr 26, 2022

did you check your mail? he is okay with us using the word lists. good for us ig

@Samyak2
Copy link
Owner

Samyak2 commented May 7, 2022

did you check your mail? he is okay with us using the word lists. good for us ig

I haven't received this mail. Can you forward it to me?

Sorry for the late reply

@Samyak2
Copy link
Owner

Samyak2 commented May 7, 2022

Got the mail - that's great! I'll take a final look and merge the PR

Also linked to exact files in monkeytype instead of just the main repo.
Copy link
Owner

@Samyak2 Samyak2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Thank you for contributing! And thank you for doing all of the extra work to get an approval from Monkeytype's maintainer ❤️

I changed some of the docstrings to make the cargo doc look nicer and more appropriate for the new word lists.

One improvement/suggestion for the future - these new word lists add around 300KB of size to the release binary, which is quite significant compared to the total binary size (1.3MB -> 1.6MB). As we'll have more word lists, it will get even larger. Perhaps it would be a good idea to have some kind of compile-time compression on these files. They will then have to be decompressed at run time to read the words. I'll open an issue for this.

@Samyak2 Samyak2 merged commit c875be7 into Samyak2:main May 7, 2022
@Samyak2 Samyak2 mentioned this pull request May 7, 2022
7 tasks
@Samyak2 Samyak2 changed the title feat: add new 25k word list Added new word lists - top10000, top25000 and commonly-misspelled May 7, 2022
@Samyak2 Samyak2 changed the title Added new word lists - top10000, top25000 and commonly-misspelled Add new word lists - top10000, top25000 and commonly-misspelled May 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

words Word selection, word lists, language support, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants