Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial addition of the Russian language #66

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

iLeonidze
Copy link

@iLeonidze iLeonidze commented Jan 5, 2020

How many sentences did you get at the end?

2,118,974 lines on output, 1,871,107 after dedup (cat wiki.ru.txt | sort | uniq | wc -l).

How did you create the blacklist file?

  1. A few bad words, which were found in results
  2. A lot of abbreviations

Review

For review please use sample file.

Reviewers:
@swenkal @spartedo @IvanParadox

Copy link
Member

@MichaelKohler MichaelKohler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've added some comments for minor improvements. Can't say anything about the actual sentences though, as I don't speak Russian ;)

src/rules/russian.toml Outdated Show resolved Hide resolved
src/rules/russian.toml Outdated Show resolved Hide resolved
@swenkal
Copy link

swenkal commented Jan 5, 2020

I checked two segments of "sample file":

  1. 169087 - 169300 lines
  2. 548455 - 548704 lines

1 segment:
169170 line. "Ей, угодниче Божий!"
Weird sentence. It's look like old slavonic lang

169258 line. "В году в карьере актрисы наметился спад."
Missed year numbers. Therefore the meaning of the sentence is lost

169286 and 169288 lines are the same.
"Исполняли функции на общественных началах."

2 segment - normal. Did not find some abbreviations, syntax or gramma errors.

@iLeonidze
Copy link
Author

@swenkal thank you.

Weird sentence. It's look like old slavonic lang

It's ok, The language is large, it has many different styles of speech, especially since we do not need any particular one.

Missed year numbers

I found the original Wikipedia article, and actually there were numbers in the original sentence. I don’t know where they went, it’s not my settings, it’s nothing to fix here either.

lines are the same

This issue already exists #14

@iLeonidze
Copy link
Author

@MichaelKohler I updated the rules file according to your comments, please check.

@nukeador
Copy link

nukeador commented Jan 7, 2020

Thanks for leading this, super excited to see Russian coming!

For the QA of the outputs, my recommendation would be to create 4 files with 100 randoms sentences each, and get a native speaker (ideally linguist) to review them and comment back with the number of sentences (you can use this template spreadsheet for easy evaluation).

On your comment about the blacklist file, it seems you are talking about the rules file, did you tried the process of removing less used words? For example in Spanish we added to the blacklist words with 80 repetitions or less, and that allowed to clean up most complex or foreign words.

Thanks for your efforts!

@iLeonidze
Copy link
Author

@nukeador first QA already checked ~400 sentences, please see his reply from above. As for others, we are waiting for them to check. Is @swenkal's reply format is ok for you? I suppose spreadsheet is a bit complex for this.

As for blacklist file - I do not see the reason for removing less used words (as I understand you correctly). First of all it is difficult to find out them in 2 million sentences. Second, in Wikipedia articles there is a good quality speech-style, may be there will be a small number of strange sentences, but among two million lines, this is a drop in the bucket.

@stefangrotz
Copy link
Contributor

stefangrotz commented Jan 7, 2020

As for blacklist file - I do not see the reason for removing less used words (as I understand you correctly). First of all it is difficult to find out them in 2 million sentences. Second, in Wikipedia articles there is a good quality speech-style, may be there will be a small number of strange sentences, but among two million lines, this is a drop in the bucket.

Hey iLeonidze, I really recommend using a blacklist, it deletes a lot of foreign words and typos. It is not difficult to generate this blacklist, you can find a instruction in the readme.md of this repo. If you don't want to loose too many sentences you can decrease the number of repetitions, e.g. you can say you only delete words that are used less than 40 or 60 times on the russian wikipedia.

@iLeonidze
Copy link
Author

@stefangrotz thank you, I did not see at first that there is a script that removes rarely used words.
Of course, I will try and make changes with blacklist as you need it, but can you explain or give me a link, why do I need to get rid of such words?

@stefangrotz
Copy link
Contributor

stefangrotz commented Jan 7, 2020

@iLeonidze I can only give you examples from other languages. The list will mainly contain typos. For example "Wikipediaa" instead of "Wikipedia". This will maybe appear 60 times in the complete corpus and this blacklist with every word that appears less than 80 time helps to avoid words like this. You can also delete valid words from the blacklist manually if you want to keep them.

Copy link
Member

@MichaelKohler MichaelKohler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made two comments. After fixing that, I'd suggest to re-run the export and create sample files out of that. When running the export, I don't think you need to wait until it's fully done, once you have a good chunk of sentences, you can abort and create the samples from there.


allowed_symbols_regex = "[А-Яа-яёЁ\\s:,.\\-‑?;!—­‐–'·=’―−”‘]"

disallowed_symbols = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this won't be used as allowed_symbols_regex is set. However, given that those invisible chars should not be allowed by the above regex, I guess that's fine and it can be completely removed?

Copy link
Author

@iLeonidze iLeonidze Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. As I found out invisible chars are not detected by regex for some reason. Perhaps because they are part of \s. Therefore, they had to be defined here, and can't be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's probably what is happening here. Could you just use the specific character you want for whitespace instead of \s or is my Regex knowledge completely failing me here?

# с Иваном Галамяном.
needs_uppercase_start = true

allowed_symbols_regex = "[А-Яа-яёЁ\\s:,.\\-‑?;!—­‐–'·=’―−”‘]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that your previous version excluded all quotes and some other characters as it lead to uneven symbol occurances in the sentences. This now allows quotes, so you might end up with sentences with wrong symbols. I'd suggest to add these to the even_symbols config option to make sure quotes (and possibly other symbols) are always even.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The may difference with the previous version, is that I allowed symbol - that is ok, in Russian there can be words like "Côte d'Ivoire".

Please, can you provide more info about even_symbols, I can't find detailed description about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this regex here, you could potentially get the following sentence: This is ”uneven and suddenly stops the quote without terminating symbol

Regarding even_symbols, can you tell me what exactly is not clear from the README? Then we can adjust the README to provide better info

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you have a chance to look at this? Is there anything I can help with?

@MichaelKohler
Copy link
Member

@iLeonidze did you see my last comment? Is there anything I could help you with?

Can I also ask you to do the following, to make sure you can profit from the automatic sample extraction we just introduced?

  • Update your branch with the latest code from the master branch
  • Rename src/rules/russian.toml to src/rules/ru.toml
  • Rename src/rules/disallowed_words/russian.toml to src/rules/disallowed_words/ru.toml

Also note that the local command for extraction will now be:

cargo run -- extract -l ru -d path/to/files

Happy to answer any question you may have and thanks for your efforts!

@iLeonidze
Copy link
Author

@MichaelKohler Hi, I’m already a little busy for a couple of months, I really want to get back to your help as soon as possible. Thank you for the information, I will try

@Nihisil
Copy link

Nihisil commented Jul 4, 2020

@iLeonidze thanks for your work. Is there any update on your side?

@MichaelKohler MichaelKohler marked this pull request as draft September 1, 2020 16:10
@MichaelKohler MichaelKohler changed the base branch from master to main October 27, 2020 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants