Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial italian rules #29

Merged
merged 15 commits into from Jul 18, 2019

Conversation

@alex179ohm
Copy link
Contributor

commented Jul 7, 2019

Fix english rules with new rules name (replace alphanumeric with letter) and adds an initial italian rules file.

@MichaelKohler
Copy link
Contributor

left a comment

Thanks! Can you please revert the changes to the english.toml file as those are part of https://github.com/Common-Voice/common-voice-wiki-scraper/pull/27/files#diff-c5c25e22f88e44782e309f420579ef27R4 ?

Also, as far as I can see, the added italian file is identical to the English one. Have you tested it? For German I just needed to add quite a few rules.

With #18 this will get easier as we will be able to run this in a step in the cloud, until then testing is manual though.

@Mte90

This comment has been minimized.

Copy link

commented Jul 8, 2019

In our discussion we said to try with a first test and see if we need more symbols to exclude as example.

@alex179ohm

This comment has been minimized.

Copy link
Contributor Author

commented Jul 8, 2019

@MichaelKohler yes the italian file rules is the same as the english one, maybe after some tests, we start adjusting some rules, the english file rules seems a good start to implement the italian scraper.
incoming commits might will contain more accurate italian rules.

With #18 this will get easier as we will be able to run this in a step in the cloud, until then testing is manual though.

very appreciated :)

@Mte90

This comment has been minimized.

Copy link

commented Jul 9, 2019

I got that error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { inner: ErrorInner { kind: Custom, line: None, col: 0, message: "missing field `min_alphanumeric_characters`", key: [] } }', src/libcore/result.rs:1051:5
@Mte90

This comment has been minimized.

Copy link

commented Jul 9, 2019

I added to silent errors:

min_alphanumeric_characters = 1
quote_start_with_alphanumeric = true
needs_alphanumeric_start = true

I think that the issue happens also in the other languages because they don't have that rule.

@MichaelKohler

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2019

Please merge latest master in your branch, the rename is there: https://github.com/Common-Voice/common-voice-wiki-scraper/pull/27/files

@Mte90

This comment has been minimized.

Copy link

commented Jul 9, 2019

I am testing but I see that export a lot of names that I don't think are useful a lot:

Virgil Jacomini
Donald Johnston
Clyde King
Alden Sanborn
Alfred Lindley
James Stillman Rockefeller
Frederick Sheffield
John Brinck
Marvin Stalder
Autore del brano è Michel Berger.
David Dunlap
Duncan Gregg
Burton Jastram
Joe Rantz
Frank Shakespeare
William Fields
Molti dei suoi libri sono ispirati ai suoi viaggi in Medio Oriente.
Africa e Asia.
James Dunbar
Wayne Frye
Charles Manring
Richard Murphy

There are also other words not very useful:

Es.
F. H.
a.C.
Tom M.
ingl.
All'art.
È solitamente abbreviato con il simbolo ©.
Volsinia di E.
In altre parole: se la v.c.
Chaco
Città dell'.
Slavs!
Iser

Maybe we can do a regex that ignore this cases?

Anyway this is the export https://send.firefox.com/download/1ab37650c1d06b62/#niVfyAaS10WfxD86Bu18dA

@alex179ohm

This comment has been minimized.

Copy link
Contributor Author

commented Jul 11, 2019

The latest commit should fix some unwanted behaviors (match-name cases).
And fix some clippy warnings on checker.rs and extractor.rs.

@@ -0,0 +1,12 @@
min_trimmed_length = 3
min_word_count = 1

This comment was marked as resolved.

Copy link
@Mte90

Mte90 Jul 11, 2019

I think that is better to put 2, so we can avoid cases:

Es:
Algol
quote_start_with_letter = true
needs_punctuation_end = false
needs_letter_start = true
needs_uppercase_start = false

This comment was marked as resolved.

Copy link
@Mte90

Mte90 Jul 11, 2019

True so we can avoid broken sentences like:
in matematica.

@Mte90

This comment has been minimized.

Copy link

commented Jul 11, 2019

I am wondering if it is possible to put a text characters limit, because for sentence collector and cv, we had a limit of 125 chars per sentence.

@Mte90

This comment has been minimized.

Copy link

commented Jul 11, 2019

We need something to ignore sentences like:

Siano "A" = {"a","a","a"} e "B" = {"b","b"}.
Se l'esponente "n" è pari, la disuguaglianza è valida per "ogni" numero reale "x".
La disuguaglianza può anche essere generalizzata ad un qualsiasi esponente reale "r".
In altre parole: se la v.c.
Il nome "disprosio" deriva dal greco "dysprositos", "difficile arrivarci".
Il primo romanzo, "Dirk Gently.

I think that for the " we can do a regex that ignore sentences with ", so we can avoid maths stuff but also wrong sentences like the last one (usually happen with film titles that are truncated) and for the v.c. case if there are more then one point to ignore the sentence.

Anyway now is working better then before, there is only the issue of english surname that sometimes are very difficult.

alex179ohm added some commits Jul 11, 2019

Fix italian rules #29
Change:
   min_word_count = 2
   needs_uppercase_start = true
@Mte90

This comment has been minimized.

Copy link

commented Jul 11, 2019

There are also letters not from the italian accents because are words from other languages that we don't use and can create issues:

Una gigantessa poteva anche essere detta "gýgr".
Attacchi aerei anche su Grödig, Hallein, Bischofshofen, Schwarzach.
Deșteaptă-te, Române!
Il testo di "Deșteaptă-te, Române!
Il suo simbolo non ufficiale è Λ.

Maybe we need to see the alphabet with the italian accented letter included and we are done.

And also sentences with " without ending, probably is an issue also for other languages:

L'anno seguente fu acquistata dalla "Commodore International Ltd.
Esso non è un bene per nessuno"".
Nemini nimium bene est significa letteralmente: ""perché aspiriamo al superfluo?
@Mte90

This comment has been minimized.

Copy link

commented Jul 11, 2019

Opened a ticket about the " issues #32
So we can focus on the alphabet to use, we don't need the full latin alphabet.

@Mte90

This comment has been minimized.

Copy link

commented Jul 12, 2019

Now is perfect we need only to define the alphabet and symbols.
We have still issues like: Nella sua recensione nel "The New Yorker", la Kael affermò "È un grande film. but I think that is something that need to be fixed in the scraper itself.

If we change the references to letters with: [a-zA-ZàèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ] I think that we are done, for other languages there is https://www.rexegg.com/regex-interesting-character-classes.html

@nukeador

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2019

Can we make this PR which with modifications to the Italian file so I can merge?

Thanks!

@Mte90

This comment has been minimized.

Copy link

commented Jul 15, 2019

There is the settings to disable symbols https://github.com/Common-Voice/common-voice-wiki-scraper/blob/master/src/rules/german.toml I think that we can start from there are and add the german (umlaut) and other maths symbols

@Mte90

This comment has been minimized.

Copy link

commented Jul 15, 2019

disallowed_symbols = [
  '<', '>', '+', '*', '\', '#', '@', '^', '[', ']', '(', ')', '/', '»', '«',
  'â', 'ă', 'ș', 'ç', 'Å', 'ý', 'ö', 'ā', 'ü', 'ö', '%', '&',
  'α', 'β', 'Γ', 'γ', 'Δ', 'δ', 'ε', 'ζ', 'η', 'Θ', 'θ', 'ι', 'κ',
  'Λ', 'λ', 'μ', 'ν', 'Ξ', 'ξ', 'Π', 'π', 'ρ', 'Σ', 'σ', 'ς', 'τ',
  'υ', 'Φ', 'φ', 'χ', 'Ψ', 'ψ', 'Ω', 'ω', '©'
]

And we are done

@Mte90

This comment has been minimized.

Copy link

commented Jul 18, 2019

ready to be merged :-D

@nukeador

This comment has been minimized.

Copy link
Contributor

commented Jul 18, 2019

This PR also modifies other files, can you make sure it only modifies the Italian rules?

Thanks!

@Mte90

This comment has been minimized.

Copy link

commented Jul 18, 2019

@alex179ohm is a Rust developer (fron the Rome meetup) and I guess that he did that to improve the code quality and avoid issues in the future, do you prefer a second pr for this (tiny) changes?

@nukeador

This comment has been minimized.

Copy link
Contributor

commented Jul 18, 2019

Yes, please.

@Mte90

This comment has been minimized.

Copy link

commented Jul 18, 2019

done

@nukeador nukeador merged commit 3dca539 into Common-Voice:master Jul 18, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.