Mass-tagging of sentences #785
Requested by Hybrid on Tatoeba Day #8 and Tatoeba Day #9, and by Ricardo14 on Tatoeba Day #11. |
One possible way to implement what Hybrid seems to want to do is to come up with a way to easily add the previously-used tag, similar to how some software has the feature to repeat previously used formatting, etc. This wouldn't really be "mass tagging", perhaps, but would allow members who sequentially contribute things that need the same tag to easily do so. I think doing something like this would help Hybrid do what he wants without unnecessarily doing a lot of programming and introducing the possibility that some members may accidentally mis-tag a lot of sentences. |
Personally, I would find checkboxes very useful for tagging lots of sentences in Italian (with tags like 'OK' and tags related to verb tenses). This would greatly improve the proofreading of sentences by users who have contributed a lot of sentences (more than 1000), since it would be quite painful to do that one by one. An example: from time to time I proofread riccioberto's Italian sentences (https://tatoeba.org/ita/sentences/of_user/riccioberto/ita) to check for typos, but it takes a long time to tag them accordingly, whether I find the sentences correct or in need of changes, because he owns more than 6000. |
+1 |
When I think about the situations where mass-tagging is needed, I imagine the following:
Right now I don't have a clear vision of what would be the best way to implement this. One possibility would be:
Some problems:
|
While this isn't exactly the solution being asked for, one easy solution for anyone wanting to add the same tag to several sentences in a series is to copy the tag, and then just paste it in on the next sentence. That's what I often do. Another, more geeky solution would be to explain to members how they can create a JavaScript bookmarklet that adds a certain tag to the sentence on the page they are on. A member could make several different bookmarklets, one for each tag they often use. This would accomplish what the person requesting this feature wants without cluttering up the pages on tatoeba.org. |
We could also add an extra "field" on the main page (maybe accessible to CMs and Admins) which would display the latest tagged sentences as we have for Latest Contributions , for example - https://tatoeba.org/eng/contributions/latest - and last comments - https://tatoeba.org/eng/sentence_comments/index . |
Here's why I would really like this implemented: since I am learning several languages at the same time and I'm a teacher too, I don't really like to talk to anyone or read anything except in that person's native language. I like to understand what the person is saying, which might not be possible if we don't do that in the native language. There are words that can't even be translated. For example, "tag". Its meaning is roughly equal to "label" or something like it, but we don't say that in Spanish, Portuguese or French; we use something SIMILAR (not equal). Another good example is "pet". It was not translated into Portuguese, for example: "(eu) Fui à um pet shop ontem." ("I went to a pet shop (a vet clinic) yesterday."). And so on. OK, and how is this related to tags? There are many scenarios in which tags are important to me.
I don't take courses here in Brazil: they are too expensive (150 to 200 euros a month, while a teacher earns about 200 euros a month here; the "luckier" ones get 300 to 400 euros).
► If it's a sentence that I can translate fairly easily and that contains (no) words which I haven't studied before, I don't translate it right away; I "mark it as OK", study the words, and then translate it.
► If it's a sentence that I can't translate, I just mark it as OK to study it later. I ask someone else what it means. I also study it on websites such as lang8, duolingo, busuu and livemocha, or even talk to my friends on WhatsApp. After I understand its meaning completely, I remove the mark, translate it and add it to both lists (3351 and 3547).
- Sentences in other languages: I follow the same principles, but I add them to the list https://tatoeba.org/eng/sentences_lists/show/4065 and I often ask someone to translate the sentences in there.
As you see, all of this began with just one tag. A big problem I found is that only English sentences have been tagged (and sometimes Italian and Portuguese ones). I'd like to have all sentences tagged someday...
I teach English, Spanish and Portuguese. I also prepare my materials using Tatoeba and, again, its tags. For example, suppose I am going to teach how we express the weather, no matter in which language. I compare the student's native language with the language s/he is studying with me. (In Portuguese, we say "Está frio" (we don't use the pronoun), and in English "It's cold" (pronoun added).) So I browse the tags for "weather" and see which sentences I can use (especially the ones that are often "omitted" from textbooks here in Brazil). I take notes and add those sentences to a list that I usually delete after the class. Each time I am going to teach about the weather, I do the same thing. As an exercise, I ask my "high-level" students to translate sentences on Tatoeba (after I have watched them translate sentences by themselves, "offline"). I always ask them to translate sentences. Many students come and ask me for sentences in Portuguese in the "presente do indicativo", for example. I tag those sentences and send them the link (or, after tagging, I just send them some sentences).
There are too many sentences in English that have not been translated into Portuguese, and that worries me a lot since many people are studying this language around the world. I mostly use the advanced search, enter the "kind" of sentence I want to translate (Present Simple, Past Simple, location, etc.), and then start translating (e.g.: https://tatoeba.org/eng/sentences/search?query=&from=eng&to=por&orphans=no&unapproved=no&user=&tags=present+simple&has_audio=yes&trans_filter=exclude&trans_to=por&trans_link=direct&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=random ). |
Hybrid says that he usually adds many sentences from the same actor or with the same verb partner, and that this would be easier if there were checkboxes on the sentences page, so he could easily add more than one tag.
One problem for him might be that you can only show 100 sentences or something per page.
He says that's a good point and maybe there could be a button on the right for "add tag" with a textbox and a button for "remove tag" with a textbox (to write which one you want to remove).
He thinks that it would be a good idea to create a log of all the tags that are being added and removed and that it should be accessible to everyone. |
Just my two cents: I don't think this would be a good idea. Limiting it to admins and corpus maintainers (any suggestions about advanced contributors?) would probably prevent abuse of this feature. Mistakes can still happen to anyone, but this could prevent serious damage (e.g. people/bots deliberately vandalizing the corpus by mass-tagging sentences with harmful tags, such as spam, insults, etc.). |
One way to make this feature reliable is to make it available only to some users (members who requested it, members who have expertise, members who know the basics of coding, etc.). I believe it's not that hard to write a kind of script that tags a sentence whenever it's in Portuguese and starts with "Eu", for example. It would save us a lot of time and also improve the corpus quality, since more sentences would carry tags that were created to clarify them. |
https://tatoeba.org/eng/sentences/search?query=%3D%5EEu&from=por&to=none&orphans=no&unapproved=no&user=&tags=&list=&has_audio=&trans_filter=limit&trans_to=und&trans_link=&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=words |
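A rough sketch of the kind of script described above, assuming the sentences are available offline as tab-separated lines of sentence id, language code and text. That layout is an assumption here, and the function name `select_for_tag` and the tag name are made up for illustration. It only works out which sentences would get the tag; it doesn't touch the site.

```python
import csv
import io


def select_for_tag(tsv_text, lang, first_word, tag):
    """Return (sentence_id, tag) pairs for sentences in `lang` whose first word is `first_word`.

    Assumes each line is: sentence id <TAB> language code <TAB> sentence text.
    """
    pairs = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed lines
        sentence_id, sentence_lang, text = row[0], row[1], row[2]
        # Match a whole first word, so "Eu" does not match "Europa"
        starts_with_word = text == first_word or text.startswith(first_word + " ")
        if sentence_lang == lang and starts_with_word:
            pairs.append((int(sentence_id), tag))
    return pairs


# Made-up sample data, not real sentence numbers
sample = (
    "1\tpor\tEu gosto de café.\n"
    "2\tpor\tEuropa é um continente.\n"
    "3\teng\tI like coffee.\n"
)
print(select_for_tag(sample, "por", "Eu", "starts with Eu"))
```

Keeping the selection step separate from the actual tagging would let someone review the list of sentence numbers before anything is changed, which might ease the worries about accidental mis-tagging.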
It was discussed again today: https://tatoeba.org/fra/wall/show_message/32221#message_32221 Again, it would help members who want to help members and non-members study a certain language. I myself use both CK's and Guybrush's tags to study English and Italian. However, as soliloquist says, "I, too, have thousands of sentences that need to be tagged, but it's discouraging having to visit each sentence's page." In other words, we are losing a chance to get more sentences tagged. |
I think making it easier to tag sentences might be possible without adding a complicated mass-tagging feature. Similar to the add-to-list icon, a tag icon that opens a text box for adding tags when clicked could save a lot of time. In this way, we could tag all sentences on pages with multiple sentences without leaving that page and going to each sentence's page separately. It's not mass-tagging, but it'll make tagging easier. |
1. Here's one possibility. Allow the data to be imported in a tab-delimited text file, similar to the way sentences used to be able to be imported. Sentence_number + tab + tag_name
Members could work with the sentences.csv file and the tag.csv files offline. Step 1. Step 2. Step 3. Step 4. This would work for me, and would be similar to the way we used to be able to do this by sending URLs.
2. Here's another possibility. Have a form that only allows importing one type of tag at a time.
3. Similar to 1. Maybe I'd like this one best. Allow one sentence number to have several tags on the same line, comma-delimiting the tags: Sentence_number + tab + tag_name,tag_name2,tagname3,tagname4
Note: For importing, it could be similar to the way sentences were imported, allowing an admin to import tags for other members. Maybe this wouldn't be the ideal way, but it could be a temporary solution until we see how things work. |
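To make the proposed file format concrete, here is a minimal sketch of how such an import file could be read, assuming one sentence number per line, a tab, then one or more comma-delimited tag names. The function name `parse_tag_import` is hypothetical and nothing here reflects how Tatoeba actually imports anything.

```python
def parse_tag_import(text):
    """Parse lines of the form: sentence_number <TAB> tag1,tag2,...

    Returns a dict mapping each sentence number to its list of tag names.
    """
    result = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip().isdigit():
            raise ValueError(f"line {line_number}: expected 'number<TAB>tags'")
        sentence_id = int(parts[0].strip())
        tags = [t.strip() for t in parts[1].split(",") if t.strip()]
        existing = result.setdefault(sentence_id, [])
        for tag in tags:
            if tag not in existing:  # do not repeat a tag for the same sentence
                existing.append(tag)
    return result


# Made-up sentence numbers and tags
example = "1234567\tpresent simple,restaurant\n1234567\trestaurant\n42\tweather\n"
print(parse_tag_import(example))
```

One thing this sketch glosses over: with comma-delimited tags on a line, a tag name that itself contains a comma could not be imported this way, so option 1 (one tag per line) would be the safer fallback for such tags.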
If soliloquist-tatoeba's idea were to be used, it would be a good idea to also show the same thing when any new sentence or translation were added, since that would likely be a time when people would want to tag a sentence. Link to the idea. Perhaps that idea should be another issue. It would be useful to have that function in addition to mass tagging. |
That's a good idea. It would be even better if it included mass-adding sentences to lists, too: Sentence_number + tab + list_number1,list_number2,list_number3....... For example: 1234567 + tab + 876,678,1234
I've opened a new issue for that: #1923 |
Before we consider implementing mass-tagging and how it could be done, we have to take into account that tags are quite messy, and mass-tagging might just make things worse. For the context, tags were implemented back in 2010. After several months, we figured it was quite a mess and tried to tidy them up. Someone rightfully wrote in the comments of the second blog post:
And I think it turned out very true. We were not able to maintain the tags in the long run. Today, tagging is a free-for-all activity. Contributors are not consulting each other before creating new tags. We have many duplicate tags and many "personal" tags. There are some questions I've asked myself, for which I do not really have a clear answer.
Mass-tagging is a solution to something. But to what exactly? If we take the original request, the use-case provided by Hybrid is about adding many sentences from the same source or same author (if the sentence is taken from a book, an article, etc). This doesn't have to be solved with mass-tagging. First of all, there's the assumption that the original author of a sentence should be mentioned as a tag. But why should it be a tag actually? Why don't we have a new field that would store this information? Just like we have a field that indicates the language of a sentence and another field that indicates who the owner of the sentence is, we could have a field that indicates who the original author is in the case of a copied sentence. Why not? Second of all, wouldn't it be more practical to be able to add this information before creating the sentences and not after? To make the analogy with the language selection, let's imagine our language detection was extremely bad and never detected the correct language. If you know that the next 20 sentences you're adding are going to be in English, you would select it in the language dropdown. It's more practical that way than adding 20 sentences with "auto-detect" and then having a feature to mass-edit the language. The same can be done for Hybrid's use-case. If you know that your next 20 sentences are from the same author, then why not have a way to select the author before adding the sentences? So let's forget about mass-tagging for a moment. We need to identify the problems. What are the scenarios in which people have to repeatedly add the same tag? Each scenario could be an issue on its own, and may not be solved with mass-tagging but with something else. Hybrid's scenario is:
CK's scenario, I assume, is something like:
I need concrete scenarios that describe the current working mode and indicate:
Note: I'm leaving this issue open for discussion, but it will eventually be closed and each use-case/scenario of mass-tagging will be handled in a separate issue. |
Then the difference would be rather cosmetic and it wouldn't matter much if I used tags or lists. |
By the way, one difference between tags and lists is that tags are collaborative. All sentences with the same tag can be viewed together no matter how many different users added them. It's more difficult with lists as the list needs to be collaborative and the other users need to be aware of that list. Also, it's not possible yet to merge contents of multiple lists created by different users as discussed on #1704. |
I agree with this and think this is a very powerful function of tags. Also, I think I've mentioned this earlier, but tags can be a very useful tool for students who want to search for sentences that have been tagged with tenses (present simple, etc.), situations (restaurant, etc.), functions (requests, etc.), and so on. Students may want to search for all such sentences with translations into their own languages, or members may want to search for all such sentences that still need translations into their own native languages. I think it would benefit such people, and thus also the Tatoeba Project, to make it easier for educators and researchers to tag sentences, similar to what I've been able to do up to now. |
Just to be clear, when I ask what's the difference between list and tags, I'm not exactly looking to know what's the difference functionally speaking. I have been closely involved in the implementation of both these features, so I know very well what's the difference in terms of functionalities :) I'm more looking to understand how everyone is interpreting the notion of lists and the notion of tags. In other words, forget about how things are implemented now. Just imagine that we implemented all the possible features for lists and all the possible features for tags that you've dreamed of. What then, would be the difference between lists and tags? On the collaborative aspect, just imagine that instead of the endless dropdown to add a sentence to a list, you have a text input with auto-suggestion. Just like tags have auto-suggestion. And so if there was a collaborative list named "Present simple", you would see it as a suggestion when you start typing "Pre...". And imagine it is easy to merge lists. So if I created a list "Present simple" and you created one too, I could easily transfer the sentences from my list into yours if you would agree. And imagine we also have macro-lists. What now? As a contributor, what would drive you to add a tag instead of adding to a list? And I should also ask, as a learner, what would drive you to search in the tags rather than searching in the lists? |
I posted on the Wall in case non-GitHub users want to participate in the discussion: https://tatoeba.org/eng/wall/show_message/32260#message_32260 |
For me, conceptually, a list is an enumeration. Associating a list with a sentence is saying "This sentence is a member of a group." I generally think of a list as serving a particular purpose (keeping track of the next batch of 100 sentences that I want to upload to Anki, for instance). It tends to be of a conceptually finite size that is manageable for its purpose. By contrast, I think of a tag as a descriptor. Associating a tag with a sentence is saying "This sentence has this attribute." It says nothing about how many other sentences have that attribute. As we know, historically, lists could only be downloaded if they had 100 or fewer sentences, and there was no simple way to download sentences with a particular tag. Furthermore, from a list of sentences, a sentence could be assigned to a list without leaving the window, while adding a tag to a sentence required going to that sentence's page. Also, there was no obvious association between a tag and a contributor, while most lists are associated with a single contributor, and even when they're collaborative, there are generally only a small number of contributors who use it (or so I surmise). While the first restriction I mentioned has been changed, and others could as well, history matters. Now that I've created 60+ lists for my personal use, I've gotten used to the rhythm of finishing a list at 100 sentences, uploading it to Anki, marking it inactive, and starting a new list. I wouldn't want to shift to using tags for the same purpose just because I suddenly had that option. That would introduce an inconsistency. I would feel the same way about seeing other people make that shift: after years of seeing lists like "100 Chuvash sentences I want to learn" and tags like "simple present", I would not want to see sentences associated with tags like "100 Chuvash sentences I want to learn" and lists like "simple present". 
I would find that jarring and disorienting (and I think it would seriously confuse new contributors). I can easily see the utility of eliminating the restriction on the size of lists to be downloaded, or introducing mass tagging of sentences, or otherwise allowing sentences to be tagged from list views. But I don't think this should serve as an opportunity for us to erode the useful connotations of "list" and "tag" that we have built up over the course of Tatoeba's existence. That seems like introducing chaos for no good reason. |
I realize I may have sounded like I want to merge lists and tags into one feature, but be assured that it's not the case. I do have my own answer as to what the difference between lists and tags is, but it is my personal definition. While the distinction is somewhat clear to me, what is not clear is whether my definition of lists and tags is a valid definition for Tatoeba. Hence this discussion. It is interesting to me to know what people have to say, because when looking at how lists and tags are used in practice, I feel that not everyone may share the same definition.
So this is interesting. Let's imagine for a moment that we introduce mass-tagging and someone starts tagging sentences with "100 Chuvash sentences I want to learn". Their reasoning is: "it's easier for me to use tags because there's no way of mass-listing".
To elaborate, I'm asking all of this because of several reasons.
|
One reason is that it's possible to search with more than one tag. Here is an example of a search for imperative sentences that would be used in a restaurant. https://tatoeba.org/eng/sentences/search?from=eng&tags=imperative%2C+restaurant |
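For anyone who wants to build such multi-tag searches themselves, here is a small sketch that reproduces the URL above from a language code and a list of tags. It relies only on the `from` and `tags` query parameters visible in that link; the function name `tag_search_url` is made up, and I'm assuming the other search parameters can simply be left out, as they are in the example link.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://tatoeba.org/eng/sentences/search"


def tag_search_url(from_lang, tags):
    """Build a search URL for sentences in `from_lang` carrying all the given tags."""
    # Tags are joined with ", " to mirror the example link; urlencode
    # escapes the comma and turns the space into "+"
    query = urlencode({"from": from_lang, "tags": ", ".join(tags)})
    return f"{SEARCH_URL}?{query}"


print(tag_search_url("eng", ["imperative", "restaurant"]))
```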
My feeling is that just about anybody who has been on the website for long enough to understand how it works should be given rights to tag sentences. Perhaps we need to be a little more careful about who gets the rights to link and unlink sentences, especially unlinking. |
I agree with this, and for the most part, with the exception of the @ tags and the tags for quality, that's how they seem to be used, at least the ones with the most tags. |
If you want to regulate tags, one possibility would be to not allow members to create new tags, but request new tags and have them added by an admin, or perhaps a corpus maintainer. We could further limit tags that only fit into certain categories if you wanted to. This would prevent tags like "100 Chuvash sentences I want to learn". At one time, I suggested sorting tags into categories and have a demo page for that online somewhere. Being able to look at a list of tags sorted by categories would make it more obvious to people what tags are used for. |
On a related note, it might be a good idea to also allow mass untagging of one's own tags. |
A member could go through the exported data, choosing appropriate tags for sentences and then add the tags to sentences without needing to spend all the time that it would take to visit each page, choose a tag, wait for the tag to be added, and then choose the next tag. Mass tagging would save a lot of time, making more efficient use of volunteers' time. Imagine how nice it would be to have a large number of our sentences tagged with at least tenses, functions, situations and maybe a few others. |
That's my main idea for mass tagging sentences. |
It's better to use lists for some purposes like:
On the other hand, tags like '@change' and 'literal translation' may not work well with lists. But for many other purposes (i.e. weather, football, maths etc.), both tags and lists would work fine. It's just a matter of choice which one we use. If we focus on this (rather large) gray area, then it might seem redundant to have both these features. |
I don't agree with this. Tags would be much better, since someone looking for such sentences would want all such sentences, and not just one member's listed sentences. If I wanted to find English sentences in the present simple tense, I'd want to find sentences tagged as such by others and not just sentences on a list I made. |
@ckjpn
|
This may be a little off-topic, but has anyone considered tagging a sentence group rather than a single sentence? By "sentence group" I mean all sentences that are linked to each other, no matter how "far" the link is, no matter how many levels of indirection there are in the group. I believe that some tags, like sports or mathematics, are quite universal, and if a member decides to tag a sentence in their own language, it would be very beneficial for all the other languages involved in the sentence group to get the same tag, translated accordingly. This would make it possible to get sentences classified in languages where few or no members tag sentences. On the other hand, if the tag conveys a concept that cannot be easily transposed to some other languages, it becomes a problem (or even carries a danger of imposing a foreign classification on a different culture). Anyway, my point is that the universality of a tag (or lack thereof) may be another way to look at how to organize and define tags vs. lists. Lists, on the contrary, tend to be tied to a single language, even though it’s possible to create multi-language lists. Unless there are valid use cases for multi-language lists, we might want to enforce a single language for each list in order to better distinguish them from tags. |
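To pin down what "sentence group" means here: it is the set of sentences reachable from a given sentence by following links, however many steps that takes (a connected component of the link graph). A minimal sketch, assuming the links are available as pairs of sentence ids and treating them as going both ways; the function name `sentence_group` and the ids below are invented for illustration.

```python
from collections import deque


def sentence_group(start, links):
    """Return the set of sentence ids reachable from `start` through any chain of links.

    `links` is an iterable of (a, b) pairs of linked sentence ids,
    treated as undirected.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first walk over direct and indirect links
        current = queue.popleft()
        for other in neighbours.get(current, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


# Made-up links: 1-2 and 2-3 form one group, 4-5 another
links = [(1, 2), (2, 3), (4, 5)]
print(sorted(sentence_group(1, links)))
```

Note that sentence 3 ends up in the same group as sentence 1 even though they are only indirectly linked, which is exactly the case where a tag chosen for one sentence might not fit the other.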
Personally, I'd prefer to mass-tag sentences in the same language, so that I'm sure the tag I want to add is being used properly. Mass-tagging whole groups of sentences, in my opinion, might be confusing, since one of the distant, indirectly linked sentences might convey a completely different meaning and context (e.g. a word in one language can mean many different things but has only one form that covers all the meanings, while in a second language it may have a different form for each meaning), so a single tag might not be OK for all the possible sentences. |
@ckjpn, @Guybrush88 I've identified two scenarios of mass-tagging and, as I've said, each use-case/scenario of mass-tagging will be handled in a separate issue. I will create these issues in due time; we are not in a situation of urgency. But if you have another scenario of mass-tagging, then it's a good time to share it here. If not, the rest of the discussion is to figure out what the difference between tags and lists should be. With regard to mass-tagging, this discussion mostly affects its prioritization. In the larger scheme, this discussion will help shape the tags and lists features in the long term.
This is an interesting point of view but we would have an issue with the verb tags ("present simple", "past simple", etc). Tenses are not universal but specific to a language.
I searched for "music" in the lists and found this one: It includes sentences in multiple languages. At the time this list was created, tags were already implemented. The contributor had access to the tag feature but chose nonetheless to make a list. This led me to think of the following use-case: the user may not want to have all the sentences about music, just a custom list. Maybe they find some sentences too boring for their taste, maybe they want to avoid near-duplicates. The user may be browsing the sentences at random and can understand several languages, so whenever they see an interesting sentence about the topic they're collecting sentences for, they would just add it to their list, regardless of the language. I don't think this use-case was the reason why the list I found had multiple languages though. But it would be a valid use-case to me. |
Since we may get sentences tagged as "custom list", perhaps we should also create a way to mass "untag" some specific sentences. |
It makes sense as a use-case for multi-language lists. And this is very similar to Ricardo's use-case: use tags as a way to browse sentences and add some to your personal list as you see fit. Actually, maybe what really distinguishes lists from tags is their scope of use. The scope of tags is the whole corpus while the scope of lists is restricted to one or more individuals’ needs. We can also think about the wording. The word list (and especially my lists) conveys the idea of an individual writing down an enumeration of sentences on a sheet of paper, which already gives some vague idea about the purpose of lists. However, I think tag is very vague. It doesn’t give me any hint about the purpose. A word like category gives me the hint that this can be used to classify. (It’s just an example; I don’t want tags to be reworded into category.) |
I'm thinking of summarizing, compacting, and filtering ideas about this topic. I think a lot has already been said so please do not restart the conversation to say or ask the same thing again. That will only make the work more discouraging and more difficult. I just want to ask one question to the people who contributed here: Would a solution similar to how we can add lists in the new design be enough to fit your needs? Something like @soliloquist-tatoeba described in a comment above. |
@agrodet At least for me, this solution would work fine to speed up mass tagging, I think. |
Same here, @agrodet |
The advantage of allowing mass tagging as I used to be able to do it would be that it would be easier to deal with sentences already in the database when tagging things like tenses, situations and functions. It is much easier and faster to work with text files offline than to have to use a web interface. Adding something for tags similar to how sentences can be added to lists in the new design would definitely be a good idea too, and is somewhat similar to @soliloquist-tatoeba's idea. |
There is an HTTP request (sorry if I named it wrong) for removing a tag from a sentence: https://tatoeba.org/eng/tags/remove_tag_from_sentence/TAG_ID/SENTENCE_ID There are also similar requests for adding/removing sentences to/from lists. My question is: is it possible to add a particular tag (using the tag id) to a particular sentence in a similar manner? If it isn't, would it be possible to add such a function alongside the others? Some users could use it for mass-tagging sentences without much programming skill or effort. |
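As an illustration of how little programming the existing remove request needs, here is a sketch that only builds the URLs from the pattern quoted above for a batch of sentences. It sends nothing; the tag id and sentence ids are invented, the function name `remove_tag_urls` is hypothetical, and I'm assuming (not confirming) that the requests would have to be made from a logged-in session. An add-tag equivalent, if one were provided, could presumably be handled the same way.

```python
# URL pattern as quoted in the comment above
REMOVE_TAG_URL = "https://tatoeba.org/eng/tags/remove_tag_from_sentence/{tag_id}/{sentence_id}"


def remove_tag_urls(tag_id, sentence_ids):
    """Build one remove-tag URL per sentence id, without sending any request."""
    return [
        REMOVE_TAG_URL.format(tag_id=tag_id, sentence_id=sentence_id)
        for sentence_id in sentence_ids
    ]


# Invented ids, for illustration only
for url in remove_tag_urls(123, [1001, 1002, 1003]):
    print(url)
```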
Related Wall Post: https://tatoeba.org/en/wall/show_message/38846#!#message_38846 If mass tagging were possible, then I, or someone else, could easily do what is suggested by this post. |
By Hybrid: