creating a cheatsheet for extracting & pushing clips onto zooniverse #181
I think you may have accidentally run
It is true that you need to provide segments beforehand, which the doc currently does not clearly state. But if you do not provide these segments, you won't be able to go any further.
You need to define a destination (otherwise, an error will be thrown and the script will stop). It is up to the user to decide where to store the output. It might not be within solomon-data; e.g., if you are developing an analysis on the side, you may have imported solomon-data as a subdataset, and the chunks will preferably lie somewhere in your analysis folder. We do not expect every user to push their own chunks to the original dataset in the general case. (Also, honestly, the audio chunks do not need to be kept once they have been uploaded.)
However, in case you do want to push the chunks to the dataset, I am not sure of the best location; you could then set the destination to some path inside solomon-data.
Batch size defines how many of the chunks will be grouped and uploaded together. This reproduces the behavior of Chiara's script, and is apparently needed because of Zooniverse upload rate quotas. --batch-size defines how many chunks each batch should contain; at the upload step, you then define how many of these batches will be uploaded. This way, you can upload n batches the first day, then n more batches the second day, etc. Maybe we could avoid that and have a single option specifying how many chunks should be uploaded at upload time - in that case, we could drop the batch system. Let me rethink this!
We have not. Should we?
This is the equivalent of git add & git commit; you will also need to push the data at some point (the equivalent of git push).
You will also have to set the destination for retrieve-classifications, e.g.: child-project zooniverse retrieve-classifications solomon-data --destination solomon-data/samples/high-volubility/classifications_2021-04-10.csv --project-id XXX. PS: you don't need to cd out of solomon-data; you could just run it from within the dataset. |
I have found a workaround to avoid the batch system, which I implemented in #182. You can try it by installing the package from:
pip install git+https://github.com/LAAC-LSCP/ChildProject.git@zooniverse/improvements --upgrade
Below you can find the upgraded documentation:
$ child-project zooniverse extract-chunks --help
usage: child-project zooniverse extract-chunks [-h] --keyword KEYWORD
[--chunks-length CHUNKS_LENGTH]
[--chunks-min-amount CHUNKS_MIN_AMOUNT]
--segments SEGMENTS
--destination DESTINATION
[--exclude-segments EXCLUDE_SEGMENTS [EXCLUDE_SEGMENTS ...]]
[--threads THREADS]
path
positional arguments:
path path to the dataset
optional arguments:
-h, --help show this help message and exit
--keyword KEYWORD export keyword
--chunks-length CHUNKS_LENGTH
chunk length (in milliseconds). if <= 0, the segments
will not be split into chunks
--chunks-min-amount CHUNKS_MIN_AMOUNT
minimum amount of chunks to extract from a segment
--segments SEGMENTS path to the input segments dataframe
--destination DESTINATION
destination
--exclude-segments EXCLUDE_SEGMENTS [EXCLUDE_SEGMENTS ...]
segments to exclude before sampling
--threads THREADS   how many threads to run on
$ child-project zooniverse upload-chunks --help
usage: child-project zooniverse upload-chunks [-h] --chunks CHUNKS
--project-id PROJECT_ID
--set-name SET_NAME
[--amount AMOUNT]
[--zooniverse-login ZOONIVERSE_LOGIN]
[--zooniverse-pwd ZOONIVERSE_PWD]
optional arguments:
-h, --help show this help message and exit
--chunks CHUNKS path to the chunk CSV dataframe
--project-id PROJECT_ID
zooniverse project id
--set-name SET_NAME subject set display name
--amount AMOUNT amount of chunks to upload
--zooniverse-login ZOONIVERSE_LOGIN
zooniverse login. If not specified, the program
attempts to get it from the environment variable
ZOONIVERSE_LOGIN instead
--zooniverse-pwd ZOONIVERSE_PWD
zooniverse password. If not specified, the program
attempts to get it from the environment variable
ZOONIVERSE_PWD instead |
I have just realised I had forgotten to answer about chunkification. If you do not specify a value for --chunks-length, currently, input segments will not be split (because the default value is zero), but we could change the default to a non-zero value (e.g. 500). |
THIS IS THE MOST UP-TO-DATE VERSION OF THE CHEAT SHEET -- NOT TESTED THE WHOLE THING & the chunkify section needs a second check. Consider also replacing the scripts with commands in other sections.
cheatsheet for zooniverse clip pushing
This is a cheatsheet for extracting & pushing clips onto zooniverse. It works on oberon; it does not work on my home computer (git-annex cannot be downloaded with my OS; not enough space for the audios). I've adapted the zoo example python script and the zoo-phon-data script. I created two separate scripts: one for sampling, one for uploading.
preparation
I start by installing the dataset.
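A hedged sketch of that step (the remote URL is a placeholder, not necessarily the real one):
```
# install the dataset and enter it (remote URL is a placeholder)
datalad install git@gin.g-node.org:LAAC-LSCP/solomon-data.git
cd solomon-data
```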
Then I get the recordings & the VTC annotations, and validate.
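Something along these lines, assuming the usual dataset layout (the recordings/converted and annotations/vtc paths are assumptions for this dataset):
```
# fetch the converted audio and the VTC annotation set, then check the dataset
datalad get recordings/converted
datalad get annotations/vtc
child-project validate .
```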
Both of those steps can be skipped if I already have the data.
Preparing the folder
I'm about to extract many files that can be re-generated if need be, and that take up space and slow down indexing, so even before I generate them, I want to tell DataLad not to pay attention to them. This way, they won't get tracked or pushed. (For more information on avoiding DataLad tracking, see the DataLad documentation.) For our purposes, all we need to do is the following:
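For instance, assuming the chunks will end up under samples/ (the path is an assumption), something along these lines:
```
# keep the soon-to-be-generated chunks out of version control
echo "samples/" >> .gitignore
datalad save -m "ignore generated samples" .gitignore
```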
sampling
Then I sample segments, chunkify, and upload. For sampling, I'll do 250 random CHI vocs + 250 random FEM vocs. I decided to store the sound files in a dedicated folder.
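A hedged sketch of one of the two sampling calls (the sampler subcommand and flag names follow my reading of the sampler docs and may differ between versions; paths are placeholders, so check child-project sampler --help):
```
# sample 250 random CHI vocalizations from the VTC annotations;
# repeat with FEM (and a different destination) for the other 250
child-project sampler . samples/chi random-vocalizations \
  --annotation-set vtc \
  --target-speaker-type CHI \
  --sample-size 250
```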
And I call it like this because all the paths are defined inside the code:
chunkify (not tested)
For chunkification, I'll do 500 ms length and only 2 threads, as I'm on a smaller computer than the cluster. My script looks like this:
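For reference, a hedged CLI equivalent of that step, built only from the flags in the extract-chunks help above (keyword and paths are placeholders):
```
# split the sampled segments into 500 ms chunks, with 2 threads
child-project zooniverse extract-chunks . \
  --keyword random_500ms \
  --segments samples/random/segments.csv \
  --destination samples/random/chunks \
  --chunks-length 500 \
  --threads 2
```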
This step takes a while, so to be on the safe side, I first open a screen, activate the environment, and call the script (like this, because all the paths are defined inside the code):
NOTE! One problem with doing the above is that I didn't overtly define a name for the chunks.csv file to be generated. So alternatively, next time, I could instead do:
upload
For upload, I target our new project and don't batch the chunks, as batching is no longer needed. I directly call the function:
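A hedged draft using only the flags from the upload-chunks help above (project id, set name, paths and credentials are placeholders):
```
# credentials can also be passed via --zooniverse-login / --zooniverse-pwd
export ZOONIVERSE_LOGIN=my_login
export ZOONIVERSE_PWD=my_password
child-project zooniverse upload-chunks \
  --chunks samples/random/chunks/chunks.csv \
  --project-id 12345 \
  --set-name solomon-random-clips
```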
record actions
The next step is to create a record that I did this, by updating the dataset. Here is my command draft for that:
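A hedged sketch of that draft (commit message and sibling name are placeholders; the audio chunks themselves stay out thanks to the .gitignore):
```
# record the new metadata and publish it
datalad save -m "zooniverse campaign: sampled segments and chunk metadata"
datalad push --to origin
```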
get classifications
Eventually, I'll get classifications:
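A hedged draft based on the retrieve-classifications call quoted earlier in this thread (project id and destination are placeholders):
```
# fetch the classifications collected so far
child-project zooniverse retrieve-classifications . \
  --destination samples/random/classifications_2021-04-10.csv \
  --project-id 12345
```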
And repeat the data update.
|
That seems good (a few details: sample_size should be 500 instead of 250 according to your description, and the destination of zooniverse classifications should be something like samples/random instead of samples/high-volubility for consistency, but these are all details/probably typos). However, there are a few issues:
|
thanks for the proofing! I see in the sampler docs that I can specify multiple talkers. If I changed my code to:
will I get 250 of each, or no assurance on this? |
Nope, it will sample uniformly among the union of CHI and FEM segments. So you need to sample them separately if you want the same amount of each. You can then concat the dataframes and save them into one dataframe if that is more convenient for you, however.
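If you prefer working on the CSV files directly rather than via pandas, a shell sketch of the merge could look like this (paths are placeholders; both files are assumed to share the same header):
```
# keep the header from the first file only, then append the rows of both
head -n 1 samples/chi/segments.csv > samples/random/segments.csv
tail -n +2 samples/chi/segments.csv >> samples/random/segments.csv
tail -n +2 samples/fem/segments.csv >> samples/random/segments.csv
```
|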
roger! I fixed a couple of typos and I'm close, but:
|
Your script is working for me - at least the sampling part; I have not tested the zooniverse part. A few suggestions:
|
I tried from oberon, where the error does NOT replicate - but I get a new error. On oberon, with the upgraded package, I checked the VTC annotations (they are there, e.g.:
My naïve reading of the error is that there are fewer vocalizations than I asked for, correct? |
You are right. However, this should not happen with the latest version of the package (I can see from the error that the code is outdated). Can you try upgrading again?
|
|
Neither of the following worked, even in a virtual environment:
however, uninstalling and reinstalling got rid of the error
Then the script runs. |
in the zooniverse section, I got
This was because I was using the raw recordings, rather than the converted recordings. |
I'm very close, but not quite done! On oberon, I'm doing:
And getting:
Note that I added a set_name to my script (although the sample script didn't have this). |
Also, |
Are you sure? I would suggest you run the script in a screen instead, so it keeps running if you get disconnected: you detach with Ctrl+a then d and can reattach later. A sketch of typical screen usage is below.
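For instance (the session name is a placeholder):
```
screen -S chunkify        # start a named screen session
# ... activate the environment and launch the script inside it ...
# detach with Ctrl+a then d; the job keeps running in the background
screen -r chunkify        # reattach later to check on progress
```
|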
Yes, I think they should not be saved. That's like 200,000 files in your case! Remember you can speed up most datalad operations by using the --jobs option. |
I'm sorry, I'm not sure I understand how to fix the situation and/or how to do this better next time. Let me lay out some possible lessons:
So if I had done things properly, I should have done this before actually creating the samples:
Sadly, that's not what I did, so now even basic datalad operations are a struggle. I can keep reading the manual, but if you already know a way in which I can fix my previous error, that would be really helpful! |
I think the best way is the one you described: you can leave your samples inside the dataset, but make sure you add a .gitignore file beforehand. Now, in order to recover a clean dataset, assuming the chunks were added in the last commit, you can do:
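A hedged sketch of that rollback (this assumes nothing else was committed together with the chunks; check before running anything destructive):
```
git log --stat -1         # confirm the last commit is indeed the one adding the chunks
git reset --hard HEAD~1   # rewind that commit
```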
(Something like this should work) For further clean up, you should remove the dangling chunks from the annex as well (see https://git-annex.branchable.com/walkthrough/unused_data/) |
great, and to check whether that's the case, I can just look at the last commit (e.g. git log --stat)? |
it's the last mile! Last error is:
yields:
https://childproject.readthedocs.io/en/latest/zooniverse.html#chunk-upload shows:
I don't see my error, do you? |
The error was that the command should have been:
That did create the subject set in Zooniverse; however, it didn't push any clips. Here is the output:
And a snapshot of the Zooniverse subject section: It looks like the error is that we're exceeding the 10k quota. So I tried again, this time specifying an amount:
Unfortunately, I get the same error:
|
My understanding is that the 10,000 quota is a limit for the whole project, and that you have to ask the administrators to have it increased.
certainly, we'll ask - in fact, we also need to ask if we can bypass the beta phase (given that we already did it with our other project). But before we do that, I'd like to try out the interface with some sample data. Is there a way in which I can push up just a few clips? I thought the "amount" flag did that. |
The --amount flag does exactly that - at least it should. But you've reached your project quota already, so even one clip (--amount 1) will be too much |
but there are no subjects -- so how can it think that we've gone over our quota? Also, notice that in my screenshot, it says "The project has 0 uploaded subjects. You have uploaded 0 subjects from an allowance of 10000. Your uploaded subject count is the tally of all subjects (including those deleted) that your account has uploaded through the project builder or Zooniverse API. Please contact us to request changes to your allowance." |
Weird! Could be because too many subjects were uploaded the first time and the upload did not complete (because of the exception thrown by the API). I'll try to see if there's a way to find invisible subjects like this. I realised I have access to your project, so I can take care of it.
not urgent, but if we could get a couple of subjects in there, so I can test the project's interface, that would unblock me to ask them for permission etc. |
Well, I just managed to get chunks through, on the same project and subject set. Can you give it another try? Try low values for --amount (e.g. 1 to begin with). |
note that this affects my account specifically (not the project) |
This cheatsheet is outdated! Look at https://gin.g-node.org/LAAC-LSCP/zoo-campaign#comparing-zooniverse-annotations-with-other-annotations instead. |
Hi,
I'm trying to create a cheatsheet for myself for extracting & pushing clips onto zooniverse. I'll always do this on oberon, so I'll only think of that case.
So far I have:
But that last step fails:
Just in case, I pushed along and the next step worked:
I'm uncertain as to the following step. I think I should select segments somehow -- but in the docs the next step is the chunkification of the segments.
About chunkification, I currently have this draft of the command:
Should destination be inside solomon-data or somewhere else? What happens if I leave it unspecified? Same for chunks-length & batch-size. Could/should we have a default behavior such that the chunks are created inside solomon-data in some place where the actual mp3/wav files don't get included in the dataset but the metadata etc. does?
What is batch-size, actually? Why is it declared at the chunkification stage in addition to the upload stage? I just saw that in the upload stage this is optional -- shouldn't it be mandatory (or have a default of 1000) at the upload stage?
The step after that is chunk upload. Here is my command draft:
Have we decided on a naming convention for the prefix?
The next step is to create a record that I did this, by updating the dataset. Here is my command draft for that:
Eventually, I'll get classifications:
And repeat the data update.