adding table_prefix option to prevent database table collisions for auto-generated tables #495
Hmm, the current default for table_name in psiturk 3 is "assignments" instead of the former "turkdemo" because I didn't understand the multi lab user use case. Prepending "assignments" probably would be weird. Can you describe the shared use case approach? In this approach, is each lab user running a unique experiment? If so, I have been reading the sqlalchemy docs and they support "schemas," where each schema can have its own copy of a table, and collisions are avoided. The schema name becomes a prefix of sorts in the underlying query, like "exp1.assignments" and "exp2.assignments." A single database can have many schemas, and heroku says having up to 50 schemas for a given db should be fine. One thing that schemas would not work for is people who want to share one single table of assignments for purposes of blocking participants, differentiating studies by prefixes to the values they set for code_version. For that reason, I might want to avoid the explicit schemas route. But maybe have a table_prefix config var that is blank by default, but if set to a string, then that string becomes the prefix for the two tables you mentioned. Does that sound okay? It's more complex and less intuitive than schemas, but maybe that's okay. |
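To make the schema idea above concrete, here is a minimal sketch using only Python's standard `sqlite3` module. It is an illustration, not psiturk code: SQLite's attached databases are used as a stand-in for PostgreSQL-style schemas (an assumption made so the example runs without a database server), and the column names and sample values are hypothetical.

```python
import sqlite3

# Stand-in for schemas: in SQLite, each attached database gets its own
# name that qualifies table names, much like "exp1.assignments".
conn = sqlite3.connect(":memory:")
for schema in ("exp1", "exp2"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")
    conn.execute(
        f"CREATE TABLE {schema}.assignments "
        "(uniqueid TEXT PRIMARY KEY, code_version TEXT)"
    )

# The same table name exists twice without colliding.
conn.execute("INSERT INTO exp1.assignments VALUES ('worker1:a1', '1.0')")
conn.execute("INSERT INTO exp2.assignments VALUES ('worker1:a2', '1.0')")
conn.commit()

exp1_rows = conn.execute("SELECT uniqueid FROM exp1.assignments").fetchall()
exp2_rows = conn.execute("SELECT uniqueid FROM exp2.assignments").fetchall()
```

Each experiment sees only its own `assignments` rows, which is also why this route does not give a single shared assignments table for blocking participants across studies.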
Yes, the use case I think is pretty typical. A lab might have one mysql server running at a fixed location like a lab server. Multiple researchers might be running experiments on that database at once, or a researcher might be running more than one psiturk experiment at once, but using the same database. In some cases, an IRB might require a particular database system. For instance, an IRB may say you have to use the MySQL database infrastructure managed by the university and can't put PPI on random servers. The overhead of making multiple databases for each experimenter might be non-zero depending on the management tools provided by the university, say. This just gives people the option to use 1 DB and multiple psiturk tasks. Re: the prefix option as opposed to the flag, that sounds good.... |
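actually if assignments is the default table name it feels like they should all just be autogenerated and the table_name used as the prefix? would make more sense that amt_assignments amt_hit and amt_campaigns are sort of standardized perhaps with the option to prefix them as needed? |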
Yeah I agree long term that having a way to override the "assignments"
table name is pointless and should be replaced with table_prefix in name
and use. And also that the name amt_campaigns makes more sense. But perhaps
not amt_assignments... I use psiturk for "lab" mode, which uses the
assignments table.
But alas, semantic versioning, I'm not ready to bump to psiturk v4 for this
lol, which would mean we'd need to keep table_name working as-is to stay at
v3.
For your use case, prefixing everything makes sense, and marking the
table_name field as deprecated. That calls for a schema I think? Should be
clean:
https://docs.sqlalchemy.org/en/14/core/metadata.html#specifying-a-default-schema-name-with-metadata
Querying schemas can be tricky as only the psql default "public" schema
gets searched by default. Maybe we just hack the table name for now and
avoid schemas, on those grounds.
I'm unsure how to continue to unofficially do a shared assignments table.
Maybe we just don't, and later that becomes a new feature somehow. There
was that one guy on the Google group who wanted to run like three different
studies simultaneously, but to not allow anyone to do more than one.
Psiturk is just not set up for that without some kind of centralized lab
server.
…On Mon, Apr 26, 2021, 10:21 PM Todd Gureckis ***@***.***> wrote:
actually if assignments is the default table name it feels like they
should all just be autogenerated and the table_name used as the prefix?
would make more sense that amt_assignments amt_hit and amt_campaigns are
sort of standardized perhaps with the option to prefix them as needed?
|
well the most straightforward options then are
now that I write it, perhaps 4 is the most sensible because it also calls attention to the fact that these tables are going to be created. it wasn't clearly mentioned anywhere obvious in the new version and it took a bit of debugging to realize this was happening. if it is listed in the default config it raises awareness. |
As it happens I encountered this issue a few days ago. It is not a shared psiturk environment, but we run longitudinal or multi-step studies, and it makes a lot of sense to keep things in the same database, particularly when recruiting participants and avoiding participants who have already done the study (one could go a complicated qualifications route I suppose). The table prefix seems a reasonable idea. In a shared environment it's still not going to stop clobbering, because users may not be sufficiently careful (for example, because they borrow config files from each other). btw, and just putting this out there, but I hardly use the base table any more to store experimental data (i.e. what used to be turkdemo). Psiturk saves each subject's data in one row, and the task data in one cell. That means each PUT has to have all the data in it each time. For long experiments, or experiments where participants have kept buttons pressed or used scripts to press buttons repeatedly, these requests have gotten so large that the system bogs down significantly, or times out, or in one case, violated my database's request size limit. It's also generally inefficient, and makes it difficult to do other things like base their bonus on performance. So it's custom tables almost all the way. One of the big problems with this is having to worry about race conditions and repeated requests: you need some kind of nonce and the server side code to support it. Also, there really should be no reason psiturk shouldn't be able to support multiple experiments simultaneously from one server. The main thing is the routes (including the ad route) need to be prefixed with the experiment name, and the ads too when creating the HIT. |
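For readers wondering what the nonce approach mentioned above might look like, here is a minimal sketch with the standard `sqlite3` module. Everything in it is hypothetical (the `trial_events` table, its columns, and the `record_event` helper are not part of psiturk); it only illustrates storing one small row per event and letting a unique nonce make repeated requests harmless.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Hypothetical custom table: one row per event instead of one big cell.
conn.execute(
    """
    CREATE TABLE trial_events (
        nonce    TEXT PRIMARY KEY,  -- client-generated, unique per event
        uniqueid TEXT NOT NULL,     -- participant identifier
        payload  TEXT NOT NULL      -- small JSON blob for this event only
    )
    """
)


def record_event(conn, nonce, uniqueid, payload):
    """Insert one event; return True if stored, False if it was a repeat."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO trial_events (nonce, uniqueid, payload) "
        "VALUES (?, ?, ?)",
        (nonce, uniqueid, payload),
    )
    conn.commit()
    # rowcount is 0 when the primary-key conflict caused the row to be ignored.
    return cur.rowcount == 1


nonce = str(uuid.uuid4())
first = record_event(conn, nonce, "worker1:a1", '{"trial": 1}')
retry = record_event(conn, nonce, "worker1:a1", '{"trial": 1}')  # resent request
count = conn.execute("SELECT COUNT(*) FROM trial_events").fetchone()[0]
```

Because the database enforces uniqueness on the nonce, a retried or duplicated request cannot create a second row, even if two requests race.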
@gureckis number 4 sounds good. So that would be:
@jacob-lee your psiturk use case is interesting, you've evolved to a new plane :-) |
There's another table we need to be mindful of -- |
Ok I think this implements option 4 from above. The exact syntax for deprecating config options is not clear to me but in this example I describe that |
- when set, this option prefixes the `amt_hit` and `campaign` tables with the value set by the configuration option `table_name`. This ensures that experiments or researchers sharing a database do not conflict with their table names.
As ugly as it is, I think we might need to leave the default for the jobs table to "apscheduler_jobs" so that upgrading psiturk doesn't break anyone's currently-running jobs by creating a new table all of a sudden. Btw I rebased this PR and force pushed. |
Same for changing back to Edit: the currently used table names are:
|
darn it! ok. I suspect this affects between N=0 and N=2 people but we're running a professional shop here. |
It's purely an academic exercise! :-) |
I'm working on the changes rn |
Oop okay, just got your changes. I'll finish merging in a few other tweaks. |
I'm still not 100% sure on the assignments table... if a future person only provides |
Yes -- psiturk_config.py will set |
to prevent it _always_ overriding any `table_name` that the user sets also, update docs
🎉 woo, high five! |
Thanks! high five. I feel so accomplished, I'm going to go try to understand experiment.py just for fun |
A somewhat undocumented feature is that psiturk now creates two additional book-keeping tables in your database besides the one specified by the `table_name` option. These are `amt_hit` and `campaign`, and they help coordinate the campaigns and hit listings for the dashboard and command line. In some cases, though, multiple psiturk users might share one database for lab coordination purposes. If so, these tables will conflict across multiple instances and cause problems. This PR adds an option called `table_prefix` (default value false). When it is set to true, it prepends the value specified by `table_name` to those auto-generated tables. For instance, if `table_name` is `exp1`, then the two tables in question become `exp1_amt_hit` and `exp1_campaign`. This at least provides a mechanism for avoiding collisions.
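As a rough illustration of the naming rule described in this PR, the sketch below computes the auto-generated table names from the two options. The function `resolve_table_names` is hypothetical and is not psiturk's actual implementation; only the option names `table_name` and `table_prefix` and the table names `amt_hit` and `campaign` come from the description above.

```python
def resolve_table_names(table_name="assignments", table_prefix=False):
    """Return the names to use for the auto-generated book-keeping tables.

    Hypothetical helper: when table_prefix is true, each auto-generated
    table is prefixed with table_name; otherwise the names are unchanged.
    """
    auto_tables = ("amt_hit", "campaign")
    if table_prefix:
        return {name: f"{table_name}_{name}" for name in auto_tables}
    return {name: name for name in auto_tables}


prefixed = resolve_table_names(table_name="exp1", table_prefix=True)
unprefixed = resolve_table_names(table_name="exp1", table_prefix=False)
```

With `table_prefix` left at its default, two experiments sharing a database would both resolve to `amt_hit` and `campaign`, which is the collision this option is meant to avoid.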