adding `table_prefix` option to prevent database table collisions for auto-generated tables #495

gureckis · 2021-04-27T03:38:32Z

A somewhat undocumented feature is that psiturk now creates two additional bookkeeping tables in your database besides the one specified by the table_name option. These are amt_hit and campaign, and they help coordinate the campaigns and HIT listings for the dashboard and command line. In some cases, though, multiple psiturk users might share one database for lab coordination purposes. If so, these tables will conflict across instances and cause problems. This PR adds an option called table_prefix (default value false). When it is set to true, the value specified by table_name is prepended to those auto-generated tables. For instance, if table_name is exp1 then the two tables in question become exp1_amt_hit and exp1_campaign. This at least provides a mechanism for avoiding collisions.
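
A minimal sketch of the naming rule described above, not the code in this PR. The helper name `bookkeeping_table_names` is hypothetical; only the option names table_name and table_prefix and the table names amt_hit and campaign come from the description.

```python
def bookkeeping_table_names(table_name, table_prefix=False):
    """Return the names of the two auto-generated tables.

    Hypothetical helper: when table_prefix is true, prepend table_name
    and an underscore; otherwise keep the shared default names.
    """
    base_names = ("amt_hit", "campaign")
    if table_prefix:
        return tuple(f"{table_name}_{name}" for name in base_names)
    return base_names


print(bookkeeping_table_names("exp1", table_prefix=True))
print(bookkeeping_table_names("exp1"))
```

With the flag off, two experiments that share a database both resolve to the same amt_hit and campaign tables, which is the collision this PR is trying to avoid.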

coveralls · 2021-04-27T03:51:30Z

Coverage increased (+1.07%) to 61.118% when pulling 48db427 on exp_table_names into 956ef9f on master.

deargle · 2021-04-27T04:03:54Z

Hmm, the current default for table_name in psiturk 3 is "assignments" instead of the former "turkdemo" because I didn't understand the multi lab user use case. Prepending "assignments" probably would be weird.

Can you describe the shared use case approach? In this approach, is each lab user running a unique experiment? If so, I have been reading the sqlalchemy docs and sqlalchemy supports "schemas," where each schema can have its own copy of a table, and collisions are avoided. The schema name becomes a prefix of sorts in the underlying query, like "exp1.assignments" and "exp2.assignments." A single database can have many schemas, and heroku says having up to 50 schemas for a given db should be fine.

One thing that schemas would not work for is people who want to share one single table of assignments for purposes of blocking participants, differentiating studies by prefixes to the values they set for code_version. For that reason, I might want to avoid the explicit schemas route. But maybe have a table_prefix config var that is blank by default, but if set to a string, then that string becomes the prefix for the two tables you mentioned. Does that sound okay? It's more complex and less intuitive than schemas, but maybe that's okay.

gureckis · 2021-04-27T04:13:47Z

Yes, the use case I think is pretty typical. A lab might have one MySQL server running at a fixed location like a lab server. Multiple researchers might be running experiments on that database at once, or a researcher might be running more than one psiturk experiment at once, but using the same database. In some cases, an IRB might require a particular database system. For instance, an IRB may say you have to use the MySQL database infrastructure managed by the university and can't put PPI on random servers. The overhead of making multiple databases for each experimenter might be non-zero depending on the management tools provided by the university, say. This just gives people the option to use 1 DB and multiple psiturk tasks.

Re: the prefix option as opposed to the flag, that sounds good....

gureckis · 2021-04-27T04:21:31Z

actually if assignments is the default table name it feels like they should all just be autogenerated and the table_name used as the prefix? would make more sense that amt_assignments amt_hit and amt_campaigns are sort of standardized perhaps with the option to prefix them as needed?

deargle · 2021-04-27T04:43:07Z

Yeah I agree long term that having a way to override the "assignments" table name is pointless and should be replaced with table_prefix in name and use. And also that the name amt_campaigns makes more sense. But perhaps not amt_assignments... I use psiturk for "lab" mode, which uses the assignments table. But alas, semantic versioning, I'm not ready to bump to psiturk v4 for this lol, which would mean we'd need to keep table_name working as-is to stay at v3. For your use case, prefixing everything makes sense, and marking the table_name field as deprecated. That calls for a schema I think? Should be clean: https://docs.sqlalchemy.org/en/14/core/metadata.html#specifying-a-default-schema-name-with-metadata Querying schemas can be tricky as only the psql default "public" schema gets searched by default. Maybe we just hack the table name for now and avoid schemas, on those grounds. I'm unsure how to continue to unofficially do a shared assignments table. Maybe we just don't, and later that becomes a new feature somehow. There was that one guy on the Google group who wanted to run like three different studies simultaneously, but to not allow anyone to do more than one. Psiturk is just not set up for that without some kind of centralized lab server.

gureckis · 2021-04-27T13:30:56Z

well the most straightforward options then are

1. a prefix_table string that applies to all three tables which defaults to empty
2. a prefix_table string that applies to only the amt_hit and campaign tables (again defaulting to empty)
3. a prefix_table flag that prefixes the amt_hit and campaign tables with the contents of table_name (this is in the current PR)
4. explicitly configure the amt_hit and campaign tables however you want, again defaulting to the existing values

now that I write it, perhaps 4 is the most sensible because it also calls attention to the fact that these tables are going to be created. it wasn't clearly mentioned anywhere obvious in the new version and it took a bit of debugging to realize this was happening. if it is listed in the default config it raises awareness.
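
To make option 4 concrete, here is a rough sketch of what explicitly configurable table names could look like, using only the standard library `configparser`. The section name `Database Parameters` and the option names `amt_hit_table_name` and `campaign_table_name` are placeholders for illustration, not names settled in this thread.

```python
import configparser

# Hypothetical defaults: the existing table names, listed explicitly so that
# users can see that these tables will be created.
DEFAULTS = """
[Database Parameters]
amt_hit_table_name = amt_hit
campaign_table_name = campaign
"""

# A user config that overrides only one of the two names.
USER_CONFIG = """
[Database Parameters]
amt_hit_table_name = exp1_amt_hit
"""

config = configparser.ConfigParser()
config.read_string(DEFAULTS)
config.read_string(USER_CONFIG)  # later reads override earlier values

section = config["Database Parameters"]
print(section["amt_hit_table_name"])
print(section["campaign_table_name"])
```

Listing the names in the defaults keeps existing installs on the current tables while making the extra tables visible to anyone reading the config.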

jacob-lee · 2021-04-27T14:43:44Z

As it happens I encountered this issue a few days ago. It is not a shared psiturk environment, but we run longitudinal or multi-step studies, and it makes a lot of sense to keep things in the same database, particularly when recruiting participants and avoiding participants who have already done the study (one could go a complicated qualifications route I suppose).

The table prefix seems a reasonable idea. In a shared environment it's still not going to stop clobbering, because users may not be sufficiently careful (for example, because they borrow config files from each other).

btw, and just putting this out there, but I hardly use the base table any more to store experimental data (i.e. what used to be turkdemo). Psiturk saves each subject's data in one row, and the task data in one cell. That means each PUT has to have all the data in it each time. For long experiments, or experiments where participants have kept buttons pressed or used scripts to press buttons repeatedly, these requests have gotten so large that the system bogs down significantly, or times out, or in one case, violated my database's request size limit. It's also generally inefficient, and makes it difficult to do other things like base their bonus on performance. So it's custom tables almost all the way. One of the big problems with this is having to worry about race conditions and repeated requests: you need some kind of nonce and the server-side code to support it.
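
One possible shape for the nonce idea, sketched with the standard library `sqlite3` module. The table `trial_data`, its columns, and the helper `record_trial` are invented for illustration; a real deployment would use whatever database and schema the experiment already has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trial_data (
        nonce TEXT PRIMARY KEY,   -- client-generated, unique per request
        worker_id TEXT NOT NULL,
        payload TEXT NOT NULL
    )
    """
)


def record_trial(conn, nonce, worker_id, payload):
    """Insert one trial row; a repeated nonce is ignored.

    Returns True if a new row was stored, False for a duplicate request.
    """
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO trial_data (nonce, worker_id, payload) "
            "VALUES (?, ?, ?)",
            (nonce, worker_id, payload),
        )
    return cursor.rowcount == 1


print(record_trial(conn, "abc123", "worker1", '{"rt": 512}'))
print(record_trial(conn, "abc123", "worker1", '{"rt": 512}'))
```

Storing one small row per trial keeps each request short, and the uniqueness constraint on the nonce lets the database, rather than application code, reject a resent request.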

Also, there really should be no reason psiturk shouldn't be able to support multiple experiments simultaneously from one server. The main thing is the routes (including the ad route) need to be prefixed with the experiment name, and the ads too when creating the HIT: `psiturk create hit foo_experiment 3.00 10 1`

deargle · 2021-04-27T23:03:59Z

@gureckis number 4 sounds good. So that would be:

- add a new config var called assignments_table_name and deprecate table_name. If both are set, prefer assignments_table_name. If only one is set, prefer that one. (I already wrote a way to deprecate config vars here)
- add new config vars campaigns_table_name and amt_hits_table_name
- add entries for all three round about here, then rerun create_sample_config_from_defaults.py to get a new default user config file.
- update docs
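
The precedence rule in the first item can be sketched as follows. This is an illustration under stated assumptions, not the existing deprecation mechanism mentioned above: the function name `resolve_assignments_table` and the use of a plain dict for settings are hypothetical.

```python
import warnings


def resolve_assignments_table(settings):
    """Pick the assignments table name from a dict of user settings.

    Sketch of the precedence rule: the new option wins when both are set,
    and the deprecated option still works on its own. Returns None when
    neither option is present.
    """
    new_value = settings.get("assignments_table_name")
    old_value = settings.get("table_name")
    if old_value is not None:
        # Nudge users toward the new option without breaking old configs.
        warnings.warn(
            "table_name is deprecated; use assignments_table_name",
            DeprecationWarning,
        )
    if new_value is not None:
        return new_value
    return old_value
```

For example, `resolve_assignments_table({"table_name": "exp1"})` still resolves to the old value, so a v3 config keeps working.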

@jacob-lee your psiturk use case is interesting, you've evolved to a new plane :-)

deargle · 2021-05-07T05:26:10Z

There's another table we need to be mindful of -- apscheduler_jobs. It gets created if the dashboard is used to try to add jobs.

gureckis · 2021-05-19T18:24:14Z

Ok I think this implements option 4 from above. The exact syntax for deprecating config options is not clear to me but in this example I describe that table_name is being retired in preference for assignments_table_name. The code still refers to table_name internally.

- when set this option prefixes the `amt_hit` and `campaign` tables to match the values set by configuration option table_name. This ensures that experiments or researchers sharing a database do not conflict with their table names.

deargle · 2021-05-20T04:56:35Z

As ugly as it is, I think we might need to leave the default for the jobs table to "apscheduler_jobs" so that upgrading psiturk doesn't break anyone's currently-running jobs by creating a new table all of a sudden.

Btw I rebased this PR and force pushed.

deargle · 2021-05-20T04:59:26Z

Same for changing back to amt_hit

Edit: the currently used table names are:

campaign
amt_hit
apscheduler_jobs
assignments

gureckis · 2021-05-20T05:01:17Z

darn it! ok. I suspect this affects between N=0 and N=2 people but we're running a professional shop here.

deargle · 2021-05-20T05:02:19Z

It's purely an academic exercise! :-)

deargle · 2021-05-20T05:05:27Z

I'm working on the changes rn

deargle · 2021-05-20T05:07:32Z

Oop okay, just got your changes. I'll finish merging in a few other tweaks.

gureckis · 2021-05-20T05:08:39Z

I'm still not 100% sure on the assignments table... if a future person only provides assignment_table_name in their config will the lookup via the psiturkconfig object still resolve? Sorry, I know I could read but also could just ask what you remember.

deargle · 2021-05-20T05:10:35Z

> if a future person only provides assignment_table_name in their config will the lookup via the psiturkconfig object still resolve?

Yes -- psiturk_config.py will set table_name to the value of assignments_table_name. But I'm changing the config file to have no default for assignments_table_name, because if it did have a default, then the default would always override whatever the user set for table_name, even if they never set assignments_table_name.
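
A small illustration of the problem with shipping a default, using plain dicts to stand in for the defaults file and the user's config. The merge and lookup below are a simplification for illustration, not the logic in psiturk_config.py, and `effective_table` is a hypothetical name.

```python
def effective_table(defaults, user_config):
    """Merge defaults with user settings, then prefer the new option."""
    merged = {**defaults, **user_config}
    return merged.get("assignments_table_name") or merged.get("table_name")


user_config = {"table_name": "exp1"}  # user only knows the old option

# With a shipped default, the user's table_name is silently ignored.
with_default = effective_table({"assignments_table_name": "assignments"}, user_config)

# With no default for the new option, the user's table_name is respected.
without_default = effective_table({}, user_config)

print(with_default, without_default)
```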

to prevent it _always_ overriding any `table_name` that the user sets also, update docs

deargle · 2021-05-20T05:18:44Z

🎉 woo, high five!

gureckis · 2021-05-20T05:20:33Z

Thanks! high five. I feel so accomplished, I'm going to go try to understand experiment.py just for fun

gureckis added 8 commits May 19, 2021 22:48

initial support for table_prefix option

025418d

- when set this option prefixes the `amt_hit` and `campaign` tables to match the values set by configuration option table_name. This ensures that experiments or researchers sharing a database do not conflict with their table names.

typos and missing the entire point of the PR.

b92e2d2

typo

56550d8

add new configuration options for all the table names used by psiturk

72a8151

allow the apschedule table name to be configured

68cc9ae

use configured table options instead of defaults

aadc368

updating docs and change log

0db879b

fixes failing tests

3629a9d

deargle force-pushed the exp_table_names branch from d6cbcd9 to 3629a9d Compare May 20, 2021 04:50

changing defaults

cee8f87

remove default for assignments_table_name

48db427

to prevent it _always_ overriding any `table_name` that the user sets also, update docs

deargle merged commit d56d26c into master May 20, 2021
