Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support Redshift #9

Open
HarlanH opened this issue Mar 27, 2018 · 11 comments
Open

support Redshift #9

HarlanH opened this issue Mar 27, 2018 · 11 comments
Labels
enhancement New feature or request

Comments

@HarlanH
Copy link

HarlanH commented Mar 27, 2018

Redshift is based on Postgres 8.0.2, so a few things are different. Maybe this issue should be a list of known issues?

pg_authid does not exist

Failed to execute query "SELECT * FROM pg_authid WHERE rolname != 'pg_signal_backend';": relation "pg_authid" does not exist

It looks like it was introduced in 8.1:

This change also replaces pg_shadow and pg_group by new role-capable catalogs pg_authid and pg_auth_members.

@zcmarine
Copy link
Contributor

Hey Harlan, that's a great request! My understanding is that the best way to verify whether something works with redshift is to test against Postgres 8.0. Do you have any sense of whether getting pgbedrock working with Postgres 8.0 is roughly equivalent to getting it working with Redshift, or are there other distinctions between those two systems?

In addition to swapping out pg_authid, we'd also need to skip the default privilege grants / removals as that wasn't introduced until Postgres 9.0.

In any case, I think the solution is relatively simple: upon connecting to the database, immediately ask what Postgres version we're working with. For default privileges we then just skip that block of code if we're in an older version. With regard to pg_authid, I'm curious whether we can just use pg_shadow / pg_group permanently or if we will need to similarly branch. Until we tinker with Postgres 8.0 it likely won't be clear.

As far as specific testing process, I think this should be pretty straightforward (at least to describe in words): add a Postgres 8.0 docker container and run our test suite against it. We'll need to mark tests to skip if we're testing the older Postgres version (e.g. default privileges), and we may have to even split out some existing tests to account for the expected differing behavior. As an example, we currently test a grant to myschema.* and make sure that a grant happens for all tables as well as a default privilege grant occurs. We'd need to skip that test for Postgres 8.0 and have a new test that asserts all the same things except for the default privilege grant.

Assuming that Postgres 8.0 is functionally equivalent to Redshift, if we can get all tests to pass then we should be in a good place, but until someone actually tests against a Redshift I guess it's still a bit of a guess.

We don't actually use Redshift at Squarespace so I'm not sure how quickly this would get prioritized on my end, but I'm also happy to sync up and offer guidance on how to try implementing this if it's higher priority on your end.

@zcmarine
Copy link
Contributor

After poking around a bit more, it looks like Redshift does support default privileges as per AWS's documentation here, yet Postgres 8.0.2 definitely does not.

First, the Postgres release notes first mention default privileges in 9.0. Additionally, when I try running a Postgres 8.0.2 docker instance I can't get an ALTER DEFAULT PRIVILEGES statement to run. Moreover, within that container there doesn't appear to be a pg_catalog.pg_default_acl table, which is how default privileges are maintained (at least in Postgres 9+).

For documentation purposes, I looked into this by getting a local Postgres 8.0.2 running via:

docker run -p 5439:5439 --name local_redshift guildeducation/docker-amazon-redshift

I then tested default privileges with:

CREATE USER test_user;
CREATE SCHEMA test_schema;
ALTER DEFAULT PRIVILEGES IN SCHEMA test_schema GRANT SELECT ON TABLES TO test_user;

I then searched for the default ACL table via:

SELECT * 
FROM information_schema.tables 
WHERE table_schema = 'pg_catalog' 
ORDER BY table_name;

The takeaway from this is that I'm not really sure how to properly test redshift then: my thought was that we could spin up a Postgres 8.0.2 container and operate against that, but if that doesn't match Redshift (which based on the above it does not), then I don't know how to reliably test redshift without having a live redshift and tests that rely on network connectivity, which feels very awkward. Maybe you have some ideas though Harlan.

@HarlanH
Copy link
Author

HarlanH commented Mar 29, 2018

Hi Zach, OK, great. I think that managing permissions is an even bigger problem on analytical databases like Redshift than on operational databases. That's certainly true for us.

Yes, Redshift has backported a bunch of stuff from post 8.0.2, including the default privs stuff. I don't think that you'll be able to test Redshift without an actual Redshift instance. I wonder how other open source projects handle testing against non-free databases?

@zcmarine
Copy link
Contributor

zcmarine commented Apr 7, 2018

Hey Harlan, I looked around a bit and it seems like people just create another redshift instance and test on that, so that seems like the best path forward. I'm hoping because there isn't much data being added / removed that the cost element would be pretty small since I don't want to discourage people from testing.

I'm hoping that since Redshift appears to support default privileges that there isn't much machinery here that will have to change, i.e. perhaps it's just that the system catalogs are still organized the way Postgres used to be and it's just a few tables that are missing. In that vein, could you try running the below query and tell me what comes back? It's just listing out which of the pg_catalog tables that pgbedrock relies on are missing within Redshift:

SELECT
    desired_tables.table_name AS missing_tables
FROM (
    VALUES
        ('pg_auth_members'),
        ('pg_authid'),
        ('pg_class'),
        ('pg_default_acl'),
        ('pg_depend'),
        ('pg_namespace'),
        ('pg_roles')
    ) AS desired_tables (table_name)
    LEFT JOIN information_schema.tables info_tables
        ON desired_tables.table_name = info_tables.table_name
        AND info_tables.table_schema = 'pg_catalog'
WHERE
    info_tables.table_name IS NULL
;

@HarlanH
Copy link
Author

HarlanH commented Apr 7, 2018

Yeah -- a single-node dc2.large instance is 25 cents an hour, so if there was a spin-up/test/spin-down script for Redshift, it'd be no big deal...

That query doesn't run because Redshift doesn't have VALUES. This query:

SELECT table_name, 
	table_type,
	table_name in ('pg_auth_members', 'pg_authid', 'pg_class', 'pg_default_acl', 'pg_depend', 'pg_namespace', 'pg_roles') as present
from information_schema.tables info_tables
WHERE info_tables.table_schema = 'pg_catalog'
order by 3 desc

yields:

pg_default_acl	BASE TABLE	t
pg_class	BASE TABLE	t
pg_namespace	BASE TABLE	t
pg_depend	BASE TABLE	t
pg_conf	BASE TABLE	f
pg_library	BASE TABLE	f
pg_shdepend	BASE TABLE	f
pg_statistic	BASE TABLE	f
pg_type	BASE TABLE	f
pg_attribute	BASE TABLE	f
pg_tablespace	BASE TABLE	f
pg_inherits	BASE TABLE	f
pg_index	BASE TABLE	f
pg_operator	BASE TABLE	f
pg_opclass	BASE TABLE	f
pg_am	BASE TABLE	f
pg_amop	BASE TABLE	f
pg_amproc	BASE TABLE	f
pg_language	BASE TABLE	f
pg_largeobject	BASE TABLE	f
pg_aggregate	BASE TABLE	f
pg_statistic_multicol	BASE TABLE	f
pg_trigger	BASE TABLE	f
pg_listener	BASE TABLE	f
pg_cast	BASE TABLE	f
pg_conversion	BASE TABLE	f
pg_udf	BASE TABLE	f
pg_bar_repos	BASE TABLE	f
pg_bar_state	BASE TABLE	f
pg_highwatermark	BASE TABLE	f
pg_resize	BASE TABLE	f
pg_attrdef	BASE TABLE	f
pg_constraint	BASE TABLE	f
pg_database	BASE TABLE	f
pg_description	BASE TABLE	f
pg_group	BASE TABLE	f
pg_proc	BASE TABLE	f
pg_rewrite	BASE TABLE	f
pg_bar_ddllog	BASE TABLE	f
pg_test	BASE TABLE	f
pg_statistic_indicator	BASE TABLE	f
pg_user	VIEW	f
pg_rules	VIEW	f
pg_views	VIEW	f
pg_tables	VIEW	f
pg_table_def	VIEW	f
pg_indexes	VIEW	f
pg_stats	VIEW	f
pg_stat_all_tables	VIEW	f
pg_stat_sys_tables	VIEW	f
pg_stat_user_tables	VIEW	f
pg_statio_all_tables	VIEW	f
pg_statio_sys_tables	VIEW	f
pg_statio_user_tables	VIEW	f
pg_stat_all_indexes	VIEW	f
pg_stat_sys_indexes	VIEW	f
pg_stat_user_indexes	VIEW	f
pg_statio_all_indexes	VIEW	f
pg_statio_sys_indexes	VIEW	f
pg_statio_user_indexes	VIEW	f
pg_statio_all_sequences	VIEW	f
pg_statio_sys_sequences	VIEW	f
pg_statio_user_sequences	VIEW	f
pg_stat_activity	VIEW	f
pg_stat_database	VIEW	f
pg_locks	VIEW	f
pg_user_info	VIEW	f
pg_database_info	VIEW	f
pg_settings	VIEW	f
pg_external_schema	BASE TABLE	f
stv_wlm_qmr_config	BASE TABLE	f
stl_rlf_scan	BASE TABLE	f
stl_internal_query_details	BASE TABLE	f
svl_awsclient_error	VIEW	f
svl_s3requests	VIEW	f
svv_comm_pkt_latency_hist	VIEW	f
svv_comm_pkt_latency	VIEW	f
svv_io_latency_read	VIEW	f
svv_io_latency_sync	VIEW	f
svv_io_latency_write	VIEW	f
svv_sem_usage	VIEW	f
svv_sem_usage_leader	VIEW	f
svl_user_info	VIEW	f
padb_config_harvest	BASE TABLE	f
stl_aggr	BASE TABLE	f
stl_alert_event_log	BASE TABLE	f
stl_analyze	BASE TABLE	f
stl_bcast	BASE TABLE	f
stl_column_stats	BASE TABLE	f
stl_commit_stats	BASE TABLE	f
stl_connection_log	BASE TABLE	f
stl_ddltext	BASE TABLE	f
stl_delete	BASE TABLE	f
stl_disk_full_diag	BASE TABLE	f
stl_dist	BASE TABLE	f
stl_error	BASE TABLE	f
stl_explain	BASE TABLE	f
stl_file_scan	BASE TABLE	f
stl_hash	BASE TABLE	f
stl_hashjoin	BASE TABLE	f
stl_insert	BASE TABLE	f
stl_leader_snapshot	BASE TABLE	f
stl_limit	BASE TABLE	f
stl_load_commits	BASE TABLE	f
stl_load_errors	BASE TABLE	f
stl_load_trace	BASE TABLE	f
stl_loaderror_detail	BASE TABLE	f
stl_merge	BASE TABLE	f
stl_mergejoin	BASE TABLE	f
stl_metadata_step	BASE TABLE	f
stl_nestloop	BASE TABLE	f
stl_parse	BASE TABLE	f
stl_plan_info	BASE TABLE	f
stl_project	BASE TABLE	f
stl_query	BASE TABLE	f
stl_query_metrics	BASE TABLE	f
stl_querytext	BASE TABLE	f
stl_replacements	BASE TABLE	f
stl_restarted_sessions	BASE TABLE	f
stl_return	BASE TABLE	f
stl_rpc	BASE TABLE	f
stl_s3client	BASE TABLE	f
stl_s3client_error	BASE TABLE	f
stl_save	BASE TABLE	f
stl_scan	BASE TABLE	f
stl_sessions	BASE TABLE	f
stl_sort	BASE TABLE	f
stl_sshclient_error	BASE TABLE	f
stl_stream_segs	BASE TABLE	f
stl_tr_conflict	BASE TABLE	f
stl_udf_compile_error	BASE TABLE	f
stl_undone	BASE TABLE	f
stl_unique	BASE TABLE	f
stl_unload_log	BASE TABLE	f
stl_userlog	BASE TABLE	f
stl_utilitytext	BASE TABLE	f
stl_vacuum	BASE TABLE	f
stl_vacuum_detail	BASE TABLE	f
stl_warning	BASE TABLE	f
stl_window	BASE TABLE	f
stl_wlm_error	BASE TABLE	f
stl_wlm_query	BASE TABLE	f
stl_wlm_rule_action	BASE TABLE	f
stv_active_cursors	BASE TABLE	f
stv_block_reps	BASE TABLE	f
stv_blocklist	BASE TABLE	f
stv_cursor_configuration	BASE TABLE	f
stv_disk_extents	BASE TABLE	f
stv_exec_state	BASE TABLE	f
stv_inflight	BASE TABLE	f
stv_interleaved_counts	BASE TABLE	f
stv_load_state	BASE TABLE	f
stv_locks	BASE TABLE	f
stv_partitions	BASE TABLE	f
stv_pg_wal_length	BASE TABLE	f
stv_proc_stat	BASE TABLE	f
stv_query_metrics	BASE TABLE	f
stv_query_stats	BASE TABLE	f
stv_recents	BASE TABLE	f
stv_sessions	BASE TABLE	f
stv_slices	BASE TABLE	f
stv_startup_recovery_state	BASE TABLE	f
stv_tbl_perm	BASE TABLE	f
stv_tbl_trans	BASE TABLE	f
stv_underrepped_blocks	BASE TABLE	f
stv_wlm_classification_config	BASE TABLE	f
stv_wlm_query_queue_state	BASE TABLE	f
stv_wlm_query_state	BASE TABLE	f
stv_wlm_query_task_state	BASE TABLE	f
stv_wlm_service_class_config	BASE TABLE	f
stv_wlm_service_class_state	BASE TABLE	f
svl_compile	VIEW	f
svl_qlog	VIEW	f
svl_query_metrics	VIEW	f
svl_query_metrics_summary	VIEW	f
svl_query_queue_info	VIEW	f
svl_query_report	VIEW	f
svl_query_summary	VIEW	f
svl_s3catalog	VIEW	f
svl_s3list	VIEW	f
svl_s3log	VIEW	f
svl_s3partition	VIEW	f
svl_s3partition_summary	VIEW	f
svl_s3query	VIEW	f
svl_s3query_summary	VIEW	f
svl_s3retries	VIEW	f
svl_statementtext	VIEW	f
svl_udf_log	VIEW	f
svl_vacuum_percentage	VIEW	f
svv_columns	VIEW	f
svv_diskusage	VIEW	f
svv_external_columns	VIEW	f
svv_external_databases	VIEW	f
svv_external_partitions	VIEW	f
svv_external_schemas	VIEW	f
svv_external_tables	VIEW	f
svv_interleaved_columns	VIEW	f
svv_query_inflight	VIEW	f
svv_query_state	VIEW	f
svv_restore_table_state	VIEW	f
svv_table_info	VIEW	f
svv_tables	VIEW	f
svv_transactions	VIEW	f
svv_vacuum_progress	VIEW	f
svv_vacuum_summary	VIEW	f
systable_globaldict	BASE TABLE	f
systable_schema	BASE TABLE	f
systable_topology	BASE TABLE	f

@zcmarine
Copy link
Contributor

zcmarine commented Apr 7, 2018

Interesting. So it looks like it's missing three tables that pgbedrock uses:

  • pg_roles - this doesn't really need to be used; we should just swap out the one reference to this for pg_authid instead
  • pg_auth_members - it looks like this could be replaced by pg_group (Postgres docs)
  • pg_authid - This looks like the most challenging relation to swap out, for a few reasons:
    1. The attributes that are available via pg_user (Postgres docs) differ, so attributes.py would need to behave differently.
    2. Group roles in much older versions of Postgres behaved differently: they didn't have attributes themselves and I think (?) were just there to manage privileges exclusively. There'd be a bit of a ramp-up making sure we understand how to use group roles properly in this other context.
    3. We'd need to handle login and non-login roles separately. As an example, changing the login attribute for a role would likely mean we'd have to DROP USER foo; and then CREATE GROUP foo;. Similarly, if someone wanted to modify some attributes of a group role (e.g. give it a connection limit) I think that wouldn't work, so in the Redshift implementation we'd have to put additional schema validation in against specs to explain what is and isn't possible.
    4. The most trivial issue: Currently we look up all roles in pg_authid, but now we'd need to decide how to work across two tables: looking up non-login roles in pg_group and looking up login roles in pg_user.

In addition, we'd need to figure out how to test against Redshift, i.e. potentially add custom markers listing tests that should only run on Redshift and others that should only run on Postgres, then add the ability to swap out the test URL so it can hit a real Redshift.

Again, I'm not sure how highly prioritized this will get at Squarespace as we don't really use Redshift, but it seems like this is doable if someone who uses Redshift would want to take the lead on trying to flesh this out with my help.

@HarlanH
Copy link
Author

HarlanH commented Apr 7, 2018

Huge thanks for doing this research, Zach! I think this gives a big head-start to whoever can devote some time to this. (Hoping my colleague @hanleyhansen might be interested...? :) )

@zcmarine zcmarine added the enhancement New feature or request label May 8, 2018
@clrcrl
Copy link

clrcrl commented May 20, 2018

So I'm hacking away at this with the goal to make my forked repo Redshift-compatible (but in the process, I'm losing Postgres compatibility).

I'm pretty new to python, so I'm using this as an exercise in teaching myself, since the subject matter (Redshift user permissions) is very familiar.

As such, I don't think any of the code that I produce will be pull-request-standard. If I get my forked repo working for Redshift, I'm happy to share it around, but I don't feel I'll be the right person to introduce multi-database compatibility. Nonetheless, what I'm learning along the way will be very valuable to the person that does take this issue on.

I haven't got it working yet, and I don't imagine I'll get another chance to look at it for at least a week, so just wanted to share my current progress in case someone is looking at this soon.

So far the noteworthy challenges are:

Group memberships

  • Group memberships are completely different in Redshift, and quite a challenge
  • There's no equivalent table to pg_auth_members, instead pg_groups stores group memberships as an array, returning tuples of (group, members), e.g.:
groname grosysid grolist
webpowerusers 101
webdevusers 102
webappusers 100 102, 103
  • grolist is an array (according to information_schema.columns), which is weird, because arrays aren't supported in Redshift
  • It's very difficult to unnest arrays (which are normally stored as strings) in Redshift, without using some sort of pre-made sequence of numbers (see this tutorial)
  • To work around this, I considered doing a Regex match between pg_user.usesysid and pg_group.gro_list to join them together (yuk!), but I couldn't do Regex on the array type. And then because arrays are not a supported data type, it won't let me cast it to a string ¯_(ツ)_/¯
  • So it seems to me that for Redshift compatibility, the group memberships will have to be returned to pgbedrock as arrays, and then I'll have to use python to unnest the array, to get the tuples of (member, group).
  • Also note that rather than granting membership of a group to a role (user), in Redshift, you have to alter group "group_name" add user "user_name" - a bit strange but nothing too bad.

Users & Groups versus Roles

  • As you've already realised, Redshift treats users and groups as separate objects, compared to postgres which uses roles.
  • In working through this, I decided it would be easier to replace can_login: in the config with a required field, role_type: (user/group).
  • I then implemented some extra Cerberus validation so that if role_type == group, certain configurations cannot exist (e.g. for owns:, role_type must equal user, since groups cannot own tables/schemas in Redshift).

Passwords

  • Passwords are required for Redshift, and have constraints on length/character-types
  • This is compared to Postgres where they are optional

Superusers

  • The concept of a superuser in Redshift is slightly different to postgres - basically a superuser is just someone with the createuser privilege, so createuser should not be able to be defined in attributes

@mikekaminsky
Copy link

@clrcrl have you made any progress on this? I might be interested in collaborating.

@JoeNaso
Copy link

JoeNaso commented Feb 20, 2019

I'm pretty late to the party here, but have there been any developments with Redshift compatibility?

@ghost
Copy link

ghost commented Nov 24, 2019

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants