
Token always expired #92

Closed
AZhyrenkov opened this issue Mar 10, 2020 · 17 comments
@AZhyrenkov AZhyrenkov commented Mar 10, 2020

Hi!
I am experiencing an issue with the library when trying to assume a role during the connection process, but the same user with the same credentials works fine over pure boto3.

0. Working example with boto3:

session = boto3.Session(profile_name='my-profile')
credentials = session.get_credentials()
current_credentials = credentials.get_frozen_credentials()
client = boto3.client('athena',
                       aws_access_key_id = current_credentials.access_key,
                       aws_secret_access_key = current_credentials.secret_key,
                       aws_session_token = current_credentials.token)

And three non-working examples:
1. Passing the access parameters within connection

> con <- dbConnect(RAthena::athena(), 
+                  aws_access_key_id = credentials$AccessKeyId,
+                  aws_secret_access_key = credentials$SecretAccessKey,
+                  s3_staging_dir='s3://my-bucket/tmp/',
+                  region_name='eu-west-1')
Error: An error occurred (UnrecognizedClientException) when calling the GetDatabases operation: The security token included in the request is invalid.

2. Over the profile file:

> con <- dbConnect(RAthena::athena(), 
+                  profile_name = "default",
+                  role_arn = "my-role-arn",
+                  s3_staging_dir='s3://my-bucket/tmp/',
+                  region_name='eu-west-1')
Error in names(y$TableType) <- y$Name : 
  attempt to set an attribute on NULL

3. Without an assume role:

> con <- dbConnect(RAthena::athena(),
+                  s3_staging_dir='s3://my-bucket/tmp/',
+                  region_name='eu-west-1')
Error in names(y$TableType) <- y$Name : 
  attempt to set an attribute on NULL

Could you please help? Am I doing something wrong?

@DyfanJones DyfanJones commented Mar 10, 2020

Hi, sorry for the issue you're having. Are you working in the RStudio IDE? My initial hunch is that there is a bug in passing the role to the schema hierarchy builder for RStudio's Connections tab.

@DyfanJones DyfanJones added the bug label Mar 10, 2020
@AZhyrenkov AZhyrenkov commented Mar 10, 2020

Thank you so much for such a quick response! Yes, the setup is like this:

  • EC2 machine with a proper IAM role
  • Docker container with RStudio Server and boto3
  • .aws_profile is copied inside the container
  • non-root user inside RStudio Server

By the way, it was working fine ~1-2 weeks ago, and I'm not sure what has changed since then.

@DyfanJones DyfanJones commented Mar 10, 2020

Are you working with the CRAN version or the GitHub dev version? Are you able to provide a traceback?

@AZhyrenkov AZhyrenkov commented Mar 10, 2020

Sure, here is the traceback:

> traceback()
15: FUN(X[[i]], ...)
14: lapply(X = X, FUN = FUN, ...)
13: sapply(x, function(y) {
        names(y$TableType) = y$Name
        y$TableType
    })
12: FUN(X[[i]], ...)
11: lapply(output, function(x) sapply(x, function(y) {
        names(y$TableType) = y$Name
        y$TableType
    }))
10: unlist(lapply(output, function(x) sapply(x, function(y) {
        names(y$TableType) = y$Name
        y$TableType
    })))
9: AthenaTableTypes(connection)
8: AthenaListObjectTypes.default(connection)
7: AthenaListObjectTypes(connection)
6: listObjectTypes()
5: observer$connectionOpened(displayName = computeDisplayName(connection), 
       type = "Athena", host = computeHostName(connection), icon = icon, 
       connectCode = paste(c("library(DBI)", "con <- dbConnect(RAthena::athena())"), 
           collapse = "\n"), disconnect = function() {
           dbDisconnect(connection)
       }, listObjectTypes = function() {
           AthenaListObjectTypes(connection)
       }, listObjects = function(...) {
           AthenaListObjects(connection, ...)
       }, listColumns = function(...) {
           AthenaListColumns(connection, ...)
       }, previewObject = function(rowLimit, ...) {
           AthenaPreviewObject(connection, rowLimit, ...)
       }, actions = AthenaConnectionActions(connection), connectionObject = connection)
4: on_connection_opened(con)
3: .local(drv, ...)
2: dbConnect(RAthena::athena(), profile_name = "default", role_arn = "role-arn", 
       s3_staging_dir = "s3://my-bucket/tmp/", 
       region_name = "eu-west-1")
1: dbConnect(RAthena::athena(), profile_name = "default", role_arn = "role-arn", 
       s3_staging_dir = "s3://my-bucket/temp/", 
       region_name = "eu-west-1")
@AZhyrenkov AZhyrenkov commented Mar 10, 2020

I have tested both versions: CRAN and this repo.

@DyfanJones DyfanJones commented Mar 10, 2020

Interesting. It looks like the client is created successfully, but the issue occurs when Glue is called to get your Glue Catalogue.

Have the permissions for that role changed with regard to Glue?

The connection code roughly goes like this (Python version):

import boto3

sts = boto3.Session(profile_name=profile_name, region_name="region_name").client("sts")
response = sts.assume_role(RoleArn="role-arn",
                           RoleSessionName="RAthena",
                           DurationSeconds=3600)

creds = response["Credentials"]

session = boto3.Session(aws_access_key_id=creds["AccessKeyId"],
                        aws_secret_access_key=creds["SecretAccessKey"],
                        aws_session_token=creds["SessionToken"],
                        region_name="region_name")

glue = session.client("glue")

database = [x["Name"] for x in glue.get_databases()["DatabaseList"]]

output = [glue.get_tables(DatabaseName=x)["TableList"] for x in database]

# This is missing some of the tables, but a little sleepy to perfect it :P
[[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]

Are you able to run this from Python?

@AZhyrenkov AZhyrenkov commented Mar 11, 2020

It goes smoothly until the last line:

[[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-15-e7298c858d45> in <module>
----> 1 [[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]

<ipython-input-15-e7298c858d45> in <listcomp>(.0)
----> 1 [[y[0].get("TableType"), y[0].get("Name")] for y in [x for x in output]]

IndexError: list index out of range

The output object has everything needed:

print(len(output))
print(output[0][0]['TableType'])
print(output[0][0]['Name'])
->100
->VIRTUAL_VIEW
->access_logs_ireland_costs
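Incidentally, the IndexError above is a separate wrinkle from the missing TableType: `y[0]` fails when one of the TableList entries in `output` is empty, i.e. a Glue database with no tables. A minimal sketch with fabricated stand-in data (these dicts are illustrative, not real Glue output):

```python
# Fabricated stand-in for a list of glue.get_tables(...)["TableList"] results;
# the second "database" has no tables, which is what breaks y[0].
output = [
    [{"Name": "access_logs", "TableType": "VIRTUAL_VIEW"}],
    [],  # empty database -> y[0] raises IndexError
    [{"Name": "problem_table"}],  # table with no TableType key
]

try:
    [[y[0].get("TableType"), y[0].get("Name")] for y in output]
except IndexError as e:
    print("IndexError:", e)

# A guarded version that skips empty databases and tolerates
# a missing TableType key:
pairs = [
    (tbl.get("TableType"), tbl.get("Name"))
    for tables in output
    for tbl in tables
]
print(pairs)
# [('VIRTUAL_VIEW', 'access_logs'), (None, 'problem_table')]
```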
@DyfanJones DyfanJones commented Mar 11, 2020

Oh, very interesting. Can you check for me: is there an object in output that doesn't have a TableType? And if so, can you share what it is? (You can remove any sensitive info from it.)

The error happens when R assigns NULL to a name. To prevent that, I need to understand which object is missing its table type and what it should be instead :)

@DyfanJones DyfanJones commented Mar 11, 2020

@AZhyrenkov I have created an initial workaround that returns "" when TableType is NULL. Can you check it out?

remotes::install_github("dyfanjones/rathena", ref = "rstudio-view")
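For reference, the workaround's behavior sketched in Python (this mirrors the idea of substituting "" for a missing TableType; it is not RAthena's actual R code):

```python
def table_types(table_list):
    """Map table Name -> TableType, defaulting to "" when Glue
    omits TableType (the case that crashed the Connections tab)."""
    return {tbl["Name"]: tbl.get("TableType") or "" for tbl in table_list}

tables = [
    {"Name": "access_logs", "TableType": "VIRTUAL_VIEW"},
    {"Name": "problem_table"},  # no TableType stored in Glue
]
print(table_types(tables))
# {'access_logs': 'VIRTUAL_VIEW', 'problem_table': ''}
```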

@DyfanJones DyfanJones commented Mar 11, 2020

If you can find out which object in output (from the Python code) is missing it, I will be able to assign the correct metadata to the table instead of "".

Any help with this would be very much appreciated.

@DyfanJones DyfanJones self-assigned this Mar 11, 2020
@AZhyrenkov AZhyrenkov commented Mar 11, 2020

So I tried to take a look at the missing TableTypes; here is the code:

import pandas as pd

def get_params(output):
    result = {'db': [], 'names': [], 'types': []}
    for x in output:
        for y in x:
            # Append database name, table name and table type,
            # falling back to '' when a key is missing
            result['db'].append(y.get('DatabaseName', ''))
            result['names'].append(y.get('Name', ''))
            result['types'].append(y.get('TableType', ''))
    return result

result = pd.DataFrame(get_params(output))

And it's showing:

result[result['types'] == ''].shape
(122, 3)

Quite a lot of the new tables in our Glue catalog have a NULL TableType.

And the metainfo of a table without a TableType:

{   'CreateTime': datetime.datetime(2020, 3, 5, 13, 44, 54, tzinfo=tzlocal()),
    'DatabaseName': 'database',
    'IsRegisteredWithLakeFormation': False,
    'Name': 'problem_table',
    'Owner': 'email',
    'PartitionKeys': [   {'Name': 'platform_type', 'Type': 'string'},
                         {'Name': 'country_code', 'Type': 'string'},
                         {'Name': 'year', 'Type': 'string'},
                         {'Name': 'month', 'Type': 'string'},
                         {'Name': 'day', 'Type': 'string'}],
    'Retention': 0,
    'StorageDescriptor': {   'Columns': [   {columns} ],
                             'Compressed': False,
                             'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
                             'Location': 's3://bucket',
                             'NumberOfBuckets': 0,
                             'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
                             'SerdeInfo': {   'Name': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'},
                             'SortColumns': [],
                             'StoredAsSubDirectories': False},
    'UpdateTime': datetime.datetime(2020, 3, 5, 13, 44, 54, tzinfo=tzlocal())}
@AZhyrenkov AZhyrenkov commented Mar 11, 2020

@AZhyrenkov I have created an initial workaround that returns "" when TableType is NULL. Can you check it out?

remotes::install_github("dyfanjones/rathena", ref = "rstudio-view")

This is working!

@DyfanJones DyfanJones commented Mar 11, 2020

@AZhyrenkov perfect.

Edit: Sorry, I didn't see your earlier message. Please ignore the below.

Can you please let me know the object types from output where TableType is not provided. For example:

output[0]
 [{'Name': "can be removed",
  'DatabaseName': "can be removed",
  'Owner': 'hadoop',
  'CreateTime': datetime.datetime(2020, 3, 9, 19, 31, 44, tzinfo=tzlocal()),
  'UpdateTime': datetime.datetime(2020, 3, 9, 19, 31, 44, tzinfo=tzlocal()),
  'LastAccessTime': datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzlocal()),
  'Retention': 0,
  'StorageDescriptor': {'Columns': "can be removed",
   'Location': "can be removed",
   'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
   'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
   'Compressed': False,
   'NumberOfBuckets': -1,
   'SerdeInfo': {'SerializationLibrary': 'org.apache.hive.hcatalog.data.JsonSerDe',
    'Parameters': {'serialization.format': '1'}},
   'BucketColumns': [],
   'SortColumns': [],
   'Parameters': {},
   'SkewedInfo': {'SkewedColumnNames': [],
    'SkewedColumnValues': [],
    'SkewedColumnValueLocationMaps': {}},
   'StoredAsSubDirectories': False},
  'PartitionKeys': 'can be removed',
  'Parameters': {'EXTERNAL': 'TRUE', 'transient_lastDdlTime': '1583782304'},
  'CreatedBy': 'can be removed',
  'IsRegisteredWithLakeFormation': False}]

All my unit tests have TableType returned from boto3. I just want to make sure I'm not missing anything.

@DyfanJones DyfanJones commented Mar 11, 2020

> So I tried to take a look at the missing TableTypes; here is the code: […]

Thanks for this. I'm not sure why Glue isn't storing your TableTypes. It seems the approach I have taken is the correct one. I will merge the branch into master. If you find any other issues, please raise a ticket. I'm happy to help :)

@DyfanJones DyfanJones closed this Mar 11, 2020
@AZhyrenkov AZhyrenkov commented Mar 11, 2020

To clarify a bit: all the tables with this issue were produced automatically via an in-house data-streaming engine, so the issue is at the table-creation stage.

It's interesting that boto3 allows creating a table without a TableType :)

Thank you so much for the quick resolution. Are you planning to merge this into master and release it to CRAN?

@DyfanJones DyfanJones commented Mar 11, 2020

Yeah, I didn't expect boto3 to return a different syntax. To be respectful of the CRAN volunteers, I try to push dev work once a month, so the next release will be in 6 days.

If you need it pushed sooner, please let me know.

DyfanJones added a commit that referenced this issue Mar 11, 2020:
Rstudio view bug fix when glue doesn't correctly store table metadata (#92)
@DyfanJones DyfanJones commented Mar 17, 2020

Version 1.8.0 has been pushed to CRAN.
