Clarify documentation to note that ingesting S3 access logs from a sub-prefix inside a prefix does not work #33
Description
I ran into an issue where we aggregate S3 distribution logs from a variety of sources into one account, and the logs are broken down by sub-prefix:
s3://<log-bucket>/s3_distribution_logs/<deployment-name-sub-prefix>/<log-file-name>
I was trying to run a single Glue job for s3://<log-bucket>/s3_distribution_logs/ to populate all <deployment-name-sub-prefix> logs into the same CONVERTED_TABLE_NAME. In this case, the RAW_TABLE_NAME Athena table was getting populated; the job would initially fail with the error below, then on subsequent runs would run "successfully". Unfortunately, I wouldn't get any logs into my CONVERTED_TABLE_NAME Athena table.
With continuous logging enabled, and a little tinkering, I tracked the issue down to _get_first_key_in_prefix():
line 128, in _get_first_key_in_prefix
first_object = response.get('Contents')[0].get('Key')
TypeError: 'NoneType' object is not subscriptable
The values going into self.s3_client.list_objects_v2(**query_params) were:
{'Bucket': 'reformated-log-bucket', 'Prefix': 's3_access/', 'MaxKeys': 10}
from the glue_jobs.json:
"S3_CONVERTED_TARGET":"s3://reformated-log-bucket/s3_access/"
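For what it's worth, the traceback is consistent with how list_objects_v2 behaves when nothing matches the prefix: the response simply has no 'Contents' key, so response.get('Contents') returns None and indexing it raises the TypeError. Below is a minimal sketch of a more defensive lookup, not the project's actual code. The function signature and the FakeS3Client stand-in are my own assumptions for illustration; only the list_objects_v2 parameters (Bucket, Prefix, MaxKeys) come from the values above.

```python
from typing import Any, Dict, Optional


def get_first_key_in_prefix(s3_client: Any, bucket: str, prefix: str) -> Optional[str]:
    """Return the first object key under the prefix, or None if the prefix is empty."""
    response: Dict[str, Any] = s3_client.list_objects_v2(
        Bucket=bucket, Prefix=prefix, MaxKeys=10
    )
    # 'Contents' is absent when no keys match the prefix, so fall back to an empty list.
    contents = response.get("Contents") or []
    if not contents:
        return None
    return contents[0].get("Key")


class FakeS3Client:
    """Hypothetical stand-in for a boto3 S3 client, so the sketch runs without AWS access."""

    def __init__(self, keys):
        self._keys = list(keys)

    def list_objects_v2(self, Bucket, Prefix, MaxKeys=1000):
        matches = [k for k in self._keys if k.startswith(Prefix)][:MaxKeys]
        response = {"KeyCount": len(matches)}
        if matches:
            # Like the real API, only include 'Contents' when something matched.
            response["Contents"] = [{"Key": k} for k in matches]
        return response
```

With an empty target such as s3_access/ this returns None instead of raising, which would at least let the caller log a clear "nothing found under this prefix" message.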
It's entirely unclear to me why, since I'm VERY new to both this project and Glue in general, but if I supply
"S3_SOURCE_LOCATION":"s3://<log-bucket>/s3_distribution_logs/<deployment-name-sub-prefix>/"
instead of:
"S3_SOURCE_LOCATION":"s3://<log-bucket>/s3_distribution_logs/"
...it just works, albeit with a smaller subset of data than I wanted. This may also be related to #30
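As a stopgap, one S3_SOURCE_LOCATION per sub-prefix could be generated by listing the immediate "folders" under the parent prefix. This is only a sketch of that idea under my own assumptions: list_sub_prefixes is a hypothetical helper, not part of this project, and it reads a single page of results (no pagination). It relies on list_objects_v2 returning CommonPrefixes when a Delimiter is supplied.

```python
from typing import Any, List


def list_sub_prefixes(s3_client: Any, bucket: str, prefix: str) -> List[str]:
    """Return s3:// URIs for the immediate sub-prefixes under the given prefix."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
    # 'CommonPrefixes' is absent when there are no sub-prefixes, so default to empty.
    common = response.get("CommonPrefixes") or []
    return [f"s3://{bucket}/{entry['Prefix']}" for entry in common]
```

Each returned URI could then be used as the S3_SOURCE_LOCATION for its own job entry, matching the per-sub-prefix configuration that worked for me.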