This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Get error running initial job for S3 logs #22

Open
pkasson opened this issue Jul 21, 2020 · 6 comments

Comments

@pkasson

pkasson commented Jul 21, 2020

Using latest from this repo ...

The databases get created, but just after the access_optimized table is created, we receive this error:

glue_service_logs/utils.py", line 128, in _get_first_key_in_prefix
TypeError: 'NoneType' object has no attribute '__getitem__'
End of LogType:stdout

Full trace:

20/07/21 09:47:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/21 09:47:30 INFO MemoryStore: MemoryStore cleared
20/07/21 09:47:30 INFO BlockManager: BlockManager stopped
20/07/21 09:47:30 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/21 09:47:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/21 09:47:30 INFO SparkContext: Successfully stopped SparkContext
20/07/21 09:47:30 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User application exited with status 1)
20/07/21 09:47:30 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
20/07/21 09:47:30 INFO ApplicationMaster: Deleting staging directory hdfs://ip-172-32-99-27.ec2.internal:8020/user/root/.sparkStaging/application_1595322435097_0001
20/07/21 09:47:30 INFO ShutdownHookManager: Shutdown hook called
20/07/21 09:47:30 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/root/appcache/application_1595322435097_0001/spark-1ca0f7e1-0bc4-47d6-ab91-12879d18055f
20/07/21 09:47:30 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/root/appcache/application_1595322435097_0001/spark-1ca0f7e1-0bc4-47d6-ab91-12879d18055f/pyspark-11265b43-46c7-46a1-a043-8f182446bdf0
End of LogType:stderr

LogType:stdout
Log Upload Time:Tue Jul 21 09:47:32 +0000 2020
LogLength:2765
Log Contents:
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.169.254
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.job:Initial run, scanning S3 for partitions.
INFO:athena_glue_service_logs.catalog_manager:Creating database aws_service_logs
INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_raw
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): glue.us-east-1.amazonaws.com
INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_optimized
null_fields []
INFO:athena_glue_service_logs.converter:No data returned, skipping conversion.
INFO:athena_glue_service_logs.job:Initial run with source NullPartitioner, adding all partitions from S3.
INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): rodin-glue.s3.amazonaws.com
Parsing the YARN logs gives this error message: TypeError: 'NoneType' object has no attribute '__getitem__'
Traceback (most recent call last):
File "script_2020-07-21-09-45-54.py", line 4, in
job_run.convert_and_partition()
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/job.py", line 156, in convert_and_partition
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/job.py", line 137, in add_new_optimized_partitions
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/catalog_manager.py", line 89, in get_and_create_partitions
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/partitioners/date_partitioner.py", line 34, in build_partitions_from_s3
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_glue_service_logs/utils.py", line 80, in get_first_hivecompatible_date_in_prefix
File "/mnt/yarn/usercache/root/appcache/application_1595322435097_0001/container_1595322435097_0001_01_000001/athena_glue_converter_latest.zip/athena_
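For context, this kind of `TypeError` typically comes from indexing into a boto3 `list_objects_v2` response that contains no `Contents` key, which happens when nothing matches the prefix. A minimal sketch (the helper name and body here are assumptions for illustration, not the repo's actual code):

```python
def get_first_key_in_prefix(s3_client, bucket, prefix):
    """Return the key of the first object under `prefix`, or None.

    `s3_client` would be boto3.client('s3') in practice. When no objects
    match the prefix, the response has no 'Contents' key, so .get()
    returns None; subscripting that None raises
    TypeError: 'NoneType' object has no attribute '__getitem__'
    (the Python 2 spelling of "'NoneType' object is not subscriptable").
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    contents = response.get('Contents')  # None when the prefix is empty
    if contents is None:
        return None  # guard that avoids the crash seen in the trace
    return contents[0]['Key']
```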

@ghost

ghost commented Jan 12, 2021

I had the same problem. Changing the runtime to

Spark 2.4, Python 2 (Glue Version 1.0)

seemed to fix it.

@ryanrf-ac

I'm experiencing this same issue, and I've set the job to use Spark 2.4, Python 2 (Glue Version 1.0). Any suggestions or tips on debugging this would be greatly appreciated.
I've noticed this only happens on the first run, when creating the databases. After encountering the error I can run the job again and it succeeds. Well... sort of: it doesn't actually process any of the logs from the "raw" table, so nothing gets converted to Parquet. And without any errors, I'm not sure where it's failing.

@dacort
Contributor

dacort commented May 4, 2021

@ryanrf-ac Hey there, thanks for chiming in - based on the error message from the original report, it looks like no data got converted in the initial job.

INFO:athena_glue_service_logs.converter:No data returned, skipping conversion.

Did you see that error message as well? Can you provide some info on how the job was configured? Specifically:

  • The arguments provided to the Glue job
  • An example of the first few files in the source location

Thanks!

@ryanrf-ac

Hi @dacort. Thanks for getting back to me.
I do see that message:

INFO:athena_glue_service_logs.catalog_manager:Creating database table s3_access_optimized
null_fields []
INFO:athena_glue_service_logs.converter:No data returned, skipping conversion.
INFO:athena_glue_service_logs.job:Initial run with source NullPartitioner, adding all partitions from S3.

As for the arguments (I've removed the bucket name; the same bucket is used for both the source location and the converted target):

--converted_database_name aws_service_logs
--raw_database_name aws_service_logs
--raw_table_name s3_access_raw
--s3_converted_target s3://<bucket name>/converted/s3_access
--converted_table_name s3_access_optimized
--s3_source_location s3://<bucket name>/service=s3/
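For anyone configuring the job from a script rather than the console, these same arguments can be passed to Glue's `start_job_run`, where job arguments keep their `--` prefix as dictionary keys. A sketch, with placeholder bucket and job names:

```python
# Hypothetical sketch: the job arguments above, passed programmatically.
# The bucket name and job name are placeholders, not values from this issue.
job_arguments = {
    "--raw_database_name": "aws_service_logs",
    "--raw_table_name": "s3_access_raw",
    "--converted_database_name": "aws_service_logs",
    "--converted_table_name": "s3_access_optimized",
    "--s3_converted_target": "s3://<bucket name>/converted/s3_access",
    "--s3_source_location": "s3://<bucket name>/service=s3/",
}

# In practice, with credentials configured:
# import boto3
# glue = boto3.client("glue")
# glue.start_job_run(JobName="S3AccessLogConverter", Arguments=job_arguments)
```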

And here are some samples from the S3 access logs:

943d72306eb63a3592b82a2d25e70ce30b2bc75f63f8c23d572210847033be4e <bucket name> [11/Mar/2021:03:55:55 +0000] 76.78.94.88 arn:aws:iam::<AWS Account>:user/service-prod-common BPHSD3PPFKM071DR REST.GET.OBJECT ActivePerl/MSWin32-x64/20210220T005354Z/ActivePerl-5.28.1.0000-MSWin32-x64-719c477d.zip "GET /ActivePerl/MSWin32-x64/20210220T005354Z/ActivePerl-5.28.1.0000-MSWin32-x64-719c477d.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAQ5FYQM547I2EFPRW%2F20210311%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210311T035555Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 200 - 46616371 46616371 11885 69 "-" "Go-http-client/1.1" - HZWnnqunXit1g72dvo9WEL/xLoa80jhFfaAo1P9SzDLXglPOERELpdLAB4nWeTn+2+PcNy9OT24= SigV4 ECDHE-RSA-AES128-GCM-SHA256 QueryString <bucket name>.s3.amazonaws.com TLSv1.2
943d72306eb63a3592b82a2d25e70ce30b2bc75f63f8c23d572210847033be4e <bucket name> [19/Mar/2021:21:59:55 +0000] 129.6.14.8 arn:aws:iam::<AWS Account>:user/service-prod-common BHA83P3NAZTRFXB3 REST.GET.OBJECT ActiveTcl/MSWin32-x64/20201021T182615Z/ActiveTcl-8.6.10.0000-MSWin32-x64-90f8b13d.exe "GET /ActiveTcl/MSWin32-x64/20201021T182615Z/ActiveTcl-8.6.10.0000-MSWin32-x64-90f8b13d.exe?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAQ5FYQM547I2EFPRW%2F20210319%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210319T215952Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 200 - 16341112 16341112 274 137 "https://platform.activestate.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" - FP2TCU89nN2//Mos3f7wVVEh3+8VqxQCLZyZRwnVKb9UU2Sc+fFLLIiIM39VumSgMQn57IZuMNo= SigV4 ECDHE-RSA-AES128-GCM-SHA256 QueryString <bucket name>.s3.amazonaws.com TLSv1.2
943d72306eb63a3592b82a2d25e70ce30b2bc75f63f8c23d572210847033be4e <bucket name> [29/Mar/2021:04:06:22 +0000] 218.92.226.69 - XJCC355ZR6BQJ64H REST.GET.OBJECT ActivePerl/MSWin32-x64/20200528T195345Z/ActivePerl-5.28.1.0000-MSWin32-x64-b462fde1.exe "GET /ActivePerl/MSWin32-x64/20200528T195345Z/ActivePerl-5.28.1.0000-MSWin32-x64-b462fde1.exe?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJAFTYUXEZJ3HWLEQ/20200702/us-east-1/s3/aws4_request&X-Amz-Date=20200702T021019Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 403 AccessDenied 369 - 3 - "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36" - xWvyO2/BG1xpUiuhZqy4D44LX5bfuQF8ettxE1kl4jHz64h4vNmvsaCUlabKVOK414nwpT6DrZY= SigV4 ECDHE-RSA-AES128-SHA QueryString <bucket name>.s3.amazonaws.com TLSv1.2
943d72306eb63a3592b82a2d25e70ce30b2bc75f63f8c23d572210847033be4e <bucket name> [12/Apr/2021:15:52:47 +0000] 35.227.107.181 arn:aws:iam::<AWS Account>:user/service-prod-common RGHSVM64E6SF8H3B REST.GET.OBJECT ActivePerl/MSWin32-x64/20210220T005354Z/ActivePerl-5.28.1.0000-MSWin32-x64-719c477d.zip "GET /ActivePerl/MSWin32-x64/20210220T005354Z/ActivePerl-5.28.1.0000-MSWin32-x64-719c477d.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAQ5FYQM547I2EFPRW%2F20210412%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210412T155247Z&X-Amz-Expires=21600&X-Amz-SignedHeaders=host&X-Amz-Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX HTTP/1.1" 200 - 46616371 46616371 467 30 "-" "Go-http-client/1.1" - djl+Ff4NgHP9yPhx28Tm2pwGy7EL1r6+W0rPAsbNhTJhqCWy1Gd1zzoWZUPgA1J78wn4j2ZtM64= SigV4 ECDHE-RSA-AES128-GCM-SHA256 QueryString <bucket name>.s3.amazonaws.com TLSv1.2
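S3 server access log lines like the samples above are space-delimited, with a bracketed timestamp and a quoted request string. A minimal parser sketch for the leading fields (an illustration only, not the serde the raw table actually uses):

```python
import re

# Simplified pattern for the leading fields of an S3 server access log line:
# bucket owner, bucket, [timestamp], remote IP, requester, request ID,
# operation, key, "request URI", HTTP status, error code.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) \[([^\]]+)\] (\S+) (\S+) (\S+) (\S+) (\S+) "([^"]*)" (\S+) (\S+)'
)

def parse_access_log_line(line):
    """Parse one access log line into a dict, or return None on no match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    keys = ("bucket_owner", "bucket", "time", "remote_ip", "requester",
            "request_id", "operation", "key", "request_uri", "status",
            "error_code")
    return dict(zip(keys, match.groups()))
```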

@dacort
Contributor

dacort commented May 6, 2021

@ryanrf-ac Hm, ok everything looks fine there. That message indicates it's not seeing any data in the source table. A couple followup questions:

  • Are the access logs directly under the prefix s3://<bucket name>/service=s3/? Or is there another prefix that they might be in?
  • Can you query the s3_access_raw table in Athena? Do any results get returned?
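A quick way to check the first question is to list a few keys under the source prefix. A sketch (`s3_client` would be `boto3.client('s3')`; the function name is made up for this example):

```python
def list_first_keys(s3_client, bucket, prefix, limit=5):
    """Return up to `limit` object keys under `prefix`, to confirm the
    access logs actually live under the configured --s3_source_location."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):  # tolerate empty pages
            keys.append(obj["Key"])
            if len(keys) >= limit:
                return keys
    return keys
```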

@ryanrf-ac

@dacort The access logs are under another prefix: s3://<bucket name>/service=s3/bucket=<bucket accessed>/06-05-2021-alongstring. But I did test specifying a more specific path (e.g. s3://<bucket name>/service=s3/bucket=<bucket accessed>) as the --s3_source_location.

Yes, I can query the s3_access_raw table in Athena. But the table isn't like other tables I've created using crawlers. I've tried using the DynamicFrame from_catalog method from a Glue ETL job on that table and can't read any values from the table.
