Clarify documentation to note that ingesting S3 access logs from a sub-prefix inside prefix does not work #33
A little bit of follow-up. I've been working with this a bunch. I was getting the […] error. So, even though the Athena table was built around […], the next step in the […]. Based on this Re:Post question, I added the […] option. I also changed […]. Ultimately though, I may be totally SOL on this project: when the data gets optimized into Parquet files, there is no backwards traceability to the original source log document. I tried a bunch of methods to include the raw Athena tables.
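The lineage problem described above (once logs are compacted into Parquet, each row loses its link to the object it came from) can be sketched outside of Glue entirely. This is a minimal illustration of the idea, not project code; the function and field names are hypothetical:

```python
# Minimal sketch of row-level lineage: stamp each record with the object
# it came from *before* compaction, so the link survives the rewrite.
# All names here are illustrative, not from the project.
def tag_with_source(objects):
    """objects: dict mapping S3 key -> list of raw log lines.

    Returns one record per line, each carrying its source key in a
    'log_object' field, mirroring the withColumn approach below.
    """
    records = []
    for key, lines in objects.items():
        for line in lines:
            records.append({"log_object": key, "raw": line})
    return records

objects = {
    "logs/2023-01-01-ABCD": ["GET /a 200", "GET /b 404"],
    "logs/2023-01-01-EFGH": ["PUT /c 200"],
}
records = tag_with_source(objects)
```

Once every record carries its source key, the compacted output can always be traced back, at the cost of one extra column.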
Hi @bbuechler - Thanks for opening this issue and providing so much information. Apologies I haven't been able to respond until now. For S3 Access Logs, the tool does assume that they're in their original location as written by the S3 service - in other words, not organized under a sub-folder or partition. I'm glad you were able to get it to work, but yes, there's no sort of lineage tracking. I presume you could potentially add […]. Unfortunately I haven't been able to maintain this project since publishing - it'd be nice to update it for newer versions of Glue, make the library more generic so it could run anywhere, or even just make it an Athena CTAS/INSERT INTO. But alas, not able to at the moment.
@dacort No worries, I understand the loss of traction. This was an awesome project to help me learn a bunch of the mechanics of Glue and PySpark, something I previously had no experience with. Hopefully the […]. I had considered […]. Anyway, I appreciate your work, and hopefully my input above can help someone else in the future, even if this project does not continue to evolve. FWIW, I've been running my ingests on Glue 3 to take advantage of autoscaling. Seems to be working without a hitch. Glue 4 gave me a deprecation warning, but seemed to execute just fine.
Awesome, glad to hear you got it working! S3 Access Logs, given their legacy, can be quite persnickety. :) Super glad to hear it's been helpful to you and thanks for the kudos. 🙏
One final note for anyone following in my footsteps. The […] ends up looking like this:

```python
# Retrieve the source data from the Glue catalog
source_data = self.glue_context.create_dynamic_frame.from_catalog(
    database=self.data_catalog.get_database_name(),
    table_name=self.data_catalog.get_table_name(),
    transformation_ctx="source_data",
    additional_options={
        "recurse": True,  # RECURSIVE!
        "groupFiles": "none",  # performance hit, but ensures input_file_name() works
    },
)
```

To inject the log file name, I just add it after the DynamicFrame is converted to a DataFrame, also within […]:

```python
from pyspark.sql.functions import input_file_name, lit

...
data_frame = data_frame.withColumn("log_object", lit(input_file_name()))
```
I ran into an issue where we aggregate S3 distribution logs from a variety of sources into one account, and the logs are broken down into sub-prefixes. I was trying to run a single Glue job for `s3://<log-bucket>/s3_distribution_logs/` to populate all `<deployment-name-sub-prefix>` logs into the same `CONVERTED_TABLE_NAME`. In this case, the `RAW_TABLE_NAME` Athena table was getting populated, but the job would initially error with the below error, then on subsequent runs would run "successfully". Unfortunately, I wouldn't get any logs into my `CONVERTED_TABLE_NAME` Athena table.

With continuous logging enabled, and a little tinkering, I tracked the issue down to `_get_first_key_in_prefix()`. The values going into `self.s3_client.list_objects_v2(**query_params)` were: […]

From the `glue_jobs.json`: […]

It's entirely unclear to me why, since I'm VERY new to both this project and Glue in general, but if I supply […] instead of […], it just works. Albeit with a smaller subset of data than I wanted. This may also be related to #30.
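The failure mode described above is consistent with how S3 listing works: `list_objects_v2` returns keys in lexicographic order, so probing a parent prefix for its "first key" always lands inside whichever sub-prefix sorts first, never a representative sample across deployments. A rough simulation of that behavior, with hypothetical keys and no actual AWS call:

```python
# Simulate probing a prefix for its first key, the way a MaxKeys=1
# ListObjectsV2 call would behave. Keys and helper are hypothetical,
# not code from this project.
def first_key_in_prefix(keys, prefix):
    """Return the lexicographically first key under `prefix`, or None."""
    matches = sorted(k for k in keys if k.startswith(prefix))
    return matches[0] if matches else None

keys = [
    "s3_distribution_logs/deploy-b/2023-01-01-00-00-00-EFGH",
    "s3_distribution_logs/deploy-a/2023-01-01-00-00-00-ABCD",
]

# Probing the parent prefix only ever sees the first-sorting sub-prefix:
first = first_key_in_prefix(keys, "s3_distribution_logs/")

# Probing a single deployment's sub-prefix behaves as expected:
scoped = first_key_in_prefix(keys, "s3_distribution_logs/deploy-b/")
```

This would explain why scoping the job to one sub-prefix "just works" while the parent prefix does not: the scoped probe finds a key for the data actually being ingested.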