how do you specify blob storage source where year/month/day.csv storage is involved #6117
Comments
Thanks for your feedback! We will investigate and update as appropriate.
. . . and here is an example of the dataset json that worked with adf v1 but doesn't appear to work when I try it with adf v2
@myusrn Let us know if that helps!
Thanks for the follow-up. I had received a pointer to the first of the above two documents in response to a datafactoryv2.azure.com feedback submission I had also created. Using it I arrived at the following settings to enable parameterized file processing. The outstanding issue it left me with is that the files I needed to process lived in a /{Year}/{Month}/{Day}.csv blob storage structure where the date range was in the past, i.e. 2016/01/01.csv -> 2016/06/30.csv content. So a scheduled trigger couldn't be set up to fire daily for these dates in the past, and a Trigger Now where I provided a scheduledRunTime input parameter value would only allow me to process one of those past date files at a time. Is a tumbling window trigger setup what is needed to allow a person to kick off a pipeline that processes a date range of parameterized blob storage folder/file hierarchy content, or is there some pipeline FOR loop option?
@myusrn please use a Tumbling Window trigger instead for this and set the interval to 24h (daily), which can do the backfill you expect. For the difference between the schedule trigger and the tumbling window trigger, check this: https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers#trigger-type-comparison
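A tumbling window trigger definition for that kind of daily backfill could look roughly like the following sketch. The trigger, pipeline, and parameter names here are placeholders, not taken from this thread; only the start/end dates come from the scenario above.

```json
{
  "name": "DailyBackfillTrigger",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 24,
      "startTime": "2016-01-01T00:00:00Z",
      "endTime": "2016-07-01T00:00:00Z",
      "maxConcurrency": 10
    },
    "pipeline": {
      "pipelineReference": {
        "referenceName": "CopyPointsPipeline",
        "type": "PipelineReference"
      },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime"
      }
    }
  }
}
```

Each 24-hour window in the past fires one pipeline run, with the window's start time passed in as a pipeline parameter.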
I'm trying to accomplish the same, however it seems the parameter 'ScheduledRunTime' is not passed through from the trigger to the pipeline. When I create the trigger it recognizes the pipeline parameter 'ScheduledRunTime', and the trigger parameter is created with the value @trigger().scheduledTime.
@terpie when I used 'Trigger Now' for a Scheduled Trigger setup, I found that the parameter prompt was asking me to provide a datetime value, e.g. 01/01/2016 12:00 AM, that would get used in pipeline execution in lieu of that value being passed by an actual firing of the Scheduled Trigger setup. I'm about to test the Tumbling Window option to determine how I can use 'Trigger Now' to kick off date-based parameterized blob folder/file path reading for an entire date range in the past, e.g. 01/01/2016 12:00 AM -> 06/30/2016 11:59 PM.
Yes, I tried the same, but the (tumbling window) trigger should pass through the value @trigger().scheduledTime via the parameter 'ScheduledRunTime' to the pipeline, so the CSV dataset can use it to select the correct .csv file. It's this passing through of the parameter that I can't get to work...
@linda33wj i'm reviewing the document you referenced and am not seeing how using a Tumbling Window trigger addresses this need, any differently than a Scheduled Trigger, to execute a pipeline Copy activity for a range of past dates. In fact, that document suggests that with v2 the only way for me to accomplish this is to manually trigger the pipeline from a powershell script or c# program that issues rest api calls in a loop, passing in the scheduledRunTime parameter value for each of the past dates I need covered, 01/01/2016 12:00 AM -> 06/30/2016 11:59 PM. Alternatively, it appears one might be able to set up an Iterations and Conditionals | ForEach activity that calls my Copy activity, with the ForEach parameters configured to pass into the child Copy activity each of the past datetime values I need that Copy activity to execute for. Not sure if that is possible, but this aspect of v2 appears to be significantly different from what v1 supported with respect to executing an activity over a past date range.
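The call-in-a-loop approach described above boils down to enumerating each past date and deriving one blob path per day. A minimal Python sketch of that enumeration, using the /{Year}/{Month}/{Day}.csv layout from this thread (the `daily_dates` helper is hypothetical, and the actual ADF REST call is omitted):

```python
from datetime import date, timedelta

def daily_dates(start, end):
    """Yield each date from start through end inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

# One blob path per day for the backfill range discussed in the thread.
# 2016 is a leap year, so Jan 1 - Jun 30 covers 182 days / 182 files.
paths = [f"mylogdata/{d:%Y}/{d:%m}/{d:%d}.csv"
         for d in daily_dates(date(2016, 1, 1), date(2016, 6, 30))]
```

Each entry in `paths` would then be the target of one manually triggered pipeline run.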
@myusrn & @terpie In the meantime, I invite you to leave feedback for the product team here: If you post a link to your feedback here, I will leave a comment on your feedback with a link back to this issue. We really appreciate your interest in ADF and hope to improve your experience with the service.
thanks @jason-j-MSFT for the closure and next-steps suggestion on this matter. I created the product team feedback entry https://feedback.azure.com/forums/270578-data-factory/suggestions/33787015-story-for-running-a-pipeline-for-a-range-of-dates to try to ensure this scenario is covered and, if not, that it gets looked at.
@myusrn & @terpie I should have suggested this simpler path to help you understand this earlier: since you are using the copy data tool, please try the built-in scheduled copy, which will auto-generate all the parameters and the corresponding tumbling window trigger. After deployment, you can then check how things are chained together through the generic authoring UI. On the copy data tool's first page choose "Run regularly on schedule" -> select a start date and end date (I suggest you go with a shorter period for a first test), and check https://docs.microsoft.com/en-us/azure/data-factory/copy-data-tool#filter-data-in-an-azure-blob-folder for specifically how to configure a datetime-partitioned path in the copy data tool.
@linda33wj: thanks, now it works! I used the built-in 'copy data' as you proposed to create the dataset/pipeline/trigger with scheduling, and that one works. I compared the json code with the one I created before, and the difference is in the parameters. So I changed my earlier created dataset/trigger to use these parameters, and that pipeline now works too! in trigger: in pipeline: So apparently the trigger parameter 'ScheduledRunTime' set to @trigger().scheduledTime didn't work and did not pass a value through to the pipeline. That was all.
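The working combination described here hinges on which system variable the trigger's parameters block references. A minimal sketch of a tumbling window trigger's pipeline section (the pipeline reference name is illustrative, not copied from the thread):

```json
"pipeline": {
  "pipelineReference": {
    "referenceName": "CopyCsvPipeline",
    "type": "PipelineReference"
  },
  "parameters": {
    "ScheduledRunTime": "@trigger().outputs.windowStartTime"
  }
}
```

@trigger().scheduledTime is only populated by schedule triggers; a tumbling window trigger instead exposes its window boundaries as @trigger().outputs.windowStartTime and @trigger().outputs.windowEndTime.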
@terpie great to know! The tumbling window trigger has a different system variable name compared to the scheduled trigger, as mentioned at https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers#trigger-type-comparison. @myusrn hope you can make your case work as well. :)
I used the adfV2 overview | let's get started | copy data wizard and followed the instructions in the https://docs.microsoft.com/en-us/azure/data-factory/copy-data-tool#filter-data-in-an-azure-blob-folder link shared earlier to swap out the browse + choose a specific file selection with a datetime-parameterized blob storage source folder/file setting. I also confirmed that the "run regularly on a schedule" selection up front created an associated tumbling window trigger with the trigger run parameters windowStart = @trigger().outputs.windowStartTime and windowEnd = @trigger().outputs.windowEndTime. What I found using that process is that the blob storage linked service dataset was created with
and not the following settings that I used when manually creating this dataset based on earlier issue suggestions.
Regardless of the folderPath/fileName expression syntax used, it's currently cranking away on what appears to be each blob storage file existing in my manually defined tumbling window trigger date range 01/01/2016 12:00 am -> 06/30/2016 11:59 pm. I will know I have success when those entries in the monitoring window stop showing up and I can do a count of all rows created by the copy process to confirm I have the same 262080 rows that resulted when using the adfV1 based setup.

q1. Is the tumbling window, once activated via adfV2 "publish all", essentially firing a bunch of pipeline instances where @trigger().outputs.windowStartTime = <date> 12:00 am and @trigger().outputs.windowEndTime = <date> 11:59 pm, such that each firing of the pipeline is able to use pipeline().parameters.windowStart <date> to construct a blob storage date-parameterized path setting in /yyyy/MM/dd.csv format?

q2. What if I wanted to rerun this pipeline/trigger setup to debug resulting differences after making some edits, optionally clearing out all the target data sink results from an earlier run? To do that in adfV2, does one deactivate the trigger and select "publish all", then reactivate the trigger and select "publish all" again?

q3. It would seem that use of the pipeline "trigger now" is not viable in this scenario, except for the case where you want to debug execution for one day at a time, correct?

q4. When looking at monitor | <pipeline execution instance> | actions | view activity runs | actions input/output/details, I'm unable to find any details telling me which tumbling window @trigger().outputs.windowStartTime and @trigger().outputs.windowEndTime iteration a particular monitoring entry is associated with. Am I overlooking how to see that?
@terpie are you also taking the msft academy big data track [ https://aka.ms/bdMsa ], specifically the dat223.3x orchestrating big data with azure data factory course's lab 3, and trying to get an adfV2 based pipeline processing setup working for the game points blob2sql copy lab, in lieu of the adfV1 based one covered in the lab? I started down this path because I found the adfV1 approach really challenging when it came to terminating failed runs, restarting new runs, and making sense of what was currently executing versus done executing in the adfV1 monitoring views . . . so the hope was that adfV2, even in preview state, would be more approachable for this date-parameterized blob storage input scenario, which I expect is pretty common, especially when processing application log files.
Yes, I am following a training with edX (currently at Lab 3). Martin.
@terpie thanks for those details, that helps.

wrt q2. I found a similar experience with adfV1, in that if I managed to pause/stop and restart an existing pipeline definition it remembered what was executed on prior pass(es) and simply tried to execute what hadn't finished in prior runs or had been in a waiting-to-execute state. With adfV1 I could only seem to get a pipeline/trigger setup to re-execute from scratch by duplicating it in an entirely different adfV1 environment. With adfV2, to make mods and rerun the pipeline/trigger combination from scratch, i'm going to test removing the trigger association, creating a new replica of that trigger associated with the pipeline, and then "publish all" to get another test pass going from the beginning. If that works, I'd say it's an improvement over the adfV1 case of having to create a whole new adfV1 service replica of the setup.

wrt q3. So this would seem to be the appropriate way to debug/test your blob storage date-parameterized folder/file path input one day at a time, versus having a bad/incorrect setup try to execute across the entire date range you want the final pass to operate on.

wrt q4. Thanks for the pointer to where to find the trigger-passed parameter values associated with each monitor entry . . . I was completely overlooking that "Parameters" column data on the far right side and focusing instead just on what the "Actions" column "View Activity Runs" exposed.

wrt the msft academy / edX.org training lab: yes, that's the same course and lab 3 that I found challenging to debug and fix errors with using adfV1. If I'd had the above understanding of how to make key aspects of that exercise work in adfV2, I think it would have been easier and quicker to debug and initiate the final complete pass, which generated 262080 rows when I did it using adfV1 and 260640 rows when I did it using the adfV2 approach discussed above.
Since each day's set of records consists of 1440 rows and (262080 - 260640) / 1440 = 1, i'm guessing my adfV1 run perhaps had an extra day's worth of rows present, perhaps from a single-day execution test pass. I'd be curious what your "SELECT count(*) FROM dbo.points" produced in the adfV1 and adfV2 cases. @linda33wj I might suggest that the adfV2 documentation have an example of how to set up and debug/run the msft academy big data track [ https://aka.ms/bdMsa ] dat223.3x orchestrating big data with azure data factory course's lab 3; the lab 3 instructions pdf download can be found at https://aka.ms/edx-dat223.3x-lab3 .
I tested how to get a tumbling window trigger enabled pipeline like this to re-run the entire date range, and what I found I had to do was de-activate and delete the existing trigger and recreate it for things to re-run. From a product documentation perspective it might be helpful to have this documented in the appropriate place, and from a product design perspective it would be nice to have a "refire trigger" button that automated these steps for you. To delete adfV2 triggers see #6163, e.g.
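With the AzureRM.DataFactoryV2 PowerShell module of that era, the de-activate / delete / recreate cycle might look like the following sketch (resource group, factory, trigger, and file names are placeholders):

```powershell
# Stop and delete the existing trigger (a trigger must be stopped before removal)
Stop-AzureRmDataFactoryV2Trigger -ResourceGroupName "myRG" -DataFactoryName "myADFv2" -Name "DailyBackfillTrigger"
Remove-AzureRmDataFactoryV2Trigger -ResourceGroupName "myRG" -DataFactoryName "myADFv2" -Name "DailyBackfillTrigger"

# Recreate it from its JSON definition and start it to re-fire the whole window range
Set-AzureRmDataFactoryV2Trigger -ResourceGroupName "myRG" -DataFactoryName "myADFv2" -Name "DailyBackfillTrigger" -DefinitionFile ".\DailyBackfillTrigger.json"
Start-AzureRmDataFactoryV2Trigger -ResourceGroupName "myRG" -DataFactoryName "myADFv2" -Name "DailyBackfillTrigger"
```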
Also, reviewing the monitor pipeline runs executed by the tumbling window trigger, I found that I needed to specify the date range 01/01/2016 -> 07/01/2016 in order to get it to process the date-parameterized blob storage files from 01/01/2016 -> 06/30/2016. This was because the windowStart value for the last firing, when I used 06/30/2016 11:59 pm as the tumbling window trigger end date, was 06/29/2016 12:00 am, not 06/30/2016 12:00 am. An aspect of this scenario configuration that didn't seem intuitive.

Also, I found that a successful no-errors configuration pass over the 01/01/2016 -> 06/30/2016 files, consisting of 180 files each with 1440 lines, took 1 hour to start and finish using the copy data wizard tumbling window trigger settings of max concurrency = 10 | retry policy count = 3 | retry policy interval in seconds = 120. I did another pass using max concurrency = 50 (the default setting for a new tw trigger) | retry policy count = 1 | retry policy interval in seconds = 10, and that ripped through the same dataset in 2 minutes, more aligned with what I would have expected a big data etl service like adfV2 to deliver with this quantity of data and a simple 1:1 mapping into the target azure sql db. So documentation-wise i'll have to look whether there is guidance on this front, given that with my naïve big data experience i'd be inclined to run this in the future with max concurrency = total number of blob storage files to process, if they can all be processed concurrently w/o any dependency on ordering, in order to start/finish this pipeline/trigger processing very quickly and in turn optimize the developer fail/fix/retry cycle.
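The end-date behavior described above can be reproduced with a small Python model of tumbling window semantics. This is an illustrative sketch, assuming a window fires only once its full interval fits within [start, end], which matches the observation in this comment:

```python
from datetime import datetime, timedelta

def tumbling_windows(start, end, interval=timedelta(hours=24)):
    """Enumerate (windowStart, windowEnd) pairs: a window is produced
    only when the whole interval fits inside [start, end]."""
    windows, w = [], start
    while w + interval <= end:
        windows.append((w, w + interval))
        w += interval
    return windows

# Ending at 07/01 yields a final window that *starts* on 06/30
# (182 daily windows in total, since 2016 is a leap year) ...
full = tumbling_windows(datetime(2016, 1, 1), datetime(2016, 7, 1))
# ... while ending at 06/30 11:59 pm leaves the last windowStart at 06/29.
short = tumbling_windows(datetime(2016, 1, 1), datetime(2016, 6, 30, 23, 59))
```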
@myusrn In being able to re-run the trigger for the whole date range, I was hoping to find something in PowerShell. I am new in this area, so I just installed the Azure PowerShell package and was able to log into Azure and run some adf trigger commands like the above. So I think PowerShell, with json scripts to create/start/stop etc., is the way to go. Martin.
@terpie note that I simply installed the current vs17 15.6.4 bits with the workloads | web & cloud | azure development option enabled, and this appears to have installed the azure powershell modules for me. Presumably at some point we get adfV2 project template support as part of the azure development workload option that will enable not only defining but also executing and monitoring adfV2 pipeline/trigger setups from within the vs17 ide, given there is an adfV1 project support story in the azure sdk for the vs15 ide environment.
@myusrn my github account name is "concat"; it makes for some interesting emails to me at times, like this one, which has a code snippet that triggered an email to me!
@myusrn Indeed, the trigger with
fyi, I received a pointer from another channel that deleting adfV2 triggers can be accomplished in the UI using azure portal < adfV2 instance > | author & monitor | author | triggers [ bottom left, just below connections ] | < your trigger > | actions | delete
. . . a related documentation issue: how would one define an adfV2 tumbling window trigger to do a backfill scenario where the dates are every week on Saturday [ or every 7 days ] starting 2009/01/03 thru 2009/12/26? Using adfV1 I tried setting up a schedule of 2009/01/03 thru 2009/12/26 with "availability": { "frequency": "Week", "interval": 1 } or { "frequency": "Day", "interval": 7 }, but neither of those generated start/end tumbling window dates that aligned with my 2009/01/03 starting date and every Saturday after that. So I switched to "availability": { "frequency": "Day", "interval": 1 } with the plan of just accepting all the dates that would have "Waiting" status due to no data, which doesn't seem optimal. I've now moved over to adfV2 to try to solve this scenario of a tumbling window trigger backfill where the dates are every week on Saturday [ or every 7 days ] starting 2009/01/03 thru 2009/12/26.
@myusrn for V2, the scheduler becomes quite flexible. Authoring the trigger through the UI would be much easier: go to your pipeline -> new/edit trigger -> you can then choose the start date (2009/01/03), end date (optional field, 2009/12/26), recurrence (weekly), and also advanced settings there (e.g. every Sat & Sun).
@linda33wj thanks for the pointer. I am only seeing "Every Minute" and "Hourly" options in the new/edit trigger UI for TumblingWindow, but I do see the full set of options you mention in the new/edit trigger UI for Schedule, specifically Weekly with the advanced "Run on these days" option and Monthly with an even more extensive set of advanced recurrence setting options. I set up the TumblingWindow with "Hourly" and a 168 hour interval [ 7 days * 24 hours ] and that did the trick.

Also, it appears that the adfV2 Copy Data wizard UI only exposes Run Now and a combined experience for the Schedule and TumblingWindow trigger configuration options. In the trigger case, it appeared that if you picked weekly it pinned the start day to the Sunday of whatever date you enter, e.g. I entered 01/03/2009 as the start date, which is a Saturday, and the copy data wizard created a trigger set up with 12/28/2008 as the start date, which is the Sunday of the week containing the date I entered. So I had to re-create the copy data wizard trigger setup using the authoring environment's Triggers section to get one that started on the actual Saturday and ran every 7 days / 168 hours after that for the backfill dates I needed to cover.
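The 168-hour workaround described above might look like the following typeProperties sketch. The dates come from the scenario in this thread; the endTime is hedged on the earlier observation that a window only fires once its full interval fits before endTime, so covering the window that starts 2009/12/26 needs an end one full week later:

```json
"typeProperties": {
  "frequency": "Hour",
  "interval": 168,
  "startTime": "2009-01-03T00:00:00Z",
  "endTime": "2010-01-02T00:00:00Z",
  "maxConcurrency": 10
}
```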
I cannot use pipeline().TriggerTime when trying to grab a folder formatted as dd mm yyyy.
Hi everybody, we've no idea what is going wrong, because we use the same credentials as always. We've set up the trigger, pipeline, and copy activity as follows: in dataset: in pipeline: in copy activity:
@JosHaemers the error message should return the exact path the Copy activity was looking for and found missing; you can then double-check that the blob exists. Per your config, I'd suggest you double-check the folder path setting on the dataset, especially capital vs. small letters; note that the blob path is case-sensitive. #please-close
With adf v1, if you were specifying a blob storage source where log files were involved using a year/month/day.csv storage hierarchy, you could specify the container source folder as "mylogdata/{Year}/{Month}" and the file as "{Day}.csv", provided you included a partitionedBy section outlining how these token placeholders are calculated. I'm finding an adf v2 dataset based on a blob storage linked service doesn't like this. Has the syntax for folder and file hierarchies based on dates changed?
The error I get when I try this in adf v2 is
Activity Copy Points Data failed: Failure happened on 'Source' side. ErrorCode=UserErrorSourceBlobNotExist,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The required Blob is missing.
ContainerName: https://mystoracct.blob.core.windows.net/mylogdata, ContainerExist: True,
BlobPrefix: {Day}.csv, BlobCount: 0.,Source=Microsoft.DataTransfer.ClientLibrary,'
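For comparison, in adf v2 the v1 {Year}/{Month}/{Day} partitionedBy tokens are replaced by expression-based dataset properties. Below is a hedged sketch of equivalent typeProperties, assuming a datetime pipeline parameter named windowStart is supplied by the trigger; the parameter name and the direct pipeline().parameters reference are illustrative of the preview-era setup discussed in this thread, not a verbatim fix:

```json
"typeProperties": {
  "format": { "type": "TextFormat", "columnDelimiter": "," },
  "folderPath": {
    "value": "@concat('mylogdata/', formatDateTime(pipeline().parameters.windowStart, 'yyyy'), '/', formatDateTime(pipeline().parameters.windowStart, 'MM'))",
    "type": "Expression"
  },
  "fileName": {
    "value": "@concat(formatDateTime(pipeline().parameters.windowStart, 'dd'), '.csv')",
    "type": "Expression"
  }
}
```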