Continuous Web Job frozen and preventing further QueueTriggers #590

Closed
ThreeScreenStudios opened this issue Oct 27, 2015 · 15 comments

@ThreeScreenStudios

Hi,

I have a continuous WebJob that executes with a QueueTrigger. Normally, if there is an exception or any other problem, the job fails, the queued message goes back into the queue, and the job tries to reprocess it (until it finishes or fails 3 times and goes into the poison queue).

However, I noticed on 10/26/2015 that no messages had been processed in the past day or so. I investigated in the Azure Portal and saw that the WebJob still had a "running" status. I clicked into the WebJob and discovered that the current execution was still going, and had been executing for the past 2 days. For some reason, the job did not time out or quit, and there were no further QueueTriggers even though multiple messages were backed up in the queue.

There were also no logs or exceptions/errors thrown (I have a decent amount of logging and exception handling in the method).

I aborted the current job execution via the Azure Portal, and once that happened, all of the backed up queue messages began processing immediately.

I can provide account details via email if needed (woot@threescreenstudios.com).

@ThreeScreenStudios
Author

A bit more requested info:

Web Jobs SDK: using Web Jobs SDK 1.0.6

JobHost setup:

var config = new JobHostConfiguration();
config.StorageConnectionString = config.DashboardConnectionString = ConfigurationManager.AppSettings["AzureStorageConnection"];
config.Queues.MaxPollingInterval = TimeSpan.FromSeconds(60);
config.Queues.BatchSize = 1;
config.Queues.MaxDequeueCount = 5;
var host = new JobHost(config);
host.RunAndBlock();

Processing Code:

public async static Task ProcessFromQueue([QueueTrigger("alertqueue")] string queuedMessage, CancellationToken token)
{
    if (token.IsCancellationRequested)
    {
        logger.Error("Cancellation requested");
        return;
    }

    logger.Info("Processing message:" + queuedMessage);
    try
    {
        var worker = IocContainer.Resolve<IQueueWorker>();
        await worker.DoWork(queuedMessage);
    }
    catch (Exception ex)
    {
        logger.Error("Error processing queued message:" + queuedMessage, ex);
        throw;
    }
    logger.Info("Finished processing message:" + queuedMessage);
}

@mathewc
Member

mathewc commented Oct 27, 2015

What is "logger" and how does it log? That's a bit of unknown code - it might be that there was no error, and your logger didn't write out the message. Where does it log to? Also, how do you guarantee timeouts occur in worker.DoWork?

I strongly suspect that somewhere AFTER we invoke your job function it is hanging/never returning. The SDK does not make any assumptions currently about how long your function may need to run so doesn't enforce any timeout. So if your code hangs, the job hangs indefinitely.

I'm considering adding a TimeoutAttribute (e.g. [Timeout("1:00:00")] to time out after 1 hour) that allows you to opt in to this behavior. We'd also have a global knob on JobHostConfiguration that you can set.

@ThreeScreenStudios
Copy link
Author

@mathewc - "logger" is a DI instance of NLog, it outputs to console as well as an integration with Raygun (an online error tracking system). The odd thing is that not even the initial log of "logger.Info("Processing message:" + queuedMessage);" was in the logs, which indicates to me that perhaps there was an error before the function could even fire?

Inside DoWork, any async calls are with RestSharp, which has a default 30 second timeout.

Having the TimeoutAttribute sounds like a good addition.

@mathewc
Member

mathewc commented Oct 28, 2015

@rustd @ThreeScreenStudios Ok, I've implemented TimeoutAttribute. Here's an example function that would hang for a day if TimeoutAttribute was not used:

[Timeout("00:00:10")]
public static async Task ProcessMessage(
    [QueueTrigger("samples-input")] string message,
    TextWriter log,
    CancellationToken cancellationToken)
{
    log.WriteLine("Begin ProcessMessage");

    await Task.Delay(TimeSpan.FromDays(1), cancellationToken);

    log.WriteLine("ProcessMessage complete");
}

Notes:

  • TimeoutAttribute can be applied at the class/method level. At class level, it applies to all functions in the class.
  • the timeout only encapsulates time spent in the user code (e.g. not any pre/post work the SDK does). I.e., the timer starts right before we call your function and ends after it returns.
  • to receive timeout notifications, a function must bind to a CancellationToken
  • if the configured timeout expires before the function returns, we cancel the token
  • the function should pass the cancellation token to any async work it does, and should monitor it for cancellation
  • There is also a JobHostConfiguration.FunctionTimeout global value. It's null by default, but you can set it. This value will be used for all functions, unless those functions override via class/method level TimeoutAttributes.
  • In a non async function, you can still bind to CancellationToken, you just have to periodically check it for cancellation, and if cancelled stop your work and call cancellationToken.ThrowIfCancellationRequested()
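A minimal sketch of that last note, assuming a synchronous function whose work can be split into short units. The function name `ProcessBatch`, the loop bound, and the `ProcessChunk` helper are hypothetical; only the `Timeout` attribute, the `QueueTrigger` binding, and the `CancellationToken` parameter come from the notes above.

```csharp
using System.IO;
using System.Threading;
using Microsoft.Azure.WebJobs;

public static class Functions
{
    // Hypothetical synchronous function. Per the notes above, the SDK cancels
    // the bound token once the configured timeout (here 30 seconds) expires.
    [Timeout("00:00:30")]
    public static void ProcessBatch(
        [QueueTrigger("samples-input")] string message,
        TextWriter log,
        CancellationToken cancellationToken)
    {
        for (int chunk = 0; chunk < 100; chunk++)
        {
            // Check between units of work; this throws
            // OperationCanceledException once the token is cancelled.
            cancellationToken.ThrowIfCancellationRequested();
            ProcessChunk(message, chunk);
        }

        log.WriteLine("ProcessBatch complete");
    }

    // Placeholder for one short, bounded unit of work (illustration only).
    private static void ProcessChunk(string message, int chunk)
    {
        Thread.Sleep(100);
    }
}
```

The check only helps if each unit of work returns in bounded time; a unit that never returns is never interrupted by the token.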

@ThreeScreenStudios I'll also point out that the reason your function hung and wouldn't process any more messages is because you have JobHostQueuesConfiguration.BatchSize set to 1. That means that only a single message is pulled per batch, and we won't pull another batch until that one is complete. Why do you have it set to 1? If you allowed multiple (as is the default), the one message might have hung, but others would continue to process.
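A rough configuration sketch combining the two suggestions above: leave the batch size at its default and set the global timeout. It assumes `FunctionTimeout` accepts a `TimeSpan`, as the description "null by default" suggests; the exact type may differ between SDK versions, and the 5-minute value is arbitrary.

```csharp
using System;
using Microsoft.Azure.WebJobs;

public static class Program
{
    public static void Main()
    {
        var config = new JobHostConfiguration();

        // Leave BatchSize at its default so one hung message does not
        // block every other message in the queue.
        // config.Queues.BatchSize = 1;  // only if messages must run one at a time

        // Assumption: FunctionTimeout is a nullable TimeSpan, as described above.
        // It applies to all functions unless a TimeoutAttribute overrides it.
        config.FunctionTimeout = TimeSpan.FromMinutes(5);

        var host = new JobHost(config);
        host.RunAndBlock();
    }
}
```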

@agnauck

agnauck commented Oct 28, 2015

I had the same problem several times in the last few days: triggered functions get stuck in the code below forever. I had a BatchSize of 32, and all of them got stuck after a while.

The Timeout attribute is a great solution for that, exactly what I am looking for.
Is there already a new build available? Or do I have to compile the sources myself?

public static async Task FtpToBlob(
    [QueueTrigger("ftp-download-file")] FtpToAzureBlobArgs ftpToAzureBlobArgs,
    string Filename,
    string FtpFolder,
    string SomeId,
    string CloudDir,
    [Blob("mycontainer/{CloudDir}/{SomeId}/{FileName}")] ICloudBlob output,
    TextWriter log)
{
    try
    {
        var uri = new Uri($"ftp://ftp.example.com/Foo/Bar/{FtpFolder}/{Filename}");
        FtpWebRequest request = (FtpWebRequest)WebRequest.Create(uri);
        request.Method = WebRequestMethods.Ftp.DownloadFile;
        request.Credentials = new NetworkCredential(FtpUser, FtpPass);
        FtpWebResponse response = (FtpWebResponse)request.GetResponse();

        await output.UploadFromStreamAsync(response.GetResponseStream());
        await log.WriteLineAsync("Downloaded: " + uri.ToString());
    }
    catch (Exception ex)
    {
        await log.WriteLineAsync(ex.StackTrace);
    }
    await log.WriteLineAsync("Finished");
}

@ThreeScreenStudios
Author

@mathewc - ah thanks for pointing out the batch size issue - is there any guidance on how to choose an optimal batch size?

Also, thanks for putting in the TimeoutAttribute; I think it will be quite helpful for many folks.

@mathewc
Member

mathewc commented Oct 28, 2015

@agnauck If all of your functions are getting stuck after a while, that indicates a problem in your code. To use the new TimeoutAttribute, you'll update your method signature to take the CancellationToken, and should then pass that to other async operations you initiate. No, there isn't a build out yet - I'll get one out today (on our myget feed) and let you guys know.
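As a sketch of that change applied to the FtpToBlob function posted above (not a tested fix): the signature gains a `CancellationToken`, the upload receives it, and the FTP request is aborted on cancellation because `FtpWebRequest` has no token-aware overload. The `FtpToAzureBlobArgs` shape and the 5-minute timeout are assumptions, and credentials are omitted.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage.Blob;

// Assumed shape of the queue message; the original type was not posted.
public class FtpToAzureBlobArgs
{
    public string Filename { get; set; }
    public string FtpFolder { get; set; }
    public string SomeId { get; set; }
    public string CloudDir { get; set; }
}

public static class Functions
{
    [Timeout("00:05:00")]
    public static async Task FtpToBlob(
        [QueueTrigger("ftp-download-file")] FtpToAzureBlobArgs ftpToAzureBlobArgs,
        string Filename,
        string FtpFolder,
        string SomeId,
        string CloudDir,
        [Blob("mycontainer/{CloudDir}/{SomeId}/{FileName}")] ICloudBlob output,
        TextWriter log,
        CancellationToken cancellationToken)
    {
        var uri = new Uri($"ftp://ftp.example.com/Foo/Bar/{FtpFolder}/{Filename}");
        var request = (FtpWebRequest)WebRequest.Create(uri);
        request.Method = WebRequestMethods.Ftp.DownloadFile;
        // Credentials omitted in this sketch.

        // Abort the request when the token is cancelled, since
        // GetResponseAsync does not accept a CancellationToken.
        using (cancellationToken.Register(request.Abort))
        using (var response = (FtpWebResponse)await request.GetResponseAsync())
        using (var stream = response.GetResponseStream())
        {
            // The storage upload does accept the token directly.
            await output.UploadFromStreamAsync(stream, cancellationToken);
        }

        await log.WriteLineAsync("Downloaded: " + uri);
    }
}
```

Unlike the original, this version lets exceptions escape instead of swallowing them, so a timed-out or failed download is reported as a failed invocation.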

@ThreeScreenStudios Well, the defaults are designed to be optimal (default is 16, max is 32). I was wondering why you dialed it back to 1.

@agnauck

agnauck commented Oct 28, 2015

@mathewc The code posted above is all the code I have in this WebJob. I will add the CancellationToken as suggested.

@rustd

rustd commented Oct 28, 2015

Regarding batchSize: this limit applies separately to each function that has a QueueTrigger attribute. If you don't want parallel execution for messages received on one queue, set the batch size to 1.
For more information, see https://azure.microsoft.com/en-us/documentation/articles/websites-dotnet-webjobs-sdk-storage-queues-how-to/

@mathewc
Member

mathewc commented Oct 28, 2015

Ok, the TimeoutAttribute feature is in. Please see the release notes for details, and for a link to a sample.

@agnauck @ThreeScreenStudios Can you guys please give this a try and verify that it meets your needs? Thanks. You can pull the latest bits from our myget feed (instructions here). Version 1.1.0-beta1-10149 includes the changes.

@mathewc mathewc closed this as completed Oct 28, 2015
@agnauck

agnauck commented Nov 5, 2015

Works perfectly. Thanks, this is a great new feature and very helpful for us.

@sakthigeek

sakthigeek commented Mar 20, 2018

@agnauck @ThreeScreenStudios @mathewc Hi guys. I have much the same situation: the web job hangs on a single process for hours, and even with extensive logging, nothing gets logged. There are no exceptions or errors either. It's as if the thread never reaches the code itself and hangs indefinitely. I am running a single-instance continuous web job with a restart time of 2 seconds. I also have similar continuous web jobs that are running fine. I have tried to restart it, rename it, delete it, and redeploy it, but nothing fixes the issue. I have rechecked the code multiple times, and it runs fine locally without any issues. What are the possible reasons for this to happen? Can anyone help with this?

@appalaraju

Hi,
I also faced the same issue. My web job reads messages from a topic. What I observed is that my web job suddenly stopped processing messages from the topic even though the topic kept receiving messages from the sender. I opened the logs of my web job and saw that it had been processing one message for "2 hours" with status "running". My web job has a custom error logging mechanism, but no errors were found. It seems the web job is not really processing that message but shows it as processing. After waiting 1 more hour, I aborted the message manually using the Azure web job logs web page, and then the web job started processing the remaining messages. How can I resolve this issue?

@vhatuncev

Got the same issue two days ago: the WebJob got stuck processing messages from the queue. It just sat there with the message: Never Finished. The underlying code makes database calls and other API calls, but it had been unchanged for a few months, and the bad thing is that this happened in production without any notifications, warnings, or failures.
For now I see that the Timeout attribute will solve this issue and can throw an exception if needed, but I'm wondering what can cause such issues.

BatchSize = 16, MaxDequeueCount = 2, MaxPollingInterval = 3 seconds.

@melenaos

melenaos commented Aug 23, 2019

TimeoutAttribute seems perfect for most of the problems discussed, but it has to be handled in many places in the job, just as with any CancellationToken.

My concern is that the "hung" part of the job might be a single unit that won't throw an exception when the CancellationToken is triggered.

foreach(var item in items){
   cancelToken.ThrowIfCancellationRequested();
   SingleUnitJobThatHungsForever();
}

How can we stop the webjob process execution after a specific timeout, in the same way we can from the Azure portal? Is there a way to kill the processing of a specific queue message without using TimeoutAttribute?
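One possible workaround for this case, sketched here with plain .NET tasks and not taken from the SDK: run the non-cooperative unit on a thread-pool thread and stop waiting for it after a deadline. The `Watchdog` helper and its names are hypothetical. It does not kill the stuck work; it only lets the calling function fail instead of hanging with it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Watchdog
{
    // Runs blocking work on a thread-pool thread and stops waiting after the timeout.
    // The abandoned work is NOT terminated; it keeps its thread until it returns
    // on its own or the process exits.
    public static async Task RunWithTimeoutAsync(Action work, TimeSpan timeout)
    {
        Task workTask = Task.Run(work);
        Task finished = await Task.WhenAny(workTask, Task.Delay(timeout));
        if (finished != workTask)
        {
            throw new TimeoutException("Work did not finish within " + timeout);
        }

        // Propagate any exception thrown by the work itself.
        await workTask;
    }
}

public static class Demo
{
    public static async Task Main()
    {
        try
        {
            // Stand-in for a unit of work that hangs forever.
            await Watchdog.RunWithTimeoutAsync(
                () => Thread.Sleep(Timeout.Infinite),
                TimeSpan.FromSeconds(1));
        }
        catch (TimeoutException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

Each abandoned unit still occupies a thread, so repeated hangs accumulate until the process is restarted; this is a mitigation, and finding the call that blocks remains the real fix.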

8 participants