
get_result_from_job_id AssertionError while initializing redeployed LS/ML NER backend #117

Closed
wojnarabc opened this issue May 30, 2022 · 23 comments
Labels: bug (Something isn't working)

Comments

wojnarabc commented May 30, 2022

I am trying to deploy the NER example model trained on my local machine, along with the Label Studio project, to another machine. I've gone through the following steps:

  1. Recreated the Label Studio and ML backend environments on the target machine to match the source machine.
  2. Copied the folder with the model itself (a folder named with just integers) into the target machine's ML backend folder.
  3. Exported the project content (data, annotations, and predictions) to JSON through the Label Studio API (using the ...export?exportType=JSON&download_all_tasks=true endpoint; a sketch of this call is at the end of this comment).
  4. Imported the project JSON file into the newly created Label Studio project.

When trying to initialize and pair LS and the ML backend on the new machine, I am getting:

    [2022-05-30 10:18:56,133] [ERROR] [label_studio_ml.model::get_result_from_last_job::128] 1647350146 job returns exception:
    Traceback (most recent call last):
      File "/Users/user/Projects/label-studio-ml-backend/label_studio_ml/model.py", line 126, in get_result_from_last_job
        result = self.get_result_from_job_id(job_id)
      File "/Users/user/Projects/label-studio-ml-backend/label_studio_ml/model.py", line 108, in get_result_from_job_id
        assert isinstance(result, dict)
    AssertionError

and it keeps repeating for each job.

Should any additional steps be performed when deploying the project/model to another environment?

I've tried the following LS versions (1.1.1, my initial one, and 1.4.1post1, the most recent one) with the most current ML backend code base, using Python 3.8 and macOS for both source and target environments.
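For reference, here is a minimal sketch of the export call from step 3, assuming the standard /api/projects/<id>/export endpoint; the host, project ID, and API token below are placeholders:

import requests

# Placeholders: adjust to your Label Studio host, project ID, and API token.
LS_HOST = 'http://localhost:8080'
PROJECT_ID = 1
API_TOKEN = '<your-api-token>'

response = requests.get(
    f'{LS_HOST}/api/projects/{PROJECT_ID}/export',
    params={'exportType': 'JSON', 'download_all_tasks': 'true'},
    headers={'Authorization': f'Token {API_TOKEN}'},
)
response.raise_for_status()

# Save the exported tasks (data, annotations, and predictions) so the file
# can be re-imported into the Label Studio project on the new machine.
with open('project_export.json', 'wb') as f:
    f.write(response.content)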

KonstantinKorotaev (Contributor) commented Jun 1, 2022

Hi @wojnarabc
This error indicates that the training of your model ended without any concrete result.
The ner example doesn't have a fit method, but you can add one like this:

def fit(self, completions, workdir=None, **kwargs):
    import random
    return {'random': random.randint(1, 10)}

and the error should disappear.
I will add this stub to the ner example in a future release.

If you are using TransformersBasedTagger, please check that the fit method ends with an appropriate result.
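For illustration, a minimal sketch of a fit method that ends with a usable result; the _train_and_save helper is hypothetical and stands in for the real training code in ner.py:

def fit(self, completions, workdir=None, **kwargs):
    # Hypothetical helper: train on the collected completions, save the model
    # into workdir, and return the path it was saved to.
    model_path = self._train_and_save(completions, workdir)

    # The returned dict is written to the job folder as the job result and is
    # later exposed as self.train_output; returning anything else (or nothing)
    # leads to the AssertionError above.
    return {'model_path': model_path}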

@wojnarabc (Author)

Hello @KonstantinKorotaev, there is a fit method in the example at ...examples/ner/ner.py, starting at line 461. TransformersBasedTagger is part of it.

jrdalenberg commented Jul 18, 2022

I'm running into the exact same problem with my custom backend.

I am in the process of upgrading my system to the latest LS and backend. Everything was working fine with LS 1.1.1 and the backend from a year ago.

After training, another job is sent for some reason, and then train_output is cleared, causing the backend to lose knowledge of the last trained model.

I already set LABEL_STUDIO_ML_BACKEND_V2_DEFAULT = True

 'train_output': {'model_path': '././my_backend/5.1655881660/1658156596'},
 'value': 'image'}
[2022-07-18 17:03:37,760] [INFO] [werkzeug::_log::225] 192.168.123.133 - - [18/Jul/2022 17:03:37] "POST /train HTTP/1.1" 201 -
[2022-07-18 17:03:37,781] [INFO] [werkzeug::_log::225] 192.168.123.133 - - [18/Jul/2022 17:03:37] "GET /health HTTP/1.1" 200 -
[2022-07-18 17:03:37,787] [ERROR] [label_studio_ml.model::get_result_from_last_job::130] 1658156583 job returns exception: 
Traceback (most recent call last):
  File "/home/USER/.virtualenvs/ls-1.5/lib/python3.8/site-packages/label_studio_ml/model.py", line 128, in get_result_from_last_job
    result = self.get_result_from_job_id(job_id)
  File "/home/USER/.virtualenvs/ls-1.5/lib/python3.8/site-packages/label_studio_ml/model.py", line 110, in get_result_from_job_id
    assert isinstance(result, dict)
AssertionError

jrdalenberg commented Jul 25, 2022

To be clearer: I based my custom backend on pytorch_transfer_learning.py. I had to set LABEL_STUDIO_ML_BACKEND_V2_DEFAULT = True in model.py, because otherwise I ran into the issue described in #118.

Now, when I monitor train_output after training, Label Studio initializes the backend with the correct train_output. But immediately after, another initialization event follows with an empty train_output.

I made an ugly workaround for this issue using an OS environment variable that saves the location of the last trained model locally. Note that this causes issues if users switch between projects, and it won't redeploy an existing model.

# ...

import os

from label_studio_ml import model
model.LABEL_STUDIO_ML_BACKEND_V2_DEFAULT = True

# ...

class ImageClassifierAPI(model.LabelStudioMLBase):
    def __init__(self, label_config=None, train_output=None, **kwargs):
        super(ImageClassifierAPI, self).__init__(**kwargs)

        # ...

        if self.train_output:
            print(f"trying to use {self.train_output['model_path']} as model path")
            self.model = ImageClassifier(self.classes, self.boxType)
            self.model.load(self.train_output['model_path'])
        elif os.environ.get("LAST_TRAINED_MODEL") is not None:
            model_path = os.environ.get("LAST_TRAINED_MODEL")
            print(f"trying to use {model_path} as model path")
            self.model = ImageClassifier(self.classes, self.boxType)
            self.model.load(model_path)
        else:
            self.model = ImageClassifier(self.classes, self.boxType)

    # ...

    def fit(self, annotations, workdir=None, batch_size=12, num_epochs=100, **kwargs):
        self.model.train(annotations, workdir, self.boxType, batch_size=batch_size, num_epochs=num_epochs)

        # Save the workdir of the last trained model in the environment
        os.environ["LAST_TRAINED_MODEL"] = workdir

        return {'model_path': workdir}

Having to manually set LABEL_STUDIO_ML_BACKEND_V2_DEFAULT = True and Label Studio losing the returned dict from the fit function is a bit awkward to work with, though.

makseq added the bug label on Jul 26, 2022
@KonstantinKorotaev (Contributor)

@jrdalenberg
The LABEL_STUDIO_ML_BACKEND_V2_DEFAULT variable tells your ML backend that you are using the active learning cycle. Starting from Label Studio version 1.4.1, training is invoked via webhooks.
Please check this documentation.

> def fit(self, annotations, workdir=None, batch_size=12, num_epochs=100, **kwargs):
>     self.model.train(annotations, workdir, self.boxType, batch_size=batch_size, num_epochs=num_epochs)

Your fit method seems to be expecting annotations, but from version 1.4.1 it receives an event instead of annotations. Maybe this is what leads to the empty train_output.

If you want to switch this logic off, disable the webhook in LS.
Check this example if you want to use the active learning cycle.
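For illustration, a minimal sketch of a fit method that tolerates the webhook-driven call described above; the class name is a placeholder, and the exact shape of the event payload is an assumption based on this thread rather than a documented signature:

import logging

from label_studio_ml.model import LabelStudioMLBase

logger = logging.getLogger(__name__)


class WebhookAwareBackend(LabelStudioMLBase):  # placeholder example class

    def fit(self, event, workdir=None, **kwargs):
        # Log what actually arrives so you can see whether you received a
        # webhook event payload or an iterator of annotations.
        logger.info('fit called with event=%s, kwargs=%s', event, list(kwargs))

        # ... retrieve the annotations via the Label Studio API/SDK and train ...

        # Always finish with a plain dict so the job result is saved and
        # self.train_output is populated on the next initialization.
        return {'model_path': workdir}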

@jrdalenberg

> Your fit method seems to be expecting annotations, but from version 1.4.1 it receives an event instead of annotations. Maybe this is what leads to the empty train_output.

I see. It’s still odd that train_output is returned once and then cleared right after, though.

> Check this example if you want to use the active learning cycle.

Thanks! I did not see this in the object recognition examples. I’ll test it out after my vacation 😄

@idreamerhx

Same bug here:

/opt/label-studio-ml-backend/coco-detector/mmdetection.py(137)fit()
-> logger.info(f'tasks={tasks}, workdir={workdir}, kwargs={kwargs}')
(Pdb) bt
/usr/lib/python3.10/threading.py(966)_bootstrap()
-> self._bootstrap_inner()
/usr/lib/python3.10/threading.py(1009)_bootstrap_inner()
-> self.run()
/usr/lib/python3.10/threading.py(946)run()
-> self._target(*self._args, **self._kwargs)
/usr/lib/python3.10/socketserver.py(683)process_request_thread()
-> self.finish_request(request, client_address)
/usr/lib/python3.10/socketserver.py(360)finish_request()
-> self.RequestHandlerClass(request, client_address, self)
/usr/lib/python3.10/socketserver.py(747)__init__()
-> self.handle()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/werkzeug/serving.py(342)handle()
-> BaseHTTPRequestHandler.handle(self)
/usr/lib/python3.10/http/server.py(425)handle()
-> self.handle_one_request()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/werkzeug/serving.py(374)handle_one_request()
-> self.run_wsgi()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/werkzeug/serving.py(319)run_wsgi()
-> execute(self.server.app)
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/werkzeug/serving.py(308)execute()
-> application_iter = app(environ, start_response)
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/flask/app.py(2464)__call__()
-> return self.wsgi_app(environ, start_response)
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/flask/app.py(2447)wsgi_app()
-> response = self.full_dispatch_request()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/flask/app.py(1950)full_dispatch_request()
-> rv = self.dispatch_request()
/opt/pyvenv-labelstudio-ml-backend/lib/python3.10/site-packages/flask/app.py(1936)dispatch_request()
-> return self.view_functions[rule.endpoint](**req.view_args)
/opt/label-studio-ml-backend/label_studio_ml/exceptions.py(39)exception_f()
-> return f(*args, **kwargs)
/opt/label-studio-ml-backend/label_studio_ml/api.py(93)_train()
-> job = _manager.train(annotations, project, label_config, **params)
/opt/label-studio-ml-backend/label_studio_ml/model.py(711)train()
-> job_result = cls.train_script_wrapper(
/opt/label-studio-ml-backend/label_studio_ml/model.py(667)train_script_wrapper()
-> train_output = m.model.fit(data_stream, workdir, **train_kwargs)
/opt/label-studio-ml-backend/coco-detector/mmdetection.py(137)fit()
-> logger.info(f'tasks={tasks}, workdir={workdir}, kwargs={kwargs}')

[2022-08-19 03:17:21,961] [ERROR] [label_studio_ml.model::get_result_from_last_job::131] 1660820149 job returns exception:
Traceback (most recent call last):
  File "/opt/label-studio-ml-backend/label_studio_ml/model.py", line 129, in get_result_from_last_job
    result = self.get_result_from_job_id(job_id)
  File "/opt/label-studio-ml-backend/label_studio_ml/model.py", line 111, in get_result_from_job_id
    assert isinstance(result, dict)
AssertionError

@idreamerhx

# train, see https://labelstud.io/guide/ml_create.html
def fit(self, tasks, workdir=None, **kwargs):
    # Retrieve the annotation ID from the payload of the webhook event
    # Use the ID to retrieve annotation data using the SDK or the API
    # Do some computations and get your model
    # import pdb; pdb.set_trace()
    logger.info(f'tasks={tasks}, workdir={workdir}, kwargs={kwargs}')
    # JSON dictionary with trained model artifacts that you can use later in code with self.train_output
    return {
        'checkpoints': '3.1660708258/1660885087/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth',
        'model_file': '3.1660708258/1660885087/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth',
        'model_version': '3.1660708258/1660885087',
        'classes': 80,
    }

Load new model from: /opt/mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
load checkpoint from local path: /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
Load new model from: /opt/mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
load checkpoint from local path: /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
[2022-08-19 05:06:31,709] [INFO] [werkzeug::_log::225] 192.168.1.7 - - [19/Aug/2022 05:06:31] "POST /train HTTP/1.1" 201 -
[2022-08-19 05:06:32,102] [INFO] [werkzeug::_log::225] 192.168.1.7 - - [19/Aug/2022 05:06:32] "GET /health HTTP/1.1" 200 -
Load new model from: /opt/mmdetection/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
load checkpoint from local path: /opt/mmdetection/checkpoints/faster_rcnn/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
[2022-08-19 05:06:33,168] [INFO] [werkzeug::_log::225] 192.168.1.7 - - [19/Aug/2022 05:06:33] "POST /setup HTTP/1.1" 200 -

It seems the return value doesn't work.

@KonstantinKorotaev (Contributor)

Hi @idreamerhx

> It seems the return value doesn't work.

I don't see any error in your log; what do you mean by it not working?

> assert isinstance(result, dict)
> AssertionError

This means that your training procedure didn't return a result. Could you please check what your fit method returns?

sjtuytc commented Sep 3, 2022

This seems to be a bug. The results should be a list of dicts, but it tells me that's wrong.

@KonstantinKorotaev (Contributor)

@sjtuytc could you please clarify what you mean?

sjtuytc commented Sep 7, 2022

I mean this assertion is too restrictive. A dummy result triggers the assertion error, which shouldn't be the desired behavior.

@seanswyi

I'm also getting the same assertion error problem. Does the model always have to return a dictionary? I'm following the examples provided in Label Studio and it seems like lists of dicts are also allowed: https://github.com/heartexlabs/label-studio-ml-backend/blob/db6d1d6d3efde1db503532f1f77a6977a7f100d2/label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py#L93

@KonstantinKorotaev (Contributor)

Hi @seanswyi
Predictions should be a list of dicts.
The assertion error is about a training job failure.
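To make the distinction concrete, a minimal sketch of the two return shapes; the class is a placeholder and the bodies are stubs:

from label_studio_ml.model import LabelStudioMLBase


class ShapeSketchBackend(LabelStudioMLBase):  # placeholder example class

    def predict(self, tasks, **kwargs):
        # Predictions: a list of dicts, one entry per task.
        return [{'result': [], 'score': 0.0} for _ in tasks]

    def fit(self, annotations, workdir=None, **kwargs):
        # Training output: a single dict, saved as the job result and later
        # available as self.train_output.
        return {'model_path': workdir}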

@ZhangYuef (Contributor)

Errors after setting LABEL_STUDIO_ML_BACKEND_V2_DEFAULT=True in model.py:

[2023-03-14 10:37:49,673] [ERROR] [label_studio_ml.model::get_result_from_last_job::132] 1678761465 job returns exception: Job 1678761465 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
Traceback (most recent call last):
  File "label-studio-ml-backend/label_studio_ml/model.py", line 130, in get_result_from_last_job
    result = self.get_result_from_job_id(job_id)
  File "label-studio-ml-backend/label_studio_ml/model.py", line 111, in get_result_from_job_id
    assert isinstance(result, dict), f"Job {job_id} was finished unsuccessfully. No result was saved in job folder."
AssertionError: Job 1678761465 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
[2023-03-14 10:37:49,673] [ERROR] [label_studio_ml.model::get_result_from_last_job::132] 1678761463 job returns exception: Job 1678761463 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
Traceback (most recent call last):
  File "label-studio-ml-backend/label_studio_ml/model.py", line 130, in get_result_from_last_job
    result = self.get_result_from_job_id(job_id)
  File "label-studio-ml-backend/label_studio_ml/model.py", line 111, in get_result_from_job_id
    assert isinstance(result, dict), f"Job {job_id} was finished unsuccessfully. No result was saved in job folder."
AssertionError: Job 1678761463 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.

How should we work around this?

@KonstantinKorotaev (Contributor)

> The assertion error is about a training job failure.

Do you have folders with job results 1678761463 and 1678761465?

TrueWodzu commented Mar 17, 2023

Hello, I have the same problem:

I've created a custom backend based on the example in mmdetection.py. I don't use active learning (I think; I did not set that up). Every time I switch to the next image for annotation, I get this output in the console:

[2023-03-17 09:30:05,107] [ERROR] [label_studio_ml.model::get_result_from_last_job::132] 1679041803 job returns exception: Job 1679041803 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.
Traceback (most recent call last):
  File "d:\coding\developement\python\label-studio-ml-backend\label_studio_ml\model.py", line 130, in get_result_from_last_job
    result = self.get_result_from_job_id(job_id)
  File "d:\coding\developement\python\label-studio-ml-backend\label_studio_ml\model.py", line 111, in get_result_from_job_id
    assert isinstance(result, dict), f"Job {job_id} was finished unsuccessfully. No result was saved in job folder." \
AssertionError: Job 1679041803 was finished unsuccessfully. No result was saved in job folder.Please clean up failed job folders to remove this error from log.

The exception is caused by _get_result_from_job_id returning None, because os.path.exists(result_file) returns False:

    def _get_result_from_job_id(self, job_id):
        """
        Return job result or {}
        @param job_id: Job id (also known as model version)
        @return: dict
        """
        job_dir = self._job_dir(job_id)
        if not os.path.exists(job_dir):
            logger.warning(f"=> Warning: {job_id} dir doesn't exist. "
                           f"It seems that you don't have specified model dir.")
            return None
        result_file = os.path.join(job_dir, self.JOB_RESULT)
        if not os.path.exists(result_file):  # <--- THIS CHECK RETURNS FALSE
            logger.warning(f"=> Warning: {job_id} dir doesn't contain result file. "
                           f"It seems that previous training session ended with error.")
            return None
        logger.debug(f'Read result from {result_file}')
        with open(result_file) as f:
            result = json.load(f)
        return result

What is strange to me is that I do have the job_result.json file in the required directory, but it probably isn't there yet when the check occurs? It must be created later. The contents of the file are an empty JSON object.

And here is my predict() method:

    def predict(self, tasks, **kwargs):
        assert len(tasks) == 1
        task = tasks[0]

        image_url = self._get_image_url(task)
        image_path = self.get_local_path(image_url)
        image = cv2.imread(image_path)
        output = Inference.infer_from_image(self.model, image)
        indices, boxes, confidences = Inference.filter_outputs(self.config, image, output)

        results = []
        all_scores = []
        for i in indices:
            score = confidences[i]
            # print(f'{confidences[i]:.2f}')
            x, y, width, height = self.convert_to_ls(boxes[i][0], boxes[i][1], boxes[i][2], boxes[i][3], image.shape[1], image.shape[0])
            results.append({
                'from_name': self.from_name,
                'to_name': self.to_name,
                'type': 'rectanglelabels',
                'value': {
                    'rectanglelabels': ['head'],
                    'x': x,
                    'y': y,
                    'width': width,
                    'height': height
                },
                'score': float(score)
            })
            all_scores.append(score)
        # Compute the average score once, after the loop (this also avoids an
        # undefined avg_score when there are no detections).
        avg_score = sum(all_scores) / max(len(all_scores), 1)
        return [{
            'result': results,
            'score': float(avg_score)
        }]

But I don't think this is due to predict(); as I said earlier, the check if not os.path.exists(result_file): is failing for some reason.

@KonstantinKorotaev (Contributor)

Hi @TrueWodzu

> What is strange to me is that I do have the job_result.json file in the required directory, but it probably isn't there yet when the check occurs? It must be created later. The contents of the file are an empty JSON object.

This file is created during the training session.
If it is empty, the training session ended with an error.
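As a quick diagnostic, a small sketch that mimics the check in _get_result_from_job_id; the model directory and job ID below are placeholders taken from the log above:

import json
import os

# Placeholders: point these at your ML backend model dir and the failing job ID.
job_dir = os.path.join('path/to/model_dir', '1679041803')
result_file = os.path.join(job_dir, 'job_result.json')

if not os.path.exists(result_file):
    print('No job_result.json: the training job never wrote a result.')
else:
    with open(result_file) as f:
        content = f.read().strip()
    # An empty file or an empty JSON object means fit() returned nothing useful.
    print('job_result.json content:', content or '<empty>')
    print('parsed:', json.loads(content) if content else None)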

TrueWodzu commented Mar 17, 2023

Hi @KonstantinKorotaev, thank you for your answer. So is this a bug? It seems like the file is required, but at the same time some of us don't want to train. How can I prevent this from happening?

@TrueWodzu

@KonstantinKorotaev I don't have training turned on, so shouldn't _get_result_from_job_id(self, job_id) handle the case where training is off and not throw an exception?

A simple change in the code resolves the exception:

def _get_result_from_job_id(self, job_id):
    """
    Return job result or {}
    @param job_id: Job id (also known as model version)
    @return: dict
    """
    job_dir = self._job_dir(job_id)
    if not os.path.exists(job_dir):
        logger.warning(f"=> Warning: {job_id} dir doesn't exist. "
                       f"It seems that you don't have specified model dir.")
        return None
    result_file = os.path.join(job_dir, self.JOB_RESULT)
    if not os.path.exists(result_file):
        logger.warning(f"=> Warning: {job_id} dir doesn't contain result file. "
                       f"It seems that previous training session ended with error.")
        return {}  # <---- A simple change here from None to an empty dict solves the issue.
    logger.debug(f'Read result from {result_file}')
    with open(result_file) as f:
        result = json.load(f)
    return result

@KonstantinKorotaev (Contributor)

Hi @TrueWodzu
I have added an environment variable to integrate your changes in a branch.
Please check whether it's an appropriate solution for you.

@TrueWodzu

Hi @KonstantinKorotaev, many thanks for the change! While it definitely works for me, I am just wondering whether this is the correct approach. What I mean is, is it really required right now to have a fit method defined? Because in the Label Studio project I have training turned off:

[screenshot: project settings with model training turned off]

So if I have training turned off, then there should be no exception about result_file?

@KonstantinKorotaev (Contributor)

Hi @TrueWodzu

> So if I have training turned off, then there should be no exception about result_file?

Yes, but the intention of this error is to surface a message when somebody tries to load a model that wasn't trained successfully.
I will add this flag so anybody can ignore such errors in the future.
