# Find a Legend Challenge - Submission Notebook

This notebook is designed to run with as few external files as possible.
Unfortunately, the OCR tool we use (Tesseract) has to be installed manually as an executable.
On a Windows machine, we suggest using the installer provided here: https://github.com/UB-Mannheim/tesseract/wiki (the standard configuration saves the files to the expected target location). Feel free to install Tesseract in another way, but note that we cannot guarantee that all paths will resolve correctly.
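
A quick way to check that the executable is reachable before running the notebook is sketched below. It uses only the standard library. `find_tesseract` and `CANDIDATE_PATHS` are hypothetical names for illustration, and the Windows path is an assumption about the installer's default location, not something this notebook confirms.

```python
import os
import shutil

# Assumed default location of the UB-Mannheim installer; adjust if you installed elsewhere.
CANDIDATE_PATHS = [r"C:\Program Files\Tesseract-OCR\tesseract.exe"]

def find_tesseract(candidates=CANDIDATE_PATHS):
    """Return the path to a tesseract executable, or None if none is found."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Fall back to whatever is available on PATH
    return shutil.which("tesseract")

tesseract_path = find_tesseract()
if tesseract_path is None:
    print("tesseract not found: install it or add it to PATH")
else:
    print(f"tesseract found at: {tesseract_path}")
    # With pytesseract installed, the executable can be set explicitly:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If the executable lives outside `PATH`, setting `pytesseract.pytesseract.tesseract_cmd` as in the commented lines is the usual way to point `pytesseract` at it.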


Competition:  https://xeek.ai/challenges/extract-crossplot-markers <br>
Repository: https://github.com/REDA-solutions/PlotLegendDetectionCV

In [2]:
team_name = 'REDA solutions' 
model_name = 'legend_detection_model'

## Imports

In [9]:
! pip install -r requirements.txt --user

Collecting numpy==1.23.3
  Using cached numpy-1.23.3-cp38-cp38-win_amd64.whl (14.7 MB)
Collecting pandas==1.5.1
  Using cached pandas-1.5.1-cp38-cp38-win_amd64.whl (11.0 MB)
Collecting Pillow==9.3.0
  Using cached Pillow-9.3.0-cp38-cp38-win_amd64.whl (2.5 MB)
Collecting pytesseract==0.3.10
  Using cached pytesseract-0.3.10-py3-none-any.whl (14 kB)
Collecting torchvision==0.14.0
  Using cached torchvision-0.14.0-cp38-cp38-win_amd64.whl (1.1 MB)
Collecting tqdm==4.64.1
  Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting packaging>=21.3
  Using cached packaging-22.0-py3-none-any.whl (42 kB)
Installing collected packages: Pillow, packaging, numpy, tqdm, torchvision, pytesseract, pandas
Successfully installed Pillow-9.3.0 numpy-1.23.3 packaging-22.0 pandas-1.5.1 pytesseract-0.3.10 torchvision-0.14.0 tqdm-4.64.1

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scipy 1.6.2 requires numpy<1.23.0,>=1.16.5, but you have numpy 1.23.3 which is incompatible.


In [2]:
import os
import platform
import sys
import numpy as np
import pandas as pd
from glob import glob
import torch
import re

from models_ocr.preprocessing.preprocessor import Preprocessor # import of our class --> see repository
from models_ocr.pytesseract_model import PytesseractModel # import of our class --> see repository

## Description

![Model Architecture](misc/model_architecture.png)

## Submission inference pipeline

In [3]:
from time import perf_counter

In [4]:
TEST_DATA_ROOT = "raw_data/helvetios_challenge_dataset_test"
TEST_IMAGE_DATASET_PATH = f"{TEST_DATA_ROOT}/images"
TEST_LABELS_DATA_PATH = f"{TEST_DATA_ROOT}/labels"
TEST_INFERENCE_RESULTS_PATH = "results"

In [7]:
def run_inference_pipeline(TEST_IMAGE_DATASET_PATH, TEST_INFERENCE_RESULTS_PATH):
   
   print(f"* OS                          : {platform.system()}, {platform.release()}")
   python_version = str(sys.version).replace('\n', ' ')
   print(f"* Python version              : {python_version}")

   os.makedirs(TEST_INFERENCE_RESULTS_PATH, exist_ok=True)
         
   ts_start = perf_counter()
   
   model = torch.hub.load('../yolov5/', 'custom', path='models_detection/best.pt', source='local')
   model.conf = 0.2 # define minimal confidence for found legends
   preprocessor = Preprocessor(deskew=True)
   tesseract = PytesseractModel(preprocessor=preprocessor, confidence=15, custom_config=r"-l eng --psm 11")

   imgs = TEST_IMAGE_DATASET_PATH + r'/*.png'  
   imgs = list(glob(imgs))
   
   results = []
   sample_names = []

   for img in imgs:
      legends = model(img)
      legends = legends.crop()
      if len(legends) == 0:
         results.append(np.nan)
      else:
         predictions = []
         for legend_ in legends:
            legend = legend_['im']
            prediction = tesseract.predict(legend)
            predictions.extend(prediction)
         # Drop tokens containing digits (e.g. axis ticks) and single lowercase letters (typical OCR noise)
         predictions = [s for s in predictions if not any(ch.isdigit() for ch in s)]
         predictions = [s for s in predictions if len(s) != 1 or not s.islower()]
         # Format the remaining tokens as "['word1' 'word2']"
         prediction_str = "[" + " ".join(f"'{word}'" for word in predictions) + "]"
         if prediction_str == "[]":
            prediction_str = np.nan
         results.append(prediction_str)
      sample_names.append(os.path.basename(img))  # file name only, independent of the OS path separator
   
   ts_after_test = perf_counter()
   
   print(f"Inference time: {ts_after_test-ts_start:.2f} sec.")
   
   inference_results = {'sample_name': sample_names,
                        'legend': results}
   inference_results_df = pd.DataFrame(inference_results)
      
   inference_results_df.to_csv(f"{TEST_INFERENCE_RESULTS_PATH}/{team_name}_{model_name}_results.csv", index = False)
   
   print(inference_results_df)

   print(f"The submission file   : {TEST_INFERENCE_RESULTS_PATH}/{team_name}_{model_name}_results.csv")
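
The OCR post-processing inside the loop can be tried in isolation. The sketch below mirrors the two filters and the string format used in the function above. `format_legend` is a hypothetical helper name, not part of the repository, the sample tokens are made up for illustration, and `math.nan` stands in for `np.nan` so that the block needs only the standard library.

```python
import math

def format_legend(tokens):
    """Filter OCR tokens and format them like the 'legend' column of the results."""
    # Drop tokens that contain any digit
    kept = [t for t in tokens if not any(ch.isdigit() for ch in t)]
    # Drop single lowercase letters
    kept = [t for t in kept if len(t) != 1 or not t.islower()]
    if not kept:
        return math.nan  # nothing left after filtering
    return "[" + " ".join(f"'{t}'" for t in kept) + "]"

print(format_legend(["Sandstone", "0.5", "a", "Shale", "B"]))
# prints: ['Sandstone' 'Shale' 'B']
```

Note that a single uppercase letter such as `B` survives the second filter, because only single lowercase characters are removed.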


In [9]:
run_inference_pipeline(TEST_IMAGE_DATASET_PATH, TEST_INFERENCE_RESULTS_PATH)

YOLOv5  v7.0-10-g10c025d Python-3.9.7 torch-1.13.0+cpu CPU

Fusing layers... 


* OS                          : Windows, 10
* Python version              : 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]


Model summary: 157 layers, 7012822 parameters, 0 gradients, 15.8 GFLOPs
Adding AutoShape... 
Saved 1 image to runs\detect\exp3
Saved results to runs\detect\exp3

Saved 1 image to runs\detect\exp4
Saved results to runs\detect\exp4

Saved 1 image to runs\detect\exp5
Saved results to runs\detect\exp5

Saved 1 image to runs\detect\exp6
Saved results to runs\detect\exp6

Saved 1 image to runs\detect\exp7
Saved results to runs\detect\exp7

Saved 1 image to runs\detect\exp8
Saved results to runs\detect\exp8

Saved 1 image to runs\detect\exp9
Saved results to runs\detect\exp9

Saved 1 image to runs\detect\exp10
Saved results to runs\detect\exp10

Saved 1 image to runs\detect\exp11
Saved results to runs\detect\exp11

Saved 1 image to runs\detect\exp12
Saved results to runs\detect\exp12

Saved 1 image to runs\detect\exp13
Saved results to runs\detect\exp13

Saved 1 image to runs\detect\exp14
Saved res

Inference time: 113.27 sec.


NameError: name 'team_name' is not defined
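
The traceback shows that `run_inference_pipeline` reads `team_name` as a global and that the name was not defined in the kernel when the cell ran, so the cell that defines `team_name` and `model_name` has to be executed first. One way to make this dependency explicit is to build the output path from arguments, as in the sketch below. `submission_path` is a hypothetical helper, not part of the repository.

```python
def submission_path(results_dir, team_name, model_name):
    """Build the results CSV path from explicit arguments instead of globals."""
    if not team_name or not model_name:
        raise ValueError("team_name and model_name must be non-empty")
    return f"{results_dir}/{team_name}_{model_name}_results.csv"

print(submission_path("results", "REDA solutions", "legend_detection_model"))
# prints: results/REDA solutions_legend_detection_model_results.csv
```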