<a href="https://colab.research.google.com/github/milmor/deep-puma/blob/main/Image-preprocessing/Evaluacion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OCR performance evaluation

### Drive access

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Libraries

In [2]:
!pip install datasets
!pip install jiwer

Collecting datasets
  Downloading datasets-1.11.0-py3-none-any.whl (264 kB)
Collecting fsspec>=2021.05.0
  Downloading fsspec-2021.7.0-py3-none-any.whl (118 kB)
Collecting huggingface-hub<0.1.0
  Downloading huggingface_hub-0.0.15-py3-none-any.whl (43 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Installing collected packages: xxhash, huggingface-hub, fsspec, datasets
Successfully installed datasets-1.11.0 fsspec-2021.7.0 huggingface-hub-0.0.15 xxhash-2.0.2
Collecting jiwer
  Downloading jiwer-2.2.0-py3-none-any.whl (13 kB)
Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Building wheels for collected p

In [3]:
from datasets import load_metric
import pandas as pd

### Character Error Rate (CER)
$$CER = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C}$$

Where:

* S: number of substituted characters.
* D: number of deleted characters.
* I: number of inserted characters.
* N: number of characters in the reference.
* C: number of correct characters.
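
To make the formula concrete, here is a minimal pure-Python sketch that computes CER from a Levenshtein edit distance. The helper names `edit_distance` and `char_error_rate` are illustrative and are not part of `datasets` or `jiwer`; the sample strings are made up.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions (S + D + I)
    needed to turn the sequence `ref` into the sequence `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """CER = (S + D + I) / N, with N the number of characters in the reference."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical OCR output: two substituted characters out of eight.
print(char_error_rate("justicia", "jvsticla"))  # 0.25
```

A library implementation may normalize or aggregate text differently, so this sketch is only meant to show how S, D, I and N enter the ratio.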


In [4]:
cer = load_metric('cer')

Downloading:   0%|          | 0.00/1.91k [00:00<?, ?B/s]

### Word Error Rate (WER)

$$ WER = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C} $$

Where:

* S: number of substituted words.
* D: number of deleted words.
* I: number of inserted words.
* N: number of words in the reference.
* C: number of correct words.
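
The same edit-distance idea applies at the word level. The sketch below is a self-contained illustration, assuming whitespace tokenization; `word_error_rate` is a hypothetical helper, not the `jiwer` implementation. It also shows that WER can exceed 1: the number of insertions I is not bounded by N, so a hypothesis with many spurious words yields more errors than there are reference words.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, with N the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance computed over word tokens instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up examples.
print(word_error_rate("el juez firma", "el juez firma"))  # 0.0
print(word_error_rate("el juez firma", "el jues firma"))  # one substitution in three words
print(word_error_rate("acta", "a c t a"))                 # 4.0: one substitution plus three insertions
```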

In [5]:
wer = load_metric('wer')

Downloading:   0%|          | 0.00/1.95k [00:00<?, ?B/s]

In [6]:
def get_prediction_text(path):
  # Read the whole OCR output file; the context manager closes it afterwards
  with open(path, 'r') as f:
    text = f.read()
  return text

In [7]:
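def get_ocr_evaluation(outputs_path, files_info, references, metric):
  # Drop the file extension (keeps only the text before the first dot)
  filenames = [f.split('.')[0] for f in files_info]
  # Read '<outputs_path><name>.txt' for each file, in the same order as the references
  predictions = [get_prediction_text('{}{}.txt'.format(outputs_path, fn)) for fn in filenames]
  # `metric` may be CER or WER, so the result is a generic score
  score = metric.compute(predictions=predictions, references=references)
  return score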
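
The following sketch exercises `get_ocr_evaluation` end to end on temporary files. The two functions are restated so the block runs on its own. `ExactMatchMetric` is a hypothetical stand-in that only mimics the `compute(predictions=..., references=...)` call. The file names and texts are made up.

```python
import os
import tempfile


def get_prediction_text(path):
    with open(path, 'r') as f:
        return f.read()


def get_ocr_evaluation(outputs_path, files_info, references, metric):
    filenames = [f.split('.')[0] for f in files_info]
    predictions = [get_prediction_text('{}{}.txt'.format(outputs_path, fn))
                   for fn in filenames]
    return metric.compute(predictions=predictions, references=references)


class ExactMatchMetric:
    """Hypothetical metric: fraction of predictions that differ from their reference."""

    def compute(self, predictions, references):
        pairs = list(zip(predictions, references))
        return sum(p != r for p, r in pairs) / len(pairs)


with tempfile.TemporaryDirectory() as tmp:
    # The function concatenates strings, so the path needs a trailing separator.
    outputs_path = tmp + os.sep
    for name, text in [('ficha_1', 'hola'), ('ficha_2', 'mundo')]:
        with open(os.path.join(tmp, name + '.txt'), 'w') as f:
            f.write(text)
    score = get_ocr_evaluation(outputs_path,
                               ['ficha_1.jpg', 'ficha_2.jpg'],
                               ['hola', 'mundos'],
                               ExactMatchMetric())

print(score)  # 0.5: the second prediction differs from its reference
```

Two details matter in practice: `outputs_path` must end with a path separator, and predictions are paired with references purely by position, so `files_info` and `references` have to come from the same rows in the same order.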

### Reading the data

In [8]:
TRANSCRIPTIONS_PATH = 'drive/MyDrive/Datos - Hackathon JusticIA/JusticIA_DatosTranscripciones.csv'

In [9]:
transcriptions_df = pd.read_csv(TRANSCRIPTIONS_PATH)

### Evaluation of Fichas_manual

In [10]:
manual_df = transcriptions_df[(transcriptions_df['Conjunto']== 'Fichas_manual')]

In [11]:
files_info = manual_df['NombreArchivo']
references = manual_df['Texto']

In [12]:
cer_manual = get_ocr_evaluation(outputs_path='drive/MyDrive/HackathonRIIAA2021/Texts_v2/Fichas_manual/', 
                                files_info=files_info, 
                                references=references, 
                                metric=cer)

In [13]:
wer_manual = get_ocr_evaluation(outputs_path='drive/MyDrive/HackathonRIIAA2021/Texts_v2/Fichas_manual/', 
                                files_info=files_info, 
                                references=references, 
                                metric=wer)

In [14]:
print('Evaluation of the Fichas_manual directory:\nCER = {}\nWER = {}'.format(cer_manual, wer_manual))

Evaluation of the Fichas_manual directory:
CER = 0.6448604847548377
WER = 1.3488420332092224


### Evaluation of Fichas_auto

In [15]:
auto_df = transcriptions_df[(transcriptions_df['Conjunto']== 'Fichas_auto')]

In [16]:
files_info = auto_df['NombreArchivo']
references = auto_df['Texto']

In [17]:
cer_auto = get_ocr_evaluation(outputs_path='drive/MyDrive/HackathonRIIAA2021/Texts_v2/Fichas_auto/',
                              files_info=files_info,
                              references=references,
                              metric=cer)

In [18]:
wer_auto = get_ocr_evaluation(outputs_path='drive/MyDrive/HackathonRIIAA2021/Texts_v2/Fichas_auto/',
                              files_info=files_info,
                              references=references,
                              metric=wer)

In [19]:
print('Evaluation of the Fichas_auto directory:\nCER = {}\nWER = {}'.format(cer_auto, wer_auto))

Evaluation of the Fichas_auto directory:
CER = 0.6218335000025907
WER = 1.0918706900213817
