# Checking whether the "new" dataset is OK

The notebook at `src/tratando_base_dados.ipynb` modifies the original BID dataset by adding, for each image, a JSON file with the relevant information from the document: name, birthdate, RG, and CPF.

It also creates a `.csv` file in which each row describes one document: its ID, the paths to its files, and its key data.

## Imports

In [1]:
import os
import json

import pandas as pd
import cv2
import ipywidgets as widgets

## Constants

In [2]:
DATASET_FOLDER_PATH = '../../RG-Dataset'
DATASET_CSV_PATH = f'{DATASET_FOLDER_PATH}/dataset.csv'

## Reading the "dataset.csv" file

In [3]:
dataset = pd.read_csv(DATASET_CSV_PATH, sep=';')
dataset

|  | id | image_path | ocr_path | segmentation_path | info_path | cpf | rg | birthdate | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 111111 | files/000111111_in.jpg | files/000111111_gt_ocr.txt | files/000111111_gt_segmentation.jpg | files/000111111_info.json | 354.205.532-87 | 08.096.661-5 | 02/01/1963 | Rebelo Ronei Nakamurakare |
| 1 | 230000 | files/000230000_in.jpg | files/000230000_gt_ocr.txt | files/000230000_gt_segmentation.jpg | files/000230000_info.json | 188.354.397-52 | 29.227.222-4 | 05/05/1984 | Kohatsu Liberatti Ivan |
| 2 | 233025 | files/000233025_in.jpg | files/000233025_gt_ocr.txt | files/000233025_gt_segmentation.jpg | files/000233025_info.json | 370.678.495-51 | 73.377.624-3 | 16/02/1976 | Chicaro Okubaro Salvo |
| 3 | 233331 | files/000233331_in.jpg | files/000233331_gt_ocr.txt | files/000233331_gt_segmentation.jpg | files/000233331_info.json | 624.476.345-95 | 84.941.430-1 | 20/11/2008 | Hochun Cerdeira Crema |
| 4 | 250000 | files/000250000_in.jpg | files/000250000_gt_ocr.txt | files/000250000_gt_segmentation.jpg | files/000250000_info.json |  | 48.753.318-5 | 20/07/1978 | Scrignoli Petenusci Rombach |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 26861 | files/00026861_in.jpg | files/00026861_gt_ocr.txt | files/00026861_gt_segmentation.jpg | files/00026861_info.json | 184.843.419-76 | 28.321.208-1 | 17/02/1978 | Hime Sylvestre Eiryo |
| 914 | 26862 | files/00026862_in.jpg | files/00026862_gt_ocr.txt | files/00026862_gt_segmentation.jpg | files/00026862_info.json | 973.352.343-79 | 27.205.367-3 | 25/10/1958 | Ciuffi Prandini Restum |
| 915 | 26863 | files/00026863_in.jpg | files/00026863_gt_ocr.txt | files/00026863_gt_segmentation.jpg | files/00026863_info.json | 579.129.745-98 | 04.355.100-2 | 07/05/2002 | Avelas Wilkens Lacaz |
| 916 | 26864 | files/00026864_in.jpg | files/00026864_gt_ocr.txt | files/00026864_gt_segmentation.jpg | files/00026864_info.json | 866.125.989-48 | 44.840.014-5 | 02/02/1982 | Akutagawa Crivelini Eleia |
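
Since every row references four files by relative path, one quick sanity check is to confirm that each referenced file actually exists on disk. The sketch below is not part of the original notebook, and `find_missing_files` is a hypothetical helper name. It assumes the paths in the CSV are relative to the dataset folder, which is how the viewer code later in this notebook uses them.

```python
import os


def find_missing_files(relative_paths, root):
    """Return the relative paths that do not point to an existing file under root."""
    return [
        path for path in relative_paths
        if not os.path.isfile(os.path.join(root, path))
    ]


# Possible usage with the objects defined in this notebook:
# for column in ['image_path', 'ocr_path', 'segmentation_path', 'info_path']:
#     missing = find_missing_files(dataset[column], DATASET_FOLDER_PATH)
#     print(f'{column}: {len(missing)} missing files')
```

An empty list for every column means the CSV and the `files/` folder are in sync.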


In [4]:
docs_count = dataset.shape[0]
print(f'There are {docs_count} documents in the dataset')

There are 918 documents in the dataset


In [5]:
dataset.isna().sum()

id                     0
image_path             0
ocr_path               0
segmentation_path      0
info_path              0
cpf                  149
rg                     0
birthdate              0
name                   0
dtype: int64
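
`isna()` only detects missing values; a field can be present and still be malformed. The sketch below shows one way to check the format of the three structured fields. The patterns are assumptions inferred from the rows displayed above (for example, an RG shaped like `08.096.661-5`), not an official specification, and the function names are hypothetical. The CPF check covers the layout only and does not verify the check digits.

```python
import re
from datetime import datetime

# Assumed layouts, inferred from the sample rows shown in this notebook
CPF_PATTERN = re.compile(r'\d{3}\.\d{3}\.\d{3}-\d{2}')
RG_PATTERN = re.compile(r'\d{2}\.\d{3}\.\d{3}-\d')


def is_valid_cpf_format(value):
    """True if value is a string shaped like 000.000.000-00 (layout only)."""
    return isinstance(value, str) and CPF_PATTERN.fullmatch(value) is not None


def is_valid_rg_format(value):
    """True if value is a string shaped like 00.000.000-0 (layout only)."""
    return isinstance(value, str) and RG_PATTERN.fullmatch(value) is not None


def is_valid_birthdate(value):
    """True if value is a string that parses as a day/month/year calendar date."""
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, '%d/%m/%Y')
    except ValueError:
        return False
    return True


# Possible usage on the loaded frame (missing CPFs are skipped):
# dataset['cpf'].dropna().map(is_valid_cpf_format).all()
# dataset['rg'].map(is_valid_rg_format).all()
# dataset['birthdate'].map(is_valid_birthdate).all()
```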

In [6]:
# Count rows (documents) that have at least one null field, rather than null cells
nan_docs_count = dataset.isna().any(axis=1).sum()
print(f'There are {nan_docs_count} documents with null data')

There are 149 documents with null data


In [7]:
not_nan_docs_count = docs_count - nan_docs_count
print(f'So we have {not_nan_docs_count} documents with no null data')

So we have 769 documents with no null data
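
Note that "documents with null data" is a per-row count, whereas summing `isna()` over the whole frame counts null cells. In this dataset the two numbers coincide because `cpf` is the only column with nulls, but they diverge as soon as one row has two missing fields. A small sketch on made-up data (the `toy` frame is purely illustrative) shows the difference:

```python
import pandas as pd

# Made-up frame for illustration; these values are not taken from the dataset
toy = pd.DataFrame({
    'cpf': [None, '111.111.111-11', None, '222.222.222-22'],
    'rg': ['1', None, None, '4'],
})

null_cells = int(toy.isna().sum().sum())            # every missing cell
rows_with_null = int(toy.isna().any(axis=1).sum())  # rows with at least one missing cell
complete_rows = len(toy.dropna())                   # rows with no missing cell

print(null_cells, rows_with_null, complete_rows)
```

Here the third row is missing both fields, so the cell count (4) exceeds the row count (3), and only one row is complete.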


## Code to visualize dataset information

In [8]:
img_widget = widgets.Image(
    format='jpg',
    width=600,
    height=400
)

rotate_img_btn = widgets.Button(
    description='Rotate',
    icon='undo'
)

In [9]:
DOC_NUMBER = 917

current_doc = dataset.iloc[DOC_NUMBER]

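# Load the document image; cv2.imread returns None instead of raising on failure
img_path = f'{DATASET_FOLDER_PATH}/{current_doc["image_path"]}'
img = cv2.imread(img_path)
if img is None:
    raise FileNotFoundError(f'Could not read image: {img_path}')
# Rotate portrait images so that every document is displayed in landscape
if img.shape[0] > img.shape[1]:
    img = cv2.rotate(img, cv2.ROTATE_90_COUNTERCLOCKWISE)
img_widget.value = cv2.imencode('.jpg', img)[1].tobytes()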

with open(f'{DATASET_FOLDER_PATH}/{current_doc["ocr_path"]}') as ocr_file:
    ocr_info = ocr_file.read()

with open(f'{DATASET_FOLDER_PATH}/{current_doc["info_path"]}') as info_file:
    info_json = json.load(info_file)

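def rotate_img():
    """Rotate the displayed image by 180 degrees and refresh the widget."""
    global img
    img = cv2.rotate(img, cv2.ROTATE_180)
    img_widget.value = cv2.imencode('.jpg', img)[1].tobytes()

# on_click passes the button instance to the callback; it is not needed here
rotate_img_btn.on_click(lambda _: rotate_img())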

In [None]:
display(current_doc)

display(widgets.HBox([
    img_widget, 
    widgets.VBox([
        rotate_img_btn
    ])
], layout=widgets.Layout(align_items='center')))

print(ocr_info)
print(info_json)
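
Inspecting documents one by one can be complemented by an automated comparison between the CSV row and its JSON file. The sketch below is not part of the original notebook. It assumes that the JSON keys mirror the CSV column names (`cpf`, `rg`, `birthdate`, `name`); the notebook does not show the JSON structure, so adjust `FIELDS` to the real keys. The helper names are hypothetical.

```python
import math

# Assumption: the JSON file uses the same field names as the CSV columns
FIELDS = ('cpf', 'rg', 'birthdate', 'name')


def _normalize(value):
    """Map None and NaN to None; convert anything else to a stripped string."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return str(value).strip()


def find_mismatches(csv_row, info, fields=FIELDS):
    """Return the field names whose values differ between the CSV row and the JSON dict."""
    return [
        field for field in fields
        if _normalize(csv_row.get(field)) != _normalize(info.get(field))
    ]


# Possible usage with the objects loaded above:
# print(find_mismatches(current_doc, info_json))
```

Both a `dict` and a pandas `Series` provide `.get`, so `current_doc` can be passed in directly. A missing CPF (NaN in the CSV, `null` in the JSON) is treated as a match.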