<a href="https://colab.research.google.com/github/Awerito/data-cience-dataset/blob/master/dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset download
This notebook downloads a cleaned version, containing 1217 images, of the [Annotated intestinal parasites image dataset](http://air.ug/downloads/intestinalparasites-phonecamera.zip).

## Corrections
- The data is split into `xml` and `jpg` files.
- The `xml` files include the following information:
    - Image resolution
    - Path to the corresponding `jpg` file
    - The `label` tag is corrected to `name`.
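As an illustration of the last correction, the following is a minimal sketch of how a `label` tag could be renamed to `name` in an annotation file. The helper `rename_label_tags` and the sample fragment are hypothetical and are not taken from the dataset or from the original cleaning procedure.

```python
import re


def rename_label_tags(xml_text: str) -> str:
    """Rename every <label> and </label> tag to <name> and </name>."""
    # The optional slash is captured so opening and closing tags are both handled.
    return re.sub(r"<(/?)label>", r"<\g<1>name>", xml_text)


# Hypothetical annotation fragment, used only to illustrate the renaming.
sample = "<object><label>Hookworm</label><xmin>10</xmin></object>"
fixed = rename_label_tags(sample)
print(fixed)
```

Tags other than `label` are left untouched, so the function can be applied to a whole file without affecting the coordinates.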

In [None]:
!wget -r "https://drive.google.com/drive/folders/1BDK9jNeqah4PGFcvwxLdo8OsP6be4zBW?usp=sharing"

# Loading the data
The information contained in each `xml` file is loaded and converted into a `pandas` `DataFrame` for later data exploration.

In [30]:
import re
import pandas as pd
import matplotlib.pyplot as plt


TOTAL_FILES = 1217  # number of annotation files in the cleaned dataset
COLUMNS = ["filename", "width", "height", "name", "quantity", "xmin", "ymin", "xmax", "ymax"]


def find_tag(tag, text):
    """Return the contents of every <tag>...</tag> pair in text, or None if there are none."""
    values = re.findall(rf"<{tag}>(.*?)</{tag}>", text)
    return values if values else None


rows = []
for i in range(1, TOTAL_FILES + 1):
    # Annotation files are numbered from 0001 to 1217.
    with open(f"intestinalparasites-phenocamera/xml/intestinalparasites-phone-{i:04}.xml", "r") as file:
        text = file.read()
    name = find_tag("name", text)
    rows.append([
        find_tag("filename", text)[0],
        int(find_tag("width", text)[0]),
        int(find_tag("height", text)[0]),
        name,
        len(name) if name else 0,  # number of annotated parasites in the image
        find_tag("xmin", text),
        find_tag("ymin", text),
        find_tag("xmax", text),
        find_tag("ymax", text),
    ])

# Build the DataFrame directly; the index matches the file number.
df = pd.DataFrame(rows, columns=COLUMNS, index=range(1, TOTAL_FILES + 1))
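The cell above extracts the fields with regular expressions. As an alternative, here is a minimal sketch that reads the same fields with the standard library's `xml.etree.ElementTree`. The sample document and the `parse_annotation` helper are hypothetical: the layout is assumed to be similar to Pascal VOC, and only the tag names (`filename`, `width`, `height`, `name`, `xmin`, `ymin`, `xmax`, `ymax`) come from the loading code.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation; the nesting is an assumption, only the tag names
# match the ones read by the loading cell.
SAMPLE_XML = """
<annotation>
    <filename>example-0001.jpg</filename>
    <size><width>1632</width><height>1224</height></size>
    <object>
        <name>Hookworm</name>
        <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
    </object>
    <object>
        <name>Taenia</name>
        <bndbox><xmin>300</xmin><ymin>40</ymin><xmax>380</xmax><ymax>90</ymax></bndbox>
    </object>
</annotation>
"""


def parse_annotation(xml_text):
    """Return one record with the same fields as the DataFrame columns."""
    root = ET.fromstring(xml_text.strip())

    def ints(tag):
        # root.iter() visits every descendant, so the nesting depth does not matter.
        return [int(element.text) for element in root.iter(tag)]

    names = [element.text for element in root.iter("name")]
    return {
        "filename": root.findtext(".//filename"),
        "width": int(root.findtext(".//width")),
        "height": int(root.findtext(".//height")),
        "name": names,
        "quantity": len(names),
        "xmin": ints("xmin"),
        "ymin": ints("ymin"),
        "xmax": ints("xmax"),
        "ymax": ints("ymax"),
    }


record = parse_annotation(SAMPLE_XML)
print(record)
```

Unlike the regular-expression version, this sketch converts the coordinates to integers and returns empty lists instead of `None` for images without annotations.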

# Sample rows
Entries with more than one parasite sighting.

In [31]:
df[df.quantity > 1]

| index | filename | width | height | name | quantity | xmin | ymin | xmax | ymax |
|---|---|---|---|---|---|---|---|---|---|
| 184 | intestinalparasites-phone-0184.xml | 1632 | 1224 | [Hookworm, Hookworm] | 2 | [1043, 982] | [413, 714] | [1235, 1189] | [714, 976] |
| 215 | intestinalparasites-phone-0215.xml | 1632 | 1224 | [Hookworm, Taenia] | 2 | [373, 510] | [437, 336] | [632, 665] | [668, 492] |
| 271 | intestinalparasites-phone-0271.xml | 1632 | 1224 | [Hookworm, Hookworm] | 2 | [379, 492] | [406, 583] | [681, 772] | [632, 760] |
| 393 | intestinalparasites-phone-0393.xml | 1632 | 1224 | [Hookworm, Hookworm, Hookworm] | 3 | [458, 863, 406] | [248, 251, 811] | [696, 1104, 690] | [489, 537, 1022] |
| 492 | intestinalparasites-phone-0492.xml | 1632 | 1224 | [Hookworm, Hookworm] | 2 | [668, 836] | [251, 461] | [891, 1040] | [507, 720] |
| 523 | intestinalparasites-phone-0523.xml | 1632 | 1224 | [Hookworm, Taenia] | 2 | [367, 513] | [878, 766] | [623, 665] | [1110, 939] |
| 558 | intestinalparasites-phone-0558.xml | 1632 | 1224 | [Hookworm, Hookworm] | 2 | [942, 434] | [665, 0] | [1146, 629] | [948, 166] |
| 594 | intestinalparasites-phone-0594.xml | 1632 | 1224 | [Hookworm, Hookworm] | 2 | [187, 1085] | [492, 352] | [492, 1289] | [684, 650] |
| 665 | intestinalparasites-phone-0665.xml | 1632 | 1224 | [Hookworm, Hookworm] | 2 | [1134, 202] | [285, 227] | [1445, 516] | [495, 416] |
| 747 | intestinalparasites-phone-0747.xml | 1632 | 1224 | [Hookworm, Hookworm] | 2 | [370, 839] | [193, 848] | [662, 1122] | [394, 1079] |
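Each row stores its annotations as parallel lists, so the i-th entries of `name`, `xmin`, `ymin`, `xmax` and `ymax` describe one bounding box. The following is a minimal sketch of how a row could be unpacked into individual boxes, using the values of entry 215 from the output above. The helpers `boxes_from_row` and `box_area` are hypothetical, and the coordinates are assumed to be stored as strings, as produced by the regular expressions in the loading cell.

```python
def boxes_from_row(row):
    """Turn the parallel lists of one row into (name, xmin, ymin, xmax, ymax) tuples."""
    # Images without annotations store None, which is treated as an empty list.
    columns = [row[key] or [] for key in ("name", "xmin", "ymin", "xmax", "ymax")]
    return [
        (name, int(x0), int(y0), int(x1), int(y1))
        for name, x0, y0, x1, y1 in zip(*columns)
    ]


def box_area(box):
    """Area in square pixels of a (name, xmin, ymin, xmax, ymax) tuple."""
    _, x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


# Values of entry 215, written as strings to mirror the loaded data.
row = {
    "name": ["Hookworm", "Taenia"],
    "xmin": ["373", "510"],
    "ymin": ["437", "336"],
    "xmax": ["632", "665"],
    "ymax": ["668", "492"],
}

boxes = boxes_from_row(row)
areas = [box_area(box) for box in boxes]
print(boxes)
print(areas)
```

The same helper could be applied to every row with `df.apply(boxes_from_row, axis=1)`, since each row supports access by column name.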
