# Creating the IIT_CDIP_BASIC_FEATURES DataFrame

## README
This notebook builds the IIT_CDIP_BASIC_FEATURES DataFrame from the images of the IIT-CDIP database. As a reminder, these are the images from which the RVL-CDIP database was derived.
Only the IIT-CDIP images that were used to build RVL-CDIP are processed here: 400,000 images, versus a little under 7,000,000 in the full IIT-CDIP database.

It first performs some preliminary operations (section 1), including the definition of the global execution variables (**UPDATE THESE BEFORE FIRST USE**).

Then (section 2), it creates the DataFrame, which keeps `iit_image_path` from df_documents and contains the following information for each document:
- document_id
- width
- height
- amount_of_pages
- size_in_kB
- mode

## 1. Preparation

In [1]:
import os
import time
import numpy as np
import pandas as pd
from PIL import Image

In [2]:
project_path = '/Users/ben/Work/mle/ds-project/mai25_bds_extraction/' # to be changed by each user to match their own directory tree

data_path = os.path.join(project_path, 'data')
raw_data_path = os.path.join(data_path, 'raw')
processed_data_path = os.path.join(data_path, 'processed')

raw_rvl_cdip_path = os.path.join(raw_data_path, 'RVL-CDIP')
rvl_cdip_images_path = os.path.join(raw_rvl_cdip_path, 'images')
rvl_cdip_labels_path = os.path.join(raw_rvl_cdip_path, 'labels')

iit_cdip_images_path = os.path.join(raw_data_path, 'IIT-CDIP', 'images')
iit_cdip_xmls_path = os.path.join(raw_data_path, 'IIT-CDIP', 'xmls')
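
Since `project_path` has to be updated on first use, it can help to check the derived folders before starting the long extraction. The following is only a minimal sketch: the helper name `find_missing_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk, in input order."""
    return [p for p in paths if not Path(p).exists()]


# Hypothetical usage with the variables defined in the cell above:
# missing = find_missing_paths([data_path, raw_data_path, processed_data_path,
#                               iit_cdip_images_path, iit_cdip_xmls_path])
# if missing:
#     raise FileNotFoundError(f"Update project_path, missing folders: {missing}")
```

Failing early here avoids discovering a wrong `project_path` only after every image has been reported as an error.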

## 2. Creating the IIT_CDIP_BASIC_FEATURES DataFrame

## 2.1. Building the base of the DataFrame

In [3]:
df_documents = pd.read_parquet(os.path.join(processed_data_path, "df_documents.parquet"))
df_documents.head()

Unnamed: 0,document_id,filename,rvl_image_path,label,data_set,iit_image_path,iit_individual_xml_path,iit_collective_xml_path
0,aaa06d00,50486482-6482.tif,raw/RVL-CDIP/images/imagesa/a/a/a/aaa06d00/504...,6,test,raw/IIT-CDIP/images/imagesa/a/a/a/aaa06d00/504...,raw/IIT-CDIP/images/imagesa/a/a/a/aaa06d00/aaa...,raw/IIT-CDIP/xmls/aa.xml
1,aaa08d00,2072197187.tif,raw/RVL-CDIP/images/imagesa/a/a/a/aaa08d00/207...,9,train,raw/IIT-CDIP/images/imagesa/a/a/a/aaa08d00/207...,raw/IIT-CDIP/images/imagesa/a/a/a/aaa08d00/aaa...,raw/IIT-CDIP/xmls/aa.xml
2,aaa09e00,2029372116.tif,raw/RVL-CDIP/images/imagesa/a/a/a/aaa09e00/202...,11,val,raw/IIT-CDIP/images/imagesa/a/a/a/aaa09e00/202...,raw/IIT-CDIP/images/imagesa/a/a/a/aaa09e00/aaa...,raw/IIT-CDIP/xmls/aa.xml
3,aaa10c00,2085133627a.tif,raw/RVL-CDIP/images/imagesa/a/a/a/aaa10c00/208...,2,train,raw/IIT-CDIP/images/imagesa/a/a/a/aaa10c00/208...,raw/IIT-CDIP/images/imagesa/a/a/a/aaa10c00/aaa...,raw/IIT-CDIP/xmls/aa.xml
4,aaa11d00,515558347+-8348.tif,raw/RVL-CDIP/images/imagesa/a/a/a/aaa11d00/515...,3,train,raw/IIT-CDIP/images/imagesa/a/a/a/aaa11d00/515...,raw/IIT-CDIP/images/imagesa/a/a/a/aaa11d00/aaa...,raw/IIT-CDIP/xmls/aa.xml


In [4]:
df_base = df_documents[["document_id", "iit_image_path"]]

In [5]:
len(df_base)

400000

In [6]:
df_base.head()

Unnamed: 0,document_id,iit_image_path
0,aaa06d00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa06d00/504...
1,aaa08d00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa08d00/207...
2,aaa09e00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa09e00/202...
3,aaa10c00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa10c00/208...
4,aaa11d00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa11d00/515...


## 2.2. Extracting the data

In [7]:
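# Count the pages (frames) of an already-opened image file
def nombre_pages(img):
    # Unreadable files are rejected earlier, by Image.open(),
    # so only the end-of-file signal needs to be handled here
    count = 0
    while True:
        try:
            # seek() raises EOFError once `count` is past the last frame
            img.seek(count)
            count += 1
        except EOFError:
            break
    return count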
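
As a possible alternative to the seek loop, Pillow exposes an `n_frames` attribute on images whose plugin supports several frames, which includes TIFF. The sketch below assumes `img` is an image opened with `Image.open()`; the name `count_pages` is hypothetical, and the default of 1 covers plugins that do not define the attribute.

```python
def count_pages(img):
    """Return the number of frames (pages) of an opened Pillow image.

    Plugins for multi-frame formats define `n_frames`; other plugins may
    leave it undefined, in which case the image is treated as single-page.
    """
    return getattr(img, "n_frames", 1)


# Hypothetical usage (the path is a placeholder, not a file from the dataset):
# from PIL import Image
# with Image.open("some_document.tif") as img:
#     print(count_pages(img))
```

This is not guaranteed to behave identically to the loop on damaged files, so it is worth comparing both on a sample before switching.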

In [8]:
def get_basic_image_features():
    """Collect basic features for every image listed in the global df_base."""
    tmp_list = []
    for document_id, relative_path in df_base.itertuples(index=False):
        filename = os.path.join(data_path, relative_path)
        try:
            # Read the image-level features while the file is open
            with Image.open(filename) as img:
                width, height = img.size
                mode = img.mode
                amount_of_pages = nombre_pages(img)
            # File size on disk, in units of 1024 bytes
            size_in_kb = os.path.getsize(filename) / 1024
            tmp_list.append([
                document_id, width, height, amount_of_pages, size_in_kb, mode])
        except Exception:
            # Report the failing document and move on to the next one
            print(f"Error with image {document_id}")

    df_data = pd.DataFrame(
        tmp_list,
        columns=["document_id", "width", "height", "amount_of_pages", "size_in_kB", "mode"]
    )
    return df_data

In [9]:
t = time.time()
df_data = get_basic_image_features()
print(f"Execution time: {time.time() - t:.3f} seconds.")
df_data.head()

Error with image fpv22d00
Execution time: 271.123 seconds.


Unnamed: 0,document_id,width,height,amount_of_pages,size_in_kB,mode
0,aaa06d00,1728,2292,1,25.595703,1
1,aaa08d00,1728,2292,1,58.537109,1
2,aaa09e00,2560,3301,1,22.52832,1
3,aaa10c00,1728,2292,1,4.379883,1
4,aaa11d00,1728,2292,2,91.145508,1


## 2.3. Creating and saving the DataFrame

In [10]:
df_iit_cdip_basic_features = df_base.merge(df_data, on="document_id", how="left")
df_iit_cdip_basic_features.head()

Unnamed: 0,document_id,iit_image_path,width,height,amount_of_pages,size_in_kB,mode
0,aaa06d00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa06d00/504...,1728.0,2292.0,1.0,25.595703,1
1,aaa08d00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa08d00/207...,1728.0,2292.0,1.0,58.537109,1
2,aaa09e00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa09e00/202...,2560.0,3301.0,1.0,22.52832,1
3,aaa10c00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa10c00/208...,1728.0,2292.0,1.0,4.379883,1
4,aaa11d00,raw/IIT-CDIP/images/imagesa/a/a/a/aaa11d00/515...,1728.0,2292.0,2.0,91.145508,1
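
In the output above, `width`, `height` and `amount_of_pages` appear as floats (1728.0, 1.0). This is presumably because the left join leaves at least one document without features, and a NaN cannot be stored in a plain integer column. The sketch below shows one way to keep integers and list the unmatched documents. It runs on toy data with made-up values; the helper name `merge_keeping_integers` and the `doc_*` identifiers are hypothetical.

```python
import pandas as pd


def merge_keeping_integers(df_base, df_data, int_columns):
    """Left-join df_data onto df_base and keep `int_columns` as nullable integers."""
    merged = df_base.merge(df_data, on="document_id", how="left", indicator=True)
    # Unmatched rows get NaN, which turns integer columns into float64;
    # the nullable "Int64" dtype stores the gap as <NA> instead.
    merged = merged.astype({column: "Int64" for column in int_columns})
    # "_merge" equals "left_only" for documents whose features are missing
    missing_ids = merged.loc[merged["_merge"] == "left_only", "document_id"].tolist()
    return merged.drop(columns="_merge"), missing_ids


# Toy data (hypothetical values, not taken from the dataset)
toy_base = pd.DataFrame({"document_id": ["doc_a", "doc_b", "doc_c"]})
toy_data = pd.DataFrame({
    "document_id": ["doc_a", "doc_c"],
    "width": [1728, 2560],
    "amount_of_pages": [1, 2],
})
toy_merged, toy_missing = merge_keeping_integers(
    toy_base, toy_data, ["width", "amount_of_pages"])
print(toy_merged.dtypes)
print(toy_missing)
```

Whether nullable integers are preferable to floats depends on how the saved file is consumed downstream, so this is a design option rather than a required change.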


In [11]:
df_iit_cdip_basic_features.to_parquet(os.path.join(processed_data_path, "df_iit_cdip_basic_features.parquet"))

### Remark:
There is an issue with image fpv22d00, whose features could not be extracted.
The script could probably be improved to handle this case.

In [12]:
df_iit_cdip_basic_features[df_iit_cdip_basic_features.document_id == "fpv22d00"]

Unnamed: 0,document_id,iit_image_path,width,height,amount_of_pages,size_in_kB,mode
88241,fpv22d00,raw/IIT-CDIP/images/imagesf/f/p/v/fpv22d00/250...,,,,,
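
One way to follow up on the remark above is to record why an extraction failed instead of only printing the document identifier. The sketch below is generic and makes no claim about what is actually wrong with fpv22d00. The names `extract_with_diagnostics` and `extractor` are hypothetical; `extractor` stands for any function that takes a file path and returns a tuple of features.

```python
def extract_with_diagnostics(items, extractor):
    """Apply `extractor` to each (document_id, path) pair.

    Returns (rows, failures):
    - rows holds [document_id, *features] for every success,
    - failures holds (document_id, exception class name, message) for every error.
    """
    rows, failures = [], []
    for document_id, path in items:
        try:
            rows.append([document_id, *extractor(path)])
        except Exception as exc:
            # Keep the exception type and message so the failure can be investigated later
            failures.append((document_id, type(exc).__name__, str(exc)))
    return rows, failures
```

The `failures` list can then be turned into its own DataFrame and saved next to the features, which keeps the diagnosis available without rerunning the extraction.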
