# Creating the IIT_CDIP_COLL_XML_FEATURES DataFrame

## README
This notebook extracts information from the collective XML files associated with the images of the RVL-CDIP dataset, by "joining" them against the IIT-CDIP dataset. The extracted information is gathered into two DataFrames, one for each of the two data sources that feed the XML:
- iit_cdip_coll_xml_a_features
- iit_cdip_coll_xml_ltdlwocr_features

The difference between the two sources is described as follows:

*The information in the `<A>` and `<LTDLWOCR>` elements is largely, but not completely, redundant with each other. The data in the `<LTDLWOCR>` elements was produced more recently and fixes some known minor glitches with data in the `<A>` elements. On the other hand, some of the interesting data in the `<LTDLWOCR>` elements is in XML comments, while all the data in the `<A>` elements is in XML subelements.*

It first performs some preliminary operations (section 1), including the definition of the global execution variables (**UPDATE THESE BEFORE FIRST USE**).

Next (section 2), it explores the XML files to determine their structure. Two tag dictionaries are created in this part so that the columns can later be renamed.

Finally (section 3), it builds the two DataFrames, which contain the following fields:
#### iit_cdip_coll_xml_a_features
- document_id
- corporate_source
- title
- authors_people_or_org
- corporate_source_and_id
- document_date
- attachment_group
- people_org_attending
- brands
- bates_number
- copied_people_org
- legal_case_id
- leagal_case_name
- document_characteristics
- document_description
- document_begin_bates_number
- document_end_bates_number
- date_loaded
- date_modified
- date_produced
- document_type
- estimated_date
- ending_date
- file
- grant_number
- litigation_usage
- names_mentionned
- names_noted
- oklahoma_downgrades
- page_count
- physical_attachment_1
- physical_attachment_2
- production_box
- recipients
- redacted
- request_number
- source
- special_collections
- date_shipped
- source_site
- st
- trial_exhibit
- topics
#### iit_cdip_coll_xml_ltdlwocr_features
- document_id
- authors_people
- bates_number
- production_box
- authors_organization
- recipients_organization
- document_date
- date_modified
- document_type
- file
- names_mentionned_organization
- names_mentionned_people
- ocr_output
- copied_people_org
- page_count
- recipients_people
- title

## 1. Setup

In [None]:
import sys
from pathlib import Path

# Add the project root (parent of the working directory) to sys.path so that `src` is importable.
project_root = Path().resolve().parent
if project_root not in [Path(p).resolve() for p in sys.path]:
    sys.path.append(str(project_root))

from src import PATHS

In [None]:
import os
import time
import numpy as np
import pandas as pd

from collections import defaultdict

from lxml import etree
from functools import reduce
from utils import remove_ds_store_files

## 2. Exploring the XML files

### 2.1. Identifying the structure of the XML files

In [None]:
class StructureNode:
    def __init__(self, tag_name):
        self.tag_name = tag_name
        self.attributes = set()
        self.children = {}

    def add_attributes(self, attrib_keys):
        self.attributes.update(attrib_keys)

    def add_child(self, child_node):
        if not isinstance(child_node, StructureNode):
            raise TypeError(f"Expected StructureNode, got {type(child_node)}")
        if child_node.tag_name not in self.children:
            self.children[child_node.tag_name] = child_node
        else:
            self.children[child_node.tag_name].merge(child_node)

    def merge(self, other_node):
        self.attributes.update(other_node.attributes)
        for child_tag, child_node in other_node.children.items():
            if child_tag in self.children:
                self.children[child_tag].merge(child_node)
            else:
                self.children[child_tag] = child_node

    def display(self, level=0):
        indent = "  " * level
        attrs = f" [attributes: {', '.join(sorted(self.attributes))}]" if self.attributes else ""
        print(f"{indent}- {self.tag_name}{attrs}")
        for child in sorted(self.children.values(), key=lambda c: str(c.tag_name)):
            child.display(level + 1)

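def build_structure(element):
    """Recursively build the StructureNode tree of an lxml element."""
    node = StructureNode(element.tag)
    node.add_attributes(element.attrib.keys())
    for child in element:
        # In lxml, comments and processing instructions are also
        # etree._Element instances; their .tag is a function, not a str,
        # which is why display() sorts children on str(tag_name).
        if isinstance(child, etree._Element):
            child_node = build_structure(child)
            node.add_child(child_node)
    return node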

def parse_file(filename):
    tree = etree.parse(filename)
    root = tree.getroot()
    return build_structure(root)

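def merge_structures(root_nodes):
    """Merge per-file structures; differing root tags go under a synthetic "MergedRoot"."""
    if not root_nodes:
        return None
    base = root_nodes[0]
    wrapped = False  # True once `base` is the synthetic "MergedRoot" node
    for node in root_nodes[1:]:
        if wrapped:
            # add_child merges nodes that share a tag, avoiding nested MergedRoot nodes
            base.add_child(node)
        elif node.tag_name == base.tag_name:
            base.merge(node)
        else:
            merged_root = StructureNode("MergedRoot")
            merged_root.add_child(base)
            merged_root.add_child(node)
            base = merged_root
            wrapped = True
    return base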

def parse_files(files):
    roots = []
    for filename in files:
        try:
            struct = parse_file(filename)
            roots.append(struct)
        except (etree.XMLSyntaxError, FileNotFoundError) as e:
            print(f"Error processing '{filename}': {e}")
    return merge_structures(roots)

In [None]:
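# Skip hidden files such as .DS_Store, which are not XML documents.
xml_file_paths = sorted(p for p in PATHS.iit_cdip_xmls.iterdir() if not p.name.startswith("."))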
merged_structure = parse_files(xml_file_paths)
merged_structure.display()
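
As a lightweight cross-check of the merged tree printed above, roughly the same information can be obtained by collecting the set of distinct tag paths. The sketch below is an illustration under assumptions, not the notebook's method: it uses the standard-library `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies, and the helper `collect_tag_paths`, the `root` element name and the two sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature documents; the real files are far larger.
SAMPLE_DOCS = [
    "<root><record><LTDLWOCR><tid>abc</tid><ti>Memo</ti></LTDLWOCR></record></root>",
    '<root><record><ucsf200507><A ID="ABC"><K>Memo</K></A></ucsf200507></record></root>',
]


def collect_tag_paths(element, prefix=""):
    """Return the set of slash-separated tag paths found at and under `element`."""
    path = f"{prefix}/{element.tag}"
    paths = {path}
    for child in element:
        paths |= collect_tag_paths(child, path)
    return paths


# Union of the paths seen across all documents.
all_paths = set()
for doc in SAMPLE_DOCS:
    all_paths |= collect_tag_paths(ET.fromstring(doc))

for path in sorted(all_paths):
    print(path)
```

A flat set of paths loses the attribute information that `StructureNode` keeps, but it is easy to diff between two batches of files.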

### 2.2. Defining the record tags
The structure recovered above matches the description document for these XML files found online (no official DTD is available).
Based on that document, two tag dictionaries are created to make the DataFrames easier to read.
Fields that are meant to be strictly identical across the two DataFrames carry exactly the same name once renamed through the dictionaries.

In [None]:
tags_a = {
    'DS': 'corporate_source',
    'K': 'title',
    'L': 'authors_people_or_org',
    'PV': 'corporate_source_and_id',
    'YR': 'document_date',
    'ag': 'attachment_group',
    'at': 'people_org_attending',
    'b': 'brands',
    'br': 'bates_number',
    'c': 'copied_people_org',
    'ci': 'legal_case_id',
    'cn': 'leagal_case_name',
    'co': 'document_characteristics',
    'd': 'document_description',
    'db': 'document_begin_bates_number',
    'de': 'document_end_bates_number',
    'dl': 'date_loaded',
    'dm': 'date_modified',
    'dp': 'date_produced',
    'dt': 'document_type',
    'ed': 'estimated_date',
    'eda': 'ending_date',
    'f': 'file',
    'gn': 'grant_number',
    'lu': 'litigation_usage',
    'm': 'names_mentionned',
    'n': 'names_noted',
    'od': 'oklahoma_downgrades',
    'p': 'page_count',
    'pa1': 'physical_attachment_1',
    'pa2': 'physical_attachment_2',
    'pb': 'production_box',
    'r': 'recipients',
    're': 'redacted',
    'rn': 'request_number',
    's': 'source',
    'sc': 'special_collections',
    'sh': 'date_shipped',
    'si': 'source_site',
    'st': 'st',
    'te': 'trial_exhibit',
    'tp': 'topics',
}

In [None]:
tags_ltdlwocr = {
    'au': 'authors_people',
    'bt': 'bates_number',
    'bx': 'production_box',
    'ca': 'authors_organization',
    'cr': 'recipients_organization',
    'dd': 'document_date',
    'dl': 'date_modified',
    'dt': 'document_type',
    'fn': 'file',
    'no': 'names_mentionned_organization',
    'np': 'names_mentionned_people',
    'ot': 'ocr_output',
    'pc': 'copied_people_org',
    'pg': 'page_count',
    'rc': 'recipients_people',
    'ti': 'title',
    'tid': 'document_id',
}
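
Because shared fields are meant to carry identical names after renaming, the overlap between the two dictionaries can be checked with a set intersection. The sketch below uses abbreviated excerpts of `tags_a` and `tags_ltdlwocr` (the `_excerpt` names are hypothetical) so that it runs on its own; in the notebook the full dictionaries would be used instead.

```python
# Abbreviated excerpts of the two dictionaries defined above.
tags_a_excerpt = {
    "K": "title",
    "YR": "document_date",
    "br": "bates_number",
    "s": "source",
}
tags_ltdlwocr_excerpt = {
    "ti": "title",
    "dd": "document_date",
    "bt": "bates_number",
    "ot": "ocr_output",
}

names_a = set(tags_a_excerpt.values())
names_ltdlwocr = set(tags_ltdlwocr_excerpt.values())

# Column names present in both DataFrames after renaming.
shared = sorted(names_a & names_ltdlwocr)
# Column names specific to one source.
only_a = sorted(names_a - names_ltdlwocr)
only_ltdlwocr = sorted(names_ltdlwocr - names_a)

print("shared:", shared)
print("only in A:", only_a)
print("only in LTDLWOCR:", only_ltdlwocr)
```

Running the same comparison on the full dictionaries gives the list of columns on which the two DataFrames can be compared directly.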

## 3. Creating and saving the DataFrames

### 3.1. Creation

In [None]:
def parse_records_from_file(file_path):
    """Return (ltdlwocr_rows, a_rows): one dict per record for each source."""
    tree = etree.parse(file_path)
    root = tree.getroot()

    records = root.findall(".//record")

    ltdlwocr_data = []
    a_data = []

    for record in records:
        # LTDLWOCR part: one entry per child tag (comments are skipped)
        ltdlwocr = record.find("LTDLWOCR")
        if ltdlwocr is not None:
            ltdlwocr_row = {
                child.tag: (child.text.strip() if child.text else "")
                for child in ltdlwocr
                if isinstance(child.tag, str)
            }
            ltdlwocr_data.append(ltdlwocr_row)

        # A part, nested under ucsf200507
        a = record.find("ucsf200507/A")
        if a is not None:
            # Read the ID attribute, lowercased
            document_id = a.attrib["ID"].lower()
            a_row = {
                "document_id": document_id
            }
            # Add the child tags
            a_row.update({
                child.tag: (child.text.strip() if child.text else "")
                for child in a
                if isinstance(child.tag, str)
            })
            a_data.append(a_row)

    return ltdlwocr_data, a_data


def parse_multiple_files(file_paths):
    all_ltdlwocr = []
    all_a = []
    for file_path in file_paths:
        ltdlwocr_rows, a_rows = parse_records_from_file(file_path)
        all_ltdlwocr.extend(ltdlwocr_rows)
        all_a.extend(a_rows)

    # Convert to DataFrames
    df_ltdlwocr = pd.DataFrame(all_ltdlwocr)
    df_a = pd.DataFrame(all_a)
    
    return df_ltdlwocr, df_a
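
The parsing above keeps only subelements (the `isinstance(child.tag, str)` test filters out comments), whereas the dataset description quoted in the README notes that some `<LTDLWOCR>` data lives in XML comments. The following sketch shows one way such comment text could be collected alongside the subelements. It is an illustration under assumptions: the sample record and its comment content are invented, the helper `split_children` is hypothetical, and it uses the standard-library parser (Python 3.8+, with `insert_comments=True`) rather than `lxml`, which keeps comments in the tree by default.

```python
import xml.etree.ElementTree as ET

# Invented sample; real records contain many more fields.
SAMPLE = """
<record>
  <LTDLWOCR>
    <tid>abc12d00</tid>
    <!-- hypothetical comment payload -->
    <ti> Memo </ti>
  </LTDLWOCR>
</record>
""".strip()


def split_children(parent):
    """Return (fields, comments) for the direct children of `parent`."""
    fields, comments = {}, []
    for child in parent:
        text = (child.text or "").strip()
        if isinstance(child.tag, str):
            fields[child.tag] = text   # regular subelement
        elif child.tag is ET.Comment:
            comments.append(text)      # XML comment
    return fields, comments


# The standard-library parser drops comments unless asked to keep them.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
record = ET.fromstring(SAMPLE, parser=parser)
fields, comments = split_children(record.find("LTDLWOCR"))

print(fields)
print(comments)
```

Whether the comment text is worth keeping depends on what it actually contains in the real files, which this sketch does not assume.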


In [None]:
df_ltdlwocr, df_a = parse_multiple_files(xml_file_paths)

In [None]:
len(df_a), len(df_ltdlwocr)
# This accounts for the 2000 files found missing after downloading the records (see the end of notebook 1.2)
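
Row counts only show that the two sources differ in size. To see which documents are affected, the two sets of identifiers can be compared directly. The sketch below is a minimal illustration with invented identifiers; in the notebook the inputs would presumably be the `document_id` values of `df_a` and the `tid` values of `df_ltdlwocr`, which is an assumption about how the two sources line up.

```python
# Invented identifiers standing in for the two ID columns.
ids_a = {"abc12d00", "def34e00", "ghi56f00"}
ids_ltdlwocr = {"abc12d00", "def34e00", "jkl78g00"}

# Identifiers found in only one of the two sources, and in both.
only_in_a = sorted(ids_a - ids_ltdlwocr)
only_in_ltdlwocr = sorted(ids_ltdlwocr - ids_a)
in_both = ids_a & ids_ltdlwocr

print(f"{len(in_both)} shared, {len(only_in_a)} only in A, "
      f"{len(only_in_ltdlwocr)} only in LTDLWOCR")
```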

In [None]:
df_a = df_a.rename(columns=tags_a).set_index("document_id", drop=True)
df_ltdlwocr = df_ltdlwocr.rename(columns=tags_ltdlwocr).set_index("document_id", drop=True)
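
`DataFrame.rename` leaves any column that is absent from the mapping unchanged, so a raw tag missing from `tags_a` or `tags_ltdlwocr` would silently survive under its short name. A quick check run before the renaming cell can surface such cases. The sketch below works on plain lists of column names; the helper `unmapped_columns` and the sample columns (including the `zz` tag) are hypothetical.

```python
def unmapped_columns(columns, mapping, keep=("document_id",)):
    """Return the column names that `mapping` would leave untouched."""
    return sorted(c for c in columns if c not in mapping and c not in keep)


# Hypothetical raw columns as they could come out of the parser.
raw_columns = ["document_id", "K", "YR", "zz"]
mapping_excerpt = {"K": "title", "YR": "document_date"}

print(unmapped_columns(raw_columns, mapping_excerpt))  # → ['zz']
```

In the notebook, the equivalent call would be along the lines of `unmapped_columns(df_a.columns, tags_a)`, made on the DataFrame as returned by `parse_multiple_files`.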

In [None]:
display(df_a.head(1))
display(df_ltdlwocr.head(1))

### 3.2. Saving the DataFrames

In [None]:
df_a.to_parquet(PATHS.processed_data / "df_iit_cdip_coll_xml_a_features.parquet")
df_ltdlwocr.to_parquet(PATHS.processed_data / "df_iit_cdip_coll_xml_ltdlwocr_features.parquet")