# Creating Hierarchies

A **generalization hierarchy** is a structure that defines levels of abstraction for an attribute in a dataset. It is used to anonymize data while preserving utility, by grouping detailed values into progressively more general categories.

Example with age:

Original age | Level 1 (most detailed) | Level 2 (more general) | Level 3 (most general)
--- | --- | --- | ---
23 | 20-29 | 20-39 | *
27 | 20-29 | 20-39 | *
45 | 40-49 | 40-59 | *
51 | 50-59 | 40-59 | *

Here, age is generalized into ranges so that specific individuals are harder to identify.

### Numeric Attributes

For numeric attributes we can use ranges. Below we define a function that builds a generalization hierarchy for a numeric attribute. It takes the minimum and maximum values of the attribute and assigns each level a list of equal-width ranges; the ranges grow wider from one level to the next, that is, less detailed.
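As a minimal sketch of the range arithmetic involved (the helper name `generalize_value` is hypothetical and not part of the notebook), a value can be rounded down to the nearest multiple of the range width to obtain the lower bound of its range:

```python
def generalize_value(value: int, step: int) -> str:
    """Return the inclusive range of width `step` that contains `value`."""
    lower = (value // step) * step  # round down to the nearest multiple of step
    upper = lower + step - 1        # inclusive upper bound
    return f"{lower}-{upper}"

# Age 23 at increasingly coarse levels
print([generalize_value(23, step) for step in (5, 10, 20, 40)])
# → ['20-24', '20-29', '20-39', '0-39']
```

Because every range starts at a multiple of its width, the ranges at a given level never overlap.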

In [1]:
from csv import writer
import os

# Build the hierarchy levels for a numeric attribute
def generate_num_hierarchy(start_from : int, end_at : int, steps : list, offset : int = 1, filename : str = None) -> None:
    """
    Generate a generalization hierarchy for a numeric attribute.
    :param start_from: First value (inclusive)
    :param end_at: Last value (exclusive, as in range())
    :param steps: List of range widths, one per level, from finest to coarsest
    :param offset: Increment between consecutive attribute values; defaults to 1
    :param filename: Path of the output CSV file; if omitted, the hierarchy is printed to the console
    """

    # Create the output directory only when a path with a directory part is given
    if filename:
        directory = os.path.dirname(filename)
        if directory and not os.path.exists(directory):
            os.makedirs(directory)
            print(f"Creado el directorio {directory}")

    hierarchy = []
    for num in range(start_from, end_at, offset):
        levels = []
        for step in steps:
            lower = (num // step) * step  # start of the range that contains num
            levels.append(f"{lower}-{lower + step - 1}")
        levels.append("*")  # top level: the value is fully suppressed
        hierarchy.append([num] + levels)

    headers = ["Original"] + [f"Nivel{i+1}" for i in range(len(steps)+1)]
    if filename:
        with open(filename, mode="w", newline="") as file:
            csvwriter = writer(file)
            csvwriter.writerow(headers)  # Header row
            csvwriter.writerows(hierarchy)
        print(f"Jerarquía guardada en {filename}")
    else:
        print(headers)
        for elem in hierarchy:
            print(elem)
            print()

Let us apply this function to the numeric attributes of the Adults' Income Dataset.

In [2]:
hierarchies_folder = "../data/adults/hierarchies/"

# Generate and export the hierarchy for 'age'
generate_num_hierarchy(0,100,[5,10,20,40], filename=f"{hierarchies_folder}age_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/age_hierarchy.csv


In [3]:
# Generate and export the hierarchy for 'fnlwgt'
generate_num_hierarchy(0,1500000,[5000,10000,30000,60000,120000], offset=100, filename=f"{hierarchies_folder}fnlwgt_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/fnlwgt_hierarchy.csv


In [4]:
# Generate and export the hierarchy for 'hours-per-week'
generate_num_hierarchy(0,100,[5,10,20,40], filename=f"{hierarchies_folder}hours-per-week_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/hours-per-week_hierarchy.csv


In [None]:
# Generate and export the hierarchies for 'capital-gain' and 'capital-loss'
generate_num_hierarchy(0,100000,[1000,5000,10000,20000], offset= 100, filename=f"{hierarchies_folder}capital-gain_hierarchy.csv")
generate_num_hierarchy(0,5000,[250,500,1000,2000], offset= 10, filename=f"{hierarchies_folder}capital-loss_hierarchy.csv")


Jerarquía guardada en ../data/adults/hierarchies/capital-gain_hierarchy.csv
Jerarquía guardada en ../data/adults/hierarchies/capital-loss_hierarchy.csv


In [24]:
# Generate and export the hierarchy for 'education-num'
generate_num_hierarchy(0,20,[2,4,8], filename=f"{hierarchies_folder}education-num_hierarchy.csv")


Jerarquía guardada en ../data/adults/hierarchies/education-num_hierarchy.csv
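The step lists used above share one property: each width evenly divides the next. Since every range starts at a multiple of its width, this means each finer range falls inside exactly one coarser range. The following is a small sketch of how that property could be checked before generating a hierarchy; the function name `steps_are_nested` is an assumption made for illustration and does not appear in the notebook.

```python
def steps_are_nested(steps: list) -> bool:
    """Return True if every step evenly divides the next, coarser step."""
    return all(coarse % fine == 0 for fine, coarse in zip(steps, steps[1:]))

print(steps_are_nested([5, 10, 20, 40]))  # → True
print(steps_are_nested([5, 10, 25]))      # → False (25 is not a multiple of 10)
```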


### Categorical Attributes

We now define a function for categorical attributes. In this case, we take the distinct values and build a semantic generalization hierarchy, that is, one based on the meaning of the values. For example, a generalization hierarchy for the "education" attribute could be defined as:

Original value | Level 1 (most detailed) | Level 2 (more general) | Level 3 (most general)
--- | --- | --- | ---
Bachelors | University | Higher Education | *
Doctorate | University | Higher Education | *
HS-grad | High School | Lower Education | *
12th | High School | Lower Education | *
Some-college | College | Higher Education | *
Assoc-acdm | Associate Degree | Higher Education | *
Assoc-voc | Associate Degree | Higher Education | *
9th | Primary School | Lower Education | *
1st-4th | Primary School | Lower Education | *
Preschool | Primary School | Lower Education | *
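To illustrate how such a table might be used once it exists, here is a minimal sketch that looks up a value at a chosen level. The rows are copied from the table above; the names `education_rows`, `lookup`, and `generalize_category` are hypothetical and not part of the notebook.

```python
# A few rows from the education table: [original, level 1, level 2, level 3]
education_rows = [
    ["Bachelors", "University", "Higher Education", "*"],
    ["HS-grad", "High School", "Lower Education", "*"],
    ["9th", "Primary School", "Lower Education", "*"],
]

# Index the rows by their original value for direct lookup
lookup = {row[0]: row for row in education_rows}

def generalize_category(value: str, level: int) -> str:
    """Return `value` generalized to `level` (level 0 keeps the original)."""
    return lookup[value][level]

print(generalize_category("HS-grad", 1))  # → High School
print(generalize_category("HS-grad", 2))  # → Lower Education
print(generalize_category("9th", 3))      # → *
```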



In [25]:
# Save a categorical attribute hierarchy to CSV
def generate_cat_hierarchy(hierarchy_data : list, levels : int = 3, filename : str = None) -> None:
    """
    Save a generalization hierarchy for a categorical attribute to a CSV file.
    :param hierarchy_data: List of rows, each of the form [original value, level 1, ..., level n]
    :param levels: Number of hierarchy levels, including the final "*"; defaults to 3
    :param filename: Path of the output CSV file; if omitted, the hierarchy is printed to the console
    """

    headers = ["Original"] + [f"Nivel{i+1}" for i in range(levels)]
    if filename is None:
        print(headers)
        for elem in hierarchy_data:
            print(elem)
            print()
        return

    # Create the output directory if the path has a directory part that does not exist yet
    directory = os.path.dirname(filename)
    if directory and not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Creado el directorio {directory}")

    with open(filename, mode="w", newline="") as file:
        csvwriter = writer(file)
        csvwriter.writerow(headers)  # Header row
        csvwriter.writerows(hierarchy_data)
    print(f"Jerarquía guardada en {filename}")

In [26]:
# Hierarchy for workclass
workclass_hierarchy = [
    ["State-gov", "Government", "Employed", "*"],
    ["Federal-gov", "Government", "Employed", "*"],
    ["Local-gov", "Government", "Employed", "*"],
    ["Private", "Private", "Employed", "*"],
    ["Self-emp-not-inc", "Self-employed", "Employed", "*"],
    ["Self-emp-inc", "Self-employed", "Employed", "*"],
    ["Without-pay", "Unemployed", "Unemployed", "*"],
    ["Never-worked", "Unemployed", "Unemployed", "*"],
    ["?", "Unknown", "Unknown", "*"]
]
generate_cat_hierarchy(workclass_hierarchy, filename=f"{hierarchies_folder}workclass_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/workclass_hierarchy.csv


In [27]:
# Hierarchy for education
education_hierarchy = [
    ["Bachelors", "University", "Higher Education", "*"],
    ["Masters", "University", "Higher Education", "*"],
    ["Doctorate", "University", "Higher Education", "*"],
    ["Prof-school", "University", "Higher Education", "*"],
    ["HS-grad", "High School", "Lower Education", "*"],
    ["11th", "High School", "Lower Education", "*"],
    ["10th", "High School", "Lower Education", "*"],
    ["12th", "High School", "Lower Education", "*"],
    ["Some-college", "College", "Higher Education", "*"],
    ["Assoc-acdm", "Associate Degree", "Higher Education", "*"],
    ["Assoc-voc", "Associate Degree", "Higher Education", "*"],
    ["9th", "Primary School", "Lower Education", "*"],
    ["7th-8th", "Primary School", "Lower Education", "*"],
    ["5th-6th", "Primary School", "Lower Education", "*"],
    ["1st-4th", "Primary School", "Lower Education", "*"],
    ["Preschool", "Primary School", "Lower Education", "*"]
]
generate_cat_hierarchy(education_hierarchy, filename=f"{hierarchies_folder}education_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/education_hierarchy.csv


In [28]:
# Hierarchy for marital-status
marital_hierarchy = [
    ["Married-civ-spouse", "Married", "In Relationship", "*"],
    ["Married-AF-spouse", "Married", "In Relationship", "*"],
    ["Never-married", "Single", "Not Married", "*"],
    ["Divorced", "Divorced/Widowed", "Not Married", "*"],
    ["Separated", "Divorced/Widowed", "Not Married", "*"],
    ["Widowed", "Divorced/Widowed", "Not Married", "*"],
    ["Married-spouse-absent", "Divorced/Widowed", "Not Married", "*"]
]
generate_cat_hierarchy(marital_hierarchy, filename=f"{hierarchies_folder}marital-status_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/marital-status_hierarchy.csv


In [29]:
# Hierarchy for occupation
occupation_hierarchy = [
    ["Exec-managerial", "Professional", "White Collar", "*"],
    ["Prof-specialty", "Professional", "White Collar", "*"],
    ["Adm-clerical", "Office Jobs", "White Collar", "*"],
    ["Tech-support", "Office Jobs", "White Collar", "*"],
    ["Sales", "Service Jobs", "White Collar", "*"],
    ["Protective-serv", "Service Jobs", "White Collar", "*"],
    ["Other-service", "Service Jobs", "White Collar", "*"],
    ["Machine-op-inspct", "Manual Labor", "Blue Collar", "*"],
    ["Craft-repair", "Manual Labor", "Blue Collar", "*"],
    ["Transport-moving", "Manual Labor", "Blue Collar", "*"],
    ["Farming-fishing", "Manual Labor", "Blue Collar", "*"],
    ["Handlers-cleaners", "Low Wage Jobs", "Blue Collar", "*"],
    ["Priv-house-serv", "Low Wage Jobs", "Blue Collar", "*"],
    ["Armed-Forces", "Military", "Other", "*"],
    ["?", "Unknown", "Unknown", "*"]
]
generate_cat_hierarchy(occupation_hierarchy, filename=f"{hierarchies_folder}occupation_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/occupation_hierarchy.csv


In [30]:
# Hierarchy for relationship
relationship_hierarchy = [
    ["Husband", "Married", "*"],
    ["Wife", "Married", "*"],
    ["Own-child", "Child", "*"],
    ["Not-in-family", "Other", "*"],
    ["Other-relative", "Other", "*"],
    ["Unmarried", "Other", "*"]
]
generate_cat_hierarchy(relationship_hierarchy, levels=2, filename=f"{hierarchies_folder}relationship_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/relationship_hierarchy.csv


In [31]:
# Hierarchy for race
race_hierarchy = [
    ["White", "White", "*"],
    ["Black", "Black", "*"],
    ["Asian-Pac-Islander", "Asian", "*"],
    ["Amer-Indian-Eskimo", "Indigenous", "*"],
    ["Other", "Other", "*"]
]
generate_cat_hierarchy(race_hierarchy, levels=2, filename=f"{hierarchies_folder}race_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/race_hierarchy.csv


In [32]:
# Hierarchy for sex
sex_hierarchy = [
    ["Male", "*"],
    ["Female", "*"]
]
generate_cat_hierarchy(sex_hierarchy, levels=1, filename=f"{hierarchies_folder}sex_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/sex_hierarchy.csv


In [None]:
# Hierarchy for native-country
country_hierarchy = [
    ["United-States", "North America", "America", "*"],
    ["Canada", "North America", "America", "*"],
    ["Mexico", "North America", "America", "*"],
    ["Cuba", "Latin America", "America", "*"],
    ["Jamaica", "Latin America", "America", "*"],
    ["Puerto-Rico", "Latin America", "America", "*"],
    ["Haiti", "Latin America", "America", "*"],
    ["Dominican-Republic", "Latin America", "America", "*"],
    ["Honduras", "Latin America", "America", "*"],
    ["El-Salvador", "Latin America", "America", "*"],
    ["Guatemala", "Latin America", "America", "*"],
    ["Nicaragua", "Latin America", "America", "*"],
    ["Columbia", "Latin America", "America", "*"],
    ["Ecuador", "Latin America", "America", "*"],
    ["Peru", "Latin America", "America", "*"],
    ["England", "Europe", "Europe", "*"],
    ["Germany", "Europe", "Europe", "*"],
    ["France", "Europe", "Europe", "*"],
    ["Poland", "Europe", "Europe", "*"],
    ["China", "Asia", "Asia", "*"],
    ["Japan", "Asia", "Asia", "*"],
    ["India", "Asia", "Asia", "*"],
    ["Iran", "Asia", "Asia", "*"],
    ["Vietnam", "Asia", "Asia", "*"],
    ["Philippines", "Asia", "Asia", "*"],
    ["Thailand", "Asia", "Asia", "*"],
    ["Taiwan", "Asia", "Asia", "*"],
    ["South", "Asia", "Asia", "*"],
    ["Scotland", "Europe", "Europe", "*"],
    ["Portugal", "Europe", "Europe", "*"],
    ["Italy", "Europe", "Europe", "*"],
    ["Ireland", "Europe", "Europe", "*"],
    ["Hungary", "Europe", "Europe", "*"],
    ["Hong", "Asia", "Asia", "*"],
    ["Greece", "Europe", "Europe", "*"],
    ["Cambodia", "Asia", "Asia", "*"],
    ["Laos", "Asia", "Asia", "*"],
    ["Trinadad&Tobago", "Latin America", "America", "*"],
    ["Yugoslavia", "Europe", "Europe", "*"],
    ["Outlying-US(Guam-USVI-etc)", "North America", "America", "*"],
    ["Holand-Netherlands", "Europe", "Europe", "*"],
    ["?", "Unknown", "Unknown", "*"]
]
generate_cat_hierarchy(country_hierarchy, filename=f"{hierarchies_folder}native-country_hierarchy.csv")

Jerarquía guardada en ../data/adults/hierarchies/native-country_hierarchy.csv


### Exporting Hierarchies

The generalization hierarchies have been exported as CSV files in `../data/adults/hierarchies/` so they can be imported in other notebooks.
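One way another notebook might read a hierarchy back is sketched below, using only the standard `csv` module. The function name `load_hierarchy` is an assumption for illustration; the only thing it relies on is the file layout written above (a header row followed by one row per original value).

```python
import csv

def load_hierarchy(path: str) -> dict:
    """Map each original value to its list of generalized levels."""
    with open(path, newline="") as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row (Original, Nivel1, ...)
        return {row[0]: row[1:] for row in reader}
```

For example, `load_hierarchy("../data/adults/hierarchies/sex_hierarchy.csv")` would return a dictionary keyed by `"Male"` and `"Female"`. Note that `csv.reader` yields strings, so the keys of a numeric hierarchy come back as `"23"` rather than `23`.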