# EDA of victimas-accionar-represivo-ilegal.csv

## Python Environment Setup

This project uses a Python virtual environment to keep dependencies isolated.  

### 1. Create and activate the virtual environment

```bash
# From the project root
cd Toys/Suisei_eda

# Create the venv inside this folder
python3 -m venv venv

# Activate it
source venv/bin/activate   # macOS / Linux / zsh
# On Windows (cmd): venv\Scripts\activate
# On Windows (PowerShell): venv\Scripts\Activate.ps1
```

You should now see `(venv)` in your terminal prompt.

---

### 2. Install dependencies

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

---

### 3. Running notebooks or scripts

* Open the project in your IDE (VS Code, PyCharm, etc.).
* Select the Python interpreter inside the venv:

  * macOS / Linux: `Toys/Suisei_eda/venv/bin/python`
  * Windows: `Toys\Suisei_eda\venv\Scripts\python.exe`
* Open notebooks or Python scripts — they will run using this environment.

> No JupyterLab server is required. The venv is automatically used by your IDE.

---

### 4. (Optional) Register the venv as a Jupyter kernel

If you want to use it inside Jupyter:

```bash
source venv/bin/activate
pip install ipykernel
python -m ipykernel install --user --name=suisei-venv --display-name "Python (Suisei venv)"
```

* `--name` is internal; `--display-name` is what shows in Jupyter.
* Open JupyterLab or Notebook and select **Python (Suisei venv)** from the kernel menu.


In [49]:
# imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import re


## Data ingestion and cleaning

In [50]:
# ingest data and inspect columns

df_raw = pd.read_csv('victimas-accionar-represivo-ilegal.csv')
df_clean = df_raw.copy()

print(df_raw.info())
df_raw.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8632 entries, 0 to 8631
Data columns (total 15 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   anio_denuncia                               8632 non-null   int64 
 1   tipificacion_ruvte                          8632 non-null   object
 2   id_unico_ruvte                              8632 non-null   object
 3   apellido_paterno_nombres                    8632 non-null   object
 4   apellido_materno                            8330 non-null   object
 5   apellido_casada                             1024 non-null   object
 6   edad_al_momento_del_hecho                   8632 non-null   object
 7   documentos                                  8632 non-null   object
 8   anio_nacimiento                             8632 non-null   object
 9   provincia_pais_nacimiento                   8632 non-null   object
 10  nacionalidad            

Unnamed: 0,anio_denuncia,tipificacion_ruvte,id_unico_ruvte,apellido_paterno_nombres,apellido_materno,apellido_casada,edad_al_momento_del_hecho,documentos,anio_nacimiento,provincia_pais_nacimiento,nacionalidad,embarazo,fecha_lugar_detencion_secuestro,fecha_lugar_asesinato_o_hallazgo_de_restos,fotografia
0,1984,DESAPARICION FORZADA,ID 5389,ABACHIAN JUAN CARLOS,BEDROSSIAN,,26 años,LE 8293245,1950,BUENOS AIRES,ARGENTINA,,26/12/1976 LA PLATA BUENOS AIRES,---,Sí
1,1984,DESAPARICION FORZADA,ID 87,ABAD ANA CATALINA,SCARLATA,PERUCCA,24 años,LC 10048122,1951,MENDOZA,ARGENTINA,,15/08/1976 CORDOBA CAPITAL CORDOBA,---,Sí
2,1984,DESAPARICION FORZADA,ID 11788,ABAD JULIO RICARDO,CORONEL,,21 años,DNI 10283544,1954,TUCUMAN,ARGENTINA,,NOV/1976 CAPITAL FEDERAL,---,No
3,1984,ASESINATO,ID 9907,ABAD OSCAR GERARDO,DOMATO,,25 años,DNI 10353245,1951,BUENOS AIRES,ARGENTINA,,08/10/1976 LA PLATA BUENOS AIRES,21/10/1976 GRAL. MANSILLA (BARTOLOME BAVIO) ...,No
4,1984,DESAPARICION FORZADA,ID 89,ABAD ROBERTO RODOLFO TOMAS,ZABALA,,23 años,DNI 10650064,1953,CAPITAL FEDERAL,ARGENTINA,,09/08/1976 FLORIDA VICENTE LOPEZ BUENOS AIRES,---,Sí


The following cells clean and process the data as follows:
- `tipificacion_ruvte`: This column describes the case, and stores values such as 'DESAPARICION FORZADA' and 'ASESINATO', often with extra details. I encoded the values into the following broader categories and added an other/unknown option to handle missing values:
    - 'Child disappearance'
    - 'Forced disappearance'
    - 'Child murder'
    - 'Murder' 
    - 'Negligent death'
    - 'Born in captivity disappearance'
    - 'Other / Unknown'
- `id_unico_ruvte`: This column stores the case ID in the format 'ID xxxx'. I processed the values to be stored as ints by extracting the numeric portion.
- `edad_al_momento_del_hecho`: This column stores the age of the missing person at time of incident in the format 'xx año(s)/mes(es)/día(s)'. I processed the values to be stored as ints in years using floor division for months and days.
- `fecha_lugar_detencion_secuestro` and `fecha_lugar_asesinato_o_hallazgo_de_restos`: These columns store the date and location of the disappearance/murder combined. I split each into separate date and location columns (`fecha_detencion`, `lugar_detencion`, `fecha_asesinato`, `lugar_asesinato`) and parsed the dates into datetime objects. I also encoded ambiguous dates such as 'fines/09/1976' to approximated dates like '28/09/1976' and handled complex Spanish date formats and ranges.
- `embarazo`: This column contains pregnancy status information. I normalized the values into standardized categories:
    - 'Probably pregnant'
    - 'Pregnant - stillbirth'
    - 'Pregnant - child born in captivity'
    - 'Pregnant - detained'
    - 'Pregnant (confirmed)'
    - 'No data'
- `nacionalidad`: This column contained nationality information with some entries marked as naturalized (e.g., 'ARGENTINA (NATURALIZADA)'). I normalized these by extracting the base nationality and created a separate `naturalizada` boolean column to indicate naturalization status.
- `anio_nacimiento`: Birth year column with 'sin datos' entries converted to missing values and remaining values converted to integers.
- `fotografia`: Boolean column indicating whether a photograph exists, converted from 'Sì'/'No' to True/False.
- String columns (`apellido_paterno_nombres`, `apellido_materno`, `apellido_casada`, `documentos`, `provincia_pais_nacimiento`): Converted to pandas string dtype for consistency.
- `periodo`: Created a new categorical variable that classifies cases into historical periods based on detention dates: 'Pre-dictadura' (before March 24, 1976), 'Dictadura' (March 24, 1976 - December 10, 1983), and 'Post-dictadura' (after December 10, 1983).

In [51]:
# encode/normalize tipificacion_ruvte

# find unique vals in tipificacion_ruvte
print(f"Unique values in column '{'tipificacion_ruvte'}': {df_raw['tipificacion_ruvte'].unique()} ")

# function to aggregate case typles into broader categories
def normalize_case_type(value):
    val = value.upper()
    if 'DESAPARICION FORZADA' in val:
        if 'NIÑO' in val or 'NIÑA' in val:
            return 'Child disappearance'
        return 'Forced disappearance'
    elif 'ASESINATO' in val:
        if 'NIÑO' in val or 'NIÑA' in val:
            return 'Child murder'
        return 'Murder'
    elif 'NEGLIGENTE' in val:
        return 'Negligent death'
    elif 'NACIDA EN CAUTIVERIO' in val:
        return 'Born in captivity disappearance'
    else:
        return 'Other / Unknown'

df_clean['tipificacion_ruvte'] = df_raw['tipificacion_ruvte'].apply(normalize_case_type).astype('string')

df_clean.head()


Unique values in column 'tipificacion_ruvte': ['DESAPARICION FORZADA' 'ASESINATO'
 'ASESINATO / PROCEDENTE POR APLICACION DE ART. 6º DE LA LEY 24.823'
 'DESAPARICION FORZADA / PROBADO EL DECESO'
 'DESAPARICION FORZADA / EXHUMADOS E IDENTIFICADOS SUS RESTOS'
 'ASESINATO / PROBADO EN CAUSA JUDICIAL'
 'DESAPARICION FORZADA / CON INFORMACION SOBRE EL DECESO (NO PROBADO)'
 'NIÑA "DESAPARECIDA" JUNTO A PADRES O ALLEGADOS'
 'DESAPARICION FORZADA / INVESTIGADO EN CAUSA JUDICIAL'
 'DESAPARICION FORZADA / EN INVESTIGACION'
 'ASESINATO / EXHUMADOS E IDENTIFICADOS SUS RESTOS'
 'NIÑO ASESINADO JUNTO A PADRES O ALLEGADOS'
 'NIÑA ASESINADA JUNTO A PADRES O ALLEGADOS'
 'NIÑO "DESAPARECIDO" JUNTO A PADRES O ALLEGADOS'
 'ASESINATO / PROCEDENTE POR RESOLUCION JUDICIAL'
 'ASESINATO / INVESTIGADO EN CAUSA JUDICIAL'
 'DESAPARICION FORZADA / A DETERMINAR TIPIFICACION'
 'NIÑO FALLECIDO POR TRATO NEGLIGENTE'
 'ASESINATO / PROCEDENTE POR ACTO ADMINISTRATIVO EN APLICACION DE ART. 6º DE LA LEY 24.823'
 'DESAPARIC

Unnamed: 0,anio_denuncia,tipificacion_ruvte,id_unico_ruvte,apellido_paterno_nombres,apellido_materno,apellido_casada,edad_al_momento_del_hecho,documentos,anio_nacimiento,provincia_pais_nacimiento,nacionalidad,embarazo,fecha_lugar_detencion_secuestro,fecha_lugar_asesinato_o_hallazgo_de_restos,fotografia
0,1984,Forced disappearance,ID 5389,ABACHIAN JUAN CARLOS,BEDROSSIAN,,26 años,LE 8293245,1950,BUENOS AIRES,ARGENTINA,,26/12/1976 LA PLATA BUENOS AIRES,---,Sí
1,1984,Forced disappearance,ID 87,ABAD ANA CATALINA,SCARLATA,PERUCCA,24 años,LC 10048122,1951,MENDOZA,ARGENTINA,,15/08/1976 CORDOBA CAPITAL CORDOBA,---,Sí
2,1984,Forced disappearance,ID 11788,ABAD JULIO RICARDO,CORONEL,,21 años,DNI 10283544,1954,TUCUMAN,ARGENTINA,,NOV/1976 CAPITAL FEDERAL,---,No
3,1984,Murder,ID 9907,ABAD OSCAR GERARDO,DOMATO,,25 años,DNI 10353245,1951,BUENOS AIRES,ARGENTINA,,08/10/1976 LA PLATA BUENOS AIRES,21/10/1976 GRAL. MANSILLA (BARTOLOME BAVIO) ...,No
4,1984,Forced disappearance,ID 89,ABAD ROBERTO RODOLFO TOMAS,ZABALA,,23 años,DNI 10650064,1953,CAPITAL FEDERAL,ARGENTINA,,09/08/1976 FLORIDA VICENTE LOPEZ BUENOS AIRES,---,Sí


In [28]:
# extract id_unico_ruvte as int

# find unique vals in id_unico_ruvte
print(f"Unique values in column '{'id_unico_ruvte'}': {df_raw['id_unico_ruvte'].unique()} ")

# function to extract ID as int
def extract_id(value):
    if 'ID' in value:
        return int(value.split(' ')[1])
    else:
        return pd.NA

df_clean['id_unico_ruvte'] = df_raw['id_unico_ruvte'].apply(extract_id)
df_clean.head() 

Unique values in column 'id_unico_ruvte': ['ID 5389' 'ID 87' 'ID 11788' ... 'ID 10263' 'ID 5384' 'ID 7789'] 


Unnamed: 0,anio_denuncia,tipificacion_ruvte,id_unico_ruvte,apellido_paterno_nombres,apellido_materno,apellido_casada,edad_al_momento_del_hecho,documentos,anio_nacimiento,provincia_pais_nacimiento,nacionalidad,embarazo,fecha_lugar_detencion_secuestro,fecha_lugar_asesinato_o_hallazgo_de_restos,fotografia
0,1984,Forced disappearance,5389,ABACHIAN JUAN CARLOS,BEDROSSIAN,,26 años,LE 8293245,1950,BUENOS AIRES,ARGENTINA,,26/12/1976 LA PLATA BUENOS AIRES,---,Sí
1,1984,Forced disappearance,87,ABAD ANA CATALINA,SCARLATA,PERUCCA,24 años,LC 10048122,1951,MENDOZA,ARGENTINA,,15/08/1976 CORDOBA CAPITAL CORDOBA,---,Sí
2,1984,Forced disappearance,11788,ABAD JULIO RICARDO,CORONEL,,21 años,DNI 10283544,1954,TUCUMAN,ARGENTINA,,NOV/1976 CAPITAL FEDERAL,---,No
3,1984,Murder,9907,ABAD OSCAR GERARDO,DOMATO,,25 años,DNI 10353245,1951,BUENOS AIRES,ARGENTINA,,08/10/1976 LA PLATA BUENOS AIRES,21/10/1976 GRAL. MANSILLA (BARTOLOME BAVIO) ...,No
4,1984,Forced disappearance,89,ABAD ROBERTO RODOLFO TOMAS,ZABALA,,23 años,DNI 10650064,1953,CAPITAL FEDERAL,ARGENTINA,,09/08/1976 FLORIDA VICENTE LOPEZ BUENOS AIRES,---,Sí


In [29]:
# extract edad_al_momento_del_hecho as int

# find unique vals in edad_al_momento_del_hecho
print(f"Unique values in column '{'edad_al_momento_del_hecho'}': {df_raw['edad_al_momento_del_hecho'].unique()} ")

# function to extract age in years
def extract_age(value):
    if 'año' in value:
        return int(value.split(' ')[0])
    elif 'mes' in value:
        return int(value.split(' ')[0]) // 12
    elif 'día' in value:
        return int(value.split(' ')[0]) // 365
    else:
        return pd.NA

df_clean['edad_al_momento_del_hecho'] = df_raw['edad_al_momento_del_hecho'].apply(extract_age).astype('Int64')
df_clean.head()

Unique values in column 'edad_al_momento_del_hecho': ['26 años' '24 años' '21 años' '25 años' '23 años' '54 años' '27 años'
 '19 años' '33 años' '40 años' '43 años' '29 años' '35 años' '32 años'
 '42 años' '30 años' '17 años' '36 años' '22 años' '50 años' '65 años'
 '20 años' '34 años' '28 años' 'sin datos' '16 años' '39 años' '56 años'
 '48 años' '31 años' '37 años' '18 años' '55 años' '58 años' '38 años'
 '63 años' '1 año' '57 años' '53 años' '5 meses' '15 años' '59 años'
 '51 años' '47 años' '52 años' '13 años' '67 años' '60 años' '41 años'
 '3 años' '5 años' '44 años' '45 años' '49 años' '14 años' '61 años'
 '3 meses' '4 años' '68 años' '64 años' '62 años' '1 mes' '46 años'
 '6 años' '66 años' '74 años' '69 años' '73 años' '1 día' '7 años'
 '77 años' '4 meses' '5 días' '72 años' '9 años' '70 años' '80 años'
 '81 años' '76 años' '7 meses'] 


Unnamed: 0,anio_denuncia,tipificacion_ruvte,id_unico_ruvte,apellido_paterno_nombres,apellido_materno,apellido_casada,edad_al_momento_del_hecho,documentos,anio_nacimiento,provincia_pais_nacimiento,nacionalidad,embarazo,fecha_lugar_detencion_secuestro,fecha_lugar_asesinato_o_hallazgo_de_restos,fotografia
0,1984,Forced disappearance,5389,ABACHIAN JUAN CARLOS,BEDROSSIAN,,26,LE 8293245,1950,BUENOS AIRES,ARGENTINA,,26/12/1976 LA PLATA BUENOS AIRES,---,Sí
1,1984,Forced disappearance,87,ABAD ANA CATALINA,SCARLATA,PERUCCA,24,LC 10048122,1951,MENDOZA,ARGENTINA,,15/08/1976 CORDOBA CAPITAL CORDOBA,---,Sí
2,1984,Forced disappearance,11788,ABAD JULIO RICARDO,CORONEL,,21,DNI 10283544,1954,TUCUMAN,ARGENTINA,,NOV/1976 CAPITAL FEDERAL,---,No
3,1984,Murder,9907,ABAD OSCAR GERARDO,DOMATO,,25,DNI 10353245,1951,BUENOS AIRES,ARGENTINA,,08/10/1976 LA PLATA BUENOS AIRES,21/10/1976 GRAL. MANSILLA (BARTOLOME BAVIO) ...,No
4,1984,Forced disappearance,89,ABAD ROBERTO RODOLFO TOMAS,ZABALA,,23,DNI 10650064,1953,CAPITAL FEDERAL,ARGENTINA,,09/08/1976 FLORIDA VICENTE LOPEZ BUENOS AIRES,---,Sí


In [30]:
# function to convert mixed date formats to datetime

def parse_mixed_dates(date_str):
    if pd.isna(date_str) or date_str in ['---', '', 'nan', 'sin', 'Sin']:
        return pd.NaT
    
    date_str = str(date_str).strip()
    
    # handle special case: just "2ª" 
    if date_str == '2ª':
        return pd.NaT  # ambiguous, treat as missing
    
    # handle Spanish ordinals (1º = 1st, 2º = 2nd, etc.)
    date_str = date_str.replace('1º', '01').replace('2º', '02').replace('3º', '03')
    date_str = date_str.replace('2ª', '02')  # handle "2ª" -> "02"
    
    # translate Spanish months to English
    spanish_months = {
        'ENE': 'JAN', 'FEB': 'FEB', 'MAR': 'MAR', 'ABR': 'APR',
        'MAY': 'MAY', 'JUN': 'JUN', 'JUL': 'JUL', 'AGO': 'AUG',
        'SEP': 'SEP', 'OCT': 'OCT', 'NOV': 'NOV', 'DIC': 'DEC'
    }
    
    for spanish, english in spanish_months.items():
        date_str = date_str.replace(spanish, english)
    
    # handle Spanish text patterns including "fines" (end of period)
    if 'med' in date_str.lower():
        date_str = date_str.replace('med', '15').replace('MED', '15')
    if 'fin' in date_str.lower() or 'fines' in date_str.lower():
        date_str = date_str.replace('fin', '28').replace('FIN', '28').replace('fines', '28')
    if 'princ' in date_str.lower():
        date_str = date_str.replace('princ', '01').replace('PRINC', '01')
    
    # handle "AÑO/YYYY" and "AÑOS/YYYY-YY" patterns
    if 'AÑO' in date_str.upper():
        if 'AÑOS/' in date_str.upper():
            year_part = date_str.upper().replace('AÑOS/', '')
            if '-' in year_part:
                first_year = year_part.split('-')[0]
            else:
                first_year = year_part
        else:
            first_year = date_str.upper().replace('AÑO/', '')
        
        try:
            return pd.to_datetime(first_year, format='%Y')
        except:
            pass
    
    # handle date ranges with dashes
    if '-' in date_str and '/' in date_str:
        parts = date_str.split('-')
        if len(parts) == 2:
            first_part = parts[0]
            
            # complex day ranges like "31/01-18/02/1977" 
            if date_str.count('/') >= 3:
                if first_part.count('/') == 1:  # "31/01" from "31/01-18/02/1977"
                    year = date_str.split('/')[-1]
                    complete_date = f"{first_part}/{year}"
                    try:
                        return pd.to_datetime(complete_date, format='%d/%m/%Y')
                    except:
                        pass
                elif first_part.count('/') == 2:  # "08/07/1976" from "08/07-31/12/1976"
                    try:
                        return pd.to_datetime(first_part, format='%d/%m/%Y')
                    except:
                        pass
            
            # day ranges like "02-04/11/1977"
            elif date_str.count('/') == 2 and first_part.isdigit():
                month_year = '/'.join(date_str.split('/')[1:])
                reconstructed = f"{first_part}/{month_year}"
                try:
                    return pd.to_datetime(reconstructed, format='%d/%m/%Y')
                except:
                    pass
            
            # month ranges like "FEB-MAR/1977"
            elif date_str.count('/') == 1:
                first_part = first_part + '/' + date_str.split('/')[-1]
                try:
                    return pd.to_datetime(first_part, format='%b/%Y')
                except:
                    pass
            
            # year ranges like "NOV/1975-ABR/1976"
            else:
                try:
                    return pd.to_datetime(first_part, format='%b/%Y')
                except:
                    pass
    
    # try common formats
    formats = ['%d/%m/%Y', '%m/%Y', '%Y', '%b/%Y', '%B/%Y']
    
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except:
            continue
    
    # fallback: let pandas infer
    try:
        return pd.to_datetime(date_str, dayfirst=True, errors='coerce')
    except:
        return pd.NaT

In [31]:
# split fecha_lugar_detencion_secuestro into date and location 

df_clean[['fecha_detencion', 'lugar_detencion']] = df_raw['fecha_lugar_detencion_secuestro'].str.split(' ', n=1, expand=True)
df_clean['lugar_detencion'] = df_clean['lugar_detencion'].astype('string')

# clean date columns
df_clean['fecha_detencion'] = df_clean['fecha_detencion'].apply(parse_mixed_dates)

print(f"Date range: {df_clean['fecha_detencion'].min()} to {df_clean['fecha_detencion'].max()}")
print(f"Invalid dates: {df_clean['fecha_detencion'].isna().sum()}")

# diagnose failed date parsing
failed_dates = df_clean[df_clean['fecha_detencion'].isna()]['fecha_detencion'].copy()
print("Sample failed dates:")
print(df_raw.loc[failed_dates.index, 'fecha_lugar_detencion_secuestro'].head(10).tolist())

# check what the original split produced for failed cases
failed_originals = df_raw.loc[failed_dates.index, 'fecha_lugar_detencion_secuestro'].str.split(' ', n=1, expand=True)[0]
print("Failed date strings after split:")
print(failed_originals.value_counts().head(10))

# Check if failures are due to missing/null values vs parsing issues
print(f"\nFailure breakdown:")
print(f"- Null/missing original values: {df_raw.loc[failed_dates.index, 'fecha_lugar_detencion_secuestro'].isna().sum()}")
print(f"- Total parsing failures: {len(failed_dates)}")
#df_clean.head()

Date range: 1962-08-23 00:00:00 to 1983-05-14 00:00:00
Invalid dates: 1132
Sample failed dates:
['---', '---', '---', '---', '---', '---', '---', '---', '---', '---']
Failed date strings after split:
0
---              1076
2ª                  4
fines/1975          4
fines/12/1976       2
fines/10/1976       2
fines/07/1977       2
fines/01/1977       2
sin                 2
fines/06/1976       2
fines/1977          2
Name: count, dtype: int64

Failure breakdown:
- Null/missing original values: 8
- Total parsing failures: 1132


In [32]:
# split fecha_lugar_asesinato_o_hallazgo_de_restos into date and location

df_clean[['fecha_asesinato', 'lugar_asesinato']] = df_raw['fecha_lugar_asesinato_o_hallazgo_de_restos'].str.split(' ', n=1, expand=True)
df_clean['lugar_asesinato'] = df_clean['lugar_asesinato'].astype('string')

# clean date columns
df_clean['fecha_asesinato'] = df_clean['fecha_asesinato'].apply(parse_mixed_dates)

print(f"Date range: {df_clean['fecha_asesinato'].min()} to {df_clean['fecha_asesinato'].max()}")
print(f"Invalid dates: {df_clean['fecha_asesinato'].isna().sum()}")

# diagnose failed date parsing
failed_dates = df_clean[df_clean['fecha_asesinato'].isna()]['fecha_asesinato'].copy()
print("Sample failed dates:")
print(df_raw.loc[failed_dates.index, 'fecha_lugar_asesinato_o_hallazgo_de_restos'].head(10).tolist())

# check what the original split produced for failed cases
failed_originals = df_raw.loc[failed_dates.index, 'fecha_lugar_asesinato_o_hallazgo_de_restos'].str.split(' ', n=1, expand=True)[0]
print("Failed date strings after split:")
print(failed_originals.value_counts().head(10))

# Check if failures are due to missing/null values vs parsing issues
print(f"\nFailure breakdown:")
print(f"- Null/missing original values: {df_raw.loc[failed_dates.index, 'fecha_lugar_detencion_secuestro'].isna().sum()}")
print(f"- Total parsing failures: {len(failed_dates)}")


Date range: 1967-01-12 00:00:00 to 1983-05-14 00:00:00
Invalid dates: 6221
Sample failed dates:
['---', '---', '---', '---', '---', '---', '---', '---', nan, '---']
Failed date strings after split:
0
---              6070
sin               139
fines/1975          1
Sin                 1
fines/1978          1
fines/04/1978       1
fines/02/1977       1
fines/08/1978       1
Name: count, dtype: int64

Failure breakdown:
- Null/missing original values: 0
- Total parsing failures: 6221


In [33]:
# function to clean place names and extract province

# list of Argentine provinces / common place tokens (upper-case)
arg_provinces = [
    "CIUDAD AUTONOMA DE BUENOS AIRES", "CAPITAL FEDERAL", "BUENOS AIRES",
    "CATAMARCA", "CHACO", "CHUBUT", "CORDOBA", "CORRIENTES", "ENTRE RIOS",
    "FORMOSA", "JUJUY", "LA PAMPA", "LA RIOJA", "MENDOZA", "MISIONES",
    "NEUQUEN", "RIO NEGRO", "SALTA", "SAN JUAN", "SAN LUIS", "SANTA CRUZ",
    "SANTA FE", "SANTIAGO DEL ESTERO", "TIERRA DEL FUEGO", "TUCUMAN"
]
# sort by length to prefer longer matches first (e.g. "TIERRA DEL FUEGO" before "FUEGO")
arg_provinces = sorted(arg_provinces, key=lambda x: -len(x))

def clean_place_and_extract_province(raw_place):
    if pd.isna(raw_place):
        return pd.NA, pd.NA
    s = str(raw_place).strip()
    # mark common "empty" tokens as missing
    if s == "" or s == "---" or s.lower().startswith("sin datos"):
        return pd.NA, pd.NA

    # normalize spacing and punctuation, uppercase
    s = re.sub(r"[\.,;\/\\\(\)\[\]]", " ", s)         # remove common punctuation
    s = re.sub(r"\s+", " ", s).strip().upper()

    # remove stray leading numeric tokens or ordinal markers like "1" or "1º" or "2ª"
    s = re.sub(r"^(?:\d+º?ª?\s+)+", "", s)
    s = re.sub(r"^\d+\s*", "", s)

    # common normalization: CAPITAL FEDERAL -> CIUDAD AUTONOMA DE BUENOS AIRES
    s = s.replace("CAPITAL FEDERAL", "CIUDAD AUTONOMA DE BUENOS AIRES")
    s = s.replace("CABA", "CIUDAD AUTONOMA DE BUENOS AIRES")

    # Try to find a known province substring
    province_found = pd.NA
    for p in arg_provinces:
        if p in s:
            province_found = p
            break

    # fallback: use last token (word) if nothing matched
    if pd.isna(province_found):
        tokens = s.split()
        province_found = tokens[-1] if tokens else pd.NA

    return s, province_found



In [34]:
# apply to lugar_detencion 
src_ld = df_clean.get('lugar_detencion', df_raw.get('fecha_lugar_detencion_secuestro', pd.Series([pd.NA]*len(df_clean))))
cleaned = df_clean['lugar_detencion'].fillna(df_raw.get('fecha_lugar_detencion_secuestro', pd.Series([pd.NA]*len(df_clean))))
res = cleaned.apply(lambda x: clean_place_and_extract_province(x))
df_clean['lugar_detencion_clean'] = [r[0] for r in res]
df_clean['provincia_detencion'] = [r[1] for r in res]

# simplified province: keep top N, else 'Other'
top_provinces = df_clean['provincia_detencion'].value_counts().head(12).index
df_clean['provincia_detencion_simple'] = df_clean['provincia_detencion'].apply(lambda x: x if x in top_provinces else 'Other')

# quick diagnostics after improved cleaning
print("\n--- Post-cleaning lugar_detencion diagnostics ---")
ld = df_clean['lugar_detencion_clean'].fillna('').str.strip()
print(f"Missing/empty lugar_detencion_clean: {(ld=='').sum()} / {len(df_clean)}")
print("\nTop cleaned lugar_detencion (head 20):")
print(df_clean['lugar_detencion_clean'].value_counts(dropna=True).head(20))

print("\nProvincia_detencion_simple distribution:")
print(df_clean['provincia_detencion_simple'].value_counts(dropna=False).head(20))



--- Post-cleaning lugar_detencion diagnostics ---
Missing/empty lugar_detencion_clean: 1133 / 8632

Top cleaned lugar_detencion (head 20):
lugar_detencion_clean
CIUDAD AUTONOMA DE BUENOS AIRES                      1929
CORDOBA CAPITAL CORDOBA                               532
LA PLATA BUENOS AIRES                                 421
SAN MIGUEL DE TUCUMAN CAPITAL TUCUMAN                 285
ROSARIO SANTA FE                                      241
MAR DEL PLATA GRAL PUEYRREDON BUENOS AIRES            220
BAHIA BLANCA BUENOS AIRES                              69
CAMPANA BUENOS AIRES                                   58
SALTA CAPITAL SALTA                                    54
MENDOZA CAPITAL MENDOZA                                54
ZARATE BUENOS AIRES                                    53
RAMOS MEJIA LA MATANZA BUENOS AIRES                    52
CASEROS TRES DE FEBRERO BUENOS AIRES                   51
MORON BUENOS AIRES                                     49
GRAL SAN MARTIN BUENOS AIR

In [35]:
# apply same logic to lugar_asesinato
cleaned_a = df_clean['lugar_asesinato'].fillna(df_raw.get('fecha_lugar_asesinato_o_hallazgo_de_restos', pd.Series([pd.NA]*len(df_clean))))
res_a = cleaned_a.apply(lambda x: clean_place_and_extract_province(x))
df_clean['lugar_asesinato_clean'] = [r[0] for r in res_a]
df_clean['provincia_asesinato'] = [r[1] for r in res_a]
top_provinces_a = df_clean['provincia_asesinato'].value_counts().head(12).index
df_clean['provincia_asesinato_simple'] = df_clean['provincia_asesinato'].apply(lambda x: x if x in top_provinces_a else 'Other')

print("\n--- Post-cleaning lugar_asesinato diagnostics ---")
la = df_clean['lugar_asesinato_clean'].fillna('').str.strip()
print(f"Missing/empty lugar_asesinato_clean: {(la=='').sum()} / {len(df_clean)}")
print("\nTop cleaned lugar_asesinato (head 20):")
print(df_clean['lugar_asesinato_clean'].value_counts(dropna=True).head(20))

print("\nProvincia_asesinato_simple distribution:")
print(df_clean['provincia_asesinato_simple'].value_counts(dropna=False).head(20))



--- Post-cleaning lugar_asesinato diagnostics ---
Missing/empty lugar_asesinato_clean: 6078 / 8632

Top cleaned lugar_asesinato (head 20):
lugar_asesinato_clean
CIUDAD AUTONOMA DE BUENOS AIRES                         265
CORDOBA CAPITAL CORDOBA                                 185
ROSARIO SANTA FE                                        131
LA PLATA BUENOS AIRES                                    86
DATOS FECHA TUCUMAN                                      77
BERNAL QUILMES BUENOS AIRES                              57
MAR DEL PLATA GRAL PUEYRREDON BUENOS AIRES               48
SANTA FE LA CAPITAL SANTA FE                             46
DATOS FECHA AVELLANEDA BUENOS AIRES                      44
BAHIA BLANCA BUENOS AIRES                                41
SAN MIGUEL DE TUCUMAN CAPITAL TUCUMAN                    41
AVELLANEDA BUENOS AIRES                                  35
FATIMA PILAR BUENOS AIRES                                23
CIUDADELA TRES DE FEBRERO BUENOS AIRES                   2

In [36]:
# drop obsolete raw place/date columns now that cleaned versions exist
cols_to_drop = [
    'lugar_detencion', 'lugar_asesinato',
    'fecha_lugar_detencion_secuestro', 'fecha_lugar_asesinato_o_hallazgo_de_restos',
    'provincia_detencion', 'provincia_asesinato'
]
existing_to_drop = [c for c in cols_to_drop if c in df_clean.columns]
df_clean.drop(columns=existing_to_drop, inplace=True, errors='ignore')

In [37]:
# normalize 'embarazo' column

def normalize_embarazo(value):
    if pd.isna(value):
        return 'No data'
    v = value.upper()
    if 'PROBABLE' in v or 'APARENTE' in v:
        return 'Probably pregnant'
    elif 'SIN VIDA' in v:
        return 'Pregnant - stillbirth'
    elif 'CAUTIVERIO' in v:
        return 'Pregnant - child born in captivity'
    elif 'RECLUSION' in v:
        return 'Pregnant -  detained'
    elif 'EMBARAZADA' in v:
        return 'Pregnant (confirmed)'
    else:
        return 'No data'

df_clean['embarazo'] = df_raw['embarazo'].apply(normalize_embarazo).astype('string')


In [38]:
# normalize 'nacionalidad' column 

def normalize_nacionalidad(value):
    if pd.isna(value):
        return 'Unknown', False
    
    v = value.upper().strip()
    
    # Check if any nationality is naturalized
    if 'NATURALIZAD' in v:
        # Extract the base nationality (remove the naturalization part)
        if 'ARGENTINA' in v:
            return 'ARGENTINA', True
        elif 'URUGUAYA' in v:
            return 'URUGUAYA', True
        else:
            # Fallback: remove the parenthetical part
            base_nationality = v.split('(')[0].strip()
            return base_nationality, True
    else:
        # Not naturalized, return as-is
        return v, False

# Apply the function and split results
nationality_data = df_raw['nacionalidad'].apply(normalize_nacionalidad)
df_clean['nacionalidad'] = [x[0] for x in nationality_data]
df_clean['naturalizada'] = [x[1] for x in nationality_data]

# Convert to appropriate dtypes
df_clean['nacionalidad'] = df_clean['nacionalidad'].astype('string')
df_clean['naturalizada'] = df_clean['naturalizada'].astype('boolean')

# Check results
print("Nationality value counts:")
print(df_clean['nacionalidad'].value_counts())
print(f"\nNaturalized people: {df_clean['naturalizada'].sum()}")
print(f"Naturalized by nationality:")
print(df_clean[df_clean['naturalizada'] == True]['nacionalidad'].value_counts())

Nationality value counts:
nacionalidad
ARGENTINA         8129
URUGUAYA           154
CHILENA             83
PARAGUAYA           76
ITALIANA            44
ESPAÑOLA            43
BOLIVIANA           31
PERUANA             21
BRASILEÑA           10
FRANCESA             8
COLOMBIANA           4
ESTADOUNIDENSE       3
ALEMANA              3
BRITANICA            3
VENEZOLANA           2
POLACA               2
CUBANA               2
MEXICANA             2
CHECOESLOVACA        2
RUMANA               1
PORTUGUESA           1
GUATEMALTECA         1
SUECA                1
JAPONESA             1
AUSTRIACA            1
YUGOESLAVA           1
GRIEGA               1
SUIZA                1
FINLANDESA           1
Name: count, dtype: Int64

Naturalized people: 73
Naturalized by nationality:
nacionalidad
ARGENTINA    72
URUGUAYA      1
Name: count, dtype: Int64


In [39]:
# convert other columns to strings/ints/booleans as needed

string_cols = ['apellido_paterno_nombres', 'apellido_materno', 'apellido_casada', 'documentos', 'provincia_pais_nacimiento']

for col in string_cols:
    df_clean[col] = df_raw[col].astype('string')

def convert_to_int(value):
    if value == 'sin datos':
        return pd.NA
    else:
        return int(value)
        
df_clean['anio_nacimiento'] = df_raw['anio_nacimiento'].apply(convert_to_int).astype('Int64')

def convert_to_bool(value):
    if value == 'Sì':
        return True
    elif value == 'No':
        return False
    else:
        return pd.NA

df_clean['fotografia'] = df_raw['fotografia'].map({'Sì': True, 'No': False}).astype('boolean')

In [40]:
# categorize by historical periods of fecha_detencion
df_clean['periodo'] = pd.cut(
    df_clean['fecha_detencion'],
    bins=[pd.Timestamp('1900-01-01'),
          pd.Timestamp('1976-03-24'),
          pd.Timestamp('1983-12-10'),
          pd.Timestamp('2100-01-01')],
    labels=['Pre-dictadura', 'Dictadura', 'Post-dictadura']
)


In [None]:
# view final cleaned dataframe info and head
print(df_clean.info())
df_clean.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8632 entries, 0 to 8631
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   anio_denuncia               8632 non-null   int64         
 1   tipificacion_ruvte          8632 non-null   string        
 2   id_unico_ruvte              8632 non-null   int64         
 3   apellido_paterno_nombres    8632 non-null   string        
 4   apellido_materno            8330 non-null   string        
 5   apellido_casada             1024 non-null   string        
 6   edad_al_momento_del_hecho   8557 non-null   Int64         
 7   documentos                  8632 non-null   string        
 8   anio_nacimiento             8472 non-null   Int64         
 9   provincia_pais_nacimiento   8632 non-null   string        
 10  nacionalidad                8632 non-null   string        
 11  embarazo                    8632 non-null   string      

Unnamed: 0,anio_denuncia,tipificacion_ruvte,id_unico_ruvte,apellido_paterno_nombres,apellido_materno,apellido_casada,edad_al_momento_del_hecho,documentos,anio_nacimiento,provincia_pais_nacimiento,...,embarazo,fotografia,fecha_detencion,fecha_asesinato,lugar_detencion_clean,provincia_detencion_simple,lugar_asesinato_clean,provincia_asesinato_simple,naturalizada,periodo
0,1984,Forced disappearance,5389,ABACHIAN JUAN CARLOS,BEDROSSIAN,,26,LE 8293245,1950,BUENOS AIRES,...,No data,,1976-12-26,NaT,LA PLATA BUENOS AIRES,BUENOS AIRES,,Other,False,Dictadura
1,1984,Forced disappearance,87,ABAD ANA CATALINA,SCARLATA,PERUCCA,24,LC 10048122,1951,MENDOZA,...,No data,,1976-08-15,NaT,CORDOBA CAPITAL CORDOBA,CORDOBA,,Other,False,Dictadura
2,1984,Forced disappearance,11788,ABAD JULIO RICARDO,CORONEL,,21,DNI 10283544,1954,TUCUMAN,...,No data,False,1976-11-01,NaT,CIUDAD AUTONOMA DE BUENOS AIRES,CIUDAD AUTONOMA DE BUENOS AIRES,,Other,False,Dictadura
3,1984,Murder,9907,ABAD OSCAR GERARDO,DOMATO,,25,DNI 10353245,1951,BUENOS AIRES,...,No data,False,1976-10-08,1976-10-21,LA PLATA BUENOS AIRES,BUENOS AIRES,GRAL MANSILLA BARTOLOME BAVIO MAGDALENA BUENOS...,BUENOS AIRES,False,Dictadura
4,1984,Forced disappearance,89,ABAD ROBERTO RODOLFO TOMAS,ZABALA,,23,DNI 10650064,1953,CAPITAL FEDERAL,...,No data,,1976-08-09,NaT,FLORIDA VICENTE LOPEZ BUENOS AIRES,BUENOS AIRES,,Other,False,Dictadura


## Exploratory Analysis

### Demographics

In [42]:
# histogram for age at disappearance
fig = px.histogram(df_clean, x='edad_al_momento_del_hecho', nbins=30, 
                   title='Histogram of Age at Detention/Disappearance',
                   labels={'edad_al_momento_del_hecho': 'Age at Detention/Disappearance'})
fig.update_layout(bargap=0.1)
fig.show()

In [43]:
# nationality count bar chart: Argentina vs others 

argentina_count = (df_clean['nacionalidad'] == 'ARGENTINA').sum()
other_count = (df_clean['nacionalidad'] != 'ARGENTINA').sum()

# Create data for plotting
plot_data = pd.DataFrame({
    'Category': ['Argentina', 'Others'],
    'Count': [argentina_count, other_count]
})

fig = px.bar(plot_data, x='Category', y='Count',
             title='Nationality Distribution: Argentina vs Others',
             labels={'Count': 'Number of Cases'})
fig.show()

### Temporal Information    

In [44]:
# Age vs year of disappearance density heatmap 
df_plot = df_clean.copy()

# drop rows with missing values used for the histogram
df_plot2 = df_plot.dropna(subset=['fecha_detencion', 'edad_al_momento_del_hecho']).copy()
if df_plot2.empty:
    print("No data after dropping NA.")
else:
    # convert datetimes to int64 (ns since epoch) and ages to float
    x_vals = df_plot2['fecha_detencion'].values.astype('int64')
    y_vals = df_plot2['edad_al_momento_del_hecho'].astype(float).values

    # explicit ranges derived from the data (avoid autodetect problems)
    x_min, x_max = int(np.nanmin(x_vals)), int(np.nanmax(x_vals))
    pad = max(int((x_max - x_min) * 0.01), 1) if x_max > x_min else 1
    x_range = [x_min - pad, x_max + pad]

    y_min, y_max = int(np.nanmin(y_vals)), int(np.nanmax(y_vals))
    y_range = [y_min, y_max if y_max > y_min else y_min + 1]

    # compute 2D histogram
    hist_data = np.histogram2d(x_vals, y_vals, bins=[50, 50], range=[x_range, y_range])
    z = hist_data[0]                      # raw counts
    z_log = np.log1p(z)                   # log(1 + count) for coloring

    # map edges back to bin centers and convert x back to datetimes
    x_edges, y_edges = hist_data[1], hist_data[2]
    x_centers = 0.5 * (x_edges[:-1] + x_edges[1:])
    y_centers = 0.5 * (y_edges[:-1] + y_edges[1:])
    x_centers_dt = pd.to_datetime(x_centers)

    # plot using log-scaled values for color but attach original counts to hover via customdata
    fig = px.imshow(
        z_log.T,                     # transpose so x=fecha, y=edad
        x=x_centers_dt,
        y=y_centers,
        origin='lower',
        color_continuous_scale='Cividis',
        labels={'x': 'Year of Detention/Disappearance', 'y': 'Age at Detention/Disappearance', 'color': 'log(Count+1)'},
        title='Age at Detention/Disappearance — Log Density'
    )

    # attach original integer counts for hover and set hover template
    # use shape (rows, cols, 1) so hover can index scalar via %{customdata[0]}
    fig.data[0].customdata = np.expand_dims(z.T.astype(int), axis=2)
    fig.data[0].hovertemplate = "Count: %{customdata[0]}<br>Year: %{x|%Y}<br>Age: %{y}<extra></extra>"

    # set x-axis limits to the histogram bin centers
    fig.update_xaxes(dtick="M12", tickformat="%Y", range=[x_centers_dt[0], x_centers_dt[-1]])
    fig.update_layout(width=900, height=500, margin=dict(l=80, r=40, t=60, b=40))
    fig.show()


In [45]:
# histogram for year of dissapearance 
# color by periodo

fig = px.histogram(df_clean, x='fecha_detencion', color='periodo', nbins=50, 
                   title='Histogram of Year of Detention/Disappearance by Historical Period')
fig.update_layout(bargap=0.1)
fig.show()

In [46]:
# histogram for time between detention and assassination 

df_clean['time_to_assassination'] = (df_clean['fecha_asesinato'] - df_clean['fecha_detencion']).dt.days
fig = px.histogram(df_clean, x='time_to_assassination', nbins=30, 
                   title='Histogram of Time Between Detention and Assassination',
                   labels={'time_to_assassination': 'Days Between Detention and Assassination'})
fig.update_layout(bargap=0.1)
fig.show()

### Spatial Information

In [47]:
# bar chart for provincia_detencion_simple (plotly)
prov_counts = df_clean['provincia_detencion_simple'].value_counts().reset_index()
prov_counts.columns = ['Province', 'Count']
prov_counts = prov_counts.sort_values('Count', ascending=True)  # ascending so largest at top for horizontal bars

fig = px.bar(
    prov_counts,
    x='Count',
    y='Province',
    orientation='h',
    color='Count',
    color_continuous_scale=px.colors.sequential.Viridis,
    text='Count',
    title='Detenciones por Provincia'
)
fig.update_traces(texttemplate='%{text}', textposition='outside', marker_line_width=0)
fig.update_layout(
    xaxis_title='Number of Disappearances',
    yaxis_title='Province',
    height=600,
    margin=dict(l=120, r=30, t=60, b=40),
    coloraxis_showscale=False,
    xaxis_range=[0, prov_counts['Count'].max() * 1.1]  # extra space for text
)
fig.show()


The plots above, combined with the historical context, show the following trends:

- **Victim age:** Most disappearances affected **young adults (20–30 years old)**, reflecting the junta’s focus on **student activists, union members, and politically active youth**.  
- **Temporal patterns:**  
  - Disappearances of young adults **began before the dictatorship** (e.g., early cases like Felipe Vallese in 1962).  
  - From **1975–1980**, there is a marked spike in both young adult and some older victims, coinciding with the **peak of state repression** under the military government.  
- **Dictatorship period:** The **majority of disappearances occurred during 1976–1983**, consistent with the documented timeline of Argentina’s “Dirty War.”  
- **Duration of captivity:** Most victims were **detained less than 100 days** before assassination or disappearance, showing the regime’s **rapid targeting and elimination**.  
- **Geography:** **Buenos Aires Province (BA) and Capital Federal (CABA)** account for the bulk (~60%) of recorded disappearances, reflecting **centralized operations, urban activism, and concentration of detention centers**.  
- **Nationality:** Almost all victims were **Argentine**, highlighting the domestic focus of repression, though a small number of foreign or naturalized residents were also targeted.
