In [55]:
import pandas as pd

## Overview
The **ETL_victims** notebook complements the **ETL_homicides** notebook; together they produce the dataset used for traffic incident analysis. This notebook loads the victims data (converted from Excel to CSV), removes the empty rows introduced by that conversion, checks for duplicates, corrects data types for consistency, and translates all column names and categorical values into English. The cleaned output is ready to be integrated with the **ETL_homicides** dataset for road safety analysis.




## Data Extraction
- Converted the Excel files provided in the starter pack to CSV and loaded them into the notebook
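The conversion step itself is not shown in this notebook. A minimal sketch of how it could be done with pandas follows. The function name `excel_to_csv`, the example file names, and the sheet index are assumptions for illustration, and `pd.read_excel` needs an Excel engine such as `openpyxl` installed to read `.xlsx` files.

```python
from pathlib import Path

import pandas as pd


def excel_to_csv(excel_path, csv_path, sheet_name=0, encoding="utf-8"):
    """Read one sheet of an Excel workbook and write it out as CSV.

    Hypothetical helper: the original conversion may have been done
    differently (for example, by hand in a spreadsheet application).
    """
    df = pd.read_excel(excel_path, sheet_name=sheet_name)
    # Create the target folder if it does not exist yet
    Path(csv_path).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False, encoding=encoding)
    return df


# Example call (paths are placeholders, not the project's actual files):
# excel_to_csv("victims.xlsx", "victims.csv", sheet_name=0)
```

Writing the CSV with an explicit encoding and reading it back with the same one helps avoid mis-decoded accented characters later on.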

In [56]:
file_path = '../Data/Raw/victimas_homicidios.csv'
df_victims = pd.read_csv(file_path, encoding='ISO-8859-1')

df_victims_clean = df_victims.copy()

In [57]:
df_victims_clean.tail()

Unnamed: 0,ID_hecho,FECHA,AAAA,MM,DD,ROL,VICTIMA,SEXO,EDAD,FECHA_FALLECIMIENTO
715,2021-0095,30/12/2021,2021.0,12.0,30.0,CONDUCTOR,MOTO,MASCULINO,27.0,02/01/2022
716,2021-0096,15/12/2021,2021.0,12.0,15.0,CONDUCTOR,AUTO,MASCULINO,60.0,20/12/2021
717,,,,,,,,,,
718,,,,,,,,,,
719,,,,,,,,,,


In [58]:
df_victims_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 720 entries, 0 to 719
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID_hecho             717 non-null    object 
 1   FECHA                717 non-null    object 
 2   AAAA                 717 non-null    float64
 3   MM                   717 non-null    float64
 4   DD                   717 non-null    float64
 5   ROL                  717 non-null    object 
 6   VICTIMA              717 non-null    object 
 7   SEXO                 717 non-null    object 
 8   EDAD                 717 non-null    object 
 9   FECHA_FALLECIMIENTO  717 non-null    object 
dtypes: float64(3), object(7)
memory usage: 56.4+ KB


In [59]:
df_victims_clean.isna().sum()

ID_hecho               3
FECHA                  3
AAAA                   3
MM                     3
DD                     3
ROL                    3
VICTIMA                3
SEXO                   3
EDAD                   3
FECHA_FALLECIMIENTO    3
dtype: int64

## Data Transformation
- Step 1: Removed the empty rows introduced by the Excel-to-CSV conversion
- Step 2: Checked for duplicates; none were found
- Step 3: Corrected data types
- Step 4: Translated all column names and categorical values into English to simplify analysis

**Note:** The remaining missing values were considered expected; dropping those rows would have discarded crucial data.

**Remove Empty Rows**

In [60]:
df_victims_clean.dropna(how='all', inplace=True)
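The choice of `how='all'` matters here: it removes only rows in which every column is missing, so records with a single unknown field survive. The toy data below is invented for illustration and is not taken from the victims file.

```python
import numpy as np
import pandas as pd

# Toy data: one partially filled row and one completely empty row
toy = pd.DataFrame(
    {
        "ID_hecho": ["2021-0001", "2021-0002", np.nan],
        "EDAD": [27.0, np.nan, np.nan],
    }
)

only_empty_removed = toy.dropna(how="all")  # drops row 2 only
any_missing_removed = toy.dropna()          # default how="any": drops rows 1 and 2

print(len(only_empty_removed))   # 2
print(len(any_missing_removed))  # 1
```

Note that `dropna` keeps the original index labels; a call to `reset_index(drop=True)` would renumber the rows if that were needed.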

**Check for duplicates**

In [61]:
duplicates = df_victims_clean[df_victims_clean.duplicated()]
duplicates.head()

Unnamed: 0,ID_hecho,FECHA,AAAA,MM,DD,ROL,VICTIMA,SEXO,EDAD,FECHA_FALLECIMIENTO
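The check above compares entire rows, which is why the empty result means "no fully identical records". A repeated `ID_hecho` on its own is not necessarily a duplicate, assuming one incident can involve several victims (an assumption about this dataset). The sketch below uses invented rows to show the difference between the variants of `duplicated`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "ID_hecho": ["2021-0001", "2021-0001", "2021-0002", "2021-0002"],
        "ROL": ["CONDUCTOR", "PEATON", "CONDUCTOR", "CONDUCTOR"],
    }
)

# Full-row duplicates: every column must match an earlier row
full_row_dupes = toy[toy.duplicated()]

# keep=False marks every member of a duplicated group, not just the repeats
all_members = toy[toy.duplicated(keep=False)]

# Repeated IDs only; these may be legitimate (several victims, one incident)
repeated_ids = toy[toy.duplicated(subset="ID_hecho", keep=False)]

print(len(full_row_dupes), len(all_members), len(repeated_ids))  # 1 2 4
```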


**Data type conversion**
- Cast each column to its appropriate type

In [62]:
# Parse dates (day/month/year); values that cannot be parsed become NaT
df_victims_clean['FECHA'] = pd.to_datetime(df_victims_clean['FECHA'], format='%d/%m/%Y', errors='coerce')
df_victims_clean['FECHA_FALLECIMIENTO'] = df_victims_clean['FECHA_FALLECIMIENTO'].replace('SD', pd.NaT)
df_victims_clean['FECHA_FALLECIMIENTO'] = pd.to_datetime(df_victims_clean['FECHA_FALLECIMIENTO'], format='%d/%m/%Y', errors='coerce')

# Year, month and day were read as floats because the empty rows introduced NaN
df_victims_clean['AAAA'] = df_victims_clean['AAAA'].astype(int)
df_victims_clean['MM'] = df_victims_clean['MM'].astype(int)
df_victims_clean['DD'] = df_victims_clean['DD'].astype(int)

df_victims_clean['ID_hecho'] = df_victims_clean['ID_hecho'].astype(str)

# Low-cardinality text columns are stored as categories
df_victims_clean['ROL'] = df_victims_clean['ROL'].astype('category')
df_victims_clean['VICTIMA'] = df_victims_clean['VICTIMA'].astype('category')
df_victims_clean['SEXO'] = df_victims_clean['SEXO'].astype('category')

# Non-numeric ages become -1 (a sentinel for "unknown") so the column can be stored as int
df_victims_clean['EDAD'] = pd.to_numeric(df_victims_clean['EDAD'], errors='coerce').fillna(-1).astype(int)
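Encoding unknown ages as `-1` keeps the column a plain integer, but the sentinel has to be filtered out before computing statistics. As a possible alternative (not what this notebook does), pandas' nullable `Int64` dtype keeps the value missing. The toy series below is invented, and it assumes `SD` marks missing data.

```python
import pandas as pd

ages = pd.Series(["27", "60", "SD"])  # toy values; 'SD' is assumed to mean "no data"

# Approach used in this notebook: sentinel value -1
sentinel = pd.to_numeric(ages, errors="coerce").fillna(-1).astype(int)

# Alternative: nullable integer dtype keeps the value missing (<NA>)
nullable = pd.to_numeric(ages, errors="coerce").astype("Int64")

print(sentinel.tolist())  # [27, 60, -1]
print(nullable.mean())    # 43.5, the missing value is ignored
```

With the sentinel approach, the same mean would need an explicit filter such as `sentinel[sentinel >= 0].mean()`.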


In [63]:
df_victims_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 717 entries, 0 to 716
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   ID_hecho             717 non-null    object        
 1   FECHA                717 non-null    datetime64[ns]
 2   AAAA                 717 non-null    int32         
 3   MM                   717 non-null    int32         
 4   DD                   717 non-null    int32         
 5   ROL                  717 non-null    category      
 6   VICTIMA              717 non-null    category      
 7   SEXO                 717 non-null    category      
 8   EDAD                 717 non-null    int32         
 9   FECHA_FALLECIMIENTO  649 non-null    datetime64[ns]
dtypes: category(3), datetime64[ns](2), int32(4), object(1)
memory usage: 36.4+ KB


**English Translation**

In [64]:
# Column name translations from Spanish to English
column_name_translations = {
    'ID_hecho': 'Incident_ID',
    'FECHA': 'Date',
    'AAAA': 'Year',
    'MM': 'Month',
    'DD': 'Day',
    'ROL': 'Role',
    'VICTIMA': 'Victim',
    'SEXO': 'Gender',
    'EDAD': 'Age',
    'FECHA_FALLECIMIENTO': 'Date_of_Death'
}

df_victims_clean.rename(columns=column_name_translations, inplace=True)




In [65]:
# Find unique values to translate

unique_roles = df_victims_clean['Role'].unique()
unique_victims = df_victims_clean['Victim'].unique()
unique_genders = df_victims_clean['Gender'].unique()

print("Unique Roles:", unique_roles)
print("Unique Victims:", unique_victims)
print("Unique Genders:", unique_genders)


Unique Roles: ['CONDUCTOR', 'PASAJERO_ACOMPA¥ANTE', 'PEATON', 'SD', 'CICLISTA']
Categories (5, object): ['CICLISTA', 'CONDUCTOR', 'PASAJERO_ACOMPA¥ANTE', 'PEATON', 'SD']
Unique Victims: ['MOTO', 'AUTO', 'PEATON', 'SD', 'CARGAS', 'BICICLETA', 'PASAJEROS', 'MOVIL']
Categories (8, object): ['AUTO', 'BICICLETA', 'CARGAS', 'MOTO', 'MOVIL', 'PASAJEROS', 'PEATON', 'SD']
Unique Genders: ['MASCULINO', 'FEMENINO', 'SD']
Categories (3, object): ['FEMENINO', 'MASCULINO', 'SD']
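The `¥` in `PASAJERO_ACOMPA¥ANTE` looks like an encoding mismatch rather than real data: the byte `0xA5` is `¥` in ISO-8859-1 but `Ñ` in the DOS code pages CP437 and CP850. Whether the raw file actually uses one of those code pages is an assumption; the sketch below only demonstrates the byte-level difference.

```python
# How the value might be stored on disk if the file were written in CP850
raw = "PASAJERO_ACOMPAÑANTE".encode("cp850")

as_latin1 = raw.decode("iso-8859-1")  # the encoding passed to read_csv in this notebook
as_cp850 = raw.decode("cp850")

print(as_latin1)  # PASAJERO_ACOMPA¥ANTE
print(as_cp850)   # PASAJERO_ACOMPAÑANTE
```

If that turns out to be the cause, passing `encoding='cp850'` to `pd.read_csv` would load the value with `Ñ`, and any dictionary key used to translate it would need to match the spelling that was actually loaded.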


In [66]:
# Translate the contents of the 'Role', 'Victim' and 'Gender' columns

role_translation = {
    'CONDUCTOR': 'Driver',
    'PASAJERO_ACOMPA¥ANTE': 'Passenger_Companion',  # '¥' is likely a mis-decoded 'Ñ'; the key must match the loaded data
    'PEATON': 'Pedestrian',
    'SD': 'No Data',
    'CICLISTA': 'Cyclist'
}

victim_translation = {
    'MOTO': 'Motorcycle',
    'AUTO': 'Car',
    'PEATON': 'Pedestrian',
    'SD': 'No Data',
    'CARGAS': 'Cargo Vehicle',
    'BICICLETA': 'Bicycle',
    'PASAJEROS': 'Passengers',
    'MOVIL': 'Emergency Vehicle'
}

gender_translation = {
    'MASCULINO': 'Male',
    'FEMENINO': 'Female',
    'SD': 'No Data'
}

def translate_term(term, translation_dict):
    # Return the English term, or the original value if no translation exists
    return translation_dict.get(term, term)

df_victims_clean['Role'] = df_victims_clean['Role'].apply(translate_term, args=(role_translation,))
df_victims_clean['Victim'] = df_victims_clean['Victim'].apply(translate_term, args=(victim_translation,))
df_victims_clean['Gender'] = df_victims_clean['Gender'].apply(translate_term, args=(gender_translation,))
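Because `translate_term` falls back to the original value, a category missing from a dictionary would pass through silently in Spanish. A small check like the one below can surface such gaps. It is a sketch on invented data (`OTRO` is a made-up value), and it uses `Series.cat.rename_categories` as one possible alternative to `apply` for categorical columns.

```python
import pandas as pd

gender_translation = {"MASCULINO": "Male", "FEMENINO": "Female", "SD": "No Data"}

toy = pd.Series(["MASCULINO", "FEMENINO", "SD", "OTRO"], dtype="category")

# Values with no dictionary entry would pass through untranslated
untranslated = sorted(set(toy.dropna().unique()) - set(gender_translation))
print(untranslated)  # ['OTRO']

# rename_categories relabels the categories and keeps the category dtype;
# categories absent from the mapping are left unchanged
translated = toy.cat.rename_categories(gender_translation)
print(translated.tolist())  # ['Male', 'Female', 'No Data', 'OTRO']
```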

In [67]:
df_victims_clean.tail()

Unnamed: 0,Incident_ID,Date,Year,Month,Day,Role,Victim,Gender,Age,Date_of_Death
712,2021-0092,2021-12-12,2021,12,12,Pedestrian,Pedestrian,Female,50,2021-12-12
713,2021-0093,2021-12-13,2021,12,13,Passenger_Companion,Motorcycle,Female,18,2021-12-18
714,2021-0094,2021-12-20,2021,12,20,Passenger_Companion,Motorcycle,Female,43,2021-12-20
715,2021-0095,2021-12-30,2021,12,30,Driver,Motorcycle,Male,27,2022-01-02
716,2021-0096,2021-12-15,2021,12,15,Driver,Car,Male,60,2021-12-20


## Loading/Saving the Data
1. Saved DataFrame: `df_victims_clean`
2. File format: `.csv`
3. File path: `'../data/processed/'`

In [68]:
csv_file_path = '../data/processed/df_victims_clean.csv'

df_victims_clean.to_csv(csv_file_path, index=False)
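CSV stores only text, so the `datetime64` and `category` dtypes set earlier are not preserved in the saved file. Whoever reloads it (for example, when combining it with the homicides data) may want to restore them explicitly. The sketch below round-trips an invented one-row frame through an in-memory buffer instead of the real file.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {
        "Incident_ID": ["2021-0095"],
        "Date": pd.to_datetime(["2021-12-30"]),
        "Role": pd.Series(["Driver"], dtype="category"),
        "Age": [27],
    }
)

# Write to an in-memory buffer, standing in for the CSV file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Dates and categories come back as plain strings unless requested explicitly
reloaded = pd.read_csv(
    buffer,
    parse_dates=["Date"],
    dtype={"Incident_ID": str, "Role": "category"},
)
print(reloaded.dtypes)
```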