# Data Wrangling and Cleaning: SIMAT

## Introduction

**Context:** To identify the socioeconomic factors that influence grade repetition among students in Bogotá, the SED provided the team with the SIMAT databases for 2017, 2018, 2019, 2020, and 2021. Each database contains information about the students enrolled in the country's public schools, including basic data (home address, identification, age, school grade, etc.) and some socioeconomic data (Estrato, type of disability, ethnicity).

**Problem:** The team's tasks are the following:
1. Identify the most relevant variables in each dataset.
2. Determine which of the variables identified in the previous step appear in all the datasets.
3. Create a dataset for each year containing only the most relevant variables, and merge them into one.
4. Clean each variable in the merged dataset and generate a new `.csv` file.

In [33]:
## Load relevant packages

# Base libraries
import time
import datetime
import os

# Scientific libraries
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Visual libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Helper libraries
import xlrd
import pprint
import base64
import math

# Visual setup

# Pandas options
pd.options.display.max_columns = None

After analyzing the SIMAT data dictionary for Annex 6A (2021), the team selected the variables most relevant to the project. To record the selection, a column named `COLUMNAS A TENER EN CUENTA` was added to the dictionary, with the values `SI`, `NO`, `SI SIN VALIDAR SED`, and `SI PREGUNTAR SED`.

In [2]:
dic_anexo64_2021 =  pd.read_excel('diccionarios/Diccionario_Anexo6A_31_03_2021_Modificado.xlsx', header=4, index_col=0)
dic_anexo64_2021

Unnamed: 0_level_0,NOMBRE VARIABLE,COLUMNAS A TENER EN CUENTA,ETIQUETA,TIPO DE VARIABLE,LONGITUD,MEDIDA,VALIDACIONES BÁSICAS O VALORES DE LA VARIABLE,VALIDACIONES,DESCRIPCION,OBSERVACIONES
No,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,ANO_INF,SI,Año de la información,Numérico,4,Nominal,Año al que corresponde la información,,AÑO DE LA INFORMACIÓN,
2,CODIGO_MUN,NO,Municipio o Distrito,Cadena,5,Nominal,Códigos DANE Municipios,"Debe estar acorde con la divipola, se toma el ...",CODIGO DANE DEL MUNICIPIO,"Corresponde al código 001, que en DIVIPOLA cor..."
3,CODIGO_DANE,SI,Código DANE del establecimiento,Numérico,12,Nominal,Código DANE de la institución educativa (12 po...,"Debe estar registrado en el DUE \nen estado ""N...",CODIGO DANE 12 DEL ESTABLECIMIENTO SEDE EDUCATIVA,
4,CODIGO_DANE_SEDE,SI,Código DANE de la Sede,Numérico,12,Nominal,Código DANE que poseía la sede en el año 2001 ...,,CODIGO DANE 12 DE LA SEDE EDUCATIVA,"Debe estar registrado en el DUE \nen estado ""A..."
5,CONS_SEDE,NO,Consecutivo de la sede,Numérico,14,Nominal,Código generado por le DUE para identificar lo...,,CONSECUTIVO DE LA SEDE,
...,...,...,...,...,...,...,...,...,...,...
93,DIR_SECTOR_CENSAL,,Sector censal,Cadena,10,Nominal,,Información del directorio con fecha de corte ...,Sector censal en el directorio,
94,DIR_X,SI,Longitud,Numérico,20,Escala,,Información del directorio con fecha de corte ...,Coordenada X del establecimiento en el directorio,
95,DIR_Y,SI,Latitud,Numérico,20,Escala,,Información del directorio con fecha de corte ...,Coordenada Y del establecimiento en el directorio,
96,VICTIMAS_INCLUIDO,SI,Población victima del conflicto armado (result...,Numérico,1,Nominal,1 Incluido\n2 No Incluido,Resultado cruce con base de RUV,Indica los estudiantes que fueron identificado...,


Once the variables are identified, they are stored in a list called `main_variables`.

In [3]:
dic_anexo64_2021['COLUMNAS A TENER EN CUENTA'] = dic_anexo64_2021['COLUMNAS A TENER EN CUENTA'].replace(np.nan, 'NO')
main_variables = dic_anexo64_2021[dic_anexo64_2021['COLUMNAS A TENER EN CUENTA'].str.contains('SI', case=False)]['NOMBRE VARIABLE']
main_variables = main_variables.str.strip().tolist()
main_variables

['ANO_INF',
 'CODIGO_DANE',
 'CODIGO_DANE_SEDE',
 'TIPO_DOCUMENTO',
 'NRO_DOCUMENTO',
 'APELLIDO1',
 'APELLIDO2',
 'NOMBRE1',
 'NOMBRE2',
 'DIRECCION_RESIDENCIA',
 'RES_DEPTO',
 'RES_MUN',
 'ESTRATO',
 'SISBEN',
 'FECHA_NACIMIENTO',
 'GENERO',
 'POB_VICT_CONF',
 'DPTO_EXP',
 'MUN_EXP',
 'PROVIENE_SECTOR_PRIV',
 'PROVIENE_OTR_MUN',
 'TIPO_DISCAPACIDAD',
 'CAP_EXC',
 'CODIGO_ETNIA',
 'CODIGO_RESGUARDO',
 'INS_FAMILIAR',
 'CODIGO_JORNADA',
 'CARACTER',
 'CODIGO_ESPECIALIDAD',
 'CODIGO_GRADO',
 'CODIGO_METODOLOGIA',
 'REPITENTE',
 'SIT_ACAD_ANIO_ANT',
 'CON_ALUM_ANIO_ANT',
 'ZONA_RESI_ALU',
 'MADR_CABE_FAMI',
 'HIJO_MADR_CABE_FAMI',
 'BENE_VETE_FUER_PUBL',
 'HERO_NACI',
 'VALORACION1',
 'VALORACION2',
 'CODIGO_INTERNADO',
 'EDAD',
 'NIVEL',
 'NIVEL_B',
 'ETNIA_RECODIFICADA',
 'RANGO_EDAD',
 'CLASE_COLEGIO',
 'PER_ID',
 'PAIS_ORIGEN',
 'NOMBRE_PAIS_ORIGEN',
 'TIPO_NACIONALIDAD',
 'DIR_NUM_LOCALIDAD',
 'DIR_LOCALIDAD',
 'DIR_X',
 'DIR_Y',
 'VICTIMAS_INCLUIDO',
 'VICTIMAS_HECHO']
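
The selection above relies on a substring match, so any flag containing `SI` is kept, including `SI SIN VALIDAR SED` and `SI PREGUNTAR SED`. The sketch below illustrates the difference between that substring match and an exact match. The dictionary rows are made up for illustration; only the two column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up dictionary rows (assumption); the column names mirror the real dictionary.
toy_dictionary = pd.DataFrame({
    "NOMBRE VARIABLE": ["ANO_INF ", "CODIGO_MUN", "ESTRATO", "DIR_SECTOR_CENSAL", "DIR_X"],
    "COLUMNAS A TENER EN CUENTA": ["SI", "NO", "SI SIN VALIDAR SED", np.nan, "SI PREGUNTAR SED"],
})

# Missing flags are treated as "NO", as in the notebook.
flags = toy_dictionary["COLUMNAS A TENER EN CUENTA"].fillna("NO")

# Substring match: keeps every flag that contains "SI", validated or not.
any_si = toy_dictionary.loc[flags.str.contains("SI", case=False), "NOMBRE VARIABLE"]
any_si = any_si.str.strip().tolist()

# Exact match: keeps only the variables flagged plainly as "SI".
only_si = toy_dictionary.loc[flags.str.strip().eq("SI"), "NOMBRE VARIABLE"]
only_si = only_si.str.strip().tolist()

print(any_si)
print(only_si)
```

Which of the two behaviours is wanted depends on whether the variables still pending validation by the SED should be part of the analysis.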

We load the datasets for each year.

In [4]:
df_anexo6A_2017 = pd.read_csv('data/1-bronce/SIMAT_EFICIENCIA_2017.csv', delimiter=';')
df_anexo6A_2018 = pd.read_csv('data/1-bronce/SIMAT_EFICIENCIA_2018.csv', delimiter=';')
df_anexo6A_2019 = pd.read_csv('data/1-bronce/Anexo6A_Eficiencia_interna_MEN_2019.csv', delimiter=';')
df_anexo6A_2020 = pd.read_csv('data/1-bronce/Anexo6A_Eficiencia_interna_MEN_2020.csv', delimiter=';')
df_anexo6A_2021 = pd.read_csv('data/1-bronce/Anexo6A_depurado_31032021.csv', delimiter=';')


One of the main problems with the SIMAT databases is that the columns differ in name and number from year to year. We therefore first identify the columns that have the same name in every year and store them in `columns_in_all_dfs`.

In [5]:
dfs = [ df_anexo6A_2017, df_anexo6A_2018, df_anexo6A_2019, df_anexo6A_2020, df_anexo6A_2021]
dfs_columns = np.concatenate([df.columns.str.strip() for df in dfs])
dfs_columns = pd.Series(dfs_columns).value_counts()
columns_in_all_dfs = dfs_columns[dfs_columns == len(dfs)]
columns_in_all_dfs = columns_in_all_dfs.index
columns_in_all_dfs

Index(['RES_MUN', 'REPITENTE', 'DIRECCION_RESIDENCIA', 'APELLIDO2', 'TEL',
       'APELLIDO1', 'NOMBRE2', 'POB_VICT_CONF', 'CARACTER', 'SISBEN',
       'NAC_DEPTO', 'CONS_SEDE', 'TIPO_DISCAPACIDAD', 'EDAD', 'DPTO_EXP',
       'NRO_DOCUMENTO', 'GRUPO', 'EXP_DEPTO', 'FECHA_NACIMIENTO', 'EXP_MUN',
       'NAC_MUN', 'MUN_EXP', 'CAP_EXC', 'ESTRATO', 'TIPO_DOCUMENTO',
       'RES_DEPTO', 'NOMBRE1', 'NUEVO', 'PROVIENE_SECTOR_PRIV', 'NIVEL'],
      dtype='object')
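
Counting how often each name occurs works as long as no dataframe repeats a column name. As a cross-check, the same result can be sketched with a set intersection, which does not depend on that assumption. The frames and the helper name `common_columns` below are hypothetical.

```python
from functools import reduce

import pandas as pd

# Hypothetical empty frames standing in for the yearly datasets.
toy_frames = [
    pd.DataFrame(columns=["ANNO_INF", "ESTRATO ", "EDAD"]),
    pd.DataFrame(columns=["ANNO_INF", "ESTRATO", "EDAD", "GRUPO"]),
    pd.DataFrame(columns=["ANO_INF", "ESTRATO", "EDAD"]),
]


def common_columns(frames):
    """Return the sorted column names present in every frame, ignoring surrounding whitespace."""
    column_sets = [set(frame.columns.str.strip()) for frame in frames]
    return sorted(reduce(set.intersection, column_sets))


shared = common_columns(toy_frames)
print(shared)
```

Here `ANNO_INF` is excluded because the third frame spells it `ANO_INF`, which is exactly the kind of mismatch the next steps deal with.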

Once the variables shared by all years are identified, we remove them from `main_variables`. The remaining variables still have to be located: we must check whether they exist in the other years' databases under a different name. They are stored in a list called `missing_main_variables`.

In [6]:
missing_main_variables = list(filter(lambda x: x not in columns_in_all_dfs, main_variables))
missing_main_variables

['ANO_INF',
 'CODIGO_DANE',
 'CODIGO_DANE_SEDE',
 'GENERO',
 'PROVIENE_OTR_MUN',
 'CODIGO_ETNIA',
 'CODIGO_RESGUARDO',
 'INS_FAMILIAR',
 'CODIGO_JORNADA',
 'CODIGO_ESPECIALIDAD',
 'CODIGO_GRADO',
 'CODIGO_METODOLOGIA',
 'SIT_ACAD_ANIO_ANT',
 'CON_ALUM_ANIO_ANT',
 'ZONA_RESI_ALU',
 'MADR_CABE_FAMI',
 'HIJO_MADR_CABE_FAMI',
 'BENE_VETE_FUER_PUBL',
 'HERO_NACI',
 'VALORACION1',
 'VALORACION2',
 'CODIGO_INTERNADO',
 'NIVEL_B',
 'ETNIA_RECODIFICADA',
 'RANGO_EDAD',
 'CLASE_COLEGIO',
 'PER_ID',
 'PAIS_ORIGEN',
 'NOMBRE_PAIS_ORIGEN',
 'TIPO_NACIONALIDAD',
 'DIR_NUM_LOCALIDAD',
 'DIR_LOCALIDAD',
 'DIR_X',
 'DIR_Y',
 'VICTIMAS_INCLUIDO',
 'VICTIMAS_HECHO']

Because there are many columns, we need a function that searches the column names of the other dataframes for matches with the variables we defined as important. We call this function `word_in_dfs_columns`.

In [7]:
def word_in_dfs_columns(dfs_dic, word, to_ignore=None):
    """
    Return the column names in the dataframes in `dfs_dic` that match `word`.

    Inputs:
    dfs_dic: A dictionary with the dataframes in which to search for `word`
    word: The variable name to search for; it is split on '_' and each part is searched separately
    to_ignore: A list of parts to ignore, to avoid matches that generate noise

    Outputs:
    results: A dictionary with the list of column names that match `word` for each dataframe in `dfs_dic`
    """
    # Avoid a mutable default argument
    to_ignore = to_ignore or []
    # Split the variable name into parts and drop the ones to ignore
    words = [w for w in word.split('_') if w not in to_ignore]
    results = {}
    for key, df in dfs_dic.items():
        matches = set()
        for w in words:
            # Case-insensitive match of each part against the column names
            matches.update(df.columns[df.columns.str.contains(w, case=False)])
        results[key] = list(matches)
    return results

In [8]:
df_dict = {'df_anexo6A_2017':df_anexo6A_2017,'df_anexo6A_2018':df_anexo6A_2018,'df_anexo6A_2019':df_anexo6A_2019, 'df_anexo6A_2020':df_anexo6A_2020}
variables = {}
for var in missing_main_variables:
    variables[var] = word_in_dfs_columns(df_dict, var, ['CODIGO', 'DIR', 'TIPO', 'B', 'ID'])
pprint.pprint(variables)

{'ANO_INF': {'df_anexo6A_2017': ['SIT_ACAD_ANO_ANT',
                                 'CON_ALUM_ANO_ANT',
                                 'ANNO_INF'],
             'df_anexo6A_2018': ['SIT_ACAD_ANO_ANT',
                                 'CON_ALUM_ANO_ANT',
                                 'ANNO_INF'],
             'df_anexo6A_2019': ['SIT_ACAD_ANO_ANT',
                                 'CON_ALUM_ANO_ANT',
                                 'ANNO_INF'],
             'df_anexo6A_2020': ['SIT_ACAD_ANO_ANT',
                                 'CON_ALUM_ANO_ANT',
                                 'ANNO_INF']},
 'BENE_VETE_FUER_PUBL': {'df_anexo6A_2017': [],
                         'df_anexo6A_2018': [],
                         'df_anexo6A_2019': [],
                         'df_anexo6A_2020': []},
 'CLASE_COLEGIO': {'df_anexo6A_2017': ['CLASE_COLEGIO_DIR', 'CLASE_SECTOR'],
                   'df_anexo6A_2018': ['CLASE_COLEGIO_DIR'],
                   'df_anexo6A_2019': ['CLASE',
            

In [9]:
word_in_dfs_columns(df_dict, 'LOC', ['CODIGO', 'DIR', 'TIPO', 'B', 'ID'])

{'df_anexo6A_2017': ['NUMERO_LOCALIDAD', 'NOMBRE_LOCALIDAD'],
 'df_anexo6A_2018': ['NUMERO_LOCALIDAD', 'NOMBRE_LOCALIDAD'],
 'df_anexo6A_2019': ['NOMBRE_LOCALIDAD', '@#_LOC'],
 'df_anexo6A_2020': ['LOC_DIR', 'NOMBRE_LOCALIDAD_DIR']}
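
To make the behaviour of this token search easier to follow, here is a compact sketch of the same idea on made-up frames. The helper name `find_columns_containing` and the toy column sets are assumptions for illustration, and the sketch uses a plain (non-regex) substring match.

```python
import pandas as pd


def find_columns_containing(frames_by_name, variable, ignore=()):
    """For each frame, list the columns containing any '_'-separated token of `variable`."""
    tokens = [t for t in variable.split("_") if t not in ignore]
    found = {}
    for name, frame in frames_by_name.items():
        matched = set()
        for token in tokens:
            # Case-insensitive, literal substring match on the column names
            mask = frame.columns.str.contains(token, case=False, regex=False)
            matched.update(frame.columns[mask])
        found[name] = sorted(matched)
    return found


# Hypothetical frames: two years that name the locality column differently.
toy = {
    "2017": pd.DataFrame(columns=["ANNO_INF", "GRADO", "NUMERO_LOCALIDAD"]),
    "2020": pd.DataFrame(columns=["ANNO_INF", "GRADO", "LOC_DIR"]),
}

grade_hits = find_columns_containing(toy, "CODIGO_GRADO", ignore=("CODIGO",))
locality_hits = find_columns_containing(toy, "DIR_NUM_LOCALIDAD", ignore=("DIR",))
print(grade_hits)
print(locality_hits)
```

The second call finds nothing for the 2020 frame, because neither `NUM` nor `LOCALIDAD` is a substring of `LOC_DIR`. Cases like this are why a manual check is still needed after the automatic search.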

The output above shows that not every variable we are looking for has a match. After manually checking the variables that `word_in_dfs_columns` did not find, we create the dictionary `replace_columns_dict`, which maps the column names used in the other datasets to the variable names we selected, so that all datasets share the same names.

In [10]:

replace_columns_dict = {
    'ANNO_INF': 'ANO_INF',
    'DANE12_ESTABLECIMIENTO_EDUCATIVO': 'CODIGO_DANE',
    'DANE12_SEDE_EDUCATIVA': 'CODIGO_DANE_SEDE',
    'ESPECIALIDAD': 'CODIGO_ESPECIALIDAD',
    'ETNIA': 'CODIGO_ETNIA',
    'GRADO': 'CODIGO_GRADO',
    'INTERNADO': 'CODIGO_INTERNADO',
    'TIPO_JORNADA': 'CODIGO_JORNADA',
    'METODOLOGIA': 'CODIGO_METODOLOGIA',
    'CON_ALUM_ANO_ANT': 'CON_ALUM_ANIO_ANT',
    'GENERO.x': 'GENERO',
    'PROVIENE_OTRO_MUN': 'PROVIENE_OTR_MUN',
    'SIT_ACAD_ANO_ANT': 'SIT_ACAD_ANIO_ANT',
    'VAL_DES_PERIODO1': 'VALORACION1',
    'VAL_DES_PERIODO2': 'VALORACION2',
    'NUMERO_LOCALIDAD': 'DIR_NUM_LOCALIDAD',
    '@_LOC': 'DIR_NUM_LOCALIDAD',
    '@#_LOC': 'DIR_NUM_LOCALIDAD',
    'LOC_DIR': 'DIR_NUM_LOCALIDAD',
    'LOCALIDAD': 'DIR_LOCALIDAD',
    'NOMBRE_LOCALIDAD': 'DIR_LOCALIDAD',
    'NOMBRE_LOCALIDAD_DIR': 'DIR_LOCALIDAD',
    'ZON_ALU': 'ZONA_RESI_ALU',
    'NIVEL':'NIVEL'
}

Before merging the datasets, we check the data type of each variable in each year.

The following cell shows, for each pair defined in `replace_columns_dict`, a small sample of the values found in each dataframe and their types:
```
<replace_columns_dict_key> <replace_columns_dict_value>--> [values examples of dataframe 1] [values examples of dataframe 2] [values examples of dataframe 3]
types: [type of dataframe 1, type of dataframe 2, type of dataframe 3] 
```

In [11]:
text =""
for key, val in replace_columns_dict.items():
    text += f"{key}  {val}--> "
    var_types = []
    for df in dfs:
        
        if key in df.columns:
            text += f"{df[key].unique()[:5]}"
            var_types.append(df[key].dtype)
        elif val in df.columns:
            text += f"{df[val].unique()[:5]}"
            var_types.append(df[val].dtype)
        else:
            text += "(not found)"
            var_types.append("(not found)")
    text += f"\ntypes: {var_types} \n\n"   
print(text)

ANNO_INF  ANO_INF--> [2017 '2017' ' 146' ' 136' ' 133'][2018 '2018' '  83' '   0' '   1'][2019][2020][2021]
types: [dtype('O'), dtype('O'), dtype('int64'), dtype('int64'), dtype('int64')] 

DANE12_ESTABLECIMIENTO_EDUCATIVO  CODIGO_DANE--> [111001000078 111001000124 111001000132 111001000272 111001000353][111001000078 111001000124 111001000132 111001014176 111001000272][211850001473 311001090793 111001035572 111001012475 111001029955][111001000078 111001000124 111001000132 111001000272 111001000612][111001000078 111001000124 111001000132 111001000272 111001000612]
types: [dtype('int64'), dtype('O'), dtype('int64'), dtype('int64'), dtype('int64')] 

DANE12_SEDE_EDUCATIVA  CODIGO_DANE_SEDE--> [1.11001000e+11 1.11001015e+11 1.11001015e+11 1.11001000e+11
 1.11001014e+11][111001000078 111001000124 111001000132 111001000213 111001000272][211850000817 211001030630 311001090793 111001035572 111001013625][111001000078 111001014834 111001014842 111001000124 111001014303][111001000078 111001014834

The code above gives a preliminary look at the data we are going to use; the findings are explained in detail later in the notebook.

Next, we identify the variables that cannot be found in all dataframes, even after the manual check, and store them in `unfindables_variables`.

In [12]:
unfindables_variables = list(filter(lambda x: x not in replace_columns_dict.values(), missing_main_variables))
unfindables_variables

['CODIGO_RESGUARDO',
 'INS_FAMILIAR',
 'MADR_CABE_FAMI',
 'HIJO_MADR_CABE_FAMI',
 'BENE_VETE_FUER_PUBL',
 'HERO_NACI',
 'NIVEL_B',
 'ETNIA_RECODIFICADA',
 'RANGO_EDAD',
 'CLASE_COLEGIO',
 'PER_ID',
 'PAIS_ORIGEN',
 'NOMBRE_PAIS_ORIGEN',
 'TIPO_NACIONALIDAD',
 'DIR_X',
 'DIR_Y',
 'VICTIMAS_INCLUIDO',
 'VICTIMAS_HECHO']

Once we have `unfindables_variables`, the variables that were not found in all the datasets, we remove them from the main variables and store the result in a new list called `validate_main_variables`.

In [13]:
validate_main_variables = list(filter(lambda x: x not in unfindables_variables, main_variables))
validate_main_variables

['ANO_INF',
 'CODIGO_DANE',
 'CODIGO_DANE_SEDE',
 'TIPO_DOCUMENTO',
 'NRO_DOCUMENTO',
 'APELLIDO1',
 'APELLIDO2',
 'NOMBRE1',
 'NOMBRE2',
 'DIRECCION_RESIDENCIA',
 'RES_DEPTO',
 'RES_MUN',
 'ESTRATO',
 'SISBEN',
 'FECHA_NACIMIENTO',
 'GENERO',
 'POB_VICT_CONF',
 'DPTO_EXP',
 'MUN_EXP',
 'PROVIENE_SECTOR_PRIV',
 'PROVIENE_OTR_MUN',
 'TIPO_DISCAPACIDAD',
 'CAP_EXC',
 'CODIGO_ETNIA',
 'CODIGO_JORNADA',
 'CARACTER',
 'CODIGO_ESPECIALIDAD',
 'CODIGO_GRADO',
 'CODIGO_METODOLOGIA',
 'REPITENTE',
 'SIT_ACAD_ANIO_ANT',
 'CON_ALUM_ANIO_ANT',
 'ZONA_RESI_ALU',
 'VALORACION1',
 'VALORACION2',
 'CODIGO_INTERNADO',
 'EDAD',
 'NIVEL',
 'DIR_NUM_LOCALIDAD',
 'DIR_LOCALIDAD']

In [14]:
len(validate_main_variables)

40

Once we know which columns exist in all the datasets under different names, we rename them with the help of `replace_columns_dict`.

In [15]:
for df in dfs:
    df.rename(columns=replace_columns_dict, inplace=True)
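
Because several source names map to the same target (for example, more than one locality column maps to `DIR_NUM_LOCALIDAD`), it may be worth checking each renamed frame for missing or duplicated columns. The following is a minimal sketch on made-up frames; `check_renamed`, the shortened mapping, and the toy column sets are assumptions, not part of the notebook.

```python
import pandas as pd

# Shortened, illustrative version of the renaming dictionary.
rename_map = {"ANNO_INF": "ANO_INF", "GRADO": "CODIGO_GRADO", "LOC_DIR": "DIR_NUM_LOCALIDAD"}
required = ["ANO_INF", "CODIGO_GRADO", "DIR_NUM_LOCALIDAD"]

# Hypothetical frames; "bad" contains both a source name and its target name.
toy_frames = {
    "2017": pd.DataFrame(columns=["ANNO_INF", "GRADO", "DIR_NUM_LOCALIDAD"]),
    "2020": pd.DataFrame(columns=["ANNO_INF", "GRADO", "LOC_DIR"]),
    "bad": pd.DataFrame(columns=["ANNO_INF", "GRADO", "LOC_DIR", "DIR_NUM_LOCALIDAD"]),
}


def check_renamed(frame, mapping, required_columns):
    """Rename a frame and report required columns that are missing or duplicated."""
    # Keys absent from the frame are silently ignored by rename()
    renamed = frame.rename(columns=mapping)
    missing = [c for c in required_columns if c not in renamed.columns]
    duplicated = sorted(set(renamed.columns[renamed.columns.duplicated()]))
    return missing, duplicated


report = {name: check_renamed(frame, rename_map, required) for name, frame in toy_frames.items()}
print(report)
```

A duplicated column name after renaming would make a later selection such as `df['DIR_NUM_LOCALIDAD']` return two columns instead of one, so catching it early avoids confusing errors.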

After replacing the names of the columns, we take only the ones we need.

In [16]:
df_anexo6A_2017 = df_anexo6A_2017[validate_main_variables]
df_anexo6A_2018 = df_anexo6A_2018[validate_main_variables]
df_anexo6A_2019 = df_anexo6A_2019[validate_main_variables]
df_anexo6A_2020 = df_anexo6A_2020[validate_main_variables]
df_anexo6A_2021 = df_anexo6A_2021[validate_main_variables]
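
Selecting columns with `df[cols]` and later assigning into the result can make pandas emit a `SettingWithCopyWarning`, depending on the pandas version. One common way to avoid the ambiguity is to take an explicit copy when subsetting. The sketch below uses a made-up frame to show the idea; it is not a change to the cells in this notebook.

```python
import pandas as pd

# Made-up raw data with a mixed-type year column and one column we do not need.
raw = pd.DataFrame({
    "ANO_INF": [2017, "2017", " 146"],
    "EDAD": [17, 4, 8],
    "TEL": ["x", "y", "z"],
})
keep = ["ANO_INF", "EDAD"]

# An explicit copy makes the subset independent of the original frame.
subset = raw[keep].copy()

# Assigning into the copy is unambiguous and leaves `raw` untouched.
subset["ANO_INF"] = "2017"

print(subset)
```

The copy costs extra memory, which matters for frames with more than a million rows, so it is a trade-off between clarity and footprint.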

The `ANO_INF` variable, which holds the year in which the data was collected, must be cleaned before the merge because it contains out-of-range values such as ' 146', ' 136', ' 133', and '0', as well as mixed types. We therefore reassign it for each year according to the dataset it comes from.

In [17]:
df_anexo6A_2017['ANO_INF'] = '2017'
df_anexo6A_2018['ANO_INF'] = '2018'
df_anexo6A_2019['ANO_INF'] = '2019'
df_anexo6A_2020['ANO_INF'] = '2020'
df_anexo6A_2021['ANO_INF'] = '2021'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_anexo6A_2017['ANO_INF'] = '2017'

As before, we look at each variable, a small sample of its values, and its type:
```
<dataframe_column> --> [values examples of dataframe 1] [values examples of dataframe 2] [values examples of dataframe 3]
types: [type of dataframe column 1, type of dataframe column 2, type of dataframe column 3] 
```

In [18]:
text =""
for col in validate_main_variables:
    text += f"{col} --> "
    var_types = []
    for df in dfs:
        text += f"{df[col].unique()[:5]}"
        var_types.append(df[col].dtype)
    text += f"\ntypes: {var_types} \n\n"   
print(text)

ANO_INF --> [2017 '2017' ' 146' ' 136' ' 133'][2018 '2018' '  83' '   0' '   1'][2019][2020][2021]
types: [dtype('O'), dtype('O'), dtype('int64'), dtype('int64'), dtype('int64')] 

CODIGO_DANE --> [111001000078 111001000124 111001000132 111001000272 111001000353][111001000078 111001000124 111001000132 111001014176 111001000272][211850001473 311001090793 111001035572 111001012475 111001029955][111001000078 111001000124 111001000132 111001000272 111001000612][111001000078 111001000124 111001000132 111001000272 111001000612]
types: [dtype('int64'), dtype('O'), dtype('int64'), dtype('int64'), dtype('int64')] 

CODIGO_DANE_SEDE --> [1.11001000e+11 1.11001015e+11 1.11001015e+11 1.11001000e+11
 1.11001014e+11][111001000078 111001000124 111001000132 111001000213 111001000272][211850000817 211001030630 311001090793 111001035572 111001013625][111001000078 111001014834 111001014842 111001000124 111001014303][111001000078 111001014834 111001014842 111001000124 111001000132]
types: [dtype('float64'

Above we find the following:

* The variable `ANO_INF` has mixed types (object and integer), and for 2017 and 2018 it contains out-of-range values such as ' 146', ' 136', ' 133', '  83', '   0', and '   1'. It should only take integer values between 2017 and 2021.
* The variable `CODIGO_DANE` is of type object for 2018 and of type integer for all other years. Either type can represent it, but only one should be used.
* The variable `CODIGO_DANE_SEDE` has mixed types (float, integer, and object). It should be represented as integer or object, but not as float.
* The variable `TIPO_DOCUMENTO` has mixed types (object and integer), with values from 1 to 10. According to the dictionary, valid values range from 1 to 12, so there are no unexpected values; it only needs to be converted to a single type.
* The variables `RES_DEPTO` and `RES_MUN` have mixed types (object and integer). They must be converted to a single type, ideally integer, to use less memory.
* The variable `ESTRATO` has mixed types, but the values in this preliminary sample are within the expected range.
* The variable `SISBEN` is of type object. Since it is a score, it must be converted to float, after removing any blank spaces and replacing the decimal commas ',' with points '.'.
* The variable `GENERO` should only have the values 'F' or 'M', but it contains others such as ' ', nan, and 'D'.
* The variable `PROVIENE_SECTOR_PRIV` should only have the values 'S' or 'N', but it contains others such as ' ', '?', '7', '6', and '5'.
* The variable `PROVIENE_OTR_MUN` should only have the values 'S' or 'N', but it contains others such as ' ', 'P', nan, and '8'.
* The variable `CARACTER` has mixed types and unexpected values such as '*' and nan.
* The variable `REPITENTE` should only have the values 'S' or 'N', but it contains others such as ' ', nan, and '   0'.
* The variable `ZONA_RESI_ALU` should only have the values '1' or '2', but it contains others such as '*', '0', and ' '.
* The variable `VALORACION2` contains unexpected values such as '*', and it is completely empty for 2020 and 2021.
* The variable `EDAD` has mixed types (object and integer), but the values observed in the sample seem to be fine.
* The variable `DIR_NUM_LOCALIDAD` has mixed types (object and integer). For 2018 in particular, the values are strings that represent floats, with ',' as the decimal separator and leading spaces, so the strings must be cleaned before converting the variable to integer or float.
* The variable `DIR_LOCALIDAD` has mixed types (object and integer). It should be a string, since it holds the name of a locality, but for one year it is an integer holding the locality number.
* The variables `TIPO_DISCAPACIDAD`, `POB_VICT_CONF`, `CAP_EXC`, `CODIGO_ETNIA`, `CODIGO_JORNADA`, `CODIGO_ESPECIALIDAD`, `CODIGO_GRADO`, `CODIGO_METODOLOGIA`, `SIT_ACAD_ANIO_ANT`, `CON_ALUM_ANIO_ANT`, and `NIVEL` have mixed types (object and integer), although the values are within the expected range. They must be converted to a single type, ideally integer, to use less memory.
* In this preliminary review, the variables `NRO_DOCUMENTO`, `APELLIDO1`, `APELLIDO2`, `NOMBRE1`, `NOMBRE2`, `DIRECCION_RESIDENCIA`, `FECHA_NACIMIENTO`, and `VALORACION1` seem to be fine.
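
Several of the findings above call for the same few operations: stripping blanks, replacing decimal commas, coercing to a single numeric type, and discarding values outside an allowed set. The following is a minimal sketch of such helpers on made-up values. The function names are hypothetical, and whether invalid values should become missing (as here) or be handled differently is a decision this sketch does not settle.

```python
import pandas as pd


def _normalize(value):
    """Strip blanks and use a decimal point; leave non-string values untouched."""
    if isinstance(value, str):
        return value.strip().replace(",", ".")
    return value


def to_float_score(series):
    """Convert a score stored as text (e.g. SISBEN) to float; invalid values become NaN."""
    return pd.to_numeric(series.map(_normalize), errors="coerce")


def to_nullable_int(series):
    """Convert mixed int/str codes to the nullable Int64 dtype; invalid values become <NA>."""
    return pd.to_numeric(series.map(_normalize), errors="coerce").astype("Int64")


def keep_valid(series, valid):
    """Strip blanks and replace anything outside `valid` with a missing value."""
    cleaned = series.map(lambda v: v.strip() if isinstance(v, str) else v)
    return cleaned.where(cleaned.isin(valid))


# Made-up samples that imitate the problems listed above.
sisben = to_float_score(pd.Series([",01", " 54,68", " ", None], dtype=object))
tipo_documento = to_nullable_int(pd.Series([2, "7", " 5", " "], dtype=object))
genero = keep_valid(pd.Series(["F", " M", "D", " ", None], dtype=object), ["F", "M"])

print(sisben.tolist())
print(tipo_documento.tolist())
print(genero.tolist())
```

The nullable `Int64` dtype is used because a plain integer column cannot hold missing values, and several of the code variables contain blanks.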

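Now we do a little cleaning: we convert `CODIGO_DANE_SEDE` to integer for 2017, and we remove duplicated `NRO_DOCUMENTO` values within each year. Note that `keep=False` discards every row whose document number appears more than once, rather than keeping one copy.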

In [19]:
df_anexo6A_2017['CODIGO_DANE_SEDE'] = df_anexo6A_2017['CODIGO_DANE_SEDE'].astype(int)
df_anexo6A_2017['CODIGO_DANE_SEDE']


0          111001000078
1          111001014834
2          111001014834
3          111001014834
4          111001014834
               ...     
1334072               0
1334073               0
1334074               0
1334075               0
1334076               0
Name: CODIGO_DANE_SEDE, Length: 1334077, dtype: int64

In [21]:
df_anexo6A_2017.drop_duplicates(subset=['NRO_DOCUMENTO'], keep=False, inplace=True)
df_anexo6A_2018.drop_duplicates(subset=['NRO_DOCUMENTO'], keep=False, inplace=True)
df_anexo6A_2019.drop_duplicates(subset=['NRO_DOCUMENTO'], keep=False, inplace=True)
df_anexo6A_2020.drop_duplicates(subset=['NRO_DOCUMENTO'], keep=False, inplace=True)
df_anexo6A_2021.drop_duplicates(subset=['NRO_DOCUMENTO'], keep=False, inplace=True)
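
The choice of `keep` changes how many students survive de-duplication. The small sketch below, on made-up rows, contrasts `keep=False` (used in this notebook) with `keep="first"`, and counts the affected rows beforehand.

```python
import pandas as pd

# Made-up enrolment rows: document "200" appears twice.
toy = pd.DataFrame({
    "NRO_DOCUMENTO": ["100", "200", "200", "300"],
    "CODIGO_GRADO": [5, 6, 7, 8],
})

# keep=False drops every row whose document number is repeated.
drop_all_copies = toy.drop_duplicates(subset=["NRO_DOCUMENTO"], keep=False)

# keep="first" keeps one row per document number.
keep_first_copy = toy.drop_duplicates(subset=["NRO_DOCUMENTO"], keep="first")

# Number of rows involved in any duplication, useful to report before dropping.
n_repeated_rows = int(toy.duplicated(subset=["NRO_DOCUMENTO"], keep=False).sum())

print(drop_all_copies["NRO_DOCUMENTO"].tolist())
print(keep_first_copy["NRO_DOCUMENTO"].tolist())
print(n_repeated_rows)
```

Dropping all copies is the more conservative option when it is unclear which of the repeated records is the correct one, at the cost of losing those students entirely.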


Finally, we concatenate the datasets for all years and keep only the records in which `REPITENTE` is 'S' or 'N'.

In [22]:
df_anexo6A = pd.concat([df_anexo6A_2017, df_anexo6A_2018, df_anexo6A_2019, df_anexo6A_2020, df_anexo6A_2021], ignore_index=True)
df_anexo6A

Unnamed: 0,ANO_INF,CODIGO_DANE,CODIGO_DANE_SEDE,TIPO_DOCUMENTO,NRO_DOCUMENTO,APELLIDO1,APELLIDO2,NOMBRE1,NOMBRE2,DIRECCION_RESIDENCIA,RES_DEPTO,RES_MUN,ESTRATO,SISBEN,FECHA_NACIMIENTO,GENERO,POB_VICT_CONF,DPTO_EXP,MUN_EXP,PROVIENE_SECTOR_PRIV,PROVIENE_OTR_MUN,TIPO_DISCAPACIDAD,CAP_EXC,CODIGO_ETNIA,CODIGO_JORNADA,CARACTER,CODIGO_ESPECIALIDAD,CODIGO_GRADO,CODIGO_METODOLOGIA,REPITENTE,SIT_ACAD_ANIO_ANT,CON_ALUM_ANIO_ANT,ZONA_RESI_ALU,VALORACION1,VALORACION2,CODIGO_INTERNADO,EDAD,NIVEL,DIR_NUM_LOCALIDAD,DIR_LOCALIDAD
0,2017,111001000078,111001000078,2,00010803853,CRUZ,CANO,LAURA,VALENTINA,CL 69 A 105 F 67,11,1,2,",01",1/8/2000,F,9,,,N,N,99,9,0,6,1,5,10,1,N,1,9,1,,,3,17,4,16,PUENTE ARANDA
1,2017,111001000078,111001014834,3,05057637,NAVAS,GONZALEZ,SUJEIBY,VALENTINA,CL 34 SUR 40 A 51,11,1,3,,6/13/2012,F,9,,,N,N,99,9,0,3,0,0,0,1,N,0,9,1,,,3,4,1,16,PUENTE ARANDA
2,2017,111001000078,111001014834,3,067289989,RODRIGUEZ,GONZALEZ,SANTIAGO,JOSE,KR 54 SUR 50 B 10,11,1,3,,10/7/2008,M,9,,,N,N,7,9,0,3,0,0,3,1,N,1,9,1,,,3,8,2,16,PUENTE ARANDA
3,2017,111001000078,111001014834,3,071759788,AQUILES,,HERNANDEZ,MILANO,KR 40 28 A 02 SUR,11,1,3,,8/23/2012,M,9,,,N,N,99,9,0,3,0,0,0,1,N,0,9,1,,,3,4,1,16,PUENTE ARANDA
4,2017,111001000078,111001014834,3,089980482,ESPIN,CARDOZO,STEFANI,ALEJANDRA,KR 40 28 A 02 SUR,11,1,3,,11/16/2013,F,9,,,N,N,99,9,0,3,0,0,-1,1,N,0,9,1,,,,3,1,16,PUENTE ARANDA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4464566,2021,411102000293,411102000293,2,1012336508,REBELLON,TORRES,DAYANA,SULEY,KR 105 B 65 81 SUR,11,1,2,5468,9/6/2005,F,99,,,N,N,99,9,0,6,2,7,10,1,N,1,9,1,,,3,15,4,7,7
4464567,2021,411102000293,411102000293,2,1016713910,CARRILLO,TORRES,ASLHEY,DAYANNA,KR 99 A 71 39 SUR,11,1,2,6121,10/1/2005,F,99,,,N,N,99,9,0,6,2,7,10,1,N,1,9,1,,,3,15,4,7,7
4464568,2021,411102000293,411102000293,2,1011090034,PEREZ,CUERVO,KAROLL,MICHELLE,CL 62 SUR 93 C 53,11,1,2,5821,11/30/2005,F,99,,,N,N,99,9,0,6,2,7,10,1,N,1,9,1,,,3,15,4,7,7
4464569,2021,411102000293,411102000293,2,1011092068,MORALES,RODRIGUEZ,VALERIA,,CL 61 B SUR 81 D 03,11,1,2,5824,2/24/2006,F,99,,,N,N,99,9,0,6,2,7,10,1,N,1,9,1,,,3,15,4,7,7


In [23]:
df_anexo6A = df_anexo6A[df_anexo6A['REPITENTE'].isin(['N', 'S'])]
df_anexo6A.head(2)

Unnamed: 0,ANO_INF,CODIGO_DANE,CODIGO_DANE_SEDE,TIPO_DOCUMENTO,NRO_DOCUMENTO,APELLIDO1,APELLIDO2,NOMBRE1,NOMBRE2,DIRECCION_RESIDENCIA,RES_DEPTO,RES_MUN,ESTRATO,SISBEN,FECHA_NACIMIENTO,GENERO,POB_VICT_CONF,DPTO_EXP,MUN_EXP,PROVIENE_SECTOR_PRIV,PROVIENE_OTR_MUN,TIPO_DISCAPACIDAD,CAP_EXC,CODIGO_ETNIA,CODIGO_JORNADA,CARACTER,CODIGO_ESPECIALIDAD,CODIGO_GRADO,CODIGO_METODOLOGIA,REPITENTE,SIT_ACAD_ANIO_ANT,CON_ALUM_ANIO_ANT,ZONA_RESI_ALU,VALORACION1,VALORACION2,CODIGO_INTERNADO,EDAD,NIVEL,DIR_NUM_LOCALIDAD,DIR_LOCALIDAD
0,2017,111001000078,111001000078,2,10803853,CRUZ,CANO,LAURA,VALENTINA,CL 69 A 105 F 67,11,1,2,",01",1/8/2000,F,9,,,N,N,99,9,0,6,1,5,10,1,N,1,9,1,,,3,17,4,16,PUENTE ARANDA
1,2017,111001000078,111001014834,3,5057637,NAVAS,GONZALEZ,SUJEIBY,VALENTINA,CL 34 SUR 40 A 51,11,1,3,,6/13/2012,F,9,,,N,N,99,9,0,3,0,0,0,1,N,0,9,1,,,3,4,1,16,PUENTE ARANDA


In [25]:
for col in df_anexo6A.columns:
    # Show the distinct values and the dtype of each column
    text = f"{col} --> {df_anexo6A[col].unique()}"
    text += f"\ntypes: {df_anexo6A[col].dtype} \n\n"
    print(text)

ANO_INF --> ['2017' '2018' '2019' '2020' '2021']
types: object 


CODIGO_DANE --> [111001000078 111001000124 111001000132 ... 111001800678 111001800694
 111001800813]
types: object 


CODIGO_DANE_SEDE --> [111001000078 111001014834 111001014842 ... 111102000672 111001800813
 211850001309]
types: object 


TIPO_DOCUMENTO --> [2 3 5 7 1 6 8 9 '2' '7' '5' '1' '6' 10 11 12 ' 5' ' 3' ' 7' ' 2' ' 1'
 ' 6']
types: object 


NRO_DOCUMENTO --> ['00010803853' '05057637' '067289989' ... '1215713431' '1065822914'
 '1028781395']
types: object 


APELLIDO1 --> ['CRUZ' 'NAVAS' 'RODRIGUEZ' ... 'MACCHIARULLO' 'BOYS' 'LENGUERKE']
types: object 


APELLIDO2 --> ['CANO' 'GONZALEZ' ' ' ... 'CIVIRA' 'MENESESS' 'FERRONEZ']
types: object 


NOMBRE1 --> ['LAURA' 'SUJEIBY' 'SANTIAGO' ... 'YORGEIDYS' 'BRYANLLERLYS' 'WITNNY']
types: object 


NOMBRE2 --> ['VALENTINA' 'JOSE' 'MILANO' ... 'SHEDYEL' 'MARAIRE' 'JESSIMAR']
types: object 


DIRECCION_RESIDENCIA --> ['CL 69 A 105 F 67' 'CL 34 SUR 40 A 51' 'KR 54 SUR 50 

PROVIENE_SECTOR_PRIV --> ['N' 'S']
types: object 


PROVIENE_OTR_MUN --> ['N' 'S']
types: object 


TIPO_DISCAPACIDAD --> [99 7 8 10 18 15 13 11 3 4 19 17 14 6 12 1 2 9 5 16 '99' ' 3' '16' ' 1'
 ' 8' '13' '10']
types: object 


CAP_EXC --> [9 4 1 3 2 5 6 '9' 7 11 10]
types: object 


CODIGO_ETNIA --> [0 46 49 200 75 97 19 999 83 1 45 26 4 29 74 53 44 77 12 34 95 80 9 20 10
 42 73 72 99 71 52 7 47 100 27 17 15 25 5 14 31 66 70 28 2 48 400 22 18 6
 57 64 3 65 13 101 33 32 67 8 35 998 16 50 23 37 56 11 58 51 85 96 86 62
 40 30 54 38 82 90 36 43 21 61 24 92 81 55 68 60 94 '  0' '  7' '200' 107
 69 102 39 59 79 93 ' 25' ' 20' ' 27' 84 98 109 76]
types: object 


CODIGO_JORNADA --> [6 3 2 4 1 5 '2' '5' '4' '1' '6']
types: object 


CARACTER --> [1 0 2 '0']
types: object 


CODIGO_ESPECIALIDAD --> [5 0 6 8 9 7 10 ' 0' ' 5']
types: object 


CODIGO_GRADO --> [10 0 3 -1 9 7 11 8 6 5 2 4 1 99 25 26 23 24 22 -2 21 12 13 '26' '24' '25'
 '22' '23' '4' '3' '2' '5' '6' '9' '0' '11' '1' '-1' '-2' '10'

Fields containing `' '` (a single space): `SISBEN`, `DPTO_EXP`, `MUN_EXP`, `VALORACION1`, `VALORACION2`, `CODIGO_INTERNADO`, `PAIS_ORIGEN`  
Fields containing `'$null$'`: `NOMBRE_PAIS_ORIGEN`

In [26]:
empty_columns = ['SISBEN', 'DPTO_EXP', 'MUN_EXP', 'VALORACION1', 'VALORACION2', 'CODIGO_INTERNADO']

In [27]:
for col in empty_columns:
    # Count the rows whose value is the single-space placeholder
    print(f"Empties in {col} --> {(df_anexo6A[col] == ' ').sum()}")

Empties in SISBEN --> 1814711
Empties in DPTO_EXP --> 4271842
Empties in MUN_EXP --> 4271842
Empties in VALORACION1 --> 4401025
Empties in VALORACION2 --> 4402929
Empties in CODIGO_INTERNADO --> 572358


`DPTO_EXP`: 2156194 empty values, but not every student was displaced as a victim of the conflict.  
`MUN_EXP`: 2156194 empty values, but not every student was displaced as a victim of the conflict.  
`VALORACION1` and `VALORACION2` contain many empty values. A count on the 2019 data, which has 27099 repeaters, found that only 4 of them have a value in these fields, so these variables are left out of the analysis.
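
The blank counts discussed above can also be computed for several columns in one pass. The following is a minimal sketch on a toy frame (the rows are invented for illustration), assuming that a missing value is encoded as a single space `' '`, as in the fields listed above:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; a single space marks a missing value
toy = pd.DataFrame({
    'SISBEN': [',01', ' ', '54,68', ' '],
    'DPTO_EXP': [' ', ' ', '73', ' '],
})

# Count single-space placeholders per column in one vectorised comparison
blank_counts = (toy == ' ').sum()
print(blank_counts)

# Replace the placeholder with NaN so pandas treats it as missing
toy_clean = toy.replace(' ', np.nan)
print(toy_clean.isna().sum())
```

Turning the placeholder into `NaN` early means later steps can rely on `isna()` instead of comparing against `' '` column by column.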

In [28]:
df_anexo6A.to_csv('data/1-bronce/Consolidado-SIMAT-2017-2021.csv')

In [29]:
del df_anexo6A

In [2]:
df_consolidado = pd.read_csv('data/1-bronce/Consolidado-SIMAT-2017-2021.csv', index_col=0)



In [3]:
df_consolidado.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4402945 entries, 0 to 4464570
Data columns (total 40 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ANO_INF               int64  
 1   CODIGO_DANE           int64  
 2   CODIGO_DANE_SEDE      int64  
 3   TIPO_DOCUMENTO        int64  
 4   NRO_DOCUMENTO         object 
 5   APELLIDO1             object 
 6   APELLIDO2             object 
 7   NOMBRE1               object 
 8   NOMBRE2               object 
 9   DIRECCION_RESIDENCIA  object 
 10  RES_DEPTO             int64  
 11  RES_MUN               int64  
 12  ESTRATO               int64  
 13  SISBEN                object 
 14  FECHA_NACIMIENTO      object 
 15  GENERO                object 
 16  POB_VICT_CONF         int64  
 17  DPTO_EXP              object 
 18  MUN_EXP               object 
 19  PROVIENE_SECTOR_PRIV  object 
 20  PROVIENE_OTR_MUN      object 
 21  TIPO_DISCAPACIDAD     int64  
 22  CAP_EXC               int64  
 23  CODIGO_

In [4]:
df_consolidado['NRO_DOCUMENTO'] = df_consolidado['NRO_DOCUMENTO'].astype(str)
df_consolidado['DIRECCION_RESIDENCIA'] = df_consolidado['DIRECCION_RESIDENCIA'].astype(str)
df_consolidado.drop(['APELLIDO1', 'APELLIDO2', 'NOMBRE1', 'NOMBRE2', 'VALORACION1', 'VALORACION2'], inplace=True, axis=1)

In [5]:
df_consolidado['NRO_DOCUMENTO'].str.contains(r'\D', regex=True).value_counts()

False    4321467
True       81478
Name: NRO_DOCUMENTO, dtype: int64

In [6]:
((df_consolidado['NRO_DOCUMENTO'].str.contains(r'\D', regex=True)) & (df_consolidado['TIPO_DOCUMENTO'].isin([1,2,3]))).value_counts()

False    4400079
True        2866
dtype: int64

In [7]:
df_consolidado["RES_DEPTO"].unique()

array([11, 85, 50, 73, 25, 15, 23, 54, 41, 47, 27,  5,  8, 13, 17, 19, 76,
       44, 66, 70, 68, 20, 63, 52, 81, 95, 99, 86, 18, 88, 94, 91, 97])

In [8]:
df_consolidado["RES_MUN"].unique()

array([  1, 606, 313, 718, 675, 286, 176, 239, 443, 754, 520, 568, 306,
       148, 269, 430, 250, 483, 168, 634, 140, 238,  13, 660, 290, 275,
       740, 132, 821, 834, 573,   6, 380, 400, 352, 152, 555,  51, 469,
       394, 885,  45, 370, 670, 755, 574, 750, 570, 665, 175, 524, 551,
       288, 130,  81, 895, 124, 671, 455, 590, 835,  35, 396, 839, 319,
       307,  47, 322, 183, 386, 377,  88, 295, 154, 758,  30, 300, 178,
       444, 170, 299, 480,  68,  32, 530, 426, 899, 770, 777, 807, 682,
       405, 547, 517,  43, 624, 162, 318, 442, 126, 473, 212, 147, 276,
       892,  53, 875,  77, 498, 759, 592,  22, 842, 717, 245, 408, 591,
        79, 298, 433, 761, 638, 854, 810, 104, 440, 673, 367, 664, 417,
       490, 466, 545, 449, 535, 513, 843, 793, 873, 692, 668, 359, 710,
       861, 708, 215, 504, 678,  25, 397, 518, 418, 980, 182, 815, 540,
       139, 862, 658, 190, 897, 646, 703, 542, 137, 189, 667, 226, 806,
       572,  60, 787, 488, 268, 817, 650, 874, 736, 222, 836, 37

In [9]:
df_consolidado["RES_MUN"].max()

980

In [10]:
df_consolidado = df_consolidado[df_consolidado["ESTRATO"] != 9]
df_consolidado["ESTRATO"].unique()

array([2, 3, 0, 1, 4, 6, 5])

In [11]:
df_consolidado["SISBEN"] = df_consolidado["SISBEN"].str.replace(",",".").str.replace(" ", "")
df_consolidado["SISBEN"] = pd.to_numeric(df_consolidado["SISBEN"], errors='coerce')

In [12]:
df_consolidado["SISBEN"].dtype

dtype('float64')

In [13]:
df_consolidado["SISBEN"].min()

0.0
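
To make the `SISBEN` conversion above easier to follow, here is a small sketch on invented values written in the raw export style (decimal comma, blank for missing). It is an illustration only, not the notebook's data:

```python
import pandas as pd

# Toy scores in the raw export style (values invented for illustration)
raw = pd.Series([',01', '54,68', ' ', '61,21'])

# Swap the decimal comma for a point and drop spaces, then parse;
# strings that are not numbers (here the blank) become NaN
parsed = pd.to_numeric(
    raw.str.replace(',', '.', regex=False).str.replace(' ', '', regex=False),
    errors='coerce',
)
print(parsed.tolist())
```

Passing `regex=False` makes it explicit that `','` and `' '` are literal characters rather than patterns.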

In [14]:
df_consolidado["FECHA_NACIMIENTO"] = pd.to_datetime(df_consolidado["FECHA_NACIMIENTO"])
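
The birth dates shown in the table outputs look like month/day/year (for example `6/13/2012`). Assuming every row follows that layout, an explicit `format` removes any ambiguity between day and month. A sketch on three sample strings:

```python
import pandas as pd

# Sample strings in the month/day/year layout seen in the outputs above
dates = pd.Series(['6/13/2012', '10/7/2008', '1/8/2000'])

# An explicit format avoids guessing whether 10/7 is October 7 or July 10
parsed_dates = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed_dates)
```

If some rows used a different layout, this call would raise an error; adding `errors='coerce'` would turn them into `NaT` instead.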

In [16]:
df_consolidado["GENERO"].unique()

array(['F', 'M'], dtype=object)

In [18]:
df_consolidado['POB_VICT_CONF'].unique()

array([ 9,  1,  2,  3,  4,  5, 99, 10, 18,  6, 17,  7, 14, 20, 15, 16, 12,
       19,  8, 11, 21, 13])

In [19]:
(df_consolidado['DPTO_EXP'] == ' ').value_counts()

True     4262055
False     130993
Name: DPTO_EXP, dtype: int64

In [20]:
df_consolidado['DPTO_EXP'] = df_consolidado['DPTO_EXP'].str.strip().replace('^0+', '', regex = True)
df_consolidado['DPTO_EXP'] = df_consolidado['DPTO_EXP'].replace('', ' ')
df_consolidado['DPTO_EXP'].unique()

array([' ', '73', '25', '95', '76', '20', '50', '18', '15', '54', '68',
       '5', '23', '13', '41', '11', '17', '47', '52', '85', '86', '27',
       '8', '81', '19', '63', '70', '44', '99', '66', '97', '91', '94',
       '88'], dtype=object)

In [21]:
(df_consolidado['MUN_EXP'] == ' ').value_counts()

True     4262055
False     130993
Name: MUN_EXP, dtype: int64

In [22]:
df_consolidado['MUN_EXP'] = df_consolidado['MUN_EXP'].str.strip().replace('^0+', '', regex = True)
df_consolidado['MUN_EXP'] = df_consolidado['MUN_EXP'].replace('', ' ')
df_consolidado['MUN_EXP'].unique()

array([' ', '616', '867', '394', '1', '148', '878', '555', '497', '596',
       '111', '228', '885', '590', '860', '686', '168', '236', '689',
       '81', '24', '120', '460', '606', '483', '837', '839', '675', '711',
       '670', '873', '6', '29', '51', '251', '319', '349', '411', '744',
       '268', '660', '25', '109', '250', '622', '810', '245', '290',
       '132', '676', '592', '298', '835', '217', '136', '55', '189',
       '200', '408', '183', '141', '507', '897', '79', '20', '288', '65',
       '101', '233', '410', '68', '244', '43', '807', '701', '320', '283',
       '753', '325', '130', '417', '823', '35', '97', '777', '773', '787',
       '30', '662', '350', '899', '359', '708', '785', '206', '438',
       '698', '26', '137', '147', '124', '688', '110', '600', '272', '90',
       '862', '406', '568', '313', '551', '495', '654', '347', '377',
       '834', '572', '571', '758', '573', '212', '683', '580', '381',
       '368', '754', '678', '380', '361', '892', '865', '67', '

In [23]:
df_consolidado.drop(['DPTO_EXP', 'MUN_EXP'], inplace=True, axis=1)

In [24]:
df_consolidado['PROVIENE_SECTOR_PRIV'].unique()

array(['N', 'S'], dtype=object)

In [25]:
df_consolidado['PROVIENE_OTR_MUN'].unique()

array(['N', 'S'], dtype=object)

In [26]:
df_consolidado['TIPO_DISCAPACIDAD'].unique()

array([99,  7,  8, 10, 18, 15, 13, 11,  3,  4, 19, 17, 14,  6, 12,  1,  2,
        9,  5, 16])

In [27]:
df_consolidado['CAP_EXC'].unique()

array([ 9,  4,  1,  3,  2,  5,  6,  7, 11, 10])

In [28]:
df_consolidado['CODIGO_ETNIA'].unique()

array([  0,  46,  49, 200,  75,  97,  19, 999,  83,   1,  45,  26,   4,
        29,  74,  53,  44,  77,  12,  34,  95,  80,   9,  20,  10,  42,
        73,  72,  99,  71,  52,   7,  47, 100,  27,  17,  15,  25,   5,
        14,  31,  66,  70,  28,   2,  48, 400,  22,  18,   6,  57,  64,
         3,  65,  13, 101,  33,  32,  67,   8,  35, 998,  16,  50,  23,
        37,  56,  11,  58,  51,  85,  96,  86,  62,  40,  30,  54,  38,
        82,  90,  36,  43,  21,  61,  24,  92,  81,  55,  68,  60,  94,
       107,  69, 102,  39,  59,  79,  93,  84,  98, 109,  76])

In [29]:
df_consolidado['CODIGO_JORNADA'].unique()

array([6, 3, 2, 4, 1, 5])

In [30]:
df_consolidado['CARACTER'] = df_consolidado['CARACTER'].astype(int)
df_consolidado['CARACTER'].unique()

array([1, 0, 2])

In [31]:
df_consolidado['CODIGO_ESPECIALIDAD'].unique()

array([ 5,  0,  6,  8,  9,  7, 10])

In [32]:
df_consolidado['CODIGO_GRADO'].unique()

array([10,  0,  3, -1,  9,  7, 11,  8,  6,  5,  2,  4,  1, 99, 25, 26, 23,
       24, 22, -2, 21, 12, 13])

In [33]:
df_consolidado['CODIGO_METODOLOGIA'].unique()

array([ 1,  9, 10, 20, 11,  2, 12])

In [34]:
df_consolidado['REPITENTE'].unique()

array(['N', 'S'], dtype=object)

In [35]:
df_consolidado['SIT_ACAD_ANIO_ANT'].unique()

array([1, 0, 8, 2])

In [36]:
df_consolidado['CON_ALUM_ANIO_ANT'] = df_consolidado['CON_ALUM_ANIO_ANT'].astype(int)
df_consolidado['CON_ALUM_ANIO_ANT'].unique()

array([9, 8, 5, 3])

In [37]:
df_consolidado['ZONA_RESI_ALU'] = df_consolidado['ZONA_RESI_ALU'].astype(str)
df_consolidado['ZONA_RESI_ALU'].unique()

array(['1', '2', ' '], dtype=object)

In [38]:
(df_consolidado['ZONA_RESI_ALU'] == ' ').value_counts()

False    4393047
True           1
Name: ZONA_RESI_ALU, dtype: int64

In [39]:
df_consolidado = df_consolidado[df_consolidado['ZONA_RESI_ALU'] != ' ']
df_consolidado['ZONA_RESI_ALU'] = df_consolidado['ZONA_RESI_ALU'].astype(int)
df_consolidado['ZONA_RESI_ALU'].unique()

array([1, 2])

In [40]:
df_consolidado['CODIGO_INTERNADO'] = df_consolidado['CODIGO_INTERNADO'].astype(str)
df_consolidado['CODIGO_INTERNADO'] = df_consolidado['CODIGO_INTERNADO'].replace(' ', '3').astype(int)
df_consolidado['CODIGO_INTERNADO'].unique()

array([3])

In [41]:
df_consolidado.drop(['CODIGO_INTERNADO'], inplace=True, axis=1)

In [42]:
df_consolidado['EDAD'].sort_values().unique()

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  91,  92,
        93,  97, 100])

In [43]:
df_consolidado['NIVEL'].unique()

array([4, 1, 2, 3, 7])

In [44]:
(df_consolidado['NIVEL'] == 7).value_counts()

False    4393046
True           1
Name: NIVEL, dtype: int64

In [45]:
df_consolidado = df_consolidado[df_consolidado['NIVEL'] != 7]

In [46]:
df_consolidado['DIR_NUM_LOCALIDAD'].unique()

array([16, 8, 1, 10, 4, 12, 11, 13, 7, 6, 3, 18, 15, 17, 5, 14, 9, 19, 2,
       20, '16', '18', '19', '11', '8', '9', '10', '2', '   16,00',
       '    8,00', '    1,00', '    4,00', '   10,00', '   12,00',
       '   11,00', '   13,00', '    7,00', '    5,00', '    6,00',
       '    3,00', '   18,00', '   15,00', '   17,00', '   19,00',
       '   14,00', '    9,00', '    2,00', '   20,00', ' ', '20'],
      dtype=object)

In [47]:
(df_consolidado['DIR_NUM_LOCALIDAD'] == " ").value_counts()

False    4393031
True          15
Name: DIR_NUM_LOCALIDAD, dtype: int64

In [48]:
df_consolidado = df_consolidado[df_consolidado['DIR_NUM_LOCALIDAD'] != " "]

In [49]:
df_consolidado['DIR_NUM_LOCALIDAD'] = df_consolidado['DIR_NUM_LOCALIDAD'].astype(str).str.strip().str.replace(",.*","", regex = True)
df_consolidado['DIR_NUM_LOCALIDAD'] = df_consolidado['DIR_NUM_LOCALIDAD'].astype(int)
df_consolidado['DIR_NUM_LOCALIDAD'].unique()

array([16,  8,  1, 10,  4, 12, 11, 13,  7,  6,  3, 18, 15, 17,  5, 14,  9,
       19,  2, 20])
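
The cleaning of `DIR_NUM_LOCALIDAD` can be tried in isolation on a handful of the mixed encodings listed in the output above. This sketch is an alternative, not the notebook's method: instead of dropping the blank rows first, it parses them as missing and uses the nullable `Int64` dtype.

```python
import pandas as pd

# Mixed encodings of the kind seen in DIR_NUM_LOCALIDAD
raw = pd.Series([16, '18', '   16,00', '    8,00', ' ', '20'], dtype=object)

cleaned = (
    raw.astype(str)
       .str.strip()                          # remove the padding spaces
       .str.replace(',', '.', regex=False)   # decimal comma -> point
)

# Blank strings become NaN; Int64 keeps the column integer despite missing values
as_int = pd.to_numeric(cleaned, errors='coerce').astype('Int64')
print(as_int.tolist())
```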

In [50]:
df_consolidado.drop(['DIR_LOCALIDAD'], inplace=True, axis=1)

In [60]:
def calcular_nivel_sisben(row):
    puntaje = row['SISBEN']
    zona = int(row['ZONA_RESI_ALU'])
    if math.isnan(puntaje):
        return 0
    elif ((0.00 <= puntaje <= 44.79) and  zona == 1) or ((0.00 <= puntaje <= 32.98) and zona == 2):
        return 1
    elif ((44.80 <= puntaje <= 51.57) and  zona == 1) or ((32.99 <= puntaje <= 37.80) and zona == 2):
        return 2
    elif (puntaje > 51.57 and  zona == 1) or (puntaje > 37.80 and zona == 2):
        return 3
    else:
        return np.nan
df_consolidado['NIVEL_SISBEN'] = df_consolidado[['SISBEN', 'ZONA_RESI_ALU']].apply(calcular_nivel_sisben, axis=1)
df_consolidado

Unnamed: 0,ANO_INF,CODIGO_DANE,CODIGO_DANE_SEDE,TIPO_DOCUMENTO,NRO_DOCUMENTO,DIRECCION_RESIDENCIA,RES_DEPTO,RES_MUN,ESTRATO,SISBEN,FECHA_NACIMIENTO,GENERO,POB_VICT_CONF,PROVIENE_SECTOR_PRIV,PROVIENE_OTR_MUN,TIPO_DISCAPACIDAD,CAP_EXC,CODIGO_ETNIA,CODIGO_JORNADA,CARACTER,CODIGO_ESPECIALIDAD,CODIGO_GRADO,CODIGO_METODOLOGIA,REPITENTE,SIT_ACAD_ANIO_ANT,CON_ALUM_ANIO_ANT,ZONA_RESI_ALU,EDAD,NIVEL,DIR_NUM_LOCALIDAD,NIVEL_SISBEN
0,2017,111001000078,111001000078,2,00010803853,CL 69 A 105 F 67,11,1,2,0.01,2000-01-08,F,9,N,N,99,9,0,6,1,5,10,1,N,1,9,1,17,4,16,1
1,2017,111001000078,111001014834,3,05057637,CL 34 SUR 40 A 51,11,1,3,,2012-06-13,F,9,N,N,99,9,0,3,0,0,0,1,N,0,9,1,4,1,16,0
2,2017,111001000078,111001014834,3,067289989,KR 54 SUR 50 B 10,11,1,3,,2008-10-07,M,9,N,N,7,9,0,3,0,0,3,1,N,1,9,1,8,2,16,0
3,2017,111001000078,111001014834,3,071759788,KR 40 28 A 02 SUR,11,1,3,,2012-08-23,M,9,N,N,99,9,0,3,0,0,0,1,N,0,9,1,4,1,16,0
4,2017,111001000078,111001014834,3,089980482,KR 40 28 A 02 SUR,11,1,3,,2013-11-16,F,9,N,N,99,9,0,3,0,0,-1,1,N,0,9,1,3,1,16,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4464566,2021,411102000293,411102000293,2,1012336508,KR 105 B 65 81 SUR,11,1,2,54.68,2005-09-06,F,99,N,N,99,9,0,6,2,7,10,1,N,1,9,1,15,4,7,3
4464567,2021,411102000293,411102000293,2,1016713910,KR 99 A 71 39 SUR,11,1,2,61.21,2005-10-01,F,99,N,N,99,9,0,6,2,7,10,1,N,1,9,1,15,4,7,3
4464568,2021,411102000293,411102000293,2,1011090034,CL 62 SUR 93 C 53,11,1,2,58.21,2005-11-30,F,99,N,N,99,9,0,6,2,7,10,1,N,1,9,1,15,4,7,3
4464569,2021,411102000293,411102000293,2,1011092068,CL 61 B SUR 81 D 03,11,1,2,58.24,2006-02-24,F,99,N,N,99,9,0,6,2,7,10,1,N,1,9,1,15,4,7,3


In [61]:
df_consolidado['NIVEL_SISBEN'].isnull().value_counts()

False    4393031
Name: NIVEL_SISBEN, dtype: int64
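
Row-wise `apply` over more than four million rows is slow. A vectorised sketch of the same level assignment is shown below on invented rows, reusing the cut-offs from `calcular_nivel_sisben`. It is not an exact replacement, and the differences are assumptions to check before using it: scores that fall between two bands (for example 44.795) go to the higher band instead of `NaN`, negative scores go to level 1, and any zone other than 1 is treated as zone 2.

```python
import numpy as np
import pandas as pd

# Toy rows (invented): score and residence zone (1 or 2, as in ZONA_RESI_ALU)
toy = pd.DataFrame({
    'SISBEN': [0.01, 44.80, 58.21, np.nan, 33.50, 40.00],
    'ZONA_RESI_ALU': [1, 1, 1, 1, 2, 2],
})

# Upper cut-offs for levels 1 and 2, taken from calcular_nivel_sisben
zone1 = toy['ZONA_RESI_ALU'] == 1
cut1 = np.where(zone1, 44.79, 32.98)
cut2 = np.where(zone1, 51.57, 37.80)

score = toy['SISBEN']
# Conditions are evaluated in order; the first one that holds wins
conditions = [score.isna(), score <= cut1, score <= cut2]
# Rows matching no condition have a score above cut2, hence level 3
toy['NIVEL_SISBEN'] = np.select(conditions, [0, 1, 2], default=3)
print(toy)
```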

In [62]:
df_consolidado['NIVEL_SISBEN'] = df_consolidado['NIVEL_SISBEN'].astype('uint8')

In [65]:
# df_consolidado

In [66]:
df_consolidado['REPITENTE'] = df_consolidado['REPITENTE'].replace({'S': 1, 'N': 0}).astype('uint8')
df_consolidado['REPITENTE'].unique()

array([0, 1], dtype=uint8)

In [67]:
dict_etnia_recod = {
    0 : 0 ,
    200 : 2 ,
    1 : 1 ,
    97 : 5 ,
    83 : 1 ,
    46 : 1 ,
    19 : 1 ,
    8 : 1 ,
    96 : 1 ,
    95 : 4 ,
    73 : 1 ,
    75 : 1 ,
    6 : 1 ,
    101 : 1 ,
    45 : 1 ,
    49 : 1 ,
    72 : 1 ,
    35 : 1 ,
    107 : 1 ,
    65 : 1 ,
    27 : 1 ,
    74 : 1 ,
    50 : 1 ,
    98 : 6 ,
    5 : 1 ,
    66 : 1 ,
    12 : 1 ,
    31 : 1 ,
    26 : 1 ,
    48 : 1 ,
    4 : 1 ,
    16 : 1 ,
    22 : 1 ,
    34 : 1 ,
    29 : 1 ,
    2 : 1 ,
    18 : 1 ,
    7 : 1 ,
    47 : 1 ,
    52 : 1 ,
    17 : 1 ,
    80 : 1 ,
    15 : 1 ,
    109 : 1 ,
    53 : 1 ,
    400 : 3 ,
    77 : 1 ,
    25 : 1 ,
    28 : 1 ,
    70 : 1 ,
    100 : 1 ,
    3 : 1 ,
    10 : 1 ,
    13 : 1 ,
    40 : 1 ,
    59 : 1 ,
    14 : 1 ,
    20 : 1 ,
    9 : 1 ,
    64 : 1 ,
    102 : 1 ,
    44 : 1 ,
    55 : 1 ,
    81 : 1 ,
    67 : 1 ,
    57 : 1 ,
    23 : 1 ,
    37 : 1 ,
    36 : 1 ,
    99 : 1 ,
    21 : 1 ,
    84 : 1 ,
    56 : 1 ,
    51 : 1 ,
    76 : 1 ,
    43 : 1 ,
    39 : 1 ,
    54 : 1 ,
    58 : 1 ,
    68 : 1 ,
    82 : 1 ,
    42 : 1 ,
    33 : 1 ,
    11 : 1 ,
    71 : 1,
    998: 0,
    999: 0,
    24: 1,
    30: 1,
    32: 1,
    38: 1,
    60: 1,
    61: 1,
    62: 1,
    69: 1,
    79: 1,
    85: 1,
    86: 1,
    90: 1,
    92: 1,
    93: 1,
    94: 1
}

In [68]:
df_consolidado['CODIGO_ETNIA'] = df_consolidado['CODIGO_ETNIA'].replace(dict_etnia_recod)
df_consolidado['CODIGO_ETNIA'].unique()

array([0, 1, 2, 5, 4, 3, 6])
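
Almost every entry of `dict_etnia_recod` maps to group 1, so the recoding can be written more compactly by listing only the exceptions. The sketch below makes one assumption that the long dictionary does not: any code missing from `special` (a hypothetical name) is assigned to group 1, whereas `replace` would leave an unlisted code unchanged.

```python
import pandas as pd

# Codes that do not map to group 1, copied from dict_etnia_recod
special = {0: 0, 998: 0, 999: 0, 200: 2, 400: 3, 95: 4, 97: 5, 98: 6}

# A few sample codes taken from the CODIGO_ETNIA values listed earlier
codes = pd.Series([0, 46, 200, 999, 97, 400, 83])

# Assumption: every code not listed in `special` belongs to group 1
recoded = codes.map(special).fillna(1).astype('uint8')
print(recoded.tolist())
```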

In [69]:
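def grado_overage(x):
    """Return grade-for-age as a percentage: 100 means on track, lower means overage."""
    # Negative grade codes (-1, -2): 0 if the student is older than expected, else 100
    if x['CODIGO_GRADO'] < 0 and x['EDAD'] >= 6:
        ovage = 0
    elif x['CODIGO_GRADO'] == -1 and x['EDAD'] <= 5:
        ovage = 100
    elif x['CODIGO_GRADO'] == -2 and x['EDAD'] >= 5:
        ovage = 0
    elif x['CODIGO_GRADO'] == -2 and x['EDAD'] <= 4:
        ovage = 100
    # Grades 0 and up: on track when the expected grade (EDAD - 6) does not exceed the actual grade
    elif x['EDAD']-6 <= x['CODIGO_GRADO'] and x['CODIGO_GRADO'] >= 0:
        ovage = 100
    else:
        # Overage: actual grade as a percentage of the expected grade
        ovage = int(100*(x['CODIGO_GRADO']) /(x['EDAD']-6))

    return ovage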

In [70]:
df_consolidado['GRADO_OVERAGE'] = df_consolidado[['EDAD','CODIGO_GRADO']].apply(grado_overage, axis = 1 )
df_consolidado['GRADO_OVERAGE']

0           90
1          100
2          100
3          100
4          100
          ... 
4464566    100
4464567    100
4464568    100
4464569    100
4464570    100
Name: GRADO_OVERAGE, Length: 4393031, dtype: int64

In [71]:
def overage(x):
    return 1 if x['GRADO_OVERAGE'] < 100 else 0

In [72]:
df_consolidado['OVERAGE'] = df_consolidado[['GRADO_OVERAGE']].apply(overage, axis=1)
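
Both overage columns can also be built without row-wise `apply`. The following sketch reproduces the branches of `grado_overage` with boolean masks on invented rows. It assumes that the only negative grade codes are -1 and -2, as in the `CODIGO_GRADO` values listed earlier.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration
toy = pd.DataFrame({
    'EDAD': [17, 4, 8, 7, 5, 12],
    'CODIGO_GRADO': [10, 0, 3, -1, -2, 3],
})
age = toy['EDAD']
grade = toy['CODIGO_GRADO']
expected = age - 6  # grade expected for the student's age

# Branches of grado_overage that return 0 and 100 respectively
is_zero = ((grade < 0) & (age >= 6)) | ((grade == -2) & (age >= 5))
is_full = (
    ((grade == -1) & (age <= 5))
    | ((grade == -2) & (age <= 4))
    | ((grade >= 0) & (expected <= grade))
)

# Percentage for the remaining rows; the division is masked where expected <= 0
ratio = (100 * grade / expected.where(expected > 0)).fillna(0).astype(int)

toy['GRADO_OVERAGE'] = np.where(is_zero, 0, np.where(is_full, 100, ratio))
toy['OVERAGE'] = (toy['GRADO_OVERAGE'] < 100).astype('uint8')
print(toy)
```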

In [73]:
dict_discapacidad = {
    15 : 1 ,
    8 : 1 ,
    18 : 1 ,
    13 : 1 ,
    7 : 1 ,
    10 : 1 ,
    17 : 1 ,
    4 : 1 ,
    3 : 1 ,
    19 : 1 ,
    11 : 1 ,
    14 : 1 ,
    12 : 1 ,
    2 : 1 ,
    9 : 1 ,
    1 : 1 ,
    99: 0,
    6: 1,
    5: 1,
    16: 1
}

In [74]:
df_consolidado['TIPO_DISCAPACIDAD'] = df_consolidado['TIPO_DISCAPACIDAD'].replace(dict_discapacidad)
df_consolidado['TIPO_DISCAPACIDAD'].unique()

array([0, 1])

In [75]:
df_consolidado.to_csv('data/2-plata/Consolidado_SIMAT_Depurado.csv')

In [76]:
df_consolidado = pd.read_csv('data/2-plata/Consolidado_SIMAT_Depurado.csv',
                             index_col=0,
                             dtype={
                                 'ANO_INF':              'uint16',
                                 'CODIGO_DANE':          'uint64',
                                 'CODIGO_DANE_SEDE':     'uint64',
                                 'TIPO_DOCUMENTO':       'uint8',
                                 'NRO_DOCUMENTO':        'string',
                                 'DIRECCION_RESIDENCIA': 'string',
                                 'RES_DEPTO':            'uint8',
                                 'RES_MUN':              'uint16',
                                 'ESTRATO':              'uint8',
                                 'SISBEN':               'float16',  # note: float16 rounds the scores (54.68 is stored as 54.6875)
                                 'GENERO':               'string',
                                 'POB_VICT_CONF':        'uint8',
                                 'PROVIENE_SECTOR_PRIV': 'string',
                                 'PROVIENE_OTR_MUN':     'string',
                                 'TIPO_DISCAPACIDAD':    'uint8',
                                 'CAP_EXC':              'uint8',
                                 'CODIGO_ETNIA':         'uint16',
                                 'CODIGO_JORNADA':       'uint8',
                                 'CARACTER':             'uint8',
                                 'CODIGO_ESPECIALIDAD':  'uint8',
                                 'CODIGO_GRADO':         'int8',
                                 'CODIGO_METODOLOGIA':   'uint8',
                                 'REPITENTE':            'uint8',  # recoded to 0/1 before saving
                                 'SIT_ACAD_ANIO_ANT':    'uint8',
                                 'CON_ALUM_ANIO_ANT':    'uint8',
                                 'ZONA_RESI_ALU':        'uint8',
                                 'EDAD':                 'uint8',
                                 'NIVEL':                'uint8',
                                 'DIR_NUM_LOCALIDAD':    'uint8',
                                 'NIVEL_SISBEN':         'uint8',
                                 'GRADO_OVERAGE':        'uint8',
                                 'OVERAGE':              'uint8'
                             },
                            )



In [77]:
df_consolidado

Unnamed: 0,ANO_INF,CODIGO_DANE,CODIGO_DANE_SEDE,TIPO_DOCUMENTO,NRO_DOCUMENTO,DIRECCION_RESIDENCIA,RES_DEPTO,RES_MUN,ESTRATO,SISBEN,FECHA_NACIMIENTO,GENERO,POB_VICT_CONF,PROVIENE_SECTOR_PRIV,PROVIENE_OTR_MUN,TIPO_DISCAPACIDAD,CAP_EXC,CODIGO_ETNIA,CODIGO_JORNADA,CARACTER,CODIGO_ESPECIALIDAD,CODIGO_GRADO,CODIGO_METODOLOGIA,REPITENTE,SIT_ACAD_ANIO_ANT,CON_ALUM_ANIO_ANT,ZONA_RESI_ALU,EDAD,NIVEL,DIR_NUM_LOCALIDAD,NIVEL_SISBEN,GRADO_OVERAGE,OVERAGE
0,2017,111001000078,111001000078,2,00010803853,CL 69 A 105 F 67,11,1,2,0.010002,2000-01-08,F,9,N,N,0,9,0,6,1,5,10,1,0,1,9,1,17,4,16,1,90,1
1,2017,111001000078,111001014834,3,05057637,CL 34 SUR 40 A 51,11,1,3,,2012-06-13,F,9,N,N,0,9,0,3,0,0,0,1,0,0,9,1,4,1,16,0,100,0
2,2017,111001000078,111001014834,3,067289989,KR 54 SUR 50 B 10,11,1,3,,2008-10-07,M,9,N,N,1,9,0,3,0,0,3,1,0,1,9,1,8,2,16,0,100,0
3,2017,111001000078,111001014834,3,071759788,KR 40 28 A 02 SUR,11,1,3,,2012-08-23,M,9,N,N,0,9,0,3,0,0,0,1,0,0,9,1,4,1,16,0,100,0
4,2017,111001000078,111001014834,3,089980482,KR 40 28 A 02 SUR,11,1,3,,2013-11-16,F,9,N,N,0,9,0,3,0,0,-1,1,0,0,9,1,3,1,16,0,100,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4464566,2021,411102000293,411102000293,2,1012336508,KR 105 B 65 81 SUR,11,1,2,54.687500,2005-09-06,F,99,N,N,0,9,0,6,2,7,10,1,0,1,9,1,15,4,7,3,100,0
4464567,2021,411102000293,411102000293,2,1016713910,KR 99 A 71 39 SUR,11,1,2,61.218750,2005-10-01,F,99,N,N,0,9,0,6,2,7,10,1,0,1,9,1,15,4,7,3,100,0
4464568,2021,411102000293,411102000293,2,1011090034,CL 62 SUR 93 C 53,11,1,2,58.218750,2005-11-30,F,99,N,N,0,9,0,6,2,7,10,1,0,1,9,1,15,4,7,3,100,0
4464569,2021,411102000293,411102000293,2,1011092068,CL 61 B SUR 81 D 03,11,1,2,58.250000,2006-02-24,F,99,N,N,0,9,0,6,2,7,10,1,0,1,9,1,15,4,7,3,100,0
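
The `SISBEN` values in the table above (for example `54.687500` where the score was `54.68`) come from reading the column as `float16`. This short check shows the rounding and how `float32` behaves for the same score:

```python
import numpy as np

score = 54.68

# float16 has about three significant decimal digits, so the score is rounded
as_f16 = float(np.float16(score))
# float32 keeps the two-decimal score accurate to well below 0.01
as_f32 = float(np.float32(score))

print(as_f16)            # 54.6875
print(round(as_f32, 2))  # 54.68
```

If the exact two-decimal score matters downstream, for instance when comparing against the level cut-offs, reading the column as `float32` is the safer choice at the cost of two extra bytes per row.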
