In [1]:
import pandas as pd
import numpy as np
import json
import glob
from unidecode import unidecode
pd.options.display.max_columns = None
pd.options.display.max_rows = None

### Read in the Data

In [2]:
inscr = pd.read_csv('../data/inscr.csv')
asig = pd.read_csv('../data/asig (1).csv')
post = pd.read_csv('../data/post.csv')

### Duplicates

In [3]:
def get_shape(df):
    print(df.shape)
    return df

(inscr
 .pipe(get_shape)
 .drop_duplicates()
 .pipe(get_shape)
 .sample()
)

(972090, 23)
(972090, 23)


Unnamed: 0.1,Unnamed: 0,ins_id,prov_nacimiento,can_nacimiento,parro_nacimiento,ued_nombre,usu_estado,ins_autoidentificacion,ins_nacionalidad,sexo,ued_tipo,cod_final,year,usu_estado_civil,usu_fecha_nac,ins_sexo,ins_fecha,pais_res,provincia_reside,canton_reside,parroquia_reside,recinto,usu_nacionalidad
868373,55358,,,,,,A,Mestizo/a,,,,2043251000.0,2020,C,32618.0,MUJER,44039.30038,ECUADOR,GUAYAS,DURAN,ELOY ALFARO (DURÁN),,ECUATORIANA


In [4]:
(asig
 .pipe(get_shape)
 .drop_duplicates()
 .pipe(get_shape)
 .sample()
)

(213154, 25)
(213154, 25)


Unnamed: 0.1,Unnamed: 0,ins_id,genero,usu_fecha_nac,usu_nacionalidad,etnia,parroquia_reside,canton_reside,provincia_reside,pos_id,pos_nota,pos_prioridad,ies_id,ies_siglas_instit,nombre_institucion,provincia,canton,parroquia,car_id,carrera,area,modalidad,ofa_id,cod_final,year
103102,89150,9312620,MASCULINO,32368.0,ECUATORIANA,Mestizo/a,PEDRO VICENTE MALDONADO,PEDRO VICENTE MALDONADO,PICHINCHA,22773277,681.0,1.0,883,,INSTITUTO TECNOLÓGICO SUPERIOR VICENTE LEÓN,COTOPAXI,LATACUNGA,"LATACUNGA, CABECERA CANTONAL Y CAPITAL PROVINCIAL",7080,TECNOLOGIA SUPERIOR EN ADMINISTRACION FINANCIERA,ADMINISTRACION,PRESENCIAL,152894,2333411747,2020


In [5]:
(post
 .pipe(get_shape)
 .drop_duplicates()
 .pipe(get_shape)
 .sample()
)

(2363966, 25)
(2363966, 25)


Unnamed: 0.1,Unnamed: 0,ins_id,pos_id,pos_nota,pos_prioridad,ies_id,nombre_institucion,provincia,canton,parroquia,car_id,carrera,area,subarea,modalidad,jornada,ofa_id,per_id,cod_final,year,nota_postula,ies_tipo_financiamiento,subarea_nombre,canton_campus,parroquia_campus
151922,312583,11462126.0,27549707.0,,3.0,520.0,,GUAYAS,GUAYAQUIL,"GUAYAQUIL, CABECERA CANTONAL Y CAPITAL PROVINCIAL",7074.0,,,,PRESENCIAL,NOCTURNA,173349.0,22.0,1819571000.0,2022,717.0,PÚBLICA,TECNOLOGIAS DE LA INFORMACION Y LA COMUNICACIO...,,


### Missing Values

Se presenta un análisis de la calidad de los datos. La cobertura temporal se extiende desde el año 2014 hasta el 2022. La base de datos tiene las siguientes variables: 
- **ins_id**: Código de la inscripcion del estudiante en el SNNA
- **prov_nacimiento**, 
- **can_nacimiento**, 
- **parro_nacimiento**, 
- **ued_nombre**, 
- **usu_estado**: Inscripción activa o inactiva, 
- **ins_autoidentificacion**: análoga a étnia, 
- **ins_nacionalidad**, 
- **sexo**, 
- **ued_tipo**: Fiscal, Municipal, Particular, Fiscomisionales 
- **cod_final**: Código final identificador del estudiante., 
- **year**, 
- **usu_estado_civil**, 
- **usu_fecha_nac**, 
- **ins_sexo**: Sexo del inscrito, 
- **ins_fecha**: Fecha de inscripción, 
- **pais_res**: pais de residencia, 
- **provincia_reside**, 
- **canton_reside**, 
- **parroquia_reside**, 
- **recinto**: Nombre de la sede de evaluación

Se evalúa el porcentaje de valores faltantes por variable y por año, y se identifican las variables que tienen más información disponible y las que tienen menos. Se discuten las posibles causas y consecuencias de los valores faltantes y se proponen algunas recomendaciones para mejorar la calidad de los datos. 
El análisis muestra que las variables que tienen más información disponible son `usu_estado, ins_autoidentificacion, cod_final, year, usu_estado_civil, usu_fecha_nac, ins_sexo, ins_fecha, provincia_reside, canton_reside, parroquia_reside`. Estas variables presenteas procentajes de valores faltantes razonables o negligibles para todo el periodo. Sin embargo, se observa que hay una mayor proporción de valores faltantes para los años 2014, 2015 y 2016, lo que puede deberse a problemas en el registro o la digitalización de los datos. Estas variables son importantes para caracterizar el perfil sociodemográfico de los inscritos y para verificar su identidad y su lugar de residencia. 

In [3]:
inscr.isna().groupby(inscr.year).mean() * 100

Unnamed: 0_level_0,Unnamed: 0,ins_id,prov_nacimiento,can_nacimiento,parro_nacimiento,ued_nombre,usu_estado,ins_autoidentificacion,ins_nacionalidad,sexo,ued_tipo,cod_final,year,usu_estado_civil,usu_fecha_nac,ins_sexo,ins_fecha,pais_res,provincia_reside,canton_reside,parroquia_reside,recinto,usu_nacionalidad
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
2014,0.0,0.0,27.830287,27.830287,27.830287,30.587833,0.0,45.332142,96.386665,0.000951,30.745678,0.0,0.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
2015,0.0,0.0,100.0,100.0,100.0,100.0,0.0,51.722934,97.966636,100.0,100.0,0.0,0.0,0.285898,0.266555,0.0,0.0,46.114435,40.445547,40.445547,40.445547,100.0,100.0
2016,0.0,0.0,100.0,100.0,100.0,100.0,0.0,47.492519,96.878824,100.0,100.0,0.0,0.0,0.280495,0.259502,0.0,0.0,42.86346,37.916834,37.916834,37.916834,100.0,100.0
2017,0.0,0.0,100.0,100.0,100.0,100.0,0.0,14.230035,95.982762,100.0,100.0,0.0,0.0,0.408721,0.375369,0.0,0.0,9.253446,2.524916,2.524916,2.524916,100.0,100.0
2018,0.0,0.0,100.0,100.0,100.0,100.0,0.0,3.514915,92.026367,100.0,100.0,0.0,0.0,0.024344,0.020599,0.0,0.0,2.402577,2.402577,2.402577,2.402577,15.568997,100.0
2019,0.0,100.0,100.0,100.0,100.0,100.0,0.0,2.369385,97.041497,100.0,100.0,0.0,0.0,0.064561,0.001614,1.636619,0.0,1.612409,1.612409,1.612409,1.612409,100.0,0.030666
2020,0.0,100.0,100.0,100.0,100.0,100.0,0.0,1.451256,94.80752,100.0,100.0,0.0,0.0,0.131932,0.051831,0.0,0.0,1.472459,1.472459,1.472459,1.472459,100.0,0.0
2021,0.0,0.0,100.0,100.0,100.0,100.0,0.0,4.52993,96.714398,100.0,100.0,0.0,0.0,0.001604,0.0,0.0,0.0,4.545965,4.545965,4.545965,4.545965,100.0,0.0
2022,0.0,0.0,100.0,100.0,100.0,100.0,0.0,3.821546,95.090279,100.0,100.0,0.0,0.0,0.001725,0.0,0.0,0.0,3.849139,3.849139,3.849139,3.849139,100.0,0.0


El análisis descriptivo de la base de datos de asignaciones de cupos a universidades, públicas en su mayoría, fue realizado utilizando las siguientes variables:

- **ins_id**: Código de la inscripcion del estudiante en el SNNA.
- **genero**: sexo del aspirante (masculino o femenino).
- **usu_fecha_nac**: fecha de nacimiento del aspirante.
- **usu_nacionalidad**: nacionalidad del aspirante (ecuatoriana o extranjera).
- **etnia**: grupo étnico al que pertenece el aspirante (afroecuatoriano, indígena, mestizo, montubio, blanco u otro).
- **parroquia_reside**: parroquia donde reside el aspirante.
- **canton_reside**: cantón donde reside el aspirante.
- **provincia_reside**: provincia donde reside el aspirante.
- **pos_id**: identificador único de la postulación.
- **pos_nota**: nota obtenida por el aspirante en el examen de ingreso a la universidad.
- **pos_prioridad**: prioridad asignada a la carrera elegida por el aspirante (de 1 a 5).
- **ies_id**: identificador único de la institución de educación superior (IES) donde se oferta la carrera.
- **ies_siglas_instit**: siglas de la IES donde se oferta la carrera.
- **nombre_institucion**: nombre completo de la IES donde se oferta la carrera.
- **provincia**: provincia donde se ubica la IES donde se oferta la carrera.
- **canton**: cantón donde se ubica la IES donde se oferta la carrera.
- **parroquia**: parroquia donde se ubica la IES donde se oferta la carrera.
- **car_id**: identificador único de la carrera.
- **carrera**: nombre de la carrera.
- **area**: área del conocimiento a la que pertenece la carrera (agropecuaria, artes, ciencias, educación, ingeniería, salud o sociales).
- **modalidad**: modalidad de estudio de la carrera (presencial, semipresencial o a distancia).
- **ofa_id**: identificador único de la oferta académica.
- **cod_final**: código final identificador del estudiante.
- **year**: año en que se realizó la asignación.

Para el año 2022 se observaron una proporción mayor de valores faltantes en las variables parroquia_reside, canton_reside y provincia_reside, en los órdenes del 15%. Estas variables podrían tener un mecanismo de falta de datos aleatorio (MAR), es decir, que la razón por la cual faltan los datos depende de otras variables observadas. Por ejemplo, podría haber una relación entre la fecha de nacimiento y la provincia de residencia, o entre la nota y el cantón de residencia. Estas relaciones podrían afectar la precisión y la validez del análisis y el modelado si no se tratan adecuadamente.
Se identificaron tres variables con más valores faltantes: etnia, parroquia_reside e ies_siglas_instit. Ies_siglas_instit en particular cuenta con varios años en los que hasta un cuarto de los datos pueden ser faltantes.

In [7]:
asig.isna().groupby(asig.year).mean() * 100

Unnamed: 0_level_0,Unnamed: 0,ins_id,genero,usu_fecha_nac,usu_nacionalidad,etnia,parroquia_reside,canton_reside,provincia_reside,pos_id,pos_nota,pos_prioridad,ies_id,ies_siglas_instit,nombre_institucion,provincia,canton,parroquia,car_id,carrera,area,modalidad,ofa_id,cod_final,year
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
2014,0.0,0.0,0.0,0.25872,1.834142,18.928159,2.845923,2.845923,2.845923,0.0,0.0,0.0,0.0,20.47124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015,0.0,0.0,36.258331,0.032177,4.169157,6.688118,3.608366,3.608366,3.608366,0.0,0.0,0.0,0.0,21.475523,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2016,0.0,0.0,6.868709,0.025808,3.882314,7.495484,5.493493,5.493493,5.493493,0.0,0.0,0.0,0.0,18.264941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2017,0.0,0.0,2.704247,0.019044,1.266425,4.946677,3.62788,3.62788,3.62788,0.0,0.0,0.0,0.0,15.339935,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018,0.0,0.0,2.198892,0.018024,1.063398,4.442842,3.519128,3.519128,3.519128,0.0,0.0,0.0,0.0,17.072951,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2019,0.0,0.0,1.415042,0.003714,1.110492,4.375116,3.524605,3.524605,3.524605,0.0,0.0,0.0,0.0,16.300836,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020,0.0,0.0,0.643699,0.004048,1.044492,3.708352,3.068702,3.068702,3.068702,0.0,0.0,0.0,0.0,17.84948,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021,0.0,0.0,0.596083,0.0,0.714353,3.098685,2.776989,2.776989,2.776989,0.0,0.0,0.0,0.0,18.19472,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022,0.0,0.0,0.303712,0.0,0.412448,1.507312,14.964379,14.964379,14.964379,0.0,0.0,0.0,0.0,17.217848,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Para la tabla de postulacion se cuentan con las variables:
- **ins_id**: Código de la inscripcion del estudiante en el SNNA.
- **pos_id**: Es el identificador único de la postulación realizada por el aspirante a una carrera.
- **pos_nota**:: Es la nota final obtenida por el aspirante en la postulación, que se calcula en base a su puntaje Ser Bachiller y su nota de grado.
- **pos_prioridad**: Es el orden de preferencia que el aspirante asignó a la carrera en su postulación, siendo 1 la más alta y 5 la más baja.
- **ies_id**: Es el mismo que ins_id, se repite por cuestiones de compatibilidad con otras bases de datos.
- **nombre_institucion**: Es el nombre oficial de la IES que ofrece la carrera.
- **provincia**: Es la provincia donde se ubica el campus principal de la IES que ofrece la carrera.
- **canton**: Es el cantón donde se ubica el campus principal de la IES que ofrece la carrera.
- **parroquia**: Es la parroquia donde se ubica el campus principal de la IES que ofrece la carrera.
- **car_id**: Es el identificador único de la carrera ofertada por la IES.
- **carrera**: Es el nombre oficial de la carrera ofertada por la IES.
- **area**: Es el área del conocimiento a la que pertenece la carrera, según la clasificación del Consejo de Educación Superior (CES).
- **subarea**: Es la subárea del conocimiento a la que pertenece la carrera, según la clasificación del CES.
- **modalidad**: Es la modalidad de estudio de la carrera, que puede ser presencial, semipresencial o a distancia.
- **jornada**: Es el horario en el que se imparte la carrera, que puede ser matutina, vespertina o nocturna.
- **ofa_id**: Es el identificador único de la oferta académica realizada por la IES para una determinada carrera, periodo y modalidad.
- **per_id**: Es el identificador único del periodo académico en el que se realizó la postulación.
- **cod_final**: Código final identificador del estudiante.
- **year**: Es el año en el que se realizó la postulación, que coincide con los primeros cuatro dígitos del per_id.
- **nota_postula**: Es lo mismo que pos_nota, se repite por cuestiones de compatibilidad con otras bases de datos.
- **ies_tipo_financiamiento**: Es el tipo de financiamiento que recibe la IES por parte del Estado, que puede ser pública, cofinanciada o privada.
- **subarea_nombre**: Es lo mismo que subarea, se repite por cuestiones de compatibilidad con otras bases de datos.
- **canton_campus**: Es el cantón donde se ubica el campus al que pertenece la oferta académica de la carrera. Puede ser diferente al canton del campus principal de la IES.
- **parroquia_campus**: Es la parroquia donde se ubica el campus al que pertenece la oferta académica de la carrera. Puede ser diferente a la parroquia del campus principal de la IES.

Variables pos_nota, nombre_instituciones, carrera, area, subarea, canton_campus, parrouia campus están compuestas por valores faltantes principalmente presentando la mayoria porcentajes de faltantes sobre el 90% para varios años. La tabla comparte muchas variables con aquella de asignaciones.

In [8]:
post.isna().groupby(post.year).mean()
# to remove ['canton_campus', 'parroquia_campus']

Unnamed: 0_level_0,Unnamed: 0,ins_id,pos_id,pos_nota,pos_prioridad,ies_id,nombre_institucion,provincia,canton,parroquia,car_id,carrera,area,subarea,modalidad,jornada,ofa_id,per_id,cod_final,year,nota_postula,ies_tipo_financiamiento,subarea_nombre,canton_campus,parroquia_campus
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
2014,0.0,4e-06,4e-06,1.0,4e-06,4e-06,1.0,1.0,1.0,1.0,4e-06,1.0,1.0,1.0,4e-06,4e-06,4e-06,4e-06,4e-06,0.0,4e-06,4e-06,4e-06,4e-06,4e-06
2015,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2016,0.0,0.0,0.001467,0.998537,4e-06,4e-06,0.998537,4e-06,4e-06,4e-06,4e-06,0.998537,0.998537,0.998537,4e-06,4e-06,4e-06,4e-06,4e-06,0.0,0.001467,0.001467,0.001467,1.0,1.0
2017,0.0,0.0,0.0,0.999596,0.0,0.0,0.999596,0.0,0.0,0.0,0.0,0.999596,0.999596,0.999596,0.0,0.0,0.0,0.0,0.0,0.0,0.000404,0.000404,0.000404,1.0,1.0
2018,0.0,0.0,0.0,0.997432,0.0,0.0,0.997432,0.0,0.0,0.0,0.0,0.997432,0.997432,0.997432,0.0,0.0,0.0,0.0,0.0,0.0,0.002568,0.002568,0.002568,1.0,1.0
2019,0.0,0.0,0.0,0.996711,0.0,0.0,0.996711,0.0,0.0,0.0,0.0,0.996711,0.996711,0.996711,0.0,0.0,0.0,0.0,0.0,0.0,0.003289,0.003289,0.003289,1.0,1.0
2020,0.0,0.0,0.0,0.995024,0.0,0.0,0.995024,0.0,0.0,0.0,0.0,0.995024,0.995024,0.995024,0.0,0.0,0.0,0.0,0.0,0.0,0.004976,0.004976,0.004976,1.0,1.0
2021,0.0,0.004124,0.0,0.995876,0.0,0.0,0.995876,0.0,0.0,0.0,0.0,0.995876,0.995876,0.995876,0.0,0.0,0.0,0.0,0.0,0.0,0.004124,0.004124,0.004124,1.0,1.0
2022,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0


In [10]:
inscr.isna().groupby(inscr.provincia_reside).mean() * 100

Unnamed: 0_level_0,Unnamed: 0,ins_id,prov_nacimiento,can_nacimiento,parro_nacimiento,ued_nombre,usu_estado,ins_autoidentificacion,ins_nacionalidad,sexo,ued_tipo,cod_final,year,usu_estado_civil,usu_fecha_nac,ins_sexo,ins_fecha,pais_res,provincia_reside,canton_reside,parroquia_reside,recinto,usu_nacionalidad
provincia_reside,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
AZUAY,0.0,13.379164,100.0,100.0,100.0,100.0,0.0,5.356643,98.68728,100.0,100.0,0.0,0.0,0.208418,0.192864,0.034218,0.0,1.303388,0.0,0.0,0.0,89.432918,73.114132
BOLIVAR,0.0,15.139626,100.0,100.0,100.0,100.0,0.0,6.704362,83.429495,100.0,100.0,0.0,0.0,0.34618,0.150012,0.103854,0.0,2.977152,0.0,0.0,0.0,90.722363,68.520655
CARCHI,0.0,12.935826,100.0,100.0,100.0,100.0,0.0,5.036214,98.51777,100.0,100.0,0.0,0.0,0.101061,0.084218,0.033687,0.0,0.252653,0.0,0.0,0.0,88.714839,70.136433
CAÑAR,0.0,12.216339,100.0,100.0,100.0,100.0,0.0,8.472012,91.326273,100.0,100.0,0.0,0.0,0.126072,0.113464,0.08825,0.0,3.32829,0.0,0.0,0.0,91.048916,76.727181
CHIMBORAZO,0.0,13.373057,100.0,100.0,100.0,100.0,0.0,4.23859,74.402605,100.0,100.0,0.0,0.0,0.073715,0.061429,0.024572,0.0,0.903004,0.0,0.0,0.0,87.923091,70.379016
COTOPAXI,0.0,13.486306,100.0,100.0,100.0,100.0,0.0,6.225747,90.332377,100.0,100.0,0.0,0.0,0.449918,0.421799,0.044992,0.0,2.722007,0.0,0.0,0.0,88.915134,69.855464
EL ORO,0.0,16.325429,100.0,100.0,100.0,100.0,0.0,11.244342,99.661261,100.0,100.0,0.0,0.0,0.314758,0.191852,0.164873,0.0,6.385083,0.0,0.0,0.0,96.336821,66.308942
ESMERALDAS,0.0,15.341857,100.0,100.0,100.0,100.0,0.0,15.084138,98.155541,100.0,100.0,0.0,0.0,0.333519,0.313305,0.101066,0.0,10.339077,0.0,0.0,0.0,97.796756,72.424074
GALAPAGOS,0.0,16.55481,100.0,100.0,100.0,100.0,0.0,21.029083,90.82774,100.0,100.0,0.0,0.0,1.006711,0.0,0.0,0.0,17.449664,0.0,0.0,0.0,99.888143,69.686801
GUAYAS,0.0,15.22887,100.0,100.0,100.0,100.0,0.0,12.542324,99.225991,100.0,100.0,0.0,0.0,0.208448,0.167711,0.114805,0.0,7.289859,0.0,0.0,0.0,97.212405,67.268908


In [11]:
asig.isna().groupby(asig.provincia_reside).mean() * 100

Unnamed: 0_level_0,Unnamed: 0,ins_id,genero,usu_fecha_nac,usu_nacionalidad,etnia,parroquia_reside,canton_reside,provincia_reside,pos_id,pos_nota,pos_prioridad,ies_id,ies_siglas_instit,nombre_institucion,provincia,canton,parroquia,car_id,carrera,area,modalidad,ofa_id,cod_final,year
provincia_reside,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
AZUAY,0.0,0.0,0.713743,0.030372,0.713743,1.822323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.835232,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BOLIVAR,0.0,0.0,2.324263,0.028345,0.595238,1.643991,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.002268,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CARCHI,0.0,0.0,0.746269,0.0,1.931519,1.141352,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.856892,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CAÃAR,0.0,0.0,1.694915,0.0,0.0,0.338983,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.745763,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CAÑAR,0.0,0.0,4.90566,0.04717,0.707547,2.54717,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.54717,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CHIMBORAZO,0.0,0.0,0.942127,0.0,0.794078,1.359354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.302826,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
COTOPAXI,0.0,0.0,2.492629,0.013401,0.830876,1.648352,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.438756,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
EL ORO,0.0,0.0,7.875458,0.011447,0.538004,3.262363,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.332875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ESMERALDAS,0.0,0.0,12.51462,0.133668,1.203008,5.647452,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0401,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GALAPAGOS,0.0,0.0,15.384615,0.0,1.775148,3.550296,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.100592,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
post.isna().groupby(post.provincia).mean() * 100
# to remove ['Unnamed: 0', 'ins_id', 'pos_nota', 'nombre_institucion', 'carrera', 'area', 'subarea', 'canton_campus', 'parroquia_campus']

Unnamed: 0_level_0,Unnamed: 0,ins_id,pos_id,pos_nota,pos_prioridad,ies_id,nombre_institucion,provincia,canton,parroquia,car_id,carrera,area,subarea,modalidad,jornada,ofa_id,per_id,cod_final,year,nota_postula,ies_tipo_financiamiento,subarea_nombre,canton_campus,parroquia_campus
provincia,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
AZUAY,0.0,0.026101,0.0,99.946555,0.0,0.0,99.946555,0.0,0.0,0.0,0.0,99.946555,99.946555,99.946555,0.0,0.0,0.0,0.0,0.0,0.0,0.053445,0.053445,0.053445,100.0,100.0
BOLIVAR,0.0,0.43397,0.100147,99.205501,0.0,0.0,99.205501,0.0,0.0,0.0,0.0,99.205501,99.205501,99.205501,0.0,0.0,0.0,0.0,0.0,0.0,0.794499,0.794499,0.794499,100.0,100.0
CARCHI,0.0,0.560886,0.118081,98.99631,0.0,0.0,98.99631,0.0,0.0,0.0,0.0,98.99631,98.99631,98.99631,0.0,0.0,0.0,0.0,0.0,0.0,1.00369,1.00369,1.00369,100.0,100.0
CAÑAR,0.0,0.206897,0.0,98.724138,0.0,0.0,98.724138,0.0,0.0,0.0,0.0,98.724138,98.724138,98.724138,0.0,0.0,0.0,0.0,0.0,0.0,1.275862,1.275862,1.275862,100.0,100.0
CHIMBORAZO,0.0,0.125837,0.004535,99.621354,0.0,0.0,99.621354,0.0,0.0,0.0,0.0,99.621354,99.621354,99.621354,0.0,0.0,0.0,0.0,0.0,0.0,0.378646,0.378646,0.378646,100.0,100.0
COTOPAXI,0.0,0.179344,0.036924,99.356472,0.0,0.0,99.356472,0.0,0.0,0.0,0.0,99.356472,99.356472,99.356472,0.0,0.0,0.0,0.0,0.0,0.0,0.643528,0.643528,0.643528,100.0,100.0
EL ORO,0.0,0.0,0.0,99.971831,0.0,0.0,99.971831,0.0,0.0,0.0,0.0,99.971831,99.971831,99.971831,0.0,0.0,0.0,0.0,0.0,0.0,0.028169,0.028169,0.028169,100.0,100.0
ESMERALDAS,0.0,0.065644,0.02073,99.827253,0.0,0.0,99.827253,0.0,0.0,0.0,0.0,99.827253,99.827253,99.827253,0.0,0.0,0.0,0.0,0.0,0.0,0.172747,0.172747,0.172747,100.0,100.0
GALAPAGOS,0.0,0.0,0.0,100.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,100.0,100.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,100.0
GUAYAS,0.0,0.0,0.013258,99.894082,0.0,0.0,99.894082,0.0,0.0,0.0,0.0,99.894082,99.894082,99.894082,0.0,0.0,0.0,0.0,0.0,0.0,0.105918,0.105918,0.105918,100.0,100.0


### Memory Usage

In [18]:
(inscr
 .memory_usage(deep=True)
 .apply(lambda s_: s_ / 1000000) # in megabytes
 .pipe(lambda df_: pd.concat([df_, inscr.dtypes], axis = 1))
 .rename(columns={0:'memory', 1:'dtype'})
)

Unnamed: 0,memory,dtype
Index,0.000128,
Unnamed: 0,7.77672,int64
ins_id,7.77672,float64
prov_nacimiento,33.631242,object
can_nacimiento,33.954511,object
parro_nacimiento,34.265398,object
ued_nombre,35.296483,object
usu_estado,56.38122,object
ins_autoidentificacion,54.169021,object
ins_nacionalidad,32.537719,object


In [20]:
(asig
 .memory_usage(deep=True)
 .apply(lambda s_: s_ / 1000000) # in megabytes
 .pipe(lambda df_: pd.concat([df_, asig.dtypes], axis = 1))
 .rename(columns={0:'memory', 1:'dtype'})
)

Unnamed: 0,memory,dtype
Index,0.000128,
Unnamed: 0,1.705232,int64
ins_id,1.705232,int64
genero,13.565072,object
usu_fecha_nac,14.17436,object
usu_nacionalidad,14.353723,object
etnia,13.653735,object
parroquia_reside,14.972927,object
canton_reside,14.312466,object
provincia_reside,13.585541,object


In [21]:
(post
 .memory_usage(deep=True)
 .apply(lambda s_: s_ / 1000000) # in megabytes
 .pipe(lambda df_: pd.concat([df_, post.dtypes], axis = 1))
 .rename(columns={0:'memory', 1:'dtype'})
)

Unnamed: 0,memory,dtype
Index,0.000128,
Unnamed: 0,18.911728,int64
ins_id,18.911728,float64
pos_id,18.911728,float64
pos_nota,18.911728,float64
pos_prioridad,18.911728,float64
ies_id,18.911728,float64
nombre_institucion,75.931656,object
provincia,143.616062,object
canton,154.683776,object


### The Data

In [22]:
inscr.head().T

Unnamed: 0,0,1,2,3,4
Unnamed: 0,515577,204466,105318,149239,296742
ins_id,6754893.0,10259568.0,3097213.0,7276542.0,10149519.0
prov_nacimiento,,,,,
can_nacimiento,,,,,
parro_nacimiento,,,,,
ued_nombre,,,,,
usu_estado,A,A,A,A,A
ins_autoidentificacion,,Montubio/a,MESTIZO,MESTIZO,Mestizo/a
ins_nacionalidad,,,,,
sexo,,,,,


In [24]:
asig.head().T

Unnamed: 0,0,1,2,3,4
Unnamed: 0,26849,131904,87874,73716,22964
ins_id,4901892,8511824,10113770,4666207,7897913
genero,FEMENINO,MASCULINO,MASCULINO,FEMENINO,MASCULINO
usu_fecha_nac,26962.0,2002/05/03 00:00:00.000000000,36850,36979.0,37073.0
usu_nacionalidad,ECUATORIANA,ECUATORIANA,ECUATORIANA,ECUATORIANA,ECUATORIANA
etnia,Afrodescendiente,Mestizo,Mestizo/a,Montubio,Mestizo
parroquia_reside,TACHINA,ALLURIQUIN,CHIMBACALLE,BABA,ARENILLAS
canton_reside,ESMERALDAS,SANTO DOMINGO,DISTRITO METROPOLITANO DE QUITO,BABA,ARENILLAS
provincia_reside,ESMERALDAS,SANTO DOMINGO DE LOS TSACHILAS,PICHINCHA,LOS RIOS,EL ORO
pos_id,11048658,20617572,23877079,11142890,16776209


In [25]:
post.head().T

Unnamed: 0,0,1,2,3,4
Unnamed: 0,577709,31007,24056,68210,357279
ins_id,10400375.0,11282068.0,8118129.0,7431405.0,6333408.0
pos_id,24510605.0,24932685.0,20807061.0,18308201.0,13608229.0
pos_nota,,,,,
pos_prioridad,5.0,1.0,1.0,3.0,4.0
ies_id,82.0,84.0,494.0,46.0,51.0
nombre_institucion,,,,,
provincia,TUNGURAHUA,COTOPAXI,EL ORO,PICHINCHA,GUAYAS
canton,AMBATO,LA MANA,MACHALA,DISTRITO METROPOLITANO DE QUITO,GUAYAQUIL
parroquia,LA MERCED,"LA MANA, CABECERA CANTONAL",MACHALA,"QUITO DISTRITO METROPOLITANO, CABECERA CANTONA...","GUAYAQUIL, CABECERA CANTONAL Y CAPITAL PROVINCIAL"


### Numeric Types

In [10]:
def check_memory(df):
    print(df.memory_usage(deep=True).sum() / 1000000) # in megabytes
    return df

def get_shape(df):
    print(df.shape)
    return df

(inscr
 .select_dtypes(include=['int', 'float'])
 .loc[~inscr.year.isin([2014, 2015, 2016]), ]
 .drop(columns=['Unnamed: 0', 'ins_id'])
 .pipe(check_memory)
 .assign(year=inscr.year.astype('uint16'),
         cod_final=inscr.cod_final.astype('uint32'))
 .pipe(check_memory)
 .sample(5)
)

10.34568
6.03498


Unnamed: 0,cod_final,year
498057,2508900110,2020
432674,2007000901,2022
663509,2255603378,2020
414158,2301973687,2021
106496,2468311301,2020


In [22]:
(asig
 .select_dtypes(include=['int', 'float'])
 .loc[~asig.year.isin([2014, 2015, 2016]), ]
 .drop(columns=['Unnamed: 0', 'ins_id', ])
 .pipe(check_memory)
 .assign(pos_prioridad=asig.pos_prioridad.astype('uint8'),
        **{c:lambda df_, c=c:df_[c].astype('uint16') for c in ['year', 'ies_id', 'pos_nota']},
        **{c:lambda df_, c=c:df_[c].astype('uint32') for c in ['pos_id', 'ofa_id', 'cod_final', 'car_id']}
        )
 .pipe(check_memory)
 .sample(5)
)

10.269432
4.421561


Unnamed: 0,pos_id,pos_nota,pos_prioridad,ies_id,car_id,ofa_id,cod_final,year
106422,20153695,713,1,333,5113,150479,2223841756,2019
21837,21671209,693,1,102,4515,153710,1958751338,2020
130509,14548255,976,4,46,4938,86501,1837691701,2017
148243,20540751,696,1,59,5205,146364,2405301329,2019
179052,15009586,981,1,46,4781,85146,1637431274,2017


In [220]:
(post
 .select_dtypes(include=['int', 'float'])
 .loc[~post.year.isin([2014, 2015, 2016])]
 .drop(columns=['Unnamed: 0', 'ins_id', 'pos_nota'])
 .loc[~post.year.isin([2014, 2015, 2016]), ]
 .dropna()
 .pipe(check_memory)
 .assign(pos_prioridad= lambda df_: df_.pos_prioridad.astype('uint8'),
        **{c: lambda df_, c=c: df_[c].astype('uint16') for c in ['ies_id', 'year', 'nota_postula']},
        **{c: lambda df_, c=c: df_[c].astype('uint32') for c in ['cod_final', 'per_id', 'ofa_id', 'car_id', 'pos_id']})
 .pipe(check_memory)
 .sample(5)
)

120.86704
52.87933


Unnamed: 0,pos_id,pos_prioridad,ies_id,car_id,ofa_id,per_id,cod_final,year,nota_postula
1932240,28755215,4,46,7202,173761,22,2742031710,2022,801
1636026,16578907,5,46,5455,104076,18,2202881710,2018,786
1883298,28784635,3,46,4478,179105,22,2729531701,2022,830
2186427,18066195,5,521,5297,100726,18,2248401001,2018,660
1397438,23625731,3,101,5455,151413,20,2541680874,2020,687


In [10]:
for size in ['uint8', 'uint16', 'uint32', 'int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']:
    try:
        print(f'{size=} {np.iinfo(size)}')
    except:
        print(f'{size=} {np.finfo(size)}')

size='uint8' Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

size='uint16' Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------

size='uint32' Machine parameters for uint32
---------------------------------------------------------------
min = 0
max = 4294967295
---------------------------------------------------------------

size='int8' Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

size='int16' Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

size='int32' Machine parameters fo

### Non Numeric Types
The Object data types for strings, categoricals and mixed dtypes

In [11]:
with open('../utils.json') as f:
    dd = json.load(f)

(inscr
 .pipe(get_shape)
 .pipe(check_memory)
 .loc[~inscr.year.isin([2014, 2015, 2016]), ]
 .drop(columns=['prov_nacimiento', 'recinto',
                'parro_nacimiento', 'ued_nombre', 
                'ins_nacionalidad', 'sexo', 
                'can_nacimiento', 'ued_tipo',
                'usu_nacionalidad', 'usu_estado'])
 .select_dtypes('object')
 .drop(columns=[c for c in inscr.columns if 'fec' in c]) # not going to deal with datetime columns now
 .applymap(lambda tx: unidecode(tx.lower()) if isinstance(tx, str) else tx)
 .assign(ins_autoidentificacion=lambda df_:(df_
                                            .ins_autoidentificacion
                                            .replace(regex=dd['etnias'])
                                            .astype('category')),
        ins_sexo=lambda df_:(df_
                             .ins_sexo
                             .replace(to_replace='sin dato', value=np.nan)
                             .astype('category')),
        pais_res=lambda df_:(df_
                             .pais_res
                             .replace('china república popular (pekin)', 'china')
                             .astype('category')),
        parroquia_reside=lambda df_:(df_
                                     .parroquia_reside
                                     .replace(to_replace=[', ', 'cabecera cantonal', 
                                                          ' y ', 'capital provincial', 
                                                          'de la republica del ecuador'], value='', regex=True)
                                     .astype('category')),
        **{c:lambda df_, c=c: df_[c].astype('category') for c in ['usu_estado_civil', 'provincia_reside', 'canton_reside']})
 .pipe(get_shape)
 .pipe(check_memory)
 .sample(5)
)

(972090, 23)
908.055865
(431070, 7)
7.468855


Unnamed: 0,ins_autoidentificacion,usu_estado_civil,ins_sexo,pais_res,provincia_reside,canton_reside,parroquia_reside
696882,montuvio,s,mujer,ecuador,el oro,el guabo,el guabo
814456,afroecuatoriano,s,hombre,ecuador,guayas,guayaquil,febres cordero
368793,mestizo,s,mujer,ecuador,pichincha,distrito metropolitano de quito,calderon (carapungo)
795271,afroecuatoriano,s,mujer,ecuador,guayas,guayaquil,febres cordero
16146,mestizo,s,hombre,ecuador,pichincha,distrito metropolitano de quito,la magdalena


In [118]:
(asig
 .pipe(get_shape)
 .pipe(check_memory)
 .loc[~asig.year.isin([2014, 2015, 2016]), ]
 .drop(columns=['parroquia_reside', 'canton_reside', 
                'provincia_reside',  'ies_siglas_instit',
                'usu_nacionalidad'])
 .select_dtypes('object')
 .drop(columns=[c for c in asig.columns if 'fec' in c]) # not going to deal with datetime columns now
 .rename(columns={'provincia':'provincia_ies', 
                  'canton':'canton_ies', 
                  'parroquia':'parroquia_ies',
                  'nombre_institucion':'nombre_ies'})
 .applymap(lambda tx: unidecode(tx.lower()) if isinstance(tx, str) else tx)
 .assign(etnia=lambda df_:(df_.
                           etnia.
                           replace(regex=dd['etnias']).
                           astype('category')),
        provincia_ies=lambda df_:(df_
                                  .provincia_ies
                                  .replace(to_replace='caaar', value='canar')
                                  .astype('category')),
        parroquia_ies=lambda df_:(df_
                                  .parroquia_ies
                                  .replace(to_replace=[', ', 'cabecera cantonal', 
                                                       ' y ', 'capital provincial', 
                                                       'de la republica del ecuador'], value='', regex=True)
                                  .astype('category')),
        area=lambda df_:(df_
                         .area
                         .replace(dd['area'])
                         .astype('category')),
        **{c:lambda df_, c=c: df_[c].astype('category') for c in ['nombre_ies', 'genero', 
                                                                  'canton_ies', 'carrera', 
                                                                  'modalidad']})
 .pipe(get_shape)
 .pipe(check_memory)
 .sample(5)
)

(213154, 25)
250.482386
(142631, 9)
2.867982


Unnamed: 0,genero,etnia,nombre_ies,provincia_ies,canton_ies,parroquia_ies,carrera,area,modalidad
196292,femenino,mestizo,instituto tecnologico superior gran colombia,pichincha,distrito metropolitano de quito,inaquito,diseno de modas con nivel equivalente a tecnol...,artes y humanidades,presencial
140130,femenino,,universidad de las fuerzas armadas (espe),cotopaxi,latacunga,latacunga,contabilidad y auditoria,administracion,presencial
129961,masculino,mestizo,universidad tecnica de manabi,manabi,portoviejo,portoviejo,pedagogia de los idiomas nacionales y extranje...,educacion,presencial
157394,masculino,mestizo,instituto superior tecnologico tsa'chila,santo domingo de los tsachilas,santo domingo,santo domingo de los colorados,tecnologia superior en logistica y transporte,servicios,presencial
52203,femenino,mestizo,universidad estatal de bolivar,bolivar,guaranda,guaranda,comunicacion,"ciencias sociales, periodismo, informacion y d...",presencial


In [119]:
(post
 .pipe(get_shape)
 .pipe(check_memory)
 .loc[~post.year.isin([2014, 2015, 2016]), ]
 .drop(columns=['nombre_institucion', 'area',
                'carrera',  'parroquia_campus',
                'subarea', 'canton_campus'])
 .select_dtypes('object')
 .applymap(lambda tx: unidecode(tx.lower()) if isinstance(tx, str) else tx)
 .assign(parroquia=lambda df_:(df_
                               .parroquia 
                               .replace(to_replace=[', ', 'cabecera cantonal', 
                                               ' y ', 'capital provincial', 
                                               'de la republica del ecuador'], 
                                        value='', 
                                        regex=True)
                               .astype('category')),
         **{c:lambda df_, c=c:df_[c].astype('category') for c in ['provincia', 'canton', 'modalidad', 
                                                                  'jornada', 'ies_tipo_financiamiento', 
                                                                  'subarea_nombre']})
 .pipe(get_shape)
 .pipe(check_memory)
 .sample(5)
)

(2363966, 25)
1939.398539
(1514498, 7)
22.742611


Unnamed: 0,provincia,canton,parroquia,modalidad,jornada,ies_tipo_financiamiento,subarea_nombre
1864972,imbabura,san miguel de urcuqui,urcuqui,presencial,vespertina,publica,ingenieria y profesiones afines
870328,guayas,guayaquil,guayaquil,presencial,vespertina,publica,ingenieria y profesiones afines
135163,manabi,portoviejo,portoviejo,en linea,no aplica jornada,publica,ciencias sociales y del comportamiento
2361446,azuay,cuenca,cuenca.,presencial,intensiva,publica,salud
582888,guayas,guayaquil,guayaquil,presencial,nocturna,publica,derecho


### Datetime types

In [184]:
(inscr
 .loc[~inscr.year.isin([2014, 2015, 2016]), ]
 .drop(columns=[c for c in inscr.columns if 'fec' not in c])
 .assign(usu_fecha_nac=lambda df_:(df_
                                   .usu_fecha_nac
                                   .where(inscr.usu_fecha_nac.str.contains('/'))
                                   .str
                                   .replace(pat=r'\s\d\d?:\d\d(:\d\d)?', repl='', regex=True)
                                   .pipe(lambda s: pd.to_datetime(s, format='%d/%m/%Y'))),
        ins_fecha=lambda df_:(df_
                              .ins_fecha
                              .where(inscr.ins_fecha.str.contains('/'))
                              .str
                              .replace(pat=r'\s\d\d?:\d\d(:\d\d)?', repl='', regex=True)
                              .pipe(lambda s: pd.to_datetime(s, format='%d/%m/%Y'))),)
 .sample(15)
)

Unnamed: 0,usu_fecha_nac,ins_fecha
38021,2001-04-05,2018-11-20
323518,1995-02-09,2019-03-01
454482,1995-09-06,2019-03-01
294943,1997-05-17,2019-03-01
709489,1997-10-23,2021-01-15
788412,2000-04-07,2019-03-01
818974,2001-05-18,2018-11-22
661948,1996-12-24,2018-11-22
489078,2001-11-07,2021-07-07
896395,2001-07-30,2019-05-01


In [162]:
(asig
 .loc[~asig.year.isin([2014, 2015, 2016])]
 .drop(columns=[c for c in asig.columns if 'fec' not in c])
 .assign(usu_fecha_nac=lambda df_:(df_
                                   .usu_fecha_nac
                                   .where(df_.usu_fecha_nac.str.contains('/'))
                                   .str
                                   .replace(pat=r'\s\d\d?:\d\d(:\d\d)?', repl='', regex=True)))
 .isna()
 .mean() * 100
)

usu_fecha_nac    81.123318
dtype: float64

### Merging Data

### Tweak Inscripciones

In [12]:
def tweak_inscripciones(df):
    """
    This function transforms or tweaks the table `inscripciones` from the senecyt database. Specifically, \
    it does the following:
    #. drops columns and indeces where data is missing by unaccepted margins
    #. all text data is decoded and made lowercase
    #. numeric variables' data types are downgraded to save memory
    #. non-numeric variables' data types are converted to categorical data type. Also, mistypes or erroneous values are replaces
    #. datetime variables are converted to true datetime data type after being cleaned for consistency
    .. note::
        The function may be altered and is under revision since many of the origional \
        decisions regarding the selection of variables, variable inclusion and creation \
        are being constantly revised
    :return: pd.DataFrame df: tweaked table `inscripciones`
    """
    return (df
            .loc[~df.year.isin([2014, 2015, 2016]), ]
            .drop(columns=['prov_nacimiento', 'recinto',
                'parro_nacimiento', 'ued_nombre', 
                'ins_nacionalidad', 'sexo', 
                'can_nacimiento', 'ued_tipo',
                'usu_nacionalidad', 'usu_estado',
                'Unnamed: 0', 'ins_id'])
            .applymap(lambda tx: unidecode(tx.lower()) if isinstance(tx, str) else tx)
            .assign(year=df.year.astype('uint16'),
                    cod_final=lambda df_:df_.cod_final.astype('uint32'),
                    ins_autoidentificacion=lambda df_:(df_
                                                       .ins_autoidentificacion
                                                       .replace(regex=dd['etnias'])
                                                       .astype('category')),
                    ins_sexo=lambda df_:(df_
                                         .ins_sexo
                                         .replace(to_replace='sin dato', value=np.nan)
                                         .astype('category')),
                    pais_res=lambda df_:(df_
                                         .pais_res
                                         .replace('china república popular (pekin)', 'china')
                                         .astype('category')),
                    parroquia_reside=lambda df_:(df_
                                                 .parroquia_reside
                                                 .replace(to_replace=[', ', 'cabecera cantonal', 
                                                                      ' y ', 'capital provincial', 
                                                                      'de la republica del ecuador'], value='', regex=True)
                                                 .astype('category')),
                    **{c:lambda df_, c=c: df_[c].astype('category') for c in ['usu_estado_civil', 'provincia_reside', 'canton_reside']},
                    usu_fecha_nac=lambda df_:(df_
                                              .usu_fecha_nac
                                              .where(df_.usu_fecha_nac.str.contains('/'))
                                              .str
                                              .replace(pat=r'\s\d\d?:\d\d(:\d\d)?', repl='', regex=True)
                                              .pipe(lambda s: pd.to_datetime(s, format='%d/%m/%Y'))),
                    ins_fecha=lambda df_:(df_
                                          .ins_fecha
                                          .where(df_.ins_fecha.str.contains('/'))
                                          .str
                                          .replace(pat=r'\s\d\d?:\d\d(:\d\d)?', repl='', regex=True)
                                          .pipe(lambda s: pd.to_datetime(s, format='%d/%m/%Y'))),)
                   )

In [13]:
df_ii = tweak_inscripciones(inscr)
df_ii.head().T

Unnamed: 0,0,1,3,4,6
ins_autoidentificacion,,montuvio,mestizo,mestizo,mestizo
cod_final,1323421310,1076491301,2031921374,2496762147,585017191
year,2017,2021,2018,2021,2020
usu_estado_civil,s,s,s,s,s
usu_fecha_nac,1999-05-22 00:00:00,1998-01-13 00:00:00,2000-01-12 00:00:00,2002-01-18 00:00:00,NaT
ins_sexo,,mujer,hombre,mujer,mujer
ins_fecha,2019-03-01 00:00:00,2021-01-15 00:00:00,2019-04-29 00:00:00,2021-01-15 00:00:00,NaT
pais_res,,ecuador,ecuador,ecuador,ecuador
provincia_reside,manabi,manabi,manabi,sucumbios,pichincha
canton_reside,junin,portoviejo,manta,lago agrio,distrito metropolitano de quito


In [None]:
tweak_inscripciones?

### Tweak Asignaciones

In [14]:
def tweak_asignaciones(df):
    """
    This function transforms or tweaks the table `asignaciones` from the senecyt database. Specifically, \
    it does the following:
    #. drops columns and indeces where data is missing by unaccepted margins
    #. columns refering the geographical location of universities are renamed to not be confounded with geographical location of students
    #. all text data is decoded and made lowercase
    #. numeric variables' data types are downgraded to save memory
    #. non-numeric variables' data types are converted to categorical data type. Also, mistypes or erroneous values are replaces
    .. note::
        The function may be altered and is under revision since many of the origional \
        decisions regarding the selection of variables, variable inclusion and creation \
        are being constantly revised
    :return: pd.DataFrame df: tweaked table `asignaciones`
    """
    return (df
            .loc[~df.year.isin([2014, 2015, 2016]), ]
            .drop(columns=['usu_fecha_nac', 'Unnamed: 0', 'ins_id', 
                           'parroquia_reside', 'canton_reside', 
                           'provincia_reside',  'ies_siglas_instit', 
                           'usu_nacionalidad'])
            .rename(columns={'provincia':'provincia_ies', 
                             'canton':'canton_ies', 
                             'parroquia':'parroquia_ies',
                             'nombre_institucion':'nombre_ies'})
            .applymap(lambda tx: unidecode(tx.lower()) if isinstance(tx, str) else tx)
            .assign(pos_prioridad=asig.pos_prioridad.astype('uint8'),
                    **{c:lambda df_, c=c:df_[c].astype('uint16') for c in ['year', 'ies_id', 'pos_nota']},
                    **{c:lambda df_, c=c:df_[c].astype('uint32') for c in ['pos_id', 'ofa_id', 'cod_final', 'car_id']},
                    etnia=lambda df_:(df_
                                      .etnia
                                      .replace(regex=dd['etnias'])
                                      .astype('category')),
                    provincia_ies=lambda df_:(df_
                                              .provincia_ies
                                              .replace(to_replace='caaar', value='canar')
                                              .astype('category')),
                    parroquia_ies=lambda df_:(df_
                                              .parroquia_ies
                                              .replace(to_replace=[', ', 'cabecera cantonal', 
                                                                   ' y ', 'capital provincial', 
                                                                   'de la republica del ecuador'], value='', regex=True)
                                              .astype('category')),
                    area=lambda df_:(df_.area.replace(dd['area']).astype('category')),
                    **{c:lambda df_, c=c: df_[c].astype('category') for c in ['nombre_ies', 'genero', 'canton_ies', 'carrera', 'modalidad']})
           )

In [15]:
df_aa = tweak_asignaciones(asig)
df_aa.head().T

Unnamed: 0,1,2,4,5,6
genero,masculino,masculino,masculino,femenino,masculino
etnia,mestizo,mestizo,mestizo,mestizo,mestizo
pos_id,20617572,23877079,16776209,21878283,26669780
pos_nota,850,736,939,889,836
pos_prioridad,2,1,1,1,4
ies_id,84,46,495,46,59
nombre_ies,universidad tecnica de cotopaxi,universidad central del ecuador,instituto tecnolagico superior josa ochoa lean,universidad central del ecuador,universidad estatal de milagro
provincia_ies,cotopaxi,pichincha,el oro,pichincha,guayas
canton_ies,latacunga,distrito metropolitano de quito,pasaje,distrito metropolitano de quito,milagro
parroquia_ies,latacunga,quito distrito metropolitano,pasaje,quito distrito metropolitano,milagro


### tweak Postulaciones

In [16]:
def tweak_postulaciones(df):
    """
    This function transforms or tweaks the table `postulaciones` from the senecyt database. Specifically, \
    it does the following:
    #. drops columns and indeces where data is missing by unaccepted margins
    #. drops rows where nota_postula is coded 9999
    #. columns refering the geographical location of universities are renamed to not be confounded with geographical location of students
    #. all text data is decoded and made lowercase
    #. numeric variables' data types are downgraded to save memory
    #. non-numeric variables' data types are converted to categorical data type. Also, mistypes or erroneous values are replaces
    .. note::
        The function may be altered and is under revision since many of the origional \
        decisions regarding the selection of variables, variable inclusion and creation \
        are being constantly revised
    :return: pd.DataFrame df: tweaked table `asignaciones`
    """
    return (df
            .loc[~df.year.isin([2014, 2015, 2016]), ]
            .drop(columns = ['Unnamed: 0', 'ins_id', 'pos_nota', 
                             'nombre_institucion', 'area','carrera',  
                             'parroquia_campus','subarea', 
                             'canton_campus'])
            .dropna(subset=['pos_prioridad', 'ies_id', 'year', 'nota_postula', 'cod_final', 'per_id', 'ofa_id', 'car_id', 'pos_id'])
            .applymap(lambda tx: unidecode(tx.lower()) if isinstance(tx, str) else tx)
            .assign(pos_prioridad= lambda df_: df_.pos_prioridad.astype('uint8'),
                    **{c: lambda df_, c=c: df_[c].astype('uint16') for c in ['ies_id', 'year', 'nota_postula']},
                    **{c: lambda df_, c=c: df_[c].astype('uint32') for c in ['cod_final', 'per_id', 'ofa_id', 'car_id', 'pos_id']},
                    parroquia=lambda df_:(df_
                                          .parroquia
                                          .replace(to_replace=[', ', 'cabecera cantonal', 
                                                               ' y ', 'capital provincial', 
                                                               'de la republica del ecuador'], 
                                                   value='', 
                                                   regex=True)
                                          .astype('category')),
                    **{c:lambda df_, c=c:df_[c].astype('category') for c in ['provincia', 'canton', 
                                                                             'modalidad', 'jornada', 
                                                                             'ies_tipo_financiamiento', 
                                                                             'subarea_nombre']})
           )

In [17]:
df_pp = tweak_postulaciones(post) 
df_pp.head().T

Unnamed: 0,0,1,2,3,4
pos_id,24510605,24932685,20807061,18308201,13608229
pos_prioridad,5,1,1,3,4
ies_id,82,84,494,46,51
provincia,tungurahua,cotopaxi,el oro,pichincha,guayas
canton,ambato,la mana,machala,distrito metropolitano de quito,guayaquil
parroquia,la merced,la mana,machala,quito distrito metropolitano,guayaquil
car_id,7077,5456,7078,4913,4801
modalidad,presencial,presencial,presencial,presencial,presencial
jornada,intensiva,vespertina,vespertina,intensiva,vespertina
ofa_id,162620,172042,144385,98317,85081


### Missing After Tweaks

#### Missing across years

In [18]:
df_ii.isna().groupby(df_ii.year).mean() * 100

Unnamed: 0_level_0,ins_autoidentificacion,cod_final,year,usu_estado_civil,usu_fecha_nac,ins_sexo,ins_fecha,pais_res,provincia_reside,canton_reside,parroquia_reside
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2017,14.230035,0.0,0.0,0.408721,0.375369,20.7493,0.0,9.253446,2.524916,2.524916,2.524916
2018,3.514915,0.0,0.0,0.024344,0.020599,1.28462,0.0,2.402577,2.402577,2.402577,2.402577
2019,2.369385,0.0,0.0,0.064561,0.001614,1.636619,0.0,1.612409,1.612409,1.612409,1.612409
2020,1.451256,0.0,0.0,0.131932,100.0,0.0,100.0,1.472459,1.472459,1.472459,1.472459
2021,4.52993,0.0,0.0,0.001604,0.0,0.0,0.0,4.545965,4.545965,4.545965,4.545965
2022,3.821546,0.0,0.0,0.001725,0.0,0.0,0.0,3.849139,3.849139,3.849139,3.849139


In [19]:
df_aa.isna().groupby(df_aa.year).mean() * 100

Unnamed: 0_level_0,genero,etnia,pos_id,pos_nota,pos_prioridad,ies_id,nombre_ies,provincia_ies,canton_ies,parroquia_ies,car_id,carrera,area,modalidad,ofa_id,cod_final,year
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2017,2.704247,4.946677,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018,2.198892,4.442842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2019,1.415042,4.375116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020,0.643699,3.708352,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021,0.596083,3.098685,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022,0.303712,1.507312,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
df_pp.isna().groupby(df_pp.year).mean() * 100

Unnamed: 0_level_0,pos_id,pos_prioridad,ies_id,provincia,canton,parroquia,car_id,modalidad,jornada,ofa_id,per_id,cod_final,year,nota_postula,ies_tipo_financiamiento,subarea_nombre
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Missing across provinces

In [21]:
df_ii.isna().groupby(df_ii.provincia_reside).mean() * 100

Unnamed: 0_level_0,ins_autoidentificacion,cod_final,year,usu_estado_civil,usu_fecha_nac,ins_sexo,ins_fecha,pais_res,provincia_reside,canton_reside,parroquia_reside
provincia_reside,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
azuay,2.428384,0.0,0.0,0.131406,16.551905,10.09724,16.446781,0.636005,0.0,0.0,0.0
bolivar,2.766278,0.0,0.0,0.282273,11.479112,6.097102,11.366202,1.298457,0.0,0.0,0.0
canar,4.05675,0.0,0.0,0.088672,9.776103,10.418976,9.709599,1.773443,0.0,0.0,0.0
carchi,2.042529,0.0,0.0,0.08394,13.43033,7.582541,13.37437,0.111919,0.0,0.0,0.0
chimborazo,1.836243,0.0,0.0,0.050171,13.967489,7.525587,13.927353,0.451535,0.0,0.0,0.0
cotopaxi,2.915002,0.0,0.0,0.308754,11.714493,7.528151,11.451144,1.380312,0.0,0.0,0.0
el oro,5.300936,0.0,0.0,0.164896,6.823803,5.839274,6.736505,3.094233,0.0,0.0,0.0
esmeraldas,8.163086,0.0,0.0,0.220148,4.825643,8.048609,4.631913,5.433251,0.0,0.0,0.0
galapagos,12.059369,0.0,0.0,0.927644,3.710575,11.688312,3.710575,10.204082,0.0,0.0,0.0
guayas,6.215847,0.0,0.0,0.133109,6.48732,6.455794,6.395369,3.688525,0.0,0.0,0.0


In [22]:
df_aa.isna().groupby(df_aa.provincia_ies).mean() * 100

Unnamed: 0_level_0,genero,etnia,pos_id,pos_nota,pos_prioridad,ies_id,nombre_ies,provincia_ies,canton_ies,parroquia_ies,car_id,carrera,area,modalidad,ofa_id,cod_final,year
provincia_ies,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
azuay,0.20908,1.851852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
bolivar,0.532319,2.357414,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
canar,0.410959,3.493151,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
carchi,0.079491,2.384738,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
chimborazo,0.21426,2.678253,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
cotopaxi,0.343584,3.109431,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
el oro,2.335669,2.669336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
esmeraldas,1.884659,3.656238,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
guayas,1.989124,4.192902,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
imbabura,0.152805,3.667322,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
df_pp.isna().groupby(df_pp.provincia).mean() * 100

Unnamed: 0_level_0,pos_id,pos_prioridad,ies_id,provincia,canton,parroquia,car_id,modalidad,jornada,ofa_id,per_id,cod_final,year,nota_postula,ies_tipo_financiamiento,subarea_nombre
provincia,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
azuay,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
bolivar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
canar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
carchi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
chimborazo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
cotopaxi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
el oro,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
esmeraldas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
guayas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
imbabura,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### TODOs
<span style="color: #B8860B">
    <span style="color:lightgreen; font-weight:normal;">@</span> Finish cleaning categories <br>
    <span style="color:lightgreen; font-weight:normal;">@</span> Include geographical data once categories are cleaned and you're able to merge <br>
    <span style="color:lightgreen; font-weight:normal;">@</span> Use Mellisa Dell's layout parser to get all categories from CES (Consejo de Educación Superior) <br>
</span>

### Anomalies

#### Atypical, estranged values in the column `nota_postula`

In [53]:
# 19 observations appear with scores of 9999, which is not possible since max score is 1000
# post.nota_postula.sort_values(ascending=False).head(19).index

(post
 .iloc[(post
        .nota_postula
        .sort_values(ascending=False)
        .head(19)
        .index), ]
)

Unnamed: 0.1,Unnamed: 0,ins_id,pos_id,pos_nota,pos_prioridad,ies_id,nombre_institucion,provincia,canton,parroquia,car_id,carrera,area,subarea,modalidad,jornada,ofa_id,per_id,cod_final,year,nota_postula,ies_tipo_financiamiento,subarea_nombre,canton_campus,parroquia_campus
1536832,178344,4604435.0,10765080.0,,4.0,101.0,,ESMERALDAS,ESMERALDAS,ESMERALDAS,4473.0,,,,PRESENCIAL,MATUTINA,69693.0,16.0,1748381000.0,2016,9999.0,PÚBLICA,EDUCACION,,
732937,35130,3607246.0,8772122.0,,4.0,48.0,,AZUAY,CUENCA,"CUENCA, CABECERA CANTONAL Y CAPITAL PROVINCIAL.",4610.0,,,,PRESENCIAL,INTENSIVA,63160.0,15.0,1791470000.0,2015,9999.0,PÚBLICA,INGENIERIA Y PROFESIONES AFINES,,
1340892,693601,3669928.0,8870861.0,,3.0,46.0,,PICHINCHA,DISTRITO METROPOLITANO DE QUITO,"QUITO DISTRITO METROPOLITANO, CABECERA CANTONA...",4781.0,,,,PRESENCIAL,VESPERTINA,64867.0,15.0,1483052000.0,2015,9999.0,PÚBLICA,DERECHO,,
844255,178343,4604435.0,10765077.0,,1.0,101.0,,ESMERALDAS,ESMERALDAS,ESMERALDAS,5455.0,,,,PRESENCIAL,MATUTINA,75504.0,16.0,1748381000.0,2016,9999.0,PÚBLICA,EDUCACION COMERCIAL Y ADMINISTRACION,,
1461610,143626,3266968.0,8368348.0,,2.0,517.0,,GUAYAS,GUAYAQUIL,"GUAYAQUIL, CABECERA CANTONAL Y CAPITAL PROVINCIAL",4538.0,,,,PRESENCIAL,NOCTURNA,62760.0,15.0,5906811000.0,2015,9999.0,PÚBLICA,INGENIERIA Y PROFESIONES AFINES,,
1165510,462586,4806074.0,12068730.0,,2.0,86.0,,MANABI,PORTOVIEJO,PORTOVIEJO,4938.0,,,,PRESENCIAL,MATUTINA,71118.0,16.0,1253221000.0,2016,9999.0,PÚBLICA,SALUD,,
923068,462589,4806074.0,12068733.0,,5.0,60.0,,MANABI,JIPIJAPA,"JIPIJAPA, CABECERA CANTONAL",4641.0,,,,PRESENCIAL,VESPERTINA,73589.0,16.0,1253221000.0,2016,9999.0,PÚBLICA,SALUD,,
1814394,178345,4604435.0,10765079.0,,3.0,101.0,,ESMERALDAS,ESMERALDAS,ESMERALDAS,4478.0,,,,PRESENCIAL,VESPERTINA,71140.0,16.0,1748381000.0,2016,9999.0,PÚBLICA,EDUCACION,,
826768,154825,3524241.0,8719914.0,,1.0,51.0,,GUAYAS,GUAYAQUIL,"GUAYAQUIL, CABECERA CANTONAL Y CAPITAL PROVINCIAL",4781.0,,,,PRESENCIAL,NOCTURNA,57168.0,15.0,1764061000.0,2015,9999.0,PÚBLICA,DERECHO,,
330392,35128,3607246.0,8772119.0,,1.0,497.0,,CAÑAR,AZOGUES,AZOGUES,7090.0,,,,PRESENCIAL,NOCTURNA,62346.0,15.0,1791470000.0,2015,9999.0,PÚBLICA,INGENIERIA Y PROFESIONES AFINES,,


#### Data type casting failed due to NaNs

In [62]:
def get_shape(df):
    print(df.shape)
    return df

# 4075 rows will be lost due to missing values in numerical columns. I'll proceed but I can't but the pattern for missing values is random, \
# otherwise this might bias any resulting statistics
(post
 .select_dtypes(include=['int', 'float'])
 .drop(columns=['Unnamed: 0', 'ins_id', 'pos_nota'])
 .pipe(get_shape)
 .dropna()
 .pipe(get_shape)
 .sample(5)
)

(2363966, 9)
(2359891, 9)


Unnamed: 0,pos_id,pos_prioridad,ies_id,car_id,ofa_id,per_id,cod_final,year,nota_postula
2078202,9199667.0,2.0,51.0,5323.0,66087.0,15.0,1638241000.0,2015,937.0
2186608,23077461.0,2.0,88.0,5482.0,157495.0,20.0,1850931000.0,2020,731.0
2341524,9955246.0,5.0,51.0,5038.0,57126.0,15.0,1268811000.0,2015,715.0
2035440,13060334.0,2.0,85.0,5455.0,76936.0,16.0,1210201000.0,2016,812.0
2147337,16277144.0,2.0,51.0,4781.0,82219.0,17.0,2099521000.0,2017,616.0


#### Missing values may not be missing at random
Missing values are overrepresented in the earliest time periods available. Check the missing values section.

In [90]:
inscrVars = ['provincia_reside', 'canton_reside', 
             'parroquia_reside',  'ins_sexo', 'pais_res',
             'usu_estado_civil', 'cod_final', 'year',
             'ins_autoidentificacion']

asigVars = ['parroquia_reside', 'canton_reside', 
            'provincia_reside',  'ies_siglas_instit',
            'usu_nacionalidad']

postVars = ['nombre_institucion', 'area', 
            'carrera',  'parroquia_campus',
            'subarea', 'canton_campus']

def getMissing(df, selVars):
    return (df
     .pipe(lambda df_: df_.drop(columns=[c for c in df_.columns if c not in selVars]))
     .isna()
     .groupby(df.year)
     .mean()
     .style
     .format(lambda n: "{:.2f} %".format(n*100))
    )

getMissing(inscr, inscrVars)

Unnamed: 0_level_0,ins_autoidentificacion,cod_final,year,usu_estado_civil,ins_sexo,pais_res,provincia_reside,canton_reside,parroquia_reside
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2014,45.33 %,0.00 %,0.00 %,100.00 %,100.00 %,100.00 %,100.00 %,100.00 %,100.00 %
2015,51.72 %,0.00 %,0.00 %,0.29 %,0.00 %,46.11 %,40.45 %,40.45 %,40.45 %
2016,47.49 %,0.00 %,0.00 %,0.28 %,0.00 %,42.86 %,37.92 %,37.92 %,37.92 %
2017,14.23 %,0.00 %,0.00 %,0.41 %,0.00 %,9.25 %,2.52 %,2.52 %,2.52 %
2018,3.51 %,0.00 %,0.00 %,0.02 %,0.00 %,2.40 %,2.40 %,2.40 %,2.40 %
2019,2.37 %,0.00 %,0.00 %,0.06 %,1.64 %,1.61 %,1.61 %,1.61 %,1.61 %
2020,1.45 %,0.00 %,0.00 %,0.13 %,0.00 %,1.47 %,1.47 %,1.47 %,1.47 %
2021,4.53 %,0.00 %,0.00 %,0.00 %,0.00 %,4.55 %,4.55 %,4.55 %,4.55 %
2022,3.82 %,0.00 %,0.00 %,0.00 %,0.00 %,3.85 %,3.85 %,3.85 %,3.85 %


#### Cleaning up of Categorical Variables


In [None]:
# This code should work with either the dictionary from the json file `utils` and the python variable `atidntreg`
with open('utils.json') as f:
    dd = json.load(f)
    
atidntreg = {r'^mesti[s|z]o?/?a?':'mestizo', r'^montu[v|b]io/?a?':'montuvio', 
             r'^afro(descendiente|ecuatoriano/a)?(\so\safrodescendiente)?|negro/?a?':'afroecuatoriano', 
             r'blanco/?a?':'blanco', r'mulato/?a?':'mulato', r'otro/?a?':'otro'}
    
(inscr
 .loc[~inscr.year.isin([2014]), ]
 .select_dtypes('object')
 .drop(columns=['prov_nacimiento', 'recinto',
                'parro_nacimiento', 'ued_nombre', 
                'ins_nacionalidad', 'sexo', 
                'can_nacimiento', 'ued_tipo',
                'usu_nacionalidad', 'usu_estado'])
 .drop(columns=[c for c in inscr.columns if 'fec' in c]) # not going to deal with datetime columns now
 .pipe(check_memory)
 .applymap(lambda tx: unidecode(tx.lower()) if isinstance(tx, str) else tx)
 .assign(ins_autoidentificacion=lambda df_:(df_
                                            .ins_autoidentificacion
                                            .replace(regex=dd['etnias'])
#                                             .replace(regex=atidntreg)
                                            .astype('category')))
 .pipe(check_memory)
 .ins_autoidentificacion
 .value_counts()
)

#### Mismatch between the number of true districts and parishes and in-data district and parishes

In [29]:
print((inscr
 .canton_reside
 .unique()
 .size
))

print((inscr
 .parroquia_reside
 .unique()
 .size
))

# provincia_reside 
# canton_reside : high cardinality ; 221 true districts contrasts to 225 in-data districts
# parroquia_reside : high_cardinality ; 1.499 true parishes contrast to 1172 in-data parishes

225
1172


#### Individuals being assigned to private universities

In [33]:
priv_uni = ['UNIVERSIDAD SAN FRANCISCO DE QUITO', 'UNIVERSIDAD PARTICULAR DE ESPECIALIDADES ESPIRITU SANTO', 'UNIVERSIDAD PARTICULAR SAN GREGORIO DE PORTOVIEJO']

(asig
 .loc[~asig.year.isin([2014, 2015, 2016])]
 .rename(columns={'nombre_institucion':'nombre_ies'})
 .pipe(lambda df_:df_.loc[df_.nombre_ies.isin(priv_uni)])
 .head()
)

Unnamed: 0.1,Unnamed: 0,ins_id,genero,usu_fecha_nac,usu_nacionalidad,etnia,parroquia_reside,canton_reside,provincia_reside,pos_id,pos_nota,pos_prioridad,ies_id,ies_siglas_instit,nombre_ies,provincia,canton,parroquia,car_id,carrera,area,modalidad,ofa_id,cod_final,year
237,88967,6242661,FEMENINO,37026.0,ECUATORIANA,Blanco,CANUTO,CHONE,MANABI,14965554,832.0,2.0,76,sangregorio,UNIVERSIDAD PARTICULAR SAN GREGORIO DE PORTOVIEJO,MANABI,PORTOVIEJO,PORTOVIEJO,4913,ODONTOLOGIA,SALUD Y BIENESTAR,PRESENCIAL,85488,2108671701,2017
6319,43686,6552398,MASCULINO,35865.0,ECUATORIANA,Indígena,PASCUALES,GUAYAQUIL,GUAYAS,13904408,726.0,2.0,49,uees,UNIVERSIDAD PARTICULAR DE ESPECIALIDADES ESPIR...,GUAYAS,SAMBORONDON,"SAMBORONDON, CABECERA CANTONAL",4473,EDUCACION INICIAL,EDUCACION,PRESENCIAL,88307,9978830938,2017
9462,4842,9120870,MASCULINO,36804.0,ECUATORIANA,Mestizo/a,SUCRE,CUENCA,AZUAY,22850623,784.0,5.0,81,usfq,UNIVERSIDAD SAN FRANCISCO DE QUITO,PICHINCHA,DISTRITO METROPOLITANO DE QUITO,"QUITO DISTRITO METROPOLITANO, CABECERA CANTONA...",4610,INGENIERIA AMBIENTAL,"INGENIERIA, INDUSTRIA Y CONSTRUCCION",PRESENCIAL,153307,2144630156,2020
10486,67484,8321118,FEMENINO,2000/11/05 00:00:00.000000000,ECUATORIANA,Indígena,SARAGURO,SARAGURO,LOJA,20765902,738.0,2.0,81,usfq,UNIVERSIDAD SAN FRANCISCO DE QUITO,PICHINCHA,DISTRITO METROPOLITANO DE QUITO,"QUITO DISTRITO METROPOLITANO, CABECERA CANTONA...",5337,EDUCACION,EDUCACION,PRESENCIAL,139784,2219761138,2019
10953,24274,8239022,FEMENINO,2001/12/25 00:00:00.000000000,ECUATORIANA,Mestizo,GARCIA MORENO (LLURIMAGUA),COTACACHI,IMBABURA,19482163,945.0,1.0,76,sangregorio,UNIVERSIDAD PARTICULAR SAN GREGORIO DE PORTOVIEJO,MANABI,PORTOVIEJO,PORTOVIEJO,4913,ODONTOLOGIA,SALUD Y BIENESTAR,PRESENCIAL,146547,2401460838,2019


#### Parishes names, after decoding and lowercase transformation, may be changed and be written differently although represent the same thing

In [12]:
# See (azogues vs azogues, cabecera cantonal) ; (la aurora (satalite) vs la aurora (satelite)) ; (iaaquito vs inaquito)
(asig
 .loc[~post.year.isin([2014, 2015, 2016]), ]
 .drop(columns=['parroquia_reside', 'canton_reside', 
                'provincia_reside',  'ies_siglas_instit',
                'usu_nacionalidad'])
 .select_dtypes('object')
 .applymap(lambda tx: unidecode(tx.title()) if isinstance(tx, str) else tx)
 .assign(provincia=lambda df_:(df_.provincia.replace(to_replace='CaaAr', value='Canar')),
         canton=lambda df_: (df_.canton.replace(to_replace='CaaAr', value='Canar')),
         parroquia=lambda df_:(df_
                               .parroquia
                               .replace(to_replace='CaaAr', value='Canar')
                               .replace(to_replace=[', ', 'Cabecera Cantonal', 
                                                    ' Y ', 'Capital Provincial', 
                                                    'De La Republica Del Ecuador'], 
                                        value='', 
                                        regex=True)))
 .groupby(['provincia', 'canton', 'parroquia'])
 .size()
 .reset_index()
)

Unnamed: 0,provincia,canton,parroquia,0
0,Azuay,Cuenca,Bellavista,42
1,Azuay,Cuenca,Cuenca.,3714
2,Azuay,Cuenca,El Vecino,104
3,Azuay,Giron,Giron,6
4,Bolivar,Chimbo,San Sebastian,80
5,Bolivar,Guaranda,Guaranda,2454
6,Bolivar,Guaranda,Simiatug,63
7,Bolivar,San Miguel,San Miguel,104
8,Canar,Azogues,Azogues,1067
9,Canar,Biblian,Biblian,54


In [71]:
(post
 .loc[~post.year.isin([2014, 2015, 2016]), ]
 .drop(columns=['nombre_institucion', 'area',
                'carrera',  'parroquia_campus',
                'subarea', 'canton_campus'])
 .select_dtypes('object')
 .applymap(lambda tx: unidecode(tx.lower()) if isinstance(tx, str) else tx)
 .assign(provincia=lambda df_:(df_.provincia.replace(to_replace='caaar', value='canar')))
 .groupby(['provincia', 'canton', 'parroquia'])
 .size()
 .reset_index()
)

Unnamed: 0,provincia,canton,parroquia,0
0,azuay,cuenca,bellavista,272
1,azuay,cuenca,"cuenca, cabecera cantonal y capital provincial.",50189
2,azuay,cuenca,el vecino,1031
3,azuay,giron,giron,22
4,bolivar,chimbo,san sebastian,54
5,bolivar,guaranda,"guaranda, cabecera cantonal y capital provincial",10927
6,bolivar,guaranda,simiatug,62
7,bolivar,san miguel,san miguel,339
8,canar,azogues,azogues,4091
9,canar,azogues,"azogues, cabecera cantonal",77


#### Cleaning Categoricals
Options I envision for effective cleaning of categorical variables, now working with names of provinces, districts and parishes in mind.
- O1: Given a pd.Series of data type string, create a function to apply to each of the strings. The string is taken, then the damarau-levenshtein distance is calculated with each of the strings present in list-like object that contains only correctly typed and formatted strings. Then, the string is replaced with the one with which the distance is the smallest.

    - Concern 1: Big Number of Categories:
    In order to reduce the space of words that the single word coming from the mistyped pd.Series must be compared to I should filter the array of correct words somehow. What is I could compare not only words but multiple strings at once.

Problems that must be addressed by O1 are i. big number of categories, ii. big ramerau levenshtein diffs although they are the same words but one contains more information than the other.