# Análisis y Limpieza de Datos Educativos

## Descripción de la Base de Datos

La base de datos **MEN_ESTADÍSTICAS EN EDUCACIÓN EN PREESCOLAR, BÁSICA Y MEDIA POR MUNICIPIO** contiene información relacionadan con los niveles educativos de **preescolar**, **básica** y **media**. La información está desglosada por municipio y presenta indicadores sectoriales que permiten analizar la cobertura educativa, deserción, repitencia, y aprobación en diferentes niveles.

La base abarca los años **2011 a 2023**, y los datos se presentan sin valores atípicos. Además, para las tasas de cobertura bruta y neta de los años **2019** y **2018**, los cálculos se realizaron utilizando las **proyecciones de población** derivadas del **Censo 2018**.

En este archivo se realizará un análisis y limpieza del conjunto de datos educativos proporcionado por el DANE. Los datos contienen información sobre tasas de matrícula, deserción, repitencia y cobertura educativa en los niveles preescolar, básica y media, desglosados por municipio y año.

El proceso se centrará en limpiar, normalizar y transformar las variables para hacer que los datos sean adecuados para análisis posteriores en SQL.

In [132]:
## Vamos a importar la librerias necesarias

import pandas as pd
import requests
pd.set_option('display.max_columns', None) ## muestra todos los datos


### 1. Extraer Datos

In [133]:
api_url = "https://www.datos.gov.co/resource/nudc-7mev.json?$limit=50000"
print(f"📥 Extrayendo datos desde: {api_url}")

try:
    response = requests.get(api_url)
    response.raise_for_status()  # Lanza un error si la petición falla (ej: 404)
    data = response.json()
    df_raw = pd.DataFrame(data)
    print(f"✅ ¡Extracción exitosa! Se cargaron {len(df_raw)} filas.")
    display(df_raw.head())

except requests.exceptions.RequestException as e:
    print(f"❌ Error al extraer los datos: {e}")
    df_raw = pd.DataFrame() # Creamos un dataframe vacío para evitar errores posteriores

except Exception as e:
    print(f"❌ Ocurrió un error inesperado: {e}")
    df_raw = pd.DataFrame()

📥 Extrayendo datos desde: https://www.datos.gov.co/resource/nudc-7mev.json?$limit=50000
✅ ¡Extracción exitosa! Se cargaron 14585 filas.


Unnamed: 0,a_o,c_digo_municipio,municipio,c_digo_departamento,departamento,c_digo_etc,etc,poblaci_n_5_16,tasa_matriculaci_n_5_16,cobertura_neta,cobertura_neta_transici_n,cobertura_neta_primaria,cobertura_neta_secundaria,cobertura_neta_media,cobertura_bruta,cobertura_bruta_transici_n,cobertura_bruta_primaria,cobertura_bruta_secundaria,cobertura_bruta_media,deserci_n,deserci_n_transici_n,deserci_n_primaria,deserci_n_secundaria,deserci_n_media,aprobaci_n,aprobaci_n_transici_n,aprobaci_n_primaria,aprobaci_n_secundaria,aprobaci_n_media,reprobaci_n,reprobaci_n_transici_n,reprobaci_n_primaria,reprobaci_n_secundaria,reprobaci_n_media,repitencia,repitencia_transici_n,repitencia_primaria,repitencia_secundaria,repitencia_media,tama_o_promedio_de_grupo,sedes_conectadas_a_internet
0,2023,5004,Abriaquí,5,Antioquia,3758,Antioquia (ETC),503,62.62,62.62,44.19,63.33,51.53,40.23,66.8,58.14,72.86,66.87,56.32,1.19,0.0,1.31,0.0,4.08,92.26,0.0,96.73,83.49,93.88,6.55,0.0,1.96,16.51,2.04,9.52,0.0,10.46,13.76,2.04,,
1,2023,95025,El Retorno,95,Guaviare,3830,Guaviare (ETC),4438,53.27,53.27,33.91,48.89,44.9,21.3,62.98,54.2,65.19,69.6,48.54,5.56,6.95,4.99,6.11,5.26,87.67,0.0,87.9,84.5,92.98,6.78,0.0,7.11,9.39,1.75,9.34,6.95,11.84,8.48,3.16,,
2,2023,95200,Miraflores,95,Guaviare,3830,Guaviare (ETC),2014,32.52,32.52,17.58,25.33,26.43,10.75,38.58,36.36,37.28,46.1,26.16,7.85,15.0,8.43,6.36,4.69,82.68,3.33,84.64,79.51,87.5,9.47,3.33,6.93,14.13,7.81,8.65,6.67,9.04,10.25,1.54,,
3,2023,97001,Mitú,97,Vaupés,3831,Vaupés (ETC),10986,59.57,59.57,42.76,55.95,43.51,17.06,70.65,64.9,76.96,72.92,53.12,3.95,2.27,1.84,6.77,5.47,90.71,0.57,94.12,84.91,89.93,5.34,0.57,4.04,8.33,4.6,16.18,7.75,21.04,13.84,7.18,,
4,2023,97161,Caruru,97,Vaupés,3831,Vaupés (ETC),1228,51.3,51.3,76.32,52.29,33.71,11.94,55.54,92.11,65.21,51.12,27.36,8.36,4.29,3.05,15.72,14.55,82.4,0.0,89.63,69.0,78.18,9.24,0.0,7.32,15.28,7.27,9.24,2.86,7.62,14.85,3.64,,


### 2. Conocer los tipos de variables
Conocer los tipos de variables te permite realizar un análisis adecuado, limpiar los datos correctamente.

In [134]:
df_raw.dtypes

a_o                            object
c_digo_municipio               object
municipio                      object
c_digo_departamento            object
departamento                   object
c_digo_etc                     object
etc                            object
poblaci_n_5_16                 object
tasa_matriculaci_n_5_16        object
cobertura_neta                 object
cobertura_neta_transici_n      object
cobertura_neta_primaria        object
cobertura_neta_secundaria      object
cobertura_neta_media           object
cobertura_bruta                object
cobertura_bruta_transici_n     object
cobertura_bruta_primaria       object
cobertura_bruta_secundaria     object
cobertura_bruta_media          object
deserci_n                      object
deserci_n_transici_n           object
deserci_n_primaria             object
deserci_n_secundaria           object
deserci_n_media                object
aprobaci_n                     object
aprobaci_n_transici_n          object
aprobaci_n_p

### 3. Convertir las variables numericas en **Tipo Float**, para poder trabajarlas

In [135]:
columnas_numericas = [
    'poblaci_n_5_16', 'tasa_matriculaci_n_5_16', 'cobertura_neta', 'cobertura_neta_transici_n',
    'cobertura_neta_primaria', 'cobertura_neta_secundaria', 'cobertura_neta_media', 'cobertura_bruta',
    'cobertura_bruta_transici_n', 'cobertura_bruta_primaria', 'cobertura_bruta_secundaria',
    'cobertura_bruta_media', 'deserci_n', 'deserci_n_transici_n', 'deserci_n_primaria', 'deserci_n_secundaria',
    'deserci_n_media', 'aprobaci_n', 'aprobaci_n_transici_n', 'aprobaci_n_primaria', 'aprobaci_n_secundaria',
    'aprobaci_n_media', 'reprobaci_n', 'reprobaci_n_transici_n', 'reprobaci_n_primaria', 'reprobaci_n_secundaria',
    'reprobaci_n_media', 'repitencia', 'repitencia_transici_n', 'repitencia_primaria', 'repitencia_secundaria',
    'repitencia_media', 'tama_o_promedio_de_grupo', 'sedes_conectadas_a_internet'
]

# Convertir las columnas numéricas a float
df_raw[columnas_numericas] = df_raw[columnas_numericas].apply(pd.to_numeric, errors='coerce')
df_raw.head(3)

Unnamed: 0,a_o,c_digo_municipio,municipio,c_digo_departamento,departamento,c_digo_etc,etc,poblaci_n_5_16,tasa_matriculaci_n_5_16,cobertura_neta,cobertura_neta_transici_n,cobertura_neta_primaria,cobertura_neta_secundaria,cobertura_neta_media,cobertura_bruta,cobertura_bruta_transici_n,cobertura_bruta_primaria,cobertura_bruta_secundaria,cobertura_bruta_media,deserci_n,deserci_n_transici_n,deserci_n_primaria,deserci_n_secundaria,deserci_n_media,aprobaci_n,aprobaci_n_transici_n,aprobaci_n_primaria,aprobaci_n_secundaria,aprobaci_n_media,reprobaci_n,reprobaci_n_transici_n,reprobaci_n_primaria,reprobaci_n_secundaria,reprobaci_n_media,repitencia,repitencia_transici_n,repitencia_primaria,repitencia_secundaria,repitencia_media,tama_o_promedio_de_grupo,sedes_conectadas_a_internet
0,2023,5004,Abriaquí,5,Antioquia,3758,Antioquia (ETC),503.0,62.62,62.62,44.19,63.33,51.53,40.23,66.8,58.14,72.86,66.87,56.32,1.19,0.0,1.31,0.0,4.08,92.26,0.0,96.73,83.49,93.88,6.55,0.0,1.96,16.51,2.04,9.52,0.0,10.46,13.76,2.04,,
1,2023,95025,El Retorno,95,Guaviare,3830,Guaviare (ETC),4438.0,53.27,53.27,33.91,48.89,44.9,21.3,62.98,54.2,65.19,69.6,48.54,5.56,6.95,4.99,6.11,5.26,87.67,0.0,87.9,84.5,92.98,6.78,0.0,7.11,9.39,1.75,9.34,6.95,11.84,8.48,3.16,,
2,2023,95200,Miraflores,95,Guaviare,3830,Guaviare (ETC),2014.0,32.52,32.52,17.58,25.33,26.43,10.75,38.58,36.36,37.28,46.1,26.16,7.85,15.0,8.43,6.36,4.69,82.68,3.33,84.64,79.51,87.5,9.47,3.33,6.93,14.13,7.81,8.65,6.67,9.04,10.25,1.54,,


Vamos a comprobar

In [136]:
df_raw.dtypes

a_o                             object
c_digo_municipio                object
municipio                       object
c_digo_departamento             object
departamento                    object
c_digo_etc                      object
etc                             object
poblaci_n_5_16                 float64
tasa_matriculaci_n_5_16        float64
cobertura_neta                 float64
cobertura_neta_transici_n      float64
cobertura_neta_primaria        float64
cobertura_neta_secundaria      float64
cobertura_neta_media           float64
cobertura_bruta                float64
cobertura_bruta_transici_n     float64
cobertura_bruta_primaria       float64
cobertura_bruta_secundaria     float64
cobertura_bruta_media          float64
deserci_n                      float64
deserci_n_transici_n           float64
deserci_n_primaria             float64
deserci_n_secundaria           float64
deserci_n_media                float64
aprobaci_n                     float64
aprobaci_n_transici_n    

### 4. Normalizar datos númericos que estan en (%)

In [137]:
porcentaje_columnas = [
    'tasa_matriculaci_n_5_16', 'cobertura_neta', 'cobertura_neta_transici_n', 
    'cobertura_neta_primaria', 'cobertura_neta_secundaria', 'cobertura_neta_media',
    'cobertura_bruta', 'cobertura_bruta_transici_n', 'cobertura_bruta_primaria', 
    'cobertura_bruta_secundaria', 'cobertura_bruta_media', 'deserci_n', 
    'deserci_n_transici_n', 'deserci_n_primaria', 'deserci_n_secundaria', 'deserci_n_media', 
    'aprobaci_n', 'aprobaci_n_transici_n', 'aprobaci_n_primaria', 'aprobaci_n_secundaria', 
    'aprobaci_n_media', 'reprobaci_n', 'reprobaci_n_transici_n', 'reprobaci_n_primaria', 
    'reprobaci_n_secundaria', 'reprobaci_n_media', 'repitencia', 'repitencia_transici_n', 
    'repitencia_primaria', 'repitencia_secundaria', 'repitencia_media',
    'sedes_conectadas_a_internet'  # Incluimos la columna que faltaba
]

# esta linea nomaliza los valores,el apply aplica la función sobre acda columna para cada argumento x
df_raw[porcentaje_columnas] = df_raw[porcentaje_columnas].apply(lambda x: x / 100)
df_raw.head(3)



Unnamed: 0,a_o,c_digo_municipio,municipio,c_digo_departamento,departamento,c_digo_etc,etc,poblaci_n_5_16,tasa_matriculaci_n_5_16,cobertura_neta,cobertura_neta_transici_n,cobertura_neta_primaria,cobertura_neta_secundaria,cobertura_neta_media,cobertura_bruta,cobertura_bruta_transici_n,cobertura_bruta_primaria,cobertura_bruta_secundaria,cobertura_bruta_media,deserci_n,deserci_n_transici_n,deserci_n_primaria,deserci_n_secundaria,deserci_n_media,aprobaci_n,aprobaci_n_transici_n,aprobaci_n_primaria,aprobaci_n_secundaria,aprobaci_n_media,reprobaci_n,reprobaci_n_transici_n,reprobaci_n_primaria,reprobaci_n_secundaria,reprobaci_n_media,repitencia,repitencia_transici_n,repitencia_primaria,repitencia_secundaria,repitencia_media,tama_o_promedio_de_grupo,sedes_conectadas_a_internet
0,2023,5004,Abriaquí,5,Antioquia,3758,Antioquia (ETC),503.0,0.6262,0.6262,0.4419,0.6333,0.5153,0.4023,0.668,0.5814,0.7286,0.6687,0.5632,0.0119,0.0,0.0131,0.0,0.0408,0.9226,0.0,0.9673,0.8349,0.9388,0.0655,0.0,0.0196,0.1651,0.0204,0.0952,0.0,0.1046,0.1376,0.0204,,
1,2023,95025,El Retorno,95,Guaviare,3830,Guaviare (ETC),4438.0,0.5327,0.5327,0.3391,0.4889,0.449,0.213,0.6298,0.542,0.6519,0.696,0.4854,0.0556,0.0695,0.0499,0.0611,0.0526,0.8767,0.0,0.879,0.845,0.9298,0.0678,0.0,0.0711,0.0939,0.0175,0.0934,0.0695,0.1184,0.0848,0.0316,,
2,2023,95200,Miraflores,95,Guaviare,3830,Guaviare (ETC),2014.0,0.3252,0.3252,0.1758,0.2533,0.2643,0.1075,0.3858,0.3636,0.3728,0.461,0.2616,0.0785,0.15,0.0843,0.0636,0.0469,0.8268,0.0333,0.8464,0.7951,0.875,0.0947,0.0333,0.0693,0.1413,0.0781,0.0865,0.0667,0.0904,0.1025,0.0154,,


### 5. Identificar Valores Nulos

In [138]:
df_raw.isnull().sum().sort_values(ascending=False)

tama_o_promedio_de_grupo       7013
sedes_conectadas_a_internet    6817
deserci_n_transici_n            903
deserci_n_media                 734
deserci_n_secundaria            270
deserci_n_primaria              242
repitencia_transici_n           159
repitencia_secundaria           152
repitencia_primaria             148
reprobaci_n_media               145
repitencia                      143
deserci_n                       142
repitencia_media                139
cobertura_bruta_media           127
tasa_matriculaci_n_5_16         115
cobertura_neta                  111
reprobaci_n_secundaria          106
aprobaci_n_media                101
cobertura_bruta_transici_n       97
reprobaci_n_primaria             97
cobertura_neta_secundaria        94
reprobaci_n_transici_n           93
aprobaci_n_transici_n            93
cobertura_neta_media             93
cobertura_neta_primaria          91
cobertura_bruta_secundaria       88
reprobaci_n                      86
cobertura_bruta_primaria    

- Vamos a eliminar TAMAÑO_PROMEDIO_DE_GRUPO y SEDES_CONECTADAS_A_INTERNET.

Decidimos eliminar las columnas **TAMAÑO_PROMEDIO_DE_GRUPO** y **SEDES_CONECTADAS_A_INTERNET** debido a que más del **50%** de sus valores están nulos. Además,  estas variables no son tan relevantes para los objetivos principales de nuestra investigación.

In [139]:
# Eliminar columnas con demasiados nulos 
df_raw.drop(columns=['tama_o_promedio_de_grupo', 'sedes_conectadas_a_internet'], inplace=True)
df_raw.head(3)

Unnamed: 0,a_o,c_digo_municipio,municipio,c_digo_departamento,departamento,c_digo_etc,etc,poblaci_n_5_16,tasa_matriculaci_n_5_16,cobertura_neta,cobertura_neta_transici_n,cobertura_neta_primaria,cobertura_neta_secundaria,cobertura_neta_media,cobertura_bruta,cobertura_bruta_transici_n,cobertura_bruta_primaria,cobertura_bruta_secundaria,cobertura_bruta_media,deserci_n,deserci_n_transici_n,deserci_n_primaria,deserci_n_secundaria,deserci_n_media,aprobaci_n,aprobaci_n_transici_n,aprobaci_n_primaria,aprobaci_n_secundaria,aprobaci_n_media,reprobaci_n,reprobaci_n_transici_n,reprobaci_n_primaria,reprobaci_n_secundaria,reprobaci_n_media,repitencia,repitencia_transici_n,repitencia_primaria,repitencia_secundaria,repitencia_media
0,2023,5004,Abriaquí,5,Antioquia,3758,Antioquia (ETC),503.0,0.6262,0.6262,0.4419,0.6333,0.5153,0.4023,0.668,0.5814,0.7286,0.6687,0.5632,0.0119,0.0,0.0131,0.0,0.0408,0.9226,0.0,0.9673,0.8349,0.9388,0.0655,0.0,0.0196,0.1651,0.0204,0.0952,0.0,0.1046,0.1376,0.0204
1,2023,95025,El Retorno,95,Guaviare,3830,Guaviare (ETC),4438.0,0.5327,0.5327,0.3391,0.4889,0.449,0.213,0.6298,0.542,0.6519,0.696,0.4854,0.0556,0.0695,0.0499,0.0611,0.0526,0.8767,0.0,0.879,0.845,0.9298,0.0678,0.0,0.0711,0.0939,0.0175,0.0934,0.0695,0.1184,0.0848,0.0316
2,2023,95200,Miraflores,95,Guaviare,3830,Guaviare (ETC),2014.0,0.3252,0.3252,0.1758,0.2533,0.2643,0.1075,0.3858,0.3636,0.3728,0.461,0.2616,0.0785,0.15,0.0843,0.0636,0.0469,0.8268,0.0333,0.8464,0.7951,0.875,0.0947,0.0333,0.0693,0.1413,0.0781,0.0865,0.0667,0.0904,0.1025,0.0154


- Analizando nuevamente la información se decide eliminar tasa de deserción en transición y media, porque aunque los valores nulos no son mayores al 50%, son muchos y no lo usaremos para nuetsro analisis.

In [140]:
df_raw.drop(columns=['deserci_n_transici_n', 'deserci_n_media'], inplace=True)
df_raw.head(3)

Unnamed: 0,a_o,c_digo_municipio,municipio,c_digo_departamento,departamento,c_digo_etc,etc,poblaci_n_5_16,tasa_matriculaci_n_5_16,cobertura_neta,cobertura_neta_transici_n,cobertura_neta_primaria,cobertura_neta_secundaria,cobertura_neta_media,cobertura_bruta,cobertura_bruta_transici_n,cobertura_bruta_primaria,cobertura_bruta_secundaria,cobertura_bruta_media,deserci_n,deserci_n_primaria,deserci_n_secundaria,aprobaci_n,aprobaci_n_transici_n,aprobaci_n_primaria,aprobaci_n_secundaria,aprobaci_n_media,reprobaci_n,reprobaci_n_transici_n,reprobaci_n_primaria,reprobaci_n_secundaria,reprobaci_n_media,repitencia,repitencia_transici_n,repitencia_primaria,repitencia_secundaria,repitencia_media
0,2023,5004,Abriaquí,5,Antioquia,3758,Antioquia (ETC),503.0,0.6262,0.6262,0.4419,0.6333,0.5153,0.4023,0.668,0.5814,0.7286,0.6687,0.5632,0.0119,0.0131,0.0,0.9226,0.0,0.9673,0.8349,0.9388,0.0655,0.0,0.0196,0.1651,0.0204,0.0952,0.0,0.1046,0.1376,0.0204
1,2023,95025,El Retorno,95,Guaviare,3830,Guaviare (ETC),4438.0,0.5327,0.5327,0.3391,0.4889,0.449,0.213,0.6298,0.542,0.6519,0.696,0.4854,0.0556,0.0499,0.0611,0.8767,0.0,0.879,0.845,0.9298,0.0678,0.0,0.0711,0.0939,0.0175,0.0934,0.0695,0.1184,0.0848,0.0316
2,2023,95200,Miraflores,95,Guaviare,3830,Guaviare (ETC),2014.0,0.3252,0.3252,0.1758,0.2533,0.2643,0.1075,0.3858,0.3636,0.3728,0.461,0.2616,0.0785,0.0843,0.0636,0.8268,0.0333,0.8464,0.7951,0.875,0.0947,0.0333,0.0693,0.1413,0.0781,0.0865,0.0667,0.0904,0.1025,0.0154


### 6. Imputaremos las siguientes variables utilizando la **mediana** para reemplazar los valores faltantes.

In [141]:
imputacion_mediana = [
    'deserci_n_secundaria', 'deserci_n_primaria', 'repitencia_transici_n', 'repitencia_secundaria', 
    'repitencia_primaria', 'reprobaci_n_media', 'repitencia', 'deserci_n', 'repitencia_media', 
    'cobertura_bruta_media', 'tasa_matriculaci_n_5_16', 'cobertura_neta', 'reprobaci_n_secundaria', 
    'aprobaci_n_media', 'cobertura_bruta_transici_n', 'reprobaci_n_primaria', 'cobertura_neta_secundaria', 
    'cobertura_neta_media', 'reprobaci_n_transici_n', 'aprobaci_n_transici_n', 'cobertura_neta_primaria', 
    'cobertura_bruta_secundaria', 'reprobaci_n', 'cobertura_bruta_primaria', 'cobertura_bruta', 
    'aprobaci_n_secundaria', 'cobertura_neta_transici_n', 'aprobaci_n_primaria', 'aprobaci_n', 
    'poblaci_n_5_16'
]

for col in imputacion_mediana:
    df_raw[col].fillna(df_raw[col].median())

In [142]:
## Prueba
### Valores nulos
df_raw.isnull().sum().sort_values(ascending=False)

deserci_n_secundaria          270
deserci_n_primaria            242
repitencia_transici_n         159
repitencia_secundaria         152
repitencia_primaria           148
reprobaci_n_media             145
repitencia                    143
deserci_n                     142
repitencia_media              139
cobertura_bruta_media         127
tasa_matriculaci_n_5_16       115
cobertura_neta                111
reprobaci_n_secundaria        106
aprobaci_n_media              101
reprobaci_n_primaria           97
cobertura_bruta_transici_n     97
cobertura_neta_secundaria      94
aprobaci_n_transici_n          93
cobertura_neta_media           93
reprobaci_n_transici_n         93
cobertura_neta_primaria        91
cobertura_bruta_secundaria     88
reprobaci_n                    86
cobertura_bruta_primaria       81
cobertura_bruta                68
aprobaci_n_secundaria          54
cobertura_neta_transici_n      52
aprobaci_n_primaria            25
aprobaci_n                     25
poblaci_n_5_16

- Revisar estadísticas descriptivas después de imputar.

In [143]:
df_raw[imputacion_mediana].describe()

Unnamed: 0,deserci_n_secundaria,deserci_n_primaria,repitencia_transici_n,repitencia_secundaria,repitencia_primaria,reprobaci_n_media,repitencia,deserci_n,repitencia_media,cobertura_bruta_media,tasa_matriculaci_n_5_16,cobertura_neta,reprobaci_n_secundaria,aprobaci_n_media,cobertura_bruta_transici_n,reprobaci_n_primaria,cobertura_neta_secundaria,cobertura_neta_media,reprobaci_n_transici_n,aprobaci_n_transici_n,cobertura_neta_primaria,cobertura_bruta_secundaria,reprobaci_n,cobertura_bruta_primaria,cobertura_bruta,aprobaci_n_secundaria,cobertura_neta_transici_n,aprobaci_n_primaria,aprobaci_n,poblaci_n_5_16
count,14315.0,14343.0,14426.0,14433.0,14437.0,14440.0,14442.0,14443.0,14446.0,14458.0,14470.0,14474.0,14479.0,14484.0,14488.0,14488.0,14491.0,14492.0,14492.0,14492.0,14494.0,14497.0,14499.0,14504.0,14517.0,14531.0,14533.0,14560.0,14560.0,14578.0
mean,0.045779,0.027567,0.009432,0.043265,0.031695,0.041308,0.032995,0.034899,0.016979,0.758558,0.849719,0.855532,0.06809,0.921405,0.871303,0.038848,0.699056,0.40747,0.004868,0.004868,0.825667,1.033824,0.047573,1.07301,0.996951,0.883587,0.576924,0.932135,0.916081,10183.25
std,0.031029,0.020478,0.024151,0.045487,0.036441,0.042118,0.033698,0.021726,0.022841,0.269927,0.185891,0.169324,0.061535,0.070143,0.255659,0.038025,0.188309,0.15629,0.016533,0.016533,0.172036,1.584949,0.039995,1.548438,1.486799,0.079718,0.161986,0.052161,0.053278,143603.2
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,0.0235,0.0126,0.0,0.0073,0.0059,0.0067,0.0072,0.0193,0.0,0.596125,0.7485,0.7689,0.0072,0.894275,0.7143,0.0041,0.60395,0.3111,0.0,0.0,0.7321,0.8651,0.0092,0.898,0.8568,0.8363,0.475,0.9056,0.8844,1165.0
50%,0.0404,0.0232,0.0,0.0274,0.0188,0.031,0.0216,0.0313,0.0087,0.7525,0.8533,0.864,0.0616,0.9329,0.8521,0.0333,0.709,0.41165,0.0,0.0,0.8283,1.0154,0.0452,1.0332,0.975,0.891,0.579,0.9387,0.9202,2640.0
75%,0.062,0.0378,0.0093,0.0682,0.0459,0.064,0.0509,0.0466,0.0256,0.903775,0.9538,0.9454,0.1083,0.9643,1.0,0.06,0.8056,0.5054,0.0022,0.0022,0.919375,1.1574,0.0742,1.177,1.0926,0.9447,0.68,0.9695,0.9559,5868.25
max,0.4714,0.1831,0.5,0.5507,0.5076,0.6786,0.3747,0.279,0.3645,4.8962,2.7903,2.6454,0.7697,1.0,2.5929,0.5197,2.2944,1.7026,0.5271,0.5271,2.5496,110.65,0.4939,109.36,104.48,1.0,1.5047,1.0,1.0,9548263.0


- Las variables relacionadas con cobertura educativa y tasa de matrícula pueden superar el 100% debido a factores como extraedad (estudiantes fuera de la edad teórica para un nivel educativo) y flujos migratorios no capturados en las proyecciones de población. Esto es común en áreas con alta demanda social o movimientos poblacionales.

### 7. Identificar que datos tiene codigo municipio

In [144]:
# AÑO
df_raw['a_o'].value_counts().sort_index()

a_o
2011    1122
2012    1122
2013    1122
2014    1122
2015    1122
2016    1122
2017    1122
2018    1122
2019    1123
2020    1122
2021    1122
2022    1121
2023    1121
Name: count, dtype: int64

In [146]:
## necesito ver si existen datos repetidos, para esto voy a tomar código municipio y el año
municipio_ano = df_raw.groupby(['a_o', 'c_digo_municipio']).size().reset_index(name='count')
municipio_ano.head(3)

Unnamed: 0,a_o,c_digo_municipio,count
0,2011,11001,1
1,2011,13001,1
2,2011,13006,1


No hay repetidos

### 8.Vamos a eliminar `c_digo_etc` y `etc` porque no nos funcionan para el análisis.

In [148]:
df_raw.drop(columns=['c_digo_etc', 'etc'], inplace=True)

In [149]:
df_raw.head(3)

Unnamed: 0,a_o,c_digo_municipio,municipio,c_digo_departamento,departamento,poblaci_n_5_16,tasa_matriculaci_n_5_16,cobertura_neta,cobertura_neta_transici_n,cobertura_neta_primaria,cobertura_neta_secundaria,cobertura_neta_media,cobertura_bruta,cobertura_bruta_transici_n,cobertura_bruta_primaria,cobertura_bruta_secundaria,cobertura_bruta_media,deserci_n,deserci_n_primaria,deserci_n_secundaria,aprobaci_n,aprobaci_n_transici_n,aprobaci_n_primaria,aprobaci_n_secundaria,aprobaci_n_media,reprobaci_n,reprobaci_n_transici_n,reprobaci_n_primaria,reprobaci_n_secundaria,reprobaci_n_media,repitencia,repitencia_transici_n,repitencia_primaria,repitencia_secundaria,repitencia_media
0,2023,5004,Abriaquí,5,Antioquia,503.0,0.6262,0.6262,0.4419,0.6333,0.5153,0.4023,0.668,0.5814,0.7286,0.6687,0.5632,0.0119,0.0131,0.0,0.9226,0.0,0.9673,0.8349,0.9388,0.0655,0.0,0.0196,0.1651,0.0204,0.0952,0.0,0.1046,0.1376,0.0204
1,2023,95025,El Retorno,95,Guaviare,4438.0,0.5327,0.5327,0.3391,0.4889,0.449,0.213,0.6298,0.542,0.6519,0.696,0.4854,0.0556,0.0499,0.0611,0.8767,0.0,0.879,0.845,0.9298,0.0678,0.0,0.0711,0.0939,0.0175,0.0934,0.0695,0.1184,0.0848,0.0316
2,2023,95200,Miraflores,95,Guaviare,2014.0,0.3252,0.3252,0.1758,0.2533,0.2643,0.1075,0.3858,0.3636,0.3728,0.461,0.2616,0.0785,0.0843,0.0636,0.8268,0.0333,0.8464,0.7951,0.875,0.0947,0.0333,0.0693,0.1413,0.0781,0.0865,0.0667,0.0904,0.1025,0.0154


### 9. Necesito crear la variable: `total matriculados`

In [150]:
df_raw['total_matriculados'] = df_raw['poblaci_n_5_16'] * df_raw['tasa_matriculaci_n_5_16']
df_raw.head(3)

Unnamed: 0,a_o,c_digo_municipio,municipio,c_digo_departamento,departamento,poblaci_n_5_16,tasa_matriculaci_n_5_16,cobertura_neta,cobertura_neta_transici_n,cobertura_neta_primaria,cobertura_neta_secundaria,cobertura_neta_media,cobertura_bruta,cobertura_bruta_transici_n,cobertura_bruta_primaria,cobertura_bruta_secundaria,cobertura_bruta_media,deserci_n,deserci_n_primaria,deserci_n_secundaria,aprobaci_n,aprobaci_n_transici_n,aprobaci_n_primaria,aprobaci_n_secundaria,aprobaci_n_media,reprobaci_n,reprobaci_n_transici_n,reprobaci_n_primaria,reprobaci_n_secundaria,reprobaci_n_media,repitencia,repitencia_transici_n,repitencia_primaria,repitencia_secundaria,repitencia_media,total_matriculados
0,2023,5004,Abriaquí,5,Antioquia,503.0,0.6262,0.6262,0.4419,0.6333,0.5153,0.4023,0.668,0.5814,0.7286,0.6687,0.5632,0.0119,0.0131,0.0,0.9226,0.0,0.9673,0.8349,0.9388,0.0655,0.0,0.0196,0.1651,0.0204,0.0952,0.0,0.1046,0.1376,0.0204,314.9786
1,2023,95025,El Retorno,95,Guaviare,4438.0,0.5327,0.5327,0.3391,0.4889,0.449,0.213,0.6298,0.542,0.6519,0.696,0.4854,0.0556,0.0499,0.0611,0.8767,0.0,0.879,0.845,0.9298,0.0678,0.0,0.0711,0.0939,0.0175,0.0934,0.0695,0.1184,0.0848,0.0316,2364.1226
2,2023,95200,Miraflores,95,Guaviare,2014.0,0.3252,0.3252,0.1758,0.2533,0.2643,0.1075,0.3858,0.3636,0.3728,0.461,0.2616,0.0785,0.0843,0.0636,0.8268,0.0333,0.8464,0.7951,0.875,0.0947,0.0333,0.0693,0.1413,0.0781,0.0865,0.0667,0.0904,0.1025,0.0154,654.9528


In [152]:
## ordenar tablas 

columnas_orden = list(df_raw.columns)
columnas_orden.insert(6, columnas_orden.pop(columnas_orden.index('total_matriculados')))

df_raw = df_raw[columnas_orden]
df_raw.head(3)

Unnamed: 0,a_o,c_digo_municipio,municipio,c_digo_departamento,departamento,poblaci_n_5_16,total_matriculados,tasa_matriculaci_n_5_16,cobertura_neta,cobertura_neta_transici_n,cobertura_neta_primaria,cobertura_neta_secundaria,cobertura_neta_media,cobertura_bruta,cobertura_bruta_transici_n,cobertura_bruta_primaria,cobertura_bruta_secundaria,cobertura_bruta_media,deserci_n,deserci_n_primaria,deserci_n_secundaria,aprobaci_n,aprobaci_n_transici_n,aprobaci_n_primaria,aprobaci_n_secundaria,aprobaci_n_media,reprobaci_n,reprobaci_n_transici_n,reprobaci_n_primaria,reprobaci_n_secundaria,reprobaci_n_media,repitencia,repitencia_transici_n,repitencia_primaria,repitencia_secundaria,repitencia_media
0,2023,5004,Abriaquí,5,Antioquia,503.0,314.9786,0.6262,0.6262,0.4419,0.6333,0.5153,0.4023,0.668,0.5814,0.7286,0.6687,0.5632,0.0119,0.0131,0.0,0.9226,0.0,0.9673,0.8349,0.9388,0.0655,0.0,0.0196,0.1651,0.0204,0.0952,0.0,0.1046,0.1376,0.0204
1,2023,95025,El Retorno,95,Guaviare,4438.0,2364.1226,0.5327,0.5327,0.3391,0.4889,0.449,0.213,0.6298,0.542,0.6519,0.696,0.4854,0.0556,0.0499,0.0611,0.8767,0.0,0.879,0.845,0.9298,0.0678,0.0,0.0711,0.0939,0.0175,0.0934,0.0695,0.1184,0.0848,0.0316
2,2023,95200,Miraflores,95,Guaviare,2014.0,654.9528,0.3252,0.3252,0.1758,0.2533,0.2643,0.1075,0.3858,0.3636,0.3728,0.461,0.2616,0.0785,0.0843,0.0636,0.8268,0.0333,0.8464,0.7951,0.875,0.0947,0.0333,0.0693,0.1413,0.0781,0.0865,0.0667,0.0904,0.1025,0.0154


### 10. Cambiar nombres columnas

In [153]:
nuevos_nombres = {
    'a_o': 'anio',
    'c_digo_municipio': 'codigo_municipio',
    'municipio': 'nombre_municipio',
    'c_digo_departamento': 'codigo_departamento',
    'departamento': 'nombre_departamento',
    'poblaci_n_5_16': 'poblacion_5_16',
    'total_matriculados': 'total_matriculados',
    'tasa_matriculaci_n_5_16': 'tasa_matriculacion_5_16',
    'cobertura_neta': 'cobertura_neta_total',
    'cobertura_neta_transici_n': 'cobertura_neta_transicion',
    'cobertura_neta_primaria': 'cobertura_neta_primaria',
    'cobertura_neta_secundaria': 'cobertura_neta_secundaria',
    'cobertura_neta_media': 'cobertura_neta_media',
    'cobertura_bruta': 'cobertura_bruta_total',
    'cobertura_bruta_transici_n': 'cobertura_bruta_transicion',
    'cobertura_bruta_primaria': 'cobertura_bruta_primaria',
    'cobertura_bruta_secundaria': 'cobertura_bruta_secundaria',
    'cobertura_bruta_media': 'cobertura_bruta_media',
    'deserci_n': 'tasa_desercion_total',
    'deserci_n_primaria': 'tasa_desercion_primaria',
    'deserci_n_secundaria': 'tasa_desercion_secundaria',
    'aprobaci_n': 'tasa_aprobacion_total',
    'aprobaci_n_transici_n': 'tasa_aprobacion_transicion',
    'aprobaci_n_primaria': 'tasa_aprobacion_primaria',
    'aprobaci_n_secundaria': 'tasa_aprobacion_secundaria',
    'aprobaci_n_media': 'tasa_aprobacion_media',
    'reprobaci_n': 'tasa_reprobacion_total',
    'reprobaci_n_transici_n': 'tasa_reprobacion_transicion',
    'reprobaci_n_primaria': 'tasa_reprobacion_primaria',
    'reprobaci_n_secundaria': 'tasa_reprobacion_secundaria',
    'reprobaci_n_media': 'tasa_reprobacion_media',
    'repitencia': 'tasa_repitencia_total',
    'repitencia_transici_n': 'tasa_repitencia_transicion',
    'repitencia_primaria': 'tasa_repitencia_primaria',
    'repitencia_secundaria': 'tasa_repitencia_secundaria',
    'repitencia_media': 'tasa_repitencia_media'
}

df_raw.rename(columns=nuevos_nombres, inplace=True) ## aca renombra
df_raw.head(2)

Unnamed: 0,anio,codigo_municipio,nombre_municipio,codigo_departamento,nombre_departamento,poblacion_5_16,total_matriculados,tasa_matriculacion_5_16,cobertura_neta_total,cobertura_neta_transicion,cobertura_neta_primaria,cobertura_neta_secundaria,cobertura_neta_media,cobertura_bruta_total,cobertura_bruta_transicion,cobertura_bruta_primaria,cobertura_bruta_secundaria,cobertura_bruta_media,tasa_desercion_total,tasa_desercion_primaria,tasa_desercion_secundaria,tasa_aprobacion_total,tasa_aprobacion_transicion,tasa_aprobacion_primaria,tasa_aprobacion_secundaria,tasa_aprobacion_media,tasa_reprobacion_total,tasa_reprobacion_transicion,tasa_reprobacion_primaria,tasa_reprobacion_secundaria,tasa_reprobacion_media,tasa_repitencia_total,tasa_repitencia_transicion,tasa_repitencia_primaria,tasa_repitencia_secundaria,tasa_repitencia_media
0,2023,5004,Abriaquí,5,Antioquia,503.0,314.9786,0.6262,0.6262,0.4419,0.6333,0.5153,0.4023,0.668,0.5814,0.7286,0.6687,0.5632,0.0119,0.0131,0.0,0.9226,0.0,0.9673,0.8349,0.9388,0.0655,0.0,0.0196,0.1651,0.0204,0.0952,0.0,0.1046,0.1376,0.0204
1,2023,95025,El Retorno,95,Guaviare,4438.0,2364.1226,0.5327,0.5327,0.3391,0.4889,0.449,0.213,0.6298,0.542,0.6519,0.696,0.4854,0.0556,0.0499,0.0611,0.8767,0.0,0.879,0.845,0.9298,0.0678,0.0,0.0711,0.0939,0.0175,0.0934,0.0695,0.1184,0.0848,0.0316


La base de datos ya ha sido completamente analizada y limpiada, y está lista para el proceso de ejercicio en SQL. Este ejercicio se centrará en la construcción del modelo dimensional y en la resolución de preguntas.