# Processing of census extended survey data

**Objective:** Load the survey data into a categorical data frame with explicit categories for each column. Categories must match the census constraints.

Extended survey data is provided in two tables, one with people data and one with household data.
Each person in the people table has a household ID that identifies people belonging to the same household.
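As a minimal sketch of how the two tables relate, the toy example below joins people to households on `ID_VIV`. The data values and the `NUM_CUARTOS` column are made up for illustration; only `ID_PERSONA`, `ID_VIV`, and `FACTOR` are names taken from the survey.

```python
import pandas as pd

# Toy stand-ins for the two survey tables (all values are made up).
personas = pd.DataFrame({
    "ID_PERSONA": [1, 2, 3],
    "ID_VIV": [10, 10, 20],
    "FACTOR": [50, 50, 80],
})
viviendas = pd.DataFrame({
    "ID_VIV": [10, 20],
    "NUM_CUARTOS": [3, 2],  # hypothetical household-level column
})

# Attach household attributes to each person; `validate` raises an error
# if a household ID appears more than once in the household table.
merged = personas.merge(viviendas, on="ID_VIV", how="left", validate="many_to_one")

# Number of people sharing each household ID.
household_size = personas.groupby("ID_VIV")["ID_PERSONA"].count()
```

Using `validate="many_to_one"` is a cheap way to catch duplicated household rows before they silently inflate the people table.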

The census extended survey is representative at the municipality level and is provided with expansion weights that match the municipality population living in private dwellings. It excludes the population in collective dwellings and homeless people, both of which are included in census statistics.
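One way to sanity-check the weights is to sum `FACTOR` per municipality and compare the result with the census population in private dwellings. The sketch below uses made-up values and assumes the people table carries the municipality code in `MUN`.

```python
import pandas as pd

# Toy people table: municipality code and survey weight (values are made up).
personas = pd.DataFrame({
    "MUN": [1, 1, 2],
    "FACTOR": [120, 80, 200],
})

# The sum of weights estimates the population each municipality's sample represents.
weighted_pop = personas.groupby("MUN")["FACTOR"].sum()
```

Because collective dwellings and homeless people are excluded from the survey, these totals should be expected to fall slightly below the full census counts.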

Valid categories for each variable are provided in `survey_categories.yaml` as a map from the original codes to the actual category values.
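The sketch below shows one possible way to apply such a map to a raw column. It uses the integer codes of `ASISTEN` as an example; the raw values are made up, and this is not necessarily how `extended_survey` implements the step.

```python
import pandas as pd

# Code-to-label map for one variable, as it might be read from the categories file.
asisten_map = {1: "Sí", 3: "No", 9: "No especificado"}

# Raw survey codes (made up); 7 is deliberately not a valid code.
raw = pd.Series([1, 3, 9, 1, 7], name="ASISTEN")

# Fix the category set up front so that every valid label is part of the dtype,
# even if it never appears in the data.
dtype = pd.CategoricalDtype(categories=list(asisten_map.values()))

# Codes missing from the map become NaN, which makes invalid values easy to spot.
asisten = raw.map(asisten_map).astype(dtype)
```

Counting the missing values after the mapping (`asisten.isna().sum()`) gives a quick validation of how many records carried codes outside the declared categories.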

The special variables `ID_PERSONA`, `ID_VIV`, and `FACTOR` hold the person id, the household id, and the original survey weight respectively.

The numerical variable `NUMPER` holds the person id within the household and is otherwise not needed.

The numerical variable `EDAD` holds age, with a maximum of 130 years; 999 encodes missing values.
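A minimal sketch of handling the sentinel, assuming `EDAD` is read as a plain integer column; the sample values are made up.

```python
import pandas as pd

# Toy age column; 999 is the missing-value sentinel.
edad = pd.Series([0, 35, 130, 999], name="EDAD")

# Replace the sentinel with a missing value and keep a nullable integer dtype.
edad_clean = edad.mask(edad == 999).astype("Int64")

# All remaining ages should fall in the documented range.
ages_in_range = bool(edad_clean.dropna().between(0, 130).all())
```

The nullable `Int64` dtype keeps ages as integers while still allowing missing entries, which a plain `int64` column cannot represent.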

All other variables are categorical and are encoded as such: the categories are part of the data type, and the codes correspond to the original ones. TODO

In [73]:
import yaml
from pathlib import Path
from yaml import CLoader

In [1]:
import sys
sys.path.append('../src/')

In [11]:
from categories import load_mun_defs, defs

In [83]:
with open("../src/survey_categories.yaml", 'r') as file:
    categories = yaml.full_load(file)
categories["ASISTEN"]

{1: 'Sí', 3: 'No', 9: 'No especificado', (0, 0): 'a'}

## Original version, with data validation.

In [36]:
import pandas as pd

In [59]:
pd.read_csv('../data/cuestionario_ampliado/Censo2020_CA_nl_csv/Personas19.CSV').MUN.value_counts(dropna=False).sort_index()

MUN
1      2937
2      3321
3      1392
4      4942
5      4435
6     18513
7      4587
8      3630
9      6422
10    15473
11     3185
12     9167
13     3083
14     6438
15     1344
16     3228
17     6569
18    19134
19     4847
20     2715
21    16130
22     4398
23     1794
24     3537
25     9890
26    17045
27     1921
28     1382
29     3395
30     3269
31    26018
32     3073
33     7483
34     2654
35     1401
36     4288
37     3413
38     5193
39    18153
40      895
41     7317
42     3061
43     2352
44     5014
45     6019
46    17772
47     4703
48    21105
49     4661
50     1546
51     2422
Name: count, dtype: int64

In [23]:
from extended_survey import process_people_df, process_places_df, categorize_p, categorize_v

# Define data paths
personas_path = Path('../data/cuestionario_ampliado/Censo2020_CA_nl_csv/Personas19.CSV')
viviendas_path = Path('../data/cuestionario_ampliado/Censo2020_CA_nl_csv/Viviendas19.CSV')

# Load survey data
personas = process_people_df(personas_path)
viviendas = process_places_df(viviendas_path)

In [32]:
personas.PARENTESCO.value_counts(dropna=False)

PARENTESCO
Hija(o)                                        127303
Jefa(e)                                         98315
Esposa(o)                                       68228
Nieta(o)                                        19521
Nuera o yerno                                    6059
Hermana(o)                                       3730
Sobrina(o)                                       2498
Madre o padre                                    2345
Sin parentesco                                   1603
Cuñada(o)                                        1480
Suegra(o)                                        1225
Otros familiares                                 1172
Hijastra(o)                                       998
Bisnieta(o) o tataranieta(o)                      527
Prima(o)                                          517
Trabajador(a) doméstico(a)                        334
Tía(o)                                            250
NaN                                                95
Abuela(o)        

## Categorized version that matches census constraints

## Additional simplifications to key categories