In [None]:
import pandas as pd

# Data wrangling in `pandas`
It's rare that data are as you would like them to be, so a process of massaging and cleaning before analysis is almost always a necessary first step. 

Here we'll use data at Statistical Area 1 level for Wellington from the 2023 census. If you've worked with census data before, you'll know it has lots of quirks.

In [None]:
sa1_data = pd.read_csv("2023_Census_totals_by_topic_for_individuals_by_SA1.csv")
sa1_data

**First quirk**: The natural index for these data is the SA1 code. So let's make it so. The column name is not very convenient, so we will rename it first, and also drop the `OBJECTID` column which is simply a sequence number of limited use.

In [None]:
sa1_data_v2 = sa1_data.drop(columns = "OBJECTID")
sa1_data_v2 = sa1_data_v2.rename(
    columns = {"Statistical area 1 (SA1) 2023 code": "sa1_code"})
sa1_data_v2 = sa1_data_v2.set_index("sa1_code")
sa1_data_v2

**Second quirk**: the `Landwater code` and `Landwater name` attributes are redundant. We probably don't need either of them, but before we drop them, we use the values to select only the rows in the data that are 'Mainland':

In [None]:
sa1_data_v3 = sa1_data_v2[sa1_data_v2["Landwater name"] == "Mainland"]
cols_to_drop = [c for c in sa1_data_v3.columns if c.startswith("Landwater")]
sa1_data_v3 = sa1_data_v3.drop(columns = cols_to_drop)
sa1_data_v3

**Third quirk**: not really a quirk, but too much information! Say for present purposes we are really only interested in the 5-year age groups data and not the 500 or so other variables that have come with this dataset.

Using the column name based indexing we just saw above, if we make up a list of only the columns that relate to this aspect of the data, then we can reduce things down further.

In [None]:
age_vars = []
for c in sa1_data_v3.columns:
    if "Year: 2023" in c and "5-year groups" in c:
        age_vars.append(c)

ages_data = sa1_data_v3[age_vars]
ages_data
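
The same selection can be written without a loop by building a boolean mask over the column names. This is only a sketch on a toy table: the column names below are made up to resemble the census ones, not copied from the real file.

```python
import pandas as pd

# Toy frame with made-up column names standing in for the long census names
toy = pd.DataFrame({
    "Count: Year: 2023: Age (5-year groups - 0-4 years)": [3, 6],
    "Count: Year: 2018: Age (5-year groups - 0-4 years)": [0, 9],
    "Count: Year: 2023: Sex - Male": [12, 15],
})

# One True/False value per column; regex=False treats the text literally,
# which matters because the names contain parentheses
is_2023 = toy.columns.str.contains("Year: 2023", regex=False)
is_age = toy.columns.str.contains("5-year groups", regex=False)

# .loc[rows, columns] with a boolean mask keeps only the matching columns
toy_ages = toy.loc[:, is_2023 & is_age]
print(list(toy_ages.columns))
```

Whether the loop or the mask reads better is a matter of taste; the loop is easier to extend with extra conditions while you are exploring.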

**Fourth quirk**: super-long column names. These are helpful in being highly specific, but also extremely inconvenient to work with. So let's write a simple function to rename them. There are a bunch of colons (`:`) and commas (`,`) in the variable names as they stand. It's convenient when working in a notebook to do this in steps.

In [None]:
varname = ages_data.columns[0]
varname

It looks like splitting that on the colons and taking only the last section would be a start.

In [None]:
varname.split(":")[-1]

OK... although notice that leading white space. Probably we only really need the `Age` and `0-4` part of this. Here's one approach:

In [None]:
varname.split(":")[-1].strip().replace("(5-year groups - ", "").replace(" years", "").replace(")", "")

That's close, but probably we don't want whitespace in our variable names (unlike the Census people, apparently...). It is also advisable to avoid hyphens, which can be misinterpreted as minus signs in some settings, and lower case is probably preferable. So... a couple more steps:

In [None]:
varname.split(":")[-1].strip().replace("(5-year groups - ", "").replace(" years)", "").replace(" ", "_").replace("-", "_").lower()

That's more like it! Let's wrap all that up in a function, since we need to apply it to more than one column.

In [None]:
def clean_age_var_name(varname):
    # substrings to delete outright, and characters to turn into underscores
    remove = ["(5-year groups - ", " years", ")"]
    replace = [" ", "-"]
    # keep only the text after the last colon, lower-cased and trimmed
    new_name = varname.split(":")[-1].lower().strip()
    for r in remove:
        new_name = new_name.replace(r, "")
    for r in replace:
        new_name = new_name.replace(r, "_")
    return new_name

clean_age_var_name(varname)

We can use this to make a 'renamer' dictionary which translates old names to new ones.

In [None]:
renamer = {}
for old_name in ages_data.columns:
    renamer[old_name] = clean_age_var_name(old_name)

# recall that we could also do this as a one-liner comprehension
# renamer = {n: clean_age_var_name(n) for n in ages_data.columns}

renamer

And apply it to the data frame using the `rename()` method.

In [None]:
ages_data_final = ages_data.rename(columns = renamer)
ages_data_final

## Restricting the data to Wellington
These data are for all of New Zealand, but we only want Wellington data. To narrow things down we use a table available from Statistics New Zealand at https://datafinder.stats.govt.nz/table/111243-geographic-areas-table-2023/ which shows the relationships between all the various spatial areas designated by Stats NZ made up from meshblocks.

In [None]:
stats_areas = pd.read_csv("geographic-areas-table-2023.csv")
stats_areas.shape

And then use a column selection to keep only the ones we need.

In [None]:
sa1_ur_lookup = stats_areas[["SA12023_code", "UR2023_name"]]
welly_sa1 = sa1_ur_lookup[sa1_ur_lookup.UR2023_name == "Wellington"]
welly_sa1

There are some duplicates here because the table has one row per meshblock, and there are more meshblocks than there are SA1s. We also don't need the `UR2023_name` column any more, so...

In [None]:
welly_sa1 = welly_sa1 \
    .drop(columns = ["UR2023_name"]) \
    .drop_duplicates()
welly_sa1
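
To see why the duplicates arise, here is a small sketch with an invented lookup table in which two meshblocks belong to the same SA1. All of the codes are hypothetical, and `MB2023_code` is an assumed name for the meshblock column.

```python
import pandas as pd

# Hypothetical meshblock and SA1 codes: two meshblocks share SA1 7001
toy = pd.DataFrame({
    "MB2023_code": [101, 102, 103],
    "SA12023_code": [7001, 7001, 7002],
    "UR2023_name": ["Wellington", "Wellington", "Wellington"],
})

# Once the meshblock column is gone, SA1 7001 appears twice
sa1_only = toy[["SA12023_code"]]
print(sa1_only.duplicated().sum())  # how many rows repeat an earlier row

# drop_duplicates() keeps the first occurrence of each distinct row
unique_sa1 = sa1_only.drop_duplicates()
print(unique_sa1)
```

Counting duplicates before dropping them is a cheap sanity check that the number of rows removed is about what you expect.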

If we now set the SA1 code variable to be the index of the `DataFrame` we get a 'bare' table ready to have data of interest joined to it.

In [None]:
welly_sa1 = welly_sa1.set_index("SA12023_code")
welly_sa1

Finally, we can join the data for all of New Zealand (32,601 rows, as we saw above) to the Wellington-only SA1 codes (1408 rows). We will look at table joins in more detail later. The important thing here is that, unless you specify otherwise, `join()` matches rows on the table indexes, which we've made sure are the same (at least they are if Stats NZ have done their job).

In [None]:
welly_ages = welly_sa1.join(ages_data_final)
welly_ages
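
If you want to check that the indexes really do line up, a sketch along the following lines might help. The tables and codes are invented for illustration; the point is that the default left join keeps every row of the table you call `join()` on, and fills in missing values where the other table has no matching index entry.

```python
import pandas as pd

# Hypothetical SA1 codes; the real ones come from the Stats NZ tables
lookup = pd.DataFrame(index=pd.Index([7001, 7002, 7003], name="sa1_code"))
national = pd.DataFrame(
    {"age_0_4": [6, 9, 12]},
    index=pd.Index([7001, 7002, 9999], name="sa1_code"),
)

# Default is how="left": every row of `lookup` is kept, and codes with no
# match in `national` get missing values
joined = lookup.join(national)

# Codes that found no match in the national table
unmatched = joined.index[joined["age_0_4"].isna()]
print(list(unmatched))
```

An empty `unmatched` result on the real data would suggest the two sets of SA1 codes agree, provided the column checked has no missing values of its own.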

With the exception of the specifics of the column renaming step, more or less all of the above are likely to be applicable as a first step in preparing data for analysis. Probably the most important decision above was choosing the index variable and assigning it with `set_index()`. In this case it's an obvious choice. The important thing is that it is unique per record of interest (in this case the SA1s). After that, it's a matter of dropping columns with `drop(columns = ...)`, or simply using the `df[[...]]` method to keep only the columns you want.
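
Uniqueness of the index is easy to check. The sketch below uses a toy table with one code deliberately repeated (the codes are hypothetical) and shows two ways of catching the problem.

```python
import pandas as pd

# Hypothetical codes, with 7002 deliberately repeated
toy = pd.DataFrame({"sa1_code": [7001, 7002, 7002], "age_0_4": [6, 9, 9]})

indexed = toy.set_index("sa1_code")
print(indexed.index.is_unique)  # False, because 7002 appears twice

# verify_integrity=True makes set_index raise instead of accepting duplicates
try:
    toy.set_index("sa1_code", verify_integrity=True)
except ValueError as err:
    print("duplicate index values:", err)
```

Checking `is_unique` after the fact is enough for interactive work; `verify_integrity=True` may be more useful in a script that should stop as soon as something is wrong.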

Renaming using a function and a 'renamer' dictionary as above seems like overkill, but is a useful approach to develop as it allows for the development of complex renaming schemes.
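
As a sketch of how the two pieces relate: `rename()` will also accept the function itself, in which case the dictionary step can be skipped. The `shorten` function and the column names here are invented for illustration and are much simpler than the census cleaner above.

```python
import pandas as pd

def shorten(name):
    # Hypothetical cleaner: keep the text after the last colon, snake_case it
    return name.split(":")[-1].strip().lower().replace(" ", "_")

# Toy table with made-up column names and values
toy = pd.DataFrame({"Year: 2023: Median age": [38.1], "Year: 2023: Sex ratio": [0.97]})

# rename() accepts a function as well as a dictionary
renamed = toy.rename(columns=shorten)
print(list(renamed.columns))

# The dictionary route makes it easy to check for collisions before renaming
renamer = {c: shorten(c) for c in toy.columns}
assert len(set(renamer.values())) == len(renamer), "new names collide"
```

The dictionary is the more inspectable of the two options, which is one reason to keep it while a renaming scheme is still being developed.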

Finally, I've mostly renamed the `DataFrame` at key steps along the way so that it's possible to double back a little bit and rerun a cell. In practical workflows you probably won't do that. It can be a useful technique when you are developing code, however.

## Cleaning data
So we have the data we want... but it's got a bunch of weird values. What are `-999` people? These along with `-997`s are sentinel values Stats NZ use in census data tables to indicate 'Confidential' and 'Not available' respectively. We should consider ways to remove or replace these.

In this case, the easiest method might be to use `.replace` and change all those values to 0.

In [None]:
welly_ages.replace([-999, -997], 0)

That works, but now we can't tell the difference between actual zeros and missing data. `pandas` has a special value `pd.NA` for this purpose, so we can instead do

In [None]:
welly_ages_NA = welly_ages.replace([-999, -997], pd.NA)
welly_ages_NA
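
One thing worth checking after a replacement like this is the column dtype, since putting `pd.NA` into a plain integer column can change it. The sketch below, on an invented two-column table, inspects the dtypes and then converts to `"Int64"`, the nullable integer dtype in `pandas`. Whether that conversion is needed for the real data is an assumption you should verify by looking at `welly_ages_NA.dtypes`.

```python
import pandas as pd

# Toy counts using the same sentinel values as the census table
toy = pd.DataFrame({"age_0_4": [6, -999, 12], "age_5_9": [9, 3, -997]})

with_na = toy.replace([-999, -997], pd.NA)
print(with_na.dtypes)   # inspect what dtype the replacement left behind

# "Int64" (capital I) is the nullable integer dtype, which can hold pd.NA
nullable = with_na.astype("Int64")
print(nullable.dtypes)
print(nullable.sum())   # NA values are skipped by default
```

Keeping the counts in an integer dtype avoids them being displayed or stored as floating point numbers later on.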

This approach gives us options, the most useful being that we can easily tell `pandas` how to handle the NA values in calculations, or further processing.

For example we can drop NA values, with options to remove any row (`axis="index"`, the default), or any column (`axis="columns"`) that has any NA value (`how="any"` the default), or only any row or any column with all NA values (`how="all"`).

In [None]:
# set axis to 'index' or 'columns' to drop either rows or columns
# set how to 'any' to remove rows or columns with any NAs, and to 'all'
# to remove only rows or columns that are all NAs
welly_ages_NA.dropna(axis = "index", how = "any")

You can also specify `thresh=<some number>` to keep only the rows/columns that have at least that number of non-NA values. In this case that is useful, because we can see by setting `axis = "columns", how = "any"` above that the total population column has no NAs, so requiring at least two non-NA values removes exactly the rows that are NA in all the other columns:

In [None]:
# every row has a total, so fewer than 2 non-NA values means all age groups are NA
welly_ages_NA.dropna(axis = "index", thresh = 2)
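
A toy example makes the meaning of `thresh` concrete: it is the minimum number of non-NA values a row (or column) needs in order to be kept. The table and codes below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "age_0_4": [6, pd.NA, pd.NA],
        "age_5_9": [9, 3, pd.NA],
        "age_total": [15, 6, 3],
    },
    index=pd.Index([7001, 7002, 7003], name="sa1_code"),
)

# Keep rows with at least 2 non-NA values: drops 7003 only
at_least_two = toy.dropna(axis="index", thresh=2)

# Keep rows with at least 3 non-NA values: here that means no NAs at all,
# so the result is the same as how="any"
complete = toy.dropna(axis="index", thresh=3)

print(list(at_least_two.index), list(complete.index))
```

Because `thresh` counts values that are present rather than values that are missing, setting it to the number of columns is just another way of writing `how="any"`.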

### Handling NAs
It's worth seeing how to include NA values in calculations. Most aggregation methods (`sum()`, `mean()` and so on) include a `skipna` parameter, which is set to `True` by default. If we set it to `False` then NA values will propagate.

In [None]:
welly_ages_NA.sum(skipna = False)
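
The difference between the two settings is easiest to see side by side. This is a sketch on an invented table, converted to the nullable `"Int64"` dtype so that the missing value is a genuine `pd.NA`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"age_0_4": [6, 9, 12], "age_5_9": [9, pd.NA, 3]}
).astype("Int64")

skipped = toy.sum()                 # default skipna=True ignores the NA
propagated = toy.sum(skipna=False)  # any NA in a column makes its sum NA

print(skipped)
print(propagated)
```

A propagated NA is a useful warning that a total is incomplete, whereas a skipped NA quietly gives you a total of whatever values happen to be present.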

Retaining NA values gives us more control and is worth doing until the point where it prevents analysis.

## What about that total column?
Totals in census data are often not consistent with the contributing columns due to independent rounding of values and confidentiality suppression of some columns. We can check this by taking the difference. Note that we have to use `.iloc[]` here with two slices (rows and columns) to get row totals that exclude the last column, which is the total itself.

In [None]:
welly_ages_NA.iloc[:, :-1].sum(axis = "columns") - welly_ages_NA.age_total

It probably makes sense to drop the total population column for further work (we can easily make a new one if we need it).

In [None]:
welly_ages_final = welly_ages_NA.drop(columns = "age_total")
welly_ages_final
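
Should a total be needed again later, one way to rebuild it from the age-group columns is sketched below on an invented table. Note the assumption built into it: NA values are skipped, so a rebuilt total counts only the groups that were not suppressed.

```python
import pandas as pd

# Toy age-group counts with hypothetical SA1 codes
toy = pd.DataFrame(
    {"age_0_4": [6, 9], "age_5_9": [9, pd.NA]},
    index=pd.Index([7001, 7002], name="sa1_code"),
).astype("Int64")

# assign() returns a new frame, so `toy` itself is left without a total
with_total = toy.assign(age_total=toy.sum(axis="columns"))
print(with_total)
```

A total rebuilt this way will always agree with its contributing columns, which is not true of the published one.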


And since we've done a lot of work to get this far, let's save the result.

In [None]:
welly_ages_final.to_csv(
    "welly-ages-final.csv", index = True, index_label = "sa1_code")