# Census Dissemination Areas


## Overview

Let's explore the data with [csvkit](http://csvkit.readthedocs.io/en/0.9.1/).

### List all (470) column names

Save the list to [columns.txt](columns.txt):

```
csvcut -n census_dissemination.csv > columns.txt
```

In [None]:
%cd ../sources/canada/census-dissemination/

### Look at the first 3 columns of the first 4 rows

In [None]:
%%bash
head -n 5 census_dissemination.csv | csvcut -c 1,2,3 | csvlook

### Find Geography values starting with a letter

In [None]:
%%bash
csvcut -c 1,2 census_dissemination.csv | csvgrep -c 1 -r "^[A-Z]" | csvlook
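
To see what the `^[A-Z]` pattern selects, here is a minimal Python sketch of the same kind of filter using only the standard library. The sample rows are invented for illustration, and the real Geography values may be formatted differently. The assumption is that named places start with a capital letter while the remaining rows start with a digit.

```python
import csv
import io
import re

# Hypothetical sample shaped like the output of `csvcut -c 1,2`; values are invented.
sample = io.StringIO(
    'Geography,"Population, 2011"\n'
    "Waterloo (3530)   00000,1000\n"
    "35300001 (35300001)   00000,400\n"
    "North Dumfries (3530004)   00000,600\n"
)

starts_with_letter = re.compile(r"^[A-Z]")

reader = csv.reader(sample)
header = next(reader)

# Keep rows whose first column starts with a capital letter,
# mirroring `csvgrep -c 1 -r "^[A-Z]"`.
matches = [row for row in reader if starts_with_letter.search(row[0])]

for row in matches:
    print(row)
```

Only the two named rows survive the filter; the row that starts with a digit is dropped.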

### Population

The values in the "Population, 2011" and "Total population by age groups" columns don't quite match:

In [None]:
%%bash
head -n 10 census_dissemination.csv | csvcut -c 1,2,6 | csvlook


### Create a new file with line numbers added

In [None]:
%%bash
csvcut -l census_dissemination.csv > census_indexed.csv
head -n 5 census_indexed.csv | csvcut -c 1,2,3 | csvlook
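
The `-l` flag prepends a `line_number` column that counts data rows from 1. As a rough illustration of that behaviour (a standard-library sketch, not how csvkit is implemented), the same transformation on a small invented sample looks like this:

```python
import csv
import io

# Hypothetical two-row sample; the column names and values are invented.
source = io.StringIO("Geography,Population\nA,10\nB,20\n")
indexed = io.StringIO()

reader = csv.reader(source)
writer = csv.writer(indexed, lineterminator="\n")

# The header gains a new first column; data rows are numbered from 1.
writer.writerow(["line_number"] + next(reader))
for number, row in enumerate(reader, start=1):
    writer.writerow([number] + row)

print(indexed.getvalue())
```

Because the header is not counted, line number 1 refers to the first data row.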


### Create new files with groups of related columns

In [None]:
%%bash
# dwellings and land area
csvcut -c 1,2,3,4,5,6 census_indexed.csv > census_dwellings.csv
csvcut -n census_dwellings.csv


In [None]:
%%bash
# age groups
csvcut -c 1-3,7-32 census_indexed.csv > census_age_groups.csv
csvcut -n census_age_groups.csv


### Joining data files

In [None]:
%%bash
csvjoin -c "line_number" census_age_groups.csv census_dwellings.csv | \
csvgrep -c 2 -r "^[A-Z]" | csvcut -c 1,2,5,35 > census_join_example.csv
csvlook census_join_example.csv
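
By default `csvjoin` performs an inner join: only rows whose key appears in both files are kept. The sketch below shows the same idea in plain Python on invented rows. The column names `0 to 4 years` and `Land area` are placeholders, not necessarily the names used in the census file.

```python
# Hypothetical rows keyed by line_number; names and values are invented.
age_groups = [
    {"line_number": "1", "Geography": "Waterloo (3530)   00000", "0 to 4 years": "50"},
    {"line_number": "2", "Geography": "35300001 (35300001)   00000", "0 to 4 years": "20"},
    {"line_number": "3", "Geography": "35300002 (35300002)   00000", "0 to 4 years": "30"},
]
dwellings = [
    {"line_number": "1", "Land area": "25.0"},
    {"line_number": "2", "Land area": "5.0"},
]

# Index the right-hand table by its key, then keep only the left-hand rows
# that have a matching key (inner join).
by_line = {row["line_number"]: row for row in dwellings}
joined = [
    {**left, **by_line[left["line_number"]]}
    for left in age_groups
    if left["line_number"] in by_line
]

for row in joined:
    print(row)
```

Line 3 has no partner in the second table, so it does not appear in the result.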


## Agate

In [None]:
import agate

# we can specify new column names for census_join_example.csv
column_names = ["id", "municipality", "population", "area"]

# load the csv into agate with the new column names
preschoolers = agate.Table.from_csv("census_join_example.csv", column_names)

preschoolers.print_table()

In [None]:
# we can refer to column values by name (row["id"]) or index (row[0])
for row in preschoolers.rows:
    municipality = row["municipality"].split(" (")[0] # extract just the name
    density = int(row["population"]/row["area"])
    print("{} has {} preschoolers per square kilometre".format(municipality, density))

### Group Dissemination Areas by Municipality

Lines 3-19 appear to be dissemination areas for North Dumfries.

In [None]:
%%bash
head -n 21 census_indexed.csv | csvcut -c 1,2 > census_geography_example.csv
csvlook census_geography_example.csv


In [None]:
%%bash
csvcut -c 1,2 census_indexed.csv | csvgrep -c 2 -r "^[A-Z]" > census_municipalities.csv
csvlook census_municipalities.csv

In [None]:
%%bash
csvcut -c 1,2 census_indexed.csv > census_areas_raw.csv
head -n 10 census_areas_raw.csv | csvlook

In [None]:
# helper functions
import re

def match_area_id(text):
    """
    Extract area/name and identifier from Census Geography column

    match_area_id("Waterloo (3530)   00000") == ("Waterloo", "3530")
    """

    # name, then the identifier in parentheses, then a trailing code
    _match = re.match(r"(.*) [(](.*)[)] .*", text)
    return _match.groups()

def get_ranges(start_values, all_values):
    """
    Use list of start_values to create ranges of items to extract from all_values

    get_ranges([1,3], [1,2,3,4,5]) == [(1, 2), (3, 5)]
    """

    # each range stops one short of the next start; the last runs to the end
    stop_values = [value - 1 for value in start_values[1:]]
    stop_values.append(len(all_values))

    # return a list so the result can be compared and reused on Python 3
    return list(zip(start_values, stop_values))

# load data
municipalities = agate.Table.from_csv("census_municipalities.csv")

areas = agate.Table.from_csv("census_areas_raw.csv")

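# first row of municipalities table is Waterloo Region
region_row = municipalities.rows[0]
region_id = match_area_id(region_row["Geography"]) # (name, identifier) tuple


# line_number counts from 1, so a municipality's line number is also the
# 0-based index of the first dissemination area listed after it
start_values = [int(row["line_number"]) for row in municipalities.rows[1:]] # skip first row
ranges = get_ranges(start_values, areas.rows)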

names = [match_area_id(row["Geography"]) for row in municipalities.rows[1:]]
named_ranges = zip(names, ranges)

column_values = []
for n, r in named_ranges:
    start, stop = r
    for row in areas.rows[start:stop]:
        _, _id = match_area_id(row["Geography"])
        columns = list(region_id)
        columns.extend(list(n))
        columns.append(_id)
        column_values.append(columns)

column_names = ["Region", "region_id", "Municipality", "municipality_id", "area_id"]

areas_table = agate.Table(column_values, column_names)
areas_table.to_csv("census_areas.csv")
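
The slicing in the loop depends on an off-by-one coincidence that is easy to miss: line numbers start at 1 while Python indexes start at 0. A small worked example with invented names may make this clearer. It uses a copy of `get_ranges` that returns a list so the result can be inspected directly.

```python
def get_ranges(start_values, all_values):
    # Copy of the helper, returning a list instead of a lazy zip object.
    stop_values = [value - 1 for value in start_values[1:]]
    stop_values.append(len(all_values))
    return list(zip(start_values, stop_values))

# Hypothetical Geography column: a region, then two municipalities,
# each followed by its dissemination areas.
rows = ["Region", "Town A", "a1", "a2", "Town B", "b1"]

# Numbered from 1, "Town A" is line 2 and "Town B" is line 5.
start_values = [2, 5]

ranges = get_ranges(start_values, rows)

# rows[2] is the first area after "Town A" (line 2), and the stop value 4
# is the index of "Town B" (line 5), which the slice excludes.
groups = [rows[start:stop] for start, stop in ranges]

print(ranges)
print(groups)
```

Each slice therefore starts just after a municipality row and ends just before the next one, which is why the municipality rows themselves never end up in `column_values`.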

In [None]:
%%bash
head -n 5 census_areas.csv | csvlook