<h1 align="center">Health specific activity 2</h1>

We revisit last week's activity to understand what we left aside, and to redo part of it in a better way.

Again, the second data set available at https://www.gen-agedcaredata.gov.au/Resources/Access-data/2017/August/GEN-Data-People-using-aged-care, namely, *People using aged care services, 30 June 2013–2016 (CSV, 67.3 MB)*, should be downloaded and saved in the directory from which this Jupyter notebook is run.

This time, let us start from scratch. To read the contents of a csv (for *comma separated values*) file in Python, it is convenient to open the file with the **open** function, which returns a "handle" to the file that can then be passed as an argument to the **reader** function of the **csv** module, provided the latter has been imported:

In [None]:
import csv

with open('People_2013to2016_GENdata.csv') as csv_file:
    aged_care_data = csv.reader(csv_file)

The contents of the file can then be read line by line by calling again and again the **next** function, with *aged_care_data* as argument. The first call to **next** should return the result of parsing the first line of the file:

In [None]:
import csv

with open('People_2013to2016_GENdata.csv') as csv_file:
    aged_care_data = csv.reader(csv_file)
    print(next(aged_care_data))

Well, that does not work: we have an encoding problem. Text can be encoded in many ways, and the encoding that **open** uses by default (**utf-8** on many systems, though the default actually depends on the platform) turns out not to be appropriate here. Another encoding has been used to encode the contents of this file. Encodings are a tricky matter, and somewhat of an advanced topic. But here we have a concrete problem and we have to bite the bullet one way or another! A Google search suggests a method to try and guess the encoding of a file. It uses the **detect** function of the **chardet** module. Let us try it by letting **detect** examine the first line of the file, and then the second line of the file:

In [None]:
import chardet

with open('People_2013to2016_GENdata.csv', 'rb') as csv_file:
    print(chardet.detect(next(csv_file)))
    print(chardet.detect(next(csv_file)))

The first line of the file makes **chardet** believe that we are dealing with **ascii** encoding; the second line makes **chardet** believe that we are dealing with **Windows-1252** encoding. If we open the file in an editor and look at the first two lines, here is what we can get (with some editors):

YEAR,STATE,ACPR CODE,ACPR NAME,PROGCODE,ADMTYPE,HOME CARE LEVEL,AGE GROUP,SEX,ATSI CODE,LAN,COB

2013,NSW,101,Central Coast,106889,,L 2,85\22689,F,,English ,Australia

The first line just contains the names of the fields, and consists of nothing but uppercase letters, spaces and commas. The second line is the first record, containing the comma separated values of the fields for some person. That line contains a strange \226 that we assume stands for a special character, which is not part of the ascii character set. This is probably what gives **chardet** a clue (and makes it believe with 73% confidence) that that line, and therefore the whole file, could be encoded using **Windows-1252**. Armed with that insight, let us modify our original code so that the **open** function does not implicitly use the default encoding, but explicitly uses the **Windows-1252** encoding:

In [None]:
import csv

with open('People_2013to2016_GENdata.csv', encoding = 'Windows-1252') as csv_file:
    aged_care_data = csv.reader(csv_file)
    print(next(aged_care_data))

Good, that works: we have got the list of the names of all fields! Let us now read the first two lines:

In [None]:
import csv

with open('People_2013to2016_GENdata.csv', encoding = 'Windows-1252') as csv_file:
    aged_care_data = csv.reader(csv_file)
    print(next(aged_care_data))
    print(next(aged_care_data))

Still all good. And we discover that \226 was used to encode the en dash (–) that separates the lower and upper bounds of the age group (from 85 to 89).
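
To see what happens at the byte level without needing the data file, here is a small sketch. It assumes that the \226 shown by the editor is an octal escape for the single byte 0x96 (octal 226 is 150 in decimal, that is, 96 in hexadecimal); the variable names are ours, chosen for illustration.

```python
# Assumption: the editor's \226 is an octal escape for the single byte 0x96
raw_age_group = b'85\x9689'

# Windows-1252 maps the byte 0x96 to the en dash, U+2013
decoded = raw_age_group.decode('Windows-1252')
print(decoded)

# As utf-8, 0x96 cannot start a character, so decoding raises an error
try:
    raw_age_group.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as error:
    utf8_ok = False
    print('utf-8 fails:', error)
```

The same byte is therefore a perfectly valid character in one encoding and an error in another, which is why the encoding has to be stated when opening the file.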

If we did not use **csv.reader**, we could, as we did when exploring encodings with **chardet**, read the contents of the file line by line passing *csv_file* as argument to **next**:

In [None]:
with open('People_2013to2016_GENdata.csv', encoding = 'Windows-1252') as csv_file:
    print(next(csv_file))
    print(next(csv_file))

Then what we get is a string, one string per line. Using the **reader** function of the **csv** module and calling **next** on *aged_care_data* rather than on *csv_file*, we get instead, for each line, a list of strings, with as many strings in the list as comma-separated values on the line.
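
The difference can be reproduced without the data file. The sketch below holds the two lines shown earlier in a string and wraps it in **io.StringIO**, which behaves like an open text file; the names *sample*, *raw_line* and *parsed_rows* are ours.

```python
import csv
import io

# The first two lines of the file, held in memory instead of on disk
sample = ('YEAR,STATE,ACPR CODE,ACPR NAME,PROGCODE,ADMTYPE,HOME CARE LEVEL,'
          'AGE GROUP,SEX,ATSI CODE,LAN,COB\n'
          '2013,NSW,101,Central Coast,106889,,L 2,85–89,F,,English ,Australia\n')

# Iterating over the file-like object yields one string per line,
# with the end of line character still attached
raw_line = next(io.StringIO(sample))
print(repr(raw_line))

# Iterating over a csv.reader yields one list of strings per line
parsed_rows = list(csv.reader(io.StringIO(sample)))
for row in parsed_rows:
    print(len(row), row)
```

Note that an empty field, such as ADMTYPE in the record, is kept as an empty string, so every list has one string per field.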

Last week, we created a list to keep track of all records for the year 2016, ignoring all other years. Let us do it in a better way, reading the contents of the file line by line, keeping the records for 2016 and discarding all others. This is more efficient than first creating a list of all records, and then, from that list, creating a new list of records for 2016 only. The way our list is created is a Pythonic construct known as a list comprehension:

In [None]:
with open('People_2013to2016_GENdata.csv', encoding = 'Windows-1252') as csv_file:
    aged_care_data = csv.reader(csv_file)
    records_for_2016 = [other_fields for year, *other_fields in aged_care_data if year == '2016']
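
Here is some practice code to see how the starred name in the list comprehension behaves. The rows are made up and much shorter than the real records; they stand in for the lists produced by **csv.reader**.

```python
# Made-up rows: a header followed by three short records
rows = [['YEAR', 'STATE', 'SEX'],
        ['2015', 'NSW', 'F'],
        ['2016', 'VIC', 'M'],
        ['2016', 'ACT', 'F']]

# year captures the first item of each row, other_fields the rest as a list;
# only the rows whose first item is '2016' are kept
kept = [other_fields for year, *other_fields in rows if year == '2016']
print(kept)
```

The header row needs no special treatment: its first item is 'YEAR', not '2016', so the condition filters it out like any record for another year.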

We surely remember from last week how many records we got, don't we?... Let us check we get the same number:

In [None]:
print(len(records_for_2016))

Before we revisit further what we did last week, let us see what the possible values are for the last field, COB (*country of birth*). We use a set which, in contrast to a list, has no duplicate elements, so it is a good way to "collapse" a collection of (possibly duplicated) values to the collection of distinct values:

In [None]:
print(set(record[-1] for record in records_for_2016))

There is no natural order for the elements in a set. Here we have a set with 4 elements and **print** displays them in an arbitrary order.
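
If a reproducible display is wanted, the set can be passed to **sorted**, which returns a list. A small sketch with made-up values rather than the actual contents of the COB field:

```python
# Made-up values, with duplicates
countries = ['Australia', 'Italy', 'Australia', 'Greece', 'Italy']

distinct = set(countries)
# The number of distinct values
print(len(distinct))
# A list of the distinct values, in alphabetical order
print(sorted(distinct))
```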

Let us see what the possible values are for the second to last field, LAN (*language*):

In [None]:
print(set(record[-2] for record in records_for_2016))

Now back again to what we did last week, when we organised the data for 2016, grouping them by state (the first field) thanks to a dictionary:

In [None]:
from collections import defaultdict

records_per_state = defaultdict(list)
for state, *other_fields in records_for_2016:
    records_per_state[state].append(other_fields)

Let us check this works too:

In [None]:
for state in sorted(records_per_state):
    print(f'{state}\t{len(records_per_state[state])}')

Rather than using a standard dictionary, this time we use a **defaultdict**, imported from the **collections** module. A few examples will do better than long explanations to illustrate the benefits of a **defaultdict** over a standard dictionary. Essentially, when a new key is processed in the context of **defaultdict** with lists as values, a default value, namely, an empty list, is created, to which the first record can be appended:

In [None]:
# Does not work
D = {}
D['ACT'].append('A record for ACT')

In [None]:
# Does work
D = defaultdict(list)
D['ACT'].append('A record for ACT')
print(D)
D['ACT'].append('Another record for ACT')
print(D)
D['ACT'].append('Still another record for ACT')
print(D)
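
For comparison, here is a sketch of how the same grouping could be written with a standard dictionary, using its **setdefault** method. The records are made up, and the names *plain* and *automatic* are ours.

```python
from collections import defaultdict

# Made-up (state, record) pairs
pairs = [('ACT', 'first'), ('NSW', 'second'), ('ACT', 'third')]

# Standard dictionary: setdefault stores an empty list the first time
# a key is seen, and returns the list stored for that key in all cases
plain = {}
for state, record in pairs:
    plain.setdefault(state, []).append(record)

# defaultdict: the empty list is created automatically on first access
automatic = defaultdict(list)
for state, record in pairs:
    automatic[state].append(record)

print(plain)
print(automatic == plain)
```

One thing to keep in mind with a **defaultdict**: merely looking up a missing key with square brackets is enough to create it, with the default value.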

Last week, we then organised the data for 2016 further, grouping them by state (the first field), and for each state, grouping them by gender (the fourth to last field). We examined the data to find out that the possible values for gender are 'M', 'F' and 'U', and made use of that "knowledge" to create our dictionaries. Let us do things better:

In [None]:
gender_distribution_per_state = defaultdict(lambda: defaultdict(int))
for state in records_per_state:
    for *_, gender, _, _, _ in records_per_state[state]:
        gender_distribution_per_state[state][gender] += 1
    for gender in gender_distribution_per_state[state]:
        gender_distribution_per_state[state][gender] /= len(records_per_state[state])

Let us check it still works:

In [None]:
for state in sorted(gender_distribution_per_state):
    print('State:', state)
    for gender in sorted(gender_distribution_per_state[state]):
        print(f'\t{gender}: {gender_distribution_per_state[state][gender] * 100:.2f}%')

We see that the output is actually better, as the category 'U' is output only when there is at least one record (for a given state) that falls in that category.

We improved the code by making use of a **defaultdict** again, together with a lambda expression.

* **defaultdict(list)**: when encountering a new key, automatically create an empty list as value for that key, so we can always append something to the value of the dictionary for any key, new or not.
* **defaultdict(int)**: when encountering a new key, automatically create 0 as value for that key, so we can always add 1 to the value of the dictionary for any key, new or not.
* **defaultdict(lambda: defaultdict(int))**: when encountering a new key, automatically create defaultdict(int) as value for that key, so we can use such a dictionary for any key, new or not.

Let us see some practice code to understand lambda expressions, which are nothing but anonymous (unnamed) functions. Here are three lambda expressions for three functions which take no, one or two arguments, respectively:

In [None]:
f = lambda: 0
print(f())
g = lambda x: x + 1
print(g(2))
h = lambda x, y: x + y
print(h(4, 6))

An empty list is created by calling **list**, 0 is created by calling **int**, and a **defaultdict** with 0 as default value is created by calling the function that the expression **lambda: defaultdict(int)** evaluates to. In each case, the call takes no argument:

In [None]:
print(list(), int(), (lambda: defaultdict(int))())
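
The following sketch puts the pieces together on made-up data, counting genders per state with a two-level **defaultdict**. The names *pairs*, *counts* and *as_plain* are ours.

```python
from collections import defaultdict

# Made-up (state, gender) pairs
pairs = [('NSW', 'F'), ('NSW', 'M'), ('NSW', 'F'), ('ACT', 'U')]

counts = defaultdict(lambda: defaultdict(int))
for state, gender in pairs:
    # counts[state] is created as a defaultdict(int) if state is new,
    # then counts[state][gender] is created as 0 if gender is new
    counts[state][gender] += 1

# Convert to standard dictionaries for a tidier display
as_plain = {state: dict(by_gender) for state, by_gender in counts.items()}
print(as_plain)
```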

Finally, we organised the data for 2016 even further, grouping them by state (the first field), and for each state, grouping them by age group (the fifth to last field), and for each age group, grouping them by gender (the fourth to last field). Again, let us do things better:

In [None]:
gender_distribution_per_state_and_age_group = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
for state in records_per_state:
    for *_, age_group, gender, _, _, _ in records_per_state[state]:
        gender_distribution_per_state_and_age_group[state][age_group][gender] += 1
    for age_group in gender_distribution_per_state_and_age_group[state]:
        tally = sum(gender_distribution_per_state_and_age_group[state][age_group].values())
        for gender in gender_distribution_per_state_and_age_group[state][age_group]:
            gender_distribution_per_state_and_age_group[state][age_group][gender] /= tally

Let us check out what we get for New South Wales:

In [None]:
for age_group in sorted(gender_distribution_per_state_and_age_group['NSW']):
    print(f'\t{age_group}:')
    for gender in sorted(gender_distribution_per_state_and_age_group['NSW'][age_group]):
        print(f'\t\t{gender}: {gender_distribution_per_state_and_age_group["NSW"][age_group][gender] * 100:.2f}%')

These are the same results as last week, but not as well presented, as last week we had the percentages for 100+ displayed last. The issue is that the string '0–49' lexicographically comes before the string '100+', which comes before the string '50–54'. What we need is to order the strings that represent the age groups using the leading number as comparison key:
* 0 for '0–49',
* 100 for '100+',
* 50 for '50–54',
* ...

This is what we did last week when we passed an extra argument to **sorted**, to tell the sorting function: do not use the default ordering on strings (the lexicographic ordering), but instead, map the key '0–49' to 0, map the key '100+' to 100, map the key '50–54' to 50, and order the results based on the values computed from the keys, not from the keys themselves:

In [None]:
def age_group_lower_bound(age_group):
    lower_bound = 0
    for c in age_group:
        if not c.isdigit():
            break
        lower_bound = lower_bound * 10 + int(c)
    return lower_bound

for age_group in sorted(gender_distribution_per_state_and_age_group['NSW'], key = age_group_lower_bound):
    print(f'\t{age_group}:')
    for gender in sorted(gender_distribution_per_state_and_age_group['NSW'][age_group]):
        print(f'\t\t{gender}: {gender_distribution_per_state_and_age_group["NSW"][age_group][gender] * 100:.2f}%')
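
The effect of the *key* argument can also be observed on a plain list of age groups, without the data. The sketch below uses a hypothetical helper, *leading_number*, which is an alternative formulation of *age_group_lower_bound* based on **takewhile** from the **itertools** module, not the function used above; like that function, it returns 0 when the string does not start with a digit.

```python
from itertools import takewhile

def leading_number(age_group):
    # Hypothetical alternative to age_group_lower_bound:
    # keep the characters up to the first non-digit, and convert them to an int
    digits = ''.join(takewhile(str.isdigit, age_group))
    return int(digits) if digits else 0

# Age groups mentioned in this sheet, deliberately out of order
age_groups = ['50–54', '100+', '85–89', '0–49']

# Default (lexicographic) ordering: '100+' comes before '50–54'
lexicographic = sorted(age_groups)
print(lexicographic)

# Ordering by leading number: '100+' comes last
numeric = sorted(age_groups, key=leading_number)
print(numeric)
```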

Here is some practice code to understand how *age_group_lower_bound* does what we want:

In [None]:
x = '5230–5340'
for c in x:
    print(c, end = ' ')
print()

In [None]:
x = '5230–5340'
for c in x:
    if not c.isdigit():
        break
    print(c, end = ' ')
print()

In [None]:
x = '5230–5340'
n = 0
for c in x:
    if not c.isdigit():
        break
    # c is a one character string, one of '0', '1',... '9', that we convert to 0, 1,... 9, respectively
    n = n * 10 + int(c)
    print(n)