<h1 align="center">Health specific activity 1</h1>

In this activity, we are going to analyse aged care data, available at
https://www.gen-agedcaredata.gov.au/Resources/Access-data/2017/August/GEN-Data-People-using-aged-care.
The aim of this activity is to illustrate how we can use the fundamental Python built-in data structures (lists, sets, dictionaries) to structure information collected from data.

The second set of data, *People using aged care services, 30 June 2013–2016 (CSV, 67.3 MB)*, should be downloaded and saved in the directory where this Jupyter notebook is run.

Next week we will revisit this activity, introducing additional programming techniques and Python syntax. In particular, we will provide details on reading data from CSV files and on the encoding issues that can arise. We avoid those issues here thanks to a function that we provide for you to use, together with a second function that will prove useful at the end of this activity. You are not expected to understand how these two functions do their work: just run the contents of the following cell. We leave it to next week to understand the code in this cell.

In [None]:
import csv

def read_data_from_csv_file(csv_filename):
    with open(csv_filename, encoding = 'Windows-1252') as csv_file:
        return [record for record in csv.reader(csv_file)]

def age_group_lower_bound(age_group):
    lower_bound = 0
    for c in age_group:
        if not c.isdigit():
            break
        lower_bound = lower_bound * 10 + int(c)
    return lower_bound

Let us call the first function, providing as argument the name of the file that contains our data. As the file is very large, it takes a few seconds for all the data to be read. When you run the contents of the following cell, you first see a star between the pair of square brackets, which means that computation is in progress. When computation is over, the star changes to a number: 2 if this is the second cell execution of the session.

In [None]:
aged_care_data = read_data_from_csv_file('People_2013to2016_GENdata.csv')

*aged_care_data* is a list. We can find out how many elements it contains thanks to the **len** function:

In [None]:
print(len(aged_care_data))

More precisely, *aged_care_data* is a list of lists, as we can see when we examine *aged_care_data*'s elements. Let us start with the first one:

In [None]:
print(aged_care_data[0])

We see it is the list of names of all fields. We could count the number of fields manually, but a better way is to use **len** again:

In [None]:
print(len(aged_care_data[0]))

Let us now examine *aged_care_data*'s second element:

In [None]:
print(aged_care_data[1])

That is our first record, with 12 strings as values for the 12 associated fields. We see that the list of values for the first record contains 2 empty strings, for 2 missing values.
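Missing values can also be counted rather than spotted by eye. Here is a small sketch that uses the list method **count** on a made-up record (invented for illustration, not an actual row of the dataset):

```python
# Hypothetical record, invented for illustration; not an actual row of the dataset.
sample_record = ['2016', 'NSW', '', '65-69', 'F', '', '12']

# list.count returns how many elements are equal to its argument.
nb_of_missing_values = sample_record.count('')
print(nb_of_missing_values)
```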

Let us examine *aged_care_data*'s second record (so its third element):

In [None]:
print(aged_care_data[2])

It is not too different from the first record. What about the last record? Of course, knowing thanks to **len** the number of elements in *aged_care_data*, we could get the index of the last record (the value returned by **len** applied to *aged_care_data*, minus 1). But it is simpler to index *aged_care_data* from the end rather than from the beginning, starting with -1 rather than with 0:

In [None]:
print(aged_care_data[-1])

We see that more values are missing for the last record. What about the penultimate record? When moving from left to right, list indices increase by 1; when moving from right to left, list indices decrease by 1:

In [None]:
print(aged_care_data[-2])

No better than the last record in terms of missing values...
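The correspondence between indexing from the end and indexing from the beginning can be checked on a short made-up list (the list is an assumption chosen for illustration):

```python
# Made-up list, for illustration only.
letters = ['a', 'b', 'c', 'd', 'e']

# Index -1 denotes the same element as index len(letters) - 1,
# index -2 the same element as index len(letters) - 2, and so on.
print(letters[-1], letters[len(letters) - 1])
print(letters[-2], letters[len(letters) - 2])
```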

Let us focus on the records for the year 2016: we want to create a list of all records for the year 2016, ignoring all other years. For that purpose, we create an empty list, *records_for_2016*, we go through all elements of *aged_care_data* thanks to a **for** loop, we test whether the first string of the list we are processing is '2016', and in case it is, we append the list of all other attributes to *records_for_2016* (we will write a better, more "Pythonic" version of this code next week):

In [None]:
records_for_2016 = []
for record in aged_care_data:
    if record[0] == '2016':
        records_for_2016.append(record[1: ])

Note how we keep track of all fields except the first one using the syntax of slices. Here are a few examples of that syntax:

In [None]:
record = aged_care_data[0]

print(record)
print(record[1: ]) # All elements starting from index 1 included
print(record[2: ]) # All elements starting from index 2 included
print(record[: -1]) # All elements up to index -1 excluded
print(record[: -2]) # All elements up to index -2 excluded
print(record[1: -2]) # All elements starting from index 1 included up to index -2 excluded

We know how to find out how many elements we have in *records_for_2016*:

In [None]:
print(len(records_for_2016))

So a bit more than one record out of four in the whole dataset is for the year 2016.
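Such a proportion can be computed by dividing one length by the other. Here is a minimal sketch on a made-up miniature dataset (field names and values are invented, and only mimic the shape of the real data):

```python
# Made-up miniature dataset: a header line followed by 4 records.
toy_data = [['YEAR', 'STATE', 'COUNT'],
            ['2015', 'NSW', '3'],
            ['2016', 'NSW', '4'],
            ['2016', 'ACT', '1'],
            ['2014', 'VIC', '2']
           ]

toy_records_for_2016 = []
for record in toy_data:
    if record[0] == '2016':
        toy_records_for_2016.append(record[1: ])

# The header line is not a record, so it is excluded from the total.
proportion = len(toy_records_for_2016) / (len(toy_data) - 1)
print(toy_records_for_2016)
print(proportion)
```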

Let us now organise the data for 2016, grouping them by state (the first field). For that purpose, we are going to build a dictionary, which will map each state to the list of records for that state (omitting the state from the records to avoid unnecessary redundancy, as we got rid of '2016' in all members of *records_for_2016*). Eventually, the dictionary should have as keys all strings that denote a state:

In [None]:
records_per_state = {}
for state, *other_fields in records_for_2016:
    if state not in records_per_state:
        records_per_state[state] = [other_fields]
    else:
        records_per_state[state].append(other_fields)

There is a bit of mysterious syntax in the code we wrote. A few examples will do better than a long explanation and illuminate how the * symbol is used to indicate that "all data there" have to be collected together in a list:

In [None]:
first_field, *last_fields = records_for_2016[0]
print(records_for_2016[0])
print(first_field)
print(last_fields)

In [None]:
*first_fields, second_to_last_field, last_field = records_for_2016[0]
print(records_for_2016[0])
print(first_fields)
print(second_to_last_field)
print(last_field)

In [None]:
first_field, *middle_fields, second_to_last_field, last_field = records_for_2016[0]
print(records_for_2016[0])
print(first_field)
print(middle_fields)
print(second_to_last_field)
print(last_field)

Also note in the code that computes *records_per_state* how we create a list with only one record the first time we see a state, which becomes a new key in our dictionary, and how we append new records to that list when we process new records for that same state. (We will see a more elegant way to proceed next week.) Here is some extra code to illustrate:

In [None]:
D = {}
D['ACT'] = ['First record for ACT']
print(D)
D['ACT'].append('Second record for ACT')
print(D)
D['NSW'] = ['First record for NSW']
print(D)
D['ACT'].append('Third record for ACT')
print(D)
D['NSW'].append('Second record for NSW')
print(D)
D['SA'] = ['First record for SA']
print(D)
D['ACT'].append('Fourth record for ACT')
print(D)
D['SA'].append('Second record for SA')
print(D)

How many records do we get per state? The call to **print** in the code below **f**ormats the string between quotes by printing out *state*, then a tab, then *len(records_per_state[state])*, with *state* ranging over all keys of our dictionary in sorted order:

In [None]:
for state in sorted(records_per_state):
    print(f'{state}\t{len(records_per_state[state])}')

We see that besides all Australian states, there is a "blank" state, for 2 records. We actually know what these two records are: the last two, for which the first three fields are empty strings.

A dictionary does not keep its keys sorted: if we output them, they are displayed in the order in which they were inserted (in older versions of Python, in arbitrary order). Using the **sorted** function, we obtained the sorted list of all keys.
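The effect of **sorted** on a dictionary can be checked on a small made-up example. Note that in Python 3.7 and later, keys come out in the order in which they were inserted, which is generally not sorted order:

```python
# Made-up dictionary, for illustration only.
D = {'NSW': 3, 'ACT': 1, '': 2, 'VIC': 5}

print(list(D))    # Keys in insertion order (Python 3.7 and later)
print(sorted(D))  # New list of keys in sorted order; D itself is unchanged
```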

Let us organise the data for 2016 further, by state (the first field) and, for each state, by gender (the fourth to last field). For that purpose, we are going to build a dictionary, which will map each state to a dictionary, which will map each gender to the proportion of records for that state that have that gender. Examination of the data reveals that the possible values for gender are 'M', 'F' and 'U' (we will improve the code and not make use of that "knowledge" when we revisit the code next week):

In [None]:
gender_distribution_per_state = {}
for state in records_per_state:
    if state not in gender_distribution_per_state:
        gender_distribution_per_state[state] = {'M': 0, 'F': 0, 'U': 0}
    for *_, gender, _, _, _ in records_per_state[state]:
        gender_distribution_per_state[state][gender] += 1
    for gender in gender_distribution_per_state[state]:
        gender_distribution_per_state[state][gender] /= len(records_per_state[state])

Let us see the proportions across genders for each state:

In [None]:
for state in sorted(gender_distribution_per_state):
    print('State:', state)
    for gender in sorted(gender_distribution_per_state[state]):
        print(f'\t{gender}: {gender_distribution_per_state[state][gender] * 100:.2f}%')

Note how in *{gender_distribution_per_state[state][gender] &ast; 100:.2f}*, the proportion of a given gender in a given state, namely, *gender_distribution_per_state[state][gender]*, is multiplied by 100 and formatted as a **f**loating point number with **2** decimal digits after **.**.
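The same kind of format specification can be tried on a single made-up value. The last line uses the **%** presentation type, a variant not used in this activity, which multiplies by 100 and appends the percent sign by itself:

```python
proportion = 0.123456  # Made-up value, for illustration only

two_digits = f'{proportion * 100:.2f}%'  # 2 digits after the decimal point
no_digit = f'{proportion * 100:.0f}%'    # No digit after the decimal point
percent_type = f'{proportion:.1%}'       # % type: times 100, 1 digit, then %
print(two_digits)
print(no_digit)
print(percent_type)
```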

Let us organise the data for 2016 even further, by state (the first field), for each state, by age group (the fifth to last field), and for each age group, by gender (the fourth to last field). For that purpose, we are going to build a dictionary, which will map each state to a dictionary, which will map each age group to a dictionary, which will map each gender to the proportion of records for that state and that age group that have that gender:

In [None]:
# Creates the dictionary {'': None, 'ACT': None, 'NSW': None, 'NT': None, 'QLD': None,
#                         'SA': None, 'TAS': None, 'VIC': None, 'WA': None
#                        }
gender_distribution_per_state_and_age_group = dict.fromkeys(records_per_state)
# Changes the dictionary to {'': {}, 'ACT': {}, 'NSW': {}, 'NT': {}, 'QLD': {},
#                            'SA': {}, 'TAS': {}, 'VIC': {}, 'WA': {}
#                           }
for state in gender_distribution_per_state_and_age_group:
    gender_distribution_per_state_and_age_group[state] = {}
for state in records_per_state:
    for *_, age_group, gender, _, _, _ in records_per_state[state]:
        if age_group not in gender_distribution_per_state_and_age_group[state]:
            gender_distribution_per_state_and_age_group[state][age_group] = {'M': 0, 'F': 0, 'U': 0}
        gender_distribution_per_state_and_age_group[state][age_group][gender] += 1
    for age_group in gender_distribution_per_state_and_age_group[state]:
        tally = sum(gender_distribution_per_state_and_age_group[state][age_group].values())
        for gender in gender_distribution_per_state_and_age_group[state][age_group]:
            gender_distribution_per_state_and_age_group[state][age_group][gender] /= tally

Note how the number of records for a given state and age group (its tally) is computed. Here is some extra code to illustrate the use of **values** and **sum**:

In [None]:
D = {'A': 10, 'B': 7, 'C': 23, 'D': 10}
print(D.values())
print(sum(D.values()))
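The whole pattern of counting first and then dividing by the tally can be tried on a handful of made-up (age group, gender) pairs. This is only a sketch on invented data, reduced to one level of grouping:

```python
# Made-up (age group, gender) pairs, for illustration only.
toy_records = [('65-69', 'F'), ('65-69', 'M'), ('70-74', 'F'),
               ('65-69', 'F'), ('70-74', 'F')
              ]

distribution = {}
# First pass: count the records for each age group and gender.
for age_group, gender in toy_records:
    if age_group not in distribution:
        distribution[age_group] = {'M': 0, 'F': 0, 'U': 0}
    distribution[age_group][gender] += 1
# Second pass: divide each count by the tally for its age group.
for age_group in distribution:
    tally = sum(distribution[age_group].values())
    for gender in distribution[age_group]:
        distribution[age_group][gender] /= tally
print(distribution)
```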

Let us see the proportions across genders for NSW, for each age group:

In [None]:
for age_group in sorted(gender_distribution_per_state_and_age_group['NSW'], key = age_group_lower_bound):
    print(f'\t{age_group}:')
    for gender in sorted(gender_distribution_per_state_and_age_group['NSW'][age_group]):
        print(f'\t\t{gender}: {gender_distribution_per_state_and_age_group["NSW"][age_group][gender] * 100:.2f}%')

In the last code snippet, we used the second function we defined at the beginning. We will understand its purpose and how it works next week.
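Without going into how it works, its effect can already be observed on a few made-up age group labels (the labels below are assumptions chosen for illustration, and need not match those of the dataset). The function is repeated so that the sketch runs on its own:

```python
def age_group_lower_bound(age_group):
    lower_bound = 0
    for c in age_group:
        if not c.isdigit():
            break
        lower_bound = lower_bound * 10 + int(c)
    return lower_bound

# Made-up labels, for illustration only.
labels = ['65-69', '100+', '5-9']

# Without a key, strings are compared character by character.
sorted_as_strings = sorted(labels)
# With the key, labels are compared by the number they start with.
sorted_by_lower_bound = sorted(labels, key = age_group_lower_bound)
print(sorted_as_strings)
print(sorted_by_lower_bound)
```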