# ABS to Average Persona
ABS (Australian Bureau of Statistics) census data, including:
- Age
- Continent of Birth
- Occupation
- Income


This notebook uses the ***Age*** and ***Continent of Birth*** census data to engineer the following features **by postcode**:
- *Average Age*
- *Quantiles of Age*
- *Count of Millennials and Gen Z aged 18 and over*, as research shows these age groups are the main target customers of the BNPL (buy now, pay later) industry.
- *Percentage of people born in each continent*, as an indicator of cultural background.

It then merges these features with the census data engineered by other group members into one dataframe and writes it out as a CSV file.

In [1]:
import pandas as pd
import numpy as np

## Age

In [2]:
age_df = pd.read_csv('../data/tables/external/by_postcode/1 year age.csv')
age_df.head()

Unnamed: 0,AGEP Age,0,1,2,3,4,5,6,7,8,...,107,108,109,110,111,112,113,114,115,Total
0,"2000, NSW",181,174,145,113,93,78,61,57,54,...,0,0,0,0,0,0,0,0,0,27411
1,"2006, NSW",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1261
2,"2007, NSW",51,25,57,41,27,23,12,26,15,...,0,0,0,0,0,0,0,0,0,8846
3,"2008, NSW",60,45,49,29,15,24,13,19,25,...,0,0,0,0,0,0,0,0,0,11712
4,"2009, NSW",151,145,119,115,77,59,84,76,67,...,0,0,0,0,0,0,0,0,0,12813


**Transpose the dataframe to simplify calculations**

In [3]:
age_df = age_df.set_index('AGEP Age').T
display(age_df.head())
print(f'shape: {age_df.shape[0]} rows, {age_df.shape[1]} cols')

AGEP Age,"2000, NSW","2006, NSW","2007, NSW","2008, NSW","2009, NSW","2010, NSW","2011, NSW","2015, NSW","2016, NSW","2017, NSW",...,"2904, ACT","2905, ACT","2906, ACT","2911, ACT","2912, ACT","2913, ACT","2914, ACT","2899, OT","6798, OT","6799, OT"
0,181,0,51,60,151,181,108,142,119,264,...,130,369,252,107,106,620,450,19,18,6
1,174,0,25,45,145,196,79,113,105,253,...,157,450,236,146,94,658,528,13,17,12
2,145,0,57,49,119,117,66,114,89,206,...,173,411,244,120,96,632,527,18,29,8
3,113,0,41,29,115,105,72,102,86,164,...,147,414,281,90,92,590,579,20,21,10
4,93,0,27,15,77,97,67,79,65,142,...,155,392,262,98,87,549,620,14,26,7


shape: 117 rows, 2653 cols


**Check for null/NaN values**<br />
-> no null/NaN values exist

In [4]:
# isnull() is an alias of isna(), so a single call covers every missing-value check
age_df.isna().values.any()

False

**Check if 'Total' is correct**<br />
-> it is not: 'Total' equals the sum of the age counts for only 47 of the 2653 postcodes, so the true sum is used for the average calculation

In [5]:
# Sum of the 116 single-year age rows (everything except the 'Total' row)
true_total = age_df.iloc[:-1,:].sum()
reported_total = age_df.loc['Total',:]
print("There are {} instances where 'Total' = true total".format((reported_total == true_total).sum()))
print("There are {} instances where 'Total' < true total".format((reported_total < true_total).sum()))
print("There are {} instances where 'Total' > true total".format((reported_total > true_total).sum()))

# Replace the reported 'Total' with the true sum
age_df.loc['Total',:] = true_total

There are 47 instances where 'Total' = true total
There are 1037 instances where 'Total' < true total
There are 1569 instances where 'Total' > true total


### Average Age

In [6]:
# Weight each age (0-115) by its head count, then divide by the postcode's true total
ages = np.arange(len(age_df) - 1)  # one entry per age row, excluding 'Total'
ave_age = (age_df.iloc[:-1,:].mul(ages, axis=0).sum()
           / age_df.iloc[-1,:]).rename('mean_Age')

display(ave_age.head())

AGEP Age
2000, NSW    33.961999
2006, NSW    20.621951
2007, NSW    30.115950
2008, NSW    29.771024
2009, NSW    36.903884
Name: mean_Age, dtype: float64
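
To make the weighted-mean calculation concrete, here is a minimal sketch on a made-up table. The frame `toy_counts` and its postcodes `A` and `B` are hypothetical and only illustrate the arithmetic: each age is multiplied by its head count, summed per postcode, and divided by the postcode total. An empty postcode gives 0 / 0, which pandas reports as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: rows are ages 0-3, columns are postcodes.
toy_counts = pd.DataFrame(
    {"A": [1, 1, 0, 2], "B": [0, 0, 0, 0]},
    index=["0", "1", "2", "3"],
)

ages = np.arange(len(toy_counts))   # 0, 1, 2, 3
totals = toy_counts.sum()           # people per postcode

# Multiply every age row by its age, sum per postcode, divide by the total.
toy_mean_age = toy_counts.mul(ages, axis=0).sum() / totals
print(toy_mean_age)
```

For postcode `A` this is (0·1 + 1·1 + 2·0 + 3·2) / 4 = 1.75, while the empty postcode `B` is NaN.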

### Quantiles

In [7]:
# number of people that must be counted to reach each quantile, per postcode
quantile_check = pd.DataFrame({'Q0': round(age_df.iloc[-1,:] / age_df.iloc[-1,:]),
                               'Q1': round(age_df.iloc[-1,:] / 4),
                               'Median': round(age_df.iloc[-1,:] / 2),
                               'Q3': round(age_df.iloc[-1,:] * 0.75),
                               'Q4': round(age_df.iloc[-1,:])})
# running count of people up to the current age, per postcode
counts_df = pd.DataFrame(0, index = age_df.columns, columns = ['Counts'])
# postcodes that have reached each quantile so far
reached_dic = dict()
# age at which each postcode reaches each quantile (NaN until reached)
quantile_df = pd.DataFrame(index = age_df.columns, columns = ['Q0', 'Q1', 'Median', 'Q3', 'Q4'])


for age in range(0,116):
    counts_df['Counts'] = counts_df['Counts'] + age_df.iloc[age,:]
    
    for col in quantile_check.columns:
        # postcodes whose running count has reached this quantile
        reached_dic[col] = counts_df[counts_df['Counts']>=quantile_check[col]].index
        
        for i in age_df.columns:
            # fill only once so the first age to reach the quantile is kept;
            # .loc avoids chained assignment, which may write to a temporary copy
            if pd.isna(quantile_df.loc[i, col]) and i in reached_dic[col]:
                quantile_df.loc[i, col] = int(age_df.index[age])
                    
# set postcodes with no people to NaN so they are not confused with a quantile age of 0
quantile_df.loc[counts_df[counts_df['Counts'] == 0].index, :] = np.nan

display(quantile_df.head())

AGEP Age,Q0,Q1,Median,Q3,Q4
"2000, NSW",0,24,30,40,97
"2006, NSW",10,19,20,21,57
"2007, NSW",0,22,26,34,90
"2008, NSW",0,22,26,34,95
"2009, NSW",0,26,34,48,94

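The same idea (find the first age at which the running head count reaches a quantile threshold) can also be written without explicit loops. The following is only a sketch on made-up data, not the code used to produce the table above. The helper `age_at_quantile` and the frame `toy_counts` are hypothetical, and the sketch assumes the threshold is clamped to at least one person, which may differ from the loop for very small postcodes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: rows are ages 0-3, columns are postcodes.
toy_counts = pd.DataFrame(
    {"A": [2, 1, 0, 1], "B": [0, 0, 0, 0]},
    index=[0, 1, 2, 3],
)

def age_at_quantile(counts, q):
    """First age at which the cumulative count reaches round(q * total)."""
    total = counts.sum()
    # Assumption: at least one person must be counted before a quantile is reached.
    threshold = (total * q).round().clip(lower=1)
    cum = counts.cumsum()                   # running total by age
    reached = cum.ge(threshold, axis=1)     # True from the age the threshold is met
    first_age = reached.idxmax()            # first True per postcode
    return first_age.where(total > 0)       # NaN for postcodes with no people

toy_quantiles = pd.DataFrame(
    {name: age_at_quantile(toy_counts, q)
     for name, q in [("Q0", 0), ("Q1", 0.25), ("Median", 0.5), ("Q3", 0.75), ("Q4", 1)]}
)
print(toy_quantiles)
```

Because the cumulative sum is computed once for the whole table, this avoids the per-postcode inner loop, at the cost of holding a second frame of the same size in memory.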

### Count for Millennials and Gen Z over 18
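Intended range: ages 18-41. Note that the `iloc` slice below is half-open, so with `upperBound = 41` it sums ages 18-40.

In [8]:
upperBound = 41
lowerBound = 18

# iloc[lowerBound:upperBound] covers positions 18..40; use upperBound + 1 to include age 41
counts = pd.Series(age_df.iloc[lowerBound:upperBound,:].sum(), name = '#Millen_Z')
percentage = pd.Series(counts / age_df.iloc[-1,:], name = 'Millen_Z%')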
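
Whether the upper bound is included depends on the indexer, which matters for an age-range count like this one. The sketch below uses a hypothetical Series `toy` whose labels are age strings, mirroring the row labels of the transposed age table. Positional slicing with `iloc` excludes the end position, while label slicing with `loc` includes the end label.

```python
import pandas as pd

# Hypothetical Series: labels are ages as strings, values equal the age.
toy = pd.Series(range(50), index=[str(a) for a in range(50)])

positional = toy.iloc[18:41]     # positions 18..40, end excluded
by_label = toy.loc["18":"41"]    # labels '18'..'41', end included

print(positional.index[-1], len(positional))
print(by_label.index[-1], len(by_label))
```

The positional slice ends at age 40 (23 rows) and the label slice ends at age 41 (24 rows).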

### Prepare the Output Dataframe

**Concat the outputs**

In [9]:
output = pd.concat([ave_age, quantile_df, counts, percentage], axis=1) \
    .rename(columns={"Q0": "min_Age", "Q1": "q1_Age", "Median": "median_Age", "Q3": "q3_Age", "Q4": "max_Age"})
output.index.name = None
output.head()

Unnamed: 0,mean_Age,min_Age,q1_Age,median_Age,q3_Age,max_Age,#Millen_Z,Millen_Z%
"2000, NSW",33.961999,0,24,30,40,97,19310,0.70423
"2006, NSW",20.621951,10,19,20,21,57,1189,0.966667
"2007, NSW",30.11595,0,22,26,34,90,6910,0.781674
"2008, NSW",29.771024,0,22,26,34,95,9321,0.795782
"2009, NSW",36.903884,0,26,34,48,94,7075,0.550627


**Check whether NaN values arise from a zero denominator**

In [10]:
print(f"number of 0 denominators: {(age_df.iloc[-1,:].array == 0.0).sum()}\n\
number of nan values: {np.isnan(output.loc[:,['mean_Age', 'Millen_Z%']]).all(axis=1).sum()}")

number of 0 denominators: 20
number of nan values: 20


**Replace NaN values with 0**<br />
-> this also sets the quantile columns of postcodes with no people to 0

In [11]:
output = output.fillna(0)

## Continent of Birth

In [12]:
conti_df = pd.read_csv('../data/tables/external/by_postcode/continent of birth.csv', index_col = 'BPLP - 1 Digit Level')
display(conti_df.head())
print(f'shape: {conti_df.shape[0]} rows, {conti_df.shape[1]} cols')

BPLP - 1 Digit Level,Oceania and Antarctica,North-West Europe,Southern and Eastern Europe,North Africa and the Middle East,South-East Asia,North-East Asia,Southern and Central Asia,Americas,Sub-Saharan Africa,Supplementary codes,Not stated,Total
"2000, NSW",5087,1329,646,352,7812,6284,966,725,142,32,4050,27411
"2006, NSW",885,65,4,5,20,68,12,35,27,0,142,1261
"2007, NSW",1687,413,213,147,1238,3423,310,269,42,12,1089,8846
"2008, NSW",3390,593,306,174,1519,3581,307,433,87,9,1320,11712
"2009, NSW",5067,1168,632,206,1184,1959,468,714,118,10,1292,12813


shape: 2653 rows, 12 cols


**Remove the 'Supplementary codes' and 'Not stated' variables**

In [13]:
conti_df = conti_df.drop(['Supplementary codes', 'Not stated'], axis = 1)

**Check for null/NaN values**<br />
-> no null/NaN values exist

In [14]:
# isnull() is an alias of isna(), so a single call is sufficient
conti_df.isna().values.any()

False

**Recalculate the total over the nine remaining regions**

In [15]:
conti_df['Total'] = conti_df.iloc[:,:-1].sum(axis=1)

### Change to Proportion

In [16]:
# divide every region column by the row total; the float shares replace the integer counts
regions = conti_df.columns[:-1]
conti_df[regions] = conti_df[regions].div(conti_df['Total'], axis=0)

**Replace NaN values with 0 and drop the 'Total' column**

In [17]:
print(f'NaN values exists because of 0 denominator: {(conti_df.isnull() | conti_df.isna() | np.isnan(conti_df)).values.any()}')

conti_df = conti_df.fillna(0).iloc[:,:-1]

NaN values exists because of 0 denominator: True
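
As a small illustration of where these NaN values come from, the sketch below builds a hypothetical two-postcode table (`toy`, with made-up region names and counts) in which one postcode has no people. Dividing by a zero total yields NaN for every region of that row, and `fillna(0)` then turns those shares into 0.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: postcode B has nobody recorded.
toy = pd.DataFrame(
    {"Oceania": [3, 0], "Asia": [1, 0], "Total": [4, 0]},
    index=["A", "B"],
)

regions = toy.columns[:-1]
shares = toy[regions].div(toy["Total"], axis=0)   # 0 / 0 gives NaN for row B
print(shares)

shares_filled = shares.fillna(0)                  # empty postcodes become all-zero rows
print(shares_filled)
```

One consequence worth keeping in mind is that, after filling, the shares of an empty postcode sum to 0 rather than 1.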


In [18]:
conti_df.head()

BPLP - 1 Digit Level,Oceania and Antarctica,North-West Europe,Southern and Eastern Europe,North Africa and the Middle East,South-East Asia,North-East Asia,Southern and Central Asia,Americas,Sub-Saharan Africa
"2000, NSW",0.217924,0.056934,0.027674,0.015079,0.334661,0.269203,0.041383,0.031059,0.006083
"2006, NSW",0.789474,0.057984,0.003568,0.00446,0.017841,0.06066,0.010705,0.031222,0.024086
"2007, NSW",0.217902,0.053345,0.027512,0.018987,0.159907,0.442134,0.040041,0.034746,0.005425
"2008, NSW",0.326275,0.057074,0.029451,0.016747,0.146198,0.344658,0.029548,0.041675,0.008373
"2009, NSW",0.439997,0.101424,0.05488,0.017888,0.102813,0.170111,0.040639,0.062001,0.010247


### Prepare the Output Dataframe

**Check if the postcodes match those in the Age file**<br />
-> matched

In [19]:
(conti_df.index == age_df.columns).all()

True

**Concat the outputs and clean postcodes**

In [20]:
output = pd.concat([output, conti_df], axis=1)
# use integers to represent postcodes; note that loc[2000] (label) differs from iloc[2000] (position)
output.index = output.index.str.replace(r',.*', '', regex=True) \
    .astype("int")
display(output.head())
print(f'Output shape: {output.shape[0]} rows, {output.shape[1]} cols')

Unnamed: 0,mean_Age,min_Age,q1_Age,median_Age,q3_Age,max_Age,#Millen_Z,Millen_Z%,Oceania and Antarctica,North-West Europe,Southern and Eastern Europe,North Africa and the Middle East,South-East Asia,North-East Asia,Southern and Central Asia,Americas,Sub-Saharan Africa
2000,33.961999,0,24,30,40,97,19310,0.70423,0.217924,0.056934,0.027674,0.015079,0.334661,0.269203,0.041383,0.031059,0.006083
2006,20.621951,10,19,20,21,57,1189,0.966667,0.789474,0.057984,0.003568,0.00446,0.017841,0.06066,0.010705,0.031222,0.024086
2007,30.11595,0,22,26,34,90,6910,0.781674,0.217902,0.053345,0.027512,0.018987,0.159907,0.442134,0.040041,0.034746,0.005425
2008,29.771024,0,22,26,34,95,9321,0.795782,0.326275,0.057074,0.029451,0.016747,0.146198,0.344658,0.029548,0.041675,0.008373
2009,36.903884,0,26,34,48,94,7075,0.550627,0.439997,0.101424,0.05488,0.017888,0.102813,0.170111,0.040639,0.062001,0.010247


Output shape: 2653 rows, 17 cols
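
The postcode clean-up and the `loc` versus `iloc` distinction can be seen on a tiny example. The frame `toy` below is hypothetical. The pattern `,.*` removes everything from the first comma onwards, and after the cast to integers a postcode is looked up by label with `loc`, not by position with `iloc`.

```python
import pandas as pd

# Hypothetical two-row frame indexed by "postcode, state" strings.
toy = pd.DataFrame({"mean_Age": [33.9, 20.6]}, index=["2000, NSW", "2006, NSW"])

# Drop the state suffix, then store the postcodes as integers.
toy.index = toy.index.str.replace(r",.*", "", regex=True).astype("int")

print(toy.loc[2000, "mean_Age"])    # label-based: the row for postcode 2000
print(toy.iloc[0]["mean_Age"])      # position-based: the first row
```

Here both lookups return the same row, but `toy.iloc[2000]` would raise an `IndexError` because the frame has only two rows.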


## Save to csv
Cleaned data of *Age* and *Continent of Birth*

In [21]:
output.to_csv('../data/curated/persona/input/age_continent_cleaned.csv')

## Merge all Curated ABS Datasets

In [22]:
income = pd.read_csv('../data/curated/persona/input/income_cleaned.csv', index_col='postcode')
occupation = pd.read_csv('../data/curated/persona/input/occupation_cleaned.csv', index_col='postcode')
population = pd.read_csv('../data/curated/persona/input/postcode_total_population.csv', index_col='postcode')

**Check that the indexes match**

In [23]:
((output.index == income.index) & (income.index == occupation.index) & (income.index == population.index)).all()

True
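
An alternative way to run this check, sketched here on hypothetical frames `a` and `b`, is `Index.equals`, which returns a single boolean. The example also shows that `pd.concat(..., axis=1)` aligns rows by label, so two frames with the same postcodes in a different order still line up correctly.

```python
import pandas as pd

# Hypothetical frames with the same postcodes in a different order.
a = pd.DataFrame({"x": [1, 2]}, index=pd.Index([2000, 2006], name="postcode"))
b = pd.DataFrame({"y": [3, 4]}, index=pd.Index([2006, 2000], name="postcode"))

same_order = a.index.equals(b.index)                                # order matters here
same_labels = a.index.sort_values().equals(b.index.sort_values())   # order ignored

merged = pd.concat([a, b], axis=1)   # rows are matched on the postcode label
print(same_order, same_labels)
print(merged)
```

In this sketch `same_order` is `False` and `same_labels` is `True`, and postcode 2000 is still paired with its own `y` value of 4.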

**Merge the required data**

In [24]:
abs_cleaned = pd.concat([output, income, occupation, population], axis=1)
abs_cleaned = abs_cleaned.drop(columns = ['#Millen_Z'])\
    .rename(columns={"average_salary": "mean_Salary", "Q0_salary": "min_Salary", "Q1_salary": "q1_Salary", 
                     "median_salary": "median_Salary", "Q3_salary": "q3_Salary", "Q4_salary": "max_Salary"})

display(abs_cleaned.head())

Unnamed: 0,mean_Age,min_Age,q1_Age,median_Age,q3_Age,max_Age,Millen_Z%,Oceania and Antarctica,North-West Europe,Southern and Eastern Europe,...,max_Salary,Managers_%,Professionals_%,Technicians and Trades Workers_%,Community and Personal Service Workers_%,Clerical and Administrative Workers_%,Sales Workers_%,Machinery Operators and Drivers_%,Labourers_%,Total
2000,33.961999,0,24,30,40,97,0.70423,0.217924,0.056934,0.027674,...,2500.0,0.07964,0.13936,0.061654,0.090256,0.047682,0.044982,0.007333,0.066105,27411
2006,20.621951,10,19,20,21,57,0.966667,0.789474,0.057984,0.003568,...,450.0,0.017446,0.11023,0.007137,0.17843,0.054718,0.064235,0.0,0.019826,1261
2007,30.11595,0,22,26,34,90,0.781674,0.217902,0.053345,0.027512,...,1625.0,0.046914,0.116889,0.041714,0.064662,0.044653,0.045105,0.011418,0.038322,8846
2008,29.771024,0,22,26,34,95,0.795782,0.326275,0.057074,0.029451,...,1875.0,0.058658,0.171363,0.036031,0.053962,0.049949,0.044826,0.006574,0.027066,11712
2009,36.903884,0,26,34,48,94,0.550627,0.439997,0.101424,0.05488,...,3500.0,0.125185,0.210567,0.049013,0.064154,0.068836,0.040584,0.007961,0.029813,12813


### Save to CSV

In [25]:
abs_cleaned.to_csv('../data/curated/persona/output/abs_cleaned.csv')
