# Aggregation by Age Group for Persona Data Clean-up

#### In this notebook, we aggregate the raw Persona age data for each postcode into a tally that contains the count for each age group and the total population aged 18 and over for each postcode in Australia.

## PySpark setup

In [1]:
# import libraries
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import numpy as np
from statistics import mean, stdev
import json

In [2]:
# set up the Spark session
spark = (
    SparkSession.builder.appName("aggregate data for first 3 final model variables")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.driver.memory", "15g")
    .getOrCreate()
)



## Read in data

In [3]:
# without inferSchema, every column is read as a string
data = spark.read.csv("../data/curated/2016_age.csv", header=True)

In [4]:
# the 'AGEP Age' column holds the postcode labels, so rename it accordingly
data = data.withColumnRenamed('AGEP Age', 'postcode')

## Aggregation
### Age brackets are defined according to the BNPL analysis of age

In [5]:
# Columns are addressed by position. This assumes column 0 is the postcode and
# column a + 1 holds the count for age a, so ages a..b map to range(a + 1, b + 2).
data_agg = (
    data.withColumn("18-24", sum(data[col] for col in range(18 + 1, 24 + 2)))
    .withColumn("25-34", sum(data[col] for col in range(25 + 1, 34 + 2)))
    .withColumn("35-44", sum(data[col] for col in range(35 + 1, 44 + 2)))
    .withColumn("45-54", sum(data[col] for col in range(45 + 1, 54 + 2)))
    .withColumn("55-64", sum(data[col] for col in range(55 + 1, 64 + 2)))
    .withColumn("65+", sum(data[col] for col in range(65 + 1, 115 + 2)))
)
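
The positional ranges above depend on the column order of the CSV. As a minimal sketch, assuming the age columns are named by their age as strings (`"0"` to `"115"`, as the row-sum check at the end of this notebook suggests), the same brackets could be expressed by column name instead. `BRACKETS` and `bracket_columns` are hypothetical names introduced here for illustration only.

```python
# Hypothetical bracket definitions: label -> (youngest age, oldest age), inclusive.
BRACKETS = {
    "18-24": (18, 24),
    "25-34": (25, 34),
    "35-44": (35, 44),
    "45-54": (45, 54),
    "55-64": (55, 64),
    "65+": (65, 115),
}


def bracket_columns(low, high):
    """Return the age column names (as strings) for ages low..high inclusive."""
    return [str(age) for age in range(low, high + 1)]


# Assumption: each age column in the CSV is named by its age, e.g. "18".
columns_by_bracket = {
    label: bracket_columns(low, high) for label, (low, high) in BRACKETS.items()
}

print(columns_by_bracket["18-24"])
# ['18', '19', '20', '21', '22', '23', '24']
```

Each list could then be summed with `sum(F.col(c) for c in names)`, which would keep the brackets correct even if the column order of the input file changed.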

In [6]:
output = data_agg.select("postcode","18-24","25-34","35-44","45-54","55-64","65+")

### Find the total population (18 years and older) and keep only the numeric part of each postcode

In [7]:
output = output.withColumn("total", sum(output[col] for col in range(1,7)))\
    .withColumn("postcode",F.regexp_extract('postcode', r'\d+',0))
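
`F.regexp_extract('postcode', r'\d+', 0)` keeps the first run of digits in each label and yields an empty string when the label contains no digits. The plain-Python sketch below mirrors that behaviour for ASCII input. The sample labels are hypothetical, since the exact format of the raw postcode column is not shown here, and `extract_postcode` is a helper name invented for this illustration.

```python
import re


def extract_postcode(label):
    """Return the first run of digits in label, or "" when there is none."""
    match = re.search(r"\d+", label)
    return match.group(0) if match else ""


# Hypothetical labels; the real format of the raw postcode column may differ.
samples = ["2000, NSW", "3053", "Total"]
print([extract_postcode(s) for s in samples])
# ['2000', '3053', '']
```

If the raw data contains non-postcode rows (such as the hypothetical `"Total"` label above), they would end up with an empty postcode, so it may be worth filtering those out before saving.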

## Save the output file

In [8]:
output.toPandas().to_csv('../data/curated/Age_after_agg.csv', index = False)
# note: adding the under-18 counts to this total does not reproduce the Total in the raw data



## Investigation on the output aggregation

In [9]:
# count NaN values in each column (note: isnan does not flag nulls)
from pyspark.sql.functions import isnan, when, count, col

output.select([count(when(isnan(c), c)).alias(c) for c in output.columns]).show()

+--------+-----+-----+-----+-----+-----+---+-----+
|postcode|18-24|25-34|35-44|45-54|55-64|65+|total|
+--------+-----+-----+-----+-----+-----+---+-----+
|       0|    0|    0|    0|    0|    0|  0|    0|
+--------+-----+-----+-----+-----+-----+---+-----+
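
Note that `isnan` flags only NaN values, so any nulls would go uncounted by this check. The plain-Python sketch below illustrates the difference on a small list of hypothetical values. `is_missing` is a helper name introduced for this example, not part of the notebook.

```python
import math


def is_missing(value):
    """Treat both None (null) and float NaN as missing."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


# Hypothetical column values, for illustration only.
values = [12.0, None, float("nan"), 7.0]

# A NaN-only check misses the None entry; the combined check catches both.
nan_only = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
missing = sum(1 for v in values if is_missing(v))
print(nan_only, missing)
# 1 2
```

In PySpark, the corresponding combined condition would be `F.col(c).isNull() | F.isnan(c)` inside the `when(...)` call.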



### Adding the under-18 counts to the 18+ total does not reproduce the Total in the raw data

In [10]:
# row-wise sums of the age columns differ from the totals in the raw data
col_list = [str(i) for i in range(0,116)]
df2 = data.withColumn(
    'SUM',
    sum([F.col(c) for c in col_list])
)
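
To quantify the discrepancy, the row-wise sum can be compared with the reported total for each postcode. The sketch below shows the arithmetic on a single hypothetical row. The `"Total"` key and the sample values are assumptions made for illustration, and `row_sum_gap` is not part of the notebook.

```python
def row_sum_gap(row, age_columns, total_key="Total"):
    """Return the reported total minus the sum of the age columns for one row."""
    return row[total_key] - sum(row[c] for c in age_columns)


# Hypothetical row with only three age columns, for illustration.
row = {"postcode": "2000", "0": 10, "1": 12, "2": 9, "Total": 34}

gap = row_sum_gap(row, ["0", "1", "2"])
print(gap)
# 3
```

A non-zero gap marks a row where the age columns do not add up to the reported total. In PySpark, assuming the raw total column is named `Total`, the same comparison could be written as `df2.withColumn("gap", F.col("Total") - F.col("SUM"))`.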