We analyze the number of regions each merchant sells to in order to build a metric for our ranking system.


In [1]:
import pandas as pd
import numpy as np
import io
import requests
import os

# Set working directory
if "/data/tables" not in os.getcwd():
    os.chdir("../data/tables")

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import matplotlib.pyplot as plt
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import countDistinct

%matplotlib inline

    
spark = (
    SparkSession.builder.appName("MAST30034 Project 2")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.broadcastTimeout", -1)
    .getOrCreate()
)





In [2]:
consumer = spark.read.option("delimiter", "|").csv('tbl_consumer.csv', header = True)
user_detail = spark.read.parquet("consumer_user_details.parquet")
transaction = spark.read.parquet("transactions_20210828_20220227_snapshot/")

                                                                                

In [3]:
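def read_url_data(url, data_format='csv'):
    """Download a file from `url` and load it into a pandas DataFrame."""
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors
    content = response.content

    if data_format == 'xlsx':
        # Wrap the raw bytes in a file-like object for read_excel
        return pd.read_excel(io.BytesIO(content))

    return pd.read_csv(io.StringIO(content.decode('utf-8')))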




In [4]:
sa2_data = read_url_data("https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/allocation-files/SA2_2021_AUST.xlsx", 'xlsx')
postcode_database = read_url_data("https://www.matthewproctor.com/Content/postcodes/australian_postcodes.csv", "csv")


In [5]:
invalid_postcodes = consumer.where(~(F.col('postcode').isin(postcode_database['postcode'].unique().tolist())))

In [6]:
invalid_postcodes.count()

                                                                                

0

In [7]:
postcode_sdf = spark.createDataFrame(postcode_database[['postcode', 'SA2_MAINCODE_2016']])
new_consumer = consumer.join(postcode_sdf,
                             consumer.postcode == postcode_sdf.postcode,
                             how='left')

In [8]:
new_consumer = new_consumer.join(user_detail, 
                                 ['consumer_id'],
                                 how = 'left')                      
                    

In [9]:
new_consumer

                                                                                

consumer_id,name,address,state,postcode,gender,postcode.1,SA2_MAINCODE_2016,user_id
407340,Karen Chapman,2706 Stewart Oval...,NSW,2033,Female,2033,118021564.0,6
712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,509031247.0,5
712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,509031247.0,5
712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,509031247.0,5
712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,509031247.0,5
712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,509031247.0,5
712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,509031247.0,5
712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,509031247.0,5
712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,509031247.0,5
712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,509031247.0,5
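
The repeated rows for consumer 712975 above are what one would expect if the postcode lookup lists the same postcode on several rows (for example, once per locality), because a left join emits one output row per matching lookup row. The following is a minimal pandas sketch of that effect, assuming this is the cause here; the two small frames are hypothetical and are not drawn from the real files.

```python
import pandas as pd

# Hypothetical lookup: postcode 6355 appears three times (e.g. one row per
# locality), and every row points at the same SA2 code.
lookup = pd.DataFrame({
    "postcode": [6355, 6355, 6355, 2033],
    "SA2_MAINCODE_2016": [509031247.0, 509031247.0, 509031247.0, 118021564.0],
})
consumers = pd.DataFrame({
    "consumer_id": [712975, 407340],
    "postcode": [6355, 2033],
})

# Joining against the raw lookup repeats a consumer once per matching row.
raw_join = consumers.merge(lookup, on="postcode", how="left")

# Dropping exact duplicate (postcode, SA2) pairs first keeps one row per pair.
deduped_lookup = lookup.drop_duplicates()
clean_join = consumers.merge(deduped_lookup, on="postcode", how="left")

print(len(raw_join), len(clean_join))
```

A postcode that genuinely spans several SA2 regions would still yield one row per region after deduplication, which is the case handled by averaging later in the notebook.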


We use an external census dataset containing the median total weekly personal income per SA2 region as our main feature for ranking.

In [10]:
census_data = spark.read.csv("2021Census_G02_AUST_SA2.csv", header = True)

In [11]:
weekly_personal_income = census_data['SA2_CODE_2021', 'Median_tot_prsnl_inc_weekly']
weekly_personal_income = weekly_personal_income.withColumnRenamed('SA2_CODE_2021', 'SA2_MAINCODE_2016')

In [12]:
new_consumer = new_consumer.join(weekly_personal_income,
                                ['SA2_MAINCODE_2016'],
                                how = "left")
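
The join key has different representations on each side: SA2_MAINCODE_2016 comes from pandas as a float (for example 118021564.0 in the earlier output), while spark.read.csv without schema inference reads SA2_CODE_2021 as a string. Spark appears to reconcile the two implicitly here, since incomes do show up after the join, but making the key format explicit can make the match easier to reason about. The helper below is a hypothetical sketch in plain Python (normalise_sa2 is not part of the notebook) showing one way to bring both forms to the same digit string. Separately, the rename treats 2021 codes as if they were 2016 codes, so any code that differs between the two editions would come through with a missing income.

```python
def normalise_sa2(code):
    """Return an SA2 code as a digit string, or None when it is missing.

    Hypothetical helper for illustration only.
    """
    if code is None:
        return None
    if isinstance(code, float):
        if code != code:  # NaN is the only float not equal to itself
            return None
        return str(int(code))
    return str(code).strip()


# The float form from the postcode lookup and the string form from the
# census CSV map to the same key.
print(normalise_sa2(118021564.0), normalise_sa2("118021564"))
```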

In [13]:
new_consumer

SA2_MAINCODE_2016,consumer_id,name,address,state,postcode,gender,postcode.1,user_id,Median_tot_prsnl_inc_weekly
118021564.0,407340,Karen Chapman,2706 Stewart Oval...,NSW,2033,Female,2033,6,946
509031247.0,712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,5,897
509031247.0,712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,5,897
509031247.0,712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,5,897
509031247.0,712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,5,897
509031247.0,712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,5,897
509031247.0,712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,5,897
509031247.0,712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,5,897
509031247.0,712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,5,897
509031247.0,712975,Rebecca Blanchard,9271 Michael Mano...,WA,6355,Female,6355,5,897


### Checking if each consumer has multiple regions
Since some consumer_ids map to multiple locations (SA2), we take the average personal income across those locations. This is done further down the notebook.

In [84]:
new_consumer.groupBy('consumer_id')\
    .agg(countDistinct('SA2_MAINCODE_2016'))

                                                                                

consumer_id,count(SA2_MAINCODE_2016)
752008,1
154770,2
1271409,1
779268,2
709093,1
359432,1
1190791,1
452212,1
441664,1
1358126,1
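
To make the distinct count concrete, here is a small plain-Python sketch of the same idea: collect the set of SA2 codes per consumer, then flag consumers with more than one. It is written without Spark so that it runs on its own; the consumer ids are taken from the output above, but the SA2 codes attached to them are made up for illustration.

```python
from collections import defaultdict

# (consumer_id, SA2 code) pairs. The repeated pair is deliberate, to show
# that duplicates do not inflate a distinct count. SA2 codes are made up.
rows = [
    (752008, 101021011), (752008, 101021011),
    (154770, 101031013), (154770, 101031014),
]

regions_by_consumer = defaultdict(set)
for consumer_id, sa2_code in rows:
    regions_by_consumer[consumer_id].add(sa2_code)

# Number of distinct SA2 regions per consumer.
distinct_counts = {cid: len(codes) for cid, codes in regions_by_consumer.items()}

# Consumers that map to more than one SA2 region.
multi_region = [cid for cid, n in distinct_counts.items() if n > 1]
print(distinct_counts, multi_region)
```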


In [14]:
merchant_sales = transaction.join(new_consumer,
                                 ['user_id'],
                                 how = "left")

In [15]:
merchant_sales

                                                                                

user_id,merchant_abn,dollar_value,order_id,order_datetime,SA2_MAINCODE_2016,consumer_id,name,address,state,postcode,gender,postcode.1,Median_tot_prsnl_inc_weekly
7,80518954462,87.82641684859922,30e323a0-5e5d-45c...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490
7,80518954462,87.82641684859922,30e323a0-5e5d-45c...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490
7,80518954462,87.82641684859922,30e323a0-5e5d-45c...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490
7,80518954462,87.82641684859922,30e323a0-5e5d-45c...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490
7,80518954462,87.82641684859922,30e323a0-5e5d-45c...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490
7,80518954462,87.82641684859922,30e323a0-5e5d-45c...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490
7,80518954462,87.82641684859922,30e323a0-5e5d-45c...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490
7,80518954462,87.82641684859922,30e323a0-5e5d-45c...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490
7,80518954462,87.82641684859922,30e323a0-5e5d-45c...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490
7,95279812400,23.855323140698204,e05b9df5-b068-4a0...,2021-12-05,319021506.0,511685,Andrea Jones,122 Brandon Cliff,QLD,4606,Female,4606,490


In [56]:
merchant_sales.printSchema()

root
 |-- user_id: long (nullable = true)
 |-- merchant_abn: long (nullable = true)
 |-- dollar_value: double (nullable = true)
 |-- order_id: string (nullable = true)
 |-- order_datetime: date (nullable = true)
 |-- consumer_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- state: string (nullable = true)
 |-- postcode: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- postcode: long (nullable = true)
 |-- SA2_MAINCODE_2016: double (nullable = true)



Since we are mainly interested in the number of SA2-level regions each merchant sells to, we drop the columns that are not needed for this analysis.

In [16]:
filter_cols = ["merchant_abn", "consumer_id", "SA2_MAINCODE_2016", "order_id", "Median_tot_prsnl_inc_weekly"]
merchant_sales_filtered = merchant_sales[filter_cols]

In [17]:
merchant_sales_filtered

                                                                                

merchant_abn,consumer_id,SA2_MAINCODE_2016,order_id,Median_tot_prsnl_inc_weekly
80518954462,511685,319021506.0,30e323a0-5e5d-45c...,490
80518954462,511685,319021506.0,30e323a0-5e5d-45c...,490
80518954462,511685,319021506.0,30e323a0-5e5d-45c...,490
80518954462,511685,319021506.0,30e323a0-5e5d-45c...,490
80518954462,511685,319021506.0,30e323a0-5e5d-45c...,490
80518954462,511685,319021506.0,30e323a0-5e5d-45c...,490
80518954462,511685,319021506.0,30e323a0-5e5d-45c...,490
80518954462,511685,319021506.0,30e323a0-5e5d-45c...,490
80518954462,511685,319021506.0,30e323a0-5e5d-45c...,490
95279812400,511685,319021506.0,e05b9df5-b068-4a0...,490


Finding out how many regions each merchant sells to

In [43]:
unique_regions_per_merchant = merchant_sales_filtered.groupby('merchant_abn')\
                                                     .agg(countDistinct('SA2_MAINCODE_2016').alias('region_count'))

In [44]:
unique_regions_per_merchant

                                                                                

merchant_abn,region_count
15613631617,671
24406529929,1159
38700038932,1518
41956465747,160
83412691377,1858
19839532017,327
73256306726,1360
96946925998,45
35344855546,620
73841664453,461


Finding out unique customers per region for each merchant

In [18]:
unique_customers_per_region = merchant_sales_filtered.groupby('merchant_abn', 'SA2_MAINCODE_2016')\
                                              .agg(countDistinct('consumer_id').alias('unique_customers'))\
                                              .orderBy('merchant_abn', 'SA2_MAINCODE_2016')


In [19]:
unique_customers_per_region




merchant_abn,SA2_MAINCODE_2016,unique_customers
10023283211,101021011.0,1
10023283211,101031013.0,3
10023283211,101031014.0,1
10023283211,101031015.0,4
10023283211,101031016.0,5
10023283211,101041017.0,1
10023283211,101041018.0,1
10023283211,101041019.0,1
10023283211,101041020.0,2
10023283211,101041022.0,1


Previously, we noted that some consumer_ids map to multiple locations (SA2) and therefore have several weekly personal income values. In that case, we take the average of those values for each merchant and consumer pair.

In [32]:
merchant_customer_avg_income = merchant_sales_filtered.groupby('merchant_abn', 'consumer_id')\
                                                        .agg(F.mean('Median_tot_prsnl_inc_weekly').alias('avg_cust_personal_inc'))\
                                                        .orderBy('merchant_abn', 'consumer_id')

In [33]:
merchant_customer_avg_income




merchant_abn,consumer_id,avg_cust_personal_inc
10023283211,1000718,662.6666666666666
10023283211,100167,865.1538461538462
10023283211,1002004,792.0
10023283211,1003027,876.0
10023283211,1003953,
10023283211,1004737,
10023283211,1006040,
10023283211,1008200,
10023283211,1011395,751.0
10023283211,1013100,774.0
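
The blank avg_cust_personal_inc values above are consistent with how Spark's mean treats nulls: null inputs are skipped, and a group with no non-null inputs yields null. The plain-Python sketch below mirrors that behaviour; mean_ignoring_missing is a hypothetical helper and the income values are made up.

```python
from statistics import mean


def mean_ignoring_missing(values):
    """Average the non-missing values; return None when none are present."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# A consumer with two matched regions and one unmatched region.
partly_matched = mean_ignoring_missing([946, None, 897])

# A consumer whose regions have no matching income at all.
unmatched = mean_ignoring_missing([None, None])

print(partly_matched, unmatched)
```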


This gives the overall average customer personal income for each merchant, computed as the mean of the per-consumer averages.

In [38]:
merchant_ranking = merchant_customer_avg_income.groupby('merchant_abn')\
                                                .agg(F.mean('avg_cust_personal_inc').alias('avg_customer_income'))

In [39]:
merchant_ranking

                                                                                


merchant_abn,avg_customer_income
12516851436,834.3608527735312
15613631617,784.2667954850245
15700338102,756.9344023323616
10648956813,833.678194645838
11944993446,795.3376453694681
14148282104,809.1637892244471
10346855916,780.0
10714068705,801.6457184884257
13467303030,514.1818181818181
14315147591,684.7888888888889
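
Averaging in two stages (first per consumer, then per merchant) weights every consumer equally, regardless of how many rows that consumer contributes. This can differ from a single pooled average over all rows. The sketch below illustrates the difference with made-up numbers for one hypothetical merchant; it is not derived from the data above.

```python
from statistics import mean

# Hypothetical income rows for one merchant, keyed by consumer.
rows_by_consumer = {
    "A": [500, 500, 500, 500],  # consumer A contributes four rows
    "B": [1000],                # consumer B contributes one row
}

# Stage 1: one average per consumer.
per_consumer = {cid: mean(vals) for cid, vals in rows_by_consumer.items()}

# Stage 2: average of the per-consumer averages (each consumer counts once).
mean_of_means = mean(per_consumer.values())

# Alternative: a single average over every row (heavy buyers count more).
pooled_mean = mean(v for vals in rows_by_consumer.values() for v in vals)

print(mean_of_means, pooled_mean)
```

With these numbers the two-stage figure is 750 while the pooled figure is 600, so the choice matters whenever a merchant's rows are concentrated in a few consumers.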
