# Data Cleaning
In this session, we apply some basic cleaning steps to the main datasets.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import shapefile as shp
import pandas as pd
import numpy as np
import os

In [2]:
# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName("Data Cleaning")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.driver.memory", "9g") 
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)

24/09/07 22:07:08 WARN Utils: Your hostname, LAPTOP-406UJ3L3 resolves to a loopback address: 127.0.1.1; using 172.27.231.53 instead (on interface eth0)
24/09/07 22:07:08 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/07 22:07:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Load dataset

We have three kinds of datasets: transactions, consumers and merchants. Since the aim is to find the top 100 merchants we should accept, we start with the merchant dataset.

## 1. Merchants

In [3]:
# Information on merchants
merchant = spark.read.parquet("../data/tables/part_1/tbl_merchants.parquet")

# Information on merchant's fraud probability
merchant_fraud_prob = pd.read_csv("../data/tables/part_1/merchant_fraud_probability.csv")

In [4]:
print(f"Number of rows: {merchant.count()}")
merchant.limit(5)

Number of rows: 4026


name,tags,merchant_abn
Felis Limited,"((furniture, home...",10023283211
Arcu Ac Orci Corp...,"([cable, satellit...",10142254217
Nunc Sed Company,"([jewelry, watch,...",10165489824
Ultricies Digniss...,"([wAtch, clock, a...",10187291046
Enim Condimentum PC,([music shops - m...,10192359162


In [5]:
merchant_fraud_prob.head()

Unnamed: 0,merchant_abn,order_datetime,fraud_probability
0,19492220327,2021-11-28,44.403659
1,31334588839,2021-10-02,42.755301
2,19492220327,2021-12-22,38.86779
3,82999039227,2021-12-19,94.1347
4,90918180829,2021-09-02,43.325517


We should look at the data type of each column.

In [6]:
merchant.printSchema()

root
 |-- name: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- merchant_abn: long (nullable = true)



In [7]:
merchant_fraud_prob.dtypes

merchant_abn           int64
order_datetime        object
fraud_probability    float64
dtype: object

### Comment
- The merchant dataset contains each merchant's name, tags (which appear to describe the goods they sell) and ABN.
- The fraud probability dataset records an order date and what appears to be the estimated probability of fraud for that merchant on that date.
- The two tables share the `merchant_abn` column.

In [27]:
# Look at the tags column
merchant.select("tags").limit(5).collect()

[Row(tags='((furniture, home furnishings and equipment shops, and manufacturers, except appliances), (e), (take rate: 0.18))'),
 Row(tags='([cable, satellite, and otHer pay television and radio services], [b], [take rate: 4.22])'),
 Row(tags='([jewelry, watch, clock, and silverware shops], [b], [take rate: 4.40])'),
 Row(tags='([wAtch, clock, and jewelry repair shops], [b], [take rate: 3.29])'),
 Row(tags='([music shops - musical instruments, pianos, and sheet music], [a], [take rate: 6.33])')]

In [34]:
# Check duplicate
print(f"Number of duplicates: {merchant.select('merchant_abn').count() - merchant.select('merchant_abn').distinct().count()}")

Number of duplicates: 0


In [111]:
# Check null value
print(f"NaN in merchant detail: {merchant.filter(F.col('merchant_abn').isNull() | F.col('tags').isNull() | F.col('name').isNull()).count()}")
print(f"Nan in merchant fraud rate:\n{merchant_fraud_prob.isna().sum()}")

NaN in merchant detail: 0
Nan in merchant fraud rate:
merchant_abn         0
order_datetime       0
fraud_probability    0
dtype: int64


In [55]:
# Number of merchant with a fraud rate
print(f"Number of merchant with a fraud rate: {len(merchant_fraud_prob.merchant_abn.unique())}")

Number of merchant with a fraud rate: 61


In [68]:
# ABNs that have a fraud record but are not in the merchant info
abn_not_in_merchant = spark.createDataFrame(pd.DataFrame(merchant_fraud_prob.merchant_abn)).subtract(merchant.select('merchant_abn'))
abn_not_in_merchant.limit(1)

merchant_abn
82999039227


In [71]:
print(f"Number of abn not in merchant dataset: {abn_not_in_merchant.count()}")
merchant.filter(F.col("merchant_abn") == 82999039227).show()
merchant_fraud_prob[merchant_fraud_prob.merchant_abn == 82999039227]

Number of abn not in merchant dataset: 13
+----+----+------------+
|name|tags|merchant_abn|
+----+----+------------+
+----+----+------------+



Unnamed: 0,merchant_abn,order_datetime,fraud_probability
3,82999039227,2021-12-19 00:00:00,94.1347
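
The same check can be done without a round trip through Spark. The sketch below assumes the ABN columns fit comfortably in memory, which seems reasonable for a merchant table of 4,026 rows. The helper name `abns_missing_from` and the toy frames are illustrative only, and the toy probabilities other than 94.1347 are made up.

```python
import pandas as pd


def abns_missing_from(fraud_df: pd.DataFrame, merchant_df: pd.DataFrame) -> pd.Series:
    """Return the distinct ABNs in fraud_df that have no row in merchant_df."""
    missing_mask = ~fraud_df["merchant_abn"].isin(merchant_df["merchant_abn"])
    return (
        fraud_df.loc[missing_mask, "merchant_abn"]
        .drop_duplicates()
        .reset_index(drop=True)
    )


# Toy data only: ABNs are taken from the previews above, probabilities are invented.
toy_merchant = pd.DataFrame({"merchant_abn": [10023283211, 10142254217]})
toy_fraud = pd.DataFrame(
    {
        "merchant_abn": [82999039227, 10023283211, 82999039227],
        "fraud_probability": [94.1347, 30.0, 50.0],
    }
)

missing = abns_missing_from(toy_fraud, toy_merchant)
print(missing.tolist())
```

On the real tables this would be called as `abns_missing_from(merchant_fraud_prob, merchant.toPandas())`.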


In [52]:
# Time range for fraud probability
merchant_fraud_prob["order_datetime"] = pd.to_datetime(merchant_fraud_prob["order_datetime"])
merchant_fraud_prob.order_datetime.sort_values()

49    2021-03-25
33    2021-04-17
47    2021-08-28
46    2021-08-29
109   2021-09-01
         ...    
16    2022-02-17
45    2022-02-19
41    2022-02-20
83    2022-02-25
15    2022-02-27
Name: order_datetime, Length: 114, dtype: datetime64[ns]

The merchant fraud dataset covers the period from 25/03/2021 to 27/02/2022.

In [54]:
# Check if fraud prob is in a valid range
merchant_fraud_prob.fraud_probability.describe()

count    114.000000
mean      40.419335
std       17.187745
min       18.210891
25%       28.992765
50%       32.692032
75%       48.395260
max       94.134700
Name: fraud_probability, dtype: float64

The fraud probability lies in a valid range: it is expressed as a percentage (0 to 100), with a minimum of about 18.2% and a maximum of about 94.1%. The maximum belongs to an ABN (82999039227) that does not exist in the merchant dataset.
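
The range check can also be made explicit rather than read off `describe()`. This is a minimal sketch, assuming the probabilities are stored as percentages between 0 and 100. The helper name `out_of_range` and the toy values are illustrative.

```python
import pandas as pd


def out_of_range(probabilities: pd.Series, low: float = 0.0, high: float = 100.0) -> pd.Series:
    """Return the values that fall outside [low, high]; NaN counts as out of range."""
    return probabilities[~probabilities.between(low, high)]


# Toy values: the observed min and max, plus one negative and one above 100.
toy = pd.Series([18.210891, 94.1347, -3.0, 120.5])
bad = out_of_range(toy)
print(bad.tolist())
```

An empty result from `out_of_range(merchant_fraud_prob.fraud_probability)` would confirm that every value is a valid percentage.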

### Consideration
- We need to split each tuple in `tags` into 3 values: the type of goods, a single-letter code (meaning still unclear) and the take rate.
- How to deal with the text describing the goods: remove stop words, doc2vec, count vectorization, etc.
- Convert `order_datetime` to a datetime data type.
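
The first point could be approached with a regular expression. The sketch below is not the project's cleaning function. It assumes that every tags string holds exactly three bracketed groups, that the groups use either `()` or `[]`, and that no group contains brackets of its own. The names `parse_tags` and `MerchantTags` are hypothetical.

```python
import re
from typing import NamedTuple

# One bracketed group with no brackets inside it, e.g. "[b]" or "(take rate: 0.18)".
_GROUP = re.compile(r"[(\[]([^()\[\]]+)[)\]]")
_TAKE_RATE = re.compile(r"take rate:\s*([0-9]*\.?[0-9]+)", re.IGNORECASE)


class MerchantTags(NamedTuple):
    goods: str
    symbol: str
    take_rate: float


def parse_tags(raw: str) -> MerchantTags:
    """Split a raw tags string into (goods, symbol, take_rate)."""
    groups = _GROUP.findall(raw)
    if len(groups) != 3:
        raise ValueError(f"expected 3 bracketed groups, got {len(groups)}: {raw!r}")
    goods_raw, symbol_raw, rate_raw = groups
    rate_match = _TAKE_RATE.search(rate_raw)
    if rate_match is None:
        raise ValueError(f"no take rate found in {rate_raw!r}")
    # Lowercase and collapse whitespace so 'otHer' and 'other' compare equal.
    goods = " ".join(goods_raw.lower().split())
    return MerchantTags(goods, symbol_raw.strip().lower(), float(rate_match.group(1)))


sample = "([cable, satellite, and otHer pay television and radio services], [b], [take rate: 4.22])"
parsed = parse_tags(sample)
print(parsed)
```

Raising on malformed input, instead of returning empty fields, makes rows that break the assumptions visible early.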

## 2. Consumer

In [77]:
# Information on consumer
consumer_user_detail = spark.read.parquet("../data/tables/part_1/consumer_user_details.parquet")
consumer = pd.read_csv("../data/tables/part_1/tbl_consumer.csv", delimiter="|")

# Information on consumer's fraud probability
consumer_fraud_prob = pd.read_csv("../data/tables/part_1/consumer_fraud_probability.csv")

In [78]:
print(f"Shape: {consumer.shape}")
consumer.head(5)

Shape: (499999, 6)


Unnamed: 0,name,address,state,postcode,gender,consumer_id
0,Yolanda Williams,413 Haney Gardens Apt. 742,WA,6935,Female,1195503
1,Mary Smith,3764 Amber Oval,NSW,2782,Female,179208
2,Jill Jones MD,40693 Henry Greens,NT,862,Female,1194530
3,Lindsay Jimenez,00653 Davenport Crossroad,NSW,2780,Female,154128
4,Rebecca Blanchard,9271 Michael Manors Suite 651,WA,6355,Female,712975


In [80]:
print(f"Shape: {consumer_fraud_prob.shape}")
consumer_fraud_prob.head(5)

Shape: (34864, 3)


Unnamed: 0,user_id,order_datetime,fraud_probability
0,6228,2021-12-19,97.629808
1,21419,2021-12-10,99.24738
2,5606,2021-10-17,84.05825
3,3101,2021-04-17,91.421921
4,22239,2021-10-19,94.703425


In [81]:
print(f"Number of rows: {consumer_user_detail.count()}")
consumer_user_detail.limit(5)

Number of rows: 499999


user_id,consumer_id
1,1195503
2,179208
3,1194530
4,154128
5,712975


### Comment
- The consumer dataset contains each consumer's name, address, state, postcode, gender and consumer_id, while the consumer user details dataset maps each user_id to its corresponding consumer_id.
- The consumer fraud probability and consumer datasets can be merged through the consumer user details dataset.
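
The merge path described above can be sketched in pandas as two left joins through the link table. The toy frames below reuse the first two id pairs from the previews. The fraud probabilities are invented, and `validate="many_to_one"` is an optional safeguard that raises if a key is unexpectedly duplicated on the right-hand side.

```python
import pandas as pd

# Toy stand-ins for the three tables; only the key columns matter here.
toy_consumer = pd.DataFrame(
    {"consumer_id": [1195503, 179208], "state": ["WA", "NSW"]}
)
toy_user_detail = pd.DataFrame(
    {"user_id": [1, 2], "consumer_id": [1195503, 179208]}
)
toy_fraud = pd.DataFrame(
    {"user_id": [2, 2, 1], "fraud_probability": [12.5, 40.0, 9.9]}  # invented values
)

# fraud (user_id) -> link table (user_id, consumer_id) -> consumer (consumer_id)
linked = (
    toy_fraud
    .merge(toy_user_detail, on="user_id", how="left", validate="many_to_one")
    .merge(toy_consumer, on="consumer_id", how="left", validate="many_to_one")
)
print(linked)
```

Left joins keep every fraud record, so a missing link would show up as a null `consumer_id` rather than a silently dropped row.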

In [85]:
# Check duplicate in user_id and consumer_id
print(f"Number of duplicates in user_id: {consumer_user_detail.select('user_id').count() - consumer_user_detail.select('user_id').distinct().count()}")

print(f"Number of duplicates in consumer_id: {consumer_user_detail.select('consumer_id').count() - consumer_user_detail.select('consumer_id').distinct().count()}")

Number of duplicates in user_id: 0
Number of duplicates in consumer_id: 0


**Question**: Why do we need both consumer_id and user_id (is this related to the database design)?

A user_id has to lie in the range 1 to 499,999, while a consumer_id can be arbitrary?

In [112]:
# Check null values
print(f"NaN in consumer fraud probability:\n{consumer_fraud_prob.isna().sum()}")
print(f"NaN in consumer:\n{consumer.isna().sum()}")

NaN in consumer fraud probability:
user_id              0
order_datetime       0
fraud_probability    0
dtype: int64
NaN in consumer:
name           0
address        0
state          0
postcode       0
gender         0
consumer_id    0
dtype: int64


In [94]:
# Check if user id range in other dataframes is valid.
print(f"Max user id in the fraud record: {max(consumer_fraud_prob.user_id)}")
print(f"Min user id in the fraud record: {min(consumer_fraud_prob.user_id)}")

Max user id in the fraud record: 24081
Min user id in the fraud record: 1


In [95]:
# Check for the consumer id without a valid corresponding user id
invalid_consumer_id = spark.createDataFrame(pd.DataFrame(consumer.consumer_id)).subtract(consumer_user_detail.select('consumer_id'))
invalid_consumer_id.limit(1)

consumer_id


Every consumer_id in the consumer dataset has a matching row in the user details dataset, so there are no problems with consumer_id and user_id.

In [98]:
# Time range for fraud probability
consumer_fraud_prob["order_datetime"] = pd.to_datetime(consumer_fraud_prob["order_datetime"])
consumer_fraud_prob.order_datetime.sort_values()

15812   2021-02-28
18284   2021-02-28
3674    2021-02-28
14061   2021-02-28
4787    2021-02-28
           ...    
11970   2022-02-27
5119    2022-02-27
22952   2022-02-27
26151   2022-02-27
14025   2022-02-27
Name: order_datetime, Length: 34864, dtype: datetime64[ns]

In [100]:
# Check if fraud prob is in a valid range
consumer_fraud_prob.fraud_probability.describe()

count    34864.000000
mean        15.120091
std          9.946085
min          8.287144
25%          9.634437
50%         11.735624
75%         16.216158
max         99.247380
Name: fraud_probability, dtype: float64

The time range (28/02/2021 to 27/02/2022) and the probabilities (about 8.3% to 99.2%) both look valid. Overall, we just need to convert `order_datetime` to a datetime data type.

## 3. Transaction

This dataset contains details for each transaction between a merchant and a user.

In [119]:
# Read transaction dataset
transaction1 = spark.read.parquet("../data/tables/part_2")
transaction2 = spark.read.parquet("../data/tables/part_3")
transaction3 = spark.read.parquet("../data/tables/part_4")

                                                                                

In [128]:
# Combine datasets into a single DataFrame
transaction = transaction1.union(transaction2).union(transaction3)
transaction.count() # Number of rows

                                                                                

14195505

In [129]:
transaction.printSchema()

root
 |-- user_id: long (nullable = true)
 |-- merchant_abn: long (nullable = true)
 |-- dollar_value: double (nullable = true)
 |-- order_id: string (nullable = true)
 |-- order_datetime: date (nullable = true)



In [130]:
transaction.limit(5)

user_id,merchant_abn,dollar_value,order_id,order_datetime
18478,62191208634,63.255848959735246,949a63c8-29f7-4ab...,2021-08-20
2,15549624934,130.3505283105634,6a84c3cf-612a-457...,2021-08-20
18479,64403598239,120.15860593212784,b10dcc33-e53f-425...,2021-08-20
3,60956456424,136.6785200286976,0f09c5a5-784e-447...,2021-08-20
18479,94493496784,72.96316578355305,f6c78c1a-4600-4c5...,2021-08-20


#### 1. Check for missing values

In [131]:
# List of columns to check
cols = transaction.columns

for col_name in cols:
    # Count the rows where this column is null
    null_count = transaction.filter(F.col(col_name).isNull()).count()

    if null_count > 0:
        print(f"Number of rows with null {col_name}: {null_count}")

                                                                                

Nothing is printed, so no column contains missing values.

In [133]:
for col in ['user_id', 'dollar_value', 'order_datetime']:
    print(f"Max of {col}: {transaction.agg({col: 'max'})}")
    print(f"Min of {col}: {transaction.agg({col: 'min'})}")

                                                                                

Max of user_id: +------------+
|max(user_id)|
+------------+
|       24081|
+------------+

Min of user_id: +------------+
|min(user_id)|
+------------+
|           1|
+------------+

Max of dollar_value: +------------------+
| max(dollar_value)|
+------------------+
|105193.88578925544|
+------------------+

Min of dollar_value: +--------------------+
|   min(dollar_value)|
+--------------------+
|9.756658099412162E-8|
+--------------------+

Max of order_datetime: +-------------------+
|max(order_datetime)|
+-------------------+
|         2022-10-26|
+-------------------+

Min of order_datetime: +-------------------+
|min(order_datetime)|
+-------------------+
|         2021-02-28|
+-------------------+



                                                                                

### Comment
- Although we have 499,999 user ids, at most 24,081 of them made a transaction within the provided time range.
- The `dollar_value` range looks strange: the minimum is almost 0 (about 9.8e-8) while the maximum exceeds 105,000. We may need to do some outlier analysis on it.
- The transaction dataset spans a wider time range (28/02/2021 to 26/10/2022) than the fraud datasets, which end on 27/02/2022. This can lead to missing data when we join the tables together.
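
One possible starting point for the outlier analysis on `dollar_value` is the IQR rule applied on a log scale, since the values span many orders of magnitude. Both the log transform and the 1.5 multiplier are assumptions made for this sketch, not decisions taken in this notebook. The function names are hypothetical, and the toy values are the rounded preview rows plus the observed minimum and maximum.

```python
import math
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers_log10(dollar_values, k=1.5):
    """Flag values whose log10 lies outside the IQR fences; non-positive values are flagged too."""
    logs = [math.log10(v) for v in dollar_values if v > 0]
    lower, upper = iqr_bounds(logs, k)
    return [v <= 0 or not (lower <= math.log10(v) <= upper) for v in dollar_values]


# Toy values: five rounded preview rows, then the observed minimum and maximum.
toy = [63.26, 130.35, 120.16, 136.68, 72.96, 9.756658099412162e-8, 105193.88578925544]
flags = flag_outliers_log10(toy)
print([v for v, is_outlier in zip(toy, flags) if is_outlier])
```

On the full Spark table, the quartiles would more likely come from `DataFrame.approxQuantile` than from collecting every value to the driver.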

# Create Function

At this step, only the `tags` column of the merchant dataset needs to be preprocessed. Because the table is small and the preprocessing steps are fairly involved, we convert tbl_merchants.parquet to CSV.

In [8]:
import sys
sys.path.append('../scripts')
from clean_dataset import clean_merchant_df

In [11]:
# Apply clean function
clean_merchant_df(merchant)

In [13]:
clean_merchant_csv = pd.read_csv("../data/curated/part_1/clean_merchant.csv")

In [16]:
clean_merchant_csv.head()

Unnamed: 0,name,merchant_abn,goods,symbol,take_rate
0,Felis Limited,10023283211,"furniture, home furnishings and equipment shop...",e,0.18
1,Arcu Ac Orci Corporation,10142254217,"cable, satellite, and other pay television and...",b,4.22
2,Nunc Sed Company,10165489824,"jewelry, watch, clock, and silverware shops",b,4.4
3,Ultricies Dignissim Lacus Foundation,10187291046,"watch, clock, and jewelry repair shops",b,3.29
4,Enim Condimentum PC,10192359162,"music shops - musical instruments, pianos, and...",a,6.33


In [17]:
clean_merchant_csv.shape

(4026, 5)