In [1]:
import pandas as pd

# Keeping the API key safe

## Step 1: Set the Environment Variable in PowerShell
In PowerShell, I set the environment variable by running:

```powershell
$env:API_KEY="api_key"
```

This stored the API key as an environment variable named `API_KEY` for the current PowerShell session only; it is not persisted once the session closes.

---

## Step 2: Access the Environment Variable in Python
For better security and convenience, I stored the API key in a `.env` file. First, I had to install the `python-dotenv` package:

```bash
pip install python-dotenv
```

Then, I created a `.env` file with the following content:

```
API_KEY=api_key
```

Finally, I loaded the environment variables from the `.env` file using the following Python code:

```python
from dotenv import load_dotenv
import os

load_dotenv()  # Load environment variables from .env file

api_key = os.getenv('API_KEY')
```
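
If the variable is missing, `os.getenv` returns `None`, and the problem only surfaces later when the key is first used. A minimal sketch of a fail-fast check is shown below. The helper name `require_env` is hypothetical and is not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

After `load_dotenv()`, this could be called as `api_key = require_env('API_KEY')`.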

---

Merchant, consumer, and fraud data were loaded from Parquet and CSV files. Key merchant information was extracted from the `tags` column by splitting it into category, revenue rate, and take rate components. The merchant data was then merged with the merchant fraud probabilities; missing `fraud_probability` values were filled with 0, and duplicates were removed.
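
The tag-splitting step might look roughly like the sketch below. The raw layout of the `tags` column is not shown in this notebook, so the format used here (three bracketed groups, the last one of the form `take rate: 4.40`) is an assumption, and `split_tags` is a hypothetical helper name.

```python
import re


def split_tags(tags: str) -> dict:
    """Split an assumed '((category), (rate), (take rate: x))' string into three fields."""
    # Grab the innermost bracketed groups, allowing () or [] as delimiters.
    parts = re.findall(r"[\(\[]([^\(\)\[\]]+)[\)\]]", tags)
    if len(parts) != 3:
        raise ValueError(f"Unexpected tags format: {tags!r}")
    category, revenue_rate, take_rate = (p.strip() for p in parts)
    # Assumption: the third group looks like "take rate: 4.40".
    take_rate_value = float(take_rate.split(":")[1])
    return {
        "category": category,
        "revenue_rate": revenue_rate,
        "take_rate_value": take_rate_value,
    }
```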

The consumer data was cleaned and merged with the fraud and user details. The `dollar_value` column was cleaned by filtering out low values and removing outliers using z-scores.
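
A minimal sketch of the `dollar_value` cleaning is given below. The floor of 1.0 and the z-score cut-off of 3.0 are illustrative assumptions, not the values used in the project, and `remove_outliers` is a hypothetical helper name.

```python
import pandas as pd


def remove_outliers(df: pd.DataFrame, column: str = "dollar_value",
                    min_value: float = 1.0, z_threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows below min_value, then rows whose z-score exceeds z_threshold."""
    # Filter low values first so they do not distort the mean and standard deviation.
    kept = df[df[column] >= min_value]
    z_scores = (kept[column] - kept[column].mean()) / kept[column].std()
    return kept[z_scores.abs() <= z_threshold]
```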

The final combined data is shown below. For the accompanying analysis, refer to the plots in `plots/1_Pre-processing and Preliminary Analysis`.

In [5]:
# Pre-processing and Preliminary Analysis of the Data
consumer_data = pd.read_parquet('../data/curated/consumer_merged.parquet')
merchant_info = pd.read_parquet('../data/curated/merchant_info.parquet')

In [6]:
consumer_data.head()

Unnamed: 0,name,address,state,postcode,gender,consumer_id,user_id,order_datetime,fraud_probability,merchant_abn,dollar_value,order_id
0,Yolanda Williams,413 Haney Gardens Apt. 742,WA,6935,Female,1195503,1,2022-02-20,9.805431,28000487688,133.226894,0c37b3f7-c7f1-48cb-bcc7-0a58e76608ea
1,Yolanda Williams,413 Haney Gardens Apt. 742,WA,6935,Female,1195503,1,2022-02-20,9.805431,83690644458,30.441348,40a2ff69-ea34-4657-8429-df7ca957d6a1
2,Yolanda Williams,413 Haney Gardens Apt. 742,WA,6935,Female,1195503,1,2022-02-20,9.805431,96680767841,301.677498,5b6c059c-6831-4710-a2ff-76e265b4a509
3,Yolanda Williams,413 Haney Gardens Apt. 742,WA,6935,Female,1195503,1,2022-02-20,9.805431,49891706470,5.904951,39e028ec-1663-4aff-a341-a9ad882d68d7
4,Yolanda Williams,413 Haney Gardens Apt. 742,WA,6935,Female,1195503,1,2022-02-20,9.805431,62403837535,18.135592,316b9dec-1140-4ef9-bbf3-810694eb7eda


In [7]:
merchant_info.head()

Unnamed: 0,merchant_abn,name,category,revenue_rate,fraud_probability,take_rate_value
0,10023283211,Felis Limited,"furniture, home furnishings and equipment shop...",e,0.0,0.18
1,10142254217,Arcu Ac Orci Corporation,"cable, satellite, and otHer pay television and...",b,0.0,4.22
2,10165489824,Nunc Sed Company,"jewelry, watch, clock, and silverware shops",b,0.0,4.4
3,10187291046,Ultricies Dignissim Lacus Foundation,"wAtch, clock, and jewelry repair shops",b,0.0,3.29
4,10192359162,Enim Condimentum PC,"music shops - musical instruments, pianos, and...",a,0.0,6.33


In [9]:
abs_data = pd.read_csv('../data/curated/2021Census_G02_AUST_POA_curated.csv')
abs_data.head()

Unnamed: 0,POA_CODE_2021,Median_age_persons,Median_mortgage_repay_monthly,Median_rent_weekly,Median_tot_hhd_inc_weekly
0,2000,32,2800,625,2225
1,2007,30,2500,500,1805
2,2008,28,2600,525,1746
3,2009,37,2800,580,2422
4,2010,36,2900,550,2297


The merchants' data was categorised based on keywords found in their category descriptions. A list of potential categories was established: 'Retail Goods', 'Media and Technology', 'Furniture and Home Furnishing', 'Services', and 'Automotives'. Basic Natural Language Processing techniques were applied: the category text was normalised and common stopwords were filtered out before matching. A merchant can match more than one category, in which case the labels are joined with commas.
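
The keyword matching can be sketched as follows. The keyword sets and the stopword list are hypothetical and chosen only for illustration, so this sketch does not reproduce the exact labels in the table below.

```python
import re

# Hypothetical keyword map and stopword list, for illustration only.
CATEGORY_KEYWORDS = {
    "Retail Goods": {"shops", "jewelry", "music"},
    "Media and Technology": {"television", "cable", "satellite", "computer"},
    "Furniture and Home Furnishing": {"furniture", "furnishings"},
    "Services": {"services", "repair"},
    "Automotives": {"motor", "vehicle"},
}
STOPWORDS = {"and", "or", "the", "of", "other"}


def label_category(description: str) -> str:
    """Return a comma-separated, alphabetically sorted list of matching labels."""
    # Normalise: lower-case, keep alphabetic tokens only, drop stopwords.
    tokens = set(re.findall(r"[a-z]+", description.lower())) - STOPWORDS
    labels = [label for label, keywords in CATEGORY_KEYWORDS.items() if tokens & keywords]
    return ", ".join(sorted(labels))
```

Lower-casing before matching means mixed-case descriptions such as "wAtch, clock, and jewelry repair shops" are handled the same way as clean ones.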

In [10]:
merchant_categories = pd.read_parquet('../data/curated/merchant_categories.parquet')
merchant_categories.head()

Unnamed: 0,merchant_abn,name,category,revenue_rate,fraud_probability,take_rate_value,category_label
0,10023283211,Felis Limited,"furniture, home furnishings and equipment shop...",e,0.0,0.18,Furniture and Home Furnishing
1,10142254217,Arcu Ac Orci Corporation,"cable, satellite, and otHer pay television and...",b,0.0,4.22,"Media and Technology, Services"
2,10165489824,Nunc Sed Company,"jewelry, watch, clock, and silverware shops",b,0.0,4.4,Retail Goods
3,10187291046,Ultricies Dignissim Lacus Foundation,"wAtch, clock, and jewelry repair shops",b,0.0,3.29,"Automotives, Retail Goods, Services"
4,10192359162,Enim Condimentum PC,"music shops - musical instruments, pianos, and...",a,0.0,6.33,Retail Goods


The transaction data was grouped by each unique combination of merchant ABN and order datetime, and the dollar value and number of transactions were aggregated to give each merchant's total sales value and transaction count by date. The result was joined to the merchant category data on merchant ABN to match each merchant with its category. The mean dollar value and mean number of transactions were then computed per merchant ABN. This was merged with the merchant fraud data, and null values were replaced with 0, on the assumption that merchants absent from the fraud data had not been flagged and therefore had no chance of fraud. The comma-separated category labels were one-hot encoded, with one column per category. A model was then trained on this dataset so that it can predict revenue for merchants whose revenue is missing.
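
The aggregation and encoding steps can be sketched on a few synthetic rows. The data below is made up for illustration; the column names follow the tables in this notebook, while `total_value` and `num_transactions` are hypothetical intermediate names.

```python
import pandas as pd

# Synthetic rows for illustration only.
transactions = pd.DataFrame({
    "merchant_abn": [1, 1, 1, 2],
    "order_datetime": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-01"],
    "dollar_value": [10.0, 20.0, 30.0, 5.0],
})
categories = pd.DataFrame({
    "merchant_abn": [1, 2],
    "category_label": ["Retail Goods, Services", "Automotives"],
})

# Total sales value and transaction count per merchant per date.
daily = (
    transactions.groupby(["merchant_abn", "order_datetime"])["dollar_value"]
    .agg(total_value="sum", num_transactions="count")
    .reset_index()
)

# Mean of the per-date totals for each merchant.
per_merchant = (
    daily.groupby("merchant_abn")[["total_value", "num_transactions"]]
    .mean()
    .reset_index()
)

# One-hot encode the comma-separated labels, one column per category.
one_hot = categories["category_label"].str.get_dummies(sep=", ")
encoded = pd.concat([categories[["merchant_abn"]], one_hot], axis=1)

features = per_merchant.merge(encoded, on="merchant_abn", how="left")
```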

Gradient Boosting was chosen because it had the highest R-squared score of the models compared. A train-test split was performed on the dataset, and the resulting sets were used to train and test the model.

In [12]:
merchant_revenue_data = pd.read_parquet('../data/curated/merged_merchant_info_with_categories_and_features.parquet')
merchant_revenue_data.head()

Unnamed: 0,take_rate_value,fraud_probability,average_monthly_revenue,average_monthly_transactions,Retail Goods,Media and Technology,Furniture and Home Furnishing,Services,Automotives
0,0.18,0.0,216.736529,155.285714,0,0,1,0,0
21,4.22,0.0,37.716001,144.571429,0,1,0,1,0
42,4.4,0.0,11236.094771,0.454545,1,0,0,0,0
53,3.29,0.0,119.526004,16.8,1,0,0,1,1
73,6.33,0.0,462.32945,19.25,1,0,0,0,0


In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Target: average monthly revenue per merchant
y = merchant_revenue_data['average_monthly_revenue']

# Features: every remaining column
X = merchant_revenue_data.drop(columns=['average_monthly_revenue'])

# Hold out 20% of the merchants for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

model = GradientBoostingRegressor(
    learning_rate=0.01,
    max_depth=5,
    min_samples_leaf=1,
    min_samples_split=10,
    n_estimators=300
)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# For a regressor, score() returns R^2 rather than a classification accuracy
accuracy = model.score(X_test, y_test)
print("Accuracy (R^2 score):", accuracy)

# Importances are listed in the column order of X
print("Feature Importances:", model.feature_importances_)

Mean Squared Error: 5143584.7856222335
Accuracy (R^2 score): 0.5121377192512677
Feature Importances: [0.06283569 0.09981132 0.59467378 0.07122158 0.00176467 0.03832798
 0.03552407 0.09584091]
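
The raw output is easier to read once the MSE is converted to an RMSE, which is in the same units as `average_monthly_revenue`, and the importances are paired with column names. The sketch below copies the numbers from the output above and assumes the feature order matches the columns of the table shown earlier with the target dropped. An R^2 of about 0.51 means the model explains roughly half of the variance in the held-out revenue values.

```python
import math

# Values copied from the cell output above.
mse = 5143584.7856222335
importances = [0.06283569, 0.09981132, 0.59467378, 0.07122158,
               0.00176467, 0.03832798, 0.03552407, 0.09584091]

# Assumed column order of X (the table columns minus the target).
feature_names = ["take_rate_value", "fraud_probability", "average_monthly_transactions",
                 "Retail Goods", "Media and Technology", "Furniture and Home Furnishing",
                 "Services", "Automotives"]

rmse = math.sqrt(mse)  # error in the same units as the target
print(f"RMSE: {rmse:.2f}")

# Rank features from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Under that ordering assumption, `average_monthly_transactions` carries about 59% of the total importance, well ahead of every other feature.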


To predict consumer fraud, transaction data was merged with fraud probability information, creating two classification datasets. In `merged_fraud_15`, transactions with a fraud probability of 15% or less were classified as non-fraud (0); in `merged_fraud_20`, the threshold was set to 20% or less. Logistic regression models were trained on both datasets to classify whether a transaction is fraudulent. Separately, the original `merged_fraud` dataset, which contains the actual fraud probabilities, was used for linear regression to predict fraud probability directly. These models can be used to predict consumer fraud for future consumer data.
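
The thresholding step can be sketched as below. The rows are synthetic, `label_fraud` is a hypothetical helper name, and treating everything above the threshold as fraud (1) is an assumption drawn from the description rather than from the original code.

```python
import pandas as pd


def label_fraud(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Replace fraud_probability (in percent) with a binary is_fraud label."""
    labelled = df.drop(columns=["fraud_probability"]).copy()
    # At or below the threshold is non-fraud (0); above it is assumed to be fraud (1).
    labelled["is_fraud"] = (df["fraud_probability"] > threshold).astype(int)
    return labelled


# Synthetic rows for illustration only.
sample = pd.DataFrame({
    "num_orders": [1, 3, 2],
    "max_dollar_value": [2479.08, 1665.51, 900.00],
    "avg_dollar_value": [2479.08, 710.67, 450.00],
    "fraud_probability": [9.8, 17.5, 32.0],
})

fraud_15 = label_fraud(sample, threshold=15)
fraud_20 = label_fraud(sample, threshold=20)
```

The middle row, at 17.5%, is labelled fraud under the 15% threshold but non-fraud under the 20% threshold, which is why the two datasets can produce different classifiers.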

In [14]:
merged_fraud_15 = pd.read_parquet('../data/curated/merged_fraud_1.parquet')
merged_fraud_20 = pd.read_parquet('../data/curated/merged_fraud_2.parquet')
merged_fraud = pd.read_parquet('../data/curated/merged_fraud.parquet')

In [15]:
merged_fraud_15.head()

Unnamed: 0,num_orders,max_dollar_value,avg_dollar_value,is_fraud
0,1,2479.076338,2479.076338,0
1,3,1665.505861,710.667418,0
2,1,2236.461666,2236.461666,0
3,1,2334.493717,2334.493717,0
4,3,1791.908327,775.013087,0


In [16]:
merged_fraud_20.head()

Unnamed: 0,num_orders,max_dollar_value,avg_dollar_value,is_fraud
0,1,2479.076338,2479.076338,0
1,3,1665.505861,710.667418,0
2,1,2236.461666,2236.461666,0
3,1,2334.493717,2334.493717,0
4,3,1791.908327,775.013087,0


In [17]:
merged_fraud.head()

Unnamed: 0,user_id,order_datetime,num_orders,max_dollar_value,avg_dollar_value,fraud_probability
0,1,2022-02-20,1,2479.076338,2479.076338,9.805431
1,2,2021-08-30,3,1665.505861,710.667418,9.599514
2,2,2021-09-25,1,2236.461666,2236.461666,10.069851
3,3,2021-11-03,1,2334.493717,2334.493717,8.300636
4,4,2021-10-09,3,1791.908327,775.013087,9.633302


To identify merchants with the highest Compound Weekly Growth Rate (CWGR), weekly revenue was analysed. After loading and aggregating transaction data by merchant and week, a Prophet model was trained for each merchant to forecast future revenue and compute its CWGR. Merchants were ranked by CWGR, and the top 10 were selected for further analysis. The revenue of these top merchants was scaled using MinMax scaling, and smooth curves were generated to visualise the trends in their monthly revenue over time. This analysis helped identify merchants with the fastest revenue growth, which could signal increased profitability and potential for the BNPL firm. However, it is important to assess whether the growth is sustainable or driven by temporary factors such as seasonal sales (e.g., during Christmas).
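
The exact CWGR formula is not stated here. The sketch below assumes the usual compound growth definition, `(last / first) ** (1 / weeks) - 1`, and shows a plain min-max scaling alongside it. Both function names are hypothetical, and the Prophet forecasting step is deliberately left out.

```python
from typing import List, Sequence


def cwgr(first_week_revenue: float, last_week_revenue: float, num_weeks: int) -> float:
    """Compound weekly growth rate over num_weeks weekly steps (assumed definition)."""
    if first_week_revenue <= 0 or num_weeks <= 0:
        raise ValueError("first_week_revenue and num_weeks must be positive")
    return (last_week_revenue / first_week_revenue) ** (1 / num_weeks) - 1


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant series has no spread; map every point to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]
```

For example, revenue that grows from 100 to 121 over two weeks gives a CWGR of about 10% per week. Scaling each merchant's revenue to [0, 1] lets merchants of very different sizes be compared on one plot.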