# Feature Selection Method

This notebook demonstrates a supplementary feature selection method that we use alongside correlation analysis. It is for internal use only, because it lacks solid theoretical backing and is hard to interpret.

## Read Me
We aim to find a quantitative way to weight our features. A good feature should capture as much of the variance in the data as possible and should be independent of the other features.

Our idea is based on principal component analysis (PCA). The principal components are linear combinations of all the features, chosen to capture the maximum variance of the data. The absolute value of each weight in these combinations represents the contribution of the corresponding feature along that direction. Hence, the importance of a feature can be measured by the sum of the absolute values of its weights across the principal components. For example, if a feature has an extremely low weight in every principal component, it is probably not a very important feature. The metrics are standardized before running PCA so that features with larger scales do not dominate.
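The scoring idea above can be sketched on synthetic data. This is a minimal illustration rather than the notebook's actual pipeline: the feature names, the random data, and the helper `pca_importance` are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_importance(x, n_components):
    """Sum of absolute PCA loadings per feature (hypothetical helper)."""
    # Standardize so that features with large scales do not dominate.
    x_std = StandardScaler().fit_transform(x)
    pca = PCA(n_components=n_components).fit(x_std)
    # components_ has shape (n_components, n_features);
    # sum the absolute loadings over the component axis.
    return np.abs(pca.components_).sum(axis=0)


# Synthetic data: three made-up features with very different scales.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
names = ["feature_a", "feature_b", "feature_c"]

scores = pca_importance(x, n_components=2)
for name, score in sorted(zip(names, scores), key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Each principal component is a unit vector, so a single loading never exceeds 1 in absolute value and a feature's score is bounded by the number of components kept.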

Additionally, PCA lets us check whether the principal components correspond to the 4 challenges we defined. Unfortunately, the weights show only a partial match, so we decided not to bring this up in the presentation.

Based on the PCA weights, we derive a ranking of the features. The ranking is used only as a supplementary criterion; the primary criteria for feature selection remain correlation and domain knowledge.
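For context, a correlation-based screen of the kind mentioned above might look roughly like the following. This is only a sketch under stated assumptions: the toy columns, the `THRESHOLD` value, and the pair-listing logic are illustrative and are not taken from the team's actual correlation analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration.
rng = np.random.default_rng(1)
base = rng.normal(size=100)
df = pd.DataFrame({
    "revenue": base,
    "revenue_scaled": 3.0 * base,        # perfectly correlated with "revenue"
    "orders": rng.normal(size=100),
})

corr = df.corr().abs()
# Keep only the strict upper triangle so that each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

THRESHOLD = 0.95  # assumed cut-off, not a value from the notebook
redundant_pairs = [
    (row, col, upper.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > THRESHOLD
]
print(redundant_pairs)
```

Pairs flagged this way are candidates for keeping only one member, which fits the goal that a good feature should be independent of the others.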

In [2]:
import pandas as pd
import numpy as np
import os
import re
merchant_df = pd.read_parquet('../../data/curated/final_merchant_statistics')

## Feature grouping

In [2]:
merchant_df.columns

Index(['merchant_abn', 'name', 'tags', 'tag', 'revenue_level', 'take_rate',
       'avg_monthly_rev', 'discounted_avg_monthly_rev', 'avg_monthly_orders',
       'avg_monthly_approximate_fraudulent_orders', 'std_monthly_revenue',
       'std_monthly_discounted_revenue', 'sales_revenue',
       'discounted_sales_revenue', 'num_orders',
       'approximate_fraudulent_orders', 'avg_daily_rev',
       'discounted_avg_daily_rev', 'avg_value_per_order',
       'discounted_avg_value_per_order', 'avg_daily_orders',
       'avg_daily_approximate_fraudulent_orders', 'std_daily_revenue',
       'std_daily_discounted_revenue', 'avg_daily_commission',
       'discounted_avg_daily_commission', 'avg_monthly_commission',
       'discounted_avg_monthly_commission', 'avg_commission_per_order',
       'discounted_avg_commission_per_order', 'overall_commission',
       'discounted_overall_commission', 'overall_fraud_rate',
       'sa2_region_count', 'median_customer_income', 'returning_customers',
       '

In [3]:

risk = ['std_daily_revenue', 'discounted_sales_revenue',
        'approximate_fraudulent_orders', 'avg_daily_approximate_fraudulent_orders',
        'discounted_avg_value_per_order', 'std_daily_discounted_revenue',
        'discounted_avg_daily_rev', 'discounted_avg_daily_commission',
        'discounted_avg_commission_per_order', 'overall_fraud_rate',
        'discounted_overall_commission']

consumer_quality = ['median_customer_income', 'returning_customers']

revenue = ['sales_revenue', 'avg_daily_rev', 'take_rate', 'avg_daily_commission',
           'overall_commission', 'avg_commission_per_order', 'avg_value_per_order',
           'mean_spending', 'std_spending']

exposure = ['avg_daily_orders', 'sa2_region_count', 'num_orders',
            'unique_customers', 'vip_customers']

# PCA for ranking

In [4]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

merchant_df = merchant_df.dropna()

features = risk+consumer_quality+revenue+exposure

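# For some features, lower is better, so we take the reciprocal to flip their direction.
# EPSILON avoids division by zero; note that zero values then map to very large numbers.
EPSILON = 0.000001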
merchant_df["approximate_fraudulent_orders"]=1/(merchant_df["approximate_fraudulent_orders"]+EPSILON)
merchant_df["avg_daily_approximate_fraudulent_orders"]=1/(merchant_df["avg_daily_approximate_fraudulent_orders"]+EPSILON)
merchant_df["overall_fraud_rate"]=1/(merchant_df["overall_fraud_rate"]+EPSILON)

merchant_df["std_daily_revenue"]=1/(merchant_df["std_daily_revenue"]+EPSILON)
merchant_df["std_daily_discounted_revenue"]=1/(merchant_df["std_daily_discounted_revenue"]+EPSILON)
merchant_df["std_spending"]=1/(merchant_df["std_spending"]+EPSILON)

x = merchant_df.loc[:, features].values
x = StandardScaler().fit_transform(x)

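# We have 4 dimensions, so n_components was originally set to 4;
# however, 3 turned out to be better judging by the explained variance ratio.

pca = PCA(n_components=3)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(
    data=principalComponents,
    columns=['principal component 1', 'principal component 2', 'principal component 3'],
)

# dropna() leaves gaps in merchant_df's index, so reset it before concatenating;
# otherwise pd.concat aligns on index labels instead of row position.
resultDF = pd.concat(
    [principalDf, merchant_df["merchant_abn"].reset_index(drop=True)], axis=1
)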
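When PCA output is joined back to identifiers from a frame that has been through `dropna()`, index alignment matters: `pd.concat(..., axis=1)` matches rows by index label rather than by position. A minimal sketch on a made-up frame (the `id`, `value`, and `pc1` columns are hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the middle row has a missing value and gets dropped.
df = pd.DataFrame({"id": [101, 102, 103], "value": [1.0, np.nan, 3.0]})
clean = df.dropna()  # remaining index labels are 0 and 2

# Stand-in for PCA scores, which come back with a fresh 0..n-1 index.
scores = pd.DataFrame({"pc1": [0.5, -0.5]})

# Label-based alignment: labels 0, 1, 2 are unioned and gaps become NaN.
misaligned = pd.concat([scores, clean["id"]], axis=1)

# Resetting the index first makes the rows line up by position.
aligned = pd.concat([scores, clean["id"].reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

An integer identifier column that unexpectedly shows up as floating point after such a concatenation can be a hint that missing values were introduced by misalignment.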

In [7]:
resultDF.head()

|   | principal component 1 | principal component 2 | principal component 3 | merchant_abn |
|---|---|---|---|---|
| 0 | 0.707971 | -1.745871 | -0.957668 | 10023280000.0 |
| 1 | -2.891114 | 3.849745 | 2.091969 | 10346860000.0 |
| 2 | -1.373781 | -0.486562 | 0.170239 | 10385160000.0 |
| 3 | 5.946911 | -1.185472 | -0.220086 | 10648960000.0 |
| 4 | 0.745905 | -2.021697 | -0.846199 | 10714070000.0 |


In [8]:
weights = pca.components_
weights

array([[-0.10024615,  0.2730169 , -0.08612818, -0.08620144, -0.10524506,
        -0.10164052,  0.2730169 ,  0.26948178, -0.08697996,  0.10463658,
         0.26948178,  0.00564998,  0.21388932,  0.27251442,  0.27251442,
         0.02117211,  0.2690897 ,  0.2690897 , -0.08718716, -0.10522241,
        -0.08398297, -0.00435665,  0.22774523,  0.209856  ,  0.22774523,
         0.25721302,  0.24820977],
       [-0.14081094,  0.10061088,  0.19721034,  0.19733985,  0.36326284,
        -0.13739872,  0.10061088,  0.10781025,  0.34465905, -0.36215989,
         0.10781025, -0.01045544,  0.0515702 ,  0.10131381,  0.10131381,
         0.04533188,  0.10855316,  0.10855316,  0.34495811,  0.36310179,
         0.37287955, -0.01540488,  0.0316987 , -0.12270765,  0.0316987 ,
        -0.02910077, -0.03329794],
       [ 0.49597827,  0.02026564,  0.41654951,  0.41648732, -0.02815862,
         0.49865336,  0.02026564,  0.02649015, -0.0542369 ,  0.05375669,
         0.02649015, -0.02448891,  0.21211147,  0.0189

In [9]:
pca.explained_variance_ratio_


array([0.42191229, 0.23238604, 0.08958295])

In [10]:
print(sum(pca.explained_variance_ratio_))

0.7438812839011149
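A cumulative view of the explained variance ratio can help when choosing the number of components. The sketch below uses synthetic data as a stand-in for the standardized merchant features, so the numbers it prints are not the notebook's; the latent-factor setup is purely an assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
x = latent @ mixing + 0.1 * rng.normal(size=(300, 8))
x = StandardScaler().fit_transform(x)

# Fit with all components kept so the full spectrum can be inspected.
pca = PCA().fit(x)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for k, total in enumerate(cumulative, start=1):
    print(f"{k} components: {total:.3f}")
```

Reading down the printed list shows how much each extra component adds, which is one way to judge whether a further component is worth keeping.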


## Feature selection by PCA weights

In [11]:
# Sum the absolute weight of each feature across all principal components.
absolute_weights = np.abs(weights).sum(axis=0)

# Rank feature indices by total absolute weight, largest first.
ranked_indices = sorted(
    range(len(features)), key=lambda i: absolute_weights[i], reverse=True
)

# Print the 15 highest-ranked features.
for i in ranked_indices[:15]:
    print(features[i])

std_daily_discounted_revenue
std_daily_revenue
avg_daily_approximate_fraudulent_orders
approximate_fraudulent_orders
overall_fraud_rate
discounted_avg_value_per_order
avg_value_per_order
sa2_region_count
discounted_avg_commission_per_order
avg_commission_per_order
mean_spending
returning_customers
avg_daily_orders
num_orders
discounted_avg_daily_commission
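The ranking above sums absolute weights with every component counted equally, even though the components explain different shares of the variance. One possible variation, which is not what this notebook does, is to scale each component's weights by its explained variance ratio before summing. The loadings and ratios below are made up to show how the two rankings can differ:

```python
import numpy as np

# Hypothetical loadings for 2 components x 3 features (not the notebook's values).
components = np.array([
    [0.8, 0.6, 0.0],
    [0.0, 0.1, 0.995],
])
# Made-up explained variance ratios for the two components.
explained_variance_ratio = np.array([0.7, 0.1])

# Plain sum of absolute loadings, as in the ranking above.
unweighted = np.abs(components).sum(axis=0)

# Variation: scale each component's loadings by its share of explained variance.
weighted = (np.abs(components) * explained_variance_ratio[:, None]).sum(axis=0)

unweighted_rank = np.argsort(-unweighted).tolist()
weighted_rank = np.argsort(-weighted).tolist()
print(unweighted_rank)
print(weighted_rank)
```

In this toy case the third feature ranks first under the plain sum only because it dominates a component that explains little variance; the weighted variant moves it to last place.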
