## Revolut FinCrime Challenge: Feature Selection

There will be a notebook for each one of the Machine Learning Pipeline steps:

1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building

**This is the notebook for step 3: Feature Selection**


## Predicting Fraudulent Transactions

The aim of the project is to build a machine learning model that identifies fraudsters so that appropriate action can be taken.

![SegmentLocal](fraud.gif "segment")

### Why is this important? 

Fraudsters can use stolen data to move other people's money from the outside into an account in our app via top-ups, so finding and blocking them is essential.

## Feature Selection

In the following cells, I will select a group of variables, the most predictive ones, to build our machine learning models. 

### Why do we need to select variables?

1. For production: Fewer variables mean smaller client input requirements (e.g. customers filling out a form on a website or mobile app), and hence less code for error handling. This reduces the chances of bugs.
2. For model performance: Fewer variables mean simpler, more interpretable models that are less prone to over-fitting.

### There are different ways to do the feature selection

1. Filter methods
2. Wrapper methods
3. Embedded methods
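
To make the three families concrete, here is a minimal sketch using one scikit-learn selector per family. The synthetic data, the choice of `k=3` and the value `C=0.05` are assumptions made only for illustration; they are not taken from the challenge dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the challenge dataset):
# 10 features, 3 of which carry signal.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

# 1. Filter: score each feature on its own, independently of any model.
filter_sel = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# 2. Wrapper: repeatedly fit a model and eliminate the weakest feature.
wrapper_sel = RFE(
    LogisticRegression(solver="liblinear"), n_features_to_select=3
).fit(X, y)

# 3. Embedded: the selection is a by-product of fitting one penalised model.
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", C=0.05, solver="liblinear", random_state=0)
).fit(X, y)

for name, sel in [("filter", filter_sel),
                  ("wrapper", wrapper_sel),
                  ("embedded", embedded_sel)]:
    print(name, "keeps", int(sel.get_support().sum()), "features")
```

The filter and wrapper selectors keep exactly the number of features requested, whereas the embedded selector keeps however many coefficients survive the penalty.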

In this notebook I have used an embedded method to select the variables.


**I will select variables using the L1 regularization in Logistic regression: L1 regularization has the property of setting the coefficient of non-informative variables to zero. This way we can identify those variables and remove them from our final models.**
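
As a rough illustration of that property, the sketch below fits the same logistic regression with an L1 and an L2 penalty and counts the coefficients that are exactly zero. The synthetic data (4 informative features followed by 16 noise features) and the value `C=0.01` are assumptions chosen to make the effect visible; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: with shuffle=False the first 4 columns are the
# informative ones and the remaining 16 are pure noise.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)
X = StandardScaler().fit_transform(X)

# Same model and penalty strength, only the penalty type differs.
l1 = LogisticRegression(penalty="l1", C=0.01, solver="liblinear",
                        random_state=0).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.01, solver="liblinear",
                        random_state=0).fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
noise_zeroed_l1 = int(np.sum(l1.coef_[0, 4:] == 0))

print("coefficients exactly zero with L1:", n_zero_l1)
print("coefficients exactly zero with L2:", n_zero_l2)
print("noise features zeroed by L1:", noise_zeroed_l1)
```

The L2 penalty only shrinks coefficients towards zero, which is why it cannot be used for selection in the same way.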

Let's go ahead and load the datasets.

In [16]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# to build the models
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [17]:
# load dataset
# I loaded the datasets with the engineered values: I built and saved these datasets in the previous notebook.


X_train = pd.read_csv('../data/xtrain.csv')
X_test = pd.read_csv('../data/xtest.csv')

X_train.head()

Unnamed: 0,user_id,transaction_id,transaction_ts,registered_ts,birth_date,transaction_date,user_age,is_fraud,avg_transaction_count,amount_gbp,transaction_hr,hr_gt_nine,state_COMPLETED,state_DECLINED,state_FAILED,state_REVERTED,currency_BGN,currency_CHF,currency_CZK,currency_EUR,currency_GBP,currency_HRK,currency_HUF,currency_PLN,currency_RON,currency_Rare,currency_SEK,currency_USD,country_BG,country_CH,country_CZ,country_DE,country_ES,country_FR,country_GB,country_HR,country_HU,country_IE,country_IT,country_LT,country_LV,country_PL,country_PT,country_RO,country_Rare,country_SE,type_ATM,type_CARD_PAYMENT,type_EXCHANGE,type_FEE,type_TOPUP,type_TRANSFER,age_18-27,age_28-37,age_38-47,age_48-57,age_58-67,age_Rare
0,00001f33-1d47-47a3-8955-e719172e788b,46285eca-e1b1-43a2-b8b0-c173b48f5a95,2019-04-21 11:04:53.081,2019-04-11 20:49:42.623,1994-08-17 00:00:00.000,2019-04-21,25,0,-0.508167,0.054028,-0.580791,0.504333,0.72799,-0.250637,-0.331993,-0.481188,-0.112081,-0.174942,-0.138751,1.256661,-0.527487,-0.13125,-0.176167,-0.322369,-0.265881,-0.197194,-0.1061,-0.290565,-0.116779,-0.199412,-0.151228,-0.121195,-0.233889,-0.316005,-0.616132,-0.107685,-0.176442,3.487115,-0.133202,-0.122211,-0.115657,-0.386638,-0.211368,-0.290697,-0.272798,-0.10378,-0.136299,1.620741,-0.226587,-0.251627,-1.136748,-0.187375,1.370363,-0.639831,-0.482421,-0.347228,-0.200353,-0.166816
1,00001f33-1d47-47a3-8955-e719172e788b,1be7d138-fbaf-4216-8dc8-5056ec5f972c,2019-04-21 10:14:53.814,2019-04-11 20:49:42.623,1994-08-17 00:00:00.000,2019-04-21,25,0,-0.508167,0.043652,-0.778896,0.504333,0.72799,-0.250637,-0.331993,-0.481188,-0.112081,-0.174942,-0.138751,1.256661,-0.527487,-0.13125,-0.176167,-0.322369,-0.265881,-0.197194,-0.1061,-0.290565,-0.116779,-0.199412,-0.151228,-0.121195,-0.233889,-0.316005,-0.616132,-0.107685,-0.176442,3.487115,-0.133202,-0.122211,-0.115657,-0.386638,-0.211368,-0.290697,-0.272798,-0.10378,-0.136299,1.620741,-0.226587,-0.251627,-1.136748,-0.187375,1.370363,-0.639831,-0.482421,-0.347228,-0.200353,-0.166816
2,00001f33-1d47-47a3-8955-e719172e788b,a9aa681d-451e-44c5-8df0-687661ac583d,2019-04-19 21:05:53.192,2019-04-11 20:49:42.623,1994-08-17 00:00:00.000,2019-04-19,25,0,-0.508167,0.346044,1.400259,0.504333,0.72799,-0.250637,-0.331993,-0.481188,-0.112081,-0.174942,-0.138751,1.256661,-0.527487,-0.13125,-0.176167,-0.322369,-0.265881,-0.197194,-0.1061,-0.290565,-0.116779,-0.199412,-0.151228,-0.121195,-0.233889,-0.316005,-0.616132,-0.107685,-0.176442,3.487115,-0.133202,-0.122211,-0.115657,-0.386638,-0.211368,-0.290697,-0.272798,-0.10378,-0.136299,-0.617002,-0.226587,-0.251627,0.879702,-0.187375,1.370363,-0.639831,-0.482421,-0.347228,-0.200353,-0.166816
3,00001f33-1d47-47a3-8955-e719172e788b,e6021128-f4c1-4164-b3de-697e66ad613c,2019-04-11 20:55:20.996,2019-04-11 20:49:42.623,1994-08-17 00:00:00.000,2019-04-11,25,0,-0.508167,-0.121944,1.202154,0.504333,0.72799,-0.250637,-0.331993,-0.481188,-0.112081,-0.174942,-0.138751,1.256661,-0.527487,-0.13125,-0.176167,-0.322369,-0.265881,-0.197194,-0.1061,-0.290565,-0.116779,-0.199412,-0.151228,-0.121195,-0.233889,-0.316005,-0.616132,-0.107685,-0.176442,3.487115,-0.133202,-0.122211,-0.115657,-0.386638,-0.211368,-0.290697,-0.272798,-0.10378,-0.136299,-0.617002,-0.226587,3.974133,-1.136748,-0.187375,1.370363,-0.639831,-0.482421,-0.347228,-0.200353,-0.166816
4,00001f33-1d47-47a3-8955-e719172e788b,9499c9c9-c9a9-410f-820d-c6e92fed27fb,2019-04-11 20:53:54.700,2019-04-11 20:49:42.623,1994-08-17 00:00:00.000,2019-04-11,25,0,-0.508167,-0.144518,1.202154,0.504333,-1.373644,-0.250637,-0.331993,2.078191,-0.112081,-0.174942,-0.138751,1.256661,-0.527487,-0.13125,-0.176167,-0.322369,-0.265881,-0.197194,-0.1061,-0.290565,-0.116779,-0.199412,-0.151228,-0.121195,-0.233889,-0.316005,-0.616132,-0.107685,-0.176442,3.487115,-0.133202,-0.122211,-0.115657,-0.386638,-0.211368,-0.290697,-0.272798,-0.10378,-0.136299,-0.617002,-0.226587,-0.251627,0.879702,-0.187375,1.370363,-0.639831,-0.482421,-0.347228,-0.200353,-0.166816


In [18]:
# capture the target
y_train = X_train['is_fraud']
y_test = X_test['is_fraud']

# drop unnecessary variables (and the target) from our training and testing sets
cols_to_drop = ['user_id', 'transaction_id', 'transaction_ts', 'registered_ts',
                'birth_date', 'transaction_date', 'user_age', 'is_fraud']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)

### Feature Selection

Let's go ahead and select a subset of the most predictive features. The solver that fits the L1-penalised logistic regression has an element of randomness, so I set the seed via `random_state`.

In [19]:
# here I do the model fitting and feature selection
# together in one line of code

# first, I specify the Logistic Regression model with an L1 penalty and
# choose a suitable C, the inverse of the regularisation strength.
# The smaller the C, the stronger the penalty and the fewer features
# that will be selected.

# Then I use the SelectFromModel object from sklearn, which
# will select the features whose coefficients are non-zero

# the solver is set explicitly because only some solvers support the L1
# penalty; liblinear was the default in the sklearn version used here
sel_ = SelectFromModel(LogisticRegression(penalty='l1', C=1, solver='liblinear', random_state=0))  # random_state sets the seed
sel_.fit(X_train, y_train)



SelectFromModel(estimator=LogisticRegression(C=1, class_weight=None, dual=False,
                                             fit_intercept=True,
                                             intercept_scaling=1, l1_ratio=None,
                                             max_iter=100, multi_class='warn',
                                             n_jobs=None, penalty='l1',
                                             random_state=0, solver='warn',
                                             tol=0.0001, verbose=0,
                                             warm_start=False),
                max_features=None, norm_order=1, prefit=False, threshold=None)
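
To show how `C` controls the number of selected features, here is a small sketch that repeats the selection for several values of `C`. The synthetic data and the grid of `C` values are assumptions for illustration only; on the real data the counts would differ.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 30 features, 5 informative.
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=5, n_redundant=0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

# C is the inverse of the regularisation strength:
# small C = strong penalty, large C = weak penalty.
n_selected = {}
for C in [0.001, 0.01, 0.1, 1, 10]:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear",
                               random_state=0)
    sel = SelectFromModel(model).fit(X, y)
    n_selected[C] = int(sel.get_support().sum())

for C, n in n_selected.items():
    print("C = {:<6} selected features: {}".format(C, n))
```

In practice `C` could be tuned by cross-validation rather than fixed by hand, which is a design choice this notebook does not explore.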

In [20]:
# this command lets us visualise those features that were kept.
# Kept features have a True indicator
sel_.get_support()

array([ True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True])

In [21]:
# let's print the number of total and selected features

# this is how we can make a list of the selected features
selected_feat = X_train.columns[(sel_.get_support())]

# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 50
selected features: 46
features with coefficients shrunk to zero: 4


In [22]:
# print the selected features
selected_feat

Index(['avg_transaction_count', 'amount_gbp', 'transaction_hr', 'hr_gt_nine',
       'state_DECLINED', 'state_FAILED', 'state_REVERTED', 'currency_BGN',
       'currency_CHF', 'currency_CZK', 'currency_EUR', 'currency_GBP',
       'currency_HRK', 'currency_PLN', 'currency_RON', 'currency_Rare',
       'currency_SEK', 'currency_USD', 'country_BG', 'country_CH',
       'country_CZ', 'country_DE', 'country_ES', 'country_FR', 'country_GB',
       'country_HR', 'country_HU', 'country_IT', 'country_LT', 'country_LV',
       'country_PL', 'country_PT', 'country_RO', 'country_Rare', 'country_SE',
       'type_ATM', 'type_CARD_PAYMENT', 'type_EXCHANGE', 'type_FEE',
       'type_TOPUP', 'type_TRANSFER', 'age_18-27', 'age_38-47', 'age_48-57',
       'age_58-67', 'age_Rare'],
      dtype='object')

### Identify the selected variables

In [23]:
# this is an alternative way of identifying the selected features 
# based on the non-zero regularisation coefficients:
selected_feats = X_train.columns[(sel_.estimator_.coef_ != 0).ravel().tolist()]
selected_feats

Index(['avg_transaction_count', 'amount_gbp', 'transaction_hr', 'hr_gt_nine',
       'state_DECLINED', 'state_FAILED', 'state_REVERTED', 'currency_BGN',
       'currency_CHF', 'currency_CZK', 'currency_EUR', 'currency_GBP',
       'currency_HRK', 'currency_PLN', 'currency_RON', 'currency_Rare',
       'currency_SEK', 'currency_USD', 'country_BG', 'country_CH',
       'country_CZ', 'country_DE', 'country_ES', 'country_FR', 'country_GB',
       'country_HR', 'country_HU', 'country_IT', 'country_LT', 'country_LV',
       'country_PL', 'country_PT', 'country_RO', 'country_Rare', 'country_SE',
       'type_ATM', 'type_CARD_PAYMENT', 'type_EXCHANGE', 'type_FEE',
       'type_TOPUP', 'type_TRANSFER', 'age_18-27', 'age_38-47', 'age_48-57',
       'age_58-67', 'age_Rare'],
      dtype='object')
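
The two approaches agree because, for an estimator with an L1 penalty, `SelectFromModel` keeps the features whose absolute coefficient is at or above a default threshold of 1e-5. The sketch below checks this on a small synthetic dataset and also lists the dropped features by name. The data, the `feat_i` column names and `C=0.02` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical column names.
X_arr, y = make_classification(
    n_samples=500, n_features=8, n_informative=2, n_redundant=0,
    random_state=0,
)
X = pd.DataFrame(X_arr, columns=["feat_{}".format(i) for i in range(8)])

sel = SelectFromModel(
    LogisticRegression(penalty="l1", C=0.02, solver="liblinear", random_state=0)
).fit(X, y)

# Boolean mask from the selector, split into kept and dropped names.
mask = sel.get_support()
kept = X.columns[mask]
dropped = X.columns[~mask]

# Rebuild the same mask by hand from the fitted coefficients.
threshold_mask = np.abs(sel.estimator_.coef_).ravel() >= sel.threshold_

print("threshold used:", sel.threshold_)
print("kept:", list(kept))
print("dropped:", list(dropped))
print("masks agree:", bool((threshold_mask == mask).all()))
```

Comparing against `coef_ != 0` gives the same result as long as no coefficient is non-zero but smaller than the threshold in absolute value.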

In [24]:
# now we save the selected list of features
pd.Series(selected_feats).to_csv('../data/selected_features.csv', index=False)
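
A later notebook can read this list back and use it to subset the data. Below is a minimal round-trip sketch. The tiny DataFrame, the temporary path and the `feature` column name are assumptions for illustration, and `header=True` is passed explicitly because the default for `header` in `Series.to_csv` has differed between pandas versions.

```python
import os
import tempfile

import pandas as pd

# Hypothetical selected features and a tiny stand-in for the engineered data.
selected = pd.Index(["amount_gbp", "transaction_hr", "type_TOPUP"])
X = pd.DataFrame({
    "amount_gbp": [0.1, -0.2],
    "transaction_hr": [1.2, -0.5],
    "state_COMPLETED": [0.7, -1.3],
    "type_TOPUP": [0.8, -1.1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "selected_features.csv")

    # Save the feature names with an explicit header row.
    pd.Series(selected, name="feature").to_csv(path, index=False, header=True)

    # Read them back as a plain Python list.
    reloaded = pd.read_csv(path)["feature"].tolist()

# Keep only the selected columns, in the saved order.
X_selected = X[reloaded]
print(X_selected.columns.tolist())
```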

  

