# Feature Engineering

In notebook [01-eda.ipynb](01-eda.ipynb) we explored our data, visualising the proportion of fraudulent and legitimate transactions in our dataset, conditional on other values of the data. Exploring the data is a vital part of the data science workflow - it's important to understand what information the data holds and to get a sense, early on, of whether the data will be sufficient to answer the question at hand. In our case, this question is 'is a particular transaction fraudulent or legitimate?'. 

We were able to see patterns and correlations in the data, which suggests that it carries enough signal to proceed to the next step of the machine learning workflow, namely Feature Engineering. 

Feature Engineering aims to process the data into a format which the machine learning model will interpret correctly. In this notebook we will transform our data, whilst ensuring that the transformed data still holds enough information to distinguish between legitimate and fraudulent transactions.  

## Loading data

In notebook [01-eda.ipynb](01-eda.ipynb) we downloaded our data from S3 storage, so we don't need to download it again. Instead, we load it from our persistent volume, using the `read_parquet` function:

In [1]:
import numpy as np
import pandas as pd
df = pd.read_parquet("fraud-cleaned-sample.parquet")

## Train/test split

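Before transforming the data we split it into train and test sets, so that every transformation is fitted on the training data alone and the test set stays unseen until evaluation. 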

In [2]:
from sklearn import model_selection
train, test = model_selection.train_test_split(df, random_state=43)

In [3]:
print(len(train))

1875000


In [4]:
print(len(test))

625000


In [5]:
len(train) / (len(train) + len(test))

0.75
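
The 0.75 above follows from the default behaviour of `train_test_split`, which holds out 25% of the rows for testing when no sizes are given. As a rough illustration of what a shuffle-and-split does, here is a plain-Python sketch. It is not scikit-learn's implementation, and `simple_split` is a hypothetical helper written only for this example.

```python
import random

def simple_split(rows, test_fraction=0.25, seed=43):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle, and so the split, reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: 1000 row identifiers.
train_rows, test_rows = simple_split(range(1000))
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=43` in the cell above: re-running the notebook gives the same split every time.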

# Encoding categorical features

Some of our features are obvious quantities (like interarrival times and transaction amounts), but others are categories of things (like merchant IDs and transaction types).  In a conventional programming language or database schema, we'd use enumerated types (C programmers may want to use distinguished small integers) to model categories of things, but those aren't suitable for input to machine learning algorithms.

Why?

Well, let's say we encode transaction types as small integers, like this:

```
MANUAL=0
SWIPE=1
CHIP_AND_PIN=2
CONTACTLESS=3
ONLINE=4
```

We can use this representation to write code that treats these differently, but the integers don't actually capture anything about our problem that a machine learning algorithm can exploit -- a manual transaction isn't "less than" a swipe transaction, and an online transaction isn't "closer to" a contactless transaction than a manual one is.  We want a representation that makes sure that manual transactions are similar to other manual transactions in some way _but dissimilar to all other transactions_ in that way.

There are several approaches we can use to make sense of categorical features, and we'll use two of them in this notebook:

- [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) for merchant IDs and
- [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) for transaction types
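
To make the contrast between the two representations concrete, here's a small plain-Python sketch comparing the integer codes above with one-hot vectors. The helpers `one_hot` and `sq_dist` are written for illustration only; in this notebook the actual encoding is done by scikit-learn's `OneHotEncoder`.

```python
from itertools import combinations

# Same order as the integer codes above: MANUAL=0 ... ONLINE=4.
TYPES = ["manual", "swipe", "chip_and_pin", "contactless", "online"]

def one_hot(value, categories=TYPES):
    """Return a vector with a single 1 in the position of `value`."""
    return [1 if value == c else 0 for c in categories]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

pairs = list(combinations(TYPES, 2))
int_dists = {(a, b): (TYPES.index(a) - TYPES.index(b)) ** 2 for a, b in pairs}
hot_dists = {(a, b): sq_dist(one_hot(a), one_hot(b)) for a, b in pairs}

print(sorted(set(int_dists.values())))  # several different distances
print(sorted(set(hot_dists.values())))  # a single distance shared by every pair
```

With integer codes, the distance between two types depends on an arbitrary ordering. With one-hot vectors, every pair of distinct types is equally far apart, and each type is at distance zero from itself.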

In [6]:
import sklearn
from sklearn.pipeline import Pipeline
from sklearn import feature_extraction, preprocessing
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

stringize = np.frompyfunc(lambda x: "%s" % x, 1, 1)

def mk_stringize(colname):
    def stringize(tab):
        return [{colname : s} for s in tab]
    return stringize

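def amap(s):
    # Wrap each merchant ID in a single-entry dict, the input format
    # that FeatureHasher expects when input_type='dict'.
    return s.map(lambda x: {'merchant_id' : str(x)})

my_func = amap

def mk_hasher(features=16384, values=None):    
    # Two steps: turn the column into dicts, then hash each dict
    # into a sparse row with `features` columns.
    return Pipeline([('dictify', 
                      FunctionTransformer(my_func, accept_sparse=True)), 
                     ('hasher', 
                      sklearn.feature_extraction.FeatureHasher(n_features=features, input_type='dict'))])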


HASH_BUCKETS = 128

tt_xform = ('onehot', sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore', categories=[['online','contactless','chip_and_pin','manual','swipe']]), ['trans_type'])
mu_xform = ('m_hashing', mk_hasher(HASH_BUCKETS), 'merchant_id')

xform_steps = [tt_xform, mu_xform]

cat_xform = ColumnTransformer(transformers=xform_steps, n_jobs=None)
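
To give some intuition for what the hasher does to a single merchant ID, here's a simplified sketch of the hashing trick in plain Python. It's an illustration under assumptions rather than a reimplementation: scikit-learn's `FeatureHasher` uses MurmurHash3, whereas this sketch swaps in MD5 from the standard library, so the bucket indices won't match the real ones. `bucket_and_sign` and `hash_features` are hypothetical names.

```python
import hashlib

HASH_BUCKETS = 128

def bucket_and_sign(key, n_buckets=HASH_BUCKETS):
    """Map a string key to a (bucket index, +1 or -1) pair."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # The first 8 bytes pick the bucket; a separate byte picks the sign.
    index = int.from_bytes(digest[:8], "big") % n_buckets
    sign = 1 if digest[8] % 2 == 0 else -1
    return index, sign

def hash_features(record, n_buckets=HASH_BUCKETS):
    """Turn a dict like {'merchant_id': '1234'} into a fixed-width vector."""
    vec = [0.0] * n_buckets
    for name, value in record.items():
        # String values are hashed as "name=value" with an implied weight of 1.
        index, sign = bucket_and_sign(f"{name}={value}", n_buckets)
        vec[index] += sign
    return vec

vec = hash_features({"merchant_id": "1234"})
print([(i, x) for i, x in enumerate(vec) if x != 0])
```

Each record lights up exactly one of the 128 buckets with +1 or -1, however many distinct merchants exist. Different merchants can collide in the same bucket, which is the price we pay for a fixed-width representation.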


# Visualizing categorical features

The general approach we'll use is to [_reduce the dimensionality_](https://en.wikipedia.org/wiki/Dimensionality_reduction) of our encoded categorical features so we can plot them as points on a plane.  This means going from a hundred or more dimensions (in the case of hashed merchant IDs) or five dimensions (in the case of one-hot encoded transaction types) to two dimensions.

We'll use two different techniques:  a linear technique called [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) and a nonlinear technique called [t-distributed stochastic neighbor embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).  The details of these techniques are out of scope for this workshop, but they're both good places to start if you want to visualize some high-dimensional data.  Dimensionality reduction can be expensive, so we'll start by sampling only a small amount of our data.  

In [7]:
vis_sample = pd.concat([train[train["label"] == label].sample(2500) for label in ["legitimate", "fraud"]])

categorical_matrix = cat_xform.fit_transform(vis_sample)

When possible, it's always a good idea to check that your transformed data looks as you'd expect. At this point in the process we can look at the number of nonzero entries in each column of the transformed matrix. 

In [8]:
none_zero_xform = (categorical_matrix != 0).sum(0)
none_zero_xform

matrix([[2195,  708,  620,  951,  526,   29,   30,   33,   41,   26,
           40,   44,   28,   56,   38,   35,   24,   41,   44,   32,
           33,   43,   34,   39,   64,   46,   17,   54,   38,   44,
           41,   32,   41,   30,   52,   58,   42,   27,   46,   47,
           28,   54,   39,   34,   65,   61,   42,   34,   43,   57,
           64,   50,   35,   18,   54,   34,   27,   42,   31,   24,
           61,   46,   20,   39,   38,   34,   34,   39,   34,   30,
           42,   25,   61,   37,   26,   43,   41,   32,   37,   53,
           32,   43,   53,   43,   46,   40,   32,   32,   50,   45,
           27,   37,   45,   29,   47,   56,   37,   35,   38,   29,
           31,   49,   62,   27,   47,   40,   32,   43,   29,   38,
           20,   37,   25,   36,   27,   58,   29,   44,   49,   39,
           37,   25,   57,   24,   31,   37,   22,   42,   37,   31,
           41,   39,   42]])

We see that the first five values are notably larger than the rest. This means that a large share of the nonzero entries fall within the first 5 columns of the matrix. Does this match our intuition? 

Well, the first 5 columns correspond to `tt_xform` - the one-hot transform of the transaction type. Since each transaction has exactly one type, it seems sensible that these values are high, and indeed if we sum up these first 5 values we find...


In [9]:
none_zero_xform[0,0:5].sum()

5000

... 5000 - that's the exact number of entries we sampled in `vis_sample` - 2500 fraud and 2500 legitimate. So this looks correct. 

✅ *You can go ahead and do similar checks on the rest of the transformed matrix - do their entries sum to the value you expect?*

Let's move on to visualising these feature vectors:

## Does the merchant ID obviously correlate with fraud?

We're going to start by using PCA to plot the first two principal components of the encoded merchants -- think of this as mapping from the high-dimensional space to a two-dimensional space in a way that emphasizes the dimensions that contain the most information and minimizes the dimensions that contain the least information.
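
As a rough intuition for what "the dimensions that contain the most information" means, here's a plain-Python sketch that estimates the first principal component of a toy dataset using power iteration. The data and the helper `top_component` are made up for illustration, and scikit-learn's `PCA` relies on different, more robust solvers than this simple loop.

```python
import random

def top_component(points, iterations=200):
    """Estimate the direction of greatest variance by power iteration."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Repeatedly applying the covariance matrix pulls v towards
        # its dominant eigenvector; renormalise after each step.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v, centered

rng = random.Random(0)
# Hypothetical toy data: 3-D points spread widely along the line y = x,
# with only a little noise in every other direction.
points = []
for _ in range(200):
    t = rng.gauss(0, 5)
    points.append([t + rng.gauss(0, 0.5), t + rng.gauss(0, 0.5), rng.gauss(0, 0.5)])

component, centered = top_component(points)
# Projecting onto the component gives one coordinate per point.
projected = [sum(c * x for c, x in zip(component, row)) for row in centered]
print([round(c, 2) for c in component])
```

The recovered direction puts roughly equal weight on the first two coordinates and almost none on the third, which is where the toy data varies the most. Plotting two such components against each other is what the next cells do for the hashed merchant IDs.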

In [10]:
import sklearn.decomposition

merchants = categorical_matrix[:, -HASH_BUCKETS:]

DIMENSIONS = 2

mpca2 = sklearn.decomposition.PCA(DIMENSIONS)

mpca2_a = mpca2.fit_transform(merchants.toarray())

In [11]:
merchants_df = pd.DataFrame({"label": vis_sample["label"].astype(object),
                             "x": mpca2_a.T[0],
                             "y": mpca2_a.T[1]}).reset_index().dropna()

del merchants_df["index"]

In [12]:
import altair as alt
alt.Chart(merchants_df).mark_point(opacity=0.1).encode(
    x="x:Q", 
    y="y:Q", 
    color="label"
).interactive()


This graph is interactive, so zoom in and take a look!

As we can see, there's a lot of overlap between the classes here and merchant ID alone isn't an obvious way to differentiate between legitimate and fraudulent transactions.

## What if we use a nonlinear visualization technique?

Sometimes, a nonlinear visualization technique can work better than a linear one like PCA.  The next approach we'll try is called t-distributed stochastic neighbor embedding, or t-SNE for short.  t-SNE learns a mapping from high-dimensional points to low-dimensional points so that points that are close together in high-dimensional space are likely to be close together in low-dimensional space as well.  t-SNE can sometimes identify structure that simpler techniques like PCA can't, but this power comes at a cost:  it is much more expensive to compute than PCA and doesn't parallelize well.  (t-SNE also works best for producing two-dimensional visualizations when it is reducing from tens of dimensions rather than hundreds or thousands.  So, in some cases, you'll want to use a fast linear technique to reduce your data to a few dozen dimensions before using t-SNE.  That's what we're doing with the `TruncatedSVD` class in the next cell, which plays the same role as PCA here and reduces the hashed merchant IDs to 16 dimensions.)

✅ *You can go back and re-run this entire notebook after changing `HASH_BUCKETS` to a different value.*



In [14]:
import sklearn.manifold
tsne = sklearn.manifold.TSNE()

# use SVD to reduce the dimensionality before fitting t-SNE
svd = sklearn.decomposition.TruncatedSVD(16)
svd_a = svd.fit_transform(merchants)

tsne_a = tsne.fit_transform(svd_a)

merchants_df["x"] = tsne_a.T[0]
merchants_df["y"] = tsne_a.T[1]

alt.Chart(merchants_df).mark_point(opacity=0.2).encode(x="x:Q", y="y:Q", color="label")



There's still a lot of overlap between the classes here.  Fortunately, we know from the [exploratory analysis notebook](./01-eda.ipynb) that our numeric features contain a lot of information to help us distinguish between classes.  We'll see how to exploit that with models in the next notebook, but first, we need to preprocess these features.


# Encoding numeric features

For the numeric features, our preprocessing is a little easier.  We need to impute missing values for interarrival times (the interarrival time is undefined for the first transaction for each user, since there was no previous transaction) and we need to scale all numeric features so that they have comparable ranges.  We'll do this using the `Pipeline` facility from scikit-learn, filling in missing values with the median and scaling with `RobustScaler`, which is less sensitive to outliers than scaling by the minimum and maximum.


In [15]:
from sklearn.preprocessing import RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


impute_and_scale = Pipeline([('median_imputer', SimpleImputer(strategy="median")), ('interarrival_scaler', RobustScaler())])
ia_scaler = ('interarrival_scaler', impute_and_scale, ['interarrival'])
amount_scaler = ('amount_scaler', RobustScaler(), ['amount'])

scale_steps = [ia_scaler, amount_scaler]
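
To see what the imputer and scaler do to a column, here's a plain-Python sketch on a made-up list of interarrival times, with `None` standing in for the missing first value. `impute_median` and `robust_scale` are hypothetical helpers, and the sketch assumes `RobustScaler`'s default behaviour of centring on the median and dividing by the interquartile range.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    center = statistics.median(values)
    # The 'inclusive' method interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - center) / (q3 - q1) for v in values]

# Hypothetical interarrival times; the first transaction has none.
interarrival = [None, 2.0, 4.0, 3.0, 100.0, 1.0]

imputed = impute_median(interarrival)
scaled = robust_scale(imputed)
print(imputed)
print([round(v, 2) for v in scaled])
```

The extreme value of 100 stays extreme after scaling, but because the median and interquartile range are barely affected by it, the remaining values keep a sensible spread around zero.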

# Fit and save the feature extraction pipeline

Scikit-learn pipelines make it simple to piece together transformation steps. We state how we want each column from the original data set to be transformed, and define this to be one pipeline, which we can 'fit' to our training data. 

In [16]:
all_xforms = ColumnTransformer(transformers=(scale_steps + xform_steps))
feat_pipeline = Pipeline([
    ('feature_extraction',all_xforms)
])

feat_pipeline.fit(train)

Pipeline(steps=[('feature_extraction',
                 ColumnTransformer(transformers=[('interarrival_scaler',
                                                  Pipeline(steps=[('median_imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('interarrival_scaler',
                                                                   RobustScaler())]),
                                                  ['interarrival']),
                                                 ('amount_scaler',
                                                  RobustScaler(), ['amount']),
                                                 ('onehot',
                                                  OneHotEncoder(categories=[['online',
                                                                             'contactless',
                                                                       

We save the pipeline and will use it in the next notebook, where we go on to train a model on our transformed data.

In [17]:
import cloudpickle as cp

In [18]:
with open("feat_pipeline.pkl", "wb") as f:
    cp.dump(feat_pipeline, f)
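
The next notebook will need to read the pipeline back from disk. As a minimal sketch of that save-and-load round trip, here's the same pattern using the standard library's `pickle` on a small stand-in object, written to a temporary directory. The stand-in dict and the file name are hypothetical; for the real pipeline you would most likely use `cp.load` on `feat_pipeline.pkl` in the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted feature pipeline.
stand_in = {"steps": ["impute", "scale", "encode"]}

path = os.path.join(tempfile.mkdtemp(), "stand_in.pkl")

# Save: open in binary write mode and close the file when done.
with open(path, "wb") as f:
    pickle.dump(stand_in, f)

# Load: open in binary read mode to get an equivalent object back.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in)
```

Using `with` blocks ensures the file is flushed and closed before anything tries to read it.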

With your feature extraction pipeline saved, you can go on to the next notebook - [logistic regression](./03-model-logistic-regression.ipynb). 