# Feature Engineering



**df1 (customer_transaction_details):**


1. Drop the three ID columns (transactionId, orderId, paymentMethodId).
2. Drop the outlier row 189 (transactionAmount = 353).
3. Keep paymentMethodRegistrationFailure, paymentMethodType, paymentMethodProvider, transactionFailed, orderState and transactionAmount.


**df2 (customers_df):**


1. Add a column flagging whether the email address is valid.
2. Add a column flagging customers whose email address is duplicated.
3. Add edomain (the email domain), consolidating all domain names with frequency below 4 as "Other".
4. Add a country field derived from IPAddress, consolidating countries with frequency below 7 as "Other".
5. Keep the three numeric columns.
6. Drop the billing address, customer phone, device and IPAddress columns.
7. Drop rows with duplicate emails from the customer table, since modelling will be performed on transaction rows, not customer rows.
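
Steps 3 and 4 describe a frequency threshold. A minimal sketch of that rule follows, using a hypothetical helper `consolidate_rare` and toy data that are not part of the notebook. The notebook's own cells use `value_counts().head(n)` instead, which keeps the n most frequent values and matches a threshold only when exactly n values meet it.

```python
import pandas as pd


def consolidate_rare(series: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace values occurring fewer than `min_count` times with `other`.

    Hypothetical helper for illustration; not defined in the notebook.
    """
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # Keep values that meet the threshold, relabel everything else
    return series.where(series.isin(keep), other)


# Toy data (hypothetical, not taken from customers_df)
domains = pd.Series(["gmail.com"] * 5 + ["yahoo.com"] * 4 + ["rare.org"] * 2 + ["odd.net"])
consolidated = consolidate_rare(domains, min_count=4)
print(consolidated.value_counts())
```

With `min_count=4`, the two domains seen fewer than four times are merged into a single "Other" level.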


Join the dataframes on customerEmail, then drop customerEmail. Convert the categorical variables to one-hot encoding.


This leaves 13 variables, which expand to 36 columns after one-hot encoding.
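
To illustrate how the column count grows, here is a small sketch on a hypothetical frame (the column values are invented). With `drop_first=True`, a categorical column with k levels contributes k - 1 dummy columns. This sketch passes `columns=` to `pd.get_dummies`, which prefixes the new column names, whereas the notebook encodes each column separately.

```python
import pandas as pd

# Hypothetical frame: one numeric column and two categorical columns
toy = pd.DataFrame({
    "amount": [10, 25, 40, 15],
    "method": ["card", "paypal", "card", "bitcoin"],            # 3 levels
    "state": ["pending", "fulfilled", "failed", "fulfilled"],   # 3 levels
})

encoded = pd.get_dummies(toy, columns=["method", "state"], drop_first=True)

# 1 numeric column + (3 - 1) + (3 - 1) dummy columns = 5 columns
print(encoded.shape[1])
print(list(encoded.columns))
```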

In [2]:
import pandas as pd
import numpy as np
import re

# df1 (customer_transaction_details)

In [3]:
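# Load the transactions, using the first CSV column as the index
df1 = pd.read_csv("customer_transaction_details (1).csv", index_col=0)
# Drop the three ID columns
df1 = df1.drop(['transactionId', 'orderId', 'paymentMethodId'], axis=1)
# Drop the outlier row with index label 189 (transactionAmount = 353)
df1 = df1.drop(189, axis=0)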

# df2 (customers_df)

In [5]:
df2 = pd.read_csv("customers_df (1).csv",index_col=0)
# Raw string, so the regex engine rather than Python interprets the backslash escapes
regex = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
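# 1 if the whole address matches the pattern, else 0
df2['valid_email'] = df2['customerEmail'].apply(lambda x: int(bool(re.fullmatch(regex, x))))
# Flag every row whose email address occurs more than once
df2['dup_email'] = df2.duplicated(subset=['customerEmail'], keep=False)
# Extract the email domain, keep the 3 most frequent and label the rest "Other"
df2['edomain'] = df2['customerEmail'].apply(lambda x: x.split("@")[1])
edomainCounts = df2['edomain'].value_counts().head(3).index
df2['edomain'] = df2['edomain'].apply(lambda x: x if x in edomainCounts else "Other")
# Country per customer; the assignment aligns on the index, so ipcountry must share df2's index
ipcountry = pd.read_csv("ipcountry.csv")
df2['country'] = ipcountry['Country']
# Keep the 6 most frequent countries and label the rest "Other"
countryCounts = df2['country'].value_counts().head(6).index
df2['country'] = df2['country'].apply(lambda x: x if x in countryCounts else "Other")
df2 = df2.drop(['customerPhone', 'customerDevice', 'customerBillingAddress', 'customerIPAddress'], axis=1)
# Keep one row per email so the join does not multiply transaction rows
df2 = df2.drop_duplicates(subset='customerEmail', keep='first')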

# Concatenate

In [6]:
# Left join keeps every transaction row and attaches the matching customer features
fraudData = df1.merge(df2, how='left', on='customerEmail')
y = fraudData['Fraud']
X = fraudData.drop(['customerEmail', 'Fraud'], axis=1)
# One-hot encode each categorical column, dropping the first level of each
payment = pd.get_dummies(X['paymentMethodType'], drop_first=True)
provider = pd.get_dummies(X['paymentMethodProvider'], drop_first=True)
state = pd.get_dummies(X['orderState'], drop_first=True)
domain = pd.get_dummies(X['edomain'], drop_first=True)
country = pd.get_dummies(X['country'], drop_first=True)
# Replace the original categorical columns with their dummy columns
X = X.drop(['paymentMethodType', 'paymentMethodProvider', 'orderState', 'edomain', 'country'], axis=1)
X = pd.concat([X, payment, provider, state, domain, country], axis=1)
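
The left join above relies on each email appearing at most once in the customer table. A hedged sketch of how that assumption could be checked with the `validate` and `indicator` options of `DataFrame.merge` is shown below; the two small frames are hypothetical stand-ins for df1 and df2.

```python
import pandas as pd

# Hypothetical stand-ins for the transaction and customer tables
transactions = pd.DataFrame({
    "customerEmail": ["a@x.com", "a@x.com", "b@y.com", "c@z.com"],
    "transactionAmount": [20, 35, 18, 50],
})
customers = pd.DataFrame({
    "customerEmail": ["a@x.com", "b@y.com"],
    "Fraud": [True, False],
})

merged = transactions.merge(
    customers,
    how="left",
    on="customerEmail",
    validate="many_to_one",  # raises MergeError if an email repeats in `customers`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# A many-to-one left join must not change the number of transaction rows
assert len(merged) == len(transactions)

# Transactions with no matching customer end up with missing features
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["customerEmail"].tolist())
```

Rows reported as `left_only` would carry missing values for every customer feature, including the target, so they are worth inspecting before modelling.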

**Final balance of data: 41% fraud**

In [7]:
np.mean(y)

0.4115755627009646



---
# Model Building
Split the data into train and test sets shared by all models. Standardize the features using statistics fitted on X_train only, then apply the same transformation to both X_train and X_test.


In [8]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Stratified 70/30 split keeps the fraud share the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101, stratify=y)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
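
As a quick illustration of what fitting the scaler on the training set means, the sketch below uses synthetic data (all names ending in `_toy`, `_tr` or `_te` are hypothetical). The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels (hypothetical, not the fraud data)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)

# Statistics come from the training rows only
toy_scaler = StandardScaler().fit(X_tr)
X_tr_s = toy_scaler.transform(X_tr)
X_te_s = toy_scaler.transform(X_te)

print(np.allclose(X_tr_s.mean(axis=0), 0.0))  # training columns are centred
print(np.allclose(X_tr_s.std(axis=0), 1.0))   # and have unit variance
print(X_te_s.mean(axis=0))                    # test means are close to, not exactly, 0
```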

# 1. Logistic Regression

In [9]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train_scaled,y_train)

# 2. Random forests

In [10]:
from sklearn.ensemble import RandomForestClassifier
rfcmodel = RandomForestClassifier(n_estimators=100)
rfcmodel.fit(X_train_scaled, y_train)

# 3. Random forests tuned

In [11]:
search_grid = {'n_estimators': [int(x) for x in np.linspace(start = 50, stop = 3000, num = 20)]}
rfc = RandomForestClassifier()
rf_grid = GridSearchCV(estimator = rfc, param_grid = search_grid, cv = 5, verbose=0, n_jobs = -1,scoring="recall", return_train_score=True)
rf_grid.fit(X_train_scaled, y_train)

# 4. Gradient boosting

In [12]:
from sklearn.ensemble import GradientBoostingClassifier
grbmodel = GradientBoostingClassifier(n_estimators=1000, learning_rate=1.0,max_depth=1, random_state=0).fit(X_train_scaled, y_train)

# 5. KNN

In [13]:
from sklearn.neighbors import KNeighborsClassifier
knnmodel = KNeighborsClassifier(n_neighbors=5)
knnmodel.fit(X_train_scaled, y_train)

# 6. KNN Tuned

In [14]:
# 29 x 2 x 49 = 2,842 parameter combinations, each cross-validated on 10 folds
grid = dict(n_neighbors=list(range(1, 30)), p=[1, 2], leaf_size=list(range(1, 50)))
knn_2 = KNeighborsClassifier()
kn_grid = GridSearchCV(knn_2, grid, cv=10, scoring="recall")
kn_grid.fit(X_train_scaled, y_train)
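
One possible refinement, sketched here on synthetic data rather than the fraud set, is to put the scaler and the classifier in a single pipeline so that scaling is refitted inside each cross-validation fold. In the cell above the scaler had already seen all of X_train, so every validation fold influenced the scaling statistics (usually a small effect). The grid below also omits `leaf_size`, which according to the scikit-learn documentation affects the speed and memory of the tree-based search. All variable names here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (hypothetical stand-in)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)

# The scaler is refitted on the training part of every CV fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
param_grid = {
    "kneighborsclassifier__n_neighbors": list(range(1, 30, 2)),
    "kneighborsclassifier__p": [1, 2],
}

toy_grid = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
toy_grid.fit(X_tr, y_tr)
print(toy_grid.best_params_)
```

Because the pipeline receives unscaled data, `toy_grid.predict(X_te)` applies the scaling fitted on the training data automatically.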

# 7. SVM

In [15]:
from sklearn.svm import SVC
svmmodel = SVC()
svmmodel.fit(X_train_scaled, y_train)

# Model Evaluation

For a fraud detection model we want to flag even merely suspect transactions, so minimizing false negatives is of paramount importance. A false positive may delay a transaction and create a customer relations issue, but a false negative causes a direct financial loss. The key metric is therefore recall, the proportion of actual frauds that are detected. We will also monitor precision to ensure that the incidence of false positives is not excessive.
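
A small worked example may make the two metrics concrete. The labels below are hypothetical counts chosen for easy arithmetic, not results from the models in this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 10 frauds (True) followed by 10 legitimate transactions (False)
y_true = [True] * 10 + [False] * 10
# The model catches 8 of the 10 frauds and wrongly flags 1 legitimate transaction
y_pred = [True] * 8 + [False] * 2 + [True] * 1 + [False] * 9

# For binary labels sorted as [False, True], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)

# recall = tp / (tp + fn) = 8 / 10
print(recall_score(y_true, y_pred, pos_label=True))
# precision = tp / (tp + fp) = 8 / 9
print(precision_score(y_true, y_pred, pos_label=True))
```

The two missed frauds lower recall, while the one false alarm lowers precision.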

In [16]:
models = [logmodel, rfcmodel, rf_grid, grbmodel, knnmodel, kn_grid, svmmodel]
names = ['Logistic', 'Random Forest', 'RF tuned', 'Gradient boost', 'Knn', 'Knn tuned', 'SVM']
recall = []
precision = []
# Score every model on the held-out test set
for model in models:
    predictions = model.predict(X_test_scaled)
    score = classification_report(y_test, predictions, output_dict=True)
    # The report keys are the class labels as strings; 'True' is the fraud class
    recall.append(score['True']['recall'])
    precision.append(score['True']['precision'])
results = pd.DataFrame({'Model': names, 'Recall': recall, 'Precision': precision})
results

|   | Model | Recall | Precision |
|---|---|---|---|
| 0 | Logistic | 0.701299 | 0.857143 |
| 1 | Random Forest | 0.805195 | 0.984127 |
| 2 | RF tuned | 0.805195 | 1.0 |
| 3 | Gradient boost | 0.727273 | 0.888889 |
| 4 | Knn | 0.688312 | 0.84127 |
| 5 | Knn tuned | 0.831169 | 0.955224 |
| 6 | SVM | 0.74026 | 0.966102 |
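
The evaluation cell reads the fraud class from the report dictionary under the string key 'True', which depends on the labels being booleans. A hedged alternative, sketched below with a hypothetical `summarize` helper and a trivial `AlwaysFraud` baseline (neither is part of the notebook), calls `recall_score` and `precision_score` with an explicit `pos_label`. The baseline also shows why precision is worth monitoring: flagging everything gives perfect recall.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score


class AlwaysFraud:
    """Hypothetical baseline that flags every transaction as fraud."""

    def predict(self, X):
        return [True] * len(X)


def summarize(models, names, X, y, pos_label=True):
    """Build a recall/precision table; hypothetical helper for illustration."""
    rows = []
    for name, model in zip(names, models):
        pred = model.predict(X)
        rows.append({
            "Model": name,
            "Recall": recall_score(y, pred, pos_label=pos_label),
            "Precision": precision_score(y, pred, pos_label=pos_label, zero_division=0),
        })
    return pd.DataFrame(rows)


# Toy data (hypothetical): half of the four transactions are fraudulent
X_toy = [[0], [1], [2], [3]]
y_toy = [True, False, True, False]

baseline = summarize([AlwaysFraud()], ["Always fraud"], X_toy, y_toy)
print(baseline)
```

On this toy data the baseline reaches a recall of 1.0 with a precision equal to the fraud share, so a useful model has to beat it on precision.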


**The best recall is achieved by the tuned KNN model. Its tuned parameters are:**

In [17]:
kn_grid.best_estimator_.get_params()

{'algorithm': 'auto',
 'leaf_size': 1,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 1,
 'p': 1,
 'weights': 'uniform'}
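
The search selected n_neighbors = 1, a setting in which each prediction copies the label of the single closest training point. Before relying on it, it may be worth checking how close the other candidates scored. The sketch below shows one way to tabulate the cross-validation results of a fitted `GridSearchCV`; it runs on synthetic data with a reduced grid, and all names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data and a small grid (hypothetical stand-ins for the notebook's search)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
toy_grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
    scoring="recall",
)
toy_grid.fit(X_toy, y_toy)

# cv_results_ holds one entry per candidate; sort by rank to compare them
cv_table = (
    pd.DataFrame(toy_grid.cv_results_)[
        ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(cv_table)
```

If several candidates fall within one standard deviation of the best mean score, a larger n_neighbors may be a more stable choice.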

# Conclusion
Using a tuned KNN model, we identified **83%** of the fraud cases in the held-out test set, with a precision of 96%.