# Fraud Prediction Using Skewed Data

### How do we achieve good predictive power in the statistical model? Can sampling techniques help? Let's find out!


## 1. Setup
To prepare your environment, you need to install some packages.

### 1.1 Install the necessary packages

Install `pandas_profiling`, which is used later to profile the dataset. The other packages imported below are assumed to be installed already:

In [None]:
!pip install pandas_profiling

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp
%matplotlib inline

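## 2. Read the data and convert it into a DataFrame
In the empty cell below, click Insert to code and then select Insert pandas DataFrame. This generates code that loads the data into a DataFrame (named `df_data_3` in this notebook). The first rows of the data look like this: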

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Term,Credit_History_Available,Housing,Locality,Fraud_Risk
0,1,0,0,1,0,5849,0,146,360,1,1,1,0
1,1,1,1,1,1,4583,1508,128,360,1,1,3,1
2,1,1,0,1,1,3000,0,66,360,1,1,1,1
3,1,1,0,0,1,2583,2358,120,360,1,1,1,1
4,1,0,0,1,0,6000,0,141,360,1,1,1,0
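
If the Insert to code helper is not available, a DataFrame with the same layout could be built with `pd.read_csv`. The sketch below is only an illustration: it reads the five preview rows shown above from an in-memory string, and treating the unnamed first column as the row index is an assumption. In practice you would pass the path of your own CSV file instead of `csv_text`.

```python
import io

import pandas as pd

# The five preview rows shown above, held in memory so the sketch is self-contained.
# In practice, pass the path of your own CSV file to pd.read_csv instead.
csv_text = """Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Term,Credit_History_Available,Housing,Locality,Fraud_Risk
0,1,0,0,1,0,5849,0,146,360,1,1,1,0
1,1,1,1,1,1,4583,1508,128,360,1,1,3,1
2,1,1,0,1,1,3000,0,66,360,1,1,1,1
3,1,1,0,0,1,2583,2358,120,360,1,1,1,1
4,1,0,0,1,0,6000,0,141,360,1,1,1,0
"""

# index_col=0 turns the unnamed first column into the row index (an assumption).
df_data_3 = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_data_3.shape)
print(df_data_3.head())
```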


## 3. Assign a new name for the dataframe

In [3]:
'''Rename the dataframe to df'''

df = df_data_3

## 4. Descriptive statistics on the data

In [4]:
print("ALL")
print(df.describe())

ALL
           Gender     Married  Dependents   Education  Self_Employed  \
count  827.000000  827.000000  827.000000  827.000000     827.000000   
mean     0.733978    0.481258    0.652963    0.790810       0.574365   
std      0.442143    0.499951    0.935835    0.406976       0.494738   
min      0.000000    0.000000    0.000000    0.000000       0.000000   
25%      0.000000    0.000000    0.000000    1.000000       0.000000   
50%      1.000000    0.000000    0.000000    1.000000       1.000000   
75%      1.000000    1.000000    1.000000    1.000000       1.000000   
max      1.000000    1.000000    3.000000    1.000000       1.000000   

       ApplicantIncome  CoapplicantIncome  LoanAmount   Loan_Term  \
count       827.000000         827.000000  827.000000  827.000000   
mean       5212.970979        1486.050786  140.892382  338.128174   
std        5593.713304        2802.847983   79.820451   75.353151   
min         150.000000           0.000000    9.000000   12.000000   
25

## 5. Analyze using Pandas Profiling

In [5]:
pp.ProfileReport(df)


## 7. Split the data into train & test data sets using 70:30 mix
#### The model is built on the training data and evaluated on the test data

In [6]:
# Split The Data with all variables

from sklearn.model_selection import train_test_split

x = df[['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Term', 'Credit_History_Available', 'Housing',
       'Locality']]
y = df['Fraud_Risk']

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.30, random_state=0)
print('xtrain shape')
print(xtrain.shape)
print('xtest shape')
print(xtest.shape)

xtrain shape
(578, 12)
xtest shape
(249, 12)
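
When the target classes are skewed, a plain random split can leave the two sets with different class ratios. A minimal sketch of a stratified split follows; the toy data (100 rows, 20% positives) and the variable names are assumptions for illustration, not the loan dataset used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in data (an assumption, not the loan dataset): 100 rows, 20% positives.
rng = np.random.RandomState(0)
x_toy = pd.DataFrame({"feature": rng.normal(size=100)})
y_toy = pd.Series([1] * 20 + [0] * 80, name="Fraud_Risk")

# stratify=y_toy keeps the class ratio the same in the train and test splits.
xtr, xte, ytr, yte = train_test_split(
    x_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy
)

# Class proportions overall, then the share of positives in each split.
print(y_toy.value_counts(normalize=True))
print("train positives:", ytr.mean(), "test positives:", yte.mean())
```

Without `stratify`, the share of positives in each split depends on the random draw, which matters more the rarer the positive class is.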


## 8. Use Random Forest Algorithm
#### A brief about Random Forest Algorithm
A random forest classifier builds a set of decision trees, each trained on a randomly selected subset of the training set, and aggregates the votes of the individual trees to decide the final class of a test object. Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks: they construct a multitude of decision trees at training time and output the mode of the individual trees' classes (classification) or the mean of their predictions (regression). Because the errors of the individual trees are averaged out, a random forest is less prone to overfitting than a single decision tree, which usually improves accuracy on new data. This is a bagging-based algorithm: it reduces overfitting by combining many trees into one stronger learner that generates more accurate predictions.
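
The bagging-and-voting idea can be sketched by hand with plain decision trees. This is an illustration on synthetic data, not the implementation used by scikit-learn: `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, and the tree count and data sizes here are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.RandomState(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.randint(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng.randint(10**6))
    trees.append(tree.fit(X[idx], y[idx]))

# Every tree votes for a class; the majority class wins (25 trees, so no ties).
votes = np.stack([t.predict(X) for t in trees])      # shape (25, 300)
majority = (votes.mean(axis=0) > 0.5).astype(int)    # binary labels 0/1

print("training accuracy of the vote:", (majority == y).mean())
```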

In [7]:
# RF Classifier explained

'''rf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=4)
n_estimators : integer, optional (default=10)
The number of trees in the forest.
oob_score : bool (default=False)
Whether to use out-of-bag samples to estimate the generalization accuracy.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.
'''

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

rfmodel = RandomForestClassifier()
rfmodel.fit(xtrain,ytrain)
print('model')
print(rfmodel)

ypredrf = rfmodel.predict(xtest)
print('confusion matrix')
print(metrics.confusion_matrix(ytest, ypredrf))
print('classification report')
print(metrics.classification_report(ytest, ypredrf))
print('Accuracy : %f' % (metrics.accuracy_score(ytest, ypredrf)))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, ypredrf)))

model
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
confusion matrix
[[106   4]
 [ 16 123]]
classification report
              precision    recall  f1-score   support

           0       0.87      0.96      0.91       110
           1       0.97      0.88      0.92       139

   micro avg       0.92      0.92      0.92       249
   macro avg       0.92      0.92      0.92       249
weighted avg       0.92      0.92      0.92       249

Accuracy : 0.919679
Area under the curve : 0.924264




#### In the classification report above, the F1-score for the fraud class (1) is 0.92. The F1-score is the harmonic mean of precision and recall.
##### Recall is the share of actual frauds that the model finds: here it identified 88% of the frauds in the test set. Precision is the share of predicted frauds that really are frauds: 97% of the cases the model flagged as fraud were frauds.
##### Area under the curve (values between 0 and 1) measures how well the model separates the two classes: 0.5 corresponds to random guessing, and scores close to 1 indicate high predictive power.
# <font color='Red'> Note: The numbers for recall, precision, F1 score and area under the curve can change between runs of the model because of the stochastic nature of the algorithms.</font>
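
As a worked check, the reported metrics can be recomputed by hand from the Random Forest confusion matrix printed above. The variable names are illustrative. The last line relies on the fact that `roc_auc_score`, when given hard 0/1 predictions as in this notebook, reduces to the mean of recall and specificity.

```python
# Confusion matrix from the Random Forest run above:
# rows are actual classes, columns are predicted classes.
tn, fp = 106, 4
fn, tp = 16, 123

precision = tp / (tp + fp)          # flagged cases that really are frauds
recall = tp / (tp + fn)             # actual frauds that were found
specificity = tn / (tn + fp)        # non-frauds correctly left alone
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# With hard 0/1 predictions, the ROC curve has a single interior point,
# so the area under it is the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.6f} auc={auc_from_labels:.6f}")
```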

## 9. Use Gradient Boosting Algorithm
#### A Brief about Gradient Boosting Algorithm
Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, as other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function. (A loss function, or cost function, maps an event or the values of one or more variables onto a real number that intuitively represents some "cost" associated with the event.) Boosting is an ensemble technique that combines weak learners into a strong learner that can make accurate predictions.
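
The stage-wise idea can be sketched for the simplest case, squared-error loss on a regression task, where the negative gradient is just the residual. This is a hand-rolled illustration on synthetic data with arbitrary settings (100 stages, depth-2 trees), not the `GradientBoostingClassifier` implementation used below.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())      # stage 0: a constant model
initial_mse = np.mean((y - prediction) ** 2)

for _ in range(100):
    # For squared-error loss the negative gradient is the residual.
    residuals = y - prediction
    # Fit a weak learner (a shallow tree) to the residuals.
    weak = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a shrunken copy of its output to the running prediction.
    prediction += learning_rate * weak.predict(X)

final_mse = np.mean((y - prediction) ** 2)
print("MSE before boosting:", initial_mse)
print("MSE after boosting: ", final_mse)
```

Lowering `learning_rate` makes each stage contribute less, so more stages are needed; this is the trade-off between `learning_rate` and `n_estimators` noted in the parameter description below.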

In [8]:
# GBM Classifier explained

'''params = {'n_estimators': 500, 'max_depth': 3, 'subsample': 0.5, 'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)
n_estimators : int (default=100)
The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.
max_depth: integer, optional (default=3)
maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. The best value depends on the interaction of the input variables.
subsample: float, optional (default=1.0)
The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
learning_rate : float, optional (default=0.1)
learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node.
'''

from sklearn import ensemble

params = {'n_estimators': 500, 'max_depth': 3, 'subsample': 0.5,
          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(xtrain, ytrain) #trains
y_pred = clf.predict(xtest)  #predicts
print('confusion matrix')
print(metrics.confusion_matrix(ytest, y_pred))
print('classification report')
print(metrics.classification_report(ytest, y_pred))
print("-----------------------------------------------------------------------------------------")
print("Accuracy is :")
print(metrics.accuracy_score(ytest, y_pred))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, y_pred)))

confusion matrix
[[109   1]
 [ 12 127]]
classification report
              precision    recall  f1-score   support

           0       0.90      0.99      0.94       110
           1       0.99      0.91      0.95       139

   micro avg       0.95      0.95      0.95       249
   macro avg       0.95      0.95      0.95       249
weighted avg       0.95      0.95      0.95       249

-----------------------------------------------------------------------------------------
Accuracy is :
0.9477911646586346
Area under the curve : 0.952289
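
The cells above pass hard class labels from `predict` to `roc_auc_score`, which evaluates the ROC curve at a single threshold. Passing the predicted probability of the positive class from `predict_proba` uses every threshold. A minimal sketch on synthetic data follows; the data and model settings are assumptions, so the printed numbers will not match the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: only the default 0.5 threshold is evaluated.
auc_labels = roc_auc_score(y_te, model.predict(X_te))
# AUC from positive-class probabilities: all thresholds are evaluated.
auc_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print("AUC from labels:       ", auc_labels)
print("AUC from probabilities:", auc_scores)
```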


## 10. Use Extreme Gradient Boosting Algorithm
#### A Brief about Extreme Gradient Boosting Algorithm
XGBoost is an implementation of the gradient boosting concept. What sets it apart is that it uses “a more regularized model formalization to control over-fitting, which gives it better performance,” according to the author of the algorithm, Tianqi Chen. This regularization helps to reduce overfitting.
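
The "more regularized model formalization" can be written out as the objective that XGBoost minimizes, as given in the XGBoost paper. The symbols are introduced here for illustration: $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves and $w_j$ are its leaf weights.

$$
\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
$$

The penalties $\gamma$ and $\lambda$ correspond to the `gamma` and `reg_lambda` entries in the default parameters printed in this section; `reg_alpha` adds an analogous L1 penalty on the leaf weights.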

In [9]:
from xgboost.sklearn import XGBClassifier

# Create the XGB classifier, xgb_model.
xgb_model = XGBClassifier()

In [10]:
# List the default parameters.
print(xgb_model.get_xgb_params())

{'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 100, 'nthread': 1, 'objective': 'binary:logistic', 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': 0, 'silent': 1, 'subsample': 1}


In [11]:
# Train and evaluate.
xgb_model.fit(xtrain, ytrain, eval_metric=['error'], eval_set=[(xtrain, ytrain), (xtest, ytest)])

[0]	validation_0-error:0.069204	validation_1-error:0.084337
[1]	validation_0-error:0.069204	validation_1-error:0.084337
[2]	validation_0-error:0.069204	validation_1-error:0.084337
[3]	validation_0-error:0.069204	validation_1-error:0.084337
[4]	validation_0-error:0.072664	validation_1-error:0.076305
[5]	validation_0-error:0.072664	validation_1-error:0.076305
[6]	validation_0-error:0.070934	validation_1-error:0.076305
[7]	validation_0-error:0.070934	validation_1-error:0.076305
[8]	validation_0-error:0.070934	validation_1-error:0.076305
[9]	validation_0-error:0.070934	validation_1-error:0.076305
[10]	validation_0-error:0.065744	validation_1-error:0.068273
[11]	validation_0-error:0.070934	validation_1-error:0.076305
[12]	validation_0-error:0.065744	validation_1-error:0.068273
[13]	validation_0-error:0.065744	validation_1-error:0.068273
[14]	validation_0-error:0.065744	validation_1-error:0.068273
[15]	validation_0-error:0.065744	validation_1-error:0.068273
[16]	validation_0-error:0.065744	v

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [12]:
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")

y_pred = xgb_model.predict(xtest)  #predicts
print('confusion matrix')
print(metrics.confusion_matrix(ytest, y_pred))
print('classification report')
print(metrics.classification_report(ytest, y_pred))
print("-----------------------------------------------------------------------------------------")
print("Accuracy is :")
print(metrics.accuracy_score(ytest, y_pred))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, y_pred)))

confusion matrix
[[107   3]
 [ 13 126]]
classification report
              precision    recall  f1-score   support

           0       0.89      0.97      0.93       110
           1       0.98      0.91      0.94       139

   micro avg       0.94      0.94      0.94       249
   macro avg       0.93      0.94      0.94       249
weighted avg       0.94      0.94      0.94       249

-----------------------------------------------------------------------------------------
Accuracy is :
0.9357429718875502
Area under the curve : 0.939601


#### The F1 score (0.94) and area under the curve (0.94) are higher than for the Random Forest model (0.92 for both) but slightly lower than for the Gradient Boosting model (0.95 for both).

## 11. Select a few variables for model building
#### This is done to check whether accuracy improves with fewer variables

In [13]:
# Split The Data with few variables

from sklearn.model_selection import train_test_split

x = df[['Married', 'ApplicantIncome','CoapplicantIncome']]
y = df['Fraud_Risk']

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.30, random_state=0)
print('xtrain shape')
print(xtrain.shape)
print('xtest shape')
print(xtest.shape)

xtrain shape
(578, 3)
xtest shape
(249, 3)
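
Instead of picking variables by hand, one option is to rank them by a fitted model's feature importances and keep the top few. The sketch below is an illustration on synthetic data; the column names `feature_0` to `feature_5` are hypothetical and the choice of three variables simply mirrors the split above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names (an assumption for illustration).
X_arr, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"feature_{i}" for i in range(6)]
X_toy = pd.DataFrame(X_arr, columns=cols)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances sum to 1; larger means the feature was more useful.
importances = pd.Series(forest.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
top3 = importances.index[:3].tolist()

print(importances)
print("variables to keep:", top3)
```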


In [14]:
'''Random Forest Classifier on reduced dimensions data'''

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

rfmodel = RandomForestClassifier()
rfmodel.fit(xtrain,ytrain)
print('model')
print(rfmodel)

ypredrf = rfmodel.predict(xtest)
print('confusion matrix')
print(metrics.confusion_matrix(ytest, ypredrf))
print('classification report')
print(metrics.classification_report(ytest, ypredrf))
print('Accuracy : %f' % (metrics.accuracy_score(ytest, ypredrf)))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, ypredrf)))

model
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
confusion matrix
[[102   8]
 [ 21 118]]
classification report
              precision    recall  f1-score   support

           0       0.83      0.93      0.88       110
           1       0.94      0.85      0.89       139

   micro avg       0.88      0.88      0.88       249
   macro avg       0.88      0.89      0.88       249
weighted avg       0.89      0.88      0.88       249

Accuracy : 0.883534
Area under the curve : 0.888097


#### With fewer variables, both the F1 score and the area under the curve have dropped (F1 for the fraud class from 0.92 to 0.89, area under the curve from 0.92 to 0.89).

In [15]:
'''Gradient Boost Algorithm on reduced dimensions data'''

from sklearn import ensemble

params = {'n_estimators': 500, 'max_depth': 3, 'subsample': 0.5,
          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(xtrain, ytrain) #trains
y_pred = clf.predict(xtest)  #predicts
print('confusion matrix')
print(metrics.confusion_matrix(ytest, y_pred))
print('classification report')
print(metrics.classification_report(ytest, y_pred))
print("-----------------------------------------------------------------------------------------")
print("Accuracy is :")
print(metrics.accuracy_score(ytest, y_pred))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, y_pred)))

confusion matrix
[[109   1]
 [ 22 117]]
classification report
              precision    recall  f1-score   support

           0       0.83      0.99      0.90       110
           1       0.99      0.84      0.91       139

   micro avg       0.91      0.91      0.91       249
   macro avg       0.91      0.92      0.91       249
weighted avg       0.92      0.91      0.91       249

-----------------------------------------------------------------------------------------
Accuracy is :
0.9076305220883534
Area under the curve : 0.916318


#### The F1 score has dropped with fewer variables (from 0.95 to 0.91 for the fraud class).

In [16]:
'''Extreme Gradient Boost Algorithm on reduced dimensions data'''

from xgboost.sklearn import XGBClassifier

# Create the XGB classifier, xgb_model.
xgb_model = XGBClassifier()
# List the default parameters.
print(xgb_model.get_xgb_params())

{'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 100, 'nthread': 1, 'objective': 'binary:logistic', 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': 0, 'silent': 1, 'subsample': 1}


In [17]:
# Train and evaluate.
xgb_model.fit(xtrain, ytrain, eval_metric=['error'], eval_set=[(xtrain, ytrain), (xtest, ytest)])

[0]	validation_0-error:0.088235	validation_1-error:0.104418
[1]	validation_0-error:0.088235	validation_1-error:0.104418
[2]	validation_0-error:0.088235	validation_1-error:0.104418
[3]	validation_0-error:0.088235	validation_1-error:0.104418
[4]	validation_0-error:0.088235	validation_1-error:0.104418
[5]	validation_0-error:0.088235	validation_1-error:0.104418
[6]	validation_0-error:0.088235	validation_1-error:0.104418
[7]	validation_0-error:0.088235	validation_1-error:0.104418
[8]	validation_0-error:0.088235	validation_1-error:0.104418
[9]	validation_0-error:0.088235	validation_1-error:0.104418
[10]	validation_0-error:0.088235	validation_1-error:0.104418
[11]	validation_0-error:0.088235	validation_1-error:0.104418
[12]	validation_0-error:0.088235	validation_1-error:0.104418
[13]	validation_0-error:0.088235	validation_1-error:0.104418
[14]	validation_0-error:0.088235	validation_1-error:0.104418
[15]	validation_0-error:0.088235	validation_1-error:0.104418
[16]	validation_0-error:0.088235	v

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [18]:
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")

y_pred = xgb_model.predict(xtest)  #predicts
print('confusion matrix')
print(metrics.confusion_matrix(ytest, y_pred))
print('classification report')
print(metrics.classification_report(ytest, y_pred))
print("-----------------------------------------------------------------------------------------")
print("Accuracy is :")
print(metrics.accuracy_score(ytest, y_pred))
print('Area under the curve : %f' % (metrics.roc_auc_score(ytest, y_pred)))

confusion matrix
[[107   3]
 [ 21 118]]
classification report
              precision    recall  f1-score   support

           0       0.84      0.97      0.90       110
           1       0.98      0.85      0.91       139

   micro avg       0.90      0.90      0.90       249
   macro avg       0.91      0.91      0.90       249
weighted avg       0.91      0.90      0.90       249

-----------------------------------------------------------------------------------------
Accuracy is :
0.9036144578313253
Area under the curve : 0.910824


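#### The F1 score for the fraud class (0.91) is about the same as for the Gradient Boosting model, while accuracy (0.904 vs 0.908) and area under the curve (0.911 vs 0.916) are slightly lower.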

##### Techniques such as hyperparameter tuning can be tried to improve the accuracy of the models.
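
One common way to tune hyperparameters is an exhaustive grid search with cross-validation. The sketch below is a minimal illustration on synthetic data; the grid values, the F1 scoring choice and the 3-fold setting are assumptions, not recommendations for the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A deliberately small grid so the search finishes quickly.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# Each combination is scored by 3-fold cross-validated F1 on the training data.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated F1:", search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all the data passed to `fit`, which can then be evaluated once on a held-out test set.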

### Generate predictions on new data (replace the values below with your own)

In [None]:
new_data = [{'Married': 1, 'ApplicantIncome': 3847, 'CoapplicantIncome':0}] 
  
# Create a DataFrame 
test_df = pd.DataFrame(new_data)

### Predictions using new data

In [None]:
new_pred = xgb_model.predict(test_df)

### Print the predicted result

In [None]:
print(new_pred)