# Random Forest model for Healthcare Fraud Detection
### Deborah Leong, Sam Nuzbrokh and Doug Devens

This notebook describes the development of a random forest classification model to detect potentially fraudulent healthcare providers.

Import the pandas package and the scikit-learn metrics reports.  Read in the file created by provider_inout_mods.  Also read in the mean residual by provider, which is created by the notebook 'Provider_claim_data_regression' and described elsewhere.

In [1]:
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
x_train_inout_mod = pd.read_csv('x_train_inout_mod.csv')
provider_reimb_residuals=pd.read_csv('provider_groups_residual.csv')
provider_reimb_residuals =provider_reimb_residuals.drop(columns=['Unnamed: 0'])
provider_reimb_residuals.columns


Index(['Provider', 'MeanResidualReimbursement', 'logOPAnnualReimbursement',
       'logIPAnnualReimbursement', '$PerClaimDay', 'InscClaimAmtReimbursed'],
      dtype='object')

Rename the columns of provider_reimb_residuals, so that 'InscClaimAmtReimbursed' becomes 'total_claim', and drop the extra index column brought into x_train_inout_mod by the file read-in.  Merge the two tables on 'Provider' to combine the data.

In [2]:
provider_reimb_residuals.columns=['Provider','MeanResidualReimbursement','logOPAnnualReimbursement',\
                'logIPAnnualReimbursement','$PerClaimDay','total_claim']
x_train_inout_mod = x_train_inout_mod.drop(columns = 'Unnamed: 0')
x_train_inout_mod = pd.merge(x_train_inout_mod,provider_reimb_residuals,on='Provider')

We also bring in data from the market basket analysis, which found a higher fraction of diabetes and ischemic heart patients for fraudulent providers.  We include that as a feature in this model.

In [3]:
diabetes_frac = pd.read_csv('diabetes_frac.csv')
diabetes_frac.columns

Index(['Unnamed: 0', 'Provider', 'PctDiabFrac'], dtype='object')

In [4]:
diabetes_frac = diabetes_frac.drop(columns = ['Unnamed: 0'])
x_train_inout_mod = pd.merge(x_train_inout_mod,diabetes_frac, on='Provider')
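
pd.merge defaults to an inner join, so a provider missing from either table would be dropped without warning. The following is a minimal sketch of one way to check for that, using toy frames with made-up values (the frame names `features` and `extra` are illustrative and not part of this notebook):

```python
import pandas as pd

# Toy stand-ins for the provider-level tables (values are made up).
features = pd.DataFrame({'Provider': ['PRV1', 'PRV2', 'PRV3'],
                         'ClaimDays_in': [10, 4, 7]})
extra = pd.DataFrame({'Provider': ['PRV1', 'PRV2'],
                      'PctDiabFrac': [0.31, 0.28]})

# validate raises a MergeError if 'Provider' is duplicated on either side;
# indicator adds a '_merge' column recording where each row was found.
checked = pd.merge(features, extra, on='Provider', how='left',
                   validate='one_to_one', indicator=True)

# Providers that an inner merge would have dropped silently.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Provider'].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that the inner merges used here lose no providers.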

Confirm all columns are numeric, except 'Provider'.

In [25]:
import numpy as np
x_train_inout_mod.select_dtypes(exclude='number').columns


Index(['Provider'], dtype='object')

Move the 'PotentialFraud' label data to the target array.  Drop it from the features matrix in a later cell.

In [26]:
x_train_inout_mod.columns

Index(['Age_in', 'Age_out', 'AttendingPhysician_in', 'AttendingPhysician_out',
       'ClaimDays_in', 'ClaimDays_out', 'DeductibleAmtPaid_in',
       'DeductibleAmtPaid_out', 'Gender_in', 'Gender_out',
       'InscClaimAmtReimbursed_in', 'InscClaimAmtReimbursed_out',
       'NumChronics_in', 'NumChronics_out', 'NumDiag_in', 'NumDiag_out',
       'NumProc_in', 'NumProc_out', 'State_in', 'State_out', 'WhetherDead_in',
       'WhetherDead_out', 'ClaimDays_in_Range', 'ClaimDays_out_Range',
       'InscClaimAmtReimbursed_in_Range', 'InscClaimAmtReimbursed_out_Range',
       'NumChronics_in_Range', 'NumChronics_out_Range', 'NumDiag_in_Range',
       'NumDiag_out_Range', 'NumProc_in_Range', 'NumProc_out_Range',
       'Provider', 'PotentialFraud', 'docDegMax', 'docBtwnMean', 'docEignMean',
       'docMANN', 'patDegMax', 'patBtwnMean', 'patEignMean', 'patMANN',
       'ClmsPerPhysician_in', 'ClmsPerPhysician_out', 'ClmsPerPatient_in',
       'ClmsPerPatient_out', 'DrPerPatient_in', 'DrPerPatie

We create the target (response) array and confirm that the number of fraudulent providers in the training dataset is unchanged after the merges (506 of 5,410 providers).

In [27]:
y = x_train_inout_mod['PotentialFraud']
y.value_counts()


0    4904
1     506
Name: PotentialFraud, dtype: int64

We drop the 'PotentialFraud' column since it is the target column.

In [28]:
X = x_train_inout_mod.drop(columns = ['PotentialFraud'])


Import model_selection from sklearn and use train_test_split to split the matrices into training and test sets for validation.

In [29]:
from sklearn import model_selection as ms
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, 
                                            test_size=0.20, random_state=42)
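
With roughly 9% of providers labeled as fraudulent, a plain random split can leave slightly different fraud rates in the train and test sets. As an illustration only, here is a sketch of a stratified split on synthetic labels that mimic the class counts reported above; `X_demo` and `y_demo` are placeholders, not the notebook's data, and the notebook itself does not stratify:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported above (506 fraud, 4904 not).
y_demo = np.array([1] * 506 + [0] * 4904)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify keeps the fraud fraction nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

print(y_tr.mean(), y_te.mean())
```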


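Set aside provider_claim matrices (the 'Provider' and 'total_claim' columns) so they can be merged back later for the cost calculations.  Min-max scale the train and test columns using the column minima and maxima of the original X matrix, which contains both the train and test rows.  The model is fit on these scaled features.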

In [30]:
X = X.drop(columns=['Provider','total_claim'])

provider_claim_trn=X_train[['Provider','total_claim']]
X_train=X_train.drop(columns=['Provider','total_claim'])
X_train=(X_train-X.min(axis=0))/(X.max(axis=0)-X.min(axis=0))
print(X_train.shape)

provider_claim_test=X_test[['Provider','total_claim']]
X_test=X_test.drop(columns=['Provider','total_claim'])
X_test=(X_test-X.min(axis=0))/(X.max(axis=0)-X.min(axis=0))
print(X_test.shape)


(4328, 95)
(1082, 95)
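
The scaling above takes its minima and maxima from the full X matrix, so the test rows influence the scaling of the training rows. A common alternative is to learn the scaling from the training rows only. The sketch below shows that variant with scikit-learn's MinMaxScaler on toy data; it is not what this notebook does, and the values are made up:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy train/test frames; column names borrowed from the notebook, values invented.
train = pd.DataFrame({'ClaimDays_in': [1.0, 5.0, 9.0], 'Age_in': [60.0, 70.0, 80.0]})
test = pd.DataFrame({'ClaimDays_in': [3.0, 11.0], 'Age_in': [65.0, 75.0]})

# Learn min and max from the training rows only, then apply to both sets.
scaler = MinMaxScaler().fit(train)
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

# Test values outside the training range fall outside [0, 1].
print(test_scaled)
```

Because tree splits depend only on the ordering of feature values, a random forest should be largely insensitive to this choice; it matters more if the same scaled matrices are reused for other model types.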


Fill any remaining NAs in the test set with 0 and confirm there are none left.

In [32]:
X_test=X_test.fillna(0)
c = np.sum(X_test.isnull())
c[c>0]


Series([], dtype: int64)

Import the ensemble module and create an instance of the random forest model.  Run the first instance.  Previous trials with weighting showed that setting class_weight to 'balanced' and then slightly underweighting the fraud class with the sample_weight option in the fit gave better results.

In [33]:
# rfparams_dict = {}
from sklearn import ensemble
randomForest = ensemble.RandomForestClassifier()
randomForest.set_params(class_weight = 'balanced',random_state=42, n_estimators=110, max_features=15, \
            min_samples_leaf = 12, min_samples_split=3,criterion='gini',oob_score=True)
sample_weight = np.array([1 if x==0 else 0.9 for x in y_train])
randomForest.fit(X_train, y_train,sample_weight=sample_weight) # fit 
print(confusion_matrix(y_test, randomForest.predict(X_test)))
print(classification_report(y_test, randomForest.predict(X_test)))


[[916  61]
 [ 24  81]]
              precision    recall  f1-score   support

           0       0.97      0.94      0.96       977
           1       0.57      0.77      0.66       105

    accuracy                           0.92      1082
   macro avg       0.77      0.85      0.81      1082
weighted avg       0.94      0.92      0.93      1082
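
To make the weighting concrete, the sketch below reproduces the 'balanced' class weights for the training-set class counts (3927 non-fraud, 401 fraud) and then applies the 0.9 multiplier passed through sample_weight. It relies on the scikit-learn documentation's statement that class weights are multiplied with sample_weight; `y_demo` is a synthetic stand-in for y_train:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the training-set class counts (3927 and 401).
y_demo = np.array([0] * 3927 + [1] * 401)

# 'balanced' weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_demo)
manual = len(y_demo) / (2 * np.bincount(y_demo))

# Extra per-sample multiplier used in the fit: 1 for class 0, 0.9 for class 1.
effective = weights * np.array([1.0, 0.9])
print(weights, effective)
```

Under these assumptions the fraud class still carries far more weight per sample than the non-fraud class; the 0.9 factor only trims it by 10%.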



In [34]:
randomForest

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features=15,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=12, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, n_estimators=110,
                       n_jobs=None, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

Run a cross-validation grid search to optimize parameter settings.  

In [42]:
%%time 
from sklearn.model_selection import GridSearchCV
grid_para_forest = [{
    "n_estimators": range(80,151,25),
    "criterion": ["gini",'entropy'],
    "min_samples_leaf": range(12,31,5),
    "min_samples_split": range(2,9,2),
    "random_state": [42],
    'max_features':range(8,21,4)}]
grid_search_forest = GridSearchCV(randomForest, grid_para_forest, scoring='f1_weighted', cv=5, n_jobs=3)
grid_search_forest.fit(X_train, y_train)
bst_prm = grid_search_forest.best_params_
randomForest.set_params(class_weight = 'balanced',min_samples_split=bst_prm['min_samples_split'],random_state=42, 
                        n_estimators=bst_prm['n_estimators'], max_features=bst_prm['max_features'], \
                        criterion = bst_prm['criterion'], min_samples_leaf = bst_prm['min_samples_leaf'])
randomForest.fit(X_train, y_train,sample_weight=None)
print(confusion_matrix(y_test, randomForest.predict(X_test)))
print(classification_report(y_test, randomForest.predict(X_test)))

[[916  61]
 [ 27  78]]



We print the set of best parameters and compare their performance against the prior 'naive' model.  The F1 score for the fraud class has dropped slightly, from 0.66 to 0.64.  The parameter selection has also tended toward overfitting, with the smallest available values of samples per leaf and samples per split chosen.  We notice the grid search chose the entropy split criterion.

In [44]:
print(bst_prm)
print(confusion_matrix(y_test, randomForest.predict(X_test)))
print(classification_report(y_test, randomForest.predict(X_test)))

{'criterion': 'entropy', 'max_features': 20, 'min_samples_leaf': 12, 'min_samples_split': 2, 'n_estimators': 130, 'random_state': 42}
[[916  61]
 [ 27  78]]
              precision    recall  f1-score   support

           0       0.97      0.94      0.95       977
           1       0.56      0.74      0.64       105

    accuracy                           0.92      1082
   macro avg       0.77      0.84      0.80      1082
weighted avg       0.93      0.92      0.92      1082
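
For a sense of the size of the search above, the sketch below counts the parameter combinations in the same grid with scikit-learn's ParameterGrid (the ranges are written out as lists; `n_candidates` and `n_fits` are illustrative names):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the search above, with the ranges expanded into lists.
grid = {
    "n_estimators": list(range(80, 151, 25)),     # 80, 105, 130
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": list(range(12, 31, 5)),   # 12, 17, 22, 27
    "min_samples_split": list(range(2, 9, 2)),    # 2, 4, 6, 8
    "random_state": [42],
    "max_features": list(range(8, 21, 4)),        # 8, 12, 16, 20
}

n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5  # cv=5 folds per candidate, before the final refit
print(n_candidates, n_fits)
```

Note that the chosen min_samples_leaf (12) and min_samples_split (2) sit on the lower edge of their ranges and max_features (20) on the upper edge, so the search could not have explored values beyond them.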



We choose to stay with the original parameters (e.g. the 'gini' split criterion instead of entropy).  The performance of a random forest also depends on the random number generator.  To capture that run-to-run variation, we fit the model for several values of the random state and save the F1 score, the confusion matrix, and a dataframe of labeled feature importances for each iteration.  This gives a more representative view of both performance and feature importances.

In [35]:
sample_weight = np.array([1 if x==0 else 0.9 for x in y_train])
rndm_score_dict = {}
for i in range(8):
    rnint = np.random.randint(0,1000000)
    randomForest.set_params(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
           criterion='gini', max_depth=None, max_features=15,
           max_leaf_nodes=None, max_samples=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=12, min_samples_split=3,
           min_weight_fraction_leaf=0.0, n_estimators=110,
           n_jobs=None, oob_score=True, random_state=rnint, verbose=0,
           warm_start=False)
    randomForest.fit(X_train, y_train,sample_weight=sample_weight)
    print(confusion_matrix(y_test, randomForest.predict(X_test)))
    print(classification_report(y_test, randomForest.predict(X_test)))
    rndm_score_dict[rnint]=[confusion_matrix(y_test, randomForest.predict(X_test)),\
    ''.join([classification_report(y_test, randomForest.predict(X_test))[x] for x in range(148,152)]),\
    pd.DataFrame(list(zip(X_train.columns, randomForest.feature_importances_))).sort_values(by = 1, ascending=False)]

[[912  65]
 [ 25  80]]
              precision    recall  f1-score   support

           0       0.97      0.93      0.95       977
           1       0.55      0.76      0.64       105

    accuracy                           0.92      1082
   macro avg       0.76      0.85      0.80      1082
weighted avg       0.93      0.92      0.92      1082

[[913  64]
 [ 26  79]]
              precision    recall  f1-score   support

           0       0.97      0.93      0.95       977
           1       0.55      0.75      0.64       105

    accuracy                           0.92      1082
   macro avg       0.76      0.84      0.80      1082
weighted avg       0.93      0.92      0.92      1082

[[916  61]
 [ 24  81]]
              precision    recall  f1-score   support

           0       0.97      0.94      0.96       977
           1       0.57      0.77      0.66       105

    accuracy                           0.92      1082
   macro avg       0.77      0.85      0.81      1082
weigh
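
The loop above recovers the fraud-class F1 score by slicing characters 148 to 151 out of the classification report string, which depends on the exact report layout. A less brittle option is to compute the score directly with f1_score. The sketch below rebuilds labels from the [[916, 61], [24, 81]] confusion matrix reported in this notebook, purely for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Labels rebuilt from the confusion matrix [[916, 61], [24, 81]].
y_true = np.array([0] * 977 + [1] * 105)
y_pred = np.array([0] * 916 + [1] * 61 + [0] * 24 + [1] * 81)

# F1 for the fraud class only (label 1), as a float rather than a string slice.
fraud_f1 = f1_score(y_true, y_pred, pos_label=1)

# ravel() on a binary confusion matrix yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(round(fraud_f1, 2), tn, fp, fn, tp)
```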

Here we calculate a composite confusion matrix, using the median and standard deviation of each cell across the iterations, to understand the likely range of classification performance.  We see a median F1 score of 0.64 for the fraud class and identification of about 80 of the 105 fraudulent providers in the test set.

In [36]:
import statistics
med_true_neg = statistics.median([rndm_score_dict[x][0][0][0] for x in rndm_score_dict.keys()])
std_true_neg = np.std([rndm_score_dict[x][0][0][0] for x in rndm_score_dict.keys()])
med_false_pos = statistics.median([rndm_score_dict[x][0][0][1] for x in rndm_score_dict.keys()])
std_false_pos = np.std([rndm_score_dict[x][0][0][1] for x in rndm_score_dict.keys()])
med_false_neg = statistics.median([rndm_score_dict[x][0][1][0] for x in rndm_score_dict.keys()])
std_false_neg = np.std([rndm_score_dict[x][0][1][0] for x in rndm_score_dict.keys()])
med_true_pos = statistics.median([rndm_score_dict[x][0][1][1] for x in rndm_score_dict.keys()])
std_true_pos = np.std([rndm_score_dict[x][0][1][1] for x in rndm_score_dict.keys()])
med_f1 = statistics.median([float(rndm_score_dict[x][1]) for x in rndm_score_dict.keys()])
std_f1 = np.std([float(rndm_score_dict[x][1]) for x in rndm_score_dict.keys()])
# print(med_f1)
print(' median, std F1 score for fraud ',(med_f1,std_f1))
print('      true neg                   false pos')
print((med_true_neg,std_true_neg),(med_false_pos,std_false_pos))
print('      false neg                   true pos')
print((med_false_neg,std_false_neg),(med_true_pos,std_true_pos))


 median, std F1 score for fraud  (0.64, 0.008291561975888507)
      true neg                   false pos
(914.0, 1.8708286933869707) (63.0, 1.8708286933869707)
      false neg                   true pos
(25.5, 1.2183492931011204) (79.5, 1.2183492931011204)
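
The same summary can be computed in a few lines by stacking the confusion matrices into one array. The sketch below uses the three matrices printed in the loop output above; it is an illustration rather than the code used to produce the numbers in this cell:

```python
import numpy as np

# Three of the confusion matrices printed in the loop output above.
matrices = np.array([
    [[912, 65], [25, 80]],
    [[913, 64], [26, 79]],
    [[916, 61], [24, 81]],
])

# Cell-wise median and standard deviation across runs.
median_cm = np.median(matrices, axis=0)
std_cm = matrices.std(axis=0)
print(median_cm)
print(std_cm)
```

Each row of a confusion matrix sums to a fixed class count (977 and 105 here), which is why the standard deviations of true negatives and false positives are identical, as are those of false negatives and true positives.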


Here we calculate the average feature importance across all the random number iterations, from the feature importance dataframes created in each iteration.  We then view the bottom 20 (lowest feature importance) features for the model.

In [38]:
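# Take the feature names from the first stored run rather than a hard-coded random key.
RF_Feature_Imp_Ave = next(iter(rndm_score_dict.values()))[2][[0]]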
for key in rndm_score_dict.keys():
    RF_Feature_Imp_Ave = pd.merge(RF_Feature_Imp_Ave,rndm_score_dict[key][2], on=0)
RF_Feature_Imp_Ave['RF_Feature_Imp_Ave']=RF_Feature_Imp_Ave.mean(axis=1)
RF_Feature_Imp_Ave = RF_Feature_Imp_Ave.sort_values(by='RF_Feature_Imp_Ave', ascending=False)
RF_Feature_Imp_Ave = RF_Feature_Imp_Ave.drop(columns=['1_x','1_y','1_x','1_y','1_y','1_y'])
RF_Feature_Imp_Ave.tail(20)


                        0  RF_Feature_Imp_Ave
72           general_otpt            0.001819
69       LogOtherPhys_out            0.001795
68            docBtwnMean             0.00177
77            docEignMean            0.001709
66  AttendingPhysician_in            0.001642
80           LogOpPhys_in            0.001587
70              Gender_in            0.001537
82              docDegMax            0.000753
84        LogOtherPhys_in            0.000584
85        congenital_inpt            0.000562
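
The repeated merges above leave behind duplicated '1_x' and '1_y' columns that then have to be dropped by name. One alternative, sketched below under the assumption that each stored table has the feature name in column 0 and the importance in column 1, is to index each table by feature name and concatenate. The `runs` dictionary and its values are hypothetical:

```python
import pandas as pd

# Hypothetical per-run importance tables in the shape the loop stores:
# column 0 holds the feature name, column 1 its importance.
runs = {
    101: pd.DataFrame([('ClaimDays_in_Range', 0.20), ('LogClaims_in', 0.10), ('Gender_in', 0.002)]),
    202: pd.DataFrame([('LogClaims_in', 0.08), ('ClaimDays_in_Range', 0.22), ('Gender_in', 0.004)]),
}

# One column per random state, aligned on feature name regardless of row order.
per_run = pd.concat({seed: df.set_index(0)[1] for seed, df in runs.items()}, axis=1)

# Average across runs and sort from most to least important.
avg_importance = per_run.mean(axis=1).sort_values(ascending=False)
print(avg_importance)
```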


We tried RFECV (recursive feature elimination with cross-validation) but found in several instances that it removed features that had been quite important in the feature importance tables created in the prior step.  For that reason we dropped this step and reduced the features by simply removing the 25 features with the lowest average feature importance.

In [139]:
# %%time
# sample_weight = np.array([1 if x==0 else 0.9 for x in y_train])
# randomForest.set_params(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
#            criterion='gini', max_depth=None, max_features=15,
#            max_leaf_nodes=None, max_samples=None,
#            min_impurity_decrease=0.0, min_impurity_split=None,
#            min_samples_leaf=12, min_samples_split=3,
#            min_weight_fraction_leaf=0.0, n_estimators=110,
#            n_jobs=None, oob_score=True, random_state=rnint, verbose=0,
#            warm_start=False)
# from sklearn.feature_selection import RFECV
# rfecv = RFECV(randomForest, step=1, min_features_to_select=15, cv=3, scoring='f1_weighted', verbose=0, \
#               n_jobs=3)
# rfecv = rfecv.fit(X_train, y_train)
# a = [X_train.columns[i] for i in range(len(X_train.columns)) if rfecv.support_[i]]
# rfminfeatures = rfecv.estimator_
# lilx_train = X_train[a]
# rfminfeatures.fit(lilx_train, y_train)
# lilx_test= X_test[a]
# print('   0    1    predicted is columns')
# print(confusion_matrix(y_test, rfminfeatures.predict(lilx_test)))
# print(classification_report(y_test, rfminfeatures.predict(lilx_test)))
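
For reference, here is a small self-contained sketch of how RFECV behaves, run on synthetic data from make_classification rather than the provider features. The sizes and parameters are chosen only to keep it fast and are not the settings from the commented-out cell:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic, imbalanced data: 12 features, 4 of them informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=12, n_informative=4,
                                     weights=[0.9, 0.1], random_state=0)

forest = RandomForestClassifier(n_estimators=25, class_weight='balanced', random_state=0)

# Drop one feature per step, scoring each subset size by 3-fold cross-validation.
selector = RFECV(forest, step=1, min_features_to_select=4, cv=3, scoring='f1_weighted')
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the original columns; n_features_ is its sum.
print(selector.n_features_, selector.support_)
```

RFECV ranks features by the importances of models refit on progressively smaller subsets, so its ranking can differ from a ranking based on importances averaged over full-feature models.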

In [19]:
RF_Feature_Imp_Ave.to_csv('rf_feature_importance.csv')

Here we identify the features (bottom 25 by average feature importance) to be removed from the reduced feature model.

In [39]:
a = RF_Feature_Imp_Ave.tail(25)
drop_list = list(a[0])
drop_list

['ClaimDays_out',
 'pulmonology_otpt',
 'urology_inpt',
 'LogOpPhys_out',
 'AttendingPhysician_out',
 'general_otpt',
 'LogOtherPhys_out',
 'docBtwnMean',
 'docEignMean',
 'AttendingPhysician_in',
 'LogOpPhys_in',
 'Gender_in',
 'docDegMax',
 'LogOtherPhys_in',
 'congenital_inpt',
 'ob-gyn_inpt',
 'NumProc_out_Range',
 'State_in',
 'ClaimDays_out_Range',
 'NumProc_out',
 'WhetherDead_in',
 'DeductibleAmtPaid_in',
 'ClmsPerPhysician_in',
 'ClmsPerPhysician_out',
 'neonatology_inpt']

We remove the bottom 25 features and then iterate across multiple random states to generate a median F1 score, a median confusion matrix and the average feature importance for the reduced model.  The median F1 score for the reduced feature model is essentially unchanged (0.645 versus 0.64), as is the confusion matrix.

In [40]:
X_train_reduced = X_train.drop(columns=drop_list)
X_test_reduced = X_test.drop(columns=drop_list)
sample_weight = np.array([1 if x==0 else 0.9 for x in y_train])
rndm_score_red_dict = {}
for i in range(8):
    rnint = np.random.randint(0,1000000)
    randomForest.set_params(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
           criterion='gini', max_depth=None, max_features=15,
           max_leaf_nodes=None, max_samples=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=12, min_samples_split=3,
           min_weight_fraction_leaf=0.0, n_estimators=110,
           n_jobs=None, oob_score=True, random_state=rnint, verbose=0,
           warm_start=False)
    randomForest.fit(X_train_reduced, y_train,sample_weight=sample_weight)
#     print(confusion_matrix(y_test, randomForest.predict(X_test_reduced)))
#     print(classification_report(y_test, randomForest.predict(X_test_reduced)))
    rndm_score_red_dict[rnint]=[confusion_matrix(y_test, randomForest.predict(X_test_reduced)),\
    ''.join([classification_report(y_test, randomForest.predict(X_test_reduced))[x] for x in range(148,152)]),\
    pd.DataFrame(list(zip(X_train_reduced.columns, randomForest.feature_importances_))).sort_values(by = 1, ascending=False)]
    
med_true_neg = statistics.median([rndm_score_red_dict[x][0][0][0] for x in rndm_score_red_dict.keys()])
std_true_neg = np.std([rndm_score_red_dict[x][0][0][0] for x in rndm_score_red_dict.keys()])
med_false_pos = statistics.median([rndm_score_red_dict[x][0][0][1] for x in rndm_score_red_dict.keys()])
std_false_pos = np.std([rndm_score_red_dict[x][0][0][1] for x in rndm_score_red_dict.keys()])
med_false_neg = statistics.median([rndm_score_red_dict[x][0][1][0] for x in rndm_score_red_dict.keys()])
std_false_neg = np.std([rndm_score_red_dict[x][0][1][0] for x in rndm_score_red_dict.keys()])
med_true_pos = statistics.median([rndm_score_red_dict[x][0][1][1] for x in rndm_score_red_dict.keys()])
std_true_pos = np.std([rndm_score_red_dict[x][0][1][1] for x in rndm_score_red_dict.keys()])
med_f1 = statistics.median([float(rndm_score_red_dict[x][1]) for x in rndm_score_red_dict.keys()])
std_f1 = np.std([float(rndm_score_red_dict[x][1]) for x in rndm_score_red_dict.keys()])
# print(med_f1)
print('Metrics for reduced random forest on test set, minus bottom 25 features')
print(len(X_train_reduced.columns))
print(' median, std F1 score for fraud ',(med_f1,std_f1))
print('      true neg                   false pos')
print((med_true_neg,std_true_neg),(med_false_pos,std_false_pos))
print('      false neg                   true pos')
print((med_false_neg,std_false_neg),(med_true_pos,std_true_pos))
print('metrics for train set with reduced features')
print(confusion_matrix(y_train, randomForest.predict(X_train_reduced)))
print(classification_report(y_train, randomForest.predict(X_train_reduced)))


Metrics for reduced random forest on test set, minus bottom 25 features
70
 median, std F1 score for fraud  (0.645, 0.006959705453537534)
      true neg                   false pos
(911.5, 1.118033988749895) (65.5, 1.118033988749895)
      false neg                   true pos
(24.0, 1.0532687216470449) (81.0, 1.0532687216470449)
metrics for train set with reduced features
[[3723  204]
 [  17  384]]
              precision    recall  f1-score   support

           0       1.00      0.95      0.97      3927
           1       0.65      0.96      0.78       401

    accuracy                           0.95      4328
   macro avg       0.82      0.95      0.87      4328
weighted avg       0.96      0.95      0.95      4328
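
The fraud-class F1 score on the training set (0.78) is noticeably higher than on the test set (about 0.645), as expected when a forest is scored on the rows it was fit to. The model is configured with oob_score=True, which makes an out-of-bag estimate available without touching the test set. The sketch below shows how that attribute can be read, using synthetic data rather than the reduced feature set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the reduced feature matrix.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=110, min_samples_leaf=12,
                                oob_score=True, random_state=42).fit(X_demo, y_demo)

# oob_score_ is the accuracy on rows each tree did not see during bootstrapping;
# score() on the training rows is typically more optimistic.
train_acc = forest.score(X_demo, y_demo)
print(forest.oob_score_, train_acc)
```

By default oob_score_ reports accuracy, so it is not directly comparable to the F1 scores quoted above.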



We now calculate the average feature importance across all the random iterations, and find that the range of claim durations, the number of claims, the range of reimbursements and the number of patients are the most important features in this model.  These are roughly in accordance with the other tree-based models we've examined, including gradient boosting, AdaBoost and LogitBoost.

In [43]:
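# Take the feature names from the first stored run rather than a hard-coded random key.
RF_Red_Feature_Imp_Ave = next(iter(rndm_score_red_dict.values()))[2][[0]]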
for key in rndm_score_red_dict.keys():
    RF_Red_Feature_Imp_Ave = pd.merge(RF_Red_Feature_Imp_Ave,rndm_score_red_dict[key][2], on=0)
RF_Red_Feature_Imp_Ave['RF_Feature_Imp_Ave']=RF_Red_Feature_Imp_Ave.mean(axis=1)
RF_Red_Feature_Imp_Ave = RF_Red_Feature_Imp_Ave.sort_values(by='RF_Feature_Imp_Ave', ascending=False)
RF_Red_Feature_Imp_Ave = RF_Red_Feature_Imp_Ave.drop(columns=['1_x','1_y','1_x','1_y','1_y','1_y'])
RF_Red_Feature_Imp_Ave.to_csv('RF_Red_Feature_Imp_Ave.csv')
RF_Red_Feature_Imp_Ave.head(20)

                                 0  RF_Feature_Imp_Ave
0               ClaimDays_in_Range              0.2043
1                     LogClaims_in            0.095395
2  InscClaimAmtReimbursed_in_Range            0.079452
3                   LogPatients_in            0.071365
4                    LogClaims_out            0.056503
7                     ClaimDays_in            0.039349
6                  LogPatients_out            0.035343
5                 NumProc_in_Range             0.03326
8                ClmsPerPatient_in            0.029042
9                 NumDiag_in_Range            0.022861


In [45]:
X_train_reduced.to_csv('rf_reduced_feature_set')
y_train.to_csv('rf_reduced_label_set')


Finally we attempt to develop a cost model to quantify the relative performance of each model. We read in the total claims data since we have decided to measure the dollar amount of claims of the fraudulent providers and the amount of that money that this model has identified as reimbursed to fraudulent providers.

In [46]:
data = pd.read_csv('./data/combinedData.csv')




Sum the money reimbursed to all providers, to be able to quantify the amount of money reimbursed to fraudulent providers.

In [47]:
data = data[data['Set']=='Train']
# Sum only the reimbursement column, so non-numeric columns are never aggregated.
data1 = data.groupby('Provider')['InscClaimAmtReimbursed'].sum().reset_index()
data1.columns=['Provider','TotalClaim']
data1


      Provider  TotalClaim
0     PRV51001      104640
1     PRV51003      605670
2     PRV51004       52170
3     PRV51005      280910
4     PRV51007       33710
...        ...         ...
5405  PRV57759       10640
5406  PRV57760        4770
5407  PRV57761       18470
5408  PRV57762        1900


In [51]:
provider_claim_test.columns


Index(['Provider', 'total_claim'], dtype='object')

The model presented is slightly different from this one, but essentially we attempt to acknowledge a cost associated with all investigations, and impose an extra cost for false positive identifications of innocent providers as fraudulent.  We attempted to maximize the amount of money identified as coming from fraudulent providers, while also trying to maximize the ratio of the recovered money to the amount spent to recover it.

In [52]:
a = pd.DataFrame({'actual':y_test,'predict':randomForest.predict(X_test_reduced),'total_claim': provider_claim_test['total_claim']})
print(confusion_matrix(y_test, randomForest.predict(X_test_reduced)))

totalclaims = np.sum(a['total_claim'])
totaldefrauded=100*np.sum(a[a['actual']==1]['total_claim'])/totalclaims

print('total claims for test set are ${:,.0f}'.format(totalclaims))

print('total fraudulent claims are %i' %totaldefrauded,'% of total claims')

totalcost=100*np.sum(a[a['predict']==1]['predict'])*100000/totalclaims
print('total investigation cost at 100K per %i' %totalcost,'% of total claims')

# Parenthesize each comparison: & binds more tightly than ==.
totalfalsepos=100*np.sum(a[(a['predict']==1) & (a['actual']==0)]['predict'])*100000/totalclaims
print('total legal costs for false positives at 100K per are %i' %totalfalsepos,'% of total claims')

# Parenthesize each comparison: & binds more tightly than ==.
totalrecovered=100*np.sum(a[(a['predict']==1) & (a['actual']==1)]['total_claim'])/totalclaims
print('total recovered claims are %i' %totalrecovered,'% of total claims')
print('total net benefit of model as Pct of total claims is %i' %(totalrecovered-(totalcost+totalfalsepos)),'% of total claims')

[[913  64]
 [ 24  81]]
total claims for test set are $106,047,820
total fraudulent claims are 51 % of total claims
total investigation cost at 100K per 13 % of total claims
total legal costs for false positives at 100K per are 6 % of total claims
total recovered claims are 48 % of total claims
total net benefit of model as Pct of total claims is 29 % of total claims
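
The cost calculation above can be wrapped in a small function so it can be reused across models. The sketch below is one possible arrangement under the same assumptions as this cell (a flat 100K cost per investigated provider, and the same amount again for each false positive). The function name `cost_summary` and the toy inputs are illustrative only:

```python
import numpy as np

def cost_summary(actual, predict, total_claim, cost_per_case=100_000):
    """Return each cost component as a percentage of total claims."""
    actual = np.asarray(actual)
    predict = np.asarray(predict)
    claims = np.asarray(total_claim, dtype=float)
    total = claims.sum()

    flagged = predict == 1
    true_pos = flagged & (actual == 1)
    false_pos = flagged & (actual == 0)

    investigation = 100 * flagged.sum() * cost_per_case / total   # every flagged provider
    legal = 100 * false_pos.sum() * cost_per_case / total         # extra cost per false positive
    recovered = 100 * claims[true_pos].sum() / total              # claims of correctly flagged providers
    return {'investigation': investigation, 'legal': legal,
            'recovered': recovered, 'net': recovered - (investigation + legal)}

# Toy example: four providers, two flagged, one of them correctly.
summary = cost_summary(actual=[1, 1, 0, 0], predict=[1, 0, 1, 0],
                       total_claim=[1_000_000, 500_000, 300_000, 200_000])
print(summary)
```

As a check against the output above, 64 + 81 = 145 flagged providers at 100K each is 14,500,000, or about 13.7% of the $106,047,820 in test-set claims, which the %i format truncates to 13.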

