In [1]:
#%run /Users/shabnanasser/workplace/git/Capstone_Two/Cap2_Preprocessing.ipynb

# Modeling
In this section, we train three ML models on the standardised data: Logistic Regression, a Decision Tree classifier and a Random Forest classifier. For each model we report accuracy, precision, recall and F1-score on the test set, and then identify the best-performing model among them.
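
As a quick illustration of how the four metrics relate to each other, the sketch below computes them from a confusion matrix. The labels are hypothetical toy values chosen for the example and are not taken from the project data.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical toy labels (1 = positive class); not from the project data
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / all samples
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, "
      f"Recall: {recall:.3f}, F1-Score: {f1:.3f}")
```

Accuracy counts both classes equally, while precision, recall and F1 focus on the positive class only, which is why a model can score well on accuracy and still have a modest F1-score.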


## 1) Logistic Regression

In [2]:
# Import the required libraries

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score


In [3]:
# Create a logistic regression model with the liblinear solver and up to 1000 iterations
logreg_model = LogisticRegression(solver='liblinear', max_iter=1000)

# Estimate performance with 5-fold cross-validation
# (cross_val_score only scores the model; it does not tune hyperparameters)
# Note: this runs on X_train, whereas the fit below uses X_train_transformed
cv_scores_logreg = cross_val_score(logreg_model, X_train, y_train, cv=5)
print(f"Cross-Validation Scores for Logistic Regression: {cv_scores_logreg}")
print(f"Mean CV Score: {cv_scores_logreg.mean()}\n")

# Train and evaluate the logistic regression model
# Fit the model on the transformed training data
logreg_model.fit(X_train_transformed, y_train)
# Make predictions on the transformed test data
y_pred_logreg = logreg_model.predict(X_test_transformed)

# Calculate evaluation metrics for logistic regression
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)
precision_logreg = precision_score(y_test, y_pred_logreg)
recall_logreg = recall_score(y_test, y_pred_logreg)
f1_logreg = f1_score(y_test, y_pred_logreg)

# Print the evaluation metrics for logistic regression
print("Evaluation Metrics for Logistic Regression:")
print(f"Accuracy: {accuracy_logreg}")
print(f"Precision: {precision_logreg}")
print(f"Recall: {recall_logreg}")
print(f"F1-Score: {f1_logreg}\n")

Cross-Validation Scores for Logistic Regression: [0.85796767 0.86489607 0.84064665 0.86589595 0.82774566]
Mean CV Score: 0.8514304022213619

Evaluation Metrics for Logistic Regression:
Accuracy: 0.9260628465804066
Precision: 0.676056338028169
Recall: 0.45714285714285713
F1-Score: 0.5454545454545455
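
The cross-validation step above scores a single fixed configuration. If an actual hyperparameter search is wanted, one option is `GridSearchCV`. The sketch below is a minimal example under stated assumptions: the synthetic data from `make_classification` stands in for the training set, and the grid of `C` values and the choice of F1 as the scoring metric are illustrative, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption, not the project data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid over the inverse regularisation strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    cv=5,          # same 5-fold setup as in the cell above
    scoring="f1",  # optimise F1 instead of the default accuracy
)
search.fit(X_demo, y_demo)

print(f"Best parameters: {search.best_params_}")
print(f"Best mean CV F1-Score: {search.best_score_}")
```

After fitting, `search.best_estimator_` holds the model refitted on all of the data passed to `fit`, so it can be used for prediction directly.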



## 2) Decision Tree Classifier

In [4]:

# Create a Decision Tree model
dt_model = DecisionTreeClassifier()

# Estimate performance with 5-fold cross-validation
# (cross_val_score only scores the model; it does not tune hyperparameters)
cv_scores_dt = cross_val_score(dt_model, X_train, y_train, cv=5)
print(f"Cross-Validation Scores for Decision Tree: {cv_scores_dt}")
print(f"Mean CV Score: {cv_scores_dt.mean()}\n")

# Train and evaluate the Decision Tree model
# Fit the model on the training data
dt_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_dt = dt_model.predict(X_test)

# Calculate evaluation metrics for the Decision Tree model
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)

# Print the evaluation metrics for the Decision Tree model
print("Evaluation Metrics for Decision Tree:")
print(f"Accuracy: {accuracy_dt}")
print(f"Precision: {precision_dt}")
print(f"Recall: {recall_dt}")
print(f"F1-Score: {f1_dt}\n")



Cross-Validation Scores for Decision Tree: [0.90877598 0.90877598 0.90184758 0.9017341  0.89942197]
Mean CV Score: 0.9041111214940795

Evaluation Metrics for Decision Tree:
Accuracy: 0.8872458410351202
Precision: 0.4247787610619469
Recall: 0.45714285714285713
F1-Score: 0.4403669724770642



## 3) Random Forest Classifier

In [5]:
# Create a Random Forest model
rf_model = RandomForestClassifier()

# Estimate performance with 5-fold cross-validation
# (cross_val_score only scores the model; it does not tune hyperparameters)
cv_scores_rf = cross_val_score(rf_model, X_train, y_train, cv=5)
print(f"Cross-Validation Scores for Random Forest: {cv_scores_rf}")
print(f"Mean CV Score: {cv_scores_rf.mean()}\n")

# Train and evaluate the Random Forest model
# Fit the model on the training data
rf_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_rf = rf_model.predict(X_test)

# Calculate evaluation metrics for the Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)

# Print the evaluation metrics for the Random Forest model
print("Evaluation Metrics for Random Forest:")
print(f"Accuracy: {accuracy_rf}")
print(f"Precision: {precision_rf}")
print(f"Recall: {recall_rf}")
print(f"F1-Score: {f1_rf}\n")


Cross-Validation Scores for Random Forest: [0.93187067 0.9295612  0.94341801 0.92716763 0.91791908]
Mean CV Score: 0.9299873179457743

Evaluation Metrics for Random Forest:
Accuracy: 0.9214417744916821
Precision: 0.6666666666666666
Recall: 0.38095238095238093
F1-Score: 0.4848484848484849
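
The three cells above repeat the same fit, predict and score steps. One way to compare the models side by side is a small helper that returns the metrics as a table. This is a sketch only: the `evaluate` helper, the synthetic data and the fixed `random_state` values are assumptions made for the example, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return its test-set metrics (hypothetical helper)."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, y_pred),
        "precision": precision_score(y_te, y_pred, zero_division=0),
        "recall": recall_score(y_te, y_pred, zero_division=0),
        "f1": f1_score(y_te, y_pred, zero_division=0),
    }


# Synthetic stand-in for the project data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear", max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# One row per model, one column per metric
results = pd.DataFrame(
    {name: evaluate(model, X_tr, y_tr, X_te, y_te) for name, model in models.items()}
).T
print(results.round(3))
```

Fixing `random_state` for the tree-based models makes the comparison reproducible between runs.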



<p><div style="text-align: justify;">On the test set, Logistic Regression has the highest accuracy (0.926) and F1-score (0.545) of the three models. It also has the highest precision (0.676), and its recall (0.457) ties with that of the Decision Tree. Hence we conclude that Logistic Regression is the best-performing model among them. The features InscClaimAmtReimbursed, DeductibleAmtPaid, Claim_Duration and Admitted_Days, along with the aggregated features and the target labels Provider and Potential Fraud, led us to the best-performing model. The idea behind adding aggregated features built from combinations of various features is that several parties or entities might work together to commit healthcare fraud, so we need to capture the interactions among them to classify fraudsters better. For example, an unusually high or low total claim reimbursement amount for a provider might indicate fraud. Because these features are related to one another in one way or another, we believe they contributed to the best-performing model.</div></p>
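
To make the idea of provider-level aggregated features concrete, here is a minimal pandas sketch. The column names `Provider`, `InscClaimAmtReimbursed` and `DeductibleAmtPaid` come from the text above; the toy values and the names of the aggregated columns are hypothetical and may differ from what the preprocessing notebook actually builds.

```python
import pandas as pd

# Hypothetical toy claims; the values are invented for illustration
claims = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV2", "PRV2", "PRV2"],
    "InscClaimAmtReimbursed": [100, 300, 50, 70, 30],
    "DeductibleAmtPaid": [10, 0, 5, 5, 0],
})

# One row per provider with summary statistics over that provider's claims
provider_agg = claims.groupby("Provider").agg(
    reimbursed_sum=("InscClaimAmtReimbursed", "sum"),
    reimbursed_mean=("InscClaimAmtReimbursed", "mean"),
    deductible_sum=("DeductibleAmtPaid", "sum"),
    claim_count=("InscClaimAmtReimbursed", "size"),
).reset_index()

# Attach the provider-level features back onto every claim row
claims_with_agg = claims.merge(provider_agg, on="Provider", how="left")
print(provider_agg)
```

A provider whose `reimbursed_sum` is far above or below that of its peers then becomes visible to the model as a single feature value.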