## The Consumer Financial Protection Bureau (CFPB) is a U.S. government agency that makes sure financial companies treat their customers fairly. Its website lets customers of financial services file complaints about unfair treatment by banks and other financial companies when those companies fail to resolve an issue to the customer’s satisfaction.

 

When customers choose to complain to the CFPB, financial companies incur additional costs to resolve such complaints.


On receipt, the CFPB routes complaints to the financial companies, who generally respond to the consumer within 15 days.  Once a response is provided, one of two things can happen:

 

- In most cases, consumers accept the response or remediation offered by the financial companies.
- In other cases, they dispute the resolution offered by the company (flagged in the 'Consumer disputed?' field).  In these situations, the bank has to perform additional investigations, and possibly offer further relief to the customers.  As a result, the cost of dealing with disputes can be high.
 

The original dataset for this project has over 2 million anonymized recent records, and covers 6000+ financial providers of all varieties.  It can be downloaded following the instructions at https://www.consumerfinance.gov/data-research/consumer-complaints/.  The website also provides additional information on the data, including the data dictionary. 

 

For this project, we will use only the data up to 2017, and only for the top 5 banks in the US.  To make sure we are all working off the same data, we will use the file **complaints_25Nov21.csv**, available in Jupyterhub under the shared/ folder.

 

The cost structure:

On average, it costs the banks $100 to resolve, respond to and close a complaint that is not disputed. 
 

On the other hand, it costs banks an extra $500 to resolve a complaint if it has been disputed.  (This $500 is on top of the $100 they have already spent.)
 

Extra diligence: If the banks know in advance which complaints will be disputed, they can perform “extra diligence” during the first round of addressing the complaint with a view to avoiding eventual disputes.  Performing extra diligence costs $90 per complaint, and provides a guarantee that the customer will not dispute the complaint.  But performing the extra diligence is wasted money if the customer would not have disputed the complaint.
 

You are required to create a model that can help the banks identify complaints that will end in a dispute.  The goal is to minimize total financial costs, and if the banks can identify future disputes they can avoid the larger costs by performing the cheaper extra diligence in advance.

### Hint: Think about Calculating Total Cost in Dollars   

The moment a complaint enters the CFPB’s system, there is a $100 cost to resolve it.  This applies to every complaint.
After that, if a complaint’s resolution is disputed by the customer, an additional $500 has to be spent (for a total cost of $600 in such cases).
But the bank can intervene in advance by spending an extra $90 for extra diligence, and that can make sure the complaint’s resolution is not disputed. 
 

While we can’t prevent complaints from coming to the CFPB, we can reduce total costs by identifying the complaints that are likely to be disputed, and doing the extra due diligence for them.  This extra due diligence will cost us an extra $90 per complaint, but save us the additional $500 to resolve the complaint after the dispute.  But obviously, the bank would not want to spend this extra money on complaints that would not have been disputed anyway.
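The cost rules above reduce to three constants and one branch. A minimal sketch, in which the function name `complaint_cost` and its arguments are illustrative and not part of the assignment:

```python
BASE_COST = 100       # paid for every complaint
DISPUTE_COST = 500    # extra cost when a complaint is disputed
DILIGENCE_COST = 90   # extra diligence, which prevents a dispute


def complaint_cost(flagged: bool, would_dispute: bool) -> int:
    """Cost of one complaint given the bank's decision and the true outcome."""
    cost = BASE_COST
    if flagged:
        # Extra diligence is paid whether or not a dispute was coming.
        cost += DILIGENCE_COST
    elif would_dispute:
        # No diligence was done and the customer disputed the resolution.
        cost += DISPUTE_COST
    return cost


# Print the cost of each of the four possible outcomes.
for flagged in (False, True):
    for would_dispute in (False, True):
        print(flagged, would_dispute, complaint_cost(flagged, would_dispute))
```

A flagged complaint costs $190 regardless of the outcome, while an unflagged dispute costs $600, so flagging only pays off when a dispute was actually coming.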

 

#### Your task is to create a predictive model that can help the banks keep their total complaint related costs low. 

 

### Follow the instructions below and answer the multiple choice questions that follow.

 

 

Explore the data, familiarize yourself with the fields and perform some EDA.
 

Set your X (predictor) and y (predicted) variables. 
Use only the below variables as your predictors.  Ignore the other variables in the dataset.
'Product', 'Sub-product', 'Issue', 'State', 'Tags', 'Submitted via',  'Company response to consumer', 'Timely response?'

 

Use 'Consumer disputed?' as your y-variable.  Be sure to convert your y-variable to 0s and 1s so your model can use it.
 

For example, you can use label encoder as below, or any other method you are comfortable using:

 

from sklearn import preprocessing

le = preprocessing.LabelEncoder()

y = le.fit_transform(complaints['Consumer disputed?'])

 

Split your data into a test and train set.  Use an 80/20 train-test split, and random_state=123 for the train-test split.
 

For example, using the below, appropriately adjusted to the variable names you are using:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 123)

 

Check what proportion of complaints in your training dataset are disputed.  If this proportion is less than 30%, use random undersampling with random_state = 123 to balance your dataset. 
 

For example, you could use the below (adjusted for your choice of variables etc)

 

from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(random_state=123)

X_train, y_train = undersampler.fit_resample(X_train, y_train)
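Conceptually, random undersampling keeps every row of the minority class and draws an equally sized random sample from the majority class. The sketch below illustrates that idea in plain Python on made-up labels. It is not imblearn's implementation, and `undersample_indices` is a hypothetical helper.

```python
import random


def undersample_indices(labels, seed=123):
    """Return row indices that balance a binary label list by dropping majority rows."""
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    # The shorter list is the minority class, the longer one the majority.
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))
    return sorted(minority + kept_majority)


# Made-up labels: 2 disputed complaints out of 10 (20%, below the 30% cut-off).
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
kept = undersample_indices(labels)
balanced = [labels[i] for i in kept]
print(sum(balanced) / len(balanced))  # share of disputed complaints after balancing
```

With two classes this leaves an even split between disputed and non-disputed complaints, at the price of discarding most of the majority-class rows.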

 

Train an XGBoost classifier (XGBClassifier) with random_state=123 to predict whether a complaint will be disputed.
 

For example, using the below:

 

model_xgb = XGBClassifier(random_state = 123)

 

Evaluate the model on the test set, and create the classification report and confusion matrix.  (Remember, when we say ‘True Positive’, ‘False Negative’ etc., the second word, positive or negative, denotes the class the model predicted; and the first word, True or False, indicates whether that prediction was correct.)
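To make the four outcomes concrete, the sketch below counts them by hand for a few made-up labels. The helper `confusion_counts` is hypothetical. Its return order mirrors what scikit-learn's `confusion_matrix(y_true, y_pred).ravel()` gives for binary 0/1 labels.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 means 'disputed'."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # flagged, and a dispute was coming
        elif predicted == 1 and actual == 0:
            fp += 1   # flagged, but no dispute was coming
        elif predicted == 0 and actual == 1:
            fn += 1   # not flagged, and the customer disputed
        else:
            tn += 1   # not flagged, no dispute
    return tn, fp, fn, tp


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))
```

In cost terms, false positives are wasted diligence and false negatives are disputes that went ahead.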
 

Calculate the total cost in dollars for the test set.  Establish the ‘base-case’, ie the total cost if you were not using a model, using the test set only. 
 

Use the cost structure explained earlier (i.e., $600 total for every disputed complaint, $100 for every non-disputed complaint, and $90 for the extra due diligence).

 

Now calculate the total cost in dollars based on the model results in the confusion matrix.  The below graphic might help you.  But you are free to use your own methods.


The cost in the default model is not the lowest cost.  Change the classification threshold on the model to calculate the lowest total cost you can achieve.

### Q1
In the test set (not the entire dataset), what proportion of consumers raised a dispute?

In [129]:
pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [130]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [131]:
# Load the dataset
complaints = pd.read_csv("complaints_25Nov21.csv")

# Display the first few rows of the dataset
complaints.head()


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2016-10-26,Money transfers,International money transfer,Other transaction issues,,"To whom it concerns, I would like to file a fo...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",,,,Consent provided,Web,2016-10-29,Closed with explanation,Yes,No,2180490
1,2015-03-27,Bank account or service,Other bank product/service,"Account opening, closing, or management",,My name is XXXX XXXX XXXX and huband name is X...,Company chooses not to provide a public response,"CITIBANK, N.A.",PA,151XX,Older American,Consent provided,Web,2015-03-27,Closed with explanation,Yes,No,1305453
2,2015-04-20,Bank account or service,Other bank product/service,"Making/receiving payments, sending money",,XXXX 2015 : I called to make a payment on XXXX...,Company chooses not to provide a public response,U.S. BANCORP,PA,152XX,,Consent provided,Web,2015-04-22,Closed with monetary relief,Yes,No,1337613
3,2013-04-29,Mortgage,Conventional fixed mortgage,"Application, originator, mortgage broker",,,,JPMORGAN CHASE & CO.,VA,22406,Servicemember,,Phone,2013-04-30,Closed with explanation,Yes,Yes,393900
4,2013-05-29,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",GA,30044,,,Referral,2013-05-31,Closed with explanation,Yes,No,418647


In [132]:
# Perform exploratory data analysis (EDA)
complaints.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207260 entries, 0 to 207259
Data columns (total 18 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   Date received                 207260 non-null  object
 1   Product                       207260 non-null  object
 2   Sub-product                   164245 non-null  object
 3   Issue                         207260 non-null  object
 4   Sub-issue                     10347 non-null   object
 5   Consumer complaint narrative  29391 non-null   object
 6   Company public response       58458 non-null   object
 7   Company                       207260 non-null  object
 8   State                         205066 non-null  object
 9   ZIP code                      197974 non-null  object
 10  Tags                          28265 non-null   object
 11  Consumer consent provided?    51313 non-null   object
 12  Submitted via                 207260 non-null  object
 13 

In [133]:

# Create a copy of the DataFrame
complaints = complaints.copy()

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Iterate over each column
for column in complaints.columns:
    # Check if the column data type is object (categorical)
    if complaints[column].dtype == 'object':
        # Encode categorical columns
        complaints[column] = label_encoder.fit_transform(complaints[column].astype(str))

# Display the first few rows of the encoded DataFrame
complaints.head()


Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,1791,5,23,69,57,24642,7,1,62,12612,3,1,5,1721,1,1,0,2180490
1,1212,0,30,1,57,19633,6,1,46,1947,0,1,5,1139,1,1,0,1305453
2,1236,0,30,61,57,27604,6,3,46,1974,3,1,5,1165,2,1,0,1337613
3,515,6,6,7,57,28991,8,2,55,2836,2,4,2,447,1,1,1,393900
4,545,6,31,57,57,28991,8,0,15,3731,3,4,4,476,1,1,0,418647


In [134]:
complaints.describe()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
count,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0,207260.0
mean,970.673632,3.737552,23.665729,45.285279,55.366964,26960.921104,7.608197,1.774192,28.077651,6567.530257,2.673101,3.183243,4.276749,910.001259,1.542903,0.977531,0.216651,1028619.0
std,568.137559,2.643515,16.722593,24.340755,7.940623,5926.071078,0.678523,1.568927,17.551795,3921.531333,0.883249,1.463812,1.017283,555.324673,1.093776,0.148205,0.411964,753334.8
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0
25%,462.0,2.0,6.0,28.0,57.0,28991.0,7.0,0.0,11.0,2954.0,3.0,4.0,4.0,407.0,1.0,1.0,0.0,345621.8
50%,944.0,6.0,30.0,57.0,57.0,28991.0,8.0,2.0,26.0,6424.0,3.0,4.0,5.0,876.0,1.0,1.0,0.0,920972.0
75%,1481.0,6.0,36.0,58.0,57.0,28991.0,8.0,4.0,42.0,10428.0,3.0,4.0,5.0,1409.0,2.0,1.0,0.0,1710704.0
max,1946.0,10.0,47.0,92.0,57.0,29345.0,8.0,4.0,62.0,12612.0,3.0,4.0,5.0,1897.0,6.0,1.0,1.0,2412707.0


In [135]:

# Set predictor variables (X) and predicted variable (y)
X = complaints[['Product', 'Sub-product', 'Issue', 'State', 'Tags', 'Submitted via', 'Company response to consumer', 'Timely response?']]
y = complaints['Consumer disputed?']

# Convert y variable to 0s and 1s
le = LabelEncoder()
y = le.fit_transform(y)


In [136]:
# Split the encoded data into training and test sets (80/20),
# with a fixed seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Initialize RandomUnderSampler
undersampler = RandomUnderSampler(random_state=123)

# Perform random undersampling
X_train_balanced, y_train_balanced = undersampler.fit_resample(X_train, y_train)

# Calculate the proportion of consumers who raised a dispute in the balanced training dataset
proportion_dispute = sum(y_train_balanced) / len(y_train_balanced)
print("Proportion of consumers who raised a dispute in the balanced training dataset:", proportion_dispute)


Proportion of consumers who raised a dispute in the balanced training dataset: 0.5


In [137]:
# Count the number of complaints labeled as 'Consumer disputed?' = 'Yes' in the test set
disputed_count = (y_test == 1).sum()

# Calculate the total number of complaints in the test set
total_complaints = len(y_test)

# Calculate the proportion of consumers who raised a dispute
proportion_disputed = disputed_count / total_complaints

print("Proportion of consumers who raised a dispute:", proportion_disputed)


Proportion of consumers who raised a dispute: 0.21586413200810575


### Q2
After you have performed random undersampling, what proportion of consumers in the training dataset raised a dispute?

In [138]:
# Balance the training set if fewer than 30% of its complaints are disputed
if y_train.mean() < 0.3:
    undersampler = RandomUnderSampler(random_state=123)
    X_train, y_train = undersampler.fit_resample(X_train, y_train)

# Train a predictive model using XGBoost Classifier
model_xgb = XGBClassifier(random_state=123)
model_xgb.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model_xgb.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.52      0.64     32504
           1       0.27      0.63      0.38      8948

    accuracy                           0.55     41452
   macro avg       0.55      0.58      0.51     41452
weighted avg       0.71      0.55      0.59     41452

Confusion Matrix:
[[17050 15454]
 [ 3316  5632]]


### Q3
Fit the XGBClassifier model as described in the instructions, and evaluate it on the test set.  What is the recall for the category 'Consumer disputed?' = 'Yes' on the test set?

### Q4
If there were no model, what would be the total cost to the banks of dealing with the complaints in the test set?  Pick the closest value to what you get.

In [139]:
# Count the number of complaints in the test set
total_complaints = len(y_test)

# Count the number of complaints that are disputed ('Consumer disputed?' = 'Yes')
num_disputed = sum(y_test)

# Calculate the number of complaints that are not disputed
num_not_disputed = total_complaints - num_disputed

# Calculate the total cost without using a model
total_cost_no_model = num_not_disputed * 100 + num_disputed * 600

print("Total cost without using a model:", total_cost_no_model)

Total cost without using a model: 8619200


### Q5
Use the predictions for which complaints are likely to be disputed from the model you have created (using the default classification threshold).  Assume that if the model predicts a complaint will be disputed, the banks decide to spend $90 performing extra diligence to avoid the $600 cost of a dispute.

In this situation based on model results, what would be the total cost to the banks of dealing with the complaints in the test set?

Pick the closest value to your answer.

In [140]:
from sklearn.metrics import confusion_matrix

# Predictions from the model
y_pred = model_xgb.predict(X_test)

# Confusion matrix to check model performance
cm = confusion_matrix(y_test, y_pred)

# Extracting true positive, false positive, true negative, false negative
tn, fp, fn, tp = cm.ravel()

# Total cost: $100 for every complaint, $90 of extra diligence for every
# complaint the model flags (tp + fp), and $500 for every missed dispute (fn)
total_cost = (tn + fn + fp + tp) * 100 + (tp + fp) * 90 + fn * 500
print("Total cost based on model results:", total_cost)

Total cost based on model results: 7700940


In [141]:

# Predictions from the model
y_pred = model_xgb.predict(X_test)
total_cost = 0
for i in range(len(y_test)):
    total_cost += 100
    if y_pred[i] == 1:
        total_cost += 90
    elif y_test[i] == 1:
        total_cost += 500

print("Total cost based on model results:", total_cost)


Total cost based on model results: 7700940
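As a cross-check, the same total follows directly from the four counts in the confusion matrix printed earlier. The sketch below hard-codes those counts from the notebook output. The variable names are illustrative.

```python
# Counts copied from the confusion matrix printed for the test set.
tn, fp, fn, tp = 17050, 15454, 3316, 5632
n_complaints = tn + fp + fn + tp

# No model: every complaint costs $100, every dispute a further $500.
base_cost = 100 * n_complaints + 500 * (tp + fn)

# With the model: $90 for each flagged complaint, $500 for each missed dispute.
model_cost = 100 * n_complaints + 90 * (tp + fp) + 500 * fn

print(base_cost, model_cost, base_cost - model_cost)
```

Both totals agree with the notebook outputs ($8,619,200 without a model and $7,700,940 from the row-by-row loop), a saving of $918,260 at the default threshold.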


### Q6
The costs to the banks from doing due diligence and from having disputes are asymmetrical.  You therefore have the opportunity to reduce total cost by moving the probability threshold away from the default of 0.5 used in binary classification.

Change the value of the threshold and determine the lowest total cost to the banks based on the observations in the test set.
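A rough starting point for the search comes from comparing expected costs: extra diligence costs $90 for certain, while skipping it costs $500 times the probability of a dispute. The sketch below computes that break-even probability with a hypothetical helper, `break_even_probability`. It assumes the model's scores are calibrated probabilities, which is unlikely to hold exactly here because the model was trained on an undersampled set, so the empirical sweep over thresholds remains the deciding step.

```python
def break_even_probability(diligence_cost: float, dispute_cost: float) -> float:
    """Dispute probability above which extra diligence is cheaper in expectation."""
    return diligence_cost / dispute_cost


# Costs taken from the assignment: $90 diligence versus a $500 dispute.
p_star = break_even_probability(90, 500)
print(p_star)
```

Under that assumption, diligence is worthwhile for any complaint whose dispute probability exceeds roughly 0.18, which suggests searching well below 0.5 as well as above it.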

In [142]:
# Predict probabilities
y_prob = model_xgb.predict_proba(X_test)[:, 1]

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

total_costs = []
for threshold in thresholds:
    y_pred_thresh = (y_prob >= threshold).astype(int)
    cm = confusion_matrix(y_test, y_pred_thresh)
    tn, fp, fn, tp = cm.ravel()
    predicted_disputed = fp + tp
    # $100 per complaint, $90 per flagged complaint, $500 per missed dispute
    total_cost = (tn + fn + fp + tp) * 100 + predicted_disputed * 90 + fn * 500
    total_costs.append(total_cost)

min_cost = min(total_costs)
min_threshold = thresholds[total_costs.index(min_cost)]

print("Lowest total cost:", min_cost)
print("Optimal threshold:", min_threshold)


In [143]:
from sklearn.metrics import confusion_matrix

# Define thresholds to test
thresholds = [0.1, 0.54, 0.5, 0.46]

# Initialize variables to store results
min_total_cost = float('inf')
optimal_threshold = None

# Calculate total cost for each threshold
for threshold in thresholds:
    # Predict using the threshold
    y_pred_thresh = (y_prob >= threshold).astype(int)
    
    # Calculate confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_thresh).ravel()
    
    # Calculate the total cost: $100 per complaint, $90 per flagged
    # complaint (tp + fp), and $500 per missed dispute (fn)
    total_cost = (tn + fn + fp + tp) * 100 + (tp + fp) * 90 + fn * 500
    
    # Update minimum total cost and optimal threshold
    if total_cost < min_total_cost:
        min_total_cost = total_cost
        optimal_threshold = threshold

# Print results
print("Lowest total cost:", min_total_cost)
print("Optimal threshold:", optimal_threshold)


In [144]:
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted probabilities from the model
y_pred_proba = model_xgb.predict_proba(X_test)[:, 1]

# Initialize variables to store the optimal threshold and its corresponding total cost
optimal_threshold = 0
min_total_cost = float('inf')

# Iterate over different threshold values
for threshold in np.linspace(0, 1, 101):
    # Classify complaints based on the threshold
    y_pred_threshold = (y_pred_proba > threshold).astype(int)

    # Confusion matrix to check model performance
    cm = confusion_matrix(y_test, y_pred_threshold)

    # Extracting true positive, false positive, true negative, false negative
    tn, fp, fn, tp = cm.ravel()

    # Calculate the total cost based on model results and threshold:
    # $90 for each flagged complaint (tp + fp), $500 for each missed dispute (fn)
    total_cost = (tn + fn + fp + tp) * 100 + (tp + fp) * 90 + fn * 500

    # Update optimal threshold and total cost if the current cost is lower
    if total_cost < min_total_cost:
        min_total_cost = total_cost
        optimal_threshold = threshold

print("Optimal Threshold:", optimal_threshold)
print("Lowest Total Cost:", min_total_cost)


In [145]:
pred_prob = model_xgb.predict_proba(X_test)
pred_prob = pred_prob[:,1]
thresholds = np.linspace(0, 1, 50)
costs = []

for threshold in thresholds:
    y_pred = (pred_prob>threshold).astype(int)
    total_cost = 0
    # Confusion matrix to check model performance
    cm = confusion_matrix(y_test, y_pred)

    # Extracting true positive, false positive, true negative, false negative
    tn, fp, fn, tp = cm.ravel()

    # Calculate the total cost based on model results and threshold:
    # $90 for each flagged complaint (tp + fp), $500 for each missed dispute (fn)
    total_cost = (tn + fn + fp + tp) * 100 + (tp + fp) * 90 + fn * 500
    costs.append(total_cost)

min_cost = min(costs)
optimal_threshold = thresholds[costs.index(min_cost)]
print("The lowest total cost to the bank is:", min_cost,
      "which is achieved when threshold is ", optimal_threshold)


In [146]:
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted probabilities from the model
y_pred_proba = model_xgb.predict_proba(X_test)[:, 1]

# Threshold grid to search and the total cost recorded for each threshold
thresholds = np.linspace(0, 1, 101)
costs = []
# Iterate over different threshold values
for threshold in thresholds:
    # Classify complaints based on the threshold
    y_pred_threshold = (y_pred_proba > threshold).astype(int)

    # Reset the running total so each threshold is costed independently
    total_cost = 0
    for i in range(len(y_test)):
        # Every complaint costs $100 to resolve
        total_cost += 100
        # Flagged complaints receive extra diligence
        if y_pred_threshold[i] == 1:
            total_cost += 90
        # Unflagged complaints that are disputed cost a further $500
        if y_pred_threshold[i] == 0 and y_test[i] == 1:
            total_cost += 500
    costs.append(total_cost)


print("Optimal Threshold:", thresholds[costs.index(min(costs))])
print("Lowest Total Cost:", min(costs))
