
Commercial banks receive _a lot_ of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `Reg`. The last column in the dataset is the target value.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load the dataset
Reg = pd.read_csv("cc_rejections.data", header=None)
Reg.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,g,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,g,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,g,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,s,0,+


In [2]:
Reg.columns

Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], dtype='int64')

The `drop_first` parameter of `pd.get_dummies()` helps avoid redundancy and multicollinearity, especially when the encoded data is used in statistical or machine learning models. One-hot encoding represents each category in a column as a separate binary column (a dummy variable).

If all dummy variables are kept, any one of them can always be inferred from the others. For a column with only the categories `a` and `b`, the `b` indicator is simply the negation of the `a` indicator. Setting `drop_first=True` drops the first category's column and removes that redundancy.
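The redundancy described above can be seen on a tiny example. This is a minimal sketch; the `toy` frame and its `color` column are hypothetical and not part of the credit card data.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one indicator column per category
full = pd.get_dummies(toy, columns=["color"])

# drop_first=True removes the first category (in sorted order)
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())

# Every row of the full encoding sums to 1, so any one column
# can be reconstructed from the remaining ones
print(full.sum(axis=1).tolist())
```

With the first category dropped, a row of all `False` values stands for the dropped category, so no information is lost.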

In [3]:
df = Reg.copy()

In [4]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,g,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,g,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,g,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,s,0,+


In [5]:
import pandas as pd
import numpy as np

# Replace the '?' missing-value marker with NaN in column 1
df[1] = df[1].replace('?', np.nan)

# Check for NaN values in column 1
print("Number of NaN values in column 1:", df[1].isnull().sum())

# Preview the updated DataFrame
print("\nUpdated Reg DataFrame:\n", df)




Number of NaN values in column 1: 12

Updated Reg DataFrame:
     0      1       2  3  4   5   6     7  8  9   10 11   12 13
0    b  30.83   0.000  u  g   w   v  1.25  t  t   1  g    0  +
1    a  58.67   4.460  u  g   q   h  3.04  t  t   6  g  560  +
2    a  24.50   0.500  u  g   q   h  1.50  t  f   0  g  824  +
3    b  27.83   1.540  u  g   w   v  3.75  t  t   5  g    3  +
4    b  20.17   5.625  u  g   w   v  1.71  t  f   0  s    0  +
..  ..    ...     ... .. ..  ..  ..   ... .. ..  .. ..  ... ..
685  b  21.08  10.085  y  p   e   h  1.25  f  f   0  g    0  -
686  a  22.67   0.750  u  g   c   v  2.00  f  t   2  g  394  -
687  a  25.25  13.500  y  p  ff  ff  2.00  f  t   1  g    1  -
688  b  17.92   0.205  u  g  aa   v  0.04  f  f   0  g  750  -
689  b  35.00   3.375  u  g   c   h  8.29  f  f   0  g    0  -

[690 rows x 14 columns]
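After the replacement, column 1 is still stored as strings (the `df.dtypes` output further down lists it as `object`). One way to get a numeric column is `pd.to_numeric` with `errors="coerce"`, which also turns unparseable entries such as `'?'` into NaN. This is a sketch on a small hypothetical frame, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-in for column 1: numbers stored as strings, '?' for missing
toy = pd.DataFrame({1: ["30.83", "?", "24.50"]})

# Coerce to numeric; anything that cannot be parsed becomes NaN
toy[1] = pd.to_numeric(toy[1], errors="coerce")

print(toy[1].dtype)
print("Number of NaN values:", toy[1].isnull().sum())
```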


In [6]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,g,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,g,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,g,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,s,0,+


In [7]:
import pandas as pd
import numpy as np

# Columns holding categorical values (including the target, column 13)
columns_to_encode = [0, 3, 4, 5, 6, 8, 9, 11, 13]

# One-hot encode only the specified columns, dropping the first category of each
df = pd.get_dummies(df, columns=columns_to_encode, drop_first=True)

# Check the result (print df, the encoded copy, not the untouched Reg)
print("Updated DataFrame after one-hot encoding:")
print(df.head())


In [8]:
df.dtypes

Unnamed: 0,0
1,object
2,float64
7,float64
10,int64
12,int64
0_a,bool
0_b,bool
3_l,bool
3_u,bool
3_y,bool


In [9]:
# Features: all rows, all columns except the last
X = df.iloc[:, :-1]

# Target: the last column, which after encoding is the dummy derived from column 13
y = df.iloc[:, -1]

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# X and y come from the previous cell.
# Ensure all column names are strings (scikit-learn rejects mixed int/str column names)
X.columns = X.columns.astype(str)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data (without refitting)
X_test_scaled = scaler.transform(X_test)

# Print results
print("Original Training Data:\n", X_train)
print("\nScaled Training Data:\n", X_train_scaled)
print("\nOriginal Test Data:\n", X_test)
print("\nScaled Test Data:\n", X_test_scaled)

Original Training Data:
          1       2       7  10  12    0_a   0_b    3_l    3_u    3_y  ...  \
278  24.58  13.500   0.000   0   0  False  True  False  False   True  ...   
110  29.17   3.500   3.500   3   0  False  True  False   True  False  ...   
82   39.83   0.500   0.250   0   0  False  True  False   True  False  ...   
51   26.00   1.000   1.750   0   0  False  True  False   True  False  ...   
218  53.92   9.625   8.665   5   0  False  True  False   True  False  ...   
..     ...     ...     ...  ..  ..    ...   ...    ...    ...    ...  ...   
71   34.83   4.000  12.500   0   0  False  True  False   True  False  ...   
106  28.75   1.165   0.500   0   0  False  True  False   True  False  ...   
270  37.58   0.000   0.000   0   0  False  True  False  False  False  ...   
435  19.00   0.000   0.000   4   1  False  True  False  False   True  ...   
102  18.67   5.000   0.375   2  38  False  True  False   True  False  ...   

       6_h    6_j    6_n    6_o    6_v    6_z    8
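The scaling cell above fits the scaler on the training split only and reuses those statistics for the test split. A minimal sketch of what that implies, using synthetic arrays (`train` and `test` here are hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=10, size=(100, 2))
test = rng.normal(loc=50, scale=10, size=(20, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Training columns end up with mean ~0 and std ~1
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))

# Test columns are only approximately standardized, because they were
# scaled with the training mean and std rather than their own
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

Keeping the test statistics out of the fit avoids leaking information from the test set into preprocessing.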

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
import numpy as np

# Initialize the SimpleImputer to handle missing values
imputer = SimpleImputer(strategy='mean')

# Impute missing values in the training and test data
X_train_scaled = imputer.fit_transform(X_train_scaled)
X_test_scaled = imputer.transform(X_test_scaled)

# Initialize the Logistic Regression model
logistic_model = LogisticRegression()

# Fit the model on the scaled training data
logistic_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = logistic_model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Logistic Regression model:", accuracy)

# Display classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Display confusion matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
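The impute, scale, and fit steps can also be chained in a scikit-learn pipeline, which keeps the order of the steps explicit. The sketch below runs on synthetic data (`X_demo` and `y_demo` are hypothetical) and imputes before scaling, a common ordering, whereas the cell above scales first.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (hypothetical stand-in)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Blank out roughly 5% of the entries to mimic missing values
mask = rng.random(X_demo.shape) < 0.05
X_demo[mask] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Each step is fitted on the training split only when pipe.fit is called
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
pipe.fit(X_tr, y_tr)

demo_accuracy = pipe.score(X_te, y_te)
print("Demo test accuracy:", demo_accuracy)
```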

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
import numpy as np

# Initialize the SimpleImputer to handle missing values
imputer = SimpleImputer(strategy='mean')

# Impute missing values in the training and test data
X_train_scaled = imputer.fit_transform(X_train_scaled)
X_test_scaled = imputer.transform(X_test_scaled)

# Define the Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)

# Define hyperparameter grid for tuning
param_grid = {
    'n_estimators': [100, 200, 300],      # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],     # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],     # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],       # Minimum samples required at a leaf node
    'bootstrap': [True, False]           # Whether bootstrap samples are used
}

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,  # Use all available cores
    verbose=2
)

# Fit GridSearchCV on the training data
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and best model
best_params = grid_search.best_params_
best_rf_model = grid_search.best_estimator_

print("Best Hyperparameters:", best_params)

# Make predictions on the test set with the best model
y_pred = best_rf_model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy of Random Forest Classifier:", accuracy)

# Display classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Display confusion matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
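Before launching a grid search like the one above, it can help to count how many model fits it will trigger. A small sketch using `ParameterGrid` with the same grid values and `cv=5`:

```python
from sklearn.model_selection import ParameterGrid

# Same hyperparameter grid as in the cell above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 3 * 2
n_fits = n_candidates * 5                      # one fit per fold per candidate

print("Candidate settings:", n_candidates)  # 216
print("Total fits with cv=5:", n_fits)      # 1080
```

`GridSearchCV` additionally refits the best candidate on the full training set once the search finishes (its default `refit=True`).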


In [13]:
from sklearn.ensemble import RandomForestClassifier

# Assumes X_train_scaled, X_test_scaled, y_train and y_test are ready after imputation

# Initialize a Random Forest Classifier with default hyperparameters
rf_model = RandomForestClassifier(random_state=42)

# Fit the model on the training data
rf_model.fit(X_train_scaled, y_train)

# Evaluate on the test set; .score() returns the mean accuracy
best_score = rf_model.score(X_test_scaled, y_test)

# Print the test accuracy
print("Best Score (Test Accuracy):", best_score)


Best Score (Test Accuracy): 0.8478260869565217


In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

# Define the Random Forest Classifier with explicitly set hyperparameters
# (all values lie within the grid searched above)
best_rf_model = RandomForestClassifier(
    bootstrap=True,
    max_depth=10,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=200,
    random_state=42
)

# Set up stratified k-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(best_rf_model, X_train_scaled, y_train, cv=cv, scoring='accuracy', n_jobs=-1)

# Save the mean cross-validation accuracy to best_score
best_score = np.mean(cv_scores)

# Print the results
print("Cross-Validation Scores:", cv_scores)
print("Mean Cross-Validation Accuracy (Best Score):", best_score)


Cross-Validation Scores: [0.88288288 0.82882883 0.86363636 0.92727273 0.82727273]
Mean Cross-Validation Accuracy (Best Score): 0.8659787059787061
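The fold scores printed above can be summarized by their spread as well as their mean. This sketch copies the printed scores by hand into a hypothetical `reported_scores` array rather than rerunning the model:

```python
import numpy as np

# Fold accuracies copied from the output above
reported_scores = np.array(
    [0.88288288, 0.82882883, 0.86363636, 0.92727273, 0.82727273]
)

mean_score = reported_scores.mean()
std_score = reported_scores.std()

print(f"CV accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

With only five folds on a small training set, the spread between folds is a useful reminder that a single accuracy figure is a rough estimate.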
