![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and most commercial banks now do so. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [216]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None) 
cc_apps.head()

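|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |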


In [217]:
# Splitting the data into features and labels 
X = cc_apps.drop(13, axis=1)
y = cc_apps[13]
X[1].value_counts()
X.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
dtype: int64
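
The all-zero counts above do not mean the data is complete: missing entries in this dataset are written as the string `'?'`, which `isna()` does not treat as null. A minimal sketch of the effect, using a hypothetical three-row frame (`toy` and its values are made up for illustration and are not taken from `cc_apps`):

```python
import pandas as pd

# Hypothetical toy rows mimicking the dataset's layout; '?' marks a missing entry
toy = pd.DataFrame({
    0: ["b", "?", "a"],
    1: ["30.83", "58.67", "?"],
    2: [0.0, 4.46, 0.5],
})

# '?' is an ordinary string, so isna() finds no nulls
nan_counts = toy.isna().sum()

# Counting the placeholder explicitly reveals the hidden gaps
placeholder_counts = (toy == "?").sum()

print(nan_counts.tolist())
print(placeholder_counts.tolist())
```

Running the same `(X == '?').sum()` check on the real features is one way to see which columns need cleaning before any imputation.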

# Data Preprocessing

- **Handling Missing Values:**
  - Replaced `'?'` with `np.nan` in all columns using the `replace_to_null` function.
  - Checked for missing values with `X.isna().sum()`.

- **Converting Data Types:**
  - Converted columns to appropriate data types:
    - Categorical: Used the `convert_cat` function.
    - Float: Used the `convert_float` function.
  - Specific conversions:
    - `X[0]` to categorical.
    - `X[1]` to float.
    - Columns `3` to `6` to categorical.
    - Column `10` to int.
    - Columns `8` to `9` to categorical.
    - Column `11` to categorical.

- **Imputing Missing Values:**
  - Replaced missing values in the numerical column `1` with its median.
  - Replaced missing values in categorical columns with the mode.

- **Checking Categories:**
  - Selected categorical columns
  - Printed unique values and value counts for each categorical column.

- **Final Checks:**
  - Confirmed no missing values

In [218]:
# Preprocessing the data
# Some columns contain '?'; it is safe to interpret these as missing values (np.nan)
# Helper that replaces a placeholder string with np.nan in a single column
def replace_to_null(df, col, string):
    df.loc[df[col] == string, col] = np.nan
    return df
for col in X.columns:
    X = replace_to_null(X, col, '?')
X.isna().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
dtype: int64
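
As an alternative to replacing placeholders after loading, `pd.read_csv` accepts an `na_values` argument that marks them as missing at read time. The sketch below assumes a tiny in-memory CSV (`raw_text` is hypothetical sample text, not the real file) so that it runs without `cc_approvals.data`:

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the real file
raw_text = "b,30.83,0.0\n?,58.67,4.46\na,?,0.5\n"

# na_values="?" tells the parser to read '?' as NaN
demo = pd.read_csv(io.StringIO(raw_text), header=None, na_values="?")

missing_per_column = demo.isna().sum()
print(missing_per_column.tolist())
print(demo.dtypes)
```

A side effect is that a numeric column containing `'?'` is parsed as float directly, so a separate float conversion step may become unnecessary.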

In [219]:
# Preprocessing the data
# Checking datatypes and converting them to appropriate datatypes
def convert_cat(df,col):
    df[col] = df[col].astype('category')
    final_var = df[col]
    return final_var
def convert_float(df, col):
    df[col] = df[col].astype('float')
    final_df = df[col]
    return final_df
X[0] = convert_cat(X, 0)
X[1] = convert_float(X, 1)
for col in X.columns[3:7]:
    X[col] = convert_cat(X, col)
X[10] = X[10].astype(int)
for col in X.columns[8:10]:
    X[col] = convert_cat(X, col)
X[11] = convert_cat(X,11)

# Replacing missing numerical values with the median
X[1] = X[1].fillna(X[1].median())
# Replacing missing categorical values with the mode (most frequent category)
for col in X.select_dtypes(include='category').columns:
    mode = X[col].mode()[0]
    X[col] = X[col].fillna(mode)

# Checking categories if cleaned
cat_df = X.select_dtypes(include='category')
for col in cat_df.columns:
    display(cat_df[col].unique())
# Confirm no columns with missing values
print(X.isna().sum()) 

['b', 'a']
Categories (2, object): ['a', 'b']

['u', 'y', 'l']
Categories (3, object): ['l', 'u', 'y']

['g', 'p', 'gg']
Categories (3, object): ['g', 'gg', 'p']

['w', 'q', 'm', 'r', 'cc', ..., 'i', 'e', 'aa', 'ff', 'j']
Length: 14
Categories (14, object): ['aa', 'c', 'cc', 'd', ..., 'q', 'r', 'w', 'x']

['v', 'h', 'bb', 'ff', 'j', 'z', 'o', 'dd', 'n']
Categories (9, object): ['bb', 'dd', 'ff', 'h', ..., 'n', 'o', 'v', 'z']

['t', 'f']
Categories (2, object): ['f', 't']

['t', 'f']
Categories (2, object): ['f', 't']

['g', 's', 'p']
Categories (3, object): ['g', 'p', 's']

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
dtype: int64


# Logistic Regression Model Building

- **Data Preparation:**
  - Created dummy variables
  - Ensured column names are strings for compatibility with the scaler
  - Split data into training and test sets
  - Checked column variance and determined scaling was necessary

- **Scaling:**
  - Instantiated StandardScaler
  - Scaled training data
  - Scaled test data

- **Model and Hyperparameter Tuning:**
  - Instantiated Logistic Regression model
  - Defined hyperparameters for GridSearchCV
  - Performed GridSearchCV with 10-fold cross-validation, parallel processing, and accuracy scoring
  - Fitted the grid search to the scaled training data

- **Results:**
  - Best parameters found
  - Best cross-validation score
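
The dummy-variable and column-name steps listed above can be illustrated on a small frame. This is a sketch under assumptions: `toy` is a hypothetical two-column frame with integer labels, chosen to mirror the headerless CSV, and it is not the real feature set.

```python
import pandas as pd

# Hypothetical toy frame with integer column labels, like the headerless CSV
toy = pd.DataFrame({
    2: [0.0, 4.46, 0.5],
    11: pd.Categorical(["g", "s", "p"]),
})

# One-hot encode; drop_first=True drops the first category ('g') as the baseline
toy_dummies = pd.get_dummies(toy, drop_first=True)
mixed_labels = list(toy_dummies.columns)  # an int label next to string labels

# Cast every label to str so downstream estimators see one consistent type
toy_dummies.columns = [str(c) for c in toy_dummies.columns]
string_labels = list(toy_dummies.columns)

print(mixed_labels)
print(string_labels)
```

The numeric column keeps its integer label while the new dummy columns get string labels, and recent scikit-learn versions reject frames whose column names mix types. That is why the labels are cast to strings before scaling.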

In [220]:
# Finished cleaning
# Let's build the model
# Creating dummy variables
X_dummies = pd.get_dummies(X, drop_first=True)
# Making sure all column names are strings so the scaler accepts them
X_dummies.columns = [str(column) for column in X_dummies.columns]
# Splitting into train and test set
X_train, X_test, y_train, y_test = train_test_split(X_dummies, y, test_size=0.25, random_state=5)
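# Checking the variance of the columns to see if scaling is necessary
X_dummies.var()
# The variances differ widely across columns, so scaling is needed
# Instantiate the scaler, fit it on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic regression model and hyperparameter grid
# C must be strictly positive, so the grid starts at 0.1 rather than 0
logreg = LogisticRegression(random_state=5)
params = {'C': np.arange(0.1, 1, 0.1),
          'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
          'max_iter': range(10, 150, 10)}
# 10-fold cross-validated grid search, scored on accuracy, using all CPU cores
cross_val = GridSearchCV(estimator=logreg, param_grid=params, cv=10, n_jobs=-1, scoring='accuracy')
cross_val.fit(X_train_scaled, y_train)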
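
One possible variation, not used in this notebook, is to wrap the scaler and the model in a scikit-learn `Pipeline` so that the scaler is refit inside each cross-validation fold instead of once on the whole training set. The sketch below makes several assumptions: the data comes from `make_classification` as a synthetic stand-in for the encoded features, the grid is trimmed to `C` only to keep it quick, and the names `X_demo`, `pipe`, and `search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded credit card features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=5
)

# Scaling and the classifier are chained, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(random_state=5, max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {"logreg__C": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(search.best_score_)
print(test_accuracy)
```

With this layout the raw `X_train` and `X_test` would be passed to `fit` and `score` directly, since the pipeline handles the scaling itself.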


In [221]:
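# Best mean cross-validated accuracy and the hyperparameters that produced it
best_score = cross_val.best_score_
best_param = cross_val.best_params_
# Evaluate the refitted best model on the held-out test set
y_pred = cross_val.predict(X_test_scaled)
accuracy = cross_val.score(X_test_scaled, y_test)
print(accuracy)    # test-set accuracy
print(best_score)  # best cross-validation accuracy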

0.861271676300578
0.8529034690799397
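
Accuracy alone does not show which kind of mistake the model makes. The imported `confusion_matrix` can break the predictions down by class, as in this sketch. The label lists here are hypothetical and only demonstrate the call; reading `'+'` as approved and `'-'` as rejected is an assumption about the target encoding.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (assumed: '+' approved, '-' rejected)
y_true_demo = ["+", "+", "-", "-", "+", "-"]
y_pred_demo = ["+", "-", "-", "-", "+", "+"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["-", "+"])
tn, fp, fn, tp = cm.ravel()

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(cm)
print(demo_accuracy)
```

In the notebook, passing `y_test` and `y_pred` in place of the demo lists would give the breakdown for the fitted model.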
