# **Logistic Regression**

## **Predicting Credit Card Approvals**

The dataset we will use is the Credit Approval Dataset, which is a collection of credit card applications and the credit approval decisions.


The data is available from the UCI Machine Learning Repository: Quinlan,J. R.. Credit Approval. UCI Machine Learning Repository. https://doi.org/10.24432/C5FS30.

In [3]:
""" Drive Mounting """
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Load the dataset**

In [4]:
# Import pandas
import pandas as pd

# Load dataset
cc_apps = pd.read_csv("/content/drive/MyDrive/eada material/crx.data",header=None)

# Inspect data (show 5 first rows)
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


This file concerns credit card applications.  

All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.


## **Inspect the features and target variable (dependent variable)**

In [5]:
# Check the names of the columns
cc_apps.columns

Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64')

The probable features in a typical credit card application are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally the ApprovalStatus.

This gives us a pretty good starting point, and we can map these features with respect to the columns in the output.

In [6]:
cc_apps.rename(columns={cc_apps.columns[0]:'Male',cc_apps.columns[1]:'Age',cc_apps.columns[2]:'Debt',cc_apps.columns[3]:'Married',cc_apps.columns[4]:'BanckCustomer',
                        cc_apps.columns[5]:'EducationLevel', cc_apps.columns[6]:'Ethnicity',cc_apps.columns[7]:'YearsEmployed',cc_apps.columns[8]:'PriorDefault',cc_apps.columns[9]:'Employed',
                        cc_apps.columns[10]:'CreditScore',cc_apps.columns[11]:'DriversLicense',cc_apps.columns[12]:'Citizen',cc_apps.columns[13]:'ZipCode',cc_apps.columns[14]:'Income',
                        cc_apps.columns[15]: 'Approved'}, inplace=True)
cc_apps.head()

Unnamed: 0,Male,Age,Debt,Married,BanckCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [7]:
# Check distribution of "Approved" variable, the value counts
cc_apps["Approved"].value_counts()

Unnamed: 0_level_0,count
Approved,Unnamed: 1_level_1
-,383
+,307


In [8]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

print("\n")

# Inspect missing values in the dataset
cc_apps.tail(17)

              Age        Debt  YearsEmployed  CreditScore         Income
count  690.000000  690.000000     690.000000    690.00000     690.000000
mean    31.455391    4.758725       2.223406      2.40000    1017.385507
std     11.922910    4.978163       3.346513      4.86294    5210.102598
min     13.750000    0.000000       0.000000      0.00000       0.000000
25%     22.520000    1.000000       0.165000      0.00000       0.000000
50%     28.290000    2.750000       1.000000      0.00000       5.000000
75%     37.750000    7.207500       2.625000      3.00000     395.500000
max     80.250000   28.000000      28.500000     67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Male            690 non-null    object 
 1   Age             690 non-null    float64
 2   Debt            690 non-null    float64
 3   Marrie

Unnamed: 0,Male,Age,Debt,Married,BanckCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


Dataset issues:

  - Our dataset contains both numeric and non-numeric data (specifically data that are of float64, int64 and object types). Specifically, the features 1, 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.

  - The dataset also contains values from several ranges.

  - Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output.

Now, let's temporarily replace these missing value question marks with NaN.

# **Data Transformations**

In [9]:
# Import numpy
import numpy as np

# Inspect missing values in the dataset
print(cc_apps.tail(17))

# Replace the '?'s with numpy NaN
cc_apps = cc_apps.replace('?', np.nan)
# Inspect the missing values again
cc_apps.tail(17)

    Male    Age    Debt Married BanckCustomer EducationLevel Ethnicity  \
673    ?  29.50   2.000       y             p              e         h   
674    a  37.33   2.500       u             g              i         h   
675    a  41.58   1.040       u             g             aa         v   
676    a  30.58  10.665       u             g              q         h   
677    b  19.42   7.250       u             g              m         v   
678    a  17.92  10.210       u             g             ff        ff   
679    a  20.08   1.250       u             g              c         v   
680    b  19.50   0.290       u             g              k         v   
681    b  27.83   1.000       y             p              d         h   
682    b  17.08   3.290       u             g              i         v   
683    b  36.42   0.750       y             p              d         v   
684    b  40.58   3.290       u             g              m         v   
685    b  21.08  10.085       y       

Unnamed: 0,Male,Age,Debt,Married,BanckCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
673,,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


## **Handling the missing values**

We **replaced all the question marks with NaNs**. This is going to help us in the next missing value treatment that we are going to perform.

An important question that gets raised here is why are we giving so much importance to missing values? Can't they be just ignored? Ignoring missing values can affect the performance of a machine learning model heavily. While ignoring the missing values our machine learning model may miss out on information about the dataset that may be useful for its training.

So, to avoid this problem, we are going to impute the **missing values** with a strategy called **mean imputation**.

In [10]:
# Inpsect columns where there are nan values
print(cc_apps.isnull().sum())

# Impute the missing values with mean imputation for the numerical columns only
for col in cc_apps.select_dtypes(include=np.number):
    cc_apps[col] = cc_apps[col].fillna(cc_apps[col].mean())

# Count the number of NaNs in the dataset to verify
print(cc_apps.isnull().sum())

Male              12
Age                0
Debt               0
Married            6
BanckCustomer      6
EducationLevel     9
Ethnicity          9
YearsEmployed      0
PriorDefault       0
Employed           0
CreditScore        0
DriversLicense     0
Citizen            0
ZipCode           13
Income             0
Approved           0
dtype: int64
Male              12
Age                0
Debt               0
Married            6
BanckCustomer      6
EducationLevel     9
Ethnicity          9
YearsEmployed      0
PriorDefault       0
Employed           0
CreditScore        0
DriversLicense     0
Citizen            0
ZipCode           13
Income             0
Approved           0
dtype: int64


We have successfully taken care of the **missing values present in the numeric columns**. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6 and 13. All of these columns contain non-numeric data and this why the mean imputation strategy would not work here. This needs a different treatment.

We are going to impute these missing values with the **most frequent values as present** in the respective columns. This is good practice when it comes to imputing missing values **for categorical** data in general.

In [11]:
# Iterate over each column of cc_apps
for col in cc_apps:
    # Check if the column is of object type
    if cc_apps[col].dtypes == 'object':
        # Impute with the most frequent value
        cc_apps = cc_apps.fillna(cc_apps[col].value_counts())

# Count the number of NaNs in the dataset and print the counts to verify
cc_apps.isnull().sum()

Unnamed: 0,0
Male,12
Age,0
Debt,0
Married,6
BanckCustomer,6
EducationLevel,9
Ethnicity,9
YearsEmployed,0
PriorDefault,0
Employed,0


Features such as DriverLisence or ZipCode are not as important as the other features in the dataset for predicting credit card approvals.

In [12]:
# Before continuing, let's first drop the features "ZipCode" and "DriversLicense" since they do not probably have an effect on credit card approval
cc_apps = cc_apps.drop(['ZipCode','DriversLicense'], axis=1)

## **Preprocessing**

The missing values are now successfully handled.

There is still some **data preprocessing** needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into three main tasks:

1. **Convert the non-numeric data into numeric.**
2. **Split the data into train and test sets.**
3. **Scale the feature values to a uniform range.**

First, we will be converting all the non-numeric values into numeric ones. We do this because not only it results in a faster computation but also many machine learning models (and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called one-hot encoding.

In [13]:
cc_apps

Unnamed: 0,Male,Age,Debt,Married,BanckCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,Citizen,Income,Approved
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,g,0,+
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,g,560,+
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,g,824,+
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,s,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,g,0,-
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,g,394,-
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,g,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,g,750,-


## **Label Encoding y**

In [14]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

cc_apps['Approved'] = le.fit_transform(cc_apps['Approved'])
print(cc_apps.head())

print(cc_apps.Approved.value_counts())

  Male    Age   Debt Married BanckCustomer EducationLevel Ethnicity  \
0    b  30.83  0.000       u             g              w         v   
1    a  58.67  4.460       u             g              q         h   
2    a  24.50  0.500       u             g              q         h   
3    b  27.83  1.540       u             g              w         v   
4    b  20.17  5.625       u             g              w         v   

   YearsEmployed PriorDefault Employed  CreditScore Citizen  Income  Approved  
0           1.25            t        t            1       g       0         0  
1           3.04            t        t            6       g     560         0  
2           1.50            t        f            0       g     824         0  
3           3.75            t        t            5       g       3         0  
4           1.71            t        f            0       s       0         0  
Approved
1    383
0    307
Name: count, dtype: int64


## **Creating Dummies**

We can create (n-1) dummy variables avoiding falling into the dummy variable trap. Note that the effect will be analyzed with respect to one of the levels of the class. Eg. We will create a single dummy for 'Male' and assign it 1 if it is Male. So by default model will give an estimation for Female and by analyzing its coefficient, we can observe its effect as the gender changes to Male.

In [15]:
cc_aps_transformed = pd.get_dummies(cc_apps,drop_first=True, dtype=int)

In [16]:
for col in cc_aps_transformed.columns:
  print(col)

Age
Debt
YearsEmployed
CreditScore
Income
Approved
Male_b
Married_u
Married_y
BanckCustomer_gg
BanckCustomer_p
EducationLevel_c
EducationLevel_cc
EducationLevel_d
EducationLevel_e
EducationLevel_ff
EducationLevel_i
EducationLevel_j
EducationLevel_k
EducationLevel_m
EducationLevel_q
EducationLevel_r
EducationLevel_w
EducationLevel_x
Ethnicity_dd
Ethnicity_ff
Ethnicity_h
Ethnicity_j
Ethnicity_n
Ethnicity_o
Ethnicity_v
Ethnicity_z
PriorDefault_t
Employed_t
Citizen_p
Citizen_s


Here, we have created dummies for all the levels. Now dropping base category from each categorical variable and create one less dummy for each attribute.

In [18]:
# Drop base dummy attributes to aviod dummy variable trap
cc_aps_transformed=cc_aps_transformed.drop(
    ["Male_b","Married_y","BanckCustomer_p","EducationLevel_x","Ethnicity_z","Citizen_s"],axis=1)

In [19]:
cc_aps_transformed.head()

Unnamed: 0,Age,Debt,YearsEmployed,CreditScore,Income,Approved,Married_u,BanckCustomer_gg,EducationLevel_c,EducationLevel_cc,...,Ethnicity_dd,Ethnicity_ff,Ethnicity_h,Ethnicity_j,Ethnicity_n,Ethnicity_o,Ethnicity_v,PriorDefault_t,Employed_t,Citizen_p
0,30.83,0.0,1.25,1,0,0,1,0,0,0,...,0,0,0,0,0,0,1,1,1,0
1,58.67,4.46,3.04,6,560,0,1,0,0,0,...,0,0,1,0,0,0,0,1,1,0
2,24.5,0.5,1.5,0,824,0,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,27.83,1.54,3.75,5,3,0,1,0,0,0,...,0,0,0,0,0,0,1,1,1,0
4,20.17,5.625,1.71,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,1,0,0


## **Dataset Split**



Next, it is time to split our data into train set and test set.

In [25]:
cc_aps_transformed.columns

Index(['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income', 'Approved',
       'Married_u', 'BanckCustomer_gg', 'EducationLevel_c',
       'EducationLevel_cc', 'EducationLevel_d', 'EducationLevel_e',
       'EducationLevel_ff', 'EducationLevel_i', 'EducationLevel_j',
       'EducationLevel_k', 'EducationLevel_m', 'EducationLevel_q',
       'EducationLevel_r', 'EducationLevel_w', 'Ethnicity_dd', 'Ethnicity_ff',
       'Ethnicity_h', 'Ethnicity_j', 'Ethnicity_n', 'Ethnicity_o',
       'Ethnicity_v', 'PriorDefault_t', 'Employed_t', 'Citizen_p'],
      dtype='object')

In [26]:
## Features X and target variable y

# Features / Target
X = cc_aps_transformed[['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income',
       'Married_u', 'BanckCustomer_gg', 'EducationLevel_c',
       'EducationLevel_cc', 'EducationLevel_d', 'EducationLevel_e',
       'EducationLevel_ff', 'EducationLevel_i', 'EducationLevel_j',
       'EducationLevel_k', 'EducationLevel_m', 'EducationLevel_q',
       'EducationLevel_r', 'EducationLevel_w', 'Ethnicity_dd', 'Ethnicity_ff',
       'Ethnicity_h', 'Ethnicity_j', 'Ethnicity_n', 'Ethnicity_o',
       'Ethnicity_v', 'PriorDefault_t', 'Employed_t', 'Citizen_p']]
y = cc_aps_transformed["Approved"]

In [27]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                    y,
                                    test_size=0.3,
                                    random_state=0)


The data is now split into two separate sets - train and test sets respectively.

We are only left with one final preprocessing step of scaling before we can fit a machine learning model to the data.

In [28]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler and use it to rescale X_train and X_test
scaler = StandardScaler()
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)



## **Fitting a logistic regression model to the train set**

Essentially, predicting if a credit card application will be approved or not is a classification task. According to UCI, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.


Which model should we pick? A question to ask is: are the features that affect the credit card approval decision process correlated with each other? Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).

In [29]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(X_train,y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## **Making predictions and evaluating performance**

**How well does our model perform?**

We will now evaluate our model on the test set with respect to classification accuracy. We will also take a look the model's confusion matrix. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict the approval status of the applications as denied that originally got denied. If our model is not performing well in this aspect, then it might end up approving the application that should have been approved. The confusion matrix helps us to view our model's performance from these aspects.

In [35]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Mean accuracy on the given test data and labels: ", accuracy_score(y_test, y_pred))
print(accuracy_score)

# Print the confusion matrix of the logreg model
conf_matrix = confusion_matrix(y_test, y_pred)
print("confusion Matrix: ", conf_matrix);

Mean accuracy on the given test data and labels:  0.8454106280193237
<function accuracy_score at 0x7ff097b6ec00>
confusion Matrix:  [[77 13]
 [19 98]]




In [36]:
# Print the confusion matrix with information
negative_label = 'Denied  '
positive_label = 'Approved'
print(f"                  Predicted   Predicted  ")
print(f"                | {negative_label:<9} | {positive_label:<9} |")
print("-----------------------------------------")
print(f"Actual {negative_label:<7} | {conf_matrix[0][0]:<9} | {conf_matrix[0][1]:<9} |")
print(f"Actual {positive_label:<7} | {conf_matrix[1][0]:<9} | {conf_matrix[1][1]:<9} |")
print("-----------------------------------------")

                  Predicted   Predicted  
                | Denied    | Approved  |
-----------------------------------------
Actual Denied   | 77        | 13        |
Actual Approved | 19        | 98        |
-----------------------------------------


In [37]:
logreg.coef_

array([[ 1.68382611e-02,  2.78507369e-02, -9.44846430e-02,
        -2.38769251e-01, -5.92716685e-04, -1.51731301e-01,
        -6.69115752e-02,  5.93774716e-01, -2.50104823e-01,
         3.14814759e-01, -5.98281433e-02,  8.63082264e-01,
         7.49814092e-01,  9.26585254e-02,  3.34513998e-01,
         1.49231082e-01, -5.82732164e-02, -7.48166913e-03,
        -2.35136820e-01,  4.92894497e-02,  8.08764283e-01,
         1.36526177e-01,  5.21976159e-02, -5.47783401e-02,
         1.12934874e-02,  6.70621461e-01, -2.95134914e+00,
        -2.97568028e-01, -4.82821890e-01]])

In [38]:
logreg.intercept_

array([1.63788551])

In [39]:
import statsmodels.api as sm
model = sm.Logit(y_train,sm.add_constant(rescaledX_train),random_state=0)
result = model.fit()
print(result.summary())

         Current function value: 0.282405
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:               Approved   No. Observations:                  483
Model:                          Logit   Df Residuals:                      453
Method:                           MLE   Df Model:                           29
Date:                Mon, 24 Mar 2025   Pseudo R-squ.:                  0.5895
Time:                        21:19:06   Log-Likelihood:                -136.40
converged:                      False   LL-Null:                       -332.30
Covariance Type:            nonrobust   LLR p-value:                 3.405e-65
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2360   6670.554   3.54e-05      1.000   -1.31e+04    1.31e+04
x1            -0.0053      0.202     -0.026      0.979      -0.40



In [40]:
print(X_train.columns)

Index(['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income', 'Married_u',
       'BanckCustomer_gg', 'EducationLevel_c', 'EducationLevel_cc',
       'EducationLevel_d', 'EducationLevel_e', 'EducationLevel_ff',
       'EducationLevel_i', 'EducationLevel_j', 'EducationLevel_k',
       'EducationLevel_m', 'EducationLevel_q', 'EducationLevel_r',
       'EducationLevel_w', 'Ethnicity_dd', 'Ethnicity_ff', 'Ethnicity_h',
       'Ethnicity_j', 'Ethnicity_n', 'Ethnicity_o', 'Ethnicity_v',
       'PriorDefault_t', 'Employed_t', 'Citizen_p'],
      dtype='object')
