# Predicting Credit Card Requests Approval

#### The purpose of this project is to predict whether a credit card request will be approved based on the client's information. Historical data from previous requests are used to develop a logistic regression model.

#### For the scaling task, I use two methods, normalization and standardization, and compare their effect on model performance.

## Importing libraries and data

In [2]:
import pandas as pd
import numpy as np

cc = pd.read_csv('data.csv')

print(cc.shape)

print(cc.head())

# assigning column names from the codebook
cc.columns = ['Male','Age','Debt','Married','BankCustomer',
              'EducationLevel','Ethnicity','YearsEmployed',
              'PriorDefault','Employed','CreditScore',
              'DriverLicense','Citizen','ZipCode','Income',
              'Approved']

print(cc.head())

(689, 16)
   b  30.83      0  u  g  w  v  1.25  t t.1  01  f g.1  00202  0.1  +
0  a  58.67  4.460  u  g  q  h  3.04  t   t   6  f   g  00043  560  +
1  a  24.50  0.500  u  g  q  h  1.50  t   f   0  f   g  00280  824  +
2  b  27.83  1.540  u  g  w  v  3.75  t   t   5  t   g  00100    3  +
3  b  20.17  5.625  u  g  w  v  1.71  t   f   0  f   s  00120    0  +
4  b  32.08  4.000  u  g  m  v  2.50  t   f   0  t   g  00360    0  +
  Male    Age   Debt Married BankCustomer EducationLevel Ethnicity  \
0    a  58.67  4.460       u            g              q         h   
1    a  24.50  0.500       u            g              q         h   
2    b  27.83  1.540       u            g              w         v   
3    b  20.17  5.625       u            g              w         v   
4    b  32.08  4.000       u            g              m         v   

   YearsEmployed PriorDefault Employed  CreditScore DriverLicense Citizen  \
0           3.04            t        t            6             f       
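
The first printed header above (`b  30.83  0  u ...`) shows that `read_csv` used the first record as column names, so that record is not part of the 689 rows. A minimal sketch of reading a headerless file follows, assuming `data.csv` really has no header row; the two-record inline sample is a stand-in for the file, and `demo` and `names` are illustrative names.

```python
import io
import pandas as pd

# Column names from the codebook, as assigned in the cell above.
names = ['Male', 'Age', 'Debt', 'Married', 'BankCustomer',
         'EducationLevel', 'Ethnicity', 'YearsEmployed',
         'PriorDefault', 'Employed', 'CreditScore',
         'DriverLicense', 'Citizen', 'ZipCode', 'Income',
         'Approved']

# Two records taken from the printed output, standing in for data.csv.
sample = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,00043,560,+\n"
)

# header=None keeps the first record as data; names= labels the columns.
demo = pd.read_csv(sample, header=None, names=names)
print(demo.shape)
```

With the real file, the same two arguments would replace the separate `cc.columns = [...]` assignment.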

## Inspecting Data

In [5]:
print(cc.info())

print(cc.describe())

print(cc.describe(include=object))  # np.object was removed in NumPy 1.24; the builtin object works everywhere


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689 entries, 0 to 688
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Male            689 non-null    object 
 1   Age             689 non-null    object 
 2   Debt            689 non-null    float64
 3   Married         689 non-null    object 
 4   BankCustomer    689 non-null    object 
 5   EducationLevel  689 non-null    object 
 6   Ethnicity       689 non-null    object 
 7   YearsEmployed   689 non-null    float64
 8   PriorDefault    689 non-null    object 
 9   Employed        689 non-null    object 
 10  CreditScore     689 non-null    int64  
 11  DriverLicense   689 non-null    object 
 12  Citizen         689 non-null    object 
 13  ZipCode         689 non-null    object 
 14  Income          689 non-null    int64  
 15  Approved        689 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.2+ KB
None
             Deb

#### The variables consist of 4 numeric and 12 object columns.

#### The Income variable is on a very different scale from the other numeric variables, which shows a need for normalization/standardization.

#### There are some undefined '?' values in the data.

## Preprocessing

### Replacing '?' with nan

In [7]:
cc = cc.replace('?',np.nan)

print(cc.isnull().values.sum()) # there are 67 missing values


67
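
The total of 67 does not show which columns are affected. A small sketch of a per-column count, using a hypothetical three-row frame (`toy`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) using '?' as the missing-value marker.
toy = pd.DataFrame({'Age': ['30.83', '?', '24.50'],
                    'Married': ['u', 'y', '?'],
                    'Debt': [0.0, 4.46, 0.5]})

toy = toy.replace('?', np.nan)

# isnull().sum() counts missing entries column by column.
per_column = toy.isnull().sum()
print(per_column)
```

Applied to `cc`, the same call would list the missing count for each of the 16 columns.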


### Replacing NaN with the column mean for numeric columns and with the most frequent value for object columns

In [9]:
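# replace missing numeric values with the column mean
cc.fillna(cc.mean(numeric_only=True), inplace=True)

# replace missing non-numeric values with the column's most frequent value (if any)
for col in cc:
    if cc[col].dtype == 'object':
        # fill only this column; filling the whole frame would push one column's value into all others
        cc[col] = cc[col].fillna(cc[col].value_counts().index[0])

print(cc.isnull().values.sum()) # there are no missing values now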


0
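
Age is listed as `object` in the `info()` output, presumably because of its '?' entries, so it is imputed as a category rather than by its mean. A hedged sketch of converting such a column to numbers first and then imputing each column by its own rule, on a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): Age stored as text, one gap per column.
toy = pd.DataFrame({'Age': ['30.0', np.nan, '20.0'],
                    'Married': ['u', 'u', np.nan]})

# Convert the text column to float so that a mean can be computed.
toy['Age'] = pd.to_numeric(toy['Age'])

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # numeric column: fill with the column mean
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # non-numeric column: fill with the most frequent value
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Whether Age should be treated this way in the real data is a modeling choice; the sketch only shows the mechanics.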


### Encoding object dtype as numeric

#### Logistic regression requires numeric inputs, so every object column is encoded as integers with LabelEncoder. For two-valued columns this is equivalent to a 0/1 indicator. For columns with more than two categories, the integer codes impose an arbitrary order that a linear model treats as a magnitude; I accept this simplification here instead of applying get_dummies or one-hot encoding.

In [10]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

for col in cc.columns:
    if cc[col].dtype == 'object':
        # fit_transform refits the encoder, so one instance can be reused per column
        cc[col] = le.fit_transform(cc[col])

print(cc.info()) # all dtypes are numeric now
print(cc)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689 entries, 0 to 688
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Male            689 non-null    int64  
 1   Age             689 non-null    int64  
 2   Debt            689 non-null    float64
 3   Married         689 non-null    int64  
 4   BankCustomer    689 non-null    int64  
 5   EducationLevel  689 non-null    int64  
 6   Ethnicity       689 non-null    int64  
 7   YearsEmployed   689 non-null    float64
 8   PriorDefault    689 non-null    int64  
 9   Employed        689 non-null    int64  
 10  CreditScore     689 non-null    int64  
 11  DriverLicense   689 non-null    int64  
 12  Citizen         689 non-null    int64  
 13  ZipCode         689 non-null    int64  
 14  Income          689 non-null    int64  
 15  Approved        689 non-null    int64  
dtypes: float64(2), int64(14)
memory usage: 86.2 KB
None
     Male  Age    Debt  Marr
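
LabelEncoder assigns integers in sorted label order, and a linear model reads those integers as magnitudes. A minimal sketch contrasting it with one-hot encoding, on a hypothetical single-column frame (`toy`), in case the implied order turns out to matter:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical values) with three categories.
toy = pd.DataFrame({'Ethnicity': ['v', 'h', 'bb', 'v']})

# Label encoding: one integer per category, following sorted label order.
labels = LabelEncoder().fit_transform(toy['Ethnicity'])
print(labels)

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy['Ethnicity'], prefix='Ethnicity')
print(one_hot)
```

One-hot encoding adds a column per category, so it trades a wider feature matrix for the absence of an artificial ordering.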

### Feature selection

#### I drop a couple of features that are intuitively irrelevant. Other features may also be irrelevant, but I let the model handle them by assigning them coefficients close to zero.

In [11]:
cc = cc.drop(['DriverLicense', 'ZipCode'], axis=1)

print(cc.shape)

(689, 14)
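
Whether the model really gives an irrelevant feature a small coefficient can be checked through the fitted `coef_` attribute. The sketch below uses synthetic data (one informative feature, one pure-noise feature), not the credit data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

informative = rng.normal(size=n)   # drives the label
noise = rng.normal(size=n)         # unrelated to the label
X_demo = np.column_stack([informative, noise])
y_demo = (informative + 0.1 * rng.normal(size=n) > 0).astype(int)

model = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem.
coef_informative, coef_noise = model.coef_[0]
print(coef_informative, coef_noise)
```

Coefficients are only comparable across features that share a scale, and the default L2 penalty shrinks coefficients without forcing any of them to exactly zero.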


## Developing a model

### Splitting Data into a train set and a test set

In [13]:
from sklearn.model_selection import train_test_split

X = cc.iloc[:,:-1].values #also convert to numpy array

Y = cc.iloc[:,-1].values  #also convert to numpy array

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,
                                test_size=0.25,
                                random_state=0)

### Scaling 1: Normalization

In [15]:
from sklearn.preprocessing import MinMaxScaler

norm = MinMaxScaler().fit(X_train)

X_train_norm = norm.transform(X_train)

X_test_norm = norm.transform(X_test)
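
As a check on what `MinMaxScaler` computes, the transform is `(x - min) / (max - min)` per column, with `min` and `max` taken from the training set only. A small sketch on hypothetical arrays:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test arrays with two features.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

# Per-column minimum and maximum, learned from the training set only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual_test = (test - col_min) / (col_max - col_min)

# The fitted scaler applies the same training-set statistics to the test set.
sklearn_test = MinMaxScaler().fit(train).transform(test)

print(manual_test)
print(np.allclose(manual_test, sklearn_test))
```

Because the statistics come from the training set, a test value can fall outside [0, 1], as the second feature does here.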


## Train a logistic regression model (normal scaling)

In [16]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state= 0)

logreg.fit(X_train_norm, Y_train)

LogisticRegression(random_state=0)

## Evaluate model performance (normal scaling)


In [17]:
from sklearn.metrics import confusion_matrix, accuracy_score

#predict
Y_pred = logreg.predict(X_test_norm)

print(confusion_matrix(Y_test, Y_pred))

print(accuracy_score(Y_test, Y_pred))


[[70 12]
 [14 77]]
0.8497109826589595
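
The accuracy can be recovered from the confusion matrix by hand, along with precision and recall for class 1. The sketch assumes scikit-learn's convention of true classes in rows and predicted classes in columns. Since LabelEncoder sorts labels, '+' should map to 0 and '-' to 1, so class 1 here is most likely the rejected requests.

```python
# Confusion matrix reported above (rows: true class, columns: predicted class).
cm = [[70, 12],
      [14, 77]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of correct predictions
precision = tp / (tp + fp)     # of those predicted class 1, share truly class 1
recall = tp / (tp + fn)        # of those truly class 1, share predicted class 1

print(total, round(accuracy, 4), round(precision, 4), round(recall, 4))
# prints: 173 0.8497 0.8652 0.8462
```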


### Scaling 2: Standardization

In [21]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler().fit(X_train)

X_train_std = sc.transform(X_train)

X_test_std = sc.transform(X_test)
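
For comparison with min-max scaling, `StandardScaler` computes `(x - mean) / std` per column, again with statistics from the training set. A minimal sketch on a hypothetical array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training array with two features.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])

col_mean = train.mean(axis=0)
col_std = train.std(axis=0)   # population std (ddof=0), as StandardScaler uses
manual_train = (train - col_mean) / col_std

sklearn_train = StandardScaler().fit(train).transform(train)

print(manual_train)
print(np.allclose(manual_train, sklearn_train))
```

After the transform each training column has mean 0 and standard deviation 1, but unlike min-max scaling the values are not confined to a fixed range.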

## Train a logistic regression model (standard scaling)


In [22]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state= 0)

logreg.fit(X_train_std, Y_train)

LogisticRegression(random_state=0)

## Evaluate model performance (standard scaling)



In [25]:
from sklearn.metrics import confusion_matrix, accuracy_score

#predict
Y_pred_std = logreg.predict(X_test_std)

print(confusion_matrix(Y_test, Y_pred_std))

print(accuracy_score(Y_test, Y_pred_std))


[[67 15]
 [10 81]]
0.8554913294797688


## Discussion

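A logistic regression model is built and tested with both normalization and standardization. Standardization gives slightly higher test accuracy (0.8555 vs. 0.8497), which amounts to one more correctly classified request out of the 173 in the test set.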
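
A comparison based on a single train/test split can hinge on one or two samples. One possible way to make it more robust is cross-validation with the scaler inside a pipeline, so that it is refit on each training fold. The sketch below runs on synthetic data from `make_classification`, not on the credit data, and its scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

results = {}
for name, scaler in [('normalization', MinMaxScaler()),
                     ('standardization', StandardScaler())]:
    # The pipeline fits the scaler on the training part of each fold only.
    model = make_pipeline(scaler, LogisticRegression(random_state=0))
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = scores.mean()

print(results)
```

With the real data, `X` and `Y` from the splitting step would take the place of `X_demo` and `y_demo`.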