## Credit cards approval 

1. Title: Credit Approval

2. Sources: 
    (confidential)
    Submitted by quinlan@cs.su.oz.au

3.  Past Usage:

    See Quinlan,
    -  "Simplifying decision trees", Int J Man-Machine Studies 27,
      Dec 1987, pp. 221-234.
    -  "C4.5: Programs for Machine Learning", Morgan Kaufmann, Oct 1992
  
4.  Relevant Information:

    - This file concerns credit card applications.  All attribute names
    and values have been changed to meaningless symbols to protect
    confidentiality of the data.
  
    - This dataset is interesting because there is a good mix of
    attributes -- continuous, nominal with small numbers of
    values, and nominal with larger numbers of values.  There
    are also a few missing values.
  
5.  Number of Instances: 690

6.  Number of Attributes: 15 + class attribute

7.  Attribute Information:

   -  A1:	b, a.
   -  A2:	continuous.
   -  A3:	continuous.
   -  A4:	u, y, l, t.
   -  A5:	g, p, gg.
   -  A6:	c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
   -  A7:	v, h, bb, j, n, z, dd, ff, o.
   -  A8:	continuous.
   -  A9:	t, f.
   -  A10:	t, f.
   -  A11:	continuous.
   -  A12:	t, f.
   -  A13:	g, p, s.
   -  A14:	continuous.
   -  A15:	continuous.
   -  A16: +,-         (class attribute)

8.  Missing Attribute Values:
    37 cases (5%) have one or more missing values.  The missing
    values from particular attributes are:

   -  A1:  12
   -  A2:  12
   -  A4:   6
   -  A5:   6
   -  A6:   9
   -  A7:   9
   -  A14: 13

9.  Class Distribution
  
    +: 307 (44.5%)
    -: 383 (55.5%)

### Information
Commercial banks regularly receive a large number of credit card applications. Some of them are rejected because of low income, existing loans, or a poor individual credit history.

In this work we will apply some machine learning algorithms to this particular problem.
First we will look at the dataset and find that it has both numerical and non-numerical features. Missing values will be replaced by appropriate values.

After that we will preprocess the data to prepare it for a machine learning algorithm.

Finally we will build a model that can predict whether an application will be approved.

### 1. Loading and viewing the dataset

Because this data is confidential, the contributor of the dataset has anonymized the feature names.

In [1]:
# Import pandas
import pandas as pd

# Load dataset
data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data", header=None)

# Inspect data
data.head()

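|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |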


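### Brief explanation of the features

The listing below assigns a descriptive name to each of the 16 anonymized columns, in column order, and shows its type and a few sample values.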

$ Male          : num  1 1 0 0 0 0 1 0 0 0 ...

$ Age           : chr  "58.67" "24.50" "27.83" "20.17" ...

$ Debt          : num  4.46 0.5 1.54 5.62 4 ...

$ Married       : chr  "u" "u" "u" "u" ...

$ BankCustomer  : chr  "g" "g" "g" "g" ...

$ EducationLevel: chr  "q" "q" "w" "w" ...

$ Ethnicity     : chr  "h" "h" "v" "v" ...

$ YearsEmployed : num  3.04 1.5 3.75 1.71 2.5 ...

$ PriorDefault  : num  1 1 1 1 1 1 1 1 1 0 ...

$ Employed      : num  1 0 1 0 0 0 0 0 0 0 ...

$ CreditScore   : num  6 0 5 0 0 0 0 0 0 0 ...

$ DriversLicense: chr  "f" "f" "t" "f" ...

$ Citizen       : chr  "g" "g" "g" "s" ...

$ ZipCode       : chr  "00043" "00280" "00100" "00120" ...

$ Income        : num  560 824 3 0 0 ...

$ Approved      : chr  "+" "+" "+" "+" ...
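
As a sketch, the names in the listing above could be attached to the DataFrame in column order. The mapping of names to positions 0-15 is an assumption based on the order of the listing, and the one-row frame below is only a stand-in for `data` (it reuses the first row of the dataset).

```python
import pandas as pd

# Names taken from the listing above, assumed to follow column order 0..15
feature_names = [
    "Male", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "Approved",
]

# Stand-in for the loaded data: the first row of the dataset
sample = pd.DataFrame(
    [["b", 30.83, 0.0, "u", "g", "w", "v", 1.25, "t", "t", 1, "f", "g", 202, 0, "+"]]
)

# Work on a copy so that the integer column labels used in later cells keep working
named = sample.copy()
named.columns = feature_names
print(named.loc[0, ["Age", "Income", "Approved"]])
```

The later cells address columns by integer position, so the renamed copy is meant for inspection only.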

### 2. Print some statistics

In [2]:
# Print summary statistics
data.describe()

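|       | 2 | 7 | 10 | 14 |
|-------|---|---|----|----|
| count | 690.0 | 690.0 | 690.0 | 690.0 |
| mean  | 4.758725 | 2.223406 | 2.4 | 1017.385507 |
| std   | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min   | 0.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 1.0 | 0.165 | 0.0 | 0.0 |
| 50%   | 2.75 | 1.0 | 0.0 | 5.0 |
| 75%   | 7.2075 | 2.625 | 3.0 | 395.5 |
| max   | 28.0 | 28.5 | 67.0 | 100000.0 |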


In [3]:
# Print DataFrame information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null object
1     690 non-null object
2     690 non-null float64
3     690 non-null object
4     690 non-null object
5     690 non-null object
6     690 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    690 non-null object
14    690 non-null int64
15    690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.3+ KB


In [4]:
# Inspect two rows near the end of the data; row 673 has a missing value marked "?"
data.iloc[-18:-16,:]

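|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|-----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 672 | a | 50.25 | 0.835 | u | g | aa | v | 0.5 | f | f | 0 | t | g | 240 | 117 | - |
| 673 | ? | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | f | g | 256 | 17 | - |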


### 3. Working with missing values
What we have so far:

1) Both numeric and non-numeric data. Columns 2, 7, 10 and 14 hold numeric data, and the rest hold non-numeric values.

2) Features with very different value ranges (for example 0-28 for column 2 and 0-100000 for column 14).

3) Missing values in the dataset, labeled as "?".
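
One option for the third point, sketched here on a two-row excerpt rather than the full file, is to let `read_csv` treat `"?"` as missing while loading. The inline text is a stand-in for `crx.data`, with values as shown in the output above.

```python
import io
import pandas as pd

# Rows 672 and 673 as shown above; row 673 has "?" in column 0
raw = (
    "a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-\n"
    "?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-\n"
)

# na_values="?" turns the placeholder into NaN at load time
excerpt = pd.read_csv(io.StringIO(raw), header=None, na_values="?")

# Number of missing entries per column, keeping only columns that have any
missing_per_column = excerpt.isnull().sum()
print(missing_per_column[missing_per_column > 0])
```

Loading this way also lets pandas parse a column as numeric when `"?"` was its only non-numeric entry.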

In [5]:
# Import numpy
import numpy as np

# Count NaN values in the dataset (still 0, because missing entries are stored as "?")
print(data.isnull().values.sum())

# Replace the '?'s with NaN
data = data.replace("?", np.nan)

# Inspect the missing values again
data.iloc[-18:-16,:]

0


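|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|-----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 672 | a | 50.25 | 0.835 | u | g | aa | v | 0.5 | f | f | 0 | t | g | 240 | 117 | - |
| 673 | NaN | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | f | g | 256 | 17 | - |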


### Performing imputation

In [6]:
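# Impute missing values in the numeric columns with the column mean
# (numeric_only=True skips the object columns, which recent pandas versions require)
data = data.fillna(data.mean(numeric_only=True))

# Count the NaNs that remain; object columns are not affected by mean imputation
print(data.isnull().values.sum())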

print(data.info())

67
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     678 non-null object
1     678 non-null object
2     690 non-null float64
3     684 non-null object
4     684 non-null object
5     681 non-null object
6     681 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    677 non-null object
14    690 non-null int64
15    690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.3+ KB
None
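
The 67 remaining NaNs sit in object-typed columns, which mean imputation does not touch. Column 1 is among them even though the attribute list describes A2 as continuous. A possible variation (not what this notebook does) is to convert such a column to numbers before taking the mean. A minimal sketch on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: numbers stored as strings, with one missing entry
toy = pd.DataFrame({1: ["30.0", np.nan, "20.0"]})

# Convert the strings to floats; NaN stays NaN
toy[1] = pd.to_numeric(toy[1])

# Mean imputation now applies, because the column is numeric
toy = toy.fillna(toy.mean(numeric_only=True))
print(toy[1].tolist())
```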


### For non-numeric columns we replace missing values with the most frequent value

In [7]:
# Iterate over each column of data
for col in data.columns:
    # Check if the column is of object type
    if data[col].dtypes == 'object':
        # Impute with the most frequent value
        data[col] = data[col].fillna(data[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify

print(data.isnull().values.sum())

0
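
The same imputation can be sketched without an explicit loop using `DataFrame.mode`. The toy frame is hypothetical. Note that this variant fills numeric columns as well, so on the real data it is only equivalent when the numeric columns contain no NaNs, as is the case above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing entry per column
toy = pd.DataFrame({
    0: ["a", "b", "b", np.nan],
    3: ["u", "u", np.nan, "y"],
})

# mode() returns the most frequent value(s) per column; row 0 holds the first mode
most_frequent = toy.mode().iloc[0]

# fillna with a Series matches fill values to columns by label
filled = toy.fillna(most_frequent)
print(filled)
```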


### 4. Preprocessing 

The next steps demonstrate two procedures:

1) Converting the non-numeric data into numeric data

2) Scaling the values into a uniform range

In [8]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
encoder = LabelEncoder()

# Iterate over the columns and check their dtypes
for col in data.columns:
    # Check whether the dtype is object
    if data[col].dtype=='object':
        # Use LabelEncoder to do the numeric transformation
        data[col]=encoder.fit_transform(data[col])
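
To illustrate what the encoder does to a single column: `LabelEncoder` sorts the distinct values and replaces each one with its position in that order. A small sketch on the values of the class column:

```python
from sklearn.preprocessing import LabelEncoder

# Class labels as they appear in column 15
labels = ["+", "-", "-", "+"]

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(labels)

# classes_ lists the original values; the index of each value is its code
print(list(demo_encoder.classes_), list(codes))
```

With this ordering `+` becomes 0 and `-` becomes 1. Because the loop above refits one encoder for every column, only the mapping of the last encoded column remains available afterwards.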

In [9]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Drop features 10 and 13 and convert the DataFrame to a NumPy array
data = data.drop([data.columns[10],data.columns[13]], axis=1)
data = data.values

# Segregate features and labels into separate variables
X,y = data[:,0:13], data[:,13]


# Instantiate MinMaxScaler and use it to rescale
scaler = MinMaxScaler(feature_range=(0,1))
rescaledX = scaler.fit_transform(X)
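
`MinMaxScaler` with `feature_range=(0,1)` maps each column through x' = (x - min) / (max - min). A short check of that formula on hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature with a wide range
X_toy = np.array([[0.0], [5.0], [395.5], [100000.0]])

# Scale with scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

# Same result computed by hand, column by column
col_min, col_max = X_toy.min(axis=0), X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))
```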

### 5. Split our data into train and test samples

In [10]:
# Import train_test_split

from sklearn.model_selection import train_test_split

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(rescaledX,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)
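
In the cells above the scaler is fitted on all rows before the split, so the test rows influence the scaling. A common variation, shown here only as a sketch on hypothetical random data, is to split first and fit the scaler on the training part alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the encoded features and labels
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split the raw features first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both parts
demo_scaler = MinMaxScaler(feature_range=(0, 1))
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside the 0-1 range when a test row exceeds the training minimum or maximum.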

### 6. Apply a logistic regression model to the train set

In [11]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### 7. Prediction, evaluation

In [12]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(X_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(X_test, y_test))

# Print the confusion matrix of the logreg model

confusion_matrix(y_test, y_pred)

Accuracy of logistic regression classifier:  0.8357487922705314


array([[87, 10],
       [24, 86]], dtype=int64)
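
The accuracy can be recomputed from the confusion matrix, where rows are true classes and columns are predicted classes. The per-class recall follows from the same numbers. Reading class 0 as `+` and class 1 as `-` is an assumption based on `LabelEncoder` sorting the label values.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[87, 10],
               [24, 86]])

# Accuracy: correctly classified instances over all instances
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal entry over its row total
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

Under that assumption, 24 of the 110 test applications that were actually rejected were predicted as approved.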

### 8. Grid searching, improvement of the model
We will grid search over the following two parameters:

    tol
    max_iter


In [13]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)
print(param_grid)

{'tol': [0.01, 0.001, 0.0001], 'max_iter': [100, 150, 200]}


### 9. Confirming best model

In [14]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_,grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.855072 using {'max_iter': 100, 'tol': 0.01}
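
Note that the grid search is fitted on `rescaledX`, which was scaled using all rows. One possible refinement, sketched on hypothetical synthetic data, is to put the scaler and the classifier into a `Pipeline` so that scaling is refitted inside each cross-validation fold. The step names `scaler` and `logreg` are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data standing in for the credit features
X_syn, y_syn = make_classification(n_samples=200, n_features=13, random_state=42)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", MinMaxScaler(feature_range=(0, 1))),
    ("logreg", LogisticRegression()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
pipe_grid = {
    "logreg__tol": [0.01, 0.001, 0.0001],
    "logreg__max_iter": [100, 150, 200],
}

search = GridSearchCV(estimator=pipe, param_grid=pipe_grid, cv=5)
search.fit(X_syn, y_syn)
print(search.best_params_)
```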
