## Project Description
Commercial banks process numerous credit card applications, many of which are rejected. In this project, you'll use supervised machine learning techniques to automate the approval decision, making the process more efficient and cost-effective for banks.

![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

##### Use supervised learning techniques to automate the credit card approval process for banks.

- Preprocess the data and apply supervised learning techniques to find the best model and parameters for the job. Save the accuracy score from your best model as a numeric variable, `best_score`. Aim for an accuracy score of at least 0.75. The target variable is the last column of the DataFrame.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

In [2]:
cc_apps = pd.read_csv("cc_approvals.data", header=None) 
cc_apps.head()

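|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |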


In [3]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    int64  
 13  13      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB


In [4]:
cc_apps.value_counts()

0  1      2       3  4  5   6  7      8  9  10  11  12    13
?  20.08  0.125   u  g  q   v  1.000  f  t  1   g   768   +     1
b  30.17  6.500   u  g  cc  v  3.125  t  t  8   g   1200  +     1
   29.67  1.415   u  g  w   h  0.750  t  t  1   g   100   +     1
   29.83  1.250   y  p  k   v  0.250  f  f  0   g   0     -     1
          2.040   y  p  x   h  0.040  f  f  0   g   1     -     1
                                                               ..
   16.50  0.125   u  g  c   v  0.165  f  f  0   g   0     -     1
   16.92  0.335   y  p  k   v  0.290  f  f  0   s   0     -     1
   17.08  0.085   y  p  c   v  0.040  f  f  0   g   722   -     1
          0.250   u  g  q   v  0.335  f  t  4   g   8     -     1
   ?      10.500  u  g  x   v  6.500  t  f  0   g   0     +     1
Name: count, Length: 690, dtype: int64

In [5]:
# Replace the '?'s with NaN in dataset
cc_apps_nans_replaced = cc_apps.replace("?", np.nan)
cc_apps_nans_replaced.value_counts()

0  1      2       3  4   5   6  7       8  9  10  11  12      13
a  15.75  0.375   u  g   c   v  1.000   f  f  0   g   18      -     1
b  29.58  4.500   u  g   w   v  7.500   t  t  2   g   0       +     1
   29.67  0.750   y  p   c   v  0.040   f  f  0   g   0       -     1
          1.415   u  g   w   h  0.750   t  t  1   g   100     +     1
   29.83  1.250   y  p   k   v  0.250   f  f  0   g   0       -     1
                                                                   ..
   17.42  6.500   u  g   i   v  0.125   f  f  0   g   100     -     1
   17.50  22.000  l  gg  ff  o  0.000   f  f  0   p   100000  +     1
   17.58  10.000  u  g   w   h  0.165   f  t  1   g   1       -     1
   17.67  4.460   u  g   c   v  0.250   f  f  0   s   0       -     1
   76.75  22.290  u  g   e   z  12.750  t  t  1   g   109     +     1
Name: count, Length: 659, dtype: int64

In [6]:
cc_apps_nans_replaced.isna().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13     0
dtype: int64
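
Column 1 reports 12 missing values, yet `cc_apps.info()` lists it as `object`: the `?` placeholders made pandas read the whole column as strings. If it stays that way, later steps will treat it as categorical. Below is a minimal sketch of restoring a numeric dtype, assuming (the notebook does not state this) that column 1 is a continuous feature. The toy frame `demo` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking columns 0-2 of cc_apps
demo = pd.DataFrame(
    {0: ["b", "a", "?"], 1: ["30.83", "?", "24.5"], 2: [0.0, 4.46, 0.5]}
)

demo = demo.replace("?", np.nan)
# Column 1 still holds strings here, so it is not a numeric dtype yet
was_numeric = pd.api.types.is_numeric_dtype(demo[1])

# Convert numeric-looking strings to floats; unparseable entries become NaN
demo[1] = pd.to_numeric(demo[1], errors="coerce")
print(was_numeric, demo[1].dtype)
```

After such a conversion, the column would be imputed with its mean rather than its most frequent value by the loop in the imputation cell.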

In [7]:
# Create a copy of the NaN replacement DataFrame
cc_apps_imputed = cc_apps_nans_replaced.copy()

In [8]:
for col in cc_apps_imputed.columns:
    if cc_apps_imputed[col].dtypes == 'object':
        cc_apps_imputed[col] = cc_apps_imputed[col].fillna(
            cc_apps_imputed[col].value_counts().index[0]
        )
    else:
        cc_apps_imputed[col] = cc_apps_imputed[col].fillna(cc_apps_imputed[col].mean())

In [9]:
cc_apps_imputed

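|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.000 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.460 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.50 | 0.500 | u | g | q | h | 1.50 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.540 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 685 | b | 21.08 | 10.085 | y | p | e | h | 1.25 | f | f | 0 | g | 0 | - |
| 686 | a | 22.67 | 0.750 | u | g | c | v | 2.00 | f | t | 2 | g | 394 | - |
| 687 | a | 25.25 | 13.500 | y | p | ff | ff | 2.00 | f | t | 1 | g | 1 | - |
| 688 | b | 17.92 | 0.205 | u | g | aa | v | 0.04 | f | f | 0 | g | 750 | - |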


In [10]:
# one hot encoding 
cc_apps_encoded = pd.get_dummies(cc_apps_imputed, dtype='int', drop_first=True)
cc_apps_encoded

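|   | 2 | 7 | 10 | 12 | 0_b | 1_15.17 | 1_15.75 | 1_15.83 | 1_15.92 | 1_16.00 | ... | 6_j | 6_n | 6_o | 6_v | 6_z | 8_t | 9_t | 11_p | 11_s | 13_- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000 | 1.25 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 4.460 | 3.04 | 6 | 560 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0.500 | 1.50 | 0 | 824 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1.540 | 3.75 | 5 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 5.625 | 1.71 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 685 | 10.085 | 1.25 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 686 | 0.750 | 2.00 | 2 | 394 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 687 | 13.500 | 2.00 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 688 | 0.205 | 0.04 | 0 | 750 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |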
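
With `drop_first=True`, the target column `13` is encoded as `13_-`, so a value of 1 marks the `-` class rather than the `+` class. The sketch below shows one alternative, assuming `+` denotes an approved application: build the target explicitly and one-hot encode only the feature columns. The toy frame `demo` and the names `X_demo` and `y_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: two feature columns plus the target in the last column
demo = pd.DataFrame(
    {0: ["b", "a", "a", "b"], 1: [30.83, 58.67, 24.5, 21.08], 2: ["+", "+", "-", "-"]}
)

# Explicit target: 1 = approved, 0 = rejected (assumes "+" means approved)
y_demo = (demo.iloc[:, -1] == "+").astype(int)

# One-hot encode only the features, leaving the target out
X_demo = pd.get_dummies(demo.iloc[:, :-1], dtype="int", drop_first=True)
print(X_demo.columns.tolist(), y_demo.tolist())
```

Accuracy is unaffected by which class is labeled 1, but an explicit mapping makes confusion matrices and any precision or recall figures easier to read.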


In [11]:
# features and target
X = cc_apps_encoded.iloc[:, :-1].values
y = cc_apps_encoded.iloc[:, [-1]].values

In [14]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
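
The split above is purely random, so the share of approved and rejected applications can differ slightly between the two parts. Below is a minimal sketch of a stratified split, which keeps the class proportions equal. It is an optional variation, not what the notebook does, and the toy arrays `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 40 of them in the positive class
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo preserves the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```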

In [33]:
# scaling data
# scaler = StandardScaler()
# X_trainscaled = scaler.fit_transform(X_train)
# X_testscaled = scaler.transform(X_test)

In [15]:
# Create the scaler
scaler = StandardScaler()

# Fit the scaler to your training data
scaler.fit(X_train)

# Transform training and test data
X_trainscaled = scaler.transform(X_train)
X_testscaled = scaler.transform(X_test)

In [17]:
# Create the scaler and transform training data
# X_trainscaled = StandardScaler().fit_transform(X_train)

# Create another scaler instance for test data
# X_testscaled = StandardScaler().fit_transform(X_test)

In [36]:
# If X_train is 1D
# X_train = X_train.reshape(-1, 1)

In [16]:
# Option 1: Flatten the (n, 1) target with ravel(); scikit-learn expects a 1D y
logreg = LogisticRegression()
logreg.fit(X_trainscaled, y_train.ravel())

# Option 2: Using flatten()
# logreg = LogisticRegression()
# logreg.fit(X_trainscaled, y_train.flatten())

# Option 3: Using reshape
# logreg = LogisticRegression()
# logreg.fit(X_trainscaled, y_train.reshape(-1))


In [17]:
# predicting
y_train_pred = logreg.predict(X_trainscaled)

In [18]:
print(confusion_matrix(y_train, y_train_pred))

[[207   3]
 [  3 270]]
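
The matrix above is computed on the training data. Its diagonal holds the correctly classified applications, so training accuracy follows directly from it. Here is a short sketch of that arithmetic, using only the numbers printed above:

```python
import numpy as np

# Confusion matrix printed above (training data)
cm = np.array([[207, 3], [3, 270]])

correct = np.trace(cm)  # 207 + 270 correctly classified samples
total = cm.sum()        # 483 training samples in total
train_accuracy = correct / total
print(round(train_accuracy, 4))  # 0.9876
```

Accuracy measured on the data the model was fitted on is an optimistic estimate, so scores on held-out data are the more meaningful figures.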


In [19]:
# tuning model 
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

In [20]:
# Parameter grid
param_grid = dict(tol=tol, max_iter=max_iter)

# Grid search with 5-fold cross-validation
grid_model = GridSearchCV(estimator=logreg,
                          param_grid=param_grid,
                          cv=5)

In [21]:
# result of grid_model
grid_model_result = grid_model.fit(X_trainscaled, y_train.ravel())
grid_model_result

In [22]:
# best model comparison
best_train_score, best_train_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_train_score, best_train_params))

Best: 0.836383 using {'max_iter': 100, 'tol': 0.0001}
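
The grid above varies `tol` and `max_iter`, which mainly control solver convergence. The sketch below is a variation, not the notebook's method: it tunes the regularization strength `C` and wraps the scaler in a `Pipeline` so that it is refit inside each cross-validation fold. The synthetic data from `make_classification` and the grid values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit card data (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=200)),
])

# The "logreg__" prefix routes the parameter to the classifier step
param_grid = {"logreg__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Because the scaler is part of the estimator, each validation fold is scaled with statistics from its own training folds only.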


In [30]:
# reshaping 
# X_testscaled = X_test.reshape(1, -1)

In [23]:
# If you want to evaluate on training data:
best_score_train = grid_model_result.score(X_trainscaled, y_train)

# If you want to evaluate on test data (which is usually preferred):
best_score_test = grid_model_result.score(X_testscaled, y_test)

In [27]:
best_model = grid_model_result.best_estimator_
best_score = best_model.score(X_testscaled, y_test)

In [28]:
print("Accuracy of logistic regression classifier: ", best_score)

Accuracy of logistic regression classifier:  0.8067632850241546


In [30]:
best_score_var = best_score  # reuse the computed score instead of hard-coding it

In [31]:
best_score_model = best_score