<a href="https://colab.research.google.com/github/IsaacKelly99/IKR/blob/master/Credit_card_aproval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit card approval
## ML supervised learning
I built a model to predict whether a bank user's application for a new credit card will be approved or rejected.

### Tools used:
* Label encoder
* Iterative imputation
* Train/test split
* Standard scaler
* Stacking classifier
* Metrics: accuracy_score, matthews_corrcoef, f1_score

Link to the dataset: http://archive.ics.uci.edu/ml/datasets/credit+approval

In [None]:
# Packages
import pandas as pd
import numpy as np

In [None]:
#/content/drive/MyDrive/data_for_colab/crx.data
df= pd.read_csv("/content/drive/MyDrive/data_for_colab/crx.data", header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


Some information about this dataframe: because this is real data, the column names weren't included, but we can still guess at some of them, such as age, gender or social status.

In [None]:
df.dtypes

0      object
1      object
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13     object
14      int64
15     object
dtype: object

*The datatypes and values according to the source*
 
* A1: b, a.
* A2: continuous.
* A3: continuous.
* A4: u, y, l, t.
* A5: g, p, gg.
* A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
* A7: v, h, bb, j, n, z, dd, ff, o.
* A8: continuous.
* A9: t, f.
* A10: t, f.
* A11: continuous.
* A12: t, f.
* A13: g, p, s.
* A14: continuous.
* A15: continuous.
* A16: +,- (class attribute)
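
The attribute labels listed above can be attached to the dataframe when it is loaded. This is a minimal sketch, assuming the labels A1 to A16 from the source; it uses the five rows from the preview above as inline sample data instead of the real file, and `df_named` is a name introduced only for this illustration.

```python
import io
import pandas as pd

# The five rows shown in the df.head() preview, used here as sample data
sample = io.StringIO(
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
    "a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+\n"
    "b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+\n"
    "b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+\n"
)

# A1 ... A16, as listed by the dataset source
column_names = [f"A{i}" for i in range(1, 17)]

# names= assigns the labels; header=None says the file has no header row
df_named = pd.read_csv(sample, header=None, names=column_names)
print(df_named.head())
```

The rest of this notebook keeps the default integer labels (0 to 15), so A1 corresponds to column 0 and A16 to column 15.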

In [None]:
# Most of the columns are objects; there are two floats and two integers
# Let's look at the unique values of each non-numeric column
# Columns 1 and 13 are numeric but stored as objects, because their
# missing values are marked with a "?" sign; for the same reason they
# can't be converted to float yet:
# df[1] = df[1].astype("float64")
print(df[2].dtypes)
# Keep the non-numeric columns and drop 1 and 13 in the same step, so that
# df2 is a new dataframe and not a modified slice of df
df2 = df.select_dtypes(exclude=["int64", "float64"]).drop(columns=[1, 13])
# A for loop to iterate over the columns and print their unique values
for column in df2:
    unique_values = df2[column].unique()
    print(column, "unique values:", unique_values)
# The missing values are marked with a "?" sign

float64
0 unique values: ['b' 'a' '?']
3 unique values: ['u' 'y' '?' 'l']
4 unique values: ['g' 'p' '?' 'gg']
5 unique values: ['w' 'q' 'm' 'r' 'cc' 'k' 'c' 'd' 'x' 'i' 'e' 'aa' 'ff' 'j' '?']
6 unique values: ['v' 'h' 'bb' 'ff' 'j' 'z' '?' 'o' 'dd' 'n']
8 unique values: ['t' 'f']
9 unique values: ['t' 'f']
11 unique values: ['f' 't']
12 unique values: ['g' 's' 'p']
15 unique values: ['+' '-']


In [None]:
# Let's replace the "?" with np.nan
df = df.replace("?", np.nan)
# and the "+" and "-" in column 15 with 1 and 0
df[15] = df[15].replace("+", 1)
df[15] = df[15].replace("-", 0)

# Now I can correct the datatypes of columns 1 and 13 so they match the source info
df[1] = df[1].astype("float64")
df[13] = df[13].astype("float64")
# Check for NaNs
def na_status(df):
  total = df.isnull().sum().sort_values(ascending=False)
  percent_1 = df.isnull().sum()/df.isnull().count()*100
  percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
  missing_data = pd.concat([total, percent_2], axis=1, keys=['Total NaN', 'NaN %'])
  print(missing_data.head(10))
na_status(df)
# We have up to almost 2% of missing values in some columns

    Total NaN  NaN %
13         13    1.9
1          12    1.7
0          12    1.7
6           9    1.3
5           9    1.3
4           6    0.9
3           6    0.9
15          0    0.0
14          0    0.0
12          0    0.0
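
As an alternative to replacing the "?" markers after loading, pandas can treat them as missing values while reading the file, which also lets it infer the numeric dtypes directly. This is a minimal sketch on two made-up rows in the crx.data layout (the second row is invented for illustration and is not a real record); `df_sample` is a hypothetical name.

```python
import io
import pandas as pd

# Two rows in the crx.data layout; the second one is made up and uses "?"
# as the missing-value marker in columns 0, 1 and 13
sample = io.StringIO(
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "?,?,1.5,u,g,q,h,0.5,f,f,0,t,g,?,10,-\n"
)

# na_values="?" makes read_csv parse every "?" as NaN
df_sample = pd.read_csv(sample, header=None, na_values="?")

# Columns 1 and 13 are now parsed as floats instead of objects
print(df_sample.dtypes.loc[[1, 13]])
# Total number of missing values in the sample
print(df_sample.isnull().sum().sum())
```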


In [None]:
# Let's impute the missing values using the iterative imputer
# First I need to encode the values of the dataframe, using LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder
# Then I create an instance of the label encoder
encoder = LabelEncoder()

# Using df.apply, I fit and transform each column on its non-null values only,
# so the NaNs are kept; otherwise they would have been encoded as a category.
# Note that this encodes every column, including the numeric ones
df = df.apply(lambda series: pd.Series(
    encoder.fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))

# Let's check that the encoder kept the missing values
# I leave these numeric columns out of the check
columns_numeric = [1, 2, 7, 14]
df2 = df.drop(columns=columns_numeric)
# Using the same for loop as before to check the unique values
for column in df2:
    unique_values = df2[column].unique()
    print(column, "unique values:", unique_values)

0 unique values: [ 1.  0. nan]
3 unique values: [ 1.  2. nan  0.]
4 unique values: [ 0.  2. nan  1.]
5 unique values: [12. 10.  9. 11.  2.  8.  1.  3. 13.  6.  4.  0.  5.  7. nan]
6 unique values: [ 7.  3.  0.  2.  4.  8. nan  6.  1.  5.]
8 unique values: [1 0]
9 unique values: [1 0]
10 unique values: [ 1  6  0  5  7 10  3 17  2  9  8 15 11 12 21 20  4 19 22 14 16 13 18]
11 unique values: [0 1]
12 unique values: [0 2 1]
13 unique values: [ 68.  11.  96.  31.  37. 115.  54.  23.  62.  15.  39.  90.   0. 105.
 127.  29.  67. 100.  47. 150.  56. 138. 158.   8.  84.  19. 143. 103.
  74. 149. 129.  83.  52. 162.  85. 154. 152. 134.  nan 167. 140.  44.
  28. 116.  97. 166.  65.  35.  58.  92.  55.  21.  49.  60. 106.  73.
 131.  94. 121. 130. 112.  69.  10.  63. 128. 139.  27.  17. 126. 125.
   3.   7.  32. 136. 118.   5.   2.  40. 151.  66.  46. 122.  13.  14.
 123.  48.  36.  16.  72.  80.  51.   4.  79. 153.  87. 148.  75.  25.
  20.  38. 146.  43.  42.  99.  50.  93. 109.  33. 163. 141. 
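
The output above shows that the numeric columns were replaced by integer codes as well. If only the categorical columns should be encoded, one option is to use pandas category codes on the non-numeric columns. This is a sketch of that alternative on a made-up two-column frame, not the approach used in this notebook; `toy` and `encoded` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up data: one categorical and one continuous column, each with a NaN
toy = pd.DataFrame({
    0: ["b", "a", np.nan, "b"],
    1: [30.83, 58.67, 24.5, np.nan],
})

encoded = toy.copy()
# Only the non-numeric columns are encoded; numeric ones stay untouched
for col in encoded.select_dtypes(exclude="number").columns:
    # Categories are sorted, so the codes match what LabelEncoder would give;
    # missing values get the code -1
    codes = encoded[col].astype("category").cat.codes
    # Turn the -1 codes back into NaN so they can be imputed later
    encoded[col] = codes.where(codes >= 0)

print(encoded)
```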

In [None]:
# We can see that the missing values were kept and not encoded
# Now the imputation can take place

# To impute the values I need IterativeImputer, which first has to be enabled
# through sklearn.experimental
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# An instance of the iterative imputer with random state 20
imp = IterativeImputer(random_state=20)
# The imputer returns an array, so I wrap it in a dataframe again to check
# whether any NaNs are left, using my function na_status
# The column names won't be kept, but there weren't any to begin with
df = pd.DataFrame(imp.fit_transform(df))
na_status(df)
# There aren't any NaNs left

    Total NaN  NaN %
15          0    0.0
14          0    0.0
13          0    0.0
12          0    0.0
11          0    0.0
10          0    0.0
9           0    0.0
8           0    0.0
7           0    0.0
6           0    0.0
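
The imputer above is fitted on the whole dataframe. A fitted `IterativeImputer` can also be reused on new rows with `transform`, which makes it possible to learn the imputation from training rows only. This is a minimal sketch on made-up numbers; the array and variable names are hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up data in which the second column is roughly twice the first
X_train_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.2],
    [np.nan, 10.0],
])
X_test_toy = np.array([
    [2.5, np.nan],
    [np.nan, 6.0],
])

imp_toy = IterativeImputer(random_state=20)
# Learn the imputation model from the training rows and fill their gaps
X_train_imp = imp_toy.fit_transform(X_train_toy)
# Reuse the fitted model on the test rows, without refitting
X_test_imp = imp_toy.transform(X_test_toy)

print(X_train_imp)
print(X_test_imp)
```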


In [None]:
# Data split and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# An instance of the scaler
scaler = StandardScaler()

# The features
X = df.drop(columns=[15])
# The target feature
y = df[15]

# The train/test split, leaving 35% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.35, random_state=42)
# Scaling the data: the scaler is fitted on the training data only,
# and the same fitted scaler is then applied to the testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
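
A common convention when scaling is to estimate the mean and standard deviation on the training data and apply those same statistics to the test data. This is a minimal sketch on randomly generated data (the arrays and names are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features drawn from the same distribution
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50, scale=10, size=(100, 3))
X_test_toy = rng.normal(loc=50, scale=10, size=(30, 3))

scaler_toy = StandardScaler()
# fit_transform learns mean and standard deviation from the training data
X_train_s = scaler_toy.fit_transform(X_train_toy)
# transform reuses those statistics, so no test information is learned
X_test_s = scaler_toy.transform(X_test_toy)

# The training columns have mean 0; the test columns are only close to 0
print(X_train_s.mean(axis=0).round(3))
print(X_test_s.mean(axis=0).round(3))
```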

In [None]:
# To predict the approval I will use a stacked ML model
# The models I will be using
# SVC is imported directly, so that the variable svm below
# doesn't overwrite the sklearn.svm module
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

In [None]:
# Defining and fitting the models
# SVM
svm = SVC()
svm.fit(X_train_scaled, y_train)

# MLP
nnet = MLPClassifier(max_iter=1500)
nnet.fit(X_train_scaled, y_train)

# RFC
rforest = RandomForestClassifier()
rforest.fit(X_train_scaled, y_train)

# KNN
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)

# Dtree
tree = DecisionTreeClassifier()
tree.fit(X_train_scaled, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [None]:
# Now to assemble all the ML models
from sklearn.ensemble import StackingClassifier
# The final model will be a logistic regression
from sklearn.linear_model import LogisticRegression
# The instance of logistic regression
logreg = LogisticRegression()

# list of estimators
estimators = [
              ('nnet',nnet),
              ('tree', tree),
              ('svm', svm),
              ('knn', knn),
              ('rforest', rforest)
]
# The stacked model, created using stacking classifier from sklearn
# the list of estimators and the final estimator are assigned as previously stated
stack_model = StackingClassifier(estimators = estimators, final_estimator = logreg)

In [None]:
# fitting the stacked model
stack_model.fit(X_train_scaled, y_train)

# creating predictions on X_train and X_test, to test the performance
y_train_pred = stack_model.predict(X_train_scaled)
y_test_pred = stack_model.predict(X_test_scaled)
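
`StackingClassifier` fits its own clones of the base estimators when `fit` is called, so it can also be placed after the scaler in a single pipeline and evaluated with cross-validation. This is a minimal sketch on synthetic data from `make_classification`, with only two base models to keep it quick; the names `stack_toy`, `pipe` and `scores` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, not the credit approval dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Unfitted base models; the stacking classifier fits clones of them itself
stack_toy = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
)

# The scaler is refitted inside each cross-validation fold,
# on the training part of that fold only
pipe = make_pipeline(StandardScaler(), stack_toy)
scores = cross_val_score(pipe, X_toy, y_toy, cv=3)
print(scores.mean())
```

Averaging over several folds gives a less split-dependent estimate than a single train/test split, at the cost of fitting the stack several times.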

In [None]:
#the metrics to evaluate the model performance
from sklearn.metrics import accuracy_score
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import f1_score

# Training set model performance
stack_model_train_accuracy = accuracy_score(y_train, y_train_pred) # Calculate Accuracy
stack_model_train_mcc = matthews_corrcoef(y_train, y_train_pred) # Calculate MCC
stack_model_train_f1 = f1_score(y_train, y_train_pred, average='weighted') # Calculate F1-score

# Test set model performance
stack_model_test_accuracy = accuracy_score(y_test, y_test_pred) # Calculate Accuracy
stack_model_test_mcc = matthews_corrcoef(y_test, y_test_pred) # Calculate MCC
stack_model_test_f1 = f1_score(y_test, y_test_pred, average='weighted') # Calculate F1-score

# Printing the results
print('Model performance for Training set')
print('- Accuracy: %s' % stack_model_train_accuracy)
print('- MCC: %s' % stack_model_train_mcc)
print('- F1 score: %s' % stack_model_train_f1)
print('----------------------------------')
print('Model performance for Test set')
print('- Accuracy: %s' % stack_model_test_accuracy)
print('- MCC: %s' % stack_model_test_mcc)
print('- F1 score: %s' % stack_model_test_f1)

Model performance for Training set
- Accuracy: 0.9732142857142857
- MCC: 0.9456969696969697
- F1 score: 0.9732142857142857
----------------------------------
Model performance for Test set
- Accuracy: 0.8801652892561983
- MCC: 0.758176479685776
- F1 score: 0.8802127812583885


On the test set the model reached an accuracy of 88% and a weighted F1-score of 88%, with a Matthews correlation coefficient of 0.76. All three measures are clearly lower than on the training set (accuracy and F1-score of 97%, MCC of 0.95), so the model predicts new data less well than the data it was trained on, which suggests that it is overfitting to some degree.
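
The Matthews correlation coefficient ranges from -1 to 1 and uses all four cells of the confusion matrix, so it is usually lower than accuracy for the same predictions. This is a small worked example on made-up labels, computing it by hand from the standard formula and comparing it with `matthews_corrcoef`:

```python
import math
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

# Made-up labels and predictions for ten samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print("Accuracy:", accuracy_score(y_true, y_pred))
print("MCC by hand:", mcc_manual)
print("MCC sklearn:", matthews_corrcoef(y_true, y_pred))
```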