The code is from https://www.kaggle.com/kabure/kernels

<a id="Introduction"></a> <br>


# **1. Introduction:** 
<h2>Context</h2>
The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. Each entry represents a person who takes credit from a bank, and each person is classified as a good or bad credit risk according to the set of attributes. The link to the original dataset can be found below.

<h2>Content</h2>
It is almost impossible to understand the original dataset due to its complicated system of categories and symbols. Thus, I wrote a small Python script to convert it into a readable CSV file. Several columns are simply ignored, because in my opinion either they are not important or their descriptions are obscure. The selected attributes are:

<b>Age </b>(numeric)<br>
<b>Sex </b>(text: male, female)<br>
<b>Job </b>(numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)<br>
<b>Housing</b> (text: own, rent, or free)<br>
<b>Saving accounts</b> (text - little, moderate, quite rich, rich)<br>
<b>Checking account </b>(numeric, in DM - Deutsch Mark)<br>
<b>Credit amount</b> (numeric, in DM)<br>
<b>Duration</b> (numeric, in months)<br>
<b>Purpose</b> (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)<br>
<b>Risk </b> (Value target - Good or Bad Risk)<br>

<a id="Librarys"></a> <br>
# **2. Libraries:** 
- Importing libraries
- Importing the dataset

In [456]:
# Load the libraries
import random
import copy
import pandas as pd # to work with the dataset
import numpy as np # math library
import seaborn as sns # plotting library built on matplotlib
import matplotlib.pyplot as plt # to adjust some plot parameters used by seaborn
import sklearn

#Importing the data
df_credit = pd.read_csv("../input/german-credit-data-with-risk/german_credit_data.csv",index_col=0) # you can get the data from kaggle notebook

<a id="Known"></a> <br>
# **3. First Look at the data:** 
- Looking at the data types
- Null counts
- Unique values
- The first rows of our dataset

In [457]:
# Check for missing values, the data types, and the shape of the data
print(df_credit.info())

In [458]:
df_credit.head()
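
The list at the top of this section mentions null counts and unique values, which `info()` reports only in part. A minimal sketch of both checks on a small hypothetical frame (`toy` and its values are invented for illustration; on the real data the same methods would be called on `df_credit`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_credit; the values are made up.
toy = pd.DataFrame({
    "Age": [67, 22, 49, 45],
    "Saving accounts": [np.nan, "little", "little", "rich"],
    "Checking account": ["little", "moderate", np.nan, np.nan],
})

null_counts = toy.isnull().sum()   # missing values per column
unique_counts = toy.nunique()      # distinct non-null values per column

print(null_counts)
print(unique_counts)
```

Note that `nunique()` ignores missing values by default, so a column with many NaNs can look less varied than it is.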

# **4. Some explorations:** <a id="Explorations"></a> <br>

- Starting with the distribution of the Age column
- Some Seaborn plots
- Crossing columns



<h2>Let's start by looking at the target variable and its distribution</h2>

In [459]:
# plotly.offline lets us render plotly figures inside the notebook
import plotly.offline as py 
py.init_notebook_mode(connected=True) # enables the offline plotly mode in the notebook
import plotly.graph_objs as go # plays the role of "plt" in matplotlib
import plotly.tools as tls # plotly helper tools (used here for subplots)
import warnings # can be used to silence some warnings
from collections import Counter # to count feature values

trace0 = go.Bar(
            x = df_credit[df_credit["Risk"]== 'good']["Risk"].value_counts().index.values,
            y = df_credit[df_credit["Risk"]== 'good']["Risk"].value_counts().values,
            name='Good credit'
    )

trace1 = go.Bar(
            x = df_credit[df_credit["Risk"]== 'bad']["Risk"].value_counts().index.values,
            y = df_credit[df_credit["Risk"]== 'bad']["Risk"].value_counts().values,
            name='Bad credit'
    )

data = [trace0, trace1]

layout = go.Layout(
    yaxis=dict(
        title='Count'
    ),
    xaxis=dict(
        title='Risk Variable'
    ),
    title='Target variable distribution'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='grouped-bar')

In [460]:
df_good = df_credit.loc[df_credit["Risk"] == 'good']['Age'].values.tolist()
df_bad = df_credit.loc[df_credit["Risk"] == 'bad']['Age'].values.tolist()
df_age = df_credit['Age'].values.tolist()

#First plot
trace0 = go.Histogram(
    x=df_good,
    histnorm='probability',
    name="Good Credit"
)
#Second plot
trace1 = go.Histogram(
    x=df_bad,
    histnorm='probability',
    name="Bad Credit"
)
#Third plot
trace2 = go.Histogram(
    x=df_age,
    histnorm='probability',
    name="Overall Age"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Good','Bad', 'General Distribution'))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Age Distribution', bargap=0.05)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')

In [461]:
df_good = df_credit[df_credit["Risk"] == 'good']
df_bad = df_credit[df_credit["Risk"] == 'bad']

fig, ax = plt.subplots(nrows=2, figsize=(12,8))
plt.subplots_adjust(hspace = 0.4, top = 0.8)

g1 = sns.distplot(df_good["Age"], ax=ax[0], color="g")
g1 = sns.distplot(df_bad["Age"], ax=ax[0], color='r')
g1.set_title("Age Distribution", fontsize=15)
g1.set_xlabel("Age")
g1.set_ylabel("Frequency")

g2 = sns.countplot(x="Age",data=df_credit, palette="hls", ax=ax[1], hue = "Risk")
g2.set_title("Age Counting by Risk", fontsize=15)
g2.set_xlabel("Age")
g2.set_ylabel("Count")
plt.show()

<h2>Looking at the difference by Sex</h2>

In [462]:
#First plot
trace0 = go.Bar(
    x = df_credit[df_credit["Risk"]== 'good']["Sex"].value_counts().index.values,
    y = df_credit[df_credit["Risk"]== 'good']["Sex"].value_counts().values,
    name='Good credit'
)

#First plot 2
trace1 = go.Bar(
    x = df_credit[df_credit["Risk"]== 'bad']["Sex"].value_counts().index.values,
    y = df_credit[df_credit["Risk"]== 'bad']["Sex"].value_counts().values,
    name="Bad Credit"
)

#Second plot
trace2 = go.Box(
    x = df_credit[df_credit["Risk"]== 'good']["Sex"],
    y = df_credit[df_credit["Risk"]== 'good']["Credit amount"],
    name=trace0.name
)

#Second plot 2
trace3 = go.Box(
    x = df_credit[df_credit["Risk"]== 'bad']["Sex"],
    y = df_credit[df_credit["Risk"]== 'bad']["Credit amount"],
    name=trace1.name
)

data = [trace0, trace1, trace2,trace3]


fig = tls.make_subplots(rows=1, cols=2, 
                        subplot_titles=('Sex Count', 'Credit Amount by Sex'))

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 2)

fig['layout'].update(height=400, width=800, title='Sex Distribution', boxmode='group')
py.iplot(fig, filename='sex-subplot')

Looking at the distribution of Credit Amount

## Looking at the unique values of each categorical feature

In [463]:
print("Purpose : ",df_credit.Purpose.unique())
print("Sex : ",df_credit.Sex.unique())
print("Housing : ",df_credit.Housing.unique())
print("Saving accounts : ",df_credit['Saving accounts'].unique())
print("Risk : ",df_credit['Risk'].unique())
print("Checking account : ",df_credit['Checking account'].unique())
# print("Aget_cat : ",df_credit['Age_cat'].unique())

## Let's do some feature engineering on these values and create dummy variables

In [464]:
def one_hot_encoder(df, nan_as_category = False):
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns= categorical_columns, dummy_na= nan_as_category, drop_first=True)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
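
The helper above is defined but not called in the cells shown here, so a short usage sketch may help. It repeats the function so that it runs on its own and applies it to a hypothetical three-row frame (the rows are invented, and `dtype="object"` is set explicitly as an assumption so that the `dtype == 'object'` check matches regardless of the pandas version):

```python
import pandas as pd

def one_hot_encoder(df, nan_as_category=False):
    """Same logic as the helper above, repeated so this sketch is self-contained."""
    original_columns = list(df.columns)
    categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    df = pd.get_dummies(df, columns=categorical_columns,
                        dummy_na=nan_as_category, drop_first=True)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns

# Hypothetical rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Age": [67, 22, 49],
    "Sex": pd.Series(["male", "female", "male"], dtype="object"),
    "Housing": pd.Series(["own", "rent", "free"], dtype="object"),
})

encoded, new_cols = one_hot_encoder(toy)
print(new_cols)          # names of the dummy columns that were created
print(encoded.columns)   # numeric columns are kept, text columns are replaced
```

Because `drop_first=True`, the alphabetically first level of each text column (here `female` and `free`) becomes the implicit baseline and gets no column of its own.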

## Transforming the data into Dummy variables

In [465]:
df_credit['Saving accounts'] = df_credit['Saving accounts'].fillna('no_inf')
df_credit['Checking account'] = df_credit['Checking account'].fillna('no_inf')

# Purpose to dummy variables
df_credit = df_credit.merge(pd.get_dummies(df_credit.Purpose, drop_first=True, prefix='Purpose'), left_index=True, right_index=True)
# Sex to dummy variables
df_credit = df_credit.merge(pd.get_dummies(df_credit.Sex, drop_first=True, prefix='Sex'), left_index=True, right_index=True)
# Housing to dummy variables
df_credit = df_credit.merge(pd.get_dummies(df_credit.Housing, drop_first=True, prefix='Housing'), left_index=True, right_index=True)
# Saving accounts to dummy variables
df_credit = df_credit.merge(pd.get_dummies(df_credit["Saving accounts"], drop_first=True, prefix='Savings'), left_index=True, right_index=True)
# Risk to dummy variables (both Risk_bad and Risk_good are created here)
df_credit = df_credit.merge(pd.get_dummies(df_credit.Risk, prefix='Risk'), left_index=True, right_index=True)
# Checking account to dummy variables
df_credit = df_credit.merge(pd.get_dummies(df_credit["Checking account"], drop_first=True, prefix='Check'), left_index=True, right_index=True)
# Age category to dummy variables (disabled)
# df_credit = df_credit.merge(pd.get_dummies(df_credit["Age_cat"], drop_first=True, prefix='Age_cat'), left_index=True, right_index=True)

## Deleting the old features

In [466]:
# Dropping the original categorical columns and the redundant Risk_good dummy
del df_credit["Saving accounts"]
del df_credit["Checking account"]
del df_credit["Purpose"]
del df_credit["Sex"]
del df_credit["Housing"]
# del df_credit["Age_cat"]
del df_credit["Risk"]
del df_credit['Risk_good']

# **5. Correlation:** <a id="Correlation"></a> <br>
- Looking at the data correlation
<h1>Looking at the correlation of the data</h1>

In [467]:
plt.figure(figsize=(14,12))
sns.heatmap(df_credit.astype(float).corr(),linewidths=0.1,vmax=1.0, 
            square=True,  linecolor='white', annot=True)
plt.show()

In [413]:
# for j in range(1000):
#     i = random.randint(0,1000)
#     if i %2 == 0:
#         df_reverse_age['Age'][j] = 0
#     else:
#         df_reverse_age['Age'][j] = 1
# df_random_gender.head()

We categorize age into 'young' (<= 35) and 'old' (> 35), because the mean age is around 35.

In [468]:
df_orig_age = copy.deepcopy(df_credit)
# 1 = 'Old' (Age > 35), 0 = 'Young' (Age <= 35); vectorized to avoid chained assignment
df_orig_age['Age'] = (df_orig_age['Age'] > 35).astype(int)

In [469]:
df_credit.head() #original data

In [470]:
df_orig_age.head() #ours

# **6. Preprocessing:** <a id="Preprocessing"></a> <br>
- Importing ML libraries
- Setting X and y variables to the prediction
- Splitting Data


In [471]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score # to split the data
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, fbeta_score, roc_curve # to evaluate our model

from sklearn.model_selection import GridSearchCV

# Algorithms to be compared
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier


In [472]:
df_credit['Credit amount'] = np.log(df_credit['Credit amount'])

# df_reverse_gender['Credit amount'] = np.log(df_reverse_gender['Credit amount'])

# df_random_age['Credit amount'] = np.log(df_random_age['Credit amount'])

# df_reverse_age['Credit amount'] = np.log(df_reverse_age['Credit amount'])
df_orig_age['Credit amount'] = np.log(df_orig_age['Credit amount'])

In [473]:
# Creating the X and y variables
X = df_credit.drop('Risk_bad', axis=1).values
y = df_credit["Risk_bad"].values

# Splitting X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)


# #Creating the X and y variables
# Xr = df_reverse_gender.drop('Risk_bad', 1).values
# yr = df_reverse_gender["Risk_bad"].values

# # Spliting X and y into train and test version
# Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size = 0.2, random_state=42)


# Creating the X and y variables for the age-binarized data
Xa = df_orig_age.drop('Risk_bad', axis=1).values
ya = df_orig_age["Risk_bad"].values

# Splitting X and y into train and test sets
Xa_train, Xa_test, ya_train, ya_test = train_test_split(Xa, ya, test_size = 0.2, random_state=42)

In [474]:
# to feed the random state
seed = 7

# prepare models
models = []
# models.append(('LR', LogisticRegression()))
# models.append(('LDA', LinearDiscriminantAnalysis()))
# models.append(('KNN', KNeighborsClassifier()))
# models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
# models.append(('RF', RandomForestClassifier()))
# models.append(('SVM', SVC(gamma='auto')))
# models.append(('XGB', XGBClassifier()))

# evaluate each model in turn
results = []
names = []
scoring = 'recall'

for name, model in models:
        # random_state only takes effect when shuffle=True
        kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
        
# boxplot algorithm comparison
fig = plt.figure(figsize=(11,6))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Very interesting. Almost all models show low recall. 

The best results came from CART, NB and XGBoost. <br>
I will implement some models and try some simple tuning on them.
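
Recall here is the share of truly bad-risk applicants that the model flags, TP / (TP + FN). A small sketch with invented labels (1 = bad risk, matching the `Risk_bad` target) shows the computation; `recall` is a hypothetical helper written for illustration, not a scikit-learn function:

```python
def recall(y_true, y_pred):
    """Recall for the positive class (label 1): TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Invented labels: 4 bad-risk applicants, of which the model catches 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

print(recall(y_true, y_pred))  # 0.25
```

A model can reach a reasonable accuracy on this kind of data while still missing most bad-risk cases, which is why recall is used as the scoring metric in the comparison above.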

In [475]:
model = GaussianNB()
model.fit(Xa_train, ya_train)

In [483]:
#Testing the model 
#Predicting using our  model
Xa_test, Xb_test = Xa_test[:100], Xa_test[100:]
ya_test, yb_test = ya_test[:100], ya_test[100:]

ya_pred = model.predict(Xa_test)

# Check the results obtained
print('group a')
print('ACC:', accuracy_score(ya_test,ya_pred))
print("\n")
print(confusion_matrix(ya_test, ya_pred))
print("\n")
print('fbeta_score:', fbeta_score(ya_test, ya_pred, beta=2))

print('roc_auc_score', sklearn.metrics.roc_auc_score(ya_test, ya_pred))
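
The `fbeta_score` call above uses `beta=2`, which weights recall more heavily than precision. A sketch of the underlying formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with invented precision and recall values (`f_beta` is a hypothetical helper, not the scikit-learn function):

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score from precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Invented values: recall is higher than precision.
p, r = 0.5, 0.8
f2 = f_beta(p, r)             # beta = 2, as in the notebook
f1 = f_beta(p, r, beta=1.0)   # ordinary F1 for comparison

print(round(f2, 3), round(f1, 3))
```

With these numbers F2 comes out higher than F1, because the larger of the two inputs is recall and `beta=2` leans toward it.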


In [484]:
# for anti-classification, make reverse dataset
Xr_test = copy.deepcopy(Xa_test)
Xr_test[:,0] = 1-Xa_test[:,0] # reverse age (young <-> old)
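
One way to read the outcome of this flip test is to count how many predictions change when only the age flag is inverted. The sketch below uses a hypothetical rule-based `predict` in place of the fitted model and invented rows, and assumes the binary age flag sits in column 0 as in the cell above:

```python
import numpy as np

def predict(X):
    # Hypothetical stand-in for model.predict: flags "bad risk" (1) when the
    # applicant is young (column 0 == 0) and the value in column 1 is above 5.
    return ((X[:, 0] == 0) & (X[:, 1] > 5.0)).astype(int)

# Invented rows: [age flag, some numeric feature]
X = np.array([[0, 6.0], [1, 6.0], [0, 4.0], [1, 7.5]])

X_flipped = X.copy()
X_flipped[:, 0] = 1 - X[:, 0]   # young <-> old, other columns unchanged

changed = predict(X) != predict(X_flipped)
print(changed.mean())           # share of predictions that depend on the age flag
```

A share of 0 would mean the age flag never changes a decision on these rows; anything above 0 means at least some decisions depend on it.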

In [485]:
#Testing the model 
#Predicting using default model, orig data
# Check the results obtained
print('group A')
print('ACC:',accuracy_score(ya_test,ya_pred))
print("\n")
print(confusion_matrix(ya_test, ya_pred))
print("\n")
print('fbeta_score', fbeta_score(ya_test, ya_pred, beta=2))

print('roc_auc_score', sklearn.metrics.roc_auc_score(ya_test, ya_pred))


#Predicting proba
print('reversed age category')
y_pred_prob = model.predict_proba(Xr_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(ya_test, y_pred_prob)


# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

In [242]:
yr_pred = model.predict(Xr_test)

# Check the results obtained
print(accuracy_score(y_test,yr_pred))
print("\n")
print(confusion_matrix(y_test, yr_pred))
cm = confusion_matrix(y_test, yr_pred)
print("\n")
print(fbeta_score(y_test, yr_pred, beta=2))
print(sklearn.metrics.roc_auc_score(y_test, yr_pred))


# **7.2 Model 2:** <a id="Modelling 2"></a> <br>

In [121]:
from sklearn.utils import resample
from sklearn.metrics import roc_curve

In [107]:
# Creating the Gaussian Naive Bayes classifier
GNB = GaussianNB()

# Fitting with train data
model = GNB.fit(X_train, y_train)

In [108]:
# Printing the Training Score
print("Training score data: ")
print(model.score(X_train, y_train))

In [109]:
y_pred = model.predict(X_test)

print(accuracy_score(y_test,y_pred))
print("\n")
print(confusion_matrix(y_test, y_pred))
print("\n")
print(classification_report(y_test, y_pred))

With the Gaussian Naive Bayes model we obtained the best recall. 

## Let's verify the ROC curve

In [110]:
#Predicting proba
y_pred_prob = model.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
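
The area under this curve can be read as the probability that a randomly chosen bad-risk case receives a higher score than a randomly chosen good-risk case. A minimal pure-Python sketch of that pairwise definition on invented scores (`pairwise_auc` is a hypothetical helper; `sklearn.metrics.roc_auc_score` is the library routine used elsewhere in this notebook):

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented labels and scores: 2 positives, 2 negatives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_true, scores))  # 0.75
```

This definition only makes full use of the ranking when continuous scores such as `predict_proba` outputs are passed in; hard 0/1 predictions collapse the curve to a single operating point.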

In [73]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

In [112]:
features = []
features.append(('pca', PCA(n_components=2)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)
# create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('naive_bayes', GaussianNB()))
model = Pipeline(estimators)
# evaluate pipeline
seed = 7
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(model, X_train, y_train, cv=kfold)
print(results.mean())

In [113]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test,y_pred))
print("\n")
print(confusion_matrix(y_test, y_pred))
print("\n")
print(fbeta_score(y_test, y_pred, beta=2))

In [166]:
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size = 0.2, random_state=42)
X_dev, X_test = Xa_test[:100], Xa_test[100:]
y_dev, y_test = ya_test[:100], ya_test[100:]


In [394]:
num_old = X_dev[:,0].sum()
num_young = len(y_dev) - num_old
print(num_young, num_old)

In [339]:
num_male = X_dev[:,11].sum()
num_female = len(y_dev) - num_male
print(num_female, num_male)

In [362]:
y_pred_th3 = (model.predict_proba(X_dev)[:,1] >= 0.7).astype(bool)
y_pred_th405 = (model.predict_proba(X_dev)[:,1] >= 0.405).astype(bool)
y_pred_th35 = (model.predict_proba(X_dev)[:,1] >= 0.35).astype(bool)
y_pred_th6 = (model.predict_proba(X_dev)[:,1] >= 0.6).astype(bool)
y_pred_th5 = model.predict(X_dev)

In [272]:
(y_pred_th5 == y_dev).sum()

In [273]:
y_pred_th5.sum()

In [367]:
count_female, count_male = 0,0
default_female, default_male = 0, 0
for i in range(len(y_dev)):
    if X_dev[i,11] == 0: # female (the Sex_male dummy in column 11 is 0)
        if y_pred_th42[i] == 1: #y_dev[i]:
            count_female += 1
        if y_pred_th5[i] == 1: #y_dev[i]:
            default_female += 1 
    else:
        if y_pred_th4[i] == 1: #y_dev[i]:
            count_male += 1
        if y_pred_th5[i] == 1: #y_dev[i]:
            default_male += 1

In [368]:
print(count_female, count_male)
print(count_female/num_female, count_male/num_male)
print(default_female/num_female, default_male/num_male)

In [395]:
count_young, count_old = 0,0
default_young, default_old = 0, 0

for i in range(len(y_dev)):
    if X_dev[i,0] == 0: # young (the age flag in column 0 is 0)
        if y_pred_th7[i] == 1: #y_dev[i]:
            count_young += 1
        if y_pred_th5[i] == 1: #y_dev[i]:
            default_young += 1 
    else:
        if y_pred_th4[i] == 1: #y_dev[i]:
            count_old += 1
        if y_pred_th5[i] == 1: #y_dev[i]:
            default_old += 1
# for test 200 set,
# threshold key: woman 0.53, man 0.72 !
# if then, P(correct|woman) = 0.732, P(correct|man) = 0.729
# default, P(correct|woman) = 0.714, P(correct|man) = 0.639
# =========================================================
# for dev 100 set,
# threshold key: woman 0.4, man 0.7 !
# if then, P(correct|woman) = 0.781, P(correct|man) = 0.779
# default, P(correct|woman) = 0.813, P(correct|man) = 0.662
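
The comment block above records group-specific thresholds that were found by hand. A sketch of how such thresholds could be applied and the resulting positive rates compared, assuming a probability vector and a binary group flag (all values are invented, and `thresholds` is a hypothetical mapping that reuses the dev-set values quoted above):

```python
import numpy as np

proba = np.array([0.45, 0.65, 0.75, 0.30, 0.55, 0.80])  # invented P(bad risk)
group = np.array([0, 0, 1, 1, 0, 1])                    # assumed coding: 0 = woman, 1 = man
thresholds = {0: 0.4, 1: 0.7}                           # per-group decision thresholds

# Pick each row's threshold from its group, then compare against it.
per_row_threshold = np.where(group == 0, thresholds[0], thresholds[1])
y_pred = (proba >= per_row_threshold).astype(int)

# Share of each group that is predicted as bad risk.
positive_rate = {g: float(y_pred[group == g].mean()) for g in (0, 1)}
print(y_pred.tolist())
print(positive_rate)
```

Thresholds tuned on one split may not carry over to another, which is consistent with the different values noted for the dev and test sets in the comments.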

In [396]:
print(count_young, count_old)
print(count_young/num_young, count_old/num_old)
print(default_young/num_young, default_old/num_old)

In [397]:
default_young

In [398]:
default_old

In [399]:
y_pred_th42 = (model.predict_proba(X_test)[:,1] >= 0.42).astype(bool)
y_pred_th7 = (model.predict_proba(X_test)[:,1] >= 0.7).astype(bool)
y_pred_th5 = model.predict(X_test)

In [400]:
count_female, count_male = 0,0
default_female, default_male = 0, 0
num_male = X_test[:,11].sum()
num_female = len(y_test) - num_male
print(num_female, num_male)

for i in range(len(y_test)):
    if X_test[i,0] == 0: # column 0 is the age flag, so this branch selects young applicants
        if y_pred_th42[i] == 1 : #y_test[i]:
            count_female += 1
        if y_pred_th5[i] == 1 : #y_test[i]:
            default_female += 1 
    else:
        if y_pred_th4[i] == 1 : #y_test[i]:
            count_male += 1
        if y_pred_th5[i] == 1 : #y_test[i]:
            default_male += 1
print("age / gender modified")
print(count_female, count_male)
print(count_female/num_female, count_male/num_male)
print(default_female/num_female, default_male/num_male)
print(default_female, default_male)

In [373]:
count_female, count_male = 0,0
default_female, default_male = 0, 0
num_male = X_test[:,11].sum()
num_female = len(y_test) - num_male
print(num_female, num_male)

for i in range(len(y_test)):
    if X_test[i,11] == 0: # in case of female
        if y_pred_th42[i] == 1 : #y_test[i]:
            count_female += 1
        if y_pred_th5[i] == 1 : #y_test[i]:
            default_female += 1 
    else:
        if y_pred_th4[i] == 1 : #y_test[i]:
            count_male += 1
        if y_pred_th5[i] == 1 : #y_test[i]:
            default_male += 1

print(count_female, count_male)
print(count_female/num_female, count_male/num_male)
print(default_female/num_female, default_male/num_male)
# 
# we found threshold key in dev set: woman 0.4, man 0.7 !
# if then, P(correct|woman) = 0.531, P(correct|man) = 0.676
# default, P(correct|woman) = 0.531, P(correct|man) = 0.647
# ========================================================================
# we should re-find threshold key in test set: woman 0.758, man 0.758 !
# if then, for test set, P(correct|woman) = 0.688, P(correct|man) = 0.691
# default, for test set, P(correct|woman) = 0.531, P(correct|man) = 0.647

In [544]:
df_credit['Age'].mean()

In [316]:
default_female, default_male

## AGE

In [306]:
int(False)

In [388]:
y_pred_th5 = model.predict(X_dev)
fn_y, fn_o = 0,0
default_young, default_old = 0, 0
for i in range(len(y_dev)):
    if X_dev[i,0] == 0: # young (the age flag in column 0 is 0)
        if y_pred_th5[i] == 1 and y_dev[i] == 0:
            fn_y += 1
    else:
        if y_pred_th5[i] == 1 and y_dev[i] == 0:
            fn_o += 1

In [389]:
fn_y, fn_o

In [392]:
y_pred_th5 = model.predict(X_test)
fn_y, fn_o = 0,0
default_young, default_old = 0, 0
for i in range(len(y_test)):
    if X_test[i,0] == 0: # young (the age flag in column 0 is 0)
        if y_pred_th5[i] == 1 and y_test[i] == 0:
            fn_y += 1
    else:
        if y_pred_th5[i] == 1 and y_test[i] == 0:
            fn_o += 1

In [393]:
fn_y, fn_o
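
The two loops above count, per age group, the cases predicted as bad risk (1) whose true label is good (0). With `Risk_bad` as the positive class these are false positives, despite the `fn_` prefix in the variable names. A vectorized sketch of the same count, turned into a rate, on invented arrays (`false_positive_rate` is a hypothetical helper):

```python
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 0])   # invented labels, 1 = bad risk
y_pred = np.array([1, 0, 1, 1, 0, 1])   # invented predictions
is_old = np.array([0, 0, 0, 1, 1, 1])   # assumed age flag: 0 = young, 1 = old

def false_positive_rate(y_true, y_pred, mask):
    """Share of actual negatives (label 0) inside `mask` that were predicted 1."""
    negatives = mask & (y_true == 0)
    if not negatives.any():
        return float("nan")
    return float((y_pred[negatives] == 1).mean())

fpr_young = false_positive_rate(y_true, y_pred, is_old == 0)
fpr_old = false_positive_rate(y_true, y_pred, is_old == 1)
print(fpr_young, fpr_old)
```

Dividing by the number of actual negatives in each group makes the two figures comparable even when the groups differ in size.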

In [305]:
#group 1
print("dev set")
y_pred = model.predict(X_dev)
print(accuracy_score(y_dev,y_pred))
print(confusion_matrix(y_dev, y_pred))

# group 2
print("=="*10)
print("test set")
y_pred = model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred))

In [537]:
#Testing the model 
#Predicting using our  model
y_pred = model.predict(Xa_test)

# Check the results obtained
print(accuracy_score(y_test,y_pred))
print("\n")
print(confusion_matrix(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print("\n")
print(fbeta_score(y_test, y_pred, beta=2))
print(cm)

print(sklearn.metrics.roc_auc_score(y_test, y_pred))
