


__________________________________________________________________________________________________________________





# Exercise 03

## Data preparation and model evaluation exercise with credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of Open loans |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


Read the data into Pandas

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.io.parsers.read_table(f, sep=',')

data.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


In [2]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [3]:
y

0         1
1         0
2         0
3         0
4         0
5         0
6         0
7         0
8         0
9         0
10        0
11        1
12        0
13        0
14        0
15        0
16        0
17        1
18        0
19        0
20        0
21        0
22        0
23        0
24        0
25        0
26        0
27        0
28        0
29        0
         ..
112885    1
112886    0
112887    0
112888    0
112889    1
112890    0
112891    0
112892    0
112893    0
112894    0
112895    0
112896    0
112897    1
112898    0
112899    0
112900    0
112901    0
112902    0
112903    0
112904    0
112905    0
112906    0
112907    0
112908    0
112909    0
112910    0
112911    0
112912    0
112913    0
112914    0
Name: SeriousDlqin2yrs, Length: 112915, dtype: int64

In [4]:
X

Unnamed: 0.1,Unnamed: 0,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0.658180,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0.233810,30.0,0.0,0.036050,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0
5,5,0.213179,74.0,0.0,0.375607,3500.0,3.0,0.0,1.0,0.0,1.0
6,6,0.754464,39.0,0.0,0.209940,3500.0,8.0,0.0,0.0,0.0,0.0
7,7,0.189169,57.0,0.0,0.606291,23684.0,9.0,0.0,4.0,0.0,2.0
8,8,0.644226,30.0,0.0,0.309476,2500.0,5.0,0.0,0.0,0.0,0.0
9,9,0.018798,51.0,0.0,0.531529,6501.0,7.0,0.0,2.0,0.0,2.0


# Exercise 3.1

Input the missing values of the Age and Number of Dependents 

In [5]:
# check for missing values
data.isnull().sum()

Unnamed: 0                                 0
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

Se observa que existen missing values en los datos de age, y en los datos de NumberOfDependents

In [6]:
data.age

0         45.0
1         40.0
2         38.0
3         30.0
4         49.0
5         74.0
6         39.0
7         57.0
8         30.0
9         51.0
10        46.0
11        40.0
12        64.0
13        53.0
14        43.0
15        25.0
16        43.0
17        38.0
18        39.0
19        32.0
20        58.0
21        58.0
22        69.0
23        24.0
24        58.0
25        28.0
26        24.0
27        57.0
28        42.0
29        64.0
          ... 
112885    31.0
112886    48.0
112887    63.0
112888    57.0
112889    55.0
112890    43.0
112891    58.0
112892    83.0
112893     NaN
112894    44.0
112895    61.0
112896    52.0
112897    55.0
112898    64.0
112899    43.0
112900    37.0
112901    82.0
112902    26.0
112903    49.0
112904    28.0
112905    31.0
112906    62.0
112907    46.0
112908    59.0
112909    22.0
112910    50.0
112911    74.0
112912    44.0
112913    30.0
112914    64.0
Name: age, Length: 112915, dtype: float64

In [7]:

import matplotlib.pyplot as plt
a = data.age
a


0         45.0
1         40.0
2         38.0
3         30.0
4         49.0
5         74.0
6         39.0
7         57.0
8         30.0
9         51.0
10        46.0
11        40.0
12        64.0
13        53.0
14        43.0
15        25.0
16        43.0
17        38.0
18        39.0
19        32.0
20        58.0
21        58.0
22        69.0
23        24.0
24        58.0
25        28.0
26        24.0
27        57.0
28        42.0
29        64.0
          ... 
112885    31.0
112886    48.0
112887    63.0
112888    57.0
112889    55.0
112890    43.0
112891    58.0
112892    83.0
112893     NaN
112894    44.0
112895    61.0
112896    52.0
112897    55.0
112898    64.0
112899    43.0
112900    37.0
112901    82.0
112902    26.0
112903    49.0
112904    28.0
112905    31.0
112906    62.0
112907    46.0
112908    59.0
112909    22.0
112910    50.0
112911    74.0
112912    44.0
112913    30.0
112914    64.0
Name: age, Length: 112915, dtype: float64

La edad se imputa con la mediana, ya que estpa muy cerca la promedio y se requiere un valor entero para la edad, además que los datos sean fieles a la realidad a la hora de generar un nivel de riesgo


In [8]:
# mean Age
data.age.mean()

51.36130439584714

In [9]:
# median Age
data.age.median()

51.0

In [10]:
#datos mas repetido, la MODA

data.age.mode()

0    49.0
dtype: float64

In [11]:
import statistics as stats
stats.mode(data.age)

49.0

Imputación de los datos:

In [12]:
data.age.fillna(data.age.median(), inplace=True)
data.isnull().sum()

Unnamed: 0                                 0
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                        0
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

El numberof dependents, se imputa con lo mas repetido, es decir la moda, ya que al tener el máximo, ayuda al modelo, ya que se tiene el supuesto que las personas a cargo tiende a aumentar.

In [13]:
data.NumberOfDependents.mode()

0    0.0
dtype: float64

In [14]:
data.NumberOfDependents.mean()

0.8565735218319711

In [15]:
data.NumberOfDependents.median()

0.0

In [16]:
data.describe()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
count,112915.0,112915.0,112915.0,112915.0,112915.0,112915.0,112915.0,112915.0,112915.0,112915.0,112915.0,108648.0
mean,56457.0,0.067449,5.825057,51.347651,0.378807,0.306221,6959.809,8.675561,0.213594,1.015587,0.188531,0.856574
std,32595.89716,0.250799,254.976948,14.178009,3.521621,0.222926,14781.93,5.124575,3.489531,1.080925,3.472207,1.149537
min,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,28228.5,0.0,0.034371,41.0,0.0,0.133458,3637.0,5.0,0.0,0.0,0.0,0.0
50%,56457.0,0.0,0.173016,51.0,0.0,0.278272,5600.0,8.0,0.0,1.0,0.0,0.0
75%,84685.5,0.0,0.570906,61.0,0.0,0.440113,8416.0,11.0,0.0,2.0,0.0,2.0
max,112914.0,1.0,50708.0,103.0,98.0,0.999909,3008750.0,57.0,98.0,29.0,98.0,20.0



Como el promedio es 0.8565, se imputan los datos con el valor de 1

In [17]:
data.NumberOfDependents.fillna(1, inplace=True)
data.isnull().sum()

Unnamed: 0                              0
SeriousDlqin2yrs                        0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
dtype: int64

In [18]:
#data['NumberOfDependents'].astype('int64')
data['NumberOfDependents'] = data['NumberOfDependents'].astype('int64')

In [19]:
data = data.astype('int64')

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112915 entries, 0 to 112914
Data columns (total 12 columns):
Unnamed: 0                              112915 non-null int64
SeriousDlqin2yrs                        112915 non-null int64
RevolvingUtilizationOfUnsecuredLines    112915 non-null int64
age                                     112915 non-null int64
NumberOfTime30-59DaysPastDueNotWorse    112915 non-null int64
DebtRatio                               112915 non-null int64
MonthlyIncome                           112915 non-null int64
NumberOfOpenCreditLinesAndLoans         112915 non-null int64
NumberOfTimes90DaysLate                 112915 non-null int64
NumberRealEstateLoansOrLines            112915 non-null int64
NumberOfTime60-89DaysPastDueNotWorse    112915 non-null int64
NumberOfDependents                      112915 non-null int64
dtypes: int64(12)
memory usage: 10.3 MB


# Exercise 3.2

From the set of features

Select the features that maximize the **F1Score** the model using K-Fold cross-validation

In [165]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [166]:
# para el modelo se calculan los test nad train

# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [167]:
X.shape

(112915, 11)

In [168]:
X_train.shape

(84686, 11)

In [169]:
X_test.shape

(28229, 11)

In [170]:
y_train.shape

(84686,)

In [171]:
y_test.shape

(28229,)

In [172]:
# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [173]:
logreg

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [174]:
# make predictions for testing set
y_pred_class = logreg.predict(X_test)
y_pred_class

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [175]:
# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.932657904991


In [176]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.cross_validation import KFold

In [177]:
#### se realiza el modelos con todas las variables y con K FOLS CROSSVALIDATION ######
# Create k-folds



kf = KFold(X.shape[0], n_folds=10, random_state=0)

results = []

for train_index, test_index in kf:
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make predictions for testing set
    y_pred_class = logreg.predict(X_test)

    # calculate testing accuracy
    results.append(metrics.accuracy_score(y_test, y_pred_class))





In [180]:
results = []
kf

sklearn.cross_validation.KFold(n=112915, n_folds=10, shuffle=False, random_state=0)

In [181]:
pd.Series(results).describe()

count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
dtype: float64

In [182]:
from sklearn.cross_validation import cross_val_score

logreg = LogisticRegression(C=1e9)

results = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')

In [183]:
pd.Series(results).describe()

count    10.000000
mean      0.932728
std       0.000272
min       0.932341
25%       0.932628
50%       0.932696
75%       0.932820
max       0.933310
dtype: float64

In [184]:
###### MATRIZ DE CONFUSIÓN

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred_class)

array([[10483,     6],
       [  793,     9]], dtype=int64)

In [185]:
######## CALCULO DE LOS VALORES DEL F1 ESCORE

from sklearn.metrics import precision_score, recall_score, f1_score
print('precision_score ', precision_score(y_test, y_pred_class))
print('recall_score    ', recall_score(y_test, y_pred_class))

precision_score  0.6
recall_score     0.0112219451372


### F1Score

In [186]:
print('f1_score    ', f1_score(y_test, y_pred_class))

f1_score     0.0220318237454


In [187]:
fscore1 = f1_score(y_test, y_pred_class)
fscore1

0.022031823745410035

In [188]:
# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.929235674431


In [189]:
mod1 = metrics.accuracy_score(y_test, y_pred_class)
mod1

0.92923567443096267

### Selección de variables

In [190]:
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [191]:
logreg = LogisticRegression()
rfe = RFE(logreg, 18)
rfe = rfe.fit(X, y )
print(rfe.support_)
print(rfe.ranking_)
#Se seleccionan las variables con valor 1, como significativas.
#https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
#Para este caso, todas son significativas

[ True  True  True  True  True  True  True  True  True  True  True]
[1 1 1 1 1 1 1 1 1 1 1]


### Primera seleccción de variables

In [192]:
data.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,1,0,45,2,0,9120,13,0,6,0,2
1,1,0,0,40,0,0,2600,4,0,0,0,1
2,2,0,0,38,1,0,3042,2,1,0,0,0
3,3,0,0,30,0,0,3300,5,0,0,0,0
4,4,0,0,49,1,0,63588,7,0,1,0,0


In [193]:
# define X and y
# define X and y
feature_cols = ['age','MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 
                'NumberOfDependents','DebtRatio','NumberOfTime60-89DaysPastDueNotWorse']


y = data['SeriousDlqin2yrs']
#X = data.drop('SeriousDlqin2yrs', axis=1)
X = data[feature_cols]


# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

In [194]:
y.head()

0    1
1    0
2    0
3    0
4    0
Name: SeriousDlqin2yrs, dtype: int64

In [195]:
X.head()

Unnamed: 0,age,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberRealEstateLoansOrLines,NumberOfDependents,DebtRatio,NumberOfTime60-89DaysPastDueNotWorse
0,45,9120,13,6,2,0,0
1,40,2600,4,0,1,0,0
2,38,3042,2,0,0,0,0
3,30,3300,5,0,0,0,0
4,49,63588,7,1,0,0,0


In [196]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred_class)

array([[26302,    13],
       [ 1893,    21]], dtype=int64)

In [197]:
from sklearn.metrics import precision_score, recall_score, f1_score
print('precision_score ', precision_score(y_test, y_pred_class))
print('recall_score    ', recall_score(y_test, y_pred_class))

precision_score  0.617647058824
recall_score     0.0109717868339


In [198]:
print('f1_score    ', f1_score(y_test, y_pred_class))

f1_score     0.0215605749487


In [199]:
fscore2 = f1_score(y_test, y_pred_class)
fscore2

0.021560574948665298

In [200]:
# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.932480782174


In [201]:
mod2 = metrics.accuracy_score(y_test, y_pred_class)
mod2

0.9324807821743597

### Segunda selección de Variables

In [202]:
# define X and y
# define X and y
feature_cols = ['age', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 
                'NumberOfDependents', 'RevolvingUtilizationOfUnsecuredLines', 'NumberOfTimes90DaysLate', 'NumberOfTime30-59DaysPastDueNotWorse']


y = data['SeriousDlqin2yrs']
#X = data.drop('SeriousDlqin2yrs', axis=1)
X = data[feature_cols]


# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

In [203]:
X.head()

Unnamed: 0,age,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberRealEstateLoansOrLines,NumberOfDependents,RevolvingUtilizationOfUnsecuredLines,NumberOfTimes90DaysLate,NumberOfTime30-59DaysPastDueNotWorse
0,45,9120,13,6,2,0,0,2
1,40,2600,4,0,1,0,0,0
2,38,3042,2,0,0,0,1,1
3,30,3300,5,0,0,0,0,0
4,49,63588,7,1,0,0,0,1


In [204]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred_class)

array([[26302,    13],
       [ 1888,    26]], dtype=int64)

In [205]:
from sklearn.metrics import precision_score, recall_score, f1_score
print('precision_score ', precision_score(y_test, y_pred_class))
print('recall_score    ', recall_score(y_test, y_pred_class))

precision_score  0.666666666667
recall_score     0.0135841170324


In [206]:
print('f1_score    ', f1_score(y_test, y_pred_class))


f1_score     0.0266257040451


In [207]:
fscore3 = f1_score(y_test, y_pred_class)
fscore3

0.026625704045058884

In [208]:
# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.932657904991


In [209]:
mod3 = metrics.accuracy_score(y_test, y_pred_class)
mod3

0.93265790499132095

### Comparación de F1

In [210]:
fscore1

0.022031823745410035

In [211]:
fscore2

0.021560574948665298

In [212]:
fscore3

0.026625704045058884

### El mejor es el tercer modelo

# Exercise 3.3

Now which is the best set of features selected by AUC

### AUC Con todas las variables

In [213]:
mod1

0.92923567443096267

### AUC Primera seleccción de variables

In [214]:
mod2

0.9324807821743597

### AUC Segunda seleccción de variables

In [215]:
mod3

0.93265790499132095

### El mejor es el tercer AUC