<h2><center> German Credit Prediction </center></h2>

In [1]:
#importing all the required packages
import pandas as pd
import sklearn as sk
import numpy as np
import os.path

## 1. Loading the dataset from CSV

In [2]:
cwd = os.getcwd()
url = os.path.join(cwd, 'DATA', 'index.csv') #Build the file path in a platform-independent way
GermanCredit = pd.read_csv(url)
GermanCredit.head() #View the first five rows as a sample of the data

Unnamed: 0,Creditability,Account Balance,Duration of Credit (month),Payment Status of Previous Credit,Purpose,Credit Amount,Value Savings/Stocks,Length of current employment,Instalment per cent,Sex & Marital Status,...,Duration in Current address,Most valuable available asset,Age (years),Concurrent Credits,Type of apartment,No of Credits at this Bank,Occupation,No of dependents,Telephone,Foreign Worker
0,1,1,18,4,2,1049,1,2,4,2,...,4,2,21,3,1,1,3,1,1,1
1,1,1,9,4,0,2799,1,3,2,3,...,2,1,36,3,1,2,3,2,1,1
2,1,2,12,2,9,841,2,4,2,2,...,4,1,23,3,1,1,2,1,1,1
3,1,1,12,4,0,2122,1,3,3,3,...,2,1,39,3,1,2,2,2,1,2
4,1,1,12,4,0,2171,1,3,4,3,...,4,2,38,1,2,2,2,1,1,2


In [3]:
GermanCredit.columns

Index([u'Creditability', u'Account Balance', u'Duration of Credit (month)',
       u'Payment Status of Previous Credit', u'Purpose', u'Credit Amount',
       u'Value Savings/Stocks', u'Length of current employment',
       u'Instalment per cent', u'Sex & Marital Status', u'Guarantors',
       u'Duration in Current address', u'Most valuable available asset',
       u'Age (years)', u'Concurrent Credits', u'Type of apartment',
       u'No of Credits at this Bank', u'Occupation', u'No of dependents',
       u'Telephone', u'Foreign Worker'],
      dtype='object')

## 2. Data Cleaning Process

In [4]:
GermanCredit.rename(columns = {'Duration of Credit (month)':'CreditDuration'
                              ,'Length of current employment':'CurrentEmploymentDuration'
                              ,'Duration in Current address':'CurrentAddressDuration'
                              ,'No of Credits at this Bank':'NoOfCreditsInBank'
                              ,'Type of apartment':'ApartmentType'
                              ,'Instalment per cent':'InstalmentPercentage'
                              ,'No of dependents': 'NoOfDependents'
                              ,'Most valuable available asset':'MostValuableAsset'
                              ,'Value Savings/Stocks': 'ValueSavingsOrStocks'
                              ,'Sex & Marital Status': 'SexAndMaritalStatus'},inplace=True)

In [5]:
GermanCredit.isnull() #To check if we have any null values in the data

Unnamed: 0,Creditability,Account Balance,CreditDuration,Payment Status of Previous Credit,Purpose,Credit Amount,ValueSavingsOrStocks,CurrentEmploymentDuration,InstalmentPercentage,SexAndMaritalStatus,...,CurrentAddressDuration,MostValuableAsset,Age (years),Concurrent Credits,ApartmentType,NoOfCreditsInBank,Occupation,NoOfDependents,Telephone,Foreign Worker
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [6]:
GermanCredit.isnull().any()

Creditability                        False
Account Balance                      False
CreditDuration                       False
Payment Status of Previous Credit    False
Purpose                              False
Credit Amount                        False
ValueSavingsOrStocks                 False
CurrentEmploymentDuration            False
InstalmentPercentage                 False
SexAndMaritalStatus                  False
Guarantors                           False
CurrentAddressDuration               False
MostValuableAsset                    False
Age (years)                          False
Concurrent Credits                   False
ApartmentType                        False
NoOfCreditsInBank                    False
Occupation                           False
NoOfDependents                       False
Telephone                            False
Foreign Worker                       False
dtype: bool

The data does not contain any NaN values: `isnull().any()` returns False for every column, so every cell holds a value.
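A more compact way to read the same check is to count missing cells per column instead of printing the full boolean table. The sketch below uses a small hypothetical frame named `toy` (not the real `GermanCredit` data) with one value deliberately left missing, so the counts are visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for GermanCredit;
# one CreditAmount value is deliberately missing.
toy = pd.DataFrame({
    "Creditability": [1, 0, 1],
    "CreditAmount": [1049, np.nan, 841],
})

missing_per_column = toy.isnull().sum()        # number of NaN cells in each column
total_missing = int(missing_per_column.sum())  # number of NaN cells in the whole frame

print(missing_per_column)
print("Total missing cells:", total_missing)
```

On the real dataset, a total of 0 would confirm the conclusion above in a single number.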

In [7]:
GermanCredit.shape

(1000, 21)

In [8]:
GC_with_Condition = GermanCredit.dropna(thresh=1)

In [9]:
GC_with_Condition.shape

(1000, 21)

In [10]:
GC_with_Condition.columns = GC_with_Condition.columns.str.replace(' ','') #Remove the spaces from the column names

The data contents now look fine. Next, we check the value distribution of each column for irregularities, which makes the later analysis more reliable.

To make sure that every level of each categorical variable is clearly defined, I explore the categorical variables with the `value_counts()` function.

In [11]:
GC_with_Condition['SexAndMaritalStatus'].value_counts()

3    548
2    310
4     92
1     50
Name: SexAndMaritalStatus, dtype: int64

In [12]:
GC_with_Condition['ApartmentType'].value_counts()

2    714
1    179
3    107
Name: ApartmentType, dtype: int64

In [13]:
GC_with_Condition['ForeignWorker'].value_counts()

1    963
2     37
Name: ForeignWorker, dtype: int64

In [14]:
GC_with_Condition['MostValuableAsset'].value_counts()

3    332
1    282
2    232
4    154
Name: MostValuableAsset, dtype: int64

In [15]:
GC_with_Condition['Occupation'].value_counts()

3    630
2    200
4    148
1     22
Name: Occupation, dtype: int64

In [16]:
GC_with_Condition['ValueSavingsOrStocks'].value_counts()

1    603
5    183
2    103
3     63
4     48
Name: ValueSavingsOrStocks, dtype: int64
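The per-column checks above could also be run in a single loop. This is a minimal sketch on a hypothetical frame named `toy`; in the notebook the frame would be `GC_with_Condition`, and the list of columns to treat as categorical is an assumption to adjust.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be GC_with_Condition.
toy = pd.DataFrame({
    "ApartmentType": [2, 2, 1, 3, 2],
    "ForeignWorker": [1, 1, 1, 2, 1],
})

# Assumed subset of categorical columns, for illustration only.
categorical_columns = ["ApartmentType", "ForeignWorker"]

counts = {}
for column in categorical_columns:
    # sort_index() orders the result by category code rather than by frequency
    counts[column] = toy[column].value_counts().sort_index()
    print(column)
    print(counts[column])
```

Sorting by category code makes it easier to spot a code that is missing or unexpected.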

With no missing values and only expected category codes in these columns, the data is clean and ready for predictive analytics.

## 3. Modeling Process

In [17]:
#Split the dataset into independent (feature) and dependent (target) variables

X = GC_with_Condition.iloc[:,1:21] #independent variables: the 20 feature columns
Y = GC_with_Condition.iloc[:,0] #dependent variable: Creditability

Y.head()

0    1
1    1
2    1
3    1
4    1
Name: Creditability, dtype: int64
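Selecting by position works here because `Creditability` is the first column. A name-based split is a possible alternative that does not depend on column order. The sketch below uses a hypothetical three-column frame named `toy`; the names `X_named` and `Y_named` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame with the target column first, as in the notebook.
toy = pd.DataFrame({
    "Creditability": [1, 0, 1],
    "CreditAmount": [1049, 2799, 841],
    "CreditDuration": [18, 9, 12],
})

target_column = "Creditability"

X_named = toy.drop(target_column, axis=1)  # every column except the target
Y_named = toy[target_column]               # the target column only

print(list(X_named.columns))
print(Y_named.tolist())
```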

In [30]:
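#train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
GC_X_Train, GC_X_Test, GC_Y_Train, GC_Y_Test = train_test_split(X,Y,test_size=0.2, random_state=0)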
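If the two `Creditability` classes are not equally frequent, a stratified split keeps their proportions similar in the training and test sets. This is an optional variation and not what the notebook does. The sketch uses hypothetical toy arrays and the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, 14 rows of class 1 and 6 of class 0.
features = np.arange(40).reshape(20, 2)
labels = np.array([1] * 14 + [0] * 6)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# With stratification, both shares should be close to the overall share of class 1.
print("train share of class 1:", y_train.mean())
print("test share of class 1:", y_test.mean())
```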

In [19]:
GC_X_Train.sample(10)

Unnamed: 0,AccountBalance,CreditDuration,PaymentStatusofPreviousCredit,Purpose,CreditAmount,ValueSavingsOrStocks,CurrentEmploymentDuration,InstalmentPercentage,SexAndMaritalStatus,Guarantors,CurrentAddressDuration,MostValuableAsset,Age(years),ConcurrentCredits,ApartmentType,NoOfCreditsInBank,Occupation,NoOfDependents,Telephone,ForeignWorker
353,1,12,4,0,3499,1,3,3,2,2,2,1,29,3,2,2,3,1,1,1
132,3,6,2,2,2116,1,3,2,3,1,2,1,41,3,2,1,3,1,2,1
639,1,24,4,1,2957,1,5,4,3,1,4,2,63,3,2,2,3,1,2,1
591,1,8,4,10,1164,1,5,3,3,1,4,4,51,1,3,2,4,2,2,1
548,2,12,2,0,1007,4,3,4,4,1,1,1,22,3,2,1,3,1,1,1
487,2,18,4,2,3612,1,5,3,2,1,4,2,37,3,2,1,3,1,2,2
309,4,54,0,1,9436,5,3,2,3,1,2,2,39,3,2,1,2,2,1,1
849,2,36,2,6,12612,2,3,1,3,1,4,4,47,3,3,1,3,2,2,1
798,1,24,2,0,915,5,5,4,2,1,2,3,29,1,2,1,3,1,1,1
179,4,48,3,3,12749,3,4,4,3,1,1,3,37,3,2,1,4,1,2,1


In [20]:
GC_X_Test.sample(10)

Unnamed: 0,AccountBalance,CreditDuration,PaymentStatusofPreviousCredit,Purpose,CreditAmount,ValueSavingsOrStocks,CurrentEmploymentDuration,InstalmentPercentage,SexAndMaritalStatus,Guarantors,CurrentAddressDuration,MostValuableAsset,Age(years),ConcurrentCredits,ApartmentType,NoOfCreditsInBank,Occupation,NoOfDependents,Telephone,ForeignWorker
844,2,24,4,10,11938,1,3,2,3,2,3,3,39,3,2,2,4,2,2,1
989,2,24,2,0,2718,1,3,3,2,1,4,2,20,3,1,1,2,1,2,1
255,3,15,2,9,2687,1,4,2,3,1,4,2,26,3,1,1,3,1,2,1
698,2,45,4,1,4576,2,1,3,3,1,4,3,27,3,2,1,3,1,1,1
958,2,30,0,9,4280,2,3,4,2,1,4,3,26,3,1,2,2,1,1,1
231,4,6,0,3,426,1,5,4,4,1,4,3,39,3,2,1,2,1,1,1
710,4,24,2,0,3757,1,5,4,2,2,4,4,62,3,3,1,3,1,2,1
545,2,8,2,9,907,1,2,3,4,1,2,1,26,3,2,1,3,1,2,1
672,2,18,2,3,1113,1,3,4,2,3,4,1,26,3,2,1,2,2,1,1
643,1,48,4,1,6143,1,5,4,2,1,4,4,58,2,3,2,2,1,1,1


## 4. Implementation of Predictive Model

The predictive model is implemented using logistic regression.

For this, we fit the `LogisticRegression` model from the sklearn package on the training set.

In [21]:
#Binary logistic regression

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

In [22]:
clf.fit(GC_X_Train, GC_Y_Train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [23]:
predictions = clf.predict(GC_X_Test)

## 5. Model Evaluation

In [25]:
from sklearn.metrics import confusion_matrix
clf.score(GC_X_Test, GC_Y_Test)

0.765

In [26]:
confusion_matrix(GC_Y_Test, predictions)

array([[ 28,  40],
       [  7, 125]])

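The confusion matrix above gives the counts of true negatives, false positives, false negatives and true positives. scikit-learn puts the actual classes in rows and the predicted classes in columns, so with class 1 as the positive class there are 28 true negatives, 40 false positives, 7 false negatives and 125 true positives. The correct predictions (28 + 125 = 153 of 200) match the accuracy of 0.765 reported by `clf.score`.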
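The four counts can be turned into the usual summary metrics by hand. The sketch below copies the counts from the confusion matrix output, reading it in scikit-learn's layout `[[TN, FP], [FN, TP]]` (rows are actual classes, columns are predicted classes), with class 1 assumed to be the positive class.

```python
# Counts copied from the confusion matrix output, read as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 28, 40, 7, 125

total = float(tn + fp + fn + tp)      # float() keeps the divisions exact on Python 2 as well

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / float(tp + fp)       # share of predicted positives that are truly positive
recall = tp / float(tp + fn)          # share of actual positives that were found
specificity = tn / float(tn + fp)     # share of actual negatives that were found

print('accuracy:    {0:0.3f}'.format(accuracy))
print('precision:   {0:0.3f}'.format(precision))
print('recall:      {0:0.3f}'.format(recall))
print('specificity: {0:0.3f}'.format(specificity))
```

Recall for class 1 is high while specificity is low, which means the model rarely misses a class-1 applicant but labels many class-0 applicants as class 1.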

In [29]:
y_score = clf.decision_function(GC_X_Test)

from sklearn.metrics import average_precision_score
average_precision = average_precision_score(GC_Y_Test, y_score)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))

Average precision-recall score: 0.88
