# HomeCredit 
# Part 2: Dealing with Imbalanced Data

## Introduction:

Before applying any machine learning algorithm to predict the target for the test set, we need to check whether the training set is balanced. If the training set is imbalanced, prediction quality for the minority class can drop significantly. As in most cases, the training set in this project is imbalanced: we have many examples of applicants who repaid their loan and only a few examples of applicants who did not. If we train on this imbalanced set, the model tends to predict applicants with TARGET = 0 well and to make more errors when identifying the applicants who will not repay their loan.

So, the first task is to find out how imbalanced our training set is. Because TARGET contains only zeros and ones, the '.sum()' method gives the number of samples with TARGET = 1. Note that we cannot use the '.count()' method for this, because it counts every non-missing value in the TARGET column, zeros and ones alike.
The difference between the total number of rows and the number of rows with TARGET = 1 gives the number of non-target samples.

In [1]:
import pandas as pd
df_train = pd.read_csv('Data/application_train.csv')
#df_test = pd.read_csv('Data/application_test.csv')

In [2]:
number_of_targets = df_train["TARGET"].sum()
number_of_samples = df_train['SK_ID_CURR'].count()
number_of_non_targets = number_of_samples - number_of_targets
balance_ratio = number_of_targets / number_of_samples * 100
print('We have', number_of_samples, 'samples;', number_of_targets,'of them with TARGETS = 1, and ',\
      number_of_non_targets, 'with TARGET= 0. So, only %.1f'%balance_ratio, '% of samples are targets.' )


We have 307511 samples; 24825 of them with TARGETS = 1, and  282686 with TARGET= 0. So, only 8.1 % of samples are targets.
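
The same class breakdown can also be read off in one step with `value_counts`. The following is a minimal sketch on a made-up toy column (it stands in for `df_train['TARGET']` and is not the real data):

```python
import pandas as pd

# Toy TARGET column standing in for df_train['TARGET'] (values are made up).
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="TARGET")

class_counts = target.value_counts()               # number of rows per class
class_share = target.value_counts(normalize=True)  # fraction of rows per class

print(class_counts)
print((class_share * 100).round(1))
```

With `normalize=True` the result is a fraction per class, so multiplying by 100 gives the same kind of percentage as `balance_ratio` above.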


# Balancing the Data: Oversampling 

One way to achieve a balanced training set is to repeat the training samples with TARGET = 1 until the number of samples with TARGET = 1 is roughly equal to the number of samples with TARGET = 0.
Now we calculate 'Oversampling_gain', which indicates how many times the target samples need to be repeated to balance the data set. 'int' truncates the ratio to an integer, and adding 1 rounds it up.

In [3]:
extra_targets = number_of_non_targets - number_of_targets
print('In order to achieve a 1:1 ratio balanced dataset we will add', extra_targets, 'extra targets to the training set.')
Oversampling_gain = int((extra_targets + number_of_targets) / number_of_targets)+1
print('Oversampling gain = ', Oversampling_gain)

In order to achieve a 1:1 ratio balanced dataset we will add 257861 extra targets to the training set.
Oversampling gain =  12


Now, let's copy the samples with TARGET = 1 from the training set and add them back to the training set to obtain balanced training data.
First, we separate the rows with TARGET = 1 and save them in a new data frame named 'df_train_ones'. We put the rest of the rows in the 'df_train_zeros' data frame.
df_train['TARGET'] == 1 checks whether each TARGET value is equal to 1; the result of this comparison is a Series of True and False values. When we put this expression inside the brackets after the data frame name, only the rows marked True are kept in the new data frame. We build the 'df_train_zeros' data frame similarly. Because TARGET is stored as numbers, we compare against the number 1 rather than the string '1'.

In [4]:
df_train_ones = df_train[df_train['TARGET'] == 1]
df_train_zeros = df_train[df_train['TARGET'] == 0]

'pd.concat' joins a list of data frames into one bigger data frame. Here, the list contains 'df_train_ones' repeated 'Oversampling_gain = 12' times, so the result holds 12 copies of every TARGET = 1 row. In the second line, we add the TARGET = 0 samples to the oversampled data frame to obtain the balanced data set.

In [8]:
df_train_ones_oversampled = pd.concat([df_train_ones]*Oversampling_gain)
df_train_samples = pd.concat([df_train_zeros,df_train_ones_oversampled])
print(len(df_train_ones))
print(len(df_train_ones_oversampled))

NameError: name 'df_train_ones' is not defined
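
Repeating the minority rows a whole number of times only approximates a 1:1 ratio (12 copies of 24825 rows give 297900 rows, slightly more than the 282686 zeros). As an alternative, here is a minimal sketch that draws minority rows with replacement until both classes have exactly the same size. The toy frame and the variable names are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Toy frame standing in for df_train (values are made up): 8 zeros, 2 ones.
toy = pd.DataFrame({"SK_ID_CURR": range(10),
                    "TARGET": [0] * 8 + [1] * 2})

zeros = toy[toy["TARGET"] == 0]
ones = toy[toy["TARGET"] == 1]

# Draw minority rows with replacement until both classes have the same size.
ones_resampled = ones.sample(n=len(zeros), replace=True, random_state=0)
balanced = pd.concat([zeros, ones_resampled], ignore_index=True)

print(balanced["TARGET"].value_counts())
```

Fixing `random_state` makes the draw reproducible; which of the two approaches is preferable depends on whether an exact 1:1 ratio matters for the model.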

In [None]:
import pandas as pd
import numpy as np
from matplotlib.pyplot import plot

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
import time
#from lightgbm import LGBMClassifier
#import lightgbm as lgb

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.simplefilter('ignore', UserWarning)

import gc
gc.enable()

from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error


# Applying a Random Forest Model to the Training Data:

In [None]:
d = defaultdict(LabelEncoder)
mydf = df_train_samples  # balanced data frame built in the oversampling step
my_data = mydf.apply(lambda x: d[x.name].fit_transform(x.astype(str)))
y = my_data.TARGET
# Let's select all columns except the target and the applicant's ID as training features.
columns_of_interest =(my_data.drop(['TARGET','SK_ID_CURR'], axis=1)).columns
X = my_data [columns_of_interest]
#print('Columns used as training features are: \n', X)
#X.describe()


In [None]:
from sklearn.model_selection import train_test_split
# We split the training set to use some of the samples for validating our model.
train_X, val_X, train_y, val_y = train_test_split(X, y,test_size = 0.25, random_state = 20)
# Report the shapes of the resulting splits
print('Training Features Shape:', train_X.shape)
print('Training Labels Shape:', train_y.shape)
print('Validation Features Shape:', val_X.shape)
print('Validation Labels Shape:', val_y.shape)
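
One thing to watch with this order of operations: when the split happens after oversampling, copies of the same TARGET = 1 row can end up in both the training and the validation part. A possible alternative, sketched here on a made-up toy frame (all names below are illustrative assumptions), is to split first and oversample only the training part:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df_train (values are made up): 16 zeros, 4 ones.
toy = pd.DataFrame({"SK_ID_CURR": range(20),
                    "TARGET": [0] * 16 + [1] * 4})

# 1) Split first, keeping the class ratio in both parts.
train_part, val_part = train_test_split(
    toy, test_size=0.25, stratify=toy["TARGET"], random_state=20)

# 2) Oversample the minority class in the training part only.
train_zeros = train_part[train_part["TARGET"] == 0]
train_ones = train_part[train_part["TARGET"] == 1]
train_ones_up = train_ones.sample(n=len(train_zeros), replace=True, random_state=0)
train_balanced = pd.concat([train_zeros, train_ones_up])

# Applicants shared between the balanced training data and the validation part.
overlap = set(train_balanced["SK_ID_CURR"]) & set(val_part["SK_ID_CURR"])
print(len(overlap))
```

The validation part then keeps the original class ratio, so metrics measured on it reflect the imbalance the model will face on unseen data.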

# Train Model:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Create a random forest Classifier. By convention, clf means 'Classifier'
#clf = RandomForestClassifier(n_jobs=2, random_state=0, class_weight="balanced")
clf = RandomForestClassifier(class_weight="balanced", max_features = 11)

# Train the Classifier to take the training features and learn how they relate to the training y.
clf.fit(train_X, train_y)
#clf.fit(train_X, np.ones(len(train_X)))

# Make Predictions on the Validation Set:

In [None]:
# Use the forest's predict method on the validation data
predictions = clf.predict(val_X)
predictions.shape

# Evaluate Classifier


In [None]:
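from sklearn.metrics import log_loss

# Signed error per sample: +1 is a false positive, -1 is a false negative
error = predictions - val_y
# Print the mean signed error (positive means more false positives than false negatives)
print('Mean Error:', round(np.mean(error), 4))
val_y.shape
# Log loss computed on the hard 0/1 predictions (not on predicted probabilities)
cv_scores = log_loss(val_y, predictions)
print('cv_scores =', cv_scores)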
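
Log loss is designed for predicted probabilities; with hard 0/1 labels every wrong prediction is penalised as heavily as the clipping allows. The sketch below writes the binary log-loss formula out in NumPy on made-up numbers to show the difference. The helper `binary_log_loss` is a hypothetical name introduced for illustration; in the notebook itself, probabilities would come from scikit-learn's `clf.predict_proba(val_X)[:, 1]`.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for probabilities of exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = [0, 0, 1, 1]            # made-up labels
proba = [0.1, 0.4, 0.35, 0.8]    # made-up probabilities of class 1
hard = [0, 0, 0, 1]              # the same predictions thresholded at 0.5

soft_loss = binary_log_loss(y_true, proba)
hard_loss = binary_log_loss(y_true, hard)
print(soft_loss)
print(hard_loss)
```

The single misclassified sample dominates the hard-label loss, which is why the probability-based value is usually the more informative one.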


In [None]:
# Let's analyse the results:
result=pd.DataFrame({'Validation':val_y,'Predictions':predictions, 'Error': error})
result.describe()
#print(result)
print('Number of false predictions:', result['Error'].abs().sum())
print('Number of total predictions:', result['Error'].count())
# Correct predictions = total predictions minus false predictions
print('Number of correct predictions:', result['Error'].count()-result['Error'].abs().sum())
fp = (result['Error']==1).sum()
fn = (result['Error']==-1).sum()
tp = ((result['Predictions']==1)&(result['Validation']==1)).sum()
tn = ((result['Predictions']==0)&(result['Validation']==0)).sum()
print('Number of false positives:',fp )
print('Number of false negatives:',fn)
print('Number of true positives:', tp)
print('Number of true negatives:', tn)
print('Mean Absolute Error = %.2f'% (100*mean_absolute_error(val_y, predictions)),'%')
precision = round((tp*100)/(tp+fp),2)
print('Precision =', precision,'%')
recall = round((tp*100)/(tp+fn),2)
print('Recall = %.2f'%recall,'%')
F1 = round(2*(precision*recall)/(precision+recall),2)
print('F1 score = %.2f' %F1, '%')
accuracy = round(100*(tp +tn)/(tp+tn+fp+fn),2)
print('Accuracy = %.2f' %accuracy, '%')

#Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
#ROC Curves:
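
The hand-computed metrics above, together with the Cohen's kappa and ROC measures mentioned in the comments, are also available in `sklearn.metrics`. The following is a minimal sketch on made-up labels and scores, not on the notebook's actual predictions:

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 0]                   # made-up labels
y_pred = [0, 0, 1, 0, 1, 1, 0, 0]                   # made-up hard predictions
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.7, 0.4, 0.2]  # made-up scores for class 1

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)  # agreement corrected for chance
auc = roc_auc_score(y_true, y_score)       # needs scores, not hard labels
print(precision, recall, f1, accuracy, kappa, auc)
```

Note that these functions return fractions between 0 and 1 (kappa can be negative), whereas the cell above reports percentages.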

# Predict the Test results

In [None]:

#print(val_X.head(1))
#print(test.shape)
#test_X = val_X.iloc[1:2, :]
#print(test.head())
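# Load the test set and select every column except the applicant's ID.
test = pd.read_csv('Data/application_test.csv')
Test_columns_of_interest =(test.drop(['SK_ID_CURR'], axis=1)).columns
Test_X = test[Test_columns_of_interest]

# Caution: fit_transform re-fits each encoder on the test values, so the integer
# codes are not guaranteed to match the codes assigned during training.
p = clf.predict(Test_X.apply(lambda x: d[x.name].fit_transform(x.astype(str))))
f_result=pd.DataFrame({'SK_ID_CURR':test['SK_ID_CURR'],'TARGET':p})
my_result= f_result.set_index('SK_ID_CURR')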

#print(p.shape)
#print('My result = ',my_result)
#print('Test Index =\n', test_X.index)
#test.index.shape
#print('Actual Target Value =\n',val_y[test_X.index])
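
For categorical columns, one way to keep the training and test encodings consistent is to fit an encoder once on the training data and reuse it for the test data. The sketch below uses scikit-learn's `OrdinalEncoder` (the `handle_unknown` option requires scikit-learn 0.24 or later). The column names and values are made up for illustration and this is not the approach used in the notebook:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical columns standing in for the training and test features.
train_cat = pd.DataFrame({"CONTRACT": ["Cash", "Revolving", "Cash"],
                          "GENDER": ["F", "M", "F"]})
test_cat = pd.DataFrame({"CONTRACT": ["Revolving", "Cash"],
                         "GENDER": ["M", "XNA"]})  # "XNA" never occurs in train_cat

# Categories not seen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_cat)                      # learn the category-to-code mapping once

train_codes = encoder.transform(train_cat)  # same mapping for training ...
test_codes = encoder.transform(test_cat)    # ... and for the test set
print(test_codes)
```

Because the mapping is learned only from the training data, a given category receives the same code in both sets, and unseen test categories are flagged explicitly.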

# Export the result to SQL


In [None]:
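import sqlite3
conn = sqlite3.connect('Summary.sqlite')
cur = conn.cursor()
# Label stored in the Method column; this value is an assumed placeholder.
Method = 'Random Forest with oversampling'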

cur.execute('DROP TABLE IF EXISTS Results')

cur.execute('''
CREATE TABLE Results (Method TEXT, Accuracy FLOAT, Precision FLOAT, F1 FLOAT, Recall FLOAT)''')
cur.execute('''INSERT INTO Results (Method, Accuracy, Precision, F1, Recall)
               VALUES (?, ?, ?, ?, ?)''', (Method, accuracy, precision, F1, recall))
conn.commit()
cur.close()
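
To check what was written, the table can be read back into a data frame with `pd.read_sql_query`. This is a minimal sketch against a throwaway in-memory database with made-up numbers, so it does not touch 'Summary.sqlite'; `conn_demo` and `summary` are illustrative names:

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")  # throwaway in-memory database
conn_demo.execute(
    "CREATE TABLE Results (Method TEXT, Accuracy FLOAT, Precision FLOAT, F1 FLOAT, Recall FLOAT)")
conn_demo.execute(
    "INSERT INTO Results (Method, Accuracy, Precision, F1, Recall) VALUES (?, ?, ?, ?, ?)",
    ("demo", 90.0, 80.0, 84.21, 88.89))  # made-up numbers
conn_demo.commit()

# Read the whole table back as a pandas data frame.
summary = pd.read_sql_query("SELECT * FROM Results", conn_demo)
print(summary)
conn_demo.close()
```

The same query run against `conn` would show the row inserted by the cell above.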

In [None]:
print(f_result['TARGET'].sum())
print(f_result['TARGET'].count())
f_result['TARGET'].sum()/f_result['TARGET'].count()

# Conclusion:
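The recall increased to almost 100%, with high accuracy and F1 score. There are no false negatives, which is very good: on the validation set, every applicant whom we predict to repay their loan actually does.
There are a few applicants whom we predict will not repay their loan but who actually do. So the model is slightly conservative in identifying trustworthy applicants, which is acceptable. We can identify all applicants who will not repay their loan. Keep in mind that the validation set was split off after oversampling, so copies of the same TARGET = 1 applicant can appear in both the training and validation sets, and these scores are likely optimistic.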

In [None]:
#from sqlalchemy import create_engine
#engine = create_engine('sqlite://', echo=False)
my_result.to_sql('MyResult2', conn, if_exists='replace')
#engine.execute("SELECT * FROM FinalResults").fetchall()