# HomeCredit 
# Part 2: Dealing with Imbalanced Data

## Introduction:

Before we apply any machine learning algorithm to predict the target on the test set, we need to check whether the training set is balanced. When the training set is imbalanced, prediction quality on the minority class usually suffers. In most cases, as in this project, the training set is imbalanced: we have many examples of applicants who repaid their loan and only a few examples of applicants who did not. If we train on this imbalanced set, the model tends to predict well for applicants with TARGET = 0, but it makes more errors when identifying the applicants who will not repay their loan.
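To see why imbalance is a problem, consider a rough sketch with made-up labels (the 92/8 split below is hypothetical, not the project data). A "model" that always predicts TARGET = 0 looks accurate overall while missing every applicant who does not repay:

```python
# Hypothetical toy labels: 92 repaid loans (0) and 8 unpaid loans (1).
y_true = [0] * 92 + [1] * 8

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of actual TARGET = 1 rows that were found.
targets_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall_targets = targets_found / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall on TARGET = 1: {recall_targets:.2f}")
```

The high accuracy comes entirely from the majority class, which is why we look at the class counts before training anything.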

So, the first task is to find out how imbalanced our training set is. Because the TARGET column contains only zeros and ones, the '.sum()' method gives the number of targets (rows with TARGET = 1). Note that we cannot use the '.count()' method here, because it counts every non-missing value in the TARGET column, zeros and ones alike.
The difference between the total number of rows and the number of rows with TARGET = 1 gives us the number of non-target samples.

In [1]:
import pandas as pd
df_train = pd.read_csv('Data/application_train.csv')

In [2]:
number_of_targets = df_train["TARGET"].sum()
number_of_samples = df_train['SK_ID_CURR'].count()
number_of_non_targets = number_of_samples - number_of_targets
balance_ratio = number_of_targets / number_of_samples * 100
print('We have', number_of_samples, 'samples;', number_of_targets,'of them with TARGETS = 1, and ',\
      number_of_non_targets, 'with TARGET= 0. So, only %.1f'%balance_ratio, '% of samples are targets.' )

We have 307511 samples; 24825 of them with TARGETS = 1, and  282686 with TARGET= 0. So, only 8.1 % of samples are targets.
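As a possible shortcut, pandas can report both class counts in one call with 'value_counts'. The sketch below uses a small made-up frame (the name 'toy' and its values are assumptions for illustration) rather than the real training set:

```python
import pandas as pd

# Hypothetical toy frame standing in for the training data: 8 zeros and 2 ones.
toy = pd.DataFrame({"TARGET": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Number of rows per class.
class_counts = toy["TARGET"].value_counts()

# Fraction of rows per class (the shares sum to 1).
class_shares = toy["TARGET"].value_counts(normalize=True)

print(class_counts)
print(class_shares)
```

On the real data, the same two calls applied to df_train['TARGET'] should give the counts and the percentage computed above.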


# Balancing the Data: Oversampling 

One way to achieve a balanced training set is to repeat the training samples with TARGET = 1 until their number is roughly equal to the number of samples with TARGET = 0.
Now, we calculate 'Oversampling_gain', which indicates how many times the target samples need to be repeated to balance the data set. 'int' truncates the ratio to an integer, and adding 1 rounds it up, so the targets end up slightly more numerous than the non-targets.

In [3]:
extra_targets = number_of_non_targets - number_of_targets
print('In order to achieve a 1:1 ratio balanced dataset we will add', extra_targets, 'extra targets to the training set.')
Oversampling_gain = int((extra_targets + number_of_targets) / number_of_targets)+1
print('Oversampling gain = ', Oversampling_gain)

In order to achieve a 1:1 ratio balanced dataset we will add 257861 extra targets to the training set.
Oversampling gain =  12
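The arithmetic behind the gain can be checked by hand with the class counts printed earlier. The sketch below uses 'math.ceil' as an alternative way to round up; it is not the notebook's own formula, and it differs from 'int(...) + 1' only when the ratio happens to be a whole number:

```python
import math

# Class counts reported earlier in this notebook.
number_of_targets = 24825
number_of_non_targets = 282686

# Smallest whole number of copies of the minority class that
# reaches or exceeds the size of the majority class.
gain = math.ceil(number_of_non_targets / number_of_targets)

# Size of the minority class after repeating it 'gain' times.
oversampled_targets = gain * number_of_targets

print(gain, oversampled_targets)
```

Since 282686 / 24825 is about 11.39, eleven copies would fall short of the non-targets and twelve copies overshoot slightly.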


Now, let's copy the samples with TARGET = 1 from the training set and add them back to it to obtain balanced training data.
First, we separate the rows with TARGET = 1 and save them in a new data frame named 'df_train_ones'. We put the rest of the rows in the 'df_train_zeros' data frame.
df_train['TARGET'] == 1 checks whether each TARGET value is equal to 1; the result is a Series of True and False values. When we put this expression inside brackets after the data frame name, the rows that correspond to True are selected and the others are left out of the new data frame. We obtain 'df_train_zeros' in the same way. Note that TARGET is stored as a number, so we compare against the number 1 rather than the string '1'.

In [4]:
# Split the rows by class: TARGET = 1 (did not repay) and TARGET = 0 (repaid)
df_train_ones = df_train[df_train['TARGET'] == 1]
df_train_zeros = df_train[df_train['TARGET'] == 0]
# To free some memory, we delete the original data frame
del df_train
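Row selection with a True/False mask is easier to follow on a tiny example. The frame below is made up for illustration (the IDs and the names 'toy', 'toy_ones' and 'toy_zeros' are assumptions):

```python
import pandas as pd

# Hypothetical toy frame with five applicants.
toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4, 5],
                    "TARGET": [0, 1, 0, 0, 1]})

# Boolean Series with one True/False value per row.
mask = toy["TARGET"] == 1

# Rows where the mask is True, and rows where it is False (~ inverts the mask).
toy_ones = toy[mask]
toy_zeros = toy[~mask]

print(toy_ones)
print(toy_zeros)
```

Every row lands in exactly one of the two frames, so their lengths add up to the length of the original.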

'pd.concat' joins a list of data frames into one bigger data frame. Here, '[df_train_ones]*Oversampling_gain' builds a list that holds 'df_train_ones' 12 times, so concatenating that list repeats every target sample 12 times. In the last line, we add the zero-target samples to the oversampled targets to obtain the balanced data set.

In [5]:
df_train_ones_oversampled = pd.concat([df_train_ones]*Oversampling_gain)
del df_train_ones
df_train_balance = pd.concat([df_train_zeros,df_train_ones_oversampled])
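The same two concatenation steps can be tried on a small made-up example. All names and values below are hypothetical; 'ignore_index=True' is an optional extra that the notebook itself does not use:

```python
import pandas as pd

# Hypothetical toy classes: 2 targets and 5 non-targets.
toy_ones = pd.DataFrame({"SK_ID_CURR": [2, 5], "TARGET": [1, 1]})
toy_zeros = pd.DataFrame({"SK_ID_CURR": [1, 3, 4, 6, 7], "TARGET": [0, 0, 0, 0, 0]})

gain = 3  # 5 / 2 = 2.5, rounded up

# [toy_ones] * gain is a plain Python list holding the same frame 3 times.
toy_oversampled = pd.concat([toy_ones] * gain)

# Without ignore_index the original row labels repeat: 0, 1, 0, 1, 0, 1.
print(toy_oversampled.index.tolist())

# ignore_index=True renumbers the rows of the combined frame from 0.
toy_balanced = pd.concat([toy_zeros, toy_oversampled], ignore_index=True)
print(toy_balanced["TARGET"].value_counts())
```

Repeated row labels are harmless for training, but renumbering can help if later steps select rows by label.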

In [7]:
nofTargets = df_train_balance['TARGET'].sum()
nofNonTargets = len(df_train_balance) - nofTargets
print('Number of zeros in the balance data set: ', nofNonTargets)
print('Number of ones in the balance data set: ', nofTargets)
ratio = nofTargets / nofNonTargets
print('So, a ratio of %.2f' %ratio, 'is achieved after oversampling.')

Number of zeros in the balance data set:  282686
Number of ones in the balance data set:  297900
So, a ratio of 1.05 is achieved after oversampling.
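The ratio is slightly above 1 because whole copies of the target rows are added. If an exact 1:1 ratio were wanted, one option would be random oversampling, that is, drawing target rows with replacement until the two classes match. This is a different technique from the one used in this notebook, sketched here on made-up data:

```python
import pandas as pd

# Hypothetical toy classes: 7 non-targets and 2 targets.
toy_zeros = pd.DataFrame({"TARGET": [0] * 7})
toy_ones = pd.DataFrame({"TARGET": [1] * 2})

# Draw target rows with replacement until there are as many as non-targets.
# random_state fixes the draw so that the result is repeatable.
toy_ones_resampled = toy_ones.sample(n=len(toy_zeros), replace=True, random_state=0)

toy_balanced = pd.concat([toy_zeros, toy_ones_resampled], ignore_index=True)
print(toy_balanced["TARGET"].value_counts())
```

With random draws, some target rows may be repeated more often than others, whereas whole-copy repetition treats every target row equally.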


After confirming that the data frame is balanced, we save it as 'balancedData.pkl' using '.to_pickle'. In the next part, we will load this data frame from 'balancedData.pkl'.

In [6]:
df_train_balance.to_pickle('balancedData.pkl')
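Loading the file back is done with 'pd.read_pickle'. The round trip below is a minimal sketch on a made-up frame, written to a temporary folder so that it does not touch 'balancedData.pkl' (the file name 'toy_balanced.pkl' is hypothetical):

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the balanced training data.
toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_balanced.pkl")
    toy.to_pickle(path)               # write the frame to disk
    restored = pd.read_pickle(path)   # read it back into a new frame

# True when values, column names, index and dtypes all match.
print(restored.equals(toy))
```

Unlike a CSV file, a pickle keeps the column types and the index as they were, so nothing needs to be parsed again on loading.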

# What is next?
In the next part, we will use the balanced data saved above to train a decision tree.