# Home Credit Risk Default: Examining Data
## <font face="times" color = "#990000"> Introduction: </font>
A financial institution has received several loan applications and wants to find qualified applicants to receive a loan. Unfortunately, no credit history is available for these applicants. However, historical data is available from previous applicants. This program aims to predict which applicants will repay the loan and which will not.

The data comes from a Kaggle competition. You can find more information at https://www.kaggle.com/c/home-credit-default-risk

## <font face="times" color = "#990000"> Reading the data: </font>
First, we need to look at the data. In this project, I downloaded the training and test sets <a href="https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets">(what are training/test sets?)</a> using the API provided by <a href="https://www.kaggle.com/c/home-credit-default-risk/data">Kaggle</a>.
We assume you have saved the data (the 'application_train.csv' and 'application_test.csv' files) in a folder named "Data", located in the same folder as your Python file (in this example, the homecredit folder).

Since the data files have the .csv extension, you need to read them using <font face="courier new" color = "#990000">pd.read_csv</font> as shown below. Here, <font face="courier new" color = "#990000">'pd'</font> is the alias we use for the pandas library, and <font face="courier new" color = "#990000">'read_csv'</font> is a pandas function. Before we use any pandas function, we have to import the library using <font face="courier new" color = "#990000">'import'</font>. The training data is read and saved in a dataframe which we name <font face="courier new" color = "#009933">'df_train'</font>. Similarly, the test set can be read and saved in a dataframe which we name <font face="courier new" color = "#009933">'df_test'</font> (that line is commented out in the cell below).

You can also refer to the file by its complete path, e.g., <font face="courier new" color = "#990000">'C:/Users/marjan/homecredit/Data/application_train.csv'</font>.

Now, you can run the cell below by selecting it and pressing "Shift"+"Enter". You will notice that the number beside the cell changes to *, which indicates that the cell is running. Reading the data may take a few seconds or more, depending on its size. When the star changes to a number, the file has been read. You will not see any output here unless an error occurs.

In [1]:
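import pandas as pd
# Read the training set; the path is relative to the folder of this notebook.
df_train = pd.read_csv('Data/application_train.csv')
# Uncomment the next line to read the test set as well.
#df_test = pd.read_csv('Data/application_test.csv')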

FileNotFoundError: File b'Data/application_train.csv' does not exist
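A <font face="courier new" color = "#990000">FileNotFoundError</font> like the one above usually means that the relative path does not match the folder the notebook is running from. The following is a minimal sketch, not part of the original notebook, that checks the path before reading. The helper name <font face="courier new" color = "#009933">'read_csv_checked'</font> is hypothetical and is not a pandas function.

```python
from pathlib import Path

import pandas as pd


def read_csv_checked(path):
    """Read a CSV file, raising a descriptive error if it is missing.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Could not find {path.resolve()} "
            f"(current working directory: {Path.cwd()})"
        )
    return pd.read_csv(path)


# Build the relative path used in this notebook and check whether it exists.
train_path = Path('Data') / 'application_train.csv'
print('Training file found:', train_path.is_file())
```

If the check prints False, either move the "Data" folder next to the notebook or pass the complete path to the helper.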

## <font face="times" color = "#990000"> Examining the data: </font>
It is good to know how big your dataset is and how many rows and columns it has. <font face="courier new" color = "#990000">'shape'</font> is the attribute that holds the number of rows and columns of your dataframe.
You can find the number of rows of a dataframe by adding <font face="courier new" color = "#990000">'.shape[0]'</font> to the dataframe's name.
Similarly, the number of columns can be found by adding <font face="courier new" color = "#990000">'.shape[1]'</font> to the dataframe's name.
In order to see the results, you need to print them using the <font face="courier new" color = "#990000">'print'</font> function. In Python 3, you have to use parentheses and put whatever you want to print inside them. Use quotation marks (single or double) to print an exact string, and use variable names without quotation marks to print the values of the variables. Separate the items with commas.
Run the cell below to see the output of the print line.

In [None]:
number_of_rows_train = df_train.shape[0]
number_of_columns_train = df_train.shape[1]
print('The training data has', number_of_rows_train, 'rows, and', number_of_columns_train, 'columns.' )
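Because <font face="courier new" color = "#990000">'shape'</font> is a (rows, columns) tuple, both numbers can also be read in one step. The sketch below uses a small made-up dataframe (the name <font face="courier new" color = "#009933">'toy'</font> and its values are assumptions for illustration, not Home Credit data) so that it runs without the downloaded files.

```python
import pandas as pd

# Toy data for illustration only; not taken from the Home Credit files.
toy = pd.DataFrame({
    'SK_ID_CURR': [100001, 100002, 100003],
    'TARGET': [0, 1, 0],
})

# shape is a (rows, columns) tuple, so it can be unpacked in one step.
n_rows, n_columns = toy.shape
print('The toy data has', n_rows, 'rows and', n_columns, 'columns.')
```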

## <font face="times" color = "#990000"> Displaying a few rows of the data: </font>
As you can see, the training set has 307511 rows, which is a lot! It is a good idea to look at the first few rows of the dataset to get an idea of what it looks like. The <font face="courier new" color = "#990000">'head()'</font> method selects only the first five rows of the data by default. You can choose how many rows to select by putting a number inside the parentheses.

In [None]:
print(df_train.head())

The last rows of a dataframe can be selected with the <font face="courier new" color = "#990000">'.tail()'</font> method. The line below selects the last three rows of the training set dataframe.

In [None]:
print(df_train.tail(3))
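To see exactly which rows <font face="courier new" color = "#990000">'head()'</font> and <font face="courier new" color = "#990000">'.tail()'</font> return, it can help to try them on a tiny dataframe first. The following sketch uses made-up data (the column names <font face="courier new" color = "#009933">'row_id'</font> and <font face="courier new" color = "#009933">'value'</font> are assumptions for illustration).

```python
import pandas as pd

# Toy data for illustration only: ten rows numbered 0 to 9.
toy = pd.DataFrame({'row_id': range(10), 'value': [x * x for x in range(10)]})

first_two = toy.head(2)    # rows 0 and 1
last_three = toy.tail(3)   # rows 7, 8 and 9
default_head = toy.head()  # with no argument, five rows are returned

print(first_two)
print(last_three)
```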

## <font face="times" color = "#990000"> Statistical Description of the data: </font>
Finally, to get a statistical description of your dataset, you can use the <font face="courier new" color = "#990000">'.describe()'</font> method.

In [None]:
print('A statistical description of the training data: \n', df_train.describe())

As a result, a dataframe with 8 rows and 106 columns appears. You might have noticed that there are 106 columns here, while the dataset has 122 columns. This is because the describe method considers only the 106 numeric columns. The remaining 122-106=16 non-numeric columns in the training dataset are ignored here.
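The difference between numeric and non-numeric columns can be checked directly. The sketch below uses a made-up dataframe (its column names and values are assumptions for illustration) and relies on <font face="courier new" color = "#990000">'select_dtypes'</font> and the <font face="courier new" color = "#990000">'exclude'</font> parameter of <font face="courier new" color = "#990000">'describe'</font> to list and summarise the columns that the default call skips.

```python
import pandas as pd

# Toy data for illustration only: two numeric columns and one text column.
toy = pd.DataFrame({
    'income': [100.0, 200.0, 300.0, 400.0],
    'children': [0, 1, 2, 0],
    'contract_type': ['Cash', 'Cash', 'Revolving', 'Cash'],
})

# Default call: only the numeric columns are summarised.
numeric_summary = toy.describe()

# Columns that are not numeric, and a summary of just those columns.
text_columns = toy.select_dtypes(exclude='number').columns
text_summary = toy.describe(exclude='number')

print(numeric_summary.shape)
print(list(text_columns))
print(text_summary)
```

For text columns, the summary reports the count, the number of unique values, the most frequent value, and its frequency instead of the mean and percentiles.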

### <font face="courier new" color = "#6600cc">count:</font> 
The first row shows the number of non-missing values in each column. For example, the last column (AMT_REQ_CREDIT_BUREAU_YEAR) has 265992 values, while the first column has 307511. So 307511-265992=41519 values are missing in the last column, which shows that the dataframe contains missing data.
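Missing values can also be counted directly instead of being inferred from the count row. The following is a minimal sketch on made-up data (the column names <font face="courier new" color = "#009933">'complete'</font> and <font face="courier new" color = "#009933">'with_gaps'</font> are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; np.nan marks a missing value.
toy = pd.DataFrame({
    'complete': [1.0, 2.0, 3.0, 4.0],
    'with_gaps': [1.0, np.nan, 3.0, np.nan],
})

non_missing = toy.count()        # non-missing values per column
missing = toy.isna().sum()       # missing values per column
missing_percent = missing / len(toy) * 100

print(missing_percent)
```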

### <font face="courier new" color = "#6600cc">mean:</font>
It shows the average value of each column.

### <font face="courier new" color = "#6600cc">std:</font> 
It shows the standard deviation of the values in each column. A standard deviation near zero indicates that the values in that column are nearly constant.

### <font face="courier new" color = "#6600cc">min:</font> 
It shows the minimum value of each column.

### <font face="courier new" color = "#6600cc">25%, 50%, and 75%:</font>
These rows show the 25th, 50th, and 75th percentiles of each column. One-fourth of the values in a column are less than its 25% value. For example, the 25% value of the first column is 189145.5. This means that, of all 307511 rows, about 76878 have a value less than 189145.5 in the first column, and about 230633 have a value greater than 189145.5. The 50% value (the median) and the 75% value describe the distribution of each column in the same way.
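The same percentiles can be computed for a single column with the <font face="courier new" color = "#990000">'.quantile()'</font> method. The sketch below uses a made-up series of the integers 1 to 100 (an assumption for illustration) and checks what share of the values lies below the 25% mark.

```python
import pandas as pd

# Toy data for illustration only: the integers 1 to 100.
values = pd.Series(range(1, 101))

q25 = values.quantile(0.25)
q50 = values.quantile(0.50)
q75 = values.quantile(0.75)

# Share of values strictly below the 25% mark.
share_below_q25 = (values < q25).mean()

print(q25, q50, q75, share_below_q25)
```

By default pandas interpolates linearly between neighbouring values, which is why a percentile such as 189145.5 does not have to appear in the column itself.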

### <font face="courier new" color = "#6600cc">max:</font>
It shows the maximum value of each column.


The training set contains our historical data, including information about the applicants and whether they repaid the loan or not. In order to use the training set, we have to know the names of the columns and where to find the required data. The list of column names can be found using <font face="courier new" color = "#990000">'.columns'</font>.

In [None]:
# enumerate yields each column name together with its position.
for column_number, column_name in enumerate(df_train.columns):
    print(column_number, column_name)

The first column is the application ID, which is a unique identifier for each applicant. The second column shows the target. A target equal to 1 means the applicant did not repay the loan.
Using the <font face="courier new" color = "#990000">'.count()'</font> method, we can find the number of applications in the training set. To refer to a specific column of the dataset, write the column name in quotation marks inside square brackets after the dataframe's name. Note that Python is a case-sensitive language, so the column name must be written with the same lower- and upper-case letters as in the column list. After selecting the column, we can call the method. It counts the non-missing entries in the column; because each ID is unique, this is also the number of unique applicants.

In [None]:
print('There are', df_train['SK_ID_CURR'].count(), 'unique loan applicants in total.' )
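Note that <font face="courier new" color = "#990000">'.count()'</font> matches the number of unique applicants only when no ID is repeated. A minimal sketch on made-up IDs (the values are assumptions for illustration) shows how <font face="courier new" color = "#990000">'.nunique()'</font> and <font face="courier new" color = "#990000">'.is_unique'</font> can be used to check this.

```python
import pandas as pd

# Toy IDs for illustration only; the value 100002 appears twice on purpose.
ids = pd.Series([100001, 100002, 100002, 100003])

print(ids.count())      # number of non-missing entries
print(ids.nunique())    # number of distinct values
print(ids.is_unique)    # True only if no value is repeated
```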

## <font face="times" color = "#990000"> Imbalanced Training Set: </font>

Before we apply any machine learning algorithm to predict the target of the test set, we need to check whether the training set is balanced. If the training set is imbalanced, prediction quality for the rare class can drop significantly. In most cases, as in this project, the training set is imbalanced: we have more information about applicants who repaid their loan and only a few examples of applicants who did not. So, if we work with this imbalanced set, the model tends to predict better for applicants with TARGET = 0 and to have a higher error when identifying applicants who will not repay their loan.
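A small worked example makes the problem concrete. The sketch below uses made-up labels (92 repaid loans and 8 defaults, an assumption for illustration and not the real class ratio) and a naive "model" that always predicts the majority class. Its overall accuracy looks high, yet it does not identify a single defaulting applicant.

```python
# Toy labels for illustration only: 92 repaid loans (0) and 8 defaults (1).
true_labels = [0] * 92 + [1] * 8

# A naive "model" that always predicts the majority class.
predictions = [0] * len(true_labels)

correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)

# Recall for class 1: the share of actual defaults the model catches.
caught_defaults = sum(t == 1 and p == 1 for t, p in zip(true_labels, predictions))
recall_defaults = caught_defaults / sum(true_labels)

print('Accuracy:', accuracy)
print('Recall for TARGET = 1:', recall_defaults)
```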

So, the first task is to find how imbalanced our training set is. To find the number of targets (rows with TARGET = 1), we can use the <font face="courier new" color = "#990000">'.sum()'</font> method: since the column contains only zeros and ones, the sum equals the number of ones. Note that we cannot use the <font face="courier new" color = "#990000">'.count()'</font> method here, because it would count both zeros and ones in the TARGET column.
The difference between the total number of rows in the TARGET column and the number of rows with TARGET equal to one gives us the number of non-target samples.

In [None]:
number_of_targets = df_train['TARGET'].sum()
number_of_samples = df_train['SK_ID_CURR'].count()
number_of_non_targets = number_of_samples - number_of_targets
# Percentage of samples with TARGET = 1.
balance_ratio = number_of_targets / number_of_samples * 100
print('We have', number_of_samples, 'samples;', number_of_targets, 'of them with TARGET = 1, and',
      number_of_non_targets, 'with TARGET = 0. So, only %.1f%% of samples are targets.' % balance_ratio)
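The same class counts and percentages can also be obtained in one call with <font face="courier new" color = "#990000">'.value_counts()'</font>. The following sketch uses a made-up TARGET column (its values are assumptions for illustration) so that it runs without the downloaded files.

```python
import pandas as pd

# Toy TARGET column for illustration only.
target = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name='TARGET')

class_counts = target.value_counts()                        # rows per class
class_percent = target.value_counts(normalize=True) * 100   # share per class in %

print(class_counts)
print(class_percent)
```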


## <font face="times" color = "#990000"> What is Next? </font>
Up to now, we have read the data and examined it. In the next part, we deal with the imbalanced data and try to find a balance between TARGET and non-target samples.