# Data Understanding
## Data Quality
Now we're going to analyze the quality of the data we were given.


## Data Summarization
In this section, we provide an overview of the key properties of the data, which will help in selecting the most suitable tools for analyzing it. Specifically, we perform a statistical analysis of every table in the dataset. First, we import pandas and matplotlib, which we use to compute statistics and to visualize the data, respectively.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

### Relation `account`
First, we open the account table and look at its first few lines.

As we can see, there is one categorical variable, which represents how often statements are issued for each account. All other variables are numerical, as they correspond to account and district IDs or to the timestamp of account creation.

In [None]:
# show the first few lines of the dataframe
account_df = pd.read_csv('./data/account.csv', sep=';')
account_df.head()

Now, we are going to check for null values in the account table. The output of the following snippet of code shows there are no such values.

In [None]:
# show how many null values each column has
account_df.isna().sum()

Next, we make sure that each account appears exactly once. The following snippet checks whether any `account_id` is duplicated; its output shows that none is.

In [None]:
# check whether any value in the account_id column is duplicated (True if so)
account_df['account_id'].duplicated().any()

Now, we calculate how many accounts exist in each district, as well as the mode. As the output of the following snippets shows, the district with the most registered accounts is district 1, which has 3.6 times as many accounts as the next district. In fact, the boxplot shows 7 outliers (districts 1, 70, 74, 54, 64, 72 and 68): the average district has 58 accounts, the first quartile is 42 and the third quartile is 53, which gives an interquartile range of 11. These districts will skew the data if no correction is made.

In [None]:
# plot the distribution of accounts in each district
district_values = account_df['district_id'].value_counts()
ax = district_values.sort_values(ascending=False).plot.bar(
    title='district_id',
    legend=True,
    xlabel='district id',
    ylabel='number of accounts in district',
    figsize=(20, 3)
)
ax.bar_label(ax.containers[0])
plt.show()

In [None]:
# compute the mode of the district_id (the district with the highest number of accounts)
account_df['district_id'].mode()

In [None]:
# compute the count, mean, standard deviation, minimum, maximum and quartiles for the number of accounts in each district
acc_description = district_values.describe()
acc_description

In [None]:
# draw a boxplot of the number of accounts in each district to show the outliers
account_df['district_id'].value_counts().plot.box()
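
The outlier claim above can be checked numerically. The following is a minimal sketch, assuming the boxplot uses matplotlib's default whiskers at 1.5 times the interquartile range. The helper name `tukey_fences` is hypothetical, and the quartiles are the ones quoted in the text.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) bounds beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# quartiles quoted in the text for the number of accounts per district
lower, upper = tukey_fences(q1=42, q3=53)
print(lower, upper)  # 25.5 69.5
```

Under that assumption, any district with more than 69.5 accounts lies beyond the upper whisker. With the real data, the quartiles could be read from `district_values.quantile(0.25)` and `district_values.quantile(0.75)`, and the outlying districts selected with `district_values[district_values > upper]`.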

Now, we calculate the mode of the `frequency` column, which shows that the most common frequency for issuance of bank statements is `monthly issuance`, with 4167 occurrences. The other types of issuance are `weekly issuance`, with 240 entries, and `issuance after transaction`, with 93 entries.

In [None]:
# calculate the most common frequency for issuance of bank statements
account_df['frequency'].mode()

In [None]:
# draw a barplot of the number of occurrences of each frequency
freq_values = account_df['frequency'].value_counts()
ax = freq_values.sort_values(ascending=False).plot.bar(
    title='frequency',
    legend=True,
    xlabel='frequency',
    ylabel='number of accounts with frequency',
)
ax.bar_label(ax.containers[0])
plt.show()
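
Raw counts are easier to compare as shares. Here is a small sketch that uses only the counts quoted above; the dictionary is typed in by hand rather than read from the file.

```python
# counts quoted in the text for each statement frequency
counts = {
    'monthly issuance': 4167,
    'weekly issuance': 240,
    'issuance after transaction': 93,
}
total = sum(counts.values())  # total number of accounts covered by the counts

# share of accounts for each frequency, as a percentage rounded to one decimal place
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(total, shares)
```

On the dataframe itself, `account_df['frequency'].value_counts(normalize=True)` should give the same proportions as fractions.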

### Relation `client`
We are now going to open the client table and look at its first few lines. 

As we can see, this table only contains numerical variables: the ID of the client and of the district where they live, as well as a birth number that encodes both the client's birth date and their gender (for men, the format is YYMMDD; for women, 50 is added to the month, giving YYMM+50DD).

In [None]:
# show the first few lines of the dataframe
client_df = pd.read_csv('./data/client.csv', sep=';')
client_df.head()
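
The encoding described above can be decoded with integer arithmetic. This is a sketch under two assumptions that are not stated in the source: the helper name `decode_birth_number` is hypothetical, and every birth year is taken to lie in the 1900s.

```python
from datetime import date
from typing import Tuple


def decode_birth_number(birth_number: int) -> Tuple[date, str]:
    """Split a YYMMDD (men) or YYMM+50DD (women) number into a date and a gender."""
    yy, rest = divmod(birth_number, 10000)  # first two digits are the year
    mm, dd = divmod(rest, 100)              # then month (possibly +50) and day
    gender = 'M'
    if mm > 50:
        mm -= 50                            # months above 50 mark a woman
        gender = 'F'
    # assumption: all clients were born in the 20th century
    return date(1900 + yy, mm, dd), gender


print(decode_birth_number(706213))  # a woman born on 1970-12-13
print(decode_birth_number(450204))  # a man born on 1945-02-04
```

Applied to the table, something like `client_df['birth_number'].map(decode_birth_number)` would yield one `(date, gender)` pair per client, assuming the column is named `birth_number`.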

Now, we are going to check for null values in the client table. The output of the following snippets of code shows there are no such values and that there are no repeated clients.

In [None]:
# show how many null values each column has
client_df.isna().sum()

In [None]:
# check whether any value in the client_id column is duplicated (True if so)
client_df['client_id'].duplicated().any()

In [None]:
# compute the count, mean, standard deviation, minimum and maximum and quartiles for the number of clients in each district
client_df['district_id'].value_counts().describe()

### Relation `disposition`
We are now going to open the disposition table and look at its first few lines. 

As we can see this table contains numerical variables: the ID of the client and of the account, as well as the disposition ID. This table also contains a categorical variable containing the type of disposition (`OWNER`, `DISPONENT`). This table contains no null values. There are 4500 owners and 869 disponents.

In [None]:
# show the first few lines of the dataframe
disp_df = pd.read_csv('./data/disp.csv', sep=';')
disp_df.head()

In [None]:
# show how many null values each column has
disp_df.isna().sum()

In [None]:
# show the disposition types present in the dataset
disp_df['type'].unique()

In [None]:
# draw a barplot of the number of occurrences of each disposition type
type_values = disp_df['type'].value_counts()
print(type_values)
ax = type_values.sort_values(ascending=False).plot.bar(
    title='type',
    legend=True,
    xlabel='disposition type',
    ylabel='number of dispositions',
)
ax.bar_label(ax.containers[0])
plt.show()
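
The owner count quoted above equals the sum of the frequency counts reported for the account table (4167 + 240 + 93 = 4500), which suggests that every account has exactly one owner. A minimal sketch of how this could be verified follows. The toy rows are hypothetical, and the column name `account_id` is an assumption for the disposition table.

```python
import pandas as pd

# hypothetical toy rows standing in for disp.csv
toy_disp = pd.DataFrame({
    'account_id': [1, 2, 2, 3],
    'type': ['OWNER', 'OWNER', 'DISPONENT', 'OWNER'],
})

# count the OWNER rows of each account
owners_per_account = (toy_disp['type'] == 'OWNER').groupby(toy_disp['account_id']).sum()

# True only if every account has exactly one owner
every_account_has_one_owner = bool((owners_per_account == 1).all())
print(every_account_has_one_owner)
```

Replacing `toy_disp` with `disp_df` would run the same check on the real table.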

### Relations `transaction` and `permanent order`
The permanent order relation is included in the transaction table: it corresponds to the rows that contain the two-letter code of the destination bank of a transaction. Rows that do not correspond to permanent orders lack these attributes, which explains the large number of null values in this table.
This relation is split between two tables, one for training purposes and another one for testing purposes. However, for the purpose of this analysis, we will concatenate them.

There are three transaction types (categorical variables): `credit`, `withdrawal` and `withdrawal in cash`. The most common transaction type is `credit`. 
There are five operation types (categorical variables): `credit in cash`, `collection from another bank`, `withdrawal in cash`, `remittance to another bank`, `credit card withdrawal`. The most common operation type is `withdrawal in cash`.

The mean transaction amount is 5911.10, the minimum is 0 and the maximum is 86400. The box plot shows that this distribution has many outliers, given that the interquartile range is 6615.6.

In [None]:
# load the train and test transaction tables, concatenate them and show the first few lines
trans_train_df = pd.read_csv('./data/trans_train.csv', sep=';', dtype={'k_symbol': str, 'bank': str})
trans_test_df = pd.read_csv('./data/trans_test.csv', sep=';', dtype={'k_symbol': str, 'bank': str})
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
trans_df = pd.concat([trans_train_df, trans_test_df])
trans_df.head()

In [None]:
# show how many null values each column has
trans_df.isna().sum()
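
Following the description above, permanent orders could be separated from the other transactions by testing whether the bank code is present. This is only a sketch on hypothetical toy rows. The `bank` and `amount` column names come from the code in this section, while `trans_id` is an assumption.

```python
import pandas as pd

# hypothetical toy rows standing in for the transaction table
toy_trans = pd.DataFrame({
    'trans_id': [1, 2, 3, 4],
    'amount': [100.0, 250.0, 75.0, 900.0],
    'bank': [None, 'AB', None, 'CD'],
})

# rows with a destination bank code are treated as permanent orders
permanent_orders = toy_trans[toy_trans['bank'].notna()]
print(len(permanent_orders), 'of', len(toy_trans), 'rows carry a bank code')
```

Replacing `toy_trans` with `trans_df` would apply the same filter to the real table.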

In [None]:
# show the unique categories of transaction types
trans_df['type'].unique()

In [None]:
# show the most common transaction type
trans_df['type'].mode()

In [None]:
# show the unique categories of transaction operation
trans_df['operation'].unique()

In [None]:
# show the most common transaction operation
trans_df['operation'].mode()

In [None]:
# compute the count, mean, standard deviation, minimum and maximum and quartiles for the amount of money in a transaction
trans_df['amount'].describe()

In [None]:
# plot a histogram of the distribution of transaction amounts
trans_df['amount'].plot.hist(bins=25)

In [None]:
# draw a boxplot of the distribution of transaction amounts
trans_df['amount'].plot.box()

### Relation `loan`
We are now going to open the loan table and look at its first few lines. This relation is split between two tables, one for training purposes and another one for testing purposes. However, for the purpose of this analysis, we will concatenate them.

This table contains numerical variables representing the IDs of the loans and of the accounts that requested them, as well as the date on which the loan was granted and the amount of money borrowed. Each row also has information on the payment amount of the loan and on its duration. Because the purpose of this project is to predict whether a loan will be paid or not, this table contains the `status` attribute which, for the train table, says whether the loan has been paid or not. This attribute is empty for the test table because it is what we want to predict.

Loans can have a duration of 12, 24, 36, 48 or 60 months. The average loan duration is 36 months.

As expected, there are 354 null values in the table, all in the status column, which corresponds to the number of rows in the test table.

Also as expected, the correlogram shows a significant positive correlation between the amount of the loan and its payments.

The average loan amount is 151410.17 with a standard deviation of 113372.40, the smallest loan has an amount of 4980 and the largest, an amount of 590820.
The average payment is 4190.66 with a standard deviation of 2215.83, the smallest payment is 304 and the largest is 9910.

The boxplot for the ``amount`` column shows that there are many outliers but the boxplot for the ``payments`` column shows none. The amount column seems to follow a Pareto distribution.

In [None]:
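# load the train and test loan tables, concatenate them and show the first few lines
loan_train_df = pd.read_csv('./data/loan_train.csv', sep=';')
loan_test_df = pd.read_csv('./data/loan_test.csv', sep=';')
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
loan_df = pd.concat([loan_train_df, loan_test_df])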
loan_df.head()

In [None]:
# show how many null values each column has
loan_df.isna().sum()
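
To confirm that the missing `status` values come only from the test table, each row can be tagged with its origin before concatenating. The following is a minimal sketch on hypothetical toy tables. The `split` column and the status values shown are assumptions made for illustration.

```python
import pandas as pd

# hypothetical toy tables standing in for loan_train.csv and loan_test.csv
toy_train = pd.DataFrame({'loan_id': [1, 2, 3], 'status': [1.0, -1.0, 1.0]})
toy_test = pd.DataFrame({'loan_id': [4, 5], 'status': [float('nan'), float('nan')]})

# tag each row with the table it came from before concatenating
combined = pd.concat(
    [toy_train.assign(split='train'), toy_test.assign(split='test')],
    ignore_index=True,
)

# number of missing status values per source table
nulls_per_split = combined['status'].isna().groupby(combined['split']).sum()
print(nulls_per_split)
```

If the same pattern holds on the real tables, the `train` entry should be 0 and the `test` entry should equal the number of rows in the test table.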

In [None]:
# compute the average loan duration in months
loan_df['duration'].mean()

Now we will look at the ``amount`` column of the table.

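In [None]:
# compute the pairwise correlation between the columns of the loan table
corr = loan_df.corr()
corr

In [None]:
# plot the correlation matrix as a heatmap (correlogram)
plt.matshow(corr)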

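In [None]:
# compute the count, mean, standard deviation, minimum, maximum and quartiles of the loan amount
loan_df['amount'].describe()

In [None]:
# draw a boxplot of the loan amounts to show the outliers
loan_df['amount'].plot.box()

In [None]:
# plot a histogram of the distribution of loan amounts
loan_df['amount'].plot.hist()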

Now we will look at the ``payments`` column of the table.

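In [None]:
# compute the count, mean, standard deviation, minimum, maximum and quartiles of the payments
loan_df['payments'].describe()

In [None]:
# draw a boxplot of the payments to check for outliers
loan_df['payments'].plot.box()

In [None]:
# plot a histogram of the distribution of payments
loan_df['payments'].plot.hist()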

### Relation `credit card`
We are now going to open the credit card table and look at its first few lines. This relation is split between two tables, one for training purposes and another one for testing purposes. However, for the purpose of this analysis, we will concatenate them.

In [None]:
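# load the train and test credit card tables, concatenate them and show the first few lines
card_train_df = pd.read_csv('./data/card_train.csv', sep=';')
card_test_df = pd.read_csv('./data/card_test.csv', sep=';')
# concatenate train with test (the original appended the train table to itself)
card_df = pd.concat([card_train_df, card_test_df])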
card_df.head()

### Relation `demographic data`
We are now going to open the district table and look at its first few lines.

In [None]:
# load the district table, treating '?' as a missing value, and show the first few lines
district_df = pd.read_csv('./data/district.csv', sep=';', na_values='?')
district_df.head()

In [None]:
# show how many null values each column has
district_df.isna().sum()