# Data Understanding
## Data Quality
Now we're going to analyze the quality of the data we were given.


## Data Summarization
This section gives an overview of the key properties of the data, which will help in selecting the most suitable tools for analyzing it. Specifically, we perform a statistical analysis of every table in the dataset. First, we import pandas, which we use to compute the statistics, and matplotlib, which we use for visualization.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

### Relation `account`
Firstly, we are going to open the account table and look at its first few lines. 

As we can see, there is one categorical variable, `frequency`, which represents how often statements are issued for each account. All other variables are numerical, as they correspond to account and district IDs or to the timestamp of account creation.

In [None]:
# show the first few lines of the dataframe
account_df = pd.read_csv('./data/account.csv', sep=';')
account_df.head()

Now, we are going to check for null values in the account table. The output of the following snippet of code shows there are no such values.

In [None]:
# show how many null values each column has
account_df.isna().sum()

Next, we make sure that each account appears exactly once. The following snippet of code returns `False`, meaning that no `account_id` is duplicated.

In [None]:
# check whether any value in the account_id column is duplicated
account_df['account_id'].duplicated().any()
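
The null and duplicate checks above recur for every table, so they could be bundled into a small helper. The sketch below is one possible way to do it; the function name `check_table` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_table(df: pd.DataFrame, key: str) -> dict:
    """Summarize null counts and key uniqueness for one table (hypothetical helper)."""
    return {
        # number of missing values per column, as a plain dict
        'nulls': df.isna().sum().to_dict(),
        # True if any value of the key column occurs more than once
        'has_duplicate_keys': bool(df[key].duplicated().any()),
    }


# toy data standing in for a real table such as account.csv
toy_df = pd.DataFrame({
    'account_id': [1, 2, 3],
    'district_id': [18, 1, 5],
    'frequency': ['monthly issuance', None, 'weekly issuance'],
})
report = check_table(toy_df, key='account_id')
print(report)
```

On the notebook's data, the equivalent call would be `check_table(account_df, key='account_id')`.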

Now, we count how many accounts exist in each district and compute the mode. As the output of the following snippets of code shows, district 1 has the highest number of registered accounts, 3.6 times as many as the next district. The boxplot shows 7 outliers (districts 1, 70, 74, 54, 64, 72 and 68): the average district has 58 accounts, the first quartile is 42 and the third quartile is 53, which gives an interquartile range of 11. These districts will skew the data if no correction is applied.

In [None]:
# plot the distribution of accounts in each district
district_values = account_df['district_id'].value_counts()
ax = district_values.sort_values(ascending=False).plot.bar(
    title='district_id',
    legend=True,
    xlabel='district id',
    ylabel='number of accounts in district',
    figsize=(20, 3)
)
ax.bar_label(ax.containers[0])
plt.show()

In [None]:
# compute the mode of the district_id (the district with the highest number of accounts)
account_df['district_id'].mode()

In [None]:
# compute the count, mean, standard deviation, minimum, maximum and quartiles of the number of accounts in each district
acc_description = district_values.describe()
acc_description

In [None]:
# draw a boxplot of the number of accounts in each district to show the outliers
account_df['district_id'].value_counts().plot.box()
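
The outliers drawn by the boxplot can also be listed explicitly. By default, matplotlib's boxplot marks points lying more than 1.5 times the interquartile range beyond the quartiles; with quartiles of 42 and 53, the fences would be 25.5 and 69.5. The sketch below applies that rule to a series of per-district counts. The function name `iqr_outliers` and the toy counts are assumptions made for illustration, not values from the dataset.

```python
import pandas as pd


def iqr_outliers(counts: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = counts.quantile(0.25)
    q3 = counts.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return counts[(counts < lower) | (counts > upper)]


# toy counts indexed by a made-up district id (not the real data)
toy_counts = pd.Series([40, 42, 45, 48, 50, 53, 55, 300],
                       index=[10, 11, 12, 13, 14, 15, 16, 1])
outliers = iqr_outliers(toy_counts)
print(outliers)
```

On the notebook's data this would be called as `iqr_outliers(district_values)`, which should list the districts the boxplot draws as individual points.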

Now, we calculate the mode of the `frequency` column, which shows that the most common frequency for the issuance of bank statements is `monthly issuance`, with 4167 occurrences. The other values are `weekly issuance`, with 240 entries, and `issuance after transaction`, with 93 entries.

In [None]:
# calculate the most common frequency for issuance of bank statements
account_df['frequency'].mode()

In [None]:
# draw a barplot of the number of occurrences of each frequency
freq_values = account_df['frequency'].value_counts()
ax = freq_values.sort_values(ascending=False).plot.bar(
    title='frequency',
    legend=True,
    xlabel='frequency',
    ylabel='number of accounts with frequency',
)
ax.bar_label(ax.containers[0])
plt.show()
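
The counts quoted above can be turned into shares of the 4500 accounts. The snippet below is a small sketch that rebuilds the counts by hand from the figures in the text rather than reading `account.csv`; on the real data, `account_df['frequency'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# counts copied from the text above
freq_counts = pd.Series({
    'monthly issuance': 4167,
    'weekly issuance': 240,
    'issuance after transaction': 93,
})
# share of each frequency, in percent, rounded to one decimal place
freq_share = (freq_counts / freq_counts.sum() * 100).round(1)
print(freq_share)
```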

### Relation `client`
We are now going to open the client table and look at its first few lines. 

As we can see, this table contains only numerical variables: the ID of the client, the ID of the district where they live, and a birth number that encodes both the client's birth date and their gender. For men the format is YYMMDD; for women 50 is added to the month, giving YYMM+50DD.

In [None]:
# show the first few lines of the dataframe
client_df = pd.read_csv('./data/client.csv', sep=';')
client_df.head()
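
To make the encoding concrete, the sketch below decodes a birth number into a birth date and a gender, following the format described above. The function name `decode_birth_number`, the example values, and the assumption that all birth years fall in the 1900s are ours and are not taken from the notebook.

```python
from datetime import date
from typing import Tuple


def decode_birth_number(birth_number: int) -> Tuple[date, str]:
    """Split a YYMMDD birth number into (birth date, gender)."""
    year = 1900 + birth_number // 10000      # assumes births in the 1900s
    month = (birth_number // 100) % 100
    day = birth_number % 100
    gender = 'M'
    if month > 50:                           # women have 50 added to the month
        month -= 50
        gender = 'F'
    return date(year, month, day), gender


# made-up examples, not rows from client.csv
print(decode_birth_number(450204))   # man born on 4 February 1945
print(decode_birth_number(706213))   # woman born on 13 December 1970
```

Applied to the whole table, the function could be mapped over the birth number column to derive separate birth date and gender columns.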

Now, we are going to check for null values in the client table. The output of the following snippets of code shows there are no such values and that there are no repeated clients.

In [None]:
# show how many null values each column has
client_df.isna().sum()

In [None]:
# check whether any value in the client_id column is duplicated
client_df['client_id'].duplicated().any()

In [None]:
# compute the count, mean, standard deviation, minimum and maximum and quartiles for the number of clients in each district
client_df['district_id'].value_counts().describe()

### Relation `disposition`
We are now going to open the disposition table and look at its first few lines. 

As we can see this table contains numerical variables: the ID of the client and of the account, as well as the disposition ID. This table also contains a categorical variable containing the type of disposition (`OWNER`, `DISPONENT`). This table contains no null values. There are 4500 owners and 869 disponents.

In [None]:
# show the first few lines of the dataframe
disp_df = pd.read_csv('./data/disp.csv', sep=';')
disp_df.head()

In [None]:
# show how many null values each column has
disp_df.isna().sum()

In [None]:
# show the disposition types in the dataset
disp_df['type'].unique()

In [None]:
# draw a barplot of the number of occurrences of each disposition type
type_values = disp_df['type'].value_counts()
print(type_values)
ax = type_values.sort_values(ascending=False).plot.bar(
    title='type',
    legend=True,
    xlabel='disposition type',
    ylabel='number of dispositions',
)
ax.bar_label(ax.containers[0])
plt.show()

### Relations `transaction` and `permanent order`
The permanent order relation is embedded in the transaction table, specifically in the rows that contain the two-letter code of the destination bank of a transaction. Rows that do not correspond to permanent orders leave these attributes empty, which explains the large number of null values in this table.
The transaction data is split between two tables: one dataset for training purposes and another for testing purposes.

There are three transaction types: `credit`, `withdrawal` and `withdrawal in cash`. The most common transaction type is `credit`. 
There are five operation types: `credit in cash`, `collection from another bank`, `withdrawal in cash`, `remittance to another bank`, `credit card withdrawal`. The most common operation type is `withdrawal in cash`.

The mean transaction amount is 5911.10, the minimum is 0 and the maximum is 86400. The box plot shows that this distribution has many outliers.

In [70]:
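# load the training and test transactions, reading k_symbol and bank as strings
trans_train_df = pd.read_csv('./data/trans_train.csv', sep=';', dtype={'k_symbol': str, 'bank': str})
trans_test_df = pd.read_csv('./data/trans_test.csv', sep=';', dtype={'k_symbol': str, 'bank': str})
# DataFrame.append was removed in pandas 2.0, so stack the two tables with pd.concat
trans_df = pd.concat([trans_train_df, trans_test_df])
trans_df.head()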

| | trans_id | account_id | date | type | operation | amount | balance | k_symbol | bank | account |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1548749 | 5270 | 930113 | credit | credit in cash | 800.0 | 800.0 | | | |
| 1 | 1548750 | 5270 | 930114 | credit | collection from another bank | 44749.0 | 45549.0 | | IJ | 80269753.0 |
| 2 | 3393738 | 11265 | 930114 | credit | credit in cash | 1000.0 | 1000.0 | | | |
| 3 | 3122924 | 10364 | 930117 | credit | credit in cash | 1100.0 | 1100.0 | | | |
| 4 | 1121963 | 3834 | 930119 | credit | credit in cash | 700.0 | 700.0 | | | |


In [None]:
# show how many null values each column has
trans_df.isna().sum()
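
Following the description above, rows belonging to permanent orders can be separated from the rest by testing whether the `bank` column is filled in. The sketch below shows the idea on a toy frame shaped like `trans_df`; the toy rows and the name `has_bank` are illustrative assumptions.

```python
import pandas as pd

# toy frame with some of the column names of trans_df (values are made up)
toy_trans = pd.DataFrame({
    'trans_id': [1, 2, 3, 4],
    'amount': [800.0, 44749.0, 1000.0, 2500.0],
    'bank': [None, 'IJ', None, 'ST'],
})

# rows with a destination bank code vs. rows without one
has_bank = toy_trans['bank'].notna()
with_bank = toy_trans[has_bank]
without_bank = toy_trans[~has_bank]

print(len(with_bank), len(without_bank))
# share of rows where the bank column is null
print(toy_trans['bank'].isna().mean())
```

The same mask built from `trans_df['bank'].notna()` would split the real table, and the null share gives a quick sense of how much of the table lacks a destination bank.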

In [None]:
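# list the distinct transaction types
trans_df['type'].unique()

In [None]:
# find the most common transaction type
trans_df['type'].mode()

In [None]:
# list the distinct operation types
trans_df['operation'].unique()

In [None]:
# find the most common operation type
trans_df['operation'].mode()

In [None]:
# compute the count, mean, standard deviation, minimum, maximum and quartiles of the transaction amount
trans_df['amount'].describe()

In [None]:
# draw a histogram of the transaction amounts
trans_df['amount'].plot.hist(bins=25)

In [None]:
# draw a boxplot of the transaction amounts to show the outliers
trans_df['amount'].plot.box()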

### Relation `loan`

### Relation `credit card`

### Relation `demographic data`