# Data Analysis

**Notes**:
- Only the owner of the account (on disposition) can issue permanent orders and ask for a loan

## Loading the Data

![Data description](./images/data_description.gif)

In [1]:
import pandas as pd
from IPython.display import display

# Read the semicolon-separated CSVs

account = pd.read_csv("data/account.csv", sep=";")
client = pd.read_csv("data/client.csv", sep=";")
disposition = pd.read_csv("data/disp.csv", sep=";")
district = pd.read_csv("data/district.csv", sep=";")  # Demographic data

# Credit cards
card_train = pd.read_csv("data/card_train.csv", sep=";")
card_test = pd.read_csv("data/card_test.csv", sep=";")

loan_train = pd.read_csv("data/loan_train.csv", sep=";")
loan_test = pd.read_csv("data/loan_test.csv", sep=";")

trans_train = pd.read_csv("data/trans_train.csv", sep=";")
trans_test = pd.read_csv("data/trans_test.csv", sep=";")


## First Analysis
For a first view of the data, we can check its shape, some sample values and the data types.
### Shape

In [2]:
print(f'''
account shape: {account.shape}
client shape: {client.shape}
disposition shape: {disposition.shape}
district shape: {district.shape}

card_train shape: {card_train.shape}
card_test shape: {card_test.shape}

loan_train shape: {loan_train.shape}
loan_test shape: {loan_test.shape}

trans_train shape: {trans_train.shape}
trans_test shape: {trans_test.shape}
''')


account shape: (4500, 4)
client shape: (5369, 3)
disposition shape: (5369, 4)
district shape: (77, 16)

card_train shape: (177, 4)
card_test shape: (25, 4)

loan_train shape: (328, 7)
loan_test shape: (354, 7)

trans_train shape: (396685, 10)
trans_test shape: (30200, 10)



- Table with most columns: district (16)
- Table with most records: trans_train (396685)
- Table with least columns: client (3)
- Table with least records: card_test (25)
- 5369 clients and only 202 credit cards (177 train + 25 test). Not every client has a credit card?
- 5369 clients and 4500 accounts. Are there clients that do not have an account?
- 5369 clients and 4500 accounts, but only 682 loans (328 train + 354 test)?
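
The questions above can be checked through the disposition table, which links clients to accounts (and cards to dispositions). The following is a minimal sketch on a few hand-typed rows shaped like the tables in this notebook. The toy values and the names `clients_without_account`, `clients_per_account` and `card_share` are illustrative assumptions, not results from the real data.

```python
import pandas as pd

# Toy rows mimicking the real tables (illustrative values only)
client = pd.DataFrame({"client_id": [1, 2, 3, 4, 5, 6]})
disposition = pd.DataFrame({
    "disp_id": [1, 2, 3, 4, 5],
    "client_id": [1, 2, 3, 4, 5],
    "account_id": [1, 2, 2, 3, 3],
    "type": ["OWNER", "OWNER", "DISPONENT", "OWNER", "DISPONENT"],
})
card = pd.DataFrame({"card_id": [10], "disp_id": [2]})

# Clients with no disposition row, i.e. not linked to any account
linked = client.merge(disposition, on="client_id", how="left", indicator=True)
clients_without_account = linked.loc[
    linked["_merge"] == "left_only", "client_id"
].tolist()

# Number of distinct clients attached to each account
clients_per_account = disposition.groupby("account_id")["client_id"].nunique()

# Fraction of dispositions that hold a card
card_share = disposition["disp_id"].isin(card["disp_id"]).mean()

print(clients_without_account)
print(clients_per_account.to_dict())
print(card_share)
```

On the real data, the toy frames would be replaced by `client`, `disposition` and the concatenation of `card_train` and `card_test`.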

### Head
We can check the first 5 entries of each table to get a sense of the data.

In [3]:
display("account", account.head())
display("client", client.head())
display("disposition", disposition.head())
display("district", district.head())

display("card_train", card_train.head())
display("card_test", card_test.head())

display("loan_train", loan_train.head())
display("loan_test", loan_test.head())

display("trans_train", trans_train.head())
display("trans_test", trans_test.head())

'account'

Unnamed: 0,account_id,district_id,frequency,date
0,576,55,monthly issuance,930101
1,3818,74,monthly issuance,930101
2,704,55,monthly issuance,930101
3,2378,16,monthly issuance,930101
4,2632,24,monthly issuance,930102


'client'

Unnamed: 0,client_id,birth_number,district_id
0,1,706213,18
1,2,450204,1
2,3,406009,1
3,4,561201,5
4,5,605703,5


'disposition'

Unnamed: 0,disp_id,client_id,account_id,type
0,1,1,1,OWNER
1,2,2,2,OWNER
2,3,3,2,DISPONENT
3,4,4,3,OWNER
4,5,5,3,DISPONENT


'district'

Unnamed: 0,code,name,region,no. of inhabitants,no. of municipalities with inhabitants < 499,no. of municipalities with inhabitants 500-1999,no. of municipalities with inhabitants 2000-9999,no. of municipalities with inhabitants >10000,no. of cities,ratio of urban inhabitants,average salary,unemploymant rate '95,unemploymant rate '96,no. of enterpreneurs per 1000 inhabitants,no. of commited crimes '95,no. of commited crimes '96
0,1,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.29,0.43,167,85677,99107
1,2,Benesov,central Bohemia,88884,80,26,6,2,5,46.7,8507,1.67,1.85,132,2159,2674
2,3,Beroun,central Bohemia,75232,55,26,4,1,5,41.7,8980,1.95,2.21,111,2824,2813
3,4,Kladno,central Bohemia,149893,63,29,6,2,6,67.4,9753,4.64,5.05,109,5244,5892
4,5,Kolin,central Bohemia,95616,65,30,4,1,6,51.4,9307,3.85,4.43,118,2616,3040


'card_train'

Unnamed: 0,card_id,disp_id,type,issued
0,1005,9285,classic,931107
1,104,588,classic,940119
2,747,4915,classic,940205
3,70,439,classic,940208
4,577,3687,classic,940215


'card_test'

Unnamed: 0,card_id,disp_id,type,issued
0,813,5873,junior,961028
1,1014,9452,classic,961102
2,408,2560,classic,961210
3,1118,11393,classic,970102
4,565,3601,gold,970106


'loan_train'

Unnamed: 0,loan_id,account_id,date,amount,duration,payments,status
0,5314,1787,930705,96396,12,8033,-1
1,5316,1801,930711,165960,36,4610,1
2,6863,9188,930728,127080,60,2118,1
3,5325,1843,930803,105804,36,2939,1
4,7240,11013,930906,274740,60,4579,1


'loan_test'

Unnamed: 0,loan_id,account_id,date,amount,duration,payments,status
0,5895,4473,970103,93960,60,1566,
1,7122,10365,970104,260640,36,7240,
2,6173,5724,970108,232560,48,4845,
3,6142,5591,970121,221880,60,3698,
4,5358,2018,970121,38520,12,3210,


'trans_train'

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account
0,1548749,5270,930113,credit,credit in cash,800.0,800.0,,,
1,1548750,5270,930114,credit,collection from another bank,44749.0,45549.0,,IJ,80269753.0
2,3393738,11265,930114,credit,credit in cash,1000.0,1000.0,,,
3,3122924,10364,930117,credit,credit in cash,1100.0,1100.0,,,
4,1121963,3834,930119,credit,credit in cash,700.0,700.0,,,


'trans_test'

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account
0,6145,25,960728,credit,credit in cash,900.0,900.0,,,
1,6456,25,960827,credit,credit in cash,15800.0,16700.0,,,
2,6150,25,960903,credit,credit in cash,13067.0,29767.0,,,
3,6171,25,960905,credit,credit in cash,42054.0,71821.0,,,
4,6457,25,960906,withdrawal,withdrawal in cash,36000.0,77580.0,,,


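- `account_id` is not sorted in the account table (the rows appear to be ordered by `date` instead)
- Dates in the account, card, loan and transaction tables are stored as integers in YYMMDD format (e.g. 930705 is 5 July 1993)
- There are some NaN values in the data (e.g. `k_symbol`, `bank` and `account` in the transaction tables, and `status` in loan_test)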
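
A short sketch of how the integer dates could be turned into proper datetimes. The two rows are copied from the loan_train head above, while the column name `date_parsed` is an assumption made for illustration.

```python
import pandas as pd

# Two rows copied from the loan_train head shown above
loan = pd.DataFrame({"loan_id": [5314, 5316], "date": [930705, 930711]})

# Integers -> zero-padded 6-character strings -> datetimes.
# With %y, two-digit years 69-99 are read as 1969-1999.
as_text = loan["date"].astype(str).str.zfill(6)
loan["date_parsed"] = pd.to_datetime(as_text, format="%y%m%d")

print(loan)
```

The same conversion should apply to `issued` in the card tables and `date` in the account and transaction tables, since they follow the same pattern.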

### Data Types

In [4]:
print('---- account  ----')
account.info()
print('\n---- client ----')
client.info()
print('\n---- disposition ----')
disposition.info()
print('\n---- district ----')
district.info()

print('\n---- card_train ----')
card_train.info()
print('\n---- card_test ----')
card_test.info()

print('\n---- loan_train ----')
loan_train.info()
print('\n---- loan_test ----')
loan_test.info()

print('\n---- trans_train ----')
trans_train.info()
print('\n---- trans_test ----')
trans_test.info()

---- account  ----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4500 entries, 0 to 4499
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   account_id   4500 non-null   int64 
 1   district_id  4500 non-null   int64 
 2   frequency    4500 non-null   object
 3   date         4500 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 140.8+ KB

---- client ----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5369 entries, 0 to 5368
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   client_id     5369 non-null   int64
 1   birth_number  5369 non-null   int64
 2   district_id   5369 non-null   int64
dtypes: int64(3)
memory usage: 126.0 KB

---- disposition ----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5369 entries, 0 to 5368
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 

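- Many columns have dtype `object`, which means they hold strings or a mix of values (possibly including NaN)
- `unemploymant rate '95` and `no. of commited crimes '95` in district should be numeric, like their '96 counterparts, but were read as `object`, which suggests they contain non-numeric entries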
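
One possible way to coerce such a column to numbers is `pd.to_numeric` with `errors="coerce"`. This is only a sketch: it assumes the non-numeric entries are some placeholder string, and the `"?"` used here is a hypothetical stand-in, since the offending values have not been inspected at this point.

```python
import pandas as pd

# Toy frame; "?" is a hypothetical placeholder for a missing value
district_toy = pd.DataFrame({
    "name ": ["A", "B", "C"],
    "unemploymant rate '95 ": ["0.29", "?", "1.95"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
rate = pd.to_numeric(district_toy["unemploymant rate '95 "], errors="coerce")

print(rate)
print("missing after conversion:", rate.isna().sum())
```

Counting the NaN values before and after the conversion shows how many entries were not numeric.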

### Describe Tables

Columns of type `object` have values for count, unique, top and freq, while numeric columns have values for count, mean, std, min, 25%, 50%, 75% and max. A NaN in these tables usually means that the metric does not apply to that column's data type.
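
To make this concrete, here is a small sketch on a toy frame (illustrative values, not the real tables) with one text column and one numeric column.

```python
import pandas as pd

# Toy frame with one text column and one numeric column
toy = pd.DataFrame({
    "type": ["credit", "withdrawal", "credit"],
    "amount": [800.0, 36000.0, 1000.0],
})

summary = toy.describe(include="all")

# Text column: unique/top/freq are filled, mean/std/min/... are NaN
print(summary["type"])
# Numeric column: the reverse
print(summary["amount"])
```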

In [5]:
display("account", account.describe(include='all'))
display("client", client.describe(include='all'))
display("disposition", disposition.describe(include='all'))
display("district", district.describe(include='all'))

display("card_train", card_train.describe(include='all'))
display("card_test", card_test.describe(include='all'))

display("loan_train", loan_train.describe(include='all'))
display("loan_test", loan_test.describe(include='all'))

display("trans_train", trans_train.describe(include='all'))
display("trans_test", trans_test.describe(include='all'))

'account'

Unnamed: 0,account_id,district_id,frequency,date
count,4500.0,4500.0,4500,4500.0
unique,,,3,
top,,,monthly issuance,
freq,,,4167,
mean,2786.067556,37.310444,,951654.608667
std,2313.811984,25.177217,,14842.188377
min,1.0,1.0,,930101.0
25%,1182.75,13.0,,931227.0
50%,2368.0,38.0,,960102.0
75%,3552.25,60.0,,961101.0


'client'

Unnamed: 0,client_id,birth_number,district_id
count,5369.0,5369.0,5369.0
mean,3359.01192,535114.970013,37.310114
std,2832.911984,172895.618429,25.04369
min,1.0,110820.0,1.0
25%,1418.0,406009.0,14.0
50%,2839.0,540829.0,38.0
75%,4257.0,681013.0,60.0
max,13998.0,875927.0,77.0


'disposition'

Unnamed: 0,disp_id,client_id,account_id,type
count,5369.0,5369.0,5369.0,5369
unique,,,,2
top,,,,OWNER
freq,,,,4500
mean,3337.09797,3359.01192,2767.496927,
std,2770.418826,2832.911984,2307.84363,
min,1.0,1.0,1.0,
25%,1418.0,1418.0,1178.0,
50%,2839.0,2839.0,2349.0,
75%,4257.0,4257.0,3526.0,


'district'

Unnamed: 0,code,name,region,no. of inhabitants,no. of municipalities with inhabitants < 499,no. of municipalities with inhabitants 500-1999,no. of municipalities with inhabitants 2000-9999,no. of municipalities with inhabitants >10000,no. of cities,ratio of urban inhabitants,average salary,unemploymant rate '95,unemploymant rate '96,no. of enterpreneurs per 1000 inhabitants,no. of commited crimes '95,no. of commited crimes '96
count,77.0,77,77,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0
unique,,77,8,,,,,,,,,71.0,,,76.0,
top,,Hl.m. Praha,south Moravia,,,,,,,,,1.51,,,2854.0,
freq,,1,14,,,,,,,,,2.0,,,2.0,
mean,39.0,,,133884.9,48.623377,24.324675,6.272727,1.727273,6.25974,63.035065,9031.675325,,3.787013,116.12987,,5030.831169
std,22.371857,,,136913.5,32.741829,12.780991,4.015222,1.008338,2.435497,16.221727,790.202347,,1.90848,16.608773,,11270.796786
min,1.0,,,42821.0,0.0,0.0,0.0,0.0,1.0,33.9,8110.0,,0.43,81.0,,888.0
25%,20.0,,,85852.0,22.0,16.0,4.0,1.0,5.0,51.9,8512.0,,2.31,105.0,,2122.0
50%,39.0,,,108871.0,49.0,25.0,6.0,2.0,6.0,59.8,8814.0,,3.6,113.0,,3040.0
75%,58.0,,,139012.0,71.0,32.0,8.0,2.0,8.0,73.5,9317.0,,4.79,126.0,,4595.0


'card_train'

Unnamed: 0,card_id,disp_id,type,issued
count,177.0,177.0,177,177.0
unique,,,3,
top,,,classic,
freq,,,127,
mean,433.576271,3031.723164,,954971.259887
std,290.507562,2632.338944,,7217.974691
min,3.0,41.0,,931107.0
25%,183.0,1080.0,,950616.0
50%,397.0,2513.0,,960221.0
75%,661.0,4270.0,,960831.0


'card_test'

Unnamed: 0,card_id,disp_id,type,issued
count,25.0,25.0,25,25.0
unique,,,3,
top,,,classic,
freq,,,20,
mean,670.12,5839.96,,972351.28
std,394.98727,4387.770936,,6111.008472
min,17.0,127.0,,961028.0
25%,408.0,2560.0,,970520.0
50%,623.0,3984.0,,970828.0
75%,1014.0,9452.0,,980109.0


'loan_train'

Unnamed: 0,loan_id,account_id,date,amount,duration,payments,status
count,328.0,328.0,328.0,328.0,328.0,328.0,328.0
mean,6205.658537,5982.085366,949989.125,145308.621951,35.853659,4150.932927,0.719512
std,667.985675,3213.262492,9495.504646,105247.318098,16.734752,2193.620989,0.695541
min,4959.0,2.0,930705.0,4980.0,12.0,319.0,-1.0
25%,5604.25,3079.0,940809.25,68328.0,24.0,2368.75,1.0
50%,6227.5,6032.0,950565.5,114804.0,36.0,3878.5,1.0
75%,6737.25,8564.5,960525.25,198600.0,48.0,5907.75,1.0
max,7308.0,11362.0,961227.0,538500.0,60.0,9689.0,1.0


'loan_test'

Unnamed: 0,loan_id,account_id,date,amount,duration,payments,status
count,354.0,354.0,354.0,354.0,354.0,354.0,0.0
mean,6141.711864,5677.838983,975109.045198,157063.59322,37.084746,4227.477401,
std,695.356054,3345.166236,4928.280822,120285.252766,17.387392,2238.683851,
min,4962.0,25.0,970103.0,5148.0,12.0,304.0,
25%,5528.0,2734.25,970619.0,66696.0,24.0,2486.75,
50%,6100.5,5378.5,971208.0,120804.0,36.0,4003.0,
75%,6761.75,8764.5,980516.5,220437.0,48.0,5686.5,
max,7295.0,11328.0,981208.0,590820.0,60.0,9910.0,


'trans_train'

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account
count,396685.0,396685.0,396685.0,396685,325924,396685.0,396685.0,211441,97242,102229.0
unique,,,,3,5,,,7,13,
top,,,,withdrawal,withdrawal in cash,,,interest credited,ST,
freq,,,,232093,165270,,,70761,8114,
mean,1239338.0,2508.434796,951310.066801,,,5677.55298,35804.792507,,,46642290.0
std,1213288.0,2020.928889,9510.974536,,,9190.364137,19692.148243,,,30021360.0
min,1.0,1.0,930101.0,,,0.0,-13588.7,,,0.0
25%,391833.0,1092.0,941110.0,,,127.5,22424.3,,,19900180.0
50%,788258.0,2220.0,950930.0,,,1952.0,30959.6,,,46736180.0
75%,1273700.0,3357.0,960606.0,,,6500.0,44661.0,,,72322170.0


'trans_test'

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account
count,30200.0,30200.0,30200.0,30200,25070,30200.0,30200.0,12781,5823,9139.0
unique,,,,3,5,,,6,13,
top,,,,withdrawal,withdrawal in cash,,,interest credited,YZ,
freq,,,,17462,13889,,,5130,501,
mean,1997768.0,5639.798344,969666.689901,,,8978.861517,44644.210795,,,33523670.0
std,1172841.0,3362.670775,6984.081578,,,12456.202612,24068.542647,,,34383930.0
min,6145.0,25.0,950419.0,,,0.2,-17030.4,,,0.0
25%,927380.5,2720.0,961214.0,,,216.8,27441.45,,,0.0
50%,2042316.0,5477.0,970522.0,,,3700.0,40378.2,,,23140500.0
75%,3117521.0,8772.0,971117.0,,,12300.0,57737.2,,,65154920.0


## Explore the Dataset

The first thing to analyse is the columns of type `object`.

In [30]:
print("------ account ------")
print("-> frequency")
display(account["frequency"].value_counts())

print("\n------ disposition ------")
print("-> type")
display(disposition["type"].value_counts())

print("\n------ district ------")
print("-> name")
display(district["name "].value_counts())
print("-> region")
display(district["region"].value_counts())
print("-> unemploymant rate '95")
display(district["unemploymant rate '95 "].value_counts())
print("-> no. of commited crimes '95")
display(district["no. of commited crimes '95 "].value_counts())


print("\n------ card_train ------")
print("-> type")
display(card_train["type"].value_counts())

print("\n------ card_test ------")
print("-> type")
display(card_test["type"].value_counts())

print("\n------ trans_train ------")
print("-> type")
display(trans_train["type"].value_counts())
print("-> operation")
display(trans_train["operation"].value_counts())
print("-> k_symbol")
display(trans_train["k_symbol"].value_counts())
print("-> bank")
display(trans_train["bank"].value_counts())

print("\n------ trans_test ------")
print("-> type")
display(trans_test["type"].value_counts())
print("-> operation")
display(trans_test["operation"].value_counts())
print("-> k_symbol")
display(trans_test["k_symbol"].value_counts())
print("-> bank")
display(trans_test["bank"].value_counts())

------ account ------
-> frequency


monthly issuance              4167
weekly issuance                240
issuance after transaction      93
Name: frequency, dtype: int64


------ disposition ------
-> type


OWNER        4500
DISPONENT     869
Name: type, dtype: int64


------ district ------
-> name


Hl.m. Praha      1
Svitavy          1
Hodonin          1
Breclav          1
Brno - venkov    1
                ..
Plzen - mesto    1
Klatovy          1
Karlovy Vary     1
Cheb             1
Vsetin           1
Name: name , Length: 77, dtype: int64

-> region


south Moravia      14
central Bohemia    12
east Bohemia       11
north Moravia      11
west Bohemia       10
north Bohemia      10
south Bohemia       8
Prague              1
Name: region, dtype: int64

-> unemploymant rate '95


1.51    2
3.38    2
1.60    2
3.13    2
1.79    2
       ..
5.75    1
6.43    1
1.02    1
1.67    1
4.01    1
Name: unemploymant rate '95 , Length: 71, dtype: int64

-> no. of commited crimes '95


2854     2
85677    1
1660     1
3729     1
3659     1
        ..
1822     1
5198     1
2879     1
1089     1
3460     1
Name: no. of commited crimes '95 , Length: 76, dtype: int64


------ card_train ------
-> type


classic    127
junior      41
gold         9
Name: type, dtype: int64


------ card_test ------
-> type


classic    20
junior      4
gold        1
Name: type, dtype: int64


------ trans_train ------
-> type


withdrawal            232093
credit                159468
withdrawal in cash      5124
Name: type, dtype: int64

-> operation


withdrawal in cash              165270
remittance to another bank       70737
credit in cash                   62202
collection from another bank     26505
credit card withdrawal            1210
Name: operation, dtype: int64

-> k_symbol


interest credited                        70761
payment for statement                    58377
household                                42839
                                         19065
old-age pension                          13502
insurrance payment                        6592
sanction interest if negative balance      305
Name: k_symbol, dtype: int64

-> bank


ST    8114
GH    7886
EF    7878
AB    7666
UV    7618
OP    7595
IJ    7536
YZ    7471
QR    7413
KL    7397
WX    7033
CD    7009
MN    6626
Name: bank, dtype: int64


------ trans_test ------
-> type


withdrawal            17462
credit                11882
withdrawal in cash      856
Name: type, dtype: int64

-> operation


withdrawal in cash              13889
credit in cash                   5304
remittance to another bank       4375
collection from another bank     1448
credit card withdrawal             54
Name: operation, dtype: int64

-> k_symbol


interest credited                        5130
payment for statement                    3186
household                                2192
                                         1549
insurrance payment                        681
sanction interest if negative balance      43
Name: k_symbol, dtype: int64

-> bank


YZ    501
UV    500
WX    497
KL    482
QR    460
OP    446
AB    438
IJ    438
GH    432
MN    431
CD    414
EF    401
ST    383
Name: bank, dtype: int64

- Some column names in district have a trailing space (e.g. `'name '`)
- Transaction `type` should only take the values withdrawal and credit (-/+); `withdrawal in cash` is an operation and should be recorded as withdrawal
- `k_symbol` has blank (whitespace-only) values
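
The three issues above could be handled along the following lines. This is a sketch on toy frames with illustrative values; the names `district_toy` and `trans_toy` are assumptions, and mapping `withdrawal in cash` to `withdrawal` follows the reading given in the list above rather than any documented rule.

```python
import pandas as pd

# Toy frames reproducing the three issues (illustrative values only)
district_toy = pd.DataFrame({"name ": ["Benesov"], "region": ["central Bohemia"]})
trans_toy = pd.DataFrame({
    "type": ["credit", "withdrawal", "withdrawal in cash"],
    "k_symbol": ["household", " ", None],
})

# 1. Strip stray whitespace from the column names
district_toy.columns = district_toy.columns.str.strip()

# 2. Fold the stray category into "withdrawal"
trans_toy["type"] = trans_toy["type"].replace({"withdrawal in cash": "withdrawal"})

# 3. Treat whitespace-only k_symbol values as missing
blank = trans_toy["k_symbol"].str.strip() == ""
trans_toy["k_symbol"] = trans_toy["k_symbol"].mask(blank)

print(list(district_toy.columns))
print(trans_toy)
```

After stripping the column names, the trailing spaces in lookups such as `district["name "]` are no longer needed.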
