# Data understanding: transaction data (after refinement)

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import scipy
import numpy as np
import data_understanding_utils as du
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline
# Do not truncate columns when displaying DataFrames
pd.set_option('display.max_columns',None)

path = "./refined/"
transacation_data = pd.read_csv(path+"transaction.csv", sep=';')

transacation_data

Unnamed: 0,trans_id,account_id,trans_date,trans_type,trans_operation,trans_amount,trans_balance,trans_k_symbol,trans_year
0,1548749,5270,1993-01-13,credit,credit in cash,800.0,800.0,,1993
1,1548750,5270,1993-01-14,credit,collection from another bank,44749.0,45549.0,,1993
2,3393738,11265,1993-01-14,credit,credit in cash,1000.0,1000.0,,1993
3,3122924,10364,1993-01-17,credit,credit in cash,1100.0,1100.0,,1993
4,1121963,3834,1993-01-19,credit,credit in cash,700.0,700.0,,1993
...,...,...,...,...,...,...,...,...,...
396680,515914,1763,1996-12-31,withdrawal,withdrawal in cash,-14.6,67769.5,payment for statement,1996
396681,516262,1765,1996-12-31,withdrawal,withdrawal in cash,-14.6,19708.1,payment for statement,1996
396682,520019,1775,1996-12-31,withdrawal,withdrawal in cash,-14.6,15944.5,payment for statement,1996
396683,517894,1769,1996-12-31,withdrawal,withdrawal in cash,-14.6,34679.4,payment for statement,1996


### 2.2 Describe data



In [15]:
du.info_data(transacation_data,"shape","")
du.info_data(transacation_data,"head","")

(396685, 9)

   trans_id  account_id  trans_date trans_type               trans_operation  \
0   1548749        5270  1993-01-13     credit                credit in cash   
1   1548750        5270  1993-01-14     credit  collection from another bank   
2   3393738       11265  1993-01-14     credit                credit in cash   
3   3122924       10364  1993-01-17     credit                credit in cash   
4   1121963        3834  1993-01-19     credit                credit in cash   

   trans_amount  trans_balance trans_k_symbol  trans_year  
0         800.0          800.0            NaN        1993  
1       44749.0        45549.0            NaN        1993  
2        1000.0         1000.0            NaN        1993  
3        1100.0         1100.0            NaN        1993  
4         700.0          700.0            NaN        1993  



#### Info about the dataset

In [16]:
du.info_data(transacation_data,"info","")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396685 entries, 0 to 396684
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   trans_id         396685 non-null  int64  
 1   account_id       396685 non-null  int64  
 2   trans_date       396685 non-null  object 
 3   trans_type       396685 non-null  object 
 4   trans_operation  396685 non-null  object 
 5   trans_amount     396685 non-null  float64
 6   trans_balance    396685 non-null  float64
 7   trans_k_symbol   211441 non-null  object 
 8   trans_year       396685 non-null  int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 27.2+ MB
None



In [17]:
du.info_data(transacation_data,"isnull","")

Number of null values: 
 trans_id                0
account_id              0
trans_date              0
trans_type              0
trans_operation         0
trans_amount            0
trans_balance           0
trans_k_symbol     185244
trans_year              0
dtype: int64



We already dropped the bank and account attributes in crisp_dm sprints 1 and 2, because they have many missing values and are clearly not relevant to the analysis.

For trans_k_symbol (previously K_symbol), it is harder to judge whether the attribute is relevant to the analysis. About 47% of its values are missing (185,244 of 396,685 rows), so it will be difficult to estimate them without introducing bias.
In the next sprints we will treat this attribute carefully and in the best possible way.
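As a quick check of the missing share, the ratio can be computed from the counts printed above. The snippet below is a minimal sketch: the `toy` frame and its values are illustrative assumptions, not the real data, and only show how the same ratio could be obtained per column with pandas.

```python
import pandas as pd

# Counts taken from the shape and isnull outputs above.
n_rows = 396685
n_missing_k_symbol = 185244

missing_share = n_missing_k_symbol / n_rows
print(f"trans_k_symbol missing share: {missing_share:.1%}")

# On a DataFrame, the per-column missing share is the mean of the null mask.
# `toy` is a hypothetical stand-in for the transaction table.
toy = pd.DataFrame({
    "trans_amount": [800.0, 44749.0, -14.6, -14.6],
    "trans_k_symbol": [None, None, "payment for statement", "payment for statement"],
})
missing_by_column = toy.isna().mean()
print(missing_by_column)
```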

Because trans_k_symbol has so many null values, we can either drop it or, if plausible, estimate the missing values. Estimation may introduce bias into the data and affect the results. Possible estimates include:
- the most common value of the attribute (the mode, since the attribute is categorical);
- a value derived from one or more other attributes;
- more sophisticated methods.
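The first two options could look roughly like the sketch below. It is not the approach chosen for the project, only an illustration on a small hypothetical frame (`toy`); using trans_operation as the "other attribute" is an assumption made for the example.

```python
import pandas as pd

# Hypothetical miniature of the transaction table.
toy = pd.DataFrame({
    "trans_operation": [
        "withdrawal in cash", "withdrawal in cash", "withdrawal in cash",
        "credit in cash", "credit in cash",
    ],
    "trans_k_symbol": [
        "payment for statement", None, "payment for statement", None, None,
    ],
})

# Option 1: fill every gap with the global mode of the attribute.
global_mode = toy["trans_k_symbol"].mode().iloc[0]
filled_global = toy["trans_k_symbol"].fillna(global_mode)

# Option 2: fill with the mode observed for the same trans_operation.
# Operations with no observed value at all are left missing.
mode_by_operation = (
    toy.dropna(subset=["trans_k_symbol"])
       .groupby("trans_operation")["trans_k_symbol"]
       .agg(lambda s: s.mode().iloc[0])
)
fallback = toy["trans_operation"].map(mode_by_operation)
filled_by_operation = toy["trans_k_symbol"].fillna(fallback)

print(filled_global)
print(filled_by_operation)
```

In this toy case the global mode labels the "credit in cash" rows as "payment for statement", which shows the kind of bias mentioned above. The per-operation variant leaves those rows missing instead.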

In [22]:
res_duplicate = du.check_duplicates(transacation_data,"transaction",["trans_id"])

No duplicates found in the data
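The internals of `du.check_duplicates` are not shown here. A plain pandas check on the key column might look like the following sketch, where the `toy` frame deliberately contains a repeated trans_id for illustration.

```python
import pandas as pd

# Hypothetical frame with one repeated trans_id (rows 1 and 3).
toy = pd.DataFrame({
    "trans_id": [1548749, 1548750, 3393738, 1548750],
    "account_id": [5270, 5270, 11265, 5270],
})

# keep=False marks every row that shares its trans_id with another row.
duplicate_mask = toy.duplicated(subset=["trans_id"], keep=False)
duplicate_rows = toy[duplicate_mask]

if duplicate_rows.empty:
    print("No duplicates found in the data")
else:
    print(f"{len(duplicate_rows)} rows share a trans_id:")
    print(duplicate_rows)
```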


In the data preparation step, we need to create new features from the existing ones, for example the account age. We may also change how the frequency attribute is represented and derive other useful information.
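A minimal sketch of such derived features is shown below. It assumes that account age can be approximated as the number of days between an account's first transaction and a reference date. Both that definition and the reference date are assumptions for illustration, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the transaction table.
toy = pd.DataFrame({
    "account_id": [5270, 5270, 11265],
    "trans_date": ["1993-01-13", "1993-01-14", "1993-01-14"],
    "trans_amount": [800.0, 44749.0, 1000.0],
})

# trans_date is loaded as object (string), so parse it before using it.
toy["trans_date"] = pd.to_datetime(toy["trans_date"], format="%Y-%m-%d")
toy["trans_month"] = toy["trans_date"].dt.month

# Assumed reference date: the last transaction date seen in the data.
reference_date = pd.Timestamp("1996-12-31")

# Proxy for account age: days since the first transaction of each account.
first_seen = toy.groupby("account_id")["trans_date"].min()
account_age_days = (reference_date - first_seen).dt.days
print(account_age_days)
```

If the account table provides an explicit creation date, that column would be a better basis for the account age than the first transaction.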

#### Statistical Summary

In [19]:
du.info_data(transacation_data,"describe","")



           trans_id     account_id   trans_amount  trans_balance  \
count  3.966850e+05  396685.000000  396685.000000  396685.000000   
mean   1.239338e+06    2508.434796     299.980242   35804.792507   
std    1.213288e+06    2020.928889   10798.494973   19692.148243   
min    1.000000e+00       1.000000  -86400.000000  -13588.700000   
25%    3.918330e+05    1092.000000   -2700.000000   22424.300000   
50%    7.882580e+05    2220.000000     -14.600000   30959.600000   
75%    1.273700e+06    3357.000000     210.600000   44661.000000   
max    3.682934e+06   11382.000000   74812.000000  193909.900000   

          trans_year  
count  396685.000000  
mean     1995.061089  
std         0.953892  
min      1993.000000  
25%      1994.000000  
50%      1995.000000  
75%      1996.000000  
max      1996.000000  



TODO: We need to do the transaction understanding on the raw data, so that we can justify the data preparation steps that dropped the bank and account attributes in sprints 1 and 2.