# Data preparation: Transactions and loans

In [7]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn import preprocessing


trans_df = pd.read_csv("data/trans_dev.csv",sep=";", low_memory=False)
loan_df = pd.read_csv('data/loan_dev.csv', sep=';')

## Transaction preparation



### Data quality issues

In [8]:
#### Data cleaning, transformation and data quality changes
# Prefix every transaction column with "trans_" in a single rename call
trans_df.rename(columns={
    'date': 'trans_date',
    'type': 'trans_type',
    'operation': 'trans_operation',
    'amount': 'trans_amount',
    'balance': 'trans_balance',
    'k_symbol': 'trans_k_symbol',
    'bank': 'trans_bank',
    'account': 'trans_account',
}, inplace=True)


#### Noise

Nothing to report and no improvements to be made.

#### Outliers

Nothing to report and no improvements to make for now, but we will check the outliers in more detail in the next sprints.

#### Inconsistent, incorrect data or improving data quality

In [9]:

# Parse the YYMMDD dates into datetime values
trans_df['trans_date'] = pd.to_datetime(trans_df['trans_date'], format='%y%m%d')
# Merge "withdrawal in cash" into "withdrawal"
trans_df.loc[trans_df["trans_type"]=="withdrawal in cash","trans_type"] = "withdrawal"

# Extract the year with the vectorized .dt accessor instead of a Python loop
trans_df['trans_year'] = trans_df['trans_date'].dt.year


# trans_operation without a name: change to other_types
trans_df.loc[trans_df["trans_operation"].isnull(),"trans_operation"] = "other_types"
# If the transaction is a withdrawal, the amount is negative
trans_df.loc[trans_df["trans_type"]=="withdrawal","trans_amount"] *= -1


### Missing values

We decided to drop the columns account, bank and k_symbol (now renamed to trans_account, trans_bank and trans_k_symbol) because of their high number of missing values and, more importantly, because nothing found so far suggests they are relevant to the analysis.

Todo: check if the missing values are relevant to the analysis.

For now we also drop the trans_k_symbol column but, as mentioned in the data_understanding phase, we will treat it more carefully later and check whether its missing values can be replaced in a way that does not introduce bias.
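To make the "high number of missing values" argument checkable, the share of missing values per column can be measured before dropping anything. The snippet below is only a sketch on a toy frame: the values and the 0.5 cut-off (`MISSING_THRESHOLD`) are assumptions for illustration, not numbers taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for trans_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "trans_amount": [100.0, 250.0, 75.0, 300.0],
    "trans_bank": [np.nan, "AB", np.nan, np.nan],
    "trans_k_symbol": [np.nan, np.nan, "household", np.nan],
})

# Fraction of missing values per column, highest first.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cut-off: flag columns with more than half of their values missing.
MISSING_THRESHOLD = 0.5
drop_candidates = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()
print(drop_candidates)
```

Running the same two lines on `trans_df` would give the actual shares to quote in this section.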

#### Duplicates

No duplicates were found in the data understanding phase. Nothing to report.
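For reference, a duplicate check of this kind can be sketched as follows. The toy frame and the `trans_id` and `account_id` column names are assumptions for illustration; the real check should use whichever identifier columns the transaction table actually has.

```python
import pandas as pd

# Toy stand-in for trans_df with one repeated row; values are illustrative.
toy_df = pd.DataFrame({
    "trans_id": [1, 2, 2, 3],
    "account_id": [10, 11, 11, 12],
    "trans_amount": [100.0, 50.0, 50.0, 75.0],
})

# Rows that repeat an earlier row in every column.
n_full_duplicates = int(toy_df.duplicated().sum())

# Rows that reuse an identifier, even if other columns differ.
n_id_duplicates = int(toy_df.duplicated(subset=["trans_id"]).sum())

# Keep the first occurrence of each fully duplicated row.
deduplicated_df = toy_df.drop_duplicates()

print(n_full_duplicates, n_id_duplicates, len(deduplicated_df))
```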

#### Inconsistent or Incorrect Data

Todo: 

We could research, or consult a professional about, technical and bank-specific information on banks and accounts, but we will not do it now.



### Data pre-processing

#### Feature Extraction

#### Data cleaning

In [None]:
# the bank and account columns are not needed: they have many null values
trans_df.drop(['trans_bank', 'trans_account','trans_k_symbol' ], axis=1, inplace=True)
print(trans_df.head())


trans_df.to_csv('refined/transaction.csv',sep=';',index=False)

Todo: 
- Handling Missing Values
- Handling Duplicates
- Handling Inconsistent or Incorrect Data
    - statistical-based methods to detect outliers
    - Domain knowledge
    - Inconsistency detection
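As a starting point for the statistical outlier detection listed above, one common option is the interquartile range (IQR) rule. The sketch below uses made-up amounts and the conventional 1.5 multiplier; both are assumptions, and the rule itself is only one candidate method, not a decision already taken in this project.

```python
import pandas as pd

# Made-up amounts standing in for trans_df["trans_amount"].
amounts = pd.Series([100.0, 120.0, 110.0, 130.0, 125.0, 5000.0], name="trans_amount")

# Quartiles and interquartile range.
q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged, not removed.
outlier_mask = (amounts < lower_fence) | (amounts > upper_fence)
outliers = amounts[outlier_mask]
print(outliers)
```

Flagging instead of deleting leaves room for the domain-knowledge review before any value is changed or removed.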




#### Data transformation

##### Label encoding

In [None]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
trans_df['trans_operation']= label_encoder.fit_transform(trans_df['trans_operation'])



##### Inconsistent, incorrect data or improving data quality

In [None]:
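# Parse trans_date (it may already be datetime from an earlier cell)
trans_df['trans_date'] = pd.to_datetime(trans_df['trans_date'], format='%y%m%d')

# Extract year and month with the vectorized .dt accessor
trans_df['trans_year'] = trans_df['trans_date'].dt.year
trans_df['trans_month'] = trans_df['trans_date'].dt.month
# Drop the original date column now that year and month are extracted
trans_df.drop(columns=['trans_date'], inplace=True)

trans_df.loc[trans_df["trans_type"]=="withdrawal in cash","trans_type"] = "withdrawal"
# Withdrawals get a negative amount; -abs() avoids flipping the sign back
# if an earlier cell has already negated these values
withdrawal_mask = trans_df["trans_type"] == "withdrawal"
trans_df.loc[withdrawal_mask, "trans_amount"] = -trans_df.loc[withdrawal_mask, "trans_amount"].abs()
trans_df.head()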

A lot to do here:

Some common strategies:
- Normalization
- Binarization / One-Hot Encoding
- Discretization

What we have done:
- Label encoding (integer codes, not one-hot) of the categorical nominal attribute trans_operation
- Change the value "withdrawal in cash" to "withdrawal" in the trans_type attribute
- Make the values of the trans_amount attribute negative when the trans_type attribute is "withdrawal"
- Convert the trans_date attribute to datetime type (year-month-day)
- Add a trans_year attribute to the dataset, extracted from trans_date
- Add a trans_month attribute to the dataset, extracted from trans_date
- Drop the trans_date column
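Since `LabelEncoder` yields a single column of integer codes, a nominal attribute such as trans_operation may end up looking ordered to a model. If a true one-hot representation is wanted, `pd.get_dummies` is one way to get it. The sketch below runs on a toy frame; the operation names and the `op` prefix are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the trans_operation column; category names are illustrative.
toy_df = pd.DataFrame({
    "trans_operation": [
        "credit in cash",
        "other_types",
        "credit in cash",
        "remittance to another bank",
    ],
})

# One indicator column per distinct category, named "op_<category>".
one_hot_df = pd.get_dummies(toy_df, columns=["trans_operation"], prefix="op")

print(one_hot_df.columns.tolist())
print(one_hot_df.shape)
```

Each row has exactly one active indicator, so no artificial ordering between operations is introduced, at the cost of one extra column per category.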


#### Feature engineering

#### Data and Dimensionality Reduction

### Export new transaction data to csv file

In [None]:
# Use the same ';' separator as the input files and skip the index column
trans_df.to_csv('refined/transaction.csv', sep=';', index=False)