# How's our data integrity?

New data has been merged into the `banking` DataFrame that contains details on how investments in the `inv_amount` column are allocated across four different funds A, B, C and D.

Furthermore, customers' ages and birth dates are now stored in the `Age` and `birth_date` columns respectively.

You want to understand how customers of different age groups invest. First, however, you want to make sure the data you're analyzing is correct. You will do so by cross-field checking the values of `inv_amount` and `Age` against the amounts invested in the different funds and against customers' birth dates. Both `pandas` and `datetime` have been imported as `pd` and `dt` respectively.

In [6]:
import pandas as pd
import numpy as np
from faker import Faker
import datetime as dt
fake = Faker()
path=r'Z:/'
file='banking_dirty.csv'
banking = pd.read_csv(path+file,index_col = [0],parse_dates=['birth_date'])
acct_cur = [fake.random_element(elements=('dollar', 'euro')) for _ in range(len(banking))]
banking['acct_cur']=acct_cur
print(banking.head(),'\n')
#banking['birth_date'] = banking['birth_date'].astype('datetime64[ns]')

    cust_id birth_date  Age  acct_amount  inv_amount   fund_A   fund_B  \
0  870A9281 1962-06-09   58     63523.31       51295  30105.0   4138.0   
1  166B05B0 1962-12-16   58     38175.46       15050   4995.0    938.0   
2  BFC13E88 1990-09-12   34     59863.77       24567  10323.0   4590.0   
3  F2158F66 1985-11-03   35     84132.10       23712   3908.0    492.0   
4  7A73F334 1990-05-17   30    120512.00       93230  12158.4  51281.0   

    fund_C   fund_D account_opened last_transaction acct_cur  
0   1420.0  15632.0       02-09-18         22-02-19   dollar  
1   6696.0   2421.0       28-02-19         31-10-18     euro  
2   8469.0   1185.0       25-04-18         02-04-18     euro  
3   6482.0  12830.0       07-11-17         08-11-18   dollar  
4  13434.0  18383.0       14-05-18         19-07-18     euro   



In [7]:
banking.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 0 to 99
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cust_id           100 non-null    object        
 1   birth_date        100 non-null    datetime64[ns]
 2   Age               100 non-null    int64         
 3   acct_amount       100 non-null    float64       
 4   inv_amount        100 non-null    int64         
 5   fund_A            100 non-null    float64       
 6   fund_B            100 non-null    float64       
 7   fund_C            100 non-null    float64       
 8   fund_D            100 non-null    float64       
 9   account_opened    100 non-null    object        
 10  last_transaction  100 non-null    object        
 11  acct_cur          100 non-null    object        
dtypes: datetime64[ns](1), float64(5), int64(2), object(4)
memory usage: 10.2+ KB


Find the rows of `banking` where the row-wise sum of the columns listed in `fund_columns` equals the `inv_amount` column.
Store the rows of `banking` with a consistent `inv_amount` in `consistent_inv`, and those with an inconsistent one in `inconsistent_inv`.

In [2]:
# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis=1) == banking['inv_amount']

# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]

# Report how many rows have an inconsistent inv_amount
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])

Number of inconsistent investments:  8
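The fund columns are floats, so an exact `==` comparison can occasionally flag rows whose sum differs from `inv_amount` only by floating-point rounding. The following is a minimal sketch of the same cross-field check with a tolerance, using `np.isclose`. The `toy` DataFrame and its values are made up for illustration and are not taken from `banking_dirty.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `banking` (values are illustrative only)
toy = pd.DataFrame({
    "inv_amount": [100, 250, 300],
    "fund_A": [40.0, 100.0, 100.1],
    "fund_B": [30.0, 50.0, 100.2],
    "fund_C": [20.0, 50.0, 50.0],
    "fund_D": [10.0, 25.0, 49.7],
})
fund_columns = ["fund_A", "fund_B", "fund_C", "fund_D"]

# Row-wise sum of the fund columns
fund_total = toy[fund_columns].sum(axis=1)

# Compare with a tolerance rather than exact equality
inv_equ = pd.Series(np.isclose(fund_total, toy["inv_amount"]), index=toy.index)

# Split into consistent and inconsistent rows, as in the cell above
consistent_inv = toy[inv_equ]
inconsistent_inv = toy[~inv_equ]
print("Number of inconsistent investments:", inconsistent_inv.shape[0])
```

Whether a tolerance is appropriate depends on how the amounts were recorded; for values stored to the cent, exact comparison on rounded values may be just as reasonable.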


Store today's date in `today`, then manually calculate customers' ages and store them in `ages_manual`.
Find all rows of `banking` where the `Age` column is equal to `ages_manual`, and then filter `banking` into `consistent_ages` and `inconsistent_ages`.


In [8]:
# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year

# Find rows where age column == ages_manual
age_equ = banking['Age']  == ages_manual

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Report how many rows have an inconsistent Age
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])

Number of inconsistent ages:  100
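A count of 100 means that no recorded `Age` matches the age computed from today's date. One possible explanation, which is an assumption and not something the source confirms, is that `Age` was recorded relative to an earlier date: the first row shown above pairs a 1962 birth year with an age of 58. The sketch below recomputes ages against a fixed reference date and also accounts for whether the birthday has already occurred in that year. The helper `ages_on` and the reference date `2020-12-31` are hypothetical; the three rows are copied from the `banking.head()` output above.

```python
import datetime as dt
import pandas as pd

def ages_on(birth_dates: pd.Series, reference: dt.date) -> pd.Series:
    """Age in completed years on `reference` (hypothetical helper)."""
    years = reference.year - birth_dates.dt.year
    # True where the birthday has already occurred by the reference date
    had_birthday = (birth_dates.dt.month < reference.month) | (
        (birth_dates.dt.month == reference.month)
        & (birth_dates.dt.day <= reference.day)
    )
    # Subtract one year where the birthday is still to come
    return years - (~had_birthday).astype(int)

# Three rows taken from the banking.head() output shown earlier
toy = pd.DataFrame({
    "birth_date": pd.to_datetime(["1962-06-09", "1990-09-12", "1985-11-03"]),
    "Age": [58, 34, 35],
})

# Assumed reference date; the real date the ages were recorded on is unknown
reference = dt.date(2020, 12, 31)
ages_manual = ages_on(toy["birth_date"], reference)

# Same cross-field check as above, against the fixed reference date
age_equ = toy["Age"] == ages_manual
consistent_ages = toy[age_equ]
inconsistent_ages = toy[~age_equ]
print("Number of inconsistent ages:", inconsistent_ages.shape[0])
```

With a fixed reference date the check becomes reproducible, whereas `dt.date.today()` gives a different result every year the notebook is rerun.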
