In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Uniformity

In this exercise and throughout this chapter, you will be working with a retail banking dataset stored in the `banking` DataFrame. The dataset contains data on the amount of money stored in accounts (`acct_amount`), the account currency (`acct_cur`), the amount invested (`inv_amount`), the account opening date (`account_opened`), and the last transaction date (`last_transaction`).

In [2]:
banking = pd.read_csv("banking_dirty.csv")

In [3]:
banking.head()

Unnamed: 0.1,Unnamed: 0,cust_id,birth_date,Age,acct_amount,inv_amount,fund_A,fund_B,fund_C,fund_D,account_opened,last_transaction
0,0,870A9281,1962-06-09,58,63523.31,51295,30105.0,4138.0,1420.0,15632.0,02-09-18,22-02-19
1,1,166B05B0,1962-12-16,58,38175.46,15050,4995.0,938.0,6696.0,2421.0,28-02-19,31-10-18
2,2,BFC13E88,1990-09-12,34,59863.77,24567,10323.0,4590.0,8469.0,1185.0,25-04-18,02-04-18
3,3,F2158F66,1985-11-03,35,84132.1,23712,3908.0,492.0,6482.0,12830.0,07-11-17,08-11-18
4,4,7A73F334,1990-05-17,30,120512.0,93230,12158.4,51281.0,13434.0,18383.0,14-05-18,19-07-18


### Ex 1:

You are tasked with understanding the average account size and how investments vary by account size. To produce this analysis accurately, you first need to `unify the currency amount into dollars`.

In [None]:
# Find the rows of acct_cur in banking that are equal to 'euro' and store them in the variable acct_eu.

acct_eu = banking['acct_cur'] == 'euro'

In [None]:
# Find all the rows of acct_amount in banking that fit the acct_eu condition, 
# and convert them to USD by multiplying them with 1.1.

banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1

In [None]:
# Find all the rows of acct_cur in banking that fit the acct_eu condition, set them to 'dollar'.

banking.loc[acct_eu, 'acct_cur'] = 'dollar'

In [None]:
# Assert that only dollar currency remains

assert (banking['acct_cur'] == 'dollar').all()
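The currency steps above can be tried end to end on a small stand-in table. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from `banking_dirty.csv`; only the 1.1 conversion rate comes from the exercise.

```python
import pandas as pd

# Toy data (hypothetical values, not from banking_dirty.csv)
toy = pd.DataFrame({
    "acct_amount": [100.0, 200.0, 300.0],
    "acct_cur": ["dollar", "euro", "euro"],
})

EUR_TO_USD = 1.1  # same fixed rate as used in the exercise

# Build the boolean mask once, before any values change
is_euro = toy["acct_cur"] == "euro"

# Convert the amounts first, then relabel the currency
toy.loc[is_euro, "acct_amount"] = toy.loc[is_euro, "acct_amount"] * EUR_TO_USD
toy.loc[is_euro, "acct_cur"] = "dollar"

# .all() reduces the comparison to a single True/False,
# even if several different currency labels were still present
assert (toy["acct_cur"] == "dollar").all()
print(toy)
```

Computing the mask before relabelling matters: once `acct_cur` is set to `'dollar'`, a fresh comparison against `'euro'` would no longer find the rows that were converted.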

### Uniform dates

After having unified the currencies of your different account amounts, you want to add a temporal dimension to your analysis and see how customers have been investing their money given the size of their account over each year. The `account_opened` column represents when customers opened their accounts and is a good proxy for segmenting customer activity and investment over time.

### Ex 2: 

Since this data was consolidated from `multiple sources`, you need to make sure that all `dates` are of the `same format`. You will do so by converting this column into a `datetime` object, while `making sure` that the format is `inferred` and `potentially incorrect` formats are set to `missing`.

In [6]:
# Print the header of account_opened
print(banking["account_opened"].head())

0    02-09-18
1    28-02-19
2    25-04-18
3    07-11-17
4    14-05-18
Name: account_opened, dtype: object


Take a look at the output. You tried converting these values to datetime using the default `to_datetime()` function without changing any arguments, but received an error from the following call:

In [7]:
banking['account_opened'] = pd.to_datetime(banking['account_opened'])

In [8]:
# Convert the account_opened column to datetime, inferring the date format
# and returning a missing value (NaT) for entries that cannot be parsed.

banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format = True,
                                           # Return missing value for error
                                           errors = "coerce") 
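Letting pandas infer the format can silently mix day-first and month-first readings: in the outputs of this notebook, `02-09-18` ends up as `2018-02-09` while `28-02-19` ends up as `2019-02-28`. In pandas 2.0 and later, `infer_datetime_format` is also deprecated. If the column really is day-month-year with a two-digit year, which is an assumption based on the sample rows, an explicit `format` is a safer option. A minimal sketch, where the last two strings are hypothetical bad entries:

```python
import pandas as pd

# First three values appear in the account_opened sample above;
# the last two are hypothetical malformed entries
raw = pd.Series(["02-09-18", "28-02-19", "25-04-18", "2018/31/12", "not a date"])

# Assumption: dates are day-month-year with a two-digit year.
# Anything that does not match the format becomes NaT.
parsed = pd.to_datetime(raw, format="%d-%m-%y", errors="coerce")

print(parsed)
print("Unparseable entries:", parsed.isna().sum())
```

With an explicit format, `02-09-18` is read as 2 September 2018, and entries in any other layout are set to missing rather than guessed.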

In [9]:
# Extract the year from the amended account_opened column and assign it to the acct_year column
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')

In [10]:
# Print acct_year
print(banking['acct_year'])

0     2018
1     2019
2     2018
3     2017
4     2018
      ... 
95    2018
96    2017
97    2017
98    2017
99    2017
Name: acct_year, Length: 100, dtype: object
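Note that the output has `dtype: object`, because `.dt.strftime('%Y')` returns the year as text. If you need to sort or compute on the year, `.dt.year` returns it as a number instead. A small sketch on made-up dates:

```python
import pandas as pd

# Hypothetical dates, including one missing value
opened = pd.to_datetime(pd.Series(["2018-02-09", "2019-02-28", None]))

# strftime returns strings; the missing date stays missing
year_as_text = opened.dt.strftime("%Y")

# .dt.year returns numbers; the missing date becomes NaN
year_as_number = opened.dt.year

print(year_as_text)
print(year_as_number)
```

Either form works for grouping by year; the numeric form is usually more convenient for comparisons such as `year_as_number >= 2018`.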


### Cross field validation

Cross field validation is the use of multiple fields in a dataset to `sanity check data integrity`.

New data has been merged into the banking DataFrame that contains details on how investments in the `inv_amount` column are allocated across four different funds `A, B, C and D`.

Furthermore, the `age` and `birthdays` of customers are now stored in the `Age` and `birth_date` columns respectively.

In [11]:
banking.head()

Unnamed: 0.1,Unnamed: 0,cust_id,birth_date,Age,acct_amount,inv_amount,fund_A,fund_B,fund_C,fund_D,account_opened,last_transaction,acct_year
0,0,870A9281,1962-06-09,58,63523.31,51295,30105.0,4138.0,1420.0,15632.0,2018-02-09,22-02-19,2018
1,1,166B05B0,1962-12-16,58,38175.46,15050,4995.0,938.0,6696.0,2421.0,2019-02-28,31-10-18,2019
2,2,BFC13E88,1990-09-12,34,59863.77,24567,10323.0,4590.0,8469.0,1185.0,2018-04-25,02-04-18,2018
3,3,F2158F66,1985-11-03,35,84132.1,23712,3908.0,492.0,6482.0,12830.0,2017-07-11,08-11-18,2017
4,4,7A73F334,1990-05-17,30,120512.0,93230,12158.4,51281.0,13434.0,18383.0,2018-05-14,19-07-18,2018


### Ex 2:

You want to understand how customers of `different age groups` invest. However, you first want to make sure the data you're analyzing is `correct`. You will do so by cross field checking the values of `inv_amount` and `Age` against the `amount` invested in the different funds and customers' `birthdays`.

In [19]:
### Sanity check in Investment

In [14]:
# Find the rows where the row-wise sum of the fund columns in banking equals the inv_amount column.

# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis = 1) == banking["inv_amount"]
inv_equ

0      True
1      True
2      True
3      True
4     False
      ...  
95     True
96     True
97     True
98     True
99     True
Length: 100, dtype: bool
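Because the fund columns hold floating-point values, an exact `==` comparison can flag rows that differ only by rounding error. One option is to compare within a small tolerance using `np.isclose`. This is a sketch on made-up numbers; the 0.01 tolerance (one cent) is an assumption, not something specified in the exercise.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "inv_amount": [100.0, 0.3, 50.0],
    "fund_A": [60.0, 0.1, 20.0],
    "fund_B": [40.0, 0.2, 20.0],
})
fund_columns = ["fund_A", "fund_B"]

# Row-wise total across the fund columns
fund_total = toy[fund_columns].sum(axis=1)

# Exact comparison: may be False for row 1 purely due to float rounding
exact_match = fund_total == toy["inv_amount"]

# Tolerant comparison: treat totals within one cent as equal (assumed tolerance)
close_match = pd.Series(
    np.isclose(fund_total, toy["inv_amount"], atol=0.01),
    index=toy.index,
)

print(pd.DataFrame({"exact": exact_match, "close": close_match}))
```

Rows that fail even the tolerant check, like the last one here, are the ones worth investigating as genuinely inconsistent.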

In [15]:
# Store the rows of banking with a consistent inv_amount in consistent_inv
consistent_inv = banking[inv_equ]
consistent_inv

Unnamed: 0.1,Unnamed: 0,cust_id,birth_date,Age,acct_amount,inv_amount,fund_A,fund_B,fund_C,fund_D,account_opened,last_transaction,acct_year
0,0,870A9281,1962-06-09,58,63523.31,51295,30105.0,4138.0,1420.0,15632.0,2018-02-09,22-02-19,2018
1,1,166B05B0,1962-12-16,58,38175.46,15050,4995.0,938.0,6696.0,2421.0,2019-02-28,31-10-18,2019
2,2,BFC13E88,1990-09-12,34,59863.77,24567,10323.0,4590.0,8469.0,1185.0,2018-04-25,02-04-18,2018
3,3,F2158F66,1985-11-03,35,84132.10,23712,3908.0,492.0,6482.0,12830.0,2017-07-11,08-11-18,2017
5,5,472341F2,1980-02-23,40,83127.65,67960,12686.0,19776.0,23707.0,11791.0,2018-12-14,22-04-18,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,CA507BA1,1974-08-10,46,12209.84,7515,190.0,931.0,1451.0,4943.0,2018-05-26,11-09-19,2018
96,96,B99CD662,1989-12-12,31,92838.44,49089,2453.0,7892.0,31486.0,7258.0,2017-04-05,12-03-19,2017
97,97,13770971,1984-11-29,36,92750.87,27962,3352.0,7547.0,8486.0,8577.0,2017-08-16,24-04-19,2017
98,98,93E78DA3,1969-12-14,51,41942.23,29662,1758.0,11174.0,11650.0,5080.0,2017-09-10,15-04-18,2017


In [16]:
inconsistent_inv = banking[~inv_equ]
inconsistent_inv

Unnamed: 0.1,Unnamed: 0,cust_id,birth_date,Age,acct_amount,inv_amount,fund_A,fund_B,fund_C,fund_D,account_opened,last_transaction,acct_year
4,4,7A73F334,1990-05-17,30,120512.0,93230,12158.4,51281.0,13434.0,18383.0,2018-05-14,19-07-18,2018
12,12,EEBD980F,1990-11-20,34,57838.49,50812,18314.0,1477.0,29049.48,5539.0,2018-08-12,04-01-20,2018
22,22,96525DA6,1992-11-23,28,82511.24,33927,8206.0,15019.0,5559.6,6182.0,2018-07-23,07-08-18,2018
43,43,38B8CD9C,1970-06-25,50,28834.71,27531,314.0,6072.28,14163.0,7908.0,2018-09-17,05-02-20,2018
47,47,68C55974,1962-07-08,58,95038.14,66796,33764.0,5042.0,10659.0,19237.41,2018-03-04,25-09-18,2018
65,65,0A9BA907,1966-09-21,54,90469.53,70171,28615.0,21720.05,11906.0,10763.0,2018-06-15,28-08-18,2018
89,89,C580AE41,1968-06-01,52,96673.37,68466,8489.36,28592.0,2439.0,30419.0,2018-09-28,17-09-18,2018
92,92,A07D5C92,1990-09-20,30,99577.36,60407,6467.0,20861.0,9861.0,26004.16,2017-11-17,16-01-20,2017


In [18]:
# Print the number of rows with inconsistent investments
print("Number of inconsistent investments: ", len(inconsistent_inv))

Number of inconsistent investments:  8


In [43]:
import datetime as dt

banking["birth_date"] = pd.to_datetime(banking["birth_date"])

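# Shift the recorded ages by one year; they appear to have been recorded
# the year before this notebook was run
banking["Age"] = banking["Age"] + 1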

In [44]:
banking["Age"]

0     59
1     59
2     35
3     36
4     31
      ..
95    47
96    32
97    37
98    52
99    28
Name: Age, Length: 100, dtype: int64

In [45]:
### Sanity Check the Ages

In [46]:
# Store today's date in today
today = dt.date.today()
today

datetime.date(2021, 10, 28)

In [47]:
# Manually calculate customers' ages and store them in ages_manual

ages_manual = today.year - banking["birth_date"].dt.year
ages_manual

0     59
1     59
2     31
3     36
4     31
      ..
95    47
96    32
97    37
98    52
99    28
Name: birth_date, Length: 100, dtype: int64
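Subtracting years alone ignores whether a customer's birthday has already happened in the current year, so it can be off by one. A slightly more careful sketch subtracts one year when the birthday is still to come. The birth dates below are the first three rows shown earlier, and the fixed `as_of` date is the value of `today` printed above; using a fixed date instead of `dt.date.today()` is only to keep the example reproducible.

```python
import datetime as dt
import pandas as pd

birth_date = pd.to_datetime(pd.Series(["1962-06-09", "1962-12-16", "1990-09-12"]))
as_of = dt.date(2021, 10, 28)  # fixed reference date for reproducibility

# Plain difference in calendar years
year_diff = as_of.year - birth_date.dt.year

# True where the birthday has not yet occurred in the as_of year
birthday_pending = (
    (birth_date.dt.month > as_of.month)
    | ((birth_date.dt.month == as_of.month) & (birth_date.dt.day > as_of.day))
)

# Subtract one year for customers whose birthday is still to come
ages_exact = year_diff - birthday_pending.astype(int)
print(ages_exact)
```

For the customer born on 1962-12-16, this gives 58 rather than 59 as of 28 October 2021, since the December birthday has not yet passed.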

In [48]:
# Find all rows of banking where the Age column is equal to ages_manual
age_equ = banking["Age"] == ages_manual
age_equ

0      True
1      True
2     False
3      True
4      True
      ...  
95     True
96     True
97     True
98     True
99     True
Length: 100, dtype: bool

In [49]:
# Filter banking into consistent_ages and inconsistent_ages
consistent_ages = banking[age_equ]
consistent_ages

Unnamed: 0.1,Unnamed: 0,cust_id,birth_date,Age,acct_amount,inv_amount,fund_A,fund_B,fund_C,fund_D,account_opened,last_transaction,acct_year
0,0,870A9281,1962-06-09,59,63523.31,51295,30105.0,4138.0,1420.0,15632.0,2018-02-09,22-02-19,2018
1,1,166B05B0,1962-12-16,59,38175.46,15050,4995.0,938.0,6696.0,2421.0,2019-02-28,31-10-18,2019
3,3,F2158F66,1985-11-03,36,84132.10,23712,3908.0,492.0,6482.0,12830.0,2017-07-11,08-11-18,2017
4,4,7A73F334,1990-05-17,31,120512.00,93230,12158.4,51281.0,13434.0,18383.0,2018-05-14,19-07-18,2018
5,5,472341F2,1980-02-23,41,83127.65,67960,12686.0,19776.0,23707.0,11791.0,2018-12-14,22-04-18,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,CA507BA1,1974-08-10,47,12209.84,7515,190.0,931.0,1451.0,4943.0,2018-05-26,11-09-19,2018
96,96,B99CD662,1989-12-12,32,92838.44,49089,2453.0,7892.0,31486.0,7258.0,2017-04-05,12-03-19,2017
97,97,13770971,1984-11-29,37,92750.87,27962,3352.0,7547.0,8486.0,8577.0,2017-08-16,24-04-19,2017
98,98,93E78DA3,1969-12-14,52,41942.23,29662,1758.0,11174.0,11650.0,5080.0,2017-09-10,15-04-18,2017


In [50]:
inconsistent_ages = banking[~age_equ]
inconsistent_ages

Unnamed: 0.1,Unnamed: 0,cust_id,birth_date,Age,acct_amount,inv_amount,fund_A,fund_B,fund_C,fund_D,account_opened,last_transaction,acct_year
2,2,BFC13E88,1990-09-12,35,59863.77,24567,10323.0,4590.0,8469.0,1185.0,2018-04-25,02-04-18,2018
8,8,E52D4C7F,1975-06-05,50,61795.89,49385,12939.0,7757.0,12569.0,16120.0,2017-05-22,24-10-19,2017
12,12,EEBD980F,1990-11-20,35,57838.49,50812,18314.0,1477.0,29049.48,5539.0,2018-08-12,04-01-20,2018
23,23,A1815565,1968-09-27,57,82996.04,30897,16092.0,5491.0,5098.0,4216.0,2017-07-11,30-09-19,2017
32,32,8D08495A,1961-08-14,64,89138.52,60795,53880.0,1325.0,2105.0,3485.0,2018-08-08,05-02-19,2018
54,54,2F4F99C1,1988-12-19,37,82058.48,35758,6129.0,16840.0,10397.0,2392.0,2018-12-30,11-08-18,2018
61,61,45F31C81,1975-01-12,50,120675300.0,94608,15416.0,18845.0,20325.0,40022.0,2018-05-11,25-12-19,2018
85,85,7539C3B7,1974-05-14,51,1077557.0,91190,32692.0,30405.0,14728.0,13365.0,2017-08-23,07-06-19,2017


In [51]:
# Print the number of rows with inconsistent ages
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])

Number of inconsistent ages:  8


### Missing Values

You just received a new version of the banking DataFrame containing data on the `amount held` and `invested` for `new and existing customers`. However, there are rows with missing `inv_amount` values.

You know for a fact that most customers `below 25` do not have investment accounts yet, and suspect it could be driving the `missingness`.

In [55]:
import missingno as msno

In [None]:
# Print the number of missing values by column in the banking DataFrame.
print(banking.isna().sum())

# Plot and show the missingness matrix of banking with the msno.matrix() function.
msno.matrix(banking)
plt.show()

# Isolate the rows of banking with missing inv_amount values into missing_investors
# and the rows with non-missing inv_amount values into investors.
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]

In [None]:
# Now that you've isolated banking into investors and missing_investors, use the .describe() method on both of these DataFrames.
# print() is needed because a cell only displays the value of its last expression.
print(investors.describe())
print(missing_investors.describe())

In [None]:
# We can see that ---
# The inv_amount is missing only for young customers, 
# since the average age in missing_investors is 22 and the maximum age is 25.

In [None]:
# Sort the banking DataFrame by the age column and plot the missingness matrix of banking_sorted
banking_sorted = banking.sort_values("age")
msno.matrix(banking_sorted)
plt.show()
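The same relationship between age and missingness can also be checked numerically, without a plot, by grouping on whether `inv_amount` is missing. This is a rough sketch on made-up data; the `toy` DataFrame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): the youngest customers lack inv_amount
toy = pd.DataFrame({
    "age": [21, 23, 25, 34, 47, 58],
    "inv_amount": [np.nan, np.nan, np.nan, 2500.0, 7100.0, 5200.0],
})

# Label each row by whether inv_amount is missing
status = toy["inv_amount"].isna().map({True: "missing", False: "present"})

# Summarise age separately for the two groups
age_by_status = toy.groupby(status)["age"].agg(["count", "mean", "max"])
print(age_by_status)
```

If the maximum age in the "missing" group sits well below the ages in the "present" group, that supports the idea that missingness is driven by age rather than occurring at random.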

### Ex 3:

In this exercise, you're working with another version of the banking DataFrame that contains missing values for both the `cust_id` column and the `acct_amount` column.

You want to produce analysis on how many `unique` customers the bank has, the `average amount` held by customers, and more. You know that rows with missing `cust_id` don't really help you, and that on average, `acct_amount` is about `5 times` the amount of `inv_amount`.

In this exercise, you will `drop rows` of banking with missing `cust_ids`, and `impute` missing values of `acct_amount` with some `domain knowledge`.

In [None]:
# Use .dropna() to drop missing values of the cust_id column in banking and store the results in banking_fullid
banking_fullid = banking.dropna(subset = ['cust_id'])

# Use inv_amount to compute the estimated account amounts for banking_fullid by setting the amounts equal to inv_amount * 5, 
# and assign the results to acct_imp.
acct_imp = banking_fullid["inv_amount"] * 5

# Impute the missing values of acct_amount in banking_fullid with the newly created acct_imp using .fillna().
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})

# Print number of missing values
print(banking_imputed.isna().sum())
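To see what this drop-then-impute approach does row by row, here is a minimal sketch on a made-up table. The `toy` DataFrame is hypothetical; the factor of 5 is the domain rule stated in the exercise.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "cust_id": ["A1", None, "C3", "D4"],
    "acct_amount": [1000.0, 500.0, np.nan, np.nan],
    "inv_amount": [150.0, 90.0, 400.0, np.nan],
})

# Drop rows that cannot be tied to a customer
with_id = toy.dropna(subset=["cust_id"])

# Domain rule from the exercise: acct_amount is roughly 5 x inv_amount
estimate = with_id["inv_amount"] * 5

# Fill only the missing acct_amount values; existing values are kept
imputed = with_id.assign(acct_amount=with_id["acct_amount"].fillna(estimate))

print(imputed)
print(imputed.isna().sum())
```

Customer `D4` is missing both `acct_amount` and `inv_amount`, so the estimate is also missing and the gap remains. It is worth checking the final missing-value counts for cases like this rather than assuming the imputation filled everything.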