# Banking Profile Dimension

In this notebook, we will create a separate CSV file containing clean, quality-checked data for the Banking Profile Dimension.

Let's begin by importing the dependencies.

In [15]:
# dependencies

import pandas as pd

## Data Loading

Load the source dataset.

In [16]:
df = pd.read_csv('../data/Customer-Churn-Records.csv')
df.head(5)

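| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Complain | Satisfaction Score | Card Type | Point Earned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 2 | DIAMOND | 464 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 3 | DIAMOND | 456 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 3 | DIAMOND | 377 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 5 | GOLD | 350 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 5 | GOLD | 425 |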


## Data Preparation

To prepare the source data for ingestion, we will drop the columns that are not required and check for data quality issues.

### Banking Profile Dimension Requirements

For the Banking Profile Dimension, we will require the following columns:

- `Bank Profile Key`: A unique identifier for the bank profile.
- `Tenure`: The number of years the customer has been with the bank.
- `Active Member`: Whether the customer is an active member or not.
- `Products Number`: The number of products the customer has with the bank.
- `Balance`: The amount of money the customer has in their bank account.
- `Complain`: Whether the customer has complained or not.

In [17]:
# drop the columns that are not needed
df.drop([
  'RowNumber', 
  'CustomerId', 
  'Surname',
  'CreditScore',
  'Geography',
  'Gender',
  'Age',
  'HasCrCard',
  'EstimatedSalary',
  'Exited',
  'Satisfaction Score',
  'Card Type', 
  'Point Earned'
  ],
  axis=1,
  inplace=True)

# confirm columns
df.columns

Index(['Tenure', 'Balance', 'NumOfProducts', 'IsActiveMember', 'Complain'], dtype='object')
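
As an alternative to listing every column to drop, the required columns can be selected explicitly, which fails loudly if the source file ever loses one of them. The following is a minimal sketch on a small hand-built frame that reuses two of the sample rows above. The names `REQUIRED_COLUMNS` and `select_required` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Columns needed for the dimension (source names, before renaming).
REQUIRED_COLUMNS = ['Tenure', 'Balance', 'NumOfProducts', 'IsActiveMember', 'Complain']


def select_required(frame, required):
    """Return a copy with only the required columns; raise if any is missing."""
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise KeyError(f"missing required columns: {missing}")
    return frame[required].copy()


# Small stand-in for the source dataset (assumption: two rows copied from the preview).
sample = pd.DataFrame({
    'CustomerId': [15634602, 15647311],
    'Surname': ['Hargrave', 'Hill'],
    'Tenure': [2, 1],
    'Balance': [0.0, 83807.86],
    'NumOfProducts': [1, 1],
    'IsActiveMember': [1, 1],
    'Complain': [1, 1],
})

selected = select_required(sample, REQUIRED_COLUMNS)
print(selected.columns.tolist())
```

Selecting by name also keeps the result stable if new, unrelated columns are added to the source file later.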

### Surrogate Key Pipeline

We will create a surrogate key for the Banking Profile Dimension, to uniquely identify each record in the dimension.

In [18]:
# adding a surrogate key column "bank_profile_key"
df['bank_profile_key'] = range(1, 1+len(df))
df.head(5)

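| | Tenure | Balance | NumOfProducts | IsActiveMember | Complain | bank_profile_key |
|---|---|---|---|---|---|---|
| 0 | 2 | 0.0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 83807.86 | 1 | 1 | 1 | 2 |
| 2 | 8 | 159660.8 | 3 | 0 | 1 | 3 |
| 3 | 1 | 0.0 | 2 | 0 | 0 | 4 |
| 4 | 2 | 125510.82 | 1 | 1 | 0 | 5 |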
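
A surrogate key is only useful if it is unique and gap-free, so it can be worth checking those two properties right after generating it. Below is a minimal sketch on a three-row stand-in frame (an assumption for illustration). The variable names `key_is_unique` and `key_is_contiguous` are hypothetical.

```python
import pandas as pd

# Small stand-in frame (assumption: values copied from the preview above).
sample = pd.DataFrame({
    'Tenure': [2, 1, 8],
    'Balance': [0.0, 83807.86, 159660.8],
})

# Same key generation as in the notebook: 1, 2, ..., number of rows.
sample['bank_profile_key'] = range(1, 1 + len(sample))

key = sample['bank_profile_key']

# The key must not repeat and must run from 1 to len(sample) without gaps.
key_is_unique = key.is_unique
key_is_contiguous = key.tolist() == list(range(1, len(sample) + 1))

print(key_is_unique, key_is_contiguous)
```

Note that a key generated this way depends on row order, so re-sorting or filtering the source before this step would assign different keys to the same records.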


We will also rename and re-order the columns so that they follow the naming convention used in the other tables and match the conceptual design plan made in **Phase 1** of the project.

In [19]:
# rename columns
df.rename(columns={
  'Tenure': 'tenure',
  'Balance': 'balance',
  'NumOfProducts': 'products_number',
  'IsActiveMember': 'active_member',
  'Complain': 'complain'
  },
  inplace=True)

# change the order of the columns
df = df[[ 'bank_profile_key', 'tenure', 'active_member', 'products_number', 'balance', 'complain' ]]

### Data Quality

To ensure we don't have any missing values, we will check for null values in the required columns.

In [20]:
# confirm no missing values
df.isnull().sum()

bank_profile_key    0
tenure              0
active_member       0
products_number     0
balance             0
complain            0
dtype: int64

The output shows that none of the columns contain missing values.
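
The same check can be turned into a list of offending columns, which is easier to act on than reading the counts by eye. The sketch below uses a stand-in frame with one missing `balance` value injected purely for illustration (the real dataset has none). The name `columns_with_nulls` is an assumption.

```python
import pandas as pd

# Stand-in frame with one deliberately missing balance (illustration only).
sample = pd.DataFrame({
    'tenure': [2, 1, 8],
    'balance': [0.0, None, 159660.8],
})

# Count nulls per column, then keep only the columns that have at least one.
null_counts = sample.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

print(columns_with_nulls)
```

On the prepared dataset this list would be empty, matching the all-zero counts above.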

Next, let's confirm the data types of the required columns to ensure they are in the expected format and free of noisy data, such as numbers stored as strings.

In [21]:
# confirm data types
df.dtypes

bank_profile_key      int64
tenure                int64
active_member         int64
products_number       int64
balance             float64
complain              int64
dtype: object
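
The dtype inspection can also be expressed as a programmatic check. The sketch below assumes that every column except `balance` should be an integer type and that `balance` should be a float, as in the output above. The mapping `expected_kind` is a hypothetical name, and the frame is a three-row stand-in.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Stand-in for the prepared frame (assumption: values copied from the preview).
sample = pd.DataFrame({
    'bank_profile_key': [1, 2, 3],
    'tenure': [2, 1, 8],
    'active_member': [1, 1, 0],
    'products_number': [1, 1, 3],
    'balance': [0.0, 83807.86, 159660.8],
    'complain': [1, 1, 1],
})

# Expected kind of each column, expressed as a pandas dtype predicate.
expected_kind = {
    'bank_profile_key': is_integer_dtype,
    'tenure': is_integer_dtype,
    'active_member': is_integer_dtype,
    'products_number': is_integer_dtype,
    'balance': is_float_dtype,
    'complain': is_integer_dtype,
}

# Collect the columns whose dtype does not match the expectation.
mismatched = [col for col, check in expected_kind.items() if not check(sample[col])]

print(mismatched)
```

A column read as `object` (for example, because of stray text in a numeric field) would show up in `mismatched`.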

Lastly, let's confirm that the values fall within sensible ranges. For example, the `tenure` column should not have negative values.

In [22]:
# check the range of values
df.describe()

| | bank_profile_key | tenure | active_member | products_number | balance | complain |
|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 5.0128 | 0.5151 | 1.5302 | 76485.889288 | 0.2044 |
| std | 2886.89568 | 2.892174 | 0.499797 | 0.581654 | 62397.405202 | 0.403283 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 2500.75 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 50% | 5000.5 | 5.0 | 1.0 | 1.0 | 97198.54 | 0.0 |
| 75% | 7500.25 | 7.0 | 1.0 | 2.0 | 127644.24 | 0.0 |
| max | 10000.0 | 10.0 | 1.0 | 4.0 | 250898.09 | 1.0 |
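
The range checks can be made explicit rather than read off the summary table. In the sketch below, only the non-negative `tenure` rule comes from the text above. The other rules (non-negative balance, at least one product, binary flags) are assumptions about what a valid record looks like, and the `checks` dictionary is a hypothetical helper.

```python
import pandas as pd

# Stand-in for the prepared frame (assumption: values copied from the preview).
sample = pd.DataFrame({
    'bank_profile_key': [1, 2, 3],
    'tenure': [2, 1, 8],
    'active_member': [1, 1, 0],
    'products_number': [1, 1, 3],
    'balance': [0.0, 83807.86, 159660.8],
    'complain': [1, 1, 1],
})

# Each entry maps a rule name to whether every row satisfies it.
checks = {
    'tenure_non_negative': (sample['tenure'] >= 0).all(),
    'balance_non_negative': (sample['balance'] >= 0).all(),           # assumed rule
    'products_at_least_one': (sample['products_number'] >= 1).all(),  # assumed rule
    'active_member_is_binary': sample['active_member'].isin([0, 1]).all(),
    'complain_is_binary': sample['complain'].isin([0, 1]).all(),
}

# Names of the rules that at least one row violates.
failed = [name for name, ok in checks.items() if not ok]

print(failed)
```

An empty `failed` list means every rule holds for every row.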


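The summary shows that `tenure` ranges from 0 to 10, `products_number` from 1 to 4, `balance` is never negative, and both `active_member` and `complain` stay within 0 and 1. Now that the quality of our data is verified, the only remaining step is to save the banking profile dataset.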

## Data Export

Finally, we will export the clean data to a new CSV file.

In [23]:
# save the prepared and verified data to a separate file
df.to_csv('../data/banking_profile_dim.csv', index=False)
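
After exporting, it can be reassuring to read the file back and confirm that it matches the frame in memory. The sketch below writes a stand-in frame to a temporary directory (an assumption, to avoid touching `../data`) and compares the reloaded result with `pd.testing.assert_frame_equal`, which by default tolerates tiny floating-point differences.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the prepared frame (assumption: values copied from the preview).
sample = pd.DataFrame({
    'bank_profile_key': [1, 2, 3],
    'tenure': [2, 1, 8],
    'active_member': [1, 1, 0],
    'products_number': [1, 1, 3],
    'balance': [0.0, 83807.86, 159660.8],
    'complain': [1, 1, 1],
})

# Write to a temporary location and read the file back while it still exists.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'banking_profile_dim.csv')
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if columns, dtypes, or values differ.
pd.testing.assert_frame_equal(reloaded, sample)
print(reloaded.shape)
```

Because the export uses `index=False`, the reloaded frame has no extra index column, and its column order is the one set earlier in the notebook.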