# Credit Profile Dimension

In this notebook, we will create a separate CSV file containing clean, quality-checked data for the Credit Profile Dimension.

Let's begin by importing the dependencies.

In [12]:
# dependencies

import pandas as pd

## Data Loading

Load source dataset.

In [13]:
df = pd.read_csv('../data/Customer-Churn-Records.csv')
df.head(5)

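| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Complain | Satisfaction Score | Card Type | Point Earned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 2 | DIAMOND | 464 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 3 | DIAMOND | 456 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 3 | DIAMOND | 377 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 5 | GOLD | 350 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 5 | GOLD | 425 |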


## Data Preparation

To prepare the source data for ingestion, we will drop the columns we don't need and check for data quality issues.

### Credit Profile Dimension Requirements

For the Credit Profile Dimension, we will require the following columns:

- `Credit Profile Key`: A unique identifier for the credit profile.
- `Credit Card Ownership`: A boolean value to indicate whether the customer has a credit card.
- `Credit Score`: The customer's credit score number.
- `Credit Card Type`: The type of credit card the customer has.

Let's extract the relevant columns and rename them to follow the naming convention used by the other tables and by the conceptual design plan created in Phase 1 of this Project.

In [14]:
# Rename the columns to follow the shared naming convention
df.rename(
    columns={
        "HasCrCard": "credit_card_ownership",
        "CreditScore": "credit_score",
        "Card Type": "credit_card_type",
    },
    inplace=True,
)

# Extract the relevant columns for credit profile attributes
credit_profile_df = df[["credit_card_ownership", "credit_score", "credit_card_type"]]

### Surrogate Key Pipeline

Let's create a surrogate key for the Credit Profile Dimension, to uniquely identify each record in the dimension.

In [15]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
credit_profile_df = credit_profile_df.copy()

# Generate unique credit profile keys
credit_profile_df["credit_profile_key"] = range(1, len(credit_profile_df) + 1)

# Reorder columns with "credit_profile_key" as the first column
credit_profile_df = credit_profile_df[["credit_profile_key", "credit_card_ownership", "credit_score", "credit_card_type"]]

credit_profile_df.head(5)

| | credit_profile_key | credit_card_ownership | credit_score | credit_card_type |
|---|---|---|---|---|
| 0 | 1 | 1 | 619 | DIAMOND |
| 1 | 2 | 0 | 608 | DIAMOND |
| 2 | 3 | 1 | 502 | DIAMOND |
| 3 | 4 | 0 | 699 | GOLD |
| 4 | 5 | 1 | 850 | GOLD |
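Since the surrogate key must uniquely identify each record, it can be worth checking that property explicitly. The following is a minimal sketch, not part of the original pipeline: `sample_df` is a hypothetical five-row frame mirroring the rows shown above, and `DataFrame.insert` is used here as one possible way to add the key as the first column in a single step.

```python
import pandas as pd

# Hypothetical sample mirroring the first five rows shown above (not the full dataset)
sample_df = pd.DataFrame({
    "credit_card_ownership": [1, 0, 1, 0, 1],
    "credit_score": [619, 608, 502, 699, 850],
    "credit_card_type": ["DIAMOND", "DIAMOND", "DIAMOND", "GOLD", "GOLD"],
})

# Insert the surrogate key directly at position 0, so no reordering is needed
expected_keys = list(range(1, len(sample_df) + 1))
sample_df.insert(0, "credit_profile_key", expected_keys)

# The key should be unique and run from 1 to the number of rows without gaps
key_is_unique = sample_df["credit_profile_key"].is_unique
key_is_contiguous = sample_df["credit_profile_key"].tolist() == expected_keys

print(key_is_unique, key_is_contiguous)
```

Both checks should hold for a key generated from `range`; they become more useful if the key is later produced by a merge or an incremental load.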


### Data Quality

To ensure we don't have any missing values, we will check for null values in the required columns.

In [16]:
# Ensure there are no missing values
credit_profile_df.isnull().sum()

credit_profile_key       0
credit_card_ownership    0
credit_score             0
credit_card_type         0
dtype: int64

We can see that we don't have any missing values.

Let's confirm the data types of the required columns, so that values stored in an unexpected format (for example, numbers read as text) are caught early.

In [17]:
# Ensure correct data types
credit_profile_df.dtypes

credit_profile_key        int64
credit_card_ownership     int64
credit_score              int64
credit_card_type         object
dtype: object
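The output above shows `credit_card_ownership` stored as `int64`, while the requirements describe it as a boolean. If the downstream table expects explicit types, one option is to cast the columns. This is only a sketch under that assumption: `sample_df` is a hypothetical frame mirroring the first rows of the dimension, and whether the target schema prefers `bool` and `category` over `0`/`1` integers and plain strings is not stated in this notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the first five rows of the dimension table
sample_df = pd.DataFrame({
    "credit_profile_key": [1, 2, 3, 4, 5],
    "credit_card_ownership": [1, 0, 1, 0, 1],
    "credit_score": [619, 608, 502, 699, 850],
    "credit_card_type": ["DIAMOND", "DIAMOND", "DIAMOND", "GOLD", "GOLD"],
})

# Cast the 0/1 flag to bool and the card type to a categorical column
# (assumed target types, chosen for illustration)
typed_df = sample_df.astype({
    "credit_card_ownership": "bool",
    "credit_card_type": "category",
})

print(typed_df.dtypes)
```

Note that a boolean column is written to CSV as `True`/`False` rather than `1`/`0`, so the cast should match whatever the ingestion step expects.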

Lastly, let's confirm that the values fall within sensible ranges. For example, the `Credit Score` column should not contain negative values, or any value below 300.

In [18]:
# check the range of values
credit_profile_df.describe()

| | credit_profile_key | credit_card_ownership | credit_score |
|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 0.7055 | 650.5288 |
| std | 2886.89568 | 0.45584 | 96.653299 |
| min | 1.0 | 0.0 | 350.0 |
| 25% | 2500.75 | 0.0 | 584.0 |
| 50% | 5000.5 | 1.0 | 652.0 |
| 75% | 7500.25 | 1.0 | 718.0 |
| max | 10000.0 | 1.0 | 850.0 |
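Reading the summary statistics by eye works for a one-off check, but the same range rules can also be expressed as explicit checks. The sketch below is illustrative only: `sample_df` is a hypothetical frame mirroring the first rows of the dimension, the lower bound of 300 comes from the text above, and the upper bound of 850 is an assumption based on the maximum observed in the summary.

```python
import pandas as pd

# Lower bound taken from the text; upper bound is an assumption (observed maximum)
MIN_SCORE, MAX_SCORE = 300, 850

# Hypothetical sample mirroring the first five rows of the dimension table
sample_df = pd.DataFrame({
    "credit_card_ownership": [1, 0, 1, 0, 1],
    "credit_score": [619, 608, 502, 699, 850],
})

# between() is inclusive on both ends by default
scores_in_range = bool(sample_df["credit_score"].between(MIN_SCORE, MAX_SCORE).all())

# The ownership flag should only ever be 0 or 1
ownership_is_binary = bool(sample_df["credit_card_ownership"].isin([0, 1]).all())

print(scores_in_range, ownership_is_binary)
```

Turning these booleans into `assert` statements would make the notebook fail loudly if a future version of the source file violates either rule.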


Now that the quality of our data is verified, the last step is to save the Credit Profile dimension table to a CSV file.

In [19]:

# Save dimension table to a CSV file
credit_profile_df.to_csv('../data/credit_profile_dim.csv', index=False)
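After saving, a quick round trip can confirm that the file reads back with the same columns and values. This is a minimal sketch rather than part of the original notebook: it writes a hypothetical `sample_df` to a temporary directory instead of `../data/`, and the file name `credit_profile_dim_sample.csv` is made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample mirroring the first five rows of the dimension table
sample_df = pd.DataFrame({
    "credit_profile_key": [1, 2, 3, 4, 5],
    "credit_card_ownership": [1, 0, 1, 0, 1],
    "credit_score": [619, 608, 502, 699, 850],
    "credit_card_type": ["DIAMOND", "DIAMOND", "DIAMOND", "GOLD", "GOLD"],
})

# Write to a temporary location so the sketch does not touch the real data folder
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "credit_profile_dim_sample.csv")
    sample_df.to_csv(path, index=False)
    reloaded_df = pd.read_csv(path)

# Compare column order and values after the round trip
columns_match = list(reloaded_df.columns) == list(sample_df.columns)
values_match = reloaded_df.to_dict("list") == sample_df.to_dict("list")

print(columns_match, values_match)
```

Passing `index=False` when writing, as the cell above does, is what keeps an extra unnamed index column from appearing when the file is read back.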