# Customer Dimension

In this notebook, we will create a separate CSV file containing clean, quality-checked data for the Customer Dimension.

Let's begin by loading the source dataset.

In [16]:
# dependencies

import pandas as pd

## Data Loading

Load the source dataset and preview the first 5 rows.

In [17]:
df = pd.read_csv('../data/Customer-Churn-Records.csv')
df.head(5)

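| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Complain | Satisfaction Score | Card Type | Point Earned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 2 | DIAMOND | 464 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 3 | DIAMOND | 456 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 3 | DIAMOND | 377 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 5 | GOLD | 350 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 5 | GOLD | 425 |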


## Data Preparation

To prepare the source data for ingestion, we will drop the columns we don't need and check for data quality issues.

### Customer Dimension Requirements

For the Customer Dimension, we require the following source columns:

- `CustomerId`: A unique identifier for the customer profile.
- `Surname`: The customer's last name.
- `Age`: The customer's age.
- `Gender`: The customer's gender.
- `EstimatedSalary`: The estimated salary of each customer.

In [18]:
# drop the columns that are not needed
df.drop([
  'RowNumber', 
  'CreditScore',
  'Geography',
  'HasCrCard',
  'Exited',
  'Satisfaction Score',
  'Card Type', 
  'Point Earned',
  'NumOfProducts',
  'Tenure',
  'Balance',
  'IsActiveMember',
  'Complain'
  ],
  axis=1,
  inplace=True)

# confirm columns
df.columns

Index(['CustomerId', 'Surname', 'Gender', 'Age', 'EstimatedSalary'], dtype='object')
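As an alternative to listing every column to drop, we could select the required columns by name. The sketch below illustrates this on a small stand-in frame built from the preview rows above (only a few of the unneeded columns are included for brevity). The names `sample`, `required_columns`, and `selected` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Stand-in for the source data, using values from the preview rows above.
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Gender": ["Female", "Female", "Female"],
    "Age": [42, 41, 42],
    "EstimatedSalary": [101348.88, 112542.58, 113931.57],
})

# Columns the Customer Dimension needs, in source naming.
required_columns = ["CustomerId", "Surname", "Gender", "Age", "EstimatedSalary"]

# Selecting by name raises a KeyError if a required column is missing,
# and is unaffected by extra columns added to the source later.
selected = sample[required_columns].copy()

print(list(selected.columns))
```

Selecting by name keeps the cell short and makes the dimension's inputs explicit, whereas the drop list has to be updated whenever the source gains a new column.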

### Surrogate Key Pipeline

Since our data already has a unique ID for each customer (`CustomerId`), we will use this value as the surrogate key.

We will also rename the columns to follow a naming convention consistent with the other tables and with the conceptual design plan made in **Phase 1** of the project.

In [19]:
# rename columns
df.rename(columns={
  'CustomerId': 'customer_id',
  'Surname': 'surname',
  'Age': 'age',
  'Gender': 'gender',
  'EstimatedSalary': 'estimated_salary'
  },
  inplace=True)
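Using the customer ID as a key relies on it being unique and non-null. A minimal sketch of how we might verify that assumption is shown below, using a stand-in frame built from the preview rows; the variable names are illustrative only.

```python
import pandas as pd

# Stand-in frame using customer IDs and surnames from the preview rows.
customers = pd.DataFrame({
    "customer_id": [15634602, 15647311, 15619304, 15701354, 15737888],
    "surname": ["Hargrave", "Hill", "Onio", "Boni", "Mitchell"],
})

# A key column must be unique and must not contain nulls.
is_unique = customers["customer_id"].is_unique
has_nulls = bool(customers["customer_id"].isnull().any())

# Rows that share a customer_id with another row (empty if the key is valid).
duplicates = customers[customers["customer_id"].duplicated(keep=False)]

print(is_unique, has_nulls, len(duplicates))
```

If `duplicates` is not empty on the full dataset, the ID would not be safe to use as a key without deduplication.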

### Data Quality

To ensure we don't have any missing values, we will check for null values in the required columns.

In [20]:
# confirm no missing values
df.isnull().sum()

customer_id         0
surname             0
gender              0
age                 0
estimated_salary    0
dtype: int64

We can see that we don't have any missing values.
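Note that `isnull()` does not flag empty or whitespace-only strings. As a hedged extra check, we could count blank values in the text columns. The sketch below uses a stand-in frame with values from the preview rows; `text_columns` and `blank_counts` are illustrative names.

```python
import pandas as pd

# Stand-in frame with the two text columns, using values from the preview rows.
customers = pd.DataFrame({
    "surname": ["Hargrave", "Hill", "Onio"],
    "gender": ["Female", "Female", "Female"],
})

text_columns = ["surname", "gender"]

# Count values that are empty once surrounding whitespace is stripped.
blank_counts = {
    col: int(customers[col].str.strip().eq("").sum())
    for col in text_columns
}

print(blank_counts)
```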

Next, let's confirm the data types of the required columns to ensure they are in the expected format and to catch noisy data, such as text in a numeric column.

In [21]:
# confirm data types
df.dtypes

customer_id           int64
surname              object
gender               object
age                   int64
estimated_salary    float64
dtype: object
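The visual check above could also be expressed as an automated check, so that the notebook reports a problem if a column arrives with an unexpected type. This is a sketch under the assumption that the expected types are the ones listed in the output above; it uses the helper functions in `pandas.api.types` on a stand-in frame.

```python
import pandas as pd
from pandas.api import types as ptypes

# Stand-in frame with the renamed columns, using values from the preview rows.
customers = pd.DataFrame({
    "customer_id": [15634602, 15647311],
    "surname": ["Hargrave", "Hill"],
    "gender": ["Female", "Female"],
    "age": [42, 41],
    "estimated_salary": [101348.88, 112542.58],
})

# Expected kind of each column, mirroring the dtypes output above.
expected = {
    "customer_id": ptypes.is_integer_dtype,
    "surname": ptypes.is_string_dtype,
    "gender": ptypes.is_string_dtype,
    "age": ptypes.is_integer_dtype,
    "estimated_salary": ptypes.is_float_dtype,
}

# Collect the columns whose dtype does not match the expectation.
mismatches = [col for col, check in expected.items() if not check(customers[col])]

print(mismatches)
```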

Lastly, let's confirm that the values fall within expected ranges. For example, the `age` column should not have negative values.

In [22]:
# check the range of values
df.describe()

| | customer_id | age | estimated_salary |
|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 |
| mean | 15690940.0 | 38.9218 | 100090.239881 |
| std | 71936.19 | 10.487806 | 57510.492818 |
| min | 15565700.0 | 18.0 | 11.58 |
| 25% | 15628530.0 | 32.0 | 51002.11 |
| 50% | 15690740.0 | 37.0 | 100193.915 |
| 75% | 15753230.0 | 44.0 | 149388.2475 |
| max | 15815690.0 | 92.0 | 199992.48 |
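Reading the summary by eye works for a one-off check. To make the rule explicit, we could filter for rows that break it. The sketch below is illustrative: the bounds `MIN_AGE` and `MAX_AGE` are hypothetical values chosen for the example, not requirements from the project, and the frame is a stand-in built from the preview rows.

```python
import pandas as pd

# Stand-in frame with the numeric columns, using values from the preview rows.
customers = pd.DataFrame({
    "customer_id": [15634602, 15647311, 15619304, 15701354, 15737888],
    "age": [42, 41, 42, 39, 43],
    "estimated_salary": [101348.88, 112542.58, 113931.57, 93826.63, 79084.1],
})

# Hypothetical plausibility bounds for age (an assumption for this example).
MIN_AGE, MAX_AGE = 0, 120

# Rows whose age falls outside the bounds (between() is inclusive by default).
bad_age = customers[~customers["age"].between(MIN_AGE, MAX_AGE)]

# Rows with a negative estimated salary.
bad_salary = customers[customers["estimated_salary"] < 0]

print(len(bad_age), len(bad_salary))
```

Both filtered frames should be empty for clean data; any rows they return can be inspected directly.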


The summary statistics show ages between 18 and 92 and no negative salaries. Now that the quality of our data is verified, we can save the dataset to a CSV file.

## Data Export

Finally, we will export the clean data to a new CSV file.

In [23]:
# save the prepared and verified data to a separate file
df.to_csv('../data/customer_profile_dim.csv', index=False)
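As an optional final check, we could read the exported file back and compare it with the frame we saved. The sketch below writes to a temporary directory rather than `../data/` so it can run anywhere; the stand-in frame uses values from the preview rows, and the variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the prepared data, using values from the preview rows.
customers = pd.DataFrame({
    "customer_id": [15634602, 15647311, 15619304],
    "surname": ["Hargrave", "Hill", "Onio"],
    "gender": ["Female", "Female", "Female"],
    "age": [42, 41, 42],
    "estimated_salary": [101348.88, 112542.58, 113931.57],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "customer_profile_dim.csv")
    # index=False keeps the pandas row index out of the file.
    customers.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if columns, dtypes, or values differ
# (floats are compared with a small tolerance by default).
pd.testing.assert_frame_equal(customers, reloaded)
```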