# Location Dimension

In this notebook, we will create a separate CSV file containing clean, quality-checked data for the Location Dimension.

Let's begin by loading the source dataset.

In [17]:
# dependencies

import pandas as pd
# import numpy as np

In [18]:
df = pd.read_csv('../data/Customer-Churn-Records.csv')
df.head(5)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425


## Data Preparation

To prepare the source data, we will drop the columns we don't need and check data quality.

### Location Dimension Requirements

For the Location Dimension, we will require the following columns:

- `location_key`: A surrogate key that uniquely identifies each location.
- `country`: The country where the customer resides (the `Geography` column in the source data).

In [19]:
# Extract the required column (selecting a single column yields a Series)
df = df['Geography']

In [20]:
# Get the unique countries from the dataset
unique_countries = df.unique()

### Surrogate Key Pipeline

We will create a surrogate key for the Location Dimension to uniquely identify each record. The keys are sequential integers starting at 1, assigned in the order the countries first appear in the source data.

In [21]:

# Create a DataFrame for the Location dimension table; the source "Geography" values are stored under "country" for consistency
loc_df = pd.DataFrame({"location_key": range(1, len(unique_countries) + 1), "country": unique_countries})
loc_df.head(5)

Unnamed: 0,location_key,country
0,1,France
1,2,Spain
2,3,Germany
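
As a rough sketch of how this surrogate key could be used downstream, the snippet below maps each country in a small sample back to its `location_key`. The sample values and the names `sample`, `lookup`, and `sample_keys` are illustrative assumptions, not part of this notebook's pipeline.

```python
import pandas as pd

# Hypothetical sample standing in for the source "Geography" column
sample = pd.Series(["France", "Spain", "France", "Germany"], name="Geography")

# Build the dimension the same way as above: one key per unique country,
# numbered in order of first appearance
unique_countries = sample.unique()
loc_df = pd.DataFrame({
    "location_key": range(1, len(unique_countries) + 1),
    "country": unique_countries,
})

# Look up the surrogate key for each source row by country name
lookup = loc_df.set_index("country")["location_key"]
sample_keys = sample.map(lookup)
print(sample_keys.tolist())
```

Because the keys depend on the order in which countries first appear, rebuilding the dimension from a differently ordered source could assign different keys to the same country.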


### Data Quality

To ensure we don't have any missing values, we will check for null values in both columns of the dimension table.

In [22]:
# confirm no missing values
loc_df.isnull().sum()

location_key    0
country         0
dtype: int64

We can see that we don't have any missing values.
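
The same check could also be written as assertions, along with a uniqueness check on the key, so that the notebook fails loudly if the data changes. This is a minimal sketch; the `loc_df` below is a hand-built stand-in for the dimension table, and `checks_passed` is a hypothetical flag added for illustration.

```python
import pandas as pd

# Hand-built stand-in for the dimension table created above
loc_df = pd.DataFrame({
    "location_key": [1, 2, 3],
    "country": ["France", "Spain", "Germany"],
})

# No missing values in any column
assert loc_df.notna().all().all(), "dimension table contains missing values"

# Every surrogate key identifies exactly one row
assert loc_df["location_key"].is_unique, "duplicate surrogate keys found"

# Each country appears only once in the dimension
assert loc_df["country"].is_unique, "duplicate countries found"

checks_passed = True
```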

Let's also confirm the data types of both columns: `location_key` should be an integer, and `country` should hold strings (reported here as `object`).

In [23]:
# confirm data types
loc_df.dtypes

location_key     int64
country         object
dtype: object

## Data Export

Finally, we will export the clean data to a new CSV file.

In [24]:
# Save dimension table to a CSV file
loc_df.to_csv("../data/location_dim.csv", index=False)
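
One way to sanity-check the export is to read the file back and compare it with the table in memory. The sketch below assumes a hand-built `loc_df` and writes to a temporary directory rather than `../data`, so it can run anywhere; `reloaded` and `round_trip_ok` are hypothetical names.

```python
import os
import tempfile

import pandas as pd

# Hand-built stand-in for the dimension table created above
loc_df = pd.DataFrame({
    "location_key": [1, 2, 3],
    "country": ["France", "Spain", "Germany"],
})

# Write to a temporary location, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "location_dim.csv")
    loc_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False, the file holds only the two dimension columns
round_trip_ok = (
    list(reloaded.columns) == ["location_key", "country"]
    and reloaded.to_dict("list") == loc_df.to_dict("list")
)
print(round_trip_ok)
```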