# Quality check for all tables

### Purpose:

- Getting insights on how to prepare the data to be inserted into the Silver layer
- Running quality checks, and doing cleaning and transformation where needed

## **Transformation Process (Mostly done in Silver layer)**

### **Main Components**
- **Data Enrichment**
- **Data Integration** (gold)
- **Derived Columns**
- **Data Normalization & Standardization**
- **Business Rules & Logic** (gold)
- **Data Aggregations** (gold)
- **Data Cleansing**

---

### **Data Cleansing Breakdown**
- **Remove Duplicates**
- **Data Filtering**
- **Handling Missing Data**
- **Handling Invalid Values**
- **Handling Unwanted Spaces**
- **Outlier Detection**
- **Data Type Casting**


| Table Name                 | Problem                                                      | Solution                                                                 | Notes                                                                                                           | Columns Targeted                   |
|----------------------------|--------------------------------------------------------------|--------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|------------------------------------|
| crm_cust_info              | Primary key (`cst_id`) duplicates                           | Pulling the latest record for each customer that likely contains the most up-to-date information | Window function (`ROW_NUMBER()`) and `PARTITION BY` are used to rank and flag the latest record.              | cst_id                             |
|                            | Unwanted leading and trailing spaces                        | `TRIM()` is used                                                        |                                                                                                               | cst_firstname, cst_lastname, cst_gndr |
|                            | Missing values and standardization                         | `'CASE WHEN'` statements are used to map values to a friendly format and impute null values | `TRIM()` and `UPPER()` string functions are used for data consistency.                                        | cst_gndr, cst_marital_status       |
| crm_prd_info               | No columns to be used for joining with other tables        | Derived columns, product key is divided into two derived columns         | - `SUBSTRING()`, `REPLACE()` are used.  <br> - The items that have not sold yet (no records in sales table) are investigated using CTE (Common Table Expression). <br> - The root cause of discrepancy between category of items in product info and category table is detected. | prd_key                            |
|                            | Negative and NULL values for product cost                  | NULL values are replaced by `0`                                          | `ISNULL()` is used.                                                                                            | prd_cost                           |
|                            | Missing values and standardization in product line        | `'CASE WHEN'` statements are used to map values to a more descriptive way and impute null values | `TRIM()` and `UPPER()` string functions are also used for data consistency.                                   | prd_line                           |
|                            | Start date of an item's price is later than its end date  | Using this logic (`End Date = Start Date of the NEXT Record -1`) to ensure no overlap and maintain **history** of the price per item | `LEAD()` window function is used to access the value of the next record. <br> `DATEADD()` is used to shift the end date one day backward to ensure no overlap. | prd_start_dt, prd_end_dt           |
| crm_sales_details          | An invalid date inserted by mistake (out of business boundaries) | Replaced by `NULL`                                                    | `'CASE WHEN'` statement and arbitrary placeholders for boundaries are used to convert the invalid date into `NULL`. | sls_order_dt                       |
|                            | Nonpositive and NULL values in sales price and sales amount | Business logic is utilized to derive the wrong values from the other two values in several possible scenarios (i.e., `sales_amount = quantity * price`) | `'CASE WHEN'`, `ABS()`, and `NULLIF()` are used.                                                             | sls_sales, sls_quantity, sls_price |
| bronze.erp_cust_az12       | Non-consistent ID patterns and non-compatible with customer info table | `"NAS"` is removed from records                                      | - Pattern-matching operator (`LIKE`) is used.  <br> - `'CASE WHEN'` and string function (`SUBSTRING()`) are used to filter and fix unmatched IDs. <br> - CTE is also used. | cid                                |
|                            | Unrealistic birthdate (in the future)                      | Wherever the birthdate is larger than today's date, replaced with `NULL` | `'CASE WHEN'` and `GETDATE()` date function are used.                                                          | bdate                              |
|                            | Missing values and standardization in gender              | `'CASE WHEN'` statements are used to map values to a more descriptive way and replace null values with `NULL` | `TRIM()` and `UPPER()` string functions are also used for data consistency.                                   | gen                                |
| bronze.erp_loc_a101        | Non-compatible with customer info table (`cst_key`)       | Removed `'-'` to be compatible                                        | `REPLACE()` string function is used.                                                                           | cid                                |
|                            | Data Standardization & Consistency issues in country column | Mapped the values to their actual countries and replaced `NULLs` with `'Unknown'` | `'CASE WHEN'` and `TRIM()` string function are used.                                                         | cntry                              |
| bronze.erp_px_cat_g1v2     | Just one ID is not compatible with derived product category in the product info table | Replaced `'CO_PD'` with `'CO_PE'` to match the product info table and avoid join issues | `REPLACE()` string function is used.                                                                          | id                                 |


## Table: crm\_cust\_info

| Problem                                   | Solution                                                                                 | Notes                                                                                   | Columns Targeted                    |
|-------------------------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|--------------------------------------|
| Primary key (cst_id) duplicates          | Pulling the latest record for each customer, which probably contains the most up-to-date information | Window function (`ROW_NUMBER()`) and `PARTITION BY` are used to rank and then flag the latest record | cst_id                               |
| Unwanted leading and trailing spaces     | `TRIM()`                                                                                 |                                                                                         | cst_firstname, cst_lastname, cst_gndr |
| Missing values and standardization       | `CASE WHEN` statements are used to map the values to a friendlier format and also impute NULL values | `TRIM()` and `UPPER()` string functions are also used for data consistency                | cst_gndr, cst_marital_status         |


In [28]:
-- Preview
SELECT TOP 10 * FROM bronze.crm_cust_info;

cst_id,cst_key,cst_firstname,cst_lastname,cst_marital_status,cst_gndr,cst_create_date
11000,AW00011000,Jon,Yang,M,M,2025-10-06
11001,AW00011001,Eugene,Huang,S,M,2025-10-06
11002,AW00011002,Ruben,Torres,M,M,2025-10-06
11003,AW00011003,Christy,Zhu,S,F,2025-10-06
11004,AW00011004,Elizabeth,Johnson,S,F,2025-10-06
11005,AW00011005,Julio,Ruiz,S,M,2025-10-06
11006,AW00011006,Janet,Alvarez,S,F,2025-10-06
11007,AW00011007,Marco,Mehta,M,M,2025-10-06
11008,AW00011008,Rob,Verhoff,S,F,2025-10-06
11009,AW00011009,Shannon,Carlson,S,M,2025-10-06


In [5]:
-- Check For Nulls or Duplicates in Primary Key
SELECT 
    cst_id, 
    COUNT(*) AS cnt
FROM bronze.crm_cust_info 
GROUP BY cst_id 
HAVING COUNT(*) > 1 
    OR cst_id IS NULL;

cst_id,cnt
29449.0,2
29473.0,2
29433.0,2
,4
29483.0,2
29466.0,3


In [29]:
-- Let's check a couple of them
SELECT  
    *
FROM bronze.crm_cust_info 
WHERE cst_id = 29466

-- It seems that the customer info in this case was updated after initial creation

cst_id,cst_key,cst_firstname,cst_lastname,cst_marital_status,cst_gndr,cst_create_date
29466,AW00029466,,,,,2026-01-25
29466,AW00029466,Lance,Jimenez,M,,2026-01-26
29466,AW00029466,Lance,Jimenez,M,M,2026-01-27


In [25]:
-- Let's check null records
SELECT  
    *
FROM bronze.crm_cust_info 
WHERE cst_id IS NULL

-- Not sure why these rows have a cst_key but no cst_id.
-- The NULL values might have happened because cst_key did not follow the ("AW000" + cst_id) pattern. More investigation is required.
-- However, I would rather focus on cst_id, as it is the attribute present in the sales table and allows us to join on this column.

cst_id,cst_key,cst_firstname,cst_lastname,cst_marital_status,cst_gndr,cst_create_date
,SF566,,,,,
,PO25,,,,,
,13451235,,,,,
,A01Ass,,,,,


In [30]:
-- For each cst_id I need a way to make sure I pull only the latest record, as it probably contains the most up-to-date information.
-- I will use the ROW_NUMBER() window function to rank the records by cst_create_date
SELECT 
    *,
    ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS flag_last
FROM bronze.crm_cust_info
WHERE cst_id = 29466;

cst_id,cst_key,cst_firstname,cst_lastname,cst_marital_status,cst_gndr,cst_create_date,flag_last
29466,AW00029466,Lance,Jimenez,M,M,2026-01-27,1
29466,AW00029466,Lance,Jimenez,M,,2026-01-26,2
29466,AW00029466,,,,,2026-01-25,3


In [32]:
-- Let's identify the duplicates
SELECT *
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS flag_last
    FROM bronze.crm_cust_info
) AS t
WHERE flag_last != 1;

cst_id,cst_key,cst_firstname,cst_lastname,cst_marital_status,cst_gndr,cst_create_date,flag_last
,SF566,,,,,,2
,13451235,,,,,,3
,A01Ass,,,,,,4
29433.0,AW00029433,,,M,M,2026-01-25,2
29449.0,AW00029449,,Chen,S,,2026-01-25,2
29466.0,AW00029466,Lance,Jimenez,M,,2026-01-26,2
29466.0,AW00029466,,,,,2026-01-25,3
29473.0,AW00029473,Carmen,,,,2026-01-25,2
29483.0,AW00029483,,Navarro,,,2026-01-25,2


In [33]:
-- Let's check one of the records as an example
-- We can see just the latest record is kept
SELECT *
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS flag_last
    FROM bronze.crm_cust_info
) AS t
WHERE flag_last = 1 AND cst_id = 29466;

cst_id,cst_key,cst_firstname,cst_lastname,cst_marital_status,cst_gndr,cst_create_date,flag_last
29466,AW00029466,Lance,Jimenez,M,M,2026-01-27,1


In [35]:
-- Now I check for unwanted leading and trailing spaces in the text (nvarchar) columns (firstname, lastname, gender)
-- Gender seems to be fine, but I will use TRIM() for firstname and lastname
SELECT cst_gndr
FROM bronze.crm_cust_info
WHERE cst_gndr != TRIM(cst_gndr);

cst_firstname


In [38]:
SELECT TOP 5
    TRIM(cst_lastname),
    TRIM(cst_firstname)
FROM bronze.crm_cust_info


(No column name),(No column name).1
Yang,Jon
Huang,Eugene
Torres,Ruben
Zhu,Christy
Johnson,Elizabeth


In [42]:
-- Check the consistency of values in low cardinality columns (gender, marital status)

-- Data Standardization and Consistency
-- Here we can swap the abbreviations for more user-friendly names
-- And replace NULL values with "Unknown"

SELECT DISTINCT cst_gndr
FROM bronze.crm_cust_info;

SELECT DISTINCT cst_marital_status
FROM bronze.crm_cust_info;

cst_gndr
""
F
M


cst_marital_status
S
""
M


In [39]:
-- I will also use TRIM() and UPPER() for consistency, so that differently formatted values are still caught in the future
SELECT TOP 10
    TRIM(cst_firstname) AS cst_firstname,
    TRIM(cst_lastname) AS cst_lastname,
    CASE 
        WHEN UPPER(TRIM([cst_marital_status])) = 'S' THEN 'Single'
        WHEN UPPER(TRIM([cst_marital_status])) = 'M' THEN 'Married'
        ELSE 'Unknown'
    END AS cst_marital_status,
    CASE 
        WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
        WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
        ELSE 'Unknown'
    END AS cst_gndr
FROM bronze.crm_cust_info;

cst_firstname,cst_lastname,cst_marital_status,cst_gndr
Jon,Yang,Married,Male
Eugene,Huang,Single,Male
Ruben,Torres,Married,Male
Christy,Zhu,Single,Female
Elizabeth,Johnson,Single,Female
Julio,Ruiz,Single,Male
Janet,Alvarez,Single,Female
Marco,Mehta,Married,Male
Rob,Verhoff,Single,Female
Shannon,Carlson,Single,Male

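The three fixes for this table can be combined into a single query. The following is a minimal sketch, not the final load script: `@cust` is a hypothetical table variable standing in for `bronze.crm_cust_info`, seeded with a few of the rows shown above. Two assumptions are made here: blank cells in the outputs are treated as `NULL`, and rows without a `cst_id` are excluded, although that decision is still marked for further investigation.

```sql
-- Hypothetical stand-in for bronze.crm_cust_info, seeded with rows shown above
DECLARE @cust TABLE (
    cst_id             INT,
    cst_key            NVARCHAR(50),
    cst_firstname      NVARCHAR(50),
    cst_lastname       NVARCHAR(50),
    cst_marital_status NVARCHAR(50),
    cst_gndr           NVARCHAR(50),
    cst_create_date    DATE
);

INSERT INTO @cust VALUES
    (11000, 'AW00011000', 'Jon',   'Yang',    'M',  'M',  '2025-10-06'),
    (29466, 'AW00029466', NULL,    NULL,      NULL, NULL, '2026-01-25'),
    (29466, 'AW00029466', 'Lance', 'Jimenez', 'M',  NULL, '2026-01-26'),
    (29466, 'AW00029466', 'Lance', 'Jimenez', 'M',  'M',  '2026-01-27'),
    (NULL,  'SF566',      NULL,    NULL,      NULL, NULL, NULL);

SELECT
    cst_id,
    cst_key,
    TRIM(cst_firstname) AS cst_firstname,          -- remove unwanted spaces
    TRIM(cst_lastname)  AS cst_lastname,
    CASE UPPER(TRIM(cst_marital_status))           -- standardize marital status
        WHEN 'S' THEN 'Single'
        WHEN 'M' THEN 'Married'
        ELSE 'Unknown'
    END AS cst_marital_status,
    CASE UPPER(TRIM(cst_gndr))                     -- standardize gender
        WHEN 'F' THEN 'Female'
        WHEN 'M' THEN 'Male'
        ELSE 'Unknown'
    END AS cst_gndr,
    cst_create_date
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS flag_last
    FROM @cust
    WHERE cst_id IS NOT NULL   -- assumption: rows without an id are dropped
) AS t
WHERE flag_last = 1;           -- keep only the latest record per customer
```

With these sample rows the query returns one row per `cst_id`, and customer 29466 resolves to the record created on 2026-01-27.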

## Table: crm\_prd\_info

| Problem                                                      | Solution                                                                 | Notes                                                                                                           | Columns Targeted          |
|--------------------------------------------------------------|--------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|---------------------------|
| No columns to be used for joining with other tables         | Derived columns, product key is divided into two derived columns         | - `SUBSTRING()` and `REPLACE()` are used.  <br> - The items that have not been sold yet (no records in sales table) are investigated using a CTE (Common Table Expression). <br> - The root cause of the discrepancy between the category of items in product info and the category table is detected. | prd_key                   |
| Negative and NULL values for product cost                   | NULL values are replaced by `0`                                          | `ISNULL()` is used.                                                                                            | prd_cost                  |
| Missing values and standardization in product line          | `'CASE WHEN'` statements are used to map values to a more descriptive way and impute null values with 'Unknown'| `TRIM` and `UPPER` string functions are also used for data consistency.                                        | prd_line                  |
| Start date of an item's price is later than its end date   | Using this logic (`End Date = Start Date of the NEXT Record -1`) to ensure no overlap and maintain **history** of the price per item | `LEAD()` window function is used to access the value of the next record. <br> `DATEADD()` is used to shift the end date one day backward to ensure no overlap. | prd_start_dt, prd_end_dt  |


In [45]:
-- Preview
SELECT TOP 10 * FROM bronze.crm_prd_info;

prd_id,prd_key,prd_nm,prd_cost,prd_line,prd_start_dt,prd_end_dt
210,CO-RF-FR-R92B-58,HL Road Frame - Black- 58,,R,2003-07-01,
211,CO-RF-FR-R92R-58,HL Road Frame - Red- 58,,R,2003-07-01,
212,AC-HE-HL-U509-R,Sport-100 Helmet- Red,12.0,S,2011-07-01,2007-12-28
213,AC-HE-HL-U509-R,Sport-100 Helmet- Red,14.0,S,2012-07-01,2008-12-27
214,AC-HE-HL-U509-R,Sport-100 Helmet- Red,13.0,S,2013-07-01,
215,AC-HE-HL-U509,Sport-100 Helmet- Black,12.0,S,2011-07-01,2007-12-28
216,AC-HE-HL-U509,Sport-100 Helmet- Black,14.0,S,2012-07-01,2008-12-27
217,AC-HE-HL-U509,Sport-100 Helmet- Black,13.0,S,2013-07-01,
218,CL-SO-SO-B909-M,Mountain Bike Socks- M,3.0,M,2011-07-01,2007-12-28
219,CL-SO-SO-B909-L,Mountain Bike Socks- L,3.0,M,2011-07-01,2007-12-28


In [46]:
-- Check For Nulls or Duplicates in Primary Key
-- No duplicates in the Primary Key
SELECT 
    prd_id, 
    COUNT(*) AS cnt
FROM bronze.crm_prd_info 
GROUP BY prd_id 
HAVING COUNT(*) > 1 
    OR prd_id IS NULL;

prd_id,cnt


In [59]:
-- We need to break down prd_key into two derived columns that we can use to join to two other tables
-- The first 5 characters seem to correspond to the category id in the erp_px_cat_g1v2 table
-- However, '-' needs to be replaced by '_'

SELECT TOP 5
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id
FROM bronze.crm_prd_info;
PRINT('bronze.erp_px_cat_g1v2')

SELECT DISTINCT TOP 5 id FROM bronze.erp_px_cat_g1v2;


cat_id
CO_RF
CO_RF
AC_HE
AC_HE
AC_HE


id
AC_BC
AC_BR
AC_BS
AC_CL
AC_FE


In [53]:
-- Now let's check which ids are in prd_info table but not in erp_px_cat_g1v2
-- Found 7
SELECT 
    prd_id,
    prd_key,
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,
    prd_nm,
    prd_cost,
    prd_line,
    prd_start_dt,
    prd_end_dt
FROM bronze.crm_prd_info
WHERE REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') NOT IN (
    SELECT DISTINCT id 
    FROM bronze.erp_px_cat_g1v2
);


prd_id,prd_key,cat_id,prd_nm,prd_cost,prd_line,prd_start_dt,prd_end_dt
542,CO-PE-PD-M282,CO_PE,LL Mountain Pedal,18,M,2013-07-01,
543,CO-PE-PD-M340,CO_PE,ML Mountain Pedal,28,M,2013-07-01,
544,CO-PE-PD-M562,CO_PE,HL Mountain Pedal,36,M,2013-07-01,
545,CO-PE-PD-R347,CO_PE,LL Road Pedal,18,R,2013-07-01,
546,CO-PE-PD-R563,CO_PE,ML Road Pedal,28,R,2013-07-01,
547,CO-PE-PD-R853,CO_PE,HL Road Pedal,36,R,2013-07-01,
548,CO-PE-PD-T852,CO_PE,Touring Pedal,36,T,2013-07-01,


In [63]:
-- Now let's see which ids are in the erp_px_cat_g1v2 table but not in prd_info
-- Found 1
------------------
-- It appears that the discrepancies are due to the inconsistent abbreviations for "Pedals" ("PE" vs "PD").
-- Further investigation and consultation with the team is required.
------------------
SELECT DISTINCT id, CAT, SUBCAT, MAINTENANCE
FROM bronze.erp_px_cat_g1v2
WHERE id NOT IN (
    SELECT REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_')
    FROM bronze.crm_prd_info);

-- Let's check if there is any other Pedals
SELECT DISTINCT
    id,
    CAT,
    SUBCAT,
    MAINTENANCE
FROM bronze.erp_px_cat_g1v2
WHERE SUBCAT LIKE '%Pedals%';


id,CAT,SUBCAT,MAINTENANCE
CO_PD,Components,Pedals,No


id,CAT,SUBCAT,MAINTENANCE
CO_PD,Components,Pedals,No


In [64]:
-- The rest of the product key can be used to join the table with crm_sales_details table
SELECT TOP 5
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,
    SUBSTRING(prd_key, 7, LEN(prd_key)) AS prd_key
FROM bronze.crm_prd_info;
PRINT('bronze.crm_sales_details');

SELECT DISTINCT TOP 5 sls_prd_key FROM bronze.crm_sales_details;

cat_id,prd_key
CO_RF,FR-R92B-58
CO_RF,FR-R92R-58
AC_HE,HL-U509-R
AC_HE,HL-U509-R
AC_HE,HL-U509-R


sls_prd_key
BK-R93R-62
BK-M82S-44
BK-R50B-62
BK-R93R-44
BK-M82B-48


In [79]:
-- Now let's check which product keys are in the prd_info table but not in crm_sales_details
-- Found 220 records, covering 165 distinct products that have not been sold yet (no match in the sales table)
SELECT TOP 5
    prd_id,
    prd_key,
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,
    SUBSTRING(prd_key, 7, LEN(prd_key)) AS prd_key,
    prd_nm,
    prd_cost,
    prd_line,
    prd_start_dt,
    prd_end_dt
FROM bronze.crm_prd_info
WHERE SUBSTRING(prd_key, 7, LEN(prd_key)) NOT IN (
    SELECT sls_prd_key 
    FROM bronze.crm_sales_details
    );


prd_id,prd_key,cat_id,prd_key.1,prd_nm,prd_cost,prd_line,prd_start_dt,prd_end_dt
210,CO-RF-FR-R92B-58,CO_RF,FR-R92B-58,HL Road Frame - Black- 58,,R,2003-07-01,
211,CO-RF-FR-R92R-58,CO_RF,FR-R92R-58,HL Road Frame - Red- 58,,R,2003-07-01,
218,CL-SO-SO-B909-M,CL_SO,SO-B909-M,Mountain Bike Socks- M,3.0,M,2011-07-01,2007-12-28
219,CL-SO-SO-B909-L,CL_SO,SO-B909-L,Mountain Bike Socks- L,3.0,M,2011-07-01,2007-12-28
238,CO-RF-FR-R92R-62,CO_RF,FR-R92R-62,HL Road Frame - Red- 62,748.0,R,2011-07-01,2007-12-28


In [75]:
-- 165 Items found that are not sold yet
WITH UnsoldProducts AS (
    SELECT
        prd_id,
        prd_key,
        REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,
        SUBSTRING(prd_key, 7, LEN(prd_key)) AS prod_key,
        prd_nm,
        prd_cost,
        prd_line,
        prd_start_dt,
        prd_end_dt
    FROM bronze.crm_prd_info
    WHERE SUBSTRING(prd_key, 7, LEN(prd_key)) NOT IN (
        SELECT sls_prd_key 
        FROM bronze.crm_sales_details
    )
)
SELECT DISTINCT TOP 5 prod_key -- COUNT(DISTINCT prod_key)
FROM UnsoldProducts;

prod_key
FR-R92B-58
FR-R92R-58
SO-B909-M
SO-B909-L
FR-R92R-62


In [82]:
-- Now let's check which products are in the sales_details table but not in prd_info
-- No records returned
-- Okay. Now we are sure that the discrepancy exists only because some items have not been sold

SELECT 
    sls_prd_key
FROM bronze.crm_sales_details
WHERE sls_prd_key NOT IN (
    SELECT SUBSTRING(prd_key, 7, LEN(prd_key))
    FROM bronze.crm_prd_info
);

sls_prd_key


In [83]:
-- Let's move on to product name and check for unwanted spaces
-- It is fine
SELECT prd_nm 
FROM bronze.crm_prd_info
WHERE prd_nm != TRIM(prd_nm)

prd_nm


In [85]:
-- For Product cost, we can check for NULL and negative values
SELECT *
FROM bronze.crm_prd_info 
WHERE prd_cost < 0 OR prd_cost IS NULL

prd_id,prd_key,prd_nm,prd_cost,prd_line,prd_start_dt,prd_end_dt
210,CO-RF-FR-R92B-58,HL Road Frame - Black- 58,,R,2003-07-01,
211,CO-RF-FR-R92R-58,HL Road Frame - Red- 58,,R,2003-07-01,


In [87]:
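-- We only have NULL issues here (no negative costs). I will replace them with 0 so that calculations on cost do not return NULL later on.
-- Keep in mind that AVG() ignores NULLs but counts zeros, so this choice lowers averages.
SELECT ISNULL(prd_cost, 0) AS prd_cost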
FROM bronze.crm_prd_info 
WHERE prd_cost < 0 OR prd_cost IS NULL

prd_cost
0
0


In [88]:
-- Moving on to product line column
SELECT DISTINCT prd_line 
FROM bronze.crm_prd_info

prd_line
""
M
R
S
T


In [91]:
-- Data Standardization & Consistency
-- We can replace them with more descriptive values with the insights gained by looking at product name
-- Also, NULL values will be replaced by 'Unknown'

SELECT DISTINCT
    CASE UPPER(TRIM(prd_line))
        WHEN 'M' THEN 'Mountain'
        WHEN 'R' THEN 'Road'
        WHEN 'S' THEN 'Other Sales'
        WHEN 'T' THEN 'Touring'
        ELSE 'Unknown'
    END AS prd_line
FROM bronze.crm_prd_info;

prd_line
Mountain
Other Sales
Road
Touring
Unknown


In [93]:
-- Moving on to start date and end date
-- These two columns show the time period during which a product had a certain price
-- So in this table we have a history of each product's cost (price)

-- The main issue here is that the start date is later than the end date, which makes no sense.
SELECT TOP 5
* 
FROM bronze.crm_prd_info

prd_id,prd_key,prd_nm,prd_cost,prd_line,prd_start_dt,prd_end_dt
210,CO-RF-FR-R92B-58,HL Road Frame - Black- 58,,R,2003-07-01,
211,CO-RF-FR-R92R-58,HL Road Frame - Red- 58,,R,2003-07-01,
212,AC-HE-HL-U509-R,Sport-100 Helmet- Red,12.0,S,2011-07-01,2007-12-28
213,AC-HE-HL-U509-R,Sport-100 Helmet- Red,14.0,S,2012-07-01,2008-12-27
214,AC-HE-HL-U509-R,Sport-100 Helmet- Red,13.0,S,2013-07-01,


In [98]:
-- I will fix the issue using this logic (End Date = Start Date of the NEXT Record -1) to make sure there is no overlap
-- Let's focus on an example
SELECT 
    prd_id,
    prd_key,
    prd_nm,
    prd_start_dt,
    prd_end_dt,
    DATEADD(day, -1, LEAD(prd_start_dt) OVER (PARTITION BY prd_key ORDER BY prd_start_dt)) AS prd_end_dt_test
FROM bronze.crm_prd_info
WHERE prd_key IN ('AC-HE-HL-U509-R', 'AC-HE-HL-U509');

prd_id,prd_key,prd_nm,prd_start_dt,prd_end_dt,prd_end_dt_test
215,AC-HE-HL-U509,Sport-100 Helmet- Black,2011-07-01,2007-12-28,2012-06-30
216,AC-HE-HL-U509,Sport-100 Helmet- Black,2012-07-01,2008-12-27,2013-06-30
217,AC-HE-HL-U509,Sport-100 Helmet- Black,2013-07-01,,
212,AC-HE-HL-U509-R,Sport-100 Helmet- Red,2011-07-01,2007-12-28,2012-06-30
213,AC-HE-HL-U509-R,Sport-100 Helmet- Red,2012-07-01,2008-12-27,2013-06-30
214,AC-HE-HL-U509-R,Sport-100 Helmet- Red,2013-07-01,,

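Taken together, the checks for this table suggest a transformation along the following lines. This is a sketch under stated assumptions rather than the final Silver-layer script: `@prd` is a hypothetical table variable standing in for `bronze.crm_prd_info`, seeded with rows shown above, and `prd_cost` is assumed to be an integer column.

```sql
-- Hypothetical stand-in for bronze.crm_prd_info, seeded with rows shown above
DECLARE @prd TABLE (
    prd_id       INT,
    prd_key      NVARCHAR(50),
    prd_nm       NVARCHAR(50),
    prd_cost     INT,            -- assumption: integer cost
    prd_line     NVARCHAR(50),
    prd_start_dt DATE,
    prd_end_dt   DATE
);

INSERT INTO @prd VALUES
    (210, 'CO-RF-FR-R92B-58', 'HL Road Frame - Black- 58', NULL, 'R', '2003-07-01', NULL),
    (212, 'AC-HE-HL-U509-R',  'Sport-100 Helmet- Red',     12,   'S', '2011-07-01', '2007-12-28'),
    (213, 'AC-HE-HL-U509-R',  'Sport-100 Helmet- Red',     14,   'S', '2012-07-01', '2008-12-27'),
    (214, 'AC-HE-HL-U509-R',  'Sport-100 Helmet- Red',     13,   'S', '2013-07-01', NULL);

SELECT
    prd_id,
    REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,    -- joins to erp_px_cat_g1v2.id
    SUBSTRING(prd_key, 7, LEN(prd_key))         AS prd_key,   -- joins to crm_sales_details.sls_prd_key
    prd_nm,
    ISNULL(prd_cost, 0)                         AS prd_cost,  -- NULL cost becomes 0
    CASE UPPER(TRIM(prd_line))
        WHEN 'M' THEN 'Mountain'
        WHEN 'R' THEN 'Road'
        WHEN 'S' THEN 'Other Sales'
        WHEN 'T' THEN 'Touring'
        ELSE 'Unknown'
    END                                         AS prd_line,
    prd_start_dt,
    -- End date = start date of the next record for the same product, minus one day
    DATEADD(day, -1,
            LEAD(prd_start_dt) OVER (PARTITION BY prd_key ORDER BY prd_start_dt)
    )                                           AS prd_end_dt
FROM @prd;
```

The `PARTITION BY prd_key` inside `OVER` refers to the original column, not the shortened alias, so price periods are still grouped by the full product key. The latest record of each product keeps a `NULL` end date, which marks the price that is currently in effect.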

## Table: crm_sales_details

| Problem                                                      | Solution                                                                 | Notes                                                                                                           | Columns Targeted         |
|--------------------------------------------------------------|--------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|--------------------------|
| An invalid date inserted by mistake (out of business boundaries) | Replaced by `NULL`                                                     | `'CASE WHEN'` statement and arbitrary placeholders for boundaries are used to convert the invalid date into `NULL`. | sls_order_dt            |
| Nonpositive and NULL values in sales price and sales amount  | Business logic is utilized to fix the wrong values using the other two values in several possible scenarios (i.e., `sales_amount = quantity * price`) | `'CASE WHEN'`, `ABS()`, and `NULLIF()` are used.                                                        | sls_sales, sls_quantity, sls_price |


In [128]:
-- Preview
SELECT TOP 10 * FROM bronze.crm_sales_details;

sls_ord_num,sls_prd_key,sls_cust_id,sls_sales,sls_quantity,sls_price,sls_order_dt,sls_ship_dt,sls_due_dt
SO43697,BK-R93R-62,21768,3578,1,3578,2010-12-29,2011-01-05,2011-01-10
SO43698,BK-M82S-44,28389,3400,1,3400,2010-12-29,2011-01-05,2011-01-10
SO43699,BK-M82S-44,25863,3400,1,3400,2010-12-29,2011-01-05,2011-01-10
SO43700,BK-R50B-62,14501,699,1,699,2010-12-29,2011-01-05,2011-01-10
SO43701,BK-M82S-44,11003,3400,1,3400,2010-12-29,2011-01-05,2011-01-10
SO43702,BK-R93R-44,27645,3578,1,3578,2010-12-30,2011-01-06,2011-01-11
SO43703,BK-R93R-62,16624,3578,1,3578,2010-12-30,2011-01-06,2011-01-11
SO43704,BK-M82B-48,11005,3375,1,3375,2010-12-30,2011-01-06,2011-01-11
SO43705,BK-M82S-38,11011,3400,1,3400,2010-12-30,2011-01-06,2011-01-11
SO43706,BK-R93R-48,27621,3578,1,3578,2010-12-31,2011-01-07,2011-01-12


In [110]:
-- Check for NULLs or duplicates in the potential primary key (sls_ord_num)
-- We have many duplicates. Let's take a closer look.

SELECT TOP 5
    sls_ord_num,
    COUNT(*) AS cnt
FROM bronze.crm_sales_details
GROUP BY sls_ord_num
HAVING COUNT(*) > 1 
    OR sls_ord_num IS NULL;



sls_ord_num,cnt
SO55367,4
SO62535,3
SO64083,3
SO65048,2
SO72893,2


In [109]:
-- It seems that the sales details table is a fact table without a primary key
-- This table captures all sales events, meaning one order number can be associated with several different items sold
SELECT *
FROM bronze.crm_sales_details
WHERE sls_ord_num = 'SO55367'

sls_ord_num,sls_prd_key,sls_cust_id,sls_sales,sls_quantity,sls_price,sls_order_dt,sls_ship_dt,sls_due_dt
SO55367,TT-T092,17642,5,1,5,2013-03-30,2013-04-06,2013-04-11
SO55367,BC-R205,17642,9,1,9,2013-03-30,2013-04-06,2013-04-11
SO55367,WB-H098,17642,5,1,5,2013-03-30,2013-04-06,2013-04-11
SO55367,CA-1098,17642,9,1,9,2013-03-30,2013-04-06,2013-04-11


In [111]:
-- Check for unwanted spaces
-- It is fine
SELECT 
    sls_ord_num
FROM bronze.crm_sales_details
WHERE sls_ord_num != TRIM(sls_ord_num);

sls_ord_num


In [114]:
-- sls_prd_key and sls_cust_id are the foreign keys that connect the sales table with the prd_info and cust_info tables.
-- Let's check whether all of these ids exist in the other two tables for data integration purposes
-- Yes. These two columns are perfectly fine

SELECT 
    sls_ord_num,
    sls_prd_key,
    sls_cust_id
FROM bronze.crm_sales_details
WHERE sls_prd_key NOT IN (
    SELECT SUBSTRING(prd_key, 7, LEN(prd_key))
    FROM bronze.crm_prd_info);

SELECT 
    sls_ord_num,
    sls_prd_key,
    sls_cust_id
FROM bronze.crm_sales_details
WHERE sls_cust_id NOT IN (
    SELECT cst_id
    FROM bronze.crm_cust_info
    WHERE cst_id IS NOT NULL);  -- a NULL in the list would make NOT IN return no rows at all


sls_ord_num,sls_prd_key,sls_cust_id


sls_ord_num,sls_prd_key,sls_cust_id

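A note of caution on using `NOT IN` for these integrity checks: if the subquery returns even one `NULL`, the predicate never evaluates to true and the check returns no rows, whether or not orphaned ids exist. Since `cst_id` in `bronze.crm_cust_info` is known to contain NULLs, the subquery has to exclude them, or the check can be written with `NOT EXISTS`. The sketch below uses two hypothetical table variables (`@cust`, `@sales`) with made-up ids purely to illustrate the difference.

```sql
-- Hypothetical sample data; the ids are made up for illustration
DECLARE @cust  TABLE (cst_id INT);
DECLARE @sales TABLE (sls_ord_num NVARCHAR(50), sls_cust_id INT);

INSERT INTO @cust  VALUES (11000), (11001), (NULL);
INSERT INTO @sales VALUES ('SO1', 11000), ('SO2', 99999);  -- 99999 has no matching customer

-- NOT IN: returns no rows, because comparing 99999 with NULL evaluates to UNKNOWN
SELECT s.sls_ord_num, s.sls_cust_id
FROM @sales AS s
WHERE s.sls_cust_id NOT IN (SELECT cst_id FROM @cust);

-- NOT EXISTS: returns SO2, the orphaned order
SELECT s.sls_ord_num, s.sls_cust_id
FROM @sales AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM @cust AS c
    WHERE c.cst_id = s.sls_cust_id
);
```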

In [117]:
-- Let's check the boundaries
SELECT 
    MIN(sls_order_dt) AS MinOrderDate,
    MAX(sls_order_dt) AS MaxOrderDate
FROM bronze.crm_sales_details;

MinOrderDate,MaxOrderDate
2010-12-29,5489-01-01


In [126]:
-- The max date is wrong
-- Let's sort the table by latest order date for further investigation
-- OK. Just one record was inserted by mistake. I will replace it with NULL
SELECT TOP 5 
* 
FROM bronze.crm_sales_details
ORDER BY sls_order_dt DESC;

sls_ord_num,sls_prd_key,sls_cust_id,sls_sales,sls_quantity,sls_price,sls_order_dt,sls_ship_dt,sls_due_dt
SO69215,TT-M928,16864,5,1,5,5489-01-01,2013-11-02,2013-11-07
SO75084,RA-H123,11078,120,1,120,2014-01-28,2014-02-04,2014-02-09
SO75085,CA-1098,11927,9,1,9,2014-01-28,2014-02-04,2014-02-09
SO75085,CL-9009,11927,8,1,8,2014-01-28,2014-02-04,2014-02-09
SO75086,CL-9009,28789,8,1,8,2014-01-28,2014-02-04,2014-02-09


In [138]:
-- Check for invalid dates
-- '1900-01-01' and '2100-01-01' are used as placeholders; the actual boundaries need to be discussed
SELECT TOP 5
    sls_order_dt,
    CASE 
        WHEN sls_order_dt < '1900-01-01'
             OR sls_order_dt > '2100-01-01' 
        THEN NULL
        ELSE sls_order_dt
    END AS validated_order_dt
FROM bronze.crm_sales_details
ORDER BY sls_order_dt DESC;

sls_order_dt,validated_order_dt
5489-01-01,
2014-01-28,2014-01-28
2014-01-28,2014-01-28
2014-01-28,2014-01-28
2014-01-28,2014-01-28

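Because the two boundary dates are placeholders that still need to be agreed on, it may help to declare them once instead of repeating the literals. A minimal sketch, assuming `sls_order_dt` is stored as `DATE` (as the output above suggests); the variable names `@min_valid_dt` and `@max_valid_dt` and the table variable `@sales` are hypothetical:

```sql
DECLARE @min_valid_dt DATE = '1900-01-01';  -- placeholder lower bound, to be discussed
DECLARE @max_valid_dt DATE = '2100-01-01';  -- placeholder upper bound, to be discussed

-- Hypothetical stand-in for bronze.crm_sales_details, with two rows shown above
DECLARE @sales TABLE (sls_ord_num NVARCHAR(50), sls_order_dt DATE);
INSERT INTO @sales VALUES
    ('SO69215', '5489-01-01'),
    ('SO75084', '2014-01-28');

SELECT
    sls_ord_num,
    sls_order_dt,
    CASE
        WHEN sls_order_dt BETWEEN @min_valid_dt AND @max_valid_dt THEN sls_order_dt
        ELSE NULL                           -- out-of-range dates become NULL
    END AS validated_order_dt
FROM @sales;
```

Both boundaries are inclusive, which matches the `<` and `>` comparisons used in the cell above.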

In [134]:
-- Let's check the boundaries for shipping and due dates as well
-- They look good
SELECT 
    MIN(sls_ship_dt) AS Min,
    MAX(sls_ship_dt) AS Max
FROM bronze.crm_sales_details;

SELECT 
    MIN(sls_due_dt) AS Min,
    MAX(sls_due_dt) AS Max
FROM bronze.crm_sales_details;

Min,Max
2011-01-05,2014-02-04


Min,Max
2011-01-10,2014-02-09


In [133]:
-- Check the logical order of the order, shipping, and due dates
-- No new records are returned; the only hit is the invalid order date already found above
SELECT *
FROM bronze.crm_sales_details
WHERE sls_order_dt > sls_ship_dt
   OR sls_order_dt > sls_due_dt
   OR sls_ship_dt > sls_due_dt;

sls_ord_num,sls_prd_key,sls_cust_id,sls_sales,sls_quantity,sls_price,sls_order_dt,sls_ship_dt,sls_due_dt
SO69215,TT-M928,16864,5,1,5,5489-01-01,2013-11-02,2013-11-07


In [135]:
-- Let's get into sls_sales, sls_quantity, sls_price

-- > Sales = Quantity * Price 
-- > Values must not be NULL, zero, or negative.
SELECT DISTINCT 
    sls_sales, 
    sls_quantity, 
    sls_price
FROM bronze.crm_sales_details
WHERE 
    sls_sales != sls_quantity * sls_price
    OR sls_sales IS NULL
    OR sls_quantity IS NULL
    OR sls_price IS NULL
    OR sls_sales <= 0
    OR sls_quantity <= 0
    OR sls_price <= 0
ORDER BY 
    sls_sales, 
    sls_quantity, 
    sls_price;

sls_sales,sls_quantity,sls_price
,1,2.0
,1,8.0
,1,9.0
,1,10.0
,1,22.0
,1,24.0
,1,35.0
-54.0,1,54.0
-35.0,1,35.0
-18.0,1,9.0


In [137]:
-- Found 33 records where sls_sales and sls_price have several issues (NULL and negative values), while quantity is fine
-- Okay. In order to fix the issues I will follow this set of rules:
----- If Sales is negative, zero, null, or inconsistent, derive it as Quantity * ABS(Price).
----- If Price is negative, zero, or null, derive it as Sales / Quantity.
----- A negative Price is treated as its absolute value when deriving Sales.

SELECT DISTINCT
    sls_sales      AS old_sls_sales,
    sls_price      AS old_sls_price,
    CASE 
        WHEN sls_sales IS NULL 
             OR sls_sales <= 0 
             OR sls_sales != sls_quantity * ABS(sls_price)
        THEN sls_quantity * ABS(sls_price)
        ELSE sls_sales
    END            AS new_sls_sales,
    CASE 
        WHEN sls_price IS NULL 
             OR sls_price <= 0 
        THEN sls_sales / NULLIF(sls_quantity, 0) -- NULLIF avoids a divide-by-zero error when quantity is 0
        ELSE sls_price
    END            AS new_sls_price,
    sls_quantity
FROM bronze.crm_sales_details
WHERE 
    sls_sales != sls_quantity * sls_price
    OR sls_sales IS NULL
    OR sls_quantity IS NULL
    OR sls_price IS NULL
    OR sls_sales <= 0
    OR sls_quantity <= 0
    OR sls_price <= 0
ORDER BY 
    sls_sales, 
    sls_quantity, 
    sls_price;

old_sls_sales,old_sls_price,new_sls_sales,new_sls_price,sls_quantity
,2.0,2,2,1
,8.0,8,8,1
,9.0,9,9,1
,10.0,10,10,1
,22.0,22,22,1
,24.0,24,24,1
,35.0,35,35,1
-54.0,54.0,54,54,1
-35.0,35.0,35,35,1
-18.0,9.0,9,9,1
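
The query above recomputes any non-positive price from `sls_sales / sls_quantity`, so a negative price is not simply flipped to a positive one as the third rule states. A minimal sketch that follows the three written rules more literally is given below. It is an alternative under stated assumptions, not the transformation used in this notebook: `sample_rows` and its integer values are hypothetical and only mirror the kinds of issues seen in the output.

```sql
-- Sketch only: sample_rows is hypothetical and not taken from bronze.crm_sales_details
WITH sample_rows AS (
    SELECT *
    FROM (VALUES
        (NULL, 1, 10),   -- missing sales
        (-54,  1, 54),   -- negative sales
        (-18,  1, 9),    -- negative sales that also disagrees with quantity * price
        (30,   2, NULL), -- missing price
        (40,   2, -20)   -- negative price
    ) AS v (sls_sales, sls_quantity, sls_price)
)
SELECT
    sls_sales AS old_sls_sales,
    sls_price AS old_sls_price,
    sls_quantity,
    -- Rule 1: derive sales from quantity and price when sales is NULL, <= 0, or inconsistent
    CASE
        WHEN sls_sales IS NULL
             OR sls_sales <= 0
             OR sls_sales != sls_quantity * ABS(sls_price)
        THEN sls_quantity * ABS(sls_price)
        ELSE sls_sales
    END AS new_sls_sales,
    CASE
        -- Rule 2: derive price from sales and quantity when price is NULL or zero
        WHEN sls_price IS NULL OR sls_price = 0
        THEN sls_sales / NULLIF(sls_quantity, 0)
        -- Rule 3: convert a negative price to a positive value
        WHEN sls_price < 0
        THEN ABS(sls_price)
        ELSE sls_price
    END AS new_sls_price
FROM sample_rows;
```

The only difference from the notebook query is the separate `sls_price < 0` branch, which keeps a negative price usable even when the sales value on the same row is missing or wrong.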


## Table: bronze.erp\_cust\_az12

| Problem                                                      | Solution                                                       | Notes                                                                                                           | Columns Targeted |
|--------------------------------------------------------------|----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|------------------|
| Inconsistent ID patterns that are incompatible with the customer info table | The `NAS` prefix is removed from IDs                          | - Pattern-matching operator (`LIKE`) is used.  <br> - `CASE WHEN` and the string function `SUBSTRING()` are used to find and fix unmatched IDs. <br> - A CTE is also used. | cid              |
| Unrealistic birthdates (in the future)                      | Birthdates later than today's date are replaced with `NULL`   | `CASE WHEN` and the `GETDATE()` date function are used.                                                         | bdate            |
| Missing values and inconsistent labels in gender            | `CASE WHEN` maps values to descriptive labels (`'Female'`, `'Male'`) and replaces missing values with `'Unknown'` | `TRIM()` and `UPPER()` string functions are also used for data consistency.                                     | gen              |


In [142]:
SELECT TOP 10
* 
FROM bronze.erp_cust_az12

CID,BDATE,GEN
NASAW00011000,1971-10-06,Male
NASAW00011001,1976-05-10,Male
NASAW00011002,1971-02-09,Male
NASAW00011003,1973-08-14,Female
NASAW00011004,1979-08-05,Female
NASAW00011005,1976-08-01,Male
NASAW00011006,1976-12-02,Female
NASAW00011007,1969-11-06,Male
NASAW00011008,1975-07-04,Female
NASAW00011009,1969-09-29,Male


In [153]:
-- Checking for duplicates or NULLs in cid
-- None found
SELECT 
    cid, 
    COUNT(*) AS cnt
FROM bronze.erp_cust_az12
GROUP BY cid 
HAVING COUNT(*) > 1 
    OR cid IS NULL;

cid,cnt


In [144]:
SELECT TOP 5
*
FROM bronze.erp_cust_az12
WHERE cid LIKE 'AW%'

CID,BDATE,GEN
AW00022042,1983-08-18,Female
AW00022043,1978-02-09,Female
AW00022044,1983-05-13,Male
AW00022045,1979-05-04,Male
AW00022046,1984-08-16,Female


In [148]:
-- Apparently, older IDs follow the NASAW% pattern and newer IDs follow the AW% pattern
-- To integrate this table and join it with customer info, every ID should start with AW

SELECT TOP 5
    cid,
    CASE 
        WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid))
        ELSE cid 
    END AS transformed_cid,
    bdate,
    gen
FROM bronze.erp_cust_az12


cid,transformed_cid,bdate,gen
NASAW00011000,AW00011000,1971-10-06,Male
NASAW00011001,AW00011001,1976-05-10,Male
NASAW00011002,AW00011002,1971-02-09,Male
NASAW00011003,AW00011003,1973-08-14,Female
NASAW00011004,AW00011004,1979-08-05,Female


In [149]:
-- Check that all IDs also exist in the customer info table (this table is joined with customer info later on)
-- No rows returned, so every transformed ID has a match
WITH cte AS (
    SELECT 
        cid,
        CASE 
            WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid))
            ELSE cid 
        END AS transformed_cid,
        bdate,
        gen
    FROM bronze.erp_cust_az12
)
SELECT *
FROM cte
WHERE transformed_cid NOT IN (
    SELECT DISTINCT cst_key
    FROM bronze.crm_cust_info
);

cid,transformed_cid,bdate,gen
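
One caveat with this kind of check: `NOT IN` returns no rows at all if the subquery yields even a single `NULL`, so an empty result is only conclusive when `cst_key` contains no `NULL`s. A hedged alternative is `NOT EXISTS`, which is not affected by `NULL`s in the lookup column. The sketch below uses hypothetical inline rows (`erp_ids`, `crm_keys`) rather than the real tables, purely to show the difference.

```sql
-- Sketch only: erp_ids and crm_keys are hypothetical stand-ins for the real tables
WITH erp_ids AS (
    SELECT * FROM (VALUES ('AW00011000'), ('AW00099999')) AS v (transformed_cid)
),
crm_keys AS (
    SELECT * FROM (VALUES ('AW00011000'), (NULL)) AS v (cst_key)
)
-- NOT EXISTS still reports AW00099999 as unmatched even though crm_keys holds a NULL;
-- the equivalent NOT IN query would return no rows for this data.
SELECT e.transformed_cid
FROM erp_ids AS e
WHERE NOT EXISTS (
    SELECT 1
    FROM crm_keys AS c
    WHERE c.cst_key = e.transformed_cid
);
```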


In [154]:
-- Moving on to the birthdate column
-- Let's check the boundaries for birthdate
-- The minimum is plausible, but the maximum (9999-11-20) is in the future
SELECT 
    MIN(bdate) AS Min,
    MAX(bdate) AS Max
FROM bronze.erp_cust_az12;

Min,Max
1916-02-10,9999-11-20


In [159]:
-- Some customers appear to be over 100 years old; this is not treated as an issue here
-- 16 customers have a birthdate in the future, which is not acceptable; these are replaced with NULL
SELECT TOP 20
bdate,
CASE WHEN bdate > GETDATE() THEN NULL
    ELSE bdate
END AS new_bdate
FROM bronze.erp_cust_az12
ORDER BY bdate DESC

bdate,new_bdate
9999-11-20,
9999-09-13,
9999-09-11,
9999-05-10,
2980-03-09,
2080-03-15,
2066-06-16,
2065-12-12,
2055-01-23,
2050-11-22,


In [160]:
-- Let's check gender
-- Data Standardization & Consistency 
SELECT DISTINCT gen FROM bronze.erp_cust_az12

gen
""
F
Male
Female
M


In [161]:
-- Standardizing gender labels and handling missing values
SELECT DISTINCT 
    gen,
    CASE 
        WHEN UPPER(TRIM(gen)) IN ('F', 'FEMALE') THEN 'Female'
        WHEN UPPER(TRIM(gen)) IN ('M', 'MALE') THEN 'Male'
        ELSE 'Unknown'
    END AS standardized_gen
FROM bronze.erp_cust_az12;

gen,standardized_gen
Female,Female
F,Female
M,Male
,Unknown
Male,Male
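
The three fixes for this table (ID prefix, future birthdates, gender labels) can be combined into a single `SELECT`. The sketch below is one possible arrangement, not necessarily the statement used to load the Silver layer. `sample_rows` is a hypothetical stand-in for `bronze.erp_cust_az12`, the `DATE` type for `bdate` is an assumption, and the third row is an invented ID paired with a future date seen in the output above.

```sql
-- Sketch only: sample_rows is a hypothetical stand-in for bronze.erp_cust_az12
WITH sample_rows AS (
    SELECT *
    FROM (VALUES
        ('NASAW00011000', CAST('1971-10-06' AS DATE), 'Male'), -- old ID pattern
        ('AW00022042',    CAST('1983-08-18' AS DATE), 'F '),   -- short label with a trailing space
        ('AW00099999',    CAST('9999-11-20' AS DATE), '')      -- invented ID, future date, blank gender
    ) AS v (cid, bdate, gen)
)
SELECT
    -- Strip the 'NAS' prefix so the ID matches cst_key in the customer info table
    CASE
        WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid))
        ELSE cid
    END AS cid,
    -- Birthdates in the future are replaced with NULL
    CASE
        WHEN bdate > GETDATE() THEN NULL
        ELSE bdate
    END AS bdate,
    -- Map gender variants to full labels; anything else becomes 'Unknown'
    CASE
        WHEN UPPER(TRIM(gen)) IN ('F', 'FEMALE') THEN 'Female'
        WHEN UPPER(TRIM(gen)) IN ('M', 'MALE') THEN 'Male'
        ELSE 'Unknown'
    END AS gen
FROM sample_rows;
```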


## Table: bronze.erp\_loc\_a101

| Problem                                                 | Solution                                        | Notes                                          | Columns Targeted |
|---------------------------------------------------------|-------------------------------------------------|------------------------------------------------|------------------|
| IDs are incompatible with the customer info table key (`cst_key`) | Removed `'-'` from IDs to make them compatible | `REPLACE()` string function is used.           | cid              |
| Data standardization and consistency issues in the country column | Mapped abbreviations to full country names and replaced blanks and `NULL`s with `'Unknown'` | `CASE WHEN` and the `TRIM()` string function are used. | cntry            |


In [183]:
-- Preview
SELECT TOP 5
*
FROM bronze.erp_loc_a101;

SELECT TOP 5
cst_key
FROM bronze.crm_cust_info;

CID,CNTRY
AW-00011000,Australia
AW-00011001,Australia
AW-00011002,Australia
AW-00011003,Australia
AW-00011004,Australia


cst_key
AW00011000
AW00011001
AW00011002
AW00011003
AW00011004


In [168]:
-- Let's remove ('-')
SELECT TOP 5
    REPLACE(cid, '-', '') AS transformed_cid,
    cid
FROM bronze.erp_loc_a101

transformed_cid,cid
AW00011000,AW-00011000
AW00011001,AW-00011001
AW00011002,AW-00011002
AW00011003,AW-00011003
AW00011004,AW-00011004


In [167]:
-- Check that all IDs exist in the customer info table
-- No rows returned, so all IDs match

SELECT 
    REPLACE(cid, '-', '') AS transformed_cid,
    cntry
FROM bronze.erp_loc_a101
WHERE REPLACE(cid, '-', '') NOT IN (
    SELECT cst_key
    FROM bronze.crm_cust_info
);

transformed_cid,cntry


In [170]:
-- Checking country
SELECT DISTINCT cntry 
FROM bronze.erp_loc_a101
ORDER BY cntry

cntry
""
Australia
Canada
DE
France
Germany
United Kingdom
United States
US
USA


In [172]:
-- Data Standardization & Consistency 
SELECT DISTINCT 
    cntry AS old_cntry,
    CASE 
        WHEN TRIM(cntry) = 'DE' THEN 'Germany'
        WHEN TRIM(cntry) IN ('US', 'USA') THEN 'United States'
        WHEN TRIM(cntry) = '' OR cntry IS NULL THEN 'Unknown'
        ELSE TRIM(cntry)
    END AS cntry
FROM bronze.erp_loc_a101
ORDER BY cntry;

old_cntry,cntry
Australia,Australia
Canada,Canada
France,France
DE,Germany
Germany,Germany
United Kingdom,United Kingdom
US,United States
United States,United States
USA,United States
,Unknown
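
A possible follow-up check is to confirm that, after standardization, only the expected country values remain. The sketch below assumes the expected list is exactly the set shown in the output above. `sample_rows` is a hypothetical stand-in for `bronze.erp_loc_a101`, and `'Deutschland'` is an invented value added to show what the check would catch.

```sql
-- Sketch only: sample_rows is a hypothetical stand-in for bronze.erp_loc_a101
WITH sample_rows AS (
    SELECT *
    FROM (VALUES ('DE'), (' USA '), (''), (NULL), ('Canada'), ('Deutschland')) AS v (cntry)
),
standardized AS (
    SELECT
        cntry AS old_cntry,
        CASE
            WHEN TRIM(cntry) = 'DE' THEN 'Germany'
            WHEN TRIM(cntry) IN ('US', 'USA') THEN 'United States'
            WHEN TRIM(cntry) = '' OR cntry IS NULL THEN 'Unknown'
            ELSE TRIM(cntry)
        END AS cntry
    FROM sample_rows
)
-- Any row returned here is a value the mapping does not yet cover
SELECT old_cntry, cntry
FROM standardized
WHERE cntry NOT IN (
    'Australia', 'Canada', 'France', 'Germany',
    'United Kingdom', 'United States', 'Unknown'
);
```

Because the `CASE` expression never yields `NULL`, the `NOT IN` comparison against a literal list is safe here.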


## Table: bronze.erp\_px\_cat\_g1v2

| Problem                                                      | Solution                                                          | Notes                               | Columns Targeted |
|--------------------------------------------------------------|-------------------------------------------------------------------|-------------------------------------|------------------|
| One ID does not match the product category key derived from `prd_key` in the product info table | Replaced `'CO_PD'` with `'CO_PE'` to match the product info table and avoid join issues | `REPLACE()` string function is used. | id               |


In [173]:
--Preview
SELECT TOP 5
*
FROM bronze.erp_px_cat_g1v2

ID,CAT,SUBCAT,MAINTENANCE
AC_BR,Accessories,Bike Racks,Yes
AC_BS,Accessories,Bike Stands,No
AC_BC,Accessories,Bottles and Cages,No
AC_CL,Accessories,Cleaners,Yes
AC_FE,Accessories,Fenders,No


In [176]:
-- As we learned earlier, one id does not match any category key derived from prd_key
SELECT id, CAT, SUBCAT, MAINTENANCE
FROM bronze.erp_px_cat_g1v2
WHERE id NOT IN (
    SELECT REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_')
    FROM bronze.crm_prd_info);


id,CAT,SUBCAT,MAINTENANCE
CO_PD,Components,Pedals,No


In [182]:
-- Now I will replace 'CO_PD' with 'CO_PE' to match the product info table, in order to avoid problems when joining
SELECT 
    id,
    CAT,
    SUBCAT,
    MAINTENANCE
FROM bronze.erp_px_cat_g1v2
WHERE REPLACE(id, 'CO_PD', 'CO_PE') NOT IN (
    SELECT REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_')
    FROM bronze.crm_prd_info
);

id,CAT,SUBCAT,MAINTENANCE
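
The check above only applies `REPLACE()` inside the `WHERE` clause. A minimal sketch of how the corrected `id` could be produced in the select list is shown below. `sample_rows` is a hypothetical stand-in for `bronze.erp_px_cat_g1v2`, using two rows that appear in the outputs above.

```sql
-- Sketch only: sample_rows is a hypothetical stand-in for bronze.erp_px_cat_g1v2
WITH sample_rows AS (
    SELECT *
    FROM (VALUES
        ('AC_BR', 'Accessories', 'Bike Racks', 'Yes'),
        ('CO_PD', 'Components',  'Pedals',     'No')
    ) AS v (id, cat, subcat, maintenance)
)
SELECT
    REPLACE(id, 'CO_PD', 'CO_PE') AS id, -- only the mismatched id changes
    cat,
    subcat,
    maintenance
FROM sample_rows;
```

`REPLACE()` substitutes the substring wherever it occurs. If stricter matching is preferred, `CASE WHEN id = 'CO_PD' THEN 'CO_PE' ELSE id END` would change only exact matches.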


In [181]:
-- Check for unwanted Spaces (cat, subcat, maintenance)
-- All looking good
SELECT *
FROM bronze.erp_px_cat_g1v2
WHERE cat != TRIM(cat)
   OR subcat != TRIM(subcat)
   OR maintenance != TRIM(maintenance);

ID,CAT,SUBCAT,MAINTENANCE


In [178]:
-- Data Standardization & Consistency 
-- It is fine
SELECT DISTINCT cat 
FROM bronze.erp_px_cat_g1v2

cat
Accessories
Bikes
Clothing
Components


In [179]:
-- Data Standardization & Consistency 
-- It is fine
SELECT DISTINCT subcat
FROM bronze.erp_px_cat_g1v2

subcat
Bib-Shorts
Bike Racks
Bike Stands
Bottles and Cages
Bottom Brackets
Brakes
Caps
Chains
Cleaners
Cranksets


In [180]:
-- Data Standardization & Consistency 
-- It is fine
SELECT DISTINCT maintenance 
FROM bronze.erp_px_cat_g1v2

maintenance
No
Yes
