# Dataset Validation - Products

This notebook performs comprehensive data validation on the *products* dataset used for the CRM pricing analysis project.

In [3]:
# import library

import pandas as pd 

# Initialize list to collect data quality issues
dq_issues = []

In [4]:
# Load dataset
prodct = pd.read_csv('/Users/Gio Noga/Documents/Data Analysis 101/repos/gn-data-crm_pricing_analysis/raw_dataset/products.csv')

### **Get General information about the DataFrame**

In [5]:
# Check table shape
prodct.shape

(7, 3)

In [6]:
# Summary of dataset
prodct.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   product      7 non-null      object
 1   series       7 non-null      object
 2   sales_price  7 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 300.0+ bytes


**Validation Approach:**

1. **Visual inspection** - Display entire dataset to verify all records at once
2. **Uniqueness checks** - Ensure primary key candidates have no duplicates
3. **Value correlation** - Check that categorical columns align with the product name
4. **Data completeness** - Verify critical columns have no unexpected nulls
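
The completeness check in item 4 can also be written as an explicit test instead of being read off the `info()` output. The following is a minimal sketch, not part of the original notebook: `sample` is a hypothetical stand-in for `prodct`, rebuilt from the seven rows displayed in this notebook, and `critical_cols` is an assumed name for the columns treated as critical.

```python
import pandas as pd

# Hypothetical stand-in for `prodct`, rebuilt from the rows shown in this notebook
sample = pd.DataFrame({
    "product": ["GTX Basic", "GTX Pro", "MG Special", "MG Advanced",
                "GTX Plus Pro", "GTX Plus Basic", "GTK 500"],
    "series": ["GTX", "GTX", "MG", "MG", "GTX", "GTX", "GTK"],
    "sales_price": [550, 4821, 55, 3393, 5482, 1096, 26768],
})

# Assumption: all three columns are treated as critical
critical_cols = ["product", "series", "sales_price"]

# Count nulls per critical column
null_counts = sample[critical_cols].isna().sum()
print(null_counts)

# True only when no critical column contains a null
no_nulls = bool((null_counts == 0).all())
print("Passed" if no_nulls else "Nulls found")
```

On a larger dataset, `null_counts[null_counts > 0]` would narrow the report to the columns that need attention.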

#### **Visual Inspection for Whole Dataset**

Because the product dataset is small (7 rows, 3 columns), manual inspection combined with basic statistical checks is sufficient and is deemed more practical than complex validation rules.

In [7]:
# Display entire dataset
prodct.head(7)

|   | product | series | sales_price |
|---|---|---|---|
| 0 | GTX Basic | GTX | 550 |
| 1 | GTX Pro | GTX | 4821 |
| 2 | MG Special | MG | 55 |
| 3 | MG Advanced | MG | 3393 |
| 4 | GTX Plus Pro | GTX | 5482 |
| 5 | GTX Plus Basic | GTX | 1096 |
| 6 | GTK 500 | GTK | 26768 |


### **Validate Original Primary Key Column**

#### **Column: product**

**product** is the primary key (PK) of the dataset, so it is checked for duplicates first.

In [8]:
# Check the PK for duplicates: the unique count should equal the row count
prodct['product'].nunique(dropna=False)

7

The count of unique values in the **product** column (7) matches the number of rows in the dataset, which confirms that there are no duplicates.
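
Comparing a unique count with the row count works, but it does not show which rows are duplicated when the check fails. The sketch below is one possible alternative, not the notebook's own method: it uses `Series.is_unique` and `Series.duplicated(keep=False)` on `sample`, a hypothetical stand-in for `prodct` rebuilt from the rows displayed earlier.

```python
import pandas as pd

# Hypothetical stand-in for `prodct`, rebuilt from the rows shown in this notebook
sample = pd.DataFrame({
    "product": ["GTX Basic", "GTX Pro", "MG Special", "MG Advanced",
                "GTX Plus Pro", "GTX Plus Basic", "GTK 500"],
    "series": ["GTX", "GTX", "MG", "MG", "GTX", "GTX", "GTK"],
    "sales_price": [550, 4821, 55, 3393, 5482, 1096, 26768],
})

# True only when no value in the column repeats
pk_is_unique = bool(sample["product"].is_unique)

# keep=False flags every member of a duplicate group, not only the later repeats
duplicate_rows = sample[sample["product"].duplicated(keep=False)]

print(pk_is_unique)
print(len(duplicate_rows))
```

If `duplicate_rows` were not empty, it would contain the full records that share a **product** value, ready for inspection.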

### **Validation for other Columns**

#### **Column: series**

Values in this column are expected to be non-null and identical to the series named in the product name.

Earlier validation has already confirmed that there are no null values in this column; what is left is to verify the product-series correlation.

In [9]:
# Validate that series matches the first word of the product name
series_mismatch = (prodct['series'] != prodct['product'].str.split().str[0]).sum()

if series_mismatch > 0:
    print(f"Series mismatch: {series_mismatch} row(s)")
else:
    print("Passed")

Passed


Validation confirms that each **product** has the correct corresponding **series** value. 
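
The check above reports only a count. When a mismatch does occur, it helps to see the offending rows and to record them in the `dq_issues` list initialized at the top of the notebook. The following is a sketch under stated assumptions: `sample` is a hypothetical stand-in for `prodct`, and the dictionary layout used for each issue entry is illustrative, since the notebook does not define one.

```python
import pandas as pd

# Hypothetical stand-in for `prodct`, rebuilt from the rows shown in this notebook
sample = pd.DataFrame({
    "product": ["GTX Basic", "GTX Pro", "MG Special", "MG Advanced",
                "GTX Plus Pro", "GTX Plus Basic", "GTK 500"],
    "series": ["GTX", "GTX", "MG", "MG", "GTX", "GTX", "GTK"],
    "sales_price": [550, 4821, 55, 3393, 5482, 1096, 26768],
})

dq_issues = []

# Same rule as the check above: series is the first whitespace-separated
# word of the product name
expected_series = sample["product"].str.split().str[0]
mismatch_rows = sample[sample["series"] != expected_series]

if not mismatch_rows.empty:
    # Assumed entry layout; adapt to whatever structure the project uses
    dq_issues.append({
        "column": "series",
        "issue": "series does not match the first word of product",
        "rows": mismatch_rows.index.tolist(),
    })

print(len(mismatch_rows))
print(dq_issues)
```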

#### **Column: sales_price**

Like the **series** column, the **sales_price** column is expected to contain no null values. In addition, every value must be valid: no zero or negative prices may be present.

In [10]:
# Check for non-positive sales price
(prodct['sales_price'] <= 0).sum()

np.int64(0)

Validation confirms that each **product** has a valid **sales_price**. 
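
The price rules (numeric type, no nulls, strictly positive) can be combined into a single pass/fail result. This is a minimal sketch rather than the notebook's own code: `sample` is a hypothetical stand-in for `prodct`, and `price_valid` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for `prodct`, rebuilt from the rows shown in this notebook
sample = pd.DataFrame({
    "product": ["GTX Basic", "GTX Pro", "MG Special", "MG Advanced",
                "GTX Plus Pro", "GTX Plus Basic", "GTK 500"],
    "series": ["GTX", "GTX", "MG", "MG", "GTX", "GTX", "GTK"],
    "sales_price": [550, 4821, 55, 3393, 5482, 1096, 26768],
})

price = sample["sales_price"]

# Each rule is evaluated separately so a failure can be traced to its cause
is_numeric = pd.api.types.is_numeric_dtype(price)
null_count = int(price.isna().sum())
non_positive_count = int((price <= 0).sum())

price_valid = bool(is_numeric and null_count == 0 and non_positive_count == 0)
print(price_valid)
```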

**There are no data quality issues in the dataset. No further cleaning is required.**

In [12]:
# Save the dataframe to a new CSV file for analysis
prodct.to_csv('C:\\Users\\Gio Noga\\Documents\\Data Analysis 101\\repos\\gn-data-crm_pricing_analysis\\clean_dataset\\02_products_cleaned.csv', index=False)