# Dataset Validation - Accounts

This notebook performs comprehensive data validation on the *accounts* dataset used for the CRM pricing analysis project.

In [6]:
# Import libraries
import pandas as pd
import re

# Initialize list to collect data quality issues
dq_issues = []

In [7]:
## Load the accounts raw dataset
acc = pd.read_csv('/Users/Gio Noga/Documents/Data Analysis 101/repos/gn-data-crm_pricing_analysis/raw_dataset/accounts.csv')

### **Get General information about the DataFrame**

In [8]:
# Table size
acc.shape

(85, 7)

In [9]:
# Summary of DataFrame
acc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   account           85 non-null     object 
 1   sector            85 non-null     object 
 2   year_established  85 non-null     int64  
 3   revenue           85 non-null     float64
 4   employees         85 non-null     int64  
 5   office_location   85 non-null     object 
 6   subsidiary_of     15 non-null     object 
dtypes: float64(1), int64(2), object(4)
memory usage: 4.8+ KB


In [10]:
# Get sample data
acc.head(3)

|    | account | sector | year_established | revenue | employees | office_location | subsidiary_of |
|---|---|---|---|---|---|---|---|
| 0 | Acme Corporation | technolgy | 1996 | 1100.04 | 2822 | United States | NaN |
| 1 | Betasoloin | medical | 1999 | 251.41 | 495 | United States | NaN |
| 2 | Betatech | medical | 1986 | 647.18 | 1185 | Kenya | NaN |


In [11]:
# Get sample data
acc.tail(3)    

|    | account | sector | year_established | revenue | employees | office_location | subsidiary_of |
|---|---|---|---|---|---|---|---|
| 82 | Zoomit | entertainment | 1992 | 324.19 | 978 | United States | NaN |
| 83 | Zotware | software | 1979 | 4478.47 | 13809 | United States | NaN |
| 84 | Zumgoity | medical | 1984 | 441.08 | 1210 | United States | NaN |


**Validation Approach:**

- **Primary Key Validation**: Ensuring the `account` column contains unique, non-null values
- **Data Type Consistency**: Verifying columns contain expected data types
- **Categorical Value Accuracy**: Checking for spelling inconsistencies and unexpected values in categorical columns (sector, office_location)
- **Numeric Value Validity**: Validating ranges and realistic values for numeric columns (year_established, employees)
- **Referential Integrity**: Ensuring `subsidiary_of` values reference valid existing accounts

### **Validate Original Primary Key Column**

#### **Column: account**

**account** is the primary key (PK) of the dataset, so it is checked for duplicates first.

In [12]:
# Check for duplicates in PK column
acc['account'].nunique(dropna=False)

85

The count of unique values in the **account** column (85) matches the number of rows in the dataset, which indicates that there are no duplicates.
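A unique-count comparison only catches exact duplicates. As a minimal sketch, assuming names that differ only by case or surrounding whitespace should also count as duplicates, the check could be extended as follows. The sample values are hypothetical; in this notebook the input would be `acc['account']`.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be acc['account']
accounts = pd.Series(["Acme Corporation", "Betatech", "acme corporation ", "Zoomit"])

# Exact duplicates (what the nunique comparison detects)
exact_dup_count = int(accounts.duplicated().sum())

# Normalize case and surrounding whitespace before comparing
normalized = accounts.str.strip().str.casefold()

# keep=False marks every member of a duplicate group, not only the repeats
near_dup_rows = accounts[normalized.duplicated(keep=False)]

print("Exact duplicates:", exact_dup_count)
print("Near duplicates:", near_dup_rows.tolist())
```

Listing the affected rows, rather than only counting them, makes it easier to decide which spelling to keep.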

In [None]:
# Validate account column
def validate_account(account):
    
    if pd.isna(account):
        return "null_value"
    
    name = str(account).strip()
    
    # Rule 1: At least one word
    tokens = name.split()
    if len(tokens) < 1:
        return "no_words"
    
    # Rule 2: Allowed characters only (letters, numbers, spaces, &, -, ., ')
    if not re.fullmatch(r"[A-Za-z0-9\s&\-\.\']+", name):
        return "invalid_characters"
    
    # Rule 3: First letter of the first word must be uppercase
    if not tokens[0][0].isupper():
        return "first_word_not_capitalized"
    
    return "valid"

# Apply validation
acc['account_validation'] = acc['account'].apply(validate_account)

# Summarize results
validation_summary = acc['account_validation'].value_counts()
print("Validation Results:")
print(validation_summary)

# Group and display issues (deduplicated, with affected rows)
issues = (
    acc[acc['account_validation'] != 'valid']
    .groupby(['account', 'account_validation'])
    .size()
    .reset_index(name='affected_rows')
)

if not issues.empty:
    print("\nIssues found:")
    print(issues)
else:
    print("\nPassed")

Validation Results:
account_validation
valid                         84
first_word_not_capitalized     1
Name: count, dtype: int64

Issues found:
   account          account_validation  affected_rows
0  dambase  first_word_not_capitalized              1


In [14]:
dq_issues.append({
    "dataset": "accounts",
    "column": "account",
    "issue": "Inconsistent Formatting",
    "details": "The first letter of the value is not capitalized.",
    "action": "Standardize the account names to have the first letter capitalized.",
    "status": "Open"
})

### **Validation for other Columns**

#### **Columns: sector, office_location**

Columns **sector** and **office_location** are expected to have repeated values; what needs to be checked is the accuracy and consistency of those values.

In [15]:
# Check for unexpected values and frequency counts
acc['sector'].value_counts(dropna=False)

sector
retail                17
technolgy             12
medical               12
marketing              8
finance                8
software               7
entertainment          6
telecommunications     6
services               5
employment             4
Name: count, dtype: int64

In [16]:
# Flag issue found
dq_issues.append({
    "dataset": "accounts",
    "column": "sector",
    "issue": "Spelling inconsistency",
    "details": "Found 'technolgy' instead of 'technology'",
    "action": "standardize spelling",
    "status": "Open"
})

dq_issues.append({
    "dataset": "accounts",
    "column": "sector",
    "issue": "Inconsistent Formatting",
    "details": "The first letter of the value is not capitalized.",
    "action": "Standardize the sector names to have the first letter capitalized.",
    "status": "Open"
})


In [17]:
# Check for unexpected values and frequency counts
acc['office_location'].value_counts(dropna=False)

office_location
United States    71
Kenya             1
Philipines        1
Japan             1
Italy             1
Norway            1
Korea             1
Jordan            1
Brazil            1
Germany           1
Panama            1
Belgium           1
Romania           1
Poland            1
China             1
Name: count, dtype: int64

In [18]:
# Flag issue found
dq_issues.append({
    "dataset": "accounts",
    "column": "office_location",
    "issue": "Spelling inconsistency",
    "details": "Found 'Philipines' instead of 'Philippines'",
    "action": "standardize spelling",
    "status": "Open"
})

Since these two columns have low cardinality, a frequency count was used to identify inconsistent or incorrect values. In this specific case, it is deemed enough to check data accuracy.
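For columns with more distinct values, reading a frequency table by eye becomes less reliable. The sketch below shows one possible automation, assuming a reference list of allowed values is available. `EXPECTED_SECTORS` and `find_unexpected` are hypothetical names introduced for illustration, and the list itself is an assumption built from the sector counts above with the spelling corrected.

```python
import difflib
import pandas as pd

# Hypothetical reference list; the real set of allowed sectors is an assumption
EXPECTED_SECTORS = [
    "retail", "technology", "medical", "marketing", "finance", "software",
    "entertainment", "telecommunications", "services", "employment",
]

def find_unexpected(values, expected, cutoff=0.8):
    """Map each value not in `expected` to its closest expected spelling, or None."""
    suggestions = {}
    for value in pd.Series(values).dropna().unique():
        if value not in expected:
            match = difflib.get_close_matches(value, expected, n=1, cutoff=cutoff)
            suggestions[value] = match[0] if match else None
    return suggestions

# Hypothetical sample; in the notebook this would be acc['sector']
sample = pd.Series(["retail", "technolgy", "medical", "technolgy"])
suggestions = find_unexpected(sample, EXPECTED_SECTORS)
print(suggestions)
```

The suggested spelling is only a hint based on string similarity, so each proposed correction should still be reviewed before it is applied.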

#### **Columns: year_established, employees** 

Columns **year_established** and **employees** are expected to contain positive whole numbers only.
Additionally, year_established must be a 4-digit year and cannot be greater than 2025.

The `acc.info()` output above already shows both columns as `int64`, so their values are whole numbers. What remains is to check for negative and unrealistic values.
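The data type confirmation can also be written as an explicit check rather than read from the `acc.info()` output. This is a minimal sketch on a hypothetical three-row frame; in the notebook, `df` would be `acc`.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical sample with the same column names as the accounts dataset
df = pd.DataFrame({
    "year_established": [1996, 1999, 1986],
    "employees": [2822, 495, 1185],
    "revenue": [1100.04, 251.41, 647.18],
})

# Columns that are expected to hold whole numbers
int_columns = ["year_established", "employees"]
dtype_check = {col: is_integer_dtype(df[col]) for col in int_columns}
print(dtype_check)
```

A column containing missing values would typically be read as `float64` and fail this check, which is itself a useful signal.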


In [19]:
# Confirm that all year_established values are four-digit years
(acc['year_established'].astype(str).str.len() != 4).sum()

np.int64(0)

In [20]:
# Check for negative or zero year_established values
(acc['year_established'] <= 0).sum()

np.int64(0)

In [21]:
# Check for unrealistic year_established values
(acc['year_established'] > 2025).sum()

np.int64(0)

Validation for **year_established** has confirmed that there are **no issues on the column values**.
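For integer years, the three checks above (four digits, positive, not greater than 2025) are equivalent to a single range test from 1000 to 2025. The sketch below shows that form on hypothetical values; `MIN_YEAR` and `MAX_YEAR` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be acc['year_established']
years = pd.Series([1996, 1979, 2031, 0, 850])

# 1000 is the smallest four-digit year; 2025 is the upper bound stated above
MIN_YEAR, MAX_YEAR = 1000, 2025

# between() is inclusive on both ends by default
invalid_years = years[~years.between(MIN_YEAR, MAX_YEAR)]
print(invalid_years.tolist())
```

Returning the offending values instead of a count makes a failed check easier to investigate.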

For the **employees** column, values are validated to ensure they are positive and fall within a realistic business range, with zero, negative, or extreme values flagged as potential data quality issues.

In [22]:
# Check for negative or zero employees values
(acc['employees'] <= 0).sum()

np.int64(0)

In [23]:
# Summary statistics for the employees column
acc['employees'].describe()


count       85.000000
mean      4660.823529
std       5715.601198
min          9.000000
25%       1179.000000
50%       2769.000000
75%       5595.000000
max      34288.000000
Name: employees, dtype: float64

Employee counts exhibit a right-skewed but realistic distribution, with no zero or negative values and a plausible upper bound. No data quality issues were identified for the **employees** column.
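The judgment of a "plausible upper bound" can be supported with a simple rule. One common option, shown here as a sketch and not as the method used in this notebook, is the interquartile range (IQR) upper fence. With the quartiles reported above, the fence would be 5595 + 1.5 × (5595 − 1179) = 12219, so larger values such as 13809 and 34288 would be surfaced for manual review. In a right-skewed distribution this does not by itself indicate an error. `iqr_upper_fence` is a hypothetical helper and the sample values are illustrative.

```python
import pandas as pd

def iqr_upper_fence(series, k=1.5):
    """Return Q3 + k * IQR, a conventional threshold for unusually high values."""
    q1, q3 = series.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

# Hypothetical sample; in the notebook this would be acc['employees']
employees = pd.Series([9, 495, 978, 1185, 1210, 2822, 13809])

fence = iqr_upper_fence(employees)
flagged = employees[employees > fence]
print("Upper fence:", fence)
print("Flagged for review:", flagged.tolist())
```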

#### **Column: revenue** 

Column revenue is expected to contain positive decimal numbers only.
Since the datatype has already been confirmed as float, the only remaining checks are for negative and zero values.

In [24]:
# Validate revenue column
def validate_revenue(revenue):

    # Rule 1: Must not be negative
    if revenue < 0:
        return "negative_value"

    # Rule 2: Must not be zero
    if revenue == 0:
        return "zero_not_allowed"

    return "valid"


# Apply validation
acc['revenue_validation'] = acc['revenue'].apply(validate_revenue)

# Summarize results
validation_summary = acc['revenue_validation'].value_counts()
print("Validation Results:")
print(validation_summary)

# Group and display issues (deduplicated, with affected rows)
issues = (
    acc[acc['revenue_validation'] != 'valid']
    .groupby(['revenue', 'revenue_validation'])
    .size()
    .reset_index(name='affected_rows')
)

if not issues.empty:
    print("\nIssues found:")
    print(issues)
else:
    print("\nPassed")



Validation Results:
revenue_validation
valid    85
Name: count, dtype: int64

Passed
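One caveat with the row-wise function above: a missing revenue would fall through both comparisons and be labeled "valid", because comparisons with NaN evaluate to False. That does not affect this dataset, which has no null revenue values. For reuse on other data, a vectorized variant with an explicit null rule could look like the sketch below. The `null_value` label mirrors the one used for the account check, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the notebook this would be acc['revenue']
revenue = pd.Series([1100.04, -5.0, 0.0, np.nan])

# Conditions are evaluated in order; the first match determines the label
conditions = [revenue.isna(), revenue < 0, revenue == 0]
labels = ["null_value", "negative_value", "zero_not_allowed"]

revenue_validation = pd.Series(
    np.select(conditions, labels, default="valid"),
    index=revenue.index,
)
print(revenue_validation.value_counts())
```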


#### **Column: subsidiary_of**

The **subsidiary_of** column is expected to contain null values. When populated, its value must reference an existing account in the **account** column, indicating a valid parent–subsidiary relationship.

In [25]:
# check for non-null values in subsidiary_of column

acc['subsidiary_of'].value_counts(dropna=False)

subsidiary_of
NaN                 70
Acme Corporation     4
Sonron               3
Bubba Gump           2
Golddex              2
Inity                2
Massive Dynamic      1
Warephase            1
Name: count, dtype: int64

In [26]:
# Extract non-null values from subsidiary_of column and compare with account column
non_null_subsidiaries = acc['subsidiary_of'].dropna().unique()
existing_accounts = acc['account'].unique()

# Check which subsidiary_of values don't exist in account column
invalid_subsidiaries = [sub for sub in non_null_subsidiaries if sub not in existing_accounts]

if invalid_subsidiaries:
    print(f"Failed: {invalid_subsidiaries}")
else:
    print("Passed")

Passed
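The same referential check can be expressed with `Series.isin`, which keeps the affected rows available for inspection. The sketch below also tests for accounts listed as their own parent, an additional rule that is an assumption here and not a requirement stated in this notebook. The sample frame is hypothetical; in the notebook, `df` would be `acc`.

```python
import pandas as pd

# Hypothetical sample with one valid reference and one orphaned reference
df = pd.DataFrame({
    "account": ["Acme Corporation", "Sonron", "Zoomit"],
    "subsidiary_of": [None, "Acme Corporation", "Unknown Co"],
})

populated = df["subsidiary_of"].notna()

# Rows whose parent does not exist in the account column
orphans = df.loc[
    populated & ~df["subsidiary_of"].isin(df["account"]),
    ["account", "subsidiary_of"],
]

# Assumed extra rule: an account should not be its own parent
self_refs = df.loc[populated & (df["subsidiary_of"] == df["account"])]

print(orphans)
print("Self references:", len(self_refs))
```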


In [27]:
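# Append this notebook's flagged issues to the shared issues file
from pathlib import Path

dq_issues_df = pd.DataFrame(dq_issues)
issues_path = Path('raw_data_issues.csv')
# Write the header only when the file does not exist yet
dq_issues_df.to_csv(issues_path, mode='a', index=False, header=not issues_path.exists())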

dq_issues_df

|    | dataset | column | issue | details | action | status |
|---|---|---|---|---|---|---|
| 0 | accounts | account | Inconsistent Formatting | The first letter of the value is not capitalized. | Standardize the account names to have the firs... | Open |
| 1 | accounts | sector | Spelling inconsistency | Found 'technolgy' instead of 'technology' | standardize spelling | Open |
| 2 | accounts | sector | Inconsistent Formatting | The first letter of the value is not capitalized. | Standardize the sector names to have the first... | Open |
| 3 | accounts | office_location | Spelling inconsistency | Found 'Philipines' instead of 'Philippines' | standardize spelling | Open |


**Data validation indicated a high level of overall data quality, with only minor accuracy and structural issues identified. These issues were documented and will be handled during cleanup.**