# Dataset Validation - Sales Teams

This notebook performs comprehensive data validation on the *sales_teams* dataset used for the CRM pricing analysis project.

In [1]:
# Import libraries
import pandas as pd
import re

# Initialize list to collect data quality issues
dq_issues = []

In [2]:
# Load dataset
team = pd.read_csv('/Users/Gio Noga/Documents/Data Analysis 101/repos/gn-data-crm_pricing_analysis/raw_dataset/sales_teams.csv')

### **Get General information about the Dataset**

In [3]:
# Check Dataset size
team.shape

(35, 3)

In [4]:
# Summary of dataset
team.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   sales_agent      35 non-null     object
 1   manager          35 non-null     object
 2   regional_office  35 non-null     object
dtypes: object(3)
memory usage: 972.0+ bytes
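The `info()` summary can also be turned into explicit checks. The following is a minimal sketch, assuming the three column names shown above are the expected schema; `EXPECTED_COLUMNS` and the `sample` frame are hypothetical stand-ins and not part of the original notebook.

```python
import pandas as pd

# Assumption: the expected schema is exactly the three columns reported by info()
EXPECTED_COLUMNS = ["sales_agent", "manager", "regional_office"]

# Hypothetical stand-in for `team`, with one missing manager for illustration
sample = pd.DataFrame({
    "sales_agent": ["Anna Snelling", "Cecily Lampkin"],
    "manager": ["Dustin Brinkmann", None],
    "regional_office": ["Central", "Central"],
})

# Column names and order match the expected schema
columns_match = list(sample.columns) == EXPECTED_COLUMNS

# Number of missing values per column
null_counts = sample.isna().sum()

print("Columns match:", columns_match)
print(null_counts[null_counts > 0])
```

In the notebook itself, `sample` would be replaced by `team`.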


In [5]:
# Get sample data
team.head(3)

         sales_agent           manager regional_office
0      Anna Snelling  Dustin Brinkmann         Central
1     Cecily Lampkin  Dustin Brinkmann         Central
2  Versie Hillebrand  Dustin Brinkmann         Central


**Validation Approach**

- **Primary Key Validation**: Confirm that the `sales_agent` column contains unique, non-null values.
- **Value Accuracy**: Confirm consistent spelling, spacing, and capitalization.

### **Validate Original Primary Key Column**

#### **Column: sales_agent**

**sales_agent** is the primary key (PK) of the dataset, so it is checked for duplicates first.

In [6]:
# Check for duplicate values in PK column
team['sales_agent'].nunique(dropna=False)

35

The number of unique values in **sales_agent** equals the number of entries in the dataset (35), which indicates that all values are unique.
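The same check can be made explicit. Note that `nunique(dropna=False)` counts a single missing value as one distinct value, so a separate null check is useful for a primary key. The sketch below uses a hypothetical `sample` frame with a deliberate duplicate and a null. In the notebook, `team` would take its place.

```python
import pandas as pd

# Hypothetical sample with one duplicated key and one missing key
sample = pd.DataFrame({
    "sales_agent": ["Anna Snelling", "Cecily Lampkin", "Anna Snelling", None],
})

pk = sample["sales_agent"]

# True only when every value in the column is distinct
pk_is_unique = pk.is_unique

# Number of missing keys
pk_null_count = int(pk.isna().sum())

# Every row whose key appears more than once (keep=False marks all copies)
pk_duplicates = sample[pk.duplicated(keep=False)]

print("Unique:", pk_is_unique, "| Nulls:", pk_null_count)
print(pk_duplicates)
```

Listing the duplicated rows, rather than only counting distinct values, shows which records would need to be resolved.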

Now, validation for accuracy and consistency will be done.

In [7]:
# Validate sales_agent column with comprehensive rules

# ---------------------------
# Validate sales_agent column
# ---------------------------
def validate_agent_name(sales_agent):

    if pd.isna(sales_agent):
        return "null_value"

    name = str(sales_agent).strip()

    # Rule 1: Minimum trimmed length
    if len(name) < 5:
        return "too_short"

    # Rule 2: Excessive whitespace
    if "  " in name:
        return "excessive_whitespace"

    # Rule 3: Allowed characters only
    if not re.match(r"^[a-zA-Z\s\-']+$", name):
        return "invalid_characters"

    # Rule 4: At least two tokens
    tokens = name.split()
    if len(tokens) < 2:
        return "insufficient_tokens"

    return "valid"


# ---------------------------
# Run validation (NO column persisted)
# ---------------------------
agent_validation_results = team['sales_agent'].apply(validate_agent_name)

# ---------------------------
# Summary
# ---------------------------
print("Validation Results:")
print(agent_validation_results.value_counts())

# ---------------------------
# Extract issues ONLY (do NOT overwrite dq issues table)
# ---------------------------
sales_agent_validation_issues = (
    team
    .assign(name_validation=agent_validation_results)  # temporary only
    .loc[agent_validation_results != 'valid', ['sales_agent', 'name_validation']]
)

if not sales_agent_validation_issues.empty:
    print("\nIssues found:")
    print(sales_agent_validation_issues)
else:
    print("\nPassed")

Validation Results:
sales_agent
valid    35
Name: count, dtype: int64

Passed


### **Validation for other Columns**

#### **Column: manager**

The **manager** column may contain duplicate values. Validation focuses on accuracy and consistency.

In [8]:
# Show unique values
team['manager'].value_counts(dropna=False)

manager
Melvin Marxen       6
Cara Losch          6
Celia Rouche        6
Rocco Neubert       6
Summer Sewald       6
Dustin Brinkmann    5
Name: count, dtype: int64
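Reading `value_counts()` output by eye works for six managers, but near-identical spellings can be hard to spot in longer lists. The following is a minimal sketch of one way to flag them, assuming that values differing only in case or whitespace refer to the same person. The helper `find_spelling_variants` and the sample data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def find_spelling_variants(values: pd.Series) -> pd.DataFrame:
    """Return raw values whose normalized form is shared with a different raw spelling."""
    raw = values.dropna().astype(str)
    # Normalize: trim, collapse whitespace runs to one space, ignore case
    key = raw.str.strip().str.replace(r"\s+", " ", regex=True).str.casefold()
    df = pd.DataFrame({"raw": raw, "key": key}).drop_duplicates("raw")
    # Count distinct raw spellings per normalized key
    n_spellings = df.groupby("key")["raw"].transform("nunique")
    return df[n_spellings > 1]

# Hypothetical sample containing three spellings of the same manager
sample = pd.Series(
    ["Cara Losch", "cara losch", "Cara  Losch", "Rocco Neubert", "Rocco Neubert"],
    name="manager",
)

variants = find_spelling_variants(sample)
print(variants)
```

An empty result would indicate that no two distinct raw values collapse to the same normalized name.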

In [9]:
# Validation function (same rules as the sales_agent column)
def validate_manager_name(manager):

    if pd.isna(manager):
        return "null_value"

    name = str(manager).strip()

    # Rule 1: Minimum trimmed length
    if len(name) < 5:
        return "too_short"

    # Rule 2: Excessive whitespace
    if "  " in name:
        return "excessive_whitespace"

    # Rule 3: Allowed characters only
    if not re.match(r"^[a-zA-Z\s\-']+$", name):
        return "invalid_characters"

    # Rule 4: At least two tokens
    tokens = name.split()
    if len(tokens) < 2:
        return "insufficient_tokens"

    return "valid"


# --- Optimization: validate unique manager names only ---
# Keep NaN among the unique values so missing managers are reported as "null_value"
unique_managers = team['manager'].unique()

manager_validation = pd.DataFrame({
    "manager": unique_managers,
    "name_validation": [validate_manager_name(m) for m in unique_managers]
})

# Summary
validation_summary = manager_validation['name_validation'].value_counts()
print("Validation Results:")
print(validation_summary)

# DO NOT overwrite your main issues table
manager_name_validation_issues = manager_validation[
    manager_validation['name_validation'] != 'valid'
]

if not manager_name_validation_issues.empty:
    print("\nIssues found:")
    print(manager_name_validation_issues)
else:
    print("\nPassed")


Validation Results:
name_validation
valid    6
Name: count, dtype: int64

Passed
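Because the manager rules are identical to the sales_agent rules, the two functions could be merged into one shared helper. The sketch below is one possible refactoring, not the notebook's actual code: `validate_person_name`, `validate_name_column`, and the sample data are hypothetical names, and `fullmatch` is used in place of `re.match` with `^...$` anchors.

```python
import re
import pandas as pd

# Letters, whitespace, hyphens, and apostrophes only
NAME_PATTERN = re.compile(r"[A-Za-z\s\-']+")

def validate_person_name(value) -> str:
    """Apply the four name rules and return a status label."""
    if pd.isna(value):
        return "null_value"
    name = str(value).strip()
    if len(name) < 5:                       # Rule 1: minimum trimmed length
        return "too_short"
    if "  " in name:                        # Rule 2: excessive whitespace
        return "excessive_whitespace"
    if not NAME_PATTERN.fullmatch(name):    # Rule 3: allowed characters only
        return "invalid_characters"
    if len(name.split()) < 2:               # Rule 4: at least two tokens
        return "insufficient_tokens"
    return "valid"

def validate_name_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Validate each distinct value of a name column (NaN is kept so nulls are reported)."""
    unique_values = df[column].unique()
    return pd.DataFrame({
        column: unique_values,
        "name_validation": [validate_person_name(v) for v in unique_values],
    })

# Hypothetical sample covering every rule
sample = pd.DataFrame({
    "manager": ["Anna Snelling", "Bo", "Madonna", "Jean  Luc", "Agent 007", None],
})

result = validate_name_column(sample, "manager")
print(result)
```

The same call would work for either column, for example `validate_name_column(team, 'sales_agent')`.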


#### **Column: regional_office**

The **regional_office** column is expected to contain duplicate values. Validation focuses on accuracy and consistency.

In [10]:
# Show unique values
team['regional_office'].value_counts(dropna=False)

regional_office
East       12
West       12
Central    11
Name: count, dtype: int64

In [11]:
# Validation function for location (regional_office)
def validate_location(location):
    # Missing values would otherwise become the string "nan" and pass as valid
    if pd.isna(location):
        return "null_value"

    location = str(location).strip()

    # Rule 1: Excessive internal whitespace
    if "  " in location:
        return "excessive_whitespace"

    # Rule 2: Letters and spaces only
    if not re.fullmatch(r"[A-Za-z\s]+", location):
        return "invalid_characters"

    return "valid"

# --- Validate unique locations only (NaN kept so nulls are reported) ---
unique_locations = team['regional_office'].unique()

validation_results = pd.DataFrame({
    'regional_office': unique_locations,
    'location_validation': [validate_location(loc) for loc in unique_locations]
})

# --- Summarize results ---
validation_summary = validation_results['location_validation'].value_counts()
print("Validation Results:")
print(validation_summary)

# --- Show invalid locations ---
issues = validation_results[validation_results['location_validation'] != 'valid']
if len(issues) > 0:
    print("\nIssues found:")
    print(issues)
else:
    print("\nPassed")


Validation Results:
location_validation
valid    3
Name: count, dtype: int64

Passed
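The character-level rules accept any alphabetic string, so a misspelling such as "Wset" would still pass. For a low-cardinality column, an allow-list comparison can catch this. The sketch below assumes that the three offices observed in this dataset are the only valid ones; `EXPECTED_OFFICES` and the sample series are hypothetical.

```python
import pandas as pd

# Assumption: the offices observed in this dataset are the complete valid set
EXPECTED_OFFICES = {"Central", "East", "West"}

# Hypothetical sample with a typo, a case variant, and a missing value
sample = pd.Series(
    ["Central", "East", "West", "Wset", "east", None],
    name="regional_office",
)

# Rows whose value is not in the allow-list (missing values are flagged too)
unexpected = sample[~sample.isin(EXPECTED_OFFICES)]

print(unexpected)
```

In the notebook, `sample` would be replaced by `team['regional_office']`. An empty result means every value matches an expected office exactly.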


**There are no data quality issues in the dataset. No further cleaning is required.**

In [12]:
# Save the cleaned dataframe to a new CSV file for analysis
team.to_csv('C:\\Users\\Gio Noga\\Documents\\Data Analysis 101\\repos\\gn-data-crm_pricing_analysis\\clean_dataset\\03_sales_teams_cleaned.csv', index=False)
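The load and save cells use absolute paths written in two different styles. One way to keep them consistent and portable is to build both from a single root with `pathlib`. The sketch below is illustrative only: `project_root` points at a temporary directory so the example runs anywhere, and the stand-in data is hypothetical. In the notebook it would point at the repository folder instead.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical project root (a temporary directory, so the sketch is runnable)
project_root = Path(tempfile.mkdtemp())
raw_dir = project_root / "raw_dataset"
clean_dir = project_root / "clean_dataset"
raw_dir.mkdir(parents=True, exist_ok=True)
clean_dir.mkdir(parents=True, exist_ok=True)

# Stand-in raw file so the example has something to load
pd.DataFrame({
    "sales_agent": ["Anna Snelling"],
    "manager": ["Dustin Brinkmann"],
    "regional_office": ["Central"],
}).to_csv(raw_dir / "sales_teams.csv", index=False)

# Load and save using paths built from the same root
team = pd.read_csv(raw_dir / "sales_teams.csv")
out_path = clean_dir / "03_sales_teams_cleaned.csv"
team.to_csv(out_path, index=False)

# Read the file back to confirm the round trip preserved the data
reloaded = pd.read_csv(out_path)
print(out_path.name, reloaded.shape)
```

The `/` operator on `Path` objects inserts the correct separator for the operating system, which avoids mixing `\\` and `/` by hand.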