# Dataset Validation - Sales Pipeline

This notebook performs comprehensive data validation on the *sales_pipeline* dataset used for the CRM pricing analysis project.

In [1]:
# import libraries
import pandas as pd
import re
import numpy as np

# initialize list to collect data quality issues
dq_issues = []

In [2]:
# load dataset
sales = pd.read_csv('/Users/Gio Noga/Documents/Data Analysis 101/repos/gn-data-crm_pricing_analysis/raw_dataset/sales_pipeline.csv')

### **Get General information about the Dataset**

In [3]:
# Get dataset size
sales.shape

(8800, 8)

In [4]:
# Get structure overview of the dataset
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8800 entries, 0 to 8799
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   opportunity_id  8800 non-null   object 
 1   sales_agent     8800 non-null   object 
 2   product         8800 non-null   object 
 3   account         7375 non-null   object 
 4   deal_stage      8800 non-null   object 
 5   engage_date     8300 non-null   object 
 6   close_date      6711 non-null   object 
 7   close_value     6711 non-null   float64
dtypes: float64(1), object(7)
memory usage: 550.1+ KB


In [5]:
# Get summary statistics
sales.describe(include='all')

Unnamed: 0,opportunity_id,sales_agent,product,account,deal_stage,engage_date,close_date,close_value
count,8800,8800,8800,7375,8800,8300,6711,6711.0
unique,8800,30,7,85,4,421,306,
top,1C1I7A6R,Darcel Schlecht,GTX Basic,Hottechi,Won,2017-07-22,2017-05-22,
freq,1,747,1866,200,4238,66,41,
mean,,,,,,,,1490.915512
std,,,,,,,,2320.670773
min,,,,,,,,0.0
25%,,,,,,,,0.0
50%,,,,,,,,472.0
75%,,,,,,,,3225.0


In [6]:
# Display first 3 rows
sales.head(3)

Unnamed: 0,opportunity_id,sales_agent,product,account,deal_stage,engage_date,close_date,close_value
0,1C1I7A6R,Moses Frase,GTX Plus Basic,Cancity,Won,2016-10-20,2017-03-01,1054.0
1,Z063OYW0,Darcel Schlecht,GTXPro,Isdom,Won,2016-10-25,2017-03-11,4514.0
2,EC4QE1BX,Darcel Schlecht,MG Special,Cancity,Won,2016-10-25,2017-03-07,50.0


In [7]:
# display last 3 rows
sales.tail(3)

Unnamed: 0,opportunity_id,sales_agent,product,account,deal_stage,engage_date,close_date,close_value
8797,LIB4KUZJ,Versie Hillebrand,MG Advanced,,Prospecting,,,
8798,18IUIUK0,Versie Hillebrand,MG Advanced,,Prospecting,,,
8799,8I5ONXJX,Versie Hillebrand,MG Advanced,,Prospecting,,,


**Validation Approach**

- **Primary Key Validation**: Ensure the `opportunity_id` column contains unique, non-null values.
- **Text Standardization**: Check string columns for consistent spelling, spacing, and capitalization.
- **Date Format Consistency**: Ensure date columns use a uniform, valid `YYYY-MM-DD` format.
- **Numeric Value Validity**: Confirm numeric fields fall within acceptable ranges (no negative values).

### **Primary Key Column**

#### **Column: opportunity_id**

**opportunity_id** is the primary key of the dataset. It is validated for duplicates and checked to ensure all values are exactly 8 characters long. A null ID has no length, so the length check also fails on missing values.

In [8]:
# Check for duplicate IDs and confirm every ID is exactly 8 characters long
duplicate_count = sales['opportunity_id'].duplicated().sum()
all_len_8 = (sales['opportunity_id'].str.len() == 8).all()

if duplicate_count == 0 and all_len_8:
    print("Passed")
else:
    print("Failed")

Passed
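
The check above reports only pass or fail. As a minimal sketch of a more diagnostic variant, the helper below returns the rows that break each rule. The `pk_report` function and the sample IDs are hypothetical and not part of this notebook.

```python
import pandas as pd

def pk_report(df, column, expected_len=8):
    """Return rows whose key is null, duplicated, or not expected_len long."""
    key = df[column]
    is_null = key.isna()
    # keep=False marks every member of a duplicate group, not just the repeats
    is_dup = key.duplicated(keep=False) & ~is_null
    lengths = key.str.len()
    bad_len = lengths.notna() & (lengths != expected_len)

    report = df.loc[is_null | is_dup | bad_len, [column]].copy()
    report["is_null"] = is_null
    report["is_duplicate"] = is_dup
    report["bad_length"] = bad_len
    return report

# Made-up sample: one duplicate pair, one short ID, one missing ID
sample = pd.DataFrame(
    {"opportunity_id": ["1C1I7A6R", "Z063OYW0", "Z063OYW0", "SHORT", None]}
)
report = pk_report(sample, "opportunity_id")
print(report)
```

An empty report corresponds to the `Passed` branch; a non-empty one shows which rows to fix and why.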


### **Validation for other Columns**

#### **Columns: sales_agent, product, account**

Columns **sales_agent**, **product**, and **account** are foreign key string fields. Duplicate values are expected, but every value must be validated for accuracy and consistency. **sales_agent** and **product** should contain no null values; **account** may be null only while the deal is still in the Prospecting or Engaging stage.

In [9]:
# Validate sales_agent column
def validate_sales_agent(sales_agent):
    
    if pd.isna(sales_agent):
        return "null_value"
    
    name = str(sales_agent).strip()
    
    # Rule 1: Minimum length after trimming
    if len(name) < 5:
        return "too_short"
    
    # Rule 2: Excessive internal whitespace
    if "  " in name:
        return "excessive_whitespace"
    
    # Rule 3: Allowed characters only (letters, spaces, hyphen, apostrophe)
    if not re.fullmatch(r"[A-Za-z\s\-']+", name):
        return "invalid_characters"
    
    # Rule 4: Must contain at least two name tokens
    if len(name.split()) < 2:
        return "insufficient_tokens"
    
    return "valid"

# Apply validation
sales['sales_agent_validation'] = sales['sales_agent'].apply(validate_sales_agent)

# Summarize results
validation_summary = sales['sales_agent_validation'].value_counts()
print("Validation Results:")
print(validation_summary)

# Display issues if any
issues = sales[sales['sales_agent_validation'] != 'valid']
if not issues.empty:
    print("\nIssues found:")
    print(issues[['sales_agent', 'sales_agent_validation']])
else:
    print("\nPassed")


Validation Results:
sales_agent_validation
valid    8800
Name: count, dtype: int64

Passed


In [10]:
# Validate product column
def validate_product(product):
    
    if pd.isna(product):
        return "null_value"
    
    name = str(product).strip()
    
    # Rule 1: Excessive internal whitespace
    if "  " in name:
        return "excessive_whitespace"
    
    # Rule 2: Letters, numbers, and spaces only
    if not re.fullmatch(r"[A-Za-z0-9\s]+", name):
        return "invalid_characters"

    
    tokens = name.split()
    
    # Rule 3: Product name must have 2-3 words
    if len(tokens) < 2 or len(tokens) > 3:
        return "invalid_word_count"
    
    # Rule 4: First word must be all uppercase
    if not tokens[0].isupper():
        return "first_word_not_uppercase"
    
    return "valid"

# Apply validation
sales['product_validation'] = sales['product'].apply(validate_product)

# Summarize results
validation_summary = sales['product_validation'].value_counts()
print("Validation Results:")
print(validation_summary)

# Group and display issues (deduplicated output)
issues = (
    sales[sales['product_validation'] != 'valid']
    .groupby(['product', 'product_validation'])
    .size()
    .reset_index(name='affected_rows')
)

if not issues.empty:
    print("\nIssues found:")
    print(issues)
else:
    print("\nPassed")



Validation Results:
product_validation
valid                 7320
invalid_word_count    1480
Name: count, dtype: int64

Issues found:
  product  product_validation  affected_rows
0  GTXPro  invalid_word_count           1480


In [11]:
# Flag issues found
dq_issues.append({
    "dataset": "sales_pipeline",
    "column": "product",
    "issue": "Inconsistent Formatting",
    "details": "Product value 'GTXPro' is missing a space between 'GTX' and 'Pro'.",
    "action": "Standardize product name to 'GTX Pro'.",
    "status": "Open"
})
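
The action recorded above could later be applied with an explicit mapping from known bad spellings to canonical names. The following is only a sketch: the `PRODUCT_FIXES` dictionary and the sample values are assumptions for illustration, and the actual cleanup is outside the scope of this notebook.

```python
import pandas as pd

# Hypothetical mapping of known bad spellings to canonical product names
PRODUCT_FIXES = {"GTXPro": "GTX Pro"}

# Made-up sample, including a padded variant of the bad spelling
sample = pd.DataFrame(
    {"product": ["GTXPro", "GTX Basic", " GTXPro ", "MG Special"]}
)

# Trim stray whitespace first so the mapping also matches padded values,
# then replace exact matches only (other names are left untouched)
cleaned = sample["product"].str.strip().replace(PRODUCT_FIXES)
print(cleaned.tolist())
```

An explicit mapping keeps the fix auditable: only values listed in the dictionary change.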

In [12]:
# Validate account column
def validate_account(account, deal_stage):
    
    # Allow NULL only if deal_stage is Prospecting or Engaging
    if pd.isna(account):
        if deal_stage in ["Prospecting", "Engaging"]:
            return "valid"
        else:
            return "null_not_allowed"
    
    name = str(account).strip()
    
    # Rule 1: At least one word
    tokens = name.split()
    if len(tokens) < 1:
        return "no_words"
    
    # Rule 2: Allowed characters only
    if not re.fullmatch(r"[A-Za-z0-9\s&\-\.\']+", name):
        return "invalid_characters"
    
    # Rule 3: First letter of the first word must be uppercase
    if not tokens[0][0].isupper():
        return "first_word_not_capitalized"
    
    return "valid"

# Apply validation (row-wise because we need deal_stage)
sales['account_validation'] = sales.apply(
    lambda row: validate_account(row['account'], row['deal_stage']),
    axis=1
)

# Summarize results
validation_summary = sales['account_validation'].value_counts()
print("Validation Results:")
print(validation_summary)

# Group and display issues
issues = (
    sales[sales['account_validation'] != 'valid']
    .groupby(['account', 'deal_stage', 'account_validation'])
    .size()
    .reset_index(name='affected_rows')
)

if not issues.empty:
    print("\nIssues found:")
    print(issues)
else:
    print("\nPassed")

Validation Results:
account_validation
valid                         8708
first_word_not_capitalized      92
Name: count, dtype: int64

Issues found:
   account deal_stage          account_validation  affected_rows
0  dambase   Engaging  first_word_not_capitalized              1
1  dambase       Lost  first_word_not_capitalized             37
2  dambase        Won  first_word_not_capitalized             54


In [13]:
dq_issues.append({
    "dataset": "sales_pipeline",
    "column": "account",
    "issue": "Inconsistent Formatting",
    "details": "The first letter of the value 'dambase' is not capitalized.",
    "action": "Standardize the account names to have the first letter capitalized.",
    "status": "Open"
})
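
One possible way to carry out this action during cleanup is to upper-case only the first character of each account name. This is a sketch on made-up sample values, not the notebook's actual cleanup step.

```python
import pandas as pd

# Made-up sample; None stands in for accounts that are not yet assigned
sample = pd.Series(["dambase", "Hottechi", "Cancity", None], name="account")

# Upper-case the first character and keep the rest of the string unchanged.
# Missing values stay missing.
fixed = sample.str[:1].str.upper() + sample.str[1:]
print(fixed.tolist())
```

Slicing is used instead of `str.capitalize()` because `capitalize()` also lower-cases every remaining character, which would alter names that contain internal capitals.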

#### **Column: deal_stage**

Validation was performed by comparing all values against a predefined list of allowed deal stages to identify any unexpected or incorrectly formatted entries.

In [14]:
allowed_stages = {
    "Prospecting",
    "Engaging",
    "Won",
    "Lost"
}

invalid_stages = sales.loc[
    ~sales['deal_stage'].isin(allowed_stages),
    'deal_stage'
].value_counts(dropna=False)

if invalid_stages.empty:
    print("Passed")
else:
    print("Failed")
    print(invalid_stages)


Passed
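
If the check had failed, it would help to separate values that differ from an allowed stage only by case or whitespace from values that are genuinely unknown. The sketch below illustrates one way to do that on made-up sample values; the `lookup` dictionary and the `fixable`/`unknown` names are hypothetical.

```python
import pandas as pd

ALLOWED_STAGES = {"Prospecting", "Engaging", "Won", "Lost"}

# Map a normalized (lower-case) form back to the canonical spelling
lookup = {stage.lower(): stage for stage in ALLOWED_STAGES}

# Made-up sample with a formatting variant, an unknown value, and a null
sample = pd.Series(["Won", "won ", "Lost", "Closed", None], name="deal_stage")

normalized = sample.str.strip().str.lower().map(lookup)

# Not an exact match, but recoverable after trimming and case-folding
fixable = sample.notna() & ~sample.isin(ALLOWED_STAGES) & normalized.notna()
# Present, but matches no allowed stage even after normalization
unknown = sample.notna() & normalized.isna()

print(sample[fixable].tolist())
print(sample[unknown].tolist())
```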


#### **Columns: engage_date, close_date**

Validation ensures that values uniformly follow the YYYY-MM-DD format. Null values are expected, but they are allowed only for certain **deal_stage** values.

In [15]:
# Validate engage_date column
def is_valid_date(date_value):
    try:
        pd.to_datetime(date_value, format="%Y-%m-%d", errors="raise")
        return True
    except Exception:
        return False


invalid_engage_date = sales[
    (
        # Non-null values must follow YYYY-MM-DD
        sales['engage_date'].notna() &
        ~sales['engage_date'].apply(is_valid_date)
    )
    |
    (
        # Null allowed only for Prospecting/Engaging
        (sales['engage_date'].isna() | sales['engage_date'].astype(str).str.strip().eq('')) &
        (~sales['deal_stage'].isin(['Prospecting', 'Engaging']))
    )
]

if invalid_engage_date.empty:
    print("Passed")
else:
    print("Failed")
    print(invalid_engage_date[['deal_stage', 'engage_date']])


Passed


In [16]:
# Validate close_date column
# Reuses is_valid_date() defined in the engage_date cell above


invalid_close_date = sales[
    (
        # Non-null values must follow YYYY-MM-DD
        sales['close_date'].notna() &
        ~sales['close_date'].apply(is_valid_date)
    )
    |
    (
        # Null allowed only for Prospecting/Engaging
        (sales['close_date'].isna() | sales['close_date'].astype(str).str.strip().eq('')) &
        (~sales['deal_stage'].isin(['Prospecting', 'Engaging']))
    )
]

if invalid_close_date.empty:
    print("Passed")
else:
    print("Failed")
    print(invalid_close_date[['deal_stage', 'close_date']])

Passed
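
The two date checks above call `is_valid_date` once per row. A vectorized alternative is sketched below: `pd.to_datetime` with `errors="coerce"` turns unparseable values into `NaT`, so format errors show up as values that were present before parsing and missing afterwards. The sample frame and the `OPEN_STAGES` name are assumptions for illustration.

```python
import pandas as pd

# Made-up sample: a wrong format, an impossible date, and two nulls
sample = pd.DataFrame({
    "deal_stage": ["Won", "Lost", "Prospecting", "Won", "Engaging"],
    "close_date": ["2017-03-01", "03/11/2017", None, None, "2017-13-40"],
})

# Stages where a missing date is acceptable (same rule as the cells above)
OPEN_STAGES = ["Prospecting", "Engaging"]

# Values that do not parse as YYYY-MM-DD become NaT instead of raising
parsed = pd.to_datetime(sample["close_date"], format="%Y-%m-%d", errors="coerce")

bad_format = sample["close_date"].notna() & parsed.isna()
missing_not_allowed = (
    sample["close_date"].isna() & ~sample["deal_stage"].isin(OPEN_STAGES)
)

flagged = sample[bad_format | missing_not_allowed]
print(flagged)
```

The same pattern works for `engage_date` by swapping the column name.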


#### **Columns: close_value**

Validation ensures values are numeric and aligned with the deal stage, allowing nulls or zero only where appropriate.

In [17]:
# Validate close_value column
def validate_close_value(close_value, deal_stage):

    # Rule 1: Null allowed only for Prospecting or Engaging
    if pd.isna(close_value) or str(close_value).strip() == "":
        if deal_stage in ['Prospecting', 'Engaging']:
            return "valid"
        return "null_value"

    # Convert to numeric
    try:
        value = float(close_value)
    except Exception:
        return "invalid_number"

    # Rule 2: Must not be negative
    if value < 0:
        return "negative_value"

    # Rule 3: Zero allowed only for Lost
    if value == 0 and deal_stage != 'Lost':
        return "zero_not_allowed"

    return "valid"


# Apply validation
sales['close_value_validation'] = sales.apply(
    lambda row: validate_close_value(row['close_value'], row['deal_stage']),
    axis=1
)

# Summarize results
validation_summary = sales['close_value_validation'].value_counts()
print("Validation Results:")
print(validation_summary)

# Group and display issues (deduplicated, with affected rows)
issues = (
    sales[sales['close_value_validation'] != 'valid']
    .groupby(['close_value', 'close_value_validation'])
    .size()
    .reset_index(name='affected_rows')
)

if not issues.empty:
    print("\nIssues found:")
    print(issues)
else:
    print("\nPassed")



Passed
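
The row-wise rules for `close_value` can also be written as vectorized conditions with `np.select`, which evaluates them in order and keeps the first match. The sketch below uses a made-up sample frame. It also passes `dropna=False` to `groupby`, because pandas drops rows whose group key is NaN by default, which would hide any `null_value` rows when grouping on `close_value`.

```python
import numpy as np
import pandas as pd

# Made-up sample covering each rule
sample = pd.DataFrame({
    "deal_stage": ["Won", "Lost", "Engaging", "Won", "Won", "Lost"],
    "close_value": [1054.0, 0.0, np.nan, 0.0, np.nan, -5.0],
})

value = sample["close_value"]
open_stage = sample["deal_stage"].isin(["Prospecting", "Engaging"])

# Conditions are checked in order; the first one that is True wins
conditions = [
    value.isna() & ~open_stage,                       # null outside open stages
    value < 0,                                        # negative amount
    (value == 0) & (sample["deal_stage"] != "Lost"),  # zero on a non-Lost deal
]
labels = ["null_value", "negative_value", "zero_not_allowed"]
sample["close_value_validation"] = np.select(conditions, labels, default="valid")

# dropna=False keeps groups whose close_value is NaN
issues = (
    sample[sample["close_value_validation"] != "valid"]
    .groupby(["close_value", "close_value_validation"], dropna=False)
    .size()
    .reset_index(name="affected_rows")
)
print(sample["close_value_validation"].value_counts())
print(issues)
```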


In [18]:
# Append locally flagged issues to the shared issue log
import os

dq_issues_df = pd.DataFrame(dq_issues)
# Write the header only when the file does not exist yet
dq_issues_df.to_csv('raw_data_issues.csv', mode='a', index=False, header=not os.path.exists('raw_data_issues.csv'))


dq_issues_df

Unnamed: 0,dataset,column,issue,details,action,status
0,sales_pipeline,product,Inconsistent Formatting,Product value 'GTXPro' is missing a space betw...,Standardize product name to 'GTX Pro'.,Open
1,sales_pipeline,account,Inconsistent Formatting,The first letter of the value 'dambase' is not...,Standardize the account names to have the firs...,Open
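
Because the issue log is written in append mode, re-running the cell adds the same rows again. One way to make the step repeatable is to merge the new issues with the existing file and drop exact duplicates before writing. The `append_issues` helper below is a hypothetical sketch that writes to a temporary directory rather than the project's real log.

```python
import tempfile
from pathlib import Path

import pandas as pd

def append_issues(issues, path):
    """Merge new issue records into the CSV at path, skipping exact duplicates."""
    combined = pd.DataFrame(issues)
    if path.exists():
        combined = pd.concat([pd.read_csv(path), combined], ignore_index=True)
    combined.drop_duplicates().to_csv(path, index=False)

product_issue = {"dataset": "sales_pipeline", "column": "product", "status": "Open"}
account_issue = {"dataset": "sales_pipeline", "column": "account", "status": "Open"}

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "raw_data_issues.csv"
    append_issues([product_issue], log_path)
    append_issues([product_issue], log_path)  # simulated re-run of the same cell
    append_issues([account_issue], log_path)
    log = pd.read_csv(log_path)

print(log)
```

The trade-off is that the whole file is rewritten on each call, which is fine for a small issue log but less suitable for very large ones.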


**Data validation indicated a high level of overall data quality, with only two minor formatting issues identified: the product value 'GTXPro' is missing a space, and the account name 'dambase' is not capitalized. Both issues were documented and will be handled during cleanup.**