# 1. Introduction

---

**Name**                : Ladityarsa Ilyankusuma

**E-Mail**               : ladityarsa.ian@gmail.com

---

### A. Datasource Breakdown

##### a.1. Source Link

Dataset Kaggle Source: [IBM HR Analytics Employee Attrition & Performance](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)

##### a.2. Overview

This datasource provides a detailed snapshot of employees within a corporate environment, covering a wide range of personal, professional, and performance-related attributes. It includes demographic details such as age, gender, and education level, as well as job-specific data like department, role, salary, and work experience.

A key focus of this dataset is employee attrition, that is, whether an employee has left the company, which makes it especially useful for analyzing turnover trends. It also contains indicators of workload and satisfaction, such as overtime status, job involvement, and work-life balance ratings. Together, these features allow for rich exploration of the factors that might influence employee retention, engagement, and productivity.

The dataset is well-suited to HR analytics. It can answer questions such as "show me a breakdown of distance from home by job role and attrition" or "compare average monthly income by education and attrition", and it can support decision-making in areas like workforce planning, employee well-being initiatives, and performance management strategies.

##### a.3. General Information

- Rows: 1,470
- Columns: 35
- Missing Values: 0
- Duplicate Rows: 0

This indicates the dataset is in a good state before validation: all column names are normalized, and no nulls or duplicates were found after preprocessing.
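
The four counts above are simple to reproduce. The following is a minimal standard-library sketch on a tiny made-up sample; `SAMPLE_CSV` and the `summarize` helper are hypothetical and not part of the original notebook, and on the real file the same figures would more typically be computed with pandas.

```python
import csv
import io

# Hypothetical three-row sample standing in for the cleaned CSV file.
SAMPLE_CSV = """id,age,attrition
1,41,Yes
2,49,No
4,37,Yes
"""

def summarize(csv_text):
    """Return row, column, missing-value and duplicate-row counts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    # Treat empty strings (and absent trailing fields) as missing values.
    missing = sum(
        1 for row in rows for col in columns if row.get(col) in ("", None)
    )
    # A row counts as a duplicate if an identical row appeared earlier.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {
        "rows": len(rows),
        "columns": len(columns),
        "missing_values": missing,
        "duplicate_rows": duplicates,
    }

summary = summarize(SAMPLE_CSV)
```

For the sample, `summary` reports three rows, three columns, and no missing values or duplicates, mirroring the structure of the list above.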

**Column Types Breakdown**

| Data Type            | Count |
| -------------------- | ----- |
| Integer (numerical)  | 19    |
| Object (categorical) | 16    |

Most of the features are numeric and suitable for statistical validations, while categorical fields are candidates for set and type checks.

**Example of Numerical Columns**

- `age`
- `daily_rate`
- `distance_from_home`
- `monthly_income`
- `num_companies_worked`
- `percent_salary_hike`
- `total_working_years`
- `years_at_company`

**Example of Categorical Columns**

- `attrition`
- `business_travel`
- `department`
- `education`
- `gender`
- `job_role`
- `marital_status`
- `over_time`

### B. Objective Breakdown

This notebook is part of a data quality assurance workflow using **Great Expectations (GX)** to validate a structured HR dataset before further analysis. The dataset contains detailed employee records, including job roles, compensation, work-life balance, and performance metrics, all of which are essential for exploring trends in attrition and workforce behavior.

##### b.1. Why Data Validation?

Before generating insights or building machine learning models, it is critical to **ensure that the underlying data is trustworthy, clean, and consistent**. This notebook uses Great Expectations to:

* Check the integrity of columns (such as uniqueness, value ranges, allowed types)
* Catch subtle data issues early (like invalid categories, duplicated rows, or out-of-range values)
* Provide a documented, automated validation layer that can be rerun as data pipelines evolve

This validation is especially important in the context of HR analytics, where **small data quality issues can mislead strategic decisions** about hiring, retention, or compensation.

##### b.2. What This Notebook Covers

Using Great Expectations, we perform the following tasks:

1. **Connect the HR dataset** (stored as a CSV) to a Pandas-based GX Datasource
2. **Create or reuse an expectation suite** tailored to this dataset
3. **Apply 7 carefully selected expectations**, covering:

   * Uniqueness checks for employee identifiers
   * Valid value ranges for numerical columns like `age`
   * Membership constraints for categorical features like `education`
   * Correct numeric data types (integer or float) for columns like `daily_rate`
   * A consistent relationship between two related columns, `monthly_rate` and `daily_rate`
   * A plausible median for `monthly_income`
   * A row count within the expected range, to catch missing chunks of data or duplicated loads in the pipeline
4. **Run a validation checkpoint** to evaluate the dataset against the expectation suite
5. **Generate and view Data Docs**, a visual report that summarizes which expectations passed or failed

These steps help ensure that any downstream analytics, whether in dashboards, KPIs, or predictive models, are built on reliable data.

##### b.3. Business Relevance

While the notebook focuses on **technical validation**, the ultimate goal is **business reliability**. Clean HR data allows analysts and stakeholders to:

* Spot early warning signs of employee disengagement
* Understand how performance varies across roles or departments
* Identify imbalances in workload, compensation, or satisfaction
* Support HR policies with evidence-driven insights

By embedding validation directly into the data workflow, this notebook ensures that these insights are not only useful, but also **credible**.

##### b.4. Intended Users

* **Data analysts and data engineers** responsible for maintaining clean HR data pipelines
* **HR analysts** validating employee datasets before visual exploration
* **People analytics teams** building predictive models or executive dashboards
* **Decision-makers** who rely on accurate, validated data to drive workforce strategy

# 2. Importing Libraries

In [1]:
# Importing libraries

## pip install -q "great-expectations==0.18.19"
from great_expectations.data_context import FileDataContext
import great_expectations.exceptions as gx_exceptions

# 3. Instantiate a Great Expectations Data Context

**What this does:**

This step creates a **new Great Expectations (GX) context**, the core environment GX uses to store and manage all your validation assets (datasources, expectation suites, checkpoints, and data docs). We're saving this context inside an auto-generated local folder called `gx/`, created in the current directory.

**This is where GX tracks everything:**

* What data sources you've connected
* What expectations you've defined
* Where to log validation results
* How to build visual reports (Data Docs)

**Having a clearly defined data context allows us to:**

* Run repeatable validation checks every time new data comes in
* Easily organize and persist your expectations and reports across notebook restarts
* Avoid relying on memory or temp configs, so the project becomes modular and portable

In [2]:
# Create a new GX context in the current folder
context = FileDataContext.create(project_root_dir="./")

# 4. Connect to a Datasource

**What this does:**

This step is about telling Great Expectations **where our data lives** and how to read it.

1. `add_pandas(...)` registers a datasource using Pandas, ideal for working with local `.csv` files in-memory.
2. If the datasource or asset already exists (like after a kernel restart), we catch that gracefully and re-load it from the context using a `try/except` block.
3. `add_csv_asset(...)` connects a specific CSV file as a data asset under the datasource.
4. Finally, we generate a `batch_request`, which is GX's internal format for pulling the actual data when running validations.

**Why this matters:**

* Datasources are GX’s entry point into **any data system**: files, databases, cloud storage, etc.
* Assets represent **specific datasets** within a datasource.
* Batch requests are the link between your raw data and your expectations: they are used to validate and explore real data batches.

The use of `try/except` makes this notebook **resilient to kernel restarts** or reruns, avoiding “already exists” errors.
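
The rerun-safe pattern itself is independent of GX. As a minimal sketch, the toy `Registry`, `AlreadyExistsError`, and `add_or_get` below are hypothetical stand-ins (not GX classes) that show how "add, or fall back to loading" keeps a second run from failing:

```python
class AlreadyExistsError(Exception):
    """Raised by the toy registry when a name is registered twice."""


class Registry:
    """Toy stand-in for a context that stores named objects."""

    def __init__(self):
        self._items = {}

    def add(self, name, value):
        if name in self._items:
            raise AlreadyExistsError(name)
        self._items[name] = value
        return value

    def get(self, name):
        return self._items[name]


def add_or_get(registry, name, value):
    """Register `value` under `name`, or return what is already stored."""
    try:
        return registry.add(name, value)
    except AlreadyExistsError:
        return registry.get(name)


registry = Registry()
first = add_or_get(registry, "csv_datasource_m3", {"type": "pandas"})
# Simulated rerun: the name already exists, so the stored object is returned.
second = add_or_get(registry, "csv_datasource_m3", {"type": "pandas"})
```

Both calls return the same stored object, which is the behavior the notebook relies on after a kernel restart.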

In [None]:
# Define the name of the datasource. This name must be unique among datasources.
datasource_name = "csv_datasource_m3"

# Add a Pandas-based datasource, or load it if it already exists.
# The exact exception class may vary by GX version, so both are caught.
try:
    datasource = context.sources.add_pandas(datasource_name)
except (gx_exceptions.DataContextError, gx_exceptions.DatasourceError):
    datasource = context.get_datasource(datasource_name)

# Define the name of the asset and its path
asset_name = "m3_data_clean"
path_to_data = "./data/HR_employee_attrition_dataset_clean.csv"

# Add the CSV file as a data asset, or load it if it already exists.
# A duplicate asset name may surface as a ValueError rather than a GX error.
try:
    asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)
except (ValueError, gx_exceptions.GreatExpectationsError):
    asset = datasource.get_asset(asset_name)

# Build the batch request used to fetch the data for validation
batch_request = asset.build_batch_request()

# 5. Create an Expectation Suite

**What this does:**

This step sets up the **expectation suite**, which is where our data rules (validations) will be saved.

1. `add_expectation_suite()` creates a new empty suite named `"m3_suite"`.
2. If that suite already exists (like after re-running the notebook), we load it instead using `get_expectation_suite()`.
3. Then, we create a `validator`, a central GX object that links:

   * The actual data (via `batch_request`)
   * The validation logic (via the suite)

Finally, `validator.head()` gives a preview of the data you're about to validate, which helps ensure the connection works and the data looks right.

**Why this matters:**

* An **expectation suite** is like a checklist for what “good” data should look like: column types, value ranges, uniqueness, etc.
* The **validator** is where you define and apply those rules interactively in your notebook.
* All expectations you define later will be **added to this suite**.

The `try/except` again ensures the suite can be reused across multiple notebook runs without throwing duplicate errors, keeping the workflow smooth.

In [4]:
# Define the name of the suite
suite_name = "m3_suite"

# Create the expectation suite, or get it if it already exists
try:
    suite = context.add_expectation_suite(expectation_suite_name=suite_name)
except gx_exceptions.DataContextError:
    suite = context.get_expectation_suite(expectation_suite_name=suite_name)

# Create the validator using the expectation suite above
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=suite)

# Check the validator
validator.head()

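|  | id | age | attrition | business_travel | daily_rate | department | distance_from_home | education | education_field | employee_count | ... | relationship_satisfaction | standard_hours | stock_option_level | total_working_years | training_times_last_year | work_life_balance | years_at_company | years_in_current_role | years_since_last_promotion | years_with_curr_manager |
| - | -- | --- | --------- | --------------- | ---------- | ---------- | ------------------ | --------- | --------------- | -------------- | --- | ------------------------- | -------------- | ------------------ | ------------------- | ------------------------ | ----------------- | ---------------- | --------------------- | -------------------------- | ----------------------- |
| 0 | 1 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | College | Life Sciences | 1 | ... | Low | 80 | 0 | 8 | 0 | Bad | 6 | 4 | 0 | 5 |
| 1 | 2 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | Below College | Life Sciences | 1 | ... | Very High | 80 | 1 | 10 | 3 | Better | 10 | 7 | 1 | 7 |
| 2 | 4 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | College | Other | 1 | ... | Medium | 80 | 0 | 7 | 3 | Better | 0 | 0 | 0 | 0 |
| 3 | 5 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | Master | Life Sciences | 1 | ... | High | 80 | 0 | 8 | 3 | Better | 8 | 7 | 3 | 0 |
| 4 | 7 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | Below College | Medical | 1 | ... | Very High | 80 | 1 | 6 | 3 | Better | 2 | 2 | 2 | 2 |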


### A. Define Expectations

Once the expectation suite is initialized and the validator is ready, we **define the specific rules** that our dataset must follow. In Great Expectations, these rules are called **expectations**.

Expectations are the heart of data validation. **They allow us to:**

* Ensure data consistency and quality across pipelines
* Catch schema drifts or anomalies early
* Document data contracts in a human-readable way

Each expectation reflects a business or technical assumption we want to enforce. In this project, we've designed **7 key expectations** based on column-level semantics and domain understanding. Each expectation is added interactively through the `validator` and automatically saved into the suite (`m3_suite`) for future validation.


##### a.1. Expectation 1 : `to be unique`

This expectation verifies that the `id` column contains **unique values**, ensuring that each row in the dataset represents a distinct employee:

In [5]:
# Expectation 1 : Column `id` must be unique
validator.expect_column_values_to_be_unique(column='id')

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Purpose of the Expectation:**

In any employee-related dataset, especially those used for HR analytics or people insights, the `id` field typically acts as a **primary key**. Enforcing uniqueness here is essential for several reasons:

* It prevents data duplication, which could skew metrics such as attrition rates or average tenure.
* It ensures consistency when joining with other datasets, like performance reviews or payroll.
* It enables accurate row-level analysis, such as tracking an individual’s progression or behavior over time.

**Result Summary:**

* `element_count: 1470` — All 1,470 rows were checked.
* `unexpected_count: 0` — No duplicate `id` values were found.
* `unexpected_percent: 0.0` — 0% of the rows failed this expectation.
* `missing_count: 0` — There were also no null or missing values in the `id` column.

These results indicate that the dataset successfully meets the uniqueness requirement. Every employee in the dataset is uniquely identified, which gives us a solid foundation for all downstream analysis and ensures trust in any insights derived from this data.
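
To make the result fields concrete, here is a minimal plain-Python sketch of a uniqueness check. It is not GX's implementation: the `check_unique` helper is hypothetical, and it assumes that every occurrence of a repeated value counts as unexpected and that the percentage is taken over non-missing values.

```python
from collections import Counter

def check_unique(values):
    """Flag every non-missing value that occurs more than once."""
    non_missing = [v for v in values if v is not None]
    counts = Counter(non_missing)
    unexpected = [v for v in non_missing if counts[v] > 1]
    percent = 100.0 * len(unexpected) / len(non_missing) if non_missing else 0.0
    return {
        "success": len(unexpected) == 0,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
        "missing_count": len(values) - len(non_missing),
    }

clean = check_unique([1, 2, 4, 5, 7])  # the first five ids from the data preview
dirty = check_unique([1, 2, 2, 5])     # hypothetical duplicated id
```

In the `dirty` case both rows carrying the id `2` are counted, which is why a single duplicated employee would surface as two unexpected values under this assumption.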


##### a.2. Expectation 2 : `to be between min_value and max_value`

This expectation ensures that all values in the `age` column fall within a **realistic and valid working age range** of 18 to 60 years:

In [6]:
# Expectation 2 : Column `age` must be between 18 and 60
validator.expect_column_values_to_be_between(
    column='age', min_value=18, max_value=60
)

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Purpose of the Expectation:**

Validating age boundaries is crucial in workforce datasets to:

* **Catch outliers** or data entry errors (ages below 18 or unrealistically high like 95).
* **Align the dataset with employment regulations**, which often stipulate minimum and maximum legal working ages.
* **Ensure accuracy** in segmentation and modeling, such as when age groups are used to study engagement, compensation trends, or attrition patterns.

By enforcing this range, we help ensure the demographic data is both clean and compliant.

**Result Summary:**

* `element_count: 1470` — All rows in the dataset were checked.
* `unexpected_count: 0` — Every `age` value was within the valid range (18–60).
* `unexpected_percent: 0.0` — No violations were found.
* `missing_count: 0` — The column contains no nulls or blanks.

This successful result confirms that the `age` column is both complete and logically valid. This allows us to confidently use this feature in further analysis tasks without needing additional filtering or imputation.
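
The range rule treats both bounds as inclusive, which is the default for this expectation when no strictness options are passed. A minimal sketch of the same logic in plain Python follows; `check_between` and the sample ages are hypothetical.

```python
def check_between(values, min_value, max_value):
    """Return the values that fall outside the inclusive range."""
    return [v for v in values if not (min_value <= v <= max_value)]

ages = [18, 33, 41, 60]      # boundary values 18 and 60 are allowed
bad_ages = [17, 33, 61, 95]  # hypothetical data-entry errors
violations = check_between(bad_ages, 18, 60)
```

An empty list from `check_between(ages, 18, 60)` corresponds to `unexpected_count: 0` in the output above, while `violations` collects the three out-of-range entries.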

##### a.3. Expectation 3 : `to be in set`

This expectation checks that the `education` column **only contains valid categorical labels** from a predefined set of accepted educational levels:

In [7]:
# Expectation 3 : Column `education` must contain only the following 5 values:
## - 'Below College'
## - 'College'
## - 'Bachelor'
## - 'Master'
## - 'Doctor'
validator.expect_column_values_to_be_in_set(
    'education', 
    ['Below College', 'College', 'Bachelor', 'Master', 'Doctor']
)

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Purpose of the Expectation:**

Ensuring categorical consistency is essential in HR analytics. By restricting values in the `education` column to a controlled set of labels, we:

* **Prevent typos or inconsistent entries** (like "bachelor" vs "Bachelor" or "Masters").
* **Ensure correct groupings** for statistical comparisons, especially when education level is used to study performance, compensation, or promotion patterns.
* Maintain **data integrity** for downstream dashboards and ML models that treat these categories as features.

**Result Summary:**

* `element_count: 1470` — All rows were evaluated.
* `unexpected_count: 0` — Every value in `education` matched one of the expected labels.
* `missing_count: 0` — There are no nulls or blanks in this column.
* `unexpected_percent: 0.0` — The entire column adheres strictly to the defined set.

This confirms that the `education` field is not only complete but also well-standardized, making it safe for use in grouping, aggregation, or encoding tasks.
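
Set membership is an exact, case-sensitive comparison, which is what lets it catch the typos mentioned above. The sketch below illustrates this in plain Python; `unexpected_labels` and the `observed` list are hypothetical.

```python
ALLOWED_EDUCATION = {"Below College", "College", "Bachelor", "Master", "Doctor"}

def unexpected_labels(values, allowed):
    """Return the distinct values not found in the allowed set (case-sensitive)."""
    return sorted({v for v in values if v not in allowed})

# Hypothetical dirty input containing a lowercase label and a plural form.
observed = ["College", "bachelor", "Master", "Masters", "Doctor"]
problems = unexpected_labels(observed, ALLOWED_EDUCATION)
```

Here `problems` contains `"Masters"` and `"bachelor"`, the two labels that would otherwise split a group in an aggregation.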

##### a.4. Expectation 4 : `to be in type list`

This expectation ensures that the values in the `daily_rate` column are of a **valid numerical data type**, specifically either `int64` or `float64`.

In [8]:
# Expectation 4 : Column `daily_rate` must be of integer or float type
validator.expect_column_values_to_be_in_type_list(
    'daily_rate', ['int64', 'float64']
)

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Purpose of the Expectation:**

Data type validation is critical when preparing numerical columns for:

* **Mathematical operations** (sums, averages, comparisons),
* **Modeling** (numeric encoding),
* And **visualization** (like histograms or scatter plots).

By ensuring that `daily_rate` contains only integers or floats, we avoid issues like:

* Parsing errors from string-based numbers (`'800'`, a number as a string),
* Unexpected objects (`None`, `N/A`, or mixed types),
* Downstream calculation bugs in aggregations or feature engineering.

**Result Summary:**

* `success: true` — All values in the column conform to an accepted type.
* `observed_value: "int64"` — The entire column consists of 64-bit integers.
* The fact that no mixed types were detected further confirms data consistency.

This makes the `daily_rate` column safe for arithmetic, aggregation, and any other analysis that requires a numeric dtype.
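
The GX output above reports a single column-level type (`int64`). As a rough analogue, the sketch below checks values one by one in plain Python to show the kinds of entries that would break numeric work; `non_numeric` and the sample lists are hypothetical, and the element-wise approach is an assumption of this sketch rather than how GX evaluates the rule.

```python
def non_numeric(values):
    """Return values that are not int or float (bool is excluded on purpose)."""
    return [
        v for v in values
        if isinstance(v, bool) or not isinstance(v, (int, float))
    ]

daily_rates = [1102, 279, 1373]          # values from the data preview
mixed_rates = [1102, "800", None, 591.0]  # hypothetical mixed-type column
offenders = non_numeric(mixed_rates)
```

The string `"800"` and the `None` entry are reported as offenders, whereas the clean `daily_rates` list yields nothing.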

##### a.5. Expectation 5 : `A to be greater than B`

This expectation enforces a **logical relationship** between two numerical columns: for every row, the value in `monthly_rate` must be **greater than** the value in `daily_rate`.

In [9]:
# Expectation 5 : Column `monthly_rate` must always be greater than column `daily_rate`
validator.expect_column_pair_values_A_to_be_greater_than_B(
    column_A='monthly_rate',
    column_B='daily_rate'
)

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Purpose of the Expectation:**

This type of cross-column validation is useful when:

* The columns represent **hierarchically related quantities**, such as daily vs. monthly compensation,
* There’s a **business logic** that must always hold true (monthly pay > daily pay),
* You're checking for **data entry errors** (accidentally swapped or inconsistent values).

By applying this expectation, we confirm that the dataset adheres to real-world constraints, improving reliability and trust in downstream analysis.

**Result Summary:**

* `success: true` — Every single row satisfies the rule.
* `unexpected_count: 0` — No violations found where daily rate exceeds or equals monthly rate.
* This confirms the **integrity of pay structure** in the dataset and rules out any flipped or inconsistent entries in these two columns.
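
Because the comparison is strict, a row where both columns are equal also counts as a violation. A minimal plain-Python sketch of the row-wise check follows; `pair_violations` and all sample values are hypothetical.

```python
def pair_violations(rows, col_a, col_b):
    """Return rows where col_a is not strictly greater than col_b."""
    return [row for row in rows if not row[col_a] > row[col_b]]

rows = [
    {"monthly_rate": 19479, "daily_rate": 1102},  # passes
    {"monthly_rate": 500, "daily_rate": 500},     # equal values fail a strict check
    {"monthly_rate": 300, "daily_rate": 1373},    # swapped values fail
]
bad_rows = pair_violations(rows, "monthly_rate", "daily_rate")
```

Only the first row passes; the equal and swapped rows end up in `bad_rows`.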

##### a.6. Expectation 6 : `median to be between min_value and max_value`

This expectation validates that the **median value** of the `monthly_income` column falls within a defined, reasonable range of \$4,000 to \$6,000.

In [10]:
# Expectation 6 : The median of column `monthly_income` must be in range of 4000 - 6000 dollars
validator.expect_column_median_to_be_between(
    column='monthly_income', min_value=4000, max_value=6000
)

{
  "success": true,
  "result": {
    "observed_value": 4919.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Purpose of the Expectation:**

Validating the median is especially useful when:

* You want to **summarize the central tendency** of skewed numerical data (like income),
* There’s an expected **business benchmark** range (like typical employee salary),
* You want to catch shifts or anomalies in data distributions over time.

This expectation helps ensure that compensation data remains **plausible and within expected business norms**.

**Result Summary:**

* `observed_value: 4919.0` — The median monthly income is well within the expected range of \$4000–\$6000.
* `success: true` — Confirms that the dataset's income distribution remains centered around typical company pay ranges.
* This builds trust in the salary data and can help validate predictive models or dashboards that use this column as input.
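
Unlike the row-level checks, this rule evaluates one aggregate value. The sketch below uses the standard `statistics` module on made-up incomes to show why the median suits skewed data: a single very high salary moves the mean out of range but leaves the median untouched. `check_median` and the `incomes` list are hypothetical.

```python
from statistics import mean, median

def check_median(values, min_value, max_value):
    """Compare the median of `values` against an inclusive range."""
    observed = median(values)
    return {
        "success": min_value <= observed <= max_value,
        "observed_value": observed,
    }

# Hypothetical incomes with one extreme value at the top.
incomes = [2900, 4100, 4919, 5200, 19999]
result = check_median(incomes, 4000, 6000)
average = mean(incomes)
```

For this sample the median stays inside the 4000 to 6000 band even though `average` lies above it.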

##### a.7. Expectation 7 : `row count to be between min_value and max_value`

This expectation ensures that the **number of rows in the dataset** falls within an acceptable range, in this case between **1,000 and 2,000** records.

In [11]:
# Expectation 7 : Row count must be in range of 1000 - 2000 rows
validator.expect_table_row_count_to_be_between(
    min_value=1000, max_value=2000
)

{
  "success": true,
  "result": {
    "observed_value": 1470
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Purpose of the Expectation:**

Validating the total row count is a basic but powerful check to:

* Confirm the dataset is **complete and loaded properly**,
* Detect partial loads, dropped records, or duplicate imports,
* Serve as a **guardrail** against structural issues in the pipeline or manual errors during collection.

This is especially useful in periodic ETL workflows where the dataset size should remain within a consistent range over time.

**Result Summary:**

* `observed_value: 1470` — The dataset has 1,470 rows.
* This value is **within the acceptable range** of 1,000 to 2,000.
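* `success: true` confirms the row count is consistent with a **full, correct load**: it rules out a badly truncated file or a doubled import, although it cannot by itself detect a handful of missing or extra rows.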
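
The guardrail idea can be sketched in a few lines of plain Python. The `check_row_count` helper and the two failure scenarios are hypothetical; only the 1,470-row case reflects the dataset.

```python
def check_row_count(rows, min_value, max_value):
    """Compare the number of rows against an inclusive range."""
    observed = len(rows)
    return {
        "success": min_value <= observed <= max_value,
        "observed_value": observed,
    }

full_load = check_row_count(range(1470), 1000, 2000)
partial_load = check_row_count(range(700), 1000, 2000)   # hypothetical truncated file
double_load = check_row_count(range(2940), 1000, 2000)   # hypothetical duplicate import
```

A truncated file and an accidental double import both fall outside the band, while the expected 1,470 rows pass.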

### B. Save the Expectations

Once we’ve defined all our validation rules (expectations), we need to **persist them into the expectation suite** so they can be reused or executed later — such as in automated data pipelines or future validation runs.

**What this does:**

* Saves the current state of the expectation suite (`m3_suite`) into the GX project directory (`gx/expectations/`).
* The `discard_failed_expectations=False` flag ensures that **even expectations that might have failed** during testing are retained in the suite (though in our case, all passed).

**Why this matters:**

* This is a critical step in **productionizing data validation**, as it turns our validation logic from ephemeral code into **a reusable contract** between data producers and consumers.
* The saved suite can now be linked to **Checkpoints**, **Data Docs**, or even integrated into **CI/CD pipelines** for automated checks.
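
Persisting the suite amounts to serializing each expectation's type and arguments. The sketch below illustrates the idea with the standard `json` module. The dictionary layout is only an approximation of a stored suite and should be treated as an assumption, since the exact file format can differ between GX versions; only two of the seven expectations are shown.

```python
import json

# Approximate, hand-written shape of a stored suite (assumption, not GX output).
suite_as_dict = {
    "expectation_suite_name": "m3_suite",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_be_unique",
            "kwargs": {"column": "id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "age", "min_value": 18, "max_value": 60},
        },
    ],
}

# Round-trip through JSON, as a file-based store would do.
serialized = json.dumps(suite_as_dict, indent=2)
restored = json.loads(serialized)
```

Because the rules survive the round trip unchanged, they can be reloaded in a later session without redefining them in code.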

In [12]:
# Save the expectations into the expectation suite
validator.save_expectation_suite(discard_failed_expectations=False)

### C. Run a Checkpoint

A **Checkpoint** in Great Expectations is a reproducible validation routine. It bundles together:

* The datasource (where the data lives),
* The batch request (what slice of data to validate),
* The expectation suite (the rules to validate against),

into a **single, versionable object** that can be executed over and over again, manually or automatically.

**What this does:**

* `add_or_update_checkpoint` registers a new checkpoint (or updates it if it already exists).
* `checkpoint.run()` triggers the entire validation process defined by the expectation suite.
* All results, including pass/fail summaries, are stored in the local GX context (`gx/uncommitted/validations/`) for tracking and reporting.

**Why this matters:**

* This is **the moment of truth** where we see if the data meets our expectations.
* It also **decouples validation from code**. Once defined, a checkpoint can be triggered without re-running the entire notebook logic.
* You can later **schedule this checkpoint** in a production pipeline or, in our case, an Airflow DAG.
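
Conceptually, a checkpoint run applies every stored rule to a batch of data and aggregates the outcomes. The plain-Python sketch below mimics that idea only; `run_checks`, the sample rows, and the check names are hypothetical and unrelated to GX's actual `Checkpoint` class.

```python
def run_checks(rows, checks):
    """Run every named check against `rows` and aggregate the outcome."""
    results = {name: bool(check(rows)) for name, check in checks.items()}
    return {"success": all(results.values()), "results": results}

rows = [
    {"id": 1, "age": 41, "education": "College"},
    {"id": 2, "age": 49, "education": "Below College"},
    {"id": 4, "age": 37, "education": "College"},
]

checks = {
    "id_is_unique": lambda rs: len({r["id"] for r in rs}) == len(rs),
    "age_between_18_and_60": lambda rs: all(18 <= r["age"] <= 60 for r in rs),
}

report = run_checks(rows, checks)
```

The overall `success` flag is true only when every individual check passes, so one failing rule is enough to fail the run.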

In [13]:
# Define the name of the checkpoint
checkpoint_name = "checkpoint_m3"

# Create or update a persistent checkpoint using the validator we've built
try:
    checkpoint = context.add_or_update_checkpoint(
        name=checkpoint_name,
        validator=validator,
    )
except Exception as e:
    print("Failed to add or update checkpoint:", e)

In [14]:
# Run the checkpoint
checkpoint_results = checkpoint.run()

### D. Build the Data Docs

After running validations and saving expectations, the final step is to **generate a visual report** of the results. Great Expectations does this through **Data Docs**, which are static HTML pages that show which validations passed or failed.

This compiles everything (datasource info, suite expectations, and validation results) into an organized visual format. We can then open it locally in a browser (if running on a local machine) or deploy it to a remote viewer for team sharing.

In [15]:
# Build data docs (renders HTML files)
context.build_data_docs()
print("Data Docs have been successfully built.")

Data Docs have been successfully built.


In [16]:
# Open the docs in default web browser (only works locally)
context.open_data_docs()

# 6. Conclusion

We have successfully executed a complete **data validation pipeline** using **Great Expectations** on a structured HR dataset. The workflow was carefully designed to ensure that the dataset is **clean, consistent, and trustworthy** before it's used for deeper analytics or business intelligence.

We started by configuring a **Great Expectations context** and connecting to a **Pandas-based datasource**, enabling direct ingestion of our cleaned CSV file. An **expectation suite** was then created and tied to a **validator**, which we used to define and apply seven targeted expectations that reflect both **data quality standards** and **business logic requirements**.

### A. Summary of Validation Expectations:

| # | Expectation Description       | Column(s) Involved                                               | Outcome                                          |
| - | ----------------------------- | ---------------------------------------------------------------- | ------------------------------------------------ |
| 1 | Must be unique                | `id`                                                             | Passed: All 1470 values are unique             |
| 2 | Must be between 18 and 60     | `age`                                                            | Passed: No out-of-range values                 |
| 3 | Must be in allowed set        | `education` ∈ {Below College, College, Bachelor, Master, Doctor} | Passed: All values valid                       |
| 4 | Must be int or float          | `daily_rate`                                                     | Passed: Observed type was `int64`              |
| 5 | `monthly_rate` > `daily_rate` | `monthly_rate`, `daily_rate`                                     | Passed: All 1470 rows satisfied this condition |
| 6 | Median between 4000–6000      | `monthly_income`                                                 | Passed: Median = 4919.0                        |
| 7 | Row count between 1000–2000   | Entire dataset                                                   | Passed: 1470 rows observed                     |

Each expectation returned `success: true`, confirming that **no violations** were found and **no manual intervention** was needed. These validations were saved to an **expectation suite** and made persistent using a **named checkpoint** (`checkpoint_m3`), allowing this entire validation logic to be re-executed reliably in future data pipelines. Finally, the results were rendered into human-readable **Data Docs**, providing a visual audit trail for data quality review and documentation.

### B. Business Impact

Clean, validated data is more than a technical necessity: it enables **trusted insights** that inform **strategic decisions**. With this validated HR dataset, stakeholders can confidently explore:

* Key drivers of employee attrition
* Role or department-level performance trends
* Early indicators of disengagement

This ensures that downstream visualizations, dashboards, and models are **built on solid ground**, supporting impactful decisions in HR policy, workforce planning, and organizational strategy.