# Dynamic Data Source Column Validation

## Purpose
This notebook provides a flexible, automated way to validate that specific columns exist in one or more data sources (CSV or Delta tables) using PySpark and Python's `unittest` framework. It is designed for data engineers and analysts who need to ensure that ingested or processed files conform to expected schemas before further processing or reporting.

## How to Use
1. **Edit the `data_sources` list at the top of the code cell:**
   - Each entry should specify:
     - `name`: A descriptive name for the data source (for reporting and error messages).
     - `file_type`: Either `'csv'` or `'delta'`.
     - `file_path`: The full path to the file or Delta table.
     - `expected_columns`: A list of column names that must be present in the data source.
   - You can add, remove, or comment out entries as needed to check any combination of data sources. The code will only check the sources you include in the list.

2. **Run the notebook:**
   - The test will loop through all specified data sources and check for the presence of the required columns.
   - For each data source, the test prints which file is being checked and reports any missing columns with clear error messages, including the file name and which columns are missing.

3. **Interpreting Results:**
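   - If all columns are present in all data sources, the test passes: each source is listed as it is checked, and `unittest` reports `OK`.
   - If any columns are missing or a file cannot be read, the test fails and prints a detailed message indicating which file and which columns are problematic. Because each source runs in its own `subTest`, a failure in one source does not stop the remaining sources from being checked.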
   - If any test fails, the notebook will exit with a failure message (if run in Databricks or similar environments).

## Example
To check only one file, simply leave only one entry in the `data_sources` list. To check two or more, add them to the list. The code is fully dynamic and does not require changes to the test logic for different files or schemas.
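
As a rough illustration, a single-source configuration might look like the sketch below. The name, path, and column names are placeholders, and `config_problems` is a hypothetical pre-flight helper (not part of this notebook) that checks each entry has the four keys described under "How to Use" before any file is read.

```python
# Hypothetical single-source configuration; name, path and columns are placeholders.
data_sources = [
    {
        "name": "Orders CSV",
        "file_type": "csv",
        "file_path": "dbfs:/mnt/example/landing/orders.csv",
        "expected_columns": ["order_id", "amount"],
    },
]

REQUIRED_KEYS = {"name", "file_type", "file_path", "expected_columns"}
SUPPORTED_TYPES = {"csv", "delta"}


def config_problems(sources):
    """Return a list of human-readable problems found in the configuration."""
    problems = []
    for index, entry in enumerate(sources):
        missing_keys = sorted(REQUIRED_KEYS - set(entry))
        if missing_keys:
            problems.append(f"entry {index}: missing keys {missing_keys}")
        elif entry["file_type"] not in SUPPORTED_TYPES:
            problems.append(f"entry {index}: unsupported file_type {entry['file_type']!r}")
    return problems


print(config_problems(data_sources))  # []
```

Catching a misspelled key or an unsupported `file_type` this way gives a clearer message than a `KeyError` raised part-way through the test loop.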

**Tip:** You can use this notebook as a template for any schema validation task by simply updating the `data_sources` list to match your current validation needs.


In [0]:
import unittest
from pyspark.sql.utils import AnalysisException

# Define your data sources here. Add or remove entries as needed.
data_sources = [
    {
        "name": "Caregivers CSV",
        "file_type": "csv",
        "file_path": "dbfs:/mnt/ci-carma/landing/caregiverevent-3531dc00-4bb1-11f0-8e22-065857d19e8f.csv",
        "expected_columns": [
            "Caregiver_ICN__c", "Applicant_Type__c", "Caregiver_Status__c",
            "Dispositioned_Date__c", "Benefits_End_Date__c", "Veteran_ICN__c", "CreatedDate"
        ]
    },
    {
        "name": "Disability CSV",
        "file_type": "csv",
        "file_path": "dbfs:/mnt/ci-vadir-shared/CPIDODIEX_20250618_spool.csv",
        "expected_columns": ["PTCPNT_ID", "CMBNED_DEGREE_DSBLTY", "DSBL_DTR_DT"]
    },
    {
        "name": "PT Indicator Delta",
        "file_type": "delta",
        "file_path": "/mnt/ci-vba-edw-2/DeltaTables/DW_ADHOC_RECURR.DOD_PATRONAGE_SCD_PT/",
        "expected_columns": ["PTCPNT_VET_ID", "PT_35_FLAG"]
    }
]

class TestFileColumns(unittest.TestCase):
    def check_columns(self, file_type, file_path, expected_columns, name=None):
        """Fail unless every expected column exists in the source at file_path."""
        print(f"Checking file: {file_path} (type: {file_type})" + (f" [{name}]" if name else ""))
        try:
            if file_type == 'csv':
                # header=True takes the column names from the first row.
                df = spark.read.csv(file_path, header=True)
            elif file_type == 'delta':
                df = spark.read.format("delta").load(file_path)
            else:
                self.fail(f"Unknown file type: {file_type} for file {file_path}")
        except AnalysisException as e:
            self.fail(f"[{name}] File {file_path} could not be read: {e}")
        # Exact, case-sensitive name comparison; sorted for a stable message.
        missing_columns = sorted(set(expected_columns) - set(df.columns))
        self.assertEqual(
            missing_columns, [],
            f"[{name}] File {file_path} is missing expected columns: {missing_columns}. Found columns: {df.columns}"
        )

    def test_dynamic_data_sources(self):
        for ds in data_sources:
            with self.subTest(data_source=ds["name"]):
                self.check_columns(
                    ds["file_type"],
                    ds["file_path"],
                    ds["expected_columns"],
                    name=ds["name"]
                )

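if __name__ == '__main__':
    # exit=False stops unittest from calling sys.exit() inside the notebook.
    result = unittest.main(argv=['first-arg-is-ignored'], exit=False)
    if not result.result.wasSuccessful():
        print("Unit tests failed. Failing the notebook.")
        # Raising marks the run as failed. dbutils.notebook.exit() would end the
        # notebook with a success status, so a scheduled job would not fail.
        raise RuntimeError("Tests failed, failing notebook and job.")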