## Data quality verification

#### 3. Data requirements

The data requirements for this project combine business and technical requirements. The business requirements come from domain knowledge of the project, while the technical requirements come from the data types and formats that the analysis needs. Together they help ensure that the data is of high quality and suitable for further analysis and modeling.

EDA has shown that only some columns matter for later stages, so we define requirements for these columns only:
- We work only with data from 2022, so `FlightDate` should fall within 2022.
- Time features such as `CRSDepTime` and `CRSArrTime` should be in the "hhmm" format, where hh is 00-23 and mm is 00-59.
- `OriginAirportID`, `DestAirportID`, `Operating_Airline`, and `Tail_Number` should follow the correct format for airport and airline codes: the IDs are numbers, `Operating_Airline` is a two-character string, and `Tail_Number` is a string of digits and letters.
- `Cancelled` is a binary feature, so it should be either False or True.
- Some features, such as `Tail_Number`, are not relevant for cancelled flights, so we should check that they are present for non-cancelled flights and missing for cancelled flights.
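
To make these rules concrete, here is a minimal sketch of a row-level check in plain Python. It is not the project's implementation: the function name `check_row` is hypothetical, and it assumes each record is a dict in which times are zero-padded four-character strings, IDs are already integers, and a missing `Tail_Number` is `None`.

```python
import re
from datetime import date

# "hhmm": hh in 00-23, mm in 00-59 (assumes a zero-padded 4-character string).
HHMM = re.compile(r"([01]\d|2[0-3])[0-5]\d")
# Tail numbers: non-empty string of ASCII letters and digits.
ALNUM = re.compile(r"[A-Za-z0-9]+")


def check_row(row):
    """Return a list of requirement violations for one flight record (a dict)."""
    problems = []

    # FlightDate must parse as "YYYY-MM-DD" and fall in 2022.
    if date.fromisoformat(row["FlightDate"]).year != 2022:
        problems.append("FlightDate is not in 2022")

    for col in ("CRSDepTime", "CRSArrTime"):
        if not HHMM.fullmatch(row[col]):
            problems.append(f"{col} is not a valid hhmm time")

    for col in ("OriginAirportID", "DestAirportID"):
        if not isinstance(row[col], int):
            problems.append(f"{col} is not an integer ID")

    if len(row["Operating_Airline"]) != 2:
        problems.append("Operating_Airline is not a two-character code")

    # Tail_Number must be present for non-cancelled flights and missing otherwise.
    cancelled = row["Cancelled"]
    tail = row["Tail_Number"]
    if not isinstance(cancelled, bool):
        problems.append("Cancelled is not a boolean")
    elif cancelled and tail is not None:
        problems.append("Tail_Number is present for a cancelled flight")
    elif not cancelled and (tail is None or not ALNUM.fullmatch(tail)):
        problems.append("Tail_Number is missing or malformed for a non-cancelled flight")

    return problems
```

An empty list means the record satisfies every rule above. A `FlightDate` that is not a valid ISO date raises `ValueError` rather than being reported as a violation.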

The types of the important columns can be defined as follows:
- `FlightDate`: date, format "YYYY-MM-DD"
- `Operating_Airline`: string, length 2
- `OriginAirportID`, `DestAirportID`: integer
- `Cancelled`: boolean
- `CRSDepTime`, `CRSArrTime`: time, format "hhmm"
- `Tail_Number`: string (can be None for cancelled flights)
- `CRSElapsedTime`, `DepDelay`, `ActualElapsedTime`: integer, minutes (can be None for cancelled flights)
- `Distance`: integer, miles
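
The type list above can be sketched as a typed record with a small parser. This is an illustration only: `FlightRecord`, `parse_record`, and `_opt_int` are hypothetical names, and the parser assumes that each CSV row arrives as a dict of strings, that missing values are empty strings, that `Cancelled` is written as "True" or "False", and that numeric fields may appear as "95" or "95.0".

```python
from dataclasses import dataclass
from datetime import date, datetime, time
from typing import Optional


@dataclass
class FlightRecord:
    flight_date: date                  # FlightDate, "YYYY-MM-DD"
    operating_airline: str             # Operating_Airline, length 2
    origin_airport_id: int             # OriginAirportID
    dest_airport_id: int               # DestAirportID
    cancelled: bool                    # Cancelled
    crs_dep_time: time                 # CRSDepTime, "hhmm"
    crs_arr_time: time                 # CRSArrTime, "hhmm"
    tail_number: Optional[str]         # None for cancelled flights
    crs_elapsed_time: Optional[int]    # minutes
    dep_delay: Optional[int]           # minutes
    actual_elapsed_time: Optional[int] # minutes
    distance: int                      # miles


def _opt_int(value):
    """Map an empty string to None; otherwise return an integer."""
    return None if value == "" else int(float(value))


def parse_record(raw):
    """Convert one CSV row (a dict of strings) into a typed FlightRecord."""
    return FlightRecord(
        flight_date=date.fromisoformat(raw["FlightDate"]),
        operating_airline=raw["Operating_Airline"],
        origin_airport_id=int(raw["OriginAirportID"]),
        dest_airport_id=int(raw["DestAirportID"]),
        cancelled=raw["Cancelled"] == "True",
        crs_dep_time=datetime.strptime(raw["CRSDepTime"], "%H%M").time(),
        crs_arr_time=datetime.strptime(raw["CRSArrTime"], "%H%M").time(),
        tail_number=raw["Tail_Number"] or None,
        crs_elapsed_time=_opt_int(raw["CRSElapsedTime"]),
        dep_delay=_opt_int(raw["DepDelay"]),
        actual_elapsed_time=_opt_int(raw["ActualElapsedTime"]),
        distance=int(float(raw["Distance"])),
    )
```

Parsing fails loudly (with `ValueError`) on a malformed date, time, or number, which is usually what you want before any modeling step sees the data.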

To test the quality of the data, we use the Great Expectations library. It lets us define expectations for the data and then automatically checks whether the data meets them. We define expectations for the columns mentioned above and validate the data against them. The expectation suite is written in the `src/data_quality.py` file, which we use to run the expectations on the data.
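
The contents of `src/data_quality.py` are not shown here, so the following is only a sketch of what such expectations might look like. The function name `add_column_expectations` is hypothetical. It assumes a Great Expectations 0.x validator backed by the pandas execution engine (needed for `row_condition` with `condition_parser="pandas"`) and assumes that the date and time columns are read as strings.

```python
# Regex for "hhmm": hh in 00-23, mm in 00-59.
HHMM_REGEX = r"^([01]\d|2[0-3])[0-5]\d$"


def add_column_expectations(validator):
    """Attach column-level expectations to a Great Expectations validator."""
    # FlightDate: "YYYY-MM-DD" within 2022 (assumes a string column).
    validator.expect_column_values_to_match_regex(
        column="FlightDate", regex=r"^2022-\d{2}-\d{2}$"
    )

    # Scheduled times must follow the "hhmm" format.
    for column in ("CRSDepTime", "CRSArrTime"):
        validator.expect_column_values_to_match_regex(column=column, regex=HHMM_REGEX)

    # Airline code is a two-character string; Cancelled is binary.
    validator.expect_column_value_lengths_to_equal(column="Operating_Airline", value=2)
    validator.expect_column_values_to_be_in_set(column="Cancelled", value_set=[True, False])

    # Tail_Number is present for non-cancelled flights and missing for cancelled ones.
    validator.expect_column_values_to_not_be_null(
        column="Tail_Number",
        row_condition="Cancelled==False",
        condition_parser="pandas",
    )
    validator.expect_column_values_to_be_null(
        column="Tail_Number",
        row_condition="Cancelled==True",
        condition_parser="pandas",
    )
    return validator
```

Each `expect_*` call both evaluates the expectation on the current batch and records it in the validator's suite, which is why the suite can be saved right after the expectations are defined.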

```python
from src.data_quality import load_context_and_sample_data, define_expectations

# Create Great Expectations context and load data
context, da = load_context_and_sample_data("../services", "../data/samples/sample.csv")
batch_request = da.build_batch_request()

# Define expectations and save them
validator = define_expectations(context, batch_request)
validator.save_expectation_suite(discard_failed_expectations=False)

# Validate the data
checkpoint = context.add_or_update_checkpoint(
    name="sample_checkpoint",
    validator=validator,
)
checkpoint_result = checkpoint.run()

context.view_validation_result(checkpoint_result)

if checkpoint_result.success:
    print("Data quality verification passed successfully")
else:
    print("Data quality verification failed")
```

Data quality verification passed successfully
