# Final Project: Great Expectations Validation for the Data Analyst's Dataset

## Import Necessary Libraries

In [1]:
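# Import the file-backed Data Context class from Great Expectations
from great_expectations.data_context import FileDataContext

# Initialize a Great Expectations project in the current directory
context = FileDataContext.create(project_root_dir='./')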

## Defining the dataset

In [2]:
# Define the name of the data source (kept lowercase)
datasource_name = 'data-analyst-stock-dataset'

# Add a pandas data source with the specified name to the context
datasource = context.sources.add_pandas(datasource_name)

# Define the name of the dataset asset (kept lowercase)
asset_name = 'data-analyst-stock-dataset'

# Specify the path to the raw CSV file containing the dataset
path_to_data = 'dataset_for_analysis.csv'

# Add a CSV asset with the specified name and file path to the data source
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build a batch request to load the dataset into memory
batch_request = asset.build_batch_request()


## Defining the Validator

In [3]:
# Create the expectation suite (or update it if it already exists)
expectation_suite_name = 'data-validation'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator that ties the batch request to the suite
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name
)

# Preview the first rows of the batch
validator.head()


| Unnamed: 0 | Bank_ID | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|---|
| 0 | BBCA | 2019-04-02 | 5530.0 | 5570.0 | 5480.0 | 5500.0 | 4896.961426 | 36853000 |
| 1 | BBCA | 2019-04-03 | 5500.0 | 5500.0 | 5500.0 | 5500.0 | 4896.961426 | 0 |
| 2 | BBCA | 2019-04-04 | 5540.0 | 5560.0 | 5490.0 | 5540.0 | 4932.575195 | 48112000 |
| 3 | BBCA | 2019-04-05 | 5560.0 | 5560.0 | 5525.0 | 5530.0 | 4923.671875 | 21913500 |
| 4 | BBCA | 2019-04-08 | 5505.0 | 5540.0 | 5450.0 | 5480.0 | 4879.15332 | 39910000 |


This validation covers four expectations for the data analyst's dataset:
- Bank ID must contain only the selected banking companies
- Closing price must be numeric
- Closing price must not be null
- Date must be between 2019 and 2024

## Expectation 1: The bank ID must contain only the selected banking companies

In [6]:
validator.expect_column_distinct_values_to_be_in_set(column="Bank_ID", value_set=["BBCA", "BBRI", "BMRI", "BRIS", "BBNI"])


{
  "success": true,
  "result": {
    "observed_value": [
      "BBCA",
      "BBNI",
      "BBRI",
      "BMRI",
      "BRIS"
    ],
    "details": {
      "value_counts": [
        {
          "value": "BBCA",
          "count": 1225
        },
        {
          "value": "BBNI",
          "count": 1225
        },
        {
          "value": "BBRI",
          "count": 1225
        },
        {
          "value": "BMRI",
          "count": 1225
        },
        {
          "value": "BRIS",
          "count": 1225
        }
      ]
    }
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The validation succeeded: every distinct value in `Bank_ID` belongs to the expected set ("BBCA", "BBNI", "BBRI", "BMRI", "BRIS"). All five values were observed, each with 1225 rows, so the data is evenly distributed across the selected banking companies.
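As a rough illustration of what this expectation checks, the plain-Python sketch below treats it as a subset test on the column's distinct values and tallies the rows per value. The `bank_ids` list is made-up sample data and `distinct_values_in_set` is a hypothetical helper, not part of Great Expectations.

```python
from collections import Counter

# Allowed tickers, copied from the expectation's value_set
ALLOWED = {"BBCA", "BBRI", "BMRI", "BRIS", "BBNI"}


def distinct_values_in_set(values, allowed):
    """Hypothetical helper: True when every distinct value is in `allowed`."""
    return set(values) <= set(allowed)


# Made-up sample standing in for the Bank_ID column
bank_ids = ["BBCA", "BBCA", "BBRI", "BMRI", "BRIS", "BBNI"]

# Subset test on the distinct values
passes = distinct_values_in_set(bank_ids, ALLOWED)

# Rows per distinct value, similar in spirit to value_counts in the result
counts = Counter(bank_ids)

# An unknown ticker (made up here) makes the check fail
fails_on_unknown = not distinct_values_in_set(bank_ids + ["XXXX"], ALLOWED)
```

A subset test still passes when an allowed ticker is absent from the data, so it is the value counts in the result that confirm all five banks are present.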

## Expectation 2: Closing price must be numeric

In [11]:
validator.expect_column_values_to_be_of_type(column="Close", type_="float64")


{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The validation succeeded: the observed data type of the `Close` column is `float64`, a numeric type, so the column meets the expectation.
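A `float64` type on its own does not rule out missing values, because a missing number in a float column is typically represented as NaN, which is itself a float. The sketch below shows this with plain Python floats rather than pandas, an assumption made to keep the example self-contained; the `closes` list is made-up sample data.

```python
import math

# Made-up closing prices with one missing value encoded as NaN
closes = [5500.0, 5540.0, float("nan")]

# Every element is a float, so a type check alone would pass
all_floats = all(isinstance(value, float) for value in closes)

# A separate null check is needed to catch the NaN
null_count = sum(1 for value in closes if math.isnan(value))
```

This is why a type check is usually paired with a not-null check on the same column.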

## Expectation 3: Closing price must not be null

In [12]:
validator.expect_column_values_to_not_be_null(column="Close")


{
  "success": true,
  "result": {
    "element_count": 6125,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The validation succeeded: none of the 6125 closing price values is null.
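To make the result fields concrete, the sketch below recomputes the unexpected percentage from the counts in the output above. It assumes `unexpected_percent` is simply `unexpected_count` divided by `element_count`, expressed as a percentage; `null_share` is a hypothetical helper, and the payload is trimmed to the fields used here.

```python
import json

# Result payload copied from the output above (trimmed to the fields used here)
raw = """
{
  "success": true,
  "result": {
    "element_count": 6125,
    "unexpected_count": 0,
    "unexpected_percent": 0.0
  }
}
"""


def null_share(payload):
    """Hypothetical helper: percentage of rows flagged as unexpected."""
    result = payload["result"]
    return 100.0 * result["unexpected_count"] / result["element_count"]


payload = json.loads(raw)
recomputed = null_share(payload)
```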

## Expectation 4: The date must be between 2019 and 2024

In [19]:
validator.expect_column_values_to_be_between(
    column="Date",
    min_value="2019-03-31",
    max_value="2024-03-28"
)


{
  "success": true,
  "result": {
    "element_count": 6125,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The validation succeeded: all 6125 values in `Date` fall between 2019-03-31 and 2024-03-28, so the data lies within the 2019 to 2024 range.
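The CSV asset was added without any date-parsing options, so the `Date` column is presumably held as strings and compared against string bounds. That is an assumption about this notebook, not something the output confirms. The sketch below illustrates why such a comparison would still be chronological: zero-padded ISO 8601 dates (`YYYY-MM-DD`) order the same way as text and as dates. The bounds come from the expectation, `in_range` is a hypothetical helper, and the last sample date is made up to show a failing case.

```python
from datetime import date

# Bounds copied from the expectation
MIN_DATE = "2019-03-31"
MAX_DATE = "2024-03-28"


def in_range(value, low=MIN_DATE, high=MAX_DATE):
    """Hypothetical helper: inclusive range check on ISO date strings."""
    return low <= value <= high


# The first two dates appear in the preview; the last one is made up
samples = ["2019-04-02", "2019-04-08", "2024-03-28", "2024-03-29"]

# Compare as plain strings
as_text = [in_range(s) for s in samples]

# Compare as real dates; the verdicts should match the string comparison
as_dates = [
    date.fromisoformat(MIN_DATE) <= date.fromisoformat(s) <= date.fromisoformat(MAX_DATE)
    for s in samples
]
```

If the dates were stored in another format, such as `DD/MM/YYYY`, string comparison would no longer match chronological order and the column would need to be parsed first.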