# Testing for ML

# Libraries

In [1]:
import great_expectations as ge
import json
import pandas as pd
from urllib.request import urlopen
import importlib.util



# Data testing

So far, we've used unit and integration tests to test the functions that interact with our data, but we haven't tested the validity of the data itself.

We're going to use the Great Expectations library to declare, in a standardized way, what our data is expected to look like. It also provides modules to connect with backend data sources such as local file systems, S3, and databases.

In [2]:
TWEETS_PATH = '../data/tweets.csv'
TAGS_PATH = '../data/tags.txt'

def get_idx_tag(path_file: str) -> dict:
    """
    Return a dictionary mapping label indices to tag names.

    Args:
        path_file (str): location of the file containing the tags

    Returns:
        dict: mapping from integer index to tag name
    """
    dict_tags = dict()
    with open(path_file, "r") as f:
        for line in f:
            # Skip empty lines
            if line != "\n":
                key_value = line.split(":")
                # Key: text after the last "_", minus the trailing character
                key = int(key_value[0].split("_")[-1][:-1])
                # Value: drop the newline, the quotes, and the leading space
                value = key_value[1].replace("\n", "")
                dict_tags[key] = value.replace('"', "")[1:]
    return dict_tags


def idx_to_tag(serie, dict_tags):
    return serie.apply(lambda x: dict_tags[x])

tweets = pd.read_csv(TWEETS_PATH)
dict_tags = get_idx_tag(TAGS_PATH)

labeled_tweets = tweets.copy()
labeled_tweets["label"] = idx_to_tag(labeled_tweets["label"], dict_tags)
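
The parsing in `get_idx_tag` implies that each line of the tags file looks roughly like `"LABEL_0": "Analyst Update"`. That shape is inferred from the code, not read from the file itself. Under that assumption, a minimal sketch of the same index-to-tag mapping on in-memory sample lines might look like this (the helper `parse_tag_line`, the `LABEL_` prefix, and the tag `Example Tag` are hypothetical):

```python
from typing import Dict, Tuple

# Hypothetical sample lines; the real tags file may differ.
SAMPLE_LINES = [
    '"LABEL_0": "Analyst Update"\n',
    "\n",  # blank lines are skipped
    '"LABEL_1": "Example Tag"\n',
]


def parse_tag_line(line: str) -> Tuple[int, str]:
    """Split one line into (index, tag), assuming the shape shown above."""
    key_part, value_part = line.split(":", 1)
    # '"LABEL_0"' -> 'LABEL_0' -> '0' -> 0
    index = int(key_part.strip().strip('"').rsplit("_", 1)[-1])
    # ' "Analyst Update"\n' -> 'Analyst Update'
    tag = value_part.strip().strip('"')
    return index, tag


sample_tags: Dict[int, str] = dict(
    parse_tag_line(line) for line in SAMPLE_LINES if line.strip()
)
print(sample_tags)
```

Stripping whitespace and quotes explicitly, instead of slicing by position, makes the sketch a little more tolerant of stray spaces, but it is still tied to the assumed line format.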


In [3]:
df = ge.dataset.PandasDataset(labeled_tweets)
print(f"{len(df)} tweets")
df.head(5)

21107 tweets


| Unnamed: 0 | text | label |
|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | Analyst Update |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | Analyst Update |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | Analyst Update |
| 3 | Analysts react to Tesla's latest earnings, bre... | Analyst Update |
| 4 | Netflix and its peers are set for a ‘return to... | Analyst Update |


## Expectations

We want to think about our entire dataset and all the features (columns) within it.

Built-in expectations:

- `expect_table_columns_to_match_ordered_list`: Presence of specific features
- `expect_compound_columns_to_be_unique`: Unique combinations of features (detect data leaks!)
- `expect_column_values_to_not_be_null`: Missing values
- `expect_column_values_to_be_of_type`: Type adherence
- `expect_column_values_to_be_unique`: Unique values
- `expect_column_values_to_be_in_set`: Set of allowed values for categorical features (for a continuous range, use `expect_column_values_to_be_between`)
- `expect_column_pair_values_a_to_be_greater_than_b`: Feature value relationships with other feature values
- `expect_table_row_count_to_be_between`: Row count (exact or range) of samples
- `expect_column_mean_to_be_between`: Value statistics (similar expectations exist for std, median, max, min, sum, etc.)

Custom expectations: see the [Great Expectations guide](https://docs.greatexpectations.io/docs/guides/expectations/creating_custom_expectations/overview/).
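
To make the idea of an expectation concrete, here is a plain-Python sketch of roughly what a missing-value check computes. It is not Great Expectations' implementation: the helper `check_not_null` and the sample values are hypothetical, and the returned keys only mirror a subset of the fields that the library reports.

```python
from typing import Any, Dict, List, Optional


def check_not_null(values: List[Optional[Any]]) -> Dict[str, Any]:
    """Report how many entries are missing (None), in an expectation-like shape."""
    unexpected = [v for v in values if v is None]
    element_count = len(values)
    # Guard against an empty column to avoid dividing by zero
    unexpected_percent = (
        100.0 * len(unexpected) / element_count if element_count else 0.0
    )
    return {
        "success": len(unexpected) == 0,
        "result": {
            "element_count": element_count,
            "unexpected_count": len(unexpected),
            "unexpected_percent": unexpected_percent,
        },
    }


# Hypothetical column with one missing label
labels = ["Analyst Update", None, "Analyst Update", "Analyst Update"]
report = check_not_null(labels)
print(report)
```

The useful property is that every check returns the same structure, so results from different expectations can be aggregated and compared without special cases.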


In [4]:
# Example 1: Success

# Missing values
df.expect_column_values_to_not_be_null(column="label")

{
  "success": true,
  "meta": {},
  "result": {
    "element_count": 21107,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "partial_unexpected_list": []
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [5]:
# Example 2: Failure

# Type adherence (labels were mapped to tag strings above, so they are not ints)
df.expect_column_values_to_be_of_type(column="label", type_="int")

{
  "success": false,
  "meta": {},
  "result": {
    "element_count": 21107,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 21107,
    "unexpected_percent": 100.0,
    "unexpected_percent_total": 100.0,
    "unexpected_percent_nonmissing": 100.0,
    "partial_unexpected_list": [
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update",
      "Analyst Update"
    ]
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

## Organization

When it comes to organizing expectations, it's recommended to start with table-level ones and then move on to individual feature columns.

1. Table expectations:
    - `expect_table_columns_to_match_ordered_list`
    - `expect_compound_columns_to_be_unique`
    - ...

2. Column expectations:
    - `expect_column_values_to_be_unique`
    - `expect_column_values_to_not_be_null`
    - ...

We can group all the expectations together to create an Expectation Suite object, which we can then use to validate any Dataset.
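
The table-first, column-second ordering and the idea of a suite can be sketched without the library. The following is a plain-Python illustration, not the Great Expectations API: `run_suite`, the check names, and the sample rows are all hypothetical, and the statistics keys simply imitate the shape of a validation report.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Check = Tuple[str, Callable[[List[Row]], bool]]

# Hypothetical sample rows standing in for the labeled tweets
rows: List[Row] = [
    {"text": "sample tweet A", "label": "Analyst Update"},
    {"text": "sample tweet B", "label": "Analyst Update"},
]

# 1. Table-level checks come first
table_checks: List[Check] = [
    ("columns match ordered list",
     lambda rs: all(list(r) == ["text", "label"] for r in rs)),
    ("row count between 1 and 100", lambda rs: 1 <= len(rs) <= 100),
]

# 2. Column-level checks follow
column_checks: List[Check] = [
    ("label is not null", lambda rs: all(r["label"] is not None for r in rs)),
    ("label is int", lambda rs: all(isinstance(r["label"], int) for r in rs)),
]


def run_suite(data: List[Row], checks: List[Check]) -> Dict[str, Any]:
    """Run every check and summarize the outcome."""
    results = [{"expectation": name, "success": bool(fn(data))} for name, fn in checks]
    passed = sum(r["success"] for r in results)
    return {
        "success": passed == len(results),
        "statistics": {
            "evaluated_expectations": len(results),
            "successful_expectations": passed,
            "unsuccessful_expectations": len(results) - passed,
            "success_percent": 100.0 * passed / len(results) if results else None,
        },
        "results": results,
    }


suite_report = run_suite(rows, table_checks + column_checks)
# Keep only the failures, similar in spirit to only_return_failures=True
failures = [r for r in suite_report["results"] if not r["success"]]
print(suite_report["statistics"], failures)
```

Keeping the checks in ordered lists means a structural problem (a missing column, an empty table) appears at the top of the report, before the column-level failures it would likely cause.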

In [6]:
# Expectation suite
expectation_suite = df.get_expectation_suite(discard_failed_expectations=False)
print(df.validate(expectation_suite=expectation_suite, only_return_failures=True))

{
  "success": false,
  "meta": {
    "great_expectations_version": "0.16.8",
    "expectation_suite_name": "default",
    "run_id": {
      "run_name": null,
      "run_time": "2023-04-27T14:01:10.078198+02:00"
    },
    "batch_kwargs": {
      "ge_batch_id": "328ba7e6-e4f3-11ed-b3ad-b0359fae9647"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20230427T120110.078055Z",
    "expectation_suite_meta": {
      "great_expectations_version": "0.16.8"
    }
  },
  "evaluation_parameters": {},
  "statistics": {
    "evaluated_expectations": 2,
    "successful_expectations": 1,
    "unsuccessful_expectations": 1,
    "success_percent": 50.0
  },
  "results": [
    {
      "success": false,
      "meta": {},
      "result": {
        "element_count": 21107,
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_count": 21107,
        "unexpected_percent": 100.0,
        "unexpected_percent_total": 100.0,
        "unexpected_percent_

So far we've worked with the Great Expectations library at the notebook level, but we can further organize our expectations by creating a Project.

In the tests folder, run the command `great_expectations init`.