# Assignment 1

In this assignment, you will get to practice using [TensorFlow Data Validation (TFDV)](https://cloud.google.com/solutions/machine-learning/analyzing-and-validating-data-at-scale-for-ml-using-tfx), an open-source Python package from the [TensorFlow Extended (TFX)](https://www.tensorflow.org/tfx) ecosystem. 

TFDV helps to understand, validate, and monitor production machine learning data at scale. It provides insight into some key questions in the data analysis process such as:

* What are the underlying statistics of my data?

* What does my training dataset look like?

* How do my evaluation and serving datasets compare to the training dataset?

* How can I find and fix data anomalies?

The figure below summarizes the usual TFDV workflow:

<img src='https://i.imgur.com/BtkrYQV.png' alt='picture of tfdv workflow'>

As shown, you can use TFDV to compute descriptive statistics of the training data and generate a schema. You can then validate new datasets (e.g. the serving dataset from your customers) against this schema to detect and fix anomalies. This helps prevent the different types of skew. That way, you can be confident that your model is training on or predicting data that is consistent with the expected feature types and distribution.

In this assignment, you will use TFDV in order to:

* Generate and visualize statistics from a dataframe
* Infer a dataset schema
* Calculate, visualize and fix anomalies

Let's begin!

<a name='1'></a>
## 1 - Setup and Imports

The following packages are pre-installed if you are using GitHub Codespaces.

If you do not use GitHub Codespaces with the pre-installed kernel, please consider creating a conda environment with Python 3.8 and installing the following packages manually.

In [1]:
# %%capture
# !pip install --upgrade 'protobuf<=3.20.1'
# !pip install --upgrade pip
# !pip install python-snappy
# !pip install tensorflow_data_validation[visualization]

In [2]:
# Import packages
import os
import pandas as pd
import tensorflow as tf
import tempfile, urllib, zipfile
import tensorflow_data_validation as tfdv


from tensorflow.python.lib.io import file_io
from tensorflow_data_validation.utils import slicing_util
from tensorflow_metadata.proto.v0.statistics_pb2 import DatasetFeatureStatisticsList, DatasetFeatureStatistics

# Set TF's logger to only display errors to avoid internal warnings being shown
tf.get_logger().setLevel('ERROR')

2023-05-09 12:50:17.272115: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-09 12:50:17.317968: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-09 12:50:17.318733: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


<a name='2'></a>
## 2 - Load the Dataset
You will be using the [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The dataset comprises information on bank marketing activities. The version used here has 17 columns: 16 input features describing the conducted activities and the label column `y`.

This dataset has already been included in your Jupyter workspace. In addition, it is split into training, evaluation, and serving sets so you can easily load it.

In [3]:
# Read CSV data into a dataframe, treating missing data encoded as the string '?' as NaN
train_df = pd.read_csv('data/bank_full_train.csv', header=0, na_values='?', sep=";") # training dataset comprises 70% of the data
eval_df = pd.read_csv('data/bank_full_eval.csv', header=0, na_values='?', sep=";") # evaluation dataset comprises 15% of the data
serving_df = pd.read_csv('data/bank_full_serv.csv', header=0, na_values='?', sep=";") # serving dataset comprises 15% of the data

# Serving data emulates the data that would be submitted for predictions, so it should not have the label column.
serving_df = serving_df.drop(['y'], axis=1)

# Preview the training dataset
train_df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


<a name='3'></a>
## 3 - Generate and Visualize Training Data Statistics

In this section, you will be generating descriptive statistics from the dataset. This is usually the first step when dealing with a dataset you are not yet familiar with. It is also known as performing an *exploratory data analysis* and its purpose is to understand the data types, the data itself and any possible issues that need to be addressed.

It is important to mention that **exploratory data analysis should be performed on the training dataset only**. Getting information out of the evaluation or serving datasets can be seen as "cheating", since this data is used to emulate data that you have not collected yet and will try to predict using your ML algorithm. **In general, it is a good practice to avoid leaking information from your evaluation and serving data into your model.**
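
To make "descriptive statistics" concrete, here is a minimal sketch in plain Python. It is not how TFDV computes its statistics; the sample rows and the helper `describe_feature` are hypothetical and only illustrate the kind of per-feature summary (counts, missing values, range, most frequent value) you will look at in this section.

```python
from collections import Counter
from statistics import mean

# Tiny hypothetical sample; None marks a missing value.
rows = [
    {"age": 58, "job": "management"},
    {"age": 44, "job": "technician"},
    {"age": 33, "job": None},
    {"age": 47, "job": "technician"},
]


def describe_feature(rows, name):
    """Return simple descriptive statistics for one feature (hypothetical helper)."""
    values = [row[name] for row in rows]
    present = [v for v in values if v is not None]
    stats = {
        "count": len(present),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    if not present:
        return stats
    if all(isinstance(v, (int, float)) for v in present):
        # Numeric feature: report range and mean.
        stats.update(min=min(present), max=max(present), mean=mean(present))
    else:
        # Categorical feature: report the most frequent value and its count.
        top, freq = Counter(present).most_common(1)[0]
        stats.update(top=top, freq=freq)
    return stats


age_stats = describe_feature(rows, "age")
job_stats = describe_feature(rows, "job")
print(age_stats)
print(job_stats)
```

The categorical summary mirrors what the Pandas `describe()` method reports for string columns (`count`, `unique`, `top`, `freq`).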

<a name='ex-1'></a>
### Exercise 1: Generate Training Statistics 

TFDV allows you to generate statistics from different data formats such as CSV or a Pandas DataFrame. 

Since you already have the data stored in a DataFrame you can use the function [`tfdv.generate_statistics_from_dataframe()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe) which, given a DataFrame and `stats_options`, generates an object of type `DatasetFeatureStatisticsList`. This object includes the computed statistics of the given dataset.

Complete the cell below to generate the statistics of the training set. Remember to pass the training dataframe and the `stats_options` defined at the top of the cell as arguments.

In [4]:
# Collect features to include while computing the statistics
cols = [col for col in train_df.columns]

# Instantiate a StatsOptions class and define the feature_allowlist property
stats_options = tfdv.StatsOptions(feature_allowlist=cols)

### START CODE HERE
train_stats = tfdv.generate_statistics_from_dataframe(train_df, stats_options)
### END CODE HERE

In [5]:
# TEST CODE

# get the number of features used to compute statistics
print(f"Number of features used: {len(train_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples used: {train_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {train_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {train_stats.datasets[0].features[-1].path.step[0]}")

Number of features used: 17
Number of examples used: 31647
First feature: age
Last feature: y


**Expected Output:**

```
Number of features used: 17
Number of examples used: 31647
First feature: age
Last feature: y
```

<a name='ex-2'></a>
### Exercise 2: Visualize Training Statistics

Now that you have the computed statistics in the `DatasetFeatureStatisticsList` instance, you will need a way to **visualize** these to get actual insights. TFDV provides this functionality through the function [`tfdv.visualize_statistics()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics).

In an interactive Python environment such as this one, this function renders an interactive view of the descriptive statistics you generated earlier.

**Try it out yourself!** Remember to pass in the training statistics generated in the previous exercise as an argument.

In [6]:
### START CODE HERE
tfdv.visualize_statistics(train_stats)
### END CODE HERE

<a name='4'></a>
## 4 - Infer a data schema

A schema defines the **properties of the data** and can thus be used to detect errors. Some of these properties include:

- which features are expected to be present
- feature type
- the number of values for a feature in each example
- the presence of each feature across all examples
- the expected domains of features

The schema is expected to be fairly static, whereas statistics can vary per data split. So, you will **infer the data schema from only the training dataset**. Later, you will generate statistics for evaluation and serving datasets and compare their state with the data schema to detect anomalies, drift and skew.
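
To illustrate what "inferring a schema" means, the following sketch derives a type, a presence flag, and a string domain per feature from a few hypothetical training rows. It is a conceptual illustration in plain Python, not TFDV's inference logic; `infer_simple_schema`, `value_type`, and the sample rows are assumptions made for this example.

```python
# Hypothetical training rows, modeled on the dataset preview shown earlier.
train_rows = [
    {"age": 58, "marital": "married", "default": "no"},
    {"age": 44, "marital": "single", "default": "no"},
    {"age": 33, "marital": "married", "default": "yes"},
]


def value_type(values):
    """Classify a list of non-missing values as STRING, INT, or FLOAT."""
    if any(isinstance(v, str) for v in values):
        return "STRING"
    if all(isinstance(v, int) for v in values):
        return "INT"
    return "FLOAT"


def infer_simple_schema(rows):
    """Derive type, presence, and (for strings) a domain for every feature."""
    schema = {}
    feature_names = sorted({name for row in rows for name in row})
    for name in feature_names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        kind = value_type(present)
        schema[name] = {
            "type": kind,
            # "required" only if every example carries a value.
            "presence": "required" if len(present) == len(values) else "optional",
            # String features get the sorted set of observed values as their domain.
            "domain": sorted(set(present)) if kind == "STRING" else None,
        }
    return schema


simple_schema = infer_simple_schema(train_rows)
for name, spec in simple_schema.items():
    print(name, spec)
```

Because the domain is built only from values seen in training, any new category that first appears in evaluation or serving data will later surface as an anomaly.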

<a name='ex-3'></a>
### Exercise 3: Infer the training set schema

Schema inference is straightforward using [`tfdv.infer_schema()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema). This function needs only the **statistics** (an instance of `DatasetFeatureStatisticsList`) of your data as input. The output will be a Schema [protocol buffer](https://developers.google.com/protocol-buffers) containing the results.

A complementary function is [`tfdv.display_schema()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_schema) for displaying the schema in a table. This accepts a **Schema** protocol buffer as input.

Fill in the code below to infer the schema from the training statistics using TFDV and display the result.

In [7]:
### START CODE HERE
# Infer the data schema by using the training statistics that you generated
schema = tfdv.infer_schema(train_stats)

# Display the data schema
tfdv.display_schema(schema)
### END CODE HERE

Unnamed: 0_level_0,Type,Presence,Valency,Domain
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
'age',INT,required,,-
'job',STRING,required,,'job'
'marital',STRING,required,,'marital'
'education',STRING,required,,'education'
'default',STRING,required,,'default'
'balance',INT,required,,-
'housing',STRING,required,,'housing'
'loan',STRING,required,,'loan'
'contact',STRING,required,,'contact'
'day',INT,required,,-


Unnamed: 0_level_0,Values
Domain,Unnamed: 1_level_1
'job',"'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown'"
'marital',"'divorced', 'married', 'single'"
'education',"'primary', 'secondary', 'tertiary', 'unknown'"
'default',"'no', 'yes'"
'housing',"'no', 'yes'"
'loan',"'no', 'yes'"
'contact',"'cellular', 'telephone', 'unknown'"
'month',"'apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct'"
'poutcome',"'failure', 'other', 'success', 'unknown'"
'y',"'no', 'yes'"


In [8]:
# TEST CODE

# Check number of features
print(f"Number of features in schema: {len(schema.feature)}")

Number of features in schema: 17


**Expected Output:**

```
Number of features in schema: 17
```

**Be sure to check the information displayed before moving forward.**

<a name='5'></a>
## 5 - Calculate, Visualize and Fix Evaluation Anomalies


It is important that the schema of the evaluation data is consistent with the training data, since the data that your model is going to receive should be consistent with the data you used to train it.

Moreover, it is also important that the **features of the evaluation data belong roughly to the same range as the training data**. This ensures that the model will be evaluated on a similar loss surface covered during training.
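
As a rough illustration of the range requirement, the sketch below computes the range of a numeric feature on training data and the share of evaluation values that fall outside it. The values and the helper names `value_range` and `fraction_outside` are hypothetical; TFDV's overlay visualization gives you the same insight interactively.

```python
def value_range(values):
    """Return the (min, max) range of a numeric feature."""
    return min(values), max(values)


def fraction_outside(values, low, high):
    """Return the share of values that lie outside the closed range [low, high]."""
    outside = [v for v in values if v < low or v > high]
    return len(outside) / len(values)


# Hypothetical age values for the two splits.
train_age = [18, 25, 33, 47, 58, 95]
eval_age = [20, 31, 44, 60, 97]

low, high = value_range(train_age)
share = fraction_outside(eval_age, low, high)
print(f"training range: [{low}, {high}], share of eval values outside: {share:.0%}")
```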

<a name='ex-4'></a>
### Exercise 4: Compare Training and Evaluation Statistics

Now you are going to generate the evaluation statistics and compare them with the training statistics. You can use the [`tfdv.generate_statistics_from_dataframe()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe) function for this. But this time, you'll need to pass the **evaluation data**. For the `stats_options` parameter, the `stats_options` object you defined before works here too.

Remember that to visualize the evaluation statistics you can use [`tfdv.visualize_statistics()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics). 

However, it is impractical to visualize both sets of statistics separately and compare them by eye. Fortunately, TFDV has this covered. You can use the `visualize_statistics` function and pass additional parameters to overlay the statistics from both datasets (referenced as left-hand side and right-hand side statistics). Let's see what these parameters are:

- `lhs_statistics`: Required parameter. Expects an instance of `DatasetFeatureStatisticsList`.
- `rhs_statistics`: Expects an instance of `DatasetFeatureStatisticsList` to compare with `lhs_statistics`.
- `lhs_name`: Name of the `lhs_statistics` dataset.
- `rhs_name`: Name of the `rhs_statistics` dataset.

For this case, remember to set the `lhs_statistics` parameter to `eval_stats` and the optional `rhs_statistics` parameter to `train_stats`.

Additionally, check the function's documentation for the name parameters, and set the lhs and rhs names to `'EVAL_DATASET'` and `'TRAIN_DATASET'` respectively.

In [9]:
### START CODE HERE
# Generate evaluation dataset statistics
# HINT: Remember to use the evaluation dataframe and to pass the stats_options (that you defined before) as an argument
eval_stats = tfdv.generate_statistics_from_dataframe(eval_df, stats_options=stats_options)

# Compare evaluation data with training data
# HINT: Remember to use both the evaluation and training statistics with the lhs_statistics and rhs_statistics arguments
# HINT: Assign the names 'EVAL_DATASET' and 'TRAIN_DATASET' to the lhs_name and rhs_name arguments
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

### END CODE HERE

In [10]:
# TEST CODE

# get the number of features used to compute statistics
print(f"Number of features: {len(eval_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples: {eval_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {eval_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {eval_stats.datasets[0].features[-1].path.step[0]}")

Number of features: 17
Number of examples: 6781
First feature: age
Last feature: y


**Expected Output:**

```
Number of features: 17
Number of examples: 6781
First feature: age
Last feature: y
```

<a name='ex-5'></a>
### Exercise 5: Detecting Anomalies

At this point, you should ask whether your evaluation dataset matches the schema from your training dataset. For instance, if you scroll through the output cell in the previous exercise, you can see that the categorical feature **marital** has 3 unique values in the training set while the evaluation dataset has 4. You can verify this with the built-in Pandas `describe()` method as well.

In [11]:
train_df["marital"].describe()

count       31647
unique          3
top       married
freq        19782
Name: marital, dtype: object

In [12]:
eval_df["marital"].describe()

count        6781
unique          4
top       married
freq         3898
Name: marital, dtype: object

It is possible but highly inefficient to visually inspect and determine all the anomalies. So, let's instead use TFDV functions to detect and display these.

You can use the function [`tfdv.validate_statistics()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics) for detecting anomalies and [`tfdv.display_anomalies()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_anomalies) for displaying them.

The `validate_statistics()` function has two required arguments:
- an instance of `DatasetFeatureStatisticsList`
- an instance of `Schema`

Fill in the following graded function which, given the statistics and schema, displays the anomalies found.

In [13]:
def calculate_and_display_anomalies(statistics, schema):
    '''
    Calculate and display anomalies.

    Parameters:
        statistics: Data statistics in statistics_pb2.DatasetFeatureStatisticsList format
        schema: Data schema in schema_pb2.Schema format

    Returns:
        None. The calculated anomalies are displayed as a side effect.
    '''
    ### START CODE HERE
    # HINT: Pass the statistics and schema parameters into the validation function
    anomalies = tfdv.validate_statistics(statistics, schema)

    # HINT: Display the calculated anomalies
    tfdv.display_anomalies(anomalies)
    ### END CODE HERE

You should see detected anomalies in the `marital` and `contact` features by running the cell below.

In [14]:
# Check evaluation data for errors by validating the evaluation data statistics using the previously inferred schema
calculate_and_display_anomalies(eval_stats, schema=schema)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'marital',Unexpected string values,Examples contain values missing from the schema: widowed (<1%).
'contact',Unexpected string values,Examples contain values missing from the schema: email (<1%).


<a name='ex-6'></a>
### Exercise 6: Fix evaluation anomalies in the schema

The evaluation data has records with values for the features **marital** and **contact** that were not included in the schema generated from the training data. You can fix this by adding the new values that exist in the evaluation dataset to the domain of these features.

To get the `domain` of a particular feature you can use [`tfdv.get_domain()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/get_domain).

You can call the `append()` method on the `value` property of the returned `domain` to add strings to the list of valid values. To be more explicit, given a domain you can do something like:

```python
domain.value.append("feature_value")
```

In [15]:
### START CODE HERE

# Get the domain associated with the input feature, marital, from the schema
marital_domain = tfdv.get_domain(schema, 'marital') 

# HINT: Append the missing value 'widowed' to the domain
marital_domain.value.append("widowed")

# Get the domain associated with the input feature, contact, from the schema
contact_domain = tfdv.get_domain(schema, 'contact') 

# HINT: Append the missing value 'email' to the domain
contact_domain.value.append("email")

# HINT: Re-calculate and re-display anomalies with the new schema
calculate_and_display_anomalies(eval_stats, schema=schema)
### END CODE HERE

If you did the exercise correctly, you should see *"No anomalies found."* after running the cell above.
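
The effect of extending a domain can also be illustrated without TFDV. The sketch below is a simplified stand-in for the "Unexpected string values" check: it reports every value that is missing from a domain together with its share of all values, before and after the domain is extended. The helper `unexpected_values` and the sample values are hypothetical.

```python
from collections import Counter


def unexpected_values(values, domain):
    """Map each value that is missing from the domain to its share of all values."""
    counts = Counter(v for v in values if v not in domain)
    return {value: n / len(values) for value, n in counts.items()}


# Domain as inferred from the training data, plus hypothetical evaluation values.
known_marital_values = ["divorced", "married", "single"]
eval_marital_values = ["married"] * 6 + ["single"] * 3 + ["widowed"]

before = unexpected_values(eval_marital_values, known_marital_values)
print("before:", before)

# Extending the domain makes the new value valid, so the check passes afterwards.
known_marital_values.append("widowed")
after = unexpected_values(eval_marital_values, known_marital_values)
print("after:", after)
```

Note that extending the domain changes what the schema accepts; it does not change the data. This is only appropriate when the new value is legitimate rather than a data error.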

<a name='6'></a>
## 6 - Schema Environments

By default, all datasets in a pipeline should use the same schema. However, there are some exceptions. 

For example, the **label column is dropped in the serving set** so this will be flagged when comparing with the training set schema. 

**In this case, introducing slight schema variations is necessary.**

<a name='ex-7'></a>
### Exercise 7: Check anomalies in the serving set

Now you are going to check for anomalies in the **serving data**. The process is very similar to the one you previously followed for the evaluation data, with one small change.

Let's create a new `StatsOptions` that is aware of the information provided by the schema and use it when generating statistics from the serving DataFrame.

In [16]:
# Define new statistics options for the serving data with tfdv.StatsOptions, passing the previously inferred schema
options = tfdv.StatsOptions(schema=schema, 
                            infer_type_from_schema=True, 
                            feature_allowlist=cols)

In [17]:
### START CODE HERE
# Generate serving dataset statistics
# HINT: Remember to use the serving dataframe and to pass the newly defined statistics options
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df, stats_options=options)

# HINT: Calculate and display anomalies using the generated serving statistics
calculate_and_display_anomalies(serving_stats, schema=schema)
### END CODE HERE

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'balance',Unexpected data type,Expected data of type: INT but got STRING
'month',Unexpected string values,Examples contain values missing from the schema: sep (~8%).
'y',Column dropped,Column is completely missing


You should see that the `month` feature has an anomaly (i.e. Unexpected string values) affecting roughly 8% of the examples. This is due to the time span in which the data was collected and is therefore negligible.

Let's **relax the anomaly detection constraint** for this feature by defining the `min_domain_mass` of the feature's distribution constraints.

In [18]:
# This relaxes the minimum fraction of values that must come from the domain for the feature.

# Get the feature and require only 90% of its values to come from the domain
month = tfdv.get_feature(schema, 'month')
month.distribution_constraints.min_domain_mass = 0.9 

# Detect anomalies with the updated constraints
calculate_and_display_anomalies(serving_stats, schema=schema)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'y',Column dropped,Column is completely missing
'balance',Unexpected data type,Expected data of type: INT but got STRING


If the `month` feature is no longer part of the output cell, then the relaxation worked!
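
The idea behind `min_domain_mass` can be sketched in a few lines of plain Python: compute the fraction of values that belong to the domain and compare it with the required minimum. The helper `domain_mass` and the sample values are hypothetical, and the strict requirement of `1.0` is stated here as an assumption about the default behaviour, not taken from this assignment.

```python
def domain_mass(values, domain):
    """Return the fraction of values that belong to the domain."""
    in_domain = sum(1 for v in values if v in domain)
    return in_domain / len(values)


# Domain inferred from the training data: it does not contain 'sep'.
known_months = {"apr", "aug", "dec", "feb", "jan", "jul",
                "jun", "mar", "may", "nov", "oct"}

# Hypothetical serving values in which 8% of the examples are from September.
serving_months = ["may"] * 92 + ["sep"] * 8

mass = domain_mass(serving_months, known_months)
passes_strict = mass >= 1.0   # assumption: every value must come from the domain
passes_relaxed = mass >= 0.9  # relaxed constraint used in this exercise
print(f"domain mass: {mass:.2f}, strict: {passes_strict}, relaxed: {passes_relaxed}")
```

With 92% of the values inside the domain, the relaxed constraint of `0.9` is satisfied, which is why the `month` anomaly disappears.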

<a name='ex-8'></a>
### Exercise 8: Detecting anomalies with environments

There is still one thing to address. The `y` feature (which is the label column) showed up as an anomaly ('Column dropped'). Since labels are not expected in the serving data, let's tell TFDV to ignore this detected anomaly.

This requirement of introducing slight schema variations can be expressed by using [environments](https://www.tensorflow.org/tfx/data_validation/get_started#schema_environments). In particular, features in the schema can be associated with a set of environments using `default_environment`, `in_environment` and `not_in_environment`.

In [19]:
# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

Complete the code below to exclude the `y` feature from the `SERVING` environment.

To achieve this, you can use the [`tfdv.get_feature()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/get_feature) function to get the `y` feature from the inferred schema and use its `not_in_environment` attribute to specify that `y` should be removed from the `SERVING` environment's schema. This **attribute is a list** so you will have to **append** the name of the environment that you wish to omit this feature for.

To be more explicit, given a feature you can do something like:

```python
feature.not_in_environment.append('NAME_OF_ENVIRONMENT')
```

The function `tfdv.get_feature` receives the following parameters:

- `schema`: The schema.
- `feature_path`: The path of the feature to obtain from the schema. In this case this is equal to the name of the feature.

In [20]:
### START CODE HERE
# Specify that 'y' feature is not in SERVING environment.
# HINT: Append the 'SERVING' environment to the not_in_environment attribute of the feature
tfdv.get_feature(schema, 'y').not_in_environment.append('SERVING')

# HINT: Calculate anomalies with the validate_statistics function by using the serving statistics, 
# inferred schema and the SERVING environment parameter.
serving_anomalies_with_env = tfdv.validate_statistics(serving_stats, schema, environment="SERVING")
### END CODE HERE

After running the cell below, the `y` anomaly ('Column dropped') should no longer be listed.

In [21]:
# Display anomalies
tfdv.display_anomalies(serving_anomalies_with_env)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'balance',Unexpected data type,Expected data of type: INT but got STRING


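Now you have successfully addressed the `month` and `y` anomalies! The recorded output above still flags `balance` (expected `INT`, got `STRING`), which this assignment does not resolve.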
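
The interplay of `default_environment`, `in_environment` and `not_in_environment` can be modeled with a few set operations. The sketch below is a simplified model written for this explanation, not TFDV's implementation; the function `expected_features` and the dictionary layout are hypothetical.

```python
def expected_features(features, default_environments, environment):
    """Return the names of the features that are expected in the given environment."""
    expected = []
    for name, spec in features.items():
        # Start from the schema-wide defaults, then apply per-feature overrides.
        environments = set(default_environments)
        environments |= set(spec.get("in_environment", []))
        environments -= set(spec.get("not_in_environment", []))
        if environment in environments:
            expected.append(name)
    return expected


# The label 'y' is excluded from the SERVING environment, as in this exercise.
features = {
    "age": {},
    "loan": {},
    "y": {"not_in_environment": ["SERVING"]},
}
defaults = ["TRAINING", "SERVING"]

training_features = expected_features(features, defaults, "TRAINING")
serving_features = expected_features(features, defaults, "SERVING")
print("TRAINING:", training_features)
print("SERVING:", serving_features)
```

Under this model, a missing `y` column is only an anomaly when validating in the `TRAINING` environment.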

<a name='7'></a>
## 7 - Check for Data Drift and Skew

During data validation, you also need to check for data drift and data skew between the training and serving data. You can do this by specifying the [skew_comparator and drift_comparator](https://www.tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift) in the schema. 

For categorical features, drift and skew are expressed in terms of the [L-infinity distance](https://en.wikipedia.org/wiki/Chebyshev_distance), which evaluates the difference between vectors as the greatest of the differences along any coordinate dimension. For numeric features, the [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen–Shannon_divergence) is used, a measure related to the Kullback-Leibler divergence.

You can set the threshold distance so that you receive warnings when the drift is higher than is acceptable.  Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.
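
Both distance measures can be written down in a few lines. The sketch below computes them exactly on small hypothetical distributions, whereas TFDV reports an approximate Jensen-Shannon divergence computed from its own histograms. The base-2 logarithm (which bounds the divergence by 1) and all function names are assumptions of this example.

```python
import math
from collections import Counter


def normalized_counts(values):
    """Turn a list of values into a dict of relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def l_infinity_distance(p, q):
    """Largest absolute difference between two discrete distributions."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        # Kullback-Leibler divergence; terms with zero probability contribute nothing.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical 'loan' distributions for two dataset spans.
current = normalized_counts(["no"] * 84 + ["yes"] * 16)
previous = normalized_counts(["no"] * 80 + ["yes"] * 20)

distance = l_infinity_distance(current, previous)
print(f"L-infinity distance: {distance:.4f}")
print("above 0.03:", distance > 0.03, "| above 0.05:", distance > 0.05)

same = jensen_shannon_divergence(current, current)
disjoint = jensen_shannon_divergence({"a": 1.0}, {"b": 1.0})
print(f"JSD identical: {same:.2f}, JSD disjoint: {disjoint:.2f}")
```

The example shows why the threshold matters: the same distance of about `0.04` is flagged with a threshold of `0.03` but accepted with `0.05`.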

Let's check for the skew in the **age** feature and drift in the **loan** feature.

In [22]:
# Set the skew threshold for the age feature
skew = tfdv.get_feature(schema, 'age')
skew.skew_comparator.jensen_shannon_divergence.threshold = 0.03 # domain knowledge helps to determine this threshold

# Set the drift threshold for the loan feature
drift = tfdv.get_feature(schema, 'loan')
drift.drift_comparator.infinity_norm.threshold = 0.03 # domain knowledge helps to determine this threshold

# Calculate anomalies
skew_drift_anomalies = tfdv.validate_statistics(train_stats, schema,
                                          previous_statistics=eval_stats,
                                          serving_statistics=serving_stats)

# Display anomalies
tfdv.display_anomalies(skew_drift_anomalies)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'age',High approximate Jensen-Shannon divergence between training and serving,"The approximate Jensen-Shannon divergence between training and serving is 0.94206 (up to six significant digits), above the threshold 0.03."
'loan',High Linfty distance between current and previous,"The Linfty distance between current and previous is 0.0426477 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: no"


The detected anomaly distance for the feature **loan** is not too far from the threshold value of `0.03`. For this exercise, let's accept this as within bounds (i.e. you can set the distance to something like `0.05` instead).

In [23]:
# Keep the skew threshold for the age feature
skew = tfdv.get_feature(schema, 'age')
skew.skew_comparator.jensen_shannon_divergence.threshold = 0.03 # domain knowledge helps to determine this threshold

# Raise the drift threshold for the loan feature
drift = tfdv.get_feature(schema, 'loan')
drift.drift_comparator.infinity_norm.threshold = 0.05 # domain knowledge helps to determine this threshold

# Calculate anomalies
skew_drift_anomalies = tfdv.validate_statistics(train_stats, schema,
                                          previous_statistics=eval_stats,
                                          serving_statistics=serving_stats)

# Display anomalies
tfdv.display_anomalies(skew_drift_anomalies)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'age',High approximate Jensen-Shannon divergence between training and serving,"The approximate Jensen-Shannon divergence between training and serving is 0.94206 (up to six significant digits), above the threshold 0.03."


Clearly, there still is an issue with the feature **age** in the serving set. This needs further investigation.

In [24]:
# Compare serving data with training data
# HINT: Remember to use both the serving and training statistics with the lhs_statistics and rhs_statistics arguments
# HINT: Assign the names 'SERVING_DATASET' and 'TRAIN_DATASET' to the lhs_name and rhs_name arguments
tfdv.visualize_statistics(lhs_statistics=serving_stats, rhs_statistics=train_stats,
                          lhs_name='SERVING_DATASET', rhs_name='TRAIN_DATASET')

The maximum value for the feature **age** is extremely high. To identify any systematic errors, look at the unique values of the feature **age**.

In [25]:
# Look for the unique values of the feature age in the serving_df

serving_df["age"].unique()

array([   28,    37,    29,    41,    32,    39,    30,    27,    58,
          40,    21,    31,    34,    60,    23,    22,    25,    52,
          33,    50,    26,    48,    46,    38,    42,    44,    47,
          35,    24,    51,    36,    49,    57,    56,    45,    59,
          43, 30000,    55,    54,    53,    20,    83,    61,    78,
          73,    64,    79,    68,    19,    66,    71,    63,    80,
          69,    72,    70,    62,    82,    67,    65,    81,    75,
          74,    77,    18,    76,    89,    84,    86,    95,    87,
          92,    85,    90,    93,    88])

There is the value 30000 for age which clearly does not fit. You are now interested in identifying the respective rows with this value.

In [26]:
# Search for the rows with the value 30000 for the feature age in the serving_df

serving_df[serving_df["age"]==30000]

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
127,30000,technician,married,secondary,no,225,yes,no,telephone,15,may,22,7,337,12,failure


As only one row is affected by the issue and there are plenty left, you can simply remove the affected row. Please do this now.

In [27]:
# Remove the affected row (index 127) from serving_df

serving_df = serving_df.drop(127)

# Regenerate the serving statistics without the removed row
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df, stats_options=options)

# Keep the skew threshold for the age feature
skew = tfdv.get_feature(schema, 'age')
skew.skew_comparator.jensen_shannon_divergence.threshold = 0.03 # domain knowledge helps to determine this threshold

# Calculate anomalies
skew_drift_anomalies = tfdv.validate_statistics(train_stats, schema,
                                          previous_statistics=eval_stats,
                                          serving_statistics=serving_stats)

# Display anomalies
tfdv.display_anomalies(skew_drift_anomalies)

Unnamed: 0_level_0,Anomaly short description,Anomaly long description
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1
'age',High approximate Jensen-Shannon divergence between training and serving,"The approximate Jensen-Shannon divergence between training and serving is 0.0854732 (up to six significant digits), above the threshold 0.03."


The skew for the feature **age** has declined immensely. From here, you can decide whether to investigate the case further or to raise your threshold because the remaining skew is negligible. For this assignment, we raise the threshold to 0.10 and thus relax the constraint.

In [28]:
# Raise the skew threshold for the age feature
skew = tfdv.get_feature(schema, 'age')
skew.skew_comparator.jensen_shannon_divergence.threshold = 0.10 # domain knowledge helps to determine this threshold

# Calculate anomalies
skew_drift_anomalies = tfdv.validate_statistics(train_stats, schema,
                                          previous_statistics=eval_stats,
                                          serving_statistics=serving_stats)

# Display anomalies
tfdv.display_anomalies(skew_drift_anomalies)

Now, the issue should be resolved and no anomalies should be found.

<a name='8'></a>
## 8 - Freeze the schema

Now that the schema has been reviewed, you will store the schema in a file in its "frozen" state. This can be used to validate incoming data once your application goes live to your users.

This is pretty straightforward using TensorFlow's `io` utils and TFDV's [`write_schema_text()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/write_schema_text) function.

In [29]:
# Create output directory
OUTPUT_DIR = "output"
file_io.recursive_create_dir(OUTPUT_DIR)

# Use TensorFlow text output format pbtxt to store the schema
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')

# The write_schema_text function expects the defined schema and the output path as parameters
tfdv.write_schema_text(schema, schema_file) 

After submitting this assignment, you can click the Jupyter logo in the left upper corner of the screen to check the Jupyter filesystem. The `schema.pbtxt` file should be inside the `output` directory. 

**Congratulations on finishing this week's assignment!** A lot of concepts were introduced, and you should now feel more familiar with using TFDV for inferring schemas, anomaly detection and other data-related tasks.

**Keep it up!**