# Workshop CBS / Capgemini

In this workshop exercise you will practice using [TensorFlow Data Validation (TFDV)](https://cloud.google.com/solutions/machine-learning/analyzing-and-validating-data-at-scale-for-ml-using-tfx), an open-source Python package from the [TensorFlow Extended (TFX)](https://www.tensorflow.org/tfx) ecosystem.
TFDV helps you understand, validate, and monitor production machine learning (ML) data at scale. Common use cases include comparing training, evaluation and serving datasets, as well as checking for training/serving skew.

In this exercise you will use TFDV in order to:

* Generate and visualize statistics from a dataframe
* Infer a dataset schema
* Calculate, visualize and fix anomalies

Let's begin!

## Table of Contents

- [1 - Setup and Imports](#1)
- [2 - Load the Dataset](#2)
  - [2.1 - Read and Split the Dataset](#2-1)
    - [2.1.1 - Data Splits](#2-1-1)
    - [2.1.2 - Label Column](#2-1-2)
- [3 - Generate and Visualize Training Data Statistics](#3)
  - [3.1 - Removing Irrelevant Features](#3-1)
  - [Generate Training Statistics](#ex-1)
  - [Visualize Training Statistics](#ex-2)
- [4 - Infer a Data Schema](#4)
  - [Infer the training set schema](#ex-3)
- [5 - Calculate, Visualize and Fix Evaluation Anomalies](#5)
  - [Compare Training and Evaluation Statistics](#ex-4)
  - [Detecting Anomalies](#ex-5)
  - [Fix evaluation anomalies in the schema](#ex-6)
- [6 - Schema Environments](#6)
  - [Check anomalies in the serving set](#ex-7)
  - [Modifying the domain](#ex-8)
  - [Detecting anomalies with environments](#ex-9)
- [7 - Check for Data Drift and Skew](#7)
- [8 - Freeze the Schema](#8)

<a name='1'></a>
## 1 - Setup and Imports

Note: if you have not used these packages before, you need to install them first, for example by uncommenting the `pip install` lines in the cell below.

In [None]:
# Import packages
import os
import pandas as pd
#!pip install tensorflow
import tensorflow as tf
import tempfile, urllib, zipfile
#!pip install tensorflow_data_validation
import tensorflow_data_validation as tfdv

from tensorflow.python.lib.io import file_io
from tensorflow_data_validation.utils import slicing_util
from tensorflow_metadata.proto.v0.statistics_pb2 import DatasetFeatureStatisticsList, DatasetFeatureStatistics

# Set TF's logger to only display errors to avoid internal warnings being shown
tf.get_logger().setLevel('ERROR')

<a name='2'></a>
## 2 - Load the Dataset
You will be using the [Diabetes 130-US hospitals for years 1999-2008 Data Set](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008) donated to the University of California, Irvine (UCI) Machine Learning Repository. The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes.


<a name='2-1'></a>
### 2.1 Read and Split the Dataset

Start by downloading the dataset from the website: [Diabetes 130-US hospitals for years 1999-2008 Data Set](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008). Then change the file path in the cell below so that it points to your local copy of `diabetic_data.csv`.

**Task**: Read the data and then preview the data by using `head()`.

In [None]:
# Read CSV data into a dataframe and recognize the missing data that is encoded with '?' string as NaN
df = pd.read_csv('diabetic_data.csv', header=0, na_values = '?')

# Preview the dataset


<a name='2-1-1'></a>
#### Data Splits

In a production ML system, the model performance can be negatively affected by anomalies and divergence between data splits for training, evaluation, and serving. To emulate a production system, you will split the dataset into:

* 70% training set 
* 15% evaluation set
* 15% serving set

You will then use TFDV to visualize, analyze, and understand the data. You will create a data schema from the training dataset, then compare the evaluation and serving sets with this schema to detect anomalies and data drift/skew.

<a name='2-1-2'></a>
#### Label Column

This dataset has been prepared to analyze the factors related to readmission outcome. In this notebook, you will treat the `readmitted` column as the *target* or label column. 

The target (or label) is important to know while splitting the data into training, evaluation and serving sets. In supervised learning, you need to include the target in the training and evaluation datasets. For the serving set however (i.e. the set that simulates the data coming from your users), the **label column needs to be dropped** since that is the feature that your model will be trying to predict.

**Task**: Split the dataset and return the training, evaluation and serving data:
* train_df: Training dataframe (70% of the entire dataset)
* eval_df: Evaluation dataframe (15% of the entire dataset) 
* serving_df: Serving dataframe (15% of the entire dataset, label column dropped)

**Drop the label column in the serving data**

How many records are in each of the three datasets?
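
One possible way to produce the three splits is sketched below on a toy DataFrame. The helper name `split_dataframe`, the purely positional split (no shuffling) and the toy columns are assumptions made for illustration, not the workshop's reference solution.

```python
import pandas as pd


def split_dataframe(df, label_column, train_frac=0.70):
    """Positional 70/15/15 split; the label is dropped from the serving part."""
    train_len = int(len(df) * train_frac)
    eval_len = (len(df) - train_len) // 2

    train_df = df.iloc[:train_len].reset_index(drop=True)
    eval_df = df.iloc[train_len:train_len + eval_len].reset_index(drop=True)
    serving_df = (
        df.iloc[train_len + eval_len:]
        .drop(columns=[label_column])
        .reset_index(drop=True)
    )
    return train_df, eval_df, serving_df


# Toy stand-in for the real data: 100 rows, one feature and the label.
toy_df = pd.DataFrame({
    "time_in_hospital": range(100),
    "readmitted": ["NO", "<30", ">30", "NO"] * 25,
})
toy_train, toy_eval, toy_serving = split_dataframe(toy_df, "readmitted")
print(len(toy_train), len(toy_eval), len(toy_serving))
print(list(toy_serving.columns))
```

On the real data you would call the helper with `df` and `'readmitted'` and assign the results to `train_df`, `eval_df` and `serving_df`.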

In [None]:
# Find the number of records in each dataset:
print('Training dataset has {} records\nEvaluation dataset has {} records\nServing dataset has {} records'.format(len(train_df),len(eval_df),len(serving_df)))

<a name='3'></a>
## 3 - Generate and Visualize Training Data Statistics

In this section, you will be generating descriptive statistics from the dataset. This is usually the first step when dealing with a dataset you are not yet familiar with. It is also known as performing an *exploratory data analysis* and its purpose is to understand the data types, the data itself and any possible issues that need to be addressed.

It is important to mention that **exploratory data analysis should be performed on the training dataset** only. This is because getting information out of the evaluation or serving datasets can be seen as "cheating", since this data is used to emulate data that you have not collected yet and will try to predict using your ML algorithm. **In general, it is a good practice to avoid leaking information from your evaluation and serving data into your model.**

<a name='3-1'></a>
### Removing Irrelevant Features

Before you generate the statistics, you want to drop irrelevant features from your dataset. TFDV lets you do that through the [tfdv.StatsOptions](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions) class. It is usually **not a good idea** to drop features without knowing what information they contain. However, there are times when this can be fairly obvious.

One of the important parameters of the `StatsOptions` class is `feature_allowlist`, which defines the features to include while calculating the data statistics. You can check the [documentation](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions#args) to learn more about the class arguments.

**Task**: Omit the variables `encounter_id` and `patient_nbr` from the data, since they are part of the hospital's internal patient tracking and contain no valuable information for the task at hand. Create a [tfdv.StatsOptions](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions) instance whose `feature_allowlist` holds the remaining columns and name it *stats_options*. Then review the selected features by printing `stats_options.feature_allowlist`.


<a name='ex-1'></a>
### Generate Training Statistics 

TFDV allows you to generate statistics from different data formats such as CSV or a Pandas DataFrame. 

Since you already have the data stored in a DataFrame you can use the function [`tfdv.generate_statistics_from_dataframe()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe) which, given a DataFrame and `stats_options`, generates an object of type `DatasetFeatureStatisticsList`. This object includes the computed statistics of the given dataset.

**Task**: Generate the statistics of the training set and name it *train_stats*. Remember to pass the training dataframe and the `stats_options` that you defined above as arguments.

You can test your code with the following code:

In [None]:
# get the number of features used to compute statistics
print(f"Number of features used: {len(train_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples used: {train_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {train_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {train_stats.datasets[0].features[-1].path.step[0]}")

**Expected Output:**

```
Number of features used: 48
Number of examples used: 71236
First feature: race
Last feature: readmitted
```

<a name='ex-2'></a>
### Visualize Training Statistics

Now that you have the computed statistics in the `DatasetFeatureStatisticsList` instance, you will need a way to **visualize** these to get actual insights. TFDV provides this functionality through the function [`tfdv.visualize_statistics()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics).

Calling this function in an interactive Python environment such as this one renders an interactive view of the descriptive statistics you generated earlier.

**Task**: Visualize the training statistics with `tfdv.visualize_statistics()`.

<a name='4'></a>
## 4 - Infer a Data Schema

A schema defines the **properties of the data** and can thus be used to detect errors. Some of these properties include:

- which features are expected to be present
- feature type
- the number of values for a feature in each example
- the presence of each feature across all examples
- the expected domains of features

The schema is expected to be fairly static, whereas statistics can vary per data split. So, you will **infer the data schema from only the training dataset**. Later, you will generate statistics for evaluation and serving datasets and compare their state with the data schema to detect anomalies, drift and skew.

<a name='ex-3'></a>
### Infer the training set schema

Schema inference is straightforward using [`tfdv.infer_schema()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema). This function needs only the **statistics** (an instance of `DatasetFeatureStatisticsList`) of your data as input. The output will be a Schema [protocol buffer](https://developers.google.com/protocol-buffers) containing the results.

A complementary function is [`tfdv.display_schema()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_schema) for displaying the schema in a table. This accepts a **Schema** protocol buffer as input.

**Task**: Infer the training schema and name it *schema*. Then display the data schema. 

In [None]:
# Infer the data schema by using the training statistics that you generated


# Display the data schema


You can test your code with the following cell:

In [None]:
# Check number of features
print(f"Number of features in schema: {len(schema.feature)}")

# Check domain name of 2nd feature
print(f"Second feature in schema: {list(schema.feature)[1].domain}")

**Expected Output:**

```
Number of features in schema: 48
Second feature in schema: gender
```

<a name='5'></a>
## 5 - Calculate, Visualize and Fix Evaluation Anomalies


It is important that the schema of the evaluation data is consistent with that of the training data, since the data your model receives should match the data it was trained on.

Moreover, it is also important that the **features of the evaluation data belong roughly to the same range as the training data**. This ensures that the model will be evaluated on a similar loss surface covered during training.

<a name='ex-4'></a>
### Compare Training and Evaluation Statistics

Now you are going to generate the evaluation statistics and compare them with the training statistics. You can use the [`tfdv.generate_statistics_from_dataframe()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe) function for this. But this time, you'll need to pass the **evaluation data**. For the `stats_options` parameter, the options object you defined before works here too.

Remember that to visualize the evaluation statistics you can use [`tfdv.visualize_statistics()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics). 

However, it is impractical to visualize both statistics separately and do your comparison from there. Fortunately, TFDV has got this covered. You can use the `visualize_statistics` function and pass additional parameters to overlay the statistics from both datasets (referenced as left-hand side and right-hand side statistics). Let's see what these parameters are:

- `lhs_statistics`: Required parameter. Expects an instance of `DatasetFeatureStatisticsList`.
- `rhs_statistics`: Expects an instance of `DatasetFeatureStatisticsList` to compare with `lhs_statistics`.
- `lhs_name`: Name of the `lhs_statistics` dataset.
- `rhs_name`: Name of the `rhs_statistics` dataset.

For this case, remember to pass `eval_stats` as the `lhs_statistics` argument and `train_stats` as the optional `rhs_statistics` argument.

Additionally, set the `lhs_name` and `rhs_name` arguments to `'EVAL_DATASET'` and `'TRAIN_DATASET'` respectively.

**Task**: Generate the statistics of the evaluation dataset and call them *eval_stats*. Then visualize the differences between the evaluation and training data. Remember to use the two functions described above.

In [None]:
# Generate evaluation dataset statistics
# HINT: Remember to use the evaluation dataframe and to pass the stats_options (that you defined before) as an argument

# Compare evaluation data with training data 
# HINT: Remember to use both the evaluation and training statistics with the lhs_statistics and rhs_statistics arguments
# HINT: Assign the names 'EVAL_DATASET' and 'TRAIN_DATASET' to the lhs_name and rhs_name arguments



You can test your code with the following cell:

In [None]:
# get the number of features used to compute statistics
print(f"Number of features: {len(eval_stats.datasets[0].features)}")

# check the number of examples used
print(f"Number of examples: {eval_stats.datasets[0].num_examples}")

# check the column names of the first and last feature
print(f"First feature: {eval_stats.datasets[0].features[0].path.step[0]}")
print(f"Last feature: {eval_stats.datasets[0].features[-1].path.step[0]}")

**Expected Output:**

```
Number of features: 48
Number of examples: 15265
First feature: race
Last feature: readmitted
```

<a name='ex-5'></a>
### Detecting Anomalies

At this point, you should ask if your evaluation dataset matches the schema from your training dataset. For instance, if you scroll through the output cell in the previous exercise, you can see that the categorical feature **glimepiride-pioglitazone** has 1 unique value in the training set while the evaluation dataset has 2. You can verify with the built-in Pandas `describe()` method as well.

**Task**: Use `describe()` on the feature "glimepiride-pioglitazone" in both the training data and the evaluation data.

It is possible but highly inefficient to visually inspect and determine all the anomalies. So, let's instead use TFDV functions to detect and display these. 
You can use the function [`tfdv.validate_statistics()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics) for detecting anomalies and [`tfdv.display_anomalies()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_anomalies) for displaying them.

The `validate_statistics()` function has two required arguments:
- an instance of `DatasetFeatureStatisticsList`
- an instance of `Schema`

**Task**: Detect the anomalies and name them *anomalies*, and then visualize the anomalies in the data.

In [None]:
# HINTS: Pass the statistics and schema parameters into the validation function 

 
# HINTS: Display input anomalies by using the calculated anomalies




You should see detected anomalies in the `medical_specialty` and `glimepiride-pioglitazone` features after running the cell above.

<a name='ex-6'></a>
### Fix evaluation anomalies in the schema

The evaluation data has records with values for the features **glimepiride-pioglitazone** and **medical_specialty** that were not included in the schema generated from the training data. You can fix this by adding the new values that exist in the evaluation dataset to the domain of these features.

To get the `domain` of a particular feature you can use [`tfdv.get_domain()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/get_domain).

You can call the `append()` method on the `value` property of the returned `domain` to add strings to the list of valid values. To be more explicit, given a domain you can do something like:

```python
domain.value.append("feature_value")
```

**Task**: Start by getting the domain associated with the feature 'glimepiride-pioglitazone', and name it *glimepiride_pioglitazone_domain*. Then append the missing value 'Steady' to the domain. Follow the same procedure for the feature 'medical_specialty', where the missing value is 'Neurophysiology'. Then recalculate and redisplay the anomalies with the updated schema.

If you did the exercise correctly, you should see *"No anomalies found."* after running your code.

<a name='6'></a>
## 6 - Schema Environments

By default, all datasets in a pipeline should use the same schema. However, there are some exceptions. 

For example, the **label column is dropped in the serving set** so this will be flagged when comparing with the training set schema. 

**In this case, introducing slight schema variations is necessary.**

<a name='ex-7'></a>
### Check anomalies in the serving set

Now you are going to check for anomalies in the **serving data**. The process is very similar to the one you previously followed for the evaluation data, with one small change.

**Task**: Create a new `tfdv.StatsOptions` instance called *options* that is aware of the information provided by the schema, and use it when generating statistics from the serving DataFrame. Name the result *serving_stats*.
Then calculate and display anomalies using the generated serving statistics.

You should see that the `metformin-rosiglitazone`, `metformin-pioglitazone`, `payer_code` and `medical_specialty` features each have an anomaly (unexpected string values) that affects less than 1% of the values.

Let's **relax the anomaly detection constraints** for the last two of these features by setting `min_domain_mass` in each feature's distribution constraints.

In [None]:
# min_domain_mass is the minimum fraction of values that must come from the feature's domain.

# Get the feature and require only 90% of its values to come from the domain
payer_code = tfdv.get_feature(schema, 'payer_code')
payer_code.distribution_constraints.min_domain_mass = 0.9

# Get the feature and require only 90% of its values to come from the domain
medical_specialty = tfdv.get_feature(schema, 'medical_specialty')
medical_specialty.distribution_constraints.min_domain_mass = 0.9

# Detect anomalies with the updated constraints
calculate_and_display_anomalies(serving_stats, schema=schema)

If `payer_code` and `medical_specialty` no longer appear in the cell output, the relaxation worked!

<a name='ex-8'></a>
### Modifying the Domain

Let's investigate the possible cause of the anomalies for the other features, namely `metformin-pioglitazone` and `metformin-rosiglitazone`. From the output of the previous exercise, you'll see that the `anomaly long description` says: "Examples contain values missing from the schema: Steady (<1%)". 
You can redisplay the schema and look at the domain of these features to verify this statement.

When you inferred the schema at the start of this lab, some values may not have appeared in the training data, so they were not included in the expected domain values of the feature's schema. In the case of `metformin-rosiglitazone` and `metformin-pioglitazone`, the value "Steady" is indeed missing. You will see only "No" in the domain of these two features after running the code cell below.

**Task**: Use the `tfdv.display_schema()` function to display the schema.

Towards the bottom of the Domain-Values pairs of the cell above, you can see that many features (including **'metformin'**) have the same values: `['Down', 'No', 'Steady', 'Up']`. These values are common to many features including the ones with missing values during schema inference. 

TFDV allows you to modify the domains of some features to match an existing domain. To address the detected anomaly, you can **set the domain** of these features to the domain of the `metformin` feature.

To set the domain of a feature to an existing domain, use the [`tfdv.set_domain()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/set_domain) function, which has the following parameters:

- `schema`: The schema.
- `feature_path`: The name of the feature whose domain needs to be set.
- `domain`: A domain protocol buffer or the name of a global string domain present in the input schema.

**Task**: Modify the domain of the features in the `domain_change_features` list below so that it equals **metformin's domain**, which addresses the anomalies found. Use a helper function named `modify_domain_of_features` (you need to define it yourself, for example as a loop over `tfdv.set_domain()`), and then display the new schema.

**Since you are overriding the existing domain of these features, it is normal to get a warning. The warning is there so that you do not do this by accident.**

Remember to display the new schema.

In [None]:
domain_change_features = ['repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 
                          'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 
                          'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 
                          'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 
                          'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone']



You can test your code with the following cell:

In [None]:
# check that the domain of some features is now switched to `metformin`
print(f"Domain name of 'chlorpropamide': {tfdv.get_feature(schema, 'chlorpropamide').domain}")
print(f"Domain values of 'chlorpropamide': {tfdv.get_domain(schema, 'chlorpropamide').value}")
print(f"Domain name of 'repaglinide': {tfdv.get_feature(schema, 'repaglinide').domain}")
print(f"Domain values of 'repaglinide': {tfdv.get_domain(schema, 'repaglinide').value}")
print(f"Domain name of 'nateglinide': {tfdv.get_feature(schema, 'nateglinide').domain}")
print(f"Domain values of 'nateglinide': {tfdv.get_domain(schema, 'nateglinide').value}")

**Expected Output:**

```
Domain name of 'chlorpropamide': metformin
Domain values of 'chlorpropamide': ['Down', 'No', 'Steady', 'Up']
Domain name of 'repaglinide': metformin
Domain values of 'repaglinide': ['Down', 'No', 'Steady', 'Up']
Domain name of 'nateglinide': metformin
Domain values of 'nateglinide': ['Down', 'No', 'Steady', 'Up']
```

Let's do a final check of anomalies to see if this solved the issue.

In [None]:
calculate_and_display_anomalies(serving_stats, schema=schema)

You should now see the `metformin-pioglitazone` and `metformin-rosiglitazone` features dropped from the output anomalies.

<a name='ex-9'></a>
### Detecting anomalies with environments

There is still one thing to address. The `readmitted` feature (which is the label column) showed up as an anomaly ('Column dropped'). Since labels are not expected in the serving data, let's tell TFDV to ignore this detected anomaly.

This requirement of introducing slight schema variations can be expressed by using [environments](https://www.tensorflow.org/tfx/data_validation/get_started#schema_environments). In particular, features in the schema can be associated with a set of environments using `default_environment`, `in_environment` and `not_in_environment`.

**Task**: Run the cell below to set the default environments of the schema.

In [None]:
# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

Complete the code below to exclude the `readmitted` feature from the `SERVING` environment.

To achieve this, you can use the [`tfdv.get_feature()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/get_feature) function to get the `readmitted` feature from the inferred schema and use its `not_in_environment` attribute to specify that `readmitted` should be removed from the `SERVING` environment's schema. This **attribute is a list** so you will have to **append** the name of the environment that you wish to omit this feature for.

To be more explicit, given a feature you can do something like:

```python
feature.not_in_environment.append('NAME_OF_ENVIRONMENT')
```

The function `tfdv.get_feature` receives the following parameters:

- `schema`: The schema.
- `feature_path`: The path of the feature to obtain from the schema. In this case this is equal to the name of the feature.

**Task**: Specify that the `readmitted` feature is not in the `SERVING` environment by appending `'SERVING'` to the feature's `not_in_environment` attribute. Then calculate the anomalies with `tfdv.validate_statistics()`, passing the serving statistics, the inferred schema and the serving environment (through the `environment` parameter), and name the result *serving_anomalies_with_env*.

You should see "No anomalies found" by running the cell below.

In [None]:
# Display anomalies
tfdv.display_anomalies(serving_anomalies_with_env)

Now you have successfully addressed all anomaly-related issues!

<a name='7'></a>
## 7 - Check for Data Drift and Skew

During data validation, you also need to check for data drift and data skew between the training and serving data. You can do this by specifying the [skew_comparator and drift_comparator](https://www.tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift) in the schema. 

Drift and skew are expressed in terms of [L-infinity distance](https://en.wikipedia.org/wiki/Chebyshev_distance), which evaluates the difference between vectors as the greatest of the differences along any coordinate dimension.
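
To make the definition concrete, the sketch below computes the L-infinity distance between the normalized value frequencies of two toy categorical columns in plain Python. The helper name `l_infinity_distance` and the data are made up, and TFDV's internal computation may differ in detail.

```python
from collections import Counter


def l_infinity_distance(left_values, right_values):
    """Largest absolute gap between the normalized value frequencies."""
    left_counts = Counter(left_values)
    right_counts = Counter(right_values)
    categories = set(left_counts) | set(right_counts)
    return max(
        abs(left_counts[c] / len(left_values) - right_counts[c] / len(right_values))
        for c in categories
    )


toy_training = ["Yes"] * 8 + ["No"] * 2  # Yes: 0.8, No: 0.2
toy_serving = ["Yes"] * 7 + ["No"] * 3   # Yes: 0.7, No: 0.3

toy_distance = l_infinity_distance(toy_training, toy_serving)
print(round(toy_distance, 3))  # 0.1
```

With a threshold of `0.03`, a distance of about `0.1` would be reported as an anomaly.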

You can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.

The code below checks the skew in the *diabetesMed* feature.

**Task**: Write code to check the drift in the *payer_code* feature, storing the feature in a variable named `payer_code`. Hint: use `drift_comparator` instead of `skew_comparator`. Then calculate the anomalies with the `tfdv.validate_statistics()` function, name the result *skew_drift_anomalies*, and display them with the `tfdv.display_anomalies()` function.

In [None]:
# Calculate skew for the diabetesMed feature
diabetes_med = tfdv.get_feature(schema, 'diabetesMed')
diabetes_med.skew_comparator.infinity_norm.threshold = 0.03 # domain knowledge helps to determine this threshold


In both of these cases, the detected anomaly distance is not too far from the threshold value of `0.03`. For this exercise, let's accept this as within bounds (i.e. you can set the distance to something like `0.035` instead).

**However, if the anomaly truly indicates a skew and drift, then further investigation is necessary as this could have a direct impact on model performance.**

<a name='8'></a>
## 8 - Freeze the Schema

Now that the schema has been reviewed, you will store the schema in a file in its "frozen" state. This can be used to validate incoming data once your application goes live to your users.

This is pretty straightforward using TensorFlow's `io` utils and TFDV's [`write_schema_text()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/write_schema_text) function.

**Task**: Set the correct output directory, run the cell, and locate the written schema file.

In [None]:
# Create output directory
OUTPUT_DIR = "output"
file_io.recursive_create_dir(OUTPUT_DIR)

# Use TensorFlow text output format pbtxt to store the schema
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')

# The write_schema_text function expects the schema and the output path as parameters
tfdv.write_schema_text(schema, schema_file) 