<a href="https://colab.research.google.com/github/sergejhorvat/Tensorflow2.0_Udemy/blob/master/Data_Validation_with_TensorFlow_Data_Validation_(TFDV).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Stage 1: Install all dependencies and set up the environment

In [2]:
# Set up the notebook to use Python 2 because of the snappy library
!apt-get install python-dev python-snappy

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python-dev is already the newest version (2.7.15~rc1-1).
python-snappy is already the newest version (0.5-1.1build2).
0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.


In [9]:
#!pip install dill==0.3.0


Collecting dill==0.3.0
  Downloading https://files.pythonhosted.org/packages/39/7a/70803635c850e351257029089d38748516a280864c97cbc73087afef6d51/dill-0.3.0.tar.gz (151kB)
     |████████████████████████████████| 153kB 2.9MB/s
Building wheels for collected packages: dill
  Building wheel for dill (setup.py) ... done
  Created wheel for dill: filename=dill-0.3.0-cp27-none-any.whl size=77513 sha256=9690cbd2bb02513c0f90a9d5085ca91f248f2f582d82bdef6729d4d589a82a5e
  Stored in directory: /root/.cache/pip/wheels/c9/de/a4/a91eec4eea652104d8c81b633f32ead5eb57d1b294eab24167
Successfully built dill
ERROR: apache-beam 2.15.0 has requirement dill<0.2.10,>=0.2.9, but you'll have dill 0.3.0 which is incompatible.
Installing collected packages: dill
  Found existing installation: dill 0.2.9
    Uninstalling dill-0.2.9:
      Successfully uninstalled dill-0.2.9
Successfully installed dill-0.3.0


In [0]:
!pip install -q tensorflow_data_validation

## Stage 2: Import project dependencies

In [0]:
# Use the Python 3 print function (this notebook runs on Python 2).
# __future__ imports belong before all other imports.
from __future__ import print_function

import pandas as pd
import tensorflow as tf
import tensorflow_data_validation as tfdv

## Stage 3: Simple dataset analysis

In [0]:
# The file needs to be re-uploaded every time the runtime is recycled
dataset = pd.read_csv("pollution-small.csv")

In [11]:
dataset.shape

(2188, 5)

In [0]:
training_data = dataset[:1600]

In [17]:
training_data.describe()

| | pm10 | no2 | so2 | soot |
|---|---|---|---|---|
| count | 1600.0 | 1600.0 | 1600.0 | 1600.0 |
| mean | 49.656494 | 30.980519 | 16.229981 | 21.551956 |
| std | 35.211906 | 12.400788 | 10.621896 | 12.127354 |
| min | 6.38 | 9.74 | 4.01 | 6.0 |
| 25% | 28.345 | 22.5675 | 9.7775 | 14.4 |
| 50% | 38.835 | 28.715 | 13.275 | 18.63 |
| 75% | 58.05 | 36.37 | 19.2825 | 24.0725 |
| max | 277.25 | 138.01 | 123.13 | 107.65 |


In [0]:
test_set = dataset[1600:]

In [19]:
test_set.describe()

| | pm10 | no2 | so2 | soot |
|---|---|---|---|---|
| count | 588.0 | 588.0 | 588.0 | 588.0 |
| mean | 44.648248 | 37.296922 | 13.60517 | 18.44131 |
| std | 28.992087 | 10.94005 | 5.098944 | 6.596459 |
| min | 11.9 | 15.07 | 4.99 | 8.0 |
| 25% | 28.3375 | 29.2175 | 10.1225 | 14.41 |
| 50% | 35.555 | 35.815 | 12.345 | 17.09 |
| 75% | 50.8125 | 43.8725 | 15.855 | 20.9625 |
| max | 273.77 | 106.03 | 38.03 | 87.21 |


## Stage 3: Data analysis and validation with TFDV

### Generate training data statistics

In [0]:
# Generate statistics with TFDV, which reports more than pandas' .describe().
# The generated statistics describe each column and its data, giving us
# everything we need to compare the training set against a new data set.
# Use only the training slice, so that test rows do not leak into the schema
train_stats = tfdv.generate_statistics_from_dataframe(dataframe=training_data, n_jobs=1)
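
To make "statistics" more concrete, here is a minimal plain-Python sketch of a per-column summary. It is not how TFDV computes its statistics, and TFDV reports considerably more. The function name `summarize_column` and the toy values are hypothetical and are not taken from `pollution-small.csv`.

```python
import math


def summarize_column(values):
    """Summarize one numeric column; None marks a missing value.

    Assumes at least two non-missing values so the sample
    standard deviation is defined.
    """
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / float(n)
    # Sample variance (divide by n - 1), as pandas' describe() does.
    variance = sum((v - mean) ** 2 for v in present) / float(n - 1)
    return {
        "count": n,
        "missing": len(values) - n,
        "mean": mean,
        "std": math.sqrt(variance),
        "min": min(present),
        "max": max(present),
    }


# Hypothetical toy column with one missing value.
toy_soot = [10.0, 12.0, None, 14.0]
summary = summarize_column(toy_soot)
print(summary)
```

Keeping a separate `missing` count is the kind of detail that `.describe()` does not surface directly and that matters when new data is later compared against the training data.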

### Inferring the schema

In [0]:
# The schema is the central concept in data validation. It is inferred from
#  the training statistics and is used to check how good newly received
#  data is, and whether it shows anomalies compared to the training data.
# The schema contains information such as:
#    + the data type of each column
#    + whether a column is required or optional
#    + the expected values (domain) of each column
schema = tfdv.infer_schema(statistics=train_stats)
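
As a rough illustration of what inferring a schema means, the sketch below derives a type and a presence flag per column from a handful of rows. This is a toy model in plain Python, not TFDV's inference logic. The function `infer_toy_schema`, the row layout, and the sample values are all hypothetical.

```python
def infer_toy_schema(rows):
    """Infer a type and a presence flag per column from a list of dict rows."""
    columns = set()
    for row in rows:
        columns.update(row)

    schema = {}
    for name in sorted(columns):
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        # Toy rule: a column is FLOAT only if every present value is a float.
        all_float = all(isinstance(v, float) for v in present)
        schema[name] = {
            "type": "FLOAT" if all_float else "BYTES",
            "presence": "required" if len(present) == len(rows) else "optional",
        }
    return schema


# Hypothetical rows; the dates and readings are made up for illustration.
toy_rows = [
    {"Date": "2019-01-01", "pm10": 38.8, "soot": 18.6},
    {"Date": "2019-01-02", "pm10": 58.1, "soot": None},
]
toy_schema = infer_toy_schema(toy_rows)
for name in sorted(toy_schema):
    print(name, toy_schema[name])
```

Because `soot` is missing in one toy row, it is inferred as optional here. In the notebook's real data every column is present in every row, which is consistent with all features being shown as `required`.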

In [26]:
# Display schema
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'soot' | FLOAT | required | | - |
| 'no2' | FLOAT | required | | - |
| 'pm10' | FLOAT | required | | - |
| 'Date' | BYTES | required | | - |
| 'so2' | FLOAT | required | | - |


### Calculate test set statistics

In [0]:
test_stats = tfdv.generate_statistics_from_dataframe(dataframe=test_set)

## Stage 4: Compare test statistics with the Schema

### Checking for anomalies in new data

In [0]:
# An anomaly is a discrepancy between the test statistics and the schema
anomalies = tfdv.validate_statistics(statistics=test_stats, schema=schema)

### Displaying all detected anomalies

Examples of the kinds of anomalies that validation can surface, depending on the schema:

- Integer larger than 10
- STRING type when expected INT type
- FLOAT type when expected INT type
- Integer smaller than 0
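
The four anomaly kinds listed above can be reproduced with a small value-level check. This is only a sketch in plain Python: TFDV validates summary statistics against the schema rather than looping over individual values, and the function `check_int_feature`, its bounds, and the toy values are hypothetical.

```python
# Display names for the non-integer types used in this toy example.
TYPE_NAMES = {str: "STRING", float: "FLOAT"}


def check_int_feature(values, minimum, maximum):
    """Describe every value that is not an int within [minimum, maximum]."""
    anomalies = []
    for value in values:
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            type_name = TYPE_NAMES.get(type(value), type(value).__name__.upper())
            anomalies.append(
                "{} type when expected INT type: {!r}".format(type_name, value)
            )
        elif value > maximum:
            anomalies.append("Integer larger than {}: {}".format(maximum, value))
        elif value < minimum:
            anomalies.append("Integer smaller than {}: {}".format(minimum, value))
    return anomalies


# Hypothetical values for a feature expected to be an integer from 0 to 10.
toy_values = [3, 11, "seven", 2.5, -1]
found = check_int_feature(toy_values, minimum=0, maximum=10)
for message in found:
    print(message)
```

Range anomalies only make sense once the schema carries a range for the feature. The schema inferred in this notebook shows no domain for any column, so it would not flag out-of-range values on its own.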

In [30]:
tfdv.display_anomalies(anomalies)

### New data WITH anomalies

In [0]:
test_set_copy = test_set.copy()

In [0]:
test_set_copy.drop("soot", axis=1, inplace=True)

### Statistics based on data with anomalies

1. Compute statistics for the new data set

In [0]:
test_set_copy_stats = tfdv.generate_statistics_from_dataframe(dataframe=test_set_copy)

2. Validate the new statistics against the schema

In [0]:
anomalies_new = tfdv.validate_statistics(statistics=test_set_copy_stats, schema=schema)

3. Display any anomalies

In [40]:
tfdv.display_anomalies(anomalies_new)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'soot' | Column dropped | Column is completely missing |
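
The "Column dropped" result comes down to comparing the columns the schema expects with the columns the new data provides. A minimal sketch of that comparison in plain Python follows. The function `find_column_anomalies` is hypothetical, and the "New column" label for unexpected columns is an illustrative choice rather than a quoted TFDV message.

```python
def find_column_anomalies(schema_columns, data_columns):
    """Map each mismatched column name to a short anomaly description."""
    anomalies = {}
    for name in schema_columns:
        if name not in data_columns:
            # Expected by the schema but absent from the new data.
            anomalies[name] = "Column dropped"
    for name in data_columns:
        if name not in schema_columns:
            # Present in the new data but unknown to the schema.
            anomalies[name] = "New column"
    return anomalies


# Column names match the notebook; "soot" was dropped from the copy.
schema_columns = ["Date", "pm10", "no2", "so2", "soot"]
data_columns = ["Date", "pm10", "no2", "so2"]
result = find_column_anomalies(schema_columns, data_columns)
print(result)
```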


## Stage 5: Prepare the schema for Serving

At serving time the column we want to predict is not available, although we need it at training time. We will therefore use different environments for:

1.   Training
2.   Serving

We add both as default environments to our schema (a list of environment names).



In [0]:
schema.default_environment.append("TRAINING")
schema.default_environment.append("SERVING")

### Removing a target column from the Serving schema

In [0]:
tfdv.get_feature(schema, "soot").not_in_environment.append("SERVING")

### Checking for anomalies between the SERVING environment and new test set

In [0]:
serving_env_anomalies = tfdv.validate_statistics(test_set_copy_stats, schema, environment="SERVING")

In [45]:
tfdv.display_anomalies(serving_env_anomalies)
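
The effect of environments can be sketched in plain Python: a feature excluded from an environment is simply not expected when validating in that environment. This is a toy model, not TFDV's implementation. The dictionary layout and the helper names `expected_features` and `missing_features` are hypothetical; only the `not_in_environment` idea and the column names come from the notebook.

```python
def expected_features(schema, environment):
    """Return the feature names the schema expects in a given environment."""
    return [
        name
        for name in sorted(schema)
        if environment not in schema[name]["not_in_environment"]
    ]


def missing_features(schema, data_columns, environment):
    """Return expected features that are absent from the data."""
    return [
        name
        for name in expected_features(schema, environment)
        if name not in data_columns
    ]


# Toy schema: every feature is expected everywhere by default.
toy_schema = {
    name: {"not_in_environment": []}
    for name in ["Date", "pm10", "no2", "so2", "soot"]
}
# Mirror the notebook: the target column is not expected when serving.
toy_schema["soot"]["not_in_environment"].append("SERVING")

serving_columns = ["Date", "pm10", "no2", "so2"]
print(missing_features(toy_schema, serving_columns, "TRAINING"))
print(missing_features(toy_schema, serving_columns, "SERVING"))
```

In this toy model the data without `soot` is still reported as incomplete under `TRAINING`, while under `SERVING` nothing is missing. That is the behavior the environment-specific validation above is meant to achieve.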