# TensorFlow Data Validation Self-Assignment

## Context 

Since I am currently auditing the Coursera course 'Machine Learning Data Lifecycle in Production', I do not have access to some of the assignments. Because of this, I looked for online tutorials that could serve as a replacement for the standard assignment. Luckily, Google has a [Jupyter notebook](https://github.com/tensorflow/tfx/blob/master/docs/tutorials/data_validation/tfdv_basic.ipynb). Taking the tutorial as a template, I replaced its data with public data from a prior project on modeling the price of Bitcoin.

In [23]:
import tensorflow_data_validation as tfdv
import pandas as pd

In [12]:
btcTrain = tfdv.generate_statistics_from_csv(data_location='StandardizedTrainX.csv')



### Training Data set

As this data set was standardized previously, it is not surprising how closely some of the features fit a Gaussian distribution. However, the following features are noticeably skewed to the right:

1. Good
2. Bad
3. Neutral

These were used in the project as a total count of Good/Bad/Neutral events across social media, as per the [Augmento API](http://api-dev.augmento.ai/v0.1/documentation#aggregated-events), and gauged the overall sentiment of the Bitcoin market.
Seeing such a stark skew in these three features would definitely have helped me focus on the data more precisely. Another reason I wish I had used this tool earlier is the 35.74% figure I see for the share of zeros in the feature 'AddrROC' (an abbreviation of 'Address Count Rate of Change'). I was not aware that there were that many zeros in the data set for this feature. Using this information, I might have been able to improve the model by understanding why this was the case.

In [13]:
tfdv.visualize_statistics(btcTrain)
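
The two numbers discussed above (the share of zeros and the direction of the skew) can also be checked directly in pandas. This is a minimal sketch, not part of the original notebook: the helper name `zero_and_skew_summary` is my own, and the sample is only the first five rows printed by `btcTrainDf.head()`, so its figures will not match the 35.74% that TFDV reports for the full file.

```python
import io

import pandas as pd

# First five rows of StandardizedTrainX.csv as printed by btcTrainDf.head(),
# trimmed to three columns to keep the sketch short.
SAMPLE_CSV = """Good,Bad,AddrROC
0.849554,0.553453,0.251669
0.123911,0.262764,0.0
-0.630565,-0.662696,-0.204762
0.022994,-0.110979,-0.0
-0.54887,-0.520317,0.0
"""


def zero_and_skew_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the fraction of exact zeros and the sample skewness per numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "zero_fraction": (numeric == 0).mean(),  # -0.0 also counts as zero
        "skew": numeric.skew(),                  # positive values mean a right skew
    })


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = zero_and_skew_summary(sample)
print(summary)
```

On the real data, passing `btcTrainDf` to the same helper should give one row per feature, which makes it easy to sort by `zero_fraction` or `skew` and spot the outliers.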

#### Training Schema

None of the feature types are surprising except for 'Date': I find it interesting that it is represented as BYTES rather than the datetime type I expected. When I open the file in pandas, the column shows as an object dtype even though the values look like ordinary datetimes. This confirms that the type comes from how the file is read rather than from a problem in the data, so it is not a concern.

In [14]:
schema = tfdv.infer_schema(statistics=btcTrain)
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'Date' | BYTES | required | | - |
| 'active address count' | FLOAT | required | | - |
| 'address count' | FLOAT | required | | - |
| 'New Non Zero Balance Address' | FLOAT | required | | - |
| 'Good' | FLOAT | required | | - |
| 'Bad' | FLOAT | required | | - |
| 'Neutral' | FLOAT | required | | - |
| 'high' | FLOAT | required | | - |
| 'low' | FLOAT | required | | - |
| 'close' | FLOAT | required | | - |


In [28]:
btcTrainDf = pd.read_csv('StandardizedTrainX.csv')
btcTrainDf.head()

| | Date | active address count | address count | New Non Zero Balance Address | Good | Bad | Neutral | high | low | close | volume | GoodROC | AddrROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-06-20 07:00:00 | -1.066907 | -0.400443 | -1.097319 | 0.849554 | 0.553453 | 0.719863 | -0.224455 | -0.224429 | -0.224443 | -0.383163 | -0.558903 | 0.251669 |
| 1 | 2019-08-06 16:00:00 | 0.532394 | 0.438459 | 0.440854 | 0.123911 | 0.262764 | 0.099298 | 0.558309 | 0.558336 | 0.558321 | -0.343842 | -1.57027 | 0.0 |
| 2 | 2019-01-10 03:00:00 | -0.705577 | -0.024963 | -0.584572 | -0.630565 | -0.662696 | -0.873228 | -0.626318 | -0.626293 | -0.626307 | -0.186559 | -0.137963 | -0.204762 |
| 3 | 2018-05-26 21:00:00 | -1.751711 | -0.442017 | -1.750465 | 0.022994 | -0.110979 | 0.113191 | -0.085413 | -0.085387 | -0.085401 | 0.091964 | -0.625741 | -0.0 |
| 4 | 2020-07-30 19:00:00 | 1.172073 | 1.298858 | 1.330062 | -0.54887 | -0.520317 | -0.646305 | 0.466899 | 0.466926 | 0.466911 | 0.164052 | -0.423782 | 0.0 |


In [27]:
btcTrainDf.dtypes

Date                            object 
active address count            float64
address count                   float64
New Non Zero Balance Address    float64
Good                            float64
Bad                             float64
Neutral                         float64
high                            float64
low                             float64
close                           float64
volume                          float64
GoodROC                         float64
AddrROC                         float64
dtype: object
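
The dtype listing above shows 'Date' arriving as text. As a small sketch of why, the snippet below reads the same two rows twice: once with pandas' defaults and once with the `parse_dates` argument of `pd.read_csv`. The inline CSV is copied from the `head()` output and trimmed to two columns. I am assuming that CSV files carry no type information, which is also why TFDV would have nothing better than BYTES to infer.

```python
import io

import pandas as pd

# Two rows copied from the btcTrainDf.head() output, trimmed to two columns.
SAMPLE_CSV = """Date,close
2018-06-20 07:00:00,-0.224443
2019-08-06 16:00:00,0.558321
"""

# Default read: the Date column stays as plain text.
raw = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Explicit parsing: the Date column becomes a datetime64 column.
parsed = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["Date"])

print(raw["Date"].dtype)
print(parsed["Date"].dtype)
```

Parsing the column only changes how pandas sees it. The CSV itself is still text, so the TFDV schema would continue to list 'Date' as BYTES.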

In [15]:
btcTest = tfdv.generate_statistics_from_csv(data_location='StandardizedTestX.csv')



### Comparing the Test data set to the Train data set

At the very least, the split that created these two data sets appears to have produced fairly similar distributions across the features they share. So, unlike some of my previous findings, it seems that the splitting of the data did not contribute to a bad model. However, I was not aware of how drastically the minimum value of the 'GoodROC' feature differs between the two data sets. Given the volatile nature of the crypto markets, this is to be expected, but I do wonder whether it had an impact on the model.

In [16]:
tfdv.visualize_statistics(lhs_statistics=btcTest, rhs_statistics=btcTrain,
                          lhs_name='TEST_DATASET', rhs_name='TRAIN_DATASET')
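
The gap in minimum values can be put side by side outside the TFDV visualization as well. This is a rough sketch under stated assumptions: `compare_ranges` is a helper name I made up, the train values are the five 'GoodROC' entries from the `head()` output earlier, and the test values are invented purely to show the shape of the result.

```python
import pandas as pd


def compare_ranges(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Place the min and max of every shared numeric column side by side."""
    numeric_cols = train.select_dtypes(include="number").columns
    shared = [col for col in numeric_cols if col in test.columns]
    return pd.DataFrame({
        "train_min": train[shared].min(),
        "test_min": test[shared].min(),
        "train_max": train[shared].max(),
        "test_max": test[shared].max(),
    })


# Train values: the GoodROC column from the head() output shown earlier.
train = pd.DataFrame({"GoodROC": [-0.558903, -1.57027, -0.137963, -0.625741, -0.423782]})
# Test values: hypothetical numbers, NOT taken from StandardizedTestX.csv.
test = pd.DataFrame({"GoodROC": [-4.2, -0.3, 0.1]})

ranges = compare_ranges(train, test)
ranges["min_gap"] = (ranges["train_min"] - ranges["test_min"]).abs()
print(ranges)
```

With the real files, the two frames would come from `pd.read_csv('StandardizedTrainX.csv')` and `pd.read_csv('StandardizedTestX.csv')`, and sorting by `min_gap` would surface features like 'GoodROC' first.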

#### Anomaly Schema Check between the two datasets

Thankfully the two datasets had no anomalies between them, so the project was definitely not impacted by this potential issue. Because no differences were found, the schema can be safely frozen, as done below.

In [17]:
anomalies = tfdv.validate_statistics(statistics=btcTest, schema=schema)
tfdv.display_anomalies(anomalies)



In [18]:
goodROCSkew = tfdv.get_feature(schema, 'GoodROC')
goodROCSkew.skew_comparator.infinity_norm.threshold = 0.01  # skew: statistics vs. serving_statistics

goodROCDrift = tfdv.get_feature(schema, 'GoodROC')
goodROCDrift.drift_comparator.infinity_norm.threshold = 0.001  # drift: statistics vs. previous_statistics

skew_anomalies = tfdv.validate_statistics(btcTrain, schema,
                                          previous_statistics=btcTest,
                                          serving_statistics=btcTrain)  # no serving data here, so train is compared with itself

tfdv.display_anomalies(skew_anomalies)
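
To make the thresholds above more concrete, here is a minimal sketch of the L-infinity distance that the `infinity_norm` comparators are built on: the largest absolute difference in relative frequency for any single value. As I understand the TFDV documentation, this measure is meant for categorical features, while numeric features such as 'GoodROC' are compared with Jensen-Shannon divergence, so treat this as an illustration of the idea rather than what TFDV computes for that feature. The function name and the sentiment labels are hypothetical.

```python
from collections import Counter


def l_infinity_distance(left, right):
    """Largest absolute difference in relative frequency over all observed values.

    Both inputs must be non-empty sequences of hashable category values.
    """
    left_counts, right_counts = Counter(left), Counter(right)
    values = set(left_counts) | set(right_counts)
    return max(
        abs(left_counts[value] / len(left) - right_counts[value] / len(right))
        for value in values
    )


# Hypothetical categorical feature, not taken from the Bitcoin data.
train_labels = ["good", "good", "bad", "neutral"]
test_labels = ["good", "bad", "bad", "neutral"]

distance = l_infinity_distance(train_labels, test_labels)
print(distance)         # share of "good" drops from 0.5 to 0.25
print(distance > 0.01)  # would exceed the 0.01 skew threshold used above
```

A threshold of 0.01 therefore means that no single category may shift by more than one percentage point between the two data sets, which is quite strict for a small sample.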

In [21]:
tfdv.write_schema_text(schema, 'btcSchema.pbtxt')

## Closing Remarks

This brief run with the TensorFlow Data Validation package has definitely taught me that discrepancies can occur even in small-scale projects. Thankfully this data is not being used for a production model, because in this respect it definitely is not ready. This tutorial has helped me understand the importance of details that can easily be missed in a larger project, and has given me an appreciation for the power this package offers in supporting machine learning production models at scale.