<a href="https://colab.research.google.com/github/Ol-Shweta/AI-Chat-GPT-3/blob/main/observation_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Upgrade Pip

To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab.  Local systems can of course be upgraded separately.

In [None]:
try:
  import google.colab  # only importable inside a Colab runtime
  !pip install --upgrade pip
except ImportError:
  pass

### Install Data Validation packages

Install the TensorFlow Data Validation packages and dependencies, which takes a few minutes. You may see warnings and errors regarding incompatible dependency versions, which you will resolve in the next section.

In [None]:
print('Installing TensorFlow Data Validation')
!pip install --upgrade 'tensorflow_data_validation[visualization]<2'

Installing TensorFlow Data Validation
Collecting tensorflow_data_validation[visualization]<2
  Downloading tensorflow_data_validation-1.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB)
     19.1/19.1 MB 41.5 MB/s eta 0:00:00
Collecting apache-beam[gcp]<3,>=2.47 (from tensorflow_data_validation[visualization]<2)
  Downloading apache_beam-2.53.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.8 MB)
     14.8/14.8 MB 71.7 MB/s eta 0:00:00
Collecting pyfarmhash<0.4,>=0.2.2 (from tensorflow_data_validation[visualization]<2)
  Downloading pyfarmhash-0.3.2.tar.gz (99 kB)
     99.9/99.9 kB 13.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting tfx-bsl<1.15,>=1.14.0 (from tensorfl

### Import TensorFlow and reload updated packages

The prior step updates the default packages in the Google Colab environment, so you must reload the package resources to resolve the new dependencies.

Note: This step resolves the dependency error from the installation. If you are still experiencing code execution problems after running this code, restart the runtime (Runtime > Restart runtime ...).

In [None]:
import pkg_resources
import importlib
importlib.reload(pkg_resources)

<module 'pkg_resources' from '/usr/local/lib/python3.10/dist-packages/pkg_resources/__init__.py'>

Check the versions of TensorFlow and TensorFlow Data Validation before proceeding.

In [None]:
import tensorflow as tf
import tensorflow_data_validation as tfdv
print('TF version:', tf.__version__)
print('TFDV version:', tfdv.version.__version__)

TF version: 2.15.0
TFDV version: 1.14.0


## Load the dataset
We will download our dataset from Google Drive.

In [None]:
import os
import tempfile
import urllib.request
import shutil

# Set up some globals for our file paths
BASE_DIR = tempfile.mkdtemp()
DATA_DIR = os.path.join(BASE_DIR, 'observ')
QHSE_DATA = os.path.join(DATA_DIR, 'observ.csv')
EVAL_DATA = os.path.join(DATA_DIR, 'eval', 'observ.csv')

# Define the Google Drive file ID
file_id = '1XiiVXtzJyTElO0A9jdt2biK2xYXkgS_M'

try:
    # Create the destination directories if they don't exist
    os.makedirs(DATA_DIR, exist_ok=True)
    os.makedirs(os.path.dirname(EVAL_DATA), exist_ok=True)

    # Construct the direct download link
    direct_download_link = f'https://drive.google.com/uc?export=download&id={file_id}'

    # Download the file to QHSE_DATA
    urllib.request.urlretrieve(direct_download_link, QHSE_DATA)
    print(f"QHSE data downloaded to: {QHSE_DATA}")

    # Copy the downloaded file to EVAL_DATA
    shutil.copy(QHSE_DATA, EVAL_DATA)
    print(f"Evaluation data copied from QHSE data to: {EVAL_DATA}")

except Exception as e:
    # Report the failure, then re-raise so later cells do not run against a missing file
    print(f"An error occurred: {e}")
    raise


QHSE data downloaded to: /tmp/tmpoykm0ou7/observ/observ.csv
Evaluation data copied from QHSE data to: /tmp/tmpoykm0ou7/observ/eval/observ.csv
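
`urllib.request.urlretrieve` saves whatever the server returns, and a Google Drive link can return an HTML page (for example a sign-in or confirmation page) instead of the file itself. The sketch below is one way to catch that before computing statistics. The helper name `looks_like_html` and the demo file contents are made up for illustration and are not part of the original notebook.

```python
import os
import tempfile

def looks_like_html(path, probe_bytes=512):
    """Return True if the start of the file looks like an HTML page rather than CSV data."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")

# Demonstration with two throwaway files (made-up contents, not the real dataset).
demo_dir = tempfile.mkdtemp()
csv_path = os.path.join(demo_dir, "ok.csv")
html_path = os.path.join(demo_dir, "bad.csv")
with open(csv_path, "w") as f:
    f.write("workshift,client\nDay,APC\n")
with open(html_path, "w") as f:
    f.write("<!DOCTYPE html><html><body>Sign in</body></html>")

print(looks_like_html(csv_path))   # False
print(looks_like_html(html_path))  # True
```

In the notebook, calling `looks_like_html(QHSE_DATA)` right after the download would flag a bad file early instead of letting it surface later as confusing statistics.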


## Compute and visualize statistics

First we'll use [`tfdv.generate_statistics_from_csv`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_csv) to compute statistics for our training data. You can ignore any warnings about snappy that appear in the output.

TFDV can compute descriptive [statistics](https://github.com/tensorflow/metadata/blob/v0.6.0/tensorflow_metadata/proto/v0/statistics.proto) that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.

Internally, TFDV uses [Apache Beam](https://beam.apache.org/)'s data-parallel processing framework to scale the computation of statistics over large datasets. For applications that wish to integrate deeper with TFDV (e.g., attach statistics generation at the end of a data-generation pipeline), the API also exposes a Beam PTransform for statistics generation.

In [None]:
qhse_stats = tfdv.generate_statistics_from_csv(data_location=QHSE_DATA)

# Display the feature names
# print(['.'.join(f.path.step) for f in qhse_stats.datasets[0].features])



Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


Now let's use [`tfdv.visualize_statistics`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics), which uses [Facets](https://pair-code.github.io/facets/) to create a succinct visualization of our training data:

* Notice that numeric features and categorical features are visualized separately, and that charts are displayed showing the distributions for each feature.
* Notice that features with missing or zero values display a percentage in red as a visual indicator that there may be issues with examples in those features.  The percentage is the percentage of examples that have missing or zero values for that feature.
* Check whether any feature has no values at all.  Such a feature carries no information and can be dropped, which is an opportunity for dimensionality reduction!
* Try clicking "expand" above the charts to change the display
* Try hovering over bars in the charts to display bucket ranges and counts
* Try switching between the log and linear scales, and notice how the log scale reveals much more detail for categorical features whose value counts are very uneven
* Try selecting "quantiles" from the "Chart to show" menu, and hover over the markers to show the quantile percentages

In [None]:
# docs-infra: no-execute
tfdv.visualize_statistics(qhse_stats)
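
The red percentage that Facets shows for missing values can be approximated by hand, which helps when sanity-checking the visualization. The following is a minimal sketch using only the standard library. It is not how TFDV computes its statistics, and the helper name `missing_percentages` and the sample rows are assumptions made for illustration.

```python
import csv
import io

def missing_percentages(csv_file, missing_tokens=("",)):
    """Return {column: percent of rows whose value is missing} for a CSV file object."""
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            value = row.get(name)
            # A short row yields None; an empty cell yields ''.
            if value is None or value.strip() in missing_tokens:
                counts[name] += 1
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}

# Made-up sample rows that reuse two column names from the dataset.
sample = io.StringIO("workshift,client\nDay,APC\nNight,\nDay,NASMA\nNight,\n")
print(missing_percentages(sample))  # {'workshift': 0.0, 'client': 50.0}
```

The CSV used in this notebook appears to contain placeholder strings such as `'NA'` and `'NULL'` (they show up as domain values in the inferred schema). Those are counted as ordinary values unless they are listed explicitly, for example with `missing_tokens=("", "NA", "NULL", "N/A")`.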

## Infer a schema

Now let's use [`tfdv.infer_schema`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema) to create a schema for our data.  A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data.  For categorical features the schema also defines the domain - the list of acceptable values.  Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

Getting the schema right is important because the rest of our production pipeline will be relying on the schema that TFDV generates to be correct.  The schema also provides documentation for the data, and so is useful when different developers work on the same data.  Let's use [`tfdv.display_schema`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_schema) to display the inferred schema so that we can review it.

In [None]:
schema = tfdv.infer_schema(statistics=qhse_stats)
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'observation_date',STRING,required,,'observation_date'
'business_category',STRING,required,,'business_category'
'operator',STRING,required,,'operator'
'client',STRING,required,,'client'
'environmental_condition',STRING,required,,'environmental_condition'
'workshift',STRING,required,,'workshift'
'number_people_observered',INT,required,,-
'description',STRING,required,,'description'
'observer_comment',STRING,required,,'observer_comment'
'immediate_action',STRING,required,,'immediate_action'


Domain,Values
'observation_date',"'04-05-2020 09:15', '08-05-2020 01:00', '09-02-2020 08:00', '09-03-2020 01:00', '09-09-2020 01:00', '10-01-2020 01:00', '10-11-2020 00:00', '10-12-2020 09:00', '11-03-2019 02:00', '11-03-2019 08:00', '17-10-2020 10:00', '19-08-2020 1:00', '22-04-2020 8:20', '26-09-2019 11:00', '28-08-2020 9:20', '31-10-2019 8:00'"
'business_category',"'APC', 'Wireline'"
'operator',"'APC', 'Anoop Kumar', 'Gurwinder Singh', 'Gurwinder Singh/ Anoop Kumar', 'NA', 'ahmad', 'wael ahmed'"
'client',"'APC', 'BHGE', 'NA', 'NASMA', 'NULL', 'Saudi Aramco', 'Wireline'"
'environmental_condition','Sunny'
'workshift',"'Day', 'Night'"
'description',"'After finishing MTD campaign, the truck and equipment were carried out for servicing at service station. So during washing they spray Diesel on all equipment. By this diesel vapors are remains in air, and most of people who are working there smoke on same place. Due to this any big accident can happen.', 'Fire extinguisher monthly November inspection is not done yet', 'It is observed on job at NA-393 rig before connecting PTC-RCT makeup, the scaffolding around the well(which is almost 6ft in height) is damaged and it may cause to slip trip and fall while connecting the PTC-RCT and can leading to any injury.', 'Light lamp in the engineer\'s rooms is not working', 'Maintenance lab light is not working', 'Maintenance lab walk way is obstructed with wooden box', 'No fire extinguisher in near the oil/ paint store', 'Noticed one of third party personnel on the wellsite manhaldling APC Equipment. Approached him and his supervisor and told them to call the concerned APC Personnel in case equipment needs to be moved to another place, and not to handle them by themselves.', 'Observed person using correct LIFTING Techniques', 'Observed personnel bending his back while connecting tools, corrected him.', 'This was observed person working in yard without safety glass', 'Unsafe Condition - leak at the FHC water valve. a small amount of water drops.', 'in APCSHOP we use the shop system plug without using something like lookout of Tagout and may be one Operator try to disconnect it so we must use LOOKOUT SYSTEM', 'in NASMA we found some unsafe issue 1 the ground in said the shop full of oil and sliding 2 all of the workshop have a lot of rubbish', 'observed person using correct ppe for job', 'water leaking from the RCT room Air condition,'"
'observer_comment',"' ""Need to do monthly inspection for all fire extinguisher""', 'Clear the wooden box from the walk way, and keep it in proper place', 'It is observed on job at NA-393 rig before connecting PTC-RCT makeup, the scaffolding around the well is damaged and it may cause to slip trip and fall while connecting the PTC-RCT and can leading to any injury.', 'N/A', 'NA', 'Stop the person, and reported to concerned persons', 'light system is not working', 'need fix a proper fire extinguisher near the oil store', 'need to repair the AC', 'needs to change the lamp', 'slip hazards', 'update', 'we need to use lookout system'"
'immediate_action',"'Briefed them to call APC equipment if equipment needs to be moved', 'Clear the wooden box from the walk way, and keep it in proper place', 'Corrected him', 'Discuss same with APC team.', 'NA', 'Need to do monthly inspection for all fire extinguisher', 'Stop the person, and reported to concerned persons', 'Stop the work and discuss same with team member Anoop & Baker Wireline guys and sequre the same place. Connected tool from the safest side.', 'Try to fix it but we can\'t, no electricity, Reported it to foreman', 'informed to concerned persons', 'report to concerned person', 'reported it to concern personal', 'reported to Star Safety team', 'reported to concerned persons', 'try to clean the ground'"
'safe_behaviour',"'NULL', 'Safe Behavior'"
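
To make the idea of a type plus a domain concrete, here is a toy schema inference in plain Python. It is a deliberately simplified illustration, not TFDV's algorithm: a column is treated as `INT` if every value parses as an integer, and otherwise as `STRING` with the set of observed values as its domain. The function name `infer_simple_schema` and the sample rows are hypothetical.

```python
import csv
import io

def infer_simple_schema(csv_file):
    """Infer a toy schema: INT if every value parses as an integer, else STRING with a domain."""
    reader = csv.DictReader(csv_file)
    values = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in values:
            values[name].append(row[name] or "")
    schema = {}
    for name, column in values.items():
        try:
            for v in column:
                int(v)
            schema[name] = {"type": "INT", "domain": None}
        except ValueError:
            # At least one value is not an integer: treat the column as categorical.
            schema[name] = {"type": "STRING", "domain": sorted(set(column))}
    return schema

# Made-up rows that reuse two column names from the dataset.
sample = io.StringIO(
    "workshift,number_people_observered\n"
    "Day,3\n"
    "Night,1\n"
    "Day,2\n"
)
toy_schema = infer_simple_schema(sample)
print(toy_schema["workshift"])                 # {'type': 'STRING', 'domain': ['Day', 'Night']}
print(toy_schema["number_people_observered"])  # {'type': 'INT', 'domain': None}
```

This also shows why free-text columns such as `description` end up with a domain listing every distinct sentence: any string column with few enough distinct values looks categorical to a rule like this one.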


## Check evaluation data for errors

So far we've only been looking at the training data.  It's important that our evaluation data is consistent with our training data, including that it uses the same schema.  It's also important that the evaluation data includes examples of roughly the same ranges of values for our numerical features as our training data, so that our coverage of the loss surface during evaluation is roughly the same as during training.  The same is true for categorical features.  Otherwise, we may have training issues that are not identified during evaluation, because we didn't evaluate part of our loss surface.  In this notebook the evaluation file is simply a copy of the training file, so the two sets of statistics will match; the checks below become meaningful once `EVAL_DATA` points at separately collected data.


In [None]:
# Compute stats for evaluation data
eval_stats = tfdv.generate_statistics_from_csv(data_location=EVAL_DATA)

In [None]:
# docs-infra: no-execute
# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=qhse_stats,
                          lhs_name='EVAL_DATASET', rhs_name='QHSE_DATASET')
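
The point about covering the same numeric ranges can be checked with a few lines of plain Python before looking at the charts. This is a rough sketch, and the function name `range_gap` and the example values are invented for illustration. It reports how far the training range extends beyond the evaluation range on each side.

```python
def range_gap(train_values, eval_values):
    """Return how far the training range extends beyond the eval range on each side."""
    train_lo, train_hi = min(train_values), max(train_values)
    eval_lo, eval_hi = min(eval_values), max(eval_values)
    return {
        "uncovered_below": max(0, eval_lo - train_lo),
        "uncovered_above": max(0, train_hi - eval_hi),
    }

# Made-up values standing in for the number_people_observered column.
train_counts = [1, 2, 3, 8]
eval_counts = [2, 3, 4]
print(range_gap(train_counts, eval_counts))  # {'uncovered_below': 1, 'uncovered_above': 4}
```

A non-zero gap means part of the range seen in training is never exercised during evaluation.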

## Check for evaluation anomalies

Does our evaluation dataset match the schema from our training dataset?  This is especially important for categorical features, where we want to identify the range of acceptable values.

Key Point: What would happen if we tried to evaluate using data with categorical feature values that were not in our training dataset?  What about numeric features that are outside the ranges in our training dataset?

In [None]:
# Check eval data for errors by validating the eval data stats using the previously inferred schema.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
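
For categorical features, the core of this check is a set-membership test: any evaluation value outside the schema's domain is reported. The sketch below illustrates that idea in plain Python. It is not TFDV's implementation, the helper name `find_domain_anomalies` is hypothetical, and the evaluation rows are made up. The two domains are copied from the inferred schema shown earlier.

```python
def find_domain_anomalies(rows, domains):
    """Return {feature: sorted unexpected values} for values outside each feature's domain."""
    unexpected = {}
    for row in rows:
        for feature, allowed in domains.items():
            value = row.get(feature)
            if value is not None and value not in allowed:
                unexpected.setdefault(feature, set()).add(value)
    return {feature: sorted(vals) for feature, vals in unexpected.items()}

# Domains taken from the inferred schema; the eval rows are invented.
domains = {"workshift": {"Day", "Night"}, "business_category": {"APC", "Wireline"}}
eval_rows = [
    {"workshift": "Day", "business_category": "APC"},
    {"workshift": "Evening", "business_category": "Wireline"},  # hypothetical new value
]
print(find_domain_anomalies(eval_rows, domains))  # {'workshift': ['Evening']}
```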

## Fix evaluation anomalies in the schema

If an anomaly truly indicates a data error, then the underlying data should be fixed.  Otherwise, we can simply update the schema to include the values in the eval dataset.

Key Point: How would our evaluation results be affected if we did not fix these problems?

Unless we change our evaluation dataset we can't fix everything, but we can fix things in the schema that we're comfortable accepting.  That includes relaxing our view of what is and what is not an anomaly for particular features, as well as updating our schema to include missing values for categorical features.  TFDV has enabled us to discover what we need to fix.

Let's make those fixes now, and then review one more time.
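
One common fix is to add a newly accepted value to a categorical feature's domain. The helper below is a small sketch of that step. The name `extend_domain` and the value `'Evening'` are assumptions for illustration, and the demo uses a stand-in object so the block runs without TFDV installed. In the notebook the same helper would be applied to the domain returned by `tfdv.get_domain`, followed by re-running `tfdv.validate_statistics`.

```python
from types import SimpleNamespace

def extend_domain(domain, new_values):
    """Append each value not already in `domain.value`; return the values that were added."""
    added = []
    for value in new_values:
        if value not in domain.value:
            domain.value.append(value)
            added.append(value)
    return added

# Stand-in for a schema string domain (any object with a list-like `value` attribute).
demo_domain = SimpleNamespace(value=["Day", "Night"])
added = extend_domain(demo_domain, ["Night", "Evening"])  # 'Evening' is a made-up value
print(added)              # ['Evening']
print(demo_domain.value)  # ['Day', 'Night', 'Evening']

# In the notebook, assuming 'Evening' is a value we decide to accept:
#   extend_domain(tfdv.get_domain(schema, 'workshift'), ['Evening'])
#   updated_anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
#   tfdv.display_anomalies(updated_anomalies)
```

Only add values that are genuinely acceptable. A value that reflects a data entry error should be corrected in the data rather than accepted in the schema.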