# Data Analysis
This phase performs exploratory data analysis to understand the nature of your data and to identify the data preprocessing and feature engineering that your ML task requires.

The output of this process is a **raw_schema**, which acts as a contract between your model and the incoming input data. The data validation step in the ML training pipeline uses this **raw_schema**. The process has four steps:

1. Analyze the training data and produce statistics.
2. Generate a data schema from the produced statistics.
3. Configure the schema.
4. Save the schema for later use.

### UCI Adult Dataset: https://archive.ics.uci.edu/ml/datasets/adult
The task is to predict whether income exceeds $50K/yr based on census data. The dataset is also known as the "Census Income" dataset.

In [2]:
import os

ROOT_DIR = '..'
DATA_DIR = ROOT_DIR + '/data'
TRAIN_DATA_DIR = DATA_DIR + '/train'
RAW_SCHEMA_DIR = ROOT_DIR + '/raw_schema'

In [3]:
!mkdir -p $DATA_DIR
!mkdir -p $TRAIN_DATA_DIR
!gsutil cp gs://cloud-samples-data/ml-engine/census/data/adult.data.csv $DATA_DIR
!gsutil cp gs://cloud-samples-data/ml-engine/census/data/adult.test.csv $DATA_DIR

Copying gs://cloud-samples-data/ml-engine/census/data/adult.data.csv...
- [1 files][  3.8 MiB/  3.8 MiB]                                                
Operation completed over 1 objects/3.8 MiB.                                      
Copying gs://cloud-samples-data/ml-engine/census/data/adult.test.csv...
- [1 files][  1.9 MiB/  1.9 MiB]                                                
Operation completed over 1 objects/1.9 MiB.                                      


### Add headers to the CSV files, because the CsvExampleGen component expects a header row

In [4]:
import pandas as pd

HEADER = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
          'marital_status', 'occupation', 'relationship', 'race', 'gender',
          'capital_gain', 'capital_loss', 'hours_per_week',
          'native_country', 'income_bracket']

# The raw files have no header row, so supply the column names explicitly.
pd.read_csv(DATA_DIR + '/adult.data.csv', names=HEADER).to_csv(
    TRAIN_DATA_DIR + '/train-01.csv', index=False)
pd.read_csv(DATA_DIR + '/adult.test.csv', names=HEADER).to_csv(
    TRAIN_DATA_DIR + '/train-02.csv', index=False)
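
The string values in these files carry a leading space (for example `' State-gov'`), and the cell above preserves it, which is why the schema domains inferred later contain that space too. The following is a minimal standard-library sketch of the same header-adding step, with an optional variation that strips the space. The helper `add_header` and the in-memory sample are assumptions made for illustration; the notebook itself uses pandas and keeps the spaces. pandas `read_csv` accepts a `skipinitialspace` argument with the same meaning.

```python
import csv
import io

HEADER = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
          'marital_status', 'occupation', 'relationship', 'race', 'gender',
          'capital_gain', 'capital_loss', 'hours_per_week',
          'native_country', 'income_bracket']

# Two header-less sample rows in the same shape as the training file.
RAW = (
    "39, State-gov,77516, Bachelors,13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male,2174,0,40, United-States, <=50K\n"
    "50, Self-emp-not-inc,83311, Bachelors,13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male,0,0,13, United-States, <=50K\n"
)


def add_header(raw_text, strip_spaces):
    """Return CSV text with HEADER prepended (hypothetical helper).

    When strip_spaces is True, whitespace right after each comma is dropped.
    """
    reader = csv.reader(io.StringIO(raw_text), skipinitialspace=strip_spaces)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(reader)
    return out.getvalue()


kept = add_header(RAW, strip_spaces=False)      # what the notebook does
stripped = add_header(RAW, strip_spaces=True)   # optional variation

print(kept.splitlines()[1])
print(stripped.splitlines()[1])
```

If you strip the spaces here, the inferred domains change accordingly, so any value added to a domain by hand has to follow the same convention.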

In [5]:
!wc -l $TRAIN_DATA_DIR/train-01.csv
!head $TRAIN_DATA_DIR/train-01.csv

   32562 ../data/train/train-01.csv
age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
39, State-gov,77516, Bachelors,13, Never-married, Adm-clerical, Not-in-family, White, Male,2174,0,40, United-States, <=50K
50, Self-emp-not-inc,83311, Bachelors,13, Married-civ-spouse, Exec-managerial, Husband, White, Male,0,0,13, United-States, <=50K
38, Private,215646, HS-grad,9, Divorced, Handlers-cleaners, Not-in-family, White, Male,0,0,40, United-States, <=50K
53, Private,234721, 11th,7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male,0,0,40, United-States, <=50K
28, Private,338409, Bachelors,13, Married-civ-spouse, Prof-specialty, Wife, Black, Female,0,0,40, Cuba, <=50K
37, Private,284582, Masters,14, Married-civ-spouse, Exec-managerial, Wife, White, Female,0,0,40, United-States, <=50K
49, Private,160187, 9th,5, Married-spouse-absent, Other-service, Not-in-family, Blac

## TensorFlow Data Validation

In [6]:
import tensorflow_data_validation as tfdv

TARGET_FEATURE_NAME = 'income_bracket'
WEIGHT_FEATURE_NAME = 'fnlwgt'

## 1. Compute Statistics

In [7]:
train_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAIN_DATA_DIR+'/*.csv', 
    column_names=None,  # The CSV files include a header row.
    stats_options=tfdv.StatsOptions(
        weight_feature=WEIGHT_FEATURE_NAME,
        sample_rate=1.0
    )
)



Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
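
Setting `weight_feature` asks TFDV to take each example's weight (here `fnlwgt`) into account when it computes statistics. The sketch below illustrates the idea for one categorical feature using only the standard library. It is a conceptual illustration, not TFDV's implementation; the helper `value_counts` is hypothetical, and the `(education, fnlwgt)` pairs are the first rows of `train-01.csv`.

```python
from collections import Counter

# (education, fnlwgt) pairs from the first six rows of train-01.csv.
ROWS = [
    (" Bachelors", 77516),
    (" Bachelors", 83311),
    (" HS-grad", 215646),
    (" 11th", 234721),
    (" Bachelors", 338409),
    (" Masters", 284582),
]


def value_counts(rows, weighted):
    """Count each value once per row, or by its weight when weighted=True."""
    counts = Counter()
    for value, weight in rows:
        counts[value] += weight if weighted else 1
    return counts


unweighted = value_counts(ROWS, weighted=False)
weighted = value_counts(ROWS, weighted=True)
total_weight = sum(weighted.values())

# Compare the share of each value with and without weights.
for value in unweighted:
    print(f"{value!r}: {unweighted[value] / len(ROWS):.2f} unweighted, "
          f"{weighted[value] / total_weight:.2f} weighted")
```

Rows with a large `fnlwgt` contribute more to the weighted share, so the two views of the same feature can differ noticeably.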




In [8]:
tfdv.visualize_statistics(train_stats)

## 2. Infer Schema

In [9]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'age' | INT | required | | - |
| 'capital_gain' | INT | required | | - |
| 'capital_loss' | INT | required | | - |
| 'education' | STRING | required | | 'education' |
| 'education_num' | INT | required | | - |
| 'fnlwgt' | INT | required | | - |
| 'gender' | STRING | required | | 'gender' |
| 'hours_per_week' | INT | required | | - |
| 'income_bracket' | STRING | required | | 'income_bracket' |
| 'marital_status' | STRING | required | | 'marital_status' |


| Domain | Values |
|---|---|
| 'education' | ' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate', ' HS-grad', ' Masters', ' Preschool', ' Prof-school', ' Some-college' |
| 'gender' | ' Female', ' Male' |
| 'income_bracket' | ' <=50K', ' >50K' |
| 'marital_status' | ' Divorced', ' Married-AF-spouse', ' Married-civ-spouse', ' Married-spouse-absent', ' Never-married', ' Separated', ' Widowed' |
| 'native_country' | ' ?', ' Cambodia', ' Canada', ' China', ' Columbia', ' Cuba', ' Dominican-Republic', ' Ecuador', ' El-Salvador', ' England', ' France', ' Germany', ' Greece', ' Guatemala', ' Haiti', ' Holand-Netherlands', ' Honduras', ' Hong', ' Hungary', ' India', ' Iran', ' Ireland', ' Italy', ' Jamaica', ' Japan', ' Laos', ' Mexico', ' Nicaragua', ' Outlying-US(Guam-USVI-etc)', ' Peru', ' Philippines', ' Poland', ' Portugal', ' Puerto-Rico', ' Scotland', ' South', ' Taiwan', ' Thailand', ' Trinadad&Tobago', ' United-States', ' Vietnam', ' Yugoslavia' |
| 'occupation' | ' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair', ' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners', ' Machine-op-inspct', ' Other-service', ' Priv-house-serv', ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support', ' Transport-moving' |
| 'race' | ' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other', ' White' |
| 'relationship' | ' Husband', ' Not-in-family', ' Other-relative', ' Own-child', ' Unmarried', ' Wife' |
| 'workclass' | ' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private', ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay' |
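
To make the inferred schema less of a black box, here is a toy approximation of what inference does: it decides a type per feature and, for string features, collects the observed values into a domain. The function `infer_toy_schema` is hypothetical and far simpler than `tfdv.infer_schema`, which works from the computed statistics and applies further heuristics; the three sample rows reuse values from the training data.

```python
def _is_int(text):
    """Return True if the text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def infer_toy_schema(rows):
    """Infer a {feature: {'type', 'domain'}} mapping from dict rows.

    A feature is INT when every value parses as an integer; otherwise it is
    STRING and its domain is the sorted set of observed values.
    """
    schema = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(_is_int(v) for v in values):
            schema[name] = {"type": "INT", "domain": None}
        else:
            schema[name] = {"type": "STRING", "domain": sorted(set(values))}
    return schema


ROWS = [
    {"age": "39", "education": " Bachelors", "gender": " Male"},
    {"age": "28", "education": " Bachelors", "gender": " Female"},
    {"age": "37", "education": " Masters", "gender": " Female"},
]

toy_schema = infer_toy_schema(ROWS)
for name, spec in toy_schema.items():
    print(name, spec)
```

Because a domain contains only the values that were observed, values absent from the training data are missing from it, which is one reason the schema usually needs manual adjustment after inference.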


## 3. Alter the Schema

In [10]:
# Relax the minimum fraction of values that must come from the domain for feature occupation.
occupation = tfdv.get_feature(schema, 'occupation')
occupation.distribution_constraints.min_domain_mass = 0.9

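# Add a new value to the domain of feature native_country.
# Values in this dataset carry a leading space (see the inferred domains),
# so the new value needs one too in order to match the incoming data.
native_country_domain = tfdv.get_domain(schema, 'native_country')
native_country_domain.value.append(' Egypt')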

# By default, all features are in the TRAINING, EVALUATION, and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('EVALUATION')
schema.default_environment.append('SERVING')

# Specify that the target feature (the label) is not in the SERVING environment.
tfdv.get_feature(schema, TARGET_FEATURE_NAME).not_in_environment.append('SERVING')
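
The two adjustments above can be illustrated with a small standard-library sketch. `min_domain_mass` is the minimum fraction of values that must come from the feature's domain, and `not_in_environment` removes a feature from the set expected in a given environment. The helpers `domain_mass` and `expected_features`, the reduced domain, and the value `' Astronaut'` are made up for illustration; this is not how TFDV implements validation.

```python
def domain_mass(values, domain):
    """Return the fraction of values that belong to the domain."""
    in_domain = sum(1 for value in values if value in domain)
    return in_domain / len(values)


def expected_features(features, environment):
    """Return the names of features expected in the given environment.

    `features` maps each feature name to the environments it is excluded from.
    """
    return [name for name, excluded in features.items()
            if environment not in excluded]


# A reduced occupation domain; ' Astronaut' is a made-up unseen value.
OCCUPATION_DOMAIN = {" Adm-clerical", " Exec-managerial", " Sales"}
observed = [" Sales"] * 5 + [" Adm-clerical"] * 4 + [" Astronaut"]

mass = domain_mass(observed, OCCUPATION_DOMAIN)
passes_strict = mass >= 1.0    # every value must come from the domain
passes_relaxed = mass >= 0.9   # the relaxed threshold set above
print(mass, passes_strict, passes_relaxed)

# Only the target feature is excluded from SERVING.
FEATURES = {"age": [], "occupation": [], "income_bracket": ["SERVING"]}
print(expected_features(FEATURES, "TRAINING"))
print(expected_features(FEATURES, "SERVING"))
```

With the relaxed threshold, a small share of unseen occupations no longer counts as a violation, and serving data is not expected to contain the label.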

## 4. Save the Schema

In [11]:
import shutil

if os.path.exists(RAW_SCHEMA_DIR):
    shutil.rmtree(RAW_SCHEMA_DIR)
    
os.mkdir(RAW_SCHEMA_DIR)

raw_schema_location = os.path.join(RAW_SCHEMA_DIR, 'schema.pbtxt')
tfdv.write_schema_text(schema, raw_schema_location)

### Test loading the saved schema

In [12]:
tfdv.load_schema_text(raw_schema_location)

feature {
  name: "age"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "capital_gain"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "capital_loss"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "education"
  type: BYTES
  domain: "education"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "education_num"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "fnlwgt"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "gender"
  type: BYTES
  domain: "gender"
  presence {
    min_fraction: 1.0
    min_count: 1