# Generating Skewed Data for Prediction

This notebook generates skewed data based on the [covertype](https://archive.ics.uci.edu/ml/datasets/covertype) dataset from the UCI Machine Learning Repository. The generated data is then used to simulate an online prediction request workload against a model version deployed on AI Platform Prediction.

The notebook covers the following steps:
1. Download the data
2. Define dataset metadata
3. Sample unskewed data points
4. Prepare skewed data points
5. Simulate serving workload to AI Platform Prediction



## Setup

### Install packages and dependencies

In [0]:
!pip install -U -q google-api-python-client
!pip install -U -q pandas

### Setup your GCP Project

In [0]:
PROJECT_ID = 'sa-data-validation'
BUCKET =  'sa-data-validation'
REGION = 'us-central1'
!gcloud config set project $PROJECT_ID

### Authenticate your GCP account

This step is required only if you run the notebook in Colab.

In [0]:
try:
  from google.colab import auth
  auth.authenticate_user()
  print("Colab user is authenticated.")
except ImportError:
  # Not running in Colab: rely on the credentials already configured locally.
  pass

### Import libraries

In [0]:
import os
from tensorflow import io as tf_io
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

### Define constants

You can change the default values of the following constants:

In [0]:
LOCAL_WORKSPACE = './workspace'
LOCAL_DATA_DIR = os.path.join(LOCAL_WORKSPACE, 'data')
LOCAL_DATA_FILE = os.path.join(LOCAL_DATA_DIR, 'train.csv')
BQ_DATASET_NAME = 'data_validation'
BQ_TABLE_NAME = 'covertype_classifier_logs'
MODEL_NAME = 'covertype_classifier'
VERSION_NAME = 'v1'
MODEL_OUTPUT_KEY = 'probabilities'
SIGNATURE_NAME = 'serving_default'

## 1. Download Data

The covertype dataset is preprocessed, split, and uploaded to the `gs://workshop-datasets/covertype` public GCS location.

We use this version of the preprocessed dataset in this notebook. For more information, see [Cover Type Dataset](https://github.com/GoogleCloudPlatform/mlops-on-gcp/tree/master/datasets/covertype).

In [0]:
if tf_io.gfile.exists(LOCAL_WORKSPACE):
  print("Removing previous workspace artifacts...")
  tf_io.gfile.rmtree(LOCAL_WORKSPACE)

print("Creating a new workspace...")
tf_io.gfile.makedirs(LOCAL_WORKSPACE)
tf_io.gfile.makedirs(LOCAL_DATA_DIR)

In [0]:
!gsutil cp gs://workshop-datasets/covertype/data_validation/training/dataset.csv {LOCAL_DATA_FILE}
!wc -l {LOCAL_DATA_FILE}

In [0]:
data = pd.read_csv(LOCAL_DATA_FILE)
print("Total number of records: {}".format(len(data.index)))
data.sample(10).T

## 2. Define Metadata

In [0]:
HEADER = ['Elevation', 'Aspect', 'Slope','Horizontal_Distance_To_Hydrology',
          'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
          'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
          'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area', 'Soil_Type',
          'Cover_Type']

TARGET_FEATURE_NAME = 'Cover_Type'

FEATURE_LABELS = ['0', '1', '2', '3', '4', '5', '6']

NUMERIC_FEATURE_NAMES = ['Aspect', 'Elevation', 'Hillshade_3pm', 
                         'Hillshade_9am', 'Hillshade_Noon', 
                         'Horizontal_Distance_To_Fire_Points',
                         'Horizontal_Distance_To_Hydrology',
                         'Horizontal_Distance_To_Roadways','Slope',
                         'Vertical_Distance_To_Hydrology']

CATEGORICAL_FEATURE_NAMES = ['Soil_Type', 'Wilderness_Area']

FEATURE_NAMES = CATEGORICAL_FEATURE_NAMES + NUMERIC_FEATURE_NAMES

HEADER_DEFAULTS = [[0] if feature_name in NUMERIC_FEATURE_NAMES + [TARGET_FEATURE_NAME] else ['NA'] 
                   for feature_name in HEADER]

NUM_CLASSES = len(FEATURE_LABELS)
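
The metadata lists above are maintained by hand, so a quick consistency check can catch a feature that is declared twice or left out. The following is only a sketch: `check_metadata` is a hypothetical helper, not part of the notebook, and the short lists in the usage lines are illustrative stand-ins for `HEADER`, `NUMERIC_FEATURE_NAMES`, and `CATEGORICAL_FEATURE_NAMES`.

```python
def check_metadata(header, numeric_names, categorical_names, target_name):
    """Report names that are duplicated, missing from, or unknown to the header."""
    declared = list(numeric_names) + list(categorical_names) + [target_name]
    # Names that appear more than once across the declared lists.
    duplicates = sorted({name for name in declared if declared.count(name) > 1})
    # Header columns that no list declares.
    missing = sorted(set(header) - set(declared))
    # Declared names that are not header columns.
    unknown = sorted(set(declared) - set(header))
    return {'duplicates': duplicates, 'missing': missing, 'unknown': unknown}


# Illustrative stand-ins for the notebook's metadata lists.
example_header = ['Elevation', 'Aspect', 'Wilderness_Area', 'Cover_Type']
report = check_metadata(
    header=example_header,
    numeric_names=['Elevation', 'Aspect'],
    categorical_names=['Wilderness_Area'],
    target_name='Cover_Type',
)
print(report)
```

In the notebook, the same call could be made with the real lists; three empty lists in the report would mean the metadata is consistent.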

In [0]:
for feature_name in CATEGORICAL_FEATURE_NAMES:
  data[feature_name] = data[feature_name].astype(str)

## 3. Sampling Normal Data



In [0]:
normal_data = data.sample(2000)

In [0]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
normal_data['Elevation'].plot.hist(bins=15, ax=axes[0][0], title='Elevation')
normal_data['Aspect'].plot.hist(bins=15, ax=axes[0][1], title='Aspect')
normal_data['Wilderness_Area'].value_counts(normalize=True).plot.bar(ax=axes[1][0], title='Wilderness Area')
normal_data[TARGET_FEATURE_NAME].value_counts(normalize=True).plot.bar(ax=axes[1][1], title=TARGET_FEATURE_NAME)

## 4. Prepare Skewed Data
We are going to introduce the following skews to the data:
1. **Numerical Features**
 * *Elevation - Feature Skew*: Convert the unit of measure from meters to kilometers for 10% of the data points
 * *Aspect - Distribution Skew*: Decrease each value by a random percentage between 1% and 50%
2. **Categorical Features**
 * *Wilderness_Area - Feature Skew*: Add a new category "Others" to about 10% of the data points
 * *Wilderness_Area - Distribution Skew*: Increase the frequency of "Neota" by reassigning about 25% of the "Rawah" and "Commanche" values to it



In [0]:
skewed_data = data.sample(1000)

### 4.1 Skewing numerical features

#### 4.1.1 Elevation Feature Skew

In [0]:
ratio = 0.1
size = int(len(skewed_data.index) * ratio)
indexes = np.random.choice(skewed_data.index, size=size, replace=False)
# Use .loc to avoid chained assignment, which may silently modify a copy.
skewed_data.loc[indexes, 'Elevation'] = skewed_data.loc[indexes, 'Elevation'] // 1000
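
The cell above draws a different random subset on every run. If reproducible skew is wanted, one option is to wrap the same steps in a function that takes a seed. This is a minimal sketch on a toy frame: `skew_units` and its parameters are hypothetical names, and the seeded `np.random.default_rng` generator is a substitution for the notebook's `np.random.choice`.

```python
import numpy as np
import pandas as pd


def skew_units(frame, column, ratio, divisor, seed=0):
    """Return a copy of `frame` with `column` integer-divided by `divisor`
    for a randomly chosen `ratio` of the rows, plus the chosen index labels."""
    rng = np.random.default_rng(seed)
    skewed = frame.copy()  # leave the caller's frame untouched
    size = int(len(skewed.index) * ratio)
    picked = rng.choice(skewed.index.to_numpy(), size=size, replace=False)
    skewed.loc[picked, column] = skewed.loc[picked, column] // divisor
    return skewed, picked


# Toy data: 100 elevations in meters, all between 2000 and 2099.
toy = pd.DataFrame({'Elevation': np.arange(2000, 2100)})
toy_skewed, picked = skew_units(toy, 'Elevation', ratio=0.1, divisor=1000)
print(len(picked), toy_skewed.loc[picked, 'Elevation'].unique())
```

Returning the picked labels makes it easy to confirm afterwards which rows were converted.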

In [0]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
normal_data['Elevation'].plot.hist(bins=15, ax=axes[0], title='Elevation - Normal')
skewed_data['Elevation'].plot.hist(bins=15,  ax=axes[1], title='Elevation - Skewed')

#### 4.1.2 Aspect Distribution Skew

In [0]:
skewed_data['Aspect'] = skewed_data['Aspect'].apply(
    lambda value: int(value * np.random.uniform(0.5, 0.99))
)
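
The same scaling can also be done without a per-row lambda by drawing all factors at once. The sketch below assumes non-negative values (as `Aspect` is, being an angle in degrees); `shrink_values` is a hypothetical helper and the seeded generator is an addition for reproducibility.

```python
import numpy as np
import pandas as pd


def shrink_values(series, low=0.5, high=0.99, seed=0):
    """Scale every value by its own random factor drawn from [low, high)
    and truncate the result to an integer."""
    rng = np.random.default_rng(seed)
    factors = rng.uniform(low, high, size=len(series))
    return (series * factors).astype(int)


# Toy data: one value per degree of aspect.
aspect = pd.Series(np.arange(0, 360))
shrunk = shrink_values(aspect)
print(aspect.mean(), shrunk.mean())
```

Because every factor is below 1, the mean of the scaled series is lower than the original mean, which is the distribution shift the histograms are meant to show.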

In [0]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
normal_data['Aspect'].plot.hist(bins=15, ax=axes[0], title='Aspect - Normal')
skewed_data['Aspect'].plot.hist(bins=15,  ax=axes[1], title='Aspect - Skewed')

### 4.2 Skewing categorical features

#### 4.2.1 Wilderness Area Feature Skew
Add a new category "Others" to about 10% of the data points.


In [0]:
skewed_data['Wilderness_Area'] = skewed_data['Wilderness_Area'].apply(
    lambda value: 'Others' if np.random.uniform() <= 0.1 else value
)

#### 4.2.2 Wilderness Area Distribution Skew

In [0]:
skewed_data['Wilderness_Area'] = skewed_data['Wilderness_Area'].apply(
    lambda value: 'Neota' if value in ['Rawah', 'Commanche'] and np.random.uniform() <= 0.25 else value
)
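
Both categorical skews come down to replacing values under a random boolean mask. The sketch below shows that pattern on a toy series; `inject_category` and `shift_categories` are hypothetical helper names, and the balanced toy data is invented for illustration only.

```python
import numpy as np
import pandas as pd


def inject_category(series, new_value, probability, rng):
    """Replace each value with `new_value` with the given probability."""
    mask = rng.uniform(size=len(series)) <= probability
    # Series.where keeps values where the condition is True.
    return series.where(~mask, new_value)


def shift_categories(series, sources, target, probability, rng):
    """Reassign values found in `sources` to `target` with the given probability."""
    mask = series.isin(sources) & (rng.uniform(size=len(series)) <= probability)
    return series.where(~mask, target)


rng = np.random.default_rng(0)
# Toy data: 250 rows per wilderness area.
areas = pd.Series(['Rawah', 'Commanche', 'Cache', 'Neota'] * 250)
with_others = inject_category(areas, 'Others', 0.1, rng)
shifted = shift_categories(areas, ['Rawah', 'Commanche'], 'Neota', 0.25, rng)
print(with_others.value_counts())
print(shifted.value_counts())
```

Note that the second helper only moves mass between existing categories, so values outside `sources` and `target` (here "Cache") keep their original counts.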

In [0]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
normal_data['Wilderness_Area'].value_counts(normalize=True).plot.bar(ax=axes[0], title='Wilderness Area - Normal')
skewed_data['Wilderness_Area'].value_counts(normalize=True).plot.bar(ax=axes[1], title='Wilderness Area - Skewed')
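
The bar charts give a visual comparison; a numeric view of the same information can be obtained by subtracting the normalized frequencies. This is only a sketch: `category_frequency_gap` is a hypothetical helper and the two short series stand in for the `Wilderness_Area` columns of `normal_data` and `skewed_data`.

```python
import pandas as pd


def category_frequency_gap(reference, candidate):
    """Return candidate-minus-reference relative frequencies per category,
    ordered by absolute size of the gap (largest first)."""
    ref = reference.value_counts(normalize=True)
    cand = candidate.value_counts(normalize=True)
    # Categories absent on one side count as frequency 0 there.
    gap = cand.sub(ref, fill_value=0.0)
    order = gap.abs().sort_values(ascending=False).index
    return gap.reindex(order)


reference = pd.Series(['a', 'a', 'b', 'b'])
candidate = pd.Series(['a', 'b', 'b', 'c'])
gap = category_frequency_gap(reference, candidate)
print(gap)
```

A category that appears only in the candidate data, such as "Others" in this notebook, shows up with a positive gap equal to its full frequency.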


## 5. Simulating serving workload

### 5.1 Implement the model API client

In [0]:
import googleapiclient.discovery
import numpy as np

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, VERSION_NAME)
print("Service name: {}".format(name))

def caip_predict(instance):
  
  request_body = {
      'signature_name': SIGNATURE_NAME,
      'instances': [instance]
  }

  response = service.projects().predict(
      name=name,
      body=request_body
  ).execute()

  if 'error' in response:
    raise RuntimeError(response['error'])

  probability_list = [output[MODEL_OUTPUT_KEY] for output in response['predictions']]
  classes = [FEATURE_LABELS[int(np.argmax(probabilities))] for probabilities in probability_list]
  return classes
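
The decoding step at the end of `caip_predict` can be tried offline, without a deployed model. The sketch below repeats that logic in a hypothetical helper, `decode_predictions`, and feeds it a hand-written response; the response shape is an assumption inferred from the code above, not taken from the service documentation.

```python
import numpy as np

FEATURE_LABELS = ['0', '1', '2', '3', '4', '5', '6']
MODEL_OUTPUT_KEY = 'probabilities'


def decode_predictions(response, labels=FEATURE_LABELS, output_key=MODEL_OUTPUT_KEY):
    """Map each prediction's probability vector to the label with the highest score."""
    if 'error' in response:
        raise RuntimeError(response['error'])
    return [
        labels[int(np.argmax(prediction[output_key]))]
        for prediction in response['predictions']
    ]


# Hand-written response (assumed shape): one instance, highest score at index 1.
mock_response = {
    'predictions': [
        {'probabilities': [0.05, 0.7, 0.05, 0.05, 0.05, 0.05, 0.05]}
    ]
}
print(decode_predictions(mock_response))  # ['1']
```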

In [0]:
import time

def simulate_requests(data_frame):

  print("Simulation started...")
  print("---------------------")
  print("Number of instances: {}".format(len(data_frame.index)))

  i = 0
  for _, row in data_frame.iterrows():
    instance = dict(row)
    instance.pop(TARGET_FEATURE_NAME)
    for k,v in instance.items():
      instance[k] = [v]

    predicted_class = caip_predict(instance)
    i += 1
    
    print(".", end='')

    # i already counts the request that was just sent.
    if i % 100 == 0:
      print()
      print("Sent {} requests.".format(i))

    time.sleep(0.5)
  print("")
  print("-------------------")
  print("Simulation finished.")
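
The loop above turns each row into a dictionary of single-element lists with the target removed. If a row ever carries NumPy scalar values, the request body may fail to serialize as JSON, so a defensive variant can convert them to native Python types first. This is a sketch under that assumption: `row_to_instance` is a hypothetical helper and the row values are invented.

```python
import json

import numpy as np
import pandas as pd


def row_to_instance(row, target_name):
    """Build a request instance from a row: drop the target and wrap each
    value in a list, converting NumPy scalars to native Python types."""
    instance = {}
    for name, value in row.items():
        if name == target_name:
            continue
        if isinstance(value, np.generic):
            value = value.item()
        instance[name] = [value]
    return instance


# Invented row for illustration.
row = pd.Series({
    'Elevation': np.int64(2596),
    'Wilderness_Area': 'Rawah',
    'Cover_Type': np.int64(4),
})
instance = row_to_instance(row, target_name='Cover_Type')
print(json.dumps(instance))
```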
    

### 5.2 Simulate AI Platform Prediction requests

In [0]:
simulate_requests(normal_data)

In [0]:
simulate_requests(skewed_data)