In [None]:
#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Monitor wildlife health via image classification

- **Time estimate**: 2 hours
- **Cost estimate**: less than $30.00

This _interactive notebook_ shows you how to train an image classification model to recognize animal species from [camera trap](https://en.wikipedia.org/wiki/Camera_trap) pictures.

We are using the
[WCS Camera Traps dataset](http://lila.science/datasets/wcscameratraps)
from LILA BC _(Labeled Information Library of Alexandria: Biology and Conservation)_.

Here's a quick summary of what we'll go through:

1. **Create an _images database_** _(~5 minutes, costs a few cents)_: A [BigQuery](https://cloud.google.com/bigquery) table with all the image file names along with their respective category.
1. **Train the image classifier** _(~2 hours, costs ~\$25.00)_: A [Dataflow](https://cloud.google.com/dataflow) pipeline that creates a balanced dataset from the images database, downloads the necessary images from LILA into [Cloud Storage](https://cloud.google.com/storage), imports the data into [AI Platform](https://cloud.google.com/ai-platform) and triggers the model training.
1. **Deploy the model** _(costs $1.25 for every hour the model is deployed)_: After the model finishes training, we look at the results and deploy it into a Cloud endpoint in AI Platform.
1. **Classify images**: We send some images into the model and get back the predictions.

# Before you begin

This is an interactive notebook that runs an existing code sample, so there's no need to write any code.

You can run a _code cell_ by clicking the _"Run cell"_ button at the top left corner of each code cell. When you run a code cell, the code runs in the notebook's runtime, so you're not making any changes to your personal computer.

To avoid getting errors, make sure to run _all_ the code cells _in order_.

Before you begin, you need to:

1. Enable the _Dataflow_ and _Cloud AutoML APIs_ in your Google Cloud project.

  > ℹ️ If you don't plan to keep the resources that you create in this sample, we recommend creating a new project instead of selecting an existing project.
  > After you finish these steps, you can delete the project, removing all resources associated with the project.

  <button>

  [_Click here_ to enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=dataflow,automl.googleapis.com)

  </button>

1. Make sure that billing is enabled for your Cloud project.
  [Learn how to confirm that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. Create a Cloud Storage bucket if you don't have one already.

  > ℹ️ Make sure it's a _regional_ bucket in a location where
  [AutoML is available](https://cloud.google.com/ai-platform-unified/docs/general/locations#available_regions).

  <button>

  [_Click here_ to create a bucket](https://console.cloud.google.com/storage/create-bucket)

  </button>

1. [Create a BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets#create-dataset) if you don't have one already.

  > ℹ️ The BigQuery table is created automatically if it doesn't exist.

In [17]:
#@title My Google Cloud resources

google_cloud_project = "" #@param {type:"string"}
cloud_storage_bucket = "" #@param {type:"string"}
cloud_storage_directory = "samples/wildlife-insights" #@param {type:"string"}
bigquery_dataset = "samples" #@param {type:"string"}
bigquery_table = "wildlife_images_metadata" #@param {type:"string"}
automl_name_prefix = "wildlife_classifier" #@param {type:"string"}
region = "us-central1" #@param {type:"string"}

# Validate inputs.
if not google_cloud_project:
  raise ValueError('Please provide your google_cloud_project')
if not cloud_storage_bucket:
  raise ValueError('Please provide your cloud_storage_bucket')

# Authenticate to use the Google Cloud resources.
try:
  from google.colab import auth
  auth.authenticate_user()
  print('Authenticated')
except ModuleNotFoundError:
  import os
  if os.environ.get('GOOGLE_APPLICATION_CREDENTIALS') is None:
    raise ValueError('Please set your GOOGLE_APPLICATION_CREDENTIALS environment variable to your service account JSON file path.')
  print(f"GOOGLE_APPLICATION_CREDENTIALS: {os.environ['GOOGLE_APPLICATION_CREDENTIALS']}")

%env GOOGLE_CLOUD_PROJECT={google_cloud_project}



> ℹ️ Run the cell above, and after pasting the _"verification code"_, press _[ENTER]_ to authenticate.

## Preparing your working environment

Run the following cells to download and install everything needed for the sample.

<button>

![View in GitHub](https://www.tensorflow.org/images/GitHub-Mark-32px.png)
[View sample in GitHub](https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/people-and-planet-ai/automl-image-classification)

</button>

In [None]:
# Clone the python-docs-samples repository.
!git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git

# Navigate to the sample code directory.
%cd python-docs-samples/people-and-planet-ai/wildlife-health-via-image-classification

Now let's install the sample requirements.

> When we `pip install` the requirements, there might be some warnings about conflicting dependency versions with the pre-installed libraries in Colab. For this sample, they are safe to ignore.

In [None]:
# We need libffi-dev to launch the Dataflow pipeline.
!apt-get -qq install libffi-dev

# ℹ️ Colab already has Pillow pre-installed.
# We remove it from the requirements.txt to avoid having a conflicting version.
# This is not necessary in a clean virtual environment.
!sed -i "s/^Pillow==.*//g" requirements.txt

# Install the sample requirements.
!pip install --quiet -r requirements.txt

# Creating the images database

First, we need to create the _images database_. This is a **one-time only** process.

We run a Dataflow pipeline that creates the images database in BigQuery. Each row contains the following fields:

- `category`: The species we want to predict. This is our _label_.
- `file_name`: The path where the image file is located.

AutoML expects the actual image files to live in Cloud Storage. During the first training job the images are downloaded from the LILA database and saved into Cloud Storage.

The idea is that as we capture new images in the field, we save the JPEG files directly in Cloud Storage and record each image's category and file name in the BigQuery images database.

The initial data comes from the metadata JSON file in the [WCS Camera Traps database](http://lila.science/datasets/wcscameratraps).
We do some very basic data cleaning, such as discarding rows with invalid categories (`#ref!`, `empty`, `unidentifiable`, `unidentified`, `unknown`) and other categories that don't give us any useful information.
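
The cleaning step can be pictured as a simple filter over the metadata rows. The following is a minimal sketch, not the code in `create_images_metadata_table.py`: the `INVALID_CATEGORIES` set only lists the categories named above, and the `is_valid_category` helper and the sample rows (including the two made-up file names) are hypothetical.

```python
# Hypothetical sketch of the category filter; the real pipeline may differ.
INVALID_CATEGORIES = {"#ref!", "empty", "unidentifiable", "unidentified", "unknown"}


def is_valid_category(category: str) -> bool:
    """Returns True when the category names something we can train on."""
    return category.strip().lower() not in INVALID_CATEGORIES


# Sample rows for illustration; the last two file names are made up.
rows = [
    {"category": "tapirus indicus", "file_name": "animals/0036/0072.jpg"},
    {"category": "empty", "file_name": "animals/9999/0001.jpg"},
    {"category": "Unknown", "file_name": "animals/9999/0002.jpg"},
]
clean_rows = [row for row in rows if is_valid_category(row["category"])]
print(clean_rows)
```

Normalizing the category with `strip()` and `lower()` is a design choice of this sketch, so that variants like `Unknown` are also discarded.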

In [None]:
# [One time only] Create the images database.
!python create_images_metadata_table.py \
  --bigquery-dataset "{bigquery_dataset}" \
  --bigquery-table "{bigquery_table}" \
  --runner "DataflowRunner" \
  --job_name "wildlife-images-metadata-`date +%Y%m%d-%H%M%S`" \
  --project "{google_cloud_project}" \
  --temp_location "gs://{cloud_storage_bucket}/{cloud_storage_directory}/temp" \
  --region "{region}" \
  --worker_machine_type "n1-standard-2"

<button>

![View in GitHub](https://www.tensorflow.org/images/GitHub-Mark-32px.png)
[View `create_images_metadata_table.py`](https://github.com/GoogleCloudPlatform/python-docs-samples/blob/dataflow-automl-vision/people-and-planet-ai/automl-image-classification/create_images_metadata_table.py)

</button>

> ℹ️ We need at least `n1-standard-2` [worker machines](https://cloud.google.com/compute/docs/machine-types#n1_machine_types) due to large RAM usage when parsing the metadata JSON file.

You can look up the job details in the Dataflow jobs page:

- [console.cloud.google.com/dataflow/jobs](https://console.cloud.google.com/dataflow/jobs)

# Training the image classifier

One of the challenges of this dataset is that it's very _unbalanced_.
There are tens of thousands of pictures of some species like
[`tayassu pecari`](https://www.google.com/search?q=tayassu+pecari&tbm=isch),
but only a handful of pictures of other species like
[`tolypeutes matacus`](https://www.google.com/search?q=tolypeutes+matacus&tbm=isch).
This could introduce a
[_bias_](https://developers.google.com/machine-learning/crash-course/fairness/types-of-bias)
into our model; it could predict a species just because it's more common rather than actually identifying its features.

So instead of using the entire database of images, we use Dataflow to create a _balanced_ dataset.
We specify the _minimum_ and _maximum_ number of images we want for every category.
Categories with too many images are capped at `max_images_per_class` randomly selected images.
Categories with fewer than `min_images_per_class` images are discarded as having _"too little information to teach our model to classify this species"_.

> ℹ️ For this sample, we decided to default to using between `50` and `100` images per class to keep the training dataset small.
> This reduces the training time at the _potential_ cost of prediction accuracy.
> Feel free to play around with other numbers.
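
To make the balancing rule concrete, here is a minimal in-memory sketch. It is not the Dataflow code in `train_model.py`: the `balance` function, its `seed` parameter, and the synthetic rows are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance(rows, min_images_per_class, max_images_per_class, seed=0):
    """Keeps between min and max images per category; drops smaller categories."""
    by_category = defaultdict(list)
    for category, file_name in rows:
        by_category[category].append(file_name)

    rng = random.Random(seed)
    balanced = {}
    for category, files in by_category.items():
        if len(files) < min_images_per_class:
            continue  # Too few images to learn this category.
        if len(files) > max_images_per_class:
            # Randomly pick a subset without replacement.
            files = rng.sample(files, max_images_per_class)
        balanced[category] = files
    return balanced


# Synthetic rows: one over-represented, one mid-sized, and one rare category.
rows = (
    [("common", f"common/{i}.jpg") for i in range(500)]
    + [("medium", f"medium/{i}.jpg") for i in range(70)]
    + [("rare", f"rare/{i}.jpg") for i in range(5)]
)
dataset = balance(rows, min_images_per_class=50, max_images_per_class=100)
print({category: len(files) for category, files in dataset.items()})
# → {'common': 100, 'medium': 70}
```

The `rare` category is dropped because it has fewer than 50 images, while `common` is sampled down to 100.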

Once Dataflow selects the images for the dataset, it
[lazily](https://en.wikipedia.org/wiki/Lazy_evaluation)
downloads them into Cloud Storage from the LILA database.
This means we are only _preprocessing_ the images we are actually using for training instead of the entire database.
At the same time, if an image already exists in Cloud Storage, there's nothing else to do for that image, so it also serves as a _cache_ for images that have already been preprocessed in previous runs.
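
The "download only if missing" idea can be sketched as follows. This is an illustration under stated assumptions, not the sample's implementation: a local temporary directory stands in for Cloud Storage, and `ensure_image` and `fake_fetch` are hypothetical helpers.

```python
import tempfile
from pathlib import Path


def ensure_image(file_name, storage_dir, fetch):
    """Copies an image into storage only if it is not already there.

    Returns True when the image was fetched, False when it was cached.
    """
    destination = Path(storage_dir) / file_name
    if destination.exists():
        return False  # Already stored by a previous run.
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(fetch(file_name))
    return True


def fake_fetch(file_name):
    # Stand-in for an HTTP request to LILA.
    return b"fake image bytes for " + file_name.encode("utf-8")


with tempfile.TemporaryDirectory() as storage_dir:
    first = ensure_image("animals/0036/0072.jpg", storage_dir, fake_fetch)
    second = ensure_image("animals/0036/0072.jpg", storage_dir, fake_fetch)
print(first, second)  # → True False
```

The second call returns without fetching anything, which is the caching behavior described above.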

After all the images are stored in Cloud Storage, Dataflow creates a
[CSV file for the AutoML dataset](https://cloud.google.com/ai-platform-unified/docs/datasets/prepare-image#csv).
Each row includes the Cloud Storage path of an image along with its category _(label)_.
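
As a rough illustration of that file, the sketch below writes one `path,label` row per image. The `to_csv` helper, the `my-bucket` name, and the assumption that each image is stored under `{cloud_storage_path}/{file_name}` are all hypothetical; check the linked documentation for the exact columns AutoML accepts.

```python
import csv
import io


def to_csv(images, cloud_storage_path):
    """Builds CSV text with one `gs://` path and one label per row."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for file_name, category in images:
        writer.writerow([f"{cloud_storage_path}/{file_name}", category])
    return output.getvalue()


csv_text = to_csv(
    [
        ("animals/0036/0072.jpg", "tapirus indicus"),
        ("animals/0000/1705.jpg", "leopardus wiedii"),
    ],
    cloud_storage_path="gs://my-bucket/samples/wildlife-insights",
)
print(csv_text)
```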

Then it tells AutoML to create a dataset and import the files from the CSV file.
It waits until the dataset is ready, and finally it tells AutoML to train the model.
This is where the Dataflow job stops.

> ℹ️ We let AutoML do the _data splitting_ automatically for us.
> By default, it uses 80% of the data for training, 10% for validation, and 10% for testing.
>
> See [About data splits for AutoML models](https://cloud.google.com/ai-platform-unified/docs/general/ml-use) for more information.
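
To get a feel for what this split means with small classes, the following sketch computes approximate counts per class. The `approximate_split_sizes` helper is hypothetical, and AutoML assigns images to each split on its own, so the real counts may differ slightly.

```python
def approximate_split_sizes(total_images, train_percent=80, validation_percent=10):
    """Returns approximate (training, validation, test) image counts."""
    training = total_images * train_percent // 100
    validation = total_images * validation_percent // 100
    test = total_images - training - validation
    return training, validation, test


for images_per_class in (50, 100):
    print(images_per_class, approximate_split_sizes(images_per_class))
# → 50 (40, 5, 5)
# → 100 (80, 10, 10)
```

With 50 images in a class, only about 5 are left for testing, which is worth keeping in mind when reading the evaluation metrics.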

> ℹ️ For simplicity, in this sample we are training a `CLOUD` model.
> This allows us to deploy it to an HTTP endpoint and get predictions online.
>
> See [Train an AutoML Edge model](https://cloud.google.com/ai-platform-unified/docs/training/automl-edge-console)
> for information on how to train a model for a microcontroller.

In [19]:
min_images_per_class = 50 #@param {type:"integer"}
max_images_per_class = 100 #@param {type:"integer"}


In [None]:
# Create a balanced dataset and signal AutoML to train a model.
!python train_model.py \
  --cloud-storage-path "gs://{cloud_storage_bucket}/{cloud_storage_directory}" \
  --bigquery-dataset "{bigquery_dataset}" \
  --bigquery-table "{bigquery_table}" \
  --automl-name-prefix "{automl_name_prefix}" \
  --min-images-per-class "{min_images_per_class}" \
  --max-images-per-class "{max_images_per_class}" \
  --runner "DataflowRunner" \
  --job_name "wildlife-train-model-`date +%Y%m%d-%H%M%S`" \
  --project "{google_cloud_project}" \
  --temp_location "gs://{cloud_storage_bucket}/{cloud_storage_directory}/temp" \
  --requirements_file "requirements.txt" \
  --region "{region}"

<button>

![View in GitHub](https://www.tensorflow.org/images/GitHub-Mark-32px.png)
[View `train_model.py`](https://github.com/GoogleCloudPlatform/python-docs-samples/blob/dataflow-automl-vision/people-and-planet-ai/automl-image-classification/train_model.py)

</button>

> ℹ️ It can take several minutes for the job to show up in the Dataflow jobs page. See [[ARROW-8983]](https://issues.apache.org/jira/browse/ARROW-8983) for more information.

You can look up the job details in the Dataflow jobs page:

- [console.cloud.google.com/dataflow/jobs](https://console.cloud.google.com/dataflow/jobs)

You can look up the status of your AutoML resources as well:

- Datasets: [console.cloud.google.com/ai/platform/datasets](https://console.cloud.google.com/ai/platform/datasets)
- Training: [console.cloud.google.com/ai/platform/training/training-pipelines](https://console.cloud.google.com/ai/platform/training/training-pipelines)
- Models: [console.cloud.google.com/ai/platform/models](https://console.cloud.google.com/ai/platform/models)

Training the AutoML model can take a while, depending on the dataset size and the training budget you allow.

> ℹ️ You can adjust the training budget using the `--automl-budget-milli-node-hours` flag. We default to `8000` (8 node hours), which is the minimum.
>
> See [AutoML pricing](https://cloud.google.com/vision/automl/pricing) and [`train_budget_milli_node_hours`](https://cloud.google.com/automl/docs/reference/rpc/google.cloud.automl.v1#imageclassificationmodelmetadata) for more information.
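
The budget is expressed in thousandths of a node hour. A small conversion sketch, using hypothetical helper names, shows how the flag value relates to node hours:

```python
def node_hours_to_milli(node_hours: float) -> int:
    """Converts node hours to the milli node hours the flag expects."""
    return round(node_hours * 1000)


def milli_to_node_hours(milli_node_hours: int) -> float:
    """Converts a milli node hours budget back to node hours."""
    return milli_node_hours / 1000


print(milli_to_node_hours(8000))  # → 8.0
print(node_hours_to_milli(12))    # → 12000
```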

# Deploying the model

> ℹ️ If you were disconnected from the session due to inactivity, please make sure to re-run the _"My Google Cloud resources"_ cell at the beginning of the notebook.

After the model training has finished, we need to deploy it into an endpoint to get predictions from it.

You can deploy it through the Cloud Console: [console.cloud.google.com/ai/platform/models](https://console.cloud.google.com/ai/platform/models)

Alternatively, you can deploy it through the API:

In [None]:
# First we need the model path, which we can get with gcloud.
cmd_output = !gcloud beta ai models list \
  --project {google_cloud_project} \
  --region {region} \
  --filter "display_name:{automl_name_prefix}*" \
  --format "table[no-heading](display_name,name)" 2>/dev/null
models = sorted([line.split() for line in cmd_output])
model_path = models[0][1]

print(f"model_path: {model_path}")

In [None]:
# Create an endpoint and deploy the model to it.
!python deploy_model.py \
  --project {google_cloud_project} \
  --region {region} \
  --model-path {model_path} \
  --model-endpoint-name {automl_name_prefix}

<button>

![View in GitHub](https://www.tensorflow.org/images/GitHub-Mark-32px.png)
[View `deploy_model.py`](https://github.com/GoogleCloudPlatform/python-docs-samples/blob/dataflow-automl-vision/people-and-planet-ai/automl-image-classification/deploy_model.py)

</button>

# Classifying images

Now that we have a deployed model, we can classify images using that model.

Since we don't have a camera trap readily available, let's use some images from LILA to see the model in action.

## Visualizing images from LILA

First, let's define some functions to help us visualize and navigate the images from the LILA database.

In [None]:
import io
import requests
from PIL import Image
from IPython.display import display

from google.cloud import bigquery


def display_image(image_file, width=400):
  base_url = 'https://lilablobssc.blob.core.windows.net/wcs-unzipped'
  image_bytes = requests.get(f"{base_url}/{image_file}").content
  if b'<Error>' in image_bytes:
    raise ValueError(f"Error requesting image: {base_url}/{image_file}\n{image_bytes.decode('utf-8')}")
  image = Image.open(io.BytesIO(image_bytes))
  display(image.resize((int(width), int(width / image.size[0] * image.size[1]))))


def display_samples_for_category(category, num_samples=3, width=400):
  client = bigquery.Client()
  query_job = client.query(f"""
      SELECT file_name
      FROM `{google_cloud_project}.{bigquery_dataset}.{bigquery_table}`
      WHERE category = '{category}'
      LIMIT {num_samples}
  """)

  for row in query_job:
    image_file = row['file_name']
    print(f"{category}: {image_file}")
    display_image(image_file, width)

In [None]:
# We can explore images for a specific category like this.
display_samples_for_category('tapirus indicus', 3)

## Online predictions

Now let's see what our model thinks about some images.

In [None]:
# First we need the endpoint ID, which we can get with gcloud.
stdout = !gcloud beta ai endpoints list \
  --project {google_cloud_project} \
  --region {region} \
  --filter "display_name={automl_name_prefix}" \
  --format "table[no-heading](ENDPOINT_ID)" 2>/dev/null
model_endpoint_id = stdout[0]

print(f"model_endpoint_id: {model_endpoint_id}")

In [None]:
def predict(image_file):
  display_image(image_file)
  !python predict.py \
    --project "{google_cloud_project}" \
    --region "{region}" \
    --model-endpoint-id "{model_endpoint_id}" \
    --image-file "{image_file}"

<button>

![View in GitHub](https://www.tensorflow.org/images/GitHub-Mark-32px.png)
[View `predict.py`](https://github.com/GoogleCloudPlatform/python-docs-samples/blob/dataflow-automl-vision/people-and-planet-ai/automl-image-classification/predict.py)

</button>


In [None]:
# Species: dicerorhinus sumatrensis
predict('animals/0325/1529.jpg')

In [None]:
# Species: didelphis imperfecta
predict('animals/0667/1214.jpg')

In [None]:
# Species: tapirus indicus
predict('animals/0036/0072.jpg')

In [None]:
# Species: leopardus wiedii
predict('animals/0000/1705.jpg')

In [None]:
# Species: hemigalus derbyanus
predict('animals/0036/0566.jpg')

In [None]:
# Species: dasypus novemcinctus
predict('animals/0000/0425.jpg')

Based on the model evaluation in AutoML and some of our experiments, it might be worth trying to classify images by family instead of by specific species. Many species of the same family look very similar, which may be confusing the model.
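
One way to try this would be to relabel each row with its family before building the dataset. The sketch below is only an illustration: the `SPECIES_TO_FAMILY` lookup is a tiny hand-written table that is not part of the sample, and `to_family_label` is a hypothetical helper. A real experiment would need a complete taxonomy for the dataset.

```python
# Hypothetical, hand-written lookup used only for illustration.
SPECIES_TO_FAMILY = {
    "leopardus wiedii": "felidae",
    "leopardus pardalis": "felidae",
    "tapirus indicus": "tapiridae",
    "tapirus terrestris": "tapiridae",
}


def to_family_label(species):
    """Returns the family for a species, or the species itself if unmapped."""
    return SPECIES_TO_FAMILY.get(species, species)


rows = [
    ("leopardus wiedii", "animals/0000/1705.jpg"),
    ("tapirus indicus", "animals/0036/0072.jpg"),
    ("dasypus novemcinctus", "animals/0000/0425.jpg"),
]
relabeled = [(to_family_label(species), file_name) for species, file_name in rows]
print(relabeled)
```

Falling back to the species name for unmapped entries keeps those rows usable, at the cost of mixing family and species labels in the same dataset.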

# Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

## Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

> ⚠️ **Caution**: Deleting a project has the following effects:
>
> - **Everything in the project is deleted.** If you used an existing project for this tutorial, when you delete it, you also delete any other work you've done in the project.
> - **Custom project IDs are lost.** When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an `appspot.com` URL, delete selected resources inside the project instead of deleting the whole project.
>
> If you plan to explore multiple tutorials and quickstarts, reusing projects can help you avoid exceeding project quota limits.

1. In the Cloud Console, go to the **Manage resources** page.

  <button>

  [Go to Manage resources](https://console.cloud.google.com/iam-admin/projects)

  </button>

1. In the project list, select the project that you want to delete, and then click **Delete**.

1. In the dialog, type the project ID, and then click **Shut down** to delete the project.
