# AutoML for Text Classification with Vertex AI

**Learning Objectives**

1. Learn how to create a text classification dataset for AutoML using BigQuery
1. Learn how to train AutoML to build a text classification model
1. Learn how to evaluate a model trained with AutoML
1. Learn how to predict on new test data with AutoML

## Introduction

In this notebook, we will use [AutoML for Text Classification](https://cloud.google.com/natural-language/automl/docs/beginners-guide) to train a text model to recognize the source of article titles: New York Times, TechCrunch, or GitHub.

First, we will query a public BigQuery dataset taken from [Hacker News](https://news.ycombinator.com/), an aggregator that displays tech-related headlines from various sources, to create our training set.

Second, we will use the AutoML UI to upload our dataset, train a text model on it, and evaluate the model we have just trained.

In [1]:
import os

import pandas as pd
from google.cloud import bigquery

Replace the variable values in the cell below. Note that AutoML can only be run in the [regions where it is available](https://cloud.google.com/vertex-ai/docs/general/locations#feature-availability).

In [136]:
PROJECT = !(gcloud config get-value core/project)
PROJECT = PROJECT[0]
BUCKET = PROJECT  # defaults to PROJECT
REGION = "us-central1"  # Replace with your REGION

os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION

In [14]:
%%bash
exists=$(gsutil ls -d | grep -w gs://$BUCKET/)

if [ -n "$exists" ]; then
   echo -e "Bucket gs://$BUCKET already exists."
    
else
   echo "Creating a new GCS bucket."
   gsutil mb -l $REGION gs://$BUCKET
   echo -e "\nHere are your current buckets:"
   gsutil ls
fi

Bucket gs://qwiklabs-gcp-00-3b39512a845c already exists.


## Create a Dataset from BigQuery 

Hacker News headlines are available as a BigQuery public dataset. The [dataset](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.stories?tab=details) contains all headlines from the site's inception in October 2006 until October 2015.

Here is a sample of the dataset:

In [15]:
%%bigquery --project $PROJECT

SELECT
    url, title, score
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    LENGTH(title) > 10
    AND score > 10
    AND LENGTH(url) > 0
LIMIT 10



| | url | title | score |
|---|---|---|---|
| 0 | https://www.kickstarter.com/projects/carlosxcl... | Show HN: Code Cards, Like Texas hold 'em for p... | 11 |
| 1 | http://vancouver.en.craigslist.ca/van/roo/2035... | Best Roommate Ad Ever | 11 |
| 2 | https://github.com/Groundworkstech/Submicron | Deep-Submicron Backdoors | 11 |
| 3 | http://empowerunited.com/ | Could this be the solution for the 99%? | 11 |
| 4 | http://themanufacturingrevolution.com/braun-vs... | Braun vs. Apple: Is copying designs theft or i... | 11 |
| 5 | https://github.com/styleguide/ | GitHub Styleguide - CSS, HTML, JS, Ruby | 11 |
| 6 | http://zoomzum.com/10-best-firefox-add-ons-to-... | Essential Firefox Add-Ons to Make You More Pro... | 11 |
| 7 | http://www.zintin.com | Feedback on our social media iPhone app | 11 |
| 8 | http://founderdating.com/comingtodinner/ | Guess Who’s Coming to Dinner…To Save Our Company | 11 |
| 9 | http://tech.matchfwd.com/poor-mans-template-ab... | Poor Man's Template A/B Testing (in Django) | 11 |


Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the URL is `http://mobile.nytimes.com/....`, we want to be left with <i>nytimes</i>.

In [16]:
%%bigquery --project $PROJECT

SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    COUNT(title) AS num_articles
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
GROUP BY
    source
ORDER BY num_articles DESC
  LIMIT 100



| | source | num_articles |
|---|---|---|
| 0 | blogspot | 41386 |
| 1 | github | 36525 |
| 2 | techcrunch | 30891 |
| 3 | youtube | 30848 |
| 4 | nytimes | 28787 |
| ... | ... | ... |
| 95 | f5 | 1254 |
| 96 | gamasutra | 1249 |
| 97 | cnbc | 1229 |
| 98 | indiatimes | 1223 |
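
To see what the nested SQL expression in the query does, the same logic can be sketched in plain Python. This is only an illustration: the function name `extract_source` is hypothetical, and Python's `re` module stands in for BigQuery's `REGEXP_EXTRACT` and `REGEXP_CONTAINS`, so edge cases may behave differently.

```python
import re
from typing import Optional

# Same pattern as in the query: capture everything between "://" and the next "/".
HOST_PATTERN = re.compile(r".*://(.[^/]+)/")


def extract_source(url: str) -> Optional[str]:
    """Return the host label just before '.com', or None if the URL does not qualify."""
    match = HOST_PATTERN.search(url)
    if match is None:
        # No "scheme://host/" prefix, e.g. a URL without a slash after the host.
        return None
    host = match.group(1)
    # Mirrors REGEXP_CONTAINS(host, '.com$'); the unescaped dot matches any character.
    if re.search(r".com$", host) is None:
        return None
    labels = host.split(".")
    if len(labels) < 2:
        return None
    # ARRAY_REVERSE(SPLIT(host, '.'))[OFFSET(1)] is the element before the final "com".
    return labels[-2]


print(extract_source("http://mobile.nytimes.com/2015/01/01/story.html"))  # nytimes
print(extract_source("https://github.com/styleguide/"))  # github
```

Note that a URL with no slash after the host, such as `http://www.zintin.com`, does not match the pattern and is therefore dropped.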


Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.

In [17]:
regex = ".*://(.[^/]+)/"


sub_query = """
SELECT
    title,
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[OFFSET(1)] AS source
    
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')
    AND LENGTH(title) > 10
""".format(
    regex
)


query = """
SELECT 
    LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,
    source
FROM
  ({sub_query})
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
""".format(
    sub_query=sub_query
)

print(query)


SELECT 
    LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,
    source
FROM
  (
SELECT
    title,
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source
    
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')



For ML training, we usually need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). AutoML, however, figures out on its own how to create these splits, so we won't need to do that here.



In [18]:
bq = bigquery.Client(project=PROJECT)
title_dataset = bq.query(query).to_dataframe()
title_dataset.head()

| | title | source |
|---|---|---|
| 0 | feminist-software-foundation complains about r... | github |
| 1 | expose sps as web services on the fly. | github |
| 2 | show hn scrwl shorthand code reading and wr... | github |
| 3 | geoip module on nodejs now is a c addon | github |
| 4 | show hn linuxexplorer | github |


AutoML for text classification requires that
* the dataset be in CSV form,
* the first column contain the texts to classify or a GCS path to the text, and
* the last column contain the text labels.

The dataset we pulled from BigQuery satisfies these requirements.
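
As a quick sanity check of the layout described above, a small standard-library sketch can verify that each CSV row has non-empty text in the first column and a known label in the last. The helper name `check_automl_rows` is hypothetical; it is not part of AutoML or any Google library.

```python
import csv
import io
from typing import List


def check_automl_rows(csv_text: str, allowed_labels: List[str]) -> int:
    """Validate text-first, label-last rows and return the number of rows checked."""
    reader = csv.reader(io.StringIO(csv_text))
    count = 0
    for line_number, row in enumerate(reader, start=1):
        if len(row) < 2:
            raise ValueError(f"row {line_number}: expected a text and a label column")
        text, label = row[0], row[-1]
        if not text.strip():
            raise ValueError(f"row {line_number}: empty text")
        if label not in allowed_labels:
            raise ValueError(f"row {line_number}: unexpected label {label!r}")
        count += 1
    return count


sample = "show hn linuxexplorer,github\nexpose sps as web services on the fly.,github\n"
print(check_automl_rows(sample, ["github", "nytimes", "techcrunch"]))  # 2
```

The same function could be pointed at the contents of a CSV file once it has been written to disk.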

In [19]:
print(f"The full dataset contains {len(title_dataset)} titles")

The full dataset contains 96203 titles


Let's make sure we have roughly the same number of examples for each of our three labels:

In [20]:
title_dataset.source.value_counts()

github        36525
techcrunch    30891
nytimes       28787
Name: source, dtype: int64

Finally we will save our data, which is currently in-memory, to disk.

We will create a csv file containing the full dataset and another containing only 1000 articles for development.

**Note:** It may take a long time to train AutoML on the full dataset, so we recommend using the sample dataset for the purpose of learning the tool.


In [21]:
DATADIR = "./data/"

if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)

In [22]:
FULL_DATASET_NAME = "titles_full.csv"
FULL_DATASET_PATH = os.path.join(DATADIR, FULL_DATASET_NAME)

# Let's shuffle the data before writing it to disk.
title_dataset = title_dataset.sample(n=len(title_dataset))

title_dataset.to_csv(
    FULL_DATASET_PATH, header=False, index=False, encoding="utf-8"
)

Now let's sample 1000 articles from the full dataset and make sure we have enough examples for each label in our sample dataset (see [here](https://cloud.google.com/natural-language/automl/docs/beginners-guide) for further details on how to prepare data for AutoML).

In [23]:
sample_title_dataset = title_dataset.sample(n=1000)
sample_title_dataset.source.value_counts()

github        351
nytimes       334
techcrunch    315
Name: source, dtype: int64
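
The "enough examples for each label" check can also be automated. Below is a minimal sketch; the helper name and the threshold of 100 are illustrative assumptions, not a documented AutoML requirement.

```python
from collections import Counter
from typing import Dict, Iterable


def underrepresented_labels(labels: Iterable[str], min_per_label: int) -> Dict[str, int]:
    """Return the labels (with their counts) that have fewer than min_per_label examples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_per_label}


# Label counts taken from the sample shown above.
sample_labels = ["github"] * 351 + ["nytimes"] * 334 + ["techcrunch"] * 315
print(underrepresented_labels(sample_labels, min_per_label=100))  # {}
```

With the dataframe from this notebook, the call would be `underrepresented_labels(sample_title_dataset.source, 100)`; an empty result means every label clears the threshold.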

Let's write the sample dataset to disk.

In [24]:
SAMPLE_DATASET_NAME = "titles_sample.csv"
SAMPLE_DATASET_PATH = os.path.join(DATADIR, SAMPLE_DATASET_NAME)

sample_title_dataset.to_csv(
    SAMPLE_DATASET_PATH, header=False, index=False, encoding="utf-8"
)

In [25]:
sample_title_dataset.head()

| | title | source |
|---|---|---|
| 8623 | baron is a bitcoin payment processor that anyo... | github |
| 10144 | default profile pictures generated on the fly | github |
| 7114 | pelusa - static analysis lint-type tool for ruby | github |
| 14841 | hackerpit.com solutions | github |
| 56251 | nyt obit of joybubbles he hacked the phone sy... | nytimes |


In [26]:
%%bash
gsutil cp data/titles_sample.csv gs://$BUCKET

Copying file://data/titles_sample.csv [Content-Type=text/csv]...
/ [1 files][ 57.2 KiB/ 57.2 KiB]                                                
Operation completed over 1 objects/57.2 KiB.                                     


## Train a Model with AutoML for Text Classification

### Step 1: Create the dataset in Vertex AI

From the Vertex AI menu, click "Datasets", then click "+Create" at the top of the window.

Create a dataset called `hacker_news_titles` and specify it as a Text dataset for text classification (Single-label). Click the `Create` button at the bottom. Note that here you should choose the region that agrees with the region you specified above; e.g. we use 'us-central1'.

![AutoML](./assets/vertex/create_dataset.png)

Then, select the file `titles_sample.csv` from your GCS bucket. Importing the data can take about 10 minutes.

![AutoML](./assets/vertex/select_gcs_file.png)
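
The same step can also be scripted instead of clicked through. The sketch below assumes the `google-cloud-aiplatform` package is installed, that credentials are configured, and that text datasets are available in your project and region; the function name `create_text_dataset` is hypothetical.

```python
def create_text_dataset(project: str, location: str, gcs_uri: str):
    """Create a single-label text classification dataset and import a CSV from GCS."""
    # Imported lazily so this definition loads even where the SDK is not installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    return aiplatform.TextDataset.create(
        display_name="hacker_news_titles",
        gcs_source=gcs_uri,
        import_schema_uri=(
            aiplatform.schema.dataset.ioformat.text.single_label_classification
        ),
    )


# Example call (needs credentials and may incur cost), using the notebook variables:
# dataset = create_text_dataset(PROJECT, REGION, f"gs://{BUCKET}/titles_sample.csv")
```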

### Step 2: Train an AutoML text model

Once the dataset is imported you can browse specific examples, or analyze label distributions. Once you are happy with what you see, proceed to train the model.

![AutoML](./assets/vertex/train_model_1.png)

Give your model an indicative name like `hacker_news_titles_automl` and start training. Training may take a few hours. 

![AutoML](./assets/vertex/train_model_2.png)
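
For readers who prefer code over the console, training can be sketched with the Vertex AI SDK as well. This assumes `google-cloud-aiplatform` is installed, `aiplatform.init` has already been called, and `dataset` is an `aiplatform.TextDataset`; the function name and the job display name are hypothetical. No split fractions are passed, so the default split is used.

```python
def train_automl_text_model(dataset):
    """Start an AutoML single-label text classification job and return the trained model."""
    # Imported lazily so this definition loads even where the SDK is not installed.
    from google.cloud import aiplatform

    job = aiplatform.AutoMLTextTrainingJob(
        display_name="hacker_news_titles_automl_job",  # hypothetical job name
        prediction_type="classification",
        multi_label=False,
    )
    # Blocks until training finishes, which may take a few hours.
    return job.run(
        dataset=dataset,
        model_display_name="hacker_news_titles_automl",
    )


# Example call (needs credentials and incurs training cost):
# model = train_automl_text_model(dataset)
```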

### Step 3: Evaluate the model

Once the model is trained, navigate to the `Models` tab in Vertex AI and find your model `hacker_news_titles_automl`. Click on the model and select "Evaluate" to see how the model performed. You'll be able to see the overall precision and recall, as well as drill down to performance at the individual label level.

![AutoML](./assets/vertex/evaluate_1.png)

AutoML will also show you a confusion matrix and you can see examples where the model made a mistake for each of the labels.

![AutoML](./assets/vertex/evaluate_2.png)

### Step 4: Predict with the trained AutoML model

Now you can test your model directly by entering new text in the UI and having AutoML predict the source of your snippet. First deploy your model to an endpoint. Click on `Deploy to Endpoint` and you'll be directed to a page to create the endpoint. Give the endpoint an indicative name, like `hacker_news_model_endpoint`. Keep all other options as default, and press `DEPLOY`. It may take a few minutes to create the endpoint and deploy the model to the endpoint.

![AutoML](./assets/vertex/deploy.png)

Once the deployment has completed, you can test your model in the UI and make online predictions. Just type text into the box and click `Predict`. You'll see your model's predictions and the corresponding softmax values for each label:

![AutoML](./assets/vertex/online_predict.png)
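
Online predictions can also be requested from code. The sketch below assumes `endpoint` is an `aiplatform.Endpoint` with the model deployed, and that each prediction is a mapping with `displayNames` and `confidences` keys; both the payload shape and the helper names are assumptions to verify against your own endpoint's response.

```python
from typing import Mapping, Sequence, Tuple


def top_label(prediction: Mapping[str, Sequence]) -> Tuple[str, float]:
    """Pick the label with the highest confidence from one prediction."""
    names = prediction["displayNames"]
    scores = prediction["confidences"]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return names[best], float(scores[best])


def predict_source(endpoint, title: str) -> Tuple[str, float]:
    """Send one title to a deployed endpoint and return its most likely source."""
    response = endpoint.predict(
        instances=[{"content": title, "mimeType": "text/plain"}]
    )
    return top_label(response.predictions[0])


# Offline illustration with a made-up prediction payload (not real model output):
fake_prediction = {
    "displayNames": ["github", "nytimes", "techcrunch"],
    "confidences": [0.91, 0.03, 0.06],
}
print(top_label(fake_prediction))  # ('github', 0.91)
```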

You can also set up a batch prediction job. First we'll need to set up our files for prediction with AutoML text classification. To do this, we'll use a JSONL file to specify a list of documents to make predictions about and then store the JSONL file in a Cloud Storage bucket. A single line in an input JSONL file should have the format:

`{"content": "gs://sourcebucket/datasets/texts/source_text.txt", "mimeType": "text/plain"}`
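
A line in this format can be produced with `json.dumps`, which takes care of quoting and escaping. The helper name `batch_input_line` is hypothetical and only illustrates the format shown above.

```python
import json


def batch_input_line(gcs_uri: str) -> str:
    """Build one JSONL input line that points at a plain-text file in GCS."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {gcs_uri!r}")
    return json.dumps({"content": gcs_uri, "mimeType": "text/plain"})


line = batch_input_line("gs://sourcebucket/datasets/texts/source_text.txt")
print(line)
# Each line must parse as a standalone JSON object.
assert json.loads(line)["mimeType"] == "text/plain"
```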

We'll create the GCS .txt files and create the jsonl file below:

In [146]:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket(BUCKET)

SAMPLE_BATCH_INPUTS = "./batch_predict_inputs.jsonl"

# Open the local JSONL file once in write mode so reruns don't append duplicate lines.
with open(SAMPLE_BATCH_INPUTS, "w") as f:
    for idx, text in sample_title_dataset.title.items():
        # write the text sample to GCS
        blob = bucket.blob(f"hacker_news_sample/sample_{idx}.txt")
        blob.upload_from_string(data=text, content_type="text/plain")

        # add the GCS file to local jsonl
        f.write(
            f'{{"content": "gs://{BUCKET}/hacker_news_sample/sample_{idx}.txt", "mimeType": "text/plain"}}\n'
        )

Let's make sure the jsonl file was written correctly and that the bucket contains the sample .txt files:

In [157]:
!head -5 ./batch_predict_inputs.jsonl

{"content": "gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_8623.txt", "mimeType": "text/plain"}
{"content": "gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_10144.txt", "mimeType": "text/plain"}
{"content": "gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_7114.txt", "mimeType": "text/plain"}
{"content": "gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_14841.txt", "mimeType": "text/plain"}
{"content": "gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_56251.txt", "mimeType": "text/plain"}


In [162]:
!gsutil ls gs://$BUCKET/hacker_news_sample | head -5

gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_1.txt
gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_10030.txt
gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_10038.txt
gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_10133.txt
gs://qwiklabs-gcp-00-3b39512a845c/hacker_news_sample/sample_10144.txt


We'll copy the JSONL file to our GCS bucket and kick off the batch prediction job.

In [163]:
!gsutil cp ./batch_predict_inputs.jsonl gs://$BUCKET

Copying file://./batch_predict_inputs.jsonl [Content-Type=application/octet-stream]...
/ [1 files][108.3 KiB/108.3 KiB]                                                
Operation completed over 1 objects/108.3 KiB.                                    


Click the `Create Batch Prediction` button and fill in the fields 'Batch prediction name', 'Cloud Storage source path' and 'Cloud Storage location to store outputs'. We'll call our job `hacker_news_batch_job`, run prediction on the JSONL file we just created and uploaded, and write the prediction outputs to our GCS bucket.

Then, we create our batch prediction job.

![AutoML](./assets/vertex/batch_predict_1.png)
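
The batch job can be launched from code too. This is a sketch under assumptions: `google-cloud-aiplatform` is installed, credentials are configured, and a model named `hacker_news_titles_automl` exists in the given project and region. The function name `run_batch_prediction` is hypothetical.

```python
def run_batch_prediction(project: str, location: str, bucket: str):
    """Run a batch prediction job on the uploaded JSONL file and return the job."""
    # Imported lazily so this definition loads even where the SDK is not installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    models = aiplatform.Model.list(filter='display_name="hacker_news_titles_automl"')
    if not models:
        raise RuntimeError("no model named hacker_news_titles_automl was found")
    return models[0].batch_predict(
        job_display_name="hacker_news_batch_job",
        gcs_source=f"gs://{bucket}/batch_predict_inputs.jsonl",
        gcs_destination_prefix=f"gs://{bucket}",
        sync=True,  # block until the job completes
    )


# Example call (needs credentials and may incur cost), using the notebook variables:
# batch_job = run_batch_prediction(PROJECT, REGION, BUCKET)
```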

Once the job is complete we can inspect the results in our GCS bucket:

![AutoML](./assets/vertex/batch_predict_2.png)

Copyright 2021 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License