# Text Classification using Keras

We will look at the titles of articles and figure out whether the article came from the New York Times, TechCrunch or GitHub. 

We will use [Hacker News](https://news.ycombinator.com/) as our data source. It is an aggregator that displays tech-related headlines from various sources.

**Learning Objectives**

* Learn how to use TF-Hub for transfer learning
* Learn how to create a sentence level text classification model using Keras
* Learn how to create a word level text classification model using Keras

In [None]:
# Ensure that the right version of TensorFlow is installed.
!pip freeze | grep tf-nightly-2.0-preview || pip install tf-nightly-2.0-preview

In [None]:
import os
import shutil

from google.cloud import bigquery
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.layers import (
    Conv1D,
    GlobalAveragePooling1D,
    Dropout,
    Dense,
    MaxPooling1D,
)
from tensorflow.keras.models import Sequential


print(tf.__version__)

In [None]:
BUCKET = 'dherin-sandbox'
PROJECT = 'dherin-sandbox'
REGION = 'us-central1'
SEED = 0

In [None]:
%load_ext google.cloud.bigquery

In [None]:
%matplotlib inline

## GPU Strongly Recommended

This entire notebook will run in under 10 minutes using a V100 GPU, but will take about 3 hours on CPU.

You can add a GPU to your AI Platform Notebook instance following [these instructions](https://cloud.google.com/ml-engine/docs/notebooks/manage-hardware-accelerators).  You can remove the GPU after completing the lab (to manage costs).

After adding the GPU, the subsequent cell should print "GPU Enabled: True".

In [None]:
print('GPU Enabled: {}'.format(tf.test.is_gpu_available()))

# Create Dataset from BigQuery 

Hacker News headlines are available as a BigQuery public dataset. The [dataset](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.stories?tab=details) contains all headlines from the site's inception in October 2006 until October 2015.

Here is a sample of the dataset:

In [None]:
%%bigquery --project $PROJECT

SELECT
    url, title, score
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    LENGTH(title) > 10
    AND score > 10
    AND LENGTH(url) > 0
LIMIT 10

Let's do some regular expression parsing in BigQuery to get the source of the article from the URL. For example, if the URL is http://mobile.nytimes.com/...., we want to be left with <i>nytimes</i>.

In [None]:
%%bigquery --project $PROJECT

SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    COUNT(title) AS num_articles
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
GROUP BY
    source
ORDER BY
    num_articles DESC
LIMIT 100
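
To make the SQL expression above easier to follow, here is a rough pure-Python equivalent of the same steps: extract the host with the regular expression, keep only hosts ending in `com`, split on dots, reverse, and take the element at offset 1. The helper name `extract_source` and the sample URLs are made up for illustration, and Python's `re` module is only a stand-in for BigQuery's regex engine, so treat this as a sketch rather than an exact replica.

```python
import re

# Same pattern as in the query: capture everything between '://' and the next '/'.
HOST_PATTERN = re.compile(r'.*://(.[^/]+)/')


def extract_source(url):
    """Return the second-to-last host component for .com URLs, else None."""
    match = HOST_PATTERN.search(url)
    if match is None:
        return None  # e.g. there is no '/' after the host
    host = match.group(1)
    # Mirrors REGEXP_CONTAINS(host, '.com$'). The dot is unescaped in the
    # query as well, so it matches any character followed by 'com'.
    if re.search(r'.com$', host) is None:
        return None
    parts = host.split('.')
    if len(parts) < 2:
        return None
    # ARRAY_REVERSE(...)[OFFSET(1)] is the second element from the end.
    return parts[::-1][1]


print(extract_source('http://mobile.nytimes.com/some/article'))  # nytimes
print(extract_source('https://github.com/some/repo'))            # github
print(extract_source('http://www.bbc.co.uk/news'))               # None
```

Note that a URL with no slash after the host (for example `http://example.com`) does not match the pattern at all, so such rows are dropped by the query's `WHERE` clause.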

Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.

In [None]:
bq = bigquery.Client(project=PROJECT)


regex = '.*://(.[^/]+)/'


sub_query = """
SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[OFFSET(1)] AS source,
    title
FROM
    `bigquery-public-data.hacker_news.stories`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')
    AND LENGTH(title) > 10
""".format(regex)


query = """
SELECT source,
       LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title
FROM
  ({sub_query})
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
""".format(sub_query=sub_query)


df = bq.query(query + " LIMIT 5").to_dataframe()
df.head()
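
The outer query also normalizes each title: any character other than a letter, digit, space, `$`, `.`, or `-` is replaced by a space, and the result is lowercased. Here is a small Python sketch of the same transformation (the function name `clean_title` and the sample title are hypothetical):

```python
import re

# Same character class as in the REGEXP_REPLACE call of the query.
DISALLOWED = re.compile(r'[^a-zA-Z0-9 $.-]')


def clean_title(title):
    """Replace disallowed characters with a space, then lowercase."""
    return DISALLOWED.sub(' ', title).lower()


print(repr(clean_title("Show HN: Foo's $5 app!")))
# 'show hn  foo s $5 app '
```

Replaced characters become spaces rather than being removed, so punctuation can leave runs of consecutive spaces in the cleaned title.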

For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).  

A simple, repeatable way to do this is to use the hash of a well-distributed column in our data (See https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).
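
The idea behind hash-based splitting can be sketched in plain Python. BigQuery's `FARM_FINGERPRINT` is not part of the Python standard library, so this sketch substitutes SHA-256 as the hash (an assumption made purely for illustration; the bucket assignments will differ from BigQuery's). The function names and the synthetic titles are hypothetical.

```python
import hashlib


def bucket(title, n_buckets=4):
    """Map a title to a stable bucket in [0, n_buckets) using its hash."""
    digest = hashlib.sha256(title.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % n_buckets


def split_titles(titles):
    """Buckets 1-3 go to training (about 75%), bucket 0 to evaluation."""
    train = [t for t in titles if bucket(t) > 0]
    evaluation = [t for t in titles if bucket(t) == 0]
    return train, evaluation


# Synthetic stand-ins for real headlines.
titles = ['synthetic headline number {}'.format(i) for i in range(1000)]
train, evaluation = split_titles(titles)
print(len(train), len(evaluation))
```

Because the assignment depends only on the title, rerunning the split gives the same partition, and identical titles always land on the same side.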

In [None]:
traindf = bq.query(
    query + " AND MOD(ABS(FARM_FINGERPRINT(title)), 4) > 0"
).to_dataframe()

In [None]:
evaldf = bq.query(
    query + "  AND MOD(ABS(FARM_FINGERPRINT(title)), 4) = 0"
).to_dataframe()

Below we can see that roughly 75% of the data is used for training, and 25% for evaluation. 

We can also see that within each dataset, the classes are roughly balanced.

In [None]:
traindf['source'].value_counts()

In [None]:
evaldf['source'].value_counts()

Finally we will save our data, which is currently in-memory, to disk.

In [None]:
DATADIR = '../data/txtcls'

shutil.rmtree(DATADIR, ignore_errors=True)
os.makedirs(DATADIR)

TRAIN_PATH = os.path.join(DATADIR, 'train.tsv')
EVAL_PATH = os.path.join(DATADIR, 'eval.tsv')

traindf.to_csv(
    TRAIN_PATH, header=False, index=False, encoding='utf-8', sep='\t')

evaldf.to_csv(
    EVAL_PATH, header=False, index=False, encoding='utf-8', sep='\t')

In [None]:
!head -3 $TRAIN_PATH

In [None]:
!wc -l $TRAIN_PATH

In [None]:
!wc -l $EVAL_PATH

# Sentence Level Model with DNN

Now that we have our dataset, we need to represent our text data numerically. [TensorFlow Hub](https://www.tensorflow.org/hub) makes this super easy. It contains a library of pre-trained text embeddings that we can download and use with a few lines of code.

In particular we will use [this](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1) embedding, which encodes sentences into 128-dimensional vectors.

Once we have the embedded representation we can simply feed it through a DNN for classification.

In [None]:
CLASSES = {
    'github': 0,
    'nytimes': 1,
    'techcrunch': 2,
}

N_CLASSES = len(CLASSES)

MAX_SEQUENCE_LENGTH = 50

HUB = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1"

LEARNING_RATE = 0.001
BATCH_SIZE = 128

TRAIN = tf.estimator.ModeKeys.TRAIN
EVAL = tf.estimator.ModeKeys.EVAL

In [None]:
def load_hacker_news_data(train_data_path, eval_data_path):
    column_names = ('label', 'text')
    
    df_train = pd.read_csv(train_data_path, names=column_names, sep='\t')
    df_eval = pd.read_csv(eval_data_path, names=column_names, sep='\t')
    
    X_train = list(df_train['text'])
    Y_train = np.array(df_train['label'].map(CLASSES))
    X_test = list(df_eval['text'])
    Y_test = np.array(df_eval['label'].map(CLASSES))
    
    return (X_train, Y_train), (X_test, Y_test)

In [None]:
(X_train, Y_train), (X_test, Y_test) = load_hacker_news_data(
    TRAIN_PATH, EVAL_PATH)

In [None]:
EXAMPLE = 1

print("X_train:", X_train[EXAMPLE])
print("Y_train:", Y_train[EXAMPLE])

assert Y_train[EXAMPLE] in CLASSES.values()

In [None]:
def create_dataset(texts, labels, batch_size, mode):
    # Precision and recall metrics require one hot labels
    labels = tf.one_hot(labels, N_CLASSES)
    dataset = tf.data.Dataset.from_tensor_slices((texts, labels))

    if mode == tf.estimator.ModeKeys.EVAL:
        return dataset.batch(batch_size)
    else:
        return dataset.shuffle(50000).batch(batch_size)
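
As a reminder of what `tf.one_hot` does to the integer labels, here is a dependency-free sketch. The helper `one_hot` is hypothetical and only mimics the behavior for valid class indices; it is not TensorFlow's implementation.

```python
def one_hot(labels, n_classes):
    """Turn integer class ids into lists with a single 1.0 at the class index."""
    return [
        [1.0 if index == label else 0.0 for index in range(n_classes)]
        for label in labels
    ]


# With the CLASSES mapping above: github=0, nytimes=1, techcrunch=2.
print(one_hot([0, 2, 1], n_classes=3))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

With labels in this form, `categorical_crossentropy` compares the softmax output against a full target vector per example.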

In [None]:
train_dataset = create_dataset(X_train, Y_train, BATCH_SIZE, mode=TRAIN)

eval_dataset = create_dataset(X_test, Y_test, BATCH_SIZE, mode=EVAL)

In [None]:
for x, y in train_dataset.take(1):
    assert x.shape == (BATCH_SIZE,)
    assert y.shape == (BATCH_SIZE, len(CLASSES.values()))

In [None]:
def build_dnn_model(learning_rate):
    
    # `Sequential` is imported directly from tensorflow.keras.models above.
    model = Sequential([
        hub.KerasLayer(HUB, output_shape=[128], input_shape=[], dtype=tf.string),
        Dense(500, activation='relu'),
        Dense(100, activation='relu'),
        Dense(len(CLASSES), activation='softmax'),
    ])

    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    
    model.compile(
        optimizer=optimizer, 
        loss='categorical_crossentropy', 
        metrics=[
            'accuracy',
            tf.keras.metrics.Precision(),
            tf.keras.metrics.Recall()
        ]
    )

    return model

In [None]:
model = build_dnn_model(LEARNING_RATE)

In [None]:
model.summary()

In [None]:
%%time

tf.random.set_seed(SEED)

MODEL_DIR = "./models/txtclf/dnn"
EPOCHS = 5

history = model.fit(
    train_dataset,
    epochs=EPOCHS,
    validation_data=eval_dataset,
    callbacks=[TensorBoard(MODEL_DIR, embeddings_freq=1)]
)

In [None]:
pd.DataFrame(history.history).plot()

### Results

We get 80% validation accuracy. Not bad.

# Word Level Model with CNN

While the above method shines in simplicity, it uses a sentence level embedding which ignores the ordering of words. Might we get better performance if we embedded each word individually then fed them into a sequential model? We test that hypothesis now.
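
To see why a pooled sentence embedding loses word order, consider a toy example. The 2-dimensional word vectors below are made up, and plain averaging is assumed as the pooling step purely for illustration; the actual TF-Hub module has its own vocabulary, weights, and combiner.

```python
# Hypothetical toy word vectors; the real embeddings have 128 dimensions.
TOY_VECTORS = {
    'dog': [1.0, 0.0],
    'bites': [0.0, 1.0],
    'man': [2.0, 2.0],
}


def embed_words(sentence):
    """Word level: one vector per word, in order."""
    return [TOY_VECTORS[word] for word in sentence.split()]


def embed_sentence(sentence):
    """Sentence level: average the word vectors into a single vector."""
    vectors = embed_words(sentence)
    return [sum(column) / len(vectors) for column in zip(*vectors)]


a, b = 'dog bites man', 'man bites dog'
print(embed_sentence(a) == embed_sentence(b))  # True: order is lost
print(embed_words(a) == embed_words(b))        # False: order is kept
```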

The `hub.KerasLayer()` wrapper doesn't expose word level embeddings directly; it averages the component word embeddings into a single sentence embedding. To get what we want, we must do the embedding upfront in `create_dataset()`. In particular we:
1. Split each sentence into a list of its component words
2. Pad (or truncate) each list to a constant length
3. Embed each word into a 128-dimensional vector representation

Note the changes to `create_dataset()` below.

Because `create_dataset()` now returns a sequence of word embeddings, we can process the data using a sequential model. Specifically we'll use a 1D CNN, defined in `build_cnn_model()` below.
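
The first two steps can be sketched without TensorFlow. The helper name `split_and_pad` is hypothetical; the `<PAD>` token and the length of 50 follow the word level `create_dataset()` code in this section. Step 3, the embedding lookup, needs the TF-Hub module and is left out here.

```python
MAX_SEQUENCE_LENGTH = 50


def split_and_pad(sentence, max_len=MAX_SEQUENCE_LENGTH, pad_token='<PAD>'):
    """Split on whitespace, then pad or truncate to exactly max_len tokens."""
    tokens = sentence.split()
    # Appending max_len pad tokens guarantees enough items before slicing.
    return (tokens + [pad_token] * max_len)[:max_len]


padded = split_and_pad('show hn  a tiny keras example')
print(len(padded), padded[:7])
# 50 ['show', 'hn', 'a', 'tiny', 'keras', 'example', '<PAD>']
```

Fixing the length this way is what lets every example share the shape `(MAX_SEQUENCE_LENGTH, 128)` once the words are embedded.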

In [None]:
def create_dataset(texts, labels, batch_size, mode):
    labels = tf.one_hot(labels, len(CLASSES))
    texts = [sentence.split() for sentence in texts]
    texts = [
        (sentence + MAX_SEQUENCE_LENGTH * ['<PAD>'])[:MAX_SEQUENCE_LENGTH]
        for sentence in texts]
    embed = hub.load(HUB)
    texts = [embed(sentence) for sentence in texts]

    dataset = tf.data.Dataset.from_tensor_slices((texts, labels))

    if mode == tf.estimator.ModeKeys.EVAL:
        return dataset.batch(batch_size)
    else:
        return dataset.shuffle(50000).batch(batch_size)

**The subsequent cell takes about 3 hours on CPU, about 6 minutes on a P100 GPU, and about 4 minutes on a V100 GPU.**

This takes so long because we are now doing a lot of pre-processing in `create_dataset()`: every word of every title is embedded before training starts.


In [None]:
%%time
train_dataset = create_dataset(X_train, Y_train, BATCH_SIZE, mode=TRAIN)

eval_dataset = create_dataset(X_test, Y_test, BATCH_SIZE, mode=EVAL)

In [None]:
# Testing cell
EMBEDDING_DIM = 128
for x, y in train_dataset.take(1):
    assert x.shape == (BATCH_SIZE, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
    assert y.shape == (BATCH_SIZE, N_CLASSES)

In [None]:
def build_cnn_model(learning_rate,
                    filters=64,
                    dropout_rate=0.2,
                    kernel_size=3,
                    pool_size=3):

    model = Sequential([
        Dropout(
            input_shape=(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM),
            rate=dropout_rate
        ),
        Conv1D(
            filters=filters,
            kernel_size=kernel_size,
            activation='relu',
            bias_initializer='random_uniform',
            padding='same',
        ),
        MaxPooling1D(pool_size=pool_size),
        Conv1D(
            filters=filters * 2,
            kernel_size=kernel_size,
            activation='relu',
            bias_initializer='random_uniform',
            padding='same',
        ),
        GlobalAveragePooling1D(),
        Dropout(rate=dropout_rate),
        Dense(N_CLASSES, activation='softmax'),
    ])

    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=[
            'accuracy',
            tf.keras.metrics.Precision(),
            tf.keras.metrics.Recall()
        ]
    )
    return model

In [None]:
model = build_cnn_model(LEARNING_RATE)

In [None]:
model.summary()
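
The shapes and parameter counts that `model.summary()` reports for the CNN can be checked by hand. The sketch below assumes the Keras defaults that `build_cnn_model()` relies on: `padding='same'` convolutions with stride 1 preserve the sequence length, and `MaxPooling1D` uses a stride equal to its pool size with no padding. The helper names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def maxpool_length(length, pool_size):
    """Output length for unpadded pooling with stride == pool_size."""
    return (length - pool_size) // pool_size + 1


seq_len, embedding_dim, n_classes = 50, 128, 3
filters, kernel_size, pool_size = 64, 3, 3

conv1 = conv1d_params(kernel_size, embedding_dim, filters)  # 24640
pooled_len = maxpool_length(seq_len, pool_size)             # 16
conv2 = conv1d_params(kernel_size, filters, filters * 2)    # 24704
dense = filters * 2 * n_classes + n_classes                 # 387

print(pooled_len, conv1 + conv2 + dense)
# 16 49731
```

The dropout and pooling layers contribute no trainable parameters, so the total comes entirely from the two convolutions and the final dense layer.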

In [None]:
%%time

tf.random.set_seed(SEED)

model.fit(
    train_dataset,
    epochs=EPOCHS,
    validation_data=eval_dataset,
)

Our accuracy improved to 83%! Looks like paying attention to word order does help.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License