# ML APIs as preprocessors for scikit-learn
Unstructured data (like documents, audio, images or video) can be difficult to preprocess into a usable state for traditional ML frameworks like scikit-learn and xgboost. 

This notebook gives an example of using Google's Building Blocks APIs to extract structure from unstructured data. 

## Use case: Tokenization with the Natural Language API

As an example, many traditional NLP pipelines first preprocess text with a technique like TF-IDF (term frequency-inverse document frequency). That, in turn, requires the text to be separated into distinct words, or "tokens".

In English this is straightforward, because words are separated by spaces.

Japanese text, however, does not separate words with spaces or other break characters, so tokenizing it requires knowledge of the language. Google's Natural Language API handles this without requiring you to supply language expertise or a dictionary.
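To make the contrast concrete, here is a minimal sketch that splits one English and one Japanese headline (both taken from the dataset shown later in this notebook) on whitespace. It uses only Python's built-in `str.split`, not the Natural Language API.

```python
# Whitespace splitting: adequate for English, useless for unsegmented Japanese.
english = "From Social Networks to Market Networks"
japanese = "地下貯蔵庫を持つワイン購入サイトが無料のワインボトルを進呈中"

english_tokens = english.split()
japanese_tokens = japanese.split()

# The English headline breaks into words; the Japanese one stays in one piece.
print(len(english_tokens), english_tokens)
print(len(japanese_tokens), japanese_tokens)
```

The Japanese headline comes back as a single "word", which is the problem the rest of this notebook works around.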

### Acknowledgements
This example is adapted, with permission, from work by [Hayato Yoshikawa](https://github.com/hayatoy). See the [original slide deck](https://www.slideshare.net/HayatoYoshikawa/using-cloud-natural-language-api-as-a-preprocessor) and the [original notebook](https://github.com/sgreenberg/misc-ml/blob/master/NL_API_for_tokenization_orig.ipynb).

The idea of distinguishing New York Times headlines from TechCrunch headlines on Hacker News is used, with permission, from [Lak Lakshmanan](https://aisoftwarellc.weebly.com/)'s blog post, [How to do text classification with CNNs, TensorFlow and word embedding](https://towardsdatascience.com/how-to-do-text-classification-using-tensorflow-word-embeddings-and-cnn-edae13b3e575).


## Getting Hacker News headlines from BigQuery

Hacker News headlines are available in the `bigquery-public-data.hacker_news.stories` public dataset. 

For this demo, I extracted a small number of them and added Japanese translations. (The original query can be found in the appendix below.)

In [1]:
# Note: the GOOGLE_APPLICATION_CREDENTIALS environment variable is already set,
# per https://google-cloud-python.readthedocs.io/en/latest/bigquery/usage.html#id3

from google.cloud import bigquery
client = bigquery.Client() 

In [2]:
# Enable magics in BigQuery. See https://google-cloud-python.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.magics.html#module-google.cloud.bigquery.magics
%load_ext google.cloud.bigquery

In [3]:
%%bigquery df
SELECT * FROM `sgreenberg-project2.misc_ml.hacker_news_stories`
ORDER BY id

| | source | title | time_ts | id |
|---|---|---|---|---|
| 0 | techcrunch | From Social Networks to Market Networks | 2015-06-28 00:31:48+00:00 | 9791977 |
| 1 | techcrunch | 地下貯蔵庫を持つワイン購入サイトが無料のワインボトルを進呈中 | 2015-10-29 00:54:20+00:00 | 9791978 |
| 2 | techcrunch | Underground Cellar A Wine-Buying Site That Rew... | 2015-10-29 00:54:20+00:00 | 9791978 |
| 3 | techcrunch | AndroidユーザーはGoogle Playで公開予定のアプリの事前登録が可能に | 2015-10-01 00:54:20+00:00 | 9791979 |
| 4 | techcrunch | Android Users Can Now Pre-Register for Upcomin... | 2015-10-01 00:54:20+00:00 | 9791979 |
| 5 | techcrunch | 最新500社のスタートアップ企業一覧 | 2015-11-03 00:54:20+00:00 | 9791980 |
| 6 | techcrunch | Here s the Latest Batch of 500 Startups Companies | 2015-11-03 00:54:20+00:00 | 9791980 |
| 7 | nytimes | Google Unveils Plan for New Corporate Campus | 2015-08-12 00:02:29+00:00 | 10045195 |
| 8 | nytimes | Google、新しい企業キャンパス計画を発表 | 2015-08-12 00:02:29+00:00 | 10045195 |
| 9 | nytimes | The Risk of a Billion-Dollar Valuation in Sili... | 2015-08-29 00:54:20+00:00 | 10138569 |
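The `%%bigquery` magic only works inside a notebook. As a rough sketch of the same step in a plain Python script, the query can be run through the client library directly. The function name `load_headlines` is made up for this illustration, and the sketch assumes the `google-cloud-bigquery` package (with its pandas support) is installed and credentials are configured as described above.

```python
# Same query as the %%bigquery cell, kept as a plain string.
QUERY = """
SELECT * FROM `sgreenberg-project2.misc_ml.hacker_news_stories`
ORDER BY id
"""


def load_headlines(query=QUERY):
    """Run the query and return the result as a pandas DataFrame.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Imported lazily so that defining this function does not require the library.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
    return client.query(query).to_dataframe()
```

Calling `df = load_headlines()` should then give a DataFrame equivalent to the one produced by the magic cell.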


## Attempt to tokenize

Tokenization is poor for Japanese text. Because the text contains no spaces, each unbroken run of Japanese characters (often an entire headline) is counted as a single word.
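The cause is the vectorizer's default token pattern, which scikit-learn documents as `(?u)\b\w\w+\b`: runs of two or more word characters. Japanese characters count as word characters, so nothing inside a headline breaks the run. The sketch below approximates the default analyzer with the standard `re` module (lowercase, then match the pattern); the helper name `sklearn_like_tokens` is made up for this illustration and is not part of scikit-learn.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def sklearn_like_tokens(text):
    """Approximate the default analyzer: lowercase, then regex-match."""
    return TOKEN_PATTERN.findall(text.lower())


headline = "AndroidユーザーはGoogle Playで公開予定のアプリの事前登録が可能に"
tokens = sklearn_like_tokens(headline)

# The only split happens at the one space in the headline.
print(tokens)
```

The first token, `androidユーザーはgoogle`, is the same entry that appears in the vocabulary produced by the vectorizer.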

In [4]:
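from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# sublinear_tf=True replaces the raw term frequency tf with 1 + log(tf)
vectorizer = TfidfVectorizer(sublinear_tf=True)

# vocabulary_ maps each token to its column index in the TF-IDF matrix
vocab = vectorizer.fit(df.title).vocabulary_
vocab_df = pd.DataFrame(list(vocab.items()))
vocab_df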

| | 0 | 1 |
|---|---|---|
| 0 | founder | 28 |
| 1 | dollar | 24 |
| 2 | startups | 67 |
| 3 | mayor | 40 |
| 4 | market | 39 |
| 5 | web | 81 |
| 6 | from | 30 |
| 7 | シリコンバレーにおける10億ドルの評価のリスク | 85 |
| 8 | apps | 6 |
| 9 | androidユーザーはgoogle | 5 |


## Adding a space between words
The Natural Language API's syntax analysis splits each headline into tokens, which are then rejoined with a space between each one.

In [5]:
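# Note: the Natural Language API must be enabled on this project.
# The types and enums modules used here exist in google-cloud-language releases before 2.0.
from google.cloud import language
from google.cloud.language import types
from google.cloud.language import enums

# Use a separate name so the BigQuery client created earlier is not overwritten.
language_client = language.LanguageServiceClient()

def tokenize(title):
    """Return the title with a single space between the tokens found by the API."""
    document = types.Document(content=title, type=enums.Document.Type.PLAIN_TEXT)
    tokens = language_client.analyze_syntax(document).tokens
    return " ".join(token.text.content for token in tokens)

# One API call per headline
df['tokenized_title'] = df.title.apply(tokenize)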

df

| | source | title | time_ts | id | tokenized_title |
|---|---|---|---|---|---|
| 0 | techcrunch | From Social Networks to Market Networks | 2015-06-28 00:31:48+00:00 | 9791977 | From Social Networks to Market Networks |
| 1 | techcrunch | 地下貯蔵庫を持つワイン購入サイトが無料のワインボトルを進呈中 | 2015-10-29 00:54:20+00:00 | 9791978 | 地下 貯蔵 庫 を 持つ ワイン 購入 サイト が 無料 の ワイン ボトル を 進呈 中 |
| 2 | techcrunch | Underground Cellar A Wine-Buying Site That Rew... | 2015-10-29 00:54:20+00:00 | 9791978 | Underground Cellar A Wine - Buying Site That R... |
| 3 | techcrunch | AndroidユーザーはGoogle Playで公開予定のアプリの事前登録が可能に | 2015-10-01 00:54:20+00:00 | 9791979 | Android ユーザー は Google Play で 公開 予定 の アプリ の 事前 ... |
| 4 | techcrunch | Android Users Can Now Pre-Register for Upcomin... | 2015-10-01 00:54:20+00:00 | 9791979 | Android Users Can Now Pre-Register for Upcomin... |
| 5 | techcrunch | 最新500社のスタートアップ企業一覧 | 2015-11-03 00:54:20+00:00 | 9791980 | 最新 500 社 の スタート アップ 企業 一覧 |
| 6 | techcrunch | Here s the Latest Batch of 500 Startups Companies | 2015-11-03 00:54:20+00:00 | 9791980 | Here s the Latest Batch of 500 Startups Companies |
| 7 | nytimes | Google Unveils Plan for New Corporate Campus | 2015-08-12 00:02:29+00:00 | 10045195 | Google Unveils Plan for New Corporate Campus |
| 8 | nytimes | Google、新しい企業キャンパス計画を発表 | 2015-08-12 00:02:29+00:00 | 10045195 | Google 、 新しい 企業 キャンパス 計画 を 発表 |
| 9 | nytimes | The Risk of a Billion-Dollar Valuation in Sili... | 2015-08-29 00:54:20+00:00 | 10138569 | The Risk of a Billion - Dollar Valuation in Si... |
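The cell above relies on the `types` and `enums` modules, which are not available in 2.x releases of `google-cloud-language`. The following is a sketch of how the same step might look with the `language_v1` module from a newer release. It is untested against a live project; the names `join_tokens` and `tokenize_v2` are made up for this illustration. The joining step is factored out so it can be checked without credentials.

```python
def join_tokens(token_texts):
    """Join token strings with single spaces, skipping empty ones."""
    return " ".join(t for t in token_texts if t)


def tokenize_v2(title, client=None):
    """Tokenize one headline with the Natural Language API.

    Hypothetical helper; assumes google-cloud-language 2.x is installed
    and the API is enabled on the project.
    """
    # Imported lazily so join_tokens works without the library installed.
    from google.cloud import language_v1

    # Reuse a client when one is passed in; otherwise create one.
    client = client or language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=title,
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )
    response = client.analyze_syntax(request={"document": document})
    return join_tokens(token.text.content for token in response.tokens)


# Offline check of the joining step, using tokens copied from the output above.
print(join_tokens(["最新", "500", "社", "の", "スタート", "アップ", "企業", "一覧"]))
```

With credentials in place, `df['tokenized_title'] = df.title.apply(tokenize_v2)` would be the equivalent of the original cell. Passing a shared `client` avoids creating a new one for every headline.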


## Tokenize a second time

This time, tokenization is successful: the Japanese headlines are split into individual words.

In [6]:
# Refit on the space-separated titles
vocab = vectorizer.fit(df.tokenized_title).vocabulary_
vocab_df = pd.DataFrame(list(vocab.items()))
vocab_df

| | 0 | 1 |
|---|---|---|
| 0 | 予定 | 98 |
| 1 | founder | 28 |
| 2 | battlefield | 8 |
| 3 | dollar | 24 |
| 4 | キャンパス | 87 |
| 5 | startups | 66 |
| 6 | 評価 | 114 |
| 7 | mayor | 40 |
| 8 | market | 39 |
| 9 | web | 80 |
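One detail to keep in mind: scikit-learn's documented default token pattern, `(?u)\b\w\w+\b`, only keeps tokens of two or more characters, so single-character words such as 社 and の are dropped even after spaces are inserted. The sketch below uses the standard `re` module to compare that default with a one-or-more pattern. Whether single-character tokens help a classifier is not something this notebook tests, so treat the alternative as an option to experiment with.

```python
import re

# scikit-learn's documented default: two or more word characters.
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Alternative for illustration: one or more word characters.
KEEP_SINGLE_CHARS = re.compile(r"(?u)\b\w+\b")

# A tokenized title taken from the output earlier in this notebook.
tokenized_title = "最新 500 社 の スタート アップ 企業 一覧"

default_tokens = DEFAULT_PATTERN.findall(tokenized_title)
all_tokens = KEEP_SINGLE_CHARS.findall(tokenized_title)

print(default_tokens)  # single-character tokens are missing
print(all_tokens)      # every space-separated token is kept
```

To apply the alternative in the vectorizer itself, pass it through the `token_pattern` parameter, for example `TfidfVectorizer(sublinear_tf=True, token_pattern=r"(?u)\b\w+\b")`.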


# Appendix - Original query to extract Hacker News headlines from BigQuery

```sql
SELECT 
  source, 
  REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title, 
  time_ts, 
  id
FROM
(SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  title, 
  time_ts, 
  id
 FROM
  `bigquery-public-data.hacker_news.stories`
 WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
)
WHERE (source = 'techcrunch' OR source = 'nytimes') AND time_ts > '2015-01-01'
```