# Preprocessing Notebook
------------------------

The goal of this notebook is to discover whether the existing model (as found in `Glove_embedding.ipynb`) can be rewritten to take advantage of the capabilities offered by TFX. In particular, we'd like to understand whether the preprocessing steps can be included in the model graph and deployed as a single artifact. If any preprocessing steps have to change, we'd like to confirm that model performance isn't negatively impacted. The overall goal is to replicate the behavior of the existing model as closely as possible while simplifying productionization and operationalization. 

The notebook is structured as follows: 

 **A. Pre-preprocessing:** Exact duplication of steps found in `Glove_embedding.ipynb`. Eventually these steps, as well as the steps found in earlier notebooks, should be made part of a robust data pipeline in order to facilitate on-demand training of new models on appropriate data.
 
 **B. Preprocessing:** Ensure that we can replicate the existing preprocessing steps in TensorFlow. These steps would be included in the `preprocessing_fn` used by the Transform component in a TFX pipeline. 
 
**NOTE:** Much of the code included in this notebook (e.g. explicit calls to `tft_beam`) is handled under the hood when using TFX. The thinking behind including this code as opposed to diving straight in with an end-to-end TFX pipeline is twofold: 

 - Ensure that any changes to the code don't negatively impact the model 
 - Give an idea of what happens under the hood when using TFX, which should help to demystify the framework and make future debugging easier. 

In [1]:
import os
import pickle
import pprint
import tempfile

import apache_beam as beam
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from google.cloud import storage
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils

For reasons that are still unclear, the following package had to be installed when running on AI Platform Notebooks. 

In [2]:
#!pip install --user google-resumable-media==0.6.0

---
---
### **A. Pre-preprocessing**

The following cells simply replicate some of the first steps in the `Glove_embedding.ipynb` notebook. Ideally these steps would come at the end of a data processing pipeline, with the result being clean CSV files that could then be picked up by the TFX training pipeline. 

**NOTE:** Exactly where the "pre-preprocessing"/data pipeline ends and the training pipeline begins is currently a bit unclear and should be discussed and agreed upon by the Data Scientists and ML Engineers.

In [3]:
translated = pd.read_csv("gs://ml-sandbox-101-tagging/data/processed/translation_data/translated_data_entode.csv", index_col=0)
overlap = pd.read_csv("gs://ml-sandbox-101-tagging/data/processed/de_gb_overlap_data.csv", index_col=0)

In [4]:
overlap.rename(columns={"combi":"full_translated"}, inplace=True)
overlap = pd.merge(translated,overlap,how='left',on='TitleId')

overlap['features'] = overlap[['full_translated_x','full_translated_y','series_ep_subgenre_y']].fillna("").apply(" ".join, axis=1)
train_data_all =  overlap[['TitleId', 'series_ep_uuid_x', 'title_x', 'series_ep_title_x',
       'series_name_x', 'series_ep_synopsis_x', 'genre_x',
       'series_ep_subgenre_x', 'series_ep_tags_x','translated_text',
       'tranlsated_subgenres', 'tranlsated_tags',  'features']]

train_data_all.rename(columns={'series_ep_uuid_x':'series_ep_uuid',
                               'title_x':'title', 
                               'series_ep_title_x':'series_ep_title',
                               'series_name_x':'series_name', 
                               'series_ep_synopsis_x':'series_ep_synopsis', 
                               'genre_x':'genre',
                               'series_ep_subgenre_x':'series_ep_subgenre', 
                               'series_ep_tags_x':'series_ep_tags'},inplace=True)

train_data_all = train_data_all[train_data_all.features.apply(len)>30]

# Split on unique series_ep_uuid values rather than on rows: all rows of a given
# series go either into training or into testing, which avoids possible leakage.
train_uuids, test_uuids = train_test_split(train_data_all.series_ep_uuid.unique(), shuffle=True, test_size=0.1, random_state=42)

train_data = train_data_all.loc[train_data_all.series_ep_uuid.isin(train_uuids),:].copy()
test_data = train_data_all[train_data_all.series_ep_uuid.isin(test_uuids)].copy()

train_data['series_ep_tags'] = train_data['series_ep_tags'].apply(eval)
test_data['series_ep_tags'] = test_data['series_ep_tags'].apply(eval)


In [5]:
display(train_data[['features', 'series_ep_tags']])

| | features | series_ep_tags |
|---|---|---|
| 0 | Ein amerikanischer Militärbeamter untersucht d... | [20th century, thriller, the middle east, myst... |
| 1 | Aufschlussreiche Dokumentation über den Weltme... | [sport skills & tricks, bio, mental health, ad... |
| 2 | Ein Söldnerteam wird von der CIA beauftragt, n... | [thriller, military personnel, soldier, action... |
| 4 | Ein Amerikaner lässt sich in einem Dorf im feu... | [the far east, thriller, china, 19th century, ... |
| 5 | Ester und ihre Mutter führen ein scheinbar bes... | [criminal, usa, latin america, violent, thrill... |
| ... | ... | ... |
| 99118 | Irinas Schicksal zwingt Georgina, sich dem Sch... | [wealthy family, thriller, crime, mystery, wea... |
| 99119 | Ein Sky-Original mit Lennie James. Ne'er-d... | [gritty, thriller, urban, crime, london, famil... |
| 99120 | Ellie und Gedeon stoßen zusammen, als die Ermi... | [german language, germany, gritty, thriller, c... |
| 99121 | Kollateralschaden: McNultys Neugier lässt die ... | [usa, gritty, crime, hbo, police, tense, corru... |


Since these are really the only two columns we need for training the model, it would be nice to have these as outputs from the data pipeline and inputs to the training pipeline. 

Some work will be needed to understand how data comes into the service. Any preprocessing steps needed to get data into the above format (without the tags) could possibly be included in the `serving_input_function` we export with the model. If any of the above pre-preprocessing steps are necessary for the serving data as well, they could be moved into the preprocessing steps of the training pipeline. 

---
---

### **B. Preprocessing**

**Can we replicate existing preprocessing steps?**

The bulk of the preprocessing consisted of a few steps: 
 1. Binarizing the labels to get a multi-one-hot encoding (using scikit-learn's `MultiLabelBinarizer`)
 2. Cleaning the text by removing punctuation and stop words, followed by stemming (using a German-language model from spaCy as well as regex methods in Python)
 3. Tokenizing the text (using the Keras tokenizer) 
 
A major focus in ML Engineering is **consistent** preprocessing so as to avoid training-serving skew. Leveraging a library such as TensorFlow Transform allows us to build our preprocessing steps as TF graphs that can be deployed together with our model as a single artifact. A requesting client can then submit raw data, and the preprocessing happens inside the deployed model graph. 

In order to take advantage of this capability one must define a **preprocessing function** which consists of only regular TensorFlow operations and/or **analyzers** provided by tf.Transform. Analyzers cause tf.Transform to compute a **full-pass** operation outside of TensorFlow and subsequently use the generated constant tensor in the preprocessing graph. For instance, there are analyzers to compute the minimum and maximum of a dataset. The generated constants from these analyzers can then be built into the model graph to do things like min-max scaling of incoming data based on statistics from the training data.  
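The two-phase idea behind analyzers can be sketched in plain Python, without TensorFlow. This is only an illustration of the concept, not how tf.Transform is implemented: the helper names `analyze_min_max` and `make_scaler` are hypothetical, and in tf.Transform the corresponding building blocks would be analyzers and mappers such as `tft.min`, `tft.max` and `tft.scale_to_0_1`.

```python
from typing import Callable, Iterable, Tuple


def analyze_min_max(values: Iterable[float]) -> Tuple[float, float]:
    """Analysis phase: one full pass over the training data."""
    values = list(values)
    return min(values), max(values)


def make_scaler(lo: float, hi: float) -> Callable[[float], float]:
    """Transform phase: the statistics are frozen into the function as constants."""
    span = hi - lo

    def scale(x: float) -> float:
        # Guard against a constant column, where hi == lo.
        return 0.0 if span == 0 else (x - lo) / span

    return scale


training_values = [3.0, 7.0, 5.0, 11.0]

# Computed once, on the training data only.
lo, hi = analyze_min_max(training_values)
scale = make_scaler(lo, hi)

scaled_training = [scale(x) for x in training_values]
# A value arriving at serving time reuses the same constants.
scaled_request = scale(9.0)

print(scaled_training, scaled_request)
```

Because the serving-time call uses the constants captured during analysis, training and serving apply exactly the same transformation, which is the property that protects against training-serving skew.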

---
The main question we are trying to answer here is whether we can replicate the existing preprocessing steps in a manner which uses only tensorflow operations, without negatively impacting model performance. Unfortunately, this requires us to rethink some of the preprocessing steps listed above; however, this also provides us an opportunity to streamline and improve these steps as well. 

In particular, it is worth rethinking the use of the spaCy language model, since this would be quite difficult to replicate using TensorFlow operations. It is important to keep in mind that we are doing text preprocessing in order to embed tokens using pre-trained GloVe embeddings. We found some resources indicating that it is in fact **unnecessary and perhaps even counter-productive to perform complex preprocessing such as stopword removal and stemming:**
 1. [This paper](https://arxiv.org/pdf/1707.01780.pdf) is perhaps the most insightful 
 
     ```
      "Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Additionally, word embeddings trained on multiword-grouped corpora perform surprisingly well when applied to simple tokenized datasets."
     ```
     
     
 2. [This kaggle kernel](https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings) is also useful
 
     ```
     1. Don't use standard preprocessing steps like stemming or stopword removal when you have pre-trained embeddings Some of you might used standard preprocessing steps when doing word count based feature extraction (e.g. TFIDF) such as removing stopwords, stemming etc. The reason is simple: You lose valuable information, which would help your NN to figure things out.
     2. Get your vocabulary as close to the embeddings as possible I will focus in this notebook, how to achieve that. For an example I take the GoogleNews pretrained embeddings, there is no deeper reason for this choice."
     ```
     
     
 3. Finally [this resource](https://mlwhiz.com/blog/2019/01/17/deeplearning_nlp_preprocess/) does a good job summarizing the main point: 
 
     ```
     In principle our preprocessing should match the preprocessing that was used before training the word embedding
     ```

The main takeaway is that methods such as stopword removal and lemmatization are less relevant for modern-day deep learning NLP techniques. This is because many such algorithms are explicitly trained to consider the context of a word—context which is lost after doing things like stopword removal. 

From what we could find, it looks like GloVe doesn't do any stop word removal or lemmatization, and therefore we might be able to get away without doing either. This allows us to focus on recreating the other preprocessing steps, in particular **cleaning of punctuation and generating a vocabulary for the text**. The ultimate test, however, is how much of our vocabulary is covered by the pretrained embeddings. In the `glove_embedding.ipynb` notebook we see that we could get about **66% coverage** of the vocabulary. **We must ensure that our preprocessing steps do not cause coverage to be any lower than this.**
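To make the coverage criterion concrete, here is a minimal sketch of how it could be measured. It assumes coverage means the fraction of unique vocabulary entries that have a pretrained vector, and that the embeddings are available as a dict-like index from word to vector. The function name `vocabulary_coverage` and the toy data are hypothetical.

```python
from typing import List, Mapping, Sequence, Tuple


def vocabulary_coverage(
    vocab: Sequence[str], embedding_index: Mapping[str, Sequence[float]]
) -> Tuple[float, List[str]]:
    """Return the fraction of vocab entries with an embedding, plus the misses."""
    missing = [word for word in vocab if word not in embedding_index]
    total = len(vocab)
    coverage = (total - len(missing)) / total if total else 0.0
    return coverage, missing


# Toy data for illustration only; the vectors are made up.
toy_vocab = ["und", "die", "hubschrauberpilotin", "golfkrieg"]
toy_embeddings = {
    "und": [0.1, 0.2],
    "die": [0.3, 0.4],
    "golfkrieg": [0.5, 0.6],
}

coverage, missing = vocabulary_coverage(toy_vocab, toy_embeddings)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

Keeping the list of missing words around is useful in practice: inspecting it shows whether misses come from genuinely rare words or from artifacts of the cleaning step.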

In [6]:
train_data_records = train_data[['features', 'series_ep_tags']].to_dict('records')
test_data_records = test_data[['features', 'series_ep_tags']].to_dict('records')

train_data_records[0]

{'features': 'Ein amerikanischer Militärbeamter untersucht den Tod einer Hubschrauberpilotin im Golfkrieg von 1990, um zu entscheiden, ob ihre Handlungen eine posthume Ehrenmedaille verdienen. In der Zwischenzeit versucht er, sich mit einem tragischen Vorfall auseinanderzusetzen, den er während des Konflikts selbst verursacht und vertuscht hat. Drama mit Denzel Washington, Meg Ryan, Lou Diamond Phillips, Scott Glenn und Matt Damon. .. Krieg Theater. Courage Under Fire  ',
 'series_ep_tags': ['20th century',
  'thriller',
  'the middle east',
  'mystery',
  'gulf war',
  'war',
  '1990s',
  'ethics & morality',
  'military personnel',
  'action',
  'drama',
  'military',
  'officer-colonel']}

---
#### **B.1 Regex**

The text cleanup class includes methods to remove punctuation, special characters and numbers, and to filter out short words. Interestingly, this text cleanup was followed by a [Keras tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer), which filters out punctuation and special characters as well. Instead of filtering out these characters twice, why not "whitelist" the characters we want to keep? Specifically, we could keep only the letters we want (the alphabet plus the German characters **äÄöÖüÜß**).

In [7]:
white_list = '[^äÄöÖüÜßa-zA-Z]'

text_sample = train_data_records[25000]['features']

print(text_sample)

Während einer harten Afghanistan-Tour findet der gewissenhafte französische Armeekapitän Bonassieu (Jeremie Renier) seine Autorität im Stich gelassen, als seine Männer unerklärlich verschwinden. Er vermutet natürlich, dass es die Arbeit der Taliban ist ... aber es stellt sich heraus, dass sie auf unheimlich ähnliche Weise Truppen verloren haben. Ein unsichtbarer Feind macht sich in diesem nervenden psychologischen Kriegsthriller bemerkbar.. Weltkino Theater. Neither Heaven Nor Earth  


In [8]:
%%time
cleaned = tf.strings.regex_replace(text_sample, white_list, ' ')

print(cleaned)

tf.Tensor(b'W\xc3\xa4hrend einer harten Afghanistan Tour findet der gewissenhafte franz\xc3\xb6sische Armeekapit\xc3\xa4n Bonassieu  Jeremie Renier  seine Autorit\xc3\xa4t im Stich gelassen  als seine M\xc3\xa4nner unerkl\xc3\xa4rlich verschwinden  Er vermutet nat\xc3\xbcrlich  dass es die Arbeit der Taliban ist     aber es stellt sich heraus  dass sie auf unheimlich \xc3\xa4hnliche Weise Truppen verloren haben  Ein unsichtbarer Feind macht sich in diesem nervenden psychologischen Kriegsthriller bemerkbar   Weltkino Theater  Neither Heaven Nor Earth  ', shape=(), dtype=string)
CPU times: user 24 ms, sys: 0 ns, total: 24 ms
Wall time: 22.2 ms


This seems to do what we want; however, the next step in the original cleanup was to remove short words. 

Using word boundaries (`\b`) doesn't work with German characters:

In [9]:
%%time
tf.strings.regex_replace(cleaned, r'\b[äÄöÖüÜßa-zA-Z]{1,3}\b', ' ')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 943 µs


<tf.Tensor: shape=(), dtype=string, numpy=b' hrend einer harten Afghanistan Tour findet   gewissenhafte franz sische Armeekapit  Bonassieu  Jeremie Renier  seine Autorit    Stich gelassen    seine  nner unerkl rlich verschwinden    vermutet   rlich  dass     Arbeit   Taliban       aber   stellt sich heraus  dass     unheimlich \xc3\xa4hnliche Weise Truppen verloren haben    unsichtbarer Feind macht sich   diesem nervenden psychologischen Kriegsthriller bemerkbar   Weltkino Theater  Neither Heaven   Earth  '>

**Notice the first word (should be während)**

This solution from John Pawley seems to work:

In [10]:
%%time
tf.strings.regex_replace(cleaned, r'((^|\s)[äÄöÖüÜßa-zA-Z]{1,3})+\s', ' ')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 982 µs


<tf.Tensor: shape=(), dtype=string, numpy=b'W\xc3\xa4hrend einer harten Afghanistan Tour findet gewissenhafte franz\xc3\xb6sische Armeekapit\xc3\xa4n Bonassieu  Jeremie Renier  seine Autorit\xc3\xa4t Stich gelassen  seine M\xc3\xa4nner unerkl\xc3\xa4rlich verschwinden  vermutet nat\xc3\xbcrlich  dass Arbeit Taliban     aber stellt sich heraus  dass unheimlich \xc3\xa4hnliche Weise Truppen verloren haben  unsichtbarer Feind macht sich diesem nervenden psychologischen Kriegsthriller bemerkbar   Weltkino Theater  Neither Heaven Earth  '>

Considering we are no longer removing stop words, it perhaps makes sense to limit the filtering to one- or two-letter words, as opposed to filtering out all words of length 3 or less. 
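The behavior seen in this section can be reproduced with Python's standard `re` module, which helps explain it. The sketch assumes that the word boundary `\b` in the RE2 syntax used by `tf.strings.regex_replace` only treats ASCII letters, digits and underscore as word characters; Python's `re.ASCII` flag emulates that. The constant names are hypothetical.

```python
import re

SHORT_WORD_WITH_BOUNDARIES = r'\b[äÄöÖüÜßa-zA-Z]{1,3}\b'
SHORT_WORD_WITH_WHITESPACE = r'((^|\s)[äÄöÖüÜßa-zA-Z]{1,3})+\s'

sample = 'Während einer Tour'

# ASCII-only word boundaries: 'ä' counts as a non-word character, so a
# boundary appears inside 'Während' and the leading 'Wä' looks like a short word.
ascii_boundaries = re.sub(SHORT_WORD_WITH_BOUNDARIES, ' ', sample, flags=re.ASCII)
print(repr(ascii_boundaries))      # ' hrend einer Tour'

# Unicode-aware word boundaries (Python's default for str patterns) leave it intact.
unicode_boundaries = re.sub(SHORT_WORD_WITH_BOUNDARIES, ' ', sample)
print(repr(unicode_boundaries))    # 'Während einer Tour'

# Delimiting short words by whitespace avoids \b altogether.
whitespace_delimited = re.sub(
    SHORT_WORD_WITH_WHITESPACE, ' ', 'Er vermutet dass es die Arbeit ist'
)
print(repr(whitespace_delimited))  # ' vermutet dass Arbeit ist'
```

Note that the whitespace-based pattern requires trailing whitespace, so in this sketch the short word at the very end of the string ("ist") is kept.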

---
#### **B.2 `preprocessing_fn`**

Now we can write a simple preprocessing function called `preprocessing_fn`. This function will be found and used by the `Transform` component in TFX to construct the preprocessing pipeline. In the `preprocessing_fn` we can make heavy use of TensorFlow Transform for feature engineering on the dataset, especially for any engineering that requires a full pass over the dataset (e.g. vocabulary generation). 

**How is this different from transforming features inside modelling code?** Using feature columns in model code, one can also do some simple feature engineering. The key difference is that those transformations can be defined without looking at the data. TFT, on the other hand, is useful when one must perform a full pass over the data. In addition, by transforming the data ahead of time we can potentially accelerate the training process. 

See [this page](https://www.tensorflow.org/tfx/guide/transform#transform_and_tensorflow_transform) for more information. 

---

The `preprocessing_fn` describes a series of operations on TensorFlow tensors. The input to the function is determined by a Schema proto containing a list of features. 

In this particular example we will have two inputs, the synopsis (the text that constitutes the input feature to the model) and the tags (the label(s) to be predicted). Our preprocessing function will need to clean and tokenize the text. We will then compute and apply vocabularies to both the feature and the tags in order to convert strings to integers. 
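Conceptually, computing and applying a vocabulary boils down to the following plain-Python sketch. It is an illustration only, not tf.Transform's implementation: the function names are hypothetical, a single out-of-vocabulary (OOV) id equal to the vocabulary size is assumed (in the spirit of `num_oov_buckets=1`), and the tie-breaking order is an assumption chosen to resemble the ordering visible in the vocabulary files shown later in this notebook.

```python
from collections import Counter
from typing import Dict, List, Sequence


def compute_vocabulary(token_lists: Sequence[Sequence[str]]) -> List[str]:
    """Full pass over the data: order tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Ties are broken in reverse lexicographic order (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return [tok for tok, _ in ordered]


def apply_vocabulary(
    tokens: Sequence[str], vocab_index: Dict[str, int], oov_id: int
) -> List[int]:
    """Map each token to its vocabulary index, or to the OOV id if unseen."""
    return [vocab_index.get(tok, oov_id) for tok in tokens]


train_tokens = [["ein", "krieg", "und", "ein", "tod"], ["und", "ein", "drama"]]

vocab = compute_vocabulary(train_tokens)
vocab_index = {tok: i for i, tok in enumerate(vocab)}
oov_id = len(vocab)  # single OOV bucket placed right after the vocabulary

print(vocab)                                                          # ['ein', 'und', 'tod', 'krieg', 'drama']
print(apply_vocabulary(["ein", "film", "und"], vocab_index, oov_id))  # [0, 5, 1]
```

The vocabulary is computed on the training data only and then reused unchanged for test and serving data, so a word never seen in training falls into the OOV bucket.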

---
**NOTE:** In this particular example we make some simplifications. Namely, we hard-code a predefined `MAX_STRING_LENGTH` and a number of unique tags. This allows us to do the appropriate padding and create a multi-one-hot encoded vector for the tags. In the future, we may want to have these values as outputs of the data pre-processing pipeline or find a better way to do this. See [this stackoverflow question](https://stackoverflow.com/questions/59793174/how-to-use-tf-transform-analyzer-variables-in-preprocessing-fn).  

**NOTE:** Again, extra code is included to show what is happening under the hood when using the Transform component. See [this example](https://github.com/tensorflow/transform/blob/master/examples/sentiment_example.py) of using TensorFlow Transform in a standalone manner vs. [the `preprocessing_fn` in this example](https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/taxi_utils_native_keras.py) to understand the difference. 

**NOTE:** We had issues serving models with the transform layer when using `tft.compute_and_apply_vocabulary` on our label data. After removing the label field from the input schema spec for serving, the graph still required the label field. It seems that something in `compute_and_apply_vocabulary` requires the field to be present at all times. Therefore we only compute the vocab in preprocessing and apply it when generating training/validation samples, which ensures the label field can be ignored within the serving graph.

**NOTE:** We would like to be able to use analyzer outputs in the preprocessing function, similar to [this question on stackoverflow](https://stackoverflow.com/questions/59793174/how-to-use-tf-transform-analyzer-variables-in-preprocessing-fn). Unfortunately this does not seem to be possible, which means that we can't do things like the following and that we may have to hard-code `MAX_STRING_LENGTH` to a reasonable value (currently the maximum string length in the training set; in the future this could simply be set to something like 200): 
```python
num_tokens = tft.word_count(text_tokens)
max_len = tft.max(num_tokens)
# Not possible: max_len is an analyzer output, not a static value known up front
text_tokens = text_tokens.to_tensor(shape=[None, max_len])
```
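Since the length has to be known up front, one option is to derive it from the training set before building the pipeline. The sketch below is a plain-Python approximation under stated assumptions: it only applies the character whitelist (not the single-letter filter), so it gives an upper bound on the token count, and the helper names `token_count`, `max_token_count` and `pad_or_truncate` are hypothetical. `pad_or_truncate` mimics what padding to a fixed shape with empty strings does to a single row.

```python
import re
from typing import Iterable, List, Sequence

# Same whitelist idea as in the notebook: anything that is not a letter becomes a space.
NON_LETTERS = re.compile(r'[^äÄöÖüÜßa-zA-Z]')


def token_count(text: str) -> int:
    """Number of whitespace-separated tokens after whitelisting letters."""
    return len(NON_LETTERS.sub(' ', text.lower()).split())


def max_token_count(texts: Iterable[str]) -> int:
    """Longest synopsis in tokens; a candidate value for MAX_STRING_LENGTH."""
    return max((token_count(t) for t in texts), default=0)


def pad_or_truncate(tokens: Sequence[str], length: int, pad: str = '') -> List[str]:
    """Force a token list to exactly `length` entries."""
    return (list(tokens) + [pad] * length)[:length]


texts = ["Ein Film über den Krieg.", "Drama mit Meg Ryan, 1990"]
longest = max_token_count(texts)

print(longest)
print(pad_or_truncate(["ein", "film"], 4))
print(pad_or_truncate(["a", "b", "c", "d", "e"], 3))
```

A fixed value chosen this way would only need to be recomputed when the training data changes substantially; longer texts seen later would simply be truncated.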

In [11]:
SYNOPSIS = 'features'
TAGS = 'series_ep_tags'
MAX_STRING_LENGTH = 277

white_list = '[^äÄöÖüÜßa-zA-Z]'

def clean(text):
    """
    Clean up text input by removing everything but the 
    white-listed characters
    """
    # Encoding needed to keep german characters
    lower = tf.strings.lower(text, 'utf-8')
    cleaned = tf.strings.regex_replace(lower, white_list, ' ')
    cleaned = tf.strings.strip(cleaned)
    # Filter single letters
    clean_1 = tf.strings.regex_replace(cleaned, r'((^|\s)[äÄöÖüÜßa-zA-Z]{1})+\s', ' ')
    # Replace multiple spaces with single space
    final = tf.strings.regex_replace(clean_1, ' +', ' ')

    return final

def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    
    text = inputs[SYNOPSIS]
    tags = inputs[TAGS]

    cleaned = clean(text)
    text_tokens = tf.compat.v1.string_split(cleaned, ' ', result_type='RaggedTensor')

    text_tokens = text_tokens.to_tensor(shape=[None, MAX_STRING_LENGTH])
    text_indices = tft.compute_and_apply_vocabulary(
        text_tokens, vocab_filename='vocab', num_oov_buckets=1
    )
    
    # compute vocab of tags, do not apply due to serving issues
    _ = tft.vocabulary(tags, vocab_filename='tags')
    
    # Need to transform the name
    return {
        SYNOPSIS: text_indices,
        TAGS: tags
    }

**NOTE:** The next few cells take quite a long time (~10 min)! They can be skipped if the data is already in GCS. 

In [13]:
!mkdir transform_dir

In [14]:
%%time
# This schema would be produced automatically when running the 
# TFX pipeline
raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(
        {
            SYNOPSIS: tf.io.FixedLenFeature([], tf.string),
            TAGS: tf.io.VarLenFeature(tf.string),
        }
    )
)


with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    # Analyze and transform training data
    transformed_dataset, transform_fn = (train_data_records, raw_data_metadata,) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
    transformed_data, transformed_metadata = transformed_dataset 

    # Transform test data using transform_fn 
    raw_test_dataset = (test_data_records, raw_data_metadata)
    transformed_test_dataset = (
      (raw_test_dataset, transform_fn) | tft_beam.TransformDataset())
    transformed_test_data, _ = transformed_test_dataset









CPU times: user 2min 18s, sys: 4.26 s, total: 2min 23s
Wall time: 2min 23s


In [15]:
# needs beam pipeline to save the meta data
with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        _ = (raw_data_metadata | 'WriteMetadata' >> tft_beam.WriteMetadata('./transform_dir/metadata', pipeline=pipeline))


In [16]:
%%time
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_data_coder = tft.coders.ExampleProtoCoder(transformed_metadata.schema)
    # Write out transformed training data
    _ = (
        transformed_data 
        | 'EncodeTrainData' >> beam.Map(transformed_data_coder.encode)
        | 'WriteTrainData' >> beam.io.WriteToTFRecord(
            os.path.join('./transform_dir', 'train_transformed'))
    )

    # Write out transformed test data
    _ = (
        (transformed_test_data 
         | 'EncodeTestData' >> beam.Map(transformed_data_coder.encode)
         | 'WriteTestData' >> beam.io.WriteToTFRecord(
             os.path.join('./transform_dir', 'test_transformed')))
    )

    # Write out the transform fn 
    _ = (
        transform_fn
        | 'WriteTransformFn' >> tft_beam.WriteTransformFn('./transform_dir')
    )



CPU times: user 7min 26s, sys: 4.44 s, total: 7min 30s
Wall time: 7min 31s


Writing out the examples took a surprisingly long time (about 7.5 minutes). It is not yet clear why.

In [17]:
!gsutil -m cp -r transform_dir gs://ml-sandbox-tagging-tfx-experiments/preprocessing_notebook

Copying file://transform_dir/test_transformed-00000-of-00001 [Content-Type=application/octet-stream]...
Copying file://transform_dir/train_transformed-00000-of-00001 [Content-Type=application/octet-stream]...
Copying file://transform_dir/transform_fn/assets/tags [Content-Type=application/octet-stream]...
Copying file://transform_dir/transform_fn/saved_model.pb [Content-Type=application/octet-stream]...
Copying file://transform_dir/transform_fn/assets/vocab [Content-Type=application/octet-stream]...
Copying file://transform_dir/metadata/schema.pbtxt [Content-Type=application/octet-stream]...
Copying file://transform_dir/transformed_metadata/schema.pbtxt [Content-Type=application/octet-stream]...
\ [7/7 files][ 84.3 MiB/ 84.3 MiB] 100% Done                                    
Operation completed over 7 objects/84.3 MiB.                                     


---
#### **B.3 Preprocessing Analysis**

Let's check out the generated vocabulary and the transformed datasets to see if they make sense!

In [21]:
# How we can access the vocab and tag files 
tf_transform_output = tft.TFTransformOutput('gs://ml-sandbox-tagging-tfx-experiments/preprocessing_notebook/')

VOCAB_FILE = tf_transform_output.vocabulary_file_by_name('vocab')
TAG_FILE = tf_transform_output.vocabulary_file_by_name('tags')

vocab_df = pd.read_csv(VOCAB_FILE, header=None)
tags_df = pd.read_csv(TAG_FILE, header=None)

In [22]:
display(vocab_df)
display(tags_df)

| | 0 |
|---|---|
| 0 | und |
| 1 | die |
| 2 | der |
| 3 | in |
| 4 | zu |
| ... | ... |
| 128968 | aalt |
| 128969 | aakeel |
| 128970 | aafri |
| 128971 | aaargh |


| | 0 |
|---|---|
| 0 | usa |
| 1 | drama |
| 2 | factual |
| 3 | united kingdom |
| 4 | reality |
| ... | ... |
| 2591 | 2010-11 football |
| 2592 | 2009-10 football |
| 2593 | 2000-01 football |
| 2594 | 1999-00 football |


This seems to make sense, though it is slightly concerning that some of the words in the vocab are things like "aaaaaaaaaaah". Is this really in the dataset, or is it due to errors in preprocessing? 

Also very good that we got the same number of tags as before (see `glove_embedding.ipynb` notebook)! 

We can check that the transformed results make sense:

In [23]:
raw_dataset = tf.data.TFRecordDataset('gs://ml-sandbox-tagging-tfx-experiments/preprocessing_notebook/train_transformed-00000-of-00001')

In [24]:
for raw_record in raw_dataset.take(1):
    print('\n {:=^80} \n'.format(" Serialized "))
    print(raw_record)
    print('\n {:=^80} \n'.format(" Decoded "))
    print(tf.io.parse_single_example(raw_record, tf_transform_output.transformed_feature_spec()))



tf.Tensor(b"\n\xb7\x07\n\x82\x06\n\x08features\x12\xf5\x05\x1a\xf2\x05\n\xef\x05\n\xbb\x0b\xfa\xf9\x05\x90\x01\x0c\xfb\x01\x14\xaa\xc4\x06\x19\xd3\xa7\x03\x05\x10\x04\xab\x07\xce\x02\x1d\xc4\x1d\x0b\xf8\xcd\x05\xe2\x9d\x04\xe9\x06\x03\x02\xcc\x03H\x16\x0e\x06\x12\x9f+\xad\x1e\xf1%\x0c\x16'\x1a\xf6W\x95\x03\xf9\t\x00\xfe\xaf\x01#4\x06\xc2\x83\x01\x85\x08\xcc'\xb5\x03\xd9\x18\xdf2\xf9\x1c\xd2\x05\x81\x1e\x00\xae\x02\x86\x0b\x94\x04,\xd0\xd5\x02g\xf5\x06\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xcd\xef\x07\xc

We see that transform will write out the data as a TFRecord of TFExample protos. We will need to parse these in order to feed them into the model. Luckily there is a helper function for this: `tf.data.experimental.make_batched_features_dataset`. We will probably end up making use of this later, but for now let's continue exploring some of these examples. 

In [25]:
first_example = tf.io.parse_single_example(raw_record, tf_transform_output.transformed_feature_spec())

vocab_df.iloc[first_example['features'].numpy()[:20]]

Unnamed: 0,0
10,ein
1467,amerikanischer
97530,militärbeamter
144,untersucht
12,den
251,tod
20,einer
107050,hubschrauberpilotin
25,im
54227,golfkrieg


Looks like things are working as intended! We are able to decode the first example (compare to the example above).

Now let's look at the tags:

In [26]:
print(tf.sparse.to_dense(first_example['series_ep_tags']).numpy())

[b'20th century' b'thriller' b'the middle east' b'mystery' b'gulf war'
 b'war' b'1990s' b'ethics & morality' b'military personnel' b'action'
 b'drama' b'military' b'officer-colonel']


Again, this looks right! We are able to retrieve all of the desired tags. 

When we actually train the model, we obviously can't use raw strings like this. Luckily, we can create a [`StaticVocabularyTable`](https://www.tensorflow.org/api_docs/python/tf/lookup/StaticVocabularyTable) using a text file initializer based on the vocabulary we just generated in order to map the tag strings to integer IDs. We can then turn those sparse tensors into multi-hot encoded vectors and use them as labels in the model. See the `Trainer.ipynb` notebook for how this is done.
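To make the idea concrete, here is a small sketch of the string-to-ID lookup and the multi-hot encoding. It is not the code from `Trainer.ipynb`: the four-tag vocabulary file, the single out-of-vocabulary bucket, and the toy batch are all assumptions made for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Toy tag vocabulary (assumption); the real one is written by Transform.
tag_vocab = ["war", "drama", "action", "thriller"]
vocab_path = os.path.join(tempfile.mkdtemp(), "tags_vocab.txt")
with open(vocab_path, "w") as f:
    f.write("\n".join(tag_vocab))

# Map each line of the file to its line number; unseen tags go to one OOV bucket.
num_oov_buckets = 1
initializer = tf.lookup.TextFileInitializer(
    vocab_path,
    key_dtype=tf.string,
    key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
    value_dtype=tf.int64,
    value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
)
table = tf.lookup.StaticVocabularyTable(initializer, num_oov_buckets=num_oov_buckets)
num_classes = len(tag_vocab) + num_oov_buckets

# A toy batch of two examples with a variable number of tags each.
tags = tf.sparse.SparseTensor(
    indices=[[0, 0], [0, 1], [1, 0]],
    values=[b"war", b"action", b"unseen tag"],
    dense_shape=[2, 2],
)

# Look up the IDs while keeping the sparse structure, then build multi-hot labels.
tag_ids = tf.sparse.SparseTensor(
    indices=tags.indices,
    values=table.lookup(tags.values),
    dense_shape=tags.dense_shape,
)
multi_hot = tf.cast(tf.sparse.to_indicator(tag_ids, vocab_size=num_classes), tf.float32)
print(multi_hot.numpy())
```

Each row has a 1 at the position of every tag attached to that example. The last column collects tags that are not in the vocabulary file; whether the real model should keep such a bucket is a design choice left open here.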

#### **B.4 Vocabulary Coverage** 

Things seem to look good. The last major thing to check is our coverage of the vocabulary. These steps will need to be replicated when we initialize the model. 

In [27]:
import pickle

from google.cloud import storage

# Download the pickled GloVe embedding index (word -> vector) from GCS.
storage_client = storage.Client()

bucket = storage_client.bucket('ml-sandbox-101-tagging')
blob = bucket.blob('data/processed/training_data/glove_data/glove_embedding_index.pkl')
pickle_in = blob.download_as_string()
file = pickle.loads(pickle_in)

In [28]:
vocab_size = len(vocab_df) + 1
embedding_matrix = np.zeros((vocab_size, 300))

In [29]:
words_not_found = []
for i, word in enumerate(vocab_df.values):
    embedding_vector = file.get(word[0])
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        words_not_found.append(word[0])

In [30]:
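# Count the rows of the embedding matrix that have at least one nonzero entry,
# i.e. the vocabulary words for which a pretrained vector was found.
nonzero_elements = np.count_nonzero(
    np.count_nonzero(embedding_matrix, axis=1))
overlap = (nonzero_elements * 100) // vocab_size
print("{}% of vocabulary is covered by pretrained model".format(overlap))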

70% of vocabulary is covered by pretrained model


In [31]:
print(words_not_found[:50])
print(len(words_not_found))

['herausforderungen', 'zeichentrickserie', 'eröffnungsausgabe', 'persönlichkeiten', 'eröffnungsfolge', 'verschwenderischen', 'catelynn', 'krankenschwester', 'handlungsstränge', 'vanderpump', 'ausgründungsserie', 'kailyn', 'dokumentarserien', 'zurückzugewinnen', 'swiper', 'stellvertretenden', 'aufrechtzuerhalten', 'jetters', 'immobilienmakler', 'berichterstattung', 'außergewöhnliche', 'zusammenarbeitet', 'auseinandersetzen', 'dokumentarfilmserie', 'wissenschaftliche', 'alleinerziehende', 'tagesslot', 'upsy', 'unwahrscheinlicher', 'liebesinteresse', 'tombliboos', 'igglepiggle', 'geburtstagsfeier', 'jareau', 'nutbrown', 'geschlechtsumwandlungen', 'detektivdrama', 'allgemeinchirurgin', 'polizeikommissar', 'kardiothorakchirurgen', 'unterschiedlichen', 'schnallende', 'raynas', 'kinderanimation', 'jerseylicious', 'diskussionsteilnehmer', 'kardashianern', 'eliminierungsherausforderung', 'zusammenschließen', 'wonnacott']
38651


The good news is that the coverage of the vocabulary is actually a bit higher than before! This is encouraging, though it is possible that the increase in coverage is merely due to the presence of stopwords. 
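One way to check the stopword hypothesis would be to compute coverage twice, once over the full vocabulary and once with stopwords removed. The sketch below uses toy data only: the vocabulary, the embedding index, and the stopword set are invented stand-ins for `vocab_df`, `file`, and whichever German stopword list was used in the earlier notebooks.

```python
# Toy stand-ins (assumptions) for the real vocabulary and GloVe index.
vocab = ["der", "und", "krieg", "detektivdrama", "tod", "kinderanimation"]
embedding_index = {
    "der": [0.1, 0.2],
    "und": [0.3, 0.4],
    "krieg": [0.5, 0.6],
    "tod": [0.7, 0.8],
}
stopwords = {"der", "und"}  # hypothetical stopword list


def coverage(words, index):
    """Return the percentage of `words` that have a pretrained vector."""
    words = list(words)
    if not words:
        return 0.0
    found = sum(1 for word in words if word in index)
    return 100.0 * found / len(words)


overall = coverage(vocab, embedding_index)
content_only = coverage(
    [word for word in vocab if word not in stopwords], embedding_index)

print("all words:         {:.1f}%".format(overall))
print("without stopwords: {:.1f}%".format(content_only))
```

If the second number on the real data is close to the coverage reported in the earlier notebook, the apparent improvement is probably just the stopwords.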

More concerning is the nature of the vocabulary words that are not covered by the pretrained embeddings. Above we've printed the first 50 of these words. While it is to be expected that some of these words are not captured (e.g. *jerseylicious*, *kardashianern*), some of the other missing words are quite concerning: 
 + Fairly normal words such as *krankenschwester* (nurse) and *persönlichkeiten* (personalities)
 + Compound words and inflected forms, which are a major feature of the German language: *eröffnungsausgabe* (opening edition), *zurückzugewinnen* (to win back). 
 + Words that intuitively would seem quite important for predicting tags: *dokumentarserien*, *dokumentarfilmserie*, *detektivdrama*, *kinderanimation*. These types of words not being covered by the pretrained embeddings might severely limit the ability of the model to predict appropriate tags. 
 
One additional thing to note is that it seems like it might be necessary to do stemming in particular cases. Though *persönlichkeiten* is not covered, *persönlichkeit* is: 

In [32]:
file.get('persönlichkeit')

array([-3.932790e-01, -1.113070e-01,  2.045410e-01, -1.262810e-01,
        2.054620e-01,  9.071800e-02,  2.108400e-01, -6.925700e-02,
       -2.625900e-01,  8.015400e-02, -4.734000e-01, -3.358700e-01,
       -1.801580e-01, -1.412570e-01,  5.990200e-02, -1.859500e-02,
       -4.520390e-01, -6.847640e-01,  4.992600e-02, -6.165300e-02,
        4.984950e-01,  2.509250e-01, -1.757640e-01,  5.247500e-02,
       -2.351000e-02,  1.460200e-01, -7.447370e-01,  2.156845e+00,
       -3.656490e-01, -1.206100e-02, -3.096100e-01, -2.968330e-01,
        1.087440e-01, -1.663210e-01,  6.946200e-02, -8.180000e-03,
        1.054030e-01, -3.857000e-03, -2.210000e-04,  8.870900e-02,
        7.814000e-02, -6.477770e-01,  2.979450e-01, -3.727920e-01,
        4.487290e-01,  5.933300e-01,  8.662300e-02,  4.326920e-01,
       -5.541000e-02, -3.068400e-02, -1.189760e-01, -1.841990e-01,
       -1.053000e-02,  2.140540e-01,  3.582340e-01,  6.302600e-02,
        4.267090e-01,  1.622990e-01, -2.603740e-01, -1.479510e

We will need to train the model to confirm that performance isn't negatively impacted by these gaps in embedding coverage. 
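The stemming idea mentioned above could be approximated with a fallback lookup: try the exact word first, and only if it is missing, try the word with a common ending removed. The function name, the suffix list, and the toy index below are all hypothetical, and the suffix stripping is deliberately crude; a proper German stemmer or lemmatizer would likely be more robust.

```python
def lookup_with_suffix_fallback(word, index,
                                suffixes=("en", "er", "es", "n", "e", "s")):
    """Return (matched_key, vector), falling back to a suffix-stripped form.

    Returns (None, None) when neither the word nor any stripped form is found.
    """
    vector = index.get(word)
    if vector is not None:
        return word, vector
    for suffix in suffixes:
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            candidate = word[:-len(suffix)]
            vector = index.get(candidate)
            if vector is not None:
                return candidate, vector
    return None, None


# Toy index (assumption) standing in for the GloVe dictionary `file`.
toy_index = {"persönlichkeit": [0.1, 0.2, 0.3]}

matched, vector = lookup_with_suffix_fallback("persönlichkeiten", toy_index)
print(matched, vector)

missing, _ = lookup_with_suffix_fallback("jerseylicious", toy_index)
print(missing)
```

In the coverage loop, this would replace the plain `file.get(word[0])` call. Note that it does nothing for compound words such as *detektivdrama*, which would need to be split into their parts instead.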

In general, though, this observation would seem to support the notion that we should explore other methods for embedding text. For future iterations of the model we should look into the models on [TF Hub](https://tfhub.dev/s?language=de&module-type=text-embedding,text-classification,text-generation,text-language-model,text-question-answering,text-retrieval-question-answering). Using a universal sentence encoder in particular might be quite beneficial: we could avoid most of the preprocessing and also leverage parts of the synopses that are written in other languages (especially English). 

#### **Cleanup**

In [33]:
!rm -rf transform_dir