In this notebook we encode the Liar-liar dataset's documents.<br>
<br>
The Liar-liar dataset contains two columns that will be used in the **Fake news learning** notebook, the **'fullText_based_content'** and the **'label-liar'** columns. The **'fullText_based_content'** column contains our documents, we will use **Bag-of-words** encoding to encode its content.

Additionaly, we will extract **contextualized embeddings** of the documents from a **Bert** model, we will use these embeddings in the **Fake news learning** notebook as an alternative data to the **BOW** encodings.

# Data exploration

### Imports & definitions

In [5]:
import pandas as pd
import numpy as np

### Get train and test datasets

In [6]:
#train
num_splits = 7
train = pd.read_csv('https://raw.githubusercontent.com/AlonBrul/liar-liar-dataset/main/liar_train_0.csv')
for i in range(1, num_splits):
  temp = pd.read_csv(f'https://raw.githubusercontent.com/AlonBrul/liar-liar-dataset/main/liar_train_{i}.csv')
  train = train.append(temp)

#test
test = pd.read_csv('https://raw.githubusercontent.com/AlonBrul/liar-liar-dataset-train/main/liar_test.csv')

In [None]:
train.head(3)

Unnamed: 0,id,date,speaker,statement,sources,paragraph_based_content,fullText_based_content,label-liar
0,18178,2020-03-18T13:26:42-04:00,Instagram posts,"""COVID-19 started because we eat animals.""",['https://www.cdc.gov/coronavirus/2019-ncov/ca...,['Vegan Instagram users are pinning the 2019 c...,Vegan Instagram users are pinning the 2019 cor...,barely-true
1,3350,2011-03-04T09:12:59-05:00,Glenn Beck,Says Michelle Obama has 43 people on her staff...,['http://www.glennbeck.com/2011/02/25/while-wo...,['Glenn Beck rekindled a falsehood about the s...,Glenn Beck rekindled a falsehood about the siz...,pants-fire
2,14343,2017-07-21T11:52:44-04:00,Mike Pence,"Says President Donald Trump ""has signed more l...",['https://nrf.com/events/retail-advocates-summ...,['Vice President Mike Pence says that when it ...,Vice President Mike Pence says that when it co...,half-true


We want only **fullText_based_content** and **label-liar** columns.<br>
Lets -
### drop unwanted columns

In [7]:
train.drop(['id', 'date', 'speaker', 'statement', 'sources', 'paragraph_based_content'], axis=1, inplace=True)
test.drop(['id', 'date', 'speaker', 'statement', 'sources', 'paragraph_based_content'], axis=1, inplace=True)

In [None]:
train.head(3)

Unnamed: 0,fullText_based_content,label-liar
0,Vegan Instagram users are pinning the 2019 cor...,barely-true
1,Glenn Beck rekindled a falsehood about the siz...,pants-fire
2,Vice President Mike Pence says that when it co...,half-true


### Get information on the datasets

In [None]:
print(train.info())
print()
print(test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15052 entries, 0 to 2139
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   fullText_based_content  15052 non-null  object
 1   label-liar              15052 non-null  object
dtypes: object(2)
memory usage: 352.8+ KB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1266 entries, 0 to 1265
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   fullText_based_content  1266 non-null   object
 1   label-liar              1266 non-null   object
dtypes: object(2)
memory usage: 19.9+ KB
None


Insight: There are 15052 rows in train and 1266 in test, all are non-null.<br>
Lets find how the labels are divided.

In [None]:
train['label-liar'].value_counts()

false          3280
half-true      2833
mostly-true    2631
barely-true    2483
true           2050
pants-fire     1775
Name: label-liar, dtype: int64

Insight: There are more *false* labels (3280) than *true* labels (2050) but overall the labels are evenly divided.<br>
Lets find the length of the documents in **fullText_based_content**<br>
<br>
Word tokenize the documents and count number of words.

In [8]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

train_docs_lst = train['fullText_based_content'].tolist()

train_tokenized_docs = []
for doc in train_docs_lst:
  train_tokenized_docs.append(word_tokenize(doc))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [34]:
print(train_tokenized_docs[0][:8])

['vegan', 'instagram', 'users', 'are', 'pinning', 'the', 'coronavirus', 'outbreak']


In [10]:
doc_len = len(train_tokenized_docs[0])
train_min_len = doc_len
train_max_len = doc_len
train_avg_len = doc_len
for doc in train_tokenized_docs:
  doc_len = len(doc)

  if train_min_len > doc_len:
    train_min_len = doc_len

  elif train_max_len < doc_len:
    train_max_len = doc_len
  
  train_avg_len = (train_avg_len+doc_len)/2

print('train_min_len = ', train_min_len, 'words')
print('train_max_len = ', train_max_len, 'words')
print('train_avg_len = ', train_avg_len, 'words')

train_min_len =  59 words
train_max_len =  3440 words
train_avg_len =  975.0845406342837 words


Insights: The minimum document length is much shorter than the maximum document length, the average length is 975 words.<br>

The acctual number of words is smaller though, because these lengths include punctuation marks and stop-words, which we will ignore, therefore it is safe to assume that the average document length will be close to 800 words.

# Bag-of-words encoding

In this section, we encode the **'fullText_based_content'** column of the **Liar-liar** dataset (the documents) using **BOW** encoding.<br>
**BOW** encoding assigns each word a unique number (e.g. the word "apple" is assigned the number 1209)

The encoding process will be in the following manner:
- Preprocess the text.
  - Lower-case the text.
  - Remove special characters.
  - Remove stop words.
  - Lemmatize words.
- Calculate TF-IDF values for all words.
- Create a vocabulary with best TF-IDF valued words.
- Word Tokenize the documents.
- Replace word-tokens with matching numbers.

<br>
A guide for extracting best TF-IDF valued words we used:<br>
https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.YIVJ7pAzaUk

### Imports & definitions

In [None]:
import pandas as pd
import numpy as np
import io
import time
from tqdm import tqdm
import re

#nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

### Get train and test datasets

In [13]:
#train
num_splits = 7
train = pd.read_csv('https://raw.githubusercontent.com/AlonBrul/liar-liar-dataset/main/liar_train_0.csv')
for i in range(1, num_splits):
  temp = pd.read_csv(f'https://raw.githubusercontent.com/AlonBrul/liar-liar-dataset/main/liar_train_{i}.csv')
  train = train.append(temp)

#test
test = pd.read_csv('https://raw.githubusercontent.com/AlonBrul/liar-liar-dataset-train/main/liar_test.csv')

### Drop unwanted columns

In [14]:
train.drop(['id', 'date', 'speaker', 'statement', 'sources', 'paragraph_based_content'], axis=1, inplace=True)
test.drop(['id', 'date', 'speaker', 'statement', 'sources', 'paragraph_based_content'], axis=1, inplace=True)

In [4]:
train.head(3)

Unnamed: 0,fullText_based_content,label-liar
0,Vegan Instagram users are pinning the 2019 cor...,barely-true
1,Glenn Beck rekindled a falsehood about the siz...,pants-fire
2,Vice President Mike Pence says that when it co...,half-true


## Preprocess corpus text

In [15]:
def pre_process(text):
    # lowercase
    text=text.lower()
    
    # remove tags
    text=re.sub("","",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text

train_docs = train['fullText_based_content'].apply(lambda x:pre_process(x))
test_docs = test['fullText_based_content'].apply(lambda x:pre_process(x))

### Tokenize words

Tokenize documents for easier preprocessing.

In [16]:
train_tokenized_docs = []
for doc in tqdm(train_docs):
  train_tokenized_docs.append(word_tokenize(doc))

test_tokenized_docs = []
for doc in tqdm(test_docs):
  test_tokenized_docs.append(word_tokenize(doc))

100%|██████████| 15052/15052 [00:43<00:00, 343.55it/s]
100%|██████████| 1266/1266 [00:03<00:00, 324.79it/s]


### Remove stop words

In [17]:
stop_words = set(stopwords.words("english"))

train_filtered_docs = []
for tokenized_doc in tqdm(train_tokenized_docs):
  filtered_doc = []
  for w in tokenized_doc:
      if w not in stop_words:
          filtered_doc.append(w)
  train_filtered_docs.append(filtered_doc)

test_filtered_docs = []
for tokenized_doc in tqdm(test_tokenized_docs):
  filtered_doc = []
  for w in tokenized_doc:
      if w not in stop_words:
          filtered_doc.append(w)
  test_filtered_docs.append(filtered_doc)

100%|██████████| 15052/15052 [00:02<00:00, 5279.95it/s]
100%|██████████| 1266/1266 [00:00<00:00, 5336.81it/s]


### Pos tagging

Add Pos tagging (Part-of-speech tagging) for better lemmatization.

In [18]:
train_pos_docs = []
for filtered_doc in tqdm(train_filtered_docs):
  train_pos_docs.append(nltk.pos_tag(filtered_doc))

test_pos_docs = []
for filtered_doc in tqdm(test_filtered_docs):
  test_pos_docs.append(nltk.pos_tag(filtered_doc))

100%|██████████| 15052/15052 [06:57<00:00, 36.03it/s]
100%|██████████| 1266/1266 [00:36<00:00, 34.76it/s]


### Lemmatize words

In [19]:
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

lem = WordNetLemmatizer()

train_lemmatized_docs = []
for pos_doc in tqdm(train_pos_docs):
  lemmatized_doc = []
  for pos in pos_doc:
    lemmatized_doc.append(lem.lemmatize(pos[0], get_wordnet_pos(pos[1])))
  train_lemmatized_docs.append(lemmatized_doc)

test_lemmatized_docs = []
for pos_doc in tqdm(test_pos_docs):
  lemmatized_doc = []
  for pos in pos_doc:
    lemmatized_doc.append(lem.lemmatize(pos[0], get_wordnet_pos(pos[1])))
  test_lemmatized_docs.append(lemmatized_doc)

100%|██████████| 15052/15052 [00:38<00:00, 391.60it/s]
100%|██████████| 1266/1266 [00:03<00:00, 372.46it/s]


Example of the preprocess result.

In [33]:
print('train tokenized doc =',train_tokenized_docs[0][25:33])
print('train filtered doc =',train_filtered_docs[0][14:22])
print('train pos doc =',train_pos_docs[0][14:22])
print('train lemmatized doc =',train_lemmatized_docs[0][14:22])

train tokenized doc = ['of', 'meat', 'production', 'claimed', 'that', 'covid', 'started', 'because']
train filtered doc = ['impact', 'meat', 'production', 'claimed', 'covid', 'started', 'eat', 'animals']
train pos doc = [('impact', 'NN'), ('meat', 'NN'), ('production', 'NN'), ('claimed', 'VBD'), ('covid', 'NN'), ('started', 'VBD'), ('eat', 'NN'), ('animals', 'NNS')]
train lemmatized doc = ['impact', 'meat', 'production', 'claim', 'covid', 'start', 'eat', 'animal']


### Join words back to sentences

In [None]:
train_docs = []
for doc in train_lemmatized_docs:
  sentence = " ".join(doc)
  train_docs.append(sentence)

test_docs = []
for doc in test_lemmatized_docs:
  sentence = " ".join(doc)
  test_docs.append(sentence)

## Calculate TF-IDF values

### Get word count matrix (TF values)

In [None]:
# create a vocabulary of words from train_docs,
# ignore words that appear in max_df% of documents, 
# max vocabulary size = max_features
cv=CountVectorizer(max_df=0.60, tokenizer=word_tokenize, max_features=100000)
word_count_vector=cv.fit_transform(train_docs)

### Get TF-IDF values

In [None]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

### Help functions for retrieving top-n words

In [None]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    #create a tuples of feature,score
    #results = zip(feature_vals,score_vals)
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    
    return results

### Get set of top-n words corpus

In [None]:
# you only needs to do this once, this is a mapping of index to 
feature_names = cv.get_feature_names()

n = 1000000 # top n words from corpus
#generate tf-idf for the corpus
tf_idf_vector = tfidf_transformer.transform(cv.transform(train_docs))

#sort the tf-idf vectors by descending order of scores
sorted_items = sort_coo(tf_idf_vector.tocoo())

#extract only the top n
keywords = extract_topn_from_vector(feature_names,sorted_items,n)

vocab = set(keywords.keys())

In [None]:
len(vocab)

56667

Vocab size in **Learning with BOW** in **Fake news learning** notebook will be 56668.

## Encode docs

### Tokenize docs

In [None]:
train_tokenized_docs = []
for doc in tqdm(train_docs, position=0, leave=True):
  train_tokenized_docs.append(word_tokenize(doc))

test_tokenized_docs = []
for doc in tqdm(test_docs, position=0, leave=True):
  test_tokenized_docs.append(word_tokenize(doc))

100%|██████████| 15052/15052 [00:31<00:00, 471.77it/s]
100%|██████████| 1266/1266 [00:02<00:00, 469.71it/s]


### Create a word vocabulary

We use a dictionary where keys are words and values are the words indices in a list.

In [None]:
vocab_list = list(vocab)

vocab_dict = dict((word, vocab_list.index(word)+1) for word in vocab_list) # numeric value assigned = word index + 1 in vocab_list

### Replace words with numeric values

In [None]:
OUT_OF_VOCAB = 0 # special value for words out of vocabulary

train_encoded_docs = []
for doc in tqdm(train_tokenized_docs, position=0, leave=True):

  encoded_doc = []
  for word in doc:

    if word in vocab_list:
      encoded_doc.append(vocab_dict[word])
    else:
      encoded_doc.append(OUT_OF_VOCAB)
      
  train_encoded_docs.append(encoded_doc)

test_encoded_docs = []
for doc in tqdm(test_tokenized_docs, position=0, leave=True):
  
  encoded_doc = []
  for word in doc:
    if word in vocab_list:
      encoded_doc.append(vocab_dict[word]) # numeric value assigned = word index + 1 in vocab_list
    else:
      encoded_doc.append(OUT_OF_VOCAB)

  test_encoded_docs.append(encoded_doc)

100%|██████████| 15052/15052 [2:41:43<00:00,  1.55it/s]
100%|██████████| 1266/1266 [14:38<00:00,  1.44it/s]


### Save encodings

In [None]:
train_encoded_df = pd.DataFrame({'full_text_encoded':train_encoded_docs})
test_encoded_df = pd.DataFrame({'full_text_encoded':test_encoded_docs})

# split to 4 datasets
num_splits = 4
split_size = 3763 # (15052 / 4 = 3763) rows per database
for i in range(num_splits):
  temp_df = train_encoded_df[(i*split_size):((i+1)*split_size)]
  path = f'/content/train_encoded_v2_{i}.csv'
  temp_df.to_csv(path, index=False)

test_encoded_df.to_csv('test_encoded_v2.csv', index=False)

# Contextualized embeddings with Bert

In this section, we extract embedded representations of the **'fullText_based_content'** column of the Liar-liar dataset (the documents) from a pretrained transformer model - **Bert**.<br>
The contextualized embeddings are vectors that **Bert** learned for each document.<br>
<br>
We will get the embeddings in the following manner:

- Convert the documents to a *Bert-format*
  - i.e. Save documents as class **input_features** with other attributes that Bert needs.
- Download the Bert model.
- Feed Bert the converted documents.
- Extract the embeddings Bert learned.

<br>
A guide for extracting contextualized word embeddings from Bert we used:<br>
https://towardsdatascience.com/nlp-extract-contextualized-word-embeddings-from-bert-keras-tf-67ef29f60a7b

### Imports & definitions


In [None]:
import pandas as pd
import numpy as np
import io
import time
from tqdm import tqdm

#tf
!pip install tensorflow==1.13.0rc1
import tensorflow as tf

#bert & other
!rm -rf bert
!git clone https://github.com/google-research/bert
import sys
sys.path.append('bert/')
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import codecs
import collections
import json
import re
import os
import pprint
import modeling
import tokenization

TPU download & config

In [None]:
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)
from google.colab import auth
auth.authenticate_user()
with tf.compat.v1.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

In [None]:
# Available pretrained model checkpoints:
#   uncased_L-12_H-768_A-12: uncased BERT base model

#   uncased_L-24_H-1024_A-16: uncased BERT large model
#   cased_L-12_H-768_A-12: cased BERT large model
BERT_MODEL = 'uncased_L-12_H-768_A-12'
BERT_PRETRAINED_DIR = 'gs://cloud-tpu-checkpoints/bert/' + BERT_MODEL
print('***** BERT pretrained directory: {} *****'.format(BERT_PRETRAINED_DIR))
!gsutil ls $BERT_PRETRAINED_DIR

### Get train and test datasets

In [None]:
#train
num_splits = 7
train = pd.read_csv('https://raw.githubusercontent.com/AlonBrul/liar-liar-dataset/main/liar_train_0.csv')
for i in range(1, num_splits):
  temp = pd.read_csv(f'https://raw.githubusercontent.com/AlonBrul/liar-liar-dataset/main/liar_train_{i}.csv')
  train = train.append(temp)

#test
test = pd.read_csv('https://raw.githubusercontent.com/AlonBrul/liar-liar-dataset-train/main/liar_test.csv')

### Drop unwanted columns

In [None]:
train.drop(['id', 'date', 'speaker', 'statement', 'sources', 'paragraph_based_content'], axis=1, inplace=True)
test.drop(['id', 'date', 'speaker', 'statement', 'sources', 'paragraph_based_content'], axis=1, inplace=True)

In [None]:
train.head(3)

Unnamed: 0,id,date,speaker,statement,sources,paragraph_based_content,fullText_based_content,label-liar
0,18178,2020-03-18T13:26:42-04:00,Instagram posts,"""COVID-19 started because we eat animals.""",['https://www.cdc.gov/coronavirus/2019-ncov/ca...,['Vegan Instagram users are pinning the 2019 c...,Vegan Instagram users are pinning the 2019 cor...,barely-true
1,3350,2011-03-04T09:12:59-05:00,Glenn Beck,Says Michelle Obama has 43 people on her staff...,['http://www.glennbeck.com/2011/02/25/while-wo...,['Glenn Beck rekindled a falsehood about the s...,Glenn Beck rekindled a falsehood about the siz...,pants-fire
2,14343,2017-07-21T11:52:44-04:00,Mike Pence,"Says President Donald Trump ""has signed more l...",['https://nrf.com/events/retail-advocates-summ...,['Vice President Mike Pence says that when it ...,Vice President Mike Pence says that when it co...,half-true


### Definitions for Bert

In [None]:
LAYERS = [-1,-2,-3,-4]
NUM_TPU_CORES = 8
MAX_SEQ_LENGTH = 512
BERT_CONFIG = BERT_PRETRAINED_DIR + '/bert_config.json'
CHKPT_DIR = BERT_PRETRAINED_DIR + '/bert_model.ckpt'
VOCAB_FILE = BERT_PRETRAINED_DIR + '/vocab.txt'
INIT_CHECKPOINT = BERT_PRETRAINED_DIR + '/bert_model.ckpt'
BATCH_SIZE = 128
VECTOR_DIM = 5

### Functions & Classes for Bert

In [None]:
class InputExample():

  def __init__(self, unique_id, text_a):
    self.unique_id = unique_id
    self.text_a = text_a

In [None]:
class InputFeatures():
  """A single set of features of data."""

  def __init__(self, unique_id, tokens, input_ids, input_mask, input_type_ids):
    self.unique_id = unique_id
    self.tokens = tokens
    self.input_ids = input_ids
    self.input_mask = input_mask
    self.input_type_ids = input_type_ids

In [None]:
def input_fn_builder(features, seq_length):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  all_unique_ids = []
  all_input_ids = []
  all_input_mask = []
  all_input_type_ids = []

  for feature in features:
    all_unique_ids.append(feature.unique_id)
    all_input_ids.append(feature.input_ids)
    all_input_mask.append(feature.input_mask)
    all_input_type_ids.append(feature.input_type_ids)

  def input_fn(params):
    """The actual input function."""
    batch_size = params["batch_size"]

    num_examples = len(features)

    # This is for demo purposes and does NOT scale to large data sets. We do
    # not use Dataset.from_generator() because that uses tf.py_func which is
    # not TPU compatible. The right way to load data is with TFRecordReader.
    d = tf.data.Dataset.from_tensor_slices({
        "unique_ids":
            tf.constant(all_unique_ids,
                        shape=[num_examples],
                        dtype=tf.int32),
        "input_ids":
            tf.constant(all_input_ids,
                        shape=[num_examples, seq_length],
                        dtype=tf.int32),
        "input_mask":
            tf.constant(all_input_mask,
                        shape=[num_examples, seq_length],
                        dtype=tf.int32),
        "input_type_ids":
            tf.constant(all_input_type_ids,
                        shape=[num_examples, seq_length],
                        dtype=tf.int32),
    })

    d = d.batch(batch_size=batch_size, drop_remainder=False)
    return d

  return input_fn
  
def model_fn_builder(bert_config, init_checkpoint, layer_indexes, use_tpu,
                     use_one_hot_embeddings):
  """Returns `model_fn` closure for TPUEstimator."""

  def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
    """The `model_fn` for TPUEstimator."""

    unique_ids = features["unique_ids"] 
    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    input_type_ids = features["input_type_ids"]

    model = modeling.BertModel(
        config=bert_config,
        is_training=False,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=input_type_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)

    if mode != tf.estimator.ModeKeys.PREDICT:
      raise ValueError("Only PREDICT modes are supported: %s" % (mode))

    tvars = tf.trainable_variables()
    scaffold_fn = None
    (assignment_map,
     initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(
         tvars, init_checkpoint)
    if use_tpu:
      def tpu_scaffold():
        tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
        return tf.train.Scaffold()

      scaffold_fn = tpu_scaffold
    else:
      tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

    all_layers = model.get_all_encoder_layers()

    predictions = {
        "unique_id": unique_ids,
    }

    for (i, layer_index) in enumerate(layer_indexes):
      predictions["layer_output_%d" % i] = all_layers[layer_index]

    output_spec = tf.contrib.tpu.TPUEstimatorSpec(
        mode=mode, predictions=predictions, scaffold_fn=scaffold_fn)
    return output_spec

  return model_fn

In [None]:
def convert_examples_to_features(examples, seq_length, tokenizer):
  """Loads a data file into a list of `InputBatch`s."""

  features = []
  for ex_index, example in enumerate(examples):
    tokens_a = tokenizer.tokenize(example.text_a)

    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > seq_length - 2:
      tokens_a = tokens_a[:(seq_length - 2)]

    tokens = []
    input_type_ids = []
    tokens.append("[CLS]")
    input_type_ids.append(0)
    for token in tokens_a:
      tokens.append(token)
      input_type_ids.append(0)
    tokens.append("[SEP]")
    input_type_ids.append(0)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < seq_length:
      input_ids.append(0)
      input_mask.append(0)
      input_type_ids.append(0)

    assert len(input_ids) == seq_length
    assert len(input_mask) == seq_length
    assert len(input_type_ids) == seq_length

    features.append(
        InputFeatures(
            unique_id=example.unique_id,
            tokens=tokens,
            input_ids=input_ids,
            input_mask=input_mask,
            input_type_ids=input_type_ids))
  return features

In [None]:
def read_sequence(input_sentences):
  examples = []
  unique_id = 0
  for sentence in input_sentences:
    line = tokenization.convert_to_unicode(sentence)
    examples.append(InputExample(unique_id=unique_id, text_a=line))
    unique_id += 1
  return examples

In [None]:
def get_features(input_list, dim=768):
  layer_indexes = LAYERS

  bert_config = modeling.BertConfig.from_json_file(BERT_CONFIG)

  tokenizer = tokenization.FullTokenizer(
      vocab_file=VOCAB_FILE, do_lower_case=True)

  is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
  tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
  run_config = tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      tpu_config=tf.contrib.tpu.TPUConfig(
          num_shards=NUM_TPU_CORES,
          per_host_input_for_training=is_per_host))

  model_fn = model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=INIT_CHECKPOINT,
      layer_indexes=layer_indexes,
      use_tpu=False,
      use_one_hot_embeddings=True)

  # If TPU is not available, this will fall back to normal Estimator on CPU
  # or GPU.
  estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=False,
      model_fn=model_fn,
      config=run_config,
      predict_batch_size=BATCH_SIZE,
      train_batch_size=BATCH_SIZE)  

  examples = read_sequence(input_list)

  features = convert_examples_to_features(
      examples=examples, seq_length=MAX_SEQ_LENGTH, tokenizer=tokenizer)

  input_fn = input_fn_builder(
      features=features, seq_length=MAX_SEQ_LENGTH)

  unique_id_to_feature = {}
  for feature in features:
    unique_id_to_feature[feature.unique_id] = feature

  features_lst = []

  # Get features
  for result in tqdm(estimator.predict(input_fn, yield_single_examples=True), position=0, leave=True):

    unique_id = int(result["unique_id"])
    feature = unique_id_to_feature[unique_id]
    output = collections.OrderedDict()
    for i, token in enumerate(feature.tokens):
      layers = []
      for j, layer_index in enumerate(layer_indexes):
        layer_output = result["layer_output_%d" % j]
        layer_output_flat = np.array([x for x in layer_output[i:(i + 1)].flat])
        layers.append(layer_output_flat)

      output[token] = sum(layers)[:dim]

    features_lst.append(output)

  return features_lst

In [None]:
def extract_embedding_vector(features):
  vector = []
  for word in features:
    vector.append(list(features[word]))
  return vector

def extract_embedded_docs(features_lst):
  embedded_docs = []
  for features in features_lst:
    embedded_docs.append(extract_embedding_vector(features))
  return embedded_docs

### Get train embeddings

In [None]:
train_docs = train['fullText_based_content'].tolist()

Split train docs, the emdedded docs are to large to get all at once

In [None]:
l_train = len(train_docs)

First split

In [None]:
train_docs_1 = train_docs[:int(l_train/3)] # first third

train_features_1 = get_features(train_docs_1, dim=5)

In [None]:
train_embedded_1 = extract_embedded_docs(train_features_1)

Save first split embeddings

In [None]:
train_embedded_df_1 = pd.DataFrame({'full_text_embedded':train_embedded_1})

# split to 4 datasets
num_splits = 4
split_size = 1673 # rows per database
for i in range(num_splits):
  temp_df = train_embedded_df_1[(i*split_size):((i+1)*split_size)]
  path = f'/content/train_embedded_v2_{i}.csv'
  temp_df.to_csv(path, index=False)

Second split

In [None]:
train_docs_2 = train_docs[int(l_train/3):int((2*l_train)/3)] # second third

train_features_2 = get_features(train_docs_2, dim=5)

In [None]:
train_embedded_2 = extract_embedded_docs(train_features_2)

Save second split embeddings

In [None]:
train_embedded_df_2 = pd.DataFrame({'full_text_embedded':train_embedded_2})

# split to 4 datasets
num_splits = 4
split_size = 1673 # rows per database
for i in range(num_splits):
  temp_df = train_embedded_df_2[(i*split_size):((i+1)*split_size)]
  path = f'/content/train_embedded_v2_{i+4}.csv'
  temp_df.to_csv(path, index=False)

Third split (last)

In [None]:
train_docs_3 = train_docs[int((2*l_train/3)):l_train] # third third

train_features_3 = get_features(train_docs_3, dim=5)

In [None]:
train_embedded_3 = extract_embedded_docs(train_features_3)

Save third split embeddings

In [None]:
train_embedded_df_3 = pd.DataFrame({'full_text_embedded':train_embedded_3})

# save train embeddings
# split to 4 datasets
num_splits = 4
split_size = 1673 # rows per database
for i in range(num_splits):
  temp_df = train_embedded_df_3[(i*split_size):((i+1)*split_size)]
  path = f'/content/train_embedded_v2_{i+8}.csv'
  temp_df.to_csv(path, index=False)

Without splits (when max sequence len = 128)

In [None]:
train_features = get_features(train_docs, dim=5)

In [None]:
train_embedded = extract_embedded_docs(train_features)

Save train embeddings

In [None]:
train_embedded_df = pd.DataFrame({'full_text_embedded':train_embedded})

# save train embeddings
# split to 4 datasets
num_splits = 4
split_size = 3763 # (15052 / 4 = 3763) rows per database
for i in range(num_splits):
  temp_df = train_embedded_df[(i*split_size):((i+1)*split_size)]
  path = f'/content/train_embedded{i}.csv'
  temp_df.to_csv(path, index=False)

### Get test embeddings

In [None]:
test_docs = list(test['fullText_based_content'])

In [None]:
test_features = get_features(test_docs, dim=5)

In [None]:
test_embedded = extract_embedded_docs(test_features)

Save test embeddings

In [None]:
test_embedded_df = pd.DataFrame({'full_text_embedded':test_embedded})

# save test embeddings
test_embedded_df.to_csv('test_embedded.csv', index=False)

In [35]:
%%shell
jupyter nbconvert --to html /content/Fake_news_encoding.ipynb

[NbConvertApp] Converting notebook /content/Fake_news_encoding.ipynb to html
[NbConvertApp] Writing 419158 bytes to /content/Fake_news_encoding.html


