# Comparing TensorFlow (original) and PyTorch models

You can use this small notebook to check the conversion of the model's weights from the TensorFlow model to the PyTorch model. In the following, we compare the hidden states produced by the two models on a simple example (in `input.txt`). Both models return all the hidden layers, so you can check every stage of the model.

To run this notebook, follow these instructions:
- make sure that your Python environment has both TensorFlow and PyTorch installed,
- download the original TensorFlow implementation,
- download a pre-trained TensorFlow model as indicated in the TensorFlow implementation readme,
- run the script `convert_tf_checkpoint_to_pytorch.py` as indicated in the `README` to convert the pre-trained TensorFlow model to PyTorch.

If needed, change the relative paths indicated in this notebook (at the beginning of Sections 1 and 2) to point to the relevant models and code.
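Before running the cells, it can help to confirm that the model directory contains the files the notebook reads. The sketch below is a hypothetical helper, not part of either BERT code base. The names `vocab.txt` and `bert_config.json` come from the paths used in Section 1; `bert_model.ckpt.index` and `pytorch_model.bin` are assumptions about how the TensorFlow checkpoint and the converted PyTorch weights are stored, so adjust them to your setup.

```python
import tempfile
from pathlib import Path


def missing_files(model_dir, expected_names):
    """Return the names in expected_names that do not exist inside model_dir."""
    root = Path(model_dir)
    return [name for name in expected_names if not (root / name).exists()]


# Assumed file names; only the first two appear in the notebook itself.
EXPECTED = [
    "vocab.txt",
    "bert_config.json",
    "bert_model.ckpt.index",  # assumption: TensorFlow checkpoint index file
    "pytorch_model.bin",      # assumption: output of the conversion script
]

# Small self-contained demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "vocab.txt").touch()
    demo_missing = missing_files(tmp, ["vocab.txt", "bert_config.json"])

print(demo_missing)
```

In the notebook itself you would call `missing_files(model_dir, EXPECTED)` once `model_dir` is defined and expect an empty list.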

In [1]:
import os
os.chdir('../')

In [2]:
import tensorflow as tf

W0702 16:02:10.969589 140024053769984 __init__.py:308] Limited tf.compat.v2.summary API due to missing TensorBoard installation.


## 1/ TensorFlow code

In [3]:
original_tf_inplem_dir = "../bert/"
model_dir = "/tmp/pretraining_output_test_final_model/"

vocab_file = model_dir + "vocab.txt"
bert_config_file = model_dir + "bert_config.json"
init_checkpoint = model_dir + "bert_model.ckpt"

input_file = "./samples/input.txt"
max_seq_length = 128

In [4]:
import importlib.util
import sys

spec = importlib.util.spec_from_file_location('*', original_tf_inplem_dir + '/extract_features.py')
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
sys.modules['extract_features_tensorflow'] = module
sys.path.append('../bert')
from extract_features_tensorflow import *
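The cell above loads the TensorFlow `extract_features.py` under the alias `extract_features_tensorflow`, presumably so that it does not clash with the PyTorch example script of the same name imported in Section 2. The following is a minimal, self-contained sketch of that pattern; `load_module_as` and the alias `extract_features_demo_alias` are hypothetical names used only for illustration.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_module_as(alias, path):
    """Load the Python file at `path` and register it in sys.modules as `alias`."""
    spec = importlib.util.spec_from_file_location(alias, path)
    module = importlib.util.module_from_spec(spec)
    # Register before executing so the module can be found during its own import.
    sys.modules[alias] = module
    spec.loader.exec_module(module)
    return module


# Demonstration with a throwaway file standing in for extract_features.py.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "extract_features.py"
    source.write_text("ANSWER = 42\n")
    demo = load_module_as("extract_features_demo_alias", str(source))

demo_value = demo.ANSWER
print(demo_value)
```

Once registered, `from extract_features_demo_alias import *` would work in the same way as the aliased import in the notebook.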

In [5]:
# with tf.variable_scope("test", dtype=tf.float64):
layer_indexes = list(range(12))
bert_config = modeling.BertConfig.from_json_file(bert_config_file)
tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=True)
examples = read_examples(input_file)

features = convert_examples_to_features(
    examples=examples, seq_length=max_seq_length, tokenizer=tokenizer)
unique_id_to_feature = {}
for feature in features:
    unique_id_to_feature[feature.unique_id] = feature

W0702 16:02:11.077666 140024053769984 deprecation_wrapper.py:119] From /dfs/scratch0/zjian/bert-pretraining/src/bert-pretraining/third_party/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

W0702 16:02:11.233513 140024053769984 deprecation_wrapper.py:119] From ../bert//extract_features.py:295: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.



In [6]:
# with tf.variable_scope("test", dtype=tf.float64):
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
run_config = tf.contrib.tpu.RunConfig(
    master=None,
    tpu_config=tf.contrib.tpu.TPUConfig(
        num_shards=1,
        per_host_input_for_training=is_per_host))

model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=init_checkpoint,
    layer_indexes=layer_indexes,
    use_tpu=False,
    use_one_hot_embeddings=False)

# If TPU is not available, this will fall back to normal Estimator on CPU
# or GPU.
estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,
    model_fn=model_fn,
    config=run_config,
    predict_batch_size=1)

input_fn = input_fn_builder(
    features=features, seq_length=max_seq_length)

W0702 16:02:11.905027 140024053769984 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0702 16:02:11.907641 140024053769984 estimator.py:1984] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f58f672fbf8>) includes params argument, but params are not passed to Estimator.
W0702 16:02:11.911869 140024053769984 estimator.py:1811] Using temporary folder as model directory: /tmp/tmpojjqolic
W0702 16:02:11.914280 140024053769984 tpu_context.py:750] Setting TPUConfig.num_shards==1 is an unsupported behavior. Please fix as soon as possible (leaving num_shards as None.)
W0702 16:02:11.915335 140024053769984 tpu_context.py:211] eval_on_tpu ig

In [7]:
# with tf.variable_scope("test", dtype=tf.float64):
tensorflow_all_out = []
for result in estimator.predict(input_fn, yield_single_examples=True):
    unique_id = int(result["unique_id"])
    feature = unique_id_to_feature[unique_id]
    output_json = collections.OrderedDict()
    output_json["linex_index"] = unique_id
    tensorflow_all_out_features = []
    # for (i, token) in enumerate(feature.tokens):
    all_layers = []
    for (j, layer_index) in enumerate(layer_indexes):
        print("extracting layer {}".format(j))
        layer_output = result["layer_output_%d" % j]
        layers = collections.OrderedDict()
        layers["index"] = layer_index
        layers["values"] = layer_output
        all_layers.append(layers)
    tensorflow_out_features = collections.OrderedDict()
    tensorflow_out_features["layers"] = all_layers
    tensorflow_all_out_features.append(tensorflow_out_features)

    output_json["features"] = tensorflow_all_out_features
    tensorflow_all_out.append(output_json)

W0702 16:02:12.093593 140024053769984 deprecation_wrapper.py:119] From /dfs/scratch0/zjian/bert-pretraining/src/bert-pretraining/third_party/bert/modeling.py:171: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W0702 16:02:12.096588 140024053769984 deprecation_wrapper.py:119] From /dfs/scratch0/zjian/bert-pretraining/src/bert-pretraining/third_party/bert/modeling.py:409: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

W0702 16:02:12.134335 140024053769984 deprecation_wrapper.py:119] From /dfs/scratch0/zjian/bert-pretraining/src/bert-pretraining/third_party/bert/modeling.py:490: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.

W0702 16:02:12.220436 140024053769984 deprecation.py:323] From /dfs/scratch0/zjian/bert-pretraining/src/bert-pretraining/third_party/bert/modeling.py:671: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a

bert/embeddings/word_embeddings:0
bert/embeddings/token_type_embeddings:0
bert/embeddings/position_embeddings:0
bert/embeddings/LayerNorm/beta:0
bert/embeddings/LayerNorm/gamma:0
bert/encoder/layer_0/attention/self/query/kernel:0
bert/encoder/layer_0/attention/self/query/bias:0
bert/encoder/layer_0/attention/self/key/kernel:0
bert/encoder/layer_0/attention/self/key/bias:0
bert/encoder/layer_0/attention/self/value/kernel:0
bert/encoder/layer_0/attention/self/value/bias:0
bert/encoder/layer_0/attention/output/dense/kernel:0
bert/encoder/layer_0/attention/output/dense/bias:0
bert/encoder/layer_0/attention/output/LayerNorm/beta:0
bert/encoder/layer_0/attention/output/LayerNorm/gamma:0
bert/encoder/layer_0/intermediate/dense/kernel:0
bert/encoder/layer_0/intermediate/dense/bias:0
bert/encoder/layer_0/output/dense/kernel:0
bert/encoder/layer_0/output/dense/bias:0
bert/encoder/layer_0/output/LayerNorm/beta:0
bert/encoder/layer_0/output/LayerNorm/gamma:0
bert/encoder/layer_1/attention/self/que

W0702 16:02:16.202486 140024053769984 deprecation.py:323] From /lfs/1/zjian/anaconda2/envs/bert-pretraining/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py:1354: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


attention output  [[-0.023517929 0.114575304 0.0559358522 ... -0.138221383 0.265626878 -0.034330003]
 [0.29820472 -0.256990701 -0.188006207 ... -0.171895146 0.276326776 -0.165664867]
 [0.221838355 -0.363108665 -0.133449152 ... -0.400731623 0.128412277 -0.185162306]
 ...
 [0.176892072 -0.00986768864 -0.00209077634 ... -0.148898095 0.252958655 -0.207604825]
 [0.180311739 -0.00693849288 0.0158761069 ... -0.158081949 0.261587888 -0.220568925]
 [0.177134782 -0.0169706102 -0.0248442 ... -0.149772689 0.25775969 -0.228787959]] [[0.168550581 -0.285767317 -0.326125622 ... -0.0275705792 0.0382532738 0.163995281]
 [-0.153370529 0.337782115 -0.169415116 ... 0.297335207 0.387448281 -0.342025697]
 [-0.4856323 -0.523777962 0.296871603 ... 0.191184327 -0.067088604 0.320287108]
 ...
 [0.0265424848 -0.179814368 0.489417434 ... -0.532132208 -0.246509075 0.465869516]
 [-0.119451314 -0.232424796 0.24276945 ... -0.473972976 -0.13932699 0.318041414]
 [0.110764056 -0.0750389695 0.322582066 ... -0.0730810165 -0

intermediate output  [[-0.0236262083 0.225160047 -0.0296256319 ... -0.0318246037 -0.0236812346 -0.0225447249]
 [-0.126183853 -0.15823181 -0.0473857597 ... -0.0720712468 -0.142897129 -0.0187649447]
 [-0.0771942809 -0.0480017141 -0.0587825067 ... -0.0458443314 -0.075253062 -0.0373694785]
 ...
 [-0.0274390765 -0.0692289174 -0.0451274924 ... -0.0139679629 -0.00442458596 -0.000603590044]
 [-0.0476122051 -0.00895324536 -0.0499093756 ... -0.0192492399 -0.00651343819 -0.000587598188]
 [-0.0426970534 -0.0222028401 -0.0293225758 ... -0.0238419697 -0.00822856463 -0.000817868859]]
layer output  [[-0.100653581 -0.220392287 -0.0911183208 ... 0.150086403 0.0866234899 0.116971701]
 [-0.787523687 0.498794913 0.475328982 ... 0.310904503 -0.309030294 -1.07281244]
 [-0.484146148 -0.173144937 0.346873283 ... -0.337559 0.28318426 -0.57993418]
 ...
 [-0.368741184 -0.236361772 0.787994862 ... 0.899103582 -0.230881214 -0.262915939]
 [-0.468671679 -0.18532905 0.690786421 ... 0.904403329 -0.155636147 -0.38067215

attention output 2  [[-0.542055726 0.478041977 -1.12282073 ... 0.341569662 0.503791213 0.416829735]
 [-0.584998548 -0.37038058 0.20044589 ... 0.897326827 -0.440936 -1.53594208]
 [-0.583883166 -0.462254822 -0.340663493 ... -0.401084423 0.913186729 -0.272743911]
 ...
 [-0.841303229 -1.13098419 0.799435079 ... 0.0242231227 -0.700819135 -0.0711985528]
 [-0.953500748 -0.972590446 1.04508686 ... 0.316271901 -0.739231825 -0.491385102]
 [-0.577084482 -1.17544258 1.21596551 ... 0.498981386 -0.819508195 -0.625325918]] [[-0.540071845 0.0500272587 -0.89557606 ... -0.26914373 0.430562019 0.31089741]
 [-0.453103513 -0.578923285 0.0848933682 ... 0.671331525 -0.479961693 -1.5545336]
 [-0.740642667 -0.710118771 0.138602927 ... -0.660527587 0.799326 -0.0161202643]
 ...
 [-0.622550726 -0.865465224 0.616362691 ... -0.075886175 -0.572800457 -0.0834857]
 [-0.673786938 -0.760952711 0.814807832 ... 0.216668099 -0.59974736 -0.396390229]
 [-0.369158477 -0.896872103 0.967870295 ... 0.399232864 -0.666993618 -0.48

layer output  [[-0.400504857 0.599968612 -1.27780664 ... -0.600738645 0.228174895 0.81543082]
 [-0.712510586 -1.04387331 -0.232637823 ... 0.313348114 -0.189448506 -0.8807531]
 [0.0555186719 -1.34353602 -0.183687896 ... -0.0756268799 0.567102432 0.0978050828]
 ...
 [-1.00748932 -0.849740386 -0.274659723 ... -0.37996307 -0.30851987 0.374443352]
 [-1.08780396 -0.678281724 0.680303454 ... 0.0840016752 -0.321386844 -0.18527]
 [-0.826690137 -0.7939924 0.769347608 ... 0.313351929 -0.518493056 -0.396081537]]
attention output  [[-0.129962802 0.0542304479 -0.208217412 ... 0.0465455838 0.215447828 0.0425155573]
 [0.00484702084 -0.162631944 -0.267138511 ... 0.0203325115 0.0877325 -0.049696736]
 [-0.0492953248 -0.0936832577 -0.261444569 ... 0.108127385 0.17607455 -0.117583416]
 ...
 [0.0138748037 -0.154966667 0.0644498095 ... 0.0145573281 -0.0305626262 0.0455522202]
 [0.073379904 0.00129055604 0.0453381464 ... -0.0243445337 0.0395112932 0.0331408866]
 [0.095911853 -0.0487403683 0.0871657804 ... 0.0

intermediate output  [[-0.0266864952 -0.168003961 -0.000373105315 ... -0.170005754 -0.16354847 -0.000815874548]
 [-0.00106928044 -0.00643255562 -0.0109781073 ... -0.0969317406 -0.0194965042 -0.00230544875]
 [-0.000643233361 -0.123876981 -0.0257592872 ... -0.165454894 -0.131807864 -0.0153195765]
 ...
 [-0.0318699963 -0.0202224683 -0.0120739201 ... -0.0319447704 -0.114373252 -0.0856173486]
 [-0.0341788307 -0.0163605921 -0.0242529754 ... -0.0367028862 -0.102355875 -0.0618763603]
 [-0.0376860611 -0.0186775792 -0.0194126852 ... -0.0545464456 -0.0904721841 -0.0628261492]]
layer output  [[-0.909168899 0.307837218 -0.577829182 ... -0.793938398 0.471214 0.695334613]
 [-0.950974107 -0.854580879 -0.168390319 ... 0.0339296535 0.0199695788 -0.703799307]
 [0.0859396607 -1.14730322 -0.0370468311 ... -0.149369046 0.267103612 0.13064301]
 ...
 [-1.15583885 -0.493509829 0.161296189 ... -0.0386419222 0.170636088 -0.262758464]
 [-1.17897987 -0.333679497 0.583467841 ... 0.160872757 0.125401601 -0.614055812

In [8]:
print(len(tensorflow_all_out))
print(len(tensorflow_all_out[0]))
print(tensorflow_all_out[0].keys())
print("number of tokens", len(tensorflow_all_out[0]['features']))
print("number of layers", len(tensorflow_all_out[0]['features'][0]['layers']))
tensorflow_all_out[0]['features'][0]['layers'][0]['values'].shape

1
2
odict_keys(['linex_index', 'features'])
number of tokens 1
number of layers 12


(128, 768)

In [9]:
tensorflow_outputs = list(tensorflow_all_out[0]['features'][0]['layers'][t]['values'] for t in layer_indexes)

In [10]:
print(tensorflow_outputs[0])

[[ 0.10810544  0.00736203 -0.14134324 ...  0.08043151  0.07175563
   0.0031992 ]
 [-0.00526232  0.6327945  -0.2985075  ...  0.10594425  0.09061253
  -0.76824725]
 [-0.3182612  -0.8120704   0.15033704 ... -0.1900597   0.15686822
   0.12246863]
 ...
 [ 0.09414048 -0.33054894  0.61384857 ...  0.43929374 -0.3086228
   0.06017733]
 [ 0.01996002 -0.37984183  0.49045902 ...  0.45061845 -0.21570973
  -0.05887301]
 [ 0.15295641 -0.2668718   0.49672574 ...  0.7504021  -0.5253611
  -0.10960616]]


In [11]:
tf.reset_default_graph()
input_tensor = tf.constant(value=0.5, shape=(128, 276))
output_tensor = tf.contrib.layers.layer_norm(
      inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope="double_test")
assign_ops = []
tvars = []
for tvar in tf.trainable_variables():
    assign_ops.append(tf.assign(tvar, tf.ones_like(tvar)))
    tvars.append(tvar)
    
print(tf.trainable_variables())


with tf.compat.v1.Session() as sess:
    sess.run(assign_ops)
    res = sess.run(tvars)
    res = sess.run(output_tensor)
    print(res)


[<tf.Variable 'double_test/beta:0' shape=(276,) dtype=float32_ref>, <tf.Variable 'double_test/gamma:0' shape=(276,) dtype=float32_ref>]
[[1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 ...
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]]
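
The cell above passes a constant tensor through `tf.contrib.layers.layer_norm` with `beta` and `gamma` both set to one, and every output value is 1. A NumPy sketch of the usual layer-normalization formula suggests why: a constant row has zero deviation from its mean, so the normalized value is 0 and the result reduces to `beta`. The function below is an illustrative reimplementation, not the TensorFlow code, and the `eps` value is an assumption.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-12):
    """Normalize over the last axis, then scale by gamma and shift by beta."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


# Same shape and constant value as the TensorFlow test cell.
x = np.full((128, 276), 0.5, dtype=np.float32)
gamma = np.ones(276, dtype=np.float32)
beta = np.ones(276, dtype=np.float32)

out = layer_norm(x, gamma, beta)
print(out.shape, float(out.min()), float(out.max()))
```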


## 2/ PyTorch code

In [12]:
os.chdir('./examples')

In [13]:
import extract_features
import pytorch_pretrained_bert as ppb
from extract_features import *

In [14]:
init_checkpoint_pt = "/tmp/pretraining_output_test_final_model/"

In [15]:
device = torch.device("cpu")
model = ppb.BertModel.from_pretrained(init_checkpoint_pt)
model.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.0)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.0)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=

In [16]:
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
all_input_type_ids = torch.tensor([f.input_type_ids for f in features], dtype=torch.long)
all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)

eval_data = TensorDataset(all_input_ids, all_input_mask, all_input_type_ids, all_example_index)
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=1)

model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.0)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.0)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.0)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=

In [17]:
layer_indexes = list(range(12))

pytorch_all_out = []
for input_ids, input_mask, input_type_ids, example_indices in eval_dataloader:
    print(input_ids)
    print(input_mask)
    print(example_indices)
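    # Move every model input to the target device (a no-op on CPU).
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    input_type_ids = input_type_ids.to(device)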

    all_encoder_layers, _ = model(input_ids, token_type_ids=input_type_ids, attention_mask=input_mask)

    for b, example_index in enumerate(example_indices):
        feature = features[example_index.item()]
        unique_id = int(feature.unique_id)
        # feature = unique_id_to_feature[unique_id]
        output_json = collections.OrderedDict()
        output_json["linex_index"] = unique_id
        all_out_features = []
        # for (i, token) in enumerate(feature.tokens):
        all_layers = []
        for (j, layer_index) in enumerate(layer_indexes):
            print("layer", j, layer_index)
            layer_output = all_encoder_layers[int(layer_index)].detach().cpu().numpy()
            layer_output = layer_output[b]
            layers = collections.OrderedDict()
            layers["index"] = layer_index
            layers["values"] = layer_output if not isinstance(layer_output, (int, float)) else [layer_output]
            all_layers.append(layers)

        # Build the feature record once per example, after all layers are collected
        # (this mirrors the TensorFlow cell in Section 1).
        out_features = collections.OrderedDict()
        out_features["layers"] = all_layers
        all_out_features.append(out_features)
        output_json["features"] = all_out_features
        pytorch_all_out.append(output_json)

tensor([[  101,  2040,  2001,  3958, 27227,  1029,   102,  3958, 27227,  2001,
          1037, 13997, 11510,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,  

attention output  tensor([[[-0.3289,  0.5208, -0.4548,  ..., -0.2294,  0.2134, -0.2056],
         [ 0.1561,  0.0687, -0.4760,  ..., -0.0908, -0.1608,  0.0037],
         [-0.0266, -0.2418, -0.1724,  ...,  0.0117,  0.2279,  0.0795],
         ...,
         [-0.0879, -0.0112,  0.1206,  ...,  0.1064, -0.3725,  0.1220],
         [-0.0258, -0.0087,  0.1116,  ...,  0.1620, -0.3612,  0.0543],
         [-0.0447,  0.0216,  0.0513,  ...,  0.1878, -0.3071,  0.0659]]],
       grad_fn=<AddBackward0>) tensor([[[-0.1698, -0.5307, -0.6242,  ..., -0.1200,  0.1597,  0.4773],
         [-0.7938, -0.1090,  0.7108,  ...,  0.4576, -0.5454, -1.3904],
         [-0.4679, -0.1531,  0.3411,  ..., -0.4670,  0.1977, -0.1903],
         ...,
         [-0.6615, -0.8491,  0.7529,  ...,  0.5292, -0.7504, -0.1584],
         [-0.6878, -0.7739,  0.7142,  ...,  0.6860, -0.6350, -0.3030],
         [-0.4130, -0.8915,  0.8246,  ...,  0.8307, -0.7879, -0.4366]]],
       grad_fn=<AddBackward0>)
attention output 2  tensor([[[-0.568

       grad_fn=<AddBackward0>) 1e-12 Parameter containing:
tensor([-2.8268e-02,  4.4276e-02,  5.6783e-02, -4.7591e-02,  1.6136e-01,
        -2.6716e-02,  1.7337e-01,  9.8786e-02, -1.6353e-02, -9.7341e-02,
         4.8042e-02,  7.7096e-02, -1.6100e-01,  2.1559e-02,  1.0660e-03,
         2.9267e-01,  1.5741e-01,  1.1006e-01, -8.4956e-02,  1.6614e-02,
         1.8766e-01, -1.2081e-01,  6.8637e-03,  2.7939e-01,  1.1396e-01,
         1.0982e-01, -5.4954e-03, -5.1910e-02, -2.8676e-01,  1.2391e-01,
        -4.1713e-02, -5.9358e-02,  1.8512e-01, -3.6515e-02,  2.4761e-02,
         8.1520e-02, -1.7233e-01, -8.6288e-02, -3.6852e-02, -1.3840e-01,
        -1.4638e-01, -1.5844e-01, -1.2759e-01,  1.5428e-01, -2.4942e-02,
        -1.5789e-01,  3.5075e-02,  7.0320e-02, -1.0181e-01, -9.3846e-02,
        -6.0247e-02, -1.0749e-01,  2.5149e-01, -1.5908e-01,  1.3973e-01,
         1.4689e-01,  7.5141e-02, -2.6053e-01, -4.2137e-02, -1.3690e-01,
         7.5296e-02,  7.8063e-02, -4.5173e-02,  4.6669e-02, -8.92

In [18]:
print(len(pytorch_all_out))
print(len(pytorch_all_out[0]))
print(pytorch_all_out[0].keys())
print("number of tokens", len(pytorch_all_out))
print("number of layers", len(pytorch_all_out[0]['features'][0]['layers']))
print("hidden_size", len(pytorch_all_out[0]['features'][0]['layers'][0]['values']))
pytorch_all_out[0]['features'][0]['layers'][0]['values'].shape

1
2
odict_keys(['linex_index', 'features'])
number of tokens 1
number of layers 12
hidden_size 128


(128, 768)

In [19]:
pytorch_outputs = list(pytorch_all_out[0]['features'][0]['layers'][t]['values'] for t in layer_indexes)
print(pytorch_outputs[0].shape)
print(pytorch_outputs[1].shape)

(128, 768)
(128, 768)


In [20]:
print(tensorflow_outputs[0].shape)
print(tensorflow_outputs[1].shape)

(128, 768)
(128, 768)


In [21]:
print(tensorflow_outputs[0])
# for i in range(tensorflow_outputs[0].shape[0]):
#     print(tensorflow_outputs[0][i])

[[ 0.10810544  0.00736203 -0.14134324 ...  0.08043151  0.07175563
   0.0031992 ]
 [-0.00526232  0.6327945  -0.2985075  ...  0.10594425  0.09061253
  -0.76824725]
 [-0.3182612  -0.8120704   0.15033704 ... -0.1900597   0.15686822
   0.12246863]
 ...
 [ 0.09414048 -0.33054894  0.61384857 ...  0.43929374 -0.3086228
   0.06017733]
 [ 0.01996002 -0.37984183  0.49045902 ...  0.45061845 -0.21570973
  -0.05887301]
 [ 0.15295641 -0.2668718   0.49672574 ...  0.7504021  -0.5253611
  -0.10960616]]


In [22]:
print(pytorch_outputs[0])

[[ 0.10828765  0.0067381  -0.14161144 ...  0.08049869  0.07187192
   0.00301047]
 [-0.00542086  0.63183767 -0.29846492 ...  0.10570657  0.09073348
  -0.7689243 ]
 [-0.31781077 -0.8129925   0.15038896 ... -0.18991752  0.15697755
   0.12239677]
 ...
 [ 0.09426479 -0.33084315  0.61350524 ...  0.4392033  -0.30860138
   0.05975728]
 [ 0.02007423 -0.3801041   0.49014768 ...  0.4505022  -0.21570988
  -0.05932539]
 [ 0.15316041 -0.26719052  0.49640608 ...  0.75035083 -0.52535987
  -0.1100089 ]]


In [23]:
for i in range(len(layer_indexes)):  # iterate over all 12 layers, including the last one
    print(i, tensorflow_outputs[i])
    print(i, pytorch_outputs[i])
    print(i, tensorflow_outputs[i] - pytorch_outputs[i])

0 [[ 0.10810544  0.00736203 -0.14134324 ...  0.08043151  0.07175563
   0.0031992 ]
 [-0.00526232  0.6327945  -0.2985075  ...  0.10594425  0.09061253
  -0.76824725]
 [-0.3182612  -0.8120704   0.15033704 ... -0.1900597   0.15686822
   0.12246863]
 ...
 [ 0.09414048 -0.33054894  0.61384857 ...  0.43929374 -0.3086228
   0.06017733]
 [ 0.01996002 -0.37984183  0.49045902 ...  0.45061845 -0.21570973
  -0.05887301]
 [ 0.15295641 -0.2668718   0.49672574 ...  0.7504021  -0.5253611
  -0.10960616]]
0 [[ 0.10828765  0.0067381  -0.14161144 ...  0.08049869  0.07187192
   0.00301047]
 [-0.00542086  0.63183767 -0.29846492 ...  0.10570657  0.09073348
  -0.7689243 ]
 [-0.31781077 -0.8129925   0.15038896 ... -0.18991752  0.15697755
   0.12239677]
 ...
 [ 0.09426479 -0.33084315  0.61350524 ...  0.4392033  -0.30860138
   0.05975728]
 [ 0.02007423 -0.3801041   0.49014768 ...  0.4505022  -0.21570988
  -0.05932539]
 [ 0.15316041 -0.26719052  0.49640608 ...  0.75035083 -0.52535987
  -0.1100089 ]]
0 [[-1.8221139

In [28]:
# print(tensorflow_outputs[0] - pytorch_outputs[0])
# print(tensorflow_outputs[1] - pytorch_outputs[1])
print(tensorflow_outputs[2] - pytorch_outputs[2])
# print(tensorflow_outputs[11] - pytorch_outputs[11])

[[-4.19542193e-05  4.87893820e-04  3.64027917e-04 ... -4.16293740e-04
   6.87688589e-04  4.09096479e-04]
 [ 1.24812126e-03 -4.68790531e-04  2.93403864e-04 ... -7.49558210e-04
   6.66707754e-04  8.89182091e-04]
 [-2.89857388e-04  1.80363655e-04  2.33143568e-04 ...  1.05842948e-03
  -2.08616257e-05  6.86526299e-04]
 ...
 [-2.32636929e-04 -3.27408314e-04  7.27355480e-04 ... -1.55210495e-04
   8.00490379e-04  1.46865845e-04]
 [-5.50061464e-04 -1.61409378e-04  1.82747841e-04 ... -1.04010105e-04
   4.23476100e-04  6.42836094e-05]
 [-2.39863992e-04 -3.22192907e-04  2.92241573e-04 ... -1.17897987e-04
   4.39703465e-04 -5.87105751e-05]]


In [29]:
print(tensorflow_outputs[2])


[[-0.10065358 -0.22039229 -0.09111832 ...  0.1500864   0.08662349
   0.1169717 ]
 [-0.7875237   0.4987949   0.47532898 ...  0.3109045  -0.3090303
  -1.0728124 ]
 [-0.48414615 -0.17314494  0.34687328 ... -0.337559    0.28318426
  -0.5799342 ]
 ...
 [-0.36874118 -0.23636177  0.78799486 ...  0.8991036  -0.23088121
  -0.26291594]
 [-0.46867168 -0.18532905  0.6907864  ...  0.9044033  -0.15563615
  -0.38067216]
 [-0.23992004 -0.31867903  0.6687768  ...  1.0202096  -0.36089402
  -0.44989982]]


## 3/ Comparing the standard deviation between both models, layer by layer

In [30]:
import numpy as np

In [31]:
print('shape tensorflow layer, shape pytorch layer, standard deviation')
for i in range(12):
    tf_layer = np.array(tensorflow_outputs[i])
    pt_layer = np.array(pytorch_outputs[i])
    # Root-mean-square of the element-wise difference between the two layers.
    rms_diff = np.sqrt(np.mean((tf_layer - pt_layer) ** 2.0))
    print((tf_layer.shape, pt_layer.shape, rms_diff))

shape tensorflow layer, shape pytorch layer, standard deviation
((128, 768), (128, 768), 0.00021029184)
((128, 768), (128, 768), 0.00055583025)
((128, 768), (128, 768), 0.00068541203)
((128, 768), (128, 768), 0.0008927335)
((128, 768), (128, 768), 0.001315971)
((128, 768), (128, 768), 0.0016274694)
((128, 768), (128, 768), 0.0021441837)
((128, 768), (128, 768), 0.0024197593)
((128, 768), (128, 768), 0.0026458544)
((128, 768), (128, 768), 0.0028913843)
((128, 768), (128, 768), 0.0030688304)
((128, 768), (128, 768), 0.001419331)
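
The last column above is the root-mean-square of the element-wise difference between the two models (labelled "standard deviation" in the cell). To turn the comparison into a pass/fail check, one option is to also look at the largest absolute difference per layer and compare it to a tolerance. The sketch below uses synthetic arrays in place of `tensorflow_outputs` and `pytorch_outputs`; `compare_layers` is a hypothetical helper and the tolerance `atol=1e-2` is an arbitrary assumption, not a value taken from either code base.

```python
import numpy as np


def compare_layers(reference_layers, candidate_layers, atol=1e-2):
    """Return (index, rms_diff, max_abs_diff, within_tolerance) for each layer pair."""
    report = []
    for index, (ref, cand) in enumerate(zip(reference_layers, candidate_layers)):
        diff = np.asarray(ref, dtype=np.float64) - np.asarray(cand, dtype=np.float64)
        rms = float(np.sqrt(np.mean(diff ** 2)))
        max_abs = float(np.max(np.abs(diff)))
        report.append((index, rms, max_abs, max_abs <= atol))
    return report


# Synthetic stand-ins: three "layers" that differ by a constant offset of 1e-3.
rng = np.random.default_rng(0)
fake_tf = [rng.standard_normal((128, 768)).astype(np.float32) for _ in range(3)]
fake_pt = [layer + 1e-3 for layer in fake_tf]

demo_report = compare_layers(fake_tf, fake_pt)
for index, rms, max_abs, ok in demo_report:
    print(index, rms, max_abs, ok)
```

In the notebook you would call `compare_layers(tensorflow_outputs, pytorch_outputs)` and choose `atol` according to how much numerical drift you are willing to accept.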
