<a href="https://colab.research.google.com/github/LollipopGB/EA_TechnicalTest_DanielGarcia/blob/master/Solving_Error_MBERT_Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MBERT Word Embeddings
This notebook follows the usage described on the model's TensorFlow Hub page:
- BERT on TensorFlow Hub: https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/2

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install tensorflow==2.0
!pip install tensorflow_hub
!pip install bert-for-tf2
!pip install sentencepiece


In [3]:
import tensorflow as tf
import tensorflow_hub as hub
print("TF version: ", tf.__version__)
print("Hub version: ", hub.__version__)

TF version:  2.0.0
Hub version:  0.9.0


## Import modules

In [4]:
import tensorflow_hub as hub
import tensorflow as tf
import bert
FullTokenizer = bert.bert_tokenization.FullTokenizer
from tensorflow.keras.models import Model       # Keras is the new high level API for TensorFlow
import math
import pandas as pd
import string

In [5]:
path = '/content/drive/My Drive/ea_corpora_no_nan.csv'
df = pd.read_csv(path, sep=',', header=0)
max_seq_length = 128

## Building the model

Build the model with tf.keras and TensorFlow Hub so that it maps tokenized sentences to embeddings.

Inputs:
 - input token ids (tokenizer converts tokens using vocab file)
 - input masks (1 for useful tokens, 0 for padding)
 - segment ids (for two-text inputs: 0 for the first text, 1 for the second)

Outputs:
 - pooled_output of shape `[batch_size, 768]` with representations for the entire input sequences
 - sequence_output of shape `[batch_size, max_seq_length, 768]` with representations for each input token (in context)
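
As a minimal sketch of the mask and segment inputs listed above, the snippet below builds them for a hypothetical two-text input padded to a toy length of 8. The token list and `max_len` are illustrative assumptions; the segment logic mirrors the `get_segments` helper defined in the "Generate the dataset" section of this notebook.

```python
# Toy two-text input; the tokens and max_len are illustrative, not from the corpus.
tokens = ["[CLS]", "hello", "[SEP]", "world", "[SEP]"]
max_len = 8

# Mask: 1 for real tokens, 0 for padding.
input_mask = [1] * len(tokens) + [0] * (max_len - len(tokens))

# Segment ids: 0 up to and including the first [SEP], 1 afterwards, 0 for padding.
segment_ids = []
current = 0
for tok in tokens:
    segment_ids.append(current)
    if tok == "[SEP]":
        current = 1
segment_ids += [0] * (max_len - len(tokens))

print(input_mask)   # [1, 1, 1, 1, 1, 0, 0, 0]
print(segment_ids)  # [0, 0, 0, 1, 1, 0, 0, 0]
```

For the single-text inputs used in the rest of this notebook, the segment ids are all 0.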

In [6]:
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/2",
                            trainable=False)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

In [7]:
model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=[pooled_output, sequence_output])

## Example of generating an embedding

First, we preprocess the sentence following the BERT methodology. Then we generate the token ids, mask, and segment ids with the tokenizer.

In [8]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)

In [9]:
s = "This nice, sentence."

Tokenizing the sentence

In [10]:
stokens = tokenizer.tokenize(s)
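
Sub-word pieces such as `nic` and `##e` in the printed tokens further down come from WordPiece tokenization. The following is a rough sketch of the greedy longest-match-first idea, assuming a tiny hypothetical vocabulary (`toy_vocab`) and a hypothetical helper name (`wordpiece`). The real `FullTokenizer` also handles casing, punctuation splitting, and other details that are omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to reproduce the split seen in this notebook.
toy_vocab = {"This", "nic", "##e", "sentence"}

print(wordpiece("nice", toy_vocab))      # ['nic', '##e']
print(wordpiece("sentence", toy_vocab))  # ['sentence']
print(wordpiece("xyz", toy_vocab))       # ['[UNK]']
```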

Add the special `[CLS]` and `[SEP]` tokens, as described in the BERT paper

In [11]:
stokens = ["[CLS]"] + stokens + ["[SEP]"]

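Get the model inputs from the tokens. The helpers `get_ids`, `get_masks`, and `get_segments` are defined in cell In [13], in the "Generate the dataset" section.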

In [14]:
input_ids = get_ids(stokens, tokenizer, max_seq_length)
input_masks = get_masks(stokens, max_seq_length)
input_segments = get_segments(stokens, max_seq_length)

In [15]:
print(stokens)
print(input_ids)
print(input_masks)
print(input_segments)

['[CLS]', 'This', 'nic', '##e', ',', 'sentence', '.', '[SEP]']
[101, 10747, 46267, 10112, 117, 49219, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Generate Embeddings using the pretrained model

In [16]:
pool_embs, all_embs = model.predict([[input_ids],[input_masks],[input_segments]])

In [17]:
pool_embs

array([[ 0.23276514, -0.00594373,  0.30834806, -0.20356625, -0.09590112,
         0.43650317,  0.2871416 ,  0.30860618, -0.4822504 ,  0.36143574,
         0.05642665, -0.32627806, -0.24309899, -0.09139016,  0.16354893,
        -0.23851739,  0.7306353 ,  0.03081608,  0.04861095, -0.30492905,
        -0.9998869 , -0.2516094 , -0.20149425, -0.16954152, -0.44420984,
         0.21227205, -0.3052292 ,  0.33267456,  0.26318187, -0.29790413,
         0.22276796, -0.9998971 ,  0.55339444,  0.71306735,  0.32374546,
        -0.17143737,  0.0824426 ,  0.28915772,  0.2440962 , -0.39900273,
        -0.25733086, -0.05787262, -0.14064786,  0.19076537, -0.12050197,
        -0.41198155, -0.2559957 ,  0.265201  , -0.346468  ,  0.01656516,
         0.11038731,  0.36560035,  0.49111158,  0.2537717 ,  0.25466797,
         0.16619101,  0.18128854,  0.24105678,  0.35735092, -0.24011606,
        -0.02952298,  0.44282633,  0.22373076, -0.17051585, -0.2681529 ,
        -0.3102133 ,  0.21565759, -0.02486245,  0.6

## Generate the dataset

Define auxiliary functions that generate the batches used to obtain the embeddings for each document.

Initially, the documents were trimmed to 1000 words (the third quartile of document length) so that each document produced a small number of batches without losing too much information. However, generating all the embeddings was going to take about 17 hours. Since the goal here is to show the whole process rather than to obtain the best performance and quality, the texts are cut much further, to 100 words. The execution time was still 6 hours, so we also reduce the number of documents, **just to show the process with the classifier**.

In [13]:
def reduce_text(text):
  """Keep at most the first 100 whitespace-separated words of text."""
  words = text.split()
  if len(words) > 100:
    return " ".join(words[:100])
  return text

def get_masks(tokens, max_seq_length):
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))


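def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab, padded with zeros"""
    # Same length check as get_masks and get_segments
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids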

def generate_batches(text, size):
  """Split a token list into chunks of `size` tokens and wrap each in [CLS] ... [SEP]"""
  chunks = [text[x:x+size] for x in range(0, len(text), size)]
  for c in chunks:
    c.insert(0, '[CLS]')
    c.append('[SEP]')
  return chunks

def generate_all(row):
  """Build [input_ids, input_masks, input_segments] for every chunk of a document"""
  data_prepared = []
  for i in row:
    input_ids = get_ids(i, tokenizer, max_seq_length)
    input_masks = get_masks(i, max_seq_length)
    input_segments = get_segments(i, max_seq_length)
    data_prepared.append([input_ids, input_masks, input_segments])
  return data_prepared

def get_embeddings(batches):
  embedding = []
  for batch in batches:
    pool_embs, all_embs = model.predict([[batch[0]], [batch[1]], [batch[2]]])
    embedding.append(pool_embs)
  return embedding
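
To illustrate what the chunking step produces, here is a small self-contained sketch on toy tokens. The name `chunk_tokens` is a hypothetical stand-in that mirrors `generate_batches`, and the chunk size of 2 is chosen only to keep the output short.

```python
def chunk_tokens(tokens, size):
    """Split tokens into chunks of `size` and wrap each chunk in [CLS] ... [SEP]."""
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [["[CLS]"] + chunk + ["[SEP]"] for chunk in chunks]


toy_tokens = ["a", "b", "c", "d", "e"]
for chunk in chunk_tokens(toy_tokens, 2):
    print(chunk)
# ['[CLS]', 'a', 'b', '[SEP]']
# ['[CLS]', 'c', 'd', '[SEP]']
# ['[CLS]', 'e', '[SEP]']
```

Each chunk ends up with `size + 2` tokens. With the chunk size of 100 used in this notebook, that gives at most 102 tokens, which fits within `max_seq_length = 128`.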

In [18]:
# Keep every third row to shrink the dataset
df_reduced = df.iloc[::3]
df_reduced = df_reduced.reset_index()

In [19]:
df_reduced.shape

(7847, 4)

In [20]:
df_reduced['text_reduced'] = df_reduced['text'].apply(lambda x: reduce_text(x))

In [21]:
df_reduced['text_tokens'] = df_reduced['text_reduced'].apply(lambda x: tokenizer.tokenize(x))

In [22]:
df_reduced['text_batches'] = df_reduced['text_tokens'].apply(lambda x: generate_batches(x, 100))

In [23]:
df_reduced['basic_set'] = df_reduced['text_batches'].apply(lambda x: generate_all(x))

In [24]:
df_reduced.head()

Unnamed: 0,index,text,category,language,text_reduced,text_tokens,text_batches,basic_set
0,0,"i read this book because in my town, everyone ...",APR,en,"i read this book because in my town, everyone ...","[i, read, this, book, because, in, my, town, ,...","[[[CLS], i, read, this, book, because, in, my,...","[[[101, 177, 24944, 10531, 12748, 12373, 10106..."
1,3,milady has found a good vein: anita blake. bas...,APR,en,milady has found a good vein: anita blake. bas...,"[mil, ##ady, has, found, a, good, vei, ##n, :,...","[[[CLS], mil, ##ady, has, found, a, good, vei,...","[[[101, 15033, 51210, 10393, 11823, 169, 15198..."
2,6,"well, frankly, the first 3 volumes of the new ...",APR,en,"well, frankly, the first 3 volumes of the new ...","[well, ,, fra, ##nk, ##ly, ,, the, first, 3, v...","[[[CLS], well, ,, fra, ##nk, ##ly, ,, the, fir...","[[[101, 11206, 117, 10628, 17761, 10454, 117, ..."
3,9,it is a deafening silence olivier delorme brea...,APR,en,it is a deafening silence olivier delorme brea...,"[it, is, a, dea, ##fen, ##ing, silence, oli, #...","[[[CLS], it, is, a, dea, ##fen, ##ing, silence...","[[[101, 10271, 10124, 169, 42492, 15559, 10230..."
4,12,"i really like if it was true, and i felt faint...",APR,en,"i really like if it was true, and i felt faint...","[i, really, like, if, it, was, true, ,, and, i...","[[[CLS], i, really, like, if, it, was, true, ,...","[[[101, 177, 30181, 11850, 12277, 10271, 10134..."


In [25]:
from tqdm import tqdm

In [26]:
tqdm.pandas()
df_reduced['embeddings'] = df_reduced['basic_set'].progress_apply(lambda x: get_embeddings(x))

100%|██████████| 7847/7847 [3:47:15<00:00,  1.74s/it]
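
Most of the run time above is spent calling the model once per chunk. One possible variation, sketched here under assumptions, is to pass all chunks of a document in a single call. `embed_document` and `predict_fn` are hypothetical names: `predict_fn` stands in for something like `model.predict`, and with the real model the lists may need converting to arrays first. The returned shape also differs from `get_embeddings`, so the flattening cell would need adapting.

```python
def embed_document(basic_set, predict_fn):
    """Return one pooled vector per chunk using a single predict call.

    basic_set:  list of [input_ids, input_masks, input_segments], one per chunk.
    predict_fn: hypothetical callable taking [ids, masks, segments] (each with
                one row per chunk) and returning (pooled, sequence).
    """
    if not basic_set:
        return []
    ids = [chunk[0] for chunk in basic_set]
    masks = [chunk[1] for chunk in basic_set]
    segments = [chunk[2] for chunk in basic_set]
    pooled, _ = predict_fn([ids, masks, segments])
    return list(pooled)


def fake_predict(inputs):
    """Stand-in for the model so the sketch runs without TensorFlow.

    It "pools" each chunk by summing its token ids.
    """
    ids, masks, segments = inputs
    pooled = [[float(sum(row))] for row in ids]
    return pooled, None


toy_set = [
    [[101, 5, 102, 0], [1, 1, 1, 0], [0, 0, 0, 0]],
    [[101, 7, 102, 0], [1, 1, 1, 0], [0, 0, 0, 0]],
]
print(embed_document(toy_set, fake_predict))  # [[208.0], [210.0]]
```

Because the texts are cut to 100 words, many documents yield only a few chunks, so any gain from this change may be modest.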


In [27]:
df_reduced.head()

Unnamed: 0,index,text,category,language,text_reduced,text_tokens,text_batches,basic_set,embeddings
0,0,"i read this book because in my town, everyone ...",APR,en,"i read this book because in my town, everyone ...","[i, read, this, book, because, in, my, town, ,...","[[[CLS], i, read, this, book, because, in, my,...","[[[101, 177, 24944, 10531, 12748, 12373, 10106...","[[[0.21412727, -0.25563636, 0.3922883, -0.1099..."
1,3,milady has found a good vein: anita blake. bas...,APR,en,milady has found a good vein: anita blake. bas...,"[mil, ##ady, has, found, a, good, vei, ##n, :,...","[[[CLS], mil, ##ady, has, found, a, good, vei,...","[[[101, 15033, 51210, 10393, 11823, 169, 15198...","[[[0.21550941, -0.32545748, 0.33222845, -0.287..."
2,6,"well, frankly, the first 3 volumes of the new ...",APR,en,"well, frankly, the first 3 volumes of the new ...","[well, ,, fra, ##nk, ##ly, ,, the, first, 3, v...","[[[CLS], well, ,, fra, ##nk, ##ly, ,, the, fir...","[[[101, 11206, 117, 10628, 17761, 10454, 117, ...","[[[0.026787607, -0.3051752, 0.072571024, -0.58..."
3,9,it is a deafening silence olivier delorme brea...,APR,en,it is a deafening silence olivier delorme brea...,"[it, is, a, dea, ##fen, ##ing, silence, oli, #...","[[[CLS], it, is, a, dea, ##fen, ##ing, silence...","[[[101, 10271, 10124, 169, 42492, 15559, 10230...","[[[0.16710715, -0.19441627, 0.23410313, -0.311..."
4,12,"i really like if it was true, and i felt faint...",APR,en,"i really like if it was true, and i felt faint...","[i, really, like, if, it, was, true, ,, and, i...","[[[CLS], i, really, like, if, it, was, true, ,...","[[[101, 177, 30181, 11850, 12277, 10271, 10134...","[[[0.3504453, -0.23025343, 0.31962156, -0.2193..."


In [37]:
df_final = df_reduced.drop(columns=['text_reduced', 'text_tokens', 'text', 'text_batches', 'basic_set', 'index'])

In [38]:
df_final.head()

Unnamed: 0,category,language,embeddings
0,APR,en,"[[[0.21412727, -0.25563636, 0.3922883, -0.1099..."
1,APR,en,"[[[0.21550941, -0.32545748, 0.33222845, -0.287..."
2,APR,en,"[[[0.026787607, -0.3051752, 0.072571024, -0.58..."
3,APR,en,"[[[0.16710715, -0.19441627, 0.23410313, -0.311..."
4,APR,en,"[[[0.3504453, -0.23025343, 0.31962156, -0.2193..."


In [39]:
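data = []

for i in range(len(df_final)):
  # Concatenate the pooled vectors of all chunks of document i into one flat list
  new_row = [item for sublist in df_final.loc[i]['embeddings'] for item in sublist.tolist()[0]]
  # Prepend the labels so each row reads: language, category, embedding values
  new_row.insert(0, df_final.loc[i]['category'])
  new_row.insert(0, df_final.loc[i]['language'])
  data.append(new_row)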

In [40]:
df_flatten = pd.DataFrame(data)

In [41]:
df_flatten.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,6874,6875,6876,6877,6878,6879,6880,6881,6882,6883,6884,6885,6886,6887,6888,6889,6890,6891,6892,6893,6894,6895,6896,6897,6898,6899,6900,6901,6902,6903,6904,6905,6906,6907,6908,6909,6910,6911,6912,6913
0,en,APR,0.214127,-0.255636,0.392288,-0.10992,-0.026097,0.148366,0.293664,0.275959,-0.319734,0.139019,-0.054376,6e-06,-0.180775,-0.144149,0.103221,-0.070089,0.368038,0.070929,0.201238,-0.079317,-0.999311,-0.197135,-0.16249,-0.11198,-0.295185,0.138431,-0.152204,0.292405,-0.021648,-0.135306,0.111917,-0.999275,0.419451,0.414028,0.212482,-0.152436,0.085479,0.092783,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,en,APR,0.215509,-0.325457,0.332228,-0.287824,-0.136689,0.383519,0.291183,0.286996,-0.481386,0.21829,-0.201273,-0.205439,-0.26538,-0.299966,0.223057,-0.247543,0.675796,0.117384,0.200475,-0.13095,-0.999962,-0.341156,-0.359581,-0.138069,-0.41684,0.228402,-0.272158,0.278628,0.18857,-0.246999,0.210601,-0.99996,0.755134,0.689256,0.309375,-0.144468,0.23134,0.23143,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,en,APR,0.026788,-0.305175,0.072571,-0.581912,-0.353657,-0.348442,0.470672,-0.131022,0.239081,-0.328728,-0.344556,0.109195,0.15958,0.189566,0.342923,0.367296,-0.619738,0.249904,0.488591,0.298604,-0.600034,-0.341328,0.367388,-0.358904,-0.09761,0.188682,-0.290164,0.363977,-0.022725,-0.348053,0.029906,-0.726957,0.912228,-0.42725,0.054324,-0.394523,-0.263104,-0.133449,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,en,APR,0.167107,-0.194416,0.234103,-0.311199,-0.171992,0.159504,0.265455,0.096494,-0.173541,0.024549,-0.192411,0.036744,-0.079506,-0.102444,0.089812,0.040732,0.248893,0.19385,0.305282,0.00485,-0.984261,-0.003436,-0.128723,-0.078641,-0.13961,0.246347,-0.198116,0.24528,0.044969,-0.251157,0.091292,-0.982581,0.727015,0.476094,0.220772,0.005005,0.019828,0.149508,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,en,APR,0.350445,-0.230253,0.319622,-0.219304,-0.079355,0.508322,0.284624,0.340958,-0.570853,0.349486,-0.063045,-0.229274,-0.319232,-0.275659,0.280474,-0.407663,0.854582,0.114386,0.167656,-0.375742,-0.999933,-0.292812,-0.52883,-0.177153,-0.57439,0.37371,-0.290632,0.24986,0.404556,-0.316472,0.168192,-0.999944,0.741588,0.788995,0.34334,-0.154618,0.263667,0.352792,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
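
The trailing `NaN` columns appear because documents produce different numbers of chunks: the flattened rows have different lengths, and `pd.DataFrame` pads the shorter ones. The 6914 columns are consistent with 2 label columns plus 9 chunks of 768 values. The sketch below uses a toy dimension of 3 instead of 768, with made-up values, to illustrate this. It also shows one possible alternative, averaging the chunk vectors, which is not what this notebook does but yields fixed-length rows. The names `flatten` and `mean_pool` are hypothetical.

```python
# Toy pooled vectors: dimension 3 instead of 768 (values are made up).
doc_a = [[0.1, 0.2, 0.3]]                    # one chunk
doc_b = [[0.1, 0.2, 0.3], [0.5, 0.6, 0.7]]   # two chunks


def flatten(chunks):
    """Concatenate chunk vectors, as the flattening cell does."""
    return [value for chunk in chunks for value in chunk]


def mean_pool(chunks):
    """Average chunk vectors element-wise into one fixed-length vector."""
    n = len(chunks)
    return [sum(values) / n for values in zip(*chunks)]


print(len(flatten(doc_a)), len(flatten(doc_b)))      # 3 6
print(len(mean_pool(doc_a)), len(mean_pool(doc_b)))  # 3 3

# Column count seen in the notebook output: 2 labels + 9 chunks * 768 values.
print(2 + 9 * 768)  # 6914
```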


In [42]:
df_flatten = df_flatten.rename(columns={0: 'language', 1: 'category'})

In [43]:
df_flatten.head()

Unnamed: 0,language,category,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,6874,6875,6876,6877,6878,6879,6880,6881,6882,6883,6884,6885,6886,6887,6888,6889,6890,6891,6892,6893,6894,6895,6896,6897,6898,6899,6900,6901,6902,6903,6904,6905,6906,6907,6908,6909,6910,6911,6912,6913
0,en,APR,0.214127,-0.255636,0.392288,-0.10992,-0.026097,0.148366,0.293664,0.275959,-0.319734,0.139019,-0.054376,6e-06,-0.180775,-0.144149,0.103221,-0.070089,0.368038,0.070929,0.201238,-0.079317,-0.999311,-0.197135,-0.16249,-0.11198,-0.295185,0.138431,-0.152204,0.292405,-0.021648,-0.135306,0.111917,-0.999275,0.419451,0.414028,0.212482,-0.152436,0.085479,0.092783,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,en,APR,0.215509,-0.325457,0.332228,-0.287824,-0.136689,0.383519,0.291183,0.286996,-0.481386,0.21829,-0.201273,-0.205439,-0.26538,-0.299966,0.223057,-0.247543,0.675796,0.117384,0.200475,-0.13095,-0.999962,-0.341156,-0.359581,-0.138069,-0.41684,0.228402,-0.272158,0.278628,0.18857,-0.246999,0.210601,-0.99996,0.755134,0.689256,0.309375,-0.144468,0.23134,0.23143,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,en,APR,0.026788,-0.305175,0.072571,-0.581912,-0.353657,-0.348442,0.470672,-0.131022,0.239081,-0.328728,-0.344556,0.109195,0.15958,0.189566,0.342923,0.367296,-0.619738,0.249904,0.488591,0.298604,-0.600034,-0.341328,0.367388,-0.358904,-0.09761,0.188682,-0.290164,0.363977,-0.022725,-0.348053,0.029906,-0.726957,0.912228,-0.42725,0.054324,-0.394523,-0.263104,-0.133449,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,en,APR,0.167107,-0.194416,0.234103,-0.311199,-0.171992,0.159504,0.265455,0.096494,-0.173541,0.024549,-0.192411,0.036744,-0.079506,-0.102444,0.089812,0.040732,0.248893,0.19385,0.305282,0.00485,-0.984261,-0.003436,-0.128723,-0.078641,-0.13961,0.246347,-0.198116,0.24528,0.044969,-0.251157,0.091292,-0.982581,0.727015,0.476094,0.220772,0.005005,0.019828,0.149508,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,en,APR,0.350445,-0.230253,0.319622,-0.219304,-0.079355,0.508322,0.284624,0.340958,-0.570853,0.349486,-0.063045,-0.229274,-0.319232,-0.275659,0.280474,-0.407663,0.854582,0.114386,0.167656,-0.375742,-0.999933,-0.292812,-0.52883,-0.177153,-0.57439,0.37371,-0.290632,0.24986,0.404556,-0.316472,0.168192,-0.999944,0.741588,0.788995,0.34334,-0.154618,0.263667,0.352792,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [44]:
df_flatten.to_csv('/content/drive/My Drive/ea_embeddings_bert_flatten_good.csv', index=False, encoding='utf-8') 