<figure>
<img src="../Imagenes/logo-final-ap.png"  width="80" height="80" align="left"/> 
</figure>

# <span style="color:blue"><left>Aprendizaje Profundo</left></span>

# <span style="color:red"><center>BERTAS</center></span>

<center>Creation of Embeddings</center>

##   <span style="color:blue">Authors</span>

1. Alvaro Mauricio Montenegro Díaz, ammontenegrod@unal.edu.co
2. Daniel Mauricio Montenegro Reyes, dextronomo@gmail.com 

## <span style="color:blue">References</span> 

1. [HuggingFace. Transformers](https://huggingface.co/transformers/)
1. [HuggingFace. Intro pipeline](https://huggingface.co/course/chapter1/3?fw=pt)
1. [Sentence transformer](https://github.com/UKPLab/sentence-transformers)
1. [SBERT paper](https://arxiv.org/pdf/1908.10084.pdf)
1. [SBERT.net](https://www.sbert.net/)

## <span style="color:blue">Contents</span>

* [Introduction](#Introduction)

## <span style="color:blue">Introduction</span>

In this notebook we compute sentence-transformer embeddings of the essays and store them in HDF5 files.

## <span style="color:blue">Load required modules</span>

In [1]:
import pandas as pd
import numpy as np
import h5py
from glob import glob
import os

## <span style="color:blue">Open hdf5 file</span>

In [2]:
# Paths of the HDF5 files to create
# (each file gets a user block that holds its documentation)
path_hdf5 = '../Datos/Embeddings_hdf5/'
file_names_hdf5 = ['../Datos/Embeddings_hdf5/F1_1783.hdf5',
                 '../Datos/Embeddings_hdf5/F2_1800.hdf5',
                 '../Datos/Embeddings_hdf5/F3_1726.hdf5',
                 '../Datos/Embeddings_hdf5/F4_1772.hdf5',
                 '../Datos/Embeddings_hdf5/F5_1805.hdf5',
                 '../Datos/Embeddings_hdf5/F6_1800.hdf5',
                 '../Datos/Embeddings_hdf5/F7_1569.hdf5',
                 '../Datos/Embeddings_hdf5/F8_723.hdf5']

path_csv = '../Datos/Clean_data/'
file_names_csv = sorted(glob(path_csv + 'F*.csv'))
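The processing loop later pairs `file_names_csv` and `file_names_hdf5` by position with `zip`, so both lists must be in the same order. A small check along the following lines could catch a mismatch before any embedding is computed. This is only a sketch: `exam_prefix` and `check_pairing` are hypothetical helpers, and the CSV names in the demo are assumed, since the real ones are whatever matches `F*.csv`.

```python
import os
import re


def exam_prefix(path):
    """Return the leading 'F<number>' tag of a file name, or None if absent."""
    match = re.match(r'F\d+', os.path.basename(path))
    return match.group(0) if match else None


def check_pairing(csv_files, hdf5_files):
    """Raise ValueError unless both lists pair up one-to-one by exam prefix."""
    if len(csv_files) != len(hdf5_files):
        raise ValueError('different number of CSV and HDF5 files')
    for csv_file, hdf5_file in zip(csv_files, hdf5_files):
        prefix = exam_prefix(csv_file)
        if prefix is None or prefix != exam_prefix(hdf5_file):
            raise ValueError(f'mismatched pair: {csv_file} / {hdf5_file}')
    return True


# Hypothetical CSV names, used only for this demo
csv_demo = ['../Datos/Clean_data/F1.csv', '../Datos/Clean_data/F2.csv']
hdf5_demo = ['../Datos/Embeddings_hdf5/F1_1783.hdf5',
             '../Datos/Embeddings_hdf5/F2_1800.hdf5']
print(check_pairing(csv_demo, hdf5_demo))
```

In the notebook the same call would take `file_names_csv` and `file_names_hdf5` as arguments.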

In [3]:
for file in file_names_hdf5:
    # Create an empty HDF5 file with a 512-byte user block reserved for documentation
    with h5py.File(file, 'w', userblock_size=512):
        pass
    # Write the documentation into the user block once the HDF5 handle is closed
    doc = ('This file contains the embeddings of all the essays of one exam type.\n'
           'There are two objects:\nessay_id -> identifier of each essay.\n'
           'data -> array of size num_total_essays x 768 holding the sentence embeddings '
           'averaged per essay, computed with the all-mpnet-base-v2 model of sentence transformers.')
    with open(file, 'r+b') as f1:
        f1.write(doc.encode('ascii'))
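The documentation stored in the user block can be read back with plain file I/O, because the user block is simply the first 512 bytes of the file. The sketch below illustrates the idea without h5py: `read_userblock` is a hypothetical helper, and the demo file is a stand-in that only mimics the layout (text, padding, then the HDF5 signature). It assumes the unused part of the block is filled with NUL bytes.

```python
import os
import tempfile

USERBLOCK_SIZE = 512  # must match the userblock_size used when the file was created


def read_userblock(path, size=USERBLOCK_SIZE):
    """Return the text stored in the first `size` bytes, without NUL padding."""
    with open(path, 'rb') as fh:
        block = fh.read(size)
    return block.rstrip(b'\x00').decode('ascii')


# Stand-in file for the demo: documentation, NUL padding up to 512 bytes,
# then the 8-byte HDF5 signature where the real HDF5 content would start.
doc = 'Example documentation.\nessay_id -> identifiers\ndata -> embeddings'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.hdf5')
    with open(path, 'wb') as fh:
        fh.write(doc.encode('ascii').ljust(USERBLOCK_SIZE, b'\x00'))
        fh.write(b'\x89HDF\r\n\x1a\n')
    recovered = read_userblock(path)

print(recovered)
```

For a real file, the size of the block is available as the `userblock_size` property of an open `h5py.File`.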

## <span style="color:blue">Create sentence model</span>

In [4]:
from sentence_transformers import SentenceTransformer, LoggingHandler
model = SentenceTransformer('all-mpnet-base-v2')
import logging

# Print timestamped log messages while the embeddings are computed in parallel

logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

## <span style="color:blue">Load and process data</span>

1. Create a list with the file names.
1. Process each file:
    * read the file,
    * extract the `essay` column,
    * split each essay into sentences,
    * embed the sentences of each essay in parallel,
    * average the sentence embeddings of each essay,
    * save the embeddings in the hdf5 file.
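The per-essay part of these steps (split into sentences, embed, average) can be sketched as follows. This is a minimal illustration, not the notebook's pipeline: `fake_encode` is a hypothetical stand-in for the sentence model that returns random vectors, and `naive_split` is a rough splitter on periods, whereas the notebook uses `sent_tokenize` and the `all-mpnet-base-v2` model.

```python
import numpy as np

EMBEDDING_DIM = 768  # embedding size of all-mpnet-base-v2


def fake_encode(sentences, dim=EMBEDDING_DIM, seed=0):
    """Hypothetical stand-in for a sentence encoder: one random vector per sentence."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((len(sentences), dim)).astype(np.float32)


def naive_split(essay):
    """Very rough sentence splitter, used only to keep the sketch dependency-light."""
    return [s.strip() for s in essay.split('.') if s.strip()]


def embed_essays(essays, encode=fake_encode):
    """Return an array with one mean-pooled embedding per essay."""
    pooled = []
    for essay in essays:
        sentences = naive_split(essay)
        emb = encode(sentences)            # shape: (num_sentences, dim)
        pooled.append(emb.mean(axis=0))    # shape: (dim,)
    return np.array(pooled, dtype=np.float32)


essays = ['First sentence. Second sentence.', 'A single sentence.']
data = embed_essays(essays)
print(data.shape)  # (2, 768)
```

Averaging gives every essay a fixed-size vector regardless of how many sentences it has, which is what allows all essays of one exam type to be stored as a single `num_essays x 768` array.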

In [6]:
import nltk
nltk.download('punkt')  # Punkt models used by sent_tokenize
import matplotlib.pyplot as plt
# tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize 

[nltk_data] Downloading package punkt to /home/bizon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:

# compute the embeddings
if __name__ == '__main__':
    #Start the multi-process pool on all available CUDA devices
    pool = model.start_multi_process_pool()
    
    # process each file
    for file_csv, file_hdf5 in zip(file_names_csv, file_names_hdf5):
        # read the essays from the ASAP CSV file
        input_file = pd.read_csv(file_csv)
        essays = input_file['essay'].values
        essay_id = np.array(input_file['essay_id'].values, dtype=np.uint16)
        emb_list = []
        
        # Embed each essay sentence by sentence
        for essay in essays:
            sentences = sent_tokenize(essay)
            # Compute the sentence embeddings using the multi-process pool
            emb = model.encode_multi_process(sentences, pool)
            # Mean-pool the sentence embeddings into a single vector per essay
            emb_pool = np.mean(emb, axis=0)
            emb_list.append(emb_pool)
        
        # convert the list to an array
        data = np.array(emb_list, dtype=np.float32)
        print('shape of the array:', data.shape)
        # write to the hdf5 file
        with h5py.File(file_hdf5, mode='r+') as f:
            f['/essay_id'] = essay_id
            f['/data'] = data

        
    # Optional: stop the processes in the pool
    model.stop_multi_process_pool(pool)

In [22]:
# Sanity check
for file_hdf5 in file_names_hdf5:
    # Open read-only: the check must not modify the files
    with h5py.File(file_hdf5, mode='r') as f:
        essay_dataset = f['/essay_id']
        dataset = f['/data']
        print(essay_dataset.shape)
        print(dataset[:2])

(1783,)
[[-0.00922659  0.03124746 -0.01250991 ...  0.02161009 -0.02491724
  -0.007745  ]
 [ 0.00144173  0.06095947 -0.01735124 ...  0.0249634  -0.01465576
  -0.0198111 ]]
(1800,)
[[ 0.00910369  0.05059931 -0.00264957 ... -0.01627963  0.01587244
  -0.00826657]
 [ 0.04705726  0.07408167 -0.00594867 ... -0.00919404 -0.00724361
   0.00038601]]
(1726,)
[[ 0.00014497 -0.0193138  -0.03675896 ... -0.01913233  0.03073528
  -0.01637933]
 [ 0.02078576 -0.04045704 -0.00054409 ...  0.00511276  0.00618818
  -0.01427932]]
(1772,)
[[ 0.02182995  0.09105893 -0.0024663  ...  0.00200623  0.0395928
  -0.01122126]
 [-0.02006244 -0.06909715 -0.02033951 ...  0.0136023   0.0234135
  -0.03045519]]
(1805,)
[[-0.01462389  0.01998284 -0.01343488 ...  0.01645633  0.01985981
  -0.01801151]
 [-0.04109019  0.03315708 -0.01877466 ...  0.01024659  0.01051008
  -0.00861973]]
(1800,)
[[-0.0100621   0.01959658 -0.00715139 ...  0.00786291 -0.00249043
  -0.00828326]
 [-0.0165115   0.00255447 -0.00583109 ... -0.01115719 -0.0