# Feature Engineering and Syntactic Similarity

This section mainly covers how to *vectorize* documents, that is, how to convert unstructured text into vectors of numbers.

Because vectorization underlies almost every *Machine Learning* task, this chapter works with two models provided by the *scikit-learn* library and also builds a custom vectorizer, which is useful for future projects because it can be adapted to whatever tasks the user needs.

In [2]:
import sys, os

#Load the setup.py file
%run -i ../pyenv_settings/setup.py

#Imports and plot settings
%run "$BASE_DIR/pyenv_settings/settings.py"

#Reset the environment when execution starts
#%reset -f

%reload_ext autoreload
%autoreload 0
%config InlineBackend.figure_format = 'png'

# # to print output of all statements and not just the last
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"

# # otherwise text between $ signs will be interpreted as formula and printed in italic
# pd.set_option('display.html.use_mathjax', False)

You are working on a local system.
Files will be searched relative to "..".


With the virtual environment's settings and preferences loaded, we can create our first vectorizer.

## Building a Vectorizer
We will build a vectorizer and apply it to the Data Frame used in the previous chapters. The first step is to load the dataset stored in the database, whose content (the users' comments) is already normalized and tokenized.

In [3]:
#Connect to the database where the Data Frame is stored
db_name = "../data/zigbee2mqtt_comments.db"
con = sqlite3.connect(db_name)
df = pd.read_sql("select * from comments", con)
con.close()

#Check that it loaded correctly
print(df.columns)
print(df[['normalized_text', 'tokens']].head(4))

Index(['id', 'user', 'text', 'impurity', 'clean_text', 'normalized_text',
       'tokens'],
      dtype='object')
                                                                                                                                                                                           normalized_text  \
0                                                                    This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days   
1  Also, after updating the z2m, cyclic reboots began ''' Starting Zigbee2MQTT without watchdog. INFO: Preparing to start... INFO: Socat not enabled INFO: Starting Zigbee2MQTT... Starting Zigbee2MQTT...   
2  Hi ! Since 2 or 3 days, MQTT suddenly fail. A few messages in the log, many auto restart, and works again ... Very strange. In the log INFO: Preparing to start... ERROR: Got unexpected response fr...   
3  I don't know if it's exactly the same, but since v1.42 I ha

### Enumerating the vocabulary
First, every word in the Data Frame's normalized comments (the tokens) is assigned a number. To refer to a specific word, we can then use its associated number (its index in the generated dictionary) when building the vectors.

Note that the Data Frame stores the tokens as a comma-separated string, not as a list of words, so each entry in the "tokens" column must first be converted into a list with one word per element.

In [4]:
#Make sure the 'tokens' column holds lists rather than strings
df['tokens'] = df['tokens'].apply(lambda x: x.split(','))

#Check what the 'tokens' column looks like now
print(df['tokens'].head(4))

#Although the printed output looks similar, each entry is now a list, so the
#dictionary can be built correctly

0                                           [This, issue, is, stale, because, it, has, been, open, 30, days, with, no, activity, Remove, stale, label, or, comment, or, this, will, be, closed, in, 7, days]
1    [Also, after, updating, the, z2m, cyclic, reboots, began, Starting, Zigbee2MQTT, without, watchdog, INFO, Preparing, to, start, INFO, Socat, not, enabled, INFO, Starting, Zigbee2MQTT, Starting, Zi...
2    [Hi, Since, 2, or, 3, days, MQTT, suddenly, fail, A, few, messages, in, the, log, many, auto, restart, and, works, again, Very, strange, In, the, log, INFO, Preparing, to, start, ERROR, Got, unexp...
3    [I, don't, know, if, it's, exactly, the, same, but, since, v1.42, I, have, trouble, with, Z2M, It, restarts, x, times, a, day, without, further, notice, I, think, it, is, a, software, issue, becau...
Name: tokens, dtype: object


In [5]:
#Build the vocabulary (the set of unique tokens) from the existing tokens column
vocabulary = set([word for tokens in df['tokens'] for word in tokens])

In [6]:
#Enumerate the words (tokens)
word_to_index = {word: i for i, word in enumerate(vocabulary)}

In [7]:
#Print the generated dictionary with its indices
# for word, i in word_to_index.items():
#     print(f"'{word}': {i}")
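
As a small illustration of the enumeration step, the sketch below builds a vocabulary from toy token lists (hypothetical data, not the comments dataset). It sorts the unique tokens before numbering them, which is an assumption added here so that each word receives the same index on every run; iterating a plain `set` of strings gives no such guarantee across sessions.

```python
# Toy token lists standing in for df['tokens'] (hypothetical data)
toy_tokens = [
    ["zigbee", "restarts", "after", "update"],
    ["update", "breaks", "zigbee", "pairing"],
]

# Collect the unique tokens and sort them so the numbering is reproducible
toy_vocabulary = sorted({word for tokens in toy_tokens for word in tokens})

# Map each token to its position in the sorted vocabulary
toy_word_to_index = {word: i for i, word in enumerate(toy_vocabulary)}

# Encode each document as the list of its token indices
toy_token_index = [[toy_word_to_index[t] for t in tokens] for tokens in toy_tokens]

print(toy_word_to_index)
print(toy_token_index)  # [[5, 3, 0, 4], [4, 1, 5, 2]]
```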

The dictionary contains a total of 12617 entries. Next, a new column is added to the Data Frame holding the index of every token that appears in the corresponding text entry (comment).

In [8]:
#Add the token indices to the 'token_index' column of the DF
df['token_index'] = df['tokens'].progress_apply(lambda tokens: [word_to_index.get(token, -1) for token in tokens])

  0%|          | 0/2678 [00:00<?, ?it/s]

100%|██████████| 2678/2678 [00:00<00:00, 41008.63it/s]


In [9]:
#print(df[['normalized_text', 'tokens', 'token_index']].head(4))

### Vectorizing documents
Vectors can only be compared if they all have the same dimensions, which is why the same dictionary is used for every document.

If a text does not contain a word, its position holds a 0; otherwise it holds a 1. The length of each vector therefore equals the length of the generated dictionary.

Next, we define a function that encodes each text as a vector:

In [10]:
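def onehot_encode(text):
    #Returns one 0/1 entry per vocabulary word. With a string argument,
    #`w in text` is a substring test; with a token list it is an exact match.
    return [1 if w in text else 0 for w in vocabulary]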
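
Python's `in` operator tests for a substring when the right-hand side is a string and for an exact element when it is a list or set. The sketch below uses a hypothetical helper, `onehot_encode_tokens`, and a toy vocabulary to show how the two differ; it is an illustrative variant, not the function used in the cells that follow.

```python
def onehot_encode_tokens(tokens, vocab):
    """Return a 0/1 vector marking which vocabulary entries occur in `tokens`."""
    present = set(tokens)  # set lookup gives exact, whole-token matching
    return [1 if w in present else 0 for w in vocab]

# Toy vocabulary and document (hypothetical data)
toy_vocab = ["a", "day", "days", "stale"]
toy_text = "stale days"

# Exact token matching: only 'days' and 'stale' are present
token_vector = onehot_encode_tokens(toy_text.split(), toy_vocab)
print(token_vector)  # [0, 0, 1, 1]

# Substring matching on the raw string: 'a' and 'day' also match
substring_vector = [1 if w in toy_text else 0 for w in toy_vocab]
print(substring_vector)  # [1, 1, 1, 1]
```

Which behaviour is wanted depends on the task; exact token matching is the one that corresponds to "the text contains this word".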

In [11]:
#Generate the one-hot vectors
onehot_vectors = [onehot_encode(text) for text in df['normalized_text']]

#Check that every entry has been encoded
print(f"Total vectors generated: {len(onehot_vectors)}")
print(f"Total Data Frame entries: {len(df)}")

Total vectors generated: 2678
Total Data Frame entries: 2678


In [12]:
for text, vector in zip(df['normalized_text'].head(2), onehot_vectors):
    print("One-hot vector: ")
    print(vector)
    print(" - Normalized text: ")
    print(text)
    print("-" * 50)

One-hot vector: 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

### Document-term matrix
In the document-term matrix, the columns hold all the terms of the vocabulary, each row corresponds to one text document, and each cell holds a 1 or a 0 depending on whether the term appears in that document.

This matrix discards how often a word appears, but it is the most basic construction and serves as the starting point for almost every Machine Learning task.

In [13]:
pd.DataFrame(onehot_vectors, columns=list(vocabulary))

Unnamed: 0,20230302.0,night,entity.py,usb-VMware_VMware_Virtual_USB_Mouse-event-mouse,cpu,June,your_number,loose,queue.ts:35:20,withouthe,med,supervisor-2021.06.6,'position_template,'ZHA,place,...,seemingly,combination,2024-01-05T13:04:18.267Z,:36,identical,coordinator_backup.json,respect,ID_SERIAL_SHORT,2023-01-04T17:03:19.360Z,0x1380,index.js:173:13,YOU,database,mqtt-user,ieee_address
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2673,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2674,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2675,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2676,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


This matrix is very helpful for calculating similarities between the documents under analysis: the similarity of two entries is the number of positions in which both have a 1.

The following shows how to calculate the similarity between the first two entries of the Data Frame used in this study.

In [14]:
simil = [onehot_vectors[0][i] & onehot_vectors[1][i] for i in range(0, len(vocabulary))]
sum(simil)

45

The similarity between all entries of the Data Frame can be calculated as follows:

In [15]:
#import numpy as np

np.dot(onehot_vectors, np.transpose(onehot_vectors))

array([[ 74,  45,  41, ...,  23,  27,  35],
       [ 45, 300,  96, ...,  25,  35,  48],
       [ 41,  96, 197, ...,  25,  36,  52],
       ...,
       [ 23,  25,  25, ...,  40,  16,  22],
       [ 27,  35,  36, ...,  16,  55,  32],
       [ 35,  48,  52, ...,  22,  32,  86]])

The output is a matrix with one row per Data Frame entry.

Each position in a row is the similarity of that entry to another entry. The highest value in each row lies on the diagonal, because that position compares the entry with itself.
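
To make the role of the diagonal concrete, here is a minimal sketch on a toy one-hot matrix (hypothetical data). For 0/1 vectors the dot product of a row with itself is simply the number of ones in that row, and no other document can share more ones with it than it has, so each diagonal value is the maximum of its row (ties are possible).

```python
import numpy as np

# Toy document-term matrix: 3 documents, 4 vocabulary terms (hypothetical data)
toy_onehot = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

# Pairwise overlap counts, same operation as np.dot(X, np.transpose(X))
toy_similarity = toy_onehot @ toy_onehot.T
print(toy_similarity)
# [[2 2 0]
#  [2 3 1]
#  [0 1 2]]

# The diagonal equals the number of ones in each document
print(np.diag(toy_similarity), toy_onehot.sum(axis=1))  # [2 3 2] [2 3 2]
```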

## Vectorization with *scikit-learn*
Alternatively, documents can be vectorized with the scikit-learn library, which also makes it possible to count how often a term appears by using the *bag-of-words* representation.

### Implementing the vectorizer

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

### Vocabulary
Next, the model must learn the vocabulary of all the Data Frame entries to be analyzed.

In [None]:
cv.fit(df['normalized_text'])

#Print the vocabulary
print(cv.vocabulary_)
print(len(cv.vocabulary_))

8443


### Transforming text into vectors

In [27]:
vecs = cv.transform(df['normalized_text'])
vecs
print(vecs)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 93810 stored elements and shape (2678, 8443)>
  Coords	Values
  (0, 1034)	1
  (0, 2033)	1
  (0, 2465)	1
  (0, 2469)	1
  (0, 2476)	1
  (0, 2863)	1
  (0, 2930)	1
  (0, 3260)	2
  (0, 4376)	1
  (0, 4650)	1
  (0, 4814)	1
  (0, 4834)	1
  (0, 4841)	1
  (0, 4961)	1
  (0, 5586)	1
  (0, 5741)	1
  (0, 5763)	2
  (0, 6490)	1
  (0, 7155)	2
  (0, 7524)	2
  (0, 8159)	1
  (0, 8178)	1
  (1, 463)	1
  (1, 465)	2
  (1, 500)	1
  :	:
  (2676, 4841)	1
  (2676, 5334)	1
  (2676, 7332)	1
  (2676, 7333)	1
  (2676, 7341)	1
  (2676, 7403)	1
  (2676, 8315)	2
  (2677, 2646)	1
  (2677, 2793)	1
  (2677, 2991)	3
  (2677, 3300)	1
  (2677, 3458)	1
  (2677, 3806)	1
  (2677, 4389)	1
  (2677, 4650)	1
  (2677, 4701)	1
  (2677, 4814)	1
  (2677, 5707)	1
  (2677, 6076)	1
  (2677, 6581)	1
  (2677, 7048)	1
  (2677, 7524)	1
  (2677, 7526)	1
  (2677, 8134)	1
  (2677, 8159)	1


This function transforms all text entries into a sparse matrix (one that stores only the non-zero positions, here the term counts) in which each row represents a Data Frame entry and each column a term of the vocabulary.
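
As a rough illustration of what "sparse" means here, the sketch below converts a small dense array (hypothetical data) into SciPy's compressed sparse row (CSR) format, the same format reported in the output above, and inspects what is actually stored.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small dense count matrix with mostly zeros (hypothetical data)
toy_dense = np.array([
    [0, 2, 0, 0],
    [0, 0, 0, 1],
    [3, 0, 0, 0],
])
toy_sparse = csr_matrix(toy_dense)

print(toy_sparse.nnz)      # number of stored (non-zero) elements: 3
print(toy_sparse.data)     # the stored values: [2 1 3]
print(toy_sparse.indices)  # column index of each stored value: [1 3 0]
print(toy_sparse.indptr)   # where each row starts in `data`: [0 1 2 3]

# Fraction of cells that are actually stored
toy_density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])
print(toy_density)         # 0.25
```

By the same arithmetic, the matrix above stores 93810 of its 2678 × 8443 cells, roughly 0.4 % of them.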

If needed, the sparse matrix can be converted into a dense one to make it easier to read:

In [None]:
#v_matrix = vecs.toarray()
v_matrix = pd.DataFrame(vecs.toarray(), columns=cv.get_feature_names_out())
print(v_matrix)

### Calculating similarities
Scikit-learn provides a function, `cosine_similarity`, that calculates the similarity between two entries (texts) or between all entries of the Data Frame. The following shows how to use it in both cases:

In [33]:
from sklearn.metrics.pairwise import cosine_similarity

#Similarity between two entries
cosine_similarity(vecs[0], vecs[1])

array([[0.02243308]])
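
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the direction of the count vectors rather than on how long the texts are. The sketch below, on two toy count vectors (hypothetical data), computes it by hand and compares the result with `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy count vectors, shaped as single-row matrices (hypothetical data)
vec_a = np.array([[1, 2, 0, 1]])
vec_b = np.array([[1, 0, 1, 1]])

# Manual computation: dot product divided by the product of the norms
dot_product = (vec_a @ vec_b.T).item()  # 1*1 + 2*0 + 0*1 + 1*1 = 2
manual = dot_product / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

# Same value from scikit-learn
library = cosine_similarity(vec_a, vec_b)[0, 0]

print(round(manual, 4), round(library, 4))  # 0.4714 0.4714
```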

In [34]:
#Similarity across the whole Data Frame
pd.DataFrame(cosine_similarity(vecs, vecs))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,...,2663,2664,2665,2666,2667,2668,2669,2670,2671,2672,2673,2674,2675,2676,2677
0,1.00,0.02,0.08,0.02,0.02,0.02,0.03,0.20,0.07,1.00,0.08,0.11,0.09,1.00,0.00,...,0.16,0.26,0.10,0.00,0.11,0.05,0.20,0.23,0.25,0.16,0.00,0.10,0.27,0.09,0.17
1,0.02,1.00,0.33,0.69,0.10,0.15,0.10,0.12,0.12,0.02,0.06,0.08,0.05,0.02,0.13,...,0.06,0.00,0.00,0.13,0.14,0.11,0.16,0.00,0.02,0.08,0.00,0.04,0.01,0.01,0.06
2,0.08,0.33,1.00,0.23,0.11,0.48,0.41,0.26,0.17,0.08,0.26,0.04,0.20,0.08,0.35,...,0.22,0.08,0.04,0.07,0.34,0.26,0.39,0.07,0.09,0.28,0.00,0.20,0.05,0.08,0.12
3,0.02,0.69,0.23,1.00,0.14,0.09,0.06,0.11,0.13,0.02,0.05,0.09,0.03,0.02,0.11,...,0.07,0.01,0.01,0.14,0.12,0.11,0.13,0.01,0.02,0.05,0.00,0.02,0.02,0.02,0.04
4,0.02,0.10,0.11,0.14,1.00,0.11,0.10,0.09,0.04,0.02,0.07,0.01,0.07,0.02,0.15,...,0.04,0.04,0.00,0.00,0.06,0.04,0.08,0.04,0.04,0.09,0.00,0.04,0.03,0.02,0.06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2673,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00
2674,0.10,0.04,0.20,0.02,0.04,0.21,0.13,0.13,0.04,0.10,0.09,0.18,0.15,0.10,0.10,...,0.06,0.10,0.00,0.00,0.22,0.06,0.23,0.00,0.06,0.14,0.00,1.00,0.12,0.14,0.00
2675,0.27,0.01,0.05,0.02,0.03,0.00,0.00,0.15,0.07,0.27,0.00,0.10,0.17,0.27,0.00,...,0.10,0.47,0.00,0.00,0.10,0.00,0.20,0.28,0.18,0.22,0.00,0.12,1.00,0.08,0.12
2676,0.09,0.01,0.08,0.02,0.02,0.02,0.08,0.14,0.00,0.09,0.11,0.24,0.20,0.09,0.04,...,0.00,0.00,0.14,0.00,0.04,0.08,0.02,0.00,0.14,0.06,0.00,0.14,0.08,1.00,0.15


As noted earlier, the diagonal values are *1.00* because each entry is compared with itself; the remaining positions hold the similarity to the other entries of the Data Frame.
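
Because the diagonal is always 1.00, it has to be masked out before looking for the most similar pair of distinct entries. The sketch below shows one possible way to do that on a toy similarity matrix (hypothetical values). In the table above, row 0 also shows values displayed as 1.00 against columns 9 and 13, which is the kind of off-diagonal match such a search would surface.

```python
import numpy as np

# Toy symmetric similarity matrix (hypothetical values)
toy_sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])

# Work on a copy and replace the self-similarities so they cannot win
masked = toy_sim.copy()
np.fill_diagonal(masked, -np.inf)

# Position of the largest remaining value, converted to (row, column)
best_i, best_j = np.unravel_index(np.argmax(masked), masked.shape)
print(best_i, best_j, toy_sim[best_i, best_j])  # 0 2 0.9
```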