# Feature Engineering and Syntactic Similarity

This section focuses on how to *vectorize* documents, that is, how to convert unstructured text into numeric vectors.

Because vectorization underlies almost every *Machine Learning* task, this chapter works with two models provided by the *scikit-learn* library and also builds a custom vectorizer, which is useful for future projects because it can be adapted to whatever tasks the user needs at any given time.

In [89]:
import sys, os

#Load the setup.py file
%run -i ../pyenv_settings/setup.py

#Imports and plot settings
%run "$BASE_DIR/pyenv_settings/settings.py"

#Reset the virtual environment at the start of the run
#%reset -f

%reload_ext autoreload
%autoreload 0
%config InlineBackend.figure_format = 'png'

# # to print output of all statements and not just the last
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"

# # otherwise text between $ signs will be interpreted as formula and printed in italic
# pd.set_option('display.html.use_mathjax', False)

You are working on a local system.
Files will be searched relative to "..".


Once the settings and preferences of the virtual environment are loaded, we can build our first vectorizer.

## Building a Vectorizer
A vectorizer will be built and applied to the Data Frame used in the previous chapters. The first step is to load the dataset stored in the database, whose content (the users' comments) is already normalized and tokenized.

In [90]:
#Connect to the database where the Data Frame is stored
db_name = "../data/zigbee2mqtt_comments.db"
con = sqlite3.connect(db_name)
df = pd.read_sql("select * from posts_nlp", con)
con.close()

#Check that it has been loaded correctly
print(df.columns)
print(df[['normalized_text', 'tokens']].head(4))

Index(['id', 'user', 'text', 'impurity', 'clean_text', 'normalized_text',
       'tokens', 'lemmas', 'adjs_verbs', 'nouns', 'noun_phrases',
       'adj_noun_phrases', 'entities'],
      dtype='object')
                                                                                                                                                                                           normalized_text  \
0                                                                    This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days   
1  Also, after updating the z2m, cyclic reboots began ''' Starting Zigbee2MQTT without watchdog. INFO: Preparing to start... INFO: Socat not enabled INFO: Starting Zigbee2MQTT... Starting Zigbee2MQTT...   
2  Hi ! Since 2 or 3 days, MQTT suddenly fail. A few messages in the log, many auto restart, and works again ... Very strange. In the log INFO: Preparing to start... ERROR: Got une

### Enumerating the vocabulary
First, every word (token) in the normalized comments of the Data Frame is enumerated, so that a specific word can be referred to by its associated number (its index in the generated dictionary) when building the vectors.

*First of all*, remember that the tokens are stored in the Data Frame as a single comma-separated string, not as a list of words, so each entry of the "tokens" column must be converted into a list in which every word is a separate element.

In [91]:
#Make sure the 'tokens' column contains lists rather than strings
df['tokens'] = df['tokens'].apply(lambda x: x.split(','))

#Check what the 'tokens' column looks like now
print(df['tokens'].head(4))

#Although the output looks unchanged, each entry is now a list, so the
#dictionary can be built correctly

0                                           [This, issue, is, stale, because, it, has, been, open, 30, days, with, no, activity, Remove, stale, label, or, comment, or, this, will, be, closed, in, 7, days]
1    [Also, after, updating, the, z2m, cyclic, reboots, began, Starting, Zigbee2MQTT, without, watchdog, INFO, Preparing, to, start, INFO, Socat, not, enabled, INFO, Starting, Zigbee2MQTT, Starting, Zi...
2    [Hi, Since, 2, or, 3, days, MQTT, suddenly, fail, A, few, messages, in, the, log, many, auto, restart, and, works, again, Very, strange, In, the, log, INFO, Preparing, to, start, ERROR, Got, unexp...
3    [I, don't, know, if, it's, exactly, the, same, but, since, v1.42, I, have, trouble, with, Z2M, It, restarts, x, times, a, day, without, further, notice, I, think, it, is, a, software, issue, becau...
Name: tokens, dtype: object


In [92]:
#Build the vocabulary from the existing 'tokens' column
vocabulary = set([word for tokens in df['tokens'] for word in tokens])

In [93]:
#Enumerate the words (tokens)
word_to_index = {word: i for i, word in enumerate(vocabulary)}

In [94]:
#Print the generated dictionary with its indices
# for word, i in word_to_index.items():
#     print(f"'{word}': {i}")

The dictionary has a total of 12,617 entries. A new column is now added to the Data Frame holding, for each text entry (comment), the index of every token it contains.

In [95]:
#Add the token indices to the 'token_index' column of the DF
df['token_index'] = df['tokens'].progress_apply(lambda tokens: [word_to_index.get(token, -1) for token in tokens])

  0%|          | 0/2678 [00:00<?, ?it/s]

100%|██████████| 2678/2678 [00:00<00:00, 41310.73it/s]
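
The enumeration and index lookup above can be sketched on a toy corpus. The token lists below are hypothetical stand-ins for `df['tokens']`, and sorting the vocabulary is an assumption added here (the notebook enumerates a plain set) so that the indices stay the same from one run to the next.

```python
# Toy token lists standing in for df['tokens'] (hypothetical data).
docs_tokens = [
    ["stale", "issue", "open", "days"],
    ["issue", "closed", "days"],
]

# Sorting makes the word -> index mapping reproducible between runs;
# iterating over a plain set of strings gives no such guarantee.
vocabulary = sorted({word for tokens in docs_tokens for word in tokens})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Map every token to its index; -1 would mark a token missing from the vocabulary.
token_index = [[word_to_index.get(t, -1) for t in tokens] for tokens in docs_tokens]

print(word_to_index)
print(token_index)
```

With a fixed ordering, an index stored in the Data Frame keeps pointing at the same word even if the notebook is restarted.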


### Vectorizing documents
Vectors can only be compared if they all have the same dimensions, which is why the same dictionary is used for every document.

If a text does not contain a word, its position is set to 0; otherwise it is set to 1. The length of every vector therefore equals the length of the generated dictionary.

A function that encodes every text as a vector is defined next:

In [96]:
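def onehot_encode(text):
    #Caution: if 'text' is a string, 'w in text' matches substrings ("up" in
    #"updating"); pass a list or set of tokens for exact token matching
    return [1 if w in text else 0 for w in vocabulary]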

In [97]:
#Generate the one-hot vectors
onehot_vectors = [onehot_encode(text) for text in df['normalized_text']]

#Check that every entry has been encoded
print(f"Total de vectores generados: {len(onehot_vectors)}")
print(f"Total de entradas del Data Frame: {len(df)}")

Total de vectores generados: 2678
Total de entradas del Data Frame: 2678
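
Because the encoder receives the raw `normalized_text` string, the `in` operator performs a substring search. The sketch below, on hypothetical toy data, contrasts that behaviour with a membership test on the set of tokens; it illustrates the difference and is not the code that produced the outputs in this notebook.

```python
# Hypothetical toy data: a normalized text and its token list.
text = "after updating the firmware it restarts"
tokens = text.split()

vocabulary = ["after", "firm", "firmware", "start", "up"]

# Substring test on the raw string: "firm", "start" and "up" are reported as
# present because they occur inside "firmware", "restarts" and "updating".
by_substring = [1 if w in text else 0 for w in vocabulary]

# Membership test on the set of tokens: only whole tokens count.
token_set = set(tokens)
by_token = [1 if w in token_set else 0 for w in vocabulary]

print(by_substring)
print(by_token)
```

Encoding from the token lists (for example `df['tokens']`) would therefore give sparser vectors than encoding from the raw strings.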


In [None]:
for text, vector in zip(df['normalized_text'].head(2), onehot_vectors):
    print("One-hot vector: ")
    print(vector)
    print(" - Normalized text: ")
    print(text)
    print("-" * 50)

### Document-term matrix
In the document-term matrix, every term of the vocabulary is a column and every row is a text document; a 1 or a 0 indicates whether or not the term appears in the document.

This matrix does not record how often a word appears, but it is the most basic construction and is used in almost every Machine Learning task.

In [98]:
pd.DataFrame(onehot_vectors, columns=list(vocabulary))

Unnamed: 0,energy,identification,curl,managed,16:32:43MQTT,nowadays,2021-04-14T12:11:34.856Z,color_temp,Products,Simply,relatively,added,searches,Hive,udp4,...,setting,depending,capture,Reinstalling,23:42:41,bought,hybrid,CN,Nortek,chksum:b8d5,wwn-0x5000c29a2c392bbc-part7,up,sidenote,allowed.includes,Object.Module._extensions..js
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2673,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2674,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2675,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2676,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


This matrix is very helpful for computing similarities between the documents under analysis: the similarity of two entries is the number of positions in which both have a 1.

The similarity between the first two entries of the Data Frame used throughout this study is computed as follows.

In [99]:
simil = [onehot_vectors[0][i] & onehot_vectors[1][i] for i in range(0, len(vocabulary))]
sum(simil)

45

The similarity between all the entries of the Data Frame can be computed as follows:

In [100]:
#import numpy as np

np.dot(onehot_vectors, np.transpose(onehot_vectors))

array([[ 74,  45,  41, ...,  23,  27,  35],
       [ 45, 300,  96, ...,  25,  35,  48],
       [ 41,  96, 197, ...,  25,  36,  52],
       ...,
       [ 23,  25,  25, ...,  40,  16,  22],
       [ 27,  35,  36, ...,  16,  55,  32],
       [ 35,  48,  52, ...,  22,  32,  86]])

The output is an array of arrays, one row for each entry of the Data Frame.

Each position in a row is the similarity between that entry and another one. The highest value of each row lies on the diagonal, because that is where an entry is compared with itself.
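
As a minimal sketch of why the dot product counts shared terms, the example below builds the same kind of matrix in pure Python. The 3x5 binary matrix and the helper `shared_terms` are hypothetical, chosen only for illustration.

```python
# Hypothetical binary document-term matrix: 3 documents, 5 vocabulary terms.
onehot = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
]

def shared_terms(u, v):
    """Dot product of two binary vectors = number of terms present in both."""
    return sum(a & b for a, b in zip(u, v))

# Pairwise similarity matrix, the pure-Python analogue of
# np.dot(onehot, np.transpose(onehot)).
similarity = [[shared_terms(u, v) for v in onehot] for u in onehot]

for row in similarity:
    print(row)
```

The matrix is symmetric, and each diagonal value is simply the number of distinct vocabulary terms found in that document.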

## Vectorization with *scikit-learn*
As an alternative, documents can be vectorized with the scikit-learn library, which also makes it possible to count how often each term appears by using the *bag-of-words* representation.

### Implementing the vectorizer

In [101]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

### Vocabulary
Next, the model must learn the vocabulary of all the entries of the Data Frame to be analyzed.

In [None]:
cv.fit(df['normalized_text'])

#Print the vocabulary
print(cv.vocabulary_)
print(len(cv.vocabulary_))

### Transforming text into vectors

In [103]:
vecs = cv.transform(df['normalized_text'])
vecs
print(vecs)

  (0, 1034)	1
  (0, 2033)	1
  (0, 2465)	1
  (0, 2469)	1
  (0, 2476)	1
  (0, 2863)	1
  (0, 2930)	1
  (0, 3260)	2
  (0, 4376)	1
  (0, 4650)	1
  (0, 4814)	1
  (0, 4834)	1
  (0, 4841)	1
  (0, 4961)	1
  (0, 5586)	1
  (0, 5741)	1
  (0, 5763)	2
  (0, 6490)	1
  (0, 7155)	2
  (0, 7524)	2
  (0, 8159)	1
  (0, 8178)	1
  (1, 463)	1
  (1, 465)	2
  (1, 500)	1
  :	:
  (2676, 4841)	1
  (2676, 5334)	1
  (2676, 7332)	1
  (2676, 7333)	1
  (2676, 7341)	1
  (2676, 7403)	1
  (2676, 8315)	2
  (2677, 2646)	1
  (2677, 2793)	1
  (2677, 2991)	3
  (2677, 3300)	1
  (2677, 3458)	1
  (2677, 3806)	1
  (2677, 4389)	1
  (2677, 4650)	1
  (2677, 4701)	1
  (2677, 4814)	1
  (2677, 5707)	1
  (2677, 6076)	1
  (2677, 6581)	1
  (2677, 7048)	1
  (2677, 7524)	1
  (2677, 7526)	1
  (2677, 8134)	1
  (2677, 8159)	1


This function transforms all the text entries into a sparse matrix (only the non-zero positions are stored, each holding the number of times the term occurs) in which each row represents an entry of the Data Frame and each column a term of the vocabulary.
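
The idea behind the sparse layout can be sketched with a plain dictionary keyed by `(row, column)`. This is only an illustration on a hypothetical two-document corpus; it is not how scikit-learn or SciPy store the matrix internally.

```python
from collections import Counter

# Hypothetical toy corpus and a fixed, sorted vocabulary.
docs = ["stale issue stale label", "issue closed"]
vocabulary = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocabulary)}

# Keep only non-zero counts, keyed by (row, column), which mirrors the
# "(row, column)  count" pairs printed for the sparse matrix.
sparse = {}
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        sparse[(i, col[word])] = count

for (i, j), count in sorted(sparse.items()):
    print(f"({i}, {j})\t{count}")
```

Only 5 of the 8 cells of this 2x4 matrix need to be stored; on a real vocabulary with thousands of columns the saving is far larger.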

If needed, it can be converted into a dense matrix to make it easier to read:

In [104]:
#v_matrix = vecs.toarray()
v_matrix = pd.DataFrame(vecs.toarray(), columns=cv.get_feature_names_out())
print(v_matrix)

      00  0000  000000  00000000000000000001  00000001  00000004  \
0      0     0       0                     0         0         0   
1      0     0       0                     0         0         0   
2      0     0       0                     0         0         0   
3      0     0       0                     0         0         0   
4      6     0       0                     0         0         0   
...   ..   ...     ...                   ...       ...       ...   
2673   0     0       0                     0         0         0   
2674   0     0       0                     0         0         0   
2675   0     0       0                     0         0         0   
2676   0     0       0                     0         0         0   
2677   0     0       0                     0         0         0   

      0000_02_00_0  0000_02_03_0  0000_02_05_0  0001  0002  0003  001  002  \
0                0             0             0     0     0     0    0    0   
1                0         

### Computing similarities
Scikit-learn provides a function, `cosine_similarity`, that computes the similarity between two entries (texts) or across the whole Data Frame. Both cases are shown below:

In [105]:
from sklearn.metrics.pairwise import cosine_similarity

#Similarity between two entries
cosine_similarity(vecs[0], vecs[1])

array([[0.02243308]])

In [106]:
#Similarity across the whole Data Frame
pd.DataFrame(cosine_similarity(vecs, vecs))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,...,2663,2664,2665,2666,2667,2668,2669,2670,2671,2672,2673,2674,2675,2676,2677
0,1.00,0.02,0.08,0.02,0.02,0.02,0.03,0.20,0.07,1.00,0.08,0.11,0.09,1.00,0.00,...,0.16,0.26,0.10,0.00,0.11,0.05,0.20,0.23,0.25,0.16,0.00,0.10,0.27,0.09,0.17
1,0.02,1.00,0.33,0.69,0.10,0.15,0.10,0.12,0.12,0.02,0.06,0.08,0.05,0.02,0.13,...,0.06,0.00,0.00,0.13,0.14,0.11,0.16,0.00,0.02,0.08,0.00,0.04,0.01,0.01,0.06
2,0.08,0.33,1.00,0.23,0.11,0.48,0.41,0.26,0.17,0.08,0.26,0.04,0.20,0.08,0.35,...,0.22,0.08,0.04,0.07,0.34,0.26,0.39,0.07,0.09,0.28,0.00,0.20,0.05,0.08,0.12
3,0.02,0.69,0.23,1.00,0.14,0.09,0.06,0.11,0.13,0.02,0.05,0.09,0.03,0.02,0.11,...,0.07,0.01,0.01,0.14,0.12,0.11,0.13,0.01,0.02,0.05,0.00,0.02,0.02,0.02,0.04
4,0.02,0.10,0.11,0.14,1.00,0.11,0.10,0.09,0.04,0.02,0.07,0.01,0.07,0.02,0.15,...,0.04,0.04,0.00,0.00,0.06,0.04,0.08,0.04,0.04,0.09,0.00,0.04,0.03,0.02,0.06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2673,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00
2674,0.10,0.04,0.20,0.02,0.04,0.21,0.13,0.13,0.04,0.10,0.09,0.18,0.15,0.10,0.10,...,0.06,0.10,0.00,0.00,0.22,0.06,0.23,0.00,0.06,0.14,0.00,1.00,0.12,0.14,0.00
2675,0.27,0.01,0.05,0.02,0.03,0.00,0.00,0.15,0.07,0.27,0.00,0.10,0.17,0.27,0.00,...,0.10,0.47,0.00,0.00,0.10,0.00,0.20,0.28,0.18,0.22,0.00,0.12,1.00,0.08,0.12
2676,0.09,0.01,0.08,0.02,0.02,0.02,0.08,0.14,0.00,0.09,0.11,0.24,0.20,0.09,0.04,...,0.00,0.00,0.14,0.00,0.04,0.08,0.02,0.00,0.14,0.06,0.00,0.14,0.08,1.00,0.15


As noted earlier, the values on the diagonal are *1.00* because each entry is being compared with itself; the remaining positions hold its similarity with the other entries of the Data Frame.
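
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on hypothetical count vectors; the helper `cosine` and its handling of all-zero vectors are choices made for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over a 4-term vocabulary.
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]   # same proportions as doc_a, twice as long
doc_c = [0, 0, 3, 1]   # no terms in common with doc_a

print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_c))
```

Unlike the raw dot product used with the one-hot vectors, this measure does not grow with document length: `doc_a` and `doc_b` point in the same direction, so their similarity is (up to rounding) 1.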

## TF/IDF Model
When analyzing documents, terms that appear very frequently may carry little information, whereas terms that appear less often may carry a lot.

This is where TF/IDF models come in: they count the occurrences of each term across the documents and use them to reduce the weight of very frequent words and increase that of less common ones.

As seen earlier, a better way to measure information is to compute the inverse document frequency.

Here, the TF/IDF weights can be computed from the bag-of-words model built a moment ago.

In [107]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
tfidf_vecs = tfidf.fit_transform(vecs)

In [108]:
pd.DataFrame(tfidf_vecs.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,00,0000,000000,00000000000000000001,00000001,00000004,0000_02_00_0,0000_02_03_0,0000_02_05_0,0001,0002,0003,001,002,002z,...,по,проблема,проблему,просто,решил,сore,свою,снимок,такая,тоже,тот,удалив,умолчанию,установил,экрана
0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
2,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4,0.20,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2673,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
2674,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
2675,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
2676,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00


The weight of some words that had a high frequency has now decreased, while the weight of those that were not very frequent has not changed much.
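
A minimal sketch of how such weights arise, assuming scikit-learn's default settings for `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`): each count is multiplied by a smoothed inverse document frequency and every document vector is then scaled to unit length. The count matrix is hypothetical.

```python
import math

# Hypothetical raw counts: 3 documents x 3 terms ("issue", "stale", "zigbee").
counts = [
    [1, 2, 0],
    [1, 0, 0],
    [1, 0, 3],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are left as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

# Weight = count * idf, then scale each document vector to unit length.
tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

for row in tfidf:
    print([round(x, 2) for x in row])
```

The term that occurs in every document ("issue") gets the lowest possible idf of 1, so within a document it weighs less than a rarer term with a comparable count.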

Let us see the effect of this process on the similarity matrix:

In [109]:
pd.DataFrame(cosine_similarity(tfidf_vecs, tfidf_vecs))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,...,2663,2664,2665,2666,2667,2668,2669,2670,2671,2672,2673,2674,2675,2676,2677
0,1.00,0.01,0.03,0.01,0.01,0.01,0.01,0.09,0.02,1.00,0.04,0.04,0.03,1.00,0.00,...,0.06,0.09,0.03,0.00,0.05,0.01,0.10,0.05,0.11,0.06,0.00,0.05,0.10,0.03,0.06
1,0.01,1.00,0.20,0.58,0.08,0.04,0.02,0.05,0.02,0.01,0.01,0.02,0.01,0.01,0.03,...,0.02,0.00,0.00,0.03,0.03,0.02,0.04,0.00,0.01,0.02,0.00,0.01,0.00,0.00,0.05
2,0.03,0.20,1.00,0.11,0.03,0.11,0.11,0.06,0.02,0.03,0.05,0.00,0.03,0.03,0.06,...,0.04,0.01,0.01,0.01,0.07,0.05,0.09,0.01,0.03,0.05,0.00,0.04,0.01,0.05,0.05
3,0.01,0.58,0.11,1.00,0.13,0.03,0.01,0.03,0.02,0.01,0.02,0.02,0.00,0.01,0.03,...,0.02,0.00,0.00,0.03,0.02,0.02,0.03,0.00,0.00,0.01,0.00,0.01,0.00,0.01,0.03
4,0.01,0.08,0.03,0.13,1.00,0.04,0.03,0.04,0.01,0.01,0.02,0.00,0.01,0.01,0.06,...,0.01,0.01,0.00,0.00,0.01,0.01,0.02,0.00,0.01,0.02,0.00,0.01,0.00,0.02,0.03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2673,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00
2674,0.05,0.01,0.04,0.01,0.01,0.05,0.02,0.04,0.00,0.05,0.01,0.04,0.03,0.05,0.01,...,0.01,0.05,0.00,0.00,0.06,0.01,0.06,0.00,0.01,0.02,0.00,1.00,0.04,0.04,0.00
2675,0.10,0.00,0.01,0.00,0.00,0.00,0.00,0.03,0.01,0.10,0.00,0.01,0.03,0.10,0.00,...,0.02,0.18,0.00,0.00,0.02,0.00,0.11,0.04,0.03,0.05,0.00,0.04,1.00,0.02,0.02
2676,0.03,0.00,0.05,0.01,0.02,0.01,0.07,0.06,0.00,0.03,0.06,0.06,0.06,0.03,0.02,...,0.00,0.00,0.03,0.00,0.01,0.01,0.00,0.00,0.02,0.01,0.00,0.04,0.02,1.00,0.07


In many cases the similarity value has dropped, because the weight of the high-frequency words has been reduced.

### Restricting word types
The information density can be improved by restricting the kinds of words kept in the Data Frame. Some word types, such as prepositions or conjunctions, contribute little information, so they can be left out to reduce the vocabulary.

In [110]:
vecs

<2678x8443 sparse matrix of type '<class 'numpy.int64'>'
	with 93810 stored elements in Compressed Sparse Row format>

At this point the matrix holds 93,810 stored elements.
Let us see how many it holds after vectorizing and fitting the data with stop words as a filter (once again, the *normalized_text* column of the Data Frame already contained previously cleaned data).

In [111]:
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS as stopwords_set

#Convert the set of stop words into a list
stopwords = list(stopwords_set)

#Create the TF/IDF vectorizer with stop words
tfidf = TfidfVectorizer(stop_words=stopwords)

#fit_transform fits the vocabulary and transforms the texts,
#combining the two previous steps in a single call
vecs = tfidf.fit_transform(df['normalized_text'].map(str))

In [112]:
vecs

<2678x8194 sparse matrix of type '<class 'numpy.float64'>'
	with 59363 stored elements in Compressed Sparse Row format>

After the transformation, the number of stored elements has clearly dropped to 59,363, which considerably reduces the number of elements treated as relevant for the study.
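
The effect of a stop-word filter can be illustrated on a single sentence. The stop-word set below is a small hypothetical list written for this sketch; the notebook itself relies on spaCy's English list.

```python
# Hypothetical mini stop-word list; the notebook uses spaCy's English list.
stop_words = {"this", "is", "it", "has", "been", "with", "no", "or", "will", "be", "in"}

text = "this issue is stale because it has been open 30 days with no activity"
tokens = text.split()

# Keep only the tokens that are not stop words.
kept = [t for t in tokens if t not in stop_words]

print(len(tokens), "->", len(kept))
print(kept)
```

Half of the tokens disappear in this example, which is the same kind of reduction seen in the number of stored elements of the matrix.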

### Removing the most common elements
If a term has a very high frequency, its contribution is unlikely to be important; it may even be a word that was not included in the stop-word list.

This step is not considered necessary here, because the spaCy stop-word list has already reduced the number of elements considerably.

Even so, the code that would perform this task is written below in case it is needed in the future.

In [None]:
top_10000 = pd.read_csv("https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt", header=None)
#top_10000_list = list(top_10000)
# tfidf = TfidfVectorizer(stop_words=set(top_10000.iloc[:,0].values))
# vecs = tfidf.fit_transform(df['normalized_text'].map(str))
# vecs
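
An alternative to an external word list is to drop terms by their document frequency in the corpus itself; scikit-learn's vectorizers expose a `max_df` parameter for this purpose. The sketch below shows the idea in pure Python on hypothetical documents, with a 75% threshold chosen arbitrarily for the example.

```python
# Hypothetical tokenized documents.
docs = [
    ["zigbee2mqtt", "restarts", "after", "update"],
    ["zigbee2mqtt", "log", "shows", "error"],
    ["zigbee2mqtt", "update", "failed"],
    ["zigbee2mqtt", "works", "again"],
]

# Document frequency: number of documents containing each term.
doc_freq = {}
for tokens in docs:
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Drop terms present in more than 75% of the documents (threshold is an assumption).
max_ratio = 0.75
too_common = {t for t, n in doc_freq.items() if n / len(docs) > max_ratio}
filtered = [[t for t in tokens if t not in too_common] for tokens in docs]

print(too_common)
print(filtered)
```

A term that appears in every comment, such as the project name here, cannot help tell the comments apart, so removing it loses little information.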

### More context with N-Grams
Unlike the previous sections, which reduced the vocabulary, context is now added by including n-grams of, for example, two words (bigrams).

So far only single-word terms have been used; terms made up of more than one word, which are very common especially in English, are now taken into account as well.

In [114]:
#stopwords = list(set(top_10000[0].values))

tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words=stopwords, min_df=2)
vecs = tfidf.fit_transform(df['normalized_text'].map(str))
print(vecs.shape)
vecs

tfidf = TfidfVectorizer(ngram_range=(1,3), stop_words=stopwords, min_df=2)
vecs = tfidf.fit_transform(df['normalized_text'].map(str))
print(vecs.shape)
vecs

(2678, 13309)
(2678, 21762)


<2678x21762 sparse matrix of type '<class 'numpy.float64'>'
	with 132346 stored elements in Compressed Sparse Row format>

The matrix has grown rather than shrunk: it now has 21,762 columns and 132,346 stored elements, because the vocabulary includes n-grams of 2 or 3 words in addition to the individual words. With `min_df=2`, only terms that appear in at least two documents are kept.
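
To see why the vocabulary grows, the sketch below generates the n-grams of a short token list by hand. The helper `ngrams` is hypothetical and only mimics the idea behind `ngram_range`; it is not scikit-learn's implementation.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["remove", "stale", "label"]   # hypothetical, stop words already removed

print(ngrams(tokens, 1, 2))
print(ngrams(tokens, 1, 3))
```

Three tokens yield five terms with bigrams and six with trigrams, so every additional n-gram size adds columns on top of the single words.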

### Related words
Certain words often appear together in texts.

This section tries to find such words. Only frequent terms are considered, in this case those that appear in at least 300 documents of the Data Frame (`min_df=300`).

In [115]:
tfidf_word = TfidfVectorizer(stop_words=stopwords, min_df=300)
rel_words = tfidf_word.fit_transform(df['normalized_text'])

rel_words

<2678x17 sparse matrix of type '<class 'numpy.float64'>'
	with 6995 stored elements in Compressed Sparse Row format>

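As a result, 17 terms remain, with 6,995 stored elements, using the spaCy stop-word list defined earlier (the list of the 10,000 most frequent words from the previous section is commented out).

The cosine similarity between the columns (terms) of this matrix is computed next, which is why the matrix is transposed; the diagonal is set to 0 so that a term is not matched with itself.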

In [116]:
rel_freq = cosine_similarity(rel_words.T, rel_words.T)
np.fill_diagonal(rel_freq, 0)

In [117]:
#Print the related word pairs
voc = tfidf_word.get_feature_names_out()
size = rel_freq.shape[0] #rel_freq is a square matrix
for index in np.argsort(rel_freq.flatten())[::-1][0:40]:
    a = index // size #row of the flattened index
    b = index % size #column of the flattened index
    if a > b:  #the matrix is symmetric, so keep each pair only once
        print('"%s" related to "%s"' % (voc[a], voc[b]))

"label" related to "activity"
"stale" related to "label"
"stale" related to "activity"
"days" related to "activity"
"label" related to "days"
"comment" related to "activity"
"label" related to "comment"
"stale" related to "days"
"stale" related to "comment"
"days" related to "comment"
"closed" related to "activity"
"label" related to "closed"
"activity" related to "30"
"label" related to "30"
"stale" related to "closed"
"remove" related to "label"
"stale" related to "30"
"remove" related to "activity"
"days" related to "closed"
"days" related to "30"


Judging by the output, the result may not be very meaningful in this case, because the dataset is not very large (the top pairs all come from the automated stale-issue message), but in a future project with significantly more data this technique can be very helpful.
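
The same idea can be sketched without NumPy: compare the columns of the document-term matrix pairwise and sort the pairs by similarity. The terms and weights below are hypothetical, and `itertools.combinations` is used here so that each pair is visited once without the `a > b` check.

```python
import math
from itertools import combinations

# Hypothetical document-term weights: rows are documents, columns are terms.
terms = ["stale", "label", "error"]
matrix = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
]

def column(j):
    """Return column j of the matrix, i.e. the vector of one term over all documents."""
    return [row[j] for row in matrix]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Score every unordered pair of terms once, then sort from most to least related.
pairs = sorted(
    ((cosine(column(a), column(b)), terms[a], terms[b])
     for a, b in combinations(range(len(terms)), 2)),
    reverse=True,
)

for score, first, second in pairs:
    print(f'"{first}" related to "{second}" ({score:.2f})')
```

Terms that occur in the same documents end up with similar column vectors, which is why words from a repeated message dominate the top of the list.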

In [118]:
#Convert the related words to a Data Frame
related_df = pd.DataFrame(rel_words.toarray(), columns=tfidf_word.get_feature_names_out())

#Connect to the database and store the related words
con = sqlite3.connect(db_name)
related_df.to_sql("related_words", con, index=False, if_exists="replace")
con.close()