In [1]:
## Packages
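library(tidyverse)  # also attaches ggplot2
# wordVectors and quanteda.corpora are not on CRAN; install them from GitHub:
# remotes::install_github("bmschmidt/wordVectors")
# remotes::install_github("quanteda/quanteda.corpora")
library(wordVectors)  # prep_word2vec(), train_word2vec(), read.vectors()
# install.packages("text2vec")
library(text2vec)
# install.packages("quanteda")
library(quanteda)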

── Attaching packages ───────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.0
✔ tidyr   1.1.0     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ──────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Package version: 3.0.0
Unicode version: 10.0
ICU version: 61.1

Parallel computing: 16 of 16 threads used.

See https://quanteda.io for tutorials and examples.



# GloVe Presentation
- Fernanda Rubio
- Roberto Pérez 
- Víctor Rivera

## Introduction

There are two main families of algorithms for learning word vectors:

- *Global matrix factorization* methods, such as LSA  ➡️  they make efficient use of corpus-wide statistical information

- *Local context window* methods, such as skip-gram  ➡️  they perform well on word analogy tasks

GloVe combines the advantages of both families of models.

## GloVe
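
For reference, here is a sketch of the weighted least-squares objective as it is usually stated for GloVe (Pennington, Socher and Manning, 2014). The notation is an assumption of this note and does not come from the presentation: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the main and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

$$
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The weighting function $f$ caps the influence of very frequent pairs. The cutoff $x_{\max}$ corresponds to the `x_max` argument passed to `GlobalVectors$new()` in the example further down; the paper reports using $\alpha = 3/4$.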

## Applied example

We load a sample of Wikipedia articles:

In [2]:
wiki_corp <- quanteda.corpora::download(
    url = "https://www.dropbox.com/s/9mubqwpgls3qi9t/data_corpus_wiki.rds?dl=1"
)

Building the vocabulary from which the word vectors will be learned:

1: Tokenize the corpus:

In [5]:
wiki_toks <- tokens(wiki_corp)
wiki_toks

Tokens consisting of 1 document and 1 docvar.
text1 :
 [1] "anarchism"  "originated" "as"         "a"          "term"      
 [6] "of"         "abuse"      "first"      "used"       "against"   
[11] "early"      "working"   
[ ... and 17,005,195 more ]


2: Extract the features that occur 5 or more times:

In [7]:
feats <- dfm(wiki_toks, verbose = TRUE) %>%
    dfm_trim(min_termfreq = 5) %>%
    featnames()

wiki_toks <- tokens_select(wiki_toks, feats, padding = TRUE)

Creating a dfm from a tokens input...

 ...lowercasing

 ...found 1 document, 253,854 features

 ...complete, elapsed time: 1.3 seconds.

Finished constructing a 1 x 253,854 sparse dfm.
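
To see what the trimming step does to the token stream, here is a minimal base-R sketch on a toy sentence. The sentence, the threshold of 2, and the variable names are hypothetical. It mimics the idea behind `tokens_select(..., padding = TRUE)`: rare tokens are replaced by an empty string rather than dropped, so the distances between the remaining tokens stay the same.

```r
# Toy token sequence (hypothetical, not taken from the Wikipedia corpus)
toks <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")

# Count how often each token occurs and keep those seen at least twice
freq <- table(toks)
keep <- names(freq)[freq >= 2]

# Replace rare tokens with an empty pad instead of deleting them,
# so that positions (and therefore window distances) are preserved
padded <- ifelse(toks %in% keep, toks, "")
padded
```

Without the padding, "the" and "cat" would look adjacent wherever the tokens between them were removed, which would distort the window-based counts computed in the next step.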



Building the feature co-occurrence matrix:

In [8]:
wiki_fcm <- fcm(wiki_toks, context = "window", count = "weighted", weights = 1 / (1:5), tri = TRUE)
wiki_fcm

Feature co-occurrence matrix of: 71,290 by 71,290 features.
            features
features     anarchism originated          as           a        term
  anarchism       29.3          1    28.23333    30.95000    2.166667
  originated       0            0    65.68333    47.78333   18.116667
  as               0            0 10508.90000 27202.56667  316.183333
  a                0            0     0       18019.26667 1145.700000
  term             0            0     0           0         60.733333
  of               0            0     0           0          0       
  abuse            0            0     0           0          0       
  first            0            0     0           0          0       
  used             0            0     0           0          0       
  against          0            0     0           0          0       
            features
features              of      abuse       first        used     against
  anarchism     73.56667   0           0.500000    0.250
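
The `weights = 1 / (1:5)` argument means that a pair of tokens at distance `d` contributes `1/d` to the count. The following base-R sketch illustrates that weighting on a toy sequence and stores each pair in the upper triangle, in the spirit of `tri = TRUE`. The token sequence and variable names are hypothetical, and the sketch is not claimed to reproduce every detail of `fcm()`.

```r
# Toy token sequence (hypothetical)
toks <- c("a", "b", "a", "c")
window <- 5
w <- 1 / seq_len(window)          # distance weights: 1, 1/2, ..., 1/5

vocab <- unique(toks)
idx <- match(toks, vocab)         # integer id of every token
cooc <- matrix(0, length(vocab), length(vocab),
               dimnames = list(vocab, vocab))

# For every pair of positions i < j inside the window, add the weight
# for their distance to the upper-triangle cell of the two features
n <- length(toks)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):min(n, i + window)) {
    r <- min(idx[i], idx[j])
    k <- max(idx[i], idx[j])
    cooc[r, k] <- cooc[r, k] + w[j - i]
  }
}
cooc
```

Here "a" and "b" are adjacent twice, so their cell receives 1 + 1, while the two occurrences of "a" are two positions apart and contribute 1/2 to the diagonal.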

### GloVe model

Training the model:

In [9]:
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(wiki_fcm, n_iter = 10,
                               convergence_tol = 0.01, n_threads = 8)

INFO  [19:37:05.057] epoch 1, loss 0.1618 
INFO  [19:37:24.649] epoch 2, loss 0.1233 
INFO  [19:37:44.160] epoch 3, loss 0.1073 
INFO  [19:38:03.865] epoch 4, loss 0.0992 
INFO  [19:38:23.756] epoch 5, loss 0.0942 
INFO  [19:38:43.684] epoch 6, loss 0.0908 
INFO  [19:39:07.081] epoch 7, loss 0.0882 
INFO  [19:39:42.393] epoch 8, loss 0.0861 
INFO  [19:40:15.271] epoch 9, loss 0.0844 
INFO  [19:40:43.506] epoch 10, loss 0.0831 
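
A quick way to read the log above is to compute the relative improvement in loss from one epoch to the next and compare it with `convergence_tol = 0.01`. The sketch below copies the logged values by hand. The exact stopping rule used inside text2vec is an assumption here, so treat this as a sanity check rather than a description of the library's internals.

```r
# Loss values copied from the training log above
loss <- c(0.1618, 0.1233, 0.1073, 0.0992, 0.0942,
          0.0908, 0.0882, 0.0861, 0.0844, 0.0831)

# Relative improvement between consecutive epochs
rel_improvement <- -diff(loss) / head(loss, -1)
round(rel_improvement, 4)

# TRUE if every epoch still improved the loss by more than 1%
all(rel_improvement > 0.01)
```

Every epoch still improves the loss by more than 1%, which is consistent with training running for all 10 iterations instead of stopping early.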


Add the main and context word vectors, which tends to improve accuracy:

In [10]:
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)
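
Once the vectors are combined, a common way to inspect them is to look up the nearest neighbours of a word by cosine similarity. The helper below is a minimal base-R sketch. The function name `cosine_neighbors` and the toy matrix are hypothetical; the only assumption about the real data is that `word_vectors` carries the feature names as row names.

```r
# Return the n rows most similar to `word` by cosine similarity
cosine_neighbors <- function(vectors, word, n = 3) {
  # L2-normalise every row so that a dot product equals the cosine
  unit <- vectors / sqrt(rowSums(vectors^2))
  sims <- drop(unit %*% unit[word, ])
  head(sort(sims, decreasing = TRUE), n)
}

# Toy 3-dimensional "embeddings" (hypothetical values)
toy <- rbind(
  king  = c(1.0, 0.9, 0.1),
  queen = c(0.9, 1.0, 0.2),
  apple = c(0.1, 0.2, 1.0)
)

res <- cosine_neighbors(toy, "king", n = 2)
res
```

On the notebook's objects the call would look like `cosine_neighbors(word_vectors, "anarchism", n = 10)`. The query word itself always comes first with similarity 1. text2vec also provides `sim2()` for cosine similarity on larger matrices.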

### Training word2vec

In [11]:
normalizar <- function(texto, vocab = NULL){
  # lowercase
  texto <- tolower(texto)
  # miscellaneous clean-up: collapse whitespace, replace punctuation with placeholder tokens
  texto <- gsub("\\s+", " ", texto)
  texto <- gsub("\\.[^0-9]", " _punto_ ", texto)
  texto <- gsub(" _s_ $", "", texto)
  texto <- gsub("\\.", " _punto_ ", texto)
  texto <- gsub("[«»¡!¿?-]", "", texto) 
  texto <- gsub(";", " _punto_coma_ ", texto) 
  texto <- gsub("\\:", " _dos_puntos_ ", texto) 
  texto <- gsub("\\,[^0-9]", " _coma_ ", texto)
  texto <- gsub("\\s+", " ", texto)
  texto
}
# as.character() extracts the texts from the quanteda corpus; passing the corpus
# object itself to normalizar() fails in tolower() with "non-character argument"
wiki_df <- tibble(txt = as.character(wiki_corp)) %>%
                mutate(id = row_number()) %>%
                mutate(txt = normalizar(txt))

# make sure the output directory exists before writing to it
dir.create("./salidas", showWarnings = FALSE)

if (!file.exists("./salidas/wiki_w2v.txt")) {
  tmp <- tempfile()
  # write the normalized text to a temporary file for prep_word2vec()
  write_lines(wiki_df$txt, tmp)
  prep <- wordVectors::prep_word2vec(tmp,
          destination = "./salidas/wiki_w2v.txt", bundle_ngrams = 2)
}

if (!file.exists("./salidas/wiki_vectors.bin")) {
  model_w2v <- wordVectors::train_word2vec("./salidas/wiki_w2v.txt",
          "./salidas/wiki_vectors.bin",
          vectors = 100, threads = 4, window = 5, cbow = 0,
          iter = 5, negative_samples = 20, min_count = 5)
} else {
  model_w2v <- wordVectors::read.vectors("./salidas/wiki_vectors.bin")
}


## End