---
title: "Word Vectors Template"
author: "Jonathan Fitzgerald & Sarah Connell"
date: "6/3/2019"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Using this File
This file is set up so that once you're comfortable with the basics of `word2vec` (which you can learn from the `introduction_word2vec.Rmd` file) and have installed all the necessary packages, you can use it as a convenient template for working with your own data. Most of the instructional language is in the introductory file, so check there if you have questions; this file includes just the code you'll need to load in files, train models, and query them, with a few comments on the things that deserve particularly close attention.
## Check the Working Directory
```{r}
getwd()
```
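If the directory returned above isn't the folder that contains your `data` folder, you can set it manually. A minimal sketch, with a hypothetical path that you would replace with your own project location:
```{r}
# Hypothetical path; replace with the location of your own project folder
# setwd("~/Documents/word-vectors-project")
```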
## Load Packages
```{r}
library(tidyverse)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(wordVectors)
library(lsa)
```
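If any of these packages fail to load, they probably need to be installed first. A sketch of the one-time installation, assuming `wordVectors` is installed from Ben Schmidt's GitHub repository (run these once per machine, not every session):
```{r}
# One-time installation; not needed every session
# install.packages(c("tidyverse", "tidytext", "magrittr", "devtools", "tsne", "lsa"))
# devtools::install_github("bmschmidt/wordVectors")
```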
## Read-in and Combine Multiple Text Files
```{r}
# Change "name_of_your_folder" to match the name of the folder with your corpus
path2file <- "data/name_of_your_folder"
fileList <- list.files(path2file,full.names = TRUE)
readTextFiles <- function(file) { # Remember to run the whole function definition: put your cursor at its beginning or end, or select the entire block
  message(file)
  rawText = paste(scan(file, sep="\n", what="raw", strip.white = TRUE)) # Read the file as a vector of lines
  output = tibble(filename = gsub(path2file, "", file), text = rawText) %>%
    group_by(filename) %>%
    summarise(text = paste(text, collapse = " ")) # Collapse each file's lines into a single string
  return(output)
}
combinedTexts <- tibble(filename = fileList) %>%
  group_by(filename) %>%
  do(readTextFiles(.$filename))
```
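Before moving on, it's worth confirming that the files were read in correctly; a quick sanity check, assuming `combinedTexts` was built as above:
```{r}
# One row per file; preview the opening of the first text
nrow(combinedTexts)
substr(combinedTexts$text[1], 1, 200)
```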
## Prepare Text for VSM
```{r}
# Don't forget to change the text in the first line to whatever you want to call your model file
baseFile <- "your_file_name"
w2vInput <- paste("data/",baseFile,".txt", sep = "")
w2vCleaned <- paste("data/",baseFile,"_cleaned.txt", sep="")
w2vBin <- paste("data/",baseFile,".bin", sep="")
combinedTexts$text %>% write_lines(w2vInput)
```
## Create Vector Space Model
```{r}
THREADS <- 3
prep_word2vec(origin=w2vInput,destination=w2vCleaned,lowercase=T,bundle_ngrams=1)
#See the introductory file for a reminder on how you might adjust the parameters below
if (!file.exists(w2vBin)) {
  w2vModel <- train_word2vec(
    w2vCleaned,
    output_file = w2vBin,
    vectors = 100,
    threads = THREADS,
    window = 6, iter = 10, negative_samples = 15
  )
} else {
  w2vModel <- read.vectors(w2vBin)
}
```
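Once the model is trained or loaded, a quick look at its dimensions confirms that everything worked; with `vectors=100` above, you should see one row per vocabulary item and 100 columns:
```{r}
# Rows are vocabulary items; columns are vector dimensions
dim(w2vModel)
```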
## Read-in Existing Model Files
You don't need to run this if you want to work with a model you just trained, but if you'd like to switch models or read in an existing one after starting a new session, this is the code to use.
```{r}
w2vModel <- read.vectors("data/name_of_your_file.bin")
```
## Visualize
```{r}
w2vModel %>% plot(perplexity=10)
```
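Plotting the whole model can be slow for a large vocabulary. One alternative, following the conventions in Schmidt's `wordVectors` tutorials, is to plot just the vectors for a few words of interest (the words below match the query examples later in this file; swap in terms from your own corpus):
```{r}
# Plot only a handful of word vectors; "pca" is faster than the default t-SNE
w2vModel[[c("girl", "woman", "man"), average = FALSE]] %>% plot(method = "pca")
```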
## Clustering
```{r}
centers <- 150
clustering <- kmeans(w2vModel, centers = centers, iter.max = 40)
# Show the first 15 words from each of 10 randomly sampled clusters
sapply(sample(1:centers, 10), function(n) {
  names(clustering$cluster[clustering$cluster == n][1:15])
})
```
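To inspect one cluster in full rather than a sample, filter by its number (the `1` here is arbitrary):
```{r}
# All words assigned to cluster 1
names(clustering$cluster[clustering$cluster == 1])
```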
## Closest To
```{r}
#To have the results appear in the console below
w2vModel %>% closest_to("girl", 30)
#To view the results in a separate tab
w2vModel %>% closest_to("girl", 30) %>% View()
```
## Closest To Two Things
```{r}
w2vModel %>% closest_to(~"girl"+"woman", 20)
```
## Closest To The Space Between Two Things
```{r}
w2vModel %>% closest_to(~"man"-"woman", 20)
```
## Analogies
```{r}
w2vModel %>% closest_to(~"king"-"man"+"woman", 20)
```
## Export Queries
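Note that `write.csv()` will fail if there is no `output` folder yet; this base R guard creates it if needed:
```{r}
# Create the output folder if it doesn't already exist
dir.create("output", showWarnings = FALSE)
```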
```{r}
w2vExport <- w2vModel %>% closest_to("girl", 30)
#Change "name_of_your_query" to a descriptive name that you want to give to your export file.
write.csv(file="output/name_of_your_query.csv", x=w2vExport)
```
## Export Clusters
```{r}
centers <- 150
clustering <- kmeans(w2vModel, centers = centers, iter.max = 40)
# Change "name_of_your_cluster" to a descriptive name that you want to give to your export file.
w2vExport <- sapply(sample(1:centers, 150), function(n) {
  names(clustering$cluster[clustering$cluster == n][1:15])
})
write.csv(file="output/name_of_your_cluster.csv", x=w2vExport)
```
## Evaluate the Model
You can run this test by hitting `command-return` or `control-return` to run one line at a time, or just hit the green button at the top right of the code block below.
```{r}
files_list = list.files(pattern="\\.bin$", recursive=TRUE) # Find every .bin model file below the working directory
rownames <- c()
# Word pairs to test; each pair contributes one cosine-similarity score per model
data = list(c("away", "off"),
            c("before", "after"),
            c("cause", "effects"),
            c("children", "parents"),
            c("come", "go"),
            c("day", "night"),
            c("first", "second"),
            c("good", "bad"),
            c("last", "first"),
            c("kind", "sort"),
            c("leave", "quit"),
            c("life", "death"),
            c("girl", "boy"),
            c("little", "small"))
data_list = list()
for(fn in files_list) {
  wwp_model = read.vectors(fn)
  sims <- c()
  for(pairs in data) {
    vector1 <- as.vector(wwp_model[[pairs[1]]])
    vector2 <- as.vector(wwp_model[[pairs[2]]])
    sims <- c(sims, cosine(vector1, vector2))
  }
  # Label each model's results by its file name (assumes models sit one folder deep, e.g. "data/model.bin")
  f_name <- strsplit(fn, "/")[[1]][[2]]
  data_list[[f_name]] <- sims
}
for(pairs in data) {
  rownames <- c(rownames, paste(pairs[1], pairs[2], sep="-"))
}
results <- structure(data_list,
                     class = "data.frame",
                     row.names = rownames)
write.csv(file="output/model-test-results.csv", x=results)
```
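For a quick spot check on a single pair without running the whole loop, the `wordVectors` package also provides `cosineSimilarity()`, which works directly on model rows:
```{r}
# Cosine similarity between two words in the currently loaded model
cosineSimilarity(w2vModel[["good"]], w2vModel[["bad"]])
```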
## Credits and Thanks
This tutorial uses the `wordVectors` package developed by Ben Schmidt and Jian Li, itself based on the original `word2vec` code developed by Mikolov et al. The walkthrough was also informed by workshop materials authored by Schmidt, as well as by an exercise created by Thanasis Kinias and Ryan Cordell for the "Humanities Data Analysis" course, and a later version used in Elizabeth Maddock Dillon and Sarah Connell's "Literature and Digital Diversity" class, both at Northeastern University.
This version of the walkthrough was developed as part of the Word Vectors for the Thoughtful Humanist series at Northeastern. Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.