---
title: "Word Vectors Template"
author: "Jonathan Fitzgerald & Sarah Connell"
date: "6/3/2019"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Using this File
This file is set up so that once you're comfortable with the basics of `word2vec` (which you can learn from the `introduction_word2vec.Rmd` file) and you have installed all the necessary packages, you can use it as a convenient template for working with your data. Most of the instructional language is in the introductory file, so check that if you have questions; this file includes just the code you'll need to load in files, train models, and query them, with a few comments on the things you'll want to pay particularly close attention to.
## Check the Working Directory
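A quick check along these lines can confirm that R is looking in the project folder before any files are read. It is a minimal sketch that assumes the layout implied by the paths used later in this file, with a `data` folder for inputs and an `output` folder for exports directly under the working directory; adjust the folder names if your project differs.

```r
# Print the folder R is currently working in
workingDir <- getwd()
print(workingDir)

# The template reads from "data/" and writes to "output/" (assumed layout);
# report whether each folder exists under the working directory
expectedFolders <- c("data", "output")
folderStatus <- dir.exists(file.path(workingDir, expectedFolders))
names(folderStatus) <- expectedFolders
print(folderStatus)
```

If either folder shows `FALSE`, the working directory is probably not the project folder, or the folder still needs to be created.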
## Load Packages
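The code in this file calls functions from several packages, so they need to be attached before anything else runs. The sketch below checks which ones are installed and attaches those that are. The package list is an assumption inferred from the functions used in this template (it is not copied from the original chunk), and `cosine()` is assumed to come from `lsa`.

```r
# Packages inferred from the functions this template calls (an assumption):
#   tidyverse   -> tibble(), group_by(), summarise(), write_lines(), %>%
#   wordVectors -> train_word2vec(), read.vectors(), closest_to()
#   lsa         -> cosine()
neededPackages <- c("tidyverse", "wordVectors", "lsa")

# requireNamespace() returns TRUE or FALSE without attaching the package
isInstalled <- vapply(neededPackages, requireNamespace,
                      logical(1), quietly = TRUE)
print(isInstalled)

# Attach whatever is available; report the rest instead of failing
for (pkg in neededPackages[isInstalled]) {
  library(pkg, character.only = TRUE)
}
missingPackages <- neededPackages[!isInstalled]
if (length(missingPackages) > 0) {
  message("Not installed: ", paste(missingPackages, collapse = ", "))
}
```

The introductory file covers installation; `wordVectors` in particular is distributed through GitHub rather than CRAN, so `install.packages()` alone will not find it.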
## Read-in and Combine Multiple Text Files
# Change "name_of_your_folder" to match the name of the folder with your corpus
path2file <- "data/name_of_your_folder"
fileList <- list.files(path2file, full.names = TRUE)

# Remember that code defining a function must be run by putting your cursor at
# the beginning or end, or by selecting the whole section of code
readTextFiles <- function(file) {
  # Read the file one line at a time, trimming surrounding whitespace
  rawText <- paste(scan(file, sep = "\n", what = "raw", strip.white = TRUE))
  # Collapse all lines of the file into a single row of text
  output <- tibble(filename = gsub(path2file, "", file), text = rawText) %>%
    group_by(filename) %>%
    summarise(text = paste(rawText, collapse = " "))
  return(output)
}

# Apply the function to every file and stack the results into one table
combinedTexts <- tibble(filename = fileList) %>%
  group_by(filename) %>%
  do(readTextFiles(.$filename))
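To see what this step produces without touching a real corpus, the same idea can be tried on two throwaway files. The following is a base R sketch, not the template's tidyverse code: the file names, their contents, and the helper `collapseFile` are all made up for illustration. It reads each file, joins its lines into one string, and returns one row per file.

```r
# Build a throwaway corpus of two small files in a temporary folder
corpusDir <- file.path(tempdir(), "toy_corpus")
dir.create(corpusDir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(corpusDir, "a.txt"))
writeLines("only line", file.path(corpusDir, "b.txt"))

toyFiles <- list.files(corpusDir, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: read a file and collapse its lines into one string
collapseFile <- function(file) {
  paste(trimws(readLines(file)), collapse = " ")
}

# One row per file: the file name and its full text
toyCombined <- data.frame(
  filename = basename(toyFiles),
  text = vapply(toyFiles, collapseFile, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(toyCombined)
```

The result has the same shape as `combinedTexts`: a `filename` column and a `text` column holding each document as a single string.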
## Prepare Text for VSM
# Don't forget to change the text in the first line to whatever you want to call your model file
baseFile <- "your_file_name"
w2vInput <- paste("data/",baseFile,".txt", sep = "")
w2vCleaned <- paste("data/",baseFile,"_cleaned.txt", sep="")
w2vBin <- paste("data/",baseFile,".bin", sep="")
combinedTexts$text %>% write_lines(w2vInput)
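Because all three paths are built from `baseFile`, changing that one value renames the combined text, the cleaned text, and the model file together. A small sketch with a hypothetical base name (`my_corpus` is only an illustration) shows the resulting paths; `paste0()` is used here as shorthand for `paste(..., sep = "")`.

```r
# Hypothetical base name, used only for this illustration
exampleBase <- "my_corpus"

# paste0() joins its arguments with no separator
examplePaths <- c(
  input   = paste0("data/", exampleBase, ".txt"),
  cleaned = paste0("data/", exampleBase, "_cleaned.txt"),
  model   = paste0("data/", exampleBase, ".bin")
)
print(examplePaths)
```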
## Create Vector Space Model
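# See the introductory file for a reminder on how you might adjust the parameters below
if (!file.exists(w2vBin)) {
  # Clean the combined text before training; the lowercase setting is an
  # assumption here, so adjust it to suit your corpus
  prep_word2vec(origin = w2vInput, destination = w2vCleaned, lowercase = TRUE)
  # Arguments not listed (such as vectors and threads) use the package defaults
  w2vModel <- train_word2vec(
    w2vCleaned,
    output_file = w2vBin,
    window = 6, iter = 10, negative_samples = 15
  )
} else {
  w2vModel <- read.vectors(w2vBin)
}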
## Read-in Existing Model Files
You don't need to run this if you want to work with a model you just trained. If you would like to switch models, or read in an existing one after starting a new session, this is the code to use.
w2vModel <- read.vectors("data/name_of_your_file.bin")
## Visualize
w2vModel %>% plot(perplexity=10)
## Clustering
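centers <- 150
clustering <- kmeans(w2vModel, centers = centers, iter.max = 40)
# Print up to 15 words from each of 10 randomly chosen clusters (15 is adjustable)
sapply(sample(1:centers, 10), function(n) {
  names(clustering$cluster[clustering$cluster == n][1:15])
})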
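The clustering step can be tried on a tiny made-up matrix to see what `kmeans` returns and how the words in each cluster are looked up. This is a sketch under stated assumptions: the six words and their two-dimensional values are invented, whereas a real model has many more words and dimensions.

```r
set.seed(42)  # k-means starts from random centers; fix the seed to reproduce

# Toy "model": six words in a two-dimensional space (made-up values)
toyVectors <- rbind(
  king   = c(0.90, 0.80),
  queen  = c(0.85, 0.90),
  prince = c(0.80, 0.85),
  apple  = c(-0.90, -0.80),
  pear   = c(-0.85, -0.90),
  plum   = c(-0.80, -0.85)
)

toyCenters <- 2
toyClustering <- kmeans(toyVectors, centers = toyCenters, iter.max = 40)

# toyClustering$cluster maps each word (row name) to a cluster number;
# list the words assigned to each cluster
toyGroups <- lapply(seq_len(toyCenters), function(n) {
  names(toyClustering$cluster[toyClustering$cluster == n])
})
print(toyGroups)
```

Cluster numbers are arbitrary labels, so the same grouping may appear under different numbers from one run to the next.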
## Closest To
#To have the results appear in the console below
w2vModel %>% closest_to("girl", 30)
#To view the results in a separate tab
w2vModel %>% closest_to("girl", 30) %>% View()
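Queries like these rank words by how similar their vectors are to the query word's vector, conventionally measured with cosine similarity. The sketch below reproduces that ranking in base R on a made-up matrix; it illustrates the idea and is not the `wordVectors` implementation. The helper `cosineToAll` and all vector values are hypothetical.

```r
# Toy "model": made-up two-dimensional vectors for five words
toyModel <- rbind(
  girl  = c(0.9, 0.1),
  woman = c(0.8, 0.3),
  boy   = c(0.7, -0.2),
  river = c(-0.1, 0.9),
  stone = c(-0.6, 0.5)
)

# Hypothetical helper: cosine similarity between one vector and every row
cosineToAll <- function(vectors, target) {
  as.vector(vectors %*% target) /
    (sqrt(rowSums(vectors^2)) * sqrt(sum(target^2)))
}

similarities <- cosineToAll(toyModel, toyModel["girl", ])
names(similarities) <- rownames(toyModel)

# Highest similarity first; the query word itself always scores 1
ranked <- sort(similarities, decreasing = TRUE)
print(round(ranked, 3))
```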
## Closest To Two Things
w2vModel %>% closest_to(~"girl"+"woman", 20)
## Closest To The Space Between Two Things
w2vModel %>% closest_to(~"man"-"woman", 20)
## Analogies
w2vModel %>% closest_to(~"king"-"man"+"woman", 20)
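The formula queries above add and subtract word vectors before looking for the nearest words. A base R sketch of the analogy case makes the arithmetic visible. The two-dimensional values are invented so that one dimension loosely tracks "royalty" and the other "maleness", and `cosineSim` is a hypothetical helper, not part of `wordVectors`.

```r
# Made-up space: dimension 1 ~ "royalty", dimension 2 ~ "maleness"
toySpace <- rbind(
  king  = c(0.9, 0.9),
  queen = c(0.9, 0.1),
  man   = c(0.1, 0.9),
  woman = c(0.1, 0.1),
  apple = c(-0.5, 0.5)
)

# Hypothetical helper: cosine similarity between two vectors
cosineSim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# king - man + woman: remove "maleness" from king, keep "royalty"
analogyVector <- toySpace["king", ] - toySpace["man", ] + toySpace["woman", ]

# Score every word against the combined vector and rank them
scores <- apply(toySpace, 1, cosineSim, b = analogyVector)
print(round(sort(scores, decreasing = TRUE), 3))
```

In this toy space the arithmetic lands on the `queen` vector, so `queen` ranks first. Real models are far noisier, and the query words themselves often appear near the top of the results.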
## Export Queries
w2vExport <- w2vModel %>% closest_to("girl", 30)
#Change "name_of_your_query" to a descriptive name that you want to give to your export file.
write.csv(file="output/name_of_your_query.csv", x=w2vExport)
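To check what an export will look like before writing into `output/`, the same `write.csv` call can be pointed at a temporary file and read back. This sketch uses an invented results table and a temporary path. It also passes `row.names = FALSE`, an optional choice the template does not make, which leaves out the extra column of row numbers.

```r
# Made-up query results standing in for the output of a real query
toyExport <- data.frame(
  word = c("girl", "woman", "boy"),
  similarity = c(1.00, 0.97, 0.93),
  stringsAsFactors = FALSE
)

# Write to a temporary file rather than the project's output folder
exportPath <- file.path(tempdir(), "toy_query.csv")
write.csv(toyExport, file = exportPath, row.names = FALSE)

# Read the file back to confirm its contents
reloaded <- read.csv(exportPath, stringsAsFactors = FALSE)
print(reloaded)
```

Note that `write.csv` does not create missing folders, so the `output` folder must exist before the template's export code is run.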
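## Export Clusters
centers <- 150
clustering <- kmeans(w2vModel, centers = centers, iter.max = 40)
# Collect up to 15 words from every cluster, one column per cluster (15 is adjustable)
w2vExport <- sapply(sample(1:centers, centers), function(n) {
  names(clustering$cluster[clustering$cluster == n][1:15])
})
# Change "name_of_your_cluster" to a descriptive name that you want to give to your export file.
write.csv(file = "output/name_of_your_cluster.csv", x = w2vExport)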
## Evaluate the Model
You can run this test by hitting `command-return` or `control-return` to run one line at a time, or by clicking the green button in the top right of the code block below.
# Find every model file (.bin) in the working directory and its subfolders
files_list <- list.files(pattern = "\\.bin$", recursive = TRUE)
rownames <- c()
data = list(c("away", "off"),
c("before", "after"),
c("cause", "effects"),
c("children", "parents"),
c("come", "go"),
c("day", "night"),
c("first", "second"),
c("good", "bad"),
c("last", "first"),
c("kind", "sort"),
c("leave", "quit"),
c("life", "death"),
c("girl", "boy"),
c("little", "small"))
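data_list <- list()
for (fn in files_list) {
  wwp_model <- read.vectors(fn)
  sims <- c()
  for (pairs in data) {
    # Flatten each word's vector into a plain numeric vector
    vector1 <- c()
    for (x in wwp_model[[pairs[1]]]) {
      vector1 <- c(vector1, x)
    }
    vector2 <- c()
    for (x in wwp_model[[pairs[2]]]) {
      vector2 <- c(vector2, x)
    }
    # Cosine similarity between the two words in this pair
    sims <- c(sims, cosine(vector1, vector2))
  }
  # Use the file name (without its folder) as the column name
  f_name <- basename(fn)
  data_list[[f_name]] <- sims
}
# Label each row with its word pair, for example "away-off"
for (pairs in data) {
  rownames <- c(rownames, paste(pairs[1], pairs[2], sep = "-"))
}
# One row per word pair, one column per model
results <- structure(data_list,
  class = "data.frame",
  row.names = rownames
)
write.csv(file = "output/model-test-results.csv", x = results)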
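The evaluation above produces a table with one row per word pair and one column per model, where each cell is the cosine similarity of the pair in that model. The base R sketch below builds the same kind of table from two invented "models" so the shape of the result is easy to inspect. The model names, vectors, and the helpers `cosineSim` and `scorePairs` are all hypothetical.

```r
# Made-up vectors standing in for two trained models
toyModels <- list(
  model_a = rbind(good = c(1, 0), bad = c(0.8, 0.6),
                  day = c(0, 1), night = c(0.6, 0.8)),
  model_b = rbind(good = c(1, 0), bad = c(0, 1),
                  day = c(1, 1), night = c(1, 1))
)
toyPairs <- list(c("good", "bad"), c("day", "night"))

# Hypothetical helper: cosine similarity between two vectors
cosineSim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Hypothetical helper: score every word pair within one model
scorePairs <- function(vectors) {
  vapply(toyPairs, function(p) cosineSim(vectors[p[1], ], vectors[p[2], ]),
         numeric(1))
}

# One column per model, one row per word pair
toyResults <- as.data.frame(lapply(toyModels, scorePairs))
rownames(toyResults) <- vapply(toyPairs, paste, character(1), collapse = "-")
print(toyResults)
```

Reading across a row shows how differently the models place the same pair, which is the comparison the exported `model-test-results.csv` is meant to support.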
## Credits and Thanks
This tutorial uses the `wordVectors` package developed by Ben Schmidt and Jian Li, itself based on the original `word2vec` code developed by Mikolov et al. The walkthrough was also informed by workshop materials authored by Schmidt, as well as by an exercise created by Thanasis Kinias and Ryan Cordell for the "Humanities Data Analysis" course, and a later version used in Elizabeth Maddock Dillon and Sarah Connell's "Literature and Digital Diversity" class, both at Northeastern University.
This version of the walkthrough was developed as part of the Word Vectors for the Thoughtful Humanist series at Northeastern. Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.