---
title: "Word Vectors Template"
author: "Jonathan Fitzgerald & Sarah Connell"
date: "6/3/2019"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Using this File

This file is a template for working with your own data once you are comfortable with the basics of `word2vec` (covered in the `introduction_word2vec.Rmd` file) and have installed all the necessary packages. Most of the instructional language is in the introductory file, so check there if you have questions. This file includes just the code you need to load files, train models, and query them, with a few comments on the things that need particularly close attention.

## Check the Working Directory

```{r}
getwd()
```
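
The later chunks in this template read from a `data/` folder and write to an `output/` folder, both relative to the working directory. As a minimal sketch, assuming that folder layout, the following check creates either folder if it is missing. The name `neededFolders` is only illustrative; adjust the list if your project is organized differently.

```{r}
# Folders that later chunks in this template read from or write to
# (relative to the working directory reported by getwd()).
neededFolders <- c("data", "output")

for (folder in neededFolders) {
  if (!dir.exists(folder)) {
    # Create the folder so later calls such as write.csv() do not fail
    dir.create(folder, recursive = TRUE)
    message("Created missing folder: ", folder)
  }
}

# Confirm that both folders now exist
dir.exists(neededFolders)
```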

## Load Packages

```{r}
library(tidyverse)
library(tidytext)
library(magrittr)
library(devtools)
library(tsne)
library(wordVectors)
library(lsa)
```
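
If one of the `library()` calls fails, the package is probably not installed yet. The following is a small sketch, not part of the original template, that reports which of the packages above are missing without stopping at the first error. The variable names are illustrative. Note that `wordVectors` is not distributed on CRAN, so it is usually installed from GitHub with `devtools`, as described in the introductory file.

```{r}
# Packages loaded by this template
neededPackages <- c("tidyverse", "tidytext", "magrittr", "devtools",
                    "tsne", "wordVectors", "lsa")

# requireNamespace() returns FALSE instead of stopping when a package is absent
isInstalled <- vapply(neededPackages, requireNamespace,
                      logical(1), quietly = TRUE)

missingPackages <- neededPackages[!isInstalled]
if (length(missingPackages) > 0) {
  message("Not installed: ", paste(missingPackages, collapse = ", "))
} else {
  message("All packages used by this template are installed.")
}
```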

## Read-in and Combine Multiple Text Files

```{r}
# Change "name_of_your_folder" to match the name of the folder with your corpus
path2file <- "data/name_of_your_folder"
fileList <- list.files(path2file, full.names = TRUE)

# Remember that the code that defines functions must be run by putting your cursor
# at the beginning or end, or by selecting the whole section of code
readTextFiles <- function(file) {
  message(file)
  # Read the file line by line, then collapse the lines into one string per file
  rawText <- paste(scan(file, sep = "\n", what = "raw", strip.white = TRUE))
  # fixed = TRUE treats the folder path as literal text rather than a regular expression
  output <- tibble(filename = gsub(path2file, "", file, fixed = TRUE), text = rawText) %>%
    group_by(filename) %>%
    summarise(text = paste(rawText, collapse = " "))
  return(output)
}

combinedTexts <- tibble(filename = fileList) %>%
  group_by(filename) %>%
  do(readTextFiles(.$filename))
```
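
To see what this step produces without touching your own corpus, here is a self-contained sketch that uses only base R and a throwaway folder of two made-up files. It is not the template's `readTextFiles()` function, only an illustration of the same idea: one row per file, with all of that file's lines collapsed into a single string. All names beginning with `toy` are hypothetical.

```{r}
# Build a throwaway corpus of two small files in a temporary folder
toyFolder <- file.path(tempdir(), "toy_corpus")
dir.create(toyFolder, showWarnings = FALSE)
writeLines(c("First line of text.", "Second line of text."),
           file.path(toyFolder, "a.txt"))
writeLines("A single line.", file.path(toyFolder, "b.txt"))

# Read each file and collapse its lines into one string per file
toyFiles <- list.files(toyFolder, pattern = "\\.txt$", full.names = TRUE)
toyTexts <- data.frame(
  filename = basename(toyFiles),
  text = vapply(toyFiles,
                function(f) paste(readLines(f), collapse = " "),
                character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# One row per file; word counts give a quick check that nothing came back empty
toyTexts$words <- lengths(strsplit(toyTexts$text, "\\s+"))
toyTexts
```

Running the same kind of word count on `combinedTexts$text` is a quick way to confirm that every file in your corpus was read.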

## Prepare Text for VSM

```{r}
# Don't forget to change the text in the first line to whatever you want to call your model file
baseFile <- "your_file_name"
w2vInput <- paste("data/", baseFile, ".txt", sep = "")
w2vCleaned <- paste("data/", baseFile, "_cleaned.txt", sep = "")
w2vBin <- paste("data/", baseFile, ".bin", sep = "")
combinedTexts$text %>% write_lines(w2vInput)
```

## Create Vector Space Model

```{r}
THREADS <- 3
prep_word2vec(origin = w2vInput, destination = w2vCleaned, lowercase = TRUE, bundle_ngrams = 1)

# See the introductory file for a reminder on how you might adjust the parameters below
if (!file.exists(w2vBin)) {
  w2vModel <- train_word2vec(
    w2vCleaned,
    output_file = w2vBin,
    vectors = 100,
    threads = THREADS,
    window = 6, iter = 10, negative_samples = 15
  )
} else {
  w2vModel <- read.vectors(w2vBin)
}
```
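
The chunk above sets `THREADS` to 3. If you are unsure how many threads your machine can spare, one possible approach, sketched here as an assumption rather than a recommendation from the introductory file, is to count the available cores with the `parallel` package that ships with R and leave one free. The names `availableCores` and `suggestedThreads` are illustrative.

```{r}
# detectCores() is in the parallel package that ships with R;
# it can return NA on platforms where the count cannot be determined
availableCores <- parallel::detectCores()

# Leave one core free for the rest of the system, but never go below 1
suggestedThreads <- if (is.na(availableCores)) 1 else max(1, availableCores - 1)
suggestedThreads
```

If the number looks reasonable, you can assign it to `THREADS` before training.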

## Read-in Existing Model Files

You don't need to run this if you want to work with a model you just trained. If you would like to switch models, or read in an existing one after starting a new session, this is the code to use.

```{r}
w2vModel <- read.vectors("data/name_of_your_file.bin")
```

## Visualize

```{r}
w2vModel %>% plot(perplexity = 10)
```

## Clustering

```{r}
centers <- 150
clustering <- kmeans(w2vModel, centers = centers, iter.max = 40)
sapply(sample(1:centers, 10), function(n) {
  names(clustering$cluster[clustering$cluster == n][1:15])
})
```
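
Because `kmeans()` starts from randomly chosen centers, and the chunk above also samples clusters at random, the output changes from run to run. The sketch below uses a small made-up matrix in place of a trained model (all `toy` names are hypothetical) to show how calling `set.seed()` first makes the clustering repeatable.

```{r}
# Made-up 2-dimensional "word vectors": two well-separated groups of three words
toyVectors <- rbind(
  apple  = c(1.0, 0.1),
  pear   = c(0.9, 0.2),
  plum   = c(1.1, 0.0),
  hammer = c(0.1, 1.0),
  saw    = c(0.0, 0.9),
  drill  = c(0.2, 1.1)
)

# Fixing the seed makes the random starting centers repeatable
set.seed(42)
toyClusteringA <- kmeans(toyVectors, centers = 2, iter.max = 40)
set.seed(42)
toyClusteringB <- kmeans(toyVectors, centers = 2, iter.max = 40)

# Same seed, same assignments
identical(toyClusteringA$cluster, toyClusteringB$cluster)

# Words grouped by the cluster they were assigned to
split(names(toyClusteringA$cluster), toyClusteringA$cluster)
```

The same idea applies to the template: a `set.seed()` call placed before `kmeans()` and `sample()` lets you return to the same clusters in a later session.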

## Closest To

```{r}
# To have the results appear in the console below
w2vModel %>% closest_to("girl", 30)

# To view the results in a separate tab
w2vModel %>% closest_to("girl", 30) %>% View()
```

## Closest To Two Things

```{r}
w2vModel %>% closest_to(~"girl"+"woman", 20)
```

## Closest To The Space Between Two Things

```{r}
w2vModel %>% closest_to(~"man"-"woman", 20)
```

## Analogies

```{r}
w2vModel %>% closest_to(~"king"-"man"+"woman", 20)
```
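
The formulas in the last three sections add and subtract word vectors and then look for the words closest to the result. As a rough illustration of that arithmetic, and not the `wordVectors` implementation, the sketch below works through the analogy query in base R with made-up two-dimensional vectors. The names `toyModel`, `cosineSim`, `target`, and `similarities` are hypothetical.

```{r}
# Made-up vectors: the first dimension loosely stands for "royalty",
# the second for "gender". Real models have many more dimensions.
toyModel <- rbind(
  king  = c(0.9,  0.8),
  queen = c(0.9, -0.8),
  man   = c(0.1,  0.8),
  woman = c(0.1, -0.8),
  apple = c(-0.7, 0.1)
)

# Cosine similarity between two numeric vectors
cosineSim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# The analogy query: king - man + woman
target <- toyModel["king", ] - toyModel["man", ] + toyModel["woman", ]

# Rank every word in the toy vocabulary by similarity to the target
similarities <- apply(toyModel, 1, cosineSim, b = target)
sort(similarities, decreasing = TRUE)
```

In this toy vocabulary the vectors were chosen so that the result lands on `queen`. With a model trained on a real corpus, the top results depend on how the words are used in your texts.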

## Export Queries

```{r}
w2vExport <- w2vModel %>% closest_to("girl", 30)
# Change "name_of_your_query" to a descriptive name that you want to give to your export file.
write.csv(file = "output/name_of_your_query.csv", x = w2vExport)
```

## Export Clusters

```{r}
centers <- 150
clustering <- kmeans(w2vModel, centers = centers, iter.max = 40)
w2vExport <- sapply(sample(1:centers, 150), function(n) {
  names(clustering$cluster[clustering$cluster == n][1:15])
})
# Change "name_of_your_cluster" to a descriptive name that you want to give to your export file.
write.csv(file = "output/name_of_your_cluster.csv", x = w2vExport)
```

## Evaluate the Model

You can run this test by hitting `command-return` or `control-return` to run one line at a time, or by clicking the green button in the top right of the code block below.

```{r}
# Find every model file (.bin) in the working directory and its subfolders
files_list <- list.files(pattern = "\\.bin$", recursive = TRUE)

# Word pairs whose cosine similarity is measured in each model
data <- list(c("away", "off"),
             c("before", "after"),
             c("cause", "effects"),
             c("children", "parents"),
             c("come", "go"),
             c("day", "night"),
             c("first", "second"),
             c("good", "bad"),
             c("last", "first"),
             c("kind", "sort"),
             c("leave", "quit"),
             c("life", "death"),
             c("girl", "boy"),
             c("little", "small"))

data_list <- list()

for (fn in files_list) {
  wwp_model <- read.vectors(fn)
  sims <- c()
  for (pairs in data) {
    # Pull out each word's vector as a plain numeric vector for cosine()
    vector1 <- as.numeric(wwp_model[[pairs[1]]])
    vector2 <- as.numeric(wwp_model[[pairs[2]]])
    sims <- c(sims, cosine(vector1, vector2))
  }
  # Label each column of results with the model's file name
  f_name <- basename(fn)
  data_list[[f_name]] <- sims
}

# Label each row with its word pair, e.g. "away-off"
row_labels <- c()
for (pairs in data) {
  row_labels <- c(row_labels, paste(pairs[1], pairs[2], sep = "-"))
}

results <- structure(data_list,
                     class = "data.frame",
                     row.names = row_labels)

write.csv(file = "output/model-test-results.csv", x = results)
```
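
The test above assumes that every word in the pair list appears in each model's vocabulary. With a small or specialized corpus that may not hold. The sketch below, which uses a made-up matrix as a stand-in for a model (all `toy` names are hypothetical), shows one way to check the pairs against the vocabulary first. The words in a model read with `read.vectors()` are its row names, so `rownames()` can be used the same way on a real model.

```{r}
# Stand-in for a model: a plain matrix whose row names are the vocabulary
toyVocabModel <- matrix(runif(8), nrow = 4,
                        dimnames = list(c("day", "night", "good", "bad"), NULL))

toyPairs <- list(c("day", "night"),
                 c("good", "bad"),
                 c("kind", "sort"))   # neither word is in the toy vocabulary

# TRUE only when both words of a pair appear in the vocabulary
pairIsUsable <- vapply(toyPairs,
                       function(pair) all(pair %in% rownames(toyVocabModel)),
                       logical(1))

# Pairs that would need to be dropped or replaced before running the test
toyPairs[!pairIsUsable]
```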

## Credits and Thanks

This tutorial uses the `wordVectors` package developed by Ben Schmidt and Jian Li, itself based on the original `word2vec` code developed by Mikolov et al. The walkthrough was also informed by workshop materials authored by Schmidt, as well as by an exercise created by Thanasis Kinias and Ryan Cordell for the "Humanities Data Analysis" course, and a later version used in Elizabeth Maddock Dillon and Sarah Connell's "Literature and Digital Diversity" class, both at Northeastern University.

This version of the walkthrough was developed as part of the Word Vectors for the Thoughtful Humanist series at Northeastern. Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.