## Corpus Linguistics - LING3038/6038
### Week 3

How can we do some descriptive corpus statistics with R?

An easy way to get started is to import a text from the web. 

In [1]:
moby<-read.delim("https://www.gutenberg.org/files/2701/2701-0.txt",encoding="UTF-8",sep="\n") #here we give the function 3 arguments:
#the website from which we are reading data
#the text encoding. UTF-8 works well for most languages. If the text was encoded with a different system, we could specify this here
#what is used to separate parts of the text in the file. Here it is different lines so we use \n to mean 'new line'
#note: by default read.delim() treats the first line of the file as a column name (header=TRUE)
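
The first line of the output further down looks odd (`X....START.OF.THE...`) because `read.delim()` uses the first line of a file as a column name unless told otherwise. The sketch below illustrates this on a made-up three-line text passed through the `text=` argument instead of a URL; the object names are only for illustration.

```r
# A toy three-line "file" (made up for this illustration)
toy <- "first line\nsecond line\nthird line"

# Default behaviour: header = TRUE, so the first line becomes the column name
with_header <- read.delim(text = toy, sep = "\n")

# header = FALSE keeps every line as data
no_header <- read.delim(text = toy, sep = "\n", header = FALSE)

names(with_header)  # the first line, turned into a syntactically valid name
nrow(with_header)   # one row fewer than the number of lines
nrow(no_header)     # all three lines kept
```

For the Gutenberg file this makes little difference, since the first line is header material that we discard anyway.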


In [2]:
#we can see the first and last bits of our object by using head() and tail(). 
#We use a comma and then an argument to define how many lines we want to see.

head(moby,300)
tail(moby,300)

X....START.OF.THE.PROJECT.GUTENBERG.EBOOK.2701....
MOBY-DICK;
"or, THE WHALE."
By Herman Melville
CONTENTS
ETYMOLOGY.
EXTRACTS (Supplied by a Sub-Sub-Librarian).
CHAPTER 1. Loomings.
CHAPTER 2. The Carpet-Bag.
CHAPTER 3. The Spouter-Inn.
CHAPTER 4. The Counterpane.


      X....START.OF.THE.PROJECT.GUTENBERG.EBOOK.2701....
18615 In an instant the boat was pulling round close under the stern.
18616 “The sharks! the sharks!” cried a voice from the low cabin-window
18617 there; “O master, my master, come back!”
18618 But Ahab heard nothing; for his own voice was high-lifted then; and the
18619 boat leaped on.
18620 Yet the voice spake true; for scarce had he pushed from the ship, when
18621 numbers of sharks, seemingly rising from out the dark waters beneath
18622 the hull, maliciously snapped at the blades of the oars, every time
18623 they dipped in the water; and in this way accompanied the boat with
18624 their bites. It is a thing not uncommonly happening to the whale-boats


There is a lot of header/footer information that we do not want to contribute to our word counts. 
We want to exclude that header/footer information. How can we do that?
We can figure out the row numbers where the actual text starts and ends.
Then we save that portion as a new object.

In [3]:
#find the line that starts the actual text
moby[14:16,]


#find the line that ends the actual text
moby[18928:18929,]

This works if you already know roughly which line numbers mark the start and end of the text.


In [4]:
moby.text<-moby[15:18928,]

In [5]:
##Another way to read in texts is with the function readLines()
##Then we select our lines by finding certain indicators/signals of where the text starts and stops
##To find those indicators we search for them using grep()

moby2<-readLines("https://www.gutenberg.org/files/2701/2701-0.txt",encoding="UTF-8")
head(moby2)

##Then we search for strings that signal the start/stop of the text and name them:
##Most Gutenberg texts contain a string like "START OF xxxx" / "END OF xxxx"

# the text starts after the Project Gutenberg header...
start <- grep("START OF THE PROJECT GUTENBERG", moby2, value=F) + 1 #one line after the line with this text

# ...and ends at the Project Gutenberg footer.
stop <- grep("END OF THE PROJECT GUTENBERG", moby2,value=F) - 1 #one line before the line with this text

start
stop



In [6]:
moby.lines <- moby2[start:stop]
head(moby.lines,25) #let's see the start of the text now
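
To see what the `grep()` step is doing, here is a minimal sketch on a made-up six-line vector (the toy text and the names `first` and `last` are assumptions for illustration only). Without `value=TRUE`, `grep()` returns the positions of the matching elements, which we can then shift by one and use for indexing.

```r
# A toy "book": licence text, a start marker, two lines of text, an end marker
toy <- c("Some licence text",
         "*** START OF THE PROJECT GUTENBERG EBOOK TOY ***",
         "Call me Ishmael.",
         "Some years ago",
         "*** END OF THE PROJECT GUTENBERG EBOOK TOY ***",
         "More licence text")

# grep() returns the index of each matching element; fixed = TRUE matches the string literally
first <- grep("START OF THE PROJECT GUTENBERG", toy, fixed = TRUE) + 1
last  <- grep("END OF THE PROJECT GUTENBERG", toy, fixed = TRUE) - 1

# Guard: each marker should occur exactly once, otherwise first:last is not well defined
stopifnot(length(first) == 1, length(last) == 1)

body <- toy[first:last]
body
```

The guard matters with real files: if a marker appears more than once, `grep()` returns several positions and the slice no longer means what we intended.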

We might also decide that we do not want to start with "CHAPTER 1" but rather with the classic line "Call me Ishmael". How would we change our start search to make this happen?

In [None]:
#Fill in the string we want to search for (Call me Ishmael) between the quotation marks
Ishmael.start <- grep("", moby2, value=F) 



In [None]:
#Now create another object called moby.lines2. Fill in the start and stop points using the objects we have created

moby.lines2<-moby2[:]

head(moby.lines2,20)

Note that these different strategies for reading in the text and selecting portions of it result in objects of different sizes.


In [None]:
length(moby.text)
length(moby.lines)
length(moby.lines2)



Now let's work with our object 'moby.lines'

In [None]:
head(moby.lines)

In [None]:
#we now are going to turn everything into lowercase

moby.text.lowercase<-tolower(moby.lines)
head(moby.text.lowercase)

In [None]:
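#Right now our object is a character vector with one string per line of text, and the strings have different lengths. We want to make our text into one long string.
#We do this with paste() and its collapse argument (unlist() leaves a character vector unchanged, so it is harmless here)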

moby.text.lowercase.vector<-paste(unlist(moby.text.lowercase), collapse =" ")

length(moby.text.lowercase)
length(moby.text.lowercase.vector)
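
The two `length()` calls above differ because `length()` counts the elements of a vector, not its characters. A small sketch with three made-up lines (the names are illustrative only):

```r
# Three lines of text stored as three separate strings
toy_lines <- c("call me ishmael.", "some years ago", "never mind how long")

# collapse = " " glues all elements into a single string, separated by spaces
one_string <- paste(toy_lines, collapse = " ")

length(toy_lines)   # number of elements before collapsing
length(one_string)  # a single element after collapsing
nchar(one_string)   # nchar() is what counts characters
```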

In [None]:
moby.vec<-moby.text.lowercase.vector #this is just to create a back-up object and work with an object with a shorter name

Many R packages work with corpus data and provide functions for common corpus tasks, such as turning a text into a collection of word tokens. Here are a few:

In [None]:
library(corpus)
moby.toks.corpus<-text_tokens(moby.vec, drop_punct=T)
moby.toks.corpus

In [None]:
library(stringr)
moby.toks.stringr<-str_split(moby.vec, "\\s|[[:punct:]]") #this is a regular expression: split at any whitespace character or punctuation mark. we'll get to these soon
moby.toks.stringr
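
Splitting at every single whitespace or punctuation character has a side effect: wherever two separators are adjacent (a full stop followed by a space, say) an empty string is left between them. The sketch below shows this with base R's `strsplit()` on a made-up sentence, so it runs without extra packages; the same idea should carry over to `str_split()`, although the two functions are not identical in every detail.

```r
toy <- "call me ishmael. some years ago"

# Split at each single separator: ". " is two separators, so "" appears between them
split_single <- strsplit(toy, "\\s|[[:punct:]]")[[1]]
split_single

# Option 1: treat a run of separators as one split point
split_runs <- strsplit(toy, "[[:space:][:punct:]]+")[[1]]

# Option 2: keep the simple pattern and drop the empty strings afterwards
non_empty <- split_single[split_single != ""]

identical(split_runs, non_empty)
```

Empty strings count as tokens if they are left in, so it is worth checking for them before building a frequency table.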

In [None]:
library(tm)
moby.toks.tm.scan<-scan_tokenizer(moby.vec) #this package has a few different tokenizers with different function names
moby.toks.tm.scan

Now that we have a bunch of tokens, we can start counting them and their types, and doing basic corpus description tasks. I am going to load {data.table} to do this and turn the tokens I created with {corpus} into a data.table.

In [None]:
library(data.table)
dt<-as.data.table(moby.toks.corpus, na.rm=T)
dt
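
Before building the full frequency table, the two most basic counts can be sketched in base R: the number of tokens (every running word) and the number of types (distinct words). The toy vector and the type-token ratio at the end are additions for illustration and are not part of the worksheet's pipeline.

```r
# Eight made-up tokens
toks <- c("the", "whale", "and", "the", "ship", "and", "the", "sea")

n_tokens <- length(toks)          # every occurrence counts
n_types  <- length(unique(toks))  # each distinct word counts once

# Type-token ratio: a rough measure of lexical variety
ttr <- n_types / n_tokens

n_tokens
n_types
ttr
```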

In [None]:
##turn it into a frequency table
#add a column to count the words
dt[, word_count := .N, by = V1] #.N counts the number of rows in each group, here the number of times each word occurs

#inspect object
dt

In [None]:
#reduce tokens to types by getting the unique tokens
dtm<-unique(dt[,c("V1","word_count")])
colnames(dtm)[1]<-"word"
colnames(dtm)[2]<-"freq"

In [None]:
dtm

In [None]:
#Let's sort the table by frequency (setkey() sorts in ascending order, so the most frequent words end up at the bottom)
setkey(dtm,freq)
dtm
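
As a cross-check on the {data.table} steps, the same kind of frequency table can be sketched in base R with `table()`. This is an alternative shown on a made-up token vector, not the method used in the rest of the worksheet.

```r
toks <- c("the", "whale", "and", "the", "ship", "and", "the", "sea")

# table() counts how often each distinct value occurs
counts <- table(toks)

# Convert to a data frame and use the same column names as above
freq_table <- as.data.frame(counts, stringsAsFactors = FALSE)
names(freq_table) <- c("word", "freq")

# Sort by descending frequency, breaking ties alphabetically
freq_table <- freq_table[order(-freq_table$freq, freq_table$word), ]
freq_table
```

The frequencies should sum to the number of tokens, which is a quick sanity check for either approach.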


In [None]:
library(ggplot2)
ggplot(subset(dtm, freq > 350) ,aes(reorder(word, -freq), freq))+
  geom_bar(stat="identity")+
  theme(axis.text.x=element_text(angle=90, hjust=1)) + ggtitle("Moby Dick word frequency (over 350 tokens)")+
  xlab("Word")+
  ylab("Frequency")


In [None]:
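#there are a lot of "stopwords" here. let's get rid of some.
library(stopwords)
#note: the object stopwords_en used below is the English list that {corpus} (loaded earlier) provides; {stopwords} itself offers the function stopwords("en")
head(stopwords_en,20)
length(stopwords_en)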

dtm2<-dtm[! dtm$word %in% stopwords_en,] 
dtm2
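
The filtering line relies on `%in%`, which returns `TRUE` for each word that appears in the stoplist; the `!` in front flips that, so only the other rows are kept. A minimal sketch with a made-up frequency table and a made-up three-word stoplist (both hypothetical, not the real `stopwords_en`):

```r
freq_table <- data.frame(word = c("the", "and", "whale", "of", "ship"),
                         freq = c(3L, 2L, 2L, 1L, 1L),
                         stringsAsFactors = FALSE)

# Hypothetical stoplist, for illustration only
toy_stopwords <- c("the", "and", "of")

# %in% is evaluated before !, so this reads as !(word %in% toy_stopwords)
is_stopword   <- freq_table$word %in% toy_stopwords
content_words <- freq_table[!is_stopword, ]

is_stopword
content_words
```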

In [None]:
ggplot(subset(dtm2, freq > 200) ,aes(reorder(word, -freq), freq))+
  geom_bar(stat="identity")+
  theme(axis.text.x=element_text(angle=90, hjust=1)) + ggtitle("Moby Dick word frequency (over 200 tokens, stopwords removed)")+
  xlab("Word")+
  ylab("Frequency")

In [None]:
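##Rank v frequency plots
dtm3 <- dtm2[order(-freq)] #this sorts by frequency like setkey(dtm,freq) above, but in descending order (most frequent word first)
dtm3[, rank:= .I] #we are adding a column for the frequency rank; .I is the row number, so the most frequent word gets rank 1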
dtm3
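
Using the row number as the rank gives every word its own rank, even when several words have exactly the same frequency. Whether tied words should share a rank is a design choice; the sketch below compares the two options in base R on made-up frequencies (the names are illustrative only).

```r
# Made-up frequencies, already sorted from most to least frequent
freq <- c(whale = 5, ship = 3, sea = 3, boat = 1)

# Row-number ranks: ties are broken by position in the table
rank_rows <- seq_along(freq)

# Shared ranks: tied words all get the lowest rank of their group
rank_min <- rank(-freq, ties.method = "min")

rank_rows
rank_min
```

For a rank v frequency plot the difference is mostly visible in the low-frequency tail, where many words share a frequency of 1 or 2.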

In [None]:
a<-ggplot(dtm3 ,aes(rank,freq))+
  geom_point(position=position_jitter(width=.05,height=.05))+ 
  ggtitle("Moby Dick word rank v frequency")+
  xlab("Rank")+
  ylab("Frequency")

In [None]:
d<-ggplot(dtm3 ,aes(rank,freq))+
  geom_point(position=position_jitter(width=.05,height=.05))+ 
  scale_x_log10()+
  scale_y_log10()+
  ggtitle("Moby Dick log-log plot")+
  xlab("Rank (on log scale)")+
  ylab("Frequency (on log scale)")

In [None]:
library(gridExtra)
grid.arrange(a,d, ncol = 2)
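
Why plot both axes on a log scale? If frequency falls off as a power of rank, the curve that hugs the axes on the ordinary plot becomes a straight line on log-log axes, which is much easier to judge by eye. The sketch below uses synthetic, idealized data (frequency exactly proportional to 1/rank, an assumption made purely for illustration, not a claim about Moby Dick) and recovers the slope of that line.

```r
# Synthetic, idealized data: frequency = 5000 / rank
word_rank <- 1:1000
word_freq <- 5000 / word_rank

# On log-log axes this relation is a straight line:
# log10(freq) = log10(5000) - 1 * log10(rank)
fit <- lm(log10(word_freq) ~ log10(word_rank))

slope     <- unname(coef(fit)[2])
intercept <- unname(coef(fit)[1])

slope      # close to -1 for this idealized data
intercept  # close to log10(5000)
```

Real word frequencies will not fall exactly on a line, so the interesting part is how far, and where, the points depart from it.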

In [None]:
##word cloud

library(wordcloud2)
dtm4<-dtm3[dtm3$freq >10,]
wordcloud2(data=dtm4, size = 1, color="random-dark", shape="triangle", hoverFunction=NULL)
