## Section 0. Ensure Dependencies

In [1]:
install.packages('textclean', dependencies = TRUE, repos = 'http://cran.us.r-project.org')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘dtt’, ‘syuzhet’, ‘english’, ‘lexicon’, ‘qdapRegex’, ‘textshape’



In [2]:
library(tidytext)
library(magrittr)
library(dplyr)
library(stringr)
library(textclean)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



## Section 1. Data Preparation

### Step 1: Read Data

In [14]:
spooky <- read.csv('../data/spooky.csv', as.is = TRUE)
spooky$author <- spooky$author %>% as.factor()
spooky$text <- spooky$text %>% replace_contraction()

### Step 2: Process Data - Split sentences into words

We use `unnest_tokens()` to convert the data into tidy format (one word per row) for ease of analysis. The subsequent analyses need two data frames:
- `sws`: punctuation dropped and words converted to lower case, with stop words retained
- `soc`: punctuation dropped, every word containing an uppercase letter removed, and stop words removed (the remaining words are therefore all lower case)

In [19]:
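# Lowercased tokens with punctuation stripped; stop words are kept
sws <- spooky_with_stopwords <- spooky %>%
    unnest_tokens(word, text)
# Keep the original case, drop every token containing an uppercase letter,
# then remove stop words
soc <- spooky_original_case <- spooky %>%
    unnest_tokens(word, text, to_lower = FALSE) %>%
    drop_row('word', c('[A-Z]')) %>%
    anti_join(stop_words, by = "word")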
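
To illustrate what the two pipelines produce, here is a minimal sketch on a one-sentence toy data frame. The sentence and the `toy` objects are invented for illustration, and `filter()` with `str_detect()` is used as a stand-in for `drop_row()` so that the example needs only `dplyr`, `stringr`, and `tidytext`.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical one-row input; not part of the spooky data set
toy <- data.frame(id = "toy1", author = "EAP",
                  text = "The raven croaked at Me!",
                  stringsAsFactors = FALSE)

# Pipeline 1: lower case, punctuation dropped, stop words kept
toy_sws <- toy %>%
    unnest_tokens(word, text)

# Pipeline 2: keep case, drop tokens containing an uppercase letter,
# then remove stop words
toy_soc <- toy %>%
    unnest_tokens(word, text, to_lower = FALSE) %>%
    filter(!str_detect(word, "[A-Z]")) %>%
    anti_join(stop_words, by = "word")

toy_sws$word
toy_soc$word
```

In this toy case the second pipeline discards `The` and `Me` because of their capital letters before the stop-word filter is applied, so capitalized sentence-initial words never reach the `soc`-style result.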

## Section 2. "Gothic" Keywords Comparison

In this section, we compare the relative frequencies of the British Gothic keyword categories identified by [Jones (2010)](https://www.era.lib.ed.ac.uk/bitstream/handle/1842/5351/Dissertation_Final.pdf;sequence=1):
- Pronouns (especially first-person and second-person)
    - Examples: I, me, myself, you, yourself, etc.
- Vocatives (names, titles, etc.)
    - Examples: Adrian, Miss, Mr, etc.
- Body parts 
    - Examples: Arms, foreheads, lips, etc.
- "Gothic" words (words related to a dark and supernatural atmosphere)
    - Examples: Fear, curiosity, silence, dark, dead, spirit

Because vocatives, especially names, are hard to identify within the time constraints of this project, we examine only the other three categories of keywords. In all cases, we define a word's relative frequency as
$$
    \operatorname{rel\_freq}(\text{word}, \text{author})
    = \frac{\text{number of occurrences of word by author}}{\text{total number of words by author}}
$$
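
As a worked example of this definition, the sketch below computes relative frequencies on a tiny invented token table. The `toy_tokens` data and the single-pipeline approach with `count()` and `mutate()` are illustrative assumptions, not the code used in this notebook.

```r
library(dplyr)

# Hypothetical tidy tokens: one row per word occurrence
toy_tokens <- data.frame(
    author = c("EAP", "EAP", "EAP", "HPL"),
    word   = c("the", "the", "raven", "the"),
    stringsAsFactors = FALSE
)

toy_freq <- toy_tokens %>%
    count(author, word, name = "count") %>%   # occurrences of each word per author
    group_by(author) %>%
    mutate(total_count = sum(count),          # total number of words by that author
           rel_freq = count / total_count) %>%
    ungroup()

toy_freq
```

Here `the` accounts for 2 of EAP's 3 tokens, so its relative frequency for EAP is 2/3, and the relative frequencies for each author sum to 1.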

In [20]:
head(sws)

|     | id      | author | word     |
|-----|---------|--------|----------|
| 1.0 | id26305 | EAP    | this     |
| 1.1 | id26305 | EAP    | process  |
| 1.2 | id26305 | EAP    | however  |
| 1.3 | id26305 | EAP    | afforded |
| 1.4 | id26305 | EAP    | me       |
| 1.5 | id26305 | EAP    | no       |


In [27]:
# Word counts used for the pronoun comparison
# Total number of words per author
author_words <- sws %>%
    group_by(author) %>%
    count()
# Number of occurrences of each word, per author
author_word_count <- sws %>%
    group_by(author, word) %>%
    count()

In [36]:
author_word_counts <- author_word_count %>% 
    select(author, word, count = n) %>%
    left_join(author_words %>% 
                 select(author, total_count = n),
              by = 'author')

In [37]:
author_word_counts$rel_freq = author_word_counts$count / author_word_counts$total_count

In [41]:
head(author_word_counts[order(-author_word_counts$rel_freq),])

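| author | word | count | total_count | rel_freq   |
|--------|------|-------|-------------|------------|
| EAP    | the  | 14993 | 200988      | 0.07459649 |
| HPL    | the  | 10933 | 156608      | 0.06981125 |
| MWS    | the  | 9659  | 165700      | 0.05829209 |
| EAP    | of   | 8972  | 200988      | 0.04463948 |
| HPL    | and  | 6098  | 156608      | 0.03893799 |
| HPL    | of   | 5846  | 156608      | 0.03732887 |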
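
As a quick sanity check, the first row of this output can be reproduced by hand from its `count` and `total_count` values. The variable names below are introduced only for this check.

```r
# Values copied from the EAP / "the" row of the output
eap_the_count <- 14993
eap_total     <- 200988

# Relative frequency = occurrences of the word / total words by the author
round(eap_the_count / eap_total, 8)
```

The result agrees with the `rel_freq` value of 0.07459649 reported for that row.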


## Section 3. Gothic keywords comparison

## Section 4. Statistical Test

## Section 5. k-means on sentence sentiments