In this tutorial we're going to get started with some basic natural language processing (NLP) tasks. We're going to:

* Read in some helpful NLP libraries & our dataset
* Find out how often each author uses each word
* Use that to guess which author wrote a sentence

Ready? Let's get started! :D

## Read in some helpful NLP libraries & our dataset

For this tutorial, I'm going to be using the tidytext library, which is designed to work with the tidyverse. The tidyverse is a collection of packages built around the central principle that data should be stored in a table where each column is a single variable and each row is a single observation.

The tidytext package applies that philosophy to text data, and has a lot of nice helper functions that we're going to use. There's also a very helpful book, which you can read for free [here](http://tidytextmining.com/).
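
To make the "one row per observation" idea concrete, here's a minimal sketch of what tidy text looks like. The two sentences and the names `toy` and `toyTokens` are made up for illustration; only `tibble()` and `unnest_tokens()` come from the libraries.

```r
library(tibble)
library(tidytext)

# A made-up two-row table: one row per sentence (hypothetical data)
toy <- tibble(
  author = c("A", "B"),
  text   = c("The raven croaked.", "The sea was calm.")
)

# One row per word; by default tokens are lowercased and punctuation is dropped
toyTokens <- unnest_tokens(toy, words, text)
print(toyTokens)
```

Each word ends up on its own row next to the author who wrote it, which is the shape we'll be counting from later on.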

In [1]:
# libraries we'll need
library(tidytext)
library(tidyverse)
library(glue)

### Read in the data

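# read in our data (keeping strings as factors, since later code
# relies on levels(texts$author))
texts <- read.csv("../input/train.csv", stringsAsFactors = TRUE)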

# look at the first few rows
head(texts)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘glue’

The following object is masked from ‘package:dplyr’:

    collapse



id,text,author
id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP
id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL
id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP
id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS
id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL
id22965,"A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services.",MWS


## Find out how often each author uses each word

A lot of NLP applications rely on counting how often certain words are used. (The fancy term for this is "word frequency".) Let's look at the word frequency for each of the authors in our dataset. The tidytext library has lots of nice built-in functions and data structures for this that we can make use of.
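
If you'd like to see what "counting words per author" boils down to before we hand it over to tidytext, here's a rough sketch in plain base R. The four sentences and the names `toy`, `tokens` and `counts` are invented for this example, and splitting on whitespace is a much cruder tokenizer than the one tidytext uses.

```r
# Hypothetical mini-corpus: two authors, two sentences each
toy <- data.frame(
  author = c("A", "A", "B", "B"),
  text   = c("the night was dark", "the dark sea",
             "the sea was calm", "a calm night"),
  stringsAsFactors = FALSE
)

# Split each sentence on whitespace into a list of word vectors
wordList <- strsplit(tolower(toy$text), "\\s+")

# One row per word, keeping track of who wrote it
tokens <- data.frame(
  author = rep(toy$author, lengths(wordList)),
  words  = unlist(wordList),
  stringsAsFactors = FALSE
)

# Count how often each author uses each word, dropping unused combinations
counts <- as.data.frame(table(author = tokens$author, words = tokens$words),
                        responseName = "n", stringsAsFactors = FALSE)
counts <- counts[counts$n > 0, ]

# Term frequency: each count divided by that author's total word count
counts$tf <- counts$n / ave(counts$n, counts$author, FUN = sum)
print(counts[order(counts$author, -counts$n), ])
```

Dividing by each author's total matters because the authors don't contribute the same amount of text, so raw counts alone would favor whoever wrote the most.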

In [2]:
### Split data

# split the data by author
byAuthor <- group_by(texts, author)

### Calculate how often each author uses each word

freqByAuthor <-  texts %>%
    group_by(author) %>% # group by author
    select(text) %>% # grab only the text column (author is kept as the grouping variable)
    mutate(text = as.character(text)) %>% # convert factors to character strings
    unnest_tokens(words, text) %>% # tokenize
    count(words) %>% # frequency by token (by author)
    bind_tf_idf(words, author, n) # normalized frequency

Adding missing grouping variables: `author`


Now we can look at how often each writer uses specific words. Since this is a Halloween competition, how about "blood", "scream" and "fear"? 👻😨🧛‍♀️

In [3]:
### Look at how often each writer uses specific words

# see how often each author says "blood"
print(freqByAuthor[freqByAuthor$words == "blood",])

# see how often each author says "scream"
print(freqByAuthor[freqByAuthor$words == "scream",])

# see how often each author says "fear"
print(freqByAuthor[freqByAuthor$words == "fear",])

# A tibble: 3 x 6
# Groups:   author [3]
  author words     n           tf   idf tf_idf
  <fctr> <chr> <int>        <dbl> <dbl>  <dbl>
1    EAP blood    34 0.0001692763     0      0
2    HPL blood    40 0.0002559787     0      0
3    MWS blood    43 0.0002595051     0      0
# A tibble: 3 x 6
# Groups:   author [3]
  author  words     n           tf   idf tf_idf
  <fctr>  <chr> <int>        <dbl> <dbl>  <dbl>
1    EAP scream     4 1.991486e-05     0      0
2    HPL scream    16 1.023915e-04     0      0
3    MWS scream     5 3.017502e-05     0      0
# A tibble: 3 x 6
# Groups:   author [3]
  author words     n           tf   idf tf_idf
  <fctr> <chr> <int>        <dbl> <dbl>  <dbl>
1    EAP  fear    24 0.0001194892     0      0
2    HPL  fear    99 0.0006335473     0      0
3    MWS  fear   117 0.0007060954     0      0
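
You might be wondering why the `idf` and `tf_idf` columns are all zero. We passed `author` as the document column, so there are only three "documents", and all three authors use each of these words. As far as I know, tidytext computes idf as the natural log of (number of documents / number of documents containing the term), so here's a small sketch of that arithmetic under that assumption. The variable names are made up; the `tf` value is copied from the output above.

```r
# Number of "documents": bind_tf_idf() was given author as the document column
nAuthors <- 3

# "blood" shows up in the output above for all three authors
authorsUsingBlood <- 3

# Inverse document frequency, assuming the natural log is used
idfBlood <- log(nAuthors / authorsUsingBlood)

# tf for EAP's "blood", copied from the output above
tfBlood <- 0.0001692763
tfIdf <- tfBlood * idfBlood
print(c(idf = idfBlood, tf_idf = tfIdf))

# A hypothetical word used by only one of the three authors gets a positive idf
idfRare <- log(3 / 1)
print(idfRare)
```

That's why we'll be working with the `tf` column rather than `tf_idf` in the next section: with only three authors, tf-idf zeroes out every word they all share.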


## Use word frequency to guess which author wrote a sentence

The general idea is that different people tend to use different words more or less often. (I had a beloved college professor who was especially fond of "gestalt".) If you're not sure who said something, but it contains a lot of words that one person uses often, then you might guess that they were the one who wrote it.

Let's use this general principle to guess who might have been more likely to write the sentence "It was a dark and stormy night."
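
Before running this on the real data, here's a toy sketch of the idea: multiply together each author's frequency for every word in the sentence, and pick the author with the biggest product. All of the numbers and the names `toyFreq`, `unseenTf` and `jointProb` are invented for illustration, and the tiny stand-in frequency for unseen words is just an assumption.

```r
# Hypothetical per-author term frequencies (made-up numbers, not from the dataset)
toyFreq <- data.frame(
  author = c("A", "A", "A", "B", "B"),
  words  = c("dark", "stormy", "night", "dark", "night"),
  tf     = c(0.002, 0.0005, 0.003, 0.004, 0.001),
  stringsAsFactors = FALSE
)

sentenceWords <- c("dark", "stormy", "night")

# Assumed stand-in frequency for a word an author was never seen using
unseenTf <- 0.000001

jointProb <- function(who) {
  rows <- toyFreq[toyFreq$author == who, ]
  # look up each word of the sentence; words this author never used come back as NA
  tf <- rows$tf[match(sentenceWords, rows$words)]
  tf[is.na(tf)] <- unseenTf
  # joint probability, treating the words as independent
  prod(tf)
}

scores <- sapply(c(A = "A", B = "B"), jointProb)
print(scores)
names(scores)[which.max(scores)]
```

Author B uses "dark" more often than author A, but never uses "stormy", so the tiny stand-in frequency drags B's product well below A's.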

In [4]:
### Use this to make guesses about who wrote a given sentence

# One way to guess authorship is to use the joint probability that each
# author used each word in a given sentence.

# first, let's start with a test sentence
testSentence <- tibble(text = "It was a dark and stormy night.")

# and then tokenize it
preProcessedTestSentence <- unnest_tokens(testSentence, words, text)

# get the frequency for each word by author
testProbabilities <- freqByAuthor[freqByAuthor$words %in% preProcessedTestSentence$words,]

# if x of the test sentence's words are missing from an author's term freq. table,
# pad with x copies of a very low frequency & then take the joint probability
getJointProb <- function(author, testSentence, tf_idf){
  # get the term frequency for each word in the test sentence
  freq <- inner_join(tf_idf, testSentence)
  # select just the target author
  byAuthor <- freq[freq$author == author,]
  
  # number of returned terms
  returnedTerms <- nrow(byAuthor)
  # number of test sentence terms this author never used in our training corpus
  missingTerms <- length(testSentence$words) - returnedTerms
  
  # add a very small amount for every term in the test sentence we didn't see
  # in our training corpus
  if(missingTerms > 0){
    # making the smoothing term very low reflects the idea that we think it's
    # unlikely that the author would use this term, since we haven't seen them
    # use it before; each missing term contributes one factor of 0.000001
    smoothingTerm <- 0.000001 ^ missingTerms
  } else {
    # since we're taking the product, making the smoothing term 1 won't
    # change our results
    smoothingTerm <- 1
  }
  
  # return probability
  return(prod(c(byAuthor$tf, smoothingTerm)))
}

# empty variable to put our predictions in
authorEst <- NULL

# Joint uni-gram probability for each author
for(i in levels(texts$author)){
  authorEst <- c(authorEst, getJointProb(i, preProcessedTestSentence, freqByAuthor))
}

# and the winner is...
levels(texts$author)[which.max(authorEst)]


Joining, by = "words"
Joining, by = "words"
Joining, by = "words"


So based on what we've seen in our training data, it looks like, of our three authors, H.P. Lovecraft was the most likely to have written the sentence "It was a dark and stormy night."