# Working with Text Data

## Introduction

Text data is everywhere: social media posts, customer reviews, news articles, survey responses, medical records, legal documents, and more.  The ability to extract meaningful insights from this unstructured data is a crucial skill in data science.  Working with text enables us to perform tasks like:

- **Sentiment Analysis:** Determining the emotional tone (positive, negative, neutral) of text. Used to understand customer opinions, brand perception, and public sentiment.
- **Topic Modeling:** Discovering the underlying themes or topics discussed in a collection of documents. Useful for organizing large text corpora, identifying trends, and understanding customer feedback.
- **Document Classification:** Categorizing documents into predefined classes (e.g., spam/not spam, news article categories).
- **Named Entity Recognition:** Identifying and classifying named entities (people, organizations, locations, dates, etc.) in text.
- **Building Chatbots and Virtual Assistants:** Natural language processing (NLP) is fundamental to these applications.
- **Search Engines:** Building search engines and improving search relevance.

This module will provide you with the foundational skills to effectively work with text data in R using `tidyverse` methods. While some base R functions can perform similar tasks, we will focus on the tidyverse approach for consistency, readability, and ease of use.

### Learning Objectives
By the end of this notebook, you will be able to:

- Understand how text is represented in R.
- Use the `stringr` package to perform common string manipulation tasks.
- Apply basic regular expressions for pattern matching.
- Convert text data into a tidy format using `tidytext`.
- Perform fundamental text analysis using tidyverse tools.

### Packages

We'll make use of several different packages in this notebook:

- `stringr`: Part of the tidyverse, stringr provides a consistent and user-friendly set of functions for working with strings. It simplifies many common text manipulation tasks. We will use this extensively.
- `tidytext`: This package provides tools for converting text data into a tidy format (one-token-per-row), making it easy to integrate text analysis with other tidyverse workflows.

You'll need to install the `tidytext` package before you work through this notebook. You should be familiar with installing packages through Anaconda by now, but I'll add a page on Canvas explaining the process.

In [None]:
suppressMessages({ # hide all the startup messages
    library(tidyverse) # attaches stringr too
    library(tidytext)
})

## Representing Text in R

As you've seen in earlier modules, the fundamental way to represent text in R is through **character vectors**. Each element of the vector is a separate string. Even a single piece of text is typically represented as a character vector of length one. We can use the `str_length()` function to count the number of characters in a string:

In [None]:
single_string = "This is a single string."

cat("single string: \n")

print(single_string)

cat(
    paste0(
        "\nclass: ", class(single_string),
        "\nlength: ", length(single_string),
        "\nnum chars: ", str_length(single_string),
        "\n\n"
    )
)

cat("multiple strings: \n")

multiple_strings = c("This is the first string.", "This is the second.", "And this is the third.")
print(multiple_strings)

cat(
    paste0(
        "\nclass: ", class(multiple_strings),
        "\nlength: ", length(multiple_strings),
        "\nnum chars: ", paste0(str_length(multiple_strings), collapse = ",")
    )
)

> Base R also provides `nchar()` to count the number of characters in a string. However, `nchar()` doesn't handle `NA`s or factors well (for example, `nchar(NA)` returns 2, and passing a factor raises an error), so you'd need to do a little extra work to ensure you're getting the right answer.
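The difference is easiest to see side by side. The following is a small sketch; the inputs `with_na` and `as_factor` are made up for illustration.

```r
library(stringr)

# Hypothetical inputs for illustration
with_na = c("apple", NA)
as_factor = factor("abc")

# str_length() propagates NA and accepts factors
len_na = str_length(with_na)
len_factor = str_length(as_factor)

# nchar() treats a bare logical NA as the two-character string "NA"
nchar_bare_na = nchar(NA)

# nchar() refuses factors, so we catch the error instead of stopping
nchar_factor = tryCatch(nchar(as_factor), error = function(e) "error")

print(len_na)
print(len_factor)
print(nchar_bare_na)
print(nchar_factor)
```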

You can use either single quotes (`'...'`) or double quotes (`"..."`) to create a string. The main difference arises when you want to include quotes within your string. If you want a single quote inside your string, you can enclose the string in double quotes, and vice versa:

In [None]:
# Single quote inside double quotes
quote_example1 = "It's a beautiful day.\n"
cat(quote_example1)

# Double quote inside single quotes
quote_example2 = 'He said, "Hello, world!"'
cat(quote_example2)

What if you want both single and double quotes inside your string, or you need to include other special characters like newlines or tabs? This is where escaping comes in. You've seen this already with `\n` to create a newline: you use a backslash (`\`) to "escape" the special meaning of the character that follows it. Here are some common escape sequences:

- `\n`: Newline (starts a new line)
- `\t`: Tab (inserts a tab space)
- `\\`: A literal backslash
- `\"`: A literal double quote (when using double quotes to define the string)
- `\'`: A literal single quote (when using single quotes to define the string)

In [None]:
escaped_string = "This is the first line.\nThis is the second line.\tThis is tabbed.\nShe said, \"It's a beautiful day!\""
cat(escaped_string) 

cat("\n\n")

escaped_string_single_quotes = 'What\'s this, a literal backslash: \\ and did I just escape a single quote?!'
cat(escaped_string_single_quotes)
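One point that often causes confusion is that `print()` shows a string as you would type it (escapes included), while `cat()` writes the characters those escapes stand for. The sketch below illustrates this with a made-up string `escaped`, and also checks that each escape sequence is stored as a single character.

```r
library(stringr)

# Hypothetical example string
escaped = "Tab:\tdone\n"

print(escaped)  # displays the escape sequences as typed, wrapped in quotes
cat(escaped)    # writes an actual tab and an actual newline

# Each escape sequence counts as ONE character in the string
n_newline = str_length("\n")
n_backslash = str_length("\\")
print(c(n_newline, n_backslash))
```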

## Basic String Manipulation with `stringr`

The `stringr` package, part of the `tidyverse`, provides a set of functions designed to make working with strings in R easier and more consistent.  `stringr` functions are vectorized, meaning they operate on entire character vectors at once, which is very efficient. They also generally follow a consistent naming convention: all `stringr` functions start with `str_`. We'll focus on common string manipulation tasks and the corresponding stringr functions.

### Concatenation
Concatenation is the process of joining strings together. We've seen how to do this using the `paste()` and `paste0()` functions. The primary function for this in `stringr` is `str_c()`.

In [None]:
# Basic concatenation
string1 = "Hello"
string2 = "world"
combined_string = str_c(string1, string2)
print(combined_string)

# Using the 'sep' argument to add a separator
combined_string_with_space = str_c(string1, string2, sep = " ")
print(combined_string_with_space)

# Concatenating multiple strings
apology = "I'm sorry,"
name = "Dave."
explanation = "I'm afraid I can't do that."
hal_apology = str_c(apology, name, explanation, sep = " ")
print(hal_apology)

# Using the 'collapse' argument to combine a vector into a single string
words = c("What", "a", "strange", "movie.")
sentence = str_c(words, collapse = " ")
print(sentence)

`str_c()` is vectorized, which makes element-wise operations on whole vectors easy:

In [None]:
# Vectorized concatenation
first_names = c("Alice", "Bob", "Charlie")
last_names = c("Smith", "Jones", "Brown")
full_names = str_c(first_names, last_names, sep = " ")
print(full_names)
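Two related behaviors are worth knowing when concatenating vectors: a length-one argument is recycled across the longer vector, and a missing value in any input makes the corresponding result missing. A minimal sketch, using made-up inputs:

```r
library(stringr)

# A single prefix is recycled across every element of the vector
labels = str_c("item_", c("a", "b", "c"))
print(labels)

# NA is contagious: any missing piece makes that result NA
with_missing = str_c("Hello, ", c("Alice", NA))
print(with_missing)
```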

### Subsetting

You can extract parts of strings using `str_sub()`. You specify the start and end positions (inclusive). Negative indices count from the end of the string.

In [None]:
# Extracting a substring
text = "This is an example string."
substring1 = str_sub(text, 1, 4)  # Extract characters 1 to 4
print(substring1)

substring2 = str_sub(text, 6, 7)  # Extract characters 6 to 7
print(substring2)

substring3 = str_sub(text, -7, -1) # Extract the last 7 characters
print(substring3)

#str_sub works on vectors too.
str_sub(full_names, 1, 4)
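`str_sub()` also has an assignment form, so the same start and end positions can be used to overwrite part of a string. The sketch below uses a made-up string `greeting`.

```r
library(stringr)

# Hypothetical example string
greeting = "hello world"

# Overwrite the first character
str_sub(greeting, 1, 1) = "H"

# Negative positions work here too: -5 is the fifth character from the end
str_sub(greeting, -5, -5) = "W"

print(greeting)
```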

### Case Conversion

Change the case of strings using `str_to_lower()`, `str_to_upper()`, and `str_to_title()`.

In [None]:
# Case conversion
mixed_case = "ThIs Is MiXeD cAsE?"
lowercase = str_to_lower(mixed_case)
print(lowercase)

uppercase = str_to_upper(mixed_case)
print(uppercase)

titlecase = str_to_title(mixed_case)
print(titlecase)

### Whitespace Handling

Remove unnecessary whitespace with `str_trim()` and `str_squish()`. `str_trim()` removes leading and trailing whitespace. `str_squish()` also removes extra spaces within the string.

In [None]:
# Whitespace trimming
messy_string = "   Too much   whitespace!  "
trimmed_string = str_trim(messy_string)
cat(trimmed_string, "\n") # Shows more clearly the whitespace is removed.

squished_string = str_squish(messy_string)
cat(squished_string, "\n")

# str_trim() can also trim from only one side
left_trimmed = str_trim(messy_string, side = "left")
cat("Original: [", messy_string, "]\n", sep = "") # Show boundaries clearly
cat("Left Trimmed: [", left_trimmed, "]\n", sep = "")

### String Replacement

`str_replace()` and `str_replace_all()` will replace matched patterns in your string(s).

In [None]:
original_string = "I like apples, apples are good."

# Replace the first instance
you_like_oranges_or_apples = str_replace(original_string, "apples", "oranges")
print(you_like_oranges_or_apples)

# Replace all instances
you_like_oranges = str_replace_all(original_string, "apples", "oranges")
print(you_like_oranges)
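When several different replacements are needed, `str_replace_all()` also accepts a named character vector, where each name is a pattern and each value is its replacement. A short sketch reusing the sentence from above:

```r
library(stringr)

original = "I like apples, apples are good."

# Names are the patterns to find; values are what to put in their place
swapped = str_replace_all(original, c("apples" = "oranges", "good" = "great"))
print(swapped)
```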

### String Duplication
`str_dup()` will duplicate, then concatenate, a string a set number of times.

In [None]:
my_string = "repeat me!"

#Duplicate a string 3 times
duplicated_string = str_dup(my_string, 3)
print(duplicated_string)

#You can also provide a vector
times = c(1,2,3)
str_dup(my_string, times)

### String Padding

`str_pad()` can be used to add padding to strings, such as spaces or zeros.

In [None]:
my_numbers = c(1, 10, 100, 1000)

#pad numbers with zeros
padded_numbers = str_pad(my_numbers, width = 4, side = "left", pad = "0")
print(padded_numbers)

### Splitting Strings

`str_split()` splits strings based on a delimiter.

In [None]:
my_string = "This-is-my-delimited-string"

cat("Split on the dash character.\n")
split_string = str_split(my_string, "-")
print(split_string)

cat("\nstr_split() also returns a list, because there can be different numbers of splits.\n")
split_multiple = str_split(multiple_strings, " ")
print(split_multiple)
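Because the result is a list, a few extra steps are often useful. The sketch below shows three options under simple assumptions: pulling out the first list element with `[[1]]`, limiting the number of pieces with `n`, and asking for a character matrix with `simplify = TRUE`.

```r
library(stringr)

delimited = "This-is-my-delimited-string"

# Take the first (and only) list element to get a plain character vector
pieces = str_split(delimited, "-")[[1]]
print(pieces)

# n caps the number of pieces; the remainder stays in the last piece
first_two = str_split(delimited, "-", n = 2)[[1]]
print(first_two)

# simplify = TRUE returns a matrix, padding short rows with ""
as_matrix = str_split(c("a-b", "c-d-e"), "-", simplify = TRUE)
print(as_matrix)
```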

### Detecting Patterns

You can use `str_detect()` to *detect* patterns in a string. It'll return a logical vector (TRUE/FALSE) indicating whether the pattern is found in each string.

In [None]:
# Detecting the presence of "apple"
fruits = c("apple", "banana", "orange", "grapefruit")
has_apple = str_detect(fruits, "apple")
print(has_apple)

`str_starts()` and `str_ends()` are convenient shortcuts for checking if a string starts or ends with a specific pattern.

In [None]:
# Using str_starts()
starts_with_b = str_starts(fruits, "b")
print(starts_with_b)

# Using str_ends()
ends_with_e = str_ends(fruits, "e")
print(ends_with_e)
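Since these functions return logical vectors, they combine naturally with subsetting and counting. A brief sketch using the same `fruits` vector:

```r
library(stringr)

fruits = c("apple", "banana", "orange", "grapefruit")

# Keep only the elements that contain "an"
contains_an = fruits[str_detect(fruits, "an")]
print(contains_an)

# str_subset() is a shortcut for the same operation
same_result = str_subset(fruits, "an")

# TRUE counts as 1, so sum() gives the number of matching strings
n_with_e = sum(str_detect(fruits, "e"))
print(n_with_e)
```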

### Intro to Regular Expressions (regex)

Regular expressions (regex) are a powerful way to describe patterns in text. `stringr` functions use regular expressions by default. We'll introduce some basic concepts here, but you can dig into [this chapter](https://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions) from Hadley Wickham's *R for Data Science* to learn more.  

- **Patterns and Matching:** A regular expression is a sequence of characters that defines a search pattern. The goal is to match this pattern within a string.
- **Basic Metacharacters:** These are some of the characters that have special meanings in regular expressions:
    - `.` (dot): Matches any single character (except newline).
    - `^`: Matches the beginning of the string.
    - `$`: Matches the end of the string.
    - `*`: Matches 0 or more repetitions of the preceding character/group.
    - `+`: Matches 1 or more repetitions.
    - `?`: Matches 0 or 1 repetition.
    - `[]`: Character set (e.g., `[aeiou]` matches any vowel).
    - `[^ ]`: Negated character set (e.g., `[^aeiou]` matches any non-vowel).
    - `|`: OR operator (e.g., `a|b` matches either "a" or "b").
    - `()`: Grouping.
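    - `\d`: Matches any digit from 0 to 9 (written as `"\\d"` inside an R string). A quantifier in braces sets how many repetitions to match, e.g. `\d{3}` matches exactly three digits.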
- `regex()` **Modifiers:** Using the `regex()` helper function allows us to apply modifiers, such as case insensitivity.

In [None]:
cat("Examples using metacharacters\n")
text = c("apple", "banana", "apricot", "avocado", "Pineapple")
print(text)
cat("\n", str_dup("-", 75), "\n", sep = "")  # print a divider line

cat("Match strings starting with 'a'\n")
str_detect(text, "^a") 

cat("\nMatch strings ending with 'a'\n")
str_detect(text, "a$")

cat("\nMatch strings containing 'app' followed by any character\n")
str_detect(text, "app.")

cat("\nMatch strings containing any of the vowels 'e', 'i', or 'o'\n")
str_detect(text, "[eio]")

cat("\nCase-insensitive search for 'p'\n")
str_detect(text, regex("p", ignore_case = TRUE))

cat("\nMatch 'a' followed by one or more 'n' characters\n")
str_c(c("bannnnnana: ", str_detect("bannnnnana", "an+"), "\n")) %>% cat()
str_c(c("bana: ", str_detect("bana", "an+"), "\n")) %>% cat()
str_c(c("ba: ", str_detect("ba", "an+"), "\n")) %>% cat()
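A quantifier such as `+` applies only to the single character (or group) immediately before it, so `an+` means "a, then one or more n", whereas `(an)+` repeats the whole pair. The sketch below contrasts the two and also tries `{}`, `|`, and `?` on made-up inputs.

```r
library(stringr)

# {4} with ^ and $ requires the whole string to be exactly four digits
years = str_detect(c("2024", "12", "abc"), "^\\d{4}$")
print(years)

# Without a group, + repeats only the "n"
ungrouped = str_extract("banana", "an+")
# With a group, + repeats the whole "an" pair
grouped = str_extract("banana", "(an)+")
print(c(ungrouped, grouped))

# | matches either alternative
either = str_detect(c("cat", "dog", "bird"), "cat|dog")
print(either)

# ? makes the preceding character optional
optional_u = str_detect(c("color", "colour", "colr"), "colou?r")
print(optional_u)
```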

To match a literal metacharacter (like a literal dot `.`), you need to escape it with a backslash. Because the backslash is itself an escape character in R strings, you write two backslashes in the string: `"\\."` is stored as `\.`, which the regex engine reads as a literal dot.

In [None]:
# Matching a literal dot
str_detect("This is a sentence with a period.", "\\.")  # Matches because the sentence ends with a dot

str_detect("This is a sentence without periods", ".") # Misleading: an unescaped dot matches ANY character, so this is TRUE

str_detect("This is a sentence without a period", "\\.") # Correct: the escaped dot finds no literal period, so this is FALSE
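If you want the whole pattern treated as literal text, an alternative to escaping is the `fixed()` helper from `stringr`, which turns off regular expression matching for that pattern. A small sketch, with made-up file names:

```r
library(stringr)

# Hypothetical file names
files = c("report.csv", "reportXcsv")

# As a regex, the dot matches any character, so both strings match
regex_dot = str_detect(files, "report.csv")

# Escaping the dot restricts the match to a literal period
escaped_dot = str_detect(files, "report\\.csv")

# fixed() compares the pattern as plain text, no escaping needed
fixed_dot = str_detect(files, fixed("report.csv"))

print(regex_dot)
print(escaped_dot)
print(fixed_dot)

# writeLines() shows the pattern as the regex engine receives it
writeLines("report\\.csv")
```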

### Extracting Matches

`str_extract()` extracts the *first* match of a pattern while `str_extract_all()` extracts *all* matches (returns a list). 

In [None]:
# Extracting the first match
text = "My phone number is 555-123-4567, and my other number is 555-987-6543."
first_phone_number = str_extract(text, "\\d{3}-\\d{3}-\\d{4}") # Matches a phone number format
print(first_phone_number)

# Extracting all matches (returns a list)
all_phone_numbers = str_extract_all(text, "\\d{3}-\\d{3}-\\d{4}")
print(all_phone_numbers)

# str_extract_all returns a list because each element can have differing numbers of matches.
another_example = c("One match here.", "Two matches here and here.", "No matches.")
str_extract_all(another_example, "here")
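When you need the individual pieces of a match rather than the whole thing, grouping with `()` can be combined with `str_match()`, which returns a character matrix: the first column holds the full match and each later column holds one group. This is a brief sketch that goes slightly beyond the functions covered above; the variable names are made up.

```r
library(stringr)

phone = "My phone number is 555-123-4567."

# Each pair of parentheses captures one part of the number
parts = str_match(phone, "(\\d{3})-(\\d{3})-(\\d{4})")
print(parts)

# Column 1 is the full match; column 2 is the first group
area_code = parts[1, 2]
print(area_code)
```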

## Intro to `tidytext`

The `tidytext` package provides tools for converting text data into a tidy format.  The core idea of tidy data (as applied to text) is:

- **One token per row:** Each row represents a single token, which is usually a word, but could also be a sentence, an [n-gram](https://en.wikipedia.org/wiki/N-gram), or another unit of text.
- **Document-term matrix:** This tidy format facilitates creating a document-term matrix, which is a fundamental structure for many text analysis techniques.

This tidy structure allows us to seamlessly integrate text analysis with other tidyverse packages like `dplyr` for filtering, counting, and summarizing.

We'll focus on the key function `unnest_tokens()`. This function takes a data frame (or tibble) and converts a text column into a tidy format. Let's start with a simple example:

In [None]:
# Create a small data frame with a text column
text_data = tibble(
  document = 1:3,
  text = c("This is the first document.",
           "The second document is here.",
           "And the third document.")
)

# Tokenize the 'text' column into words
tidy_text = text_data %>%
  unnest_tokens(output = word, input = text)

# Print the tidy data
print(tidy_text)

A lot happened in a few lines of code:

- In `text_data`, we create an example where we imagine having multiple documents stored in our tibble.
- The `unnest_tokens()` function performs the tokenization:
    - `output = word`:  This argument specifies the name of the new column that will contain the tokens (the individual words by default). We're calling it `word`.
    - `input = text`: This argument specifies the name of the column in our input tibble (`text_data`) that contains the text we want to tokenize. We're using the `text` column.
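To see why the one-token-per-row layout is convenient, the sketch below counts words per document with `dplyr` and then spreads the counts into a wide, document-term-style table with `tidyr::pivot_wider()`. This is only an illustration: the tibble `docs` and its `doc_id` column are made up here (the id column is deliberately not called `document`, because "document" is also one of the words and would collide as a column name).

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical example documents
docs = tibble(
  doc_id = 1:3,
  text = c("This is the first document.",
           "The second document is here.",
           "And the third document.")
)

# One row per (document, word) pair with its count
word_counts = docs %>%
  unnest_tokens(output = word, input = text) %>%
  count(doc_id, word)

# One row per document, one column per word, 0 where a word is absent
wide_counts = word_counts %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)

print(wide_counts)
```

For real corpora this wide table gets very sparse, so treat it as a way to picture the structure rather than a recommended storage format.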

There are a few other arguments we could specify in the `unnest_tokens()` function. The `token` argument is important as it controls how the text is split. The default is `"words"`, but some other options include:

- `"sentences"`: Splits the text into sentences.
- `"characters"`: Splits the text into individual characters.
- `"ngrams"`: Splits the text into n-grams (sequences of n words).
- `"lines"`: Splits the text into lines.

In [None]:
# Tokenizing by sentences
sentences = text_data %>%
  unnest_tokens(output = sentence, input = text, token = "sentences")
print(sentences)

# Tokenizing by characters
characters = text_data %>%
  unnest_tokens(output = char, input = text, token = "characters")
print(characters)

# Tokenizing into n-grams (in this case, bigrams - pairs of words)
bigrams = text_data %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
print(bigrams)
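The `"lines"` option needs text that actually contains line breaks, and `unnest_tokens()` lowercases tokens unless told otherwise. The sketch below shows both points with a made-up two-line string, using the `to_lower` argument to keep the original capitalization.

```r
library(dplyr)
library(tidytext)

# Hypothetical text with an embedded newline
poem = tibble(text = "First line here.\nSecond line here.")

# Tokenizing by lines: one row per line of text
by_line = poem %>%
  unnest_tokens(output = line, input = text, token = "lines")
print(by_line)

# to_lower = FALSE keeps the original capitalization of each word
cased_words = poem %>%
  unnest_tokens(output = word, input = text, to_lower = FALSE)
print(cased_words)
```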

Let's look at a simple but powerful example: counting word frequencies. We'll use a dataset from the [`janeaustenr`](https://cran.r-project.org/web/packages/janeaustenr/index.html) package that gives the text of Jane Austen's 6 completed, published novels in a one-row-per-line format. I've included the dataset in the `austen-books.rds` file, so make sure you have that downloaded before proceeding. An excellent analysis of this dataset is done in the [tidytext vignette](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html). I encourage you to check it out!

In [None]:
tb_austen = readRDS("austen-books.rds")

tb_austen %>% 
    head()

# Filter out common "stop words" (like "the", "is", "a")
# using the built-in stop_words dataset from tidytext
stop_words %>% 
    head()

filtered_word_counts = tb_austen %>%
    unnest_tokens(output = word, input = text) %>% 
    anti_join(stop_words, by = "word") %>%  # Remove stop words
    count(word, sort = TRUE)

filtered_word_counts %>% 
    head()

options(repr.plot.width = 12, repr.plot.height = 8)
theme_set(theme_bw(base_size = 16))

# Visualize the top 10 words
top_10_plot = filtered_word_counts %>%
  slice_max(n, n = 10) %>%  # keep the 10 most frequent words (ties included)
  mutate(word = reorder(word, n)) %>% # order the bars by frequency
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +  # Horizontal bars
  labs(title = "Top 10 Words (Excluding Stop Words)",
       x = "Word",
       y = "Frequency")
print(top_10_plot) # print() is explicit here; in a notebook, evaluating top_10_plot on its own line also displays it
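Names of characters often dominate word counts in novels. One way to handle this is to extend the built-in `stop_words` table with your own entries before the `anti_join()`. The sketch below is an assumption-laden illustration: it uses a made-up one-sentence tibble `toy` rather than the Austen data, and the choice of "emma" as a custom stop word is hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for a larger corpus
toy = tibble(text = "Emma walked to the garden and Emma smiled.")

# Add our own entries, matching the columns of stop_words (word, lexicon)
custom_stop_words = bind_rows(
  stop_words,
  tibble(word = "emma", lexicon = "custom")
)

toy_counts = toy %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(custom_stop_words, by = "word") %>%  # drop built-in and custom stop words
  count(word, sort = TRUE)

print(toy_counts)
```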

## Conclusion

In this notebook, we've covered the fundamentals of working with text data in R, with a strong emphasis on the tidyverse approach. We learned:

- **Text Representation**: How text is represented in R using character vectors.
- **`stringr` Fundamentals**: How to use the `stringr` package for common string manipulation tasks like concatenation, subsetting, case conversion, whitespace handling, pattern detection (using basic regular expressions), and string extraction.
- **Tidy Text with `tidytext`**: How to convert text data into a tidy format (one-token-per-row) using `unnest_tokens()`, enabling seamless integration with other tidyverse tools. We explored different tokenization options ("words", "sentences", "characters", "ngrams") and counted word frequencies after removing stop words.

In the data wrangling notebook, we'll take a deeper dive into some of the tools in `tidytext` and other packages for more advanced analysis.