This project demonstrates basic text analysis using R, covering tokenization, word frequency analysis, and stopword filtering. The analysis is applied to an excerpt from "Frankenstein" by Mary Shelley and aims to provide a practical introduction to Natural Language Processing (NLP) concepts and techniques in R.
Text analysis, a subset of Natural Language Processing (NLP), involves processing and analyzing large collections of text to extract meaningful information. This project covers essential techniques such as tokenization, word frequency analysis, and stopword filtering.
- Tokenization: the process of breaking text down into smaller units, such as words or sentences. In this project, we use tokenization to split the text into individual words.
- Word Frequency Analysis: counting the occurrences of each word in the text. Analyzing word frequencies reveals the most common words and gives insight into the text's content.
- Stopword Filtering: stopwords are common words (e.g., "and", "the", "is") that usually carry little meaning and are often removed so the analysis can focus on more meaningful words. This project demonstrates how to filter them out.
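Although this project tokenizes at the word level, the same tokenizers package can also split text into sentences; a quick illustrative sketch (the sample sentence here is made up):

```r
library(tokenizers)

# Split a short passage into sentences rather than words;
# returns a one-element list holding the two sentences
tokenize_sentences("I arrived here yesterday. The weather was fine.")
```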
To run this project, you need to have R and RStudio installed on your system. You also need to install the following R packages:
install.packages("tidyverse")
install.packages("tokenizers")
- Clone the repository to your local machine.
- Open the R script in RStudio.
- Run the script to perform the text analysis.
First, we load the necessary packages and create a variable containing a sample text.
library(tokenizers)
library(tidyverse)
text <- "You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking."
We then tokenize the text into words.
words <- tokenize_words(text)
words <- words[[1]]  # tokenize_words() returns a list with one element per input document
Next, we create a table of word frequencies and convert it into a data frame.
tab <- table(words)
tab <- tibble(word = names(tab), count = as.numeric(tab))  # data_frame() is deprecated; use tibble()
tab <- arrange(tab, desc(count))
We load a reference dataset of corpus word frequencies and join it with our word-frequency table; words that are very common in the reference corpus are treated as stopwords and filtered out.
wordfreq <- read_csv("https://raw.githubusercontent.com/BrockDSL/R_for_Text_Analysis/master/wordfrequency.csv")
tab <- inner_join(tab, wordfreq, by = "word")  # join on the shared "word" column
filtered_tab <- filter(tab, frequency < 0.01)
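The same idea can be shown with an explicit stopword list instead of a corpus-frequency threshold; a minimal sketch (this short vector is illustrative, not the list implied by the CSV):

```r
# Illustrative hand-made stopword list (not the reference dataset's list)
stopwords <- c("the", "and", "is", "to", "of", "a", "in", "my")

# Keep only rows whose word is not in the stopword list
filtered_tab <- filter(tab, !(word %in% stopwords))
```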
We create a function that performs the entire text analysis process, including tokenization, word frequency analysis, and filtering of stopwords.
top_words <- function(fulltext) {
  # Tokenize the text into a vector of words
  words <- tokenize_words(fulltext)
  words <- words[[1]]
  # Count occurrences of each word and sort by count, descending
  tab <- table(words)
  tab <- tibble(word = names(tab), count = as.numeric(tab))
  tab <- arrange(tab, desc(count))
  # Join against reference corpus frequencies and drop common (stopword) terms
  wordfreq <- read_csv("https://raw.githubusercontent.com/BrockDSL/R_for_Text_Analysis/master/wordfrequency.csv")
  tab <- inner_join(tab, wordfreq, by = "word")
  return(filter(tab, frequency < 0.01))
}
# Try out your new function by running on the text variable
top_words(text)
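To run the analysis on the full novel rather than the sample passage, the plain-text edition can be read in from Project Gutenberg first; a sketch (the URL below is an assumption pointing at Project Gutenberg's ebook #84 and may change over time):

```r
# Read the full text of Frankenstein from Project Gutenberg (URL is an assumption)
frankenstein <- paste(readLines("https://www.gutenberg.org/files/84/84-0.txt"), collapse = " ")

# Apply the analysis function to the whole novel
top_words(frankenstein)
```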
This project provides a practical introduction to basic text analysis techniques using R, suitable for beginners and those looking to explore NLP concepts.