# Applied Data Wrangling 06

## Introduction

In this notebook, we'll take a look at the [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). We'll start by briefly exploring the data, then we'll clean it and perform some sentiment analysis. Next, we'll produce some word clouds using the `wordcloud` package. Finally, we'll train a classification model using tools from `tidymodels` and `textrecipes` to classify positive and negative reviews.

> The paper for the IMDB dataset can be found [here](https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf). The researchers only consider highly polarized reviews in constructing their dataset: a score $\leq$ 4 out of 10 is labeled negative, while a score $\geq$ 7 out of 10 is labeled positive. The dataset is balanced, so a random guess as to whether a review is positive or negative has a 50% chance of being right. Additionally, no more than 30 reviews per movie were included in the dataset. The task for this dataset is to correctly predict the label (negative or positive) from the text of the review. In their 2011 paper, the best model the researchers tested had a classification accuracy of 88.89%.

Before beginning this analysis, you'll need to...

- download the `imdb-dataset.rds` from Canvas
- install the `r-wordcloud=2.6` library
- install the `r-textrecipes=1.0.7` library

Be sure to specify the version number of the packages when you install. You should know the drill by now on how to install the packages. However, you'll find instructions on Canvas should you need them. Now, let's begin!

## Data Exploration and Preparation

In [None]:
suppressMessages({
    library(tidyverse)
    library(tidytext)
})

tb_imdb = readRDS("imdb-dataset.rds")

tb_imdb %>% 
    str()

We see that the dataset includes two columns: `review` and `sentiment`. The `review` column contains the text of the movie review, and the `sentiment` column indicates whether the review is positive or negative. Let's clean up the data a bit by tokenizing the reviews, removing stop words, and converting the `sentiment` column to a `factor` (binary instead of character):

In [None]:
cleanText = function(text) { # replace html line breaks with a space so adjacent words aren't fused
    text %>%
        str_replace_all("<br /><br />", " ")
}

tb_imdb_tokens = tb_imdb %>%
    mutate(review = cleanText(review)) %>% 
    rowid_to_column(var = "review_id") %>% # label the reviews
    mutate(sentiment = factor(sentiment, levels = c("negative", "positive"))) %>% 
    rename(label = sentiment) %>% # change the name to avoid conflicts later
    unnest_tokens(output = word, input = review) %>% 
    anti_join(stop_words, by = "word") # remove stop words

tb_imdb_tokens %>% 
    str()
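
To see what this pipeline does to a single review, here is a minimal sketch on two made-up reviews. The tibble `tb_toy` and its text are illustrative assumptions, not rows from the dataset. `unnest_tokens()` lowercases the text, strips punctuation, and returns one row per word; `anti_join()` then drops every word that appears in tidytext's `stop_words` table.

```r
suppressMessages({
    library(dplyr)
    library(tidytext)
})

# two made-up reviews (hypothetical, for illustration only)
tb_toy = tibble(
    review_id = 1:2,
    review = c("The plot was thin and the acting was wooden.",
               "A charming, heartfelt film!")
)

tb_toy_tokens = tb_toy %>%
    unnest_tokens(output = word, input = review) %>% # one lowercase word per row
    anti_join(stop_words, by = "word")               # drop common words like "the"

tb_toy_tokens
```

Note that `review_id` is carried along on every row, which is what lets us group the tokens back into reviews later.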

As a first look at the data, is there any relationship between the number of words in a review and the sentiment of the review?

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)
theme_set(theme_bw(base_size = 16))

tb_imdb_tokens %>% 
    group_by(review_id) %>% 
    summarize(n_words = n(), label = first(label)) %>% 
    ggplot(aes(n_words, fill = label)) + 
    geom_density(color = NA, alpha = 0.3) + 
    labs(title = "distribution of # of words in a review")

Doesn't look like it; positive and negative reviews have similar distributions of word counts. Next, let's look at the sentiment of reviews.

## Sentiment Analysis

Now that we have cleaned data, let's perform some sentiment analysis.

> What is sentiment analysis? Sentiment analysis is a field of Natural Language Processing (NLP) and text analysis that uses computational techniques to determine the emotional tone behind textual data. In our analysis, we'll be using a lexicon ("bing") that contains a list of words that have been labeled as positive or negative to calculate a sentiment score for each review. You can read more about it [here](https://en.wikipedia.org/wiki/Sentiment_analysis). 

In [None]:
# use inner_join to drop all "neutral" words
# "many-to-many" used since some words can have positive and negative connotations
tb_imdb_sentiment = tb_imdb_tokens %>%
    inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>% 
    mutate(sentiment = ifelse(sentiment == "negative", 0, 1))

tb_imdb_sentiment %>% 
    str()
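
The score we'll compute for each review is just the mean of these 0/1 values over the words that matched the lexicon. Here is a small worked example in base R. The four-word `lexicon` and the `review_words` vector are made up for illustration; they stand in for `get_sentiments("bing")` and a tokenized review.

```r
# hypothetical mini-lexicon: 1 = positive, 0 = negative
lexicon = c(great = 1, charming = 1, boring = 0, awful = 0)

# tokens from one made-up review (stop words already removed)
review_words = c("plot", "boring", "acting", "awful", "ending", "great")

# look up each word; words missing from the lexicon come back as NA
scores = lexicon[review_words]

# dropping the NAs mimics inner_join() discarding "neutral" words
scores = scores[!is.na(scores)]

# review-level sentiment score: share of matched words that are positive
mu_sentiment = mean(scores)
mu_sentiment
```

Two of the three matched words are negative, so the score is 1/3, closer to 0 (negative) than to 1 (positive).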

How many unique words are there, and what are their frequencies?

In [None]:
tb_imdb_sentiment %>% 
    count(word, sort = T) %>% 
    mutate(p = n/sum(n)) %>% 
    head()

tb_imdb_sentiment %>% 
    count(word, sort = T) %>% 
    rowid_to_column(var = "word_index") %>% 
    ggplot(aes(word_index, n)) + 
    geom_line() + 
    scale_y_continuous(trans = "log10")

Now let's look at the distribution of review sentiment scores for each label:

In [None]:
tb_imdb_sentiment %>% 
    group_by(review_id) %>% 
    summarize(mu_sentiment = mean(sentiment), label = first(label)) %>% 
    ggplot(aes(mu_sentiment, fill = label)) + 
    geom_density(color = NA, alpha = 0.3) + 
    labs(title = "distribution of review sentiment score")

There's clearly a separation between the two distributions. If we used a model based only on the sentiment score of a review, we'd get decent results:

In [None]:
suppressMessages({
    library(tidymodels)
})

tb_sentiment_predictions = tb_imdb_sentiment %>% 
    group_by(review_id) %>% 
    summarize(mu_sentiment = mean(sentiment), label = first(label)) %>% 
    mutate(prediction = ifelse(mu_sentiment < 0.5, "negative", "positive")) %>% 
    mutate(prediction = factor(prediction, levels = c("negative", "positive")))

tb_sentiment_predictions %>% 
    conf_mat(truth = label, estimate = prediction) 

tb_sentiment_predictions %>% 
    metrics(truth = label, estimate = prediction) 

So if we predict the label (negative or positive) of a review based on its average sentiment score (< 0.5 is negative), our model has $\approx 73\%$ accuracy. Keep in mind that random guessing would yield $50\%$ accuracy since the dataset is balanced. The [kap](https://en.wikipedia.org/wiki/Cohen%27s_kappa) metric (Cohen's kappa) corrects for chance agreement: a value of 0 means the predictions are no better than random guessing, and values closer to 1 indicate better performance.
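
To make the two metrics concrete, here is a sketch that computes accuracy and Cohen's kappa by hand in base R. The counts in `cm` are hypothetical, chosen to keep the arithmetic simple; they are not the confusion matrix from the cell above.

```r
# hypothetical confusion matrix (rows = prediction, columns = truth)
cm = matrix(c(40, 20,
              10, 30),
            nrow = 2, byrow = TRUE,
            dimnames = list(prediction = c("negative", "positive"),
                            truth      = c("negative", "positive")))

n = sum(cm)

# observed agreement: share of reviews on the diagonal
accuracy = sum(diag(cm)) / n

# agreement expected by chance, from the row and column totals
p_chance = sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: how far accuracy exceeds chance, rescaled so 1 is perfect
kap = (accuracy - p_chance) / (1 - p_chance)

c(accuracy = accuracy, p_chance = p_chance, kap = kap)
```

With these counts the accuracy is 0.7, chance agreement is 0.5, and kappa is (0.7 - 0.5) / (1 - 0.5) = 0.4.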

Now, let's produce some word clouds.

## Word Clouds

First, let's make a basic word cloud using the words that have a sentiment score.

In [None]:
suppressMessages({
    library(wordcloud)
})

tb_imdb_sentiment %>% 
    count(word, sort = T) %>% 
    slice(1:100) %>% # Get the 100 words with the highest frequencies
    with(wordcloud(word, n, colors = "gray50"))

We can make this more informative by separating the words by positive or negative sentiment using the `comparison.cloud()` function:

In [None]:
# comparison.cloud() expects a term matrix: rownames are words and columns are negative or positive
getTermDocMatrix = function(tb_, num_words = 100) {
    tb_tmp = tb_ %>% 
        mutate(sentiment = ifelse(sentiment == 0, "negative", "positive")) %>% 
        count(word, sentiment, sort = T) %>% 
        slice(1:num_words) %>% 
        pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) 

    M = tb_tmp %>% 
        select(negative, positive) %>% 
        as.matrix()

    rownames(M) = tb_tmp$word

    return(M)
}

tb_imdb_sentiment %>% 
    getTermDocMatrix() %>% 
    comparison.cloud(colors = c("#F8766D", "#00BFC4"))
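
If the shape of that term matrix is unclear, here is the same reshaping applied to a tiny made-up table. The tibble `tb_toy` is a hypothetical stand-in for `tb_imdb_sentiment`, and the plotting step is left out so the matrix itself is easy to inspect.

```r
suppressMessages({
    library(dplyr)
    library(tidyr)
})

# made-up word/sentiment pairs (0 = negative, 1 = positive)
tb_toy = tibble(
    word      = c("bad", "bad", "great", "funny", "funny", "funny"),
    sentiment = c(0, 0, 1, 1, 1, 0)
)

# count each word per sentiment, then spread the sentiments into columns
tb_wide = tb_toy %>%
    mutate(sentiment = ifelse(sentiment == 0, "negative", "positive")) %>%
    count(word, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)

# keep only the count columns and move the words into the row names
M = tb_wide %>%
    select(negative, positive) %>%
    as.matrix()
rownames(M) = tb_wide$word

M
```

Each row is a word and each column holds how often that word was counted under that sentiment, with 0 filled in where a word never appears with a given sentiment.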

## Classification Model

Finally, let's train a classification model to identify positive and negative reviews. We'll use a logistic regression model here (hardly cutting edge, but still effective).

> There's a lot going on in the next few lines of code. I encourage you to check out [this excerpt](https://smltar.com/mlclassification) from [Supervised Machine Learning for Text Analysis in R](https://smltar.com/) to dig deeper into each step of the process. This textbook is a wonderful resource, and this chapter in particular gives a great end-to-end example of using these tools.

In [None]:
suppressMessages({
    library(textrecipes)
})

set.seed(1)

# Reconstruct the reviews using only non-neutral language based on sentiment score
tb_imdb_clean = tb_imdb_sentiment %>% 
    group_by(review_id) %>% 
    summarize(review_clean = str_c(word, collapse = " "), label = first(label))

# get training and testing data
imdb_split = initial_split(tb_imdb_clean, prop = 0.8)
imdb_train = training(imdb_split)
imdb_test  = testing(imdb_split)

# 10-fold CV for hyperparameter tuning
imdb_folds = vfold_cv(imdb_train)

# specify the preprocessing (the recipe is defined on the training data only)
imdb_rec = recipe(label ~ review_clean, data = imdb_train) %>% 
    step_tokenize(review_clean) %>%
    step_tokenfilter(review_clean, max_tokens = 2E3) %>% # only use top 2K words
    step_tfidf(review_clean)

# specify the model
lasso_spec = logistic_reg(penalty = tune(), mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

# define the hyperparameter grid
lambda_grid = grid_regular(penalty(), levels = 10)

# define the workflow
tune_wf = workflow() %>%
  add_recipe(imdb_rec) %>%
  add_model(lasso_spec)
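
The last recipe step, `step_tfidf()`, turns each review into a row of TF-IDF weights. As a rough illustration of the idea, here is the textbook calculation done by hand in base R on three made-up documents. This is a sketch under assumptions: it uses term frequency as a proportion of the document and `log(N / df)` as the inverse document frequency, while `step_tfidf()` has its own defaults for smoothing and normalization, so its numbers may differ.

```r
# three made-up "reviews" (hypothetical)
docs = c("good good fun", "bad boring", "good bad")

tokens = strsplit(docs, " ")
vocab  = sort(unique(unlist(tokens)))

# document-term count matrix: one row per document, one column per word
counts = t(sapply(tokens, function(tok) {
    as.integer(table(factor(tok, levels = vocab)))
}))
colnames(counts) = vocab

# term frequency: share of each document made up by each word
tf = counts / rowSums(counts)

# inverse document frequency: words found in fewer documents score higher
doc_freq = colSums(counts > 0)
idf = log(length(docs) / doc_freq)

# TF-IDF: scale each column of tf by that word's idf
tfidf = sweep(tf, 2, idf, "*")
round(tfidf, 3)
```

Words that appear in most documents (here "good" and "bad") get small weights, while words unique to one document ("fun", "boring") get larger ones, which is what makes the features useful to the classifier.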

In [None]:
# train the model & tune parameters
# takes a few minutes to run...
set.seed(1)
tune_rs = tune_grid(
  tune_wf,
  imdb_folds,
  grid = lambda_grid,
  control = control_grid(save_pred = TRUE, verbose = TRUE)
)
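
It helps to know which penalty values are being tried. The sketch below reproduces the grid in base R, assuming that `penalty()` uses its default range of $10^{-10}$ to $10^{0}$ on a log10 scale; if that default differs in your installed version of `dials`, print `lambda_grid` to see the actual values.

```r
# 10 candidate penalties, evenly spaced on a log10 scale
# (assumed default range of penalty(): 1e-10 to 1)
lambda_values = 10^seq(-10, 0, length.out = 10)

lambda_values

# the spacing is constant in log10 units, not in raw units
diff(log10(lambda_values))
```

Most of these values are tiny, so several of the fitted models will be nearly unpenalized; only the largest few penalties shrink many coefficients to zero.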

In [None]:
collect_metrics(tune_rs) # see all the accuracy metrics

In [None]:
tune_rs %>% 
    show_best(metric = "accuracy") # see the most accurate models

In [None]:
# choose the simplest model (largest penalty) within one standard error of the best accuracy, then train it
chosen_acc = tune_rs %>%
  select_by_one_std_err(metric = "accuracy", -penalty)

final_lasso = finalize_workflow(tune_wf, chosen_acc)

fitted_lasso = fit(final_lasso, imdb_train)
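
The one-standard-error rule used by `select_by_one_std_err()` can be sketched by hand. The metrics in `tb_metrics` are hypothetical values made up for this example, not output from `tune_rs`, and the code illustrates the idea rather than reproducing the function's implementation.

```r
# hypothetical cross-validation results for four penalties
tb_metrics = data.frame(
    penalty = c(1e-4, 1e-3, 1e-2, 1e-1),
    mean    = c(0.840, 0.842, 0.838, 0.700), # mean accuracy across folds
    std_err = c(0.005, 0.005, 0.005, 0.006)  # standard error of that mean
)

# 1. find the best-performing model
best = tb_metrics[which.max(tb_metrics$mean), ]

# 2. anything within one standard error of the best is "about as good"
bound = best$mean - best$std_err
candidates = tb_metrics[tb_metrics$mean >= bound, ]

# 3. among those, prefer the simplest model, i.e. the largest penalty
chosen = candidates[which.max(candidates$penalty), ]
chosen
```

Here the best accuracy belongs to a penalty of 0.001, but a penalty of 0.01 is within one standard error of it, so the rule picks 0.01: a sparser model at nearly the same accuracy.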

In [None]:
# get predictions from the test data
tb_result = imdb_test %>% 
    mutate(prediction = predict(fitted_lasso, .) %>% unlist())

tb_result %>% 
    conf_mat(truth = label, estimate = prediction) 

tb_result %>% 
    metrics(truth = label, estimate = prediction) 

So we can achieve $\approx 84\%$ accuracy with our classification model. Not bad for a logistic regression model!

## Going Further



We could build a more accurate classification model using a deep learning framework. The [Supervised Machine Learning for Text Analysis in R](https://smltar.com/) textbook has [some chapters](https://smltar.com/dloverview) on building deep neural networks using keras. You can adapt one of these examples to build a model of your own using our IMDB dataset, though I'd recommend installing the newer [keras3](https://keras3.posit.co/) if you try this out. As long as you follow the right steps, installation shouldn't be too much of a pain. I'll include a notebook in the "Extra Materials" section on Canvas to get you started with keras3 if you're interested.