## Introduction to Text Classification

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import re
import numpy as np
import matplotlib.pyplot as plt

In this notebook, we'll be working with a set of reviews from Amazon.com. This is a subset of a larger dataset obtained from https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews.

In [None]:
reviews = pd.read_csv('data/amazon_reviews.csv')

reviews.head()

Each review is assigned a sentiment score, where 1 indicates negative and 2 indicates positive.

This subset is equally balanced between positive and negative sentiment.

In [None]:
reviews['sentiment'].value_counts()

First, let's look at some of the negative reviews.

In [None]:
seed = 123
for statement in reviews.loc[reviews['sentiment'] == 1, 'text'].sample(3, random_state=seed):
    print(statement)
    print('-----------------------------')

And then some of the positive ones.

In [None]:
seed = 123
for statement in reviews.loc[reviews['sentiment'] == 2, 'text'].sample(3, random_state=seed):
    print(statement)
    print('-----------------------------')

## Step 1: Naive Bayes Using the Text Field

First, we'll split our data into a train and test set, stratifying on the target variable.

In [None]:
X = reviews[['text']]
y = reviews['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

We need a way to convert the text into predictors. We will start by using a **bag of words** model - one which looks at which words are present but disregards word order.
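To make the "disregards word order" point concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two sentences are made up for illustration and are not part of the reviews data; splitting on whitespace is a simplification of what a real vectorizer does.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())

# Two made-up sentences with opposite meanings but identical words
a = bag_of_words("the plot was good not bad")
b = bag_of_words("the plot was bad not good")

print(a == b)  # True: a bag of words cannot tell these apart
```

This is one reason a pure bag-of-words model can struggle with negation.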

Let's start with a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

Fill in the code to import the CountVectorizer class.

In [None]:
# Fill this in to import the CountVectorizer class

Then create a CountVectorizer (with all of the default arguments) named `vect`, fit it to the "text" column of the training data, and transform both the train and test texts.

In [None]:
# Fill in the code to fit and transform a CountVectorizer (using all defaults) on the text column of X_train and X_test

vect = # Fill this in

X_train_vec = # Fill this in
X_test_vec = # Fill this in

The CountVectorizer class will take in the text, and **tokenize** it, splitting it into a list of tokens. The built-in tokenizer does some minimal cleaning of the text, and by default the CountVectorizer will convert all text to lowercase.
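As a rough sketch of what the default tokenization does, the snippet below applies the regular expression documented as CountVectorizer's default `token_pattern` to a lowercased string. The helper `simple_tokenize` and the example sentence are hypothetical, and this only approximates the vectorizer's full preprocessing.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer:
# runs of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def simple_tokenize(text):
    """Lowercase the text, then extract tokens of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = simple_tokenize("I LOVE it! A great buy.")
print(tokens)  # ['love', 'it', 'great', 'buy']
```

Notice that punctuation disappears and that single-character words such as "I" and "a" are dropped.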

If we want to take a look at all of the tokens that the CountVectorizer has seen, we can look at its vocabulary. Check the `vocabulary_` attribute. 

In [None]:
# fill this in

**Question:** You should see that the first entry is `'this': 25040`. What does the 25040 represent?

**Question:** How many total tokens are there?

In [None]:
# Fill this in to answer the above question.

Now, let's see how often each word appeared across all reviews. To do this, we need to look at X_train_vec.

**Question:** What type of object is X_train_vec?

In [None]:
# Fill this in to answer.

Notice that we can sum this object to get a count per word.

In [None]:
X_train_vec.sum(axis = 0)

It would be nice to be able to view the counts result as a DataFrame. To do this, we need to convert this into a numpy array.

In [None]:
np.array(X_train_vec.sum(axis = 0))

Also, when we make a DataFrame, we need to pass in these values as a one-dimensional object. For this, we can use the `flatten` method.
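The small sketch below shows why both steps are needed, assuming the column sums come back as a two-dimensional `numpy.matrix` with a single row. The values are made up and stand in for the real counts.

```python
import numpy as np

# Stand-in for the result of summing counts over all documents (made-up values)
column_sums = np.matrix([[1, 0, 2], [3, 1, 0]]).sum(axis=0)
print(column_sums.shape)  # (1, 3): still two-dimensional

# Convert to a plain array, then flatten to one dimension
flat = np.array(column_sums).flatten()
print(flat.shape)  # (3,)
```

A one-dimensional array like `flat` can be passed directly as a DataFrame column.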

Make a `word_counts` DataFrame which has a 'words' column containing each word in the vocabulary and a 'frequency' column containing the counts.

**Hint:** Check the methods of the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to see how to access the words.

In [None]:
# Fill this in to build a DataFrame of words and their counts
word_counts = pd.DataFrame({
    'words': #Fill this in,
    'frequency': #Fill this in
})

word_counts.head()

Which word appears most frequently?

In [None]:
# Fill this in

Now, let's fit a MultinomialNB model to our word count vectors.

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
nb = MultinomialNB().fit(X_train_vec, y_train)

y_pred = nb.predict(X_test_vec)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
confusion_matrix(y_test, y_pred)

**Question:** How do the estimated probabilities $P(great | positive)$ and $P(great | negative)$ compare? How about $P(disappointed | positive)$ and $P(disappointed | negative)$?

Hint: You can access the estimated (log) probabilities via the `feature_log_prob_` attribute of your model.
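To see how two log probabilities can be compared, here is a minimal sketch of the additive-smoothing estimate that multinomial Naive Bayes uses, with `alpha=1.0` as in the scikit-learn default. The helper `smoothed_log_prob` and all counts are hypothetical and are not taken from the reviews data.

```python
import math

def smoothed_log_prob(word_count, total_count, vocab_size, alpha=1.0):
    """Log of (count of word in class + alpha) / (all word counts in class + alpha * vocabulary size)."""
    return math.log((word_count + alpha) / (total_count + alpha * vocab_size))

# Hypothetical counts for one word in each class
log_p_pos = smoothed_log_prob(word_count=30, total_count=1000, vocab_size=500)
log_p_neg = smoothed_log_prob(word_count=5, total_count=1000, vocab_size=500)

# Subtracting log probabilities and exponentiating gives the probability ratio
ratio = math.exp(log_p_pos - log_p_neg)
print(round(ratio, 2))  # 5.17
```

A ratio above 1 means the word is estimated to be more likely in the first class than in the second.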

In [None]:
# Your code here.

## Logistic Regression for Text Classification

Fill in the code to fit a LogisticRegression model.

What accuracy score do you obtain? How does the confusion matrix look?

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

accuracy_score(y_test, y_pred)

In [None]:
confusion_matrix(y_test, y_pred)

### Explaining the Model

There are two lenses through which we can try to understand how the model is making predictions: (global) feature importances and feature importances for single predictions.

We'll start by looking at global explanations. Since we're using a logistic regression model, we can look at the coefficients.

In [None]:
coefficients = pd.DataFrame({
    'word': vect.get_feature_names_out(),
    'coefficient': logreg.coef_[0]
})

coefficients.sort_values('coefficient').head(10)

In [None]:
coefficients.sort_values('coefficient', ascending = False).head(10)

**Question:** By what factor do we multiply the estimated odds of a review being positive when we see the word "great"? What about the word "garbage"? What about the word "the"?
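As a reminder of the underlying relationship: logistic regression models the log-odds as a linear function of the features, so one additional occurrence of a word multiplies the odds by the exponential of that word's coefficient, holding the other counts fixed. The sketch below uses made-up coefficient values, not the ones from the fitted model.

```python
import math

def odds_factor(coefficient):
    """Multiplicative change in the odds for a one-unit increase in a feature."""
    return math.exp(coefficient)

# Hypothetical coefficients for illustration only
for name, coef in [("strongly positive", 1.2), ("strongly negative", -2.0), ("neutral", 0.0)]:
    print(name, odds_factor(coef))
```

A factor above 1 raises the odds of a positive prediction, a factor below 1 lowers them, and a coefficient of 0 leaves them unchanged.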

In [None]:
# Your code here

Here's a helper function so that you can see what contributes to individual predictions.

In [None]:
def get_most_important_features(text, vectorizer, model):
    weight = vectorizer.transform(text).toarray().flatten() * model.coef_.flatten()
    weights = pd.DataFrame({
        'word': vectorizer.get_feature_names_out(),
        'weight': weight
    })
    return weights[weights['weight'] != 0].sort_values('weight', ascending = False)
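The idea behind the helper can be sketched by hand: each word's weight is its count in the review times its coefficient, and the weights plus the intercept add up to the log-odds of a positive prediction. The coefficients, intercept, and counts below are hypothetical.

```python
import math

# Hypothetical coefficients, intercept, and word counts for one review
coefs = {"great": 1.5, "not": -0.8, "the": 0.1}
intercept = -0.5
counts = {"great": 1, "not": 2, "the": 3}

# Per-word contribution to the log-odds: count * coefficient
contributions = {word: counts[word] * coefs[word] for word in counts}

# Log-odds is the intercept plus the sum of contributions;
# the sigmoid converts log-odds to a probability
log_odds = intercept + sum(contributions.values())
prob_positive = 1 / (1 + math.exp(-log_odds))

print(contributions)
print(prob_positive)
```

Words with large positive weights push the prediction toward positive, and words with large negative weights push it toward negative.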

In [None]:
i = 0

text = X_test.iloc[i]
print(text['text'])

print('Predicted Probability of Positive: {}'.format(logreg.predict_proba(X_test_vec[i].reshape(1,-1))[0,1]))
print('True Label: {}'.format(y_test.iloc[i]))

get_most_important_features(text, vect, logreg)

Let's look at some examples that are incorrectly classified.

In [None]:
np.where((y_test == 2) & (y_pred == 1))[0][:10]

In [None]:
i = 26

text = X_test.iloc[i]
print(text['text'])

print('Predicted Probability of Positive: {}'.format(logreg.predict_proba(X_test_vec[i].reshape(1,-1))[0,1]))
print('True Label: {}'.format(y_test.iloc[i]))

get_most_important_features(text, vect, logreg)

### Adding $n$-grams

Notice that the vectorizer we are currently using only looks at words individually and does not consider their order. We can include combinations of adjacent words using n-grams.
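To illustrate what an n-gram is, here is a small standard-library sketch. The helper `ngrams` and the token list are made up for illustration; a vectorizer builds these features internally.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "worth", "the", "money"]

# Unigrams followed by bigrams
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
# ['not', 'worth', 'the', 'money', 'not worth', 'worth the', 'the money']
```

A bigram such as "not worth" carries information that neither "not" nor "worth" carries alone.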

Create a new vectorizer that includes both unigrams and bigrams.

In [None]:
vect = CountVectorizer(
    # Fill this in so that the model uses both unigrams and bigrams
)

X_train_vec = vect.fit_transform(X_train['text'])
X_test_vec = vect.transform(X_test['text'])

How large is the vocabulary when you use unigrams and bigrams?

In [None]:
len(vect.vocabulary_)

In [None]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

confusion_matrix(y_test, y_pred)

In [None]:
coefficients = pd.DataFrame({
    'word': vect.get_feature_names_out(),
    'coefficient': logreg.coef_[0]
})

coefficients.sort_values('coefficient', ascending = True).head(25)

**Question:** By what factor do we multiply the predicted odds of a review being positive if we see the phrase "love this"? What about "very disappointed"?

In [None]:
# Fill this in

### TFIDF

Instead of using raw counts, we can use a tfidf vectorizer. This will scale down the weights for frequently occurring words.

The acronym "tfidf" stands for "term frequency inverse document frequency". This type of vectorizer takes into account the number of times a word occurs in a document but then scales inversely for the number of documents that word appears in.
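The calculation can be sketched by hand on a tiny made-up corpus. This sketch assumes the smoothed variant documented as the scikit-learn default, where idf = ln((1 + n) / (1 + df)) + 1 and each document's weights are then scaled to unit length. The documents and helper names are hypothetical.

```python
import math

# Tiny made-up corpus, already tokenized
docs = [
    ["great", "book", "great", "story"],
    ["bad", "book"],
    ["great", "value"],
]
n_docs = len(docs)

def doc_freq(word):
    """Number of documents that contain the word."""
    return sum(word in doc for doc in docs)

def idf(word):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq(word))) + 1

def tfidf(doc):
    """Term count times idf for each word, scaled so the squared weights sum to 1."""
    raw = {word: doc.count(word) * idf(word) for word in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(docs[1])
print(weights)
```

In the second document, "bad" and "book" each appear once, but "bad" receives the larger weight because it occurs in fewer documents.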

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vect = TfidfVectorizer()

X_train_vec = vect.fit_transform(X_train['text'])
X_test_vec = vect.transform(X_test['text'])

In [None]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

confusion_matrix(y_test, y_pred)

Both the CountVectorizer and TfidfVectorizer have additional parameters that change how they treat tokens or that remove certain tokens.

Look at the `min_df` and `max_df` parameters. Experiment with these and see whether the model's performance changes when you adjust them (or any other parameters you want to experiment with).

Try out some different values for these parameters. Which combination does best?
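To build intuition for document-frequency filtering, here is a simplified sketch on made-up documents. It treats `min_df` as a count of documents and `max_df` as a proportion of documents; scikit-learn accepts either form for both parameters. The helper `filter_vocab` is hypothetical.

```python
# Made-up documents, each represented by its set of distinct words
docs = [
    {"the", "great", "book"},
    {"the", "bad", "book"},
    {"the", "great", "value"},
    {"the", "zzz"},
]

def filter_vocab(docs, min_df=1, max_df=1.0):
    """Keep words found in at least min_df documents and in at most a max_df share of documents."""
    n = len(docs)
    kept = []
    for word in sorted(set().union(*docs)):
        df = sum(word in doc for doc in docs)
        if df >= min_df and df / n <= max_df:
            kept.append(word)
    return kept

kept = filter_vocab(docs, min_df=2, max_df=0.9)
print(kept)  # ['book', 'great']
```

Here "the" is removed for appearing in every document, while "bad", "value", and "zzz" are removed for appearing in only one.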

In [None]:
# Your Code Here

## Bonus: Exploring Other Potential Features

**Question:** Does the length of a review seem to be related to its sentiment? That is, do longer (or shorter) reviews tend to have a more positive sentiment?

In [None]:
# Your Code here

In [None]:
# Your Code Here

We can even use pretrained models to help us. For example, nltk includes tools for determining sentiment, like the VADER tool. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. 
 
This outputs a dictionary of scores:
* neg: proportion of words that are negative
* neu: proportion of words that are neutral
* pos: proportion of words that are positive
* compound: computed by summing the valence scores of each word, normalized to be between -1 and +1

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
sent_analyzer = SentimentIntensityAnalyzer()

In [None]:
i = 0
print(X_train.iloc[i,0])
print(sent_analyzer.polarity_scores(X_train.iloc[i,0]))
print(y_train.iloc[i])

Create new columns to hold the "neg", "neu", "pos" and "compound" scores for each review. 

Finally, concatenate your word counts and the new features together. How well do all of these together do at predicting positive or negative sentiment?

In [None]:
# Your Code Here

## Bonus Bonus - Using the Title and Text

There are a couple of ways that you could incorporate both the title and text.

**Option 1:** Concatenate together the title and the text into a single field.

**Option 2:** Use a separate vectorizer for the title and text and concatenate the results.

Try both options. Which gives a better result?

In [None]:
# Your Code Here