# Text analysis and sentiment

You may need to install `nltk`, if you haven't done so already.

In [None]:
%pip install nltk


In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report, balanced_accuracy_score
from sklearn.model_selection import train_test_split, cross_validate

import nltk
from nltk import tokenize
from nltk.corpus import stopwords
from nltk import SnowballStemmer, LancasterStemmer
from nltk.sentiment import SentimentIntensityAnalyzer
# included so I can add latex code
from IPython.display import display, Math, Latex

You'll also need to download some additional functions and data by running this code. (You'll generally only need to do this once.)

In [None]:
nltk.download('popular')
nltk.download('vader_lexicon')


The purpose of **text analysis** is to extract meaning from text data. This involves cleaning and processing text data, as well as using analysis methods that are able to get something **quantitative** out of something that doesn't inherently have **numbers**. 

One way to extract meaning from text is by assigning values of **sentiment**. The words we use have meaning, and we can assign measures of what they are intended to portray. For example, the word "bad" is generally a negative sentiment (slang usage notwithstanding), while "good" has a positive sentiment. The word "hurt" generally also has a negative sentiment, while "heal" has a positive one. In this way, we can attempt to put different words onto the same scale and measure the overall sentiment of text.

In this section, we will look at doing a type of analysis called **sentiment analysis**, which is a class of techniques designed to extract this type of meaning from text data. We'll compare two different approaches: one based on a "dictionary" method, and another that relies on a machine learning approach.

## Dictionary based sentiment analysis

VADER is a **dictionary-based** method, meaning it is pre-built and comes with a list of words and scores. To use it, we need to download the list of words with scores, then apply those scores to the words within our documents. Combining the scores of the words/tokens within our document gives us the overall sentiment of the document. For VADER, we will get back the negative, neutral, positive, and compound scores of a document.


In [None]:
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("I'm very angry!")

In [None]:
sia.polarity_scores("Feelin fine")

The scores for negative, neutral, and positive each fall between 0 and 1, and indicate the proportion of the document that falls into that category. The compound score is a value from -1 to 1 that provides an overall summary of how positive or negative that document is in its sentiment.

The compound score is most often used, and typically, the threshold for being considered positive, neutral, or negative is as follows:
- positive sentiment: compound score >= 0.05
- neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
- negative sentiment: compound score <= -0.05
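
As a small illustration of the thresholds listed above, here is a sketch of a helper that turns a compound score into a label. The function name `label_from_compound` and the example scores are made up for this illustration; only the 0.05 cutoffs come from the text.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score to a sentiment label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# made-up compound scores, chosen to sit on either side of the cutoffs
example_scores = [0.62, 0.0, -0.05, 0.049]
example_labels = [label_from_compound(s) for s in example_scores]
print(example_labels)
```

Note that the cutoffs are inclusive, so a score of exactly -0.05 counts as negative.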

VADER was designed with social media in mind, so it will also pick up on things like slang and emoticons:

In [None]:
sia.polarity_scores(":-)")

Moreover, it includes some rules that pick up on some forms of negation. 

In [None]:
sia.polarity_scores("I'm not angry")

In [None]:
sia.polarity_scores("I'm not not angry")

We can try out applying the VADER method to a sample of Tweets that's included as part of the NLTK package. 

In [None]:
tweets = nltk.corpus.twitter_samples.strings()
tweets[:4]

<h2 style="color:red;font-weight: bold;">Question 1:</h2><p style="color:red;font-weight: bold;">Get the sentiment scores for the <code>tweets</code> data. Create a histogram to show the distribution of positive and negative sentiments</p>
<p><b>Hint</b>: Since the <code>polarity_scores</code> method requires a single string, you'll need to use a list comprehension to get the sentiment score for each tweet. </p>

## Evaluating the sentiment scoring

It seems intuitive that a dictionary of positive and negative terms would be a good way to classify text, but how do we know whether this sentiment dictionary is any good? One way to test out the performance of the VADER model is by comparing its predictions to a "ground truth" source of evidence. The `imdb_reviews` dataset has two columns: the text of a user review and a label that is 1 if the user gave a positive rating (>= 7) and 0 if the user gave a negative rating (<= 4). Since this corpus includes both text and labels, we can use it as a way to evaluate VADER against a data source with some known labels.

In [None]:
imdb = pd.read_csv('imdb_reviews.csv').dropna()
imdb.head()

We'll need to convert the sentiment scores to a dichotomous variable to match the way sentiment is recorded in the IMDB corpus.

In [None]:
polarity =  pd.DataFrame([sia.polarity_scores(i) for i in imdb['text']])

# assign a positive or negative label based on the compound score: 
polarity['predicted_positive'] = polarity['compound']>=.05

polarity.head()

Then, we'll add the "ground truth" labels to the data:

In [None]:
# add the "ground truth" to the polarity
# (.values avoids index misalignment: dropna() can leave gaps in imdb's index)
polarity['actual'] = imdb['label'].values
# view the results:
polarity.head()

Finally, we'll make a confusion matrix and evaluate the results:

In [None]:
cmat = pd.crosstab(polarity['predicted_positive'], polarity['actual'],  margins=True)
cmat

In [None]:

print(classification_report(polarity.actual, polarity.predicted_positive ,
                            # add target_names to show labels in the report:
                            target_names=['negative', 'positive']))

# add cohen's kappa and balanced accuracy
print("cohens kappa: ", cohen_kappa_score(polarity.actual, polarity.predicted_positive))
print("balanced accuracy: ", balanced_accuracy_score(polarity.actual, polarity.predicted_positive))

Pretty good! Although VADER wasn't really developed for use in this particular context, it still does a decent job of identifying positive and negative reviews, even though there are still plenty of documents that are not correctly classified.

# Machine learning for sentiment

Overall, the VADER tool performed well, but still had plenty of errors. Can we improve on this performance? Maybe! Instead of using a lexicon to determine if the texts are positive or negative, we can make a machine learning model that infers the terms that are associated with positive or negative documents automatically. In general, even fairly simple machine learning models can beat lexicon-based methods. 

Recall that, for supervised learning, we must have a y variable that we know. That is, we need to have the y variable in our dataset, so that we can build our model and use that model to predict y for future data. We're going to compare the predictions from the VADER sentiment lexicon to those from a widely used machine learning model called a Naive Bayes classifier. It works by using Bayes' Theorem to calculate the probability of each class ($C_k$ in the formula below) given a predictor $x_i$:

$$ P(C_k|x_i) = \frac{P(C_k) P(x_i|C_k)}{P(x_i)}$$

- $P(x_i)$ is the probability of feature $x_i$ in all documents
- $P(C_k)$ is the probability of class $k$
- $P(x_i|C_k)$ is the probability of feature $x_i$ in documents with class $k$
- $P(C_k|x_i)$ is the probability of class $k$ given feature $x_i$

The model is called "naive" because it assumes (incorrectly!) that each word is conditionally independent of all other words given the class. This turns what might otherwise be a gnarly calculus problem into a bunch of division and multiplication: we just need to calculate the conditional probability of each term given each class, multiply these probabilities together (along with the class probability), and then pick the class with the highest total. Despite this wild oversimplification of how words work, Naive Bayes classifiers have a pretty good track record for document classification problems like this one.
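
To make the "multiply the probabilities and pick the biggest" idea concrete, here is a minimal hand-rolled sketch. The four training "reviews" and the add-one smoothing constant are assumptions made up for this illustration. It is not how scikit-learn implements the model internally, although it follows the same idea of summing log probabilities instead of multiplying raw ones.

```python
import math

# Hypothetical toy training data: (tokens, label)
toy_train = [
    (["great", "fun", "great"], "pos"),
    (["fun", "moving"], "pos"),
    (["boring", "flop"], "neg"),
    (["boring", "boring", "fun"], "neg"),
]
classes = ["neg", "pos"]
vocab = sorted({w for tokens, _ in toy_train for w in tokens})

# P(C_k): share of training documents in each class
prior = {c: sum(1 for _, y in toy_train if y == c) / len(toy_train) for c in classes}

# word counts per class, used to estimate P(x_i | C_k)
counts = {c: {w: 0 for w in vocab} for c in classes}
for tokens, y in toy_train:
    for w in tokens:
        counts[y][w] += 1

def log_likelihood(word, c, alpha=1.0):
    """log P(word | class) with add-one smoothing so unseen words don't give log(0)."""
    total = sum(counts[c].values())
    return math.log((counts[c][word] + alpha) / (total + alpha * len(vocab)))

def nb_predict(tokens):
    """Sum log prior and log likelihoods for each class and return the best class."""
    scores = {
        c: math.log(prior[c]) + sum(log_likelihood(w, c) for w in tokens if w in vocab)
        for c in classes
    }
    return max(scores, key=scores.get), scores

nb_label, nb_scores = nb_predict(["fun", "great"])
print(nb_label, nb_scores)
```

The denominator $P(x_i)$ is the same for every class, so it can be dropped when all we want is the class with the highest score.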


## Splitting the data

Before we do anything, we'll want to split the data into a training and testing set, just like we did with our other supervised models. We'll use `X_train` and `y_train` to train the model, and then we'll use `X_test` and `y_test` to evaluate on new data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(imdb.text, imdb.label,
                                     test_size=0.20, # 20% of observations for validation
                                     random_state = 999) # this is a random process, so you want to set a random seed! 

## Pre-processing

Our goal here is to apply a mathematical model to text data, so a basic step is going to be converting words into some kind of a sensible numeric format.

The VADER model does a lot of the "words-into-numbers" stuff for us. But for the current case we're going to need to do that process ourselves. As a starting point, we'll use what's commonly called a "bag-of-words" representation of our texts. A "bag-of-words" model simplifies texts by ignoring things like word-ordering, parts of speech, tone, context etc. and instead just focuses on "how many times a term occurs in a given document". To convert our reviews to a bag-of-words, we're going to do the following pre-processing steps:



1. **Tokenization** splits texts into smaller units. In the current case, this will be individual words without punctuation. But it could also be sentences, paragraphs or "n-grams" (groups of 1, 2, 3... words).
2. **Stopword Removal** removes common terms like "a", "and", "the" that are grammatically useful but uninformative for many text models.
3. **Normalization** combines terms that are more-or-less equivalent by doing things like converting words to lower-case or removing word endings through stemming or lemmatization.


(In some cases we may change the ordering of these steps, or we might do some cleaning and normalization and then decide, based on a closer look at our results, that we need to go back and do some additional cleaning. Our end goal, however, is generally to get a representation of a text that is simple *enough* without sacrificing too much nuance or complexity.)

Once we've made our bag-of-words, we'll use it to create what is called a "document-term-matrix" where each row represents a single document, each column represents a word that occurs anywhere in the entire collection of documents, and the cell values indicate a count of the number of times word `j` occurs in document `i`.

For example, if I think about these three sentences as a collection of three documents:

>    "See Spot run. Spot runs fast. Run Spot run!"


I would represent those documents as a document-term-matrix that looks like this:




<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-0pky">See</th>
    <th class="tg-0pky">Spot</th>
    <th class="tg-0pky">Run</th>
    <th class="tg-0pky">Fast</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-0pky">See Spot run.</td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">0</td>
  </tr>
  <tr>
    <td class="tg-0pky">Spot runs fast</td>
    <td class="tg-0pky">0</td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">1</td>
  </tr>
  <tr>
    <td class="tg-0pky">Run Spot run.</td>
    <td class="tg-0pky">0</td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">2</td>
    <td class="tg-0pky">0</td>
  </tr>
</tbody>
</table>
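
The table above can be reproduced with a few lines of plain Python. This is only a sketch: the `simple_tokens` helper is a made-up stand-in for real tokenization and stemming, and it hard-codes the one normalization ("runs" to "run") that this example needs.

```python
from collections import Counter

toy_docs = ["See Spot run.", "Spot runs fast.", "Run Spot run!"]

def simple_tokens(doc):
    """Lower-case, strip punctuation, and crudely map "runs" to "run" (toy normalization)."""
    words = [w.strip(".!?,").lower() for w in doc.split()]
    return ["run" if w == "runs" else w for w in words]

# one Counter of term counts per document
term_counts = [Counter(simple_tokens(d)) for d in toy_docs]

# columns of the matrix, in the same order as the table
toy_vocab = ["see", "spot", "run", "fast"]

# rows = documents, columns = terms, cells = counts (a Counter returns 0 for missing terms)
toy_dtm = [[c[term] for term in toy_vocab] for c in term_counts]

for doc, row in zip(toy_docs, toy_dtm):
    print(f"{doc:<16} {row}")
```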

Note that, for real-world text analysis, the number of columns is going to grow quickly as we include more documents, and most cells are just going to have values of `0`. This tendency to get very large "sparse" matrices is a recurring problem in text analysis and we'll have to make a lot of simplifications to keep things manageable. Many of the processing steps below are intended to allow us to combine or drop columns from our document term matrix. 


The `scikit-learn` package has a `CountVectorizer` class that will let us do all of the necessary processing in one step, but just to get a sense of what each component does, we'll try each part separately.

### 1. Tokenization

In the tokenization step, we'll split up documents into individual words. To do this, we'll use the `word_tokenize` function from the NLTK package.

In [None]:
text = 'See Spot run. Spot runs fast. Run Spot run!'

nltk.word_tokenize(text)

For the IMDB reviews, we're going to go a step further and lower case all of the texts. This will ensure that our document-term-matrix treats "This", "this", and "THIS" as a single word instead of creating separate columns for terms that mean the same thing, so our output will end up looking something like this:

In [None]:
# lower-case the text before tokenizing

nltk.word_tokenize(text.lower())

### 2. Stopword removal


Stopwords are words that are found commonly throughout a text and carry little semantic information. Examples of common stopwords are prepositions ("to", "on", "in"), articles ("the", "an", "a"), conjunctions ("and", "or", "but") and pronouns ("it", "they", "we"). For example, the words *the* and *of* are totally ubiquitous, so they won't serve as meaningful features, whether to distinguish documents from each other or to tell what a given document is about. You may also run into words that you want to remove based on where you obtained your corpus of text or what it's about. There are many lists of common stopwords available for you to use, both for general documents and for specific contexts, so you don't have to start from scratch.   

We can eliminate stopwords by checking all the words in our corpus against a list of commonly occurring stopwords that comes with the `nltk` package:

In [None]:
nltk.download('stopwords')

In [None]:
stop = stopwords.words('english')
stop[0:10]

We'll convert this list of stopwords to a set (this will speed up some processing steps), and then we'll use it to filter the results from the tokenization step:

In [None]:
# loading the stopword list and converting it to a set
eng_stopwords = set(stopwords.words('english'))

text = "This text contains a couple of stopwords."

tokens = nltk.word_tokenize(text.lower())
filtered_tokens = [w for w in tokens if w not in  eng_stopwords]

filtered_tokens

While we're filtering, we might also want to remove punctuation marks from the bag of words. So we'll use `.isalpha()` to remove any token that isn't made up entirely of letters:

In [None]:
filtered_tokens = [w for w in tokens if w not in  eng_stopwords and w.isalpha()]

filtered_tokens

### 3. Stemming

Finally, we'll try to simplify our bag-of-words analysis by grouping together different inflections of the same terms. Word-endings that indicate things like pluralization and tense are useful in the context of human communication, but they're not informative when we're trying to do things like identify the topic or tone of a text.

There are two common approaches to this kind of normalization: 

- **Lemmatization** uses parts-of-speech and context clues to convert words to their basic dictionary form. 
- **Stemming** uses some simple heuristics to remove word inflections. 

Stemming is more error-prone than lemmatization, and sometimes results in words that you won't find in a dictionary, but it has the advantage of being much faster because it relies on simple rules whereas lemmatization considers word context and parts-of-speech. 
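
To get a feel for what "simple rules" means, here is a deliberately crude toy stemmer. It is not the Snowball or Lancaster algorithm: the suffix list and the minimum stem length of 3 are arbitrary choices for this sketch. Real stemmers use many more rules, but they fail in a similar way.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s"), min_stem=3):
    """Strip the first matching suffix, as long as at least `min_stem` letters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

toy_words = ["runs", "jumped", "fishing", "was", "caring"]
toy_stems = [toy_stem(w) for w in toy_words]
print(dict(zip(toy_words, toy_stems)))
```

The last word shows the kind of error a rule-based approach can make: "caring" is cut down to "car", which is a real word with an unrelated meaning.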

There are multiple stemming algorithms with different rule sets and differing strengths and weaknesses. In this notebook, we'll use the Snowball Stemmer. You'll notice this works pretty well for many words, but gives odd results for others:

In [None]:
# load the stemming algorithm
stemmer = SnowballStemmer("english")

In [None]:
# apply it to some terms
forms = ['lying', 'fisherman', 'change', 'systematic', 'stapled', 'catlike', 'argument', 'alphabetical']
print([stemmer.stem(i) for i in forms])


You can contrast these results with the results from an alternate stemming algorithm, such as the `LancasterStemmer`:

In [None]:
lancaster = LancasterStemmer()

In [None]:
forms = ['lying', 'fisherman', 'change', 'systematic', 'stapled', 'catlike', 'argument', 'alphabetical']
print([lancaster.stem(i) for i in forms])

### Putting it all together

We'll wrap all three steps together in a single function that will take a document and return a processed bag of words. 






In [None]:
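def tokenize(text):
    # lower-case first so capitalized stopwords like "This" match the lower-case stopword list
    tokens = nltk.word_tokenize(text.lower())
    # drop stopwords and non-alphabetic tokens, then stem whatever is left
    return [stemmer.stem(token) for token in tokens if token not in eng_stopwords and token.isalpha()]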

In [None]:
tokenize("This is a sentence that has been fully processed into a bag of words!! Isn't it great?")

### Building the Document-Term Matrix

Finally, we'll use `CountVectorizer` to convert the tokenized texts into a document-term-matrix with counts for each term. Note that we're passing the `tokenize` function as one of the arguments to the `CountVectorizer` here, so the stop word removal and stemming will be automatically applied before we get our word counts:

In [None]:
vectorizer = CountVectorizer(
                             tokenizer = tokenize,
                             ngram_range=(0,1), # Tokens are individual words for now
                             strip_accents='unicode'
                            )

Here's what the resulting data would look like if we applied this onto some example sentences:

In [None]:
sentences = ["Best movie I've ever seen",
             "This one is a total flop",
             "I give this one two thumbs up.",
             "I almost walked out of the theatre."
            ]
dtm = vectorizer.fit_transform(sentences)
pd.DataFrame.sparse.from_spmatrix(dtm, columns=vectorizer.get_feature_names_out())


## Building a pipeline

Now we've got everything we need to process the data for text analysis. We just need to add our Naive Bayes classifier. We'll use `MultinomialNB()`, which is a variant of the Naive Bayes model that can include information about term frequency.

To simplify some of the coding here, we'll use a `Pipeline` class that will combine the pre-processing and the classifier steps into a single object. We can specify a pipeline with a list of tuples that define each processing step:


In [None]:
# Create a pipeline
nb_pipeline = Pipeline([
    ('DTM',CountVectorizer(
                             ngram_range=(0,1), 
                             strip_accents='unicode',
                             max_df = 0.1,
                             min_df = .0025
                            )),
    ('naivebayes',MultinomialNB())
])


**Note**: This pipeline relies on `CountVectorizer`'s default tokenizer, so it does not apply the stemming and stopword removal from our `tokenize` function. Instead, you might notice we set values for `max_df` and `min_df`. `max_df=0.1` means that we remove all terms that occur in more than 10% of the training documents. This can be a useful way to remove common terms that might not be captured by a standard stopword list, and it has the same basic motivation: we want to remove terms that are not really informative. Setting `min_df=.0025` has the effect of removing terms that occur in fewer than 0.25% of documents, so it gets rid of rare terms which, although they might be informative, don't show up in the corpus enough times for our model to really "learn" anything about them. Fiddling with both of these parameters may change your results for better or worse, so it may be worth experimenting with different values here if you're not satisfied with the results.


Now we'll use `fit` to train the pipeline object with our training data:

In [None]:
nb_pipeline.fit(X_train, y_train)

Now, applying `nb_pipeline.predict()` to a list of strings will automatically process the text, apply the model, and return a predicted class. For example:

In [None]:
sentences = ["Best movie I've ever seen",
             "This one is a total flop",
             "I give this one two thumbs up.",
             "I almost walked out of the theatre."
            ]

# 1 = predicted positive, 0 = predicted negative
nb_pipeline.predict(sentences)

Apply this to our testing data and then evaluate the results:

In [None]:
preds = nb_pipeline.predict(X_test)


In [None]:
pd.crosstab(y_test, preds,  margins=True).rename_axis(index = 'Truth', columns='Predictions')


In [None]:

print(classification_report(y_test, preds, 
                            # add target_names to show labels in the report:
                            target_names=['negative', 'positive']))

# add cohen's kappa and balanced accuracy
print("cohens kappa: ", cohen_kappa_score(y_test, preds))
print("balanced accuracy: ", balanced_accuracy_score(y_test, preds))

Here's a quick reminder for interpreting these metrics:

| **metric**                  | Description                                                                                                                               |
|-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| **accuracy**                | % of predictions that are correct                                                                                                         |
| **recall**                  | % of actually positive reviews that were correctly classified as positive                                                                 |
| **precision**               | % of predicted positive reviews that were actually positive                                                                               |
| **f-1**                     | Harmonic mean of precision and recall. Used as an overall measure of model performance. The maximum score is 1. Scores below .5 are poor. |
| **Cohen's Kappa**           | Measures how well the model performs relative to a model based on the marginal probabilities of each class. Higher is better.             |
| **Balanced Accuracy**       | Average of the recall for each class, which accounts for imbalance between the classes                                                    |
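
As a worked example of the metrics in the table, the sketch below computes each one by hand from the four cells of a confusion matrix. The counts are made up for illustration and are not results from the IMDB data.

```python
# Hypothetical confusion-matrix counts for 100 reviews (50 positive, 50 negative)
tp, fn = 45, 5    # actual positives: predicted positive / predicted negative
fp, tn = 15, 35   # actual negatives: predicted positive / predicted negative
total = tp + fn + fp + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)              # recall for the positive class
specificity = tn / (tn + fp)         # recall for the negative class
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

# Cohen's kappa: observed agreement vs. agreement expected from the marginals alone
p_observed = accuracy
p_expected = ((tp + fn) / total) * ((tp + fp) / total) + ((tn + fp) / total) * ((tn + fn) / total)
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(f"f1={f1:.3f} balanced_accuracy={balanced_accuracy:.2f} kappa={kappa:.2f}")
```

With balanced classes, as here, accuracy and balanced accuracy come out the same; they diverge when one class is much more common than the other.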

<h2 style="color:red;font-weight: bold;">Question 2:</h2><p style="color:red;font-weight: bold;">Run the code below to get predictions from the VADER sentiment analysis tool on the testing data only. Then create a classification report that compares the results from VADER to the actual values of y_test</p>

In [None]:
sia = SentimentIntensityAnalyzer()
vader_scores = [sia.polarity_scores(i)['compound'] for i in X_test]
vader_preds = [i>=.05 for i in vader_scores]

# now compare vader_preds to y_test



So how does the Naive Bayes model perform relative to VADER? More importantly, why is there a difference? Primarily, this probably comes down to the differences in context: there are a lot of terms that indicate negative views in the IMDB corpus that probably wouldn't indicate negative views in other contexts. We can get a sense of this by extracting some of the most important features from the model using the function below.


In [None]:
def getFeatures(labels, features, nbmodel):
    """Takes a set of labels, feature names, and a fitted naive bayes classifier and returns the odds ratio of p(positive|word)/p(negative|word)
    Code adapted from: https://stackoverflow.com/a/62175164"""
    # get the probability of positive and negative classes
    class_probs = labels.value_counts(normalize=True)
    prob_neg = class_probs[0]
    prob_pos = class_probs[1]
    # making a data frame with the results
    df_nbf = pd.DataFrame()
    df_nbf.index = features
    # feature_log_prob_ holds log p(word|class); exponentiating turns these back into probabilities
    df_nbf['pos'] = np.exp(nbmodel.feature_log_prob_[1, :]) # p(word|positive class)
    df_nbf['neg'] = np.exp(nbmodel.feature_log_prob_[0, :]) # p(word|negative class)
    # terms with the highest ratio of association with predicting one class
    # p(word|positive)/p(word|negative) * (p(positive)/p(negative)) = p(positive|word)/p(negative|word)
    df_nbf['odds_positive'] = (df_nbf['pos']/df_nbf['neg'])*(prob_pos /prob_neg)
    df_nbf = df_nbf.sort_values('odds_positive',ascending=False).reset_index(names='term')
    return df_nbf

In [None]:
features = nb_pipeline.named_steps['DTM'].get_feature_names_out() # getting the terms from the dtm
nb = nb_pipeline.named_steps['naivebayes']                        # getting the weights from the classifier

getFeatures(y_train, features, nb).head()
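
To see where these ratios come from, here is a sketch that fits `MultinomialNB` on a tiny made-up count matrix and compares the per-class log probabilities directly. The two "terms" and their counts are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical document-term counts; columns are the terms ["great", "boring"]
toy_X = np.array([[3, 0],
                  [2, 1],
                  [0, 3],
                  [1, 2]])
toy_y = np.array([1, 1, 0, 0])   # 1 = positive review, 0 = negative review

toy_nb = MultinomialNB().fit(toy_X, toy_y)

# feature_log_prob_[k, j] is log P(term j | class k); rows follow toy_nb.classes_ (here [0, 1])
log_ratio = toy_nb.feature_log_prob_[1] - toy_nb.feature_log_prob_[0]
print(dict(zip(["great", "boring"], log_ratio.round(3))))
```

A positive log ratio means the term is more likely under the positive class, and a negative one means it is more likely under the negative class. With the default add-one smoothing, "great" has probability 6/8 in positive reviews and 2/8 in negative ones, so its log ratio is log(3).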

Now we can look at some features associated with positive or negative reviews

And with a little reshaping, we can plot them

In [None]:
top = getFeatures(y_train, features, nb)
top_bottom = pd.concat([top.iloc[:15], top.iloc[-15:]])
ax = sns.barplot(data=top_bottom,
                 y= 'term',    
                 hue='term',
                x=np.log(top_bottom['odds_positive']),dodge=False, palette='turbo')
ax.set(xlabel='Strength of association with positive\n vs. negative reviews', ylabel='term')


The results here should give you a rough idea of how and why the Naive Bayes model is able to outperform VADER: there are a number of terms - particularly the names of actors (Seagal, Matthau) and directors (Boll) - that are strongly associated with negative or positive reviews in this corpus. 


<b>Note that adaptability can be a double-edged sword:</b> the model makes no distinction between *terms* that indicate negative feelings and *specific groups or individuals* who might be frequent targets of negative sentiments in the training data. As a result, it's easy to fit models that make inferences that reflect biases - including cultural, racial, or gender biases - that we don't want to perpetuate. There's no easy fix for this problem, so it's important to remain cognizant of this risk, especially when using machine learning models to make high stakes decisions.

## Improving the pre-processing

Our pre-processing steps can make a big difference in model performance. I'll try a new processing step that includes **bi-grams** and that reweights the data so that certain terms count for more using the **TF-IDF** scheme. You can click the boxes below to get a little more detail on how these modify our data.


### Bi-grams
<details>
  <summary>(Click to expand)</summary>
N-grams are consecutive sequences of words. If I take a sentence like "See spot run"

- **uni-grams** are (see, spot, run)
- **bi-grams** are (see_spot, spot_run)
- **tri-gram** is (see_spot_run) 

Adjusting the `ngram_range=(0,1)` to `ngram_range=(0,2)` will make a document term matrix that includes all unigrams AND all bigrams in the corpus. Note that this will have the effect of nearly doubling the number of columns in my matrix, so the model will take noticeably longer to run, but we may get some improved predictions as a result.

</details>
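
As a quick illustration of the n-gram idea, here is a sketch of a helper that builds n-grams from a token list. The `ngrams_of` function is made up for this example; `CountVectorizer` builds its n-grams internally and joins the words with a space rather than an underscore.

```python
def ngrams_of(tokens, n):
    """Return all runs of n consecutive tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

spot_tokens = ["see", "spot", "run"]
unigrams = ngrams_of(spot_tokens, 1)
bigrams = ngrams_of(spot_tokens, 2)
trigrams = ngrams_of(spot_tokens, 3)
print(unigrams, bigrams, trigrams)
```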

### TF-IDF
<details>
  <summary>(Click to expand)</summary>



TF-IDF (Term Frequency - Inverse Document Frequency) is a common weighting scheme that causes rare words to receive a bigger weight in the document-term matrix. The default version of the scheme used by scikit-learn uses this formula:

$$\text{TF-IDF}(t, d) = \text{TF}(t,d) \times \text{IDF}(t) $$

Where: 
$$\text{TF}(t,d) = \text{Number of occurrences of term } t \text{ in document } d$$

And:

$$\text{IDF}(t) = \log  \dfrac{\text{Total number of documents} +1 }{\text{Number of documents containing term } t + 1} +1 $$


So, the "TF" part is simply the word count, while the "IDF" weighting shrinks as a term appears in more documents: a common term has a lower weight and rare terms have higher weights. By default, scikit-learn then rescales each document's row to unit length (`norm='l2'`), so long reviews don't dominate short ones just by containing more words.
</details>
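
Here is a small sketch of the smoothed IDF formula on its own. The document counts are hypothetical: a 4-document corpus where one term appears everywhere and another appears only once.

```python
import math

n_docs = 4
# Hypothetical document frequencies: number of documents containing each term
doc_freq = {"the": 4, "great": 2, "flop": 1}

# smoothed IDF: log((N + 1) / (df + 1)) + 1, using the natural log
idf = {term: math.log((n_docs + 1) / (df + 1)) + 1 for term, df in doc_freq.items()}

for term, weight in idf.items():
    print(f"{term:<6} df={doc_freq[term]}  idf={weight:.3f}")
```

A term that appears in every document gets the minimum weight of 1 (the log term is zero), and the weight rises as the term gets rarer.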

Here's our new pipeline. Note that all I've really changed here is the vectorizer (now a `TfidfVectorizer` that uses our `tokenize` function) and the `ngram_range` argument:

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a pipeline
nb_tfidf_ngram = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
                             tokenizer = tokenize,
                             ngram_range=(0,2), # groups of multiple words
                             strip_accents='unicode',
                             max_df = 0.1, # maximum share of documents in which word j may occur (10%)
                             min_df = .0025 # minimum share of documents in which word j must occur (0.25%)
                            )),
    ('naivebayes',MultinomialNB())
])




Now we can apply it to the data and fit a new model to check our results:

In [None]:
nb_tfidf_ngram.fit(X_train, y_train)


In [None]:
preds = nb_tfidf_ngram.predict(X_test)
print(classification_report(y_test, preds, 
                            # add target_names to show labels in the report:
                           target_names=['negative', 'positive']))

# add cohen's kappa and balanced accuracy
print("cohens kappa: ", cohen_kappa_score(y_test, preds))
print("balanced accuracy: ", balanced_accuracy_score(y_test, preds))

Now compare the important features from this model to the important features from the model that used unweighted unigrams:

In [None]:
features = nb_tfidf_ngram.named_steps['tfidf_vectorizer'].get_feature_names_out() # getting the terms from the dtm
nb_model = nb_tfidf_ngram.named_steps['naivebayes']                   # getting the weights from the classifier

top = getFeatures(y_train, features, nb_model)
top_bottom = pd.concat([top.iloc[:15], top.iloc[-15:]])
ax = sns.barplot(data=top_bottom,
                 y= 'term',    
                 hue='term',
                x=np.log(top_bottom['odds_positive']),dodge=False, palette='turbo')
ax.set(xlabel='Strength of association with positive\n vs. negative reviews', ylabel='term')


<h2 style="color:red;font-weight: bold;">Question 3:</h2> 
<p style="color:red;font-weight: bold;">Try changing the pre-processing steps for the reviews data. Consider making changes to <code>max_df</code> or <code>min_df</code>, or use a different stemming algorithm.  What steps you choose here is up to you. Fit a new naive bayes model and compare your results.</p>

## Fitting a different model

Can we do better with a better model? The Naive Bayes classifier, after all, fails to account for correlations between predictors. Logistic regression is one alternative that does not have the same "naive" assumptions. While an unpenalized logit model can be unstable here (because we have a very large number of sparse predictors), the scikit-learn version of the logit model applies an L2 (ridge) penalty by default, which shrinks the coefficients toward zero during the fitting process. A lasso (L1) penalty, which performs a kind of automatic variable selection, is available as an option.

In [74]:
from sklearn.linear_model import LogisticRegression
# Create a pipeline
logistic_tfidf_ngram = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(
                             tokenizer = tokenize,
                             ngram_range=(0,2), # groups of multiple words
                             strip_accents='unicode',
                             max_df = 0.1, # maximum share of documents in which word j may occur (10%)
                             min_df = .0025 # minimum share of documents in which word j must occur (0.25%)
                            )),
    ('logistic',LogisticRegression())
])


logit_model = logistic_tfidf_ngram.fit(X_train, y_train)

In [75]:
preds = logit_model.predict(X_test)
print(classification_report(y_test, preds, 
                            # add target_names to show labels in the report:
                            target_names=['negative', 'positive']))

# add cohen's kappa and balanced accuracy
print("cohens kappa: ", cohen_kappa_score(y_test, preds))
print("balanced accuracy: ", balanced_accuracy_score(y_test, preds))

              precision    recall  f1-score   support

    negative       0.89      0.82      0.85       499
    positive       0.84      0.90      0.87       501

    accuracy                           0.86      1000
   macro avg       0.86      0.86      0.86      1000
weighted avg       0.86      0.86      0.86      1000

cohens kappa:  0.7199585538659722
balanced accuracy:  0.8599274397097589


## K-fold cross validation

Although it's unlikely, given how many test examples we have, we might want to make sure that our performance with this model isn't just a function of random chance, and we also want to avoid over-fitting to a single test data set. To avoid this problem, we'll often use k-fold cross validation. In k-fold cross validation, we separate the data into "k" equally sized groups, and then loop through the folds using each one as the validation data and all of the remaining observations as training data. Scikit-learn has an easy method for running k-fold cross validation and getting some overall metrics. I'll create a new pipeline for this, but this is just to make sure I get a model that hasn't "seen" any of the reviews data before.

In [76]:
cross_validation_pipeline = Pipeline([
    ('tfidf',TfidfVectorizer(analyzer = 'word',
                             ngram_range=(0,1), 
                             strip_accents='unicode',
                             max_df = 0.1, # maximum share of documents in which word j may occur (10%)
                             min_df = .0025 # minimum share of documents in which word j must occur (0.25%)
                            )),
    ('naivebayes',MultinomialNB())
])


cross_val = cross_validate(cross_validation_pipeline, 
               imdb.text,   # the full IMDB data loaded earlier; cross_validate does its own splitting
               imdb.label, 
               cv=40,
               scoring =['f1', 'balanced_accuracy']
              
              )




In [77]:
pd.DataFrame(cross_val).describe()

|       | fit_time | score_time | test_f1  | test_balanced_accuracy |
|-------|----------|------------|----------|------------------------|
| count | 40.0     | 40.0       | 40.0     | 40.0                   |
| mean  | 0.400579 | 0.011481   | 0.846313 | 0.846393               |
| std   | 0.011234 | 0.000906   | 0.032429 | 0.031659               |
| min   | 0.395329 | 0.010026   | 0.784    | 0.78405                |
| 25%   | 0.396755 | 0.01076    | 0.823292 | 0.824405               |
| 50%   | 0.398012 | 0.011383   | 0.847961 | 0.847798               |
| 75%   | 0.399468 | 0.012152   | 0.873016 | 0.872024               |
| max   | 0.465295 | 0.013395   | 0.909091 | 0.911802               |
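
To make the fold mechanics concrete, here is a minimal sketch of how the row indices get divided up in k-fold cross validation. The `kfold_indices` helper is made up for this illustration. It splits rows in their original order, and scikit-learn's own splitters offer more options (such as stratifying by class).

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k consecutive folds."""
    # spread any remainder across the first few folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop

toy_folds = list(kfold_indices(n_samples=10, k=3))
for fold_number, (train, validation) in enumerate(toy_folds, start=1):
    print(f"fold {fold_number}: validate on {validation}, train on {train}")
```

Every row is used for validation exactly once, and it is never in the training set of the fold that validates it.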
