# Text Classification in Python Using spaCy

Text is an extremely rich source of information. Every minute, millions of text messages are shared around the globe, and valuable insights can be mined from them.

However, because text is *unstructured* and arrives in large volumes and at high velocity, mining those insights by hand is tiresome, time-consuming and sometimes impossible.

## spaCy
To transform unstructured text data into something more useful for analysis and natural language processing (NLP), `spaCy` is used. 

`spaCy` is an NLP library for Python.

In [12]:
# Installing spaCy and its English-language model.

#!pip install spacy
#!python -m spacy download en

# The leading ! tells the Jupyter notebook to run the line as a command-line command.

# Loading the model via the 'en' shortcut may fail with:
# OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
# This happens when the shortcut link for 'en' could not be created (for example, without admin permissions).
# Solution: load the model via its full package name instead: nlp = spacy.load('en_core_web_sm')

## Tokenization
Tokenization is the process of breaking text into pieces, called **tokens**. There can be:
1. word tokenization
2. sentence tokenization

In [13]:
#1. Word Tokenization

# importing the English Language model
from spacy.lang.en import English


# create a blank English pipeline; it contains only the tokenizer
# (no tagger, parser, NER or word vectors)
nlp = English()

# assign the text to be processed to the variable text
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

#  "nlp" Object is used to create documents with linguistic annotations.
my_doc = nlp(text)    # at this point the text is tokenized

# create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
print(token_list)

['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']


A list containing each word token as a separate item has been produced. Note that spaCy has recognized that contractions such as *shouldn't* actually represent two distinct words, and has split them into two tokens.
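To make the contraction splitting concrete, here is a minimal plain-Python sketch based on a regular expression. It is not how spaCy's tokenizer is implemented; the pattern and the function name `simple_tokenize` are assumptions made purely for illustration.

```python
import re

# Hypothetical pattern, tried left to right at each position:
#   n't          -> the negation suffix as its own token
#   '\w+         -> other contraction suffixes such as 're or 've
#   \w+(?=n't)   -> the part of a word that precedes n't
#   \w+          -> ordinary words
#   [^\w\s]      -> any single punctuation mark
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, contraction and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("You shouldn't get discouraged, they're just part of the journey!")
print(tokens)
```

A rule set this small will mishandle many real-world cases (URLs, abbreviations, hyphenated words), which is why a dedicated tokenizer such as spaCy's is preferable in practice.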

In [14]:
#2. Sentence tokenization

# create a blank English pipeline; it contains only the tokenizer
nlp = English()

# create the pipeline 'sentencizer' component
sbd = nlp.create_pipe('sentencizer')

# add the component to the pipeline
# (spaCy v2 API; in spaCy v3 and later, use nlp.add_pipe('sentencizer') instead)
nlp.add_pipe(sbd)

# assign the text to a variable
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)

["When learning data science, you shouldn't get discouraged!", "\nChallenges and setbacks aren't failures, they're just part of the journey.", "You've got this!"]


A list containing each sentence as a separate item has been produced.

## Cleaning Text Data: Removing Stopwords
Most text data that we work with contains a lot of words that aren’t actually useful to us. These words, called **stopwords**, are useful in human speech, but they don’t have much to contribute to data analysis. 

Removing stopwords eliminates noise and distraction from the text data, and also speeds up analysis because there are fewer words to process.

In [15]:
# spaCy's default stop words
# importing the stop words for the English language
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS

# printing the total number of stop words:
print('Number of stop words: %d' % len(spacy_stopwords))

# printing ten stop words (STOP_WORDS is a set, so which ten are shown is arbitrary):
print('First ten stop words: %s' % list(spacy_stopwords)[:10])

Number of stop words: 326
First ten stop words: ['therein', 'is', 'anywhere', 'afterwards', 'our', 'top', 'eleven', 'here', 'whoever', 'even']


## Removing Stopwords from the Data
Next, stopwords are removed from the text stored in the variable `text`, using spaCy's default stopword list.

An empty list, `filtered_sent`, is created, and the `doc` variable is iterated over to look at each token from the source text.

The spaCy token attribute `is_stop` is used to identify tokens that aren't in the stopword list, and those tokens are appended to `filtered_sent`.

In [16]:
from spacy.lang.en.stop_words import STOP_WORDS

# Implementation of stop words:
filtered_sent=[]

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# filtering stop words
for word in doc:
    if not word.is_stop:
        filtered_sent.append(word)
        
print("Filtered Sentence:",filtered_sent)

Filtered Sentence: [learning, data, science, ,, discouraged, !, 
, Challenges, setbacks, failures, ,, journey, ., got, !]


Removing the stopwords has reduced the original text to just a few words that give the general idea of what the sentences are discussing: learning data science, and not being discouraged by challenges and setbacks along that journey.

## Lexicon Normalization
Lexicon normalization is another step in the data cleaning process. It maps the different surface forms of a word onto a single form, which shrinks the vocabulary and therefore reduces the number of features a machine learning model has to handle.

## Lemmatization
Lemmatization is a way of processing words that reduces them to their roots. Words like connect, connection, connecting and connected aren't exactly the same, but they all share the same essential meaning: connect. The differences in spelling serve grammatical functions in spoken language, but for machine processing they can be confusing. A way is therefore needed to turn all the forms of the word connect into the word connect itself.

One method for doing this is called **stemming**. Stemming involves simply lopping off easily identified prefixes and suffixes to produce what's often the simplest version of a word. Connection, for example, would have the -ion suffix removed and be correctly reduced to connect. This kind of simple stemming is often all that's needed, but **lemmatization**, which actually looks at words and their roots (called **lemmas**) as described in a dictionary, is more precise (as long as the words exist in the dictionary).
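The suffix-lopping idea behind stemming can be sketched in a few lines of plain Python. This is a deliberately naive illustration and not a real stemming algorithm such as Porter's; the suffix list and the function name `naive_stem` are assumptions chosen for this example.

```python
# Hypothetical suffix list; the longer suffixes are listed before the shorter ones.
SUFFIXES = ("ion", "ing", "ed", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_length:
            return lowered[: -len(suffix)]
    return lowered


for word in ["connect", "connection", "connecting", "connected", "connects"]:
    print(word, "->", naive_stem(word))

# The limits of blind suffix stripping show up quickly:
print("running", "->", naive_stem("running"))  # gives "runn", not "run"
```

All five forms of connect collapse to the same stem, but "running" becomes "runn", which is the kind of error a dictionary-based lemmatizer avoids.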

Since spaCy includes a built-in way to break a word down into its lemma, we can simply use that for lemmatization. In the following example, we'll use `.lemma_` to produce the lemma for each word we're analyzing.

In [17]:
# Implementing lemmatization
lem = nlp("was rats")

# finding lemma for each word
for word in lem:
    print(word.text,word.lemma_)

was was
rats rats


In [18]:
# Implementing lemmatization
lem1 = nlp("was rats")
lem2 = nlp("run runs running runner")

# finding lemma for each word
for word in lem1:
    print(word.text, word.lemma_.lower().strip())

print("\n")  

for word in lem2:
    print(word.text, word.lemma_.lower().strip())
 

was was
rats rats


run run
runs runs
running running
runner runner

In both cells the printed lemmas are identical to the input words. The most likely reason is that `nlp` is still the blank `English()` pipeline created for tokenization, which carries no lemmatization data. Running the same code after loading a trained model, for example with `nlp = spacy.load('en_core_web_sm')`, should produce actual lemmas such as *be* for *was*.


## Part of Speech (POS) Tagging
A word’s **part of speech** defines its function within a sentence. A noun, for example, identifies an object. An adjective describes an object. A verb describes action. Identifying and tagging each word’s part of speech in the context of a sentence is called ***Part-of-Speech Tagging, or POS Tagging***.

For POS tagging with spaCy, the `en_core_web_sm` model is required, so it is imported here; it contains the vocabulary and grammatical information needed for this analysis.

In [19]:
# POS tagging

# importing the English model en_core_web_sm for vocabulary, syntax & entities
import en_core_web_sm

# load en_core_web_sm for vocabulary, syntax & entities
nlp = en_core_web_sm.load()

# "nlp" Object is used to create documents with linguistic annotations.
# the u prefix in u"All is well that ends well." marks a Unicode string (optional in Python 3, where all strings are Unicode).
docs = nlp(u"All is well that ends well.")

for word in docs:
    print(word.text,word.pos_)

All DET
is AUX
well ADJ
that DET
ends VERB
well ADV
. PUNCT


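`spaCy` has assigned a part of speech to each word in the sentence above, and has distinguished the adjectival use of the first *well* from the adverbial use of the second. Tags from a statistical model are not guaranteed to be right in every case; *that*, for instance, is tagged as a determiner here although it arguably acts as a relative pronoun. Being able to identify parts of speech is useful in a variety of NLP-related contexts, because it helps more accurately understand input sentences and more accurately construct output responses.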

## Entity Detection
**Entity detection**, also called **entity recognition**, is a more advanced form of language processing that identifies important elements like places, people, organizations, and languages within an input string of text. This is really helpful for quickly extracting information from text, since one can quickly pick out important topics or identify the key sections of text.

A few paragraphs from a Washington Post article are used to demonstrate entity detection.
`.label_` is used to grab the label name for each entity detected in the text (`.label` holds the corresponding integer ID), and the entities are then shown in a more visual format using spaCy's `displaCy` visualizer.

In [20]:
# importing displacy from spaCy to visualize entity detection:

from spacy import displacy

nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid 
an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. 
The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations,
including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

entities = [(i, i.label_, i.label) for i in nytimes.ents]
entities

[(New York City, 'GPE', 384),
 (Tuesday, 'DATE', 391),
 (At least 285, 'CARDINAL', 397),
 (September, 'DATE', 391),
 (Brooklyn, 'GPE', 384),
 (Williamsburg, 'GPE', 384),
 (four, 'CARDINAL', 397),
 (Bill de Blasio, 'PERSON', 380),
 (Tuesday, 'DATE', 391),
 (Orthodox Jews, 'PERSON', 380),
 (6 months old, 'DATE', 391),
 (up to $1,000, 'MONEY', 394)]

From the short example above it can be seen that we've been able to identify a variety of different **entity types**, including *specific locations (**GPE**), date-related words (**DATE**), important numbers (**CARDINAL**), specific individuals (**PERSON**),* etc.

Using displaCy one can also visualize our input text, with each identified entity highlighted by color and labeled. `style = "ent"` is used to tell displaCy that we want to visualize entities here.


In [21]:
displacy.render(nytimes, style="ent", jupyter=True)

## Dependency Parsing
Dependency parsing is a language processing technique that allows one to better determine the meaning of a sentence by analyzing how it's constructed and how the individual words relate to each other.

Consider, for example, the sentence "Bill throws the ball." We have two nouns (Bill and ball) and one verb (throws). But we can't just look at these words individually, or we may end up thinking that the ball is throwing Bill! To understand the sentence correctly, we need to look at the word order and sentence structure, not just the words and their parts of speech.
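One way to picture what a dependency parse records is as a list of (token, relation, head) triples. The sketch below uses a hand-written annotation of the example sentence, not actual spaCy output; the helper name `find_dependent` is hypothetical, and the labels `nsubj`, `dobj`, `det` and `ROOT` are assumed here as conventional dependency labels.

```python
# Hand-written annotation (an assumption, not parser output):
# each entry is (token, dependency label, head token).
parse = [
    ("Bill", "nsubj", "throws"),   # Bill is the subject of "throws"
    ("throws", "ROOT", "throws"),  # the main verb is the root of the sentence
    ("the", "det", "ball"),        # "the" is the determiner of "ball"
    ("ball", "dobj", "throws"),    # ball is the direct object of "throws"
]


def find_dependent(parse, head, label):
    """Return the first token attached to `head` with relation `label`, or None."""
    for token, dep, token_head in parse:
        if token_head == head and dep == label and token != head:
            return token
    return None


subject = find_dependent(parse, "throws", "nsubj")
direct_object = find_dependent(parse, "throws", "dobj")
print(f"{subject} is doing the throwing; {direct_object} is being thrown.")
```

With the relations made explicit, there is no ambiguity about who throws what, even though both "Bill" and "ball" are nouns.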

Here we use another `spaCy` feature, `noun_chunks`, which breaks the input down into nouns and the words describing them. We iterate through each chunk in the source text and print the chunk's text, its root word, the root's dependency label, and the root's head word.

In [22]:
docp = nlp(" In pursuit of a wall, President Trump ran into one.")

for chunk in docp.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

pursuit pursuit pobj In
a wall wall pobj of
President Trump Trump nsubj ran


This output can be a little difficult to follow, but since the displaCy visualizer is already imported, we can use it to view a dependency diagram using `style="dep"`, which is much easier to understand:

In [23]:
displacy.render(docp, style="dep", jupyter=True)

## Word Vector Representation
When we’re looking at words alone, it’s difficult for a machine to understand connections that a human would understand immediately. Engine and car, for example, have what might seem like an obvious connection (cars run using engines), but that link is not so obvious to a computer.

To tackle this challenge, words can be represented in a way that captures more of these sorts of connections. A **word vector** is a numeric representation of a word that communicates its relationship to other words.

Each word is represented as a unique and lengthy array of numbers. One can think of these numbers as being something like GPS coordinates. GPS coordinates consist of two numbers (latitude and longitude), and if we saw two sets of GPS coordinates that were numerically close to each other (like 43,-70 and 44,-70), we would know that those two locations were relatively close together. Word vectors work similarly, although there are a lot more than two coordinates assigned to each word, so they're much harder for a human to eyeball.
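The notion of "numerically close" can be made concrete with two standard measures, Euclidean distance and cosine similarity. The sketch below is plain Python; the far-away coordinate and the three-dimensional "word vectors" are made-up values for illustration only, not vectors from any real model.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The two coordinate pairs from the text, plus a hypothetical distant point.
point_a = (43.0, -70.0)
point_b = (44.0, -70.0)
point_far = (-33.0, 151.0)

print(euclidean_distance(point_a, point_b))    # small: the points are close
print(euclidean_distance(point_a, point_far))  # large: the points are far apart

# Made-up 3-dimensional "word vectors" (real ones have many more dimensions).
car = (0.9, 0.8, 0.1)
engine = (0.8, 0.9, 0.2)
banana = (0.1, 0.0, 0.9)

print(cosine_similarity(car, engine))  # close to 1: related words
print(cosine_similarity(car, banana))  # much lower: unrelated words
```

Cosine similarity is the measure most commonly used for comparing word vectors, because it depends on the direction of the vectors rather than their length.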

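Using spaCy's `en_core_web_sm` model, the cell below illustrates the length of the vector for a single word, and what that vector looks like, using `.vector` and `.shape`. Note that the small model does not ship with pretrained static word vectors, so the values returned by `.vector` here come from the model's internal context-sensitive tensors; the medium and large models (`en_core_web_md`, `en_core_web_lg`) include real word vectors.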

In [24]:
import en_core_web_sm

nlp = en_core_web_sm.load()

luciana = nlp(u'Luciana')

print(luciana.vector.shape)
print(luciana.vector)

(96,)
[ 0.10664701 -0.07504743  0.03595835  2.2842727   2.5875654   6.391607
  4.3126864   1.25866     1.7773373   0.8812995   1.852885    0.14802298
 -0.48537052  0.22203338  0.28891057  0.11898458 -1.4072561   0.93513054
  3.283813   -1.6075639  -1.0125482  -0.49704075  2.1452773  -4.917537
 -0.5037513  -1.3133576   1.1718066  -0.94715214  0.15880446 -1.8873702
 -0.10553706 -0.81776214  1.3617762   2.1530282   3.0421371  -4.097598
  2.4708734  -0.01782579 -1.6933503   0.49135697  3.8267903  -2.5971653
  0.17764509 -4.442807    0.19148743  1.2740519  -0.4889225  -2.1662173
  1.1703099   2.4897647  -1.1301436  -1.483716   -3.1136823  -0.8048452
 -4.445101    2.255146    0.56909084  0.08033788  2.9775665   0.2835077
  0.28063476 -0.4804771  -0.43213904 -0.1735108   0.70856726 -2.8426328
  2.871697   -3.9304304  -1.58431    -0.89612997 -1.6255403   1.1376268
  0.37081936  1.0023171  -1.6875923  -3.7045298   3.3940258  -0.5357765
 -3.3093276   0.04607564 -0.65839314 -3.007496   -0.9103234

Representing the word this way works well for machines, because it allows one to represent both the word’s meaning and its “proximity” to other similar words using the coordinates in the array.

Importing libraries to help with the analysis:

In [25]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

## Loading Data
The [Amazon Alexa Product Reviews dataset](https://www.kaggle.com/sid321axn/amazon-alexa-reviews/data) will be used for illustration. This dataset contains consumer reviews of Amazon Alexa products such as Echos, Echo Dots and Alexa Firesticks. It comes as a tab-separated file (.tsv) with five columns: `rating`, `date`, `variation`, `verified_reviews` and `feedback`.
- `rating` denotes the rating each user gave the Alexa (out of 5). 
- `date` indicates the date of the review.
- `variation` describes which model the user reviewed.
- `verified_reviews` contains the text of each review.
- `feedback` contains a sentiment label, with 1 denoting positive sentiment (the user liked it) and 0 denoting negative sentiment (the user didn’t).
 
The objective of this project is to ***develop a classification model that looks at the review text and predicts whether a review is positive or negative.*** 
Since this data set already records whether each review is positive or negative in the `feedback` column, those answers can be used to train and test the model. The goal is to produce an accurate model that can then be used to process new user reviews and quickly determine whether they are positive or negative.

The data is read into a `pandas` dataframe, and pandas' built-in functions are then used to take a closer look at it.

In [26]:
# loading TSV file
df_amazon = pd.read_csv("C:/Users/Luci/Desktop/Data Science/DataQuest/my_datasets/amazon_alexa.tsv", sep="\t")

# top 5 records
df_amazon.head()

|   | rating | date | variation | verified_reviews | feedback |
|---|---|---|---|---|---|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |


In [27]:
# shape of dataframe
df_amazon.shape

(3150, 5)

The dataframe has 3150 rows and 5 columns.

In [28]:
# view data information
df_amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
rating              3150 non-null int64
date                3150 non-null object
variation           3150 non-null object
verified_reviews    3150 non-null object
feedback            3150 non-null int64
dtypes: int64(2), object(3)
memory usage: 123.1+ KB


In [29]:
# feedback Value count
df_amazon.feedback.value_counts()

1    2893
0     257
Name: feedback, dtype: int64

From the feedback value counts, **2893 users liked** the Amazon Alexa product (1) while **257 users didn't like** it (0).
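These counts show that the two classes are heavily imbalanced, which matters when judging a classifier later on. As a quick reference point, the arithmetic below computes the accuracy of a trivial model that always predicts the majority class; the variable names are chosen for this illustration.

```python
# Class counts taken from the value_counts() output above.
positive_reviews = 2893
negative_reviews = 257

total_reviews = positive_reviews + negative_reviews

# Accuracy of a baseline that labels every review as positive.
majority_baseline = max(positive_reviews, negative_reviews) / total_reviews

print(f"Total reviews: {total_reviews}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")  # 0.9184
```

Any trained model should be compared against this baseline of roughly 91.8%, since accuracy alone can look impressive on imbalanced data.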

## Tokenizing the Data with spaCy
First, a custom tokenizer function is created using spaCy. This function automatically strips information we don't need, like stopwords and punctuation, from each review.

The English language class is imported from spaCy, along with Python's `string` module, which provides a helpful list of all punctuation marks in `string.punctuation`. 
Variables holding the punctuation marks and stopwords to be removed are created, as well as a parser that runs input through spaCy's English language class.
A `spacy_tokenizer()` function is then created that accepts a sentence as input and processes it into tokens, performing *lemmatization, lowercasing, and stopword removal*. 

In [30]:
import string

import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# create the list of punctuation marks
punctuations = string.punctuation

# load the small English model and create our list of stopwords
nlp = spacy.load('en_core_web_sm')
stop_words = STOP_WORDS

# create a blank English pipeline (tokenizer only) to use as the parser
parser = English()

# creating the tokenizer function, whose input is a sentence
def spacy_tokenizer(sentence):
    # creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

## Defining a Custom Transformer
To further clean the text data, a custom transformer is created that removes leading and trailing spaces and converts text to lowercase. 
Here, a custom `predictors` class is created which inherits from the `TransformerMixin` class. This class defines the `transform`, `fit` and `get_params` methods. 
A `clean_text()` function is created that strips the spaces and converts text to lowercase.

In [31]:
# Custom transformer using spaCy
# TransformerMixin is the base class our transformer class is built on;
# listing it in parentheses in the class declaration tells Python that our class inherits from it

class predictors(TransformerMixin):

    # As with most custom transformers:
    #   - the fit method only needs to return self
    #   - the transform method does the real work;
    #     in this case it simply needs to return the cleaned text
    
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

## Vectorization Feature Engineering (TF-IDF)
A text classifier is trained on text snippets paired with their labels. But raw text strings can't be fed to a machine learning model directly, so a way is needed to represent the text numerically, just as the labels already are (1 for positive and 0 for negative). 
Classifying text into positive and negative labels is called ***sentiment analysis***.

One tool for representing text numerically is called **Bag of Words** (BoW). BoW converts text into a matrix of word occurrences within each document. It records which words occur in a document (and how often), ignoring their order, and produces what is referred to as a BoW matrix or document-term matrix.
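To see what such a matrix looks like, here is a minimal sketch that builds one by hand with the standard library. The three short "reviews" are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical mini-corpus; each string plays the role of one document.
docs = ["love my echo", "love it love it", "echo stopped working"]

# Naive whitespace tokenization (an assumption made to keep the sketch short).
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all distinct tokens; it defines the columns.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary term, cells hold counts.
bow_matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row has the same length regardless of how long the document is, which is what makes this representation usable as model input.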

To generate a BoW matrix for the text data, scikit-learn's `CountVectorizer` can be used. 
In the code below, we're telling `CountVectorizer` to use the custom `spacy_tokenizer` function we built as its tokenizer, and defining the n-gram range we want.
***N-grams*** are combinations of adjacent words in a given text, where n is the number of words included in each token. For example, in the sentence "Who will win the football world cup in 2022?" unigrams would be single words such as "who", "will", "win" and so on. Bigrams would be sequences of 2 contiguous words such as "who will", "will win", and so on. The `ngram_range` parameter used in the code below sets the lower and upper bounds of the n-grams (we'll be using unigrams only). The configured vectorizer is then assigned to `bow_vector`.


In [32]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

**TF-IDF (Term Frequency-Inverse Document Frequency)**: this is simply a way of normalizing our Bag of Words (BoW) by weighing each word's frequency within a document against how many documents contain that word. In other words, it's a way of representing how important a particular term is in the context of a given document, based on how many times the term appears in it and how many other documents that same term appears in. The higher the TF-IDF, the more important that term is to that document.
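A hand-computed example may help make the weighting concrete. The sketch below uses the common textbook formulation, term frequency times the logarithm of (number of documents / number of documents containing the term); scikit-learn's `TfidfVectorizer` uses a smoothed variant with L2 normalization by default, so its numbers will differ. The mini-corpus and the function name `tf_idf` are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical, already tokenized mini-corpus.
corpus = [
    ["love", "echo"],
    ["love", "music"],
    ["echo", "broke"],
]


def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    term_frequency = Counter(doc)[term] / len(doc)
    document_frequency = sum(1 for other in corpus if term in other)
    if document_frequency == 0:
        return 0.0  # the term never occurs in the corpus
    inverse_document_frequency = math.log(len(corpus) / document_frequency)
    return term_frequency * inverse_document_frequency


second_doc = corpus[1]
print("love :", tf_idf("love", second_doc, corpus))
print("music:", tf_idf("music", second_doc, corpus))
```

Both words appear once in the second document, but "music" scores higher because it occurs in no other document, while "love" is shared with the first one.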

TF-IDF features can be generated automatically using scikit-learn's `TfidfVectorizer`: 


In [33]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
# TfidfVectorizer is using the custom tokenizer that we built with spaCy, and the result assigned to the variable tfidf_vector.

## Splitting The Data into Training and Test Sets
When building a classification model, there has to be a way to know how it’s actually performing. Dividing the dataset into a ***training set*** and a ***test set*** allows us to determine how the model is performing. 

70% of the data set will be used as the training set, which includes the correct answers. The remaining 30% will be used to test the model. Without being given the answers, the model is measured by how accurately it predicts the labels of this held-out portion, i.e. the test set.

`scikit-learn` comes with a built-in function for splitting the dataset: `train_test_split()`. All it needs is the feature set to split (`X`), the labels to test against (`ylabels`), and the size of the test set (expressed as a fraction, here `0.3`).
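Conceptually, the split amounts to shuffling the row indices and cutting the shuffled list in two. The following plain-Python sketch illustrates that idea; it is a simplification and not scikit-learn's implementation, and the function name `simple_train_test_split`, the `seed` parameter and the toy data are assumptions.

```python
import random


def simple_train_test_split(features, labels, test_size=0.3, seed=42):
    """Shuffle indices with a fixed seed, then cut them into test and train parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)

    n_test = round(len(indices) * test_size)
    test_indices, train_indices = indices[:n_test], indices[n_test:]

    X_train = [features[i] for i in train_indices]
    X_test = [features[i] for i in test_indices]
    y_train = [labels[i] for i in train_indices]
    y_test = [labels[i] for i in test_indices]
    return X_train, X_test, y_train, y_test


# Toy data: ten made-up reviews with alternating labels.
reviews = [f"review {i}" for i in range(10)]
sentiments = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(reviews, sentiments)
print(len(X_tr), len(X_te))
```

Fixing the seed makes the split reproducible. scikit-learn's `train_test_split()` offers the same through its `random_state` parameter, and its `stratify` parameter can additionally keep the class proportions equal in both parts, which is worth considering for imbalanced data.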

In [34]:
from sklearn.model_selection import train_test_split

X = df_amazon['verified_reviews']    # the features(column) we want to analyze
ylabels = df_amazon['feedback']      # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

## Creating a Pipeline and Generating the Model
With the above steps complete, the next step is to build the model.
First, the **LogisticRegression** class is imported, and a **LogisticRegression classifier object** is created.

Then, a pipeline is created with three components: a cleaner, a vectorizer, and a classifier. 
- the cleaner uses the `predictors` class object to clean and preprocess the text
- the vectorizer uses the `CountVectorizer` object (`bow_vector`) to create the bag of words matrix for our text
- the classifier is an object that performs the logistic regression to classify the sentiments

Once the pipeline is built, the pipeline components are fit using `fit()`.
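What a pipeline does when it is fit can be sketched without scikit-learn: every step except the last transforms the data and hands it on, and the last step is trained on the result. The classes below (`Lowercaser`, `WordVoteClassifier`, `MiniPipeline`) are toy stand-ins invented for this illustration, not scikit-learn classes, and the training texts are made up.

```python
from collections import Counter, defaultdict


class Lowercaser:
    """Toy cleaning step: strip surrounding spaces and lowercase."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.strip().lower() for text in X]


class WordVoteClassifier:
    """Toy classifier: each known word votes for the label it was seen with most."""

    def fit(self, X, y):
        counts = defaultdict(Counter)
        for text, label in zip(X, y):
            for word in text.split():
                counts[word][label] += 1
        self.word_label_ = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.default_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        predictions = []
        for text in X:
            votes = Counter(self.word_label_[w] for w in text.split() if w in self.word_label_)
            predictions.append(votes.most_common(1)[0][0] if votes else self.default_)
        return predictions


class MiniPipeline:
    """Chain transformers, then fit or query the final estimator."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


mini_pipe = MiniPipeline([("cleaner", Lowercaser()), ("classifier", WordVoteClassifier())])
mini_pipe.fit(["  Love it ", "Terrible device", "LOVE the sound"], [1, 0, 1])
print(mini_pipe.predict(["Love this", "  TERRIBLE "]))
```

The benefit of this design is that exactly the same cleaning is applied at training time and at prediction time, so the two cannot drift apart.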

In [35]:
# Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)



Pipeline(memory=None,
         steps=[('cleaner', <__main__.predictors object at 0x000001B0A7964FD0>),
                ('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function spacy_tokenizer at 0x000001B09667FC80>,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_

## Evaluating the Model
Now that the model is trained, the test data is put through the pipeline to come up with predictions. 

Then, using various functions of scikit-learn's `metrics` module, the model's **accuracy, precision,** and **recall** are evaluated.
- *Accuracy* is the proportion of all predictions the model makes that are correct.
- *Precision* describes the ratio of true positives to true positives plus false positives in our predictions.
- *Recall* describes the ratio of true positives to true positives plus false negatives in our predictions.

All three metrics range from 0 to 1, where 1 means predicting everything correctly. Therefore, the closer the model's scores are to 1, the better.
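The three definitions above translate directly into counts of true and false positives and negatives. The sketch below computes them by hand on a small made-up set of labels; the function name `classification_metrics` and the example data are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up ground truth and predictions for eight reviews.
actual = [1, 1, 1, 1, 0, 0, 1, 0]
guessed = [1, 1, 1, 0, 1, 0, 1, 0]

accuracy, precision, recall = classification_metrics(actual, guessed)
print(accuracy, precision, recall)
```

Here there are 4 true positives, 2 true negatives, 1 false positive and 1 false negative, giving an accuracy of 0.75 and a precision and recall of 0.8 each.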


In [36]:
from sklearn import metrics
# predicting with a test dataset
predicted = pipe.predict(X_test)

# model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.9439153439153439
Logistic Regression Precision: 0.9498364231188658
Logistic Regression Recall: 0.9920273348519362


The model correctly identified a review's sentiment 94.4% of the time. When it predicted a review was positive, that review was actually positive 95.0% of the time. When handed a positive review, the model identified it as positive 99.2% of the time.