## Introduction to Text Classification

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import re
import numpy as np
import matplotlib.pyplot as plt

In this notebook, we'll be working with a set of reviews from Amazon.com. This is a subset of a larger dataset obtained from https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews.

In [4]:
reviews = pd.read_csv('data/amazon_reviews.csv')

reviews.head()

Unnamed: 0,sentiment,title,text
0,1,The Gnostic Gospels (Vintage),This is a misrepesentation of the Gospels. It ...
1,1,Christine Feehan sucks,Ok she always starts off good with the tension...
2,1,bad review,The Dvd that amazon sent me only worked one ti...
3,1,Cheap,"This bracelet was missing pearls, and when I e..."
4,1,piece of crap,The ear piece is completely worthless. It is c...


Each review is assigned a sentiment score, where 1 indicates negative and 2 indicates positive.

This subset is equally balanced between positive and negative sentiment.

In [5]:
reviews['sentiment'].value_counts()

1    5000
2    5000
Name: sentiment, dtype: int64

First, let's look at some of the negative reviews.

In [6]:
seed = 123
for statement in reviews.loc[reviews['sentiment'] == 1, 'text'].sample(3, random_state=seed):
    print(statement)
    print('-----------------------------')

I am a trim carpenter I have had three LS1212 saws all three saws have had the same problems. the bevel and miter adjustments start to stick and bind after about two months of use. Makita service is terrible. They have lied to me send saw back not fully adjusted. It took them four weeks to do what should have been a simple fix. It is a nice when it works right but you need a back up saw for when it breaks. Makita has disapointed me. I will never buy another Makita tool.Their products and service are first rate GARBAGE.
-----------------------------
The heater worked great in my 3 gallon eclipse for about 3 days or so. It kept my tank right at 78 degrees which is what I set it to. However, after that bad things started to happen. First, the temperature got to be 80, then 82 and then even when I set to heater to 76F, my tank was 85F and the heater was still not shutting off. I thought it was my thermometer, but it wasn't. I'm just glad it didn't get any hotter or my poor betta would have

And then some of the positive ones.

In [7]:
seed = 123
for statement in reviews.loc[reviews['sentiment'] == 2, 'text'].sample(3, random_state=seed):
    print(statement)
    print('-----------------------------')

Much like it's predecessor Darwinia, Bios is an ambitious combination of hard science fiction and pan-universal spirituality. Wilson writes beautifully; I'm envious. Of particular interest is his treatment of emerging biohazards. Let's hope this book isn't as prophetic as it may seem to be. If you really appreciate literature and science fiction (and especially a combo of the two) you owe it to yourself to read both Bios and Darwinia.
-----------------------------
this epesode told us bout how mac gyver never give up, even the person who he help hate him & insult him at the first he tried to help her.
-----------------------------
I've gotten to know Tatiana and her producer/manager/husband Matthew, over the last few years. In a live concert, from when she plays that first chord on her keyboard and opens her mouth -- shivers run up the spine of everyone in the audience. She's an amazing talent who can hit a perfect pitch with all the power and inspiriation of one of the clearest voices

## Step 1: Naive Bayes Using the Text Field

First, we'll split our data into a train and test set, stratifying on the target variable.
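
As a small illustration of what stratifying does, the sketch below splits a made-up, imbalanced set of labels (80 negative, 20 positive; these toy values are assumptions, not the review data) and counts the classes on each side. With `stratify`, the 80/20 ratio should carry over to both the train and test portions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 negative (1) and 20 positive (2) reviews.
toy_X = list(range(100))
toy_y = [1] * 80 + [2] * 20

# stratify=toy_y asks for the same class proportions in both splits.
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=321, stratify=toy_y
)

train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts)
print(test_counts)
```

In the notebook's data the classes are already balanced, so stratifying mainly guarantees that the split stays exactly balanced rather than approximately so.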

In [189]:
X = reviews[['text']]
y = reviews['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

We need a way to convert the text into predictors. We will start by using a **bag of words** model, which looks at which words are present but disregards word order.
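
To make "disregards word order" concrete, here is a minimal plain-Python sketch (the `bag_of_words` helper and both sentences are made up for illustration). Two sentences with opposite meanings end up with identical bags because they contain the same words.

```python
from collections import Counter

def bag_of_words(text):
    # Hypothetical helper: lowercase, split on whitespace, count each word.
    return Counter(text.lower().split())

bag_a = bag_of_words("the plot was good not bad")
bag_b = bag_of_words("the plot was bad not good")

# Same words, different order: the two bags are indistinguishable.
same_bag = bag_a == bag_b
print(same_bag)
```

This is the main limitation of the approach, and part of the motivation for the n-grams used later in the notebook.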

Let's start with a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

Fill in the code to import the CountVectorizer class.

In [9]:
#######REMOVE
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
# Fill this in to import the CountVectorizer class

Then create a CountVectorizer (with all of the default arguments) named `vect`, fit it to the "text" column of the training data, and transform both the train and test texts.

In [232]:
#######REMOVE
vect = CountVectorizer()

X_train_vec = vect.fit_transform(X_train['text'])
X_test_vec = vect.transform(X_test['text'])

In [191]:
# Fill in the code to fit and transform a CountVectorizer (using all defaults) on the text column of X_train and X_test

vect = ...  # Fill this in

X_train_vec = ...  # Fill this in
X_test_vec = ...  # Fill this in

The CountVectorizer class will take in the text, and **tokenize** it, splitting it into a list of tokens. The built-in tokenizer does some minimal cleaning of the text, and by default the CountVectorizer will convert all text to lowercase.
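
One way to see what the default tokenizer does is to call the vectorizer's analyzer directly on a sentence. The sketch below uses a made-up sentence and a separate vectorizer named `demo_vect` (both are illustrative assumptions). With the default settings, text is lowercased, punctuation is dropped, and tokens shorter than two characters are discarded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that preprocesses and tokenizes text.
demo_vect = CountVectorizer()
analyze = demo_vect.build_analyzer()

tokens = analyze("I LOVED it, didn't you?")
print(tokens)
```

Note that "I" disappears entirely and "didn't" is cut at the apostrophe, which is why tokens such as `didnt` and `ive` in the vocabulary only come from reviews typed without apostrophes.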

If we want to take a look at all of the tokens that the CountVectorizer has seen, we can look at its vocabulary. Check the `vocabulary_` attribute. 

In [None]:
# fill this in

In [233]:
#####REMOVE
vect.vocabulary_

{'this': 25040,
 'was': 27139,
 'probally': 19291,
 'the': 24914,
 '2nd': 337,
 'best': 2873,
 'movie': 16380,
 'ive': 13308,
 'seen': 21971,
 'acting': 868,
 'great': 11048,
 'and': 1508,
 'josh': 13570,
 'jackson': 13333,
 'hotter': 12144,
 'than': 24888,
 'ever': 8845,
 'is': 13247,
 'must': 16523,
 'see': 21954,
 'it': 13280,
 'about': 702,
 'time': 25194,
 'what': 27346,
 'taking': 24524,
 'so': 22976,
 'long': 14807,
 'for': 9970,
 'studios': 23965,
 'to': 25263,
 'release': 20492,
 'these': 24988,
 'rarer': 19972,
 'movies': 16382,
 'also': 1349,
 'recently': 20153,
 'released': 20493,
 'butcher': 3862,
 'boy': 3418,
 'on': 17389,
 'dvd': 7993,
 'from': 10252,
 'one': 17394,
 'format': 10031,
 'vhs': 26747,
 'hd': 11576,
 'now': 17095,
 'coming': 5222,
 'into': 13095,
 'own': 17782,
 'films': 9613,
 'have': 11542,
 'passed': 18034,
 'initial': 12811,
 'decade': 6582,
 'of': 17294,
 'that': 24899,
 'ridiculous': 21009,
 'not': 17043,
 '1930': 168,
 'print': 19258,
 'needs': 16757

**Question:** You should see that the first entry is `'this': 25040`. What does the 25040 represent?

**Question:** How many distinct tokens are in the vocabulary?

In [None]:
# Fill this in to answer the above question.

In [None]:
#####REMOVE
len(vect.vocabulary_)

Now, let's see how often each word appeared across all reviews. To do this, we need to look at X_train_vec.

**Question:** What type of object is X_train_vec?

In [None]:
# Fill this in to answer.

Notice that we can sum this object to get a count per word.

In [133]:
X_train_vec.sum(axis = 0)

matrix([[55, 21,  2, ...,  1,  1,  1]])

It would be nice to be able to view the counts result as a DataFrame. To do this, we need to convert this into a numpy array.

In [134]:
np.array(X_train_vec.sum(axis = 0))

array([[55, 21,  2, ...,  1,  1,  1]])

Also, when we make a DataFrame, we need to pass in these values as a one-dimensional object. For this, we can use the `flatten` method.
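
The conversion steps can be tried on a tiny sparse matrix first. This is a sketch on made-up counts (the `toy` matrix is an assumption), showing that summing a SciPy sparse matrix gives a two-dimensional `numpy.matrix`, which then needs to be converted and flattened.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two "documents" and three "words", stored sparsely like the vectorizer output.
toy = csr_matrix(np.array([[1, 0, 2],
                           [0, 3, 1]]))

col_sums = toy.sum(axis=0)                 # numpy.matrix with shape (1, 3)
flat = np.asarray(col_sums).flatten()      # plain 1-D array with shape (3,)

print(type(col_sums), col_sums.shape)
print(flat, flat.shape)
```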

Make a `word_counts` DataFrame which has a 'words' column containing each word in the vocabulary and a 'frequency' column containing the counts.

**Hint:** Check the methods of the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to see how to access the words.

In [193]:
####REMOVE
word_counts = pd.DataFrame({'words': vect.get_feature_names_out(),
                            'frequency': np.array(X_train_vec.sum(axis = 0)).flatten()})

word_counts.head()

Unnamed: 0,words,frequency
0,0,55
1,0,21
2,7,2
3,1,4
4,1781,1


In [None]:
# Fill this in to build a DataFrame of words and their counts
word_counts = pd.DataFrame({
    'words': ...,      # Fill this in
    'frequency': ...   # Fill this in
})

word_counts.head()

Which word appears most frequently?

In [None]:
# Fill this in

In [195]:
word_counts.sort_values('frequency', ascending = False)

Unnamed: 0,words,frequency
24914,the,28900
1508,and,15664
25263,to,14087
13280,it,13053
17294,of,11109
...,...,...
12588,incidentally,1
12583,incestous,1
12582,incest,1
12581,incessant,1


Now, let's fit a MultinomialNB model to our word count vectors.

In [197]:
from sklearn.naive_bayes import MultinomialNB

In [198]:
nb = MultinomialNB().fit(X_train_vec, y_train)

y_pred = nb.predict(X_test_vec)

In [199]:
accuracy_score(y_test, y_pred)

0.786

In [200]:
confusion_matrix(y_test, y_pred)

array([[1031,  219],
       [ 316,  934]])

**Question:** How do the estimated probabilities $P(\text{great} \mid \text{positive})$ and $P(\text{great} \mid \text{negative})$ compare? How about $P(\text{disappointed} \mid \text{positive})$ and $P(\text{disappointed} \mid \text{negative})$?

Hint: You can access the estimated (log) probabilities via the `feature_log_prob_` attribute of your model.
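
It may help to see where these estimates come from. The sketch below fits `MultinomialNB` on a made-up count matrix (the toy data and variable names are assumptions) and checks that, with the default `alpha=1.0`, each estimated probability is the smoothed share of a class's total word count: (count of the word in the class + alpha) / (total words in the class + alpha × vocabulary size).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 4 documents (rows) and 3 vocabulary words (columns).
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 1, 0]])
y_toy = np.array([1, 1, 2, 2])

nb_toy = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)

# Manual estimate for class 1 (row 0 of feature_log_prob_, since classes_ is sorted).
class_counts = X_toy[y_toy == 1].sum(axis=0)
manual = (class_counts + 1.0) / (class_counts.sum() + 1.0 * X_toy.shape[1])

learned = np.exp(nb_toy.feature_log_prob_[0])
print(manual)
print(learned)
```

Because every row of `feature_log_prob_` is a probability distribution over the whole vocabulary, individual values are small; comparing the same word across the two classes is more informative than the raw magnitudes.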

In [188]:
# Your code here.

In [234]:
### REMOVE
np.exp(nb.feature_log_prob_)[:, vect.vocabulary_['great']]

array([0.00133935, 0.00444937])

In [235]:
np.exp(nb.feature_log_prob_)[:, vect.vocabulary_['disappointed']]

array([0.00076017, 0.00022389])

## Logistic Regression for Text Classification

Fill in the code to fit a LogisticRegression model.

What accuracy score do you obtain? How does the confusion matrix look?

In [201]:
from sklearn.linear_model import LogisticRegression

In [202]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

accuracy_score(y_test, y_pred)

0.8068

In [203]:
confusion_matrix(y_test, y_pred)

array([[1003,  247],
       [ 236, 1014]])

### Explaining the Model

There are two lenses through which we can try to understand how the model is making predictions: (global) feature importances and feature importances for single predictions.

We'll start by looking at global explanations. Since we're using a logistic regression model, we can look at the coefficients.
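
The reason coefficients are interpretable is that logistic regression is linear in the log-odds, so one extra occurrence of a word multiplies the predicted odds by `exp(coefficient)`. The sketch below checks this on made-up data (the toy features, labels, and the `odds` helper are illustrative assumptions, unrelated to the reviews).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: two count-like features, label driven mostly by the first one.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 3, size=(200, 2))
y_toy = (X_toy[:, 0] + rng.normal(size=200) > 1).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

def odds(x):
    # Hypothetical helper: predicted odds p / (1 - p) of the positive class.
    p = toy_model.predict_proba(np.array([x]))[0, 1]
    return p / (1 - p)

# Increase the first feature by one while holding the second fixed.
odds_ratio = odds([2, 1]) / odds([1, 1])
expected = np.exp(toy_model.coef_[0, 0])
print(odds_ratio, expected)
```

The two printed numbers should agree up to floating point error, whatever the fitted coefficient turns out to be.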

In [204]:
coefficients = pd.DataFrame({
    'word': vect.get_feature_names_out(),
    'coefficient': logreg.coef_[0]
})

coefficients.sort_values('coefficient').head(10)

Unnamed: 0,word,coefficient
27722,worst,-2.631002
3333,boring,-2.437989
27155,waste,-2.110624
18869,poor,-2.107917
18870,poorly,-1.827457
7270,disappointed,-1.768856
20875,returned,-1.66478
7124,didnt,-1.651214
7273,disappointment,-1.550021
24839,terrible,-1.531603


In [205]:
coefficients.sort_values('coefficient', ascending = False).head(10)

Unnamed: 0,word,coefficient
8917,excellent,1.907333
9329,fantastic,1.638024
2297,awesome,1.615352
11855,highly,1.527134
18255,perfect,1.475142
11048,great,1.383717
14892,love,1.379172
1399,amazing,1.37436
5300,compare,1.369906
18701,pleased,1.349113


**Question:** By what factor do we multiply the estimated odds of a review being positive when we see the word "great"? What about the word "garbage"? What about the word "the"?

In [210]:
# Your code here

In [213]:
### REMOVE
np.exp(coefficients.set_index('word').loc['great'])

coefficient    3.989705
Name: great, dtype: float64

In [236]:
## REMOVE
np.exp(coefficients.set_index('word').loc['garbage'])

coefficient    0.513755
Name: garbage, dtype: float64

In [237]:
np.exp(coefficients.set_index('word').loc['the'])

coefficient    0.985293
Name: the, dtype: float64

Here's a helper function so that you can see what contributes to individual predictions.

In [238]:
def get_most_important_features(text, vectorizer, model):
    weight = vectorizer.transform(text).toarray().flatten() * model.coef_.flatten()
    weights = pd.DataFrame({
        'word': vectorizer.get_feature_names_out(),
        'weight': weight
    })
    return weights[weights['weight'] != 0].sort_values('weight', ascending = False)

In [222]:
i = 0

text = X_test.iloc[i]
print(text['text'])

print('Predicted Probability of Positive: {}'.format(logreg.predict_proba(X_test_vec[i].reshape(1,-1))[0,1]))
print('True Label: {}'.format(y_test.iloc[i]))

get_most_important_features(text, vect, logreg)

I believe almost everything in this book is true. I think the assumptions are correct, the assessments realistic and true. But, its impossible to accept at face value, because 100 pages of footnotes marked confidential do not a proof make. By the very nature of the subject, there is subterfuge and deception, secrecy and deceit. In the end, you have to believe a good part of this book without seeing all the footprints. And then, its scary as hell.The holy trinity of oil, prejudice and profits mean that this world won't change any time soon.
Predicted Probability of Positive: 0.8997689621304078
True Label: 2


Unnamed: 0,word,weight
1508,and,1.035374
21703,scary,0.641645
10866,good,0.629234
13300,its,0.430151
24914,the,0.420726
...,...,...
12526,impossible,-0.524243
2780,believe,-0.527439
8466,end,-0.588499
17857,pages,-0.903993


Let's look at some examples that are incorrectly classified.

In [172]:
np.where((y_test == 2) & (y_pred == 1))[0][:10]

array([ 26,  36,  47,  53,  54,  60,  62,  65,  87, 106])

In [223]:
i = 26

text = X_test.iloc[i]
print(text['text'])

print('Predicted Probability of Positive: {}'.format(logreg.predict_proba(X_test_vec[i].reshape(1,-1))[0,1]))
print('True Label: {}'.format(y_test.iloc[i]))

get_most_important_features(text, vect, logreg)

this epesode told us bout how mac gyver never give up, even the person who he help hate him & insult him at the first he tried to help her.
Predicted Probability of Positive: 0.02302098289037401
True Label: 2


Unnamed: 0,word,weight
26467,us,0.490733
12167,how,0.23547
3396,bout,0.172709
27412,who,0.14208
24914,the,0.120208
11585,he,0.082552
25305,told,-0.02366
25040,this,-0.027992
2087,at,-0.06548
11761,her,-0.086021


### Adding $n$-grams

Notice that the vectorizer we are currently using only looks at words individually and does not consider their order. We can include combinations of adjacent words using n-grams.
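
As a quick illustration of what a bigram is, the plain-Python sketch below pairs each token with the one that follows it (the sentence is made up). A phrase such as "not good" becomes a feature of its own instead of being split into two unrelated words.

```python
tokens = "this was not good".split()

unigrams = tokens
# zip pairs each token with its successor, giving the adjacent two-word sequences.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(unigrams)
print(bigrams)
```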

Create a new vectorizer that includes both unigrams and bigrams.

In [None]:
vect = CountVectorizer(
    # Fill this in so that the model uses both unigrams and bigrams
)

X_train_vec = vect.fit_transform(X_train['text'])
X_test_vec = vect.transform(X_test['text'])

In [239]:
### REMOVE
vect = CountVectorizer(ngram_range = (1,2))

X_train_vec = vect.fit_transform(X_train['text'])
X_test_vec = vect.transform(X_test['text'])

How large is the vocabulary when you use unigrams and bigrams?

In [240]:
len(vect.vocabulary_)

257212

In [241]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

confusion_matrix(y_test, y_pred)

array([[1033,  217],
       [ 210, 1040]])

In [250]:
coefficients = pd.DataFrame({
    'word': vect.get_feature_names_out(),
    'coefficient': logreg.coef_[0]
})

coefficients.sort_values('coefficient', ascending = True).head(25)

Unnamed: 0,word,coefficient
33917,boring,-1.740628
167603,poor,-1.380055
60754,disappointed,-1.324677
243298,waste,-1.321843
253000,worst,-1.319185
146173,not,-1.115436
210688,terrible,-1.112718
24465,bad,-1.085593
147438,nothing,-1.075412
167710,poorly,-1.062365


**Question:** By what factor do we multiply predicted odds of a review being positive if we see the phrase "love this"? What about "very disappointed"?

In [252]:
# Fill this in

In [251]:
np.exp(coefficients.set_index("word").loc['very disappointed'])

coefficient    0.399618
Name: very disappointed, dtype: float64

In [248]:
np.exp(coefficients.set_index("word").loc['love this'])

coefficient    1.997428
Name: love this, dtype: float64

### TFIDF

Instead of using raw counts, we can use a tfidf vectorizer. This will scale down the weights for frequently occurring words.

The acronym "tfidf" stands for "term frequency inverse document frequency". This type of vectorizer takes into account the number of times a word occurs in a document but then scales that count down according to the number of documents the word appears in.
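
For a concrete view of the weighting, the sketch below recomputes tfidf values by hand on a made-up corpus and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings, where the inverse document frequency is `ln((1 + n_docs) / (1 + df)) + 1` and each document's vector is then scaled to unit length; other tfidf variants use slightly different formulas.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the movie was great", "the book was boring", "the plot was the problem"]

# Raw term counts per document.
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents does each word appear?
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit (L2) length.
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, from_sklearn))
```

Words such as "the" and "was" appear in every document, so they receive the smallest possible idf, while words unique to one document receive the largest.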

In [71]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [72]:
vect = TfidfVectorizer()

X_train_vec = vect.fit_transform(X_train['text'])
X_test_vec = vect.transform(X_test['text'])

In [73]:
logreg = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = logreg.predict(X_test_vec)

confusion_matrix(y_test, y_pred)

array([[1030,  220],
       [ 224, 1026]])

Both the CountVectorizer and TfidfVectorizer have additional parameters that change how they treat tokens or that remove certain tokens altogether.

Look at the `min_df` and `max_df` parameters. Experiment with these (or any other parameters you like) and see whether the model's performance changes.

Try out some different values for these parameters. Which combination does best?

In [230]:
# Your Code Here

In [80]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [81]:
classification_pipe = Pipeline(steps = [
    ('vectorize', TfidfVectorizer()),
    ('logreg', LogisticRegression(max_iter = 10000))
])

In [96]:
param_grid = {
    'vectorize__min_df': [1, 5, 10, 25],
    'vectorize__max_df': [1.0, 0.75, 0.5],
    'vectorize__ngram_range': [(1,1), (1,2), (1,3)]
}

In [97]:
grid = GridSearchCV(estimator = classification_pipe,
                    param_grid = param_grid, verbose = 2, n_jobs = -1)

grid.fit(X_train['text'], y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


GridSearchCV(estimator=Pipeline(steps=[('vectorize', TfidfVectorizer()),
                                       ('logreg',
                                        LogisticRegression(max_iter=10000))]),
             n_jobs=-1,
             param_grid={'vectorize__max_df': [1.0, 0.75, 0.5],
                         'vectorize__min_df': [1, 5, 10, 25],
                         'vectorize__ngram_range': [(1, 1), (1, 2), (1, 3)]},
             verbose=2)

In [98]:
grid.best_params_

{'vectorize__max_df': 0.75,
 'vectorize__min_df': 5,
 'vectorize__ngram_range': (1, 2)}

In [99]:
y_pred = grid.predict(X_test['text'])

confusion_matrix(y_test, y_pred)

array([[1047,  203],
       [ 215, 1035]])


## Bonus: Exploring Other Potential Features

**Question:** Does the length of a review seem to be related to its sentiment? That is, do longer (or shorter) reviews tend to have a more positive sentiment?

In [None]:
# Your Code here

Create a "length" column that contains the number of characters in the review text. Also, create a "num_words" column that counts the number of words and a "num_cap" that counts the number of capital letters in the text of a review. 

How well do these features work for predicting whether a review is positive or negative?

In [None]:
X_train['text'].str.len().groupby(y_train).describe()

In [None]:
X_train['length'] = X_train['text'].str.len()
X_test['length'] = X_test['text'].str.len()

In [None]:
X_train['num_words'] = X_train['text'].str.split().str.len()
X_test['num_words'] = X_test['text'].str.split().str.len()

What about the number of capital letters?

In [None]:
X_train['num_cap'] = X_train['text'].str.count(r'[A-Z]')
X_test['num_cap'] = X_test['text'].str.count(r'[A-Z]')

In [None]:
X_train.groupby(y_train)['num_cap'].describe()

We can even use pretrained models to help us. For example, nltk includes tools for determining sentiment, like the VADER tool. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. 
 
This outputs a dictionary of scores:
* neg: proportion of words that are negative
* neu: proportion of words that are neutral
* pos: proportion of words that are positive
* compound: computed by summing the valence scores of each word, normalized to be between -1 and +1

In [254]:
# Requires the VADER lexicon; if it is missing, run nltk.download('vader_lexicon') once.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [255]:
sent_analyzer = SentimentIntensityAnalyzer()

In [256]:
i = 0
print(X_train.iloc[i,0])
print(sent_analyzer.polarity_scores(X_train.iloc[i,0]))
print(y_train.iloc[i])

This was probally the 2nd best movie ive seen. The acting was great and Josh Jackson was hotter than ever. This is a must see movie.
{'neg': 0.0, 'neu': 0.735, 'pos': 0.265, 'compound': 0.8519}
2


Create new columns to hold the "neg", "neu", "pos" and "compound" scores for each review. 

In [None]:
vader_train = X_train['text'].apply(lambda x: sent_analyzer.polarity_scores(x)).apply(pd.Series)
vader_train

In [None]:
X_train = pd.concat([X_train, vader_train], axis = 1)

In [None]:
vader_test = X_test['text'].apply(lambda x: sent_analyzer.polarity_scores(x)).apply(pd.Series)
X_test = pd.concat([X_test, vader_test], axis = 1)

In [None]:
clf = LogisticRegression(max_iter = 10000).fit(X_train.drop(columns = 'text'), y_train)

y_pred = clf.predict(X_test.drop(columns = 'text'))

confusion_matrix(y_test, y_pred)

In [None]:
coefficients = pd.DataFrame({
    'feature': clf.feature_names_in_,
    'coefficient': clf.coef_[0]
})

coefficients.sort_values('coefficient', ascending = False).head(10)

Finally, concatenate your word counts and the new features together. How well do all of these together do at predicting positive or negative sentiment?

In [None]:
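# Label the columns with get_feature_names_out(), which follows the column order of the matrix.
# The keys of vocabulary_ are in first-seen order and would mislabel the columns.
X_train_full = pd.concat([X_train.drop(columns = 'text'),
           pd.DataFrame(X_train_vec.toarray(), columns = vect.get_feature_names_out()).set_index(X_train.index)],
          axis = 1)

X_test_full = pd.concat([X_test.drop(columns = 'text'),
           pd.DataFrame(X_test_vec.toarray(), columns = vect.get_feature_names_out()).set_index(X_test.index)],
          axis = 1)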

In [None]:
clf = LogisticRegression(max_iter = 10000).fit(X_train_full, y_train)

y_pred = clf.predict(X_test_full)

confusion_matrix(y_test, y_pred)

In [None]:
coefficients = pd.DataFrame({
    'word': clf.feature_names_in_,
    'coefficient': clf.coef_[0]
})

coefficients.sort_values('coefficient', ascending = False).head(10)

## Bonus Bonus - Using the Title and Text

There are a couple of ways that you could incorporate both the title and text.

**Option 1:** Concatenate together the title and the text into a single field.

**Option 2:** Use a separate vectorizer for the title and text and concatenate the results.

Try both options. Which gives a better result?

In [None]:
X = reviews['title'] + ' ' + reviews['text']
X = pd.DataFrame(X, columns = ['text'])
X

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [None]:
vect = TfidfVectorizer()

X_train_vec = vect.fit_transform(X_train['text'])
X_test_vec = vect.transform(X_test['text'])

In [None]:
clf = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = clf.predict(X_test_vec)

confusion_matrix(y_test, y_pred)

**Option 2:** Vectorize separately and concatenate the results.

In [None]:
X = reviews[['title', 'text']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [None]:
title_vect = TfidfVectorizer()

X_train_vec_title = title_vect.fit_transform(X_train['title'])
X_test_vec_title = title_vect.transform(X_test['title'])


text_vect = TfidfVectorizer()

X_train_vec_text = text_vect.fit_transform(X_train['text'])
X_test_vec_text = text_vect.transform(X_test['text'])

In [None]:
from scipy.sparse import hstack

In [None]:
X_train_vec = hstack([X_train_vec_title, X_train_vec_text])
X_test_vec = hstack([X_test_vec_title, X_test_vec_text])

In [None]:
clf = LogisticRegression(max_iter = 10000).fit(X_train_vec, y_train)

y_pred = clf.predict(X_test_vec)

confusion_matrix(y_test, y_pred)

In [None]:
X_train['length'] = X_train['text'].str.len()
X_test['length'] = X_test['text'].str.len()

X_train['num_words'] = X_train['text'].str.split().str.len()
X_test['num_words'] = X_test['text'].str.split().str.len()

X_train['num_cap'] = X_train['text'].str.count(r'[A-Z]')
X_test['num_cap'] = X_test['text'].str.count(r'[A-Z]')


vader_train = X_train['text'].apply(lambda x: sent_analyzer.polarity_scores(x)).apply(pd.Series).rename(columns = lambda x: 'text_' + x)
X_train = pd.concat([X_train, vader_train], axis = 1)
vader_test = X_test['text'].apply(lambda x: sent_analyzer.polarity_scores(x)).apply(pd.Series).rename(columns = lambda x: 'text_' + x)
X_test = pd.concat([X_test, vader_test], axis = 1)

vader_train = X_train['title'].apply(lambda x: sent_analyzer.polarity_scores(x)).apply(pd.Series).rename(columns = lambda x: 'title_' + x)
X_train = pd.concat([X_train, vader_train], axis = 1)
vader_test = X_test['title'].apply(lambda x: sent_analyzer.polarity_scores(x)).apply(pd.Series).rename(columns = lambda x: 'title_' + x)
X_test = pd.concat([X_test, vader_test], axis = 1)


In [None]:
X_test

In [None]:
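# Column labels must follow the column order of the stacked matrix, so use
# get_feature_names_out() (the keys of vocabulary_ are in first-seen order).
# The prefixes keep a word that occurs in both the title and the text distinct.
vec_columns = (['title__' + word for word in title_vect.get_feature_names_out()]
               + ['text__' + word for word in text_vect.get_feature_names_out()])

X_train_full = pd.concat([X_train.drop(columns = ['text', 'title']),
           pd.DataFrame(X_train_vec.toarray(), columns = vec_columns).set_index(X_train.index)],
          axis = 1)

X_test_full = pd.concat([X_test.drop(columns = ['text', 'title']),
           pd.DataFrame(X_test_vec.toarray(), columns = vec_columns).set_index(X_test.index)],
          axis = 1)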

In [None]:
clf = LogisticRegression(max_iter = 10000).fit(X_train_full, y_train)

y_pred = clf.predict(X_test_full)

confusion_matrix(y_test, y_pred)