In [103]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from data_sci.fastai.nlp import *
from sklearn.linear_model import LogisticRegression

## IMDB dataset and the sentiment classification task

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB, with an even number of positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`gunzip aclImdb_v1.tar.gz`

`tar -xvf aclImdb_v1.tar`

### Tokenizing and term document matrix creation

In [2]:
# PATH='data/aclImdb/'
PATH='/data/msnow/data_science/imdb/aclImdb/'
names = ['neg','pos']

In [3]:
%ls {PATH}

imdbEr.txt  imdb.vocab  README  test/  train/


In [4]:
%ls {PATH}train

labeledBow.feat  pos/    unsupBow.feat  urls_pos.txt
neg/             unsup/  urls_neg.txt   urls_unsup.txt


In [5]:
%ls {PATH}train/pos | head

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt
ls: write error


In [10]:
??texts_labels_from_folders

In [7]:
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)

Here is the text of the first review

In [8]:
trn[0]

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, ev

In [11]:
trn_y[0]

0
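The loader `texts_labels_from_folders` comes from the course library, and its source is not reproduced here. As a rough sketch of what such a loader might do (the function name `load_texts_labels` and every detail below are assumptions, not the library's actual implementation), each subfolder named in `names` is read in order and its index becomes the label, which is why a review from the first folder, `neg`, gets label 0:

```python
from pathlib import Path

def load_texts_labels(root, names):
    """Hypothetical stand-in for a folder-based loader: one subfolder per class."""
    texts, labels = [], []
    for idx, name in enumerate(names):
        # Sort for a reproducible order; the real loader may order files differently.
        for fname in sorted(Path(root, name).glob('*.txt')):
            texts.append(fname.read_text(encoding='utf-8'))
            labels.append(idx)  # the label is the position of the folder in `names`
    return texts, labels
```

Under these assumptions, `load_texts_labels(f'{PATH}train', ['neg', 'pos'])` would return the review texts together with labels 0 for `neg` and 1 for `pos`.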

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts (part of `sklearn.feature_extraction.text`).

In [12]:
veczr = CountVectorizer(tokenizer=tokenize)

`fit_transform(trn)` finds the vocabulary in the training set and transforms the training set into a term-document matrix. Since we have to apply the *same transformation* to the validation set, the second line uses just the method `transform(val)`. `trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document i and contains the count of each vocabulary word in that document.

In [13]:
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
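
To make the difference between `fit_transform` and `transform` concrete, here is a small pure-Python sketch. It is not how scikit-learn implements `CountVectorizer` (which builds sparse matrices and uses a configurable tokenizer); the helper names `fit_vocab` and `count_rows` and the whitespace tokenizer are illustrative assumptions.

```python
from collections import Counter

def simple_tokenize(text):
    # Assumption: lowercase + whitespace split stands in for the real tokenizer.
    return text.lower().split()

def fit_vocab(docs):
    # "fit": learn a word -> column index mapping from the training documents only.
    words = sorted({w for d in docs for w in simple_tokenize(d)})
    return {w: i for i, w in enumerate(words)}

def count_rows(docs, vocab):
    # "transform": one row of counts per document; out-of-vocabulary words are dropped.
    rows = []
    for d in docs:
        counts = Counter(simple_tokenize(d))
        rows.append([counts.get(w, 0) for w in vocab])
    return rows

train_docs = ['this movie was great', 'bad bad movie']
valid_docs = ['great movie , terrible sequel']

vocab = fit_vocab(train_docs)               # vocabulary comes from the training data
train_rows = count_rows(train_docs, vocab)  # [[0, 1, 1, 1, 1], [2, 0, 1, 0, 0]]
valid_rows = count_rows(valid_docs, vocab)  # [[0, 1, 1, 0, 0]]: same columns, unseen words ignored
```

The validation row has exactly the same columns as the training rows, which is what lets a model trained on one be applied to the other.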

In [14]:
trn_term_doc

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

In [15]:
trn_term_doc[0]

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 189 stored elements in Compressed Sparse Row format>

In [16]:
vocab = veczr.get_feature_names(); vocab[5000:5005]

['aussie', 'aussies', 'austen', 'austeniana', 'austens']

In [17]:
w0 = set([o.lower() for o in trn[0].split(' ')]); w0

{'"controversial"',
 '(no',
 '/><br',
 '/>i',
 '/>the',
 '/>what',
 '1967.',
 '40',
 'a',
 'about',
 'ago,',
 'all',
 'also',
 'am',
 'america.',
 'and',
 'answer',
 'any',
 'anyone',
 'are',
 'arguably',
 'around',
 'artistic',
 'as',
 'asking',
 'at',
 'attentions',
 'average',
 'be',
 'because',
 'being',
 'bergman,',
 'between',
 'between,',
 'boy',
 'but',
 'by',
 'can',
 'centered',
 'certain',
 'cheaply',
 'cinema.',
 'classmates,',
 'commend',
 'considered',
 'controversy',
 'country,',
 'countrymen',
 'curious-yellow',
 'customs',
 'denizens',
 'do',
 'documentary',
 "doesn't",
 'drama',
 'enter',
 'even',
 'ever',
 'everything',
 'fact',
 'fan',
 'far',
 'few',
 'film',
 'filmmakers',
 'films',
 'films.<br',
 'find',
 'first',
 'focus',
 'for',
 'ford,',
 'from',
 'good',
 'had',
 'has',
 'have',
 'heard',
 'her',
 'his',
 'i',
 'if',
 'in',
 'ingmar',
 'intended)',
 'is',
 'issues',
 'it',
 "it's",
 'john',
 'just',
 'kills',
 'learn',
 'lena',
 'life.',
 'like',
 'made',
 '

In [18]:
len(w0)

185

In [22]:
veczr.vocabulary_['really']

53936

In [25]:
trn_term_doc[0,53936]

3

In [24]:
trn_term_doc[0,5000]

0

## Naive Bayes

### Theory break

The following cell builds a Markdown-style table from a pandas DataFrame, so that a small example term-document matrix can be shown in the text.

In [63]:
tmp = []
tmp.append('This movie was great')
tmp.append('I liked this movie')
tmp.append('This is the worst movie ever')
tmp.append('Bad movie')
veczr_tmp = CountVectorizer(tokenizer=tokenize)
tmp_fit = veczr_tmp.fit_transform(tmp)
df = pd.DataFrame(tmp_fit.toarray(),columns=veczr_tmp.get_feature_names(),index=tmp)

from tabulate import tabulate
print(tabulate(df, headers=df.columns,tablefmt="pipe"))

|                              |   bad |   ever |   great |   i |   is |   liked |   movie |   the |   this |   was |   worst |
|:-----------------------------|------:|-------:|--------:|----:|-----:|--------:|--------:|------:|-------:|------:|--------:|
| This movie was great         |     0 |      0 |       1 |   0 |    0 |       0 |       1 |     0 |      1 |     1 |       0 |
| I liked this movie           |     0 |      0 |       0 |   1 |    0 |       1 |       1 |     0 |      1 |     0 |       0 |
| This is the worst movie ever |     0 |      1 |       0 |   0 |    1 |       0 |       1 |     1 |      1 |     0 |       1 |
| Bad movie                    |     1 |      0 |       0 |   0 |    0 |       0 |       1 |     0 |      0 |     0 |       0 |


In general, I want to know, given a specific document (which in our case is a review), whether it is a positive or negative review, class 0 or class 1, respectively. (This toy convention is the reverse of the IMDB labels loaded above, where `neg` is 0 and `pos` is 1.) Using Bayes' theorem I can determine the probability that a specific document belongs to a certain class, e.g., $p\left(c=1\mid d\right)$.

$$ p\left(c=1 \mid d\right) = \dfrac{p\left(d\mid c=1\right) p\left(c=1\right)}{p\left(d\right)} $$

Let's take this one step further before trying to solve, as it will make the math easier.  I don't really care about the exact probability of a review being positive or negative; I just want to know whether it's more likely to be positive or negative.  I can extract this information by taking the ratio of the conditional probabilities.

$$\dfrac{p\left(c=1 \mid d\right)}{p\left(c=0 \mid d\right)} $$

If the result is greater than 1, the review is more likely to belong to class 1, i.e., negative, and if the result is less than 1, the review is more likely to belong to class 0, i.e., positive.

\begin{align} 
\dfrac{p\left(c=1 \mid d\right)}{p\left(c=0 \mid d\right)} & = \dfrac{p\left(d\mid c=1\right) p\left(c=1\right)}{p\left(d\right)} \dfrac{p\left(d\right)}{p\left(d\mid c=0\right) p\left(c=0\right)} \\
& = \dfrac{p\left(d\mid c=1\right) p\left(c=1\right)}{p\left(d\mid c=0\right) p\left(c=0\right)} \\
\end{align}

Let's go through each of these terms in the context of the four sample reviews in the following term document matrix.



|                              |   bad |   ever |   great |   i |   is |   liked |   movie |   the |   this |   was |   worst |
|:-----------------------------|------:|-------:|--------:|----:|-----:|--------:|--------:|------:|-------:|------:|--------:|
| This movie was great         |     0 |      0 |       1 |   0 |    0 |       0 |       1 |     0 |      1 |     1 |       0 |
| I liked this movie           |     0 |      0 |       0 |   1 |    0 |       1 |       1 |     0 |      1 |     0 |       0 |
| This is the worst movie ever |     0 |      1 |       0 |   0 |    1 |       0 |       1 |     1 |      1 |     0 |       1 |
| Bad movie                    |     1 |      0 |       0 |   0 |    0 |       0 |       1 |     0 |      0 |     0 |       0 |

$p\left(c=C\right)$ is simply the probability of a document being class 0 or 1.  This is just the number of documents in each class divided by the total number of documents.

\begin{align}
p\left(c=0\right) &=2/4 = 0.5 \\
p\left(c=1\right) &= 2/4 = 0.5 \\
\end{align}

$p\left(d\mid c=C\right)$ is the probability of seeing this document given a specific class, $C$.  Since the document is just the words (or, in NLP speak, the features) that make it up, we can rewrite this term as $p\left(f_0,f_1,\ldots,f_p\mid c=C\right)$.  For example, for the first review and class 0:

$$ p\left(d_0\mid c=0\right) = p\left(f_8, f_6, f_9, f_2\mid c=0\right)$$

Here is where the Naive part of Naive Bayes comes in.  In Naive Bayes we assume that all features are conditionally independent, which means that I can rewrite the previous equation as 

$$ p\left(f_8, f_6, f_9, f_2\mid c=C\right) = p\left(f_8 \mid c=C\right) \times p\left(f_6 \mid c=C\right) \times p\left(f_9 \mid c=C\right) \times p\left(f_2 \mid c=C\right) = \prod\limits_{i=8,6,9,2}p\left(f_i \mid c=C\right) $$

Going back to our problem, we can now calculate $p\left(f_i \mid c=0\right)$ for each feature in $d_0$ as the number of documents in class 0 that contain the feature divided by the number of documents in class 0. Multiplying these together gives $p\left(d_0 \mid c=0\right)$.

$$p\left(f_{this}\mid c=0\right) = 2/2 = 1$$

$$p\left(f_{movie}\mid c=0\right) = 2/2 = 1$$

$$p\left(f_{was}\mid c=0\right) = 1/2 = 0.5$$

$$p\left(f_{great}\mid c=0\right) = 1/2 = 0.5$$

$$p\left(d_0 \mid c=0\right) = 1\times 1 \times 0.5 \times 0.5 = 0.25 $$
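
The same unsmoothed numbers can be reproduced with a few lines of Python. This is a sketch over the four toy reviews only; the `docs` and `labels` variables and the helper `p_feature` are illustrative names, and the labels follow the toy convention used in the calculation (class 0 for the two positive reviews, class 1 for the two negative ones).

```python
from fractions import Fraction

# Toy corpus from the table; in this example 0 = positive, 1 = negative.
docs = [
    'this movie was great',
    'i liked this movie',
    'this is the worst movie ever',
    'bad movie',
]
labels = [0, 0, 1, 1]

def p_feature(word, cls):
    # Fraction of documents in class `cls` that contain `word` (no smoothing yet).
    in_class = [d.split() for d, c in zip(docs, labels) if c == cls]
    return Fraction(sum(word in d for d in in_class), len(in_class))

# Class priors: documents in the class divided by all documents.
prior = {c: Fraction(labels.count(c), len(labels)) for c in (0, 1)}

# Naive Bayes: multiply the per-feature probabilities of the first review.
p_d0_c0 = Fraction(1)
for word in docs[0].split():
    p_d0_c0 *= p_feature(word, 0)

print(prior[0], prior[1], p_d0_c0)  # 1/2 1/2 1/4
```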

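If we repeat the same procedure for the other class, we run into a problem: what happens if a feature never appears in that class?

$$p\left(f_{great}\mid c=1\right) = 0/2 = 0$$

A single zero factor forces the whole product $p\left(d_0 \mid c=1\right)$ to zero, which makes the ratio between the two classes either zero or undefined.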

To get around this problem we add an additional row to our term document matrix which contains a 1 in every entry.  Intuitively, this row represents the idea that there is never a zero percent chance of some word appearing.  It might be infinitesimal, but it is greater than zero. The row of ones is counted in both classes, so every feature count and each class's document count goes up by 1. It is used just for calculating $p\left(d \mid c=C\right)$.

|                              |   bad |   ever |   great |   i |   is |   liked |   movie |   the |   this |   was |   worst |
|:-----------------------------|------:|-------:|--------:|----:|-----:|--------:|--------:|------:|-------:|------:|--------:|
| This movie was great         |     0 |      0 |       1 |   0 |    0 |       0 |       1 |     0 |      1 |     1 |       0 |
| I liked this movie           |     0 |      0 |       0 |   1 |    0 |       1 |       1 |     0 |      1 |     0 |       0 |
| This is the worst movie ever |     0 |      1 |       0 |   0 |    1 |       0 |       1 |     1 |      1 |     0 |       1 |
| Bad movie                    |     1 |      0 |       0 |   0 |    0 |       0 |       1 |     0 |      0 |     0 |       0 |
| **ones**                     |     1 |      1 |       1 |   1 |    1 |       1 |       1 |     1 |      1 |     1 |       1 |

We can now recalculate the probabilities:

$$p\left(f_{this}\mid c=0\right) = (2+1)/3 = 1$$

$$p\left(f_{movie}\mid c=0\right) = (2+1)/3 = 1$$

$$p\left(f_{was}\mid c=0\right) = (1+1)/3 = 0.667$$

$$p\left(f_{great}\mid c=0\right) = (1+1)/3 = 0.667$$

$$p\left(d_0 \mid c=0\right) = 1\times 1 \times 0.667 \times 0.667 = 0.444 $$

Repeat for the other class

$$p\left(f_{this}\mid c=1\right) = (1+1)/3 = 0.667$$

$$p\left(f_{movie}\mid c=1\right) = (2+1)/3 = 1$$

$$p\left(f_{was}\mid c=1\right) = (0+1)/3 = 0.333$$

$$p\left(f_{great}\mid c=1\right) = (0+1)/3 = 0.333$$

$$p\left(d_0 \mid c=1\right) = 0.667 \times 1 \times 0.333 \times 0.333 = 0.074 $$

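Now, to answer our original question, we just need to take the ratio of these two probabilities. Because the priors are equal, $p\left(c=0\right) = p\left(c=1\right) = 0.5$, they cancel, and the ratio of the likelihoods equals the ratio of the posteriors.

$$ \dfrac{p\left(d_0 \mid c=0\right)}{p\left(d_0 \mid c=1\right)} = \dfrac{0.444}{0.074} = 6$$

This tells us that the first review, $d_0$, is 6 times more likely to belong to class 0 than to class 1.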
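
The smoothed calculation can be checked the same way. In this sketch (toy data and helper names are illustrative assumptions), adding the row of ones amounts to adding 1 to each feature count and 1 to the number of documents in each class:

```python
from fractions import Fraction

# Toy corpus from the table; in this example 0 = positive, 1 = negative.
docs = [
    'this movie was great',
    'i liked this movie',
    'this is the worst movie ever',
    'bad movie',
]
labels = [0, 0, 1, 1]

def p_feature_smoothed(word, cls):
    # (documents in class containing word + 1) / (documents in class + 1)
    in_class = [d.split() for d, c in zip(docs, labels) if c == cls]
    return Fraction(sum(word in d for d in in_class) + 1, len(in_class) + 1)

def p_doc(doc, cls):
    # Naive Bayes likelihood: product of the smoothed per-feature probabilities.
    p = Fraction(1)
    for word in doc.split():
        p *= p_feature_smoothed(word, cls)
    return p

p0 = p_doc(docs[0], 0)  # 4/9, about 0.444
p1 = p_doc(docs[0], 1)  # 2/27, about 0.074
ratio = p0 / p1         # exactly 6
print(p0, p1, ratio)    # 4/9 2/27 6
```

Using exact fractions shows that the ratio is exactly 6 rather than a rounding coincidence.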

As an aside: without the naive independence assumption, $p\left(f_8, f_6, f_9, f_2\mid c=0\right)$ would have to be expanded with the chain rule into terms that are much harder to estimate:

$$ p\left(f_8, f_6, f_9, f_2\mid c=0\right) = p\left(f_8 \mid c=0\right) \times p\left(f_6\mid f_8, c=0\right) \times p\left(f_9\mid f_8, f_6, c=0\right) \times p\left(f_2\mid f_8, f_6, f_9, c=0\right) $$

### Theory break over

As stated above, the probability of each feature appearing in a document of a specific class is just the ratio of the number of times that feature appears in that class (plus 1) to the number of documents in that class (plus 1).  The ratio of these per-class probabilities (`r` below, after taking the log) is one half of what we need to solve the problem; the other half is the ratio of the class priors (`b` below).  Converting everything to logs turns the products and quotients into sums and differences, which makes the values less likely to underflow to zero or blow up to infinity.

In [80]:
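def pr(y_i):
    # Smoothed feature probabilities for class y_i:
    # (feature count in class + 1) / (documents in class + 1)
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

# pr reads the globals x and y at call time
x=trn_term_doc
y=trn_y

r = np.log(pr(1)/pr(0))  # log-count ratio for every feature
b = np.log((y==1).mean() / (y==0).mean())  # log ratio of the class priors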

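Here is the formula for Naive Bayes in log space. For a document with feature vector $x$, the log of the ratio $\dfrac{p\left(c=1 \mid d\right)}{p\left(c=0 \mid d\right)}$ is computed as $x \cdot r + b$, and we predict class 1 whenever it is positive.

Instead of calculating the probabilities for each document individually, we can just use matrix multiplication.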

In [81]:
pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.81655999999999995
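
To connect this code back to the toy example, the sketch below computes `r` and `b` for the four sample reviews in plain Python and scores the first review. The variable names mirror the cells above, but the data and the list-based arithmetic are illustrative assumptions; with NumPy or SciPy matrices the score is the `val_term_doc @ r.T + b` product. Note that the toy example labels the positive reviews as class 0, so a negative score here means "positive".

```python
import math

# Columns: bad, ever, great, i, is, liked, movie, the, this, was, worst
X = [
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0],  # This movie was great
    [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0],  # I liked this movie
    [0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1],  # This is the worst movie ever
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],  # Bad movie
]
y = [0, 0, 1, 1]

def pr(cls):
    # Smoothed per-feature probability: (column sum over class + 1) / (class size + 1).
    rows = [row for row, c in zip(X, y) if c == cls]
    return [(sum(col) + 1) / (len(rows) + 1) for col in zip(*rows)]

r = [math.log(p1 / p0) for p1, p0 in zip(pr(1), pr(0))]  # log-count ratio per feature
b = math.log(y.count(1) / y.count(0))                    # log prior ratio (0 here)

# Dot product of the first review's counts with r, plus b.
score = sum(x_j * r_j for x_j, r_j in zip(X[0], r)) + b
print(round(math.exp(score), 4))  # 0.1667, i.e. 1/6: class 0 is six times more likely
```

The exponentiated score of 1/6 is the inverse of the ratio of 6 obtained by hand, because the code divides class 1 by class 0.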

...and binarized Naive Bayes (where I don't care how often a word appears in a review, just whether it appears at all)

In [82]:
x=trn_term_doc.sign()
r = np.log(pr(1)/pr(0))

pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.83016000000000001
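
Binarizing simply replaces every positive count with 1, which is what `.sign()` does for a matrix of non-negative counts. A tiny sketch with a made-up count row (the values are illustrative, not taken from the dataset):

```python
counts = [0, 3, 1, 0, 7]  # hypothetical word counts for one document

# Presence/absence features: 1 if the word occurs at all, else 0.
binarized = [1 if c > 0 else 0 for c in counts]
print(binarized)  # [0, 1, 1, 0, 1]
```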

## Logistic regression

Here is how we can fit logistic regression where the features are the unigrams.

In [83]:
m = LogisticRegression(C=1e8, dual=True)
m.fit(x, y)  # note: x still holds trn_term_doc.sign() from the previous cell
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.83328000000000002

...and the binarized version

In [84]:
m = LogisticRegression(C=1e8, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.85519999999999996

...and the regularized version (the `C` parameter is the inverse of the regularization strength, so the smaller `C` is, the stronger the regularization)

In [85]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

0.84872000000000003
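
For reference, the L2-penalized objective that scikit-learn documents for `LogisticRegression` is shown below. The notation follows that documentation rather than this notebook: labels $y_i \in \{-1, 1\}$, weight vector $w$, and intercept $c$.

$$ \min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i \left(x_i^T w + c\right)\right) + 1\right) $$

Because $C$ multiplies the data-fit term, a small value such as $C = 0.1$ lets the penalty $\frac{1}{2} w^T w$ dominate (strong regularization), while $C = 10^8$ makes the penalty negligible.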

...and the regularized binarized version

In [86]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

0.88404000000000005

### Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment.

In [87]:
veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

In [88]:
trn_term_doc.shape

(25000, 800000)

In [90]:
vocab = veczr.get_feature_names()

In [91]:
vocab[200000:200005]

['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']
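
With `ngram_range=(1,3)`, every run of one, two, or three consecutive tokens becomes a candidate feature, which is why entries such as `'by vengeance .'` appear in the vocabulary. A minimal sketch of that extraction step (the helper name `ngrams` and the pre-tokenized input are assumptions for illustration; scikit-learn's internals differ):

```python
def ngrams(tokens, min_n=1, max_n=3):
    # Join each window of n consecutive tokens with a space, for n = min_n..max_n.
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i + n]))
    return out

features = ngrams(['driven', 'by', 'vengeance', '.'])
print(features)
# ['driven', 'by', 'vengeance', '.', 'driven by', 'by vengeance', 'vengeance .',
#  'driven by vengeance', 'by vengeance .']
```

Four tokens already produce nine features, which is why the vocabulary is capped with `max_features=800000`.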

In [92]:
y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()

In [93]:
r = np.log(pr(1) / pr(0))
b = np.log((y==1).mean() / (y==0).mean())

Here we fit regularized logistic regression where the features are the binarized unigrams, bigrams, and trigrams.

In [94]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()

0.90500000000000003

Here is the $\text{log-count ratio}$ `r`.  

In [95]:
r.shape, r

((1, 800000), matrix([[-0.05468386, -0.16100472, -0.24783616, ...,  1.09861229,
          -0.69314718, -0.69314718]]))

In [96]:
np.exp(r)

matrix([[ 0.94678442,  0.85128806,  0.7804878 , ...,  3.        ,
          0.5       ,  0.5       ]])

Here we fit regularized logistic regression where the features are the trigrams' log-count ratios.

This is not equivalent to just multiplying the weights by the ratios, as the weights get regularized while the input values do not.  Thus, when you multiply the input values by the Naive Bayes ratios, you are essentially saying that you believe the ratios and that the model should not alter them unless it has a good reason to.

In [97]:
x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()

0.91768000000000005
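
A small numeric sketch may help show what the scaling does to the model's score. All numbers below are made up for illustration. The score on the scaled features, $w \cdot (x \odot r)$, is arithmetically the same as using effective weights $w \odot r$ on the binarized features, and with every weight equal to 1 it reduces to the Naive Bayes log-ratio (before adding `b`):

```python
r = [-0.7, 0.0, 1.1]  # hypothetical log-count ratios for three features
x = [1, 0, 1]         # binarized features of one document
w = [0.5, 2.0, 1.0]   # hypothetical learned logistic regression weights

# Per-row equivalent of x.multiply(r): scale each feature by its log-count ratio.
x_nb = [x_j * r_j for x_j, r_j in zip(x, r)]

# Score on the scaled features: w . (x * r)
score_scaled = sum(w_j * v for w_j, v in zip(w, x_nb))

# Same score written as effective weights (w * r) applied to the binarized x.
score_effective = sum(w_j * r_j * x_j for w_j, r_j, x_j in zip(w, r, x))

# With all weights equal to 1, the score is the Naive Bayes log-ratio x . r.
nb_score = sum(x_nb)
```

The two scores are identical for any fixed `w`; what differs during training is that the penalty is applied to `w` alone, not to `r`.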

## fastai NBSVM++

In [98]:
sl=2000

In [99]:
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

In [19]:
learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)


[ 0.       0.0251   0.12003  0.91552]                          



In [159]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)


[ 0.       0.02014  0.11387  0.92012]                         
[ 1.       0.01275  0.11149  0.92124]                         



In [160]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)


[ 0.       0.01681  0.11089  0.92129]                           
[ 1.       0.00949  0.10951  0.92223]                          



## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)