In [0]:
!pip install "fastai==0.7.0"

In [0]:
!pip install Pillow==4.1.1

!pip install torchtext==0.2.3
!apt-get install gunzip
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!gunzip aclImdb_v1.tar.gz
!tar -xvf aclImdb_v1.tar


In [0]:
!pip list

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.nlp import *
from sklearn.linear_model import LogisticRegression

## IMDB dataset and the sentiment classification task

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains 50,000 reviews from IMDB, split evenly between positive and negative. The authors kept only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included. The dataset is divided into a training set and a test set of 25,000 labeled reviews each.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

`wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz`

`gunzip aclImdb_v1.tar.gz`

`tar -xvf aclImdb_v1.tar`

### Tokenizing and term document matrix creation

In [0]:
PATH='aclImdb/'
names = ['neg','pos']

In [6]:
%ls {PATH}

imdbEr.txt  imdb.vocab  README  test/  train/


In [7]:
%ls {PATH}train
# positive and negative reviews

labeledBow.feat  pos/    unsupBow.feat  urls_pos.txt
neg/             unsup/  urls_neg.txt   urls_unsup.txt


In [8]:
%ls {PATH}train/pos | head

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt
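
Judging from the listing above, the filenames seem to follow an `id_rating.txt` pattern. This is an assumption read off the listing, and the dataset README remains the authority. Under that assumption, a minimal sketch of recovering the rating and mapping it to the polarity thresholds described earlier could look like this (`rating_from_filename` and `polarity` are hypothetical helpers, not part of fastai):

```python
import re

def rating_from_filename(fname):
    """Return (review_id, rating) parsed from a name like '10001_10.txt'."""
    m = re.fullmatch(r"(\d+)_(\d+)\.txt", fname)
    if m is None:
        raise ValueError(f"unexpected filename: {fname!r}")
    return int(m.group(1)), int(m.group(2))

def polarity(rating):
    """Map a 1-10 rating to 1 (positive), 0 (negative) or None (neutral)."""
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    return None

print(rating_from_filename("10001_10.txt"))   # (10001, 10)
print(polarity(9), polarity(3), polarity(5))  # 1 0 None
```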


In [0]:
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names) # read every review under train/neg and train/pos; labels are indices into names (0=neg, 1=pos)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)

In [0]:
??texts_labels_from_folders
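
To make the loading step concrete, here is a rough sketch of what a function like this might do: walk each class folder, read every file, and use the folder's position in `names` as the label. `read_texts_labels` is a hypothetical stand-in written for illustration; the real fastai source (shown by `??texts_labels_from_folders`) may differ in details such as the glob pattern and the return types.

```python
import os
from glob import glob

def read_texts_labels(path, folders):
    """Read every .txt file under path/<folder>/ and label it with the folder's index."""
    texts, labels = [], []
    for idx, folder in enumerate(folders):
        # sorted() only makes the order reproducible; it is not required
        for fname in sorted(glob(os.path.join(path, folder, "*.txt"))):
            with open(fname, "r", encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(idx)
    return texts, labels
```

With `folders=['neg','pos']`, reviews from `neg/` get label 0 and reviews from `pos/` get label 1, which matches `trn_y[0]` being 0 for a negative review.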

Here is the text of the first review

In [10]:
trn[0]


'Are you familiar with concept of children\'s artwork? While it is not the greatest Picasso any three-year-old has ever accomplished with their fingers, you encourage them to do more. If painting is what makes them happy, there should be no reason a parent should hold that back on a child. Typically, if a child loves to paint or draw, you will immediately see the groundwork of their future style. You will begin to see their true form in these very primitive doodles. Well, this concept of children\'s artwork is how I felt about Fuqua\'s depressingly cheap and uncreative film Bait. While on all accounts it was a horrid film, it was impressive to see Fuqua\'s style begin emerging through even the messiest of moments. If you have seen either Training Day or King Arthur, you will be impressed with the birth of this director in his second film Bait. While Foxx gives a horrid, unchained performance, there are certain scenes, which define Fuqua and demonstrate his brilliance behind the camera.

In [0]:
trn_y[0]  #0 is negative review

0

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts (part of `sklearn.feature_extraction.text`).

In [0]:
# create the vectorizer that turns tokenized text into token counts
veczr = CountVectorizer(tokenizer=tokenize)

In [0]:
??CountVectorizer

`fit_transform(trn)` finds the vocabulary in the training set and transforms the training set into a term-document matrix. Since we have to apply the *same transformation* to the validation set, the second line uses just the method `transform(val)`. `trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document i: it holds, for each word in the vocabulary, the number of times that word occurs in the document.
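
The split between fitting and transforming can be illustrated with a dependency-free toy. This is only a sketch: it splits on whitespace instead of using a real tokenizer, builds dense lists instead of sparse matrices, and `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Assign a column index to every distinct token seen in the training docs."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Build a dense term-document matrix; tokens missing from vocab are skipped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # vocab preserves insertion order, so columns follow the fitted indices
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

train = ["good good movie", "bad movie"]
val = ["good film"]                      # "film" never appears in train

vocab = fit_vocabulary(train)            # vocabulary comes from train only
print(vocab)                             # {'bad': 0, 'good': 1, 'movie': 2}
print(transform(train, vocab))           # [[0, 2, 1], [1, 0, 1]]
print(transform(val, vocab))             # [[0, 1, 0]]
```

Because the vocabulary is fixed at fit time, the validation matrix has exactly the same columns as the training matrix, and the unseen token "film" contributes nothing.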

In [0]:
trn_term_doc = veczr.fit_transform(trn) # create vocab and create term doc matrix based on trn set
val_term_doc = veczr.transform(val) # use previously fitted model / vocab for valid set

# tokens in the validation set that are not in the fitted vocabulary are ignored by transform

In [0]:
# ??veczr.fit_transform

In [16]:
trn_term_doc
# not stored as a dense array
# only the non-zero counts are stored, together with their positions
# e.g. if word no. 12 occurs 6 times in doc 1, the sparse matrix stores
# (1,12) -> 6

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>
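
From the numbers above, 3,749,745 stored elements out of 25,000 × 75,132 cells means only about 0.2% of the matrix is non-zero. The idea of keeping just those entries can be sketched with a plain dictionary keyed by position. This is a simplification for illustration: the Compressed Sparse Row format packs the same information into three arrays rather than a dictionary, and `to_sparse` is a hypothetical helper.

```python
def to_sparse(dense_rows):
    """Keep only non-zero entries as {(row, col): count}."""
    return {
        (i, j): v
        for i, row in enumerate(dense_rows)
        for j, v in enumerate(row)
        if v != 0
    }

dense = [[0, 0, 6, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
sparse = to_sparse(dense)
print(sparse)                 # {(0, 2): 6, (1, 1): 1}
print(sparse.get((2, 3), 0))  # 0, anything not stored is an implicit zero
```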

In [17]:
trn_term_doc[0]  # 403 of the 75132 vocabulary tokens occur in doc 0

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 403 stored elements in Compressed Sparse Row format>

In [18]:
# get_feature_names maps integer indices back to words
vocab = veczr.get_feature_names(); vocab[5000:5005]

['aussie', 'aussies', 'austen', 'austeniana', 'austens']

In [22]:
vocab = veczr.get_feature_names(); vocab[0:5]

['\x08\x08\x08\x08a', '\x10own', '!', '"', '#']

In [0]:
# split on spaces (not the real tokenizer) and lower-case everything,
# just to see which distinct words appear in the first review

w0 = set([o.lower() for o in trn[0].split(' ')]); w0

In [0]:
len(w0) # number of distinct words in the review

108

In [19]:
# maps words to integers
veczr.vocabulary_['could'] 

# vocabulary_ returns the index of a word, i.e. it is the
# inverse of get_feature_names

15042

In [20]:
trn_term_doc[0,15042] # how many times 'could' (index 15042) occurs in doc 0
# checked by searching the review text in an editor

3

In [21]:
trn_term_doc[0,5000] # 'aussie' (index 5000) does not occur in doc 0

0

## Naive Bayes

We define the **log-count ratio** $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where ratio of feature $f$ in positive documents is the number of times a positive document has a feature divided by the number of positive documents.
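
As a worked example of this definition, the sketch below computes $r$ for a three-word toy vocabulary in plain Python. It uses the same add-one smoothing as the `pr` function defined further down (count plus 1, divided by number of documents plus 1). The toy data and the helper name `log_count_ratio` are made up for illustration.

```python
import math

def log_count_ratio(x, y):
    """r[f] = log(p(f|pos) / p(f|neg)) with add-one smoothing, per feature f."""
    n_feats = len(x[0])

    def pr(label):
        rows = [row for row, lab in zip(x, y) if lab == label]
        return [(sum(row[f] for row in rows) + 1) / (len(rows) + 1)
                for f in range(n_feats)]

    return [math.log(p / q) for p, q in zip(pr(1), pr(0))]

# columns: "good", "bad", "movie"
x = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 1],
     [0, 1, 0]]
y = [1, 1, 0, 0]

r = log_count_ratio(x, y)
print([round(v, 4) for v in r])  # [1.0986, -1.0986, 0.0]
```

"good" appears only in positive documents, so its ratio is log(3) > 0; "bad" mirrors it with log(1/3) < 0; "movie" is equally common in both classes and gets 0.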

In [0]:
x = trn_term_doc
y = trn_y

p = x[y==1].sum(0)+1 # per-word counts summed over the positive reviews (y==1), plus 1 for smoothing
q = x[y==0].sum(0)+1 # same for the negative reviews (y==0)
r = np.log((p/p.sum())/(q/q.sum()))  # log of the positive/negative ratio; with logs we add instead of multiply
b = np.log((y==1).mean() / (y==0).mean()) # bias: log ratio of the class frequencies

In [0]:
def pr(y_i):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

In [0]:
x=trn_term_doc
y=trn_y

r = np.log(pr(1)/pr(0))
b = np.log((y==1).mean() / (y==0).mean())

Here is the formula for Naive Bayes.

In [27]:
# count-based Naive Bayes: each word's log-count ratio is weighted by how often the word occurs in the review
# score = counts . r + b; a positive score is predicted as a positive review
pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.8074
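
The decision rule used in the cell above (dot product with `r`, add `b`, compare with zero) can be traced by hand on a toy example. The ratios below are illustrative values for a three-word vocabulary, and `nb_predict` is a hypothetical helper written for this sketch.

```python
import math

def nb_predict(x_rows, r, b):
    """Predict 1 (positive) when the summed log-ratios plus the bias exceed zero."""
    return [int(sum(c * w for c, w in zip(row, r)) + b > 0) for row in x_rows]

r = [math.log(3), -math.log(3), 0.0]   # toy ratios for "good", "bad", "movie"
b = math.log(0.5 / 0.5)                # balanced classes give b = 0

val = [[2, 0, 1],    # "good good movie"
       [1, 2, 0],    # "good bad bad"
       [0, 0, 1]]    # "movie": no evidence either way

print(nb_predict(val, r, b))  # [1, 0, 0]
```

The last document scores exactly 0, and since the rule is a strict `> 0` it falls on the negative side.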

...and binarized Naive Bayes.

In [29]:
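# .sign() replaces every positive count with 1, so only the presence of a word matters
x=trn_term_doc.sign()
# r = np.log(pr(1)/pr(0))  # uncomment to re-estimate r from the binarized x; as written, the r computed earlier is reused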

pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.82624
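
For count matrices, binarizing simply means replacing every positive count with 1. A tiny plain-Python equivalent of what `.sign()` does here (`binarize` is a hypothetical helper for illustration):

```python
def binarize(rows):
    """Replace every positive count with 1 and leave zeros untouched."""
    return [[1 if v > 0 else 0 for v in row] for row in rows]

print(binarize([[3, 0, 1], [0, 7, 0]]))  # [[1, 0, 1], [0, 1, 0]]
```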

### Logistic regression

Here is how we can fit logistic regression where the features are the unigrams.

In [34]:
'''
Instead of using the fixed log-count ratios r as coefficients, we learn the
coefficients from the data with logistic regression.
dual=True can reduce computation time when the data is wider than it is long,
i.e. has more columns than rows.

A smaller C means stronger regularization; a value as large as 1e8 effectively
turns regularization off.

Accuracy with C = 0.1 : 0.848
              C = 1e8 : 0.8327

'''

m = LogisticRegression(C=1e8, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()



0.83276

In [31]:
''' This is the Binarized version of the fitting '''

m = LogisticRegression(C=1e8, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()



0.85496

...and the regularized version

In [0]:
# turn regularization on (C=0.1) because the unregularized model overfits; LogisticRegression uses an L2 penalty by default
# L2 shrinks coefficients without driving them exactly to 0; if two features are correlated it tends to shrink both
# L1, by contrast, tends to set one of them to 0 and keep the other non-zero
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()



0.84872

In [0]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()



0.88404
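
To see why a smaller `C` means stronger regularization, here is a rough from-scratch sketch of L2-regularized logistic regression fitted by plain gradient descent. It is not how scikit-learn solves the problem; the objective (mean log-loss plus `||w||^2 / (2*C*n)`), the absence of an intercept, the learning rate and the toy data are all assumptions made for illustration.

```python
import math

def fit_logreg(x, y, C, lr=0.1, steps=3000):
    """Gradient descent on mean log-loss + ||w||^2 / (2*C*n); no intercept, for brevity."""
    n, d = len(x), len(x[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [wj / (C * n) for wj in w]             # gradient of the L2 penalty
        for row, label in zip(x, y):
            z = sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label  # sigmoid(z) - y
            for j in range(d):
                grad[j] += err * row[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# binarized toy data; columns: "good", "bad", "movie"
x = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]

w_weak = fit_logreg(x, y, C=1e8)    # regularization effectively off
w_strong = fit_logreg(x, y, C=0.1)  # strong regularization

norm = lambda w: math.sqrt(sum(v * v for v in w))
print(norm(w_weak), norm(w_strong))
```

On this separable toy set the weakly regularized weights keep growing with more steps, while `C=0.1` holds them close to zero. That shrinkage is what limits overfitting on the real, very wide term-document matrix.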

### Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment.

In [0]:
# ngram_range=(1,3) uses sequences of 1 to 3 tokens as features instead of single tokens only
# max_features=800000 keeps only the 800k most frequent n-grams

veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)

val_term_doc = veczr.transform(val)

In [0]:
trn_term_doc.shape

(25000, 800000)

In [0]:
vocab = veczr.get_feature_names()

In [0]:
vocab[200000:200005]

['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']
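
Vocabulary entries such as 'by vera miles' are n-grams: runs of up to three consecutive tokens. A small sketch of how they can be enumerated from a token list (the helper `ngrams` is hypothetical, and `CountVectorizer` does the equivalent internally when `ngram_range=(1,3)`):

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams(["by", "vera", "miles"]))
# ['by', 'vera', 'miles', 'by vera', 'vera miles', 'by vera miles']
```

The number of distinct n-grams grows quickly with corpus size, which is why the vectorizer above caps the vocabulary with `max_features`.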

In [0]:
y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()

In [0]:
r = np.log(pr(1) / pr(0))
b = np.log((y==1).mean() / (y==0).mean())

Here we fit regularized logistic regression where the features are the trigrams.

In [0]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()



0.905

Here is the $\text{log-count ratio}$ `r`.  

In [0]:
r.shape, r

((1, 800000),
 matrix([[-0.05468, -0.161  , -0.24784, ...,  1.09861, -0.69315, -0.69315]]))

In [0]:
np.exp(r)

matrix([[0.94678, 0.85129, 0.78049, ..., 3.     , 0.5    , 0.5    ]])

Here we fit regularized logistic regression where the features are the trigrams' log-count ratios.

In [0]:
# multiply each binarized feature by its log-count ratio r (element-wise, broadcast over rows),
# then fit logistic regression on these Naive Bayes features
x_nb = x.multiply(r)

m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()



0.91768
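
The feature construction in the cell above is an element-wise product: every row of the binarized matrix is multiplied by the same vector of log-count ratios. A toy version in plain Python, with made-up ratios and a hypothetical helper `nb_features`:

```python
import math

def nb_features(x_bin, r):
    """Scale each binarized feature by its log-count ratio, row by row."""
    return [[v * w for v, w in zip(row, r)] for row in x_bin]

r = [math.log(3), -math.log(3), 0.0]   # toy ratios for "good", "bad", "movie"
x_bin = [[1, 0, 1],                    # contains "good" and "movie"
         [0, 1, 1]]                    # contains "bad" and "movie"

x_nb = nb_features(x_bin, r)
print(x_nb)
```

Each present feature now carries its Naive Bayes evidence as its value, so the logistic regression starts from inputs that already point in the right direction and only has to learn how much to trust each one.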

# Lesson 11


## fastai NBSVM++

In [0]:
sl=2000

In [0]:
# build a model-data object from the bag-of-words matrices
# trn_term_doc / val_term_doc = term-document matrices, trn_y / val_y = labels, sl = at most 2000 unique words per review
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

In [40]:
learner = md.dotprod_nb_learner() # fastai's generalization of the model above, based on a dot product with the Naive Bayes ratios
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>   
    0      0.0247     0.1191     0.91624   



[array([0.1191]), 0.916239999961853]

In [41]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>   
    0      0.01922    0.113365   0.92156   
    1      0.01085    0.112204   0.92176   



[array([0.1122]), 0.921759999961853]

In [42]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)

epoch      trn_loss   val_loss   <lambda>   
    0      0.016589   0.111132   0.92124   
    1      0.010538   0.109505   0.92208   



[array([0.1095]), 0.9220799999809265]

In [0]:
'''
NBSVM++

  class DotProdNB(nn.Module):
    def __init__(self, nf, ny, w_adj=0.4, r_adj=10):
        super().__init__()
        self.w_adj,self.r_adj = w_adj,r_adj
        self.w = nn.Embedding(nf+1, 1, padding_idx=0)  #nf+1 = no. of rows assume embedding = linear
        self.w.weight.data.uniform_(-0.1,0.1)
        self.r = nn.Embedding(nf+1, ny)

    def forward(self, feat_idx, feat_cnt, sz):
        w = self.w(feat_idx)
        r = self.r(feat_idx)
        x = ((w+self.w_adj)*r/self.r_adj).sum(1)  # calculating activations
        return F.softmax(x)



With w = 0 alone, a feature would contribute nothing to the activation, i.e.
the model would have no opinion about it, which rarely makes sense in practice.

Regularization (weight decay) penalizes sum(w^2) and so pulls w towards 0.
Because the activation uses w + w_adj, w = 0 still leaves each feature with
the weight w_adj, so the model is pulled towards the Naive Bayes ratios r
rather than towards ignoring the feature. w can still go negative and make
w + w_adj = 0, but only by paying the sum(w^2) penalty.



'''
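
The effect of `w_adj` can be checked numerically with a stripped-down version of the activation in `forward`. This sketch assumes a single output and one scalar ratio per feature (the real module looks up per-class ratios in an embedding), and the ratio values are made up.

```python
def dotprod_nb_activation(w, r, w_adj=0.4, r_adj=10):
    """Sum (w_f + w_adj) * r_f / r_adj over the features present in one document."""
    return sum((wf + w_adj) * rf / r_adj for wf, rf in zip(w, r))

r = [1.1, -0.7, 0.2]   # toy log-count ratios of the features in a document

# w = 0 (where weight decay pushes): the Naive Bayes evidence still gets through
print(dotprod_nb_activation([0.0, 0.0, 0.0], r))

# w = -w_adj: the learned weights cancel the evidence completely
print(dotprod_nb_activation([-0.4, -0.4, -0.4], r))
```

With all-zero learned weights the activation is still `w_adj / r_adj` times the summed ratios, so the fully regularized model behaves like a scaled Naive Bayes classifier rather than an uninformed one.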

In [0]:
??DotProdNB

## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)