# Predicting the positive/negative labels for movie reviews

The goal of this assignment is to implement a method that predicts whether a movie review is positive or negative.

For that, you will use the movie review database built by Pang and Lee in 2008. This dataset provides the text of movie reviews. Two examples of such reviews are given below:

The first: "this film is extraordinarily horrendous and I'm not going to waste any more words on it." is clearly negative.

The second: "this three hour movie opens up with a view of singer/guitar player/musician/composer frank zappa rehearsing with his fellow band members. All the rest displays a compilation of footage, mostly from the concert at the palladium in new york city, halloween 1979. Other footage shows backstage foolishness, and amazing clay animation by Bruce Bickford. the performance of "titties and beer" played in this movie is very entertaining, with drummer terry bozzio supplying the voice of the devil. Frank's guitar solos outdo any van halen or hendrix I've ever heard. Bruce Bickford's outlandish clay animation is that beyond belief with zooms, morphings, etc. and actually, it doesn't even look like clay, it looks like meat." gives a positive opinion of the movie.

Pang and Lee labeled 1000 movie reviews with the 'positive' label and 1000 movie reviews with the 'negative' label.

In this work, you will program the method described in [Dave et al., 2003].

# Dealing with the movie review database

Firstly, let's manipulate the movie review database with Python.

This resource is included in the nltk package:

In [2]:
import nltk

nltk.download("movie_reviews")

from nltk.corpus import movie_reviews

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

print(negids[0])

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\langlois\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


neg/cv000_29416.txt


As you can see in this example, the movie_reviews corpus reader provides a function fileids, which lists the ids of the negative or positive reviews.

It is possible to get the text of a review:

In [3]:
t = movie_reviews.words(fileids = [negids[0]])
print(" ".join(t))

plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what ' s the deal ? watch the movie and " sorta " find out . . . critique : a mind - fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn ' t snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it ' s simply too jumbled . it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea

As you can see, the words are returned as a list, and tokenization has already been applied.

Now that we can access every movie review and its content, we can compute statistics on this textual material. Let's go!

# Work for you

Extract statistics for the negative reviews and for the positive reviews. We want the histogram of document lengths (in tokens), in buckets of size 100. Do you see a difference in distribution between negative and positive reviews?

In [3]:
stats1 = {}
for iddoc in posids:
    t = movie_reviews.words(fileids = [iddoc])
    bucket = len(t) // 100 + 1
    if bucket in stats1:
        stats1[bucket] += 1
    else:
        stats1[bucket] = 1

stats2 = {}
for iddoc in negids:
    t = movie_reviews.words(fileids = [iddoc])
    bucket = len(t) // 100 + 1
    if bucket in stats2:
        stats2[bucket] += 1
    else:
        stats2[bucket] = 1

        
        
for b in stats1:
    print("bucket=",b,"(["+str((b-1)*100)+":"+str((b)*100-1)+"]) : pos =",stats1[b],"neg =",end="")
    if b in stats2:
        print(stats2[b])
    else:
        print("0")
    

bucket= 2 ([100:199]) : pos = 4 neg =6
bucket= 3 ([200:299]) : pos = 23 neg =35
bucket= 4 ([300:399]) : pos = 56 neg =66
bucket= 5 ([400:499]) : pos = 85 neg =83
bucket= 6 ([500:599]) : pos = 106 neg =139
bucket= 7 ([600:699]) : pos = 132 neg =154
bucket= 8 ([700:799]) : pos = 120 neg =129
bucket= 9 ([800:899]) : pos = 128 neg =126
bucket= 10 ([900:999]) : pos = 86 neg =98
bucket= 11 ([1000:1099]) : pos = 74 neg =48
bucket= 12 ([1100:1199]) : pos = 48 neg =39
bucket= 13 ([1200:1299]) : pos = 31 neg =21
bucket= 14 ([1300:1399]) : pos = 33 neg =12
bucket= 15 ([1400:1499]) : pos = 17 neg =13
bucket= 16 ([1500:1599]) : pos = 14 neg =8
bucket= 17 ([1600:1699]) : pos = 11 neg =8
bucket= 18 ([1700:1799]) : pos = 9 neg =3
bucket= 19 ([1800:1899]) : pos = 7 neg =4
bucket= 20 ([1900:1999]) : pos = 4 neg =3
bucket= 21 ([2000:2099]) : pos = 2 neg =0
bucket= 22 ([2100:2199]) : pos = 5 neg =1
bucket= 23 ([2200:2299]) : pos = 2 neg =2
bucket= 27 ([2600:2699]) : pos = 1 neg =0
bucket= 28 ([2700:2799])
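
The same bucketing can be written more compactly with `collections.Counter`. The following is only a sketch on hypothetical document lengths (the names `length_histogram`, `toy_pos_lengths` and `toy_neg_lengths` are illustrative and not part of the notebook); with the real corpus, the lengths would come from `len(movie_reviews.words(fileids=[iddoc]))`. It iterates over the sorted union of buckets, so buckets that occur in only one class are still reported.

```python
from collections import Counter

BUCKET_SIZE = 100

def length_histogram(lengths, bucket_size=BUCKET_SIZE):
    """Count how many documents fall into each length bucket.

    Bucket b covers lengths (b - 1) * bucket_size .. b * bucket_size - 1,
    the same numbering as in the loop above.
    """
    return Counter(length // bucket_size + 1 for length in lengths)

# Hypothetical document lengths, for illustration only.
toy_pos_lengths = [120, 250, 260, 999, 1000]
toy_neg_lengths = [130, 140, 270]

pos_hist = length_histogram(toy_pos_lengths)
neg_hist = length_histogram(toy_neg_lengths)

# A Counter returns 0 for a missing key, so no membership test is needed.
for b in sorted(set(pos_hist) | set(neg_hist)):
    low, high = (b - 1) * BUCKET_SIZE, b * BUCKET_SIZE - 1
    print(f"bucket={b} ([{low}:{high}]) : pos = {pos_hist[b]} neg = {neg_hist[b]}")
```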

# Segmentation of the corpus into a train part and a test part

When you want to estimate a predictive model and evaluate it, you have to estimate the model on a training part and evaluate its predictive performance on a __separate__ test part.

For that, I propose the following code:

In [4]:
train_negids = negids[0:int(0.75*len(negids))]
test_negids = negids[int(0.75*len(negids)):]
train_posids = posids[0:int(0.75*len(posids))]
test_posids = posids[int(0.75*len(posids)):]

print(len(train_negids))
print(len(test_negids))
print(len(train_posids))
print(len(test_posids))

750
250
750
250
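
The split above takes the first 75% of the file ids of each class. As a hedged variation, the same logic can be wrapped in a small helper with an optional seeded shuffle; the helper name `split_ids`, its `seed` parameter and the toy ids are assumptions made for illustration, not something the notebook defines.

```python
import random

def split_ids(ids, train_ratio=0.75, seed=None):
    """Split a list of ids into a train part and a disjoint test part.

    Without a seed, the original order is kept (first 75% for training),
    which mirrors the slicing used above. With a seed, the ids are
    shuffled reproducibly before the cut.
    """
    ids = list(ids)  # work on a copy so the caller's list is untouched
    if seed is not None:
        random.Random(seed).shuffle(ids)
    cut = int(train_ratio * len(ids))
    return ids[:cut], ids[cut:]

# Hypothetical file ids, for illustration only.
toy_ids = [f"neg/doc{i}.txt" for i in range(8)]

train_part, test_part = split_ids(toy_ids)
shuffled_train, shuffled_test = split_ids(toy_ids, seed=0)

print(len(train_part), len(test_part))
```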


# The predictive model

[Dave et al., 2003] propose the following strategy to predict the label of a movie review :

The score of a word $w$ is defined by:

$$ score(w) = \frac{P(w|P)-P(w|N)}{P(w|P)+P(w|N)}$$

Then, the 'positivity' of a document $d$ is given by:

$$eval(d) = \sum_{w \in set(d)} score(w)$$

where the sum runs over the set of words in $d$ (a word occurring several times in $d$ is counted only once in the sum).

The decision then follows this rule: if $eval(d) > 0$, the document is predicted positive; otherwise it is predicted negative.
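
To make the two formulas concrete, here is a minimal sketch on made-up probabilities. The dictionaries `toy_p_pos` and `toy_p_neg` and the function names are hypothetical; words absent from both dictionaries are given a score of 0, which is the convention the notebook's own `score` function uses later.

```python
def word_score(w, p_pos, p_neg):
    """score(w) = (P(w|P) - P(w|N)) / (P(w|P) + P(w|N))."""
    pp = p_pos.get(w, 0.0)
    pn = p_neg.get(w, 0.0)
    if pp + pn == 0:
        return 0.0  # unseen word: neither positive nor negative
    return (pp - pn) / (pp + pn)

def positivity(words, p_pos, p_neg):
    """eval(d): sum of the scores over the set of distinct words of d."""
    return sum(word_score(w, p_pos, p_neg) for w in set(words))

# Hypothetical probabilities, for illustration only.
toy_p_pos = {"good": 0.6, "bad": 0.25, "great": 0.3}
toy_p_neg = {"good": 0.6, "bad": 0.5, "great": 0.1}

doc = ["great", "great", "film", "bad"]  # "great" is counted once
value = positivity(doc, toy_p_pos, toy_p_neg)
label = "P" if value > 0 else "N"
print(value, label)
```

Here "great" contributes 0.5, "bad" contributes -1/3 and "film" contributes 0, so the sum is positive and the toy document is predicted positive.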



# Estimating the parameters of the predictive model

The parameters of this predictive model are all the $P(w|P)$ and the $P(w|N)$ for all the words $w$ in the positive and negative documents.

__Work__: compute these values on the training corpus.

To estimate $P(w|P)$, iterate over all the positive documents in the training part. $P(w|P)$ is defined by:

$$P(w|P) = \frac{\sum_{d \in \{P\}} \delta(w,d)}{|\{P\}|}$$

where $\{P\}$ is the set of documents in the positive training part, and $\delta(w,d)$ is equal to 1 if $w$ is in $d$, 0 otherwise.

The formula is the same for $P(w|N)$, using the negative training documents.
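
As a small illustration of this estimate, the sketch below computes the fraction of documents containing each word on a hypothetical four-document corpus. The function name `doc_frequency` and the toy documents are assumptions for the example; on the real data, each document would be the word list returned by `movie_reviews.words`.

```python
from collections import Counter

def doc_frequency(docs):
    """Return, for each word, the fraction of documents that contain it."""
    counts = Counter()
    for words in docs:
        counts.update(set(words))  # set(): a word counts once per document
    return {w: c / len(docs) for w, c in counts.items()}

# Hypothetical tokenized positive documents, for illustration only.
toy_pos_docs = [
    ["good", "good", "film"],
    ["good", "plot"],
    ["bad", "film"],
    ["great"],
]

toy_p_w_pos = doc_frequency(toy_pos_docs)
print(toy_p_w_pos["good"])  # "good" appears in 2 of the 4 documents
```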

In [9]:

## for P(w|negative)
stats_neg_words = {}

for iddoc in train_negids:
    t = movie_reviews.words(fileids = [iddoc])
    t = set(t) ## now a word occurs only once 
    for w in t:
        if w in stats_neg_words:
            stats_neg_words[w] += 1
        else:
            stats_neg_words[w] = 1

## we divide by the number of negative docs in order to have values between 0 and 1            
for w in stats_neg_words:
    stats_neg_words[w] = stats_neg_words[w]/len(train_negids) 

## same work for positive documents
stats_pos_words = {}

for iddoc in train_posids:
    t = movie_reviews.words(fileids = [iddoc])
    t = set(t)
    for w in t:
        if w in stats_pos_words:
            stats_pos_words[w] += 1
        else:
            stats_pos_words[w] = 1

for w in stats_pos_words:
    stats_pos_words[w] = stats_pos_words[w]/len(train_posids) 
    
## we complete stats_pos_words by adding each
## word that is in stats_neg_words but not in stats_pos_words,
## with probability 0
for w in stats_neg_words:
  if not w in stats_pos_words:
    stats_pos_words[w] = 0

## same work for stats_neg_words
for w in stats_pos_words:
  if not w in stats_neg_words:
    stats_neg_words[w] = 0

    
## examples    
print(stats_pos_words["bad"])
print(stats_neg_words["bad"])
print(stats_pos_words["good"])
print(stats_neg_words["good"])

## P(w|c): c is the positive or the negative class, depending on which dictionary is passed as stats
def score_word(w,stats):
    if w not in stats:
        return 0
    return stats[w]

## positivity of the word
def score(w,stats_p,stats_n):
    ## ATTENTION: at test time, w may appear neither in the training
    ## positive docs nor in the training negative docs
    sp = score_word(w,stats_p)
    sn = score_word(w,stats_n)
    if sp+sn == 0:
        ## then, by default return 0 : not negative, not positive
        return 0
    return (sp-sn)/(sp+sn)


## positivity of a document
def eval(iddoc,stats_p,stats_n,threshold):
    t = movie_reviews.words(fileids = [iddoc])
    t = set(t)
    s = 0
    ## sum for all word in the document
    for w in t:
        s += score(w,stats_p,stats_n)
    ## return decision
    if s > threshold:
        return "P"
    else:
        return "N"



0.2613333333333333
0.5066666666666667
0.5853333333333334
0.5893333333333334


# Evaluating the predictive model

We want to know whether the decision strategy described above works well on the test corpus: when a document is labeled positive, is it predicted as positive, and likewise for a negative document?

We are going to evaluate the following statistics on the test corpus.

When the model predicts P, the document may actually be positive: this is a true positive ($tp$). But the document may actually be negative: this is a false positive ($fp$).

In the same way, when the model predicts N, the document may actually be negative: this is a true negative ($tn$). But the document may actually be positive: this is a false negative ($fn$).

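Note that the following constraints hold:

+ $tp+fn=n$
+ $fp+tn=n$

where $n$ is the number of test documents in each class (positive or negative). In contrast, $tp+fp$ and $fn+tn$ are the numbers of documents predicted positive and predicted negative; they depend on the model and need not be equal to $n$.

These notations can be summarized in the following table: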

<table>
    <tr>
        <td>  </td>
        <td>  </td>
        <td colspan=2 style="text-align: center"> True label </td>
    </tr>
    <tr>
        <td>  </td>
        <td>  </td>
        <td> positive </td>
        <td> negative </td>
    </tr>    
    <tr>
        <td rowspan=2 style="vertical-align: middle"> Predicted label </td>
        <td> positive </td>
        <td> true positive </td>
        <td> false positive </td>
    </tr>
    <tr>
        <td> negative </td>
        <td> false negative </td>
        <td> true negative </td>
    </tr>
</table>
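
The four cells of this table can be counted from paired lists of true and predicted labels. The sketch below uses hypothetical labels and a hypothetical helper `confusion_counts`; it only illustrates the bookkeeping and is not tied to the corpus.

```python
def confusion_counts(true_labels, predicted_labels, positive="P"):
    """Return (tp, fp, tn, fn) for paired true and predicted labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted P, actually P
            else:
                fp += 1  # predicted P, actually N
        else:
            if truth == positive:
                fn += 1  # predicted N, actually P
            else:
                tn += 1  # predicted N, actually N
    return tp, fp, tn, fn

# Hypothetical labels, for illustration only: 3 positive and 3 negative docs.
toy_true = ["P", "P", "P", "N", "N", "N"]
toy_pred = ["P", "P", "N", "P", "P", "N"]

tp, fp, tn, fn = confusion_counts(toy_true, toy_pred)
print(tp, fp, tn, fn)
```

In this toy case, $tp+fn$ is the number of actually positive documents and $fp+tn$ is the number of actually negative documents.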


The recall measures the proportion of the actually positive documents that the model retrieves:

$$recall = \frac{tp}{tp+fn}$$

The precision measures the proportion of the documents predicted positive that are actually positive, i.e. whether the model over-generates positive predictions:

$$precision = \frac{tp}{tp+fp}$$

Moreover, we can use the F1 measure, which combines precision and recall (it is their harmonic mean):

$$F1 = 2 \times \frac{precision \times recall}{precision + recall}$$
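
The three formulas can be sketched as one small helper. The counts used below (tp = 244, fp = 134, fn = 6) are not printed by the notebook: they are back-computed from the recall and precision reported for threshold 0 in the evaluation output, assuming the 250 positive test documents of the split. Returning 0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred from the reported threshold-0 figures (see lead-in).
precision, recall, f1 = precision_recall_f1(tp=244, fp=134, fn=6)
print("recall", recall)
print("precision", precision)
print("F1", f1)
```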

__Work__: evaluate on the test corpus the recall, the precision and the F1 measure.

In [10]:
## we predict P or N for each doc in test, and we compute tp, fp, tn and fn

for t in [0,1,2,3,4,5,6,7,8,9,10]:
    print("==================",t)

    tp = 0
    fp = 0
    tn = 0
    fn = 0

    for iddoc in test_negids:
        if eval(iddoc,stats_pos_words,stats_neg_words,t) == "P":
            fp += 1
        else:
            tn += 1

    for iddoc in test_posids:
        if eval(iddoc,stats_pos_words,stats_neg_words,t) == "P":
            tp += 1
        else:
            fn += 1

    recall = tp/(tp+fn)
    precision = tp/(tp+fp)
    F1 = 2*(precision*recall)/(precision+recall)

    print("recall",recall)
    print("precision",precision)
    print("F1",F1)

recall 0.976
precision 0.6455026455026455
F1 0.7770700636942673
recall 0.964
precision 0.667590027700831
F1 0.7888707037643207
recall 0.952
precision 0.695906432748538
F1 0.8040540540540541
recall 0.936
precision 0.7069486404833837
F1 0.8055077452667815
recall 0.908
precision 0.7138364779874213
F1 0.7992957746478873
recall 0.892
precision 0.7216828478964401
F1 0.7978533094812166
recall 0.852
precision 0.7344827586206897
F1 0.788888888888889
recall 0.824
precision 0.7490909090909091
F1 0.7847619047619048
recall 0.808
precision 0.7651515151515151
F1 0.7859922178988327
recall 0.776
precision 0.782258064516129
F1 0.7791164658634537
recall 0.764
precision 0.809322033898305
F1 0.7860082304526749


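With a threshold of 0, the model retrieves 97.6% of the positive documents, but only 64.6% of the retrieved documents (those labeled positive by the model) are actually positive. Raising the threshold lowers the recall and raises the precision, as the results for the higher thresholds show.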