# Work practice 2 in NLP

In this notebook you will see the basics of part-of-speech tagging and sentiment analysis.

Subject : **Propose a method based on pos-tagging and SentiWordNet to classify movie reviews.**

# Libraries and data import

In [83]:
import nltk
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import stopwords, treebank
from nltk.tag import UnigramTagger
from nltk.tokenize import word_tokenize
from sklearn.datasets import load_files
from sklearn.metrics import accuracy_score

# Resources needed by the treebank examples, the tokenizer and pos_tag
nltk.download('treebank')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


The movie review polarity dataset v2.0 can be downloaded here: https://www.cs.cornell.edu/people/pabo/movie-review-data/

In [2]:
!wget http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

--2024-03-10 15:45:51--  http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3127238 (3.0M) [application/x-gzip]
Saving to: ‘review_polarity.tar.gz’


2024-03-10 15:45:51 (16.6 MB/s) - ‘review_polarity.tar.gz’ saved [3127238/3127238]



In [None]:
!tar --gunzip --extract --verbose --file=review_polarity.tar.gz

This creates a folder named "txt_sentoken" that you can use directly to import the reviews.

To download SentiWordNet :

In [4]:
nltk.download("sentiwordnet")
nltk.download("stopwords")
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

# Course:

## Part-of-speech tagging:

See the short example below if you want to build your own POS tagger with UnigramTagger.

In [None]:
train_sents = treebank.tagged_sents()[:3000]  # treebank.tagged_sents() returns tokenized sentences with their tags (same format as the output below)
tagger = UnigramTagger(train_sents)  # train a UnigramTagger on the first 3000 sentences
tagger.tag(treebank.sents()[0])  # predict the tags of the first treebank sentence
# treebank.sents() returns the tokenized sentences only, without tags

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

In [None]:
#another example
tagger.tag(["Brian","is","in","the","kitchen"])

[('Brian', 'NNP'),
 ('is', 'VBZ'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('kitchen', None)]
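
The `None` tag for "kitchen" appears because a unigram tagger can only tag words it has seen during training. The sketch below illustrates that idea in pure Python. It is not NLTK's implementation: the tiny training set and the helper names `train_unigram` and `tag_words` are invented for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(model, words, default=None):
    """Tag known words from the model; unknown words get `default`."""
    return [(w, model.get(w, default)) for w in words]

# Hypothetical two-sentence training set, for illustration only.
toy_train = [
    [("Brian", "NNP"), ("is", "VBZ"), ("in", "IN"), ("the", "DT"), ("garden", "NN")],
    [("the", "DT"), ("garden", "NN"), ("is", "VBZ"), ("green", "JJ")],
]
model = train_unigram(toy_train)
sentence = ["Brian", "is", "in", "the", "kitchen"]

print(tag_words(model, sentence))                # 'kitchen' was never seen: its tag is None
print(tag_words(model, sentence, default="NN"))  # unseen words fall back to 'NN'
```

In NLTK itself, a comparable fallback is obtained by passing a `backoff` tagger (for instance `nltk.tag.DefaultTagger('NN')`) when building the `UnigramTagger`.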

Below are the tags and their meanings (Penn Treebank tag set). Note that VP, S, SBAR, SBARQ, SINV and SQ are phrase- or clause-level labels used in parse trees, not tags assigned to individual words.

| POS Tag | Description | Example |
| --- | --- | --- |
| CC | Coordinating conjunction | and, but, or, & |
| CD | Cardinal number | 1, three |
| DT | Determiner | the |
| PDT | Predeterminer | all, both |
| WDT | Wh-determiner | which, that |
| IN | Preposition / subordinating conjunction | in, of, like, after, whether |
| JJ | Adjective | green |
| JJR | Adjective, comparative | greener |
| JJS | Adjective, superlative | greenest |
| MD | Modal | could, will |
| NNS | Noun, plural | tables |
| NN | Noun, singular or mass | table |
| PRP | Personal pronoun | I, he, it |
| PRP\$ | Possessive pronoun | my, his |
| WP | Wh-pronoun | who, what |
| VB | Verb, base form | take |
| VBD | Verb, past tense | took |
| VBG | Verb, gerund or present participle | taking |
| VBN | Verb, past participle | taken |
| VBP | Verb, present, not 3rd person singular | take |
| VBZ | Verb, present, 3rd person singular | takes |
| RB | Adverb | well, again |
| RBR | Adverb, comparative | better |
| RBS | Adverb, superlative | best |
| WRB | Wh-adverb | where, when |
| EX | Existential *there* | there |
| FW | Foreign word | a non-English word |
| VP | Verb phrase | muttering, yelling |
| LS | List item marker | 1., a) |
| POS | Possessive ending | 's |
| RP | Particle | up in "give up" |
| S | Simple declarative clause | |
| SBAR | Clause introduced by a subordinating conjunction | |
| SBARQ | Direct question introduced by a wh-word | what |
| SINV | Inverted declarative sentence | the subject follows the tensed verb or the modal |
| SQ | Inverted yes/no question, or main clause of a wh-question | |
| SYM | Symbol | +, = |





## SentiWordNet:

SentiWordNet is a lexical resource for opinion mining. SentiWordNet (swn) assigns three sentiment scores to each WordNet synset: positivity, negativity and objectivity.
It works like this:

In [81]:
word = input()
synsets = list(swn.senti_synsets(word))  # senti_synsets returns a lazy filter object, so convert it to a list
first = synsets[0]  # the length of the list varies from word to word; here we simply keep the first synset
print("chosen word:", word)
print("type returned by swn.senti_synsets:", type(swn.senti_synsets(word)))
print("synsets as a list:", synsets)
print("first synset:", first)
print("objectivity score (share of the word that carries no sentiment), '.obj_score()':", first.obj_score())
print("subjectivity score (how much sentiment the word carries), '1 - .obj_score()':", float(1 - first.obj_score()))
print("positive score, '.pos_score()':", first.pos_score())
print("negative score, '.neg_score()':", first.neg_score())

wonderful
chosen word: wonderful
type returned by swn.senti_synsets: <class 'filter'>
synsets as a list: [SentiSynset('fantastic.s.02')]
first synset: <fantastic.s.02: PosScore=0.75 NegScore=0.0>
objectivity score (share of the word that carries no sentiment), '.obj_score()': 0.25
subjectivity score (how much sentiment the word carries), '1 - .obj_score()': 0.75
positive score, '.pos_score()': 0.75
negative score, '.neg_score()': 0.0
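
In NLTK's SentiWordNet reader, the objectivity score is derived from the other two as `1 - (pos + neg)`, so the subjectivity `1 - obj_score()` is simply `pos + neg`. Here is a small sketch of that arithmetic with the values printed above. The `Scores` tuple is a hypothetical stand-in for a `SentiSynset`, so that the snippet runs without any NLTK data.

```python
from collections import namedtuple

# Hypothetical stand-in for a SentiSynset: only the two stored scores.
Scores = namedtuple("Scores", ["pos", "neg"])

def objectivity(s):
    # Whatever is neither positive nor negative counts as objective.
    return 1.0 - (s.pos + s.neg)

def subjectivity(s):
    # Same value as 1 - objectivity(s).
    return s.pos + s.neg

wonderful = Scores(pos=0.75, neg=0.0)  # scores of fantastic.s.02 shown above
print(objectivity(wonderful))   # 0.25
print(subjectivity(wonderful))  # 0.75
```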


Note that `pos_tag` and SentiWordNet do not use the same tag set, so you can either ignore the tags or convert them. Here is one way to map the prefixes of the `pos_tag` tags to the SentiWordNet parts of speech:

* NN: n
* VB: v
* JJ: a
* RB: r

Below is an example of code that measures the sentiment of a text:


In [84]:
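def flatten(list_of_lists):
    # Flatten one level of nesting: [[a, b], [c]] -> [a, b, c]
    return [item for sublist in list_of_lists for item in sublist]

english_stopwords = set(stopwords.words("english"))

# Penn Treebank tag prefixes -> WordNet part-of-speech letters
nltk_to_sentiwordnet = {
    "NN": "n",
    "VB": "v",
    "JJ": "a",
    "RB": "r",
}

def get_sentiment(article):
    # Split into sentences, tokenize each one, then POS-tag all of them
    sentences = nltk.sent_tokenize(article)
    sentence_words = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentence_words = flatten(nltk.pos_tag_sents(sentence_words))

    pos_scores = []
    neg_scores = []
    subj_scores = []

    for word, pos in tagged_sentence_words:
        # Only nouns, verbs, adjectives and adverbs have a SentiWordNet equivalent
        swn_pos = nltk_to_sentiwordnet.get(pos[:2])
        if swn_pos is None:
            continue

        synsets = list(swn.senti_synsets(word.lower(), pos=swn_pos))
        if len(synsets) == 0:
            continue

        # Keep only the first synset of the word
        synset = synsets[0]
        pos_scores.append(synset.pos_score())
        neg_scores.append(synset.neg_score())
        subj_scores.append(1 - synset.obj_score())

    # np.average raises ZeroDivisionError when the weights sum to zero
    # (no scored word, or only fully objective words), so return zeros in that case
    if sum(subj_scores) == 0:
        return 0.0, 0.0, 0.0
    # Positive and negative scores weighted by subjectivity, plus the mean subjectivity
    return np.average(pos_scores, weights=subj_scores), np.average(neg_scores, weights=subj_scores), np.mean(subj_scores)
    # Unweighted alternative:
    #return np.average(pos_scores) , np.average(neg_scores), np.mean(subj_scores)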

In [87]:
print("for the sentence : 'The product quality is consistently outstanding, exceeding my expectations every time.', we have this vector:")
v = get_sentiment("The product quality is consistently outstanding, exceeding my expectations every time.")
print(v)
print("So the review is :", "positive" if v[0]>v[1] else "negative")

for the sentence : 'The product quality is consistently outstanding, exceeding my expectations every time.', we have this vector:
(0.49107142857142855, 0.026785714285714284, 0.21875)
So the review is : positive


## Importing text files with labels:

This applies if your dataset contains text files with categories as subfolder names.

Individual samples are assumed to be files stored in a two-level folder structure such as the following:

container_folder/

> category_1_folder/

> > file_1.txt file_2.txt … file_42.txt

> category_2_folder/

> > file_43.txt file_44.txt …

You can use the function :

    x = sklearn.datasets.load_files(container_path, *, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0, allowed_extensions=None)

and you will be able to extract the text with :
    
    x.data

and the label with :
    
    x.target
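
To make the expected layout concrete, here is a pure-Python sketch that mimics the core of this behaviour on a tiny temporary dataset. It is not scikit-learn's implementation (`load_files` also shuffles by default and returns a `Bunch` object), and the helper name `load_text_folders` and the two toy reviews are invented for illustration. It shows why the labels follow the alphabetical order of the folder names and why, with `encoding=None`, the texts come back as bytes.

```python
import tempfile
from pathlib import Path

def load_text_folders(container_path):
    """Read container/<category>/<file> into (data, target, target_names)."""
    root = Path(container_path)
    # Category names are the subfolder names, sorted alphabetically.
    target_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    data, target = [], []
    for label, name in enumerate(target_names):
        for path in sorted((root / name).iterdir()):
            data.append(path.read_bytes())  # raw bytes, not str
            target.append(label)
    return data, target, target_names

# Build a tiny hypothetical dataset in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for category, text in [("neg", "a dull film"), ("pos", "a great film")]:
        folder = Path(tmp) / category
        folder.mkdir()
        (folder / "review_1.txt").write_text(text, encoding="utf-8")
    data, target, target_names = load_text_folders(tmp)

print(target_names)             # ['neg', 'pos']
print(target)                   # [0, 1]
print(data[1].decode("utf-8"))  # a great film
```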





# Practice:



### Data import

In [42]:
data = load_files('txt_sentoken')

In [58]:
print(data.target_names)
print(np.unique(data.target))

['neg', 'pos']
[0 1]


So the labels are :


*   0 = neg
*   1 = pos



In [66]:
#see what data we have :

print("Example with the first review:",data.data[0])
print("Example with the first label:",data.target[0])

print("number of reviews:",len(data.data))
print("type of the data:",type(data.data[0]))

Example with the first review: b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in wi

### Data processing:

We will try two methods:


*   One based on the whole review.
*   One based on the verbs (love, etc.), the adverbs (beautifully, etc.) and the adjectives (beautiful, etc.), because grammatically these are the words most likely to carry information about a person's opinion.



In [94]:
df = pd.DataFrame({"raw_review": data.data, "type_of_review": data.target})
# .decode(): the reviews are bytes, so we convert them to strings
# .lower(): put everything in lowercase to avoid problems with SentiWordNet lookups on mixed-case words
df["review"] = df["raw_review"].apply(lambda x : word_tokenize(x.decode().lower()))
df["tagged_review"] = df["review"].apply(lambda x : pos_tag(x))  # POS-tag each tokenized review
# Adjectives, verbs and adverbs are the grammatical categories that carry the most information about opinion, so the second method keeps only these words.
df["vb_adv_adj"] = df["tagged_review"].apply(lambda list_tags : [x[0] for x in list_tags if x[1] in ["RB","JJ","VB","JJS","JJR","RBR","RBS","VBD","VBZ","VBP","VBN","VBG"]])
df

Unnamed: 0,raw_review,type_of_review,review,tagged_review,vb_adv_adj,swn_on_vb_adv_adj,pos_on_vb_adv_adj,neg_on_vb_adv_adj,type_of_review_vb_adv_adj,swn_on_review,pos_on_review,neg_on_review,type_of_review_all
0,"b""arnold schwarzenegger has been an icon for a...",0,"[arnold, schwarzenegger, has, been, an, icon, ...","[(arnold, RB), (schwarzenegger, NN), (has, VBZ...","[arnold, has, been, late, lately, have, been, ...","[[<arnold.n.01: PosScore=0.0 NegScore=0.0>, <a...",0.307907,0.293742,1,"[[<arnold.n.01: PosScore=0.0 NegScore=0.0>, <a...",0.256784,0.252229,1
1,"b""good films are hard to find these days . \ng...",1,"[good, films, are, hard, to, find, these, days...","[(good, JJ), (films, NNS), (are, VBP), (hard, ...","[good, are, hard, find, great, are, rare, russ...","[[<good.n.01: PosScore=0.5 NegScore=0.0>, <goo...",0.342423,0.224182,1,"[[<good.n.01: PosScore=0.5 NegScore=0.0>, <goo...",0.285653,0.209273,1
2,"b""quaid stars as a man who has taken up the pr...",1,"[quaid, stars, as, a, man, who, has, taken, up...","[(quaid, JJ), (stars, NNS), (as, IN), (a, DT),...","[quaid, has, taken, feels, is, betrayed, early...","[[], [<early.a.01: PosScore=0.0 NegScore=0.0>,...",0.364011,0.263736,1,"[[], [<star.n.01: PosScore=0.0 NegScore=0.0>, ...",0.317742,0.220161,1
3,b'we could paraphrase michelle pfieffer\'s cha...,0,"[we, could, paraphrase, michelle, pfieffer, 's...","[(we, PRP), (could, MD), (paraphrase, VB), (mi...","[paraphrase, dangerous, say, 's, fair, enough,...",[[<paraphrase.n.01: PosScore=0.0 NegScore=0.0>...,0.279494,0.300562,0,"[[], [], [<paraphrase.n.01: PosScore=0.0 NegSc...",0.257812,0.257812,1
4,"b""kolya is one of the richest films i've seen ...",1,"[kolya, is, one, of, the, richest, films, i, '...","[(kolya, NN), (is, VBZ), (one, CD), (of, IN), ...","[is, richest, i, 've, seen, plays, confirmed, ...","[[<confirm.v.01: PosScore=0.0 NegScore=0.0>, <...",0.343750,0.302083,1,"[[], [<be.v.01: PosScore=0.25 NegScore=0.125>,...",0.360294,0.210784,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,"b'under any other circumstances , i would not ...",0,"[under, any, other, circumstances, ,, i, would...","[(under, IN), (any, DT), (other, JJ), (circums...","[other, not, be, discussing, ending, particula...","[[<other.a.01: PosScore=0.0 NegScore=0.625>, <...",0.230924,0.304719,0,"[[<nether.s.03: PosScore=0.0 NegScore=0.0>, <u...",0.232176,0.251173,0
1996,b'bruce barth\'s mellow piano plays in the bac...,1,"[bruce, barth, 's, mellow, piano, plays, in, t...","[(bruce, VB), (barth, NN), ('s, POS), (mellow,...","[bruce, mellow, little, fonda, golden, has, so...","[[<bruce.n.01: PosScore=0.0 NegScore=0.0>, <br...",0.320833,0.201389,1,"[[<bruce.n.01: PosScore=0.0 NegScore=0.0>, <br...",0.299216,0.160279,1
1997,"b' "" a man is not a man without eight taels of...",1,"[``, a, man, is, not, a, man, without, eight, ...","[(``, ``), (a, DT), (man, NN), (is, VBZ), (not...","[is, not, starring, directed, written, alex, w...","[[<not.r.01: PosScore=0.0 NegScore=0.625>], []...",0.219027,0.373894,0,"[[], [<angstrom.n.01: PosScore=0.0 NegScore=0....",0.247831,0.260575,0
1998,"b""this is a film that i was inclined to like a...",0,"[this, is, a, film, that, i, was, inclined, to...","[(this, DT), (is, VBZ), (a, DT), (film, NN), (...","[is, was, inclined, like, main, had, been, inv...","[[<like.n.01: PosScore=0.125 NegScore=0.0>, <l...",0.253333,0.300000,0,"[[], [<be.v.01: PosScore=0.25 NegScore=0.125>,...",0.269293,0.250402,1


Here we compute the sentiment of each review from the selected verbs, adverbs and adjectives.

To compare performance, we also apply the same computation to the whole, unfiltered review.

The positive and negative scores are averaged over the words of the text to decide whether it is, overall, rather positive or negative. Each word is weighted in this average by its degree of subjectivity, computed as `1 - x.obj_score()`.
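
The weighting and the decision rule can be sketched in pure Python as follows. This is an illustration, not the notebook's code: the helper `weighted_mean` is hypothetical, and returning 0.0 when all weights are zero is an assumed convention (`np.average` raises `ZeroDivisionError` in that case). The three (pos, neg) pairs are scores that appear elsewhere in this notebook; combining them into one "review" is purely illustrative.

```python
def weighted_mean(values, weights):
    """Weighted average; returns 0.0 when all weights are zero (assumed convention)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(v * w for v, w in zip(values, weights)) / total

# (pos, neg) scores of be.v.01, other.a.01 and fantastic.s.02
word_scores = [(0.25, 0.125), (0.0, 0.625), (0.75, 0.0)]
# Subjectivity of each word: pos + neg, which equals 1 - obj_score
weights = [pos + neg for pos, neg in word_scores]

pos = weighted_mean([p for p, _ in word_scores], weights)
neg = weighted_mean([n for _, n in word_scores], weights)
label = 0 if neg > pos else 1  # a tie is labelled positive, as in the notebook's rule
print(pos, neg, label)  # 0.375 0.25 1
```

Because the weight of a word is its own subjectivity, strongly polarized words dominate the average while fully objective words (weight 0) are ignored.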

In [95]:
def lookup_synsets(words):
    # For each word, list all its SentiWordNet synsets (no part-of-speech filter)
    return [list(swn.senti_synsets(word)) for word in words]

def weighted_score(list_swn, score_name):
    # Keep the first synset of every word known to SentiWordNet
    firsts = [synsets[0] for synsets in list_swn if synsets != []]
    scores = [getattr(s, score_name)() for s in firsts]  # score_name is "pos_score" or "neg_score"
    weights = [1 - s.obj_score() for s in firsts]  # subjectivity of each word
    return np.average(scores, weights=weights)

# Method based on verbs, adverbs and adjectives only
df["swn_on_vb_adv_adj"] = df["vb_adv_adj"].apply(lookup_synsets)
df["pos_on_vb_adv_adj"] = df["swn_on_vb_adv_adj"].apply(lambda s : weighted_score(s, "pos_score"))
df["neg_on_vb_adv_adj"] = df["swn_on_vb_adv_adj"].apply(lambda s : weighted_score(s, "neg_score"))
df["type_of_review_vb_adv_adj"] = np.where(df["neg_on_vb_adv_adj"] > df["pos_on_vb_adv_adj"], 0, 1)

# Method based on the whole review
df["swn_on_review"] = df["review"].apply(lookup_synsets)
df["pos_on_review"] = df["swn_on_review"].apply(lambda s : weighted_score(s, "pos_score"))
df["neg_on_review"] = df["swn_on_review"].apply(lambda s : weighted_score(s, "neg_score"))
df["type_of_review_all"] = np.where(df["neg_on_review"] > df["pos_on_review"], 0, 1)

df

Unnamed: 0,raw_review,type_of_review,review,tagged_review,vb_adv_adj,swn_on_vb_adv_adj,pos_on_vb_adv_adj,neg_on_vb_adv_adj,type_of_review_vb_adv_adj,swn_on_review,pos_on_review,neg_on_review,type_of_review_all
0,"b""arnold schwarzenegger has been an icon for a...",0,"[arnold, schwarzenegger, has, been, an, icon, ...","[(arnold, RB), (schwarzenegger, NN), (has, VBZ...","[arnold, has, been, late, lately, have, been, ...","[[<arnold.n.01: PosScore=0.0 NegScore=0.0>, <a...",0.289770,0.259099,1,"[[<arnold.n.01: PosScore=0.0 NegScore=0.0>, <a...",0.256784,0.252229,1
1,"b""good films are hard to find these days . \ng...",1,"[good, films, are, hard, to, find, these, days...","[(good, JJ), (films, NNS), (are, VBP), (hard, ...","[good, are, hard, find, great, are, rare, russ...","[[<good.n.01: PosScore=0.5 NegScore=0.0>, <goo...",0.332169,0.189005,1,"[[<good.n.01: PosScore=0.5 NegScore=0.0>, <goo...",0.285653,0.209273,1
2,"b""quaid stars as a man who has taken up the pr...",1,"[quaid, stars, as, a, man, who, has, taken, up...","[(quaid, JJ), (stars, NNS), (as, IN), (a, DT),...","[quaid, has, taken, feels, is, betrayed, early...","[[], [<hour_angle.n.02: PosScore=0.0 NegScore=...",0.332061,0.216603,1,"[[], [<star.n.01: PosScore=0.0 NegScore=0.0>, ...",0.317742,0.220161,1
3,b'we could paraphrase michelle pfieffer\'s cha...,0,"[we, could, paraphrase, michelle, pfieffer, 's...","[(we, PRP), (could, MD), (paraphrase, VB), (mi...","[paraphrase, dangerous, say, 's, fair, enough,...",[[<paraphrase.n.01: PosScore=0.0 NegScore=0.0>...,0.255000,0.290000,0,"[[], [], [<paraphrase.n.01: PosScore=0.0 NegSc...",0.257812,0.257812,1
4,"b""kolya is one of the richest films i've seen ...",1,"[kolya, is, one, of, the, richest, films, i, '...","[(kolya, NN), (is, VBZ), (one, CD), (of, IN), ...","[is, richest, i, 've, seen, plays, confirmed, ...","[[<be.v.01: PosScore=0.25 NegScore=0.125>, <be...",0.381098,0.201220,1,"[[], [<be.v.01: PosScore=0.25 NegScore=0.125>,...",0.360294,0.210784,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,"b'under any other circumstances , i would not ...",0,"[under, any, other, circumstances, ,, i, would...","[(under, IN), (any, DT), (other, JJ), (circums...","[other, not, be, discussing, ending, particula...","[[<other.a.01: PosScore=0.0 NegScore=0.625>, <...",0.239266,0.250693,0,"[[<nether.s.03: PosScore=0.0 NegScore=0.0>, <u...",0.232176,0.251173,0
1996,b'bruce barth\'s mellow piano plays in the bac...,1,"[bruce, barth, 's, mellow, piano, plays, in, t...","[(bruce, VB), (barth, NN), ('s, POS), (mellow,...","[bruce, mellow, little, fonda, golden, has, so...","[[<bruce.n.01: PosScore=0.0 NegScore=0.0>, <br...",0.301786,0.166071,1,"[[<bruce.n.01: PosScore=0.0 NegScore=0.0>, <br...",0.299216,0.160279,1
1997,"b' "" a man is not a man without eight taels of...",1,"[``, a, man, is, not, a, man, without, eight, ...","[(``, ``), (a, DT), (man, NN), (is, VBZ), (not...","[is, not, starring, directed, written, alex, w...","[[<be.v.01: PosScore=0.25 NegScore=0.125>, <be...",0.246284,0.278716,0,"[[], [<angstrom.n.01: PosScore=0.0 NegScore=0....",0.247831,0.260575,0
1998,"b""this is a film that i was inclined to like a...",0,"[this, is, a, film, that, i, was, inclined, to...","[(this, DT), (is, VBZ), (a, DT), (film, NN), (...","[is, was, inclined, like, main, had, been, inv...","[[<be.v.01: PosScore=0.25 NegScore=0.125>, <be...",0.284751,0.217842,1,"[[], [<be.v.01: PosScore=0.25 NegScore=0.125>,...",0.269293,0.250402,1


Let's check the accuracy of both methods.

In [93]:
acc_vb_adv_adj = accuracy_score(df["type_of_review"],df["type_of_review_vb_adv_adj"])
acc_all = accuracy_score(df["type_of_review"],df["type_of_review_all"])  # renamed to avoid shadowing the built-in all()
print(f'accuracy with sentiwordnet on verbs, adverbs and adjectives :{acc_vb_adv_adj*100}%' )
print(f'accuracy with sentiwordnet on the whole review :{acc_all*100}%' )

accuracy with sentiwordnet on verbs, adverbs and adjectives :63.4%
accuracy with sentiwordnet on the whole review :64.4%
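
As a reminder of what this metric measures, here is a minimal pure-Python sketch of accuracy (the fraction of predictions equal to the true labels). The helper `accuracy` is hypothetical and only mirrors the basic behaviour of `accuracy_score`. The five labels and predictions are those of the first five rows of the table above (columns `type_of_review` and `type_of_review_all`).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [0, 1, 1, 0, 1]  # type_of_review, rows 0 to 4
y_pred = [1, 1, 1, 1, 1]  # type_of_review_all, rows 0 to 4
print(accuracy(y_true, y_pred))  # 0.6

# Counting (true, predicted) pairs shows where the errors are.
confusion = Counter(zip(y_true, y_pred))
print(confusion[(0, 1)])  # 2 negative reviews predicted as positive
```

On these five rows both errors are negative reviews predicted as positive; counting the pairs over the whole DataFrame would show whether that pattern holds in general.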


Working with the whole review slightly improves our results (64.4% versus 63.4%), but the difference is small. The filtered method probably loses some useful information by ignoring the other words.

**64.4% accuracy is quite a good score for such a simple method: the dataset is balanced between the two classes, so always predicting the same label would give 50%.**

This method could be improved with machine learning (classical models or neural networks), trained in a supervised way either as a classification (with the neg and pos classes) or as a regression (predicting a value between 0 and 1, where values close to 0 mean neg and values close to 1 mean pos). We will see this in another notebook.