## Part-of-Speech Tagging for Sentiment Analysis

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

https://www.nltk.org/_modules/nltk/corpus/reader/sentiwordnet.html

### Tagger

Part-of-speech tagging is the process of converting a sentence, in the form of a list of words,
into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech
tag, and signifies whether the word is a noun, adjective, verb, and so on.

Most taggers are trainable. They take a list of tagged sentences as training data, such as
the one returned by the tagged_sents() method of a TaggedCorpusReader. From these training
sentences, the tagger builds an internal model that tells it how to tag a word. Other taggers
use external data sources or match word patterns to choose a tag for a word.


All taggers in NLTK are in the nltk.tag package. Many taggers can also be combined into a backoff
chain, so that if one tagger cannot tag a word, the next tagger is used, and so on.
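
A backoff chain can be sketched as follows. The tiny hand-written training set and the choice of 'NN' as the default tag are assumptions made for illustration; in practice the training sentences would come from a tagged corpus.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Toy training data (hypothetical): a single hand-tagged sentence.
train_sents = [[("the", "DT"), ("board", "NN"), ("will", "MD"), ("join", "VB")]]

# Last resort in the chain: tag every word as a noun.
default_tagger = DefaultTagger("NN")

# The unigram tagger answers first and defers to default_tagger for unseen words.
backoff_tagger = UnigramTagger(train_sents, backoff=default_tagger)

tagged = backoff_tagger.tag(["the", "crew", "will", "join"])
print(tagged)
```

Here 'crew' never appears in the training data, so the unigram tagger hands it to the default tagger instead of returning None.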

In [2]:
import nltk
from nltk.tag import UnigramTagger
from nltk.corpus import treebank

In [7]:
treebank.sents()

[['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.'], ['Mr.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N.V.', ',', 'the', 'Dutch', 'publishing', 'group', '.'], ...]

In [33]:
train_sents = treebank.tagged_sents()[:3000]

In [34]:
tagger = UnigramTagger(train_sents)

In [35]:
tagger.tag(treebank.sents()[0])

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

### Sentiment Analysis

In [8]:
from nltk.corpus import sentiwordnet as swn

In [13]:
nltk.download('sentiwordnet')

[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\marya\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\sentiwordnet.zip.


True

In [14]:
list(swn.senti_synsets('slow'))

[SentiSynset('decelerate.v.01'),
 SentiSynset('slow.v.02'),
 SentiSynset('slow.v.03'),
 SentiSynset('slow.a.01'),
 SentiSynset('slow.a.02'),
 SentiSynset('dense.s.04'),
 SentiSynset('slow.a.04'),
 SentiSynset('boring.s.01'),
 SentiSynset('dull.s.08'),
 SentiSynset('slowly.r.01'),
 SentiSynset('behind.r.03')]
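
The list above mixes verb (v), adjective (a, s), and adverb (r) senses of the same word. One possible way to narrow the lookup is to translate the tagger's Penn Treebank tag into a WordNet part-of-speech letter first. The helper below is a minimal sketch; the name penn_to_wordnet_pos is hypothetical and the mapping only covers the four open word classes.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    if tag is None:
        return None  # the tagger could not tag this word
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return None


print(penn_to_wordnet_pos("RB"), penn_to_wordnet_pos("JJ"), penn_to_wordnet_pos("DT"))
```

The resulting letter can be passed as the optional second argument of senti_synsets, for example swn.senti_synsets('slow', 'r'), which should limit the results to adverb senses.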

In [205]:
not_synsets = swn.senti_synsets('not')

In [206]:
# Keep only the first (most common) sense of 'not'
first_synset = list(not_synsets)[0]

In [207]:
print(first_synset.pos_score())
print(first_synset.neg_score())
print(list(swn.senti_synsets('not'))[0].obj_score())

0.0
0.625
0.375


### Read File to test

In [27]:
def read_file(filename, category):
    # category is the corpus sub-folder, e.g. 'neg'
    with open(f"txt_sentoken/{category}/{filename}", "r") as f:
        # One list entry per line of the review
        article = f.readlines()
    return article

In [22]:
filename = "cv001_19502.txt"

In [28]:
text = read_file(filename, 'neg')

In [112]:
len(text)

13

In [132]:
print(text[5])

we don't know why the crew was really out in the middle of nowhere , we don't know the origin of what took over the ship ( just that a big pink flashy thing hit the mir ) , and , of course , we don't know why donald sutherland is stumbling around drunkenly throughout . 



### Tokenize

In [133]:
text_tokenized = nltk.word_tokenize(text[5])
text_tokenized

['we',
 'do',
 "n't",
 'know',
 'why',
 'the',
 'crew',
 'was',
 'really',
 'out',
 'in',
 'the',
 'middle',
 'of',
 'nowhere',
 ',',
 'we',
 'do',
 "n't",
 'know',
 'the',
 'origin',
 'of',
 'what',
 'took',
 'over',
 'the',
 'ship',
 '(',
 'just',
 'that',
 'a',
 'big',
 'pink',
 'flashy',
 'thing',
 'hit',
 'the',
 'mir',
 ')',
 ',',
 'and',
 ',',
 'of',
 'course',
 ',',
 'we',
 'do',
 "n't",
 'know',
 'why',
 'donald',
 'sutherland',
 'is',
 'stumbling',
 'around',
 'drunkenly',
 'throughout',
 '.']

In [43]:
# text_tokenized = nltk.word_tokenize(text[2])
# text_tokenized

In [63]:
from replacers import RegexpReplacer

In [64]:
replacer = RegexpReplacer()

In [134]:
tokenized_text = nltk.word_tokenize(replacer.replace(text[5]))
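
The replacers module is a local file that is not shown in this notebook. Judging from the tagged output, where "don't" becomes "do not", it expands contractions before tokenization. A minimal sketch of such a class is given below; the name ContractionReplacer and the pattern list are assumptions, and the actual RegexpReplacer may differ.

```python
import re

# Assumed contraction patterns; the real replacers module may use another list.
REPLACEMENT_PATTERNS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'re", r"\g<1> are"),
]


class ContractionReplacer:
    """Hypothetical stand-in for RegexpReplacer: expands English contractions."""

    def __init__(self, patterns=REPLACEMENT_PATTERNS):
        # Compile once so replace() can be called on many sentences
        self.patterns = [(re.compile(regex), repl) for regex, repl in patterns]

    def replace(self, text):
        # Apply every pattern in order; specific cases come before generic ones
        for pattern, repl in self.patterns:
            text = pattern.sub(repl, text)
        return text


expanded = ContractionReplacer().replace("we don't know why")
print(expanded)
```

Expanding contractions first means the tokenizer emits 'not' rather than "n't", which the tagger and SentiWordNet can both look up.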

#### Tag the words with the tagger, then filter them by tag

In [135]:
tagged_text = tagger.tag(tokenized_text)
tagged_text

[('we', 'PRP'),
 ('do', 'VBP'),
 ('not', 'RB'),
 ('know', 'VB'),
 ('why', 'WRB'),
 ('the', 'DT'),
 ('crew', None),
 ('was', 'VBD'),
 ('really', 'RB'),
 ('out', 'RP'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('middle', 'NN'),
 ('of', 'IN'),
 ('nowhere', 'RB'),
 (',', ','),
 ('we', 'PRP'),
 ('do', 'VBP'),
 ('not', 'RB'),
 ('know', 'VB'),
 ('the', 'DT'),
 ('origin', None),
 ('of', 'IN'),
 ('what', 'WP'),
 ('took', 'VBD'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('ship', None),
 ('(', None),
 ('just', 'RB'),
 ('that', 'IN'),
 ('a', 'DT'),
 ('big', 'JJ'),
 ('pink', None),
 ('flashy', 'JJ'),
 ('thing', 'NN'),
 ('hit', 'VBN'),
 ('the', 'DT'),
 ('mir', None),
 (')', None),
 (',', ','),
 ('and', 'CC'),
 (',', ','),
 ('of', 'IN'),
 ('course', 'NN'),
 (',', ','),
 ('we', 'PRP'),
 ('do', 'VBP'),
 ('not', 'RB'),
 ('know', 'VB'),
 ('why', 'WRB'),
 ('donald', None),
 ('sutherland', None),
 ('is', 'VBZ'),
 ('stumbling', None),
 ('around', 'IN'),
 ('drunkenly', None),
 ('throughout', 'IN'),
 ('.', '.')]

In [120]:
def filter_tag(token):
    # Keep adverbs only (RB, RBR, RBS); untagged tokens (None) are dropped
    return token[1] in ("RB", "RBR", "RBS")

In [136]:
filtered_text = filter(filter_tag, tagged_text)

In [137]:
filtered_tokens = [x[0] for x in filter(filter_tag, tagged_text)]

In [138]:
filtered_tokens

['not', 'really', 'nowhere', 'not', 'just', 'not']

#### Use SentiWordNet to get the negativity, positivity, and objectivity scores

In [204]:
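# senti_synsets() returns a one-shot iterator, not a list
sentiment = swn.senti_synsets('damn')

In [189]:
len(list(sentiment))

5

In [190]:
# The iterator was consumed by the previous cell, so nothing is left
list(sentiment)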

[]
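
The empty list above is what a one-shot iterator produces once it has been consumed. The sketch below reproduces the behavior with the built-in filter() on plain numbers (made-up data, no SentiWordNet lookup) and shows the usual remedy of materializing the results once.

```python
# filter() returns an iterator that can be walked only once
numbers = filter(None, [1, 2, 3, 4, 5])
first_pass = list(numbers)   # consumes the iterator
second_pass = list(numbers)  # nothing left to read
print(len(first_pass), second_pass)

# Remedy: convert to a list once, then reuse the list freely
numbers = list(filter(None, [1, 2, 3, 4, 5]))
print(len(numbers), numbers[0])
```

Applied to this notebook, that would mean writing sentiment = list(swn.senti_synsets('damn')) before calling len() or indexing.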

#### Structure of the data passed as input to the sentiment analysis

A list of lists containing the tokenized sentences, filtered by tag. The model will compute an average and take the best score.

In [242]:
def calculate_total_score(tuple_list):
    # Each tuple is (token, pos, neg, obj); keep the pos and neg columns
    _, pos, neg, _ = zip(*tuple_list)

    return f"positive : {sum(pos)}, negative : {sum(neg)}"

In [243]:
def get_sentiment(filename):
    # Note: the 'neg' folder is hard-coded here
    text = read_file(filename, 'neg')

    # Collect the adverbs of every sentence in the review
    tokens = []
    for sentence in text:
        tokenized_text = nltk.word_tokenize(replacer.replace(sentence))
        tagged_text = tagger.tag(tokenized_text)
        tokens.extend(x[0] for x in filter(filter_tag, tagged_text))

    print(tokens)

    # Score each token with its first SentiWordNet synset
    token_score = []
    for token in tokens:
        synsets = list(swn.senti_synsets(token))
        if not synsets:
            # Skip tokens that SentiWordNet does not know
            continue
        first = synsets[0]
        token_score.append((token, first.pos_score(), first.neg_score(), first.obj_score()))

    print(token_score)
    score = calculate_total_score(token_score)

    return score

In [244]:
get_sentiment(filename)

['damn', 'back', 'here', 'still', 'very', 'not', 'really', 'nowhere', 'not', 'just', 'not', 'here', 'just', 'even', 'well', 'here', 'so', 'really', 'here', 'pretty', 'much']
[('damn', 0.125, 0.125, 0.75), ('back', 0.0, 0.0, 1.0), ('here', 0.0, 0.0, 1.0), ('still', 0.0, 0.0, 1.0), ('very', 0.5, 0.0, 0.5), ('not', 0.0, 0.625, 0.375), ('really', 0.625, 0.0, 0.375), ('nowhere', 0.0, 0.25, 0.75), ('not', 0.0, 0.625, 0.375), ('just', 0.625, 0.0, 0.375), ('not', 0.0, 0.625, 0.375), ('here', 0.0, 0.0, 1.0), ('just', 0.625, 0.0, 0.375), ('even', 0.0, 0.0, 1.0), ('well', 0.0, 0.0, 1.0), ('here', 0.0, 0.0, 1.0), ('so', 0.0, 0.0, 1.0), ('really', 0.625, 0.0, 0.375), ('here', 0.0, 0.0, 1.0), ('pretty', 0.875, 0.125, 0.0), ('much', 0.125, 0.125, 0.75)]


'positive : 4.125, negative : 2.5'
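
The raw sums above grow with the number of tokens, so a longer review gets larger totals regardless of its tone. One possible next step, in line with the averaging idea described earlier, is to average the scores and pick the larger one. The sketch below is an assumption about how that could look; the function names and the 'neutral' fallback are hypothetical, and the sample tuples are copied from the output above.

```python
def average_scores(token_scores):
    """Average the pos and neg columns of (token, pos, neg, obj) tuples."""
    if not token_scores:
        return 0.0, 0.0
    count = len(token_scores)
    avg_pos = sum(score[1] for score in token_scores) / count
    avg_neg = sum(score[2] for score in token_scores) / count
    return avg_pos, avg_neg


def label_from_scores(token_scores):
    """Return the label with the higher average; ties and empty input are 'neutral'."""
    avg_pos, avg_neg = average_scores(token_scores)
    if avg_pos > avg_neg:
        return "positive"
    if avg_neg > avg_pos:
        return "negative"
    return "neutral"


# Three of the tuples printed by get_sentiment above
sample = [("very", 0.5, 0.0, 0.5), ("not", 0.0, 0.625, 0.375), ("really", 0.625, 0.0, 0.375)]
avg_pos, avg_neg = average_scores(sample)
label = label_from_scores(sample)
print(avg_pos, avg_neg, label)
```

Note that this review comes from the 'neg' folder yet scores as positive, which suggests that scoring adverbs in isolation does not capture negation such as "not really".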