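# 1. Understanding Unsupervised Sentiment Analysis

> Unsupervised sentiment analysis is built on a sentiment dictionary called a lexicon. A lexicon assigns each word a sentiment score, a number that expresses how positive or negative the word is. That score is determined with reference to the word's position, its neighboring words, the context, and its part of speech (POS).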

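## 1.1 WordNet

> WordNet is a lexical database that provides semantic information. Semantics here means the meaning a word takes on in context, not just its dictionary definition. A word rarely has a single meaning, and its meaning shifts with the situation: 'present' can mean a gift, but in 'present day' it means the current time.

> WordNet supplies this kind of semantic information for words that are used differently in different situations. To do so, it represents individual words, grouped by part of speech (noun, verb, adjective, adverb), using synsets (sets of cognitive synonyms).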

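## 1.2 NLTK Sentiment Lexicon Packages

- SentiWordNet : a package that assigns positivity, negativity, and objectivity scores to each synset
- VADER : a package aimed mainly at sentiment analysis of social media text; it performs well and runs fast, so it is often used on large volumes of text
- Pattern : a separate library that attracts the most attention for predictive performance, but at the time of writing it ran only on Python 2.x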

# 2. SentiWordNet

## 2.1 Understanding synsets

> Synsets are the core concept in WordNet: each one carries the contextual, semantic information of a word.

### 2.1.1 Downloading the package

In [1]:
import nltk

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\kys05\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\kys05\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\kys05\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\kys05\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers\averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\kys05\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       tagge

True

### 2.1.2 How synsets work

In [3]:
from nltk.corpus import wordnet as wn

word = 'present'

synsets = wn.synsets(word)

print('synsets return type :', type(synsets))
print('number of synsets returned :', len(synsets))
print('synsets returned')
synsets

synsets return type : <class 'list'>
number of synsets returned : 18
synsets returned


[Synset('present.n.01'),
 Synset('present.n.02'),
 Synset('present.n.03'),
 Synset('show.v.01'),
 Synset('present.v.02'),
 Synset('stage.v.01'),
 Synset('present.v.04'),
 Synset('present.v.05'),
 Synset('award.v.01'),
 Synset('give.v.08'),
 Synset('deliver.v.01'),
 Synset('introduce.v.01'),
 Synset('portray.v.04'),
 Synset('confront.v.03'),
 Synset('present.v.12'),
 Synset('salute.v.06'),
 Synset('present.a.01'),
 Synset('present.a.02')]

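> The call returns a list of 18 synsets, one per sense of the word. In a name such as present.n.01, the three fields are the representative word (lemma), the part of speech, and the sense number. n means noun, v means verb, and a means adjective (s, a satellite adjective, and r, an adverb, appear in later outputs).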
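
The naming scheme described above can be taken apart with plain string handling. The following is a minimal sketch, not part of NLTK: `SynsetName`, `POS_LABELS`, and `parse_synset_name` are hypothetical helper names introduced only for illustration. It splits from the right so that a lemma containing dots would stay in one piece.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    """The three dot-separated fields of a WordNet synset name."""
    lemma: str
    pos: str
    sense: int


# POS codes that appear in the synset names printed in this notebook.
POS_LABELS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}


def parse_synset_name(name: str) -> SynsetName:
    """Split 'lemma.pos.nn' from the right into its three fields."""
    lemma, pos, sense = name.rsplit(".", 2)
    return SynsetName(lemma, pos, int(sense))


parsed = parse_synset_name("present.n.01")
print(parsed.lemma, POS_LABELS[parsed.pos], parsed.sense)
```

For a real Synset object, `synset.name()` supplies the string to parse.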

In [4]:
for i, synset in enumerate(synsets):
    print('###### Synset name : ', synset.name(), '######')
    print('POS : ', synset.lexname())
    print('Definition :', synset.definition())
    print('Lemmas :', synset.lemma_names())
    print("\n")

    if i == 1:
        break

###### Synset name :  present.n.01 ######
POS :  noun.time
Definition : the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas : ['present', 'nowadays']


###### Synset name :  present.n.02 ######
POS :  noun.possession
Definition : something presented as a gift
Lemmas : ['present']




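- POS (Part Of Speech) : printed here from synset.lexname(), so it appears in the form part-of-speech.category (for example, noun.time).
- Definition : the definition of this sense of the word
- Lemmas : the words (synonyms) that belong to this synset

> present.n.01 is the time-related sense: the period that is happening now, that is, the present. Its lemmas are 'present' and 'nowadays'.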

## 2.2 Similarity analysis

### 2.2.1 Creating the objects

In [5]:
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

entities = [tree, lion, tiger, cat, dog]

tree.definition()

'a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms'

### 2.2.2 Measuring similarity

In [6]:
tree.path_similarity(tiger)

0.07142857142857142

In [10]:
similarity = [round(tree.path_similarity(x),3) for x in entities]
similarity

[1.0, 0.071, 0.071, 0.077, 0.125]

In [11]:
similarities = []

for entity in entities:
    similarity = [round(entity.path_similarity(x),3) for x in entities]
    similarities.append(similarity)

similarities

[[1.0, 0.071, 0.071, 0.077, 0.125],
 [0.071, 1.0, 0.333, 0.25, 0.167],
 [0.071, 0.333, 1.0, 0.25, 0.167],
 [0.077, 0.25, 0.25, 1.0, 0.2],
 [0.125, 0.167, 0.167, 0.2, 1.0]]

### 2.2.3 Displaying the result

In [12]:
import pandas as pd

similarity_df = pd.DataFrame(similarities)
similarity_df

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.071 | 0.071 | 0.077 | 0.125 |
| 1 | 0.071 | 1.0 | 0.333 | 0.25 | 0.167 |
| 2 | 0.071 | 0.333 | 1.0 | 0.25 | 0.167 |
| 3 | 0.077 | 0.25 | 0.25 | 1.0 | 0.2 |
| 4 | 0.125 | 0.167 | 0.167 | 0.2 | 1.0 |


### 2.2.4 Adding row and column names to the result

In [13]:
tree.name()

'tree.n.01'

In [14]:
tree.name().split('.')

['tree', 'n', '01']

In [15]:
tree.name().split('.')[0]

'tree'

In [16]:
entity_name = [entity.name().split('.')[0] for entity in entities]
entity_name

['tree', 'lion', 'tiger', 'cat', 'dog']

In [18]:
similarity_df = pd.DataFrame(similarities, columns=entity_name, index=entity_name)
similarity_df

|   | tree | lion | tiger | cat | dog |
|---|---|---|---|---|---|
| tree | 1.0 | 0.071 | 0.071 | 0.077 | 0.125 |
| lion | 0.071 | 1.0 | 0.333 | 0.25 | 0.167 |
| tiger | 0.071 | 0.333 | 1.0 | 0.25 | 0.167 |
| cat | 0.077 | 0.25 | 0.25 | 1.0 | 0.2 |
| dog | 0.125 | 0.167 | 0.167 | 0.2 | 1.0 |


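> The most similar pair is lion and tiger (0.333), followed by cat paired with lion or tiger (0.25). path_similarity derives these values from the shortest path between two synsets in WordNet's hypernym/hyponym hierarchy, not from the text returned by definition().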
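
NLTK's `path_similarity` scores a pair of synsets as 1 / (d + 1), where d is the number of edges on the shortest path between them in the hypernym/hyponym hierarchy. The sketch below only reproduces that arithmetic for the values in the table; the two function names are hypothetical helpers, and the distances are recovered from the printed scores rather than looked up in WordNet.

```python
def path_similarity_from_distance(distance: int) -> float:
    """Similarity for two synsets that are `distance` edges apart."""
    return 1.0 / (distance + 1)


def distance_from_path_similarity(similarity: float) -> int:
    """Invert the formula; rounding absorbs scores printed to 3 decimals."""
    return round(1.0 / similarity) - 1


# Scores copied from the similarity table in this section.
observed = {
    ("lion", "tiger"): 0.333,
    ("cat", "lion"): 0.25,
    ("cat", "dog"): 0.2,
    ("tree", "dog"): 0.125,
    ("tree", "tiger"): 0.07142857142857142,
}

for pair, score in observed.items():
    edges = distance_from_path_similarity(score)
    print(pair, "edges apart:", edges,
          "recomputed:", round(path_similarity_from_distance(edges), 3))
```

Read this way, lion and tiger sit two edges apart (they share a direct parent), while tree and tiger are thirteen edges apart.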

## 2.3 SentiWordNet

### 2.3.1 Basic information

In [20]:
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets return type :', type(senti_synsets))
print('number of senti_synsets returned :', len(senti_synsets))
print('senti_synsets returned')
senti_synsets

senti_synsets return type : <class 'list'>
number of senti_synsets returned : 11
senti_synsets returned


[SentiSynset('decelerate.v.01'),
 SentiSynset('slow.v.02'),
 SentiSynset('slow.v.03'),
 SentiSynset('slow.a.01'),
 SentiSynset('slow.a.02'),
 SentiSynset('dense.s.04'),
 SentiSynset('slow.a.04'),
 SentiSynset('boring.s.01'),
 SentiSynset('dull.s.08'),
 SentiSynset('slowly.r.01'),
 SentiSynset('behind.r.03')]

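> The result is structured like the wn.synsets() output: a list with one SentiSynset per sense of 'slow', named the same way as the WordNet synsets.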

### 2.3.2 Sentiment scores

In [26]:
father = swn.senti_synset('father.n.01')

print('father positive score :', father.pos_score())
print('father negative score :', father.neg_score())
print('father objectivity score :', father.obj_score())

father positive score : 0.0
father negative score : 0.0
father objectivity score : 1.0


In [29]:
fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score :', fabulous.pos_score())
print('fabulous negative score :', fabulous.neg_score())
print('fabulous objectivity score :', fabulous.obj_score())

fabulous positive score : 0.875
fabulous negative score : 0.125
fabulous objectivity score : 0.0


In [30]:
mother = swn.senti_synset('mother.n.01')

print('mother positive score :', mother.pos_score())
print('mother negative score :', mother.neg_score())
print('mother objectivity score :', mother.obj_score())

mother positive score : 0.0
mother negative score : 0.0
mother objectivity score : 1.0


In [31]:
mom = swn.senti_synset('mom.n.01')

print('mom positive score :', mom.pos_score())
print('mom negative score :', mom.neg_score())
print('mom objectivity score :', mom.obj_score())

mom positive score : 0.875
mom negative score : 0.0
mom objectivity score : 0.125


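> The sentiment scores and the objectivity score are complementary: the positive, negative, and objectivity scores of a synset sum to 1, so when the positive and negative scores add up to 1 the objectivity score is 0.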
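
The relationship between the three scores can be checked against the outputs above. This is a small sketch using the printed values only; `objectivity` is a hypothetical helper, not a SentiWordNet function.

```python
def objectivity(pos_score: float, neg_score: float) -> float:
    """Objectivity is whatever remains after the positive and negative scores."""
    return 1.0 - (pos_score + neg_score)


# (positive, negative) pairs copied from the cell outputs in this section.
printed_scores = {
    "father.n.01": (0.0, 0.0),
    "fabulous.a.01": (0.875, 0.125),
    "mother.n.01": (0.0, 0.0),
    "mom.n.01": (0.875, 0.0),
}

for name, (pos, neg) in printed_scores.items():
    print(name, "objectivity :", objectivity(pos, neg))
```

Each recomputed value matches the `obj_score()` printed for the same synset.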

### 2.3.3 Analyzing the sentiment score function

In [32]:
from nltk import sent_tokenize

text = "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.";

raw_sentences = sent_tokenize(text)
raw_sentences

["With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again.",
 'Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent.',
 'Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released.',
 "Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring.",
 'Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit wh

In [33]:
len(raw_sentences)

15

In [34]:
from nltk import word_tokenize

raw_words = word_tokenize(raw_sentences[0])
raw_words

['With',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'MJ',
 'i',
 "'ve",
 'started',
 'listening',
 'to',
 'his',
 'music',
 ',',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 ',',
 'watched',
 'The',
 'Wiz',
 'and',
 'watched',
 'Moonwalker',
 'again',
 '.']

In [35]:
from nltk import pos_tag

tagged_sentences = pos_tag(raw_words)
tagged_sentences

[('With', 'IN'),
 ('all', 'PDT'),
 ('this', 'DT'),
 ('stuff', 'NN'),
 ('going', 'VBG'),
 ('down', 'RP'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('moment', 'NN'),
 ('with', 'IN'),
 ('MJ', 'NNP'),
 ('i', 'NN'),
 ("'ve", 'VBP'),
 ('started', 'VBN'),
 ('listening', 'VBG'),
 ('to', 'TO'),
 ('his', 'PRP$'),
 ('music', 'NN'),
 (',', ','),
 ('watching', 'VBG'),
 ('the', 'DT'),
 ('odd', 'JJ'),
 ('documentary', 'NN'),
 ('here', 'RB'),
 ('and', 'CC'),
 ('there', 'RB'),
 (',', ','),
 ('watched', 'VBD'),
 ('The', 'DT'),
 ('Wiz', 'NNP'),
 ('and', 'CC'),
 ('watched', 'VBD'),
 ('Moonwalker', 'NNP'),
 ('again', 'RB'),
 ('.', '.')]

In [36]:
for word, tag in tagged_sentences:
    if tag.startswith('V'):
        print(word, ':', tag)

going : VBG
've : VBP
started : VBN
listening : VBG
watching : VBG
watched : VBD
watched : VBD


In [37]:
def penn_to_wn(tag):
    from nltk.corpus import wordnet as wn

    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    
    return

In [38]:
for word, tag in tagged_sentences:
    wn_tag = penn_to_wn(tag)
    if wn_tag not in (wn.ADJ, wn.NOUN, wn.ADV, wn.VERB):
        continue
    print(word, ':', tag)

stuff : NN
going : VBG
down : RP
moment : NN
MJ : NNP
i : NN
've : VBP
started : VBN
listening : VBG
music : NN
watching : VBG
odd : JJ
documentary : NN
here : RB
there : RB
watched : VBD
Wiz : NNP
watched : VBD
Moonwalker : NNP
again : RB
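
One detail in the output above: 'down' is tagged RP (particle), yet it passes the filter because `penn_to_wn` only checks the first letter, and RP starts with 'R' just like the adverb tags RB, RBR, and RBS. A stricter variant could match on the two-letter prefix instead. The sketch below is an assumption about how one might tighten the mapping, not the notebook's method; `penn_to_wn_strict` is a hypothetical name, and the returned letters are the values of `wn.ADJ`, `wn.NOUN`, `wn.ADV`, and `wn.VERB`.

```python
from typing import Optional


def penn_to_wn_strict(tag: str) -> Optional[str]:
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"             # same value as wn.ADJ
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "n"             # same value as wn.NOUN
    if tag.startswith("RB"):   # RB, RBR, RBS, but not RP
        return "r"             # same value as wn.ADV
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"             # same value as wn.VERB
    return None


for tag in ["RB", "RP", "VBG", "NNP", "JJ", "IN"]:
    print(tag, "->", penn_to_wn_strict(tag))
```

Whether dropping particles helps or hurts the final score is an empirical question; the rest of this section keeps the original first-letter mapping.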


In [39]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for word, tag in tagged_sentences:
    wn_tag = penn_to_wn(tag)
    if wn_tag not in (wn.ADJ, wn.NOUN, wn.ADV, wn.VERB):
        continue

    lemma = lemmatizer.lemmatize(word, pos=wn_tag)

    if not lemma:
        continue 

    print(word, ':', tag, ':', lemma)

stuff : NN : stuff
going : VBG : go
down : RP : down
moment : NN : moment
MJ : NNP : MJ
i : NN : i
've : VBP : 've
started : VBN : start
listening : VBG : listen
music : NN : music
watching : VBG : watch
odd : JJ : odd
documentary : NN : documentary
here : RB : here
there : RB : there
watched : VBD : watch
Wiz : NNP : Wiz
watched : VBD : watch
Moonwalker : NNP : Moonwalker
again : RB : again


In [41]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for word, tag in tagged_sentences:
    wn_tag = penn_to_wn(tag)
    if wn_tag not in (wn.ADJ, wn.NOUN, wn.ADV, wn.VERB):
        continue

    lemma = lemmatizer.lemmatize(word, pos=wn_tag)

    if not lemma:
        continue 

    synsets = wn.synsets(lemma, pos=wn_tag)

    if not synsets:
        continue 

    print(word, ':', tag, ':', lemma, ':', synsets[0])

stuff : NN : stuff : Synset('material.n.01')
going : VBG : go : Synset('travel.v.01')
down : RP : down : Synset('down.r.01')
moment : NN : moment : Synset('moment.n.01')
i : NN : i : Synset('iodine.n.01')
started : VBN : start : Synset('get_down.v.07')
listening : VBG : listen : Synset('listen.v.01')
music : NN : music : Synset('music.n.01')
watching : VBG : watch : Synset('watch.v.01')
odd : JJ : odd : Synset('odd.a.01')
documentary : NN : documentary : Synset('documentary.n.01')
here : RB : here : Synset('here.r.01')
there : RB : there : Synset('there.r.01')
watched : VBD : watch : Synset('watch.v.01')
Wiz : NNP : Wiz : Synset('ace.n.03')
watched : VBD : watch : Synset('watch.v.01')
again : RB : again : Synset('again.r.01')


In [42]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for word, tag in tagged_sentences:
    wn_tag = penn_to_wn(tag)
    if wn_tag not in (wn.ADJ, wn.NOUN, wn.ADV, wn.VERB):
        continue

    lemma = lemmatizer.lemmatize(word, pos=wn_tag)

    if not lemma:
        continue 

    synsets = wn.synsets(lemma, pos=wn_tag)

    if not synsets:
        continue 
    
    synset = synsets[0]

    swn_score = swn.senti_synset(synset.name())

    print(swn_score, ':', swn_score.pos_score(), ':', swn_score.neg_score(), ':', swn_score.obj_score())

<material.n.01: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<travel.v.01: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<down.r.01: PosScore=0.0 NegScore=0.125> : 0.0 : 0.125 : 0.875
<moment.n.01: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<iodine.n.01: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<get_down.v.07: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<listen.v.01: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<music.n.01: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<watch.v.01: PosScore=0.125 NegScore=0.0> : 0.125 : 0.0 : 0.875
<odd.a.01: PosScore=0.5 NegScore=0.375> : 0.5 : 0.375 : 0.125
<documentary.n.01: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<here.r.01: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<there.r.01: PosScore=0.0 NegScore=0.0> : 0.0 : 0.0 : 1.0
<watch.v.01: PosScore=0.125 NegScore=0.0> : 0.125 : 0.0 : 0.875
<ace.n.03: PosScore=0.125 NegScore=0.0> : 0.125 : 0.0 : 0.875
<watch.v.01: PosScore=0.125 NegScore=0.0> : 0.125 : 0.0 : 0.875
<again.r.01: PosScore=0.0 

### 2.3.4 Implementing the sentiment score function

In [43]:
def swn_polarity(text):
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet as wn
    from nltk.corpus import sentiwordnet as swn
    from nltk import sent_tokenize, word_tokenize, pos_tag

    sentiment = 0.0
    tokens_count = 0

    # Lemmatizer used to reduce each word to its base form
    lemmatizer = WordNetLemmatizer()

    # Split the text into a list of sentences
    raw_sentences = sent_tokenize(text)

    for raw_sentence in raw_sentences:
        # Tokenize the sentence into words and tag each word with its POS
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))

        for word, tag in tagged_sentence:
            # Convert the Penn Treebank tag to a WordNet POS; skip anything else
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV, wn.VERB):
                continue

            # Lemmatize the word
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue

            # Look up the synsets for the lemma with the WordNet POS
            synsets = wn.synsets(lemma, pos=wn_tag)
            if not synsets:
                continue

            # Use only the first sense of the word
            synset = synsets[0]

            # Get the SentiWordNet entry for that synset
            swn_synset = swn.senti_synset(synset.name())

            # Accumulate positive minus negative; the objectivity score is not used
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())
            tokens_count += 1
    
    # No scorable token was found
    if not tokens_count:
        return 0
    
    # Return 1 (positive) when the total score is 0 or higher, otherwise 0 (negative)
    if sentiment >= 0:
        return 1
    
    return 0
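
The final aggregation step of `swn_polarity` can be isolated from the NLTK lookups. The sketch below is a simplified stand-in under that assumption: `polarity_from_scores` is a hypothetical helper that takes (positive, negative) pairs directly, and the sample pairs are four of the non-neutral scores printed for the first sentence in section 2.3.3.

```python
def polarity_from_scores(pos_neg_pairs):
    """Return 1 if the summed (pos - neg) is >= 0, else 0; 0 for no tokens."""
    if not pos_neg_pairs:
        return 0
    sentiment = sum(pos - neg for pos, neg in pos_neg_pairs)
    return 1 if sentiment >= 0 else 0


# down.r.01, watch.v.01, odd.a.01, ace.n.03 from the earlier output.
sample_pairs = [(0.0, 0.125), (0.125, 0.0), (0.5, 0.375), (0.125, 0.0)]

print(polarity_from_scores(sample_pairs))   # sum is 0.25, so positive
print(polarity_from_scores([(0.0, 0.0)]))   # sum is exactly 0, still positive
```

Because the comparison is `>= 0`, a text whose words are all scored as neutral is labeled positive, which is one reason the function tends to favor the positive class.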

### 2.3.5 Prediction

In [45]:
import pandas as pd

review_df = pd.read_csv('data/labeledTrainData.tsv', sep=r'\t', quoting=3, engine='python')
review_df.head(2)

|   | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |


In [46]:
review_df['preds'] = review_df['review'].apply(lambda x: swn_polarity(x))

y_target = review_df['sentiment'].values
preds = review_df['preds'].values

### 2.3.6 Evaluation metrics

In [47]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.metrics import f1_score, roc_auc_score

def get_clf_eval(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred) 
    f1 = f1_score(y_test, pred)

    print('Confusion matrix')
    print(confusion)
    print(f'Accuracy : {accuracy:.4f}, Precision : {precision:.4f}, Recall : {recall:.4f}, F1 : {f1:.4f}')

In [48]:
get_clf_eval(y_target, preds)

Confusion matrix
[[ 4061  8439]
 [ 1367 11133]]
Accuracy : 0.6078, Precision : 0.5688, Recall : 0.8906, F1 : 0.6943


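> With an accuracy of about 0.61, this falls well short of what supervised learning achieves. The high recall (0.89) paired with low precision (0.57) shows that the function labels most reviews as positive.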
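
The four metrics follow directly from the confusion matrix, which scikit-learn lays out as [[TN, FP], [FN, TP]] for binary labels 0 and 1. The sketch below recomputes them by hand from the printed matrix; `metrics_from_confusion` is a hypothetical helper written for this check, not a scikit-learn function.

```python
def metrics_from_confusion(confusion):
    """Compute accuracy, precision, recall, F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of everything predicted positive, how much is right
    recall = tp / (tp + fn)      # of everything actually positive, how much is found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Confusion matrix printed for the SentiWordNet predictions.
swn_confusion = [[4061, 8439], [1367, 11133]]

accuracy, precision, recall, f1 = metrics_from_confusion(swn_confusion)
print(f"Accuracy : {accuracy:.4f}, Precision : {precision:.4f}, "
      f"Recall : {recall:.4f}, F1 : {f1:.4f}")
```

The 8,439 false positives in the top-right cell are what pull precision down while recall stays high.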

# 3. VADER

> VADER is a lexicon built for sentiment analysis of social media text.

## 3.1 Checking the API

In [49]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])

print(senti_scores)

{'neg': 0.13, 'neu': 0.744, 'pos': 0.126, 'compound': -0.8278}


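- Sentiment analysis is performed by calling polarity_scores() on the object created by SentimentIntensityAnalyzer().
- neg, neu, and pos are the negative, neutral, and positive proportions of the text; compound is a single overall sentiment score, normalized from the summed word-level valence scores.
- compound ranges from -1 to 1; a value of 0.1 or higher is treated as positive here.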
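
To give a feel for how a compound score ends up between -1 and 1, the sketch below assumes the normalization commonly described for VADER: the summed valence x is mapped to x / sqrt(x² + α) with α = 15. The function name `normalize_valence` and the sample inputs are illustrative assumptions; the actual word-level valences and rule adjustments come from the library and are not reproduced here.

```python
import math


def normalize_valence(total_valence: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm = total_valence / math.sqrt(total_valence * total_valence + alpha)
    # Guard against floating-point overshoot at the extremes.
    return max(-1.0, min(1.0, norm))


# Illustrative sums only; larger magnitudes approach but never cross +/-1.
for total in [-50.0, -4.0, 0.0, 0.5, 4.0, 50.0]:
    print(total, "->", round(normalize_valence(total), 4))
```

The curve is steep near zero and flattens toward the ends, so a long, strongly worded review quickly saturates close to -1 or 1.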

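## 3.2 Prediction

In [50]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Build the analyzer once; constructing it per review reloads the lexicon each time
vader_analyzer = SentimentIntensityAnalyzer()

def vader_polarity(review, threshold=0.1):
    # Score the review and keep only the compound value
    scores = vader_analyzer.polarity_scores(review)
    agg_score = scores['compound']

    # 1 (positive) when compound reaches the threshold, otherwise 0 (negative)
    final_sentiment = 1 if agg_score >= threshold else 0

    return final_sentiment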

In [51]:
review_df['vader_preds'] = review_df['review'].apply(lambda x:vader_polarity(x))

y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

In [52]:
get_clf_eval(y_target, vader_preds)

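Confusion matrix
[[ 6820  5680]
 [ 1936 10564]]
Accuracy : 0.6954, Precision : 0.6503, Recall : 0.8451, F1 : 0.7350


In [None]:
# SentiWordNet result from section 2.3.6, for comparison
# Confusion matrix
# [[ 4061  8439]
#  [ 1367 11133]]
# Accuracy : 0.6078, Precision : 0.5688, Recall : 0.8906, F1 : 0.6943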