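## Sentiment Analysis Overview
- Used in many areas, including social media, opinion polls, online reviews, and feedback
- Two approaches: supervised-learning-based and unsupervised-learning-based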

In [25]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [26]:
import pandas as pd
review_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/2023/python/data/labeledTrainData.tsv', header=0, sep='\t', quoting=3)
review_df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


- id: the ID of each record
- sentiment: the label of the movie review
  + 1 is a positive review
  + 0 is a negative review
- review: the review text

In [27]:
print(review_df['review'][1])

"\"The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur \"critics\" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the \"critics\". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the \"critics\" perceive to be its shortcomings."


## Text Data Preprocessing
- The reviews contain many special characters, so remove them
- Regular expressions are used for this
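As a small illustration of the cleaning described above, the sketch below applies the same two steps (replace `<br />`, then replace every non-letter character) to a single made-up string. The helper name `clean_review` and the sample text are assumptions for illustration only, not part of the notebook.

```python
import re

def clean_review(text):
    """Replace HTML line breaks and every non-letter character with a space."""
    text = text.replace('<br />', ' ')
    return re.sub(r'[^a-zA-Z]', ' ', text)

# Hypothetical sample string, not taken from the dataset
sample = 'Great movie!<br />10/10, would watch again.'
cleaned = clean_review(sample)

# Collapse the runs of spaces left behind so the result is easier to read
print(' '.join(cleaned.split()))
```

Digits and punctuation become spaces rather than being deleted, so neighbouring words never get glued together.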

In [28]:
import re

# Replace the <br /> HTML tags with a space using the replace function

review_df['review'] = review_df['review'].str.replace('<br />', ' ')

In [29]:
# Use Python's regular expression module to replace every character that is not an English letter with a space.
review_df['review'] = review_df['review'].apply(lambda x : re.sub(r'[^a-zA-Z]', ' ', x))

In [30]:
review_df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,With all this stuff going down at the moment ...
1,"""2381_9""",1,The Classic War of the Worlds by Timothy ...
2,"""7759_3""",0,The film starts with a manager Nicholas Bell...
3,"""3630_4""",0,It must be assumed that those who praised thi...
4,"""9495_8""",1,Superbly trashy and wondrously unpretentious ...


## Writing the Machine Learning Code

In [31]:
from sklearn.model_selection import train_test_split

class_Y = review_df['sentiment']
feature_X = review_df['review']

X_train, X_test, y_train, y_test = train_test_split(feature_X, class_Y, test_size=0.3, random_state = 42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((17500,), (7500,), (17500,), (7500,))

- An N-gram treats N consecutive words as a single token
  + Example: I Love You
  + Single-word tokens: I, Love, You
  + N-grams with N = 2: (I, Love), (Love, You)
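The N-gram idea above can be sketched in a few lines of plain Python. The helper `ngrams` is a hypothetical name introduced only for this illustration; it is not how `CountVectorizer` is implemented internally.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'I Love You'.split()

unigrams = ngrams(tokens, 1)  # [('I',), ('Love',), ('You',)]
bigrams = ngrams(tokens, 2)   # [('I', 'Love'), ('Love', 'You')]

print(unigrams)
print(bigrams)
```

With `ngram_range=(1, 2)`, `CountVectorizer` extracts both the single words and the two-word sequences. Note that with its default settings it also lowercases the text and drops one-character tokens such as "I", so its vocabulary for this sentence would differ from the output above.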

In [32]:
# Machine learning pipeline using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words= 'english', ngram_range = (1,2))),
    ('lr_clf', LogisticRegression(C=10))
])

# Train the model
pipeline.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Pipeline(steps=[('cnt_vect',
                 CountVectorizer(ngram_range=(1, 2), stop_words='english')),
                ('lr_clf', LogisticRegression(C=10))])

In [33]:
# Predict on the test set
pred = pipeline.predict(X_test)
pred_probs = pipeline.predict_proba(X_test)[:,1]

print(accuracy_score(y_test,pred))
print(roc_auc_score(y_test, pred_probs))

0.8844
0.9508383943629359


In [34]:
# Machine learning pipeline using TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words= 'english', ngram_range = (1,2))),
    ('lr_clf', LogisticRegression(C=10))
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
pred = pipeline.predict(X_test)
pred_probs = pipeline.predict_proba(X_test)[:,1]

print(accuracy_score(y_test,pred))
print(roc_auc_score(y_test, pred_probs))

0.8916
0.9591985866379715


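## Unsupervised Sentiment Analysis
- Based on a lexicon
- The supervised approach, by contrast:
  + Needs labels (a target variable is mandatory)
  + Building the labels takes a lot of time
- Uses a sentiment lexicon
  + Positive, Negative
  + Polarity score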
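To make the lexicon idea concrete, here is a minimal sketch that scores a text by summing per-word polarity values and then thresholds the total. The lexicon `TOY_LEXICON`, its scores, and the function `lexicon_polarity` are all made up for illustration; a real analysis would rely on an established resource such as SentiWordNet.

```python
# Hypothetical toy lexicon: word -> polarity score (positive > 0, negative < 0)
TOY_LEXICON = {
    'entertaining': 0.75,
    'classic': 0.5,
    'predictable': -0.5,
    'boring': -0.75,
}

def lexicon_polarity(text, lexicon=TOY_LEXICON, threshold=0.0):
    """Sum the polarity of known words; return (label, score).

    label is 1 (positive) when score >= threshold, otherwise 0 (negative).
    Words missing from the lexicon contribute nothing.
    """
    words = text.lower().split()
    score = sum(lexicon.get(word, 0.0) for word in words)
    label = 1 if score >= threshold else 0
    return label, score

print(lexicon_polarity('a very entertaining classic'))
print(lexicon_polarity('boring and predictable'))
```

No labeled training data is involved: the quality of the prediction depends entirely on the coverage and accuracy of the lexicon.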

## SentiWordNet Sentiment Analysis
- Define a helper function to measure elapsed time

In [35]:
import time
import datetime
def bench_mark(start):
  # Print the time elapsed since `start` as H:MM:SS (fractional seconds are dropped)
  sec = time.time() - start
  times = str(datetime.timedelta(seconds=sec)).split(".")
  times = times[0]

  print(times)

In [36]:
start = time.time()

import nltk
nltk.download('all')

bench_mark(start)

0:00:00


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    |   Package bcp47 is already up-to-dat

- Load the WordNet module
- synsets()
  + Returns the contexts and semantic information that a given word can carry

In [37]:
from nltk.corpus import wordnet as wn

keyword = 'present'

# Build the WordNet synsets for the word 'present'
synsets = wn.synsets(keyword)
print(type(synsets))
print(len(synsets))
print(synsets)

<class 'list'>
18
[Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]


In [38]:
for idx,synset in enumerate(synsets):
  print('## Synset name : ', synset.name(), '##')
  print('POS : ', synset.lexname())
  print('Definition : ', synset.definition())
  print('Lemmas : ', synset.lemma_names())
  print("\n")

## Synset name :  present.n.01 ##
POS :  noun.time
Definition :  the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas :  ['present', 'nowadays']


## Synset name :  present.n.02 ##
POS :  noun.possession
Definition :  something presented as a gift
Lemmas :  ['present']


## Synset name :  present.n.03 ##
POS :  noun.communication
Definition :  a verb tense that expresses actions or states at the time of speaking
Lemmas :  ['present', 'present_tense']


## Synset name :  show.v.01 ##
POS :  verb.perception
Definition :  give an exhibition of to an interested audience
Lemmas :  ['show', 'demo', 'exhibit', 'present', 'demonstrate']


## Synset name :  present.v.02 ##
POS :  verb.communication
Definition :  bring forward and present to the mind
Lemmas :  ['present', 'represent', 'lay_out']


## Synset name :  stage.v.01 ##
POS :  verb.creation
Definition :  perform (a play), especially on a stage
Lemmas :  ['stage', 'present', 'repr

## Measuring Similarity Between WordNet Words


In [39]:
import pandas as pd

plant = wn.synset('plant.n.01')
rabbit = wn.synset('rabbit.n.01')
lion = wn.synset('lion.n.01')
horse = wn.synset('horse.n.01')
dog = wn.synset('dog.n.01')

entities = [plant, rabbit, lion, horse, dog]
similarities = []
entity_names = [entity.name().split('.')[0] for entity in entities]

# Loop over each word's synset and measure its similarity to every other word's `synset`
for entity in entities:
  similarity = [round(entity.path_similarity(compared_entity), 2) for compared_entity in entities]
  similarities.append(similarity)

# Store the pairwise `synset` similarities in a `DataFrame`
similarity_df = pd.DataFrame(similarities, columns = entity_names, index = entity_names)
similarity_df

Unnamed: 0,plant,rabbit,lion,horse,dog
plant,1.0,0.07,0.06,0.06,0.1
rabbit,0.07,1.0,0.12,0.12,0.14
lion,0.06,0.12,1.0,0.11,0.17
horse,0.06,0.12,0.11,1.0,0.12
dog,0.1,0.14,0.17,0.12,1.0
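`path_similarity` scores two synsets from the shortest path that connects them in the hypernym/hyponym taxonomy, as 1 / (distance + 1), which is why every word scores 1.0 against itself. The sketch below reproduces that calculation on a tiny hand-made taxonomy. The `PARENT` hierarchy is an assumption for illustration and does not match the real WordNet graph, so the numbers differ from the table above.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); NOT the real WordNet hierarchy
PARENT = {
    'dog': 'carnivore',
    'lion': 'carnivore',
    'carnivore': 'animal',
    'rabbit': 'animal',
    'animal': 'organism',
    'plant': 'organism',
}

def build_graph(parent):
    """Turn the child -> parent map into an undirected adjacency map."""
    graph = {}
    for child, par in parent.items():
        graph.setdefault(child, set()).add(par)
        graph.setdefault(par, set()).add(child)
    return graph

def shortest_distance(graph, start, goal):
    """Breadth-first search for the number of edges between two nodes."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # no path between the two nodes

def path_similarity(graph, a, b):
    """Similarity = 1 / (shortest path length + 1), or None if unconnected."""
    dist = shortest_distance(graph, a, b)
    return None if dist is None else 1 / (dist + 1)

GRAPH = build_graph(PARENT)
print(path_similarity(GRAPH, 'dog', 'dog'))    # distance 0
print(path_similarity(GRAPH, 'dog', 'lion'))   # distance 2
print(path_similarity(GRAPH, 'dog', 'plant'))  # distance 4
```

Read this way, a score of 0.17 in the table above would correspond to a shortest path of five edges, since 1 / 6 rounds to 0.17.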


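## Sentiment Scores
- A SentiSynset object represents the sentiment of a word
  + The objectivity score is 1 for a word that carries no sentiment and falls toward 0 as the word becomes more sentiment-laden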

In [40]:
import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print(father.obj_score()) # objectivity score 1 / carries no sentiment
print(father.pos_score()) # positive --> 0
print(father.neg_score()) # negative --> 0

1.0
0.0
0.0


In [41]:
fabulous = swn.senti_synset('fabulous.a.01')
print(fabulous.obj_score()) # objectivity score 0 / sentiment-laden
print(fabulous.pos_score()) # positive --> 0.875
print(fabulous.neg_score()) # negative --> 0.125

0.0
0.875
0.125
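The three scores are tied together: in SentiWordNet the objectivity score is what remains after the positive and negative scores are subtracted from 1. The sketch below checks that relationship against the values printed above and also derives a single polarity value (positive minus negative). The helper names `objectivity` and `polarity` are assumptions for illustration, not NLTK functions.

```python
def objectivity(pos, neg):
    """Objectivity is the share of the score not assigned to sentiment."""
    return 1.0 - (pos + neg)

def polarity(pos, neg):
    """Net sentiment: positive score minus negative score."""
    return pos - neg

# (positive, negative) scores copied from the outputs above
scores = {
    'father.n.01': (0.0, 0.0),
    'fabulous.a.01': (0.875, 0.125),
}

for name, (pos, neg) in scores.items():
    print(name, objectivity(pos, neg), polarity(pos, neg))
```

For `fabulous.a.01` the net polarity is 0.75, clearly positive, while `father.n.01` is fully objective and contributes nothing to a document's sentiment total.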


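## Sentiment Analysis of Movie Reviews
- Split each document into sentences
- Tokenize each sentence into words and apply POS tagging
- Create synset and senti_synset objects from the POS-tagged words
- The senti_synset provides the positive and negative sentiment scores
  + Sum them all; if the total is at or above a chosen threshold, classify the document as positive, otherwise as negative (an if condition)
- Use WordNet to tokenize the document into words again, then apply lemmatization and POS tagging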

In [42]:
# POS tag conversion function

from nltk.corpus import wordnet as wn

# Convert a Penn Treebank tag from NLTK into the corresponding WordNet POS tag
def penn_to_wn(tag):
  if tag.startswith('J'):
    return wn.ADJ
  elif tag.startswith('N'):
    return wn.NOUN
  elif tag.startswith('R'):
    return wn.ADV
  elif tag.startswith('V'):
    return wn.VERB

- Create a function that computes the sentiment score

In [43]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

def swn_polarity(text):
    # Initialize the sentiment score
    sentiment = 0.0
    tokens_count = 0

    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)
    # For each sentence: tokenize into words -> POS tag -> build SentiSynset -> accumulate the sentiment score
    for raw_sentence in raw_sentences:
        # POS-tag the sentence with NLTK
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        for word , tag in tagged_sentence:

            # Convert to a WordNet POS tag and lemmatize
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN , wn.ADJ, wn.ADV):
                continue
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue
            # Build the Synset objects from the lemma and its WordNet POS tag
            synsets = wn.synsets(lemma , pos=wn_tag)
            if not synsets:
                continue
            # Look up the sentiment synset in SentiWordNet (first sense only)
            # Add the positive score and subtract the negative score for every word
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())
            tokens_count += 1

    if not tokens_count:
        return 0

    # Return 1 (positive) if the total score is 0 or higher, otherwise 0 (negative)
    if sentiment >= 0 :
        return 1

    return 0

- Use the swn_polarity(text) function to predict positive or negative sentiment for each document.

In [45]:
start = time.time()
review_df['preds'] = review_df['review'].apply(lambda x: swn_polarity(x))
y_target= review_df['sentiment'].values
preds = review_df['preds'].values

bench_mark(start)

0:05:39


- Performance evaluation

In [46]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score, roc_auc_score
import numpy as np

print(confusion_matrix(y_target, preds))
print("Accuracy:", np.round(accuracy_score(y_target, preds), 4))
print("Precision:", np.round(precision_score(y_target, preds), 4))
print("Recall:", np.round(recall_score(y_target, preds), 4))

[[7668 4832]
 [3636 8864]]
Accuracy: 0.6613
Precision: 0.6472
Recall: 0.7091
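The three metrics can be recomputed by hand from the confusion matrix printed above, which makes it easier to see where the errors come from. This is a sketch that assumes scikit-learn's layout for binary labels 0 and 1 (rows are actual classes, columns are predicted classes). The variable names are chosen here for illustration, and the F1 score is an extra that the notebook imports but does not print.

```python
# Confusion matrix from the output above:
# rows = actual (0, 1), columns = predicted (0, 1)
tn, fp = 7668, 4832   # actual negative: predicted negative / predicted positive
fn, tp = 3636, 8864   # actual positive: predicted negative / predicted positive

total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all reviews classified correctly
precision = tp / (tp + fp)      # share of predicted positives that are truly positive
recall = tp / (tp + fn)         # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The 4,832 false positives are the largest error cell: the rule `sentiment >= 0` labels many negative reviews as positive, which pulls precision below recall.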
