# Sentiment Analysis of Product Review Data

***

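### Goals

- Basic understanding of sentiment analysis: learn what sentiment analysis is and where it is applied.
- Data
    - (odd) amazon_uk_shoes_products_dataset_2021_12.csv
- Data preprocessing: gain experience preparing data for machine learning tasks.
    - Data cleaning (noise removal, handling missing values, etc.).
    - Text tokenization and stop word removal.
    - Text normalization (stemming or lemmatization).
- Model implementation: implement the task with a model used in class.
    - Apply a model covered in class.
- Model training and tuning
    - Split the dataset into training and test sets.
    - Train the model and tune hyperparameters (e.g., with grid search or random search).
    - Use techniques such as cross-validation to ensure robustness.
- Model evaluation: evaluate model performance with appropriate metrics.
    - Evaluate the model with metrics such as accuracy, precision, recall, F1 score, and ROC-AUC.
    - Discuss overfitting and underfitting and learn how to address them.
- Visualization and interpretation: visualize and interpret the results.
    - Visualize results with libraries such as Matplotlib or Seaborn.
    - Produce a confusion matrix and an ROC curve.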

***

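## Data Preprocessing

### Data Overview
- url : product URL
- product_name : product name
- reviewer_name : name of the reviewer
- review_title : review title
- review_text : review body
- review_rating : star rating of the review
- verified_purchase : whether the purchase was verified
- review_date : date the review was written
- helpful_count : number of people who found the review helpful
- uniq_id : review ID
- scraped_at : time the review was scraped

### Data Cleaning
- Check for missing values
- Exclude reviews not written by verified purchasers
- Keep English reviews only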

In [1]:
# Install and import the required libraries
%pip install langdetect

from langdetect import detect
import re

import pandas as pd

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Load the data
file_path = './amazon_uk_shoes_products_dataset_2021_12.csv'
df = pd.read_csv(file_path)

In [3]:
# Check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6823 entries, 0 to 6822
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   url                6823 non-null   object 
 1   product_name       6823 non-null   object 
 2   reviewer_name      6823 non-null   object 
 3   review_title       6822 non-null   object 
 4   review_text        6814 non-null   object 
 5   review_rating      6823 non-null   float64
 6   verified_purchase  6823 non-null   bool   
 7   review_date        6823 non-null   object 
 8   helpful_count      1953 non-null   object 
 9   uniq_id            6823 non-null   object 
 10  scraped_at         6823 non-null   object 
dtypes: bool(1), float64(1), object(9)
memory usage: 539.8+ KB


In [4]:
# Drop missing values
df = df.dropna(subset=['review_title', 'review_text'])  # drop a row only when review_title or review_text is missing
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6813 entries, 0 to 6822
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   url                6813 non-null   object 
 1   product_name       6813 non-null   object 
 2   reviewer_name      6813 non-null   object 
 3   review_title       6813 non-null   object 
 4   review_text        6813 non-null   object 
 5   review_rating      6813 non-null   float64
 6   verified_purchase  6813 non-null   bool   
 7   review_date        6813 non-null   object 
 8   helpful_count      1950 non-null   object 
 9   uniq_id            6813 non-null   object 
 10  scraped_at         6813 non-null   object 
dtypes: bool(1), float64(1), object(9)
memory usage: 592.1+ KB


In [5]:
df['verified_purchase'] = df['verified_purchase'].astype(str)  # convert to string

# Exclude reviews not written by verified purchasers
df = df[df.verified_purchase != 'False']
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6800 entries, 0 to 6822
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   url                6800 non-null   object 
 1   product_name       6800 non-null   object 
 2   reviewer_name      6800 non-null   object 
 3   review_title       6800 non-null   object 
 4   review_text        6800 non-null   object 
 5   review_rating      6800 non-null   float64
 6   verified_purchase  6800 non-null   object 
 7   review_date        6800 non-null   object 
 8   helpful_count      1950 non-null   object 
 9   uniq_id            6800 non-null   object 
 10  scraped_at         6800 non-null   object 
dtypes: float64(1), object(10)
memory usage: 637.5+ KB


In [6]:
# Keep only the columns needed for sentiment analysis: 'review_rating', 'review_title', 'review_text'
df = df[['review_rating', 'review_title', 'review_text']]

In [7]:
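# Keep only English reviews

# Helper that returns True only when the given text is detected as English
def is_english(text):
    try:
        return detect(text) == 'en'
    except Exception:  # detect() raises when no language can be identified (e.g. symbol-only text)
        return False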

df = df[df['review_text'].apply(is_english)]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3817 entries, 0 to 6817
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   review_rating  3817 non-null   float64
 1   review_title   3817 non-null   object 
 2   review_text    3817 non-null   object 
dtypes: float64(1), object(2)
memory usage: 119.3+ KB


In [8]:
# Text preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Collapse consecutive whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Preprocess 'review_title' and 'review_text'
df['review_title'] = df['review_title'].apply(preprocess_text)
df['review_text'] = df['review_text'].apply(preprocess_text)

In [9]:
# Inspect the data
df

Unnamed: 0,review_rating,review_title,review_text
0,5.0,love em,love these. was looking for converses and thes...
1,2.0,the plastic ripped,"the shoes are very cute, but after the nd day ..."
2,5.0,good quality,good quality
3,5.0,good,great
14,5.0,perfect right outta the box,true to size. if between i'd probably go with ...
...,...,...,...
6813,5.0,great for early walkers,the only shoes (after many tries) that worked ...
6814,3.0,three stars,too narrow hard to get on for a toddler
6815,5.0,said they were very comfortable.,my son loves them. said they were very comfort...
6816,2.0,they are smaller than other shoes the same size,size but they are smaller than the size my son...
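
The `nd day` fragment in row 1 of the output above is a side effect of the digit-removal step: an ordinal such as `2nd` loses its number but keeps its suffix. A minimal standalone sketch of the same three steps, run on a made-up sentence (the input string is an assumption for illustration, not a row from the dataset), makes this visible:

```python
import re

def preprocess_text(text):
    """Lowercase, strip digits, and collapse whitespace (same steps as the cell above)."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)            # "2nd" becomes "nd"
    text = re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace
    return text

# Hypothetical input, chosen to show the effect on an ordinal
sample = "After the 2nd   day  the sole RIPPED"
cleaned = preprocess_text(sample)
print(cleaned)
```

If such leftovers matter, one option would be to drop whole tokens that contain a digit (for example with the pattern `\b\w*\d\w*\b`) rather than the digits alone.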


In [10]:
# Import the required libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK data (only needed the first time)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Tokenization

In [11]:
# Tokenize the text
df['review_title_token'] = df['review_title'].apply(word_tokenize)
df['review_text_token'] = df['review_text'].apply(word_tokenize)

In [12]:
# Inspect the data
df

Unnamed: 0,review_rating,review_title,review_text,review_title_token,review_text_token
0,5.0,love em,love these. was looking for converses and thes...,"[love, em]","[love, these, ., was, looking, for, converses,..."
1,2.0,the plastic ripped,"the shoes are very cute, but after the nd day ...","[the, plastic, ripped]","[the, shoes, are, very, cute, ,, but, after, t..."
2,5.0,good quality,good quality,"[good, quality]","[good, quality]"
3,5.0,good,great,[good],[great]
14,5.0,perfect right outta the box,true to size. if between i'd probably go with ...,"[perfect, right, outta, the, box]","[true, to, size, ., if, between, i, 'd, probab..."
...,...,...,...,...,...
6813,5.0,great for early walkers,the only shoes (after many tries) that worked ...,"[great, for, early, walkers]","[the, only, shoes, (, after, many, tries, ), t..."
6814,3.0,three stars,too narrow hard to get on for a toddler,"[three, stars]","[too, narrow, hard, to, get, on, for, a, toddler]"
6815,5.0,said they were very comfortable.,my son loves them. said they were very comfort...,"[said, they, were, very, comfortable, .]","[my, son, loves, them, ., said, they, were, ve..."
6816,2.0,they are smaller than other shoes the same size,size but they are smaller than the size my son...,"[they, are, smaller, than, other, shoes, the, ...","[size, but, they, are, smaller, than, the, siz..."


### Stop Word Removal

In [13]:
# Stop word removal function
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

# Apply stop word removal
df['review_title_token'] = df['review_title_token'].apply(remove_stopwords)
df['review_text_token'] = df['review_text_token'].apply(remove_stopwords)

In [14]:
# Inspect the data
df

Unnamed: 0,review_rating,review_title,review_text,review_title_token,review_text_token
0,5.0,love em,love these. was looking for converses and thes...,"[love, em]","[love, ., looking, converses, half, price, uni..."
1,2.0,the plastic ripped,"the shoes are very cute, but after the nd day ...","[plastic, ripped]","[shoes, cute, ,, nd, day, wearing, tongue, sta..."
2,5.0,good quality,good quality,"[good, quality]","[good, quality]"
3,5.0,good,great,[good],[great]
14,5.0,perfect right outta the box,true to size. if between i'd probably go with ...,"[perfect, right, outta, box]","[true, size, ., 'd, probably, go, lower, end, ..."
...,...,...,...,...,...
6813,5.0,great for early walkers,the only shoes (after many tries) that worked ...,"[great, early, walkers]","[shoes, (, many, tries, ), worked, early, walk..."
6814,3.0,three stars,too narrow hard to get on for a toddler,"[three, stars]","[narrow, hard, get, toddler]"
6815,5.0,said they were very comfortable.,my son loves them. said they were very comfort...,"[said, comfortable, .]","[son, loves, ., said, comfortable, .]"
6816,2.0,they are smaller than other shoes the same size,size but they are smaller than the size my son...,"[smaller, shoes, size]","[size, smaller, size, son, outgrowing, ., disa..."
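
One thing to keep in mind with this step: NLTK's English stop word list contains negation words such as `not` and `no`, so a phrase like "not good" is reduced to "good" before it reaches the sentiment analyzer. The sketch below shows one possible workaround, which keeps a small set of negations. The `STOP_WORDS` and `NEGATIONS` sets are tiny hand-written stand-ins (assumptions for illustration), not the NLTK list itself, and `remove_stopwords_keep_negation` is a hypothetical helper name.

```python
# Hypothetical miniature stop word list; the notebook uses stopwords.words('english')
STOP_WORDS = {"the", "is", "are", "a", "this", "not", "no", "but"}
# Negation words we choose to keep (an assumption, adjust as needed)
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords_keep_negation(tokens):
    """Drop stop words but keep negations so polarity cues survive."""
    return [t for t in tokens if t.lower() in NEGATIONS or t.lower() not in STOP_WORDS]

tokens = ["this", "is", "not", "good"]
plain = [t for t in tokens if t.lower() not in STOP_WORDS]  # plain removal drops "not"
kept = remove_stopwords_keep_negation(tokens)                # negation is preserved
print(plain)
print(kept)
```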


### Text Normalization

In [15]:
# Lemmatization function
lemmatizer = WordNetLemmatizer()

def lemmas(tokens):
    return [lemmatizer.lemmatize(word, pos='v') for word in tokens]  # pos='v' lemmatizes every token as a verb

# Apply lemmatization
df['review_title_token'] = df['review_title_token'].apply(lemmas)
df['review_text_token'] = df['review_text_token'].apply(lemmas)

In [16]:
# Inspect the data
df

Unnamed: 0,review_rating,review_title,review_text,review_title_token,review_text_token
0,5.0,love em,love these. was looking for converses and thes...,"[love, em]","[love, ., look, converse, half, price, unique—..."
1,2.0,the plastic ripped,"the shoes are very cute, but after the nd day ...","[plastic, rip]","[shoe, cute, ,, nd, day, wear, tongue, start, ..."
2,5.0,good quality,good quality,"[good, quality]","[good, quality]"
3,5.0,good,great,[good],[great]
14,5.0,perfect right outta the box,true to size. if between i'd probably go with ...,"[perfect, right, outta, box]","[true, size, ., 'd, probably, go, lower, end, ..."
...,...,...,...,...,...
6813,5.0,great for early walkers,the only shoes (after many tries) that worked ...,"[great, early, walkers]","[shoe, (, many, try, ), work, early, walker, b..."
6814,3.0,three stars,too narrow hard to get on for a toddler,"[three, star]","[narrow, hard, get, toddler]"
6815,5.0,said they were very comfortable.,my son loves them. said they were very comfort...,"[say, comfortable, .]","[son, love, ., say, comfortable, .]"
6816,2.0,they are smaller than other shoes the same size,size but they are smaller than the size my son...,"[smaller, shoe, size]","[size, smaller, size, son, outgrow, ., disappo..."


***
## Model Implementation

- Use the VADER model
- Define a function that maps the compound score, a real value between -1 and 1, to a binary label (0 or 1)
- Use logistic regression

In [17]:
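# Import the required library
nltk.download('vader_lexicon', quiet=True)  # lexicon required by SentimentIntensityAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer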

In [18]:
# Initialize the VADER sentiment analyzer
vader_sentiment = SentimentIntensityAnalyzer()

In [19]:
# Define the sentiment analysis function
def calc_sentiment(review_tokens):
    if isinstance(review_tokens, list):
        review_text = ' '.join(review_tokens)  # join the token list back into a string
        return vader_sentiment.polarity_scores(review_text)  # run sentiment analysis
    else:
        return {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}  # default scores

# Run sentiment analysis on every review in the DataFrame
df["review_title_sentiment_score"] = df['review_title_token'].apply(calc_sentiment)
df["review_text_sentiment_score"] = df['review_text_token'].apply(calc_sentiment)

# Extract only the compound score into new columns
df['review_title_compound'] = df['review_title_sentiment_score'].apply(lambda x: x['compound'])
df['review_text_compound'] = df['review_text_sentiment_score'].apply(lambda x: x['compound'])


In [20]:
# Inspect the data
df

Unnamed: 0,review_rating,review_title,review_text,review_title_token,review_text_token,review_title_sentiment_score,review_text_sentiment_score,review_title_compound,review_text_compound
0,5.0,love em,love these. was looking for converses and thes...,"[love, em]","[love, ., look, converse, half, price, unique—...","{'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'comp...","{'neg': 0.07, 'neu': 0.446, 'pos': 0.484, 'com...",0.6369,0.9188
1,2.0,the plastic ripped,"the shoes are very cute, but after the nd day ...","[plastic, rip]","[shoe, cute, ,, nd, day, wear, tongue, start, ...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...","{'neg': 0.0, 'neu': 0.841, 'pos': 0.159, 'comp...",0.0000,0.6705
2,5.0,good quality,good quality,"[good, quality]","[good, quality]","{'neg': 0.0, 'neu': 0.256, 'pos': 0.744, 'comp...","{'neg': 0.0, 'neu': 0.256, 'pos': 0.744, 'comp...",0.4404,0.4404
3,5.0,good,great,[good],[great],"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...","{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...",0.4404,0.6249
14,5.0,perfect right outta the box,true to size. if between i'd probably go with ...,"[perfect, right, outta, box]","[true, size, ., 'd, probably, go, lower, end, ...","{'neg': 0.0, 'neu': 0.448, 'pos': 0.552, 'comp...","{'neg': 0.07, 'neu': 0.728, 'pos': 0.202, 'com...",0.5719,0.6361
...,...,...,...,...,...,...,...,...,...
6813,5.0,great for early walkers,the only shoes (after many tries) that worked ...,"[great, early, walkers]","[shoe, (, many, try, ), work, early, walker, b...","{'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'comp...","{'neg': 0.0, 'neu': 0.623, 'pos': 0.377, 'comp...",0.6249,0.8658
6814,3.0,three stars,too narrow hard to get on for a toddler,"[three, star]","[narrow, hard, get, toddler]","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...","{'neg': 0.318, 'neu': 0.682, 'pos': 0.0, 'comp...",0.0000,-0.1027
6815,5.0,said they were very comfortable.,my son loves them. said they were very comfort...,"[say, comfortable, .]","[son, love, ., say, comfortable, .]","{'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'comp...","{'neg': 0.0, 'neu': 0.211, 'pos': 0.789, 'comp...",0.5106,0.8176
6816,2.0,they are smaller than other shoes the same size,size but they are smaller than the size my son...,"[smaller, shoe, size]","[size, smaller, size, son, outgrow, ., disappo...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...","{'neg': 0.351, 'neu': 0.649, 'pos': 0.0, 'comp...",0.0000,-0.4019
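
The compound values above can be traced by hand. In VADER, the summed valence of the words in a text is squashed into (-1, 1) by $x / \sqrt{x^2 + \alpha}$ with $\alpha = 15$. The sketch below re-implements only that normalization step. The per-word valences (`good` = 1.9, `great` = 3.1) are quoted from memory of the VADER lexicon and should be treated as assumptions, although they do reproduce the 0.4404 and 0.6249 shown for row 3.

```python
import math

def normalize(valence_sum, alpha=15):
    """Map an unbounded valence sum into (-1, 1), as VADER does for the compound score."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

# Assumed lexicon valences for the single-word title and text of row 3
print(round(normalize(1.9), 4))  # "good"
print(round(normalize(3.1), 4))  # "great"
```

Because the function saturates, texts with many sentiment-bearing words tend toward ±1, which is worth keeping in mind when comparing short titles with longer review bodies.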


### Labeling

In [21]:
# Labeling function that produces binary targets for logistic regression
def change_to_binary(sentiment_label):
    if sentiment_label >= 0.1:  # 0.1 or higher is positive (1); everything else is negative (0)
        return 1
    else:
        return 0

# Label every review in the DataFrame
df["review_title_sentiment_label"] = df.review_title_compound.apply(change_to_binary)
df["review_text_sentiment_label"] = df.review_text_compound.apply(change_to_binary)
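
The same labeling can also be written as a vectorized comparison, which avoids a Python-level function call per row. A small sketch on made-up compound scores (the DataFrame below is hypothetical; the 0.1 threshold is the one used above):

```python
import pandas as pd

# Hypothetical compound scores, not taken from the dataset
scores = pd.DataFrame({"review_text_compound": [0.9188, 0.0, -0.4019, 0.1, 0.0999]})

THRESHOLD = 0.1  # scores >= 0.1 are labeled positive (1), everything else negative (0)
scores["review_text_sentiment_label"] = (scores["review_text_compound"] >= THRESHOLD).astype(int)
print(scores)
```

Note that under this rule a neutral score (a compound of exactly 0.0) falls into the negative class, which affects short titles in particular.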

In [29]:
# Inspect the data
df

Unnamed: 0,review_rating,review_title,review_text,review_title_token,review_text_token,review_title_sentiment_score,review_text_sentiment_score,review_title_compound,review_text_compound,review_title_sentiment_label,review_text_sentiment_label,check_Labeling
0,5.0,love em,love these. was looking for converses and thes...,"[love, em]","[love, ., look, converse, half, price, unique—...","{'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'comp...","{'neg': 0.07, 'neu': 0.446, 'pos': 0.484, 'com...",0.6369,0.9188,1,1,Match
1,2.0,the plastic ripped,"the shoes are very cute, but after the nd day ...","[plastic, rip]","[shoe, cute, ,, nd, day, wear, tongue, start, ...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...","{'neg': 0.0, 'neu': 0.841, 'pos': 0.159, 'comp...",0.0000,0.6705,0,1,Mismatch
2,5.0,good quality,good quality,"[good, quality]","[good, quality]","{'neg': 0.0, 'neu': 0.256, 'pos': 0.744, 'comp...","{'neg': 0.0, 'neu': 0.256, 'pos': 0.744, 'comp...",0.4404,0.4404,1,1,Match
3,5.0,good,great,[good],[great],"{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...","{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound...",0.4404,0.6249,1,1,Match
14,5.0,perfect right outta the box,true to size. if between i'd probably go with ...,"[perfect, right, outta, box]","[true, size, ., 'd, probably, go, lower, end, ...","{'neg': 0.0, 'neu': 0.448, 'pos': 0.552, 'comp...","{'neg': 0.07, 'neu': 0.728, 'pos': 0.202, 'com...",0.5719,0.6361,1,1,Match
...,...,...,...,...,...,...,...,...,...,...,...,...
6813,5.0,great for early walkers,the only shoes (after many tries) that worked ...,"[great, early, walkers]","[shoe, (, many, try, ), work, early, walker, b...","{'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'comp...","{'neg': 0.0, 'neu': 0.623, 'pos': 0.377, 'comp...",0.6249,0.8658,1,1,Match
6814,3.0,three stars,too narrow hard to get on for a toddler,"[three, star]","[narrow, hard, get, toddler]","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...","{'neg': 0.318, 'neu': 0.682, 'pos': 0.0, 'comp...",0.0000,-0.1027,0,0,Match
6815,5.0,said they were very comfortable.,my son loves them. said they were very comfort...,"[say, comfortable, .]","[son, love, ., say, comfortable, .]","{'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'comp...","{'neg': 0.0, 'neu': 0.211, 'pos': 0.789, 'comp...",0.5106,0.8176,1,1,Match
6816,2.0,they are smaller than other shoes the same size,size but they are smaller than the size my son...,"[smaller, shoe, size]","[size, smaller, size, son, outgrow, ., disappo...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...","{'neg': 0.351, 'neu': 0.649, 'pos': 0.0, 'comp...",0.0000,-0.4019,0,0,Match


#### Comparing Review Title and Review Text

In [34]:
# Check whether the sentiment labels of the review title and the review text agree
def check_labeling(review_title_sentiment_label, review_text_sentiment_label):
    if review_title_sentiment_label == 1 and review_text_sentiment_label == 1:
        return "Match"
    elif review_title_sentiment_label == 0 and review_text_sentiment_label == 0:
        return "Match"
    else:
        return "Mismatch"

# Compare the labels
df['check_labeling'] = df.apply(lambda row: check_labeling(row['review_title_sentiment_label'], row['review_text_sentiment_label']), axis=1)

In [35]:
# Inspect the data
df[['review_title_sentiment_label', 'review_text_sentiment_label', 'check_labeling']]

Unnamed: 0,review_title_sentiment_label,review_text_sentiment_label,check_labeling
0,1,1,Match
1,0,1,Mismatch
2,1,1,Match
3,1,1,Match
14,1,1,Match
...,...,...,...
6813,1,1,Match
6814,0,0,Match
6815,1,1,Match
6816,0,0,Match


In [36]:
# Compute the fraction of rows labeled 'Match'
match_count = (df['check_labeling'] == 'Match').sum()  # number of rows labeled 'Match'
total_count = len(df)  # total number of rows in the DataFrame
match_ratio = match_count / total_count  # fraction of rows labeled 'Match'
print("matching ratio", match_ratio)

matching ratio 0.6772334293948127


At about 0.68, the title and text labels do not agree closely enough for this comparison to be meaningful.

#### Comparing Review Rating and Review Title

In [38]:
# Check whether the review rating and the sentiment label of the review title agree
def check_labeling(review_rating, review_title_sentiment_label):
    if review_rating >= 3 and review_title_sentiment_label == 1:
        return "Match"
    elif review_rating < 3 and review_title_sentiment_label == 0:
        return "Match"
    else:
        return "Mismatch"

# Compare the labels
df['check_labeling'] = df.apply(lambda row: check_labeling(row['review_rating'], row['review_title_sentiment_label']), axis=1)

In [39]:
# Inspect the data
df[['review_rating', 'review_title_sentiment_label', 'check_labeling']]

Unnamed: 0,review_rating,review_title_sentiment_label,check_labeling
0,5.0,1,Match
1,2.0,0,Match
2,5.0,1,Match
3,5.0,1,Match
14,5.0,1,Match
...,...,...,...
6813,5.0,1,Match
6814,3.0,0,Mismatch
6815,5.0,1,Match
6816,2.0,0,Match


In [40]:
# Compute the fraction of rows labeled 'Match'
match_count = (df['check_labeling'] == 'Match').sum()  # number of rows labeled 'Match'
total_count = len(df)  # total number of rows in the DataFrame
match_ratio = match_count / total_count  # fraction of rows labeled 'Match'
print("matching ratio", match_ratio)

matching ratio 0.7186271941315169


At about 0.72, this is not a meaningful improvement either.

#### Comparing Review Rating and Review Text

In [41]:
# Check whether the review rating and the sentiment label of the review text agree
def check_labeling(review_rating, review_text_sentiment_label):
    if review_rating >= 3 and review_text_sentiment_label == 1:
        return "Match"
    elif review_rating < 3 and review_text_sentiment_label == 0:
        return "Match"
    else:
        return "Mismatch"

# Compare the labels
df['check_labeling'] = df.apply(lambda row: check_labeling(row['review_rating'], row['review_text_sentiment_label']), axis=1)

In [42]:
# Inspect the data
df[['review_rating', 'review_text_sentiment_label', 'check_labeling']]

Unnamed: 0,review_rating,review_text_sentiment_label,check_labeling
0,5.0,1,Match
1,2.0,1,Mismatch
2,5.0,1,Match
3,5.0,1,Match
14,5.0,1,Match
...,...,...,...
6813,5.0,1,Match
6814,3.0,0,Mismatch
6815,5.0,1,Match
6816,2.0,0,Match


In [43]:
# Compute the fraction of rows labeled 'Match'
match_count = (df['check_labeling'] == 'Match').sum()  # number of rows labeled 'Match'
total_count = len(df)  # total number of rows in the DataFrame
match_ratio = match_count / total_count  # fraction of rows labeled 'Match'
print("matching ratio", match_ratio)

matching ratio 0.8307571391144878


At about 0.83, this is the best agreement of the three comparisons. The sentiment model is therefore trained on the review text.
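
The three comparisons above all reduce to checking whether two binary columns agree, so they can be computed with one helper. A sketch on a made-up frame, assuming the same rule as above (a rating of 3 or more counts as positive); `agreement` is a hypothetical helper name:

```python
import pandas as pd

def agreement(a, b):
    """Fraction of rows where two binary label series agree."""
    return (a == b).mean()

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "review_rating": [5.0, 2.0, 5.0, 3.0],
    "review_title_sentiment_label": [1, 0, 1, 0],
    "review_text_sentiment_label": [1, 1, 1, 0],
})
rating_label = (toy["review_rating"] >= 3).astype(int)

title_vs_text = agreement(toy["review_title_sentiment_label"], toy["review_text_sentiment_label"])
rating_vs_title = agreement(rating_label, toy["review_title_sentiment_label"])
rating_vs_text = agreement(rating_label, toy["review_text_sentiment_label"])

print("title vs text  :", title_vs_text)
print("rating vs title:", rating_vs_title)
print("rating vs text :", rating_vs_text)
```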

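#### Strengths of VADER
- Ease of use
    - VADER is a lexicon- and rule-based analyzer, so it can be used immediately without any training.
- Detailed sentiment scores
    - It reports positive, negative, neutral, and compound scores, so sentiment can be examined from several angles.

#### Weaknesses of VADER
- Limits of the lexicon
    - VADER relies on a predefined word lexicon, so it may not recognize new words or neologisms.
- Negation handling
    - VADER handles simple negations such as "not bad" appropriately, but it can struggle with more complex negative expressions.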


***
## Model Training, Tuning, and Evaluation

In [44]:
# Import the required libraries
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report

In [52]:
# Separate features and labels
# The compound sentiment score is the feature and review_text_sentiment_label is the label.
# Note: this label was derived from review_text_compound by thresholding at 0.1 (see change_to_binary).
X = df["review_text_compound"].values.reshape(-1, 1)  # the feature matrix must be 2-D
y = df["review_text_sentiment_label"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=77)

# Initialize the logistic regression model
log_reg = LogisticRegression()

# Define the hyperparameter grid
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

# Initialize the grid search
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')

# Train the model
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters: ", grid_search.best_params_)

Best hyperparameters:  {'C': 100}
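
When reading the results of this setup, note that the label `review_text_sentiment_label` was produced by thresholding the feature `review_text_compound` at 0.1, so the classifier is in effect asked to rediscover that threshold. The sketch below illustrates this on synthetic scores (evenly spaced values in [-1, 1], an assumption standing in for the real data): the fitted decision boundary, `-intercept / coef`, should land close to 0.1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for review_text_compound: evenly spaced scores in [-1, 1]
x = np.linspace(-1, 1, 2001)
X = x.reshape(-1, 1)
y = (x >= 0.1).astype(int)  # same labeling rule as change_to_binary

model = LogisticRegression(C=100, max_iter=1000).fit(X, y)

# With a single feature, the model predicts 1 when coef * x + intercept > 0
boundary = -model.intercept_[0] / model.coef_[0, 0]
train_accuracy = model.score(X, y)
print(f"decision boundary ~ {boundary:.3f}")
print(f"training accuracy: {train_accuracy:.4f}")
```

Seen this way, near-perfect scores are to be expected in this configuration: they measure how well the threshold is recovered. Evaluating against an independent target, for example `review_rating >= 3`, would be one way to obtain a more informative estimate.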


In [53]:
# Select the best model
best_model = grid_search.best_estimator_

# Predict
y_pred = best_model.predict(X_test)

# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)  # computed from hard 0/1 predictions rather than predicted probabilities

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"ROC-AUC: {roc_auc}")


Accuracy: 0.9991273996509599
Precision: 1.0
Recall: 0.9989795918367347
F1 Score: 0.9994895354772844
ROC-AUC: 0.9994897959183673
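
A note on the ROC-AUC value above: when `roc_auc_score` receives hard 0/1 predictions, the ROC curve has a single interior point and the area equals the balanced accuracy, (TPR + TNR) / 2. With a recall of about 0.99898 and a precision of 1.0 (no false positives, so TNR = 1.0), that gives the 0.99949 printed above. A ranking-based AUC is usually computed from `predict_proba` instead. The sketch below shows both on synthetic, deliberately noisy data (all data here is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
# Noisy labels so that the two classes overlap (unlike the thresholded labels above)
y = (x + rng.normal(0, 0.3, size=500) >= 0.1).astype(int)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=77, stratify=y
)
model = LogisticRegression().fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard 0/1 labels
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Balanced accuracy from the confusion matrix: mean of TPR and TNR
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
balanced_acc = (tp / (tp + fn) + tn / (tn + fp)) / 2

auc_hard = roc_auc_score(y_test, y_pred)
auc_soft = roc_auc_score(y_test, y_score)
print(f"AUC from hard labels  : {auc_hard:.4f} (balanced accuracy: {balanced_acc:.4f})")
print(f"AUC from probabilities: {auc_soft:.4f}")
```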


In [54]:
# Cross-validation
import numpy as np  # needed for np.mean and np.std below
cv_scores_accuracy = cross_val_score(best_model, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy scores: {cv_scores_accuracy}")
print(f"Mean cross-validation accuracy: {np.mean(cv_scores_accuracy)}")
print(f"Standard deviation of cross-validation accuracy: {np.std(cv_scores_accuracy)}")

cv_scores_precision = cross_val_score(best_model, X, y, cv=5, scoring='precision')
print(f"Cross-validation precision scores: {cv_scores_precision}")
print(f"Mean cross-validation precision: {np.mean(cv_scores_precision)}")
print(f"Standard deviation of cross-validation precision: {np.std(cv_scores_precision)}")

cv_scores_recall = cross_val_score(best_model, X, y, cv=5, scoring='recall')
print(f"Cross-validation recall scores: {cv_scores_recall}")
print(f"Mean cross-validation recall: {np.mean(cv_scores_recall)}")
print(f"Standard deviation of cross-validation recall: {np.std(cv_scores_recall)}")

cv_scores_f1 = cross_val_score(best_model, X, y, cv=5, scoring='f1')
print(f"Cross-validation f1 scores: {cv_scores_f1}")
print(f"Mean cross-validation f1: {np.mean(cv_scores_f1)}")
print(f"Standard deviation of cross-validation f1: {np.std(cv_scores_f1)}")

cv_scores_roc_auc = cross_val_score(best_model, X, y, cv=5, scoring='roc_auc')
print(f"Cross-validation roc_auc scores: {cv_scores_roc_auc}")
print(f"Mean cross-validation roc_auc: {np.mean(cv_scores_roc_auc)}")
print(f"Standard deviation of cross-validation roc_auc: {np.std(cv_scores_roc_auc)}")

Cross-validation accuracy scores: [0.9973822  1.         1.         0.99868938 0.99737877]
Mean cross-validation accuracy: 0.9986900701968668
Standard deviation of cross-validation accuracy: 0.0011714839509345157
Cross-validation precision scores: [1. 1. 1. 1. 1.]
Mean cross-validation precision: 1.0
Standard deviation of cross-validation precision: 0.0
Cross-validation recall scores: [0.9969278  1.         1.         0.99846154 0.99692308]
Mean cross-validation recall: 0.9984624837528063
Standard deviation of cross-validation recall: 0.0013749858581258668
Cross-validation f1 scores: [0.99846154 1.         1.         0.99923018 0.99845917]
Mean cross-validation f1: 0.9992301766943015
Standard deviation of cross-validation f1: 0.0006885513865469319
Cross-validation roc_auc scores: [1. 1. 1. 1. 1.]
Mean cross-validation roc_auc: 1.0
Standard deviation of cross-validation roc_auc: 0.0
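
The five `cross_val_score` calls above each refit the model on the same folds. scikit-learn's `cross_validate` accepts a list of scorers and evaluates them in a single pass. A minimal sketch on synthetic data (the data is an assumption; the metric names are the ones used above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = (x + rng.normal(0, 0.3, size=500) >= 0.1).astype(int)  # synthetic noisy labels
X = x.reshape(-1, 1)

metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]
results = cross_validate(LogisticRegression(C=100), X, y, cv=5, scoring=metrics)

# Each metric is stored under the key "test_<metric>" as an array with one score per fold
for metric in metrics:
    scores = results[f"test_{metric}"]
    print(f"{metric:>9}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```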
