# RNN-LSTM

**Amazon Fine Food Reviews** dataset

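- Id: unique ID of the review
- ProductId: product ID
- UserId: user ID
- ProfileName: user name (profile name)
- HelpfulnessNumerator: numerator of the helpfulness rating (number of users who found the review helpful)
- HelpfulnessDenominator: denominator of the helpfulness rating (number of users who voted)
- Score: review rating (1 to 5)
- Time: Unix time at which the review was written
- Summary: review summary
- Text: full review text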
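
As a small illustration of two of these columns, the sketch below converts the `Time` value of the first review shown in this notebook (1303862400) to a calendar date and computes a helpfulness ratio. The helper name `helpfulness_ratio` is hypothetical, and returning `None` for reviews with no votes is an assumption, not something the dataset prescribes.

```python
from datetime import datetime, timezone

def helpfulness_ratio(numerator, denominator):
    """Share of voters who found the review helpful, or None if nobody voted."""
    if denominator == 0:
        return None
    return numerator / denominator

# Unix time is seconds since 1970-01-01 UTC
review_date = datetime.fromtimestamp(1303862400, tz=timezone.utc).date()
print(review_date)               # 2011-04-27
print(helpfulness_ratio(3, 3))   # 1.0
print(helpfulness_ratio(0, 0))   # None
```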

In [None]:
import pandas as pd
import numpy as np
import re
import string

# TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, GRU, SimpleRNN, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


1. Load the dataset and print it.

In [None]:
df = pd.read_csv("Reviews.csv")
print(df.shape)
df.head()

(568454, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


2. Inspect the dataset's columns and check for missing values.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568428 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB


In [None]:
# isna().count() counts every row, missing or not; isna().sum() counts the missing values
df.isna().sum()

Unnamed: 0,0
Id,0
ProductId,0
UserId,0
ProfileName,26
HelpfulnessNumerator,0
HelpfulnessDenominator,0
Score,0
Time,0
Summary,27
Text,0


3. Drop the columns that are not needed for the analysis and remove missing values.

In [None]:
# Keep only Score and Text; the other columns (Id, ProductId, UserId, ProfileName, Time, etc.) are not needed
df = df[['Score', 'Text']].dropna()

# Overview of the remaining data
print(df.info())
print(df['Score'].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Score   568454 non-null  int64 
 1   Text    568454 non-null  object
dtypes: int64(1), object(1)
memory usage: 8.7+ MB
None
Score
5    363122
4     80655
1     52268
3     42640
2     29769
Name: count, dtype: int64


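4. Decide how to frame the labels.

- Binarize the label (Score) or set up multi-class classification
  
  1. Binary classification example (positive vs. negative)
    - Scores 4 and 5 are positive (1), scores 1 and 2 are negative (0), and score 3 is excluded (or treated as neutral)
  2. Multi-class classification example (predict 1 to 5 stars)
    - Use Score as-is, as five classes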

In [None]:
# Binary classification

# Scores 1-2 => 0 (negative), scores 4-5 => 1 (positive), score 3 => dropped
df = df[df['Score'] != 3].copy()  # copy to avoid pandas' SettingWithCopyWarning
df['Sentiment'] = (df['Score'] > 3).astype(int)
df = df[['Text', 'Sentiment']]

df.head()

Unnamed: 0,Text,Sentiment
0,I have bought several of the Vitality canned d...,1
1,Product arrived labeled as Jumbo Salted Peanut...,0
2,This is a confection that has been around a fe...,1
3,If you are looking for the secret ingredient i...,0
4,Great taffy at a great price. There was a wid...,1


5. Preprocess the text so the model can learn from it effectively.

In [None]:
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove HTML tags
    text = re.sub(r"<[^>]*>", "", text)
    # Remove punctuation, digits, and anything else that is not a letter or whitespace
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return text

df['cleaned_text'] = df['Text'].apply(clean_text)
df.head()


Unnamed: 0,Text,Sentiment,cleaned_text
0,I have bought several of the Vitality canned d...,1,i have bought several of the vitality canned d...
1,Product arrived labeled as Jumbo Salted Peanut...,0,product arrived labeled as jumbo salted peanut...
2,This is a confection that has been around a fe...,1,this is a confection that has been around a fe...
3,If you are looking for the secret ingredient i...,0,if you are looking for the secret ingredient i...
4,Great taffy at a great price. There was a wid...,1,great taffy at a great price there was a wide...
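
A quick check of `clean_text` on a made-up review shows one side effect worth knowing: because tags are replaced with an empty string, the words on either side of a `<br />` are joined. The sample sentence and the `clean_text_spaced` variant below are hypothetical; the variant is only a sketch of one possible alternative, not what this notebook trains on.

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text
    text = text.lower()
    text = re.sub(r"<[^>]*>", "", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return text

def clean_text_spaced(text):
    """Hypothetical variant: replace tags with a space, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]*>", " ", text)
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Great taffy!<br />I bought 2 bags."
print(repr(clean_text(sample)))         # 'great taffyi bought  bags'
print(repr(clean_text_spaced(sample)))  # 'great taffy i bought bags'
```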


6. Split the dataset for model training.

Split it into training, validation, and test sets at a ratio of 8:1:1, and set random_state to 42 for reproducibility.

In [None]:
X = df['cleaned_text'].values
y = df['Sentiment'].values

# train : validation : test = 8 : 1 : 1
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
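
The resulting split sizes can be checked by hand. The sketch below assumes scikit-learn's behaviour of rounding a fractional `test_size` up to a whole number of samples and giving the remainder to the other part. The starting row count is the 568,454 reviews minus the 42,640 three-star reviews from the `value_counts()` output.

```python
import math

n_total = 568454 - 42640           # rows left after dropping Score == 3

n_temp = math.ceil(0.2 * n_total)  # held out by the first split
n_train = n_total - n_temp
n_test = math.ceil(0.5 * n_temp)   # the second split halves the held-out part
n_val = n_temp - n_test

print(n_train, n_val, n_test)      # 420651 52581 52582
print(math.ceil(n_train / 128))    # 3287 batches per epoch at batch_size=128
```

These figures agree with the 3287 steps per epoch in the training logs and the support of 52582 in the classification report later in this notebook.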

7. To train the RNN models, convert the text into fixed-length integer sequences.

In [None]:
# Tokenization
tokenizer = Tokenizer(num_words=20000)  # keep only the most frequent words (indices below 20,000)
tokenizer.fit_on_texts(X_train)  # fit on the training set only

# Convert texts to integer sequences
train_sequences = tokenizer.texts_to_sequences(X_train)
val_sequences = tokenizer.texts_to_sequences(X_val)
test_sequences = tokenizer.texts_to_sequences(X_test)

# Padding
max_len = 100  # maximum sequence length in tokens
X_train_pad = pad_sequences(train_sequences, maxlen=max_len, padding='post')
X_val_pad = pad_sequences(val_sequences, maxlen=max_len, padding='post')
X_test_pad = pad_sequences(test_sequences, maxlen=max_len, padding='post')
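
To make the two steps concrete, here is a dependency-free sketch of roughly what `Tokenizer` and `pad_sequences` do with the settings used above. It is a simplified stand-in, not the Keras implementation: it splits on whitespace only, and all function names are hypothetical. It does mirror three behaviours worth knowing: indices start at 1 in order of frequency, words that are unseen or whose index is not below `num_words` are dropped, and with the default `truncating='pre'` an over-long sequence loses its beginning rather than its end.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an index starting at 1, most frequent first."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only known words whose index is below num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

def pad_post(seq, max_len, value=0):
    """Truncate from the front, then pad with `value` at the end."""
    seq = seq[-max_len:]
    return seq + [value] * (max_len - len(seq))

word_index = fit_word_index(["good good taffy", "bad taffy"])
print(word_index)                      # {'good': 1, 'taffy': 2, 'bad': 3}
seq = to_sequence("good bad unknown taffy", word_index, num_words=3)
print(seq)                             # [1, 2]
print(pad_post(seq, 4))                # [1, 2, 0, 0]
print(pad_post([1, 2, 3, 4, 5], 4))    # [2, 3, 4, 5]
```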

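8. Build the models that will be trained on the preprocessed data.


- Embedding: converts word indices into dense vectors
- LSTM: an RNN variant designed to learn long-term dependencies
- Dense(1, sigmoid): output layer with a sigmoid activation, producing the probability used for binary classification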
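
The parameter counts that `model.summary()` reports can be worked out by hand from the layer sizes used in this notebook (vocabulary 20,000, embedding dimension 128, 64 recurrent units). The sketch below uses the standard formulas for these layer types and assumes default settings such as bias terms being enabled. It also shows why the LSTM is the heavier of the two recurrent layers: it holds four weight sets where SimpleRNN holds one.

```python
vocab_size, embed_dim, units = 20000, 128, 64

# One embedding vector per vocabulary index
embedding_params = vocab_size * embed_dim
# A recurrent unit sees the input, the previous hidden state, and a bias
simple_rnn_params = units * (embed_dim + units + 1)
# LSTM has four such weight sets: input, forget, and output gates plus the cell candidate
lstm_params = 4 * units * (embed_dim + units + 1)
# Single sigmoid output: one weight per recurrent unit plus a bias
dense_params = units * 1 + 1

print(embedding_params)   # 2560000
print(simple_rnn_params)  # 12352
print(lstm_params)        # 49408
print(dense_params)       # 65
```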


In [None]:
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
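
The effect of `patience=2` can be sketched without Keras. The function below is a simplified model of the callback's stopping rule (hypothetical name, no `min_delta` or baseline handling): training stops once the monitored loss has failed to improve for `patience` consecutive epochs, and the best epoch is the one whose weights `restore_best_weights=True` would bring back. The first call uses the SimpleRNN validation losses recorded in this notebook; the second list is made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None when training runs through every epoch.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

print(early_stopping_epoch([0.4035, 0.3999, 0.3944]))        # (None, 3)
print(early_stopping_epoch([0.40, 0.39, 0.41, 0.42, 0.38]))  # (4, 2)
```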

In [None]:
RNN_model = Sequential()
RNN_model.add(Embedding(input_dim=20000, output_dim=128, input_length=max_len))
RNN_model.add(SimpleRNN(64, dropout=0.2, recurrent_dropout=0.2))
RNN_model.add(Dense(1, activation='sigmoid'))

RNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
RNN_model.summary()



In [None]:
model = Sequential()
model.add(Embedding(input_dim=20000, output_dim=128, input_length=max_len))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))  # 64 LSTM units, with dropout
model.add(Dense(1, activation='sigmoid'))  # sigmoid output for binary classification

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()


9. Train the models you built.

In [None]:
# SimpleRNN model

RNN_history = RNN_model.fit(
    X_train_pad, y_train,
    validation_data=(X_val_pad, y_val),
    epochs=3,
    batch_size=128,   # adjust to the available GPU resources
    callbacks=[early_stop]
)

Epoch 1/3
3287/3287 ━━━━━━━━━━━━━━━━━━━━ 47s 13ms/step - accuracy: 0.8411 - loss: 0.4342 - val_accuracy: 0.8537 - val_loss: 0.4035
Epoch 2/3
3287/3287 ━━━━━━━━━━━━━━━━━━━━ 76s 12ms/step - accuracy: 0.8561 - loss: 0.3992 - val_accuracy: 0.8552 - val_loss: 0.3999
Epoch 3/3
3287/3287 ━━━━━━━━━━━━━━━━━━━━ 37s 11ms/step - accuracy: 0.8608 - loss: 0.3885 - val_accuracy: 0.8592 - val_loss: 0.3944


In [None]:
# LSTM model

history = model.fit(
    X_train_pad, y_train,
    validation_data=(X_val_pad, y_val),
    epochs=3,
    batch_size=128,   # adjust to the available GPU resources
    callbacks=[early_stop]
)

Epoch 1/3
3287/3287 ━━━━━━━━━━━━━━━━━━━━ 1100s 335ms/step - accuracy: 0.8671 - loss: 0.3548 - val_accuracy: 0.9396 - val_loss: 0.1612
Epoch 2/3
3287/3287 ━━━━━━━━━━━━━━━━━━━━ 1100s 335ms/step - accuracy: 0.9465 - loss: 0.1458 - val_accuracy: 0.9542 - val_loss: 0.1248
Epoch 3/3
1782/3287 ━━━━━━━━━━━━━━━━━━━━ 8:06 323ms/step - accuracy: 0.9628 - loss: 0.1038

10. Evaluate the model on the test set.

- accuracy_score: computes accuracy
- classification_report: reports precision, recall, and F1 score per class

In [None]:
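# Predict on the test set (this evaluates the SimpleRNN model)
y_pred_prob = RNN_model.predict(X_test_pad)
y_pred = (y_pred_prob > 0.5).astype(int).ravel()  # flatten (n, 1) to (n,) to match y_test

# Accuracy and per-class metrics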
acc = accuracy_score(y_test, y_pred)
print("Test Accuracy: {:.2f}%".format(acc * 100))

print(classification_report(y_test, y_pred))


1644/1644 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step
Test Accuracy: 85.92%
              precision    recall  f1-score   support

           0       0.73      0.15      0.24      8161
           1       0.86      0.99      0.92     44421

    accuracy                           0.86     52582
   macro avg       0.80      0.57      0.58     52582
weighted avg       0.84      0.86      0.82     52582
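
The report shows a strong class imbalance, so the accuracy figure is easier to interpret next to a majority-class baseline. The sketch below computes that baseline from the support column and, as one commonly used mitigation, derives "balanced" class weights with the heuristic `n_samples / (n_classes * class_count)`. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Test-set class counts taken from the support column of the report
n_neg, n_pos = 8161, 44421
baseline_acc = n_pos / (n_neg + n_pos)   # always predicting "positive"
print(round(baseline_acc * 100, 2))      # 84.48

# Hypothetical toy labels: one negative for every four positives
weights = balanced_class_weights([0, 1, 1, 1, 1])
print(weights)                           # {0: 2.5, 1: 0.625}
```

The SimpleRNN's 85.92% is therefore only about 1.4 points above always predicting the positive class, which is consistent with its recall of 0.15 on negative reviews. In this notebook the weights would be computed from the training labels as plain Python ints (for example `y_train.tolist()`) and could be passed to `fit` through its `class_weight` argument; whether that improves negative-class recall here has not been tested.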



11. With a trained model in hand, define functions that take user-supplied text and return the predicted sentiment (positive or negative).

In [None]:
# Preprocessing for user input
def preprocess_input_text(text, tokenizer, max_len=100):
    """
    1) Lowercase and strip HTML tags and non-letter characters (clean_text)
    2) Apply tokenizer.texts_to_sequences()
    3) Apply pad_sequences
    """
    # Reuse the training-time cleaning so inference input matches the training data
    text = clean_text(text)

    # Convert to an integer sequence
    seq = tokenizer.texts_to_sequences([text])
    # Pad to the fixed length
    pad_seq = pad_sequences(seq, maxlen=max_len, padding='post')
    return pad_seq

def predict_sentiment(user_text, model, tokenizer, max_len=100):
    """
    Preprocess the user's text and predict its sentiment (0: negative, 1: positive).
    """
    # Preprocess
    processed_text = preprocess_input_text(user_text, tokenizer, max_len=max_len)
    # Predict
    pred_prob = model.predict(processed_text)
    pred_label = (pred_prob > 0.5).astype(int)[0][0]  # 0 or 1
    return pred_label


In [None]:
# user_input = input("Enter a sentence for sentiment analysis: ")
user_inputs = [
    "I absolutely love this product! It exceeded my expectations and I'll definitely buy it again.",
    "I was very disappointed with the quality. It broke after just one use, and I won't be purchasing it again.",
    "The product was okay—nothing special, but it worked as described. I might consider a different brand next time."
]

for user_input in user_inputs:
  label = predict_sentiment(user_input, RNN_model, tokenizer)

  if label == 1:
    print(">> Prediction: this looks like a positive review!")
  else:
    print(">> Prediction: this looks like a negative review!")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 442ms/step
>> Prediction: this looks like a positive review!
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 92ms/step
>> Prediction: this looks like a positive review!
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 93ms/step
>> Prediction: this looks like a positive review!
