## Experiment 1
![alt text](<image (1).png>)

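- Why does logistic regression score well on the evaluation metrics in most cases?
    - 1. Logistic regression is a linear model that works directly on sparse (mostly zero) vectors, so it is likely to be efficient.
    - 2. It also learns the influence of co-occurring words through its weights.
    - 3. It only needs one weight vector per class, so it gives a stable, high F1-score in the multi-class setting as well.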
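
To make points 1 and 3 concrete, here is a minimal sketch of how a multi-class linear model scores one sparse document. The feature indices, weights and the `predict_proba` helper are made up for illustration; this is not scikit-learn's implementation, only the same idea in plain Python.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(sparse_doc, class_weights, class_biases):
    """Score one document stored as {feature_index: value}.

    Only the non-zero features are visited, which is why a linear model
    stays cheap on a mostly-zero TF-IDF vector.
    """
    scores = []
    for weights, bias in zip(class_weights, class_biases):
        score = bias + sum(value * weights.get(idx, 0.0)
                           for idx, value in sparse_doc.items())
        scores.append(score)
    return softmax(scores)

# Hypothetical toy setup: 2 classes; features 0 ("earnings"), 1 ("wheat"), 2 ("said").
class_weights = [
    {0: 2.0, 2: 0.1},   # class 0 leans on "earnings"
    {1: 2.5, 2: 0.1},   # class 1 leans on "wheat"
]
class_biases = [0.0, 0.0]

doc = {0: 0.8, 2: 0.3}  # only "earnings" and "said" are non-zero
probs = predict_proba(doc, class_weights, class_biases)
predicted_class = probs.index(max(probs))
print(predicted_class, probs)
```

Each class contributes just one weight vector and one bias, so going from 2 classes to 46 only adds more rows of weights.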


In [288]:
!pip install gensim



In [289]:
from tensorflow.keras.datasets import reuters
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [290]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)

In [291]:
word_index = reuters.get_word_index(path="reuters_word_index.json")

In [292]:
index_to_word = { index+3 : word for word, index in word_index.items() }
for index, token in enumerate(("<pad>", "<sos>", "<unk>")):
  index_to_word[index]=token

In [293]:
print(' '.join([index_to_word[index] for index in x_train[0]]))

<sos> mcgrath rentcorp said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3
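
The `index+3` offset in the decoding step can be checked on a toy dictionary. The sketch below assumes the Keras defaults for this dataset (indices 0, 1 and 2 reserved for padding, start-of-sequence and unknown words); `toy_word_index` and the encoded sample are made up.

```python
# Hypothetical miniature of the dictionary returned by reuters.get_word_index():
# ranks start at 1, with 1 being the most frequent word.
toy_word_index = {"the": 1, "of": 2, "said": 3}

# The encoded data shifts every rank by 3 so that 0, 1 and 2 stay free for
# the padding, start-of-sequence and unknown-word markers.
toy_index_to_word = {rank + 3: word for word, rank in toy_word_index.items()}
for index, token in enumerate(("<pad>", "<sos>", "<unk>")):
    toy_index_to_word[index] = token

encoded = [1, 4, 6, 5, 2]  # a made-up encoded sample
decoded_text = " ".join(toy_index_to_word[i] for i in encoded)
print(decoded_text)
```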


In [294]:
decoded = []
for i in range(len(x_train)):
    t = ' '.join([index_to_word[index] for index in x_train[i]])
    decoded.append(t)

x_train = decoded
print(len(x_train))

8982


In [295]:
decoded_test = []
for i in range(len(x_test)):
    t = ' '.join([index_to_word[index] for index in x_test[i]])
    decoded_test.append(t)

x_test = decoded_test
print(len(x_test))

2246


In [296]:
# Vectorization: DTM and TF-IDF
dtmvector = CountVectorizer()

tfidf_transformer = TfidfTransformer()

x_train_dtm = dtmvector.fit_transform(x_train)
x_test_dtm= dtmvector.transform(x_test)

x_train_tfidf = tfidf_transformer.fit_transform(x_train_dtm)
x_test_tfidf = tfidf_transformer.transform(x_test_dtm)
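
For reference, the weighting that turns a DTM row into a TF-IDF row can be sketched in plain Python. This assumes the documented defaults of `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency); the `tfidf_rows` helper and the three toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # the DTM row (raw counts)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [
    "oil prices rise".split(),
    "oil output falls".split(),
    "wheat prices rise".split(),
]
rows = tfidf_rows(docs)
print(rows[1])
```

In the second toy document, "output" and "falls" appear in only one document, so they end up with a larger weight than "oil", which appears in two.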

In [297]:
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes model
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score  # accuracy computation

print('=3')

=3


- Because we split up the work, I ran the decision tree through voting part with num_words=None.
    - Decision tree

In [298]:
tree = DecisionTreeClassifier(max_depth=10, random_state=0)
tree.fit(x_train_tfidf, y_train)

In [299]:
predicted = tree.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels
print("f1-score", f1_score(y_test, predicted, average='weighted'))

Accuracy: 0.6211041852181657
f1-score 0.5769283128518847


- Random forest

In [300]:
from sklearn.ensemble import RandomForestClassifier

# 5 trees, random seed 0
forest = RandomForestClassifier(
    n_estimators=5,
    random_state=0,
    n_jobs=-1        # (optional) use multiple CPU cores
)

# Train
forest.fit(x_train_tfidf, y_train)


In [301]:
predicted = forest.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels
print("f1-score", f1_score(y_test, predicted, average='weighted'))

Accuracy: 0.6544968833481746
f1-score 0.6225909375608356


- Gradient boosting trees

In [302]:
# This can take about 15 minutes.
grbt = GradientBoostingClassifier(random_state=0) # verbose=3
grbt.fit(x_train_tfidf, y_train)

In [303]:
predicted = grbt.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels
print("f1-score", f1_score(y_test, predicted, average='weighted'))

Accuracy: 0.7680320569902048
f1-score 0.7627808003795614


- Voting

In [304]:
from sklearn.linear_model   import LogisticRegression
from sklearn.naive_bayes    import ComplementNB
from sklearn.ensemble       import GradientBoostingClassifier, VotingClassifier

log_clf   = LogisticRegression(          
    penalty='l2',
    solver='lbfgs',        
    max_iter=1000,
    random_state=0
)

cnb_clf   = ComplementNB()               

gb_clf    = GradientBoostingClassifier( random_state=0 )

voting_classifier = VotingClassifier(
    estimators=[
        ('lr',  log_clf),
        ('cnb', cnb_clf),
        ('gb',  gb_clf)
    ],
    voting='soft',   # ← required: average the predicted probabilities
    n_jobs=-1        # use all CPU cores (optional)
)


voting_classifier.fit(x_train_tfidf, y_train)

In [305]:
predicted = voting_classifier.predict(x_test_tfidf)  # predict on the test data
print("Accuracy:", accuracy_score(y_test, predicted))  # compare predictions with the true labels
print("f1-score", f1_score(y_test, predicted, average='weighted'))

Accuracy: 0.7996438112199465
f1-score 0.7942898938123609
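
As a rough illustration of what `voting='soft'` does for a single sample, the sketch below averages per-model class probabilities and picks the largest. The probabilities are made up, and the `soft_vote` helper is a hypothetical stand-in assuming equal model weights, not scikit-learn code.

```python
def soft_vote(probabilities_per_model):
    """Average class probabilities across models and pick the argmax.

    probabilities_per_model: one probability list per model, for ONE sample.
    """
    n_models = len(probabilities_per_model)
    n_classes = len(probabilities_per_model[0])
    averaged = [
        sum(model_probs[c] for model_probs in probabilities_per_model) / n_models
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged

# Made-up probabilities from three models for one document and three classes.
model_outputs = [
    [0.50, 0.45, 0.05],   # e.g. logistic regression: narrowly prefers class 0
    [0.40, 0.55, 0.05],   # e.g. complement naive Bayes: prefers class 1
    [0.10, 0.80, 0.10],   # e.g. gradient boosting: confident in class 1
]
winner, averaged = soft_vote(model_outputs)
print(winner, averaged)
```

Because probabilities are averaged rather than counted as votes, a confident model can outweigh a model that is only narrowly leaning the other way.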


In [306]:
# Vectorization: Word2Vec
from gensim.models import Word2Vec

# First tokenize the sentences on whitespace. This was not needed for the DTM because CountVectorizer tokenizes internally (its default pattern keeps tokens of two or more word characters).
x_train_tokenized = [sentence.split() for sentence in x_train]
x_test_tokenized = [sentence.split() for sentence in x_test]

# Try increasing or decreasing vector_size (gensim's default is 100)
model = Word2Vec(sentences = x_train_tokenized, vector_size = 256, window = 5, min_count = 5, workers = 4, sg = 0)
print("Model training complete!")

Model training complete!


In [307]:
# Check whether Word2Vec trained properly -> it seems to have worked, more or less
model_result = model.wv.most_similar('man')
print(model_result)

[('ontario', 0.8736326098442078), ('colony', 0.8663220405578613), ('olivetti', 0.8608534336090088), ('iowa', 0.8515343070030212), ('glenn', 0.8496575951576233), ('okla', 0.8478798866271973), ('cawl', 0.8436609506607056), ('alliance', 0.8409212827682495), ('iii', 0.8394174575805664), ('hazardous', 0.8388103246688843)]


In [308]:
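# Trained Word2Vec model
w2v_model = model

# Turn each tokenized sentence into a (max_len, vector_size) array
def vectorize_sentence(sentence, model, max_len):
    vecs = []
    for word in sentence:
        if word in model.wv:
            vecs.append(model.wv[word])
        else:
            # Out-of-vocabulary word (e.g. dropped by min_count): use a zero vector.
            # float32 matches the Word2Vec vectors and halves the memory of the padded array.
            vecs.append(np.zeros(model.vector_size, dtype=np.float32))
    # Pad with zero vectors, or truncate, to exactly max_len tokens
    if len(vecs) < max_len:
        vecs += [np.zeros(model.vector_size, dtype=np.float32)] * (max_len - len(vecs))
    else:
        vecs = vecs[:max_len]
    return np.array(vecs)


# Choose the maximum sentence length carefully
x_train_w2v = np.array([vectorize_sentence(s, w2v_model, max_len=100) for s in x_train_tokenized])
x_test_w2v = np.array([vectorize_sentence(s, w2v_model, max_len=100) for s in x_test_tokenized])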


### Model definition and experiments

In [310]:
# Train an XGBoost model on the TF-IDF data

# This one also takes a while!

from xgboost import XGBClassifier

# Train the XGBoost model
xgb_model = XGBClassifier(n_estimators=100, max_depth=5, eval_metric='mlogloss')
xgb_model.fit(x_train_tfidf, y_train)

In [311]:
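# Predict
y_pred = xgb_model.predict(x_test_tfidf)

# Evaluation metrics
# Note: this cell uses average='macro', while the other cells use 'weighted',
# so this F1-score is not directly comparable with the others.
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')

print(f"✅ Accuracy : {acc:.4f}")
print(f"✅ F1-score : {f1:.4f}")

✅ Accuracy : 0.7939
✅ F1-score : 0.6410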
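
The cell above computes F1 with `average='macro'`, while the other cells use `average='weighted'`. The following plain-Python sketch shows how the two averages can diverge on imbalanced labels. The labels are made up, and the helpers are a simplified re-implementation that only considers labels present in `y_true`; they are not the scikit-learn functions.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label present in y_true."""
    results = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        results[label] = (f1, tp + fn)
    return results

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by class support."""
    scores = per_class_f1(y_true, y_pred)
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total

# Made-up labels: class 0 is frequent and predicted well, class 1 is rare and always missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

With many rare classes, as in a 46-class dataset, a macro score well below the weighted score is plausible, so the two numbers should not be placed side by side.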


In [312]:
# The data must go from word level to sentence level, because these ML models only accept 2-D input
# For each sentence, take the average of its token vectors.

# Word2Vec embedding sequences: (8982, 100, 256)
x_w2v_seq_train = x_train_w2v
x_w2v_seq_test = x_test_w2v
# Mean pooling → (8982, 256)
x_w2v_avg_train = np.mean(x_w2v_seq_train, axis=1)
x_w2v_avg_test = np.mean(x_w2v_seq_test, axis=1)
print(x_w2v_avg_train.shape)  # (8982, 256)

(8982, 256)
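
One thing to keep in mind with the plain `np.mean` above: zero-padding rows are part of the average, so short articles get vectors shrunk toward zero. A possible alternative is a masked mean that divides by the real token count. This is a sketch on toy data; `masked_mean_pool` and `lengths` are names introduced here, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(sequences, lengths):
    """Average only the first lengths[i] time steps of each sequence.

    sequences: array of shape (n_samples, max_len, dim), zero-padded at the end.
    lengths:   number of real (non-padding) tokens per sample.
    """
    sequences = np.asarray(sequences, dtype=np.float32)
    lengths = np.asarray(lengths)
    max_len = sequences.shape[1]
    # mask[i, t] is 1.0 for real tokens and 0.0 for padding positions.
    mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(np.float32)
    summed = (sequences * mask[:, :, None]).sum(axis=1)
    counts = np.maximum(lengths, 1)[:, None]  # avoid dividing by zero
    return summed / counts

# Toy batch: 2 samples, max_len 4, dim 2; the first sample has only 2 real tokens.
batch = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
])
lengths = [2, 4]

plain = batch.mean(axis=1)                 # padding included in the average
masked = masked_mean_pool(batch, lengths)  # padding excluded
print(plain[0], masked[0])
```

In the notebook, the lengths could presumably be taken as `min(len(s), 100)` for each tokenized sentence. Whether this changes the XGBoost scores was not tested here.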


In [313]:
# Train an XGBoost model on the Word2Vec data
from xgboost import XGBClassifier


# Train the XGBoost model
xgb_model = XGBClassifier(n_estimators=100, max_depth=5, eval_metric='mlogloss')
xgb_model.fit(x_w2v_avg_train, y_train)

In [316]:
# Predict
y_pred = xgb_model.predict(x_w2v_avg_test)

# Evaluation metrics
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"✅ Accuracy : {acc:.4f}")
print(f"✅ F1-score : {f1:.4f}")

✅ Accuracy : 0.7315
✅ F1-score : 0.7130


### Dense NN (deep learning)

In [317]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, LSTM


dense_model = Sequential([
    Flatten(input_shape=(100, 256)),  # (seq_len, embedding_dim)
    Dense(512, activation='relu'),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(46, activation='softmax')   # output size must match the number of classes (46)
])

dense_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
dense_model.summary()



In [318]:
# This takes a while, about 20 minutes.
dense_model.fit(x_train_w2v, y_train, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 15s 63ms/step - accuracy: 0.5237 - loss: 2.2640 - val_accuracy: 0.6706 - val_loss: 1.5005
Epoch 2/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 15s 65ms/step - accuracy: 0.6776 - loss: 1.3335 - val_accuracy: 0.6850 - val_loss: 1.4119
Epoch 3/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 15s 68ms/step - accuracy: 0.7434 - loss: 1.0799 - val_accuracy: 0.7023 - val_loss: 1.3738
Epoch 4/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 14s 61ms/step - accuracy: 0.7829 - loss: 0.8989 - val_accuracy: 0.6861 - val_loss: 1.4266
Epoch 5/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 15s 67ms/step - accuracy: 0.8116 - loss: 0.7458 - val_accuracy: 0.6989 - val_loss: 1.4727
Epoch 6/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 16s 71ms/step - accuracy: 0.8444 - loss: 0.6362 - val_accuracy: 0.6873 - val_loss: 1.5400
Epoch 7/10

<keras.src.callbacks.history.History at 0x1a198877b60>
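
In the log above, val_loss is lowest at epoch 3 and rises afterwards while training accuracy keeps climbing, which looks like overfitting. An early-stopping rule with a patience setting would likely have halted training sooner. The sketch below applies such a rule in plain Python to the six logged val_loss values; `best_epoch_with_patience` is a hypothetical helper, and in Keras the `EarlyStopping` callback (with `monitor='val_loss'` and `patience`) plays this role.

```python
def best_epoch_with_patience(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# val_loss values for epochs 1-6, copied from the training log.
logged_val_loss = [1.5005, 1.4119, 1.3738, 1.4266, 1.4727, 1.5400]
best_epoch, stop_epoch = best_epoch_with_patience(logged_val_loss, patience=2)
print(best_epoch, stop_epoch)
```

With a patience of 2 (an arbitrary choice for this sketch), the rule keeps the epoch 3 weights and stops after epoch 5.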

In [319]:
y_pred_proba = dense_model.predict(x_test_w2v)
y_pred = np.argmax(y_pred_proba, axis=1)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"✅ Accuracy: {acc:.4f}")
print(f"✅ F1-score: {f1:.4f}")

71/71 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step
✅ Accuracy: 0.6955
✅ F1-score: 0.6724


### RNN deep learning model

In [320]:
rnn_model = Sequential([
    LSTM(128, input_shape=(100, 256)),  # (seq_len, embedding_dim)
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(46, activation='softmax')   # output size must match the number of classes (46)
])

rnn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
rnn_model.summary()



In [321]:
# This takes a while, about 20 minutes.
rnn_model.fit(x_train_w2v, y_train, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 20s 81ms/step - accuracy: 0.4145 - loss: 2.6705 - val_accuracy: 0.6077 - val_loss: 1.6617
Epoch 2/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 22s 98ms/step - accuracy: 0.5892 - loss: 1.7294 - val_accuracy: 0.6422 - val_loss: 1.6663
Epoch 3/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 22s 95ms/step - accuracy: 0.5448 - loss: 1.8594 - val_accuracy: 0.6066 - val_loss: 1.5554
Epoch 4/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 24s 106ms/step - accuracy: 0.6117 - loss: 1.5729 - val_accuracy: 0.6711 - val_loss: 1.3957
Epoch 5/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 25s 113ms/step - accuracy: 0.6766 - loss: 1.3516 - val_accuracy: 0.6566 - val_loss: 1.3593
Epoch 6/10
225/225 ━━━━━━━━━━━━━━━━━━━━ 23s 102ms/step - accuracy: 0.6686 - loss: 1.3375 - val_accuracy: 0.7045 - val_loss: 1.2546
Epoch 7/10

<keras.src.callbacks.history.History at 0x1a198c35df0>

In [322]:
y_pred_proba = rnn_model.predict(x_test_w2v)
y_pred = np.argmax(y_pred_proba, axis=1)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"✅ Accuracy: {acc:.4f}")
print(f"✅ F1-score: {f1:.4f}")

71/71 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step
✅ Accuracy: 0.7070
✅ F1-score: 0.6674
