<a href="https://colab.research.google.com/github/12park1jiho/mbti-behavior-predictor/blob/main/mbti-behavior-predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

✅ Overall project roadmap (open-source based)

1️⃣ Data acquisition and cleaning (done)

✅ MBTI + posts data obtained from Kaggle (datasnaek/mbti-type)

⚙️ Loading the CSV directly failed → retrying via Google Drive mount

In [3]:
# 1. Install libraries
!pip install kagglehub --upgrade
!pip install openai pandas tqdm

# 2. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 3. Download the dataset with KaggleHub
import kagglehub
import os

path = kagglehub.dataset_download("datasnaek/mbti-type")
print("✅ Dataset download complete:", path)

# List the downloaded files to confirm the actual file names
for root, dirs, files in os.walk(path):
    for f in files:
        print("✔ Found:", os.path.join(root, f))

# 4. Locate the data file
# If several .csv files exist, the last one found is kept
file_path = None
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".csv"):
            file_path = os.path.join(root, file)

# If a CSV was found, copy it to Drive
if file_path:
    drive_path = "/content/drive/MyDrive/mbti_data"
    os.makedirs(drive_path, exist_ok=True)

    dest_path = os.path.join(drive_path, os.path.basename(file_path))
    !cp "{file_path}" "{dest_path}"
    print(f"✅ File copied to Google Drive: {dest_path}")
else:
    print("❌ No CSV file found.")


Collecting kagglehub
  Downloading kagglehub-0.3.11-py3-none-any.whl.metadata (32 kB)
Downloading kagglehub-0.3.11-py3-none-any.whl (63 kB)
Installing collected packages: kagglehub
  Attempting uninstall: kagglehub
    Found existing installation: kagglehub 0.3.10
    Uninstalling kagglehub-0.3.10:
      Successfully uninstalled kagglehub-0.3.10
Successfully installed kagglehub-0.3.11
Mounted at /content/drive
✅ Dataset download complete: /root/.cache/kagglehub/datasets/datasnaek/mbti-type/versions/1
✔ Found: /root/.cache/kagglehub/datasets/datasnaek/mbti-type/versions/1/mbti_1.csv
✅ File copied to Google Drive: /content/drive/MyDrive/mbti_data/mbti_1.csv


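2️⃣ Data preprocessing

🔹 Columns: type (MBTI), posts (user posts)

🔹 Text cleaning: remove links and symbols, split into sentences, etc.

🔹 Sampling: secure at least a minimum number of rows per MBTI type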
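
The per-type sampling step in the list above could look roughly like the sketch below. It is only an illustration: plain dictionaries stand in for DataFrame rows, and the names `sample_per_type`, `n_per_type`, and `min_per_type` are hypothetical, not part of the notebook.

```python
import random
from collections import defaultdict

def sample_per_type(rows, n_per_type, min_per_type=1, seed=42):
    """Cap each MBTI type at n_per_type rows and drop types with too few rows."""
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["type"]].append(row)

    rng = random.Random(seed)  # seeded so the sample is reproducible
    sampled = []
    for mbti in sorted(by_type):
        group = by_type[mbti]
        if len(group) < min_per_type:
            continue  # not enough data for this type
        rng.shuffle(group)
        sampled.extend(group[:n_per_type])
    return sampled

# Toy rows (made up for illustration): 5 INFJ, 3 ENTP, 1 ESTJ
rows = (
    [{"type": "INFJ", "clean_post": f"infj post {i}"} for i in range(5)]
    + [{"type": "ENTP", "clean_post": f"entp post {i}"} for i in range(3)]
    + [{"type": "ESTJ", "clean_post": "only one post"}]
)
balanced = sample_per_type(rows, n_per_type=3, min_per_type=2)
print(len(balanced))
```

With a DataFrame, the same idea is usually expressed by shuffling the rows and keeping the first `n` per group.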

In [12]:
import pandas as pd
import re

# Load the CSV file
file_path = '/content/drive/MyDrive/mbti_data/mbti_1.csv'  # adjust to your own Drive path
df = pd.read_csv(file_path)

# 🔹 1. Drop rows with NaN
df = df.dropna()

# 🔹 2. Normalize MBTI types to upper case
df['type'] = df['type'].str.upper()

# 🔹 3. Split posts on the "|||" delimiter (each piece is treated as one sentence)
df['sentences'] = df['posts'].apply(lambda x: x.split("|||"))

# 🔹 4. Limit the number of sentences per user (here: the first 20)
MAX_SENTENCES = 20
df['sentences'] = df['sentences'].apply(lambda x: x[:MAX_SENTENCES])

# 🔹 5. Use explode() to get one row per sentence
df = df.explode('sentences').reset_index(drop=True)

# 🔹 6. Remove links and special characters
def clean_text(text):
    text = re.sub(r"http\S+", "", text)              # remove links
    text = re.sub(r"[^a-zA-Z0-9.,!?'\s]", "", text)   # remove special characters
    return text.strip()

df['clean_post'] = df['sentences'].apply(clean_text)

# 🔹 7. Filter out sentences that are too short (10 characters or fewer)
df = df[df['clean_post'].str.len() > 10].reset_index(drop=True)

# Check the result
df[['type', 'clean_post']].head(10)


Unnamed: 0,type,clean_post
0,INFJ,enfp and intj moments sportscenter not top ...
1,INFJ,What has been the most lifechanging experience...
2,INFJ,On repeat for most of today.
3,INFJ,May the PerC Experience immerse you.
4,INFJ,The last thing my INFJ friend posted on his fa...
5,INFJ,Hello ENFJ7. Sorry to hear of your distress. I...
6,INFJ,84389 84390 ...
7,INFJ,Welcome and stuff.
8,INFJ,Game. Set. Match.
9,INFJ,"Prozac, wellbrutin, at least thirty minutes of..."
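
Row 6 of the preview appears to be mostly numeric residue, and removing links leaves runs of spaces behind. A stricter variant of the cleaning step might look like the sketch below; `clean_text_strict` and `has_words` are hypothetical helpers, not part of the notebook, and the sample strings are made up.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]{2,}")  # at least one run of two or more letters

def clean_text_strict(text):
    """Same two substitutions as clean_text, plus whitespace collapsing."""
    text = re.sub(r"http\S+", "", text)               # remove links
    text = re.sub(r"[^a-zA-Z0-9.,!?'\s]", "", text)   # remove special characters
    return re.sub(r"\s+", " ", text).strip()          # collapse leftover gaps

def has_words(text):
    """True if the text contains something that looks like a word."""
    return WORD_RE.search(text) is not None

samples = [
    "Check this https://youtu.be/abc123 :) it's great!!",
    "84389 84390",
    "Welcome and stuff.",
]
cleaned = [clean_text_strict(s) for s in samples]
kept = [c for c in cleaned if has_words(c)]
print(kept)
```

Filtering on `has_words` in addition to the length check would drop digits-only rows before labeling.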


In [17]:
# 1. Install
!pip install -q transformers sentencepiece

# 2. Load the model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to("cuda" if torch.cuda.is_available() else "cpu")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



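3️⃣ Defining behavior-pattern labels

🔥 Method 1: cluster first, then attach behavior labels (e.g. prefers outdoor activities, prefers time alone)

🔥 Method 2: map keywords by hand, or label with GPT and then train a model

Example: posts → GPT → "This person enjoys writing alone at night." → behavior pattern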

In [30]:
# Candidate labels
LABEL_CANDIDATES = [
    "Prefers alone time", "Enjoys social activities", "Spontaneous outdoor behavior",
    "Planned and structured actions", "Emotional expression", "Self-reflective habits",
    "Intellectual exploration", "Practical problem solving", "Healing in nature", "Desire for recognition"
]

def build_prompt(mbti, post):
    label_options = ", ".join(LABEL_CANDIDATES)
    return (
        f"Based on the MBTI type and sentence below, choose the most appropriate behavioral pattern from the list.\n"
        f"MBTI: {mbti}\n"
        f"Sentence: {post}\n"
        f"Choices: {label_options}\n"
        f"Answer: "
    )

def classify_with_flan(mbti, post, max_tokens=32):
    prompt = build_prompt(mbti, post)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_tokens)
    result = tokenizer.decode(output[0], skip_special_tokens=True)
    return result.strip()
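
A generative model is not guaranteed to answer with one of the listed choices verbatim, so the raw output may need to be mapped back onto the candidate list. The sketch below shows one way to do that with the standard library; `normalize_label` and its `cutoff` value are assumptions for illustration, not part of the notebook.

```python
import difflib

LABEL_CANDIDATES = [
    "Prefers alone time", "Enjoys social activities", "Spontaneous outdoor behavior",
    "Planned and structured actions", "Emotional expression", "Self-reflective habits",
    "Intellectual exploration", "Practical problem solving", "Healing in nature", "Desire for recognition"
]

def normalize_label(raw, candidates=LABEL_CANDIDATES, cutoff=0.6):
    """Map free-text model output to the closest candidate label, or None."""
    lowered = {c.lower(): c for c in candidates}
    key = raw.strip().strip(".\"'").lower()
    if key in lowered:                      # exact match, ignoring case and punctuation
        return lowered[key]
    close = difflib.get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None  # None marks an unusable answer

print(normalize_label("prefers alone time."))
print(normalize_label("Enjoy social activity"))
print(normalize_label("12345"))
```

Rows that come back as `None` can be dropped or re-prompted before training, which keeps stray strings out of the label set.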


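4️⃣ Model design (free / open-source LLM based)

🎯 Goal: MBTI + post → generate behavior-pattern text

✅ Selected model: google/flan-t5-base
🚀 Open source / free

🤏 Light and fast (runs well even on Colab)

🧠 Instruction-tuned (follows the instruction given in the prompt)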

In [31]:
import pandas as pd
from tqdm import tqdm

# 🔹 Load the previously saved sample data
df = pd.read_csv("/content/drive/MyDrive/mbti_data/labeling_sample.csv")

# 🔹 Enable tqdm progress bars for pandas
tqdm.pandas()

# 🔹 Label each row with flan-t5
df["label"] = df.progress_apply(lambda row: classify_with_flan(row['type'], row['clean_post']), axis=1)

# 🔹 Save the result
df.to_csv("/content/drive/MyDrive/mbti_data/labeling_sample_labeled_flan.csv", index=False)
print("✅ flan-t5 auto-labeling finished and saved!")


100%|██████████| 100/100 [01:56<00:00,  1.17s/it]

✅ flan-t5 auto-labeling finished and saved!





In [32]:
# Install (only if needed)
!pip install -q scikit-learn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the labeled data
file_path = "/content/drive/MyDrive/mbti_data/labeling_sample_labeled_flan.csv"
df = pd.read_csv(file_path)

# Drop rows with missing values
df = df.dropna(subset=['type', 'clean_post', 'label'])

# Input: MBTI type combined with the post text
df["input_text"] = df["type"] + " / " + df["clean_post"]

X_train, X_test, y_train, y_test = train_test_split(
    df["input_text"], df["label"], test_size=0.2, random_state=42
)

# Simple ML pipeline (vectorizer + classifier)
clf = make_pipeline(
    TfidfVectorizer(max_features=1000),
    LogisticRegression(max_iter=1000)
)

# Train the model
clf.fit(X_train, y_train)


In [33]:
# Predict and evaluate
y_pred = clf.predict(X_test)
print("✅ Evaluation results:")
print(classification_report(y_test, y_pred))

✅ Evaluation results:
                              precision    recall  f1-score   support

        Emotional expression       0.00      0.00      0.00         4
    Enjoys social activities       0.50      0.20      0.29         5
          Prefers alone time       0.29      1.00      0.45         5
Spontaneous outdoor behavior       0.00      0.00      0.00         6

                    accuracy                           0.30        20
                   macro avg       0.20      0.30      0.19        20
                weighted avg       0.20      0.30      0.19        20





In [34]:
sample_input = "ISTJ / I usually spend Friday nights reading or organizing my plans for the next week."
predicted_label = clf.predict([sample_input])[0]
print(f"🔍 Predicted behavior pattern: {predicted_label}")

🔍 Predicted behavior pattern: Prefers alone time


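✅ Performance summary

Item	Details
Accuracy	30%: only slightly better than uniform random guessing over the 4 test-set classes (25%)

Labels in the dataset	4 (only some of the labels are ever predicted)

Problems	Class imbalance + some labels never predicted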
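
Uniform random guessing is one reference point; always predicting the most frequent class is another. The sketch below computes both from the support column of the report above. The variable names are illustrative only.

```python
# Support counts copied from the classification report above (test set, 20 rows)
support = {
    "Emotional expression": 4,
    "Enjoys social activities": 5,
    "Prefers alone time": 5,
    "Spontaneous outdoor behavior": 6,
}

total = sum(support.values())

# Baseline 1: pick one of the classes uniformly at random
uniform_random = 1 / len(support)

# Baseline 2: always predict the most frequent class in the test set
majority_label = max(support, key=support.get)
majority_baseline = support[majority_label] / total

model_accuracy = 0.30  # reported accuracy of the TF-IDF + logistic regression model

print(f"uniform random   : {uniform_random:.2f}")
print(f"majority class   : {majority_baseline:.2f} ({majority_label})")
print(f"model (reported) : {model_accuracy:.2f}")
```

With these counts the majority-class baseline is 6/20 = 0.30, the same as the reported accuracy, so the model is not yet beating a constant prediction on this test split.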

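4️⃣ Model design (free / open-source LLM based)

🎯 Goal: MBTI + post → generate behavior-pattern text

🧩 Model used: distilbert-base-uncased
Optimized for English data

Lighter and faster than BERT

🛠️ Step 1: Install the required libraries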

In [35]:
!pip install transformers datasets scikit-learn

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.w

🧠 Step 2: Prepare the dataset

In [36]:
import pandas as pd

# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/mbti_data/labeling_sample_labeled_flan.csv")
df = df.dropna(subset=["type", "clean_post", "label"])

# Input: MBTI type combined with the text
df["text"] = df["type"] + " / " + df["clean_post"]

# Encode the labels as integers
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df["label_id"] = label_encoder.fit_transform(df["label"])


🧪 Step 3: Split the dataset

In [37]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"].tolist(), df["label_id"].tolist(), test_size=0.2, random_state=42
)
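
With only a handful of examples per label, a purely random split can leave label proportions uneven between train and test. scikit-learn's `train_test_split` has a `stratify` parameter for this; the dependency-free sketch below only illustrates the idea, and `stratified_split` is a hypothetical helper, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Split so that each label contributes roughly test_size of its rows to the test set."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    test_idx = set()
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        if len(idxs) >= 2:
            n_test = max(1, n_test)  # keep at least one test row per label when possible
        test_idx.update(idxs[:n_test])

    train = [(texts[i], labels[i]) for i in range(len(labels)) if i not in test_idx]
    test = [(texts[i], labels[i]) for i in range(len(labels)) if i in test_idx]
    return train, test

# Toy data (made up): 10 rows of label 0 and 5 rows of label 1
toy_labels = [0] * 10 + [1] * 5
toy_texts = [f"text {i}" for i in range(len(toy_labels))]
train_pairs, test_pairs = stratified_split(toy_texts, toy_labels)
print(len(train_pairs), len(test_pairs))
```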

🤗 Step 4: Build the tokenizer and dataset

In [38]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

import torch

class MBTIDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {
            key: torch.tensor(val[idx]) for key, val in self.encodings.items()
        } | {"labels": torch.tensor(self.labels[idx])}

    def __len__(self):
        return len(self.labels)

train_dataset = MBTIDataset(train_encodings, train_labels)
test_dataset = MBTIDataset(test_encodings, test_labels)



🧠 Step 5: Configure and train the DistilBERT model

In [40]:
import os
os.environ["WANDB_DISABLED"] = "true"  # turn off wandb logging

from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(label_encoder.classes_)
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="no",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss
1,1.5525,1.488885
2,1.4288,1.453842
3,1.3525,1.446547


TrainOutput(global_step=30, training_loss=1.4445861180623372, metrics={'train_runtime': 132.2674, 'train_samples_per_second': 1.815, 'train_steps_per_second': 0.227, 'total_flos': 4160526818400.0, 'train_loss': 1.4445861180623372, 'epoch': 3.0})
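
The `global_step=30` in the output above can be cross-checked against the training configuration. The sketch assumes a single device, no gradient accumulation, and that none of the 100 labeled rows were dropped by `dropna`; the variable names are illustrative.

```python
import math

n_rows = 100        # rows labeled by flan-t5 (progress bar: 100/100)
n_test = 20         # test rows (support total in the classification reports)
batch_size = 8      # per_device_train_batch_size
epochs = 3          # num_train_epochs
train_runtime = 132.2674  # seconds, from TrainOutput

n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = n_train * epochs / train_runtime

print(n_train, steps_per_epoch, total_steps)
print(round(samples_per_second, 3))
```

Under these assumptions the numbers line up with the reported `global_step` and `train_samples_per_second`, which suggests the model saw only about 80 training examples per epoch.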

✅ Prediction and evaluation

In [42]:
from sklearn.metrics import classification_report
import numpy as np

# Run predictions
predictions = trainer.predict(test_dataset)
y_pred = np.argmax(predictions.predictions, axis=1)
y_true = predictions.label_ids

# Class indices that actually appear in y_true or y_pred
unique_labels = sorted(np.unique(np.concatenate([y_true, y_pred])))

# Keep only the class names for those indices
target_names = label_encoder.inverse_transform(unique_labels)

# Print the evaluation
print("✅ DistilBERT model evaluation:")
print(classification_report(y_true, y_pred, labels=unique_labels, target_names=target_names))


✅ DistilBERT model evaluation:
                              precision    recall  f1-score   support

        Emotional expression       0.00      0.00      0.00         4
    Enjoys social activities       0.00      0.00      0.00         5
          Prefers alone time       0.25      1.00      0.40         5
Spontaneous outdoor behavior       0.00      0.00      0.00         6

                    accuracy                           0.25        20
                   macro avg       0.06      0.25      0.10        20
                weighted avg       0.06      0.25      0.10        20
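
The report above, with recall 1.00 for one class and zeros everywhere else, is consistent with the model predicting "Prefers alone time" for every test row. The sketch below recomputes the averaged rows from the per-class values to show how they arise; the numbers are copied from the report and the names are illustrative.

```python
# (precision, recall, support) per class, copied from the DistilBERT report
rows = {
    "Emotional expression": (0.00, 0.00, 4),
    "Enjoys social activities": (0.00, 0.00, 5),
    "Prefers alone time": (0.25, 1.00, 5),
    "Spontaneous outdoor behavior": (0.00, 0.00, 6),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

total = sum(s for _, _, s in rows.values())
f1_scores = {name: f1(p, r) for name, (p, r, _) in rows.items()}

# Macro average: every class counts equally
macro_precision = sum(p for p, _, _ in rows.values()) / len(rows)
macro_f1 = sum(f1_scores.values()) / len(rows)

# Weighted average: each class is weighted by its support
weighted_f1 = sum(f1_scores[name] * s for name, (_, _, s) in rows.items()) / total

print(f"macro precision: {macro_precision:.2f}")
print(f"macro f1       : {macro_f1:.2f}")
print(f"weighted f1    : {weighted_f1:.2f}")
```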





In [None]:
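# Move to the project directory (placeholder path; e.g. /content/mbti_project)
%cd ~/your_project_directory

# Initialize git
!git init

# Connect the GitHub repository (replace YOUR_USERNAME with your GitHub ID)
!git remote add origin https://github.com/YOUR_USERNAME/mbti-behavior-predictor.git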
