## Cross-Encoder + LightGBM Ensemble Model

This notebook builds a two-stage ensemble model that predicts whether a comment violates a community rule.

### **Methodology**
1.  **Feature Engineering**: Creates numerical and categorical features based on EDA insights.
2.  **Cross-Encoder Training**: A `deberta-v3-small` model is fine-tuned to understand the semantic relationship between a rule and a comment, generating a 'semantic score'.
3.  **LightGBM Training**: An LGBM model is trained on a combination of the engineered features and the semantic score from the Cross-Encoder.
4.  **Ensemble Pipeline**: The final model uses this two-stage process for prediction.

In [1]:
import numpy as np
import pandas as pd
import os
import pickle
import joblib
import re
import torch
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Check whether a GPU is available
print(f"GPU available: {torch.cuda.is_available()}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

GPU available: True
Using device: cuda


In [2]:
# ==========================================================
# 📊 Load the data and inspect basic information
# ==========================================================
# Load the training data from a local path.
# If you don't have 'train.csv', create a dummy file to run the notebook
if not os.path.exists('train.csv'):
    dummy_data = {
        'body': ['This is a great post!', 'Check out my website www.spam.com', 'I disagree with this rule.', 'legal advice is not allowed here', 'where can i watch the game?', 'no advertising please'],
        'rule': ['Be nice', 'No spam', 'Follow the rules', 'No legal advice', 'No illegal content', 'No Advertising'],
        'subreddit': ['hearthstone', 'soccerstreams', 'legaladvice', 'legaladvice', 'soccerstreams', 'sex'],
        'rule_violation': [0, 1, 0, 1, 1, 1],
        'positive_example_1': [np.nan, 'our product is the best', np.nan, 'asking for a lawyer is legal advice', 'youtube.com/stream', 'dont promote your onlyfans'],
        'negative_example_1': ['thanks for sharing', np.nan, 'I love this sub', 'I am not a lawyer but...', 'what time is the match?', 'i have a question about my body']
    }
    train_df = pd.DataFrame(dummy_data)
    train_df.to_csv('train.csv', index=False)
    print("Dummy 'train.csv' created.")

train_df = pd.read_csv('train.csv')

print(f"🔍 Data shape: {train_df.shape}")
print(f"🎯 Target distribution: {train_df['rule_violation'].value_counts().to_dict()}")
print("\n📋 Data sample:")
display(train_df.head())

🔍 Data shape: (2029, 9)
🎯 Target distribution: {1: 1031, 0: 998}

📋 Data sample:


Unnamed: 0,row_id,body,rule,subreddit,positive_example_1,positive_example_2,negative_example_1,negative_example_2,rule_violation
0,0,Banks don't want you to know this! Click here ...,"No Advertising: Spam, referral links, unsolici...",Futurology,If you could tell your younger self something ...,hunt for lady for jack off in neighbourhood ht...,Watch Golden Globe Awards 2017 Live Online in ...,"DOUBLE CEE x BANDS EPPS - ""BIRDS""\n\nDOWNLOAD/...",0
1,1,SD Stream [ ENG Link 1] (http://www.sportsstre...,"No Advertising: Spam, referral links, unsolici...",soccerstreams,[I wanna kiss you all over! Stunning!](http://...,LOLGA.COM is One of the First Professional Onl...,#Rapper \n🚨Straight Outta Cross Keys SC 🚨YouTu...,[15 Amazing Hidden Features Of Google Search Y...,0
2,2,Lol. Try appealing the ban and say you won't d...,No legal advice: Do not offer or request legal...,pcmasterrace,Don't break up with him or call the cops. If ...,It'll be dismissed: https://en.wikipedia.org/w...,Where is there a site that still works where y...,Because this statement of his is true. It isn'...,1
3,3,she will come your home open her legs with an...,"No Advertising: Spam, referral links, unsolici...",sex,Selling Tyrande codes for 3€ to paypal. PM. \n...,tight pussy watch for your cock get her at thi...,NSFW(obviously) http://spankbang.com/iy3u/vide...,Good News ::Download WhatsApp 2.16.230 APK for...,1
4,4,code free tyrande --->>> [Imgur](http://i.imgu...,"No Advertising: Spam, referral links, unsolici...",hearthstone,wow!! amazing reminds me of the old days.Well...,seek for lady for sex in around http://p77.pl/...,must be watch movie https://sites.google.com/s...,We're streaming Pokemon Veitnamese Crystal RIG...,1


In [3]:
# ==========================================================
# 🛠️ Feature engineering function definitions
# ==========================================================
def count_urls(text):
    """Count http(s) links and bare www. hosts in the text."""
    return len(re.findall(r'https?://\S+|www\.\S+', str(text)))

def count_exclaims(text):
    """Count exclamation marks."""
    return str(text).count('!')

def count_questions(text):
    """Count question marks."""
    return str(text).count('?')

def upper_ratio(text):
    """Return the fraction of alphabetic characters that are uppercase."""
    s = str(text)
    letters = [c for c in s if c.isalpha()]
    if not letters:
        return 0.0
    upp = sum(1 for c in letters if c.isupper())
    return upp / len(letters)

def repeat_char_max(text):
    """Return the length of the longest run of one repeated character (minimum 1)."""
    longest = 1
    last = ''
    cur = 0
    for ch in str(text):
        if ch == last:
            cur += 1
        else:
            longest = max(longest, cur)
            cur = 1
            last = ch
    longest = max(longest, cur)
    return longest

def jaccard_similarity(text1, text2):
    """Return the Jaccard overlap of the two texts' lowercased whitespace tokens."""
    set1 = set(str(text1).lower().split())
    set2 = set(str(text2).lower().split())
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union > 0 else 0.0

def create_features(df):
    """Return a copy of df with length, style, and rule-comment overlap features added."""
    df = df.copy()
    print("📏 Creating basic text features...")
    df['body_len'] = df['body'].astype(str).str.len()
    df['rule_len'] = df['rule'].astype(str).str.len()
    df['body_words'] = df['body'].astype(str).str.split().str.len()
    print("🎨 Creating style features...")
    df['url_cnt'] = df['body'].apply(count_urls)
    df['exc_cnt'] = df['body'].apply(count_exclaims)
    df['q_cnt'] = df['body'].apply(count_questions)
    df['upper_rt'] = df['body'].apply(upper_ratio)
    df['rep_run'] = df['body'].apply(repeat_char_max)
    print("🔗 Creating rule-comment interaction features...")
    df['rule_body_jaccard'] = [jaccard_similarity(rule, body) for rule, body in zip(df['rule'], df['body'])]
    print("✅ Feature creation complete!")
    return df

def prepare_cross_encoder_input(rule, body, positive_ex1=None, negative_ex1=None):
    # The Korean marker tokens are kept verbatim because later cells split on '[댓글]':
    # [규칙] = rule, [긍정예시] = positive example, [부정예시] = negative example, [댓글] = comment.
    rule_text = str(rule).strip()
    comment_text = str(body).strip()
    examples_text = ""
    if pd.notna(positive_ex1) and str(positive_ex1).strip():
        examples_text += f" [긍정예시] {str(positive_ex1).strip()}"
    if pd.notna(negative_ex1) and str(negative_ex1).strip():
        examples_text += f" [부정예시] {str(negative_ex1).strip()}"
    full_input = f"[규칙] {rule_text}{examples_text} [댓글] {comment_text}"
    return full_input

# ==========================================================
# 📊 Run preprocessing and feature creation
# ==========================================================
print("🔧 Starting data preprocessing...")
train_df_featured = create_features(train_df)

🔧 Starting data preprocessing...
📏 Creating basic text features...
🎨 Creating style features...
🔗 Creating rule-comment interaction features...
✅ Feature creation complete!
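
As a quick sanity check of the hand-crafted features, the sketch below re-implements three of them in self-contained form and applies them to made-up strings. The sample strings are illustrative assumptions, not rows from `train.csv`.

```python
import re

def count_urls(text):
    # Same pattern as the notebook: http(s) links or bare www. hosts
    return len(re.findall(r'https?://\S+|www\.\S+', str(text)))

def upper_ratio(text):
    # Share of alphabetic characters that are uppercase
    letters = [c for c in str(text) if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def jaccard_similarity(text1, text2):
    # Token overlap between two texts after lowercasing
    set1 = set(str(text1).lower().split())
    set2 = set(str(text2).lower().split())
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

# Hypothetical inputs for illustration only
comment = "CLICK here: http://example.com and www.example.org"
print(count_urls(comment))                                          # 2
print(upper_ratio("ABcd"))                                          # 0.5
print(jaccard_similarity("No Advertising", "no advertising please"))  # 2 shared / 3 total tokens
```

Because `jaccard_similarity` splits on whitespace only, punctuation stays attached to tokens, so "advertising:" and "advertising" would not count as a match.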


In [4]:
# ==========================================================
# 🤖 Prepare Cross-Encoder inputs
# ==========================================================
print("🔄 Preparing Cross-Encoder input data...")
ce_inputs = []
labels = []
for _, row in tqdm(train_df.iterrows(), total=len(train_df), desc="Processing CE inputs"):
    ce_input = prepare_cross_encoder_input(
        row['rule'], row['body'],
        row.get('positive_example_1'),
        row.get('negative_example_1')
    )
    ce_inputs.append(ce_input)
    labels.append(int(row['rule_violation']))

ce_inputs = np.array(ce_inputs)
labels = np.array(labels)
print(f"✅ Prepared {len(ce_inputs)} Cross-Encoder input pairs")

🔄 Preparing Cross-Encoder input data...


Processing CE inputs: 100%|██████████| 2029/2029 [00:00<00:00, 20562.90it/s]

✅ Prepared 2029 Cross-Encoder input pairs
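
To make the input format concrete, here is a minimal sketch of how one row becomes a single marked-up string and how that string is later split back into a (rule side, comment side) pair on the `[댓글]` marker. `build_input` and `split_pair` are hypothetical helper names that mirror the notebook's logic without the pandas dependency, and the sample row is made up.

```python
def build_input(rule, body, positive_ex=None, negative_ex=None):
    # Mirrors prepare_cross_encoder_input; None or blank examples are skipped
    examples = ""
    if positive_ex and str(positive_ex).strip():
        examples += f" [긍정예시] {str(positive_ex).strip()}"
    if negative_ex and str(negative_ex).strip():
        examples += f" [부정예시] {str(negative_ex).strip()}"
    return f"[규칙] {str(rule).strip()}{examples} [댓글] {str(body).strip()}"

def split_pair(ce_input, sep="[댓글]"):
    # Split once on the comment marker to recover the two sides of the pair
    if sep in ce_input:
        left, right = ce_input.split(sep, 1)
        return left.strip(), right.strip()
    return ce_input.strip(), ""

# Hypothetical row for illustration
text = build_input("No spam", "Buy now at www.example.com",
                   positive_ex="our product is the best")
rule_side, comment_side = split_pair(text)
print(rule_side)     # [규칙] No spam [긍정예시] our product is the best
print(comment_side)  # Buy now at www.example.com
```

One caveat of this round trip: the split happens at the first occurrence of the marker, so a rule or example text that itself contains `[댓글]` would be cut in the wrong place.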





In [5]:
# ==========================================================
# 🏗️ Stage 1: Train the Cross-Encoder model
# ==========================================================
from sentence_transformers import CrossEncoder, InputExample

print("🏗️ Starting Cross-Encoder training on the full dataset!")
output_model_path = './model_output/final_cross_encoder_model'
os.makedirs(output_model_path, exist_ok=True)

model_name = 'microsoft/deberta-v3-small'
# No max_length is passed, so inputs are not truncated (see the tokenizer warning in the output).
cross_encoder_model = CrossEncoder(model_name, num_labels=1, device=device)

# Convert the data into InputExample objects that the Cross-Encoder can train on.
print("📚 Creating training examples from the full dataset...")
train_examples = []
for i in tqdm(range(len(ce_inputs)), desc="Processing final training data"):
    ce_input = ce_inputs[i]
    if '[댓글]' in ce_input:
        rule_part, comment_part = ce_input.split('[댓글]', 1)
    else:  # Fallback in case the '[댓글]' separator is missing
        rule_part, comment_part = ce_input, ""

    train_examples.append(
        InputExample(texts=[rule_part.strip(), comment_part.strip()], label=float(labels[i]))
    )

# Build the DataLoader from train_examples.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
warmup_steps = max(1, int(len(train_dataloader) * 0.1))

print("🚀 Training the final model (epochs: 4, batch size: 16)")

# 1. No save-related options are passed to fit().
cross_encoder_model.fit(
    train_dataloader=train_dataloader,
    epochs=4,
    warmup_steps=warmup_steps,
    show_progress_bar=True
)

# 2. After training finishes, call .save() explicitly to store the model.
print(f"💾 Saving the model to {output_model_path}...")
cross_encoder_model.save(output_model_path)
print("✅ Model saved!")

🏗️ Starting Cross-Encoder training on the full dataset!


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


📚 Creating training examples from the full dataset...


Processing final training data: 100%|██████████| 2029/2029 [00:00<00:00, 137401.60it/s]

🚀 Training the final model (epochs: 4, batch size: 16)


Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/127 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

💾 Saving the model to ./model_output/final_cross_encoder_model...
✅ Model saved!
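
The warmup setting is easier to judge with the arithmetic written out. The sketch below reproduces the step counts from the values in this notebook (2029 rows, batch size 16, 4 epochs). It assumes the `DataLoader` keeps the last partial batch, which is its default behavior.

```python
import math

n_samples, batch_size, epochs = 2029, 16, 4   # values from this notebook

steps_per_epoch = math.ceil(n_samples / batch_size)   # equals len(train_dataloader)
warmup_steps = max(1, int(steps_per_epoch * 0.1))      # same formula as the training cell
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)                        # 127, matching the Iteration progress bar
print(warmup_steps)                           # 12
print(total_steps)                            # 508
print(round(warmup_steps / total_steps, 3))   # 0.024
```

So the warmup covers 10% of one epoch, which is about 2.4% of the whole run. If 10% of all training steps is the intent, multiplying by the epoch count before taking 10% would be one way to get there.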


In [6]:
# ==========================================================
# 🤖 Stage 2 preparation: generate the semantic feature with the Cross-Encoder
# ==========================================================
print("🔮 Predicting semantic scores with the Cross-Encoder...")

# Convert the inputs into the [rule, comment] pair format expected by predict()
predict_examples = []
for i in tqdm(range(len(ce_inputs)), desc="Converting prediction data"):
    ce_input = ce_inputs[i]
    if '[댓글]' in ce_input:
        rule_part, comment_part = ce_input.split('[댓글]', 1)
    else:
        rule_part, comment_part = ce_input, ""
    predict_examples.append([rule_part.strip(), comment_part.strip()])

# Run prediction. With num_labels=1, CrossEncoder.predict applies a sigmoid by default,
# so the scores fall in [0, 1] rather than being raw logits.
# Note: these scores are computed on the same rows the Cross-Encoder was trained on,
# so they are in-sample and likely optimistic.
ce_predictions = cross_encoder_model.predict(predict_examples, show_progress_bar=True)

# Add the predicted scores to the DataFrame as a new column
train_df_featured['ce_score'] = ce_predictions
print("✅ Semantic score ('ce_score') added to the features.")
display(train_df_featured[['body', 'rule', 'ce_score', 'rule_violation']].head())

🔮 Predicting semantic scores with the Cross-Encoder...


Converting prediction data: 100%|██████████| 2029/2029 [00:00<00:00, 117264.59it/s]


Batches:   0%|          | 0/64 [00:00<?, ?it/s]

✅ Semantic score ('ce_score') added to the features.


Unnamed: 0,body,rule,ce_score,rule_violation
0,Banks don't want you to know this! Click here ...,"No Advertising: Spam, referral links, unsolici...",0.073306,0
1,SD Stream [ ENG Link 1] (http://www.sportsstre...,"No Advertising: Spam, referral links, unsolici...",0.030508,0
2,Lol. Try appealing the ban and say you won't d...,No legal advice: Do not offer or request legal...,0.956159,1
3,she will come your home open her legs with an...,"No Advertising: Spam, referral links, unsolici...",0.968073,1
4,code free tyrande --->>> [Imgur](http://i.imgu...,"No Advertising: Spam, referral links, unsolici...",0.966402,1
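
The `ce_score` values above come from a model that was trained on these same rows, so the second-stage model sees a feature that looks more reliable than it would be on unseen data. A common remedy in stacking is out-of-fold scoring, where each row is scored by a model that did not train on it. The sketch below shows only the bookkeeping. `kfold_indices`, `out_of_fold_scores`, and the stand-in `fit`/`predict` callables are hypothetical names, and in this notebook `fit` would mean retraining the Cross-Encoder once per fold, which is considerably more expensive.

```python
import random

def kfold_indices(n, k, seed=42):
    # Shuffle row indices once, then deal them into k roughly equal folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def out_of_fold_scores(rows, labels, fit, predict, k=5):
    # Each row is scored by a model that never saw it during training
    oof = [None] * len(rows)
    for held_out in kfold_indices(len(rows), k):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in held_out:
            oof[i] = predict(model, rows[i])
    return oof

# Stand-in "model" (an assumption for illustration): predicts the mean training label
fit = lambda X, y: sum(y) / len(y)
predict = lambda model, x: model

rows = [f"comment {i}" for i in range(10)]
labels = [0, 1] * 5
scores = out_of_fold_scores(rows, labels, fit, predict, k=5)
print(len(scores), all(s is not None for s in scores))
```

The resulting list has one score per row and could replace an in-sample column such as `ce_score` when training the second stage.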


In [7]:
# ==========================================================
# 🛠️ Stage 2 preparation: prepare the data for LGBM
# ==========================================================
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from scipy.sparse import hstack

print("🔧 Scaling and encoding features for the LGBM model...")

# 1. Numerical features
numerical_cols = [
    'body_len', 'rule_len', 'body_words', 'url_cnt', 'exc_cnt', 'q_cnt',
    'upper_rt', 'rep_run', 'rule_body_jaccard', 
    'ce_score' # Includes the Cross-Encoder prediction score
]
scaler = StandardScaler()
numerical_features = scaler.fit_transform(train_df_featured[numerical_cols])
print(f"🔢 Scaled {len(numerical_cols)} numerical features.")

# 2. Categorical features (informed by the EDA)
categorical_cols = ['subreddit']
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
categorical_features = onehot_encoder.fit_transform(train_df_featured[categorical_cols])
print(f"📋 One-hot encoded {len(categorical_cols)} categorical feature(s).")

# 3. Combine all features
X_lgbm = hstack([numerical_features, categorical_features])
y_lgbm = train_df_featured['rule_violation'].values

print(f"✅ LGBM training data ready. Final shape: {X_lgbm.shape}")

🔧 Scaling and encoding features for the LGBM model...
🔢 Scaled 10 numerical features.
📋 One-hot encoded 1 categorical feature(s).
✅ LGBM training data ready. Final shape: (2029, 110)
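
The final width of 110 columns is the 10 scaled numerical features plus 100 one-hot columns, which implies 100 distinct `subreddit` values in the training data. The pure-Python sketch below illustrates what the two preprocessing steps compute, including how an unseen category is handled. The helper names and toy values are assumptions for illustration; the notebook itself uses scikit-learn's `StandardScaler` and `OneHotEncoder`.

```python
from statistics import fmean, pstdev

def standardize(values):
    # z-score using the population standard deviation, as StandardScaler does
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def fit_categories(values):
    # The sorted distinct values define the one-hot column order
    return sorted(set(values))

def one_hot(value, categories):
    # A value not seen during fitting yields an all-zero row,
    # which is the behavior requested by handle_unknown='ignore'
    return [1 if value == c else 0 for c in categories]

body_len = [10, 20, 30]   # toy numbers
cats = fit_categories(["soccerstreams", "hearthstone", "legaladvice", "hearthstone"])

print([round(z, 4) for z in standardize(body_len)])   # [-1.2247, 0.0, 1.2247]
print(one_hot("legaladvice", cats))                   # [0, 1, 0]
print(one_hot("unseen_subreddit", cats))              # [0, 0, 0]
```

At inference time the fitted scaler and encoder should be reused as they are, so that a new comment is mapped into the same 110 columns.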


In [8]:
# ==========================================================
# 🏗️ Stage 2: Train the LightGBM model
# ==========================================================
import lightgbm as lgb

print("🚀 Starting LightGBM training...")

lgbm_model = lgb.LGBMClassifier(
    objective='binary',
    metric='auc',
    n_estimators=1000, # Set generously because early stopping is used
    learning_rate=0.05,
    num_leaves=31,
    max_depth=-1,
    random_state=42,
    n_jobs=-1,
    colsample_bytree=0.8,
    subsample=0.8 # Has no effect unless subsample_freq > 0
)

# Train LGBM.
# Caution: eval_set is the training data itself, so early stopping monitors training AUC
# and does not guard against overfitting.
lgbm_model.fit(X_lgbm, y_lgbm, 
             eval_set=[(X_lgbm, y_lgbm)],
             eval_metric='auc',
             callbacks=[lgb.early_stopping(100, verbose=False)])

print("✅ LightGBM training complete!")

🚀 Starting LightGBM training...
[LightGBM] [Info] Number of positive: 1031, number of negative: 998
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000219 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1030
[LightGBM] [Info] Number of data points in the train set: 2029, number of used features: 34
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508132 -> initscore=0.032531
[LightGBM] [Info] Start training from score 0.032531
✅ LightGBM training complete!
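
The `fit` call above passes the training data as `eval_set`, so the monitored AUC is a training AUC. The sketch below is a simplified, patience-based version of early stopping (not LightGBM's actual implementation) applied to two made-up AUC curves. It shows why a score that keeps improving never triggers a stop, while a held-out score that peaks early does.

```python
def best_iteration(scores, patience):
    # Return the 1-based iteration with the best score, stopping once
    # `patience` consecutive rounds fail to improve on it (higher is better)
    best_score, best_iter = float("-inf"), 0
    for it, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_iter = score, it
        elif it - best_iter >= patience:
            break
    return best_iter

# Made-up AUC curves for illustration
train_auc = [0.80, 0.90, 0.95, 0.98, 0.99, 0.995, 0.999, 1.0]   # keeps improving
valid_auc = [0.78, 0.84, 0.86, 0.85, 0.85, 0.84, 0.84, 0.83]    # peaks at round 3

print(best_iteration(train_auc, patience=3))   # 8: runs through every round
print(best_iteration(valid_auc, patience=3))   # 3: stops after three rounds without improvement
```

Holding out part of the data (for example with a train/validation split) and passing that split as `eval_set` would let the callback react to generalization rather than to fit on the training rows.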


In [9]:
# ==========================================================
# 💾 Save the final models and preprocessing objects
# ==========================================================
print("💾 Saving models and preprocessing objects...")
output_dir = './model_output'
os.makedirs(output_dir, exist_ok=True)

# 1. The Cross-Encoder model was already saved to output_model_path
print(f"✅ Cross-Encoder model saved: {os.path.abspath(output_model_path)}")

# 2. Save the LGBM model
joblib.dump(lgbm_model, os.path.join(output_dir, 'lgbm_model.pkl'))
print(f"✅ LGBM model saved: {os.path.join(output_dir, 'lgbm_model.pkl')}")

# 3. Save the scaler
joblib.dump(scaler, os.path.join(output_dir, 'scaler.pkl'))
print(f"✅ Scaler saved: {os.path.join(output_dir, 'scaler.pkl')}")

# 4. Save the OneHotEncoder
joblib.dump(onehot_encoder, os.path.join(output_dir, 'onehot_encoder.pkl'))
print(f"✅ OneHotEncoder saved: {os.path.join(output_dir, 'onehot_encoder.pkl')}")

# 5. Save the numerical feature column names
with open(os.path.join(output_dir, 'numerical_cols.pkl'), 'wb') as f:
    pickle.dump(numerical_cols, f)
print(f"✅ Numerical feature column names saved: {os.path.join(output_dir, 'numerical_cols.pkl')}")


💾 Saving models and preprocessing objects...
✅ Cross-Encoder model saved: /workspace/Agile-Community-Rules-Classification/model_output/final_cross_encoder_model
✅ LGBM model saved: ./model_output/lgbm_model.pkl
✅ Scaler saved: ./model_output/scaler.pkl
✅ OneHotEncoder saved: ./model_output/onehot_encoder.pkl
✅ Numerical feature column names saved: ./model_output/numerical_cols.pkl
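
The column-name list is saved because the scaler expects features in the same order at inference time as during fitting. A minimal sketch of that round trip follows, using a temporary directory as a stand-in for `./model_output` and a shortened, illustrative column list.

```python
import os
import pickle
import tempfile

numerical_cols = ['body_len', 'rule_len', 'ce_score']   # shortened list for illustration

with tempfile.TemporaryDirectory() as tmp_dir:          # stand-in for ./model_output
    path = os.path.join(tmp_dir, 'numerical_cols.pkl')
    with open(path, 'wb') as f:
        pickle.dump(numerical_cols, f)
    with open(path, 'rb') as f:
        loaded_cols = pickle.load(f)

# Build a feature row in the saved column order, regardless of dict ordering
new_row = {'ce_score': 0.91, 'body_len': 42, 'rule_len': 17}
ordered = [new_row[c] for c in loaded_cols]
print(ordered)   # [42, 17, 0.91]
```

The same pattern applies to the full ten-column list: load it first, then select the columns in that order before calling the saved scaler.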
