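# The Top-k Method

The top-k method is a useful way to inspect a model's results. It simply consists of looking at the **most successful and least successful examples** and searching for patterns in them. These patterns can then be used to devise new features or to iterate on existing ones.

First, we load the data.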

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import joblib

import sys
sys.path.append("..")
import warnings
warnings.filterwarnings('ignore')

from ml_editor.data_processing import (
    format_raw_df, get_split_by_author, 
    add_text_features_to_df, 
    get_vectorized_series, 
    get_feature_vector_and_label
)
from ml_editor.model_evaluation import get_top_k

data_path = Path('../data/writers.csv')
df = pd.read_csv(data_path)
df = format_raw_df(df.copy())

Next, we add features and split the dataset.

In [2]:
df = add_text_features_to_df(df.loc[df["is_question"]].copy())
train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)
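
Splitting by author keeps all questions from one author on the same side of the split, so the model is not evaluated on writers it saw during training. The following is a minimal sketch of that idea and not the actual implementation of `get_split_by_author`. The function name `split_by_group` and the column name `author_id` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def split_by_group(df, group_col, test_size=0.2, random_state=40):
    """Split df so that no value of group_col appears in both parts."""
    rng = np.random.RandomState(random_state)
    # Shuffle the distinct group values, then reserve a fraction for testing.
    groups = rng.permutation(df[group_col].unique())
    n_test = max(1, int(round(test_size * len(groups))))
    test_groups = list(groups[:n_test])
    is_test = df[group_col].isin(test_groups)
    return df.loc[~is_test], df.loc[is_test]


# Made-up toy data: five authors with two questions each.
toy_df = pd.DataFrame({
    "author_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "full_text": [f"question {i}" for i in range(10)],
})
toy_train, toy_test = split_by_group(toy_df, "author_id")
```

Because whole groups are assigned to one side, the realized test fraction only approximates `test_size` when authors have different numbers of questions.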

We load the trained model and vectorize the features.

In [3]:
model_path = Path("../models/model_1.pkl")
clf = joblib.load(model_path)
vectorizer_path = Path("../models/vectorizer_1.pkl")
vectorizer = joblib.load(vectorizer_path)

In [4]:
model_path

PosixPath('../models/model_1.pkl')

In [5]:
train_df["vectors"] = get_vectorized_series(train_df["full_text"].copy(), vectorizer)
test_df["vectors"] = get_vectorized_series(test_df["full_text"].copy(), vectorizer)

features = [
    "action_verb_full",
    "question_mark_full",
    "text_len",
    "language_question",
]
X_train, y_train = get_feature_vector_and_label(train_df, features)
X_test, y_test = get_feature_vector_and_label(test_df, features)

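We now use the top-k method to look at:

- The k best performing examples for each class (high-scoring and low-scoring questions)
- The k worst performing examples for each class
- The k most uncertain examples, where the model's predicted probability is close to 0.5

To see how displaying these specific examples helps with model iteration, refer to Chapter 5 of the book.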
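
The three groups listed above can be selected with a few boolean masks and sorts. The sketch below is an illustration under assumptions and not the `get_top_k` function from `ml_editor`. The name `top_k_sketch` and the toy data are made up, and the return order simply mirrors how the results are unpacked in this notebook.

```python
import pandas as pd


def top_k_sketch(df, proba_col, label_col, k=2, threshold=0.5):
    """Return the k best, k worst, and k most uncertain rows per class."""
    predicted_pos = df[proba_col] > threshold
    actual_pos = df[label_col].astype(bool)

    # Correct predictions, ranked by how confident the model was.
    top_pos = df[predicted_pos & actual_pos].nlargest(k, proba_col)
    top_neg = df[~predicted_pos & ~actual_pos].nsmallest(k, proba_col)

    # Wrong predictions, ranked by how confidently wrong the model was.
    # worst_pos: truly positive rows that received the lowest probabilities.
    worst_pos = df[~predicted_pos & actual_pos].nsmallest(k, proba_col)
    # worst_neg: truly negative rows that received the highest probabilities.
    worst_neg = df[predicted_pos & ~actual_pos].nlargest(k, proba_col)

    # Rows whose probability is closest to the decision threshold.
    distance = (df[proba_col] - threshold).abs()
    unsure = df.loc[distance.nsmallest(k).index]
    return top_pos, top_neg, worst_pos, worst_neg, unsure


# Made-up toy predictions for illustration only.
toy = pd.DataFrame({
    "predicted_proba": [0.9, 0.8, 0.1, 0.2, 0.3, 0.7, 0.52, 0.45],
    "true_label": [True, True, False, False, True, False, True, False],
})
top_pos, top_neg, worst_pos, worst_neg, unsure = top_k_sketch(
    toy, "predicted_proba", "true_label", k=2
)
```

A group can contain fewer than k rows when the model makes few mistakes of that kind, as with `worst_pos` and `worst_neg` in the toy data.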

In [6]:
test_analysis_df = test_df.copy()
y_predicted_proba = clf.predict_proba(X_test)
test_analysis_df["predicted_proba"] = y_predicted_proba[:, 1]
test_analysis_df["true_label"] = y_test

to_display = [
    "predicted_proba",
    "true_label",
    "Title",
    "body_text",
    "text_len",
    "action_verb_full",
    "question_mark_full",
    "language_question",
]
threshold = 0.5


top_pos, top_neg, worst_pos, worst_neg, unsure = get_top_k(test_analysis_df, "predicted_proba", "true_label", k=2)
pd.options.display.max_colwidth = 500

Most confidently correct positive predictions

In [7]:
# Most confidently correct positive examples
top_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
6793,0.74,True,Non-cheap ways to make villains evil?,"Do you have any tried and true techniques to make villains of your stories truly hated by the audience?\nI mean, frequently it's ""eh, sure, that's bad, he's got to be stopped"" but the audience would rather observe the villain more, learn, maybe try to get them to change their ways. Or worst of all, pity the villain in the end for failing to execute their just revenge, or not getting along with their plan for what would -really- be a better future, even if through baptism of fire.\nNow what t...",1448,False,True,False
4107,0.72,True,Single character POV vs. two POVs - how to decide?,"I'm starting to look at my next novel, and I'm trying to decide whether I should tell it from one POV or two. I've used both techniques in the past, so I'm aware of the basic advantages/disadvantages, but I'm still having trouble deciding which is best for the story I want to tell. \nI realize that it's impossible to answer that question without knowing the details of my story, but I'm hunting for some sort of framework for my thoughts, so: in general, when is it advisable to stick with a ...",1406,True,True,False


Most confidently correct negative predictions

In [8]:
# Most confidently correct negative examples
top_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
7488,0.09,False,Releases needed for picture books?,Do you need location releases for national parks and model releases for Pets to use in picture books?\n,137,False,True,False
34919,0.13,False,Is Rob Zombie's Red Red Kroovy Trochee style?,"https://genius.com/Rob-zombie-never-gonna-stop-\nYeah, I'm on Durango number 95\nTake me to the home, kick boots and ultra live\nSee heaven flash a horror show\nKnock it nice and smooth step back and watch it blow, yeah\nNever gonna stop me, never gonna stop\nNever gonna stop me, never gonna stop\nNever gonna stop me, never gonna stop\nNever gonna stop me, never gonna stop\nGive it to me, give it to me\nYeah, the devil ride it down the shore\nHe paint the monster red so the blood don't stain...",783,False,True,False


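Most of the correctly predicted negative examples are **short**. This supports the result of the feature importance analysis, which found question length to be one of the most important features.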
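
One rough way to check the length pattern is to compare the average predicted probability of short and long questions. The sketch below uses only the ten rows displayed in this notebook, which are hand-picked extremes and not a representative sample, so in practice the same comparison would be run on the full `test_analysis_df`. The 500-character cut-off and the `length_bucket` column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# (text_len, predicted_proba) pairs copied from the tables shown in this notebook.
displayed_rows = pd.DataFrame({
    "text_len": [1448, 1406, 137, 783, 119, 591, 694, 917, 291, 831],
    "predicted_proba": [0.74, 0.72, 0.09, 0.13, 0.14, 0.15, 0.86, 0.72, 0.5, 0.5],
})

# Assumed cut-off: treat questions under 500 characters as "short".
displayed_rows["length_bucket"] = np.where(
    displayed_rows["text_len"] < 500, "short", "long"
)

# Mean predicted probability of the positive class per bucket.
mean_by_length = displayed_rows.groupby("length_bucket")["predicted_proba"].mean()
print(mean_by_length)
```

A lower mean for the short bucket is consistent with the model leaning on `text_len`, but it does not by itself show that length causes the prediction.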

Let's look at the most confidently incorrect positive predictions.

In [9]:
# Most confidently incorrect positive examples
worst_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
14157,0.14,True,Do you bold punctuation directly after bold text?,"Do you bold punctuation directly after bold, linked or italic text? \n",119,False,True,False
17543,0.15,True,Using Pronoun 'It' repetitvely for emphasis?,"I'd like to know if using ""It"" repetitively (for emphasis) in this context is okay grammatically.\n\nTV has become the modern day baby sitter. It is raising our children. It is dictating the cultural narrative and shaping future society. It is raising the bored inattentive child. It is raising the consumer child. It is raising the aggressive child. It is raising the obese child. It is raising the misinformed and complacent child. It is raising the disenchanted child. And what’s more, it...",591,False,True,False


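Conversely, the questions the model gets wrong here are short questions that actually received a high score, which the model confidently predicted as negative.

Next, let's look at the most confidently incorrect negative predictions.

In [10]:
# Most confidently incorrect negative examples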
worst_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
7878,0.86,False,"When quoting a person's informal speech, how much liberty do you have to make changes to what they say?","Even during a formal interview for a news article, people speak informally. They say ""uhm"", they cut off sentences half-way through, they interject phrases like ""you know?"", and they make innocent grammatical mistakes.\nAs somebody who wants to fairly and accurately report the discussion that takes place in an interview, what guidelines should I use in making changes to what a person says?\nWhile the simplest solution is to write exactly what they say and [sic] any errors they make, that can...",694,True,True,False
24995,0.72,False,Self-translating into English,"I am finishing writing my first book (in Slovak, SF) and will be looking for publishers soon. I was considering self-publishing but I don't think I can do more then them in this field. Well, except for the translation.\nWe have all heard the 3% problem where only this many books are translated to english. So I think about translating my book to english by my own money (i.e. paying someone to translate it - I'm not the one doing this).\nFew reasons: 1) publishers in my country don't try hard ...",917,True,True,False


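Finally, we look at the most uncertain questions, those for which the model assigns nearly equal probability to every class (with two classes, the examples whose probability is close to `0.5`).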
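
With more than two classes, "close to `0.5`" no longer applies, and one common substitute is the entropy of the predicted distribution, which is largest when all classes are equally likely. The sketch below is one possible approach and not part of `ml_editor`. The function name `most_uncertain_indices` and the probabilities are made up for illustration.

```python
import numpy as np


def most_uncertain_indices(proba, k=2):
    """Return the row indices of the k highest-entropy predictions."""
    proba = np.asarray(proba, dtype=float)
    # Clip before taking the log so that zero probabilities do not yield NaN.
    clipped = np.clip(proba, 1e-12, 1.0)
    entropy = -(proba * np.log(clipped)).sum(axis=1)
    # Sort by descending entropy and keep the first k rows.
    return np.argsort(-entropy)[:k]


# Made-up two-class probabilities, shaped like a predict_proba output.
binary_proba = [[0.9, 0.1], [0.5, 0.5], [0.3, 0.7], [0.55, 0.45]]
binary_unsure = most_uncertain_indices(binary_proba, k=2)

# Made-up three-class probabilities.
multi_proba = [[0.34, 0.33, 0.33], [0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
multi_unsure = most_uncertain_indices(multi_proba, k=1)
```

For two classes, ranking by entropy gives the same order as ranking by distance from `0.5`.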

In [11]:
unsure[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
21955,0.5,True,How to make a dark story not-so-dark (Shining the light in darkness),"I'm writing a war story, and it's dark. However, I find that every scene turns out to be depressing because of it. Readers will be overwhelmed. Are there ways I can induce hope/shine the light in the darkness in my novel?\n",291,True,True,False
23599,0.5,False,Quotes around long backstory narrated by a character,"I am working on a novel in which the characters talk to the protagonist and explain lot of backstory. It can run into tens of pages - essentially the entire story is told by the character to the protagonist and the reader is a third person learning about it in parallel.\nMy question is: what are the rules for quotes in these kind of conversations.\nFor instance (H: hero, C: a local character)\nPage 1: H: So, how did you end up here?\nPage 1 - 15: C: tells a 15-page long story ......\nPage 16...",831,True,True,False


To find new candidate features, we recommend using the top-k method together with feature importance and vectorization methods.