# Top-k

A useful way to inspect a model's results: look at the most successful samples, the worst failures, and the most uncertain samples, and search for patterns.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import joblib

import sys
sys.path.append("..")
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%load_ext autoreload
%autoreload 2

from ml_editor.data_processing import (
    format_raw_df, get_split_by_author, 
    add_text_features_to_df, 
    get_vectorized_series, 
    get_feature_vector_and_label
)
from ml_editor.model_evaluation import get_top_k

data_path = Path('./data/processed/writers/writers.csv')
df = pd.read_csv(data_path, index_col=0)
df = format_raw_df(df.copy())



Add the text features and split the dataset.

In [11]:
df = add_text_features_to_df(df.loc[df["is_question"]].copy())
train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)

Load the trained model and vectorize the features.

In [12]:
model_path = Path("./models/model_1.pkl")
clf = joblib.load(model_path) 
vectorizer_path = Path("./models/vectorizer_1.pkl")
vectorizer = joblib.load(vectorizer_path) 

In [13]:
model_path

WindowsPath('models/model_1.pkl')

In [14]:
train_df["vectors"] = get_vectorized_series(train_df["full_text"].copy(), vectorizer)
test_df["vectors"] = get_vectorized_series(test_df["full_text"].copy(), vectorizer)

features = [
    "action_verb_full",
    "question_mark_full",
    "text_len",
    "language_question",
]
X_train, y_train = get_feature_vector_and_label(train_df, features)
X_test, y_test = get_feature_vector_and_label(test_df, features)

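Use the top-k method to inspect:
- The k best-predicted samples from each class (high-scoring and low-scoring questions)
- The k worst-predicted samples from each class
- The k most uncertain samples, whose predicted probability is closest to 0.5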
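
The three groups above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the `ml_editor` implementation of `get_top_k`: the function name `top_k_sketch`, its `threshold` parameter, and the toy values are all hypothetical.

```python
import pandas as pd


def top_k_sketch(df, proba_col, label_col, k=2, threshold=0.5):
    """Illustrative top-k split; NOT the ml_editor implementation."""
    predicted_pos = df[proba_col] > threshold
    actual_pos = df[label_col].astype(bool)

    correct_pos = df[actual_pos & predicted_pos]
    correct_neg = df[~actual_pos & ~predicted_pos]
    missed_pos = df[actual_pos & ~predicted_pos]  # false negatives
    missed_neg = df[~actual_pos & predicted_pos]  # false positives

    # Best: correct predictions made with the most confidence.
    top_pos = correct_pos.nlargest(k, proba_col)
    top_neg = correct_neg.nsmallest(k, proba_col)
    # Worst: wrong predictions made with the most confidence.
    worst_pos = missed_pos.nsmallest(k, proba_col)
    worst_neg = missed_neg.nlargest(k, proba_col)
    # Unsure: predicted probability closest to the decision threshold.
    distance = (df[proba_col] - threshold).abs()
    unsure = df.loc[distance.nsmallest(k).index]
    return top_pos, top_neg, worst_pos, worst_neg, unsure


# Toy values invented for illustration only.
toy = pd.DataFrame(
    {
        "predicted_proba": [0.9, 0.8, 0.2, 0.1, 0.15, 0.85, 0.52, 0.45],
        "true_label": [True, True, False, False, True, False, True, False],
    },
    index=list("abcdefgh"),
)
top_pos, top_neg, worst_pos, worst_neg, unsure = top_k_sketch(
    toy, "predicted_proba", "true_label", k=1
)
```

Sorting by confidence within each correct or incorrect group is what separates "best" from "worst": a confident mistake says more about a missing feature than a borderline one.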

In [16]:
test_analysis_df = test_df.copy()
y_predicted_proba = clf.predict_proba(X_test)

test_analysis_df["predicted_proba"] = y_predicted_proba[:, 1]
test_analysis_df["true_label"] = y_test

to_display = [
    "predicted_proba",
    "true_label",
    "Title",
    "body_text",
    "text_len",
    "action_verb_full",
    "question_mark_full",
    "language_question",
]
threshold = 0.5

top_pos, top_neg, worst_pos, worst_neg, unsure = get_top_k(test_analysis_df, "predicted_proba", "true_label", k=2)
pd.options.display.max_colwidth = 500

In [17]:
# Positive samples predicted correctly with the highest confidence
top_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
529,0.72,True,How to make travel scenes interesting without adding needless plot diversions?,"I have always had a problem with travel in my stories. Since I'm writing an epic fantasy novel, travel is a big theme as characters often have to move from where they are to where the plot dictates.\nHowever, one of the difficulties I have is that the travel itself is often not important to the plot. In the novel I'm reading now (Wizard's First Rule by Terry Goodkind), there is a huge amount of travel, and the author adds needless encounters with various magical beasts just to keep tension...",1391,True,True,False
8335,0.72,True,What factors in fiction arouse readers' expectations?,"Feedback from my writer's group tells me that my recent stories leave promises unfulfilled and important questions unanswered.\nSo I've become interested in how stories make promises and raise questions.\nSo I've identified a few factors that arouse readers' expectations.\n\nCharacter desire. If I put a desire into a character's mind (or words, or actions), readers expect the story to resolve the desire.\nCharacter speculation. If a character speculates about some future event or condition, ...",1921,True,True,False


In [18]:
# Negative samples predicted correctly with the highest confidence
top_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
16799,0.1,False,Capitalization of Open form Compound Words in Titles,"What would be considered proper capitalization of open form compound words in titles? Should the second part of the compound word be capitalized? Why?\nFor example, the capitalization for which title would be correct?\n\nCash flow Analysis Report\n\n--OR--\n\nCash Flow Analysis Report\n\nThanks!\n",343,True,True,True
51092,0.11,False,Are illogical comparisons permitted?,"\n""Clouds soared high into the sky like raging horses.""\n\nHorses don't soar, but is it ok to use ""like raging horses"" after ""soar high into the sky""? I am wondering if this kind of comparison is permitted. The direction is ""wrong"" and the verb is ""wrong"", so I am wondering if the use of like would be warranted and if another comparison should be used.\n",389,True,True,False


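The samples correctly predicted as negative are short (`text_len` of 343 and 389), while the correctly predicted positives are much longer (1391 and 1921).
This supports the idea that question length is one of the important features for receiving a high score.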
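
With k=2 only four rows are visible, so it may be worth checking the length pattern on the whole test set. The sketch below compares median `text_len` between predicted classes and computes a rank correlation. It runs on a toy frame with invented values; on the real data, `test_analysis_df` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for test_analysis_df; values are invented for illustration.
toy = pd.DataFrame(
    {
        "predicted_proba": [0.10, 0.11, 0.30, 0.65, 0.72, 0.72],
        "text_len": [343, 389, 500, 1400, 1391, 1921],
    }
)

# Label each row by the class the model predicts at a 0.5 threshold.
predicted_class = (toy["predicted_proba"] > 0.5).map(
    {True: "predicted_pos", False: "predicted_neg"}
)

# Median question length per predicted class.
median_len = toy.groupby(predicted_class)["text_len"].median()

# Spearman-style rank correlation: Pearson correlation of the ranks.
rank_corr = toy["text_len"].rank().corr(toy["predicted_proba"].rank())
```

A large gap between the two medians, or a rank correlation close to 1, would suggest the model leans heavily on length.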

In [19]:
# Positive samples predicted most wrongly (confident false negatives)
worst_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
42882,0.08,True,Capitlization of A Named Experiment,"I have an experiment which we call 'the krypton experiment'. In referring to the krypton experiment, should it be capitalized?\ne.g.\nThe Krypton Experiment was used as a source of benchmark data.\nor\nThe krypton experiment was used as a source of benchmark data.\n",298,True,True,True
19760,0.09,True,Adding coding template in a Google blogger,I want to add a code viewer in Google Blogger like in image given below.\nHow can I add it please help.\n\n,147,True,False,False


The positive samples wrongly predicted as negative are short questions (`text_len` of 298 and 147).
=> Question length alone is not enough to predict correctly.

In [20]:
# Negative samples predicted most wrongly (confident false positives)
worst_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
30420,0.72,False,Should I make my prologue chapter 1?,"My prologue is set 17 years before the main story arc. I am reflecting on the discussion here, which was asked by another SE contributor. I'm trying to decide what to do with my prologue. Building a website for my world with minor character sketches, short stories, mythologies, etc and additional supplemental is one possibility. It could go there. Or,\n\nI can delete it entirely, and put any necessary points into the rest of the book. \nI can leave it as the prologue, since that is my first ...",1658,True,True,False
56530,0.71,False,How do I manage audience expectations for a paranormal romance story?,"A big part of writing is managing audience expectations, especially as it pertains to the genre. I.e., if a story is pitched as an action-adventure story, people expect a story of fight scenes and explosions; if it's pitched as a comedy they expect it to actually be funny; if it's pitched as a romance they expect to see true love and happily ever after. If the author pitches their story as one genre and it ends up spiraling into another...well, the audience feels betrayed and often throws th...",3312,True,True,False


As with the mispredicted positive samples above, the model tends to wrongly predict long negative questions as positive.
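
One way to test whether errors really cluster by length is to count false positives and false negatives per length bucket. The following is a rough sketch on invented toy values; the 800 cutoff and the `length_bucket` and `error_type` column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for test_analysis_df; values are invented for illustration.
toy = pd.DataFrame(
    {
        "predicted_proba": [0.08, 0.09, 0.10, 0.72, 0.71, 0.72, 0.60, 0.20],
        "true_label": [True, True, False, False, False, True, True, False],
        "text_len": [298, 147, 343, 1658, 3312, 1391, 900, 400],
    }
)
threshold = 0.5
length_cutoff = 800  # hypothetical boundary between "short" and "long"

predicted_label = toy["predicted_proba"] > threshold
toy["length_bucket"] = np.where(toy["text_len"] > length_cutoff, "long", "short")

# Classify every row as correct, false negative, or false positive.
toy["error_type"] = "correct"
toy.loc[toy["true_label"] & ~predicted_label, "error_type"] = "false_negative"
toy.loc[~toy["true_label"] & predicted_label, "error_type"] = "false_positive"

# Rows: length bucket; columns: error type; cells: sample counts.
counts = pd.crosstab(toy["length_bucket"], toy["error_type"])
```

If false negatives fall almost entirely in the short bucket and false positives in the long one, the model is probably using length as a shortcut, and features that capture content quality become the natural next candidates.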

In [21]:
unsure[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
10798,0.5,False,Is it better to follow some structure or just write following intuition,"I have a plot in mind without any details. I am planning to write it down as a story and then a screenplay, without any background in writing or reading books but just watching movies. When going through online articles its seen that following certain structures for plots is useful and is widely followed by many. One example is this.\nNow I wonder if I should just write what comes to my mind or plan and do some homework and build the story methodologically. What usually works good in writin...",629,True,True,False
20317,0.5,False,"How to give written advice in a way that is encouraging, not overbearing","How do we write something to inspire a person which corrects the mistakes they've made until now, but without making them feel like they're getting mocked from the recipient's perspective? \nI was trying to write a text to a person younger to me in order to inspire him. Something just doesn't feel right in this paragraph. It doesn't evoke any positive feeling such as hope or inspiration from it. It appears kinda over-bearing and in part hurtful albeit being true. How can I improve and convey...",2472,True,True,False


To find new candidate features, it is recommended to combine the top-k method with feature importance and vectorization methods.
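
As a companion to top-k inspection, feature importances can be ranked in a few lines. The sketch below assumes the classifier exposes a `feature_importances_` array, as tree-based scikit-learn models do, and that the feature names are listed in the same order as the columns of the feature matrix. The helper `rank_importances` and the toy numbers are hypothetical.

```python
import numpy as np


def rank_importances(importances, feature_names, k=3):
    """Return the k (name, importance) pairs with the largest importance."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1]  # indices, largest first
    return [(feature_names[i], float(importances[i])) for i in order[:k]]


# Invented importances for illustration; with a fitted tree-based model
# these would come from clf.feature_importances_ instead.
toy_names = [
    "action_verb_full",
    "question_mark_full",
    "text_len",
    "language_question",
]
toy_importances = [0.05, 0.10, 0.70, 0.15]

top_features = rank_importances(toy_importances, toy_names, k=2)
```

Reading this ranking next to the top-k tables helps separate patterns the model already uses from patterns that are visible in the errors but not yet captured by any feature.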