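## 51. Feature extraction
Extract features from the training, validation, and test data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt, respectively. Design whatever features seem useful for category classification. A minimal baseline is the article headline converted into a sequence of words.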

In [61]:
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()


columns = ('id',
           'title',
           'category',
           'story')

train = pd.read_csv('../../data/NewsAggregatorDataset/train.feature.txt',
                    names=columns, sep='\t')
valid = pd.read_csv('../../data/NewsAggregatorDataset/valid.feature.txt',
                    names=columns, sep='\t')
test = pd.read_csv('../../data/NewsAggregatorDataset/test.feature.txt',
                   names=columns, sep='\t')



In [74]:
import re

def tokenize(doc):
    doc = re.sub(r"[',.]", '', doc)  # strip apostrophes, commas, and periods
    tokens = doc.split(' ')
    tokens = [token.lower() for token in tokens]  # lowercase every token
    return tokens

def preprocessor(tokens):
    tokens = [token for token in tokens if token in vocab]
    return tokens

def bag_of_words(doc):
    vector = [0]*len(vocab)
    for word in doc:
        if word in vocab:
            vector[vocab.index(word)] += 1
    return pd.Series(vector)

In [76]:
# vocabulary
from collections import Counter

train['tokens'] = train.title.apply(tokenize)
vocab = train['tokens'].tolist()
vocab = sum(vocab, [])  # flat list
counter = Counter(vocab)
vocab = [
    token
    for token, freq in counter.most_common()
    if 2 < freq < 300
]

In [77]:
train['tokens'] = train.tokens.progress_apply(preprocessor)
X_train = train.tokens.progress_apply(bag_of_words)

test['tokens'] = test.title.apply(tokenize)
test['tokens'] = test.tokens.progress_apply(preprocessor)
X_test = test.tokens.progress_apply(bag_of_words)

valid['tokens'] = valid.title.apply(tokenize)
valid['tokens'] = valid.tokens.progress_apply(preprocessor)
X_valid = valid.tokens.progress_apply(bag_of_words)

100%|██████████| 10672/10672 [00:05<00:00, 2129.69it/s]
100%|██████████| 10672/10672 [00:20<00:00, 509.11it/s]
100%|██████████| 1334/1334 [00:00<00:00, 1996.46it/s]
100%|██████████| 1334/1334 [00:02<00:00, 479.35it/s]
100%|██████████| 1334/1334 [00:00<00:00, 2097.34it/s]
100%|██████████| 1334/1334 [00:02<00:00, 511.08it/s]
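
The `bag_of_words` function above looks up each token with `vocab.index(word)`, which scans the list linearly on every call. A minimal sketch of the same count vector built with a precomputed token-to-column dictionary follows. The toy `vocab` and the names `build_index` and `bag_of_words_indexed` are assumptions made for illustration, not part of the notebook.

```python
def build_index(vocab):
    """Map each vocabulary token to its column position (hypothetical helper)."""
    return {token: i for i, token in enumerate(vocab)}


def bag_of_words_indexed(tokens, index):
    """Count vocabulary tokens in `tokens`; out-of-vocabulary tokens are ignored."""
    vector = [0] * len(index)
    for token in tokens:
        column = index.get(token)
        if column is not None:
            vector[column] += 1
    return vector


# Toy vocabulary for illustration only; the notebook derives `vocab`
# from the training headlines.
vocab = ["stocks", "google", "movie", "ebola"]
index = build_index(vocab)
vector = bag_of_words_indexed(["google", "buys", "google", "movie"], index)
print(vector)
```

Each lookup becomes a constant-time dictionary access, so the cost per headline no longer grows with the vocabulary size.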


## 52. Training
Train a logistic regression model on the training data built in 51.

In [65]:
from sklearn.linear_model import LogisticRegression

Y_train = train['category'].map({'b': 0, 't': 1, 'e': 2, 'm': 3})  # map category labels to class ids
lr = LogisticRegression(class_weight='balanced')  # create the logistic regression model
lr.fit(X_train, Y_train)  # fit the model weights

100%|██████████| 10672/10672 [00:05<00:00, 1866.74it/s]
100%|██████████| 10672/10672 [00:21<00:00, 501.62it/s]
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [66]:
print("coefficient = ", lr.coef_)  # coefficients of the explanatory variables
print("intercept = ", lr.intercept_)  # intercepts

coefficient =  [[-0.63990247 -0.55659616 -1.08942751 ... -0.09404083 -0.06126723
  -0.081567  ]
 [-0.5264873  -0.46478526 -0.4030986  ... -0.05961833 -0.03921252
   0.15460631]
 [ 1.65678062  1.40852714  1.13529375 ...  0.19064374  0.1266009
   0.0486148 ]
 [-0.49039084 -0.38714573  0.35723237 ... -0.03698459 -0.02612115
  -0.12165411]]
intercept =  [ 0.35953993 -0.0887583   0.42194464 -0.69272628]


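## 53. Prediction
Using the logistic regression model trained in 52, implement a program that computes the category of a given article headline together with its predicted probability.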

In [70]:
# Predict on the training data

Y_pred = lr.predict(X_train)
Y_true = train['category'].map({'b': 0, 't': 1, 'e': 2, 'm': 3})
print('predicted', Y_pred[:10])
print('actual', Y_true.head(10).tolist())

predicted [3 0 2 2 0 0 0 1 2 0]
actual [3, 0, 2, 2, 0, 0, 0, 1, 2, 0]


In [69]:
# Predict on the test data

Y_pred = lr.predict(X_test)
Y_true = test['category'].map({'b': 0, 't': 1, 'e': 2, 'm': 3})
print('predicted', Y_pred[:10])
print('actual', Y_true.head(10).tolist())

100%|██████████| 1334/1334 [00:00<00:00, 1879.01it/s]
100%|██████████| 1334/1334 [00:02<00:00, 456.54it/s]

predicted [1 2 0 1 2 2 3 0 0 2]
actual [1, 2, 0, 1, 2, 2, 3, 0, 0, 2]
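
The cells above print only the predicted labels, while the task also asks for the prediction probability. scikit-learn exposes this through `lr.predict_proba`. Assuming the model was fit in multinomial mode (the default for the `lbfgs` solver shown in 52), those probabilities are the softmax of the per-class scores computed from `coef_` and `intercept_`. The sketch below reproduces that computation in plain Python; the two-feature weights, intercepts, and input vector are made-up values for illustration, not the fitted model.

```python
import math

CATEGORIES = ["b", "t", "e", "m"]  # same class order as the label mapping in the notebook


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_proba(x, coef, intercept):
    """Return the top category, its probability, and the full distribution."""
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(coef, intercept)
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best], probs


# Hypothetical weights (4 classes x 2 features); NOT the fitted values.
coef = [[1.0, -1.0], [-0.5, 0.5], [0.0, 2.0], [-1.0, 0.0]]
intercept = [0.0, 0.0, 0.0, 0.0]
label, prob, probs = predict_with_proba([0, 1], coef, intercept)
print(label, round(prob, 3))
```

With the real model, `x` would be the bag-of-words vector of a tokenized headline and `coef`/`intercept` would come from `lr.coef_` and `lr.intercept_`.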





## 54. Measuring accuracy
Measure the accuracy of the logistic regression model trained in 52 on both the training data and the test data.

In [75]:
def accuracy(predict, y):
    return (predict == y).mean()

In [73]:
# Accuracy on the training data

X_train = train.tokens.progress_apply(bag_of_words)
Y_pred = lr.predict(X_train)
Y_true = train['category'].map({'b': 0, 't': 1, 'e': 2, 'm': 3})

print('accuracy', accuracy(Y_pred, Y_true))
# with doc2vec: 0.3256184407796102

100%|██████████| 10672/10672 [00:22<00:00, 476.37it/s]


accuracy 0.9845389805097451


In [88]:
# Accuracy on the test data

X_test = test.tokens.progress_apply(bag_of_words)
Y_pred = lr.predict(X_test)
Y_test = test['category'].map({'b': 0, 't': 1, 'e': 2, 'm': 3})
print('accuracy', accuracy(Y_pred, Y_test))
# with doc2vec: 0.24287856071964017

100%|██████████| 1334/1334 [00:02<00:00, 494.32it/s]

accuracy 0.8958020989505248





## 55. Building the confusion matrix
Build the confusion matrix of the logistic regression model trained in 52 on both the training data and the test data.

In [89]:
import numpy as np

def confusion_matrix(y_true, y_pred):
    size = len(set(y_true))
    result = np.zeros((size, size), dtype=int)  # rows: true class, columns: predicted class
    for t, p in zip(y_true, y_pred):
        result[t][p] += 1
    return result

In [90]:
con_matrix = confusion_matrix(Y_test, Y_pred)
print(con_matrix)

[[498  36  10   9]
 [ 21 124   4   4]
 [ 14  14 505   5]
 [  7   8   7  68]]
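
Because `confusion_matrix` increments `result[t][p]`, each row corresponds to a true class and each column to a predicted class. As a small check on that orientation, the sketch below recomputes the row and column totals from the matrix printed above; the variable names are chosen for illustration.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
con_matrix = [
    [498, 36, 10, 9],
    [21, 124, 4, 4],
    [14, 14, 505, 5],
    [7, 8, 7, 68],
]

true_counts = [sum(row) for row in con_matrix]             # examples per true class
predicted_counts = [sum(col) for col in zip(*con_matrix)]  # examples per predicted class
total = sum(true_counts)

print('true counts', true_counts)
print('predicted counts', predicted_counts)
print('total', total)
```

The total equals the 1334 test examples reported by the progress bars, which confirms that every example is counted exactly once.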


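## 56. Measuring precision, recall, and F1 score
Measure the precision, recall, and F1 score of the logistic regression model trained in 52 on the test data. Compute precision, recall, and F1 score for each category, then combine the per-category results with the micro-average and the macro-average.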

In [93]:
def precision(con_matrix):
    # con_matrix[t][p]: rows are true classes, columns are predicted classes,
    # so precision for class i divides by the column total (everything predicted as i)
    size = len(con_matrix)
    results = []
    for i in range(size):
        result = con_matrix[i][i]/sum([row[i] for row in con_matrix])
        results.append(result)
    return results
    
def recall(con_matrix):
    # recall for class i divides by the row total (everything that truly is i)
    size = len(con_matrix)
    results = []
    for i in range(size):
        result = con_matrix[i][i]/sum(con_matrix[i])
        results.append(result)
    return results

def f1(pre, rec):
    size = len(pre)
    results = []
    for i in range(size):
        result = (2*rec[i]*pre[i])/(rec[i]+pre[i])
        results.append(result)
    return results

In [99]:
pre = precision(con_matrix)
rec = recall(con_matrix)
f1_score = f1(pre, rec)
print('precision', pre)
print('recall', rec)
print('f1', f1_score)

precision [0.9222222222222223, 0.6813186813186813, 0.9600760456273765, 0.7906976744186046]
recall [0.9005424954792043, 0.8104575163398693, 0.9386617100371747, 0.7555555555555555]
f1 [0.9112534309240622, 0.7402985074626866, 0.9492481203007519, 0.7727272727272727]


In [103]:
# Macro average
print('precision (macro avg)', np.array(pre).mean())
print('recall (macro avg)', np.array(rec).mean())
print('F1 (macro avg)', np.array(f1_score).mean())

precision (macro avg) 0.8385786558967212
recall (macro avg) 0.8513043193529509
F1 (macro avg) 0.8433818328536933
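
The task also asks for the micro average, which pools true positives, false positives, and false negatives over all classes before dividing. A minimal sketch using the confusion matrix printed in 55 (rows as true classes, columns as predicted classes) is given below; the function name `micro_average` is an assumption for illustration.

```python
def micro_average(con_matrix):
    """Return micro-averaged (precision, recall, F1) from a confusion matrix."""
    size = len(con_matrix)
    tp = sum(con_matrix[i][i] for i in range(size))
    # false positives: predicted as class i but truly another class (column minus diagonal)
    fp = sum(sum(row[i] for row in con_matrix) - con_matrix[i][i] for i in range(size))
    # false negatives: truly class i but predicted as another class (row minus diagonal)
    fn = sum(sum(con_matrix[i]) - con_matrix[i][i] for i in range(size))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion matrix copied from the output in 55.
con_matrix = [
    [498, 36, 10, 9],
    [21, 124, 4, 4],
    [14, 14, 505, 5],
    [7, 8, 7, 68],
]
micro_pre, micro_rec, micro_f1 = micro_average(con_matrix)
print('precision (micro avg)', micro_pre)
print('recall (micro avg)', micro_rec)
print('F1 (micro avg)', micro_f1)
```

In single-label multiclass classification every misclassified example counts once as a false positive and once as a false negative, so the three micro-averaged scores coincide with the test accuracy measured in 54.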

