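# Chapter 6: Machine Learning
In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to work on a category classification task: classifying news headlines into the categories "business", "science and technology", "entertainment", and "health".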

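# 50. Obtaining and formatting the data
Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and test data (test.txt) as follows.

1. Unzip the downloaded zip file and read the description in readme.txt.
2. Extract only the examples (articles) whose source (publisher) is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
3. Shuffle the extracted examples randomly.
4. Split the extracted examples into 80% training data and 10% each for validation and test data, and save them as train.txt, valid.txt, and test.txt. Write one example per line, in a tab-separated format of category name and headline (these files are reused later in problem 70).

After creating the training and test data, check the number of examples in each category.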

In [83]:
!ls files/NewsAggregatorDataset/

2pageSessions.csv check.txt         newsCorpora.csv   readme.txt


In [85]:
!head -n 210713 files/NewsAggregatorDataset/newsCorpora.csv | tail -n 1

210713	"The Best Reactions To The Supposed Video of Solange Knowles & Jay Z  ...	http://www.hiphopdx.com/index/news/id.28728/title.-the-best-reactions-to-the-supposed-video-of-solange-knowles-jay-z-fighting-in-an-elevator-list-by-time/	HipHopDX	e	dku0uRoeehpC9JM1RoZ4n0fg8cyoM	www.hiphopdx.com	1399983366398


"The Best Reactions To The Supposed Video of Solange Knowles & Jay Z  ...
## ↑ A double quotation mark that is never closed makes DictReader misread the rows, so this has to be handled
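
One way to avoid the problem is to tell the csv module not to treat `"` as a quote character at all. The sketch below is a minimal illustration on made-up rows (the sample data and the `read_rows` helper are assumptions, not part of the dataset or of this notebook); it compares the default behaviour with `quoting=csv.QUOTE_NONE`.

```python
import csv
import io

# Two made-up tab-separated rows; the first title opens a quote that is never closed.
sample = (
    '1\t"Unclosed quote in a title  ...\thttp://example.com/a\tReuters\te\n'
    '2\tPlain title\thttp://example.com/b\tReuters\tb\n'
)
fieldnames = ["id", "title", "url", "publisher", "category"]


def read_rows(text, **reader_kwargs):
    """Parse tab-separated text into a list of dict rows (hypothetical helper)."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", fieldnames=fieldnames, **reader_kwargs
    )
    return list(reader)


# Default quoting: the unmatched quote makes the reader merge following rows.
default_rows = read_rows(sample)
# QUOTE_NONE: the quote character is kept as ordinary text, one row per line.
raw_rows = read_rows(sample, quoting=csv.QUOTE_NONE)

print(len(default_rows), len(raw_rows))
```

Applying the same keyword argument when reading newsCorpora.csv may change the number of extracted examples, so the counts reported below would need to be re-checked.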

In [22]:
import csv


PUBLISHERS = ('Reuters','Huffington Post','Businessweek','Contactmusic.com','Daily Mail')
CATEGORIES = ('b','t','e','m')
CATEGORIES_DIC = {'b':0,'t':1,'e':2,'m':3}
CATEGORIES_DIC_inv = {0:'business',1:'science and technology',2:'entertaiment',3:'health'}

# Keep (label id, headline) pairs for the target publishers and categories only
with open('files/NewsAggregatorDataset/newsCorpora.csv') as f:
    reader = csv.DictReader(f,delimiter='\t',fieldnames=["id","title","url","publisher","category","story","hostname","timestamp"])
    y_X_list = [(CATEGORIES_DIC[row['category']],row['title']) for row in reader if row['publisher'] in PUBLISHERS and row['category'] in CATEGORIES]


In [2]:
from sklearn.model_selection import train_test_split
from typing import List
import csv
y = [y for y, x in  y_X_list]
X = [x for y, x in  y_X_list]


print(f"y len :{len(y)}, X len: {len(X)}")

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.20, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(f"train:{len(X_train)}, valid:{len(X_valid)}, test:{len(X_test)}")

train = [[_y_train,_x_train] for _y_train,_x_train  in zip(y_train,X_train)]
valid = [[_y_valid,_x_valid] for _y_valid,_x_valid in zip(y_valid,X_valid)]
test = [[_y_test,_x_test] for _y_test,_x_test in zip(y_test,X_test)]

def save_data(data:List[List],file_name):
    with open(file_name, mode='w', encoding='utf-8') as f:
        writer = csv.writer(f,delimiter='\t')
        writer.writerows(data)

save_data(train,'files/train.txt')
save_data(valid,'files/valid.txt')
save_data(test,'files/test.txt')

y len :13340, X len: 13340
train:10672, valid:1334, test:1334


In [3]:
!head files/train.txt

0	Draghi Unites Euro Bulls With Bears Watching $1.35: Currencies
3	A Guide To Spring Gardening, For Allergy-Sufferers
2	By Odin's beard: Marvel creates a storm of controversy as it reveals Thor is a  ...
0	Walmart Strikes Deal That Will Hopefully Make Organic Food Cheaper
2	The Voice contestant Kristen Merlin goes silent mid-song after microphone fails  ...
1	Gender Non-Conforming Teen Forced To Remove Makeup For Driver's License  ...
2	From Cleaning Toilets to Intergalactic Royalty: Jupiter Ascending [Trailer +  ...
1	CORRECTED-Sprint's revenue beats estimate as network upgrade progresses
2	'My mother touched me I'm certain': Hip hop mogul gunned down at mom's  ...
0	Fairfax CEO Watsa Probed by Regulator in Trading Review


In [4]:
!cut -f 1 files/train.txt | sort | uniq -c

4538 0
1205 1
4228 2
 701 3


In [5]:
!cut -f 1 files/valid.txt | sort | uniq -c

 531 0
 155 1
 529 2
 119 3


In [6]:
!cut -f 1 files/test.txt | sort | uniq -c

 558 0
 164 1
 522 2
  90 3
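
The same per-category counts can also be obtained in Python, which makes it easy to print proportions next to them. This is a minimal sketch on a few made-up lines; `count_labels` is a hypothetical helper, and for the real files one would pass an open file object instead of the `io.StringIO` sample.

```python
from collections import Counter
import io


def count_labels(lines):
    """Count the first tab-separated field of each non-empty line."""
    return Counter(line.split("\t", 1)[0] for line in lines if line.strip())


# Made-up sample in the same "label<TAB>headline" layout as train.txt
sample = io.StringIO(
    "0\tBank profits rise\n"
    "2\tNew film opens\n"
    "0\tOil prices fall\n"
    "3\tFlu season starts\n"
)

counts = count_labels(sample)
total = sum(counts.values())
for label in sorted(counts):
    # label, absolute count, share of all examples
    print(f"{label}\t{counts[label]}\t{counts[label] / total:.1%}")
```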


# 51. Feature extraction
Extract features from the training, validation, and test data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt. Design whatever features you think are useful for category classification. The minimal baseline would be the headline converted to a sequence of words.

## Feature 1: Bag of Words (1-gram, 2-gram, binary=True)
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

count_vect = CountVectorizer(binary=True,ngram_range=(1,2),dtype=np.int8)
X_train_bow = count_vect.fit_transform(X_train).toarray()
X_valid_bow = count_vect.transform(X_valid).toarray()
X_test_bow = count_vect.transform(X_test).toarray()

print(f"train: {X_train_bow.shape}\nvalid: {X_valid_bow.shape}\ntest: {X_test_bow.shape}")


train: (10672, 75528)
valid: (1334, 75528)
test: (1334, 75528)


In [14]:
import numpy as np
print(np.array([y_train],dtype=np.int8).T.shape)
print(f"type :{X_train_bow.dtype} , shape:{X_train_bow.shape}")
train_bow = np.hstack((np.array([y_train]).T, X_train_bow)).astype(np.int8)
valid_bow = np.hstack((np.array([y_valid]).T, X_valid_bow)).astype(np.int8)
test_bow = np.hstack((np.array([y_test]).T, X_test_bow)).astype(np.int8)
print(f"type : {train_bow.dtype}")
print(f"train: {train_bow.shape}\nvalid: {valid_bow.shape}\ntest: {test_bow.shape}")


(10672, 1)
type :int8 , shape:(10672, 75528)
type : int8
train: (10672, 75529)
valid: (1334, 75529)
test: (1334, 75529)


Saving these dense arrays would produce files of more than 5 GB each, so they are not saved!
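
The size comes from writing every zero explicitly. A sparse text layout that lists only the non-zero feature indices stays small. The sketch below is one possible approach, not what this notebook does: `to_sparse_line` and the tiny vocabulary are hypothetical, and the whitespace tokenization is a simplification of what `CountVectorizer` does. Keeping the matrix returned by `fit_transform` sparse and saving it with `scipy.sparse.save_npz` would be another option.

```python
def to_sparse_line(label, text, vocabulary):
    """Format one example as 'label idx:1 idx:1 ...' with binary 1-gram/2-gram features."""
    tokens = text.lower().split()
    # 1-grams followed by 2-grams built from adjacent tokens
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # Keep only n-grams that exist in the vocabulary; a set makes the features binary
    indices = sorted({vocabulary[g] for g in grams if g in vocabulary})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


# Hypothetical vocabulary mapping n-grams to column indices
toy_vocabulary = {"bank": 0, "stocks": 1, "bank stocks": 2, "film": 3}

line_a = to_sparse_line(0, "Bank stocks rise", toy_vocabulary)
line_b = to_sparse_line(2, "New film opens", toy_vocabulary)
print(line_a)
print(line_b)
```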

In [102]:
# Feature 2: tf-idf
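
This cell is left as a placeholder. As a rough idea of what tf-idf features would look like, here is a minimal pure-Python sketch under stated assumptions: whitespace tokenization, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per headline. `tfidf_vectors` is a hypothetical helper; in practice scikit-learn's `TfidfVectorizer` would be the natural choice.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        # Guard against empty headlines to avoid dividing by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["bank stocks rise", "bank shares fall", "film tops box office"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first headline, "bank" appears in two documents and "stocks" in only one, so "stocks" receives the larger weight.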

In [103]:
# Feature 3: word2vec
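
This cell is also a placeholder. A common way to turn word vectors into a headline feature is to average the vectors of the words that have one. The sketch below assumes that approach; `TOY_VECTORS` is a made-up two-dimensional table standing in for pretrained word2vec embeddings, and `headline_vector` is a hypothetical helper.

```python
# Made-up 2-dimensional vectors; real word2vec embeddings are much larger
TOY_VECTORS = {
    "bank": [1.0, 0.0],
    "stocks": [0.8, 0.2],
    "film": [0.0, 1.0],
}


def headline_vector(text, vectors, dim):
    """Average the vectors of known words; return zeros if no word is known."""
    hits = [vectors[token] for token in text.lower().split() if token in vectors]
    if not hits:
        return [0.0] * dim
    # zip(*hits) iterates over dimensions, so each column is averaged separately
    return [sum(column) / len(hits) for column in zip(*hits)]


vec_known = headline_vector("Bank stocks rise", TOY_VECTORS, dim=2)
vec_unknown = headline_vector("Completely unseen words", TOY_VECTORS, dim=2)
print(vec_known, vec_unknown)
```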

# 52. Training
Train a logistic regression model using the training data built in problem 51.

In [17]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0,dual=True,solver='liblinear',verbose=1).fit(X_train_bow,y_train)

[LibLinear]

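# 53. Prediction
Using the logistic regression model trained in problem 52, implement a program that computes the category of a given headline and its predicted probability.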

In [23]:
sentence = "science is important to improve your life."
CATEGORIES_DIC_inv[clf.predict(count_vect.transform([sentence]).toarray())[0]]

'entertaiment'
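
The cell above returns only the category. The predicted probability asked for in the problem can be read from `predict_proba`. The following is a minimal, self-contained sketch on made-up headlines: the toy data, `LABEL_NAMES`, and `predict_with_proba` are assumptions for illustration, and with the notebook's own objects the same call would be made on `clf` and `count_vect`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy subset of the label mapping (hypothetical, two classes only)
LABEL_NAMES = {0: "business", 2: "entertainment"}

texts = [
    "stocks rally as bank profits rise",
    "bank shares fall on rate fears",
    "oil prices lift energy stocks",
    "new film tops box office",
    "singer announces world tour",
    "actor joins film sequel",
]
labels = [0, 0, 0, 2, 2, 2]

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
model = LogisticRegression(solver="liblinear", random_state=0)
model.fit(vectorizer.fit_transform(texts), labels)


def predict_with_proba(headline):
    """Return (category name, probability of that category) for one headline."""
    proba = model.predict_proba(vectorizer.transform([headline]))[0]
    best = proba.argmax()
    # model.classes_ maps probability columns back to the original label ids
    return LABEL_NAMES[int(model.classes_[best])], float(proba[best])


name, probability = predict_with_proba("bank stocks rise")
print(name, probability)
```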

# 54. Measuring accuracy
Measure the accuracy of the logistic regression model trained in problem 52 on the training data and on the test data.

In [None]:
from sklearn.metrics import accuracy_score
y_train_pred =  clf.predict(X_train_bow)
y_test_pred = clf.predict(X_test_bow)

print(f"train accuracy:{accuracy_score(y_train,y_train_pred)}\ntest accuracy:{accuracy_score(y_test,y_test_pred)}")

# 55. Building the confusion matrix
Build the confusion matrix of the logistic regression model trained in problem 52 on the training data and on the test data.

In [None]:
from sklearn.metrics import confusion_matrix
print(f"train confusion matrix:\n{confusion_matrix(y_train,y_train_pred)}\ntest confusion matrix:\n{confusion_matrix(y_test,y_test_pred)}")
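
To make the layout of the matrix concrete, here is a small pure-Python sketch that builds it by hand. It follows the same convention as scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels); the label lists are made up and `build_confusion` is a hypothetical helper.

```python
def build_confusion(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes matrix: rows = true label, columns = predicted label."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix


# Made-up labels for the four categories 0..3
toy_true = [0, 0, 1, 2, 2, 3]
toy_pred = [0, 2, 1, 2, 2, 0]

toy_matrix = build_confusion(toy_true, toy_pred, n_classes=4)
for row in toy_matrix:
    print(row)
```

The diagonal holds the correctly classified examples, so the accuracy is the sum of the diagonal divided by the number of examples.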

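# 56. Measuring precision, recall, and F1 score
Measure the precision, recall, and F1 score of the logistic regression model trained in problem 52 on the test data. Compute precision, recall, and F1 score for each category, then combine the per-category results with the micro-average and the macro-average.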

In [None]:
from sklearn.metrics import precision_score,recall_score,f1_score
# Compute each score once and reuse it when printing
pre_micro = precision_score(y_test,y_test_pred,average='micro')
pre_macro = precision_score(y_test,y_test_pred,average='macro')
rec_micro = recall_score(y_test,y_test_pred,average='micro')
rec_macro = recall_score(y_test,y_test_pred,average='macro')
f1_micro = f1_score(y_test,y_test_pred,average='micro')
f1_macro = f1_score(y_test,y_test_pred,average='macro')
print("precision micro:",pre_micro)
print("precision macro:",pre_macro)
print("recall micro:",rec_micro)
print("recall macro:",rec_macro)
print("f1 micro:",f1_micro)
print("f1 macro:",f1_macro)

# Computed by hand from precision and recall, micro F1 looks like this
# print("f1 micro:",2*pre_micro*rec_micro/(pre_micro+rec_micro))


In [None]:
# Per-category precision: average=None returns one score per label
pre_per_class = precision_score(y_test, y_test_pred, average=None, labels=[0, 1, 2, 3])
for label, pre in enumerate(pre_per_class):
    print(f'precision {CATEGORIES_DIC_inv[label]}:', pre)

# Micro average: weight each category's precision by how often it was predicted
pred_counts = np.bincount(y_test_pred, minlength=4)
print('precision micro:', (pre_per_class * pred_counts).sum() / len(y_test))
# Macro average: plain mean over the four categories
print('precision macro:', pre_per_class.mean())
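
To show how the per-category scores and the two averages relate, the sketch below computes everything from true-positive, false-positive, and false-negative counts in pure Python. The label lists are made up and `per_class_counts`, `prf`, and the variable names are hypothetical; scores with an empty denominator are set to 0 here by assumption.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} for each label."""
    counts = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    return counts


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when a denominator is empty)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


toy_true = [0, 0, 1, 2, 2, 3]
toy_pred = [0, 2, 1, 2, 2, 0]
labels = [0, 1, 2, 3]

counts = per_class_counts(toy_true, toy_pred, labels)
scores = {c: prf(*counts[c]) for c in labels}

# Macro average: mean of the per-category scores
macro = tuple(sum(scores[c][i] for c in labels) / len(labels) for i in range(3))
# Micro average: pool the counts over all categories, then compute the scores once
micro = prf(*(sum(counts[c][i] for c in labels) for i in range(3)))

print("per class:", scores)
print("macro (P, R, F1):", macro)
print("micro (P, R, F1):", micro)
```

In single-label multi-class classification every error is a false positive for one category and a false negative for another, so micro precision, micro recall, and micro F1 all equal the accuracy.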


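# 57. Checking feature weights
For the logistic regression model trained in problem 52, check the 10 features with the highest weights and the 10 features with the lowest weights.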

In [None]:
# argsort sorts in ascending order, so the highest weights are in the last columns
order = clf.coef_.argsort(axis=1)
top10 = order[:, ::-1][:, :10]   # 10 highest weights, largest first
bottom10 = order[:, :10]         # 10 lowest weights, smallest first
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = np.array(count_vect.get_feature_names_out())
for label, name in CATEGORIES_DIC_inv.items():
    print(name," top 10:",','.join(feature_names[top10[label]]))
    print(name," bottom 10:",','.join(feature_names[bottom10[label]]))

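# 58. Changing the regularization parameter
When training a logistic regression model, the degree of overfitting can be controlled by adjusting the regularization parameter. Train logistic regression models with different regularization parameters and compute the accuracy on the training, validation, and test data. Summarize the results in a graph with the regularization parameter on the horizontal axis and the accuracy on the vertical axis.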
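
No solution is given for this problem in the notebook. The following is a minimal sketch of one way to approach it, run on made-up headlines so that it is self-contained. The toy splits, `sweep_c`, and `plot_results` are assumptions; note that scikit-learn's `C` is the inverse of the regularization strength, so small values mean strong regularization.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up stand-ins for the real train/valid/test splits
train_texts = ["stocks rally as bank profits rise", "bank shares fall on rate fears",
               "oil prices lift energy stocks", "new film tops box office",
               "singer announces world tour", "actor joins film sequel"]
train_labels = [0, 0, 0, 2, 2, 2]
valid_texts, valid_labels = ["bank stocks rise", "film star plans tour"], [0, 2]
test_texts, test_labels = ["oil shares fall", "singer joins film"], [0, 2]

vectorizer = CountVectorizer(binary=True)
X_tr = vectorizer.fit_transform(train_texts)  # kept sparse; no .toarray() needed
X_va = vectorizer.transform(valid_texts)
X_te = vectorizer.transform(test_texts)


def sweep_c(c_values):
    """Train one model per C and return (C, train acc, valid acc, test acc) tuples."""
    results = []
    for c in c_values:
        model = LogisticRegression(C=c, solver="liblinear", random_state=0)
        model.fit(X_tr, train_labels)
        results.append((
            c,
            accuracy_score(train_labels, model.predict(X_tr)),
            accuracy_score(valid_labels, model.predict(X_va)),
            accuracy_score(test_labels, model.predict(X_te)),
        ))
    return results


def plot_results(results):
    """Plot accuracy against C on a log-scaled x axis (requires matplotlib)."""
    import matplotlib.pyplot as plt
    cs = [r[0] for r in results]
    for column, split in enumerate(["train", "valid", "test"], start=1):
        plt.plot(cs, [r[column] for r in results], marker="o", label=split)
    plt.xscale("log")
    plt.xlabel("C (inverse regularization strength)")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()


c_values = [0.01, 0.1, 1.0, 10.0, 100.0]
results = sweep_c(c_values)
for row in results:
    print(row)
# plot_results(results)  # uncomment to draw the graph
```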



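# 59. Searching for hyperparameters
Train category classification models while varying the learning algorithm and its parameters. Find the learning algorithm and parameters that give the highest accuracy on the validation data. Also compute the accuracy on the test data when using that learning algorithm and those parameters.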
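
No solution is given for this problem either. A simple way to approach it is to fit every candidate on the training data, pick the one with the best validation accuracy, and only then look at the test data. The sketch below assumes that procedure; the toy splits, the candidate list, and `select_best` are made up for illustration and are not tuned for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up stand-ins for the real train/valid/test splits
train_texts = ["stocks rally as bank profits rise", "bank shares fall on rate fears",
               "oil prices lift energy stocks", "new film tops box office",
               "singer announces world tour", "actor joins film sequel"]
train_labels = [0, 0, 0, 2, 2, 2]
valid_texts, valid_labels = ["bank stocks rise", "film star plans tour"], [0, 2]
test_texts, test_labels = ["oil shares fall", "singer joins film"], [0, 2]

vectorizer = CountVectorizer(binary=True)
X_tr = vectorizer.fit_transform(train_texts)
X_va = vectorizer.transform(valid_texts)
X_te = vectorizer.transform(test_texts)

# Candidate algorithm/parameter combinations (an arbitrary small grid)
candidates = {
    "logistic regression C=0.1": LogisticRegression(C=0.1, solver="liblinear", random_state=0),
    "logistic regression C=10": LogisticRegression(C=10.0, solver="liblinear", random_state=0),
    "linear SVM C=1": LinearSVC(C=1.0, random_state=0),
    "naive Bayes alpha=1": MultinomialNB(alpha=1.0),
}


def select_best(models):
    """Fit every model on the training data and rank them by validation accuracy."""
    valid_scores = {}
    for name, model in models.items():
        model.fit(X_tr, train_labels)
        valid_scores[name] = accuracy_score(valid_labels, model.predict(X_va))
    best = max(valid_scores, key=valid_scores.get)
    return best, valid_scores


best_name, valid_scores = select_best(candidates)
# The test data is used once, only for the selected model
test_accuracy = accuracy_score(test_labels, candidates[best_name].predict(X_te))
print("best on validation:", best_name, valid_scores[best_name])
print("test accuracy:", test_accuracy)
```

Selecting on the validation data and reporting on the test data keeps the test accuracy an unbiased estimate for the chosen configuration.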

