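# Chapter 6: Machine Learning
In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to classify news headlines into the categories "business", "science and technology", "entertainment", and "health" (category classification).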

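# 50. Obtaining and formatting the data
Download the News Aggregator Data Set and create the training data (train.txt), validation data (valid.txt), and test data (test.txt) as follows.

1. Unzip the downloaded zip file and read the description in readme.txt.
2. Extract only the examples (articles) whose source (publisher) is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
3. Shuffle the extracted examples randomly.
4. Split the extracted examples into 80% training data and 10% each for validation and test data, and save them as train.txt, valid.txt, and test.txt, respectively. Write one example per line, in a tab-separated format of category name and headline (these files are reused later in problem 70).

Once the training data and test data have been created, check the number of examples in each category.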

In [1]:
!ls files/NewsAggregatorDataset/

2pageSessions.csv newsCorpora.csv   readme.txt


In [2]:
!head -210713 files/NewsAggregatorDataset/newsCorpora.csv | tail -1

210713	The Best Reactions To The Supposed Video of Solange Knowles & Jay Z  ...	http://www.hiphopdx.com/index/news/id.28728/title.-the-best-reactions-to-the-supposed-video-of-solange-knowles-jay-z-fighting-in-an-elevator-list-by-time/	HipHopDX	e	dku0uRoeehpC9JM1RoZ4n0fg8cyoM	www.hiphopdx.com	1399983366398


"The Best Reactions To The Supposed Video of Solange Knowles & Jay Z  ...

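The title above has a double quotation mark on one side only, and csv.DictReader does not read such fields correctly.
I fixed every line matched by the regular expression \t"[^"]+?\n before loading the file.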
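
As an alternative to editing the file by hand, the reader can be told not to treat double quotes as special at all. The following is a minimal sketch under that assumption, using `quoting=csv.QUOTE_NONE` from the standard `csv` module. The two sample rows and the `example.com` URLs are made up for illustration and are not taken from newsCorpora.csv.

```python
import csv
import io

FIELDNAMES = ["id", "title", "url", "publisher", "category",
              "story", "hostname", "timestamp"]

# Hypothetical two-row sample; the first title has an unclosed double quote.
sample = (
    '1\t"Unclosed quote title ...\thttp://example.com/a\tReuters\tb\ts1\texample.com\t1\n'
    '2\tPlain title\thttp://example.com/b\tReuters\te\ts2\texample.com\t2\n'
)

# QUOTE_NONE makes the reader treat '"' as an ordinary character,
# so an unbalanced quote can no longer swallow the following lines.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t",
                        fieldnames=FIELDNAMES, quoting=csv.QUOTE_NONE)
rows = list(reader)
print(len(rows), rows[0]["title"])
```

One consequence of this choice is that titles which are correctly wrapped in quotes keep their quote characters, so whether this is preferable to fixing the file depends on how the titles are used afterwards.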

In [3]:
import csv


PUBLISHERS = ('Reuters','Huffington Post','Businessweek','Contactmusic.com','Daily Mail')
CATEGORIES = ('b','t','e','m')
CATEGORIES_DIC = {'b':0,'t':1,'e':2,'m':3}
CATEGORIES_DIC_inv = {0:'business',1:'science and technology',2:'entertaiment',3:'health'}

y_X_list = []

with open('files/NewsAggregatorDataset/newsCorpora.csv') as f:
    reader = csv.DictReader(f,delimiter='\t',fieldnames=["id","title","url","publisher","category","story","hostname","timestamp"])
    y_X_list = [(CATEGORIES_DIC[row['category']],row['title']) for row in reader if row['publisher'] in PUBLISHERS and row['category'] in CATEGORIES]


In [4]:
from sklearn.model_selection import train_test_split
from typing import List
import csv
y = [y for y, x in  y_X_list]
X = [x for y, x in  y_X_list]


print(f"y len :{len(y)}, X len: {len(X)}")

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.20, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(f"train:{len(X_train)}, valid:{len(X_valid)}, test:{len(X_test)}")

train = [[_y_train,_x_train] for _y_train,_x_train  in zip(y_train,X_train)]
valid = [[_y_valid,_x_valid] for _y_valid,_x_valid in zip(y_valid,X_valid)]
test = [[_y_test,_x_test] for _y_test,_x_test in zip(y_test,X_test)]

def save_data(data:List[List],file_name:str):
    with open(file_name, mode='w', encoding='utf-8') as f:
        writer = csv.writer(f,delimiter='\t')
        writer.writerows(data)

save_data(train,'files/train.txt')
save_data(valid,'files/valid.txt')
save_data(test,'files/test.txt')

y len :13356, X len: 13356
train:10684, valid:1336, test:1336


In [5]:
!head files/train.txt

2	Lena Dunham: I Feel Prettier With A Naked Face And ChapStick
0	REFILE-FOREX-Yen grinds lower as global stocks rally, dollar holds steady
2	Dolly Parton is a pint-size knickerbocker glittering glory
0	UPDATE 3-Shares of China's JD.com climb in US market debut
2	Seth Rogen talks about Zac Efrons sex appeal on chat show
0	UPDATE 1-Marlboro maker Philip Morris cuts 2014 earnings forecast
0	Walmart Strikes Deal That Will Hopefully Make Organic Food Cheaper
2	Julia Roberts - Julia Roberts opens up about half-sister's death
3	Top Official: VA Has Lost Trust Of Veterans, American People
2	From Cleaning Toilets to Intergalactic Royalty: Jupiter Ascending [Trailer +  ...


In [6]:
!cut -f 1 files/train.txt | sort | uniq -c

4558 0
1193 1
4231 2
 702 3


In [7]:
!cut -f 1 files/valid.txt | sort | uniq -c

 511 0
 196 1
 526 2
 103 3


In [8]:
!cut -f 1 files/test.txt | sort | uniq -c

 558 0
 136 1
 537 2
 105 3


# 51. Feature extraction
Extract features from the training, validation, and test data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt, respectively. Design whatever features seem useful for category classification. The minimum baseline would be the headline converted into a sequence of words.

## Feature 1: Bag of Words (1-gram, 2-gram, binary=True)
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

count_vect = CountVectorizer(binary=True,ngram_range=(1,2),dtype=np.int8)
X_train_bow = count_vect.fit_transform(X_train).toarray()
X_valid_bow = count_vect.transform(X_valid).toarray()
X_test_bow = count_vect.transform(X_test).toarray()

print(f"train: {X_train_bow.shape}\nvalid: {X_valid_bow.shape}\ntest: {X_test_bow.shape}")


train: (10684, 75580)
valid: (1336, 75580)
test: (1336, 75580)


In [11]:
import numpy as np
print(np.array([y_train],dtype=np.int8).T.shape)
print(f"type :{X_train_bow.dtype} , shape:{X_train_bow.shape}")
train_bow = np.hstack((np.array([y_train]).T, X_train_bow)).astype(np.int8)
valid_bow = np.hstack((np.array([y_valid]).T, X_valid_bow)).astype(np.int8)
test_bow = np.hstack((np.array([y_test]).T, X_test_bow)).astype(np.int8)
print(f"type : {train_bow.dtype}")
print(f"train: {train_bow.shape}\nvalid: {valid_bow.shape}\ntest: {test_bow.shape}")


(10684, 1)
type :int8 , shape:(10684, 75580)
type : int8
train: (10684, 75581)
valid: (1336, 75581)
test: (1336, 75581)


Saving these arrays would produce files of more than 5 GB each, so I do not save them.
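
The size problem comes from converting the vectorizer output to a dense array with `.toarray()`. A possible workaround, sketched below under the assumption that SciPy is available, is to keep the sparse matrix that `CountVectorizer` returns and store it with `scipy.sparse.save_npz`. The three headlines and the file name `train.feature.npz` are placeholders, not the notebook's data.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder headlines standing in for X_train.
docs = ["fed raises interest rates",
        "new film star wedding",
        "ebola study finds new drug"]

vect = CountVectorizer(binary=True, ngram_range=(1, 2), dtype=np.int8)
X_sparse = vect.fit_transform(docs)  # stays a SciPy sparse matrix

# Store only the non-zero entries in a compressed .npz file.
path = os.path.join(tempfile.mkdtemp(), "train.feature.npz")
sparse.save_npz(path, X_sparse)
X_loaded = sparse.load_npz(path)

# Rough memory comparison between dense and CSR storage.
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * X_sparse.dtype.itemsize
sparse_bytes = (X_sparse.data.nbytes + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
print(dense_bytes, sparse_bytes)
```

On a corpus this small the sparse form is not necessarily smaller; the saving appears when almost all entries are zero, as with 75,580 n-gram features per headline. `LogisticRegression` also accepts sparse input, so the matrix can be passed to `fit` without densifying it.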

## Feature 2: tf-idf

In [14]:
np.float16

numpy.float16

In [22]:
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(binary=True,ngram_range=(1,2),dtype=np.float32)
X_train_tfidf = vectorizer.fit_transform(X_train).toarray()
X_valid_tfidf = vectorizer.transform(X_valid).toarray()
X_test_tfidf = vectorizer.transform(X_test).toarray()
print(f"train: {X_train_tfidf.shape}\nvalid: {X_valid_tfidf.shape}\ntest: {X_test_tfidf.shape}")

print(np.array([y_train],dtype=np.float32).T.shape)
print(f"type :{X_train_tfidf.dtype} , shape:{X_train_tfidf.shape}")
train_tfidf = np.hstack((np.array([y_train]).T, X_train_tfidf)).astype(np.float32)
valid_tfidf = np.hstack((np.array([y_valid]).T, X_valid_tfidf)).astype(np.float32)
test_tfidf = np.hstack((np.array([y_test]).T, X_test_tfidf)).astype(np.float32)
print(f"type : {train_tfidf.dtype}")
print(f"train: {train_tfidf.shape}\nvalid: {valid_tfidf.shape}\ntest: {test_tfidf.shape}")



train: (10684, 75580)
valid: (1336, 75580)
test: (1336, 75580)
(10684, 1)
type :float32 , shape:(10684, 75580)
type : float32
train: (10684, 75581)
valid: (1336, 75581)
test: (1336, 75581)


# 52. Training
Using the training data built in problem 51, train a logistic regression model.

In [31]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0,dual=True,solver='liblinear',verbose=1).fit(X_train_bow,y_train)

[LibLinear]

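# 53. Prediction
Using the logistic regression model trained in problem 52, implement a program that computes the category of a given headline and its predicted probability.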

In [32]:
sentence = "science is important to improve your life."
CATEGORIES_DIC_inv[clf.predict(count_vect.transform([sentence]).toarray())[0]]

'entertaiment'
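
The cell above returns only the category, while the problem also asks for the predicted probability. A minimal sketch of how both could be returned with `predict_proba` follows. The toy headlines, the two-class label map, and the helper name `predict_with_proba` are illustrative assumptions and stand in for the notebook's `count_vect`, `clf`, and `CATEGORIES_DIC_inv`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the notebook's training data (hypothetical).
headlines = ["stocks fall as bank shares slide",
             "fed raises interest rates",
             "new movie star wedding",
             "film festival opens with star"]
labels = [0, 0, 2, 2]
label_names = {0: "business", 2: "entertainment"}

vect = CountVectorizer(binary=True)
model = LogisticRegression(solver="liblinear", random_state=0)
model.fit(vect.fit_transform(headlines), labels)

def predict_with_proba(headline):
    """Return (category name, probability) for the most likely category."""
    proba = model.predict_proba(vect.transform([headline]))[0]
    best = int(np.argmax(proba))
    # model.classes_ maps probability columns back to the label ids.
    return label_names[int(model.classes_[best])], float(proba[best])

category, p = predict_with_proba("bank shares rally")
print(category, p)
```

The columns of `predict_proba` follow the order of `model.classes_`, which is why the lookup goes through that attribute rather than assuming the column index equals the label id.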

# 54. Measuring accuracy
Measure the accuracy of the logistic regression model trained in problem 52 on the training data and on the test data.

In [33]:
from sklearn.metrics import accuracy_score
y_train_pred =  clf.predict(X_train_bow)
y_test_pred = clf.predict(X_test_bow)

print(f"train accuracy:{accuracy_score(y_train,y_train_pred)}\ntest accuracy:{accuracy_score(y_test,y_test_pred)}")

train accuracy:0.9988768251591165
test accuracy:0.9101796407185628


# 55. Creating a confusion matrix
Create the confusion matrix of the logistic regression model trained in problem 52 on the training data and on the test data.

In [34]:
from sklearn.metrics import confusion_matrix
print(f"train confusion matrix:\n{confusion_matrix(y_train,y_train_pred)}\ntest confusion matrix:\n{confusion_matrix(y_test,y_test_pred)}")

train confusion matrix:
[[4552    4    2    0]
 [   2 1191    0    0]
 [   3    0 4228    0]
 [   0    0    1  701]]
test confusion matrix:
[[529   6  21   2]
 [ 22  94  19   1]
 [  9   4 522   2]
 [ 21   0  13  71]]


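# 56. Measuring precision, recall, and F1 score
Measure the precision, recall, and F1 score of the logistic regression model trained in problem 52 on the test data. Compute precision, recall, and F1 score for each category, then combine the per-category performance using the micro-average and the macro-average.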

In [35]:
from sklearn.metrics import precision_score,recall_score,f1_score
pre_micro = precision_score(y_test,y_test_pred,average='micro')
pre_macro = precision_score(y_test,y_test_pred,average='macro')
rec_micro = recall_score(y_test,y_test_pred,average='micro')
rec_macro = recall_score(y_test,y_test_pred,average='macro')
f1_micro = f1_score(y_test,y_test_pred,average='micro')
f1_macro = f1_score(y_test,y_test_pred,average='macro')
print("precision micro:",pre_micro)
print("precision macro:",pre_macro)
print("recall micro:",rec_micro)
print("recall macro:",rec_macro)
print("f1 micro:",f1_micro)
print("f1 macro:",f1_macro)

# Computing micro F1 directly from its definition looks like this
# print("f1 micro:",2*pre_micro*rec_micro/(pre_micro+rec_micro))


precision micro: 0.9101796407185628
precision macro: 0.9140954766333167
recall micro: 0.9101796407185628
recall macro: 0.8218656649299956
f1 micro: 0.9101796407185628
f1 macro: 0.8588994069418818


In [37]:
# Computing each category separately... (I would like to know a neater way to write this)
# a = np.array(y_test)[y_test_pred == 0]
# a[a > 1] = 1
# pre_0 = precision_score(1 - a , 1-y_test_pred[y_test_pred == 0])

# a = np.array(y_test)[y_test_pred == 1]
# a[a != 1] = 0
# pre_1 = precision_score(a , y_test_pred[y_test_pred == 1])

# a = np.array(y_test)[y_test_pred == 2]
# a[a != 2] = 0
# a[a == 2] = 1
# pre_2 = precision_score(a , y_test_pred[y_test_pred == 2]-1)

# a = np.array(y_test)[y_test_pred == 3]
# a[a != 3] = 0
# a[a == 3] = 1
# pre_3 = precision_score(a , y_test_pred[y_test_pred == 3]-2)
# print('precision micro:',(pre_0*len(y_test_pred[y_test_pred == 0])+pre_1*len(y_test_pred[y_test_pred == 1])+pre_2*len(y_test_pred[y_test_pred == 2])+pre_3*len(y_test_pred[y_test_pred == 3]))/len(y_test))
# print('precision macro:',(pre_0+pre_1+pre_2+pre_3)/4)
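
One neater way to get the per-category values is to derive them from the confusion matrix. The sketch below uses the test-set matrix printed in problem 55 and plain NumPy; it is an illustration, not the only option. scikit-learn's `precision_recall_fscore_support(y_test, y_test_pred, average=None)` should give the same per-category arrays directly from the label vectors.

```python
import numpy as np

# Test-set confusion matrix from problem 55 (rows: true, columns: predicted).
cm = np.array([[529,   6,  21,  2],
               [ 22,  94,  19,  1],
               [  9,   4, 522,  2],
               [ 21,   0,  13, 71]])

tp = np.diag(cm)                 # correctly classified examples per category
precision = tp / cm.sum(axis=0)  # TP / number of predictions of that category
recall = tp / cm.sum(axis=1)     # TP / number of true examples of that category
f1 = 2 * precision * recall / (precision + recall)

# Macro average: unweighted mean over categories.
macro = {"precision": precision.mean(),
         "recall": recall.mean(),
         "f1": f1.mean()}

# Micro average: pool all decisions. For single-label multiclass data,
# micro precision, recall, and F1 all coincide with accuracy.
micro = tp.sum() / cm.sum()

for name, p, r, f in zip(["business", "sci/tech", "entertainment", "health"],
                         precision, recall, f1):
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(macro, micro)
```

The macro and micro values obtained this way agree with the scikit-learn output shown above, which is a useful sanity check on the per-category numbers.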

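# 57. Checking feature weights
For the logistic regression model trained in problem 52, check the 10 features with the highest weights and the 10 features with the lowest weights.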

In [36]:
feature_names = np.array(count_vect.get_feature_names_out())  # scikit-learn >= 1.0; older versions use get_feature_names()
order = clf.coef_.argsort()      # feature indices per category, ascending by weight
top10 = order[:,::-1][:,:10]     # 10 highest weights
bottom10 = order[:,:10]          # 10 lowest weights
for i in range(len(CATEGORIES_DIC_inv)):
    print(CATEGORIES_DIC_inv[i]," top 10:",','.join(feature_names[top10[i]]))
    print(CATEGORIES_DIC_inv[i]," bottom 10:",','.join(feature_names[bottom10[i]]))
# argsort sorts in ascending order, so the highest-weighted features, which characterize each category, come last

business  top 10: fed,bank,ecb,obamacare,ukraine,china,oil,euro,mcdonald,yellen
business  bottom 10: ebola,google,her,aereo,apple,microsoft,facebook,kardashian,star,chris
science and technology  top 10: google,facebook,apple,microsoft,climate,tesla,comcast,nasa,heartbleed,gm
science and technology  bottom 10: stocks,fed,ecb,ukraine,bank,euro,american,kardashian,shares,movie
entertaiment  top 10: kardashian,chris,film,star,wedding,her,jennifer,movie,jay,paul
entertaiment  bottom 10: google,facebook,china,apple,gm,fed,bank,update,billion,study
health  top 10: ebola,cancer,study,fda,mers,drug,cigarettes,cdc,doctors,health
health  bottom 10: google,facebook,apple,gm,fed,ecb,bank,kardashian,climate,sales


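# 58. Changing the regularization parameter
When training a logistic regression model, the degree of overfitting during training can be controlled by adjusting the regularization parameter. Train logistic regression models with different regularization parameters and compute the accuracy on the training, validation, and test data. Summarize the results in a graph with the regularization parameter on the horizontal axis and accuracy on the vertical axis.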
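
In scikit-learn's `LogisticRegression`, the parameter `C` is the inverse of the regularization strength, so smaller values mean stronger regularization. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption made only to keep the example small; it is not the news data. It uses the default L2 penalty with the `lbfgs` solver rather than the L1 setup used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustration only).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

norms = []
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, solver="lbfgs", max_iter=1000).fit(X, y)
    # A weaker penalty (larger C) lets the weight vector grow.
    norms.append(float(np.linalg.norm(model.coef_)))
    print(f"C={C}: ||w||={norms[-1]:.3f}, train acc={model.score(X, y):.3f}")
```

Larger weights let the model fit the training data more closely, which is why training accuracy tends to rise with `C` while validation accuracy may level off or drop.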



In [105]:

train_accuracy = []
valid_accuracy = []
test_accuracy = []
Cs = np.logspace(-1,2,10)
for c in Cs:
    _clf = LogisticRegression(random_state=0,dual=False,solver='liblinear',class_weight='balanced',penalty='l1',verbose=1,C=c).fit(X_train_bow,y_train)
    _y_train_pred = _clf.predict(X_train_bow)
    _y_valid_pred = _clf.predict(X_valid_bow)
    _y_test_pred = _clf.predict(X_test_bow)
    train_accuracy.append(accuracy_score(y_train,_y_train_pred))
    valid_accuracy.append(accuracy_score(y_valid,_y_valid_pred))
    test_accuracy.append(accuracy_score(y_test,_y_test_pred))

[LibLinear][LibLinear][LibLinear]
Liblinear failed to converge, increase the number of iterations.
[LibLinear]
Liblinear failed to converge, increase the number of iterations.
[LibLinear]
Liblinear failed to converge, increase the number of iterations.
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]

In [106]:
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import plotly.graph_objects as go

fig = go.Figure(data=go.Scatter(x=Cs, y=train_accuracy,name='train'),layout_title_text="accuracy")
fig.add_trace(
    go.Scatter(x=Cs, y=valid_accuracy,name='valid')
)
fig.add_trace(
    go.Scatter(x=Cs, y=test_accuracy,name='test')
)


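fig.update_layout(xaxis_title="C (inverse regularization strength)", yaxis_title="accuracy")
fig.update_xaxes(type="log")  # Cs is log-spaced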
fig.show()

## tfidf

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import plotly.graph_objects as go



train_accuracy = []
valid_accuracy = []
test_accuracy = []
Cs = np.logspace(-1,2,10)
for c in Cs:
    _clf = LogisticRegression(random_state=0,dual=False,solver='liblinear',class_weight='balanced',penalty='l1',verbose=1,C=c).fit(X_train_tfidf,y_train)
    _y_train_pred = _clf.predict(X_train_tfidf)
    _y_valid_pred = _clf.predict(X_valid_tfidf)
    _y_test_pred = _clf.predict(X_test_tfidf)
    train_accuracy.append(accuracy_score(y_train,_y_train_pred))
    valid_accuracy.append(accuracy_score(y_valid,_y_valid_pred))
    test_accuracy.append(accuracy_score(y_test,_y_test_pred))


fig = go.Figure(data=go.Scatter(x=Cs, y=train_accuracy,name='train'),layout_title_text="tfidf accuracy")
fig.add_trace(
    go.Scatter(x=Cs, y=valid_accuracy,name='valid')
)
fig.add_trace(
    go.Scatter(x=Cs, y=test_accuracy,name='test')
)


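fig.update_layout(xaxis_title="C (inverse regularization strength)", yaxis_title="accuracy")
fig.update_xaxes(type="log")  # Cs is log-spaced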
fig.show()

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]
Liblinear failed to converge, increase the number of iterations.
[LibLinear]
Liblinear failed to converge, increase the number of iterations.
[LibLinear]
Liblinear failed to converge, increase the number of iterations.
[LibLinear]
Liblinear failed to converge, increase the number of iterations.
[LibLinear]
Liblinear failed to converge, increase the number of iterations.
[LibLinear]
Liblinear failed to converge, increase the number of iterations.



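# 59. Hyperparameter search
Train category classification models while varying the learning algorithm and its parameters. Find the learning algorithm and parameters that give the highest accuracy on the validation data. Then compute the accuracy on the test data with that learning algorithm and those parameters.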



### Search the parameters with RandomizedSearchCV
### The candidate models are random forest and logistic regression

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(max_depth=2, random_state=0)
random_forest_params= dict(n_estimators=range(20,300,50),criterion=['gini','entropy'])
random_forest_clf = RandomizedSearchCV(random_forest, random_forest_params, random_state=0,n_jobs=-1,verbose=1)

random_forest_search = random_forest_clf.fit(np.vstack([X_train_tfidf,X_valid_tfidf]), y_train+y_valid)

In [None]:
from scipy.stats import uniform  # continuous distribution to sample C from

logistic_regression = LogisticRegression(solver='saga', tol=1e-2, max_iter=200, random_state=0)
logistic_regression_params = dict(C=uniform(loc=0, scale=4),penalty=['l2', 'l1'])
logistic_regression_clf = RandomizedSearchCV(logistic_regression, logistic_regression_params, random_state=0,n_jobs=-1,verbose=1)
logistic_regression_search = logistic_regression_clf.fit(np.vstack([X_train_tfidf,X_valid_tfidf]), y_train+y_valid)
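
The two searches above score candidates by cross-validation over the concatenated training and validation data. If the held-out validation split itself should decide the winner, as the problem statement asks, one option is `PredefinedSplit`. The sketch below assumes synthetic data and a small grid over `C`; the names `X_tr`, `X_va`, and `test_fold` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-ins for the training and validation splits.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, y_tr = X[:240], y[:240]
X_va, y_va = X[240:], y[240:]

# -1 keeps a row in the training part of every split;
# 0 assigns it to the single validation fold.
test_fold = np.r_[np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)]

search = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=0),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=PredefinedSplit(test_fold),
    scoring="accuracy",
)
search.fit(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]))
print(search.best_params_, search.best_score_)
```

Here `best_score_` is the accuracy on the validation rows only. With the default `refit=True`, the best model is afterwards refitted on all rows passed to `fit`, that is, training and validation data together. The same `cv` argument can be given to `RandomizedSearchCV`.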



In [38]:

# Predict from the tf-idf test features, since the search was fitted on tf-idf features
accuracy_score(y_test,logistic_regression_search.predict(X_test_tfidf))

In [None]:
# Predict from the tf-idf test features, since the search was fitted on tf-idf features
accuracy_score(y_test,random_forest_search.predict(X_test_tfidf))