<a href="https://colab.research.google.com/github/Fukkatsuso/livedoornews-topicmodel/blob/master/topicmodel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Model

## Goal
- Build a web article classifier using the livedoor news corpus

## Step
1. Download the livedoor news corpus
1. Preprocess the data
  1. Morphological analysis => **MeCab** (+NEologd)
  1. Remove unneeded words, unify word forms (stemming)
1. Build the topic model => **gensim**
1. Machine learning => **sklearn**

## References
- [Topic models with LDA using gensim: inferring user preferences from Qiita tags](https://qiita.com/shizuma/items/44c016812552ba8a8b88)
- [A rough understanding and quick trial of topic models](https://qiita.com/d-ogawa/items/c423cd4b01c6ed84a5e7)
- [Visualizing LDA with WordCloud and pyLDAvis](http://www.ie110704.net/2018/12/29/wordcloud%E3%81%A8pyldavis%E3%81%AB%E3%82%88%E3%82%8Blda%E3%81%AE%E5%8F%AF%E8%A6%96%E5%8C%96%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6/)
- [Learning topic models, the very basics of document classification with NLP](https://qiita.com/icoxfog417/items/7c944cb29dd7cdf5e2b1)
- [Classifying news articles with scikit-learn and gensim](https://qiita.com/yasunori/items/31a23eb259482e4824e2)
- [Getting started with NLP through document classification](https://colab.research.google.com/drive/1IMjc-RTesapfNCEh0TPmg_ce_qAcV95b#scrollTo=9a9CUjgUXgB6)
- [Types of preprocessing in NLP and their impact](https://qiita.com/Hironsan/items/2466fe0f344115aff177)
- [Python 3 × Japanese: a summary of NLP preprocessing](https://qiita.com/chamao/items/7edaba62b120a660657e)
- [Predicting news article categories with machine learning](https://qiita.com/hyo_07/items/ba3d53868b2f55ed9941)


## Data collection
### Target
- [livedoor news corpus](https://www.rondhuit.com/download.html#ldcc)
  - [トピックニュース](http://news.livedoor.com/category/vender/news/)
  - [Sports Watch](http://news.livedoor.com/category/vender/208/)
  - [ITライフハック](http://news.livedoor.com/category/vender/223/)
  - [家電チャンネル](http://news.livedoor.com/category/vender/kadench/)
  - [MOVIE ENTER](http://news.livedoor.com/category/vender/movie_enter/)
  - [独女通信](http://news.livedoor.com/category/vender/90/)
  - [エスマックス](http://news.livedoor.com/category/vender/smax/)
  - [livedoor HOMME](http://news.livedoor.com/category/vender/homme/)
  - [Peachy](http://news.livedoor.com/category/vender/ldgirls/)


In [0]:
# Get dataset
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz
!mkdir -p dataset/livedoor && tar xvzf ldcc-20140209.tar.gz -C dataset/livedoor --strip-components 1
!rm ldcc-20140209.tar.gz

# Install MeCab
!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3

# Install mecab-ipadic-NEologd
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -a

!pip install mojimoji

In [0]:
# Check the NEologd dictionary path
!echo `mecab-config --dicdir`"/mecab-ipadic-neologd"

## 1. LDA without preprocessing

In [0]:
import gensim
import glob2
import MeCab

mecab = MeCab.Tagger("-Owakati -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")

paths = glob2.glob("dataset/livedoor/sports-watch/*-*.txt")

In [0]:
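words = []
for path in paths:
  # Each file holds the URL, the timestamp, the title, then the body
  with open(path, 'r', encoding="utf-8") as f:
    data = f.read().split('\n')
  title = data[2]  # only the title is used here
  words.append(mecab.parse(title).split())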
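
The loop above relies on the file layout of the livedoor news corpus: the first line is the article URL, the second the timestamp, the third the title, and the body follows. A minimal sketch of that split, using a made-up article string (the URL and text are placeholders, and `parse_article` is a hypothetical helper, not part of this notebook):

```python
def parse_article(raw):
    """Split one corpus file into its URL, timestamp, title and body."""
    lines = raw.split('\n')
    return {
        'url': lines[0],
        'date': lines[1],
        'title': lines[2],
        'body': '\n'.join(lines[3:]).strip(),
    }

# Placeholder text in the corpus layout (not a real article)
sample = (
    "http://example.com/article/1/\n"
    "2012-01-01T00:00:00+0900\n"
    "Sample title\n"
    "First body line.\n"
    "Second body line.\n"
)
article = parse_article(sample)
print(article['title'])
```

Only `title` is used in this notebook; `body` would be the natural next input if titles alone turn out to be too short for stable topics.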

In [0]:
# Build the dictionary and the corpus
dictionary = gensim.corpora.Dictionary(words)

dictionary.save_as_text("dictionary1.dict.txt")

corpus = [dictionary.doc2bow(w) for w in words]
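
`doc2bow` turns each token list into a sparse bag-of-words: a list of `(token_id, count)` pairs. The sketch below imitates that idea with the standard library only. The id assignment is simplified for illustration and is not meant to match gensim's internal ordering; `build_vocab` and `to_bow` are hypothetical names.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(vocab, doc):
    """Count known tokens and return (token_id, count) pairs sorted by id."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [['sports', 'watch', 'sports'], ['movie', 'watch']]
vocab = build_vocab(docs)
bows = [to_bow(vocab, d) for d in docs]
print(bows)
```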

In [0]:
# LDA
topic_N = 10
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=topic_N, id2word=dictionary)

for i in range(topic_N):
  print('TOPIC:', i, '=>', lda.print_topic(i))

using symmetric alpha at 0.1
using symmetric eta at 0.1
using serial LDA version on this node
running online (single-pass) LDA training, 10 topics, 1 passes over the supplied corpus of 900 documents, updating model once every 900 documents, evaluating perplexity every 900 documents, iterating 50x with a convergence threshold of 0.001000
too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
-11.454 per-word bound, 2805.7 perplexity estimate based on a held-out corpus of 900 documents with 14123 words
PROGRESS: pass 0, at document #900/900
topic #0 (0.100): 0.053*"の" + 0.039*"】" + 0.039*"に" + 0.039*"【" + 0.038*"Watch" + 0.037*"Sports" + 0.033*"「" + 0.032*"」" + 0.030*"、" + 0.023*"が"
topic #4 (0.100): 0.036*"「" + 0.036*"」" + 0.032*"に" + 0.031*"が" + 0.024*"は" + 0.021*"の" + 0.017*"・" + 0.016*"Sports" + 0.016*"【" + 0.016*"Watch"
topic #2 (0.100): 0.036*"、" + 0.030*"Sports" + 0.030*"【" + 0.030*"Watch" + 0.029*"】" + 0.027*"「" + 

TOPIC: 0 => 0.053*"の" + 0.039*"】" + 0.039*"に" + 0.039*"【" + 0.038*"Watch" + 0.037*"Sports" + 0.033*"「" + 0.032*"」" + 0.030*"、" + 0.023*"が"
TOPIC: 1 => 0.022*"に" + 0.009*"Sports" + 0.009*"Watch" + 0.009*"【" + 0.009*"た" + 0.009*"、" + 0.008*"】" + 0.007*"”" + 0.007*"“" + 0.007*"ない"
TOPIC: 2 => 0.036*"、" + 0.030*"Sports" + 0.030*"【" + 0.030*"Watch" + 0.029*"】" + 0.027*"「" + 0.026*"」" + 0.022*"に" + 0.019*"を" + 0.014*"た"
TOPIC: 3 => 0.035*"の" + 0.027*"」" + 0.026*"「" + 0.025*"は" + 0.025*"Sports" + 0.024*"【" + 0.024*"Watch" + 0.024*"】" + 0.021*"、" + 0.021*"た"
TOPIC: 4 => 0.036*"「" + 0.036*"」" + 0.032*"に" + 0.031*"が" + 0.024*"は" + 0.021*"の" + 0.017*"・" + 0.016*"Sports" + 0.016*"【" + 0.016*"Watch"
TOPIC: 5 => 0.040*"】" + 0.040*"、" + 0.039*"Sports" + 0.039*"Watch" + 0.039*"【" + 0.038*"の" + 0.023*"は" + 0.022*"に" + 0.020*"「" + 0.020*"」"
TOPIC: 6 => 0.030*"・" + 0.028*"の" + 0.028*"は" + 0.021*"Sports" + 0.021*"、" + 0.021*"【" + 0.020*"】" + 0.020*"Watch" + 0.020*"に" + 0.012*"と"
TOPIC: 7 => 0.047*"、" + 0.
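
The training log above reports both a per-word bound and a perplexity estimate. The two are related: the logged perplexity is 2 raised to the negative of the bound. A quick check with the logged numbers (the bound is rounded to three decimals in the log, so the match is only approximate):

```python
per_word_bound = -11.454  # value taken from the training log
perplexity = 2 ** (-per_word_bound)
print(round(perplexity))  # close to the logged 2805.7
```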

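## 2. LDA with preprocessing

1. Normalization
  - Half-width kana => full-width kana
  - Full-width alphanumerics => half-width alphanumerics
  - Uppercase => lowercase
  - Unification via a dictionary?
1. Filtering by part of speech
1. Stop word removal
  - Dictionary
    - [SlothLib](http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt)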
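
The normalization and stop word steps above can be sketched with the standard library alone. This is not what the notebook itself uses (it relies on mojimoji): here Unicode NFKC normalization stands in for the width conversion, and because NFKC also rewrites other compatibility characters, results may differ in edge cases. The stop word set is a tiny made-up example.

```python
import unicodedata

def normalize(token):
    """Full-width alphanumerics -> half-width, half-width kana -> full-width, then lowercase."""
    return unicodedata.normalize('NFKC', token).lower()

def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stopwords]

stopwords = {'これ', 'それ'}  # hypothetical, for illustration only
tokens = ['ＩＰｈｏｎｅ', 'ｽﾏﾎ', 'これ', 'Android４']
cleaned = remove_stopwords([normalize(t) for t in tokens], stopwords)
print(cleaned)
```

Normalizing before the stop word lookup means the stop word list only needs to contain the normalized spelling of each word.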

In [0]:
import gensim
import glob2
import mojimoji
import MeCab
import urllib3

from sklearn.model_selection import train_test_split

mecab = MeCab.Tagger("mecabrc -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")

paths = glob2.glob("dataset/livedoor/*/*-*.txt")

# Split into training and evaluation sets
train_rate = 0.8
train_article_paths, test_article_paths = train_test_split(paths, train_size=train_rate, random_state=0)

In [0]:
# Number of training articles per category
cat = {}
for path in train_article_paths:
  c = path.split('/')[2]  # dataset/livedoor/<category>/<file>
  cat[c] = cat.get(c, 0) + 1  # the first article counts as 1, not 0

for c in cat:
  print(c, ':', cat[c])

In [0]:
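# Extract words from the parsed string, keeping only the given parts of speech
def extract_by_parts(parsed, parts):
  words = []
  lines = parsed.split('\n')
  for line in lines:
    feature = line.split('\t')
    if len(feature) == 2:
      info = feature[1].split(',')
      if info[0] in parts:
        if len(info) <= 6 or info[6] == '*':
          words.append(feature[0])  # no base form available: keep the surface form
        else:
          words.append(info[6])  # use the base form to absorb inflection and spelling variants
  return words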

# Preprocess each word
def preprocess_words(words, stopwords):
  words = [unify_chartype(word) for word in words]
  words = filter_stopwords(words, stopwords)
  return words

# Unify character types
def unify_chartype(text):
  text = mojimoji.zen_to_han(text, kana=False, digit=True, ascii=True) # full-width alphanumerics => half-width
  text = mojimoji.han_to_zen(text, kana=True, digit=False, ascii=False) # half-width kana => full-width
  text = text.lower() # uppercase => lowercase
  return text

# Remove stop words
def filter_stopwords(words, stopwords):
  filtered_words = [word for word in words if word not in stopwords]
  return filtered_words

# Return the list of stop words
def get_stopwords():
  # SlothLib
  slothlib_url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
  http = urllib3.PoolManager()
  res = http.request('GET', slothlib_url)
  stopwords = res.data.decode('utf-8').split()
  return stopwords
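
With its default output format, MeCab emits one line per token: the surface form, a tab, then comma-separated features. In the IPA dictionary layout the first feature is the part of speech and the seventh is the base form. The sketch below applies the same filtering idea as `extract_by_parts` to a hand-written sample in that layout, so it runs without MeCab installed. The sample string is illustrative rather than captured output, and `pick_words` is a hypothetical stand-in.

```python
# Hand-written sample in the IPA dictionary layout (not real MeCab output)
SAMPLE = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "走っ\t動詞,自立,*,*,五段・ラ行,連用タ接続,走る,ハシッ,ハシッ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)

def pick_words(parsed, parts):
    """Keep tokens whose part of speech is in `parts`, preferring the base form."""
    words = []
    for line in parsed.split('\n'):
        surface, sep, features = line.partition('\t')
        if not sep:
            continue  # "EOS" and empty lines carry no features
        info = features.split(',')
        if info[0] in parts:
            base = info[6] if len(info) > 6 else '*'
            words.append(surface if base == '*' else base)
    return words

print(pick_words(SAMPLE, ('名詞', '動詞')))
```

Here the inflected surface form 走っ is replaced by its base form 走る, which is how the notebook merges inflected variants of a word into a single vocabulary entry.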

In [0]:
def text2words(text, stopwords, parts):
  parsed_text = mecab.parse(text)
  words = extract_by_parts(parsed_text, parts)
  words = preprocess_words(words, stopwords)
  return words

stopwords = get_stopwords()

train_words = []
for path in train_article_paths:
  with open(path, 'r', encoding="utf-8") as f:
    data = f.read().split('\n')
  words = text2words(data[2], stopwords, ('名詞',))  # nouns only; the trailing comma makes this a tuple
  train_words.append(words)
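
`extract_by_parts` tests each part of speech with `in`, so the type of the `parts` argument matters. Parentheses alone do not make a tuple in Python: `('名詞')` is just the string `'名詞'`, and `in` on a string is a substring test. A short check:

```python
as_string = ('名詞')   # a plain str, despite the parentheses
as_tuple = ('名詞',)   # a one-element tuple

print(type(as_string).__name__, type(as_tuple).__name__)
print('詞' in as_string)  # substring test on a str
print('詞' in as_tuple)   # element test on a tuple
```

With a string, any feature value that happens to be a substring of `'名詞'` would be accepted, so passing a real tuple is the safer choice.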

In [0]:
def words2corpus(dictionary, words):
  return [dictionary.doc2bow(w) for w in words]

# Build the dictionary and the corpus
dictionary = gensim.corpora.Dictionary(train_words)
dictionary.filter_extremes(no_below=5, no_above=0.075)
dictionary.save_as_text("dictionary2.dict.txt")

train_corpus = words2corpus(dictionary, train_words)
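
`filter_extremes(no_below=5, no_above=0.075)` prunes the vocabulary by document frequency: tokens that appear in fewer than 5 documents, or in more than 7.5% of all documents, are dropped. A standard-library sketch of that rule on toy data follows. The thresholds are changed to suit the tiny corpus, and gensim's own rounding of the upper limit and its `keep_n` cap are not reproduced; `filter_by_doc_freq` is a hypothetical name.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within both limits."""
    # set(doc) ensures a token is counted once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    return {
        token for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }

docs = [
    ['sports', 'watch', 'player'],
    ['sports', 'watch', 'golf'],
    ['sports', 'watch', 'player'],
    ['sports', 'watch', 'movie'],
]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In the toy corpus, `sports` and `watch` occur in every document and are removed by the upper limit, much like the `Sports` and `Watch` tokens that dominate every topic in section 1.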

In [8]:
# LDA
topic_N = 10
lda = gensim.models.ldamodel.LdaModel(corpus=train_corpus, num_topics=topic_N, id2word=dictionary)

for i in range(topic_N):
  print('TOPIC:', i, '=>', lda.print_topic(i))

TOPIC: 0 => 0.031*"公開" + 0.021*"登場" + 0.016*"写真" + 0.015*"レポート" + 0.013*"d" + 0.012*"deji" + 0.011*"スマホ" + 0.011*"対応" + 0.010*"バッテリー" + 0.009*"smartphone"
TOPIC: 1 => 0.021*"nttドコモ" + 0.021*"smartphone" + 0.020*"向け" + 0.020*"搭載" + 0.020*"d" + 0.016*"発表" + 0.014*"au" + 0.014*"更新" + 0.014*"提供開始" + 0.014*"対応"
TOPIC: 2 => 0.104*"watch" + 0.104*"sports" + 0.013*"理由" + 0.010*"氏" + 0.009*"選手" + 0.009*"の" + 0.009*"ノムさん" + 0.008*"年収" + 0.007*"楽天" + 0.007*"端末"
TOPIC: 3 => 0.052*"独女" + 0.018*"特集" + 0.015*"写真" + 0.014*"女子" + 0.013*"ゴルフ" + 0.013*"vol" + 0.012*"d" + 0.011*"結婚" + 0.010*"動画" + 0.010*"cafe"
TOPIC: 4 => 0.027*"アプリ" + 0.018*"さ" + 0.013*"家電" + 0.013*"虎の巻" + 0.010*"非難" + 0.009*"akb" + 0.009*"便利" + 0.009*"deji" + 0.008*"予定" + 0.008*"ニュース"
TOPIC: 5 => 0.068*"映画" + 0.023*"オトナ女子" + 0.014*"まとめ" + 0.014*"週末" + 0.014*"読み" + 0.011*"終了" + 0.011*"クリスマス" + 0.010*"決定" + 0.009*"プレゼント" + 0.007*"時代"
TOPIC: 6 => 0.025*"の" + 0.024*"iphone" + 0.021*"女子" + 0.018*"プレゼント" + 0.016*"deji" + 0.016*"終了" + 0.016*"g

## Classification

In [9]:
# Test data
test_words = []
for path in test_article_paths:
  with open(path, 'r', encoding="utf-8") as f:
    data = f.read().split('\n')
  words = text2words(data[2], stopwords, ('名詞',))  # nouns only; the trailing comma makes this a tuple
  test_words.append(words)

test_corpus = words2corpus(dictionary, test_words)

for i in range(len(test_article_paths)):
  category = test_article_paths[i].split('/')[2]
  title = open(test_article_paths[i], 'r', encoding="utf-8").read().split('\n')[2]
  print(category, '\n', title, '\n', lda[test_corpus[i]], '\n')

Streaming output truncated to the last 5000 lines.
 応援する？ ムカつく？ あなたの近くの社内恋愛 
 [(0, 0.05), (1, 0.05), (2, 0.05), (3, 0.05), (4, 0.5499877), (5, 0.050012343), (6, 0.05), (7, 0.05), (8, 0.05), (9, 0.05)] 

smax 
 NTTドコモ、Xi対応Android 4.0 ICS搭載スマートフォン「MEDIAS X N-07D」を発表！1.5GHzデュアルコアCPUや4.3インチHDディスプレイ、おサイフ、ワンセグ、赤外線、防水、NOTTV 
 [(1, 0.9185209), (7, 0.05184615)] 

topic-news 
 「殺人行為に等しい」の声も…てんかん患者の免許取得に賛否 
 [(0, 0.025000842), (1, 0.025000464), (2, 0.025001125), (3, 0.025000747), (4, 0.2901549), (5, 0.025001513), (6, 0.025001729), (7, 0.5098374), (8, 0.025001302), (9, 0.025)] 

peachy 
 ケイト・ミドルトン ニュープリンセス誕生の軌跡 / ロイヤルウェディング特集 
 [(0, 0.020000074), (1, 0.2858732), (2, 0.020000057), (3, 0.34202352), (4, 0.23209684), (5, 0.020000044), (6, 0.020000165), (7, 0.019999998), (8, 0.020000432), (9, 0.020005645)] 

peachy 
 化粧品をすぐ変えるとはやく老ける／意外な飲み物で楽ヤセなど−【ビューティー】週間ランキング 
 [(0, 0.02500004), (1, 0.025000092), (2, 0.025003674), (3, 0.025000395), (4, 0.025000516), (5, 0.025001613), (6, 0.02500011), (7, 0.025019782), (8
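
Each printed result is a sparse list of `(topic_id, probability)` pairs. gensim omits topics whose probability falls below its `minimum_probability` threshold, which is why some articles list only a few topics. To use these results as fixed-length features, or to pick a single topic per article, the list can be expanded as sketched below. The input is copied from the smax example above; `to_dense` and `top_topic` are hypothetical helpers.

```python
def to_dense(topic_dist, num_topics):
    """Expand sparse (topic_id, prob) pairs into a fixed-length list."""
    vec = [0.0] * num_topics
    for topic_id, prob in topic_dist:
        vec[topic_id] = prob
    return vec

def top_topic(topic_dist):
    """Return the id of the most probable topic."""
    return max(topic_dist, key=lambda pair: pair[1])[0]

dist = [(1, 0.9185209), (7, 0.05184615)]  # from the smax article above
features = to_dense(dist, num_topics=10)
print(top_topic(dist), features)
```

A dense vector of this kind, one per article, is the sort of input a scikit-learn classifier could take for the machine learning step listed at the top.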