In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 13)

# Naive Bayes
---
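A model that infers the class of a sample from the distribution of its features, based on Bayes' theorem. It is used for document classification and spam filtering.  
There are several implementations, depending on the distribution assumed for the features.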

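Let $p( C_{j} )$ be the probability that a sample (document) belongs to class (category) $C_{j}$, that is, the proportion of all documents that belong to $C_{j}$, and let $p( F_{1} |C_{j} ) ,\ p( F_{2} |C_{j} ) ,\ \dotsc p( F_{k} |C_{j} )$ be the probabilities that the features (words) $F_{1} ,\ F_{2} ,\ \dotsc F_{k}$ appear in a sample belonging to class $C_{j}$. Given a document that contains the features (words) $F_{1} ,\ F_{2} ,\ \dotsc F_{k}$, the goal is to choose the $j$ that maximizes $p( C_{j} |F_{1} ,\ F_{2} ,\ \dotsc F_{k})$, the probability that the document belongs to class (category) $C_{j}$.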

Assuming that $F_{1} ,\ F_{2} ,\ \dotsc F_{k}$ are conditionally independent given the class, the desired $j$ is the one that maximizes ${\displaystyle p( C_{j})\prod ^{k}_{i=1} p( F_{i} |C_{j})}$.

Proof

---

The derivation uses [Bayes' theorem](../beginner/market_basket_analysis.ipynb#%E3%83%99%E3%82%A4%E3%82%BA%E3%81%AE%E5%AE%9A%E7%90%86-(Bayes'-theorem)) $
{\displaystyle p( B|A) =\frac
    {p( A|B)\cdot p( B)}
    {p( A)}
}
$ and the definition of probabilistic independence, $p( A|B) =p( A)$.

The probability that a sample with features $F_{1} ,\ F_{2} ,\ \dotsc F_{k}$ belongs to class $C$ can be written as

$p( C|F_{1} ,\ F_{2} ,\ \dotsc F_{k})$

Applying Bayes' theorem to this gives

$
{\displaystyle p( C|F_{1} ,\ F_{2} ,\ \dotsc F_{k}) =\frac
    {p( C) \cdot p( F_{1} ,\ F_{2} ,\ \dotsc F_{k} |C)}
    {p( F_{1} ,\ F_{2} ,\ \dotsc F_{k})}
}
$

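The denominator $p( F_{1} ,\ F_{2} ,\ \dotsc F_{k})$ is a constant (the same for every class), so it can be ignored when we only need the class with the highest probability. We can therefore write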

$p( C|F_{1} ,\ F_{2} ,\ \dotsc F_{k}) \varpropto p( C) \cdot p( F_{1} ,\ F_{2} ,\ \dotsc F_{k} |C) \ $

where $\varpropto $ denotes proportionality.

By the chain rule that follows from the definition of conditional probability ( $p( X,\ Y|Z) =p( X|Z) \cdot p( Y|Z,\ X)$ ), the right-hand side becomes

$
\begin{align}
    p( C) \cdot p( F_{1} ,\ F_{2} ,\ \dotsc F_{k} |C) \  & =p( C) \cdot p( F_{1} |C) \cdot p( F_{2} ,\ F_{3} ,\ \dotsc F_{k} |C,\ F_{1})\\
     & =p( C) \cdot p( F_{1} |C) \cdot p( F_{2} |C,\ F_{1}) \cdot p( F_{3} ,\ F_{4} ,\ \dotsc F_{k} |C,\ F_{1} ,\ F_{2})\\
     & \vdots \\
     & =p( C) \cdot p( F_{1} |C) \cdot p( F_{2} |C,\ F_{1}) \cdot \dotsc p( F_{k} |C,\ F_{1} ,\ F_{2} ,\ \dotsc F_{k-1})
\end{align}
$

Now assume that the features $F_{1} ,\ F_{2} ,\ \dotsc F_{k}$ are conditionally independent given the class $C$, that is, $p( F_{i} |C,\ F_{1} ,\ \dotsc F_{i-1}) =p( F_{i} |C)$ for every $i$. Then

$
\begin{align}
    p( C) \cdot p( F_{1} ,\ F_{2} ,\ \dotsc F_{k} |C) & =p( C) \cdot p( F_{1} |C) \cdot p( F_{2} |C,\ F_{1}) \cdot \dotsc p( F_{k} |C,\ F_{1} ,\ F_{2} ,\ \dotsc F_{k-1})\\
     & =p( C) \cdot p( F_{1} |C) \cdot p( F_{2} |C) \cdot \dotsc p( F_{k} |C)\\
     & =p( C) \cdot \prod ^{k}_{i=1} p( F_{i} |C)
\end{align}
$

---
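
As a concrete check of the decision rule derived above, here is a minimal sketch that scores two classes by hand. The class names, priors, and likelihoods are hypothetical toy numbers chosen for illustration, not values from any dataset. The sketch sums logarithms instead of multiplying probabilities, which is a common way to avoid underflow when $k$ is large and does not change which class wins.

```python
import math

# Hypothetical toy numbers, for illustration only.
priors = {"spam": 0.4, "ham": 0.6}                 # p(C_j)
likelihoods = {                                     # p(F_i | C_j)
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.05, "meeting": 0.20},
}
features = ["free", "meeting"]                      # features observed in the document


def log_score(label):
    """Return log p(C_j) + sum_i log p(F_i | C_j) for one class."""
    return math.log(priors[label]) + sum(
        math.log(likelihoods[label][f]) for f in features
    )


scores = {label: log_score(label) for label in priors}
predicted = max(scores, key=scores.get)

# Dividing by the sum over classes recovers the posterior p(C_j | F_1, ..., F_k).
# The divisor is the same for every class, so it does not change the argmax.
total = sum(math.exp(s) for s in scores.values())
posteriors = {label: math.exp(s) / total for label, s in scores.items()}

print(predicted, posteriors)
```

With these numbers the unnormalized scores are $0.4 \cdot 0.30 \cdot 0.02 = 0.0024$ for `spam` and $0.6 \cdot 0.05 \cdot 0.20 = 0.006$ for `ham`, so `ham` is selected.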

## Running Naive Bayes in Python
---
scikit-learn provides several implementations, depending on which probability distribution $p( F|C)$ is assumed to follow.

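### Bernoulli distribution
---
When features are binary, such as 0/1 or True/False, use `sklearn.naive_bayes.BernoulliNB`.  
When building features with `sklearn.feature_extraction.text.CountVectorizer`, either pass `binary=True` to the vectorizer or set the `binarize` argument of `sklearn.naive_bayes.BernoulliNB` to a suitable threshold (values less than or equal to `binarize` become 0, larger values become 1).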
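
The two options can be compared on a small example. The sketch below uses a hypothetical four-document corpus (not the newsgroups data) and fits one model on features binarized by the vectorizer and another on raw counts thresholded by `BernoulliNB` itself. Under these assumptions both routes should yield the same fitted parameters.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus, for illustration only.
docs = ["spam spam offer", "offer meeting", "meeting agenda agenda", "spam offer offer"]
labels = ["spam", "ham", "ham", "spam"]

# Option 1: binarize while vectorizing, and tell BernoulliNB the input is already binary.
X_binary = CountVectorizer(binary=True).fit_transform(docs)
model_a = BernoulliNB(binarize=None).fit(X_binary, labels)

# Option 2: keep raw counts and let BernoulliNB threshold them (count > 0.0 becomes 1).
X_counts = CountVectorizer().fit_transform(docs)
model_b = BernoulliNB(binarize=0.0).fit(X_counts, labels)

# Compare the learned per-class feature log probabilities.
same_parameters = np.allclose(model_a.feature_log_prob_, model_b.feature_log_prob_)
print(same_parameters)
```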

In [2]:
remove = ('headers', 'footers', 'quotes')

loader = fetch_20newsgroups(subset='all', random_state=1234, remove=remove)
news = pd.DataFrame(dict(document=loader.data, category=loader.target))
news['category'] = pd.Categorical.from_codes(news['category'],
                                             categories=loader.target_names)
print('news')
display(news)

news


| | document | category |
|---|---|---|
| 0 | \n\n\nLikewise for me please. First time I've ... | comp.graphics |
| 1 | Sorry, but I just wanted to be the first hypoc... | talk.politics.misc |
| ... | ... | ... |
| 18844 | \nPut up or shut up. Where is your evidence?\n... | talk.politics.misc |
| 18845 | \n\n\n\nWe're looking at a series of chips by ... | sci.electronics |


In [3]:
help(BernoulliNB)

Help on class BernoulliNB in module sklearn.naive_bayes:

class BernoulliNB(_BaseDiscreteNB)
 |  BernoulliNB(*, alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
 |  
 |  Naive Bayes classifier for multivariate Bernoulli models.
 |  
 |  Like MultinomialNB, this classifier is suitable for discrete data. The
 |  difference is that while MultinomialNB works with occurrence counts,
 |  BernoulliNB is designed for binary/boolean features.
 |  
 |  Read more in the :ref:`User Guide <bernoulli_naive_bayes>`.
 |  
 |  Parameters
 |  ----------
 |  alpha : float, default=1.0
 |      Additive (Laplace/Lidstone) smoothing parameter
 |      (0 for no smoothing).
 |  
 |  binarize : float or None, default=0.0
 |      Threshold for binarizing (mapping to booleans) of sample features.
 |      If None, input is presumed to already consist of binary vectors.
 |  
 |  fit_prior : bool, default=True
 |      Whether to learn class prior probabilities or not.
 |      If false, a uniform prior wil

In [4]:
# The text is noisy, so keep only tokens made of two or more letters (a-z)
# that appear in at least 20 documents (min_df=20)
binary_vectorizer = CountVectorizer(stop_words='english',
                                    token_pattern='(?u)\\b[a-z][a-z]+\\b',
                                    min_df=20,
                                    binary=True)
binary = binary_vectorizer.fit_transform(news['document'])
news_binary = pd.DataFrame.sparse.from_spmatrix(
    binary, columns=binary_vectorizer.get_feature_names_out())
news_binary

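| | aa | aaa | aaron | ab | abandon | abandoned | ... | zionist | zionists | zip | zone | zoom | zx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18844 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 18845 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 |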


In [5]:
bernoulli = BernoulliNB()
bernoulli.fit(news_binary, news['category'])
prediction_binary = bernoulli.predict(news_binary)
pd.DataFrame(dict(category=news['category'], prediction=prediction_binary))

| | category | prediction |
|---|---|---|
| 0 | comp.graphics | rec.autos |
| 1 | talk.politics.misc | rec.motorcycles |
| ... | ... | ... |
| 18844 | talk.politics.misc | rec.motorcycles |
| 18845 | sci.electronics | comp.sys.ibm.pc.hardware |


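### Multinomial distribution (Multinomial Naive Bayes)
---
When features represent quantities (occurrence counts) rather than just presence or absence, use `sklearn.naive_bayes.MultinomialNB`.  
The binomial distribution is the distribution of the number of successes in $n$ Bernoulli trials, such as coin tosses. The multinomial distribution is the distribution of how many times each outcome occurs in $n$ trials that have more than two possible outcomes, such as dice rolls.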
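
To make the difference between the two distributions concrete, the following sketch draws one sample from each with NumPy. The fair coin, the fair die, and the choice of 10 trials are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # number of trials

# Binomial: number of heads in n tosses of a fair coin (a single integer).
heads = rng.binomial(n, 0.5)

# Multinomial: how often each face appears in n rolls of a fair die
# (one count per face, and the counts always sum to n).
face_counts = rng.multinomial(n, [1 / 6] * 6)

print(heads, face_counts, face_counts.sum())
```

In the text setting, each word in the vocabulary plays the role of a die face and a document's word counts play the role of `face_counts`.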

In practice it works not only with raw word counts but also with weighted values such as TF-IDF.
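
A minimal sketch of this point, using a hypothetical four-document corpus rather than the newsgroups data: the same `MultinomialNB` is fitted once on raw counts and once on TF-IDF weights, and both models classify a new sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, for illustration only.
docs = [
    "goal goal match team",
    "team match referee",
    "election vote vote policy",
    "policy election debate",
]
labels = ["sport", "sport", "politics", "politics"]

predictions = {}
for name, vectorizer in [("counts", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    X = vectorizer.fit_transform(docs)            # integer counts or fractional weights
    model = MultinomialNB().fit(X, labels)
    new_doc = vectorizer.transform(["vote in the election"])
    predictions[name] = model.predict(new_doc)[0]

print(predictions)
```

Words that were not seen during fitting ("in", "the") are simply dropped by `transform`, so only "vote" and "election" influence the prediction.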

In [6]:
help(MultinomialNB)

Help on class MultinomialNB in module sklearn.naive_bayes:

class MultinomialNB(_BaseDiscreteNB)
 |  MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)
 |  
 |  Naive Bayes classifier for multinomial models
 |  
 |  The multinomial Naive Bayes classifier is suitable for classification with
 |  discrete features (e.g., word counts for text classification). The
 |  multinomial distribution normally requires integer feature counts. However,
 |  in practice, fractional counts such as tf-idf may also work.
 |  
 |  Read more in the :ref:`User Guide <multinomial_naive_bayes>`.
 |  
 |  Parameters
 |  ----------
 |  alpha : float, default=1.0
 |      Additive (Laplace/Lidstone) smoothing parameter
 |      (0 for no smoothing).
 |  
 |  fit_prior : bool, default=True
 |      Whether to learn class prior probabilities or not.
 |      If false, a uniform prior will be used.
 |  
 |  class_prior : array-like of shape (n_classes,), default=None
 |      Prior probabilities of the classes

In [7]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(news['document'])
news_tfidf = pd.DataFrame.sparse.from_spmatrix(
    tfidf, columns=tfidf_vectorizer.get_feature_names_out())
news_tfidf

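| | 00 | 000 | 0000 | 00000 | 000000 | 00000000 | ... | zzzoh | zzzzzz | zzzzzzt | ³ation | ýé | ÿhooked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18844 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18845 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |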


In [8]:
multinomial = MultinomialNB()
multinomial.fit(news_tfidf, news['category'])
prediction_tfidf = multinomial.predict(news_tfidf)
pd.DataFrame(dict(category=news['category'], prediction=prediction_tfidf))

| | category | prediction |
|---|---|---|
| 0 | comp.graphics | comp.graphics |
| 1 | talk.politics.misc | rec.sport.hockey |
| ... | ... | ... |
| 18844 | talk.politics.misc | talk.politics.misc |
| 18845 | sci.electronics | sci.electronics |


### Normal distribution
---
When the features are continuous and assumed to follow a normal distribution within each class, use `sklearn.naive_bayes.GaussianNB`.
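
The notebook does not fit a `GaussianNB` model, so here is a minimal sketch of how it might be used. It assumes the iris dataset bundled with scikit-learn as a stand-in for continuous features. The 70/30 split and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Four continuous measurements per flower, three classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

gaussian = GaussianNB().fit(X_train, y_train)
accuracy = gaussian.score(X_test, y_test)   # mean accuracy on the held-out split

# Per-class, per-feature means estimated during fitting: shape (3 classes, 4 features).
class_means = gaussian.theta_

print(accuracy, class_means.shape)
```

Each class gets its own mean and variance per feature, which is the Gaussian counterpart of the per-class word probabilities used by the discrete variants.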

In [9]:
help(GaussianNB)

Help on class GaussianNB in module sklearn.naive_bayes:

class GaussianNB(_BaseNB)
 |  GaussianNB(*, priors=None, var_smoothing=1e-09)
 |  
 |  Gaussian Naive Bayes (GaussianNB)
 |  
 |  Can perform online updates to model parameters via :meth:`partial_fit`.
 |  For details on algorithm used to update feature means and variance online,
 |  see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:
 |  
 |      http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
 |  
 |  Read more in the :ref:`User Guide <gaussian_naive_bayes>`.
 |  
 |  Parameters
 |  ----------
 |  priors : array-like of shape (n_classes,)
 |      Prior probabilities of the classes. If specified the priors are not
 |      adjusted according to the data.
 |  
 |  var_smoothing : float, default=1e-9
 |      Portion of the largest variance of all features that is added to
 |      variances for calculation stability.
 |  
 |      .. versionadded:: 0.20
 |  
 |  Attributes
 |  ----------
 |  c

## Recommended reading
---
- [見て試してわかる機械学習アルゴリズムの仕組み 機械学習図鑑](https://www.amazon.co.jp/%E8%A6%8B%E3%81%A6%E8%A9%A6%E3%81%97%E3%81%A6%E3%82%8F%E3%81%8B%E3%82%8B%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%82%A2%E3%83%AB%E3%82%B4%E3%83%AA%E3%82%BA%E3%83%A0%E3%81%AE%E4%BB%95%E7%B5%84%E3%81%BF-%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E5%9B%B3%E9%91%91-%E7%A7%8B%E5%BA%AD-%E4%BC%B8%E4%B9%9F/dp/4798155659/)
- [Pythonではじめる機械学習 ―scikit-learnで学ぶ特徴量エンジニアリングと機械学習の基礎](https://www.amazon.co.jp/Python%E3%81%A7%E3%81%AF%E3%81%98%E3%82%81%E3%82%8B%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92-%E2%80%95scikit-learn%E3%81%A7%E5%AD%A6%E3%81%B6%E7%89%B9%E5%BE%B4%E9%87%8F%E3%82%A8%E3%83%B3%E3%82%B8%E3%83%8B%E3%82%A2%E3%83%AA%E3%83%B3%E3%82%B0%E3%81%A8%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E5%9F%BA%E7%A4%8E-Andreas-C-Muller/dp/4873117984/)