## Naive Bayes 


#### If events A and B are independent, the probability that both occur is
$P(A \cap B) = P(A) \times P(B)$
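
The product rule can be checked by brute-force enumeration. The sketch below assumes two fair dice and two illustrative events chosen here (they are not from the original notes); `prob` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event, given as a predicate on (die1, die2)."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

a = lambda o: o[0] % 2 == 0      # A: first die is even
b = lambda o: o[1] >= 5          # B: second die shows 5 or 6

p_a = prob(a)
p_b = prob(b)
p_a_and_b = prob(lambda o: a(o) and b(o))

print(p_a, p_b, p_a_and_b)       # 1/2 1/3 1/6
print(p_a_and_b == p_a * p_b)    # True
```

The two events depend on different dice, so knowing one tells you nothing about the other, which is exactly what independence means.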

#### Bayes' theorem

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$
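
The theorem follows from the definition of conditional probability. A short derivation, assuming $P(A) > 0$ and $P(B) > 0$:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B|A) = \frac{P(A \cap B)}{P(A)}$$

$$P(A \cap B) = P(B|A) \times P(A) \quad\Rightarrow\quad P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$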

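#### Probability of having cancer given a positive test

$$P(\text{cancer} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{cancer}) \times P(\text{cancer})}{P(\text{positive})}$$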

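#### Example
- Cancer prevalence: 0.001% (1 in 100,000)
- Test accuracy: 99%

**A positive result can arise in two ways**
- Has cancer & the test is correct
- Does not have cancer & the test is wrong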
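
These two cases are mutually exclusive, so the denominator of Bayes' theorem is their sum. Written out, under the assumption that "99% accuracy" means the test is right 99% of the time for both sick and healthy people (the notes do not distinguish the two):

$$P(\text{positive}) = P(\text{positive} \mid \text{cancer}) \times P(\text{cancer}) + P(\text{positive} \mid \text{no cancer}) \times P(\text{no cancer})$$

$$= 0.99 \times 0.00001 + 0.01 \times 0.99999 = 0.0100098$$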

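In [5]:
p_disease = 1/100000   # prior: P(cancer) = 0.001%
p_correct = 0.99       # test accuracy, assumed equal for sick and healthy people

# Case 1: has cancer and the test is correct (true positive)
p_disease_and_correct = p_disease * p_correct

# Case 2: no cancer and the test is wrong (false positive)
p_no_disease_and_incorrect = (1 - p_disease) * (1 - p_correct)

print('p_disease_and_correct:', p_disease_and_correct)
print('p_no_disease_and_incorrect:', p_no_disease_and_incorrect)

p_disease_and_correct: 9.9e-06
p_no_disease_and_incorrect: 0.00999990000000001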


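In [6]:
p_disease = 1/100000
p_correct = 0.99

# P(positive | cancer): the test is correct on someone who has cancer
p_positive_given_disease = p_correct

# P(positive) = P(cancer) * P(test correct) + P(no cancer) * P(test wrong)
p_positive = p_disease * p_correct + (1 - p_disease) * (1 - p_correct)

# P(cancer | positive) = P(positive | cancer) * P(cancer) / P(positive)
# The result is only about 0.1%: the disease is so rare that false positives
# far outnumber true positives, even with a 99% accurate test.
p_disease_given_positive = p_positive_given_disease * p_disease / p_positive
print(round(p_disease_given_positive, 6))

0.000989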
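
To see how strongly the answer depends on how common the disease is, the same calculation can be wrapped in a function. This is a minimal sketch: `posterior_given_positive` and its parameter names are hypothetical, and it splits the single "accuracy" figure into sensitivity and specificity, which the example above assumes are both 99%.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem.

    prior:       P(disease)
    sensitivity: P(positive | disease)
    specificity: P(negative | no disease)
    """
    true_positive = sensitivity * prior
    false_positive = (1 - specificity) * (1 - prior)
    return true_positive / (true_positive + false_positive)

# Same test as in the example, applied to increasingly common diseases.
for prior in (1/100000, 1/1000, 1/100, 1/10):
    post = posterior_given_positive(prior, 0.99, 0.99)
    print(f'prior={prior:.5f}  posterior={post:.4f}')
```

With a prevalence of 1 in 100,000 the posterior is roughly 0.1%, while at 1 in 100 it reaches 50%: the prior matters as much as the quality of the test.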


## Sklearn

In [8]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

train_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset='train', shuffle=True, random_state=108)

# print(train_emails.target_names)
# print(train_emails.data[5])
# print(train_emails.target[5])
# print(train_emails.target_names[train_emails.target[5]])

test_emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey'], subset='test', shuffle=True, random_state=108)

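# Build the vocabulary. Note that this fits on the test and training text
# together, so the feature count below covers both sets; fitting on the
# training text alone is the usual practice and keeps the test data unseen.
counter = CountVectorizer()
counter.fit(test_emails.data + train_emails.data)

# Convert each document into a vector of word counts
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)
print(test_counts.shape)  # (number of documents, vocabulary size)

# Train on the training counts and report mean accuracy on the test set
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)
print(classifier.score(test_counts, test_emails.target))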

(796, 23714)
0.9723618090452262
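
To connect the classifier back to Bayes' theorem, here is a simplified pure-Python sketch of the idea behind multinomial Naive Bayes: score each class by its log prior plus the sum of smoothed log word likelihoods, treating words as independent given the class. It is not scikit-learn's implementation; the tiny corpus, the whitespace tokenization, and the `predict` helper are assumptions made for illustration. The smoothing value `alpha=1.0` matches the default of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus (hypothetical, for illustration only).
train_docs = [
    ('the pitcher threw a strike', 'baseball'),
    ('home run in the ninth inning', 'baseball'),
    ('the goalie stopped the puck', 'hockey'),
    ('a slap shot on the ice', 'hockey'),
]

# Count documents per class and word occurrences per class.
class_counts = Counter(label for _, label in train_docs)
word_counts = defaultdict(Counter)
for text, label in train_docs:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text, alpha=1.0):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label in class_counts:
        # log P(class)
        score = math.log(class_counts[label] / len(train_docs))
        total = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # log P(word | class) with add-alpha (Laplace) smoothing
            score += math.log((word_counts[label][word] + alpha)
                              / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict('the pitcher hit a home run'))  # baseball
print(predict('puck on the ice'))             # hockey
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a single unseen word from forcing a class probability to zero.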
