# Naive Bayes

## [ Naive Bayes ]

### 1. Naive Bayes

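* Formula (Bayes' theorem) : $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$
* Given the dependent variable (Y), the input variables (X) are all assumed to be conditionally independent => estimating the full joint conditional probability is computationally expensive, so independence is assumed
* The simplifying assumption is that the exact joint conditional probability of the predictors can be estimated well enough by the product of the individual conditional probabilities
* It is called Naive Bayes because it naively trusts the data set: the independence assumption is taken at face value
  * Compare the probability that y=1 when (x1=1, x2=1, x3=1), the probability that y=1 when (x1=1, x2=1, x3=0), and so on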
<img src="../Images/Machine_Learning/Naive_Bayes_1.JPG" width="600" height="200" title=""/>
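
The conditional independence assumption in the bullets above can be written out as a short derivation. This is a sketch of the standard form; the symbols $x_1, \dots, x_n$ and $y$ follow the notation of the bullets, and $n$ (the number of input variables) is introduced here only for illustration.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\,P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
$$

$$
P(x_1, \dots, x_n \mid y) \approx \prod_{i=1}^{n} P(x_i \mid y)
\quad\Rightarrow\quad
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator does not depend on $y$, so it can be dropped when comparing classes. For three binary inputs, a full table of $P(x_1, x_2, x_3 \mid y)$ needs $2^3 - 1 = 7$ free parameters per class, while the factorized form needs only 3.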


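* `Laplace Smoothing`
  * When counting, some combinations never occur => their probability becomes 0
  * To prevent a probability of 0, a minimum probability is assigned
  * $P(x|c) = \frac{\mathrm{count}(x,c)+1}{\sum_{x'} \mathrm{count}(x',c) + v}$
  * $v$ : the number of input variables (for text, the number of distinct words)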

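* Pros : works well when there are many variables; particularly strong on text data
* Cons : hard to handle rare events (very small probabilities); the conditional independence assumption itself is unrealistic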
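
The Laplace smoothing formula above can be tried on a small example. This is a minimal sketch, assuming a bag-of-words setting where $v$ is the vocabulary size; the function name `smoothed_likelihoods`, the `alpha` parameter, and the toy words are hypothetical and not part of the original notes.

```python
from collections import Counter

def smoothed_likelihoods(tokens_in_class, vocabulary, alpha=1):
    """Return P(x | c) for every word x in the vocabulary, with add-alpha smoothing."""
    counts = Counter(tokens_in_class)
    total = sum(counts[word] for word in vocabulary)  # sum of count(x', c) over the vocabulary
    v = len(vocabulary)                               # v in the formula above
    return {word: (counts[word] + alpha) / (total + alpha * v) for word in vocabulary}

# Hypothetical toy data: words observed in spam messages
vocabulary = ["free", "win", "hello", "meeting"]
spam_tokens = ["free", "win", "free", "hello"]

probs = smoothed_likelihoods(spam_tokens, vocabulary)
print(probs["free"])     # (2 + 1) / (4 + 4) = 0.375
print(probs["meeting"])  # (0 + 1) / (4 + 4) = 0.125, not 0
```

The word "meeting" never appears in the spam tokens, yet it keeps a small nonzero probability, so a single unseen word no longer forces the whole product to 0.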

---

## [ Naive Bayes : Spam Message Classification ]

### 1. Data

#### 1-1. Data Load

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2021)

In [9]:
spam = pd.read_csv("../Data/Naive_Bayes/sms_spam.csv")
spam.head()

| | type | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [10]:
text = spam["text"]
label = spam["type"]

In [11]:
label.value_counts()

ham     4827
spam     747
Name: type, dtype: int64

#### 1-2. Data Cleaning

* Convert `type` to numeric labels

In [13]:
mapping = {'ham':0, 'spam':1}
label=label.map(mapping)

label.value_counts()

0    4827
1     747
Name: type, dtype: int64

* Remove unnecessary special characters from `text`
  * `^` : negation (inside `[...]`, it matches any character that is not listed)

In [19]:
re_pattern = r"[^a-zA-Z0-9 ]"  # raw string; the backslash before the space was an invalid escape

In [20]:
print(text[0])
print(text.iloc[:1].str.replace(re_pattern, "", regex=True)[0])


Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat


In [22]:
text = text.str.replace(re_pattern, "", regex=True)
text[0]

'Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat'

* Convert to lowercase

In [23]:
text = text.str.lower()
text[0]

'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'
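
The two cleaning steps above (dropping special characters, then lowercasing) can be combined into one helper for plain strings. This is a minimal sketch using only the standard `re` module; the names `NON_ALNUM_SPACE` and `clean_text` are hypothetical, and the character class is the same one used in `re_pattern`.

```python
import re

# Same character class as re_pattern: anything that is not a letter, digit, or space
NON_ALNUM_SPACE = re.compile(r"[^a-zA-Z0-9 ]")

def clean_text(message):
    """Drop every character outside a-z, A-Z, 0-9 and space, then lowercase."""
    return NON_ALNUM_SPACE.sub("", message).lower()

print(clean_text("Ok lar... Joking wif u oni..."))
# ok lar joking wif u oni
```

A helper like this is convenient when a new raw message has to go through exactly the same cleaning before it is passed to the fitted vectorizer.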

#### 1-3. Data Split

In [24]:
from sklearn.model_selection import train_test_split

train_text, test_text, train_label, test_label = train_test_split(
    text, label, train_size=0.7, random_state=2021
)

In [25]:
print(f"train_data size: {len(train_label)}, {len(train_label)/len(text):.2f}")
print(f"test_data size: {len(test_label)}, {len(test_label)/len(text):.2f}")

train_data size: 3901, 0.70
test_data size: 1673, 0.30


### 2. Count Vectorize

* To train Naive Bayes, convert each sentence into word occurrence counts

#### 2-1. `nltk.word_tokenize` : Split a sentence into words

In [27]:
# pip install nltk


In [28]:
import nltk
from nltk import word_tokenize

nltk.download('punkt')


True

In [29]:
train_text.iloc[0]

'am only searching for good dual sim mobile pa'

In [30]:
word_tokenize(train_text.iloc[0])

['am', 'only', 'searching', 'for', 'good', 'dual', 'sim', 'mobile', 'pa']

#### 2-2. `CountVectorizer` : Turn words into count vectors

In [32]:
train_text.iloc[:2].values

array(['am only searching for good dual sim mobile pa',
       'excellent ill see what rileys plans are'], dtype=object)

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize)

cnt_vectorizer.fit(train_text.iloc[:2])


CountVectorizer(tokenizer=<function word_tokenize at 0x000001B5737D85E0>)

In [35]:
cnt_vectorizer.vocabulary_

{'am': 0,
 'only': 8,
 'searching': 12,
 'for': 4,
 'good': 5,
 'dual': 2,
 'sim': 14,
 'mobile': 7,
 'pa': 9,
 'excellent': 3,
 'ill': 6,
 'see': 13,
 'what': 15,
 'rileys': 11,
 'plans': 10,
 'are': 1}

In [37]:
vocab = sorted(cnt_vectorizer.vocabulary_.items(), key=lambda x: x[1])
vocab

[('am', 0),
 ('are', 1),
 ('dual', 2),
 ('excellent', 3),
 ('for', 4),
 ('good', 5),
 ('ill', 6),
 ('mobile', 7),
 ('only', 8),
 ('pa', 9),
 ('plans', 10),
 ('rileys', 11),
 ('searching', 12),
 ('see', 13),
 ('sim', 14),
 ('what', 15)]

In [38]:
vocab = list(map(lambda x: x[0], vocab))
vocab

['am',
 'are',
 'dual',
 'excellent',
 'for',
 'good',
 'ill',
 'mobile',
 'only',
 'pa',
 'plans',
 'rileys',
 'searching',
 'see',
 'sim',
 'what']

In [39]:
sample_cnt_vector = cnt_vectorizer.transform(train_text.iloc[:2]).toarray()
sample_cnt_vector

array([[1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]], dtype=int64)

In [40]:
train_text.iloc[:2].values

array(['am only searching for good dual sim mobile pa',
       'excellent ill see what rileys plans are'], dtype=object)

In [41]:
pd.DataFrame(sample_cnt_vector, columns=vocab)

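| | am | are | dual | excellent | for | good | ill | mobile | only | pa | plans | rileys | searching | see | sim | what |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |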
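
To see what the fit and transform steps above produce, the same count vectors can be rebuilt by hand. This is a minimal sketch in plain Python, not how scikit-learn implements `CountVectorizer` internally (which returns a sparse matrix); it assumes whitespace splitting as a stand-in for `word_tokenize`, and the helper names `fit_vocabulary` and `to_count_vector` are hypothetical.

```python
def fit_vocabulary(sentences):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({token for sentence in sentences for token in sentence.split()})
    return {token: index for index, token in enumerate(tokens)}

def to_count_vector(sentence, vocabulary):
    """Count how often each vocabulary token occurs; unknown tokens are skipped."""
    vector = [0] * len(vocabulary)
    for token in sentence.split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

sentences = [
    "am only searching for good dual sim mobile pa",
    "excellent ill see what rileys plans are",
]
vocabulary = fit_vocabulary(sentences)
for sentence in sentences:
    print(to_count_vector(sentence, vocabulary))
# [1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0]
# [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]
```

The two printed rows match the `sample_cnt_vector` output above, and a token that was never seen during fitting simply leaves the vector at all zeros.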


#### 2-3. Fit

* Instead of just 2 rows, now fit on the entire training data

In [42]:
cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize)
cnt_vectorizer.fit(train_text)



CountVectorizer(tokenizer=<function word_tokenize at 0x000001B5737D85E0>)

In [43]:
len(cnt_vectorizer.vocabulary_)

7908

#### 2-4. Prediction (transform)

In [44]:
train_matrix = cnt_vectorizer.transform(train_text)
test_matrix = cnt_vectorizer.transform(test_text)

* A word that is not in the fitted vocabulary is ignored (it contributes 0)

In [45]:
cnt_vectorizer.transform(["notavailblewordforcnt"]).toarray().sum()

0

### 3. Naive Bayes

* Naive Bayes model for classification

In [46]:
from sklearn.naive_bayes import BernoulliNB

naive_bayes = BernoulliNB()

# Count-vectorized sentences and their ground-truth labels
naive_bayes.fit(train_matrix, train_label)

BernoulliNB()
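
`BernoulliNB` treats each word as a present/absent feature: with its default `binarize=0.0`, any count above zero is reduced to 1, and the default `alpha=1.0` corresponds to Laplace smoothing. The sketch below shows the idea from scratch in plain Python. It is an illustration under those assumptions, not scikit-learn's actual implementation; the function names and the toy data are hypothetical.

```python
import math

def fit_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed P(word present | class) from count vectors."""
    n_features = len(rows[0])
    model = {}
    for cls in sorted(set(labels)):
        members = [row for row, label in zip(rows, labels) if label == cls]
        # Number of documents of this class in which each word appears at least once
        present = [sum(1 for row in members if row[j] > 0) for j in range(n_features)]
        # Two outcomes per feature (present/absent), hence 2 * alpha in the denominator
        probs = [(count + alpha) / (len(members) + 2 * alpha) for count in present]
        model[cls] = (math.log(len(members) / len(rows)), probs)
    return model

def predict_bernoulli_nb(model, row):
    """Return the class with the highest log posterior for one count vector."""
    scores = {}
    for cls, (log_prior, probs) in model.items():
        score = log_prior
        for value, p in zip(row, probs):
            # Absent words contribute log(1 - p), unlike in a multinomial model
            score += math.log(p) if value > 0 else math.log(1 - p)
        scores[cls] = score
    return max(scores, key=scores.get)

# Hypothetical toy count vectors over the vocabulary ["free", "win", "meeting"]
rows = [[2, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 2]]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham, as in the mapping above

model = fit_bernoulli_nb(rows, labels)
print(predict_bernoulli_nb(model, [1, 1, 0]))  # 1
print(predict_bernoulli_nb(model, [0, 0, 1]))  # 0
```

Working in log space avoids numerical underflow when thousands of per-word probabilities are multiplied, which matters with a vocabulary of several thousand words.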

In [47]:
train_pred = naive_bayes.predict(train_matrix)
test_pred = naive_bayes.predict(test_matrix)

In [48]:
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(train_label, train_pred)
test_acc = accuracy_score(test_label, test_pred)

print(f"Train Accuracy is {train_acc:.4f}")
print(f"Test Accuracy is {test_acc:.4f}")

Train Accuracy is 0.9854
Test Accuracy is 0.9767
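
Because ham makes up 4827 of the 5574 messages, accuracy alone can look high even for a trivial classifier, so it helps to compare against a majority-class baseline and to look at precision and recall for the spam class. The sketch below is plain Python; the baseline uses the label counts of the whole data set shown earlier (not the test split), and `y_true` / `y_pred` are hypothetical labels used only to show the calculation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (spam) class.

    Assumes at least one predicted and one actual positive, otherwise it divides by zero.
    """
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

# Baseline from the label counts shown earlier: always predict ham
ham, spam = 4827, 747
print(f"{ham / (ham + spam):.4f}")  # 0.8660

# Hypothetical labels, only to show the calculation
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
print(accuracy(y_true, y_pred))          # 0.75
print(precision_recall(y_true, y_pred))  # (0.6666666666666666, 0.6666666666666666)
```

Against a baseline of roughly 0.866, the reported test accuracy of 0.9767 is a clear improvement, but precision and recall on the spam class would show which kind of error remains.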
