# Naive Bayes Classification

The Naive Bayes classifier is a simple yet effective probabilistic machine learning algorithm. It is based on Bayes' theorem and assumes that the features are conditionally independent given the class.

**1. Basic Principle: Bayes' Theorem**  

**Bayes' theorem:**  

$$
P(Y | X) = \frac{P(X | Y) \cdot P(Y)}{P(X)}
$$
  
**Explanation:**  
- $ P(Y | X) $ = Posterior probability (of class $Y$ given features $X$).  
- $ P(X | Y) $ = Likelihood (probability of features $X$ given class $Y$).  
- $ P(Y) $ = Prior probability (probability of class $Y$ before seeing any features).  
- $ P(X) $ = Evidence (overall probability of features $X$).  

**Goal:**  
Find the class $Y$ with the highest $P(Y|X)$. Because $P(X)$ is the same for every class, it is enough to compare $P(X|Y) \cdot P(Y)$ across classes.  
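To make the four terms concrete, here is a minimal sketch that applies Bayes' theorem to a two-class spam example. All numbers are invented for illustration and do not come from any dataset.

```python
# Hypothetical numbers: 30% of mail is spam, and the word "free"
# appears in 60% of spam but only 5% of non-spam ("ham") messages.
prior = {"spam": 0.3, "ham": 0.7}          # P(Y)
likelihood = {"spam": 0.6, "ham": 0.05}    # P(X | Y) for X = "message contains 'free'"

# Evidence P(X): total probability of observing X across all classes.
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Posterior P(Y | X) for each class, by Bayes' theorem.
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

# The predicted class is the one with the highest posterior.
best = max(posterior, key=posterior.get)
print(posterior, best)
```

The posteriors sum to 1 because they share the same evidence term, which is also why the evidence can be skipped when only the winning class is needed.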

**2. The "Naive" Principle (Feature Independence)**  

Naive Bayes assumes that the features are **conditionally independent of one another** given the class. For a class $A$ and features $X_1, X_2, ..., X_n$, Bayes' theorem reads:

$$
P(A | X_1, X_2, ..., X_n) = \frac{P(X_1, X_2, ..., X_n | A) \cdot P(A)}{P(X_1, X_2, ..., X_n)}
$$

Because the features are assumed **independent** given the class, the joint likelihood factorizes:

$$
P(X_1, X_2, ..., X_n | A) = P(X_1 | A) \cdot P(X_2 | A) \cdot ... \cdot P(X_n | A)
$$

Therefore:

$$
P(A | X_1, X_2, ..., X_n) = \frac{P(A) \cdot \prod_{i=1}^{n} P(X_i | A)}{P(X_1, X_2, ..., X_n)}
$$
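The factorized form can be sketched in a few lines. Since the denominator is identical for every class, the sketch compares only the numerators, and it sums logarithms instead of multiplying probabilities to avoid numerical underflow. The class names and probabilities are hypothetical.

```python
import math

# Hypothetical per-class priors P(A) and per-feature likelihoods P(X_i | A)
# for a single observation with three features.
priors = {"a": 0.5, "b": 0.5}
feature_likelihoods = {
    "a": [0.8, 0.1, 0.5],
    "b": [0.3, 0.4, 0.5],
}

def log_score(cls):
    """log P(A) + sum_i log P(X_i | A): the log of the numerator."""
    return math.log(priors[cls]) + sum(math.log(p) for p in feature_likelihoods[cls])

scores = {cls: log_score(cls) for cls in priors}

# Dividing by the sum of the numerators (the evidence) yields posteriors.
total = sum(math.exp(s) for s in scores.values())
posteriors = {cls: math.exp(s) / total for cls, s in scores.items()}

predicted = max(scores, key=scores.get)
print(predicted, posteriors)
```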

**3. Types of Naive Bayes**  

**a) Gaussian Naive Bayes** (for continuous numeric data)  
If a feature $ x $ follows a **normal (Gaussian) distribution** within each class $ C $, its likelihood is computed as:

$$
P(x | C) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}
$$

Where:
- $ \mu $ = Mean of the feature within the given class
- $ \sigma $ = Standard deviation of the feature within the given class
- $ x $ = Feature value
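As a minimal sketch, the Gaussian likelihood can be written as a plain function. The mean and standard deviation used below are hypothetical class statistics, not values estimated from the dataset.

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, i.e. P(x | C) in the formula above."""
    variance = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Hypothetical class statistics for one feature: mean 1.5, standard deviation 0.2.
at_mean = gaussian_likelihood(1.5, mu=1.5, sigma=0.2)
far_away = gaussian_likelihood(4.0, mu=1.5, sigma=0.2)
print(at_mean, far_away)
```

Note that the result is a probability density, so it can exceed 1 when $\sigma$ is small; only its relative size across classes matters for classification.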

**b) Multinomial Naive Bayes** (for count data)  
Used mainly for **text classification** based on word frequencies. The smoothed likelihood of a word $ X $ in class $ C $ is:

$$
P(X | C) = \frac{(N_{c, X} + \alpha)}{(N_c + \alpha \cdot d)}
$$

Where:
- $ N_{c, X} $ = Number of occurrences of word $ X $ in class $ C $
- $ N_c $ = Total number of words in class $ C $
- $ d $ = Number of unique words across all classes (vocabulary size)
- $ \alpha $ = Smoothing parameter ($ \alpha = 1 $ gives Laplace smoothing)
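The smoothing formula is easy to check by hand. The sketch below uses hypothetical word counts for a "spam" class and a four-word vocabulary; the function name is made up for this example.

```python
from collections import Counter

def smoothed_word_probability(word, class_counts, vocabulary_size, alpha=1.0):
    """(N_{c,X} + alpha) / (N_c + alpha * d) for one word in one class."""
    n_word = class_counts[word]              # N_{c,X}; a Counter returns 0 for unseen words
    n_total = sum(class_counts.values())     # N_c
    return (n_word + alpha) / (n_total + alpha * vocabulary_size)

# Hypothetical word counts for a "spam" class and a 4-word vocabulary.
spam_counts = Counter({"free": 3, "win": 2, "hello": 1})
vocabulary = ["free", "win", "hello", "meeting"]

p_free = smoothed_word_probability("free", spam_counts, len(vocabulary))
p_meeting = smoothed_word_probability("meeting", spam_counts, len(vocabulary))
print(p_free, p_meeting)
```

Without smoothing, the unseen word "meeting" would get probability zero and wipe out the whole product; with $\alpha = 1$ it gets a small positive probability instead.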

**c) Bernoulli Naive Bayes** (for binary data)  
Used when each feature has only two possible values (present/absent):

$$
P(X | C) = \prod_{i=1}^{n} P(X_i | C)^{x_i} \cdot (1 - P(X_i | C))^{(1 - x_i)}
$$

Here $ x_i \in \{0, 1\} $ indicates whether feature $ i $ is present, so absent features also contribute to the likelihood.
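A minimal sketch of the Bernoulli likelihood for a binary feature vector: each feature contributes $P(X_i | C)$ when it is present and $1 - P(X_i | C)$ when it is absent. The probabilities below are hypothetical.

```python
def bernoulli_likelihood(x, p):
    """Product over features of p_i^{x_i} * (1 - p_i)^{1 - x_i}."""
    result = 1.0
    for x_i, p_i in zip(x, p):
        result *= p_i if x_i == 1 else (1.0 - p_i)
    return result

# Hypothetical P(X_i = 1 | C) for three binary features.
p_given_class = [0.9, 0.2, 0.5]
value = bernoulli_likelihood([1, 0, 1], p_given_class)
print(value)
```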

**4. Advantages and Disadvantages**  

**Advantages:**  
- Simple and easy to implement.  
- Fast and efficient on large datasets.  
- Well suited to data with many features.  

**Disadvantages:**  
- The feature-independence assumption is often unrealistic.  
- Performance degrades when features depend on one another.  
- Zero probabilities need special handling (for example smoothing), because a feature value never seen with a class during training would otherwise zero out the whole product.  

**5. Example Applications of Naive Bayes**  

Naive Bayes is commonly used in areas such as:
- Spam detection
- Sentiment analysis
- Document classification
- Disease prediction


# Naive Bayes Implementation

Import Library

In [7]:
import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder, StandardScaler

Load Dataset

In [8]:
dataset = pd.read_csv('data_gabungan.csv')
dataset.head()

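| | id | class | petal_length | petal_width | sepal_length | sepal_width |
|---|---|---|---|---|---|---|
| 0 | 1 | Iris-setosa | 1.4 | 0.2 | 5.1 | 3.5 |
| 1 | 2 | Iris-setosa | 1.4 | 0.2 | 4.9 | 3.0 |
| 2 | 3 | Iris-setosa | 1.3 | 0.2 | 4.7 | 3.2 |
| 3 | 4 | Iris-setosa | 1.5 | 0.2 | 4.6 | 3.1 |
| 4 | 5 | Iris-setosa | 1.4 | 0.2 | 5.0 | 3.6 |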


Label Encoding

In [9]:
# Map the class names to integer codes, assigned in sorted order of the names
en = LabelEncoder()

dataset['class'] = en.fit_transform(dataset['class'])
dataset.head()

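| | id | class | petal_length | petal_width | sepal_length | sepal_width |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1.4 | 0.2 | 5.1 | 3.5 |
| 1 | 2 | 0 | 1.4 | 0.2 | 4.9 | 3.0 |
| 2 | 3 | 0 | 1.3 | 0.2 | 4.7 | 3.2 |
| 3 | 4 | 0 | 1.5 | 0.2 | 4.6 | 3.1 |
| 4 | 5 | 0 | 1.4 | 0.2 | 5.0 | 3.6 |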
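The integer codes can be understood without scikit-learn: `LabelEncoder` sorts the distinct labels and numbers them from 0. The sketch below reproduces that mapping in plain Python. Only "Iris-setosa" is visible in the preview above; the other two class names are assumed to be those of the standard Iris dataset.

```python
# Class names of the standard Iris dataset (only "Iris-setosa" is visible
# in the preview; the other two names are assumed).
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Sort the distinct labels and number them from 0, as LabelEncoder does.
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]
decoded = [classes[code] for code in encoded]   # the reverse mapping
print(classes, encoded)
```

With the fitted encoder from the cell above, `en.classes_` lists the names in code order and `en.inverse_transform` converts codes back to names.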


Separating Features and Labels

In [14]:
# Columns 2 onward are the four measurements (features); column 1 is the encoded class (label)
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values

In [15]:
X

array([[1.4, 0.2, 5.1, 3.5],
       [1.4, 0.2, 4.9, 3. ],
       [1.3, 0.2, 4.7, 3.2],
       [1.5, 0.2, 4.6, 3.1],
       [1.4, 0.2, 5. , 3.6],
       [1.7, 0.4, 5.4, 3.9],
       [1.4, 0.3, 4.6, 3.4],
       [1.5, 0.2, 5. , 3.4],
       [1.4, 0.2, 4.4, 2.9],
       [1.5, 0.1, 4.9, 3.1],
       [1.5, 0.2, 5.4, 3.7],
       [1.6, 0.2, 4.8, 3.4],
       [1.4, 0.1, 4.8, 3. ],
       [1.1, 0.1, 4.3, 3. ],
       [1.2, 0.2, 5.8, 4. ],
       [1.5, 0.4, 5.7, 4.4],
       [1.3, 0.4, 5.4, 3.9],
       [1.4, 0.3, 5.1, 3.5],
       [1.7, 0.3, 5.7, 3.8],
       [1.5, 0.3, 5.1, 3.8],
       [1.7, 0.2, 5.4, 3.4],
       [1.5, 0.4, 5.1, 3.7],
       [1. , 0.2, 4.6, 3.6],
       [1.7, 0.5, 5.1, 3.3],
       [1.9, 0.2, 4.8, 3.4],
       [1.6, 0.2, 5. , 3. ],
       [1.6, 0.4, 5. , 3.4],
       [1.5, 0.2, 5.2, 3.5],
       [1.4, 0.2, 5.2, 3.4],
       [1.6, 0.2, 4.7, 3.2],
       [1.6, 0.2, 4.8, 3.1],
       [1.5, 0.4, 5.4, 3.4],
       [1.5, 0.1, 5.2, 4.1],
       [1.4, 0.2, 5.5, 4.2],
       [1.5, 0

In [16]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [29]:
# Hold out 20% of the rows as a test set; random_state makes the shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

print("x_train shape: ", x_train.shape)
print("x_test shape: ", x_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

x_train shape:  (120, 4)
x_test shape:  (30, 4)
y_train shape:  (120,)
y_test shape:  (30,)
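The idea behind the split can be sketched in plain Python: shuffle the row indices with a fixed seed and set aside a fraction for testing. This is an illustrative stand-in, not scikit-learn's implementation, so it will not reproduce the same rows as `train_test_split`; the function name is made up for this example.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=123):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(150))   # stand-in for the 150 dataset rows
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

A purely random split does not preserve class balance: the classification report further down shows 13, 6, and 11 test rows per class. Passing `stratify=y` to `train_test_split` would keep the class proportions equal in both parts.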


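Data Normalization (Standardization)

In [None]:
# Fit the scaler on the training set only, then reuse its mean and standard deviation on the test set
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)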
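What the scaler does to a single column can be sketched as follows: compute the mean and (population) standard deviation on the training values, then apply $z = (x - \text{mean}) / \text{std}$ to both training and test values. The five training values are the petal lengths from the dataset preview; the two test values are hypothetical.

```python
import statistics

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using statistics learned from the training data."""
    return [(x - mean) / std for x in column]

# Petal lengths: the training values come from the dataset preview,
# the test values are hypothetical.
train_petal_length = [1.4, 1.4, 1.3, 1.5, 1.4]
test_petal_length = [1.7, 1.2]

mean, std = fit_scaler(train_petal_length)
train_scaled = transform(train_petal_length, mean, std)
test_scaled = transform(test_petal_length, mean, std)   # reuses the training mean and std
print(train_scaled, test_scaled)
```

Reusing the training statistics on the test set keeps information about the test data from leaking into preprocessing.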

In [19]:
x_train

array([[ 1.32384841,  0.91509239,  1.89107197, -0.54902995],
       [ 0.68491609,  0.37415107,  0.16162128, -1.91685543],
       [-1.28996563, -1.38390819, -1.44429721,  0.3628537 ],
       [-1.40613514, -1.38390819, -0.95016844,  1.04676645],
       [ 0.10406852, -0.30202557,  0.16162128, -1.91685543],
       [-0.30252477, -0.30202557, -1.07370063, -1.68888452],
       [ 1.49810268,  1.05032772,  2.50873292,  1.73067919],
       [-1.23188088, -1.11343754, -0.57957187,  1.9586501 ],
       [-1.4642199 , -1.11343754, -0.57957187,  1.9586501 ],
       [ 0.04598377, -0.03155491, -0.0854431 , -0.77700086],
       [-0.18635526, -0.30202557, -1.07370063, -2.37279726],
       [-1.52230466, -1.38390819, -1.07370063,  0.3628537 ],
       [ 1.20767889,  1.4560337 ,  1.14987882,  0.3628537 ],
       [ 0.8010856 ,  1.4560337 ,  1.02634662, -0.09308812],
       [ 0.74300084,  0.91509239, -0.0854431 , -0.77700086],
       [ 0.74300084,  1.59126903, -0.0854431 , -0.54902995],
       [ 1.03342462,  1.

In [20]:
x_test

array([[ 0.62683133,  0.37415107,  0.53221786, -1.23294269],
       [ 0.97533987,  1.18556304,  1.14987882, -0.09308812],
       [ 1.03342462,  1.32079837,  0.65575005, -0.54902995],
       [ 0.16215328,  0.10368042, -0.33250748, -0.09308812],
       [-1.34805039, -1.51914352, -1.19723282,  0.13488279],
       [ 0.56874657,  0.77985706,  0.16162128, -0.09308812],
       [ 0.33640755,  0.10368042,  0.53221786, -1.68888452],
       [-1.4642199 , -1.38390819, -1.81489378,  0.3628537 ],
       [-1.40613514, -1.38390819, -1.81489378, -0.32105904],
       [ 0.33640755, -0.03155491, -0.45603967, -1.00497178],
       [ 0.74300084,  1.4560337 ,  1.27341101,  0.13488279],
       [-1.40613514, -1.38390819, -0.45603967,  2.64256284],
       [ 0.04598377,  0.23891575, -0.82663625, -0.77700086],
       [ 0.97533987,  0.77985706,  0.77928224, -0.09308812],
       [ 1.32384841,  1.4560337 ,  2.26166854, -0.09308812],
       [ 1.14959414,  1.32079837,  0.77928224, -0.09308812],
       [-1.4642199 , -1.

In [None]:
clasifier = GaussianNB()
clasifier.fit(x_train, y_train)
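Conceptually, fitting a Gaussian Naive Bayes model means estimating a prior for each class plus a mean and variance for each feature within each class. The following from-scratch sketch illustrates that idea on hypothetical toy data; it is not scikit-learn's implementation, and the fixed variance floor of `1e-9` is an assumption made here to avoid division by zero.

```python
import math
import statistics
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate P(C) plus a per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [statistics.fmean(col) for col in columns],
            # Assumed small floor so a constant feature never gives zero variance.
            "var": [statistics.pvariance(col) + 1e-9 for col in columns],
        }
    return model

def predict_one(model, row):
    """Return the class with the largest log P(C) + sum_i log P(x_i | C)."""
    def log_score(params):
        score = math.log(params["prior"])
        for x, mu, var in zip(row, params["mean"], params["var"]):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_score(model[label]))

# Hypothetical two-feature toy data with two well-separated classes.
X_toy = [[1.4, 0.2], [1.3, 0.2], [1.5, 0.3], [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]]
y_toy = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(model, [1.6, 0.2]), predict_one(model, [4.4, 1.3]))
```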

In [22]:
y_pred = clasifier.predict(x_test)
y_pred

array([1, 2, 2, 1, 0, 2, 1, 0, 0, 1, 2, 0, 1, 2, 2, 2, 0, 0, 1, 0, 0, 1,
       0, 2, 0, 0, 0, 2, 2, 0])

In [23]:
clasifier.predict_proba(x_test)

array([[9.98113576e-125, 9.23061983e-001, 7.69380171e-002],
       [7.28011469e-195, 1.22131474e-005, 9.99987787e-001],
       [1.76579881e-205, 4.05456535e-006, 9.99995945e-001],
       [6.68285574e-075, 9.99831223e-001, 1.68777079e-004],
       [1.00000000e+000, 7.69492498e-017, 1.27237734e-025],
       [5.40967176e-131, 2.65635572e-001, 7.34364428e-001],
       [3.65348518e-092, 9.99265736e-001, 7.34264286e-004],
       [1.00000000e+000, 2.13664102e-017, 2.01478558e-026],
       [1.00000000e+000, 2.34838672e-016, 1.53190017e-025],
       [2.62228168e-085, 9.99862993e-001, 1.37007010e-004],
       [9.66245490e-184, 1.79873766e-006, 9.99998201e-001],
       [1.00000000e+000, 9.41524767e-019, 1.84582236e-026],
       [5.43637781e-070, 9.99924553e-001, 7.54469589e-005],
       [2.10529816e-175, 2.94898096e-003, 9.97051019e-001],
       [4.48635493e-257, 4.56398627e-010, 1.00000000e+000],
       [2.25678814e-219, 6.99119033e-007, 9.99999301e-001],
       [1.00000000e+000, 9.75935324e-017
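Each row of `predict_proba` holds the posterior probability of classes 0, 1, and 2 for one test sample, and `predict` returns the index of the largest entry. The sketch below checks this on the first six rows of the output above.

```python
# First six rows of the predict_proba output shown above.
proba = [
    [9.98113576e-125, 9.23061983e-01, 7.69380171e-02],
    [7.28011469e-195, 1.22131474e-05, 9.99987787e-01],
    [1.76579881e-205, 4.05456535e-06, 9.99995945e-01],
    [6.68285574e-75, 9.99831223e-01, 1.68777079e-04],
    [1.00000000e+00, 7.69492498e-17, 1.27237734e-25],
    [5.40967176e-131, 2.65635572e-01, 7.34364428e-01],
]

# The predicted class is the column index holding the largest probability.
predicted = [row.index(max(row)) for row in proba]
print(predicted)
```

The result matches the first six entries of `y_pred`. The sixth row is the least certain of these, with roughly 0.73 for class 2 against 0.27 for class 1.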

In [30]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[13  0  0]
 [ 0  6  0]
 [ 0  1 10]]


In [31]:
akurasi = classification_report(y_test, y_pred)
print(akurasi)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       0.86      1.00      0.92         6
           2       1.00      0.91      0.95        11

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30
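The figures in the report can be derived directly from the confusion matrix printed earlier, where rows are true classes and columns are predicted classes. A minimal sketch:

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = [
    [13, 0, 0],
    [0, 6, 0],
    [0, 1, 10],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[k][k] for k in range(n_classes))   # diagonal = correct predictions
accuracy = correct / total

for k in range(n_classes):
    predicted_as_k = sum(cm[i][k] for i in range(n_classes))   # column sum
    actually_k = sum(cm[k])                                     # row sum (support)
    precision = cm[k][k] / predicted_as_k
    recall = cm[k][k] / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.4f}")
```

The single off-diagonal entry (one class 2 sample predicted as class 1) explains both the 0.86 precision of class 1 and the 0.91 recall of class 2.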



In [32]:
akurasi = accuracy_score(y_test, y_pred)
# %d truncates the percentage to an integer, so 29/30 (about 96.67%) prints as 96
print("Accuracy: %d percent" % (akurasi * 100))

Accuracy: 96 percent


In [33]:
# Put the true and predicted labels side by side for comparison
ydata = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})
ydata

| | y_test | y_pred |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 2 |
| 2 | 2 | 2 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |
| 5 | 2 | 2 |
| 6 | 1 | 1 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 1 | 1 |
