Implement the Gaussian Naïve Bayes algorithm yourself, given that the class-conditional probability used by this algorithm is:

$$P(x|C_1) = \displaystyle \frac{1}{\sqrt{2\pi}\sigma_1} \cdot  e^{\displaystyle -\frac{(x-\mu_1)^2}{2\sigma_1^2}}$$

Where:
- μ₁ is the mean and σ₁ is the standard deviation of the feature x within class C₁
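
For reference, the classifier built below multiplies these per-feature densities by the class prior, under the naive assumption that features are independent given the class. A sketch of that decision rule in the notation above, where the indices $k$ (class) and $i$ (feature) and the feature count $n$ are symbols introduced here for illustration:

$$\hat{C} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i|C_k), \qquad P(x_i|C_k) = \frac{1}{\sqrt{2\pi}\sigma_{k,i}} \cdot e^{\displaystyle -\frac{(x_i-\mu_{k,i})^2}{2\sigma_{k,i}^2}}$$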

Write the following functions (without using libraries):
 
1. Read the data
2. Compute the mean
3. Compute the standard deviation
4. Compute the probability
5. Classify
6. Compute the accuracy
7. Apply the algorithm to the following dataset:

| Day | Outlook | Humidity | Wind | Play Tennis |
|---|---|---|---|---|
| D1 | Sunny | High | Weak | No |
| D2 | Sunny | High | Strong | No |
| D3 | Overcast | High | Weak | Yes |
| D4 | Rain | High | Weak | Yes |
| D5 | Rain | Normal | Weak | Yes |
| D6 | Rain | Normal | Strong | No |
| D7 | Overcast | Normal | Strong | Yes |
| D8 | Sunny | High | Weak | No |
| D9 | Sunny | Normal | Weak | Yes |
| D10 | Rain | Normal | Weak | Yes |
| D11 | Sunny | Normal | Strong | Yes |
| D12 | Overcast | High | Strong | Yes |
| D13 | Overcast | Normal | Weak | Yes |
| D14 | Rain | High | Strong | No |



In [10]:
import numpy as np
import pandas as pd
import csv

In [64]:
def read_data_1():
    # Read weather.csv; the first row holds the column names
    with open('weather.csv', 'r', newline='') as f:
        data = list(csv.reader(f))
    columns = data[0]
    data = data[1:]
    return data, columns

In [66]:
def read_data(data):
    lines = data.split('\n')
    dataset = []
    columns = lines[0].strip().split(',')
    for line in lines[1:]:  # Skip the first line, which holds the header
        if line.strip():  # Skip blank lines
            data_point = line.strip().split(',')
            dataset.append(data_point)
    return dataset, columns

In [67]:

data = 'Day,Outlook,Humidity,Wind,Play Tennis\nD1,Sunny,High,Weak,No\nD2,Sunny,High,Strong,No\nD3,Overcast,High,Weak,Yes\nD4,Rain,High,Weak,Yes\nD5,Rain,Normal,Weak,Yes\nD6,Rain,Normal,Strong,No\nD7,Overcast,Normal,Strong,Yes\nD8,Sunny,High,Weak,No\nD9,Sunny,Normal,Weak,Yes\nD10,Rain,Normal,Weak,Yes\nD11,Sunny,Normal,Strong,Yes\nD12,Overcast,High,Strong,Yes\nD13,Overcast,Normal,Weak,Yes\nD14,Rain,High,Strong,No'
 
dataset,columns = read_data(data)
df = pd.DataFrame(dataset, columns=columns)
df

| | Day | Outlook | Humidity | Wind | Play Tennis |
|---|---|---|---|---|---|
| 0 | D1 | Sunny | High | Weak | No |
| 1 | D2 | Sunny | High | Strong | No |
| 2 | D3 | Overcast | High | Weak | Yes |
| 3 | D4 | Rain | High | Weak | Yes |
| 4 | D5 | Rain | Normal | Weak | Yes |
| 5 | D6 | Rain | Normal | Strong | No |
| 6 | D7 | Overcast | Normal | Strong | Yes |
| 7 | D8 | Sunny | High | Weak | No |
| 8 | D9 | Sunny | Normal | Weak | Yes |
| 9 | D10 | Rain | Normal | Weak | Yes |


In [68]:
def calculate_mean(numbers):
    return sum(numbers) / len(numbers)

def calculate_std_dev(numbers):
    # Population standard deviation (divides by N, like NumPy's default ddof=0)
    mean = calculate_mean(numbers)
    variance = sum((x - mean) ** 2 for x in numbers) / len(numbers)
    return variance ** 0.5

$$P(x|C_1) = \displaystyle \frac{1}{\sqrt{2\pi}\sigma_1} \cdot  e^{\displaystyle -\frac{(x-\mu_1)^2}{2\sigma_1^2}}$$

In [None]:

def calculate_probability(x, mean, std_dev):
    return (1 / (std_dev * (2 * np.pi) ** 0.5)) * np.exp(-((x - mean) ** 2) / (2 * std_dev ** 2))

def predict_class(input_data, class_means, class_std_devs, class_prior):
    probabilities = {}
    for class_val, class_mean in class_means.items():
        probabilities[class_val] = class_prior[class_val]
        for i in range(len(class_mean)):
            probabilities[class_val] *= calculate_probability(input_data[i], class_mean[i], class_std_devs[class_val][i])
    return max(probabilities, key=probabilities.get)

def accuracy(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0
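
Because `predict_class` multiplies many densities together, the product can underflow with more features, and a class whose feature has zero standard deviation would cause a division by zero. One common way around both issues is to sum log-densities instead. The sketch below is a hedged variant, not part of the original notebook: the names `log_gaussian` and `predict_class_log` and the floor value `eps` are assumptions made for illustration.

```python
import math

def log_gaussian(x, mean, std_dev, eps=1e-9):
    # Floor the standard deviation (eps is an assumed value) to avoid dividing by zero
    std_dev = max(std_dev, eps)
    # log of the Gaussian density: -log(sigma * sqrt(2*pi)) - (x - mu)^2 / (2 * sigma^2)
    return -math.log(std_dev * math.sqrt(2 * math.pi)) - ((x - mean) ** 2) / (2 * std_dev ** 2)

def predict_class_log(input_data, class_means, class_std_devs, class_prior):
    # Same decision rule as predict_class, computed as a sum of logs
    scores = {}
    for class_val, means in class_means.items():
        score = math.log(class_prior[class_val])
        for x, m, s in zip(input_data, means, class_std_devs[class_val]):
            score += log_gaussian(x, m, s)
        scores[class_val] = score
    return max(scores, key=scores.get)

# Toy parameters (hypothetical, one feature, two classes) to show the call shape
toy_means = {0: [0.0], 1: [5.0]}
toy_stds = {0: [1.0], 1: [1.0]}
toy_prior = {0: 0.5, 1: 0.5}
toy_prediction = predict_class_log([4.0], toy_means, toy_stds, toy_prior)
```

Since the logarithm is monotonic, the class with the largest log-score is the same class that maximizes the original product, so predictions should agree with `predict_class` whenever the product does not underflow.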


In [15]:
outlook_mapping = {'Sunny': 0, 'Overcast': 1, 'Rain': 2}
humidity_mapping = {'High': 0, 'Normal': 1}
wind_mapping = {'Weak': 0, 'Strong': 1}
play_mapping = {'No': 0, 'Yes': 1}

X = []
y = []
for row in dataset:
    X.append([outlook_mapping[row[1]], humidity_mapping[row[2]], wind_mapping[row[3]]])
    y.append(play_mapping[row[4]])

X = np.array(X)
y = np.array(y)

# Compute the parameters needed for Gaussian Naïve Bayes
class_means = {}
class_std_devs = {}
class_prior = {}

for class_val in np.unique(y):
    X_class = X[y == class_val]
    class_means[class_val] = X_class.mean(axis=0)
    class_std_devs[class_val] = X_class.std(axis=0)
    class_prior[class_val] = len(X_class) / len(X)

# Predict and compute the accuracy
predictions = []
for data_point in X:
    predictions.append(predict_class(data_point, class_means, class_std_devs, class_prior))

acc = accuracy(y, predictions)
print("Accuracy:", acc)

Accuracy: 85.71428571428571
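
The task statement asks for functions written without libraries, while the cells above rely on NumPy for the per-class statistics. A minimal sketch of the same pipeline using only the standard `math` module might look like the following. The names `fit`, `predict`, `gaussian_pdf`, `ENCODERS`, and `LABELS` are illustrative choices rather than part of the original notebook, and the integer encodings simply mirror the mappings used above.

```python
import math

RAW = """Day,Outlook,Humidity,Wind,Play Tennis
D1,Sunny,High,Weak,No
D2,Sunny,High,Strong,No
D3,Overcast,High,Weak,Yes
D4,Rain,High,Weak,Yes
D5,Rain,Normal,Weak,Yes
D6,Rain,Normal,Strong,No
D7,Overcast,Normal,Strong,Yes
D8,Sunny,High,Weak,No
D9,Sunny,Normal,Weak,Yes
D10,Rain,Normal,Weak,Yes
D11,Sunny,Normal,Strong,Yes
D12,Overcast,High,Strong,Yes
D13,Overcast,Normal,Weak,Yes
D14,Rain,High,Strong,No"""

# Same integer encodings as the mappings in the notebook
ENCODERS = [
    {'Sunny': 0, 'Overcast': 1, 'Rain': 2},
    {'High': 0, 'Normal': 1},
    {'Weak': 0, 'Strong': 1},
]
LABELS = {'No': 0, 'Yes': 1}

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    # Population standard deviation (divides by N)
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def fit(X, y):
    # Estimate prior, per-feature mean and per-feature std for each class
    params = {}
    for label in sorted(set(y)):
        members = [x for x, target in zip(X, y) if target == label]
        columns = list(zip(*members))
        params[label] = {
            'prior': len(members) / len(X),
            'means': [mean(col) for col in columns],
            'stds': [std_dev(col) for col in columns],
        }
    return params

def predict(params, x):
    # Pick the class with the largest prior * product of feature densities
    scores = {}
    for label, p in params.items():
        score = p['prior']
        for value, mu, sigma in zip(x, p['means'], p['stds']):
            score *= gaussian_pdf(value, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)

# Skip the header row, then encode features (columns 1-3) and the label (column 4)
rows = [line.split(',') for line in RAW.strip().split('\n')[1:]]
X = [[enc[value] for enc, value in zip(ENCODERS, row[1:4])] for row in rows]
y = [LABELS[row[4]] for row in rows]

params = fit(X, y)
predictions = [predict(params, x) for x in X]
acc = 100.0 * sum(p == t for p, t in zip(predictions, y)) / len(y)
print("Accuracy:", acc)
```

Evaluated on the same 14 training rows, this sketch should reproduce the accuracy printed above, since it uses the same encodings and the same population standard deviation. Note that encoding categorical values as integers and fitting a Gaussian to them is a modeling shortcut carried over from the notebook.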
