# Naive Bayes Classifier
Tutorial of Computational Linguistics, National Chengchi University

*Chang-Yu Tsai, 2025.03.14*

- This week, we will:
  - build a Naive Bayes classifier using `PyTorch`
  - conduct K-fold cross-validation
  - understand the differences between the averaging methods for the F1-score

## Set-up

- importing required packages

```
import pandas as pd

import torch

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

import numpy as np

import re
```


## Preprocessing

- downloading the dataset from github

```
!wget https://raw.githubusercontent.com/EntropiaTsai/nccu_elt_course_material/refs/heads/main/fake_news.csv
```


--2025-03-07 03:15:12--  https://raw.githubusercontent.com/EntropiaTsai/nccu_elt_course_material/refs/heads/main/fake_news.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 726216 (709K) [text/plain]
Saving to: ‘fake_news.csv’


2025-03-07 03:15:12 (7.87 MB/s) - ‘fake_news.csv’ saved [726216/726216]



- reading the file

```
# Read the CSV file with pandas
df = pd.read_csv('fake_news.csv', encoding='utf-8')
# Show the first five rows
print(df.head())
len(df)
```


                                                text  label
0  星巴克Starbucks詐騙邀請LINE訊息. 原始謠傳版本：. 統一星巴克starbuck...      1
1  转摘：\n我给大家透漏个家传秘密，每天出门前用棉签蘸点小磨香油，滴进两个鼻孔内，轻捏几下即可...      1
2  各位朋友請注意，剛才在承德路被警察攔下臨檢，因為當時沒有戴口罩，被警察詢問車上有沒有乘客，我...      1
3  亡國危機迫在眉睫- -家庭觀念及性別認同危機。這支影片看後你我都有責仼，請及時踴躍發聲！轉發...      1
4  (shiny)臭跩貓回來囉(shiny)\n限時活動發放貼圖主題\n超可愛白爛貓愛嗆人嗆不停...      1


1000

- Extracting features

```
# Searching for reposting keywords such as 轉發, 轉載, 轉摘, 轉貼.
# Note: `*` makes the second character optional, so a bare 轉 also matches.
repost=[]
for text in df['text']:
  if re.search('轉[發載摘貼]*',text):
    result=1
  else:
    result=0
  repost.append(result)

df['repost']=repost

# taking a look at the current dataframe

df.head()
```
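To see what the pattern above actually matches, here is a small sketch that uses only the `re` module. The sample strings and the stricter pattern `轉[發載摘貼]` are assumptions made up for illustration; they are not part of the dataset or of the original code.

```python
import re

# Pattern from the cell above: `*` makes the second character optional,
# so a bare 轉 is enough to produce a match.
loose = re.compile(r'轉[發載摘貼]*')
# Hypothetical stricter variant: requires one of 發/載/摘/貼 right after 轉.
strict = re.compile(r'轉[發載摘貼]')

samples = [
    '請及時轉發給朋友',  # contains 轉發
    '前方路口請右轉',    # contains 轉, but is not about reposting
    '转摘：家传秘密',    # simplified 转, which neither pattern covers
]

for text in samples:
    loose_hit = int(bool(loose.search(text)))
    strict_hit = int(bool(strict.search(text)))
    print(text, loose_hit, strict_hit)
```

Which variant is preferable depends on the data. The outputs shown in this tutorial were produced with the original, looser pattern.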


                                                text  label  repost
0  星巴克Starbucks詐騙邀請LINE訊息. 原始謠傳版本：. 統一星巴克starbuck...      1       0
1  转摘：\n我给大家透漏个家传秘密，每天出门前用棉签蘸点小磨香油，滴进两个鼻孔内，轻捏几下即可...      1       0
2  各位朋友請注意，剛才在承德路被警察攔下臨檢，因為當時沒有戴口罩，被警察詢問車上有沒有乘客，我...      1       0
3  亡國危機迫在眉睫- -家庭觀念及性別認同危機。這支影片看後你我都有責仼，請及時踴躍發聲！轉發...      1       1
4  (shiny)臭跩貓回來囉(shiny)\n限時活動發放貼圖主題\n超可愛白爛貓愛嗆人嗆不停...      1       0


- separating the features and labels

```
features=df[['repost']]
labels=df['label']
print('Feature:')
features.head()
# print('Labels:')
# labels.head()
```


Feature:


   repost
0       0
1       0
2       0
3       1
4       0


- dividing them into the training set and the test set

```
feat_train, feat_test, label_train, label_test = train_test_split(features, labels, test_size=0.2, random_state=42)
```


- Converting features and labels into **tensors**
  - A tensor is an n-dimensional array of values; it is the basic data structure used in deep learning tasks.
  
```
# 0D Tensor (scalar)
print(torch.tensor(1.1))

# 1D Tensor (vector)
print(torch.tensor([1.1, 2.1]))

# 2D Tensor (matrix)
print(torch.tensor([[1.1, 2.1], [3.1, 3.2]]))

# 3D Tensor (cube)
print(torch.tensor([[[1.1, 0.2], [0.1, 0.2]], [[0.3, 0.4], [0.3, 0.4]]]))

# 4D Tensor
print(torch.tensor([[[[0.1, 0.2], [0.1, 0.2]], [[0.3, 0.4], [0.3, 0.4]]],
                    [[[0.5, 0.6], [0.5, 0.6]], [[0.7, 0.8], [0.7, 0.8]]]]))

# 5D Tensor
print(torch.tensor([[[[[0.1, 0.2], [0.1, 0.2]], [[0.3, 0.4], [0.3, 0.4]]],
                      [[[0.5, 0.6], [0.5, 0.6]], [[0.7, 0.8], [0.7, 0.8]]]],

                    [[[[0.9, 1.0], [0.9, 1.0]], [[1.1, 1.2], [1.1, 1.2]]],
                      [[[1.3, 1.4], [1.3, 1.4]], [[1.5, 1.6], [1.5, 1.6]]]]]))
```


tensor(1.1000)
tensor([1.1000, 2.1000])
tensor([[1.1000, 2.1000],
        [3.1000, 3.2000]])
tensor([[[1.1000, 0.2000],
         [0.1000, 0.2000]],

        [[0.3000, 0.4000],
         [0.3000, 0.4000]]])
tensor([[[[0.1000, 0.2000],
          [0.1000, 0.2000]],

         [[0.3000, 0.4000],
          [0.3000, 0.4000]]],


        [[[0.5000, 0.6000],
          [0.5000, 0.6000]],

         [[0.7000, 0.8000],
          [0.7000, 0.8000]]]])
tensor([[[[[0.1000, 0.2000],
           [0.1000, 0.2000]],

          [[0.3000, 0.4000],
           [0.3000, 0.4000]]],


         [[[0.5000, 0.6000],
           [0.5000, 0.6000]],

          [[0.7000, 0.8000],
           [0.7000, 0.8000]]]],



        [[[[0.9000, 1.0000],
           [0.9000, 1.0000]],

          [[1.1000, 1.2000],
           [1.1000, 1.2000]]],


         [[[1.3000, 1.4000],
           [1.3000, 1.4000]],

          [[1.5000, 1.6000],
           [1.5000, 1.6000]]]]])


- Data must be converted into tensors before it is passed to a model built with `PyTorch`.

```
# Converting features into tensors
feat_train_tensor = torch.tensor(feat_train.values, dtype=torch.float32)
feat_test_tensor = torch.tensor(feat_test.values, dtype=torch.float32)

# Converting labels into tensors
label_train_tensor = torch.tensor(label_train.values, dtype=torch.long)
label_test_tensor = torch.tensor(label_test.values, dtype=torch.long)

# Inspecting the result
print(f"feat_train_tensor shape: {feat_train_tensor.shape}")
print(f"label_train_tensor shape: {label_train_tensor.shape}")
print(feat_train_tensor[:1])
print(label_train_tensor[:1])

print(f"feat_test_tensor shape: {feat_test_tensor.shape}")
print(f"label_test_tensor shape: {label_test_tensor.shape}")
print(feat_test_tensor[:1])
print(label_test_tensor[:1])

```



feat_train_tensor shape: torch.Size([800, 1])
label_train_tensor shape: torch.Size([800])
tensor([[0.]])
tensor([1])
feat_test_tensor shape: torch.Size([200, 1])
label_test_tensor shape: torch.Size([200])
tensor([[0.]])
tensor([0])


- counting the number of features and the number of classes, both of which must be specified when creating the model

```
num_features = feat_train_tensor.shape[1]  # number of features
print('Number of features:', num_features)
num_classes = len(torch.unique(label_train_tensor))  # number of labels
print('Number of labels:',num_classes)
```


Number of features: 1
Number of labels: 2


## Model training

- recalling the formula of the Naive Bayes Classifier

  $\hat{C} = \arg\max_{C_i} \left[ \log P(C_i) + \sum_{j} \log P(F_j \mid C_i) \right]$

- defining a `MultinomialNB` class

```
class MultinomialNB:
    def __init__(self, num_classes, num_features, alpha=1.0):
        """
        Multinomial Naive Bayes model.
        Parameters:
            num_classes (int): Number of class labels.
            num_features (int): Number of features.
            alpha (float): Laplace smoothing parameter. Set alpha=0 to disable smoothing.
        """
        self.num_classes = num_classes
        self.num_features = num_features
        self.alpha = alpha  # Laplace smoothing
        self.class_log_prior_ = None  # log(P(C))
        self.feature_log_prob_ = None  # log(P(X|C))

    def compute_probabilities(self, X, y):
        """
        Compute class prior probabilities P(C) and feature probabilities P(X | C).
        Parameters:
            X (Tensor): (num_samples, num_features), Feature counts.
            y (Tensor): (num_samples,), Class labels.
        """
        # calculate the prior probability P(C)
        class_counts = torch.bincount(y, minlength=self.num_classes).float()
        class_prior = class_counts / class_counts.sum()  # P(C)
        self.class_log_prior_ = torch.log(class_prior)  # log(P(C))

        # printing out the prior probability
        for c in range(self.num_classes):
            print(f"Prior P(C={c}): {class_prior[c]:.4f}")

        # calculating the conditional probability P(X | C)
        feature_counts = torch.zeros(self.num_classes, self.num_features)
        for c in range(self.num_classes):
            feature_counts[c] = X[y == c].sum(dim=0)  # sum the feature counts for this class

        # Avoiding division-by-zero problems when `alpha=0` by adding a small epsilon
        eps = 1e-10
        feature_prob = (feature_counts + self.alpha) / (feature_counts.sum(dim=1, keepdim=True) + self.alpha * self.num_features + eps)
        self.feature_log_prob_ = torch.log(feature_prob)  # log(P(X | C))


    def compute_log_likelihood(self, X):
        """
        Compute log likelihood for each sample.
        Parameters:
            X (Tensor): (num_samples, num_features), Feature counts.
        Returns:
            log_prob (Tensor): (num_samples, num_classes), Log probabilities.
        """
        return self.class_log_prior_ + X @ self.feature_log_prob_.T  # log P(C) + Σ log P(X | C)

```
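As a sanity check on the smoothing formula in `compute_probabilities`, the sketch below recomputes it in plain Python. The counts are made up for illustration, and `smoothed_log_probs` is a hypothetical helper, not part of the class above. It suggests that with a single feature the smoothed probability is almost exactly 1 for every class, so its log is about 0 and the prediction would be driven by the prior alone. With two or more features, the likelihood term starts to differ between classes.

```python
import math

def smoothed_log_probs(feature_counts, alpha=1.0, eps=1e-10):
    """Mirror of the formula above:
    (count + alpha) / (total + alpha * n_features + eps), then log."""
    n_features = len(feature_counts)
    total = sum(feature_counts)
    return [math.log((c + alpha) / (total + alpha * n_features + eps))
            for c in feature_counts]

# Hypothetical per-class counts with ONE feature (e.g. only `repost`).
one_feature = {0: [30.0], 1: [55.0]}
for label, counts in one_feature.items():
    print("one feature, class", label, smoothed_log_probs(counts))

# Hypothetical per-class counts with TWO features.
two_features = {0: [30.0, 10.0], 1: [55.0, 5.0]}
for label, counts in two_features.items():
    print("two features, class", label, smoothed_log_probs(counts))
```

If this reading is right, adding more features is what lets the likelihood term influence the prediction.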


- training the model with 10-fold cross-validation
> K-fold cross-validation shows how much the model's performance varies across different train/dev splits, which helps us judge whether the model overfits.

```
# defining 10-Fold Cross-Validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# collecting the weighted F1-scores of each fold
f1_scores_weighted = []

# conducting 10-Fold Cross-Validation
for fold, (train_idx, dev_idx) in enumerate(kf.split(feat_train_tensor)):
    print(f"\nFold {fold+1}/10")

    # splitting data into the training set and the dev set
    feat_train_fold = feat_train_tensor[torch.tensor(train_idx)]
    feat_dev_fold = feat_train_tensor[torch.tensor(dev_idx)]
    label_train_fold = label_train_tensor[torch.tensor(train_idx)]
    label_dev_fold = label_train_tensor[torch.tensor(dev_idx)]

    # Adding print() calls to your code helps with debugging.
    print(f"feat_dev_fold shape: {feat_dev_fold.shape}")
    print(f"label_dev_fold shape: {label_dev_fold.shape}")

    # training
    model = MultinomialNB(num_classes, num_features, alpha=1.0)
    model.compute_probabilities(feat_train_fold, label_train_fold)

    # calculating log likelihood
    log_likelihood = model.compute_log_likelihood(feat_dev_fold)

    # predicting based on log likelihood
    label_dev_pred = log_likelihood.argmax(dim=1)

    # Adding print() calls to your code helps with debugging.
    print(f"label_dev_pred shape: {label_dev_pred.shape}")
    print(f"label_dev_fold shape: {label_dev_fold.shape}")

    # ensuring `label_dev_pred` and `label_dev_fold` have consistent shapes
    label_dev_pred_np = label_dev_pred.view(-1).numpy()
    label_dev_fold_np = label_dev_fold.view(-1).numpy()

    # calculating F1-score
    f1_weighted = f1_score(label_dev_fold_np, label_dev_pred_np, average='weighted')

    # collecting weighted F1-scores of each fold
    f1_scores_weighted.append(f1_weighted)

    # printing out the weighted F1-score of each fold
    print(f"F1-Score (weighted): {f1_weighted:.4f}")

# calculating the mean and the standard deviation of weighted F1-scores
print("\n=== Cross-Validation Results ===")
print(f"F1-Score (weighted): Mean = {np.mean(f1_scores_weighted):.4f}, Std = {np.std(f1_scores_weighted):.4f}")

```
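To make the splitting step concrete, the following minimal sketch builds the folds by hand in plain Python. It only illustrates the idea: `k_fold_indices` is a hypothetical helper, and scikit-learn's `KFold` has its own implementation of shuffling and fold sizing, so the exact indices will differ.

```python
import random

def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_idx, dev_idx) pairs; every sample lands in the dev set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        dev_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, dev_idx
        start += size

# 800 training samples split into 10 folds, as in the tutorial.
folds = list(k_fold_indices(n_samples=800, k=10))
for fold, (train_idx, dev_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: train={len(train_idx)}, dev={len(dev_idx)}")
```

Each fold trains on nine tenths of the data and evaluates on the remaining tenth, which is why the dev tensors in the loop above have 80 rows.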



Fold 1/10
feat_dev_fold shape: torch.Size([80, 1])
label_dev_fold shape: torch.Size([80])
Prior P(C=0): 0.4958
Prior P(C=1): 0.5042
label_dev_pred shape: torch.Size([80])
label_dev_fold shape: torch.Size([80])
F1-Score (weighted): 0.3473

Fold 2/10
feat_dev_fold shape: torch.Size([80, 1])
label_dev_fold shape: torch.Size([80])
Prior P(C=0): 0.4972
Prior P(C=1): 0.5028
label_dev_pred shape: torch.Size([80])
label_dev_fold shape: torch.Size([80])
F1-Score (weighted): 0.3615

Fold 3/10
feat_dev_fold shape: torch.Size([80, 1])
label_dev_fold shape: torch.Size([80])
Prior P(C=0): 0.5028
Prior P(C=1): 0.4972
label_dev_pred shape: torch.Size([80])
label_dev_fold shape: torch.Size([80])
F1-Score (weighted): 0.2535

Fold 4/10
feat_dev_fold shape: torch.Size([80, 1])
label_dev_fold shape: torch.Size([80])
Prior P(C=0): 0.4875
Prior P(C=1): 0.5125
label_dev_pred shape: torch.Size([80])
label_dev_fold shape: torch.Size([80])
F1-Score (weighted): 0.2663

Fold 5/10
feat_dev_fold shape: torch.Size([

## Predicting and evaluating


- predicting

```
label_pred = model.compute_log_likelihood(feat_test_tensor).argmax(dim=1)
label_true = label_test_tensor

# Converting tensors into numpy format
label_pred = label_pred.numpy()
label_true = label_true.numpy()
```


- evaluating

```
accuracy_sklearn = accuracy_score(label_true, label_pred)
precision_sklearn = precision_score(label_true, label_pred, average='weighted')
recall_sklearn = recall_score(label_true, label_pred, average='weighted')
f1_sklearn = f1_score(label_true, label_pred, average='weighted')

# printing out the results
print("\n=== Evaluation Metrics (Scikit-learn) ===")
print(f"Accuracy : {accuracy_sklearn:.4f}")
print(f"Precision: {precision_sklearn:.4f}")
print(f"Recall   : {recall_sklearn:.4f}")
print(f"F1 Score : {f1_sklearn:.4f}")
```



=== Evaluation Metrics (Scikit-learn) ===
Accuracy : 0.4800
Precision: 0.2304
Recall   : 0.4800
F1 Score : 0.3114
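
Since one goal of this tutorial is to compare the averaging methods for the F1-score, here is a minimal pure-Python sketch of how the `macro`, `weighted`, and `micro` averages are usually defined. The toy labels are made up and the helper names are hypothetical. In practice, the `average` argument of scikit-learn's `f1_score` does this for you.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 of one class, computed from its true/false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # treat an undefined F1 as 0

def f1_averages(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    f1s = {c: per_class_f1(y_true, y_pred, c) for c in labels}
    # macro: plain mean over classes, every class counts equally
    macro = sum(f1s.values()) / len(labels)
    # weighted: mean over classes, weighted by class support
    weighted = sum(f1s[c] * support[c] for c in labels) / len(y_true)
    # micro: pooled counts; for single-label classification this equals accuracy
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"macro": macro, "weighted": weighted, "micro": micro}

# Toy case (an assumption): a classifier that always predicts class 1.
y_true = [0] * 52 + [1] * 48
y_pred = [1] * 100
scores = f1_averages(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the three averages disagree noticeably: micro equals the accuracy (0.48), while the weighted average is about 0.31 because the class that is never predicted contributes an F1 of 0.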




# Assignment

Please create a new `ipynb` notebook for your assignment. Don't forget to include your name and related information at the top of your code.

1. **Feature Extraction:**  
   - Add at least two new features that you believe can improve the prediction. (10%)
   - Explain why you selected these features and how they might help. (20%)

2. **K-Fold Cross-Validation for Model Stability:**  
   - Train your model using K-fold Cross-Validation and compare the standard deviation (std) of F1-score between your model and the one used in class. (10%)
   - Is there any difference? If so, explain what the difference indicates. If not, explain why the two models behave the same. (20%)

3. **Prediction and Evaluation:**  
   - Predict and evaluate your model on the test set with `weighted`, `micro` or `macro` averaging for F1-score.  (10%)
   - Justify your choice of averaging method. (20%)
