# Logistic Regression
Tutorial of Computational Linguistics, National Chengchi University

*Chang-Yu Tsai, 2025.03.14*

- This week, we will:
  - build a logistic regression classifier with PyTorch
  - understand the concept of batch size
  - understand the concept of learning rate


## Set-up

- importing required packages

```
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import numpy as np

import re
```


## Preprocessing

- downloading the dataset from GitHub
> Our data was collected from Cofacts. To explore their dataset further, see [this website](https://huggingface.co/datasets/Cofacts/line-msg-fact-check-tw).

```
!wget https://raw.githubusercontent.com/EntropiaTsai/nccu_elt_course_material/refs/heads/main/balanced_misinformation.csv
```


--2025-03-14 03:29:52--  https://raw.githubusercontent.com/EntropiaTsai/nccu_elt_course_material/refs/heads/main/balanced_misinformation.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18379337 (18M) [text/plain]
Saving to: ‘balanced_misinformation.csv.2’


2025-03-14 03:29:52 (123 MB/s) - ‘balanced_misinformation.csv.2’ saved [18379337/18379337]



- reading the file

```
df = pd.read_csv('balanced_misinformation.csv',encoding='utf-8')
print('The dataframe contains', len(df), 'rows.')
df.head()
```


The dataframe contains 25170 rows.


| | text | title | type |
|---|---|---|---|
| 0 | https://tw.news.yahoo.com/%E6%A0%B8%E5%BB%A2%E... | 政治 | OPINIONATED |
| 1 | 昨天6/15時鐘說,政府買進居\n家快篩劑一份一千元,需要的\n民眾可以洽購。\n今天6/1... | 法律 | RUMOR |
| 2 | 桃園市議會 通過112年度預算 明年起即可全面適用\n☝️自111學年度下學期起，全面推動... | 法律 | NOT_RUMOR |
| 3 | 救命藥出現了？ 美治療首例確診病患 1天後退燒 實驗藥奏效\n\n世界日報\n2020/02... | 健康醫療 | RUMOR |
| 4 | 一月初我被幾個中國醫生朋友拉入中國人所成立的幫助武漢群組WeChat裡面，我在裡面什麼都沒幹... | 政治 | OPINIONATED |


- extracting features

```
# flagging each message: 1 if its text contains 'http' (a link), otherwise 0
links=[]
for text in df['text']:
  if re.search('http',text):
    result=1
  else:
    result=0
  links.append(result)

df['links']=links

# taking a look at the current dataframe
df.head()
```


| | text | title | type | links |
|---|---|---|---|---|
| 0 | https://tw.news.yahoo.com/%E6%A0%B8%E5%BB%A2%E... | 政治 | OPINIONATED | 1 |
| 1 | 昨天6/15時鐘說,政府買進居\n家快篩劑一份一千元,需要的\n民眾可以洽購。\n今天6/1... | 法律 | RUMOR | 0 |
| 2 | 桃園市議會 通過112年度預算 明年起即可全面適用\n☝️自111學年度下學期起，全面推動... | 法律 | NOT_RUMOR | 0 |
| 3 | 救命藥出現了？ 美治療首例確診病患 1天後退燒 實驗藥奏效\n\n世界日報\n2020/02... | 健康醫療 | RUMOR | 1 |
| 4 | 一月初我被幾個中國醫生朋友拉入中國人所成立的幫助武漢群組WeChat裡面，我在裡面什麼都沒幹... | 政治 | OPINIONATED | 0 |


- encoding data
  - The model cannot work with text categories directly, so we encode them as integers. `LabelEncoder()` is one common approach.

```
# encoding the labels (the `type` column)
labels_label_encoder = LabelEncoder()
df["type_encoded"] = labels_label_encoder.fit_transform(df["type"])
# encoding the feature (the `title` column)
features_label_encoder = LabelEncoder()
df["title_encoded"] = features_label_encoder.fit_transform(df["title"])
df.head()
```

| | text | title | type | links | type_encoded | title_encoded |
|---|---|---|---|---|---|---|
| 0 | https://tw.news.yahoo.com/%E6%A0%B8%E5%BB%A2%E... | 政治 | OPINIONATED | 1 | 1 | 5 |
| 1 | 昨天6/15時鐘說,政府買進居\n家快篩劑一份一千元,需要的\n民眾可以洽購。\n今天6/1... | 法律 | RUMOR | 0 | 2 | 6 |
| 2 | 桃園市議會 通過112年度預算 明年起即可全面適用\n☝️自111學年度下學期起，全面推動... | 法律 | NOT_RUMOR | 0 | 0 | 6 |
| 3 | 救命藥出現了？ 美治療首例確診病患 1天後退燒 實驗藥奏效\n\n世界日報\n2020/02... | 健康醫療 | RUMOR | 1 | 2 | 1 |
| 4 | 一月初我被幾個中國醫生朋友拉入中國人所成立的幫助武漢群組WeChat裡面，我在裡面什麼都沒幹... | 政治 | OPINIONATED | 0 | 1 | 5 |
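
To make the integer codes above less opaque, here is a minimal pure-Python sketch of what label encoding amounts to, assuming (as `LabelEncoder` does) that the distinct values are sorted before being numbered. The helper `encode_labels` is hypothetical and only for illustration; it is not part of scikit-learn.

```python
def encode_labels(values):
    """Map each distinct value to an integer by numbering the sorted distinct values from 0."""
    classes = sorted(set(values))
    mapping = {value: index for index, value in enumerate(classes)}
    return [mapping[value] for value in values], classes

# the `type` values of the five rows shown above
types = ["OPINIONATED", "RUMOR", "NOT_RUMOR", "RUMOR", "OPINIONATED"]
codes, classes = encode_labels(types)
print(classes)  # ['NOT_RUMOR', 'OPINIONATED', 'RUMOR']
print(codes)    # [1, 2, 0, 2, 1]
```

The codes are arbitrary identifiers. That is harmless for the labels, but when such codes are used as a feature (as with `title_encoded`), a linear model treats them as ordered quantities.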


- separating the features and labels

```
features=df[['title_encoded','links']]
labels=df['type_encoded']
print('Feature:')
features.head()
# print('Labels:')
# labels.head()
```


Feature:


| | title_encoded | links |
|---|---|---|
| 0 | 5 | 1 |
| 1 | 6 | 0 |
| 2 | 6 | 0 |
| 3 | 1 | 1 |
| 4 | 5 | 0 |


- dividing them into the training set and the test set

```
feat_train, feat_test, label_train, label_test = train_test_split(features, labels, test_size=0.2, random_state=42)
```


- converting features and labels into tensors


```
# converting data into Tensor
feat_train_tensor = torch.tensor(feat_train.values, dtype=torch.float32)
label_train_tensor = torch.tensor(label_train.values, dtype=torch.long)

feat_test_tensor = torch.tensor(feat_test.values, dtype=torch.float32)
label_test_tensor = torch.tensor(label_test.values, dtype=torch.long)
```


- creating datasets for mini-batch training

```
# wrapping the feature and label tensors in a `TensorDataset`
dataset_train = TensorDataset(feat_train_tensor, label_train_tensor)
dataset_test = TensorDataset(feat_test_tensor, label_test_tensor)
# setting the batch size
batch_size = 32
# wrapping each `TensorDataset` in a `DataLoader` for mini-batch training;
# only the training data is shuffled
train_loader = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset_test, batch_size=batch_size, shuffle=False)
```
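
As a rough illustration of what `batch_size` controls, the sketch below counts the parameter updates in one pass over the training set. It assumes the 80/20 split of the 25,170 rows leaves 20,136 training rows and that the final partial batch is kept; `batches_per_epoch` is a hypothetical helper, not a PyTorch function.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Return (number of batches, size of the last batch), keeping the final partial batch."""
    num_batches = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_batches - 1) * batch_size
    return num_batches, last_batch

num_train = 25170 - 25170 // 5  # assumed training-set size after holding out 20%
for size in (16, 32, 64, 128):
    print(size, batches_per_epoch(num_train, size))
```

Under these assumptions, a batch size of 32 gives 630 updates per pass, the last one computed on only 8 samples. Smaller batches mean more updates per pass, each based on fewer samples.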


## Model training
Recall the two types of logistic regression:
- Binary Logistic Regression: for binary-class tasks
  - $ \hat C = \sigma(W \cdot F + b) $

    > $C$ = class; $F$ = features; $W$ = weights; $b$ = bias

- Multinomial Logistic Regression: for multi-class tasks
  - $ \hat C = \mathrm{softmax}(W \cdot F + b) $
    > $C$ = class; $F$ = features; $W$ = weights; $b$ = bias
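
The two activation functions above can be written out in a few lines of plain Python. This is only an illustrative sketch with made-up scores, not the PyTorch implementation; `sigmoid` and `softmax` here are hand-written helpers.

```python
import math

def sigmoid(z):
    """Logistic function: squashes a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))              # a score of 0 maps to probability 0.5
print(softmax([2.0, 1.0, 0.1]))  # the largest score gets the largest probability
```

For two classes the softmax reduces to the sigmoid of the difference between the two scores, which is why the binary case only needs a single output.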


- defining the models
  - binary logistic regression

  
```
class BinaryLogisticRegression(nn.Module):
    def __init__(self, feature_dim):
        """
        Simple Binary Logistic Regression model in PyTorch.
        Parameters:
            feature_dim (int): Number of input features.
        """
        super().__init__()
        self.linear = nn.Linear(in_features = feature_dim, # linear transformation W·F + b
                                out_features = 1)
        self.sigmoid = nn.Sigmoid()  # `nn.Sigmoid` is the logistic function, the activation of binary logistic regression

    def forward(self, x):
        """ Forward pass: Compute sigmoid activation """
        return self.sigmoid(self.linear(x))
```
  - multinomial logistic regression


```
class MultinomialLogisticRegression(nn.Module):
    def __init__(self, feature_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(feature_dim,
                                num_classes) # We need to specify the class number based on our dataset.

    def forward(self, x):
        """
        Return raw scores (logits). No softmax is applied here because
        `nn.CrossEntropyLoss` already applies log-softmax internally.
        """
        return self.linear(x)
```
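
To see why the model can return raw scores, the following sketch computes the cross-entropy of one sample directly from its logits, i.e. the negative log of the softmax probability of the true class. It is a hand-written illustration in plain Python with made-up scores; `cross_entropy_from_logits` is a hypothetical helper, and `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math

def cross_entropy_from_logits(logits, target):
    """-log(softmax(logits)[target]), computed with the log-sum-exp trick."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

# with three equal scores every class has probability 1/3, so the loss is log(3)
print(cross_entropy_from_logits([0.0, 0.0, 0.0], target=1))
# a high score for the correct class gives a small loss
print(cross_entropy_from_logits([0.0, 5.0, 0.0], target=1))
```

A loss that stays near log(3) ≈ 1.0986 on a three-class task therefore suggests the model is doing no better than guessing uniformly.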


- initialising the model

```
torch.manual_seed(4) # the random seed of initialisation
multinomial_model = MultinomialLogisticRegression(feature_dim=feat_train_tensor.shape[1],
                                                  num_classes=len(torch.unique(label_train_tensor))) #initialising
multinomial_criterion = nn.CrossEntropyLoss() # Cross Entropy loss function

# setting up the optimiser
learning_rate = 0.1 # learning rate
l2_lambda = 0.001    # lambda of l2 regularisation

multinomial_optimizer = optim.SGD(multinomial_model.parameters(), lr=learning_rate, weight_decay=l2_lambda)    
```
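
As a sketch of what `lr` and `weight_decay` do, a single SGD update of one weight can be written by hand. It assumes SGD without momentum, where weight decay adds `l2_lambda * weight` to the gradient before the step; `sgd_step` is a hypothetical helper and the weight and gradient values are made up.

```python
def sgd_step(weight, grad, learning_rate, l2_lambda):
    """One SGD update with L2 weight decay and no momentum."""
    return weight - learning_rate * (grad + l2_lambda * weight)

weight, grad = 1.0, 0.5  # made-up values for illustration
for learning_rate in (0.001, 0.01, 0.1):
    print(learning_rate, sgd_step(weight, grad, learning_rate, l2_lambda=0.001))
```

The learning rate scales how far each update moves the weight: for the same gradient, a rate of 0.1 moves it one hundred times further than a rate of 0.001.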


- training the model

  - the procedure for each batch
    1. Clearing previous gradients to prevent accumulation
    2. Performing the forward pass
    3. Calculating the loss
    4. Computing gradients for the current batch (backward pass)
    5. Updating model parameters using gradient descent

    After updating the parameters, we move on to the next batch and repeat the same steps, starting again with clearing the previous gradients.

```
multinomial_model.train()
for batch_feat, batch_label in train_loader:

    # We first cast the tensors to the data types the model and the loss expect.

    batch_feat = batch_feat.float()  # ensuring features are 32-bit floats
    batch_label = batch_label.long() # ensuring labels are long integers

    multinomial_optimizer.zero_grad() # clearing previous gradients to prevent accumulation
    multinomial_predictions = multinomial_model(batch_feat) # performing forward pass
    multinomial_loss = multinomial_criterion(multinomial_predictions, batch_label) # calculating loss

    multinomial_loss.backward() # computing gradients for the current batch (Backward Pass)
    multinomial_optimizer.step() # updating model parameters using gradient descent
```
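
The loop above makes a single pass over `train_loader`, i.e. one epoch. The sketch below spells out the same five steps by hand and repeats them for several epochs. It is a plain-Python illustration on made-up data for a one-feature binary model, not the PyTorch code used in this tutorial; `train`, `mean_loss` and `toy_data` are hypothetical names.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(data, w, b):
    """Mean binary cross-entropy of the model sigmoid(w * x + b) over the data."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(data, batch_size, learning_rate, epochs, seed=0):
    """Mini-batch gradient descent for one-feature binary logistic regression."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        rows = list(data)
        rng.shuffle(rows)                      # plays the role of shuffle=True
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            grad_w, grad_b = 0.0, 0.0          # 1. clear previous gradients
            for x, y in batch:
                p = sigmoid(w * x + b)         # 2. forward pass
                # 3-4. for binary cross-entropy, d(loss)/d(score) = p - y,
                #      averaged over the batch
                grad_w += (p - y) * x / len(batch)
                grad_b += (p - y) / len(batch)
            w -= learning_rate * grad_w        # 5. update the parameters
            b -= learning_rate * grad_b
        history.append(mean_loss(data, w, b))  # loss after each epoch
    return w, b, history

toy_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # made-up (feature, label) pairs
w, b, history = train(toy_data, batch_size=2, learning_rate=0.1, epochs=20)
print(history[0], history[-1])
```

Wrapping the batch loop in an outer epoch loop like this, and recording the loss after each epoch, is one common way to check whether a given batch size and learning rate are actually making progress.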


### What are `long()` and `float()` for?

- They cast a tensor to a particular data type: `long()` to long (64-bit) integers and `float()` to 32-bit floats.
  - 32-bit floats are widely used to represent features in machine learning.
  - Long integers are **required** for the class labels passed to the cross-entropy loss function.

In [None]:
# Creating a Float64 Tensor
tensor_float64 = torch.tensor([31.2345676543, 2.231, 0.3], dtype=torch.float64)

# Converting it into a Float32 Tensor
tensor_float32 = tensor_float64.float()
# Converting it into a Long Integer Tensor
tensor_long = tensor_float64.long()

# setting the number of decimal places displayed in PyTorch output
torch.set_printoptions(precision=8)

print("Float64 Tensor:", tensor_float64)
print("Float32 Tensor:", tensor_float32)
print("Long Integer Tensor:", tensor_long)



Float64 Tensor: tensor([31.23456765,  2.23100000,  0.30000000], dtype=torch.float64)
Float32 Tensor: tensor([31.23456764,  2.23099995,  0.30000001])
Long Integer Tensor: tensor([31,  2,  0])


## Predicting and evaluating


```
# predicting
multinomial_model.eval()
label_true = []
label_pred = []

with torch.no_grad():
    for batch_feat, batch_label in test_loader:
        batch_feat = batch_feat.float()
        batch_label = batch_label.long()
        test_predictions = multinomial_model(batch_feat)

        predicted_labels = torch.argmax(test_predictions, dim=1)

        label_true.extend(batch_label.cpu().numpy())
        label_pred.extend(predicted_labels.cpu().numpy())
# evaluating
accuracy = accuracy_score(label_true, label_pred)
precision = precision_score(label_true, label_pred, average='weighted')
recall = recall_score(label_true, label_pred, average='weighted')
f1 = f1_score(label_true, label_pred, average='weighted')

# printing out the results
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")


```



Test Accuracy: 0.3308
Precision: 0.1094, Recall: 0.3308, F1 Score: 0.1644


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
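
The numbers above have a recognisable pattern: the weighted precision (0.1094) is approximately the accuracy squared (0.3308² ≈ 0.1094), which is what one would see if the model predicted the same class for every test message. The `_warn_prf` fragment likely comes from scikit-learn's warning about classes that were never predicted. The sketch below reproduces that pattern on made-up labels with hand-written, support-weighted metrics; `weighted_scores` is a hypothetical helper, not scikit-learn's implementation, and it counts an undefined precision as 0.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1; undefined ratios count as 0."""
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        prec = true_pos / predicted if predicted else 0.0
        rec = true_pos / count
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = count / len(y_true)
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return precision, recall, f1

y_true = [0, 1, 2] * 4  # made-up, perfectly balanced labels
y_pred = [0] * 12       # a model that always predicts class 0
print(weighted_scores(y_true, y_pred))
```

With three balanced classes, this always-one-class predictor gets accuracy 1/3, weighted precision 1/9 ≈ 0.11 and weighted F1 1/6 ≈ 0.17, close to the figures reported above. Checking how many distinct values appear in `label_pred` would confirm whether that is what happened here.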


# Assignment

Please create a new `ipynb` file for your assignment. Don't forget to include your name and related information at the top of your code.

1. Train other models to compare with the model used in class:
  - Try more than three different batch sizes. (25\%)
  - Explain how they affect the evaluation results. (25\%)
2. Using the batch size of the best-performing model in Question 1, tune further parameters:
  - Train the model with different learning rates from 0.001 to 0.01. (15\%)
  - Plot the F-scores of the models trained with different learning rates. (10\%)
  > Hint: you can use the package `matplotlib.pyplot` to visualise the results. Select a suitable chart from [their official documentation](https://matplotlib.org/stable/gallery/index.html).
  - Explain how they affect the evaluation results. (15\%)

**Bonus:** Change the optimiser. You can find the available options on [this website](https://pytorch-cn.readthedocs.io/zh/latest/package_references/torch-optim/). Explain why you chose it.
