# Automated Website Categorization Using Machine Learning Algorithms (BERT)

This notebook processes the website content data and fine-tunes a BERT model to predict the category of each website.

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained Transformer encoder that represents each word using the words on both sides of it. Unlike bag-of-words models, which treat a word the same way wherever it appears, BERT produces context-dependent representations, which often improves accuracy on text classification tasks.

# Stages of the project
- Web Scraping: Extract textual content from websites using tools like BeautifulSoup and Selenium.
- Data Preprocessing: Prepare and clean the text data for input into the model.
- Modeling: Decision Tree, Regression Tree, BERT
- Output Results: Evaluate the model performance.

# Model implementation in this file
The BERT model is implemented using the Hugging Face Transformers library. The following steps are performed:
1. Prepare Data
2. Preprocessing
3. Modeling & Fine-tuning
4. Evaluation using different metrics (e.g. accuracy, precision, recall)

Verizon, Group 41
<br>Athena Bai, Tia Zheng, Kathy Yang, Tapuwa Kabaira, Chris Smith

Last updated: Dec. 1, 2024

In [1]:
# Install Scikit-learn for evaluation metrics
!pip install scikit-learn



In [1]:
import sys
!{sys.executable} -m pip install torch



In [None]:
import sys
!{sys.executable} -m pip install transformers

In [1]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # the AdamW class in transformers is deprecated and missing from recent releases
import torch
from torch.utils.data import Dataset, DataLoader, random_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
import os


In [2]:
import pandas as pd

In [12]:
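# Load and preprocess data
# Assumes the 'text_content' column holds the scraped website text.
# A plain pd.read_csv('df_text.csv') fails with
#   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 7535
# so the file is not valid UTF-8 and an encoding is passed explicitly.
# 'latin-1' is an assumption: it decodes every byte without error, but
# replace it with the file's real encoding (e.g. 'cp1252') if that is known.
data = pd.read_csv('df_text.csv', encoding='latin-1')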

In [None]:
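# Drop rows with no website text; `subset` names the column to check
# (`axis` only accepts 0/1 or 'index'/'columns', so axis='text_content' raises an error)
data = data.dropna(subset=['text_content'])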

In [None]:
data.columns.values

In [None]:
# Encode the target labels
label_encoder = LabelEncoder()
data['category'] = label_encoder.fit_transform(data['category'])

# 3. Custom Dataset Class for BERT
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors="pt"
        )
        return {key: val.squeeze(0) for key, val in encoding.items()}, torch.tensor(label, dtype=torch.long)

Here’s an explanation of what each part of this code is doing:

### 1. **Label Encoding**
```python
# Encode the target labels
label_encoder = LabelEncoder()
data['category'] = label_encoder.fit_transform(data['category'])
```
- **Purpose**: This part of the code is encoding the target labels (categories) as numerical values.
- **Explanation**:
  - `LabelEncoder` is a utility from `sklearn` that converts categorical labels (text) into numerical form, which is required for model training.
  - `fit_transform` assigns a unique integer to each category in `data['category']`. Classes are numbered in sorted order, so the labels ["Sports", "News", "Technology"] would be encoded as `[1, 0, 2]` (News is 0, Sports is 1, Technology is 2).
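
To make the numbering concrete, here is a plain-Python sketch that mimics what `LabelEncoder` does: sort the distinct labels, then map each label to its position. The category values and the names `classes`, `class_to_id`, `encoded`, and `decoded` are illustrative assumptions, not part of the notebook.

```python
# Illustrative labels; in the notebook they come from data['category'].
categories = ["Sports", "News", "Technology", "News"]

# LabelEncoder keeps the distinct classes in sorted order (its classes_ attribute).
classes = sorted(set(categories))

# Each class is mapped to its index in that sorted list.
class_to_id = {name: idx for idx, name in enumerate(classes)}
encoded = [class_to_id[name] for name in categories]

# Decoding is a lookup back into the sorted class list,
# which is the job inverse_transform performs.
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
```

Keeping the fitted encoder around matters later: `label_encoder.classes_` supplies the readable category names for the classification report.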

---

### 2. **Custom Dataset Class for BERT**
```python
# 3. Custom Dataset Class for BERT
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors="pt"
        )
        return {key: val.squeeze(0) for key, val in encoding.items()}, torch.tensor(label, dtype=torch.long)
```
- **Purpose**: This is a custom dataset class for preparing the text data in a format that BERT can understand. It helps in managing how data is fed to the model during training.
- **Explanation**:
  - `__init__(self, texts, labels, tokenizer, max_len=512)`: The constructor initializes the class with:
    - `texts`: The list of text content (website content).
    - `labels`: The list of numerical category labels (encoded in the previous step).
    - `tokenizer`: The BERT tokenizer to convert text to tokens.
    - `max_len`: The maximum length of each tokenized sequence.
  - `__len__(self)`: Returns the number of samples in the dataset.
  - `__getitem__(self, idx)`: This method retrieves an individual sample (text and label) and prepares it for BERT:
    - `text = self.texts[idx]` and `label = self.labels[idx]`: Retrieve the text and corresponding label at index `idx`.
    - `encoding = self.tokenizer(...)`: Tokenizes the text using BERT’s tokenizer with specified parameters:
      - **`truncation=True`**: Truncates the text if it exceeds `max_len`.
      - **`padding='max_length'`**: Pads the text to `max_len`.
      - **`max_length=self.max_len`**: Limits the tokenized output to the specified `max_len`.
      - **`return_tensors="pt"`**: Returns the encoding as PyTorch tensors, which is the required format for BERT.
    - `return {key: val.squeeze(0) for key, val in encoding.items()}, torch.tensor(label, dtype=torch.long)`: Returns a dictionary of the tokenized text (as PyTorch tensors) and the label as a tensor. With `return_tensors="pt"` the tokenizer returns tensors of shape `(1, max_len)`; `squeeze(0)` drops that leading batch dimension so each sample has shape `(max_len,)` and the `DataLoader` can stack samples into `(batch_size, max_len)` batches.
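
The effect of `truncation=True` together with `padding='max_length'` can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the helper `pad_or_truncate` and the token IDs are made up, and the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]              # truncation: drop tokens past max_len
    n_pad = max_len - len(kept)             # padding needed to reach max_len
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets cut.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded positions, which is why the dataset returns every key of the encoding rather than only `input_ids`.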

---

### Summary
- **Label Encoding**: Converts categorical labels into numerical format required by the model.
- **Custom Dataset Class**: Prepares each sample by tokenizing the text and converting both text and labels to PyTorch tensors, making it ready for BERT’s input format. This class is essential for managing and loading data in a way compatible with BERT during training.

In [None]:
# Tokenizer and Dataset Preparation
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
dataset = TextDataset(data['text_content'].tolist(), data['category'].tolist(), tokenizer)

# Train-Test Split
train_size = int(0.8 * len(dataset))
train_dataset, test_dataset = random_split(dataset, [train_size, len(dataset) - train_size])

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

Here’s a detailed explanation of what each part of this code is doing:

---

### 1. **Tokenizer and Dataset Preparation**
```python
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
dataset = TextDataset(data['text_content'].tolist(), data['category'].tolist(), tokenizer)
```

#### **BERT Tokenizer Initialization**
- **`BertTokenizer.from_pretrained('bert-base-uncased')`**:
  - This line loads a pretrained tokenizer for the **BERT base model** with an **uncased vocabulary** (all words are converted to lowercase).
  - **Purpose**: The tokenizer splits input text into tokens that correspond to BERT's vocabulary, adds special tokens like `[CLS]` and `[SEP]`, and converts tokens to their corresponding IDs.

#### **Dataset Initialization**
- **`TextDataset` Class**:
  - Initializes a dataset object that prepares text samples and labels for BERT.
  - **Inputs**:
    - `data['text_content'].tolist()`: A list of website text content.
    - `data['category'].tolist()`: A list of encoded category labels.
    - `tokenizer`: The initialized BERT tokenizer.
  - **Purpose**: Wraps the raw texts and labels so that each sample can be tokenized and paired with its label. Tokenization is lazy: it runs inside `__getitem__` each time a sample is requested, not when the dataset object is constructed.

---

### 2. **Train-Test Split**
```python
train_size = int(0.8 * len(dataset))
train_dataset, test_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
```

#### **Train-Test Split Logic**
- **`train_size = int(0.8 * len(dataset))`**:
  - Calculates the size of the training dataset as **80%** of the total dataset.
- **`random_split(dataset, [train_size, len(dataset) - train_size])`**:
  - Splits the full dataset into two subsets:
    - **`train_dataset`**: Contains 80% of the data.
    - **`test_dataset`**: Contains the remaining 20%.
  - **Purpose**: Ensures the model is trained on one subset (training set) and evaluated on an unseen subset (test set) to measure its generalization performance.
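
Unless it is seeded, `random_split` draws a new permutation on every run, so the test set changes between runs. The plain-Python sketch below shows the same 80/20 index split made repeatable with a fixed seed. The helper `split_indices` and the seed value are hypothetical. In the notebook itself, a comparable effect should be available by passing `generator=torch.Generator().manual_seed(...)` to `random_split`.

```python
import random

def split_indices(n_samples, train_fraction=0.8, seed=42):
    """Shuffle range(n_samples) reproducibly and cut it into train/test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # seeded, so the split is repeatable
    train_size = int(train_fraction * n_samples)  # same rounding as the notebook
    return indices[:train_size], indices[train_size:]

train_idx, test_idx = split_indices(103)
print(len(train_idx), len(test_idx))
```

Because `int()` rounds down, any leftover sample goes to the test set, which matches the `len(dataset) - train_size` expression above.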

---

### 3. **DataLoader Initialization**
```python
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
```

#### **`DataLoader` Explanation**
- **`DataLoader`**: Efficiently loads batches of data for training and testing.
  - **Inputs**:
    - `train_dataset` / `test_dataset`: The respective subsets of data.
    - `batch_size=16`: The number of samples per batch.
    - `shuffle=True` (for `train_loader`): Randomly shuffles the training data at the beginning of each epoch to improve model training.
    - `shuffle=False` (for `test_loader`): Keeps the test data in the same order for consistent evaluation.
  - **Purpose**: Loads data in smaller, manageable batches for processing by BERT. This improves memory usage and speeds up training and evaluation.
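
As a rough picture of what batching means, the sketch below slices a list into batches of 16 in plain Python. The generator `iter_batches` is a hypothetical stand-in for `DataLoader`; it leaves out shuffling and tensor collation.

```python
import math

def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(50))  # stand-in for 50 dataset items
batches = list(iter_batches(samples, batch_size=16))

print([len(b) for b in batches])         # sizes of the individual batches
print(math.ceil(len(samples) / 16))      # number of batches per epoch
```

The last batch is smaller when the dataset size is not a multiple of the batch size. `DataLoader` behaves the same way by default (`drop_last=False`).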

---

### Summary
1. **Tokenizer**: Converts text into token IDs that BERT understands.
2. **Dataset**: Prepares the data (tokenized text + labels) in a format compatible with BERT.
3. **Train-Test Split**: Divides the dataset into training and testing sets for evaluation.
4. **DataLoader**: Loads data in batches, enabling efficient training and testing while managing memory and computational resources.

In [None]:
# 4. Load BERT and Train
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_encoder.classes_))
optimizer = AdamW(model.parameters(), lr=1e-5)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Training Loop
epochs = 3
for epoch in range(epochs):
    model.train()
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        inputs, labels = batch
        inputs = {k: v.to(device) for k, v in inputs.items()}
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Here’s a detailed explanation of this code:

---

### 1. **Load Pretrained BERT Model**
```python
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_encoder.classes_))
```
- **`BertForSequenceClassification`**:
  - A specific variant of BERT designed for sequence classification tasks.
  - **`from_pretrained('bert-base-uncased')`**: Loads a pretrained BERT model (base version with an uncased vocabulary).
  - **`num_labels=len(label_encoder.classes_)`**: Specifies the number of output labels for the classification task based on the number of unique categories in the dataset.
  - **Purpose**: Initializes a model capable of classifying website content into one of the predefined categories.

---

### 2. **Set Up Optimizer**
```python
optimizer = AdamW(model.parameters(), lr=1e-5)
```
- **AdamW Optimizer**:
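  - A variant of the Adam optimizer that applies weight decay directly to the weights, decoupled from the gradient-based update, which tends to regularize better than adding an L2 penalty to the loss.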
  - **`model.parameters()`**: Passes the model’s parameters to the optimizer for updating during training.
  - **`lr=1e-5`**: Sets a small learning rate to ensure gradual updates, crucial for fine-tuning BERT without disrupting its pretrained knowledge.
  - **Purpose**: Optimizes the model’s weights to minimize the loss during training.
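
For intuition, one AdamW update on a single scalar parameter can be written out by hand. This is a minimal sketch of the standard update rule, not the library's code: the function `adamw_step` is hypothetical, and its default hyperparameters are assumed, commonly used values.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; returns (theta, m, v)."""
    theta = theta * (1 - lr * weight_decay)   # decoupled weight decay on the weight itself
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step (t=1) with a large learning rate so the change is visible.
theta, m, v = adamw_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1, lr=0.1)
print(theta)
```

The update is normalized by the gradient's running magnitude, so the step size is governed mainly by `lr`. That is one reason a small value such as `1e-5` keeps fine-tuning close to the pretrained weights.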

---

### 3. **Device Configuration**
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
```
- **`torch.device`**:
  - Checks if a GPU is available (via CUDA).
  - If available, it sets the computation to run on the GPU; otherwise, it defaults to the CPU.
- **`model.to(device)`**:
  - Moves the model's parameters and buffers to the specified device. Input tensors must be moved to the same device before each forward pass.
  - **Purpose**: Lets training run on the GPU when one is available, which is much faster for large models like BERT.

---

### 4. **Training Loop**
```python
epochs = 3
for epoch in range(epochs):
    model.train()
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        inputs, labels = batch
        inputs = {k: v.to(device) for k, v in inputs.items()}
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
```

#### **Breaking Down the Training Loop**:
1. **`epochs = 3`**:
   - Specifies the number of times the model will pass through the entire training dataset.

2. **`model.train()`**:
   - Sets the model to training mode, which enables dropout. It does not update any weights by itself; that happens in `optimizer.step()`.

3. **`tqdm` Loop**:
   - Wraps the training loop with a progress bar to monitor training progress and loss in real time.

4. **Batch Processing**:
   - **`for batch in loop`**: Iterates through batches of training data.
   - **`inputs, labels = batch`**: Separates input features and labels for each batch.
   - **`inputs = {k: v.to(device)}`**: Moves tokenized inputs to the appropriate device (CPU/GPU).
   - **`labels = labels.to(device)`**: Moves labels to the same device.

5. **Forward and Backward Pass**:
   - **`optimizer.zero_grad()`**: Clears gradients from the previous step to prevent accumulation.
   - **`outputs = model(**inputs, labels=labels)`**: Passes inputs through the model and computes the loss.
   - **`loss.backward()`**: Computes gradients for all trainable parameters.
   - **`optimizer.step()`**: Updates model weights based on computed gradients.

6. **Progress Tracking**:
   - **`loop.set_description()`**: Displays the current epoch.
   - **`loop.set_postfix()`**: Displays the current batch loss.
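
The same zero-gradient, forward, backward, step pattern can be seen on a toy one-parameter model in plain Python, with the gradient derived by hand instead of by autograd. The data, learning rate, and step count here are arbitrary assumptions chosen for illustration.

```python
# Toy data generated by y = 2 * x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0    # the single trainable parameter
lr = 0.05  # learning rate

for step in range(200):
    grad = 0.0                                   # reset, like optimizer.zero_grad()
    preds = [w * x for x in xs]                  # forward pass
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)  # mean squared error
    # Hand-derived d(loss)/dw, the quantity loss.backward() would compute.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    w -= lr * grad                               # parameter update, like optimizer.step()

print(round(w, 4), round(loss, 8))
```

In the BERT loop the structure is identical; only the model, the loss, and the optimizer are far larger.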

---

### Summary
- **Model Loading**: Initializes a pretrained BERT model tailored for classification.
- **Optimizer**: Fine-tunes BERT’s weights using AdamW.
- **Device Usage**: Leverages GPU (if available) for faster training.
- **Training Loop**: Iteratively trains the model, optimizing it to reduce loss while displaying progress and performance metrics.

In [None]:
# 5. Evaluation
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in test_loader:
        inputs, labels = batch
        inputs = {k: v.to(device) for k, v in inputs.items()}
        labels = labels.to(device)
        outputs = model(**inputs)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Generate Classification Report
print(classification_report(all_labels, all_preds, target_names=label_encoder.classes_))
print("Accuracy:", accuracy_score(all_labels, all_preds))

### Model Performance Evaluation Metrics

When evaluating a classification model like BERT for website categorization, several metrics are used to assess its performance. Here's an explanation of the key metrics:

---

### 1. **Accuracy**
   **Formula**:
   \[
   \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
   \]
   - **What It Measures**: The proportion of correctly classified instances out of all instances.
   - **Strengths**: Provides a quick overview of model performance.
   - **Limitations**: Not reliable for imbalanced datasets (where some categories are much more frequent than others).
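
A small made-up example shows why accuracy can mislead on imbalanced data: a model that always predicts the majority category scores well on accuracy while never finding the minority one. The counts and category names below are hypothetical.

```python
# 95 "News" sites and 5 "Health" sites (made-up counts).
y_true = ["News"] * 95 + ["Health"] * 5
# A degenerate model that always predicts the majority class.
y_pred = ["News"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for "Health": correctly found Health sites / all actual Health sites.
health_found = sum(t == p == "Health" for t, p in zip(y_true, y_pred))
health_recall = health_found / y_true.count("Health")

print(accuracy, health_recall)
```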

---

### 2. **Precision**
   **Formula**:
   \[
   \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
   \]
   - **What It Measures**: Among the instances predicted as a particular category, how many actually belong to that category.
   - **Strengths**: Useful when the cost of false positives is high (e.g., misclassifying a category with strict rules).
   - **Example**: If predicting "News" category, precision indicates how many of the predicted "News" websites are truly "News".

---

### 3. **Recall (Sensitivity)**
   **Formula**:
   \[
   \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
   \]
   - **What It Measures**: Among all actual instances of a category, how many were correctly predicted.
   - **Strengths**: Useful when the cost of false negatives is high (e.g., missing critical categories).
   - **Example**: For "Health" websites, recall tells how many actual "Health" websites were identified by the model.

---

### 4. **F1-Score**
   **Formula**:
   \[
   \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
   \]
   - **What It Measures**: The harmonic mean of precision and recall.
   - **Strengths**: Balances the trade-off between precision and recall, especially when dealing with imbalanced datasets.
   - **Range**: F1-score ranges between 0 (worst) and 1 (best).
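
The precision, recall, and F1 formulas can be checked by hand with a short helper. The function `precision_recall_f1` and the counts are hypothetical. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the three metrics for one category from its raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for one category: 8 correct hits, 2 false alarms, 8 misses.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 4))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a simple average would.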

---

### 5. **Confusion Matrix**
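   - **What It Shows**: A table of counts with one row per actual category and one column per predicted category (the scikit-learn convention). For any single category, the table yields:
     - **True Positives (TP)**: Instances of the category that were predicted as that category (the diagonal cell).
     - **False Positives (FP)**: Instances of other categories incorrectly predicted as this category (the rest of its column).
     - **False Negatives (FN)**: Instances of this category predicted as something else (the rest of its row).
     - **True Negatives (TN)**: Instances of other categories that were not predicted as this category (all remaining cells).
   - **Purpose**: Shows which categories the model confuses with which, not just how often it is wrong.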
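
The notebook does not compute a confusion matrix, so here is a minimal plain-Python sketch of how one is built and read. The helper `confusion_matrix_counts`, the labels, and the predictions are hypothetical; scikit-learn's `confusion_matrix` uses the same rows-are-actual, columns-are-predicted layout.

```python
def confusion_matrix_counts(y_true, y_pred, labels):
    """Rows are actual labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

labels = ["Health", "News", "Sports"]
y_true = ["News", "News", "Health", "Sports", "Sports", "Health"]
y_pred = ["News", "Health", "Health", "Sports", "News", "Health"]
matrix = confusion_matrix_counts(y_true, y_pred, labels)

# Per-category counts read off the matrix for "News".
k = labels.index("News")
tp = matrix[k][k]
fn = sum(matrix[k]) - tp                  # actual News predicted as something else
fp = sum(row[k] for row in matrix) - tp   # other categories predicted as News
print(matrix, tp, fp, fn)
```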

---

### 6. **Macro vs. Micro Metrics**
   - **Macro-Averaged Metrics**:
     - Treats all categories equally by calculating the metric independently for each category and averaging them.
     - Useful for evaluating overall performance across categories, regardless of class imbalance.
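   - **Micro-Averaged Metrics**:
     - Pools the true positives, false positives, and false negatives of all classes, then computes the metric once from the pooled counts.
     - Dominated by the most frequent classes, because every sample counts equally. For single-label multi-class problems such as this one, micro-averaged precision, recall, and F1 all equal accuracy.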
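
The difference between the two averages shows up clearly with one frequent and one rare category. The per-class counts below are made up for illustration.

```python
# Hypothetical per-class counts as (TP, FP, FN): one frequent class, one rare class.
counts = {
    "News": (90, 5, 10),
    "Health": (2, 10, 5),
}

# Macro: compute recall per class, then take the unweighted mean.
per_class_recall = {c: tp / (tp + fn) for c, (tp, fp, fn) in counts.items()}
macro_recall = sum(per_class_recall.values()) / len(per_class_recall)

# Micro: pool the counts across classes first, then compute recall once.
total_tp = sum(tp for tp, fp, fn in counts.values())
total_fn = sum(fn for tp, fp, fn in counts.values())
micro_recall = total_tp / (total_tp + total_fn)

print(round(macro_recall, 3), round(micro_recall, 3))
```

The macro average is pulled down by the poorly handled rare class, while the micro average mostly reflects the frequent one.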

---

### Use Cases of Each Metric
- **Accuracy**: Best for balanced datasets with equal class distribution.
- **Precision**: Critical when false positives are costly.
- **Recall**: Important when false negatives are costly.
- **F1-Score**: Ideal for imbalanced datasets to balance false positives and negatives.

---

### Example from Model Evaluation
After training, these metrics are calculated using the model's predictions on the test set. In Python, they are computed using:

```python
from sklearn.metrics import classification_report, accuracy_score

# all_labels and all_preds are the lists collected in the evaluation loop above
print(classification_report(all_labels, all_preds, target_names=label_encoder.classes_))
print("Accuracy:", accuracy_score(all_labels, all_preds))
```

This generates a detailed breakdown of performance per category, helping to identify strengths and weaknesses of the model.