# Automated Website Categorization Using Machine Learning Algorithms (BERT)

This notebook processes the scraped website content data and fine-tunes a BERT model to predict each website's category.

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained NLP model that builds contextual representations of language. Unlike bag-of-words models, BERT represents each word in light of the words around it, which often improves accuracy on text classification tasks.

## Stages of the project
- Web Scraping: Extract textual content from websites using tools like BeautifulSoup and Selenium.
- Data Preprocessing: Prepare and clean the text data for input into the model.
- Modeling: Decision Tree, Regression Tree, BERT
- Output Results: Evaluate the model performance.

## Model implementation in this file
The BERT model is implemented using the Hugging Face Transformers library. The following steps are performed:
1. Prepare Data
2. Preprocessing
3. Modeling & Fine-tuning
4. Evaluation using different metrics (e.g. accuracy, precision, recall)
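
To make the evaluation metrics in step 4 concrete, here is a minimal pure-Python sketch of how accuracy and per-class precision and recall can be computed from true and predicted labels. The function name `per_class_metrics` and the toy labels are illustrative assumptions; the notebook itself relies on scikit-learn for these numbers.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and a {label: (precision, recall)} dict."""
    true_count = Counter(y_true)   # how often each class really occurs
    pred_count = Counter(y_pred)   # how often each class is predicted
    correct = Counter()            # correct predictions per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct[t] += 1

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # Precision: of everything predicted as `label`, how much was right.
        precision = correct[label] / pred_count[label] if pred_count[label] else 0.0
        # Recall: of everything that truly is `label`, how much was found.
        recall = correct[label] / true_count[label] if true_count[label] else 0.0
        metrics[label] = (precision, recall)

    accuracy = sum(correct.values()) / len(y_true)
    return accuracy, metrics


# Toy labels, made up for illustration only.
y_true = ["News", "News", "Shopping", "Travel"]
y_pred = ["News", "Shopping", "Shopping", "Shopping"]
accuracy, metrics = per_class_metrics(y_true, y_pred)
```

A class that is never predicted gets precision 0.0 here, which mirrors how a report would show 0.00 for such a class.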

Verizon, Group 41
<br>Athena Bai, Tia Zheng, Kathy Yang, Tapuwa Kabaira, Chris Smith

Last updated: Dec. 1, 2024

## 0. Package preparation (optional)

In [1]:
# Install Scikit-learn for evaluation metrics
!pip install scikit-learn





In [2]:
import sys
!{sys.executable} -m pip install torch





In [3]:
import sys
!{sys.executable} -m pip install transformers








## 1. Prepare Data

In [1]:
import os
import pandas as pd
import numpy as np

# Modeling
from transformers import BertTokenizer, BertForSequenceClassification
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from torch.utils.data import Dataset, DataLoader, random_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm



In [2]:
# Set environment variables so that the caches use the D: drive instead of the C: drive.
# Note: HF_HOME is read when transformers is first imported, so this cell should run
# before the import cell (or after a kernel restart) for the setting to take effect.
# Older alternatives, kept for reference:
# os.environ["TRANSFORMERS_CACHE"] = "D:/huggingface_cache"
# os.environ["TORCH_HOME"] = "D:/torch_cache"

import os
import tempfile

# Set Hugging Face cache directory
os.environ['HF_HOME'] = 'D:\\huggingface_cache'  # Or your preferred folder

# Set temporary directory
tempfile.tempdir = 'D:\\temp'
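
The cache setup above can fail quietly if the target folders do not exist yet, because `tempfile` raises an error when its directory is missing. The sketch below wraps the same two settings in a helper that creates the folders first. The name `configure_cache` and the subfolder names are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile


def configure_cache(base_dir):
    """Point the Hugging Face cache and temp files at folders under base_dir.

    Hypothetical helper: call it before importing transformers.
    """
    hf_cache = os.path.join(base_dir, "huggingface_cache")
    tmp_dir = os.path.join(base_dir, "temp")
    # Create both folders up front so later writes do not fail.
    os.makedirs(hf_cache, exist_ok=True)
    os.makedirs(tmp_dir, exist_ok=True)
    os.environ["HF_HOME"] = hf_cache
    tempfile.tempdir = tmp_dir
    return hf_cache, tmp_dir


# Example call (commented out because the drive letter is machine-specific):
# configure_cache("D:\\")
```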

In [3]:
data = pd.read_csv('data_from_check.csv', header=0)

In [4]:
list(data.columns.values)

['url',
 'category',
 'text_content',
 'Text_Length',
 'text_cleaned',
 'Sentiment',
 'lexical_diversity']

In [5]:
# Encode the target labels
label_encoder = LabelEncoder()
data['category'] = label_encoder.fit_transform(data['category'])
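
The encoder assigns a separate id to every distinct category string, so near-duplicate names such as 'Food & Drink', 'Food and Beverage', 'Food and Dining' and 'Food and Drink' (all present in this dataset's label list) become four different classes. One possible cleanup, sketched below, is to map such variants to a single canonical name before encoding. The mapping itself is an assumption for illustration; which categories should actually be merged is a project decision.

```python
# Hypothetical mapping from near-duplicate category names to one canonical name.
CANONICAL = {
    "Food & Drink": "Food and Drink",
    "Food and Beverage": "Food and Drink",
    "Food and Dining": "Food and Drink",
}


def normalize_category(name):
    """Strip stray whitespace and collapse known variants of a category name."""
    name = name.strip()
    return CANONICAL.get(name, name)


# Toy input for illustration.
raw = ["Food & Drink", "Food and Dining", " News ", "Shopping"]
normalized = [normalize_category(c) for c in raw]
```

With pandas, the same function could be applied as `data['category'].map(normalize_category)` before calling `fit_transform`.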

In [6]:
data['text_content'] = data['text_content'].astype(str)

In [7]:
# Custom Dataset Class for BERT
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors="pt"
        )
        return {key: val.squeeze(0) for key, val in encoding.items()}, torch.tensor(label, dtype=torch.long)

In [8]:
# Tokenizer and Dataset Preparation
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
dataset = TextDataset(data['text_content'].tolist(), data['category'].tolist(), tokenizer)

# Train-Test Split
train_size = int(0.8 * len(dataset))
train_dataset, test_dataset = random_split(dataset, [train_size, len(dataset) - train_size])

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
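
`random_split` shuffles all examples together, so with many small categories a class can land entirely in the test set and never be seen during training. A stratified split is one alternative. The sketch below is a pure-Python version under stated assumptions: the helper name `stratified_split`, the fixed seed, and the rule of keeping at least one example per class in the training set are all illustrative choices, not the notebook's method.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Keep at least one example of every class in the training set.
        n_train = max(1, round(train_frac * len(idxs)))
        train_idx.extend(idxs[:n_train])
        test_idx.extend(idxs[n_train:])
    return train_idx, test_idx


# Toy labels: one large, one medium and one single-example class.
labels = [0] * 10 + [1] * 5 + [2]
train_idx, test_idx = stratified_split(labels)
```

The resulting index lists could then be wrapped with `torch.utils.data.Subset(dataset, train_idx)` and `torch.utils.data.Subset(dataset, test_idx)` in place of the `random_split` outputs.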

In [17]:
# Load BERT and Train
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_encoder.classes_))
optimizer = AdamW(model.parameters(), lr=1e-5)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Training Loop
epochs = 3
for epoch in range(epochs):
    model.train()
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        inputs, labels = batch
        inputs = {k: v.to(device) for k, v in inputs.items()}
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 37/37 [37:55<00:00, 61.50s/it, loss=3.54]
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 37/37 [50:13<00:00, 81.45s/it, loss=3.01]
Epoch 2: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 37/37 [54:20<00:00, 88.12s/it, loss=2.78]
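
The loss falls only slowly over the three epochs, and the evaluation output further down shows predictions for just two classes. One common response to heavily imbalanced categories is to weight the loss by inverse class frequency. The sketch below computes such weights in plain Python; the function name and the choice to give weight 0.0 to classes with no training examples are assumptions for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for c in range(num_classes):
        if counts[c]:
            weights.append(total / (num_classes * counts[c]))
        else:
            # Class absent from the training data: no examples to weight.
            weights.append(0.0)
    return weights


# Toy labels: class 0 is common, class 1 is rare, class 2 is absent.
weights = inverse_frequency_weights([0, 0, 0, 1], num_classes=3)
```

In the training loop, these weights could be passed as a float tensor to `torch.nn.CrossEntropyLoss(weight=...)` and applied to `outputs.logits` instead of using `outputs.loss`. Whether this helps on this dataset has not been tested here.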


In [18]:
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in test_loader:
        inputs, labels = batch
        inputs = {k: v.to(device) for k, v in inputs.items()}
        labels = labels.to(device)
        outputs = model(**inputs)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
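
Before building a full report, it can help to look at how the collected predictions are distributed, since a model that has collapsed onto a few classes is easy to spot this way. The sketch below is a small helper for that check; `prediction_distribution` and the toy inputs are hypothetical names and data, and in the notebook it would be called with `all_preds` and `label_encoder.classes_`.

```python
from collections import Counter


def prediction_distribution(preds, class_names):
    """Return [(class_name, count)] sorted by count, most frequent first."""
    counts = Counter(int(p) for p in preds)
    return [(class_names[i], n) for i, n in counts.most_common()]


# Toy data for illustration only.
class_names = ["Music", "News", "Shopping"]
toy_preds = [2, 2, 1, 2, 1, 2]
distribution = prediction_distribution(toy_preds, class_names)
```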

In [19]:
# Fix for the error "Number of classes, 34, does not match size of
# target_names, 45. Try specifying the labels parameter": the test split
# contains only 34 of the 45 encoded classes, so report only those labels.
print("Unique classes in all_labels:", set(all_labels))
print("Unique classes in all_preds:", set(all_preds))
print("Classes in label_encoder:", label_encoder.classes_)
print("Number of classes in label_encoder:", len(label_encoder.classes_))

# Generate Classification Report
unique_labels = sorted(set(all_labels))
print(classification_report(
    all_labels,
    all_preds,
    labels=unique_labels,
    target_names=[label_encoder.classes_[i] for i in unique_labels],
    zero_division=0,  # report 0.00 without warnings for classes that are never predicted
))
print("Accuracy:", accuracy_score(all_labels, all_preds))

Unique classes in all_labels: {0, 1, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 31, 32, 33, 34, 36, 39, 40, 41}
Unique classes in all_preds: {32, 22}
Classes in label_encoder: ['Business and Economy' 'Computer and Internet Info'
 'Content Delivery Networks' 'Dating' 'Educational Institutions'
 'Entertainment and Arts' 'Financial Services' 'Food & Drink'
 'Food and Beverage' 'Food and Dining' 'Food and Drink' 'Gambling' 'Games'
 'Government' 'Health and Medicine' 'Home and Garden'
 'Internet Communications and Telephony' 'Internet Portals' 'Job Search'
 'Military' 'Motor Vehicles' 'Music' 'News' 'Online Storage and Backup'
 'Personal Sites and Blogs' 'Real Estate' 'Recreation and Hobbies'
 'Reference and Research' 'Religion' 'Science' 'Science and Technology'
 'Search Engines' 'Shopping' 'Smart Home' 'Social Networking' 'Society'
 'Sports' 'Sports and Fitness' 'Stock Advice and Tools' 'Streaming Media'
 'Technology' 'Travel' 'Weather' 'Web Advertisements' 'Web Hosting']
Number of classes in label_encoder: 45



## Results with epochs = 3
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 37/37 [37:55<00:00, 61.50s/it, loss=3.54]

Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 37/37 [50:13<00:00, 81.45s/it, loss=3.01]

Epoch 2: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 37/37 [54:20<00:00, 88.12s/it, loss=2.78]

Unique classes in all_labels: {0, 1, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 31, 32, 33, 34, 36, 39, 40, 41}

Unique classes in all_preds: {32, 22}

Classes in label_encoder: ['Business and Economy' 'Computer and Internet Info'
 'Content Delivery Networks' 'Dating' 'Educational Institutions'
 'Entertainment and Arts' 'Financial Services' 'Food & Drink'
 'Food and Beverage' 'Food and Dining' 'Food and Drink' 'Gambling' 'Games'
 'Government' 'Health and Medicine' 'Home and Garden'
 'Internet Communications and Telephony' 'Internet Portals' 'Job Search'
 'Military' 'Motor Vehicles' 'Music' 'News' 'Online Storage and Backup'
 'Personal Sites and Blogs' 'Real Estate' 'Recreation and Hobbies'
 'Reference and Research' 'Religion' 'Science' 'Science and Technology'
 'Search Engines' 'Shopping' 'Smart Home' 'Social Networking' 'Society'
 'Sports' 'Sports and Fitness' 'Stock Advice and Tools' 'Streaming Media'
 'Technology' 'Travel' 'Weather' 'Web Advertisements' 'Web Hosting']
Number of classes in label_encoder: 45
                                       precision    recall  f1-score   support

                 Business and Economy       0.00      0.00      0.00        10
           Computer and Internet Info       0.00      0.00      0.00         6
                               Dating       0.00      0.00      0.00         1
             Educational Institutions       0.00      0.00      0.00         2
               Entertainment and Arts       0.00      0.00      0.00        10
                   Financial Services       0.00      0.00      0.00         7
                    Food and Beverage       0.00      0.00      0.00         3
                      Food and Dining       0.00      0.00      0.00         1
                       Food and Drink       0.00      0.00      0.00         2
                             Gambling       0.00      0.00      0.00         3
                                Games       0.00      0.00      0.00         3
                           Government       0.00      0.00      0.00         4
                  Health and Medicine       0.00      0.00      0.00         5
                      Home and Garden       0.00      0.00      0.00         1
    Internet Communications and Telephony       0.00      0.00      0.00         6
                     Internet Portals       0.00      0.00      0.00         1
                           Job Search       0.00      0.00      0.00         1
                             Military       0.00      0.00      0.00         1
                       Motor Vehicles       0.00      0.00      0.00         3
                                Music       0.00      0.00      0.00         1
                                 News       0.39      1.00      0.56        14
            Online Storage and Backup       0.00      0.00      0.00         1
             Personal Sites and Blogs       0.00      0.00      0.00         1
                          Real Estate       0.00      0.00      0.00         5
               Recreation and Hobbies       0.00      0.00      0.00         1
               Reference and Research       0.00      0.00      0.00         2
                       Search Engines       0.00      0.00      0.00         1
                             Shopping       0.32      0.97      0.48        36
                           Smart Home       0.00      0.00      0.00         1
                    Social Networking       0.00      0.00      0.00         3
                               Sports       0.00      0.00      0.00         3
                      Streaming Media       0.00      0.00      0.00         2
                           Technology       0.00      0.00      0.00         1
                               Travel       0.00      0.00      0.00         5

                             accuracy                           0.33       147
                            macro avg       0.02      0.06      0.03       147
                         weighted avg       0.11      0.33      0.17       147

Accuracy: 0.3333333333333333
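
To put the 0.33 accuracy in context, it helps to compare it with a majority-class baseline, that is, always predicting the most frequent test class. The short calculation below uses only the support and recall values from the report above; the variable names are illustrative.

```python
# Support (number of test examples) for the two classes the model predicts,
# copied from the classification report; the test set has 147 examples in total.
support = {"News": 14, "Shopping": 36}
total = 147

# Baseline: always predict the most frequent class ("Shopping").
majority_baseline = max(support.values()) / total

# Correct predictions implied by recall * support for the two predicted classes.
correct = round(1.00 * support["News"]) + round(0.97 * support["Shopping"])
accuracy = correct / total

print(f"Majority baseline: {majority_baseline:.3f}")
print(f"Model accuracy:    {accuracy:.3f}")
```

The model's accuracy comes entirely from the News and Shopping classes, so it is only modestly above what always predicting Shopping would achieve.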