# Project Summary: HTML Ad Classifier

## Overview
The goal of this project is to build a simple classifier that takes an HTML string as input and determines whether it represents an advertisement.

## Input
- A string containing an HTML element, e.g.:
  ```html
  <button class=ad> ... </button>
  ```

## Data Collection
- We will use `curl` to fetch the HTML of each website.
- We will manually label the lines that are advertisements.
- Each website will contribute a few thousand lines of HTML.
- Only a small portion (approximately 20 lines per site) will actually be ads.
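
The labeling step above could be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the helper `label_lines`, the `ad_line_numbers` argument, and the toy page are all hypothetical, and the sketch assumes labels are recorded as 1-based line numbers of a page saved with `curl`.

```python
from typing import List, Set, Tuple


def label_lines(html_text: str, ad_line_numbers: Set[int]) -> List[Tuple[str, int]]:
    """Pair each non-empty HTML line with a 0/1 label (1 = ad).

    `ad_line_numbers` holds the 1-based line numbers marked as ads by hand.
    """
    examples = []
    for line_number, line in enumerate(html_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no signal, so skip them
        examples.append((stripped, int(line_number in ad_line_numbers)))
    return examples


# Toy page standing in for a file saved with `curl -o page.html <url>`.
page = """<div class="content">
  <p>Hello</p>

  <button class=ad> Buy now </button>
</div>"""

examples = label_lines(page, ad_line_numbers={4})
for text, label in examples:
    print(label, text)
```

Keeping the original line numbers as the labeling key means the hand-made labels stay valid even if blank lines are later filtered out.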

## Dataset Characteristics
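- Around 10,000 lines of HTML data overall.
- Ads make up roughly 1% or less of the lines (about 20 out of a few thousand per site), so the two classes are heavily imbalanced.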
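
With only about 20 ad lines among a few thousand per site, the positive class is rare. One common way to compensate during training is to weight positive examples by the ratio of negatives to positives. The sketch below computes that ratio; the function name `positive_class_weight` and the counts are illustrative assumptions, not measured values from the dataset.

```python
from typing import Sequence


def positive_class_weight(labels: Sequence[int]) -> float:
    """Return negatives / positives, a common weight for the rare class."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive (ad) examples to weight")
    return negatives / positives


# Illustrative counts only: 20 ad lines in a hypothetical 3,000-line page.
labels = [1] * 20 + [0] * 2980
weight = positive_class_weight(labels)
print(f"ad fraction: {sum(labels) / len(labels):.2%}, pos_weight: {weight:.0f}")
```

A value like this could be passed as `pos_weight` to a binary cross-entropy loss, though whether weighting, resampling, or neither works best would have to be checked on the real data.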

## Objective
- Train a model that accurately classifies whether a given HTML line is an ad.
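
Because ads are rare, plain accuracy can look good even for a model that never flags an ad. The sketch below shows one way "accurately" might be measured instead, using precision and recall on the ad class. The helper `precision_recall` and the toy labels are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive (ad) class; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 1 ad among 100 lines, and a model that always answers "not an ad".
y_true = [1] + [0] * 99
always_negative = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall = precision_recall(y_true, always_negative)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```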

In [1]:
import torch
import torch.nn as nn
from typing import Optional

In [3]:
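class HTMLAdClassifier(nn.Module):
    """
    Transformer encoder that scores each position in an HTML token sequence
    (a token or an element start tag) as ad / non-ad.

    `forward` returns one raw logit per position, shape (B, L); apply
    `probability` or `prediction` to turn logits into scores or decisions.
    """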

    def __init__(
        self,
        vocab_size: int,
        tag_vocab_size: int,
        attr_vocab_size: int,
        embed_dim: int = 256,
        num_layers: int = 4,
        num_heads: int = 8,
        dropout: float = 0.2,
        max_seq_len: int = 1024,
    ) -> None:
        super().__init__()

        # ──────────────────── Embedding blocks ────────────────────
        self.token_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.tag_embed = nn.Embedding(tag_vocab_size, embed_dim, padding_idx=0)
        self.attr_embed = nn.Embedding(attr_vocab_size, embed_dim, padding_idx=0)
        self.pos_embed = nn.Embedding(max_seq_len, embed_dim)
        self.embed_dropout = nn.Dropout(dropout)

        # ──────────────────── Transformer encoder ────────────────────
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            activation="gelu",
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)

        # ──────────────────── Classification head ────────────────────
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),  # could be swapped for nn.ReLU()
            nn.Dropout(dropout),
            nn.Linear(embed_dim, 1),  # logit
        )

    def forward(
        self,
        token_ids: torch.LongTensor,      # (B, L)
        tag_ids: torch.LongTensor,        # (B, L)
        attr_ids: torch.LongTensor,       # (B, L)
        pos_ids: torch.LongTensor,        # (B, L)
        attention_mask: Optional[torch.BoolTensor] = None,  # (B, L)
    ) -> torch.Tensor:
        x = (
            self.token_embed(token_ids)
            + self.tag_embed(tag_ids)
            + self.attr_embed(attr_ids)
            + self.pos_embed(pos_ids)
        )
        x = self.embed_dropout(x)

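        # src_key_padding_mask expects True at padded positions (they are ignored),
        # so attention_mask must follow that convention here.
        x = self.encoder(x, src_key_padding_mask=attention_mask)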
        logits = self.classifier(x).squeeze(-1)  # (B, L)
        return logits

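    @staticmethod
    def probability(logits: torch.Tensor) -> torch.Tensor:
        """Convert raw logits to ad probabilities in [0, 1]."""
        return torch.sigmoid(logits)

    @staticmethod
    def prediction(logits: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
        """Return a boolean tensor that is True where the ad probability exceeds `threshold`."""
        return torch.sigmoid(logits) > threshold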
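
The model returns one logit per position, so a training loss has to ignore padded positions and, given how rare ads are, may need to up-weight positives. The sketch below shows one possible way to do both, plus a helper for building the `pos_ids` input. The names `make_position_ids`, `masked_bce_loss`, and `valid_mask` are hypothetical and not part of the class above. Here `valid_mask` is True at real positions, which is the inverse of what `src_key_padding_mask` expects.

```python
import torch
import torch.nn.functional as F


def make_position_ids(batch_size: int, seq_len: int) -> torch.Tensor:
    """Positions 0..seq_len-1 repeated for every sequence in the batch."""
    return torch.arange(seq_len).unsqueeze(0).expand(batch_size, seq_len)


def masked_bce_loss(
    logits: torch.Tensor,      # (B, L) raw scores
    labels: torch.Tensor,      # (B, L) 0/1 targets
    valid_mask: torch.Tensor,  # (B, L) True at real (non-padded) positions
    pos_weight: float = 1.0,   # extra weight for the rare positive class
) -> torch.Tensor:
    """Binary cross-entropy averaged over valid positions only."""
    per_position = F.binary_cross_entropy_with_logits(
        logits,
        labels.float(),
        pos_weight=logits.new_tensor([pos_weight]),
        reduction="none",
    )
    valid = valid_mask.to(per_position.dtype)
    # clamp avoids division by zero if a batch contains no valid positions
    return (per_position * valid).sum() / valid.sum().clamp(min=1.0)


# Toy batch: 2 sequences of length 3; the second has two padded positions.
logits = torch.zeros(2, 3)
labels = torch.tensor([[0, 1, 0], [0, 0, 0]])
valid_mask = torch.tensor([[True, True, True], [True, False, False]])

pos_ids = make_position_ids(2, 3)
loss = masked_bce_loss(logits, labels, valid_mask, pos_weight=3.0)
print(pos_ids.tolist(), float(loss))
```

Position ids built this way must stay below `max_seq_len` (1024 by default), since `pos_embed` has only that many rows.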
