# Tokenization, labeling and filtering
- My approach revolves around labeling the h1 tag of websites that pass my filtering and, in some of my training datasets, also labeling strings that are similar to the already labeled h1 tag.
- Also relevant is the format of my training data. Each entry will contain the following:
    1. URL last path (e.g. modern-wooden-chair)
    2. Page Title (e.g. Modern Wooden Chair, usually similar or identical to the h1 tag)
    3. H1 tag surrounded by page text (e.g. Modern Wooden Chair, Price 1000$, Description: Beautiful modern wooden chair ... )
- Still subject to change are the number of tokens kept to the left and right of the h1 tag and the way substrings similar to the h1 tag are labeled (fuzzy matching or even more advanced methods).
- The labeling will be done in the BIO / CoNLL format (B-PRODUCT, I-PRODUCT, O) where B-PRODUCT is the beginning of the product name, I-PRODUCT is the continuation of the product name and O is outside of the product name.
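
To make the BIO scheme concrete, here is a minimal sketch of how one product name maps to labels. The helper `bio_labels` and the sample sentence are hypothetical and only illustrate the tagging convention; the labeling functions actually used in this notebook are defined further down.

```python
# Hypothetical helper for illustration only; not part of the pipeline below.
def bio_labels(tokens, product_tokens):
    """Tag the first exact occurrence of product_tokens inside tokens."""
    labels = ["O"] * len(tokens)
    n = len(product_tokens)
    if n == 0:
        return labels
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == product_tokens:
            labels[i] = "B-PRODUCT"                        # first token of the name
            labels[i + 1:i + n] = ["I-PRODUCT"] * (n - 1)  # remaining tokens
            break
    return labels


tokens = "Buy the Modern Wooden Chair today".split()
labels = bio_labels(tokens, "Modern Wooden Chair".split())
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Every token gets exactly one label, so the token list and the label list always have the same length.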

In [None]:

# importing libraries
import csv
csv.field_size_limit(5000000)
import ast
import operator
import re
import threading
import random
from rapidfuzz import fuzz
import sys
import requests
from tqdm import tqdm 
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import concurrent
from urllib.parse import urlparse
from urllib.parse import urljoin
import spacy # we use this for word similarity
from collections import defaultdict
import time

### A little bit of data preprocessing (i.e. manually removing data that is likely wrong by removing entries with fewer than 2 words in the h1 tag, stripping strange symbols, etc.)
- By looking at the data manually I noticed what the wrong titles tend to contain. The dataset will still be a bit noisy, but I consider that we have enough data to work with.
At the bottom of the code cell you can also see the number of distinct websites in the dataset, to give an idea of its diversity. When I wrote the code it was 288 / 705 (the initial dataset). This should be enough diversity for the model to learn from.


In [None]:
data = []
distinct_websites = set()

BAD_TEXT_PATTERNS_IN_TITLE = ['releases', 'products', 'product', 'collections', 'collection', 'item', 'personalization', 'personalize', 'personalized', 'customize', 'customized', 'customise', 'customised', 'shop', 'store', 'stores', 'home', 'page', 'pages', 'about', 'contact', 'contact us', 'contact me', 'contact info', 'furniture', 'sofas', 'chairs', 'armchairs', 'ottomans', 'gift', 'card', ] #  all generic names that would indicate that the h1 tag does not contain a product - we can afford to lose a few products in the dataset

def clean_text(s):
    # this pattern keeps only normal alphanumerical characters and some special symbols
    allowed_pattern = r"[^a-zA-Z0-9\s,.:;\'\"!?()\-&+]"
    return re.sub(allowed_pattern, '', s)

def get_base_url(url):
    try:
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
        return base_url
    except Exception as e:
        # print(f"Error parsing URL {url}: {e}")
        return None

def literal_eval(item): # for parsing the tuple from the csv file (which is now a string)
    try:
        # only evaluate strings that look like a tuple
        if item.startswith('(') and item.endswith(')'):
            return ast.literal_eval(item)
        return None
    except (ValueError, SyntaxError):
        # evaluation failed - return None so the caller skips this row
        return None
    
def find_all_h1_positions(text, h1_tag): # returns the h1 tag and a list of (start, end) character positions
    positions = []
    start_pos = 0
    # search for all occurrences of h1_tag in text - if they match exactly we label them regardless
    while True:
        start_idx = text.find(h1_tag, start_pos)
        if start_idx == -1:
            break
        end_idx = start_idx + len(h1_tag) 
        positions.append((start_idx, end_idx))
        start_pos = end_idx + 1
    return h1_tag, positions


with open('../data/preprocessed_data_from_all_sitemaps_100000.csv', 'r', encoding='utf-8', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        row[1] = literal_eval(row[1])
        if row[1] is None:
            continue
        # row layout: url, h1_tag_position, title, url_last_path, page_text
        row[1] = (clean_text(row[1][0]), row[1][1], row[1][2]) # only the h1 text (first element) is cleaned
        row[2] = clean_text(row[2])
        row[3] = clean_text(row[3])
        row[4] = clean_text(row[4])
        
        h1_tag, positions = find_all_h1_positions(row[4], row[1][0])
        row[1] = (h1_tag, positions) # string with list
        if 'G Plan Chloe' in row[1][0]: # this website haunts me in my data with G Plan Chloe (substring check on the h1 text)
            continue
        
        # finally we remove any h1 tag with fewer than 2 words
        if len(row[1][0].split()) < 2:
            continue
        
        ok = True
        for word in row[1][0].split():
            if word.lower() in BAD_TEXT_PATTERNS_IN_TITLE:
                ok = False
                break
        if not ok:
            continue    
        
        distinct_websites.add(get_base_url(row[0]))
        data.append(row)
        
print(len(distinct_websites))

### Methods for labeling the data
- Now there is a choice to be made. We can either label only the h1 tag strictly, or also label strings that are similar to the h1 tag. I will try both approaches and see which one works better.
- The labeling of similar strings will only be done on the title and the URL last path, since they are shorter and more likely to contain the product name, which makes labeling them worthwhile (plus the algorithm that labels similar substrings is inefficient and slow at the time of writing).
- The ideal way to label similar tokens, including which method to use and how lenient or strict to be, can only be determined by testing and experimenting.
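
The difference between the two approaches can be sketched with the standard library alone. Note that this sketch uses `difflib.SequenceMatcher` as a stand-in, not the rapidfuzz scorer used in the cell below, so the scores and the threshold scale (0 to 1 here) differ. `best_span` and the sample title are hypothetical.

```python
from difflib import SequenceMatcher


def best_span(tokens, target, threshold=0.8):
    """Return (start, end, score) of the token span most similar to target, or None."""
    best = None
    target = target.lower()
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            candidate = " ".join(tokens[i:j]).lower()
            score = SequenceMatcher(None, candidate, target).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (i, j, score)
    return best


h1 = "Modern Wooden Chair"
h1_tokens = h1.split()
title_tokens = "Modern Wooden Chairs | Acme Store".split()

# strict labeling: the h1 tokens must appear verbatim as a token run
strict_hit = any(
    title_tokens[i:i + len(h1_tokens)] == h1_tokens
    for i in range(len(title_tokens) - len(h1_tokens) + 1)
)

# lenient labeling: accept the most similar span above a threshold
lenient_hit = best_span(title_tokens, h1)

print("strict match found:", strict_hit)
print("lenient match:", lenient_hit)
```

In this sample the strict comparison fails because of the plural "Chairs", while the similarity-based search still selects the first three tokens of the title.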

In [None]:
def fuzzy_match(text_tokens, h1_tokens, similarity_threshold=80):
    longest_match = None
    max_similarity = similarity_threshold  # similarity of the current longest match
    h1_text = ' '.join(h1_tokens).lower()  # computed once instead of on every iteration
    
    for i in range(len(text_tokens)):
        for j in range(i+1, len(text_tokens)+1):
            text_sub_seq = ' '.join(text_tokens[i:j])
            # compare with h1 tokens using fuzzy matching
            similarity = fuzz.partial_ratio(text_sub_seq.lower(), h1_text)
            
            # if the similarity is above the threshold, we keep the longest match so we don't label too short sequences (shorter substrings caused problems during testing)
            if similarity >= similarity_threshold:
                if longest_match is None or len(text_sub_seq) > len(longest_match[0]):
                    longest_match = (text_sub_seq, similarity)
                    max_similarity = similarity  # keep the score in sync with the stored match
                elif len(text_sub_seq) == len(longest_match[0]) and similarity > max_similarity:
                    longest_match = (text_sub_seq, similarity)
                    max_similarity = similarity

    return longest_match  # None if no subsequence reached the threshold


def label_text(text, to_label): 
    # split the main text and the text to be labeled into tokens
    tokens = text.split()
    label_tokens = to_label.split()

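    # initialize the labels list with 'O' for 'Outside'
    labels = ['O'] * len(tokens)
    if not label_tokens:  # nothing to label - an empty label would also make the loop below never advance
        return tokens, labels
    # iterate through the tokens and label all occurrences of the label_tokens
    i = 0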
    while i < len(tokens):
        # convert both the token slice and label tokens to lowercase for comparison
        if [t.lower() for t in tokens[i:i+len(label_tokens)]] == [lt.lower() for lt in label_tokens]:
            labels[i] = 'B-PRODUCT'  # mark the beginning of the label
            for j in range(1, len(label_tokens)):
                labels[i+j] = 'I-PRODUCT'  # mark the rest of the label tokens
            i += len(label_tokens)  # skip the tokens that were labeled
        else:
            i += 1  # move to the next token

    return tokens, labels


# for simplicity, we will only perform fuzzy matching on the title and url last path strings 
def label_url_or_title(text, to_label):
    # split the text and label into tokens
    tokens = text.split()
    label_tokens = to_label.split()

    # initialize labels with 'O' for 'Outside'
    labels = ['O'] * len(tokens)

    # perform fuzzy matching between the text and the label
    match_result = fuzzy_match(tokens, label_tokens)

    if match_result is None or len(match_result[0].split()) < 3: # we don't want small matches since they are wrong most of the time
        return tokens, labels
    # get the best matching subsequence and its similarity score
    matched_subseq, similarity = match_result
    # tokenize the matched subsequence
    matched_tokens = matched_subseq.split()
    
    # Find the starting index of the matched subsequence in the main text
    start_index = None
    for i in range(len(tokens) - len(matched_tokens) + 1):
        if tokens[i:i+len(matched_tokens)] == matched_tokens:
            start_index = i
            break
    if start_index is not None:
        # Label the matched subsequence in the main text
        labels[start_index] = 'B-PRODUCT'
        for j in range(1, len(matched_tokens)):
            labels[start_index + j] = 'I-PRODUCT'

    return tokens, labels



### Training entry format
- The training data will be in the format of a csv file with the following columns:
    1. URL
    2. Tokens (space-separated)
    3. Labels (space-separated)
- We will have 5 special tokens that will be used to separate the different parts of the text that we want to label:
    1. [URL] - for the URL last path
    2. [TITLE] - for the page title
    3. [TEXT] - for the page text
    4. <NO_URL> - if the URL last path is missing
    5. <NO_TITLE> - if the page title is missing
- Token entries will look like this:
    - [URL] modern-wooden-chair [URL] [TITLE] Modern Wooden Chair [TITLE] [TEXT] Modern Wooden Chair, Price 1000$, Description: Beautiful modern wooden chair ... [TEXT]
- Label entries will look like this:
    - O B-PRODUCT I-PRODUCT I-PRODUCT O O B-PRODUCT I-PRODUCT I-PRODUCT O O B-PRODUCT I-PRODUCT I-PRODUCT O O O O O O O O O O O O .... O
- This is the format I came up with that made the most logical sense to me. It captures all the information a page has to offer about a product. The only possible issue would be the model over-relying on the URL last path and the page title, but measures have been taken to prevent that (i.e. randomly setting the page title to <NO_TITLE> or the URL last path to <NO_URL>).
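
The entry layout described above can be sketched as follows. This is a simplified illustration, not the function used in the next cell: `assemble_entry`, `drop_title_prob` and the sample data are hypothetical, only the title is dropped at random here, and the URL last path is assumed to have been split into words already.

```python
import random


def assemble_entry(url_part, title_part, text_part, drop_title_prob=0.01, rng=random):
    """Each *_part is a (tokens, labels) pair; returns one aligned (tokens, labels) entry."""
    if not url_part[0]:
        url_part = (["<NO_URL>"], ["O"])
    # a missing title and a randomly dropped title look identical to the model
    if not title_part[0] or rng.random() < drop_title_prob:
        title_part = (["<NO_TITLE>"], ["O"])

    tokens, labels = [], []
    sections = (("[URL]", url_part), ("[TITLE]", title_part), ("[TEXT]", text_part))
    for marker, (part_tokens, part_labels) in sections:
        tokens += [marker, *part_tokens, marker]  # the marker opens and closes each section
        labels += ["O", *part_labels, "O"]        # markers are never part of a product
    return tokens, labels


product = ["B-PRODUCT", "I-PRODUCT", "I-PRODUCT"]
url_part = (["modern", "wooden", "chair"], product)
title_part = (["Modern", "Wooden", "Chair"], product)
text_part = (["Modern", "Wooden", "Chair", "Price", "1000$"], product + ["O", "O"])

kept_tokens, kept_labels = assemble_entry(url_part, title_part, text_part, drop_title_prob=0.0)
dropped_tokens, dropped_labels = assemble_entry(url_part, title_part, text_part, drop_title_prob=1.0)

print(" ".join(kept_tokens))
print(" ".join(kept_labels))
print(" ".join(dropped_tokens))
```

Setting `drop_title_prob` to 0.0 or 1.0 makes the behavior deterministic, which is convenient for checking both branches.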

In [None]:
# left and right tokens represent the number of tokens before and after the h1 tag position that will be fed to the model
def tokenize_and_label(text, h1_tag_position, title, url_last_path, tokens_left=15, tokens_right=25):
    to_label = h1_tag_position[0]
    
    tokens_url_last_path, labels_url_last_path = [], []
    tokens_title, labels_title = [], []
    
    if url_last_path is None or url_last_path == '': # if the url last path is missing
        tokens_url_last_path = ['<NO_URL>']
        labels_url_last_path = ['O']
    elif isinstance(url_last_path, str): # we label using fuzzy matching if nothing was labeled
        tokens_url_last_path, labels_url_last_path = label_text(url_last_path, to_label) # this labels normally
        is_any_non_o_label = any(label != 'O' for label in labels_url_last_path) # check if there are any non 'O' labels
        if not is_any_non_o_label:
            tokens_url_last_path, labels_url_last_path = label_url_or_title(url_last_path, to_label)
    else: 
        tokens_url_last_path = ['<NO_URL>']
        labels_url_last_path = ['O']
    
    if title is None or title == '':
        tokens_title = ['<NO_TITLE>']
        labels_title = ['O']
    elif isinstance(title, str): # we label using fuzzy matching if nothing was labeled - same as above
        tokens_title, labels_title = label_text(title, to_label) # this labels normally
        is_any_non_o_label = any(label != 'O' for label in labels_title) # check if there are any non 'O' labels
        if not is_any_non_o_label:
            tokens_title, labels_title = label_url_or_title(title, to_label) # this labels using fuzzy matching
    else: 
        tokens_title = ['<NO_TITLE>']
        labels_title = ['O']
    
    tokens_text, labels_text = label_text(text, to_label)
    
    try:
        first_label_index = labels_text.index('B-PRODUCT')
    except ValueError:
        first_label_index = 0  # if no labeled entity, start from the beginning
    try:
        last_label_index = max(idx for idx, label in enumerate(labels_text) if label in ['B-PRODUCT', 'I-PRODUCT'])
    except ValueError:
        last_label_index = len(tokens_text) - 1  # If no labeled entity, end at the last token

    # Calculate the window to slice
    start_index = max(0, first_label_index - tokens_left)
    end_index = min(len(tokens_text), last_label_index + tokens_right + 1) # or len(tokens_text) 


    tokens_text = tokens_text[start_index:end_index] # slice the tokens
    labels_text = labels_text[start_index:end_index] # slice the labels
    
    end_of_window = 100 # maximum length of the window
    while end_of_window < len(labels_text) and labels_text[end_of_window] != 'O': # we want to end the window at the end of the product name
        end_of_window +=1
    
    if end_of_window < len(labels_text):
        tokens_text = tokens_text[:end_of_window]
        labels_text = labels_text[:end_of_window]
    
    if random.randint(1, 100) == 1: # with a 1% chance we drop the title so the model does not over-rely on it
        tokens_title = ['<NO_TITLE>']
        labels_title = ['O']
    
    tokens = ['[URL]'] + tokens_url_last_path + ['[URL]', '[TITLE]'] + tokens_title + ['[TITLE]', '[TEXT]'] + tokens_text + ['[TEXT]'] # concatenate all tokens and matching the format
    
    labels = ['O'] + labels_url_last_path + ['O', 'O'] + labels_title + ['O', 'O'] + labels_text + ['O'] # concatenate all labels and matching the format
    

    return tokens, labels # return the tokens and labels


### Creating our dataset for training
- One thing to note is that we have to join the tokens and labels into a single string so that we can write them to a csv file. This is due to how the BERT tokenizer works (see the training notebook). 
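
A small round trip shows why the join is safe as long as no token contains whitespace. This sketch writes to an in-memory buffer instead of a file; the URL and the sample row are made up for illustration.

```python
import csv
import io

tokens = ["[TEXT]", "Modern", "Wooden", "Chair", "Price", "1000$,", "Description:", "[TEXT]"]
labels = ["O", "B-PRODUCT", "I-PRODUCT", "I-PRODUCT", "O", "O", "O", "O"]

# write one row the same way the cell below does, but into memory
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(
    ["https://example.com/modern-wooden-chair", " ".join(tokens), " ".join(labels)]
)

# read it back and split on single spaces to recover the lists
buffer.seek(0)
url, tokens_str, labels_str = next(csv.reader(buffer))
restored_tokens = tokens_str.split(" ")
restored_labels = labels_str.split(" ")

print(restored_tokens == tokens, restored_labels == labels)
```

The comma inside `1000$,` survives because the csv module quotes the field; the alignment between tokens and labels only holds because tokens were produced by splitting on whitespace in the first place.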

In [None]:
with open('../data/100000_data_ready_for_training.csv', 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    for row in data:
        url, h1_tag_positions, title, url_last_path, page_text = row
        
        tokens, labels = tokenize_and_label(page_text, h1_tag_positions, title, url_last_path)
        # tokens, labels = clean_tokens_with_labels(tokens, labels)
        tokens_str = ' '.join(tokens)  # join tokens into a single string
        labels_str = ' '.join(labels)  # join labels into a single string
        writer.writerow([url, tokens_str, labels_str])


### Testing code

In [None]:
preprocessed_data = []

with open('../data/100000_data_ready_for_training.csv', 'r', encoding='utf-8', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        url, tokens_str, labels_str = row
        tokens = tokens_str.split(' ') # split on a single space to mirror the ' '.join used when writing
        labels = labels_str.split(' ')
        preprocessed_data.append((url, tokens, labels))