<a href="https://colab.research.google.com/gist/toniramchandani1/c513e5c870e465a38964a90dd57dd83f/xpathsgenlstm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import Libraries**
The first cell imports the libraries the notebook relies on: numpy for numerical operations, tensorflow (specifically its keras API) for building the neural network model, pad_sequences for bringing all sequences to the same length, and train_test_split from sklearn for splitting the dataset into training and testing sets.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, TimeDistributed
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

**Define Dataset**
Here, a sample dataset of HTML elements and their corresponding XPaths is defined. Each element is a dictionary with two keys: 'element' for the HTML snippet and 'xpath' for its XPath.

In [None]:
dataset = [
    {'element': '<html><head></head></html>', 'xpath': '/html/head'},
    {'element': '<div class="container"></div>', 'xpath': '//div[@class="container"]'},
    {'element': '<ul><li>Item 1</li><li>Item 2</li></ul>', 'xpath': '//ul/li'},
    {'element': '<div><span class="text">Example</span></div>', 'xpath': '//div/span[@class="text"]'},
    {'element': '<a href="https://example.com">Link</a>', 'xpath': '//a[@href="https://example.com"]'},
    {'element': '<input type="text" name="firstname">', 'xpath': '//input[@type="text"]'},
    {'element': '<button id="submit">Submit</button>', 'xpath': '//button[@id="submit"]'},
    {'element': '<section><header>Header</header></section>', 'xpath': '//section/header'},
    {'element': '<footer><p>Copyright 2021</p></footer>', 'xpath': '//footer/p'},
    {'element': '<nav><ul><li><a href="#home">Home</a></li></ul></nav>', 'xpath': '//nav//a[@href="#home"]'},
    {'element': '<article><h2>Title</h2><p>Paragraph</p></article>', 'xpath': '//article//p'},
    {'element': '<form><label for="email">Email:</label><input type="email" id="email"></form>', 'xpath': '//form//input[@type="email"]'},
    {'element': '<table><tr><td>Cell 1</td><td>Cell 2</td></tr></table>', 'xpath': '//table/tr/td'},
    {'element': '<img src="image.jpg" alt="Image">', 'xpath': '//img[@alt="Image"]'},
    {'element': '<aside><h3>Related</h3><p>Content</p></aside>', 'xpath': '//aside/p'},
    {'element': '<section id="main-content"><div class="highlight">Featured</div></section>', 'xpath': '//section[@id="main-content"]/div[@class="highlight"]'},
    {'element': '<ol><li class="item">First item</li><li class="item">Second item</li></ol>', 'xpath': '//ol/li[@class="item"]'},
    {'element': '<header><h1>Welcome</h1><nav><ul><li>About</li><li>Contact</li></ul></nav></header>', 'xpath': '//header/nav/ul/li'},
    {'element': '<div class="login"><form><input type="password" name="password"></form></div>', 'xpath': '//div[@class="login"]//input[@type="password"]'},
    {'element': '<article><section><p>Some text here</p></section></article>', 'xpath': '//article//section/p'},
    {'element': '<iframe src="video.html"></iframe>', 'xpath': '//iframe[@src="video.html"]'},
    {'element': '<main><section><h2>Introduction</h2><p>Welcome to our site.</p></section></main>', 'xpath': '//main//section/p'},
    {'element': '<ul class="menu"><li><a href="#home">Home</a></li><li><a href="#about">About</a></li></ul>', 'xpath': '//ul[@class="menu"]/li/a'},
    {'element': '<div class="header"><span class="date">Jan 1, 2024</span><h1>New Year</h1></div>', 'xpath': '//div[@class="header"]/span[@class="date"]'},
    {'element': '<blockquote cite="http://example.com/facts">Interesting Fact</blockquote>', 'xpath': '//blockquote[@cite]'},
    {'element': '<label for="search">Search:</label><input id="search" type="search">', 'xpath': '//input[@id="search"]'},
    {'element': '<details><summary>More Info</summary><p>Detailed Information</p></details>', 'xpath': '//details/summary'},
    {'element': '<figure><img src="photo.png" alt="Photo"><figcaption>Photo Caption</figcaption></figure>', 'xpath': '//figure/figcaption'},
    {'element': '<datalist id="browsers"><option value="Chrome"></option><option value="Firefox"></option></datalist>', 'xpath': '//datalist[@id="browsers"]/option'},
    {'element': '<div class="comments"><comment>Great article!</comment><comment>Thanks for sharing.</comment></div>', 'xpath': '//div[@class="comments"]/comment'},
    {'element': '<bdi dir="rtl">This text will be reversed</bdi>', 'xpath': '//bdi[@dir="rtl"]'},
    {'element': '<mark>This is highlighted text</mark>', 'xpath': '//mark'},
    {'element': '<time datetime="2024-01-01">January 1st, 2024</time>', 'xpath': '//time[@datetime="2024-01-01"]'}

    # Add more samples as needed...
]

**Preprocess Dataset**
This section finds the length of the longest string among all HTML elements and XPaths so that every sequence can later be padded to one common length, separates elements and XPaths into lists, and builds a vocabulary from all unique characters in the dataset. It then encodes HTML elements and XPaths into sequences of integers based on this vocabulary. Character indices start at 1 because index 0 is reserved for padding.

In [None]:
# Length of the longest string in the dataset; every sequence is padded to this length.
max_length = max(max(len(sample['element']), len(sample['xpath'])) for sample in dataset)
elements = [sample['element'] for sample in dataset]
xpaths = [sample['xpath'] for sample in dataset]

# Character-level vocabulary shared by elements and XPaths.
vocabulary = sorted(set(''.join(elements) + ''.join(xpaths)))
# Indices start at 1 because 0 is reserved for padding.
char_to_index = {char: index + 1 for index, char in enumerate(vocabulary)}
x_encoded = [[char_to_index[char] for char in element] for element in elements]
y_encoded = [[char_to_index[char] for char in xpath] for xpath in xpaths]
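
As an illustration of the encoding step, the following minimal sketch builds the same kind of character vocabulary on a single toy sample and checks that encoding followed by decoding returns the original string. The toy sample and the index_to_char, encode, and decode helpers are assumptions added for this example; they are not part of the notebook.

```python
# Toy stand-in for the notebook's dataset (hypothetical sample).
samples = [{'element': '<b>Hi</b>', 'xpath': '//b'}]

# Shared character vocabulary, built the same way as in the cell above.
text = ''.join(s['element'] + s['xpath'] for s in samples)
vocabulary = sorted(set(text))
char_to_index = {ch: i + 1 for i, ch in enumerate(vocabulary)}  # 0 is kept for padding
index_to_char = {i: ch for ch, i in char_to_index.items()}      # hypothetical reverse map


def encode(s):
    """Turn a string into a list of vocabulary indices."""
    return [char_to_index[ch] for ch in s]


def decode(indices):
    """Turn indices back into a string, skipping the padding index 0."""
    return ''.join(index_to_char[i] for i in indices if i != 0)


encoded = encode('//b')
print(encoded)                   # [1, 1, 5]
print(decode(encoded + [0, 0]))  # //b
```

Keeping 0 out of the vocabulary is what lets the decoder drop padding without ambiguity.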

**Pad Sequences**
To ensure all input sequences have the same length, this part pads shorter sequences with zeros at the end (post padding).

In [None]:
x_padded = pad_sequences(x_encoded, maxlen=max_length, padding='post', value=0)
y_padded = pad_sequences(y_encoded, maxlen=max_length, padding='post', value=0)
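
To make the effect of post padding concrete, here is a small pure-Python sketch of what the padding step does to a batch of sequences. pad_post is a hypothetical stand-in written for this example, not the Keras function; it assumes maxlen is at least 1 and, like pad_sequences with its default truncating='pre', keeps the last maxlen items of any sequence that is too long.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad every sequence to maxlen with `value` (assumes maxlen >= 1)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: keep only the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


print(pad_post([[3, 1, 2], [5]], maxlen=4))  # [[3, 1, 2, 0], [5, 0, 0, 0]]
```

In the notebook no truncation happens, because max_length is taken from the longest string in the dataset.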

**Split Dataset**
Splits the padded sequences into training and testing sets using a standard 80/20 split. Setting random_state=42 makes the shuffle reproducible.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_padded, y_padded, test_size=0.2, random_state=42)
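
With the 33 samples listed earlier, an 80/20 split cannot be exact. The sketch below works out the resulting sizes, assuming the test fraction is rounded up and the remainder goes to training, which is how train_test_split is understood to behave when only test_size is given as a float. split_sizes is a hypothetical helper for this arithmetic only.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) when the test fraction is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(33, 0.2))  # (26, 7)
```

A test set of only 7 sequences gives a very noisy validation score, so the reported accuracy should be read with caution.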

**Define Model Architecture**
Defines a Sequential model with an Embedding layer to learn dense representations of characters, an LSTM layer to capture sequences, and a TimeDistributed Dense layer to make predictions at each sequence step.

In [None]:
vocab_size = len(vocabulary) + 1  # +1 because index 0 is reserved for padding
embedding_dim = 64

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
    LSTM(128, return_sequences=True),
    TimeDistributed(Dense(vocab_size, activation='softmax'))
])
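
To get a feel for the size of this architecture, the sketch below computes the number of trainable parameters per layer from the standard formulas for an embedding, an LSTM with one bias vector per gate, and a dense layer. The vocabulary size of 60 is a made-up value for illustration; the real value depends on the dataset. count_parameters is a hypothetical helper, not part of Keras.

```python
def count_parameters(vocab_size, embedding_dim=64, lstm_units=128):
    """Trainable parameters of Embedding -> LSTM -> Dense(vocab_size).

    Tensor shapes along the way:
    (batch, max_length) -> (batch, max_length, embedding_dim)
    -> (batch, max_length, lstm_units) -> (batch, max_length, vocab_size)
    """
    embedding = vocab_size * embedding_dim
    # Four gates, each with input weights, recurrent weights, and a bias.
    lstm = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
    dense = lstm_units * vocab_size + vocab_size
    return {
        'embedding': embedding,
        'lstm': lstm,
        'dense': dense,
        'total': embedding + lstm + dense,
    }


print(count_parameters(vocab_size=60))
# {'embedding': 3840, 'lstm': 98816, 'dense': 7740, 'total': 110396}
```

Calling model.summary() on the model defined above reports the actual counts for comparison.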

**Compile and Train Model**
Compiles the model with sparse_categorical_crossentropy as the loss function, which suits multi-class classification with integer class indices as targets (no one-hot encoding needed), and trains it on the training data. y_train and y_test are given a trailing axis of size 1 so that each time step carries exactly one target index.

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

y_train_reshaped = np.expand_dims(y_train, -1)
y_test_reshaped = np.expand_dims(y_test, -1)

model.fit(x_train, y_train_reshaped, epochs=10, batch_size=32, validation_data=(x_test, y_test_reshaped))
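
For intuition about the loss, the following pure-Python sketch computes sparse categorical cross-entropy for one short sequence: at each time step it takes the negative log of the probability assigned to the true index and then averages over steps. This is a simplified illustration with made-up probabilities. It omits details of the Keras implementation such as clipping probabilities away from zero and batching.

```python
import math


def sparse_categorical_crossentropy(targets, probabilities):
    """Mean of -log(p[target]) over all time steps."""
    losses = [-math.log(p[t]) for t, p in zip(targets, probabilities)]
    return sum(losses) / len(losses)


# Two time steps, three classes (index 0 would be padding in the notebook).
probabilities = [
    [0.1, 0.7, 0.2],  # step 1: true class 1 gets probability 0.7
    [0.8, 0.1, 0.1],  # step 2: true class 0 gets probability 0.8
]
targets = [1, 0]

loss = sparse_categorical_crossentropy(targets, probabilities)
print(round(loss, 4))  # 0.2899
```

Because padded positions count as ordinary targets here, a model can reach a low loss and high accuracy largely by predicting index 0.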


**Generate XPath Predictions**
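Encodes a new HTML element using the vocabulary, pads the sequence, predicts the corresponding XPath using the trained model, and decodes the prediction back into characters. Characters that do not appear in the vocabulary are skipped during encoding, and positions predicted as padding (index 0) are dropped during decoding.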

In [None]:
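sample_element = '<div class="example"></div>'
# Encode with the training vocabulary; characters not seen during training are skipped.
sample_encoded = [char_to_index[char] for char in sample_element if char in char_to_index]
sample_padded = pad_sequences([sample_encoded], maxlen=max_length, padding='post', value=0)
prediction = model.predict(sample_padded)
# Most likely character index at each position of the single input sequence.
predicted_xpath_encoded = np.argmax(prediction, axis=-1)[0]
# Map indices back to characters, dropping the padding index 0.
predicted_xpath = ''.join([vocabulary[index - 1] for index in predicted_xpath_encoded if index > 0])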

print("Predicted XPath:", predicted_xpath)

Predicted XPath: 
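
The decoding step can be sketched in isolation with plain Python lists in place of the model output. The example below picks the most likely index at each position and drops padding, which also shows how an empty string results when every position favors index 0. greedy_decode, the two-character vocabulary, and the probabilities are made up for this illustration.

```python
def greedy_decode(step_probabilities, vocabulary):
    """Take the argmax at each step and skip the padding index 0."""
    chars = []
    for probs in step_probabilities:
        index = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if index > 0:
            chars.append(vocabulary[index - 1])
    return ''.join(chars)


vocabulary = ['/', 'a']  # index 1 -> '/', index 2 -> 'a', index 0 -> padding
steps = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.20, 0.70],
    [0.90, 0.05, 0.05],  # padding wins at the last position
]

print(greedy_decode(steps, vocabulary))                    # //a
print(repr(greedy_decode([[0.9, 0.05, 0.05]] * 4, vocabulary)))  # ''
```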


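This code demonstrates how to preprocess data, build a simple character-level sequence model with TensorFlow's Keras API, train the model, and use it to predict the XPath of a given HTML element. The model predicts one output character per input position instead of using an encoder-decoder, so it only approximates sequence-to-sequence learning. The empty prediction above means that every position was decoded as the padding index 0, which is likely with a 33-sample dataset, 10 epochs, and targets that consist mostly of padding. More data, longer training, or a different architecture may be needed for usable results, especially with more complex HTML/XPath pairs.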