# Problem Definition 


The goal is to build a system that can predict potential directories on a website based on its existing structure. This can be useful for web crawling, SEO analysis, and security assessments.

The system uses two independent models, one for each kind of discovery:
- Lateral discovery: It will identify top-level directories that are likely to be present on a similar website (e.g., `/about`, `/contact`).
- Hierarchical discovery: It predicts subdirectories based on the parent directory (e.g., `/parent/child`).
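
To make the two notions concrete, the sketch below splits a handful of made-up paths into the top-level set that lateral discovery works with and the parent/child pairs that hierarchical discovery works with. The sample paths and the helper name `split_views` are illustrative assumptions, not part of the dataset or the pipeline.

```python
def split_views(paths):
    """Return (top-level directory set, sorted parent->child pairs) for one site."""
    top_level = set()
    parent_child = set()
    for path in paths:
        parts = [p for p in path.strip("/").split("/") if p]
        if not parts:
            continue
        top_level.add(parts[0])
        # Every adjacent pair of segments is one parent -> child observation
        for parent, child in zip(parts, parts[1:]):
            parent_child.add((parent, child))
    return top_level, sorted(parent_child)


# Hypothetical paths observed on a single website
site_paths = ["/about", "/contact", "/blog/2024/launch", "/blog/tags"]
top_level, parent_child = split_views(site_paths)
print(sorted(top_level))   # ['about', 'blog', 'contact']
print(parent_child)        # [('2024', 'launch'), ('blog', '2024'), ('blog', 'tags')]
```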

# Data Collection 

Origin of the dataset: [Common Crawl - Open Repository of Web Crawl Data](https://commoncrawl.org)

Chosen subset for analysis: [Common Crawl Columnar URL Index Files](https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-38/subset=warc/part-00000-165bfb83-c006-44f2-a121-3d8ae730bc93.c000.gz.parquet)

# Global Libraries

In [1]:
from IPython.display import display
import pandas as pd
import numpy as np
import gc

# Data Exploration 



## Data Loading

In [4]:
df = pd.read_parquet("./data/part-00000-802c11a9-bd5f-4d25-a78f-7a021d95d575.c000.gz.parquet")

## Data Shape

Simply print the shape of the dataframe to understand the number of records and features.

In [5]:
data_shape = {"rows": [df.shape[0]], "cols": [df.shape[1]]}
display(pd.DataFrame(data_shape))
del data_shape

Unnamed: 0,rows,cols
0,1235437,30


## Data Sampling

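Sample one value from each column to get a sense of the data structure and content. Each column is sampled independently, so the values shown do not come from the same record.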

In [6]:
sample_data = {
    "col_names": list(df.columns),
    "data": [df[col].sample(n=1).iloc[0] for col in df.columns],
}

display(pd.DataFrame(sample_data))
del sample_data

Unnamed: 0,col_names,data
0,url_surtkey,"ro,ilfovsport)/tag/afumati-2"
1,url,https://k-pop.rocks/artist/1517
2,url_host_name,george-dragon.ro
3,url_host_tld,ro
4,url_host_2nd_last_part,bwear
5,url_host_3rd_last_part,
6,url_host_4th_last_part,
7,url_host_5th_last_part,
8,url_host_registry_suffix,ro
9,url_host_registered_domain,sanrotex.ro


Based on these observations, we will focus on the following columns for our analysis:
- `url_path`: It contains the path component of the URL
- `url_host_registered_domain`: It contains the registered domain of the URL
- `fetch_status`: It indicates the HTTP status code of the URL fetch operation

## Identify Empty Rows

Identify and handle any empty rows in the selected columns to ensure data quality.

In [7]:
columns = ["url_path", "url_host_registered_domain", "fetch_status"]

nan_values = {}
for column in columns:
    nan_values[column] = df[column].isna().sum()

display(pd.DataFrame.from_dict(nan_values, orient="index", columns=["NaN count"]))
del nan_values, columns

Unnamed: 0,NaN count
url_path,0
url_host_registered_domain,0
fetch_status,0


## Check for Duplicates

Count how often each `url_path` value repeats across the dataset. Generic paths such as `/` are expected to recur on many different domains.

In [8]:
dup_counts = df["url_path"].value_counts()
dup_only = dup_counts[dup_counts > 1].reset_index()
dup_only.columns = ["url_path", "count"]
dup_only["percentage_of_total_rows"] = (dup_only["count"] / len(df) * 100).round(6)

print(f"Number of distinct duplicated url_path values: {len(dup_only)}")
print("Top 10 duplicated url_path values and counts:")
display(dup_only.head(10))
del dup_counts, dup_only

Number of distinct duplicated url_path values: 81718
Top 10 duplicated url_path values and counts:


Unnamed: 0,url_path,count,percentage_of_total_rows
0,/,68155,5.516671
1,/index.php,13075,1.05833
2,/Catalogul-Bibliotecii-BNR-10393.aspx,3188,0.258046
3,/video/,2380,0.192644
4,/ucp.php,2307,0.186736
5,/url,2232,0.180665
6,/drept/culegeri-de-jurisprudenta-si-legislatie,1920,0.155411
7,/e107_plugins/forum/forum_viewtopic.php,1708,0.138251
8,/advanced_search_result.php,1496,0.121091
9,/file.php,1303,0.105469


## Path Depth Stats

Statistics on the depth of URL paths to understand the distribution of path lengths.

In [9]:
depth_data = df["url_path"].str.count("/").describe()
display(depth_data.to_frame())
del depth_data

Unnamed: 0,url_path
count,1235437.0
mean,2.592459
std,1.488754
min,0.0
25%,2.0
50%,2.0
75%,3.0
max,32.0
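
Counting slashes is only a proxy for depth: the root path `/` already counts as 1, and a trailing slash adds one more. The sketch below compares the slash count with the number of non-empty segments on a few made-up paths. Both helper names are hypothetical and introduced here only for illustration.

```python
def slash_depth(path):
    """Depth as measured in the cell above: number of '/' characters."""
    return path.count("/")


def segment_depth(path):
    """Alternative measure: number of non-empty path segments."""
    return len([s for s in path.split("/") if s])


samples = ["/", "/index.php", "/video/", "/a/b/c"]
for p in samples:
    print(p, slash_depth(p), segment_depth(p))
```

On these samples the two measures agree only for `/a/b/c`; the other three paths each have a slash count one higher than their segment count.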


## Domain Frequency Stats

Understand the frequency distribution of registered domains in the dataset. This can help identify popular domains and potential biases in the data.

In [10]:
domain_counts = (
    df.groupby("url_host_registered_domain")
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
domain_counts["percentage"] = (domain_counts["count"] / len(df) * 100).round(4)

print(f"Unique domains: {len(domain_counts)}\n")
print("Top 20 domains by row count:")

display(domain_counts.head(20))

del domain_counts

Unique domains: 58265

Top 20 domains by row count:


Unnamed: 0,url_host_registered_domain,count,percentage
32074,monitorulsv.ro,18433,1.492
2665,carzz.ro,11932,0.9658
19364,gov.ro,11093,0.8979
21465,hotnews.ro,7837,0.6344
54048,ubbcluj.ro,7632,0.6178
25176,k-pop.rocks,7188,0.5818
8947,dcmedical.ro,6993,0.566
10200,digi24.ro,6541,0.5294
19293,google.ro,5699,0.4613
46720,sfatulmedicului.ro,5467,0.4425


## Status Code Frequency Stats

404 responses are particularly important as they indicate missing pages, which are irrelevant for our directory prediction task.

In [11]:
# Count rows per HTTP status code
status_counts = df["fetch_status"].value_counts().reset_index()
status_counts.columns = ["status_code", "count"]
print("Top 20 status codes by row count:")
display(status_counts.head(20))

del status_counts

Top 20 status codes by row count:


Unnamed: 0,status_code,count
0,301,534149
1,404,397027
2,302,179965
3,304,32526
4,403,29355
5,308,13048
6,410,12605
7,500,10957
8,307,7412
9,303,5137


In [12]:
del df
gc.collect()

1629

Things to consider:
- None of the columns of interest contains null values.
- Many `url_path` values are duplicated: 81,718 distinct paths occur more than once.
- The 75th percentile of the slash count (a proxy for path depth) is 3 and the maximum is 32, so abnormally deep paths are rare.
- A large share of rows returned a 404 status code: 397,027 of 1,235,437, about 32%.

Preprocessing steps based on the analysis above:

1. Keep only the relevant columns.
2. Shuffle the dataset.
3. Normalize the paths by converting them to lowercase and stripping whitespace.
4. Exclude paths that are too deep (e.g., more than 10 slashes) to avoid outliers.
5. Remove rows with 404 status codes to focus on existing directories.
6. Remove files and query parameters from the paths to focus on directory structures.
7. Remove duplicate entries in the `url_path` column.
8. Remove directory names that mostly contain non-alphabetic characters (e.g., `/12345`, `/!@#$%`), are too short (e.g., less than 3 characters), or are too long (e.g., more than 100 characters).
9. Remove directory names that contain more than 6 digits to avoid noise from auto-generated directories.
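
Steps 3 and 6 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used later in the notebook: the function name `normalize_path` is hypothetical, and treating a final segment that contains a dot as a file name is a heuristic chosen for this example.

```python
def normalize_path(path):
    """Lowercase, trim, drop query/fragment, and drop a trailing file name."""
    path = path.strip().lower()
    # Cut off anything after '?' or '#' (assumption: raw paths may still carry them)
    for sep in ("?", "#"):
        path = path.split(sep, 1)[0]
    segments = [s for s in path.split("/") if s]
    # Assumption: a final segment containing a dot is a file such as index.php
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    return "/" + "/".join(segments)


print(normalize_path("  /Blog/Posts/Index.PHP?page=2 "))  # /blog/posts
print(normalize_path("/video/"))                          # /video
print(normalize_path("/"))                                # /
```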


# Data Preprocessing

## Libraries

In [15]:
from collections import Counter

## Path Preprocessing

In [16]:
df = pd.read_parquet("./data/part-00000-802c11a9-bd5f-4d25-a78f-7a021d95d575.c000.gz.parquet")

In [17]:
# Take valuable columns
print(f"Original shape: {df.shape}")
df_clean = df[["url_host_registered_domain", "url_path", "fetch_status"]]

Original shape: (1235437, 30)


Shuffling removes any ordering inherited from the source file, so that subsets taken later (for example, the first 10,000 paths used to train the LSTM) are not dominated by a handful of domains. A fixed `random_state` keeps the shuffle reproducible.

In [18]:
# Shuffle dataset
df_clean = df_clean.sample(frac=1, random_state=1)
print(f"Shuffled: {df_clean.shape}")

Shuffled: (1235437, 3)


Part of data normalization is to convert all paths to lowercase.

In [19]:
# Convert all paths to lowercase
df_clean["url_path"] = df_clean["url_path"].str.lower()
print(f"To lowercase: {df_clean.shape}")

To lowercase: (1235437, 3)


In [20]:
# Remove 404
df_clean = df_clean[df_clean["fetch_status"] != 404]
print(f"404 removed: {df_clean.shape}")

404 removed: (838410, 3)


## Tokenization

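Most machine learning models require numerical input, so the directory names must eventually be mapped to integers. As a first step, each path is split into a list of directory-name tokens; the integer encoding itself happens later, in the model training section.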

In [21]:
# Validate the path by checking if it has the "/" character
def validate_path(url):
    return url if isinstance(url, str) and "/" in url else None

In [22]:
# Remove surrounding whitespace and leading/trailing "/" characters
def clean_slashes(url):
    url = url.strip().strip("/")

    return url if url else None

In [23]:
# Separate directory names into a list
def tokenize(url):
    return url.split("/")

In [24]:
# Apply changes using a pipeline
def url_tokenize_pipeline(url):
    url = validate_path(url)
    if not url: return None

    url = clean_slashes(url)
    if not url: return None

    return tokenize(url)

df_clean["url_path"] = df_clean["url_path"].apply(url_tokenize_pipeline)
df_clean = df_clean.dropna(subset=["url_path"])

In [25]:
# Display sample of the data
print(f"Tokenized: {df_clean.shape}")
display(df_clean.head(5))

Tokenized: (774661, 3)


Unnamed: 0,url_host_registered_domain,url_path,fetch_status
4220,bnr.ro,[catalogul-bibliotecii-bnr-10393.aspx],302
711011,nutriland.ro,"[products, vit-min-powder-pudra-150g-ostrovit-...",302
825627,profismile.ro,[cnn1122368.htm],301
1100122,tunetanken.ro,"[categorie, agro, constructie-agro, produse-pe...",503
834381,protv.ro,"[emisiuni, 19-stirile-pro-tv, episodul, 41221-...",403
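
The following usage sketch restates the three helpers above as a single function and runs it on made-up inputs to show which values are dropped. The name `url_tokenize` and the sample inputs are assumptions introduced for illustration. Note that a doubled slash produces an empty token, which any later filter has to be prepared for.

```python
def url_tokenize(path):
    """Compact restatement of validate_path, clean_slashes and tokenize."""
    # Reject non-strings and strings without any "/"
    if not isinstance(path, str) or "/" not in path:
        return None
    # Trim whitespace, then leading/trailing slashes
    path = path.strip().strip("/")
    # An empty result (e.g. the root path "/") carries no directory names
    return path.split("/") if path else None


for raw in ["/products/vitamins/", "/", "no-slash", None, "/a//b"]:
    print(repr(raw), "->", url_tokenize(raw))
```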


## Local Token Filtering

Token-level filtering keeps the model focused on genuine directory names. Removing file names, identifiers, and other irrelevant tokens reduces noise in the data and can improve the model's performance.

In [26]:
# Keep only tokens made of letters, digits, "-", "_" and "%" (this also drops file names, since "." is not allowed)
def is_allowed_char(c):
    return c.isalnum() or c in "-_%"

def filter_allowed_chars(tokens):
    return [d for d in tokens if all(is_allowed_char(c) for c in d)]

Long directory names are likely to be less common. Examples include session IDs or unique identifiers that do not generalize well.

In [27]:
# Remove long directory names
def filter_length(tokens, max_len=15):
    return [d for d in tokens if len(d) <= max_len]

Similarly, tokens with a high digit count are often not meaningful directory names.

In [28]:
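# Remove long numeric IDs: purely numeric tokens longer than 6 characters,
# and tokens longer than 10 characters that are at least 50% digits
def is_mostly_digits(s, threshold=0.5):
    digit_count = sum(c.isdigit() for c in s)
    return (digit_count / len(s)) >= threshold

def filter_digits(tokens):
    # The length check runs first, so is_mostly_digits never sees an empty token
    return [
        d for d in tokens
        if not (d.isdigit() and len(d) > 6) and not (len(d) > 10 and is_mostly_digits(d))
    ]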

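Deep directory paths can introduce noise and complexity to the model. The helper below truncates a path to its first `max_depth` tokens; note that `token_filter_pipeline` does not call it, so no truncation is applied to the results shown here.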

In [29]:
# Limit paths that are too deep
def limit_depth(tokens, max_depth=10):
    return tokens[:max_depth]

In [30]:
# Apply changes using a pipeline
def token_filter_pipeline(tokens):
    tokens = filter_allowed_chars(tokens)
    tokens = filter_length(tokens)
    tokens = filter_digits(tokens)

    return tokens if tokens else None

df_clean["url_path"] = df_clean["url_path"].apply(token_filter_pipeline)
df_clean = df_clean.dropna(subset=["url_path"])

In [31]:
# Display a sample of the data
print(f"Local Token Filtering: {df_clean.shape}")
display(df_clean.head(5))

Local Token Filtering: (551714, 3)


Unnamed: 0,url_host_registered_domain,url_path,fetch_status
711011,nutriland.ro,[products],302
1100122,tunetanken.ro,"[categorie, agro, steps]",503
834381,protv.ro,"[emisiuni, episodul]",403
270277,epson.ro,"[ro_ro, produse, op%c8%9biuni, standard, p, 10...",301
25368,buhnici.ro,[tag],301
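
As a worked example, the sketch below folds the three local filters into one predicate and applies it to a made-up path. The name `keep_token` and the sample tokens are assumptions. Unlike the cells above, this version also rejects empty tokens, which is a choice made for the example only.

```python
def keep_token(token, max_len=15):
    """Return True if a token passes the character, length and digit filters."""
    if not token or not all(c.isalnum() or c in "-_%" for c in token):
        return False  # empty, or contains a disallowed character such as "."
    if len(token) > max_len:
        return False  # too long
    if token.isdigit() and len(token) > 6:
        return False  # long, purely numeric identifier
    digits = sum(c.isdigit() for c in token)
    if len(token) > 10 and digits / len(token) >= 0.5:
        return False  # long and at least half digits
    return True


path = ["emisiuni", "19-stirile-pro-tv", "episodul", "1234567", "index.php", "2024"]
print([t for t in path if keep_token(t)])  # ['emisiuni', 'episodul', '2024']
```

Short numeric tokens such as `2024` survive these filters, which is consistent with year-like directory names appearing later in the notebook's outputs.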


## Global Token Filtering

This step is done separately as it requires calculating global statistics across the entire dataset.

In [32]:
df_grouped = df_clean.groupby("url_host_registered_domain")["url_path"].sum().reset_index()

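Each token is counted once per registered domain, and tokens that appear on fewer than `min_domains` distinct domains (10 by default) are removed. Dropping these rare tokens is one of the most effective ways to reduce noise in the dataset, because it focuses the model on directory names that generalize across websites.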

In [33]:
# Create global frequency counter
global_freqs = Counter(token for tokens in df_grouped["url_path"] for token in set(tokens))

# Remove infrequent tokens
def filter_by_min_domains(tokens, global_freqs, min_domains=10):
    filtered_tokens = [t for t in tokens if global_freqs.get(t, 0) >= min_domains]

    return filtered_tokens if filtered_tokens else None

# Apply changes using a pipeline
df_clean["url_path"] = df_clean["url_path"].apply(lambda tokens: filter_by_min_domains(tokens, global_freqs))

# Remove rows with empty url_path after filtering
df_clean = df_clean.dropna(subset=["url_path"])

print(df_clean.shape)

(409086, 3)
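
The domain-level counting can be illustrated on toy data. The domains, tokens, and the lowered threshold below are assumptions chosen so that the effect is visible on three sites; the notebook itself uses `min_domains=10`.

```python
from collections import Counter

# Hypothetical per-domain token lists (one list per registered domain)
site_tokens = {
    "a.example": ["blog", "blog", "tag", "promo-2024"],
    "b.example": ["blog", "contact"],
    "c.example": ["blog", "tag"],
}

# Count each token once per domain, regardless of how often it repeats there
domain_freq = Counter(tok for toks in site_tokens.values() for tok in set(toks))

min_domains = 2  # lowered from 10 for the toy data
kept = sorted(t for t, n in domain_freq.items() if n >= min_domains)

print(domain_freq["blog"], domain_freq["tag"], domain_freq["promo-2024"])  # 3 2 1
print(kept)  # ['blog', 'tag']
```

Because of the `set(toks)` step, `blog` counts as 3 even though it occurs four times in total.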


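Dropping rare tokens shrinks the vocabulary, which can improve the efficiency of the model used for lateral directory prediction. The next cell compares the total number of token occurrences before and after the filter: 1,045,851 versus 631,971, a reduction of about 40%.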

In [34]:
# Display how many tokens were removed
print(df_grouped["url_path"].apply(len).sum())
print(df_clean["url_path"].apply(len).sum())

1045851
631971


# Model Training

**Hierarchical model**

LSTM was chosen due to its effectiveness in handling sequential data, such as directory paths.

**Lateral model**

A co-occurrence-based recommender system was selected for its ability to suggest directories based on how often they appear together across different websites.

## Libraries

In [35]:
from itertools import chain
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import re

2025-10-26 16:59:38.963278: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-10-26 16:59:39.497627: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-10-26 16:59:42.031176: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.


## Lateral Directory Prediction

From the preprocessed data, we will take only the top-level directories for training the lateral directory prediction model. Then, each token will be grouped by its registered domain to create a list of unique top-level directories for each website.

In [36]:
df_clean_grouped = df_clean.copy()
df_clean_grouped["first_path"] = df_clean_grouped["url_path"].str[0]
df_clean_grouped = df_clean_grouped.drop(columns=["url_path", "fetch_status"])
df_clean_grouped = df_clean_grouped.groupby("url_host_registered_domain")["first_path"].apply(set).reset_index()
df_clean_grouped = df_clean_grouped["first_path"].tolist()

print(df_clean_grouped[:10])

[{'tagged'}, {'author', 'home'}, {'2022'}, {'blog', 'tag', 'register'}, {'artist'}, {'2019', '2016', '2021', '2017', '2018', '2015', '2020'}, {'ro_ro'}, {'blog', 'pages'}, {'page'}, {'category'}]


In [37]:
# Number of unique dir names
unique_dirs = {d for sublist in df_clean_grouped for d in sublist}
print(len(unique_dirs))

1563


The co-occurrence matrix is built by counting how often each pair of top-level directories appears together on the same website. This matrix is the foundation of the recommender: given the set of directories already known for a website, the model sums the co-occurrence counts of every other directory with the known ones and returns the highest-scoring candidates. The `top_k` parameter controls how many recommendations are returned.

In [38]:
from collections import defaultdict, Counter

# Build index
cooccur = defaultdict(Counter)
dir_counts = Counter()

for site in df_clean_grouped:
    dirs = list(set(site))
    dir_counts.update(dirs)
    for i, d1 in enumerate(dirs):
        for d2 in dirs[i+1:]:
            cooccur[d1][d2] += 1
            cooccur[d2][d1] += 1

print(f"Indexed {len(dir_counts):,} directories")

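# Recommend directories that co-occur with the ones already known
def recommend(input_dirs, top_k=100):
    known = set(input_dirs)
    scores = Counter()
    for d in input_dirs:
        # .get avoids adding unseen directories to the defaultdict
        for related, count in cooccur.get(d, Counter()).most_common(top_k * 2):
            if related not in known:
                scores[related] += count
    return scores.most_common(top_k)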

Indexed 1,563 directories
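
A toy run makes the scoring easier to follow. The three sites below are made up, and `toy_cooccur` and `toy_recommend` are hypothetical names used so that the sketch does not clash with the notebook's own `cooccur` and `recommend`. Ties are broken alphabetically here, which is a choice made for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical top-level directory sets, one per website
sites = [
    {"blog", "tag", "author"},
    {"blog", "tag"},
    {"blog", "contact"},
]

# Symmetric pair counts: how many sites contain both directories
toy_cooccur = defaultdict(Counter)
for dirs in sites:
    for d1, d2 in combinations(sorted(dirs), 2):
        toy_cooccur[d1][d2] += 1
        toy_cooccur[d2][d1] += 1


def toy_recommend(input_dirs, top_k=3):
    known = set(input_dirs)
    scores = Counter()
    for d in known:
        for related, count in toy_cooccur.get(d, {}).items():
            if related not in known:
                scores[related] += count
    # Sort by score, then by name, so ties are resolved deterministically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


print(toy_recommend(["blog"]))  # [('tag', 2), ('author', 1), ('contact', 1)]
```

`tag` ranks first because it shares two sites with `blog`, while `author` and `contact` share one each.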


## Hierarchical Directory Prediction Model

The steps to train the hierarchical model using LSTM are as follows:
1. **Data Preparation**: The preprocessed directory paths are tokenized and converted into sequences of integers. Each sequence represents a path, with each token corresponding to a directory name.
2. **Model Architecture**: An LSTM model is designed with an embedding layer to convert tokens into dense vectors.
3. **Training**: The model is trained using the prepared sequences.
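
The data preparation step can be sketched without TensorFlow. The example below turns one tokenized path into prefix/target pairs and left-pads the prefixes, mimicking what `pad_sequences(..., padding="pre")` does. The helper names `build_pairs` and `pad_pre` and the three-word vocabulary are assumptions for illustration only.

```python
def build_pairs(path, token_to_idx):
    """Turn one tokenized path into (prefix, target) integer pairs."""
    ids = [token_to_idx[t] for t in path]
    # Each prefix of length i predicts the token at position i
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]


def pad_pre(seq, maxlen, pad_value=0):
    """Left-pad (or left-truncate) a sequence to exactly maxlen items."""
    seq = seq[-maxlen:]
    return [pad_value] * (maxlen - len(seq)) + seq


# Hypothetical mapping; index 0 is reserved for padding
vocab = {"blog": 1, "tag": 2, "news": 3}

pairs = build_pairs(["blog", "tag", "news"], vocab)
print(pairs)                                               # [([1], 2), ([1, 2], 3)]
print([pad_pre(prefix, maxlen=3) for prefix, _ in pairs])  # [[0, 0, 1], [0, 1, 2]]
```

A path with a single token yields no pairs, so such paths contribute nothing to training.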

In [39]:
# Limit to 10,000
tokenized_paths = df_clean["url_path"].tolist()[:10000]

# Collect the unique tokens (sorted so the index mapping is reproducible across runs)
all_tokens = sorted(set(chain.from_iterable(tokenized_paths)))

# Assign indices
token_to_idx = {
    token: idx + 1 for idx, token in enumerate(all_tokens)
}

# Reverse mapping
idx_to_token = {idx: token for token, idx in token_to_idx.items()}

# Get vocab size
vocab_size = len(token_to_idx) + 1

# Prepare input and target sequences
input_seqs = []
target_tokens = []

# Build sequences
for path in tokenized_paths:
    for i in range(1, len(path)):
        prefix = [token_to_idx[t] for t in path[:i]]
        target = token_to_idx[path[i]]
        input_seqs.append(prefix)
        target_tokens.append(target)

max_len = max(len(seq) for seq in input_seqs)

# Pad sequences
X = pad_sequences(input_seqs, maxlen=max_len, padding="pre")

# Target array
y = np.array(target_tokens)

# Build model
model = Sequential(
    [
        # Embedding layer
        Embedding(input_dim=vocab_size, output_dim=32, input_length=max_len),
        # LSTM layer
        LSTM(64),
        # Output layer
        Dense(vocab_size, activation="softmax"),
    ]
)

# Compile and summarize
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()

# Helper to detect numeric tokens
def is_numeric(token):
    return re.fullmatch(r"\d+", token) is not None

# Build sample weights based on target token type
sample_weights = np.array([
    0.1 if is_numeric(idx_to_token[t]) else 1.0
    for t in y
])

# Train the model with sample weights
model.fit(X, y, epochs=20, batch_size=32, sample_weight=sample_weights)

2025-10-26 16:59:45.333236: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


Epoch 1/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - loss: 4.4097
Epoch 2/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.8826
Epoch 3/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.7580
Epoch 4/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.6737
Epoch 5/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.5925
Epoch 6/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.4974
Epoch 7/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.3801
Epoch 8/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.2608
Epoch 9/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.1511
Epoch 10/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - lo

<keras.src.callbacks.history.History at 0x7f951b803ce0>

# Model Evaluation

## Hierarchical Model Evaluation

In [42]:
def predict_hierarchical_tf(model, input_path, top_k=50):
    # Tokens missing from the vocabulary are skipped
    tokens = [token_to_idx[t] for t in input_path if t in token_to_idx]
    padded = pad_sequences([tokens], maxlen=max_len, padding="pre")
    probs = model.predict(padded, verbose=0)[0]
    top_idx = probs.argsort()[-top_k:][::-1]
    # Index 0 is the padding slot and has no token, so skip it
    return [idx_to_token[i] for i in top_idx if i in idx_to_token]


input_path = ["admin"]
suggestions = predict_hierarchical_tf(model, input_path)
print(suggestions)

['attachment', 'products', 'bilete', 'search', 'product', 'category', 'stiri', 'details', 'news', 'produs', 'page', 'node', 'p', 'cart', 'educatie', 'bucuresti', 'home', 'ro', 'tickets', 'produse', 'articol', 'pages', 'special', 'timisoara', 'oferta', 'artist', 'frumusete', 'articole', 'content', 'wishlist', 'iasi', 'c', 'ten', 'activitate', 'store', 'blog', 'fara-categorie', 'google', 'filtre', 'games', 'literatura', 'valcea', 'author', 'tag', 'tags', 'cercetare', 'national', 'wellness', 'collections', 'contact']
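
The ranking step in the prediction function boils down to taking the indices with the highest probability and mapping them back to tokens. The sketch below shows the same idea in plain Python on a made-up probability vector; `top_k_tokens`, the probabilities, and the four-token vocabulary are all assumptions for illustration.

```python
def top_k_tokens(probs, idx_to_token, k=3):
    """Return the k most probable tokens, skipping indices without a token."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Index 0 is the padding slot, so it has no entry in idx_to_token
    return [idx_to_token[i] for i in ranked if i in idx_to_token][:k]


# Hypothetical softmax output over a 5-entry vocabulary (index 0 = padding)
probs = [0.30, 0.05, 0.40, 0.10, 0.15]
idx_to_token = {1: "news", 2: "category", 3: "tag", 4: "page"}

print(top_k_tokens(probs, idx_to_token))  # ['category', 'page', 'tag']
```

Even though the padding index carries the second-highest probability in this toy vector, it is skipped because it does not correspond to a directory name.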


## Lateral Model Evaluation

In [41]:
print(recommend(['admin', 'user'], 100))

[('en', 20), ('course', 20), ('wp-content', 13), ('ro', 11), ('anunturi', 10), ('search', 9), ('files', 9), ('tag', 8), ('calendar', 8), ('author', 8), ('issue', 8), ('studenti', 8), ('pdf', 7), ('node', 7), ('category', 7), ('proiecte', 6), ('blog', 6), ('login', 6), ('2018', 6), ('article', 6), ('admitere', 6), ('component', 6), ('docs', 6), ('contact', 5), ('2021', 5), ('page', 5), ('evenimente', 5), ('fr', 5), ('documente', 5), ('mod', 5), ('content', 5), ('account', 5), ('2017', 5), ('2015', 5), ('cercetare', 5), ('de', 4), ('2023', 4), ('2019', 4), ('educatie', 4), ('imobiliare', 4), ('logout', 4), ('stiri', 4), ('departamente', 4), ('2022', 4), ('2016', 4), ('news', 3), ('documents', 3), ('2024', 3), ('despre-noi', 3), ('hu', 3), ('noutati', 3), ('cursuri', 3), ('arhiva', 3), ('forum', 3), ('media', 3), ('afaceri', 3), ('bucuresti', 3), ('2', 3), ('-', 3), ('frontend', 3), ('about', 3), ('2025', 3), ('publicatii', 3), ('organizare', 3), ('pages', 3), ('images', 3), ('site', 3), 