# Problem Definition 


The goal is to build a system that can predict potential directories on a website based on its existing structure. This can be useful for web crawling, SEO analysis, and security assessments.

The system will predict both lateral and hierarchical directories:
- Lateral discovery: It will identify top-level directories that are likely to be present on a similar website (e.g., `/about`, `/contact`).
- Hierarchical discovery: It predicts subdirectories based on the parent directory (e.g., `/parent/child`).
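To make the two targets concrete, here is a minimal sketch that derives both from a handful of paths. The example paths and the helper name `split_targets` are made up for illustration and are not part of the dataset or of the cells that follow.

```python
from collections import defaultdict


def split_targets(paths):
    """Return (lateral, hierarchical) targets for the paths of one site."""
    lateral = set()                  # top-level directory names
    hierarchical = defaultdict(set)  # parent directory -> child directories
    for path in paths:
        segments = [s for s in path.split("/") if s]
        if not segments:
            continue  # the bare root "/" carries no directory name
        lateral.add(segments[0])
        for parent, child in zip(segments, segments[1:]):
            hierarchical[parent].add(child)
    return lateral, dict(hierarchical)


# Hypothetical paths from a single site
lateral, hierarchical = split_targets(["/about", "/contact", "/blog/2024", "/blog/tags"])
print(sorted(lateral))
print({parent: sorted(children) for parent, children in hierarchical.items()})
```

Lateral discovery works on the first set, hierarchical discovery on the parent-to-child mapping.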

# Data Collection 

Origin of the dataset: [Common Crawl - Open Repository of Web Crawl Data](https://commoncrawl.org)

Chosen subset for analysis: [Common Crawl Columnar URL Index Files](https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-38/subset=warc/part-00000-165bfb83-c006-44f2-a121-3d8ae730bc93.c000.gz.parquet)

# Global Libraries

In [1]:
from IPython.display import display
import pandas as pd
import numpy as np
import gc

# Data Exploration 



## Data Loading

In [2]:
df = pd.read_parquet("../data/part-00000-802c11a9-bd5f-4d25-a78f-7a021d95d575.c000.gz.parquet")

## Data Shape

In [3]:
data_shape = {"rows": [df.shape[0]], "cols": [df.shape[1]]}
display(pd.DataFrame(data_shape))
del data_shape

Unnamed: 0,rows,cols
0,1235437,30


## Data Sampling

In [4]:
sample_data = {
    "col_names": list(df.columns),
    "data": [df[col].sample(n=1).iloc[0] for col in df.columns],
}

display(pd.DataFrame(sample_data))
del sample_data

Unnamed: 0,col_names,data
0,url_surtkey,"ro,mania)/p/tricou-cu-imprimeu-tommy-hilfiger-..."
1,url,https://www.magazintuning.ro/maxton-design/se-...
2,url_host_name,wizardhr.ro
3,url_host_tld,ro
4,url_host_2nd_last_part,vhdforum
5,url_host_3rd_last_part,www
6,url_host_4th_last_part,
7,url_host_5th_last_part,
8,url_host_registry_suffix,ro
9,url_host_registered_domain,holdem.ro


Based on these observations, we will focus on the following columns for our analysis:
- `url_path`: It contains the path component of the URL
- `url_host_registered_domain`: It contains the registered domain of the URL
- `fetch_status`: It indicates the HTTP status code of the URL fetch operation

## Identify Empty Rows

In [5]:
columns = ["url_path", "url_host_registered_domain", "fetch_status"]

nan_values = {}
for column in columns:
    nan_values[column] = df[column].isna().sum()

display(pd.DataFrame.from_dict(nan_values, orient="index", columns=["NaN count"]))
del nan_values, columns

Unnamed: 0,NaN count
url_path,0
url_host_registered_domain,0
fetch_status,0


## Check for Duplicates

In [6]:
dup_counts = df["url_path"].value_counts()
dup_only = dup_counts[dup_counts > 1].reset_index()
dup_only.columns = ["url_path", "count"]
dup_only["percentage_of_total_rows"] = (dup_only["count"] / len(df) * 100).round(6)

print(f"Number of distinct duplicated url_path values: {len(dup_only)}")
print("Top 10 duplicated url_path values and counts:")
display(dup_only.head(10))
del dup_counts, dup_only

Number of distinct duplicated url_path values: 81718
Top 10 duplicated url_path values and counts:


Unnamed: 0,url_path,count,percentage_of_total_rows
0,/,68155,5.516671
1,/index.php,13075,1.05833
2,/Catalogul-Bibliotecii-BNR-10393.aspx,3188,0.258046
3,/video/,2380,0.192644
4,/ucp.php,2307,0.186736
5,/url,2232,0.180665
6,/drept/culegeri-de-jurisprudenta-si-legislatie,1920,0.155411
7,/e107_plugins/forum/forum_viewtopic.php,1708,0.138251
8,/advanced_search_result.php,1496,0.121091
9,/file.php,1303,0.105469


## Path Depth Stats

In [7]:
depth_data = df["url_path"].str.count("/").describe()
display(depth_data.to_frame())
del depth_data

Unnamed: 0,url_path
count,1235437.0
mean,2.592459
std,1.488754
min,0.0
25%,2.0
50%,2.0
75%,3.0
max,32.0
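Counting slashes is only a proxy for depth: a trailing slash adds one to the count without adding a directory. The sketch below contrasts the slash count with the number of non-empty segments. The first three paths come from the duplicates table above; `/a/b/c/` and both helper names are hypothetical.

```python
def slash_count(path):
    # Same measure as str.count("/") in the cell above
    return path.count("/")


def segment_count(path):
    # Number of non-empty components between slashes
    return len([s for s in path.split("/") if s])


for p in ["/", "/video/", "/drept/culegeri-de-jurisprudenta-si-legislatie", "/a/b/c/"]:
    print(p, slash_count(p), segment_count(p))
```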


## Domain Frequency Stats

In [8]:
domain_counts = (
    df.groupby("url_host_registered_domain")
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
domain_counts["percentage"] = (domain_counts["count"] / len(df) * 100).round(4)

print(f"Unique domains: {len(domain_counts)}\n")
print("Top 20 domains by row count:")

display(domain_counts.head(20))

del domain_counts

Unique domains: 58265

Top 20 domains by row count:


Unnamed: 0,url_host_registered_domain,count,percentage
32074,monitorulsv.ro,18433,1.492
2665,carzz.ro,11932,0.9658
19364,gov.ro,11093,0.8979
21465,hotnews.ro,7837,0.6344
54048,ubbcluj.ro,7632,0.6178
25176,k-pop.rocks,7188,0.5818
8947,dcmedical.ro,6993,0.566
10200,digi24.ro,6541,0.5294
19293,google.ro,5699,0.4613
46720,sfatulmedicului.ro,5467,0.4425


## Status Code Frequency Stats

In [9]:
# Count rows per HTTP fetch status
status_counts = df["fetch_status"].value_counts().reset_index()
status_counts.columns = ["status_code", "count"]
print("Top 20 status codes by row count:")
display(status_counts.head(20))

del status_counts

Top 20 status codes by row count:


Unnamed: 0,status_code,count
0,301,534149
1,404,397027
2,302,179965
3,304,32526
4,403,29355
5,308,13048
6,410,12605
7,500,10957
8,307,7412
9,303,5137


In [10]:
del df
gc.collect()

44

Things to consider:
- None of the columns of interest contain null values.
- There are many duplicate entries in the `url_path` column (81,718 distinct values appear more than once).
- The 75th percentile of the slash count (a proxy for path depth) is 3, so abnormally deep paths are rare, although the maximum reaches 32.
- A significant share of paths (397,027 rows, about 32%) respond with a 404 status code.
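The share of each status class can be checked directly from the counts printed in the status code table. This sketch hard-codes the ten counts shown there and the total row count from the data shape cell. Codes that were cut off from the displayed table are not included, so the shares are relative to all rows rather than to the listed codes.

```python
total_rows = 1_235_437

# Counts copied from the status code table above (only the ten rows shown)
status_counts = {
    301: 534_149, 404: 397_027, 302: 179_965, 304: 32_526, 403: 29_355,
    308: 13_048, 410: 12_605, 500: 10_957, 307: 7_412, 303: 5_137,
}

share_404 = status_counts[404] / total_rows * 100
count_3xx = sum(n for code, n in status_counts.items() if 300 <= code < 400)
share_3xx = count_3xx / total_rows * 100

print(f"404 share: {share_404:.2f}%")
print(f"3xx share: {share_3xx:.2f}%")
```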

Preprocessing steps based on the analysis above:

1. Keep only the relevant columns.
2. Shuffle the dataset.
3. Normalize the paths by converting them to lowercase and stripping whitespace.
4. Exclude paths that are too deep (e.g., more than 10 slashes) to avoid outliers.
5. Remove rows with 404 status codes to focus on existing directories.
6. Remove files and query parameters from the paths to focus on directory structures.
7. Remove duplicate entries in the `url_path` column.
8. Remove directory names that mostly contain non-alphabetic characters (e.g., `/12345`, `/!@#$%`), are too short (e.g., fewer than 3 characters), or are too long (e.g., more than 100 characters).
9. Add the word *root* to represent the root directory `/` for better model understanding.
10. Remove directory names that contain more than 6 digits to avoid noise from auto-generated directories.
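The steps on removing files and query parameters and on adding a *root* token could look roughly like the sketch below. It is one possible reading of those steps, not the implementation used in the cells that follow. The helper name `path_to_dirs` is hypothetical, and the rule "a last segment containing a dot is a file" is an assumed heuristic.

```python
from urllib.parse import urlsplit


def path_to_dirs(path, root_token="root"):
    """Turn a URL path into a list of directory names prefixed by a root token."""
    # urlsplit separates any query string or fragment from the path component
    clean = urlsplit(path.strip().lower()).path
    segments = [s for s in clean.split("/") if s]
    # Assumed heuristic: a final segment with a dot (and no trailing slash) is a file
    if segments and "." in segments[-1] and not clean.endswith("/"):
        segments = segments[:-1]
    return [root_token] + segments


# Hypothetical inputs
print(path_to_dirs("/Blog/2024/post.html?id=3"))
print(path_to_dirs("/shop/items/"))
print(path_to_dirs("/index.php"))
```

A dot-based heuristic will misclassify directory names that contain dots, so the rule would need checking against real paths before use.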

# Data Preprocessing

## Libraries

In [92]:
from collections import Counter

## Path Preprocessing

In [93]:
df = pd.read_parquet("../data/part-00000-802c11a9-bd5f-4d25-a78f-7a021d95d575.c000.gz.parquet")

In [94]:
# Take valuable columns
print(f"Original shape: {df.shape}")
df_clean = df[["url_host_registered_domain", "url_path", "fetch_status"]]

Original shape: (1235437, 30)


In [95]:
# Shuffle dataset
df_clean = df_clean.sample(frac=1, random_state=1)
print(f"Shuffled: {df_clean.shape}")

Shuffled: (1235437, 3)


In [96]:
# Convert all paths to lowercase
df_clean["url_path"] = df_clean["url_path"].str.lower()
print(f"To lowercase: {df_clean.shape}")

To lowercase: (1235437, 3)


In [97]:
# Remove 404
df_clean = df_clean[df_clean["fetch_status"] != 404]
print(f"404 removed: {df_clean.shape}")

404 removed: (838410, 3)


## Tokenization

In [98]:
def validate_path(url):
    return url if isinstance(url, str) and "/" in url else None

In [99]:
def clean_slashes(url):
    url = url.strip().strip("/")

    return url if url else None

In [100]:
def tokenize(url):
    return url.split("/")

In [101]:
def url_tokenize_pipeline(url):
    url = validate_path(url)
    if not url: return None

    url = clean_slashes(url)
    if not url: return None

    return tokenize(url)

df_clean["url_path"] = df_clean["url_path"].apply(url_tokenize_pipeline)
df_clean = df_clean.dropna(subset=["url_path"])

In [102]:
print(f"Tokenized: {df_clean.shape}")
display(df_clean.head(5))

Tokenized: (774661, 3)


Unnamed: 0,url_host_registered_domain,url_path,fetch_status
4220,bnr.ro,[catalogul-bibliotecii-bnr-10393.aspx],302
711011,nutriland.ro,"[products, vit-min-powder-pudra-150g-ostrovit-...",302
825627,profismile.ro,[cnn1122368.htm],301
1100122,tunetanken.ro,"[categorie, agro, constructie-agro, produse-pe...",503
834381,protv.ro,"[emisiuni, 19-stirile-pro-tv, episodul, 41221-...",403


## Local Token Filtering

In [103]:
def is_allowed_char(c):
    return c.isalnum() or c in "-_%"

def filter_allowed_chars(tokens):
    return [d for d in tokens if all(is_allowed_char(c) for c in d)]

In [104]:
def filter_length(tokens, max_len=15):
    return [d for d in tokens if len(d) <= max_len]

In [105]:
def is_mostly_digits(s, threshold=0.5):
    # Empty tokens (e.g. from "a//b") would otherwise cause a division by zero
    if not s:
        return False
    digit_count = sum(c.isdigit() for c in s)
    return (digit_count / len(s)) >= threshold

def filter_digits(tokens):
    return [
        d for d in tokens 
        if not (d.isdigit() and len(d) > 6) and not (is_mostly_digits(d) and len(d) > 10)
    ]

In [106]:
def limit_depth(tokens, max_depth=10):
    # Keeps the first max_depth tokens; token_filter_pipeline below does not call this helper
    return tokens[:max_depth]

In [107]:
def token_filter_pipeline(tokens):
    tokens = filter_allowed_chars(tokens)
    tokens = filter_length(tokens)
    tokens = filter_digits(tokens)

    return tokens if tokens else None

df_clean["url_path"] = df_clean["url_path"].apply(token_filter_pipeline)
df_clean = df_clean.dropna(subset=["url_path"])

In [108]:
print(f"Local Token Filtering: {df_clean.shape}")
display(df_clean.head(5))

Local Token Filtering: (551714, 3)


Unnamed: 0,url_host_registered_domain,url_path,fetch_status
711011,nutriland.ro,[products],302
1100122,tunetanken.ro,"[categorie, agro, steps]",503
834381,protv.ro,"[emisiuni, episodul]",403
270277,epson.ro,"[ro_ro, produse, op%c8%9biuni, standard, p, 10...",301
25368,buhnici.ro,[tag],301


## Global Token Filtering

In [109]:
df_grouped = df_clean.groupby("url_host_registered_domain")["url_path"].sum().reset_index()

In [110]:
global_freqs = Counter(token for tokens in df_grouped["url_path"] for token in set(tokens))

def filter_by_min_domains(tokens, global_freqs, min_domains=10):
    filtered_tokens = [t for t in tokens if global_freqs.get(t, 0) >= min_domains]

    return filtered_tokens if filtered_tokens else None

df_clean["url_path"] = df_clean["url_path"].apply(lambda tokens: filter_by_min_domains(tokens, global_freqs))

df_clean = df_clean.dropna(subset=["url_path"])

print(df_clean.shape)

(409086, 3)
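The global filter keeps a token only if it appears on enough distinct domains, which is why each domain's tokens pass through `set()` before counting. The toy sketch below shows the effect on invented data with a lower threshold. The domain names, tokens, and the helper `keep_common` are all hypothetical.

```python
from collections import Counter

# Hypothetical tokens per domain after local filtering
tokens_by_domain = {
    "a.example": ["blog", "tag", "blog", "promo-x"],
    "b.example": ["blog", "contact"],
    "c.example": ["tag", "contact", "blog"],
}

# set() ensures each domain contributes at most one count per token
domain_freq = Counter(
    token for tokens in tokens_by_domain.values() for token in set(tokens)
)


def keep_common(tokens, min_domains=2):
    # Drop tokens seen on fewer than min_domains distinct domains
    return [t for t in tokens if domain_freq[t] >= min_domains]


print(domain_freq["blog"], domain_freq["promo-x"])
print(keep_common(["blog", "promo-x", "tag"]))
```

Here `blog` counts three domains even though it appears four times, and the site-specific `promo-x` is dropped.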


In [111]:
print(df_grouped["url_path"].apply(len).sum())
print(df_clean["url_path"].apply(len).sum())

1045851
631971


In [112]:
display(df_clean.head(10)["url_path"].str[0])
display(df_clean.head(10))

711011       products
1100122     categorie
834381       emisiuni
270277          ro_ro
25368             tag
1132501          2024
288980     wp-content
708381     literatura
500768       taxonomy
137351         produs
Name: url_path, dtype: object

Unnamed: 0,url_host_registered_domain,url_path,fetch_status
711011,nutriland.ro,[products],302
1100122,tunetanken.ro,[categorie],503
834381,protv.ro,[emisiuni],403
270277,epson.ro,"[ro_ro, produse, p]",301
25368,buhnici.ro,[tag],301
1132501,upb.ro,"[2024, user]",303
288980,evaluarea-riscului.ro,"[wp-content, uploads, 2017, 2016]",301
708381,nouasperanta.ro,[literatura],301
500768,jysk.ro,"[taxonomy, term]",301
137351,costumase.ro,[produs],301


# Model Training

## Libraries

In [69]:
from itertools import chain
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
import re



## Lateral Directory Prediction

In [113]:
df_clean_grouped = df_clean.copy()
df_clean_grouped["first_path"] = df_clean_grouped["url_path"].str[0]
df_clean_grouped = df_clean_grouped.drop(columns=["url_path", "fetch_status"])
df_clean_grouped = df_clean_grouped.groupby("url_host_registered_domain")["first_path"].apply(set).reset_index()
df_clean_grouped = df_clean_grouped["first_path"].tolist()

print(df_clean_grouped[:10])

[{'tagged'}, {'author', 'home'}, {'2022'}, {'register', 'blog', 'tag'}, {'artist'}, {'2020', '2017', '2021', '2019', '2018', '2015', '2016'}, {'ro_ro'}, {'blog', 'pages'}, {'page'}, {'category'}]


In [114]:
unique_dirs = {d for sublist in df_clean_grouped for d in sublist}
print(len(unique_dirs))         # Number of unique dir names

1563


In [115]:
from collections import defaultdict, Counter

# Build index
cooccur = defaultdict(Counter)
dir_counts = Counter()

for site in df_clean_grouped:
    dirs = list(set(site))
    dir_counts.update(dirs)
    for i, d1 in enumerate(dirs):
        for d2 in dirs[i+1:]:
            cooccur[d1][d2] += 1
            cooccur[d2][d1] += 1

print(f"Indexed {len(dir_counts):,} directories")

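# Recommend directories that co-occur with the given ones across sites
def recommend(input_dirs, top_k=100):
    known = set(input_dirs)
    scores = Counter()
    for d in input_dirs:
        # .get avoids inserting empty entries into the defaultdict for unseen names
        for related, count in cooccur.get(d, Counter()).most_common(top_k * 2):
            if related not in known:
                scores[related] += count
    return scores.most_common(top_k)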

# Test
print(recommend(['wp-content'], 100))

Indexed 1,563 directories
[('en', 58), ('ro', 45), ('category', 39), ('tag', 25), ('2023', 22), ('2024', 21), ('2022', 21), ('contact', 20), ('2016', 19), ('anunturi', 19), ('2017', 18), ('2018', 18), ('despre-noi', 18), ('documente', 16), ('files', 16), ('2025', 15), ('blog', 15), ('2021', 15), ('stiri', 14), ('page', 14), ('author', 13), ('2015', 13), ('evenimente', 13), ('produs', 13), ('2020', 13), ('2019', 13), ('noutati', 13), ('attachment', 13), ('proiecte', 13), ('fr', 11), ('user', 11), ('cercetare', 11), ('studenti', 11), ('docs', 10), ('download', 10), ('hu', 9), ('2014', 9), ('events', 9), ('product', 9), ('article', 9), ('de', 8), ('2013', 8), ('course', 8), ('admitere', 8), ('news', 7), ('categorie', 7), ('2011', 7), ('2012', 7), ('departamente', 7), ('articole', 7), ('pdf', 7), ('it', 6), ('primaria', 6), ('produse', 6), ('home', 6), ('login', 6), ('pages', 6), ('event', 6), ('about', 6), ('team', 6), ('portfolio', 5), ('doc', 5), ('prezentare', 5), ('old', 5), ('servici

In [36]:
print(recommend(['admin', 'user', "products"], 100))

[]
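The co-occurrence scoring can be traced by hand on a toy input. The sketch below rebuilds the same kind of index for three invented sites and queries it. The site contents and the name `recommend_toy` are hypothetical, and `itertools.combinations` is used here only as a compact way to enumerate pairs.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical top-level directory sets, one per site
sites = [{"blog", "tag", "contact"}, {"blog", "tag"}, {"blog", "shop"}]

toy_cooccur = defaultdict(Counter)
for site in sites:
    # sorted() makes the pair order deterministic across runs
    for d1, d2 in combinations(sorted(site), 2):
        toy_cooccur[d1][d2] += 1
        toy_cooccur[d2][d1] += 1


def recommend_toy(input_dirs, top_k=5):
    known = set(input_dirs)
    scores = Counter()
    for d in input_dirs:
        for related, count in toy_cooccur.get(d, Counter()).items():
            if related not in known:
                scores[related] += count
    return scores.most_common(top_k)


print(recommend_toy(["tag"]))    # blog co-occurs with tag on two sites, contact on one
print(recommend_toy(["forum"]))  # a name absent from the index yields no suggestions
```

Inputs that never appear in the index contribute no scores, so a query made only of unseen names returns an empty list.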


## Hierarchical Directory Prediction Model

In [70]:
# Hierarchical model: limit training to the first 10,000 tokenized paths
tokenized_paths = df_clean["url_path"].tolist()[:10000]

all_tokens = list(set(chain.from_iterable(tokenized_paths)))
token_to_idx = {
    token: idx + 1 for idx, token in enumerate(all_tokens)
}
idx_to_token = {idx: token for token, idx in token_to_idx.items()}

vocab_size = len(token_to_idx) + 1

input_seqs = []
target_tokens = []

for path in tokenized_paths:
    for i in range(1, len(path)):
        prefix = [token_to_idx[t] for t in path[:i]]
        target = token_to_idx[path[i]]
        input_seqs.append(prefix)
        target_tokens.append(target)

max_len = max(len(seq) for seq in input_seqs)
X = pad_sequences(input_seqs, maxlen=max_len, padding="pre")
y = np.array(target_tokens)

model = Sequential(
    [
        Embedding(input_dim=vocab_size, output_dim=32, input_length=max_len),
        LSTM(64),
        Dense(vocab_size, activation="softmax"),
    ]
)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()

# Helper to detect numeric tokens
def is_numeric(token):
    return re.fullmatch(r"\d+", token) is not None

# Build sample weights based on target token type
sample_weights = np.array([
    0.1 if is_numeric(idx_to_token[t]) else 1.0
    for t in y
])

model.fit(X, y, epochs=20, batch_size=32, sample_weight=sample_weights)



Epoch 1/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - loss: 4.4107
Epoch 2/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.8899
Epoch 3/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 3.7443
Epoch 4/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 3.6279
Epoch 5/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 3.5037
Epoch 6/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 3.3560
Epoch 7/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 3.1891
Epoch 8/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 3.0229
Epoch 9/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 2.8800
Epoch 10/20
[1m168/168[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - lo

<keras.src.callbacks.history.History at 0x712e20088950>
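The training data is built by turning each tokenized path into (prefix, next token) pairs and left-padding the prefixes to a common length. The sketch below reproduces that construction in plain Python for a single path taken from the preview table. The helpers `make_pairs` and `pad_pre` are hypothetical stand-ins for the loop above and for `pad_sequences` with `padding="pre"`.

```python
def make_pairs(path):
    # Every proper prefix predicts the token that follows it
    return [(path[:i], path[i]) for i in range(1, len(path))]


def pad_pre(seq, maxlen, pad=0):
    # Keep the last maxlen items and left-pad with the padding index
    seq = seq[-maxlen:]
    return [pad] * (maxlen - len(seq)) + seq


path = ["wp-content", "uploads", "2017"]
pairs = make_pairs(path)

# Index 0 is reserved for padding, so real tokens start at 1
vocab = {token: i + 1 for i, token in enumerate(path)}
encoded = [pad_pre([vocab[t] for t in prefix], maxlen=2) for prefix, _ in pairs]

print(pairs)
print(encoded)
```

A path with a single token produces no pairs, so such rows add nothing to the training set.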

# Model Evaluation

In [81]:
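def predict_lateral_tf(model, input_path, top_k=50):
    # Despite its name, this queries the hierarchical model: given a path prefix,
    # it returns the top_k most probable next directory names.
    tokens = [token_to_idx[t] for t in input_path if t in token_to_idx]
    padded = pad_sequences([tokens], maxlen=max_len, padding="pre")
    probs = model.predict(padded, verbose=0)[0]
    top_idx = probs.argsort()[-top_k:][::-1]
    # Index 0 is the padding slot and has no entry in idx_to_token, so skip it
    return [idx_to_token[i] for i in top_idx if i in idx_to_token]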


input_path = ["category"]
suggestions = predict_lateral_tf(model, input_path)
print(suggestions)

['page', 'comentarii', 'node', 'user', 'evenimente', 'usa', 'paturi', 'comments', 'details', 'members', 'member', 'news', 'ten', 'forum', 'c', 'noutati', 'articles', 'publicatii', 'posts', 'files', 'culoare', 'declaratii', 'documente', 'category', 'fara-categorie', 'contact', 'pagina', 'iasi', 'producator', 'blog', 'pages', 'tickets', 'revista', 'cariera', 'orar', 'rom', 'home', 'p', 'program', 'europa', 'help', 'doctorat', 'carte', 'hotarari', 'canapele', 'bucuresti', 'special', 'retete', 'fashion', 'alba']
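Printing suggestions gives only a qualitative impression. One way to quantify them would be a hit rate at k over held-out (prefix, next token) pairs. The notebook does not create a held-out split, so the sketch below runs on invented predictions. The function name `hit_rate_at_k` and the sample data are hypothetical.

```python
def hit_rate_at_k(ranked_predictions, targets, k=10):
    """Fraction of targets found among the top-k suggestions for their prefix."""
    if not targets:
        return 0.0
    hits = sum(
        target in ranked[:k]
        for ranked, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Hypothetical ranked suggestions and the true next directory for three prefixes
ranked = [["page", "news", "user"], ["uploads", "themes"], ["blog", "tag"]]
truth = ["news", "plugins", "blog"]

print(hit_rate_at_k(ranked, truth, k=2))
print(hit_rate_at_k(ranked, truth, k=1))
```

With a real split, `ranked` would come from calling the prediction function on each held-out prefix.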
