## 1. Imports & Setup

We import:
- **NetworkX** for graph analysis,  
- **Seaborn/Matplotlib** for statistical visualization,  
- **Pandas/NumPy** for data handling,  
- **json/urlparse** to load and normalize graph data.

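We also set a Seaborn theme (`whitegrid` style, slightly larger fonts) and raise the Pandas column width limit so long URLs are less likely to be truncated in tables.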

In [4]:
import json
import re
from urllib.parse import urlparse

import networkx as nx
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

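# set_theme is the current name for the older sns.set alias.
sns.set_theme(context="notebook", style="whitegrid", font_scale=1.1)
# Show up to 150 characters per cell so long URLs stay readable.
pd.options.display.max_colwidth = 150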
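
The `urlparse` import is listed above as a tool for normalizing graph data. As a rough illustration of what such normalization might look like, the sketch below lowercases the host, drops a leading `www.`, and removes trailing slashes and fragments so that trivially different spellings of a page map to a single node key. The helper name `normalize_url` and the exact rules are assumptions for illustration, not something defined in this notebook, and the sketch expects absolute URLs.

```python
from urllib.parse import urlparse, urlunparse


def normalize_url(url: str) -> str:
    """Hypothetical helper: collapse trivially different URLs into one key."""
    parts = urlparse(url.strip())
    scheme = parts.scheme.lower()
    # .hostname is already lowercased by urllib and excludes any port.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    if parts.port:
        host = f"{host}:{parts.port}"
    # Treat "/pl/" and "/pl" as the same page; keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # Keep the query string, drop params and the fragment.
    return urlunparse((scheme, host, path, "", parts.query, ""))
```

Whether `www.` and trailing slashes should really be merged depends on the site, so these rules are best treated as a starting point.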

## 2. Generate graph

In [15]:
from web_scraping.graph_generator import GraphGenerator

urls = [
    "https://iet.agh.edu.pl/",
    "https://skos.agh.edu.pl/tel",
    "https://oferta-badawcza.agh.edu.pl/",
    "https://podyplomowe.agh.edu.pl",
    "https://szkolenia.agh.edu.pl/",
    "https://open.agh.edu.pl/",
    "https://badap.agh.edu.pl/",
    "http://www.dzp.agh.edu.pl/",
    "https://cwp.agh.edu.pl",
    "https://ckim.agh.edu.pl/",
    "https://rownosc.agh.edu.pl/",
    "https://wilgz.agh.edu.pl/",
    "https://www.metal.agh.edu.pl/",
    "https://www.eaiib.agh.edu.pl/",
    "https://imir.agh.edu.pl/",
    "https://www.wggios.agh.edu.pl/",
    "https://www.ceramika.agh.edu.pl/",
    "https://odlewnictwo.agh.edu.pl/",
    "https://wmn.agh.edu.pl/",
    "https://wnig.agh.edu.pl/",
    "https://www.zarz.agh.edu.pl/",
    "https://weip.agh.edu.pl/",
    "https://www.fis.agh.edu.pl/",
    "https://www.wms.agh.edu.pl/",
    "https://wh.agh.edu.pl/",
    "https://www.informatyka.agh.edu.pl/pl/",
    "https://spacetech.agh.edu.pl/pl",
    "https://www.sjo.agh.edu.pl/",
    "https://swfis.agh.edu.pl/"
]

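for url in urls:
    print(url)
    # Turn the URL into a file-name prefix, e.g.
    # "https://iet.agh.edu.pl/" -> "iet_agh_edu_pl".
    prefix = url.replace(".", "_").replace("http:", "").replace("https:", "").replace("/", "")
    # Note: there is no separator after "graphs", so the relative path is
    # graphs<prefix>_graph.json rather than a file inside a graphs/ folder.
    GRAPH_JSON_PATH = f"graphs{prefix}_graph.json"

    # Crawl from this start URL, capped at 1000 pages, then save the graph as JSON.
    graph_generator = GraphGenerator(
        allowed_domains=[url],
        start_urls=[url],
        max_pages=1000,
    )
    graph_generator.generate_graph()
    graph_generator.graph_to_json(output_file=GRAPH_JSON_PATH)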
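
The chained `replace` calls above derive a file-name prefix from each start URL. A possible alternative is sketched here, under the assumption that the JSON files are meant to live in a `graphs/` directory (the f-string in the loop has no separator after `graphs`). It parses the URL first and then replaces every run of unsafe characters with an underscore. The helper name `graph_path_for` and the `out_dir` parameter are hypothetical, and the resulting names differ slightly from the loop's (for example, a path such as `/tel` is separated from the host by `_`).

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def graph_path_for(url: str, out_dir: str = "graphs") -> Path:
    """Hypothetical helper: map a start URL to a JSON output path."""
    parts = urlparse(url)
    # Host plus path identifies the crawl; the scheme is dropped.
    raw = f"{parts.netloc}{parts.path}"
    # Replace each run of characters other than letters, digits, or "-" with "_".
    slug = re.sub(r"[^A-Za-z0-9-]+", "_", raw).strip("_")
    return Path(out_dir) / f"{slug}_graph.json"
```

If this variant were used, the output directory would need to exist before writing, for instance via `Path("graphs").mkdir(parents=True, exist_ok=True)`.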

https://iet.agh.edu.pl/




Crawling:   2%|▏         | 15/1000 [00:36<43:47,  2.67s/it]

Error fetching the URL: HTTPSConnectionPool(host='www.wrss.iet.agh.edu.pl', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))




Crawling:  14%|█▎        | 136/1000 [05:06<24:21,  1.69s/it]

Error crawling https://iet.agh.edu.pl/wp-content/uploads/2025/03/dzien_otwarty_2024-7.webp: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![\x12��%�>��s\x0c!Ӓ�@m\x138'




Crawling:  18%|█▊        | 179/1000 [06:27<34:08,  2.49s/it]

Error crawling https://iet.agh.edu.pl/wp-content/uploads/2024/11/B9_TC.webp: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'WE' in marked section




Crawling:  21%|██        | 210/1000 [07:39<21:01,  1.60s/it]

Error crawling https://iet.agh.edu.pl/wp-content/uploads/2024/02/7.webp: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![_Ѡ)6��0|��ţN/�.��'




Crawling:  62%|██████▏   | 624/1000 [20:52<10:50,  1.73s/it]

Error fetching the URL: HTTPSConnectionPool(host='www.wrss.iet.agh.edu.pl', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))




Crawling:  77%|███████▋  | 771/1000 [26:01<08:53,  2.33s/it]

https://skos.agh.edu.pl/tel


Crawling: 100%|██████████| 1000/1000 [26:45<00:00,  1.61s/it]


https://oferta-badawcza.agh.edu.pl/


Crawling:   9%|▉         | 93/1000 [02:39<27:08,  1.80s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://oferta-badawcza.agh.edu.pl/equipment/?page=%E2%80%A6


Crawling:  71%|███████▏  | 714/1000 [20:02<08:24,  1.76s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://oferta-badawcza.agh.edu.pl/equipment/?responsible_entity=401&page=%E2%80%A6


Crawling: 100%|██████████| 1000/1000 [28:33<00:00,  1.71s/it]


https://podyplomowe.agh.edu.pl


Crawling:  79%|███████▊  | 787/1000 [24:06<05:33,  1.56s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://informatyka.podyplomowe.agh.edu.pl/aktualnosci/


Crawling:  79%|███████▉  | 788/1000 [24:07<05:27,  1.54s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://informatyka.podyplomowe.agh.edu.pl/o-nas/


Crawling:  79%|███████▉  | 789/1000 [24:09<05:15,  1.49s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://informatyka.podyplomowe.agh.edu.pl/kontakt/


Crawling:  79%|███████▉  | 790/1000 [24:10<05:17,  1.51s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://informatyka.podyplomowe.agh.edu.pl/o-studiach/


Crawling:  79%|███████▉  | 791/1000 [24:12<05:03,  1.45s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://informatyka.podyplomowe.agh.edu.pl/uzytkownicy/login/


Crawling:  79%|███████▉  | 792/1000 [24:13<05:17,  1.52s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://informatyka.podyplomowe.agh.edu.pl/uzytkownicy/signup/


Crawling:  79%|███████▉  | 793/1000 [24:15<05:01,  1.45s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://informatyka.podyplomowe.agh.edu.pl/polityka-prywatnosci/


Crawling:  79%|███████▉  | 794/1000 [24:16<05:14,  1.53s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://informatyka.podyplomowe.agh.edu.pl/admin/login/


Crawling: 100%|██████████| 1000/1000 [29:57<00:00,  1.80s/it]


https://szkolenia.agh.edu.pl/


Crawling:   7%|▋         | 69/1000 [01:59<26:46,  1.73s/it]


https://open.agh.edu.pl/


Crawling:  87%|████████▋ | 868/1000 [35:03<05:00,  2.28s/it]  

Error fetching the URL: 404 Client Error: Not Found for url: https://zasoby.open.agh.edu.pl/zasob/e-fizyka-podstawy-fizyki


Crawling: 100%|██████████| 1000/1000 [40:33<00:00,  2.43s/it]


https://badap.agh.edu.pl/


Crawling:   9%|▊         | 87/1000 [02:19<26:10,  1.72s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://badap.agh.edu.pl/wykazy/czasopisma


Crawling: 100%|██████████| 1000/1000 [26:17<00:00,  1.58s/it]


http://www.dzp.agh.edu.pl/



Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(response.text, 'html.parser')
Crawling:  10%|▉         | 96/1000 [03:32<41:53,  2.78s/it]  

Error crawling http://www.dzp.agh.edu.pl/home/dzp/Umowy_ogolnouczelniane/tonery/642_2024_Umowa_MAK_Sp._z_o.o._DE-dzp.272-642-24.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![��Ɵ����K����F��T�'


Crawling:  15%|█▍        | 146/1000 [05:21<29:11,  2.05s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://dzp.agh.edu.pl/fileadmin/default/templates/css/j/dzp/system/inne_nowe/2022_informacja_o_zmianach_PZP.pdf


Crawling:  15%|█▍        | 149/1000 [05:29<31:22,  2.21s/it]


https://cwp.agh.edu.pl


Crawling:  31%|███▏      | 314/1000 [08:55<19:30,  1.71s/it]


https://ckim.agh.edu.pl/


Crawling:  10%|█         | 105/1000 [02:57<25:12,  1.69s/it]


https://rownosc.agh.edu.pl/


Crawling:   4%|▍         | 45/1000 [01:19<28:14,  1.77s/it]


https://wilgz.agh.edu.pl/


Crawling:  13%|█▎        | 134/1000 [04:26<28:07,  1.95s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wilgz.agh.edu.pl/rekrutacja/studia-stacjonarne-kierunki/inzynieria-gornicza/


Crawling:  14%|█▎        | 135/1000 [04:28<25:44,  1.79s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wilgz.agh.edu.pl/rekrutacja/studia-stacjonarne-kierunki/inzynieria-i-zarzadzanie-procesami-przemyslowymi/


Crawling:  14%|█▎        | 136/1000 [04:29<24:37,  1.71s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wilgz.agh.edu.pl/rekrutacja/studia-stacjonarne-kierunki/budownictwo-2/


Crawling:  14%|█▎        | 137/1000 [04:31<24:08,  1.68s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wilgz.agh.edu.pl/rekrutacja/studia-stacjonarne-kierunki/rewitalizacja-terenow-zdegradowanych/


Crawling:  25%|██▍       | 246/1000 [07:55<22:49,  1.82s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wilgz.agh.edu.pl/student/dokumenty-i-sprawy-socjalne/dokumenty-student/


Crawling:  32%|███▏      | 316/1000 [09:53<18:23,  1.61s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wilgz.agh.edu.pl/rekrutacja/studia-stacjonarne-kierunki/inzynieria-ksztaltowania-srodowiska/


Crawling:  42%|████▏     | 415/1000 [12:47<13:53,  1.42s/it]

Error crawling https://wilgz.agh.edu.pl/home/__processed__/5/e/csm_f3_175c748051.jpg: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![�<�\n����9��1=����'


Crawling:  61%|██████    | 609/1000 [18:41<09:39,  1.48s/it]

Error crawling https://wilgz.agh.edu.pl/home/__processed__/d/a/csm_Image__1__a547068ff7.jpg: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![�A�\n�6=���/�0w��ڧ'


Crawling:  68%|██████▊   | 676/1000 [20:38<09:53,  1.83s/it]


https://www.metal.agh.edu.pl/


Crawling:   0%|          | 1/1000 [00:01<32:06,  1.93s/it]


https://www.eaiib.agh.edu.pl/


Crawling:  23%|██▎       | 233/1000 [06:49<20:18,  1.59s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.eaiib.agh.edu.pl/wp-content/uploads/2022/10/Zalacznik_nr_6_Formularz-informacyjny-przed-podjeciem-praktyki.docx


Crawling:  24%|██▍       | 241/1000 [07:07<29:27,  2.33s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.eaiib.agh.edu.pl/system-jakosci-ksztalcenia/Raport%20z%20ankietyzacji%20student%C3%B3w%20dyscyplina%20AEE%20lato%202020_21


Crawling:  25%|██▌       | 254/1000 [07:36<25:55,  2.09s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.eaiib.agh.edu.pl/system-jakosci-ksztalcenia/Centrum%20Akredytacji%20i%20Jako%C5%9Bci%20Kszta%C5%82cenia


Crawling:  33%|███▎      | 333/1000 [09:53<15:39,  1.41s/it]

Error crawling https://www.eaiib.agh.edu.pl/wp-content/uploads/2025/09/105-lat-Elektrotechniki-AGH-fot-8.jpg: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![�y���k�zc)|b9�X�s'


Crawling:  34%|███▍      | 343/1000 [10:10<17:40,  1.61s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://www.eaiib.agh.edu.pl/repozytorium/.html


Crawling:  57%|█████▊    | 575/1000 [17:18<15:43,  2.22s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.eaiib.agh.edu.pl/wp-content/uploads/2020/07/Harmonogrhttps:/www.eaiib.agh.edu.pl/wp-content/uploads/2020/07/M__YNARCZYK_HARMONOGRAM.pdfam__P.Bania_.docx



Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(response.text, 'html.parser')
Crawling:  65%|██████▌   | 653/1000 [20:17<10:12,  1.77s/it]

Error crawling https://www.eaiib.agh.edu.pl/wp-content/uploads/2025/05/ABU-SARHAN-MOHAMMAD_RECENZJA-1-dr-hab.-inz-Michal-Jasinski.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'w' in marked section


Crawling:  79%|███████▉  | 791/1000 [25:48<07:28,  2.15s/it]

Error crawling https://www.eaiib.agh.edu.pl/wp-content/uploads/2024/03/PANKIEWICZ-NIKODEM_ZAWIADOMIENIE-O-PUBLICZNEJ-OBRONIE.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![$�!s�d�ӽw4P\x06jC.��'


Crawling:  86%|████████▌ | 860/1000 [28:31<03:21,  1.44s/it]

Error crawling https://www.eaiib.agh.edu.pl/wp-content/uploads/2023/11/SIERSZYNSKI-MICHAL_STRESZCZENIA.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![|W[\r/#݅��~���\x17u�\x19'


Crawling:  93%|█████████▎| 932/1000 [31:35<04:53,  4.32s/it]

Error crawling https://www.eaiib.agh.edu.pl/wp-content/uploads/2023/07/KMAK-JAROSLAW_RECENZJA-1-dr-hab.inz_.-Andrzej-Chojnacki-prof.-PSk.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'u\x1d' in marked section


Crawling: 100%|██████████| 1000/1000 [34:02<00:00,  2.04s/it]


https://imir.agh.edu.pl/


Crawling:   0%|          | 1/1000 [00:01<28:14,  1.70s/it]


https://www.wggios.agh.edu.pl/


Crawling:  16%|█▋        | 164/1000 [04:46<18:43,  1.34s/it]

Error crawling https://www.wggios.agh.edu.pl/home/wggios/Postepowania-awansowe/doktoraty/2025/Kubacka_Tomasz/recenzja_prof_dr_hab._inz._GRZEGORZ_MUTKE.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![<�^.=\x16*Kd� �1��6i'


Crawling:  17%|█▋        | 174/1000 [05:05<22:19,  1.62s/it]

Error crawling https://www.wggios.agh.edu.pl/home/wggios/Postepowania-awansowe/doktoraty/2024/Pawel_Godlewski/zawiadomienie_o_obronie_doktorskiej_mgr_inz._PAWEL_GODLEWSKI.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at "<!['�\x15\x00�\x19�\t���8K;{\x1d�"


Crawling:  18%|█▊        | 181/1000 [05:18<20:07,  1.47s/it]

Error crawling https://www.wggios.agh.edu.pl/home/wggios/Postepowania-awansowe/doktoraty/2024/Adam_Cygal/uchwala_nadanie_stopnia_doktora_mgr_inz._ADAM_CYGAL.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![%��\x07\x040\x18ڴJf�{�Z��Z'


Crawling:  33%|███▎      | 326/1000 [09:48<16:27,  1.47s/it]

Error crawling https://www.wggios.agh.edu.pl/home/wggios/Postepowania-awansowe/doktoraty/2023/Powolny-Tomasz-STRESZCZENIE_J_ANGIELSKI.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![���aFe\x0b�\x0b�g�\x1d\x15�W�'


Crawling:  36%|███▌      | 358/1000 [10:48<17:07,  1.60s/it]

Error crawling https://www.wggios.agh.edu.pl/home/wggios/Postepowania-awansowe/doktoraty/2022/02-Obrona_doktorska_mgr_inz._Adrianny_Maslanki/uchwala_nadanie_stopnia_doktora_ADRIANNA_MASLANKA.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![�i�Ү�.\x01�۸��H�\x0e\x0bY�'


Crawling:  78%|███████▊  | 781/1000 [24:09<06:30,  1.78s/it]

Error crawling https://www.wggios.agh.edu.pl/home/wggios/Postepowania-awansowe/doktoraty/2016/04-Obrona_doktorska_mgr_inz._Barbary_Muir/Muir-Streszczenie.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![���aFe\x0b�\x0b�g�\x1d\x15�W�'


Crawling:  78%|███████▊  | 782/1000 [24:10<05:00,  1.38s/it]

Error crawling https://www.wggios.agh.edu.pl/home/wggios/Postepowania-awansowe/doktoraty/2016/04-Obrona_doktorska_mgr_inz._Barbary_Muir/Muir-Abstract.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![���aFe\x0b�\x0b�g�\x1d\x15�W�'


Crawling:  82%|████████▏ | 816/1000 [25:09<04:01,  1.31s/it]

Error crawling https://www.wggios.agh.edu.pl/home/wggios/Postepowania-awansowe/doktoraty/2016/10-Obrona_doktorska_mgr_inz._Piotra_Olchowego/streszczenie_rozprawy_doktorskiej_pol_PIOTR_OLCHOWY.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'j' in marked section


Crawling:  93%|█████████▎| 927/1000 [28:46<02:30,  2.06s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.wggios.agh.edu.pl/fileadmin/default/templates/css/j/wggios/2020/Uchwala_komisja_hab._A.GRUSZECKA-KOSOWSKA-sig.pdf


Crawling:  99%|█████████▉| 988/1000 [30:49<00:17,  1.46s/it]

Error crawling https://www.wggios.agh.edu.pl/home/wggios/Postepowania-awansowe/habilitacje/13-Postepowanie_habilitacyjne_dr_inz._Marcina_Zycha/Zych-recenzja-Idziak-Adam.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![���aFe\x0b�\x0b�g�\x1d\x15�W�'


Crawling: 100%|██████████| 1000/1000 [31:15<00:00,  1.88s/it]


https://www.ceramika.agh.edu.pl/


Crawling:  34%|███▎      | 336/1000 [10:09<22:22,  2.02s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.ceramika.agh.edu.pl/aktualnosci/detail/kierunek-inzynieria-materialowa-z-akredytacja-polskiej-komisji-akredytacyjnej


Crawling:  40%|████      | 404/1000 [12:06<17:13,  1.73s/it]

Error fetching the URL: 403 Client Error: Internal Server Error for url: https://www.ceramika.agh.edu.pl/typo3/record/edit?token=02d76df8b19d83011af50a238b8cda85b39cd344&edit%5Btt_content%5D%5B27723%5D=edit&returnUrl=%2Ftypo3%2Fmodule%2Fweb%2Flayout%3Ftoken%3D22ee4014c9d970a43d397007ab647bee72e832b1%26id%3D941%23element-tt_content-27723


Crawling:  47%|████▋     | 473/1000 [14:11<12:38,  1.44s/it]

Error crawling https://www.ceramika.agh.edu.pl/home/wimic/Aktualnosci/komunikaty_dla_studentow/plany_zajec/II2_TCH.xlsx: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'H' in marked section


Crawling:  87%|████████▋ | 868/1000 [26:14<03:42,  1.69s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.ceramika.agh.edu.pl/aktualnosci/detail/konferencja-termoelektryczna-ict-ect-2024


Crawling:  97%|█████████▋| 967/1000 [29:17<00:55,  1.67s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.ceramika.agh.edu.pl/pl.wikipedia.org/wiki/Prorektor_ds._nauki


Crawling: 100%|██████████| 1000/1000 [30:23<00:00,  1.82s/it]


https://odlewnictwo.agh.edu.pl/


Crawling:   1%|          | 6/1000 [00:11<35:17,  2.13s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://kandydaci.odlewnictwo.agh.edu.pl/


Crawling:   4%|▍         | 44/1000 [01:39<37:30,  2.35s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://kandydaci.odlewnictwo.agh.edu.pl/technologie-przemyslu-4-0/


Crawling:   4%|▍         | 45/1000 [01:41<32:57,  2.07s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://kandydaci.odlewnictwo.agh.edu.pl/komputerowe-wspomaganie-procesow-inzynierskich/


Crawling:   5%|▍         | 46/1000 [01:42<31:05,  1.96s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://kandydaci.odlewnictwo.agh.edu.pl/tworzywa-i-technologie-motoryzacyjne/


Crawling:   7%|▋         | 71/1000 [02:36<24:02,  1.55s/it]

Error crawling https://odlewnictwo.agh.edu.pl/wp-content/uploads/2025/09/Komunikat_terminy_stypendium_rektora.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![\x14�fc83��9@h&YS?\x18e'


Crawling:   8%|▊         | 75/1000 [02:46<40:12,  2.61s/it]

Error crawling https://odlewnictwo.agh.edu.pl/wp-content/uploads/2021/11/mcc2019.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![�KV(��Y�#��%�bi�\x17'
Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/pracownik/konferencje/www.afe.polsl.pl/pl/kikm/index/80


Crawling:   8%|▊         | 82/1000 [03:02<37:02,  2.42s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/aktualnosci/info/article/konferencja-sprawozdawcza-komitetu-metalurgii-pan-metalurgia-2014/


Crawling:   8%|▊         | 83/1000 [03:05<38:51,  2.54s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/wo-2013


Crawling:   8%|▊         | 84/1000 [03:08<39:59,  2.62s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/pracownik/konferencje/www.targikielce.pl


Crawling:   9%|▊         | 87/1000 [03:15<36:08,  2.37s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/wydzial/kontakt/epajor@agh.edu.pl



Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(response.text, 'html.parser')
Crawling:  14%|█▍        | 140/1000 [05:07<28:12,  1.97s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/student/praktyki-i-staze/zapala@agh.edu.pl


Crawling:  20%|██        | 200/1000 [07:06<30:34,  2.29s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/plan_Kolegium_Wydzialu_15_11_2021.pdf


Crawling:  20%|██        | 201/1000 [07:08<30:07,  2.26s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/protokol_z_dnia_20_09_2020.pdf


Crawling:  20%|██        | 202/1000 [07:10<29:42,  2.23s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/PLAN_KOLEGIUM_WYDZIALU_W_DNIU_20_09_2021.pdf


Crawling:  20%|██        | 203/1000 [07:12<29:34,  2.23s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/28_06_2021_protokol.pdf


Crawling:  20%|██        | 204/1000 [07:14<28:54,  2.18s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/PROGRAM_28_06_2021.pdf


Crawling:  20%|██        | 205/1000 [07:16<27:50,  2.10s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/PROTOKOL_24_05_2021.pdf


Crawling:  21%|██        | 206/1000 [07:18<27:24,  2.07s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/PLAN_KOLEGIUM_WYDZIALU_24_05_2021_r..pdf


Crawling:  21%|██        | 207/1000 [07:20<26:51,  2.03s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/Protokol_KW_z_dnia_19_04_2021_r.-1.pdf


Crawling:  21%|██        | 208/1000 [07:22<26:02,  1.97s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/program_KW_w_dniu_19_04_2021.pdf


Crawling:  21%|██        | 209/1000 [07:24<26:55,  2.04s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/protokol_15_03_2021.pdf


Crawling:  21%|██        | 210/1000 [07:26<27:23,  2.08s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/Plan_Kolegium_Wydzialu_15_03_2021.pdf


Crawling:  21%|██        | 211/1000 [07:29<28:28,  2.17s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/Protokol_01_02_2021.pdf


Crawling:  21%|██        | 212/1000 [07:31<27:54,  2.13s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/restricted/wo/Protokoly_RW/2021/Plan_KW_w_dniu_02_02_2021.pdf


Crawling:  32%|███▎      | 325/1000 [11:02<17:53,  1.59s/it]

Error crawling https://odlewnictwo.agh.edu.pl/wp-content/uploads/2021/11/Zaproszenie_na_publiczna_obrone_z_logo_-_PN.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![�m���\x7f\x10�\x06�9Sc��.�'


Crawling:  33%|███▎      | 332/1000 [11:15<16:28,  1.48s/it]

Error crawling https://odlewnictwo.agh.edu.pl/wp-content/uploads/2021/11/Streszczenie_jezyk_angielski_M.Piekos.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![��\x1f1\x13��8o��\x178q�\x0f�'


Crawling:  47%|████▋     | 468/1000 [15:48<15:48,  1.78s/it]

Error crawling https://odlewnictwo.agh.edu.pl/wp-content/uploads/2021/11/Zaproszenie_na_publiczna_obrone_rozprawy_doktorskiej_Pani_mgr_Magdaleny_Bisztygi_-_Szklarz.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![�І\x15j>g�Q6�Я�����w'


Crawling:  58%|█████▊    | 576/1000 [19:41<17:08,  2.43s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://odlewnictwo.agh.edu.pl/fileadmin/default/templates/css/w/08_wo/Konferencje/wo_2015/AGENDA_Polski.pdf


Crawling:  71%|███████   | 711/1000 [23:38<05:35,  1.16s/it]

Error crawling https://odlewnictwo.agh.edu.pl/wp-content/uploads/2021/11/csm_15156925_10157881480400602_6827439561927186775_o_8b34818323.jpg: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![�����\x83*��g�Њ�R�~n'


Crawling: 100%|██████████| 1000/1000 [31:29<00:00,  1.89s/it]


https://wmn.agh.edu.pl/


Crawling:   3%|▎         | 26/1000 [00:53<36:05,  2.22s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/www.pse.pl


Crawling:   3%|▎         | 27/1000 [00:55<34:06,  2.10s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/www.kuca.pl/


Crawling:   3%|▎         | 28/1000 [00:56<30:49,  1.90s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/www.kghm.com


Crawling:   3%|▎         | 29/1000 [00:58<29:07,  1.80s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/www.walcownia.com


Crawling:   3%|▎         | 31/1000 [01:02<30:15,  1.87s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/www.pse.pl


Crawling:   3%|▎         | 32/1000 [01:03<29:30,  1.83s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/www.kuca.pl/


Crawling:   3%|▎         | 33/1000 [01:05<28:44,  1.78s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/www.kghm.com


Crawling:   3%|▎         | 34/1000 [01:07<28:50,  1.79s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/www.walcownia.com


Crawling:  43%|████▎     | 427/1000 [15:05<19:57,  2.09s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/informacja-dla-absolwentow-roku-akademickiego-20202021-dot.-odbioru-dyplomu-76-74


Crawling:  45%|████▌     | 452/1000 [15:57<18:18,  2.01s/it]

Error fetching the URL: HTTPConnectionPool(host='wmn.agh.edu.pl', port=80): Read timed out. (read timeout=10)


Crawling:  75%|███████▍  | 747/1000 [27:43<07:16,  1.73s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://wmn.agh.edu.pl/%20https:/agh4s2024.4scienceinstitute.org/


Crawling:  75%|███████▍  | 748/1000 [27:45<06:52,  1.64s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://wmn.agh.edu.pl/%20https:/emb-agh4s2024.4scienceinstitute.org/


Crawling:  75%|███████▍  | 749/1000 [27:46<06:40,  1.59s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://wmn.agh.edu.pl/%20https:/cyber-agh4s2024.4scienceinstitute.org/


Crawling:  75%|███████▌  | 750/1000 [27:48<06:27,  1.55s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://wmn.agh.edu.pl/%20https:/platform.4scienceinstitute.org/signup


Crawling:  75%|███████▌  | 751/1000 [27:49<06:16,  1.51s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://wmn.agh.edu.pl/%20https:/lnkd.in/dG7vNWKe


Crawling:  78%|███████▊  | 777/1000 [28:36<06:48,  1.83s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wmn.agh.edu.pl/rekrutacja.agh.edu.pl


Crawling:  81%|████████  | 810/1000 [29:34<04:27,  1.41s/it]

Error crawling http://wmn.agh.edu.pl/uploads/180.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'c\x0b' in marked section


Crawling:  83%|████████▎ | 829/1000 [30:09<04:08,  1.45s/it]

Error crawling http://wmn.agh.edu.pl/uploads/20220930132513439.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'q' in marked section


Crawling:  84%|████████▎ | 837/1000 [30:28<05:44,  2.11s/it]

Error fetching the URL: 403 Client Error: Forbidden for url: https://wmn.agh.edu.pl/%20http:/www.bhp.agh.edu.pl/studenci/


Crawling:  98%|█████████▊| 981/1000 [35:05<00:40,  2.15s/it]


https://wnig.agh.edu.pl/


Crawling:  24%|██▍       | 239/1000 [08:29<23:18,  1.84s/it] 

Error fetching the URL: HTTPSConnectionPool(host='wnig.agh.edu.pl', port=443): Read timed out. (read timeout=10)


Crawling:  53%|█████▎    | 528/1000 [17:56<10:33,  1.34s/it]  

Error crawling https://test.wnig.agh.edu.pl/wp-content/uploads/2020/02/galeria_2019_06_05_2.jpg: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![:\x06���F�Q�\x00W\r\x07M\x07hW'


Crawling:  55%|█████▍    | 545/1000 [18:24<10:11,  1.34s/it]

Error crawling https://test.wnig.agh.edu.pl/wp-content/uploads/2020/02/galeria_2019_06_05_19.jpg: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![ژ���\x02��*��W\x11���*\ue625'


Crawling:  57%|█████▋    | 566/1000 [18:58<08:52,  1.23s/it]

Error crawling https://test.wnig.agh.edu.pl/wp-content/uploads/2020/02/galeria_2019_06_06_20.jpg: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![��$��M��b��������'


Crawling:  67%|██████▋   | 666/1000 [22:17<10:38,  1.91s/it]

Error fetching the URL: HTTPSConnectionPool(host='wnig.agh.edu.pl', port=443): Read timed out. (read timeout=10)


Crawling:  70%|███████   | 701/1000 [23:39<10:34,  2.12s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://wnig.agh.edu.pl/wp-content/uploads/2020/11/aktualne-terminy-zjazd%C3%B3w.pdf


Crawling:  88%|████████▊ | 877/1000 [29:14<03:01,  1.47s/it]

Error crawling https://wnig.agh.edu.pl/wp-content/uploads/2021/02/S_komunikaty_swiadczenia_2020_2021_sem_letni_2021_02_02.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![�+M��\\��Q\x00�\x00r��g\x1a'


Crawling: 100%|██████████| 1000/1000 [32:58<00:00,  1.98s/it]


https://www.zarz.agh.edu.pl/


Crawling:   2%|▏         | 18/1000 [00:45<44:07,  2.70s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.zarz.agh.edu.pl/index.php/studia-doktoranckie/


Crawling:  21%|██        | 211/1000 [07:05<25:50,  1.97s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.zarz.agh.edu.pl/polska/informacje_o_wydziale/dzialalnosc_naukowa/Info_1_2016.pdf


Crawling:  27%|██▋       | 268/1000 [08:48<21:24,  1.76s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.zarz.agh.edu.pl/wp-content/uploads/2021/09/7_Procedura_doktoranci_zglaszania_ograniczen_2021_22-1.pdf


Crawling:  27%|██▋       | 274/1000 [08:59<21:10,  1.75s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.zarz.agh.edu.pl/wp-content/uploads/2016/04/organizacja-roku-ak.-2017-18.pdf


Crawling:  28%|██▊       | 278/1000 [09:06<22:20,  1.86s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.zarz.agh.edu.pl/wp-content/uploads/2016/05/Regulamin-kwalifikacji-wniosk%C3%B3w-NR.pdf


Crawling:  28%|██▊       | 280/1000 [09:10<21:54,  1.83s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.zarz.agh.edu.pl/wp-content/uploads/2016/05/wniosek-o-nagrode-rektora_zespo%C5%82owa.docx


Crawling:  29%|██▉       | 289/1000 [09:32<29:58,  2.53s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.zarz.agh.edu.pl/index.php/kontakt/wzarz@zarz.agh.edu.pl


Crawling:  30%|███       | 300/1000 [09:54<22:56,  1.97s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.zarz.agh.edu.pl/wp-content/uploads/2023/04/oplaty-za-przetrzymane-ksiazki.pdf


Crawling:  52%|█████▎    | 525/1000 [16:53<16:36,  2.10s/it]

Error fetching the URL: 404 Client Error: Not Found for url: http://www.zarz.agh.edu.pl/index.php/studia-doktoranckie/


Crawling:  56%|█████▌    | 560/1000 [18:13<11:33,  1.58s/it]

Error crawling http://www.zarz.agh.edu.pl/wpawnik/pliki_do_pobrania/Wojciech_Pawnik_Spoleczna_odpowiedzialnosc_przedsiebiorstwa.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![\x1d\x1d۽�)&����\x1a�=\x10t��'


Crawling:  56%|█████▌    | 562/1000 [18:20<16:54,  2.32s/it]

Error fetching the URL: 404 Client Error: Not Found for url: http://www.zarz.agh.edu.pl/wpawnik/pliki_do_pobrania/Wojciech_Pawnik_Zaufanie_spoleczne_paradoks_proces?w_modernizacyjnych.pdf


Crawling:  57%|█████▋    | 569/1000 [18:47<14:13,  1.98s/it]


https://weip.agh.edu.pl/


Crawling:  21%|██        | 206/1000 [06:18<22:44,  1.72s/it] 

Error fetching the URL: 404 Client Error: Not Found for url: https://weip.agh.edu.pl/wiecej/ogloszenie-o-mozliwosci-zatrudnienia-konkurs/?portfolioCats=33


Crawling:  80%|███████▉  | 798/1000 [42:49<10:50,  3.22s/it]  


https://www.fis.agh.edu.pl/


Crawling:  27%|██▋       | 269/1000 [08:34<23:56,  1.97s/it]  

Error crawling https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/2024_JN.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![\x0c�y�\x19\x18��Lf�\x11{5ѥB�'


Crawling:  28%|██▊       | 278/1000 [08:50<18:26,  1.53s/it]

Error crawling https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/2024_AZ_abs.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'L' in marked section


Crawling:  39%|███▉      | 392/1000 [13:00<16:23,  1.62s/it]

Error crawling https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/2020_MP_abs.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'v' in marked section


Crawling:  47%|████▋     | 472/1000 [16:58<1:51:23, 12.66s/it]

Error fetching the URL: HTTPSConnectionPool(host='www.fis.agh.edu.pl', port=443): Read timed out.



Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(response.text, 'html.parser')
Crawling:  51%|█████▏    | 514/1000 [19:43<17:04,  2.11s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/2017_BB.pdf



Crawling:  56%|█████▌    | 560/1000 [22:38<23:38,  3.22s/it]  

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/homola_2016.pdf


Crawling:  56%|█████▋    | 565/1000 [22:50<17:39,  2.44s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/2016_Kaminski.pdf



Crawling:  59%|█████▉    | 589/1000 [24:11<19:54,  2.91s/it]

Error fetching the URL: HTTPSConnectionPool(host='www.fis.agh.edu.pl', port=443): Read timed out.


Crawling:  60%|█████▉    | 597/1000 [25:19<43:45,  6.52s/it]  

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/adamowski_2014.pdf


Crawling:  61%|██████▏   | 613/1000 [29:38<28:58,  4.49s/it]  

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/seminarium


Crawling:  64%|██████▎   | 635/1000 [30:39<16:33,  2.72s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/chau.pdf


Crawling:  64%|██████▍   | 639/1000 [30:49<15:39,  2.60s/it]

Error crawling https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/spalek2011.pdf: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: expected name token at '<![*Rڝ\x06J_����F�\x1f���p'


Crawling:  68%|██████▊   | 678/1000 [32:23<11:40,  2.17s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/niewodniczanski.pdf


Crawling:  71%|███████▏  | 713/1000 [33:57<11:09,  2.33s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/FTsem1.pdf


Crawling:  71%|███████▏  | 714/1000 [33:59<10:03,  2.11s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/FTsem2.pdf


Crawling:  72%|███████▏  | 715/1000 [34:00<09:03,  1.91s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/FTsem3.pdf


Crawling:  72%|███████▏  | 716/1000 [34:02<08:20,  1.76s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/FTsem4.pdf


Crawling:  72%|███████▏  | 717/1000 [34:03<07:41,  1.63s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/FTsem5.pdf


Crawling:  72%|███████▏  | 718/1000 [34:04<07:12,  1.53s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/FTsem6.pdf


Crawling:  72%|███████▏  | 719/1000 [34:06<07:15,  1.55s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/sem7FCS.pdf


Crawling:  72%|███████▏  | 720/1000 [34:08<07:17,  1.56s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/sem7FJ.pdf


Crawling:  72%|███████▏  | 721/1000 [34:09<07:09,  1.54s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/sem7FK.pdf


Crawling:  72%|███████▏  | 722/1000 [34:11<07:06,  1.54s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/ft/sem7FS.pdf


Crawling:  72%|███████▏  | 723/1000 [34:12<07:13,  1.56s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/fm/FMsem1.pdf


Crawling:  72%|███████▏  | 724/1000 [34:14<07:38,  1.66s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/fm/FMsem2.pdf


Crawling:  72%|███████▎  | 725/1000 [34:16<07:32,  1.65s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/fm/FMsem3.pdf


Crawling:  73%|███████▎  | 726/1000 [34:17<07:26,  1.63s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/fm/FMsem4.pdf


Crawling:  73%|███████▎  | 727/1000 [34:19<07:00,  1.54s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/fm/FMsem5.pdf


Crawling:  73%|███████▎  | 728/1000 [34:20<07:15,  1.60s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/fm/FMsem6.pdf


Crawling:  73%|███████▎  | 729/1000 [34:22<06:57,  1.54s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/fm/FMsem7.pdf


Crawling:  73%|███████▎  | 730/1000 [34:23<06:59,  1.55s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/is/ISsem1.pdf


Crawling:  73%|███████▎  | 731/1000 [34:25<07:08,  1.59s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/is/ISsem2.pdf


Crawling:  73%|███████▎  | 732/1000 [34:27<07:02,  1.58s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/is/ISsem3.pdf


Crawling:  73%|███████▎  | 733/1000 [34:28<06:52,  1.55s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/is/ISsem4.pdf


Crawling:  73%|███████▎  | 734/1000 [34:30<07:03,  1.59s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/is/ISsem5.pdf


Crawling:  74%|███████▎  | 735/1000 [34:31<07:13,  1.64s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/is/ISsem6.pdf


Crawling:  74%|███████▎  | 736/1000 [34:33<07:07,  1.62s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/is/ISsem7.pdf


Crawling:  75%|███████▌  | 750/1000 [35:10<12:25,  2.98s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/MarsAGH


Crawling:  77%|███████▋  | 767/1000 [36:01<12:03,  3.10s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/home/wfiis/wfiis/doc/pl/seminarium/idzik.ps.gz


Crawling:  77%|███████▋  | 774/1000 [36:13<06:49,  1.81s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.fis.agh.edu.pl/fee2016/


Crawling:  95%|█████████▍| 947/1000 [41:19<02:18,  2.62s/it]


https://www.wms.agh.edu.pl/


Crawling:  10%|█         | 101/1000 [02:39<26:04,  1.74s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.wms.agh.edu.pl/fileadmin/default/templates/css/j/wms/Studia/I_stopien/plan_stud_I_2018_19.xlsx


Crawling:  16%|█▌        | 158/1000 [04:10<22:41,  1.62s/it]

Error fetching the URL: 404 Client Error: Not Found for url: https://www.wms.agh.edu.pl/fileadmin/default/templates/css/j/wms/Studia/I_stopien/ZD_-_Szczegolowe_zasady_realizacji_programu_studiow_dla_studiow_I_stopnia_na_kierunku_Matematyka_-_zal_1.pdf


Crawling:  17%|█▋        | 174/1000 [04:37<26:33,  1.93s/it]

KeyboardInterrupt: 
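
Two error patterns dominate the log above. Binary files (`.pdf`, `.jpg`) are handed to the HTML parser, and scheme-less links such as `www.pse.pl` appear to be resolved as relative paths (`https://wmn.agh.edu.pl/www.pse.pl`). The sketch below shows one possible pre-filter. The internals of `GraphGenerator` are not shown in this notebook, so the helper names `is_probably_html` and `resolve_link` and the extension list are assumptions for illustration, not part of its API.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Hypothetical skip list; extend it to match the file types seen in the crawl.
BINARY_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".gif", ".docx", ".xlsx", ".gz"}


def is_probably_html(url: str) -> bool:
    """Return False when the URL path ends in a known binary extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix not in BINARY_EXTENSIONS


def resolve_link(base_url: str, href: str) -> str:
    """Resolve an href against the page it was found on."""
    href = href.strip()  # drops stray whitespace such as " https://..."
    if href.lower().startswith("www."):
        # Assumption: a scheme-less "www." link points to an external site.
        href = "https://" + href
    return urljoin(base_url, href)
```

Checking the `Content-Type` response header would be more robust than matching extensions, at the cost of one request per URL.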

## 3. Load Graph from JSON

We rebuild the crawl result as a **directed graph** (`nx.DiGraph`) from the JSON file.  
Each **node** is a web page, identified by its URL, and each **edge** is a hyperlink pointing from one page to another.  

- Nodes represent entities we might later cluster.  
- Edges capture relationships that clustering algorithms exploit.  
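
The loading cell below reads only three fields from the file: `url` on each node, and `source` and `target` on each edge. Here is a minimal sketch of that shape, using made-up URLs. It makes no claim about other fields the crawler may write.

```python
import json

# Hypothetical miniature of the graph file; only the keys the loader reads are shown.
sample = {
    "nodes": [
        {"url": "https://example.org/"},
        {"url": "https://example.org/about"},
    ],
    "edges": [
        {"source": "https://example.org/", "target": "https://example.org/about"},
    ],
}

# Round-trip through JSON text to mimic reading the file from disk.
data_sample = json.loads(json.dumps(sample))

node_urls = [n["url"] for n in data_sample["nodes"]]
edge_pairs = [(e["source"], e["target"]) for e in data_sample["edges"]]
```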

In [9]:
GRAPH_JSON_PATH = "/Users/wnowogorski/PycharmProjects/ChatAGH_DataCollecting/graphs/rekrutacja_agh_edu_graph.json"  # <-- change path if needed

# Read the crawl result; an explicit encoding avoids platform-dependent defaults.
with open(GRAPH_JSON_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

# One node per crawled URL, one directed edge per hyperlink (source -> target).
G = nx.DiGraph()
G.add_nodes_from(n["url"] for n in data["nodes"])
G.add_edges_from((e["source"], e["target"]) for e in data["edges"])

print(f"Loaded graph: {G.number_of_nodes():,} nodes, {G.number_of_edges():,} edges")

Loaded graph: 578 nodes, 7,781 edges
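
`add_edges_from` silently creates any endpoint that is not already in the graph. The node count printed above can therefore exceed the number of entries in `data["nodes"]` if the crawler recorded links to pages it never visited. A small sketch for checking this on the raw JSON follows; the function name `dangling_endpoints` and the toy data are illustrative only.

```python
def dangling_endpoints(graph_data: dict) -> set:
    """Return edge endpoints that do not appear in the node list."""
    known = {n["url"] for n in graph_data["nodes"]}
    sources = {e["source"] for e in graph_data["edges"]}
    targets = {e["target"] for e in graph_data["edges"]}
    return (sources | targets) - known


# Toy data: page "c" is linked to but was never recorded as a node.
toy = {
    "nodes": [{"url": "https://example.org/a"}, {"url": "https://example.org/b"}],
    "edges": [
        {"source": "https://example.org/a", "target": "https://example.org/b"},
        {"source": "https://example.org/b", "target": "https://example.org/c"},
    ],
}

missing = dangling_endpoints(toy)
```

On the real file, a non-empty result from `dangling_endpoints(data)` would mean that some nodes in `G` exist only as link targets.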
