## 1. Imports & Setup

We import:
- **NetworkX** for graph analysis,  
- **Seaborn/Matplotlib** for statistical visualization,  
- **Pandas/Numpy** for data handling,  
- **json**, **re**, and **urlparse** to load graph data and normalize URLs.

We also configure Seaborn for clear plots.

In [4]:
import json
import re
from urllib.parse import urlparse

import networkx as nx
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# set_theme is the current name for the older sns.set alias (seaborn >= 0.11)
sns.set_theme(context="notebook", style="whitegrid", font_scale=1.1)
pd.options.display.max_colwidth = 150

## 2. Generate Graph

In [None]:
from web_scraping.graph_generator import GraphGenerator

urls = [
    "https://iet.agh.edu.pl/",
    "https://skos.agh.edu.pl/tel",
    "https://oferta-badawcza.agh.edu.pl/",
    "https://podyplomowe.agh.edu.pl",
    "https://szkolenia.agh.edu.pl/",
    "https://open.agh.edu.pl/",
    "https://badap.agh.edu.pl/",
    "http://www.dzp.agh.edu.pl/",
    "https://cwp.agh.edu.pl",
    "https://ckim.agh.edu.pl/",
    "https://rownosc.agh.edu.pl/",
    "https://wilgz.agh.edu.pl/",
    "https://www.metal.agh.edu.pl/",
    "https://www.eaiib.agh.edu.pl/",
    "https://imir.agh.edu.pl/",
    "https://www.wggios.agh.edu.pl/",
    "https://www.ceramika.agh.edu.pl/",
    "https://odlewnictwo.agh.edu.pl/",
    "https://wmn.agh.edu.pl/",
    "https://wnig.agh.edu.pl/",
    "https://www.zarz.agh.edu.pl/",
    "https://weip.agh.edu.pl/",
    "https://www.fis.agh.edu.pl/",
    "https://www.wms.agh.edu.pl/",
    "https://wh.agh.edu.pl/",
    "https://www.informatyka.agh.edu.pl/pl/",
    "https://spacetech.agh.edu.pl/pl",
    "https://www.sjo.agh.edu.pl/",
    "https://swfis.agh.edu.pl/"
]

for url in urls:
    print(url)
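    # Build a file-safe prefix, e.g. "https://iet.agh.edu.pl/" -> "iet_agh_edu_pl"
    prefix = url.replace(".", "_").replace("http:", "").replace("https:", "").replace("/", "")
    # Write into the graphs/ directory (assumed to exist); the original f-string lacked the "/" separator
    GRAPH_JSON_PATH = f"graphs/{prefix}_graph.json"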

    graph_generator = GraphGenerator(
        allowed_domains=[url],
        start_urls=[url],
        max_pages=1000,
    )
    graph_generator.generate_graph()
    graph_generator.graph_to_json(output_file=GRAPH_JSON_PATH)

https://iet.agh.edu.pl/

Crawling:   2%|▏         | 15/1000 [00:36<43:47,  2.67s/it]
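
The loop above derives each output file name with chained `replace` calls, which also folds any URL path into the name (for example the trailing `/pl/`). As a possible alternative, here is a minimal sketch that builds the name from the host only, using `urlparse`. The helper `graph_path_for`, its `out_dir` parameter, and the choice to drop a leading `www.` are assumptions for illustration and are not part of the original notebook.

```python
from urllib.parse import urlparse


def graph_path_for(url: str, out_dir: str = "graphs") -> str:
    """Return a JSON output path derived from the URL's host name (hypothetical helper)."""
    host = urlparse(url).netloc.lower()
    # Assumption: "www.example.org" and "example.org" should map to the same file.
    if host.startswith("www."):
        host = host[len("www."):]
    # Replace characters that are awkward in file names.
    stem = host.replace(".", "_").replace(":", "_")
    return f"{out_dir}/{stem}_graph.json"


for example in (
    "https://iet.agh.edu.pl/",
    "http://www.dzp.agh.edu.pl/",
    "https://www.informatyka.agh.edu.pl/pl/",
):
    print(example, "->", graph_path_for(example))
```

Because the path component is ignored, two start URLs on the same host would map to the same file. That is acceptable for the list above, where each host appears once, but would need revisiting otherwise.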

## 3. Load Graph from JSON

We rebuild the **directed graph (DiGraph)** from the JSON file.  
Each **node** is a web page (URL), and each **edge** is a hyperlink from one page to another.  

- Nodes represent entities we might later cluster.  
- Edges capture relationships that clustering algorithms exploit.  
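
To make the node/edge picture concrete, here is a small sketch on a toy graph. The page names `a` to `f` are hypothetical, and `greedy_modularity_communities` is only one of several NetworkX community-detection routines that could be used; the notebook itself does not commit to a particular algorithm here.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical pages: two tightly linked groups joined by a single hyperlink.
toy_edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("d", "e"), ("e", "f"), ("f", "d"),
    ("c", "d"),
]
toy = nx.DiGraph(toy_edges)

# Direction matters: "c" links to "d", but "d" does not link back.
print(toy.has_edge("c", "d"), toy.has_edge("d", "c"))

# In-degree counts incoming hyperlinks per page.
print(dict(toy.in_degree()))

# This sketch clusters the undirected projection, so link direction is ignored.
communities = greedy_modularity_communities(toy.to_undirected())
print([sorted(c) for c in communities])
```

Dropping direction with `to_undirected()` is a simplification: it treats a one-way link the same as a mutual one, which may or may not suit a later analysis.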

In [9]:
GRAPH_JSON_PATH = "/Users/wnowogorski/PycharmProjects/ChatAGH_DataCollecting/graphs/rekrutacja_agh_edu_graph.json"  # <-- change path if needed

with open(GRAPH_JSON_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

# Nodes are page URLs; edges are (source, target) hyperlink pairs.
G = nx.DiGraph()
G.add_nodes_from([n["url"] for n in data["nodes"]])
G.add_edges_from([(e["source"], e["target"]) for e in data["edges"]])

print(f"Loaded graph: {G.number_of_nodes():,} nodes, {G.number_of_edges():,} edges")

Loaded graph: 578 nodes, 7,781 edges
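
Before analysing the loaded graph, a few sanity checks may help. `add_edges_from` silently creates any endpoint missing from the node list, and a `DiGraph` collapses duplicate edges, so the counts can differ from the raw JSON. The sketch below assumes the same JSON layout as the loading cell (`nodes[].url`, `edges[].source`, `edges[].target`). The helper `summarize_graph` and the `example.org` URLs are hypothetical.

```python
import networkx as nx


def summarize_graph(data: dict) -> dict:
    """Rebuild the DiGraph and report basic consistency figures (hypothetical helper)."""
    listed = {n["url"] for n in data["nodes"]}
    G = nx.DiGraph()
    G.add_nodes_from(listed)
    G.add_edges_from((e["source"], e["target"]) for e in data["edges"])
    return {
        "listed_nodes": len(listed),
        "graph_nodes": G.number_of_nodes(),
        # Endpoints that appear only in edges, never in the node list.
        "implicit_nodes": sorted(set(G.nodes) - listed),
        "edges": G.number_of_edges(),
        "self_loops": nx.number_of_selfloops(G),
        "weak_components": nx.number_weakly_connected_components(G),
    }


toy_data = {
    "nodes": [{"url": "https://example.org/"}, {"url": "https://example.org/a"}],
    "edges": [
        {"source": "https://example.org/", "target": "https://example.org/a"},
        {"source": "https://example.org/", "target": "https://example.org/a"},  # duplicate
        {"source": "https://example.org/a", "target": "https://example.org/a"},  # self-loop
        {"source": "https://example.org/a", "target": "https://example.org/b"},  # target not listed
    ],
}
summary = summarize_graph(toy_data)
print(summary)
```

A non-empty `implicit_nodes` list or more than one weak component is not necessarily an error, but both are worth knowing about before running clustering on the graph.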
